An open algorithm to pre-process free-text self-reported medication data in UK Biobank
Xie J., Hong S., Delmestri A., Khalid S., Strauss VY., Collins G., Prieto-Alhambra D.
Introduction: UK Biobank is a large-scale biomedical research database with in-depth genetic and phenotypic information. However, baseline medication data are recorded as free text containing semantic ambiguity and errors, which potentially limits their use for research. Aim: To develop a flexible, automated, and transparent algorithm to extract drug information from self-reported medication free text in the UK Biobank. Methods: We propose a three-step approach: 1) filtering strings, 2) building a dictionary of candidate drugs, and 3) mapping drug names. In step 1, special characters and uninformative stop words were filtered out using regular expressions and a common-word repository, which can be manually tuned by adding or removing domain-specific words. In step 2, a dictionary of candidate drugs was built by referencing official UK Biobank documents and the public ChEMBL database, which curates British National Formulary names, International Nonproprietary Names, and Anatomical Therapeutic Chemical (ATC) codes. Finally, in step 3, the filtered strings were mapped to the candidate drug dictionary built in step 2. The accuracy of the algorithm was evaluated by comparing its output with manual identification on a random sample of 100 distinct medication strings. Results: A total of 7,944 distinct medication strings were reported by ~502,125 participants in the UK Biobank. Of these, 6,804 (85.6%) strings included at least one identifiable active ingredient. The accuracy of our algorithm against manual extraction was 96%. Conclusion: Our algorithm is accurate and facilitates the automatic identification of drug names from self-reported medication strings in the UK Biobank. Its performance can be further improved by refining the stop-word repository. This tool could act as a standard pre-processor for phenotyping baseline drug use information and enable reproducible research into drug use and pharmacoepidemiology/pharmacogenomics in the UK Biobank.
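The three-step approach described in the Methods can be illustrated with a minimal Python sketch. This is not the authors' implementation: the stop-word set, the toy drug dictionary, and the function names `filter_string` and `map_drugs` are all hypothetical, and the choice to strip digits along with special characters is an assumption made for the example. In practice the dictionary would be built from UK Biobank documents and ChEMBL rather than hard-coded.

```python
import re

# Step 1 input (assumption): a tiny stand-in for the tunable common-word repository.
STOP_WORDS = {"tablet", "tablets", "capsule", "capsules", "mg", "oral", "product"}

# Step 2 (assumption): a toy candidate-drug dictionary mapping names and
# synonyms to one active ingredient. The real dictionary is not reproduced here.
DRUG_DICTIONARY = {
    "paracetamol": "paracetamol",
    "acetaminophen": "paracetamol",
    "simvastatin": "simvastatin",
    "zocor": "simvastatin",
    "codeine": "codeine",
}


def filter_string(raw):
    """Step 1: lowercase, replace non-letter characters with spaces, drop stop words."""
    text = re.sub(r"[^a-z\s]", " ", raw.lower())
    return [token for token in text.split() if token not in STOP_WORDS]


def map_drugs(raw):
    """Step 3: map each filtered token to an ingredient, keeping first-seen order."""
    found = []
    for token in filter_string(raw):
        ingredient = DRUG_DICTIONARY.get(token)
        if ingredient is not None and ingredient not in found:
            found.append(ingredient)
    return found


if __name__ == "__main__":
    for medication in ["Zocor 20mg tablet", "paracetamol + codeine", "multivitamin product"]:
        print(medication, "->", map_drugs(medication))
```

Token-level lookup is the simplest possible mapping strategy; it would miss multi-word drug names and names containing digits or hyphens, which a fuller implementation would need to handle.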