  • Project No: Botnar-2025-07
  • Intake: 2026

PROJECT OVERVIEW

Real world data (RWD) is increasingly available for research. Large investments and collaborations have led to the creation of international networks that enable federated analytics of such data for disease characterisation and for the study of drug and device safety and effectiveness.

Our team partners with the Observational Health Data Sciences and Informatics (OHDSI) open science community to leverage data mapped to the Observational Medical Outcomes Partnership (OMOP) Common Data Model.

Despite years of investment and effort, the process of creating and validating computable phenotypes remains one of the most expensive and least scalable steps in the whole pipeline of real world evidence generation [1]. Fortunately, recent advances in artificial intelligence (AI) and the emergence of large language models (LLMs) have started to prove valuable as a catalyst for making this process more efficient [2].

However, extensive work is needed to fully understand the potential of LLMs in this context, and to mitigate associated risks and challenges. Through this PhD studentship, we aim to investigate how LLMs can be applied to the generation and validation of computable phenotypes for RWD analysis, by researching:

  1. The performance of different LLMs for the generation of clinical descriptions and concept sets based on pre-specified clinical knowledge
  2. The usefulness of LLMs for the review and validation of the resulting phenotypes compared to gold standard processes and previously validated cohorts
  3. The application of LLMs for federated learning across a large network of data partners with access to RWD previously mapped to OMOP 

Key References

[1] F Dernie et al. Standardised and Reproducible Phenotyping Using Distributed Analytics and Tools in the Data Analysis and Real World Interrogation Network (DARWIN EU). Pharmacoepidemiol Drug Saf 2024. https://doi.org/10.1002/pds.70042

[2] M Schuemie et al. Standardized patient profile review using large language models for case adjudication in observational research. NPJ Digit Med 2025;8(1):18. https://doi.org/10.1038/s41746-025-01433-4

KEYWORDS

Real world evidence, epidemiology, health data sciences

The Health Data Sciences team

The Health Data Sciences team at the Botnar Institute is a multidisciplinary group of over 40 people, including research staff, postdoctoral researchers, and 8 PhD students. Our team brings together colleagues from diverse backgrounds and geographies, and from the complementary areas of knowledge needed to complete research studies, from design to reporting. We have extensive expertise in health data sciences, epidemiology, and pharmacoepidemiology.

Training

Alongside the departmental training opportunities listed below, we will provide hands-on training in real world data analysis using medical records and genetic data from the Health Data Sciences section at the Botnar Institute (University of Oxford).

The Botnar Institute plays host to the University of Oxford's NDORMS Health Data Sciences and Real World Evidence section, which enables and encourages research and education into the use of large, routinely collected health datasets for the study and improvement of human health. Training will be provided in techniques and methods including epidemiology, pharmacoepidemiology, data sciences, applied artificial intelligence, causal inference, and real world evidence.

A core curriculum of lectures will be taken in the first term to provide a solid multidisciplinary foundation in a broad range of subjects including biology, inflammation, epigenetics, translational immunology, microbiome, and data sciences. Students will also be required to attend regular seminars within the Department and those relevant in the wider University.

Students will be expected to present data regularly in Departmental seminars and fortnightly Health Data Science meetings, and to attend external conferences to present their research globally, with limited financial support from the Department.

Students will also have the opportunity to work closely with our wide range of collaborators in the Observational Health Data Sciences and Informatics (OHDSI), European Health Data and Evidence Network (EHDEN), and related open data science communities.

Students will have access to various courses run by the Medical Sciences Division Skills Training Team and other Departments. All students are required to attend a 2-day Statistical and Experimental Design course at NDORMS (information will be provided once you have been accepted to the programme).

How to Apply

Please contact the relevant supervisor(s) to register your interest in the project and, if required, the departmental Education Team (graduate.studies@ndorms.ox.ac.uk), who will be able to advise you of the essential requirements for the programme and provide further information on how to make an official application.

Interested applicants should have, or expect to obtain, a first or upper second-class BSc degree or equivalent in a relevant subject, and will also need to provide evidence of English language competence (where applicable). The application guide and form are available online, and the DPhil programme will commence in October 2026.

Applications should be made to the following programme using the specified course code:

- DPhil in Clinical Epidemiology and Medical Statistics (course code: RD_NNRA1)

For further information, please visit http://www.ox.ac.uk/admissions/graduate/applying-to-oxford.