Large language models for phenotyping in Real World Evidence distributed analytics: bridging expert knowledge and artificial intelligence

Project No: Botnar-2025-07
Intake: 2026

PROJECT OVERVIEW

Real world data (RWD) is increasingly available for research. Large investments and collaborations have led to the creation of international networks that enable federated analytics of such data for disease characterisation and for the study of drug and device safety and effectiveness.

Our team partners with the Observational Health Data Sciences and Informatics (OHDSI) open science community to leverage data mapped to the Observational Medical Outcomes Partnership (OMOP) Common Data Model.

Despite years of investment and effort, the process of creating and validating computable phenotypes remains one of the most expensive and least scalable steps in the whole pipeline of real world evidence generation [1]. Recent advances in artificial intelligence (AI) and the emergence of large language models (LLMs) have started to prove valuable as a catalyst for making this process more efficient [2].

However, extensive work is needed to fully understand the potential of LLMs in this context, and to mitigate associated risks and challenges. Through this PhD studentship, we aim to investigate how LLMs can be applied to the generation and validation of computable phenotypes for RWD analysis, by researching:

- The performance of different LLMs for the generation of clinical descriptions and concept sets based on pre-specified clinical knowledge
- The usefulness of LLMs for the review and validation of the resulting phenotypes compared to gold standard processes and previously validated cohorts
- The application of LLMs for federated learning across a large network of data partners with access to RWD previously mapped to OMOP

Key References

1) F Dernie et al. Standardised and Reproducible Phenotyping Using Distributed Analytics and Tools in the Data Analysis and Real World Interrogation Network (DARWIN EU). Pharmacoepidemiol Drug Saf 2024. https://doi.org/10.1002/pds.70042

2) M Schuemie et al. Standardized patient profile review using large language models for case adjudication in observational research. NPJ Digit Med 2025;8(1):18. https://doi.org/10.1038/s41746-025-01433-4

KEYWORDS

Real world evidence, epidemiology, health data sciences

The Health Data Sciences team

The Health Data Sciences team at the Botnar Institute is a multidisciplinary group of over 40 people, including research staff, postdoctoral researchers, and 8 PhD students. Our colleagues come from diverse backgrounds and geographies and bring the complementary areas of knowledge needed to complete research studies, from design to reporting. We have extensive expertise in health data sciences, epidemiology, and pharmacoepidemiology.

Training

Alongside the departmental training opportunities listed below, we will provide hands-on training in real world data analysis using medical records and genetic data from the Health Data Sciences section at the Botnar Institute (University of Oxford).

The Botnar Institute plays host to the University of Oxford's NDORMS Health Data Sciences and Real World Evidence section, which enables and encourages research and education into the use of large routinely collected health data for the study and improvement of human health. Training will be provided in techniques and methods including epidemiology, pharmacoepidemiology, data sciences, applied artificial intelligence, causal inference, and real world evidence.

A core curriculum of lectures will be taken in the first term to provide a solid multidisciplinary foundation in a broad range of subjects including biology, inflammation, epigenetics, translational immunology, microbiome, and data sciences. Students will also be required to attend regular seminars within the Department and those relevant in the wider University.

Students will be expected to present data regularly in Departmental seminars and fortnightly Health Data Science meetings, and to attend external conferences to present their research, with limited financial support from the Department.

Students will also have the opportunity to work closely with our wide range of collaborators in the Observational Health Data Sciences and Informatics (OHDSI), European Health Data and Evidence Network (EHDEN), and related open data science communities.

Students will have access to various courses run by the Medical Sciences Division Skills Training Team and other Departments. All students are required to attend a 2-day Statistical and Experimental Design course at NDORMS (information will be provided once accepted to the programme).

How to Apply

Please contact the relevant supervisor(s) to register your interest in the project and, if required, the departmental Education Team (graduate.studies@ndorms.ox.ac.uk), who will be able to advise you of the essential requirements for the programme and provide further information on how to make an official application.

Interested applicants should have, or expect to obtain, a first or upper second-class BSc degree or equivalent in a relevant subject, and will also need to provide evidence of English language competence (where applicable). The application guide and form are available online, and the DPhil programme will commence in October 2026.

Applications should be made to the following programme using the specified course code:

- DPhil in Clinical Epidemiology and Medical Statistics (course code: RD_NNRA1)

For further information, please visit http://www.ox.ac.uk/admissions/graduate/applying-to-oxford.

Internal supervisors

Daniel Prieto-Alhambra

Professor of Pharmaco- and Device Epidemiology

Albert Prats-Uribe

Senior Clinical Research Fellow in Public Health

Anna Saura Lazaro

Senior Researcher in Clinical Epidemiology and Real World Evidence

External supervisor

Dr Anna Ostropolets

APPLY NOW

All applications must be made through the official University of Oxford online applications system.

INTERVIEW DATES

Divisional Competition interviews will be held on Wednesday 14 and Thursday 15 January 2026.

Admission Only interviews will be held on Wednesday 21 and Thursday 22 January 2026.

APPLICATION NOTICE

The deadline for complete applications is noon on Tuesday 2 December 2025. We strongly encourage you to submit your applications in advance of the deadline date. Any applications which are not complete by the deadline will not be eligible for consideration.
