Ethnicity, Health Equity, and AI

Improving how self-identified ethnicity data is used in real-world research

This project addresses health inequalities caused by biases in clinical prediction tools. To do this, our researchers are improving the way recorded ethnicity is used in research.

OVERVIEW

The COVID-19 pandemic highlighted inequalities in health systems around the world, but these inequalities are long-standing and disadvantage people based on their backgrounds. The factors involved, which often intersect, include ethnicity and race, sex, age, socioeconomic status, rare conditions or disabilities, pregnancy, and more.

Inequalities based on ethnicity are of concern because our understanding of many diseases comes from predominantly White or Caucasian populations whose risk factors, disease prevalence, and incidence can differ from other ethnic groups.

In addition to these complexities, imbalances in healthcare technologies, such as artificial intelligence systems which create clinical prediction models, can worsen these existing biases. If the data used to train models are not representative of patients with the conditions being modelled, or if the models include algorithmic biases, the results can be inaccurate for some populations and lead to mis-estimation of a patient’s health risks. In this project, our researchers are addressing health inequalities by improving our understanding of ethnicity data in health records, and by applying these data to make tailored health prediction models.

This project was commissioned as part of the UK Government’s COVID-19 Core Studies programme, funded by Health Data Research UK, and aims to produce representative, good-quality and reliable data on the ethnicity records of NHS patients in England. The project has three main stages:

1. To organise, assess and describe the ethnicity data available from NHS England patient data, from both primary and secondary healthcare settings (Phase I).
2. To use the high-quality data collated in the first phase to look at health disparities of COVID-19 for different ethnicities, with a particular focus on heart disease.
3. To create tools for machine learning and data science projects which use ethnicity data, to improve the ability of projects to use data and produce accurate results.

This project uses data from the NHS England Secure Data Environment (SDE), which gives approved researchers access to de-identified records of over 60 million patients who were impacted by the COVID-19 pandemic. Ethnicity in these data can be recorded in various forms, including Systematized Nomenclature of Medicine – Clinical Terms (SNOMED-CT) codes, of which there are 489 categories, or the 19 Primary Code ethnicity codes of the Office for National Statistics (ONS). In research, however, ethnicity is often collapsed into six higher-level groups, which can be too broad and unspecific. By describing the quality of the more detailed ethnicity groups and applying these detailed data in research, we aim to demonstrate that researchers can and should use the available detail in patient ethnicity data to improve the accuracy of clinical models.
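To make the effect of collapsing concrete, here is a minimal Python sketch. It is not the project's code or mapping: the `DETAILED_TO_BROAD` dictionary, the `collapse` helper and the sample records are all hypothetical, and only a handful of labels are shown rather than the full SNOMED-CT or ONS code lists. It simply counts how many distinct categories remain before and after detailed labels are mapped to broad groups.

```python
from collections import Counter

# Hypothetical mapping from a few detailed ethnicity labels to broad groups.
# Illustrative only: this is NOT the project's SNOMED-CT or ONS mapping.
DETAILED_TO_BROAD = {
    "Indian": "Asian or Asian British",
    "Pakistani": "Asian or Asian British",
    "Bangladeshi": "Asian or Asian British",
    "Caribbean": "Black or Black British",
    "African": "Black or Black British",
}


def collapse(labels):
    """Map each detailed label to its broad group, or 'Unknown' if unmapped."""
    return [DETAILED_TO_BROAD.get(label, "Unknown") for label in labels]


# Made-up records, one detailed label per patient.
records = ["Indian", "Pakistani", "Bangladeshi", "Indian", "Caribbean", "African"]

detailed_counts = Counter(records)
broad_counts = Counter(collapse(records))

print(len(detailed_counts), "detailed categories:", dict(detailed_counts))
print(len(broad_counts), "broad groups:", dict(broad_counts))
```

In this toy example, several distinct detailed categories end up in the same broad group, so any difference in risk between them would no longer be visible to a model trained only on the collapsed variable.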

This project was supported by Health Data Research UK, the British Heart Foundation Data Science Centre and The Alan Turing Institute, and involved collaborators from University College London. It forms part of the work of the CVD-COVID-UK/COVID-IMPACT Consortium.

The Ethnicity, Equity and AI Project

Phase I Paper - Pineda Moncusi et al. (2024), Scientific Data
