Phase I - The largest study of patient ethnicity data in England

In the first phase of this project, our researchers focused on analysing the ethnicity data collected about patients in the NHS, and creating a research-ready resource from these data. Our findings relate to the ways ethnicity is categorised, how these categories can lead to patients having contradictory ethnicity records, and which populations are most likely to be lacking ethnicity data.

Phase I of the Ethnicity, Health Equity and AI study focuses on the ethnicity data that are routinely collected within the NHS, and how these data could benefit healthcare tools such as health prediction models. The ethnicity data collected in GP centres and hospitals are available to approved researchers through NHS England’s Secure Data Environment (SDE), where all data are de-identified to protect patient identities.

This study’s data set is specifically related to patients affected by the COVID-19 pandemic, and represents around 93% of patients in England. Our researchers looked at the available data on ethnicity, analysing how ethnicity is categorised in GP centres and hospitals, where contradictions or gaps in recorded ethnicity exist, and how these might impact research which relies on NHS-recorded ethnicity data.

Flow chart showing how the 6 research codes, 19 NHS codes and 489 SNOMED codes have different levels of detail and different group sizes

Healthcare in England uses two key classifications of ethnicity data: SNOMED-CT concepts and NHS ethnicity codes. SNOMED concepts cover 489 different categories of ethnicity, giving very detailed, or granular, information. NHS ethnicity codes, by contrast, have 19 categories, so they carry less detail but describe larger population groups.

Beyond these two classifications, health researchers often simplify ethnicity further into six even broader groups. This gives larger populations and, supposedly, more reliable results, but it comes at the cost of detail. Broad groups can prevent researchers from identifying specific ethnicity groups whose health profiles differ from those of the broader group they belong to. Inconsistencies in how ethnicity data are recorded, or converted from one classification to another, can also make research results less accurate.
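The effect of collapsing detailed categories into broader groups can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the subgroup labels, the mapping and the counts are invented toy values, not study data or the study's method. It shows how a small subgroup with a different rate of some condition becomes invisible once it is pooled into a broad group.

```python
from collections import defaultdict

# Hypothetical mapping from detailed categories to one broad research group.
# Labels are placeholders, not real SNOMED concepts or NHS codes.
DETAILED_TO_BROAD = {
    "Subgroup A": "Broad group 1",
    "Subgroup B": "Broad group 1",
    "Subgroup C": "Broad group 1",
}

# Invented toy counts: category -> (patients, patients with a condition).
TOY_COUNTS = {
    "Subgroup A": (1000, 100),
    "Subgroup B": (1000, 100),
    "Subgroup C": (200, 60),
}


def rate_by_detailed(counts):
    """Condition rate for each detailed category."""
    return {cat: cases / patients for cat, (patients, cases) in counts.items()}


def rate_by_broad(counts, mapping):
    """Condition rate after pooling detailed categories into broad groups."""
    totals = defaultdict(lambda: [0, 0])
    for cat, (patients, cases) in counts.items():
        group = mapping[cat]
        totals[group][0] += patients
        totals[group][1] += cases
    return {group: cases / patients for group, (patients, cases) in totals.items()}


detailed_rates = rate_by_detailed(TOY_COUNTS)
broad_rates = rate_by_broad(TOY_COUNTS, DETAILED_TO_BROAD)
```

In this toy example the rate for "Subgroup C" is three times that of the other two subgroups, yet the pooled rate for "Broad group 1" sits close to the majority rate, so the difference would go unnoticed in an analysis that only used the broad group.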

FINDINGS

The study had three key findings which are important for researchers to consider when using ethnicity data in their research.

COMPLETENESS OF ETHNICITY DATA

Our study found that 1 in 10 patients have missing ethnicity records, either because ethnicity was not collected or because the patient chose the ‘prefer not to say’ option. Patients with missing data were more likely to be younger, male and to have fewer recorded co-occurring conditions than those with ethnicity data.

One in ten map
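As a rough sketch of how completeness might be summarised, the example below counts ethnicity values as recorded, declined or not collected. It is not the study's code. The use of "Z" for 'prefer not to say' and of None or an empty string for an uncollected value are assumptions for this sketch, and the list of codes is invented toy data.

```python
from collections import Counter

# Assumptions for this sketch; real extracts may encode these differently.
NOT_STATED = "Z"  # assumed code for patients who prefer not to say


def completeness_summary(codes):
    """Count recorded, declined and not-collected ethnicity values."""
    summary = Counter()
    for code in codes:
        if code is None or code == "":
            summary["not_collected"] += 1
        elif code == NOT_STATED:
            summary["declined"] += 1
        else:
            summary["recorded"] += 1
    return summary


def missing_fraction(codes):
    """Fraction of values that are either declined or not collected."""
    summary = completeness_summary(codes)
    total = sum(summary.values())
    if total == 0:
        return 0.0
    return (summary["declined"] + summary["not_collected"]) / total


# Invented toy values, one per patient.
TOY_CODES = ["A", "H", None, "Z", "M", "A", "", "N", "A", "R"]
toy_summary = completeness_summary(TOY_CODES)
toy_missing = missing_fraction(TOY_CODES)
```

Keeping "declined" separate from "not_collected" means a patient's choice not to state their ethnicity stays visible in the summary rather than being merged with values that were simply never collected.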

Our Public and Patient Involvement (PPI) in this study helped researchers to understand that the common practice of inferring patient ethnicity when records were missing didn’t respect patients’ wishes when they used a ‘prefer not to say’ option, and could lead to the reinforcement of current biases in healthcare data. As part of this project, we launched the ‘Be Proud to reveal your ethnicity’ campaign to encourage patients to provide their ethnicity, to aid accurate research. Explore this project's PPI work and the 'Be Proud of Your Ethnicity' Campaign here.

DETAILS IN ETHNICITY DATA

Our study demonstrated that patients in England self-identify across 250 different ethnicity sub-groups. These groups all fall within the SNOMED concepts, a standardised way to record patients’ ethnicity that includes 489 categories, not all of which were used by patients in this data set.

PHI Ethnicity Code Levels in NHS England Patient Data

Download this visualiser from GitHub to explore further.

Our analysis is the first to demonstrate the detail, or granularity, of ethnicity records available in healthcare data, which can be used by researchers to produce more accurate results and tools for different ethnicities.

INCONSISTENCIES AND INACCURACIES IN ETHNICITY DATA

Our study showed that around 12% of patients had inconsistencies in the ethnicity code used in their patient profiles. This can reflect changes in an individual’s perception of their own ethnicity, but can become complex when seemingly similar ethnicity groups are organised into different higher-level classifications.
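One simple way to flag this kind of inconsistency is to collect the distinct ethnicity codes recorded for each patient and report anyone with more than one. The sketch below assumes records arrive as (patient_id, ethnicity_code) pairs with None for a missing code. The identifiers and codes are invented, and this is not the study's own implementation.

```python
from collections import defaultdict


def patients_with_conflicts(records):
    """Return {patient_id: sorted distinct codes} for patients with more than one code.

    records: iterable of (patient_id, ethnicity_code) pairs.
    Missing codes (None) are ignored rather than treated as a conflict.
    """
    codes_by_patient = defaultdict(set)
    for patient_id, code in records:
        if code is not None:
            codes_by_patient[patient_id].add(code)
    return {
        patient_id: sorted(codes)
        for patient_id, codes in codes_by_patient.items()
        if len(codes) > 1
    }


# Invented toy records: hypothetical patient ids and placeholder codes.
TOY_RECORDS = [
    ("p1", "A"), ("p1", "A"),
    ("p2", "A"), ("p2", "C"),
    ("p3", None), ("p3", "H"),
    ("p4", "M"), ("p4", "N"), ("p4", "M"),
]
toy_conflicts = patients_with_conflicts(TOY_RECORDS)
```

Here "p2" and "p4" are flagged because each has two distinct codes, while "p3" is not, since a missing value alongside one recorded code is a completeness issue rather than a contradiction.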

A pie chart showing the proportions of different ethnicity groups in NHS England patient data, which can be expanded to look in more detail at how each group is categorised.

Download this visualiser from GitHub to explore further.

We also found that some healthcare settings are using older versions of the NHS ethnicity codes, which are based on outdated ONS Census data. Where older categories were used, patients could be organised into different higher-level categories, which could affect results.

Our researchers suggested alternative ways to map the detailed SNOMED codes to NHS Primary Codes, which could make these 19 Primary Codes more representative for use in research.

A diagram showing how some SNOMED codes are currently organised into NHS codes (Left) and how they could be organised to reduce conflicts in records.

Download this visualiser from GitHub to explore further.
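The idea of comparing candidate mappings can be sketched by counting, for each mapping, how many patients end up with more than one primary code. The detailed codes "d1" to "d3", the primary codes "P1" and "P2", both mappings and the records are hypothetical placeholders. The actual mappings proposed by the researchers are in the study materials and are not reproduced here.

```python
from collections import defaultdict


def count_conflicted_patients(records, mapping):
    """Count patients whose detailed codes map to more than one primary code."""
    primary_by_patient = defaultdict(set)
    for patient_id, detailed_code in records:
        primary_by_patient[patient_id].add(mapping[detailed_code])
    return sum(1 for primaries in primary_by_patient.values() if len(primaries) > 1)


# Hypothetical mappings from detailed codes to primary codes.
CURRENT_MAPPING = {"d1": "P1", "d2": "P2", "d3": "P2"}
ALTERNATIVE_MAPPING = {"d1": "P1", "d2": "P1", "d3": "P2"}

# Invented toy records: (patient_id, detailed_code).
TOY_RECORDS = [
    ("p1", "d1"), ("p1", "d2"),
    ("p2", "d1"), ("p2", "d2"),
    ("p3", "d2"), ("p3", "d3"),
    ("p4", "d3"),
]

conflicts_current = count_conflicted_patients(TOY_RECORDS, CURRENT_MAPPING)
conflicts_alternative = count_conflicted_patients(TOY_RECORDS, ALTERNATIVE_MAPPING)
```

In this toy data the alternative mapping resolves the conflicts for "p1" and "p2" but introduces one for "p3", which illustrates why a remapping would need to be judged against the whole population rather than against individual cases.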

ACCESSING THIS DATA

The de-identified data used in this study were made available to accredited researchers. Those wishing to gain access to the data should contact bhfdsc@hdruk.ac.uk in the first instance. If you would like to explore the data visualisers shown in the GIFs on this page, they are available from the BHF Data Science Centre GitHub for this project, alongside the study code and further figures.

The Ethnicity, Equity and AI Project

Phase I Paper - Pineda Moncusi et al. (2024), Scientific Data
