Improving how self-identified ethnicity data is used in real-world research

A banner showing 9 people coloured in blues, yellows, pinks and greens, representing different ethnicity, and a central person in white holding a sign which shows a symbol for health equity

This project addresses health inequalities caused by biases in clinical prediction tools by improving the way researchers use recorded ethnicity data.


The COVID-19 pandemic highlighted inequalities in health systems around the world, but the issue is long-standing and disadvantages people based on their backgrounds. The factors involved, which often intersect, include ethnicity and race, sex, age, socioeconomic status, rare conditions or disabilities, pregnancy, and more.

Inequalities based on ethnicity are of concern because our understanding of many diseases comes from predominantly White or Caucasian populations whose risk factors, disease prevalence, and incidence can differ from other ethnic groups.

In addition to these complexities, imbalances in healthcare technologies, such as artificial intelligence systems that create clinical prediction models, can worsen these existing biases. If the data used to train models are not representative of the patients with the conditions being modelled, or if the models include algorithmic biases, the results can be inaccurate for some populations and lead to the mis-estimation of a patient’s health risks. In this project, our researchers are addressing health inequalities by improving our understanding of ethnicity data in health records and applying these data to build tailored health prediction models.

This project was commissioned as part of the UK Government’s COVID-19 Core Studies programme, funded by Health Data Research UK, and aims to produce representative, good-quality and reliable data on the ethnicity records of NHS patients in England. The project has three main stages:

  1. To organise, assess and describe the ethnicity data available from NHS England patient data, from both primary and secondary healthcare settings (Phase I).
  2. To use the high-quality data collated in the first phase to examine COVID-19 health disparities across ethnic groups, with a particular focus on heart disease.
  3. To create tools for machine learning and data science projects that use ethnicity data, so that those projects can use the data more effectively and produce accurate results.

This project uses data from the NHS England Secure Data Environment (SDE), which gives approved researchers access to de-identified records of over 60 million patients who were impacted by the COVID-19 pandemic. Ethnicity in these records can be coded in various forms, including Systematized Nomenclature of Medicine – Clinical Terms (SNOMED-CT) codes, of which there are 489 categories, or the Office for National Statistics (ONS) 19 Primary Code ethnicity codes. In research, however, ethnicity is often collapsed into six higher-level groups, which can be too broad and unspecific. By describing the quality of the more detailed ethnicity groups and applying these detailed data in research, the project aims to demonstrate that researchers can and should use the available detail in patient ethnicity data to improve the accuracy of clinical models.
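To illustrate what is lost when detailed ethnicity codes are collapsed, the following is a minimal sketch in Python. It is not the project's code: the mapping `DETAILED_TO_BROAD`, the helper `collapse`, and the sample records are all hypothetical, and the mapping covers only a small illustrative subset of categories rather than the full SNOMED-CT or ONS code lists.

```python
from collections import Counter

# Hypothetical, partial mapping from detailed labels to broad groups.
# Assumption for illustration only; not the project's actual code list.
DETAILED_TO_BROAD = {
    "Indian": "Asian or Asian British",
    "Pakistani": "Asian or Asian British",
    "Bangladeshi": "Asian or Asian British",
    "Chinese": "Asian or Asian British",
    "African": "Black or Black British",
    "Caribbean": "Black or Black British",
}


def collapse(labels):
    """Map each detailed label to its broad group; unmapped labels become 'Unknown'."""
    return [DETAILED_TO_BROAD.get(label, "Unknown") for label in labels]


# Made-up example records, not real patient data.
records = [
    "Indian", "Pakistani", "Bangladeshi", "Chinese",
    "African", "Caribbean", "Indian",
]

detailed_counts = Counter(records)          # one count per detailed label
broad_counts = Counter(collapse(records))   # one count per broad group

print("Distinct detailed groups:", len(detailed_counts))
print("Distinct broad groups:", len(broad_counts))
```

In this toy example, several distinct detailed groups fall into the same broad group, so any difference in risk between them can no longer be seen once the data are collapsed.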

This project was supported by Health Data Research UK, the British Heart Foundation Data Science Centre and The Alan Turing Institute, and involved collaborators from University College London. It forms part of the work of the CVD-COVID-UK/COVID-IMPACT Consortium.