Phase I - The largest study of patient ethnicity data in England
In the first phase of this project, our researchers focused on analysing the ethnicity data collected about patients in the NHS, and creating a research-ready resource from these data. Our findings relate to the ways ethnicity is categorised, how these categories can lead to patients having contradicting ethnicity records, and which populations are most likely to be lacking ethnicity data.
Phase I of the Ethnicity, Health Equity and AI study focuses on the ethnicity data that are routinely collected within the NHS, and how these data could benefit healthcare tools such as health prediction models. The ethnicity data collected in GP centres and hospitals are available to approved researchers through NHS England’s Secure Data Environment (SDE), where all the data are de-identified to protect patient identities.
This study’s data set is specifically related to patients affected by the COVID-19 pandemic, and represents around 93% of patients in England. Our researchers looked at the available data on ethnicity, analysing how ethnicity is categorised in GP centres and hospitals, where contradictions or gaps in recorded ethnicity exist, and how these might impact research which relies on NHS-recorded ethnicity data.
Healthcare in England uses two key categories of ethnicity data: SNOMED-CT concepts and NHS ethnicity codes. SNOMED concepts cover 489 different categories of ethnicity, providing very detailed, or granular, information. NHS ethnicity codes, on the other hand, have 19 categories, so they provide less detailed information but larger population groups.
Beyond these two categories, health researchers often simplify ethnicity further into six even broader groups. The larger population in each group is intended to give more reliable results, but the simplification reduces the accuracy of those results. These broad groups can prevent researchers from identifying specific ethnicity groups whose health profiles differ from those of the broader group they belong to. Inconsistencies in how ethnicity data are recorded, or converted from one set of ethnicity categories to another, can also make research results less accurate.
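The effect of rolling detailed categories up into broader ones can be illustrated with a minimal Python sketch. All category labels and both lookup tables below are hypothetical placeholders invented for illustration; they are not real SNOMED-CT concepts, NHS ethnicity codes, or the mapping used in the study.

```python
from collections import Counter

# Hypothetical labels for illustration only: these are NOT real SNOMED-CT
# concepts, NHS ethnicity codes, or an official mapping between them.
DETAILED_TO_PRIMARY = {
    "detailed_concept_1": "primary_A",
    "detailed_concept_2": "primary_A",
    "detailed_concept_3": "primary_B",
    "detailed_concept_4": "primary_C",
}
PRIMARY_TO_BROAD = {
    "primary_A": "broad_1",
    "primary_B": "broad_1",
    "primary_C": "broad_2",
}


def roll_up(detailed_code):
    """Return (primary, broad) for a detailed code; None where unmapped."""
    primary = DETAILED_TO_PRIMARY.get(detailed_code)
    broad = PRIMARY_TO_BROAD.get(primary)
    return primary, broad


def count_by_level(detailed_codes):
    """Count records at each level of detail."""
    counts = {"detailed": Counter(), "primary": Counter(), "broad": Counter()}
    for code in detailed_codes:
        primary, broad = roll_up(code)
        counts["detailed"][code] += 1
        counts["primary"][primary] += 1
        counts["broad"][broad] += 1
    return counts


records = [
    "detailed_concept_1",
    "detailed_concept_2",
    "detailed_concept_3",
    "detailed_concept_4",
]
counts = count_by_level(records)
# Four distinct detailed groups collapse into three primary groups and
# then into two broad groups.
print({level: len(c) for level, c in counts.items()})
```

In this toy example, any difference between the first three detailed groups would be invisible to an analysis that uses only the broad groups.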
FINDINGS
The study had three key findings which are important for researchers to consider when using ethnicity data in their research.
COMPLETENESS OF ETHNICITY DATA
Our study found that 1 in 10 patients have missing ethnicity records, either because their ethnicity was not collected or because they chose the ‘prefer not to say’ option. Patients with missing data were more likely to be younger, to be male, and to have fewer recorded co-occurring conditions than those with ethnicity data.
Our Public and Patient Involvement (PPI) in this study helped researchers to understand that the common practice of inferring patient ethnicity when records were missing didn’t respect patients’ wishes when they used a ‘prefer not to say’ option, and could lead to the reinforcement of current biases in healthcare data. As part of this project, we launched the ‘Be Proud to reveal your ethnicity’ campaign to encourage patients to provide their ethnicity, to aid accurate research. Explore this project's PPI work and the 'Be Proud of Your Ethnicity' Campaign here.
DETAILS IN ETHNICITY DATA
Our study demonstrated that patients in England self-identify across 250 different ethnicity sub-groups. These groups all fall within the SNOMED concepts, a standardised way to record patients’ ethnicity. SNOMED includes 489 ethnicity categories, but not all of these were used by patients in this data set.
Download this visualiser from GitHub to explore further.
Our analysis is the first to demonstrate the detail, or granularity, of ethnicity records available in healthcare data, which can be used by researchers to produce more accurate results and tools for different ethnicities.
INCONSISTENCIES AND INACCURACIES IN ETHNICITY DATA
Our study showed that around 12% of patients had inconsistencies in the ethnicity code used in their patient profiles. This can reflect changes in an individual’s perception of their own ethnicity, but can become complex when seemingly similar ethnicity groups are organised into different higher-level classifications.
Download this visualiser from GitHub to explore further.
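As a rough illustration of how inconsistent records might be flagged, the Python sketch below groups recorded ethnicity codes by patient and reports anyone with more than one distinct code. The record layout, the patient identifiers, the codes and the definition of "inconsistent" are all assumptions made for this example; the study's own method is documented in its published code.

```python
from collections import defaultdict


def find_inconsistent_patients(records):
    """Return {patient_id: set of codes} for patients with more than one
    distinct recorded ethnicity code.

    `records` is an iterable of (patient_id, ethnicity_code) pairs. A code
    of None stands for a missing record and is ignored here, because
    missing data is a separate issue from contradictory data.
    """
    codes_by_patient = defaultdict(set)
    for patient_id, code in records:
        if code is not None:
            codes_by_patient[patient_id].add(code)
    return {
        patient_id: codes
        for patient_id, codes in codes_by_patient.items()
        if len(codes) > 1
    }


# Hypothetical records for illustration only.
example_records = [
    ("patient_1", "code_X"),
    ("patient_1", "code_X"),  # same code repeated: consistent
    ("patient_2", "code_X"),
    ("patient_2", "code_Y"),  # two different codes: inconsistent
    ("patient_3", None),      # missing record: not counted as inconsistent
]
inconsistent = find_inconsistent_patients(example_records)
print(sorted(inconsistent))
```

A fuller analysis would also check whether the conflicting codes fall into the same higher-level group, since a conflict within one broad group affects research differently from a conflict that crosses groups.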
We also found that some healthcare settings are using older versions of the NHS ethnicity codes, which are based on outdated ONS Census data. Where older categories were used, patients could be organised into different higher-level categories, which could affect research results.
Our researchers suggested alternative ways to map the detailed SNOMED codes to the NHS Primary Codes, which can make these 19 Primary Codes more representative for use in research.
Download this visualiser from GitHub to explore further.
ACCESSING THIS DATA
The de-identified data used in this study were made available to accredited researchers. Those wishing to gain access to the data should contact bhfdsc@hdruk.ac.uk in the first instance. If you would like to explore the data visualisers shown in the GIFs on this page, they are available from the BHF Data Science Centre GitHub for this project, alongside the study code and further figures.