Phenotype Library Inclusion Criteria
For an algorithm to be included in the Phenotype Library, it must satisfy the following criteria:
- Define a disease (e.g. hypertension), lifestyle risk factor (e.g. smoking) or biomarker (e.g. blood pressure).
- Derive information from one or more electronic health record data sources. This can include national and local sources. The definition of EHR includes administrative data such as billing/claims data, and clinical audits.
- Have one or more peer-reviewed outputs associated with it (e.g. journal publications, scientific conference papers, policy white papers).
- Provide evidence of how the phenotyping algorithm was validated.
Phenotyping algorithms are stored in the Phenotype Library using a combination of YAML, CSV and markdown files. Each algorithm has two main components: a) the phenotype definition file (a markdown file with a YAML header), and b) one or more terminology files (also known as codelists), which are stored as CSV files. The sections below describe their schema and contents.
All phenotype definition files associated with a phenotype use a common naming pattern:
Phenotype files are stored in the _phenotypes directory.
Codelist files follow a similar pattern:
Codelist files are stored in the codelists directory.
Phenotype definition file
The phenotype definition file is a markdown file with a YAML header. The YAML header is used to record metadata fields capturing information about the algorithm, the data sources, controlled clinical terminologies and other information.
For example, the code snippet below displays the metadata associated with the bronchiectasis phenotyping algorithm submitted by the HDR UK BREATHE Hub (you can view the raw file directly in the repository).
    title: Bronchiestasis
    name: Bronchiestasis
    phenotype_id: ZckoXfUWNXn8Jn7fdLQuxj
    type: Disease or Syndrome
    group: Respiratory
    data_sources:
        - Clinical Practice Research Datalink GOLD
        - Clinical Practice Research Datalink Aurum
        - Hospital Episode Statistics APC for CPRD GOLD
        - Hospital Episode Statistics APC for CPRD Aurum
        - Death Registration data for CPRD GOLD
        - Death Registration data for CPRD Aurum
        - UK Biobank
    clinical_terminologies:
        - Read Version 2
        - SNOMED-CT
        - ICD-10
        - ICD-11
    validation:
        - prognostic
    codelists:
        - axson_bronchiestasis_ZckoXfUWNXn8Jn7fdLQuxj_ICD10.csv
        - axson_bronchiestasis_ZckoXfUWNXn8Jn7fdLQuxj_ICD11.csv
        - axson_bronchiestasis_ZckoXfUWNXn8Jn7fdLQuxj_SNOMEDCT.csv
        - axson_bronchiestasis_ZckoXfUWNXn8Jn7fdLQuxj_UKBIOBANK.csv
    valid_event_data_range: 01/01/2001 - 31/12/2019
    sex:
        - Female
        - Male
    author:
        - Eleanor L Axson
        - Jennifer K Quint
    publications:
    status: BETA
    date: 2019-06-20
    modified_date: 2019-06-20
    version: 1
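Because a definition file mixes a YAML header with a markdown body, a reader of the file first has to separate the two. The sketch below shows one way this could be done. It assumes the header is delimited by lines containing only `---`, a common front-matter convention that this page does not confirm, and `split_front_matter` is a hypothetical helper name rather than part of the library's tooling.

```python
from typing import Tuple


def split_front_matter(text: str) -> Tuple[str, str]:
    """Split a definition file into (yaml_header, markdown_body).

    Assumption: the header sits between two lines containing only '---'.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", text  # no header found, treat everything as body
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            header = "\n".join(lines[1:i])
            body = "\n".join(lines[i + 1:])
            return header, body
    raise ValueError("YAML header is opened but never closed")


# Shortened, illustrative file content (not the real definition file).
sample = """---
title: Bronchiestasis
status: BETA
---
Free-text description of the phenotype.
"""

header, body = split_front_matter(sample)
print(header)
print(body)
```

The returned header string can then be passed to any YAML parser to obtain the metadata fields.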
The metadata fields required are the following:
- title (string): Phenotype (long) name
- name (string): Phenotype (short) name
- data_sources (list of strings): Names of the data sources from which the phenotype derives information. Where possible, these should be identical to the names used to identify individual datasets in the HDR Gateway.
- clinical_terminologies (list of strings): List of controlled clinical terminologies that are used by the phenotype algorithm.
- validation (list of strings): the types of validation evidence supporting the robustness of the phenotype. Valid values:
  - prognostic: the ability to replicate known prognostic associations
  - aetiologic: the ability to replicate known associations with risk factors
  - genetic: the ability to replicate associations with known regions or variants
  - cross-source: the algorithm has been evaluated in a similar external data source
  - casenote review: the algorithm has been validated through manual review of clinical notes (this usually yields PPV and NPV values)
  - cross-country: the algorithm has been evaluated in a similar external healthcare system
- codelists (list of strings): (unordered) list of CSV terminology files associated with the phenotype
- phenotype_id (string): Unique universal phenotype identifier, generated using the
- group (string): Disease group for phenotype
- valid_event_data_range (list of strings): DD/MM/YYYY date range for events
- sex (list of strings): list of sexes valid for the phenotype
- author (list of strings): list of phenotype authors
- publications (list of strings): list of publications
- status (string): ‘DRAFT’ or ‘FINAL’ status
- date (string): date created
- modified_date (string): date last modified
- version (integer): integer version of phenotype, default ‘1’
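The field list above can be turned into a simple pre-submission check. The following is a minimal sketch, not the library's own validation: it assumes the YAML header has already been parsed into a Python dict, and that `valid_event_data_range` is a single `DD/MM/YYYY - DD/MM/YYYY` string as in the bronchiectasis example. The names `check_metadata` and `parse_event_range` are hypothetical.

```python
from datetime import date, datetime
from typing import List, Tuple

# Field names and validation values copied from the list above.
REQUIRED_FIELDS = (
    "title", "name", "data_sources", "clinical_terminologies", "validation",
    "codelists", "phenotype_id", "group", "valid_event_data_range", "sex",
    "author", "publications", "status", "date", "modified_date", "version",
)
VALIDATION_VALUES = {
    "prognostic", "aetiologic", "genetic",
    "cross-source", "casenote review", "cross-country",
}


def parse_event_range(value: str) -> Tuple[date, date]:
    """Parse 'DD/MM/YYYY - DD/MM/YYYY' (assumed format) into two dates."""
    start_text, end_text = (part.strip() for part in value.split(" - "))
    start = datetime.strptime(start_text, "%d/%m/%Y").date()
    end = datetime.strptime(end_text, "%d/%m/%Y").date()
    return start, end


def check_metadata(meta: dict) -> List[str]:
    """Return a list of human-readable problems; empty means none found."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS
                if name not in meta]
    for value in meta.get("validation") or []:
        if value not in VALIDATION_VALUES:
            problems.append(f"unknown validation value: {value}")
    if "valid_event_data_range" in meta:
        try:
            start, end = parse_event_range(meta["valid_event_data_range"])
            if start > end:
                problems.append("valid_event_data_range starts after it ends")
        except (ValueError, AttributeError):
            problems.append(
                "valid_event_data_range is not 'DD/MM/YYYY - DD/MM/YYYY'")
    return problems


# Metadata from the bronchiectasis example, shortened for illustration.
example = {
    "title": "Bronchiestasis", "name": "Bronchiestasis",
    "phenotype_id": "ZckoXfUWNXn8Jn7fdLQuxj", "group": "Respiratory",
    "data_sources": ["UK Biobank"], "clinical_terminologies": ["ICD-10"],
    "validation": ["prognostic"],
    "codelists": ["axson_bronchiestasis_ZckoXfUWNXn8Jn7fdLQuxj_ICD10.csv"],
    "valid_event_data_range": "01/01/2001 - 31/12/2019",
    "sex": ["Female", "Male"],
    "author": ["Eleanor L Axson", "Jennifer K Quint"],
    "publications": None, "status": "BETA", "date": "2019-06-20",
    "modified_date": "2019-06-20", "version": 1,
}
print(check_metadata(example))
```

The check only tests that each field is present and that the two constrained fields are well formed. It deliberately does not enforce the `status` values, since the example file uses `BETA`.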
Terminology files (codelists)
Codelist files are specified as CSV files with one term per row - for example:
    ICD-10 code,ICD-10 term
    J47,Bronchiectasis
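A codelist in this shape can be read with Python's standard `csv` module. The sketch below is illustrative only: it loads the two-line example from an in-memory string instead of a real file, and `read_codelist` is a hypothetical helper name. The column names are taken from the example header and will differ between terminologies.

```python
import csv
import io

# The example codelist from above, held in memory for illustration.
sample = "ICD-10 code,ICD-10 term\nJ47,Bronchiectasis\n"


def read_codelist(handle):
    """Return one dict per term, keyed by the CSV header names."""
    return list(csv.DictReader(handle))


terms = read_codelist(io.StringIO(sample))
for row in terms:
    print(row["ICD-10 code"], row["ICD-10 term"])
```

To read a file on disk, pass a handle opened with `open(path, newline="")` in place of the `io.StringIO` object.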
How to submit data
You can download a sample template file from the repository:
If you have a phenotyping algorithm that meets the inclusion criteria, we invite you to submit your data in one of the following ways: