Technical documentation

Phenotype Library Inclusion Criteria

For an algorithm to be included in the Phenotype Library, it must satisfy the following criteria:

  • Define a disease (e.g. hypertension), lifestyle risk factor (e.g. smoking) or biomarker (e.g. blood pressure)
  • Derive information from one or more electronic health record (EHR) data sources. These can include national and local sources. The definition of EHR includes administrative data, such as billing/claims data, and clinical audits.
  • Have one or more peer-reviewed outputs associated with it, e.g. a journal publication, a scientific conference paper or a policy white paper.
  • Provide evidence of how the phenotyping algorithm was validated.

Specification

Phenotyping algorithms are stored in the Phenotype Library using a combination of YAML, CSV and markdown files. Each algorithm has two main components: a) the phenotype definition file (which is in YAML and markdown) and b) one or more terminology files (also known as codelists), which are stored as CSV files. The sections below describe their schema and contents.

[Figure: an Electronic Health Records phenotyping algorithm consists of a phenotype definition file (metadata and content) and one or more codelist files.]

File naming

All phenotype definition files associated with a phenotype use a common naming pattern:

AUTHORSURNAME_NAME_UUID.md 

for example: axson_bronchiestasis_ZckoXfUWNXn8Jn7fdLQuxj.md

Phenotype files are stored in the _phenotypes directory.

Codelist files follow a similar pattern:

AUTHORSURNAME_NAME_UUID_TERMINOLOGY.csv

for example: axson_bronchiestasis_ZckoXfUWNXn8Jn7fdLQuxj_ICD10.csv

Codelist files are stored in the codelists directory.
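
The naming patterns above can be sketched in Python as follows. This is a minimal illustration, not part of the library: the function names are hypothetical, and the lowercasing of the author surname and phenotype name is an assumption inferred from the example file names. Parsing also assumes that the identifier and the terminology label contain no underscores.

```python
def phenotype_filename(author_surname, name, phenotype_id):
    """Build a phenotype definition file name (AUTHORSURNAME_NAME_UUID.md)."""
    # Lowercasing is an assumption based on the example file names.
    return f"{author_surname.lower()}_{name.lower()}_{phenotype_id}.md"


def codelist_filename(author_surname, name, phenotype_id, terminology):
    """Build a codelist file name, following the example codelist file names."""
    stem = f"{author_surname.lower()}_{name.lower()}_{phenotype_id}"
    return f"{stem}_{terminology}.csv"


def parse_codelist_filename(filename):
    """Split a codelist file name back into its four parts."""
    if not filename.endswith(".csv"):
        raise ValueError("codelist files are expected to end in .csv")
    stem = filename[: -len(".csv")]
    # The first underscore-separated part is the surname; the last two
    # are the identifier and the terminology. The name is what remains.
    author, rest = stem.split("_", 1)
    name, phenotype_id, terminology = rest.rsplit("_", 2)
    return {
        "author": author,
        "name": name,
        "phenotype_id": phenotype_id,
        "terminology": terminology,
    }


parts = parse_codelist_filename(
    "axson_bronchiestasis_ZckoXfUWNXn8Jn7fdLQuxj_ICD10.csv"
)
print(parts["terminology"])
```

Splitting from both ends keeps the sketch working for phenotype names that themselves contain underscores.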

Phenotype definition file

The phenotype definition file is a markdown file with a YAML header. The YAML header is used to record metadata fields capturing information about the algorithm, the data sources, controlled clinical terminologies and other information.

For example, the code snippet below displays the metadata associated with the bronchiectasis phenotyping algorithm submitted by the HDR UK BREATHE Hub (you can view the raw file directly in the repository).

title: Bronchiestasis
name: Bronchiestasis
phenotype_id: ZckoXfUWNXn8Jn7fdLQuxj 
type: Disease or Syndrome
group: Respiratory
data_sources: 
    - Clinical Practice Research Datalink GOLD
    - Clinical Practice Research Datalink Aurum
    - Hospital Episode Statistics APC for CPRD GOLD
    - Hospital Episode Statistics APC for CPRD Aurum
    - Death Registration data for CPRD GOLD
    - Death Registration data for CPRD Aurum
    - UK Biobank
clinical_terminologies: 
    - Read Version 2
    - SNOMED-CT
    - ICD-10
    - ICD-11
validation: 
    - prognostic
codelists:
    - axson_bronchiestasis_ZckoXfUWNXn8Jn7fdLQuxj_ICD10.csv
    - axson_bronchiestasis_ZckoXfUWNXn8Jn7fdLQuxj_ICD11.csv
    - axson_bronchiestasis_ZckoXfUWNXn8Jn7fdLQuxj_SNOMEDCT.csv
    - axson_bronchiestasis_ZckoXfUWNXn8Jn7fdLQuxj_UKBIOBANK.csv
valid_event_data_range: 01/01/2001 - 31/12/2019
sex: 
    - Female
    - Male
author: 
    - Eleanor L Axson
    - Jennifer K Quint
publications: 
status: BETA
date: 2019-06-20
modified_date: 2019-06-20
version: 1
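
As a rough sketch of how such a file might be read, the Python function below separates the YAML header from the markdown body. It assumes the header is delimited by lines containing only `---`, which is a common front-matter convention but is not stated in this documentation. The function name `split_front_matter` and the sample text are hypothetical.

```python
def split_front_matter(text):
    """Split a phenotype definition file into (yaml_header, markdown_body).

    Assumes the header sits between two lines that contain only '---'.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("file does not start with a '---' header delimiter")
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            header = "\n".join(lines[1:i])
            body = "\n".join(lines[i + 1:])
            return header, body
    raise ValueError("closing '---' header delimiter not found")


# Hypothetical, shortened file content for illustration only.
sample = """---
title: Bronchiestasis
phenotype_id: ZckoXfUWNXn8Jn7fdLQuxj
---
Free-text description of the phenotype.
"""

header, body = split_front_matter(sample)
print(header)
```

The returned header text could then be passed to a YAML parser, for example `yaml.safe_load` from the third-party PyYAML package, to obtain the metadata as a dictionary.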

The metadata fields required are the following:

  • title (string): Phenotype (long) name
  • name (string): Phenotype (short) name
  • data_sources (list of strings): Names of the data sources from which the phenotype derives information. These should be identical, if possible, to the names used to identify individual datasets in the HDR Gateway.
  • clinical_terminologies (list of strings): List of controlled clinical terminologies that are used by the phenotype algorithm.
  • validation (list of strings): types of evidence supporting the robustness of the phenotype. Valid values:
    • prognostic: the ability to replicate known prognostic associations
    • aetiologic: the ability to replicate known associations with risk factors
    • genetic: the ability to replicate associations with known regions or variants
    • cross-source: the algorithm has been evaluated in a similar external data source
    • casenote review: the algorithm has been validated through manual review of clinical notes (this usually yields PPV and NPV values)
    • cross-country: the algorithm has been evaluated in a similar external healthcare system
  • codelists (list of strings): (unordered) list of CSV terminology files associated with the phenotype
  • phenotype_id (string): Universally unique phenotype identifier, generated using the shortuuid Python module.
  • group (string): Disease group for phenotype
  • valid_event_data_range (string): date range for valid events, written as DD/MM/YYYY - DD/MM/YYYY
  • sex (list of strings): list of sexes valid for the phenotype
  • author (list of strings): list of phenotype authors
  • publications (list of strings): list of publications
  • status (string): ‘DRAFT’ or ‘FINAL’ status
  • date (string): date created
  • modified_date (string): date last modified
  • version (integer): integer version of phenotype, default ‘1’
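
The field list above can be turned into a simple consistency check. The sketch below assumes the YAML header has already been parsed into a Python dictionary. The function name `check_metadata` is hypothetical, and the check covers only field presence and the allowed `validation` values, not field types or date formats.

```python
# Field names taken from the list of required metadata fields above.
REQUIRED_FIELDS = [
    "title", "name", "data_sources", "clinical_terminologies", "validation",
    "codelists", "phenotype_id", "group", "valid_event_data_range", "sex",
    "author", "publications", "status", "date", "modified_date", "version",
]

# Allowed entries for the 'validation' field, as listed above.
VALIDATION_VALUES = {
    "prognostic", "aetiologic", "genetic",
    "cross-source", "casenote review", "cross-country",
}


def check_metadata(meta):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in meta:
            problems.append(f"missing field: {field}")
    # An empty YAML value parses to None, so fall back to an empty list.
    for value in meta.get("validation") or []:
        if value not in VALIDATION_VALUES:
            problems.append(f"unknown validation value: {value}")
    return problems


# Hypothetical, incomplete metadata used only to show the output shape.
example = {"title": "Bronchiestasis", "validation": ["prognostic", "clinical"]}
for problem in check_metadata(example):
    print(problem)
```

Returning a list of messages rather than raising on the first problem lets a submitter see every issue in one pass.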

Terminology files (codelists)

Codelist files are specified as CSV files with one term per row - for example:

ICD-10 code,ICD-10 term
J47,Bronchiectasis
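
A codelist in this shape can be read with Python's standard `csv` module. The sketch below assumes a header row followed by rows whose first two columns are the code and the term, as in the example above. The function name `read_codelist` is hypothetical, and real codelist files may carry additional columns.

```python
import csv
import io


def read_codelist(handle):
    """Read a codelist CSV and return (header, rows).

    Each row is returned as a (code, term) tuple; rows with fewer
    than two columns are skipped.
    """
    reader = csv.reader(handle)
    header = next(reader)
    rows = [
        (row[0].strip(), row[1].strip())
        for row in reader
        if len(row) >= 2
    ]
    return header, rows


# In-memory copy of the example codelist shown above.
sample = io.StringIO("ICD-10 code,ICD-10 term\nJ47,Bronchiectasis\n")
header, rows = read_codelist(sample)
print(header, rows)
```

For a file on disk, the same function can be called with `open(path, newline="")`, which is the mode the `csv` module documentation recommends.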

How to submit data

You can download a sample template file from the repository:

If you have a phenotyping algorithm that meets the inclusion criteria, we invite you to submit your data in one of the following ways: