## Unifying disease vocabularies

This discussion will explore how to unify the variety of disease vocabularies used by our resources. These resources are listed below:

| type | resource | vocabulary |
| --- | --- | --- |
| indications | MEDI [1] | ICD-9 |
| indications | LabeledIn [2, 3] | UMLS [4] |
| transcriptional signatures | STAR-GEO | custom |
| symptoms | HSDN [5] | MeSH [6] (2011 release) |
| gene associations | het.io [7] | Disease Ontology [8, 9] |
| pathophysiology | het.io [7] | Disease Ontology [8, 9] |
| tissue localization | het.io [7] | Disease Ontology [8, 9] |

Our current plan is to use the Disease Ontology (DO) as our primary vocabulary. Therefore, we will have to map resources to DO terms.

Daniel Himmelstein Researcher

# Disease Ontology Resources

Since we plan to use the DO as our primary disease vocabulary, I thought I would keep track of related papers and projects here. These may be slightly off-topic but valuable to keep track of.

| name | description | cite |
| --- | --- | --- |
| Disease Ontology | Main resource | [1, 2] |
| DOLite | DO terms are grouped using associated-gene similarity to produce a simplified vocabulary with little redundancy. | [3] |
| DO_cancer_slim | Created a DO subset named TOPNodes_DOcancerslim composed of 63 non-redundant upper-level cancer terms | [4] |
| DOAF | Provides gene annotations (extracted from GeneRIF) to the Disease Ontology. | [5] |
| FunDO | Provides gene annotations (extracted from GeneRIF) to the Disease Ontology. Uses diseases from DOLite. Potentially outdated. | [6] |
| DOSE | DOSE is an R package to compute semantic similarity between DO terms. The result is pairwise similarities between DO terms based only on the ontology structure. | [7] |
| DOsim | Similar to DOSE but allegedly unmaintained or outdated | [8] |

Daniel Himmelstein Researcher

# Seeking a Slim DO with distinct terms.

The Disease Ontology is a hierarchy of human diseases. However, our current method has been designed for distinct nodes (especially with respect to diseases and compounds, since we will be predicting indications). By distinct we mean non-redundant — in a non-redundant set of terms, no terms should be an ancestor or descendant of any other term.
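The non-redundancy requirement can be checked mechanically. The following is a minimal sketch, assuming the ontology is available as a child-to-parents mapping; the function names and the toy edges are illustrative and not taken from our repository.

```python
def ancestors(term, parents):
    """Return every ancestor of `term`, given a child -> set-of-parents map."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def redundant_pairs(terms, parents):
    """Return sorted (descendant, ancestor) pairs where both are in `terms`."""
    terms = set(terms)
    return sorted(
        (term, anc)
        for term in terms
        for anc in ancestors(term, parents) & terms
    )


# Toy is_a edges; disease names stand in for DO identifiers (illustrative only).
parents = {
    "organ system cancer": {"cancer"},
    "respiratory system cancer": {"organ system cancer"},
    "lung cancer": {"respiratory system cancer"},
    "multiple sclerosis": {"demyelinating disease"},
}

distinct = redundant_pairs({"lung cancer", "multiple sclerosis"}, parents)
overlapping = redundant_pairs({"lung cancer", "cancer"}, parents)
print(distinct)
print(overlapping)
```

A term set is distinct in our sense exactly when `redundant_pairs` returns an empty list.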

Additionally, we would like to pick terms at an appropriate level of specificity. Below, I show lineages of DO terms and bold the term whose specificity I prefer:

• cancer > organ system cancer > respiratory system cancer > lung cancer > Pancoast tumor > lung superior sulcus carcinoma
• neurodegenerative disease > demyelinating disease > multiple sclerosis > relapsing-remitting multiple sclerosis
• glucose metabolism disease > diabetes mellitus > type 2 diabetes mellitus > diabetic peripheral angiopathy

Ideally, we would choose a level of specificity such that:

• a term and its descendants form a cohesive and differentiated disease concept
• disease data is collected at a similar level of specificity
• sufficient data exists for the term, after propagating data annotated to more specific terms

These are competing aims, and we will most likely have to make difficult subjective decisions. In the past [1], we identified 108 distinct, complex diseases by investigating only diseases with GWAS. However, GWAS may be too restrictive a filter for the present study. Our main concern is to omit diseases that would be poorly connected in the network, and some diseases may be well connected despite lacking GWAS. We would also prefer to focus on diseases with indications, since indications are needed to train our model.

Two previous DO studies may be helpful here:

1. DOLite took a data-driven approach to selecting a consolidated set of DO terms [2]. I have not been able to locate the list of DO identifiers composing DOLite. These terms may also be obsolete as the project is dated. Any information regarding DOLite would be appreciated.
2. DO_cancer_slim created a DO subset named TOPNodes_DOcancerslim composed of 63 non-redundant upper-level cancer terms [3]. We would like a similar term list but encompassing all diseases, not just cancers.

Daniel Himmelstein Researcher

# Selecting the top nodes from DO Cancer Slim

I emailed Lynn Schriml, a lead investigator for the DO, asking about the creation of a consolidated and distinct set of high-level cancer terms. This work was recently published in Database [1]. I wrote:

My main question is how did you decide which terms should be included in TopNodes_DOcancerslim? I want to generate a similar top-level term set but for all diseases. Was this a very difficult task that required medical expertise? Do you have any plans to create a DO-wide slim set? (Ideally, I would prefer to rely on an existing effort than to recreate the wheel).

Response by Lynn Schriml (posted with permission):

We are not planning on creating a DO wide slim at this time.

Creating the DO cancer slim and the TOP nodes slim was a couple of months work by a team, including MDs, cancer and disease experts.

We started with a set of terms from multiple cancer sources that we wanted align. We worked to identify how each term mapped to the current nomenclature of disease (disease names change over time), we then worked to define each term, figure out if the term was represented in DO, and where to place the term in DO, then we added the terms, creating DO definitions. We also reviewed and edited related terms. Once we had the larger set of terms defined, we looked at their parent nodes up to the top node for cancer.

We wanted to identify body system level parents that reflected both the most specific we could be (not mapping the new TOP node parent all the way up to cancer), and for the TOP node to be biologically informative. There was also much discussion in our work group to finalize these choices. When that was done, we then edited the DO file, adding the new subtypes (slims) and adding each term in (one at a time).

Daniel Himmelstein Researcher

# Reconstructing DO Lite

DO Lite is an outdated project which provided a consolidated disease terminology [1]. The only data release we could find is a mapping of disease names to implicated genes. Unfortunately, this dataset omits the actual DO identifiers.

We extracted these disease names and found 561 distinct terms. We used a text-matching approach to map these disease names to current DO terms. The approach consisted of the following steps:

1. creating a mapping of names (including synonyms) to DO identifiers [tsv]
2. mapping DOLite names to current DO names (via exact lowercase match) [tsv]
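The two steps above can be sketched as follows. This is a minimal illustration rather than the code we ran: the function names, the input layout, the sample DOLite names, and the "myeloma" synonym are assumptions made for the example.

```python
def build_name_index(do_terms):
    """Step 1: map each lowercased name or synonym to the DO ids that use it."""
    index = {}
    for doid, name, synonyms in do_terms:
        for label in [name, *synonyms]:
            index.setdefault(label.lower(), set()).add(doid)
    return index


def match_names(query_names, index):
    """Step 2: keep only names with an exact lowercase match in the index."""
    matches = {}
    for query in query_names:
        hits = index.get(query.lower())
        if hits:
            matches[query] = sorted(hits)
    return matches


# (identifier, name, synonyms); the synonym below is made up for illustration.
do_terms = [
    ("DOID:1909", "melanoma", []),
    ("DOID:9538", "multiple myeloma", ["myeloma"]),
    ("DOID:289", "endometriosis", []),
]
# Hypothetical DOLite-style names, not taken from the actual release.
dolite_names = ["Melanoma", "Myeloma", "Unmatched example disease"]

index = build_name_index(do_terms)
matches = match_names(dolite_names, index)
matched_fraction = len(matches) / len(dolite_names)
print(matches)
print(f"{matched_fraction:.1%} of names matched")
```

The index stores a set of identifiers per name so that a name shared by several DO terms stays visible as an ambiguous match instead of being silently overwritten.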

A majority (66.3% = 372 / 561) of DOLite terms were matched to a current DO identifier. We may consider using these terms as a reference when manually constructing a DO Slim.

Daniel Himmelstein Researcher

# Creating a slim DO

We created a slim DO with 137 terms where:

1. no terms were descendants/ancestors of other terms
2. terms were specific enough to be clinically relevant
3. terms were general enough to be well annotated

To create this slim term set, we combined the diseases from:

1. hetio [1] — 108 complex diseases contained in the GWAS Catalog.
2. TOPNodes_DOcancerslim [2] — a body system focused set of 63 cancer terms.

We found that both sources contained overlapping nodes, and we removed 34 nodes to create a non-overlapping term set. We chose the following rules to resolve overlapping nodes:

1. For cancers in TOPNodes_DOcancerslim, retain only the most specific cancer
2. Remove hetio terms that descend from TOPNodes_DOcancerslim terms
3. For the remaining overlapping hetio terms, choose the term with greater clinical interest or GWAS annotations. For 4 out of the 5 conflicts under this rule, we chose to retain the more general term.
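Rules 1 and 2 are mechanical, while rule 3 needs human judgment. The sketch below shows one way the first two rules could be automated, with rule 3 conflicts flagged for manual review. It is not the procedure we followed (the term list was curated in a spreadsheet). The function names and the toy hierarchy are hypothetical, and rule 2 is applied here against the full cancer slim set, which is an assumption.

```python
def ancestors(term, parents):
    """Return every ancestor of `term`, given a child -> set-of-parents map."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def resolve_overlaps(cancer_slim, hetio, parents):
    """Apply rules 1 and 2; return the kept terms and the pairs left for rule 3."""
    cancer_slim, hetio = set(cancer_slim), set(hetio)
    # Rule 1: drop cancer slim terms that are ancestors of another cancer slim term.
    rule1 = {
        anc
        for term in cancer_slim
        for anc in ancestors(term, parents) & cancer_slim
    }
    # Rule 2 (assumption: tested against the full cancer slim set):
    # drop hetio terms that descend from any cancer slim term.
    rule2 = {term for term in hetio if ancestors(term, parents) & cancer_slim}
    kept = (cancer_slim - rule1) | (hetio - rule2)
    # Rule 3 is a judgment call, so only report the remaining conflicts.
    rule3_pairs = sorted(
        (term, anc) for term in kept for anc in ancestors(term, parents) & kept
    )
    return kept, rule1, rule2, rule3_pairs


# Toy hierarchy: edges and set memberships are illustrative, not copied from the DO.
parents = {
    "urinary system cancer": {"cancer"},
    "kidney cancer": {"urinary system cancer"},
    "renal cell carcinoma": {"kidney cancer"},
    "bipolar disorder": {"mood disorder"},
}
cancer_slim = {"urinary system cancer", "kidney cancer"}
hetio = {"renal cell carcinoma", "mood disorder", "bipolar disorder"}

kept, rule1, rule2, rule3_pairs = resolve_overlaps(cancer_slim, hetio, parents)
print(sorted(kept))
print(rule1, rule2, rule3_pairs)
```

Each pair in `rule3_pairs` is a (descendant, ancestor) conflict where one of the two terms still has to be removed by hand.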

### Other notes:

The repository with our DO analysis is here and contains notebooks for extracting xrefs and evaluating our slim DO [3].

We plan to propagate annotations from more specific terms to our slim terms. To facilitate this process, we created a propagated DO slim xref mapping file.
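As a rough illustration of what propagation means here, the sketch below assigns every annotation on a term to the slim term that is the term itself or one of its ancestors. The function names, the toy edges, and the `EXAMPLE:` identifiers are placeholders and do not reflect the layout of the actual mapping file.

```python
def ancestors(term, parents):
    """Return every ancestor of `term`, given a child -> set-of-parents map."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def propagate_to_slim(slim_terms, parents, annotations):
    """Return slim term -> union of annotations on it and on its descendants."""
    slim_terms = set(slim_terms)
    propagated = {}
    for term, items in annotations.items():
        # A term contributes to each slim term that is itself or an ancestor.
        targets = ({term} | ancestors(term, parents)) & slim_terms
        for slim in targets:
            propagated.setdefault(slim, set()).update(items)
    return propagated


# Toy is_a edges; names stand in for DO identifiers (illustrative only).
parents = {
    "relapsing-remitting multiple sclerosis": {"multiple sclerosis"},
    "multiple sclerosis": {"demyelinating disease"},
    "diabetic peripheral angiopathy": {"type 2 diabetes mellitus"},
    "type 2 diabetes mellitus": {"diabetes mellitus"},
}
slim_terms = {"multiple sclerosis", "type 2 diabetes mellitus"}
# Placeholder cross-references, not real identifiers.
xrefs = {
    "relapsing-remitting multiple sclerosis": {"EXAMPLE:0001"},
    "multiple sclerosis": {"EXAMPLE:0002"},
    "diabetic peripheral angiopathy": {"EXAMPLE:0003"},
    "demyelinating disease": {"EXAMPLE:0004"},
}

propagated = propagate_to_slim(slim_terms, parents, xrefs)
print(propagated)
```

Annotations on terms more general than every slim term, such as "demyelinating disease" in this toy example, are dropped and not pushed down to children.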

Pleural cancer (DOID:9917) was a TOPNodes_DOcancerslim term but was not found in the ontology version we downloaded (subversion revision 2810).

Daniel Himmelstein Researcher

# Redundant terms removed from the slim DO

My post above on creating the slim DO didn't specify which diseases were removed to "resolve overlapping nodes". The table below shows which diseases we removed and why (rules above). The exclusion counts by rule are: 7 for rule 1, 22 for rule 2, and 5 for rule 3.

| ID | Name | Source | Removed by |
| --- | --- | --- | --- |
| DOID:201 | Connective tissue cancer | DOcancerslim | rule 1 |
| DOID:10155 | Intestinal cancer | DOcancerslim | rule 1 |
| DOID:5672 | Large intestine cancer | DOcancerslim | rule 1 |
| DOID:3119 | Gastrointestinal system cancer | DOcancerslim | rule 1 |
| DOID:8618 | Oral cavity cancer | DOcancerslim | rule 1 |
| DOID:170 | Endocrine gland cancer | DOcancerslim | rule 1 |
| DOID:3996 | Urinary system cancer | DOcancerslim | rule 1 |
| DOID:3459 | breast carcinoma | hetio | rule 2 |
| DOID:10286 | prostate carcinoma | hetio | rule 2 |
| DOID:1040 | chronic lymphocytic leukemia | hetio | rule 2 |
| DOID:3905 | lung carcinoma | hetio | rule 2 |
| DOID:1909 | melanoma | hetio | rule 2 |
| DOID:4001 | ovarian carcinoma | hetio | rule 2 |
| DOID:1107 | esophageal carcinoma | hetio | rule 2 |
| DOID:289 | endometriosis | hetio | rule 2 |
| DOID:4450 | renal cell carcinoma | hetio | rule 2 |
| DOID:769 | neuroblastoma | hetio | rule 2 |
| DOID:8567 | Hodgkin's lymphoma | hetio | rule 2 |
| DOID:3963 | thyroid carcinoma | hetio | rule 2 |
| DOID:9538 | multiple myeloma | hetio | rule 2 |
| DOID:9952 | acute lymphocytic leukemia | hetio | rule 2 |
| DOID:5517 | stomach carcinoma | hetio | rule 2 |
| DOID:684 | hepatocellular carcinoma | hetio | rule 2 |
| DOID:1380 | endometrial cancer | hetio | rule 2 |
| DOID:4905 | pancreatic carcinoma | hetio | rule 2 |
| DOID:4960 | bone marrow cancer | hetio | rule 2 |
| DOID:706 | mature B-cell neoplasm | hetio | rule 2 |
| DOID:8552 | chronic myeloid leukemia | hetio | rule 2 |
| DOID:5844 | myocardial infarction | hetio | rule 3 |
| DOID:6713 | cerebrovascular disease | hetio | rule 3 |
| DOID:11829 | degenerative myopia | hetio | rule 3 |
| DOID:13641 | exfoliation syndrome | hetio | rule 3 |
| DOID:3324 | mood disorder | hetio | rule 3 |

See the remaining 137 diseases here.

Hi Daniel,

In cell 2 of this notebook I notice that you're reading the data/slim-terms.tsv file from the dhimmel/disease-ontology repository. However, I could not determine how the slim-terms.tsv file was generated by looking at the files in the disease-ontology repository.

The only mention of a slim-terms.tsv file in the DO repository is in cell 42 of this notebook, but here the file seems to have been created already.

Do you know how this file was generated? I am trying to replicate the Rephetio workflow with up-to-date data.

Thanks!

Daniel Himmelstein Researcher

@tongli, slim-terms.tsv was created manually according to the post above. In short, I created the table in a spreadsheet by combining the two disease sets (hetio-dag [1] and TOPNodes_DOcancerslim [2]) and removing any problematic terms based on ad hoc rules. The pathophysiology classification was done manually by @pouyakhankhanian and me (I believe primarily for hetio-dag [1]) and was not used in Project Rephetio.

> I am trying to replicate the Rephetio workflow with up-to-date data.

The problem with the slim approach we took is that it's difficult to automate and potentially requires some clinical expertise. In other words, it's difficult to update. Therefore, I'm thinking of alternatives for the future. One option would be to include the entire ontology. You could then pick your slim term set at analysis time rather than network construction time. This would make the network more versatile. Since Cypher is capable of variable length relationship queries, you could do on-the-fly propagation. However, this would require some additional implementation and theoretical work.
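To make the on-the-fly idea a bit more concrete, here is a rough sketch of a Cypher query that uses a variable-length pattern (`*0..`) to collect annotations from a slim term and all of its descendants at query time. The node labels, relationship types, and the `identifier` property are hypothetical placeholders and do not describe an existing database schema. The Python helper only builds the query text; running it would additionally require a Neo4j driver and a loaded ontology.

```python
def propagation_query(disease_label="Disease", is_a_type="IS_A",
                      assoc_type="ASSOCIATES", target_label="Gene"):
    """Build a Cypher query that propagates annotations up to chosen slim terms.

    All labels and relationship types are hypothetical schema names. Cypher
    cannot parameterize them, so they are validated and interpolated here.
    """
    for name in (disease_label, is_a_type, assoc_type, target_label):
        if not name.isidentifier():
            raise ValueError(f"unsafe Cypher name: {name!r}")
    return (
        f"MATCH (slim:{disease_label})\n"
        "WHERE slim.identifier IN $slim_ids\n"
        # *0.. matches the slim term itself plus descendants at any depth.
        f"MATCH (slim)<-[:{is_a_type}*0..]-(specific:{disease_label})"
        f"-[:{assoc_type}]-(target:{target_label})\n"
        "RETURN slim.identifier AS slim_id,\n"
        "       collect(DISTINCT target.identifier) AS targets"
    )


query = propagation_query()
print(query)
```

The slim term set is passed as the `$slim_ids` parameter, so it can be changed per analysis without rebuilding the network.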

Along these lines, I started playing around with creating Hetontology, a public Neo4j database of open biomedical ontologies. Depending on how much ground you're looking to break, this may be of interest.

Cite this as
Daniel Himmelstein, Tong Shu Li (2015) Unifying disease vocabularies. Thinklab. doi:10.15363/thinklab.d44