Unifying disease vocabularies

Daniel Himmelstein, Tong Shu Li

doi:10.15363/thinklab.d44

Project:

Rephetio: Repurposing drugs on a hetnet [rephetio]

Daniel Himmelstein Researcher March 31, 2015

This discussion will explore how to unify the variety of disease vocabularies used by our resources. These resources are listed below:

type	resource	vocabulary
indications	MEDI [1]	ICD-9
indications	LabeledIn [2, 3]	UMLS [4]
transcriptional signatures	STAR-GEO	custom
symptoms	HSDN [5]	MeSH [6] (2011 release)
gene associations	het.io [7]	Disease Ontology [8, 9]
pathophysiology	het.io [7]	Disease Ontology [8, 9]
tissue localization	het.io [7]	Disease Ontology [8, 9]

Our current plan is to use the Disease Ontology (DO) as our primary vocabulary. Therefore, we will have to map resources to DO terms.

Daniel Himmelstein Researcher April 2, 2015

Disease Ontology Resources

Since we plan to use the DO as our primary disease vocabulary, I will keep track of related papers and projects here. Some may be slightly off-topic but are still worth recording.

name	description	cite
Disease Ontology	Main resource	[1, 2]
DOLite	DO terms are grouped using associated-gene similarity to produce a simplified vocabulary with little redundancy.	[3]
DO_cancer_slim	Created a DO subset named `TOPNodes_DOcancerslim` composed of 63 non-redundant upper-level cancer terms	[4]
DOAF	Provides gene annotations (extracted from GeneRIF) to the Disease Ontology.	[5]
FunDO	Provides gene annotations (extracted from GeneRIF) to the Disease Ontology. Uses diseases from DOLite. Potentially outdated.	[6]
DOSE	DOSE is an R package to compute semantic similarity between DO terms. The result is pairwise similarities between DO terms based only on the ontology structure.	[7]
DOsim	Similar to DOSE but allegedly unmaintained or outdated	[8]

Daniel Himmelstein Researcher April 9, 2015

Seeking a Slim DO with distinct terms.

The Disease Ontology is a hierarchy of human diseases. However, our current method has been designed for distinct nodes (especially with respect to diseases and compounds, since we will be predicting indications). By distinct we mean non-redundant: in a non-redundant set of terms, no term is an ancestor or descendant of any other term.
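
One way to make the non-redundancy requirement concrete is to check a candidate term set against the ontology's parent links. The sketch below is illustrative only: the `PARENTS` map is a hand-written fragment using term names from this discussion rather than a parsed copy of the DO, and the function names are hypothetical.

```python
# Hand-written child -> parents fragment (an assumption for illustration,
# not a parsed copy of the Disease Ontology).
PARENTS = {
    "lung cancer": {"respiratory system cancer"},
    "respiratory system cancer": {"organ system cancer"},
    "organ system cancer": {"cancer"},
    "multiple sclerosis": {"demyelinating disease"},
}


def ancestors(term, parents):
    """Return every term reachable from `term` by following parent links."""
    seen, stack = set(), [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def redundant_pairs(terms, parents):
    """Return (ancestor, descendant) pairs where both terms are in `terms`."""
    terms = set(terms)
    pairs = []
    for term in sorted(terms):
        for ancestor in sorted(ancestors(term, parents) & terms):
            pairs.append((ancestor, term))
    return pairs


# A distinct set yields no pairs; adding an ancestor introduces redundancy.
distinct = redundant_pairs(["lung cancer", "multiple sclerosis"], PARENTS)
overlapping = redundant_pairs(["cancer", "lung cancer"], PARENTS)
```

A term set qualifies as distinct in the sense above when `redundant_pairs` returns an empty list.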

Additionally, we would like to pick terms at an appropriate level of specificity. Below, I show lineages of DO terms and bold the term whose specificity I prefer:

cancer > organ system cancer > respiratory system cancer > lung cancer > Pancoast tumor > lung superior sulcus carcinoma
neurodegenerative disease > demyelinating disease > multiple sclerosis > relapsing-remitting multiple sclerosis
glucose metabolism disease > diabetes mellitus > type 2 diabetes mellitus > diabetic peripheral angiopathy

Ideally, we would choose a level of specificity such that:

a term and its descendants form a cohesive and differentiated disease concept
disease data is collected at a similar level of specificity
sufficient data exists for the term, after propagating data annotated to more specific terms

These are competing aims, and we will most likely have to make difficult subjective decisions. In the past [1], we identified 108 distinct, complex diseases by investigating only diseases with GWAS. However, GWAS may be too restrictive a filter for the present study. We are more concerned with omitting diseases that would be poorly connected in the network, and some diseases may be well connected despite lacking GWAS. We would also prefer to focus on diseases with indications, as indications are needed to train our model.

Two previous DO studies may be helpful here:

DOLite took a data-driven approach to selecting a consolidated set of DO terms [2]. I have not been able to locate the list of DO identifiers composing DOLite. These terms may also be obsolete as the project is dated. Any information regarding DOLite would be appreciated.
DO_cancer_slim created a DO subset named TOPNodes_DOcancerslim composed of 63 non-redundant upper-level cancer terms [3]. We would like a similar term list but encompassing all diseases, not just cancers.

Daniel Himmelstein Researcher April 10, 2015

Selecting the top nodes from DO Cancer Slim

I emailed Lynn Schriml, a lead investigator for the DO, asking about the creation of a consolidated and distinct set of high-level cancer terms. This work was recently published in Database [1]. I wrote:

My main question is how did you decide which terms should be included in TopNodes_DOcancerslim? I want to generate a similar top-level term set but for all diseases. Was this a very difficult task that required medical expertise? Do you have any plans to create a DO-wide slim set? (Ideally, I would prefer to rely on an existing effort than to recreate the wheel).

Response by Lynn Schriml (posted with permission):

We are not planning on creating a DO wide slim at this time.

Creating the DO cancer slim and the TOP nodes slim was a couple of months work by a team, including MDs, cancer and disease experts.

We started with a set of terms from multiple cancer sources that we wanted to align. We worked to identify how each term mapped to the current nomenclature of disease (disease names change over time), we then worked to define each term, figure out if the term was represented in DO, and where to place the term in DO, then we added the terms, creating DO definitions. We also reviewed and edited related terms. Once we had the larger set of terms defined, we looked at their parent nodes up to the top node for cancer.

We wanted to identify body system level parents that reflected both the most specific we could be (not mapping the new TOP node parent all the way up to cancer), and for the TOP node to be biologically informative. There was also much discussion in our work group to finalize these choices. When that was done, we then edited the DO file, adding the new subtypes (slims) and adding each term in (one at a time).

Daniel Himmelstein Researcher April 10, 2015

Reconstructing DOLite

DOLite is an outdated project that provided a consolidated disease terminology [1]. The only data release we could find is a mapping of disease names to implicated genes. Unfortunately, this dataset omits the actual DO identifiers.

We extracted these disease names and found 561 distinct terms. We used a text-matching approach to match these disease names to current DO terms. The approach consisted of the following steps:

creating a mapping of names (including synonyms) to DO identifiers [tsv]
mapping DOLite names to current DO names (via exact lowercase match) [tsv]

A majority (66.3% = 372 / 561) of DOLite terms were matched to a current DO identifier. We may consider using these terms as a reference when manually constructing a DO Slim.
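
The matching steps described above can be sketched roughly as follows. This is not the code we ran: the function names, the toy `DO_TERMS` dictionary, the synonym it contains, and the `DOLITE_NAMES` list are all assumptions made for illustration.

```python
def build_name_index(do_terms):
    """Map each lowercase name or synonym to the DO identifiers that use it."""
    index = {}
    for doid, names in do_terms.items():
        for name in names:
            index.setdefault(name.lower(), set()).add(doid)
    return index


def match_names(query_names, index):
    """Return {query name: sorted DO identifiers} for exact lowercase matches."""
    matches = {}
    for name in query_names:
        doids = index.get(name.lower())
        if doids:
            matches[name] = sorted(doids)
    return matches


# Toy data: each DO identifier lists its name first, then any synonyms.
# The synonym shown here is assumed for illustration.
DO_TERMS = {
    "DOID:1909": ["melanoma"],
    "DOID:8567": ["Hodgkin's lymphoma", "Hodgkin disease"],
    "DOID:289": ["endometriosis"],
}
# Hypothetical DOLite-style disease names.
DOLITE_NAMES = ["Melanoma", "Hodgkin disease", "Unmatched example"]

index = build_name_index(DO_TERMS)
matches = match_names(DOLITE_NAMES, index)
toy_percent = round(100 * len(matches) / len(DOLITE_NAMES), 1)

# The percentage reported above follows from the same arithmetic.
reported_percent = round(100 * 372 / 561, 1)
```

Because the index maps a name to a set of identifiers, a name shared by several DO terms surfaces as an ambiguous match rather than being silently resolved.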

Daniel Himmelstein Researcher April 17, 2015

Creating a slim DO

We created a slim DO with 137 terms where:

no terms were descendants/ancestors of other terms
terms were specific enough to be clinically relevant
terms were general enough to be well annotated

To create this slim term set, we combined the diseases from:

hetio [1] — 108 complex diseases contained in the GWAS Catalog.
TOPNodes_DOcancerslim [2] — a body system focused set of 63 cancer terms.

We found that the combined sources contained overlapping nodes (terms that were ancestors or descendants of other terms), and we removed 34 nodes to create a non-overlapping term set. We chose the following rules to resolve overlapping nodes:

For cancers in TOPNodes_DOcancerslim, retain only the most specific cancer
Remove hetio terms that descend from TOPNodes_DOcancerslim terms
For the remaining overlapping hetio terms, choose the term with greater clinical interest or GWAS annotations. For 4 out of the 5 conflicts under this rule, we chose to retain the more general term.
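
The first two rules are mechanical and could be sketched as below; the third rule relied on manual judgment and is not modeled. The actual slim set was assembled by hand, so this is only an illustration: the function names and the toy hierarchy are hypothetical, and applying rule 2 against the full cancer slim set (including terms that rule 1 removes) is an assumption.

```python
def ancestors(term, parents):
    """Return every term reachable from `term` by following parent links."""
    seen, stack = set(), [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def resolve_overlaps(cancer_slim, hetio, parents):
    """Apply rules 1 and 2; return (kept, removed_by_rule1, removed_by_rule2)."""
    cancer_slim, hetio = set(cancer_slim), set(hetio)
    # Rule 1: drop a cancer slim term when it is an ancestor of another
    # cancer slim term, so that only the most specific cancers remain.
    rule1 = {
        term for term in cancer_slim
        if any(term in ancestors(other, parents) for other in cancer_slim)
    }
    # Rule 2: drop hetio terms that descend from any cancer slim term
    # (assumption: the full cancer slim set is used, not the rule 1 survivors).
    rule2 = {
        term for term in hetio - cancer_slim
        if ancestors(term, parents) & cancer_slim
    }
    kept = (cancer_slim | hetio) - rule1 - rule2
    return kept, rule1, rule2


# Toy hierarchy with placeholder names (not real DO terms).
PARENTS = {
    "organ cancer": {"system cancer"},
    "organ carcinoma": {"organ cancer"},
}
kept, rule1, rule2 = resolve_overlaps(
    cancer_slim={"system cancer", "organ cancer"},
    hetio={"organ carcinoma", "unrelated disease"},
    parents=PARENTS,
)
```

The term counts reported in this post are consistent with this kind of set arithmetic: 108 + 63 - 34 = 137.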

Other notes:

The repository with our DO analysis is here and contains notebooks for extracting xrefs and evaluating our slim DO [3].

We plan to propagate annotations from more specific terms to our slim terms. To facilitate this process, we created a propagated DO slim xref mapping file.
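
As a rough illustration of what such a propagated mapping contains, the sketch below assigns each slim term the cross-references of the term itself and of all its descendants. The function names, the `CHILDREN` fragment, and the `EXAMPLE:` identifiers are placeholders, not the contents of the real file.

```python
def descendants(term, children):
    """Return every term reachable from `term` by following child links."""
    seen, stack = set(), [term]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def propagate_xrefs(slim_terms, children, xrefs):
    """Return (slim term, source term, xref) rows for each slim term's subtree."""
    rows = []
    for slim in sorted(slim_terms):
        for term in sorted({slim} | descendants(slim, children)):
            for xref in sorted(xrefs.get(term, ())):
                rows.append((slim, term, xref))
    return rows


# Parent -> children fragment and placeholder xrefs (assumed for illustration).
CHILDREN = {"multiple sclerosis": {"relapsing-remitting multiple sclerosis"}}
XREFS = {
    "multiple sclerosis": {"EXAMPLE:0001"},
    "relapsing-remitting multiple sclerosis": {"EXAMPLE:0002"},
}
rows = propagate_xrefs({"multiple sclerosis"}, CHILDREN, XREFS)
```

Since the DO allows a term to have more than one parent, a specific term can in principle appear under more than one slim term in such a mapping.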

Pleural cancer (DOID:9917) was a TOPNodes_DOcancerslim term but was not found in the ontology version we downloaded (subversion revision 2810).

Daniel Himmelstein Researcher Feb. 20, 2016

Redundant terms removed from the slim DO

My above post on creating the slim DO didn't specify which diseases were removed to "resolve overlapping nodes". The table below shows which diseases we removed and why (rule numbers refer to the list in that post). The exclusion counts by rule are: 7 for rule 1, 22 for rule 2, and 5 for rule 3.

ID	Name	Source	Removed by
DOID:201	Connective tissue cancer	DOcancerslim	rule 1
DOID:10155	Intestinal cancer	DOcancerslim	rule 1
DOID:5672	Large intestine cancer	DOcancerslim	rule 1
DOID:3119	Gastrointestinal system cancer	DOcancerslim	rule 1
DOID:8618	Oral cavity cancer	DOcancerslim	rule 1
DOID:170	Endocrine gland cancer	DOcancerslim	rule 1
DOID:3996	Urinary system cancer	DOcancerslim	rule 1
DOID:3459	breast carcinoma	hetio	rule 2
DOID:10286	prostate carcinoma	hetio	rule 2
DOID:1040	chronic lymphocytic leukemia	hetio	rule 2
DOID:3905	lung carcinoma	hetio	rule 2
DOID:1909	melanoma	hetio	rule 2
DOID:4001	ovarian carcinoma	hetio	rule 2
DOID:1107	esophageal carcinoma	hetio	rule 2
DOID:4007	bladder carcinoma	hetio	rule 2
DOID:289	endometriosis	hetio	rule 2
DOID:4450	renal cell carcinoma	hetio	rule 2
DOID:769	neuroblastoma	hetio	rule 2
DOID:8567	Hodgkin's lymphoma	hetio	rule 2
DOID:3963	thyroid carcinoma	hetio	rule 2
DOID:9538	multiple myeloma	hetio	rule 2
DOID:9952	acute lymphocytic leukemia	hetio	rule 2
DOID:5517	stomach carcinoma	hetio	rule 2
DOID:684	hepatocellular carcinoma	hetio	rule 2
DOID:1380	endometrial cancer	hetio	rule 2
DOID:4905	pancreatic carcinoma	hetio	rule 2
DOID:4960	bone marrow cancer	hetio	rule 2
DOID:706	mature B-cell neoplasm	hetio	rule 2
DOID:8552	chronic myeloid leukemia	hetio	rule 2
DOID:5844	myocardial infarction	hetio	rule 3
DOID:6713	cerebrovascular disease	hetio	rule 3
DOID:11829	degenerative myopia	hetio	rule 3
DOID:13641	exfoliation syndrome	hetio	rule 3
DOID:3324	mood disorder	hetio	rule 3

See the remaining 137 diseases here.

Tong Shu Li Sept. 21, 2016

Hi Daniel,

In cell 2 of this notebook I notice that you're reading the data/slim-terms.tsv file from the dhimmel/disease-ontology repository. However, I could not determine how the slim-terms.tsv file was generated by looking at the files in the disease-ontology repository.

The only mention of a slim-terms.tsv file in the DO repository is in cell 42 of this notebook, but here the file seems to have been created already.

Do you know how this file was generated? I am trying to replicate the Rephetio workflow with up-to-date data.

Thanks!

Daniel Himmelstein Researcher Sept. 22, 2016

@tongli, slim-terms.tsv was created manually according to the post above. In short, I created the table in a spreadsheet by combining the two disease sets (hetio-dag [1] and TOPNodes_DOcancerslim [2]) and removing any problematic terms based on ad hoc rules. The pathophysiology classification was done manually by @pouyakhankhanian and me (I believe primarily for hetio-dag [1]) and was not used in Project Rephetio.

"I am trying to replicate the Rephetio workflow with up-to-date data."

The problem with the slim approach we took is that it's difficult to automate and potentially requires some clinical expertise. In other words, it's difficult to update. Therefore, I'm thinking of alternatives for the future. One option would be to include the entire ontology in the network. You could then pick your slim term set at analysis time rather than at network construction time, which would make the network more versatile. Since Cypher supports variable-length relationship queries, you could do on-the-fly propagation. However, this would require some additional implementation and theoretical work.
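
To give a sense of what on-the-fly propagation might look like, here is a hedged sketch of a Cypher query held in a Python string. The node labels (`Disease`, `Gene`), relationship types (`IS_A`, `ASSOCIATES`), and the `identifier` property are hypothetical schema choices, since no such network exists yet; only the `*0..` variable-length pattern is standard Cypher.

```python
# Hypothetical schema: (:Disease)-[:IS_A]->(:Disease) ontology edges and
# (:Disease)-[:ASSOCIATES]-(:Gene) annotation edges.
PROPAGATION_QUERY = """
MATCH (slim:Disease)<-[:IS_A*0..]-(specific:Disease)
WHERE slim.identifier IN $slim_ids
MATCH (specific)-[:ASSOCIATES]-(gene:Gene)
RETURN slim.identifier AS slim_id,
       collect(DISTINCT gene.identifier) AS gene_ids
"""


def propagation_params(slim_ids):
    """Build the parameter map for the query from a chosen slim term set."""
    return {"slim_ids": sorted(set(slim_ids))}


# The slim set is chosen at analysis time; these identifiers are examples.
params = propagation_params(["DOID:1909", "DOID:289", "DOID:1909"])
```

The `*0..` quantifier matches zero or more `IS_A` hops, so each slim term collects its own annotations along with those of every descendant. The query string and parameter map would then be passed to whichever Neo4j client is in use.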

Along these lines, I started playing around with creating Hetontology, a public Neo4j database of open biomedical ontologies. Depending on how much ground you're looking to break, this may be of interest.

Status: Completed

Topics

Mapping Terminologies UMLS Disease Ontology Vocabularies Diseases MeSH DO

Cite this as

Daniel Himmelstein, Tong Shu Li (2015) Unifying disease vocabularies. Thinklab. doi:10.15363/thinklab.d44
