Suggestions for additional information types?

Daniel Himmelstein, Venkat Malladi, Alessandro Didonna, Alexey Strokach

doi:10.15363/thinklab.d22

Daniel Himmelstein Researcher Jan. 16, 2015

Are there any types of nodes or edges that you think we should include? Or do you think there is a superior resource for an information type than the one proposed?

When constructing edges, we prefer data that is:

systematic, without knowledge bias
extensive in coverage
easy to process
without prohibitive reuse restrictions

For node suggestions, we prefer controlled vocabularies so annotating the nodes with new edge types in the future is possible.

Venkat Malladi March 19, 2015

I started a discussion on changing the ontology used for the Tissue node. The discussion is here [1].

Jesse Spaulding: Here's a link to the new discussion (Venkat, it might be a good idea to edit your post to add this link)

1. Tissue Node
Venkat Malladi, Daniel Himmelstein, Chris Mungall (2015) Thinklab. doi:10.15363/thinklab.d41

Daniel Himmelstein Researcher July 30, 2015

Phenotype–Disease Associations

A recent paper titled "The Human Phenotype Ontology: Semantic Unification of Common and Rare Disease" [1] constructed a catalog of 132,006 phenotypic annotations for common diseases.

The approach relied on text mining of PubMed abstracts. Abstracts were annotated with diseases using MEDLINE topics. Phenotype annotation, however, relied on concept recognition. Thus the method appears to produce results similar to our disease–symptom MEDLINE approach. The difference is that their approach achieves greater phenotype/symptom coverage by replacing manual topic annotation with concept recognition.

The data is online. The column names for the dataset, provided by Tudor Groza in personal communication, are:

MeSH ID
MeSH descriptor
Disease Ontology ID
HPO ID
HPO label
Ranking score of the HPO concept in the context of the MeSH term - this is a modified version of TF-IDF (as per the paper)
Number of Publications containing this association
5 PMIDs (of the total number listed above) referring to this association
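
As a rough sketch of how these eight columns could be loaded, the snippet below maps each row onto a named record. The file layout is an assumption (tab-separated, no header row, the example PMIDs kept as one raw field), since only the column names are given above. The `Annotation` class, the `read_annotations` helper, and the sample row are all made up for illustration and are not real data.

```python
import csv
import io
from typing import NamedTuple


class Annotation(NamedTuple):
    """One row of the dataset, following the column order listed above."""
    mesh_id: str
    mesh_descriptor: str
    disease_ontology_id: str
    hpo_id: str
    hpo_label: str
    ranking_score: float  # modified TF-IDF, as per the paper
    n_publications: int
    example_pmids: str    # up to 5 PMIDs; kept raw because the delimiter is unknown


def read_annotations(handle):
    """Yield one Annotation per row (assumes tab-separated, headerless input)."""
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 8:
            continue  # skip blank or malformed lines
        yield Annotation(
            row[0], row[1], row[2], row[3], row[4],
            float(row[5]), int(row[6]), row[7],
        )


# Placeholder row with invented values, only to exercise the parser.
sample = (
    "MESH:X\tExample disease\tDOID:X\tHP:X\tExample phenotype"
    "\t0.5\t12\tPMID_A,PMID_B\n"
)
records = list(read_annotations(io.StringIO(sample)))
```

For the real file, the `io.StringIO` object would be replaced by an open file handle, and the assumed layout should be checked against the download first.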

I am still slightly unclear on what a phenotype means in the context of human disease, but we will keep this dataset on hand.

Daniel Himmelstein: Noting another disease–phenotype approach [1] titled "Analysis of the human diseasome using phenotype similarity between common, genetic, and infectious diseases."

Daniel Himmelstein Researcher July 31, 2015

A recent study [1] found disease-associated genes are mildly predictive of effective drug targets, which is in line with the findings of a preceding but less rigorous study [2].

The data supplement for "The support of human genetic evidence for approved drug indications" contains two potentially useful resources:

Gene-disease associations from GWAS and OMIM. Compared to our approach for converting from SNP to gene, their method incorporates experimental genomic evidence.

Disease-target combinations extracted from Pharmaprojects, a commercial database. As per the paper:

A target was defined as successful in treating an indication if a drug targeting that gene product was approved for the corresponding indication in the United States or the European Union, as annotated in Pharmaprojects.

Unfortunately, this dataset is a step abstracted from the real deal (separate databases of drug targets and drug indications).

We performed some basic manipulation of their data, which is available here.

janet piñero: I think that for gene-disease associations you should check DisGeNET (www.disgenet.org). I would start by taking the disease list that you already have and querying DisGeNET with it. Currently, DisGeNET can be searched using MeSH, OMIM, and UMLS CUI identifiers. You can download the data as tab-delimited files.
Daniel Himmelstein: @janispi, thanks for the suggestion. We've begun adding DisGeNET.

1. Processing DisGeNET for disease-gene relationships
Daniel Himmelstein, janet piñero (2015) Thinklab. doi:10.15363/thinklab.d105

Alessandro Didonna July 31, 2015

What do you think about incorporating high-throughput data from RNA interference screens? RNAi is an alternative, more precise way to control gene expression, and it should have fewer off-target effects than pharmacological inhibition.

Daniel Himmelstein Researcher Aug. 8, 2015

Genetic perturbation edges

@alessandrodidonna, thanks for the recommendation. You have motivated us to add four new gene–gene metaedges:

Gene → knockdown downregulates → Gene
Gene → knockdown upregulates → Gene
Gene → overexpression downregulates → Gene
Gene → overexpression upregulates → Gene

These will be our first directed edges, so it will be exciting to stress test directed-edge support, a feature our implementation was designed to handle.
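
To illustrate what a directed metaedge implies for storage and lookup, here is a minimal sketch. It is not the project's actual implementation: the `MetaEdge` and `Graph` classes and the gene names `GENE_A` and `GENE_B` are hypothetical. The point is only that a directed edge is traversable from source to target, while an undirected edge would be stored in both directions.

```python
from collections import defaultdict
from typing import NamedTuple


class MetaEdge(NamedTuple):
    """An edge type: source node type, relationship kind, target node type."""
    source: str
    kind: str
    target: str
    directed: bool


# The four gene-gene metaedges listed above, all directed.
PERTURBATION_METAEDGES = [
    MetaEdge("Gene", kind, "Gene", directed=True)
    for kind in (
        "knockdown downregulates",
        "knockdown upregulates",
        "overexpression downregulates",
        "overexpression upregulates",
    )
]


class Graph:
    """Toy adjacency store keyed by (node, edge kind)."""

    def __init__(self):
        self._out = defaultdict(set)

    def add_edge(self, source, target, metaedge):
        self._out[(source, metaedge.kind)].add(target)
        if not metaedge.directed:
            # Undirected edges are traversable from either endpoint.
            self._out[(target, metaedge.kind)].add(source)

    def targets(self, node, kind):
        """Nodes reachable from `node` along edges of the given kind."""
        return set(self._out.get((node, kind), ()))


graph = Graph()
knockdown_down = PERTURBATION_METAEDGES[0]
graph.add_edge("GENE_A", "GENE_B", knockdown_down)
```

Here `graph.targets("GENE_A", "knockdown downregulates")` contains `GENE_B`, but the reverse lookup from `GENE_B` is empty, which is the behavior a stress test of directed edges would need to confirm.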

We'll be taking the data from LINCS L1000, which contains a large number of genetic perturbation experiments:

Read more about our consensus signatures for gene knockdowns (of 4,363 genes) and overexpressions (of 2,471 genes).

Alexey Strokach April 15, 2016

What about disease-disease co-occurrence data mined from electronic health records (EHR)? I know of two studies in particular, but there are probably more:

Human Disease Network (HuDiNe) [1]

Data compiled from Medicare claims of mostly elderly Americans.
Diseases are mapped to ICD9.
Data can be downloaded from the HuDiNe website.
License: Free for academic use only.

Stanford Translational Research Integrated Database Environment (STRIDE) [2]

Data was mined from EHR at Stanford hospitals.
Diseases are mapped to UMLS CUI.
Data can be downloaded from dryad.
License: Looks like something liberal.

Daniel Himmelstein Researcher April 16, 2016

Thanks @ostrokach. While these resources won't make it into the current project (corresponding to Hetionet v1.0), your suggestions will help us in the future.

HuDiNe

Regarding HuDiNe, I know I looked into the resource in the past. Specifically, I mention it in my Qualifying Exam proposal in 2013 [1]:

Disease comorbidity will be extracted from HuDiNe, a resource of co-occurring diseases constructed from Medicare claims spanning three years and 13 million patients [2]. FDR adjusted p-values from the φ comorbidity statistic — a conservative metric optimized for prevalent diseases — will determine disease-disease edges.

If I remember correctly, I didn't end up including comorbidity relationships from HuDiNe because I had trouble picking an inclusion threshold: prevalent diseases were always comorbid with each other, while rare diseases never had comorbidities. However, as the HuDiNe paper explains, it includes two measures of comorbidity [2]:

We will use two comorbidity measures to quantify the distance between two diseases: The Relative Risk (RR) and φ-correlation (φ). … For example, RR overestimates relationships involving rare diseases and underestimates the comorbidity between highly prevalent illnesses, whereas φ accurately discriminates comorbidities between pairs of diseases of similar prevalence but underestimates the comorbidity between rare and common diseases (see SM Box 1).
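
To make the contrast between the two measures concrete, here is a small sketch. It assumes the usual definitions (which should be verified against the HuDiNe paper): RR is the observed co-occurrence count divided by the count expected under independence, and φ is the Pearson correlation of two binary variables. The function names and all counts are invented for illustration.

```python
import math


def relative_risk(c_ij, p_i, p_j, n):
    """Observed co-occurrences over the number expected if independent.

    c_ij: patients with both diseases; p_i, p_j: patients with each disease;
    n: total patients.
    """
    return c_ij * n / (p_i * p_j)


def phi_correlation(c_ij, p_i, p_j, n):
    """Pearson correlation between two binary disease indicators."""
    numerator = c_ij * n - p_i * p_j
    denominator = math.sqrt(p_i * p_j * (n - p_i) * (n - p_j))
    return numerator / denominator


n_patients = 1_000_000  # invented population size

# Two rare diseases (100 patients each) sharing 5 patients.
rare_rr = relative_risk(5, 100, 100, n_patients)
rare_phi = phi_correlation(5, 100, 100, n_patients)

# Two common diseases (200,000 patients each) sharing 60,000 patients.
common_rr = relative_risk(60_000, 200_000, 200_000, n_patients)
common_phi = phi_correlation(60_000, 200_000, 200_000, n_patients)
```

With these invented counts, the rare pair gets an RR of 500 but a φ of about 0.05, while the common pair gets an RR of only 1.5 and a φ of 0.125. No single cutoff on either measure treats both pairs sensibly, which is the thresholding problem described above.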

Hence, I think it may make sense for us to revisit this dataset and attempt to pick a threshold that combines information from several available columns including:

Relative Risk 99% Conf. Interval
Phi-correlation
t-test value
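
One possible shape for such a combined threshold is sketched below. The cutoff values are placeholders rather than recommendations, and the `ComorbidityRow` field names are hypothetical rather than the actual HuDiNe column headers. The rule requires the lower bound of the relative risk confidence interval to exceed 1, plus minimum values for φ and the t-test value.

```python
from typing import NamedTuple


class ComorbidityRow(NamedTuple):
    """The columns of interest for one disease pair (field names are assumed)."""
    rr_ci_lower: float  # lower bound of the relative risk 99% confidence interval
    rr_ci_upper: float  # upper bound of the same interval
    phi: float          # phi-correlation
    t_value: float      # t-test value


def passes_threshold(row, min_phi=0.01, min_t=2.58):
    """Keep a pair only if all three criteria agree (cutoffs are placeholders)."""
    return (
        row.rr_ci_lower > 1.0      # RR significantly above independence
        and row.phi >= min_phi     # non-trivial correlation
        and row.t_value >= min_t   # correlation unlikely to be noise
    )


# Invented rows, only to show the filter in use.
kept = passes_threshold(ComorbidityRow(1.2, 1.8, 0.05, 4.0))
dropped = passes_threshold(ComorbidityRow(0.9, 1.8, 0.05, 4.0))
```

Requiring agreement across measures is only one option; the exploratory work would be in choosing cutoffs that do not simply reproduce the prevalence bias of a single measure.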

Intuitively, comorbidity is an ideal relationship for our approach because it's high throughput and could potentially offer orthogonal information to existing relationships.

STRIDE

STRIDE is definitely of interest [3]. The data is licensed as CC0 [4], which is ideal. However, a good deal of exploratory analysis would be needed to determine a processing pipeline and to see which types of co-occurring concepts are meaningful.