Rephetio: Repurposing drugs on a hetnet [rephetio]

Suggestions for additional information types?

Are there any types of nodes or edges that you think we should include? Or do you know of a better resource for an information type than the one we proposed?

When constructing edges, we prefer data that is:

  • systematic, without knowledge bias
  • extensive in coverage
  • easy to process
  • without prohibitive reuse restrictions

For node suggestions, we prefer controlled vocabularies, which make it possible to annotate the nodes with new edge types in the future.

I've started a separate discussion on changing the ontology used for the Tissue node.

Phenotype–Disease Associations

A recent paper titled "The Human Phenotype Ontology: Semantic Unification of Common and Rare Disease" [1] constructed a catalog of 132,006 phenotypic annotations for common diseases.

The approach relied on text mining of PubMed abstracts. Abstracts were annotated with diseases using MEDLINE topics. Phenotype annotation, however, relied on concept recognition. Thus the method appears to produce similar results to our disease–symptom MEDLINE approach. The difference is that their approach achieves greater phenotype/symptom coverage by using concept recognition in place of manual topic annotation.

The data is online. The column names for the dataset, provided by Tudor Groza in personal communication, are:

  • MeSH ID
  • MeSH descriptor
  • Disease Ontology ID
  • HPO ID
  • HPO label
  • Ranking score of the HPO concept in the context of the MeSH term - this is a modified version of TF-IDF (as per the paper)
  • Number of Publications containing this association
  • 5 PMIDs (of the total number listed above) referring to this association
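
For anyone who wants to poke at the file, here is a minimal sketch of reading it with the eight columns above and filtering on the ranking score and publication count. I'm assuming a tab-delimited file without a header row; the short field names, the thresholds, and the toy rows are all made up for illustration and are not taken from the dataset.

```python
import csv
import io

# Hypothetical short names for the eight columns listed above.
FIELDS = [
    "mesh_id", "mesh_name", "doid_id", "hpo_id",
    "hpo_name", "score", "n_publications", "example_pmids",
]


def read_annotations(handle, min_score=0.0, min_publications=1):
    """Yield rows (as dicts) that pass both thresholds.

    Assumes tab-delimited text with no header row, which may not match
    the actual file layout.
    """
    reader = csv.DictReader(handle, fieldnames=FIELDS, delimiter="\t")
    for row in reader:
        row["score"] = float(row["score"])
        row["n_publications"] = int(row["n_publications"])
        if row["score"] >= min_score and row["n_publications"] >= min_publications:
            yield row


# Toy rows with placeholder identifiers, NOT real records from the dataset.
toy = io.StringIO(
    "MESH:X1\tdisease one\tDOID:X1\tHP:X1\tphenotype one\t2.5\t12\t111,222\n"
    "MESH:X1\tdisease one\tDOID:X1\tHP:X2\tphenotype two\t0.4\t1\t333\n"
)
kept = list(read_annotations(toy, min_score=1.0, min_publications=2))
print([(row["doid_id"], row["hpo_id"]) for row in kept])
```

Filtering on both the score and the publication count is just one option; either column alone could serve as the edge inclusion criterion.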

I am still slightly unclear on what a phenotype means in the context of human disease, but we will keep this dataset on hand.

A recent study [1] found disease-associated genes are mildly predictive of effective drug targets, which is in line with the findings of a preceding but less rigorous study [2].

The data supplement for "The support of human genetic evidence for approved drug indications" contains two potentially useful resources:

Gene-disease associations from GWAS and OMIM. Compared to our approach for converting from SNP to gene, their method incorporates experimental genomic evidence.

Disease-target combinations extracted from Pharmaprojects, a commercial database. As per the paper:

A target was defined as successful in treating an indication if a drug targeting that gene product was approved for the corresponding indication in the United States or the European Union, as annotated in Pharmaprojects.

Unfortunately, this dataset is a step abstracted from the real deal (separate databases of drug targets and drug indications).
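
To make the "step abstracted" point concrete, here is a small sketch of how target–indication pairs like theirs could be derived from the two underlying tables. The function name and the toy drugs, genes, and diseases are placeholders I made up; this is not how Pharmaprojects or the paper's authors actually built the dataset.

```python
def successful_target_indications(drug_targets, drug_indications):
    """Collapse drug-target and drug-indication tables into (gene, disease) pairs.

    A pair is included when at least one drug that targets the gene is
    approved for the disease, following the definition quoted above.
    """
    pairs = set()
    for drug, diseases in drug_indications.items():
        for gene in drug_targets.get(drug, ()):
            for disease in diseases:
                pairs.add((gene, disease))
    return pairs


# Placeholder data for illustration only.
drug_targets = {
    "drug_1": {"GENE_A", "GENE_B"},
    "drug_2": {"GENE_B"},
}
drug_indications = {
    "drug_1": {"disease_X"},
    "drug_2": {"disease_Y"},
    "drug_3": {"disease_Z"},  # no known target, so it contributes no pairs
}
pairs = successful_target_indications(drug_targets, drug_indications)
print(sorted(pairs))
```

Once the tables are collapsed this way, a pair such as (GENE_B, disease_X) no longer says which drug supports it, which is the information we lose by starting from the derived dataset.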

We performed some basic manipulation of their data, which is available here.

What do you think about incorporating high-throughput data from RNA interference screens? RNAi is an alternative, more precise way to control gene expression, and it should have fewer off-target effects than pharmacological inhibition.

Genetic perturbation edges

@alessandrodidonna, thanks for the recommendation. You have motivated us to add four new gene–gene metaedges:

  • Gene → knockdown downregulates → Gene
  • Gene → knockdown upregulates → Gene
  • Gene → overexpression downregulates → Gene
  • Gene → overexpression upregulates → Gene

These will be our first directed edges, so it will be exciting to stress test the directed-edge support that we designed into our implementation.
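
As a rough sketch of what the four metaedges could look like in code, the snippet below turns one perturbation signature into directed edges. The `Edge` tuple, the z-score input format, and the cutoff of 2.0 are assumptions for illustration, not the project's actual data model or threshold.

```python
from collections import namedtuple

# Directed edge: source and target are not interchangeable.
Edge = namedtuple("Edge", ["source", "kind", "target"])


def perturbation_edges(perturbed_gene, perturbation, z_scores, threshold=2.0):
    """Convert one consensus signature into directed gene-gene edges.

    perturbation: "knockdown" or "overexpression"
    z_scores: dict mapping each measured gene to a differential expression z-score
    threshold: hypothetical cutoff on |z|, chosen only for this example
    """
    edges = []
    for gene, z in sorted(z_scores.items()):
        if gene == perturbed_gene or abs(z) < threshold:
            continue
        direction = "upregulates" if z > 0 else "downregulates"
        edges.append(Edge(perturbed_gene, f"{perturbation} {direction}", gene))
    return edges


# Placeholder gene names and scores for illustration only.
signature = {"GENE_B": -3.1, "GENE_C": 0.4, "GENE_D": 2.6}
edges = perturbation_edges("GENE_A", "knockdown", signature)
for edge in edges:
    print(f"{edge.source} -> {edge.kind} -> {edge.target}")
```

The edge always runs from the perturbed gene to the gene whose expression changed, so the reverse edge from GENE_B to GENE_A would need its own experiment to exist.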

We'll be taking the data from LINCS L1000, which contains a large number of genetic perturbation experiments.

Read more about our consensus signatures for gene knockdowns (of 4,363 genes) and overexpressions (of 2,471 genes).

What about disease-disease co-occurrence data mined from electronic health records (EHR)? I know of two studies in particular, but there are probably more:

Human Disease Network (HuDiNe) [1]

  • Data compiled from Medicare claims of mostly elderly Americans.
  • Diseases are mapped to ICD9.
  • Data can be downloaded from the HuDiNe website.
  • License: Free for academic use only.

Stanford Translational Research Integrated Database Environment (STRIDE) [2]

  • Data was mined from EHR at Stanford hospitals.
  • Diseases are mapped to UMLS CUI.
  • Data can be downloaded from dryad.
  • License: Looks like something liberal.

Thanks @ostrokach. While these resources won't make it into the current project (corresponding to Hetionet v1.0), your suggestions will help us in the future.


Regarding HuDiNe, I know I looked into the resource in the past. Specifically, I mentioned it in my Qualifying Exam proposal in 2013 [1]:

Disease comorbidity will be extracted from HuDiNe, a resource of co-occurring diseases constructed from Medicare claims spanning three years and 13 million patients [2]. FDR adjusted p-values from the φ comorbidity statistic — a conservative metric optimized for prevalent diseases — will determine disease-disease edges.

If I remember correctly, I didn't end up including comorbidity relationships from HuDiNe because I had trouble picking an inclusion threshold: prevalent diseases were always comorbid with each other, while rare diseases never had comorbidities. However, as the HuDiNe paper explains, the resource includes two measures of comorbidity [2]:

We will use two comorbidity measures to quantify the distance between two diseases: The Relative Risk (RR) and φ-correlation (φ). … For example, RR overestimates relationships involving rare diseases and underestimates the comorbidity between highly prevalent illnesses, whereas φ accurately discriminates comorbidities between pairs of diseases of similar prevalence but underestimates the comorbidity between rare and common diseases (see SM Box 1).

Hence, I think it may make sense for us to revisit this dataset and attempt to pick a threshold that combines information from several available columns including:

  • Relative Risk 99% Conf. Interval
  • Phi-correlation
  • t-test value
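
To illustrate the prevalence issue and what a combined threshold might look like, here is a sketch. The RR and φ formulas are the standard definitions from co-occurrence counts, which I believe match the HuDiNe paper but have not rechecked against it. The `passes` rule, its cutoffs, and all the counts are hypothetical.

```python
import math


def relative_risk(c_ij, p_i, p_j, n):
    """Observed co-occurrence divided by the count expected under independence.

    c_ij: patients with both diseases; p_i, p_j: patients with each disease;
    n: total number of patients.
    """
    return (c_ij * n) / (p_i * p_j)


def phi_correlation(c_ij, p_i, p_j, n):
    """Pearson correlation between the two binary disease indicators."""
    numerator = c_ij * n - p_i * p_j
    denominator = math.sqrt(p_i * p_j * (n - p_i) * (n - p_j))
    return numerator / denominator


def passes(rr_ci_lower, phi, t_value, min_phi=0.01, min_t=2.0):
    """Hypothetical inclusion rule that requires all three columns to agree.

    rr_ci_lower: lower bound of the relative risk 99% confidence interval.
    The cutoffs are placeholders, not tuned values.
    """
    return rr_ci_lower > 1.0 and phi > min_phi and t_value > min_t


n = 1_000_000  # made-up population size
rare = (5, 100, 100)                # co-occurrences and prevalences, rare pair
common = (60_000, 200_000, 200_000)  # same for a highly prevalent pair

for label, (c_ij, p_i, p_j) in [("rare", rare), ("common", common)]:
    rr = relative_risk(c_ij, p_i, p_j, n)
    phi = phi_correlation(c_ij, p_i, p_j, n)
    print(f"{label}: RR={rr:.1f} phi={phi:.4f}")
```

In these made-up counts the rare pair gets a very large RR but a small φ, while the common pair gets a modest RR and a larger φ, which is the bias the quoted passage describes and the reason a single-column threshold was hard to pick.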

Intuitively, comorbidity is an ideal relationship for our approach because it's high throughput and could potentially offer orthogonal information to existing relationships.


STRIDE is definitely of interest [3]. The data is licensed as CC0 [4], which is ideal. However, a good deal of exploratory analysis would be needed to determine a processing pipeline and see what types of co-occurring concepts have meaning.

Status: Open
Cite this as
Daniel Himmelstein, Venkat Malladi, Alessandro Didonna, Alexey Strokach (2015) Suggestions for additional information types?. Thinklab. doi:10.15363/thinklab.d22

Creative Commons License