A recent paper titled "The Human Phenotype Ontology: Semantic Unification of Common and Rare Disease" constructed a catalog of 132,006 phenotypic annotations for common diseases.
The approach relied on text mining of PubMed abstracts. Abstracts were annotated with diseases using MEDLINE topics. Phenotype annotation, however, relied on concept recognition. Thus the method appears to produce results similar to our disease–symptom MEDLINE approach. The difference is that their approach achieves greater phenotype/symptom coverage by replacing manual topic annotation with concept recognition.
The data is online. The column names for the dataset, provided by Tudor Groza in personal communication, are:
Disease Ontology ID
Ranking score of the HPO concept in the context of the MeSH term - this is a modified version of TF-IDF (as per the paper)
Number of Publications containing this association
5 PMIDs (of the total number listed above) referring to this association
I am still slightly unclear on what a phenotype means in the context of human disease, but we will keep this dataset on hand.
Daniel Himmelstein: Noting another disease–phenotype approach, titled "Analysis of the human diseasome using phenotype similarity between common, genetic, and infectious diseases."
Robert Hoehndorf, Paul N. Schofield, Georgios V. Gkoutos (2015) Sci. Rep.. doi:10.1038/srep10888
A recent study found that disease-associated genes are mildly predictive of effective drug targets, which is in line with the findings of a preceding but less rigorous study.
The data supplement for "The support of human genetic evidence for approved drug indications" contains two potentially useful resources:
Gene-disease associations from GWAS and OMIM. Compared to our approach for converting from SNP to gene, their method incorporates experimental genomic evidence.
Disease-target combinations extracted from Pharmaprojects, a commercial database. As per the paper:
A target was defined as successful in treating an indication if a drug targeting that gene product was approved for the corresponding indication in the United States or the European Union, as annotated in Pharmaprojects.
Unfortunately, this dataset is a step abstracted from the real deal (separate databases of drug targets and drug indications).
We performed some basic manipulation of their data, which is available here.
janet piñero: I think that for gene-disease associations you should check DisGeNET (www.disgenet.org). I would start by checking the disease list that you already have and query DisGeNET with it. Currently, DisGeNET may be searched using MeSH, OMIMs, and UMLS CUIs. You can download the data in tab files.
What do you think about incorporating high-throughput data from RNA interference screens? RNAi is an alternative, more precise way to control gene expression, and it should have fewer off-target effects than pharmacological inhibition.
Genetic perturbation edges
@alessandrodidonna, thanks for the recommendation. You have motivated us to add four new gene–gene metaedges:
Gene → knockdown downregulates → Gene
Gene → knockdown upregulates → Gene
Gene → overexpression downregulates → Gene
Gene → overexpression upregulates → Gene
These will be our first directed edges, so it will be exciting to stress test a feature that our implementation was designed to support but has not yet exercised.
We'll be taking the data from LINCS L1000 which contains a large number of genetic perturbation experiments:
Read more about our consensus signatures for gene knockdowns (of 4,363 genes) and overexpressions (of 2,471 genes).
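To make the directionality concrete, here is a minimal sketch of how the four metaedges above could be stored so that source and target stay distinct. The class name, method names, and placeholder gene identifiers are hypothetical and are not taken from our actual implementation.

```python
from collections import defaultdict

# The four directed gene-gene metaedge kinds listed above.
KINDS = {
    "knockdown downregulates",
    "knockdown upregulates",
    "overexpression downregulates",
    "overexpression upregulates",
}


class DirectedGeneGraph:
    """Hypothetical container for directed gene-gene edges."""

    def __init__(self):
        # Separate indexes so an edge can be looked up from either end.
        self.out_edges = defaultdict(set)  # source -> {(kind, target)}
        self.in_edges = defaultdict(set)   # target -> {(kind, source)}

    def add_edge(self, source, kind, target):
        if kind not in KINDS:
            raise ValueError(f"unknown edge kind: {kind}")
        self.out_edges[source].add((kind, target))
        self.in_edges[target].add((kind, source))

    def targets(self, source, kind):
        """Genes whose expression changes when `source` is perturbed."""
        edges = self.out_edges.get(source, set())
        return sorted(t for k, t in edges if k == kind)

    def sources(self, target, kind):
        """Perturbed genes that change the expression of `target`."""
        edges = self.in_edges.get(target, set())
        return sorted(s for k, s in edges if k == kind)


# Placeholder gene names, for illustration only.
graph = DirectedGeneGraph()
graph.add_edge("GENE_A", "knockdown downregulates", "GENE_B")
```

Unlike an undirected edge, the relationship here does not hold in reverse: `graph.targets("GENE_B", "knockdown downregulates")` is empty unless a separate experiment supports that edge.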
What about disease-disease co-occurrence data mined from electronic health records (EHR)? I know of two studies in particular, but there are probably more:
Thanks @ostrokach. While these resources won't make it into the current project (corresponding to Hetionet v1.0), your suggestions will help us in the future.
Regarding HuDiNe, I know I looked into the resource in the past. Specifically, I mention it in my Qualifying Exam proposal in 2013:
Disease comorbidity will be extracted from HuDiNe, a resource of co-occurring diseases constructed from Medicare claims spanning three years and 13 million patients . FDR adjusted p-values from the φ comorbidity statistic — a conservative metric optimized for prevalent diseases — will determine disease-disease edges.
If I remember correctly, I didn't end up including comorbidity relationships from HuDiNe because I had trouble picking an inclusion threshold: prevalent diseases were always comorbid with each other, while rare diseases never had comorbidities. However, as the HuDiNe paper explains, it includes two measures of comorbidity:
We will use two comorbidity measures to quantify the distance between two diseases: The Relative Risk (RR) and φ-correlation (φ). … For example, RR overestimates relationships involving rare diseases and underestimates the comorbidity between highly prevalent illnesses, whereas φ accurately discriminates comorbidities between pairs of diseases of similar prevalence but underestimates the comorbidity between rare and common diseases (see SM Box 1).
Hence, I think it may make sense for us to revisit this dataset and attempt to pick a threshold that combines information from several available columns including:
Relative Risk 99% Conf. Interval
Intuitively, comorbidity is an ideal relationship for our approach because it's high throughput and could potentially offer orthogonal information to existing relationships.
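For reference, here is a small sketch of the two measures computed from raw counts. The formulas are the standard definitions of relative risk and the φ-correlation for a pair of binary traits, which I believe match the ones HuDiNe uses, but they should be checked against the paper before we rely on them. The function names and the example counts are made up.

```python
import math


def relative_risk(c_ij, p_i, p_j, n):
    """Observed co-occurrence divided by the co-occurrence expected
    if the two diseases were independent.

    c_ij: patients with both diseases
    p_i, p_j: patients with disease i, disease j
    n: total patients
    """
    return (c_ij * n) / (p_i * p_j)


def phi_correlation(c_ij, p_i, p_j, n):
    """Pearson correlation between the two binary disease indicators."""
    numerator = c_ij * n - p_i * p_j
    denominator = math.sqrt(p_i * p_j * (n - p_i) * (n - p_j))
    return numerator / denominator


# Made-up counts for two fairly common diseases.
rr_common = relative_risk(c_ij=20, p_i=100, p_j=50, n=1000)
phi_common = phi_correlation(c_ij=20, p_i=100, p_j=50, n=1000)

# Made-up counts for two rare diseases sharing a single patient.
rr_rare = relative_risk(c_ij=1, p_i=10, p_j=10, n=1_000_000)
phi_rare = phi_correlation(c_ij=1, p_i=10, p_j=10, n=1_000_000)

print(rr_common, phi_common)
print(rr_rare, phi_rare)
```

With these toy counts, a single shared patient between two rare diseases gives a far larger RR than the common pair, while its φ is smaller. That is the threshold problem described above, and it is why combining several columns may work better than cutting on one.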
STRIDE is definitely of interest. The data is licensed as CC0, which is ideal. However, a good deal of exploratory analysis would be needed to determine a processing pipeline and to see which types of co-occurring concepts are meaningful.
Tudor Groza, Sebastian Köhler, Dawid Moldenhauer, Nicole Vasilevsky, Gareth Baynam, Tomasz Zemojtel, Lynn Marie Schriml, Warren Alden Kibbe, Paul N. Schofield, Tim Beck, Drashtti Vasant, Anthony J. Brookes, Andreas Zankl, Nicole L. Washington, Christopher J. Mungall, Suzanna E. Lewis, Melissa A. Haendel, Helen Parkinson, Peter N. Robinson (2015) The American Journal of Human Genetics. doi:10.1016/j.ajhg.2015.05.020
Matthew R Nelson, Hannah Tipney, Jeffery L Painter, Judong Shen, Paola Nicoletti, Yufeng Shen, Aris Floratos, Pak Chung Sham, Mulin Jun Li, Junwen Wang, Lon R Cardon, John C Whittaker, Philippe Sanseau (2015) Nature Genetics. doi:10.1038/ng.3314