Processing the DISEASES resource for disease–gene relationships

Daniel Himmelstein, Lars Juhl Jensen

doi:10.15363/thinklab.d106

Project:

Rephetio: Repurposing drugs on a hetnet [rephetio]

Publication:

DISEASES: Text mining and data integration of disease–gene associations

Daniel Himmelstein Researcher Aug. 20, 2015

We are looking into DISEASES [1] as a resource for gene–disease relationships. This database is produced by @larsjuhljensen's group and follows protocols similar to those of TISSUES [2], which we have already processed.

DISEASES includes three types of evidence:

text mining: using named entity recognition to look for disease–protein cooccurrences in abstracts and sentences. @larsjuhljensen, which literature corpus was used?
knowledge: curated relationships from GHR and UniProtKB [3]
experiments: cancer mutation data from COSMIC [4, 5] and GWAS data from DistiLD [6]

We did a preliminary processing of the integrated dataset, which yielded 81,499 gene–disease relationships for DO Slim diseases (notebook, download). Filtering for scores ≥ 3 left 2,441 relationships.
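The score filter described above can be sketched in a few lines. This is only an illustration on made-up rows, not the code from the notebook: the column names (`doid`, `entrez_gene_id`, `score`) and all values are assumptions.

```python
import csv
import io

# Hypothetical TSV excerpt; column names and values are illustrative only.
EXAMPLE_TSV = (
    "doid\tentrez_gene_id\tscore\n"
    "DOID:1\t100\t4.2\n"
    "DOID:1\t200\t1.5\n"
    "DOID:2\t300\t3.0\n"
)


def filter_by_score(tsv_text, threshold=3.0):
    """Return the rows whose score is at least `threshold`."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if float(row["score"]) >= threshold]


confident = filter_by_score(EXAMPLE_TSV)
```

The threshold is inclusive, matching the "≥ 3" cutoff mentioned above.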

@larsjuhljensen, are scores in DISEASES comparable between datasets? In other words, are confidence scores standardized to a common gold standard?

We may consider creating an integrated score excluding DistiLD, since we have a distinct GWAS edge.

Lars Juhl Jensen Aug. 21, 2015

Regarding the scores, they are designed to be as comparable as we could make them; however, it was not possible to do so purely through benchmarking, since a high-quality unbiased benchmark set does not exist.

If you already have GWAS from another source, I would exclude DistiLD too. If you already import mutation data from e.g. COSMIC, I would exclude the experiments channel entirely. This also makes comparability of scores much less of an issue, since you're left with only automatically text-mined associations, which are scored the same way as tissue associations, and manually curated associations, which are inherently highly reliable.

Daniel Himmelstein: We don't include COSMIC anywhere else, so I would like to include it. @larsjuhljensen, which literature corpus was used for text mining?
Lars Juhl Jensen: Just Medline so far.

Daniel Himmelstein Researcher Aug. 21, 2015

Completed processing

We have completed an initial processing of DISEASES (notebook). The output is a tsv of gene–disease pairs (download) with scores for the following channels:

text mining
knowledge
cosmic: the COSMIC subset of the experiments channel
distild: the DistiLD subset of the experiments channel
integrated_no_distild: the integration of the text mining, knowledge, and cosmic scores, excluding distild
integrated: the integrated score calculated by the DISEASES team, without any exclusions

Genes were converted to Entrez identifiers using the STRING 9.1 mapping (entrez_gene_id.vs.string.v9.05.28122012.txt). We also created a dataset with only DO Slim diseases (download). For this file, we propagated scores from subsumed diseases up to each DO Slim disease and reported the maximum score per gene.
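The propagation step can be illustrated roughly as follows. This is a sketch under stated assumptions rather than the notebook's implementation: the mapping from each DO Slim term to the terms it subsumes (`SLIM_TO_SUBSUMED`), the score dictionary, and all identifiers are hypothetical.

```python
from collections import defaultdict

# Hypothetical inputs: each DO Slim term maps to the set of terms it subsumes
# (including itself); scores are keyed by (disease, gene).
SLIM_TO_SUBSUMED = {
    "slim_A": {"slim_A", "child_A1", "child_A2"},
    "slim_B": {"slim_B"},
}
SCORES = {
    ("child_A1", "gene_1"): 2.0,
    ("child_A2", "gene_1"): 3.5,
    ("slim_A", "gene_2"): 1.0,
    ("slim_B", "gene_1"): 0.5,
    ("other", "gene_3"): 4.0,  # not under any slim term, so it is dropped
}


def propagate_max(slim_to_subsumed, scores):
    """Assign each (slim term, gene) pair the maximum score among subsumed terms."""
    # Invert the mapping so each term points to the slim terms that subsume it.
    term_to_slims = defaultdict(set)
    for slim, terms in slim_to_subsumed.items():
        for term in terms:
            term_to_slims[term].add(slim)

    propagated = {}
    for (disease, gene), score in scores.items():
        for slim in term_to_slims.get(disease, ()):
            key = (slim, gene)
            propagated[key] = max(score, propagated.get(key, score))
    return propagated


slim_scores = propagate_max(SLIM_TO_SUBSUMED, SCORES)
```

In this toy input, `gene_1` is linked to two children of `slim_A`, so the slim-level pair keeps only the higher of the two scores.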

Visualizing channel concordance

We visualized the relationships between scores on the full dataset. The off-diagonal plots show a 2D histogram, using hexagonal bins. The diagonal of the grid contains 1D histograms for the x-variable. Bin counts for all panels are log-transformed.

Status: Completed

Topics

Associations Diseases Databases

Referenced by

Visualizing the top epilepsy predictions in Cytoscape
Research report: Rephetio: Repurposing drugs on a hetnet

Cite this as

Daniel Himmelstein, Lars Juhl Jensen (2015) Processing the DISEASES resource for disease–gene relationships. Thinklab. doi:10.15363/thinklab.d106
