Mining knowledge from MEDLINE articles and their indexed MeSH terms

Daniel Himmelstein, Alex Pankov

doi:10.15363/thinklab.d67

Daniel Himmelstein Researcher May 10, 2015

Background

The National Library of Medicine (NLM) produces a catalog of 23 million journal articles called PubMed. PubMed contains two subsets that are relevant for literature mining:

PubMed Central (PMC) — 3.4 million articles that include full texts, rather than just abstracts.
MEDLINE — 21 million articles that are manually annotated with their topics. Topics are chosen from the MeSH vocabulary. 5,594 journals are currently indexed.

MeSH, which stands for Medical Subject Headings, is a broad terminology of ~27 thousand terms structured hierarchically to form an ontology. Skilled subject analysts at the NLM typically assign 10–12 MeSH terms per article and denote a subset of these terms as major topics.

Application

Text mining, as suggested to us by @b_good, is an intriguing technique because it is widely applicable and draws from a knowledge base of epic proportions [1].

We would like to infer relationships between nodes in our network based on MEDLINE cooccurrence. We will search for pairs of MeSH terms that are assigned to the same articles beyond what would be expected if the terms were unrelated. This approach has successfully identified disease symptoms [2] (browse results). The method is versatile and can be applied to any nodes which have been mapped to MeSH.

Daniel Himmelstein Researcher May 13, 2015

Proof of concept implementation

We implemented a topic cooccurrence calculator based on MEDLINE and used this method to identify disease-symptom relationships (notebook, API query script, tsv of results).

First we created a disease set of 119 MeSH terms that mapped to DO slim diseases (tsv of diseases). Next, we created a symptom set of 438 MeSH terms by finding all descendants of D012816 (Signs and Symptoms) (notebook, tsv of symptoms).

For each disease, we identified the articles where that disease was a major topic. For each symptom, we identified the articles where that symptom was a topic. We then identified the articles that contained both a disease major topic and symptom topic. We based further analysis only on these 392,397 articles that contain at least one disease–symptom cooccurrence.

For each symptom–disease pair, we calculated:

cooccurrence — the number of articles where the disease and symptom terms cooccurred.
expected — the number of expected cooccurrences by chance based on each term's marginal frequency.
enrichment — cooccurrence divided by expected.
odds_ratio — the odds of cooccurrence divided by the odds of expected. This calculation appears to be slightly messed up due to non-integer expected counts.
p_fisher — the p-value from Fisher's exact test evaluating whether the observed cooccurrence exceeded that expected by chance.
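As a rough illustration of the first three metrics, the sketch below computes them for one disease–symptom pair from sets of article identifiers. It assumes that expected is the product of the two marginal counts divided by the number of articles under consideration (the usual independence estimate); the post does not spell out the exact formula. The function name, argument names, and toy PubMed IDs are hypothetical.

```python
def cooccurrence_stats(disease_pmids, symptom_pmids, total_articles):
    """Return (cooccurrence, expected, enrichment) for one disease-symptom pair.

    disease_pmids: set of article IDs where the disease is a major topic
    symptom_pmids: set of article IDs where the symptom is a topic
    total_articles: size of the article universe (an assumption of this sketch)
    """
    # Observed: articles annotated with both terms.
    cooccurrence = len(disease_pmids & symptom_pmids)
    # Expected under independence: (row total * column total) / grand total.
    expected = len(disease_pmids) * len(symptom_pmids) / total_articles
    # Enrichment: observed divided by expected (undefined when expected is 0).
    enrichment = cooccurrence / expected if expected else float("nan")
    return cooccurrence, expected, enrichment


# Toy data with made-up article IDs.
disease = {1, 2, 3, 4}
symptom = {3, 4, 5}
print(cooccurrence_stats(disease, symptom, total_articles=10))
```

In the analysis described above, the article universe would presumably be the 392,397 articles with at least one disease–symptom cooccurrence.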

@apankov, can you comment on the Fisher's exact test and whether there is a superior way to identify terms that significantly cooccur?

@b_good or others: do you know of better metrics for literature mining? One issue is that our approach may miss common symptoms that are not greatly enriched for any particular disease. The HSDN study [1] used a TF-IDF measure, but we require metrics that are comparable across diseases.

Alex Pankov May 13, 2015

I think the Fisher's exact test will be accepted well by reviewers, but Barnard's test could be a good alternative. Otherwise, if you can calculate a p-value based on permutation (or get bootstrapped estimates for the variance of the number of expected cooccurrences), that could be an easy, straightforward approach.
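For reference, a permutation p-value along these lines could be sketched as follows. This is only one possible reading of the suggestion, not something implemented in this discussion: it shuffles the symptom labels across articles, recounts cooccurrences each time, and reports the fraction of shuffles that match or exceed the observed count. The function name, the boolean-list representation, and the default of 999 permutations are all assumptions.

```python
import random


def permutation_p_value(has_disease, has_symptom, n_permutations=999, seed=0):
    """One-sided permutation p-value for cooccurrence of two binary labels.

    has_disease, has_symptom: equal-length lists of booleans, one per article.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = sum(d and s for d, s in zip(has_disease, has_symptom))
    shuffled = list(has_symptom)
    hits = 0
    for _ in range(n_permutations):
        # Shuffling breaks any association while keeping both marginal counts.
        rng.shuffle(shuffled)
        count = sum(d and s for d, s in zip(has_disease, shuffled))
        if count >= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return (hits + 1) / (n_permutations + 1)


# Toy example: 20 articles where the two labels coincide perfectly.
labels = [True] * 10 + [False] * 10
print(permutation_p_value(labels, labels))
```

With hundreds of thousands of articles and many term pairs, this would be far slower than an exact test, which is one reason to prefer the latter when p-value fidelity is not critical.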

Daniel Himmelstein Researcher May 15, 2015

Thanks @apankov. I couldn't find a Python implementation of Barnard's test [1, 2], so I think we'll stick with Fisher's exact test [3] for simplicity. The fidelity of p-values is not a major concern here.

However, it has occurred to me that in our above post, we incorrectly created the contingency table for the exact test. We now construct it similarly to Table 1 of this paper [4] so that the contingency table is:

$$$ \begin{bmatrix} a & b\\ c & d \end{bmatrix} $$$

where

a is the number of studies with both the disease and the symptom (cooccurrence)
b is the number of studies with the disease and without the symptom
c is the number of studies without the disease and with the symptom
d is the number of studies without either the disease or symptom
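The table above can be built from four counts and fed to a one-sided Fisher's exact test. The following is a minimal, standard-library sketch of that calculation and not necessarily the code used for the tsv file; the function names are hypothetical, and the choice of a one-sided ("greater") alternative is an assumption based on the goal of detecting excess cooccurrence.

```python
from math import comb


def contingency_table(n_disease, n_symptom, n_both, n_total):
    """Return (a, b, c, d) as defined above from marginal and joint counts."""
    a = n_both                 # disease and symptom
    b = n_disease - n_both     # disease, no symptom
    c = n_symptom - n_both     # symptom, no disease
    d = n_total - a - b - c    # neither
    return a, b, c, d


def fisher_exact_greater(a, b, c, d):
    """One-sided p-value: probability of at least `a` cooccurrences
    under the hypergeometric null with all margins fixed."""
    row1 = a + b               # articles with the disease
    col1 = a + c               # articles with the symptom
    n = a + b + c + d
    upper = min(row1, col1)    # largest possible cooccurrence count
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, upper + 1))
    return tail / comb(n, col1)


# Toy example: 8 articles, 4 with the disease, 4 with the symptom, 3 with both.
table = contingency_table(n_disease=4, n_symptom=4, n_both=3, n_total=8)
print(table)                         # (3, 1, 1, 3)
print(fisher_exact_greater(*table))  # 17/70, about 0.243
```

At MEDLINE scale the exact binomial coefficients become very large, so a library routine such as `scipy.stats.fisher_exact` with `alternative='greater'` would likely be the more practical choice.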

The revised symptom–disease pair tsv file is available here.

Daniel Himmelstein: See this newer symptom–disease pair tsv file which has additional DO slim diseases with MeSH mappings.

Daniel Himmelstein Researcher May 21, 2015

Anatomy–Disease Relationships

The Uberon ontology [1] of anatomical structures includes MeSH cross-references. Thus, we performed our MEDLINE cooccurrence analysis described above to find relationships between diseases and anatomical structures (notebook, tsv download).

The ability of this method to capture disease localization was exceptional. For example, the top five terms by p-value for multiple sclerosis were:

mesh_name	cooccurrence	expected	enrichment	odds_ratio
Central Nervous System	881	38.6	22.8	34.3
Spinal Cord	1492	80.8	18.5	27.5
Myelin Sheath	1006	19.9	50.5	146.8
Brain	4777	778.3	6.1	11.5
Optic Nerve	372	36.5	10.2	11.9

One improvement would be to exclude Uberon terms that don't exist in humans, such as venom (UBERON:0007113). Additionally, there are some Uberon–MeSH mapping issues that should be resolved soon, allowing us to update the analysis.

Daniel Himmelstein Researcher July 14, 2015

Disease–Disease Relationships

We computed disease similarities based on MEDLINE cooccurrences. Refer to this discussion for more information.

Daniel Himmelstein Researcher Nov. 20, 2016

Noting MRCOC

Heard about this through a tweet by @b_good. It appears that the National Library of Medicine precomputes literature co-occurrences for MeSH terms. See the page MEDLINE Co-Occurrences (MRCOC) Files.

This could replace some or all of the functionality of dhimmel/medline. However, I haven't actually looked into whether it's a user-friendly substitute. Just wanted to take note.