The National Library of Medicine (NLM) produces a catalog of 23 million journal articles called PubMed. PubMed contains two subsets that are relevant for literature mining:
PubMed Central (PMC) — 3.4 million articles that include full texts, rather than just abstracts.
MEDLINE — 21 million articles that are manually annotated with their topics. Topics are chosen from the MeSH vocabulary. 5,594 journals are currently indexed.
MeSH, which stands for Medical Subject Headings, is a broad terminology of ~27 thousand terms structured hierarchically to form an ontology. Skilled subject analysts at the NLM typically assign 10–12 MeSH terms per article and denote a subset of these terms as major topics.
Text mining, as suggested to us by @b_good, is an intriguing technique because it is widely applicable and draws from a knowledge base of epic proportions.
We would like to infer relationships between nodes in our network based on MEDLINE cooccurrence. We will search for pairs of MeSH terms that are assigned to the same articles more often than would be expected if the terms were unrelated. This approach has successfully identified disease symptoms (browse results). The method is versatile and can be applied to any nodes that have been mapped to MeSH.
First we created a disease set of 119 MeSH terms that mapped to DO slim diseases (tsv of diseases). Next, we created a symptom set of 438 MeSH terms by finding all descendants of D012816 (Signs and Symptoms) (notebook, tsv of symptoms).
For each disease, we identified the articles where that disease was a major topic. For each symptom, we identified the articles where that symptom was a topic. We then identified the articles that contained both a disease major topic and symptom topic. We based further analysis only on these 392,397 articles that contain at least one disease–symptom cooccurrence.
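The article selection above can be sketched with plain set operations. This is only an illustration under stated assumptions: the article IDs, term assignments, and the helper `articles_by_term` are hypothetical toy values, not the actual MEDLINE extract or our notebook code.

```python
from collections import defaultdict

# Hypothetical toy data: article ID -> MeSH terms marked as major topics,
# and article ID -> all assigned MeSH terms (major or not).
major_topics = {
    "pmid1": {"Multiple Sclerosis"},
    "pmid2": {"Multiple Sclerosis", "Asthma"},
    "pmid3": {"Asthma"},
    "pmid4": set(),
}
all_topics = {
    "pmid1": {"Multiple Sclerosis", "Fatigue"},
    "pmid2": {"Multiple Sclerosis", "Asthma", "Cough"},
    "pmid3": {"Asthma"},
    "pmid4": {"Fatigue"},
}
diseases = {"Multiple Sclerosis", "Asthma"}
symptoms = {"Fatigue", "Cough"}


def articles_by_term(term_sets, terms):
    """Map each term of interest to the set of articles it is assigned to."""
    index = defaultdict(set)
    for pmid, assigned in term_sets.items():
        for term in assigned & terms:
            index[term].add(pmid)
    return index


# Diseases must be major topics; symptoms may be any topic.
disease_articles = articles_by_term(major_topics, diseases)
symptom_articles = articles_by_term(all_topics, symptoms)

# Keep only articles with at least one disease major topic
# and at least one symptom topic.
disease_pmids = set().union(*disease_articles.values())
symptom_pmids = set().union(*symptom_articles.values())
cooccurrence_pmids = disease_pmids & symptom_pmids
```

In this toy example only `pmid1` and `pmid2` survive the filter, which plays the role of the 392,397-article universe in the real analysis.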
For each symptom–disease pair, we calculated:
cooccurrence — the number of articles where the disease and symptom terms cooccurred.
expected — the number of expected cooccurrences by chance based on each term's marginal frequency.
enrichment — cooccurrence divided by expected.
odds_ratio — the odds of cooccurrence divided by the odds of expected. This calculation appears to be slightly messed up due to non-integer expected counts.
p_fisher — the p-value from Fisher's exact test evaluating whether the observed cooccurrence exceeded that expected by chance.
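As a rough sketch of the first three metrics, the snippet below assumes that "expected" is the product of the two marginal article counts divided by the total number of articles, which is the usual independence estimate. The function name `cooccurrence_metrics` and the toy sets are hypothetical, and both sets are assumed to already be restricted to the article universe.

```python
def cooccurrence_metrics(disease_pmids, symptom_pmids, total):
    """Compute cooccurrence, expected, and enrichment for one pair.

    Assumes expected = n_disease * n_symptom / total, i.e. the count
    we would see if the two terms were assigned independently.
    """
    cooccurrence = len(disease_pmids & symptom_pmids)
    expected = len(disease_pmids) * len(symptom_pmids) / total
    enrichment = cooccurrence / expected if expected else float("nan")
    return {
        "cooccurrence": cooccurrence,
        "expected": expected,
        "enrichment": enrichment,
    }


# Toy example: 12 articles, 4 with the disease, 3 with the symptom, 2 with both.
metrics = cooccurrence_metrics({1, 2, 3, 4}, {3, 4, 5}, total=12)
```

Note that `expected` is generally non-integer, which is why it cannot be dropped directly into a contingency table of counts.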
@apankov, can you comment on the Fisher's exact test and whether there is a superior way to identify terms that significantly cooccur?
@b_good or others: do you know of better metrics for literature mining? One issue is that our approach may miss common symptoms that are not greatly enriched for any particular disease. The HSDN study used a TF-IDF measure, but we require metrics that are comparable across diseases.
I think the Fisher's exact test will be accepted well by reviewers, but Barnard's test could be a good alternative. Otherwise, if you can calculate a p-value based on permutation (or get bootstrapped estimates for the variance of the number of expected cooccurrences), that could be an easy, straightforward approach.
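One way the permutation idea might look, as a minimal sketch: shuffle the symptom labels across articles and count how often the permuted cooccurrence reaches the observed one. The function name `permutation_p_value`, the per-article boolean representation, and the add-one correction are assumptions for illustration, not something already implemented in the analysis.

```python
import random


def permutation_p_value(has_disease, has_symptom, n_permutations=2000, seed=0):
    """One-sided permutation p-value for a disease-symptom cooccurrence.

    has_disease / has_symptom are parallel lists of booleans, one per article.
    """
    rng = random.Random(seed)
    observed = sum(d and s for d, s in zip(has_disease, has_symptom))
    shuffled = list(has_symptom)
    hits = 0
    for _ in range(n_permutations):
        # Shuffling breaks any real association but keeps both marginals fixed.
        rng.shuffle(shuffled)
        permuted = sum(d and s for d, s in zip(has_disease, shuffled))
        if permuted >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Toy example: 100 articles where disease and symptom always coincide.
labels = [True] * 10 + [False] * 90
p_value = permutation_p_value(labels, labels)
```

The smallest p-value this can report is 1 / (n_permutations + 1), so ranking strongly associated pairs would need many permutations.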
However, it has occurred to me that in our above post, we incorrectly created the contingency table for the exact test. We now construct it similarly to Table 1 of this paper so that the contingency table is:
$$\begin{bmatrix} a & b \\ c & d \end{bmatrix}$$
a is the number of studies with both the disease and the symptom (cooccurrence)
b is the number of studies with the disease and without the symptom
c is the number of studies without the disease and with the symptom
d is the number of studies without either the disease or symptom
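With the table laid out this way, the test and the odds ratio can be computed from integer counts alone. The following is a minimal pure-Python sketch, assuming a one-sided test (cooccurrence greater than expected); the function names are hypothetical, and in practice a statistics library would likely be used instead.

```python
from math import exp, lgamma


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact p-value: P(X >= a) with all margins fixed."""
    with_disease = a + b
    with_symptom = a + c
    total = a + b + c + d
    log_denominator = log_comb(total, with_symptom)
    p_value = 0.0
    # Sum hypergeometric probabilities for tables at least as extreme as observed.
    for x in range(a, min(with_disease, with_symptom) + 1):
        log_p = (
            log_comb(with_disease, x)
            + log_comb(total - with_disease, with_symptom - x)
            - log_denominator
        )
        p_value += exp(log_p)
    return min(p_value, 1.0)


def odds_ratio(a, b, c, d):
    """Sample odds ratio (a * d) / (b * c); infinite when b or c is zero."""
    return (a * d) / (b * c) if b * c else float("inf")


# Toy 2x2 table: a=3, b=1, c=1, d=3.
p_fisher = fisher_exact_greater(3, 1, 1, 3)
ratio = odds_ratio(3, 1, 1, 3)
```

Because every cell is an observed count, the odds ratio here avoids the non-integer expected counts that affected the earlier calculation.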
The Uberon ontology of anatomical structures includes MeSH cross-references. Thus, we applied the MEDLINE cooccurrence analysis described above to find relationships between diseases and anatomical structures (notebook, tsv download).
The ability of this method to capture disease localization was exceptional. For example, the top five terms by p-value for multiple sclerosis were:
Central Nervous System
One improvement would be to exclude Uberon terms that don't exist in humans, such as venom (UBERON:0007113). Additionally, there are some Uberon–MeSH mapping issues that should be resolved soon, allowing us to update the analysis.
We computed disease similarities based on MEDLINE cooccurrences. Refer to this discussion for more information.