Human Symptom Disease Network-MeSH ID Matching

Leo Brueggeman, Daniel Himmelstein

doi:10.15363/thinklab.d52

Rephetio: Repurposing drugs on a hetnet [rephetio]

	Leo Brueggeman Researcher April 9, 2015 One of the edge types we plan to incorporate in our network is the one between diseases and symptoms. These data will come from the work of Zhou et al. in their "Human symptoms–disease network" paper. The supplementary data released by Zhou et al. identify diseases and symptoms by their MeSH names but do not include the associated MeSH IDs. To ease interoperability, we have performed the minor task of appending the relevant MeSH IDs to these files. The result can be found here.
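As a rough illustration of the name-to-ID step described in the post above, here is a minimal Python sketch. It is not the code used to produce the released files. The lookup table, the field names `mesh_name` and `mesh_id`, and the case-insensitive matching are all assumptions made for the example; a real lookup would be built from a MeSH release, and the two identifiers shown should be checked against it.

```python
# Hypothetical name-to-ID lookup; in practice this would be built from a MeSH release.
mesh_name_to_id = {
    "Multiple Sclerosis": "D009103",
    "Fever": "D005334",
}


def append_mesh_ids(rows, name_field, id_field, lookup):
    """Return copies of rows with id_field filled from lookup (None if unmatched).

    Matching ignores case and surrounding whitespace, which is an assumption
    of this sketch rather than a documented property of the source files.
    """
    normalized = {name.strip().lower(): mesh_id for name, mesh_id in lookup.items()}
    annotated_rows = []
    for row in rows:
        new_row = dict(row)
        new_row[id_field] = normalized.get(row[name_field].strip().lower())
        annotated_rows.append(new_row)
    return annotated_rows


# Toy input standing in for rows of the supplementary files.
rows = [
    {"mesh_name": "Multiple Sclerosis"},
    {"mesh_name": "fever "},
    {"mesh_name": "Not A Real Term"},
]
annotated = append_mesh_ids(rows, "mesh_name", "mesh_id", mesh_name_to_id)

# Names that could not be matched are worth reviewing by hand.
unmatched = [row["mesh_name"] for row in annotated if row["mesh_id"] is None]
```

Collecting the unmatched names separately makes it easy to spot terms whose spelling differs between the supplementary files and the MeSH release.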
	Daniel Himmelstein Researcher April 20, 2015 We mapped the MeSH diseases from the HSDN [1] to our slim DO. See the notebook for more info or download the mapped data. Each disease-symptom relationship includes a `tfidf_score` (term frequency-inverse document frequency). This score, $$w_{i,j}$$, between symptom $$i$$ and disease $$j$$ was calculated with: $$$ w_{i,j} = W_{i,j} \times \log{\frac{N}{n_i}} $$$ where $$W_{i,j}$$ is the number of co-occurrences of symptom $$i$$ and disease $$j$$ in PubMed, $$N$$ is the total number of diseases, and $$n_i$$ is the number of diseases in which symptom $$i$$ appears. At some point we will set an inclusion threshold for symptom edges based on their `tfidf_score`. We used a propagated slim DO mapping, so symptoms for the MeSH term "relapsing-remitting multiple sclerosis", for example, were included as symptoms for the DO term "multiple sclerosis".
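To make the formula above concrete, here is a minimal Python sketch that computes $$w_{i,j}$$ from a table of co-occurrence counts. The function name, the toy counts, and the disease labels are hypothetical. The sketch also assumes a natural logarithm and takes $$N$$ to be the number of diseases present in the input, neither of which is stated in the post.

```python
import math


def tfidf_scores(cooccurrence):
    """Compute w_ij = W_ij * log(N / n_i) for each (symptom, disease) pair.

    cooccurrence maps (symptom, disease) -> W_ij, the co-occurrence count.
    N is taken as the number of distinct diseases in the input (assumption).
    The natural logarithm is used (assumption; the base is not specified).
    """
    positive = {pair: count for pair, count in cooccurrence.items() if count > 0}
    n_diseases = len({disease for _, disease in positive})

    # n_i: number of diseases in which symptom i appears at least once.
    diseases_per_symptom = {}
    for symptom, disease in positive:
        diseases_per_symptom.setdefault(symptom, set()).add(disease)

    return {
        (symptom, disease): count
        * math.log(n_diseases / len(diseases_per_symptom[symptom]))
        for (symptom, disease), count in positive.items()
    }


# Toy counts: "fever" co-occurs with all three diseases, "tremor" with only one.
toy_counts = {
    ("fever", "disease_a"): 10,
    ("fever", "disease_b"): 5,
    ("fever", "disease_c"): 1,
    ("tremor", "disease_a"): 4,
}
scores = tfidf_scores(toy_counts)
```

In this toy input, "fever" appears with every disease, so its inverse document frequency term is $$\log(3/3) = 0$$ and all of its scores vanish, while "tremor" is specific to one disease and keeps a positive score.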
	Daniel Himmelstein Researcher April 28, 2015 The above formula used to calculate `tfidf_score` adjusts for the frequency of the symptom but not for the frequency of the disease. Therefore, we speculate that the scores are comparable within but not across diseases. Since we want to adopt a single inclusion threshold for all symptom-disease pairs, we would like to reformulate the metric to adjust for disease frequency. We added a new visualization and table to investigate a disease-frequency bias. It appears that diseases that occur in more PubMed records tend to have more symptoms exceeding a given `tfidf_score` threshold.
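The kind of check described above can be sketched in a few lines of Python: count, for each disease, how many symptoms meet a single score threshold. This is only an illustration, not the notebook analysis itself. The function name, the scores, the threshold, and the disease labels are all hypothetical.

```python
from collections import Counter


def symptoms_above_threshold(scores, threshold):
    """Count, per disease, the symptoms whose score is at least threshold.

    scores maps (symptom, disease) -> tfidf_score.
    """
    counts = Counter()
    for (symptom, disease), score in scores.items():
        if score >= threshold:
            counts[disease] += 1
    return counts


# Hypothetical scores: both diseases have the same relative symptom profile,
# but "common_disease" has ten times the co-occurrence counts, so its scores
# are ten times larger under w_ij = W_ij * log(N / n_i).
toy_scores = {
    ("symptom_1", "rare_disease"): 2.0,
    ("symptom_2", "rare_disease"): 0.5,
    ("symptom_1", "common_disease"): 20.0,
    ("symptom_2", "common_disease"): 5.0,
}
counts = symptoms_above_threshold(toy_scores, threshold=3.0)
```

Because $$w_{i,j}$$ scales linearly with $$W_{i,j}$$, the more frequently recorded disease clears the threshold for both symptoms while the rarer one clears it for none, even though their symptom profiles are proportional.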
	Daniel Himmelstein Researcher April 30, 2015 Below, I've copied the supplementary methods section from the HSDN [1] describing how the literature mining was accomplished. I think a similar method could help us if we choose to perform our own text mining for network population.
We use the Medical Subject Headings (MeSH) [2] terminology to generate symptom-disease relationships from the metadata extracted from PubMed [3] bibliographic records. PubMed is currently the most comprehensive literature database on biomedical sciences. It includes MEDLINE [4] and uses MeSH for each citation to facilitate information retrieval. MeSH is a controlled thesaurus that is used for the annotation of published articles, resulting in a high quality representation of their main topics and contributions. The MeSH terms are assigned manually by trained indexers and have been used in numerous biomedical text mining and literature-based discovery studies [5, 6, 7, 8]. We downloaded the 2011 ASCII version of MeSH that contains 26,142 distinct terms and their unified identifiers. The MeSH vocabulary is structured as a hierarchical tree with 16 top nodes, representing general categories, such as ‘Anatomy’, ‘Diseases’ and ‘Phenomena and Processes.’ The broad category ‘Diseases’ contains the sub-category ‘Symptoms and Signs’ (MeSH tree code C23.888) that incorporates terms related to clinical manifestations observed by physicians or perceived by patients. We used all terms contained in the ‘Disease’ category (Table S1), excluding ‘Animal diseases’, as well as twenty terms, which only represent unspecific disease information, such as ‘Diseases’ itself, ‘Syndrome’, ‘Chronic diseases’ and ‘Infection’. In total, we obtained 4,442 distinct MeSH disease terms and 327 distinct MeSH symptom terms to be used for the PubMed query.
To ensure that we only retrieve records with the corresponding indexed disease terms as a major topic, we search MEDLINE with the constraint “[Majr:NoExp]”, which filters for bibliographic records with the study of a specific disease as a main contribution. Using the E-Utility API web service interface of the National Center for Biotechnology Information, we developed a JAVA program to automatically search all MEDLINE bibliographic records published between 1966 and October 2011 (Figure S4). The total number of corresponding PubMed records was 7,109,429, of which 6,553,494 included a disease and 1,405,038 a symptom term. The number of records that contain both a disease, as well as a symptom term was 849,103. They included all 4,442 MeSH disease terms and almost all (322, i.e. 98%) symptom terms.
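If we do perform our own text mining, a query of the kind quoted above could be assembled along these lines. This is a Python sketch, not the authors' Java program. The function name and defaults are hypothetical, and restricting by publication date (`datetype=pdat`) and requesting only a count (`retmax=0`) are assumptions about how the date window and record totals might be obtained. The sketch only builds the ESearch URL and does not contact NCBI.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_majr_query(mesh_term, mindate="1966", maxdate="2011/10"):
    """Build an ESearch URL for records indexed with mesh_term as a major topic.

    The [Majr:NoExp] tag restricts matches to the term itself as a major
    topic, without expanding to narrower terms in the MeSH tree.
    The date window mirrors the one in the quoted methods; filtering on
    publication date is an assumption of this sketch.
    """
    params = {
        "db": "pubmed",
        "term": f'"{mesh_term}"[Majr:NoExp]',
        "datetype": "pdat",
        "mindate": mindate,
        "maxdate": maxdate,
        "retmax": 0,  # no ID list needed when only the record count is of interest
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = build_majr_query("Multiple Sclerosis")

# Decode the query string again to confirm what would be sent.
query = parse_qs(urlsplit(url).query)
```

Keeping URL construction separate from the network call makes the query easy to inspect and to rate-limit when looping over thousands of MeSH terms.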