Rephetio: Repurposing drugs on a hetnet [rephetio]

How should we construct a catalog of drug indications?

We are looking to construct a catalog of indications (efficacious drug-disease pairs) with the following attributes (ordered by importance):

  1. automated and high-throughput construction
  2. high-quality, or varying levels of quality as long as quality level is annotated
  3. comprehensive
  4. disease modifying rather than symptomatic
  5. compounds which map to PubChem
  6. contraindications and adverse effects are excluded and cataloged separately
  7. diseases which map to the Disease Ontology
  8. source is retrievable

A few options we can consider:

  • LabeledIn — Curators manually identified indications from drug labels for 250 human prescription ingredients (drugs) [1].
  • MEDI — Indications extracted from RxNorm, SIDER 2, MedlinePlus, and Wikipedia were integrated into a single resource. The high-precision subset (indications in RxNorm or two other resources) includes 13,304 unique indications for 2,136 medications [2]. Further work added indication prevalence information [3]. MEDI compares favorably to SemRep for extracting indications from clinical text [4].
  • SemRep — "SemRep is a program that extracts semantic predications (subject-relation-object triples) from biomedical free text" [5]. SemRep has been used to extract TREAT relations from MeSH scope notes, Daily Med, DrugBank, and AHFS Consumer Medication Information [6]. SemRep has also been used to identify TREAT relations from Medline abstracts [7]. A project called SemMedDB provides the SemRep results from mining PubMed [8].
  • SPL-X (Structured Product Labels eXtractor) — Using MetaMap, this project extracted indications from DailyMed drug labels that were available as XML [9]. The data does not appear to be available.
  • Comparative Toxicogenomics Database [10] — Manual literature curators annotated drug-disease pairs as 'therapeutic'. The resource is extensive (the 'therapeutic' threshold was low) but incomplete.
  • SIDER 2 — In addition to extracting side effects from drug labels, SIDER also extracts indications [11]. Since the approach is automated, some side effects may be extracted as indications and vice versa. This approach would only provide information for drugs with labels from the US FDA or Canada.

Any additional resources or suggestions?

The initial LabeledIn [1] resource used expert curators. The team behind this project tested crowdsourced curation using Amazon Mechanical Turk workers [2]. They found the majority vote of workers on whether a disease within a label was an indication had a high accuracy (96%).
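As a rough illustration of the majority-vote step described above, here is a minimal sketch. The label names, disease mentions, and votes are made up for the example, and ties are treated as "not an indication", which is an assumption rather than something the study states.

```python
from collections import Counter


def majority_vote(votes):
    """Return True when strictly more than half of the workers voted 'indication'."""
    counts = Counter(votes)
    return counts[True] > len(votes) / 2


# Hypothetical worker judgments for (drug label, disease mention) pairs.
judgments = {
    ("label_A", "hypertension"): [True, True, False, True, True],
    ("label_A", "dizziness"): [False, False, True, False, False],
}

# Consensus call per pair: True means the crowd judged the mention an indication.
consensus = {pair: majority_vote(votes) for pair, votes in judgments.items()}
```

An odd number of workers per pair avoids ties altogether.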

They assessed 3004 indications not already in LabeledIn corresponding to 706 new drug labels. We are looking to increase the coverage of the initial LabeledIn dataset by adding these crowdsourced indications.

This dataset might be worth looking into: drug-indication links captured from physicians in an EHR system [1]. The data appears to be available, though it's in a 200+ page PDF (I'm sure that was a journal requirement).

Hey @b_good, thanks for the suggestion [1] and tracking down the data supplement, which I cannot find on the article's JAMIA page. Hereon, I will refer to this resource as ehrlink, unless anyone can find a previously-used or author-preferred nickname.

This resource is noteworthy because it will capture off-label usages better than LabeledIn (which is explicitly on-label) and MEDI (whose inclusion criteria likely favor on-label indications).

I converted the pdf file into a tsv file, which can be downloaded here.

You can access SemRep-extracted semantic relations (e.g. treats, causes) based on all PubMed abstracts (updated biannually) via the Semantic MEDLINE Database. With a UMLS login, you can get the complete MySQL dump. The main challenge here is ensuring quality (as with any NLP output).

  • Daniel Himmelstein: Added the reference to my initial post. Given the quality issues, I do not plan to include this resource in our gold standard set of indications. It could be helpful later as a literature-derived set of potential indications.

ehrlink problem and medication vocabularies

We have extracted the ehrlink [1] indication data (see above). Unfortunately, I am unfamiliar with the identifiers used for problems (diseases) and medications (drugs). I've posted a sample below in case anyone can figure them out.

63645    Complete D-transposition Of The Great Vessels
275590   Organic REM Sleep Behavior Disorder
62983    Arteriosclerotic Cardiovascular Disease (ASCVD)
75090    Cerebral Palsy
17938    Sodium Polystyrene Sulfonate Oral Powder
21707    Clotrimazole Anti-Fungal 1 % External Cream
18805    Niacin CR 1000 MG Oral Tablet Extended Release
19598    ClonazePAM 0.5 MG Oral Tablet
136143   AmLODIPine Besylate 2.5 MG Oral Tablet

My worry is that these identifiers may not correspond to a standardized vocabulary that we can access and easily map to. I will contact the authors for clarification.

Just to let you guys know that, at UNM, Oleg Ursu and I have been constructing such a catalog for nearly eight years.
Unfortunately, nobody funds this type of activity - or at least nobody has funded it so far - thus resources are somewhat limited.
Briefly, we manually curated all the active pharmaceutical ingredients APIs (over 4400; includes biologics), and mapped them to FDA approved drug labels (over 50000 ADLs).
From the ADLs one can extract/map indications, contra-indications, off-label indications... and to each API we mapped RxNorm [CUI], NPC, ATC, INN, plus targets, including numeric bioactivity & type [MoA related; non-MoA assigned; as well as non-human targets]. We also mapped all our diseases to DOIDs - however, there are about 800 or so left that will take us a while to map.
A few pointers:
1) if you want to extract the data yourselves, you're in for a treat. There are diseases in "indications" that do NOT exist anywhere else [e.g., cancer XYZ with mutation A3999B, in other words it's not enough to have the disease, you need the right genotype!];
2) you also have to deal with indications that are "fringe" (pregnancy is not a disease; neither is contraception)
3) indications etc. are not from PubMed - so please pay attention to approved labels
4) disease modifying is far from trivial - you need epi to show you that, X years after the Dx/Rx event, there was no recurrence [are steroids in anti-allergy disease modifying? probably not; are antibiotics in sinusitis disease-modifying? yes and no, if it's chronic!]

My colleagues and I have worked on multiple approaches to create this knowledge in the papers below:

  1. Wright A, Chen ES, Maloney FL. An automated technique for identifying associations between medications, laboratory results and problems. J Biomed Inform. 2010 Dec;43(6):891–901.
  2. McCoy AB, Wright A, Laxmisan A, Singh H, Sittig DF. A prototype knowledge base and SMART app to facilitate organization of patient medications by clinical problems. AMIA Annu Symp Proc. 2011;2011:888–94.
  3. McCoy AB, Wright A, Laxmisan A, Ottosen MJ, McCoy JA, Butten D, et al. Development and evaluation of a crowdsourcing methodology for knowledge base construction: identifying relationships between clinical problems and medications. J Am Med Inform Assoc. 2012 Oct;19(5):713–8.
  4. McCoy AB, Wright A, Rogith D, Fathiamini S, Ottenbacher AJ, Sittig DF. Development of a clinician reputation metric to identify appropriate problem-medication pairs in a crowdsourced knowledge base. J Biomed Inform. 2014 Apr;48:66–72.

In the JAMIA paper mentioned above, we used what we called a crowdsourcing approach to get this data. We have recently validated that approach at another site, and that publication is coming out in ACI soon. Unfortunately, in the original version, as you suspected, our medications and problems were not mapped to any standardized terminology. The identifiers are local to the EHR, and while we have made some attempts to map them to RxNorm and SNOMED-CT, we were never able to get a really accurate set. However, the validation uses data from a different EHR, which I believe can be more easily mapped. Once the paper is out, I'll see if I can share that data.

I find crowdsourcing useful when you use a team of experts. For example, a carefully selected team of experts working on the same problem can give surprisingly interesting feedback on an otherwise difficult problem. Note that this paper is not about data entry, but about polling experts for their opinion.

I professionally supervised data entry for chemical structures, chemical bioactivities, as well as controlled vocabulary descriptions for assays, indexing medicinal chemistry literature. The average trained person loading data had an error rate of 5-10% - errors varied with period (e.g., the closer to the deadline, usually Christmas, the worse the quality). We used a 3-layer quality control system. And even so, we had a 1-2% error in our database, as revealed by comparison with two other systems.
See this paper for details (mine is the WOMBAT database).

With this in mind, I want to point out that crowdsourcing problem-medication pairs by clinicians is an intriguing effort, and if the data is publicly available I would like to learn more. There are risks because a) verification of data entry was probably not done at the entry level (was the clinician familiar with both the drug and the disease?); b) the person determining the problem would require training in pharmacovigilance, understanding of known side-effects, etc. I assume you have done that, and that you compared the sets? I apologize that I do not have time to access your papers right now.

To clarify the crowdsourcing approach, in our study the clinicians are completing the task because it is required during routine care, not solely for the purpose of creating a knowledge base. They are entering the data into the EHR because they are prescribing a medication to a patient and are often required to link it to one or more of the patient's problems for billing purposes. We did not ask them to do any additional work outside of their own routine clinical practice.

thank you - was wondering about that. this does make their work more reliable.

My colleagues and I have worked on multiple approaches to create this knowledge in the papers [1, 2, 3]

@allisonmccoy, thanks for the references. I like your approach because it captures what clinicians are actually using to treat diseases (and can provide indication prevalence: what percent of patients with problem X receive medication Y). Too bad that the identifiers are local. We would definitely appreciate the validation data when available, especially if it can be mapped to standard terminologies.
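To make the notion of indication prevalence concrete, here is a minimal sketch. The input layout (a patient-to-problems mapping plus a list of clinician-entered links) and all patient identifiers are assumptions for illustration, not the format of the ehrlink data.

```python
from collections import defaultdict


def indication_prevalence(patient_problems, links):
    """Fraction of patients with a problem who have a given medication linked to it.

    patient_problems: dict mapping patient -> set of problems on the problem list
    links: iterable of (patient, problem, medication) tuples entered by clinicians
    """
    patients_with_problem = defaultdict(set)
    for patient, problems in patient_problems.items():
        for problem in problems:
            patients_with_problem[problem].add(patient)

    treated = defaultdict(set)
    for patient, problem, medication in links:
        treated[(problem, medication)].add(patient)

    prevalence = {}
    for (problem, medication), patients in treated.items():
        denominator = patients_with_problem[problem]
        if denominator:  # skip problems that never appear on a problem list
            prevalence[(problem, medication)] = len(patients & denominator) / len(denominator)
    return prevalence


# Hypothetical data: four patients with multiple sclerosis, two linked to modafinil.
patient_problems = {
    "p1": {"multiple sclerosis", "fatigue"},
    "p2": {"multiple sclerosis"},
    "p3": {"multiple sclerosis"},
    "p4": {"multiple sclerosis"},
}
links = [
    ("p1", "multiple sclerosis", "modafinil"),
    ("p2", "multiple sclerosis", "modafinil"),
    ("p1", "fatigue", "modafinil"),
]
prevalence = indication_prevalence(patient_problems, links)
```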

In terms of the mappings from the aforementioned study [2], we still may be able to extract some utility: for example, we could manually map indications for diseases where our indications were lacking. @allisonmccoy, did any of the other papers you highlighted release data that could add value here?

@TIOprea mentioned the difficulty of identifying disease-modifying indications, even in a carefully hand-curated database. @allisonmccoy, does your method favor disease-modifying links? For example, if modafinil were prescribed to treat MS-induced fatigue, would the clinicians link modafinil to multiple sclerosis or fatigue?

Did any of the other papers you highlighted release data that could add value here?

I don't believe so. I also omitted one more paper that validated the approach in Wright, et al.:

  1. Wright A, McCoy A, Henkin S, Flaherty M, Sittig D. Validation of an association rule mining-based method to infer associations between medications and problems. Appl Clin Inform. 2013;4(1):100–9.

The 2nd reference uses RxNorm, SNOMED-CT, and NDF-RT, all of which are freely available, so that knowledge base could easily be regenerated by another party.

Does your method favor disease-modifying links? For example, if modafinil were prescribed to treat MS-induced fatigue, would the clinicians link modafinil to multiple sclerosis or fatigue?

It could be either, but more than likely it would be linked to MS, because that's what would already be on the problem list and easily linked during e-prescribing. In our evaluation, we would have counted either as correct. We actually had a lot of discussion about this while doing the evaluations, because it occurred frequently.

@TIOprea, thanks for your insights. You touch on important points. In general our method may not require a perfect indication catalog to succeed, so I am hopeful despite the difficulties you mention. Specifically,

There are diseases in "indications" that do NOT exist anywhere else [e.g., cancer XYZ with mutation A3999B, in other words it's not enough to have the disease, you need the right genotype!]

In this case, "cancer XYZ with mutation A3999B" would likely not be in the Disease Ontology and if it were would probably lack cross-references. However, if the disease did map to the DO, we would propagate the indication to "cancer XYZ".
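The propagation idea can be sketched as follows: walk up the is_a hierarchy from the annotated term and keep the indication for any term that is in the set of included diseases. The hierarchy dictionary and term names below are toy placeholders, not actual Disease Ontology content.

```python
def ancestors(term, parents):
    """Return every ancestor of a term by following is_a edges.

    parents: dict mapping a term to the set of its direct parents.
    """
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(indications, parents, included):
    """Map each (compound, disease) pair onto the included terms at or above the disease."""
    propagated = set()
    for compound, disease in indications:
        for term in {disease} | ancestors(disease, parents):
            if term in included:
                propagated.add((compound, term))
    return propagated


# Toy hierarchy (hypothetical terms, not real DO identifiers).
parents = {
    "cancer XYZ with mutation A3999B": {"cancer XYZ"},
    "cancer XYZ": {"cancer"},
}
included = {"cancer XYZ"}
indications = {("drug_1", "cancer XYZ with mutation A3999B")}
result = propagate(indications, parents, included)
```

Indications whose disease has no included term at or above it simply drop out of the result.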

you also have to deal with indications that are "fringe" (pregnancy is not a disease; neither is contraception)

These indications would not make it into the network because they do not relate to an included disease term. Information loss is unfortunate, but we'll get over it.

indications etc. are not from PubMed - so please pay attention to approved labels

Thanks for the perspective. We won't include these as part of our gold standard.

disease modifying is far from trivial - you need epi to show you that

This I think will be the biggest difficulty. One option could be to exclude drugs that mostly treat symptoms. We noticed that drugs with many indications tended to be of this category. For multiple sclerosis, disease modifying is an established concept with currently 12 drugs. Unfortunately, the MS indications we've extracted from MEDI and LabeledIn are predominantly symptomatic. And to make matters worse, for most other diseases the DM status seems much more poorly defined.

The 2nd reference [1] uses RxNorm, SNOMED-CT, and NDF-RT, all of which are freely available, so that knowledge base could easily be regenerated by another party.

@allisonmccoy, I believe that when MEDI [2] extracts RxNorm indications, it takes the information from NDF-RT. I base this on the introduction, where they state:

The integration of RxNorm with the National Drug File–Reference Terminology (NDF-RT) from the Veterans Health Administration has added significant indication information between single-ingredient medications and diseases through ‘may_treat’ and ‘may_prevent’ therapeutic relationships. NDF-RT includes both on-label and off-label indications, but its performance on indications has not been previously reported. Preliminary work with earlier versions of RxNorm and NDF-RT demonstrated that a number of medications were lacking indications.

Then in the methods they state:

To obtain indications of a medication from RxNorm, we retrieved all diseases that connect with the medication through either ‘may_be_treated_by’ or ‘may_be_prevented_by’ relationships.

Do you know whether the RxNorm portion of MEDI relied on the same underlying NDF-RT data that you collected for the 2011 AMIA Proceedings Paper [1]?

Do you know whether the RxNorm portion of MEDI relied on the same underlying NDF-RT data that you collected for the 2011 AMIA Proceedings Paper [1]?

@dhimmel We only used the may_treat relationship, but we also took advantage of the is_a hierarchy for problems and ingredient_of relationships between medications and expanded the original set of pairs. So there is some overlap between the two, but likely some pairs that exist in only one or the other.

  • Daniel Himmelstein: Thanks for the clarification. We also plan to perform some indication propagation on the Disease Ontology hierarchy.

PREDICT Indications

An existing computational repurposing approach called PREDICT [1] compiled indications for its analysis. The authors describe their approach as:

The associations between drugs and UMLS disease concepts were integrated from four different sources using three different methods: (i) direct mapping to drugs, exploiting embedded UMLS links between concepts and drugs; (ii) drug–condition associations downloaded from, where conditions were mapped to UMLS concepts using MetaMap; and (iii) indication‐based mapping. For the latter, we extracted UMLS concepts using the MetaMap tool from textual drug indications downloaded from FDA package inserts (available in the DailyMed website) and DrugBank. In addition, we manually added 44 associations occurring in phase IV (post‐marketing) clinical trials.

... Finally, performing a manual curation of the extracted UMLS concepts from textual description of drug indications, we observed that they are more prone to false positives. We thus required that associations extracted from drug indications appear also in at least one more source.

Compounds are from DrugBank and diseases are from OMIM and the UMLS, which are both cross-referenced by the DO. The study does not report the precision of its indications, making it difficult to assess how the quality compares with MEDI-HPS and LabeledIn.
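The corroboration rule quoted above (text-mined associations must also appear in at least one other source) can be sketched like this. The source names and drug/concept identifiers are placeholders, and this is only an illustration of the rule, not the PREDICT authors' code.

```python
def supported_associations(associations_by_source, weak_source):
    """Keep associations unless their only support is the weak (text-mined) source.

    associations_by_source: dict mapping source name -> set of (drug, disease) pairs
    Returns a dict mapping each retained pair to the set of sources supporting it.
    """
    support = {}
    for source, pairs in associations_by_source.items():
        for pair in pairs:
            support.setdefault(pair, set()).add(source)
    return {
        pair: sources
        for pair, sources in support.items()
        if sources != {weak_source}
    }


# Hypothetical inputs; "indication_text" stands for the MetaMap-extracted associations.
associations_by_source = {
    "umls_links": {("drug_1", "concept_A")},
    "drug_condition_site": {("drug_2", "concept_B")},
    "indication_text": {("drug_1", "concept_A"), ("drug_3", "concept_C")},
}
kept = supported_associations(associations_by_source, weak_source="indication_text")
```

Keeping the supporting sources per pair also gives a crude confidence annotation for each indication.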

We combined the supplementary datasets from the study to create a table of PREDICT indications (notebook, download). We will further investigate including these indications.

Indication Set

Now that we have decided which diseases and compounds to include in the network, we can map indications onto these nodes.

Our indication set contains four indication resources:

  1. MEDI-HPS — indications from the MEDI's high precision subset
  2. MEDI-LPS — indications from the MEDI's low precision subset
  3. LabeledIn — drug label indications extracted by experts or Mechanical Turk workers
  4. PREDICT — indications compiled by the PREDICT study

We anticipate constructing our gold standard of indications from MEDI-HPS, LabeledIn, and PREDICT while omitting MEDI-LPS, which has a lower precision. We did not include ehrlink [1] because the vocabularies were not mapped. However, we would happily reward anyone who contributes a mapping of the problems to the DO and the medications to DrugBank.
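A minimal sketch of how the gold standard could be assembled from the per-resource sets: take the union of the higher-precision resources, and keep MEDI-LPS pairs separately when they are not already covered. The compound and disease identifiers are placeholders.

```python
def build_catalog(resources, high_confidence_sources):
    """Split (compound, disease) pairs into high- and low-confidence sets.

    resources: dict mapping resource name -> set of (compound, disease) pairs
    """
    high = set().union(*(resources[name] for name in high_confidence_sources))
    remaining = (
        pairs for name, pairs in resources.items()
        if name not in high_confidence_sources
    )
    # A pair counts as low-confidence only if no high-confidence resource has it.
    low = set().union(*remaining) - high
    return high, low


# Placeholder identifiers for illustration only.
resources = {
    "MEDI-HPS": {("compound_1", "disease_1"), ("compound_2", "disease_1")},
    "MEDI-LPS": {("compound_2", "disease_1"), ("compound_3", "disease_2")},
    "LabeledIn": {("compound_1", "disease_1")},
    "PREDICT": {("compound_4", "disease_3")},
}
high, low = build_catalog(resources, ["MEDI-HPS", "LabeledIn", "PREDICT"])
```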

Indication Links

We would still like a way to differentiate disease-modifying from symptomatic indications and will explore manually classifying a subset of indications and training a model.

Hi Daniel,

I spent some time on your problem of mapping the drug names from one arbitrary system to a known ontology. As it turns out, RxNorm offers an API with an endpoint that can be queried directly for fuzzy matching, so that's useful. It will be worth looking into the other endpoints of the API down the road; they provide many useful features (though they are poorly documented).

I wrote a script (in R) to match all the medication names in your file and get the related properties of the retrieved RxNorm concepts. It took over an hour to run because I throttled requests to stay within the API quotas. The fuzzy-matching API returns several rxcui matches for each medication. A score, ranging from 0 to 100, is attached to every match.

The main output file can have several concepts per medication name if (i) there is ambiguity, i.e. there is more than one best match for a medication, or (ii) the best match is imperfect, i.e. the best score is not 100 (in which case the first three are reported).
The final output is a subset of this file containing only the unambiguous hits, which are the large majority (so we have one concept per medication string).
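The selection logic described above can be sketched in Python (the original script is in R and is not reproduced here). The function takes the scored candidates that the fuzzy-matching API returned for one medication name. The candidate identifiers are placeholders, and how to handle a medication that is both tied and imperfect is an assumption: here a tie is simply reported as ambiguous.

```python
def select_matches(candidates):
    """Choose which matches to report for one medication name.

    candidates: list of (rxcui, score) tuples, with scores from 0 to 100.
    Returns a dict with the reported matches and the single final match (or None).
    """
    if not candidates:
        return {"reported": [], "final": None}

    ranked = sorted(candidates, key=lambda match: match[1], reverse=True)
    best_score = ranked[0][1]
    best = [match for match in ranked if match[1] == best_score]

    if len(best) > 1:
        # (i) Ambiguity: more than one best match, so no final concept is assigned.
        return {"reported": best, "final": None}
    if best_score < 100:
        # (ii) Imperfect best match: report the first three for later QC.
        return {"reported": ranked[:3], "final": ranked[0][0]}
    return {"reported": best, "final": ranked[0][0]}


perfect = select_matches([("rxcui_a", 100), ("rxcui_b", 80)])
imperfect = select_matches([("rxcui_a", 90), ("rxcui_b", 85), ("rxcui_c", 70), ("rxcui_d", 60)])
ambiguous = select_matches([("rxcui_a", 100), ("rxcui_b", 100)])
```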

Here are some numbers:
1. Only 2353 medications got matched with a valid concept, out of 2537 initially. Some names don't correspond to any medication and are filtered out.
2. From these 2353 matched medications, 2281 (97%) have an unambiguous first match. These are in the final output.
3. These 2281 unambiguous hits match to a total of 2148 different rxcuis.
4. 1490 (63%) medications have at least one perfect match, with 1471 (63%) being unambiguous.
5. These 1471 unambiguous perfect hits match to a total of 1442 different rxcuis.

QC is straightforward: compare the original medication names with the retrieved name of each matched rxcui. I quickly checked, and even the non-perfect matches (with a score below 100) seem on point, with the exception of the "therapies", which have very few equivalents in RxNorm and definitely match to the wrong concept.

Potential future directions:
1. Assess the quality of the matches through a systematic check based on the QC file mentioned above.
2. Enrich the final dataset by resolving ambiguity from the term types reported in the rxcui properties.

I went ahead and resolved the ambiguity, first using the term type and then the number of "atoms" that match each medication name.

This brings the number of remaining ambiguous medications down from 72 to 11 (0.5%).

I understand you want to extract the ingredients from these concepts, so it doesn't necessarily matter if there are two "top" matches for one medication after trying to resolve ambiguity (both will likely lead to the same components). As a result, I created both a file for the successfully resolved matches and a file for all the best matches after trying to resolve them. The latter covers 100% of the medications, including the 11 ambiguous ones, for which I arbitrarily took one of the top concepts. This is the file you'll want to work from in the future.

I also created the QC file for the ambiguity resolution step.
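For readers who want a feel for this tie-breaking step, here is a minimal Python sketch: tied candidates are ranked by a term type priority and then by the number of matching atoms. The priority order below is purely hypothetical (SCD, SBD, and IN are RxNorm term types, but the order actually used is not reproduced here), and the candidate records are made up.

```python
# Hypothetical priority, high to low. This is NOT the list used for the real mapping.
HYPOTHETICAL_TTY_PRIORITY = ["SCD", "SBD", "IN"]


def resolve_ties(tied, priority=HYPOTHETICAL_TTY_PRIORITY):
    """Return the candidates that remain tied after both tie-breakers.

    tied: list of dicts with keys "rxcui", "tty" (term type), and "atoms"
          (number of atoms matching the medication name).
    """
    def rank(candidate):
        tty = candidate["tty"]
        tty_rank = priority.index(tty) if tty in priority else len(priority)
        # Lower tuples sort first: preferred term type, then more matching atoms.
        return (tty_rank, -candidate["atoms"])

    ranked = sorted(tied, key=rank)
    best_rank = rank(ranked[0])
    return [candidate for candidate in ranked if rank(candidate) == best_rank]


tied = [
    {"rxcui": "rxcui_x", "tty": "IN", "atoms": 5},
    {"rxcui": "rxcui_y", "tty": "SCD", "atoms": 2},
    {"rxcui": "rxcui_z", "tty": "SCD", "atoms": 4},
]
winners = resolve_ties(tied)
```

A result with more than one element means the medication is still ambiguous after both criteria.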

  • Antoine Lizee: As a side note, I removed all the row name columns in the CSVs, because they were annoying for display on GitHub. You might have to slightly change your code to read the new format.

  • Daniel Himmelstein: For reference, @alizee used the following term type priority list (high to low):


Revised indications which include ehrlink

We were able to collaboratively map ehrlink to RxNorm and the DO.

Our indication catalog, which only includes DO slim diseases and approved small molecules in DrugBank, now contains:

  • 1,386 high-confidence indications retrieved from MEDI-HPS [1], LabeledIn [2, 3], PREDICT [4], and ehrlink [5] covering 96 diseases and 602 compounds
  • 1,113 low-confidence indications retrieved from MEDI-LPS [1]

The combined high and low-confidence indication set covers 107 diseases and 744 compounds. For more information see the notebook, table of indications with resource info, or table of collapsed indications.

Our validation manuscript has been published, @dhimmel:

I'll see what I can do about sharing the data, but unfortunately I've got travel coming up along with several deadlines, so it may be a little while longer before I'm able to do that.

Expert curation of the indication catalog

We have decided to filter our catalog for disease-modifying indications and are seeking an expert curator to assist with this task. We started a new discussion for this next step.

@allisonmccoy, have you thought more about releasing the data from your recent publication [1]? If you can do this in the next week or two, we would be thrilled to include this data. Otherwise we will have to move ahead with only the ehrlink data from your initial study [2].

PharmacotherapyDB Version 1.0

We completed physician curation for the time being and released the first version of our indications catalog called PharmacotherapyDB.

Thanks @b_good, @TIOprea, @allisonmccoy, @ritukhare, and @alizee — your suggestions and feedback were immensely helpful!

We'll keep this discussion alive for any suggestions of new resources or methods to improve future versions of PharmacotherapyDB.

Therapeutic Target Database

The Therapeutic Target Database (TTD) is a target focused resource with pharmacological relationships. @janispi suggested we check out TTD as a source of drug–disease therapies.

Specifically, TTD has a dataset of indications, which range from approved to investigational, available online (drug-disease_TTD2016.txt). I couldn't find how these indications were constructed from their publications [1, 2, 3, 4, 5], although I may have missed it. I emailed Professor Yu Zong, who indicated their drug-disease relationships were human curated.

Just wanted to note this information, so we remember to keep TTD in mind.

Update July 16, 2016: Qin Chu provided me the following additional information via email:

The mapping between diseases and drugs were done manually. We searched different sources of literature such as pharmacology textbooks, review articles and research papers. The methods to extract the related drug target and disease information from literature were described in the 2012 version of TTD update paper [3]. We mapped the disease information to ICD code in the 2014 update of TTD [4].

Cheng et al 2014

A 2014 study titled "Systematic evaluation of connectivity map for disease indications" compiled 890 indications between 152 drugs and 145 diseases [1]. They compiled the indications from FAERS and Pharmaprojects. The indications are available as free text in Table S2 of the supplementary word document. I copied Table S2 into a TSV available here.

Recently, three resources have been published that provide catalogs of indications and drug repurposing examples.


repoDB, according to its website [1]:

contains a standard set of drug repositioning successes and failures that can be used to fairly and reproducibly benchmark computational repositioning methods. repoDB data was extracted from DrugCentral and

The data is available on figshare [2] under a CC BY 4.0 license and contains "1,571 drugs and 2,051 UMLS disease concepts, accounting for 6,677 approved and 4,123 failed drug-indication pairs." This dataset will be useful for distinguishing clinical trial indications that result in the following statuses: "Approve, Program Terminated, Not Approved, or Trial Halted". Unfortunately, there is no machine-readable way to determine whether the failures resulted from lack of efficacy.


The RepurposeDB database was recently released by the Dudley Lab [3]. RepurposeDB aims to catalog historical instances of drug repurposing in a machine readable and standardized format.

The database is at its core a bunch of triples consisting of a:

  1. drug
  2. primary indication
  3. secondary indication

where primary/secondary indication is defined as:

Primary indication refers to the original disease indication for which the drug is targeted, and secondary indication indicates any subsequent indications

The first version (v1) of the resource (dated March 30, 2016) contains 253 drugs (188 small molecules & 65 biologics), 1125 indications, and 3660 data triples. The code is available on bitbucket and some data is available on figshare [4]. Unfortunately, the actual catalog of triples is not available on Figshare, but the authors provided us with the datasets without any restrictions attached (i.e. licensed under CC0).

Drug-Indication Database

Scientist(s) at Merck created the drug-indication database (DID) by integrating 12 resources [5]. While some of this resource is proprietary, much of it has been released via figshare under a CC BY license [6]. The dataset with indications looks a bit complex, but with some munging there are likely many good indications in there.

Status: Completed
Cite this as
Daniel Himmelstein, Benjamin Good, Tudor Oprea, Allison McCoy, Antoine Lizee (2015) How should we construct a catalog of drug indications?. Thinklab. doi:10.15363/thinklab.d21

Creative Commons License