Project:
Rephetio: Repurposing drugs on a hetnet [rephetio]

Evaluation framework


What data are you planning to use to validate your framework? E.g. what are the positive controls? Presumably you would have some collection of drug-disease pairs that would be divided into development, training, and testing sets?

Evaluation with known indications

@b_good, the primary means of evaluation will be assessing performance on a masked subset of indications. Previously [1], we withheld 25% of observations for testing. During training (on the remaining 75% of observations), we can use cross-validation to identify optimal parameter values. We plan to measure performance using area under the ROC curve (AUROC). We will also consider using condensed-ROC curves [2], which emphasize top predictions.
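
As a rough sketch of this setup, the snippet below withholds 25% of compound-disease observations for testing and uses cross-validation on the remaining 75% to tune a classifier before reporting training and testing AUROC. The synthetic features and the logistic regression model are stand-ins, not our actual pipeline.

```python
import numpy
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Stand-in features and labels: one row per compound-disease pair,
# label 1 if the pair is a known indication (hypothetical data).
rng = numpy.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = rng.binomial(1, 0.1, size=1000)

# Withhold 25% of observations for final testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Cross-validation on the remaining 75% to choose the regularization strength
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={'C': [0.01, 0.1, 1.0, 10.0]},
    scoring='roc_auc',
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
grid.fit(X_train, y_train)

# Comparable training and testing AUROCs suggest limited overfitting
train_auroc = roc_auc_score(y_train, grid.predict_proba(X_train)[:, 1])
test_auroc = roc_auc_score(y_test, grid.predict_proba(X_test)[:, 1])
print(f'train AUROC = {train_auroc:.3f}, test AUROC = {test_auroc:.3f}')
```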

The discussion you noted covers how to construct a catalog of high-confidence indications, also known as a gold standard. What we haven't discussed thus far is how to create a negative set, which is a necessary input for our classification approaches. The simplest way to generate negatives is to treat all non-positives as negatives: if a compound is not indicated for a disease, the compound-disease pair is considered a negative. Since our positive set is incomplete, some true but unknown indications will be labeled as negatives. Given that the overwhelming majority of negatives will truly be negatives, I expect the impact of these mislabeled negatives to be minimal. However, our past experience shows that many people find this response unsatisfactory and would prefer us to exclude potential positives from the negative set. We will probably do that for this project, for example by omitting compound-disease pairs that are in the low-precision subset of MEDI [3].
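
A minimal sketch of this labeling scheme, with toy identifiers standing in for the real catalogs: every compound-disease pair outside the positive set is labeled negative, except pairs flagged as potential positives (for example, the MEDI low-precision subset), which are omitted entirely.

```python
import itertools
import pandas

# Hypothetical inputs; in practice these come from the gold standard and MEDI.
compounds = ['methotrexate', 'metformin']
diseases = ['rheumatoid arthritis', 'type 2 diabetes']
positives = {('methotrexate', 'rheumatoid arthritis'),
             ('metformin', 'type 2 diabetes')}
potential_positives = {('metformin', 'rheumatoid arthritis')}  # e.g. MEDI low-precision pairs

rows = []
for compound, disease in itertools.product(compounds, diseases):
    pair = (compound, disease)
    if pair in potential_positives and pair not in positives:
        continue  # omit potential positives rather than mislabel them as negatives
    rows.append({'compound': compound, 'disease': disease,
                 'status': int(pair in positives)})

pair_df = pandas.DataFrame(rows)
print(pair_df)
```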

In our previous project [1], we found that our classification approach was resistant to overfitting: our training and testing AUROCs were comparable. We should still maintain a formal testing paradigm as good practice, but there is a larger issue that I will discuss in my next post.

Evaluation with novel indications

Experienced chemoinformaticians stress that impressive testing performance on known positives does not always translate to predicting novel positives. There are several causes:

  • the patterns behind established positives are not generative: they do not extend to unknown positives.
  • predictions are made for instances that are not well represented in the training set. Understanding the applicability domain of your model is crucial here (see the sketch after this list).
  • the current set of positives mirrors the current state of knowledge. If the method does not incorporate untapped knowledge, novel predictions may not be possible.
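
For the second point, a minimal applicability-domain check might flag prediction instances that sit unusually far from the training data in feature space. The features, neighbor count, and 95th-percentile cutoff below are illustrative assumptions, not settled choices.

```python
import numpy
from sklearn.neighbors import NearestNeighbors

rng = numpy.random.default_rng(0)
X_train = rng.normal(size=(500, 20))          # hypothetical training features
X_new = rng.normal(size=(50, 20), scale=2.0)  # hypothetical prediction instances

nn = NearestNeighbors(n_neighbors=5).fit(X_train)

# Distance of each training point to its 5 nearest training neighbors (excluding itself)
train_dist = nn.kneighbors(X_train, n_neighbors=6)[0][:, 1:].mean(axis=1)
threshold = numpy.percentile(train_dist, 95)

# Prediction instances whose mean neighbor distance exceeds the threshold
# fall outside the applicability domain and warrant extra caution.
new_dist = nn.kneighbors(X_new)[0].mean(axis=1)
outside_domain = new_dist > threshold
print(f'{outside_domain.sum()} of {len(X_new)} instances outside the applicability domain')
```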

These issues are one reason why the community places such emphasis on novel discovery when appraising new methods. Absent experimental verification of our top predicted indications, there are a few approaches we could consider that begin to assess our ability to predict the "unknown". Several potential approaches were explored by a past repurposing study [1]:

  • Predicting indications currently undergoing clinical trials
  • Predicting potential indications that were not in our gold standard. For example, those indications in the MEDI low-precision subset or those identified using literature mining.

These approaches are both imperfect, but they are a start. Ideally, we could experimentally evaluate several predictions. This work could potentially be outsourced and thus parallelized.
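
As an example of the clinical-trials check, the sketch below restricts to pairs absent from our gold standard and asks whether pairs currently under trial receive higher predicted probabilities than the remaining pairs. The column names, scores, and trial set are hypothetical.

```python
import pandas
from sklearn.metrics import roc_auc_score

# Hypothetical predictions table: one row per compound-disease pair
predictions = pandas.DataFrame({
    'compound': ['A', 'A', 'B', 'B', 'C', 'C'],
    'disease':  ['X', 'Y', 'X', 'Y', 'X', 'Y'],
    'gold_standard': [1, 0, 0, 0, 0, 0],
    'prediction':    [0.9, 0.4, 0.7, 0.1, 0.6, 0.2],
})
trial_pairs = {('B', 'X'), ('C', 'X')}  # hypothetical pairs in clinical trials

# Restrict to non-gold-standard pairs and treat trial pairs as positives
novel = predictions[predictions.gold_standard == 0].copy()
novel['in_trial'] = [
    (c, d) in trial_pairs for c, d in zip(novel.compound, novel.disease)]
auroc = roc_auc_score(novel.in_trial, novel.prediction)
print(f'AUROC for discriminating trial indications: {auroc:.3f}')
```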

Just saw "A review of validation strategies for computational drug repositioning" [1]. This review offers the following recommendation:

creating a true ‘gold standard’ that contains both repositioning successes and failures is one way to improve consistency in the field, and allows for equitable comparisons between methods. We believe that such a ‘gold standard’ database can improve the accuracy of drug repositioning methods and increase the probability of success in clinical trials.

For this project [2], we created PharmacotherapyDB as a gold standard database that distinguishes disease-modifying treatments [3, 4]. While PharmacotherapyDB was not designed to compile "repositioning successes and failures", it can still be used to assess the performance of drug efficacy predictions as suggested by the authors:

We propose a new direction in repositioning validation through the creation of a repositioning database to promote reproducible calculations of sensitivity and specificity.

Essentially, predicting whether a drug treats a disease is a superset of the drug repurposing problem. Therefore, I don't think it's problematic that we don't assess our predictions on repurposing successes versus failures. However, I do know that there are treatment catalogs in the works by other groups that focus solely on drug repurposing, rather than all treatments. Once these are available, it would be nice to see how our predictions fare.
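
If we do score predictions against PharmacotherapyDB in the way the review suggests, the calculation would look roughly like the following, with disease-modifying indications as positives. The labels, scores, and 0.5 threshold are purely illustrative.

```python
import numpy
from sklearn.metrics import confusion_matrix

# Hypothetical labels (1 = disease-modifying indication) and predicted probabilities
y_true = numpy.array([1, 1, 1, 0, 0, 0, 0, 0])
y_score = numpy.array([0.9, 0.8, 0.3, 0.6, 0.2, 0.1, 0.4, 0.05])

threshold = 0.5
y_pred = (y_score >= threshold).astype(int)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)  # fraction of known indications recovered
specificity = tn / (tn + fp)  # fraction of negatives correctly rejected
print(f'sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}')
```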

Cite this as
Benjamin Good, Daniel Himmelstein (2015) Evaluation framework. Thinklab. doi:10.15363/thinklab.d47