The Prior Problem

The last step of our project rephetio is to use our heterogeneous network hetionet † to create features that can predict missing edges in the network. As the goal is to repurpose existing drugs for existing diseases, the outcome metaedge we are predicting is 'treatment' (read 'treats' as a verb in our nomenclature). We tackle here a major issue that we call 'self-testing', which makes our machine learning approach non-conventional.

Self-testing

For computational and semantic reasons, we currently fit our machine-learned models and report performance based on the training error. Indeed, our network architecture, implementation, and data currently prevent us from creating separate training and testing sets, e.g. by selectively hiding predictor and predicted edges based on their time of appearance. As a result, (i) predictive models are trained with features that have intrinsic knowledge of the outcome, exposing us to overfitting of these features, and (ii) classical performance evaluation measures (like the AUROC) lose relevance and applicability.

Source-Target Degrees as Features

We included source-target degrees as features of our edge prediction problem. The goal of this addition, as explained elsewhere, is to avoid misleading selection of DWPC features that are proxies for this basic topological information.

In this discussion, we want to focus on treatment degrees. They can be noted $n_{CtD}(C_i)$ and $n_{DtC}(D_j)$, and represent, respectively, the number of diseases treated by a Compound $C_i$ (source node) and the number of compounds that treat a Disease $D_j$ (target node). We believe that these degrees, as features, crystallize the problem of self-testing in our edge-prediction problem.
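As an illustration, these degrees can be computed directly from a 'treats' edge list. Below is a minimal stdlib sketch with toy identifiers (the C*/D* names are hypothetical, not actual Hetionet IDs):

```python
from collections import Counter

# Hypothetical 'treats' edges as (Compound, Disease) pairs
treats_edges = [
    ("C1", "D1"), ("C1", "D2"), ("C2", "D1"), ("C3", "D3"),
]

# n_CtD(C_i): number of diseases treated by compound C_i
n_CtD = Counter(compound for compound, _ in treats_edges)
# n_DtC(D_j): number of compounds treating disease D_j
n_DtC = Counter(disease for _, disease in treats_edges)

print(n_CtD["C1"], n_DtC["D1"])  # 2 2
```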

Direct contamination

These degrees precisely characterize the (bipartite) subnetwork $\mathcal{N}_O$ that contains only the edges from the 'treatment' metaedge, together with their source and target nodes (from the Compound and Disease metanodes). $\mathcal{N}_O$ contains no information apart from the actual outcomes we want to predict, and yet it is sufficient to compute all the treatment degrees $n_{CtD}$ and $n_{DtC}$. As a result, outcome observations directly 'contaminate' the features on which the models are trained and evaluated. Because these degrees are so unequally distributed (they follow a power law), and since they are directly linked to the probability of existence of a treatment edge, fitting a model with only these two degree features gives misleadingly high performance numbers.

A first pass gave an AUROC above 0.97.

Prior knowledge

The information encapsulated in these two degree features solely characterizes the treatment edges, which we use for both fitting and evaluation. We are not tackling here the more general problem of knowledge bias, but the specific issue arising from self-testing: actual outcome observations directly contaminate our features, inducing prior knowledge about their value (positive or negative) before any additional information is integrated into our network.

† This is the first public use of the name 'hetionet' for our heterogeneous network that powers — in particular — the project rephetio. We call it a 'soft disclosure' ;-)

Antoine Lizee Researcher

Characterizing the Prior

Removing the degree features would only obfuscate the problem at hand, since this trivial information about outcomes could be picked up in subtler ways by the DWPC features. On the contrary, we found that embracing the concept of prior knowledge presented above could solve both problems of fitting and evaluating our models. We strive here for a more accurate characterization of this prior knowledge.

Computing prior probabilities

From the two treatment degree features alone, we can directly compute the probability that a given Compound-Disease couple is linked by a treatment edge.

Let us consider a couple $C_i\text{-}D_j$ where the treatment degree of the compound $C_i$ is known, and the Disease $D_j$ is drawn at random. The probability that this couple is a treatment is equal to the number of diseases $C_i$ is connected to, divided by the total number of diseases it could connect to. This can be written as:

\begin{align} p_t(C_i\text{-}D) & = \frac{n_{CtD}(C_i)}{N_{C_i\text{-}D}} \\ p_t(C_i\text{-}D) & = \frac{n_{CtD}(C_i)}{N_D} \end{align}

... where $N_{C_i\text{-}D}$ is the total number of possible edges starting from $C_i$, equal to $N_D$ the number of Diseases in the network.

A similar formula can be derived for the Disease degrees:

\begin{align} p_t(D_j\text{-}C) & = \frac{n_{DtC}(D_j)}{N_C} \end{align}

And both of these prior probabilities can be combined to get the prior probability of a given treatment edge knowing both degrees:

$$\boxed{p_{ij} = p_t(C_i\text{-}D_j) = \frac{n_{CtD}(C_i) \cdot n_{DtC}(D_j)}{N_D \cdot N_C}}$$

This probability fully describes the prior knowledge about the outcome that the treatment degrees hold.
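The boxed formula is straightforward to implement. Here is a small sketch with made-up degree counts and network sizes (not Hetionet's actual numbers):

```python
def prior_probability(n_ctd_ci, n_dtc_dj, n_diseases, n_compounds):
    """Prior probability of a treatment edge between C_i and D_j,
    knowing only the two treatment degrees:
    p_ij = n_CtD(C_i) * n_DtC(D_j) / (N_D * N_C)."""
    return (n_ctd_ci * n_dtc_dj) / (n_diseases * n_compounds)

# Illustrative numbers only
print(prior_probability(3, 2, 100, 1000))  # 6e-05
print(prior_probability(0, 2, 100, 1000))  # 0.0 -- zero-degree compound
```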

A Null model

A null model can be specified, where the predicted outcome for any source-target couple, when predicting the presence of a treatment edge, is equal to the prior probability $p_{ij}$ above. This model is denoted $\mathcal{M}_{prior}$.

$\mathcal{M}_{prior}$, when tested on all observations, is expected to have a very high AUROC, because it successfully stratifies the population of potential edges into tiers that have increasing and radically different probabilities of being a true link. For instance, it successfully discriminates the large majority (85%) of couples that have no chance of being a treatment edge because the Compound and/or the Disease has a treatment degree equal to zero.

AUROC of the null model

Considering just this point on the ROC curve, with $P_0 = P\left(p_{ij} = 0\right)$ the proportion of couples that have a prior probability of zero, we can compute (for fun) a lower bound for the AUROC that illustrates why the expected performance is high. Every couple with a prior probability of zero is ranked lower than any other couple, and each of these couples is actually a True Negative (TN) by design. Since we know that the model will be better than random over the remainder of the specificity range, the lower bound for the AUROC of this null model is given by a simple linear interpolation based on the one point we know in ROC space.

This point corresponds to classifying only these "absolute negatives" that have $p_{ij} = 0$ as negatives. We get a sensitivity $sens = 1$, since all the real positives are predicted to be positives, and a specificity of

\begin{align} spe &= \frac{P_0}{P_0 + (1 - P_0) \cdot (1 - pre)}\\ &= \frac{P_0}{1 - pre \cdot (1 - P_0)}\\ spe &\simeq P_0 \end{align}

with the prevalence $pre$ being negligible (of the order of $10^{-4}$).

The linear interpolation gives us:

\begin{align} auroc_{min}\left(\mathcal{M}_{prior}\right) &= 1 - \frac{1 - P_0}{2} \end{align}

With $P_0 = 0.85$, we get $\boxed{auroc_{min}\left(\mathcal{M}_{prior}\right) = 92.5 \%}$ as a lower bound for the null model performance.
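As a sanity check, the two quantities above can be computed numerically. This pure-Python sketch uses $P_0 = 0.85$ and $pre = 10^{-4}$ as in the text:

```python
def specificity_at_p0(p0, pre):
    """Specificity when only the zero-prior couples are called negatives:
    spe = P0 / (1 - pre * (1 - P0))."""
    return p0 / (1 - pre * (1 - p0))

def auroc_lower_bound(p0):
    """Linear-interpolation lower bound: 1 - (1 - P0) / 2."""
    return 1 - (1 - p0) / 2

print(round(specificity_at_p0(0.85, 1e-4), 4))  # 0.85 -- prevalence is negligible
print(round(auroc_lower_bound(0.85), 3))        # 0.925
```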

The measured AUROC of this null model, on all observations, is equal to:

$$auroc\left(\mathcal{M}_{prior}\right) = 97.9 \%$$
Antoine Lizee Researcher

Incorporating the prior in fitting and testing

Using $\mathcal{M}_{prior}$ as baseline

The performance achieved by the null model $\mathcal{M}_{prior}$ should serve as a baseline, thus mitigating the lack of meaning of the AUROC: we'll use $auroc(\mathcal{M}_{prior})$ as a minimum value to improve on (instead of the usual $auroc(\mathcal{M}_{random})= 0.5$).

Using $p_{ij}$ as covariate

Nevertheless, we also want to include this prior probability as a covariate in our models, in order to isolate the prior knowledge about the outcome that it describes, and to avoid selection of DWPC features that 'tag' it.

In the framework of generalized linear models, I propose to directly use the linear predictor that corresponds to the computed prior probability. In the case of logistic regression, the feature to use as a covariate in the model would be:

\begin{align} \boxed{f_{prior}\left(C_i\text{-}D_j\right) = \text{logit}\left(p_{ij}\right)\strut} \end{align}
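For illustration, the logit transform can be sketched as follows (couples with a zero prior, for which the logit is undefined, are handled separately in the text):

```python
import math

def f_prior(p_ij):
    """Logit of the prior probability p_ij, to be used as a covariate.
    Assumes 0 < p_ij < 1; a couple with p_ij = 0 would give
    logit(0) = -inf and is excluded from fitting."""
    return math.log(p_ij / (1 - p_ij))

print(f_prior(0.5))              # 0.0
print(round(f_prior(6e-05), 2))  # -9.72
```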

Proposed procedure

We propose an incremental approach to the fitting and evaluation of the models, going through the following iterations:

\begin{align} \text{1.} && \mathcal{M}_{prior}: \; &\text{the null model described above} \\ \text{2.} && \mathcal{M}_{deg}: \; &\delta_{CtD} \sim f_{prior} + degrees \\ \text{3.} && \mathcal{M}_{perm}: \; &\delta_{CtD} \sim f_{prior} + degrees + DWPC_{permuted} \\ \text{4.} && \mathcal{M}_{real}: \; &\delta_{CtD} \sim f_{prior} + degrees + DWPC_{real} \\ \end{align}

where we use the R formula notation to specify outcomes and features of the model;
where $\delta_{CtD} = \delta_{DtC}$ is the outcome variable that denotes the presence of a treatment edge between a given couple of Compound and Disease;
where '$degrees$' are all the source/target degree features; and
where $DWPC_{permuted}$ and $DWPC_{real}$ are all the DWPC features computed respectively from the degree-preserving permuted network $\mathcal{N}_{permuted}$ (see ***) and the full network $\mathcal{N}$.

We expect $\mathcal{M}_{deg}$ to bring no additional predictive value over the null model $\mathcal{M}_{prior}$, since the latter already holds the best information available at the level of the source/target degrees. For $\mathcal{M}_{perm}$ and $\mathcal{M}_{real}$, the degrees might be useful in conjunction with the DWPC features.

Testing on new or held-out data will be carried out using an arbitrary value for the prior in the models, e.g. an estimate of the prevalence.

Redefining the training & testing sets

Finally, we propose to fit the models above only on couples that have a non-zero prior probability of being a treatment edge, i.e. couples that verify both:

$$\left\{ \begin{array}{c} n_{CtD}(C_i) > 0 \\ n_{DtC}(D_j) > 0 \end{array} \right.$$

Indeed, when the prior probability $p_{ij}$ is equal to zero, these points theoretically do not contribute to the fitting, whereas in practice they induce complications (lack of convergence, problems with the numerical encoding of $-\infty$, ...). Moreover, it gives us the opportunity to report results for the couples from the complementary set (those with a zero prior), which is quite independent from the set used for training.
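In practice this restriction is a simple filter on the two degrees. A toy sketch (the couples and counts are made up):

```python
# Keep for fitting only the couples with a non-zero prior:
# n_CtD(C_i) > 0 and n_DtC(D_j) > 0
couples = [
    {"couple": ("C1", "D1"), "n_CtD": 2, "n_DtC": 3},
    {"couple": ("C2", "D1"), "n_CtD": 0, "n_DtC": 3},  # zero prior
    {"couple": ("C1", "D2"), "n_CtD": 2, "n_DtC": 0},  # zero prior
    {"couple": ("C3", "D3"), "n_CtD": 1, "n_DtC": 1},
]

train_set = [c for c in couples if c["n_CtD"] > 0 and c["n_DtC"] > 0]
# Complementary set with a null prior, reported on separately
holdout = [c for c in couples if not (c["n_CtD"] > 0 and c["n_DtC"] > 0)]

print(len(train_set), len(holdout))  # 2 2
```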

On this smaller training set, the performance of the prior alone, as measured by the AUROC, decreases since we removed obvious negatives:

$$auroc\left(\mathcal{M}_{prior}\right) = 85.1 \%$$

... while the area under the precision recall curve is exactly the same: $auprc\left(\mathcal{M}_{prior}\right) = 0.16$.

Daniel Himmelstein Researcher

Why we self-test

@alizee mentions that we plan to use self-testing. Here I'll describe what that means, and why we chose this unconventional approach. Self-testing refers to our plan to not withhold any treatments (positives) and non-treatments (negatives) from our hetnet and subsequent model training. In a conventional testing/validation approach, a subset of compound–disease pairs would be entirely removed from the training process. Here are reasons we don't want to withhold compound–disease pairs for testing.

We only have 755 treatments. Many of the best performing features rely on these edges. For example, we'll look for compounds that bind similar genes to those bound by compounds that treat the target disease. The informativeness of this feature is proportional to how many compounds treat the target disease. Therefore, withholding too many treatments for testing will hurt performance. Conversely, withholding too few treatments will lead to unreliable performance estimates. One solution to this dilemma is cross-validation. However, each fold would require a distinct feature extraction on a distinct hetnet. With our current infrastructure, this would create a major runtime and implementation-time bottleneck that would negatively affect our agility.

So the question becomes whether this investment is warranted by the benefits of a true testing set. The main benefit is an accurate assessment of performance. However, in our particular situation, I don't think this is a major concern, and such an assessment is probably not even possible. We're focused on making good predictions, not comparing the performance of many different methods. Ultimately the predictions will be unaffected by whether we adopt a formal testing scheme.

Next, I expect the extent of model-based overfitting to be low. When training our logistic regression model, we use cross-validation to find an optimal level of regularization (a penalty to prevent overfitting). Specifically, we adopt the "one-standard-error" rule to choose the optimal λ based on deviance [1]. This is a stringent choice that takes a more conservative regularization strength than the deviance-minimizing value, as an extra precaution against overfitting. In our previous project, where we did mask edges from the hetnet for testing, we commented [2]:

Importantly, we did not observe any significant degradation of performance from training to testing (Fig 3A), indicating that our disciplined regularization approach avoided overfitting and that predictions for associations included in the network were not biased by their presence in the network.
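The "one-standard-error" rule mentioned above can be sketched in a few lines. The cross-validation results below are entirely hypothetical, for illustration only:

```python
# One-standard-error rule (sketch): among candidate regularization
# strengths, pick the strongest penalty whose mean CV deviance stays
# within one standard error of the minimum. All numbers are made up.
lambdas = [0.001, 0.01, 0.1, 1.0]        # increasing regularization
cv_deviance = [0.50, 0.51, 0.515, 0.70]  # mean CV deviance per lambda
cv_se = [0.02, 0.02, 0.02, 0.03]         # standard error per lambda

i_min = min(range(len(lambdas)), key=lambda i: cv_deviance[i])
threshold = cv_deviance[i_min] + cv_se[i_min]  # 0.50 + 0.02

# Largest (most regularized) lambda still within one SE of the best
lambda_1se = max(l for l, d in zip(lambdas, cv_deviance) if d <= threshold)
print(lambda_1se)  # 0.1
```

Note how the rule selects a stronger penalty (λ = 0.1) than the deviance-minimizing λ = 0.001, trading a little training fit for robustness.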

And finally, there are two much bigger biases than overfitting. The first is the bias of existing knowledge (study bias), which we realistically cannot control for. The second is the prior probability of treatment based on degrees, as @alizee has discussed. I don't think cross-validation can easily address the prior probability issue. The PREDICT study — which also incorporated treatment information in features used to predict treatments — attempted to address the issue with the following approach [3]:

To evaluate our classification scheme, we applied it in a 10‐fold cross‐validation setting. To avoid easy prediction cases, we hid all the associations involved with 10% of the drugs in each iteration, rather than hiding 10% of the associations.

However, if you hide all treatments for a set of drugs, a non-uniform prior based on the treatment degree of the diseases still exists. Therefore, I think the best way forward will be to explicitly model the prior probability of treatment based on degree and proceed intelligently.

• Ultimately the predictions will be unaffected by whether we adopt a formal testing scheme.

I don't really agree with that, as the self-testing here also applies when fitting the model. Correctly incorporating the prior should nevertheless mitigate this issue.

• Daniel Himmelstein: I was assuming an approach where you partition for assessing performance, but then refit on the entire dataset for your final predictions.

Cite this as
Antoine Lizee, Daniel Himmelstein (2016) Network Edge Prediction: how to deal with self-testing. Thinklab. doi:10.15363/thinklab.d194