Our current approach revolves around modeling treatments that are included in the hetnet as treats edges. We've previously discussed and dealt with some implications of this self-testing methodology. This post will introduce another issue related to modeling edges that are part of the hetnet used to extract features.
Previously, we looked into whether paths with duplicate nodes should be excluded. Without the duplicate node exclusion, we were allowing paths that directly contained the prediction edge (the prediction edge refers to the compound–disease pair that a feature describes). We found that:
paths including the prediction edge cause overfitting by incorporating the outcome (indication status) into the predictor (DWPC)
Now we have identified another contamination vector by which the presence of the prediction edge can seep into DWPCs. In the previous example of contamination, the DWPC of positives is inflated relative to negatives. With edge dropout contamination, the DWPC of positives is deflated relative to negatives. Here's why: if the prediction edge is an actual treatment, the duplicate node exclusion prevents any path between that compound and disease from traversing it, so a treatment edge is effectively masked from the hetnet. However, if the prediction edge is not a treatment, no treatment edges are masked from the hetnet.
Let's imagine that Compound C1 treats only Disease D1. We'll consider the CtDlAlD metapath (Compound–treats–Disease–localizes–Anatomy–localizes–Disease). First, note that Disease–localizes–Anatomy edges are common. Therefore, two random diseases likely colocalize to several anatomies. In other words, there will almost always be paths for the DlAlD portion of the metapath. When the prediction edge is C1–D1, the DWPC will be zero: every CtDlAlD path from C1 must begin with C1–treats–D1, and any such path ending at D1 would visit D1 twice, so the duplicate node exclusion removes it. However, for other diseases (D2, D3, …), the C1–treats–D1 edge is available for traversal. Thus the path count (and hence the DWPC) will generally be nonzero between C1 and all diseases other than D1. In this hypothetical example, edge dropout contamination could lead to a spurious negative association between the DWPC for the CtDlAlD metapath and the presence of treatment edges.
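The hypothetical example above can be sketched in a few lines of Python. This is only a toy illustration, not the project's actual feature extraction code: the edge sets, the node names beyond C1 and D1, and the `ctdlald_path_count` helper are all assumptions made for the sketch, and it reports raw path counts rather than degree-weighted ones (a DWPC is zero exactly when the path count is zero).

```python
# Toy hetnet (hypothetical data): C1 treats only D1, and diseases
# share anatomies, as Disease-localizes-Anatomy edges are common.
treats = {("C1", "D1")}
localizes = {
    ("D1", "A1"), ("D1", "A2"),
    ("D2", "A1"), ("D2", "A2"),
    ("D3", "A1"),
}


def ctdlald_path_count(compound, disease):
    """Count CtDlAlD paths from compound to disease.

    Paths that visit the same node twice are excluded. For this
    metapath, the only possible duplicate is the intermediate
    disease coinciding with the target disease.
    """
    count = 0
    for c, mid_disease in treats:
        if c != compound:
            continue
        if mid_disease == disease:
            # Duplicate node exclusion: the path would visit `disease` twice.
            continue
        for d, anatomy in localizes:
            if d == mid_disease and (disease, anatomy) in localizes:
                count += 1
    return count


# Path counts from C1 to every disease in the toy hetnet.
counts = {d: ctdlald_path_count("C1", d) for d in ("D1", "D2", "D3")}
print(counts)
```

In this toy network, the only positive (C1–D1) gets a count of zero, while the negatives (C1–D2, C1–D3) get nonzero counts, which is the deflation of positives relative to negatives described above.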
The specific issue we describe here could more precisely be called "prediction-edge dropout contamination". It's possible that edge dropout raises problems for a broad spectrum of network approaches.