The degree-weighted path count (DWPC) is the metric we use to assess the prevalence of a specific type of path between two nodes. The DWPC counts the paths between a source and target node for a given metapath while downweighting paths that pass through highly connected (high-degree) nodes.
Despite the degree weighting, DWPCs are highly dependent on the source and target node degrees. For a given observation (compound–disease pair in our application) and metapath, the majority of a DWPC often reflects the source and target node degrees rather than the specific relationship between the source and target.
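For reference, the DWPC can be written out as below. This is a sketch of the usual formulation rather than something spelled out in this post: the symbols s, t, m, w, and D_p are notation introduced here, and w is assumed to be a damping exponent between 0 and 1.

```latex
% Path-degree product (PDP) for a single path p:
% D_p is the collection of metaedge-specific node degrees along p.
\mathrm{PDP}(p) = \prod_{d \in D_p} d^{-w}

% DWPC for source s, target t, and metapath m:
% the sum of PDPs over all paths from s to t that follow m.
\mathrm{DWPC}(s, t, m) = \sum_{p \,\in\, \mathrm{Paths}(s, t, m)} \mathrm{PDP}(p)
```

Under this form, every degree along a path enters the product, including the degrees of s and t themselves, which is one way to see why the source and target degrees still shape the DWPC.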
The residual DWPC, or R-DWPC, is a metric invented by @alizee and me to measure the specific connectivity between two nodes after adjusting for general effects of node degree. For a given observation and metapath, the R-DWPC equals the DWPC minus the P-DWPC, where the P-DWPC is the average DWPC for that observation across permuted hetnets. Because permutation preserves node degrees while shuffling which nodes are connected, the P-DWPC captures general effects of node degree but does not capture specific edge effects.
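Written as equations, the definition reads roughly as follows. The notation is assumed for illustration: N is the number of permuted hetnets, and DWPC_i is the DWPC of the same observation (s, t) and metapath m on the i-th permuted hetnet.

```latex
% P-DWPC: average DWPC of the observation across N permuted hetnets
\text{P-DWPC}(s, t, m) = \frac{1}{N} \sum_{i=1}^{N} \mathrm{DWPC}_i(s, t, m)

% R-DWPC: what remains of the unpermuted DWPC after removing the P-DWPC
\text{R-DWPC}(s, t, m) = \mathrm{DWPC}(s, t, m) - \text{P-DWPC}(s, t, m)
```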
In summary, we've developed a method for splitting a DWPC into permuted (degree-specific) and residual (edge-specific) components. I see the R-DWPC as part of a larger movement to disentangle degree and edge effects in network analysis. As past research has shown [1, 2, 3, 4], node degree can be highly predictive without drawing any insight from the relationship between entities. While a good modeling approach should incorporate degree effects, it could aid interpretation to model degree and edge effects separately.
Transformation and the R-DWPC
We identified a transformation for DWPCs to make their distribution more suitable for modeling. The transformation scales DWPCs by their mean and then applies the inverse hyperbolic sine (IHS) function. When computing P-DWPCs and R-DWPCs, @alizee and I decided on the following pipeline:
1. Compute DWPCs for the selected observations on the unpermuted and permuted hetnets.
2. Identify the mean DWPC for each metapath on the unpermuted hetnet.
3. Transform DWPCs for the unpermuted and permuted hetnets using the scaling factor identified in step 2.
4. Average the transformed DWPCs on the permuted hetnets to get the P-DWPC.
5. Subtract the P-DWPC from the transformed unpermuted-hetnet DWPC to get the R-DWPC.
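The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code we ran: it assumes the DWPCs from the first step are already computed and stored as arrays, and the function name `residual_dwpc`, the array layout, and the guard against a zero mean are all assumptions made for the example.

```python
import numpy as np


def residual_dwpc(dwpc_unperm, dwpc_perm):
    """Split transformed DWPCs into P-DWPC and R-DWPC components.

    dwpc_unperm: shape (n_obs, n_metapaths), DWPCs on the unpermuted hetnet.
    dwpc_perm:   shape (n_perms, n_obs, n_metapaths), DWPCs for the same
                 observations on each permuted hetnet.
    (Hypothetical layout chosen for this sketch.)
    """
    dwpc_unperm = np.asarray(dwpc_unperm, dtype=float)
    dwpc_perm = np.asarray(dwpc_perm, dtype=float)

    # Mean DWPC per metapath, taken from the unpermuted hetnet only.
    scale = dwpc_unperm.mean(axis=0)
    # Assumption: leave a metapath unscaled if its mean DWPC is zero.
    scale = np.where(scale > 0, scale, 1.0)

    # Scale by the unpermuted mean, then apply the IHS (arcsinh) function.
    t_unperm = np.arcsinh(dwpc_unperm / scale)
    t_perm = np.arcsinh(dwpc_perm / scale)

    # P-DWPC: average transformed DWPC across the permuted hetnets.
    p_dwpc = t_perm.mean(axis=0)

    # R-DWPC: transformed unpermuted DWPC minus the P-DWPC.
    r_dwpc = t_unperm - p_dwpc
    return p_dwpc, r_dwpc


# Toy usage with made-up numbers: 5 observations, 3 metapaths, 4 permutations.
rng = np.random.default_rng(0)
toy_unperm = rng.gamma(2.0, size=(5, 3))
toy_perm = rng.gamma(2.0, size=(4, 5, 3))
toy_p, toy_r = residual_dwpc(toy_unperm, toy_perm)
```

Note that the same scaling factor (from the unpermuted hetnet) is applied to both the unpermuted and permuted DWPCs, so the two components stay on a common scale and sum back to the transformed DWPC.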
We chose to base the R-DWPC on a difference rather than a quotient. Additionally, we chose one of many ways to incorporate transformation. Future research that makes use of R-DWPCs may want to reevaluate these decisions under a performance-driven approach.
Model 2 (P-DWPCs) was the worst performer, with only the prior_logit feature being selected by a cross-validated lasso. Model 1 performed the best (AUROC = 91.2%). Model 4, for which we had the highest hopes, came in slightly lower (AUROC = 91.0%). I was surprised that splitting the DWPC into permuted and residual components did not improve performance. The split also didn't seem to substantially change predictions: the predicted probabilities from Models 1 and 4 had a correlation of 87.0%.
Given that the P/R-DWPC separation adds complexity without improving performance, we plan to continue primarily with DWPCs for Project Rephetio. However, I do hope to explore per-observation permutation adjustment again in the future.