Project Rephetio produced predicted probabilities of treatment for 209,168 compound–disease pairs. In addition to the predictions themselves, we provide Neo4j Browser guides with details for each individual prediction.
The guides include the following:
a table of metapaths supporting the prediction
a table of the top 25 paths supporting the prediction
a visualization of the top 10 network paths supporting the prediction
As an example, see the guide for the prediction that bupropion treats nicotine dependence (click play to show the guide). This discussion explains how we assign contribution scores to specific metapaths and paths.
Calculating metapath and path contributions
The computation of contribution stats for each prediction occurs in this notebook. The code is rather gnarly, so I'm going to describe the method by example. Specifically, what network evidence do we observe for bupropion treating nicotine dependence, a prediction we've discussed before?
The first step is to compile the list of metapaths that provide support for the prediction. Our logistic regression model assigned positive coefficients to 12 DWPC features. Going forward, we only consider these 12 metapaths. A metapath is considered to positively contribute if its logistic regression term is positive. The following table shows the process:
The table shows that there are 5 metapaths providing positive support. The metapath's Transformed DWPC (which has been IHS-transformed and standardized) is multiplied by the logistic regression coefficient to compute the Term. Only metapaths with positive terms are retained. So, for example, if a metapath has a negative Transformed DWPC for a specific compound–disease pair, that metapath is not considered to contribute. In reality, a lack of paths of the given type contributes negatively to the prediction, but this effect cannot be decomposed to specific paths, so we ignore it. Finally, the retained terms are scaled to sum to one, yielding the Contribution: the proportion of the total support provided by a specific metapath.
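As a minimal sketch of the steps above (not the notebook's actual code), the Term and Contribution columns could be computed as follows. The function name and all coefficient and DWPC values are hypothetical and chosen only so the arithmetic is easy to follow; the metapath abbreviations are just labels here.

```python
def metapath_contributions(coefficients, transformed_dwpcs):
    """Return {metapath: contribution} for metapaths whose term is positive.

    term = logistic regression coefficient * transformed DWPC.
    Contributions are the positive terms rescaled to sum to one.
    """
    terms = {
        metapath: coefficients[metapath] * transformed_dwpcs[metapath]
        for metapath in coefficients
    }
    # Keep only metapaths that positively support the prediction
    positive = {m: t for m, t in terms.items() if t > 0}
    total = sum(positive.values())
    return {m: t / total for m, t in positive.items()} if total else {}


# Hypothetical values for illustration only (not taken from the Rephetio model)
coefficients = {"CbGpPWpGaD": 0.5, "CbGaD": 1.0, "CrCtD": 0.25}
transformed_dwpcs = {"CbGpPWpGaD": 2.0, "CbGaD": 3.0, "CrCtD": -1.0}

contributions = metapath_contributions(coefficients, transformed_dwpcs)
for metapath, contribution in contributions.items():
    print(metapath, contribution)
```

In this toy input, the third metapath has a negative transformed DWPC, so its term is negative and it is dropped before the remaining terms are rescaled.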
Once we know the contribution of a metapath, it's easy to calculate the contribution of a specific path, since we know how much each path contributes to the untransformed DWPC. To compute the contribution of a path, we multiply its metapath's contribution by its contribution to that metapath. For the bupropion example, we can calculate the overall contribution of each CbGpPWpGaD path, using the DWPC and Contribution values from the above table:
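MATCH path = (n0:Compound)-[:BINDS_CbG]-(n1)-[:PARTICIPATES_GpPW]-(n2)-[:PARTICIPATES_GpPW]-(n3)-[:ASSOCIATES_DaG]-(n4:Disease)
USING JOIN ON n2
WHERE n0.name = 'Bupropion'
  AND n4.name = 'nicotine dependence'
  AND n1 <> n3
WITH [
  // Degree of each endpoint of each edge, counted along that edge's type
  size((n0)-[:BINDS_CbG]-()),
  size(()-[:BINDS_CbG]-(n1)),
  size((n1)-[:PARTICIPATES_GpPW]-()),
  size(()-[:PARTICIPATES_GpPW]-(n2)),
  size((n2)-[:PARTICIPATES_GpPW]-()),
  size(()-[:PARTICIPATES_GpPW]-(n3)),
  size((n3)-[:ASSOCIATES_DaG]-()),
  size(()-[:ASSOCIATES_DaG]-(n4))
] AS degrees, path
WITH
  extract(n in nodes(path)| n.name) AS nodes,
  // Input the untransformed DWPC in the next line
  sum(reduce(pdp = 1.0, d in degrees| pdp * d ^ -0.4)) / 0.03288 AS dwpc_contribution
WITH
  nodes,
  dwpc_contribution, // Contribution of the PDP to the DWPC
  // Input the metapath contribution in the next line
  dwpc_contribution * 0.0947 AS contribution // Contribution of the PDP to the overall prediction
RETURN nodes, dwpc_contribution, contribution
ORDER BY contribution DESC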
The contribution values returned by this query equal the values in the path table of the browser guide (copied below).
One limitation of our approach is that it overlooks paths that contribute to a prediction but correspond to a metapath whose transformed DWPC was negative (which occurs for DWPCs that are below the mean).
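The path-level arithmetic can also be sketched outside of Cypher. The snippet below is an illustration under stated assumptions, not the notebook's code: the function names and the degree lists are hypothetical, and the untransformed DWPC is taken to be the sum of the path degree products (PDPs) over all paths of the metapath, rather than being typed in by hand as in the query. The damping exponent of 0.4 and the metapath contribution of 0.0947 come from the query above.

```python
def path_degree_product(degrees, damping=0.4):
    """Multiply each degree along the path, raised to the negative damping exponent."""
    pdp = 1.0
    for degree in degrees:
        pdp *= degree ** -damping
    return pdp


def path_contributions(path_degrees, metapath_contribution, damping=0.4):
    """Split a metapath's contribution among its paths in proportion to their PDPs."""
    pdps = {
        name: path_degree_product(degrees, damping)
        for name, degrees in path_degrees.items()
    }
    dwpc = sum(pdps.values())  # assumed: untransformed DWPC = sum of the PDPs
    return {
        name: pdp / dwpc * metapath_contribution
        for name, pdp in pdps.items()
    }


# Hypothetical degree lists (two degrees per edge) for two paths of one metapath
path_degrees = {
    "path_a": [2, 4, 8, 16, 8, 4, 2, 1],
    "path_b": [2, 10, 3, 16, 8, 7, 5, 1],
}
contributions = path_contributions(path_degrees, metapath_contribution=0.0947)
for name, contribution in contributions.items():
    print(name, round(contribution, 4))
```

Because each path's share of the DWPC sums to one across the metapath's paths, the path contributions sum back to the metapath's contribution, and paths through lower-degree nodes receive the larger share.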
Calculating source / target edge contributions
In another discussion we introduced partial DWPCs, whereby a DWPC is divided into components by grouping paths based on a chosen attribute. The attribute could be the node the path traverses at a given position. Or the attribute could be the source edge of the path.
The term partial DWPC applies to computing a DWPC for any source–target–metapath–attribute combination. However, we can use a similar approach of grouping paths and summing their scores for source–target (compound–disease) pairs. For example, we can group all paths contributing to a prediction by their source edge to get the contribution of each source edge.
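As a rough sketch of this grouping, assume each path has already been assigned a contribution and is stored as a record with its source edge, target edge, and contribution. The record layout, the function name, the placeholder gene GENE_X, and every number below are hypothetical and serve only to show the group-and-sum step.

```python
from collections import defaultdict


def group_contributions(paths, key):
    """Sum path contributions by the chosen attribute, largest total first."""
    totals = defaultdict(float)
    for path in paths:
        totals[path[key]] += path["contribution"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical path records; edges are (node, relationship, node) tuples
paths = [
    {
        "source_edge": ("Bupropion", "binds", "CHRNA3"),
        "target_edge": ("CHRNA3", "associates", "nicotine dependence"),
        "contribution": 0.5,
    },
    {
        "source_edge": ("Bupropion", "binds", "CHRNA3"),
        "target_edge": ("Varenicline", "treats", "nicotine dependence"),
        "contribution": 0.25,
    },
    {
        "source_edge": ("Bupropion", "binds", "GENE_X"),
        "target_edge": ("Varenicline", "treats", "nicotine dependence"),
        "contribution": 0.125,
    },
]

by_source = group_contributions(paths, "source_edge")
by_target = group_contributions(paths, "target_edge")
for edge, total in by_source:
    print(edge, total)
```

The same function handles target edges by switching the key, and the grouping spans all metapaths because it uses the per-path contribution to the overall prediction rather than a single metapath's DWPC.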
For the bupropion–nicotine dependence example, here are the top five contributing source edges:
Percent of Prediction
Notice that the Bupropion—binds—CHRNA3 relationship is responsible for the bulk of the prediction. This indicates that binding CHRNA3 is likely sufficient to give a compound a high nicotine dependence prediction. We can also look at the top five contributing target edges:
Percent of Prediction
Here we see that the prediction primarily picks up on two aspects of nicotine dependence: first, that it's treated by varenicline, and second, that it's associated with CHRNA3.
I'm in the process of adding source/target edge contribution tables to our Neo4j Browser guides with a deployment target of the next 24 hours.