Decomposing the DWPC to assess intermediate node or edge contributions

Daniel Himmelstein

doi:10.15363/thinklab.d228

Project:

Rephetio: Repurposing drugs on a hetnet [rephetio]

Daniel Himmelstein Researcher Dec. 15, 2016

As a reminder, the degree-weighted path count (DWPC) measures the prevalence of a metapath between a specific source and target node [1]. It equals the sum of path degree products (PDPs), each of which scores a single path based on the degrees along that path.
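As a minimal sketch of this definition, the Python below computes a PDP from the list of degrees along one path and sums PDPs into a DWPC. The damping exponent of 0.4 mirrors the `d ^ -0.4` term in the Cypher queries later in this discussion; the function names and the toy degree lists are hypothetical and only meant for illustration.

```python
from math import prod

def path_degree_product(degrees, damping=0.4):
    """PDP for one path: product of every degree raised to -damping."""
    return prod(d ** -damping for d in degrees)

def dwpc(paths_degrees, damping=0.4):
    """DWPC: sum of the PDPs over all paths for one metapath."""
    return sum(path_degree_product(degrees, damping) for degrees in paths_degrees)

# Hypothetical toy data: two paths of length 2, with two degrees per edge
# (one for each end of the edge), so four degrees per path.
toy_paths = [
    [1, 2, 2, 1],
    [1, 4, 4, 1],
]
total = dwpc(toy_paths)
```

Paths through high-degree nodes receive smaller PDPs, so the second toy path contributes less to the total than the first.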

Traditionally, the DWPC sums the PDPs for all paths connecting the source and target node along a specified metapath. Here I propose a new type of DWPC that sums only the paths that traverse the same intermediate node at a specified position. In other words, traditional DWPCs are defined for a source–target–metapath combination, whereas the proposed DWPCs are defined for a source–target–metapath–position combination. Here, position refers to an intermediate metanode, although the approach would also work with an intermediate metaedge as the position. Note that choosing either the source or target metanode as the position yields the traditional DWPC.

The purpose of this approach is to assess the contribution of intermediate nodes (or edges) in composing the DWPC. Note that, for a given position, the "partial" DWPCs across all intermediate nodes sum to the traditional DWPC. This approach doesn't replace the need for traditional DWPCs: the two serve different needs and answer different questions.
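One way to write the decomposition down is sketched below. The notation is introduced here only for illustration: $P_{s,t,M}$ is the set of paths from source $s$ to target $t$ along metapath $M$, $k$ is the chosen position, and $p_k$ is the node that path $p$ visits at that position.

```latex
\begin{align}
\mathrm{DWPC}(s, t, M) &= \sum_{p \in P_{s,t,M}} \mathrm{PDP}(p) \\
\mathrm{DWPC}_k(s, t, M, v) &= \sum_{\substack{p \in P_{s,t,M} \\ p_k = v}} \mathrm{PDP}(p) \\
\sum_{v} \mathrm{DWPC}_k(s, t, M, v) &= \mathrm{DWPC}(s, t, M)
\end{align}
```

The last line holds because every path visits exactly one node at position $k$, so each PDP is counted in exactly one partial DWPC.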

I'm not satisfied with the traditional versus partial nomenclature. @alizee, any advice?

Daniel Himmelstein Researcher Dec. 15, 2016

Enalapril for coronary artery disease example

Prelude: I recently helped @cgreene with a grant proposal titled "Network-based algorithms for drug discovery from genetic associations" (application 1R01HG009516-01A1). For this proposal, we wanted to show an example where considering the tissue-specificity of paths helped identify the mechanisms of drug efficacy. In the course of this analysis, we came up with the partial DWPC method and the following example (the tissue-specific additions are not included below).

Enalapril treats coronary artery disease (CAD) by inhibiting angiotensin-converting enzyme (ACE) [1]. Traditionally, if we were interested in potential pathways contributing to drug efficacy, we might search for CbGpPWpGaD paths between enalapril and CAD. Below is the Cypher query to return all paths, ranked by PDP (run the query at https://neo4j.het.io):

MATCH path = (n0:Compound)-[:BINDS_CbG]-(n1)-[:PARTICIPATES_GpPW]-
  (n2)-[:PARTICIPATES_GpPW]-(n3)-[:ASSOCIATES_DaG]-(n4:Disease)
USING JOIN ON n2
WHERE n0.name = 'Enalapril'
  AND n4.name = 'coronary artery disease'
  AND n1 <> n3
WITH
  path,
[
  size((n0)-[:BINDS_CbG]-()),
  size(()-[:BINDS_CbG]-(n1)),
  size((n1)-[:PARTICIPATES_GpPW]-()),
  size(()-[:PARTICIPATES_GpPW]-(n2)),
  size((n2)-[:PARTICIPATES_GpPW]-()),
  size(()-[:PARTICIPATES_GpPW]-(n3)),
  size((n3)-[:ASSOCIATES_DaG]-()),
  size(()-[:ASSOCIATES_DaG]-(n4))
] AS degrees
RETURN
  substring(reduce(s = '', node IN nodes(path)| s + '–' + node.name), 1) AS nodes,
  reduce(pdp = 1.0, d in degrees| pdp * d ^ -0.4) AS PDP
ORDER BY PDP DESC

Overall, 757 paths were returned. The top 3 paths are:

nodes	PDP
Enalapril–ACE–Metabolism of Angiotensinogen to Angiotensins–ACE2–coronary artery disease	0.000258
Enalapril–ACE–ACE Inhibitor Pathway–NR3C2–coronary artery disease	0.000252
Enalapril–ACE–ACE Inhibitor Pathway–ACE2–coronary artery disease	0.000245

Now let's assume we're more interested in the contributions of specific pathway nodes rather than specific paths. In other words, we don't really care which genes got us to a pathway; we just want an overall score per pathway. In this case, we can select n2 as the position. We're then computing a DWPC for Enalapril–binds–Gene–participates–Pathway–participates–Gene–associates–coronary artery disease, with Pathway (n2) as the position. The query becomes:

MATCH path = (n0:Compound)-[:BINDS_CbG]-(n1)-[:PARTICIPATES_GpPW]-
  (n2)-[:PARTICIPATES_GpPW]-(n3)-[:ASSOCIATES_DaG]-(n4:Disease)
USING JOIN ON n2
WHERE n0.name = 'Enalapril'
  AND n4.name = 'coronary artery disease'
  AND n1 <> n3
WITH
  path,
  n2 AS pathway,
[
  size((n0)-[:BINDS_CbG]-()),
  size(()-[:BINDS_CbG]-(n1)),
  size((n1)-[:PARTICIPATES_GpPW]-()),
  size(()-[:PARTICIPATES_GpPW]-(n2)),
  size((n2)-[:PARTICIPATES_GpPW]-()),
  size(()-[:PARTICIPATES_GpPW]-(n3)),
  size((n3)-[:ASSOCIATES_DaG]-()),
  size(()-[:ASSOCIATES_DaG]-(n4))
] AS degrees
RETURN
  pathway.identifier AS pathway_id,
  pathway.name AS pathway_name,
  count(*) AS PC,
  sum(reduce(pdp = 1.0, d in degrees| pdp * d ^ -0.4)) AS DWPC
ORDER BY DWPC DESC, pathway_name

40 pathways are returned, of which the top 5 are displayed below:

pathway_id	pathway_name	PC	DWPC
WP554_r84372	ACE Inhibitor Pathway	11	0.0015
PC7_8339	Transmembrane transport of small molecules	150	0.0008
PC7_5323	Metabolism of Angiotensinogen to Angiotensins	3	0.0005
PC7_7290	SLC-mediated transmembrane transport	40	0.0004
PC7_5322	Metabolism	309	0.0004

As shown, we now have a ranking of pathways based on their contribution to the overall CbGpPWpGaD metapath. Currently, I don't see a huge role for this approach for feature extraction, but think it's useful for following up on specific predictions and highlighting mechanisms of drug efficacy.
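To make the grouping step concrete outside of Cypher, here is a minimal Python sketch of the same aggregation: paths are grouped by the node at a chosen position and their PDPs are summed per group. The node names, PDP values, and the `partial_dwpcs` helper are all hypothetical; they are not taken from the enalapril results above.

```python
from collections import defaultdict

def partial_dwpcs(paths, position):
    """Sum PDPs per node found at `position` (0 = source, -1 = target)."""
    totals = defaultdict(float)
    for nodes, pdp in paths:
        totals[nodes[position]] += pdp
    return dict(totals)

# Hypothetical toy paths: (nodes along the path, precomputed PDP)
toy_paths = [
    (("drug", "gene_a", "pathway_x", "gene_b", "disease"), 0.004),
    (("drug", "gene_a", "pathway_x", "gene_c", "disease"), 0.002),
    (("drug", "gene_a", "pathway_y", "gene_b", "disease"), 0.001),
]

# Group by the pathway (index 2 along the metapath) and rank by partial DWPC
by_pathway = partial_dwpcs(toy_paths, position=2)
ranked = sorted(by_pathway.items(), key=lambda item: item[1], reverse=True)

# The traditional DWPC is the sum over all paths
total_dwpc = sum(pdp for _, pdp in toy_paths)
```

In this sketch the partial DWPCs sum to `total_dwpc`, and choosing position 0 (the source) puts every path in one group, which recovers the traditional DWPC.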

Pouya Khankhanian: Agree with "I think it's useful for following up on specific predictions and highlighting mechanisms of drug efficacy". Especially if the function to display this result is embedded in a button on the neo4j interface.

I'd love to see the weight given to various nodes in the top predictions for epilepsy, especially the ones in the top 100 which were not classified as AEDs.

Daniel Himmelstein Researcher Dec. 23, 2016

Grouping paths by their source or target edge

The previous comment discussed grouping paths by an intermediate node and then calculating partial DWPCs. This comment introduces an alternative grouping method: grouping either by the source edge (first edge in the path) or target edge (last edge in the path).

Here's the intuition behind this approach. In a hetnet, a node derives its meaning from its relationships. For example, our algorithm is based solely on relationships. Therefore, a good way to investigate a prediction is to consider which edges of either the source compound or target disease mattered. We can do this for a specific source–target–metapath combination by grouping paths by their source or target edge.
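As a rough sketch of this grouping, the Python below keys each path by its first two nodes (source edge) or its last two nodes (target edge) and sums the PDPs per key. Within a single metapath the edge type at each end is fixed, so the node pair is enough to identify the edge. The `dwpc_by_edge` helper, the node names, and the PDP values are hypothetical.

```python
from collections import defaultdict

def dwpc_by_edge(paths, which="target"):
    """Sum PDPs per source edge (first two nodes) or target edge (last two nodes)."""
    if which not in ("source", "target"):
        raise ValueError("which must be 'source' or 'target'")
    totals = defaultdict(float)
    for nodes, pdp in paths:
        edge = nodes[:2] if which == "source" else nodes[-2:]
        totals[tuple(edge)] += pdp
    return dict(totals)

# Hypothetical toy paths: (nodes along the path, precomputed PDP)
toy_paths = [
    (("drug", "gene_a", "pathway_x", "gene_b", "disease"), 0.004),
    (("drug", "gene_a", "pathway_y", "gene_c", "disease"), 0.002),
    (("drug", "gene_d", "pathway_y", "gene_b", "disease"), 0.001),
]

by_source = dwpc_by_edge(toy_paths, "source")
by_target = dwpc_by_edge(toy_paths, "target")
```

Both groupings partition the same set of paths, so each one sums to the traditional DWPC.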

For example, the following query takes the enalapril–CAD example and asks which target edges are composing the CbGpPWpGaD paths.

MATCH path = (n0:Compound)-[:BINDS_CbG]-(n1)-[:PARTICIPATES_GpPW]-
  (n2)-[:PARTICIPATES_GpPW]-(n3)-[:ASSOCIATES_DaG]-(n4:Disease)
USING JOIN ON n2
WHERE n0.name = 'Enalapril'
  AND n4.name = 'coronary artery disease'
  AND n1 <> n3
WITH
  path,
[
  size((n0)-[:BINDS_CbG]-()),
  size(()-[:BINDS_CbG]-(n1)),
  size((n1)-[:PARTICIPATES_GpPW]-()),
  size(()-[:PARTICIPATES_GpPW]-(n2)),
  size((n2)-[:PARTICIPATES_GpPW]-()),
  size(()-[:PARTICIPATES_GpPW]-(n3)),
  size((n3)-[:ASSOCIATES_DaG]-()),
  size(()-[:ASSOCIATES_DaG]-(n4))
] AS degrees, n3, n4
RETURN
  n4.name AS target_name,
  type(relationships(path)[3]) AS target_edge_type,
  n3.name AS n3_name,
  sum(reduce(pdp = 1.0, d in degrees| pdp * d ^ -0.4)) AS DWPC
ORDER BY DWPC DESC

The top five results are:

target_name	target_edge_type	n3_name	DWPC
coronary artery disease	ASSOCIATES_DaG	SLC22A3	0.00072
coronary artery disease	ASSOCIATES_DaG	ACE2	0.00058
coronary artery disease	ASSOCIATES_DaG	REN	0.00044
coronary artery disease	ASSOCIATES_DaG	SLC6A6	0.00038
coronary artery disease	ASSOCIATES_DaG	NR3C2	0.00025

These are the top ranking CAD-associated genes that participate in pathways with enalapril targets. As shown by the DWPC column, several of the top target edges are contributing to a similar extent. There is no one CAD-associated gene that is responsible for the bulk of the CbGpPWpGaD DWPC.

In instances where a single edge accounts for the bulk of the total DWPC, you know that a single relationship is driving the score. For example, we can rewrite the above query to analyze the source edge:

MATCH path = (n0:Compound)-[:BINDS_CbG]-(n1)-[:PARTICIPATES_GpPW]-
  (n2)-[:PARTICIPATES_GpPW]-(n3)-[:ASSOCIATES_DaG]-(n4:Disease)
USING JOIN ON n2
WHERE n0.name = 'Enalapril'
  AND n4.name = 'coronary artery disease'
  AND n1 <> n3
WITH
  path,
[
  size((n0)-[:BINDS_CbG]-()),
  size(()-[:BINDS_CbG]-(n1)),
  size((n1)-[:PARTICIPATES_GpPW]-()),
  size(()-[:PARTICIPATES_GpPW]-(n2)),
  size((n2)-[:PARTICIPATES_GpPW]-()),
  size(()-[:PARTICIPATES_GpPW]-(n3)),
  size((n3)-[:ASSOCIATES_DaG]-()),
  size(()-[:ASSOCIATES_DaG]-(n4))
] AS degrees, n0, n1
RETURN
  n0.name AS source_name,
  type(head(relationships(path))) AS source_edge_type,
  n1.name AS n1_name,
  sum(reduce(pdp = 1.0, d in degrees| pdp * d ^ -0.4)) AS DWPC
ORDER BY DWPC DESC

source_name	source_edge_type	n1_name	DWPC
Enalapril	BINDS_CbG	ACE	0.00273
Enalapril	BINDS_CbG	SLCO1A2	0.00081
Enalapril	BINDS_CbG	ABCB1	0.00081
Enalapril	BINDS_CbG	SLC22A7	0.00068

These results show that enalapril's binding to ACE is the main driver of the CbGpPWpGaD DWPC. In other words, if enalapril did not bind ACE, the CbGpPWpGaD DWPC would be ~40% lower (the ACE source edge contributes 0.00273 of the total CbGpPWpGaD DWPC of 0.00677 between enalapril and CAD).
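The ~40% figure can be checked with a few lines of Python. The DWPC values are copied from the source-edge table and the total reported above; the `contribution_shares` helper is a hypothetical name used only for this check.

```python
# Values reported in this discussion
total_dwpc = 0.00677  # total CbGpPWpGaD DWPC between enalapril and CAD
source_edge_dwpc = {  # partial DWPCs grouped by source edge, keyed by n1 gene
    "ACE": 0.00273,
    "SLCO1A2": 0.00081,
    "ABCB1": 0.00081,
    "SLC22A7": 0.00068,
}

def contribution_shares(partials, total):
    """Fraction of the total DWPC attributable to each group."""
    return {name: value / total for name, value in partials.items()}

shares = contribution_shares(source_edge_dwpc, total_dwpc)
ace_share = shares["ACE"]  # 0.00273 / 0.00677, roughly 0.40
```

Removing the enalapril–ACE edge would remove every path that starts with it, which is why its share of the DWPC equals the expected drop.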

Topics

Algorithms Hetnet DWPC HNEP Cypher Metapath

Referenced by

Prediction in epilepsy
Decomposing predictions into their network support
Research report: Rephetio: Repurposing drugs on a hetnet

Cite this as

Daniel Himmelstein (2016) Decomposing the DWPC to assess intermediate node or edge contributions. Thinklab. doi:10.15363/thinklab.d228