Our algorithm relies on extracting paths of a certain type (metapath) between a source and target node. For our previous project where we predicted gene–disease associations, we placed two restrictions on paths :
paths with duplicate nodes were excluded, and, if present, the association edge between the source gene and target disease was masked.
However, since migrating to neo4j, we have not added extra path restrictions beyond the builtin restriction that paths cannot contain duplicate relationships.
So in total, we have relied on three different conditions for excluding a path:
contains the prediction edge
It turns out that 1 implies 2 and 3 (1 ⇒ 2, 1 ⇒ 3). In other words, when we exclude paths with duplicate nodes, we also exclude paths with duplicate edges. And since we omit the length-one metapath that is simply the metaedge being modeled, all paths containing the prediction edge will contain either the source or target node at least twice.
You may have noticed above that for hetio-dag , we masked the prediction edge if present. Masking an edge temporarily removes it from the network. Masking the prediction edge not only eliminates paths containing that edge, but also can affect path degrees, which go into the DWPC (degree-weighted path count).
This discussion will focus on identifying the most sensible path restrictions and whether masking is warranted.
Cypher implementations of duplicate node exclusion
With help from Christophe Willemsen, @alizee and I identified two methods for excluding paths with duplicate nodes in a Cypher query.
We implemented both methods. Here, we'll describe them in the context of a length-four path matched using MATCH paths = (n0)--(n1)--(n2)--(n3)--(n4).
The nested method adds a WHERE clause specifying:
ALL (x IN nodes(paths) WHERE size(filter(z IN nodes(paths) WHERE z = x)) = 1)
This method uses list comprehension to iterate over each node in a path and ensure that it only occurs once. This clause can be applied to paths of any length and does not require assigning nodes to identifiers.
The expanded method adds a WHERE clause specifying:
NOT (n0=n1 OR n0=n2 OR n0=n3 OR n0=n4 OR n1=n2 OR n1=n3 OR n1=n4 OR n2=n3 OR n2=n4 OR n3=n4)
The method checks that no two nodes are equal by explicitly evaluating all combinations of two nodes. The method requires assigning node identifiers and is path-length dependent. However, it is intuitive and amenable to query plan optimization since filtering can be front-loaded to avoid expanding on illegitimate paths.
Another general solution would be to check whether the number of distinct nodes equals the length of the full path. We currently are unaware of a Cypher implementation for this distinct method, but suspect it could scale to longer paths better than the nested method.
One final variant of the expanded method, called labeled, would avoid comparing nodes with different labels, as these nodes are implicitly different. This method requires the greatest a priori knowledge of path characteristics.
To help optimize our cypher queries, we computed features for 347 positives and 317 negatives for the 1979 metapaths with length ≤ 4 (notebook, features). Positives were randomly selected indications, while negatives were randomly selected non-indications. We computed the PC (path count) and DWPC for each compound–disease–metapath combination using each of three node uniqueness methods described above. In addition, we computed features without excluding duplicate nodes.
Runtimes for duplicate node exclusions
We compared the average query runtime for each node uniqueness method (notebook):
average query runtime
We found that not performing any duplicate node exclusion was fastest. The most efficient method for exclusion was labeled, which explicitly checks that node pairs of the same label are not duplicates. This method only slowed down runtime by 11%, a very acceptable hit.
The nested method was on average slightly slower than expanded. However, in the overwhelming majority of instances, nested was faster than expected, yet poor worst case runtime pushed nested to last place. This observation underscores the vast asymmetry in runtimes: most queries finish quickly, while a small percentage of queries form a long tail that contributes disproportionately to overall runtime.
Remember that these findings are highly context-dependent. Here they are in the context of a subset of queries chosen to be representative of our specific HNEP (hetnet edge prediction) task.
In conclusion, if excluding paths with duplicate nodes is desired, we will use the labeled method.
The effect of duplicate node exclusion on features
Above, we described computing features for 664 compound–disease pairs × 1,979 metapaths. In this comment, we'll investigate the effect of excluding paths with duplicate nodes on this dataset.
Paths without the unique node constraint are still subject to neo4j's unique relationship constraint. In essence, the node uniqueness constraint determines whether paths containing a cycle are permitted.
As previously discussed the node uniqueness constraint also excludes paths containing the prediction edge. In our case, the prediction edge is an indication between the source compound and target disease.
Therefore, feature performance could decline after applying the unique constraint because:
paths with cycles convey meaningful information that the unique node constraint overlooks
paths including the prediction edge cause overfitting by incorporating the outcome (indication status) into the predictor (DWPC)
For the 1,961 features that contained a duplicate metanode, we calculated the decline in AUROC of the DWPC resulting from duplicate node exclusion (notebook, table). We investigated the occurrence of 1 and 2 by segregating metapaths based on whether they include an indication metaedge. If a metapath doesn't include an indication metaedge, then any change in performance must be due to 1. The distributions of AUROC declines identify 2 (overfitting) as a major factor, while showing 1 is at most a very minor factor.
Therefore, we will adopt the unique node constraint because it omits paths that include the prediction edge, which leads to overfitting. Absent their potential for overfitting, paths with duplicate nodes did not contribute greatly to DWPC performance. Finally, paths that are excluded because they contain a cycle will still be incorporated into DWPCs for shorter metapaths that bypass the cyclical segment.