Estimating the complexity of hetnet traversal

Daniel Himmelstein, Antoine Lizee

doi:10.15363/thinklab.d187

Project:

Rephetio: Repurposing drugs on a hetnet [rephetio]

Estimating the complexity of hetnet traversal

Daniel Himmelstein Researcher March 22, 2016

Our approach quantifies the hetnet topology between compound–disease pairs by calculating the prevalence of different path types (metapaths) [1]. We would like to be able to estimate the complexity of a metapath, given the graph. We'll define complexity as the number of Neo4j database hits (dbhits) needed to execute a query. Specifically, we're interested in assessing the runtime of queries for a given metapath, without having to run the queries.

If we can accurately predict runtime using estimated complexity, we can selectively avoid computing features for overly complex metapaths. Since the number of potential metapaths (as well as paths per metapath) scales combinatorially with increasing path length, our method will be always run into computational limits. We're hoping to use complexity estimates to help choose a tractable set of metpaths.

Metapath runtime on a trial feature extraction

In the past, we've used a length cutoff for metapaths. In a trial run of our feature extraction, I computed features for all metapaths with lengths 2–4. However, I terminated the process midway because progress was too slow. What we saw was that a few metapaths took a disproportionate amount of time to query.

See cells 11–12 in this notebook, noting that ~half of metapaths with lengths 2–4 are missing and that paths with duplicate nodes were not excluded as they should have been. Nonetheless, the results are clear: the metapaths with long runtimes were only mildly predictive (auroc), did not decline in predictiveness due to network permutation (delta_auroc).

The worst metapath was CdGeAeGuD, taking on average 37 seconds per query (compound–disease pair). This metapath combines four gene-expression-based edges, which can be terribly high degree on their non-gene terminus.

Hence, we will look into estimating metapath complexity to exclude such runtime outliers.

Daniel Himmelstein Researcher March 22, 2016

Estimating the complexity of a sequential traversal

In sequential traversal, the query begins on a single node and expands to create a tree of paths conforming to the specified metapath. Once the tree has been fully expanded, only paths whose leaves match the desired ending node are retained.

To estimate the sequential complexity of a metapath on a given graph, I first calculated the average degree of each metaedge based on our specific hetnet. Then, I took the log10 of the product of the average degrees along the path. Each metaedge contributes one degree: the average degree of whichever metanode is traversed first by the path. Since you can start a sequential traversal from either end of a path, we'll refer to forward versus backward sequential traversal.

@alizee proved to me in whiteboard discussion, that there is always a constant difference between the complexity of a forward and reverse sequential traversal, which is based only on the number of nodes of the source and target metanodes. The estimated complexity of a sequential traversal is less when starting on the node whose type is more abundant. I'll let @alizee explain the details.

On the trial feature extraction, forward sequential complexity was a good estimator of runtime (notebook cell 13). However, it's important to note that our Neo4j queries were not performed sequentially. Instead, traversal began from both termini and met in the middle where the results were joined. Even so, estimated sequential complexity may be a good metric to use for metapath selection.

Antoine Lizee Researcher March 23, 2016

A note on traversal complexity

Intro

The question here is to estimate the complexity of graph traversal when following a 'metapath': a specification for which kind of nodes we need to traverse, in which order. This complexity will be useful to get a proxy for computation time when create the features we currently use in rephetio for prediction problems, the DWPC. We mainly look at sequential complexity, which is estimating complexity of traversal from one point to another, in only one direction, without joining two partial traversals in the middle. (even though the latter is almost always a better option)

We can reasonably make the hypothesis that sequential complexity is equal to the number of possible paths that must be traversed. You propose above, and I would agree, that we use something along the mean-field approximation and consider that the average number of paths to be traversed when following a Metapath is equal to the multiplication of the average degrees (outward connections) of all node types along the metapath. Doing so, we intrinsically base our analysis on the number of edges in the networks, which gives us some simple, neat properties.

Formal definition

We consider as an example 4 types of Nodes in our Network, $$A, B, C, D$$, linked with symmetrical edges, and the Metapath joining them $$A-B-C-D$$. We consider these nodes in order, since the difference between 'forward' (starting from $$A$$) and 'backward' (starting from $$D$$) traversal matters - as defined by @dhimmel above. Each node type has respectively $$N_A, N_B, N_C, N_D$$ nodes, noted $$A_i, B_j,$$... Each of these single nodes has a forward degree, noted $$a_i$$ for instance, and a backward degree, noted with a prime, eg $$b_i'$$. These degrees represent the number of edges that connect the node at stake ($$A_i$$ and $$B_i$$ in our examples) to the nodes of the next ($$B$$) or previous ($$A$$) types respectively.

[remark: we use the actual types instead of a notation "$$K$$" for simplicity. The results that are written below for nodes of type $$A$$ or $$B$$ stand general]

Proof of the equivalence between forward and backward complexities

The average of node degree is written $$E[a_i]$$ such as:

$$$ \begin{align} E_{i\in1,...,N_A}[a_i]=\frac{1}{N_A}\sum_{i\in1,...,N_A}{a_i} \end{align} $$$

Then, we have

$$$ \begin{align} E_{i\in1,...,N_A}[a_i] \cdot N_A = N_{A-B} = E_{i\in1,...,N_B}[b_i'] \cdot N_B \end{align} $$$

where $$N_{A-B}$$ is the number of edges from the nodes $$A$$ to $$B$$. Follows, in short notation:

$$$ \begin{align} E[a_i] = E[b_i'] \cdot \frac{N_B} {N_A} \quad or \quad E[b_i'] = E[a_i] \cdot \frac{N_A} {N_B} \end{align} $$$

The estimation of 'forward complexity' outlined above, for discovering the paths that correspond to the metapath $$A-B-C-D$$ is written as:

$$$ \begin{equation} FC_{A-B-C-D} = E[a_i] \cdot E[b_i] \cdot E[c_i] \end{equation} $$$

The estimated backward complexity can be written as:

$$$ \begin{align} BC_{A-B-C-D} = FC_{D-C-B-A} = E[b_i'] \cdot E[c_i'] \cdot E[d_i'] \end{align} $$$

Follows:

$$$ \begin{align} BC_{A-B-C-D} &= E[b_i'] \cdot E[c_i'] \cdot E[d_i']\\ &= E[a_i]\frac{N_A}{N_B} \cdot E[b_i]\frac{N_B}{N_C} \cdot E[c_i]\frac{N_C}{N_D}\\ &= E[a_i] \cdot E[b_i]\ \cdot E[c_i] \cdot \frac{N_A}{N_D}\\ BC_{A-B-C-D} &= FC_{A-B-C-D} \cdot \frac{N_A}{N_D} \end{align} $$$

More generally:

$$$ \begin{equation} \boxed{BC = FC \cdot \frac{N_{START}}{N_{END}}} \end{equation} $$$

All sequential complexities are made equal

Thus, forward and backward complexities are trivially related by the formula above.

Nevertheless, we need to keep in mind that the forward and backward complexity estimates above stand only for a single traversal, starting either from one Node $$A_i$$ or $$D_i$$ respectively. Thus, if you plan to do these traversals for all the $$N_A \cdot N_D $$ possible pairs (or a randomly selected subset of those) you need to multiply these complexities by the number of starting nodes, $$N_A$$ or $$N_D$$ in order to get a relevant time estimate. As a result of this computation and under these hypotheses, there is no difference between forward and backward traversal - which is reassuring.

Conclusion

Forward and Backward Complexity estimates are trivially related through the formula above.
Time estimates should take into account the number of starting nodes, resulting in no difference between both directions of traversal.
About join complexity: I believe that the sequential complexity estimates discussed here are good proxies for the process of "joint" traversal, when two semi-traversals are joined in the middle in order to reduce computation time.

Lars Juhl Jensen: I know it's dangerous to make pre-morning coffee comments on a mathematical proof, but didn't you get N_START and N_END swapped in the last formula? In the formula just above, I thought N_A was the number of start nodes and N_D the number of end nodes.

It would also make more sense intuitively, since it implies that N_START⋅FC = N_END⋅BC. In other words, it doesn't matter whether you start from every start node and do forward traversal or start from every end node and do backward traversal. Both involve that you traverse all possible paths between all start nodes and all end nodes and have the same complexity.
Antoine Lizee: @larsjuhljensen: I wish I could say I let this one slip to test readership's attention, but it just seems that your pre-morning coffee brain is more awake than my post-lunch one. Well spotted - thank you.

Daniel Himmelstein Researcher March 31, 2016

The @alizee theorem of sequential complexity

@alizee, fantastic proof! The takeaway for me is that if you're doing a sequential traversal from one source to one target, start from the end where there are a greater number of nodes. In situations where we are using sequential complexity to estimate runtime, we only have be consistent on reporting either forward or backward complexity.

Antoine Lizee: Correct

Daniel Himmelstein Researcher March 31, 2016

A formula for estimating joint traversal complexity

Rather than perform traversals sequentially, both Neo4j and hetio support joint traversal. In joint traversal, paths are expanded upon from both endpoints until meeting at an interior node, where they are joined.

Using @alizee's notation above, we'll consider metapath $$A-B-C-D$$. If we adopt a join index of 2 (joining on C), we compute joint complexity (JC) as:

$$$ JC_{A-B-C_{join}-D} = \log _{10} (E[a_i] \cdot E[b_i] + E[d_i']) $$$

Note that @alizee didn't include the log, but for consistency with our reported complexity values, I'm including it. This proposed formula for complexity assumes that joining the traversals is a free operation. In Neo4j a NodeHashJoin doesn't require any dbhits, but could still add to runtime.

When the join index is 0, the joint traversal becomes equivalent to backward traversal. When the join index is the target node, the joint traversal becomes equivalent to forward traversal.

Daniel Himmelstein Researcher March 31, 2016

Choosing the best join index for Neo4j queries

If you don't give Neo4j 2.3.2 any hints where to join, the planner decides for itself. The planner is capable of sequential or join traversal. However, which query plan to use is determined prior to query execution (see my issue). You can tell Neo4j which node to join on using a join hint (USING JOIN ON in Cypher). If you specify a terminal node to join on in 2.3.2, a sequential query will be executed, but Neo4j will decide for itself whether to do a forward or backward sequential join.

Therefore, we wanted to see whether we could use our estimated join complexities to optimize our Neo4j queries. Therefore we did a parameter sweep where we evaluated all possible join indexes when computing features (DWPCs) for 75 metapaths × 150 compound–disease pairs (notebook). In total, we performed 66,600 queries.

We consider three options for specifying the join index:

midpoint index — the floor of the metapath length divided by two. Note if there were more diseases than compounds, then we expect the ceiling would outperform the floor.
optimal index — the least complex join index based on our formula for estimating join complexity.
no hint — where we let Neo4j choose the join index. The query planner will choose a join index, although we do not know which index was used.

For each metapath, we identified the average runtime of the three options (table). While no option was universally fastest for all metapaths, on average the midpoint index was best (0.229 seconds), followed by the optimal index (0.339), and last no hint (0.656). In other words, specifying the midpoint as the join index cut runtime in third compared to not providing any join hint. Interestingly, our optimal complexity estimate performed ~50% worse than the midpoint overall.

On an individual query level, the midpoint index was the fastest index for 40% of queries, while the optimal index was the fastest index for 29% of the queries. The average rank of the fastest join index according to our complexity estimate was 2.5.

In conclusion, the optimal join index for a specific query is highly variable. For a given metapath, different compound–disease pairs will prefer different join indexes. However, since we plan to select a join index at the metapath level, the midpoint is the best option thus far.

Daniel Himmelstein Researcher March 31, 2016

Complexity poorly estimates runtime

We recently performed 22,933,125 DWPC queries (3,775 compound–disease pairs × 1,215 metapaths × 5 hetnets). For these queries, I specified the optimal join index according to our formula for join complexity. In the future, we will switch to the midpoint join index.

Both sequential complexity and optimal index join complexity poorly predicted runtime (notebook). Many of the outlier metapaths appear to contain a Gene→regulates→Gene edge. Since, the target genes for these edges will largely be restricted to the 978 landmark L1000 genes, I suspect mean degree is a underestimating complexity. One idea I've had is to weight the average degree calculation by degree to address this issue. In other words, should we account for the fact that queries are more likely to traverse high degree nodes when estimating complexity?

Antoine Lizee: I would love to see how sequential complexity predicts midpoint runtime when these results will be ready.

Daniel Himmelstein Researcher April 16, 2016

The relationship between midpoint-join runtime and complexity

We performed a similar DWPC computation to before but switched to specifying midpoint join indexes rather than our "optimal" join indexes. This time we computed 27,308,958 DWPCs for 3,775 compound–disease pairs × 1,206 metapaths × 6 hetnets. However, a bug was introduced which made queries taking over 30 seconds fail silently. Therefore, the following features are incomplete and will be prone to runtime underestimation: CbGeAeGaD, CdG<rGeAlD, CdGeAeGaD, CdGeAeGdD, CdGeAeGuD, CuG<rGeAlD, CuGeAeGaD, CuGeAeGdD, CuGeAeGuD, CuGeAuGaD.

@alizee: the switch to midpoint join led to a stronger association between sequential complexity and runtime (notebook). Interestingly, switch also improved the fit between optimal complexity and runtime.

Antoine Lizee: EDIT: removed because unnecessarily complicated while wrong.
Daniel Himmelstein: EDIT: removed in response to the removal of @alizee's preceding note.
Daniel Himmelstein: See the updated notebook where the silent timeout issue has been fixed. Now all queries count towards average time.

Status: Completed

Views

Topics

Join Hints Runtime Optimization Neo4J Complexity

Referenced by

Cite this as

Daniel Himmelstein, Antoine Lizee (2016) Estimating the complexity of hetnet traversal. Thinklab. doi:10.15363/thinklab.d187

License