Rephetio: Repurposing drugs on a hetnet [rephetio]

Using the neo4j graph database for hetnets

Recently, I attended a two-part meetup series on the graph database neo4j. Nicole White led the meetups, and her materials are online:

  1. neo4j: Intro to Graphs (slides, meetup)
  2. Data Science with Python and Neo4j (tutorial, repository, meetup)

Currently, we store our hetnets in compressed JSON text files. To perform any computation or graph analysis, we must first load the network into memory, which takes 2–5 minutes for version one of our network. In contrast, neo4j provides persistent storage with immediate access.

Additional benefits of neo4j include a mature ecosystem offering broad functionality. The Cypher query language is especially exciting: it uses an ASCII-art-based syntax to enable advanced graph lookups and traversals with little boilerplate.

Comparison to hetio

There are a few differences between neo4j and our Python package hetio [1] with regard to hetnets:

  • nomenclature: an edge in hetio is called a relationship in neo4j. hetio calls its graph a hetnet, while neo4j calls its graph a property graph.
  • node type: in hetio, each node belongs to one metanode representing its type. In neo4j, nodes are annotated with labels to indicate type, and a node can have zero or more labels.
  • directionality: in hetio, metaedges are either directed or undirected, and edges conform to their metaedge's directionality. neo4j doesn't support undirected edges. The suggested workaround is to arbitrarily choose a direction upon creation and ignore the direction when querying.
  • type graph: hetio requires a predefined graph of types called a metagraph. neo4j does not enforce or explicitly support a graph of type definitions.
  • inverted edges: hetio internally stores two copies of each edge (inverses of each other).

We plan to create export functionality from hetio to neo4j, so we can leverage the strengths of neo4j.

Exporting hetio hetnets to neo4j

We've added neo4j export capability to hetio. Our implementation uses the py2neo toolkit to interact with the neo4j server.

Adding edges is quite slow and the database size is large. However, the neo4j browser combined with cypher is great for exploratory analyses. In a short amount of time, I discovered 4 issues with our network (1, 2, 3, 4) and created a sneak-preview visualization.

We exported the current version of our network, which contains 49,399 nodes and 2,997,246 edges, to neo4j. The export took 10 hours and resulted in a 3.04 GB database.

The data/graph.db/ directory, which stores the database, contained the following files larger than 1 MB:

File                                        Size
messages.log                                1.3 GB
data/graph.db/neostore.transaction.db.1     262 MB
data/graph.db/neostore.transaction.db.2     262 MB
data/graph.db/neostore.transaction.db.3     262 MB
data/graph.db/neostore.transaction.db.4     262 MB
data/graph.db/neostore.transaction.db.5     262 MB
data/graph.db/neostore.transaction.db.6     262 MB
data/graph.db/neostore.transaction.db.7     113 MB
neostore.propertystore.db                   127.9 MB
neostore.relationshipstore.db               102 MB
neostore.propertystore.db.strings           25 MB
neostore.relationshipgroupstore.db          3.2 MB
rrd                                         2.0 MB
neostore.propertystore.db.arrays            1.5 MB
neostore.nodestore.db                       1.5 MB

We will look into ways to speed up our write times and reduce storage.

DWPC in Cypher

We've implemented the degree-weighted path count (DWPC [1]) in Cypher. Our implementation produces a different query for each metapath, but specifies the source node (source), target node (target), and damping exponent (w) as parameters.

Below is the query for the CsCuGod>GuD metapath:

MATCH p = (n0:Compound)-[:SIMILARITY]-(n1:Compound)-[:UPREGULATION]-(n2:Gene)-[:OVEREXPRESSION_DOWNREGULATION]->(n3:Gene)-[:UPREGULATION]-(n4:Disease)
WHERE n0.identifier = { source }
AND n4.identifier = { target }
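WITH [size((n0)-[:SIMILARITY]-(:Compound)),
size((:Compound)-[:SIMILARITY]-(n1)),
size((n1)-[:UPREGULATION]-(:Gene)),
size((:Compound)-[:UPREGULATION]-(n2)),
size((n2)-[:OVEREXPRESSION_DOWNREGULATION]->(:Gene)),
size((:Gene)-[:OVEREXPRESSION_DOWNREGULATION]->(n3)),
size((n3)-[:UPREGULATION]-(:Disease)),
size((:Gene)-[:UPREGULATION]-(n4))] AS degrees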
RETURN sum(reduce(pdp = 1.0, d in degrees| pdp * d ^ -{ w }))

The DWPC for this metapath measures the extent to which compounds similar to the query compound upregulate genes whose overexpression downregulates genes upregulated by the query disease. The MATCH clause identifies paths corresponding to the metapath. The WITH clause computes degrees along each path, and the RETURN clause computes path degree products (PDPs) and sums them to get the DWPC.
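
The arithmetic in the RETURN clause can be sketched outside the database. The Python below is only an illustration on made-up degree lists (the names path_degree_product, dwpc, and toy_paths are ours, not part of hetio): each path contributes the product of its degrees raised to the power -w, and the DWPC is the sum over paths.

```python
from math import prod


def path_degree_product(degrees, w):
    """Multiply together each degree raised to the power -w."""
    return prod(d ** -w for d in degrees)


def dwpc(paths_degrees, w):
    """Sum the path degree products over all matched paths."""
    return sum(path_degree_product(degrees, w) for degrees in paths_degrees)


# Two hypothetical paths, each listing the degrees encountered along it.
toy_paths = [
    [1, 4, 4, 1],  # passes through two higher-degree nodes
    [1, 1, 1, 1],  # every node has degree one
]

print(dwpc(toy_paths, w=0.5))  # 0.25 + 1.0 = 1.25
```

With w = 0 every path degree product equals 1, so the DWPC reduces to the plain path count. Larger values of w downweight paths that pass through high-degree nodes more strongly.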

Comparison to hetio

We configure our hetio queries to exclude paths with duplicate nodes. However, neo4j excludes duplicate relationships. Additionally, when computing features for an indicated compound–disease pair, we configure our hetio queries to ignore that indication. Our current cypher framework does not support this exclusion.
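
To make the difference concrete, here is a small Python sketch on a toy path. It assumes at most one undirected relationship between any pair of nodes, and the function names are hypothetical. The path below revisits node A without reusing any relationship, so a relationship-uniqueness rule keeps it while a node-uniqueness rule drops it.

```python
def repeats_a_node(path_nodes):
    """True if any node occurs more than once in the path."""
    return len(set(path_nodes)) != len(path_nodes)


def repeats_a_relationship(path_nodes):
    """True if any undirected relationship is traversed more than once.

    Toy assumption: a pair of nodes is joined by at most one relationship,
    so an unordered node pair identifies the relationship.
    """
    rels = [frozenset(pair) for pair in zip(path_nodes, path_nodes[1:])]
    return len(set(rels)) != len(rels)


# A toy path that returns to node A through a different relationship.
path = ["A", "B", "C", "A", "D"]

print(repeats_a_relationship(path))  # no relationship is reused
print(repeats_a_node(path))          # but node A is visited twice
```

One way to bring the two systems in line would be to post-filter the paths returned by Cypher with a check like repeats_a_node, at the cost of returning every path instead of a single aggregate.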

Our preliminary experience is that DWPC computations in neo4j run approximately twice as quickly as in hetio. However, hetio may have more room for improvement, since we haven't implemented path caching yet.

GraphConnect 2015

Today I attended GraphConnect, a conference focused on neo4j. CEO Emil Eifrem kicked off the event with several exciting announcements:

  • Neo4j 2.3 has been released, bringing speed and scalability improvements. Specifically, the caching infrastructure has been rewritten to provide "significant (up to 2-3x) improvements in concurrent read scaling."
  • Neo4j 3.0 is in the works and will bring unified and official drivers across languages. The initial release will include a Python driver but not an R driver.
  • Cypher will be open sourced as openCypher. This will hopefully give rise to a standard query language for all graph databases.

Select learnings

Neo4j is designed for deep traversals. Other graph databases preferentially support big data (networks with billions of nodes) over efficient traversal. Since our network is small but our edge prediction method requires deep traversal, neo4j is a good fit for our application.

Neo4j doesn't enforce or specifically support a type graph (also called a schema, metagraph, or graph model). However, a metagraph can easily be created from an already populated graph. While neo4j won't innately reason based on the created metagraph, it can be convenient from a user standpoint.
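
As an illustration of how a metagraph falls out of a populated graph, the Python sketch below collects every distinct (source label, relationship type, target label) triple from a toy edge list. The node ids and edges are invented for the example, and derive_metagraph is a hypothetical helper rather than a neo4j or hetio function.

```python
def derive_metagraph(node_labels, edges):
    """Return the sorted set of (source label, relationship type, target label).

    node_labels maps a node id to its single label (a simplification, since
    neo4j nodes may carry zero or more labels).
    edges is an iterable of (source id, relationship type, target id).
    """
    return sorted(
        {(node_labels[source], rel_type, node_labels[target])
         for source, rel_type, target in edges}
    )


# Toy data: ids and edges are made up for the example.
node_labels = {"c1": "Compound", "g1": "Gene", "g2": "Gene", "d1": "Disease"}
edges = [
    ("c1", "UPREGULATION", "g1"),
    ("c1", "UPREGULATION", "g2"),
    ("d1", "UPREGULATION", "g1"),
]

for metaedge in derive_metagraph(node_labels, edges):
    print(metaedge)
```

The three edges collapse to two metaedges, which is the sense in which the metagraph summarizes the populated graph.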

  • Lars Juhl Jensen: Neo4j 3.0 should be interesting. So far, what has mainly held me back was that I do not want to develop my bioinformatics software in Java and doing everything through Cypher was not sufficiently efficient to warrant migration from a PostgreSQL database.

  • Daniel Himmelstein: Milestone 1 of version 3.0 was released today with a python driver. The update promises fast access from outside of java. However, you will still need to use Cypher (which I find easier and more powerful than SQL). @larsjuhljensen, I would wait till the final 3.0 release for production use but wanted to give you a heads up of what lies ahead.

  • Lars Juhl Jensen: Thanks - my problem was, though, that my network queries could not be expressed efficiently in Cypher. Whereas shortest path could be done very efficiently, something as simple as shortest path in a weighted graph could not. Maybe Cypher has become more powerful since?

  • Daniel Himmelstein: Cypher doesn't have great algorithm coverage. Shortest weighted path isn't too hard to implement, albeit inefficiently. However, if you're encoding anything that resembles a hetnet, cypher will beat SQL for data interactions.

  • Daniel Himmelstein: Lars, you should have mentioned your article on whether graph databases are ready for bioinformatics [1]! This is a citation that belongs in this discussion.

    Are graph databases ready for bioinformatics?
    C. T. Have, L. J. Jensen (2013) Bioinformatics. doi:10.1093/bioinformatics/btt549
  • Daniel Himmelstein: @larsjuhljensen, I just saw that APOC — a Neo4j plugin for "Awesome Procedures On Cypher" — contains several graph algorithms. Installing APOC is easy. It looks like some but not all algorithms have edge weight support.

Query Optimization

Above, we debuted DWPC (degree-weighted path count) computation using Cypher. I noticed that looking up the degrees along each path was a major timesink. This finding was surprising because node degree lookup should be trivial compared to path traversal.

In a stroke of genius, @alizee hypothesized that our inclusion of node labels was to blame. For our diagnostic query, removing node labels reduced database hits 1,339-fold and runtime 8-fold.

Michael Hunger, caretaker general of the neo4j community, explained:

size(pattern) uses node.getDegree() if pattern only contains relationship type & direction

More explanation is available here, but the essential insight is that by using only direction and relationship type to look up node degree, we no longer need to look up the label on the other end of each edge.
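
A rough Python sketch of how label-free degree lookups could be generated for a path is shown below. The function degree_clauses is hypothetical (it is not hetio's query builder), it handles only undirected traversal, and the exact form size((n0)-[:TYPE]-()) is our reading of the quote above rather than something it spells out.

```python
def degree_clauses(rel_types):
    """Build two size() expressions per relationship along a path.

    For the relationship between n{i} and n{i+1}, the first expression is the
    degree of n{i} and the second is the degree of n{i+1}, each restricted to
    that relationship type. The far end of each pattern carries no label.
    """
    clauses = []
    for i, rel_type in enumerate(rel_types):
        clauses.append(f"size((n{i})-[:{rel_type}]-())")
        clauses.append(f"size(()-[:{rel_type}]-(n{i + 1}))")
    return clauses


# Relationship types taken from the earlier example query.
for clause in degree_clauses(["SIMILARITY", "UPREGULATION"]):
    print(clause)
```

Because the far end is unlabeled, this form only counts the right edges when a relationship type identifies a single metaedge at each node.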

Database changes

To support this optimization, we need to ensure that no two metaedges that touch a common metanode have the same relationship type. Our current hetnet is noncompliant in this regard: for example, the three Gene Ontology metaedges all have kind 'participation':

  1. Gene–participation–Biological Process
  2. Gene–participation–Molecular Function
  3. Gene–participation–Cellular Component

Therefore, we have implemented unique neo4j relationship types for each metaedge (primary commit and bugfixes 1 and 2) by appending the standardized metaedge abbreviation to its kind. With this change, the relationship types for the Gene Ontology metaedges become:
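
Judging from the relationship types in the example query that follows (such as CAUSATION_CcSE), each type appears to be the upper-cased kind joined to the metaedge abbreviation with an underscore. Here is a minimal Python sketch of that naming, assuming the abbreviations GpBP, GpMF, and GpCC for the three Gene Ontology metaedges (the text above does not confirm them) and using a hypothetical helper name:

```python
def relationship_type(kind, abbreviation):
    """Join the upper-cased metaedge kind and its abbreviation."""
    return f"{kind.upper()}_{abbreviation}"


# Abbreviations are assumed for illustration.
go_metaedges = [
    ("participation", "GpBP"),  # Gene to Biological Process
    ("participation", "GpMF"),  # Gene to Molecular Function
    ("participation", "GpCC"),  # Gene to Cellular Component
]

for kind, abbreviation in go_metaedges:
    print(relationship_type(kind, abbreviation))
```

Under that assumption, the three metaedges no longer share a relationship type, so a degree lookup by type alone is unambiguous at each Gene node.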


Example query

With the updated database, the query for calculating the DWPC between goserelin and lung cancer for the CcSEcCdGuD metapath is:

MATCH paths = (n0:Compound)-[:CAUSATION_CcSE]-(n1)-[:CAUSATION_CcSE]-(n2)-[:DOWNREGULATION_CdG]-(n3)-[:UPREGULATION_DuG]-(n4:Disease)
  WHERE n0.identifier = 'DB00014' // Goserelin
  AND n4.identifier = 'DOID:1324' // lung cancer
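  WITH [
    size((n0)-[:CAUSATION_CcSE]-()), size(()-[:CAUSATION_CcSE]-(n1)),
    size((n1)-[:CAUSATION_CcSE]-()), size(()-[:CAUSATION_CcSE]-(n2)),
    size((n2)-[:DOWNREGULATION_CdG]-()), size(()-[:DOWNREGULATION_CdG]-(n3)),
    size((n3)-[:UPREGULATION_DuG]-()), size(()-[:UPREGULATION_DuG]-(n4))
  ] AS degrees, paths
  RETURN COUNT(paths) AS PC,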
  sum(reduce(pdp = 1.0, d in degrees| pdp * d ^ -0.4)) AS DWPC

Concurrent queries using py2neo

As explained by Stefan Armbruster, a single cypher query is limited to a single core. However, multiple queries can be fulfilled in parallel:

With current versions of Neo4j, a Cypher query traverses the graph in single threaded mode. Since most graph applications out there are concurrently used by multiple users, this model saturates the available cores. [source]

Currently, we perform a separate query for each compound–disease–metapath combination. Depending on the number of compound–disease pairs and metapaths considered, we will need to compute between 1 million and 1 billion DWPCs.

Our software package for hetnets, hetio [1], is built in python. Despite migrating to neo4j, we are still dependent on hetio for:

  • metagraph operations
  • edge directionality
  • metapath abbreviation
  • constructing cypher queries

Therefore, we're using python to construct and execute queries with the py2neo package.

To enable concurrent queries, we initially used the multiprocessing module (notebook), which achieves parallelism by creating subprocesses. However, subprocesses carry substantial overhead. Since the majority of computation is performed outside of Python by the neo4j server, we switched to the threading module (notebook). Threading has less overhead than multiprocessing, but all threads share a single Python process in which only one thread executes Python code at a time. However, since a thread releases the global interpreter lock while it waits for its Cypher query to return, the restriction to a single process is not a time-limiting step.

In the end, we used concurrent.futures to make threading easier (notebook). We encountered a small hiccup where our queue of queries waiting to be executed grew large and consumed substantial memory. We addressed the issue by postponing new query submission until the queue dropped below a given size.
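
The throttling idea can be sketched as follows. This is not the code from our notebook: run_bounded and fake_query are hypothetical names, and fake_query merely sleeps in place of sending a Cypher query through py2neo. A semaphore blocks submission whenever too many queries are submitted but unfinished, and a callback frees a slot each time a query completes.

```python
import concurrent.futures
import threading
import time


def fake_query(task):
    """Hypothetical stand-in for executing one Cypher query."""
    time.sleep(0.001)  # pretend to wait on the database
    return task * task


def run_bounded(tasks, worker, max_workers=4, max_pending=8):
    """Run worker(task) on a thread pool with a cap on pending submissions.

    At most max_pending tasks are submitted but unfinished at any time.
    Results are returned in submission order.
    """
    slots = threading.BoundedSemaphore(max_pending)
    futures = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        for task in tasks:
            slots.acquire()  # wait here while the pending queue is full
            future = executor.submit(worker, task)
            # Free one slot when this task finishes, whether it succeeds or fails.
            future.add_done_callback(lambda _: slots.release())
            futures.append(future)
        return [future.result() for future in futures]


results = run_bounded(range(20), fake_query, max_workers=4, max_pending=5)
print(results[:5])
```

Since tasks can be any iterable, a generator of queries is consumed lazily and only about max_pending of them are materialized at once, although this sketch still keeps every future until the end to collect results.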

Performing concurrent queries led to ~1000% processor usage by neo4j, equivalent to 10 cores at full load. This benchmark was performed on a 16 core machine running neo4j-community-2.3.1 on Ubuntu 15.10. Increasing the number of concurrent python workers above 16 did not increase the ~1000% usage figure. Let us know of any methods to increase processor saturation.

2016 GraphGist Challenge

Neo4j is hosting a GraphGist challenge. GraphGists provide the following:

With Neo4j GraphGists you can describe and model your domain in a simple text file (AsciiDoc) and render it as a rich, interactive, database-backed page in any browser. It is perfect to document a specific domain, use-case, question or graph problem.

This year's competition is Star Wars themed, a theme we adhered to in our submission. To give you a taste, our prologue begins with:

A long time ago in a galaxy far, far away… It is a dark time for drug discovery. The Empire spends over a billion dollars in R&D per new drug approval. The process takes decades, 9 out of 10 attempts fail, and the cost has been doubling every 9 years since 1970. But, a small band of Rebel scientists pursue an alternative. Using public data and open source software, the Rebels are predicting new uses for existing drugs.

Our goal in creating a submission was twofold. First, we're excited to interact with other members of the neo4j community who are doing complementary work. Second, we designed the GraphGist to be a good introduction to our project and hetnet relationship prediction in general.

Literature on using Neo4j for biomedical hetnets

Here we'll compile a list of studies that discuss using Neo4j or graph databases for hetnets related to biology or medicine.

  • "Are graph databases ready for bioinformatics?" by @larsjuhljensen as noted above [1]

  • "Representing and querying disease networks using graph databases" which we discussed here [2]

  • "Use of Graph Database for the Integration of Heterogeneous Biological Data" which found that Neo4j was faster than MySQL for some common queries on a biological hetnet [3].

Status: In Progress
Cite this as
Daniel Himmelstein (2015) Using the neo4j graph database for hetnets. Thinklab. doi:10.15363/thinklab.d112

Creative Commons License