One network to rule them all

Daniel Himmelstein, Lars Juhl Jensen

doi:10.15363/thinklab.d102

Project:

Rephetio: Repurposing drugs on a hetnet [rephetio]

Daniel Himmelstein Researcher Aug. 14, 2015

Network version 1.0

We have completed an initial version of our network (notebook, download) [1].

The network consists of 10 types of nodes (metanodes) and 27 types of edges (metaedges). It contains 49,427 nodes and 2,997,892 edges (1,488,312 of which are unbiased). The network is visualized below, laid out by metanode and colored by metaedge (only a subset of edges are drawn for efficiency):

For additional information, see the summary of nodes, summary of edges, or visualization of degree distributions. Network existence (SHA256 checksum for graph.json.gz) is proven in Bitcoin block 369,898.
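Anyone who downloads graph.json.gz can check it against the timestamped checksum by recomputing the SHA256 digest locally. The following is a minimal sketch using only the Python standard library; the function name and chunk size are illustrative choices, and the expected digest must be taken from the published record rather than from this example.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at `path`.

    The file is read in chunks so that a large download such as
    graph.json.gz does not have to fit in memory at once.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `sha256_of_file("graph.json.gz")` and comparing the result with the published checksum confirms that the local copy is byte-for-byte the file that was timestamped. Note that the digest covers the compressed file, so decompressing and recompressing it may change the result.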

Future changes

There are a few changes we hope to make in the near future. First, we plan to replace ADEPTUS with STARGEO for expression signatures of disease. Second, we plan to update our indications with a manually curated subset. As always, suggestions for additional information types are welcome here.

Daniel Himmelstein: 2015-09-06: I updated this post to correct edge counts. Previously, I wrote the network contains "5,998,711 edges (2,977,167 of which are unbiased)". The issue was caused by counting single edges multiple times.
Daniel Himmelstein: Here are two references for publishing hashes of academic content to the bitcoin blockchain: a 2014 blog post and a 2016 paper [1].

1. Greg Irving, John Holden (2016) How blockchain-timestamped protocols could improve the trustworthiness of medical science. F1000Research. doi:10.12688/f1000research.8114.2
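The 2015-09-06 correction above attributes the inflated edge count to single edges being counted multiple times. The post does not say how that happened. One common cause, assumed here purely for illustration, is an undirected edge being visited once from each endpoint. The sketch below counts each edge once; the tuple layout and the example identifiers are hypothetical and do not reflect the actual graph.json.gz format.

```python
def count_unique_edges(edges):
    """Count each edge once.

    `edges` is an iterable of (source, target, kind, directed) tuples,
    a hypothetical layout chosen for this sketch. For undirected edges,
    (a, b) and (b, a) are treated as the same edge; for directed edges
    they are treated as two different edges.
    """
    seen = set()
    for source, target, kind, directed in edges:
        if directed:
            key = (kind, source, target)
        else:
            # frozenset ignores endpoint order, so both visits collapse.
            key = (kind, frozenset((source, target)))
        seen.add(key)
    return len(seen)


# Hypothetical example: one undirected edge listed from both endpoints,
# plus two directed edges that are distinct from each other.
example_edges = [
    ("node_a", "node_b", "interacts", False),
    ("node_b", "node_a", "interacts", False),
    ("node_a", "node_b", "regulates", True),
    ("node_b", "node_a", "regulates", True),
]
naive_count = len(example_edges)
unique_count = count_unique_edges(example_edges)
```

Keying on the metaedge type as well as the endpoints keeps edges of different kinds between the same pair of nodes separate, which matters in a network with many metaedges.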

Lars Juhl Jensen Aug. 16, 2015

Nice of you to share this big network with everyone; however, I think you need to take care not to get yourself into legal trouble here.

I looked into the JSON network file and found the following:
- Gene membership of all KEGG maps. If you look at the KEGG license, it is questionable if you can do that at all, and very clear that you cannot allow commercial use.
- Side effect data from (I assume) SIDER. The SIDER download files are distributed under the CC-BY-NC-SA license, which means that you are only allowed to redistribute if you give attribution and attach the same license to the derived work.

Given earlier posts, I assume that you also import associations from the TISSUES database. Even though I distribute this resource under the very permissive CC-BY license, you are still required to give attribution. This could, for example, be done by including relevant linkouts under "data" : { "url" : "..." }.

I am not trying to cause trouble here - just the contrary. When making a meta-resource, licenses and copyright law are not something you can afford to ignore. I regularly leave out certain data sources from my resources for legal reasons. For example, OMIM is not included in DISEASES due to its restrictive license.

Lars Juhl Jensen: I forgot to mention explicitly that this obviously means that you cannot distribute your network file on GitHub under the CC0 waiver as you currently do.
Lars Juhl Jensen: Redistribution of LINCS data is also problematic, unless you contacted them and got permission to do so (http://www.lincscloud.org/license/).
Daniel Himmelstein: Okay, I will look into the copyright issues and make any necessary modifications. I like the idea of including source and license fields for each node and edge. My understanding is that CC licenses prior to 4.0 do not restrict the underlying data when used in the United States. However, I will need to do more reading and solicit the advice of copyright experts.
Daniel Himmelstein: I updated the repository license until we clarify these issues.
Lars Juhl Jensen: CC licenses prior to 4.0 were not very well suited for data. I only just noticed that the new SIDER still uses the old license, which I think should be changed. Not to make it more restrictive, but simply because the 4.0 license addresses a problem in the earlier license (i.e. the attribution stacking problem).
Lars Juhl Jensen: Unfortunately, as far as I know, it is not sufficient to consider US law when you distribute data to the whole world. Although the US does not have sui generis database right, other parts of the world do, including the EU. This means that databases created in the EU are protected by such laws. (Caveat: I am not a lawyer, this does not constitute legal advice, yada yada yada.)
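The attribution ideas raised in this exchange (linkouts under "data" : { "url" : "..." }, plus source and license fields for each node and edge) could look roughly like the sketch below. The record layout, field names other than "data" and "url", the helper function, and the example.org URL are all assumptions made for illustration; only the TISSUES name and its CC-BY license come from the discussion above.

```python
import json

# Attribution fields this sketch expects under each record's "data" key.
# "url" follows the suggestion in the thread; "source" and "license" are
# assumed names for the per-node and per-edge fields that were proposed.
REQUIRED_FIELDS = ("source", "license", "url")


def missing_attribution(record):
    """Return the attribution fields that are absent or empty in `record`."""
    data = record.get("data", {})
    return [field for field in REQUIRED_FIELDS if not data.get(field)]


# Hypothetical edge record; identifiers and URL are placeholders.
edge = {
    "kind": "expression",
    "source_id": "anatomy:example",
    "target_id": "gene:example",
    "data": {
        "source": "TISSUES",
        "license": "CC-BY",
        "url": "https://example.org/tissues-record",
    },
}

unattributed = {"kind": "expression", "data": {"source": "TISSUES"}}

edge_problems = missing_attribution(edge)
unattributed_problems = missing_attribution(unattributed)
serialized = json.dumps(edge, sort_keys=True)
```

A check like `missing_attribution` could be run over every node and edge before a release, so that records lacking a source, license, or linkout are caught before the network file is distributed.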

Lars Juhl Jensen Aug. 16, 2015

Related to the license issues, it is wise that you solicit advice from legal experts. However, as long as you are dealing with databases created by researchers in academia, the risk of actually getting sued is pretty minimal. The most important question that you should be asking yourself is thus not "what can I technically do without risking a lawsuit?", but rather "what was the intent of the license?". If you frequently do things that may be technically legal but clearly go against the intent of other researchers, you will quickly make many enemies.

Just friendly words of advice :-)

Daniel Himmelstein Researcher Sept. 5, 2015

Licensing and copyright

Thanks @larsjuhljensen for your helpful advice. We have created a separate discussion for licensing and copyright to continue the conversation.

Daniel Himmelstein Researcher Sept. 5, 2015

Network name

We've realized that not having a name for the network is a major impediment. We are tentatively calling the network hetio-ind. However, rephetio, the current Thinklab handle for the project, is also an option. We will update this thread when the name is finalized.

Daniel Himmelstein Researcher Feb. 4, 2017

Hetionet v1.0 released

Just wanted to update this thread. The final version of the hetnet for Project Rephetio was released a while ago. The network is named Hetionet, and Project Rephetio is based on version 1.0, which is available on GitHub [1] in JSON, TSV, and Neo4j database formats. We also host a public Neo4j database instance at https://neo4j.het.io [2].

For details on the construction of Hetionet v1.0, see the Methods section of the project report/manuscript [3, 4]. Quoting from the abstract:

Hetionet v1.0 consists of 47,031 nodes of 11 types and 2,250,197 relationships of 24 types. Data was integrated from 29 public resources to connect compounds, diseases, genes, anatomies, pathways, biological processes, molecular functions, cellular components, pharmacologic classes, side effects, and symptoms.

Here is a visualization:

Hetionet v1.0 landscape labeled

Just to be clear, Hetionet v1.0 is not the same as the "initial version of our network" mentioned above. Please use Hetionet v1.0 rather than earlier versions, which had a less developed licensing solution and should therefore be avoided.

Status: Completed

Topics

Networks Network Biology Drug Repurposing Dataset Heterogeneous Network

Cite this as

Daniel Himmelstein, Lars Juhl Jensen (2015) One network to rule them all. Thinklab. doi:10.15363/thinklab.d102
