Rephetio: Repurposing drugs on a hetnet [rephetio]

Tracking project reuse, citation, and publicity

Today is October 3, 2015. 345 days ago, I first met @jspauld and learned of Thinklab. 330 days ago, @sergiobaranzini and I agreed to try out the platform, and 264 days ago we posted an initial draft of our proposal.

Since then our project has recruited 22 reviewers and started 47 discussions containing 266 comments. Currently, our proposal has 641 views and our most highly viewed discussions have 175 and 173 views. These view counts rely on Google Analytics and are therefore just estimates.

We are pleased with the current progress on Thinklab and expect continued growth as the platform matures. In this thread, we will also post instances of reuse, citation, and publicity received outside of Thinklab.
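As a quick check on the timeline in the opening paragraph, the day counts can be converted to calendar dates. This is a minimal sketch using Python's standard `datetime` module; the dictionary labels are shorthand introduced here for illustration and do not come from Thinklab.

```python
from datetime import date, timedelta

# "Today" as stated at the top of this post.
today = date(2015, 10, 3)

# Day counts quoted in the opening paragraph (labels are illustrative shorthand).
days_ago = {
    "first met @jspauld": 345,
    "agreed to try Thinklab": 330,
    "posted initial proposal draft": 264,
}

# Subtract each offset from today's date to recover the calendar date.
calendar_dates = {
    label: today - timedelta(days=offset) for label, offset in days_ago.items()
}

for label, when in calendar_dates.items():
    print(f"{label}: {when.isoformat()}")
```

Under these inputs, the proposal draft falls on 2015-01-12.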

  • Jesse Spaulding: Just a small correction — Thinklab is tracking views independently of Google Analytics. Our figure attempts to track the total number of humans that have viewed the page. Our figure will be less than what Google Analytics has because for them a unique page view is really the number of sessions with at least one page view.

Initial network release covered by the Drug Repurposing Portal

On August 6th 2015, the initial release [1] of our network was covered by the Drug Repurposing Portal. This site describes itself as a

first of its kind one-stop-shop platform for intelligent information on Drug Repurposing

The site also includes a database of over 300 instances of repurposing. While the database is unstructured text (so currently unsuitable for computational analyses), it provides a nice human-readable reference.

Project Altmetrics

Our project has an Altmetric page, which tracks online attention. Currently, some of the project metadata is wrong, and most mentioning content is missing. @jspauld, perhaps you could investigate improving Altmetric integration?

Nature News mentions our use of Thinklab to avoid publishing delays

A Nature News feature published today [1] mentions our Thinklab project:

Some scientists are going a step further, and using platforms such as GitHub, Zenodo and figshare to publish each hypothesis, data collection or figure as they go along. Each file can be given a DOI, so that it is citable and trackable. Himmelstein, who already publishes his papers as preprints, has been using the Thinklab platform to progressively write up and publish the results of a new project since January 2015. “I push 'publish' and it gets a DOI with no delay,” he says. “Am I really gaining that much by publishing [in a conventional journal]? Or is it better to do what is fastest and most efficient to get your research out there?”

The feature also covers my blog post on the history of publishing delays. Using data from PubMed, I found a median time from submission to acceptance of ~100 days and a median time from acceptance to online publication of ~25 days. Since we post most content on Thinklab several months before it will ever be submitted, we're getting our work out 200+ days sooner by using realtime open notebook publishing.

Mention in Storybench piece on BLAZE

Margaux Phares wrote an article published today on Storybench exploring BLAZE. BLAZE is a new type of scholarly search engine that returns search results as bubble maps rather than lists. The article quotes me:

“I’ll often work on problems for years before encountering seminal works that would have been helpful from day one,” Himmelstein said. “The [scientific literature search problem] is worst at the cross-section of fields as each field has its own specialized terminology.” He thinks Blaze may help solve this problem.

This quote was motivated by this project: specifically, the difficulty we faced when finding literature on hetnets. As I wrote to Margaux:

I work on networks with multiple types of relationships. I call these networks "hetnets". It wasn't until the fifth year of my PhD, that I learned of a plethora of other terms referring to the same concept. And had I not been highly proactive at reaching out to diverse players, my ignorance would persist to this day.

2016 Neo4j GraphGist Challenge

We created a Star Wars-themed entry to the 2016 Neo4j GraphGist challenge. The competition aimed to showcase exciting uses of Neo4j, a graph database designed for hetnets. The winners were announced today and our GraphGist won the "Open/Government Data and Politics" category.

Genes with more SNPs tend to have higher hetnet degrees

A review I worked on with @caseygreene was published today [1]:

Greene CS, Himmelstein DS (2016) Genetic Association–Guided Analysis of Gene Networks for the Study of Complex Traits. Circulation: Cardiovascular Genetics

The review explains the NetWAS method for prioritizing disease-associated genes [2]. Additionally, it investigates the relationship between SNP abundance and network degree for human genes [3]. I based this analysis on the prerelease of Hetionet — the hetnet constructed for this study. We observed that more connected genes tended to contain more SNPs across several genotyping platforms:

[Figure: Hetnet degree versus SNPs in gene]

This finding is important because it shows why network methods that convert from SNP to gene space are prone to bias. Without the integration provided by our hetnet, this analysis would have been time consuming. However, once the different network edges were integrated, the analysis became trivial.

Greene & Voight Review

@caseygreene has preprinted his new review titled "Pathway and network-based strategies to translate genetic discoveries into effective therapies" [1]. The review mentions our project:

They use heterogeneous networks that connect genes, diseases, tissues, and other factors [2]. They then apply machine learning methods to predict effective drug-disease pairs. The entire process has been visible on an open science platform called Thinklab. The open platform reveals the extensive sets of analyses that lay the groundwork for machine learning applications. Though work is ongoing the initial results are highly promising.

as well as the legal barriers to data reuse that we've encountered:

An often-overlooked aspect of studies that incorporate multiple data types and sources for drug repurposing is licensing and resource accessibility. The manner in which restrictive database licenses have hampered progress by the Himmelstein et al. team [3] has been instructive [4].

The review mentions the difficulties faced by genetic-association-driven drug development:

There have been a limited number of cases where each required elements falls cleanly into place.

I'm excited to see if our metapath-based approach can help decode the contexts in which genetic association data is most informative for drug efficacy.

Data copyright article in Nature News

Simon Oxenham wrote a story titled "Legal confusion threatens to slow data science", which was published in Nature News today [1]. The story covers our experience with data copyright and seeking permission to reuse publicly-funded research.

The article begins with:

Knowledge from millions of biological studies encoded into one network — that is Daniel Himmelstein’s alluring description of Hetionet, a free online resource that melds data from 28 public sources on links between drugs, genes and diseases. But for a product built on public information, obtaining legal permissions has been surprisingly tough.

Thanks to everyone who contributed to our data licensing discussions. This post wouldn't be here if it weren't for you!

Graphistania Podcast on Neo4j & Project Rephetio

Rik Van Bruggen, Regional Vice President at Neo4j, interviewed me for the Graphistania podcast. We discussed hetnets, Neo4j, Cypher, Hetionet, and Project Rephetio. The podcast is available on the Bruggen Blog, SoundCloud, YouTube, iTunes, or below.

Project PrePrint / Report

We've posted the manuscript describing Project Rephetio as a report on Thinklab [1] and as a preprint on bioRxiv [2]. We're submitting the study to the journal eLife. Thanks again to everyone who helped us reach this point for your tremendous efforts!

I posted the preprint while at OpenCon 2016, which turned out to be a bit of a saga that unfolded through the following series of tweets: 1, 2, 3, 4, 5.

GraphConnect Presentation

I gave a lightning talk at GraphConnect 2016, a conference devoted to graph databases that is hosted by Neo4j. The recording is below:

Hetionet poster

I created a poster, available on figshare [1], to raise awareness of hetnets, especially Hetionet v1.0. I have presented the poster at the Moore Data-Driven Discovery Investigator Symposium and the Penn Biomedical Postdoctoral Research Symposium, and will present it at the upcoming Rocky Bioinformatics Conference.

CC BY and data: Not always a good fit

In a recent blog post for the University of California's Office of Scholarly Communication Blog, titled "CC BY and data: Not always a good fit", @katiefortney writes:

And licenses that dictate exactly how later projects reuse data and provide attribution can hobble those projects. For just one example, see this discussion of licensing issues in Rephetio, a project that’s done a particularly good job documenting the challenges faced in attempting to reuse drug repurposing data. Data reusers should follow best practices for data citation in their field given their particular project. If data sharers can trust them to do just that, rather than trying to shoehorn attribution practices into one-size-fits-all copyright licenses, they can make it much easier for others to reuse their data.

Regarding explanatory notes alongside licenses, she writes:

Explanatory notes about your expectations alongside a CC license may help by showing that you’re aware that there’s complexity, or by inviting data users to initiate a conversation about desired uses. But avoid adding language to your notes that contradicts your chosen license; see another discussion from the Rephetio project mentioned above for an example of custom terms of use that conflicted with a CC license, leading to confusion.

Great to see that our discussions on the legal barriers to data use were helpful examples! Awesome post @katiefortney!

MIA Talk at the Broad Institute

@caseygreene and I took a trip to Cambridge and presented at the Broad Institute's Models, Inference & Algorithms (MIA) series.

I gave a primer on Project Rephetio and Hetionet. The recording is on YouTube (slides):

Thanks Broad for making the YouTube video available under a CC BY license!

Thinkable Bioinformatics Peer Prize

We've entered the Bioinformatics Peer Prize by Thinkable. You can see our entry here, which is based on version 2 of our bioRxiv preprint [1].

To vote, you'll need to sign up for an account and submit verification of your researcher status in the form of:

  1. a University email address
  2. a link to a peer-reviewed publication from the last five years

Thinkable can take up to two days to verify your status. Voting closes on May 11, 2017, so don't delay. You can also vote for multiple projects, so once verified you may also find other projects you'd like to support!

Just to clarify: there is no relationship between Thinklab and Thinkable — they just happen to have a Levenshtein distance of only 3!
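For readers unfamiliar with the metric, Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The sketch below is a standard dynamic-programming implementation written for illustration; the function name `levenshtein` is my own and is not taken from any particular library.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn `a` into `b`."""
    # previous[j] is the distance between the prefix of `a` processed so far
    # (initially empty) and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        # Turning the first i characters of `a` into "" takes i deletions.
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


distance = levenshtein("Thinklab", "Thinkable")
print(distance)
```

One shortest edit path is to delete the "l" from "Thinklab" (giving "Thinkab") and then append "l" and "e", which accounts for the three edits.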

Cite this as
Daniel Himmelstein (2015) Tracking project reuse, citation, and publicity. Thinklab. doi:10.15363/thinklab.d113

Creative Commons License