Enabling reproducibility and reuse

Jesse Spaulding, Daniel Himmelstein, Casey Greene, Benjamin Good

doi:10.15363/thinklab.d23

Enabling reproducibility and reuse

Jesse Spaulding Jan. 16, 2015

Daniel — we talked a bit about how you'll be constructing this resource in a manner that is automated and reproducible. I think it would be good to put these details in the research plan — it will give people more opportunities to make suggestions. And if you'll be using techniques that you've used before it's probably a good idea to link people to that work.

I want people to be able to share their opinion on how you can create the most value from this project. I'm sure you'll agree that enabling reproducibility and reuse is a big part of that.

What do you think?

Daniel Himmelstein Researcher Jan. 16, 2015

I think it would be good to put these details in the research plan — it will give people more opportunities to make suggestions.

Perhaps, we should discuss here what formats of data and types of services would be most valuable to the community. Then once we reach some consensus, I will update the proposal.

Currently, I was planning on releasing the network in a few file formats:

as a single JSON text file based on a specification we've developed.
as a SIF file which is a text file of all edges. We will also provide an accompanying node table, with node attributes. This format is ideal for Cytoscape users.
as a GML file — a poorly documented and varyingly implemented format for storing graphs. Despite it's problems, this format is widely supported.
as separate files for each metanode and metaedge. This will help users who are only interested in a single part of the network, bypass having to process the rest of the network.
as matrices for edge types that were constructed from continuous pairwise scores.

An online tool for browsing the network would also be nice. I will look into options here.

Casey Greene Jan. 22, 2015

As a follow-up to this, are you planning to make only the network available, or are you planning to release all of the code necessary to construct the network? If you release that code, will you use some testing and continuous integration process to evaluate regressions? Releasing the code would allow others in the future to assess the role that new experiments or new sources of information play in prediction quality. I agree with Jesse that more details on the overall development process would be great.

Daniel Himmelstein Researcher Jan. 26, 2015

I plan to follow the 10 rules of reproducible computational research [1].

@caseygreene, I would like to release all of the code. I plan on doing a lot of the work in notebooks (of the R and python varieties). However, my current plan lacks the level of automation that you are suggesting.

Testing whether a new source of information has improved prediction should be straightforward. I am not sure exactly what you mean by "continuous integration process".

I would like to provide a single script that performs the entire analysis, but this may be difficult because the computation will be performed in different venues and locations.

Jesse Spaulding: It seems appropriate that some of these things be put in the research plan. For example, that you plan to follow the 10 rules of reproducible computational research.

Casey Greene Jan. 26, 2015

If you get the entire analysis boiled down a one-step build process, there are services that can monitor your github repository, look for commits, and then kick off a build. You'd need to define test cases that determine that something looks different than you expect. Maybe you have a set of positive controls (known multi-use drugs?) and you evaluate the extent to which they are accurately predicted. Depending on how long the entire process takes, these services might be enough to monitor and identify any commits that produce large changes in your overall results. You could also use some sort of correlation measure to previous runs to look for any commits that produce abnormally large changes in the output.

Daniel Himmelstein Researcher Feb. 16, 2015

@caseygreene and @jspauld, thanks for the suggestions. I've added an open science section to the proposal where I make commitments to:

the 10 simple rules for reproducible research in computational biology [1]
releasing all code on GitHub
releasing all data
CC-BY or CC-0 licensing

I also mention R Markdown and IPython notebooks, which I've been using extensively thus far — for example, when analyzing SIDER 2 data.

@caseygreene, I think your suggestion of implementing a continuous integration process is fantastic. I am hesitant to commit on this issue for a few reasons:

In the past, the computations have been extreme (requiring cluster usage) and this may be unwieldy to execute after every commit. I hope to optimize the algorithm, which may help in this regard.
I need to investigate these services more.
Our heterogeneous network framework and analyses are still rapidly evolving, so I don't want to invest lot's of time in code or methods that will soon be replaced.
I need to start writing unit tests for my programs. This seems like a more immediate issue that I will prioritize. Once I create proper unit tests, I will enable them as pre-commit hooks.

That being said, your suggestion is in line with the philosophy of ThinkLab and I will keep it in mind.

Casey Greene: Thanks! For things that require significant cluster usage I haven't yet found a cost effective way to do this either outside of a grant from one of the cloud providers. I wonder if you could apply to amazon or google for access to compute instances for this purpose.

Benjamin Good March 18, 2015

Would you consider hosting your knowledge network on WikiData ? https://www.wikidata.org/. WikiData is a new freebase-like open knowledge base being constructed by the Wikimedia Foundation.

A couple reasons to think about this:
1) Our group has NIH funding to build many of the nodes and edges you need there and we have already started.
2) By working in WikiData, your project can benefit from an existing, large user/contributor community and from the WMF computational resources.
3) WikiData is fundamentally about open knowledge exchange. Working in its context will ensure the greatest visibility and re-use for your network.

You probably wouldn't want to store computed probabilities there, but qualitative relationships would work well I think. This would be a great use case for our own efforts..

Daniel Himmelstein: @b_good, thanks for the suggestion. I am excited about any venues where we can publish the data to increase its reuse. Once we complete the data integration stage, I will touch base with you — it's still not entirely clear to me what exactly we would upload.

Benjamin Good March 19, 2015

Daniel,
Regarding what to upload. The figure you presented in your research plan sums this up nicely https://dl.dropboxusercontent.com/s/yp0gjh1v3329xji/metagraph.pngAside from perhaps the MSigDB collection (which could be added), those are exactly the nodes and edges that we hope to see in WikiData.

To use WikiData to its full power, you would actually use it within your own application in the same way that you would use your own internal relational database. Rather than thinking of it only as a repository to export to, you could think of it as the central staging area for the data that you (and the rest of the community) want to compute with.

For more about how we are working with wikidata, you can check out the series of blog posts that describes the funded grant here starting here: http://sulab.org/2013/07/the-future-of-the-gene-wiki/

Daniel Himmelstein: In support of the semantic web, you should check out ThinkLab's awesome markdown syntax, which includes easy citation and hyperlink capabilities.

Daniel Himmelstein Researcher March 20, 2015

Hi @b_good, fascinating work. I was amazed that the Gene Wiki [1] averages "two page views every single second". This amazing traffic, and return on investement from the perspective of funders, illustrates the potential of open, crowdsourced, and collaborative undertakings.

I think your wiki work capitalizes on a larger trend of computing moving to the web and browser. I recently made my first html presentation; my coding is now largely browser based thanks to Jupyter and RStudio and hosted online; API web-queries are now fundamental for information retrieval. It only makes sense that we adopt a information commons like Wikidata to integrate knowledge. The video below really sold me on the concept:

Since the initial stage of this project is highly prototypical, I am hesitant to immediately switch to Wikidata as our backend. However, I would love to work with you and your team to upload as much content as possible. Then for subsequent analyses, we could potentially pull from Wikidata.