Rephetio: Repurposing drugs on a hetnet [rephetio]

MSigDB licensing

We currently rely on MSigDB [1, 2], the Molecular Signatures Database, for perturbation gene sets and pathways. Since the license is highly restrictive, we have emailed the creators with the message below. We will post any updates regarding MSigDB licensing or permissions in this discussion.

Greetings MSigDB Team,

I am a graduate student at UCSF, and I have been using MSigDB for my research. My project aims to predict new uses for existing drugs by integrating many different types of biomedical information. Recently, the issue of database copyright and licensing came up, and we are now trying to ensure that we have sufficient permissions for each of the 28 databases we're integrating.

I was surprised to learn of MSigDB's restrictive license that forbids redistribution, especially given the project's public funding. Currently, several resources I have created may be non-compliant with the license:

My GitHub repository (dhimmel/msigdb) for parsing the MSigDB database contains:

  • unmodified MSigDB downloads
  • a reformatted version of the underlying data

My GitHub repository (dhimmel/pathways) for combining pathway databases contains:

  • the reformatted version of C2:CP from dhimmel/msigdb and a derived dataset containing pathways from other resources

My GitHub repository (dhimmel/integrate) for integrating many resources into a single network contains:

  • the majority of MSigDB 5.0 C2:CP and C2:CGP data stored as network nodes and edges.

My website for a previous project [3] contains:

  • a network download formatted as in dhimmel/integrate, but containing data from most collections in MSigDB version 3.0.

The public availability of the aforementioned resources is important so others can reproduce and build off of our work. Thus, we request permission for our current redistribution and derivative works of MSigDB. Ideally, we could be granted permission to release MSigDB data under a Creative Commons license without a No Derivatives restriction. Applying a CC license would lessen the burden on downstream users.

Thanks for your consideration. Our research is academic in nature, and we suspect it falls under the intended use of MSigDB but perhaps not the license.

Finally, we're performing our project using an open science platform called Thinklab. I've posted a copy of this email on Thinklab and will update the discussion with our progress. Alternatively, feel free to respond via Thinklab rather than email. By detailing each step of our research process publicly, we're hoping to create a valuable resource and explore a more holistic and collaborative medium of publication.


After an additional round of emails on February 4, 2016, Jill Mesirov got back to me. Dr. Mesirov was the principal investigator for the MSigDB project while at the Broad but has since moved to UCSD. She mentioned that they received permission to distribute certain parts of the database but that they did not receive permission to pass on those rights. She also added Helga Thorvaldsdottir — the MSigDB project manager at the Broad — to the conversation.

I responded with the following message:

Dear Dr. Mesirov et al,

Thanks for the reply and involving Helga. Hopefully, we can now locate the appropriate parties to handle our request.

I had guessed that non-transferable distribution rights were part of the issue. I appreciate wanting to build the most comprehensive resource, even when that necessitates stricter licensing.

Do you know which resources forbid downstream distribution? Perhaps we could be given permission to redistribute the unencumbered portions of the database? And for encumbered portions, we could seek the needed additional permissions from the content owners.

We feel that distribution fulfills an important scientific need. We're integrating over 30 resources into a single network that we envision becoming a widely used community dataset. Much like MSigDB did with gene sets, our network will enable novel analyses that are only possible once the data has been unified into a single resource. Additionally, forbidding distribution has troubling consequences for reproducibility. See for example this instance where data copyright interfered with replication.

Given these considerations, we would appreciate help in finding a solution that allows us to distribute MSigDB data, even if only a subset of the database.


In short, I asked if they could look into granting us permission to distribute the unencumbered portions of the database. Ms. Thorvaldsdottir responded that they will be meeting with the IP/Licensing team to discuss my request.

Removing MSigDB from the Rephetio project

I removed MSigDB from our project, since we haven't been able to resolve the licensing issues. It's been 186 days since we initially contacted the MSigDB team and 53 days since we were told that the IP/Licensing team had been notified and a meeting scheduled. Unfortunately, we can't wait any longer.

Two things set MSigDB apart from other resources where permission is pending but which we continue to include. First, MSigDB is the only resource with a license that explicitly forbids distribution. Second, the MSigDB website requires registration, although accessing the database by URL bypasses registration.

Registration makes the license into a legally binding agreement. Essentially, the registration acts as a contract, which can place additional restrictions beyond copyright. As an aside, I therefore find it misleading that the website states:

Registration is free. Its only purpose is to help us track usage for reports to our funding agencies.

Specifically, the license contains several troubling components. First is a reporting requirement for modifications and bug fixes:

modifications and BUG FIXES shall be provided to MIT promptly upon their creation.

While this reporting requirement only applies to the program, which we don't use, a different reporting requirement applies to the database:

As consideration for the licenses granted in this Agreement, LICENSEE agrees to provide … a written evaluation of the PROGRAM and the DATABASE, including a description of its functionality or problems and areas for further improvement in the PROGRAM or the DATABASE.

The license is very clear that distribution is prohibited. In fact, uploading the database to a private cloud service appears to violate the license:

2.2 No Sublicensing or Additional Rights. In no event shall LICENSEE sublicense or distribute, in whole or in part, the PROGRAM, modifications, BUG FIXES, or the DATABASE, without prior permission from MIT. LICENSEE agrees not to allow any non-employee of LICENSEE to access, view, or use the PROGRAM or the DATABASE, unless such person is an independent contractor performing services on behalf of LICENSEE. LICENSEE agrees not to put the PROGRAM or the DATABASE on a network, server, or other similar technology that may be accessed by any individual other than the LICENSEE.

As a result, using MSigDB as part of an extensible open science project is not possible.

Remedial action

I removed MSigDB from our hetnet (commit). Our hetnet no longer contains perturbation gene sets, which were from the C2:CGP collection. I replaced the MSigDB pathways (C2:CP collection) using Pathway Commons. Both Pathway Commons and MSigDB include data from Reactome and the PID, but by going through Pathway Commons we were able to release those resources as CC BY and CC0.

I deleted my GitHub repository, formerly dhimmel/msigdb, for converting the database into a single user-friendly TSV. Our website for a previous project contains a hetnet including MSigDB 3.0, which I posted before being aware of the licensing issue. Since this hetnet is the foundation of our prior study [1], taking it offline would be problematic for reproducibility and destructive to science. Therefore, I am not taking down this dataset unless specifically requested.

Finally, I removed the following two paragraphs from our project report draft. I'll let them serve as a eulogy:

Pathways were extracted by combining human pathways from WikiPathways [2, 3] and MSigDB version 5.1 [4, 5]. Perturbations were extracted from MSigDB. Each perturbation corresponds to a differential expression experiment of a chemical or genetic perturbation.

Perturbation–regulates–Gene edges are from MSigDB [4, 5]. These edges represent groups of genes that responded in the same direction to a chemical or genetic perturbation. Our previous project found this indiscriminate, automated, and high-throughput method produced gene sets that together were highly informative for predicting disease–gene associations [6].


One issue at play is the restrictive licensing of resources that MSigDB integrates. For example, KEGG requires academic services to obtain a subscription, which stipulates that:

Your Product or Service must not allow Your users to obtain KEGG FTP Data, except in small quantities.

Hence, portions of the MSigDB dataset do have to be licensed to forbid redistribution. Nevertheless, MIT went beyond these upstream restrictions when writing the MSigDB license. First, much of the database could likely be released openly. Second, I find the reporting requirements extreme. It's possible that MIT had a financial motivation when writing the license, as it states:

LICENSEE agrees that neither the PROGRAM nor the DATABASE shall be used as the basis of a commercial software or hardware product

I think the MSigDB team deserves credit for making their comprehensive compilation of gene sets public. However, given the extent of public funding this project has received, I question whether it's ethical for MIT to apply such a problematic license.

Finally, I'm not deflecting responsibility: I'm the one who included a resource whose license forbids redistribution. While scientists are often poorly informed on the legality of data reuse, I think it's important to take responsibility for educating yourself. Going forward, I will address licensing issues before using a new dataset to avoid similar problems.

Status: Completed
Cite this as
Daniel Himmelstein (2015) MSigDB licensing. Thinklab. doi:10.15363/thinklab.d108

Creative Commons License