Rephetio: Repurposing drugs on a hetnet [rephetio]

Adding pathway resources to your network

For Table 1, you might consider adding another Gene Set resource based on curated pathways. These are analogous to GO-Biological Process terms, but are much more focused and constrained. In fact, they also include small molecules and drugs, so they can serve as more than just gene sets.

Likewise, for Table 2, you could add a number of interaction resources with pathway data. As co-founder of WikiPathways, I have to recommend that one in particular! :) It's 100% free, open source and open access. I can also recommend Reactome and Pathway Commons, the latter of which compiles pathway data from multiple sources into BioPAX data format.

Again, these would not only provide high-quality gene-gene interactions for your network, but also direct drug-gene interactions. And in relation to your Figure 1, they would also provide Disease and Tissue associations.

You can download all human pathways from WikiPathways in multiple formats, or parse just the Entrez Genes in human pathways from this single dump file. The advantage of the first option is that you get the original data, as curated by contributors; the disadvantage is that you have to perform the ID mapping yourself to unify to Entrez and your preferred small molecule system. The advantage of the second option (the dump file) is that the Entrez ID unification has been done for you; the disadvantage is that anything that didn't map to Entrez is simply discarded (including drugs and small molecules!).
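
To make the trade-off concrete, here is a minimal sketch of the ID unification step in the first option. The function name and the mapping table are hypothetical; a real mapping would come from an identifier-mapping resource of your choice. The sketch keeps the identifiers that fail to map instead of silently dropping them, so drugs and small molecules can still be routed to a separate vocabulary.

```python
def unify_to_entrez(datanode_ids, id_to_entrez):
    """Split identifiers into Entrez-mapped and unmapped lists.

    `id_to_entrez` is a hypothetical lookup table (source ID -> Entrez Gene ID).
    Unmapped identifiers are returned rather than discarded.
    """
    mapped, unmapped = [], []
    for identifier in datanode_ids:
        entrez = id_to_entrez.get(identifier)
        if entrez is None:
            unmapped.append(identifier)
        else:
            mapped.append(entrez)
    return mapped, unmapped


# Illustrative input only: two gene identifiers and one ChEBI-style
# small-molecule identifier that has no Entrez Gene equivalent.
id_to_entrez = {"ENSG00000141510": "7157", "P38398": "672"}
datanodes = ["ENSG00000141510", "P38398", "CHEBI:15365"]

mapped, unmapped = unify_to_entrez(datanodes, id_to_entrez)
print(mapped)
print(unmapped)
```

Inspecting `unmapped` is a quick way to see how much non-gene content a gene-only dump would leave out.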

@alexanderpico, thanks for letting us know about the best current pathway resources.

MSigDB Canonical Pathways

In the past, we used the versions of Reactome, KEGG, and BioCarta provided by MSigDB [1, 2]. MSigDB version 5.0 was released in April, but it's unclear whether the pathway resources were updated. However, the "C2: Canonical Pathways" (CP) collection integrates 9 pathway resources, so I think we should create a C2: CP metanode with a node for each MSigDB CP gene set.


WikiPathways

We can have a separate metanode for WikiPathways [3, 4]. The open and crowdsourced nature of WikiPathways is ideal. The inclusion of compounds, tissues, and diseases in addition to genes in these pathways could provide a major performance boost for our method. The benefit will depend on how frequently non-gene entities are included in these pathways. What percent of pathways include diseases, tissues, or drugs? Additionally, are non-gene entities identified as free text, or are they structured by a standardized vocabulary?

Pathway Commons

I like how Pathway Commons [5] brings a common format to many resources. One worry is that Pathway Commons contains edges, such as those from DrugBank, which will already be included elsewhere in the network. One solution would be to pick and choose which source databases to integrate from Pathway Commons. After including MSigDB C2: CP and WikiPathways, will Pathway Commons contain much information not already captured? If not, we may just stick with the above resources.

General Questions

  1. How much do these pathway resources overlap? Does WikiPathways include pathways directly taken from other databases?
  2. Do databases differ greatly in quality or type of pathways encoded? If the databases do differ, it may make sense to give each a separate metanode. Otherwise, we will organize all pathways under 1 or 2 metanodes.

Figures 1 and 2 in the Pathways4Life proposal will answer questions about overlap and frequency of updates. Pathway Commons is not a primary source; their focus is on compiling from as many sources as possible. So, given their restriction to BioPAX, they definitely include more than any single resource.

@alexanderpico, thanks. Figures 1 and 2 do help; however, I am more interested in edge-based measures of overlap. Do you have a general sense of whether the same pathways are represented in multiple databases?

My interpretation of Figure 1 is that it provides a lower bound of uniqueness. The fact that there are many genes unique in KEGG, Reactome, and WikiPathways warrants the inclusion of all three resources. However, it doesn't answer whether the common genes are from duplicated pathways or not.

That measure of overlap is fraught with caveats relating to exactly how edges are modeled. When each of the three resources mentioned here converts to a single exchange format, such as BioPAX, we each make a unique set of mapping decisions and compromises. So you're absolutely right that node overlap is a lower bound, but I don't have a good estimate for edge overlap. Just browsing the pathway titles is the most convincing way to see that we cover much of the same ground: metabolism, signaling, and gene regulation.
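
As a rough, node-based way to probe whether shared genes come from duplicated pathways, one could compare gene sets pairwise across two resources with the Jaccard index. This is only a sketch under stated assumptions: the function names, the 0.9 threshold, and the toy pathways are all hypothetical, and the comparison sidesteps the edge-modeling caveats above rather than resolving them.

```python
from itertools import product


def jaccard(a, b):
    """Jaccard index of two collections of gene identifiers."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def likely_duplicates(resource_a, resource_b, threshold=0.9):
    """Return (name_a, name_b, score) for pathway pairs at or above threshold.

    Each resource is a dict mapping pathway name -> set of gene IDs.
    The threshold is an arbitrary illustrative cutoff.
    """
    hits = []
    for (name_a, genes_a), (name_b, genes_b) in product(
        resource_a.items(), resource_b.items()
    ):
        score = jaccard(genes_a, genes_b)
        if score >= threshold:
            hits.append((name_a, name_b, score))
    # Highest-scoring pairs first.
    return sorted(hits, key=lambda hit: -hit[2])


# Toy gene sets, not real pathway content.
wikipathways = {"Toy cell cycle": {1, 2, 3, 4}, "Toy apoptosis": {5, 6, 7}}
reactome = {"Toy cell cycle (alt)": {1, 2, 3, 4}, "Toy metabolism": {8, 9}}

duplicates = likely_duplicates(wikipathways, reactome)
print(duplicates)
```

A high score only suggests that two pathways cover the same genes; it says nothing about whether the interactions between those genes are modeled the same way.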

Initial human pathway collection

We have downloaded, parsed, and combined MSigDB and WikiPathways (notebook, tsv results). In total, we identified 1,516 human pathways after removing a single duplicated pathway. Most pathways have fewer than 100 genes, but some have up to 1,000.
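
For readers who want to reproduce this kind of merge, the following is a minimal sketch of one way to combine two collections and drop duplicates. It assumes that "duplicated" means two pathways with identical gene sets; the post does not say how the duplicate was detected, and the linked notebook is the authority on that. The function name and toy data are hypothetical.

```python
def drop_duplicate_gene_sets(pathways):
    """Keep the first pathway seen for each distinct gene set.

    `pathways` is a list of (identifier, genes) pairs.
    Assumption: duplicates are defined as identical gene sets.
    """
    seen = set()
    kept = []
    for identifier, genes in pathways:
        key = frozenset(genes)
        if key in seen:
            continue  # same genes as an earlier pathway
        seen.add(key)
        kept.append((identifier, key))
    return kept


# Toy collections, not real pathway content.
msigdb = [("A", {1, 2, 3}), ("B", {4, 5})]
wikipathways = [("C", {3, 2, 1})]

combined = drop_duplicate_gene_sets(msigdb + wikipathways)
sizes = [len(genes) for _, genes in combined]
small = sum(size < 100 for size in sizes)
print(len(combined), "pathways;", small, "with fewer than 100 genes")
```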

WikiPathways Method

We extracted pathways from the previously-suggested dump file. We removed pathways without any human genes. From a total of 669 wikipathways, 187 were human. @alexanderpico, can you confirm that these are the expected numbers?

Above I asked:

Additionally, are non-gene entities identified as free text, or are they structured by a standardized vocabulary?

It appears that entities other than genes are not standardized in WikiPathways models. Therefore, we chose not to connect pathways to diseases, drugs, and tissues.

MSigDB Method

We used the C2: CP collection from MSigDB 5.0 which yielded 1,330 pathways. The sources and counts of these pathways are below:

MSigDB ID   Name                           Pathways
PID         Pathway Interaction Database   196
ST          Signaling Transduction KE      28
SIG         Signaling Gateway              8
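
Counts like these can be tallied directly from a gene set file. The sketch below assumes the collection is in GMT format (one gene set per line: name, description, then genes, all tab-separated) and that each gene set name begins with its source ID followed by an underscore, such as `PID_`. The function names and the toy lines are hypothetical.

```python
import io
from collections import Counter


def read_gmt(handle):
    """Parse GMT lines into a dict of gene set name -> set of genes."""
    gene_sets = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = fields
        gene_sets[name] = set(genes)
    return gene_sets


def count_by_prefix(gene_sets):
    """Count gene sets by the text before the first underscore in the name."""
    return Counter(name.split("_", 1)[0] for name in gene_sets)


# Toy GMT content; names and genes are illustrative, not real MSigDB entries.
toy_gmt = io.StringIO(
    "PID_TOY_PATHWAY\tna\t7157\t672\n"
    "ST_TOY_PATHWAY\tna\t5594\t5595\n"
    "PID_OTHER_TOY\tna\t1956\n"
)

gene_sets = read_gmt(toy_gmt)
counts = count_by_prefix(gene_sets)
print(counts.most_common())
```

With a real file, `open(path)` would replace the `io.StringIO` object.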

Hmm... You are correct that the dump file contains 187 human pathways (just did a browser FIND on the page for 'homo sapiens'), but there are ~293 human pathways in the standard collection. You can access these on the bulk download page in multiple formats, including plain text lists of (non-unified) datanode identifiers. This number is climbing as folks continue to add new content. For example, we have over 300 additional human pathways that are in the works at various stages of completion (or disrepair) that are not included in these bulk downloads.

Sorry for suggesting the dump file. I thought that would make it easier since the identifiers are unified to Entrez, but it's apparently incomplete. I'm not sure why...

The WikiPathways team found the error, corrected it and updated the dump file, which now contains 290 human pathways with gene identifiers unified to Entrez. Same location:

Human pathway collection revision

We updated our pathway resource with the updated WikiPathways data (notebook, tsv results). The new total for human pathways is 1,619 with 289 of those from WikiPathways.

Version 1.0

Our compilation of pathway gene sets is now released (version 1.0) [1]. Gene sets (download) are compiled from WikiPathways and MSigDB. This updated version contains 1,617 pathways.

Replacing MSigDB with Pathway Commons

Due to licensing issues with MSigDB, we've removed MSigDB pathways and switched to Pathway Commons as @alexanderpico initially suggested. Pathway Commons aggregates pathway and binary interaction data from many providers [1].

Pathway Commons data is freely available, but the data is licensed under the terms of each contributing database. For example, Pathway Commons includes KEGG pathways, which have a problematic license. Accordingly, we only include pathways from Pathway Commons resources that are openly licensed.

Specifically, I identified only two appropriate resources from the 8 Pathway Commons resources that contribute pathways (see notebook cell 7). These resources were Reactome [2] and the Pathway Interaction Database (PID) [3]. Reactome is licensed as CC BY, while I believe PID data is in the public domain since it was created by US Government employees. At least, the PID publication states, "All data in PID is freely available, without restriction on use" [3]. Since Reactome and PID contributed the majority of MSigDB pathways, I suspect that we didn't lose much information by abandoning MSigDB.
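
The license-based selection amounts to an allowlist filter on the source of each pathway. Here is a minimal sketch, assuming the Pathway Commons records have already been parsed into dictionaries with a `source` key. That record layout, the function name, and the source labels are hypothetical; the actual logic lives in the notebook.

```python
# Sources judged to be openly licensed in the discussion above.
OPEN_SOURCES = {"reactome", "pid"}


def keep_openly_licensed(pathways, allowed=OPEN_SOURCES):
    """Keep only pathways whose source is on the allowlist (case-insensitive)."""
    return [p for p in pathways if p["source"].lower() in allowed]


# Toy records, not real Pathway Commons content.
toy_pathways = [
    {"name": "Toy pathway 1", "source": "Reactome"},
    {"name": "Toy pathway 2", "source": "KEGG"},
    {"name": "Toy pathway 3", "source": "PID"},
]

kept = keep_openly_licensed(toy_pathways)
print([p["name"] for p in kept])
```

An allowlist is the safer default here: a newly added source with an unknown license stays excluded until someone reviews it.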

Ultimately, our updated compilation of human gene sets (dhimmel/pathways v2.0 [4]) contains 1,862 human pathways of which 1,341 are from Reactome, 298 are from WikiPathways, and 223 are from the PID.

Status: Completed
  resource   data integration
Cite this as
Alexander Pico, Daniel Himmelstein (2015) Adding pathway resources to your network. Thinklab. doi:10.15363/thinklab.d72

Creative Commons License