Rephetio: Repurposing drugs on a hetnet [rephetio]

STARGEO: expression signatures for disease using crowdsourced GEO annotation

A 2011 study [1] introduced the idea of large-scale drug repurposing based on disease expression profiles. However, the field has faced a major impediment: results from differential expression experiments are only available on a per-study basis. Our project requires a consensus signature (one that aggregates many experiments) for each of 137 diseases.

A forthcoming project called STARGEO aims to provide disease-specific expression signatures on a broad scale. The webapp crowdsources GEO [2, 3] annotation and performs case-control analyses based on user queries. The following video introduces the project:

We now join forces with STARGEO and welcome its creator @idrdex to the team! The first stage will be tagging all GEO datasets containing DO Slim diseases. STARGEO's current DO Slim coverage is available here.

Ontologies for disease-centric GEO annotation

STARGEO allows users to define arbitrary "tags" for sample annotation. We have been adding Disease Ontology (DO) IDs to our tag descriptions. This suffices for simple case-control comparisons, but is insufficient for more complex comparisons.

For example, many cancer studies compare tumors to healthy tissue, but all samples are from cases. The contrast is therefore not case versus control, but healthy versus diseased tissue. In general, we will want to incorporate these contrasts into our disease-specific expression signatures.

@fbastian and @chrismungall, do you know of any ontologies that could help with our annotation task? I know that Bgee focuses on healthy tissue, but I thought you might be able to point us in the right direction.

Summary: we are gathering disease-specific expression signatures. What terminologies should we use to create contrast between samples within a study (GEO Series)?

To annotate the condition of a sample, you can use the Experimental Factor Ontology (EFO). But I'm not sure what you mean by "contrast between samples" (do you want to annotate each sample, or have terms directly representing, e.g., a "healthy vs. diseased" contrast?).


For now, STARGEO annotations really funnel towards performing classical meta-analysis across studies given a standardized set of "cases" vs "controls". This kind of fits with microarray experimental design, which usually "contrasts" some type of case vs control. So there is a concept of a control for a disease, which may or may not be a "healthy" control.


We will soon have analyses working on the dev site. We will probably shut down the old site by the end of the month.

The Experimental Factor Ontology (EFO) [1] has a few terms related to what we need. For example,

Ultimately, for a given disease, we want to be able to differentiate the following:

  • a healthy sample from a healthy individual
  • a healthy sample from a diseased individual
  • a diseased sample from a diseased individual

Essentially, we want to support two types of case-control analyses based on a single set of annotations:

  1. samples from healthy individuals versus diseased individuals
  2. samples from healthy tissue versus diseased tissue, where all samples may come from diseased individuals
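As a rough sketch of how a single set of annotations could feed both analyses, each sample could carry two flags, one for the donor and one for the tissue. The `Sample` class, its field names, and the GSM accessions below are hypothetical and are not part of STARGEO. The first contrast follows one literal reading of the list above, in which healthy tissue from a diseased individual counts as a case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """Hypothetical per-sample annotation; not a STARGEO data structure."""
    gsm: str                   # GEO sample accession (made-up values below)
    individual_diseased: bool  # does the donor have the disease?
    tissue_diseased: bool      # is the sampled tissue itself diseased?


def individual_contrast(samples):
    """Contrast 1: samples from diseased individuals vs healthy individuals."""
    cases = [s.gsm for s in samples if s.individual_diseased]
    controls = [s.gsm for s in samples if not s.individual_diseased]
    return cases, controls


def tissue_contrast(samples):
    """Contrast 2: diseased tissue vs healthy tissue, whatever the donor status."""
    cases = [s.gsm for s in samples if s.tissue_diseased]
    controls = [s.gsm for s in samples if not s.tissue_diseased]
    return cases, controls


# One sample for each of the three categories listed above.
samples = [
    Sample("GSM_A", individual_diseased=False, tissue_diseased=False),
    Sample("GSM_B", individual_diseased=True, tissue_diseased=False),
    Sample("GSM_C", individual_diseased=True, tissue_diseased=True),
]

print(individual_contrast(samples))
print(tissue_contrast(samples))
```

The same three samples land in different groups depending on the contrast, which is the distinction a single "case"/"control" tag cannot express.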

It seems that most existing ontologies are good at describing the characteristics of a single sample — for example, its tissue of origin, the developmental stage of the donor, the phenotypes/diseases of the donor — but they are not good at allowing tagging for the sole purpose of contrast.

@idrdex suggested that we could use "qualifiers for the tags: like PC_individual_case vs PC_individual_control or PC_tissue_case vs PC_tissue_control". I like this idea and think it is a good immediate solution.

  • Dexter Hadley: I think whatever we do is ultimately arbitrary and will become our 'protocol'. We can assume that tags reference the sample itself which probably is some type of tissue and applies to the individual. Like diabetic pancreas tissue is from a diabetic individual. But cancer is a "mosaic" disease which puts tissue vs individual out of sync. I think eventually for every set of annotations made on a GSE, we can allow for qualifiers from EFO to be set.

@idrdex: We've been doing some extensive curation (i.e. back to the literature describing the experiments) on ~1000 samples that matter a lot to a project that we're working on. A couple questions about STARGEO:
1) Is it going to/does it also include samples from organisms other than human?
2) What's the best way to contribute these annotations? Anything programmatic and/or spreadsheet friendly?
3) What's the API like to extract annotations?


  • Dexter Hadley: 1) Yes multi species are on the books. But for now we are focusing on humans to get the site launched.
    2) Let's get you an account, which for now is through me. I'm idrdex at both Google Hangouts and Skype.

    3) Everything for now is accessed through the site. We are working on an API to serve the data for the computational folks. Basically, it will allow retrieval of annotations and serve matrices with matching gene IDs.

Analysis is working now here:

  • Casey Greene: I took a look at that but it doesn't look like it provides hooks to the underlying annotations. Are those going to be available? For the types of analyses that we work on, those are much more valuable than the profiles.

  • Dexter Hadley: The "annotations" interface is not ready yet. But imagine a searchable list where you could put GSEs in, for instance, and get all our annotations across GSMs. For now I could probably generate a .csv for you. We currently have 400K+ annotations over 100+ tags. Some GSEs have been annotated up to four times, and for the ones with multiple annotations we have kappa inter-rater stats calculated.

  • Casey Greene: The annotations interface sounds very helpful! I'm looking forward to it! The analysis that we're doing right now isn't for human datasets, so if your curations are generally there we can't check the overlap with our own curations. We do a lot of human work though, so it'll be very helpful to us to have these annotations available.

  • Dexter Hadley: I've got a postgres table that you can query if you like. Let's hangout and discuss :)
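For readers unfamiliar with the kappa statistics mentioned above for GSEs annotated more than once, here is a minimal sketch of Cohen's kappa for two annotators. The thread does not say which kappa variant STARGEO computes (with up to four annotators, a multi-rater variant such as Fleiss' kappa is also plausible), so treat Cohen's kappa as an assumption. The function name and labels are made up for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same samples."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of samples given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Two hypothetical annotators tagging the same four GSMs.
annotator_1 = ["case", "case", "control", "control"]
annotator_2 = ["case", "control", "control", "control"]
print(cohens_kappa(annotator_1, annotator_2))
```

Here the annotators agree on three of four samples (0.75) against a chance agreement of 0.5, giving a kappa of 0.5.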

Gene handling quality control

Currently, we have STARGEO case-control queries for 66 of our diseases. Of these 66 queries, 37 return differential expression results. The rest either have insufficient samples or fail due to errors.

In the past, I remember coming across a STARGEO output (gene rows × meta-analysis columns) where many rows contained duplicate gene symbols. @idrdex had also mentioned to me that mapping the probe/gene names deposited in GEO can get complicated. Therefore, I wanted to do a few quality controls before proceeding.

I checked into our current STARGEO analysis of differential expression for 37 diseases. I looked for three occurrences which could be due to problems with gene handling (notebook):

  • GeneID–Symbol mappings that don't exist in Entrez Gene (results). There were 2,274 ID-Symbol pairs that didn't exist in my parsing of Entrez Gene. However, most of these were not protein-coding and appeared to stem from updates to the database over time.
  • Rows with duplicate GeneIDs (results). Only two rows were affected by this issue.
  • Rows with duplicate symbols (results). 72 rows were affected by this issue. It did seem however that many gene symbols were not the approved symbols but rather synonyms. Since we use Entrez Gene IDs for mapping, synonyms are not a major concern for us.

So in conclusion, I didn't detect any major issues with the gene handling in STARGEO. These quality controls do not assess the probe–gene mapping, but instead whether the gene information reported for the meta-analyses makes sense.
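The three checks can be sketched in a few lines. This is not the code from the notebook: the function name, the input shapes, and the toy gene IDs and symbols are all invented for illustration, and the reference mapping stands in for a parsed copy of Entrez Gene.

```python
from collections import Counter


def gene_qc(rows, reference):
    """Run three sanity checks on (gene_id, symbol) rows.

    rows: list of (gene_id, symbol) tuples, one per output row.
    reference: dict mapping gene_id -> symbol, standing in for Entrez Gene.
    """
    # 1. ID-symbol pairs that are absent from the reference mapping.
    unknown_pairs = sorted({(g, s) for g, s in rows if reference.get(g) != s})
    # 2. Rows whose gene ID appears more than once.
    id_counts = Counter(g for g, _ in rows)
    dup_id_rows = [row for row in rows if id_counts[row[0]] > 1]
    # 3. Rows whose symbol appears more than once.
    symbol_counts = Counter(s for _, s in rows)
    dup_symbol_rows = [row for row in rows if symbol_counts[row[1]] > 1]
    return unknown_pairs, dup_id_rows, dup_symbol_rows


# Toy data: IDs and symbols are made up.
reference = {101: "GENEA", 102: "GENEB", 103: "GENEC"}
rows = [
    (101, "GENEA"),
    (102, "GENEB"),
    (102, "GENEB"),    # duplicate gene ID (and symbol)
    (103, "OLDNAME"),  # symbol differs from the reference (e.g. a synonym)
    (104, "OLDNAME"),  # ID missing from the reference; symbol duplicated
]

unknown_pairs, dup_id_rows, dup_symbol_rows = gene_qc(rows, reference)
print(unknown_pairs)
print(len(dup_id_rows), len(dup_symbol_rows))
```

A synonym in the symbol column shows up in both the first and third checks, which matches the observation that many duplicated symbols were synonyms rather than approved symbols.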

Initial release of STARGEO analyses

We've released the first complete version of our STARGEO analysis (repository [1]). Thanks @idrdex for helping us get all of the queries running smoothly.

In summary, we defined case-control queries for 66 diseases. Of these diseases, 49 contained sufficient data (multiple studies with at least 3 samples per class). We used STARGEO's random effects meta-analysis and applied an FDR p-value threshold of 0.05 to identify differentially expressed genes for each disease.
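To make the random effects step concrete, here is a minimal sketch of one common estimator, DerSimonian-Laird, which pools a gene's per-study effect sizes. The thread does not state which estimator STARGEO uses, so this choice is an assumption, and the function name and inputs are illustrative only.

```python
import math


def random_effects_pool(effects, variances):
    """DerSimonian-Laird random effects pooling of per-study effect sizes.

    effects: one effect size per study (e.g. a standardized mean difference).
    variances: the within-study variance of each effect size (all > 0).
    Returns (pooled_effect, standard_error, tau2).
    """
    k = len(effects)
    if k != len(variances) or k < 2:
        raise ValueError("need at least two studies with matching variances")
    # Fixed-effect (inverse-variance) estimate, used to measure heterogeneity.
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    # Cochran's Q and the method-of-moments between-study variance tau^2.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c)
    # Re-weight each study by its total (within + between) variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    standard_error = math.sqrt(1.0 / sum(w_star))
    return pooled, standard_error, tau2


# Two hypothetical studies that disagree strongly about one gene.
pooled, se, tau2 = random_effects_pool([0.0, 1.0], [0.01, 0.01])
print(pooled, se, tau2)
```

When studies agree, tau2 shrinks to zero and the result matches a fixed-effect analysis. When they disagree, the larger tau2 widens the standard error.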

48,688 Disease–downregulates–Gene and 50,287 Disease–upregulates–Gene relationships were identified (dataset). The number of dysregulated genes varied widely by disease. No differentially expressed genes were identified for endogenous depression, which had a combined sample size of 533 cases and controls.
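The step from per-gene meta-analysis results to up- and downregulation relationships can be sketched as follows. The thread only mentions an FDR threshold of 0.05, so the Benjamini-Hochberg adjustment used here is an assumption. The function names, the tuple layout, and the toy values are illustrative as well.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def call_relationships(results, alpha=0.05):
    """Split significant genes by direction of effect.

    results: list of (gene, effect, pvalue) tuples for one disease.
    Returns (upregulated, downregulated) gene lists.
    """
    adjusted = bh_adjust([p for _, _, p in results])
    up, down = [], []
    for (gene, effect, _), q in zip(results, adjusted):
        if q > alpha:  # assumption: genes with adjusted p <= alpha are kept
            continue
        if effect > 0:
            up.append(gene)
        elif effect < 0:
            down.append(gene)
    return up, down


# Toy results for one hypothetical disease.
results = [
    ("g1", 1.2, 0.001),
    ("g2", -0.8, 0.01),
    ("g3", 0.3, 0.04),
    ("g4", -0.1, 0.5),
]
print(call_relationships(results))
```

In the toy data, g3 has a raw p-value below 0.05 but an adjusted value of about 0.053, so it is not called.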

Now we will integrate these relationships into our hetnet. We may choose to limit each disease to the 500 most significantly up and 500 most significantly down-regulated genes.
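If we do apply that limit, the selection might look like the sketch below. It assumes each significant gene comes with an effect direction and a p-value; the function name and tuple layout are hypothetical.

```python
def top_dysregulated(results, n=500):
    """Keep the n most significant up- and n most significant downregulated genes.

    results: list of (gene, effect, pvalue) tuples, already filtered to
    significant genes for one disease.
    """
    by_pvalue = sorted(results, key=lambda r: r[2])  # most significant first
    up = [gene for gene, effect, _ in by_pvalue if effect > 0][:n]
    down = [gene for gene, effect, _ in by_pvalue if effect < 0][:n]
    return up, down


# Toy example with a limit of one gene per direction.
significant = [
    ("g1", 0.9, 0.002),
    ("g2", 1.5, 0.0001),
    ("g3", -0.7, 0.003),
    ("g4", -1.1, 0.02),
]
print(top_dysregulated(significant, n=1))
```

A cap like this would keep diseases with thousands of dysregulated genes from dominating the expression edges in the hetnet.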

Additional resources

Here are some additional resources that claim to do similar things:

  • CRowd Extracted Expression of Differential Signatures (CREEDS) [1]
  • OMics Compendia Commons (OMiCC) [2]

I haven't had time to look into them, but wanted to note them here in case I ever do!

Status: Completed
Cite this as
Daniel Himmelstein, Frederic Bastian, Dexter Hadley, Casey Greene (2015) STARGEO: expression signatures for disease using crowdsourced GEO annotation. Thinklab. doi:10.15363/thinklab.d96

Creative Commons License