Project:
Rephetio: Repurposing drugs on a hetnet [rephetio]

Publication:
Bgee: Integrating and Comparing Heterogeneous Transcriptome Data Among Species

Processing Bgee for tissue-specific gene presence and over/under-expression


Bgee is an integrative and comparative resource for gene expression [1]. We extract the following edges from Bgee for our network:

  • Gene–Anatomy expression — whether a gene is present in an anatomy
  • Gene–Anatomy upregulation — whether a gene is upregulated (overexpressed) in an anatomy
  • Gene–Anatomy downregulation — whether a gene is downregulated (underexpressed) in an anatomy

We have already substantially discussed Bgee on Thinklab in a more general gene expression thread [2]. The comments consisted of:

We are revisiting our Bgee analysis. Since the original discussion has become cluttered and covers a general topic, we are starting a designated Bgee discussion. Stay tuned.

Quality control filters

Initially, we chose permissive filters for including Bgee edges based largely on the recommendations of @fbastian. Several developments are making us reevaluate our permissiveness:

  1. Gene–expression–Anatomy edges are the most prevalent type in the network (see last figure in this notebook). Now that we're traversing the network to extract paths, we're noticing high-degree nodes are computationally problematic. Several anatomies have 20,000 expressed genes and 5000 differentially expressed genes (see page 2 here). Therefore downstream constraints favor rigorous thresholds for extremely high-degree edge types.
  2. We also extract Gene–expression–Anatomy edges from TISSUES [1]. Currently, TISSUES contributes 321,516 expression edges, while Bgee contributes 5,406,177. I suspect our TISSUES inclusion threshold (score ≥ 3) is much more stringent than our Bgee theshold.
  3. TISSUES provides a score for each tissue–gene relationship. These scores have been calibrated on a gold standard allowing evidence-based weighting of each study. In contrast, Bgee has different categories of evidence based on ambiguity and quality. These measures are codependent and have been difficult to understand. Therefore, it's difficult to conclude whether low quality or high ambiguity relationships have merit.

Ambiguity and call quality

We looked into the relationship between ambiguity and call quality.

Below, we show the contingency table for Expression (columns) and Call quality (rows) in the Homo_sapiens_expr-complete dataset (documentation). Each cell contains the percentage of observations in the corresponding category:

absenthigh ambiguitylow ambiguitypresent
NA00.0010.0010
poor quality0000.23
high quality0.11000.67

Below, we show the contingency table for Differential expression (columns) and Call quality (rows) in the Homo_sapiens_diffexpr-anatomy-simple dataset (documentation). Each cell contains the percentage of observations in the corresponding category:

high ambiguitylow ambiguityover-expressionunder-expression
NA0.0250.1600
low quality000.180.24
high quality000.210.20

Conclusions

Presence: Homo_sapiens_expr-complete uses the value poor quality while Homo_sapiens_diffexpr-anatomy-simple uses low quality. In our previous processing, we used low quality as the value for both datasets. Therefore, we accidentally omitted the 23% of observations that were present with poor quality. Our proposed solution is to take only high quality and present observations. The observations with ambiguous call qualities are rare, and thus I am not worried about excluding them for simplicity.

Differential expression: In Homo_sapiens_diffexpr-anatomy-simple differentially expressed observations are split between low and high quality. Low quality is explained as:

differential expression reported as low quality, or there exists a conflict for the same gene, anatomical entity and developmental stage, from different analyses of a same data type (conflicts between different data types are treated differently). For instance, an analysis showed a gene to be over-expressed in a condition, while another analysis showed the same gene to be under-expressed or not differentially expressed in the same condition. Such conflicts are resolved by a voting system based on the number of conditions compared, weighted by p-value. Note that in one case, this quality level is used to reconcile conflicting calls from different data types: when a data type produced an under-expression call, while a different data type has shown that the same gene was never seen as expressed in the same condition. In that case, the overall summary is under-expression low quality.

We are undecided whether to omit low quality differential expression edges.

For expression calls, besides being more stringent, it is also possible to discard anatomical entities close to the root of the ontology, that are less informative, and that benefits from the propagation from lots of substructures. See also https://github.com/owlcollab/owltools/issues/145 for a related discussion.

For differential expression, well, as said before, I'm not really surprised with this number of 5,000 differentially expressed genes in some structures. We are still willing to update our FDR computation to take into account all analyses, as mentionned here (we are currently updating our differential expression pipeline).

Methods for reducing the number of Bgee expression relationships

@fbastian, we're looking to reduce the number of expression relationships extracted from Bgee. Our motivation is twofold:

  1. Our hetnet currently contains over 1 million Anatomy–expresses–Gene relationships. This high number of relationships is causing computational bottlenecks.
  2. Our network permutations do little to randomize the expression relationships. In other words, too many expression relationships connect super anatomies — anatomies which express most genes — limiting the information content of the edge.

From our previous conversations, it appears that there are three ways to proceed:

  1. Exclude relationships for general anatomies, as suggested above.
  2. Use only RNA-Seq experiments. @fbastian mentions that using only RNA-Seq data avoids the ambiguity states. So what happens when a gene is present in one experiment but not the other for the same data type and conditions? With RNA-seq we should be able to adjust the RPKM inclusion threshold.
  3. Use unpropagated relationships, so for example genes expressed in for brain gray matter would not automatically be transmitted to brain.

These options are not mutually exclusive. We can choose any combination of the three.

I am leaning towards option 3, because I think studies will often be performed on clinically relevant anatomies. In other words, we may accomplish the goal of 1 (removing overly broad anatomies) by including only relationships to their directly annotated anatomies. Additionally, since our hetnet contains many nested levels Uberon terms, we would not be throwing out too many experiments entirely. In the future, we can even write queries that perform the propagation in realtime.

Data complications: In Homo_sapiens_expr-complete.tsv version 13.1, all the values for In situ call quality and In situ call quality equal no data. Additionally, all values for Including in situ observed data are no. @fbastian, I assume this is a bug? Is there a workaround? Advice on the specific filters to apply on which files to achieve options 2 or 3 would be appreciated.

For me, the two working solutions would either be:

  • "3. Use unpropagated relationships": actually, you would still use the propagated data, but you would consider only the anatomical structures described in the experiments. This is what our "simple" expression file contains.
  • or manually select a list of 50-100 anatomical structures you are interested in, and use the propagated data in them. Maybe you can also include all cell types. This is very similar to the previous solution, except that you'd have more freedom about the choice of the organs. Although you should already have lots of organs experimentally described once we include the GTEx data (see bottom of this message), so the previous solution might be good enough.

"2. Use only RNA-Seq experiments" is incorrect: you'll still get the ambiguity states if two libraries provide contradicting information. And, it would be sad not to use other data types – lots of good information come from Affymetrix data.
And adjusting the RPKM threshold will not solve your problem: your problem comes from the propagation of data to upper level terms.

To implement solution "3", you can either use our "simple" expression file, or use the "complete" file, but filter thanks to the column "Observed data": http://bgee.org/?page=doc&action=call_files#single_expr_complete_col9

Otherwise, we can think of alternative solutions:

  • Using only over-/under-expression data. We should have a lot of them thanks to the GTEx data in the near future
  • Using only the best-ranked anatomical structures for each gene, rather than all data, and do the propagation based on that (e.g., in TISSUES the anatomical structures are ranked – we are going to release a gene page next week also providing ranked anatomical structures).
  • Using a completely different approach, based on gene lists rather than individual genes: see our new GO-like expression enrichment test: http://bgee.org/?page=top_anat. I don't know if it can be applied to your network, though, but maybe you can link groups of genes to anatomical structures, rather than individual genes.

About in situ data: we don't have in situ data for human. I don't know of any resource providing systematic gene expression analyses in human using in situ hybridization. This data type is not well suited for human (e.g., I think you're not supposed to use frozen tissues).

Additional information: we have started the pipeline for Bgee 14, that will include the GTEx data. The full pipeline run should take a few months, but you'll get the data ultimately. Also, I'll let you know about the status of our work of re-annotation in the other thread, as we have completed it.

Clarification

Let's see if I understand: It is not possible to tell whether gene presence for a specific condition (anatomy–developmental stage) was from an experiment on that exact condition or was propagated. However, the simple file (or complete file filtered for Observed data) contains only conditions with directly annotated experiments, but any given gene presence call may be from propagation.

Proposed gene presence method

I think the ideal setup for our network would be propagation by developmental stage but not by anatomical structure. Using just the simple or complete datasets, this doesn't currently seem to be possible. However, I've what about the following workaround: using all adult stages on the simple dataset.

Using the simple dataset, I found all gene–anatomy pairs where Expression is present and Call quality is high quality for any adult developmental stage. To identify adult developmental stages, I filtered for HsapDv:0000087 and its descendants (notebook, dataset).

The resulting presence/absence matrix of gene expression is 16,257 genes × 188 anatomies (notebook, dataset) compared to the previous dimensions of 16,278 genes × 666 anatomies. We hope that this pruning of anatomies will help address the network problems we were facing.

Miscellaneous

Thanks @fbastian for suggesting other possible approaches. For future networks, we will revisit these options. For now we're looking for the most straightforward and immediately actionable option.

We don't have in situ data for human. I don't know of any resource providing systematic gene expression analyses in human using in situ hybridization.

I had misinterpreted Including in situ observed data to mean Observed data

the simple file (or complete file filtered for Observed data) contains only conditions with directly annotated experiments, but any given gene presence call may be from propagation.

No, you have the guarantee that there exists an unpropagated call for that very gene. What you can't know for sure from these files, is where the quality level comes from. For instance:

gene A expressed in brain with low quality from experiment A
gene A expressed in brain substructure with high quality from experiment B
=> gene A expressed in brain with high quality in Bgee files

Or

gene A NOT expressed in brain from RNA-Seq experiment A
gene A expressed in brain substructure from Affymetrix experiment B
=> gene A with lowly conflicting status in brain substructure, in Bgee files 

Data in adult for human must represent 90% of the data, so you shouldn't add much by considering all developmental stages.

If you want, I can provide you with a completely unpropagated dataset. But I think it's good to benefit from propagation (e.g., to determine that 2 genes are both expressed in brain, which you might miss if they were expressed in different brain substructures).

I had misinterpreted Including in situ observed data to mean Observed data

I see, we should rename these fields with in situ hybridization then.

Releasing dhimmel/bgee v1.0

We've released version 1.0 of our Bgee processing [1].

Datasets

Genes are identified with Entrez GeneIDs and anatomical structures (anatomies) are identified by Cell Ontology or Uberon terms.

  • present-in-adult.tsv.gz indicates whether a gene is present (1) or absent (0) for a given anatomy in human adults. The dataset is a matrix of 16,257 genes × 188 anatomies.
  • diffex.tsv.gz indicates whether a gene is over-expressed (1), under-expressed (-1), or non-deferentially expressed (0) for a given anatomy. The dataset is a matrix of 18,620 genes × 98 anatomies.

Changelog

Compared to version 0 — the Bgee data in the initial version of our hetnet — the following change was made:

  • We adopted the modification proposed above: gene presence was extracted from the simple Bgee dataset, which guarantees that each Anatomy–expresses–Gene relationship contains direct (unpropagated) experimental data.
 
Status: Completed
Views
140
Topics
Referenced by
Cite this as
Daniel Himmelstein, Frederic Bastian (2015) Processing Bgee for tissue-specific gene presence and over/under-expression. Thinklab. doi:10.15363/thinklab.d124
License

Creative Commons License

Share