Gene–expression–Anatomy edges are the most prevalent type in the network (see last figure in this notebook). Now that we're traversing the network to extract paths, we're noticing high-degree nodes are computationally problematic. Several anatomies have 20,000 expressed genes and 5000 differentially expressed genes (see page 2 here). Therefore downstream constraints favor rigorous thresholds for extremely high-degree edge types.
We also extractGene–expression–Anatomy edges from TISSUES. Currently, TISSUES contributes 321,516 expression edges, while Bgee contributes 5,406,177. I suspect our TISSUES inclusion threshold (score ≥ 3) is much more stringent than our Bgee theshold.
TISSUES provides a score for each tissue–gene relationship. These scores have been calibrated on a gold standard allowing evidence-based weighting of each study. In contrast, Bgee has different categories of evidence based on ambiguity and quality. These measures are codependent and have been difficult to understand. Therefore, it's difficult to conclude whether low quality or high ambiguity relationships have merit.
Ambiguity and call quality
We looked into the relationship between ambiguity and call quality.
Below, we show the contingency table for Expression (columns) and Call quality (rows) in the Homo_sapiens_expr-complete dataset (documentation). Each cell contains the percentage of observations in the corresponding category:
Below, we show the contingency table for Differential expression (columns) and Call quality (rows) in the Homo_sapiens_diffexpr-anatomy-simple dataset (documentation). Each cell contains the percentage of observations in the corresponding category:
Presence:Homo_sapiens_expr-complete uses the value poor quality while Homo_sapiens_diffexpr-anatomy-simple uses low quality. In our previous processing, we used low quality as the value for both datasets. Therefore, we accidentally omitted the 23% of observations that were present with poor quality. Our proposed solution is to take only high quality and present observations. The observations with ambiguous call qualities are rare, and thus I am not worried about excluding them for simplicity.
Differential expression: In Homo_sapiens_diffexpr-anatomy-simple differentially expressed observations are split between low and high quality. Low quality is explained as:
differential expression reported as low quality, or there exists a conflict for the same gene, anatomical entity and developmental stage, from different analyses of a same data type (conflicts between different data types are treated differently). For instance, an analysis showed a gene to be over-expressed in a condition, while another analysis showed the same gene to be under-expressed or not differentially expressed in the same condition. Such conflicts are resolved by a voting system based on the number of conditions compared, weighted by p-value. Note that in one case, this quality level is used to reconcile conflicting calls from different data types: when a data type produced an under-expression call, while a different data type has shown that the same gene was never seen as expressed in the same condition. In that case, the overall summary is under-expression low quality.
We are undecided whether to omit low quality differential expression edges.
For expression calls, besides being more stringent, it is also possible to discard anatomical entities close to the root of the ontology, that are less informative, and that benefits from the propagation from lots of substructures. See also https://github.com/owlcollab/owltools/issues/145 for a related discussion.
For differential expression, well, as said before, I'm not really surprised with this number of 5,000 differentially expressed genes in some structures. We are still willing to update our FDR computation to take into account all analyses, as mentionned here (we are currently updating our differential expression pipeline).
Methods for reducing the number of Bgee expression relationships
@fbastian, we're looking to reduce the number of expression relationships extracted from Bgee. Our motivation is twofold:
Our hetnet currently contains over 1 million Anatomy–expresses–Gene relationships. This high number of relationships is causing computational bottlenecks.
Use only RNA-Seq experiments. @fbastianmentions that using only RNA-Seq data avoids the ambiguity states. So what happens when a gene is present in one experiment but not the other for the same data type and conditions? With RNA-seq we should be able to adjust the RPKM inclusion threshold.
Use unpropagated relationships, so for example genes expressed in for brain gray matter would not automatically be transmitted to brain.
These options are not mutually exclusive. We can choose any combination of the three.
I am leaning towards option 3, because I think studies will often be performed on clinically relevant anatomies. In other words, we may accomplish the goal of 1 (removing overly broad anatomies) by including only relationships to their directly annotated anatomies. Additionally, since our hetnet contains many nested levels Uberon terms, we would not be throwing out too many experiments entirely. In the future, we can even write queries that perform the propagation in realtime.
Data complications: In Homo_sapiens_expr-complete.tsv version 13.1, all the values for In situ call quality and In situ call quality equal no data. Additionally, all values for Including in situ observed data are no. @fbastian, I assume this is a bug? Is there a workaround? Advice on the specific filters to apply on which files to achieve options 2 or 3 would be appreciated.
For me, the two working solutions would either be:
"3. Use unpropagated relationships": actually, you would still use the propagated data, but you would consider only the anatomical structures described in the experiments. This is what our "simple" expression file contains.
or manually select a list of 50-100 anatomical structures you are interested in, and use the propagated data in them. Maybe you can also include all cell types. This is very similar to the previous solution, except that you'd have more freedom about the choice of the organs. Although you should already have lots of organs experimentally described once we include the GTEx data (see bottom of this message), so the previous solution might be good enough.
"2. Use only RNA-Seq experiments" is incorrect: you'll still get the ambiguity states if two libraries provide contradicting information. And, it would be sad not to use other data types – lots of good information come from Affymetrix data. And adjusting the RPKM threshold will not solve your problem: your problem comes from the propagation of data to upper level terms.
Using only over-/under-expression data. We should have a lot of them thanks to the GTEx data in the near future
Using only the best-ranked anatomical structures for each gene, rather than all data, and do the propagation based on that (e.g., in TISSUES the anatomical structures are ranked – we are going to release a gene page next week also providing ranked anatomical structures).
Using a completely different approach, based on gene lists rather than individual genes: see our new GO-like expression enrichment test: http://bgee.org/?page=top_anat. I don't know if it can be applied to your network, though, but maybe you can link groups of genes to anatomical structures, rather than individual genes.
About in situ data: we don't have in situ data for human. I don't know of any resource providing systematic gene expression analyses in human using in situ hybridization. This data type is not well suited for human (e.g., I think you're not supposed to use frozen tissues).
Additional information: we have started the pipeline for Bgee 14, that will include the GTEx data. The full pipeline run should take a few months, but you'll get the data ultimately. Also, I'll let you know about the status of our work of re-annotation in the other thread, as we have completed it.
Let's see if I understand: It is not possible to tell whether gene presence for a specific condition (anatomy–developmental stage) was from an experiment on that exact condition or was propagated. However, the simple file (or complete file filtered for Observed data) contains only conditions with directly annotated experiments, but any given gene presence call may be from propagation.
Proposed gene presence method
I think the ideal setup for our network would be propagation by developmental stage but not by anatomical structure. Using just the simple or complete datasets, this doesn't currently seem to be possible. However, I've what about the following workaround: using all adult stages on the simple dataset.
Using the simple dataset, I found all gene–anatomy pairs where Expression is present and Call quality is high quality for any adult developmental stage. To identify adult developmental stages, I filtered for HsapDv:0000087 and its descendants (notebook, dataset).
The resulting presence/absence matrix of gene expression is 16,257 genes × 188 anatomies (notebook, dataset) compared to the previous dimensions of 16,278 genes × 666 anatomies. We hope that this pruning of anatomies will help address the network problems we were facing.
Thanks @fbastian for suggesting other possible approaches. For future networks, we will revisit these options. For now we're looking for the most straightforward and immediately actionable option.
We don't have in situ data for human. I don't know of any resource providing systematic gene expression analyses in human using in situ hybridization.
I had misinterpreted Including in situ observed data to mean Observed data
the simple file (or complete file filtered for Observed data) contains only conditions with directly annotated experiments, but any given gene presence call may be from propagation.
No, you have the guarantee that there exists an unpropagated call for that very gene. What you can't know for sure from these files, is where the quality level comes from. For instance:
gene A expressed in brain with low quality from experiment A
gene A expressed in brain substructure with high quality from experiment B
=> gene A expressed in brain with high quality in Bgee files
gene A NOT expressed in brain from RNA-Seq experiment A
gene A expressed in brain substructure from Affymetrix experiment B
=> gene A with lowly conflicting status in brain substructure, in Bgee files
Data in adult for human must represent 90% of the data, so you shouldn't add much by considering all developmental stages.
If you want, I can provide you with a completely unpropagated dataset. But I think it's good to benefit from propagation (e.g., to determine that 2 genes are both expressed in brain, which you might miss if they were expressed in different brain substructures).
I had misinterpreted Including in situ observed data to mean Observed data
I see, we should rename these fields with in situ hybridization then.
We adopted the modification proposed above: gene presence was extracted from the simple Bgee dataset, which guarantees that each Anatomy–expresses–Gene relationship contains direct (unpropagated) experimental data.