The Genotype-Tissue Expression project (GTEx) RNA-sequenced 1641 samples from 175 individuals representing 43 sites: 29 solid organ tissues, 11 brain subregions, whole blood, and two cell lines: Epstein-Barr virus–transformed lymphocytes (LCL) and cultured fibroblasts from skin.
The data is available online. Specifically, we are interested in the GTEx_Analysis_V4_RNA-seq_RNA-SeQCv1.1.8_gene_rpkm.gct.gz file that contains RPKM expression values for each sample. We would like to calculate a single expression value for each gene-tissue pair. Expression values should be comparable across tissues, not just within tissues.
We will post our questions here. Advice appreciated.
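A minimal Python sketch of one way to collapse the RPKM matrix to a single value per gene-tissue pair. The per-site median and the log2(x + 1) transform are assumptions chosen for illustration, not a method settled on in this discussion, and the function names are hypothetical. The sketch assumes the GCT layout of a version line, a dimensions line, and then a tab-separated table whose first two columns are Name and Description.

```python
import csv
import math
from collections import defaultdict
from statistics import median


def read_gct(lines):
    """Parse GCT text into {gene: {sample: rpkm}}.

    `lines` is any iterable of text lines, e.g. an open file handle.
    """
    rows = csv.reader(lines, delimiter="\t")
    next(rows)  # version line, e.g. "#1.2"
    next(rows)  # dimensions line (row and column counts)
    header = next(rows)
    samples = header[2:]  # columns after Name and Description
    expr = {}
    for row in rows:
        if row:
            expr[row[0]] = dict(zip(samples, map(float, row[2:])))
    return expr


def gene_by_site(expr, sample_to_site, pseudocount=1.0):
    """Collapse samples to sites: log2(median RPKM + pseudocount) per gene and site.

    Samples without a site in `sample_to_site` are skipped.
    """
    result = {}
    for gene, values in expr.items():
        by_site = defaultdict(list)
        for sample, rpkm in values.items():
            site = sample_to_site.get(sample)
            if site is not None:
                by_site[site].append(rpkm)
        result[gene] = {
            site: math.log2(median(vals) + pseudocount)
            for site, vals in by_site.items()
        }
    return result
```

For the real file, the lines could come from `gzip.open(path, "rt")`. A per-site median summarizes each tissue but does not by itself make values comparable across tissues; that would need a separate normalization choice.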
Mapping GTEx sites to Uberon and CL
We are using Uberon terms to identify anatomical structures and Cell Ontology (CL) terms to identify cell types. Thus, we need to map GTEx sites to their corresponding ontology terms.
From the sample attribute documentation (GTEx_Data_V4_Annotations_SampleAttributesDS.txt), we identified 54 sites using the SMTSD attribute. I have mapped about half of the sites to Uberon. The remainder would benefit from a skilled anatomist or GTEx consortium member.
Bounty: Add or correct our mappings using this spreadsheet and enter your Thinklab username. Then leave a comment in this discussion, and we will rate its value at ≥ $4 × n, where n is the number of mappings provided.
Some additional sample site information is available in Table S1 (p. 58) of the supplement.
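The site list described above can be derived from the sample attributes file with a few lines of Python. This sketch assumes the file is tab-separated with a header row and that sample identifiers live in a column named SAMPID; only SMTSD is named in this post, so the identifier column name is an assumption to check against the file. The function names are hypothetical.

```python
import csv
from collections import Counter


def read_sample_sites(lines, id_col="SAMPID", site_col="SMTSD"):
    """Return {sample id: site} from tab-separated sample attribute lines.

    `id_col` is an assumed column name; rows with an empty site are skipped.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    return {row[id_col]: row[site_col] for row in reader if row.get(site_col)}


def site_counts(sample_to_site):
    """Count how many samples were annotated to each distinct site."""
    return Counter(sample_to_site.values())
```

Calling `len(site_counts(...))` on the parsed file gives the number of distinct sites, and the counts show which sites have too few samples to summarize reliably.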
Handing over GTEx processing responsibility to Bgee
We have decided to use Bgee for tissue-specific transcript presence, over-, and under-expression. Bgee doesn't currently include GTEx data but will soon.
Therefore, we are not going to proceed with GTEx data directly for this project. However, we did already process the data into a usable gene × site format (notebook, download). We converted genes to Entrez GeneIDs. The sites are still in GTEx strings rather than Uberon terms. Expression values are log-transformed. Check out the notebook for a visualization of tissue-specific transcript abundance distributions.
Bounty: we will keep the GTEx–Uberon mapping bounty going until June 25, 2015 because these mappings will help the Bgee team and eventually us. @chrismungall, do you want to add a comment here, so you can get rewarded for your 3 mappings?
I think the Bgee team will do a great job. Just a few general comments:
Most of the GTEx terms correspond to 'wild-type' structures as can be found in Uberon/CL. There are, however, two subclasses of skin: exposed and unexposed. We could add these as subclasses in Uberon, but this would be unusual. It would be better to either post-compose these, or to have some kind of ancillary 'sample' ontology where this is composed.
For 'hippocampus', the safest option is to map to the broadest term, 'hippocampal formation', but if it can be shown that the GTEx sample excludes bits of the dentate gyrus then the more specific 'Ammon's horn' can be used.
Finally, it's always best when ontologies are used prospectively rather than retrospectively. Maybe future rounds of GTEx will follow the lead of FANTOM5 and ENCODE in doing this.
IMO, the "exposed/unexposed" state is an experimental factor and should not be annotated using a new anatomical term; it should be an additional "column" in an annotation (using, e.g., EFO).
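As a rough illustration of the "additional column" idea, the Python sketch below splits a GTEx skin site string into an anatomical part and a separate sun-exposure factor. The regular expression is written against the two skin site strings quoted in this thread, and the field names are hypothetical; linking the factor values to EFO terms is left out because no EFO identifiers are given here.

```python
import re

# Pattern based on the two GTEx skin site strings quoted in this thread.
SKIN_SITE = re.compile(
    r"^(?P<anatomy>Skin) - (?P<exposure>(?:Not )?Sun Exposed) \((?P<location>[^)]+)\)$"
)


def split_skin_site(site):
    """Split a GTEx skin site into anatomy, location, and exposure columns.

    Returns None for any site string that is not a skin site.
    """
    match = SKIN_SITE.match(site)
    if match is None:
        return None
    return {
        "anatomy": match["anatomy"],
        "location": match["location"],
        "sun_exposure": match["exposure"],
    }
```

With this layout the anatomical column can be mapped to an ordinary Uberon term, while the exposure column stays available as an experimental factor.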
Daniel, we will discuss the timescale for annotating GTEx data during our lab meeting next week, but as you said, it should be fast. Our problem is that we requested access to the data, and are waiting for an answer.
We could start working on the mapping you have, but we'd rather go through the information for each sample, to check for normality, etc. This is how we usually work. (do they provide GEO or SRA identifiers for the samples BTW?)
Frederic Bastian: Sequence Read Archive (http://www.ncbi.nlm.nih.gov/sra). This is where we usually download the raw data from. But I think GTEx keeps them private because of their "on demand" data sharing policy...
do they provide GEO or SRA identifiers for the samples BTW?
We could start working on the mapping you have, but we'd rather go through the information for each sample, to check for normality, etc.
@fbastian, do whatever is best for you! And let it be known that your painstaking and thorough integration efforts are appreciated.
There are, however, two subclasses of skin: exposed and unexposed.
@chrismungall, by post-compose do you mean contacting GTEx and asking for more details on the skin sample sites? That seems the best to me as I assume the sample collectors had specific instructions. The skin sites are specified as suprapubic for sun unexposed and lower leg for sun exposed.
@dhimmel, thanks for the information. Did your lab formally request access to the data to get the actual annotations? (GTEx_Data_V4_Annotations_SampleAttributesDS.txt in your notebook)
Otherwise, Chris is speaking about ontology term post-composition, a way of creating a new ontology concept on-the-fly, that doesn't have any identifier or IRI ("anonymous class expression"), and that is made of the "composition" of several other terms. That would allow you to create on the fly a new concept for "exposed skin".
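To make post-composition more concrete, here is a small Python sketch that models an anonymous class expression as a base term plus one or more (relation, filler) restrictions. The class name, the relation name `has_exposure`, and the filler label are hypothetical placeholders, and the rendered string only loosely imitates OWL Manchester syntax; it is not output from any ontology tool.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PostComposed:
    """An unnamed class built from an existing term plus restrictions."""

    genus: str  # label of an existing ontology term
    differentia: Tuple[Tuple[str, str], ...] = ()  # (relation, filler label) pairs

    def render(self):
        """Render the expression as a readable, Manchester-like string."""
        parts = [f"'{self.genus}'"]
        parts += [f"({rel} some '{filler}')" for rel, filler in self.differentia]
        return " and ".join(parts)


# Hypothetical relation and filler, for illustration only.
exposed_leg_skin = PostComposed("skin of leg", (("has_exposure", "sun exposed"),))
```

The expression has no identifier of its own; it exists only as the combination of the terms it is built from, which is the point of composing it on the fly.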
@fbastian, GTEx_Data_V4_Annotations_SampleAttributesDS.txt is available from the GTEx download page which requires an account. However, this direct link circumvents the login page. How long has it been since you submitted the data access request?
On June 17th I emailed firstname.lastname@example.org asking for a GTEx–Uberon mapping. Today, Tim Sullivan responded and attached this mapping file.
He didn't mention the methodology used, but you may want to crosscheck your work.
I've reproduced Tim's mapping below for quick reference:
Tissue Site Detail | Uberon term
Adipose - Subcutaneous | subcutaneous adipose tissue
Adipose - Visceral (Omentum) | omental fat pad
Artery - Aorta |
Artery - Coronary |
Artery - Tibial |
Artery - Tibial |
Brain - Amygdala |
Brain - Anterior cingulate cortex (BA24) | anterior cingulate cortex
Brain - Caudate (basal ganglia) |
Brain - Cerebellar Hemisphere |
Brain - Cerebellum |
Brain - Cortex |
Brain - Frontal Cortex (BA9) | dorsolateral prefrontal cortex
Brain - Hippocampus |
Brain - Hypothalamus |
Brain - Nucleus accumbens (basal ganglia) |
Brain - Putamen (basal ganglia) |
Brain - Spinal cord (cervical c-1) | first cervical spinal cord segment
Brain - Substantia nigra |
Breast - Mammary Tissue |
Cells - EBV-transformed lymphocytes |
Cells - Leukemia cell line (CML) |
Cells - Transformed fibroblasts |
Cervix - Ectocervix |
Cervix - Endocervix |
Colon - Sigmoid |
Colon - Transverse |
Esophagus - Gastroesophageal Junction |
Esophagus - Mucosa | esophagus squamous epithelium
Esophagus - Muscularis | esophagus muscularis mucosa
Heart - Atrial Appendage | right atrium auricular region
Heart - Left Ventricle | left ventricle myocardium
Kidney - Cortex | cortex of kidney
 | right lobe of liver
 | upper lobe of left lung
Minor Salivary Gland | anterior lingual gland
Muscle - Skeletal |
Nerve - Tibial |
Nerve - Tibial |
 | body of pancreas
Skin - Not Sun Exposed (Suprapubic) | skin of abdomen
Skin - Sun Exposed (Lower leg) | skin of leg
Small Intestine - Terminal Ileum |
Some seem slightly more specific than the label suggests. Sometimes the increased specificity is trivial (i.e., their ovary sample was from a left ovary), sometimes relevant (their representative skeletal muscle sample was from gastrocnemius medialis, the esophagus mucosa sample was taken from the epithelium rather than lamina propria).
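For the suggested crosscheck, a short Python sketch that compares two site-to-label mappings and reports disagreements. The dictionary below contains only a handful of site and label pairs copied from the mapping reproduced in this thread; the function name and the case-insensitive comparison are assumptions, and any second mapping passed in would come from your own curation.

```python
# A few site -> Uberon label pairs copied from the mapping reproduced above.
GTEX_PROVIDED = {
    "Adipose - Subcutaneous": "subcutaneous adipose tissue",
    "Adipose - Visceral (Omentum)": "omental fat pad",
    "Brain - Frontal Cortex (BA9)": "dorsolateral prefrontal cortex",
    "Esophagus - Mucosa": "esophagus squamous epithelium",
    "Heart - Left Ventricle": "left ventricle myocardium",
    "Kidney - Cortex": "cortex of kidney",
    "Skin - Not Sun Exposed (Suprapubic)": "skin of abdomen",
    "Skin - Sun Exposed (Lower leg)": "skin of leg",
}


def crosscheck(ours, theirs):
    """Compare two {site: label} mappings, ignoring letter case in labels."""
    shared = ours.keys() & theirs.keys()
    conflicts = {
        site: (ours[site], theirs[site])
        for site in shared
        if ours[site].lower() != theirs[site].lower()
    }
    return {
        "conflicts": conflicts,
        "only_ours": sorted(ours.keys() - theirs.keys()),
        "only_theirs": sorted(theirs.keys() - ours.keys()),
    }
```

Conflicts where one label is simply more specific than the other, as noted above, still need a human decision; the report only points to where to look.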
To keep you posted: we were recently given access to the GTEx data, so we have started annotating/analyzing the data. We hope to have a new release of Bgee including these data in about 2 months.
Daniel Himmelstein: Exciting, thanks for the update. On another note, I wasn't able to find any licensing information on the Bgee website, which technically means all rights reserved. We're trying to compile licenses for each resource we use. It would be great if you could add a license.
M. Mele, P. G. Ferreira, F. Reverter, D. S. DeLuca, J. Monlong, M. Sammeth, T. R. Young, J. M. Goldmann, D. D. Pervouchine, T. J. Sullivan, R. Johnson, A. V. Segre, S. Djebali, A. Niarchou, The GTEx Consortium, F. A. Wright, T. Lappalainen, M. Calvo, G. Getz, E. T. Dermitzakis, K. G. Ardlie, R. Guigo (2015) The human transcriptome across tissues and individuals. Science. doi:10.1126/science.aaa0355