|
Genes dysregulated by knockdowns
|
perturbagen | entrez_gene_id | z_score | symbol | status | direction | nlog10_bonferroni_pval |
---|---|---|---|---|---|---|
2 | 133 | -5.495 | ADM | imputed | down | 3.596 |
2 | 501 | -4.317 | ALDH7A1 | measured | down | 1.811 |
2 | 9915 | -5.579 | ARNT2 | measured | down | 4.626 |
dysreg-overexpression.tsv
This dataset contains significantly dysregulated genes due to overexpression perturbations. The raw dataset is available at:
https://github.com/dhimmel/lincs/raw/abcb12f942f93e3ee839e5e3593f930df2c56845/data/consensi/signif/dysreg-overexpression.tsv
Below is a preview:
perturbagen | entrez_gene_id | z_score | symbol | status | direction | nlog10_bonferroni_pval |
---|---|---|---|---|---|---|
2 | 991 | -4.687 | CDC20 | measured | down | 2.567 |
2 | 54438 | 4.551 | GFOD1 | measured | up | 2.282 |
2 | 5950 | 4.590 | RBP4 | imputed | up | 1.541 |
I answered:
How many overexpressions significantly upregulated their target gene (expected)? How many overexpressions significantly downregulated their target gene (unexpected)?
R code:
# workshop with dan himmelstein
# which genes have regulation scores. It also contains whether the gene's expression was directly measured or imputed
path <- 'https://github.com/dhimmel/lincs/raw/abcb12f942f93e3ee839e5e3593f930df2c56845/data/consensi/genes.tsv'
gene_df <- readr::read_tsv(path)
# significantly dysregulated genes due to knockdown perturbations
path2 <- 'https://github.com/dhimmel/lincs/raw/abcb12f942f93e3ee839e5e3593f930df2c56845/data/consensi/signif/dysreg-knockdown.tsv'
kd_df <- readr::read_tsv(path2)
# significantly dysregulated genes due to overexpression perturbations
path3 <- 'https://github.com/dhimmel/lincs/raw/abcb12f942f93e3ee839e5e3593f930df2c56845/data/consensi/signif/dysreg-overexpression.tsv'
oe_df <- readr::read_tsv(path3)
dat <- filter(oe_df, perturbagen == entrez_gene_id)
dat %>%
dplyr::group_by(direction) %>%
dplyr::summarize(
count = n()
)
R output:
Source: local data frame [2 x 2]
direction count
(chr) (int)
1 down 4
2 up 124
/
So, 4 genes were significantly downregulated after being overexpressed (unexpected) and 124 genes were significantly upregulated after being overexpressed (expected).
Thanks Dan!
For question 3,
How many genes were never significantly dysregulated by any knockdown perturbation? Report this number for both the measured and inferred gene sets.
The elegant dplyr solution (thanks Daniel) looks like:
#how many times are genes disregulated in all?
count_df = knockdown_df %>%
dplyr::group_by(entrez_gene_id) %>%
dplyr::summarise(count=n())
#join the table of counts with the full table of genes. the genes that were not present
#are automatically converted to missing data
full=gene_df %>%
dplyr::left_join(count_df)
#divide the missing data by imputed vs. measured
result = full %>% dplyr::filter(is.na(count)) %>%
dplyr::group_by(status) %>%
dplyr::summarise(count=n())
The solution: of all the genes, very few avoid disregulation!
status count
(chr) (int)
1 imputed 55
2 measured 1
How many knockdowns significantly downregulated their target gene (expected)? How many knockdowns significantly upregulated their target gene (unexpected)?
Here is the Code :)
path = 'https://github.com/dhimmel/lincs/raw/abcb12f942f93e3ee839e5e3593f930df2c56845/data/consensi/signif/dysreg-knockdown.tsv'
gene_kd = readr::read_tsv(path)
gene_kd %>%
dplyr::filter(perturbagen == entrez_gene_id) %>%
dplyr::group_by(direction) %>%
dplyr::summarize(
count = n()
)
Output:
direction count
(chr) (int)
1 down 806
2 up 9
Conclusion: Of the knockdown genes, 806 significantly downregulated their gene (expected) while 9 upregulated their gene (unexpected)
knock_down_path = "https://github.com/dhimmel/lincs/raw/abcb12f942f93e3ee839e5e3593f930df2c56845/data/con sensi/signif/dysreg-knockdown.tsv"
kd_genes = readr::read_tsv(knock_down_path)
kd_genes$direction = as.factor(kd_genes$direction)
kd_genes %>%
group_by(symbol, direction) %>%
dplyr::summarise(count=n()) %>%
tidyr::spread(key = direction, value = count, fill = 0) %>%
arrange(desc(up)) %>% top_n(n = 10, wt = desc(up))
symbol | down | up |
---|---|---|
MCOLN1 | 2 | 1128 |
MAL | 0 | 985 |
WIF1 | 0 | 884 |
SERPINA3 | 0 | 873 |
SATB1 | 0 | 862 |
CES1 | 0 | 849 |
XIST | 30 | 764 |
CRIP1 | 0 | 713 |
KLHL21 | 2 | 602 |
COL11A1 | 0 | 562 |
TF | 0 | 527 |
ERAP2 | 0 | 512 |
ABCC5 | 3 | 501 |
AGR2 | 2 | 478 |
CPVL | 1 | 476 |
These are the top 10 most unregulated genes. These up-regulated genes do not appear to be down-regulated with any significant frequency
Which ten genes were most frequently significantly downregulated by gene knockdowns? How many knockdowns significantly downregulated these genes? How many knockdowns significantly upregulated these genes?
Here's my code:
path="https://raw.githubusercontent.com/dhimmel/lincs/abcb12f942f93e3ee839e5e3593f930df2c56845/data/consensi/genes.tsv"
path2="https://raw.githubusercontent.com/dhimmel/lincs/abcb12f942f93e3ee839e5e3593f930df2c56845/data/consensi/signif/dysreg-knockdown.tsv"
path3="https://raw.githubusercontent.com/dhimmel/lincs/abcb12f942f93e3ee839e5e3593f930df2c56845/data/consensi/signif/dysreg-overexpression.tsv"
gene_df = readr::read_tsv(path)
kd_gene = readr::read_tsv(path2)
oexp_gene= readr::read_tsv(path3)
head(gene_df)
head(kd_gene)
View(kd_gene)
library(dplyr)
library(tidyr)
gene_df %>%
dplyr::group_by(status) %>%
dplyr::summarize(
count=n()
)
gene_df %>%
dplyr::mutate(kind='gene')
#which 10 genes were most frequently dowregulated by KDs
#first, find number of distinct genes downregulated by KDs (7411)
kd_gene$entrez_gene_id %>%
n_distinct()
#next, find number of pertubagens (4312)
kd_gene$perturbagen %>%
n_distinct()
#from the top 10 genes, how many times were they downregulated?
#genes most frequently DOWNREGULATED by the KNOCKDOWNS
#filter to only downregulated KDs
downregulated_kds <- kd_gene %>%
filter(direction=="down")
#sort by count to downregulated KDs
downreg_kd_sorted <- downregulated_kds %>%
dplyr::group_by(symbol) %>%
dplyr::summarise(
count=n()
) %>%
dplyr::arrange(desc(count))
head(downreg_kd_sorted, 10)
View(downreg_kd_sorted)
#from the top 10 genes, how many times were they UPREGULATED?
#genes most frequently UPREGULATED by the KNOCKDOWNS
#filter to only upregulated KDs
upregulated_kds <- kd_gene %>%
filter(direction=="up")
#sort by count to upregulated KDs
upreg_kd_sorted <- upregulated_kds %>%
dplyr::group_by(symbol) %>%
dplyr::summarise(
count=n()
) %>%
dplyr::arrange(desc(count))
head(upreg_kd_sorted, 10)
View(upreg_kd_sorted)
#How many knockdowns downregulated these genes? 195,786
#How many knockdowns upregulated these genes? 132,282
nrow(kd_gene)
kd_gene %>%
dplyr::group_by(direction) %>%
dplyr::summarize(
count=n()
)
upreg_kd_sorted<- upreg_kd_sorted %>%
dplyr::rename(up_count=count)
dim(upreg_kd_sorted)
View(upreg_kd_sorted)
downreg_kd_sorted <-downreg_kd_sorted %>%
dplyr::rename(down_count=count)
dim(downreg_kd_sorted)
MYANSWER<- dplyr::full_join(downreg_kd_sorted, upreg_kd_sorted)
MYANSWER[is.na(MYANSWER)] = 0
MYANSWER
Here's my answer:
symbol | down_count | up_count |
---|---|---|
RPS4Y1 | 1637 | 0 |
CDC20 | 1456 | 1 |
PCNA | 1360 | 0 |
NME1 | 1182 | 0 |
MIF | 1052 | 0 |
CSRP1 | 1031 | 1 |
STUB1 | 996 | 10 |
TIMM9 | 989 | 4 |
TYMS | 881 | 0 |
GDF15 | 866 | 0 |
#read in data
path_genes = 'https://github.com/dhimmel/lincs/raw/abcb12f942f93e3ee839e5e3593f930df2c56845/data/consensi/genes.tsv'
gene_df = readr::read_tsv(path_genes)
path_down = 'https://github.com/dhimmel/lincs/raw/abcb12f942f93e3ee839e5e3593f930df2c56845/data/consensi/signif/dysreg-knockdown.tsv'
down_df = readr::read_tsv(path_down)
path_over = 'https://github.com/dhimmel/lincs/raw/abcb12f942f93e3ee839e5e3593f930df2c56845/data/consensi/signif/dysreg-overexpression.tsv'
over_df = readr::read_tsv(path_over)
q.7 = over_df %>% subset(z_score < 0) %>% group_by(entrez_gene_id) %>% summarize(count=n()) %>% arrange(-count)
q.7 = q.7[1:10,]
a. Which ten genes were most frequently significantly downregulated by gene overexpression?
entrez_gene_id count
(int) (int)
1 6192 166
2 991 165
3 5111 165
4 5018 122
5 994 105
6 26520 102
7 1738 91
8 9133 86
9 7298 84
10 22827 83
b. How many overexpressions significantly downregulated these genes?
q.7.2 = over_df %>% subset(z_score < 0) %>% filter(entrez_gene_id %in% q.7$entrez_gene_id) %>%
group_by(perturbagen) %>% summarize(count = n())
612 overexpressed genes
c. How many overexpressions significantly upregulated these genes?
q.7.3 = over_df %>% subset(z_score > 0) %>% filter(entrez_gene_id %in% q.7$entrez_gene_id) %>%
group_by(perturbagen) %>% summarize(count = n())
4 overexpressed genes
q.8 = over_df %>% subset(z_score > 0) %>% group_by(entrez_gene_id) %>% summarize(count=n()) %>% arrange(-count)
q.8 = q.8[1:10,]
a. Which ten genes were most frequently significantly upregulated by gene overexpression?
entrez_gene_id count
(int) (int)
1 57192 180
2 5331 152
3 25966 140
4 23378 113
5 4118 104
6 9903 99
7 55008 98
8 1066 96
9 5971 94
10 7503 94
b. How many overexpressions significantly upregulated these genes?
q.8.2 = over_df %>% subset(z_score > 0) %>% filter(entrez_gene_id %in% q.8$entrez_gene_id) %>%
group_by(perturbagen) %>% summarize(count = n())
792 overexpressed genes
c. How many overexpressions significantly downregulated these genes?
q.8.3 = over_df %>% subset(z_score < 0) %>% filter(entrez_gene_id %in% q.8$entrez_gene_id) %>%
group_by(perturbagen) %>% summarize(count = n())
14 overexpressed genes
For knockdown perturbations, what is the correlation between number of significantly down and upregulated measured genes.
q.9.down.reg = down_df %>% subset(z_score < 0) %>% group_by(perturbagen) %>% summarize(count.down.reg = n())
q.9.up.reg = down_df %>% subset(z_score > 0) %>% group_by(perturbagen) %>% summarize(count.up.reg = n())
joined_df = dplyr::full_join(q.9.down.reg, q.9.up.reg)
joined_df[is.na(joined_df)] = 0
Pearson correlation, R =0.9371317
cor(joined_df$count.down.reg, joined_df$count.up.reg)
Kendall correlation R = 0.732456
cor(joined_df$count.down.reg, joined_df$count.up.reg, method = 'kendall')
See Question 7 above.
How many genes were never significantly dysregulated by any knockdown perturbation?
library(dplyr)
install.packages("tidyr")
library(readr)
library(ggplot2)
path = 'https://github.com/dhimmel/lincs/raw/abcb12f942f93e3ee839e5e3593f930df2c56845/data/consensi/genes.tsv'
path_ko = 'https://github.com/dhimmel/lincs/raw/abcb12f942f93e3ee839e5e3593f930df2c56845/data/consensi/signif/dysreg-knockdown.tsv'
gene_df = readr::read_tsv(path)
gene_ko_df = readr::read_tsv(path_ko)
ghost_df = gene_df[! (gene_df$entrez_gene_id %in% gene_ko_df$entrez_gene_id), ]
nrow(ghost_df)
#for a list of these genes
ghost_df$symbol
The number of genes that were not sig dysregulated by knockdown perturbation (on main list of genes, but not on knockdown list of genes) = 56!
See question 8 above.
Impressive work!
Each of the nine pupils in attendance answered their question. Most finished within two hours — despite several having little R experience — after an initial 30 minute tutorial. The workshop succeeded at introducing a broad range of topics: R, the hadleyverse, transcriptomics, LINCS L1000, markdown, Thinklab, and open science.
I enjoyed helping the pupils learn while they performed original and noteworthy analyses. And meanwhile, through the power of realtime open science on Thinklab, we're now coauthors on a citeable work [1].
The workshop built off of many developments in scientific education: specifically, solving problems [2] in contemporary research [3] while contributing to the scientific record [4].
Next, I'll review the answers to see what we have learned.
Here's my analysis of the answers from today's workshop. Thanks again to the pupils for their hard work.
Yes, we established this important control. Knockdown overwhelming downregulated (806 instances) rather than upregulated (9 instances) its target gene (Q1 by @jeffreykim). Overexpression overwhelming upregulated (124 instances) rather than downregulated (4 instances) its target gene (Q2 by @kathleenk).
No, we saw that almost all genes were dysregulated by at least one genetic perturbation. Only 0.7% of genes (56 out of 7,467) were never dysregulated by a knockdown (Q4 by @jasleensodhi). Only 1 of these genes was measured, while the remaining 55 were imputed (Q3 by @mishavysotskiy). This imbalance makes sense since imputed genes were subject to a more stringent significance threshold. The low number of never-dysregulated genes is a welcome result from a network perspective, where pervasive connectivity is important.
Next, we identified which genes were most frequently dysregulated due to a knockdown. RPS4Y1 was downregulated by 37.8% (1,637 out of 4,326) of knockdowns (Q5 by @juliacluceru). MCOLN1 was upregulated by 26.1% (1,128 out of 4,326) of knockdowns (Q6 by @beaunorgeot). The top-ten-most-frequently-downregulated-by-knockdown genes were rarely upregulated by knockdown. The same consistency in direction of dysregulation applied to the top-ten-most-frequently-upregulated genes as well.
Next, we identified which genes were most frequently dysregulated due to overexpression. RPS4Y1 was downregulated by 6.9% (166 out of 2,413) of overepressions (Q7 by @emmalynchen). MCOLN1 was upregulated by 7.5% (180 out of 2,413) of overepressions (Q8 by @elizabethlevy1). Interestingly, RPS4Y1 was the most downregulated gene by both knockdown and overexpression. Conversely, MCOLN1 was the most upregulated gene for both perturbation types.
The findings from Q5–8 fit with @larsjuhljensen's hypothesis that a general stress response may cause many genes to respond to any genetic perturbation in a consistent direction. Q5–8 also help address @caseygreene's question on which genes are driving the signals.
Finally, there was a strong correlation between the number of downregulated and upregulated genes per knockdown (Q9 by @marjorieimperial). In other words, a perturbation which downregulates many genes will also likely upregulate many genes.
Lars Juhl Jensen: Reassuring to see that things behave the way I would expect. This should make it fairly easy to derive a scoring scheme that extracts only associations that are specifically associated with a small number of perturbations, as opposed to associated with any perturbation.
Lars Juhl Jensen: The part about broad downregulation occurring in tandem with broad upregulation is almost a given. Even if it is not the case biologically, this will be the case after most normalization methods.