Workshop to analyze LINCS data for the Systems Pharmacology course at UCSF

Daniel Himmelstein, Kathleen Keough, Misha Vysotskiy, Jeffrey Kim, Beau Norgeot, Julia Cluceru, Marjorie Imperial, Emmalyn Chen, Jasleen Sodhi, Elizabeth Levy

doi:10.15363/thinklab.d181

Project:

Rephetio: Repurposing drugs on a hetnet [rephetio]

Daniel Himmelstein Researcher March 8, 2016

Today, I'm teaching a workshop for the Systems Pharmacology course at UCSF. The course primarily consists of first-year students in the Pharmaceutical Sciences and Pharmacogenomics graduate program.

The topic of my workshop is "Big data". Therefore, I thought a perfect activity would be to analyze the transcriptional perturbation data from LINCS L1000. And stars have aligned: first, we've just released version 2 of our consensus signatures; second, we recently noticed some counterintuitive occurrences in the genetic perturbation data.

Hence, I've designed a set of questions. Each pupil will be assigned a question. The pupils will then use R to attempt to answer the question. At the end of the three-hour workshop, we will encourage pupils to post their findings as a comment on this discussion.

I'm hoping to teach my R best practices as well as introduce several packages for modern data science. We will strive for the following workflow in R (not every step is needed for each question):

1. Read the appropriate file into a dataframe using readr. The readr::read_tsv() function should come in handy. Datasets are available on GitHub (readr should be able to read from the raw dataset URL).
2. Tidy the dataframe using tidyr. The tidyr::gather() function will help convert a wide (matrix) format to a long format; tidyr::spread() does the reverse.
3. Manipulate the dataframe using dplyr. Common operations here will be dplyr::filter() and dplyr::mutate().
4. Join dataframes using dplyr::inner_join() or dplyr::left_join().
5. Answer the question, either by using dplyr::group_by() followed by dplyr::summarize() or by using ggplot2 to visualize the results.
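
As a rough illustration of the workflow above, here is a minimal sketch on made-up data. The table contents, the column names `g1` and `g2`, and the cutoff of 4 are all hypothetical; with a real file the first table would come from readr::read_tsv() on a raw dataset URL. tidyr::gather() is used for the wide-to-long step.

```r
library(dplyr)

# Toy stand-in for a wide matrix of z-scores (rows: perturbagens, columns: genes).
# HYPOTHETICAL data; a real table would be read with readr::read_tsv().
wide_df <- tibble::tibble(
  perturbagen = c("p1", "p2"),
  g1 = c(-5.2, 0.3),
  g2 = c(4.8, -4.1)
)

# Toy gene annotations, used to demonstrate the join step.
gene_df <- tibble::tibble(
  gene = c("g1", "g2"),
  status = c("measured", "imputed")
)

# Tidy: wide (matrix) format to long format, one row per perturbagen-gene pair.
long_df <- wide_df %>%
  tidyr::gather(key = "gene", value = "z_score", -perturbagen)

# Manipulate: keep large effects (hypothetical cutoff) and label their direction.
signif_df <- long_df %>%
  dplyr::filter(abs(z_score) > 4) %>%
  dplyr::mutate(direction = ifelse(z_score < 0, "down", "up"))

# Join the annotations, then summarize the number of rows per direction.
summary_df <- signif_df %>%
  dplyr::inner_join(gene_df, by = "gene") %>%
  dplyr::group_by(direction) %>%
  dplyr::summarize(count = dplyr::n())

summary_df
```

The workshop datasets are already in long format, so the tidy step can usually be skipped for the questions below.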

Questions will follow!

Daniel Himmelstein Researcher March 8, 2016

Questions

Here are the 13 questions for the workshop. They all focus on understanding the transcriptional response to genetic perturbation.

1. How many knockdowns significantly downregulated their target gene (expected)? How many knockdowns significantly upregulated their target gene (unexpected)?
2. How many overexpressions significantly upregulated their target gene (expected)? How many overexpressions significantly downregulated their target gene (unexpected)?
3. How many genes were never significantly dysregulated by any knockdown perturbation? Report this number for both the measured and inferred gene sets.
4. How many genes were never significantly dysregulated by any knockdown perturbation? Report this number for both the measured and inferred gene sets. (same as 3 by accident)
5. Which ten genes were most frequently significantly downregulated by gene knockdowns? How many knockdowns significantly downregulated these genes? How many knockdowns significantly upregulated these genes?
6. Which ten genes were most frequently significantly upregulated by gene knockdowns? How many knockdowns significantly upregulated these genes? How many knockdowns significantly downregulated these genes?
7. Which ten genes were most frequently significantly downregulated by gene overexpression? How many overexpressions significantly downregulated these genes? How many overexpressions significantly upregulated these genes?
8. Which ten genes were most frequently significantly upregulated by gene overexpression? How many overexpressions significantly upregulated these genes? How many overexpressions significantly downregulated these genes?
9. For knockdown perturbations, what is the correlation between the number of significantly downregulated and upregulated measured genes?

Dataset documentation will follow.

Daniel Himmelstein Researcher March 8, 2016

Datasets

The above questions can all be answered using the following three datasets.

Gene information `genes.tsv`

This dataset lists the genes that have regulation scores. It also indicates whether each gene's expression was directly measured or imputed. The raw dataset is available at:

https://github.com/dhimmel/lincs/raw/abcb12f942f93e3ee839e5e3593f930df2c56845/data/consensi/genes.tsv

Below is a preview:

entrez_gene_id	status	symbol	type_of_gene	description
100	imputed	ADA	protein-coding	adenosine deaminase
1000	imputed	CDH2	protein-coding	cadherin 2, type 1, N-cadherin (neuronal)
10000	imputed	AKT3	protein-coding	v-akt murine thymoma viral oncogene homolog 3

Genes dysregulated by knockdowns `dysreg-knockdown.tsv`

This dataset contains significantly dysregulated genes due to knockdown perturbations. The raw dataset is available at:

https://github.com/dhimmel/lincs/raw/abcb12f942f93e3ee839e5e3593f930df2c56845/data/consensi/signif/dysreg-knockdown.tsv

Below is a preview:

perturbagen	entrez_gene_id	z_score	symbol	status	direction	nlog10_bonferroni_pval
2	133	-5.495	ADM	imputed	down	3.596
2	501	-4.317	ALDH7A1	measured	down	1.811
2	9915	-5.579	ARNT2	measured	down	4.626

Genes dysregulated by overexpressions `dysreg-overexpression.tsv`

This dataset contains significantly dysregulated genes due to overexpression perturbations. The raw dataset is available at:

https://github.com/dhimmel/lincs/raw/abcb12f942f93e3ee839e5e3593f930df2c56845/data/consensi/signif/dysreg-overexpression.tsv

Below is a preview:

perturbagen	entrez_gene_id	z_score	symbol	status	direction	nlog10_bonferroni_pval
2	991	-4.687	CDC20	measured	down	2.567
2	54438	4.551	GFOD1	measured	up	2.282
2	5950	4.590	RBP4	imputed	up	1.541
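
The column name `nlog10_bonferroni_pval` suggests a negative log10 of a Bonferroni-adjusted p-value. Assuming that reading (it is an interpretation of the name, not something stated in this thread), the adjusted p-value can be recovered as 10 raised to the negated value. A small sketch using the three knockdown preview rows, typed in by hand:

```r
# Preview rows copied from the dysreg-knockdown.tsv preview above.
preview_df <- data.frame(
  symbol = c("ADM", "ALDH7A1", "ARNT2"),
  nlog10_bonferroni_pval = c(3.596, 1.811, 4.626)
)

# ASSUMPTION: nlog10_bonferroni_pval = -log10(Bonferroni-adjusted p-value),
# so the adjusted p-value is 10^(-nlog10_bonferroni_pval).
preview_df$bonferroni_pval <- 10 ^ (-preview_df$nlog10_bonferroni_pval)

preview_df
```

Under this assumption, larger values in the column correspond to smaller adjusted p-values.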

Kathleen Keough March 8, 2016

Question 2

I answered:

How many overexpressions significantly upregulated their target gene (expected)? How many overexpressions significantly downregulated their target gene (unexpected)?

R code:

# workshop with dan himmelstein

# dplyr provides %>%, filter() and n(), which are used unqualified below
library(dplyr)

# which genes have regulation scores. It also contains whether the gene's expression was directly measured or imputed

path <- 'https://github.com/dhimmel/lincs/raw/abcb12f942f93e3ee839e5e3593f930df2c56845/data/consensi/genes.tsv'
gene_df <- readr::read_tsv(path)

# significantly dysregulated genes due to knockdown perturbations

path2 <- 'https://github.com/dhimmel/lincs/raw/abcb12f942f93e3ee839e5e3593f930df2c56845/data/consensi/signif/dysreg-knockdown.tsv'
kd_df <- readr::read_tsv(path2)

# significantly dysregulated genes due to overexpression perturbations

path3 <- 'https://github.com/dhimmel/lincs/raw/abcb12f942f93e3ee839e5e3593f930df2c56845/data/consensi/signif/dysreg-overexpression.tsv'
oe_df <- readr::read_tsv(path3)

dat <- filter(oe_df, perturbagen == entrez_gene_id)

dat %>%
  dplyr::group_by(direction) %>%
  dplyr::summarize(
    count = n()
  )

R output:

Source: local data frame [2 x 2]

  direction count
      (chr) (int)
1      down     4
2        up   124

So, 4 genes were significantly downregulated after being overexpressed (unexpected) and 124 genes were significantly upregulated after being overexpressed (expected).

Thanks Dan!

Misha Vysotskiy March 8, 2016

For question 3,

How many genes were never significantly dysregulated by any knockdown perturbation? Report this number for both the measured and inferred gene sets.

The elegant dplyr solution (thanks Daniel) looks like:

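library(dplyr)

# load the gene table and the knockdown table (URLs from the Datasets post)
base = 'https://github.com/dhimmel/lincs/raw/abcb12f942f93e3ee839e5e3593f930df2c56845/data/consensi/'
gene_df = readr::read_tsv(paste0(base, 'genes.tsv'))
knockdown_df = readr::read_tsv(paste0(base, 'signif/dysreg-knockdown.tsv'))

# how many times is each gene dysregulated in all?
count_df = knockdown_df %>%
  dplyr::group_by(entrez_gene_id) %>%
  dplyr::summarise(count = dplyr::n())

# join the table of counts with the full table of genes. the genes that were
# not present in count_df get a missing (NA) count
full = gene_df %>%
  dplyr::left_join(count_df, by = 'entrez_gene_id')

# divide the genes with missing counts by imputed vs. measured
result = full %>% dplyr::filter(is.na(count)) %>%
  dplyr::group_by(status) %>%
  dplyr::summarise(count = dplyr::n())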

The solution: of all the genes, very few avoid dysregulation!

    status count
     (chr) (int)
1  imputed    55
2 measured     1
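
The left-join-then-filter pattern used for this question can be checked on toy data. The sketch below uses made-up gene IDs and counts, and compares it with dplyr::anti_join(), which is one alternative way to find rows without a match; it is not the approach used in the workshop.

```r
library(dplyr)

# Toy data (HYPOTHETICAL IDs): three genes, only two of which have a count.
gene_df <- tibble::tibble(
  entrez_gene_id = c(1, 2, 3),
  status = c("measured", "imputed", "imputed")
)
count_df <- tibble::tibble(entrez_gene_id = c(1, 2), count = c(5L, 2L))

# left_join keeps every gene; genes absent from count_df get count = NA.
via_left_join <- gene_df %>%
  dplyr::left_join(count_df, by = "entrez_gene_id") %>%
  dplyr::filter(is.na(count)) %>%
  dplyr::select(entrez_gene_id, status)

# anti_join returns only the genes with no match in count_df.
via_anti_join <- gene_df %>%
  dplyr::anti_join(count_df, by = "entrez_gene_id")

# Both approaches should single out the same gene.
identical(via_left_join$entrez_gene_id, via_anti_join$entrez_gene_id)
```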

Jeffrey Kim March 8, 2016

Question 1

How many knockdowns significantly downregulated their target gene (expected)? How many knockdowns significantly upregulated their target gene (unexpected)?

Here is the Code :)

library(dplyr)  # provides %>% and n(), which are used unqualified below

path = 'https://github.com/dhimmel/lincs/raw/abcb12f942f93e3ee839e5e3593f930df2c56845/data/consensi/signif/dysreg-knockdown.tsv'
gene_kd = readr::read_tsv(path)

gene_kd %>%
  dplyr::filter(perturbagen == entrez_gene_id) %>%
  dplyr::group_by(direction) %>%
  dplyr::summarize(
    count = n()
  )

Output:

  direction count
      (chr) (int)
1      down   806
2        up     9

Conclusion: Of the knockdowns, 806 significantly downregulated their target gene (expected), while 9 significantly upregulated it (unexpected).

Beau Norgeot March 9, 2016

Question 6

library(dplyr)

knock_down_path = "https://github.com/dhimmel/lincs/raw/abcb12f942f93e3ee839e5e3593f930df2c56845/data/consensi/signif/dysreg-knockdown.tsv"
kd_genes = readr::read_tsv(knock_down_path)
kd_genes$direction = as.factor(kd_genes$direction)

kd_genes %>%
  group_by(symbol, direction) %>%
  dplyr::summarise(count = n()) %>%
  # drop the leftover grouping by symbol so the ranking runs over all genes
  ungroup() %>%
  tidyr::spread(key = direction, value = count, fill = 0) %>%
  arrange(desc(up)) %>%
  # the first ten rows answer the question; 15 are shown in the table below
  head(15)

Resulting Table

symbol	down	up
MCOLN1	2	1128
MAL	0	985
WIF1	0	884
SERPINA3	0	873
SATB1	0	862
CES1	0	849
XIST	30	764
CRIP1	0	713
KLHL21	2	602
COL11A1	0	562
TF	0	527
ERAP2	0	512
ABCC5	3	501
AGR2	2	478
CPVL	1	476

Notes

The first ten rows of the table are the ten most frequently upregulated genes. These upregulated genes do not appear to be downregulated with any significant frequency.

Julia Cluceru March 9, 2016

Question 5

Which ten genes were most frequently significantly downregulated by gene knockdowns? How many knockdowns significantly downregulated these genes? How many knockdowns significantly upregulated these genes?

Here's my code:

path="https://raw.githubusercontent.com/dhimmel/lincs/abcb12f942f93e3ee839e5e3593f930df2c56845/data/consensi/genes.tsv"
path2="https://raw.githubusercontent.com/dhimmel/lincs/abcb12f942f93e3ee839e5e3593f930df2c56845/data/consensi/signif/dysreg-knockdown.tsv"
path3="https://raw.githubusercontent.com/dhimmel/lincs/abcb12f942f93e3ee839e5e3593f930df2c56845/data/consensi/signif/dysreg-overexpression.tsv"
gene_df = readr::read_tsv(path)
kd_gene = readr::read_tsv(path2)
oexp_gene= readr::read_tsv(path3)
head(gene_df)
head(kd_gene)
View(kd_gene)
library(dplyr)
library(tidyr)

gene_df %>%
  dplyr::group_by(status) %>%
  dplyr::summarize(
    count=n()
  )

gene_df %>% 
  dplyr::mutate(kind='gene')

#which 10 genes were most frequently downregulated by KDs

#first, find number of distinct genes downregulated by KDs (7411)
kd_gene$entrez_gene_id %>% 
  n_distinct()
#next, find number of perturbagens (4312)
kd_gene$perturbagen %>% 
  n_distinct()

#from the top 10 genes, how many times were they downregulated? 
#genes most frequently DOWNREGULATED by the KNOCKDOWNS


#filter to only downregulated KDs
downregulated_kds <- kd_gene %>% 
  filter(direction=="down")
#sort by count to downregulated KDs
downreg_kd_sorted <- downregulated_kds %>%
  dplyr::group_by(symbol) %>%
  dplyr::summarise(
    count=n()
  ) %>%
  dplyr::arrange(desc(count))
head(downreg_kd_sorted, 10)
View(downreg_kd_sorted)

#from the top 10 genes, how many times were they UPREGULATED? 
#genes most frequently UPREGULATED by the KNOCKDOWNS

#filter to only upregulated KDs
upregulated_kds <- kd_gene %>% 
  filter(direction=="up")
#sort by count to upregulated KDs
upreg_kd_sorted <- upregulated_kds %>%
  dplyr::group_by(symbol) %>%
  dplyr::summarise(
    count=n()
  ) %>%
  dplyr::arrange(desc(count))
head(upreg_kd_sorted, 10)
View(upreg_kd_sorted)


#Across all genes, how many gene-knockdown pairs were downregulated? 195,786
#Across all genes, how many gene-knockdown pairs were upregulated? 132,282
nrow(kd_gene)
kd_gene %>%
  dplyr::group_by(direction) %>%
  dplyr::summarize(
    count=n()
  )

upreg_kd_sorted<- upreg_kd_sorted %>% 
  dplyr::rename(up_count=count)
dim(upreg_kd_sorted)
View(upreg_kd_sorted)

downreg_kd_sorted <-downreg_kd_sorted %>%
  dplyr::rename(down_count=count)
dim(downreg_kd_sorted)

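#join the down and up counts per gene; rows keep the order of downreg_kd_sorted
MYANSWER <- dplyr::full_join(downreg_kd_sorted, upreg_kd_sorted, by = "symbol")

#genes missing from one of the two tables had zero events in that direction
MYANSWER[is.na(MYANSWER)] = 0
head(MYANSWER, 10)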

Here's my answer:

symbol	down_count	up_count
RPS4Y1	1637	0
CDC20	1456	1
PCNA	1360	0
NME1	1182	0
MIF	1052	0
CSRP1	1031	1
STUB1	996	10
TIMM9	989	4
TYMS	881	0
GDF15	866	0

Marjorie Imperial March 9, 2016

Question 7, 8, and 9

#load dplyr for %>%, group_by(), summarize(), arrange() and filter()
library(dplyr)

#read in data
path_genes = 'https://github.com/dhimmel/lincs/raw/abcb12f942f93e3ee839e5e3593f930df2c56845/data/consensi/genes.tsv'
gene_df = readr::read_tsv(path_genes)

path_down = 'https://github.com/dhimmel/lincs/raw/abcb12f942f93e3ee839e5e3593f930df2c56845/data/consensi/signif/dysreg-knockdown.tsv'
down_df = readr::read_tsv(path_down)

path_over = 'https://github.com/dhimmel/lincs/raw/abcb12f942f93e3ee839e5e3593f930df2c56845/data/consensi/signif/dysreg-overexpression.tsv'
over_df = readr::read_tsv(path_over)

Question 7- Emmalyn Chen

q.7 = over_df %>% subset(z_score < 0) %>% group_by(entrez_gene_id) %>% summarize(count=n()) %>% arrange(-count)
q.7 = q.7[1:10,]

a. Which ten genes were most frequently significantly downregulated by gene overexpression?

entrez_gene_id count
            (int) (int)
1            6192   166
2             991   165
3            5111   165
4            5018   122
5             994   105
6           26520   102
7            1738    91
8            9133    86
9            7298    84
10          22827    83

b. How many overexpressions significantly downregulated these genes?

q.7.2 = over_df %>% subset(z_score < 0) %>% filter(entrez_gene_id %in% q.7$entrez_gene_id) %>% 
  group_by(perturbagen) %>% summarize(count = n())

612 overexpressed genes

c. How many overexpressions significantly upregulated these genes?

q.7.3 = over_df %>% subset(z_score > 0) %>% filter(entrez_gene_id %in% q.7$entrez_gene_id) %>% 
  group_by(perturbagen) %>% summarize(count = n())

4 overexpressed genes

Question 8 - Liz Levy

q.8 = over_df %>% subset(z_score > 0) %>% group_by(entrez_gene_id) %>% summarize(count=n()) %>% arrange(-count)
q.8 = q.8[1:10,]

a. Which ten genes were most frequently significantly upregulated by gene overexpression?

entrez_gene_id count
            (int) (int)
1           57192   180
2            5331   152
3           25966   140
4           23378   113
5            4118   104
6            9903    99
7           55008    98
8            1066    96
9            5971    94
10           7503    94

b. How many overexpressions significantly upregulated these genes?

q.8.2 = over_df %>% subset(z_score > 0) %>% filter(entrez_gene_id %in% q.8$entrez_gene_id) %>% 
  group_by(perturbagen) %>% summarize(count = n())

792 overexpressed genes

c. How many overexpressions significantly downregulated these genes?

q.8.3 = over_df %>% subset(z_score < 0) %>% filter(entrez_gene_id %in% q.8$entrez_gene_id) %>% 
  group_by(perturbagen) %>% summarize(count = n())

14 overexpressed genes

Question 9 - Marjorie Imperial

For knockdown perturbations, what is the correlation between the number of significantly downregulated and upregulated measured genes?

q.9.down.reg = down_df %>% subset(z_score < 0) %>% group_by(perturbagen) %>% summarize(count.down.reg = n())
q.9.up.reg = down_df %>% subset(z_score > 0) %>% group_by(perturbagen) %>% summarize(count.up.reg = n())

joined_df = dplyr::full_join(q.9.down.reg, q.9.up.reg)
joined_df[is.na(joined_df)] = 0

Pearson correlation, R = 0.9371317

cor(joined_df$count.down.reg, joined_df$count.up.reg)

Kendall correlation, R = 0.732456

cor(joined_df$count.down.reg, joined_df$count.up.reg, method = 'kendall')
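
The question asks about measured genes, while the code above counts all significantly dysregulated genes. A minimal sketch of restricting the counts to measured genes before correlating is shown below. The data are made up for illustration, and the `status` and `direction` columns are assumed to be coded as in the dataset preview; no result for the real data is implied.

```r
library(dplyr)

# Toy stand-in for dysreg-knockdown.tsv (HYPOTHETICAL values).
toy_df <- tibble::tibble(
  perturbagen = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3),
  status = c("measured", "measured", "imputed",
             "measured", "measured", "measured",
             "measured", "measured", "measured",
             "measured", "measured", "measured", "imputed"),
  direction = c("down", "up", "down",
                "down", "down", "up",
                "down", "down", "down",
                "up", "up", "up", "up")
)

# Keep measured genes only, count rows per perturbagen and direction,
# then put the down and up counts side by side (0 where a direction is absent).
counts_df <- toy_df %>%
  dplyr::filter(status == "measured") %>%
  dplyr::count(perturbagen, direction) %>%
  tidyr::spread(key = direction, value = n, fill = 0)

# Correlation between per-perturbagen down and up counts.
cor(counts_df$down, counts_df$up)
cor(counts_df$down, counts_df$up, method = "kendall")
```

Dropping the dplyr::filter() line would include imputed genes in the counts, which is what the code above does.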

Emmalyn Chen March 9, 2016

See Question 7 above.

Jasleen Sodhi March 9, 2016

Question 4

How many genes were never significantly dysregulated by any knockdown perturbation?

library(dplyr)
library(tidyr)  # install once with install.packages("tidyr") if it is missing
library(readr)
library(ggplot2)

path = 'https://github.com/dhimmel/lincs/raw/abcb12f942f93e3ee839e5e3593f930df2c56845/data/consensi/genes.tsv'
path_ko = 'https://github.com/dhimmel/lincs/raw/abcb12f942f93e3ee839e5e3593f930df2c56845/data/consensi/signif/dysreg-knockdown.tsv'
gene_df = readr::read_tsv(path)
gene_ko_df = readr::read_tsv(path_ko)

ghost_df = gene_df[! (gene_df$entrez_gene_id %in% gene_ko_df$entrez_gene_id), ]
nrow(ghost_df)

#for a list of these genes
ghost_df$symbol

The number of genes that were never significantly dysregulated by a knockdown perturbation (on the main list of genes, but not on the knockdown list of genes) = 56!

Elizabeth Levy March 9, 2016

See question 8 above.

Daniel Himmelstein Researcher March 9, 2016

Closing remarks

Impressive work!

Each of the nine pupils in attendance answered their question. Most finished within two hours, despite several having little R experience, after an initial 30-minute tutorial. The workshop succeeded at introducing a broad range of topics: R, the hadleyverse, transcriptomics, LINCS L1000, markdown, Thinklab, and open science.

I enjoyed helping the pupils learn while they performed original and noteworthy analyses. And meanwhile, through the power of realtime open science on Thinklab, we're now coauthors on a citeable work [1].

The workshop built off of many developments in scientific education: specifically, solving problems [2] in contemporary research [3] while contributing to the scientific record [4].

Next, I'll review the answers to see what we have learned.

Daniel Himmelstein Researcher March 9, 2016

Workshop conclusions

Here's my analysis of the answers from today's workshop. Thanks again to the pupils for their hard work.

Do target genes of genetic perturbation respond in the expected direction?

Yes, we established this important control. Knockdown overwhelmingly downregulated (806 instances) rather than upregulated (9 instances) its target gene (Q1 by @jeffreykim). Overexpression overwhelmingly upregulated (124 instances) rather than downregulated (4 instances) its target gene (Q2 by @kathleenk).

Are there many genes that never respond to genetic perturbation?

No, we saw that almost all genes were dysregulated by at least one genetic perturbation. Only 0.7% of genes (56 out of 7,467) were never dysregulated by a knockdown (Q4 by @jasleensodhi). Only 1 of these genes was measured, while the remaining 55 were imputed (Q3 by @mishavysotskiy). This imbalance makes sense since imputed genes were subject to a more stringent significance threshold. The low number of never-dysregulated genes is a welcome result from a network perspective, where pervasive connectivity is important.

Which genes are most frequently dysregulated?

Next, we identified which genes were most frequently dysregulated due to a knockdown. RPS4Y1 was downregulated by 37.8% (1,637 out of 4,326) of knockdowns (Q5 by @juliacluceru). MCOLN1 was upregulated by 26.1% (1,128 out of 4,326) of knockdowns (Q6 by @beaunorgeot). The ten genes most frequently downregulated by knockdown were rarely upregulated by knockdown. The same consistency in direction of dysregulation applied to the ten genes most frequently upregulated by knockdown.

Next, we identified which genes were most frequently dysregulated due to overexpression. RPS4Y1 was downregulated by 6.9% (166 out of 2,413) of overexpressions (Q7 by @emmalynchen). MCOLN1 was upregulated by 7.5% (180 out of 2,413) of overexpressions (Q8 by @elizabethlevy1). Interestingly, RPS4Y1 was the most downregulated gene by both knockdown and overexpression. Conversely, MCOLN1 was the most upregulated gene for both perturbation types.
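
For readers who want to check the arithmetic, the percentages quoted in these conclusions follow from the counts reported in the answers. The helper name `pct` below is made up for this sketch.

```r
# Percentage of perturbations (or genes), rounded to one decimal place.
# `pct` is a hypothetical helper, not part of any package.
pct <- function(hits, total) round(100 * hits / total, 1)

pct(56, 7467)    # genes never dysregulated by a knockdown: 0.7
pct(1637, 4326)  # RPS4Y1 downregulated by knockdowns: 37.8
pct(1128, 4326)  # MCOLN1 upregulated by knockdowns: 26.1
pct(166, 2413)   # RPS4Y1 downregulated by overexpressions: 6.9
pct(180, 2413)   # MCOLN1 upregulated by overexpressions: 7.5
```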

The findings from Q5–8 fit with @larsjuhljensen's hypothesis that a general stress response may cause many genes to respond to any genetic perturbation in a consistent direction. Q5–8 also help address @caseygreene's question on which genes are driving the signals.

Does broad downregulation occur in tandem with broad upregulation?

Finally, there was a strong correlation between the number of downregulated and upregulated genes per knockdown (Q9 by @marjorieimperial). In other words, a perturbation which downregulates many genes will also likely upregulate many genes.

Lars Juhl Jensen: Reassuring to see that things behave the way I would expect. This should make it fairly easy to derive a scoring scheme that extracts only associations that are specifically associated with a small number of perturbations, as opposed to associated with any perturbation.
Lars Juhl Jensen: The part about broad downregulation occurring in tandem with broad upregulation is almost a given. Even if it is not the case biologically, this will be the case after most normalization methods.

Cite this as

Daniel Himmelstein, Kathleen Keough, Misha Vysotskiy, Jeffrey Kim, Beau Norgeot, Julia Cluceru, Marjorie Imperial, Emmalyn Chen, Jasleen Sodhi, Elizabeth Levy (2016) Workshop to analyze LINCS data for the Systems Pharmacology course at UCSF. Thinklab. doi:10.15363/thinklab.d181