Calculating genomic windows for GWAS lead SNPs

Daniel Himmelstein, Marina Sirota, Greg Way

doi:10.15363/thinklab.d71

Project:

Rephetio: Repurposing drugs on a hetnet [rephetio]

Daniel Himmelstein Researcher June 8, 2015

Background

GWAS uncover disease-associated loci, but because genotyping arrays are sparse and nearby variants are in linkage disequilibrium (LD), identifying the specific SNP driving an association is difficult. Therefore, GWAS usually report the most significant hit as the single lead SNP for a locus, leaving the identification of a causal SNP for later research. Often multiple GWAS of the same disease will identify different lead SNPs in the same region, presumably all tagging the same causal variant. Therefore, around any lead SNP is a region of indetermination: a genomic window in which the SNP driving the association is likely to reside.

Application

When extracting disease-gene associations from the GWAS Catalog [1], we collapse multiple associations for the same disease into loci (regions) [2]. Starting with lead SNPs for each association, we find the corresponding windows and overlap them into genomically disjoint sets.
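As a rough illustration of the overlap step, here is a minimal Python sketch that collapses per-SNP windows into disjoint loci. The function name `merge_windows`, the `(chromosome, start, end)` tuple layout, and the example coordinates are assumptions made for illustration, not the code used in the project.

```python
from collections import defaultdict


def merge_windows(windows):
    """Collapse (chrom, start, end) windows into disjoint loci per chromosome.

    Windows on the same chromosome that overlap (or touch) are merged
    into a single locus spanning all of them.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in windows:
        by_chrom[chrom].append((start, end))

    loci = []
    for chrom in sorted(by_chrom):
        merged = []
        # Sorting by start lets us merge in a single left-to-right pass.
        for start, end in sorted(by_chrom[chrom]):
            if merged and start <= merged[-1][1]:
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        loci.extend((chrom, s, e) for s, e in merged)
    return loci


# Hypothetical windows for four lead SNPs of one disease.
example = [("6", 100, 250), ("6", 200, 400), ("6", 900, 950), ("1", 50, 60)]
loci = merge_windows(example)
print(loci)
```

In this toy input, the first two chromosome 6 windows overlap and become one locus, while the other two windows remain separate.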

Previously, we retrieved windows for GWAS lead-SNPs from the DAPPLE [3] wingspan files. DAPPLE windows "were calculated for each lead-SNP by finding the furthest upstream and downstream SNPs where $$r^2 > 0.5$$ and extending outwards to the next recombination hotspot [2]."
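The quoted procedure can be sketched in Python as follows. This is a simplified reading of the description, not DAPPLE's actual implementation: the function names, the representation of proxies as `(position, r2)` pairs, and the treatment of each recombination hotspot as a single position are all assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right


def ld_window(lead_pos, proxies, threshold=0.5):
    """Return (start, end) spanning the lead SNP and its furthest proxies.

    proxies: iterable of (position, r2) pairs on the lead SNP's chromosome.
    Only proxies with r2 strictly above the threshold are considered.
    """
    positions = [lead_pos] + [pos for pos, r2 in proxies if r2 > threshold]
    return min(positions), max(positions)


def extend_to_hotspots(start, end, hotspots):
    """Extend a window outwards to the nearest flanking hotspots.

    hotspots: sorted list of hotspot positions (a simplification; real
    hotspots are intervals). If no hotspot lies beyond an edge, that
    edge is left unchanged.
    """
    i = bisect_right(hotspots, start)  # number of hotspots at or before start
    j = bisect_left(hotspots, end)     # index of first hotspot at or after end
    new_start = hotspots[i - 1] if i > 0 else start
    new_end = hotspots[j] if j < len(hotspots) else end
    return new_start, new_end


# Hypothetical lead SNP at position 1000 with four proxy SNPs.
proxies = [(700, 0.9), (850, 0.55), (1200, 0.6), (1500, 0.3)]
window = ld_window(1000, proxies, threshold=0.5)
extended = extend_to_hotspots(*window, hotspots=[100, 600, 1400, 2000])
print(window, extended)
```

Here the SNP at position 1500 falls below the threshold and is ignored, so the window runs from 700 to 1200 before extension and from 600 to 1400 after.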

However, DAPPLE relied on HapMap [4] for LD data, which is now outdated. Many SNPs in the GWAS Catalog are not in HapMap. Because HapMap is missing many SNPs, the furthest SNPs in LD with a lead SNP could go unobserved, so extending to the next recombination hotspot was necessary.

Questions

Given a lead SNP, how should we identify the furthest upstream and downstream SNPs with $$r^2$$ exceeding a given threshold? Which data and tools should we use?
In the context of GWAS loci, is $$r^2 > 0.5$$ too low a threshold for windows?
Is the recombination hotspot extension necessary?

We would like to identify windows for ~5000 SNPs, which are identified by dbSNP build 142 rsIDs.

Marina Sirota June 8, 2015

I would suggest using 1000 Genomes for the LD calculation here with a more stringent $$r^2$$ cutoff (maybe 0.8?). Some LD information is available through their browser:

http://browser.1000genomes.org/Homo_sapiens/Location/Genome?db=core;r=2:31451742-31452000

Here is a thread discussing similar ideas:
https://www.biostars.org/p/2909/

The other resource of interest is the ExAC dataset: http://exac.broadinstitute.org/ I don't think the LD data is available, but it's worthwhile reaching out to them!

Daniel Himmelstein Researcher June 11, 2015

@marinasirota, thanks for the advice.

The SNAP Proxy Search [5] allows us to find all SNPs within 500 kb of a query SNP whose LD with it exceeds a provided threshold, using 1000 Genomes (KG) pilot data.

One issue with KG is that the whole-genome sequencing was done at low depth (4x coverage) and that only 179 samples were sequenced: 60 CEU, 59 YRI, 30 CHB, and 30 JPT [6]. Therefore, many low-frequency or technically difficult variants were likely missed. Since GWAS have mostly focused on common variants, the small size of the 1000 Genomes pilot data may be acceptable.

We went ahead and evaluated the SNAP LD information from KG for our GWAS lead SNPs. For each lead SNP, we found all SNPs with $$r^2 \geq 0.8$$ in the European (CEU) subset of 60 individuals. The findings are as follows:

Of 5,255 GWAS lead SNPs, 517 were not found by SNAP.
SNPs with lower minor allele frequencies were more likely to have large windows (kilobase spans). We speculate this results from greater noise in $$r^2$$ values when the number of minor alleles is low, enabling faraway SNPs to appear in high LD by chance.
614 lead SNPs have a zero-length span, meaning no SNPs were found with LD exceeding the threshold. Most likely this is due to the incompleteness of KG.
Window spans measured in kilobases are highly, positively correlated with spans measured in centimorgans. Windows that are long in kilobases therefore tend to be long in centimorgans as well, so we cannot choose a single centimorgan threshold to approximate windows calculated using the $$r^2$$ method.
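To make the speculation about noise at low minor allele frequency concrete, here is a small Python sketch that computes $$r^2$$ from phased haplotypes using the standard definition $$r^2 = D^2 / (p_A (1 - p_A) p_B (1 - p_B))$$ with $$D = p_{AB} - p_A p_B$$. The haplotype vectors are invented for illustration (120 haplotypes, matching 60 diploid individuals), and SNAP's own computation may differ in detail.

```python
def r_squared(hap_a, hap_b):
    """r^2 between two biallelic SNPs from phased 0/1 haplotype vectors."""
    n = len(hap_a)
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    p_ab = sum(a * b for a, b in zip(hap_a, hap_b)) / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    # r^2 is undefined when either SNP is monomorphic in the sample.
    return d * d / denom if denom else float("nan")


# Rare SNP pair: 2 minor alleles out of 120 haplotypes.
rare_a = [1, 1] + [0] * 118
rare_b_same = [1, 1] + [0] * 118   # identical carriers
rare_b_diff = [1, 0] + [0] * 118   # one haplotype differs

# Common SNP pair: 60 minor alleles out of 120 haplotypes.
common_a = [1] * 60 + [0] * 60
common_b_diff = [1] * 59 + [0] * 61  # one haplotype differs

print(r_squared(rare_a, rare_b_same))     # approximately 1.0
print(r_squared(rare_a, rare_b_diff))     # approximately 0.50
print(r_squared(common_a, common_b_diff)) # approximately 0.97
```

A single discordant haplotype moves the rare pair from perfect LD to roughly 0.5, while the common pair barely changes. This illustrates why estimates based on a handful of minor alleles are unstable.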

In conclusion, the KG data retrieved from SNAP is usable but not ideal. We will look into larger datasets and have reached out to the ExAC team.

Daniel Himmelstein: 2015-06-12: I reassessed the implications of the centimorgan versus kilobase window span correlation. My current conclusion is that a single centimorgan threshold cannot be chosen that produces similar windows to the r-squared method.

Daniel Himmelstein Researcher June 12, 2015

Permissive $$r^2$$ threshold when relying on low-powered LD data

@marinasirota, I intuitively agree that for modern GWAS, which assay and impute millions of SNPs, lead SNPs are likely to be in $$r^2 \geq 0.8$$ with the SNPs driving the association. However, when using the 1000 Genomes Pilot data for LD, I think we should use a more permissive threshold of $$r^2 \geq 0.5$$. The 0.8 threshold produces 614 windows with zero-length spans compared to 149 for the 0.5 threshold. Zero-length spans are equivalent to declaring that the lead SNP is the only SNP capable of creating the association. I would prefer to minimize these instances when we have such incomplete LD information.
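The threshold comparison can be sketched as a count of lead SNPs with no qualifying proxy. The data below are invented and the function name `count_zero_length` is hypothetical. The sketch only shows how the counts at two thresholds would be tallied and does not reproduce the 614 and 149 figures.

```python
def count_zero_length(proxies_by_lead, threshold):
    """Count lead SNPs with no proxy SNP at r2 >= threshold.

    proxies_by_lead: dict mapping a lead SNP ID to a list of
    (position, r2) pairs for other SNPs in its neighbourhood.
    """
    return sum(
        1
        for proxies in proxies_by_lead.values()
        if not any(r2 >= threshold for _, r2 in proxies)
    )


# Hypothetical LD results for three lead SNPs.
proxies_by_lead = {
    "rs_a": [(1200, 0.95), (900, 0.60)],
    "rs_b": [(5100, 0.55)],
    "rs_c": [],
}

strict = count_zero_length(proxies_by_lead, threshold=0.8)
permissive = count_zero_length(proxies_by_lead, threshold=0.5)
print(strict, permissive)
```

In this toy input, relaxing the threshold from 0.8 to 0.5 rescues `rs_b`, while `rs_c` stays zero-length because no proxies were observed at all.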

Marina Sirota June 12, 2015

@dhimmel - that makes sense.

Daniel Himmelstein Researcher June 23, 2015

1000 Genomes Phase 3 data

We have become aware that more recent and comprehensive 1000 Genomes data exists; it is just not included in SNAP. The Phase 3 dataset contains ~2500 individuals with whole-genome sequencing.

The Phase 3 data was recently added to the Ensembl database. Ensembl has a Perl API, which should be able to find all SNPs in LD with a lead SNP.

We found example code and have reached out for advice because our implementation is currently failing.

Daniel Himmelstein Researcher Nov. 2, 2015

LDlink

A recently published webapp called LDlink calculates SNPs in LD for a given lead SNP using 1000 Genomes Phase 3 data [7]. The LDproxy feature allows specifying a lead SNP and reference population. The resulting table of proxy SNPs is downloadable as a TSV file.

Unfortunately, the service doesn't offer a public API. Therefore, querying at scale could be difficult.

Greg Way April 8, 2016

Extracting LD from 1000 Genomes data is not straightforward. Here is a rough outline of the solution I came up with. I used PLINK 1.9 (https://www.cog-genomics.org/plink2/) and VCFtools. It will require some tweaking for specific applications.

Step 0: Download 1000 Genomes data and remove duplicate SNP IDs

Step 1: use VCFtools to generate a population-specific tped file

vcftools --gzvcf <vcf_file> --plink-tped --keep <samples.txt> --out <tped_fh>

where <samples.txt> is the location of a text file with a single column of sample IDs

Step 2: transpose the tped file (more efficient than creating a ped file originally)

plink --tfile <tped_fh> --recode --threads <num_threads> --no-sex --no-pheno --out <ped_fh>

Step 3: use PLINK 1.9 to pull out an LD matrix

plink --file <ped_fh> --allow-no-sex --r2 --threads <num_threads> --ld-window-r2 <r2_threshold> --chr <chromosome> --ld-snps <snp_string> --out <out_name>

where <ped_fh> is the output prefix from Step 2, <r2_threshold> is the minimum $$r^2$$ to report, and <snp_string> is a comma-separated list of rsIDs
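Once PLINK has written its pairwise LD table, the windows can be derived with a few lines of Python. The sketch below assumes the whitespace-delimited `.ld` layout with the columns CHR_A, BP_A, SNP_A, CHR_B, BP_B, SNP_B and R2, and assumes that each queried lead SNP appears in the SNP_A column. The function name and the example rows are hypothetical. Note also that PLINK's own window options (`--ld-window`, `--ld-window-kb`) limit which pairs get reported, so they may need widening for long-range windows.

```python
import io


def windows_from_plink_ld(handle, threshold=0.5):
    """Build {lead_snp: (chrom, start, end)} from a PLINK --r2 table.

    handle: open text file (or file-like object) in PLINK .ld format.
    Rows with R2 below the threshold are skipped. Lead SNPs with no
    remaining rows do not appear in the result.
    """
    header = handle.readline().split()
    col = {name: i for i, name in enumerate(header)}
    windows = {}
    for line in handle:
        fields = line.split()
        if not fields or float(fields[col["R2"]]) < threshold:
            continue
        lead = fields[col["SNP_A"]]
        chrom = fields[col["CHR_A"]]
        bp_a = int(fields[col["BP_A"]])
        bp_b = int(fields[col["BP_B"]])
        # Start from a zero-length window at the lead SNP, then widen it.
        _, start, end = windows.get(lead, (chrom, bp_a, bp_a))
        windows[lead] = (chrom, min(start, bp_b), max(end, bp_b))
    return windows


# Hypothetical excerpt of a .ld file for one lead SNP.
ld_text = """\
 CHR_A  BP_A  SNP_A  CHR_B  BP_B  SNP_B  R2
 1 1000 rs1 1 700 rs7 0.91
 1 1000 rs1 1 1000 rs1 1
 1 1000 rs1 1 1300 rs9 0.62
 1 1000 rs1 1 1800 rs4 0.31
"""
windows = windows_from_plink_ld(io.StringIO(ld_text), threshold=0.5)
print(windows)
```

For real output, pass an open file handle for `<out_name>.ld` instead of the `io.StringIO` wrapper.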

1. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. D. Welter, J. MacArthur, J. Morales, T. Burdett, P. Hall, H. Junkins, A. Klemm, P. Flicek, T. Manolio, L. Hindorff, H. Parkinson (2013) Nucleic Acids Research. doi:10.1093/nar/gkt1229
2. Heterogeneous Network Edge Prediction: A Data Integration Approach to Prioritize Disease-Associated Genes. Daniel S. Himmelstein, Sergio E. Baranzini (2015) PLOS Computational Biology. doi:10.1371/journal.pcbi.1004259
3. Proteins Encoded in Genomic Regions Associated with Immune-Mediated Disease Physically Interact and Suggest Underlying Biology. Elizabeth J. Rossin, Kasper Lage, Soumya Raychaudhuri, Ramnik J. Xavier, Diana Tatar, Yair Benita, Chris Cotsapas, Mark J. Daly (2011) PLoS Genetics. doi:10.1371/journal.pgen.1001273
4. A haplotype map of the human genome. The International HapMap Consortium (2005) Nature. doi:10.1038/nature04226
5. SNAP: a web-based tool for identification and annotation of proxy SNPs using HapMap. A. D. Johnson, R. E. Handsaker, S. L. Pulit, M. M. Nizzari, C. J. O'Donnell, P. I. W. de Bakker (2008) Bioinformatics. doi:10.1093/bioinformatics/btn564
6. A map of human genome variation from population-scale sequencing. The 1000 Genomes Project Consortium (2010) Nature. doi:10.1038/nature09534
7. LDlink: a web-based application for exploring population-specific haplotype structure and linking correlated alleles of possible functional variants. Mitchell J. Machiela, Stephen J. Chanock (2015) Bioinformatics. doi:10.1093/bioinformatics/btv402

Status: Deferred

Topics

Genomics, Genetics, GWAS Catalog, 1000 Genomes, Recombination Hotspots, HapMap, Linkage Disequilibrium, LD, Loci

Referenced by

Research report: Rephetio: Repurposing drugs on a hetnet
Extracting disease-gene associations from the GWAS Catalog

Cite this as

Daniel Himmelstein, Marina Sirota, Greg Way (2015) Calculating genomic windows for GWAS lead SNPs. Thinklab. doi:10.15363/thinklab.d71
