## Background

GWAS uncover disease-associated loci, but due to sparse genotyping arrays and linkage disequilibrium (LD), identifying the specific SNP driving the association is difficult. Therefore, GWAS usually report the most significant hit as the single lead SNP for a loci, leaving the identification of a causal SNP for later research. Often multiple GWAS of the same disease will identify different lead SNPs in the same region, presumable all tagging the same causal variant. Therefore, around any lead SNP is a region of indetermination—a genomic window in which the SNP driving the association is likely to reside.

## Application

When extracting disease-gene associations from the GWAS Catalog [1], we collapse multiple associations for the same disease into loci (regions) [2]. Starting with lead SNPs for each association, we find the corresponding windows and overlap them into genomically disjoint sets.

Previously, we retrieved windows for GWAS lead-SNPs from the DAPPLE [3] wingspan files. DAPPLE windows "were calculated for each lead-SNP by finding the furthest upstream and downstream SNPs where $r^2 > 0.5$ and extending outwards to the next recombination hotspot [2]."

However, DAPPLE relied on HapMap [4] for LD data, which is now outdated. Many SNPs in the GWAS catalog are not in HapMap. Since HapMap is missing many SNPs, extending to the next recombination hotspot was necessary.

# Questions

1. Given a lead SNP, how should we identify the furthest upstream and downstream SNPs with $r^2$ exceeding a given threshold? Which data and tools should we use?
2. In the context of GWAS loci, is $r^2 > 0.5$ too low of a threshold for windows?
3. Is the recombination hotspot extension necessary?

We would like to identify windows for ~5000 SNPs which are identified in dbSNP build 142 rsids.

I would suggest using 1000 genomes for the LD calculation here with a more stringent r^2 cutoff (maybe 0.8?). Some LD information is available through their browser

http://browser.1000genomes.org/Homo_sapiens/Location/Genome?db=core;r=2:31451742-31452000

Here is a thread discussing similar ideas:
https://www.biostars.org/p/2909/

The other resource of interest is the ExAC dataset: http://exac.broadinstitute.org/ I don't think the LD data is available, but it's worthwhile reaching out to them!

Daniel Himmelstein Researcher

The SNAP Proxy Search [1] allows us to find all SNPs within 500kb and with LD above a provided threshold for the query SNP, using 1000 Genomes (KG) pilot data.

One issue with KG is that the whole-genome sequencing was done at low depth (4x coverage) and that only 179 samples were sequenced: 60 CEU, 59 YRI, 30 CHB, and 30 JPT [2]. Therefore many low frequency or technically difficult variants were likely missed. Since GWAS have mostly focused on common variants, the puniness of 1000 Genomes pilot data may be acceptable.

We went ahead and evaluated the SNAP LD information from KG for our GWAS lead SNPs. For each lead SNP, we found all SNPs in $r^2 \geq 0.8$ in the European subset of 60 individuals. The findings are as follows:

• SNPs with lower minor allele frequencies were more likely to have large windows (kilobase spans). We speculate this results from greater noise in $r^2$ values when the number of minor alleles is low, enabling far away SNPs to appear in high LD by chance.
• 614 lead SNPs have a zero-length span — no SNPs were found with LD exceeding the threshold. Most likely this is due to the incompleteness of KG.
• Window spans measured in kilobases are highly, positively correlated with spans measured in centimorgans. Therefore, we cannot chose a single centimorgan threshold to approximate windows calculated using the $r^2$ method.

In conclusion, the KG data retrieved from SNAP is feasible but not ideal. We will look into larger datasets and have reached out to the ExAC team.

• Daniel Himmelstein: 2015-06-12: I reassessed the implications of the centimorgan versus kilobase window span correlation. My current conclusion is that a single centimorgan threshold cannot be chosen that produces similar windows to the r-squared method.

Daniel Himmelstein Researcher

# Permissive $r^2$ threshold when relying on low-powered LD data

@marinasirota, I intuitively agree that for the modern GWAS assaying and imputing millions of SNPs, lead SNPs are likely be in $r^2 \geq 0.8$ with the SNPs driving the association. However, when using the 1000 Genomes Pilot data for LD, I think we should use a more permissive threshold of $r^2 \geq 0.5$. The 0.8 threshold produces 614 windows with zero-length spans compared to 149 for the 0.5 threshold. Zero-length spans are equivalent to declaring that the lead SNP is the only SNP capable of creating the association. I would prefer to minimize these instances when we have such incomplete LD information.

@dhimmel - that makes sense.

Daniel Himmelstein Researcher

# 1000 Genomes Phase 3 data

We have become aware that more recent and comprehensive 1000 Genomes data exists, it's just not included in SNAP. The phase 3 dataset contains ~2500 individuals with whole-genome sequencing.

The phase 3 data was recently added to the Ensembl database. Ensembl has a perl API, which should be able to find all SNPs in LD with a lead SNP.

We found example code and have reached out for advice because our implementation is currently failing.

Daniel Himmelstein Researcher

A recently-published webapp called LDlink calculates SNPs in LD for a given lead SNP using 1000 Genomes Phase 3 data [1]. The LDproxy feature allows specifying a lead SNP and reference population. The resulting table of proxy SNPs is downloadable as a tsv.

Unfortunately, the service doesn't release a public API. Therefore, querying at scale could be difficult.

Extracting LD from 1000 genomes data is not straightforward. Here is a rough outline of the solution I came up with. I used Plink1.9 https://www.cog-genomics.org/plink2/ and VCFTools. It will require some tweaking for specific applications.

# Step 1: use vcftools to generate a population specific tped file

vcftools --gzvcf <vcf_file> --plink-tped --keep <samples.txt> --out <tped_fh>

where <samples.txt> is the location of a text file with a single column of sample IDs

# Step 2: transpose the tped file (more efficient than creating a ped file originally)

plink --tfile <tped_fh> --recode --threads <num_threads> --no-sex --no-pheno --out <ped_fh>

# Step 3: use Plink1.9 to pull out an LD matrix

plink --file <vcf_file> --allow-no-sex --r2 --threads <num_threads> --ld-window-r2 <window> --chr <chromosome> --ld-snps <snp_string> --out <out_name>

where <snp_string> is a comma separated list of RS IDs

Status: Deferred
Views
704
Topics
Referenced by
Cite this as
Daniel Himmelstein, Marina Sirota, Greg Way (2015) Calculating genomic windows for GWAS lead SNPs. Thinklab. doi:10.15363/thinklab.d71