GWAS VI: Polygenic risk scores and LD score regression
BIOS 770
Josh Weinstock, PhD
Assistant Professor, Department of Human Genetics, Emory University
Learning objectives
Define polygenic risk scores (PRS)
Discuss common PRS methods
Discuss challenges in PRS portability and interpretation
Discuss linkage disequilibrium score regression (LDSR)
Genotypes → Phenotypes: Recap of GWAS
Continuous Trait
\[ \color{darkblue}{Y}_i \sim \mathcal{N}(\color{darkgreen}{X}_i \beta, \sigma^2) \]
Binary Trait
\[ \color{darkblue}{Y}_i \sim \text{Bernoulli}(\sigma(\color{darkgreen}{X}_i \beta)) \]
\(\color{darkgreen}{X}_i\) : Genotype vector (e.g., SNP counts 0/1/2)
\(\beta\) : Effect sizes from GWAS
\(\sigma\) : Sigmoid function for binary outcomes
Interpretation
\(\beta\) represents marginal effects of genotypes on phenotypes
PRS = \(\color{darkgreen}{X}_i \hat{\beta}\) (weighted sum of risk alleles)
Why create a PRS?
Torkamani et al., Nature Reviews Genetics , 2018
Statistical methods for PRS construction
Clumping and Thresholding
Clumping : Remove SNPs in linkage disequilibrium (LD) .
Thresholding : Select SNPs with GWAS p < threshold.
Penalized Regression
LASSO/Ridge: penalize by L1 or L2 norm to either shrink coefficients or induce sparsity.
Bayesian Methods (PRS-CS, LDpred, BayesR)
Bayesian regression with some helpful prior.
Creating a PRS
Wray et al., Nature Reviews Genetics , 2013
Modeling linkage disequlibrium
For genotype matrix \(\color{darkgreen}{X} \in \mathbb{R}^{n \times p}\) (samples \(\times\) SNPs):
Covariance : \[
R = \text{Cov}(\color{darkgreen}{X}) = \mathbb{E}[\color{darkgreen}{X}^T \color{darkgreen}{X}] - \mathbb{E}[\color{darkgreen}{X}]^T \mathbb{E}[\color{darkgreen}{X}]
\]
Under common assumptions :
Centered genotypes: \(\mathbb{E}[X] = 0\)
Unit variance: \(\text{Var}(X_{i}) = 1 \forall i\)
Simplifies to Correlation Matrix : \[
R = \text{Cor}(\color{darkgreen}{X}) = \mathbb{E}[\color{darkgreen}{X}^T \color{darkgreen}{X}]
\]
Properties of the estimator
Good : \[
\mathbb{E}[\hat{R}] = R \quad \text{(Unbiased)}
\]
Bad :
When \(P \ge N\) , \(\hat{R}\) becomes rank-deficient: \[
\text{rank}(\hat{R}) \leq \min(N,P) = N
\]
\(\hat{R}\) may also fail to be positive definite.
GWAS Dimensionality Challenge
UK Biobank Example (10Mb region): \[
\begin{aligned}
P &= 1{,}000 \ \text{SNPs} \\
\text{Params in } \hat{R} &= \frac{P(P-1)}{2} = 499{,}500 \\
N &\approx 500{,}000 \ \text{(typical subsample)}
\end{aligned}
\]
Implications :
LD matrix estimation becomes ill-conditioned
Matrix inversion \(\hat{R}^{-1}\) unstable
Impacts PRS methods and fine-mapping
Recovering joint estimates from summary statistics
GWAS estimates the marginal effects.
These are likely to be “worse” (for prediction, loci discovery) than joint effects if the SNPs are correlated (highly common)
Simulation: joint vs marginal coefficients
N = 10000
P = 100
M = 20
h2 = 0.20
U <- qr.Q (qr (matrix (rnorm (P ^ 2 ), P)))
D = seq (0.01 , 10 , length = P)
Sigma = crossprod (U, U * D)
L = chol (Sigma)
X = matrix (rnorm (N * P), N, P) %*% L
X = scale (X) # ensure variances are 1
beta = rep (0 , P)
beta_causal = sample (1 : P, size = M)
beta[beta_causal] = rnorm (M, mean = 0 , sd = sqrt (h2 / M))
Y = X %*% beta + rnorm (N, 0 , sqrt (1.0 - h2))
Joint esimates are better than marginal estimates
beta_joint = coef (lm (Y ~ X + 0 ))
beta_gwas = (t (X) %*% Y) / N
summary (lm (beta ~ beta_joint))[["r.squared" ]]
summary (lm (beta ~ beta_gwas))[["r.squared" ]]
Using sufficient statistics, convert marginal to joint
\[
\begin{align}
\hat{\beta}^{\text{marg}} &= \frac{\color{darkgreen}{X}^t \color{darkblue}{Y}}{N} \\
(\color{darkgreen}{X}^t \color{darkgreen}{X})^{-1} \hat{\beta}^{\text{marg}} &= (\color{darkgreen}{X}^t \color{darkgreen}{X})^{-1}\frac{\color{darkgreen}{X}^t \color{darkblue}{Y}}{N} \\
(N\hat{R})^{-1} \hat{\beta}^{\text{marg}} &= \frac{(\color{darkgreen}{X}^t \color{darkgreen}{X})^{-1} * \color{darkgreen}{X}^t \color{darkblue}{Y}}{N} \\
\hat{R}^{-1} \hat{\beta}^{\text{marg}} &= \hat{\beta}^{joint}
\end{align}
\]
Asymptotically: \[
\hat{\beta}^{\text{marg}} \sim \mathcal{N}\left(\hat{R}\beta, \frac{\sigma^2}{N}\hat{R} \right)
\]
What we get from GWAS (\(\hat{\beta}^{\text{marg}}\) ) are the true coefficients (\(\beta\) ) “rotated” by LD (\(R\) ) + noise
GWAS summary statistics
Often, we don’t have access to individual level data
But we can pair sufficient statistics from GWAS with LD estimates from a population reference panel
This information is sufficient to approximate an individual level likelihood without requiring individual genotypes from the GWAS
In summary so far
GWAS estimates marginal effects
We always want to convert these to joint estimates for prediction purposes
We can actually do this without needing to re-analyze the original individual level GWAS data
Methods for computing PRS
Clumping and thresholding
Perform GWAS
“Clump” correlated variants together / LD-prune
Simply take all betas where the pvalue is less than some threshold
Note, no conversion from marginal to joint here
LDPred: Bayesian linear regression
Prior on the SNP effects: spike and slab
\[
\beta_i \stackrel{\text{iid}}{\sim} \begin{cases}
\mathcal{N}\left(0, \dfrac{h_g^2}{M_p}\right) & \text{with probability } p \\
0 & \text{with probability } (1-p)
\end{cases}
\]
Posterior expectation without LD
\[
\mathbb{E}\left(\beta_i \mid \hat{\beta}_i^{marg} \right) = \left(\dfrac{h_g^2}{h_g^2 + \dfrac{M_p}{N}}\right) \bar{p}_i \hat{\beta}^{marg}_i
\]
Estimates in the presence of LD
\[
\mathbb{E}\left(\beta_i \mid \hat{\beta}_i^{marg}, \mathbf{\hat{R}} \right) = \left(\mathbf{\hat{R}} + I \dfrac{M}{Nh_g^2}\right)^{-1} \hat{\beta}^{marg}_i
\]
Simulation
beta_joint = coef (lm (Y ~ X + 0 ))
beta_gwas = (t (X) %*% Y) / N
beta_ld_pred_inf = solve (cor (X) + diag (P) * (M / N * h2), beta_gwas)
summary (lm (beta ~ beta_joint))[["r.squared" ]]
summary (lm (beta ~ beta_gwas))[["r.squared" ]]
summary (lm (beta ~ beta_ld_pred_inf))[["r.squared" ]]
LDPred induces shrinkage towards origin
Connections to ridge regression
\[
\hat{\beta}^{ridge} = (\color{darkgreen}{X}^t \color{darkgreen}{X} + \lambda I)^{-1}\color{darkgreen}{X}^t \color{darkblue}{Y}
\]
\[
\mathbb{E}\left(\beta_i \mid \hat{\beta}_i^{marg}, \mathbf{\hat{R}} \right) = \left(\mathbf{\hat{R}} + I \dfrac{M}{Nh_g^2}\right)^{-1} \hat{\beta}^{marg}_i
\]
\[
\lambda \approx \dfrac{M}{Nh_g^2}
\]
Lasso derived PRS
Minimize: \[
(\color{darkblue}{Y} - \color{darkgreen}{X}\beta)^t(\color{darkblue}{Y} - \color{darkgreen}{X}\beta) + \lambda||\beta||_1
\]
This can be approximated with summary statistics: \[
\color{darkblue}{y}^t\color{darkblue}{y} + \beta^{t}\hat{R}\beta - 2\beta^{t}N\hat{\beta}^{marg} + \lambda||\beta||_1
\]
Mak et al., Genetic Epidemiology , 2017
Simulation with glmnet as a proxy
library (glmnet)
lasso = glmnet (X, Y, alpha = 1.0 )
lasso_estimates = as.vector (coef (lasso, s = 0.01 ))[2 : (P + 1 )]
summary (lm (beta ~ beta_joint))[["r.squared" ]]
summary (lm (beta ~ beta_gwas))[["r.squared" ]]
summary (lm (beta ~ beta_ld_pred_inf))[["r.squared" ]]
summary (lm (beta ~ lasso_estimates))[["r.squared" ]]
PRSCS: Bayesian linear regression with continuous shrinkage prior
\[
\beta_i \sim \mathcal{N}\left(0, \frac{\sigma^2}{N} \lambda_i \tau \right) \\
\lambda_i \sim Gamma(a, \delta_i) \\
\delta_i \sim Gamma(b, 1) \\
\sqrt{\tau} \sim \mathcal{C}^{+}(0, 1)
\]
Ge et al., 2019, Nature Communications
PRSCS intuition
\[
\mathbb{E}\left(\beta_i \mid \hat{\beta}_i^{marg}, \mathbf{\hat{R}} \right) = \left(\mathbf{\hat{R}} + \mathbf{T}^{-1} \right)^{-1} \hat{\beta}^{marg}_i
\]
No LD case (simplification)
\[
\mathbb{E}\left(\beta_i \mid \hat{\beta}_i^{marg} \right) = \left(1 - \omega_i \right) \hat{\beta}^{marg}_i \\
\omega_i = \frac{1}{(1 + \lambda_i \tau)}
\]
\(\omega_i\) prior
SBayesR: Bayesian linear regression with a mixture distribution prior
\[
\beta_j \mid \pi, \sigma_{\beta}^2 =
\begin{cases}
0 & \text{with probability } \pi_1, \\
\sim N(0, \gamma_2 \sigma_{\beta}^2) & \text{with probability } \pi_2, \\
\vdots & \vdots \\
\sim N(0, \gamma_C \sigma_{\beta}^2) & \text{with probability } 1 - \sum_{c=1}^{C-1} \pi_c,
\end{cases}
\]
Lloyd-Jones^, Zeng^ et al., Nature Communications , 2019
Contribution of annotations to mixture probabilities
Non-parametric Bayesian PRS with dirichlet process regression
\[
\sigma^2_\beta \sim \mathbf{G} \\
\mathbf{G} \sim DP(\mathbf{H}, \lambda)
\]
\(\mathbf{H}\) is a “base distribution”
\(\lambda\) is a concentration parameter
Zeng et al., Nature Communications , 2017
“Stick-breaking” interpretation
\[
\beta_{i} \sim \sum_{k=1}^{\infty} \pi_{k} N\left(0, \sigma_{k}^{2}\right) \\
\pi_{k} = \nu_{k} \prod_{l=1}^{k-1} (1 - \nu_{l}), \quad \nu_{k} \sim \mathrm{Beta}\left(1, \lambda\right)
\]
Review
PRS methods generally select and/or shink coefficients
It helps to use LD to convert from marginal to joint estimates
Several different Bayesian priors have shown to be useful in different ways
Bayesian methods offer a framework to include external functional annotations
PRS limitations: do PRS transfer across ancestries?
Figure 1., Martin et al., Nature Genetics , 2019
Portability is inversely proportional to PCA distance
Figure 1, Ding et al., Nature , 2023
Effect apparent even within discrete ancestry clusters
Figure 3, Ding et al., Nature , 2023
Why does this happen?
Causal variants vary in allele frequency across ancestries
Causal variants vary across ancestries
LD differences across ancestries
Allele frequency and LD variation
Martin et al., Figure 3, Nature Genetics , 2019
Multi-ancestry PRS extensions
Ruan et al., Figure 1, Nature Genetics , 2022
Linkage disequilibrium regression
How can we characterize the mechanisms that manifest in a particular distribution of GWAS test statistics? Why are test statistics frequently inflated?
The trait many have many causal loci (degree of polygenicity)
Inflation or deflation may occur as a consequence of population stratification, invalidation of homoscedascity, poor asymptotic approximations to tail probabilities used to compute pvalues
LD Scores and \(\chi^2\) -Statistics
Effect size estimate: \(\hat{\beta}_j^{marg} = \color{darkgreen}{X}_j^\top y/N\)
\(\chi_j^2 = N\left(\hat{\beta}^{marg}_j\right)^2\)
Define the LD Score of variant \(j\) as: \(\ell_j := \sum_{k=1}^M r_{jk}^2\)
Proposition: Under this model, the expected \(\chi^2\) -statistic is : \[
\mathbb{E}[\chi_j^2] \approx \frac{N h_g^2}{M}\ell_j + 1
\]
Bulk-Sullivan et al., Nature Genetics , 2015
Proof step 1: law of total variance
Decompose \(\mathrm{Var}[\hat{\beta}_j]\) : \[
\mathrm{Var}[\hat{\beta}_j] = \underbrace{\mathbb{E}[\mathrm{Var}[\hat{\beta}_j \mid \color{darkgreen}{X}]]}_{\text{Main term}} + \underbrace{\mathrm{Var}[\mathbb{E}[\hat{\beta}_j \mid \color{darkgreen}{X}]]}_{\text{Vanishes}}
\] The second term vanishes because \(\mathbb{E}[\hat{\beta}_j \mid \color{darkgreen}{X}] = 0\) .
Step 2: conditional variance
Given \(X\) , the variance becomes: \[
\mathrm{Var}[\hat{\beta}_j \mid \color{darkgreen}{X}] = \frac{1}{N^2} \color{darkgreen}{X}_j^\top \mathrm{Var}[\color{darkblue}{y} \mid \color{darkgreen}{X}] \color{darkgreen}{X}_j
\] Substitute \(y = X= \color{darkgreen}{X}\beta + \epsilon\) , and note \(Var(\beta) = I\frac{h_g^2}{M}\) : \[
= \frac{1}{N^2} \left(\frac{h_g^2}{M} \color{darkgreen}{X}_j^\top \color{darkgreen}{X} \color{darkgreen}{X}^\top \color{darkgreen}{X}_j + N(1-h_g^2)\right)
\]
Step 3: Sample LD scores
Rewrite the key quadratic form: \[
\frac{1}{N^2} \color{darkgreen}{X}_j^\top \color{darkgreen}{X} \color{darkgreen}{X}^\top \color{darkgreen}{X}_j = \sum_{k=1}^M \tilde{r}_{jk}^2
\] where \(\tilde{r}_{jk} := \frac{1}{N}\sum_{i=1}^N \color{darkgreen}{X}_{ij} \color{darkgreen}{X}_{ik}\) (sample LD).
Step 4: Expectation
Using the \(\delta\) -method approximation: \[
\mathbb{E}[\tilde{r}_{jk}^2] \approx r_{jk}^2 + \frac{1 - r_{jk}^2}{N}
\] Summing over \(k\) gives: \[
\mathbb{E}\left[\sum_{k=1}^M \tilde{r}_{jk}^2\right] \approx \ell_j + \frac{M - \ell_j}{N}
\]
Step 5: Final steps
Combine all terms: \[
\mathbb{E}[\chi_j^2] \approx \frac{N h_g^2}{M} \left(\ell_j + \frac{M - \ell_j}{N}\right) + (1 - h_g^2)
\] Simplify for large \(N\) : \[
\mathbb{E}[\chi_j^2] \approx \frac{N h_g^2}{M}\ell_j + 1
\]
Extension to SNP annotations
\[
\mathbb{E}[\chi_{j}^{2}] = N c \sum_{C} \tau_{C} \ell(j, C) + N a + 1
\]
Finucane et al., Nature Genetics , 2015
Contribution of annotations to SNP heritability across traits