GWAS VI: Polygenic risk scores and LD score regression

BIOS 770

Josh Weinstock, PhD

Assistant Professor, Department of Human Genetics, Emory University

Learning objectives

  1. Define polygenic risk scores (PRS)
  2. Discuss common PRS methods
  3. Discuss challenges in PRS portability and interpretation
  4. Discuss linkage disequilibrium score regression (LDSR)

Genotypes → Phenotypes: Recap of GWAS

Continuous Trait
\[ \color{darkblue}{Y}_i \sim \mathcal{N}(\color{darkgreen}{X}_i \beta, \sigma^2) \]

Binary Trait
\[ \color{darkblue}{Y}_i \sim \text{Bernoulli}(\sigma(\color{darkgreen}{X}_i \beta)) \]

  • \(\color{darkgreen}{X}_i\): Genotype vector (e.g., SNP counts 0/1/2)
  • \(\beta\): Effect sizes from GWAS
  • \(\sigma\): Sigmoid function for binary outcomes

Interpretation

  • \(\beta\) represents marginal effects of genotypes on phenotypes
  • PRS = \(\color{darkgreen}{X}_i \hat{\beta}\) (weighted sum of risk alleles)
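Computing a PRS is a single matrix-vector product. A minimal sketch with made-up toy data (geno and beta_hat are hypothetical stand-ins for real allele counts and GWAS weights):

set.seed(1)
n_ind = 5                                                    # individuals
n_snp = 4                                                    # SNPs
geno = matrix(rbinom(n_ind * n_snp, 2, 0.3), n_ind, n_snp)   # 0/1/2 allele counts
beta_hat = rnorm(n_snp, 0, 0.1)                              # stand-in GWAS effect sizes
prs = as.vector(geno %*% beta_hat)                           # one score per individual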

Why create a PRS?

Torkamani et al., Nature Reviews Genetics, 2018

Statistical methods for PRS construction

  1. Clumping and Thresholding
    • Clumping: Remove SNPs in linkage disequilibrium (LD).
    • Thresholding: Select SNPs with GWAS p < threshold.
  2. Penalized Regression
    • LASSO/Ridge: penalize the L1 norm to induce sparsity or the L2 norm to shrink coefficients.
  3. Bayesian Methods (PRS-CS, LDpred, BayesR)
    • Bayesian regression with some helpful prior.

Creating a PRS

Wray et al., Nature Reviews Genetics, 2013

Modeling linkage disequilibrium

For genotype matrix \(\color{darkgreen}{X} \in \mathbb{R}^{n \times p}\) (samples \(\times\) SNPs):

Covariance: \[ R = \text{Cov}(\color{darkgreen}{X}) = \mathbb{E}[\color{darkgreen}{X}^T \color{darkgreen}{X}] - \mathbb{E}[\color{darkgreen}{X}]^T \mathbb{E}[\color{darkgreen}{X}] \]

Under common assumptions:

  1. Centered genotypes: \(\mathbb{E}[X] = 0\)
  2. Unit variance: \(\text{Var}(X_{i}) = 1\ \forall\, i\)

Simplifies to Correlation Matrix: \[ R = \text{Cor}(\color{darkgreen}{X}) = \mathbb{E}[\color{darkgreen}{X}^T \color{darkgreen}{X}] \]
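A small sketch of this estimator (a random Gaussian matrix stands in for standardized genotypes):

set.seed(1)
N = 1000; P = 10
X = scale(matrix(rnorm(N * P), N, P))  # stand-in for standardized genotypes
R_hat = crossprod(X) / N               # estimate of E[X^t X]
max(abs(R_hat - cor(X)))               # agrees with cor(X) up to the 1/N vs 1/(N-1) factor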

Properties of the estimator

Good: \[ \mathbb{E}[\hat{R}] = R \quad \text{(Unbiased)} \]

Bad:

  • When \(P \ge N\), \(\hat{R}\) becomes rank-deficient: \[ \text{rank}(\hat{R}) \leq \min(N,P) = N \]

\(\hat{R}\) may also fail to be positive definite.
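A quick demonstration of the rank deficiency when \(P > N\) (sketch):

set.seed(1)
N = 50; P = 200
X = scale(matrix(rnorm(N * P), N, P))
R_hat = crossprod(X) / N
qr(R_hat)$rank                               # about N, far below P
min(eigen(R_hat, symmetric = TRUE)$values)   # ~0 (or slightly negative)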

GWAS Dimensionality Challenge

UK Biobank Example (10Mb region): \[ \begin{aligned} P &= 1{,}000 \ \text{SNPs} \\ \text{Params in } \hat{R} &= \frac{P(P-1)}{2} = 499{,}500 \\ N &\approx 500{,}000 \ \text{(typical subsample)} \end{aligned} \]

Implications:

  • LD matrix estimation becomes ill-conditioned
  • Matrix inversion \(\hat{R}^{-1}\) unstable
  • Impacts PRS methods and fine-mapping

Recovering joint estimates from summary statistics

  • GWAS estimates the marginal effects.
  • These are likely to be “worse” (for prediction and locus discovery) than joint effects when the SNPs are correlated, which is very common

Simulation: joint vs marginal coefficients

N = 10000   # samples
P = 100     # SNPs
M = 20      # causal SNPs
h2 = 0.20   # SNP heritability

# Random covariance matrix with eigenvalues D; draw correlated genotypes via its Cholesky factor
U <- qr.Q(qr(matrix(rnorm(P ^ 2), P)))
D = seq(0.01, 10, length = P)
Sigma = crossprod(U, U * D)
L = chol(Sigma)
X = matrix(rnorm(N * P), N, P) %*% L
X = scale(X) # ensure variances are 1

# Spike-and-slab truth: M causal SNPs, each contributing h2 / M of variance
beta = rep(0, P)
beta_causal = sample(1:P, size = M)
beta[beta_causal] = rnorm(M, mean = 0, sd = sqrt(h2 / M))
Y = X %*% beta + rnorm(N, 0, sqrt(1.0 - h2))

Joint estimates are better than marginal estimates

beta_joint = coef(lm(Y ~ X + 0))   # joint OLS fit (all SNPs at once)
beta_gwas  = (t(X) %*% Y) / N      # marginal one-SNP-at-a-time estimates

summary(lm(beta ~ beta_joint))[["r.squared"]] 
[1] 0.8535646
summary(lm(beta ~ beta_gwas))[["r.squared"]] 
[1] 0.7479689

Using sufficient statistics, convert marginal to joint

Since \(\color{darkgreen}{X}^t \color{darkgreen}{X} = N\hat{R}\) for standardized genotypes:

\[ \begin{align} \hat{\beta}^{\text{marg}} &= \frac{\color{darkgreen}{X}^t \color{darkblue}{Y}}{N} \\ \color{darkgreen}{X}^t \color{darkblue}{Y} &= N \hat{\beta}^{\text{marg}} \\ \hat{\beta}^{\text{joint}} = (\color{darkgreen}{X}^t \color{darkgreen}{X})^{-1} \color{darkgreen}{X}^t \color{darkblue}{Y} &= (N\hat{R})^{-1} N \hat{\beta}^{\text{marg}} \\ \hat{\beta}^{\text{joint}} &= \hat{R}^{-1} \hat{\beta}^{\text{marg}} \end{align} \]

Asymptotically: \[ \hat{\beta}^{\text{marg}} \sim \mathcal{N}\left(\hat{R}\beta, \frac{\sigma^2}{N}\hat{R} \right) \]

What we get from GWAS (\(\hat{\beta}^{\text{marg}}\)) are the true coefficients (\(\beta\)) “rotated” by LD (\(R\)) + noise
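Continuing with X, Y, and the estimates from the simulation above, the conversion is a single linear solve (a sketch):

R_hat = crossprod(X) / N                 # LD matrix (could come from a reference panel)
beta_joint_ss = solve(R_hat, beta_gwas)  # R^{-1} beta_marg
max(abs(beta_joint_ss - beta_joint))     # matches the joint OLS fit up to rounding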

GWAS summary statistics

  1. Often, we don’t have access to individual level data
  2. But we can pair sufficient statistics from GWAS with LD estimates from a population reference panel
  3. This information is sufficient to approximate an individual level likelihood without requiring individual genotypes from the GWAS

In summary so far

  1. GWAS estimates marginal effects
  2. For prediction, we generally want to convert these to joint estimates
  3. We can actually do this without needing to re-analyze the original individual level GWAS data

Methods for computing PRS

Clumping and thresholding

  1. Perform GWAS
  2. “Clump” correlated variants together / LD-prune
  3. Simply take all betas where the p-value is below some threshold
  4. Note, no conversion from marginal to joint here
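A bare-bones sketch on the simulated data (the \(r^2 = 0.1\) and \(p < 5 \times 10^{-4}\) cutoffs are arbitrary choices; z-scores are approximated from the marginal estimates):

z = sqrt(N) * as.vector(beta_gwas)         # approximate z-scores (X, Y standardized)
pvals = 2 * pnorm(-abs(z))
r2 = cor(X)^2
pruned = c()
for (j in order(pvals)) {                  # visit SNPs from most to least significant
  if (all(r2[j, pruned] <= 0.1)) pruned = c(pruned, j)
}
prs_snps = pruned[pvals[pruned] < 5e-4]    # survivors below the p-value threshold
weights = replace(rep(0, P), prs_snps, beta_gwas[prs_snps])
prs = X %*% weights                        # marginal betas used as-is (no joint conversion)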

LDpred: Bayesian linear regression

Prior on the SNP effects: spike and slab

\[ \beta_i \stackrel{\text{iid}}{\sim} \begin{cases} \mathcal{N}\left(0, \dfrac{h_g^2}{M_p}\right) & \text{with probability } p \\ 0 & \text{with probability } (1-p) \end{cases} \]

Posterior expectation without LD

\[ \mathbb{E}\left(\beta_i \mid \hat{\beta}_i^{marg} \right) = \left(\dfrac{h_g^2}{h_g^2 + \dfrac{M_p}{N}}\right) \bar{p}_i \hat{\beta}^{marg}_i \]
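In the infinitesimal special case (\(p = 1\), so \(\bar{p}_i = 1\) and \(M_p = M\)), this reduces to one scalar shrinkage factor. A minimal sketch using the simulation's constants for illustration:

shrink = h2 / (h2 + M / N)       # close to 1 here since N >> M
beta_no_ld = shrink * beta_gwas  # every marginal estimate shrunk by the same factor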

Estimates in the presence of LD

\[ \mathbb{E}\left(\beta \mid \hat{\beta}^{marg}, \mathbf{\hat{R}} \right) = \left(\mathbf{\hat{R}} + I \dfrac{M}{Nh_g^2}\right)^{-1} \hat{\beta}^{marg} \]

Simulation

beta_joint = coef(lm(Y ~ X + 0))
beta_gwas  = (t(X) %*% Y) / N
# ridge term is M / (N * h2), matching the formula above
beta_ld_pred_inf = solve(cor(X) + diag(P) * (M / (N * h2)), beta_gwas)

summary(lm(beta ~ beta_joint))[["r.squared"]]
[1] 0.8535646
summary(lm(beta ~ beta_gwas))[["r.squared"]] 
[1] 0.7479689
summary(lm(beta ~ beta_ld_pred_inf))[["r.squared"]] 
[1] 0.8629468

LDpred induces shrinkage towards the origin

Connections to ridge regression

\[ \hat{\beta}^{ridge} = (\color{darkgreen}{X}^t \color{darkgreen}{X} + \lambda I)^{-1}\color{darkgreen}{X}^t \color{darkblue}{Y} \]

\[ \mathbb{E}\left(\beta \mid \hat{\beta}^{marg}, \mathbf{\hat{R}} \right) = \left(\mathbf{\hat{R}} + I \dfrac{M}{Nh_g^2}\right)^{-1} \hat{\beta}^{marg} \]

\[ \lambda \approx \dfrac{M}{Nh_g^2} \]
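The correspondence can be checked numerically (a sketch: glmnet with alpha = 0 fits ridge; the intercept and internal standardization are turned off so that glmnet's objective, \(\frac{1}{2N}\text{RSS} + \frac{\lambda}{2}||\beta||_2^2\), matches the closed form above):

library(glmnet)
lam = M / (N * h2)                             # the LDpred-inf penalty
ridge = glmnet(X, Y, alpha = 0, lambda = lam,
               intercept = FALSE, standardize = FALSE)
beta_ridge = as.vector(coef(ridge))[-1]        # drop the (zero) intercept
beta_closed = solve(crossprod(X) / N + lam * diag(P), beta_gwas)
cor(beta_ridge, beta_closed)                   # approximately 1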

Lasso-derived PRS

Minimize: \[ (\color{darkblue}{Y} - \color{darkgreen}{X}\beta)^t(\color{darkblue}{Y} - \color{darkgreen}{X}\beta) + \lambda||\beta||_1 \]

This can be approximated with summary statistics: \[ \color{darkblue}{Y}^t\color{darkblue}{Y} + N\beta^{t}\hat{R}\beta - 2N\beta^{t}\hat{\beta}^{marg} + \lambda||\beta||_1 \]

Mak et al., Genetic Epidemiology, 2017
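A quick sanity check on the simulated data: for any coefficient vector, the summary-statistic objective matches the individual-level residual sum of squares exactly (the b below is arbitrary):

R_hat = crossprod(X) / N
b = rnorm(P, 0, 0.05)                        # an arbitrary coefficient vector
obj_individual = sum((Y - X %*% b)^2)
obj_summary = sum(Y^2) + N * as.numeric(t(b) %*% R_hat %*% b) - 2 * N * sum(b * beta_gwas)
obj_individual - obj_summary                 # ~0: the objectives coincide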

Simulation with glmnet as a proxy

library(glmnet)

lasso = glmnet(X, Y, alpha = 1.0)                              # alpha = 1: pure L1 penalty
lasso_estimates = as.vector(coef(lasso, s = 0.01))[2:(P + 1)]  # drop intercept; s = 0.01 chosen by hand

summary(lm(beta ~ beta_joint))[["r.squared"]] 
[1] 0.8535646
summary(lm(beta ~ beta_gwas))[["r.squared"]] 
[1] 0.7479689
summary(lm(beta ~ beta_ld_pred_inf))[["r.squared"]] 
[1] 0.8629468
summary(lm(beta ~ lasso_estimates))[["r.squared"]] 
[1] 0.9905764

PRS-CS: Bayesian linear regression with continuous shrinkage prior

\[ \beta_i \sim \mathcal{N}\left(0, \frac{\sigma^2}{N} \lambda_i \tau \right) \\ \lambda_i \sim \text{Gamma}(a, \delta_i) \\ \delta_i \sim \text{Gamma}(b, 1) \\ \sqrt{\tau} \sim \mathcal{C}^{+}(0, 1) \]

Ge et al., 2019, Nature Communications

PRS-CS intuition

\[ \mathbb{E}\left(\beta \mid \hat{\beta}^{marg}, \mathbf{\hat{R}} \right) = \left(\mathbf{\hat{R}} + \mathbf{T}^{-1} \right)^{-1} \hat{\beta}^{marg} \]

No LD case (simplification)

\[ \mathbb{E}\left(\beta_i \mid \hat{\beta}_i^{marg} \right) = \left(1 - \omega_i \right) \hat{\beta}^{marg}_i \\ \omega_i = \frac{1}{(1 + \lambda_i \tau)} \]
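What does this prior imply for the weights \(\omega_i\)? A sketch that samples from the hierarchy above; a = 1 and b = 1/2 are assumed values (the defaults described in Ge et al.):

set.seed(1)
a = 1; b = 0.5                             # assumed values
n_snp = 1e4
delta = rgamma(n_snp, shape = b, rate = 1)
lambda = rgamma(n_snp, shape = a, rate = delta)
tau = abs(rcauchy(1))^2                    # sqrt(tau) ~ half-Cauchy(0, 1)
omega = 1 / (1 + lambda * tau)
hist(omega, breaks = 50)                   # weights spread from 0 (no shrinkage) to 1 (full shrinkage)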

\(\omega_i\) prior

SBayesR: Bayesian linear regression with a mixture distribution prior

\[ \beta_j \mid \pi, \sigma_{\beta}^2 \sim \begin{cases} 0 & \text{with probability } \pi_1, \\ N(0, \gamma_2 \sigma_{\beta}^2) & \text{with probability } \pi_2, \\ \vdots & \vdots \\ N(0, \gamma_C \sigma_{\beta}^2) & \text{with probability } 1 - \sum_{c=1}^{C-1} \pi_c \end{cases} \]

Lloyd-Jones*, Zeng*, et al., Nature Communications, 2019

SBayesRC: SBayesR, but prior is informed by functional annotations

Figure 1, Zheng et al., 2024, Nature Genetics

Contribution of annotations to mixture probabilities

Non-parametric Bayesian PRS with Dirichlet process regression

\[ \sigma^2_\beta \sim \mathbf{G} \\ \mathbf{G} \sim DP(\mathbf{H}, \lambda) \]

  • \(\mathbf{H}\) is a “base distribution”
  • \(\lambda\) is a concentration parameter

Zeng et al., Nature Communications, 2017

“Stick-breaking” interpretation

\[ \beta_{i} \sim \sum_{k=1}^{\infty} \pi_{k} N\left(0, \sigma_{k}^{2}\right) \\ \pi_{k} = \nu_{k} \prod_{l=1}^{k-1} (1 - \nu_{l}), \quad \nu_{k} \sim \mathrm{Beta}\left(1, \lambda\right) \]
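The weights are easy to simulate; a sketch truncating the infinite sum at K components:

set.seed(1)
lambda = 1                               # concentration parameter
K = 20                                   # truncation for illustration
nu = rbeta(K, 1, lambda)
pi_k = nu * cumprod(c(1, 1 - nu[-K]))    # pi_k = nu_k * prod_{l<k} (1 - nu_l)
round(pi_k, 3)                           # weights decay with k
sum(pi_k)                                # close to 1; the remainder is the unbroken stick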

Review

  1. PRS methods generally select and/or shrink coefficients
  2. It helps to use LD to convert from marginal to joint estimates
  3. Several different Bayesian priors have been shown to be useful in different ways
  4. Bayesian methods offer a framework to include external functional annotations

PRS limitations: do PRS transfer across ancestries?

Figure 1., Martin et al., Nature Genetics, 2019

PRS performance decays in test sets that differ in ancestry from the GWAS cohort

Figure 3., Martin et al., Nature Genetics, 2019

Portability is inversely proportional to PCA distance

Figure 1, Ding et al., Nature, 2023

Effect apparent even within discrete ancestry clusters

Figure 3, Ding et al., Nature, 2023

Why does this happen?

  1. Causal variants vary in allele frequency across ancestries
  2. Causal variants vary across ancestries
  3. LD differences across ancestries

Allele frequency and LD variation

Martin et al., Figure 3, Nature Genetics, 2019

Multi-ancestry PRS extensions

Ruan et al., Figure 1, Nature Genetics, 2022

Linkage disequilibrium score regression

What mechanisms shape the distribution of GWAS test statistics? Why are test statistics frequently inflated?

  1. The trait may have many causal loci (a high degree of polygenicity)
  2. Inflation or deflation may occur as a consequence of population stratification, violations of homoscedasticity, or poor asymptotic approximations to the tail probabilities used to compute p-values

LD Scores and \(\chi^2\)-Statistics

  • Effect size estimate: \(\hat{\beta}_j^{marg} = \color{darkgreen}{X}_j^\top y/N\)
  • \(\chi_j^2 = N\left(\hat{\beta}^{marg}_j\right)^2\)
  • Define the LD Score of variant \(j\) as: \(\ell_j := \sum_{k=1}^M r_{jk}^2\)

Proposition: Under the infinitesimal model \(\beta_k \stackrel{\text{iid}}{\sim} \mathcal{N}(0, h_g^2/M)\), the expected \(\chi^2\)-statistic is: \[ \mathbb{E}[\chi_j^2] \approx \frac{N h_g^2}{M}\ell_j + 1 \]

Bulik-Sullivan et al., Nature Genetics, 2015

Proof step 1: law of total variance

Decompose \(\mathrm{Var}[\hat{\beta}_j]\): \[ \mathrm{Var}[\hat{\beta}_j] = \underbrace{\mathbb{E}[\mathrm{Var}[\hat{\beta}_j \mid \color{darkgreen}{X}]]}_{\text{Main term}} + \underbrace{\mathrm{Var}[\mathbb{E}[\hat{\beta}_j \mid \color{darkgreen}{X}]]}_{\text{Vanishes}} \] The second term vanishes because \(\mathbb{E}[\hat{\beta}_j \mid \color{darkgreen}{X}] = 0\).

Step 2: conditional variance

Given \(\color{darkgreen}{X}\), the variance becomes: \[ \mathrm{Var}[\hat{\beta}_j \mid \color{darkgreen}{X}] = \frac{1}{N^2} \color{darkgreen}{X}_j^\top \mathrm{Var}[\color{darkblue}{y} \mid \color{darkgreen}{X}] \color{darkgreen}{X}_j \] Substitute \(\color{darkblue}{y} = \color{darkgreen}{X}\beta + \epsilon\), and note \(\mathrm{Var}(\beta) = I\frac{h_g^2}{M}\): \[ = \frac{1}{N^2} \left(\frac{h_g^2}{M} \color{darkgreen}{X}_j^\top \color{darkgreen}{X} \color{darkgreen}{X}^\top \color{darkgreen}{X}_j + N(1-h_g^2)\right) \]

Step 3: Sample LD scores

Rewrite the key quadratic form: \[ \frac{1}{N^2} \color{darkgreen}{X}_j^\top \color{darkgreen}{X} \color{darkgreen}{X}^\top \color{darkgreen}{X}_j = \sum_{k=1}^M \tilde{r}_{jk}^2 \] where \(\tilde{r}_{jk} := \frac{1}{N}\sum_{i=1}^N \color{darkgreen}{X}_{ij} \color{darkgreen}{X}_{ik}\) (sample LD).

Step 4: Expectation

Using the \(\delta\)-method approximation: \[ \mathbb{E}[\tilde{r}_{jk}^2] \approx r_{jk}^2 + \frac{1 - r_{jk}^2}{N} \] Summing over \(k\) gives: \[ \mathbb{E}\left[\sum_{k=1}^M \tilde{r}_{jk}^2\right] \approx \ell_j + \frac{M - \ell_j}{N} \]

Step 5: Final steps

Combine all terms: \[ \mathbb{E}[\chi_j^2] \approx \frac{N h_g^2}{M} \left(\ell_j + \frac{M - \ell_j}{N}\right) + (1 - h_g^2) \] Simplify for large \(N\): \[ \mathbb{E}[\chi_j^2] \approx \frac{N h_g^2}{M}\ell_j + 1 \]
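As a check (sketch), simulate a trait matching the proof's assumptions (every SNP causal, \(\beta_j \sim \mathcal{N}(0, h_g^2/P)\)), reusing X from the earlier simulation; with only P = 100 SNPs the regression is noisy:

set.seed(2)
beta_inf = rnorm(P, 0, sqrt(h2 / P))          # infinitesimal model: all SNPs causal
Y_inf = X %*% beta_inf + rnorm(N, 0, sqrt(1 - h2))
beta_marg = crossprod(X, Y_inf) / N
chi2 = N * as.vector(beta_marg)^2
R_hat = crossprod(X) / N
ld_scores = colSums(R_hat^2)                  # ell_j = sum_k r_jk^2
coef(lm(chi2 ~ ld_scores))                    # slope near N * h2 / P, intercept near 1
N * h2 / P                                    # theoretical slope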

Extension to SNP annotations

\[ \mathbb{E}[\chi_{j}^{2}] = N \sum_{C} \tau_{C} \ell(j, C) + N a + 1 \]

Finucane et al., Nature Genetics, 2015

Contribution of annotations to SNP heritability across traits