GWAS V: Methods for biobank scale analyses

BIOS 770

Josh Weinstock, PhD

Assistant Professor, Department of Human Genetics, Emory University

Learning objectives

  1. Computational challenges of working with data at massive scale
  2. Statistical challenges at biobank scale
  3. Discussion of “state of the art” methods

GWAS in the era of biobanks

Biobank: A collection of biosamples that are available for phenotyping and genotyping (array + imputation or sequencing). Phenotypes are typically defined by processing billing codes available from electronic health record (EHR) data.

Genome-wide association studies at scale

Global Biobank Meta-analysis Initiative (GBMI)

Challenges to biobank scale analyses

  1. Compute scalability
  2. Cryptic relatedness, population stratification
  3. Case-control imbalance

Big \(O\) notation

  • \(g(n) = O(f(n)) \text{ means } C \times f(n) \text{ is an upper bound on } g(n).\)
  • \(g(n) = \Omega(f(n)) \text{ means } C \times f(n) \text{ is a lower bound on } g(n).\)
  • \(g(n) = \Theta(f(n))\) means \(C_1 \times f(n)\) is an upper bound on \(g(n)\) and \(C_2 \times f(n)\) is a lower bound on \(g(n)\).
  • \(C, C_1, \text{ and } C_2\) are all constants.

Formal definition

We write

\[ f(x) = O(g(x)) \quad \text{as } x \to \infty, \]

read “\(f(x)\) is big \(O\) of \(g(x)\)” or “\(f(x)\) is of the order of \(g(x)\)”, if \(|f(x)|\) is at most a positive constant multiple of \(|g(x)|\) for all sufficiently large values of \(x\). That is, \(f(x) = O(g(x))\) if there exist a constant \(M > 0\) and a real number \(x_0 \in \mathbb{R}\) such that

\[ |f(x)| \leq M |g(x)| \quad \forall x \geq x_0. \]

Example

\(F(n) = 3n^2 + 7n\)

Which of these is true?

  1. \(F(n) = O(n^7)\)
  2. \(F(n) = O(n^2)\)
  3. \(F(n) = O(n)\)
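
Answers (1) and (2) are both true, while (3) is false. A quick check:

\[ 3n^2 + 7n \;\leq\; 3n^2 + 7n^2 \;=\; 10n^2 \quad \text{for all } n \geq 1, \]

so \(F(n) = O(n^2)\) with \(M = 10\) and \(x_0 = 1\); since \(n^2 \leq n^7\) for \(n \geq 1\), \(F(n) = O(n^7)\) as well. No constant \(M\) satisfies \(3n^2 + 7n \leq Mn\) for all large \(n\), so \(F(n) \neq O(n)\).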

Reminder on byte units

  • 1 byte = 8 bits
  • 1 KB = \(2^{10}\) bytes, 1 MB = \(2^{20}\) bytes, 1 GB = \(2^{30}\) bytes, 1 TB = \(2^{40}\) bytes

Scalability

Naive implementations simply won’t scale!

N = 10000
P = 1000
# simulate P common genotypes
G = matrix(rbinom(N * P, size = 2, prob = .5), nrow = N, ncol = P)

Q: How much RAM does this take in MB?

RAM usage in R

  • One int takes 4 bytes in R
  • \(10{,}000 \times 1{,}000 = 10^7\) ints
  • \(10^7 \times 4 = 4 \times 10^7\) bytes
  • \(\frac{4 \times 10^7}{2^{20}} \approx 38.1\) MB

Does this scale?

  • Imagine instead 100,000 individuals and 8 million SNPs
  • \(38.1 \times 10 \times 8{,}000 \approx 3{,}048{,}000\) MB \(\approx 2{,}977\) GB \(\approx 2.9\) TB

This is just to store common variants in RAM.

Storing genotype data

A given genotype generally just takes a value equal to the number of alt alleles: \(0\), \(1\), or \(2\). Including a missing value, this is just \(4\) combinations, which fit in 2 bits.

Q: How much space does it take to store our genotype matrix at 2 bits per genotype?

A: \(10^5 \times 8 \times 10^6 \times 0.25 \text{ bytes} \approx 190{,}735 \text{ MB} \approx 186 \text{ GB}\)

Can we do better with compression? Do we need to store every reference genotype?

Data structures for genotype data

For common variants, data are typically represented with a data structure built on the 2-bit encoding above:

  • Plink (.bed or .pgen)
  • BCF (depending on phase or ploidy)
  • BGEN

Generally, the associated libraries provide abstractions on top of this encoding, including linear algebra routines over the packed data.

Storing genotypes efficiently in Julia

struct PackedGenotypeMatrix
    data::Vector{UInt8}  # Column-major packed genotypes (2 bits per entry)
    n_samples::Int
    n_variants::Int
end

function PackedGenotypeMatrix(dense_matrix::Matrix{Int32})
    n_samples, n_variants = size(dense_matrix)
    bytes_per_col = ceil(Int, n_samples / 4)  # 4 genotypes fit in each byte
    packed_data = zeros(UInt8, bytes_per_col * n_variants)
    
    for col in 1:n_variants
        for row in 1:n_samples
            val = dense_matrix[row, col]  # 0, 1, or 2 (3 reserved for missing)
            byte_idx = (col - 1) * bytes_per_col + (row - 1) ÷ 4 + 1
            shift = 2 * ((row - 1) % 4)   # offset of this 2-bit field within the byte
            packed_data[byte_idx] |= val << shift
        end
    end
    
    PackedGenotypeMatrix(packed_data, n_samples, n_variants)
end
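
To read a genotype back out, reverse the packing: locate the byte, shift, and mask. A minimal `getindex` sketch matching the constructor’s layout:

# Recover genotype (row, col) by extracting its 2-bit field (0-2, with 3 for missing).
function Base.getindex(g::PackedGenotypeMatrix, row::Int, col::Int)
    bytes_per_col = ceil(Int, g.n_samples / 4)
    byte_idx = (col - 1) * bytes_per_col + (row - 1) ÷ 4 + 1
    shift = 2 * ((row - 1) % 4)
    Int((g.data[byte_idx] >> shift) & 0x03)  # mask off the neighboring genotypes
end

After this definition, `Gbits[i, j]` reproduces `G[i, j]` for the matrix packed on the next slide.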

Efficient data structure saves 16x storage in RAM

using Distributions
N = 10_000;
P = 1000;
G = Int32.(rand(Binomial(2, 0.5), N, P));
Gbits = PackedGenotypeMatrix(G);

println("Naive: ", Base.summarysize(G) / (2 ^ 20)) # 38.1 Mb
Naive: 38.147010803222656
println("Efficient: ", Base.summarysize(Gbits) / (2 ^ 20)) # 2.4 Mb
Efficient: 2.384246826171875

Instead of 4 bytes per individual, store 4 individuals per single byte: 16x savings

Other ways to save on storage costs?

  1. Gzip / zlib / snappy compression (saves on disk space)
  2. Other data structures?

Sparse allele vectors

(Figure: SAV, the sparse allele vectors format)

Efficient mixed model implementations

  • Biobanks have a lot of individuals, with some relatedness in there
  • Generalized linear mixed models have emerged as the standard here
  • Although statistically powerful, some care is needed to avoid a large compute burden

BOLT-LMM

\[ \mathbf{\color{darkgreen}{y}} = \mathbf{X}\mathbf{\beta} + \mathbf{e}_{\text{proj}}, \]

  • \(\color{darkgreen}{y} \in S \subset \mathbb{R}^N\): projected phenotypes
  • \(X\): \(N \times M\) matrix of projected genotypes with columns \(x_m \in S \subset \mathbb{R}^N\), \(m = 1, \ldots, M\)
  • \(\beta \in \mathbb{R}^M\): iid random effects with \(E[\beta_m] = 0\), \(\text{Var}(\beta_m) = \sigma_\beta^2\)
  • \(e_{\text{proj}} \in S \subset \mathbb{R}^N\): random normal noise with \(e_{\text{proj}} \sim N(0, \sigma_e^2 \mathbf{P}_{\text{fixed}})\)

BOLT-LMM test statistic

\[ \chi^2_{\text{LMM-LOCO}} = \frac{\bigl(\mathbf{x}_{\text{test}}^\top \mathbf{\color{darkorange}{V}}_{\text{LOCO}}^{-1} \color{darkgreen}{y} \bigr)^2} {\mathbf{x}_{\text{test}}^\top \mathbf{\color{darkorange}{V}}_{\text{LOCO}}^{-1}\,\mathbf{x}_{\text{test}}} \]

Covariance of individuals

  • \(\color{darkorange}{V} = \sigma_g^2 \frac{XX^\top}{M} + \sigma_e^2 I_N\)
  • We will need to efficiently solve \(\color{darkorange}{V}^{-1}x\) and \(\color{darkorange}{V}^{-1}\color{darkgreen}{y}\) to estimate parameters

Strategies:

  1. Explicitly invert \(\color{darkorange}{V}\)
  2. Factorize \(\color{darkorange}{V}\) using Cholesky or SVD, and then solve system with factors
  3. Use an iterative method

Conjugate gradient iteration

We want to compute \(\color{darkorange}{V}^{-1}x\), i.e. find \(z\) such that \(z = \color{darkorange}{V}^{-1}x\)

This is equivalent to finding \(z\) such that \(\color{darkorange}{V}z = x\).

Conjugate Gradient Algorithm

  1. Initialize:
    • \(\mathbf{r}_0 := \mathbf{x} - \mathbf{\color{darkorange}{V}} \mathbf{z}_0\)
    • \(\mathbf{p}_0 := \mathbf{r}_0\)
    • \(k := 0\)

  2. Repeat until convergence:
    • \(\alpha_k := \frac{\mathbf{r}_k^\top \mathbf{r}_k}{\mathbf{p}_k^\top \mathbf{\color{darkorange}{V}} \mathbf{p}_k}\)
    • \(\mathbf{z}_{k+1} := \mathbf{z}_k + \alpha_k \mathbf{p}_k\)
    • \(\mathbf{r}_{k+1} := \mathbf{r}_k - \alpha_k \mathbf{\color{darkorange}{V}} \mathbf{p}_k\)
    • If \(\|\mathbf{r}_{k+1}\|\) is small, exit loop.
    • \(\beta_k := \frac{\mathbf{r}_{k+1}^\top \mathbf{r}_{k+1}}{\mathbf{r}_k^\top \mathbf{r}_k}\)
    • \(\mathbf{p}_{k+1} := \mathbf{r}_{k+1} + \beta_k \mathbf{p}_k\)
    • \(k := k + 1\)
  3. Return \(\mathbf{z}_{k+1}\) as the result.
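
A direct Julia translation of the loop above (a minimal sketch for a dense symmetric positive-definite \(\color{darkorange}{V}\); production solvers add preconditioning and convergence safeguards):

using LinearAlgebra

# Solve Vz = x for symmetric positive-definite V with conjugate gradient.
function conjugate_gradient(V, x; tol = 1e-8, maxiter = length(x))
    z = zero(x)        # z_0 = 0
    r = x - V * z      # r_0
    p = copy(r)        # p_0
    for k in 1:maxiter
        Vp = V * p
        α = dot(r, r) / dot(p, Vp)
        z = z + α * p
        r_new = r - α * Vp
        norm(r_new) < tol && break
        β = dot(r_new, r_new) / dot(r, r)
        p = r_new + β * p
        r = r_new
    end
    return z
end

Each iteration costs one matrix-vector product, and for well-conditioned \(\color{darkorange}{V}\) the loop converges in far fewer than \(N\) iterations.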

Simulate nearly block-diagonal kinship matrix

(Figure: heatmap of a simulated, nearly block-diagonal kinship matrix)

Simulate genotypes and solve system

using LinearAlgebra, LinearSolve

K = simulate_GRM(N);  # kinship simulator from the previous slide
σ2_g = 0.3;
σ2_e = 0.7;
V = K * σ2_g + I * σ2_e;

x = rand(MvNormal(zeros(N), V));

prob = LinearProblem(V, x);
sol = solve(prob, KrylovJL_CG()); # 0.07 seconds
good = sol.u;

naive = inv(V) * x; # 17.8 seconds
isapprox(good, naive) # true!

# cholesky takes 2.17 seconds

Using conjugate gradient iteration instead of explicit matrix inversion saves ~250x compute

Other efficiency considerations

  • We don’t actually have to form \(\color{darkorange}{V}\) explicitly; we can leave it in a factored form
  • Consider \(\color{darkorange}{V}x\) vs \((\sigma^2_g XX^\top + \sigma^2_e I)x\)

Why does this reduce compute?

Computational complexity

  1. Forming \(\color{darkorange}{V}=XX^\top\) requires \(O(MN^2)\) operations
  2. \(\color{darkorange}{V} x\) then requires \(O(MN^2 + N^2) = O(MN^2)\) operations, dominated by forming \(\color{darkorange}{V}\)
  3. \(X (X^\top x)\) requires \(O(MN + MN) = O(MN)\) operations, because it is two matrix-vector products

Example

N = 10_000;
M = 1_000;
X = rand(Normal(0, 1), N, M); # N x M matrix
x = rand(Normal(0, 1), N); # N x 1 vector

z1 = (X * X') * x; # takes 2.76 seconds
z2 = X * (X'x);    # takes 0.003 seconds
isapprox(z1, z2) # true

Doing matrix-vector multiplication before matrix-matrix multiplication is ~920x faster when N is large

BOLT-LMM compared to other methods

Table 1, Loh et al., Nature Genetics, 2015

FaST-LMM implementation

Model setup

The LMM log likelihood for phenotype data \(\mathbf{\color{darkgreen}{y}}\) (dimension \(n \times 1\)) given fixed effects \(\mathbf{X}\):

\[ \log \mathcal{L} = -\frac{1}{2} \left[ n \log(2\pi) + \log\bigl|\sigma_e^2 \mathbf{I} + \sigma_g^2 \mathbf{\color{darkblue}{K}}\bigr| + \\ \Bigl(\mathbf{\color{darkgreen}{y}} - \mathbf{X}\mathbf{\beta}\Bigr)^\top \Bigl(\sigma_e^2 \mathbf{I} + \sigma_g^2 \mathbf{\color{darkblue}{K}}\Bigr)^{-1} \Bigl(\mathbf{\color{darkgreen}{y}} - \mathbf{X}\mathbf{\beta}\Bigr) \right] \]

  • \(\mathbf{\color{darkblue}{K}}\): Genetic similarity matrix (\(n \times n\))
  • \(\sigma_e^2\): Residual variance
  • \(\sigma_g^2\): Genetic variance
  • \(\boldsymbol{\beta}\): Fixed-effect weights (\(d \times 1\))

Log likelihood after factorizing \(\mathbf{\color{darkblue}{K}}\)

Let \(\delta = \sigma_e^2 / \sigma_g^2\) and decompose \(\mathbf{\color{darkblue}{K}} = \mathbf{U} \mathbf{S} \mathbf{U}^\top\)
\[ \log \mathcal{L} = -\frac{1}{2} \left[ n \log(2\pi\sigma_g^2) + \log\bigl|\mathbf{S} + \delta \mathbf{I}\bigr| + \\ \frac{1}{\sigma_g^2} \Bigl(\mathbf{U}^\top\mathbf{\color{darkgreen}{y}} - \mathbf{U}^\top\mathbf{X} \mathbf{\beta}\Bigr)^\top \Bigl(\mathbf{S} + \delta \mathbf{I}\Bigr)^{-1} \Bigl(\mathbf{U}^\top\mathbf{\color{darkgreen}{y}} - \mathbf{U}^\top\mathbf{X}\mathbf{\beta}\Bigr) \right] \]

Simplifications:

  1. \(|\sigma_g^2 (\mathbf{S} + \delta \mathbf{I})| = \sigma_g^{2n} |\mathbf{S} + \delta \mathbf{I}|\)
  2. \((\mathbf{\color{darkblue}{K}} + \delta \mathbf{I})^{-1} = \mathbf{U} (\mathbf{S} + \delta \mathbf{I})^{-1} \mathbf{U}^\top\)
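
Both identities follow from the eigendecomposition: since \(\mathbf{U}\mathbf{U}^\top = \mathbf{I}\),

\[ \sigma_e^2 \mathbf{I} + \sigma_g^2 \mathbf{\color{darkblue}{K}} = \sigma_g^2 \bigl(\mathbf{\color{darkblue}{K}} + \delta \mathbf{I}\bigr) = \sigma_g^2\, \mathbf{U} \bigl(\mathbf{S} + \delta \mathbf{I}\bigr) \mathbf{U}^\top, \]

so inverting the covariance only requires inverting the diagonal matrix \(\mathbf{S} + \delta \mathbf{I}\).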

Simplified log likelihood

Rotate data using \(\mathbf{U}^\top\) and let \(\tilde{\mathbf{\color{darkgreen}{y}}} = \mathbf{U}^\top\mathbf{\color{darkgreen}{y}}\), \(\tilde{\mathbf{X}} = \mathbf{U}^\top\mathbf{X}\):

\[ \log \mathcal{L} = -\frac{1}{2} \left[ n \log(2\pi\sigma_g^2) + \sum_{i=1}^n \log(s_i + \delta) + \\ \frac{1}{\sigma_g^2} \sum_{i=1}^n \frac{\bigl(\tilde{\color{darkgreen}{y}}_i - \tilde{\mathbf{X}}_{i:} \mathbf{\beta}\bigr)^2}{s_i + \delta} \right] \]

where \(s_i\) are eigenvalues of \(\mathbf{\color{darkblue}{K}}\).

This reduces to a sum of univariate normal terms:

\[ \prod_{i=1}^n \mathcal{N}\left(\tilde{\color{darkgreen}{y}}_i \,\Big|\, \tilde{\mathbf{X}}_{i:} \boldsymbol{\beta}, \; \sigma_g^2 (s_i + \delta)\right) \]
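
A quick numerical check of this factorization (a sketch: a random PSD matrix stands in for \(\mathbf{\color{darkblue}{K}}\), with no fixed effects):

using Distributions, LinearAlgebra

n = 500
A = randn(n, 2n)
K = A * A' / (2n)                       # toy PSD stand-in for a kinship matrix
σ2_g, δ = 0.5, 1.2
E = eigen(Symmetric(K))
U, s = E.vectors, E.values

Σ = Symmetric(σ2_g * (K + δ * I))
y = rand(MvNormal(zeros(n), Σ))
yt = U' * y                             # rotated phenotype

ll_joint   = logpdf(MvNormal(zeros(n), Σ), y)
ll_rotated = sum(logpdf(Normal(0, sqrt(σ2_g * (s[i] + δ))), yt[i]) for i in 1:n)
isapprox(ll_joint, ll_rotated)          # true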

Optimization Process

  1. Solve for \(\mathbf{\beta}(\delta)\)
    Differentiate LL w.r.t. \(\mathbf{\beta}\) and set to zero: \[ \mathbf{\beta}(\delta) = \left(\tilde{\mathbf{X}}^\top (\mathbf{S} + \delta \mathbf{I})^{-1} \tilde{\mathbf{X}}\right)^{-1} \tilde{\mathbf{X}}^\top (\mathbf{S} + \delta \mathbf{I})^{-1} \tilde{\mathbf{\color{darkgreen}{y}}} \]

  1. Solve for \(\sigma_g^2(\delta)\)
    Substitute \(\mathbf{\beta}(\delta)\) into LL and differentiate w.r.t. \(\sigma_g^2\): \[ \sigma_g^2(\delta) = \frac{1}{n} \sum_{i=1}^n \frac{\Bigl(\tilde{\color{darkgreen}{y}}_i - \tilde{\mathbf{X}}_{i} \mathbf{\beta}(\delta)\Bigr)^2}{s_i + \delta} \]

  1. Optimize \(\delta\)
    Plug \(\mathbf{\beta}(\delta)\) and \(\sigma_g^2(\delta)\) into (2) and use Brent’s method for 1D optimization over \(\delta\).

FaST-LMM TLDR
The “Fa” in FaST-LMM refers to the factorization \(\mathbf{\color{darkblue}{K}} = \mathbf{U}\mathbf{S}\mathbf{U}^\top\), which diagonalizes the covariance matrix and enables computationally efficient \(O(n)\) updates.

However, an expensive eigendecomposition, \(O(n^3)\), must occur up front.
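
To make step 3 concrete, here is a sketch of the profile likelihood and its 1-D optimization. It assumes the rotated quantities `yt = U'y`, `Xt = U'X`, and eigenvalues `s` are precomputed; Optim.jl’s `Brent()` stands in for whatever 1-D optimizer an implementation actually uses (an illustration, not FaST-LMM’s code):

using LinearAlgebra, Optim

# Negative profile log likelihood in δ, given rotated data and eigenvalues s of K.
function neg_profile_ll(logδ, yt, Xt, s)
    δ = exp(logδ)                                 # optimize on the log scale
    n = length(yt)
    d = s .+ δ                                    # diagonal of S + δI
    β = (Xt' * (Xt ./ d)) \ (Xt' * (yt ./ d))     # GLS solution β(δ)
    r = yt - Xt * β
    σ2_g = sum(abs2.(r) ./ d) / n                 # closed-form σ_g²(δ)
    0.5 * (n * log(2π * σ2_g) + sum(log.(d)) + n)
end

# 1-D optimization with Brent's method, e.g.:
# res = optimize(lδ -> neg_profile_ll(lδ, yt, Xt, s), -10.0, 10.0, Brent())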

fastGWA with GCTA

Jiang et al., Nature Genetics, 2019


Reduce computational complexity with sparse matrices

using SparseArrays
K = simulate_GRM(N, 10, 0.02) # third argument is baseline-relatedness
Ktrunc = copy(K)
Ktrunc[Ktrunc .< 0.05] .= 0.0 # truncate values
spK = SparseMatrixCSC(Ktrunc)
@time z1 = K * x; # 0.015 seconds
@time z2 = spK * x; # 0.0002 seconds

cor(z1, z2) # 0.999

Using a dense GRM matrix is 75x slower than a sparse matrix multiplication!

What is a sparse matrix?

struct SparseMatrix
    m::Int32                   # Number of rows
    n::Int32                   # Number of columns
    colptr::Vector{Int32}      # Column j is in colptr[j]:(colptr[j+1]-1)
    rowval::Vector{Int32}      # Row indices of stored values
    nzval::Vector{Float64}     # Stored values, typically nonzeros
end;
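
Why this pays off: a matrix-vector product only needs to touch the stored entries. A minimal sketch against the struct above:

# y = A * x touching only stored entries: O(nnz) work instead of O(m * n).
function spmatvec(A::SparseMatrix, x::Vector{Float64})
    y = zeros(Float64, A.m)
    for j in 1:A.n                                # loop over columns
        for k in A.colptr[j]:(A.colptr[j+1] - 1)  # stored entries in column j
            y[A.rowval[k]] += A.nzval[k] * x[j]
        end
    end
    return y
end

If the truncated GRM keeps only a few relatives per person, the product drops from \(O(N^2)\) to roughly \(O(N)\) work.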

What are the tradeoffs of using a sparse GRM in place of a dense GRM?

Review

Compute burden in LMMs can be reduced by:

  1. Converting the problem to one of linear regression (FaST-LMM)
  2. Using iterative methods to avoid quadratic complexity in sample size (BOLT-LMM)
  3. Using more efficient data structures to store the GRM (fastGWA in GCTA)

Q: Any other ideas of ways to reduce compute burden?

Case-control phenotypes

So far - we’ve discussed efficient methods for running LMMs at biobank-scale. How well will these work for case-control phenotypes?

Case-control phenotypes

Figure 1, Zhou et al., Nature Genetics, 2018

Using an LMM on case-control phenotypes leads to substantial inflation with even modest case-control imbalance!

Saddle point approximation for case-control phenotypes

1. Logistic Mixed Model

The null model is specified as: \[ \text{logit}\Bigl(\mathbb{E}[\color{darkgreen}{Y_i}]\Bigr) = \mathbf{X}_i^\top \alpha + \mathbf{g}_i^\top \mathbf{u} \]

  • \(\color{darkgreen}{Y_i}\): Binary phenotype (case/control status) for individual \(i\),
  • \(\mathbf{X}_i\): Covariates (fixed effects)
  • \(\boldsymbol{\alpha}\): Fixed-effect coefficients
  • \(\mathbf{u} \sim \mathcal{N}(\mathbf{0}, \sigma_g^2 \mathbf{\color{darkblue}{K}})\): Random effects capturing genetic relatedness.

2. Variance Component Estimation

The variance ratio \(\hat{r}\) is estimated as: \[ \hat{r} = \frac{\sigma_g^2}{\sigma_e^2} \] where \(\sigma_g^2\) is the genetic variance and \(\sigma_e^2\) is the residual variance (fixed at 1 for logistic models).

3. Predicted Means and Weights

  • Predicted means under the null model: \[ \hat{\mu}_i = \text{logit}^{-1}\left( \mathbf{X}_i^T \hat{\boldsymbol{\alpha}} + \mathbf{g}_i^T \hat{\mathbf{u}} \right) \]
  • Weight matrix \(\hat{\mathbf{W}}\) (diagonal entries): \[ \hat{W}_{ii} = \hat{\mu}_i (1 - \hat{\mu}_i) \]

4. Variance-Adjusted Test Statistic

The adjusted test statistic for Step 2 is: \[ T_{\text{adj}} = \frac{\tilde{\mathbf{G}}^T (\mathbf{\color{darkgreen}{Y}} - \hat{\boldsymbol{\mu}})}{\sqrt{\hat{r} \cdot \tilde{\mathbf{G}}^T \hat{\mathbf{W}} \tilde{\mathbf{G}}}} \]

Computing pvalues

  • Under the null, the distribution of the test statistic is assumed to converge in distribution to a Gaussian
  • However, when the case-control imbalance is particularly large, the usual asymptotics haven’t kicked in at these sample sizes, making this a poor approximation

Simulation

using Distributions, LinearAlgebra, Random
using StatsFuns: logistic

function gwas_sim(N = 50_000, P = 10_000)
    X = rand(Normal(0, 1), N, 10)
    α = rand(Normal(0, 1), 10)
    μ0 = -7.0                       # low intercept induces case-control imbalance
    η = μ0 .+ X * α
    μ = logistic.(η)
    Y = rand.(Bernoulli.(μ))        # ~3% cases
    W = Diagonal(μ .* (1 .- μ))     # W_ii = μ_i (1 - μ_i), as in step 3 above

    stats = zeros(P)
    pvals = zeros(P)

    @inbounds Threads.@threads :static for i in 1:P
        p = rand(Beta(3, 12))
        g = rand(Binomial(2, p), N)
        stats[i] = g' * (Y - μ) / sqrt(g' * W * g)
        pvals[i] = 2.0 * (1.0 - cdf(Normal(0, 1), abs(stats[i])))
    end

    return stats, pvals
end;

Random.seed!(3);
stats, pvals = gwas_sim();

Pvalues are not distributed uniformly under the null

Substantial deviation in qq plot

How can we generate properly calibrated test statistics?

To apply fastSPA to \(T_{adj}\), first derive its cumulant generating function (CGF). Given the fitted random effects \(\hat{\mathbf{u}}\), \(T_{adj}\) is modeled as a weighted sum of independent Bernoulli random variables. The approximated CGF is:

\[ K\left(t;\ \widehat{\mu}, c\right) = \sum_{i=1}^{N} \log\left(1 - \widehat{\mu}_{i} + \widehat{\mu}_{i} e^{c t \widehat{G}_{i}}\right) - c t \sum_{i=1}^{N} \widehat{G}_{i} \widehat{\mu}_{i} \]

where \(c = \text{Var}^{*}(\mathbf{T})^{-1/2}\).

FastSPA, Dey et al., AJHG, 2017

Derivatives and Probability Calculation

Let \(K^{\prime}(t)\) and \(K^{\prime\prime}(t)\) denote the first and second derivatives of \(K\) with respect to \(t\). To compute the probability \(\Pr(T_{adj} < q)\) for an observed test statistic \(q\), use:

\[ \Pr\left(T_{adj} < q\right) \approx F(q) = \Phi\left\{w + \frac{1}{w} \log\left(\frac{v}{w}\right)\right\} \]

where: \[ w = \text{sign}\left(\widehat{\zeta}\right) \left[2\left\{\widehat{\zeta} q - K\left(\widehat{\zeta}\right)\right\}\right]^{1/2}, \quad v = \widehat{\zeta} \left\{K^{\prime\prime}\left(\widehat{\zeta}\right)\right\}^{1/2} \]

and \(\widehat{\zeta} = \widehat{\zeta}(q)\) solves the equation \(K^{\prime}(\widehat{\zeta}) = q\).

When the observed \(q\) lies close to the mean of \(T_{adj}\), the saddle point machinery is unnecessary, and fastSPA falls back to the usual normal approximation:

\[ \Pr(T_{adj} < q) \approx \Phi\left(\frac{q - \mu}{\sigma}\right) \]
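
Putting the pieces together: a minimal sketch for the unstandardized score \(T = \sum_i G_i (Y_i - \hat{\mu}_i)\), using Roots.jl to solve \(K'(\zeta) = q\). The helper names and the fallback threshold are illustrative, not SAIGE’s implementation:

using Distributions, Roots

# CGF of T = Σ_i G_i (Y_i - μ_i) with Y_i ~ Bernoulli(μ_i), and its derivatives.
cgf(t, G, μ)  = sum(@. log(1 - μ + μ * exp(t * G))) - t * sum(G .* μ)
cgf1(t, G, μ) = sum(@. G * μ * exp(t * G) / (1 - μ + μ * exp(t * G))) - sum(G .* μ)
cgf2(t, G, μ) = sum(@. G^2 * μ * (1 - μ) * exp(t * G) / (1 - μ + μ * exp(t * G))^2)

# Upper-tail p-value Pr(T > q) via the saddle point formula above.
function spa_pvalue(q, G, μ)
    ζ = find_zero(t -> cgf1(t, G, μ) - q, 0.0)      # solve K'(ζ) = q
    σ = sqrt(cgf2(0.0, G, μ))                       # null SD of T
    abs(ζ) < 1e-6 && return ccdf(Normal(0, σ), q)   # near the mean: normal approx
    w = sign(ζ) * sqrt(2 * (ζ * q - cgf(ζ, G, μ)))
    v = ζ * sqrt(cgf2(ζ, G, μ))
    ccdf(Normal(), w + log(v / w) / w)
end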

Review

  1. For case-control phenotypes, it’s faster to compute a score test than a log-likelihood ratio test
  2. For phenotypes with case-control imbalance, the usual asymptotic approximations of the null distribution don’t work well enough!
  3. SAIGE uses the saddle point approximation to compute better calibrated pvalues
  4. SAIGE combines this innovation with some of the ideas from BOLT-LMM

Further optimizations

Mbatchou et al., Nature Genetics, 2021

REGENIE benchmarks

Mbatchou et al., Table 1

Optimization for multiple phenotypes

  • No need to process each phenotype completely independently
  • For example, rather than projecting out covariates for each phenotype one at a time, this can be done once for a matrix storing all phenotypes (see the sketch below):

\(\mathbb{Y} = \left[ Y_1, Y_2, \cdots, Y_P \right]\)
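
A sketch of this trick in Julia (illustrative shapes; one QR factorization of the covariate matrix serves every phenotype column):

using LinearAlgebra

N, P, d = 10_000, 50, 10
X = randn(N, d)              # covariates, shared by all phenotypes
Y = randn(N, P)              # columns are phenotypes Y_1, ..., Y_P

F = qr(X)                    # factorize the covariates once
Yres = Y - X * (F \ Y)       # residualize all P phenotypes with one solve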

Summary

  1. Efficient data structures are essential
    1. Store genotypes with 2 bits each
    2. Use sparse matrices for the GRM
  2. Optimized algorithms are essential
    1. Solve linear systems with iterative methods rather than an explicit inverse, eigendecomposition, or Cholesky factorization
  3. Sometimes we need more than standard asymptotic results to compute pvalues