GWAS V: Methods for biobank scale analyses

BIOS 770

Josh Weinstock, PhD

Assistant Professor, Department of Human Genetics, Emory University

Learning objectives

  1. Computational challenges of working with data at massive scale
  2. Statistical challenges at biobank scale
  3. Discussion of “state of the art” methods

GWAS in the era of biobanks

Biobank: A collection of biosamples that are available for phenotyping and genotyping (array + imputation or sequencing). Phenotypes are typically defined by processing billing codes available from electronic health record (EHR) data.

Genome-wide association studies at scale

Global Biobank Meta-analysis Initiative (GBMI)

Challenges to biobank scale analyses

  1. Compute scalability
  2. Cryptic relatedness, population stratification
  3. Case-control imbalance

Big \(O\) notation

  • \(g(n) = O(f(n)) \text{ means } C \times f(n) \text{ is an upper bound on } g(n).\)
  • \(g(n) = \Omega(f(n)) \text{ means } C \times f(n) \text{ is a lower bound on } g(n).\)
  • \(g(n) = \Theta(f(n))\) means \(C_1 \times f(n)\) is an upper bound on \(g(n)\) and \(C_2 \times f(n)\) is a lower bound on \(g(n)\).
  • \(C, C_1, \text{ and } C_2\) are all constants.

Formal definition

We write

\[ f(x) = O(g(x)) \quad \text{as } x \to \infty, \]

read “\(f(x)\) is big \(O\) of \(g(x)\)” or “\(f(x)\) is of the order of \(g(x)\)”, if \(|f(x)|\) is at most a positive constant multiple of \(|g(x)|\) for all sufficiently large values of \(x\). That is, \(f(x) = O(g(x))\) if there exist a constant \(M > 0\) and a real number \(x_0 \in \mathbb{R}\) such that

\[ |f(x)| \leq M |g(x)| \quad \forall x \geq x_0. \]

Example

\(F(n) = 3n^2 + 7n\)

Which of these is true?

  1. \(F(n) = O(n^7)\)
  2. \(F(n) = O(n^2)\)
  3. \(F(n) = O(n)\)
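
Answers (1) and (2) are both true, while (3) is false. A quick check:

\[ 3n^2 + 7n \;\leq\; 3n^2 + 7n^2 \;=\; 10n^2 \quad \text{for all } n \geq 1, \]

so \(F(n) = O(n^2)\) with \(M = 10\) and \(x_0 = 1\); since \(n^2 \leq n^7\) for \(n \geq 1\), \(F(n) = O(n^7)\) as well. No constant \(M\) satisfies \(3n^2 + 7n \leq Mn\) for all large \(n\), so \(F(n) \neq O(n)\).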

Reminder on byte units

  • 1 byte = 8 bits
  • 1 KB = \(2^{10}\) bytes, 1 MB = \(2^{20}\) bytes, 1 GB = \(2^{30}\) bytes, 1 TB = \(2^{40}\) bytes

Scalability

Naive implementations simply won’t scale!

N = 10000
P = 1000
# simulate P common genotypes
G = matrix(rbinom(N * P, size = 2, prob = .5), nrow = N, ncol = P)

Q: How much RAM does this take in MB?

RAM usage in R

  • One int takes 4 bytes in R
  • \(10{,}000 \times 1{,}000 = 10^7\) ints
  • \(10^7 \times 4 = 4 \times 10^7\) bytes
  • \(\frac{4 \times 10^7}{2^{20}} \approx 38.1\) MB

Does this scale?

  • Imagine instead 100,000 individuals and 8 million SNPs
  • \(38.1 \times 10 \times 8{,}000 \approx 3{,}048{,}000\) MB \(\approx 2{,}977\) GB \(\approx 2.9\) TB

This is just to store common variants in RAM.

Storing genotype data

A given genotype generally just takes a value equal to the number of alt alleles: \(0\), \(1\), or \(2\). Including a missing value, this is just \(4\) combinations, which fit in 2 bits.

Q: How much space does it take to store our genotype matrix at 2 bits per genotype?

A: \(10^5 \times 8 \times 10^6 \times 0.25 \text{ bytes} \approx 190{,}735 \text{ MB} \approx 186 \text{ GB}\)

Can we do better with compression? Do we need to store every reference genotype?

Data structures for genotype data

For common variants, data are typically represented with a data structure built on the 2-bit encoding above:

  • Plink (.bed or .pgen)
  • BCF (depending on phase or ploidy)
  • BGEN

Generally, the associated libraries provide abstractions on top of this encoding, including linear algebra routines over the packed data.

Storing genotypes efficiently in Julia

struct PackedGenotypeMatrix
    data::Vector{UInt8}  # Column-major packed genotypes (2 bits per entry)
    n_samples::Int
    n_variants::Int
end

function PackedGenotypeMatrix(dense_matrix::Matrix{Int32})
    n_samples, n_variants = size(dense_matrix)
    bytes_per_col = ceil(Int, n_samples / 4)  # 4 genotypes fit in each byte
    packed_data = zeros(UInt8, bytes_per_col * n_variants)
    
    for col in 1:n_variants
        for row in 1:n_samples
            val = dense_matrix[row, col]  # 0, 1, or 2 (3 reserved for missing)
            byte_idx = (col - 1) * bytes_per_col + (row - 1) ÷ 4 + 1
            shift = 2 * ((row - 1) % 4)   # offset of this 2-bit field within the byte
            packed_data[byte_idx] |= val << shift
        end
    end
    
    PackedGenotypeMatrix(packed_data, n_samples, n_variants)
end
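
To read a genotype back out, reverse the packing: locate the byte, shift, and mask. A minimal `getindex` sketch matching the constructor’s layout:

# Recover genotype (row, col) by extracting its 2-bit field (0-2, with 3 for missing).
function Base.getindex(g::PackedGenotypeMatrix, row::Int, col::Int)
    bytes_per_col = ceil(Int, g.n_samples / 4)
    byte_idx = (col - 1) * bytes_per_col + (row - 1) ÷ 4 + 1
    shift = 2 * ((row - 1) % 4)
    Int((g.data[byte_idx] >> shift) & 0x03)  # mask off the neighboring genotypes
end

After this definition, `Gbits[i, j]` reproduces `G[i, j]` for the matrix packed on the next slide.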

Efficient data structure saves 16x storage in RAM

using Distributions
N = 10_000;
P = 1000;
G = Int32.(rand(Binomial(2, 0.5), N, P));
Gbits = PackedGenotypeMatrix(G);

println("Naive: ", Base.summarysize(G) / (2 ^ 20)) # 38.1 Mb
Naive: 38.147010803222656
println("Efficient: ", Base.summarysize(Gbits) / (2 ^ 20)) # 2.4 Mb
Efficient: 2.384246826171875

Instead of 4 bytes per individual, store 4 individuals per single byte: 16x savings

Other ways to save on storage costs?

  1. Gzip / zlib / snappy compression (saves on disk space)
  2. Other data structures?

Sparse allele vectors

(Figure: SAV, the sparse allele vectors format)

Efficient mixed model implementations

  • Biobanks have a lot of individuals, with some relatedness in there
  • Generalized linear mixed models have emerged as the standard here
  • Although statistically powerful, some care is needed to avoid a large compute burden

BOLT-LMM

\[ \mathbf{\color{darkgreen}{y}} = \mathbf{X}\mathbf{\beta} + \mathbf{e}_{\text{proj}}, \]

  • \(\color{darkgreen}{y} \in S \subset \mathbb{R}^N\): projected phenotypes
  • \(X\): \(N \times M\) matrix of projected genotypes with columns \(x_m \in S \subset \mathbb{R}^N\), \(m = 1, \ldots, M\)
  • \(\beta \in \mathbb{R}^M\): iid random effects with \(E[\beta_m] = 0\), \(\text{Var}(\beta_m) = \sigma_\beta^2\)
  • \(e_{\text{proj}} \in S \subset \mathbb{R}^N\): random normal noise with \(e_{\text{proj}} \sim N(0, \sigma_e^2 \mathbf{P}_{\text{fixed}})\)

BOLT-LMM test statistic

\[ \chi^2_{\text{LMM-LOCO}} = \frac{\bigl(\mathbf{x}_{\text{test}}^\top \mathbf{\color{darkorange}{V}}_{\text{LOCO}}^{-1} \color{darkgreen}{y} \bigr)^2} {\mathbf{x}_{\text{test}}^\top \mathbf{\color{darkorange}{V}}_{\text{LOCO}}^{-1}\,\mathbf{x}_{\text{test}}} \]

Covariance of individuals

  • \(\color{darkorange}{V} = \sigma_g^2 \frac{XX^\top}{M} + \sigma_e^2 I_N\)
  • We will need to efficiently solve \(\color{darkorange}{V}^{-1}x\) and \(\color{darkorange}{V}^{-1}\color{darkgreen}{y}\) to estimate parameters

Strategies:

  1. Explicitly invert \(\color{darkorange}{V}\)
  2. Factorize \(\color{darkorange}{V}\) using Cholesky or SVD, and then solve system with factors
  3. Use an iterative method

Conjugate gradient iteration

We want to compute \(\color{darkorange}{V}^{-1}x\), i.e. find \(z\) such that \(z = \color{darkorange}{V}^{-1}x\)

This is equivalent to finding \(z\) such that \(\color{darkorange}{V}z = x\).

Conjugate Gradient Algorithm

  1. Initialize:
    • \(\mathbf{r}_0 := \mathbf{x} - \mathbf{\color{darkorange}{V}} \mathbf{z}_0\)
    • \(\mathbf{p}_0 := \mathbf{r}_0\)
    • \(k := 0\)

  2. Repeat until convergence:
    • \(\alpha_k := \frac{\mathbf{r}_k^\top \mathbf{r}_k}{\mathbf{p}_k^\top \mathbf{\color{darkorange}{V}} \mathbf{p}_k}\)
    • \(\mathbf{z}_{k+1} := \mathbf{z}_k + \alpha_k \mathbf{p}_k\)
    • \(\mathbf{r}_{k+1} := \mathbf{r}_k - \alpha_k \mathbf{\color{darkorange}{V}} \mathbf{p}_k\)
    • If \(\|\mathbf{r}_{k+1}\|\) is small, exit loop.
    • \(\beta_k := \frac{\mathbf{r}_{k+1}^\top \mathbf{r}_{k+1}}{\mathbf{r}_k^\top \mathbf{r}_k}\)
    • \(\mathbf{p}_{k+1} := \mathbf{r}_{k+1} + \beta_k \mathbf{p}_k\)
    • \(k := k + 1\)
  3. Return \(\mathbf{z}_{k+1}\) as the result.
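
A direct Julia translation of the loop above (a minimal sketch for a dense symmetric positive-definite \(\color{darkorange}{V}\); production solvers add preconditioning and convergence safeguards):

using LinearAlgebra

# Solve Vz = x for symmetric positive-definite V with conjugate gradient.
function conjugate_gradient(V, x; tol = 1e-8, maxiter = length(x))
    z = zero(x)        # z_0 = 0
    r = x - V * z      # r_0
    p = copy(r)        # p_0
    for k in 1:maxiter
        Vp = V * p
        α = dot(r, r) / dot(p, Vp)
        z = z + α * p
        r_new = r - α * Vp
        norm(r_new) < tol && break
        β = dot(r_new, r_new) / dot(r, r)
        p = r_new + β * p
        r = r_new
    end
    return z
end

Each iteration costs one matrix-vector product, and for well-conditioned \(\color{darkorange}{V}\) the loop converges in far fewer than \(N\) iterations.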

Simulate nearly block-diagonal kinship matrix

(Figure: heatmap of a simulated, nearly block-diagonal kinship matrix)

Simulate genotypes and solve system

using LinearAlgebra, LinearSolve

K = simulate_GRM(N);  # kinship simulator from the previous slide
σ2_g = 0.3;
σ2_e = 0.7;
V = K * σ2_g + I * σ2_e;

x = rand(MvNormal(zeros(N), V));

prob = LinearProblem(V, x);
sol = solve(prob, KrylovJL_CG()); # 0.07 seconds
good = sol.u;

naive = inv(V) * x; # 17.8 seconds
isapprox(good, naive) # true!

# cholesky takes 2.17 seconds

Using conjugate gradient iteration instead of explicit matrix inversion saves ~250x compute

Other efficiency considerations

  • We don’t actually have to form \(\color{darkorange}{V}\) explicitly; we can leave it in a factored form
  • Consider \(\color{darkorange}{V}x\) vs \((\sigma^2_g XX^\top + \sigma^2_e I)x\)

Why does this reduce compute?

Computational complexity

  1. Forming \(\color{darkorange}{V}=XX^\top\) requires \(O(MN^2)\) operations
  2. \(\color{darkorange}{V} x\) then requires \(O(MN^2 + N^2) = O(MN^2)\) operations, dominated by forming \(\color{darkorange}{V}\)
  3. \(X (X^\top x)\) requires \(O(MN + MN) = O(MN)\) operations, because it is two matrix-vector products

Example

N = 10_000;
M = 1_000;
X = rand(Normal(0, 1), N, M); # N x M matrix
x = rand(Normal(0, 1), N); # N x 1 vector

z1 = (X * X') * x; # takes 2.76 seconds
z2 = X * (X'x);    # takes 0.003 seconds
isapprox(z1, z2) # true

Doing matrix-vector multiplication before matrix-matrix multiplication is ~920x faster when N is large

BOLT-LMM compared to other methods

Table 1, Loh et al., Nature Genetics, 2015

FaST-LMM implementation

Model setup

The LMM log likelihood for phenotype data \(\mathbf{\color{darkgreen}{y}}\) (dimension \(n \times 1\)) given fixed effects \(\mathbf{X}\):

\[ \log \mathcal{L} = -\frac{1}{2} \left[ n \log(2\pi) + \log\bigl|\sigma_e^2 \mathbf{I} + \sigma_g^2 \mathbf{\color{darkblue}{K}}\bigr| + \\ \Bigl(\mathbf{\color{darkgreen}{y}} - \mathbf{X}\mathbf{\beta}\Bigr)^\top \Bigl(\sigma_e^2 \mathbf{I} + \sigma_g^2 \mathbf{\color{darkblue}{K}}\Bigr)^{-1} \Bigl(\mathbf{\color{darkgreen}{y}} - \mathbf{X}\mathbf{\beta}\Bigr) \right] \]

  • \(\mathbf{\color{darkblue}{K}}\): Genetic similarity matrix (\(n \times n\))
  • \(\sigma_e^2\): Residual variance
  • \(\sigma_g^2\): Genetic variance
  • \(\boldsymbol{\beta}\): Fixed-effect weights (\(d \times 1\))

Log likelihood after factorizing \(\mathbf{\color{darkblue}{K}}\)

Let \(\delta = \sigma_e^2 / \sigma_g^2\) and decompose \(\mathbf{\color{darkblue}{K}} = \mathbf{U} \mathbf{S} \mathbf{U}^\top\)
\[ \log \mathcal{L} = -\frac{1}{2} \left[ n \log(2\pi\sigma_g^2) + \log\bigl|\mathbf{S} + \delta \mathbf{I}\bigr| + \\ \frac{1}{\sigma_g^2} \Bigl(\mathbf{U}^\top\mathbf{\color{darkgreen}{y}} - \mathbf{U}^\top\mathbf{X} \mathbf{\beta}\Bigr)^\top \Bigl(\mathbf{S} + \delta \mathbf{I}\Bigr)^{-1} \Bigl(\mathbf{U}^\top\mathbf{\color{darkgreen}{y}} - \mathbf{U}^\top\mathbf{X}\mathbf{\beta}\Bigr) \right] \]

Simplifications:

  1. \(|\sigma_g^2 (\mathbf{S} + \delta \mathbf{I})| = \sigma_g^{2n} |\mathbf{S} + \delta \mathbf{I}|\)
  2. \((\mathbf{\color{darkblue}{K}} + \delta \mathbf{I})^{-1} = \mathbf{U} (\mathbf{S} + \delta \mathbf{I})^{-1} \mathbf{U}^\top\)
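
Both identities follow from the eigendecomposition: since \(\mathbf{U}\mathbf{U}^\top = \mathbf{I}\),

\[ \sigma_e^2 \mathbf{I} + \sigma_g^2 \mathbf{\color{darkblue}{K}} = \sigma_g^2 \bigl(\mathbf{\color{darkblue}{K}} + \delta \mathbf{I}\bigr) = \sigma_g^2\, \mathbf{U} \bigl(\mathbf{S} + \delta \mathbf{I}\bigr) \mathbf{U}^\top, \]

so inverting the covariance only requires inverting the diagonal matrix \(\mathbf{S} + \delta \mathbf{I}\).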

Simplified log likelihood

Rotate data using \(\mathbf{U}^\top\) and let \(\tilde{\mathbf{\color{darkgreen}{y}}} = \mathbf{U}^\top\mathbf{\color{darkgreen}{y}}\), \(\tilde{\mathbf{X}} = \mathbf{U}^\top\mathbf{X}\):

\[ \log \mathcal{L} = -\frac{1}{2} \left[ n \log(2\pi\sigma_g^2) + \sum_{i=1}^n \log(s_i + \delta) + \\ \frac{1}{\sigma_g^2} \sum_{i=1}^n \frac{\bigl(\tilde{\color{darkgreen}{y}}_i - \tilde{\mathbf{X}}_{i:} \mathbf{\beta}\bigr)^2}{s_i + \delta} \right] \]

where \(s_i\) are eigenvalues of \(\mathbf{\color{darkblue}{K}}\).

This reduces to a sum of univariate normal terms:

\[ \prod_{i=1}^n \mathcal{N}\left(\tilde{\color{darkgreen}{y}}_i \,\Big|\, \tilde{\mathbf{X}}_{i:} \boldsymbol{\beta}, \; \sigma_g^2 (s_i + \delta)\right) \]
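
A quick numerical check of this factorization (a sketch: a random PSD matrix stands in for \(\mathbf{\color{darkblue}{K}}\), with no fixed effects):

using Distributions, LinearAlgebra

n = 500
A = randn(n, 2n)
K = A * A' / (2n)                       # toy PSD stand-in for a kinship matrix
σ2_g, δ = 0.5, 1.2
E = eigen(Symmetric(K))
U, s = E.vectors, E.values

Σ = Symmetric(σ2_g * (K + δ * I))
y = rand(MvNormal(zeros(n), Σ))
yt = U' * y                             # rotated phenotype

ll_joint   = logpdf(MvNormal(zeros(n), Σ), y)
ll_rotated = sum(logpdf(Normal(0, sqrt(σ2_g * (s[i] + δ))), yt[i]) for i in 1:n)
isapprox(ll_joint, ll_rotated)          # true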

Optimization Process

  1. Solve for \(\mathbf{\beta}(\delta)\)
    Differentiate LL w.r.t. \(\mathbf{\beta}\) and set to zero: \[ \mathbf{\beta}(\delta) = \left(\tilde{\mathbf{X}}^\top (\mathbf{S} + \delta \mathbf{I})^{-1} \tilde{\mathbf{X}}\right)^{-1} \tilde{\mathbf{X}}^\top (\mathbf{S} + \delta \mathbf{I})^{-1} \tilde{\mathbf{\color{darkgreen}{y}}} \]

  1. Solve for \(\sigma_g^2(\delta)\)
    Substitute \(\mathbf{\beta}(\delta)\) into LL and differentiate w.r.t. \(\sigma_g^2\): \[ \sigma_g^2(\delta) = \frac{1}{n} \sum_{i=1}^n \frac{\Bigl(\tilde{\color{darkgreen}{y}}_i - \tilde{\mathbf{X}}_{i} \mathbf{\beta}(\delta)\Bigr)^2}{s_i + \delta} \]

  1. Optimize \(\delta\)
    Plug \(\mathbf{\beta}(\delta)\) and \(\sigma_g^2(\delta)\) into (2) and use Brent’s method for 1D optimization over \(\delta\).

FaST-LMM TLDR
The “Fa” in FaST-LMM refers to the factorization \(\mathbf{\color{darkblue}{K}} = \mathbf{U}\mathbf{S}\mathbf{U}^\top\), which diagonalizes the covariance matrix and enables computationally efficient \(O(n)\) updates.

However, an expensive eigendecomposition, \(O(n^3)\), must occur up front.
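
To make step 3 concrete, here is a sketch of the profile likelihood and its 1-D optimization. It assumes the rotated quantities `yt = U'y`, `Xt = U'X`, and eigenvalues `s` are precomputed; Optim.jl’s `Brent()` stands in for whatever 1-D optimizer an implementation actually uses (an illustration, not FaST-LMM’s code):

using LinearAlgebra, Optim

# Negative profile log likelihood in δ, given rotated data and eigenvalues s of K.
function neg_profile_ll(logδ, yt, Xt, s)
    δ = exp(logδ)                                 # optimize on the log scale
    n = length(yt)
    d = s .+ δ                                    # diagonal of S + δI
    β = (Xt' * (Xt ./ d)) \ (Xt' * (yt ./ d))     # GLS solution β(δ)
    r = yt - Xt * β
    σ2_g = sum(abs2.(r) ./ d) / n                 # closed-form σ_g²(δ)
    0.5 * (n * log(2π * σ2_g) + sum(log.(d)) + n)
end

# 1-D optimization with Brent's method, e.g.:
# res = optimize(lδ -> neg_profile_ll(lδ, yt, Xt, s), -10.0, 10.0, Brent())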

fastGWA with GCTA

Jiang et al., Nature Genetics, 2019


Reduce computational complexity with sparse matrices

using SparseArrays
K = simulate_GRM(N, 10, 0.02) # third argument is baseline-relatedness
Ktrunc = copy(K)
Ktrunc[Ktrunc .< 0.05] .= 0.0 # truncate values
spK = SparseMatrixCSC(Ktrunc)
@time z1 = K * x; # 0.015 seconds
@time z2 = spK * x; # 0.0002 seconds

cor(z1, z2) # 0.999

Using a dense GRM matrix is 75x slower than a sparse matrix multiplication!

What is a sparse matrix?

struct SparseMatrix
    m::Int32                   # Number of rows
    n::Int32                   # Number of columns
    colptr::Vector{Int32}      # Column j is in colptr[j]:(colptr[j+1]-1)
    rowval::Vector{Int32}      # Row indices of stored values
    nzval::Vector{Float64}     # Stored values, typically nonzeros
end;
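
Why this pays off: a matrix-vector product only needs to touch the stored entries. A minimal sketch against the struct above:

# y = A * x touching only stored entries: O(nnz) work instead of O(m * n).
function spmatvec(A::SparseMatrix, x::Vector{Float64})
    y = zeros(Float64, A.m)
    for j in 1:A.n                                # loop over columns
        for k in A.colptr[j]:(A.colptr[j+1] - 1)  # stored entries in column j
            y[A.rowval[k]] += A.nzval[k] * x[j]
        end
    end
    return y
end

If the truncated GRM keeps only a few relatives per person, the product drops from \(O(N^2)\) to roughly \(O(N)\) work.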

What are the tradeoffs of using a sparse GRM in place of a dense GRM?

Review

Compute burden in LMMs can be reduced by:

  1. Converting the problem to one of linear regression (FaST-LMM)
  2. Using iterative methods to avoid quadratic complexity in sample size (BOLT-LMM)
  3. Using more efficient data structures to store the GRM (fastGWA in GCTA)

Q: Any other ideas of ways to reduce compute burden?

Case-control phenotypes

So far - we’ve discussed efficient methods for running LMMs at biobank-scale. How well will these work for case-control phenotypes?

Case-control phenotypes

Figure 1, Zhou et al., Nature Genetics, 2018

Using an LMM on case-control phenotypes leads to substantial inflation with even modest case-control imbalance!

Saddle point approximation for case-control phenotypes

1. Logistic Mixed Model

The null model is specified as: \[ \text{logit}\Bigl(\mathbb{E}[\color{darkgreen}{Y_i}]\Bigr) = \mathbf{X}_i^\top \alpha + \mathbf{g}_i^\top \mathbf{u} \]

  • \(\color{darkgreen}{Y_i}\): Binary phenotype (case/control status) for individual \(i\),
  • \(\mathbf{X}_i\): Covariates (fixed effects)
  • \(\boldsymbol{\alpha}\): Fixed-effect coefficients
  • \(\mathbf{u} \sim \mathcal{N}(\mathbf{0}, \sigma_g^2 \mathbf{\color{darkblue}{K}})\): Random effects capturing genetic relatedness.

2. Variance Component Estimation

The variance ratio \(\hat{r}\) is estimated as: \[ \hat{r} = \frac{\sigma_g^2}{\sigma_e^2} \] where \(\sigma_g^2\) is the genetic variance and \(\sigma_e^2\) is the residual variance (fixed at 1 for logistic models).

3. Predicted Means and Weights

  • Predicted means under the null model: \[ \hat{\mu}_i = \text{logit}^{-1}\left( \mathbf{X}_i^T \hat{\boldsymbol{\alpha}} + \mathbf{g}_i^T \hat{\mathbf{u}} \right) \]
  • Weight matrix \(\hat{\mathbf{W}}\) (diagonal entries): \[ \hat{W}_{ii} = \hat{\mu}_i (1 - \hat{\mu}_i) \]

4. Variance-Adjusted Test Statistic

The adjusted test statistic for Step 2 is: \[ T_{\text{adj}} = \frac{\tilde{\mathbf{G}}^T (\mathbf{\color{darkgreen}{Y}} - \hat{\boldsymbol{\mu}})}{\sqrt{\hat{r} \cdot \tilde{\mathbf{G}}^T \hat{\mathbf{W}} \tilde{\mathbf{G}}}} \]

Computing pvalues

  • Under the null, the distribution of the test statistic is assumed to converge in distribution to a Gaussian
  • However, when the case-control imbalance is particularly large, the usual asymptotics haven’t kicked in at these sample sizes, making this a poor approximation

Simulation

using Distributions, LinearAlgebra, Random
using StatsFuns: logistic

function gwas_sim(N = 50_000, P = 10_000)
    X = rand(Normal(0, 1), N, 10)
    α = rand(Normal(0, 1), 10)
    μ0 = -7.0                       # low intercept induces case-control imbalance
    η = μ0 .+ X * α
    μ = logistic.(η)
    Y = rand.(Bernoulli.(μ))        # ~3% cases
    W = Diagonal(μ .* (1 .- μ))     # W_ii = μ_i (1 - μ_i), as in step 3 above

    stats = zeros(P)
    pvals = zeros(P)

    @inbounds Threads.@threads :static for i in 1:P
        p = rand(Beta(3, 12))
        g = rand(Binomial(2, p), N)
        stats[i] = g' * (Y - μ) / sqrt(g' * W * g)
        pvals[i] = 2.0 * (1.0 - cdf(Normal(0, 1), abs(stats[i])))
    end

    return stats, pvals
end;

Random.seed!(3);
stats, pvals = gwas_sim();

Pvalues are not distributed uniformly under the null

Substantial deviation in qq plot

How can we generate properly calibrated test statistics?

To apply fastSPA to \(T_{adj}\), first derive its cumulant generating function (CGF). Given the fitted random effects \(\hat{\mathbf{u}}\), \(T_{adj}\) is modeled as a weighted sum of independent Bernoulli random variables. The approximated CGF is:

\[ K\left(t;\ \widehat{\mu}, c\right) = \sum_{i=1}^{N} \log\left(1 - \widehat{\mu}_{i} + \widehat{\mu}_{i} e^{c t \widehat{G}_{i}}\right) - c t \sum_{i=1}^{N} \widehat{G}_{i} \widehat{\mu}_{i} \]

where \(c = \text{Var}^{*}(\mathbf{T})^{-1/2}\).

FastSPA, Dey et al., AJHG, 2017

Derivatives and Probability Calculation

Let \(K^{\prime}(t)\) and \(K^{\prime\prime}(t)\) denote the first and second derivatives of \(K\) with respect to \(t\). To compute the probability \(\Pr(T_{adj} < q)\) for an observed test statistic \(q\), use:

\[ \Pr\left(T_{adj} < q\right) \approx F(q) = \Phi\left\{w + \frac{1}{w} \log\left(\frac{v}{w}\right)\right\} \]

where: \[ w = \text{sign}\left(\widehat{\zeta}\right) \left[2\left\{\widehat{\zeta} q - K\left(\widehat{\zeta}\right)\right\}\right]^{1/2}, \quad v = \widehat{\zeta} \left\{K^{\prime\prime}\left(\widehat{\zeta}\right)\right\}^{1/2} \]

and \(\widehat{\zeta} = \widehat{\zeta}(q)\) solves the equation \(K^{\prime}(\widehat{\zeta}) = q\).

When the observed \(q\) lies close to the mean of \(T_{adj}\), the saddle point machinery is unnecessary, and fastSPA falls back to the usual normal approximation:

\[ \Pr(T_{adj} < q) \approx \Phi\left(\frac{q - \mu}{\sigma}\right) \]
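
Putting the pieces together: a minimal sketch for the unstandardized score \(T = \sum_i G_i (Y_i - \hat{\mu}_i)\), using Roots.jl to solve \(K'(\zeta) = q\). The helper names and the fallback threshold are illustrative, not SAIGE’s implementation:

using Distributions, Roots

# CGF of T = Σ_i G_i (Y_i - μ_i) with Y_i ~ Bernoulli(μ_i), and its derivatives.
cgf(t, G, μ)  = sum(@. log(1 - μ + μ * exp(t * G))) - t * sum(G .* μ)
cgf1(t, G, μ) = sum(@. G * μ * exp(t * G) / (1 - μ + μ * exp(t * G))) - sum(G .* μ)
cgf2(t, G, μ) = sum(@. G^2 * μ * (1 - μ) * exp(t * G) / (1 - μ + μ * exp(t * G))^2)

# Upper-tail p-value Pr(T > q) via the saddle point formula above.
function spa_pvalue(q, G, μ)
    ζ = find_zero(t -> cgf1(t, G, μ) - q, 0.0)      # solve K'(ζ) = q
    σ = sqrt(cgf2(0.0, G, μ))                       # null SD of T
    abs(ζ) < 1e-6 && return ccdf(Normal(0, σ), q)   # near the mean: normal approx
    w = sign(ζ) * sqrt(2 * (ζ * q - cgf(ζ, G, μ)))
    v = ζ * sqrt(cgf2(ζ, G, μ))
    ccdf(Normal(), w + log(v / w) / w)
end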

Review

  1. For case-control phenotypes, it’s faster to compute a score test than a log-likelihood ratio test
  2. For phenotypes with case-control imbalance, the usual asymptotic approximations of the null distribution don’t work well enough!
  3. SAIGE uses the saddle point approximation to compute better calibrated pvalues
  4. SAIGE combines this innovation with some of the ideas from BOLT-LMM

Further optimizations

Mbatchou et al., Nature Genetics, 2021

REGENIE benchmarks

Mbatchou et al., Table 1

Optimization for multiple phenotypes

  • No need to process each phenotype completely independently
  • For example, rather than projecting out covariates for each phenotype one at a time, this can be done once for a matrix storing all phenotypes (see the sketch below):

\(\mathbb{Y} = \left[ Y_1, Y_2, \cdots, Y_P \right]\)
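
A sketch of this trick in Julia (illustrative shapes; one QR factorization of the covariate matrix serves every phenotype column):

using LinearAlgebra

N, P, d = 10_000, 50, 10
X = randn(N, d)              # covariates, shared by all phenotypes
Y = randn(N, P)              # columns are phenotypes Y_1, ..., Y_P

F = qr(X)                    # factorize the covariates once
Yres = Y - X * (F \ Y)       # residualize all P phenotypes with one solve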

Summary

  1. Efficient data structures are essential
    1. Store genotypes with 2 bits each
    2. Use sparse matrices for the GRM
  2. Optimized algorithms are essential
    1. Solve linear systems with iterative methods rather than an explicit inverse, eigendecomposition, or Cholesky factorization
  3. Sometimes we need more than standard asymptotic results to compute pvalues