BIOS 770
Assistant Professor, Department of Human Genetics, Emory University
Biobank: A collection of biosamples that are available for phenotyping and genotyping (array + imputation or sequencing). Phenotyping is typically defined by processing of billing codes available from Electronic health record (EHR) data.
Global-Biobank Meta-analysis Initiative
f(x)=O(g(x))as x→∞
and it is read f(x) is big O of g(x) or f(x) is of the order of g(x) if |f(x)| is at most a positive constant multiple of |g(x)| for all sufficiently large values of x. That is, f(x)=O(g(x)) if ∃M∈Z+ and a real number x0∈R such that
|f(x)|≤M|g(x)|∀x≥x0.
F(n)=3n2+7n
Which of these is true?
Bits and Bytes
Naive implementations simply won’t scale!
Q: How much RAM does this take in Mb?
This is just to store common variants in RAM.
A given genotype generally takes just takes a value of the number of alt-alleles: 0, 1, 2. Including a missing value, this is just 4 combinations.
Q: How many bits does it take to store our genotype matrix
*A: 1e5∗8e6∗(0.25)=190,724Mb=186Gb
Can we do better with compression? Do we need to store every reference genotype?
On common variants, typically data are represented with a data structure using the above encoding:
Generally, these libraries provide abstractions on top of this for linear algebra routines on these data structures.
struct PackedGenotypeMatrix
data::Vector{UInt8} # Column-major packed genotypes (2 bits per entry)
n_samples::Int
n_variants::Int
end
function PackedGenotypeMatrix(dense_matrix::Matrix{Int32})
n_samples, n_variants = size(dense_matrix)
bytes_per_col = ceil(Int, n_samples / 4)
packed_data = zeros(UInt8, bytes_per_col * n_variants)
for col in 1:n_variants
for row in 1:n_samples
val = dense_matrix[row, col]
byte_idx = (col - 1) * bytes_per_col + (row - 1) ÷ 4 + 1
shift = 2 * ((row - 1) % 4)
packed_data[byte_idx] |= val << shift
end
end
PackedGenotypeMatrix(packed_data, n_samples, n_variants)
end
using Distributions
N = 10_000;
P = 1000;
G = Int32.(rand(Binomial(2, 0.5), N, P));
Gbits = PackedGenotypeMatrix(G);
println("Naive: ", Base.summarysize(G) / (2 ^ 20)) # 38.1 Mb
Naive: 38.147010803222656
Efficient: 2.384246826171875
Instead of 4 bytes per individual, store 4 individuals per single byte: 16x savings
y=Xβ+eproj,
χ2LMM-LOCO=(x⊤testV−1LOCOy)2x⊤testV−1LOCOxtest
Strategies:
We want to compute V−1x, i.e. find z such that z=V−1x
This is equivalent to solving finding z such that Vz=x .
Simulated Kinship matrix
Using conjugate gradient iteration instead of explicit matrix inversion saves ~250x compute
Why does this reduce compute?
Doing matrix-vector multiplication before matrix-matrix multiplication is ~920x faster when N is large
Table 1, Loh et al., Nature Genetics, 2015
The LMM log likelihood for phenotype data y (dimension n×1) given fixed effects X:
logL=−12[nlog(2π)+log|σ2eI+σ2gK|+(y−Xβ)⊤(σ2eI+σ2gK)−1(y−Xβ)]
Let δ=σ2e/σ2g and decompose K=USU⊤
logL=−12[nlog(2πσ2g)+log|S+δI|+1σ2g(U⊤y−U⊤Xβ)⊤(S+δI)−1(U⊤y−U⊤Xβ)]
Simplifications:
Rotate data using U⊤ and let ˜y=U⊤y, ˜X=U⊤X:
logL=−12[nlog(2πσ2g)+n∑i=1log(si+δ)+1σ2gn∑i=1(˜yi−˜Xi:β)2si+δ]
where si are eigenvalues of K.
This reduces to a sum of univariate normal terms:
n∏i=1N(˜yi|˜Xi:β,σ2g(si+δ))
FaST-LMM TLDR
The “Fa” in FaST-LMM refers to the factorization K=USU⊤, which diagonalizes the covariance matrix and enables computationally efficient O(n) updates.
However, expensive eigen decomposition O(n3) must occur up front.
Jiang et al., Nature Genetics, 2019
Jiang et al., Nature Genetics, 2019
Using a dense GRM matrix is 75x slower than a sparse matrix multiplication!
What are the tradeoffs of using a sparse GRM in place of a dense GRM?
Compute burden in LMMs can be reduced by:
Q: Any other ideas of ways to reduce compute burden?
So far - we’ve discussed efficient methods for running LMMs at biobank-scale. How well will these work for case-control phenotypes?
Figure 1, Zhou et al., Nature Genetics, 2019
Using an LMM on case-control phenotypes leads to substantial inflation with even modest case-control imbalance!
The null model is specified as: logit(E[Yi])=X⊤iα+g⊤iu
The variance ratio ˆr is estimated as: ˆr=σ2gσ2e where σ2g is the genetic variance and σ2e is the residual variance (fixed at 1 for logistic models).
The adjusted test statistic for Step 2 is: Tadj=˜GT(Y−ˆμ)√ˆr⋅˜GTˆW˜G
function gwas_sim(N = 50_000, P = 10_000)
X = rand(Normal(0, 1), N, 10)
α = rand(Normal(0, 1), 10)
μ0 = -7.0
η = μ0 .+ X * α
μ = logistic.(η)
Y = rand.(Bernoulli.(μ)) # 3% cases
W = Diagonal(μ .* (1 .- μ))
stats = zeros(P)
pvals = zeros(P)
@inbounds Threads.@threads :static for i in 1:P
p = rand(Beta(3, 12))
g = rand(Binomial(2, p), N)
stats[i] = g' * (Y - μ) / sqrt(g' * W * g)
pvals[i] = 2.0 * (1.0 - cdf(Normal(0, 1), abs(stats[i])))
end
return stats, pvals
end;
Random.seed!(3);
stats, pvals = gwas_sim();
To apply fastSPA to Tadj, first derive its cumulant generating function (CGF). Given ˆb, Tadj is modeled as a weighted sum of independent Bernoulli random variables. The approximated CGF is:
K(t; ˆμ,c)=N∑i=1log(1−ˆμi+ˆμiectˆGi)−ctN∑i=1ˆG′iˆμi
where c=Var∗(T)−1/2.
FastSPA, Dey et al., AJHG, 2017
Let K′(t) and K′′(t) denote the first and second derivatives of K with respect to t. To compute the probability Pr(Tadj<q) for an observed test statistic q, use:
Pr(Tadj<q)≈F(q)=Φ{w+1wlog(vw)}
where: w=sign(ˆζ)[2{ˆζq−K(ˆζ)}]1/2,v=ˆζ{K′′(ˆζ)}1/2
and ˆζ=ˆζ(q) solves the equation K′(ˆζ)=q.
Pr(Tadj<q)≈Φ(q−μσ)
Mbatchou et al., Nature Genetics, 2021
Mbatchou et al., Table 1
Y=[Y1,Y2,⋯,YP]