BIOS 770
Assistant Professor, Department of Human Genetics, Emory University
Biobank: A collection of biosamples that are available for phenotyping and genotyping (array + imputation or sequencing). Phenotypes are typically defined by processing billing codes available from electronic health record (EHR) data.
Global Biobank Meta-analysis Initiative
\[ f(x) = O(g(x)) \quad \text{as } x \to \infty \]
and it is read "\(f(x)\) is big \(O\) of \(g(x)\)" or "\(f(x)\) is of the order of \(g(x)\)": it holds if \(|f(x)|\) is at most a positive constant multiple of \(|g(x)|\) for all sufficiently large values of \(x\). That is, \(f(x) = O(g(x))\) if there exist \(M \in \mathbb{R}^{+}\) and \(x_0 \in \mathbb{R}\) such that
\[ |f(x)| \leq M |g(x)| \quad \forall x \geq x_0. \]
\(F(n) = 3n^2 + 7n\)
Which of these is true?
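For example, \(F(n) = 3n^2 + 7n = O(n^2)\): taking \(M = 10\) and \(x_0 = 1\),
\[ |3n^2 + 7n| \leq 3n^2 + 7n^2 = 10 n^2 \quad \forall n \geq 1. \]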
Bits and Bytes
Naive implementations simply won’t scale!
Q: How much RAM does this take in MB?
This is just to store common variants in RAM.
A given genotype generally takes a value equal to the number of alt alleles: \(0\), \(1\), or \(2\). Including a missing value, that is just \(4\) possible values, which fit in \(2\) bits.
Q: How many bits does it take to store our genotype matrix?
A: At \(2\) bits per genotype, \(1e5 \times 8e6 \times 0.25 \text{ bytes} \approx 190{,}735\) MB \(\approx 186\) GB
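Unpacking the arithmetic at 2 bits (\(0.25\) bytes) per genotype:
\[ 10^5 \times (8 \times 10^6) \times 0.25 \text{ bytes} = 2 \times 10^{11} \text{ bytes} = \frac{2 \times 10^{11}}{2^{20}} \text{ MB} \approx 190{,}735 \text{ MB} \approx 186 \text{ GB}. \]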
Can we do better with compression? Do we need to store every reference genotype?
For common variants, data are typically represented with a packed data structure using the above 2-bit encoding:
Generally, these libraries provide linear algebra routines as abstractions on top of the packed data structures.
struct PackedGenotypeMatrix
    data::Vector{UInt8}   # column-major packed genotypes (2 bits per entry)
    n_samples::Int
    n_variants::Int
end

function PackedGenotypeMatrix(dense_matrix::Matrix{Int32})
    n_samples, n_variants = size(dense_matrix)
    bytes_per_col = ceil(Int, n_samples / 4)              # 4 genotypes per byte
    packed_data = zeros(UInt8, bytes_per_col * n_variants)
    for col in 1:n_variants
        for row in 1:n_samples
            val = dense_matrix[row, col]
            byte_idx = (col - 1) * bytes_per_col + (row - 1) ÷ 4 + 1
            shift = 2 * ((row - 1) % 4)                   # 2-bit slot within the byte
            packed_data[byte_idx] |= val << shift
        end
    end
    PackedGenotypeMatrix(packed_data, n_samples, n_variants)
end
using Distributions

N = 10_000;
P = 1000;
G = Int32.(rand(Binomial(2, 0.5), N, P));
Gbits = PackedGenotypeMatrix(G);
println("Naive: ", Base.summarysize(G) / (2 ^ 20))          # dense Int32 storage, MB
println("Efficient: ", Base.summarysize(Gbits) / (2 ^ 20))  # 2-bit packed storage, MB
Naive: 38.147010803222656
Efficient: 2.384246826171875
Instead of 4 bytes per individual, store 4 individuals in a single byte: a 16x savings
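Reading a genotype back out reverses the packing: locate the byte, then shift and mask the 2-bit field. A minimal accessor sketch (the name getgeno is illustrative, not from any library):

function getgeno(g::PackedGenotypeMatrix, row::Int, col::Int)
    bytes_per_col = ceil(Int, g.n_samples / 4)            # bytes holding one variant
    byte_idx = (col - 1) * bytes_per_col + (row - 1) ÷ 4 + 1
    shift = 2 * ((row - 1) % 4)                           # 2-bit slot within the byte
    Int((g.data[byte_idx] >> shift) & 0x03)               # mask out the genotype
end

getgeno(Gbits, 1, 1) == G[1, 1]   # true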
\[ \mathbf{\color{darkgreen}{y}} = \mathbf{X}\mathbf{\beta} + \mathbf{e}_{\text{proj}}, \]
\[ \chi^2_{\text{LMM-LOCO}} = \frac{\bigl(\mathbf{x}_{\text{test}}^\top \mathbf{\color{darkorange}{V}}_{\text{LOCO}}^{-1} \color{darkgreen}{y} \bigr)^2} {\mathbf{x}_{\text{test}}^\top \mathbf{\color{darkorange}{V}}_{\text{LOCO}}^{-1}\,\mathbf{x}_{\text{test}}} \]
Strategies:
We want to compute \(\color{darkorange}{V}^{-1}x\), i.e. find \(z\) such that \(z = \color{darkorange}{V}^{-1}x\)
This is equivalent to finding \(z\) such that \(\color{darkorange}{V}z = x\).
Simulated Kinship matrix
Using conjugate gradient iteration instead of explicit matrix inversion saves ~250x compute
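A minimal hand-rolled sketch of conjugate gradient for solving \(\color{darkorange}{V}z = x\) (illustrative; assumes \(\color{darkorange}{V}\) is symmetric positive definite, and production code would use a tuned library such as IterativeSolvers.jl):

using LinearAlgebra

function cg_solve(V, x; tol = 1e-8, maxiter = 1000)
    z = zeros(length(x))       # initial guess z = 0
    r = copy(x)                # residual r = x - V * z
    p = copy(r)                # search direction
    rs = dot(r, r)
    for _ in 1:maxiter
        Vp = V * p             # the only use of V: a matrix-vector product
        a = rs / dot(p, Vp)
        z .+= a .* p
        r .-= a .* Vp
        rs_new = dot(r, r)
        sqrt(rs_new) < tol && break
        p .= r .+ (rs_new / rs) .* p
        rs = rs_new
    end
    return z
end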
Why does this reduce compute?
Doing matrix-vector multiplication before matrix-matrix multiplication is ~920x faster when N is large
Table 1, Loh et al., Nature Genetics, 2015
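A sketch of why ordering matters: with \(\color{darkorange}{V} = \sigma_g^2 \mathbf{G}\mathbf{G}^\top / M + \sigma_e^2 \mathbf{I}\), evaluating right-to-left needs two \(O(NM)\) matrix-vector products instead of the \(O(N^2 M)\) cost of forming \(\mathbf{G}\mathbf{G}^\top\) (toy dimensions below):

using LinearAlgebra

N, M = 10_000, 1_000
G = randn(N, M); x = randn(N)
σg2, σe2 = 0.5, 0.5

# Slow: builds the N x N matrix G * G' first -- O(N^2 M) flops, O(N^2) memory
# v = (σg2 / M .* (G * G') + σe2 .* I) * x
# Fast: two matrix-vector products -- O(N M) flops, O(N) extra memory
v = σg2 / M .* (G * (G' * x)) .+ σe2 .* x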
The LMM log likelihood for phenotype data \(\mathbf{\color{darkgreen}{y}}\) (dimension \(n \times 1\)) given fixed effects \(\mathbf{X}\):
\[ \log \mathcal{L} = -\frac{1}{2} \left[ n \log(2\pi) + \log\bigl|\sigma_e^2 \mathbf{I} + \sigma_g^2 \mathbf{\color{darkblue}{K}}\bigr| + \\ \Bigl(\mathbf{\color{darkgreen}{y}} - \mathbf{X}\mathbf{\beta}\Bigr)^\top \Bigl(\sigma_e^2 \mathbf{I} + \sigma_g^2 \mathbf{\color{darkblue}{K}}\Bigr)^{-1} \Bigl(\mathbf{\color{darkgreen}{y}} - \mathbf{X}\mathbf{\beta}\Bigr) \right] \]
Let \(\delta = \sigma_e^2 / \sigma_g^2\) and decompose \(\mathbf{\color{darkblue}{K}} = \mathbf{U} \mathbf{S} \mathbf{U}^\top\)
\[
\log \mathcal{L} = -\frac{1}{2} \left[
n \log(2\pi\sigma_g^2) + \log\bigl|\mathbf{S} + \delta \mathbf{I}\bigr| + \\
\frac{1}{\sigma_g^2} \Bigl(\mathbf{U}^\top\mathbf{\color{darkgreen}{y}} - \mathbf{U}^\top\mathbf{X} \mathbf{\beta}\Bigr)^\top \Bigl(\mathbf{S} + \delta \mathbf{I}\Bigr)^{-1} \Bigl(\mathbf{U}^\top\mathbf{\color{darkgreen}{y}} - \mathbf{U}^\top\mathbf{X}\mathbf{\beta}\Bigr)
\right]
\]
Simplifications:
Rotate data using \(\mathbf{U}^\top\) and let \(\tilde{\mathbf{\color{darkgreen}{y}}} = \mathbf{U}^\top\mathbf{\color{darkgreen}{y}}\), \(\tilde{\mathbf{X}} = \mathbf{U}^\top\mathbf{X}\):
\[ \log \mathcal{L} = -\frac{1}{2} \left[ n \log(2\pi\sigma_g^2) + \sum_{i=1}^n \log(s_i + \delta) + \\ \frac{1}{\sigma_g^2} \sum_{i=1}^n \frac{\bigl(\tilde{\color{darkgreen}{y}}_i - \tilde{\mathbf{X}}_{i:} \mathbf{\beta}\bigr)^2}{s_i + \delta} \right] \]
where \(s_i\) are eigenvalues of \(\mathbf{\color{darkblue}{K}}\).
This reduces to a sum of univariate normal terms:
\[ \prod_{i=1}^n \mathcal{N}\left(\tilde{\color{darkgreen}{y}}_i \,\Big|\, \tilde{\mathbf{X}}_{i:} \boldsymbol{\beta}, \; \sigma_g^2 (s_i + \delta)\right) \]
FaST-LMM TLDR
The “Fa” in FaST-LMM refers to the factorization \(\mathbf{\color{darkblue}{K}} = \mathbf{U}\mathbf{S}\mathbf{U}^\top\), which diagonalizes the covariance matrix and enables computationally efficient \(O(n)\) updates.
However, an expensive \(O(n^3)\) eigendecomposition must occur up front.
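A sketch of the trick in the notation above (the helper fastlmm_loglik is illustrative; yt, Xt, and s come from the one-time decomposition):

using LinearAlgebra

# One-time O(n^3) setup: K = U S U', then rotate the data
# F = eigen(Symmetric(K)); U, s = F.vectors, F.values
# yt = U' * y; Xt = U' * X

# After rotation, each likelihood evaluation at (β, σg2, δ) is O(n)
function fastlmm_loglik(yt, Xt, s, β, σg2, δ)
    n = length(yt)
    r = yt .- Xt * β                       # rotated residuals
    -0.5 * (n * log(2π * σg2) +
            sum(log.(s .+ δ)) +
            sum(abs2.(r) ./ (s .+ δ)) / σg2)
end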
Jiang et al., Nature Genetics, 2019
Using a dense GRM is ~75x slower than using a sparse GRM for matrix-vector multiplication!
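A toy illustration of the cost gap (a sketch: the diagonal stands in for a sparse GRM in which pairs below a relatedness threshold are zeroed out):

using LinearAlgebra, SparseArrays

N = 5_000
Ksp = sparse(1.0I, N, N)     # stand-in sparse GRM: O(N) stored entries
Kdn = Matrix(Ksp)            # same matrix stored densely: O(N^2) entries
x = randn(N)
Kdn * x                      # O(N^2) flops
Ksp * x                      # O(nnz) flops -- here O(N)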
What are the tradeoffs of using a sparse GRM in place of a dense GRM?
Compute burden in LMMs can be reduced by:
- solving \(\color{darkorange}{V}z = x\) iteratively (conjugate gradient) instead of explicitly inverting \(\color{darkorange}{V}\)
- ordering computations as matrix-vector products so \(\mathbf{G}\mathbf{G}^\top\) is never formed
- paying for one up-front eigendecomposition of \(\mathbf{\color{darkblue}{K}}\) (FaST-LMM) so later likelihood evaluations are \(O(n)\)
- replacing the dense GRM with a sparse GRM
Q: Any other ideas of ways to reduce compute burden?
So far, we've discussed efficient methods for running LMMs at biobank scale. How well will these work for case-control phenotypes?
Figure 1, Zhou et al., Nature Genetics, 2019
Using an LMM on case-control phenotypes leads to substantial inflation with even modest case-control imbalance!
The null model is specified as: \[ \text{logit}\Bigl(\mathbb{E}[\color{darkgreen}{Y_i}]\Bigr) = \mathbf{X}_i^\top \alpha + \mathbf{g}_i^\top \mathbf{u} \]
The variance ratio \(\hat{r}\) is estimated as: \[ \hat{r} = \frac{\sigma_g^2}{\sigma_e^2} \] where \(\sigma_g^2\) is the genetic variance and \(\sigma_e^2\) is the residual variance (fixed at 1 for logistic models).
The adjusted test statistic for Step 2 is: \[ T_{\text{adj}} = \frac{\tilde{\mathbf{G}}^T (\mathbf{\color{darkgreen}{Y}} - \hat{\boldsymbol{\mu}})}{\sqrt{\hat{r} \cdot \tilde{\mathbf{G}}^T \hat{\mathbf{W}} \tilde{\mathbf{G}}}} \]
using Distributions, LinearAlgebra, LogExpFunctions, Random  # LogExpFunctions provides `logistic`

function gwas_sim(N = 50_000, P = 10_000)
    X = rand(Normal(0, 1), N, 10)        # covariates
    α = rand(Normal(0, 1), 10)           # covariate effects
    μ0 = -7.0                            # intercept chosen to make cases rare
    η = μ0 .+ X * α
    μ = logistic.(η)
    Y = rand.(Bernoulli.(μ))             # ~3% cases
    W = Diagonal(μ .* (1 .- μ))          # working weights for the score test
    stats = zeros(P)
    pvals = zeros(P)
    @inbounds Threads.@threads :static for i in 1:P
        p = rand(Beta(3, 12))            # allele frequency
        g = rand(Binomial(2, p), N)      # genotypes simulated under the null
        stats[i] = g' * (Y - μ) / sqrt(g' * W * g)   # score statistic
        pvals[i] = 2.0 * (1.0 - cdf(Normal(0, 1), abs(stats[i])))
    end
    return stats, pvals
end;
Random.seed!(3);
stats, pvals = gwas_sim();
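A quick calibration check on the simulated null statistics (a sketch; λgc is the usual genomic-control inflation factor):

using Statistics

λgc = median(abs2.(stats)) / quantile(Chisq(1), 0.5)   # median χ² over its theoretical median
mean(pvals .< 1e-3)          # empirical tail rate vs the nominal 1e-3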
To apply fastSPA to \(T_{adj}\), first derive its cumulant generating function (CGF). Given \(\widehat{\boldsymbol{b}}\), \(T_{adj}\) is modeled as a weighted sum of independent Bernoulli random variables. The approximated CGF is:
\[ K\left(t;\ \widehat{\mu}, c\right) = \sum_{i=1}^{N} \log\left(1 - \widehat{\mu}_{i} + \widehat{\mu}_{i} e^{c t \widehat{G}_{i}}\right) - c t \sum_{i=1}^{N} \widehat{G}_{i} \widehat{\mu}_{i} \]
where \(c = \text{Var}^{*}(\mathbf{T})^{-1/2}\).
FastSPA, Dey et al., AJHG, 2017
Let \(K^{\prime}(t)\) and \(K^{\prime\prime}(t)\) denote the first and second derivatives of \(K\) with respect to \(t\). To compute the probability \(\Pr(T_{adj} < q)\) for an observed test statistic \(q\), use:
\[ \Pr\left(T_{adj} < q\right) \approx F(q) = \Phi\left\{w + \frac{1}{w} \log\left(\frac{v}{w}\right)\right\} \]
where: \[ w = \text{sign}\left(\widehat{\zeta}\right) \left[2\left\{\widehat{\zeta} q - K\left(\widehat{\zeta}\right)\right\}\right]^{1/2}, \quad v = \widehat{\zeta} \left\{K^{\prime\prime}\left(\widehat{\zeta}\right)\right\}^{1/2} \]
and \(\widehat{\zeta} = \widehat{\zeta}(q)\) solves the equation \(K^{\prime}(\widehat{\zeta}) = q\).
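Putting the recipe together, a minimal sketch for a weighted sum of independent Bernoullis (illustrative, not SAIGE's implementation; assumes Roots.jl for the root-find and skips the \(\widehat{\zeta} \approx 0\) edge case, which needs special handling):

using Distributions, Roots

function spa_pvalue(q, g, μ)       # g: adjusted genotypes, μ: fitted null means
    K(t)  = sum(@. log(1 - μ + μ * exp(t * g))) - t * sum(g .* μ)
    K1(t) = sum(@. g * μ * exp(t * g) / (1 - μ + μ * exp(t * g))) - sum(g .* μ)  # K'
    K2(t) = sum(@. g^2 * μ * (1 - μ) * exp(t * g) / (1 - μ + μ * exp(t * g))^2)  # K''
    ζ = find_zero(t -> K1(t) - q, 0.0)          # saddlepoint: K'(ζ) = q
    w = sign(ζ) * sqrt(2 * (ζ * q - K(ζ)))
    v = ζ * sqrt(K2(ζ))
    cdf(Normal(), w + log(v / w) / w)           # ≈ Pr(T_adj < q)
end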
By contrast, the standard normal approximation uses only the first two cumulants (mean \(\mu\) and variance \(\sigma^2\)) of the test statistic: \[ \Pr(T_{adj} < q) \approx \Phi\left(\frac{q - \mu}{\sigma}\right) \]
Mbatchou et al., Nature Genetics, 2021
Mbatchou et al., Table 1
\(\mathbb{Y} = \left[ Y_1, Y_2, \cdots, Y_P \right]\)