2.13.1. Overview of Mixed Linear Models

Mixed linear models incorporate both “fixed effects” and “random effects” (that is, “mixed effects”). The independent variables in a linear regression may be thought of as fixed effects. To solve for the random effects in a mixed model, something must be known about the variances and covariances of those random effects.

The Mixed Model Equation

Suppose there are n measurements of a phenotype which is influenced by f fixed effects and t instances of one random effect. The mixed linear model may be written as

y = X\beta + Zu + \epsilon,

where y is an n \times 1 vector of observed phenotypes, X is an n \times f matrix of fixed effects, and \beta is an f \times 1 vector representing the coefficients of the fixed effects. u is a t \times 1 vector containing the unknown instances of the random effect, and Z is an n \times t matrix relating the instances of the random effect to the phenotypes. We assume

Var(u) = \sigma^2_g K

is the variance-covariance matrix for the random effect u, and

Var(\epsilon) = \sigma^2_e I,

so that

Var(y) = \sigma^2_g ZKZ' + \sigma^2_e I.

Examples of “fixed effects” may include the mean, one or more genotypic markers, and other additional covariates that may be analyzed.

Examples of a “random effect” are:

  1. Polygenic effects from each of t subgroupings, where the n measurements have been grouped into t subgroupings such as inbred strains. Z is then an incidence matrix relating subgroupings/strains to measurements, and K should be a matrix showing the pairwise genetic relationship among the t strains.

  2. Polygenic effects from each of the n samples, where there is just one measurement per sample. Z is then just the identity matrix I, and K should be a pairwise genetic relationship or kinship matrix among the n samples.

Note

At this time, none of the SVS Mixed Linear Model Analysis tools supports organizing measurements into subgroupings such as inbred strains.

The parameters \sigma^2_g and \sigma^2_e are called the “variance components”, and are assumed to be unknown. To solve the mixed model equation, the variance components must first be estimated. Once this is done, a generalized least squares (GLS) procedure may be used to estimate \beta.
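
To make the GLS step concrete, the following is a minimal numpy sketch (not SVS code; the function name and arguments are illustrative), assuming the variance components \sigma^2_g and \sigma^2_e have already been estimated:

    import numpy as np

    def gls_beta(y, X, Z, K, sigma2_g, sigma2_e):
        """Generalized least squares estimate of the fixed-effect
        coefficients beta, given estimated variance components."""
        n = len(y)
        V = sigma2_g * Z @ K @ Z.T + sigma2_e * np.eye(n)   # Var(y)
        Vinv = np.linalg.inv(V)
        # beta_hat = (X' V^-1 X)^-1 X' V^-1 y
        return np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)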

More than One Random Effect

It is also possible for a mixed linear model to have more than one random effect. For instance, a mixed model with two random effects u and v would be defined by

y = X\beta + u + v + \epsilon,

where

Var(u) = \sigma^2_{g1} K_1,

Var(v) = \sigma^2_{g2} K_2,

and

Var(\epsilon) = \sigma^2_e I,

K_1 and K_2 being the variance/covariance matrices for u and v respectively, so that

Var(y) = \sigma^2_{g1} K_1 + \sigma^2_{g2} K_2 + \sigma^2_e I.

In this example, there are three variance components: \sigma^2_{g1}, \sigma^2_{g2}, and \sigma^2_e.

Finding the Variance Components

The SVS Mixed Linear Model Analysis tools use one of two approaches to estimate the variance components:

  1. EMMA. To estimate the variance components \sigma^2_g and \sigma^2_e for a mixed model with only one random effect, you may select to use the EMMA (Efficient Mixed Model Association) approach ([Kang2008]). This approach reduces the estimating problem to a maximization search in just one dimension, and will thus be reliable and fast for use cases with 10,000 or fewer samples.

    (For sample sizes n over 10,000 or 15,000, this method starts becoming limited by the time it takes to solve the eigenvalue/eigenvector problem for these n \times n matrices. While some of the Mixed Linear Model Analysis tools (specifically GBLUP and K-Fold cross validation using GBLUP) have a large-data capability which saves space in memory, you must still choose, when using this capability, the tradeoff between faster approximation of the eigenvalues and eigenvectors vs. more exact computation of these values.)

  2. AI. Average Information (AI) ([Yang2011], [LeeSH2012]) is the other approach you may select. Multiple variance components may be estimated using this approach, and it is not limited for large sample sizes by the need to solve an eigenvalue/eigenvector problem.

    However, neither the MLMM feature nor large-data capability is currently implemented in SVS for the AI approach.

Either the full likelihood (Maximum Likelihood, ML) or the restricted likelihood (Restricted Maximum Likelihood, REML) may be maximized. The restricted likelihood is defined as the full likelihood with the fixed effects integrated out. As stated in [Kang2008], “The restricted likelihood avoids a downward bias of maximum-likelihood estimates of variance components by taking into account the loss in degrees of freedom associated with fixed effects.”

Note

The SVS Mixed Linear Model Analysis tools always maximize the restricted likelihood (REML) rather than the full likelihood (ML) except when the Bayes Information Criterion (for the MLMM feature) is computed.

Finding the Variance Components Using EMMA

Suppose \sigma = \sigma_g, \delta = \frac{\sigma^2_e}{\sigma^2_g}, and H = \frac{V}{\sigma^2} = ZKZ' + \delta I, which is a function of \delta. Under the null hypothesis, the full log-likelihood function can be formulated as

l_F(y; \beta, \sigma, \delta) = \frac{1}{2}\big[-n\log(2\pi\sigma^2) - \log|H| - \frac{1}{\sigma^2}(y - X\beta)'H^{-1}(y - X\beta)\big]

and the restricted log-likelihood function can be formulated as

l_R(y;\sigma,\delta) = l_F(y; \hat{\beta}, \sigma, \delta) + \frac{1}{2}\big[f\log(2\pi\sigma^2) + \log|X'X| - \log|X'H^{-1}X|\big].

The full-likelihood function is maximized when \beta is \hat{\beta} = (X'H^{-1}X)^{-1}X'H^{-1}y, and the optimal variance component is \sigma^2_F = R/n for full likelihood and \sigma^2_R = R/(n-f) for restricted likelihood, where R=(y - X\hat{\beta})'H^{-1}(y - X\hat{\beta}) is a function of \delta as well.

Using spectral decomposition, it is possible to find \xi_i and \lambda_s such that

H = ZKZ' + \delta I = U_F diag(\xi_1 + \delta, ..., \xi_n + \delta)U_F'

and

SHS &= S(ZKZ' + \delta I)S \\
    &= [U_R W_R]diag(\lambda_1 + \delta, ..., \lambda_{n-f} + \delta, 0, ..., 0)[U_R W_R]' \\
    &= U_R diag(\lambda_1 + \delta, ..., \lambda_{n-f} + \delta)U_R',

where S = I - X(X'X)^{-1}X', U_F is n \times n, and U_R is an n \times (n-f) eigenvector matrix corresponding to the nonzero eigenvalues. W_R is an n \times f matrix corresponding to the zero eigenvalues. U_F and U_R are independent of \delta.

Let [\eta_1, \eta_2, ..., \eta_{n-f}]' = U_R'y. Then, finding the maximum-likelihood (ML) estimate is equivalent to optimizing

f_F(\delta) &= l_F(y; \hat{\beta}, \hat{\sigma}, \delta) \\
            &= \frac{1}{2}\bigg[n \log\frac{n}{2\pi} - n - n \log(\sum_{s=1}^{n-f}\frac{\eta^2_s}{\lambda_s + \delta}) - \sum_{i=1}^n \log(\xi_i + \delta)\bigg]

with respect to \delta, and finding the restricted maximum-likelihood (REML) estimate is equivalent to optimizing

f_R(\delta) &= l_R(y; \hat{\sigma}, \delta) \\
            &= \frac{1}{2}\bigg[(n-f) \log\frac{n-f}{2\pi} - (n-f) - (n-f) \log(\sum_{s=1}^{n-f}\frac{\eta^2_s}{\lambda_s + \delta}) - \sum_{s=1}^{n-f} \log(\lambda_s + \delta)\bigg]

with respect to \delta. (See the Appendix of [Kang2008] for the mathematical details, except that PS = P, not “PS = S” as that Appendix states.) These functions are continuous for \delta > 0 if and only if all the eigenvalues \lambda_s are nonnegative (and, for f_F(\delta), the eigenvalues \xi_i are nonnegative). Otherwise, if the kinship matrix is not positive semidefinite, the likelihood will be ill-defined for a certain range of \delta.

The derivatives of these functions, which may be used to find the local maxima for the functions themselves, are

f_F'(\delta) = \frac{n}{2}\frac{\sum_s\eta^2_s/(\lambda_s + \delta)^2}{\sum_s\eta^2_s/(\lambda_s + \delta)} - \frac{1}{2} \sum_i \frac{1}{\xi_i + \delta}

and

f_R'(\delta) = \frac{(n-f)}{2} \frac{\sum_s\eta^2_s/(\lambda_s + \delta)^2}{\sum_s\eta^2_s/(\lambda_s + \delta)} - \frac{1}{2} \sum_s \frac{1}{\lambda_s + \delta}.

The ML or REML may be searched for by subdividing the values of \delta into 100 intervals, evenly in log space from \delta = 10^{-5} to \delta = 10^5, and applying a method such as the Newton-Raphson algorithm or the secant method on f_F'(\delta) or f_R'(\delta) to all the intervals where the sign of the derivative function changes, then taking the optimal \delta among all the stationary points and endpoints.

Note

The secant method is used by the SVS Mixed Linear Model Analysis tools.

Notice that evaluating f_F'(\delta) or f_R'(\delta) does not require a large number of matrix multiplications or inverses at each iteration as other methods typically do – instead, the EMMA technique computes spectral decomposition only once. Thus, using the grid search indicated above, the likelihood may be optimized globally with high confidence using much less computation.
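
To illustrate this one-dimensional search, here is a minimal numpy/scipy sketch (not SVS code; names are illustrative, and Brent's bracketing root-finder stands in for the secant method that SVS uses). It optimizes f_R(\delta) given \eta = U_R'y and the eigenvalues \lambda_s:

    import numpy as np
    from scipy.optimize import brentq

    def reml_delta(etas, lambdas, n, f):
        """Grid search for the REML-optimal delta: evaluate f_R'(delta)
        over 100 intervals, even in log space from 1e-5 to 1e5, and
        root-solve wherever the derivative changes sign."""
        def dfR(delta):
            d = lambdas + delta
            return (((n - f) / 2) * np.sum(etas**2 / d**2) / np.sum(etas**2 / d)
                    - 0.5 * np.sum(1.0 / d))

        def fR(delta):
            d = lambdas + delta
            return 0.5 * ((n - f) * np.log((n - f) / (2 * np.pi)) - (n - f)
                          - (n - f) * np.log(np.sum(etas**2 / d))
                          - np.sum(np.log(d)))

        grid = np.logspace(-5, 5, 101)           # 100 intervals in log space
        candidates = [grid[0], grid[-1]]         # endpoints
        for lo, hi in zip(grid[:-1], grid[1:]):
            if dfR(lo) * dfR(hi) < 0:            # sign change: stationary point
                candidates.append(brentq(dfR, lo, hi))
        return max(candidates, key=fR)           # best delta found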

Finding the Variance Components Using the Average Information (AI) Technique

Problem Restatement and Summary

Suppose we have r random effects in our mixed model. We may then write our mixed model as

y = X\beta + \sum_{i=1}^r u_i + \epsilon ,

where u_i is a vector of random genetic effects. In this model, the phenotypic variance \sigma^2_P is partitioned into the variance explained by each of the genetic factors and the residual variance. We have

V = Var(y) = \sum_{i=1}^r G_i \sigma^2_i + I \sigma^2_{\epsilon},

where \sigma^2_i is the variance of the i-th genetic factor with its corresponding variance/covariance matrix G_i.

(G_i can be a kinship matrix or a GBLUP Genomic Relationship Matrix (GRM). See The Genomic Relationship Matrix for more information about the GRM matrix.)

The variance components \sigma^2_i (for all i) and \sigma^2_{\epsilon} are unknown and need to be determined. To determine these variance components, SVS uses the Average Information (AI) technique to implement the restricted maximum likelihood (REML) method. This process may be summarized as follows:

  • SVS first initializes each component to (the somewhat arbitrary value of) \sigma^2_P divided by the number of total components (r + 1).

  • Next, SVS uses one iteration of the expectation maximization (EM) algorithm (as adapted for the AI REML method).

  • Finally, repeated iterations of the Average Information (AI) algorithm itself are used until convergence is achieved. The criterion for convergence is that the log-likelihood changes by less than the threshold value of 10^{-6}.

Definitions and Algorithm Details

Define \theta as a vector of variance components

\theta = (\sigma^2_1, \sigma^2_2, ..., \sigma^2_r, \sigma^2_{\epsilon})' ,

the matrix P as

P = V^{-1} - V^{-1} X (X'V^{-1}X)^{-1}X'V^{-1} ,

and L as the log-likelihood function of the model, ignoring the constant:

L = -\frac{1}{2}\left(\log|V| + \log|X'V^{-1}X| + y'Py\right) .

We now define a matrix called AI which is the average of the observed and expected information matrices:

AI = \frac{1}{2} \begin{bmatrix}
y'PG_1 P G_1 Py & \cdots & y'PG_1 PG_rPy & y'PG_1 PPy \\
\vdots          & \ddots & \vdots        & \vdots     \\
y'PG_r P G_1 Py & \cdots & y'PG_r PG_rPy & y'PG_r PPy \\
y'P    P G_1 Py & \cdots & y'P    PG_rPy & y'P    PPy
\end{bmatrix} .

The vector of first derivatives of the log likelihood function with respect to each variance component is

\frac{\partial L}{\partial \theta} = -\frac{1}{2} \begin{bmatrix}
tr(PG_1) - y'PG_1Py \\
\vdots              \\
tr(PG_r) - y'PG_rPy \\
tr(P)    - y'P   Py
\end{bmatrix} .

  • The i-th variance component \sigma^2_i is initialized to

    \sigma^{2(0)}_i = \frac{\sigma^2_P}{(r + 1)} .

  • This is updated by one iteration of the EM algorithm as

\sigma^{2(1)}_i = \left[ \sigma^{4(0)}_i y'PG_iPy + tr(\sigma^{2(0)}_i I - \sigma^{4(0)}_i P G_i) \right] / n .

    The EM algorithm is used as an initial step to determine the direction of the iteration updates because it is robust to poor starting values.

  • The remaining iterations use the Average Information algorithm itself, which consists of updating the vector \theta of variance components as

    \theta^{(t+1)} = \theta^{(t)} + (AI^{(t)})^{-1} \partial L / \partial \theta \big|_{\theta^{(t)}}

    to get from iteration (t) to iteration (t+1).

  • Convergence is determined by whether L^{(t+1)} - L^{(t)} < 10^{-6}, where L^{(t)} is the log likelihood of the t-th iteration.

During the iteration process, any component whose estimate is negative (that is, the estimate “has escaped from the parameter space”) is set to 10^{-6} \times \sigma^2_P.

See [Yang2011] for further information.
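
The following is a compact numpy sketch (not SVS code; names are illustrative) of the AI REML loop just described: initialization, one EM step, then AI updates until the log-likelihood converges. np.var(y) stands in for \sigma^2_P:

    import numpy as np

    def ai_reml(y, X, Gs, max_iter=100, tol=1e-6):
        """Estimate theta = (sigma2_1, ..., sigma2_r, sigma2_eps)'.
        Gs is the list of the r variance/covariance matrices G_i."""
        n = len(y)
        comps = list(Gs) + [np.eye(n)]           # residual component is I
        theta = np.full(len(comps), np.var(y) / len(comps))

        def pieces(theta):
            V = sum(t * G for t, G in zip(theta, comps))
            Vinv = np.linalg.inv(V)
            VX = Vinv @ X
            P = Vinv - VX @ np.linalg.solve(X.T @ VX, VX.T)
            Py = P @ y
            _, logdetV = np.linalg.slogdet(V)
            _, logdetXVX = np.linalg.slogdet(X.T @ VX)
            L = -0.5 * (logdetV + logdetXVX + y @ Py)  # ignoring the constant
            return P, Py, L

        P, Py, L = pieces(theta)
        # one EM iteration, robust to the arbitrary starting values
        theta = np.array([(t * t * (Py @ G @ Py)
                           + np.trace(t * np.eye(n) - t * t * P @ G)) / n
                          for t, G in zip(theta, comps)])
        for _ in range(max_iter):
            P, Py, L_new = pieces(theta)
            GPy = [G @ Py for G in comps]        # each G_i P y
            AI = 0.5 * np.array([[gi @ P @ gj for gj in GPy] for gi in GPy])
            dL = np.array([-0.5 * (np.trace(P @ G) - Py @ G @ Py)
                           for G in comps])
            theta = theta + np.linalg.solve(AI, dL)
            theta[theta < 0] = 1e-6 * np.var(y)  # pull escaped estimates back
            if abs(L_new - L) < tol:
                break
            L = L_new
        return theta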

Estimating the Variance of Heritability for One Random Effect

The formula for the estimate of the variance of heritability is derived as follows using a Taylor series expansion:

Var\left(\frac{X}{Y}\right) = \left(\frac{E(X)}{E(Y)}\right)^2
                              \left[\frac{Var(X)}{(E(X))^2} - \frac{2 Cov(X,Y)}{E(X)E(Y)}
                              + \frac{Var(Y)}{(E(Y))^2}\right]

Now, using the fact that the pseudo-heritability (or narrow-sense heritability) is:

h^2 = \frac{\hat{\sigma}_g^2}{\hat{\sigma}_g^2 + \hat{\sigma}_e^2}
    = \frac{\hat{\sigma}_g^2}{\hat{\sigma}_T^2}

We can obtain the formula for the estimate of the variance of heritability:

Var(h^2) &= Var\left(\frac{\hat{\sigma}_g^2}{\hat{\sigma}_g^2 + \hat{\sigma}_e^2}\right) \\
         &= Var\left(\frac{\hat{\sigma}_g^2}{\hat{\sigma}_T^2}\right) \\
         &= \left(\frac{\hat{\sigma}_g^2}{\hat{\sigma}_T^2}\right)^2
            \left[\frac{Var(\hat{\sigma}_g^2)}{(\hat{\sigma}_g^2)^2}
            - 2\left(\frac{Cov(\hat{\sigma}_g^2, \hat{\sigma}_T^2)}{\hat{\sigma}_g^2\hat{\sigma}_T^2}\right)
            + \frac{Var(\hat{\sigma}_T^2)}{(\hat{\sigma}_T^2)^2}\right]

Note that

Var(\hat{\sigma}_T^2) &= Var(\hat{\sigma}_g^2 + \hat{\sigma}_e^2) \\
                      &= Var(\hat{\sigma}_g^2) + 2Cov(\hat{\sigma}_g^2,\hat{\sigma}_e^2)
                         + Var(\hat{\sigma}_e^2) \\
Cov(\hat{\sigma}_g^2,\hat{\sigma}_T^2) &= Cov(\hat{\sigma}_g^2,\hat{\sigma}_g^2 + \hat{\sigma}_e^2) \\
                                       &= Cov(\hat{\sigma}_g^2,\hat{\sigma}_g^2)
                                          + Cov(\hat{\sigma}_g^2,\hat{\sigma}_e^2) \\
                                       &= Var(\hat{\sigma}_g^2) + Cov(\hat{\sigma}_g^2,\hat{\sigma}_e^2)

So:

Var(h^2) = (h^2)^2\left[\frac{Var(\hat{\sigma}_g^2)}{(\hat{\sigma}_g^2)^2}
          - 2 \frac{Var(\hat{\sigma}_g^2) + Cov(\hat{\sigma}_g^2,\hat{\sigma}_e^2)}
            {\hat{\sigma}_g^2(\hat{\sigma}_g^2 + \hat{\sigma}_e^2)}
          + \frac{Var(\hat{\sigma}_g^2) + 2Cov(\hat{\sigma}_g^2,\hat{\sigma}_e^2) + Var(\hat{\sigma}_e^2)}
            {(\hat{\sigma}_g^2 + \hat{\sigma}_e^2)^2}\right]

The formulas and methods for calculating the variance components can be found in Finding the Variance Components. The (estimated) sampling variances and covariances of the (estimates of the) variance components, including Cov(\hat{\sigma}_g^2,\hat{\sigma}_e^2), are the entries of the inverse of the Average Information Matrix,

\begin{bmatrix}
Var(\hat{\sigma}_g^2) & Cov(\hat{\sigma}_g^2,\hat{\sigma}_e^2) \\
Cov(\hat{\sigma}_g^2,\hat{\sigma}_e^2) & Var(\hat{\sigma}_e^2)
\end{bmatrix} = AI^{-1} ,

where AI is the Average Information Matrix for

\theta = (\sigma_g^2, \sigma_e^2)' .

(See Definitions and Algorithm Details.)
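
A minimal numpy sketch (illustrative names, not SVS code) of this Var(h^2) computation, taking the 2 \times 2 Average Information matrix as input:

    import numpy as np

    def var_h2(var_g, var_e, ai_matrix):
        """Var(h^2) from the variance-component estimates and the 2x2 AI
        matrix for theta = (sigma_g^2, sigma_e^2)'."""
        cov = np.linalg.inv(ai_matrix)   # sampling variances/covariances
        vg, ve, cge = cov[0, 0], cov[1, 1], cov[0, 1]
        var_t = vg + 2 * cge + ve        # Var(sigma_T^2)
        cov_gt = vg + cge                # Cov(sigma_g^2, sigma_T^2)
        tot = var_g + var_e              # sigma_T^2
        h2 = var_g / tot
        return h2**2 * (vg / var_g**2 - 2 * cov_gt / (var_g * tot)
                        + var_t / tot**2)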

Solving the Mixed Model Equation Using EMMA

The generalized least squares (GLS) solution to

y = X\beta + Zu + \epsilon

may now be obtained. Note that the variance V of Zu + \epsilon is

V = \sigma^2 (ZKZ' + \delta I).

If we can find a matrix B such that

BB' = H = \frac{V}{\sigma^2} = ZKZ' + \delta I,

we can substitute y^* = B^{-1}y, X^* = B^{-1}X, and \epsilon^* = B^{-1}(Zu + \epsilon) to get

y^* = X^*\beta + \epsilon^*.

(The Cholesky decomposition of H is one way to obtain such a matrix B.) This equation can be solved for \beta through ordinary least squares (OLS), because we have

Var(\epsilon^*) = Var(B^{-1}(Zu + \epsilon)) = B^{-1}V(B^{-1})' = \sigma^2 B^{-1}H(B^{-1})' = \sigma^2 B^{-1}BB'(B^{-1})' = \sigma^2 I.

The value of the residual sum of squares (RSS) from solving the transformed equation y^* = X^*\beta + \epsilon^* is the Mahalanobis RSS for the original equation y = X\beta + Zu + \epsilon.

Taking advantage of the eigendecomposition of H performed in the EMMA algorithm, the computation of a valid B^{-1} can be simplified to

B^{-1} = diag(1/\sqrt{\xi_1 + \delta}, ..., 1/\sqrt{\xi_n + \delta}) U_F' .
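
A minimal numpy sketch (not SVS code; names are illustrative) of this GLS solve, building B^{-1} from the eigendecomposition of ZKZ' and then applying OLS:

    import numpy as np

    def solve_beta_emma(y, X, Z, K, delta):
        """Build B^{-1} from the spectral decomposition of ZKZ', then
        solve the transformed system by ordinary least squares."""
        xis, U_F = np.linalg.eigh(Z @ K @ Z.T)   # eigenvectors independent of delta
        Binv = np.diag(1.0 / np.sqrt(xis + delta)) @ U_F.T
        y_star, X_star = Binv @ y, Binv @ X
        beta_hat, _, _, _ = np.linalg.lstsq(X_star, y_star, rcond=None)
        resid = y_star - X_star @ beta_hat
        mrss = resid @ resid                     # Mahalanobis RSS of the original model
        return beta_hat, mrss, Binv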

Using the Mixed Model for Association Studies

The Exact Model

Association studies are typically carried out by testing the hypothesis H_0:\beta_k = 0 for each of m loci, one at a time, on the basis of the model

y_i = \sum_f \beta_f X_{if} + \beta_k M_{ik} + \eta_{i\bar{k}},

where M_{ik} is the minor allele count of marker k for individual i, \beta_k is the (fixed) effect size of marker k, and the terms \beta_f X_{if} represent other fixed effects such as the intercept and any fixed covariates. The error term \eta_{i\bar{k}} is

\eta_{i\bar{k}} = \sum_{s \ne k} \beta_s M_{is} + \epsilon_i .

If we assume the n individuals are unrelated and there is no dependence across the genotypes, the \eta_{i\bar{k}} values will be independently and identically distributed (i.i.d.), and thus simple linear regressions will make appropriate inferences for each of the \beta_k.

However, the variance of the first term of \eta_{i\bar{k}} actually comes closer to being proportional to a matrix of the relatedness or kinship between samples. Thus, if we write

u_{i\bar{k}} = \sum_{s \ne k} \beta_s M_{is},

we see that the equation for y_i reduces to the mixed-model equation

y_i = \sum_f \beta_f X_{if} + \beta_k M_{ik} + u_{i\bar{k}} + \epsilon_i .

Note that strictly speaking, to use this equation, we should base not only the kinship, but also the variance components, upon all markers except for marker k.

The EMMAX Approximations and Technique

Even using the EMMA technique, finding the kinship matrix and the variance components for, and solving for, \beta_k separately for every marker k would be a daunting task. However, we may make two approximations:

  1. Let

    \eta_i = \sum_{s=1}^m \beta_s M_{is} + \epsilon_i

    approximate \eta_{i\bar{k}}, and let

    u_i = \sum_s \beta_s M_{is},

    approximate u_{i\bar{k}}. Then, we have

    y_i &= \sum_f \beta_f X_{if} + \beta_k M_{ik} + \eta_i \\
    &= \sum_f \beta_f X_{if} + \beta_k M_{ik} + u_i + \epsilon_i .

    To solve this, we need to compute the kinship matrix just once, using all markers. That kinship matrix may then be used to solve this equation for every marker k.

  2. Find the variance components once – specifically, for the system of equations

    y_i = \sum_f \beta_f X_{if} + u_i + \epsilon_i,

    using the kinship matrix K which is computed just once for all markers k. Then, use these variance components \sigma^2_g (for the variance of the u_i which is \sigma^2_g K) and \sigma^2_e (for the variance of the \epsilon_i which is \sigma^2_e
I) to apply the GLS method for solving

    y_i = \sum_f \beta_f X_{if} + \beta_k M_{ik} + u_i + \epsilon_i

    for \beta_k for every marker k.

This technique, which is called EMMAX (EMMA eXpedited) and was published in [Kang2010], allows mixed models to be used for genome-wide association testing within a very reasonable amount of computing time.
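
A sketch of the resulting per-marker loop (illustrative names, not SVS code): B^{-1} comes from variance components estimated once under the null model, and each marker is then tested by OLS on the transformed data:

    import numpy as np

    def emmax_scan(y, X_cov, genos, Binv):
        """genos is an n x m matrix of minor-allele counts; returns the
        estimated beta_k for each marker k."""
        y_s, X_s = Binv @ y, Binv @ X_cov
        betas = []
        for k in range(genos.shape[1]):
            Xk = np.column_stack([X_s, Binv @ genos[:, k]])
            sol, _, _, _ = np.linalg.lstsq(Xk, y_s, rcond=None)
            betas.append(sol[-1])                # beta_k for marker k
        return np.array(betas)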

Scaling the Kinship Matrix

The actual EMMAX technique, however, before using the kinship matrix K, scales it (and thus effectively scales u) by an amount that makes the expectation of the estimated population variance of the (scaled) u_i equal to \sigma^2_g, just as the expectation of the estimated population variance for the \epsilon_i is \sigma^2_e.

This is done by defining a scaling factor w as

w = \frac{n - 1}{Tr(CKC)},

and multiplying K by it to get

K_S = w K.

Here, C = I - 1_n1_n'/n and 1_n is a length-n vector of ones. C is called a “Gower’s centering matrix”: it has the property that when you apply it to a vector v to get Cv, it subtracts the mean 1_n'v/n of the components of v from each component of v:

Cv = [I - 1_n1_n'/n]v = v - 1_n[1_n'v/n] = v - (1_n'v/n)1_n

The reasoning for using this scaling factor is as follows:

Suppose we have a vector v of elements v_i. Estimate the population variance q of these elements over all the n samples i. (This estimate is sometimes called the “sample variance”.) The (unbiased) estimate would be

q &= \frac{\sum_i^n (v_i - \bar{v})^2}{n - 1} \\
  &= \frac{\sum_i^n v_i^2 - n \bar{v}^2}{n - 1},

where \bar{v} is the average of the components v_i of v.

However, another way to write this is

q = \frac{(v - \bar{v}1_n)'(v - \bar{v}1_n)}{n - 1},

since

v'v &= \sum_i^n v_i^2,\\
(\bar{v}1_n)'v &= v'(\bar{v}1_n) = \bar{v}\sum_i^n v_i = \frac{\sum_i^n v_i}n \sum_i^n v_i = n \bar{v}^2,\\
(\bar{v}1_n)'(\bar{v}1_n) &= n \bar{v}^2,

and

q &= \frac{v'v - (\bar{v}1_n)'v - v'(\bar{v}1_n) + (\bar{v}1_n)'(\bar{v}1_n)}{n - 1} \\
  &= \frac{\sum_i^n v_i^2 - 2n \bar{v}^2 + n \bar{v}^2}{n - 1} \\
  &= \frac{\sum_i^n v_i^2 - n \bar{v}^2}{n - 1}.

Two other ways to write this are

q = \frac{Tr((v - \bar{v}1_n)'(v - \bar{v}1_n))}{n - 1},

since (v - \bar{v}1_n)'(v - \bar{v}1_n) is a scalar, and

q = \frac{Tr((v - \bar{v}1_n)(v - \bar{v}1_n)')}{n - 1},

since Tr(AB) = Tr(BA) for any two matrices A and B.

We note that

Cv = v - (1_n'v/n)1_n = v - \bar{v}1_n ;

therefore, the estimated population variance q may be written as

q &= \frac{Tr((v - \bar{v}1_n)(v - \bar{v}1_n)')}{n - 1} \\
  &= \frac{Tr(Cv(Cv)')}{n - 1} \\
  &= \frac{Tr(Cvv'C)}{n - 1}.

Looking at v as a random variable of n dimensions and q as a scalar random variable, let us write the expectation of the estimated population variance q as:

E(q) &= E(\frac{Tr(Cvv'C)}{n - 1}) \\
     &= \frac{Tr(CE(vv')C)}{n - 1}.

But K = E(vv') defines a relationship matrix among the elements of possible v. (Note that the possible instances of v themselves might not be “centered” – that is, the components of the v may or may not have zero averages.) We now write the expected estimated population variance E(q) as

E(q) = \frac{Tr(CKC)}{n - 1}.

We wish to “scale E(q) to one” – that is, set E(q) to one by scaling K appropriately. We do that by defining K_S = w K, where

w = \frac{n - 1}{Tr(CKC)},

and noting that

E(q_S) = \frac{Tr(CK_SC)}{n - 1} = w \frac{Tr(CKC)}{n - 1} = 1.
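
A minimal numpy sketch (illustrative) of this scaling:

    import numpy as np

    def scale_kinship(K):
        """K_S = w*K with w = (n-1)/Tr(CKC), C the Gower's centering matrix."""
        n = K.shape[0]
        C = np.eye(n) - np.ones((n, n)) / n
        w = (n - 1) / np.trace(C @ K @ C)
        return w * K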

Note

Some SVS mixed-model features scale the kinship matrix before using it, as described above. The other SVS mixed-model features, including the GBLUP-related features, do not scale the kinship matrix before using it.

Note

When the EMMA algorithm for GBLUP is used, GBLUP will output what the computed pseudo-heritability would have been had the matrix been scaled.

Further Optimization When Covariates Are Present

The following technique is mentioned in passing in [Segura2012] and is used both in [Vilhjalmsson2012] and in the mixed-model tools of SVS.

If we have a mixed linear model with fixed-effect covariates X_f, one particular “more interesting” fixed-effect covariate X_k, and a random-effect covariate u_S for which the scaled relationship matrix is K_S,

y = X_f \beta_{kf} + X_k \beta_k + u_S + \epsilon,

and we have this model for many k and we don’t need to find the covariate coefficients \beta_{kf} for any of these models, and we have a matrix B such that BB' = K_S + \delta I (see Solving the Mixed Model Equation Using EMMA), we can perform the following optimization:

  1. Solve the ordinary-least-squares (OLS) “null hypothesis” or reduced-model problem

    B^{-1}y = B^{-1}X_f \beta_{h0} + \epsilon_{h0}

    to find \hat{\beta}_{h0} as an estimate for \beta_{h0}.

    Designate the (Mahalanobis) RSS obtained from solving this equation as

    mrss_{h0} = (B^{-1}y - B^{-1}X_f \hat{\beta}_{h0})'(B^{-1}y - B^{-1}X_f \hat{\beta}_{h0}).

  2. Perform a QR decomposition of B^{-1}X_f to get

    QR = B^{-1}X_f

    where Q and R are the “thin”, “reduced”, or “economic” versions of Q and R.

  3. Define

    M = (I - QQ'),

    giving us

    MB^{-1} = (I - QQ')B^{-1} .

  4. Transform the original equation by pre-multiplying it by MB^{-1} to get

    MB^{-1}y = MB^{-1}X_f\beta_{kf} + MB^{-1}X_k\beta_k + MB^{-1}u_S + MB^{-1}\epsilon

    But

    MB^{-1}X_f = B^{-1}X_f - QQ'B^{-1}X_f = QR - QQ'QR = (Q - QQ'Q)R = (Q - Q)R = 0,

    because the columns of Q are “orthogonal” and of “unit length” and so Q'Q = I and QQ'Q = Q. Thus, we have

    MB^{-1}y = MB^{-1}X_k\beta_k + MB^{-1}u_S + MB^{-1}\epsilon.

    MB^{-1}y may be re-written as

    MB^{-1}y &= (I - QQ')B^{-1}y = B^{-1}y - QQ'B^{-1}y = B^{-1}y - QR\hat{\beta}_{h0} \\
   &= B^{-1}y - B^{-1}X_f\hat{\beta}_{h0},

    the last step being because

    \hat{\beta}_{h0} = ((B^{-1}X_f)'B^{-1}X_f)^{-1}(B^{-1}X_f)'B^{-1}y,

    and, since QR = B^{-1}X_f,

    \hat{\beta}_{h0} = ((QR)'QR)^{-1}(QR)'B^{-1}y,

    and

    \begin{aligned}
QR \hat{\beta}_{h0}
&= QR ((QR)'QR)^{-1}(QR)'B^{-1}y \\
&= QR (R'Q'QR)^{-1}(QR)'B^{-1}y \\
&= QR (R'R)^{-1}(QR)'B^{-1}y \\
&= QR (R'R)^{-1}R'Q'B^{-1}y \\
&= QR R^{-1}(R')^{-1}R'Q'B^{-1}y \\
&= QQ'B^{-1}y. \\
\end{aligned}

    Thus,

    MB^{-1}y = B^{-1}y - B^{-1}X_f\hat{\beta}_{h0} = MB^{-1}X_k\beta_k + MB^{-1}u_S + MB^{-1}\epsilon.

    Define \epsilon_{MB} as

    \epsilon_{MB} = MB^{-1}u_S + MB^{-1}\epsilon ,

    so that

    MB^{-1}y = B^{-1}y - B^{-1}X_f\hat{\beta}_{h0} = MB^{-1}X_k\beta_k + \epsilon_{MB}.

    If we pre-multiply the original problem simply by B^{-1}, we get

    B^{-1}y = B^{-1}X_f \beta_{kf} + B^{-1}X_k \beta_k + B^{-1}(u_S + \epsilon),

    which may be solved as an ordinary-least-squares (OLS) problem (Solving the Mixed Model Equation Using EMMA), making Var(B^{-1}(u_S + \epsilon)) proportional to I.

    Meanwhile,

    Var(MB^{-1}y) &= Var(B^{-1}y - B^{-1}X_f\hat{\beta}_{h0}) = Var(B^{-1}y) = Var(B^{-1}(u_S + \epsilon)) \\
&= Var(MB^{-1}X_k\beta_k + \epsilon_{MB}) = Var(\epsilon_{MB}),

    and thus Var(MB^{-1}y) = Var(\epsilon_{MB}) is proportional to I. This means that

    MB^{-1}y = B^{-1}y - B^{-1}X_f\hat{\beta}_{h0} = MB^{-1}X_k\beta_k + \epsilon_{MB}

    may be treated as an ordinary-least-squares (OLS) problem. This may now be solved for all k.

Note

For optimization, SVS pre-computes the matrix product MB^{-1} and uses this product as one matrix to help perform all of the regressions involving X_k.

Note

The matrix M is the “annihilator matrix” for the null hypothesis problem

B^{-1}y = B^{-1}X_f \beta_{h0} + \epsilon_{h0} .

Designate the Mahalanobis Residual Sum of Squares (Mahalanobis RSS) for marker k as

mrss_k = (B^{-1}y - B^{-1}X_f \hat{\beta}_{kf} - B^{-1}X_k\hat{\beta}_k)'(B^{-1}y - B^{-1}X_f \hat{\beta}_{kf} - B^{-1}X_k\hat{\beta}_k),

which is optimized to

mrss_k = (MB^{-1}y - MB^{-1}X_k\hat{\beta}_k)'(MB^{-1}y - MB^{-1}X_k\hat{\beta}_k).

This is the Residual Sum of Squares (RSS) value for the regression as transformed by pre-multiplying it by B^{-1}.
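
A minimal numpy sketch (not SVS code; names are illustrative) of this optimization, pre-computing MB^{-1} once and then performing a single-column regression for each marker:

    import numpy as np

    def precompute_mbinv(Binv, X_f):
        """Compute M B^{-1} = (I - QQ')B^{-1} once, where QR = B^{-1}X_f."""
        Q, _ = np.linalg.qr(Binv @ X_f)          # "reduced" QR by default
        return Binv - Q @ (Q.T @ Binv)

    def marker_beta_and_mrss(MBinv, y, x_k):
        """Regress MB^{-1}y on the single column MB^{-1}x_k; returns
        beta_k and the optimized Mahalanobis RSS, mrss_k."""
        ty, tx = MBinv @ y, MBinv @ x_k
        beta_k = (tx @ ty) / (tx @ tx)
        r = ty - tx * beta_k
        return beta_k, r @ r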

Note

If we have a (completely fixed-effect) linear model with covariates X_f and one particular “more interesting” covariate X_k,

y = X_f \beta_{kf} + X_k \beta_k + \epsilon,

and we have this model for many k and we don’t need to find the covariate coefficients \beta_{kf} for any of these models, we can perform the same kind of optimization.

  1. Solve the ordinary-least-squares (OLS) “null hypothesis” or reduced-model problem

    y = X_f \beta_{h0} + \epsilon_{h0}

    to find \hat{\beta}_{h0} as an estimate for \beta_{h0}.

  2. Perform a QR decomposition of X_f to get

    QR = X_f

    where Q and R are the “thin”, “reduced”, or “economic” versions of Q and R.

  3. Define

    M = (I - QQ')

  4. Transform the original equation by pre-multiplying it by M to get

    My = MX_f\beta_{kf} + MX_k\beta_k + M\epsilon

    But

    MX_f = X_f - QQ'X_f = QR - QQ'QR = (Q - QQ'Q)R = (Q - Q)R = 0,

    because the columns of Q are “orthogonal” and of “unit length” and so Q'Q = I and QQ'Q = Q. Thus, we have

    My = MX_k\beta_k + M\epsilon.

    My may be re-written as

    My &= (I - QQ')y = y - QQ'y = y - QR\hat{\beta}_{h0} \\
   &= y - X_f\hat{\beta}_{h0},

    because \hat{\beta}_{h0} = R^{-1}Q'y and R\hat{\beta}_{h0} = RR^{-1}Q'y = Q'y.

    Thus,

    My = y - X_f\hat{\beta}_{h0} = MX_k\beta_k + M\epsilon.

    This is equivalent to the ordinary-least-squares (OLS) problem

    My = y - X_f\hat{\beta}_{h0} = MX_k\beta_k + \epsilon_M,

    where the variance of \epsilon_M is proportional to I. This is because

    Var(My) = Var(y) = Var(\epsilon)

    which we assume to be proportional to I.

    Note that the matrix M is the “annihilator matrix” for the null hypothesis problem

    y = X_f \beta_{h0} + \epsilon_{h0} .

The Mixed Model Equation for Gene-by-Environment Interactions

Note

This algorithm is distinct from the GBLUP-related algorithm discussed later in Incorporating Gene by Environment Interactions into GBLUP.

Suppose there are n measurements of a phenotype which is influenced by m fixed effects and n instances of one random effect. Also suppose there are one or more (say, e) gene-by-environment effects, where the interaction is potentially with one “more interesting” predictor variable.

We model this with a full model and a reduced model. The full model is:

y = X_c \beta_{kc} + X_i \beta_{ki} + X_k \beta_{kp} + X_{ip} \beta_{ip} + u_{full} + \epsilon_{full} ,

and the reduced model is:

y = X_c \beta_{krc} + X_i \beta_{kri} + X_k \beta_{rkp} + u_{reduced} + \epsilon_{reduced} ,

where y is an n-vector of observed phenotypes, X_c is an n \times m matrix of m fixed covariates, X_i is an n \times e matrix of e fixed terms that are being tested for gene-environment interactions, X_k is an n-vector containing the current “more interesting” covariate or predictor variable that may be doing the interacting, and X_{ip} is an n \times e matrix containing the e interaction terms created by multiplying the columns of X_i element-by-element with X_k (the “Predictor”). All of the above \beta terms correspond to the X terms with which they are shown and to whether they apply to the full model or to the reduced model. The u and \epsilon terms are random effect and error terms, respectively.

We approximate this using a technique like that used in the EMMAX method without interactions (The EMMAX Approximations and Technique). Specifically, we find the variance components once, using the parts of the above equations that are independent of X_k altogether. This works as follows: we write

y = X_c \beta_{cvc} + X_i \beta_{ivc} + u_{vc} + \epsilon_{vc}

where “vc” stands for (finding the) “variance components”.

We assume that

Var(u_{vc}) = \sigma^2_g K

is the variance-covariance matrix for the random effect u_{vc}, and that

Var(\epsilon_{vc}) = \sigma^2_e I,

so that

Var(y) = \sigma^2_g K + \sigma^2_e I.

The EMMA technique (Finding the Variance Components Using EMMA and Solving the Mixed Model Equation Using EMMA) is then used to determine the variance components \sigma^2_g and \sigma^2_e, as well as a matrix B (and its inverse) such that

BB' = H = \frac{Var(y)}{\sigma^2_g} = K + \frac{\sigma^2_e}{\sigma^2_g} I.

It is now possible to go through every marker (“more interesting covariate”) k and compute (as an EMMAX-type approximation) the following full and reduced models:

B^{-1}y = B^{-1}X_c \beta_{kc} + B^{-1}X_i \beta_{ki} + B^{-1}X_k \beta_{kp} + B^{-1}X_{ip} \beta_{ip} + B^{-1}(u_{full} + \epsilon_{full})

for the full model, where the term B^{-1}(u_{full} + \epsilon_{full}) is assumed to be an error term whose variance is proportional to the identity matrix, and

B^{-1}y = B^{-1}X_c \beta_{krc} + B^{-1}X_i \beta_{kri} + B^{-1}X_k \beta_{rkp} + B^{-1}(u_{reduced} + \epsilon_{reduced})

for the reduced model, where the term B^{-1}(u_{reduced} + \epsilon_{reduced}) is likewise assumed to be an error term whose variance is proportional to the identity matrix.

Optimization when Gene-Environment Interaction Terms Are Included

While SVS explicitly computes the full model mentioned above and outputs all of its \beta values, it computes the reduced model only in an optimized form. This optimized computation is sufficient to determine the RSS of the reduced-model equation above (which is the approximate Mahalanobis RSS for the original reduced-model equation), and thus to determine the full-vs-reduced p-value for the complete model.

This reduced-model optimization is to solve

MB^{-1}y = MB^{-1}X_k\beta_{rkp} + \epsilon_{MB},

where M = (I - QQ'), and Q is derived from performing a QR decomposition, as

QR = B^{-1}[X_c | X_i] .

Note

This optimization technique is detailed above in Further Optimization When Covariates Are Present. (Note that in that section, it is the full model that is being optimized, but here, we are optimizing the reduced model.) Please substitute [X_c | X_i] for X_f and \beta_{rkp} for \beta_k in that section.

Note

For the similar (completely fixed-effect) linear-model problem with full model

y = X_c \beta_{kc} + X_i \beta_{ki} + X_k \beta_{kp} + X_{ip} \beta_{ip} + \epsilon_{full}

and reduced model

y = X_c \beta_{krc} + X_i \beta_{kri} + X_k \beta_{rkp} + \epsilon_{reduced},

we may optimize computing the reduced model using the technique shown in the (completely fixed-effect) linear-model note to Further Optimization When Covariates Are Present, where that optimization is for a full model, and [X_c | X_i] should be substituted for the X_f in that note.

The reduced-model Mahalanobis RSS that we will use for marker k thus becomes, in its optimized form,

mrss_{kr} = (MB^{-1}y - MB^{-1}X_k\hat{\beta}_{kr})'(MB^{-1}y - MB^{-1}X_k\hat{\beta}_{kr}).

Meanwhile, for the full-model Mahalanobis Residual Sum of Squares (Mahalanobis RSS) for marker k, we use:

mrss_k = (B^{-1}y - B^{-1}X_c \beta_{kc} - B^{-1}X_i \beta_{ki} - B^{-1}X_k \beta_{kp} - B^{-1}X_{ip} \beta_{ip})' \\
(B^{-1}y - B^{-1}X_c \beta_{kc} - B^{-1}X_i \beta_{ki} - B^{-1}X_k \beta_{kp} - B^{-1}X_{ip} \beta_{ip}).
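
To illustrate, here is a minimal numpy/scipy sketch (not SVS code; names are illustrative) of the per-marker gene-by-environment test. It builds the interaction terms X_{ip} and fits the B^{-1}-transformed full and reduced models by OLS; the full-vs-reduced F statistic shown is a standard construction supplied here for illustration:

    import numpy as np
    from scipy import stats

    def gxe_test(y, X_c, X_i, x_k, Binv):
        """Full-vs-reduced test of the e interaction terms for marker k."""
        X_ip = X_i * x_k[:, None]                # element-wise interactions
        reduced = np.column_stack([X_c, X_i, x_k])
        full = np.column_stack([reduced, X_ip])

        def mrss(X):                             # RSS of the transformed model
            tX, ty = Binv @ X, Binv @ y
            sol, _, _, _ = np.linalg.lstsq(tX, ty, rcond=None)
            r = ty - tX @ sol
            return r @ r

        n, e = len(y), X_ip.shape[1]
        mrss_full, mrss_red = mrss(full), mrss(reduced)
        df2 = n - full.shape[1]
        F = ((mrss_red - mrss_full) / e) / (mrss_full / df2)
        return stats.f.sf(F, e, df2)             # full-vs-reduced p-value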

The Multi-Locus Mixed Model (MLMM)

For complex traits controlled by several large-effect loci, a single-locus test may not be appropriate, especially in the presence of population structure.

Therefore, [Segura2012] has proposed a simple stepwise mixed-model regression with forward inclusion and backward elimination of genotypic markers as fixed effect covariates. This method, called the Multi-Locus Mixed Model (MLMM), proceeds as follows:

  1. Begin with an initial model that includes, as its fixed effects, only the intercept and any additional covariates you may have specified.

  2. Using this model, perform an EMMAX scan through all markers (that you have not specified as additional covariates).

  3. From the markers scanned above, select the most significant marker and add it to the model as a new fixed-effect covariate (“cofactor”), creating a new model.

  4. Repeat (2) and (3) (forward inclusion) until a pre-specified maximum number of forward steps is reached or until forward inclusion must be stopped for some other reason (see note below).

  5. For each selected marker in the current model, temporarily remove it from the fixed effects and perform an EMMAX scan over only that marker.

  6. Eliminate, from the current model, the marker that came out as least significant using the above test. A new smaller model is created.

  7. Repeat (5) and (6) (backward elimination) until only one selected marker is left.

The variance components are re-estimated between each forward and backward step, while the same kinship matrix is used throughout the calculations.

Note

The following are reasons why forward inclusion might be stopped early:

  • When the pseudo-heritability \big(\hat{\sigma}^2_g / Var(y)\big) estimate has become close to zero.

  • When all of the variance is explained by the currently-selected fixed-effect covariate markers (cofactors) and any covariates you may have specified.

  • When the latest significant marker that has been selected is collinear with the previously-selected fixed-effect covariate markers (cofactors) and any covariates you may have specified.

  • When the number of active samples is no longer enough to continue analysis.

  • When the number of active (non-selected) markers is no longer enough to analyze the resulting models.

Model Criteria

The result of this stepwise regression is a series of models. Several model criteria have been explored by the authors of [Segura2012] to determine how appropriate any of the models is:

  • Bayes Information Criterion (BIC). This is calculated as BIC = -2l_F + p\log(n), where l_F is the full-model log-likelihood, p is the number of model parameters (one for the intercept, one for \delta, one for each marker covariate used in the particular MLMM model, and finally one for each additional covariate used in all of the models), and n is the sample size/number of individuals.

    Given any two estimated models, the model with the lower value of BIC is the preferred choice.

    However, the authors of [Segura2012] believe this model is “too tolerant in the context of GWAS”.

  • Extended Bayes Information Criterion (Extended BIC). This is the BIC penalized by the model space dimension. Its formula is

    EBIC = BIC + 2 \log\biggl(\binom{n}{p-q}\biggr)
     = BIC + 2 \biggl(\sum_{i = n-(p-q)+1}^n \log(i)  -  \sum_{i = 1}^{p-q} \log(i) \biggr),

    where q is the initial number of model parameters (one for the intercept, one for \delta, and one for each additional covariate used in all of the models), and \binom{n}{p-q} is the total number of models which can be formed using p-q marker covariates under the assumption that these will only be selected from the best n markers.

  • Modified Bayes Information Criterion (Modified BIC). This adds a different penalty based not only on the model space dimension, but also on how many markers overall there are to test. Its formula is

    MBIC = BIC + 2 p \log\biggl(\frac{m}{2.2} - 1\biggr),

    where m is the total number of markers being tested in the current step.

  • Bonferroni Criterion. Only defined for models derived from forward selection, this selects the model with the most covariate marker loci for which the best p-value obtained from the preceding EMMAX scan was below the Bonferroni threshold. The (Bonferroni) threshold used is 1/(20 m), where m is the total number of markers being tested in the current step.

  • Multiple Bonferroni Criterion. This selects the model with the most covariate marker loci all of which have individual p-values below the Bonferroni threshold. Here, “individual p-value” is as explained in the note of Outputs from the Multi-Locus Mixed Linear Model (MLMM) Method. The (Bonferroni) threshold used is 1/(20 m), where m is the total number of markers being tested in the current step.

  • Multiple Posterior Probability of Association. This selects the model with the most covariate marker loci all of which have posterior probabilities of association above a PPA threshold. The threshold used is 0.5 . Posterior probabilities of association are based on Bayesian priors pr of 1/m for every marker (and for every step), where m is the total number of markers being tested in the current step, and are computed as follows:

    • Find the Bayes factor for marker k as

      bf = \exp\left(\frac{n\log(mrss_{h0}/mrss_k) - \log(n)}{2}\right) ,

      where mrss_{h0} and mrss_k are the values of the Mahalanobis RSS for the base model and for testing with marker k, respectively.

    • Determine the posterior odds and posterior probability as

      po = bf \frac{pr}{1 - pr}

      and

      pp = \frac{po}{1 + po} .
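
A minimal sketch (illustrative names) of this posterior-probability computation:

    import numpy as np

    def ppa(n, m, mrss_h0, mrss_k):
        """Posterior probability of association for marker k, with the
        Bayesian prior pr = 1/m."""
        bf = np.exp((n * np.log(mrss_h0 / mrss_k) - np.log(n)) / 2)
        pr = 1.0 / m
        po = bf * pr / (1 - pr)      # posterior odds
        return po / (1 + po)         # posterior probability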

Gender Correction for EMMAX and MLMM

Gender correction for EMMAX and MLMM consists simply of considering X Chromosome genotypes for males as either zero (major allele present) or one (minor allele present). No other gender correction is performed.

Note

If a genotypic spreadsheet is used, it is still expected that male X chromosomes are coded in this spreadsheet as “two major alleles” (major allele present) vs. “two minor alleles” (minor allele present).

Genomic Best Linear Unbiased Predictors (GBLUP)

Problem Statement

Suppose we have the mixed model equation

y = X_f \beta_f + u + \epsilon

over n samples, with fixed effects specified by \beta_f that include the intercept and any additional covariates you may have specified, and we assume that Var(\epsilon) = \sigma^2_e I. Also suppose that the random effects u are additive genetic merits or genomic breeding values associated with these samples, and that these may be formulated from m autosomal markers as

u = M\alpha,

where M is an n \times m matrix, the exact construction of which will be discussed later in Overall Normalization and Normalization by Individual Marker, and \alpha is a vector for which \alpha_k is the allele substitution effect (ASE) for marker k. For now, suffice it to say that matrix M is derived from the individual genotypes, with M_{ik} dependent on the genotype for the i-th sample at the k-th locus. (For inclusion of non-autosomal markers, see Correcting the GRM for Gender Using Overall Normalization and Correcting the GRM for Gender Using Normalization by Individual Marker.)

Further assume that \operatorname{E}(\alpha) = 0, which makes \operatorname{E}(u) = 0, and that \operatorname{Var}(\alpha) = \operatorname{I}\sigma^2_M, where \sigma^2_M is an (unknown) constant which is the component of variance associated with the ASE. (For inclusion of non-autosomal markers in the calculation of ASE values, see Correcting the ASE for the Gender Chromosome.)

This gives us

\operatorname{Var}(u) = \operatorname{Var}(M\alpha) = M\operatorname{Var}(\alpha)M' = MM'\sigma^2_M.

Thus, our model becomes

y = X_f \beta_f + M\alpha + \epsilon

and

Var(y) = MM'\sigma^2_M + \sigma^2_e I .

Our object is to estimate both the genomic breeding value u for every sample and the ASE for every marker.

We now discuss obtaining the matrix M and from that, a normalization factor or normalization factors and a kinship matrix.

Overall Normalization

There are two ways to normalize the entries of matrix M. The first method is Overall Normalization.

For Overall Normalization, we use the mixed model equation exactly as described above, namely

y = X_f \beta_f + M\alpha + \epsilon ,

with M_{ik} set to 2p_k, (p_k - q_k), or -2q_k, depending upon whether the genotype for the i-th sample at the k-th locus is homozygous for the minor allele, heterozygous, or homozygous for the major allele, respectively. Here, p_k and q_k are the major and minor allele frequencies for marker k, respectively.

Additionally, we write the sum of would-be variances over all the markers if each had been at Hardy-Weinberg equilibrium as

\phi = 2 \sum_{k=1}^m p_k q_k .

We now define the variance matrix for Overall Normalization as

G_O = \frac{MM'}{\phi} ,

which gives us

\operatorname{Var}(u) = MM'\sigma^2_M = \phi \sigma^2_M \frac{MM'}{\phi} = \phi \sigma^2_M G_O = \sigma^2_{Go} G_O,

where we let \sigma^2_{Go} = \phi \sigma^2_M.

Normalization by Individual Marker

The second normalization method is Normalization by Individual Marker. (This method is used by the software package GCTA [Yang2011].)

For Normalization by Individual Marker, we instead use

y = X_f \beta_f + W\alpha + \epsilon

and

Var(y) = WW'\sigma^2_W + \sigma^2_e I ,

as our mixed-model equations, where

W_{ik} = \frac{x_{ik} - 2 q_k}{\sqrt{2 p_k q_k}},

and x_{ik} is two, one, or zero, depending upon whether the genotype for the i-th sample at the k-th locus is homozygous for the minor allele, heterozygous, or homozygous for the major allele, respectively. Here also, p_k and q_k are the major and minor allele frequencies for marker k, respectively.

This time, we define our variance matrix G_W as

G_W = \frac{WW'}{m} ,

which gives us

\operatorname{Var}(u) = WW'\sigma^2_W = m \sigma^2_W \frac{WW'}{m} = m \sigma^2_W G_W = \sigma^2_{Gw} G_W,

where we let \sigma^2_{Gw} = m \sigma^2_W.

We note that if we define

N = diag(1/\sqrt{2 p_1 q_1}, 1/\sqrt{2 p_2 q_2}, ..., 1/\sqrt{2 p_m q_m}) ,

we see that

W = M N ,

where M is the matrix M defined above in Overall Normalization.
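
A minimal numpy sketch (not SVS code) of both GRM constructions. Here genos holds minor-allele counts (0, 1, or 2), so genos - 2q_k reproduces the 2p_k, p_k - q_k, -2q_k coding of M:

    import numpy as np

    def grm(genos, method="overall"):
        """Return G_O (Overall Normalization) or G_W (Normalization by
        Individual Marker) from an n x m matrix of minor-allele counts."""
        n, m = genos.shape
        q = genos.mean(axis=0) / 2               # minor allele frequencies q_k
        p = 1 - q                                # major allele frequencies p_k
        M = genos - 2 * q                        # entries 2p_k, p_k - q_k, -2q_k
        if method == "overall":
            phi = 2 * np.sum(p * q)              # sum of would-be HWE variances
            return M @ M.T / phi                 # G_O
        W = M / np.sqrt(2 * p * q)               # W = M N
        return W @ W.T / m                       # G_W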

The Genomic Relationship Matrix

For either normalization method, we refer to the matrix G (G_O or G_W) as the GBLUP Genomic Relationship Matrix (GRM). This matrix may be used as a kinship matrix for solving the GBLUP mixed-model equation, and \sigma^2_{Go} or \sigma^2_{Gw} may be thought of as the variance component \sigma^2_g for u.

Note

  1. Because this method uses a kinship matrix based on genotypes rather than on actual ancestry, the results are referred to as “Genomic Best Linear Unbiased Predictors” rather than just “Best Linear Unbiased Predictors”.

  2. Unlike some other SVS mixed-model analysis tools, SVS GBLUP does not scale its kinship matrix using a Gower’s centering matrix. Instead, the normalizing method used, whether overall or individual-marker, is considered sufficient to scale the GBLUP Genomic Relationship Matrix.

  3. Because of how it is constructed, the GBLUP Genomic Relationship Matrix is itself a “centered matrix” and is (thus) singular. However, it is still a positive semidefinite matrix and will work well as a kinship matrix.

  4. The GBLUP Genomic Relationship Matrix may either be constructed as a separate step by itself, separate from performing any analyses, or it may be computed “dynamically” or “on-the-fly” as the first part of a GBLUP analysis. Note that

    • If the Genomic Relationship Matrix is pre-computed separately from performing any analyses, genotypes from all samples will be used.

    • If, on the other hand, the GRM is computed “dynamically” as a part of a GBLUP analysis, genotypes from any samples for which the phenotype or any covariates or gene-by-environment interaction variables involved in that analysis have a missing value will not be used for computing the GRM.

    • The exception is that when predicting for missing values (Genomic Prediction) and the GRM is computed “dynamically”, genotype data from samples in the “prediction set” (those with a missing phenotype) that have missing covariate values is still used to compute the GRM. (But not data from samples with missing values for gene-by-environment interaction variables.)

Finding the Genomic Best Linear Unbiased Predictors Using EMMA

For either matrix G (G_O or G_W), and using the EMMA technique (Finding the Variance Components Using EMMA), we can find \hat{\beta}_f, \delta, and H = G + \delta I. The second of Henderson’s mixed-model equations, as modified to accommodate singular G, is

G X_f \hat{\beta}_f + (G + \delta I) \hat{u} = G y .

This may be rewritten as

(G + \delta I) \hat{u} = H \hat{u} = G y - G X_f \hat{\beta}_f .

This gives us

\hat{u} = H^{-1} G (y - X_f \hat{\beta}_f) .

Noting the following equalities,

G^2 + \delta G = G (G + \delta I) = (G + \delta I) G = G H = H G

H^{-1} G H = G

H^{-1} G = G H^{-1} ,

we may write

\hat{u} = GH^{-1}(y - X_f \hat{\beta}_f)

as a solution for the genomic BLUP.

In SVS, this is computationally streamlined by finding

\hat{\gamma} = H^{-1}(y - X_f\hat{\beta}_f),

then computing \hat{u} = G\hat{\gamma}.

Finding the Allele Substitution Effect (ASE) Using EMMA and Overall Normalization

If we now define

\hat{\alpha}_O = M'H_O^{-1}(y - X_f\hat{\beta_f}) / \phi,

where H_O is the matrix H for Overall Normalization, we find that

M\hat{\alpha}_O = \frac{MM'}{\phi}H_O^{-1}(y - X_f\hat{\beta_f}) = G_OH_O^{-1}(y - X_f\hat{\beta_f}) = \hat{u},

which makes \hat{\alpha}_O a solution for the ASE.

In SVS, this is computationally streamlined by finding

\hat{\alpha}_O = M'\hat{\gamma}_O/\phi ,

where \hat{\gamma}_O is the version of \hat{\gamma} for Overall Normalization.

Finding the ASE Using EMMA and Individual Marker Normalization

If we now define

\hat{\alpha}_W = W'H_W^{-1}(y - X_f\hat{\beta_f}) / m,

where H_W is the matrix H for Normalization by Individual Marker, we find that

W\hat{\alpha}_W = \frac{WW'}{m}H_W^{-1}(y - X_f\hat{\beta_f}) = G_WH_W^{-1}(y - X_f\hat{\beta_f}) = \hat{u} .

This makes \hat{\alpha}_W a best linear unbiased predictor (BLUP) for \alpha (as defined in Normalization by Individual Marker).

To obtain the ASE, we need to scale the elements of \hat{\alpha}_W appropriately. Define the k-th element of \hat{\alpha}_W^* by dividing element k of \hat{\alpha}_W by \sqrt{2 p_k q_k}. This calculation is equivalent to writing

\hat{\alpha}_W^* = N \hat{\alpha}_W = N N'M'H_W^{-1}(y - X_f\hat{\beta_f}) / m .

We see, however, that

M \hat{\alpha}_W^* = M N N'M'H_W^{-1}(y - X_f\hat{\beta_f}) / m
= \frac{WW'}{m} H_W^{-1}(y - X_f\hat{\beta_f})
= G_W H_W^{-1}(y - X_f\hat{\beta_f}) = \hat{u} .

Thus, we see that \hat{\alpha}_W^* is a solution for the ASE.

In SVS, we computationally streamline this by finding

\hat{\alpha}_W^* = N W' \hat{\gamma}_W / m = N N' M' \hat{\gamma}_W / m = N^2 M' \hat{\gamma}_W / m,

where \hat{\gamma}_W is the version of \hat{\gamma} for Individual Marker Normalization.
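
The following numpy sketch (illustrative names, not SVS code) collects these solves: \hat{\gamma}, \hat{u}, and the ASE under either normalization, assuming \hat{\beta}_f and \delta have already been found by EMMA:

    import numpy as np

    def gblup_predictors(y, X_f, G, delta, beta_hat):
        """gamma_hat = H^{-1}(y - X_f beta_hat) with H = G + delta*I,
        then u_hat = G gamma_hat (the genomic breeding values)."""
        H = G + delta * np.eye(len(y))
        gamma = np.linalg.solve(H, y - X_f @ beta_hat)
        return G @ gamma, gamma                  # u_hat, gamma_hat

    def ase_overall(M, gamma, phi):
        """alpha_hat_O = M' gamma / phi (Overall Normalization)."""
        return M.T @ gamma / phi

    def ase_by_marker(M, gamma, m, p, q):
        """alpha_hat_W* = N^2 M' gamma / m, with N^2 = diag(1/(2 p_k q_k))
        (Normalization by Individual Marker)."""
        return (M.T @ gamma / m) / (2 * p * q)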

Finding the Genomic Best Linear Unbiased Predictors and ASE Using AI

For any variance component \sigma^2_{vc}, we find

\hat{\gamma}_{vc} = \sigma^2_{vc} P y ,

where the variance components and the matrix P are defined in Finding the Variance Components Using the Average Information (AI) Technique. From this, we can compute

\hat{u}_{vc} = G_{vc} \hat{\gamma}_{vc}

for any variance component.

For any set of markers M_{vc} that contribute exclusively to a genomic relationship matrix (GRM) G_{vc} for just one variance component \sigma^2_{vc}, where G_{vc} is completely generated from M_{vc} as M_{vc}M'_{vc}/\phi_{vc} or as W_{vc}W'_{vc}/m_{vc}, and \phi_{vc} or m_{vc} is also generated exclusively from set M_{vc} of markers, we may compute an allele substitution effect (ASE). This ASE value is either

\hat{\alpha}_{vcO} = M'_{vc}\hat{\gamma}_{vcO}/\phi_{vc}

if Overall Normalization is being used, or

\hat{\alpha}^*_{vcW} = N_{vc} W'_{vc} \hat{\gamma}_{vcW} / m_{vc} = N_{vc}^2 M'_{vc} \hat{\gamma}_{vcW} / m_{vc}

if Normalization by Individual Marker is being used.

Note

The Gene-by-Environment Interaction feature Incorporating Gene by Environment Interactions into GBLUP (available only when using the Average Information algorithm) generates one or more GRM matrices that do not fit the proper criteria for generating ASE values. This is because these matrices are not generated directly from M_{vc}M'_{vc}/\phi_{vc} or W_{vc}W'_{vc}/m_{vc} for any set of markers.

Normalizing the ASE Itself Using EMMA

An additional output when using EMMA, for either GRM normalization choice, consists of a normalization of the allele substitution effects themselves. To do this, each ASE is divided by the SNP Standard Deviation, which is the square root of the component of variance \sigma^2_M associated with the ASE. \sigma^2_M is reconstructed by dividing the additive genetic variance by the sum of would-be variances over all the markers if each had been at Hardy-Weinberg equilibrium:

\sigma_M^2 = \frac{\sigma_G^2}{\phi} .

The normalized ASE is then:

\hat{\alpha}_{norm} = \frac{1}{\sqrt{\sigma_M^2}}\hat{\alpha}

Incorporating Gene by Environment Interactions into GBLUP

Note

This algorithm is distinct from the EMMAX-related algorithm discussed in The Mixed Model Equation for Gene-by-Environment Interactions.

Suppose there are one or more gene-by-environment interactions we wish to test for, where for each genotype-environment interaction effect i, there is an environmental variable e_i which specifies two or more categories, e_i being categorical (each environment being a category) or binary (defining two environments). We may specify the mixed model as

y = [X|X_e]\beta + u + \sum_i u_{ei} + \epsilon,

with

Var(y) = \sigma^2_{g} G + \sum_i \sigma^2_{gei} G_{ei} + \sigma^2_e I,

where X_e includes all environmental variables as separate fixed effects and u_{ei} is the vector of genotype-environment interaction effects corresponding to e_i. We use G_{ei} = G for all pairs of individuals in the same environment as each other, as defined by e_i, and G_{ei} = 0 for all pairs of individuals in separate environments.

The variance components \sigma^2_{g}, \sigma^2_{gei}, and \sigma^2_e must be found using the Average Information (AI) REML algorithm.

Genetic Correlation Using GBLUP

Suppose we have two phenotypes, y_1 and y_2, which are dependent on some genotypic data. We may write a standard bivariate linear mixed model for these as

y_1 = X_1\beta_1 + u_1 + \epsilon_1

and

y_2 = X_2\beta_2 + u_2 + \epsilon_2,

where u_1 and u_2 are polygenic effects influencing the traits y_1 and y_2, respectively.

We may alternatively think of this model as

\begin{bmatrix}
y_1 \\ y_2
\end{bmatrix} = \begin{bmatrix}
X_1\beta_1 \\ X_2\beta_2
\end{bmatrix} + \begin{bmatrix}
u_1 \\ u_2
\end{bmatrix} + \begin{bmatrix}
\epsilon_1 \\ \epsilon_2
\end{bmatrix}.

If G is the Genomic Relationship Matrix (GRM) for the genotypic data, we may write the variance/covariance matrix (V) for this model as

V = var \left(\begin{bmatrix} y_1 \\ y_2 \end{bmatrix}\right) =
  \begin{bmatrix}
    G \sigma^2_{u1} + I \sigma^2_{e1} & G \sigma^2_{u1u2} \\
    G \sigma^2_{u1u2} & G \sigma^2_{u2} + I \sigma^2_{e2}
  \end{bmatrix},

where \sigma^2_{u1u2} is the covariance between u_1 and u_2. V may also be written as

V = \sigma^2_{u1} \begin{bmatrix}
    G & 0 \\
    0 & 0
  \end{bmatrix} + \sigma^2_{u2} \begin{bmatrix}
    0 & 0 \\
    0 & G
  \end{bmatrix} + \sigma^2_{u1u2} \begin{bmatrix}
    0 & G \\
    G & 0
  \end{bmatrix} + \sigma^2_{e1} \begin{bmatrix}
    I & 0 \\
    0 & 0
  \end{bmatrix} + \sigma^2_{e2} \begin{bmatrix}
    0 & 0 \\
    0 & I
  \end{bmatrix},

from which we see we have five variance components.

SVS uses an adaptation of the AI REML algorithm to estimate these, with two variations:

  1. Simply estimate these five variance components,

  2. Or instead use

    V = \sigma^2_{u1} \begin{bmatrix}
    G & 0 \\
    0 & 0
  \end{bmatrix} + \sigma^2_{u2} \begin{bmatrix}
    0 & 0 \\
    0 & G
  \end{bmatrix} + \sigma^2_{u1u2} \begin{bmatrix}
    0 & G \\
    G & 0
  \end{bmatrix} + \sigma^2_{e1} \begin{bmatrix}
    I & 0 \\
    0 & 0
  \end{bmatrix} + \sigma^2_{e2} \begin{bmatrix}
    0 & 0 \\
    0 & I
  \end{bmatrix} + \sigma^2_{e12} \begin{bmatrix}
    0 & I \\
    I & 0
  \end{bmatrix},

    where \sigma^2_{e12} is theoretically zero, but actually may exist as a residual covariance.

Designate the following names for the above variance/covariance matrices:

G1 = \begin{bmatrix}
        G & 0 \\
        0 & 0
      \end{bmatrix},

G2 = \begin{bmatrix}
        0 & 0 \\
        0 & G
      \end{bmatrix},

G12 = \begin{bmatrix}
        0 & G \\
        G & 0
      \end{bmatrix},

E1 = \begin{bmatrix}
        I & 0 \\
        0 & 0
      \end{bmatrix},

E2 =  \begin{bmatrix}
        0 & 0 \\
        0 & I
      \end{bmatrix},

and, where used,

E12 = \begin{bmatrix}
        0 & I \\
        I & 0
      \end{bmatrix}.

This gives us, for the two variations of this algorithm,

V = \sigma^2_{u1} G1 + \sigma^2_{u2} G2 + \sigma^2_{u1u2} G12 + \sigma^2_{e1} E1 + \sigma^2_{e2} E2

and

V = \sigma^2_{u1} G1 + \sigma^2_{u2} G2 + \sigma^2_{u1u2} G12 + \sigma^2_{e1} E1 + \sigma^2_{e2} E2 + \sigma^2_{e12} E12 .

Alternative names for the variance components that are used by the outputs of SVS, along with the interpretations for these variance components, are

  • V(G1) (\sigma^2_{u1}) Variance of y_1 due to random effects

  • V(G2) (\sigma^2_{u2}) Variance of y_2 due to random effects

  • C(G12) (\sigma^2_{u1u2}) Covariance between y_1 and y_2 due to random effects

  • V(E1) (\sigma^2_{e1}) Error variance for y_1

  • V(E2) (\sigma^2_{e2}) Error variance for y_2 and, if necessary,

  • C(E12) (\sigma^2_{e12}) (Residual) covariance of y_1 and y_2 due to error terms.

Please see [LeeSH2012] for further details about this algorithm.
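
As an illustration, the block variance/covariance matrices above may be assembled as follows (a numpy sketch with illustrative names). These matrices could then be supplied as the complete component list to an AI REML routine such as the one sketched earlier, adapted so that no extra identity component is appended:

    import numpy as np

    def bivariate_components(G):
        """Build G1, G2, G12, E1, E2, E12 from the n x n GRM G."""
        n = G.shape[0]
        Z = np.zeros((n, n))
        I = np.eye(n)
        G1  = np.block([[G, Z], [Z, Z]])
        G2  = np.block([[Z, Z], [Z, G]])
        G12 = np.block([[Z, G], [G, Z]])
        E1  = np.block([[I, Z], [Z, Z]])
        E2  = np.block([[Z, Z], [Z, I]])
        E12 = np.block([[Z, I], [I, Z]])
        return [G1, G2, G12, E1, E2, E12]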

Correcting GBLUP for Gender

When using the GBLUP features of SVS, there are several steps that may be taken to correct for gender:

  • Correct the genomic relationship matrix (GRM) for gender by using male-specific and female-specific values for the genotypes of the gender chromosome, along with modifying the normalization (for either normalization method).

  • Correct the gender-chromosome allele-substitution effect (ASE) values to match the gender corrections done for the GRM. (To do this requires defining separate ASE values for males vs. females.)

  • When AI REML is being used, and thus multiple GRM matrices may be analyzed, one GRM matrix (presumably gender-corrected) may be used for the gender chromosome and a different GRM matrix may be used for the other chromosomes.

Correcting the GRM for Gender

For markers within the gender chromosome (usually X), there are two factors:

  • The number of copies of the alternate allele at any locus for males can only be zero or one.

  • There are three assumptions that can be made about how much effect a gender chromosome allele has in males vs. how much effect a gender chromosome allele has in females. These assumptions go by the name of Dosage Compensation, and are:

    • Full dosage compensation: Each allele in females has only half the effect that an allele has in males.

    • Equal X-linked genetic variance for males and females: This occurs when each allele in females has 1/\sqrt{2} times as much effect as each allele in males.

    • No dosage compensation: Each allele has a similar effect on the trait.

For a further discussion of Dosage Compensation, please see A Further Note About Dosage Compensation below, as well as [Yang2011].

Correcting the GRM for Gender Using Overall Normalization

We first compute a combined value for the major and minor allele frequencies p_k and q_k for gender-chromosome marker k and for all samples based on the total frequencies of the alleles over all occurrences of the gender chromosome, as

p_k = \frac{c_m p_{km} + 2 c_f p_{kf}}{c_m + 2 c_f}

and

q_k = \frac{c_m q_{km} + 2 c_f q_{kf}}{c_m + 2 c_f} ,

where c_m and c_f are the number of male samples and the number of female samples, respectively, and p_{km}, q_{km}, p_{kf} and q_{kf} are the major and minor allele frequencies for male samples only and female samples only, respectively.
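For instance, these combined frequencies could be computed as in the following minimal sketch (hypothetical names; f_m and f_f may be per-marker arrays of male-only and female-only frequencies of the same allele):

```python
import numpy as np

def combined_freq(f_m, f_f, c_m, c_f):
    """Combined frequency over all X chromosomes: males contribute one
    copy each and females two, so f = (c_m*f_m + 2*c_f*f_f)/(c_m + 2*c_f)."""
    return (c_m * np.asarray(f_m) + 2 * c_f * np.asarray(f_f)) / (c_m + 2 * c_f)
```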

To begin with, we use the following entries for matrix M for the gender chromosome:

  • For females, we use half of what we would have used for autosomes, to compensate for the fact that males have only one gender chromosome. Specifically, we use p_k, (p_k - q_k)/2, or -q_k for M_{ik}, depending upon whether the genotype for the i-th sample at the k-th locus is homozygous for the minor allele, heterozygous, or homozygous for the major allele, respectively.

  • For males, we use d p_k or -d q_k for M_{ik}, depending upon whether the genotype for the i-th sample at the k-th locus contains the minor (gender-chromosome) allele or the major (gender-chromosome) allele, respectively. The parameter d is the dosage compensation factor. Its values are

    • d = \sqrt{2} for Full dosage compensation

    • d = 1 for Equal X-linked genetic variance for males and females

    • d = 1 / \sqrt{2} for No dosage compensation.

Entries of M for markers outside of the gender chromosome are computed the same as otherwise.
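As a minimal sketch of this encoding (hypothetical names; note that the three female values are equivalently x_{ik}/2 - q_k, since p_k = 1 - q_k):

```python
import numpy as np

D_FACTOR = {"full": np.sqrt(2), "equal": 1.0, "none": 1.0 / np.sqrt(2)}

def encode_x_marker(dosage, is_male, q_k, compensation="equal"):
    """Encode one gender-chromosome column of M for Overall Normalization.
    dosage counts minor alleles: 0/1/2 for females, 0/1 for males."""
    d = D_FACTOR[compensation]
    dosage = np.asarray(dosage, dtype=float)
    male = np.asarray(is_male, dtype=bool)
    col = np.empty_like(dosage)
    col[~male] = dosage[~male] / 2.0 - q_k  # p_k, (p_k - q_k)/2, or -q_k
    col[male] = d * (dosage[male] - q_k)    # d*p_k or -d*q_k
    return col
```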

The remaining differences are that:

  1. Contributions to the GRM from gender-chromosome genotypes of female subjects are multiplied by \sqrt{2}. This applies to contributions from both M and M', so that the contribution to the covariance between two female subjects (or to the variance of one female subject) is multiplied by two.

  2. For contributions to the overall normalization factor \phi from gender-chromosome markers, we use

    w_m (p_k q_k)  +  w_f (p_k q_k) = p_k q_k,

    where w_m and w_f are the fraction of the samples that are male and female, respectively, so that w_m + w_f = 1.

    Note

    p_{km} q_{km}, which is approximated by p_k q_k, is the variance of the male-sample genotype data at marker k, as coded into matrix M, and p_{kf} q_{kf}, which is approximated by p_k q_k, is the would-be variance, at Hardy-Weinberg equilibrium, of the female-sample genotype data at marker k, as coded into matrix M and afterward multiplied by \sqrt{2}.

The remaining steps for computing a GRM and computing \hat{u} are the same as they would be otherwise.

To show computing the gender-corrected GRM as a matrix formula, we have to do some factoring out of our M matrix. Without loss of generality, we may consider the matrix M to be partitioned into

\begin{bmatrix}
  M_m \\
  M_f
\end{bmatrix},

where M_m and M_f are the male and female entries for the matrix M. We also define matrices D_{mm}, D_{mf}, and D_{ff} to be diagonal with entries of 1 for autosomal-chromosome markers and of the following values for each gender-chromosome marker k:

D_{mmkk} = 1

D_{mfkk} = \sqrt{2}

D_{ffkk} = 2

Then, we can write

G_O = \left(\frac{1}{\phi}\right) \begin{bmatrix}
  M_m D_{mm} M_m' & M_m D_{mf} M_f'\\
  M_f D_{mf} M_m' & M_f D_{ff} M_f'
\end{bmatrix}

to show that contributions to the variance from gender-chromosome genotypes of female subjects (from each of M and M') are multiplied by \sqrt{2}.
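A minimal numpy sketch of this block formula (hypothetical names; M_m and M_f are the male and female rows of M, is_x flags gender-chromosome markers, and phi is the overall normalization factor):

```python
import numpy as np

def grm_overall_gender(M_m, M_f, is_x, phi):
    """G_O = (1/phi) [[Mm Dmm Mm', Mm Dmf Mf'], [Mf Dmf Mm', Mf Dff Mf']],
    with D diagonals of 1 for autosomal markers and 1, sqrt(2), 2 for
    gender-chromosome markers."""
    d_mf = np.where(is_x, np.sqrt(2.0), 1.0)  # D_mf diagonal
    d_ff = np.where(is_x, 2.0, 1.0)           # D_ff diagonal; D_mm = I
    top = np.hstack([M_m @ M_m.T, (M_m * d_mf) @ M_f.T])
    bot = np.hstack([(M_f * d_mf) @ M_m.T, (M_f * d_ff) @ M_f.T])
    return np.vstack([top, bot]) / phi
```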

The next two formulas, for either REML algorithm, remain the same. We still compute, for EMMA,

\hat{\gamma}_O = H_O^{-1}(y - X_f\hat{\beta}_f)

and

\hat{u} = G_O\hat{\gamma}_O

as before, and for AI and for any variance component \sigma^2_{vc}, we compute

\hat{\gamma}_{vcO} = \sigma^2_{vc} P y

and

\hat{u}_{vc} = G_{vcO} \hat{\gamma}_{vcO}

as before, where matrix P, as defined in Finding the Variance Components Using the Average Information (AI) Technique, is based on the gender-corrected G_{vcO}.

Correcting the GRM for Gender Using Normalization by Individual Marker

In this case, we start with using the following entries for matrix M for the gender chromosome (where later on, matrix W will be built using matrix M):

  • For females, we use 2p_k, (p_k - q_k), or -2q_k for M_{ik}, depending upon whether the genotype for the i-th sample at the k-th locus is homozygous for the minor allele, heterozygous, or homozygous for the major allele, respectively. This is the same as would be for a non-gender chromosome (that is, x_{ik} - 2 q_k, where x_{ik} is two, one, or zero).

  • For males, we use the same values as the male values for Overall Normalization, namely, d p_k or -d q_k for M_{ik}, depending upon whether the genotype for the i-th sample at the k-th locus contains the minor (gender-chromosome) allele or the major (gender-chromosome) allele, respectively. Just as for Overall Normalization, the values for the dosage compensation factor d are

    • d = \sqrt{2} for Full dosage compensation

    • d = 1 for Equal X-linked genetic variance for males and females

    • d = 1 / \sqrt{2} for No dosage compensation.

Entries of M for markers outside of the gender chromosome are computed the same as otherwise.

We next compute the normalizations, which are different between genders. We must use two different matrices (N_m and N_f) to represent the normalizations in matrix form.

Now, the normalizations themselves (and matrices N_m and N_f) may be computed as follows:

  • For females, the normalization for gender-chromosome marker k (and the k-th diagonal element for matrix N_f) is

    1 / \sqrt{2 p_k q_k} .

  • For males, the normalization for gender-chromosome marker k (and the k-th diagonal element for matrix N_m) is

    1 / \sqrt{p_k q_k} .

For non-gender chromosomes, the normalizations for both male and female samples are the same as they would have been without gender correction.

To show computing the gender-corrected GRM using Normalization by Individual Marker as a matrix formula, we note that without loss of generality, we may consider the matrix M to be partitioned as

M = \begin{bmatrix}
  M_{mX} & M_{mA} \\
  M_{fX} & M_{fA}
\end{bmatrix},

where M_{mX} and M_{fX} are the male and female entries for the gender chromosome and M_{mA} and M_{fA} are the male and female entries for the remaining (autosomal) chromosomes.

Also, we note that without loss of generality, we may partition matrices N_m and N_f as

N_m = \begin{bmatrix}
  N_{mX} & 0 \\
  0 & N_A
\end{bmatrix}

and

N_f = \begin{bmatrix}
  N_{fX} & 0 \\
  0 & N_A
\end{bmatrix} .

We now may represent a gender-corrected matrix W as

W = \begin{bmatrix}
\begin{bmatrix} M_{mX} & M_{mA} \end{bmatrix} N_m \\
\begin{bmatrix} M_{fX} & M_{fA} \end{bmatrix} N_f \end{bmatrix}

or

W = \begin{bmatrix}
  M_{mX} N_{mX} & M_{mA} N_A \\
  M_{fX} N_{fX} & M_{fA} N_A
  \end{bmatrix} .

Using this W, we compute

G_W = \frac{WW'}{m} .
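A minimal sketch of this construction (hypothetical names; it assumes the uncorrected per-marker normalization is 1/\sqrt{2 p_k q_k}, which females keep at every marker, while males switch to 1/\sqrt{p_k q_k} at gender-chromosome markers):

```python
import numpy as np

def grm_individual_marker(M_m, M_f, is_x, p, q):
    """G_W = W W' / m with gender-specific per-marker normalization."""
    n_auto = 1.0 / np.sqrt(2.0 * p * q)
    n_f = n_auto                                        # N_f diagonal
    n_m = np.where(is_x, 1.0 / np.sqrt(p * q), n_auto)  # N_m diagonal
    W = np.vstack([M_m * n_m, M_f * n_f])
    return W @ W.T / M_m.shape[1]
```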

Now, for EMMA, we compute

\hat{\gamma}_W = H_W^{-1}(y - X_f\hat{\beta}_f)

and

\hat{u} = G_W \hat{\gamma}_W

as before, and for AI and for any variance component \sigma^2_{vc}, we compute

\hat{\gamma}_{vcW} = \sigma^2_{vc} P y

and

\hat{u}_{vc} = G_{vcW} \hat{\gamma}_{vcW}

as before, where matrix P, as defined in Finding the Variance Components Using the Average Information (AI) Technique, is based on the gender-corrected G_{vcW}.

Correcting the ASE for the Gender Chromosome

When there is a designated gender chromosome, the allele substitution effects (ASE) are computed separately for females and males, although the results will only be different between the genders for the gender chromosome markers.

To perform the separate ASE computations for either normalization method, we may consider, without loss of generality, as we did in Correcting the GRM for Gender Using Normalization by Individual Marker, the matrix M to be partitioned into

\begin{bmatrix}
  M_m \\
  M_f
\end{bmatrix} =
\begin{bmatrix}
  M_{mX} & M_{mA} \\
  M_{fX} & M_{fA}
\end{bmatrix},

where M_m and M_f are the male and female entries for the matrix M, and M_{mX} and M_{fX} are the male and female entries for the gender chromosome and M_{mA} and M_{fA} are the male and female entries for the remaining chromosomes.

Also, for either normalization method, we may consider, without loss of generality, \hat{\gamma} to be partitioned into

\begin{bmatrix}
  \hat{\gamma}_m \\
  \hat{\gamma}_f
\end{bmatrix},

where \hat{\gamma}_m and \hat{\gamma}_f are the values of \hat{\gamma} corresponding to male and female samples, respectively.

To find what the gender-corrected ASE values should be, we write out the formula for the random effects \hat{u} for each normalization method more explicitly, in terms of the values of \hat{\gamma}.

Overall Normalization

For Overall Normalization, recalling that we have

G_O = \left(\frac{1}{\phi}\right) \begin{bmatrix}
  M_m D_{mm} M_m' & M_m D_{mf} M_f'\\
  M_f D_{mf} M_m' & M_f D_{ff} M_f'
\end{bmatrix} ,

we may write the random effects by sample as

\hat{u} = G_O\hat{\gamma}_O = \left(\frac{1}{\phi}\right) \begin{bmatrix}
  M_m D_{mm} M_m' & M_m D_{mf} M_f'\\
  M_f D_{mf} M_m' & M_f D_{ff} M_f'
\end{bmatrix} \begin{bmatrix}
  \hat{\gamma}_{mO} \\
  \hat{\gamma}_{fO}
\end{bmatrix} .

Based on the definitions in Correcting the GRM for Gender Using Overall Normalization, and bearing in mind how we have partitioned M, we write out the D entries in matrix notation as

D_{mm} &= \begin{bmatrix}
  I & 0 \\
  0 & I
\end{bmatrix}

D_{mf} &= \begin{bmatrix}
  \sqrt{2} I & 0 \\
  0 & I
\end{bmatrix}

D_{ff} &=  \begin{bmatrix}
  2 I & 0 \\
  0 & I
\end{bmatrix} .

Noting that

M_m = \begin{bmatrix}
  M_{mX} & M_{mA}
\end{bmatrix}

and

M_f = \begin{bmatrix}
  M_{fX} & M_{fA}
\end{bmatrix} ,

this gives us

M_m D_{mm} M_m' &= \begin{bmatrix}
  M_{mX} & M_{mA}
\end{bmatrix}  \begin{bmatrix}
  I & 0 \\
  0 & I
\end{bmatrix}  \begin{bmatrix}
  M'_{mX} \\
  M'_{mA}
\end{bmatrix} = M_{mX}M'_{mX} + M_{mA}M'_{mA}

M_m D_{mf} M_f' &= \begin{bmatrix}
  M_{mX} & M_{mA}
\end{bmatrix}  \begin{bmatrix}
  \sqrt{2} I & 0 \\
  0 & I
\end{bmatrix}  \begin{bmatrix}
  M'_{fX} \\
  M'_{fA}
\end{bmatrix} = \sqrt{2}M_{mX}M'_{fX} + M_{mA}M'_{fA}

M_f D_{mf} M_m' &= \begin{bmatrix}
  M_{fX} & M_{fA}
\end{bmatrix}  \begin{bmatrix}
  \sqrt{2} I & 0 \\
  0 & I
\end{bmatrix}  \begin{bmatrix}
  M'_{mX} \\
  M'_{mA}
\end{bmatrix} = \sqrt{2}M_{fX}M'_{mX} + M_{fA}M'_{mA}

M_f D_{ff} M_f' &= \begin{bmatrix}
  M_{fX} & M_{fA}
\end{bmatrix}  \begin{bmatrix}
  2 I & 0 \\
  0 & I
\end{bmatrix}  \begin{bmatrix}
  M'_{fX} \\
  M'_{fA}
\end{bmatrix} = 2 M_{fX}M'_{fX} + M_{fA}M'_{fA} .

Thus,

\hat{u} = G_O\hat{\gamma}_O &= \left(\frac{1}{\phi}\right) \begin{bmatrix}
  M_{mX}M'_{mX} + M_{mA}M'_{mA} & \sqrt{2}M_{mX}M'_{fX} + M_{mA}M'_{fA} \\
  \sqrt{2}M_{fX}M'_{mX} + M_{fA}M'_{mA} & 2 M_{fX}M'_{fX} + M_{fA}M'_{fA}
\end{bmatrix} \begin{bmatrix}
  \hat{\gamma}_{mO} \\
  \hat{\gamma}_{fO}
\end{bmatrix}

 &= \left(\frac{1}{\phi}\right) \begin{bmatrix}
  M_{mX}M'_{mX}\hat{\gamma}_{mO} + M_{mA}M'_{mA}\hat{\gamma}_{mO} + \sqrt{2}M_{mX}M'_{fX}\hat{\gamma}_{fO} + M_{mA}M'_{fA}\hat{\gamma}_{fO} \\
  \sqrt{2}M_{fX}M'_{mX}\hat{\gamma}_{mO} + M_{fA}M'_{mA}\hat{\gamma}_{mO} + 2 M_{fX}M'_{fX}\hat{\gamma}_{fO} + M_{fA}M'_{fA}\hat{\gamma}_{fO}
\end{bmatrix}

 &= \left(\frac{1}{\phi}\right) \begin{bmatrix}
  M_{mX}(M'_{mX}\hat{\gamma}_{mO} + \sqrt{2}M'_{fX}\hat{\gamma}_{fO}) + M_{mA}(M'_{mA}\hat{\gamma}_{mO} + M'_{fA}\hat{\gamma}_{fO}) \\
  M_{fX}(\sqrt{2}M'_{mX}\hat{\gamma}_{mO} + 2 M'_{fX}\hat{\gamma}_{fO}) + M_{fA}(M'_{mA}\hat{\gamma}_{mO} + M'_{fA}\hat{\gamma}_{fO})
\end{bmatrix} .

Now define

\hat{\alpha}_{XmO} &= (M'_{mX}\hat{\gamma}_{mO} + \sqrt{2}M'_{fX}\hat{\gamma}_{fO}) / \phi \\
\hat{\alpha}_{XfO} &= (\sqrt{2}M'_{mX}\hat{\gamma}_{mO} + 2 M'_{fX}\hat{\gamma}_{fO}) / \phi \\
\hat{\alpha}_{AO}  &= (M'_{mA}\hat{\gamma}_{mO} + M'_{fA}\hat{\gamma}_{fO}) / \phi = \begin{bmatrix} M'_{mA} & M'_{fA} \end{bmatrix} \begin{bmatrix} \hat{\gamma}_{mO} / \phi \\ \hat{\gamma}_{fO} / \phi \end{bmatrix} ,

and finally define the allele substitution effects (ASE) for Overall Normalization as

\hat{\alpha}_{mO} &= \begin{bmatrix} \hat{\alpha}_{XmO} \\ \hat{\alpha}_{AO} \end{bmatrix} \\
\hat{\alpha}_{fO} &= \begin{bmatrix} \hat{\alpha}_{XfO} \\ \hat{\alpha}_{AO} \end{bmatrix} .

This gives us

\hat{u} = G_O\hat{\gamma}_O = \begin{bmatrix}
  M_{mX} \hat{\alpha}_{XmO} + M_{mA} \hat{\alpha}_{AO} \\
  M_{fX} \hat{\alpha}_{XfO} + M_{fA} \hat{\alpha}_{AO}
\end{bmatrix} .

But

M_{mX} \hat{\alpha}_{XmO} + M_{mA} \hat{\alpha}_{AO} = \begin{bmatrix}
  M_{mX} & M_{mA}
\end{bmatrix} \begin{bmatrix} \hat{\alpha}_{XmO} \\ \hat{\alpha}_{AO} \end{bmatrix}
= M_m \hat{\alpha}_{mO}

and

M_{fX} \hat{\alpha}_{XfO} + M_{fA} \hat{\alpha}_{AO} = \begin{bmatrix}
  M_{fX} & M_{fA}
\end{bmatrix} \begin{bmatrix} \hat{\alpha}_{XfO} \\ \hat{\alpha}_{AO} \end{bmatrix}
 = M_f \hat{\alpha}_{fO} .

We now can confirm that for Overall Normalization,

\hat{u} = G_O\hat{\gamma}_O = \begin{bmatrix}
  M_m \hat{\alpha}_{mO} \\
  M_f \hat{\alpha}_{fO}
\end{bmatrix} .
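A minimal sketch of these ASE computations (hypothetical names; g_m and g_f stand for \hat{\gamma}_{mO} and \hat{\gamma}_{fO}):

```python
import numpy as np

def ase_overall(M_mX, M_fX, M_mA, M_fA, g_m, g_f, phi):
    """ASE for Overall Normalization: a shared autosomal block plus
    gender-specific X blocks, with alpha_XfO = sqrt(2) * alpha_XmO."""
    a_Xm = (M_mX.T @ g_m + np.sqrt(2) * (M_fX.T @ g_f)) / phi
    a_Xf = np.sqrt(2) * a_Xm
    a_A = (M_mA.T @ g_m + M_fA.T @ g_f) / phi
    return np.concatenate([a_Xm, a_A]), np.concatenate([a_Xf, a_A])
```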

Note

The above definitions of \hat{\alpha}_{XmO} and \hat{\alpha}_{XfO} may be rewritten as

\hat{\alpha}_{XmO} &= (M'_{mX}\hat{\gamma}_{mO} + \sqrt{2} M'_{fX}\hat{\gamma}_{fO}) / \phi \\
\hat{\alpha}_{XfO} &= \sqrt{2}(M'_{mX}\hat{\gamma}_{mO} + \sqrt{2} M'_{fX}\hat{\gamma}_{fO}) / \phi .

From this, we see that

\hat{\alpha}_{XfO} = \sqrt{2} \hat{\alpha}_{XmO} .

In other words, the ASE values for Overall Normalization for gender-chromosome markers are proportional to each other, but are not equal. Two sets of ASE values must be used.

Normalization by Individual Marker

For Normalization by Individual Marker, we first represent the gender-corrected matrix W as

W = \begin{bmatrix}
  M_{mX} N_{mX} & M_{mA} N_A \\
  M_{fX} N_{fX} & M_{fA} N_A
  \end{bmatrix}

as was done in Correcting the GRM for Gender Using Normalization by Individual Marker, using the same notations. We next write out some representations of WW':

WW' &= \begin{bmatrix}
  M_{mX} N_{mX} & M_{mA} N_A \\
  M_{fX} N_{fX} & M_{fA} N_A
  \end{bmatrix} \begin{bmatrix}
  N'_{mX} M'_{mX} & N'_{fX} M'_{fX} \\
  N'_A M'_{mA} & N'_A M'_{fA}
  \end{bmatrix}

  &= \begin{bmatrix}
  M_{mX} N_{mX} N'_{mX} M'_{mX} + M_{mA} N_A N'_A M'_{mA} & M_{mX} N_{mX} N'_{fX} M'_{fX} + M_{mA} N_A N'_A M'_{fA} \\
  M_{fX} N_{fX} N'_{mX} M'_{mX} + M_{fA} N_A N'_A M'_{mA} & M_{fX} N_{fX} N'_{fX} M'_{fX} + M_{fA} N_A N'_A M'_{fA}
  \end{bmatrix} .

Noting that all of the N-type matrices are symmetric, we get the following:

WW' = \begin{bmatrix}
  M_{mX} N^2_{mX} M'_{mX} + M_{mA} N^2_A M'_{mA} & M_{mX} N_{mX} N_{fX} M'_{fX} + M_{mA} N^2_A M'_{fA} \\
  M_{fX} N_{fX} N_{mX} M'_{mX} + M_{fA} N^2_A M'_{mA} & M_{fX} N^2_{fX} M'_{fX} + M_{fA} N^2_A M'_{fA}
  \end{bmatrix} .

Representing the GRM as

G_W = \frac{WW'}{m} ,

we now represent the sample-based random effects as

\hat{u} &= G_W \hat{\gamma}_W = \frac{WW'}{m} \hat{\gamma}_W

&= \frac{1}{m} \begin{bmatrix}
  M_{mX} N^2_{mX} M'_{mX} + M_{mA} N^2_A M'_{mA} & M_{mX} N_{mX} N_{fX} M'_{fX} + M_{mA} N^2_A M'_{fA} \\
  M_{fX} N_{fX} N_{mX} M'_{mX} + M_{fA} N^2_A M'_{mA} & M_{fX} N^2_{fX} M'_{fX} + M_{fA} N^2_A M'_{fA}
  \end{bmatrix} \begin{bmatrix}
  \hat{\gamma}_{mW} \\
  \hat{\gamma}_{fW}
\end{bmatrix}

&= \frac{1}{m} \begin{bmatrix}
  M_{mX} N^2_{mX} M'_{mX} \hat{\gamma}_{mW} + M_{mA} N^2_A M'_{mA} \hat{\gamma}_{mW} + M_{mX} N_{mX} N_{fX} M'_{fX} \hat{\gamma}_{fW} + M_{mA} N^2_A M'_{fA} \hat{\gamma}_{fW} \\
  M_{fX} N_{fX} N_{mX} M'_{mX} \hat{\gamma}_{mW} + M_{fA} N^2_A M'_{mA} \hat{\gamma}_{mW} + M_{fX} N^2_{fX} M'_{fX} \hat{\gamma}_{fW} + M_{fA} N^2_A M'_{fA} \hat{\gamma}_{fW}
\end{bmatrix}

&= \frac{1}{m} \begin{bmatrix}
  M_{mX} (N^2_{mX} M'_{mX} \hat{\gamma}_{mW} + N_{mX} N_{fX} M'_{fX} \hat{\gamma}_{fW}) + M_{mA} (N^2_A M'_{mA} \hat{\gamma}_{mW} + N^2_A M'_{fA} \hat{\gamma}_{fW}) \\
  M_{fX} (N_{fX} N_{mX} M'_{mX} \hat{\gamma}_{mW} + N^2_{fX} M'_{fX} \hat{\gamma}_{fW}) + M_{fA} (N^2_A M'_{mA} \hat{\gamma}_{mW} + N^2_A M'_{fA} \hat{\gamma}_{fW})
\end{bmatrix} .

Now define

\hat{\alpha}^*_{XmW} &= (N^2_{mX} M'_{mX} \hat{\gamma}_{mW} + N_{mX} N_{fX} M'_{fX} \hat{\gamma}_{fW}) / m \\
\hat{\alpha}^*_{XfW} &= (N_{fX} N_{mX} M'_{mX} \hat{\gamma}_{mW} + N^2_{fX} M'_{fX} \hat{\gamma}_{fW}) / m \\
\hat{\alpha}^*_{AW}  &= (N^2_A M'_{mA} \hat{\gamma}_{mW} + N^2_A M'_{fA} \hat{\gamma}_{fW}) / m = N^2_A \begin{bmatrix} M'_{mA} & M'_{fA} \end{bmatrix} \begin{bmatrix} \hat{\gamma}_{mW} / m \\ \hat{\gamma}_{fW} / m \end{bmatrix} ,

and finally define the allele substitution effects (ASE) for Normalization by Individual Marker as

\hat{\alpha}^*_{mW} &= \begin{bmatrix} \hat{\alpha}^*_{XmW} \\ \hat{\alpha}^*_{AW} \end{bmatrix} \\
\hat{\alpha}^*_{fW} &= \begin{bmatrix} \hat{\alpha}^*_{XfW} \\ \hat{\alpha}^*_{AW} \end{bmatrix} .

This gives us

\hat{u} = G_W\hat{\gamma}_W = \begin{bmatrix}
  M_{mX} \hat{\alpha}^*_{XmW} + M_{mA} \hat{\alpha}^*_{AW} \\
  M_{fX} \hat{\alpha}^*_{XfW} + M_{fA} \hat{\alpha}^*_{AW}
\end{bmatrix} .

But

M_{mX} \hat{\alpha}^*_{XmW} + M_{mA} \hat{\alpha}^*_{AW} = \begin{bmatrix}
  M_{mX} & M_{mA}
\end{bmatrix} \begin{bmatrix} \hat{\alpha}^*_{XmW} \\ \hat{\alpha}^*_{AW} \end{bmatrix}
= M_m \hat{\alpha}^*_{mW}

and

M_{fX} \hat{\alpha}^*_{XfW} + M_{fA} \hat{\alpha}^*_{AW} = \begin{bmatrix}
  M_{fX} & M_{fA}
\end{bmatrix} \begin{bmatrix} \hat{\alpha}^*_{XfW} \\ \hat{\alpha}^*_{AW} \end{bmatrix}
 = M_f \hat{\alpha}^*_{fW} .

So, we can now confirm that for Normalization by Individual Marker,

\hat{u} = G_W\hat{\gamma}_W = \begin{bmatrix}
  M_m \hat{\alpha}^*_{mW} \\
  M_f \hat{\alpha}^*_{fW}
\end{bmatrix} .
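Analogously, a minimal sketch for this normalization method (hypothetical names; n_mX, n_fX, and n_A are the diagonals of N_{mX}, N_{fX}, and N_A stored as vectors, and g_m, g_f stand for \hat{\gamma}_{mW}, \hat{\gamma}_{fW}):

```python
import numpy as np

def ase_individual_marker(M_mX, M_fX, M_mA, M_fA, n_mX, n_fX, n_A, g_m, g_f, m):
    """ASE for Normalization by Individual Marker; the X blocks are the
    same vector scaled by n_mX vs. n_fX, so alpha*_XmW = sqrt(2)*alpha*_XfW."""
    shared = (n_mX * (M_mX.T @ g_m) + n_fX * (M_fX.T @ g_f)) / m
    a_Xm = n_mX * shared
    a_Xf = n_fX * shared
    a_A = n_A**2 * (M_mA.T @ g_m + M_fA.T @ g_f) / m
    return np.concatenate([a_Xm, a_A]), np.concatenate([a_Xf, a_A])
```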

Note

The above definitions of \hat{\alpha}^*_{XmW} and \hat{\alpha}^*_{XfW} may be rewritten as

\hat{\alpha}^*_{XmW} &= N_{mX} (N_{mX} M'_{mX} \hat{\gamma}_{mW} + N_{fX} M'_{fX} \hat{\gamma}_{fW}) / m \\
\hat{\alpha}^*_{XfW} &= N_{fX} (N_{mX} M'_{mX} \hat{\gamma}_{mW} + N_{fX} M'_{fX} \hat{\gamma}_{fW}) / m .

This is N_{mX} and N_{fX} multiplying the same vector

(N_{mX} M'_{mX} \hat{\gamma}_{mW} + N_{fX} M'_{fX} \hat{\gamma}_{fW}) / m .

Since the normalizations for any marker k in the gender chromosome, and the diagonal elements of N_{fX} and N_{mX} corresponding to marker k, are 1 / \sqrt{2 p_k q_k} and 1 / \sqrt{p_k q_k} for females and males, respectively, we see that

N_{mX} = \sqrt{2} N_{fX} .

Therefore,

\hat{\alpha}^*_{XmW} = \sqrt{2} \hat{\alpha}^*_{XfW} .

In other words, for Normalization by Individual Marker as well, the ASE values for gender-chromosome markers are proportional to each other, but are not equal. Here, too, two sets of ASE values must be used.

A Further Note About Dosage Compensation

While [Yang2011] discusses this subject in detail, let’s answer one question: why are these factors used for dosage compensation? Let’s look at the individual covariance terms:

Covariance Terms

As implied above, the contribution from marker k to the covariance between two female subjects i and j for Normalization by Individual Marker is the GRM variance component times the expression

\frac{(x_{ik} - 2q_k)(x_{jk} - 2q_k)}{2q_k(1 - q_k)} ,

where x_{ik} and x_{jk} are the number of minor alleles at marker k for (female) subjects i and j, and q_k is the (overall) minor allele frequency for marker k.

Meanwhile, the contribution from marker k to the covariance between two male subjects i and j for Normalization by Individual Marker is, instead, the GRM variance component times

\frac{d(x_{ik} - q_k)d(x_{jk} - q_k)}{q_k(1 - q_k)} ,

where d is the dosage compensation factor, x_{ik} and x_{jk} are the number of gender-chromosome minor alleles (0 or 1) at marker k for (male) subjects i and j, and (as above) q_k is the (overall) minor allele frequency for marker k.

For Overall Normalization, the male terms used in M are the same as those used for Normalization by Individual Marker. Also, the contributions to the Overall Normalization factor \phi from the gender chromosome for males are analogous to the normalization terms for Normalization by Individual Marker, namely

q_k(1 - q_k) .

Therefore, if you were to use Overall Normalization to compute a GRM for only one marker in the gender chromosome, and only for male subjects, your contribution from that marker k to the covariance between two (male) subjects would be the GRM variance component times

\frac{d(x_{ik} - q_k)d(x_{jk} - q_k)}{q_k(1 - q_k)} ,

which is equal, for this data, to the male/male covariance term for Normalization by Individual Marker.

To match this, for Overall Normalization, the female terms used in M are half the corresponding autosomal terms. However, the contributions to the variance for the female gender chromosome are multiplied by \sqrt{2} for each female term (and thus by two if both terms are female). Thus, for an all-female gender-chromosome GRM computed on only one marker using Overall Normalization, the contribution from that marker k to the covariance between two (female) subjects i and j would be the GRM variance component times

2 \left(\frac{(x_{ik}/2 - q_k)(x_{jk}/2 - q_k)}{q_k(1 - q_k)}\right) .

This is equal to

\frac{(x_{ik} - 2q_k)(x_{jk} - 2q_k)}{2 q_k(1 - q_k)} ,

which is equal, for this data, to the female/female covariance term for Normalization by Individual Marker.
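A quick numeric spot-check of this equality (plain Python with a hypothetical q_k; not part of SVS):

```python
import numpy as np

# 2*((x_i/2 - q)*(x_j/2 - q))/(q*(1-q)) equals
# (x_i - 2q)*(x_j - 2q)/(2*q*(1-q)) for all dosages 0, 1, 2.
q = 0.3
for x_i in (0, 1, 2):
    for x_j in (0, 1, 2):
        lhs = 2 * (x_i / 2 - q) * (x_j / 2 - q) / (q * (1 - q))
        rhs = (x_i - 2 * q) * (x_j - 2 * q) / (2 * q * (1 - q))
        assert np.isclose(lhs, rhs)
```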

Comparison of Dosage Compensation Factors

Let’s focus on Normalization by Individual Marker and write the male/male covariance term for the three levels of dosage compensation, giving these expressions a common denominator.

For Full Dosage Compensation, we have

\frac{\sqrt{2}(x_{ik} - q_k)\sqrt{2}(x_{jk} - q_k)}{q_k(1 - q_k)} = \frac{(2x_{ik} - 2q_k)(2x_{jk} - 2q_k)}{2q_k(1 - q_k)} .

For Equal X-Linked Variance, we have

\frac{(x_{ik} - q_k)(x_{jk} - q_k)}{q_k(1 - q_k)} = \frac{(\sqrt{2}x_{ik} - \sqrt{2}q_k)(\sqrt{2}x_{jk} - \sqrt{2}q_k)}{2q_k(1 - q_k)} .

And for No Dosage Compensation, we have

\frac{((x_{ik} - q_k)/\sqrt{2})((x_{jk} - q_k)/\sqrt{2})}{q_k(1 - q_k)} = \frac{(x_{ik} - q_k)(x_{jk} - q_k)}{2q_k(1 - q_k)} .

Comparing these with the covariance between two female subjects, we see that

  • For Full Dosage Compensation (“each allele in females has only half the effect that an allele has in males”), we can compare the female x_{ik} term with the male 2x_{ik} term.

  • For Equal X-Linked Variance, we find ourselves comparing the female x_{ik} term with the male \sqrt{2}x_{ik} term, using the common denominator above. You might think of this as a type of “Partial Dosage Compensation”.

  • Finally, for No Dosage Compensation (“each allele has a similar effect on the trait”), we can compare the female x_{ik} term with the male x_{ik} term.

This leads to the second question, “How is the Equal X-Linked Variance assumption ‘Equal’?” The answer is that in each of the expressions

\frac{(x_{ik} - 2q_k)(x_{jk} - 2q_k)}{2q_k(1 - q_k)}

and

\frac{(x_{ik} - q_k)(x_{jk} - q_k)}{q_k(1 - q_k)} ,

the dosage is normalized by the would-be variance of the dosage for the whole marker if all the subjects were of the gender in question. This is not true for the male/male covariance term for either of the other two Dosage Compensation assumptions.

Separating Out Gender-Chromosome-Linked Effects

If you select AI as your REML algorithm, SVS allows you to separate out the gender-chromosome-linked effects from the non-gender-chromosome-linked effects by fitting the model

y = X\beta + u_X + u + \epsilon,

where u_X is a vector of genetic effects attributable to the gender chromosome, with

Var(u_X) = \sigma^2_X G_X,

Var(u) = \sigma^2_u G_u,

and

Var(\epsilon) = \sigma^2_e I,

where G_X and G_u are the genomic relationship matrices (GRM’s) for the gender chromosome and for all other chromosomes, respectively, so that

Var(y) = \sigma^2_X G_X + \sigma^2_u G_u + \sigma^2_e I.

Note

You may wish to test for the most appropriate Dosage Compensation for the G_X gender-chromosome GRM for this model by running the model three times, each time with a different Dosage Compensation for G_X, then comparing the likelihoods of the model-fitting under each of the three assumptions.

Genomic Prediction

Sometimes, it is desired to predict the random effects (genomic merit/genomic breeding values) for samples having genotypic data but no phenotype data, based on other samples for which phenotype data, genotypic data, and, if applicable, covariate data exist.

Additionally, it is sometimes desired to predict the phenotypes themselves for those samples with missing phenotypes, so long as the samples have not only genotypic data but also, if applicable, covariate data.

Call the n_t samples for which there are phenotype values the “training set”, and the n_v others the “validation set”. Assume all samples have genotypic data, imputed or otherwise, for all markers. Also assume all samples in the training set have valid covariate data, if there are covariates being used.

To predict the random effects and the missing phenotypes, we do the following:

  • Without loss of generality, imagine the samples of the training set all come first, before any samples of the validation set. Define Z = [I | 0], where the width and height of I is n_t, the width of the zero matrix is n_v, and the height of the zero matrix is n_t. Also partition u, X_f, and y according to training vs. validation, as

    u &= \begin{bmatrix}
     u_t \\
     u_v
    \end{bmatrix}

X_f &= \begin{bmatrix}
       X_{ft} \\
       X_{fv}
      \end{bmatrix}

y &= \begin{bmatrix}
     y_t \\
     y_v
    \end{bmatrix} .
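    For example, Z and the partitions could be set up as follows (a minimal numpy sketch with hypothetical sample counts; training samples come first, as assumed above):

    ```python
    import numpy as np

    n_t, n_v = 600, 150                                  # hypothetical sizes
    Z = np.hstack([np.eye(n_t), np.zeros((n_t, n_v))])   # Z = [I | 0]
    # Z @ u extracts the training block u_t of the full random-effect
    # vector u, so the training model reads
    # y_t = X_ft @ beta_f + Z @ u + eps_t.
    ```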

  • Either compute the genomic relationship matrix or import a pre-computed genomic relationship matrix based on all samples (both training and validation sets).

    Note

    If correcting for gender has been selected, the modifications for computing M and G remain the same as noted above in Correcting the GRM for Gender Using Overall Normalization and Correcting the GRM for Gender Using Normalization by Individual Marker.

  • Use Restricted Maximum Likelihood (REML) on the mixed model for the training set, which may be expressed as

    y_t = X_{ft} \beta_f + Z u + \epsilon_t ,

    where Var(u) = \sigma^2_G G and Var(Zu) = \sigma^2_G ZGZ', to determine the variance components and the values of \hat{\beta}_f.

  • If EMMA is being used as the REML algorithm (Finding the Variance Components Using EMMA), it will also determine \delta and the inverse of H_t = ZGZ' + \delta I for the training set.

    Noting that the form of the second of Henderson’s mixed-model equations, as modified to accommodate singular G, is, for Z \ne I,

    G Z' X_{ft} \hat{\beta}_f + (GZ'Z + \delta I) \hat{u} = GZ' y_t ,

    we obtain

    (GZ'Z + \delta I) \hat{u} = GZ' y_t - GZ' X_{ft} \hat{\beta}_f ,

    or

    \hat{u} = (GZ'Z + \delta I)^{-1} GZ'(y_t - X_{ft}\hat{\beta}_f) .

    Noting the following equalities,

    GZ'ZGZ' + \delta GZ' = (GZ'Z + \delta I)GZ' = GZ'(ZGZ' + \delta I) = GZ'H_t

    GZ' = (GZ'Z + \delta I)^{-1} GZ'H_t

    GZ'H_t^{-1} = (GZ'Z + \delta I)^{-1} GZ',

    we may write

    \hat{u} = GZ'H_t^{-1} (y_t - X_{ft}\hat{\beta}_f)

    as a solution for the genomic BLUP.
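    As a minimal numeric sketch of this solution (hypothetical names; G and Z as above, with delta and beta_hat as estimated by EMMA on the training set):

    ```python
    import numpy as np

    def gblup_predict(G, Z, delta, y_t, X_ft, beta_hat):
        """u_hat = G Z' H_t^{-1} (y_t - X_ft beta_hat), where
        H_t = Z G Z' + delta*I. Returns random effects for all
        samples, training and validation alike."""
        H_t = Z @ G @ Z.T + delta * np.eye(Z.shape[0])
        gamma_t = np.linalg.solve(H_t, y_t - X_ft @ beta_hat)
        return G @ (Z.T @ gamma_t)
    ```

    This follows the streamlined form described next: gamma_t plays the role of \hat{\gamma}_t, Z.T @ gamma_t that of \hat{\gamma}, and the final product that of \hat{u}.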

    If we now define, for Overall Normalization,

    \hat{\alpha}_O = M'Z'H_{tO}^{-1}(y_t - X_{ft}\hat{\beta}_f) / \phi,

    where H_{tO} is the matrix H_t for Overall Normalization, we find that

    M\hat{\alpha}_O = \frac{MM'}{\phi}Z'H_{tO}^{-1}(y_t - X_{ft}\hat{\beta}_f) = G_OZ'H_{tO}^{-1}(y_t - X_{ft}\hat{\beta}_f) = \hat{u},

    which makes \hat{\alpha}_O a solution for the ASE. The computational streamlining becomes computing \hat{\gamma}_{tO} as

    \hat{\gamma}_{tO} = H_{tO}^{-1}(y_t - X_{ft}\hat{\beta}_f),

    computing \hat{\gamma}_O as

    \hat{\gamma}_O = Z'\hat{\gamma}_{tO},

    and finally computing \hat{u} = G_O\hat{\gamma}_O and the ASE \hat{\alpha}_O = M'\hat{\gamma}_O/\phi as before.

    Here, \phi is calculated for all samples, namely both training samples and validation samples.

    For Normalization by Individual Marker, if we now define

    \hat{\alpha}_W = W' Z' H_{tW}^{-1}(y_t - X_{ft}\hat{\beta}_f) / m,

    and

    \hat{\alpha}_W^* = N \hat{\alpha}_W,

    where N is defined as in Normalization by Individual Marker, and is calculated for all samples, namely both training samples and validation samples, we find that

    W\hat{\alpha}_W = \frac{WW'}{m}Z'H_{tW}^{-1}(y_t - X_{ft}\hat{\beta}_f) = G_WZ'H_{tW}^{-1}(y_t - X_{ft}\hat{\beta}_f) = \hat{u} ,

    and that

    M \hat{\alpha}_W^* &= M N N'M'Z'H_{tW}^{-1}(y_t - X_{ft}\hat{\beta}_f) / m    \\
                   &= \frac{WW'}{m} Z' H_{tW}^{-1}(y_t - X_{ft}\hat{\beta}_f) \\
                   &= G_W Z' H_{tW}^{-1}(y_t - X_{ft}\hat{\beta}_f) = \hat{u} ,

    which makes \hat{\alpha}_W^* a solution for the ASE. The computational streamlining becomes computing \hat{\gamma}_{tW} as

    \hat{\gamma}_{tW} = H_{tW}^{-1}(y_t - X_{ft}\hat{\beta}_f),

    computing \hat{\gamma}_W as

    \hat{\gamma}_W = Z'\hat{\gamma}_{tW},

    and finally computing \hat{u} = G_W\hat{\gamma}_W and the ASE \hat{\alpha}_W^* = N^2 M' \hat{\gamma}_W/m as before.

  • If Average Information is being used as the REML algorithm (Finding the Variance Components Using the Average Information (AI) Technique), we may express the total random effect u as the sum of the r random effects of the individual components

    u = \sum_{vc=1}^r u_{vc} .

    Given this, we express the mixed model for the training set as

    y_t = X_{ft}\beta_f + Z \sum_{vc=1}^r u_{vc} + \epsilon_t ,

    with

    Var(y_t) = Z \left( \sum_{vc=1}^r G_{vc} \sigma^2_{vc} \right) Z' + I \sigma^2_{\epsilon}

    for either normalization method. Along with the variance components (\sigma^2_{vc} and \sigma^2_{\epsilon}), the AI algorithm will find the matrix P_t and the values of \hat{\beta}_f for the training set. From these are found

    \hat{\gamma}_{vct} = \sigma^2_{vc} P_t y_t

    and

    \hat{\gamma}_{vc} = Z'\hat{\gamma}_{vct}

    for each variance component \sigma^2_{vc}, as well as

    \hat{u}_{vc} = G_{vc} \hat{\gamma}_{vc}

    for each variance component, with the total random effect estimate being

    \hat{u} = \sum_{vc=1}^r \hat{u}_{vc} .

    For any variance component \sigma^2_{vc} for which the matrix G_{vc} is completely generated from a set of markers M_{vc} as M_{vc}M'_{vc}/\phi_{vc} or as W_{vc}W'_{vc}/m_{vc}, where \phi_{vc} or m_{vc} is also generated exclusively from set M_{vc} of markers, we may compute an allele substitution effect (ASE) value for any marker within such a set. The vector of these values is

    \hat{\alpha}_{vcO} = M'_{vc}\hat{\gamma}_{vcO}/\phi_{vc}

    for Overall Normalization, or

    \hat{\alpha}^*_{vcW} = N_{vc} W'_{vc} \hat{\gamma}_{vcW} / m_{vc} = N_{vc}^2 M'_{vc} \hat{\gamma}_{vcW} / m_{vc}

    for Normalization by Individual Marker.

  • Finally, noting that the full mixed-model problem is

    y = X_f \beta_f + u + \epsilon

    or

    \begin{bmatrix}
     y_t \\
     y_v
    \end{bmatrix} = \begin{bmatrix}
       X_{ft} \\
       X_{fv}
      \end{bmatrix} \beta_f + \begin{bmatrix}
     u_t \\
     u_v
    \end{bmatrix} + \epsilon,

    we predict the validation phenotypes

    \hat{y}_v = X_{fv} \hat{\beta}_f + \hat{u}_v

    from (the intercept and) any validation covariates X_{fv} and the predicted values \hat{u}_v. If there are missing validation covariates, the corresponding validation phenotypes are not predicted.

Note

As a “bonus”, we also use the same prediction algorithm to “re-predict” the training phenotypes as

\hat{y}_t = X_{ft} \hat{\beta}_f + \hat{u}_t .

Note

If correcting for gender has been selected, the modifications for partitioning M and either \hat{\gamma}_O or \hat{\gamma}_W (both of which involve all samples, both training samples and validation samples) and computing the ASE (\hat{\alpha}_O or \hat{\alpha}_W) remain the same as noted above in Correcting the GRM for Gender Using Overall Normalization, Correcting the GRM for Gender Using Normalization by Individual Marker, and Correcting the ASE for the Gender Chromosome.

Bayes C and C-pi

Problem Statement

Suppose that we have the following mixed model equation to describe the relationship between the phenotypes of our samples, their genotypes, and fixed and random effects.

y = X_f \beta_f + u + \epsilon

over n samples, where the fixed effects in X_f, with coefficients \beta_f, include the intercept and any additional covariates you may have specified. The random effects u are the additive genetic merits or genomic breeding values associated with each sample.

We can define u in terms of the genotypes of each sample over m autosomal markers and the allele substitution effects.

u = M \alpha

where M is an n \times m matrix containing the genotypes of each sample. M_{ik} is 0, 1, or 2, depending upon whether the genotype for the i-th sample at the k-th locus is homozygous for the major allele, heterozygous, or homozygous for the minor allele, respectively. \alpha is a vector for which \alpha_k is the allele substitution effect (ASE) for marker k.

Estimating the Model Parameters

The two Bayesian methods for fitting this mixed model implemented in SVS are Bayes C and Bayes C\pi [Habier2011]. The only difference between the two methods is the assumption made about \pi.

\pi is the prior probability that a SNP has no effect on the phenotype. In the Bayes C\pi method, the value of \pi is treated as unknown and is estimated; in Bayes C, it is treated as known, with a value of 0.9 [Neves2012].

The Bayesian approach combines prior probabilities (beliefs about the parameters before the data are analyzed) with conditional probabilities (probability densities for the parameters given the data) to construct full conditional posterior probabilities. These posterior probabilities are then sampled from to make estimates of the model parameters.

SVS uses a single-site Gibbs sampler.

Prior and Posterior Distributions

There are six parameters that are sampled in each iteration of the Gibbs sampler.

The parameters are:

  • \pi: the prior probability that a SNP has no effect.

  • \alpha_k: the allele substitution effect (ASE) at locus k.

  • \sigma_M^2: the component of variance associated with the ASE.

  • \sigma_e^2: the component of variance associated with the error term.

  • \beta_f: the coefficient of the f-th fixed effect.

  • \delta: whether a marker is included in the current iteration.

The prior distributions for the parameters are as follows [Fernando2009a], [Sorensen2002]:

  • \pi \sim \mathcal{U}(0, 1)

  • \alpha_k \sim \mathcal{N}(0, \sigma_M^2)

  • \sigma_M^2 \sim v_M S_M^2 \chi_{v_M}^{-2}

  • \sigma_e^2 \sim v_e S_e^2 \chi_{v_e}^{-2}

  • \beta_f \propto \text{constant}

  • \delta \sim \mathcal{B}(numIter, 1 - \pi)

where numIter is the number of iterations for the Gibbs sampler, v_M = 4, v_e = 2 [Fernando2009b], and S_M^2 is defined as [Habier2011]:

S_M^2 = \frac {\overset {\sim} {\sigma_M^2} (v_M - 2)} {v_M}

where \overset {\sim} {\sigma_M^2} is the initial value of \sigma_M^2 and is defined as [Habier2011]:

\overset {\sim} {\sigma_M^2} = \frac {\overset {\sim} {\sigma_s^2}} {m (1 - \pi) (\phi / m)}

where \phi is:

\phi = 2 \sum_{k = 1}^{m} p_k q_k

or the sum of would-be variances over all the markers if each had been at Hardy-Weinberg equilibrium. \sigma_s^2 is the additive-genetic variance explained by the SNPs [Habier2011]; we set it to 0.05 [Fernando2009b].

The prior of \beta_f is a constant and is not proper; however, the posterior is. The initial value for the first \beta_f is the mean of the phenotype values, and the rest are set to 0.

The posterior distributions for these parameters are [Fernando2009a], [Sorensen2002]:

  • \pi \sim \text{Beta}(m - m_{iter} + 1, m_{iter} + 1)

  • \alpha_k \sim \mathcal{N}(\frac {M'_k(y - X\beta - M_k\prime \alpha_k\prime)} {M'_k M_k + \frac {\sigma_e^2} {\sigma_M^2}}, \frac {\sigma_e^2} {M'_k M_k + \frac {\sigma_e^2} {\sigma_M^2}})

  • \sigma_M^2 \sim \overset{\sim}{v_M} \overset{\sim}{S_M^2} \chi_{\overset{\sim}{v_M}}^{-2}

  • \sigma_e^2 \sim ((y - X\beta - M\alpha)'(y - X\beta - M\alpha) + v_e S_e^2) \chi_{v_e + n}^{-2}

  • \beta_f \sim \mathcal{N}((X'_f X_f)^{-1} X'_f (y - X_f\prime \beta_f\prime - M\alpha), (\frac {X'_f X_f} {\sigma_e^2})^{-1})

  • \delta: \frac {f(r_k | \delta_k, \theta_{k\_}) Pr(\delta_k | \pi)} {f(r_k | \delta_k = 0, \theta_{k\_})\pi + f(r_k | \delta_k = 1, \theta_{k\_})(1 - \pi)}

where m_{iter} is the number of markers included in this iteration (see Deciding when to include a marker), and S_e^2 is the scale factor for the posterior distribution of \sigma_e^2 and is set to 1. \overset {\sim} {v_M} is [Habier2011]:

\overset {\sim} {v_M} = v_M + m_{iter}

and \overset {\sim} {S_M^2} is defined as [Habier2011]:

\overset {\sim} {S_M^2} = \frac {\alpha' \alpha + v_M S_M^2} {\overset {\sim} {v_M}}

where S_M^2 is defined as above and is updated each iteration with the new value of \pi.

Note

A prime in the subscript means “all but this.” For example, when sampling \beta_f, we remove all the fixed effects and their coefficients except for the current one.

The posterior of \delta can be better explained by [Fernando2009a]:

Pr(\delta_k | y, \beta, \alpha_{\_j}, \delta_{\_j}, \sigma_M^2, \sigma_e^2, \pi) = Pr(\delta_k | r_k, \theta_{k\_})

Pr(\delta_k | r_k, \theta_{k\_}) = \frac {f(\delta_k, r_k | \theta_{k\_})} {f(r_k | \theta_{k\_})} = \frac {f(r_k | \delta_k, \theta_{k\_}) Pr(\delta_k | \pi)} {f(r_k | \delta_k = 0, \theta_{k\_})\pi + f(r_k | \delta_k = 1, \theta_{k\_})(1 - \pi)}

Deciding when to include a marker

Because of the difficulty in sampling directly from the \delta distribution, we use the log-likelihoods, find the probability that \delta is 1, and sample from a uniform distribution, \mathcal{U}(0,1), to determine whether a marker will be included in the current iteration. If we do not include a marker, then \alpha_k is set to 0.

The log-likelihoods are defined as [Fernando2009b]:

log \mathcal{L}(\delta_k | 0) = - \frac {1} {2} (log(M'_k M_k \sigma_e^2) + \frac {(M'_k (y - X\beta - M_k\prime \alpha_k\prime))^2} {M'_k M_k \sigma_e^2}) + log(\pi)

log \mathcal{L}(\delta_k | 1)  = - \frac {1} {2} (log((M'_k M_k)^2 \sigma_M^2 + M'_k M_k \sigma_e^2) + \frac {(M'_k (y - X\beta - M_k\prime \alpha_k\prime))^2} {(M'_k M_k)^2 \sigma_M^2 + M'_k M_k \sigma_e^2}) + log(1 - \pi)

We can then define the probability that \delta is 1 as [Fernando2009b]:

P(\delta_k = 1) = \frac {1} {1 + e^{log \mathcal{L} (\delta_k | 0) - log \mathcal{L} (\delta_k | 1)}}
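A minimal sketch of this inclusion test (hypothetical names; M_k is the k-th column of M and r_k is the residual y - X\beta with all other marker effects removed):

```python
import numpy as np

def prob_include(M_k, r_k, s2_e, s2_M, pi):
    """P(delta_k = 1) from the two log-likelihoods above."""
    mm = M_k @ M_k                     # M'_k M_k
    s = M_k @ r_k                      # M'_k (y - X beta - ...)
    v0 = mm * s2_e                     # variance term when delta_k = 0
    v1 = mm**2 * s2_M + mm * s2_e      # variance term when delta_k = 1
    ll0 = -0.5 * (np.log(v0) + s**2 / v0) + np.log(pi)
    ll1 = -0.5 * (np.log(v1) + s**2 / v1) + np.log(1.0 - pi)
    return 1.0 / (1.0 + np.exp(ll0 - ll1))

# Include marker k this iteration when
# np.random.uniform() < prob_include(...); otherwise set alpha_k to 0.
```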

Finding the ASE and Genomic Estimated Breeding Values

To find our estimates for \alpha and \beta, we take the average over all iterations:

\hat {\alpha_k} = \frac {\sum_{t = 1}^{numIter} \alpha_{kt}} {numIter}

\hat {\beta_f} = \frac {\sum_{t = 1}^{numIter} \beta_{ft}} {numIter}

for all k and f.

The ASE values will then be the new \hat \alpha values.

To find the genomic estimated breeding values (GEBV) we use the ASE values:

\hat u = M \hat \alpha

\hat \sigma_M^2, \hat \sigma_e^2, and \hat \pi are found in the same way as \hat \alpha and \hat \beta, taking the average value over all iterations.
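A minimal sketch of these averages (hypothetical names; alpha_draws and beta_draws hold the Gibbs samples, one row per iteration):

```python
import numpy as np

def posterior_estimates(alpha_draws, beta_draws, M):
    """Average the draws: alpha_draws is (numIter, m), beta_draws is
    (numIter, f). Returns the ASE vector, the fixed-effect estimates,
    and the GEBVs u_hat = M @ alpha_hat."""
    alpha_hat = alpha_draws.mean(axis=0)
    beta_hat = beta_draws.mean(axis=0)
    return alpha_hat, beta_hat, M @ alpha_hat
```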

Gender Correction

To correct for gender, we take the following steps:

  • For markers within the X chromosome, we use the following entries for matrix M:

    • For females, we encode them the same as non-X-chromosome markers.

    • For males, we use 0 or 1 for M_{ik}, depending upon whether the genotype for the i-th sample at the k-th locus contains the major (X-chromosome) allele or the minor (X-chromosome) allele, respectively.

    The other entries of M are left the same.

  • To compute \phi, we continue to use 2p_k q_k as the expected-variance term for most markers. For X-chromosome markers, however, we use

    w_m (p_k q_k)  +  w_f \frac{p_k q_k}{2},

    where w_m and w_f are the fraction of the samples that are male and female, respectively. (A sketch of this computation appears after this list.)

  • We still compute

    G = \frac{MM'}{\phi}

    as before.

  • During the \alpha_k sampling phase of the Gibbs sampler, the y and M matrices are split into male and female sections and separate samples are taken.

  • The non-X-Chromosome ASE values are added (from the males and females) to get the final ASE values.
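As referenced in the list above, the \phi computation mixes autosomal and X-chromosome expected-variance terms; a minimal sketch (hypothetical names; p, q, and is_x are per-marker arrays):

```python
import numpy as np

def phi_gender_corrected(p, q, is_x, w_m, w_f):
    """phi = sum over markers of 2*p*q for non-X markers and
    w_m*(p*q) + w_f*(p*q/2) for X markers, with w_m + w_f = 1."""
    terms = np.where(is_x, w_m * p * q + w_f * p * q / 2.0, 2.0 * p * q)
    return terms.sum()
```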

Normalizing the ASE

To normalize the ASE values, we compute:

\hat \alpha_{norm} = \frac {1} {\sqrt{\sigma_M^2}} \hat \alpha

Standardizing Phenotype Values

Phenotype values will be standardized to prevent \sigma_e^2 from becoming too large and disrupting the results.

Values will be set to their z-score:

z = \frac {x - \mu} {\sigma}

The result spreadsheet by marker will have the original phenotypes and the standardized phenotypes. If phenotype prediction is chosen, there will also be a column of predicted phenotypes and a column of the predicted phenotypes transformed back to the original scale.

Genomic Prediction

To predict the phenotypes of samples for which there is genotypic data but no phenotypic data (the validation set), we can run the Gibbs sampler with just the data from the samples having known phenotypic and genotypic data (the training set).

After running the Gibbs sampler with just the training set data we can estimate the GEBVs for all samples with:

\hat u = M \hat \alpha

where \hat \alpha is estimated with just the training samples.

And we can predict the phenotypes for all samples (training and validation) with:

\hat y = X \hat \beta + M \hat \alpha

However, if a sample has missing covariate values, its phenotype cannot be predicted, and the sample will be dropped from the result spreadsheet.
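A minimal sketch of this prediction step (hypothetical names; the rows of X and M cover both training and validation samples):

```python
import numpy as np

def predict_phenotypes(X, beta_hat, M, alpha_hat):
    """y_hat = X beta_hat + M alpha_hat, with alpha_hat and beta_hat
    estimated from the training samples only."""
    return X @ beta_hat + M @ alpha_hat
```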