# Overview of Mixed Linear Models¶

Mixed linear models incorporate both “fixed effects” and “random effects” (that is, “mixed effects”). The independent variables in a linear regression may be thought of as fixed effects. To solve for the random effects in a mixed model, something should be known about the variances and covariances of these random effects.

## The Mixed Model Equation¶

Suppose there are measurements of a phenotype which is influenced by fixed effects and instances of one random effect. The mixed linear model may be written as

where is an vector of observed phenotypes, is an matrix of fixed effects, and is a vector representing the coefficients of the fixed effects. is the unknown random effect, and is an matrix relating the instances of the random effect to the phenotypes. We assume

is the variance-covariance matrix for the random effect , and

so that

Examples of “fixed effects” may include the mean, one or more genotypic markers, and other additional covariates that may be analyzed.

Examples of a “random effect” are:

1. Polygenic effects from each of subgroupings, where the measurements have been grouped into subgroupings such as inbred strains. is then an incidence matrix relating subgroupings/strains to measurements, and should be a matrix showing the pairwise genetic relationship among the strains.
2. Polygenic effects from each of the samples, where there is just one measurement per sample. is then just the identity matrix , and should be a pairwise genetic relationship or kinship matrix among the samples.

Note

At this time, none of the SVS Mixed Linear Model Analysis tools supports organizing measurements into subgroupings such as inbred strains.

The parameters and are called the “variance components”, and are assumed to be unknown. To solve the mixed model equation, the variance components must first be estimated. Once this is done, a generalized least squares (GLS) procedure may be used to estimate .

### More than One Random Effect¶

It is also possible for a mixed linear model to have more than one random effect. For instance, a mixed model with two random effects and would be defined by

where

and

and being the variance/covariance matrices for and respectively, so that

In this example, there are three variance components: , , and .

## Finding the Variance Components¶

The SVS Mixed Linear Model Analysis tools use either one of two approaches to estimate the variance components:

1. EMMA. To estimate the variance components and for a mixed model with only one random effect, you may select to use the EMMA (Efficient Mixed Model Association) approach ([Kang2008]). This approach reduces the estimating problem to a maximization search in just one dimension, and will thus be reliable and fast for use cases with 10,000 or fewer samples.

(For sample sizes over 10,000 or 15,000, this method starts becoming limited by the time it takes to solve the eigenvalue/eigenvector problem for these matrices. While some of the Mixed Linear Model Analysis tools (specifically GBLUP and K-Fold cross validation using GBLUP) have a large-data capability which saves space in memory, you must still choose, when using this capability, the tradeoff between faster approximation of the eigenvalues and eigenvectors vs. more exact computation of these values.)

2. AI. Average Information (AI) ([Yang2011], [LeeSH2012]) is the other approach you may select. Multiple variance components may be estimated using this approach, and it is not limited for large sample sizes by the need to solve an eigenvalue/eigenvector problem.

However, neither the MLMM feature nor large-data capability is currently implemented in SVS for the AI approach.

Either the full likelihood (Maximum Likelihood ML) or the restricted likelihood (Restricted Maximum Likelihood REML) may be maximized. The restricted likelihood is defined as the full likelihood with the fixed effects integrated out. As stated in [Kang2008], “The restricted likelihood avoids a downward bias of maximum-likelihood estimates of variance components by taking into account the loss in degrees of freedom associated with fixed effects.”

Note

The SVS Mixed Linear Model Analysis tools always maximize the restricted likelihood (REML) rather than the full likelihood (ML) except when the Bayes Information Criterion (for the MLMM feature) is computed.

## Finding the Variance Components Using EMMA¶

Suppose , , and , which is a function of . Under the null hypothesis, the full log-likelihood function can be formulated as

and the restricted log-likelihood function can be formulated as

The full-likelihood function is maximized when is , and the optimal variance component is for full likelihood and for restricted likelihood, where is a function of as well.

Using spectral decomposition, it is possible to find and such that

and

where , is , and is an eigenvector matrix corresponding to the nonzero eigenvalues. is an matrix corresponding to the zero eigenvalues. and are independent of .

Let . Then, finding the maximum-likelihood (ML) estimate is equivalent to optimizing

with respect to , and finding the restricted maximum-likelihood (REML) estimate is equivalent to optimizing

with respect to . (See the Appendix of [Kang2008] for the mathematical details, except that , not “” as that Appendix states.) These functions are continuous for if and only if all the eigenvalues are nonnegative (and, for , the eigenvalues are nonnegative). Otherwise, if the kinship matrix is not positive semidefinite, the likelihood will be ill-defined for a certain range of .

The derivatives of these functions, which may be used to find the local maxima for the functions themselves, are

and

The ML or REML may be searched for by subdividing the values of into 100 intervals, evenly in log space from to , and applying a method such as the Newton-Raphson algorithm or the secant method on or to all the intervals where the sign of the derivative function changes, then taking the optimal among all the stationary points and endpoints.

Note

The secant method is used by the SVS Mixed Linear Model Analysis tools.

Notice that evaluating or does not require a large number of matrix multiplications or inverses at each iteration as other methods typically do – instead, the EMMA technique computes spectral decomposition only once. Thus, using the grid search indicated above, the likelihood may be optimized globally with high confidence using much less computation.

## Finding the Variance Components Using the Average Information (AI) Technique¶

### Problem Restatement and Summary¶

Suppose we have random effects in our mixed model. We may then write our mixed model as

where is a vector of random genetic effects. In this model, the phenotypic variance is partitioned into the variance explained by each of the genetic factors and the residual variance. We have

where is the variance of the -th genetic factor with its corresponding variance/covarince matrix .

( can be a kinship matrix or a GBLUP Genomic Relationship Matrix (GRM). See The Genomic Relationship Matrix for more information about the GRM matrix.)

The variance components (for all ) and are unknown and need to be determined. To determine these variance components, SVS uses the Average Information (AI) technique to implement the restricted maximum likelihood (REML) method. This process may be summarized as follows:

• SVS first initializes each component to (the somewhat arbitrary value of) divided by the number of total components .
• Next, SVS uses one iteration of the expectation maximization (EM) algorithm (as adapted for the REML method).
• Finally, repeated iterations of the Average Information (AI) algorithm itself are used until convergence is achieved. The criterion for convergence is that the log-likelihood changes by less than the threshold value of .

### Definitions and Algorithm Details¶

Define as a vector of variance components

the matrix as

and as the log-likelihood function of the model, ignoring the constant:

We now define a matrix called which is the average of the observed and expected information matrices:

The vector of first derivatives of the log likelihood function with respect to each variance component is

• The -th variance component is initialized to

• This is updated by one iteration of the EM algorithm as

The EM algorithm is used as an initial step to determine the direction of the iteration updates because it is robust to poor starting values.

• The remaining iterations use the Average Information algorithm itself, which consists of updating the vector of variance components as

to get from iteration to iteration .

• Convergence is determined by whether , where is the log likelihood of the -th iteration.

During the iteration process, any component whose estimate is negative (that is, the estimate “has escaped from the parameter space”) is set to .

See [Yang2011] for further information.

## Estimating the Variance of Heritability for One Random Effect¶

The formula for the estimate of the variance of heritability is derived as follows using a Taylor series expansion:

Now, using the fact that the pseudo-heritability (or narrow-sense heritability) is:

We can obtain the formula for the estimate of the variance of heritability:

Note that

So:

The formulas and methods for calculating the variance components can be found in Finding the Variance Components. is the (estimated) sampling variance/covariance of the (estimates of the) variance components, and may be computed as

where is the Average Information Matrix for

## Solving the Mixed Model Equation Using EMMA¶

The generalized least squares (GLS) solution to

may now be obtained. Note that the variance of is

If we can find a matrix such that

we can substitute , , and to get

(The Cholesky decomposition of is one way to obtain such a matrix .) This equation can be solved for through ordinary least squares (OLS), because we have

The value of the residual sum of squares (RSS) from solving the transformed equation is the Mahalanobis RSS for the original equation .

Taking advantage of the eigendecomposition of performed in the EMMA algorithm, the computation of a valid can be simplified to

## Using the Mixed Model for Association Studies¶

### The Exact Model¶

Association studies are typically carried out by testing the hypothesis for each of loci, one at a time, on the basis of the model

where is the minor allele count of marker for individual , is a (fixed) effect size of marker , and are other fixed effects such as the mean of the and any fixed covariates. The error term is

If we assume the individuals are unrelated and there is no dependence across the genotypes, the values will be independently and identically distributed (i.i.d.), and thus simple linear regressions will make appropriate inferences for the values of .

However, the variance of the first term of actually comes closer to being proportional to a matrix of the relatedness or kinship between samples. Thus, if we write

we see that the equation for reduces to the mixed-model equation

Note that strictly speaking, to use this equation, we should base not only the kinship, but also the variance components, upon all markers except for marker .

### The EMMAX Approximations and Technique¶

Even using the EMMA technique, finding the kinship matrix for, variance components for and solving for for all would be a daunting task. However, we may make two approximations:

1. Let

approximate , and let

approximate . Then, we have

To solve this, we need to compute the kinship matrix just once, using all markers. That kinship matrix may then be used to solve this equation for every marker .

2. Find the variance components once – specifically, for the system of equations

using the kinship matrix which is computed just once for all markers . Then, use these variance components (for the variance of the which is ) and (for the variance of the which is ) to apply the GLS method for solving

for for every marker .

This technique, which is called EMMAX (EMMA eXpedited) and was published in [Kang2010], allows mixed models to be used for genome-wide association testing within a very reasonable amount of computing time.

### Scaling the Kinship Matrix¶

The actual EMMAX technique, however, before using the kinship matrix , scales it (and thus effectively scales ) by an amount that will make the expectation of the estimated population variance of the (scaled) to be , just as the expectation of the estimated population variance for the is .

This is done by defining a scaling factor as

and multiplying by it to get

Here, and is a length vector of ones. is called a “Gower’s centering matrix”–it has the property that when you apply it to a vector to get , it will subtract the mean of the components of from each component of :

The reasoning for using this scaling factor is as follows:

Suppose we have a vector of elements . Estimate the population variance of these elements over all the samples . (This estimate is sometimes called the “sample variance”.) The (unbiased) estimate would be

where is the average of the components of .

However, another way to write this is

since

and

Two other ways to write this are

since is a scalar, and

since for any two matrices and .

We note that

therefore, the estimated population variance may be written as

Looking at as a random variable of dimensions and as a scalar random variable, let us write the expectation of the estimated population variance as:

But defines a relationship matrix among the elements of possible . (Note that the possible instances of themselves might not be “centered” – that is, the components of the may or may not have zero averages.) We now write the expected estimated population variance as

We wish to “scale to one” – that is, set to one by scaling appropriately. We do that by defining , where

and noting that

Note

The features that scale the kinship matrix before using it are:

The other SVS mixed-model features, including the GBLUP-related features, do not scale the kinship matrix before using it.

Note

When the EMMA algorithm for GBLUP is used, GBLUP will output what the computed pseudo-heritability would have been had the matrix been scaled.

### Further Optimization When Covariates Are Present¶

The following technique is mentioned in passing in [Segura2012] and is used both in [Vilhjalmsson2012] and in the mixed-model tools of SVS.

If we have a mixed linear model with fixed-effect covariates , one particular “more interesting” fixed-effect covariate and a random-effect covariate for which the scaled relationship matrix is

and we have this model for many and we don’t need to find the covariate coefficients for any of these models, and we have a matrix such that (see Solving the Mixed Model Equation Using EMMA), we can perform the following optimization:

1. Solve the ordinary-least-squares (OLS) “null hypothesis” or reduced-model problem

to find as an estimate for .

Designate the (Mahalanobis) RSS obtained from solving this equation as

2. Perform the QR algorithm on to get

where and are the “thin”, “reduced”, or “economic” versions of and .

3. Define

giving us

4. Transform the original equation by pre-multiplying it by to get

But

because the columns of are “orthogonal” and of “unit length” and so and . Thus, we have

may be re-written as

the last step being because

and, since ,

and

Thus,

This is equivalent to the ordinary-least-squares (OLS) problem

where the variance of is proportional to . This is because if we pre-multiply the original problem simply by , we get

which may be solved as an OLS (Solving the Mixed Model Equation Using EMMA), and because

which is proportional to .

The ordinary-least-squares (OLS) problem

may now be solved for all .

Note

For optimization, SVS pre-computes the matrix product and uses this product as one matrix to help perform all of the regressions involving .

Note

The matrix is the “annihilator matrix” for the null hypothesis problem

Designate the Mahalanobis Root Sum of Squares (Mahalanobis RSS) for marker as

which is optimized to

This is the Root Sum of Squares (RSS) value for the regression as transformed by pre-multiplying it by .

Note

If we have a (completely fixed-effect) linear model with covariates and one particular “more interesting” covariate ,

and we have this model for many and we don’t need to find the covariate coefficients for any of these models, we can perform the same kind of optimization.

1. Solve the ordinary-least-squares (OLS) “null hypothesis” or reduced-model problem

to find as an estimate for .

2. Perform the QR algorithm on to get

where and are the “thin”, “reduced”, or “economic” versions of and .

3. Define

4. Transform the original equation by pre-multiplying it by to get

But

because the columns of are “orthogonal” and of “unit length” and so and . Thus, we have

may be re-written as

because and .

Thus,

This is equivalent to the ordinary-least-squares (OLS) problem

where the variance of is proportional to . This is because

which we assume to be proportional to .

Note that the matrix is the “annihilator matrix” for the null hypothesis problem

### Optimization when Gene-Environment Interaction Terms Are Included¶

If we have the full mixed linear model

and we have

as the corresponding reduced mixed linear model, where are fixed covariates, are fixed terms that will later be used to create gene-environment interaction terms, is the current “more interesting” covariate or predictor variable, and are interaction terms created by multiplying the element-by-element with , and it is desired to determine all of the full-model beta’s, we must compute the entire linear full-model regression directly, as

where the term is assumed to be an error term proportional to the identity matrix. Meanwhile, the reduced model here may be optimized in the same fashion as what Further Optimization When Covariates Are Present refers to as the “full model”. Substitute for the in that explanation.

The reduced-model Mahalanobis RSS for marker thus becomes, in its optimized form,

where , and is derived from performing the QR algorithm, as

Here, we use the following for the Mahalanobis Root Sum of Squares (Mahalanobis RSS) for marker :

Note

For the similar linear-model problem with full model

and reduced model

where it is desired to determine all of the full-model beta’s, we must compute the entire full-model regression itself. However, we may still optimize computing the reduced model using the technique shown above in the linear-model note to Further Optimization When Covariates Are Present, where is substituted for the in that note.

## The Multi-Locus Mixed Model (MLMM)¶

For complex traits controlled by several large-effect loci, a single-locus test may not be appropriate, especially in the presence of population structure.

Therefore, [Segura2012] has proposed a simple stepwise mixed-model regression with forward inclusion and backward elimination of genotypic markers as fixed effect covariates. This method, called the Multi-Locus Mixed Model (MLMM), proceeds as follows:

1. Begin with an initial model that includes, as its fixed effects, only the intercept and any additional covariates you may have specified.
2. Using this model, perform an EMMAX scan through all markers (that you have not specified as additional covariates).
3. From the markers scanned above, select the most significant marker and add it to the model as a new fixed-effect covariate (“cofactor”), creating a new model.
4. Repeat (2) and (3) (forward inclusion) until a pre-specified maximum number of forward steps is reached or until forward inclusion must be stopped for some other reason (see note below).
5. For each selected marker in the current model, temporarily remove it from the fixed effects and perform an EMMAX scan over only that marker.
6. Eliminate, from the current model, the marker that came out as least significant using the above test. A new smaller model is created.
7. Repeat (5) and (6) (backward elimination) until only one selected marker is left.

The variance components are re-estimated between each forward and backward step, while the same kinship matrix is used throughout the calculations.

Note

The following are reasons why forward inclusion might be stopped early:

• When the pseudo-heritability estimate has become close to zero.
• When all of the variance is explained by the currently-selected fixed-effect covariate markers (cofactors) and any covariates you may have specified.
• When the latest significant marker that has been selected is collinear with the previously-selected fixed-effect covariate markers (cofactors) and any covariates you may have specified.
• When the number of active samples is no longer enough to continue analysis.
• When the number of active (non-selected) markers is no longer enough to analyze the resulting models.

### Model Criteria¶

The result of this stepwise regression is a series of models. Several model criteria have been explored by the authors of [Segura2012] to determine how appropriate any of the models is:

• Bayes Information Criteria (BIC). This is calculated as , where is the full-model log-likelihood, is the number of model parameters (one for the intercept, one for , one for each marker covariate used in the particular MLMM model, and finally one for each additional covariate used in all of the models), and is the sample size/number of individuals.

Given any two estimated models, the model with the lower value of BIC is the preferred choice.

However, the authors of [Segura2012] believe this model is “too tolerant in the context of GWAS”.

• Extended Bayes Information Criteria (Extended BIC). This is the BIC penalized by the model space dimension. Its formula is

where is the initial number of model parameters (one for the intercept, one for , and one for each additional covariate used in all of the models), and is the total number of models which can be formed using marker covariates under the assumption that these will only be selected from the best markers.

• Modified Bayes Information Criteria (Modified BIC). This adds a different penalty based not only by the model space dimension, but also by how many overall markers there are to test. Its formula is

where is the total number of markers being tested in the current step.

• Bonferroni Criterion. Only defined for models derived from forward selection, this selects the model with the most covariate marker loci for which the best p-value obtained from the preceding EMMAX scan was below the Bonferroni threshold.

• Multiple Bonferroni Criterion. This selects the model with the most covariate marker loci all of which have individual p-values below the Bonferroni threshold. Here, “individual p-value” is as explained in the note of Outputs from the Multi-Locus Mixed Linear Model (MLMM) Method. The threshold used is , where is the total number of markers being tested in the current step.

• Multiple Posterior Probability of Association. This selects the model with the most covariate marker loci all of which have posterior probabilities of association above a PPA threshold. The threshold used is 0.5 . Posterior probabilities of association are based on Bayesian priors of for every marker (and for every step), where is the total number of markers being tested in the current step, and are computed as follows:

• Find the Bayes factor for marker as

where and are the values of the Mahalanobis RSS for the base model and for testing with marker , respectively.

• Determine the posterior odds and posterior probability as

and

## Gender Correction for EMMAX and MLMM¶

Gender correction for EMMAX and MLMM consists simply of considering X Chromosome genotypes for males as either zero (major allele present) or one (minor allele present). No other gender correction is performed.

Note

If a genotypic spreadsheet is used, it is still expected that male X chromosomes are coded in this spreadsheet as “two major alles” (major allele present) vs. “two minor alleles” (minor allele present).

## Genomic Best Linear Unbiased Predictors (GBLUP)¶

### Problem Statement¶

Suppose we have the mixed model equation

over samples, with fixed effects specified by that include the intercept and any additional covariates you may have specified, and we assume that . Also suppose that the random effects are additive genetic merits or genomic breeding values associated with these samples, and that these may be formulated from autosomal markers as

where is an matrix, the exact construction of which will be discussed later in Overall Normalization and Normalization by Individual Marker, and is a vector for which is the allele substitution effect (ASE) for marker . For now, suffice it to say that matrix is derived from the individual genotypes, with dependent on the genotype for the -th sample at the -th locus. (For inclusion of non-autosomal markers, see Correcting the GRM for Gender Using Overall Normalization and Correcting the GRM for Gender Using Normalization by Individual Marker.)

Further assume that (which makes ), and that , where is an (unknown) constant which is the component of variance associated with the ASE. This gives us

This gives us

and

Our object is to estimate both the genomic breeding value for every sample and the ASE for every marker.

We now discuss obtaining the matrix and from that, a scaling factor and a kinship matrix.

### Overall Normalization¶

There are two ways to normalize the entries of matrix . The first method is Overall Normalization.

For Overall Normalization, we use the mixed model equation exactly as described above, namely

with set to , , or , depending upon whether the genotype for the -th sample at the -th locus is homozygous for the minor allele, heterozygous, or homozygous for the major allele, respectively. Here, and are the major and minor allele frequencies for marker , respectively.

Additionally, we write the sum of would-be variances over all the markers if each had been at Hardy-Weinberg equilibrium as

We now define the variance matrix for Overall Normalization as

which gives us

where we let .

### Normalization by Individual Marker¶

The second normalization method is Normalization by Individual Marker. (This method is used by the software package GCTA [Yang2011].)

For Normalization by Individual Marker, we instead use

and

as our mixed-model equations, where

and is two, one, or zero, depending upon whether the genotype for the -th sample at the -th locus is homozygous for the minor allele, heterozygous, or homozygous for the major allele, respectively. Here also, and are the major and minor allele frequencies for marker , respectively.

This time, we define our variance matrix as

which gives us

where we let .

We note that if we define

we see that

where is the matrix defined above in Overall Normalization.

### The Genomic Relationship Matrix¶

For either normalization method, we refer to the matrix ( or ) as the GBLUP Genomic Relationship Matrix (GRM). This matrix may be used as a kinship matrix for solving the GBLUP mixed-model equation, and or may be thought of as the variance component for .

Note

1. Because this method uses a kinship matrix based on genotypes rather than on actual ancestry, the results are referred to as “Genomic Best Linear Unbiased Predictors” rather than just “Best Linear Unbiased Predictors”.
2. Unlike some other SVS mixed-model analysis tools, SVS GBLUP does not scale its kinship matrix using a Gower’s centering matrix. Instead, the normalizing method used, whether overall or individual-marker, is considered sufficient to scale the GBLUP Genomic Relationship Matrix.
3. Because of how it is constructed, the GBLUP Genomic Relationship Matrix is itself a “centered matrix” and is (thus) singular. However, it is still a positive semidefinite matrix and will work well as a kinship matrix.

### Finding the Genomic Best Linear Unbiased Predictors Using EMMA¶

For either matrix ( or ), and using the EMMA technique (Finding the Variance Components Using EMMA), we can find , , and The second of Henderson’s mixed-model equations, as modified to accommodate singular , is

This may be rewritten as

This gives us

Noting the following equalities,

we may write

as a solution for the genomic BLUP.

In SVS, this is computationally streamlined by finding

then computing .

### Finding the Allele Substitution Effect (ASE) Using EMMA and Overall Normalization¶

If we now define

where is the matrix for the Overall Normalization, we find that

which makes a solution for the ASE.

In SVS, this is computationally streamlined by finding

where is the version of for Overall Normalization.

### Finding the ASE Using EMMA and Individual Marker Normalization¶

If we now define

where is the matrix for Normalization by Individual Marker, we find that

This makes a best linear unbiased predictor (BLUP) for .

To obtain the ASE, we need to scale the elements of appropriately. Define the -th element of by dividing element of by . This calculation is equivalent to writing

We see, however, that

Thus, we see that is a solution for the ASE.

In SVS, we computationally streamline this by finding

where is the version of for Individual Marker Normalization.

### Finding the Genomic Best Linear Unbiased Predictors and ASE Using AI¶

For any variance component , we find

where matrix is defined in Finding the Variance Components Using the Average Information (AI) Technique, and is based on using the appropriate normalization. From this, we compute

and an ASE as either

if Overall Normalization is being used, or

if Normalize by Individual Marker is being used.

### Normalizing the ASE Itself Using EMMA¶

An additional output when using EMMA, for either GRM normalization choice, consists of a normalization of the allele substitution effects themselves. To do this, each ASE is divided by the SNP Standard Deviation, which is the square root of the component of variance associated with the ASE. is reconstructed by dividing the additive genetic variance by the sum of would-be variances over all the markers if each had been at Hardy-Weinberg equilibrium:

The normalized ASE is then:

### Incorporating Gene by Environment Interactions into GBLUP¶

We may specify the mixed model as

with

where is a vector of genotype-environment interaction effects for all of the individuals, with for the pairs of individuals in the same environment, and for the pairs of individuals in different environments.

The variance components , , and may all be found using the Average Information (AI) REML algorithm.

### Genetic Correlation Using GBLUP¶

Suppose we have two phenotypes, and , which are dependent on some genotypic data. We may write a standard bivariate linear mixed model for these as

and

where and are polygenic effects influencing the traits and , respectively.

We may alternatively think of this model as

If is the Genomic Relationship Matrix (GRM) for the genotypic data, we may write the variance/covariance matrix () for this model as

where is the covariance between and . may also be written as

from which we see we have five variance components.

SVS uses an adaptation of the AI REML algorithm to estimate these, with two variations:

1. Simply estimate these five variance components,

where is theoretically zero, but actually may exist as a residual covariance.

Designate the following names for the above variance/covariance matrices:

and, where used,

This gives us, for the two variations of this algorithm,

and

Alternative names for the variance components that are used by the outputs of SVS, along with the interpretations for these variance components, are

• V(G1) () Variance of due to random effects
• V(G2) () Variance of due to random effects
• C(G12) () Covariance between and due to random effects
• V(E1) () Error variance for
• V(E2) () Error variance for
and, if necessary,
• C(E12) () (Residual) covariance of and due to error terms.

## Correcting GBLUP for Gender¶

When using the GBLUP features of SVS, there are several steps that may be taken to correct for gender:

• Correct the gemomic relationship matrix (GRM) for gender, by using male-specific and female-specific values for the genotypes of the gender chromosome, and by modifying the normalization (for either normalization method).
• The ASE values may be computed separately for males vs. females, using male-only subsets vs female-only subsets of the (or ) matrix and of the vector. (This approach assumes the GRM matrix is already gender-corrected.)
• When AI REML is being used, and thus multiple GRM matrices may be analyzed, one GRM matrix (presumably gender-corrected) may be used for the gender chromosome and a different GRM matrix may be used for the other chromosomes.

### Correcting the GRM for Gender¶

For markers within the gender chromosome (usually X), there are two factors:

• The number of copies of the alternate allele at any locus for males can only be zero or one.
• There are varying assumptions that may be made and have been made about how much effect a gender chromosome allele has in males, as well as how much effect a gender chromosome allele has in females.

Dealing with the second point first, there are three assumptions that are usually used regarding the ratio of the allele effects between males and females. These assumptions go by the name of Dosage Compensation, and are:

• Full dosage compensation Each allele in females has only half the effect that an allele has in males.
• Equal X-linked genetic variance for males and females This occurs when each allele in females has times as much effect as each allele in males.
• No dosage compensation Each allele has a similar effect on the trait.

For a further discussion of Dosage Compensation, please see [Yang2011].

Additionally, if the same GRM is being built for both the gender chromosome and for non-gender chromosomes, there has to be a ratio assumed for gender-chromosome effects vs. non-gender-chromosome effects.

### Correcting the GRM for Gender Using Overall Normalization¶

To begin with, we use the following entries for matrix for the gender chromosome:

• For females, we use , , or for , depending upon whether the genotype for the -th sample at the -th locus is homozygous for the minor allele, heterozygous, or homozygous for the major allele, respectively. The frequencies and here are computed only for female subjects.
• For males, we use or for , depending upon whether the genotype for the -th sample at the -th locus contains the minor (gender-chromosome) allele or the major (gender-chromosome) allele, respectively. The frequencies and here are computed only for male subjects. The parameter is the dosage compensation factor. Its values are
• for Full dosage compensation
• for Equal X-linked genetic variance for males and females
• for No dosage compensation.

Entries of for markers outside of the gender chromosome are computed the same as otherwise.

The remaining difference is in computing the overall normalization factor . We continue to use as the expected-variance term for most markers. However, for gender chromosome markers, we use

where and are the fraction of the samples that are male and female, respectively.

Note

This Overall Normalization algorithm assumes the same ratio of variances between female gender-chromosome alleles and non-gender-chromosome alleles (namely 2 to 1) as does [Taylor2013].

The remaining computations for a GRM and a GBLUP are the same as they would be otherwise. We still compute

as before. Also, for EMMA, we compute

and

as before, and for AI and for any variance component , we compute

and

as before, where matrix , as defined in Finding the Variance Components Using the Average Information (AI) Technique, is based on the gender-corrected .

### Correcting the GRM for Gender Using Normalization by Individual Marker¶

In this case, we start with using the following entries for matrix for the gender chromosome (where later on, matrix will be built using matrix ):

• For females, we use , , or for , depending upon whether the genotype for the -th sample at the -th locus is homozygous for the minor allele, heterozygous, or homozygous for the major allele, respectively. This is the same as would be for a non-gender chromosome, except that the frequencies and here are computed only for female subjects.
• For males, we use the same values as the male values for Overall Normalization, namely, or for , depending upon whether the genotype for the -th sample at the -th locus contains the minor (gender-chromosome) allele or the major (gender-chromosome) allele, respectively. The frequencies and here are computed only for male subjects. Just as for Overall Normalization, the values for the dosage compensation factor are
• for Full dosage compensation
• for Equal X-linked genetic variance for males and females
• for No dosage compensation.

Entries of for markers outside of the gender chromosome are computed the same as otherwise.

We next compute the normalizations, which are different between genders. We must use two different matrices ( and ) to represent the normalizations in matrix form.

We first compute and for all samples based on the total frequencies of the alleles over all occurrences of the gender chromosome, as

and

where and are the number of male samples and the number of female samples, respectively.

Now, the normalizations themselves (and matrices and ) may be computed as follows:

• For females, the normalization for marker (and the -th diagonal element for matrix ) is

• For males, the normalization for marker (and the -th diagonal element for matrix ) is

For non-gender chromosomes, the normalizations for both male and female samples are the same as they would have been without gender correction.

Note

This Normalization by Individual Marker algorithm assumes the same ratio of variances between female gender-chromosome alleles and non-gender-chromosome alleles (namely 1 to 1) as does [Yang2011].

To further explain gender correction using Normalization by Individual Marker, we note that without loss of generality, we may consider the matrix to be partitioned into

where and are the male and female entries for the gender chromosome and and are the male and female entries for the remaining chromosomes.

Also, we note that without loss of generality, we may partition matrices and into

and

We now may represent a gender-corrected matrix as

or

Using this , we compute

Now, for EMMA, we compute

and

as before, and for AI and for any variance component , we compute

and

as before, where matrix , as defined in Finding the Variance Components Using the Average Information (AI) Technique, is based on the gender-corrected .

### Correcting the ASE for the Gender Chromosome¶

When there is a designated gender chromosome, the allele substitution effects (ASE) are computed separately for females and males, although the results will only be different between the genders for the gender chromosome markers.

Without loss of generality, we may consider, as we did in Correcting the GRM for Gender Using Normalization by Individual Marker, the matrix to be partitioned into

where and are the male and female entries for the gender chromosome and and are the male and female entries for the remaining chromosomes.

Also, for either normalization method, we may consider, without loss of generality, to be partitioned into

where and are the values of corresponding to male and female samples, respectively.

For Overall Normalization, we then compute

with the final two results being

For Normalization by Individual Marker, we then compute

with the final two results being

If you select AI as your REML algorithm, SVS allows you to separate out the gender-chromosome-linked effects from the non-gender-chromosome-linked effects by fitting the model

where is a vector of genetic effects attributable to the gender chromosome, with

and

where and are the genomic relationship matrices (GRM’s) for the gender chromosome and for all other chromosomes, respectively, so that

This model may be fit any of three ways:

• with having been made using Full Dosage Compensation,
• with having been made using No Dosage Compensation.

Note

If you wish to test for the most appropriate Dosage Compensation, this can be achieved by running this model three times, each time with a different Dosage Compensation, then comparing the likelihoods of the model-fitting under each of the Dosage Compensation assumptions.

## Genomic Prediction¶

Sometimes, it is desired to predict the random effects (genomic merit/genomic breeding values) for samples for which there is genotypic data, but no phenotype data, based on other samples for which phenotype data (as well as genotypic and covariate data) does exist. (If there is covariate data for these missing phenotype values, these values can be predicted based on the random effect predictions.)

Call the samples for which there are phenotype values the “training set”, and the others the “validation set”. Assume all samples have genotypic data, imputed or otherwise, for all markers. Also assume all samples in the training set have valid covariate data, if there are covariates being used.

To predict the random effects and the missing phenotypes, we do the following:

• Without loss of generality, imagine the samples of the training set all come first, before any samples of the validation set. Define , where the width and height of is , the width of the zero matrix is , and the height of the zero matrix is . Also partition , , and according to training vs. validation, as

and

• Either compute the genomic relationship matrix or import a pre-computed genomic relationship matrix based on all samples (both training and validation sets).

Note

If correcting for gender has been selected, the modifications for computing and remain the same as noted above in Correcting the GRM for Gender Using Overall Normalization and Correcting the GRM for Gender Using Normalization by Individual Marker.

• Use the EMMA technique (Finding the Variance Components Using EMMA) on the mixed model for the training set

(where and ) to determine the proper values for , , and the inverse of , where .

• Noting that the form of the second of Henderson’s mixed-model equations, as modified to accommodate singular , is, for ,

we obtain

or

Noting the following equalities,

We may write

as a solution for the genomic BLUP. If we now define

we find that

which makes a solution for the ASE. The computational streamlining becomes computing as

computing as

and finally computing and the ASE as before.

Note

If correcting for gender has been selected, the modifications for partitioning and (both of which involve all samples) and computing the ASE () remain the same as noted above in Correcting the GRM for Gender Using Overall Normalization and Correcting the GRM for Gender Using Normalization by Individual Marker.

• Finally, noting that the full mixed-model problem is

or

we predict the validation phenotypes

from (the intercept and) any validation covariates and the predicted values . If there are missing validation covariates, the corresponding validation phenotypes are not predicted.

## Bayes C and C-pi¶

### Problem Statement¶

Suppose that we have the following mixed model equation to describe the relationship between the phenotypes of our samples, their genotypes, and fixed and random effects.

over samples, with fixed effects in specified in that include the intercept and any additional covariates you may have specified. Also, the random effects, are additive genetic merits or genomic breeding values associated with each sample.

We can define in terms of the genotypes of each sample over autosomal markers and the allele substitution effects.

where is an x matrix containing the genotypes of each sample. is 0, 1, or 2, depending upon whether the genotype for the -th sample at the -th locus is homozygous for the major allele, heterozygous, or homozygous for the minor allele, respectively. is a vector for which is the allele substitution effect (ASE) for marker .

### Estimating the Model Parameters¶

The two Bayesian methods for fitting this mixed model implemented in SVS are Bayes C and Bayes C [Habier2011]. The only difference between the two methods is the assumptions about .

is the prior probability that a SNP has no effect on the phenotype, the value of is treated as unknown and is estimated in the Bayes C method and is treated as known with a value of 0.9 [Neves2012] in Bayes C.

The Bayesian approach uses prior probabilities, beliefs about the parameters before data is analyzed, and the conditional probabilities, a probability density for a parameter based on the given data, to construct full conditional posterior probabilities. These posterior probabilities are then sampled from to make estimates about the model parameters.

SVS uses a single-site Gibbs sampler.

### Prior and Posterior Distributions¶

There are six parameters that are sampled each iteration of the Gibbs sampler.

The parameters are:

• : the prior probability that a SNP has no effect.
• : the allele substitution effect (ASE) at loci .
• : the component of variance associated with the ASE.
• : the component of variance associated with the error term.
• : the component of the -th fixed effect.
• : whether a marker is included in the current iteration.

The prior distributions for the parameters are as follows [Fernando2009a], [Sorensen2002]:

Parameter Prior

Where is the number of iterations for the Gibbs sampler, , , [Fernando2009b] and is defined as [Habier2011]:

where is the initial value of and is defined as [Habier2011]:

where is:

or the sum of would-be variances over all the markers if each had been at Hardy-Weinberg equilibrium. And is the additive-genetic variance explained by SNPs [Habier2011] and we can set it as 0.05 [Fernando2009b].

The prior of is constant and not proper, however the posterior is. The initial value for the first is the mean of the phenotype values and the rest are set to 0.

The posterior distributions for these parameters are [Fernando2009a], [Sorensen2002]:

Parameter Posterior

where is the number of markers included in this iteration (see Deciding when to include a marker) and is the scale factor for the posterior distribution and it is set at 1. And is [Habier2011]:

and is defined as [Habier2011]:

where is defined as above and is updated each iteration with the new value of .

Note

A prime in the subscript mean “all but this.” For example, when sampling we remove all the fixed effects and their coefficients except for the current one.

The posterior of can be better explained by [Fernando2009a]:

### Deciding when to include a marker¶

Because of the difficulty in sampling directly from the distribution we use the log-likelihoods, find the probability that is 1, and sample from a uniform distribution, , to determine if a marker will be included in the current iteration, if we do not include a marker, then will be set to 0.

The log-likelihoods are defined as [Fernando2009b]:

We can then define the probability that is 1 as [Fernando2009b]:

### Finding the ASE and Genomic Estimated Breeding Values¶

To find the our estimates for and we take the average over all iterations:

for all and .

The ASE values will then be the new values.

To find the genomic estimated breeding values (GEBV) we use the ASE values:

, , and are found in the same way as and , taking the average value over all iterations.

### Gender Correction¶

To correct for gender, we take the following steps:

• For markers within the X chromosome, we use the following entries for matrix :

• For females, we encode them the same as non-X-chromosome markers.
• For males, we use 0 or 1 for , depending upon whether the genotype for the -th sample at the -th locus contains the major (X-chromosome) allele or the minor (X-chromosome) allele, respectively.

The other entries of are left the same.

• To compute , we continue to use as the expected-variance term for most markers. For X-chromosome markers, however, we use

where and are the fraction of the samples that are male and female, respectively.

• We still compute

as before.

• During the sampling phase of the Gibbs sampler, the and matrices are split into male and female sections and separate samples are taken.

• The non-X-Chromosome ASE values are added (from the males and females) to get the final ASE values.

### Normalizing the ASE¶

To normalize the ASE values we do:

### Standardizing Phenotype Values¶

Phenotype values will be standardized to prevent from becoming too large and disrupting the results.

Values will be set to their z-score:

The result spreadsheet by marker will have the original phenotypes and the standardized phenotypes, if phenotype prediction is chosen, then there will be a column of predicted phenotypes and transformed back predicted phenotypes.

### Genomic Prediction¶

To predict the phenotype of samples for which there is genotypic data but no phenotypic data (will be known as our validation set) we can run the Gibbs sampler with just the data from the samples with known phenotypic and genotypic data (the training set).

After running the Gibbs sampler with just the training set data we can estimate the GEBVs for all samples with:

where is estimated with just the training samples.

And we can predict the phenotypes for all samples (training and validation) with:

However, if there are missing covariate values the phenotypes cannot be predicted for the sample and it will be dropped from the result spreadsheet.