2.13.1. Overview of Mixed Linear Models¶
Mixed linear models incorporate both “fixed effects” and “random effects” (that is, “mixed effects”). The independent variables in a linear regression may be thought of as fixed effects. To solve for the random effects in a mixed model, something should be known about the variances and covariances of these random effects.
The Mixed Model Equation¶
Suppose there are measurements of a phenotype which is
influenced by
fixed effects and
instances of
one random effect. The mixed linear model may be written as
where is an
vector of observed
phenotypes,
is an
matrix of fixed
effects, and
is a
vector representing
the coefficients of the fixed effects.
is the unknown random
effect, and
is an
matrix relating the
instances of the random effect to the phenotypes. We assume
is the variance-covariance matrix for the random effect , and
so that
Examples of “fixed effects” may include the mean, one or more genotypic markers, and other additional covariates that may be analyzed.
Examples of a “random effect” are:
Polygenic effects from each of
subgroupings, where the
measurements have been grouped into
subgroupings such as inbred strains.
is then an incidence matrix relating subgroupings/strains to measurements, and
should be a matrix showing the pairwise genetic relationship among the
strains.
Polygenic effects from each of the
samples, where there is just one measurement per sample.
is then just the identity matrix
, and
should be a pairwise genetic relationship or kinship matrix among the
samples.
Note
At this time, none of the SVS Mixed Linear Model Analysis tools supports organizing measurements into subgroupings such as inbred strains.
The parameters and
are
called the “variance components”, and are assumed to be unknown. To
solve the mixed model equation, the variance components must first be
estimated. Once this is done, a generalized least squares (GLS)
procedure may be used to estimate
.
More than One Random Effect¶
It is also possible for a mixed linear model to have more than one
random effect. For instance, a mixed model with two random effects
and
would be defined by
where
and
and
being the variance/covariance matrices for
and
respectively, so that
In this example, there are three variance components: ,
, and
.
Finding the Variance Components¶
The SVS Mixed Linear Model Analysis tools use either one of two approaches to estimate the variance components:
EMMA. To estimate the variance components
and
for a mixed model with only one random effect, you may select to use the EMMA (Efficient Mixed Model Association) approach ([Kang2008]). This approach reduces the estimating problem to a maximization search in just one dimension, and will thus be reliable and fast for use cases with 10,000 or fewer samples.
(For sample sizes
over 10,000 or 15,000, this method starts becoming limited by the time it takes to solve the eigenvalue/eigenvector problem for these
matrices. While some of the Mixed Linear Model Analysis tools (specifically GBLUP and K-Fold cross validation using GBLUP) have a large-data capability which saves space in memory, you must still choose, when using this capability, the tradeoff between faster approximation of the eigenvalues and eigenvectors vs. more exact computation of these values.)
AI. Average Information (AI) ([Yang2011], [LeeSH2012]) is the other approach you may select. Multiple variance components may be estimated using this approach, and it is not limited for large sample sizes by the need to solve an eigenvalue/eigenvector problem.
However, neither the MLMM feature nor large-data capability is currently implemented in SVS for the AI approach.
Either the full likelihood (Maximum Likelihood ML) or the restricted likelihood (Restricted Maximum Likelihood REML) may be maximized. The restricted likelihood is defined as the full likelihood with the fixed effects integrated out. As stated in [Kang2008], “The restricted likelihood avoids a downward bias of maximum-likelihood estimates of variance components by taking into account the loss in degrees of freedom associated with fixed effects.”
Note
The SVS Mixed Linear Model Analysis tools always maximize the restricted likelihood (REML) rather than the full likelihood (ML) except when the Bayes Information Criterion (for the MLMM feature) is computed.
Finding the Variance Components Using EMMA¶
Suppose ,
, and
, which is a function of
. Under the
null hypothesis, the full log-likelihood function can be formulated as
and the restricted log-likelihood function can be formulated as
The full-likelihood function is maximized when is
, and the optimal
variance component is
for full likelihood
and
for restricted likelihood, where
is a function
of
as well.
Using spectral decomposition, it is possible to find and
such that
and
where ,
is
, and
is an
eigenvector matrix corresponding to the nonzero
eigenvalues.
is an
matrix
corresponding to the zero eigenvalues.
and
are independent of
.
Let . Then,
finding the maximum-likelihood (ML) estimate is equivalent to
optimizing
with respect to , and finding the restricted
maximum-likelihood (REML) estimate is equivalent to optimizing
with respect to . (See the Appendix of [Kang2008] for
the mathematical details, except that
, not
“
” as that Appendix states.) These functions are
continuous for
if and only if all the eigenvalues
are nonnegative (and, for
,
the eigenvalues
are nonnegative). Otherwise, if the
kinship matrix is not positive semidefinite, the likelihood will be
ill-defined for a certain range of
.
The derivatives of these functions, which may be used to find the local maxima for the functions themselves, are
and
The ML or REML may be searched for by subdividing the values of
into 100 intervals, evenly in log space from
to
, and applying a
method such as the Newton-Raphson algorithm or the secant method on
or
to all the intervals
where the sign of the derivative function changes, then taking the
optimal
among all the stationary points and
endpoints.
Note
The secant method is used by the SVS Mixed Linear Model Analysis tools.
Notice that evaluating or
does not require a large number of matrix multiplications or inverses
at each iteration as other methods typically do – instead, the EMMA
technique computes spectral decomposition only once. Thus, using the
grid search indicated above, the likelihood may be optimized globally
with high confidence using much less computation.
Finding the Variance Components Using the Average Information (AI) Technique¶
Problem Restatement and Summary¶
Suppose we have random effects in our mixed model. We may
then write our mixed model as
where is a vector of random genetic effects. In this
model, the phenotypic variance
is partitioned into
the variance explained by each of the genetic factors and the residual
variance. We have
where is the variance of the
-th genetic
factor with its corresponding variance/covarince matrix
.
( can be a kinship matrix or a GBLUP Genomic Relationship
Matrix (GRM). See The Genomic Relationship Matrix for more information
about the GRM matrix.)
The variance components (for all
) and
are unknown and need to be determined. To
determine these variance components, SVS uses the Average
Information (AI) technique to implement the restricted maximum
likelihood (REML) method. This process may be summarized as follows:
SVS first initializes each component to (the somewhat arbitrary value of)
divided by the number of total components
.
Next, SVS uses one iteration of the expectation maximization (EM) algorithm (as adapted for the AI REML method).
Finally, repeated iterations of the Average Information (AI) algorithm itself are used until convergence is achieved. The criterion for convergence is that the log-likelihood changes by less than the threshold value of
.
Definitions and Algorithm Details¶
Define as a vector of variance components
the matrix as
and as the log-likelihood function of the model, ignoring the constant:
We now define a matrix called which is the average of the
observed and expected information matrices:
The vector of first derivatives of the log likelihood function with respect to each variance component is
The
-th variance component
is initialized to
This is updated by one iteration of the EM algorithm as
The EM algorithm is used as an initial step to determine the direction of the iteration updates because it is robust to poor starting values.
The remaining iterations use the Average Information algorithm itself, which consists of updating the vector
of variance components as
to get from iteration
to iteration
.
Convergence is determined by whether
, where
is the log likelihood of the
-th iteration.
During the iteration process, any component whose estimate is negative
(that is, the estimate “has escaped from the parameter space”) is set
to .
See [Yang2011] for further information.
Estimating the Variance of Heritability for One Random Effect¶
The formula for the estimate of the variance of heritability is derived as follows using a Taylor series expansion:
Now, using the fact that the pseudo-heritability (or narrow-sense heritability) is:
We can obtain the formula for the estimate of the variance of heritability:
Note that
So:
The formulas and methods for calculating the variance components can be found in
Finding the Variance Components. is
the (estimated) sampling variance/covariance of the (estimates of the)
variance components, and may be computed as
where is the Average Information Matrix for
Solving the Mixed Model Equation Using EMMA¶
The generalized least squares (GLS) solution to
may now be obtained. Note that the variance of
is
If we can find a matrix such that
we can substitute ,
,
and
to get
(The Cholesky decomposition of is one way to obtain such a
matrix
.) This equation can be solved for
through ordinary least squares (OLS), because we have
The value of the residual sum of squares (RSS) from solving the
transformed equation is the
Mahalanobis RSS for the original equation
.
Taking advantage of the eigendecomposition of performed in
the EMMA algorithm, the computation of a valid
can be
simplified to
Using the Mixed Model for Association Studies¶
The Exact Model¶
Association studies are typically carried out by testing the
hypothesis for each of
loci, one
at a time, on the basis of the model
where is the minor allele count of marker
for
individual
,
is a (fixed) effect size of
marker
, and
are other fixed effects
such as the intercept and any fixed covariates. The error term
is
If we assume the individuals are unrelated and there is no
dependence across the genotypes, the
values
will be independently and identically distributed (i.i.d.), and thus
simple linear regressions will make appropriate inferences for the
values of
.
However, the variance of the first term of
actually comes closer to being proportional to a matrix of the
relatedness or kinship between samples. Thus, if we write
we see that the equation for reduces to the mixed-model
equation
Note that strictly speaking, to use this equation, we should base not
only the kinship, but also the variance components, upon all markers
except for marker .
The EMMAX Approximations and Technique¶
Even using the EMMA technique, finding the kinship matrix for,
variance components for and solving for for all
would be a daunting task. However, we may make two
approximations:
Let
approximate
, and let
approximate
. Then, we have
To solve this, we need to compute the kinship matrix just once, using all markers. That kinship matrix may then be used to solve this equation for every marker
.
Find the variance components once – specifically, for the system of equations
using the kinship matrix
which is computed just once for all markers
. Then, use these variance components
(for the variance of the
which is
) and
(for the variance of the
which is
) to apply the GLS method for solving
for
for every marker
.
This technique, which is called EMMAX (EMMA eXpedited) and was published in [Kang2010], allows mixed models to be used for genome-wide association testing within a very reasonable amount of computing time.
Scaling the Kinship Matrix¶
The actual EMMAX technique, however, before using the kinship matrix
, scales it (and thus effectively scales
) by an
amount that will make the expectation of the estimated population
variance of the (scaled)
to be
,
just as the expectation of the estimated population variance for the
is
.
This is done by defining a scaling factor as
and multiplying by it to get
Here, and
is a length
vector of ones.
is called a “Gower’s centering
matrix”–it has the property that when you apply it to a vector
to get
, it will subtract the mean
of the components of
from each component
of
:
The reasoning for using this scaling factor is as follows:
Suppose we have a vector of elements
.
Estimate the population variance
of these elements over
all the
samples
. (This estimate is sometimes
called the “sample variance”.) The (unbiased) estimate would be
where is the average of the components
of
.
However, another way to write this is
since
and
Two other ways to write this are
since is a scalar, and
since for any two matrices
and
.
We note that
therefore, the estimated population variance may be
written as
Looking at as a random variable of
dimensions
and
as a scalar random variable, let us write the
expectation of the estimated population variance
as:
But defines a relationship matrix among the
elements of possible
. (Note that the possible instances of
themselves might not be “centered” – that is, the
components of the
may or may not have zero averages.) We
now write the expected estimated population variance
as
We wish to “scale to one” – that is, set
to one by scaling
appropriately. We do
that by defining
, where
and noting that
Note
The features that scale the kinship matrix before using it are:
Single-locus mixed model GWAS (EMMAX) (Performing Analysis)
Multi-locus mixed model GWAS (MLMM) (Performing Analysis)
Single-locus mixed model GWAS (EMMAX) with Interactions (Mixed Linear Model Analysis with Interactions)
The other SVS mixed-model features, including the GBLUP-related features, do not scale the kinship matrix before using it.
Note
When the EMMA algorithm for GBLUP is used, GBLUP will output what the computed pseudo-heritability would have been had the matrix been scaled.
Further Optimization When Covariates Are Present¶
The following technique is mentioned in passing in [Segura2012] and is used both in [Vilhjalmsson2012] and in the mixed-model tools of SVS.
If we have a mixed linear model with fixed-effect covariates
, one particular “more interesting” fixed-effect
covariate
and a random-effect covariate
for which the scaled relationship matrix is
and we have this model for many and we don’t need to find
the covariate coefficients
for any of these
models, and we have a matrix
such that
(see Solving the Mixed Model Equation Using EMMA), we can perform the
following optimization:
Solve the ordinary-least-squares (OLS) “null hypothesis” or reduced-model problem
to find
as an estimate for
.
Designate the (Mahalanobis) RSS obtained from solving this equation as
Perform the QR algorithm on
to get
where
and
are the “thin”, “reduced”, or “economic” versions of
and
.
Define
giving us
Transform the original equation by pre-multiplying it by
to get
But
because the columns of
are “orthogonal” and of “unit length” and so
and
. Thus, we have
may be re-written as
the last step being because
and, since
,
and
Thus,
Define
as
so that
If we pre-multiply the original problem simply by
, we get
which may be solved as an ordinary-least-squares (OLS) problem (Solving the Mixed Model Equation Using EMMA), making
proportional to
.
Meanwhile,
and thus
is proportional to
. This means that
may be treated as an ordinary-least-squares (OLS) problem. This may now be solved for all
.
Note
For optimization, SVS pre-computes the matrix product
and uses this product as one matrix to help perform
all of the regressions involving
.
Note
The matrix is the “annihilator matrix” for the null
hypothesis problem
Designate the Mahalanobis Residual Sum of Squares (Mahalanobis RSS)
for marker as
which is optimized to
This is the Residual Sum of Squares (RSS) value for the regression as
transformed by pre-multiplying it by .
Note
If we have a (completely fixed-effect) linear model with covariates
and one particular “more interesting” covariate
,
and we have this model for many and we don’t need to
find the covariate coefficients
for any of
these models, we can perform the same kind of optimization.
Solve the ordinary-least-squares (OLS) “null hypothesis” or reduced-model problem
to find
as an estimate for
.
Perform the QR algorithm on
to get
where
and
are the “thin”, “reduced”, or “economic” versions of
and
.
Define
Transform the original equation by pre-multiplying it by
to get
But
because the columns of
are “orthogonal” and of “unit length” and so
and
. Thus, we have
may be re-written as
because
and
.
Thus,
This is equivalent to the ordinary-least-squares (OLS) problem
where the variance of
is proportional to
. This is because
which we assume to be proportional to
.
Note that the matrix
is the “annihilator matrix” for the null hypothesis problem
The Mixed Model Equation for Gene-by-Environment Interactions¶
Note
This algorithm is distinct from the GBLUP-related algorithm discussed later in Incorporating Gene by Environment Interactions into GBLUP.
Suppose there are measurements of a phenotype which is
influenced by
fixed effects and
instances of one
random effect. Also suppose there are one or more (say,
)
gene-by-environment effects, where the interaction is potentially with
one “more interesting” predictor variable.
We model this with a full model and a reduced model. The full model is:
and the reduced model is:
where is an
-vector of observed phenotypes,
is an
matrix of
fixed
covariates,
is an
matrix of
fixed terms that are being tested for gene-environment interactions,
is an
-vector containing the current “more
interesting” covariate or predictor variable that may be doing the
interacting, and
is an
matrix
containing the
interaction terms created by multiplying the
columns of
element-by-element with
(the
“Predictor”). All of the above
terms correspond to the
terms with which they are shown and to whether they apply to
the full model or to the reduced model. The
and
terms are random effect and error terms,
respectively.
We approximate this using a technique like that used in the EMMAX
method without interactions (The EMMAX Approximations and Technique). Specifically,
we find the variance components once, using the parts of the above
equations that are independent of altogether. This works
as follows–we write
where “vc” stands for (finding the) “variance components”.
We assume that
is the variance-covariance matrix for the random effect , and that
so that
The EMMA technique (Finding the Variance Components Using EMMA and
Solving the Mixed Model Equation Using EMMA) is then used to determine the
variance components and
, as well
as a matrix
(and its inverse) such that
It is now possible to go through every marker (“more interesting
covariate”) and compute (as an EMMAX-type approximation) the
following full and reduced models:
for the full model, where the term is assumed to be an error term proportional to the
identity matrix, and
for the reduced model, where also the term is assumed to be an error term proportional to
the identity matrix.
Optimization when Gene-Environment Interaction Terms Are Included¶
While SVS explicitly computes the full model mentioned above and
outputs all of its values, it performs only an
optimization of the reduced model computation, which is sufficient to
determine the RSS of the reduced-model equation above (which is the
approximate Mahalanobis RSS for the original reduced-model equation),
and thus to determine the full-vs-reduced p-value for the complete
model.
This reduced-model optimization is to solve
where , and
is derived from performing the QR algorithm, as
Note
This optimization technique is detailed above in
Further Optimization When Covariates Are Present. (Note that in that section, it is the full
model that is being optimized, but here, we are optimizing the
reduced model.) Please substitute for
and
for
in that
section.
Note
For the similar (completely fixed-effect) linear-model problem with full model
and reduced model
we may optimize computing the reduced model using the technique
shown in the (completely fixed-effect) linear-model note to
Further Optimization When Covariates Are Present, where that optimization is for a full model,
and should be substituted for the
in that note.
The reduced-model Mahalanobis RSS that we will use for marker thus becomes, in its optimized form,
Meanwhile, for the full-model Mahalanobis Residual Sum of Squares
(Mahalanobis RSS) for marker , we use:
The Multi-Locus Mixed Model (MLMM)¶
For complex traits controlled by several large-effect loci, a single-locus test may not be appropriate, especially in the presence of population structure.
Therefore, [Segura2012] has proposed a simple stepwise mixed-model regression with forward inclusion and backward elimination of genotypic markers as fixed effect covariates. This method, called the Multi-Locus Mixed Model (MLMM), proceeds as follows:
Begin with an initial model that includes, as its fixed effects, only the intercept and any additional covariates you may have specified.
Using this model, perform an EMMAX scan through all markers (that you have not specified as additional covariates).
From the markers scanned above, select the most significant marker and add it to the model as a new fixed-effect covariate (“cofactor”), creating a new model.
Repeat (2) and (3) (forward inclusion) until a pre-specified maximum number of forward steps is reached or until forward inclusion must be stopped for some other reason (see note below).
For each selected marker in the current model, temporarily remove it from the fixed effects and perform an EMMAX scan over only that marker.
Eliminate, from the current model, the marker that came out as least significant using the above test. A new smaller model is created.
Repeat (5) and (6) (backward elimination) until only one selected marker is left.
The variance components are re-estimated between each forward and backward step, while the same kinship matrix is used throughout the calculations.
Note
The following are reasons why forward inclusion might be stopped early:
When the pseudo-heritability
estimate has become close to zero.
When all of the variance is explained by the currently-selected fixed-effect covariate markers (cofactors) and any covariates you may have specified.
When the latest significant marker that has been selected is collinear with the previously-selected fixed-effect covariate markers (cofactors) and any covariates you may have specified.
When the number of active samples is no longer enough to continue analysis.
When the number of active (non-selected) markers is no longer enough to analyze the resulting models.
Model Criteria¶
The result of this stepwise regression is a series of models. Several model criteria have been explored by the authors of [Segura2012] to determine how appropriate any of the models is:
Bayes Information Criteria (BIC). This is calculated as
, where
is the full-model log-likelihood,
is the number of model parameters (one for the intercept, one for
, one for each marker covariate used in the particular MLMM model, and finally one for each additional covariate used in all of the models), and
is the sample size/number of individuals.
Given any two estimated models, the model with the lower value of BIC is the preferred choice.
However, the authors of [Segura2012] believe this model is “too tolerant in the context of GWAS”.
Extended Bayes Information Criteria (Extended BIC). This is the BIC penalized by the model space dimension. Its formula is
where
is the initial number of model parameters (one for the intercept, one for
, and one for each additional covariate used in all of the models), and
is the total number of models which can be formed using
marker covariates under the assumption that these will only be selected from the best
markers.
Modified Bayes Information Criteria (Modified BIC). This adds a different penalty based not only by the model space dimension, but also by how many overall markers there are to test. Its formula is
where
is the total number of markers being tested in the current step.
Bonferroni Criterion. Only defined for models derived from forward selection, this selects the model with the most covariate marker loci for which the best p-value obtained from the preceding EMMAX scan was below the Bonferroni threshold. The (Bonferroni) threshold used is
, where
is the total number of markers being tested in the current step.
Multiple Bonferroni Criterion. This selects the model with the most covariate marker loci all of which have individual p-values below the Bonferroni threshold. Here, “individual p-value” is as explained in the note of Outputs from the Multi-Locus Mixed Linear Model (MLMM) Method. The (Bonferroni) threshold used is
, where
is the total number of markers being tested in the current step.
Multiple Posterior Probability of Association. This selects the model with the most covariate marker loci all of which have posterior probabilities of association above a PPA threshold. The threshold used is 0.5 . Posterior probabilities of association are based on Bayesian priors
of
for every marker (and for every step), where
is the total number of markers being tested in the current step, and are computed as follows:
Find the Bayes factor for marker
as
where
and
are the values of the Mahalanobis RSS for the base model and for testing with marker
, respectively.
Determine the posterior odds and posterior probability as
and
Gender Correction for EMMAX and MLMM¶
Gender correction for EMMAX and MLMM consists simply of considering X Chromosome genotypes for males as either zero (major allele present) or one (minor allele present). No other gender correction is performed.
Note
If a genotypic spreadsheet is used, it is still expected that male X chromosomes are coded in this spreadsheet as “two major alles” (major allele present) vs. “two minor alleles” (minor allele present).
Genomic Best Linear Unbiased Predictors (GBLUP)¶
Problem Statement¶
Suppose we have the mixed model equation
over samples, with fixed effects specified by
that include the intercept and any additional
covariates you may have specified, and we assume that
. Also suppose that the random
effects
are additive genetic merits or genomic breeding
values associated with these samples, and that these may be formulated
from
autosomal markers as
where is an
matrix, the exact
construction of which will be discussed later in
Overall Normalization and
Normalization by Individual Marker, and
is a vector
for which
is the allele substitution effect (ASE) for
marker
. For now, suffice it to say that matrix
is
derived from the individual genotypes, with
dependent
on the genotype for the
-th sample at the
-th
locus. (For inclusion of non-autosomal markers, see
Correcting the GRM for Gender Using Overall Normalization and Correcting the GRM for Gender Using Normalization by Individual Marker.)
Further assume that , which makes
, and that
, where
is an (unknown) constant which is the component of
variance associated with the ASE. (For inclusion of non-autosomal
markers in the calculation of ASE values, see Correcting the ASE for the Gender Chromosome.)
This gives us
This gives us
and
Our object is to estimate both the genomic breeding value
for every sample and the ASE for every marker.
We now discuss obtaining the matrix and from that, a
normalization factor or normalization factors and a kinship matrix.
Overall Normalization¶
There are two ways to normalize the entries of matrix . The
first method is Overall Normalization.
For Overall Normalization, we use the mixed model equation exactly as described above, namely
with set to
,
, or
, depending upon whether the genotype for the
-th sample at the
-th locus is homozygous for the
minor allele, heterozygous, or homozygous for the major allele,
respectively. Here,
and
are the major and
minor allele frequencies for marker
, respectively.
Additionally, we write the sum of would-be variances over all the markers if each had been at Hardy-Weinberg equilibrium as
We now define the variance matrix for Overall Normalization as
which gives us
where we let .
Normalization by Individual Marker¶
The second normalization method is Normalization by Individual Marker. (This method is used by the software package GCTA [Yang2011].)
For Normalization by Individual Marker, we instead use
and
as our mixed-model equations, where
and is two, one, or zero, depending upon whether the
genotype for the
-th sample at the
-th locus is
homozygous for the minor allele, heterozygous, or homozygous for the
major allele, respectively. Here also,
and
are the
major and minor allele frequencies for marker
, respectively.
This time, we define our variance matrix as
which gives us
where we let .
We note that if we define
we see that
where is the matrix
defined above in
Overall Normalization.
The Genomic Relationship Matrix¶
For either normalization method, we refer to the matrix
(
or
) as the GBLUP Genomic Relationship Matrix
(GRM). This matrix may be used as a kinship matrix for solving the
GBLUP mixed-model equation, and
or
may be thought of as the variance component
for
.
Note
Because this method uses a kinship matrix based on genotypes rather than on actual ancestry, the results are referred to as “Genomic Best Linear Unbiased Predictors” rather than just “Best Linear Unbiased Predictors”.
Unlike some other SVS mixed-model analysis tools, SVS GBLUP does not scale its kinship matrix using a Gower’s centering matrix. Instead, the normalizing method used, whether overall or individual-marker, is considered sufficient to scale the GBLUP Genomic Relationship Matrix.
Because of how it is constructed, the GBLUP Genomic Relationship Matrix is itself a “centered matrix” and is (thus) singular. However, it is still a positive semidefinite matrix and will work well as a kinship matrix.
The GBLUP Genomic Relationship Matrix may either be constructed as a separate step by itself, separate from performing any analyses, or it may be computed “dynamically” or “on-the-fly” as the first part of a GBLUP analysis. Note that
If the Genomic Relationship Matrix is pre-computed separately from performing any analyses, genotypes from all samples will be used.
If, on the other hand, the GRM is computed “dynamically” as a part of a GBLUP analysis, genotypes from any samples for which the phenotype or any covariates or gene-by-environment interaction variables involved in that analysis have a missing value will not used for computing the GRM.
The exception is that when predicting for missing values (Genomic Prediction) and the GRM is computed “dynamically”, genotype data from samples in the “prediction set” (those with a missing phenotype) that have missing covariate values is still used to compute the GRM. (But not data from samples with missing values for gene-by-environment interaction variables.)
Finding the Genomic Best Linear Unbiased Predictors Using EMMA¶
For either matrix (
or
), and using
the EMMA technique (Finding the Variance Components Using EMMA), we can find
,
, and
The second of Henderson’s mixed-model equations, as modified to
accommodate singular
, is
This may be rewritten as
This gives us
Noting the following equalities,
we may write
as a solution for the genomic BLUP.
In SVS, this is computationally streamlined by finding
then computing .
Finding the Allele Substitution Effect (ASE) Using EMMA and Overall Normalization¶
If we now define
where is the matrix
for Overall
Normalization, we find that
which makes a solution for the ASE.
In SVS, this is computationally streamlined by finding
where is the version of
for Overall Normalization.
Finding the ASE Using EMMA and Individual Marker Normalization¶
If we now define
where is the matrix
for Normalization by
Individual Marker, we find that
This makes a best linear unbiased predictor
(BLUP) for
(as defined in
Normalization by Individual Marker).
To obtain the ASE, we need to scale the elements of
appropriately. Define the
-th element
of
by dividing element
of
by
. This calculation
is equivalent to writing
We see, however, that
Thus, we see that is a solution for the ASE.
In SVS, we computationally streamline this by finding
where is the version of
for Individual Marker Normalization.
Finding the Genomic Best Linear Unbiased Predictors and ASE Using AI¶
For any variance component , we find
where the variance components and the matrix are defined in
Finding the Variance Components Using the Average Information (AI) Technique. From this, we can compute
for any variance component.
For any set of markers that contribute exclusively to a
genomic relationship matrix (GRM)
for just one variance
component
, where
is completely
generated from
as
or as
, and
or
is also generated exclusively from set
of markers, we
may compute an allele substitution effect (ASE). This ASE value is
either
if Overall Normalization is being used, or
if Normalize by Individual Marker is being used.
Note
The Gene-by-Environment Interaction feature
Incorporating Gene by Environment Interactions into GBLUP (available only when using the Average
Information algorithm) generates one or more GRM matrices that do
not fit the proper criteria for generating ASE values. This is
because these matrices are not generated directly from
or
for
any set of markers.
Normalizing the ASE Itself Using EMMA¶
An additional output when using EMMA, for either GRM normalization
choice, consists of a normalization of the allele substitution effects
themselves. To do this, each ASE is divided by the SNP Standard
Deviation, which is the square root of the component of variance
associated with the ASE.
is
reconstructed by dividing the additive genetic variance by the sum of
would-be variances over all the markers if each had been at
Hardy-Weinberg equilibrium:
The normalized ASE is then:
Incorporating Gene by Environment Interactions into GBLUP¶
Note
This algorithm is distinct from the EMMAX-related algorithm discussed in The Mixed Model Equation for Gene-by-Environment Interactions.
Suppose there are one or more gene-by-environment interactions we wish
to test for, where for each genotype-environment interaction effect
, there is an environmental variable
which
specifies two or more categories,
being categorical (each
environment being a category) or binary (defining two environments). We
may specify the mixed model as
with
where includes all environmental variables as separate
fixed effects and
is the vector of genotype-environment
interaction effects corresponding to
. We use
for all pairs of individuals in the same environment as each
other, as defined by
, and
for all pairs
of individuals in separate environments.
The variance components ,
,
and
must be found using the Average Information
(AI) REML algorithm.
Genetic Correlation Using GBLUP¶
Suppose we have two phenotypes, and
, which are
dependent on some genotypic data. We may write a standard bivariate
linear mixed model for these as
and
where and
are polygenic effects influencing
the traits
and
, respectively.
We may alternatively think of this model as
If is the Genomic Relationship Matrix (GRM) for the
genotypic data, we may write the variance/covariance matrix
(
) for this model as
where is the covariance between
and
.
may also be written as
from which we see we have five variance components.
SVS uses an adaptation of the AI REML algorithm to estimate these, with two variations:
Simply estimate these five variance components,
Or instead use
where
is theoretically zero, but actually may exist as a residual covariance.
Designate the following names for the above variance/covariance matrices:
and, where used,
This gives us, for the two variations of this algorithm,
and
Alternative names for the variance components that are used by the outputs of SVS, along with the interpretations for these variance components, are
V(G1) (
) Variance of
due to random effects
V(G2) (
) Variance of
due to random effects
C(G12) (
) Covariance between
and
due to random effects
V(E1) (
) Error variance for
V(E2) (
) Error variance for
and, if necessary,
C(E12) (
) (Residual) covariance of
and
due to error terms.
Please see [LeeSH2012] for further details about this algorithm.
Correcting GBLUP for Gender¶
When using the GBLUP features of SVS, there are several steps that may be taken to correct for gender:
Correct the gemomic relationship matrix (GRM) for gender by using male-specific and female-specific values for the genotypes of the gender chromosome, along with modifying the normalization (for either normalization method).
Correct the gender-chromosome allele-substitution effect (ASE) values to match the gender corrections done for the GRM. (To do this requires defining separate ASE values for males vs. females.)
When AI REML is being used, and thus multiple GRM matrices may be analyzed, one GRM matrix (presumably gender-corrected) may be used for the gender chromosome and a different GRM matrix may be used for the other chromosomes.
Correcting the GRM for Gender¶
For markers within the gender chromosome (usually X), there are two factors:
The number of copies of the alternate allele at any locus for males can only be zero or one.
There are three assumptions that can be made about how much effect a gender chromosome allele has in males vs. how much effect a gender chromosome allele has in females. These assumptions go by the name of Dosage Compensation, and are:
Full dosage compensation–Each allele in females has only half the effect that an allele has in males.
Equal X-linked genetic variance for males and females–This occurs when each allele in females has
times as much effect as each allele in males.
No dosage compensation–Each allele has a similar effect on the trait.
For a further discussion of Dosage Compensation, please see A Further Note About Dosage Compensation below, as well as [Yang2011].
Correcting the GRM for Gender Using Overall Normalization¶
We first compute a combined value for the major and minor allele
frequencies and
for gender-chromosome marker
and for all samples based on the total frequencies of the
alleles over all occurrences of the gender chromosome, as
and
where and
are the number of male samples and
the number of female samples, respectively, and
,
,
and
are the major and
minor allele frequencies for male samples only and female samples
only, respectively.
To begin with, we use the following entries for matrix for
the gender chromosome:
For females, we use half of what we would have used for autosomes, to compensate for the fact that males have only one gender chromosome. Specifically, we use
,
, or
for
, depending upon whether the genotype for the
-th sample at the
-th locus is homozygous for the minor allele, heterozygous, or homozygous for the major allele, respectively.
For males, we use
or
for
, depending upon whether the genotype for the
-th sample at the
-th locus contains the minor (gender-chromosome) allele or the major (gender-chromosome) allele, respectively. The parameter
is the dosage compensation factor. Its values are
for Full dosage compensation
for Equal X-linked genetic variance for males and females
for No dosage compensation.
Entries of for markers outside of the gender chromosome are
computed the same as otherwise.
The remaining differences are that:
Contributions to the GRM from gender-chromosome genotypes of female subjects are multiplied by
. This applies to contributions from both
and
, so that the contribution to the covariance between two female subjects (or to the variance of one female subject) is multiplied by two.
For contributions to the overall normalization factor
from gender-chromosome markers, we use
where
and
are the fraction of the samples that are male and female, respectively.
Note
, which is approximated by
, is the variance of the male-sample genotype data at marker
, as coded into matrix
, and
, which is approximated by
, is the would-be variance, at Hardy-Weinberg equilibrium, of the female-sample genotype data at marker
, as coded into matrix
and afterward multiplied by
.
The remaining steps for computing a GRM and computing are
the same as they would be otherwise.
To show computing the gender-corrected GRM as a matrix formula, we
have to do some factoring out of our matrix. Without loss of
generality, we may consider the matrix
to be partitioned
into
where and
are the male and female entries for
the matrix
. We also define matrices
,
, and
to be diagonal with entries of
for autosomal-chromosome markers and of the following values
for each gender-chromosome marker
:
Then, we can write
to show that contributions to the variance from gender-chromosome
genotypes of female subjects (from each of and
)
are multiplied by
.
The next two formulas, for either REML algorithm, remain the same. We still compute, for EMMA,
and
as before, and for AI and for any variance component
, we compute
and
as before, where matrix , as defined in
Finding the Variance Components Using the Average Information (AI) Technique, is based on the gender-corrected
.
Correcting the GRM for Gender Using Normalization by Individual Marker¶
In this case, we start with using the following entries for matrix
for the gender chromosome (where later on, matrix
will be built using matrix
):
For females, we use
,
, or
for
, depending upon whether the genotype for the
-th sample at the
-th locus is homozygous for the minor allele, heterozygous, or homozygous for the major allele, respectively. This is the same as would be for a non-gender chromosome (that is,
, where
is two, one, or zero).
For males, we use the same values as the male values for Overall Normalization, namely,
or
for
, depending upon whether the genotype for the
-th sample at the
-th locus contains the minor (gender-chromosome) allele or the major (gender-chromosome) allele, respectively. Just as for Overall Normalization, the values for the dosage compensation factor
are
for Full dosage compensation
for Equal X-linked genetic variance for males and females
for No dosage compensation.
Entries of for markers outside of the gender chromosome are
computed the same as otherwise.
We next compute the normalizations, which are different between
genders. We must use two different matrices ( and
) to represent the normalizations in matrix form.
Now, the normalizations themselves (and matrices and
) may be computed as follows:
For females, the normalization for gender-chromosome marker
(and the
-th diagonal element for matrix
) is
For males, the normalization for gender-chromosome marker
(and the
-th diagonal element for matrix
) is
For non-gender chromosomes, the normalizations for both male and female samples are the same as they would have been without gender correction.
To show computing the gender-corrected GRM using Normalization by
Individual Marker as a matrix formula, we note that without loss of
generality, we may consider the matrix to be partitioned as
where and
are the male and female
entries for the gender chromosome and
and
are the
male and female entries for the remaining (autosomal) chromosomes.
Also, we note that without loss of generality, we may partition
matrices and
as
and
We now may represent a gender-corrected matrix as
or
Using this , we compute
Now, for EMMA, we compute
and
as before, and for AI and for any variance component
, we compute
and
as before, where matrix , as defined in
Finding the Variance Components Using the Average Information (AI) Technique, is based on the gender-corrected
.
Correcting the ASE for the Gender Chromosome¶
When there is a designated gender chromosome, the allele substitution effects (ASE) are computed separately for females and males, although the results will only be different between the genders for the gender chromosome markers.
To perform the separate ASE computations for either normalization
method, we may consider, without loss of generality, as we did in
Correcting the GRM for Gender Using Normalization by Individual Marker, the matrix to be partitioned
into
where and
are the male and female entries for
the matrix
, and
and
are the
male and female entries for the gender chromosome and
and
are the male and female entries for the remaining
chromosomes.
Also, for either normalization method, we may consider, without loss
of generality, to be partitioned into
where and
are the values of
corresponding to male and female samples,
respectively.
To find what the gender-corrected ASE values should be, we write out
the formula for the random effects for each
normalization method more explicitly, in terms of the values of
.
Overall Normalization¶
For Overall Normalization, recalling that we have
we may write the random effects by sample as
Based on the definitions in Correcting the GRM for Gender Using Overall Normalization, and
bearing in mind how we have partitioned , we write out the
entries in matrix notation as
Noting that
and
this gives us
Thus,
Now define
and finally define the allele substitution effects (ASE) for Overall Normalization as
This gives us
But
and
We now can confirm that for Overall Normalization,
Note
The above definitions of and
may be rewritten as
From this, we see that
In other words, the ASE values for Overall Normalization for gender-chromosome markers are in proportion to each other, but are not the same as each other. Two sets of ASE values must be used.
Normalization by Individual Marker¶
For Normalization by Individual Marker, we first represent the
gender-corrected matrix as
as was done in Correcting the GRM for Gender Using Normalization by Individual Marker, using the same notations. We next write out some representations of :
Noting that all of the -type matrices are symmetric, we get the following:
Representing the GRM as
we now represent the sample-based random effects as
Now define
and finally define the allele substitutions effects (ASE) for Normalization by Individual Marker as
This gives us
But
and
So, we can now confirm that for Normalization by Individual Marker,
Note
The above definitions of and
may be rewritten as
This is and
multiplying the same vector
Since the normalizations for any marker in the gender
chromosome, and the diagonal elements of
and
corresponding to marker
, are
and
for females and
males, respectively, we see that
Therefore,
In other words, it is also true that for Normalization by Individual Marker, the ASE values for gender-chromosome markers are in proportion to each other, but are not the same as each other. Two sets of ASE values must be used here, also.
A Further Note About Dosage Compensation¶
While [Yang2011] discusses this subject in detail, let’s answer one question–why are these factors used for dosage compensation? Let’s look at the individual covariance terms:
Covariance Terms¶
As implied above, the contribution from marker to the
covariance between two female subjects
and
for
Normalization by Individual Marker is the GRM variance component
times the expression
where and
are the number of minor
alleles at marker
for (female) subjects
and
, and
is the (overall) minor allele frequency for
marker
.
Meanwhile, the contribution from marker to the covariance
between two male subjects
and
for Normalization
by Individual Marker is, instead, the GRM variance component times
where is the dosage compensation factor,
and
are the number of gender-chromosome minor alleles (0
or 1) at marker
for (male) subjects
and
,
and (as above)
is the (overall) minor allele frequency for
marker
.
For Overall Normalization, the male terms used in are
the same as those used for Normalization by Individual
Marker. Also, the contributions to the Overall Normalization
factor
from the gender chromosome for males are analogous
to the normalization terms for Normalization by Individual
Marker, namely
Therefore, if you were to use Overall Normalization to compute a
GRM for only one marker in the gender chromosome, and only for male
subjects, your contribution from that marker to the
covariance between two (male) subjects would be the GRM variance
component times
which is equal, for this data, to the male/male covariance term for Normalization by Individual Marker.
To match this, for Overall Normalization, the female terms used in
are twice as small as the corresponding autosomal
terms. However, the contributions to the variance for the female
gender chromosome are multiplied by
for each female
term (and thus by two if both terms are female). Thus, for an
all-female gender-chromosome GRM computed on only one marker using
Overall Normalization, the contribution from that marker
to the covariance between two (female) subjects
and
would be the GRM variance component times
This is equal to
which is equal, for this data, to the female/female covariance term for Normalization by Individual Marker.
Comparison of Dosage Compensation Factors¶
Let’s (focus on Normalization by Individual Marker and) write the male/male covariance term for the three levels of dosage compensation, also giving these expressions a common denominator.
For Full Dosage Compensation, we have
For Equal X-Linked Variance, we have
And for No Dosage Compensation, we have
Comparing these with the covariance between two female subjects, we see that
For Full Dosage Compensation (“each allele in females has only half the effect that an allele has in males”), we can compare the female
term with the male
term.
For Equal X-Linked Variance, we find ourselves comparing the female
term with the male
term, using the common denominator above. You might think of this as a type of “Partial Dosage Compensation”.
Finally, for No Dosage Compensation (“each allele has a similar effect on the trait”), we can compare the female
term with the male
term.
This leads to the second question, “How is the Equal X-Linked Variance assumption ‘Equal’?” The answer is that in each of the expressions
and
the dosage is normalized by the would-be variance of the dosage for the whole marker if all the subjects were of the gender in question. This is not true for the male/male covariance term for either of the other two Dosage Compensation assumptions.
Separating Out Gender-Chromosome-Linked Effects¶
If you select AI as your REML algorithm, SVS allows you to separate out the gender-chromosome-linked effects from the non-gender-chromosome-linked effects by fitting the model
where is a vector of genetic effects attributable to the
gender chromosome, with
and
where and
are the genomic relationship
matrices (GRM’s) for the gender chromosome and for all other
chromosomes, respectively, so that
Note
You may wish to test for the most appropriate Dosage
Compensation for the gender-chromosome GRM for this
model by running the model three times, each time with a different
Dosage Compensation for
, then comparing the likelihoods
of the model-fitting under each of the three assumptions.
Genomic Prediction¶
Sometimes, it is desired to predict the random effects (genomic merit/genomic breeding values) for samples for which there is genotypic data but no phenotype data, based on other samples for which phenotype data, genotypic data and, if applicable, covariate data does exist.
Additionally, it is sometimes desired to predict the phenotypes themselves for those samples with missing phenotypes, so long as the samples have not only genotypic data but also, if applicable, covariate data.
Call the samples for which there are phenotype values the
“training set”, and the
others the “validation
set”. Assume all samples have genotypic data, imputed or otherwise,
for all markers. Also assume all samples in the training set have
valid covariate data, if there are covariates being used.
To predict the random effects and the missing phenotypes, we do the following:
Without loss of generality, imagine the samples of the training set all come first, before any samples of the validation set. Define
, where the width and height of
is
, the width of the zero matrix is
, and the height of the zero matrix is
. Also partition
,
, and
according to training vs. validation, as
Either compute the genomic relationship matrix or import a pre-computed genomic relationship matrix based on all samples (both training and validation sets).
Note
If correcting for gender has been selected, the modifications for computing
and
remain the same as noted above in Correcting the GRM for Gender Using Overall Normalization and Correcting the GRM for Gender Using Normalization by Individual Marker.
Use Restricted Maximum Likelihood (REML) on the mixed model for the training set, which may be expressed as
where
and
, to determine the variance components and the values of
.
If EMMA is being used as the REML algorithm (Finding the Variance Components Using EMMA), it will also determine
and the inverse of
for the training set.
Noting that the form of the second of Henderson’s mixed-model equations, as modified to accommodate singular
, is, for
,
we obtain
or
Noting the following equalities,
We may write
as a solution for the genomic BLUP.
If we now define, for Overall Normalization,
where
is the matrix
for Overall Normalization, we find that
which makes
a solution for the ASE. The computational streamlining becomes computing
as
computing
as
and finally computing
and the ASE
as before.
Here,
is calculated for all samples, namely both training samples and validation samples.
For Normalization by Individual Marker, if we now define
and
where
is defined as in Normalization by Individual Marker, and is calculated for all samples, namely both training samples and validation samples, we find that
and that
which makes
a solution for the ASE. The computational streamlining becomes computing
as
computing
as
and finally computing
and the ASE
as before.
If Average Information is being used as the REML algorithm (Finding the Variance Components Using the Average Information (AI) Technique), we may express the total random effect
as the sum of the
random effects of the individual components
Given this, we express the mixed model for the training set as
with
for either normalization method. Along with the variance components (
and
), the AI algorithm will find the matrix
and the values of
for the training set. From these are found
and
for each variance component
, as well as
for each variance component, with the total random effect estimate being
For any variance component
for which the matrix
is completely generated from a set of markers
as
or as
, where
or
is also generated exclusively from set
of markers, we may compute an allele substitution effect (ASE) value for any marker within such a set. The vector of these values is
for Overall Normalization, or
for Normalization by Individual Marker.
Finally, noting that the full mixed-model problem is
or
we predict the validation phenotypes
from (the intercept and) any validation covariates
and the predicted values
. If there are missing validation covariates, the corresponding validation phenotypes are not predicted.
Note
As a “bonus”, we also use the same prediction algorithm to “re-predict” the training phenotypes as
Note
If correcting for gender has been selected, the
modifications for partitioning and either
or
(both of which
involve all samples, both training samples and validation samples)
and computing the ASE (
or
) remain the same as noted above in
Correcting the GRM for Gender Using Overall Normalization, Correcting the GRM for Gender Using Normalization by Individual Marker,
and Correcting the ASE for the Gender Chromosome.
Bayes C and C-pi¶
Problem Statement¶
Suppose that we have the following mixed model equation to describe the relationship between the phenotypes of our samples, their genotypes, and fixed and random effects.
over samples, with fixed effects in
specified in
that include the intercept and any additional covariates you may
have specified. Also, the random effects,
are additive genetic merits
or genomic breeding values associated with each sample.
We can define in terms of the genotypes of each sample over
autosomal markers and the allele substitution effects.
where is an
x
matrix containing the genotypes of each
sample.
is 0, 1, or 2, depending upon whether the genotype for the
-th sample at the
-th locus is homozygous for the major allele,
heterozygous, or homozygous for the minor allele, respectively.
is a
vector for which
is the allele substitution effect (ASE) for marker
.
Estimating the Model Parameters¶
The two Bayesian methods for fitting this mixed model implemented in SVS are
Bayes C and Bayes C [Habier2011]. The only difference between the two methods
is the assumptions about
.
is the prior probability that a SNP has no effect on the phenotype,
the value of
is treated as unknown and is estimated in the Bayes C
method and is treated as known with a value of 0.9 [Neves2012] in Bayes C.
The Bayesian approach uses prior probabilities, beliefs about the parameters before data is analyzed, and the conditional probabilities, a probability density for a parameter based on the given data, to construct full conditional posterior probabilities. These posterior probabilities are then sampled from to make estimates about the model parameters.
SVS uses a single-site Gibbs sampler.
Prior and Posterior Distributions¶
There are six parameters that are sampled each iteration of the Gibbs sampler.
The parameters are:
: the prior probability that a SNP has no effect.
: the allele substitution effect (ASE) at loci
.
: the component of variance associated with the ASE.
: the component of variance associated with the error term.
: the component of the
-th fixed effect.
: whether a marker is included in the current iteration.
The prior distributions for the parameters are as follows [Fernando2009a], [Sorensen2002]:
Parameter |
Prior |
---|---|
Where is the number of iterations for the Gibbs sampler,
,
, [Fernando2009b] and
is
defined as [Habier2011]:
where is the initial value of
and is defined as [Habier2011]:
where is:
or the sum of would-be variances over all the markers if each had been at
Hardy-Weinberg equilibrium. And is the additive-genetic variance
explained by SNPs [Habier2011] and we can set it as 0.05 [Fernando2009b].
The prior of is constant and not proper, however the posterior is.
The initial value for the first
is the mean of the phenotype values
and the rest are set to 0.
The posterior distributions for these parameters are [Fernando2009a], [Sorensen2002]:
Parameter |
Posterior |
---|---|
where is the number of markers included in this iteration
(see Deciding when to include a marker) and
is the scale factor for
the posterior distribution
and it is set at 1. And
is [Habier2011]:
and is defined as [Habier2011]:
where is defined as above and is updated each iteration with the new
value of
.
Note
A prime in the subscript mean “all but this.” For example, when sampling
we remove all the fixed effects and their coefficients
except for the current one.
The posterior of can be better explained by [Fernando2009a]:
Deciding when to include a marker¶
Because of the difficulty in sampling directly from the distribution
we use the log-likelihoods, find the probability that
is 1, and
sample from a uniform distribution,
, to determine if a
marker will be included in the current iteration, if we do not include a marker,
then
will be set to 0.
The log-likelihoods are defined as [Fernando2009b]:
We can then define the probability that is 1 as [Fernando2009b]:
Finding the ASE and Genomic Estimated Breeding Values¶
To find the our estimates for and
we take the average
over all iterations:
for all and
.
The ASE values will then be the new values.
To find the genomic estimated breeding values (GEBV) we use the ASE values:
,
, and
are found
in the same way as
and
, taking the average
value over all iterations.
Gender Correction¶
To correct for gender, we take the following steps:
For markers within the X chromosome, we use the following entries for matrix
:
For females, we encode them the same as non-X-chromosome markers.
For males, we use 0 or 1 for
, depending upon whether the genotype for the
-th sample at the
-th locus contains the major (X-chromosome) allele or the minor (X-chromosome) allele, respectively.
The other entries of
are left the same.
To compute
, we continue to use
as the expected-variance term for most markers. For X-chromosome markers, however, we use
where
and
are the fraction of the samples that are male and female, respectively.
We still compute
as before.
During the
sampling phase of the Gibbs sampler, the
and
matrices are split into male and female sections and separate samples are taken.
The non-X-Chromosome ASE values are added (from the males and females) to get the final ASE values.
Normalizing the ASE¶
To normalize the ASE values we do:
Standardizing Phenotype Values¶
Phenotype values will be standardized to prevent from becoming
too large and disrupting the results.
Values will be set to their z-score:
The result spreadsheet by marker will have the original phenotypes and the standardized phenotypes, if phenotype prediction is chosen, then there will be a column of predicted phenotypes and transformed back predicted phenotypes.
Genomic Prediction¶
To predict the phenotype of samples for which there is genotypic data but no phenotypic data (will be known as our validation set) we can run the Gibbs sampler with just the data from the samples with known phenotypic and genotypic data (the training set).
After running the Gibbs sampler with just the training set data we can estimate the GEBVs for all samples with:
where is estimated with just the training samples.
And we can predict the phenotypes for all samples (training and validation) with:
However, if there are missing covariate values the phenotypes cannot be predicted for the sample and it will be dropped from the result spreadsheet.