# 2.13.3. Mixed Linear Model Analysis¶

GWAS mixed linear model analysis uses a kinship matrix to correct for cryptic relatedness as a random effect and can include any additional fixed effects in the model.

The methods currently available in the **Mixed Linear Model Analysis** tool
include:

Linear regression (fixed effects only) [Vilhjalmsson2012]

Mixed Model GWAS using a single locus (EMMAX) [Kang2010], [Vilhjalmsson2012]

Multi-locus mixed model GWAS (MLMM) [Segura2012], [Vilhjalmsson2012]

While the **Linear regression (fixed effects only)** method does not
include a random effect component, it is able to take advantage of the
Fast F test and regression code written by Bjarni Vilhjalmsson for the
MLMM implementation in Python.

Note

This feature only performs **linear** regression or **linear**
mixed-model analysis, even on a binary phenotype. The outputs
will be the same as they would be for recoding the binary phenotype
as an integer and then running the analysis. According to [Kang2010]
this is “in the spirit of Armitage’s test”. Thus results from a logistic
regression (**Numeric > Numeric Regression Analysis** or **Genotype >
Haplotype Trend Regression** (using windowed haplotype blocks with
a size of one marker)) could be considerably different from EMMAX or
MLMM in **Genotype > Mixed Linear Model Analysis**. For a more
direct comparison, recode the binary phenotype into an integer
phenotype and then run linear regression analysis through Genotypic
Regression Analysis, Numeric Regression Analysis or Haplotype Trend
Regression.

## Data Types Accepted¶

This tool can perform the regression analysis directly on mapped genotypic data – there is no need to recode the data into a numeric model. However, this analysis can be run on mapped numerically recoded genotypes as well.

Note

For genotypic input, only one minor allele per marker is allowed.

For recoded genotypes, the corresponding restrictions hold:

For the Additive Model, there must be either two or three unique values.

For other models, there must be exactly two unique values.

Additionally, this feature may be run on mapped real-mode numeric data, such as PCA-corrected genotypic data. However, when using real-mode numeric data, observe the following:

Select

**Numerically as average value**for imputing missing data.Do not select

**Correct for Hemizygous Males**.In the

**Additional Outputs**tab, do not request**Allele frequencies**,**Genotype counts**or**Allele Counts**.

## Performing Analysis¶

To perform mixed linear model analysis on genotypic data, recoded
genotypic data, or real data, open a spreadsheet and select a column
for the dependent variable. The dependent variable must be either
quantitative (real-valued or integer-valued) or a binary case/control
status column. To open the Mixed Linear Model Analysis window, select the
**Genotype** > **Mixed Linear Model Analysis** menu item. This feature is
currently supported for spreadsheets with only one column set as dependent.
Categorical dependent columns are currently not supported.

The main tab of the Mixed Linear Model Analysis window (see Figure Mixed Linear Model Analysis Window (Main Tab)) allows for various methods and parameters to be set or changed. A list and brief description of these options is as follows:

**Linear regression (fixed effects only):**This method performs a linear regression analysis with fixed effects only. If the marker mapped data is genotypic, then the data is recoded first into the specified numeric model (additive, dominant or recessive). The advantage of this linear regression analysis is that it uses the same F test (Fast F Test) as the mixed linear models (EMMAX and MLMM) and has the choice of imputation methods. See Correct for Additional Covariates for instructions on including fixed effects in the model.

The following methods use a regression analysis framework to add either a random effect (normally based on kinship) or both a random effect and fixed effects to regression analysis models for each genetic marker. The regression models include:

**Single-locus mixed model GWAS (EMMAX):**[Kang2010] This method is an EMMAX implementation by [Vilhjalmsson2012] which can include fixed effects (see Correct for Additional Covariates) but always includes a kinship matrix as a random effect. Either an identity-by-state (IBS) kinship matrix will be computed from the genotypic data or a pre-computed kinship matrix can be selected to speed up analysis. See Precomputed Kinship Matrix Option for more information.**Multi-locus mixed model GWAS (MLMM):**[Segura2012] This method is an implementation of the MLMM method by [Vilhjalmsson2012] which uses a forward and backward stepwise approach to select markers as fixed effect covariates in the model. Additional fixed effects can be selected (see Correct for Additional Covariates) and either a pre-computed kinship matrix can be used (see Precomputed Kinship Matrix Option) or an IBS kinship matrix can be computed from the genotypic data and used for the analysis.

An additional parameter specifically for the EMMAX and MLMM methods is

**Use Pre-Computed Kinship Matrix (Cov. Matrix of Random Effects)**: See Precomputed Kinship Matrix Option.

Additional parameters for all regression models include:

**Genetic Model and Imputation:****Genetic model to use:**The genetic models available include:*Additive*: Recodes Major Homozygous genotype (dd) to 0, Heterozygous (Dd) to 1, and Minor Homozygous (DD) to 2.*Dominant*: Recodes Major Homozygous genotype (dd) to 0 and Heterozygous (Dd) and Minor Homozygous (DD) to 1.*Recessive*: Recodes Major Homozygous (dd) and Heterozygous (Dd) to 0 and Minor Homozygous (DD) to 1.

Note

If you are running this analysis from a numerically recoded spreadsheet, this prompt will read

**Genetic model used for recoding the original spreadsheet**, and you should enter which of the above recoding operations you have already performed (see Recode Genotypes) or would have performed to create the spreadsheet from which you are running this analysis.**Impute missing data as:**The options for imputing the missing genotypes include setting the missing genotype to*Homozygous major allele*: Always sets the missing genotypes to 0.*Numerically as average value*: Uses the average non-missing recoded genotype values for each marker as the value to use for recoding missing genotypes.Note

If

**Correct for Hemizygous Males**(see below) is also selected, and there is non-missing data for both males and females in a given marker, averages for males and females will be computed and applied separately.

**Correct for Hemizygous Males**: Recodes the X-Chromosome genotypes for males as 0 or 1. Assumes the column is coded as if the male were homozygous for the X-Chromosome allele in question.**Choose Sex Column:**Choose the spreadsheet column that specifies the gender of the sample. This column may either be categorical (“M” vs. “F”) or binary (0 = male, 1 = female).**Chromosome that is hemizygous for males:**Usually the X Chromosome, which is the default.

**Correct for Additional Covariates**: See Correct for Additional Covariates.

## Additional Outputs¶

The second tab of the Mixed Linear Model Analysis window (see Figure
Mixed Linear Model Analysis Window (Second Tab)) allows for additional outputs to be
added to the output spreadsheet of **Linear regression (fixed effects
only)** or of **Single-locus mixed model GWAS (EMMAX)** and/or to the
p-value output spreadsheet of **Multi-locus mixed model GWAS
(MLMM)**. Four of these selectable outputs are enhancements to all p-value outputs:

**Bonferroni multiple testing correction**The number of markers tested (successfully or not) is counted as the number of tests for this correction.**False discovery rate (FDR)****Output data for P-P/Q-Q plots****Output -log 10(P)**Negative log base 10 of the p-values. This is the optimal column to use for plotting the values in a genome browser.

The other four are genotype statistics, which are placed at the end of the spreadsheet.

**Call rate (fraction not missing)****Allele frequencies****Genotype counts****Allele Counts**

Note

If you are using a numerically recoded spreadsheet and are not using the additive model, only

**Call rate**is selectable as a genotype statistic.Before the genotype statistics are output, the

*Actual Sample Size*, that is, the number of samples actually used for each marker, as distinct from samples containing missing data that is imputed, will always be output.If you are using a genotypic spreadsheet, the

*Minor Allele (Test Allele)*and the*Major Allele*for each marker will always be output.

## Precomputed Kinship Matrix Option¶

A pre-computed kinship matrix may be used for any of the following mixed-model analysis features:

Single-locus mixed model GWAS (EMMAX) (Performing Analysis)

Multi-locus mixed model GWAS (MLMM) (Performing Analysis)

Single-locus mixed model GWAS (EMMAX) with Interactions (Mixed Linear Model Analysis with Interactions)

Compute Genomic BLUP (GBLUP) (Performing GBLUP Analysis)

Genetic Correlation using GBLUP (Genetic Correlation of Two Traits using GBLUP)

Bayes C and C-pi Analysis (Bayes C and C-pi Genomic Prediction Analysis)

K-Fold Cross Validation (K-Fold Cross Validation)

Mixed-Model KBAC (Mixed-Model KBAC)

A pre-computed kinship matrix may be obtained from any one of the following features or places:

The IBS (identity-by-state) matrix computed automatically as a part of an EMMAX or MLMM (Performing Analysis) run on the same genotypic data as you are analyzing.

An IBS matrix computed using the Identity by Descent Estimation feature.

Any other relationship matrix which is computed using the Identity by Descent Estimation feature, so long as it meets the requirements enumerated below.

A Genomic Relationship Matrix computed automatically as a part of a GBLUP (Performing GBLUP Analysis) run on the same genotypic data as you are analyzing.

A Genomic Relationship Matrix computed by Separately Computing the Genomic Relationship Matrix.

A Numerator Relationship Matrix (“A Matrix”) based on a pedigree as computed by Computing the Numerator Relationship Matrix.

Any other relationship matrix which is imported into SVS, so long as it meets the requirements enumerated below.

Note

For computing Genomic BLUP (GBLUP), a Genomic Relationship Matrix (pre-computed or computed automatically) should be used as the kinship matrix. Results are not guaranteed if a different type of kinship matrix is used for computing GBLUP.

Additionally,

The pre-computed Wishart Matrix (see Spreadsheet Outputs and Spreadsheet Outputs) has the same format as a pre-computed kinship matrix.

A pre-computed kinship matrix is required to be an numeric matrix where is (at least) the number of samples in the genotypic spreadsheet. Both row labels and column headers should be the sample names.

Also, the spreadsheet’s data must be symmetric with respect to the diagonal (within , where is the above-the-diagonal value being checked), and its trace (the sum of its diagonal elements) must be positive. These last two precautions are designed to make it more likely that the selected spreadsheet’s matrix is positive semidefinite.

Note

The rows and columns in the pre-computed kinship matrix, while they need to be sorted in the same order as each other, do not need to be sorted in the same order as the rows of the spreadsheet to be analyzed. However:

The data will be matched up using the row labels (sample names) of both spreadsheets.

The kinship matrix spreadsheet must have a row label (sample name) for every row label (sample name) in the spreadsheet to be analyzed.

If the kinship matrix spreadsheet has extra rows not present in the spreadsheet to be analyzed, those rows (and corresponding columns) simply will not be used.

To begin, check this option (unless you are running Mixed-Model KBAC),
then click on **Select Sheet** and select the kinship spreadsheet from
the window that is presented.

## Correct for Additional Covariates¶

Additional fixed effect covariates can be selected for inclusion in
the model from columns of this spreadsheet. Fixed effect covariates
can be binary, integer, real-valued, categorical or (if actual
genotypic data rather than recoded genotypic data is being used for
the analysis) genotypic. In all cases, if a marker is used as an
additional fixed effect, it will not be included in the analysis in
any other way. To begin, check this option, then click on **Add
Columns** to get a choice of spreadsheet columns to use.

## Output from Various Mixed Linear Models Analysis Methods¶

### Output from the Linear Regression (fixed effects only) Method¶

The linear regression option creates a **P-Values from Linear
Regression** spreadsheet.

The columns in this spreadsheet are:

*P-Value*P-value of the linear regression.*-log10(P-Value)*The negative log (based 10) of the above p-value.Two parameters for each of the following variables:

The intercept (

*Intercept*)Each additional covariate you have specified

The individual marker being tested (

*Predictor*)

These parameters are:

*__________ Beta*: This is the beta value of the parameter that has been tested.For the intercept and any covariates, this corresponds to one component of in the third note for Further Optimization When Covariates Are Present.

For the individual marker being tested (

*Predictor*), this corresponds to in the third note for Further Optimization When Covariates Are Present.Note

The

*Predictor Beta*indicates the direction of the effect.*__________ Beta SE*: This is the standard error corresponding to this beta value.This is computed as

Here, is a component ranging from the

*Intercept*through any covariates you have specified and through the*Predictor*, and is the matrix made up of these components, asis the RSS value () for the regression, is the number of samples and is one plus the number of any fixed covariates you may have selected.

Any additional p-value outputs you have selected (see Additional Outputs).

The actual sample size and any genotype statistic outputs (see Additional Outputs).

### Output from the Single Locus Mixed Linear Model (EMMAX) Method¶

If an IBS kinship matrix was computed on the fly (and your input data
is genotypic), this will be output as **IBS Distance ((IBS2 *
0.5*IBS1)/ # non-missing markers)**. (Otherwise, if it was computed on
the fly for numerically recoded genotypes, it will be called **IBS
Relationship Matrix (Computed from Recoded Data)**.) This spreadsheet
can be used again for further Mixed Model analysis.

Also created is the **P-Values from Single-Locus Mixed Model** spreadsheet.

The columns in this spreadsheet are:

*P-Value*: P-value of the mixed linear model.*-log10(P-Value)*The negative log (based 10) of the above p-value.Two parameters for each of the following variables:

The intercept (

*Intercept*)Each additional covariate you have specified

The individual marker being tested (

*Predictor*)

These parameters are:

*__________ Beta*: This is the beta value of the parameter that has been tested.For the intercept and any covariates, this corresponds to one component of in Further Optimization When Covariates Are Present.

For the individual marker being tested (

*Predictor*), this corresponds to in Further Optimization When Covariates Are Present.Note

The

*Predictor Beta*indicates the direction of the (fixed) effect.*__________ Beta SE*: This is the standard error corresponding to this beta value.This is computed as

Here, is a component ranging from the

*Intercept*through any covariates you have specified and through the*Predictor*, and is the matrix made up of these components, asis the RSS value () for the transformed regression, is the number of samples and is one plus the number of any fixed covariates you may have selected.

Any additional p-value outputs you have selected (see Additional Outputs).

*Proportion of Variance Explained*: This column contains the proportion of variance explained by the effects of this marker. Using the notation of Further Optimization When Covariates Are Present, this proportion of variance explained will beThe actual sample size and any genotype statistic outputs (see Additional Outputs).

The following will be output to the node change log of this spreadsheet:

The scaling factor used for the kinship matrix, as computed according to Scaling the Kinship Matrix.

The pseudo-heritability , which is

The variance and standard error of the pseudo-heritability and , see Estimating the Variance of Heritability for One Random Effect for the formula.

The genetic component of variance

*Vg*().The error component of variance

*Ve*().

If you have selected additional covariates, two additional outputs will appear in the node change log:

*Proportion of genetic variance*and*Prop. explained by fixed covariates*.

These correspond exactly to the proportion of *Genetic variance*
( for ) and the proportion of *Variance
explained* () outputs of the **Variance Partition
Plot** (See The Variance Partition Plot) for the initial model
() of an MLMM run.

### Outputs from the Multi-Locus Mixed Linear Model (MLMM) Method¶

If an IBS kinship matrix was computed on the fly (and your input data
is genotypic), this will be output as **IBS Distance ((IBS2 *
0.5*IBS1)/ # non-missing markers)**. (Otherwise, if it was computed on
the fly for numerically recoded genotypes, it will be called **IBS
Relationship Matrix (Computed from Recoded Data)**.) This spreadsheet
can be used again for further Mixed Model analysis.

Also created are:

**P-Values from Multi-Locus Mixed Model**: The columns in this spreadsheet are grouped by step number. For each step containing a model not used by a previous step, and starting with the initial model, the following columns are added:*P-Value (Initial Model)*or*P-Value (After Step ##)*: P-value of the mixed linear model.*-log10(P-Value)*The negative log (based 10) of the above p-value.*Predictor Beta*: This is the beta value of the individual marker that has been tested (the*Predictor*), and corresponds to in Further Optimization When Covariates Are Present.Note

For the Initial Model, the

*Predictor Beta*indicates the direction of the (fixed) effect of the marker being tested. For succeeding steps, the*Predictor Beta*only indicates the direction of the additional (fixed) effect of the marker being tested over and above the (fixed) effects of the markers (now being used as covariates) found in the previous steps. No one “effect direction” is defined for multiple markers considered as one unit.*Predictor Beta SE*: This is the standard error corresponding to this beta value, and could be computed as noted above in Output from the Single Locus Mixed Linear Model (EMMAX) Method.However, since we only need to output the standard error for the

*Predictor*, and we don’t need to output the standard error for the*Intercept*or any covariates, the following optimized expression, which is based on the optimized regression for , namelyis used to compute the standard error for the

*Predictor*:is the RSS value () for the (transformed and) optimized regression, is the number of samples and is one plus the number of any fixed covariates you may have selected.

Any additional p-value outputs you have selected (see Additional Outputs).

*Prop. Var. Explained*: This column contains the proportion of variance explained by the effects of this marker. Using the notation of Further Optimization When Covariates Are Present, this proportion of variance explained will be*Manhattan Category (Initial Model)*or*Manhattan Category (After Step ##)*: This column can be used to color the -log10 p-values in a genome browser. This column is preferred to the*Chromosome*category because it labels the marker covariates as a separate category (“Cofactor” rather than “Chr ##”).

Note

“For each step containing a model not used by a previous step” normally means the initial model and all Forward Selection steps. This is because Backward Elimination normally eliminates in the reverse order of Forward Selection, thus causing these models to duplicate the models of corresponding Forward Selection steps. But not always. If Backward Elimination eliminates one or two (or more) covariates in a different order than reversing Forward Selection would have done, the results of these different Backward Elimination models will additionally be displayed in this spreadsheet.

Note

If, for the step being displayed, the marker has become a covariate marker (which you will know because the

*Manhattan Category*will show that the marker is a “Cofactor”), the p-value, regression beta, beta standard error, and proportion of variance explained (PVE) for this marker (say, marker ) are obtained by using the EMMAX algorithm with all the other covariate markers, plus any additional covariates you have selected, as covariates, using marker as the only marker which is scanned as a non-covariate.The covariate markers for any step (whether shown in this spreadsheet or not) are also summarized in the final columns of the

**MLMM Step Information and Covariate P-Values**spreadsheet (see below).At the end are added the actual sample size and any genotype statistic outputs (see Additional Outputs).

**MLMM Step Information and Covariate P-Values**: The rows of this spreadsheet correspond to the MLMM models obtained by both the forward selection steps and backward elimination steps. The first row corresponds to the initial model where no covariate markers have yet been selected. This spreadsheet will have the following columns:*Preceding step type*: None (for the initial model), Forward selection or Backward elimination.*Optimal according to:*If the model is optimal according to one or more model criteria (other than classic BIC), the criterion or criteria used for classifying the model as optimal will be listed. (See Model Criteria.) Possible criteria (and their abbreviations) are:*Bonferroni**Multiple Bonferroni**Ext. BIC*Extended Bayes Information Criteria*Mod. BIC*Modified Bayes Information Criteria*Multiple PPA*Multiple Posterior Probability of Association

*Pseudo-heritability*: Proportion of inheritance explained by the random effects, as evaluated for this model. If this model is after Step , this proportion iswhere , , and are all evaluated for this model.

*Variance of heritability*, the variance of the pseudo-heritability , as evaluated for this model.*SE of heritability*, the standard error of the pseudo-heritability , as evaluated for this model. See Estimating the Variance of Heritability for One Random Effect for the formula for these two values.The genetic component of variance

*Vg*() for this model.The error component of variance

*Ve*() for this model.*RSS*: The actual Residual Sum of Squares with just the covariates as fixed effects. This may be written aswhere the results from using the EMMA technique to solve the base model equation

and the include marker covariates that are a part of this model, any additional covariates you have specified, and the vector of all one’s for obtaining the intercept. (See The EMMAX Approximations and Technique.)

*REML Mahalanobis RSS*: Transformed RSS with just the covariates as fixed effects. Using the notation of Further Optimization When Covariates Are Present, this is .*Min Mahalanobis RSS*: Transformed RSS for the covariates plus the marker with the best p-value. Using the notation of Further Optimization When Covariates Are Present, this is where represents the marker with the best p-value.*Log likelihood*: The full log likelihood (as computed using ML) for this model.*BIC*: Classic Bayesian Information Criterion. (See Model Criteria.)*Extended BIC*: BIC penalized by the model space dimension. (See Model Criteria.)*Modified BIC*: BIC penalized by the model space dimension using an alternative algorithm. (See Model Criteria.)*Max. cofactor p-value*: The maximum individual p-value for all of the covariate markers included in this model (see note below).*Min. p-value*: The minimum p-value for all of the markers not included as covariates in this model.*Beta at min. p-value*The beta value of the marker having the minimum p-value.*Beta std. error at min. p-value*The beta standard error for the marker having the minimum p-value.*Max. PPA*: The maximum, over all markers, of the posterior probability of association (PPA) (see Model Criteria).*Max. PPA p-value*: The p-value of that marker having the maximum PPA.The remaining columns contain the following values for the covariate markers included in this model:

*P:*The individual p-value*B:*The regression beta*SE:*The standard error for the regression beta*PVE:*The proportion of variance explained

These are displayed in four columns, each marker-mapped and lateled with the covariate marker name. The data in these columns is individually labeled according to the above abbreviations.

Note

The p-value, regression beta, beta standard error, and proportion of variance explained (PVE) values for any covariate marker (say, marker ) are obtained by using the EMMAX algorithm with all the other covariate markers, plus any additional covariates you have selected, as covariates, using marker as the only marker which is scanned as a non-covariate.

These covariate marker values are also shown in the

**P-Values from Multi-Locus Mixed Model**spreadsheet (for each step containing a model not used by a previous step). In that spreadsheet, for any given step being displayed, the covariate markers are each designated in the step’s*Manhattan Category*column as “Cofactor”.

**Variance Partition Plot**: See The Variance Partition Plot.

The following will be output to the node change log of each of the above outputs:

The options selected and numbers of samples and markers scanned

The scaling factor used for the kinship matrix, as computed according to Scaling the Kinship Matrix

The number of MLMM forward steps actually used

### The Variance Partition Plot¶

This output graph shows, for the model resulting from each step, how the variance of may be partitioned into the following component proportions:

*Error*(In Red)*Genetic variance*(In Green)*Variance explained*(In Blue)*Explained by fixed covariates*(In Gray – will show only if you have selected additional covariates)

These proportions are obtained as follows:

A reference RSS () is determined to be used for all models. This reference RSS will be the (actual) Residual Sum of Squares with only the intercept as a fixed effect. This may be written as

where the results from using the EMMA technique to solve the null model equation

(See The EMMAX Approximations and Technique.)

If you have selected additional covariates, this will be computed separately. If you have not selected additional covariates, we have

where is the

*RSS*value for the initial model.If you have selected additional covariates, the proportion explained by fixed covariates will be

In any case, for the model resulting from step , the proportion of

*Variance explained*iswhere is the

*RSS*value computed for the model resulting from step . If you have not selected additional covariates, this simplifies toThe proportion of

*Genetic variance*iswhere the pseudo-heritability is

, , and having been evaluated for step .

The

*Error*accounts for the rest of the variance: