2.13.3. Mixed Linear Model Analysis¶
GWAS mixed linear model analysis uses a kinship matrix to correct for cryptic relatedness as a random effect and can include any additional fixed effects in the model.
The methods currently available in the Mixed Linear Model Analysis tool include:
Linear regression (fixed effects only) [Vilhjalmsson2012]
Mixed Model GWAS using a single locus (EMMAX) [Kang2010], [Vilhjalmsson2012]
Multi-locus mixed model GWAS (MLMM) [Segura2012], [Vilhjalmsson2012]
While the Linear regression (fixed effects only) method does not include a random effect component, it is able to take advantage of the Fast F test and regression code written by Bjarni Vilhjalmsson for the MLMM implementation in Python.
Note
This feature only performs linear regression or linear mixed-model analysis, even on a binary phenotype. The outputs will be the same as they would be for recoding the binary phenotype as an integer and then running the analysis. According to [Kang2010] this is “in the spirit of Armitage’s test”. Thus results from a logistic regression (Numeric > Numeric Regression Analysis or Genotype > Haplotype Trend Regression (using windowed haplotype blocks with a size of one marker)) could be considerably different from EMMAX or MLMM in Genotype > Mixed Linear Model Analysis. For a more direct comparison, recode the binary phenotype into an integer phenotype and then run linear regression analysis through Genotypic Regression Analysis, Numeric Regression Analysis or Haplotype Trend Regression.
Data Types Accepted¶
This tool can perform the regression analysis directly on mapped genotypic data – there is no need to recode the data into a numeric model. However, this analysis can be run on mapped numerically recoded genotypes as well.
Note
For genotypic input, only one minor allele per marker is allowed.
For recoded genotypes, the corresponding restrictions hold:
For the Additive Model, there must be either two or three unique values.
For other models, there must be exactly two unique values.
Additionally, this feature may be run on mapped real-mode numeric data, such as PCA-corrected genotypic data. However, when using real-mode numeric data, observe the following:
Select Numerically as average value for imputing missing data.
Do not select Correct for Hemizygous Males.
In the Additional Outputs tab, do not request Allele frequencies, Genotype counts or Allele Counts.
Performing Analysis¶
To perform mixed linear model analysis on genotypic data, recoded genotypic data, or real data, open a spreadsheet and select a column for the dependent variable. The dependent variable must be either quantitative (real-valued or integer-valued) or a binary case/control status column. To open the Mixed Linear Model Analysis window, select the Genotype > Mixed Linear Model Analysis menu item. This feature is currently supported for spreadsheets with only one column set as dependent. Categorical dependent columns are currently not supported.

Mixed Linear Model Analysis Window (Main Tab)¶
The main tab of the Mixed Linear Model Analysis window (see Figure Mixed Linear Model Analysis Window (Main Tab)) allows for various methods and parameters to be set or changed. A list and brief description of these options is as follows:
Linear regression (fixed effects only): This method performs a linear regression analysis with fixed effects only. If the marker mapped data is genotypic, then the data is recoded first into the specified numeric model (additive, dominant or recessive). The advantage of this linear regression analysis is that it uses the same F test (Fast F Test) as the mixed linear models (EMMAX and MLMM) and has the choice of imputation methods. See Correct for Additional Covariates for instructions on including fixed effects in the model.
The following methods use a regression analysis framework to add either a random effect (normally based on kinship) or both a random effect and fixed effects to regression analysis models for each genetic marker. The regression models include:
Single-locus mixed model GWAS (EMMAX): [Kang2010] This method is an EMMAX implementation by [Vilhjalmsson2012] which can include fixed effects (see Correct for Additional Covariates) but always includes a kinship matrix as a random effect. Either an identity-by-state (IBS) kinship matrix will be computed from the genotypic data or a pre-computed kinship matrix can be selected to speed up analysis. See Precomputed Kinship Matrix Option for more information.
Multi-locus mixed model GWAS (MLMM): [Segura2012] This method is an implementation of the MLMM method by [Vilhjalmsson2012] which uses a forward and backward stepwise approach to select markers as fixed effect covariates in the model. Additional fixed effects can be selected (see Correct for Additional Covariates) and either a pre-computed kinship matrix can be used (see Precomputed Kinship Matrix Option) or an IBS kinship matrix can be computed from the genotypic data and used for the analysis.
An additional parameter specifically for the EMMAX and MLMM methods is
Use Pre-Computed Kinship Matrix (Cov. Matrix of Random Effects): See Precomputed Kinship Matrix Option.
Additional parameters for all regression models include:
Genetic Model and Imputation:
Genetic model to use: The genetic models available include:
Additive: Recodes Major Homozygous genotype (dd) to 0, Heterozygous (Dd) to 1, and Minor Homozygous (DD) to 2.
Dominant: Recodes Major Homozygous genotype (dd) to 0 and Heterozygous (Dd) and Minor Homozygous (DD) to 1.
Recessive: Recodes Major Homozygous (dd) and Heterozygous (Dd) to 0 and Minor Homozygous (DD) to 1.
Note
If you are running this analysis from a numerically recoded spreadsheet, this prompt will read Genetic model used for recoding the original spreadsheet, and you should enter which of the above recoding operations you have already performed (see Recode Genotypes) or would have performed to create the spreadsheet from which you are running this analysis.
Impute missing data as: The options for imputing the missing genotypes include setting the missing genotype to
Homozygous major allele: Always sets the missing genotypes to 0.
Numerically as average value: Uses the average non-missing recoded genotype values for each marker as the value to use for recoding missing genotypes.
Note
If Correct for Hemizygous Males (see below) is also selected, and there is non-missing data for both males and females in a given marker, averages for males and females will be computed and applied separately.
Correct for Hemizygous Males: Recodes the X-Chromosome genotypes for males as 0 or 1. Assumes the column is coded as if the male were homozygous for the X-Chromosome allele in question.
Choose Sex Column: Choose the spreadsheet column that specifies the gender of the sample. This column may either be categorical (“M” vs. “F”) or binary (0 = male, 1 = female).
Chromosome that is hemizygous for males: Usually the X Chromosome, which is the default.
Correct for Additional Covariates: See Correct for Additional Covariates.
Additional Outputs¶

Mixed Linear Model Analysis Window (Second Tab)¶
The second tab of the Mixed Linear Model Analysis window (see Figure Mixed Linear Model Analysis Window (Second Tab)) allows for additional outputs to be added to the output spreadsheet of Linear regression (fixed effects only) or of Single-locus mixed model GWAS (EMMAX) and/or to the p-value output spreadsheet of Multi-locus mixed model GWAS (MLMM). Four of these selectable outputs are enhancements to all p-value outputs:
Bonferroni multiple testing correction The number of markers tested (successfully or not) is counted as the number of tests for this correction.
False discovery rate (FDR)
Output data for P-P/Q-Q plots
Output -log 10(P) Negative log base 10 of the p-values. This is the optimal column to use for plotting the values in a genome browser.
The other four are genotype statistics, which are placed at the end of the spreadsheet.
Call rate (fraction not missing)
Allele frequencies
Genotype counts
Allele Counts
Note
If you are using a numerically recoded spreadsheet and are not using the additive model, only Call rate is selectable as a genotype statistic.
Before the genotype statistics are output, the Actual Sample Size, that is, the number of samples actually used for each marker, as distinct from samples containing missing data that is imputed, will always be output.
If you are using a genotypic spreadsheet, the Minor Allele (Test Allele) and the Major Allele for each marker will always be output.
Precomputed Kinship Matrix Option¶
A pre-computed kinship matrix may be used for any of the following mixed-model analysis features:
Single-locus mixed model GWAS (EMMAX) (Performing Analysis)
Multi-locus mixed model GWAS (MLMM) (Performing Analysis)
Single-locus mixed model GWAS (EMMAX) with Interactions (Mixed Linear Model Analysis with Interactions)
Compute Genomic BLUP (GBLUP) (Performing GBLUP Analysis)
Genetic Correlation using GBLUP (Genetic Correlation of Two Traits using GBLUP)
Bayes C and C-pi Analysis (Bayes C and C-pi Genomic Prediction Analysis)
K-Fold Cross Validation (K-Fold Cross Validation)
Mixed-Model KBAC (Mixed-Model KBAC)
A pre-computed kinship matrix may be obtained from any one of the following features or places:
The IBS (identity-by-state) matrix computed automatically as a part of an EMMAX or MLMM (Performing Analysis) run on the same genotypic data as you are analyzing.
An IBS matrix computed using the Identity by Descent Estimation feature.
Any other relationship matrix which is computed using the Identity by Descent Estimation feature, so long as it meets the requirements enumerated below.
A Genomic Relationship Matrix computed automatically as a part of a GBLUP (Performing GBLUP Analysis) run on the same genotypic data as you are analyzing.
A Genomic Relationship Matrix computed by Separately Computing the Genomic Relationship Matrix.
A Numerator Relationship Matrix (“A Matrix”) based on a pedigree as computed by Computing the Numerator Relationship Matrix.
Any other relationship matrix which is imported into SVS, so long as it meets the requirements enumerated below.
Note
For computing Genomic BLUP (GBLUP), a Genomic Relationship Matrix (pre-computed or computed automatically) should be used as the kinship matrix. Results are not guaranteed if a different type of kinship matrix is used for computing GBLUP.
Additionally,
The pre-computed Wishart Matrix (see Spreadsheet Outputs and Spreadsheet Outputs) has the same format as a pre-computed kinship matrix.
A pre-computed kinship matrix is required to be an numeric matrix where
is (at least) the number of
samples in the genotypic spreadsheet. Both row labels and column
headers should be the sample names.
Also, the spreadsheet’s data must be symmetric with respect to the
diagonal (within , where
is
the above-the-diagonal value being checked), and its trace (the sum of
its diagonal elements) must be positive. These last two precautions
are designed to make it more likely that the selected spreadsheet’s
matrix is positive semidefinite.
Note
The rows and columns in the pre-computed kinship matrix, while they need to be sorted in the same order as each other, do not need to be sorted in the same order as the rows of the spreadsheet to be analyzed. However:
The data will be matched up using the row labels (sample names) of both spreadsheets.
The kinship matrix spreadsheet must have a row label (sample name) for every row label (sample name) in the spreadsheet to be analyzed.
If the kinship matrix spreadsheet has extra rows not present in the spreadsheet to be analyzed, those rows (and corresponding columns) simply will not be used.
To begin, check this option (unless you are running Mixed-Model KBAC), then click on Select Sheet and select the kinship spreadsheet from the window that is presented.
Correct for Additional Covariates¶
Additional fixed effect covariates can be selected for inclusion in the model from columns of this spreadsheet. Fixed effect covariates can be binary, integer, real-valued, categorical or (if actual genotypic data rather than recoded genotypic data is being used for the analysis) genotypic. In all cases, if a marker is used as an additional fixed effect, it will not be included in the analysis in any other way. To begin, check this option, then click on Add Columns to get a choice of spreadsheet columns to use.
Output from Various Mixed Linear Models Analysis Methods¶
Output from the Linear Regression (fixed effects only) Method¶
The linear regression option creates a P-Values from Linear Regression spreadsheet.
The columns in this spreadsheet are:
P-Value P-value of the linear regression.
-log10(P-Value) The negative log (based 10) of the above p-value.
Two parameters for each of the following variables:
The intercept (Intercept)
Each additional covariate you have specified
The individual marker being tested (Predictor)
These parameters are:
__________ Beta: This is the beta value of the parameter that has been tested.
For the intercept and any covariates, this corresponds to one component of
in the third note for Further Optimization When Covariates Are Present.
For the individual marker being tested (Predictor), this corresponds to
in the third note for Further Optimization When Covariates Are Present.
Note
The Predictor Beta indicates the direction of the effect.
__________ Beta SE: This is the standard error corresponding to this beta value.
This is computed as
Here,
is a component ranging from the Intercept through any covariates you have specified and through the Predictor
, and
is the matrix made up of these components, as
is the RSS value (
) for the regression,
is the number of samples and
is one plus the number of any fixed covariates you may have selected.
Any additional p-value outputs you have selected (see Additional Outputs).
The actual sample size and any genotype statistic outputs (see Additional Outputs).
Output from the Single Locus Mixed Linear Model (EMMAX) Method¶
If an IBS kinship matrix was computed on the fly (and your input data is genotypic), this will be output as IBS Distance ((IBS2 * 0.5*IBS1)/ # non-missing markers). (Otherwise, if it was computed on the fly for numerically recoded genotypes, it will be called IBS Relationship Matrix (Computed from Recoded Data).) This spreadsheet can be used again for further Mixed Model analysis.
Also created is the P-Values from Single-Locus Mixed Model spreadsheet.
The columns in this spreadsheet are:
P-Value: P-value of the mixed linear model.
-log10(P-Value) The negative log (based 10) of the above p-value.
Two parameters for each of the following variables:
The intercept (Intercept)
Each additional covariate you have specified
The individual marker being tested (Predictor)
These parameters are:
__________ Beta: This is the beta value of the parameter that has been tested.
For the intercept and any covariates, this corresponds to one component of
in Further Optimization When Covariates Are Present.
For the individual marker being tested (Predictor), this corresponds to
in Further Optimization When Covariates Are Present.
Note
The Predictor Beta indicates the direction of the (fixed) effect.
__________ Beta SE: This is the standard error corresponding to this beta value.
This is computed as
Here,
is a component ranging from the Intercept through any covariates you have specified and through the Predictor
, and
is the matrix made up of these components, as
is the RSS value (
) for the transformed regression,
is the number of samples and
is one plus the number of any fixed covariates you may have selected.
Any additional p-value outputs you have selected (see Additional Outputs).
Proportion of Variance Explained: This column contains the proportion of variance explained by the effects of this marker. Using the notation of Further Optimization When Covariates Are Present, this proportion of variance explained
will be
The actual sample size and any genotype statistic outputs (see Additional Outputs).
The following will be output to the node change log of this spreadsheet:
The scaling factor used for the kinship matrix, as computed according to Scaling the Kinship Matrix.
The pseudo-heritability
, which is
The variance and standard error of the pseudo-heritability
and
, see Estimating the Variance of Heritability for One Random Effect for the formula.
The genetic component of variance Vg (
).
The error component of variance Ve (
).
If you have selected additional covariates, two additional outputs will appear in the node change log:
Proportion of genetic variance and
Prop. explained by fixed covariates.
These correspond exactly to the proportion of Genetic variance
( for
) and the proportion of Variance
explained (
) outputs of the Variance Partition
Plot (See The Variance Partition Plot) for the initial model
(
) of an MLMM run.
Outputs from the Multi-Locus Mixed Linear Model (MLMM) Method¶
If an IBS kinship matrix was computed on the fly (and your input data is genotypic), this will be output as IBS Distance ((IBS2 * 0.5*IBS1)/ # non-missing markers). (Otherwise, if it was computed on the fly for numerically recoded genotypes, it will be called IBS Relationship Matrix (Computed from Recoded Data).) This spreadsheet can be used again for further Mixed Model analysis.
Also created are:
P-Values from Multi-Locus Mixed Model: The columns in this spreadsheet are grouped by step number. For each step containing a model not used by a previous step, and starting with the initial model, the following columns are added:
P-Value (Initial Model) or P-Value (After Step ##): P-value of the mixed linear model.
-log10(P-Value) The negative log (based 10) of the above p-value.
Predictor Beta: This is the beta value of the individual marker that has been tested (the Predictor), and corresponds to
in Further Optimization When Covariates Are Present.
Note
For the Initial Model, the Predictor Beta indicates the direction of the (fixed) effect of the marker being tested. For succeeding steps, the Predictor Beta only indicates the direction of the additional (fixed) effect of the marker being tested over and above the (fixed) effects of the markers (now being used as covariates) found in the previous steps. No one “effect direction” is defined for multiple markers considered as one unit.
Predictor Beta SE: This is the standard error corresponding to this beta value, and could be computed as noted above in Output from the Single Locus Mixed Linear Model (EMMAX) Method.
However, since we only need to output the standard error for the Predictor, and we don’t need to output the standard error for the Intercept or any covariates, the following optimized expression, which is based on the optimized regression for
, namely
is used to compute the standard error for the Predictor
:
is the RSS value (
) for the (transformed and) optimized regression,
is the number of samples and
is one plus the number of any fixed covariates you may have selected.
Any additional p-value outputs you have selected (see Additional Outputs).
Prop. Var. Explained: This column contains the proportion of variance explained by the effects of this marker. Using the notation of Further Optimization When Covariates Are Present, this proportion of variance explained
will be
Manhattan Category (Initial Model) or Manhattan Category (After Step ##): This column can be used to color the -log10 p-values in a genome browser. This column is preferred to the Chromosome category because it labels the marker covariates as a separate category (“Cofactor” rather than “Chr ##”).
Note
“For each step containing a model not used by a previous step” normally means the initial model and all Forward Selection steps. This is because Backward Elimination normally eliminates in the reverse order of Forward Selection, thus causing these models to duplicate the models of corresponding Forward Selection steps. But not always. If Backward Elimination eliminates one or two (or more) covariates in a different order than reversing Forward Selection would have done, the results of these different Backward Elimination models will additionally be displayed in this spreadsheet.
Note
If, for the step being displayed, the marker has become a covariate marker (which you will know because the Manhattan Category will show that the marker is a “Cofactor”), the p-value, regression beta, beta standard error, and proportion of variance explained (PVE) for this marker (say, marker
) are obtained by using the EMMAX algorithm with all the other covariate markers, plus any additional covariates you have selected, as covariates, using marker
as the only marker which is scanned as a non-covariate.
The covariate markers for any step (whether shown in this spreadsheet or not) are also summarized in the final columns of the MLMM Step Information and Covariate P-Values spreadsheet (see below).
At the end are added the actual sample size and any genotype statistic outputs (see Additional Outputs).
MLMM Step Information and Covariate P-Values: The rows of this spreadsheet correspond to the MLMM models obtained by both the forward selection steps and backward elimination steps. The first row corresponds to the initial model where no covariate markers have yet been selected. This spreadsheet will have the following columns:
Preceding step type: None (for the initial model), Forward selection or Backward elimination.
Optimal according to: If the model is optimal according to one or more model criteria (other than classic BIC), the criterion or criteria used for classifying the model as optimal will be listed. (See Model Criteria.) Possible criteria (and their abbreviations) are:
Bonferroni
Multiple Bonferroni
Ext. BIC Extended Bayes Information Criteria
Mod. BIC Modified Bayes Information Criteria
Multiple PPA Multiple Posterior Probability of Association
Pseudo-heritability: Proportion of inheritance explained by the random effects, as evaluated for this model. If this model is after Step
, this proportion is
where
,
, and
are all evaluated for this model.
Variance of heritability, the variance of the pseudo-heritability
, as evaluated for this model.
SE of heritability, the standard error of the pseudo-heritability
, as evaluated for this model. See Estimating the Variance of Heritability for One Random Effect for the formula for these two values.
The genetic component of variance Vg (
) for this model.
The error component of variance Ve (
) for this model.
RSS: The actual Residual Sum of Squares with just the covariates as fixed effects. This may be written as
where the
results from using the EMMA technique to solve the base model equation
and the
include marker covariates that are a part of this model, any additional covariates you have specified, and the vector of all one’s for obtaining the intercept. (See The EMMAX Approximations and Technique.)
REML Mahalanobis RSS: Transformed RSS with just the covariates as fixed effects. Using the notation of Further Optimization When Covariates Are Present, this is
.
Min Mahalanobis RSS: Transformed RSS for the covariates plus the marker with the best p-value. Using the notation of Further Optimization When Covariates Are Present, this is
where
represents the marker with the best p-value.
Log likelihood: The full log likelihood (as computed using ML) for this model.
BIC: Classic Bayesian Information Criterion. (See Model Criteria.)
Extended BIC: BIC penalized by the model space dimension. (See Model Criteria.)
Modified BIC: BIC penalized by the model space dimension using an alternative algorithm. (See Model Criteria.)
Max. cofactor p-value: The maximum individual p-value for all of the covariate markers included in this model (see note below).
Min. p-value: The minimum p-value for all of the markers not included as covariates in this model.
Beta at min. p-value The beta value of the marker having the minimum p-value.
Beta std. error at min. p-value The beta standard error for the marker having the minimum p-value.
Max. PPA: The maximum, over all markers, of the posterior probability of association (PPA) (see Model Criteria).
Max. PPA p-value: The p-value of that marker having the maximum PPA.
The remaining columns contain the following values for the covariate markers included in this model:
P: The individual p-value
B: The regression beta
SE: The standard error for the regression beta
PVE: The proportion of variance explained
These are displayed in four columns, each marker-mapped and lateled with the covariate marker name. The data in these columns is individually labeled according to the above abbreviations.
Note
The p-value, regression beta, beta standard error, and proportion of variance explained (PVE) values for any covariate marker (say, marker
) are obtained by using the EMMAX algorithm with all the other covariate markers, plus any additional covariates you have selected, as covariates, using marker
as the only marker which is scanned as a non-covariate.
These covariate marker values are also shown in the P-Values from Multi-Locus Mixed Model spreadsheet (for each step containing a model not used by a previous step). In that spreadsheet, for any given step being displayed, the covariate markers are each designated in the step’s Manhattan Category column as “Cofactor”.
Variance Partition Plot: See The Variance Partition Plot.
The following will be output to the node change log of each of the above outputs:
The options selected and numbers of samples and markers scanned
The scaling factor used for the kinship matrix, as computed according to Scaling the Kinship Matrix
The number of MLMM forward steps actually used
The Variance Partition Plot¶
This output graph shows, for the model resulting from each step, how
the variance of may be partitioned into the following
component proportions:
Error (In Red)
Genetic variance (In Green)
Variance explained (In Blue)
Explained by fixed covariates (In Gray – will show only if you have selected additional covariates)

Variance Partition Plot with additional fixed covariates¶
These proportions are obtained as follows:
A reference RSS (
) is determined to be used for all models. This reference RSS will be the (actual) Residual Sum of Squares with only the intercept as a fixed effect. This may be written as
where the
results from using the EMMA technique to solve the null model equation
(See The EMMAX Approximations and Technique.)
If you have selected additional covariates, this will be computed separately. If you have not selected additional covariates, we have
where
is the RSS value for the initial model.
If you have selected additional covariates, the proportion explained by fixed covariates will be
In any case, for the model resulting from step
, the proportion of Variance explained is
where
is the RSS value computed for the model resulting from step
. If you have not selected additional covariates, this simplifies to
The proportion of Genetic variance is
where the pseudo-heritability
is
,
, and
having been evaluated for step
.
The Error accounts for the rest of the variance: