# Mixed Linear Model Analysis¶

The methods currently available in the **Mixed Linear Model Analysis** tool
include:

- Linear regression (fixed effects only) [Vilhjalmsson2012]
- Mixed Model GWAS using a single locus (EMMAX) [Kang2010], [Vilhjalmsson2012]
- Multi-locus mixed model GWAS (MLMM) [Segura2012], [Vilhjalmsson2012]

While linear regression does not include a random effect component, this method is able to take advantage of the Fast F test and regression code written by Bjarni Vilhjalmsson for the MLMM implementation in Python.

GWAS mixed linear model analysis uses a kinship matrix to correct for cryptic relatedness as a random effect and can include any additional fixed effects in the model.

This analysis can perform the regression analysis directly on genotypic
data – there is no need to recode the data into a numeric model as
with *Numeric Regression Analysis*. However, this analysis can be run on a
numerically recoded spreadsheet as well.

Note

EMMAX only performs **linear** regression, even on a binary
phenotype. This is similar to recoding the binary phenotype as an
integer. According to [Kang2010] this is “in the spirit of
Armitage’s test”. Thus results from a logistic regression
(**Numeric > Numeric Regression Analysis** or **Genotype >
Haplotype Trend Regression** (using windowed haplotype blocks with
a size of one marker)) will be considerably different from EMMAX or
MLMM in **Genotype > Mixed Linear Model Analysis**. For a more
direct comparison, a binary phenotype can be recoded into an
integer phenotype to run linear regression analysis through Numeric
Regression Analysis or Haplotype Trend Regression.

## Performing Analysis¶

To perform mixed linear model analysis on genotypic data or recoded genotypic data,
open a spreadsheet and select a column for the dependent variable.
The dependent variable must be either quantitative (real-valued or
integer-valued) or a binary case/control
status column. To open the Mixed Linear Model Analysis window, select the
**Genotype** > **Mixed Linear Model Analysis** menu item. This feature is
currently supported for spreadsheets with only one column set as dependent.
Categorical dependent columns are currently not supported.

The main tab of the Mixed Linear Model Analysis window (see Figure
*Mixed Linear Model Analysis Window (Main Tab)*) allows for various methods and
parameters to be set or changed. A list and brief description of these
options is as follows:

**Linear regression (fixed effects only):**This method performs a linear regression analysis with fixed effects only. If the marker mapped data is genotypic, then the data is recoded first into the specified numeric model (additive, dominant or recessive). The advantage of this linear regression analysis is that it uses the same F test (Fast F Test) as the mixed linear models (EMMAX and MLMM) and has the choice of imputation methods. See*Correct for Additional Covariates*for instructions on including fixed effects in the model.

The following methods use a regression analysis framework to add either a random effect (normally based on kinship) or both a random effect and fixed effects to regression analysis models for each genetic marker. The regression models include:

**Single-locus mixed model GWAS (EMMAX):**[Kang2010] This method is an EMMAX implementation by [Vilhjalmsson2012] which can include fixed effects (see*Correct for Additional Covariates*) but always includes a kinship matrix as a random effect. Either an identity-by-state (IBS) kinship matrix will be computed from the genotypic data or a pre-computed kinship matrix can be selected to speed up analysis. See*Precomputed Kinship Matrix Option*for more information.**Multi-locus mixed model GWAS (MLMM):**[Segura2012] This method is an implementation of the MLMM method by [Vilhjalmsson2012] which uses a forward and backward stepwise approach to select markers as fixed effect covariates in the model. Additional fixed effects can be selected (see*Correct for Additional Covariates*) and either a pre-computed kinship matrix can be used (see*Precomputed Kinship Matrix Option*) or an IBS kinship matrix can be computed from the genotypic data and used for the analysis.

An additional parameter specifically for the EMMAX and MLMM methods is

**Use Pre-Computed Kinship Matrix (Cov. Matrix of Random Effects)**: See*Precomputed Kinship Matrix Option*.

Additional parameters for all regression models include:

**Genetic Model and Imputation:****Genetic model to use:**The genetic models available include:*Additive*: Recodes Major Homozygous genotype (dd) to 0, Heterozygous (Dd) to 1, and Minor Homozygous (DD) to 2.*Dominant*: Recodes Major Homozygous genotype (dd) to 0 and Heterozygous (Dd) and Minor Homozygous (DD) to 1.*Recessive*: Recodes Major Homozygous (dd) and Heterozygous (Dd) to 0 and Minor Homozygous (DD) to 1.

Note

If you are running this analysis from a numerically recoded spreadsheet, this prompt will read

**Genetic model used for recoding the original spreadsheet**, and you should enter which of the above recoding operations you have already performed (see*Recode Genotypes*) or would have performed to create the spreadsheet from which you are running this analysis.**Impute missing data as:**The options for imputing the missing genotypes include setting the missing genotype to*Homozygous major allele*: Always sets the missing genotypes to 0.*Numerically as average value*: Uses the average non-missing recoded genotype values for each marker as the value to use for recoding missing genotypes.Note

If

**Correct for Hemizygous Males**(see below) is also selected, and there is non-missing data for both males and females in a given marker, averages for males and females will be computed and applied separately.

**Correct for Hemizygous Males**: Recodes the X-Chromosome genotypes for males as 0 or 1. Assumes the column is coded as if the male were homozygous for the X-Chromosome allele in question.**Choose Sex Column:**Choose the spreadsheet column that specifies the gender of the sample. This column may either be categorical (“M” vs. “F”) or binary (0 = male, 1 = female).**Chromosome that is hemizygous for males:**Usually the X Chromosome, which is the default.

**Correct for Additional Covariates**: See*Correct for Additional Covariates*.

## Additional Outputs¶

The second tab of the Mixed Linear Model Analysis window (see Figure
*Mixed Linear Model Analysis Window (Second Tab)*) allows for additional outputs to be
added to the output spreadsheet of **Linear regression (fixed effects
only)** or of **Single-locus mixed model GWAS (EMMAX)** and/or to the
p-value output spreadsheet of **Multi-locus mixed model GWAS
(MLMM)**. Four of these selectable outputs are enhancements to all p-value outputs:

**Bonferroni multiple testing correction****False discovery rate (FDR)****Output data for P-P/Q-Q plots****Output -log 10(P)**Negative log base 10 of the p-values. This is the optimal column to use for plotting the values in a genome browser.

The other four are genotype statistics, which are placed at the end of the spreadsheet.

**Call rate (fraction not missing)****Allele frequencies****Genotype counts****Allele Counts**

Note

- If you are using a numerically recoded spreadsheet and are not
using the additive model, only
**Call rate**is selectable as a genotype statistic. - Before the genotype statistics are output, the
*Actual Sample Size*, that is, the number of samples actually used for each marker, as distinct from samples containing missing data that is imputed, will always be output. - If you are using a genotypic spreadsheet, the
*Minor Allele (Test Allele)*and the*Major Allele*for each marker will always be output.

## Precomputed Kinship Matrix Option¶

A pre-computed kinship matrix may be used for any of the following mixed-model analysis features:

- Single-locus mixed model GWAS (EMMAX) (
*Performing Analysis*) - Multi-locus mixed model GWAS (MLMM) (
*Performing Analysis*) - Single-locus mixed model GWAS (EMMAX) with Interactions (
*Performing Analysis*) - Compute Genomic BLUP (GBLUP) (
*Performing GBLUP Analysis*) - Mixed-Model KBAC (
*Mixed-Model KBAC*)

A pre-computed kinship matrix may be obtained from any one of the following features or places:

- The IBS (identity-by-state) matrix computed automatically as a part
of an EMMAX or MLMM (
*Performing Analysis*) run on the same genotypic data as you are analyzing. - An IBS matrix computed using the
*Identity by Descent Estimation*feature. - Any other relationship matrix which is computed using the
*Identity by Descent Estimation*feature, so long as it meets the requirements enumerated below. - A Genomic Relationship Matrix computed automatically as a part of a
GBLUP (
*Performing GBLUP Analysis*) run on the same genotypic data as you are analyzing. - A Genomic Relationship Matrix computed by
*Separately Computing the Genomic Relationship Matrix*. - A Numerator Relationship Matrix (“A Matrix”) based on a pedigree as
computed by
*Computing the Numerator Relationship Matrix*. - Any other relationship matrix which is imported into SVS, so long as it meets the requirements enumerated below.

Note

For computing Genomic BLUP (GBLUP), a Genomic Relationship Matrix (pre-computed or computed automatically) should be used as the kinship matrix. Results are not guaranteed if a different type of kinship matrix is used for computing GBLUP.

A pre-computed kinship matrix is required to be an numeric matrix where is (at least) the number of samples in the genotypic spreadsheet. Both row labels and column headers should be the sample names.

Also, the spreadsheet’s data must be symmetric with respect to the diagonal (within , where is the above-the-diagonal value being checked), and its trace (the sum of its diagonal elements) must be positive. These last two precautions are designed to make it more likely that the selected spreadsheet’s matrix is positive semidefinite.

Note

The rows and columns in the pre-computed kinship matrix, while they need to be sorted in the same order as each other, do not need to be sorted in the same order as the rows of the spreadsheet to be analyzed. However:

- The data will be matched up using the row labels (sample names) of both spreadsheets.
- The kinship matrix spreadsheet must have a row label (sample name) for every row label (sample name) in the spreadsheet to be analyzed.
- If the kinship matrix spreadsheet has extra rows not present in the spreadsheet to be analyzed, those rows (and corresponding columns) simply will not be used.

To begin, check this option (unless you are running Mixed-Model KBAC),
then click on **Select Sheet** and select the kinship spreadsheet from
the window that is presented.

## Correct for Additional Covariates¶

Additional fixed effect covariates can be selected for inclusion in
the model from columns of this spreadsheet. Fixed effect covariates
can be binary, integer, real-valued, categorical or (if actual
genotypic data rather than recoded genotypic data is being used for
the analysis) genotypic. In all cases, if a marker is used as an
additional fixed effect, it will not be included in the analysis in
any other way. To begin, check this option, then click on **Add
Columns** to get a choice of spreadsheet columns to use.

## Output from Various Mixed Linear Models Analysis Methods¶

### Output from the Linear Regression (fixed effects only) Method¶

The linear regression option creates a **P-Values from Linear
Regression** spreadsheet.

The columns in this spreadsheet are:

*P-Value*P-value of the linear regression.*-log10(P-Value)*The negative log (based 10) of the above p-value.*Regression Beta*: This is the beta value of the individual marker that has been tested, and corresponds to in the third note for*Further Optimization When Covariates Are Present*.Note

*Regression Beta*indicates the direction of the effect.*Beta Standard Error*: This is the standard error corresponding to this beta value, and is computed aswhere is the RSS value () for the regression itself (either the original form of the regression or of the optimized form, ), is the number of samples, is one plus the number of any fixed covariates you may have selected, and is the RSS value of regressing against the intercept and any fixed covariates, as in

Computing the value may be optimized to

Any additional p-value outputs you have selected (see

*Additional Outputs*).The actual sample size and any genotype statistic outputs (see

*Additional Outputs*).

### Output from the Single Locus Mixed Linear Model (EMMAX) Method¶

If a kinship matrix was computed on the fly, this will be output
as **IBS Distance ((IBS2 * 0.5*IBS1)/ # non-missing markers)**. This
spreadsheet can be used again for further Mixed Model analysis.

Also created is the **P-Values from Single-Locus Mixed Model** spreadsheet.

The columns in this spreadsheet are:

*P-Value*: P-value of the mixed linear model.*-log10(P-Value)*The negative log (based 10) of the above p-value.*Regression Beta*: This is the beta value of the individual marker that has been tested, and corresponds to in*Using the Mixed Model for Association Studies*.Note

*Regression Beta*indicates the direction of the (fixed) effect.*Beta Standard Error*: This is the standard error corresponding to this beta value, and is computed aswhere is the RSS value () for the transformed regression itself (either the original form of the transformed regression or of the optimized form, ), is the number of samples, is one plus the number of any fixed covariates you may have selected, and is the RSS value of regressing the transformed (which is ) against the transformation of the intercept and any fixed covariates, as in

Computing the value may be optimized to

Any additional p-value outputs you have selected (see

*Additional Outputs*).*Proportion of Variance Explained*: This column contains the proportion of variance explained by the effects of this marker. Using the notation of*Further Optimization When Covariates Are Present*, this proportion of variance explained will beThe actual sample size and any genotype statistic outputs (see

*Additional Outputs*).

The following will be output to the node change log of this spreadsheet:

The pseudo-heritability , which is

The variance and standard error of the pseudo-heritability and , see

*Estimating the Variance of Heritability*for the formula.The genetic component of variance

*Vg*().The error component of variance

*Ve*().

If you have selected additional covariates, two additional outputs will appear in the node change log:

*Proportion of genetic variance*and*Prop. explained by fixed covariates*.

These correspond exactly to the proportion of *Genetic variance*
( for ) and the proportion of *Variance
explained* () outputs of the **Variance Partition
Plot** (See *The Variance Partition Plot*) for the initial model
() of an MLMM run.

### Outputs from the Multi-Locus Mixed Linear Model (MLMM) Method¶

If an IBS kinship matrix was computed on the fly (and your input data
is genotypic), this will be output as **IBS Distance ((IBS2 *
0.5*IBS1)/ # non-missing markers)**. (Otherwise, if it was computed on
the fly for numerically recoded genotypes, it will be called **IBS
Relationship Matrix (Computed from Recoded Data)**.) This spreadsheet
can be used again for further Mixed Model analysis.

Also created are:

**P-Values from Multi-Locus Mixed Model**: The columns in this spreadsheet are grouped by step number. For each step the following columns are added:*P-Value*: P-value of the mixed linear model.*-log10(P-Value)*The negative log (based 10) of the above p-value.*Regression Beta*: This is the beta value of the individual marker that has been tested, and corresponds to in*Using the Mixed Model for Association Studies*.Note

For the Initial Model, the

*Regression Beta*indicates the direction of the (fixed) effect of the marker being tested. For succeeding steps, the*Regression Beta*only indicates the direction of the additional (fixed) effect of the marker being tested over and above the (fixed) effects of the markers (now being used as covariates) found in the previous steps. No one “effect direction” is defined for multiple markers considered as one unit.*Beta Standard Error*: This is the standard error corresponding to this beta value, and is computed as noted above in*Output from the Single Locus Mixed Linear Model (EMMAX) Method*.Any additional p-value outputs you have selected (see

*Additional Outputs*).*Prop. Var. Explained*: This column contains the proportion of variance explained by the effects of this marker. Using the notation of*Further Optimization When Covariates Are Present*, this proportion of variance explained will be*Manhattan Category*: This column can be used to color the -log10 p-values in a genome browser. This column is preferred to the*Chromosome*category because it colors the covariates as a separate category.

At the end are added the actual sample size and any genotype statistic outputs (see

*Additional Outputs*).**MLMM Step Information and Covariate P-Values**: The rows of this spreadsheet correspond to the MLMM models obtained by both the forward selection steps and backward elimination steps. The first row corresponds to the initial model where no covariate markers have yet been selected. This spreadsheet will have the following columns:*Preceding step type*: None (for the initial model), Forward selection or Backward elimination.*Optimal according to:*If the model is optimal according to one or more model criteria (other than classic BIC), the criterion or criteria used for classifying the model as optimal will be listed. (See*Model Criteria*.) Possible criteria (and their abbreviations) are:*Bonferroni**Multiple Bonferroni**Ext. BIC*Extended Bayes Information Criteria*Mod. BIC*Modified Bayes Information Criteria*Multiple PPA*Multiple Posterior Probability of Association

*Pseudo-heritability*: Proportion of inheritance explained by the random effects, as evaluated for this model. If this model is after Step , this proportion iswhere , , and are all evaluated for this model.

The genetic component of variance

*Vg*() for this model.The error component of variance

*Ve*() for this model.*RSS*: The actual Root Sum of Squares with just the covariates as fixed effects. This may be written aswhere the results from using the EMMA technique to solve the base model equation

and the include marker covariates that are a part of this model, any additional covariates you have specified, and the vector of all one’s for obtaining the intercept. (See

*The EMMAX Approximations and Technique*.)*REML Mahalanobis RSS*: Transformed RSS with just the covariates as fixed effects. Using the notation of*Further Optimization When Covariates Are Present*, this is .*Min Mahalanobis RSS*: Transformed RSS for the covariates plus the marker with the best p-value. Using the notation of*Further Optimization When Covariates Are Present*, this is where represents the marker with the best p-value.*Log likelihood*: The full log likelihood for this model.*BIC*: Classic Bayesian Information Criterion. (See*Model Criteria*.)*Extended BIC*: BIC penalized by the model space dimension. (See*Model Criteria*.)*Modified BIC*: BIC penalized by the model space dimension using an alternative algorithm. (See*Model Criteria*.)*Max. cofactor p-value*: The maximum individual p-value for all of the covariate markers included in this model (see note below).*Min. p-value*: The minimum p-value for all of the markers not included as covariates in this model.*Beta at min. p-value*The beta value of the marker having the minimum p-value.*Beta std. error at min. p-value*The beta standard error for the marker having the minimum p-value.*Max. PPA*: The maximum, over all markers, of the posterior probability of association (PPA) (see*Model Criteria*).*Max. PPA p-value*: The p-value of that marker having the maximum PPA.The remaining columns contain the following values for the covariate markers included in this model (see note below). For each covariate marker, these values are grouped together within parentheses inside one column:

*P:*The individual p-value*B:*The regression beta*SE:*The standard error for the regression beta

Note

The individual p-value, regression beta, and beta standard error of each covariate marker are obtained by using the EMMAX algorithm with all the other covariate markers, plus any additional covariates you have selected, as covariates, using marker as the only non-covariate marker which is scanned.

**Variance Partition Plot**: See*The Variance Partition Plot*.

### The Variance Partition Plot¶

This output graph shows, for the model resulting from each step, how the variance of may be partitioned into the following component proportions:

*Error*(In Red)*Genetic variance*(In Green)*Variance explained*(In Blue)*Explained by fixed covariates*(In Gray – will show only if you have selected additional covariates)

These proportions are obtained as follows:

A reference RSS () is determined to be used for all models. This reference RSS will be the (actual) Root Sum of Squares with only the intercept as a fixed effect. This may be written as

where the results from using the EMMA technique to solve the null model equation

(See

*The EMMAX Approximations and Technique*.)If you have selected additional covariates, this will be computed separately. If you have not selected additional covariates, we have

where is the

*RSS*value for the initial model.If you have selected additional covariates, the proportion explained by fixed covariates will be

In any case, for the model resulting from step , the proportion of

*Variance explained*iswhere is the

*RSS*value computed for the model resulting from step . If you have not selected additional covariates, this simplifies toThe proportion of

*Genetic variance*iswhere the pseudo-heritability is

, , and having been evaluated for step .

The

*Error*accounts for the rest of the variance: