Mixed Linear Model Analysis

The methods currently available in the Mixed Linear Model Analysis tool include:

While linear regression does not include a random effect component, this method is able to take advantage of the Fast F test and regression code written by Bjarni Vilhjalmsson for the MLMM implementation in Python.

GWAS mixed linear model analysis uses a kinship matrix to correct for cryptic relatedness as a random effect and can include any additional fixed effects in the model.

This analysis can perform the regression analysis directly on genotypic data – there is no need to recode the data into a numeric model as with Numeric Regression Analysis. However, this analysis can be run on a numerically recoded spreadsheet as well.

Note

EMMAX only performs linear regression, even on a binary phenotype. This is similar to recoding the binary phenotype as an integer. According to [Kang2010] this is “in the spirit of Armitage’s test”. Thus results from a logistic regression (Numeric > Numeric Regression Analysis or Genotype > Haplotype Trend Regression (using windowed haplotype blocks with a size of one marker)) will be considerably different from EMMAX or MLMM in Genotype > Mixed Linear Model Analysis. For a more direct comparison, a binary phenotype can be recoded into an integer phenotype to run linear regression analysis through Numeric Regression Analysis or Haplotype Trend Regression.

Performing Analysis

To perform mixed linear model analysis on genotypic data or recoded genotypic data, open a spreadsheet and select a column for the dependent variable. The dependent variable must be either quantitative (real-valued or integer-valued) or a binary case/control status column. To open the Mixed Linear Model Analysis window, select the Genotype > Mixed Linear Model Analysis menu item. This feature is currently supported for spreadsheets with only one column set as dependent. Categorical dependent columns are currently not supported.

mixedModelWin

Mixed Linear Model Analysis Window (Main Tab)

The main tab of the Mixed Linear Model Analysis window (see Figure Mixed Linear Model Analysis Window (Main Tab)) allows for various methods and parameters to be set or changed. A list and brief description of these options is as follows:

  • Linear regression (fixed effects only): This method performs a linear regression analysis with fixed effects only. If the marker mapped data is genotypic, then the data is recoded first into the specified numeric model (additive, dominant or recessive). The advantage of this linear regression analysis is that it uses the same F test (Fast F Test) as the mixed linear models (EMMAX and MLMM) and has the choice of imputation methods. See Correct for Additional Covariates for instructions on including fixed effects in the model.

The following methods use a regression analysis framework to add either a random effect (normally based on kinship) or both a random effect and fixed effects to regression analysis models for each genetic marker. The regression models include:

  • Single-locus mixed model GWAS (EMMAX): [Kang2010] This method is an EMMAX implementation by [Vilhjalmsson2012] which can include fixed effects (see Correct for Additional Covariates) but always includes a kinship matrix as a random effect. Either an identity-by-state (IBS) kinship matrix will be computed from the genotypic data or a pre-computed kinship matrix can be selected to speed up analysis. See Precomputed Kinship Matrix Option for more information.
  • Multi-locus mixed model GWAS (MLMM): [Segura2012] This method is an implementation of the MLMM method by [Vilhjalmsson2012] which uses a forward and backward stepwise approach to select markers as fixed effect covariates in the model. Additional fixed effects can be selected (see Correct for Additional Covariates) and either a pre-computed kinship matrix can be used (see Precomputed Kinship Matrix Option) or an IBS kinship matrix can be computed from the genotypic data and used for the analysis.

An additional parameter specifically for the EMMAX and MLMM methods is

Additional parameters for all regression models include:

  • Genetic Model and Imputation:

    • Genetic model to use: The genetic models available include:

      • Additive: Recodes Major Homozygous genotype (dd) to 0, Heterozygous (Dd) to 1, and Minor Homozygous (DD) to 2.
      • Dominant: Recodes Major Homozygous genotype (dd) to 0 and Heterozygous (Dd) and Minor Homozygous (DD) to 1.
      • Recessive: Recodes Major Homozygous (dd) and Heterozygous (Dd) to 0 and Minor Homozygous (DD) to 1.

      Note

      If you are running this analysis from a numerically recoded spreadsheet, this prompt will read Genetic model used for recoding the original spreadsheet, and you should enter which of the above recoding operations you have already performed (see Recode Genotypes) or would have performed to create the spreadsheet from which you are running this analysis.

    • Impute missing data as: The options for imputing the missing genotypes include setting the missing genotype to

      • Homozygous major allele: Always sets the missing genotypes to 0.

      • Numerically as average value: Uses the average non-missing recoded genotype values for each marker as the value to use for recoding missing genotypes.

        Note

        If Correct for Hemizygous Males (see below) is also selected, and there is non-missing data for both males and females in a given marker, averages for males and females will be computed and applied separately.

    • Correct for Hemizygous Males: Recodes the X-Chromosome genotypes for males as 0 or 1. Assumes the column is coded as if the male were homozygous for the X-Chromosome allele in question.

      • Choose Sex Column: Choose the spreadsheet column that specifies the gender of the sample. This column may either be categorical (“M” vs. “F”) or binary (0 = male, 1 = female).
      • Chromosome that is hemizygous for males: Usually the X Chromosome, which is the default.
    • Correct for Additional Covariates: See Correct for Additional Covariates.

Additional Outputs

mixedModelWin2

Mixed Linear Model Analysis Window (Second Tab)

The second tab of the Mixed Linear Model Analysis window (see Figure Mixed Linear Model Analysis Window (Second Tab)) allows for additional outputs to be added to the output spreadsheet of Linear regression (fixed effects only) or of Single-locus mixed model GWAS (EMMAX) and/or to the p-value output spreadsheet of Multi-locus mixed model GWAS (MLMM). Four of these selectable outputs are enhancements to all p-value outputs:

  • Bonferroni multiple testing correction
  • False discovery rate (FDR)
  • Output data for P-P/Q-Q plots
  • Output -log 10(P) Negative log base 10 of the p-values. This is the optimal column to use for plotting the values in a genome browser.

The other four are genotype statistics, which are placed at the end of the spreadsheet.

  • Call rate (fraction not missing)
  • Allele frequencies
  • Genotype counts
  • Allele Counts

Note

  1. If you are using a numerically recoded spreadsheet and are not using the additive model, only Call rate is selectable as a genotype statistic.
  2. Before the genotype statistics are output, the Actual Sample Size, that is, the number of samples actually used for each marker, as distinct from samples containing missing data that is imputed, will always be output.
  3. If you are using a genotypic spreadsheet, the Minor Allele (Test Allele) and the Major Allele for each marker will always be output.

Precomputed Kinship Matrix Option

A pre-computed kinship matrix may be used for any of the following mixed-model analysis features:

A pre-computed kinship matrix may be obtained from any one of the following features or places:

Note

For computing Genomic BLUP (GBLUP), a Genomic Relationship Matrix (pre-computed or computed automatically) should be used as the kinship matrix. Results are not guaranteed if a different type of kinship matrix is used for computing GBLUP.

A pre-computed kinship matrix is required to be an n \times
n numeric matrix where n is (at least) the number of samples in the genotypic spreadsheet. Both row labels and column headers should be the sample names.

Also, the spreadsheet’s data must be symmetric with respect to the diagonal (within 10^{-8} + 10^{-5} |b|, where b is the above-the-diagonal value being checked), and its trace (the sum of its diagonal elements) must be positive. These last two precautions are designed to make it more likely that the selected spreadsheet’s matrix is positive semidefinite.

Note

The rows and columns in the pre-computed kinship matrix, while they need to be sorted in the same order as each other, do not need to be sorted in the same order as the rows of the spreadsheet to be analyzed. However:

  • The data will be matched up using the row labels (sample names) of both spreadsheets.
  • The kinship matrix spreadsheet must have a row label (sample name) for every row label (sample name) in the spreadsheet to be analyzed.
  • If the kinship matrix spreadsheet has extra rows not present in the spreadsheet to be analyzed, those rows (and corresponding columns) simply will not be used.

To begin, check this option (unless you are running Mixed-Model KBAC), then click on Select Sheet and select the kinship spreadsheet from the window that is presented.

Correct for Additional Covariates

Additional fixed effect covariates can be selected for inclusion in the model from columns of this spreadsheet. Fixed effect covariates can be binary, integer, real-valued, categorical or (if actual genotypic data rather than recoded genotypic data is being used for the analysis) genotypic. In all cases, if a marker is used as an additional fixed effect, it will not be included in the analysis in any other way. To begin, check this option, then click on Add Columns to get a choice of spreadsheet columns to use.

Output from Various Mixed Linear Models Analysis Methods

Output from the Linear Regression (fixed effects only) Method

The linear regression option creates a P-Values from Linear Regression spreadsheet.

The columns in this spreadsheet are:

  • P-Value P-value of the linear regression.

  • -log10(P-Value) The negative log (based 10) of the above p-value.

  • Regression Beta: This is the beta value of the individual marker that has been tested, and corresponds to \beta_k in the third note for Further Optimization When Covariates Are Present.

    Note

    Regression Beta indicates the direction of the effect.

  • Beta Standard Error: This is the standard error corresponding to this beta value, and is computed as

    se_k = \sqrt{\frac{RSS_{Reg}}{(n - p) RSS_k}},

    where RSS_{Reg} is the RSS value (\epsilon'\epsilon) for the regression itself (either the original form of the regression or of the optimized form, My = y -
X_f\hat{\beta_{h0}} = MX_k\beta_k + \epsilon_M), n is the number of samples, p is one plus the number of any fixed covariates you may have selected, and RSS_k is the RSS value of regressing X_k against the intercept and any fixed covariates, as in

    X_k = X_f\beta_{Xk} + \epsilon_{Xk} .

    Computing the value RSS_k may be optimized to

    RSS_k = X_k' M X_k = X_k' M M X_k = (M X_k)' (M X_k) .

  • Any additional p-value outputs you have selected (see Additional Outputs).

  • The actual sample size and any genotype statistic outputs (see Additional Outputs).

Output from the Single Locus Mixed Linear Model (EMMAX) Method

If a kinship matrix was computed on the fly, this will be output as IBS Distance ((IBS2 * 0.5*IBS1)/ # non-missing markers). This spreadsheet can be used again for further Mixed Model analysis.

Also created is the P-Values from Single-Locus Mixed Model spreadsheet.

The columns in this spreadsheet are:

  • P-Value: P-value of the mixed linear model.

  • -log10(P-Value) The negative log (based 10) of the above p-value.

  • Regression Beta: This is the beta value of the individual marker that has been tested, and corresponds to \beta_k in Using the Mixed Model for Association Studies.

    Note

    Regression Beta indicates the direction of the (fixed) effect.

  • Beta Standard Error: This is the standard error corresponding to this beta value, and is computed as

    se_k = \sqrt{\frac{RSS_{Reg}}{(n - p) RSS_k}},

    where RSS_{Reg} is the RSS value (\epsilon'\epsilon) for the transformed regression itself (either the original form of the transformed regression or of the optimized form, MB^{-1}y
= B^{-1}y - B^{-1}X_f\hat{\beta_{h0}} = MB^{-1}X_k\beta_k +
\epsilon_{MB}), n is the number of samples, p is one plus the number of any fixed covariates you may have selected, and RSS_k is the RSS value of regressing the transformed X_k (which is B^{-1}X_k) against the transformation of the intercept and any fixed covariates, as in

    B^{-1}X_k = B^{-1}X_f\beta_{Xk} + \epsilon_{Xk} .

    Computing the value RSS_k may be optimized to

    RSS_k = (B^{-1}X_k)' M (B^{-1} X_k) = (B^{-1}X_k)' M M (B^{-1} X_k) = (MB^{-1} X_k)' (MB^{-1} X_k) .

  • Any additional p-value outputs you have selected (see Additional Outputs).

  • Proportion of Variance Explained: This column contains the proportion of variance explained by the effects of this marker. Using the notation of Further Optimization When Covariates Are Present, this proportion of variance explained pve will be

    pve = \frac{mrss_{h0} - mrss_k}{mrss_{h0}} .

  • The actual sample size and any genotype statistic outputs (see Additional Outputs).

The following will be output to the node change log of this spreadsheet:

  • The pseudo-heritability ph, which is

    ph &= \hat{\sigma^2_g} / Var(y) \\
   &= \hat{\sigma^2_g} / (\hat{\sigma^2_g} + \hat{\sigma^2_e}) \\
   &= 1 / (1 + \hat{\delta}).

  • The variance and standard error of the pseudo-heritability Var(h^2) and SE(h^2) = \sqrt{Var(h^2)}, see Estimating the Variance of Heritability for the formula.

  • The genetic component of variance Vg (\hat{\sigma^2_g}).

  • The error component of variance Ve (\hat{\sigma^2_e}).

If you have selected additional covariates, two additional outputs will appear in the node change log:

  • Proportion of genetic variance and
  • Prop. explained by fixed covariates.

These correspond exactly to the proportion of Genetic variance (pg_j for j=0) and the proportion of Variance explained (p_{fixed}) outputs of the Variance Partition Plot (See The Variance Partition Plot) for the initial model (j=0) of an MLMM run.

Outputs from the Multi-Locus Mixed Linear Model (MLMM) Method

If an IBS kinship matrix was computed on the fly (and your input data is genotypic), this will be output as IBS Distance ((IBS2 * 0.5*IBS1)/ # non-missing markers). (Otherwise, if it was computed on the fly for numerically recoded genotypes, it will be called IBS Relationship Matrix (Computed from Recoded Data).) This spreadsheet can be used again for further Mixed Model analysis.

Also created are:

  • P-Values from Multi-Locus Mixed Model: The columns in this spreadsheet are grouped by step number. For each step the following columns are added:

    • P-Value: P-value of the mixed linear model.

    • -log10(P-Value) The negative log (based 10) of the above p-value.

    • Regression Beta: This is the beta value of the individual marker that has been tested, and corresponds to \beta_k in Using the Mixed Model for Association Studies.

      Note

      For the Initial Model, the Regression Beta indicates the direction of the (fixed) effect of the marker being tested. For succeeding steps, the Regression Beta only indicates the direction of the additional (fixed) effect of the marker being tested over and above the (fixed) effects of the markers (now being used as covariates) found in the previous steps. No one “effect direction” is defined for multiple markers considered as one unit.

    • Beta Standard Error: This is the standard error corresponding to this beta value, and is computed as noted above in Output from the Single Locus Mixed Linear Model (EMMAX) Method.

    • Any additional p-value outputs you have selected (see Additional Outputs).

    • Prop. Var. Explained: This column contains the proportion of variance explained by the effects of this marker. Using the notation of Further Optimization When Covariates Are Present, this proportion of variance explained pve will be

      pve = \frac{mrss_{h0} - mrss_k}{mrss_{h0}} .

    • Manhattan Category: This column can be used to color the -log10 p-values in a genome browser. This column is preferred to the Chromosome category because it colors the covariates as a separate category.

    At the end are added the actual sample size and any genotype statistic outputs (see Additional Outputs).

  • MLMM Step Information and Covariate P-Values: The rows of this spreadsheet correspond to the MLMM models obtained by both the forward selection steps and backward elimination steps. The first row corresponds to the initial model where no covariate markers have yet been selected. This spreadsheet will have the following columns:

    • Preceding step type: None (for the initial model), Forward selection or Backward elimination.

    • Optimal according to: If the model is optimal according to one or more model criteria (other than classic BIC), the criterion or criteria used for classifying the model as optimal will be listed. (See Model Criteria.) Possible criteria (and their abbreviations) are:

      • Bonferroni
      • Multiple Bonferroni
      • Ext. BIC Extended Bayes Information Criteria
      • Mod. BIC Modified Bayes Information Criteria
      • Multiple PPA Multiple Posterior Probability of Association
    • Pseudo-heritability: Proportion of inheritance explained by the random effects, as evaluated for this model. If this model is after Step j, this proportion is

      ph_j &= \hat{\sigma^2_g} / Var(y) \\
     &= \hat{\sigma^2_g} / (\hat{\sigma^2_g} + \hat{\sigma^2_e}) \\
     &= 1 / (1 + \hat{\delta}),

      where \hat{\sigma^2_g}, \hat{\sigma^2_e}, and \hat{\delta} are all evaluated for this model.

    • The genetic component of variance Vg (\hat{\sigma^2_g}) for this model.

    • The error component of variance Ve (\hat{\sigma^2_e}) for this model.

    • RSS: The actual Root Sum of Squares with just the covariates as fixed effects. This may be written as

      rss = \sum_i (y_i - \sum_f \hat{\beta}_f X_{if})^2,

      where the \hat{\beta}_f results from using the EMMA technique to solve the base model equation

      y_i = \sum_f \beta_f X_{if} + u_i + \epsilon_i,

      and the X_{if} include marker covariates that are a part of this model, any additional covariates you have specified, and the vector of all one’s for obtaining the intercept. (See The EMMAX Approximations and Technique.)

    • REML Mahalanobis RSS: Transformed RSS with just the covariates as fixed effects. Using the notation of Further Optimization When Covariates Are Present, this is mrss_{h0}.

    • Min Mahalanobis RSS: Transformed RSS for the covariates plus the marker with the best p-value. Using the notation of Further Optimization When Covariates Are Present, this is mrss_k where k represents the marker with the best p-value.

    • Log likelihood: The full log likelihood for this model.

    • BIC: Classic Bayesian Information Criterion. (See Model Criteria.)

    • Extended BIC: BIC penalized by the model space dimension. (See Model Criteria.)

    • Modified BIC: BIC penalized by the model space dimension using an alternative algorithm. (See Model Criteria.)

    • Max. cofactor p-value: The maximum individual p-value for all of the covariate markers included in this model (see note below).

    • Min. p-value: The minimum p-value for all of the markers not included as covariates in this model.

    • Beta at min. p-value The beta value of the marker having the minimum p-value.

    • Beta std. error at min. p-value The beta standard error for the marker having the minimum p-value.

    • Max. PPA: The maximum, over all markers, of the posterior probability of association (PPA) (see Model Criteria).

    • Max. PPA p-value: The p-value of that marker having the maximum PPA.

    • The remaining columns contain the following values for the covariate markers included in this model (see note below). For each covariate marker, these values are grouped together within parentheses inside one column:

      • P: The individual p-value
      • B: The regression beta
      • SE: The standard error for the regression beta

    Note

    The individual p-value, regression beta, and beta standard error of each covariate marker k are obtained by using the EMMAX algorithm with all the other covariate markers, plus any additional covariates you have selected, as covariates, using marker k as the only non-covariate marker which is scanned.

  • Variance Partition Plot: See The Variance Partition Plot.

The Variance Partition Plot

This output graph shows, for the model resulting from each step, how the variance of y may be partitioned into the following component proportions:

  • Error (In Red)
  • Genetic variance (In Green)
  • Variance explained (In Blue)
  • Explained by fixed covariates (In Gray – will show only if you have selected additional covariates)
Variance Partition Plot

Variance Partition Plot with additional fixed covariates

These proportions are obtained as follows:

  1. A reference RSS (rss_{ref}) is determined to be used for all models. This reference RSS will be the (actual) Root Sum of Squares with only the intercept as a fixed effect. This may be written as

    rss_{ref} = \sum_i (y_i - \hat{\beta}_{null})^2,

    where the \hat{\beta}_{null} results from using the EMMA technique to solve the null model equation

    y_i =\beta_{null} + u_i + \epsilon_i.

    (See The EMMAX Approximations and Technique.)

    If you have selected additional covariates, this will be computed separately. If you have not selected additional covariates, we have

    rss_{ref} = rss_0,

    where rss_0 is the RSS value for the initial model.

  2. If you have selected additional covariates, the proportion explained by fixed covariates will be

    p_{fixed} = 1 - \frac{rss_0}{rss_{ref}} .

  3. In any case, for the model resulting from step j, the proportion of Variance explained is

    px_j = \frac{rss_0 - rss_j}{rss_{ref}},

    where rss_j is the RSS value computed for the model resulting from step j. If you have not selected additional covariates, this simplifies to

    px_j = 1 - \frac{rss_j}{rss_0}.

  4. The proportion of Genetic variance is

    pg_j = ph_j \frac{rss_j}{rss_{ref}},

    where the pseudo-heritability ph_j is

    ph_j &= \hat{\sigma^2_g} / Var(y) \\
     &= \hat{\sigma^2_g} / (\hat{\sigma^2_g} + \hat{\sigma^2_e}) \\
     &= 1 / (1 + \hat{\delta}),

    \hat{\sigma^2_g}, \hat{\sigma^2_e}, and \hat{\delta} having been evaluated for step j.

  5. The Error accounts for the rest of the variance:

    pe_j = 1 - pg_j - px_j - p_{fixed}.