2.13.3. Mixed Linear Model Analysis

GWAS mixed linear model analysis uses a kinship matrix to correct for cryptic relatedness as a random effect and can include any additional fixed effects in the model.

The methods currently available in the Mixed Linear Model Analysis tool include:

While the Linear regression (fixed effects only) method does not include a random effect component, it is able to take advantage of the Fast F test and regression code written by Bjarni Vilhjalmsson for the MLMM implementation in Python.

Note

This feature only performs linear regression or linear mixed-model analysis, even on a binary phenotype. The outputs will be the same as they would be for recoding the binary phenotype as an integer and then running the analysis. According to [Kang2010] this is “in the spirit of Armitage’s test”. Thus results from a logistic regression (Numeric > Numeric Regression Analysis or Genotype > Haplotype Trend Regression (using windowed haplotype blocks with a size of one marker)) could be considerably different from EMMAX or MLMM in Genotype > Mixed Linear Model Analysis. For a more direct comparison, recode the binary phenotype into an integer phenotype and then run linear regression analysis through Genotypic Regression Analysis, Numeric Regression Analysis or Haplotype Trend Regression.

Data Types Accepted

This tool can perform the regression analysis directly on mapped genotypic data – there is no need to recode the data into a numeric model. However, this analysis can be run on mapped numerically recoded genotypes as well.

Note

For genotypic input, only one minor allele per marker is allowed.

For recoded genotypes, the corresponding restrictions hold:

  • For the Additive Model, there must be either two or three unique values.

  • For other models, there must be exactly two unique values.

Additionally, this feature may be run on mapped real-mode numeric data, such as PCA-corrected genotypic data. However, when using real-mode numeric data, observe the following:

  • Select Numerically as average value for imputing missing data.

  • Do not select Correct for Hemizygous Males.

  • In the Additional Outputs tab, do not request Allele frequencies, Genotype counts or Allele Counts.

Performing Analysis

To perform mixed linear model analysis on genotypic data, recoded genotypic data, or real data, open a spreadsheet and select a column for the dependent variable. The dependent variable must be either quantitative (real-valued or integer-valued) or a binary case/control status column. To open the Mixed Linear Model Analysis window, select the Genotype > Mixed Linear Model Analysis menu item. This feature is currently supported for spreadsheets with only one column set as dependent. Categorical dependent columns are currently not supported.

mixedModelWin

Mixed Linear Model Analysis Window (Main Tab)

The main tab of the Mixed Linear Model Analysis window (see Figure Mixed Linear Model Analysis Window (Main Tab)) allows for various methods and parameters to be set or changed. A list and brief description of these options is as follows:

  • Linear regression (fixed effects only): This method performs a linear regression analysis with fixed effects only. If the marker mapped data is genotypic, then the data is recoded first into the specified numeric model (additive, dominant or recessive). The advantage of this linear regression analysis is that it uses the same F test (Fast F Test) as the mixed linear models (EMMAX and MLMM) and has the choice of imputation methods. See Correct for Additional Covariates for instructions on including fixed effects in the model.

The following methods use a regression analysis framework to add either a random effect (normally based on kinship) or both a random effect and fixed effects to regression analysis models for each genetic marker. The regression models include:

  • Single-locus mixed model GWAS (EMMAX): [Kang2010] This method is an EMMAX implementation by [Vilhjalmsson2012] which can include fixed effects (see Correct for Additional Covariates) but always includes a kinship matrix as a random effect. Either an identity-by-state (IBS) kinship matrix will be computed from the genotypic data or a pre-computed kinship matrix can be selected to speed up analysis. See Precomputed Kinship Matrix Option for more information.

  • Multi-locus mixed model GWAS (MLMM): [Segura2012] This method is an implementation of the MLMM method by [Vilhjalmsson2012] which uses a forward and backward stepwise approach to select markers as fixed effect covariates in the model. Additional fixed effects can be selected (see Correct for Additional Covariates) and either a pre-computed kinship matrix can be used (see Precomputed Kinship Matrix Option) or an IBS kinship matrix can be computed from the genotypic data and used for the analysis.

An additional parameter specifically for the EMMAX and MLMM methods is

Additional parameters for all regression models include:

  • Genetic Model and Imputation:

    • Genetic model to use: The genetic models available include:

      • Additive: Recodes Major Homozygous genotype (dd) to 0, Heterozygous (Dd) to 1, and Minor Homozygous (DD) to 2.

      • Dominant: Recodes Major Homozygous genotype (dd) to 0 and Heterozygous (Dd) and Minor Homozygous (DD) to 1.

      • Recessive: Recodes Major Homozygous (dd) and Heterozygous (Dd) to 0 and Minor Homozygous (DD) to 1.

      Note

      If you are running this analysis from a numerically recoded spreadsheet, this prompt will read Genetic model used for recoding the original spreadsheet, and you should enter which of the above recoding operations you have already performed (see Recode Genotypes) or would have performed to create the spreadsheet from which you are running this analysis.

    • Impute missing data as: The options for imputing the missing genotypes include setting the missing genotype to

      • Homozygous major allele: Always sets the missing genotypes to 0.

      • Numerically as average value: Uses the average non-missing recoded genotype values for each marker as the value to use for recoding missing genotypes.

        Note

        If Correct for Hemizygous Males (see below) is also selected, and there is non-missing data for both males and females in a given marker, averages for males and females will be computed and applied separately.

    • Correct for Hemizygous Males: Recodes the X-Chromosome genotypes for males as 0 or 1. Assumes the column is coded as if the male were homozygous for the X-Chromosome allele in question.

      • Choose Sex Column: Choose the spreadsheet column that specifies the gender of the sample. This column may either be categorical (“M” vs. “F”) or binary (0 = male, 1 = female).

      • Chromosome that is hemizygous for males: Usually the X Chromosome, which is the default.

    • Correct for Additional Covariates: See Correct for Additional Covariates.

Additional Outputs

mixedModelWin2

Mixed Linear Model Analysis Window (Second Tab)

The second tab of the Mixed Linear Model Analysis window (see Figure Mixed Linear Model Analysis Window (Second Tab)) allows for additional outputs to be added to the output spreadsheet of Linear regression (fixed effects only) or of Single-locus mixed model GWAS (EMMAX) and/or to the p-value output spreadsheet of Multi-locus mixed model GWAS (MLMM). Four of these selectable outputs are enhancements to all p-value outputs:

  • Bonferroni multiple testing correction The number of markers tested (successfully or not) is counted as the number of tests for this correction.

  • False discovery rate (FDR)

  • Output data for P-P/Q-Q plots

  • Output -log 10(P) Negative log base 10 of the p-values. This is the optimal column to use for plotting the values in a genome browser.

The other four are genotype statistics, which are placed at the end of the spreadsheet.

  • Call rate (fraction not missing)

  • Allele frequencies

  • Genotype counts

  • Allele Counts

Note

  1. If you are using a numerically recoded spreadsheet and are not using the additive model, only Call rate is selectable as a genotype statistic.

  2. Before the genotype statistics are output, the Actual Sample Size, that is, the number of samples actually used for each marker, as distinct from samples containing missing data that is imputed, will always be output.

  3. If you are using a genotypic spreadsheet, the Minor Allele (Test Allele) and the Major Allele for each marker will always be output.

Precomputed Kinship Matrix Option

A pre-computed kinship matrix may be used for any of the following mixed-model analysis features:

A pre-computed kinship matrix may be obtained from any one of the following features or places:

Note

For computing Genomic BLUP (GBLUP), a Genomic Relationship Matrix (pre-computed or computed automatically) should be used as the kinship matrix. Results are not guaranteed if a different type of kinship matrix is used for computing GBLUP.

Additionally,

A pre-computed kinship matrix is required to be an n \times
n numeric matrix where n is (at least) the number of samples in the genotypic spreadsheet. Both row labels and column headers should be the sample names.

Also, the spreadsheet’s data must be symmetric with respect to the diagonal (within 10^{-8} + 10^{-5} |b|, where b is the above-the-diagonal value being checked), and its trace (the sum of its diagonal elements) must be positive. These last two precautions are designed to make it more likely that the selected spreadsheet’s matrix is positive semidefinite.

Note

The rows and columns in the pre-computed kinship matrix, while they need to be sorted in the same order as each other, do not need to be sorted in the same order as the rows of the spreadsheet to be analyzed. However:

  • The data will be matched up using the row labels (sample names) of both spreadsheets.

  • The kinship matrix spreadsheet must have a row label (sample name) for every row label (sample name) in the spreadsheet to be analyzed.

  • If the kinship matrix spreadsheet has extra rows not present in the spreadsheet to be analyzed, those rows (and corresponding columns) simply will not be used.

To begin, check this option (unless you are running Mixed-Model KBAC), then click on Select Sheet and select the kinship spreadsheet from the window that is presented.

Correct for Additional Covariates

Additional fixed effect covariates can be selected for inclusion in the model from columns of this spreadsheet. Fixed effect covariates can be binary, integer, real-valued, categorical or (if actual genotypic data rather than recoded genotypic data is being used for the analysis) genotypic. In all cases, if a marker is used as an additional fixed effect, it will not be included in the analysis in any other way. To begin, check this option, then click on Add Columns to get a choice of spreadsheet columns to use.

Output from Various Mixed Linear Models Analysis Methods

Output from the Linear Regression (fixed effects only) Method

The linear regression option creates a P-Values from Linear Regression spreadsheet.

The columns in this spreadsheet are:

  • P-Value P-value of the linear regression.

  • -log10(P-Value) The negative log (based 10) of the above p-value.

  • Two parameters for each of the following variables:

    • The intercept (Intercept)

    • Each additional covariate you have specified

    • The individual marker being tested (Predictor)

    These parameters are:

    • __________ Beta: This is the beta value of the parameter that has been tested.

      For the intercept and any covariates, this corresponds to one component of \beta_{kf} in the third note for Further Optimization When Covariates Are Present.

      For the individual marker being tested (Predictor), this corresponds to \beta_k in the third note for Further Optimization When Covariates Are Present.

      Note

      The Predictor Beta indicates the direction of the effect.

    • __________ Beta SE: This is the standard error corresponding to this beta value.

      This is computed as

      se_{kc} = \sqrt{\frac{RSS_{Reg}}{(n - p)} (X_{fk}'X_{fk})^{-1}_{c,c}} .

      Here, c is a component ranging from the Intercept through any covariates you have specified and through the Predictor X_k, and X_{fk} is the matrix made up of these components, as

      X_{fk} = \large[ X_f \large| X_k \large] .

      RSS_{Reg} is the RSS value (\epsilon'\epsilon) for the regression, n is the number of samples and p is one plus the number of any fixed covariates you may have selected.

  • Any additional p-value outputs you have selected (see Additional Outputs).

  • The actual sample size and any genotype statistic outputs (see Additional Outputs).

Output from the Single Locus Mixed Linear Model (EMMAX) Method

If an IBS kinship matrix was computed on the fly (and your input data is genotypic), this will be output as IBS Distance ((IBS2 * 0.5*IBS1)/ # non-missing markers). (Otherwise, if it was computed on the fly for numerically recoded genotypes, it will be called IBS Relationship Matrix (Computed from Recoded Data).) This spreadsheet can be used again for further Mixed Model analysis.

Also created is the P-Values from Single-Locus Mixed Model spreadsheet.

The columns in this spreadsheet are:

  • P-Value: P-value of the mixed linear model.

  • -log10(P-Value) The negative log (based 10) of the above p-value.

  • Two parameters for each of the following variables:

    • The intercept (Intercept)

    • Each additional covariate you have specified

    • The individual marker being tested (Predictor)

    These parameters are:

    • __________ Beta: This is the beta value of the parameter that has been tested.

      For the intercept and any covariates, this corresponds to one component of \beta_{kf} in Further Optimization When Covariates Are Present.

      For the individual marker being tested (Predictor), this corresponds to \beta_k in Further Optimization When Covariates Are Present.

      Note

      The Predictor Beta indicates the direction of the (fixed) effect.

    • __________ Beta SE: This is the standard error corresponding to this beta value.

      This is computed as

      se_{kc} = \sqrt{\frac{RSS_{Reg}}{(n - p)} ((B^{-1}X_{fk})'(B^{-1}X_{fk}))^{-1}_{c,c}} .

      Here, c is a component ranging from the Intercept through any covariates you have specified and through the Predictor X_k, and X_{fk} is the matrix made up of these components, as

      X_{fk} = \large[ X_f \large| X_k \large] .

      RSS_{Reg} is the RSS value (\epsilon'\epsilon) for the transformed regression, n is the number of samples and p is one plus the number of any fixed covariates you may have selected.

  • Any additional p-value outputs you have selected (see Additional Outputs).

  • Proportion of Variance Explained: This column contains the proportion of variance explained by the effects of this marker. Using the notation of Further Optimization When Covariates Are Present, this proportion of variance explained pve will be

    pve = \frac{mrss_{h0} - mrss_k}{mrss_{h0}} .

  • The actual sample size and any genotype statistic outputs (see Additional Outputs).

The following will be output to the node change log of this spreadsheet:

If you have selected additional covariates, two additional outputs will appear in the node change log:

  • Proportion of genetic variance and

  • Prop. explained by fixed covariates.

These correspond exactly to the proportion of Genetic variance (pg_j for j=0) and the proportion of Variance explained (p_{fixed}) outputs of the Variance Partition Plot (See The Variance Partition Plot) for the initial model (j=0) of an MLMM run.

Outputs from the Multi-Locus Mixed Linear Model (MLMM) Method

If an IBS kinship matrix was computed on the fly (and your input data is genotypic), this will be output as IBS Distance ((IBS2 * 0.5*IBS1)/ # non-missing markers). (Otherwise, if it was computed on the fly for numerically recoded genotypes, it will be called IBS Relationship Matrix (Computed from Recoded Data).) This spreadsheet can be used again for further Mixed Model analysis.

Also created are:

  • P-Values from Multi-Locus Mixed Model: The columns in this spreadsheet are grouped by step number. For each step containing a model not used by a previous step, and starting with the initial model, the following columns are added:

    • P-Value (Initial Model) or P-Value (After Step ##): P-value of the mixed linear model.

    • -log10(P-Value) The negative log (based 10) of the above p-value.

    • Predictor Beta: This is the beta value of the individual marker that has been tested (the Predictor), and corresponds to \beta_k in Further Optimization When Covariates Are Present.

      Note

      For the Initial Model, the Predictor Beta indicates the direction of the (fixed) effect of the marker being tested. For succeeding steps, the Predictor Beta only indicates the direction of the additional (fixed) effect of the marker being tested over and above the (fixed) effects of the markers (now being used as covariates) found in the previous steps. No one “effect direction” is defined for multiple markers considered as one unit.

    • Predictor Beta SE: This is the standard error corresponding to this beta value, and could be computed as noted above in Output from the Single Locus Mixed Linear Model (EMMAX) Method.

      However, since we only need to output the standard error for the Predictor, and we don’t need to output the standard error for the Intercept or any covariates, the following optimized expression, which is based on the optimized regression for X_k, namely

      MB^{-1}y = MB^{-1}X_k\beta_k + \epsilon_{MB} ,

      is used to compute the standard error for the Predictor X_k:

      se_k &= \sqrt{\frac{RSS_{Reg}}{(n - p)} ((MB^{-1}X_k)'(MB^{-1}X_k))^{-1}_{1,1}} \\
     &= \sqrt{\frac{RSS_{Reg}}{(n - p) ((MB^{-1}X_k)'(MB^{-1}X_k))}} .

      RSS_{Reg} is the RSS value (\epsilon'_{MB}\epsilon_{MB}) for the (transformed and) optimized regression, n is the number of samples and p is one plus the number of any fixed covariates you may have selected.

    • Any additional p-value outputs you have selected (see Additional Outputs).

    • Prop. Var. Explained: This column contains the proportion of variance explained by the effects of this marker. Using the notation of Further Optimization When Covariates Are Present, this proportion of variance explained pve will be

      pve = \frac{mrss_{h0} - mrss_k}{mrss_{h0}} .

    • Manhattan Category (Initial Model) or Manhattan Category (After Step ##): This column can be used to color the -log10 p-values in a genome browser. This column is preferred to the Chromosome category because it labels the marker covariates as a separate category (“Cofactor” rather than “Chr ##”).

    Note

    “For each step containing a model not used by a previous step” normally means the initial model and all Forward Selection steps. This is because Backward Elimination normally eliminates in the reverse order of Forward Selection, thus causing these models to duplicate the models of corresponding Forward Selection steps. But not always. If Backward Elimination eliminates one or two (or more) covariates in a different order than reversing Forward Selection would have done, the results of these different Backward Elimination models will additionally be displayed in this spreadsheet.

    Note

    If, for the step being displayed, the marker has become a covariate marker (which you will know because the Manhattan Category will show that the marker is a “Cofactor”), the p-value, regression beta, beta standard error, and proportion of variance explained (PVE) for this marker (say, marker k) are obtained by using the EMMAX algorithm with all the other covariate markers, plus any additional covariates you have selected, as covariates, using marker k as the only marker which is scanned as a non-covariate.

    The covariate markers for any step (whether shown in this spreadsheet or not) are also summarized in the final columns of the MLMM Step Information and Covariate P-Values spreadsheet (see below).

    At the end are added the actual sample size and any genotype statistic outputs (see Additional Outputs).

  • MLMM Step Information and Covariate P-Values: The rows of this spreadsheet correspond to the MLMM models obtained by both the forward selection steps and backward elimination steps. The first row corresponds to the initial model where no covariate markers have yet been selected. This spreadsheet will have the following columns:

    • Preceding step type: None (for the initial model), Forward selection or Backward elimination.

    • Optimal according to: If the model is optimal according to one or more model criteria (other than classic BIC), the criterion or criteria used for classifying the model as optimal will be listed. (See Model Criteria.) Possible criteria (and their abbreviations) are:

      • Bonferroni

      • Multiple Bonferroni

      • Ext. BIC Extended Bayes Information Criteria

      • Mod. BIC Modified Bayes Information Criteria

      • Multiple PPA Multiple Posterior Probability of Association

    • Pseudo-heritability: Proportion of inheritance explained by the random effects, as evaluated for this model. If this model is after Step j, this proportion is

      ph_j &= \hat{\sigma^2_g} / Var(y) \\
     &= \hat{\sigma^2_g} / (\hat{\sigma^2_g} + \hat{\sigma^2_e}) \\
     &= 1 / (1 + \hat{\delta}),

      where \hat{\sigma^2_g}, \hat{\sigma^2_e}, and \hat{\delta} are all evaluated for this model.

    • Variance of heritability, the variance of the pseudo-heritability Var(h^2), as evaluated for this model.

    • SE of heritability, the standard error of the pseudo-heritability SE(h^2) = \sqrt{Var(h^2)}, as evaluated for this model. See Estimating the Variance of Heritability for One Random Effect for the formula for these two values.

    • The genetic component of variance Vg (\hat{\sigma^2_g}) for this model.

    • The error component of variance Ve (\hat{\sigma^2_e}) for this model.

    • RSS: The actual Residual Sum of Squares with just the covariates as fixed effects. This may be written as

      rss = \sum_i (y_i - \sum_f \hat{\beta}_f X_{if})^2,

      where the \hat{\beta}_f results from using the EMMA technique to solve the base model equation

      y_i = \sum_f \beta_f X_{if} + u_i + \epsilon_i,

      and the X_{if} include marker covariates that are a part of this model, any additional covariates you have specified, and the vector of all one’s for obtaining the intercept. (See The EMMAX Approximations and Technique.)

    • REML Mahalanobis RSS: Transformed RSS with just the covariates as fixed effects. Using the notation of Further Optimization When Covariates Are Present, this is mrss_{h0}.

    • Min Mahalanobis RSS: Transformed RSS for the covariates plus the marker with the best p-value. Using the notation of Further Optimization When Covariates Are Present, this is mrss_k where k represents the marker with the best p-value.

    • Log likelihood: The full log likelihood (as computed using ML) for this model.

    • BIC: Classic Bayesian Information Criterion. (See Model Criteria.)

    • Extended BIC: BIC penalized by the model space dimension. (See Model Criteria.)

    • Modified BIC: BIC penalized by the model space dimension using an alternative algorithm. (See Model Criteria.)

    • Max. cofactor p-value: The maximum individual p-value for all of the covariate markers included in this model (see note below).

    • Min. p-value: The minimum p-value for all of the markers not included as covariates in this model.

    • Beta at min. p-value The beta value of the marker having the minimum p-value.

    • Beta std. error at min. p-value The beta standard error for the marker having the minimum p-value.

    • Max. PPA: The maximum, over all markers, of the posterior probability of association (PPA) (see Model Criteria).

    • Max. PPA p-value: The p-value of that marker having the maximum PPA.

    • The remaining columns contain the following values for the covariate markers included in this model:

      • P: The individual p-value

      • B: The regression beta

      • SE: The standard error for the regression beta

      • PVE: The proportion of variance explained

      These are displayed in four columns, each marker-mapped and lateled with the covariate marker name. The data in these columns is individually labeled according to the above abbreviations.

      Note

      The p-value, regression beta, beta standard error, and proportion of variance explained (PVE) values for any covariate marker (say, marker k) are obtained by using the EMMAX algorithm with all the other covariate markers, plus any additional covariates you have selected, as covariates, using marker k as the only marker which is scanned as a non-covariate.

      These covariate marker values are also shown in the P-Values from Multi-Locus Mixed Model spreadsheet (for each step containing a model not used by a previous step). In that spreadsheet, for any given step being displayed, the covariate markers are each designated in the step’s Manhattan Category column as “Cofactor”.

  • Variance Partition Plot: See The Variance Partition Plot.

The following will be output to the node change log of each of the above outputs:

  • The options selected and numbers of samples and markers scanned

  • The scaling factor used for the kinship matrix, as computed according to Scaling the Kinship Matrix

  • The number of MLMM forward steps actually used

The Variance Partition Plot

This output graph shows, for the model resulting from each step, how the variance of y may be partitioned into the following component proportions:

  • Error (In Red)

  • Genetic variance (In Green)

  • Variance explained (In Blue)

  • Explained by fixed covariates (In Gray – will show only if you have selected additional covariates)

Variance Partition Plot

Variance Partition Plot with additional fixed covariates

These proportions are obtained as follows:

  1. A reference RSS (rss_{ref}) is determined to be used for all models. This reference RSS will be the (actual) Residual Sum of Squares with only the intercept as a fixed effect. This may be written as

    rss_{ref} = \sum_i (y_i - \hat{\beta}_{null})^2,

    where the \hat{\beta}_{null} results from using the EMMA technique to solve the null model equation

    y_i =\beta_{null} + u_i + \epsilon_i.

    (See The EMMAX Approximations and Technique.)

    If you have selected additional covariates, this will be computed separately. If you have not selected additional covariates, we have

    rss_{ref} = rss_0,

    where rss_0 is the RSS value for the initial model.

  2. If you have selected additional covariates, the proportion explained by fixed covariates will be

    p_{fixed} = 1 - \frac{rss_0}{rss_{ref}} .

  3. In any case, for the model resulting from step j, the proportion of Variance explained is

    px_j = \frac{rss_0 - rss_j}{rss_{ref}},

    where rss_j is the RSS value computed for the model resulting from step j. If you have not selected additional covariates, this simplifies to

    px_j = 1 - \frac{rss_j}{rss_0}.

  4. The proportion of Genetic variance is

    pg_j = ph_j \frac{rss_j}{rss_{ref}},

    where the pseudo-heritability ph_j is

    ph_j &= \hat{\sigma^2_g} / Var(y) \\
     &= \hat{\sigma^2_g} / (\hat{\sigma^2_g} + \hat{\sigma^2_e}) \\
     &= 1 / (1 + \hat{\delta}),

    \hat{\sigma^2_g}, \hat{\sigma^2_e}, and \hat{\delta} having been evaluated for step j.

  5. The Error accounts for the rest of the variance:

    pe_j = 1 - pg_j - px_j - p_{fixed}.