Mixed Linear Model Analysis with Interactions

Gene-Environment interactions may be analyzed using the Mixed Linear Model Analysis with Interactions tool. The methods which are available in this tool include:

  • Linear regression with interactions
  • Mixed Model GWAS using a single predictor (EMMAX) with interactions

These two methods are very similar to their non-interaction counterparts (see Mixed Linear Model Analysis). The difference is that a full-vs-reduced model is used which consists of the following for the reduced model:

  • The constant (“intercept”) term
  • Any non-interacting covariates you may specify
  • The interaction-term covariates you specify, not yet multiplied by the predictor
  • The (single-SNP/single-locus) predictor

and for the full model, the above plus the following:

  • The interaction terms. Each term is the element-by-element product of the corresponding interaction-term covariate with the predictor.

(These are in addition to the error term(s) of each model.)

For an explanation of the underlying mathematics, see Overview of Mixed Linear Models, which includes the section Optimization when Gene-Environment Interaction Terms Are Included.

Performing Analysis

To perform mixed linear model analysis with interactions on genotypic data or recoded genotypic data, open a spreadsheet and select a column for the dependent variable. The dependent variable must be either quantitative (real-valued or integer-valued) or a binary case/control status column. To open the Mixed Linear Model Analysis with Interactions window, select the Genotype > Mixed Linear Model Analysis with Interactions menu item. This feature is currently supported for spreadsheets with only one column set as dependent. Categorical dependent columns are currently not supported.

mixedModelWin

Mixed Linear Model Analysis with Interactions Window (Main Tab)

The main tab of the Mixed Linear Model Analysis with Interactions window (Mixed Linear Model Analysis with Interactions Window (Main Tab)) allows for various methods and parameters to be set or changed. A list and brief description of these options is as follows:

  • Linear regression (fixed effects only) with Interactions: This method performs a (fixed-effect-only) linear regression analysis with interactions. If the marker mapped data is genotypic, then the data is recoded first into the specified numeric model (additive, dominant or recessive). This linear regression analysis uses the same F test (Fast F Test) as the mixed linear models (EMMAX) and has a choice of imputation methods. See Selecting the Interaction Terms and Correct for Additional Covariates for instructions on including the interaction terms and other covariates in the model.
  • Single-locus mixed model GWAS (EMMAX) with Interactions: [Kang2010] This method is an EMMAX implementation by [Vilhjalmsson2012] which analyzes both fixed effects containing interactions and random effects. See Selecting the Interaction Terms and Correct for Additional Covariates for instructions on including the interaction terms and other covariates in the model as fixed effects. A kinship matrix is always required to help describe the random effects. Either an identity-by-state (IBS) kinship matrix will be computed from the genotypic data or a pre-computed kinship matrix can be selected to speed up analysis. See Precomputed Kinship Matrix Option for more information about kinship matrices.

An additional parameter specific to the EMMAX with Interactions method is

Additional parameters for all regression models include:

  • Genetic Model and Imputation:

    • Genetic model to use: The genetic models available include:

      • Additive: Recodes Major Homozygous genotype (dd) to 0, Heterozygous (Dd) to 1, and Minor Homozygous (DD) to 2.
      • Dominant: Recodes Major Homozygous genotype (dd) to 0 and Heterozygous (Dd) and Minor Homozygous (DD) to 1.
      • Recessive: Recodes Major Homozygous (dd) and Heterozygous (Dd) to 0 and Minor Homozygous (DD) to 1.

      Note

      If you are running this analysis from a numerically recoded spreadsheet, this prompt will read Genetic model used for recoding the original spreadsheet, and you should enter which of the above recoding operations you have already performed (see Recode Genotypes) or would have performed to create the spreadsheet from which you are running this analysis.

    • Impute missing data as: The options for imputing the missing genotypes include setting the missing genotype to

      • Homozygous major allele: Always sets the missing genotypes to 0.

      • Numerically as average value: Uses the average non-missing recoded genotype values for each marker as the value to use for recoding missing genotypes.

        Note

        If Correct for Hemizygous Males (see below) is also selected, and there is non-missing data for both males and females in a given marker, averages for males and females will be computed and applied separately.

    • Correct for Hemizygous Males: Recodes the X-Chromosome genotypes for males as 0 or 1. Assumes the column is coded as if the male were homozygous for the X-Chromosome allele in question.

      • Choose Sex Column: Choose the spreadsheet column that specifies the gender of the sample. This column may either be categorical (“M” vs. “F”) or binary (0 = male, 1 = female).
      • Chromosome that is hemizygous for males: Usually the X Chromosome, which is the default.
    • Please Enter the Interaction Terms: See Selecting the Interaction Terms.

    • Correct for Additional Covariates: See Correct for Additional Covariates.

Selecting the Interaction Terms

Interaction-term covariates can be selected for inclusion in the model from columns of this spreadsheet. As noted above, interaction-term covariates are included in the model in two different ways:

  • Each interaction-term covariate that you select will act, by itself, as one reduced-model covariate.
  • Each interaction-term covariate is also multiplied, element by element, with the predictor (the current marker/locus being analyzed) to create one of the actual interaction terms to be included in the full model.

Interaction-term covariates can be binary, integer, real-valued, categorical or (if actual genotypic data rather than recoded genotypic data is being used for the analysis) genotypic. In all cases, if a marker is used as an interaction-term covariate, it will not be included in the analysis in any other way. Click on Add Columns to get a choice of spreadsheet columns to use.

Correct for Additional Covariates

Additional fixed-effect reduced-model covariates (which are not a part of any interaction term) can also be selected for inclusion in the model from columns of this spreadsheet. These covariates can be binary, integer, real-valued, categorical or (if actual genotypic data rather than recoded genotypic data is being used for the analysis) genotypic. In all cases, if a marker is used as an additional covariate, it will not be included in the analysis in any other way. To begin, check the Correct for Additional Covariates option, then click on Add Columns to get a choice of spreadsheet columns to use.

Additional Outputs

The second tab of the Mixed Linear Model Analysis with Interactions window is the same as the second tab of the Mixed Linear Model Analysis window (see Mixed Linear Model Analysis Window (Second Tab)). This tab allows for additional outputs to be added to either output spreadsheet. Four of these selectable outputs are enhancements to the p-value output:

  • Bonferroni multiple testing correction
  • False discovery rate (FDR)
  • Output data for P-P/Q-Q plots
  • Output -log 10(P) Negative log base 10 of the p-values. This is the optimal column to use for plotting the values in a genome browser.

The other four are genotype statistics, which are placed at the end of the spreadsheet.

  • Call rate (fraction not missing)
  • Allele frequencies
  • Genotype counts
  • Allele Counts

Note

  1. If you are using a numerically recoded spreadsheet and are not using the additive model, only Call rate is selectable as a genotype statistic.
  2. Before the genotype statistics are output, the Actual Sample Size, that is, the number of samples actually used for each marker, as distinct from samples containing missing data that is imputed, will always be output.
  3. If you are using a genotypic spreadsheet, the Minor Allele (Test Allele) and the Major Allele for each marker will always be output.

Output from Linear Regression with Interactions

The linear regression option creates a P-Values from Linear Regression spreadsheet.

The columns in this spreadsheet are:

  • P-Value This is the full-model-vs-reduced-model p-value.

  • -log10(P-Value) The negative log (based 10) of the above p-value.

  • For each of the following terms:

    • the intercept (term 0)
    • each additional covariate you have specified (if any)
    • each interaction covariate
    • the predictor itself, and
    • each interaction term,

    the following values are output from the full-model regression:

    • The beta coefficient for this term (called Beta #, Beta pred, or Interaction # Beta), and
    • The standard error associated with this coefficient. (The standard error value is called Beta # SE, Beta pred SE, or Interaction # Beta SE.)

    Note

    The standard error corresponding to \beta_m at locus k is computed as

    se_k = \sqrt{\frac{RSS_{Reg}}{(n - p) RSS_{km}}},

    where RSS_{Reg} is the RSS value (\epsilon'\epsilon) for the full-model regression itself, n is the number of samples, p is the total number of terms in the full model (as explained above), and RSS_{km} is the m+1st diagonal element of

    (X_{Fk}' X_{Fk})^+,

    where X_{Fk} denotes a matrix whose columns are the full-model (fixed-effect) terms explained above, and A^+ denotes the generalized inverse of any matrix A.

  • Any additional p-value outputs you have selected (see Additional Outputs).

  • The actual sample size and any genotype statistic outputs (see Additional Outputs).

Output from Single Locus Mixed Linear Model (EMMAX) with Interactions

If a kinship matrix was computed on the fly, this will be output as IBS Distance ((IBS2 * 0.5*IBS1)/ # non-missing markers). This spreadsheet can be used again for further Mixed Model analysis.

Also created is the P-Values from Single-Locus Mixed Model spreadsheet.

The columns in this spreadsheet are:

  • P-Value This is the full-model-vs-reduced-model p-value.

  • -log10(P-Value) The negative log (based 10) of the above p-value.

  • For each of the following terms:

    • the intercept (term 0)
    • each additional covariate you have specified (if any)
    • each interaction covariate
    • the predictor itself, and
    • each interaction term,

    the following values are output from the full-model regression:

    • The beta coefficient for this term (called Beta #, Beta pred, or Interaction # Beta), and
    • The standard error associated with this coefficient. (The standard error value is called Beta # SE, Beta pred SE, or Interaction # Beta SE.)

    Note

    The standard error corresponding to \beta_m at locus k is computed from the full-model regression itself, which is, using the notation in Optimization when Gene-Environment Interaction Terms Are Included,

    B^{-1}y = B^{-1}X_c \beta_{kc} + B^{-1}X_i \beta_{ki} + B^{-1}X_k \beta_{kp} + B^{-1}X_{ip} \beta_{ip} + \epsilon.

    This standard error is

    se_k = \sqrt{\frac{RSS_{Reg}}{(n - p) RSS_{km}}},

    where RSS_{Reg} is the RSS value from the regression (\epsilon'\epsilon), n is the number of samples, p is the total number of terms in the full model (as explained above), and RSS_{km} is the m+1st diagonal element of

    (X_{Fk}' X_{Fk})^+,

    where A^+ denotes the generalized inverse of any matrix A, and X_{Fk} denotes the matrix whose columns are the full-model fixed-effect terms explained above, pre-multiplied by B^{-1}–that is,

    X_{Fk} = B^{-1}[X_c X_i X_k X_{ip}].

  • Any additional p-value outputs you have selected (see Additional Outputs).

  • Proportion of Variance Explained: This column contains the proportion of variance explained by the effects of this marker. Using the notation of Further Optimization When Covariates Are Present, this proportion of variance explained pve will be

    pve = \frac{mrss_{h0} - mrss_k}{mrss_{h0}} .

  • The actual sample size and any genotype statistic outputs (see Additional Outputs).

The following will be output to the node change log of this spreadsheet:

  • The pseudo-heritability ph, which is

    ph &= \hat{\sigma^2_g} / Var(y) \\
   &= \hat{\sigma^2_g} / (\hat{\sigma^2_g} + \hat{\sigma^2_e}) \\
   &= 1 / (1 + \hat{\delta}).

  • The variance and standard error of the pseudo-heritability Var(h^2) and SE(h^2) = \sqrt{Var(h^2)}, see Estimating the Variance of Heritability for the formula.

  • The genetic component of variance Vg (\hat{\sigma^2_g}).

  • The error component of variance Ve (\hat{\sigma^2_e}).

  • Two final outputs also appear in the node change log, namely

    • Proportion of genetic variance, and
    • Prop. explained by fixed covariates.

    These are calculated as

    pg = ph \frac{rss_0}{rss_{ref}}

    and

    p_{fixed} = 1 - \frac{rss_0}{rss_{ref}} ,

    where rss_0 is the (actual) Root Sum of Squares (RSS) value for the model consisting of the intercept and all covariates (including any additional covariates and including the interaction covariate(s), but no predictor and no interaction terms) as fixed effects, and rss_{ref} is a reference RSS, which is the (actual) Root Sum of Squares computed from a model containing only the intercept as a fixed effect. The reference RSS may be written as

    rss_{ref} = \sum_i (y_i - \hat{\beta}_{null})^2,

    where the \hat{\beta}_{null} results from using the EMMA technique to solve the null model equation

    y_i =\beta_{null} + u_i + \epsilon_i.

    (See The EMMAX Approximations and Technique.)