Genotypic Regression Analysis

From the Genotypic Regression window, you can perform linear and logistic regression, stepwise linear and logistic regression, and permutation tests against one dependent variable, using genotypic variables in a moving window along with numeric, categorical, or genotypic covariates.

Individual regressions may either be performed with all variables and covariates together in one regression (“full model only”) or as a pair of regressions, one with all variables and covariates together (the “full model”) and a second with only some of the covariates (the “reduced model”), to obtain a “full-vs-reduced-model p-value”. (See Full Versus Reduced Model Regression Equation.)

The covariates used for regression may optionally consist of interactions between other (numeric, categorical, or genotypic) covariates that are derived directly from the spreadsheet.

This feature directly parallels the Numeric Regression Analysis feature, except that:

  • The moving window or single column is directed over genotypic columns rather than numeric columns.
  • The genotypic columns are converted into numeric values according to your specification before they are used in the regression algorithm.
  • In some cases, you may specify genotype statistics to be additionally output.

For an overview of the theories behind regression analysis in Golden Helix SVS, see Linear Regression and Logistic Regression.

Full Versus Reduced Model Regression Equation

As noted in Numeric Regression Analysis, it is sometimes desirable to “correct for” binary, continuous, or categorical variables, otherwise known as “covariates”. These covariates, or first-order interactions between covariates, may be influencing the dependent variable response. Correcting for the covariates allows the user to see specifically what effects there are on the remaining variables. In Genotypic Regression, covariates may be genotypic as well as numeric or categorical.

As in Numeric Regression Analysis, two regression equations are calculated. First, the “reduced model” is calculated, which includes only the dependent variable and the reduced model covariates (plus a constant term). Next, the “full model” is calculated, which includes all of the variables, including all full model covariates, along with all reduced model covariates and the constant term. The significance of the full versus the reduced model is then calculated with an F-test (for linear regression) or a likelihood ratio statistic (for logistic regression).

See Full Versus Reduced Model Regression Equation (for linear regression) and Full Versus Reduced Model Regression Equation (for logistic regression) for more information.
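
For the linear-regression case, the comparison described above can be illustrated with a short Python sketch. This is not the SVS implementation: the function names, the normal-equations solver, and the sample data are assumptions made for illustration only. It computes the standard nested-model F statistic, F = ((RSS_reduced - RSS_full) / (p_full - p_reduced)) / (RSS_full / (n - p_full)), and leaves out the p-value lookup, which needs the F distribution.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so each row operation is applied to both sides.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: regressors are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def residual_sum_of_squares(columns, y):
    """Least-squares fit of y on a constant term plus `columns`; returns RSS."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(rows[0])
    xtx = [[sum(r[j] * r[k] for r in rows) for k in range(p)] for j in range(p)]
    xty = [sum(r[j] * yi for r, yi in zip(rows, y)) for j in range(p)]
    beta = solve_linear_system(xtx, xty)
    return sum(
        (yi - sum(b * v for b, v in zip(beta, r))) ** 2 for r, yi in zip(rows, y)
    )


def full_vs_reduced_f(reduced_cols, extra_cols, y):
    """Return (F statistic, numerator df, denominator df) for nested models."""
    n = len(y)
    p_reduced = 1 + len(reduced_cols)        # constant + reduced covariates
    p_full = p_reduced + len(extra_cols)     # ... plus the remaining regressors
    rss_reduced = residual_sum_of_squares(reduced_cols, y)
    rss_full = residual_sum_of_squares(reduced_cols + extra_cols, y)
    df_num = p_full - p_reduced
    df_den = n - p_full
    f_stat = ((rss_reduced - rss_full) / df_num) / (rss_full / df_den)
    return f_stat, df_num, df_den


# Hypothetical data: one numeric covariate and one additively coded marker.
age = [30, 42, 51, 38, 60, 47, 35, 55]
additive = [0, 1, 2, 0, 1, 2, 1, 0]
trait = [1.1, 2.0, 3.2, 1.3, 2.6, 3.0, 1.9, 1.8]

f_stat, df_num, df_den = full_vs_reduced_f([age], [additive], trait)
print(f"F({df_num}, {df_den}) = {f_stat:.3f}")
```

Here the reduced model contains only the constant and the age covariate, and the full model adds the genotypic regressor, so a large F statistic suggests the marker explains variation beyond what the covariate explains.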

Performing Analysis

To perform genotypic regression analysis, open a spreadsheet and select a column for the dependent variable. The dependent variable must be either quantitative (real-valued or integer-valued) or a binary case/control status column. To open the Genotypic Regression window, select the Genotype > Genotypic Regression Analysis menu item. This feature is currently supported for spreadsheets with only one column set as dependent. Categorical dependent columns are currently not supported.

Genotypic Regression Analysis – Regression Parameters

Genotypic Regression Analysis – Regression Parameters (When Covariate-Column Interactions Are Specified)

There are three tabs in the Genotypic Regression Analysis window. These are:

  • Regression Parameters: The first tab of the Genotypic Regression Analysis window (see Figure Genotypic Regression Analysis – Regression Parameters) allows for the general regression parameters to be set, including covariate and windowing selection parameters and whether to use stepwise regression. These general parameters are the same as those for the first tab of the Numeric Regression Analysis window, and are explained in Performing Analysis.
  • Output Parameters: The second tab of the Genotypic Regression Analysis window, which has the same choices as the second tab of the Numeric Regression Analysis window (see Figure Genotypic Regression Analysis – Regression Output Parameters), allows for the setting of additional regression parameters, including multiple testing corrections and what additional regression outputs to create. These parameters are also explained in Performing Analysis.
  • Genotypic Parameters: The third tab of the Genotypic Regression Analysis window (see Figure Genotypic Regression Analysis – Genotypic Parameters) is unique to Genotypic Regression Analysis. In this tab, you may specify how to convert genotypic columns from the spreadsheet to numeric data and what genotype statistics to output. These parameters are explained in Genotypic Parameters below.

Genotypic Regression Analysis – Regression Output Parameters

Genotypic Regression Analysis – Genotypic Parameters

Genotypic Parameters

These parameters, which are set in the third tab of the Genotypic Regression Analysis window (Genotypic Regression Analysis – Genotypic Parameters), include:

  • Recode Genotype Column Data To Numeric Values by Allele Classification Two parameters, allele classification and a genetic model, may be chosen here. Together, they dictate how to convert non-missing genotypic data to numeric values.

    The allele classification parameter is a choice between

    • Classify Alleles by major allele (d) vs. minor allele (D) as found in the data, or
    • Classify Alleles by reference allele (r) vs. alternate allele (A) from the marker map.

    Note

    The second allele classification (reference vs. alternate) is enabled only if the spreadsheet is marker-mapped and the marker map contains a field called “Reference”.

    The genetic model parameter consists of either

    • Genetic Model To Use
      • Additive model: DD=2, Dd=1, dd=0
      • Dominant model: DD=1, Dd=1, dd=0
      • Recessive model: DD=1, Dd=0, dd=0

    or

    • Genetic Model To Use
      • Additive model: AA=2, Ar=1, rr=0
      • Dominant model: AA=1, Ar=1, rr=0
      • Recessive model: AA=1, Ar=0, rr=0,

    depending upon the allele classification chosen.

    Note

    For any columns with more than two (distinct) alleles, the “major” allele, when classifying according to major allele (d) vs. minor allele (D), will be considered to be the allele with the highest frequency, and all other alleles will be treated the same as if they were all one and the same “minor” allele.

    Note

    When classifying according to reference allele (r) vs. alternate allele (A), all alternate alleles, for any columns with more than one alternate allele, will be treated the same as if they were all one and the same alternate allele.

    Note

    When classifying according to reference allele (r) vs. alternate allele (A), regression will take place only over mapped genotypic columns.

    Note

    When classifying according to reference allele (r) vs. alternate allele (A), using unmapped genotypic covariates (or those with no reference allele in the marker map) will not work. These will be declared as being “constant”.

  • Missing (Non-Covariate) Genotype Values This parameter dictates what to do with any missing values which are found in the windowed or single-column genotypic data.

    Note

    This parameter is only available if you have selected Regress on each of the ### genotypic columns, Use a moving window of regressors, or Regress on covariate-column interactions (on ### genotypic columns).

    • Impute missing genotype values as homozygous major allele or Impute missing genotype values as homozygous reference allele (the label depends upon the allele classification chosen) Use the value zero. This corresponds to the value obtained from a homozygous major allele genotype or homozygous reference allele genotype.
    • Impute missing genotype values numerically as average value The average value resulting from the non-missing genotypes (using the selected genetic model) is calculated, then used as the value for any missing genotypes.
    • Drop samples containing missing genotype values Any spreadsheet rows containing missing genotype values are effectively ignored when regressing the current column or window position.

    Note

    All samples/spreadsheet rows containing a missing value for any individually-selected covariate, genotypic or otherwise, are always dropped.

  • Marker Statistics Outputs Select options for the outputs which are desired. If one or more of the last three of these options are selected, the names of the major and minor (or reference and alternate) alleles will also be output.

    Note

    These options are only available if you have selected Regress on each of the ### genotypic columns or Regress on covariate-column interactions (on ### genotypic columns).

    • Call rate (fraction not missing) The fraction of genotypic data which is present in the current column.
    • Allele frequencies The respective fractions of minor alleles and major alleles, or alternate alleles and reference alleles, represented in the non-missing data.
    • Genotype counts The counts of how many of the three possible genotypes are present, along with the count of missing values.
    • Allele counts The counts of how many of the two possible alleles are present, along with the count of missing-value alleles.

    Note

    These outputs apply to the genotypic data after the spreadsheet rows corresponding to any missing individually-selected covariate data have been excluded, but before any other processing, including missing-value imputation (if selected), takes place on the data.
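
The recoding and missing-value options above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the SVS code: genotypes are assumed to be strings such as "A_G" with None marking a missing value, the names MODELS and recode_column are hypothetical, and only the major (d) vs. minor (D) allele classification is shown. Ties in allele frequency are broken by first occurrence, which is also an assumption.

```python
from collections import Counter

# Numeric value for 0, 1, or 2 copies of the minor allele (dd, Dd, DD).
MODELS = {
    "additive": {0: 0.0, 1: 1.0, 2: 2.0},
    "dominant": {0: 0.0, 1: 1.0, 2: 1.0},
    "recessive": {0: 0.0, 1: 0.0, 2: 1.0},
}


def recode_column(genotypes, model="additive", missing="drop"):
    """Return (row_index, numeric_value) pairs for one genotypic column.

    missing: "major"   -> impute zero (homozygous major allele),
             "average" -> impute the mean of the non-missing coded values,
             "drop"    -> leave out rows whose genotype is missing.
    """
    coding = MODELS[model]
    allele_counts = Counter(
        allele for g in genotypes if g is not None for allele in g.split("_")
    )
    if not allele_counts:
        raise ValueError("column has no non-missing genotypes")
    # The most frequent allele is "major"; every other allele counts as "minor".
    major = allele_counts.most_common(1)[0][0]

    coded = []
    for g in genotypes:
        if g is None:
            coded.append(None)
        else:
            minor_copies = sum(1 for allele in g.split("_") if allele != major)
            coded.append(coding[minor_copies])

    observed = [v for v in coded if v is not None]
    if missing == "major":
        fill = 0.0
    elif missing == "average":
        fill = sum(observed) / len(observed)
    elif missing == "drop":
        fill = None
    else:
        raise ValueError(f"unknown missing-value option: {missing}")

    result = []
    for index, value in enumerate(coded):
        if value is None:
            if fill is None:
                continue  # drop this sample for the current column
            value = fill
        result.append((index, value))
    return result


# Hypothetical column: allele A occurs 6 times and G 4 times, so A is major.
example_column = ["A_A", "A_G", "G_G", None, "A_A", "A_G"]
print(recode_column(example_column, "additive", "drop"))
# -> [(0, 0.0), (1, 1.0), (2, 2.0), (4, 0.0), (5, 1.0)]
print(recode_column(example_column, "additive", "average")[3])
# -> (3, 0.8)
```

Returning row indices alongside the values keeps the dropped samples traceable, since dropping applies only to the column or window position currently being regressed.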

Running the Regression

Click Run to start the regression analysis procedure.

Note

Sometimes a regression may fail due to insufficient rank in the coefficient matrix. This can be a result of not enough observations or due to the inclusion of “collinear” regressors. A collinear regressor is one which is a linear combination of one or more other regressors.
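
To make the rank condition concrete, the following Python sketch counts the rank of a small design matrix by Gaussian elimination. The function name matrix_rank, the tolerance, and the example matrices are assumptions for illustration. How SVS itself detects rank deficiency is not described here.

```python
def matrix_rank(rows, tol=1e-9):
    """Rank of a matrix (list of rows) by Gaussian elimination with pivoting."""
    m = [[float(v) for v in row] for row in rows]
    n_rows = len(m)
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        # Pick the largest remaining entry in this column as the pivot.
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) <= tol:
            continue  # no usable pivot: this column depends on earlier ones
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
    return rank


# Columns: constant, additive, dominant, recessive coding of one marker.
# additive = dominant + recessive, so the four columns are collinear.
collinear_design = [
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [1, 2, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
]
# Same data without the additive column: full column rank.
independent_design = [[row[0], row[2], row[3]] for row in collinear_design]
# Too few observations: two samples cannot support three regressors.
too_few_rows = [[1, 0, 0], [1, 1, 1]]

print(matrix_rank(collinear_design), "of", len(collinear_design[0]))
print(matrix_rank(independent_design), "of", len(independent_design[0]))
print(matrix_rank(too_few_rows), "of", len(too_few_rows[0]))
```

A regression can be fit only when the rank equals the number of columns. In the first and third cases the rank falls short, which corresponds to the two failure causes named in the note.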

Regression Outputs

A Genotypic Regression can produce three possible outputs, although at most two are output from any single regression. These are:

  • A residual spreadsheet. This is output if you have selected Perform regression with selected covariates only and also selected to output a residual spreadsheet.
  • A regression results spreadsheet. This is always output if you have selected Regress on each of the ### genotypic columns, Use a moving window of regressors, or Regress on covariate-column interactions (on ### genotypic columns).
  • A regression statistics results viewer. This is always output if you have selected Perform regression with selected covariates only. Otherwise, it is output only if detailed output (Viewing Detailed Results) is specified and the criteria for detailed output are met.

Residual Spreadsheet

The residual spreadsheet outputs are the same as for Numeric Regression. Please see Residual Spreadsheet.

Regression Results Spreadsheet

All the outputs listed for the Numeric Regression results spreadsheet (Regression Results Spreadsheet) are output to this spreadsheet.

In addition, depending on the genotype statistics outputs you have selected, the following genotype statistics may be output if you have selected Classify Alleles by major allele (d) vs. minor allele (D) as found in the data:

  • Call Rate The fraction of genotypic data which is present for this marker.

  • Minor Allele The minor allele for this marker.

    Note

    If the column has more than two (distinct) alleles, a list of the names of all the alleles other than the allele with the highest frequency will be output.

  • Major Allele The major allele for this marker.

    Note

    If the column has more than two (distinct) alleles, the name of the allele with the highest frequency will be output.

  • Minor Allele Frequency The fraction of minor alleles to total (non-missing) alleles.

  • Major Allele Frequency The fraction of major alleles to total (non-missing) alleles.

  • Genotype DD Count The count of homozygous minor allele genotypes.

  • Genotype Dd Count The count of heterozygous genotypes.

  • Genotype dd Count The count of homozygous major allele genotypes.

  • Missing Genotype Count The count of missing genotypes.

  • Minor Allele D Count The count of minor alleles.

  • Major Allele d Count The count of major alleles.

  • Missing Allele Count The count of missing alleles.

If you have selected Classify Alleles by reference allele (r) vs. alternate allele (A) from the marker map, the following genotype statistics may be output, depending on what you have selected:

  • Call Rate The fraction of genotypic data which is present for this marker.

  • Alternate Allele(s) The alternate allele for this marker.

    Note

    If the column has more than one alternate allele, a list of the names of all alternate alleles will be output.

  • Reference Allele The reference allele for this marker.

  • Alternate Allele Frequency The fraction of alternate alleles to total (non-missing) alleles.

  • Reference Allele Frequency The fraction of reference alleles to total (non-missing) alleles.

  • Genotype AA Count The count of homozygous alternate allele genotypes.

  • Genotype Ar Count The count of heterozygous genotypes.

  • Genotype rr Count The count of homozygous reference allele genotypes.

  • Missing Genotype Count The count of missing genotypes.

  • Alternate Allele A Count The count of alternate alleles.

  • Reference Allele r Count The count of reference alleles.

  • Missing Allele Count The count of missing alleles.
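
The statistics listed above can be illustrated for the major vs. minor classification with a short Python sketch. It is a minimal sketch, not the SVS implementation: genotypes are assumed to be strings such as "A_G" with None for a missing value, the function name marker_statistics and the dictionary keys are hypothetical, and each missing genotype is assumed to contribute two missing alleles.

```python
from collections import Counter


def marker_statistics(genotypes):
    """Call rate, allele frequencies, and counts for one genotypic column."""
    observed = [g.split("_") for g in genotypes if g is not None]
    n_missing = len(genotypes) - len(observed)

    allele_counts = Counter(allele for pair in observed for allele in pair)
    # Most frequent allele is major (d); all other alleles are minor (D).
    major = allele_counts.most_common(1)[0][0]
    major_count = allele_counts[major]
    minor_count = sum(allele_counts.values()) - major_count
    total_alleles = major_count + minor_count

    # Key each genotype by its number of minor-allele copies (0, 1, or 2).
    copies = Counter(
        sum(1 for allele in pair if allele != major) for pair in observed
    )
    return {
        "call_rate": len(observed) / len(genotypes),
        "major_allele": major,
        "minor_alleles": sorted(a for a in allele_counts if a != major),
        "major_allele_freq": major_count / total_alleles,
        "minor_allele_freq": minor_count / total_alleles,
        "genotype_DD_count": copies[2],
        "genotype_Dd_count": copies[1],
        "genotype_dd_count": copies[0],
        "missing_genotype_count": n_missing,
        "minor_allele_count": minor_count,
        "major_allele_count": major_count,
        "missing_allele_count": 2 * n_missing,  # assumption: diploid calls
    }


# Hypothetical column with one missing genotype.
stats = marker_statistics(["A_A", "A_G", "G_G", None, "A_A", "A_G"])
for name, value in stats.items():
    print(f"{name}: {value}")
```

The allele frequencies are taken over non-missing alleles only, matching the definitions above, while the call rate is taken over all samples in the column.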

Regression Statistics Results Viewer

A Regression Statistics Results Viewer (see Figure Linear Regression Statistics Results Viewer) is displayed for a single regression. If Output detailed results if... was selected in the Output Parameters tab of the Genotypic Regression Analysis window, a viewer is instead displayed for every regression that meets the criteria specified on that tab.

The outputs are the same as they are for Numeric Regression. Please see Regression Statistics Results Viewer.

Caveats for Logistic Regression

The same caveats apply for Genotypic Logistic Regression as for Numeric Logistic Regression. Please see Caveats for Logistic Regression.