Haplotype Trend Regression

Haplotype Trend Regression (HTR) takes one or more blocks of genotypic markers and, for each block, estimates haplotypes for its markers, then regresses the by-sample haplotype probabilities against a dependent variable. The haplotypes used for the regression are all haplotypes above the frequency threshold, or all but one of them (see the Frequency threshold section of How Haplotype Frequencies are Computed). The regression may be linear or logistic (depending upon the dependent variable), may be stepwise if desired, and may involve fixed numeric or categorical covariates and/or interaction terms. The fixed covariates and interaction terms may either be regressed together with the by-sample haplotype probabilities (“full model only”) or may be grouped separately into a “reduced model”. Permutation testing is also available for Haplotype Trend Regression.

For an overview of the theories behind regression analysis in Golden Helix SVS, see Linear Regression and Logistic Regression.

Performing Analysis

To perform Haplotype Trend Regression, open a spreadsheet containing genotypic markers and at least one numeric or case/control column. Select a column for the dependent variable–this column must be either quantitative (real-valued or integer-valued) or a binary case/control status column. To open the Haplotype Trend Regression window, select the Genotypic > Haplotype Trend Regression menu item. This feature is currently supported for spreadsheets with only one column set as dependent. Categorical dependent columns are currently not supported.

Haplotype Trend Regression – Analysis Parameters

The Haplotype Trend Regression window (see Figure Haplotype Trend Regression – Analysis Parameters) allows for various options to be set or changed. A list and brief description of these options is as follows:

  • HTR Parameters: The first tab of the Haplotype Trend Regression window allows for most of the categories of parameters to be set. The categories are:
    • Haplotype Block Definition: Defines the block(s) of markers to be used for analysis.
    • Haplotype Estimation Options: These options specify how haplotypes should be estimated. Additionally, the frequency threshold determines which haplotypes should be directly used in the analysis.
    • Stepwise Regression: These options select stepwise regression and specify its parameters.
    • Fixed Covariate Options: If you have specified fixed covariates, this option determines whether to use them as part of a full-model-only regression or as the reduced model for a full-vs-reduced model regression. This box becomes enabled as soon as you select any fixed covariates.
    • Fixed Covariates: Specify fixed covariates here. This spreadsheet column chooser box is activated if there are binary or numeric columns in the spreadsheet in addition to the column selected as the dependent variable, and/or there are categorical columns in the spreadsheet.
  • Advanced Parameters: This tab (see Figure Haplotype Trend Regression – Advanced Parameters) allows residual spreadsheet output to be specified (when that is allowed based on how you define your haplotype blocks) and for multiple testing corrections to be set. Additional output, such as that used for the creation of P-P or Q-Q plots, and detailed output options are also available in this tab.

Haplotype Trend Regression – Advanced Parameters

Ways of Defining Haplotype Blocks

Golden Helix SVS has a few convenient ways of defining marker blocks to be used for Haplotype Trend Regression.

  • Use precomputed blocks
  • Use all markers as a single block
  • Use a moving window of markers

These are described in more detail below.

Use Precomputed Blocks

By selecting this option, you have complete control over the definition of marker blocks to be used in analysis. This option reads from an external spreadsheet and a given block definition column.

The spreadsheet with the block columns should have the same marker names along its row labels as are in the current spreadsheet as column headers. A block definition column should be a column of type Integer. Each row for the column specifies the block number the current row’s marker is a member of. It may have missing values to indicate that the marker in the current row is not in any block.
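
As a minimal sketch of how such a block definition column can be interpreted (this is not SVS code; the marker names and the `blocks_from_column` helper are hypothetical), the following Python groups markers by their integer block number and skips markers whose value is missing:

```python
from collections import defaultdict

def blocks_from_column(marker_names, block_numbers):
    """Group marker names by integer block number (hypothetical helper).

    A block number of None stands for a missing value, meaning the
    marker is not a member of any block.
    """
    blocks = defaultdict(list)
    for name, number in zip(marker_names, block_numbers):
        if number is None:
            continue  # marker is not in any block
        blocks[number].append(name)
    return dict(blocks)

# Made-up block spreadsheet: one row per marker, one Integer column.
markers = ["rs1", "rs2", "rs3", "rs4", "rs5"]
block_column = [1, 1, None, 2, 2]
blocks = blocks_from_column(markers, block_column)
```

Here `rs3` has a missing block number, so it belongs to neither block 1 nor block 2.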

When you have selected a block spreadsheet, you must then choose which of the valid block definition columns from that spreadsheet you would like to use for analysis.

In a common workflow, you may wish to run the Haplotype Block Detection algorithm to produce a block spreadsheet with blocks defined algorithmically. Then open this dialog to select the resulting block spreadsheet to define the blocks to be used for Haplotype Trend Regression.

Use All Markers as Single Block

By selecting this option, the entire set of active markers for the current spreadsheet will be treated as a single block. This may be useful when investigating entire sets of markers produced as subset spreadsheets from an LD plot. See Using LD Plots for more information.

Use a Moving Window of Markers

By selecting this option, a set of blocks will be automatically generated based on parameters for a moving window. There are two options for the moving window–either a moving window of a fixed number of columns, or, if a marker map is applied, a dynamic moving window size based on the base pair distance between markers.

  • Fixed window size: Specifies that a fixed number of markers should be used for the moving window.
  • Dynamic window size in base pairs: Specifies both the distance in kilobase pairs and the maximum size of the moving window. Together these define which markers are considered to be within the window. The “kb” field defines the maximum distance in kilobase pairs that the moving window will span, and the “max columns” field, if used, specifies the maximum number of columns within that distance to be included in the window. The window will not cross over chromosome boundaries as defined in the marker map. This option is only available for spreadsheets where a marker map has been applied.
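
The two window types can be sketched as follows. This is an illustration only, not the SVS implementation: the function names are hypothetical, and the sketch assumes that the window advances one marker at a time and that markers are sorted by chromosome and position. How SVS itself steps the window or treats the end of a chromosome is not described here.

```python
def fixed_windows(n_markers, size):
    """Index lists for a moving window of a fixed number of markers."""
    return [list(range(start, start + size))
            for start in range(n_markers - size + 1)]

def dynamic_windows(chromosomes, positions_bp, kb, max_columns=None):
    """Index lists for a window bounded by base-pair distance.

    Each window starts at one marker and extends to later markers on the
    same chromosome that lie within `kb` kilobase pairs of the start,
    optionally capped at `max_columns` markers (assumed behavior).
    """
    windows = []
    for start in range(len(positions_bp)):
        window = []
        for i in range(start, len(positions_bp)):
            if chromosomes[i] != chromosomes[start]:
                break  # never cross a chromosome boundary
            if positions_bp[i] - positions_bp[start] > kb * 1000:
                break  # marker is farther away than the kb limit
            if max_columns is not None and len(window) == max_columns:
                break  # window already holds the maximum number of columns
            window.append(i)
        windows.append(window)
    return windows

# Made-up marker map: three markers on chromosome 1, one on chromosome 2.
chroms = ["1", "1", "1", "2"]
positions = [1000, 3000, 9000, 500]
by_count = fixed_windows(5, 3)
by_distance = dynamic_windows(chroms, positions, kb=5)
```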

Show Marker Names in Spreadsheet Output

The checkbox Show marker names in spreadsheet output is available for selection when Use precomputed blocks or Use a moving window of markers is selected. This checkbox defaults to being checked.

  • Check this box to display, in the output spreadsheet, which markers make up the current haplotype block.
  • Uncheck this box if your spreadsheet is going to be very large and you wish to save the memory that this column will take up. (For instance, the output for 1 million haplotype blocks may take from 20 megabytes up to 100 megabytes or more.)

How Haplotype Frequencies are Computed

Because the phase of the genotypic information in genotypic markers is not known, haplotype frequencies must be estimated using statistical methods. Although the estimation algorithms may find many potential haplotypes, there are usually only a handful with significant frequencies in a given block of markers.

The Frequency threshold parameter is used

  1. to consider only haplotypes with an estimated frequency above the threshold, so as to reduce the number of variables being considered in association tests, and
  2. to decide whether, in addition to the haplotypes below the threshold, the haplotype with the lowest frequency above the threshold should also be left out of the regression in order to help prevent multicollinearity. Specifically, the total sum of frequencies of the haplotypes not considered for inclusion in the regression must be greater than the Frequency threshold.
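
One reading of these two rules can be sketched in Python. The `select_haplotypes` function and the example frequencies are hypothetical, and details such as how ties at the threshold are handled are assumptions rather than documented SVS behavior.

```python
def select_haplotypes(frequencies, threshold):
    """Pick haplotypes to use as regressors (illustrative reading of the rules).

    `frequencies` maps a haplotype label to its estimated frequency.
    Returns the selected labels, most frequent first.
    """
    above = sorted((h for h, f in frequencies.items() if f > threshold),
                   key=lambda h: frequencies[h], reverse=True)
    left_out_total = sum(f for f in frequencies.values() if f <= threshold)
    if above and left_out_total <= threshold:
        # Too little frequency has been left out, so the remaining haplotype
        # probabilities would be nearly collinear with the intercept.
        # Drop the least frequent haplotype above the threshold as well.
        above.pop()
    return above

# Made-up estimated frequencies for one block.
estimated = {"AC": 0.55, "GT": 0.30, "AT": 0.14, "GC": 0.01}
used = select_haplotypes(estimated, threshold=0.05)
```

In this example only 0.01 of the total frequency falls below the threshold, so `AT`, the least frequent haplotype above it, is left out too.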

Both estimation methods (see paragraph below) allow for samples with missing genotypes to have their haplotype frequencies inferred. Select Impute missings to enable this algorithmic feature.

Currently, there are two methods for estimating haplotype frequencies (see the link below for details about the algorithms and their individual strengths). If you select the EM method, you must also provide the additional Maximum EM iterations and EM convergence tolerance parameters used by the algorithm.

See Haplotype Frequency Estimation Methods for more information on the details of each estimation algorithm.

Stepwise Regression

  • Stepwise Regression: Check this group box to specify that the linear or logistic regression should be done as a stepwise regression procedure, either backward elimination or forward selection. A p-value cut-off must be specified when running stepwise regression. Backward elimination starts with all of the full model covariates and repeatedly removes the least significant covariate until every remaining covariate is significant at the specified stepwise p-value cut-off. “Significant” here means testing the current model as a “full model” and the current model without a regressor as a “reduced model” to find a full-vs-reduced p-value. Forward selection selects the most significant covariate and keeps adding the next most significant covariate until adding a further covariate is no longer significant. “Significant” here means testing the current model plus a covariate as a “full model” and the current model itself as a “reduced model” to find a full-vs-reduced p-value.
  • If Stepwise Regression is not checked, then a single linear or logistic regression will take place for each haplotype block.
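
The forward selection loop described above can be sketched generically. The `forward_selection` function and the `p_value` callback are hypothetical names: the callback stands in for whatever full-vs-reduced test is used, and comparing with a strict `<` against the cut-off is an assumption.

```python
def forward_selection(candidates, p_value, cutoff):
    """Generic forward selection (illustrative sketch).

    `p_value(model, candidate)` must return the full-vs-reduced p-value
    for `model + [candidate]` (full) against `model` (reduced).
    """
    model = []
    remaining = list(candidates)
    while remaining:
        # Find the candidate whose addition is most significant.
        best = min(remaining, key=lambda c: p_value(model, c))
        if p_value(model, best) >= cutoff:
            break  # adding a further regressor is no longer significant
        model.append(best)
        remaining.remove(best)
    return model

# Toy p-values that ignore the current model, for demonstration only.
toy = {"hap1": 0.001, "hap2": 0.2, "hap3": 0.03}
chosen = forward_selection(toy, lambda model, c: toy[c], cutoff=0.05)
```

Backward elimination follows the same pattern in reverse: start from all regressors and repeatedly remove the one with the largest p-value while that p-value is not significant.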

Fixed Covariates and the Regression Model

Sometimes it is desired to observe the effects of one or more binary, continuous, or categorical variables, otherwise known as “covariates”. These covariates, or first-order interactions between covariates, may be influencing the dependent variable response. Covariates (see Specifying Fixed Covariates for how to specify these) may be included in Haplotype Trend Regression in two ways:

  1. They may be added to the regression as additional variables. Once you have specified at least one covariate, go to the Fixed Covariate Options and check Include with a full-model-only regression to do this. Stepwise regression may prove useful for this type of analysis.
  2. They may be “corrected for” by using them in a reduced model. Once you have specified at least one covariate, go to the Fixed Covariate Options and check Use as the reduced model for a full-vs.-reduced regression to do this. Correcting for the covariates allows the user to see specifically what effects there are from the remaining variables (haplotype frequencies). To do this, first a linear regression equation, which includes only the dependent and the reduced model covariates, is calculated (the “reduced model”). Next, a linear regression which includes all of the variables (including the haplotype frequencies) is calculated (the “full model”). The significance of the full versus the reduced model is calculated with an F-test. See Full Versus Reduced Model Regression Equation for more information.
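
For the linear case, the standard F statistic for comparing nested models can be computed from the residual sums of squares of the two fits. The sketch below uses that textbook formula with hypothetical names and made-up numbers; the exact expression SVS uses is given in Full Versus Reduced Model Regression Equation.

```python
def full_vs_reduced_f(rss_reduced, rss_full, n_samples,
                      n_reduced_regressors, n_full_regressors):
    """Textbook F statistic for nested linear models.

    Regressor counts exclude the intercept. Returns the statistic and
    its numerator and denominator degrees of freedom.
    """
    df_numerator = n_full_regressors - n_reduced_regressors
    df_denominator = n_samples - n_full_regressors - 1
    f_stat = ((rss_reduced - rss_full) / df_numerator) / (rss_full / df_denominator)
    return f_stat, df_numerator, df_denominator

# Made-up example: 2 covariates in the reduced model, plus 3 haplotypes
# in the full model, fitted on 100 samples.
f_stat, df1, df2 = full_vs_reduced_f(rss_reduced=120.0, rss_full=90.0,
                                     n_samples=100,
                                     n_reduced_regressors=2,
                                     n_full_regressors=5)
```

A large statistic means the haplotype regressors explain variation that the covariates alone do not.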

Note

The Fixed Covariate Options box will be disabled until you have specified at least one covariate.

Specifying Fixed Covariates

If there are binary or numeric columns in the spreadsheet in addition to the column selected as the dependent variable, and/or there are categorical columns in the spreadsheet, these will be selectable in the fixed covariates section. To include a covariate in the analysis, click on the Add Covariate button. This will open a dialog allowing you to select the covariate(s) to use in the regression equation. Then, select the covariate(s) to include and click Add. If you would like to add all of the covariates in the list, click Add All. The selected covariates will be shown in the “Fixed Covariates” list. To remove a covariate, select the covariate(s) to remove, and click Remove Selected. This will remove the item from the “Fixed Covariates” list and from the regression equation. To remove all covariates click Clear List.

To include first-order interactions, click the Add Interaction button. This will open a dialog which displays two lists, each containing all of the non-genotypic covariate column names within the spreadsheet. Select the term(s) from each of the two lists which you would like to include and click Add. All selected items from the list on the left will be paired with all the selected items from the list on the right, and an item for each pair will be added to the “Fixed Covariates” list. If any of the selected items in either window represent categorical columns, then sub-items representing the dummy variables used in regression for each category will be paired with the items or sub-items from the other window. (Values from each pair are multiplied to create a “new” covariate, which is then used in the regression equation.)
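
The pairing of a numeric covariate with the dummy variables of a categorical covariate can be illustrated as follows. The column names are made up, and treating the first category as the reference (so it gets no dummy column) is an assumption of this sketch, not a statement about how SVS codes categories.

```python
def dummy_columns(name, values):
    """One 0/1 column per category except the first (assumed reference)."""
    categories = sorted(set(values))
    return {f"{name}={cat}": [1.0 if v == cat else 0.0 for v in values]
            for cat in categories[1:]}

def interaction(left, right):
    """Multiply two covariate columns element-wise."""
    return [a * b for a, b in zip(left, right)]

# Hypothetical covariates for four samples.
age = [30.0, 45.0, 60.0, 52.0]
site = ["A", "B", "C", "B"]

# Pair the numeric column with each dummy variable of the categorical one.
interactions = {
    f"age*{label}": interaction(age, column)
    for label, column in dummy_columns("site", site).items()
}
```

Each resulting column is a “new” covariate that is nonzero only for samples in the corresponding category.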

When you have added all of the interactions, click Close to return to the Haplotype Trend Regression window. All listed interactions will be included in the analysis, so unwanted interactions must be removed in order to exclude them. To remove an interaction, select the item(s) to remove and click Remove Interaction.

Once you have selected the covariates and interactions you wish to include in the regression, go back to the Fixed Covariate Options box, which will now be enabled, to select how the covariates will be used in the regression (see Fixed Covariates and the Regression Model).

Missing Values

All samples containing missing values in the dependent variable or in any fixed covariates will be dropped from the analysis.

Additionally, if Impute missing values for haplotypes is NOT checked in the Haplotype Estimation Options section, all samples containing missing values in any of the genotypes of the haplotype block currently being analyzed will also be dropped from the analysis.

Residual Spreadsheet

A residual spreadsheet is available if you have checked Use all markers as single block to get one regression from all markers.

The residual spreadsheet (see Figure Haplotype Trend Regression Residual Spreadsheet) will contain the actual, predicted, and residual values of the dependent variable for each sample. The residual value of a sample is defined as the difference between the sample’s actual value and its predicted value from the regression. In addition, the haplotype frequencies for the individual samples will be output for those haplotypes that were used in the regression.

To obtain a residual spreadsheet, which will be in addition to the results view from the regression, check Output Residual spreadsheet in the Advanced Parameters tab.

Note

This option is available only when you select Use all markers as single block.

Additional Outputs

To enable the most utility from your regression results, some convenient derivative statistics can be computed on your p-values.

  • Output -log10(P): Computes the value -log_{10}(\textit{p-value}) for each p-value and for each multiple-testing-corrected p-value.
  • Output data for P-P/Q-Q plots: Computes the expected value for each p-value and for each multiple-testing-corrected p-value. By plotting the expected vs. actual p-values, you can create P-P or Q-Q plots. This option forces the -log10(P) output as well.
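
Both derived quantities are simple to compute. In the sketch below, the expected p-value for the i-th smallest of N observed p-values is taken as (i - 0.5) / N, which is one common plotting convention and an assumption here; the formula SVS uses may differ.

```python
import math

def neg_log10(p_values):
    """-log10 of each p-value (p-values must be greater than zero)."""
    return [-math.log10(p) for p in p_values]

def expected_p_values(p_values):
    """Expected p-value for each observed one under a uniform null.

    The i-th smallest of N p-values is paired with (i - 0.5) / N
    (assumed convention). Results are in the original input order.
    """
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    expected = [0.0] * n
    for rank, index in enumerate(order, start=1):
        expected[index] = (rank - 0.5) / n
    return expected

# Made-up p-values for four haplotype blocks.
observed = [0.5, 0.001, 0.04, 0.2]
expected = expected_p_values(observed)
scores = neg_log10(observed)
```

Plotting `expected` against `observed` gives a P-P plot; plotting the -log10 values of both gives the usual Q-Q plot.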

Note

Output data for P-P/Q-Q plots is only available if you have selected Use precomputed blocks or Use a moving window of markers.

Multiple Testing Correction

When many haplotype blocks are tested, a seemingly significant test statistic may arise by chance alone. Multiple testing corrections are designed to guard against this. You may optionally select one or more of the following multiple testing corrections.

  • Bonferroni adjustment (on N haplotype blocks) Multiplies p-values by N, but does not allow the result to be more than 1.

  • False Discovery Rate (FDR) A less severe adjustment. See False Discovery Rate for an explanation of this algorithm.

  • Single value permutations and/or Full scan permutations Permutation testing of the linear and logistic regression models permutes the dependent variable, then runs the regressions over again, checking the significance of these regressions. This is distinct from checking the “fit” of the permuted dependent to the original regression results from a given set of regressors. The object is to see whether by chance, a different set of dependents could have had a better relationship or “fit” with the covariates and regressors. This is tested through performing a new regression for each permutation.

    See Permutation Testing Methodology for a more detailed explanation and examples of permutation testing.
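
The two analytic corrections in the list above can be sketched as follows. The Bonferroni function follows the description given here. The FDR function implements the widely used Benjamini-Hochberg step-up adjustment as an assumption; see False Discovery Rate for the procedure SVS actually applies.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (assumed FDR procedure)."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * n / rank)
        adjusted[index] = running_min
    return adjusted

# Made-up p-values for four haplotype blocks.
raw = [0.01, 0.04, 0.03, 0.5]
bonf = bonferroni(raw)
fdr = benjamini_hochberg(raw)
```

For the same input, the FDR-adjusted values are never larger than the Bonferroni-adjusted ones, which is why FDR is described as the less severe adjustment.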

Note

If you have selected Use all markers as single block, the only available correction will be Single value permutations.

Viewing Detailed Results

If you have selected Use all markers as single block, detailed results (Regression Statistics Results Viewer) will always be shown for the regression.

Otherwise, if you have selected Use precomputed blocks or Use a moving window of markers, the only outputs shown will be in rows of the results spreadsheet (Haplotype Trend Regression Results Spreadsheet), unless you specify a criterion for which regressions you would like to see detailed results. To see detailed results for some regressions, check Output detailed results if... on the Advanced Parameters tab and set the desired criterion. This criterion comes in three parts:

  • Value to use. These may be:
    • P-Value Full vs. Reduced Model (available and default for full vs. reduced testing)
    • -log 10 P-Value FullvsRed Model (available for full vs. reduced testing)
    • P-Value Full Model (default for full-model-only testing)
    • -log 10 P-Value Full Model
    • R Squared Full Model (available for linear haplotype trend regression)
  • Type of comparison. This may be “<” (default), “<=” (\le), “>”, or “>=” (\ge).
  • Threshold. (Defaults to 0.05.)

Detailed output will be generated for those regressions for which the criterion is true. These will all be placed into a single detailed output viewer.
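
The three-part criterion amounts to a simple comparison applied to each regression. A minimal sketch, with hypothetical block names and p-values:

```python
import operator

# The four comparison types offered in the dialog.
COMPARISONS = {"<": operator.lt, "<=": operator.le,
               ">": operator.gt, ">=": operator.ge}

def wants_detail(value, comparison="<", threshold=0.05):
    """True if the chosen value meets the criterion."""
    return COMPARISONS[comparison](value, threshold)

# Made-up full-model p-values, one per haplotype block.
results = {"block_1": 0.20, "block_2": 0.003, "block_3": 0.05}
detailed = [name for name, p in results.items() if wants_detail(p)]
```

With the defaults (“<” and 0.05), a p-value exactly equal to the threshold does not qualify; choosing “<=” would include it.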

Running Haplotype Trend Regression

Click Run to start the analysis procedure.

Note

Sometimes a regression may fail due to insufficient rank in the coefficient matrix. This can be a result of not enough observations or due to the inclusion of “collinear” regressors. A collinear regressor is one which is a linear combination of one or more other regressors.
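
To illustrate why collinearity reduces rank, the sketch below computes the rank of a small, made-up design matrix by Gaussian elimination. The matrix has an intercept column followed by three haplotype probability columns that sum to one for every sample, so the intercept is a linear combination of the other columns. The `matrix_rank` helper is hypothetical and is not how SVS checks rank.

```python
def matrix_rank(rows, tol=1e-9):
    """Rank of a matrix (list of rows) by Gaussian elimination."""
    m = [list(map(float, r)) for r in rows]
    rank = 0
    n_cols = len(m[0]) if m else 0
    for col in range(n_cols):
        # Choose the largest remaining entry in this column as the pivot.
        pivot = max(range(rank, len(m)), key=lambda r: abs(m[r][col]),
                    default=None)
        if pivot is None or abs(m[pivot][col]) < tol:
            continue  # no usable pivot: this column adds no rank
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

# Columns: intercept, then probabilities of three haplotypes (made up).
design = [
    [1, 0.5, 0.5, 0.0],
    [1, 1.0, 0.0, 0.0],
    [1, 0.0, 0.5, 0.5],
    [1, 0.5, 0.0, 0.5],
    [1, 0.0, 1.0, 0.0],
]
full_rank = matrix_rank(design)                         # four columns
reduced_rank = matrix_rank([row[:3] for row in design]) # last haplotype dropped
```

The four-column matrix has rank 3, so it is rank deficient. After one haplotype column is dropped, the remaining three columns have rank 3 and the regression can proceed.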

Haplotype Trend Regression Outputs

Residual Spreadsheet

If a residual spreadsheet is produced (see Figure Haplotype Trend Regression Residual Spreadsheet), it will contain the actual, predicted, and residual values of the dependent variable for each sample. The residual value of a sample is defined as the difference between the sample’s actual value and its predicted value from the regression. Additionally, any fixed covariates that were specified for the regression are output, followed by the haplotype frequencies by individual sample for those haplotypes that were used in the regression.

Haplotype Trend Regression Residual Spreadsheet

Note

Strictly speaking, residuals do not make as much sense for logistic regression as they do for linear regression because the distribution of a logistic regression residual separates into two parts. However, this spreadsheet may be used as a crude gauge of how well the regression model predicts the observed values of the dependent variable.

Regression Results Spreadsheet

If you checked Use precomputed blocks or Use a moving window of markers, a spreadsheet of regression results will be output (see Figure Haplotype Trend Regression Results Spreadsheet). The rows of this spreadsheet correspond to the haplotype blocks used. The row label will correspond to the first marker in the block.

Haplotype Trend Regression Results Spreadsheet

Note

Detailed results for any interesting haplotype blocks can be obtained in either of two ways. They appear in the Regression Statistics Results Viewer if the p-value or R^2 value meets the specified criterion (see Viewing Detailed Results). Alternatively, you can inactivate all genotypic markers other than those in the block of interest, then run Haplotype Trend Regression with Use all markers as single block and any fixed covariates used originally.

Regression Statistics Results Viewer

A Regression Statistics Results Viewer (see Figure Linear Haplotype Trend Regression Statistics Results Viewer) will be displayed for any regression if Use all markers as single block was selected or if Output detailed results if... on the Advanced Parameters tab was selected and the regression meets the selected criterion (Viewing Detailed Results).

Linear Haplotype Trend Regression Statistics Results Viewer

Linear Haplotype Trend Regression Statistics

The detailed output viewable in the Haplotype Trend Regression Statistics Viewer is described below.

Full Model Only Regression

If only a full model was used for the regression equation, the following model statistics are displayed for both normal and stepwise regression:

  • Name of the response variable.
  • Unsigned multiple correlation coefficient R, where R=\sqrt{R^2}.
  • Coefficient of determination R^2.
  • Adjusted R^2. This statistic is meant to compensate for many regressors, each explaining small portions of the variation by chance alone.
  • Sample size.
  • Residual standard error SE_{resid}.
  • Unbiased standard deviation of the response.
  • Value of the F-statistic.
  • P-value of the F-statistic for the regression model.
  • Single-value permuted p-value, if single-value permutation testing was selected.
  • Full-scan permuted p-value, if full-scan permutation testing was selected.
  • Number of permutations, if permutation testing was selected.
  • Regression degrees of freedom.
  • Residual degrees of freedom.
  • Total degrees of freedom.
  • Y-intercept.
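
Several of these statistics follow directly from the actual and predicted values of the response. The sketch below uses the standard textbook definitions, which are assumed rather than taken from the SVS formulas; the `linear_fit_summary` helper and the data are made up for illustration.

```python
import math

def linear_fit_summary(actual, predicted, n_regressors):
    """Standard summary statistics for a linear fit with an intercept.

    `n_regressors` excludes the intercept.
    """
    n = len(actual)
    mean = sum(actual) / n
    ss_total = sum((y - mean) ** 2 for y in actual)
    ss_resid = sum((y - yhat) ** 2 for y, yhat in zip(actual, predicted))
    df_residual = n - n_regressors - 1
    r_squared = 1.0 - ss_resid / ss_total
    return {
        "R": math.sqrt(max(r_squared, 0.0)),
        "R^2": r_squared,
        # Penalizes R^2 for the number of regressors in the model.
        "adjusted R^2": 1.0 - (1.0 - r_squared) * (n - 1) / df_residual,
        "residual SE": math.sqrt(ss_resid / df_residual),
        "response SD": math.sqrt(ss_total / (n - 1)),
        "regression df": n_regressors,
        "residual df": df_residual,
        "total df": n - 1,
    }

# Illustrative values only, not output from a real regression.
summary = linear_fit_summary(actual=[1.0, 2.0, 3.0, 4.0, 5.0],
                             predicted=[1.2, 1.9, 3.0, 4.1, 4.8],
                             n_regressors=1)
```

The adjusted R^2 is always at most R^2, and the gap widens as regressors are added without a matching drop in residual variation.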

Full Versus Reduced Model Regression

If a full versus reduced model was used for the regression equation, the following model statistics are displayed for both normal and stepwise regression:

  • Name of the response variable.
  • Coefficient of determination R^2 for the full model.
  • Coefficient of determination R^2 for the reduced model.
  • Adjusted R^2 for the full model. This statistic is meant to compensate for many regressors, each explaining small portions of the variation by chance alone.
  • Sample size.
  • Residual standard error SE_{resid}.
  • Unbiased standard deviation of the response.
  • Value of the F-statistic for the full model.
  • Value of the F-statistic for the full versus reduced model.
  • P-value of the F-statistic for the full regression model.
  • P-value of the F-statistic for the full versus reduced regression model.
  • Single-value permuted p-value, if single-value permutation testing was selected.
  • Full-scan permuted p-value, if full-scan permutation testing was selected.
  • Number of permutations, if permutation testing was selected.
  • Regression degrees of freedom of the full model.
  • Regression degrees of freedom of the reduced model.
  • Residual degrees of freedom of the full model.
  • Total degrees of freedom of the full model.
  • Y-intercept of the full model.
  • Y-intercept of the reduced model.

Logistic Haplotype Trend Regression Model Statistics

Full Model Only Regression

If only a full model was used for the regression equation, the following model statistics are displayed for both normal and stepwise regression:

  • Name of the response variable.
  • Regression likelihood L_1.
  • Null model likelihood L_0.
  • Sample size.
  • Value of the Chi-Squared (\chi^2) statistic.
  • P-value of the Chi-Squared statistic for the regression model.
  • Single-value permuted p-value, if single-value permutation testing was selected.
  • Full-scan permuted p-value, if full-scan permutation testing was selected.
  • Number of permutations, if permutation testing was selected.
  • Regression degrees of freedom.
  • Residual degrees of freedom.
  • Total degrees of freedom.
  • \beta_0.
  • \beta_0 standard error.

Full Versus Reduced Model Regression

If a full versus reduced model was used for the regression equation, the following model statistics are displayed for both normal and stepwise regression:

  • Name of the response variable.
  • Full model likelihood L_1.
  • Reduced model likelihood L_0.
  • Chi-squared (\chi^2) statistic of the full model.
  • Chi-squared statistic of the full versus reduced model.
  • P-value of the Chi-Squared statistic for the full regression model.
  • P-value of the Chi-Squared statistic for the full versus reduced regression model.
  • Single-value permuted p-value, if single-value permutation testing was selected.
  • Full-scan permuted p-value, if full-scan permutation testing was selected.
  • Number of permutations, if permutation testing was selected.
  • Regression degrees of freedom of the full model.
  • Regression degrees of freedom of the reduced model.
  • Residual degrees of freedom of the full model.
  • Total degrees of freedom of the full model.
  • \beta_0 for the full model.
  • Standard error for \beta_0 for the full model.
  • \beta_0 for the reduced model.

Markers Used

The markers used for generating the haplotypes (that is, the markers of the haplotype block being regressed) are shown.

Linear Model Regressor Statistics

For both full-model-only and full versus reduced linear haplotype trend regressions, the Y-intercept for the full model is displayed. Also, for full versus reduced linear haplotype trend regression models, the Y-intercept for both the full and reduced models is displayed.

The following statistics are displayed for each regressor:

  • Name (unless this is a fixed covariate or interaction term, this will be a haplotype identifier consisting of the alleles used from each marker)
  • Coefficient
  • Standard error
  • T-statistic for adding this regressor
  • P-value for adding this regressor
  • Univariate fit p-value

Logistic Model Regressor Statistics

For both full-model-only and full versus reduced logistic haplotype trend regressions, \beta_0 for the full model is displayed. Also, for full versus reduced logistic haplotype trend regression models, \beta_0 for both the full and reduced models is displayed.

The following statistics are displayed for each regressor:

  • Name (unless this is a fixed covariate or interaction term, this will be a haplotype identifier consisting of the alleles used from each marker)
  • Coefficient
  • Standard error
  • P-value for adding this regressor
  • Odds ratio
  • Univariate fit p-value

The regression odds ratio for the coefficient \hat{\beta} is e^{\hat{\beta}}. The interpretation of this odds ratio is the ratio of the odds of the dependent being one (“true”) if the given regressor were increased by one unit to the odds of the dependent being one (“true”) when the given regressor has its current value.
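
This interpretation can be checked numerically. The coefficients and regressor value below are made up; the sketch evaluates a one-regressor logistic model at x and at x + 1 and compares the ratio of the two odds with e^{\hat{\beta}}.

```python
import math

def logistic(z):
    """Probability that the dependent is one for linear predictor z."""
    return 1.0 / (1.0 + math.exp(-z))

def odds(probability):
    """Odds corresponding to a probability."""
    return probability / (1.0 - probability)

# Hypothetical fitted coefficients and regressor value.
beta0, beta1 = -1.0, 0.7
x = 0.4

# Odds after a one-unit increase divided by odds at the current value.
ratio = (odds(logistic(beta0 + beta1 * (x + 1.0)))
         / odds(logistic(beta0 + beta1 * x)))
odds_ratio = math.exp(beta1)
```

The two quantities agree up to rounding, and the result does not depend on the starting value of x.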

Left Out Regressors

This list will include all regressors excluded from the final model of a stepwise regression model.

Haplotypes Used in the Regression

This lists the haplotypes considered for this regression because their frequencies are above the Frequency threshold. Also listed is the overall frequency of each haplotype, along with whether it was actually used for the regression. “Used for” here means considered as a part of the full model, but not necessarily included in the final model of a stepwise regression model.

Caveats for Logistic Regression

Under some circumstances, the iteration procedure for the logistic regression algorithm will be unstable, and the regression may fail, even when the coefficient matrix has sufficient rank and significant regressors are included. Such a circumstance can arise when the regression algorithm tries to emulate a step function or otherwise tries to accommodate independent values for which the dependent variable is either exclusively 1 or exclusively 0.
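
One simple way to spot the step-function situation for a single numeric regressor is to check whether some cut point splits the 0 and 1 outcomes exactly. The helper below is a hypothetical diagnostic, not part of SVS, and it does not detect partial overlap (ties at the cut point) or separation by combinations of regressors.

```python
def perfectly_separates(values, outcomes):
    """True if some cut point on `values` splits outcomes 0 and 1 exactly."""
    zeros = [v for v, y in zip(values, outcomes) if y == 0]
    ones = [v for v, y in zip(values, outcomes) if y == 1]
    if not zeros or not ones:
        return True  # only one outcome is present, so nothing can be fitted
    return max(zeros) < min(ones) or max(ones) < min(zeros)

# Made-up regressor values and case/control status.
separated = perfectly_separates([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
overlapping = perfectly_separates([0.1, 0.8, 0.2, 0.9], [0, 0, 1, 1])
```

In the first example every control lies below every case, so the fitted coefficient would grow without bound during iteration; in the second the groups overlap and a finite fit exists.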

If a stepwise regression model approach is used, similar circumstances resulting in instability may cause “paradoxical” phenomena such as:

  • The final regression used to get the model statistics fails, even though it is “the same as” the last model tried in the stepwise regression algorithm. Actually, it is possible that a different order will be used for the regressors in the final model compared to the last model tried for stepwise regression. If the problem is highly unstable, the different order may be enough to cause failure.
  • For some regressors, the p-value Pr(Chi) associated with dropping the regressor from the regression equals 1 (Pr(Chi)=1). This happens when the regression fails after removing the regressor. This is only possible for a regressor other than the last one added to the model.

The best workaround is to filter out the data causing such instabilities. If one covariate of a regression has a coefficient above 15 or 20 or below -15 or -20 and the regressors from a stepwise regression won’t regress directly, or if a certain covariate does not regress by itself, the data should be filtered. Consider making a row subset spreadsheet based on ranges of values of the covariates and performing the desired regression on each subset. Alternatively, consider stepwise regression if it is not already applied to the model. If stepwise regression is failing, changing the method from forward selection to backward elimination or vice versa could result in a solution.