# Haplotype Trend Regression

Haplotype Trend Regression (HTR) takes one or more blocks of
genotypic markers and, for each block, estimates haplotypes
for its markers, then regresses their by-sample haplotype
probabilities against a dependent variable. The haplotypes used
for the regression are those above the frequency threshold: either all of
them or all but one (see the **Frequency threshold** section of
*How Haplotype Frequencies are Computed*). The regression may be
linear or logistic (depending upon the dependent variable), may be
stepwise if desired, and may involve fixed numeric or categorical
covariates and/or interaction terms. The fixed covariates and
interaction terms may either be regressed together with the by-sample
haplotype probabilities (“full model only”) or may be grouped
separately into a “reduced model”. Permutation testing is also
available for Haplotype Trend Regression.

For an overview of the theories behind regression analysis in Golden Helix SVS, see
*Linear Regression* and *Logistic Regression*.

## Performing Analysis

To perform Haplotype Trend Regression, open a spreadsheet containing
genotypic markers and at least one numeric or case/control
column. Select a column for the dependent variable; this column must
be either quantitative (real-valued or integer-valued) or a binary
case/control status column. To open the Haplotype Trend Regression
window, select the **Genotypic** > **Haplotype Trend Regression** menu
item. This feature is currently supported for spreadsheets with only
one column set as dependent. Categorical dependent columns are
currently not supported.

The Haplotype Trend Regression window (see Figure *Haplotype Trend Regression – Analysis Parameters*)
allows for various options to be set or changed. A list and brief
description of these options is as follows:

**HTR Parameters:** The first tab of the Haplotype Trend Regression window allows for most of the categories of parameters to be set. The categories are:

- *Haplotype Block Definition*: Defines the block(s) of markers to be used for analysis.
- *Haplotype Estimation Options*: These options specify how haplotypes should be estimated. Additionally, the frequency threshold determines which haplotypes should be directly used in the analysis.
- *Stepwise Regression*: These options allow stepwise regression to be selected and its parameters to be specified.
- *Fixed Covariate Options*: If you have specified fixed covariates, this selects whether to use them as part of a full-model-only regression or as the reduced model for a full-vs-reduced model regression. This box will become enabled as soon as you select any fixed covariates.
- *Fixed Covariates*: Specify fixed covariates here. This spreadsheet column chooser box is activated if there are binary or numeric columns in the spreadsheet in addition to the column selected as the dependent variable, and/or there are categorical columns in the spreadsheet.

**Advanced Parameters:** This tab (see Figure *Haplotype Trend Regression – Advanced Parameters*) allows residual spreadsheet output to be specified (when that is allowed based on how you define your haplotype blocks) and multiple testing corrections to be set. Additional output, such as that used for the creation of P-P or Q-Q plots, and detailed output options are also available in this tab.

### Ways of Defining Haplotype Blocks

Golden Helix SVS has a few convenient ways of defining marker blocks to be used for Haplotype Trend Regression.

- Use precomputed blocks
- Use all markers as a single block
- Use a moving window of markers

These are described in more detail below.

#### Use Precomputed Blocks

By selecting this option, you have complete control over the definition of marker blocks to be used in analysis. This option reads from an external spreadsheet and a given block definition column.

The spreadsheet with the block columns should have the same marker names
along its row labels as are in the current spreadsheet as column
headers. A block definition column should be a column of type
**Integer**. Each row for the column specifies the block number the
current row’s marker is a member of. It may have missing values to
indicate that the marker in the current row is not in any block.
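
The block definition column described above can be pictured as a mapping from marker name to block number. The following is a minimal Python sketch of that idea, not SVS code: the function name `group_markers_by_block`, the marker names, and the use of `None` for a missing value are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def group_markers_by_block(
    block_column: Dict[str, Optional[int]]
) -> Dict[int, List[str]]:
    """Map each block number to its markers, skipping markers with no block."""
    blocks: Dict[int, List[str]] = defaultdict(list)
    for marker, block_id in block_column.items():
        if block_id is None:  # missing value: the marker is in no block
            continue
        blocks[block_id].append(marker)
    return dict(blocks)


# Hypothetical block definition column: row label -> block number.
example_column = {"rs1": 1, "rs2": 1, "rs3": None, "rs4": 2, "rs5": 2, "rs6": 2}
print(group_markers_by_block(example_column))
```

In this example `rs3` has a missing block number, so it is left out of every block.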

When you have selected a block spreadsheet, you must then choose which of the valid block definition columns from that spreadsheet you would like to use for analysis.

In a common workflow, you may wish to run the *Haplotype Block Detection* algorithm
to produce a block spreadsheet with blocks defined algorithmically. Then
open this dialog to select the resulting block spreadsheet to define the
blocks to be used for Haplotype Trend Regression.

#### Use All Markers as Single Block

By selecting this option, the entire set of active markers for the
current spreadsheet will be treated as a single block. This may be
useful when investigating entire sets of markers produced as subset
spreadsheets from a LD plot. See *Using LD Plots* for more information.

#### Use a Moving Window of Markers

By selecting this option, a set of blocks will be automatically generated based on parameters for a moving window. There are two options for the moving window: either a window of a fixed number of columns or, if a marker map is applied, a dynamic window whose size is based on the base-pair distance between markers.

- *Fixed window size:* Specifies that a fixed number of markers should be used for the moving window.
- *Dynamic window size in base pairs:* Specifies both the genetic distance in kilo-base pairs and the maximum size of the moving window. These define which markers are considered to be within the window. The “kb” field defines a maximum genetic distance in kilo-base pairs that the moving window will include, and the “max columns” field, if used, specifies the maximum number of columns within the specified genetic distance to be included in the window. The window will not cross over chromosome boundaries as defined in the marker map. This option is only available for spreadsheets where a marker map has been applied.
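
The two window types can be sketched in Python as follows. This is only an illustration under stated assumptions: it assumes one window is started at each marker in turn, that markers are already sorted by chromosome and position, and that distance is measured from the first marker in the window. The function names and the example marker map are hypothetical, and SVS may advance or measure its windows differently.

```python
from typing import List, Optional, Tuple

Marker = Tuple[str, str, int]  # (name, chromosome, base-pair position)


def fixed_windows(names: List[str], size: int) -> List[List[str]]:
    """One window per starting marker, each holding exactly `size` markers."""
    return [names[i:i + size] for i in range(len(names) - size + 1)]


def dynamic_windows(
    markers: List[Marker], kb: float, max_columns: Optional[int] = None
) -> List[List[str]]:
    """One window per starting marker, extended while markers stay on the
    same chromosome and within `kb` kilo-base pairs of the first marker."""
    windows = []
    for i, (_, chrom, start) in enumerate(markers):
        window: List[str] = []
        for name, c, pos in markers[i:]:
            # Stop at a chromosome boundary or beyond the distance limit.
            if c != chrom or pos - start > kb * 1000:
                break
            # Stop once the optional column cap is reached.
            if max_columns is not None and len(window) >= max_columns:
                break
            window.append(name)
        windows.append(window)
    return windows


# Hypothetical marker map, sorted by chromosome and position.
example_markers = [("m1", "1", 1000), ("m2", "1", 3000),
                   ("m3", "1", 9000), ("m4", "2", 500)]
print(fixed_windows([m[0] for m in example_markers], 3))
print(dynamic_windows(example_markers, kb=5))
```

Note that in the dynamic case the window starting at `m3` does not pick up `m4`, because `m4` lies on a different chromosome.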

#### Show Marker Names in Spreadsheet Output

The checkbox **Show marker names in spreadsheet output** is available
for selection when **Use precomputed blocks** or **Use a moving window
of markers** is selected. This checkbox defaults to being checked.

- Check this box to display, in the output spreadsheet, which markers make up the current haplotype block.
- Uncheck this box if your spreadsheet is going to be very large and you wish to save the memory that this column will take up. (For instance, the output for 1 million haplotype blocks may take from 20 megabytes up to 100 megabytes or more.)

### How Haplotype Frequencies are Computed

Because the phase of the genotypic information in genotypic markers is not known, haplotype frequencies must be estimated using statistical methods. Although the estimation algorithms may find many potential haplotypes, there are usually only a handful with significant frequencies in a given block of markers.

The **Frequency threshold** parameter is used

- to consider only haplotypes with an estimated frequency above the threshold, which reduces the number of variables considered in association tests, and
- to decide whether, in order to help prevent multicollinearity, the haplotype with the lowest frequency above the **Frequency threshold** should also be left out of the regression. Specifically, the total sum of frequencies of the haplotypes not considered for inclusion in the regression must be greater than the **Frequency threshold**.
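
One way to read the two rules above is sketched below in Python. The function name `select_haplotypes` and the example frequencies are hypothetical, and the sketch assumes that at most one extra haplotype is dropped and that "above" means strictly greater than the threshold; the actual SVS behavior at the boundary may differ.

```python
from typing import Dict, List


def select_haplotypes(frequencies: Dict[str, float], threshold: float) -> List[str]:
    """Return the haplotypes to use as regressors, highest frequency first."""
    # Keep only haplotypes whose estimated frequency is above the threshold.
    above = sorted(
        (h for h, f in frequencies.items() if f > threshold),
        key=lambda h: frequencies[h],
        reverse=True,
    )
    # Total frequency of the haplotypes at or below the threshold.
    left_out = sum(f for f in frequencies.values() if f <= threshold)
    # If too little frequency has been left out, also drop the rarest
    # remaining haplotype so that the regressors do not (nearly) sum to one.
    if above and left_out <= threshold:
        above.pop()
    return above


# Rare haplotypes total 0.01, which is not above 0.05, so "GC" is dropped too.
print(select_haplotypes({"AC": 0.55, "AT": 0.30, "GC": 0.14, "GT": 0.01}, 0.05))
# Rare haplotypes total 0.08, which is above 0.05, so "GC" is kept.
print(select_haplotypes({"AC": 0.50, "AT": 0.30, "GC": 0.12, "GT": 0.04, "GA": 0.04}, 0.05))
```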

Both estimation methods (see paragraph below) allow for samples with
missing genotypes to have their haplotype frequencies inferred. Select
**Impute missings** to enable this algorithmic feature.

Currently, there are two methods for estimating haplotype frequencies
(see the link below for details about the algorithms and their
individual strengths). If you select the **EM** method, you must also
provide the additional **Maximum EM iterations** and **EM convergence
tolerance** parameters used by the algorithm.

See *Haplotype Frequency Estimation Methods* for more information on the details of each
estimation algorithm.

### Stepwise Regression

**Stepwise Regression:** Check this group box to specify that the linear or logistic regression should be done as a stepwise regression procedure, either **backwards elimination** or **forward selection**. A **P-value cut-off** must be specified when running stepwise regression.

- Backward elimination starts with all of the full model covariates and repeatedly removes the least significant covariate until removing any remaining covariate would be significant at the specified stepwise p-value cut-off. “Significant” here means testing the current model as a “full model” and the current model without a regressor as a “reduced model” to find a full-vs-reduced p-value.
- Forward selection selects the most significant covariate and keeps adding the next most significant covariate until adding a further covariate is no longer significant. “Significant” here means testing the current model plus a covariate as a “full model” and the current model itself as a “reduced model” to find a full-vs-reduced p-value.
- If **Stepwise Regression** is not checked, then a single linear or logistic regression will take place for each haplotype block.
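
The control flow of the two stepwise procedures can be sketched in Python as below. This is an illustration only: the regression itself is abstracted into a caller-supplied `p_value(full, reduced)` function, the toy p-values are made up, and whether SVS compares against the cut-off with a strict or non-strict inequality is an assumption.

```python
from typing import Callable, List

PValueFn = Callable[[List[str], List[str]], float]  # (full, reduced) -> p-value


def forward_selection(candidates: List[str], p_value: PValueFn, cutoff: float) -> List[str]:
    """Add the most significant regressor until none is below the cut-off."""
    selected: List[str] = []
    remaining = list(candidates)
    while remaining:
        # Full model = current model plus one candidate; reduced = current model.
        best_p, best = min((p_value(selected + [c], selected), c) for c in remaining)
        if best_p >= cutoff:
            break
        selected.append(best)
        remaining.remove(best)
    return selected


def backward_elimination(candidates: List[str], p_value: PValueFn, cutoff: float) -> List[str]:
    """Remove the least significant regressor until all are at or below the cut-off."""
    selected = list(candidates)
    while selected:
        # Full model = current model; reduced = current model minus one regressor.
        worst_p, worst = max(
            (p_value(selected, [r for r in selected if r != c]), c) for c in selected
        )
        if worst_p <= cutoff:
            break
        selected.remove(worst)
    return selected


# Toy stand-in for a full-vs-reduced test: each regressor has a fixed p-value.
toy = {"h1": 0.001, "h2": 0.2, "h3": 0.03}


def toy_p(full: List[str], reduced: List[str]) -> float:
    (tested,) = set(full) - set(reduced)
    return toy[tested]


print(forward_selection(list(toy), toy_p, 0.05))
print(backward_elimination(list(toy), toy_p, 0.05))
```

In a real analysis the p-value of a regressor depends on which other regressors are in the model, so the two procedures need not arrive at the same final model.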

### Fixed Covariates and the Regression Model

Sometimes it is desired to observe the effects of one or more binary,
continuous, or categorical variables, otherwise known as
“covariates”. These covariates, or first-order interactions between
covariates, may be influencing the dependent variable
response. Covariates (see *Specifying Fixed Covariates* for how to specify
these) may be included in Haplotype Trend Regression in two ways:

- They may be added to the regression as additional variables. Once you have specified at least one covariate, go to the **Fixed Covariate Options** and check *Include with a full-model-only regression* to do this. Stepwise regression may prove useful for this type of analysis.
- They may be “corrected for” by using them in a reduced model. Once you have specified at least one covariate, go to the **Fixed Covariate Options** and check *Use as the reduced model for a full-vs.-reduced regression* to do this. Correcting for the covariates allows the user to see specifically what effects there are from the remaining variables (haplotype frequencies). To do this, first a linear regression equation, which includes only the dependent and the reduced model covariates, is calculated (the “reduced model”). Next, a linear regression which includes all of the variables (including the haplotype frequencies) is calculated (the “full model”). The significance of the full versus the reduced model is calculated with an F-test. See *Full Versus Reduced Model Regression Equation* for more information.
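
For orientation, the usual textbook form of the full-versus-reduced F statistic for linear regression is given here. The notation is assumed for illustration and may differ from that used in *Full Versus Reduced Model Regression Equation*.

$$
F = \frac{\left(\mathrm{SSE}_R - \mathrm{SSE}_F\right) / \left(k_F - k_R\right)}{\mathrm{SSE}_F / \left(n - k_F - 1\right)}
$$

Here $\mathrm{SSE}_R$ and $\mathrm{SSE}_F$ are the residual sums of squares of the reduced and full models, $k_R$ and $k_F$ are their numbers of regressors (not counting the intercept), and $n$ is the sample size. Under the null hypothesis that the added regressors have no effect, $F$ follows an F distribution with $k_F - k_R$ and $n - k_F - 1$ degrees of freedom.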

Note

The **Fixed Covariate Options** box will be disabled until
you have specified at least one covariate.

### Specifying Fixed Covariates

If there are binary or numeric columns in the spreadsheet in addition
to the column selected as the dependent variable, and/or there are
categorical columns in the spreadsheet, these will be selectable in
the **fixed covariates** section. To include a covariate in the
analysis, click on the Add Covariate button. This will open a dialog
allowing you to select the covariate(s) to use in the regression
equation. Then, select the covariate(s) to include and click
**Add**. If you would like to add all of the covariates in the list,
click **Add All**. The selected covariates will be shown in the “Fixed
Covariates” list. To remove a covariate, select the covariate(s) to
remove, and click **Remove Selected**. This will remove the item from
the “Fixed Covariates” list and from the regression equation. To
remove all covariates click **Clear List**.

To include first-order interactions, click the **Add Interaction**
button. This will open a dialog which displays two lists, each
containing all of the non-genotypic covariate column names within the
spreadsheet. Select the term(s) from each of the two lists which you
would like to include and click **Add**. All selected items from the
list on the left will be paired with all the selected items from the
list on the right, and an item for each pair will be added to the
“Fixed Covariates” list. If any of the selected items in either window
represent categorical columns, then sub-items representing the dummy
variables used in regression for each category will be paired with the
items or sub-items from the other window. (Values from each pair are
multiplied to create a “new” covariate, which is then used in the
regression equation.)
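
The pairing and multiplication described above can be sketched in Python as follows. The helper names, the example columns, and the choice of the alphabetically first category as the reference level for the dummy variables are assumptions for illustration; SVS may code its dummy variables differently.

```python
from typing import Dict, List


def dummy_code(values: List[str]) -> Dict[str, List[float]]:
    """One 0/1 dummy column per category, omitting the first (reference) level."""
    levels = sorted(set(values))
    return {lvl: [1.0 if v == lvl else 0.0 for v in values] for lvl in levels[1:]}


def interaction(a: List[float], b: List[float]) -> List[float]:
    """Element-wise product of two covariate columns."""
    return [x * y for x, y in zip(a, b)]


# Hypothetical covariates for four samples.
sex = ["F", "M", "M", "F"]
age = [30.0, 40.0, 50.0, 60.0]

# Pair every dummy variable of the categorical column with the numeric column.
terms = {
    f"sex={level} * age": interaction(column, age)
    for level, column in dummy_code(sex).items()
}
print(terms)
```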

When you have added all of the interactions, click **Close** to return
to the Haplotype Trend Regression window. All listed interactions will
be included in the analysis, so unwanted interactions must be removed
in order to exclude them. To remove an interaction, select the
item(s) to remove and click **Remove Interaction**.

Once you have selected the covariates and interactions you wish to
include in the regression, go back to the **Fixed Covariate Options**
box, which will now be enabled, to select how the covariates will be
used in the regression (see *Fixed Covariates and the Regression Model*).

### Missing Values

All samples containing missing values in the dependent variable or in any fixed covariates will be dropped from the analysis.

Additionally, if **Impute missing values for haplotypes** is NOT
checked in the **Haplotype Estimation Options** section, all samples
containing missing values in any of the genotypes of the haplotype
block currently being analyzed will also be dropped from the analysis.
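
The two rules above amount to a per-sample, per-block filter. A minimal Python sketch follows; the function name `keep_sample` and the use of `None` to represent a missing value are assumptions for illustration.

```python
from typing import List, Optional


def keep_sample(
    dependent: Optional[float],
    covariates: List[Optional[float]],
    block_genotypes: List[Optional[str]],
    impute_missing: bool,
) -> bool:
    """Decide whether a sample takes part in the regression for one block."""
    # A missing dependent value or fixed covariate always drops the sample.
    if dependent is None or any(v is None for v in covariates):
        return False
    # Missing genotypes drop the sample only when imputation is switched off.
    if not impute_missing and any(g is None for g in block_genotypes):
        return False
    return True


print(keep_sample(1.2, [30.0], ["A_A", None], impute_missing=True))   # kept
print(keep_sample(1.2, [30.0], ["A_A", None], impute_missing=False))  # dropped
```

Because the genotype check is made per block, a sample dropped for one block may still be used for another.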

### Residual Spreadsheet

A residual spreadsheet is available if you have checked **Use all
markers as single block** to get one regression from all markers.

The residual spreadsheet (see Figure *Haplotype Trend Regression Residual Spreadsheet*) will contain
the actual, predicted, and residual values of the dependent variable
for each sample. The residual value of a sample is defined as the
difference between the sample’s actual value and its predicted value
from the regression. In addition, the haplotype frequencies for the
individual samples will be output for those haplotypes that were used
in the regression.

To obtain a residual spreadsheet, which will be in addition to the
results view from the regression, check **Output Residual
spreadsheet** in the **Advanced Parameters** tab.

Note

This option is available only when you select **Use all
markers as single block**.

### Additional Outputs

To enable the most utility from your regression results, some convenient derivative statistics can be computed on your p-values.

- *Output -log10(P):* Computes the $-\log_{10}$ value for each p-value and for each multiple-testing-corrected p-value.
- *Output data for P-P/Q-Q plots:* Computes the expected value for each p-value and for each multiple-testing-corrected p-value. By plotting the expected vs. actual p-values, you can create P-P or Q-Q plots. This option forces the **-log10(P)** output as well.
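
As a rough illustration of these derived statistics, the Python sketch below computes $-\log_{10}$ values and expected p-values for a Q-Q plot. It assumes the expected value of the $i$-th smallest of $n$ p-values is $i/(n+1)$, the mean of the corresponding uniform order statistic; the formula SVS actually uses is not stated here and may differ. The function name `qq_data` is hypothetical.

```python
import math
from typing import List, Tuple


def qq_data(p_values: List[float]) -> List[Tuple[float, float]]:
    """Return (observed, expected) -log10 p-values, in the input order.

    All p-values must be greater than zero.
    """
    n = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    expected = [0.0] * n
    for rank, i in enumerate(order, start=1):
        expected[i] = rank / (n + 1)  # assumed expected p-value for this rank
    return [(-math.log10(p), -math.log10(e)) for p, e in zip(p_values, expected)]


for observed, expect in qq_data([0.01, 0.5, 0.1]):
    print(f"observed={observed:.3f} expected={expect:.3f}")
```

Plotting the expected values on one axis against the observed values on the other gives the Q-Q plot; points well above the diagonal indicate p-values smaller than chance would predict.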

Note

**Output data for P-P/Q-Q plots** is only available if you
have selected **Use precomputed blocks** or **Use a moving window
of markers**.

### Multiple Testing Correction

It may be possible to obtain a good test statistic by chance alone. Multiple testing corrections are designed to help ensure, if possible, this is not the case. You may optionally select one or more of the following multiple testing corrections.

- **Bonferroni adjustment (on N haplotype blocks):** Multiplies p-values by $N$, but does not allow the result to be more than 1.
- **False Discovery Rate (FDR):** A less severe adjustment. See *False Discovery Rate* for an explanation of this algorithm.
- **Single value permutations** and/or **Full scan permutations:** Permutation testing of the linear and logistic regression models permutes the dependent variable, then runs the regressions over again, checking the significance of these regressions. This is distinct from checking the “fit” of the permuted dependent to the original regression results from a given set of regressors. The object is to see whether, by chance, a different set of dependents could have had a better relationship or “fit” with the covariates and regressors. This is tested by performing a new regression for each permutation.

See *Permutation Testing Methodology* for a more detailed explanation and examples of permutation testing.
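
The first two corrections can be sketched in Python as below. The Bonferroni function follows the description above directly. The FDR function uses the Benjamini-Hochberg step-up procedure, which is an assumption: it is a common FDR method, but *False Discovery Rate* should be consulted for the algorithm SVS actually applies.

```python
from typing import List


def bonferroni(p_values: List[float]) -> List[float]:
    """Multiply each p-value by the number of tests N, capping at 1."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


p = [0.01, 0.04, 0.03, 0.5]
print(bonferroni(p))
print(benjamini_hochberg(p))
```

For the same input, the FDR-adjusted values are never larger than the Bonferroni-adjusted ones, which is the sense in which FDR is the less severe adjustment.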

Note

If you have selected **Use all markers as single block**,
the only available correction will be **Single value
permutations**.

### Viewing Detailed Results

If you have selected **Use all markers as single block**, detailed
results (*Regression Statistics Results Viewer*) will always be shown for the regression.

Otherwise, if you have selected **Use precomputed blocks** or **Use a
moving window of markers**, the only outputs shown will be in rows of
the results spreadsheet (*Haplotype Trend Regression Results Spreadsheet*), unless you specify a
criterion for which regressions you would like to see detailed
results. To see detailed results for some regressions, check **Output
detailed results if...** on the Advanced Parameters tab and set the
desired criterion. This criterion comes in three parts:

- Value to use. These may be:
  - *P-Value Full vs. Reduced Model* (available and default for full vs. reduced testing)
  - *-log 10 P-Value FullvsRed Model* (available for full vs. reduced testing)
  - *P-Value Full Model* (default for full-model-only testing)
  - *-log 10 P-Value Full Model*
  - *R Squared Full Model* (available for linear haplotype trend regression)
- Type of comparison. This may be “<” (default), “<=” ($\le$), “>”, or “>=” ($\ge$).
- Threshold. (Defaults to 0.05.)

Detailed output will be generated for those regressions for which the criterion is true. These will all be placed into a single detailed output viewer.
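
The three-part criterion can be thought of as a simple predicate applied to each regression. The Python sketch below illustrates this; the function name `wants_detail` is hypothetical, while the defaults mirror those listed above.

```python
import operator

# The four comparison types offered by the criterion.
COMPARISONS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def wants_detail(value: float, comparison: str = "<", threshold: float = 0.05) -> bool:
    """True if a regression's chosen value satisfies the criterion."""
    return COMPARISONS[comparison](value, threshold)


# With the defaults, only regressions with p < 0.05 get detailed output.
print([p for p in [0.2, 0.01, 0.05, 0.003] if wants_detail(p)])
# For R squared, a ">" or ">=" comparison is the natural choice.
print(wants_detail(0.8, ">=", 0.5))
```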

## Running Haplotype Trend Regression

Click **Run** to start the analysis procedure.

Note

Sometimes a regression may fail due to insufficient rank in the coefficient matrix. This can result from too few observations or from the inclusion of “collinear” regressors. A collinear regressor is one which is a linear combination of one or more other regressors.
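
To make the idea of insufficient rank concrete, the Python sketch below computes the rank of a small design matrix by Gaussian elimination. The matrix values are made up for illustration: each row holds an intercept column followed by one sample's probabilities for three haplotypes. Because those probabilities sum to one for every sample, the three haplotype columns add up to the intercept column and the matrix is rank deficient.

```python
from typing import List


def matrix_rank(rows: List[List[float]], tol: float = 1e-10) -> int:
    """Rank of a matrix via Gaussian elimination with partial pivoting."""
    m = [[float(v) for v in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        if rank == len(m):
            break
        # Choose the row with the largest entry in this column as the pivot.
        pivot = max(range(rank, len(m)), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            continue  # no usable pivot: this column depends on earlier ones
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
    return rank


# Columns: intercept, h1, h2, h3 (hypothetical haplotype probabilities).
design = [
    [1, 1.0, 0.0, 0.0],
    [1, 0.5, 0.5, 0.0],
    [1, 0.0, 0.5, 0.5],
    [1, 0.0, 0.0, 1.0],
    [1, 0.5, 0.0, 0.5],
]
print(matrix_rank(design))                         # less than the 4 columns
print(matrix_rank([row[:3] for row in design]))    # full rank once h3 is dropped
```

Dropping one haplotype column restores full rank in this example.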

## Haplotype Trend Regression Outputs

### Residual Spreadsheet

If a residual spreadsheet is produced (see Figure *Haplotype Trend Regression Residual Spreadsheet*),
it will contain the actual, predicted, and residual values of the
dependent variable for each sample. The residual value of a sample is
defined as the difference between the sample’s actual value and its
predicted value from the regression. Additionally, any fixed
covariates that were specified for the regression are output, followed
by the haplotype frequencies by individual sample for those haplotypes
that were used in the regression.

Note

Strictly speaking, residuals do not make as much sense for logistic regression as they do for linear regression because the distribution of a logistic regression residual separates into two parts. However, this spreadsheet may be used as a crude gauge of how well the regression model predicts the observed values of the dependent variable.

### Regression Results Spreadsheet

If you checked **Use precomputed blocks** or **Use a moving window of
markers**, a spreadsheet of regression results will be output (see
Figure *Haplotype Trend Regression Results Spreadsheet*). The rows of this spreadsheet correspond to
haplotype blocks used. The row label will correspond to the first
marker in the block.

Note

Detailed results for any interesting haplotype blocks can either be
found in the Regression Statistics Results Viewer, if the p-value
or other selected value meets the specified criterion (see
*Viewing Detailed Results*), or by inactivating all genotypic markers
other than those in the block of interest, then running Haplotype
Trend Regression with **Use all markers as single block** and any
fixed covariates used originally.

### Regression Statistics Results Viewer

A Regression Statistics Results Viewer (see Figure
*Linear Haplotype Trend Regression Statistics Results Viewer*) will be displayed for any regression if **Use all
markers as single block** was selected or if **Output detailed results
if...** on the Advanced Parameters tab was selected and the regression
meets the selected criterion (*Viewing Detailed Results*).

### Linear Haplotype Trend Regression Statistics

The output viewable in the Regression Statistics Results Viewer is described below.

**Full Model Only Regression**

If only a full model was used for the regression equation, the following model statistics are displayed for both normal and stepwise regression:

- Name of the response variable.
- Unsigned multiple correlation coefficient $R$, where $R = \sqrt{R^2}$.
- Coefficient of determination $R^2$.
- Adjusted $R^2$. This statistic is meant to compensate for many regressors, each explaining small portions of the variation by chance alone.
- Sample size.
- Residual standard error.
- Unbiased standard deviation of the response.
- Value of the F-statistic.
- P-value of the F-statistic for the regression model.
- Single-value permuted p-value, if single-value permutation testing was selected.
- Full-scan permuted p-value, if full-scan permutation testing was selected.
- Number of permutations, if permutation testing was selected.
- Regression degrees of freedom.
- Residual degrees of freedom.
- Total degrees of freedom.
- Y-intercept.

**Full Versus Reduced Model Regression**

If a full versus reduced model was used for the regression equation, the following model statistics are displayed for both normal and stepwise regression:

- Name of the response variable.
- Coefficient of determination $R^2$ for the full model.
- Coefficient of determination $R^2$ for the reduced model.
- Adjusted $R^2$ for the full model. This statistic is meant to compensate for many regressors, each explaining small portions of the variation by chance alone.
- Sample size.
- Residual standard error.
- Unbiased standard deviation of the response.
- Value of the F-statistic for the full model.
- Value of the F-statistic for the full versus reduced model.
- P-value of the F-statistic for the full regression model.
- P-value of the F-statistic for the full versus reduced regression model.
- Single-value permuted p-value, if single-value permutation testing was selected.
- Full-scan permuted p-value, if full-scan permutation testing was selected.
- Number of permutations, if permutation testing was selected.
- Regression degrees of freedom of the full model.
- Regression degrees of freedom of the reduced model.
- Residual degrees of freedom of the full model.
- Total degrees of freedom of the full model.
- Y-intercept of the full model.
- Y-intercept of the reduced model.

### Logistic Haplotype Trend Regression Model Statistics

**Full Model Only Regression**

If only a full model was used for the regression equation, the following model statistics are displayed for both normal and stepwise regression:

- Name of the response variable.
- Regression likelihood.
- Null model likelihood.
- Sample size.
- Value of the Chi-Squared ($\chi^2$) statistic.
- P-value of the Chi-Squared statistic for the regression model.
- Single-value permuted p-value, if single-value permutation testing was selected.
- Full-scan permuted p-value, if full-scan permutation testing was selected.
- Number of permutations, if permutation testing was selected.
- Regression degrees of freedom.
- Residual degrees of freedom.
- Total degrees of freedom.
- Intercept term $\beta_0$.
- $\beta_0$ standard error.

**Full Versus Reduced Model Regression**

If a full versus reduced model was used for the regression equation, the following model statistics are displayed for both normal and stepwise regression:

- Name of the response variable.
- Full model likelihood.
- Reduced model likelihood.
- Chi-squared ($\chi^2$) statistic of the full model.
- Chi-squared statistic of the full versus reduced model.
- P-value of the Chi-Squared statistic for the full regression model.
- P-value of the Chi-Squared statistic for the full versus reduced regression model.
- Single-value permuted p-value, if single-value permutation testing was selected.
- Full-scan permuted p-value, if full-scan permutation testing was selected.
- Number of permutations, if permutation testing was selected.
- Regression degrees of freedom of the full model.
- Regression degrees of freedom of the reduced model.
- Residual degrees of freedom of the full model.
- Total degrees of freedom of the full model.
- Intercept term $\beta_0$ for the full model.
- Standard error for $\beta_0$ for the full model.
- Intercept term $\beta_0$ for the reduced model.

### Markers Used

The markers used for generating the haplotypes (that is, the markers of the haplotype block being regressed) are shown.

### Linear Model Regressor Statistics

For both full-model-only and full versus reduced linear haplotype trend regressions, the Y-intercept for the full model is displayed. Also, for full versus reduced linear haplotype trend regression models, the Y-intercept for both the full and reduced models is displayed.

The following statistics are displayed for each regressor:

- Name (unless this is a fixed covariate or interaction term, this will be a haplotype identifier consisting of the alleles used from each marker)
- Coefficient
- Standard error
- T-statistic for adding this regressor
- P-value for adding this regressor
- Univariate fit p-value

### Logistic Model Regressor Statistics

For both full-model-only and full versus reduced logistic haplotype trend regressions, the intercept term $\beta_0$ for the full model is displayed. Also, for full versus reduced logistic haplotype trend regression models, $\beta_0$ for both the full and reduced models is displayed.

The following statistics are displayed for each regressor:

- Name (unless this is a fixed covariate or interaction term, this will be a haplotype identifier consisting of the alleles used from each marker)
- Coefficient
- Standard error
- P-value for adding this regressor
- Odds ratio
- Univariate fit p-value

The regression odds ratio for the coefficient $\beta_i$ is $e^{\beta_i}$. This odds ratio is interpreted as the ratio of the odds of the dependent being one (“true”) if the given regressor were increased by one unit to the odds of the dependent being one (“true”) when the given regressor has its current value.
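
The interpretation above can be checked numerically. The Python sketch below uses a one-regressor logistic model with made-up coefficients (`b0` and `b1` are hypothetical values) and confirms that raising the regressor by one unit multiplies the odds by the exponential of its coefficient.

```python
import math


def odds_ratio(coefficient: float) -> float:
    """Odds ratio for a one-unit increase in a logistic regressor."""
    return math.exp(coefficient)


def odds(intercept: float, coefficient: float, x: float) -> float:
    """Odds that the dependent is one under a one-regressor logistic model."""
    p = 1.0 / (1.0 + math.exp(-(intercept + coefficient * x)))
    return p / (1.0 - p)


b0, b1 = -1.0, 0.7  # hypothetical intercept and coefficient
ratio = odds(b0, b1, 3.0) / odds(b0, b1, 2.0)
print(ratio, odds_ratio(b1))  # the two values agree up to rounding
```

The ratio does not depend on the starting value of the regressor, which is why a single odds ratio can be reported per coefficient.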

### Left Out Regressors

This list will include all regressors excluded from the final model of a stepwise regression model.

### Haplotypes Used in the Regression

This will list the haplotypes considered for this regression because
of their being above the **Frequency threshold**. Also listed is the
overall frequency of each haplotype, along with whether it was
actually used for the regression. “Used for” here means considered as
a part of the full model, but not necessarily included in the final
model of a stepwise regression model.

## Caveats for Logistic Regression

Under some circumstances, the iteration procedure for the logistic regression algorithm will be unstable, and the regression may fail, even when the coefficient matrix has sufficient rank and significant regressors are included. Such a circumstance can arise when the regression algorithm tries to emulate a step function or otherwise tries to accommodate independent values for which the dependent variable is either exclusively 1 or exclusively 0.

If a stepwise regression model approach is used, similar circumstances resulting in instability may cause “paradoxical” phenomena such as:

- The final regression used to get the model statistics fails, even though it is “the same as” the last model tried in the stepwise regression algorithm. Actually, it is possible that a different order will be used for the regressors in the final model compared to the last model tried for stepwise regression. If the problem is highly unstable, the different order may be enough to cause failure.
- For some regressors, the p-value Pr(Chi) associated with dropping the regressor from the regression equals 1. This happens when the regression fails after removing the regressor. This is only possible for a regressor other than the last one added to the model.

The best workaround is to filter out the data causing such instabilities. If one covariate of a regression has a coefficient above 15 or 20 or below -15 or -20 and the regressors from a stepwise regression won’t regress directly, or if a certain covariate does not regress by itself, the data should be filtered. Consider making a row subset spreadsheet based on ranges of values of the covariates and performing the desired regression model on each. Alternatively, consider stepwise regression if not already applied to the model. If stepwise regression is failing, changing the method from forward selection to backwards elimination or vice versa could result in a solution.
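
One simple check for the “step function” situation described in this section is to test whether a single covariate perfectly separates the cases from the controls. The Python sketch below is an illustrative helper, not an SVS feature; the function name and example values are hypothetical, and it only detects separation by a single cut point on one covariate.

```python
from typing import List


def is_completely_separated(x: List[float], y: List[int]) -> bool:
    """True if some cut point on x splits the samples into all-0 and all-1 groups."""
    x0 = [xi for xi, yi in zip(x, y) if yi == 0]
    x1 = [xi for xi, yi in zip(x, y) if yi == 1]
    if not x0 or not x1:
        return False  # only one outcome present; nothing to separate
    return max(x0) < min(x1) or max(x1) < min(x0)


# Every control lies below every case: a logistic fit would chase a step function.
print(is_completely_separated([0.1, 0.2, 0.4, 0.7, 0.9], [0, 0, 0, 1, 1]))
# Overlapping values: no single cut point separates the outcomes.
print(is_completely_separated([0.1, 0.8, 0.4, 0.7, 0.9], [0, 0, 0, 1, 1]))
```

A covariate flagged by such a check is a candidate for the filtering or row-subsetting workaround described above.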