Haplotype Association Tests and Block Detection

Haplotype Association Tests

Haplotype Association Tests Overview


Haplotype Association Tests Window

The Haplotype Association Tests window offers a straightforward way of testing for association between the haplotype frequencies of marker blocks and a case/control status. Tests can be made against individual haplotypes inferred for a marker block, or against all significant haplotypes on a per-block basis.

Ways of Defining Haplotype Blocks

Golden Helix SVS has a few convenient ways of defining marker blocks to be used in the association test.

  • Use precomputed blocks
  • Use all markers as a single block
  • Use a moving window of markers

These are described in more detail below.

Use Precomputed Blocks

By selecting this option, you have complete control over the definition of marker blocks to be used in analysis. This option reads from an external spreadsheet and a given block definition column.

The spreadsheet with the block columns should have the same marker names along its row labels as are in the current spreadsheet as column headers. A block definition column should be a column of type Integer. Each row for the column specifies the block number the current row’s marker is a member of. It may have missing values to indicate that the marker in the current row is not in any block.
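The layout of a block definition column can be illustrated with a minimal Python sketch. It is not part of SVS; the marker names, the function name `blocks_from_column`, and the use of `None` for a missing value are assumptions made for illustration only.

```python
from collections import defaultdict

def blocks_from_column(marker_names, block_numbers):
    """Group marker names by block number.

    A value of None stands in for a missing value, meaning the marker
    in that row is not a member of any block.
    """
    blocks = defaultdict(list)
    for name, block in zip(marker_names, block_numbers):
        if block is not None:
            blocks[block].append(name)
    return dict(blocks)

# Hypothetical row labels (markers) and one Integer block definition column.
markers = ["rs1", "rs2", "rs3", "rs4", "rs5"]
block_column = [1, 1, None, 2, 2]

blocks = blocks_from_column(markers, block_column)
for number, members in sorted(blocks.items()):
    print(number, members)
```

In this example the third marker has a missing block number, so it is left out of every block.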

When you have selected a block spreadsheet, you must then choose which of the valid block definition columns from that spreadsheet you would like to use for analysis.

In a common workflow, you may wish to run the Haplotype Block Detection algorithm to produce a block spreadsheet with blocks defined algorithmically. Then open this dialog to select the resulting block spreadsheet to define the blocks to be used for Haplotype Analysis.

Use All Markers as Single Block

By selecting this option, the entire set of active markers for the current spreadsheet will be treated as a single block. This may be useful when investigating entire sets of markers produced as subset spreadsheets from an LD plot. See Using LD Plots for more information.

Use a Moving Window of Markers

By selecting this option, a set of blocks will be automatically generated based on parameters for a moving window. There are two options for the moving window: either a window of a fixed number of columns or, if a marker map is applied, a dynamic window size based on the base pair distance between markers.

  • Fixed window size: Specifies that a fixed number of markers should be used for the moving window.
  • Dynamic window size in base pairs: Specifies both the base pair distance in kilo-base pairs and the maximum size of the moving window, which together define which markers are considered to be within the window. The “kb” field defines the maximum distance in kilo-base pairs that the moving window will include, and the “max columns” field, if used, specifies the maximum number of columns within that distance to be included in the window. The window will not cross over chromosome boundaries as defined in the marker map. This option is only available for spreadsheets where a marker map has been applied.
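The two window modes can be sketched in Python as follows. This is only an illustration under stated assumptions, not the SVS implementation: the function names are hypothetical, the window is assumed to advance one marker at a time, and the distance is assumed to be measured from the first marker in the window.

```python
def fixed_windows(n_markers, size):
    """(start, end) index pairs, end exclusive, for a window of `size` markers."""
    return [(start, start + size) for start in range(n_markers - size + 1)]

def dynamic_windows(chroms, positions, kb, max_columns=None):
    """For each start marker, extend the window while the next marker is on
    the same chromosome, within `kb` kilo-base pairs of the start marker,
    and the optional column cap has not been reached."""
    limit = kb * 1000  # kilo-base pairs to base pairs
    n = len(positions)
    windows = []
    for start in range(n):
        end = start + 1
        while (end < n
               and chroms[end] == chroms[start]
               and positions[end] - positions[start] <= limit
               and (max_columns is None or end - start < max_columns)):
            end += 1
        windows.append((start, end))
    return windows

# Hypothetical marker map: chromosome and base pair position per marker.
chroms = ["1", "1", "1", "2", "2"]
positions = [1000, 3000, 9000, 500, 1500]

fixed = fixed_windows(len(positions), 3)
dynamic = dynamic_windows(chroms, positions, kb=5)
capped = dynamic_windows(chroms, positions, kb=5, max_columns=1)
```

Note how the window starting at the third marker stops at the chromosome boundary rather than continuing onto chromosome 2.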

Show Marker Names in Output

The checkbox Show marker names in output defaults to being checked.

  • Check this box to display, in the output spreadsheet, which markers make up the current haplotype block.
  • Uncheck this box if your spreadsheet is going to be very large and you wish to save the memory that this column will take up. (For instance, the output for 1 million haplotype blocks may take from 20 megabytes up to 100 megabytes or more.)

Association Tests Used with Haplotype Frequencies

The following statistical tests are available in comparing the significance of the association between the selected case/control dependent variable and the haplotype frequencies:

  • Chi-squared test
  • Odds ratio with 95% CI
  • Logistic regression

These statistics are applied in different ways based on whether the tests are done on a per-block or per-haplotype basis.

Tests Computed Per Haplotype vs. Per Block

There are two modes of computing haplotype association tests:

  • Calculate per haplotype: Specifies that for each marker block, the haplotype frequencies will be computed (see How Haplotype Frequencies are Computed) and for each haplotype above the frequency threshold, the selected tests will be computed to measure the association between each haplotype and the case/control dependent trait.
  • Calculate per block: Specifies that all (or all but one; see the Frequency threshold section of How Haplotype Frequencies are Computed) of the haplotypes above the frequency threshold are tested together in the association test. This measures the association of the haplotype block as a whole with the case/control trait.

Chi-Squared Test of Haplotype Association

When selected, the chi-square test sets up a 2 \times N contingency table comparing the haplotype frequencies for the cases vs. the controls. The values in the contingency table are based on the haplotype counts between cases and controls.

On a per-haplotype basis, the table is 2 \times 2 and the haplotype counts are computed by a summation of the values in a full frequency table (see Haplotype Tables). For each haplotype, the haplotype’s frequencies for cases and controls are individually summed up, multiplied by 2, and then placed in the first column of the contingency table. The frequencies of all the other haplotypes are then summed up into a single case count and a single control count, multiplied by 2, and placed in the second column of the contingency table. Given that h is the current haplotype and n is the summation of all haplotypes other than h, where there are N haplotypes total, the table is constructed as follows.

            Current Haplotype    Other Haplotypes
Case        h_{case}             n_{case}
Control     h_{control}          n_{control}

Where h_{trait} = 2 \displaystyle\sum_{i \in trait} h_i for each trait (case or control), summing the frequency h_i of the current haplotype over each sample i with that trait.

And n_{trait} = 2 \displaystyle\sum_{x=1}^{N}\sum_{i \in trait} I_{x \neq h}(x_i) for each trait (case or control), where x_i is the frequency of haplotype x in sample i with that trait and I_{x \neq h} is an indicator function that returns x_i when x \neq h and 0 when x = h.

On a per-block basis, the contingency table is constructed with N columns, one for each haplotype. Each column will be computed for the given haplotype h according to the above h_{trait} formula.

See (Pearson) Chi-Squared Test for details on how a chi-square statistic and p-value are computed from the contingency table.
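As a rough illustration of the per-haplotype table construction described above, the following Python sketch builds the 2 \times 2 table from per-sample haplotype frequencies and computes a Pearson chi-square statistic. It is a sketch under assumptions, not the SVS code: the data, the function names, and the assumption that each sample's frequencies sum to 1 are all hypothetical, and no continuity correction is applied.

```python
import math

def two_by_two(freq_rows, is_case, hap):
    """Return [[h_case, n_case], [h_control, n_control]] for haplotype `hap`.

    freq_rows holds one dict per sample mapping haplotype -> frequency.
    Each frequency is multiplied by 2, as in the formulas above.
    """
    table = [[0.0, 0.0], [0.0, 0.0]]
    for freqs, case in zip(freq_rows, is_case):
        row = 0 if case else 1
        for name, f in freqs.items():
            col = 0 if name == hap else 1
            table[row][col] += 2 * f
    return table

def pearson_chi2(table):
    """Pearson chi-square statistic for a contingency table of any size."""
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, r in enumerate(table):
        for j, observed in enumerate(r):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    return chi2

def p_value_1df(chi2):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))

# Hypothetical frequency table: two cases followed by two controls.
freq_rows = [{"AC": 1.0}, {"AC": 0.5, "GT": 0.5},
             {"GT": 1.0}, {"AC": 0.5, "GT": 0.5}]
is_case = [True, True, False, False]

table = two_by_two(freq_rows, is_case, "AC")
chi2 = pearson_chi2(table)
p = p_value_1df(chi2)
```

The same `pearson_chi2` function accepts a 2 \times N table, so a per-block table with one column per haplotype could be passed to it; the p-value would then need N - 1 degrees of freedom rather than 1.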

Odds Ratio Test of Haplotype Association

Odds ratio tests are only available on a per-haplotype basis.

When selecting the Odds Ratio Test, you will get odds ratios and the lower and upper 95% confidence bounds of the current haplotype versus the other haplotypes. The values used in the odds ratio computation are the same counts described in the above 2 \times 2 contingency table.


An odds ratio is generally considered significant if both the lower and the upper 95% confidence bounds are greater than one (or both less than one for an odds ratio less than one).

See the Formulas and Theories chapter for an explanation of this statistic (Section Odds Ratio with Confidence Limits).
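For orientation, an odds ratio and its 95% confidence bounds can be computed from a 2 \times 2 table as in the Python sketch below. It assumes the common log-based (Woolf) interval, which may differ from the formula SVS uses; the function name and the counts are hypothetical.

```python
import math

def odds_ratio_ci(table, z=1.96):
    """Odds ratio with a log-based (Woolf) confidence interval.

    table is [[h_case, n_case], [h_control, n_control]];
    z = 1.96 corresponds to a 95% interval.
    """
    (a, b), (c, d) = table
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper

# Hypothetical counts: current haplotype vs. all other haplotypes.
table = [[3.0, 1.0], [1.0, 3.0]]
odds_ratio, lower, upper = odds_ratio_ci(table)

# Significant only if the interval excludes one.
significant = lower > 1 or upper < 1
```

With counts this small the interval is wide and contains one, so the odds ratio would not be considered significant.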

Logistic Regression Test of Haplotype Association

If the Logistic Regression test is selected, a regression is performed with the case/control status as the response. The binary response, y, is fit to the given predictor variables x_i using logistic regression. The results include the regression p-value and the reportable intercepts.

In a per-haplotype test, there is a single predictor variable, constructed from the haplotype frequency found in the haplotype frequency table (see Haplotype Tables) for the given haplotype. In addition to the p-value for the regression, the \beta_0 and \beta_1 terms and their respective standard errors are reported in their own columns.

When computed per-block, a single logistic regression is done with N predictor variables, where N is the number of haplotypes being used for the block-mode regression. The intercept for the solved regression is reported along with the p-value.

See the Frequency threshold section of How Haplotype Frequencies are Computed regarding which haplotypes will be used for block-mode regression.

See Logistic Regression for more details on how the logistic regression is performed and how the resulting p-values are formed.
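A per-haplotype regression with a single predictor can be sketched in plain Python as below. This is an illustrative sketch only: the Newton-Raphson fit, the likelihood ratio p-value, the function names, and the data are assumptions, and SVS may compute its p-values differently.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score_and_information(b0, b1, x, y):
    """Gradient of the log-likelihood and entries of the information matrix."""
    g0 = g1 = i00 = i01 = i11 = 0.0
    for xi, yi in zip(x, y):
        p = sigmoid(b0 + b1 * xi)
        w = p * (1.0 - p)
        g0 += yi - p
        g1 += (yi - p) * xi
        i00 += w
        i01 += w * xi
        i11 += w * xi * xi
    return g0, g1, i00, i01, i11

def fit_logistic(x, y, iterations=25):
    """Fit P(y=1) = sigmoid(b0 + b1*x) by Newton-Raphson."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0, g1, i00, i01, i11 = score_and_information(b0, b1, x, y)
        det = i00 * i11 - i01 * i01
        b0 += (i11 * g0 - i01 * g1) / det
        b1 += (i00 * g1 - i01 * g0) / det
    # Standard errors from the inverse information matrix at the solution.
    _, _, i00, i01, i11 = score_and_information(b0, b1, x, y)
    det = i00 * i11 - i01 * i01
    return b0, b1, math.sqrt(i11 / det), math.sqrt(i00 / det)

def log_likelihood(b0, b1, x, y):
    total = 0.0
    for xi, yi in zip(x, y):
        p = sigmoid(b0 + b1 * xi)
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total

# Hypothetical data: per-sample haplotype frequency and case (1) / control (0).
x = [0, 0, 0, 0, 0.5, 0.5, 0.5, 0.5, 1, 1, 1, 1]
y = [0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1]

b0, b1, se0, se1 = fit_logistic(x, y)
# Likelihood ratio test against the intercept-only model (1 degree of freedom).
p_bar = sum(y) / len(y)
null_ll = sum(y) * math.log(p_bar) + (len(y) - sum(y)) * math.log(1 - p_bar)
lr = 2 * (log_likelihood(b0, b1, x, y) - null_ll)
p_value = math.erfc(math.sqrt(lr / 2))
```

Here `b0` and `b1` play the role of the \beta_0 and \beta_1 terms, and `se0` and `se1` are their standard errors. A per-block regression would use one predictor per haplotype and correspondingly more degrees of freedom.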

How Haplotype Frequencies are Computed

Because the phase of the genotypic information in genotypic markers is not known, haplotype frequencies must be estimated using statistical methods. Although the estimation algorithms may find many potential haplotypes, there are usually only a handful with significant frequencies in a given block of markers.

The Frequency threshold parameter is used

  1. to consider only haplotypes with an estimated frequency above the threshold, so as to reduce the number of variables being considered in association tests, and
  2. in the case of block-mode testing, to decide whether to exclude, in addition to the haplotypes below the threshold, the haplotype with the lowest frequency above the threshold, in order to help prevent multicollinearity. Specifically, the total sum of frequencies of the haplotypes not considered for block-mode testing must be greater than the Frequency threshold.
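One reading of the two rules above can be sketched in Python as follows. The function name, the haplotype labels, and the frequencies are hypothetical, and the sketch reflects an interpretation of the rule rather than the exact SVS logic.

```python
def select_haplotypes(freqs, threshold, block_mode=False):
    """Return the haplotypes that enter the association test.

    Rule 1: keep only haplotypes with frequency above the threshold.
    Rule 2 (block mode): if the excluded haplotypes together do not exceed
    the threshold, also drop the lowest-frequency haplotype that was kept.
    """
    kept = {h: f for h, f in freqs.items() if f > threshold}
    if block_mode and kept:
        excluded_total = sum(f for h, f in freqs.items() if h not in kept)
        if excluded_total <= threshold:
            lowest = min(kept, key=kept.get)
            del kept[lowest]
    return kept

# Hypothetical estimated frequencies for one block.
freqs = {"ACG": 0.55, "ATG": 0.30, "GTC": 0.14, "GCC": 0.01}

per_haplotype = select_haplotypes(freqs, threshold=0.05)
per_block = select_haplotypes(freqs, threshold=0.05, block_mode=True)

# When the rare haplotypes already sum past the threshold, nothing more is dropped.
other = {"A": 0.50, "B": 0.30, "C": 0.12, "D": 0.04, "E": 0.04}
per_block_other = select_haplotypes(other, threshold=0.05, block_mode=True)
```

In the first block only 0.01 of the total frequency falls below the threshold, so block mode also drops the least frequent retained haplotype; in the second block the excluded haplotypes already sum to more than the threshold.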

Both estimation methods (see paragraph below) allow for samples with missing genotypes to have their haplotype frequencies inferred. Select Impute missings to enable this algorithmic feature.

Currently, there are two methods for estimating haplotype frequencies (see the link below for details about the algorithms and their individual strengths). If you select the EM method, you must also provide the additional Maximum EM iterations and EM convergence tolerance parameters used by the algorithm.

See Haplotype Frequency Estimation Methods for more information on the details of each estimation algorithm.

Multiple Testing Correction

To account for the multiple testing problem, you can have additional output columns that relate to multiple testing correction computed for each selected association test.

Bonferroni and FDR multiple testing corrections as well as single value and full scan permutation tests can be applied to the chi-squared and regression p-value results.

See the Formulas and Theories chapter for an explanation of these correction methods in the False Discovery Rate and Permutation Testing Methodology sections.
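The two closed-form corrections can be illustrated with the short Python sketch below. It assumes the FDR correction is the Benjamini-Hochberg procedure, which is a common choice but is not confirmed here; the function names and p-values are hypothetical, and permutation testing is not shown.

```python
def bonferroni(pvals):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values, one per haplotype or block tested.
pvals = [0.01, 0.04, 0.03, 0.5]
bonf = bonferroni(pvals)
fdr = benjamini_hochberg(pvals)
```

The Bonferroni values are never smaller than the FDR values for the same input, reflecting that Bonferroni is the more conservative correction.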

Additional Outputs

To enable the most utility from your association test results, some convenient derivative statistics can be computed on your p-values.

  • Output -log10(P): Computes the value -log_{10}(\textit{p-value}) for each p-value and multiple testing corrected p-value.
  • Output data for P-P/Q-Q plots: Computes expected values for each p-value and their multiple testing corrected p-value. By plotting the expected vs. actual P values, you can create P-P or Q-Q plots. This option forces the -log10(P) output as well.
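These derived columns can be approximated with the Python sketch below. The expected p-value is taken here as rank / (n + 1) under a uniform null distribution, which is one common convention and an assumption about what SVS outputs; the function names and p-values are hypothetical.

```python
import math

def neg_log10(pvals):
    """-log10 of each p-value."""
    return [-math.log10(p) for p in pvals]

def expected_pvalues(pvals):
    """Expected p-value for each observed one, assuming a uniform null.

    The smallest observed p-value gets 1/(n+1), the next 2/(n+1), and so on.
    Results are returned in the input order.
    """
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    expected = [0.0] * n
    for rank, i in enumerate(order, start=1):
        expected[i] = rank / (n + 1)
    return expected

# Hypothetical p-values from an association test.
pvals = [0.2, 0.001, 0.05]
observed_log = neg_log10(pvals)
expected = expected_pvalues(pvals)
expected_log = neg_log10(expected)

# Pairs to plot: expected vs. observed for P-P, and their -log10 values for Q-Q.
pp_points = list(zip(expected, pvals))
qq_points = list(zip(expected_log, observed_log))
```

Points that rise well above the diagonal in the -log10 plot indicate p-values smaller than expected by chance.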

Also, if you select Haplotype frequencies when doing a per-haplotype test, columns for the overall and case/control frequencies of the haplotype tested in each row of the results spreadsheet will be listed beside the description of the haplotype.

Haplotype Association Tests Results

The results spreadsheet for the haplotype association tests will start with the column containing block numbers. If you selected pre-computed blocks, these are the block numbers for the blocks that were used in the association tests. If you generated blocks dynamically, these numbers reflect those generated blocks. Because of this column, you can select this spreadsheet from an LD plot (see Haplotype Block Sets and LD Graphs) as a marker block spreadsheet to visualize the blocks on an LD graph.

When performing per-haplotype tests, each row represents the tests for a given haplotype. The second column (third column if you have checked Show marker names in output) will contain a string representation of the haplotype that was tested.

Haplotype Block Detection


Haplotype Block Detection Window


Haplotype Block Detection Window - Tab2

When studying haplotypes in the human genome, it is often desirable to try to define the location of haplotype blocks: “sizable regions over which there is little evidence for historical recombination and within which only a few common haplotypes are observed” [Gabriel2002].

With the Haplotype Block Detection dialog, you have the ability to detect haplotype blocks in your SNP data using the method described in the defining paper by Gabriel et al. [Gabriel2002].

Block Detection Options

In the Block Definition Algorithm group of options, you can set parameters to control the “Gabriel et al.” algorithm. The defaults specified are the ones found in the paper, which were suggested by empirical results and biological intuition.

You can also specify in the General Options a MAF threshold under which markers are ignored in the context of the algorithm. The maximum number of markers in a block and the maximum length of a block in kilo-base pairs define upper bounds of a block’s size using two different metrics for size.

Haplotype Estimation Options

As in Haplotype Association Tests, you can select the haplotype estimation method and the options that define its behavior for estimating haplotype frequencies.

For block detection, the frequencies are used to compute D' statistics and one-sided confidence bounds for the D'.
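To make the role of the frequencies concrete, the Python sketch below computes D' for a pair of biallelic markers from their four two-marker haplotype frequencies, using the standard definition of D'. The function name and frequencies are hypothetical, both markers are assumed to be polymorphic, and the confidence bounds are not computed here.

```python
def d_prime(f11, f12, f21, f22):
    """Signed D' for two biallelic markers.

    f11 is the frequency of the haplotype carrying the first allele at both
    markers, f12 and f21 carry the first allele at one marker only, and f22
    carries the second allele at both. The four frequencies should sum to 1.
    """
    p_a = f11 + f12          # first-allele frequency at marker 1
    p_b = f11 + f21          # first-allele frequency at marker 2
    d = f11 - p_a * p_b      # raw disequilibrium coefficient D
    if d == 0:
        return 0.0
    if d > 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return d / d_max

# Hypothetical estimated haplotype frequencies.
complete_ld = d_prime(0.5, 0.0, 0.0, 0.5)
no_ld = d_prime(0.25, 0.25, 0.25, 0.25)
partial_ld = d_prime(0.4, 0.1, 0.1, 0.4)
```

A value near 1 in absolute terms suggests little evidence of historical recombination between the two markers, while a value near 0 suggests the alleles are inherited independently.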

See How Haplotype Frequencies are Computed for more details.