# Haplotype Association Tests and Block Detection

## Haplotype Association Tests

### Haplotype Association Tests Overview

The Haplotype Association Tests window offers a straightforward way of testing for association between the haplotype frequencies of marker blocks and a case/control status. Tests can be made against individual haplotypes inferred for a marker block, or against all significant haplotypes on a per-block basis.

### Ways of Defining Haplotype Blocks

Golden Helix SVS has a few convenient ways of defining the marker blocks to be used in the association test.

- Use precomputed blocks
- Use all markers as a single block
- Use a moving window of markers

These are described in more detail below.

#### Use Precomputed Blocks

By selecting this option, you have complete control over the definition of marker blocks to be used in analysis. This option reads from an external spreadsheet and a given block definition column.

The spreadsheet with the block columns should have the same marker names
along its row labels as are in the current spreadsheet as column
headers. A block definition column should be a column of type
**Integer**. Each row for the column specifies the block number the
current row’s marker is a member of. It may have missing values to
indicate that the marker in the current row is not in any block.

When you have selected a block spreadsheet, you must then choose which of the valid block definition columns from that spreadsheet you would like to use for analysis.

In a common workflow, you may wish to run the *Haplotype Block Detection* algorithm
to produce a block spreadsheet with blocks defined algorithmically. Then
open this dialog to select the resulting block spreadsheet to define the
blocks to be used for Haplotype Analysis.

#### Use All Markers as Single Block

By selecting this option, the entire set of active markers for the
current spreadsheet will be treated as a single block. This may be
useful when investigating entire sets of markers produced as subset
spreadsheets from an LD plot. See *Using LD Plots* for more information.

#### Use a Moving Window of Markers

By selecting this option, a set of blocks will be automatically generated based on parameters for a moving window. There are two options for the moving window: either a fixed number of columns or, if a marker map is applied, a dynamic window size based on the base-pair distance between markers.

- *Fixed window size:* Specifies that a fixed number of markers should be used for the moving window.
- *Dynamic window size in base pairs:* Specifies both the genetic distance in kilo-base pairs and the maximum size of the moving window. Together these define which markers are considered to be within the window. The “kb” field defines the maximum genetic distance in kilo-base pairs that the moving window will include, and the “max columns” field, if used, specifies the maximum number of columns within the specified genetic distance to be included in the window. The window will not cross over chromosome boundaries as defined in the marker map. This option is only available for spreadsheets where a marker map has been applied.
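As a rough illustration of the two window modes, the sketch below generates blocks as lists of marker indices. The function names, the `(chromosome, position)` tuple layout, and the choice to advance the window one marker at a time are assumptions made for this example and are not taken from the SVS implementation.

```python
def fixed_windows(n_markers, size):
    """Return index lists for a window of `size` markers, moved one marker at a time."""
    return [list(range(start, start + size))
            for start in range(0, n_markers - size + 1)]


def dynamic_windows(markers, kb, max_columns=None):
    """Build one window per starting marker from a (hypothetical) marker map.

    markers: list of (chromosome, position_bp), sorted by chromosome, then position.
    A window grows while the next marker is on the same chromosome, lies within
    `kb` kilo-base pairs of the starting marker, and the window holds fewer than
    `max_columns` markers (when that limit is given).
    """
    windows = []
    for start, (chrom, pos) in enumerate(markers):
        window = [start]
        for j in range(start + 1, len(markers)):
            next_chrom, next_pos = markers[j]
            # Never cross a chromosome boundary or exceed the kb distance.
            if next_chrom != chrom or next_pos - pos > kb * 1000:
                break
            if max_columns is not None and len(window) >= max_columns:
                break
            window.append(j)
        windows.append(window)
    return windows
```

For example, `fixed_windows(5, 3)` yields three overlapping blocks of three markers each, while `dynamic_windows` yields shorter blocks wherever markers are sparse or a chromosome ends.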

#### Show Marker Names in Output

The checkbox **Show marker names in output** defaults to being
checked.

- Check this box to display, in the output spreadsheet, which markers make up the current haplotype block.
- Uncheck this box if your spreadsheet is going to be very large and you wish to save the memory that this column will take up. (For instance, the output for 1 million haplotype blocks may take from 20 megabytes up to 100 megabytes or more.)

### Association Tests Used with Haplotype Frequencies

The following statistical tests are available in comparing the significance of the association between the selected case/control dependent variable and the haplotype frequencies:

- Chi-squared test
- Odds ratio with 95% CI
- Logistic regression

These statistics are applied in different ways based on whether the tests are done on a per-block or per-haplotype basis.

#### Tests Computed Per Haplotype vs. Per Block

There are two modes of computing haplotype association tests:

- *Calculate per haplotype:* Specifies that for each marker block, the haplotype frequencies will be computed (see *How Haplotype Frequencies are Computed*) and, for each haplotype above the frequency threshold, the selected tests will be computed to measure the association between that haplotype and the case/control dependent trait.
- *Calculate per block:* Specifies that all of the haplotypes above the frequency threshold (or all but one; see the **Frequency threshold** section of *How Haplotype Frequencies are Computed*) are tested together in the association test. This measures the association of the haplotype block as a whole with the case/control trait.

#### Chi-Squared Test of Haplotype Association

When selected, the chi-square test sets up a
contingency table comparing the haplotype frequencies for the cases vs.
the controls. The values in the contingency table are based on the
haplotype *counts* between cases and controls.

On a per-haplotype basis, the haplotype counts are computed by a
summation of the values in a full frequency table (see *Haplotype Tables*).
For each haplotype, the haplotype’s frequencies for cases and controls
are individually summed up, multiplied by 2 and then placed in the first
column of the contingency table. The frequencies of all the other
haplotypes are then summed up into a single cases and control count,
multiplied by 2 and placed in the second column of the contingency
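table. Given $H_i$ is the current haplotype and $\bar{H}_i$ is the
summation of all haplotypes other than $H_i$, where there are $n$
haplotypes total, the table is constructed as follows.

| | Current Haplotype | Other Haplotypes |
|---|---|---|
| Case | $H_{i,\text{case}}$ | $\bar{H}_{i,\text{case}}$ |
| Control | $H_{i,\text{control}}$ | $\bar{H}_{i,\text{control}}$ |

Where $H_{i,t} = 2 \sum_{s} f_{s,i}$ for each $i$ and trait $t$ and each sample $s$ with that trait, with $f_{s,i}$ denoting the estimated frequency of the current haplotype $i$ in sample $s$.

And $\bar{H}_{i,t} = 2 \sum_{s} \sum_{j=1}^{n} f_{s,j}\, I(i,j)$ for each $i$ and trait $t$ and each sample $s$ with that trait, and where $I(i,j)$ is an indicator function that returns 0 when $i = j$ and 1 otherwise.

On a per-block basis, the contingency table is constructed with $n$ columns, one for each haplotype. Each column will be computed for the given haplotype according to the formula for $H_{i,t}$ above.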

See *(Pearson) Chi-Squared Test* for details on how a chi-square statistic and p-value
are computed from the contingency table.
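The per-haplotype table described above can be sketched as follows. This is a minimal illustration, not the SVS code: the function names and the list-of-lists layout for the per-sample frequency table are hypothetical, and the p-value assumes a plain Pearson statistic with one degree of freedom and no continuity correction.

```python
import math


def haplotype_contingency(freqs, is_case, hap):
    """Build the 2x2 table for one haplotype.

    freqs: per-sample lists of estimated haplotype frequencies (each sums to 1).
    is_case: per-sample booleans (True for cases, False for controls).
    hap: index of the current haplotype.
    Returns [[case_current, case_other], [control_current, control_other]],
    where every frequency sum is multiplied by 2 (two chromosomes per sample).
    """
    table = [[0.0, 0.0], [0.0, 0.0]]
    for row, case in zip(freqs, is_case):
        r = 0 if case else 1
        table[r][0] += 2 * row[hap]
        table[r][1] += 2 * (sum(row) - row[hap])
    return table


def chi_squared_2x2(table):
    """Pearson chi-squared statistic and p-value for a 2x2 table (1 df)."""
    row_totals = [sum(r) for r in table]
    col_totals = [table[0][c] + table[1][c] for c in range(2)]
    total = sum(row_totals)
    stat = 0.0
    for r in range(2):
        for c in range(2):
            expected = row_totals[r] * col_totals[c] / total
            stat += (table[r][c] - expected) ** 2 / expected
    # Chi-squared survival function with 1 df: P(X > stat) = erfc(sqrt(stat / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))
```

Because the counts are sums of estimated frequencies, the table entries are generally not integers.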

#### Odds Ratio Test of Haplotype Association

Odds ratio tests are only available on a per-haplotype basis.

When selecting the Odds Ratio Test, you will get odds ratios and the lower and upper 95% confidence bounds of the current haplotype versus the other haplotypes. The values used in the odds ratio computation are the same counts described in the above contingency table.

Note

An odds ratio is generally considered significant if both the lower and the upper 95% confidence bounds are greater than one (or both less than one for an odds ratio less than one).

See the Formulas and Theories chapter for an explanation of this
statistic (Section *Odds Ratio with Confidence Limits*).
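For orientation, the sketch below computes an odds ratio and its 95% confidence bounds from a 2x2 table of haplotype counts. It assumes the common log-based (Woolf) interval; the exact formula SVS uses is the one given in *Odds Ratio with Confidence Limits*, and the function name here is hypothetical.

```python
import math


def odds_ratio_ci(table, z=1.96):
    """Odds ratio with an approximate 95% confidence interval.

    table: [[case_current, case_other], [control_current, control_other]].
    Assumes all four cells are greater than zero.
    """
    (a, b), (c, d) = table
    odds_ratio = (a * d) / (b * c)
    # Standard error of ln(OR) under the log (Woolf) method.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper
```

With the table `[[30, 10], [20, 40]]`, the odds ratio is 6 and both bounds lie above one, which matches the significance rule in the note above.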

#### Logistic Regression Test of Haplotype Association

If the Logistic Regression test is selected, a regression is performed with the case/control status as the response. The binary response, $y$, is fit to the given predictor variables, $x_1, \ldots, x_k$, using logistic regression. The results include the regression p-value and the reportable intercepts.

In a per-haplotype test, there is the single predictor variable
constructed out of the haplotype frequency you would find in the
haplotype frequency table (see *Haplotype Tables*) for the given haplotype.
In addition to the p-value for the regression, a $\beta_0$ and a
$\beta_1$ term and their respective standard errors are reported
in their own columns.

When computed per block, a single logistic regression is done with $k$ predictor variables, where $k$ is the number of haplotypes being used for the block-mode regression. The intercept for the solved regression is reported along with the p-value.

See the **Frequency threshold** section of *How Haplotype Frequencies are Computed*
regarding which haplotypes will be used for block-mode regression.

See *Logistic Regression* for more details on how the logistic regression is
performed and how the resulting p-values are formed.
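To make the per-haplotype case concrete, here is a small pure-Python sketch that fits a logistic regression with one predictor by Newton-Raphson and forms a likelihood-ratio p-value against the intercept-only model. The fitting method, the likelihood-ratio test, and the function name are assumptions for illustration; the procedure SVS actually follows is the one described in *Logistic Regression*.

```python
import math


def logistic_single_predictor(x, y, iterations=25):
    """Fit logit(P(y = 1)) = b0 + b1 * x and return (b0, b1, p_value).

    x: per-sample predictor values (e.g. the haplotype frequency per sample).
    y: per-sample 0/1 responses. Assumes both classes occur, x is not
    constant, and the data are not perfectly separable.
    """
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p              # gradient of the log-likelihood
            g1 += (yi - p) * xi
            h00 += w                  # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information matrix) * step = gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det

    def log_likelihood(probs):
        return sum(math.log(p) if yi else math.log(1.0 - p)
                   for p, yi in zip(probs, y))

    fitted = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
    null = [sum(y) / len(y)] * len(y)
    lr_stat = 2.0 * (log_likelihood(fitted) - log_likelihood(null))
    # Chi-squared survival function with 1 df.
    p_value = math.erfc(math.sqrt(max(lr_stat, 0.0) / 2.0))
    return b0, b1, p_value
```

A block-mode regression would use several predictors at once, which requires a general matrix solve and is left out of this sketch.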

### How Haplotype Frequencies are Computed

Because the phase of the genotypic information in genotypic markers is not known, haplotype frequencies must be estimated using statistical methods. Although the estimation algorithms may find many potential haplotypes, there are usually only a handful with significant frequencies in a given block of markers.

The **Frequency threshold** parameter is used

- to consider only haplotypes with an estimated frequency above the threshold, so as to reduce the number of variables being considered in association tests, and
- in the case of block-mode testing, to decide whether to exclude not
only the haplotypes below the frequency threshold but also, in order
to help prevent multicollinearity, the haplotype with the lowest
frequency above the **Frequency threshold**. Specifically, the total sum of frequencies of the haplotypes not considered for block-mode testing must be greater than the **Frequency threshold**.
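One possible reading of the block-mode rule is sketched below: haplotypes at or below the threshold are dropped, and if the dropped frequencies do not sum to more than the threshold, the least frequent remaining haplotype is dropped as well. The function name and the dictionary input are hypothetical, and the handling of the boundary case is an interpretation of the text rather than documented SVS behavior.

```python
def select_block_haplotypes(freqs, threshold):
    """Choose the haplotypes to use as predictors in a block-mode test.

    freqs: dict mapping a haplotype string to its estimated frequency.
    """
    kept = {h: f for h, f in freqs.items() if f > threshold}
    excluded_total = sum(f for h, f in freqs.items() if h not in kept)
    # The excluded haplotypes must together exceed the threshold; otherwise
    # also drop the lowest-frequency haplotype that passed the threshold.
    if kept and excluded_total <= threshold:
        del kept[min(kept, key=kept.get)]
    return kept
```

Dropping one haplotype in this way leaves a reference category, which is the usual remedy when predictor frequencies would otherwise sum to nearly one for every sample.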

Both estimation methods (see paragraph below) allow for samples with
missing genotypes to have their haplotype frequencies inferred. Select
**Impute missings** to enable this algorithmic feature.

Currently, there are two methods for estimating haplotype frequencies
(see the link below for details about the algorithms and their
individual strengths). If you select the **EM** method, you must also
provide the additional **Maximum EM iterations** and **EM convergence
tolerance** parameters used by the algorithm.

See *Haplotype Frequency Estimation Methods* for more information on the details of each
estimation algorithm.
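To show where the **Maximum EM iterations** and **EM convergence tolerance** parameters come into play, the following is a textbook EM sketch for the simplest case of two biallelic markers. It is not the SVS algorithm: the function name, the 0/1/2 genotype coding, and the uniform starting frequencies are assumptions, and missing genotypes are not handled.

```python
def em_two_marker_haplotypes(genotypes, max_iterations=50, tolerance=1e-6):
    """Estimate frequencies of haplotypes [00, 01, 10, 11] for two markers.

    genotypes: list of (g1, g2), each the count (0, 1 or 2) of allele 1
    at the first and second marker for one sample.
    """
    freqs = [0.25, 0.25, 0.25, 0.25]
    n_chromosomes = 2 * len(genotypes)
    alleles = {0: (0, 0), 1: (0, 1), 2: (1, 1)}
    for _ in range(max_iterations):
        counts = [0.0, 0.0, 0.0, 0.0]
        for g1, g2 in genotypes:
            if g1 == 1 and g2 == 1:
                # E-step: a double heterozygote is either 00/11 or 01/10;
                # split it according to the current frequency estimates.
                cis = freqs[0] * freqs[3]
                trans = freqs[1] * freqs[2]
                w = cis / (cis + trans) if cis + trans > 0 else 0.5
                counts[0] += w
                counts[3] += w
                counts[1] += 1 - w
                counts[2] += 1 - w
            else:
                # Phase is unambiguous when at most one marker is heterozygous.
                for a, b in zip(alleles[g1], alleles[g2]):
                    counts[2 * a + b] += 1
        # M-step: new frequencies are the expected counts per chromosome.
        new_freqs = [c / n_chromosomes for c in counts]
        change = max(abs(n - o) for n, o in zip(new_freqs, freqs))
        freqs = new_freqs
        if change < tolerance:
            break
    return freqs
```

The loop stops either when the largest change in any frequency falls below the tolerance or when the iteration limit is reached, whichever comes first.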

### Multiple Testing Correction

To account for the multiple testing problem, you can have additional output columns that relate to multiple testing correction computed for each selected association test.

Bonferroni and FDR multiple testing corrections as well as single value and full scan permutation tests can be applied to the chi-squared and regression p-value results.

See the Formulas and Theories chapter for an explanation of these
correction methods in the *False Discovery Rate* and *Permutation Testing Methodology* sections.
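As a brief illustration of the two closed-form corrections, the sketch below adjusts a list of p-values. The FDR function assumes the Benjamini-Hochberg step-up procedure, which this section does not name explicitly; the function names are hypothetical. Permutation testing is not covered here.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def fdr_bh(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```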

### Additional Outputs

To enable the most utility from your association test results, some convenient derivative statistics can be computed on your p-values.

- *Output -log10(P):* Computes the $-\log_{10}(P)$ value for each p-value and each multiple-testing-corrected p-value.
- *Output data for P-P/Q-Q plots:* Computes expected values for each p-value and each multiple-testing-corrected p-value. By plotting the expected vs. actual p-values, you can create P-P or Q-Q plots. This option forces the **-log10(P)** output as well.
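These derived columns can be approximated as in the sketch below. The expected value for the p-value of rank `r` among `m` tests is taken to be `r / (m + 1)`, which is one common convention and an assumption here; the function names are hypothetical.

```python
import math


def neg_log10(p_values):
    """Return -log10(p) for each p-value (each must be greater than zero)."""
    return [-math.log10(p) for p in p_values]


def expected_p_values(p_values):
    """Expected p-value under the null for each observed p-value, by rank."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    expected = [0.0] * m
    for rank, i in enumerate(order, start=1):
        expected[i] = rank / (m + 1)
    return expected
```

Plotting `neg_log10(expected_p_values(p))` against `neg_log10(p)` gives the familiar Q-Q view, where points well above the diagonal suggest more small p-values than chance alone would produce.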

Also, if you select **Haplotype frequencies** when doing a per-haplotype
test, columns for the overall and case/control frequencies for the
haplotype responsible for the resulting row in the results spreadsheet
will be listed beside the description of the haplotype.

### Haplotype Association Tests Results

The results spreadsheet for the haplotype association tests will start
with the column containing block numbers. If you selected pre-computed
blocks, these are the block numbers for the blocks that were used in the
association tests. If you generated blocks dynamically, these numbers
reflect those generated blocks. Because of this column, you can select
this spreadsheet from an LD plot (see *Haplotype Block Sets and LD Graphs*) as a marker block
spreadsheet to visualize the blocks on an LD graph.

When performing per-haplotype tests, each row represents the tests for
a given haplotype. The second column (third column if you have checked
**Show marker names in output**) will contain a string representation
of the haplotype that was tested.

## Haplotype Block Detection

When studying haplotypes in the human genome, it is often desirable to try to define the location of haplotype blocks: “sizable regions over which there is little evidence for historical recombination and within which only a few common haplotypes are observed” [Gabriel2002].

With the **Haplotype Block Detection** dialog, you have the ability to
detect haplotype blocks in your SNP data using the method described in the
defining paper by Gabriel et al. [Gabriel2002].

### Block Detection Options

In the **Block Definition Algorithm** group of options, you can set
parameters to control the “Gabriel et al.” algorithm. The defaults
are the values given in the paper, which were suggested by
empirical results and biological intuition.

You can also specify in the **General Options** a MAF threshold under
which markers are ignored in the context of the algorithm. The maximum
number of markers in a block and the maximum length of a block in
kilo-base pairs define upper bounds of a block’s size using two
different metrics for size.

### Haplotype Estimation Options

As in *Haplotype Association Tests*, you can select the haplotype estimation
method and the options that define its behavior for estimating
haplotype frequencies.

For block detection, the frequencies are used to compute $D'$ statistics and one-sided confidence bounds for the $D'$.

See *How Haplotype Frequencies are Computed* for more details.
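The Gabriel et al. method classifies marker pairs using the linkage disequilibrium measure $D'$ and its confidence bounds. As a minimal sketch of how estimated two-marker haplotype frequencies turn into a $D'$ value, consider the function below. The function name and input layout are hypothetical, it returns the absolute value of $D'$, and the confidence-bound computation is not shown.

```python
def d_prime(freqs):
    """Compute |D'| from two-marker haplotype frequencies [p00, p01, p10, p11]."""
    p00, p01, p10, p11 = freqs
    p_a = p00 + p01   # frequency of allele 0 at the first marker
    p_b = p00 + p10   # frequency of allele 0 at the second marker
    d = p00 - p_a * p_b
    # Normalize D by its largest attainable magnitude given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return abs(d) / d_max if d_max > 0 else 0.0
```

A value near 1 indicates little evidence of historical recombination between the two markers, while a value near 0 indicates that the alleles are close to independent.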