Haplotype Frequency Estimation Methods

The alleles of multiple markers transmitted from one parent are called a haplotype. Haplotype analysis of safety and efficacy data can incorporate the information from multiple markers from the same gene or genes, which are physically close on a specific chromosome. Genotypic data from unrelated individuals do not contain information on which alleles were transmitted from each parent, but haplotype frequencies can be estimated using several existing in silico methodologies such as the Expectation Maximization (EM) algorithm and the Composite Haplotype Method (CHM).

About Haplotype Inference

A common type of genetic data is genetic information independently scored at several markers along a chromosome. Each subject has two copies of the same chromosome, one chromosome from the mother, the other one from the father. It is not clear which alleles at different markers reside on the maternal, and which are on the paternal copy of the chromosome. Only individuals that are heterozygous at most at a single marker can be resolved into a pair of haplotypes unambiguously. This problem, called “genetic phase uncertainty” is an example of a more general problem of statistical inference in the presence of missing data.

The expectation-maximization (EM) algorithm, formalized by [Dempster1977], is a popular iterative technique for obtaining maximum likelihood estimates of sample haplotype frequencies (see [Excoffier1995] for details of obtaining haplotype frequencies by EM).

Determining the probability of each haplotype for each sample from the overall haplotype probabilities is based on the computation method being used, and does not relate to any estimate of LD between any pair of markers involved.

CHM diplotype probabilities are based on the average of the probabilities of the two haplotypes, while EM diplotype probabilities are based on the product of the probabilities of the two haplotypes. The process of finding the EM diplotype probabilities is equivalent to the “expectation” step of each EM iteration.

Case/Control Association with Haplotype Frequencies

Consider the following simple example. Suppose we observe a sample of 6 individuals typed at two bi-allelic markers. The first three individuals are double homozygotes, AA/BB, contributing six gametes of the type AB to the sample counts. The next two are aa/bb, contributing four gametes of the type ab. The last individual, however, is a double heterozygote, Aa/Bb, and there are two possible pairs of gametes, AB with ab, or Ab with aB. Taking into account the first part of the sample, the EM algorithm assigns an extremely low probability to the second, (Ab with aB) arrangement, because these types of gametes have not been observed among unambiguous individuals at all. To make such a calculation, it is necessary to assume “Hardy Weinberg Equilibrium” (HWE) over the two markers, which means that the two-marker genotypes do not deviate from the products of the haplotype frequencies. On the other hand, the CHM method described below does not assume HWE and maintains that either of the two arrangements (AB with ab, or Ab with aB) is equally likely.

Certainly, there will be discrepancy between sample frequencies generated by the two methods. Our goal, however, is not simply to report, but to contrast haplotype frequencies between levels of phenotype. Assume for the moment two markers with two alleles and that the phenotype is binary, i.e. the subjects can be classified as either “cases” or “controls”. For simplicity, assume that the haplotype AB determines the probability of an individual being a “case” and individual alleles are unimportant. Then the expected allele frequencies (p_A ,p_B) are the same among cases and controls. We can therefore express the expected frequency of AB haplotypes in cases and controls as

\begin{array}{l c} Cases: & {p_{AB} = p_A p_B + D_1} \\ Controls: & {p_{AB} = p_A p_B + D_2} \\ \end{array}

where D_1, D_2 are linkage disequilibrium (LD) coefficients. These coefficients are needed to account for the part of gametic frequency not explained by frequencies of alleles, p_A,p_B. When we are comparing p_{AB} between cases and controls, we are essentially looking at the size of the difference between D_{1} and D_{2}. The EM algorithm attempts to obtain estimates of p_{AB} in cases and controls separately, assuming HWE in these groups. However, even if HWE holds in the population, it is expected not to hold in either cases or controls looked at separately [Nielsen1998], which may lower the power of the test. Moreover, cases and controls exhibit the deviation from HWE in different directions [Zaykin2000], which is exploited by CHM. The CHM estimates a quantity different from p_{AB}, specifically

\begin{array}{l c} Cases: & {\phi_{AB} = p_A p_B + D_1 + p_{A/B}} \\ Controls: & {\phi_{AB} = p_A p_B + D_2 + p_{A/B}} \\ \end{array}

The last term, p_{A/B}, is the frequency of A, B alleles that reside on two different gametes [Weir1996], in contrast to p_{AB}, that measures their joint frequency on the same gamete. This “intra-gametic” frequency can also be written as a product of A, B allele frequencies plus the deviation unexplained by the product. Generally, this deviation is not zero if the HWE at the haplotype level does not hold. It is also expected that this deviation is generally larger in cases than in controls (denote the difference by \delta), when cases are less prevalent category in a population. Therefore, when comparing composite frequencies \phi_{AB} between cases and controls, we are looking at the size of the difference D_1 - D_2 + \delta. The term \delta is zero only under the multiplicative model of haplotype action, in which case two methods are comparing the same quantities.

The quantities D_1, D_2, p_{AB}, and \delta are all unknown. The process of imputing LD involves first imputing two-marker haplotype frequencies by the chosen method, then imputing LD based on those imputed haplotype frequencies.

EM and CHM each have a distinct way of imputing haplotype frequencies, neither of which involve the values of D_1, D_2, p_{AB}, and \delta between any particular pair of markers. EM assumes diplotypes are products of haplotype frequencies (which is a consequence of HWE over the markers), and CHM assumes that any of the haplotype arrangements is equally likely (such as AB with ab or Ab with aB in the example described above) and will take the average of their probabilities. The respective algorithms are based on those respective assumptions.

The CHM LD method (Formulas for Computing Linkage Disequilibrium (LD)) internally takes a shortcut past the haplotype imputation code and just does the equivalent imputation process directly. For biallelic markers, it additionally compensates for lack of HWE in either or both of the two markers.

In practice, we have found that more often than not the EM and CHM result in essentially the same p-values for tests of association with the phenotype, although the CHM can be much more powerful under models that include large non-additive components. Compared to EM, it is a non-iterative and much faster algorithm, posing no danger of convergence to a local maximum.

Expectation Maximization (EM)

For haplotype inference with Single-Nucleotide Polymorphisms (SNPs), the EM algorithm is popular because it is easy to interpret and is stable. Compared to the Gibbs sampler, another method of estimating haplotype phases, the EM approach is a deterministic procedure. It also requires less computing time, and is easier to check for convergence. The output of the EM algorithm is the maximum-likelihood estimate (MLE), which possesses well-established statistical properties.


Because of the memory-space properties of the EM algorithm, the EM algorithm will perform poorly if at all with more than a dozen loci.

The expectation-maximization (EM) algorithm for haplotype imputation is an implementation of the method described in [Excoffier1995].

Composite Haplotype Method (CHM)

The CHM is based on the idea of the genotypic LD coefficient, \Delta_{AB}, [Weir1996]. Estimation of \Delta_{AB} involves calculation of di-genic frequencies. In the two-locus bi-allelic case, they are estimated as

\frac{1}n\eta_{AB} = \frac{2n_{AABB} + n_{AABb} + n_{AaBB} + n_{AaBb}/2}n,

where n_{AaBb}, for example, is the number of individuals with genotype Aa/Bb, and n is the sample size. The composite disequilibrium is defined as a sum of inter- and intra-gametic components,


Under random mating, P_{A/B}=p_Ap_B, and so assuming random mating, \Delta_{AB} is an unbiased estimate of the LD parameter, D_{AB}. Also, if we do not wish to separate the inter- and intra-gametic components, we may define


which is an observable quantity.

[Zaykin2001] extended the definition of di-genic frequencies to multiple loci and alleles. For the i-th individual multilocus genotype g_i, let H(g_i) be the number of single-locus heterozygotes in g_i. Define weights as


Sample composite haplotype counts are calculated from summing over individual contributions,

\eta_{abc,...}=\sum_{i=1}^nw(g_i)I(a,b,c,...\subset g_i),

where n is the sample size, and I(\cdot) is the indicator function, defined as

I(a,b,c,... \subset g_i) = \{
\begin{array}{l l}
1 & \mbox{if i-th individual genotype }g_i
\mbox{ has alleles }a,b,c,...\\
0 & \mbox{otherwise}\\

Thus, if the i-th individual has at least one copy of all required alleles, it is counted with weight w(g_{i}). The composite haplotype frequencies are given by

\rho_{abc...} = \frac{1}{2n}\eta_{abc...}.

Note that \rho_{abc...} includes both inter- and intra-gametic component frequencies.

In a two-locus, two-allele case, composite haplotype counts simplify to Weir’s definition, \eta_{AB}=2n_{AABB}+n_{AABb}+n_{AaBB}+n_{AaBb}/2. In a single-locus case, they are the usual definition of the allele count:

n_i = 2n_{ii}+\sum_{i\neq j}n_{ij}.