Haplotype Frequency Estimation Methods

The alleles of multiple markers transmitted together from one parent are called a haplotype. Haplotype analysis of safety and efficacy data can incorporate information from multiple markers within the same gene, or within genes that are physically close on a chromosome. Genotypic data from unrelated individuals do not show which alleles were transmitted from each parent, but haplotype frequencies can be estimated with several existing in silico methods, such as the Expectation-Maximization (EM) algorithm and the Composite Haplotype Method (CHM).


GenomeBrowse computes LD using the CHM only.

About Haplotype Inference

A common type of genetic data is genetic information scored independently at several markers along a chromosome. Each subject has two copies of the same chromosome, one from the mother and one from the father. It is not clear which alleles at different markers reside on the maternal copy and which on the paternal copy of the chromosome. Only individuals who are heterozygous at no more than one marker can be resolved unambiguously into a pair of haplotypes. This problem, called “genetic phase uncertainty”, is an example of a more general problem of statistical inference in the presence of missing data.
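To make the phase ambiguity concrete, the following minimal Python sketch enumerates the haplotype pairs that are consistent with one unphased genotype. The function name `compatible_haplotype_pairs` and the tuple-based genotype encoding are illustrative assumptions, not part of any particular software package.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """Enumerate the unordered haplotype pairs consistent with a genotype.

    genotype: list of (allele1, allele2) tuples, one tuple per marker.
    Returns a sorted list of (haplotype, haplotype) pairs.
    """
    pairs = set()
    # At every marker, choose which of the two alleles sits on haplotype 1;
    # the other allele then sits on haplotype 2.
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = tuple(g[c] for g, c in zip(genotype, choice))
        h2 = tuple(g[1 - c] for g, c in zip(genotype, choice))
        # Sorting the pair makes (h1, h2) and (h2, h1) count once.
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)


if __name__ == "__main__":
    # Heterozygous at one marker: a single resolution.
    print(compatible_haplotype_pairs([("A", "A"), ("B", "b")]))
    # Heterozygous at two markers: AB/ab or Ab/aB.
    print(compatible_haplotype_pairs([("A", "a"), ("B", "b")]))
```

In this sketch, an individual heterozygous at H markers (H at least 1) yields 2^(H-1) distinct pairs, which is why only genotypes with at most one heterozygous marker have a unique resolution.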

The expectation-maximization (EM) algorithm, formalized by [Dempster1977], is a popular iterative technique for obtaining maximum likelihood estimates of sample haplotype frequencies (see [Excoffier1995] for details of obtaining haplotype frequencies by EM).
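As an illustration of the idea, here is a minimal Python sketch of an EM iteration for haplotype frequencies from unphased genotypes. It assumes random mating (so a haplotype pair has probability proportional to the product of the two haplotype frequencies), a uniform starting point, and a simple convergence threshold. The names `phase_pairs` and `em_haplotype_frequencies` and the parameters `n_iter` and `tol` are hypothetical; this is not the implementation used by any specific program.

```python
from collections import defaultdict
from itertools import product


def phase_pairs(genotype):
    """Unordered haplotype pairs consistent with one unphased genotype."""
    pairs = set()
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = tuple(g[c] for g, c in zip(genotype, choice))
        h2 = tuple(g[1 - c] for g, c in zip(genotype, choice))
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)


def em_haplotype_frequencies(genotypes, n_iter=200, tol=1e-10):
    """Estimate haplotype frequencies by EM (sketch, assumes random mating)."""
    candidates = [phase_pairs(g) for g in genotypes]
    haplotypes = sorted({h for prs in candidates for pr in prs for h in pr})
    # Start from equal frequencies for every haplotype seen in any resolution.
    freq = {h: 1.0 / len(haplotypes) for h in haplotypes}
    for _ in range(n_iter):
        counts = defaultdict(float)
        for prs in candidates:
            # E-step: posterior probability of each resolution of this
            # genotype. The factor 2 for heterozygous pairs is shared by all
            # resolutions of one genotype, so it cancels on normalization.
            weights = [freq[h1] * freq[h2] for h1, h2 in prs]
            total = sum(weights)
            for (h1, h2), w in zip(prs, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: expected haplotype counts divided by 2n chromosomes.
        new_freq = {h: counts[h] / (2 * len(genotypes)) for h in haplotypes}
        change = max(abs(new_freq[h] - freq[h]) for h in haplotypes)
        freq = new_freq
        if change < tol:
            break
    return freq


if __name__ == "__main__":
    sample = (
        [[("A", "A"), ("B", "B")]] * 4
        + [[("a", "a"), ("b", "b")]] * 4
        + [[("A", "a"), ("B", "b")]] * 2
    )
    for hap, f in em_haplotype_frequencies(sample).items():
        print(hap, round(f, 4))
```

EM converges to a local maximum of the likelihood, so in practice several starting points may be worth trying; the single uniform start here is a simplification.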

How the probability of each haplotype for each sample is determined from the overall haplotype probabilities depends on the computation method being used; it does not depend on any estimate of LD between any pair of the markers involved.

CHM diplotype probabilities are based on the average of the probabilities of the two haplotypes, while EM diplotype probabilities are based on the product of the probabilities of the two haplotypes. The process of finding the EM diplotype probabilities is equivalent to the “expectation” step of each EM iteration.
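The difference between the two rules can be sketched in Python as follows. The text states only that the CHM uses the average and EM uses the product of the two haplotype probabilities; the normalization over a sample's candidate diplotypes, the function name `diplotype_probabilities`, and the `method` argument are assumptions made for this illustration.

```python
def diplotype_probabilities(pairs, hap_prob, method="EM"):
    """Score each candidate diplotype of one sample and normalize.

    pairs: list of (haplotype1, haplotype2) candidates for the sample.
    hap_prob: dict mapping haplotype -> overall haplotype probability.
    method: "EM" (product rule) or "CHM" (average rule).
    """
    if method == "EM":
        scores = [hap_prob[h1] * hap_prob[h2] for h1, h2 in pairs]
    elif method == "CHM":
        scores = [(hap_prob[h1] + hap_prob[h2]) / 2 for h1, h2 in pairs]
    else:
        raise ValueError("method must be 'EM' or 'CHM'")
    total = sum(scores)
    if total == 0:
        raise ValueError("all candidate diplotypes have zero probability")
    # Assumption: probabilities are normalized over this sample's candidates.
    return [s / total for s in scores]


if __name__ == "__main__":
    probs = {"AB": 0.4, "ab": 0.4, "Ab": 0.1, "aB": 0.1}
    candidates = [("AB", "ab"), ("Ab", "aB")]  # a double heterozygote
    print(diplotype_probabilities(candidates, probs, method="EM"))
    print(diplotype_probabilities(candidates, probs, method="CHM"))
```

With these example numbers the product rule concentrates more weight on the AB/ab resolution than the average rule does, since products penalize rare haplotypes more strongly than averages.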

Composite Haplotype Method (CHM)

The CHM is based on the genotypic LD coefficient, \Delta_{AB} [Weir1996]. Estimation of \Delta_{AB} involves calculating di-genic frequencies. In the two-locus bi-allelic case, they are estimated as

\frac{1}{n}\eta_{AB} = \frac{2n_{AABB} + n_{AABb} + n_{AaBB} + n_{AaBb}/2}{n},

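where n_{AaBb}, for example, is the number of individuals with genotype Aa/Bb, and n is the sample size. With P_{AB} and P_{A/B} denoting the frequencies of alleles A and B on the same gamete and on the two different gametes of an individual, the composite disequilibrium is defined as a sum of inter- and intra-gametic components,

\Delta_{AB} = D_{AB} + D_{A/B} = (P_{AB} - p_Ap_B) + (P_{A/B} - p_Ap_B) = P_{AB} + P_{A/B} - 2p_Ap_B.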

Under random mating, P_{A/B}=p_Ap_B, and so \Delta_{AB} is an unbiased estimate of the LD parameter, D_{AB}. Also, if we do not wish to separate the inter- and intra-gametic components, we may work with the sum of the intra- and inter-gametic frequencies,

P_{AB} + P_{A/B},

which is an observable quantity, estimated directly by \eta_{AB}/n.
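The two-locus calculation can be sketched in Python as below. The sketch assumes the usual form of the composite disequilibrium, \Delta_{AB} = P_{AB} + P_{A/B} - 2p_Ap_B, estimated by plugging in \eta_{AB}/n and the sample allele frequencies without any small-sample bias correction. The function name `composite_ld` and the encoding of genotypes as (copies of A, copies of B) are illustrative choices.

```python
def composite_ld(counts):
    """Two-locus composite di-genic count and composite LD estimate.

    counts: dict mapping (copies of A, copies of B) -> number of
            individuals, with copies in {0, 1, 2}; e.g. (2, 1) is AA/Bb.
    Returns (eta_AB, delta_AB).
    """
    n = sum(counts.values())

    def c(a, b):
        return counts.get((a, b), 0)

    # eta_AB = 2 n_AABB + n_AABb + n_AaBB + n_AaBb / 2
    eta_ab = 2 * c(2, 2) + c(2, 1) + c(1, 2) + c(1, 1) / 2
    # Sample allele frequencies of A and B over 2n chromosomes.
    p_a = sum(a * k for (a, _), k in counts.items()) / (2 * n)
    p_b = sum(b * k for (_, b), k in counts.items()) / (2 * n)
    # Plug-in estimate of Delta_AB (no bias correction applied here).
    delta_ab = eta_ab / n - 2 * p_a * p_b
    return eta_ab, delta_ab


if __name__ == "__main__":
    # Hypothetical counts: only AA/BB and aa/bb individuals are observed.
    print(composite_ld({(2, 2): 5, (0, 0): 5}))
```

Because only genotype counts enter the calculation, no phase information is needed, which is the practical appeal of the composite approach.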

[Zaykin2001] extended the definition of di-genic frequencies to multiple loci and alleles. For the i-th individual multilocus genotype g_i, let H(g_i) be the number of single-locus heterozygotes in g_i. Define weights as

w(g_i) = 2^{1-H(g_i)}.

Sample composite haplotype counts are calculated by summing over individual contributions,

\eta_{abc...}=\sum_{i=1}^n w(g_i)I(a,b,c,...\subset g_i),

where n is the sample size, and I(\cdot) is the indicator function, defined as

I(a,b,c,... \subset g_i) = \left\{
\begin{array}{l l}
1 & \mbox{if the i-th individual genotype } g_i \mbox{ has alleles } a,b,c,...\\
0 & \mbox{otherwise}
\end{array}
\right.

Thus, if the i-th individual has at least one copy of all required alleles, it is counted with weight w(g_{i}). The composite haplotype frequencies are given by

\rho_{abc...} = \frac{1}{2n}\eta_{abc...}.

Note that \rho_{abc...} includes both inter- and intra-gametic component frequencies.
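The multilocus counting rule can be sketched in Python as follows. The sketch assumes the weight w(g_i) = 2^{1-H(g_i)}, chosen because it reproduces both the two-locus count \eta_{AB} and the single-locus allele count; the function name `composite_haplotype_frequencies` and the genotype encoding are hypothetical.

```python
from collections import defaultdict
from itertools import product


def composite_haplotype_frequencies(genotypes):
    """Composite haplotype counts (eta) and frequencies (rho).

    genotypes: list of multilocus genotypes, each a list of
               (allele1, allele2) tuples, one tuple per locus.
    """
    n = len(genotypes)
    eta = defaultdict(float)
    for g in genotypes:
        # H(g): number of heterozygous loci in this genotype.
        h = sum(1 for a1, a2 in g if a1 != a2)
        w = 2.0 ** (1 - h)  # assumed weight w(g) = 2^(1 - H(g))
        # Every allele combination of which g carries at least one copy
        # at each locus receives the weight w (the indicator equals 1).
        for hap in product(*[sorted(set(locus)) for locus in g]):
            eta[hap] += w
    rho = {hap: count / (2 * n) for hap, count in eta.items()}
    return dict(eta), rho


if __name__ == "__main__":
    sample = [
        [("A", "A"), ("B", "B")],
        [("A", "A"), ("B", "b")],
        [("A", "a"), ("B", "B")],
        [("A", "a"), ("B", "b")],
        [("A", "a"), ("B", "b")],
    ]
    eta, rho = composite_haplotype_frequencies(sample)
    for hap in sorted(eta):
        print(hap, eta[hap], rho[hap])
```

Under this weighting each individual contributes a total weight of 2 across all allele combinations, so the composite haplotype frequencies sum to 1.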

In a two-locus, two-allele case, composite haplotype counts simplify to Weir’s definition, \eta_{AB}=2n_{AABB}+n_{AABb}+n_{AaBB}+n_{AaBb}/2. In a single-locus case, they reduce to the usual allele count:

n_i = 2n_{ii}+\sum_{j\neq i}n_{ij}.