# Methods

## Haplotype Frequency Estimation Methods

The alleles of multiple markers transmitted from one parent are called a haplotype. Haplotype analysis of safety and efficacy data can incorporate information from multiple markers, from the same gene or from several genes, that are physically close together on a chromosome. Genotypic data from unrelated individuals do not contain information on which alleles were transmitted from each parent, but haplotype frequencies can be estimated using several existing in silico methodologies, such as the Expectation Maximization (EM) algorithm and the Composite Haplotype Method (CHM).

Note

GenomeBrowse computes LD using the CHM only.

A common type of genetic data is genetic information independently scored at several markers along a chromosome. Each subject has two copies of the same chromosome, one from the mother and one from the father. The genotypes alone do not show which alleles at different markers reside on the maternal copy and which on the paternal copy. Only individuals that are heterozygous at no more than one marker can be resolved unambiguously into a pair of haplotypes. This problem, called “genetic phase uncertainty”, is an example of a more general problem of statistical inference in the presence of missing data.
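
Phase ambiguity can be made concrete by listing every pair of haplotypes that is consistent with one unphased genotype. The following is a minimal sketch, not part of the product: the function name `compatible_haplotype_pairs` and the genotype representation (a list of per-marker allele pairs) are assumptions chosen for illustration.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """Enumerate unordered haplotype pairs consistent with an unphased genotype.

    `genotype` is a list of per-marker allele pairs, e.g. [("A", "a"), ("B", "b")].
    (Hypothetical representation, chosen for this illustration.)
    """
    het = [i for i, (x, y) in enumerate(genotype) if x != y]
    # The phase of the first heterozygous marker is fixed so that each
    # unordered pair is listed once; the remaining ones may be flipped freely.
    free = het[1:]
    pairs = []
    for flips in product([False, True], repeat=len(free)):
        flip_at = {i for i, f in zip(free, flips) if f}
        h1 = tuple(y if i in flip_at else x for i, (x, y) in enumerate(genotype))
        h2 = tuple(x if i in flip_at else y for i, (x, y) in enumerate(genotype))
        pairs.append((h1, h2))
    return pairs


double_het = compatible_haplotype_pairs([("A", "a"), ("B", "b")])
single_het = compatible_haplotype_pairs([("A", "A"), ("B", "b")])
```

With zero or one heterozygous marker the list has a single entry, so the genotype is resolved unambiguously. A double heterozygote yields two candidate pairs (AB/ab and Ab/aB), and in general a genotype with h ≥ 1 heterozygous markers yields 2^(h-1) candidates.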

The expectation-maximization (EM) algorithm, formalized by [Dempster1977], is a popular iterative technique for obtaining maximum likelihood estimates of sample haplotype frequencies (see [Excoffier1995] for details of obtaining haplotype frequencies by EM).
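
The iteration can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the implementation used by the software: the function names, the genotype representation, the uniform starting frequencies, and the stopping rule (`n_iter`, `tol`) are all hypothetical choices.

```python
from collections import Counter
from itertools import product


def phase_pairs(genotype):
    """Unordered haplotype pairs consistent with an unphased genotype."""
    het = [i for i, (x, y) in enumerate(genotype) if x != y]
    pairs = []
    for flips in product([False, True], repeat=max(len(het) - 1, 0)):
        flip_at = {i for i, f in zip(het[1:], flips) if f}
        h1 = tuple(y if i in flip_at else x for i, (x, y) in enumerate(genotype))
        h2 = tuple(x if i in flip_at else y for i, (x, y) in enumerate(genotype))
        pairs.append((h1, h2))
    return pairs


def em_haplotype_frequencies(genotypes, n_iter=100, tol=1e-10):
    """Maximum likelihood haplotype frequencies by EM (illustrative sketch)."""
    candidates = [phase_pairs(g) for g in genotypes]
    haps = sorted({h for pairs in candidates for pair in pairs for h in pair})
    freq = {h: 1.0 / len(haps) for h in haps}  # uniform start (an assumption)
    for _ in range(n_iter):
        counts = Counter()
        for pairs in candidates:
            # E-step: the posterior probability of each compatible pair is
            # proportional to the product of its two haplotype frequencies.
            weights = [freq[h1] * freq[h2] for h1, h2 in pairs]
            total = sum(weights)
            for (h1, h2), w in zip(pairs, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: expected haplotype counts divided by the 2n chromosomes.
        new_freq = {h: counts[h] / (2 * len(genotypes)) for h in haps}
        delta = max(abs(new_freq[h] - freq[h]) for h in haps)
        freq = new_freq
        if delta < tol:
            break
    return freq


AABB = [("A", "A"), ("B", "B")]
aabb = [("a", "a"), ("b", "b")]
AaBb = [("A", "a"), ("B", "b")]
freq = em_haplotype_frequencies([AABB] * 3 + [aabb] * 3 + [AaBb])
```

In this toy sample the homozygous individuals carry only AB and ab haplotypes, so the EM iterations assign the double heterozygote almost entirely to the AB/ab phase and the estimated frequencies of Ab and aB shrink toward zero.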

How the probability of each haplotype for each sample is determined from the overall haplotype probabilities depends on the computation method used. It does not depend on any estimate of LD between any pair of the markers involved.

CHM diplotype probabilities are based on the average of the probabilities of the two haplotypes, while EM diplotype probabilities are based on the product of the probabilities of the two haplotypes. The process of finding the EM diplotype probabilities is equivalent to the “expectation” step of each EM iteration.

### Composite Haplotype Method (CHM)

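The CHM is based on the idea of the genotypic LD coefficient, $\Delta_{AB}$ [Weir1996]. Estimation of $\Delta_{AB}$ involves calculation of di-genic frequencies. In the two-locus bi-allelic case, they are estimated as

$$\tilde{P}_{AB} + \tilde{P}_{A/B} = \frac{n_{AB}}{n}, \qquad n_{AB} = 2n_{AABB} + n_{AABb} + n_{AaBB} + \tfrac{1}{2}n_{AaBb},$$

where $n_{AaBb}$, for example, is the number of individuals with genotype Aa/Bb, and $n$ is the sample size. The composite disequilibrium is defined as a sum of inter- and intra-gametic components,

$$\Delta_{AB} = D_{AB} + D_{A/B} = P_{AB} + P_{A/B} - 2p_Ap_B,$$

where $D_{AB} = P_{AB} - p_Ap_B$ is the intra-gametic component and $D_{A/B} = P_{A/B} - p_Ap_B$ is the inter-gametic component. Under random mating, $D_{A/B} = 0$, and so assuming random mating, $\hat{\Delta}_{AB} = n_{AB}/n - 2\tilde{p}_A\tilde{p}_B$ is an unbiased estimate of the LD parameter, $D_{AB}$. Also, if we do not wish to separate the inter- and intra-gametic components, we may define

$$P^{*}_{AB} = \tfrac{1}{2}\left(P_{AB} + P_{A/B}\right),$$

which is an observable quantity.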
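
As a worked illustration of the two-locus case, the sketch below computes Weir's di-genic count, n_AB = 2 n_AABB + n_AABb + n_AaBB + n_AaBb / 2, and the composite LD estimate, n_AB / n - 2 p_A p_B, from a table of genotype counts. The function name `composite_ld` and the input layout (a dictionary keyed by the number of copies of A and of B) are assumptions made for this example.

```python
def composite_ld(counts):
    """Weir's composite LD estimate for two bi-allelic loci (sketch).

    `counts` maps (copies of A, copies of B), each 0, 1 or 2, to the
    number of individuals with that two-locus genotype.
    (Hypothetical input layout, chosen for this illustration.)
    """
    n = sum(counts.values())
    # Di-genic count: n_AB = 2 n_AABB + n_AABb + n_AaBB + n_AaBb / 2
    n_ab = (2 * counts.get((2, 2), 0) + counts.get((2, 1), 0)
            + counts.get((1, 2), 0) + 0.5 * counts.get((1, 1), 0))
    # Allele frequencies of A and B among the 2n chromosomes.
    p_a = sum(a * c for (a, _), c in counts.items()) / (2 * n)
    p_b = sum(b * c for (_, b), c in counts.items()) / (2 * n)
    delta = n_ab / n - 2 * p_a * p_b
    return n_ab, delta


# Five AABB and five aabb individuals: alleles A and B always co-occur.
n_ab_linked, delta_linked = composite_ld({(2, 2): 5, (0, 0): 5})
# Ten double heterozygotes: the genotypes carry no evidence of association.
n_ab_het, delta_het = composite_ld({(1, 1): 10})
```

The first sample gives n_AB = 10 and a composite estimate of 0.5. The second gives n_AB = 5 and an estimate of 0, because the phase of a double heterozygote is unknown and each individual contributes only one half to the count.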

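[Zaykin2001] extended the definition of di-genic frequencies to multiple loci and alleles. For the i-th individual multilocus genotype $G_i$, let $h_i$ be the number of single-locus heterozygotes in $G_i$. Define weights as

$$w_i = 2^{\,1 - h_i}.$$

Sample composite haplotype counts for a haplotype $H$ (one allele per locus) are calculated from summing over individual contributions,

$$n_H = \sum_{i=1}^{n} w_i \, I(H \in G_i),$$

where $n$ is the sample size, and $I(\cdot)$ is the indicator function, defined as

$$I(H \in G_i) = \begin{cases} 1, & \text{if } G_i \text{ contains at least one copy of every allele of } H, \\ 0, & \text{otherwise.} \end{cases}$$

Thus, if the i-th individual has at least one copy of all required alleles, it is counted with weight $w_i$. The composite haplotype frequencies are given by

$$\tilde{P}^{*}_{H} = \frac{n_H}{2n}.$$

Note that $\tilde{P}^{*}_{H}$ includes both inter- and intra-gametic component frequencies.

In a two-locus, two-allele case, composite haplotype counts simplify to Weir’s definition, $n_{AB}$. In a single-locus case, they are the usual definition of the allele count:

$$n_A = 2n_{AA} + n_{Aa}.$$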
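
The multilocus counting rule can be sketched as follows. Two points are assumptions rather than statements from the source: the weight 2^(1 - h), chosen because it reproduces both Weir's two-locus count and the single-locus allele count, and the normalization by 2n. The function names and the genotype representation are likewise hypothetical.

```python
def composite_haplotype_count(genotypes, target):
    """Composite count of the haplotype `target` (illustrative sketch).

    `genotypes` is a list of multilocus genotypes, each a list of
    per-locus allele pairs; `target` holds one allele per locus.
    """
    total = 0.0
    for g in genotypes:
        h = sum(1 for x, y in g if x != y)  # number of heterozygous loci
        weight = 2.0 ** (1 - h)             # assumed weight: 2^(1 - h)
        # Indicator: the individual carries every required allele.
        if all(allele in pair for allele, pair in zip(target, g)):
            total += weight
    return total


def composite_haplotype_frequency(genotypes, target):
    """Composite count divided by 2n (normalization is an assumption)."""
    return composite_haplotype_count(genotypes, target) / (2 * len(genotypes))


two_locus = [
    [("A", "A"), ("B", "B")],  # weight 2
    [("A", "A"), ("B", "b")],  # weight 1
    [("A", "a"), ("B", "B")],  # weight 1
    [("A", "a"), ("B", "b")],  # weight 1/2
    [("a", "a"), ("b", "b")],  # carries neither A nor B: not counted
]
count_ab = composite_haplotype_count(two_locus, ("A", "B"))
freq_ab = composite_haplotype_frequency(two_locus, ("A", "B"))

one_locus = [[("A", "A")], [("A", "a")], [("a", "a")]]
count_a = composite_haplotype_count(one_locus, ("A",))
```

For the two-locus sample the count is 2 + 1 + 1 + 0.5 = 4.5, matching Weir's two-locus di-genic count, and the composite frequency is 4.5 / 10 = 0.45. For the single-locus sample the count is 3, the ordinary count of allele A.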