Formulas for Computing Linkage Disequilibrium (LD)

Two approaches are available for computing linkage disequilibrium (LD). They differ in the method used to impute the two-marker haplotype frequencies on which the LD computations depend: expectation-maximization (EM) or the composite haplotype method (CHM).

Computing LD using Expectation-Maximization (EM)

First, the EM method (see Expectation Maximization (EM) and Haplotype Frequency Estimation Methods) is used to impute the two-marker haplotype probabilities p_{ij} for the i-th and j-th allele in the first and second markers, respectively.

Using these, the signed D_{ij} statistics may be calculated as

D_{ij} = p_{ij} - p_i q_j,

where p_i is the frequency of allele i in the first marker, and q_j is the frequency of allele j in the second marker.
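
As a minimal sketch of this step, the following Python computes the D_{ij} table from a table of haplotype frequencies. The function name and the example frequencies are hypothetical, and the sketch assumes that the allele frequencies p_i and q_j can be taken as the row and column sums of the imputed haplotype table.

```python
def ld_coefficients(hap_freqs):
    """Return (p, q, d) for a k x m table of two-marker haplotype frequencies.

    hap_freqs[i][j] plays the role of p_{ij}: rows index alleles of the
    first marker, columns index alleles of the second marker.
    """
    k = len(hap_freqs)
    m = len(hap_freqs[0])
    # Assumption: allele frequencies are the margins of the haplotype table.
    p = [sum(hap_freqs[i]) for i in range(k)]
    q = [sum(hap_freqs[i][j] for i in range(k)) for j in range(m)]
    # D_{ij} = p_{ij} - p_i q_j
    d = [[hap_freqs[i][j] - p[i] * q[j] for j in range(m)] for i in range(k)]
    return p, q, d


# Hypothetical haplotype frequencies, chosen only for illustration.
example = [[0.4, 0.1],
           [0.1, 0.4]]
p, q, d = ld_coefficients(example)
```

Because the margins are used as allele frequencies, each row and each column of the resulting D_{ij} table sums to zero.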

If there are k alleles in the first marker and m alleles in the second, a chi-squared statistic with (k-1)(m-1) degrees of freedom may be written as

\chi^2 = n\sum_{i=1}^k \sum_{j=1}^m \frac{D_{ij}^2}{p_i q_j}.
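
The statistic can be sketched in a few lines of Python. The function name is hypothetical, the numerator D_{ij} is squared as in a Pearson-type statistic, and every p_i and q_j is assumed to be strictly positive so that no term divides by zero. The value of n is simply passed in by the caller.

```python
def chi_squared_statistic(p, q, d, n):
    """Return n * sum_ij D_ij^2 / (p_i q_j) and its degrees of freedom."""
    total = 0.0
    for i, p_i in enumerate(p):
        for j, q_j in enumerate(q):
            # Each cell contributes its squared LD coefficient,
            # scaled by the product of the two allele frequencies.
            total += d[i][j] ** 2 / (p_i * q_j)
    dof = (len(p) - 1) * (len(q) - 1)
    return n * total, dof


# Hypothetical inputs, chosen only for illustration.
p_example = [0.5, 0.5]
q_example = [0.5, 0.5]
d_example = [[0.15, -0.15],
             [-0.15, 0.15]]
chi2, dof = chi_squared_statistic(p_example, q_example, d_example, n=100)
```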

R^2 may then be computed by taking the p-value as

p = \operatorname{chisqr}(\chi^2, (k-1)(m-1))

and obtaining R^2 from the inverse chi-squared distribution for one degree of freedom, F^{-1}, as

R^2 = \frac{F^{-1}(p)}{n}.

For the two-locus two-allele case, this procedure simplifies to the following direct formula

R^2 = \sum_{i=1}^2 \sum_{j=1}^2 \frac{D_{ij}^2}{p_i q_j},

which, for this two-locus two-allele case, may be shown to be equivalent to

R^2 = \frac{D_{AB}^2}{p_A(1 - p_A)q_B(1 - q_B)},

where A may be chosen as either one of the alleles in the first marker and B may be chosen as either one of the alleles in the second marker.
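
The equivalence of the two forms can be checked numerically. In the two-allele case all four D_{ij} share the same magnitude and differ only in sign, so the double sum of D_{ij}^2 / (p_i q_j) collapses to the closed form. The sketch below uses hypothetical function names and hypothetical haplotype frequencies, and takes allele frequencies from the margins of the table.

```python
def r_squared_sum(hap):
    """R^2 as the double sum of D_ij^2 / (p_i q_j) over a 2 x 2 table."""
    p = [hap[0][0] + hap[0][1], hap[1][0] + hap[1][1]]
    q = [hap[0][0] + hap[1][0], hap[0][1] + hap[1][1]]
    return sum(
        (hap[i][j] - p[i] * q[j]) ** 2 / (p[i] * q[j])
        for i in range(2)
        for j in range(2)
    )


def r_squared_closed(hap):
    """R^2 as D_AB^2 / (p_A (1 - p_A) q_B (1 - q_B))."""
    p_a = hap[0][0] + hap[0][1]
    q_b = hap[0][0] + hap[1][0]
    d_ab = hap[0][0] - p_a * q_b
    return d_ab ** 2 / (p_a * (1 - p_a) * q_b * (1 - q_b))


# Hypothetical haplotype frequencies, chosen only for illustration.
hap_example = [[0.5, 0.1],
               [0.1, 0.3]]
r2_from_sum = r_squared_sum(hap_example)
r2_from_closed = r_squared_closed(hap_example)
```

Both functions should agree up to floating-point rounding, whichever allele is labeled A or B.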

Computing LD using the Composite Haplotype Method (CHM)

Multi-Allelic

If there are k alleles in the first marker and m alleles in the second, where either k > 2 or m > 2 or both, and using the same notation for p_i and q_j as above, a chi-squared statistic with (k-1)(m-1) degrees of freedom may be written as

\chi^2 = n\sum_{i=1}^k \sum_{j=1}^m \frac{\Delta_{ij}^2}{2 p_i q_j},

where

\Delta_{ij} = \frac{\eta_{ij}}{n} - 2 p_i q_j,

and \eta_{ij} is defined as in Composite Haplotype Method (CHM). Here, we are effectively using

\rho_{ij} = \frac{1}{2n}\eta_{ij},

which includes both inter- and intra-gametic component frequencies, as our haplotype frequencies.
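
A minimal sketch of these quantities follows. The \eta_{ij} counts are treated as given inputs, since their definition lives in Composite Haplotype Method (CHM). All function names and example numbers are hypothetical. The sketch squares \Delta_{ij} in the statistic and relies on the algebraic identity \rho_{ij} - p_i q_j = \Delta_{ij} / 2.

```python
def chm_delta(eta, p, q, n):
    """Delta_ij = eta_ij / n - 2 p_i q_j."""
    return [[eta[i][j] / n - 2.0 * p[i] * q[j] for j in range(len(q))]
            for i in range(len(p))]


def chm_rho(eta, n):
    """rho_ij = eta_ij / (2 n), used in place of haplotype frequencies."""
    return [[e / (2.0 * n) for e in row] for row in eta]


def chm_chi_squared(delta, p, q, n):
    """n * sum_ij Delta_ij^2 / (2 p_i q_j); assumes all p_i, q_j > 0."""
    total = 0.0
    for i, p_i in enumerate(p):
        for j, q_j in enumerate(q):
            total += delta[i][j] ** 2 / (2.0 * p_i * q_j)
    return n * total


# Hypothetical inputs: 3 alleles at the first marker, 2 at the second.
n_example = 10
p_example = [0.5, 0.3, 0.2]
q_example = [0.6, 0.4]
eta_example = [[7, 3],
               [3, 3],
               [2, 2]]
delta = chm_delta(eta_example, p_example, q_example, n_example)
rho = chm_rho(eta_example, n_example)
chi2 = chm_chi_squared(delta, p_example, q_example, n_example)
```

Written in terms of \rho_{ij}, the same statistic equals 2n times the sum of (\rho_{ij} - p_i q_j)^2 / (p_i q_j), which is one way to see where the factor of 2 in the denominator comes from.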

R^2 may then be computed by taking the p-value as

p = \operatorname{chisqr}(\chi^2, (k-1)(m-1))

and obtaining R^2 from the inverse chi-squared distribution for one degree of freedom, F^{-1}, as

R^2 = \frac{F^{-1}(p)}{n}.

Bi-Allelic

For the two-locus two-allele case, and using the notation of Composite Haplotype Method (CHM), we compute R^2 using the following direct formula

R^2 = \frac{\Delta_{AB}^2}{(p_A(1-p_A) + D_{AA})(q_B(1 - q_B) + D_{BB})},

where D_{AA} and D_{BB} are the Hardy-Weinberg coefficients for allele A of the first marker and allele B of the second marker, respectively. This formula may be thought of as putting a “Hardy-Weinberg correction” onto the formula

R^2 = \frac{\Delta_{AB}^2}{p_A(1 - p_A)q_B(1 - q_B)},

which is only completely accurate under the special circumstance of random mating (perfect Hardy-Weinberg equilibrium over the two-marker haplotypes), for which p_{A/B} approximates p_A q_B and \Delta_{AB} is an unbiased estimate of D_{AB}.
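
The corrected and uncorrected formulas differ only in the denominator, which the following sketch makes explicit. The function name and example values are hypothetical. \Delta_{AB}, D_{AA} and D_{BB} are taken as inputs computed elsewhere, following Composite Haplotype Method (CHM).

```python
def chm_r_squared(delta_ab, p_a, q_b, d_aa=0.0, d_bb=0.0):
    """Bi-allelic CHM R^2 with the Hardy-Weinberg correction.

    Setting d_aa = d_bb = 0 (no departure from Hardy-Weinberg
    equilibrium) recovers the uncorrected formula.
    """
    denom = (p_a * (1 - p_a) + d_aa) * (q_b * (1 - q_b) + d_bb)
    return delta_ab ** 2 / denom


# Hypothetical inputs, chosen only for illustration.
uncorrected = chm_r_squared(0.1, 0.5, 0.4)
corrected = chm_r_squared(0.1, 0.5, 0.4, d_aa=0.05, d_bb=0.01)
```

In this example the positive Hardy-Weinberg coefficients enlarge the denominator, so the corrected value is smaller than the uncorrected one.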

It may be shown that for the circumstance of perfect linkage disequilibrium, the result of using the “Hardy-Weinberg correction” formula is equivalent to

R^2 = \frac{D_{AB}^2}{p_A(1 - p_A)q_B(1 - q_B)}.

The D-Prime Statistic

If the minor allele frequencies of the respective markers are small, the magnitude of the D_{ij} statistic cannot get very large, even if the markers are in almost complete linkage disequilibrium, compared to the magnitude it could have had if the allele frequencies of the markers were almost equal.

The D-prime statistic was designed to compensate for this. D'_{ij} is defined as D_{ij} normalized by the maximum magnitude that D_{ij} could attain given the allele frequencies in each of the markers.

Specifically,

D'_{ij} = \frac{D_{ij}}{\min(p_i q_j, (1 - p_i)(1 - q_j))}

if D_{ij} < 0, and

D'_{ij} = \frac{D_{ij}}{\min((1 - p_i) q_j, p_i (1 - q_j))}

otherwise.

The overall D-prime statistic is defined as

D' = \sum_{i=1}^k \sum_{j=1}^m p_i q_j |D'_{ij}|.
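
The two-branch definition and the weighted sum can be sketched as follows. The function names and example values are hypothetical, and the sketch assumes every allele frequency lies strictly between 0 and 1 so that the normalizing term is nonzero.

```python
def d_prime_ij(d_ij, p_i, q_j):
    """Normalize D_ij by its largest attainable magnitude."""
    if d_ij < 0:
        d_max = min(p_i * q_j, (1 - p_i) * (1 - q_j))
    else:
        d_max = min((1 - p_i) * q_j, p_i * (1 - q_j))
    return d_ij / d_max


def overall_d_prime(p, q, d):
    """D' = sum_ij p_i q_j |D'_ij|."""
    return sum(
        p[i] * q[j] * abs(d_prime_ij(d[i][j], p[i], q[j]))
        for i in range(len(p))
        for j in range(len(q))
    )


# Hypothetical inputs, chosen only for illustration.
p_example = [0.5, 0.5]
q_example = [0.5, 0.5]
d_example = [[0.15, -0.15],
             [-0.15, 0.15]]
d_prime = overall_d_prime(p_example, q_example, d_example)
```

The same two functions apply to multi-allelic CHM if the \Delta_{ij} values are passed in place of D_{ij}.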

Computing D-Prime

For EM, the above formula is used directly on the values of p_i, q_j, and D_{ij}, where the D_{ij} are imputed using the technique of Computing LD using Expectation-Maximization (EM).

For multi-allelic CHM, we use

D'_{ij} = \frac{\Delta_{ij}}{\min(p_i q_j, (1 - p_i)(1 - q_j))}

if \Delta_{ij} < 0, and

D'_{ij} = \frac{\Delta_{ij}}{\min((1 - p_i) q_j, p_i (1 - q_j))}

otherwise, with the overall D-prime statistic being defined as

D' = \sum_{i=1}^k \sum_{j=1}^m p_i q_j |D'_{ij}|.

For bi-allelic CHM, we use the same formulas as for multi-allelic CHM, except that for the final D', we take the original overall D' obtained as above and use a Hardy-Weinberg correction on it:

D' = D'_{uncorrected} \sqrt{\frac{p_A(1 - p_A)q_B(1 - q_B)}{(p_A(1 - p_A) + D_{AA})(q_B(1 - q_B) + D_{BB})}},

where A, B, D_{AA} and D_{BB} are defined as in Bi-Allelic.
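
A minimal sketch of this correction is shown below. The function name and example values are hypothetical, the second marker's allele frequency is written q_b, and the Hardy-Weinberg coefficients are taken as inputs.

```python
import math


def hw_corrected_d_prime(d_prime_uncorrected, p_a, q_b, d_aa, d_bb):
    """Scale the uncorrected D' by the Hardy-Weinberg correction factor."""
    var_a = p_a * (1 - p_a)
    var_b = q_b * (1 - q_b)
    # Ratio of the uncorrected to the corrected denominators; assumes
    # both corrected terms are strictly positive.
    factor = math.sqrt((var_a * var_b) / ((var_a + d_aa) * (var_b + d_bb)))
    return d_prime_uncorrected * factor


# Hypothetical inputs, chosen only for illustration.
unchanged = hw_corrected_d_prime(0.5, 0.5, 0.4, d_aa=0.0, d_bb=0.0)
corrected = hw_corrected_d_prime(0.5, 0.5, 0.4, d_aa=0.05, d_bb=0.01)
```

When both Hardy-Weinberg coefficients are zero the factor is 1 and D' is left unchanged.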