General Statistics

General Marker Statistics

The following subsections further explain the methods used in obtaining General Marker Statistics, which may be invoked using a separate window (Genotype Statistics by Marker) or as a tab in the Genotypic Association Test dialog (Genotype Association Tests).

Hardy-Weinberg Equilibrium Computation

The HWE p-value measures the strength of evidence against the null hypothesis that the marker follows Hardy-Weinberg Equilibrium. Large p-values are consistent with the marker following HWE.

Suppose we have a marker with alleles 1,\ldots, k having frequencies p_1,\ldots, p_k. We may write the genotype count for alleles i and j as n_{ij}. Due to phase ambiguity, if i \neq j, we count occurrences of allele i on the first chromosome and allele j on the second chromosome, along with occurrences of allele j on the first chromosome and allele i on the second chromosome in both the notations n_{ij} and n_{ji}.

Thus, we may write the count for allele i as n_i = 2n_{ii} + \sum^k_{j=1,j \neq i}{n_{ij}}. We may also express the genotype frequency for allele i occurring homozygously as p_{ii}=\frac{n_{ii}}{n}, and the genotype frequency for heterozygous alleles i and j as p_{ij} = \frac{n_{ij}}{n}, where n is the population count. The frequency of allele i may be expressed as:

p_i = \frac{n_i}{2n} = p_{ii} + \frac{1}{2}\sum^k_{j=1,j\neq i}{p_{ij}}.

We wish to check the agreement of p_{ii} with p^2_i and the agreement of p_{ij}, where i \neq j, with 2p_i p_j. We multiply by two to deal with the phase ambiguity (see above).

Thus, we will define the Hardy-Weinberg equilibrium coefficient D_{ii} or D_{ij} for alleles i and j such that

p_{ii} = p^2_i + D_{ii}

p_{ij} = 2p_i p_j - 2D_{ij} (\text{for } i \neq j).

(It may be shown that for a bi-allelic marker, D_{11}=D_{12}=D_{22}.)

The following statistic then approximately follows a chi-squared distribution with k(k-1)/2 degrees of freedom:

X^2 = n\sum^k_{i=1}{\frac{(p_{ii}-p^2_i)^2}{p^2_i}} + n\sum^{k-1}_{i=1}{\sum^k_{j=i+1}{\frac{(p_{ij}-2p_ip_j)^2}{2p_ip_j}}}
    = \sum^k_{i=1}{\frac{(n_{ii}-np^2_i)^2}{np^2_i}} + \sum^{k-1}_{i=1}{\sum^k_{j=i+1}{\frac{(n_{ij}-2np_ip_j)^2}{2np_ip_j}}}.

From this, we obtain the p-value p = \chi^2(X^2, k(k-1)/2). The correlation R is then obtained from the inverse chi-squared distribution F^{-1} for one degree of freedom as

R = \sqrt{\frac{F^{-1}(p)}{n}}.
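
As an illustration of the computation above, the following minimal Python sketch (not part of SVS) computes the HWE chi-squared statistic and p-value for a bi-allelic marker from its three genotype counts. The function name and the use of scipy.stats are assumptions made for this example; for k = 2 alleles there is k(k-1)/2 = 1 degree of freedom.

    from scipy.stats import chi2

    def hwe_chi_squared(n_11, n_12, n_22):
        """Chi-squared HWE test for a bi-allelic marker (k = 2, so df = 1)."""
        n = n_11 + n_12 + n_22                     # total genotype count
        p1 = (2 * n_11 + n_12) / (2.0 * n)         # frequency of allele 1
        p2 = 1.0 - p1
        # expected counts under HWE: n*p1^2, 2*n*p1*p2, n*p2^2
        expected = [n * p1 * p1, 2 * n * p1 * p2, n * p2 * p2]
        observed = [n_11, n_12, n_22]
        x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
        return x2, chi2.sf(x2, df=1)

    # example: 120 homozygous 1/1, 60 heterozygous, 20 homozygous 2/2
    print(hwe_chi_squared(120, 60, 20))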

Fisher’s Exact Test HWE P-Values

In this test, all possible sets of genotype counts consistent with the observed allele totals are cycled through, and the probabilities of all sets of counts that are as extreme as or more extreme than (equally probable or less probable than) the observed set of counts are summed.

See [Emigh1980].
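
The following sketch illustrates this idea for the bi-allelic case only, using the standard exact probability of a heterozygote count given the allele totals (as in [Emigh1980]); it is an assumption-laden example, not the SVS implementation, and does not cover multi-allelic markers.

    import math

    def hwe_exact_pvalue(n_11, n_12, n_22):
        """Exact HWE p-value for a bi-allelic marker: sum the probabilities of all
        heterozygote counts compatible with the allele totals that are no more
        probable than the observed count."""
        n = n_11 + n_12 + n_22
        n1 = 2 * n_11 + n_12          # copies of allele 1
        n2 = 2 * n_22 + n_12          # copies of allele 2
        lg = math.lgamma

        def log_prob(het):            # log P(het heterozygotes | n, n1)
            hom1 = (n1 - het) // 2
            hom2 = (n2 - het) // 2
            return (het * math.log(2.0)
                    + lg(n + 1) - lg(hom1 + 1) - lg(het + 1) - lg(hom2 + 1)
                    + lg(n1 + 1) + lg(n2 + 1) - lg(2 * n + 1))

        p_obs = math.exp(log_prob(n_12))
        total = 0.0
        for het in range(n1 % 2, min(n1, n2) + 1, 2):   # het has the parity of n1
            p = math.exp(log_prob(het))
            if p <= p_obs * (1.0 + 1e-12):
                total += p
        return min(total, 1.0)

    print(hwe_exact_pvalue(120, 60, 20))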

Signed HWE Correlation R

Note

This statistic applies only to bi-allelic markers.

We define the signed HWE correlation R as

R_{signed} = \frac{n \cdot n_{DD} - n^2_D}{n \cdot n_D - n^2_D},

where

n_D = n_{DD} + \frac{n_{Dd}}{2},

n is the total genotype count and n_{DD} and n_{Dd} are the counts for genotypes DD and Dd, respectively.

This is derived from the formula for (signed) correlation between two sets of observations, x_i and y_i,

r = \frac{n\sum{x_i y_i} - \sum{x_i}\sum{y_i}}{\sqrt{n\sum{x^2_i} - \left(\sum{x_i}\right)^2}\sqrt{n\sum{y^2_i} - \left(\sum{y_i}\right)^2}},

where we take the x_i to be 0 if the first allele is d and 1 if the first allele is D, and the y_i to be 0 if the second allele is d and 1 if the second allele is D.

Because of phase ambiguity, we set each of the counts of (d, D) and (D, d) to be one-half of the (phase-ambiguous) observed count of Dd. The correlation then simplifies to the formula first given above.

If there is a high homozygous count, x_i and y_i will often be 1 or often be 0 at the same time, and therefore there will be a positive correlation between the x_i and the y_i. Similarly, if there is a high heterozygous count, x_i and y_i will often be 1 at opposite times, causing an anti-correlation to exist.
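
A minimal sketch of this computation from the three genotype counts follows; the function name is an assumption for the example.

    def signed_hwe_r(n_dd, n_Dd, n_DD):
        """Signed HWE correlation for a bi-allelic marker from genotype counts."""
        n = n_dd + n_Dd + n_DD
        n_d_allele = n_DD + n_Dd / 2.0     # n_D in the text above
        return (n * n_DD - n_d_allele ** 2) / (n * n_d_allele - n_d_allele ** 2)

    # a deficit of heterozygotes relative to HWE gives a positive value,
    # an excess of heterozygotes gives a negative value
    print(signed_hwe_r(70, 10, 20))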

Call Rate

The call rate is the fraction of genotypes present and not missing for the given marker.

\text{call rate}=\frac{\text{number of complete and non-missing genotypes}}{\text{total number of genotypes}}

Minor Allele Frequency (MAF)

The minor allele frequency is the fraction of the total alleles of the given marker that are minor alleles.

\text{MAF} = \frac{\text{Minor allele count}}{\text{Total allele count}}
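
As a small worked example of both quantities, the sketch below computes the call rate and MAF from a list of genotype calls; representing a genotype as an allele pair and a missing call as None is an assumption made for this illustration.

    def call_rate_and_maf(genotypes):
        """genotypes: list of (allele1, allele2) tuples, or None for a missing call."""
        called = [g for g in genotypes if g is not None]
        call_rate = len(called) / len(genotypes)
        alleles = [a for g in called for a in g]
        counts = {}
        for a in alleles:
            counts[a] = counts.get(a, 0) + 1
        # if the marker is monomorphic, report MAF = 0
        maf = 0.0 if len(counts) < 2 else min(counts.values()) / len(alleles)
        return call_rate, maf

    print(call_rate_and_maf([("A", "A"), ("A", "G"), None, ("G", "G")]))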

Statistics Available for Genotype Association Tests

Correlation/Trend Test

The Correlation/Trend Test tests the significance of any correlation between two numeric variables (or two variables which have been encoded as numeric variables). Equivalently, it tests for any “trend” that one of the numeric variables exhibits against the other.

If we have n pairs of observations (x_i,\: y_i), the (signed) correlation R between them is

R = \frac{cov(x,\: y)}{\sqrt{var(x)var(y)}}= \frac{\sum{x_i y_i} - \frac{1}{n}\sum{x_i}\sum{y_i}}{\sqrt{\left(\sum{x^2_i}-\frac{1}{n}\left(\sum{x_i}\right)^2\right)\left(\sum{y^2_i}-\frac{1}{n}\left(\sum{y_i}\right)^2\right)}}.

Meanwhile,

\chi^2 = (n-1)R^2

follows an approximate chi-squared distribution with one degree of freedom, from which a p-value may be obtained.

Note

  1. In the special case of the additive model (and no PCA correction) for a case/control study, if we were to use, instead of the above formula, \chi^2 = nR^2, we would have the mathematical equivalent of the Armitage Trend Test.
  2. This correlation/trend test is also available to be used after PCA correction. However, the formula for the chi-squared statistic is instead \chi^2 = (n-1-k)R^2, where k is the number of principal components that have been removed from the data. The premise is that PCA correction has removed k degrees of freedom from the data, and only the remaining degrees need to be tested.
  3. R is a signed value and indicates the effect direction.
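
The sketch below illustrates the test as described above; the function name, the use of numpy/scipy, and the optional k parameter (number of removed principal components, zero when no PCA correction is applied) are assumptions for the example.

    import numpy as np
    from scipy.stats import chi2

    def correlation_trend_test(x, y, k=0):
        """Correlation R and chi-squared p-value; k = number of PCA components removed."""
        x = np.asarray(x, dtype=float)
        y = np.asarray(y, dtype=float)
        n = len(x)
        r = np.corrcoef(x, y)[0, 1]        # same R as the covariance formula above
        stat = (n - 1 - k) * r ** 2        # (n-1)R^2 when k = 0
        return r, chi2.sf(stat, df=1)

    # e.g. additive-coded genotypes (0/1/2 minor alleles) vs. case/control status (0/1)
    print(correlation_trend_test([0, 1, 2, 1, 0, 2], [0, 0, 1, 1, 0, 1]))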

Armitage Trend Test

The Armitage Trend Test tests the “trend” in an ordered case/control contingency table. In Golden Helix SVS, the ordering is by the number of minor alleles in the genotype: zero, one, or two.

Let n_{10},\; n_{11}, \text{ and } n_{12} be the counts for cases with 0, 1, and 2 minor alleles, respectively, and n_{00},\; n_{01}, \text{ and } n_{02} be the counts for controls with 0, 1, and 2 minor alleles, respectively. Also, let s_0 = 0, s_1 = 1, and s_2 = 2.

If we let N be the total count,

p_{case} =\frac{n_{10}+n_{11}+n_{12}}{N},
p_{1i} = \frac{n_{1i}}{\left(n_{0i} + n_{1i}\right)},
\bar{s} = \frac{\sum{\left(n_{0i} + n_{1i}\right)s_i}}{N},
d = \sum{\left(n_{0i}+n_{1i}\right)\left(s_i - \bar{s}\right)^2}, and
b =\frac{\sum{\left(n_{0i}+n_{1i}\right)\left(p_{1i}-p_{case}\right)\left(s_i- \bar{s}\right)}}{d},

then the prediction equation under ordinary least-squares fit is

\hat{p}_{1i} = p_{case} + b(s_i - \bar{s}).

The statistic for the Armitage Trend Test is

z^2 = \frac{b^2 d}{p_{case}\left(1 - p_{case}\right)},

which is asymptotically chi-squared with one degree of freedom. This is used to obtain the chi-squared based p-value for this test.

Note

The trend statistic itself may be formulated as

T_A = b \sqrt{\frac{d}{p_{case}(1 - p_{case})}} .

This trend statistic T_A indicates the direction of the effect.
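
The following minimal sketch follows the formulas above for a 2 x 3 case/control genotype table; the function name and argument layout are assumptions for the example.

    from scipy.stats import chi2

    def armitage_trend_test(n_controls, n_cases, s=(0, 1, 2)):
        """n_controls/n_cases: counts per genotype column (0, 1, 2 minor alleles)."""
        cols = [c0 + c1 for c0, c1 in zip(n_controls, n_cases)]
        N = float(sum(cols))
        p_case = sum(n_cases) / N
        s_bar = sum(c * si for c, si in zip(cols, s)) / N
        d = sum(c * (si - s_bar) ** 2 for c, si in zip(cols, s))
        p1 = [a / (a + b) for a, b in zip(n_cases, n_controls)]   # p_{1i}
        b = sum(c * (p - p_case) * (si - s_bar)
                for c, p, si in zip(cols, p1, s)) / d
        z2 = b * b * d / (p_case * (1.0 - p_case))
        t_a = b * (d / (p_case * (1.0 - p_case))) ** 0.5          # signed trend statistic
        return t_a, z2, chi2.sf(z2, df=1)

    print(armitage_trend_test(n_controls=[60, 30, 10], n_cases=[40, 40, 20]))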

Exact Form of Armitage Test

The exact form of this test yields the exact probability under the null hypothesis of having a “trend” at least as extreme as the one observed, assuming an equal probability of any permutation of the dependent variable.

To perform the exact Armitage test, we define the trend score for the contingency table m as

T_m = \sum{n^{(m)}_{1i}s_{1i}},

where

s_{1i}=\frac{\left(s_i - \bar{s}\right)}{\sum{\left(n_{0j} + n_{1j}\right)\left(s_j - \bar{s}\right)^2}}.

The exact permutation p-value is evaluated as

p_{exact} = \sum_{|T_m|\geq|T_{observed}|}{p_m},

where

p_m = \frac{\prod{\binom{n_{0i} + n_{1i}}{n_{1i}}}}{\binom{N}{n_{10} + n_{11} + n_{12}}}.

Note

The trend statistic T_A for the observed data may be formulated in this context as

T_A = T_0 \sqrt{\frac{\sum{\left(n_{0j} + n_{1j}\right)\left(s_j - \bar{s}\right)^2}}{p_{case}(1 - p_{case})}} ,

where T_0 is the value of T_m for the table created from the observed data, and p_{case} = \frac{n_{10}+n_{11}+n_{12}}{N}, where N is the total number of observations. As noted in Armitage Trend Test, T_A indicates the direction of the effect.

(Pearson) Chi-Squared Test

This is the most commonly used way to obtain a p-value for (the extremeness of) an (unordered) m \times n contingency table. It tests the null hypothesis that the rows and columns of the table are independent, that is, that the proportions in the rows and columns differ from those implied by the marginal row and column totals only by chance.

If the contingency table with elements x_{ij} has N observations, we make an “expected” contingency table based on the marginal totals with elements

e_{ij} = \frac{r_i c_j}{N},

where r_i are the row totals and c_j are the column totals.

We then obtain a p-value from the fact that

\chi^2 = \sum{\frac{\left(x_{ij} - e_{ij}\right)^2}{e_{ij}}}

approximates a chi-squared distribution with (m-1)(n-1) degrees of freedom.

For the 2 \times 2, 2 \times 3, and 2 \times 4 tables for which this technique is used in SVS, the degrees of freedom are 1, 2 and 3, respectively.

Note

For the cases that use the 2 \times 2 table, the correlation R is defined and may be computed as

R = \frac{x_{11}x_{00} - x_{10}x_{01}}{\sqrt{r_0 r_1 c_0 c_1}} .

R indicates the direction of the effect.
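
A minimal sketch of the uncorrected test follows; the function name is an assumption, and scipy.stats.chi2_contingency with correction=False gives the same statistic and p-value.

    import numpy as np
    from scipy.stats import chi2

    def pearson_chi_squared(table):
        """Uncorrected Pearson chi-squared test for an m x n contingency table."""
        obs = np.asarray(table, dtype=float)
        row_totals = obs.sum(axis=1)
        col_totals = obs.sum(axis=0)
        n = obs.sum()
        expected = np.outer(row_totals, col_totals) / n
        stat = ((obs - expected) ** 2 / expected).sum()
        df = (obs.shape[0] - 1) * (obs.shape[1] - 1)
        return stat, chi2.sf(stat, df)

    # 2 x 2 example; scipy.stats.chi2_contingency(table, correction=False) agrees
    print(pearson_chi_squared([[20, 30], [40, 10]]))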

(Pearson) Chi-Squared Test with Yates’ Correction

This test is almost the same as the Pearson Chi-Squared test. The difference is that a correction is made to compensate for the fact that the contingency table uses discrete integer values.

Both this test and the Pearson Chi-Squared test itself obtain a p-value for (the extremeness of) an (unordered) m \times n contingency table, testing the null hypothesis that the rows and columns of the table are independent, that is, that the proportions in the rows and columns differ from those implied by the marginal row and column totals only by chance.

If the contingency table with elements x_{ij} has N observations, our “expected” contingency table based on the marginal totals has elements

e_{ij} = \frac{r_i c_j}{N},

where r_i are the row totals and c_j are the column totals.

The estimated \chi^2 value from the Pearson Chi-Squared test with Yates’ correction is

\chi^2 = \sum{\frac{(|x_{ij} - e_{ij}| - 0.5)^2}{e_{ij}}}.

This approximates a chi-squared distribution with (m-1)(n-1) degrees of freedom which is used to obtain a p-value for this test.

For the 2 \times 2, 2 \times 3, and 2 \times 4 tables for which this technique is used in SVS, the degrees of freedom are 1, 2 and 3, respectively.

Note

For the cases that use the 2 \times 2 table, the correlation R is defined and may be computed as

R = \frac{x_{11}x_{00} - x_{10}x_{01}}{\sqrt{r_0 r_1 c_0 c_1}} .

R indicates the direction of the effect.
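
A short sketch that applies the 0.5 continuity correction cell by cell, as in the formula above, is shown below. Note that scipy.stats.chi2_contingency applies Yates’ correction only to tables with one degree of freedom, so this direct translation is used for illustration; the function name is an assumption.

    import numpy as np
    from scipy.stats import chi2

    def yates_chi_squared(table):
        """Pearson chi-squared with the 0.5 continuity correction applied to every cell."""
        obs = np.asarray(table, dtype=float)
        expected = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / obs.sum()
        stat = (((np.abs(obs - expected) - 0.5) ** 2) / expected).sum()
        df = (obs.shape[0] - 1) * (obs.shape[1] - 1)
        return stat, chi2.sf(stat, df)

    print(yates_chi_squared([[20, 30], [40, 10]]))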

Fisher’s Exact Test

The output of this test is the sum of the probabilities of all contingency tables whose marginal sums are the same as those of the observed contingency table and which are as extreme as or more extreme than (equally probable or less probable than) the observed contingency table.

The probability of a 2 \times r contingency table with elements x_{ij}, row totals r_1 and r_2, column totals c_1, \ldots, c_r, and N total observations is given by

p = \frac{\left(r_1!r_2!\right)\left(c_1!c_2!\ldots c_r!\right)}{x_{11}!x_{12}!\ldots x_{2r}!N!}

To reduce the amount of computation, techniques developed by Mehta and Patel [MehtaAndPatel1983] are used for computing Fisher’s Exact Test.

Note

For the cases that use the 2 \times 2 table, the correlation R is defined and may be computed as noted in (Pearson) Chi-Squared Test. R indicates the direction of the effect.
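
For the 2 \times 2 case only, scipy.stats.fisher_exact computes this two-sided p-value directly, as sketched below; the general 2 \times r computation with the Mehta and Patel techniques used in SVS is not reproduced here.

    from scipy.stats import fisher_exact

    table = [[8, 2],
             [1, 5]]
    result = fisher_exact(table)      # (odds ratio, two-sided p-value)
    print(result)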

Odds Ratio with Confidence Limits

For the purposes of this method’s description, we define a 2 \times 2 contingency table as being organized as “(Case/Control) vs. (Yes/No)”, as demonstrated in the table below.

             Yes                        No                         Total
  Case       y_{case}                   n_{case}                   y_{case} + n_{case}
  Control    y_{control}                n_{control}                y_{control} + n_{control}
  Total      y_{case} + y_{control}     n_{case} + n_{control}     N

The odds ratio is defined as the ratio of the odds for “Case” among the Yes values to the odds for “Case” among the No values, or equivalently the ratio of the odds for “Yes” among the cases to the odds for “Yes” among the controls, or equivalently

OR = \frac{y_{case} n_{control}}{n_{case} y_{control}}.

To obtain confidence limits, we use the standard error of \log(OR), which is

s = \sqrt{\frac{1}{y_{case}} + \frac{1}{y_{control}} + \frac{1}{n_{case}} + \frac{1}{n_{control}}}.

The 95% confidence interval then ranges from e^{\log(OR)-1.96s} to e^{\log(OR)+1.96s}.
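
A minimal sketch of this computation follows; the function name and argument names mirror the table above and are assumptions for the example.

    import math

    def odds_ratio_ci(y_case, n_case, y_control, n_control, z=1.96):
        """Odds ratio and 95% confidence limits for a (Case/Control) vs. (Yes/No) table."""
        odds_ratio = (y_case * n_control) / (n_case * y_control)
        s = math.sqrt(1.0 / y_case + 1.0 / y_control + 1.0 / n_case + 1.0 / n_control)
        lower = math.exp(math.log(odds_ratio) - z * s)
        upper = math.exp(math.log(odds_ratio) + z * s)
        return odds_ratio, lower, upper

    print(odds_ratio_ci(y_case=30, n_case=20, y_control=15, n_control=35))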

Analysis of Deviance

This is a maximum-likelihood based technique for analyzing a case-control contingency table with k columns. Let n be the total number of observations, s be the proportion of cases in the entire sample, n_j be the number of observations in column j of the contingency table, and p_j be the proportion of cases in column j. Then, to perform an analysis of deviance test, we define

F_0 = -2n(s \log(s) + (1-s)\log(1-s))

and

F_k = \sum^k_{j=1}{\left[-2n_j(p_j\log(p_j) + (1-p_j)\log(1-p_j))\right]}.

The test statistic is then F_0 - F_k, which approximates a chi-squared distribution with k-1 degrees of freedom. A p-value is then obtained based on this chi-squared approximation.

Note

For the cases that use a contingency table with just two columns, the correlation R is defined and may be computed as noted in (Pearson) Chi-Squared Test. R indicates the direction of the effect.
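
The sketch below follows the definitions of F_0 and F_k above, treating 0 \log(0) as zero when a column contains only cases or only controls; the function names are assumptions for the example.

    import math
    from scipy.stats import chi2

    def _deviance(count, prop):
        """-2 * count * (p*log(p) + (1-p)*log(1-p)), with 0*log(0) taken as 0."""
        total = 0.0
        for p in (prop, 1.0 - prop):
            if p > 0.0:
                total += p * math.log(p)
        return -2.0 * count * total

    def analysis_of_deviance(n_cases, n_controls):
        """n_cases/n_controls: case and control counts per column of the table."""
        k = len(n_cases)
        n = sum(n_cases) + sum(n_controls)
        s = sum(n_cases) / n
        f0 = _deviance(n, s)
        fk = sum(_deviance(a + b, a / (a + b)) for a, b in zip(n_cases, n_controls))
        stat = f0 - fk
        return stat, chi2.sf(stat, df=k - 1)

    print(analysis_of_deviance(n_cases=[40, 40, 20], n_controls=[60, 30, 10]))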

F-Test

The F-Test applies to a quantitative trait being subdivided into two or more groups according to the category of the predictor variable.

This test is on whether the distributions of the dependent variable within each category are significantly different between the various categories of the predictor variable. Another way to phrase this question is whether the variation of the trait between the categories is substantial by comparison to the variation of the trait within the categories.

If there are n observations x_i subdivided into k groups, we define

F_0 = \sum_{observations} {\left(x_i - \bar{x}\right)^2} = \sum_{observations}{x^2_i} - n\bar{x}^2,

and

F_k = \sum_{groups}{\left(\sum_{group}{x^2_i} - \frac{\left(\sum_{group}{x_i}\right)^2}{n_{group}}\right)}.

If v_1 = (k - 1) and v_2 = (n-k), then

\frac{F_0 - F_k}{v_1}

is proportional to the variance between the groups, and

\frac{F_k}{v_2}

is proportional to the variance within the groups. The F statistic becomes

F = \frac{\left(F_0 - F_k\right)/v_1}{F_k/v_2} = \frac{\left(F_0 - F_k\right)v_2}{F_kv_1}.

The p-value is obtained as one minus the probability of observing a value less than the F statistic under an F_{v_{1},v_{2}} distribution (where v_1 is the numerator degrees of freedom and v_2 is the denominator degrees of freedom).

p-value = P\left(X \geq \left|F_{statistic}\right|\right) \text{ where } X \sim F_{v_{1},v_{2}}

Note

For the cases where there are just two categories, the change in the dependent average in going from category/group 1 to category/group 2,

\Delta = \frac{\sum_{group 2}^{n_2} x_i}{n_2} - \frac{\sum_{group 1}^{n_1} x_i}{n_1} ,

may be calculated. This change \Delta indicates the direction of the effect.
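
A minimal sketch following the F_0 and F_k formulas above is shown below; the function name is an assumption, and scipy.stats.f_oneway(*groups) yields the same F statistic and p-value.

    from scipy.stats import f as f_dist

    def f_test(groups):
        """groups: list of lists of quantitative observations, one list per category."""
        all_obs = [x for g in groups for x in g]
        n, k = len(all_obs), len(groups)
        mean = sum(all_obs) / n
        f0 = sum(x * x for x in all_obs) - n * mean ** 2                 # total sum of squares
        fk = sum(sum(x * x for x in g) - sum(g) ** 2 / len(g) for g in groups)  # within-group SS
        v1, v2 = k - 1, n - k
        f_stat = (f0 - fk) * v2 / (fk * v1)
        return f_stat, f_dist.sf(f_stat, v1, v2)

    print(f_test([[1.2, 0.8, 1.0, 1.4], [2.1, 1.9, 2.3], [0.9, 1.1, 1.3, 0.7]]))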

Linear Regression

See Linear Regression.

Logistic Regression

See Logistic Regression.

Statistics for Numeric Association Tests

Correlation/Trend Test

The Correlation/Trend Test tests the significance of any correlation between two numeric variables (or two variables which have been encoded as numeric variables). Equivalently, it tests for any “trend” that one of the numeric variables exhibits against the other.

If we have n pairs of observations (x_i,\: y_i), the (signed) correlation R between them is

R = \frac{cov(x,\: y)}{\sqrt{var(x)var(y)}} = \frac{\sum{x_i y_i} - \frac{1}{n}\sum{x_i}\sum{y_i}}{\sqrt{\left(\sum{x^2_i}-\frac{1}{n}\left(\sum{x_i}\right)^2\right)\left(\sum{y^2_i}-\frac{1}{n}\left(\sum{y_i}\right)^2\right)}}.

Meanwhile,

\chi^2 = (n-1)R^2

follows an approximate chi-squared distribution with one degree of freedom, from which a p-value may be obtained.

Note

This correlation/trend test is also available to be used after PCA correction. However, the formula for the chi-squared statistic is instead \chi^2 = (n-1-k)R^2, where k is the number of principal components that have been removed from the data. The premise is that the PCA correction has removed k degrees of freedom from the data, and only the remaining degrees need to be tested.

T-Test

The T-Test is a special form of the F-Test in which distributions in only two categories are being compared. (The T statistic is the square root of the corresponding F statistic for two categories.)

In the CNV Association Test, the T-Test is used for a quantitative predictor (independent variable) and a case/control (binary) dependent variable.

The test is on whether the distributions of the quantitative predictor within the two categories of case versus control are significantly different. Another way to phrase this question is whether the variation of the predictor between the categories is substantial by comparison to the variation of the predictor within the categories.

If there are n_{t} observations x_{ti} corresponding to a true dependent variable value and n_f observations x_{fi} corresponding to a false dependent variable value, we define

S_t =\sum{x_{ti}}
S_f =\sum{x_{fi}}
S_q =\sum_{observations}{x^2_i}

Then,

S_d = \frac{S_q - \frac{S^2_t}{n_t} - \frac{S^2_f}{n_f}}{n_t + n_f - 2}.

If S_d is less than a threshold (10^{-6}), then the p-value returned is 1.0. Otherwise,

T = \frac{\frac{S_t}{n_t}-\frac{S_f}{n_f}}{\sqrt{\frac{S_d}{n_t}+\frac{S_d}{n_f}}}

The p-value may be calculated on the basis of this T value as a “two-sided p-value” using Student’s t distribution with n_t+n_f-2 degrees of freedom.
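
The sketch below follows the S_t, S_f, S_q, S_d, and T formulas above, including the small threshold on S_d; the function name is an assumption for the example.

    from scipy.stats import t as t_dist

    def two_sample_t_test(x_true, x_false):
        """x_true/x_false: predictor values for observations with true/false dependent value."""
        n_t, n_f = len(x_true), len(x_false)
        s_t, s_f = sum(x_true), sum(x_false)
        s_q = sum(x * x for x in x_true) + sum(x * x for x in x_false)
        s_d = (s_q - s_t ** 2 / n_t - s_f ** 2 / n_f) / (n_t + n_f - 2)
        if s_d < 1e-6:
            return 0.0, 1.0
        t_stat = (s_t / n_t - s_f / n_f) / (s_d / n_t + s_d / n_f) ** 0.5
        p_value = 2.0 * t_dist.sf(abs(t_stat), df=n_t + n_f - 2)
        return t_stat, p_value

    print(two_sample_t_test([2.3, 1.9, 2.8, 2.5], [1.2, 1.6, 1.4, 1.1, 1.5]))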

False Discovery Rate

When testing multiple hypotheses, there is always the possibility one or more tests have appeared significant just by chance. Various techniques have been proposed to adjust the p-values or to otherwise correct for multiple testing issues. Among these are the Bonferroni adjustment and the False Discovery Rate. The following discussion and technique is used in Golden Helix SVS specifically to correct for multiple testing over many different predictors.

Suppose that m hypotheses are tested, and R of them are rejected (positive results). Of the rejected hypotheses, suppose that V of them are really false positive results, that is V is the number of type I errors. The False Discovery Rate is defined as

FDR = E\left(\frac{V}{R}\vert R>0\right)Pr(R>0),

that is, the expected proportion of false positive findings among all rejected hypotheses times the probability of making at least one rejection.

Suppose we are rejecting (the null hypothesis) on the basis of the p-values p_1,\ldots,p_m from these m tests, specifically, when a p-value is less than or equal to a parameter \gamma. If we can treat the p-values as being independent, then we can estimate Pr(P \leq \gamma) as

\widehat{Pr}(P \leq \gamma) = \frac{\max(R(\gamma),1)}{m},

where R(\gamma) is the number of p_i less than or equal to \gamma, and use this to estimate the False Discovery Rate FDR as

\widehat{FDR}(\gamma) = \frac{\gamma}{\widehat{Pr}(P \leq \gamma)}.

When this is computed for \gamma equal to any particular p-value, these expressions simplify to

\widehat{Pr}(P \leq \gamma) = \frac{R(\gamma)}{m},

and

\widehat{FDR}(\gamma) = \frac{m\gamma}{R(\gamma)} = \frac{m\gamma}{j},

where j is the number of p-values less than or equal to \gamma.

See [Storey2002]. (We use \pi_0 = 1 here.)
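
As a small worked example of the simplified expression above (with \pi_0 = 1), the sketch below evaluates \widehat{FDR}(\gamma) at each observed p-value; the function name is an assumption.

    def fdr_estimates(p_values):
        """FDR-hat(gamma) = m * gamma / R(gamma), evaluated at each observed p-value."""
        m = len(p_values)
        results = []
        for p in sorted(p_values):
            r = sum(1 for q in p_values if q <= p)   # R(gamma): number of p-values <= gamma
            results.append((p, m * p / r))
        return results

    for p, fdr in fdr_estimates([0.001, 0.008, 0.039, 0.041, 0.27, 0.62]):
        print(p, fdr)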