# General Statistics

## General Marker Statistics

The following subsections further explain the methods used in obtaining
General Marker Statistics, which may be invoked using a separate window
(*Genotype Statistics by Marker*) or as a tab in the Genotypic Association
Test dialog (*Genotype Association Tests*).

### Hardy-Weinberg Equilibrium Computation

The HWE p-value measures the strength of evidence against the null hypothesis that the marker follows Hardy-Weinberg Equilibrium. Large p-values are consistent with the marker following HWE.

Suppose we have a marker with $m$ alleles having frequencies $p_1, \ldots, p_m$. We may write the genotype count for alleles $i$ and $j$ as $n_{ij}$. Due to phase ambiguity, if $i \neq j$, we count occurrences of allele $i$ on the first chromosome and allele $j$ on the second chromosome, along with occurrences of allele $j$ on the first chromosome and allele $i$ on the second chromosome, in both the notations $n_{ij}$ and $n_{ji}$.

Thus, we may write the count for allele $i$ as $n_i = 2 n_{ii} + \sum_{j \neq i} n_{ij}$. We may also express the genotype frequency for allele $i$ occurring homozygously as $p_{ii} = n_{ii}/n$, and the genotype frequency for heterozygous alleles $i$ and $j$ as $p_{ij} = n_{ij}/n$, where $n$ is the population count. The frequency of allele $i$ may be expressed as:

$$p_i = \frac{n_i}{2n} = p_{ii} + \frac{1}{2} \sum_{j \neq i} p_{ij}$$

We wish to check the agreement of $p_{ii}$ with $p_i^2$ and the agreement of $p_{ij}$, where $i \neq j$, with $2 p_i p_j$. We multiply by two to deal with the phase ambiguity (see above).

Thus, we will define the Hardy-Weinberg equilibrium coefficient or $D_{ij}$ for alleles $i$ and $j$ such that

$$p_{ii} = p_i^2 + D_{ii} \quad \text{and} \quad p_{ij} = 2 p_i p_j - 2 D_{ij} \;\; (i \neq j).$$

(It may be shown that for a bi-allelic marker, $D_{11} = D_{12} = D_{22}$.)

We then have a chi-squared distribution with $m(m-1)/2$ degrees of freedom,

$$X^2 = n \left[ \sum_i \frac{(p_{ii} - p_i^2)^2}{p_i^2} + \sum_{i < j} \frac{(p_{ij} - 2 p_i p_j)^2}{2 p_i p_j} \right].$$

From this, we obtain the distribution's p-value $p_{HWE}$, and the correlation, $R$, from the inverse distribution for one degree of freedom (where $\chi^2_1$ is the chi-squared distribution with one degree of freedom), which is

$$R = \sqrt{\frac{(\chi^2_1)^{-1}(1 - p_{HWE})}{n}}.$$
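
As a concrete illustration for the bi-allelic case, here is a minimal Python sketch of this chi-squared computation. The function name and argument conventions are our own, not SVS's; the one-degree-of-freedom p-value uses the identity $P(\chi^2_1 > x) = \mathrm{erfc}(\sqrt{x/2})$.

```python
import math

def hwe_chi_squared(n_AA, n_Aa, n_aa):
    """Bi-allelic HWE chi-squared statistic and p-value (1 df),
    from the three genotype counts. A sketch of the formulas above."""
    n = n_AA + n_Aa + n_aa
    p_AA, p_Aa, p_aa = n_AA / n, n_Aa / n, n_aa / n
    p = p_AA + p_Aa / 2.0              # frequency of allele A
    q = 1.0 - p                        # frequency of allele a
    x2 = n * ((p_AA - p * p) ** 2 / (p * p)
              + (p_Aa - 2 * p * q) ** 2 / (2 * p * q)
              + (p_aa - q * q) ** 2 / (q * q))
    p_value = math.erfc(math.sqrt(x2 / 2.0))   # chi-squared tail, 1 df
    return x2, p_value
```

For genotype counts exactly in HWE proportions (for example 36/48/16, giving $p = 0.6$), the statistic is zero and the p-value is 1.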

### Fisher’s Exact Test HWE P-Values

In this test, all of the possible sets of genotypic counts consistent with the observed allele totals are cycled through, and the probabilities of all sets of counts which are as extreme as or more extreme than the observed set of counts (equally probable or less probable) are summed.

See [Emigh1980].
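
The enumeration can be sketched directly for a bi-allelic marker. This is a simple (slow) illustration of the idea, not the optimized method of [Emigh1980]; the function name and the tolerance used when comparing probabilities are our own.

```python
from math import factorial

def hwe_exact_p(n_AA, n_Aa, n_aa):
    """Fisher's exact HWE p-value for a bi-allelic marker: sum the
    probabilities of all genotype tables with the observed allele
    totals that are no more probable than the observed table."""
    n = n_AA + n_Aa + n_aa
    n_A = 2 * n_AA + n_Aa              # allele totals are held fixed

    def prob(het):                     # P(het heterozygotes | n, n_A)
        hom_A = (n_A - het) // 2
        hom_a = n - hom_A - het
        return (factorial(n) * 2 ** het
                * factorial(n_A) * factorial(2 * n - n_A)
                / (factorial(hom_A) * factorial(het) * factorial(hom_a)
                   * factorial(2 * n)))

    p_obs = prob(n_Aa)
    total = 0.0
    # the heterozygote count must have the same parity as n_A
    for het in range(n_A % 2, min(n_A, 2 * n - n_A) + 1, 2):
        p = prob(het)
        if p <= p_obs * (1 + 1e-9):    # as extreme or more extreme
            total += p
    return min(total, 1.0)
```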

### Signed HWE Correlation R

Note

This statistic applies only to bi-allelic markers.

We define the signed HWE correlation $R$ as

$$R = \frac{n\, n_{DD} - \left( n_{DD} + \frac{1}{2} n_{Dd} \right)^2}{\left( n_{DD} + \frac{1}{2} n_{Dd} \right) \left( n - n_{DD} - \frac{1}{2} n_{Dd} \right)},$$

where $n$ is the total genotype count and $n_{DD}$ and $n_{Dd}$ are the counts for genotypes DD and Dd, respectively.

This is derived from the formula for the (signed) correlation between two sets of observations, $x_k$ and $y_k$,

$$r = \frac{\sum_k (x_k - \bar{x})(y_k - \bar{y})}{\sqrt{\sum_k (x_k - \bar{x})^2 \sum_k (y_k - \bar{y})^2}},$$

where we take $x_k$ to be 0 if the first allele is d and 1 if the first allele is D, and $y_k$ to be 0 if the second allele is d and 1 if the second allele is D.

Because of phase ambiguity, we set each of the counts of (d, D) and (D, d) to be one-half of the (phase-ambiguous) observed count of Dd. The correlation then simplifies to the formula first given above.

If there is a high homozygous count, $x_k$ and $y_k$ will often be 1 or often be 0 at the same time, and therefore there will be a positive correlation between the $x_k$ and the $y_k$. Similarly, if there is a high heterozygous count, $x_k$ and $y_k$ will often be 1 at opposite times, causing an anti-correlation to exist.
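
The simplified formula can be checked numerically against the brute-force Pearson correlation over the reconstructed indicator pairs. The sketch below (function name ours) computes $R$ directly from the genotype counts.

```python
def signed_hwe_r(n_DD, n_Dd, n_dd):
    """Signed HWE correlation for a bi-allelic marker, from the
    simplified closed form above; h = n_DD + n_Dd/2 equals n * p_D."""
    n = n_DD + n_Dd + n_dd
    h = n_DD + n_Dd / 2.0
    return (n * n_DD - h * h) / (h * (n - h))
```

Splitting the Dd count evenly between (d, D) and (D, d) pairs and computing the Pearson correlation of the resulting 0/1 sequences reproduces the same value.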

### Minor Allele Frequency (MAF)

The minor allele frequency is the fraction of the total alleles of the given marker that are minor alleles.

## Statistics Available for Genotype Association Tests

### Correlation/Trend Test

The Correlation/Trend Test tests the significance of any correlation between two numeric variables (or two variables which have been encoded as numeric variables). Equivalently, this test may be thought of as testing for any “trend” in one of the numeric variables against the other.

If we have $n$ pairs of observations $(x_k, y_k)$, the (signed) correlation between them is

$$r = \frac{\sum_k (x_k - \bar{x})(y_k - \bar{y})}{\sqrt{\sum_k (x_k - \bar{x})^2 \sum_k (y_k - \bar{y})^2}}$$

Meanwhile,

$$X^2 = (n - 1) r^2$$

follows an approximate chi-squared distribution with one degree of freedom, from which a p-value may be obtained.

Note

- In the special case of the additive model (and no PCA correction) for a case/control study, if we were to use, instead of the above formula, $X^2 = n r^2$, we would have the mathematical equivalent of the Armitage Trend Test.
- This correlation/trend test is also available to be used after PCA correction. However, the formula for the chi-squared statistic is instead $X^2 = (n - k - 1) r^2$, where $k$ is the number of principal components that have been removed from the data. The premise is that PCA correction has removed $k$ degrees of freedom from the data, and only the remaining degrees need to be tested.
- $r$ is a signed value and indicates the effect direction.
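
A minimal Python sketch of this test follows (names ours; the one-degree-of-freedom chi-squared p-value is again obtained via `math.erfc`). Passing `k_removed` applies the $(n - k - 1)$ factor described in the note.

```python
import math

def correlation_trend_test(xs, ys, k_removed=0):
    """Correlation/trend test: X^2 = (n - k - 1) * r^2, where k is the
    number of principal components removed (k = 0 gives (n - 1) r^2)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    r = sxy / math.sqrt(sxx * syy)
    x2 = (n - k_removed - 1) * r * r
    p = math.erfc(math.sqrt(x2 / 2.0))   # chi-squared p-value, 1 df
    return r, x2, p
```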

### Armitage Trend Test

The Armitage Trend Test tests the “trend” in an ordered case/control contingency table. In Golden Helix SVS, the ordering is by the number of minor alleles in the genotype: zero, one, or two.

Let $r_0$, $r_1$, and $r_2$ be the counts for cases with 0, 1 and 2 minor alleles, respectively, and $s_0$, $s_1$, and $s_2$ be the counts for controls with 0, 1 and 2 minor alleles, respectively. Also, let $n_i = r_i + s_i$, $R = r_0 + r_1 + r_2$, and $S = s_0 + s_1 + s_2$.

If we let $N = R + S$ be the total count,

$$\bar{x} = \frac{n_1 + 2 n_2}{N},$$

$$\bar{y} = \frac{R}{N},$$

$$s_{xy} = \frac{r_1 + 2 r_2}{N} - \bar{x} \bar{y},$$

$$s_{xx} = \frac{n_1 + 4 n_2}{N} - \bar{x}^2, \text{ and}$$

$$s_{yy} = \bar{y} (1 - \bar{y}),$$

then the prediction equation under ordinary least-squares fit is

$$\hat{y} = \bar{y} + \frac{s_{xy}}{s_{xx}} (x - \bar{x}).$$

The statistic for the Armitage Trend Test is

$$X^2 = \frac{N s_{xy}^2}{s_{xx} s_{yy}},$$

which is asymptotically chi-squared with one degree of freedom. This is used to obtain the chi-squared based p-value for this test.

Note

The trend statistic itself may be formulated as

$$b = \frac{s_{xy}}{s_{xx}}.$$

This trend statistic indicates the direction of the effect.
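
Putting the definitions together, here is a Python sketch (function name and argument conventions ours): case and control counts are passed as triples indexed by minor-allele count.

```python
import math

def armitage_trend_test(r, s):
    """Armitage trend test from case counts r = (r0, r1, r2) and
    control counts s = (s0, s1, s2), indexed by minor-allele count."""
    N = sum(r) + sum(s)
    R = sum(r)
    n = [ri + si for ri, si in zip(r, s)]
    xbar = (n[1] + 2 * n[2]) / N           # mean genotype score
    ybar = R / N                           # mean case indicator
    sxy = (r[1] + 2 * r[2]) / N - xbar * ybar
    sxx = (n[1] + 4 * n[2]) / N - xbar ** 2
    syy = ybar * (1 - ybar)
    x2 = N * sxy ** 2 / (sxx * syy)        # equals N r^2
    b = sxy / sxx                          # trend statistic (direction)
    p = math.erfc(math.sqrt(x2 / 2.0))     # chi-squared p-value, 1 df
    return x2, b, p
```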

### Exact Form of Armitage Test

The exact form of this test yields the exact probability under the null hypothesis of having a “trend” at least as extreme as the one observed, assuming an equal probability of any permutation of the dependent variable.

To perform the exact Armitage test, we define the trend score for the contingency table as

$$T = r_1 + 2 r_2,$$

where $r_1$ and $r_2$ are the case counts for genotypes with one and two minor alleles, respectively.

The exact permutation p-value is evaluated as

$$p = \sum_{t \,:\, |T_t - E[T]| \,\geq\, |T_{obs} - E[T]|} P(t),$$

where the sum is over the tables $t$ having the same margins as the observed table, and

$$P(t) = \frac{\binom{n_0}{r_0} \binom{n_1}{r_1} \binom{n_2}{r_2}}{\binom{N}{R}}.$$

Note

The trend statistic for the observed data may be formulated in this context as

$$b \propto T_{obs} - E[T],$$

where $T_{obs}$ is the value of $T$ for the table created
from the observed data, and $E[T] = R (n_1 + 2 n_2) / N$, where $N$ is the total
number of observations. As noted in *Armitage Trend Test*, $b$
indicates the direction of the effect.
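
A direct enumeration sketch of the exact test follows (slow but simple; names and tolerance ours, and this is not SVS's internal implementation): every case allocation consistent with the observed margins is visited, and the hypergeometric probabilities of allocations at least as far from the expected trend score are summed.

```python
from math import comb

def exact_armitage_p(r, s):
    """Exact (permutation) trend p-value from case counts r and
    control counts s, each indexed by minor-allele count 0, 1, 2."""
    n = [ri + si for ri, si in zip(r, s)]
    N, R = sum(n), sum(r)
    expect = R * (n[1] + 2 * n[2]) / N     # E[T] under permutation
    obs_dev = abs(r[1] + 2 * r[2] - expect)
    total = 0
    for a1 in range(min(R, n[1]) + 1):
        for a2 in range(min(R - a1, n[2]) + 1):
            a0 = R - a1 - a2
            if a0 > n[0]:
                continue
            if abs(a1 + 2 * a2 - expect) >= obs_dev - 1e-9:
                total += comb(n[0], a0) * comb(n[1], a1) * comb(n[2], a2)
    return total / comb(N, R)
```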

### (Pearson) Chi-Squared Test

This is the most often used way to obtain a p-value for (the extremeness of) an (unordered) contingency table. It addresses whether to reject the null hypothesis that the rows and columns are independent, that is, that the proportions in the rows and columns of the table differ from the proportions of the margin column totals and the margin row totals, respectively, only by chance.

If the contingency table with elements $o_{ij}$ has $n$ observations, we make an “expected” contingency table based on the marginal totals with elements

$$e_{ij} = \frac{R_i C_j}{n},$$

where $R_i$ are the row totals and $C_j$ are the column totals.

We then obtain a p-value from the fact that

$$X^2 = \sum_{ij} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$$

approximates a chi-squared distribution with $(r - 1)(c - 1)$ degrees of freedom, where $r$ is the number of rows and $c$ is the number of columns.

For the $2 \times 2$, $2 \times 3$, and $2 \times 4$ tables for which this technique is used in SVS, the degrees of freedom are 1, 2 and 3, respectively.

Note

For the cases that use the $2 \times 2$ table, the correlation $r$ is defined and may be computed as

$$r = \frac{o_{11} o_{22} - o_{12} o_{21}}{\sqrt{R_1 R_2 C_1 C_2}}.$$

$r$ indicates the direction of the effect.
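
For the $2 \times 2$ case, a Python sketch combining the statistic, the correlation, and the one-degree-of-freedom p-value (names ours):

```python
import math

def pearson_chi_squared_2x2(o):
    """Pearson chi-squared and phi correlation for a 2x2 table o[i][j],
    following the expected-table construction above."""
    R = [sum(row) for row in o]                     # row totals
    C = [o[0][j] + o[1][j] for j in range(2)]       # column totals
    n = sum(R)
    x2 = sum((o[i][j] - R[i] * C[j] / n) ** 2 / (R[i] * C[j] / n)
             for i in range(2) for j in range(2))
    r = (o[0][0] * o[1][1] - o[0][1] * o[1][0]) / math.sqrt(
        R[0] * R[1] * C[0] * C[1])
    p = math.erfc(math.sqrt(x2 / 2.0))              # 1 degree of freedom
    return x2, r, p
```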

### (Pearson) Chi-Squared Test with Yates’ Correction

This test is almost the same as the Pearson Chi-Squared test. The difference is that a correction is made to compensate for the fact that the contingency table uses discrete integer values.

Both this test and the Pearson Chi-Squared test itself obtain a p-value for (the extremeness of) an (unordered) contingency table, addressing whether to reject the null hypothesis that the rows and columns are independent, that is, that the proportions in the rows and columns of the table differ from the proportions of the margin column totals and the margin row totals, respectively, only by chance.

If the contingency table with elements $o_{ij}$ has $n$ observations, our “expected” contingency table based on the marginal totals has elements

$$e_{ij} = \frac{R_i C_j}{n},$$

where $R_i$ are the row totals and $C_j$ are the column totals.

The estimated value from the Pearson Chi-Squared test with Yates’ correction is

$$X_Y^2 = \sum_{ij} \frac{\left( |o_{ij} - e_{ij}| - 0.5 \right)^2}{e_{ij}}.$$

This approximates a chi-squared distribution with $(r - 1)(c - 1)$ degrees of freedom, which is used to obtain a p-value for this test.

For the $2 \times 2$, $2 \times 3$, and $2 \times 4$ tables for which this technique is used in SVS, the degrees of freedom are 1, 2 and 3, respectively.

Note

For the cases that use the $2 \times 2$ table, the correlation $r$ is defined and may be computed as

$$r = \frac{o_{11} o_{22} - o_{12} o_{21}}{\sqrt{R_1 R_2 C_1 C_2}}.$$

$r$ indicates the direction of the effect.
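
A sketch of the corrected statistic for a $2 \times 2$ table (names ours; the 0.5 is the continuity term):

```python
import math

def yates_chi_squared_2x2(o):
    """Pearson chi-squared with Yates' continuity correction for a
    2x2 table o[i][j], following the formula above."""
    R = [sum(row) for row in o]
    C = [o[0][j] + o[1][j] for j in range(2)]
    n = sum(R)
    x2 = 0.0
    for i in range(2):
        for j in range(2):
            e = R[i] * C[j] / n
            x2 += (abs(o[i][j] - e) - 0.5) ** 2 / e
    p = math.erfc(math.sqrt(x2 / 2.0))   # 1 degree of freedom
    return x2, p
```

For the table `[[10, 20], [20, 10]]` the uncorrected statistic is about 6.67, while the corrected statistic is smaller, as expected.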

### Fisher’s Exact Test

The output of this test is the sum of the probabilities of all contingency tables whose marginal sums are the same as those of the observed contingency table and which are as extreme as or more extreme than the observed contingency table (that is, equally probable or less probable).

The probability of a contingency table with elements $o_{ij}$, row totals $R_i$, and column totals $C_j$ is given by

$$P = \frac{\prod_i R_i! \; \prod_j C_j!}{n! \; \prod_{ij} o_{ij}!},$$

where $n$ is the total number of observations.

To reduce the amount of computation, techniques developed by Mehta and Patel [MehtaAndPatel1983] are used for computing Fisher’s Exact Test.

Note

For the cases that use the $2 \times 2$ table, the
correlation $r$ is defined and may be computed as noted in
*(Pearson) Chi-Squared Test*. $r$ indicates the direction of the effect.
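
For a $2 \times 2$ table the test can be sketched by direct enumeration over the top-left cell; the Mehta-Patel network algorithm cited above is a faster approach for larger tables. Names and the comparison tolerance are ours.

```python
from math import comb

def fisher_exact_2x2(o):
    """Two-sided Fisher's exact p-value for a 2x2 table o[i][j],
    summing hypergeometric probabilities of tables with the same
    margins that are no more probable than the observed one."""
    R1, R2 = sum(o[0]), sum(o[1])
    C1 = o[0][0] + o[1][0]
    n = R1 + R2
    denom = comb(n, C1)

    def prob(a):                        # a = top-left cell
        return comb(R1, a) * comb(R2, C1 - a) / denom

    p_obs = prob(o[0][0])
    return sum(prob(a)
               for a in range(max(0, C1 - R2), min(R1, C1) + 1)
               if prob(a) <= p_obs * (1 + 1e-9))
```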

### Odds Ratio with Confidence Limits

For the purposes of this method’s description, we define a $2 \times 2$ contingency table as being organized as “(Case/Control) vs. (Yes/No)”, as demonstrated in the table below.

|         | Yes   | No    | Total |
|---------|-------|-------|-------|
| Case    | $a$   | $b$   | $a+b$ |
| Control | $c$   | $d$   | $c+d$ |
| Total   | $a+c$ | $b+d$ | $n$   |

The odds ratio is defined as the ratio of the odds for “Case” among the Yes values to the odds for “Case” among the No values, or equivalently the ratio of the odds for “Yes” among the cases to the odds for “Yes” among the controls, or equivalently

$$OR = \frac{a/c}{b/d} = \frac{a d}{b c}.$$

To obtain confidence limits, we use the standard error of $\ln(OR)$, which is

$$SE(\ln OR) = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}.$$

The 95% confidence interval then ranges from $\exp(\ln(OR) - 1.96\, SE)$ to $\exp(\ln(OR) + 1.96\, SE)$.
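
A Python sketch of the computation (names ours; `z = 1.96` gives the 95% interval):

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio ad/(bc) with the log-scale Wald confidence interval
    described above."""
    odds_ratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lo = math.exp(math.log(odds_ratio) - z * se)
    hi = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lo, hi
```

Note that the interval is symmetric on the log scale, so the product of the two limits equals $OR^2$.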

### Analysis of Deviance

This is a maximum-likelihood based technique for analyzing a case-control contingency table with $m$ columns. Let $\bar{p}$ be the proportion of cases in the entire sample, $n_j$ be the number of observations in column $j$ of the contingency table, and $p_j$ be the proportion of cases in column $j$. Then, to perform an analysis of deviance test, we define

$$\ell_0 = \sum_j n_j \left[ \bar{p} \ln \bar{p} + (1 - \bar{p}) \ln(1 - \bar{p}) \right]$$

and

$$\ell_1 = \sum_j n_j \left[ p_j \ln p_j + (1 - p_j) \ln(1 - p_j) \right].$$

The test statistic is then $G = 2 (\ell_1 - \ell_0)$, which approximates a chi-squared distribution with $m - 1$ degrees of freedom. A p-value is then obtained based on this chi-squared approximation.

Note

For the cases that use a contingency table with just two
columns, the correlation $r$ is defined and may be computed
as noted in *(Pearson) Chi-Squared Test*. $r$ indicates the direction of
the effect.
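
A sketch of the deviance computation (names ours; the convention $0 \ln 0 = 0$ is handled explicitly):

```python
import math

def analysis_of_deviance(cases, totals):
    """Likelihood-ratio (deviance) statistic G = 2 * (l1 - l0) for a
    case/control table given cases[j] and totals[j] per column."""
    def ll(k, n):                       # binomial log-likelihood term
        if k == 0 or k == n:
            return 0.0                  # 0*ln(0) treated as 0
        p = k / n
        return k * math.log(p) + (n - k) * math.log(1 - p)

    K, N = sum(cases), sum(totals)
    l0 = ll(K, N)                       # one proportion for all columns
    l1 = sum(ll(k, n) for k, n in zip(cases, totals))
    return 2.0 * (l1 - l0)              # ~ chi-squared with m - 1 df
```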

### F-Test

The F-Test applies to a quantitative trait being subdivided into two or more groups according to the category of the predictor variable.

This test assesses whether the distribution of the dependent variable differs significantly between the various categories of the predictor variable. Another way to phrase this question is whether the variation of the trait between the categories is substantial by comparison to the variation of the trait within the categories.

If there are $n$ observations $y_{ij}$ subdivided into $k$ groups, where group $i$ contains $n_i$ observations, we define

$$\bar{y}_i = \frac{1}{n_i} \sum_j y_{ij}$$

and

$$\bar{y} = \frac{1}{n} \sum_{ij} y_{ij}.$$

If $SS_B = \sum_i n_i (\bar{y}_i - \bar{y})^2$ and $SS_W = \sum_{ij} (y_{ij} - \bar{y}_i)^2$, then

$$MS_B = \frac{SS_B}{k - 1}$$

is proportional to the variance between the groups, and

$$MS_W = \frac{SS_W}{n - k}$$

is proportional to the variance within the groups. The F statistic becomes

$$F = \frac{MS_B}{MS_W}.$$

The p-value is obtained by subtracting the probability of observing the F statistic from an $F(k - 1, n - k)$ distribution (where $k - 1$ are the numerator degrees of freedom and $n - k$ are the denominator degrees of freedom) from one.

Note

For the cases where there are just two categories, the change in the dependent average in going from category/group 1 to category/group 2,

$$\Delta = \bar{y}_2 - \bar{y}_1,$$

may be calculated. This change indicates the direction of the effect.
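
A sketch of the F statistic itself (names ours); obtaining the p-value would additionally require the CDF of the F distribution, which is not reproduced here.

```python
def f_statistic(groups):
    """One-way ANOVA F statistic for a list of groups of observations,
    following the between/within sums of squares above."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2
                     for g in groups)
    ss_within = sum((y - sum(g) / len(g)) ** 2
                    for g in groups for y in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```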

### Linear Regression

See *Linear Regression*.

### Logistic Regression

See *Logistic Regression*.

## Statistics for Numeric Association Tests

### Correlation/Trend Test

The Correlation/Trend Test tests the significance of any correlation between two numeric variables (or two variables which have been encoded as numeric variables). Equivalently, this test may be thought of as testing for any “trend” in one of the numeric variables against the other.

If we have $n$ pairs of observations $(x_k, y_k)$, the (signed) correlation between them is

$$r = \frac{\sum_k (x_k - \bar{x})(y_k - \bar{y})}{\sqrt{\sum_k (x_k - \bar{x})^2 \sum_k (y_k - \bar{y})^2}}$$

Meanwhile,

$$X^2 = (n - 1) r^2$$

follows an approximate chi-squared distribution with one degree of freedom, from which a p-value may be obtained.

Note

This correlation/trend test is also available to be used after PCA correction. However, the formula for the chi-squared statistic is instead $X^2 = (n - k - 1) r^2$, where $k$ is the number of principal components that have been removed from the data. The premise is that the PCA correction has removed $k$ degrees of freedom from the data, and only the remaining degrees need to be tested.

### T-Test

The T-Test is a special form of the F-Test in which distributions in only two categories are being compared. (The T statistic is the square root of the corresponding F statistic for two categories.)

In the CNV Association Test, the T-Test is used for a quantitative predictor (independent variable) and a case/control (binary) dependent variable.

The test is on whether the distributions of the quantitative predictor within the two categories of case versus control are significantly different. Another way to phrase this question is whether the variation of the predictor between the categories is substantial by comparison to the variation of the predictor within the categories.

If there are $n_1$ observations $x_i$ corresponding to a true dependent variable value and $n_2$ observations $y_i$ corresponding to a false dependent variable value, we define

$$\bar{x} = \frac{1}{n_1} \sum_i x_i, \qquad \bar{y} = \frac{1}{n_2} \sum_i y_i, \qquad s^2 = \frac{\sum_i (x_i - \bar{x})^2 + \sum_i (y_i - \bar{y})^2}{n_1 + n_2 - 2}.$$

Then,

$$SE = \sqrt{s^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}.$$

If $SE$ is less than a small threshold, then the p-value returned is 1.0. Otherwise,

$$T = \frac{\bar{x} - \bar{y}}{SE}.$$

The p-value may be calculated on the basis of this T value as a “two-sided p-value” using Student’s t distribution with $n_1 + n_2 - 2$ degrees of freedom.
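
A sketch of the pooled T statistic (names ours); the two-sided p-value would come from Student's t CDF, which is not reproduced here.

```python
import math

def t_statistic(xs, ys):
    """Pooled two-sample T statistic as described above; the p-value
    would use Student's t with len(xs) + len(ys) - 2 df."""
    n1, n2 = len(xs), len(ys)
    mx, my = sum(xs) / n1, sum(ys) / n2
    ssd = sum((x - mx) ** 2 for x in xs) + sum((y - my) ** 2 for y in ys)
    s2 = ssd / (n1 + n2 - 2)             # pooled variance
    se = math.sqrt(s2 * (1 / n1 + 1 / n2))
    return (mx - my) / se
```

Squaring this T statistic for two groups reproduces the corresponding F statistic, as noted above.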

## False Discovery Rate

When testing multiple hypotheses, there is always the possibility that one or more tests appear significant just by chance. Various techniques have been proposed to adjust the p-values or to otherwise correct for multiple testing issues. Among these are the Bonferroni adjustment and the False Discovery Rate. The following discussion and technique is used in Golden Helix SVS specifically to correct for multiple testing over many different predictors.

Suppose that $m$ hypotheses are tested, and $R$ of them are rejected (positive results). Of the rejected hypotheses, suppose that $V$ of them are really false positive results; that is, $V$ is the number of type I errors. The False Discovery Rate is defined as

$$FDR = E\!\left[ \frac{V}{R} \,\middle|\, R > 0 \right] \Pr(R > 0),$$

that is, the expected proportion of false positive findings among all rejected hypotheses times the probability of making at least one rejection.

Suppose we are rejecting (the null hypothesis) on the basis of the p-values from these tests, specifically, when a p-value is less than a parameter $\gamma$. If we can treat the p-values as being independent, then we can estimate $\Pr(P \leq \gamma)$ as

$$\widehat{\Pr}(P \leq \gamma) = \frac{R(\gamma)}{m},$$

where $R(\gamma)$ is the number of $p_i$ less than or equal to $\gamma$, and use this to estimate the False Discovery Rate as

$$\widehat{FDR}(\gamma) = \frac{\pi_0 \gamma}{R(\gamma)/m}.$$

When this is computed for $\gamma$ equal to any particular p-value $p_k$, these expressions simplify to

$$\widehat{\Pr}(P \leq p_k) = \frac{k}{m}$$

and

$$\widehat{FDR}(p_k) = \frac{m\, p_k}{k},$$

where $k$ is the number of p-values less than or equal to $p_k$.

See [Storey2002]. (We use $\pi_0 = 1$ here.)
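
A sketch of this simplified estimate with $\pi_0 = 1$ (names ours); it returns a mapping from each p-value to its estimated FDR, capped at 1.

```python
def fdr_estimates(pvalues):
    """FDR estimate with pi0 = 1, as in the text:
    FDR(p_k) = m * p_k / k, where k = #{p_i <= p_k}."""
    m = len(pvalues)
    out = {}
    for p in pvalues:
        k = sum(1 for q in pvalues if q <= p)
        out[p] = min(1.0, m * p / k)
    return out
```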