3.1. General Statistics¶
3.1.1. General Marker Statistics¶
The following subsections further explain the methods used in obtaining General Marker Statistics, which may be invoked using a separate window (Genotype Statistics by Marker) or as a tab in the Genotypic Association Test dialog (Genotype Association Tests).
Hardy-Weinberg Equilibrium Computation¶
The HWE p-value measures the strength of evidence against the null hypothesis that the marker follows Hardy-Weinberg Equilibrium. Large p-values are consistent with the marker following HWE.
Suppose we have a marker with alleles $A_1, \ldots, A_k$ having frequencies $p_1, \ldots, p_k$. We may write the genotype count for alleles $i$ and $j$ as $n_{ij}$. Due to phase ambiguity, if $i \ne j$, we count occurrences of allele $i$ on the first chromosome and allele $j$ on the second chromosome, along with occurrences of allele $j$ on the first chromosome and allele $i$ on the second chromosome, in both the notations $n_{ij}$ and $n_{ji}$.
Thus, we may write the count for allele $i$ as $n_i = 2 n_{ii} + \sum_{j \ne i} n_{ij}$. We may also express the genotype frequency for allele $i$ occurring homozygously as $f_{ii} = n_{ii}/n$, and the genotype frequency for heterozygous alleles $i$ and $j$ as $f_{ij} = n_{ij}/n$, where $n$ is the population count. The frequency of allele $i$ may be expressed as:

\[ p_i = \frac{n_i}{2n} = f_{ii} + \frac{1}{2} \sum_{j \ne i} f_{ij}. \]
We wish to check the agreement of $f_{ii}$ with $p_i^2$ and the agreement of $f_{ij}$, where $i \ne j$, with $2 p_i p_j$. We multiply $p_i p_j$ by two to deal with the phase ambiguity (see above).
Thus, we will define the Hardy-Weinberg equilibrium coefficient $D_{ii}$ or $D_{ij}$ for alleles $i$ and $j$ such that

\[ f_{ii} = p_i^2 + D_{ii} \quad \text{and} \quad f_{ij} = 2 p_i p_j - 2 D_{ij} \ (i \ne j). \]

(It may be shown that for a biallelic marker, $D_{11} = D_{22} = D_{12}$.)
We then have a chi-squared distribution with $k(k-1)/2$ degrees of freedom,

\[ \chi^2 = \sum_i \frac{(n_{ii} - n p_i^2)^2}{n p_i^2} + \sum_{i < j} \frac{(n_{ij} - 2 n p_i p_j)^2}{2 n p_i p_j}. \]

From this, we obtain the distribution's p-value $P$, and the correlation, $r$, from the inverse distribution for one degree of freedom (where $F_{\chi^2_1}$ is the chi-squared distribution function for one degree of freedom), which is

\[ r = \sqrt{\frac{F^{-1}_{\chi^2_1}(1 - P)}{n}}. \]
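The biallelic case of this chi-squared HWE computation can be sketched in pure Python (a minimal illustration; the function name, and the use of `erfc` for the one-degree-of-freedom chi-squared survival function, are this sketch's own choices, not SVS code):

```python
from math import erfc, sqrt

def hwe_chi_squared(n_DD, n_Dd, n_dd):
    """Chi-squared HWE test for a biallelic marker: compares observed
    genotype counts with the counts expected under HWE (p^2, 2pq, q^2)."""
    n = n_DD + n_Dd + n_dd            # total genotype count
    p = (2 * n_DD + n_Dd) / (2 * n)   # frequency of allele D
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_DD, n_Dd, n_dd)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For a biallelic marker there is one degree of freedom; the
    # chi-squared(1) survival function is erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(chi2 / 2.0))
    return chi2, p_value

# Genotype counts exactly in HWE proportions give chi2 = 0, p = 1.
chi2, p_val = hwe_chi_squared(25, 50, 25)
```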
Fisher’s Exact Test HWE P-Values¶
In this test, all of the possible sets of genotypic counts consistent with the observed allele totals are cycled through, and the probabilities of all sets of counts which are as extreme as or more extreme than (equally probable as or less probable than) the observed set of counts are summed.
See [Emigh1980].
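For a biallelic marker, the exact test can be sketched by direct enumeration over heterozygote counts, using Levene's conditional distribution of the heterozygote count given fixed allele totals (a simplified illustration; SVS's actual implementation may differ):

```python
from math import factorial

def hwe_exact_p(n_DD, n_Dd, n_dd):
    """Exact HWE p-value for a biallelic marker (see Emigh 1980):
    sum of probabilities of all genotype configurations with the same
    allele totals that are no more probable than the observed one."""
    n = n_DD + n_Dd + n_dd
    n_D = 2 * n_DD + n_Dd             # count of D alleles
    n_d = 2 * n - n_D

    def prob(het):
        # P(het heterozygotes | fixed allele counts), Levene's formula
        hom_D = (n_D - het) // 2
        hom_d = (n_d - het) // 2
        return (factorial(n) * factorial(n_D) * factorial(n_d)
                * 2 ** het) / (factorial(hom_D) * factorial(het)
                               * factorial(hom_d) * factorial(2 * n))

    # the heterozygote count must have the same parity as n_D
    hets = range(n_D % 2, min(n_D, n_d) + 1, 2)
    p_obs = prob(n_Dd)
    return sum(p for p in map(prob, hets) if p <= p_obs * (1 + 1e-9))
```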
Signed HWE Correlation R¶
Note
This statistic applies only to biallelic markers.
We define the signed HWE correlation as

\[ r = \frac{n_{DD}/n - p_D^2}{p_D (1 - p_D)}, \]

where

\[ p_D = \frac{2 n_{DD} + n_{Dd}}{2n}, \]

$n$ is the total genotype count and $n_{DD}$ and $n_{Dd}$ are the counts for genotypes DD and Dd, respectively.
This is derived from the formula for (signed) correlation between two sets of observations, $x_i$ and $y_i$,

\[ r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}, \]

where we take the $x_i$ to be 0 if the first allele is d and 1 if the first allele is D, and the $y_i$ to be 0 if the second allele is d and 1 if the second allele is D.
Because of phase ambiguity, we set each of the counts of (d, D) and (D, d) to be one-half of the (phase-ambiguous) observed count of Dd. The correlation then simplifies to the formula first given above.
If there is a high homozygous count, $x_i$ and $y_i$ will often be 1 at the same time or often be 0 at the same time, and therefore there will be a positive correlation between the $x_i$ and the $y_i$. Similarly, if there is a high heterozygous count, $x_i$ and $y_i$ will often be 1 at opposite times, causing an anticorrelation to exist.
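The simplified biallelic formula can be sketched directly from the genotype counts (names are illustrative; this is not SVS's code):

```python
def signed_hwe_r(n_DD, n_Dd, n_dd):
    """Signed HWE correlation for a biallelic marker:
    r = (f_DD - p_D^2) / (p_D * p_d), positive for an excess of
    homozygotes, negative for an excess of heterozygotes."""
    n = n_DD + n_Dd + n_dd
    p_D = (2 * n_DD + n_Dd) / (2 * n)   # frequency of allele D
    p_d = 1.0 - p_D
    return (n_DD / n - p_D ** 2) / (p_D * p_d)
```

Counts in exact HWE proportions give r = 0; all-homozygote counts give r = +1 and all-heterozygote counts give r = -1.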
Call Rate¶
The call rate is the fraction of genotypes present and not missing for the given marker.
Minor Allele Frequency (MAF)¶
The minor allele frequency is the fraction of the total alleles of the given marker that are minor alleles.
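Both statistics are straightforward to compute; a sketch, assuming genotypes are coded as minor-allele counts (0, 1 or 2) with None marking a missing call:

```python
def call_rate_and_maf(genotypes):
    """Call rate and minor allele frequency for one biallelic marker.
    `genotypes` holds the minor-allele count per sample (0, 1 or 2),
    with None marking a missing genotype."""
    called = [g for g in genotypes if g is not None]
    call_rate = len(called) / len(genotypes)
    maf = sum(called) / (2 * len(called))  # minor alleles / all observed alleles
    return call_rate, maf
```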
3.1.2. Statistics Available for Genotype Association Tests¶
Correlation/Trend Test¶
The Correlation/Trend Test tests the significance of any correlation between two numeric variables (or two variables which have been encoded as numeric variables). This test may also be thought of as testing for any “trend” which either one of the numeric variables may follow against the other one.
If we have $N$ pairs of observations $(x_i, y_i)$, the (signed) correlation between them is

\[ r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}. \]

Meanwhile,

\[ (N - 1) r^2 \]

follows an approximate chi-squared distribution with one degree of freedom, from which a p-value may be obtained.
Note
In the special case of the additive model (and no PCA correction) for a case/control study, if we were to use, instead of the above formula, $N r^2$, we would have the mathematical equivalent of the Armitage Trend Test.
This correlation/trend test is also available to be used after PCA correction. However, the formula for the chi-squared statistic is instead $(N - 1 - k) r^2$, where $k$ is the number of principal components that have been removed from the data. The premise is that PCA correction has removed $k$ degrees of freedom from the data, and only the remaining degrees need to be tested.
$r$ is a signed value and indicates the effect direction.
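A sketch of the test in pure Python (the `erfc` identity supplies the chi-squared(1) p-value; `pca_components = 0` recovers the uncorrected statistic; names are this sketch's own):

```python
from math import erfc, sqrt

def correlation_trend_test(xs, ys, pca_components=0):
    """Correlation/trend test: signed correlation r between two numeric
    variables, with chi-squared statistic (N - 1 - k) * r^2, where k
    principal components were removed (k = 0 without PCA correction)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    r = sxy / sqrt(sxx * syy)
    chi2 = (n - 1 - pca_components) * r * r
    p_value = erfc(sqrt(chi2 / 2.0))  # chi-squared(1) survival function
    return r, chi2, p_value
```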
Armitage Trend Test¶
The Armitage Trend Test tests the “trend” in an ordered case/control contingency table. In Golden Helix SVS, the ordering is by number of minor alleles in the genotype: zero, one, or two.
Let $r_0$, $r_1$, and $r_2$ be the counts for cases with 0, 1 and 2 minor alleles, respectively, and $s_0$, $s_1$, and $s_2$ be the counts for controls with 0, 1 and 2 minor alleles, respectively. Also, let $n_i = r_i + s_i$, $R = r_0 + r_1 + r_2$, and $S = s_0 + s_1 + s_2$.
If we let $N = R + S$ be the total count,

\[ \bar{x} = \frac{n_1 + 2 n_2}{N}, \]

\[ \bar{y} = \frac{R}{N}, \]

\[ S_{xx} = \sum_{i=0}^{2} n_i (i - \bar{x})^2, \]

\[ S_{yy} = N \bar{y} (1 - \bar{y}), \]

and

\[ S_{xy} = \sum_{i=0}^{2} (i - \bar{x})(r_i - n_i \bar{y}), \]

then the prediction equation under ordinary least-squares fit is

\[ \hat{y} = \bar{y} + \frac{S_{xy}}{S_{xx}} (x - \bar{x}). \]
The statistic for the Armitage Trend Test is

\[ \chi^2 = \frac{N S_{xy}^2}{S_{xx} S_{yy}}, \]

which is asymptotically chi-squared with one degree of freedom. This is used to obtain the chi-squared-based p-value for this test.
Note
The trend statistic itself may be formulated as

\[ b = \frac{S_{xy}}{S_{xx}}. \]

This trend statistic indicates the direction of the effect.
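The statistic and the signed trend can be sketched directly from the six genotype counts (an illustration under the least-squares formulation above, not SVS's code):

```python
def armitage_trend_test(r_counts, s_counts):
    """Armitage trend test on a 2x3 case/control table.
    r_counts = (r0, r1, r2): cases with 0, 1, 2 minor alleles;
    s_counts = (s0, s1, s2): controls.  Returns (chi2, b), where
    b is the signed trend (least-squares slope)."""
    R, S = sum(r_counts), sum(s_counts)
    N = R + S
    n = [r + s for r, s in zip(r_counts, s_counts)]
    xbar = (n[1] + 2 * n[2]) / N     # mean genotype score
    ybar = R / N                     # overall proportion of cases
    sxx = sum(n[i] * (i - xbar) ** 2 for i in range(3))
    syy = N * ybar * (1 - ybar)
    sxy = sum((i - xbar) * (r_counts[i] - n[i] * ybar) for i in range(3))
    chi2 = N * sxy ** 2 / (sxx * syy)
    b = sxy / sxx                    # indicates the direction of the effect
    return chi2, b
```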
Exact Form of Armitage Test¶
The exact form of this test yields the exact probability under the null hypothesis of having a “trend” at least as extreme as the one observed, assuming an equal probability of any permutation of the dependent variable.
To perform the exact Armitage test, we define the trend score for a contingency table with case counts $a_0$, $a_1$, and $a_2$ as

\[ T = a_1 + 2 a_2, \]

where the probability of such a table, given the margins $n_0$, $n_1$, $n_2$, and $R$, is

\[ P(a_0, a_1, a_2) = \frac{\binom{n_0}{a_0} \binom{n_1}{a_1} \binom{n_2}{a_2}}{\binom{N}{R}}. \]
The exact permutation p-value is evaluated as

\[ p = \sum_{|T - E(T)| \,\ge\, |T_{obs} - E(T)|} P(a_0, a_1, a_2), \]

the sum being taken over all tables consistent with the observed margins, where

\[ E(T) = R\, \frac{n_1 + 2 n_2}{N}. \]
Note
The trend statistic for the observed data may be formulated in this context as

\[ b = \frac{T_{obs} - R \bar{x}}{S_{xx}}, \]

where $T_{obs}$ is the value of $T$ for the table created from the observed data, and $\bar{x} = (n_1 + 2 n_2)/N$, where $N$ is the total number of observations. As noted in Armitage Trend Test, $b$ indicates the direction of the effect.
(Pearson) Chi-Squared Test¶
This is the most often used way to obtain a p-value for (the extremeness of) an (unordered) contingency table, that is, to decide whether to reject the null hypothesis that the proportions in the rows and columns of the table differ from the proportions implied by the marginal row and column totals only by chance.
If the contingency table with elements $O_{ij}$ has $n$ observations, we make an “expected” contingency table based on the marginal totals with elements

\[ E_{ij} = \frac{R_i C_j}{n}, \]

where $R_i$ are the row totals and $C_j$ are the column totals.
We then obtain a p-value from the fact that

\[ \chi^2 = \sum_{ij} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \]

approximates a chi-squared distribution with $(r - 1)(c - 1)$ degrees of freedom, where $r$ is the number of rows and $c$ the number of columns.
For the $2 \times 2$, $2 \times 3$, and $2 \times 4$ tables for which this technique is used in SVS, the degrees of freedom are 1, 2 and 3, respectively.
Note
For the cases that use the $2 \times 2$ table, the correlation $r$ is defined and may be computed as

\[ r = \frac{O_{11} O_{22} - O_{12} O_{21}}{\sqrt{R_1 R_2 C_1 C_2}}, \]

where $O_{ij}$ are the observed cell counts and $R_i$ and $C_j$ are the row and column totals. $r$ indicates the direction of the effect.
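A sketch of the statistic for an arbitrary r x c table (illustrative only; the p-value conversion shown is exact only for one degree of freedom):

```python
from math import erfc, sqrt

def pearson_chi_squared(table):
    """Pearson chi-squared statistic for an r x c contingency table
    (list of rows); expected counts come from the marginal totals."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = sum((obs - rt * ct / n) ** 2 / (rt * ct / n)
               for row, rt in zip(table, row_totals)
               for obs, ct in zip(row, col_totals))
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

# For a 2x2 table (df = 1) the p-value is erfc(sqrt(chi2 / 2)).
chi2, df = pearson_chi_squared([[10, 20], [20, 10]])
p_value = erfc(sqrt(chi2 / 2.0))
```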
(Pearson) Chi-Squared Test with Yates’ Correction¶
This test is almost the same as the Pearson Chi-Squared test. The difference is that a correction is made to compensate for the fact that the contingency table uses discrete integer values.
Both this test and the Pearson Chi-Squared test itself obtain a p-value for (the extremeness of) an (unordered) contingency table, that is, they decide whether to reject the null hypothesis that the proportions in the rows and columns of the table differ from the proportions implied by the marginal row and column totals only by chance.
If the contingency table with elements $O_{ij}$ has $n$ observations, our “expected” contingency table based on the marginal totals has elements

\[ E_{ij} = \frac{R_i C_j}{n}, \]

where $R_i$ are the row totals and $C_j$ are the column totals.
The estimated $\chi^2$ value from the Pearson Chi-Squared test with Yates’ correction is

\[ \chi^2_{Yates} = \sum_{ij} \frac{\left( |O_{ij} - E_{ij}| - 0.5 \right)^2}{E_{ij}}. \]

This approximates a chi-squared distribution with $(r - 1)(c - 1)$ degrees of freedom, which is used to obtain a p-value for this test.
For the $2 \times 2$, $2 \times 3$, and $2 \times 4$ tables for which this technique is used in SVS, the degrees of freedom are 1, 2 and 3, respectively.
Note
For the cases that use the $2 \times 2$ table, the correlation $r$ is defined and may be computed as

\[ r = \frac{O_{11} O_{22} - O_{12} O_{21}}{\sqrt{R_1 R_2 C_1 C_2}}, \]

where $O_{ij}$ are the observed cell counts and $R_i$ and $C_j$ are the row and column totals. $r$ indicates the direction of the effect.
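The corrected statistic can be sketched the same way as the uncorrected Pearson statistic, with each |O - E| reduced by 0.5 before squaring (the clamp at zero for very small deviations is this sketch's own guard, not from the formula above):

```python
def yates_chi_squared(table):
    """Pearson chi-squared with Yates' continuity correction for an
    r x c contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for row, rt in zip(table, row_totals):
        for obs, ct in zip(row, col_totals):
            e = rt * ct / n
            # reduce each absolute deviation by 0.5 (clamped at zero)
            chi2 += max(abs(obs - e) - 0.5, 0.0) ** 2 / e
    return chi2
```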
Fisher’s Exact Test¶
The output of this test is the sum of the probabilities of all contingency tables whose marginal sums are the same as those of the observed contingency table and which are as extreme as or more extreme than (equally probable as or less probable than) the observed contingency table.
The probability of a contingency table with elements $a_{ij}$, row totals $R_i$, column totals $C_j$, and $n$ observations is given by

\[ P = \frac{\prod_i R_i! \, \prod_j C_j!}{n! \, \prod_{ij} a_{ij}!}. \]
To reduce the amount of computation, techniques developed by Mehta and Patel [MehtaAndPatel1983] are used for computing Fisher’s Exact Test.
Note
For the cases that use the $2 \times 2$ table, the correlation $r$ is defined and may be computed as noted in (Pearson) Chi-Squared Test. $r$ indicates the direction of the effect.
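For a 2 x 2 table the test can be sketched by enumerating the single free cell (SVS uses the Mehta-Patel network algorithm to avoid enumeration; this brute-force version, with a small relative tolerance in the probability comparison, is only an illustration):

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]:
    sum of probabilities of all tables with the same margins that
    are no more probable than the observed table."""
    r1, r2 = a + b, c + d
    c1 = a + c
    n = r1 + r2

    def prob(x):  # hypergeometric probability; x plays the role of cell a
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

    p_obs = prob(a)
    low, high = max(0, c1 - r2), min(r1, c1)
    return sum(p for p in (prob(x) for x in range(low, high + 1))
               if p <= p_obs * (1 + 1e-9))
```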
Odds Ratio with Confidence Limits¶
For the purposes of this method’s description, we define a $2 \times 2$ contingency table as being organized as “(Case/Control) vs. (Yes/No)”, as demonstrated in the table below.

            Yes      No       Total
  Case      a        b        a + b
  Control   c        d        c + d
  Total     a + c    b + d    n
The odds ratio is defined as the ratio of the odds for “Case” among the Yes values to the odds for “Case” among the No values, or equivalently the ratio of the odds for “Yes” among the cases to the odds for “Yes” among the controls, or equivalently, writing $a$, $b$, $c$, and $d$ for the Case-Yes, Case-No, Control-Yes, and Control-No cells,

\[ OR = \frac{a/c}{b/d} = \frac{ad}{bc}. \]
To obtain confidence limits, we use the standard error of $\ln(OR)$, which is

\[ SE = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}. \]

The 95% confidence interval then ranges from $e^{\ln(OR) - 1.96\,SE}$ to $e^{\ln(OR) + 1.96\,SE}$.
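A sketch of this computation (z = 1.96 gives the 95% interval; no continuity adjustment for zero cells is attempted here):

```python
from math import exp, log, sqrt

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio ad/(bc) for the (Case/Control) vs. (Yes/No) table,
    with a normal-approximation confidence interval on ln(OR)."""
    or_ = (a * d) / (b * c)
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)   # standard error of ln(OR)
    lower = exp(log(or_) - z * se)
    upper = exp(log(or_) + z * se)
    return or_, lower, upper
```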
Analysis of Deviance¶
This is a maximum-likelihood-based technique for analyzing a case-control contingency table with $c$ columns. Let $\bar{p}$ be the proportion of cases in the entire sample, $n_j$ be the number of observations in column $j$ of the contingency table, and $p_j$ be the proportion of cases in column $j$. Then, to perform an analysis of deviance test, we define

\[ L_0 = \sum_j n_j \left[ \bar{p} \ln \bar{p} + (1 - \bar{p}) \ln(1 - \bar{p}) \right] \]

and

\[ L_1 = \sum_j n_j \left[ p_j \ln p_j + (1 - p_j) \ln(1 - p_j) \right]. \]

The test statistic is then $2(L_1 - L_0)$, which approximates a chi-squared distribution with $c - 1$ degrees of freedom. A p-value is then obtained based on this chi-squared approximation.
Note
For the cases that use a contingency table with just two columns, the correlation $r$ is defined and may be computed as noted in (Pearson) Chi-Squared Test. $r$ indicates the direction of the effect.
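A sketch of the statistic (the convention 0 ln 0 = 0 is handled explicitly; names are illustrative):

```python
from math import log

def analysis_of_deviance(case_counts, control_counts):
    """Analysis-of-deviance statistic 2*(L1 - L0) for a case/control
    table with c columns; approximately chi-squared with c - 1 df."""
    n = [a + b for a, b in zip(case_counts, control_counts)]
    p_bar = sum(case_counts) / sum(n)        # overall case proportion

    def ll(p, nj):                           # n_j * [p ln p + (1-p) ln(1-p)]
        if p in (0.0, 1.0):                  # 0 ln 0 is taken as 0
            return 0.0
        return nj * (p * log(p) + (1 - p) * log(1 - p))

    l0 = sum(ll(p_bar, nj) for nj in n)
    l1 = sum(ll(a / nj, nj) for a, nj in zip(case_counts, n))
    return 2 * (l1 - l0), len(n) - 1
```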
F-Test¶
The F-Test applies to a quantitative trait being subdivided into two or more groups according to the category of the predictor variable.
This test is on whether the distributions of the dependent variable within each category are significantly different between the various categories of the predictor variable. Another way to phrase this question is whether the variation of the trait between the categories is substantial by comparison to the variation of the trait within the categories.
If there are $n$ observations $y_{ij}$ subdivided into $k$ groups, we define the group means

\[ \bar{y}_j = \frac{1}{n_j} \sum_i y_{ij} \]

and the grand mean

\[ \bar{y} = \frac{1}{n} \sum_{ij} y_{ij}. \]

If

\[ SSB = \sum_j n_j (\bar{y}_j - \bar{y})^2 \quad \text{and} \quad SSW = \sum_j \sum_i (y_{ij} - \bar{y}_j)^2, \]

then

\[ MSB = \frac{SSB}{k - 1} \]

is proportional to the variance between the groups, and

\[ MSW = \frac{SSW}{n - k} \]

is proportional to the variance within the groups. The F statistic becomes

\[ F = \frac{MSB}{MSW}. \]
The p-value is obtained by subtracting from one the cumulative probability of the observed F statistic under an $F_{k-1,\,n-k}$ distribution (where $k - 1$ is the numerator degrees of freedom and $n - k$ is the denominator degrees of freedom).
Note
For the cases where there are just two categories, the change in the dependent-variable average in going from category/group 1 to category/group 2,

\[ \bar{y}_2 - \bar{y}_1, \]

may be calculated. This change indicates the direction of the effect.
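A sketch of the F statistic (groups passed as lists of observations; the p-value step is omitted since the F distribution's CDF needs an incomplete-beta routine not in the stdlib):

```python
def f_test_statistic(groups):
    """One-way F statistic: between-group mean square over within-group
    mean square, with k - 1 and n - k degrees of freedom."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ssb = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ssw = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    f = (ssb / (k - 1)) / (ssw / (n - k))
    return f, k - 1, n - k
```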
Linear Regression¶
See Linear Regression.
Logistic Regression¶
See Logistic Regression.
3.1.3. Statistics for Numeric Association Tests¶
Correlation/Trend Test¶
The Correlation/Trend Test tests the significance of any correlation between two numeric variables (or two variables which have been encoded as numeric variables). This test may also be thought of as testing for any “trend” which either one of the numeric variables may follow against the other one.
If we have $N$ pairs of observations $(x_i, y_i)$, the (signed) correlation between them is

\[ r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}. \]

Meanwhile,

\[ (N - 1) r^2 \]

follows an approximate chi-squared distribution with one degree of freedom, from which a p-value may be obtained.
Note
This correlation/trend test is also available to be used after PCA correction. However, the formula for the chi-squared statistic is instead $(N - 1 - k) r^2$, where $k$ is the number of principal components that have been removed from the data. The premise is that the PCA correction has removed $k$ degrees of freedom from the data, and only the remaining degrees need to be tested.
T-Test¶
The T-Test is a special form of the F-Test in which distributions in only two categories are being compared. (The T statistic is the square root of the corresponding F statistic for two categories.)
In the CNV Association Test, the TTest is used for a quantitative predictor (independent variable) and a case/control (binary) dependent variable.
The test is on whether the distributions of the quantitative predictor within the two categories of case versus control are significantly different. Another way to phrase this question is whether the variation of the predictor between the categories is substantial by comparison to the variation of the predictor within the categories.
If there are $n_1$ observations $x_{1i}$ corresponding to a true dependent variable value and $n_2$ observations $x_{2i}$ corresponding to a false dependent variable value, we define the group means $\bar{x}_1$ and $\bar{x}_2$ and the pooled standard deviation

\[ s_p = \sqrt{\frac{\sum_i (x_{1i} - \bar{x}_1)^2 + \sum_i (x_{2i} - \bar{x}_2)^2}{n_1 + n_2 - 2}}. \]

Then, the standard error of the difference of the means is

\[ SE = s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}. \]

If $SE$ is less than a threshold, then the p-value returned is 1.0. Otherwise,

\[ T = \frac{\bar{x}_1 - \bar{x}_2}{SE}. \]

The p-value may be calculated on the basis of this T value as a “two-sided p-value” using Student’s t distribution with $n_1 + n_2 - 2$ degrees of freedom.
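A sketch of the T statistic (returning None in the degenerate case where a p-value of 1.0 would be reported; the threshold `eps` is illustrative, as the document does not state SVS's value):

```python
from math import sqrt

def t_statistic(xs_true, xs_false, eps=1e-12):
    """Two-sample T statistic with pooled variance; the corresponding
    p-value uses Student's t with n1 + n2 - 2 degrees of freedom."""
    n1, n2 = len(xs_true), len(xs_false)
    m1 = sum(xs_true) / n1
    m2 = sum(xs_false) / n2
    ss1 = sum((x - m1) ** 2 for x in xs_true)
    ss2 = sum((x - m2) ** 2 for x in xs_false)
    sp = sqrt((ss1 + ss2) / (n1 + n2 - 2))   # pooled standard deviation
    se = sp * sqrt(1 / n1 + 1 / n2)
    if se < eps:
        return None                          # degenerate: p-value is 1.0
    return (m1 - m2) / se
```

Squaring the T statistic for two groups recovers the corresponding F statistic.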
3.1.4. False Discovery Rate¶
When testing multiple hypotheses, there is always the possibility one or more tests have appeared significant just by chance. Various techniques have been proposed to adjust the pvalues or to otherwise correct for multiple testing issues. Among these are the Bonferroni adjustment and the False Discovery Rate. The following discussion and technique is used in Golden Helix SVS specifically to correct for multiple testing over many different predictors.
Suppose that $m$ hypotheses are tested, and $R$ of them are rejected (positive results). Of the rejected hypotheses, suppose that $V$ of them are really false positive results, that is, $V$ is the number of type I errors. The False Discovery Rate is defined as

\[ FDR = E\left[\frac{V}{R} \,\middle|\, R > 0\right] P(R > 0), \]

that is, the expected proportion of false positive findings among all rejected hypotheses times the probability of making at least one rejection.
Suppose we are rejecting $H_0$ (the null hypothesis) on the basis of the p-values from these tests, specifically, when a p-value is less than a parameter $\gamma$. If we can treat the p-values as being independent, then we can estimate $E[V/R \mid R > 0]$ as

\[ \frac{m \gamma}{R(\gamma) \left[ 1 - (1 - \gamma)^m \right]}, \]

where $R(\gamma)$ is the number of p-values less than or equal to $\gamma$, and use this to estimate the False Discovery Rate as

\[ \widehat{FDR}(\gamma) = \frac{m \gamma}{R(\gamma)}. \]
When this is computed for $\gamma$ equal to any particular p-value $p$, these expressions simplify to

\[ \frac{m p}{k \left[ 1 - (1 - p)^m \right]} \]

and

\[ \widehat{FDR}(p) = \frac{m p}{k}, \]

where $k$ is the number of p-values less than or equal to $p$.
See [Storey2002]. (We use $\hat{\pi}_0 = 1$ here.)
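The simplified m·p/k estimate can be sketched as follows (illustrative; tied p-values are given the largest rank, so that k counts all p-values less than or equal to p):

```python
def fdr_estimates(p_values):
    """FDR estimate m*p/k at each observed p-value, where m is the
    number of tests and k is the number of p-values <= p."""
    m = len(p_values)
    fdr = {}
    # iterate in sorted order so a duplicated p-value keeps the
    # largest rank, i.e. the count of p-values <= p
    for k, p in enumerate(sorted(p_values), start=1):
        fdr[p] = m * p / k
    return [fdr[p] for p in p_values]

fdrs = fdr_estimates([0.01, 0.02, 0.9, 0.5])
```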