The following subsections further explain the methods used in obtaining General Marker Statistics, which may be invoked using a separate window (Genotype Statistics by Marker) or as a tab in the Genotypic Association Test dialog (Genotype Association Tests).
The HWE p-value measures the strength of evidence against the null hypothesis that the marker does not follow Hardy-Weinberg Equilibrium. Large p-value are consistent with the marker following HWE.
Suppose we have a marker with alleles
having
frequencies
. We may write the genotype count
for alleles
and
as
. Due to
phase ambiguity, if
, we count occurrences of allele
on the first chromosome and allele
on the second
chromosome, along with occurrences of allele
on the first
chromosome and allele
on the second chromosome in both the
notations
and
.
Thus, we may write the count for allele
as
. We may also
express the genotype frequency for allele
occurring
homozygously as
, and the genotype
frequency for heterozygous alleles
and
as
, where
is the population
count. The frequency of allele
may be expressed as:

We wish to check the agreement of
with
and the agreement of
, where
, with
. We multiply by two to deal with the phase ambiguity (see above).
Thus, we will define the Hardy-Weinberg equilibrium coefficient
or
for alleles
and
such that

.
(It may be shown that for a bi-allelic marker,
.)
We then have a chi-squared distribution with
degrees of freedom,

From this, we obtain the distribution’s p-value
, and the correlation,
,
from the inverse distribution for one degree of freedom (where
is the chi-squared distribution), which is

In this test, all of the possible sets of genotypic counts consistent with the observed allele totals are cycled through, and all the probabilities of all sets of counts which are as extreme or more extreme (equally probably or less probable) than the observed set of counts are summed.
See [Emigh1980].
Note
This statistic applied only to bi-allelic markers.
We define the signed HWE correlation
as

where

is the total genotype count and
and
are the counts for genotypes DD and Dd, respectively.
This is derived from the formula for (signed) correlation between two
sets of observations,
and
,

where we take the
to be 0 if the first allele is d and
1 if the first allele is D, and the
to be 0 if the second
allele is d and 1 if the second allele is D.
Because of phase ambiguity, we set each of the counts of (d, D) and (D, d) to be one-half of the (phase-ambiguous) observed count of Dd. The correlation then simplifies to the formula first given above.
If there is a high homozygous count,
and
will often be 1 or often be 0 at the same time, and therefore there will
be a positive correlation between the
and the
. Similarly, if there is a high heterozygous count,
and
will often be 1 at opposite times,
causing an anti-correlation to exist.
The minor allele frequency is the fraction of the total alleles of the given marker that are minor alleles.

The Correlation/Trend Test tests the significance of any correlation between two numeric variables (or two variables which have been encoded as numeric variables). This test may also be thought of as any “trend” which either one of the numeric variables may have taken against the other one.
If we have
pairs of observations
, the
(signed) correlation
between them is

Meanwhile,

follows an approximate chi-squared distribution with one degree of freedom, from which a p-value may be obtained.
Note
, we would have the mathematical equivalent of the Armitage Trend Test.
where
is the number of principal components that have been removed from the data. The premise is that PCA correction has removed
degrees of freedom from the data, and only the remaining degrees need to be tested.The Armitage Trend Test tests the “trend” in an ordered case/control contingency table. In Golden Helix SVS, the ordering is by number of minor alleles in the genotype–zero, one, or two.
Let
be the counts for
cases with 0, 1 and 2 alleles, respectively, and
be the counts for
controls with 0, 1 and 2 alleles, respectively. Also, let
,
, and
.
If we let
be the total count,
![]() |
![]() |
![]() |
![]() |
![]() |
, and |
![]() |
, |
then the prediction equation under ordinary least-squares fit is

The statistic for the Armitage Trend Test is

which is asymptotically chi-squared with one degree of freedom. This is used to obtain the chi-squared based p-value for this test.
The exact form of this test yields the exact probability under the null hypothesis of having a “trend” at least as extreme as the one observed, assuming an equal probability of any permutation of the dependent variable.
To perform the exact Armitage test, we define the trend score for the
contingency table
as

where

The exact permutation p-value is evaluated as

where

This is the most-often used way to obtain a p-value for (the
extremeness of) an (unordered)
contingency table,
to know whether to reject the null hypothesis that the proportions in
the rows and columns of the table differ from the proportions of the
margin column totals and the margin row totals, respectively, as much
as they do by chance alone.
If the contingency table with elements
has
observations, we make an “expected” contingency table based on the
marginal totals with elements

where
are the row totals and
are the
column totals.
We then obtain a p-value from the fact that

approximates a chi-squared distribution with
degrees of freedom.
For the
,
, and
tables for which this technique is used in SVS, the
degrees of freedom are 1, 2 and 3, respectively.
This test is almost the same as the Pearson Chi-Squared test. The difference is that a correction is made to compensate for the fact that the contingency table uses discrete integer values.
Both this test and the Pearson Chi-Squared test itself obtain a
p-value for (the extremeness of) an (unordered)
contingency table, to know whether to reject the null hypothesis that
the proportions in the rows and columns of the table differ from the
proportions of the margin column totals and the margin row totals,
respectively, as much as they do by chance alone.
If the contingency table with elements
has
observations, our “expected” contingency table based on the
marginal totals has elements

where
are the row totals and
are the
column totals.
The estimated
value from the Pearson Chi-Squared test
with Yates correction is

This approximates a chi-squared distribution with
degrees of freedom which is used to obtain a p-value for this test.
For the
,
, and
tables for which this technique is used in SVS, the
degrees of freedom are 1, 2 and 3, respectively.
The output of this test is the sum of the probabilities of all contingency tables whose marginal sums are the same as those of the observed contingency table and which are as extreme or more extreme (equally probable or less probable) than the observed contingency table.
The probability of a
contingency table with
elements
and row totals
and column totals
and
elements is given by

To reduce the amount of computation, techniques developed by Mehta and Patel [MehtaAndPatel1983] are used for computing Fisher’s Exact Test.
For the purposes of this method’s description, we define a
contingency table as being organized as
“(Case/Control) vs. (Yes/No)” demonstrated in the table below.
Yes No Total Case Control Total
The odds ratio is defined as the ratio of the odds for “Case” among the Yes values to the odds for “Case” among the No values, or equivalently the ratio of the odds for “Yes” among the cases to the odds for “Yes” among the controls, or equivalently

To obtain confidence limits, we use the standard error of
, which is

The 95% confidence interval then ranges from
to
.
This is a maximum-likelihood based technique for analyzing a
case-control contingency table with
columns. Let
be the proportion of cases in the entire sample,
be the
number of observations in column
of the contingency table,
and
be the proportion of cases in column
.
Then, to perform an analysis of deviance test, we define

and
![F_k = -\sum^k_{j=1}{[-2n_j(p_j\log(p_j) + (1-p_j)\log(1-p_j))]}.](_images/math/b5dfae9c08f57ed060b9a5fe4801e32a409c1126.png)
The test statistic is then
, which approximates a
chi-squared distribution with
degrees of freedom. A
p-value is then obtained based on this chi-squared approximation.
The F-Test applies to a quantitative trait being subdivided into two or more groups according to the category of the predictor variable.
This test is on whether the distributions of the dependent variable within each category are significantly different between the various categories of the predictor variable. Another way to phrase this question is whether the variation of the trait between the categories is substantial by comparison to the variation of the trait within the categories.
If there are
observations
subdivided into
groups, we define

and

If
and
, then

is proportional to the variance between the groups, and

is proportional to the variance within the groups. The F statistic becomes

The p-value is obtained by subtracting the probability of observing
the F statistic from an
distribution (where
are the numerator degrees of freedom and
are
the denominator degrees of freedom) from one.

See Linear Regression.
See Logistic Regression.
The Correlation/Trend Test tests the significance of any correlation between two numeric variables (or two variables which have been encoded as numeric variables). This test may also be thought of as any “trend” which either one of the numeric variables may have taken against the other one.
If we have
pairs of observations
, the
(signed) correlation
between them is

Meanwhile,

follows an approximate chi-squared distribution with one degree of freedom, from which a p-value may be obtained.
Note
This correlation/trend test is also available to be used after PCA
correction. However, the formula for the chi-squared statistic is
instead
where
is the
number of principal components that have been removed from the data.
The premise is that the PCA correction has removed
degrees of freedom from the data, and only the remaining degrees need
to be tested.
The T-Test is a special form of the F-Test in which distributions in only two categories are being compared. (The T statistic is the square root of the corresponding F statistic for two categories.)
In the CNV Association Test, the T-Test is used for a quantitative predictor (independent variable) and a case/control (binary) dependent variable.
The test is on whether the distributions of the quantitative predictor within the two categories of case versus control are significantly different. Another way to phrase this question is whether the variation of the predictor between the categories is substantial by comparison to the variation of the predictor within the categories.
If there are
observations
corresponding
to a true dependent variable value and
observations
corresponding to a false dependent variable value, we
define
![]() |
![]() |
![]() |
![]() |
![]() |
![]() |
Then,

If
is less than a threshold (
), then the
p-value returned is 1.0. Otherwise,

The p-value may be calculated on the basis of this T value as a
“two-sided p-value” using Student’s t distribution with
degrees of freedom.
When testing multiple hypotheses, there is always the possibility one or more tests have appeared significant just by chance. Various techniques have been proposed to adjust the p-values or to otherwise correct for multiple testing issues. Among these are the Bonferroni adjustment and the False Discovery Rate. The following discussion and technique is used in Golden Helix SVS specifically to correct for multiple testing over many different predictors.
Suppose that
hypotheses are tested, and
of them
are rejected (positive results). Of the rejected hypotheses, suppose
that
of them are really false positive results, that is
is the number of type I errors. The False Discovery Rate is
defined as

that is, the expected proportion of false positive findings among all rejected hypotheses times the probability of making at least one rejection.
Suppose we are rejecting (the null hypothesis) on the basis of the
p-values
from these
tests,
specifically, when a p-value is less than a parameter
.
If we can treat the p-values as being independent, then we can estimate
as

where
is the number of
less than or
equal to
, and use this to estimate the False Discovery
Rate
as

When this is computed for
equal to any particular
p-value, these expressions simplify to

and

where
is the number of p-values less than or equal to
.
See [Storey2002]. (We use
here.)
Permutation testing consists of running many tests which are just like the original test except that the dependent variable is permuted differently for each test, and counting the number of times the test statistic resulting from a permuted dependent variable is more significant than the statistic from the original test.
The permuted p-value is the fraction of permuted tests which are more significant than or as significant as the original test (this is the definition of a p-value). The original test is counted as one of the “permuted tests” in this calculation. This is meant to approximate the probability that the test could come out as “positive” by chance alone.
Note
The permutations that are used are not based on a time of day seed, but on a constant seed. This means that permutation results will be identical when the study is re-run.
With single value permutations, the dependent variable is permuted and the given statistical test is performed using the given model on the given marker or predictor. The permutation methodology is the same for genotypic association tests and numerical association tests. The independent variables in the first case are genetic markers, and in the second numerical predictors. In this paragraph and those following, “marker” can be replaced with “predictor” depending on the context of the permutation testing. Only “marker” will be used for simplicity.
This process is repeated the number of times you select. The permuted p-value is defined as the fraction of times a test with a permuted dependent variable on the given marker came out as significant as or more significant than the same test on the same marker with the non-permuted dependent variable.
Note
This single-value permutation technique focuses exclusively on the marker in question, and by itself does not offer any type of correction for multiple testing.
Suppose your approximated p-value (based on a chi-square distribution) is 0.2, you are performing 10 permutations, and the significance of those permutations (based on the same chi-square distribution) may be expressed as the following approximated p-values (counting the original test as the first “permutation”):
0.20
0.15
0.09
0.51
0.31
0.01
0.26
0.11
0.05
0.18
In the above column, there are 7 outcomes where the approximated p-value is equal to or less than 0.2. The permuted p-value is therefore

The full-scan permutation technique differs from the single-value technique in that it addresses the multiple-testing problem. It does this by comparing the original test result from an individual marker not merely with the permuted result from that marker, but also with the permuted results from all the rest of the markers. If any of these comparisons comes out to be more significant than the unpermuted result of the marker in question, the permutation is considered to have resulted in the “permuted test being more significant than the unpermuted test”.
To be specific, the following procedure is used for full-scan permutation testing:
Note
This procedure will tend to multiply the result by a value somewhat equal to the number of markers being tested, and thus gives answers comparable to those from Bonferroni-corrected p-values.
You run the same association test on four spreadsheets. These spreadsheets are all alike, except that one has the real data that you are interested in, and the other three have the same data except that the dependent variable has been permuted randomly in those spreadsheets.
You look at the (approximated) p-values from all of these tests, and note that the best approximated p-value from the first spreadsheet on any marker (which came from marker 4) is 0.0003. You also note that the best approximated p-values (which resulted from various markers) from the second, third, and fourth spreadsheets are 0.005, 0.02, and 0.013.
Using this technique, you could say that the full-scan permuted p-value of marker 4 would be one (the number of spreadsheets whose significance measured as a p-value was 0.0003 or better) divided by four (the total number of spreadsheets), or 0.25.
Normally, to get a better idea of the real p-value, you would perform many more permutations than this! For instance, one thousand permutations or more would be necessary to more accurately gage a p-value near 0.001.
Suppose you have 5 markers, and their approximated test p-values (based on a chi-square distribution) are:
0.20 0.10 0.02 0.34 0.52
Their Bonferroni-corrected p-values would be:
1.00 0.50 0.10 1.00 1.00
Also suppose we do full-scan permutation testing using 10 permutations on these markers, and the approximated p-value results of those permutations on those markers (based on the same chi-square distribution and counting the original test as a “permutation”) are:
0.20 0.10 0.02 0.34 0.52
0.15 0.13 0.71 0.05 0.26
0.09 0.17 0.41 0.26 0.22
0.51 0.36 0.47 0.92 0.01
0.20 0.74 0.26 0.87 0.23
0.01 0.46 0.21 0.08 0.05
0.26 0.61 0.92 0.37 0.43
0.11 0.54 0.29 0.08 0.13
0.05 0.06 0.04 0.39 0.42
0.18 0.69 0.21 0.19 0.14
In the above data, we find the following best permuted-test outcomes over all markers for each of the permutations are:
| 0.02 | (from the third marker) |
| 0.05 | (from the fourth marker) |
| 0.09 | (first marker) |
| 0.01 | (fifth marker) |
| 0.20 | (first marker) |
| 0.01 | (first marker) |
| 0.26 | (first marker) |
| 0.08 | (fourth marker) |
| 0.04 | (third marker) |
| 0.14 | (fifth marker) |
For the first marker, whose approximated p-value from actual data was 0.20, we find that the best approximated p-values from the permutations other than permutation 7 (0.26) were all as good as or as better than 0.20. Thus the full scan permuted p-value for the first marker is 9/10 or 0.90.
For the second marker, at 0.10, all permutations except permutation 5 (0.20), permutation 7 (0.26), and permutation 10 (0.14) were better than its actual data value. Thus, its full-scan permuted p-value is 7/10 or 0.70.
The third marker has an approximated p-value of 0.02 on actual data, which is much better than that for the other markers, and is only equaled or bettered by itself (“permutation” 1 at 0.02), permutation 4 (0.01) and permutation 6 (0.01). Thus, its full-scan permuted p-value is 3/10 or 0.30.
Meanwhile, every permutation had a better approximated p-value than either the 0.34 or 0.52 values from the last two markers. Thus, these markers both have a full-scan permuted p-value of 10/10 or 1.00.
Comparing the full-scan permuted p-values for every marker with its Bonferroni-corrected p-value for real data we get these similar values:
0.90 0.70 0.30 1.00 1.00
1.00 0.50 0.10 1.00 1.00
Let’s say your full-scan permuted p-value was 0.463 from when you performed 1000 permutations. This means that of the 1000 permutations performed, the p-value from the internally saved list of the best statistical results from the permutations was better 463 times out of 1000. This would be a poor showing. However, you must take into consideration that each time the permutation was performed, it did not do anything like averaging the approximated p-values for all the markers, but it took the one marker with the best showing (whose approximated p-value would be the lowest).
The response,
, is fit to every genetic predictor variable or
encoded genotype,
, in the spreadsheet, using linear
regression, and the results include the regression p-value, intercept
and slope which are output in a new spreadsheet along with other
genotypic association test results and any multiple correction results.
The response is represented with the formula
, where the model is represented by
the expression
and the error term,
, expressing the difference, or residual, between the
model of the response and the response itself. For missing values of the
predictor, the mean of the response is used.
The regression hypothesis test is the test of:

Assumptions:
for all 
where
denote the residuals
and
are all independent and follow a normal distribution
and all
have equal variance
The sums of squares and mean sum of squared errors are calculated as follows:
| Number of Observations: | ![]() |
| Rank of the Coefficient Matrix: | ![]() |
| Mean of the response: | ![]() |
| Mean of the predictors or genotypes: | ![]() |
| Solution to the normal equations: | ![]() |
where ![]() |
|
| Total Sum of Squares: | ![]() |
| Regression Sum of Squares: | ![]() |
where is a matrix of ones |
|
| Error Sum of Squares: | ![]() |
| Residual Sum of Squares: | ![]() |
| Coefficient of determination: | ![]() |
| Adjusted coefficient of determination: | ![]() |
| Test Statistic: | ![]() |
The test statistic follows the F distribution, where
where
.
The Regression Analysis window performs multiple linear regression on the regressors unless only one regressor is specified. A multiple linear regression model takes one or more regressors and fits a regression model to one dependent variable. This model is a generalization of the simple linear regression model used for linear regression in the analysis test dialogs.
The regression hypothesis test is the test of:

Assumptions:
for all 
where
denote the residuals
and
are all independent and follow a normal distribution
and all
have equal variance
The sums of squares and mean sum of squared errors are calculated as follows:
| Number of Observations: | ![]() |
| Rank of the Coefficient Matrix: | ![]() |
| Mean of the response: | ![]() |
| Mean of the predictors or genotypes: | ![]() |
| Solution to the normal equations: | ![]() |
where ![]() |
|
| Total Sum of Squares: | ![]() |
| Regression Sum of Squares: | ![]() |
where is a matrix of ones |
|
| Error Sum of Squares: | ![]() |
| Residual Sum of Squares: | ![]() |
| Coefficient of determination: | ![]() |
| Adjusted coefficient of determination: | ![]() |
| Test Statistic: | ![]() |
The test statistic follows the F distribution, where
where
.
In the full versus reduced model regression equation, the regression sums of squares are calculated both for the reduced and for the full model the same way that they are calculated for a regression on just one model. An F test is then performed to find the significance of the full versus the reduced model.
The null hypothesis tested is the model comparison test, where the null hypothesis is that the reduced model is the true model and that the full model is not necessary.
The sums of squares and mean sum of squared errors for the reduced model are calculated as follows:
| Number of Observations: | ![]() |
| Rank of the Reduced Model Coefficient Matrix: | ![]() |
| Mean of the response: | ![]() |
| Mean of the predictors or genotypes: | ![]() |
| Solution to the normal equations: | ![]() |
where ![]() |
|
| Total Sum of Squares: | ![]() |
| Regression Sum of Squares: | ![]() |
where is a matrix of ones |
|
| Error Sum of Squares: | ![]() |
The sums of squares and mean sum of squared errors for the full model are calculated similarly:
| Number of Observations: | ![]() |
| Rank of the Full Model Coefficient Matrix: | ![]() |
| Mean of the response: | ![]() |
| Mean of the predictors or genotypes: | ![]() |
| Solution to the normal equations: | ![]() |
where ![]() |
|
| Total Sum of Squares: | ![]() |
| Regression Sum of Squares: | ![]() |
where is a matrix of ones |
|
| Residual Sum of Squares: | ![]() |
| Error Sum of Squares: | ![]() |
The test statistic is:

The p-value is calculated by:
where
.
The coefficient of the
regressor is calculated with the
equation:

where
is the sample size,
is the mean of
the
regressor and
is the mean of the
response.
The Y-intercept of the regression equation is calculated with the equation:

where
is the number of regressors,
is the
coefficient and
is the mean of the
regressor.
The standard error for the
regressor is computed by
taking a full model regression equation with all regressors less the
regressor. For the purposes of calculating the standard
error, the
regressor is set as the dependent variable.
Let
be the regressor sum
of squares,
be the coefficient of determination for the
regressor vs all other regressors model,
be the mean square errors for the regression model, and
be
the error sum of squares. Let the total number of regressors in the
model be
. Then the standard error of the regressor
is calculated as follows:

The value of the t-statistic for the
regressor is
obtained from the equation:

where
is the estimated coefficient for the
regressor.
The p-value of the t-statistic for the
regressor is the
probability of a value as extreme or more extreme than the observed
t-statistic from a Student’s T distribution with
degrees
of freedom.

The p-value for the univariate fit is obtained from a Student’s T
distribution where the t-statistic is calculated assuming that the
regressor is the only regressor in the model against
the dependent variable.
If a covariate is categorical, dummy variables are used to indicate the category of the covariate. A value of “1” for the observation indicates that it is equal to the category the dummy variable represents. Similarly, if the observation is not equal to the category for the dummy variable, then it is assigned the value of “0”. As the values of one dummy variable can be determined by examining all other dummy variables for a covariate, in most cases the last dummy variable is dropped. This avoids using a rank-deficient matrix in the regression equation.
A first-order interaction term is considered a new covariate created from the product of two covariates as specified in either the full- or reduced-model covariates. If one interaction term is categorical, dummy variables for each category of the covariate will be multiplied by the other covariate to create a first-order interaction term. If both covariates are categorical, dummy variables from both covariates will be multiplied by each other.
For example, consider the following covariates for five samples.
| Sample | Lab | Dose | Age |
|---|---|---|---|
| sample01 | A | Low | 35 |
| sample02 | A | Med | 31 |
| sample03 | A | High | 37 |
| sample04 | B | Low | 32 |
| sample05 | B | Med | 36 |
| sample06 | B | High | 33 |
Using dummy variables for the categorical covariates the above table would be:
| Sample | Lab=A | Lab=B | Dose=Low | Dose=Med | Dose=High | Age |
|---|---|---|---|---|---|---|
| sample01 | 1 | 0 | 1 | 0 | 0 | 35 |
| sample02 | 1 | 0 | 0 | 1 | 0 | 31 |
| sample03 | 1 | 0 | 0 | 0 | 1 | 37 |
| sample04 | 0 | 1 | 1 | 0 | 0 | 32 |
| sample05 | 0 | 1 | 0 | 1 | 0 | 36 |
| sample06 | 0 | 1 | 0 | 0 | 1 | 33 |
Interactions Lab*Dose and Lab*Age would be specified as:
| Sample | A*Low | A*Med | A*High | B*Low | B*Med | B*High | A*Age | B*Age |
|---|---|---|---|---|---|---|---|---|
| sample01 | 1 | 0 | 0 | 0 | 0 | 0 | 35 | 0 |
| sample02 | 0 | 1 | 0 | 0 | 0 | 0 | 31 | 0 |
| sample03 | 0 | 0 | 1 | 0 | 0 | 0 | 37 | 0 |
| sample04 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 32 |
| sample05 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 36 |
| sample06 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 33 |
If only a few variables (regressors or covariates) drive the outcome of the response, Stepwise Regression can isolate these variables. The methods for the two types of stepwise regression, forward selection or backward elimination, are described below.
Forward Selection
Starting with either the null model or the reduced model (depending on which type of regression was specified), successive models are created, each one using one more regressor (or covariate) than the previous model.
Each of the unused regressors is added to the current model to create a “trial” model for that regressor. The p-value of the trial model (or full model) versus the current model (or reduced model) is calculated, and the model with the smallest p-value is used as the next model. This method adds the next most significant variable to the current model. If the current model had the smallest p-value, or if no p-value is better than the p-value cut-off specified, then the forward selection method stops and declares the current model as the final model as determined by stepwise forward selection. If the model with all regressors has the smallest p-value then this full model is determined to be the final model.
From the standpoint of further analysis, the final model becomes the “full model” for this set of potential regressors.
Backward Elimination
Starting with the full model, successive models are created, each one using one less regressor (or covariate) than the previous model.
Each of the regressors currently in the model is removed to create a “trial” model excluding that regressor. The p-value of the current model (or full model) versus the trial model (or reduced model) is calculated, and the model with the smallest p-value is used as the next model. This method removes the least significant variable from the current model. If every p-value is smaller than the p-value cut-off specified, the backward elimination method stops. The method also stops if all variables have been removed from the model, or if all variables left are included in the original reduced model.
From the standpoint of further analysis, the final model becomes the “full model” for this set of potential regressors.
In this case, all the observations with zero as the predictor variable are place in one group, and all of the observations with a one as the predictor variable are placed in a second group. A two-sample t-test is used to determine the probability that the two groups have the same mean.
-Test)¶Suppose you have
items that are split into two groups of
sizes
and
, and the respective sums of
their continuous responses are
and
.
Further, let
be the sum of the squared responses,
, and let
We can then calculate the
statistic:
where the p-value is given by the tails of a two-sided Student’s t
distribution with
degrees of freedom:
where
.
Test)¶Suppose you have
observations, each with a
-dimensional response in an
matrix,
. All the observations with zero as the predictor
variable are placed into one group,
, of size
, and all of the observations with a one as the
predictor variable are placed into a second group,
, of
size
. Note that
. Then a Hotelling
statistic is computed to compare the multivariate
continuous responses in the two groups. Let the
-dimensional vector
contain the means of
responses
,

Define the
-dimensional vector
, where

Define the
entry of the
matrix
as follows:

Define the
entry of the
vector
as follows:

The matrix equation
is solved and the
statistic is calculated as follows:

The p-value is computed using the F distribution with
and
degrees of freedom:

where
.
The (univariate binary) response,
is fit to the given
predictor variable,
, using logistic regression, and the
results include the regression p-value, the parameters
and
which are output in a new spreadsheet along with
other association test results, any multiple test correction results, as
well as any expected p-values based on the rank of the observed p-value
and number of predictors. The response is represented with the formula
, with the model itself being
the expression
and the error term,
, expressing the difference, or residual, between the
model of the response and the response itself.
Assuming there are
observations, a logit model is used to
fit the binary response,
, using
and a vector
of 1’s as a covariate matrix (
). (The vector of 1’s
facilitates obtaining an intercept.) The Newton’s method approach of
maximizing the log likelihood function is used for estimating the
logit model [Green1997]. The null hypothesis being tested is that the
slope and intercept coefficients in the logit model are zero. A
likelihood statistic is calculated, where
is the
unrestricted likelihood and
is the restricted
likelihood, and
is asymptotically distributed
as Chi-Squared with
degrees of freedom.
We simplify the notation by using a lower-case
to mean “log
likelihood”–that is,

and

where base
logarithms are used.
Using this notation, the unrestricted log likelihood is
![l_0 = n[s\log(s) + (1-s)\log(1-s)],](_images/math/918dea7acf9361ca36ca78e651b0c56aeab134d4.png)
where
is the proportion of the
dependent observations
that are equal to one, and the restricted log
likelihood is
![l_1 = \sum^n_{i=1}\left[y_i\log\left(\frac{1}{1+e^{-\hat{\beta}^Tz_i}}\right) + (1-y_i)\log\left(1-\frac{1}{1+e^{-\hat{\beta}^Tz_i}}\right)\right],](_images/math/db4af195980611221aa4662b2f477f0a989d7a4c.png)
and
where
. Here,
is the vector
.
The multiple logistic regression uses a logit model to fit the binary
response
, using the covariate matrix
, consisting of the regression coefficients for
continuous predictors and indicator coefficients for categorical
predictors, along with a column of 1’s for the intercept. The Newton’s
method approach of maximizing the log likelihood function is used for
estimating the logit model [Green1997]. The null hypothesis being
tested is that the slope and intercept coefficients in the logit model
are zero. A likelihood statistic is calculated, where
is
the unrestricted likelihood and
is the restricted
likelihood, and
is asymptotically distributed
as Chi-Squared with
degrees of freedom.
We simplify the notation by using a lower-case
to mean “log
likelihood”–that is,

and

where base
logarithms are used.
Using this notation, the unrestricted log likelihood is
![l_0 = n[s\log(s) + (1-s)\log(1-s)],](_images/math/918dea7acf9361ca36ca78e651b0c56aeab134d4.png)
where
is the proportion of the
dependent observations
that are equal to one, and the restricted log
likelihood is
![l_1 = \sum^n_{i=1}\left[y_i\log\left(\frac{1}{1+e^{-\hat{\beta}^Tz_i}}\right) + (1-y_i)\log\left(1-\frac{1}{1+e^{-\hat{\beta}^Tz_i}}\right)\right],](_images/math/db4af195980611221aa4662b2f477f0a989d7a4c.png)
and
where
. Here,
is the vector
of slope coefficients.
For the full versus reduced logistic regression model, logistic regression equations are obtained for both the full model and for the reduced model. The reduced logistic regression model includes only the dependent and any covariates selected for the reduced model. The full logistic regression model includes all of the variables including any full model covariates.
A likelihood ratio statistic is calculated to find the significance of
including the full model regressors vs not including these regressors.
The restricted likelihood of the reduced model is represented by
and
is the restricted likelihood of the full
model. Both
and
are computed as below:
![l_0 & =\log{L_0}= \sum^n_{i=1}\left[y_i\log\left(\frac{1}{1+e^{-\hat{\beta}^T_Rz_i}}\right) + (1-y_i)\log\left(1-\frac{1}{1+e^{-\hat{\beta}^T_Rz_i}}\right)\right],
l_1 & =\log{L_1}= \sum^n_{i=1}\left[y_i\log\left(\frac{1}{1+e^{-\hat{\beta}^T_Fz_i}}\right) + (1-y_i)\log\left(1-\frac{1}{1+e^{-\hat{\beta}^T_Fz_i}}\right)\right],](_images/math/36af1a17086f05b512e1db284d586e67a04426cc.png)
and
where
where
are the degrees of
freedom of the full model and
are the degrees of freedom of
the reduced model. Here,
is the reduced model
vector of slope coefficients, and
is the full
model vector of slope coefficients.
The coefficient of each regressor, along with the y-intercept, is calculated as a part of the Newton’s method approach of maximizing the log likelihood function for the full model.
The standard error of the
regressor is found by
inverting the information matrix of the regression, which is formed
using the intercept as the last coefficient. The square root of the
diagonal element of the inverted matrix is the standard
error of the
regressor, and the square root of the last
diagonal element of the inverted matrix is the standard error of the
intercept. (See [HosmerAndLemeshow2000].)
The p-value Pr(Chi) associated with dropping the
regressor from the regression is found by running a separate logistic
regression using all the regressors as the full model and a model with
all the regressors except the
regressor as the reduced
model. (The “Chi” refers to the likelihood ratio test that is performed
between these two models to find the p-value.)
The regression odds ratio for a coefficient
is simply
. The interpretation of this is by how much (by what
ratio) the odds of the dependent being one change if the given regressor
changes by one unit. An example would be the ratio of the odds of being
a case rather than a control for a smoker to the odds of being a case
rather than a control for a non-smoker.
The p-value for the univariate fit of the
regressor is
obtained from a separate logistic regression which is calculated as if
the
regressor were the only regressor in the model
against the dependent variable.
If a covariate is categorical, dummy variables are used to indicate the category of the covariate. A value of “1” for the observation indicates that it is equal to the category the dummy variable represents. Similarly, if the observation is not equal to the category for the dummy variable, then it is assigned the value of “0”. As the values of one dummy variable can be determined by examining all other dummy variables for a covariate, in most cases the last dummy variable is dropped. This avoids using a rank-deficient matrix in the regression equation.
A first-order interaction term is considered a new covariate created from the product of two covariates as specified in either the full- or reduced-model covariates. If one interaction term is categorical, dummy variables for each category of the covariate will be multiplied by the other covariate to create a first-order interaction term. If both covariates are categorical, dummy variables from both covariates will be multiplied by each other.
For example, consider the following covariates for five samples.
| Sample | Lab | Dose | Age |
|---|---|---|---|
| sample01 | A | Low | 35 |
| sample02 | A | Med | 31 |
| sample03 | A | High | 37 |
| sample04 | B | Low | 32 |
| sample05 | B | Med | 36 |
| sample06 | B | High | 33 |
Using dummy variables for the categorical covariates the above table would be:
| Sample | Lab=A | Lab=B | Dose=Low | Dose=Med | Dose=High | Age |
|---|---|---|---|---|---|---|
| sample01 | 1 | 0 | 1 | 0 | 0 | 35 |
| sample02 | 1 | 0 | 0 | 1 | 0 | 31 |
| sample03 | 1 | 0 | 0 | 0 | 1 | 37 |
| sample04 | 0 | 1 | 1 | 0 | 0 | 32 |
| sample05 | 0 | 1 | 0 | 1 | 0 | 36 |
| sample06 | 0 | 1 | 0 | 0 | 1 | 33 |
Interactions Lab*Dose and Lab*Age would be specified as:
| Sample | A*Low | A*Med | A*High | B*Low | B*Med | B*High | A*Age | B*Age |
|---|---|---|---|---|---|---|---|---|
| sample01 | 1 | 0 | 0 | 0 | 0 | 0 | 35 | 0 |
| sample02 | 0 | 1 | 0 | 0 | 0 | 0 | 31 | 0 |
| sample03 | 0 | 0 | 1 | 0 | 0 | 0 | 37 | 0 |
| sample04 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 32 |
| sample05 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 36 |
| sample06 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 33 |
If only a few variables (regressors or covariates) drive the outcome of the response, Stepwise Regression can isolate these variables. The methods for the two types of stepwise regression, forward selection or backward elimination, are described below.
Forward Selection
Starting with either the null model or the reduced model (depending on which type of regression was specified), successive models are created, each one using one more regressor (or covariate) than the previous model.
Each of the unused regressors is added to the current model to create a “trial” model for that regressor. The p-value of the trial model (or full model) versus the current model (or reduced model) is calculated, and the model with the smallest p-value is used as the next model. This method adds the next most significant variable to the current model. If the current model had the smallest p-value, or if no p-value is better than the p-value cut-off specified, then the forward selection method stops and declares the current model as the final model as determined by stepwise forward selection. If the model with all regressors has the smallest p-value then this full model is determined to be the final model.
From the standpoint of further analysis, the final model becomes the “full model” for this set of potential regressors.
Backward Elimination
Starting with the full model, successive models are created, each one using one less regressor (or covariate) than the previous model.
Each of the regressors currently in the model is removed to create a “trial” model excluding that regressor. The p-value of the current model (or full model) versus the trial model (or reduced model) is calculated, and the model with the smallest p-value is used as the next model. This method removes the least significant variable from the current model. If every p-value is smaller than the p-value cut-off specified, the backward elimination method stops. The method also stops if all variables have been removed from the model, or if all variables left are included in the original reduced model.
From the standpoint of further analysis, the final model becomes the “full model” for this set of potential regressors.
The alleles of multiple markers transmitted from one parent are called a haplotype. Haplotype analysis of safety and efficacy data can incorporate the information from multiple markers from the same gene or genes, which are physically close on a specific chromosome. Genotypic data from unrelated individuals do not contain information on which alleles were transmitted from each parent, but haplotype frequencies can be estimated using several existing in silico methodologies such as the Expectation Maximization (EM) algorithm and the Composite Haplotype Method (CHM).
A common type of genetic data is genetic information independently scored at several markers along a chromosome. Each subject has two copies of the same chromosome, one chromosome from the mother, the other one from the father. It is not clear which alleles at different markers reside on the maternal, and which are on the paternal copy of the chromosome. Only individuals that are heterozygous at most at a single marker can be resolved into a pair of haplotypes unambiguously. This problem, called “genetic phase uncertainty” is an example of a more general problem of statistical inference in the presence of missing data.
The expectation-maximization (EM) algorithm, formalized by [Dempster1977], is a popular iterative technique for obtaining maximum likelihood estimates of sample haplotype frequencies (see [Excoffier1995] for details of obtaining haplotype frequencies by EM).
Determining the probability of each haplotype for each sample from the overall haplotype probabilities is based on the computation method being used, and does not relate to any estimate of LD between any pair of markers involved.
CHM diplotype probabilities are based on the average of the probabilities of the two haplotypes, while EM diplotype probabilities are based on the product of the probabilities of the two haplotypes. The process of finding the EM diplotype probabilities is equivalent to the “expectation” step of each EM iteration.
Consider the following simple example. Suppose we observe a sample of 6 individuals typed at two bi-allelic markers. The first three individuals are double homozygotes, AA/BB, contributing six gametes of the type AB to the sample counts. The next two are aa/bb, contributing four gametes of the type ab. The last individual, however, is a double heterozygote, Aa/Bb, and there are two possible pairs of gametes, AB with ab, or Ab with aB. Taking into account the first part of the sample, the EM algorithm assigns an extremely low probability to the second, (Ab with aB) arrangement, because these types of gametes have not been observed among unambiguous individuals at all. To make such a calculation, it is necessary to assume “Hardy Weinberg Equilibrium” (HWE) over the two markers, which means that the two-marker genotypes do not deviate from the products of the haplotype frequencies. On the other hand, the CHM method described below does not assume HWE and maintains that either of the two arrangements (AB with ab, or Ab with aB) is equally likely.
Certainly, there will be discrepancy between sample frequencies
generated by the two methods. Our goal, however, is not simply to
report, but to contrast haplotype frequencies between levels of
phenotype. Assume for the moment two markers with two alleles and that
the phenotype is binary, i.e. the subjects can be classified as either
“cases” or “controls”. For simplicity, assume that the haplotype AB
determines the probability of an individual being a “case” and
individual alleles are unimportant. Then the expected allele frequencies
(
) are the same among cases and controls. We can
therefore express the expected frequency of AB haplotypes in cases and
controls as

where
,
are linkage disequilibrium (LD)
coefficients. These coefficients are needed to account for the part of
gametic frequency not explained by frequencies of alleles,
. When we are comparing
between cases
and controls, we are essentially looking at the size of the difference
between
and
. The EM algorithm
attempts to obtain estimates of
in cases and controls
separately, assuming HWE in these groups. However, even if HWE holds in
the population, it is expected not to hold in either cases or controls
looked at separately [Nielsen1998], which may lower the power of the test. Moreover,
cases and controls exhibit the deviation from HWE in different
directions [Zaykin2000], which is exploited by CHM. The CHM estimates a quantity
different from
, specifically

The last term,
, is the frequency of A, B alleles that
reside on two different gametes [Weir1996], in contrast to
, that measures their joint frequency on the same
gamete. This “intra-gametic” frequency can also be written as a product
of A, B allele frequencies plus the deviation unexplained by the
product. Generally, this deviation is not zero if the HWE at the
haplotype level does not hold. It is also expected that this deviation
is generally larger in cases than in controls (denote the difference by
), when cases are less prevalent category in a
population. Therefore, when comparing composite frequencies
between cases and controls, we are looking at the
size of the difference
. The term
is zero only under the multiplicative model of
haplotype action, in which case two methods are comparing the same
quantities.
The quantities
,
,
, and
are all unknown. The process of imputing LD involves
first imputing two-marker haplotype frequencies by the chosen method,
then imputing LD based on those imputed haplotype frequencies.
EM and CHM each have a distinct way of imputing haplotype frequencies,
neither of which involve the values of
,
,
, and
between any particular pair of
markers. EM assumes diplotypes are products of haplotype frequencies
(which is a consequence of HWE over the markers), and CHM assumes that
any of the haplotype arrangements is equally likely (such as AB with ab
or Ab with aB in the example described above) and will take the average
of their probabilities. The respective algorithms are based on those
respective assumptions.
The CHM LD method internally takes a shortcut past the haplotype imputation code and just does the equivalent imputation process directly. For biallelic markers, it additionally compensates for lack of HWE in either or both of the two markers.
In practice, we have found that more often than not the EM and CHM result in essentially the same p-values for tests of association with the phenotype, although the CHM can be much more powerful under models that include large non-additive components. Compared to EM, it is a non-iterative and much faster algorithm, posing no danger of convergence to a local maximum.
For haplotype inference with Single-Nucleotide Polymorphisms (SNPs), the EM algorithm is popular because it is easy to interpret and stability. Compared to the Gibbs sampler, another method of estimating haplotype phases, the EM approach is a deterministic procedure. It also requires less computing time, and is easier to check for convergence. The output of the EM algorithm is the maximum-likelihood estimate (MLE), which possesses well-established statistical properties.
Note
Because of the memory-space properties of the EM algorithm, the EM algorithm will perform poorly if at all with more than a dozen loci.
The expectation-maximization (EM) algorithm for haplotype imputation is an implementation of the method described in [Excoffier1995].
The CHM is based on the idea of the genotypic LD coefficient,
, [Weir1996]. Estimation of
involves
calculation of di-genic frequencies. In the two-locus bi-allelic case,
they are estimated as

where
, for example, is the number of individuals with
genotype Aa/Bb, and n is the sample size. The composite disequilibrium
is defined as a sum of inter- and intra-gametic components,

Under random mating,
and so
is
an unbiased estimate of the LD parameter,
For our
purposes it is not necessary to separate the inter- and intra-gametic
components and we work in terms of

[Zaykin2001] extended the definition of di-genic frequencies to multiple loci and
alleles. For the i-th individual multilocus genotype
let
be the number of single-locus heterozygotes in
Define weights as

Sample composite haplotype counts are calculated from summing over individual contributions,

where n is the sample size, and
is the indicator function, defined
as

Thus, if the i-th individual has at least one copy of all required
alleles, it is counted with weight
. The composite
haplotype frequencies are given by

In a two-locus, two-allele case, composite haplotype counts simplify to Weir’s
definition,
In a single-locus case, they are the usual definition of the allele
count:

This technique was pioneered at the Broad Institute, which distributes a program called “EIGENSTRAT” which implements this technique. They describe the PCA correction technique in [Price2006]. A more thorough discussion of stratification, principal components analysis and the eigenvalues involved may be found in [Patterson2006].
Note
Golden Helix SVS currently uses PCA with association tests performed against genotypes (using their numeric equivalent values) (Genotypic Principal Component Analysis) and with association tests performed against numerical data (Numeric Principal Component Analysis).
Suppose
is the
(marker) by
(subject
or sample) matrix of either numeric values or the numeric equivalent
values of the genotypes, with the exception that for each marker, what
had been the average value over that marker has first been subtracted
out from the marker’s values (data centering by marker).
Note
In our notation, a spreadsheet in SVS on which PCA may
be performed may be thought of as
, since such a
spreadsheet suggests an
(subject or sample) by
(marker) matrix.
We would like to decompose
into
as a
singular value decomposition, where
is an
by
matrix for which
,
is an
by
matrix for which
, and
is an
by
singular-value matrix.
For this decomposition,
may be thought of as consisting of
principal components or “ancestries” relating the samples to the
singular values, while the
column of
(where
and
) may be thought of as
containing the “coordinate value” of each marker along the
principal component.
We do this by finding the eigenvalues and eigenvectors of
. (This matrix is like a Wishart matrix. See
Further Motivation: Relation to the Variance-Covariance Matrix.) Rewriting
in terms of the
singular value decomposition, we have

where
is an
by
matrix
containing the eigenvalues of
on its diagonal and
zeroes elsewhere. Right-hand multiplying the above by
gives us

Since
is a matrix of the eigenvectors (principal
components) of
, the above implies, for
through
, that

where
are the eigenvalues of
, and
the vectors
are the columns of
and are also
the eigenvectors of
.
As stated above, the principal components
may be thought
of as “ancestry” vectors. The most “important” ones correspond to the
highest values of
. Our object will be to subtract
these most important
out from the rest of the data, so
that we may more effectively test for associations with markers that
specifically relate to our phenotype, rather than with markers that
relate to population ancestry.
Note
Matrix
is not unique, except that the first
columns have to be eigenvectors of
corresponding to the eigenvalues
. As stated
above, the
element of the
column of
may be thought of as the relative magnitude of the
principal component in making up the
row of
.
However, SVS does not actually need to compute any part of
matrix
at all. The procedure outlined in
Applying the PCA Correction is dependent only upon the input data
and upon the
(from matrix
), and is
otherwise self-contained.
Note
In actual fact, SVS normalizes
by the number
of markers,
. The corresponding decomposition of
becomes
, and
is the matrix which contains the eigenvalues
that we can obtain and output from SVS.
When accumulating the elements of
for genotypic data, the
marker data may be optionally weighted in favor of those markers with
less variance by normalizing them according to their variance, either
theoretical or actual. See Formulas for PCA Normalization of Genotypic Data regarding marker normalizing
for genotypic data.
Otherwise, the technique is to:
matrix.
.Suppose
is a vector containing the dependent variable
(response) values or numeric-equivalent values. Exercising data
centering by marker, we subtract the average element value from each
element to get
; then,

where
. Similarly,

where
, etc., up to

where
.
Note that the
values are obtained through dot products,
where component wise,

The corrected
is obtained (as a part of data centering by
marker) by adding back the average element value of the original
to each element of
.
The idea is for each principal component, we find what magnitude of it
is present in
(or
)–then we subtract that
magnitude of that component from
to get
.
Exactly the same technique is applied for every vector
of
data from an individual marker. After subtracting the average element
value to get
, we compute

where
. Similarly,

where
, etc., up to

where 
Component-wise, we have

We finally get the corrected
back (as a part of data
centering by marker) by adding back the average element value of the
original
to each element of
.
The possibilities for marker normalization are:
The theoretical standard deviation of this marker’s data at Hardy-Weinberg equilibrium (HWE). This means the standard deviation of this marker’s data would have to be over the population if, in the population, it were in Hardy-Weinberg equilibrium and had the same major and minor allele frequencies as are actually measured for this marker. This is the standard method used by the EIGENSTRAT program.
This theoretical standard deviation is as follows according to the genetic model being used:
Additive (the only model supported by EIGENSTRAT itself):

where

is a Bayesian estimate of the minor allele probability in terms
of the numeric marker values
.
Dominant:

where

is a Bayesian estimate of the probability of being a DD or Dd genotype.
Recessive:

where

is a Bayesian estimate of the probability of being a DD genotype.
The actual standard deviation of this marker’s data.
Don’t normalize.
Note
Numeric data is not normalized in SVS.
For the purposes of PCA correction in SVS, the marker data may
be treated as if it comes from a multivariate normal distribution, where
the data from marker
,
,
is picked from a random variable
(where
is equal to the number of samples). If
indeed
has a multivariate normal distribution, then there is a random vector
whose components are independent standard
normal random variables, a vector
, and
a
by
matrix
such that
.
The vector
is the expected value of
, and the
matrix
is the variance-covariance matrix of the
components
.
Thus, the vectors
in matrix
may be treated as coming from the expression

or

where
are elements in matrix
.
Let us examine
. One element of
may be
written as

The term
expands out to
![x_{ri}x_{rj} = [(\sum_qz_{rq}a_{qi}) + \mu_i][(\sum_qz_{rq}a_{qj}) + \mu_j]](_images/math/c403856f916870f8521897dfa7a84e5e4be6b8cf.png)

Expanding the first term above, we have

We may thus write
![(X^TX)_{ij}& = \sum_rx_{ri}x_{rj} = \sum_r[\sum_qz_{rq}a_{qi}z_{rq}a_{qj} + \sum_q\sum_{s\ne q}z_{rq}a_{qi}z_{rs}a_{sj} + \mu_i(\sum_qz_{rq}a_{qj}) + \mu_j(\sum_qz_{rq}a_{qi}) + \mu_i\mu_j]\\
& = \sum_r\sum_qz_{rq}a_{qi}z_{rq}a_{qj} + \sum_r\sum_q\sum_{s\ne q}z_{rq}a_{qi}z_{rs}a_{sj} + \mu_i\sum_r\sum_qz_{rq}a_{qj} + \mu_j\sum_r\sum_qz_{rq}a_{qi} + \sum_r\mu_i\mu_j\\
& = \sum_qa_{qi}a_{qj}\sum_rz_{rq}^2 + \sum_q\sum_{s\ne q}a_{qi}a_{sj}\sum_rz_{rq}z_{rs} + \mu_i\sum_qa_{qj}\sum_rz_{rq} + \mu_j\sum_qa_{qi}\sum_rz_{rq} + m\mu_i\mu_j,](_images/math/be02bbb1bb8e184b994c820bf74deb521d0365cd.png)
where m is the number of markers.
However, since the components of the random variable from which
is assumed to be drawn are independent standard normal
random variables, we may make the following estimates:

Thus, we may estimate the elements
and
of
as

or

Note that

thus, we have the following estimate:
where
is the covariance matrix of the multivariate
distribution of
and
is an outer product
of the sample averages.
Note
is a Wishart matrix based
on
. The distribution of this random
matrix is a generalization of the chi-square distribution.
In the EIGENSTRAT technique (Formulas for Principal Component Analysis), the data is centered
by marker–that is, the average of the data within a given marker is
first subtracted out from each of that marker’s data values both before
they are used for the
matrix and before they are
PCA-corrected, with the averages being added in after correction.
An alternative approach, available in SVS for numeric or copy number data, is to not center by marker, but instead allow the overall average value of each marker to be a part of the principal components analysis.
Two cases can be made that this approach is preferable to centering by marker for segmenting copy number data. First, if a variation in the data due to a batch effect is “corrected out” using centering by marker, a slight displacement over the remaining markers will ensue. If this displacement is large enough, it might be interpreted as a variation large enough to require a segment boundary, even over samples where there should not be a segment boundary.
Second, leaving the marker average in the principal components analysis may result in overall data smoothing and removing unnecessary segment boundaries.
On the other hand, a true copy-number variation over enough samples may influence the marker average, which, under this approach, could be picked up as a large component or as large parts of two or three large components by principal components analysis. This might result in apparent variations large enough for segment boundaries for samples for which there were no actual variations. However, the good news is that the apparent variations would be in the opposite direction from the true copy-number variations.
Regarding data centering by marker, it should be noted that after the
marker averages are subtracted out, we have
, where
is the element of
corresponding to the
-th marker and the
-th sample. Since

for any pair of samples
and
, we have

Multiplying a constant vector
into the above
will therefore obtain a vector with zero for every
element, demonstrating that when data centering by marker is used, one
of the eigenvalues is zero, corresponding to an “average vector” of
constant elements.
If data is not centered by marker, but instead marker averages are incorporated into the components, this zero eigenvalue will not be present–instead, if the data’s averages are indeed not zero, one “extra” eigenvalue will show up. Often, this will be among the larger eigenvalues, before the eigenvalues start becoming very similar to eigenvalues (with one smaller subscript) that occur when data centering by marker is used.
A second alternative approach to PCA, available in SVS for
numeric or copy number data, is to center by sample, either in addition
to centering by marker or instead of centering by marker. This entails
effectively subtracting out the sample average before contributing to
the matrix
(and adding it back in while correcting). (In
SVS, if centering by both sample and marker, the overall
average is subtracted out only once.)
Whether it is appropriate to subtract out sample averages and add them back in after correction, rather than including the sample averages as part of what may be corrected, depends upon how much of the batch effect or stratification may be built into the various sample averages and their products vs. built into the variation across both markers and samples.
An example of when centering by sample can be useful is when obtaining components of copy number data from chromosomes 1 through 22 and then applying these to copy number data for chromosomes X and Y (See Applying PCA to a Superset of Markers). Centering by sample insures that the different averages of X chromosome intensities between men and women and the different averages of Y chromosome intensities between men and women will be preserved through PCA correction.
Sometimes, it is desirable to obtain principal components for a given set of samples from one set of markers, and then apply these components to a different set of markers. One example is, within a given set of samples, obtaining the components from a set of markers with a known population structure and then applying these to genotypic data over other markers. Another is the example mentioned above, which is to, for a given set of samples, obtain components of copy number data from chromosomes 1 through 22 and then apply these to copy number data for chromosomes X and Y.
Whether applying PCA to a superset of markers is reasonable to do is governed by whether we can assume that the behavior of the data over the different set of markers, especially the parts of the data contributing to population structure or batch effects, will be governed by essentially the same variance-covariance matrix (and the same set of sample averages, unless data centering by sample is being used) that the original data is governed by. Under this assumption, the same eigenvectors that describe the original data’s population structure or batch effects will also govern, and may be used to remove, the “new” data’s population structure or batch effects.
It is also sometimes desirable to obtain components for one set of samples, then use these components to correct data from a subset of these samples, either for the same markers or for a superset of markers. To do this, a subset of the components is first created to match the data’s sample subset, then are applied to the data.
Here again, if we are correcting over any superset of the markers from which we obtained principal components, we assume consistency in the variance-covariance matrix (and the sample averages, if applicable) over this superset of markers.
However, now we must look at whether two components which are orthogonal
in the main sample set are still orthogonal in the sample subset.
Suppose
and
are two
components of the full set, where
and
are in the
original sample set but not the sample subset, and
and
are in the sample subset. (Without loss of generality, we
are writing the vector as if all of the elements in the subset follow
after all of the elements not in the subset.) Because these are
eigenvectors of
, they will be orthogonal. That is, their
dot product will be zero.
This dot product may be written
. We see
that if the individual dot products, both outside of and inside of the
subset, are zero, then the full dot product will be zero. Thus,
components that are orthogonal in the subset and its complement will be
orthogonal for the overall set.
Unfortunately, the converse is not generally true. It is true that because the two dot products must sum to zero, any variations of dot products for the subset are perfectly compensated for by corresponding variations in the opposite direction in the product components over data that has not been subset out. On that basis, we could try to say that neither dot product “should” deviate too much from zero, and thus that the components as input “should” be more or less orthogonal over the subset, especially if the subset is of a reasonably large size.
However, because the subset components are not precisely orthogonal
except in special cases, a modified Gram-Schmidt process is applied by
SVS to make the components that belong to the subset
orthonormal (both orthogonal and normalized). Effectively, this is
equivalent to “obtaining a PCA correction” for every component
based upon the previous components
,
, ..., up to
.
The Runs of Homozygosity algorithm is designed to find runs of consecutive homozygous SNPs and find where those runs are common among at least the specified number of samples.
The first part of the analysis is determining the homozygous runs or Runs of Homozygosity (ROH), which can be represented as the index of the SNP where the run started, and the index of the last SNP in the run. A homozygous genotype is one where both bases are the same, such as A_A or B_B.
Unlike the PLINK ROH algorithm which uses a moving window that potentially introduces artificial runs and fails to capture runs larger than the window, the Golden Helix SVS algorithm works continuously across an entire chromosome looking at every possible run that match the parameters specified for the algorithm in the Runs of Homozygosity} window. The algorithm looks at every homozygous SNP as a potential start of a new ROH run. Each SNP is then determined to be homozygous, heterozygous or missing. Each potential run is then updated to be extended if it’s homozygous, or modified to have their heterozygous or missing count incremented. Runs that exceed the allowed number of heterozygotes or missing SNPs are then checked to see if they should be thrown out based on their length and SNP density according to the specified algorithm parameters. At the end of a chromosome, the longest runs are greedily chosen out of all the potential runs (with no overlapping of runs allowed) until no more runs are available. The result is the set of runs containing the longest runs of homozygous SNPs for a given chromosome and sample. This is repeated for each chromosome and sample of the dataset.
A second algorithm is used to create a clustering of runs. Given the
final list of ROHs calculated by the previous algorithm, and a threshold
for the minimum number of samples that contain a run,
a cluster is defined as a contiguous set of SNPs where every SNP has at
least
samples in a run for each SNP. The algorithm
sweeps across all the SNPs in the dataset looking for clusters that meet
the run length requirements provided to the algorithm. A list of
clusters is outputted to a spreadsheet. Optionally, another spreadsheet
is computed that calculates for each sample the ratio of SNPs in each
cluster that are members of a run of homozygosity.
As SVS imports Affymetrix CEL files it creates quantile normalized log2 ratios in order to more accurately find regions of copy number variation and for more accurate association testing on the CNV variants.
The process to generate normalized log2 ratios is very analogous to the methodology employed by Affymetrix [Affymetrix2007], scaled to handle thousands of cases and controls. The process is as follows:
Depending on the Mapping array type, there are anywhere from 1 to 40 probes used to interrogate a given genotype. The Affymetrix 500k mapping array provides perfect match and mismatch probes, whereas the Affy 5.0 and 6.0 chips only include perfect match probes. We only use the means of the perfect match probes.
The A and B probe intensities are quantile normalized per sample using the approach of [Bolstad2001] and [Bolstad2003]. 500k NSP and STY samples are quantile normalized separately, as are the Affy 5.0 and 6.0 polymorphic and non-polymorphic probes. The process is as follows:
Calculate a reference distribution and calculate log2 ratios:
Select a set of samples for the reference distribution, for instance the controls.
For each polymorphic probe,
, calculate the median of
the quantile normalized A intensities and the median of the
quantile normalized B intensities,
for the reference samples. Then for a given pair of probe
intensities
and
, the normalized copy
number signal is the log2 ratio of the sum of the
and
probe intensities to the median A and B probe
intensities:

For non-polymorphic probes, there is no B intensity, but the analogous normalization is performed:

Join together distributions:
The copy number segmenting algorithm is designed to find local variations over markers. The segmenting process is optimized by working at three levels:
CNAM offers two types of segmenting methods, univariate and multivariate. These methods are based on the same algorithm, but use different criteria for determining cut-points. The multivariate method segments all samples simultaneously, finding general copy number regions that may be similar across all samples. This method is preferable for finding very small copy number regions, and for finding conserved regions possibly useful for association studies. For a given sample, the covariate is the mean of the intensities within each segment for that sample. If there are consistent positions for copy number variation across multiple samples, the copy number segments will be found. In reality, there may not always be consistent copy number segments across multiple samples. The univariate method segments each sample separately, finding the cut-points of each segment for each sample and outputs a spreadsheet showing all cut-points found among all samples. There are two options for how this spreadsheet can be formatted. Either there will be one column for each unique cut-point segment, or there will be a column for all segmented markers in the marker map. The first spreadsheet is good for association analysis because the multiple testing adjustment is smaller. However, the second spreadsheet is better for plotting the results. The results are the same in both cases, but in the second spreadsheet the results of a column are repeated for every marker in each unique segment.
See Using CNAM (Copy Number Analysis Method) Optimal Segmenting for more information.
CNAM allows segmenting with a moving window, where the segmentation algorithm is applied to each sub-region created by the window. Using the optional moving window will result in an overall faster analysis; however, the results may not be as robust as analyzing the entire chromosome simultaneously. We implement an approach to windowing that minimizes edge effects as the moving window moves along the chromosome:
Start out with a sub-region of
markers.
Perform the segmenting algorithm on the current sub-region.
If two or more cut-points were found, meaning that there were more than two segments found:
Use
markers. Otherwise:
If no cut-points could be found at all to subdivide the one
sub-region, the next sub-region is started in the middle of the
current sub-region and extended for
markers. No
cut-points are yielded in this case.
Note
This may mean a region of copy number variation which started before this sub-region will be considered to be extended through this sub-region and possibly for much longer. We do not arbitrarily subdivide regions based on moving-window boundaries–instead, this whole procedure is meant to find valid cut-points throughout the main region.
If the sub-region was successfully segmented into exactly two
segments, yielding exactly one cut-point, and the cut-point is in
the first half of the sub-region, this cut-point is accepted, and
the next sub-region is started at that cut-point and extended for
markers.
If the sub-region was successfully segmented into exactly two
segments, yielding exactly one cut-point, but the cut-point is in
the second half of the sub-region, the apparent cut-point may be
due to an edge effect. Thus,the cut-point is not used and the
moving-window sub-region is changed so as to get better
segmenting. If the size of the sub-region is now still
, the sub-region is simply extended so that the total
size of the sub-region is
. If the size of the
sub-region is already
, the next sub-region, which
goes back to being length
, is started at the former
start plus
.
Finally, when the end of the main region is encountered, whatever cut-points are obtained from the sub-region are accepted.
You may set the window size as one of the optional segmenting parameters, see Using CNAM (Copy Number Analysis Method) Optimal Segmenting.
Selecting to not use a moving window in the segmenting analysis results in the sub-region spanning an entire chromosome. The segmenting algorithm is then applied to the whole chromosome. This is more computationally intensive, and the most trusted procedure. Note, that it may be necessary to increase the Max segments per window to more than the default of 20, as a given chromosome may well have more than 20 regions of copy number variation.
The segmentation algorithm exhaustively looks through all possible
cut-points for a given region of data, to minimize the sum of squared
deviations of the data from the mean of its respective segment. Given
points or intensities, and
cut-points defining
segments, the number of possible cut-points is
. That is, the search space explodes exponentially. This
exponential search space perhaps accounts for the explosion of the
algorithms using hidden Markov model approaches or approximate
segmenting approaches to find the segment regions.
Ironically, since 1999, within the tree splitting code of Golden Helix SVS, we have had an optimal algorithm that finds segment boundaries, effectively overcoming the exponential search problem. This algorithm, due to Hawkins, see [Hawkins1972], [Hawkins1973], and [Hawkins2002], is based on dynamic programming. It exhaustively searches through all possible cut-point positions to find an optimal segmenting of the data. We employ this approach with a few modifications to the copy number problem.
The segmentation algorithms in Golden Helix SVS use “segmentation for a normally distributed response for a continuous-ordinal predictor”. This mode of segmentation is partly analogous to a univariate counterpart used in creating tree models. However, there are important differences, the main one being that no final p-value calculation for the entire sub-region is attempted because the main emphasis here is on local variation rather than an entire “split”. Also, the tree models do not currently implement a permutation testing procedure for determining split cardinality.
Note
The following points apply for the copy number segmentation algorithms:
In the multivariate algorithm, the samples (from the respective subjects) are considered to be the respective multivariate “dimensions” of this analysis. In the univariate algorithm, the dimension of the analysis is one, as each sample is analyzed independently. The log2 ratio for every sample and every marker is considered to be the “response” variable.
The model is that the observations segment into
groups, and
that each group, within each dimension, has a different mean with noise.
The
cut-points optimally splitting the data in a maximum
likelihood sense are found by minimizing the sum over all dimensions of
sums of squared deviations of the group means from the observations.
The positions of these cut-points are found in advance for every
possible total number of segments, from two to the maximum number of
segments allowed. You may set the maximum number of segments (per
sub-region)
as one of the optional segmenting
parameters (see Using CNAM (Copy Number Analysis Method) Optimal Segmenting).
To determine if we can use a given number of segments
, a
permutation testing based comparison is done between each pair of
adjacent segments. The permutation testing procedure is described in the
next section. If the segments are significantly different for every pair
of adjacent segments,
segments are used. Otherwise, this
same determination is tried for
segments, and so on. This
process is started at
and continued through
.
It may turn out that this process continues through to
.
This means it is not possible to divide the observations into more than
one segment nor to get any cut-points (over this sub-region).
The univariate outlier removal option addresses a couple of problems. When a minimum number of markers per segment is specified, such as 10, a single marker outlier could drag along another 9 markers to create a segment that was driven only by the one point. This would inflate the number of 10 marker segments, and would incorrectly specify boundaries of the copy number segment if the region was 9 markers or less in length. The second problem is that if the minimum number of markers per segment is set to 1, single marker segments would be found, but those single marker segments would be found to not be significant with permutation testing, and because the outliers reduced the sum of squared deviations from the mean more than for some of the longer segments, those longer (but significant) segments would be lost as the next lower cardinality splits were chosen. The basic problem is that this maximum likelihood estimator of minimizing the sum of squared deviations from the mean assumes normality, which does not hold in the case of numerous outliers. The outlier removal process applies only when the minimum number of markers per segment is set to 1. It starts with finding the maximum number of segments per window and then deletes the cut-points that bracket single marker segments. Then permutation testing is done on the remaining segments that are longer than length 1 (including the outlier data as part of those larger segments), dropping down to lower cardinality if any pair of adjacent segments are not statistically different.
An optimal
-way split is one that defines
segments such that the sum of squared deviations from the mean of each
segment is minimized. In other words, there is no better placement of
cut-points that would result in a smaller sum of squared deviations of
the data from the respective mean of the segment in which it falls. The
challenge, given this optimal segmenting, is determining whether the
resulting segments represent significant differences between pairwise
adjacent segments.
Consider two adjacent segments. Both the left and right segments have means. The cut-point between them minimizes the sum of squared deviations of the data from the two respective means. If we were to just perform a T-test comparing the adjacent segments, we would need to adjust the resulting p-value with some kind of multiple testing correction for all of the possible cut-points that were searched through to find the best one. A Bonferroni adjustment would be too conservative due to correlation structure in the data. Instead we turn to permutation testing.
The key to performing permutation testing correctly is to randomize and perform the same procedure as was used to find the original statistic. This ends up being fairly straightforward.
Given
is the max pairwise permuted p-value parameter,
for each adjacent pair of segments we perform the following procedure:

times:
.If the permuted pairwise
is less than or equal to
for all adjacent pairwise segments, all of the
segments are statistically significant. Otherwise, we repeat the
procedure with the optimal
way segmenting, until either we
find all pairwise segments are significant, or all the data in a region
represents a single segment (or possibly part of a larger segment if the
moving window procedure is used).
Recent improvements in the permutation algorithm (versions 7.1 and higher) enable the use of all specified threads and an updated randomization procedure. The previous naive potentially biased randomization procedure was replaced with the unbiased Fisher and Yates method (also known as the Knuth shuffle) [Durstenfeld1964]. This modification and a set of 128 random number generators with unique keys run sequentially allows us to simultaneously have up to 128 threads performing permutation testing, and the same segmentation results occur regardless of the number of threads specified. If a non-power-of-two number of threads is specified, it rounds up or down to the nearest power of two, with an upper bound of 128. Due to differences in the randomization procedure, different output can be generated over previous versions for segments that fall on the borders of the p-value threshold. A single segmenting job will now utilize close to 100% CPU 100% of the time if all threads are selected, removing the need to have more than one job running simultaneously on the same machine.
We recommend always using univariate outlier removal and segmenting with the minimum number of markers per segment set to 1 and maximum number of segments per window set to 20 or more. The windowing and p-value for the permutation process can change depending on whether one is focused on whole genome scans allowing higher false discovery rates or cytogenetic applications focused on larger and fewer findings. For the former, use a window size of 5000 or more markers with a p-value of 0.01 (although a p-value of 0.001 will reduce FDR). In the latter cytogenetic case, do not use windowing, segment an entire chromosome at a time, and set the p-value threshold to 0.001.
Reduction in p-value leads to an increase in permutation testing time.
The worst case maximum number of permutations for pairwise permutation
is
, but optimizations allow us to reduce
this number as the p-values become more significant. The segmenting time
is proportional to the square of the number of markers multiplied by the
maximum number of segments per window. Performing whole chromosome
segmentation without a moving window is particularly computationally
expensive on larger arrays, but very tractable on aCGH arrays with
100-300K markers. Running the algorithm with different settings is also
useful as a way to compare results and generate either more
possibilities by taking the union of segments over all methods, or to
narrow down a study to only include the consensus or partial consensus
segments over multiple settings.