# Formulas and Theories: The Science Behind SNP and Variation Suite¶

## General Statistics¶

### General Marker Statistics¶

The following subsections further explain the methods used in obtaining General Marker Statistics, which may be invoked using a separate window (Genotype Statistics by Marker) or as a tab in the Genotypic Association Test dialog (Genotype Association Tests).

#### Hardy-Weinberg Equilibrium Computation¶

The HWE p-value measures the strength of evidence against the null hypothesis that the marker follows Hardy-Weinberg Equilibrium. Large p-values are consistent with the marker following HWE.

Suppose we have a marker with alleles $A_1, \ldots, A_k$ having frequencies $p_1, \ldots, p_k$. We may write the genotype count for alleles $A_i$ and $A_j$ as $n_{ij}$. Due to phase ambiguity, if $i \ne j$, we count occurrences of allele $A_i$ on the first chromosome and allele $A_j$ on the second chromosome, along with occurrences of allele $A_j$ on the first chromosome and allele $A_i$ on the second chromosome, in both the notations $n_{ij}$ and $n_{ji}$.

Thus, we may write the count for allele $A_i$ as

$$n_i = 2 n_{ii} + \sum_{j \ne i} n_{ij}.$$

We may also express the genotype frequency for allele $A_i$ occurring homozygously as $P_{ii} = n_{ii} / N$, and the genotype frequency for heterozygous alleles $A_i$ and $A_j$ as $P_{ij} = n_{ij} / N$, where $N$ is the population count. The frequency of allele $A_i$ may be expressed as:

$$p_i = \frac{n_i}{2 N}.$$

We wish to check the agreement of $P_{ii}$ with $p_i^2$ and the agreement of $P_{ij}$, where $i \ne j$, with $2 p_i p_j$. We multiply by two to deal with the phase ambiguity (see above).

Thus, we will define the Hardy-Weinberg equilibrium coefficient or disequilibrium coefficient $D_{ij}$ for alleles $A_i$ and $A_j$ such that

$$P_{ij} = 2 p_i p_j - 2 D_{ij} \quad (i \ne j) \qquad \text{and} \qquad P_{ii} = p_i^2 + \sum_{j \ne i} D_{ij}.$$

(It may be shown that for a bi-allelic marker, $D_{11} = D_{12} = D_{22}$.)

We then have a chi-squared statistic,

$$\chi^2 = \sum_i \frac{\left( n_{ii} - N p_i^2 \right)^2}{N p_i^2} + \sum_{i < j} \frac{\left( n_{ij} - 2 N p_i p_j \right)^2}{2 N p_i p_j},$$

which follows a chi-squared distribution with $k (k - 1) / 2$ degrees of freedom.

From this, we obtain the distribution’s p-value $p$, and the correlation, $r$, from the inverse distribution for one degree of freedom (where $F_{\chi^2_1}$ is the chi-squared distribution function for one degree of freedom), which is

$$r = \sqrt{\frac{F_{\chi^2_1}^{-1}(1 - p)}{N}}.$$
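As a concrete illustration, the bi-allelic case can be sketched as follows. This is not SVS code; the function name and the direct use of the one-degree-of-freedom formulas are assumptions for illustration.

```python
import math

def hwe_chi_squared(n_AA, n_Aa, n_aa):
    """Bi-allelic HWE chi-squared statistic, p-value and correlation r.

    Compares the observed genotype counts with the counts expected
    under Hardy-Weinberg Equilibrium at the observed allele frequencies.
    """
    N = n_AA + n_Aa + n_aa            # population (genotype) count
    p = (2 * n_AA + n_Aa) / (2 * N)   # frequency of allele A
    q = 1.0 - p                       # frequency of allele a
    # Expected genotype counts under HWE: N*p^2, 2*N*p*q, N*q^2
    exp_AA, exp_Aa, exp_aa = N * p * p, 2 * N * p * q, N * q * q
    chi2 = ((n_AA - exp_AA) ** 2 / exp_AA
            + (n_Aa - exp_Aa) ** 2 / exp_Aa
            + (n_aa - exp_aa) ** 2 / exp_aa)
    # One degree of freedom: the chi-squared tail probability is erfc(sqrt(x/2))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    r = math.sqrt(chi2 / N)           # unsigned HWE correlation
    return chi2, p_value, r
```

For example, counts of (25, 50, 25) are in exact HWE and give a statistic of zero, while (30, 40, 30) shows a homozygote excess.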

#### Fisher’s Exact Test HWE P-Values¶

In this test, all of the possible sets of genotypic counts consistent with the observed allele totals are cycled through, and the probabilities of all sets of counts which are as extreme as or more extreme than the observed set of counts (equally probable or less probable) are summed.

See [Emigh1980].
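A brute-force sketch of this enumeration for a bi-allelic marker, using the conditional distribution of heterozygote counts given the allele totals (as in [Emigh1980]); the function name is illustrative and SVS’s implementation may differ:

```python
from math import factorial

def hwe_exact_pvalue(n_AA, n_Aa, n_aa):
    """Exact HWE p-value by enumerating genotype configurations.

    Cycles through every genotype configuration consistent with the
    observed allele totals and sums the probabilities of all
    configurations as probable as or less probable than the observed one.
    """
    N = n_AA + n_Aa + n_aa
    n_A = 2 * n_AA + n_Aa            # count of allele A
    n_a = 2 * N - n_A                # count of allele a

    def prob(het):                   # P(het heterozygotes | allele totals)
        hom_A = (n_A - het) // 2
        hom_a = (n_a - het) // 2
        return (factorial(N) * 2 ** het * factorial(n_A) * factorial(n_a)
                / (factorial(hom_A) * factorial(het) * factorial(hom_a)
                   * factorial(2 * N)))

    p_obs = prob(n_Aa)
    total = 0.0
    # The heterozygote count must share the parity of the allele count n_A
    for het in range(n_A % 2, min(n_A, n_a) + 1, 2):
        p = prob(het)
        if p <= p_obs * (1 + 1e-12):  # tolerate floating-point ties
            total += p
    return min(total, 1.0)
```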

#### Signed HWE Correlation R¶

Note

This statistic applies only to bi-allelic markers.

We define the signed HWE correlation as

$$r = 1 - \frac{n_{Dd} / N}{2 p_D (1 - p_D)},$$

where

$$p_D = \frac{2 n_{DD} + n_{Dd}}{2 N},$$

$N$ is the total genotype count and $n_{DD}$ and $n_{Dd}$ are the counts for genotypes DD and Dd, respectively.

This is derived from the formula for (signed) correlation between two sets of observations, $x_i$ and $y_i$,

$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}},$$

where we take the $x_i$ to be 0 if the first allele is d and 1 if the first allele is D, and the $y_i$ to be 0 if the second allele is d and 1 if the second allele is D.

Because of phase ambiguity, we set each of the counts of (d, D) and (D, d) to be one-half of the (phase-ambiguous) observed count of Dd. The correlation then simplifies to the formula first given above.

If there is a high homozygous count, $x_i$ and $y_i$ will often be 1 at the same time or often be 0 at the same time, and therefore there will be a positive correlation between the $x_i$ and the $y_i$. Similarly, if there is a high heterozygous count, $x_i$ and $y_i$ will often be 1 at opposite times, causing an anti-correlation to exist.
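In code, the simplified formula reads as below (an illustrative sketch; the function name is assumed, and the expression is the inbreeding-coefficient form comparing the heterozygote frequency with its HWE expectation):

```python
def signed_hwe_r(n_DD, n_Dd, N):
    """Signed HWE correlation for a bi-allelic marker.

    r > 0 indicates an excess of homozygotes, r < 0 an excess of
    heterozygotes, relative to Hardy-Weinberg expectations.
    """
    p_D = (2 * n_DD + n_Dd) / (2 * N)   # frequency of allele D
    q_d = 1.0 - p_D                     # frequency of allele d
    # Heterozygote frequency compared with its HWE expectation 2*p*q
    return 1.0 - (n_Dd / N) / (2.0 * p_D * q_d)
```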

#### Call Rate¶

The call rate is the fraction of genotypes present and not missing for the given marker.

#### Minor Allele Frequency (MAF)¶

The minor allele frequency is the fraction of the total alleles of the given marker that are minor alleles.
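Both the call rate and the MAF can be sketched as follows (illustrative code, not SVS’s; genotypes are assumed encoded as two-character strings, with None standing for a no-call):

```python
def call_rate_and_maf(genotypes):
    """Call rate and minor allele frequency for one bi-allelic marker.

    `genotypes` is a list of two-letter genotype strings, with None
    standing for a missing (no-call) genotype.
    """
    called = [g for g in genotypes if g is not None]
    call_rate = len(called) / len(genotypes)
    # Tally each allele over the genotypes actually called
    counts = {}
    for g in called:
        for allele in g:
            counts[allele] = counts.get(allele, 0) + 1
    if len(counts) < 2:               # monomorphic marker: no minor allele
        return call_rate, 0.0
    maf = min(counts.values()) / (2 * len(called))
    return call_rate, maf
```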

### Statistics Available for Genotype Association Tests¶

#### Correlation/Trend Test¶

The Correlation/Trend Test tests the significance of any correlation between two numeric variables (or two variables which have been encoded as numeric variables). This test may also be thought of as any “trend” which either one of the numeric variables may have taken against the other one.

If we have $N$ pairs of observations $(x_i, y_i)$, the (signed) correlation between them is

$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}.$$

Meanwhile,

$$\chi^2 = (N - 1) r^2$$

follows an approximate chi-squared distribution with one degree of freedom, from which a p-value may be obtained.

Note

1. In the special case of the additive model (and no PCA correction) for a case/control study, if we were to use, instead of the above formula, $\chi^2 = N r^2$, we would have the mathematical equivalent of the Armitage Trend Test.
2. This correlation/trend test is also available to be used after PCA correction. However, the formula for the chi-squared statistic is instead $\chi^2 = (N - k - 1) r^2$, where $k$ is the number of principal components that have been removed from the data. The premise is that PCA correction has removed $k$ degrees of freedom from the data, and only the remaining degrees need to be tested.
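The test, including the PCA-corrected variant from the note, can be sketched as follows (illustrative code; `pca_components` stands for the number of removed components):

```python
import math

def correlation_trend_test(x, y, pca_components=0):
    """Correlation/trend chi-squared test for two numeric variables.

    Uses chi2 = (N - k - 1) * r**2, where k principal components have
    been removed (k = 0 without PCA correction), with one degree of
    freedom for the p-value.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    chi2 = (n - pca_components - 1) * r * r
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # 1-df chi-squared tail
    return r, chi2, p_value
```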

#### Armitage Trend Test¶

The Armitage Trend Test tests the “trend” in an ordered case/control contingency table. In Golden Helix SVS, the ordering is by the number of minor alleles in the genotype: zero, one, or two.

Let $r_0$, $r_1$ and $r_2$ be the counts for cases with 0, 1 and 2 alleles, respectively, and $s_0$, $s_1$ and $s_2$ be the counts for controls with 0, 1 and 2 alleles, respectively. Also, let $n_0 = r_0 + s_0$, $n_1 = r_1 + s_1$, and $n_2 = r_2 + s_2$.

If we let $N = R + S$ be the total count, where $R = r_0 + r_1 + r_2$ and $S = s_0 + s_1 + s_2$,

$$\bar{x} = \frac{n_1 + 2 n_2}{N}, \quad \text{and} \quad \bar{y} = \frac{R}{N},$$

then the prediction equation under an ordinary least-squares fit is

$$\hat{y} = \bar{y} + b (x - \bar{x}), \qquad \text{where} \quad b = \frac{N (r_1 + 2 r_2) - R (n_1 + 2 n_2)}{N (n_1 + 4 n_2) - (n_1 + 2 n_2)^2}.$$

The statistic for the Armitage Trend Test is

$$\chi^2 = \frac{N \left[ N (r_1 + 2 r_2) - R (n_1 + 2 n_2) \right]^2}{R (N - R) \left[ N (n_1 + 4 n_2) - (n_1 + 2 n_2)^2 \right]},$$

which is asymptotically chi-squared with one degree of freedom. This is used to obtain the chi-squared based p-value for this test.
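The statistic can be sketched directly from the table counts (illustrative code; it equals N·r² for the 0/1 case status regressed on the minor-allele count):

```python
def armitage_trend_chi2(cases, controls):
    """Armitage trend chi-squared for a 2x3 case/control table.

    `cases` = (r0, r1, r2) and `controls` = (s0, s1, s2) are the counts
    of subjects carrying 0, 1 and 2 minor alleles.
    """
    r0, r1, r2 = cases
    s0, s1, s2 = controls
    n1, n2 = r1 + s1, r2 + s2        # column totals for 1 and 2 alleles
    R = r0 + r1 + r2                 # total cases
    N = R + s0 + s1 + s2             # total count
    num = N * (N * (r1 + 2 * r2) - R * (n1 + 2 * n2)) ** 2
    den = R * (N - R) * (N * (n1 + 4 * n2) - (n1 + 2 * n2) ** 2)
    return num / den
```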

#### Exact Form of Armitage Test¶

The exact form of this test yields the exact probability under the null hypothesis of having a “trend” at least as extreme as the one observed, assuming an equal probability of any permutation of the dependent variable.

To perform the exact Armitage test, we define the trend score for the contingency table as

where

The exact permutation p-value is evaluated as

where

#### (Pearson) Chi-Squared Test¶

This is the most often used way to obtain a p-value measuring the extremeness of an (unordered) contingency table. It tests the null hypothesis that the proportions in the rows and columns of the table differ from the proportions of the marginal row and column totals only by chance.

If the contingency table with elements $O_{ij}$ has $N$ observations, we make an “expected” contingency table based on the marginal totals with elements

$$E_{ij} = \frac{R_i C_j}{N},$$

where $R_i$ are the row totals and $C_j$ are the column totals.

We then obtain a p-value from the fact that

$$\chi^2 = \sum_{i,j} \frac{\left( O_{ij} - E_{ij} \right)^2}{E_{ij}}$$

approximates a chi-squared distribution with $(r - 1)(c - 1)$ degrees of freedom, where $r$ and $c$ are the numbers of rows and columns.

For the contingency table sizes for which this technique is used in SVS, the degrees of freedom are 1, 2 and 3, respectively.
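A direct sketch of the computation (illustrative code; the p-value lookup from the chi-squared distribution is omitted):

```python
def pearson_chi_squared(table):
    """Pearson chi-squared statistic and degrees of freedom for an
    unordered contingency table given as a list of rows of counts."""
    n_rows, n_cols = len(table), len(table[0])
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    N = sum(row_totals)
    chi2 = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            # Expected count from the marginal totals
            expected = row_totals[i] * col_totals[j] / N
            chi2 += (table[i][j] - expected) ** 2 / expected
    df = (n_rows - 1) * (n_cols - 1)
    return chi2, df
```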

#### (Pearson) Chi-Squared Test with Yates’ Correction¶

This test is almost the same as the Pearson Chi-Squared test. The difference is that a correction is made to compensate for the fact that the contingency table uses discrete integer values.

Both this test and the Pearson Chi-Squared test itself obtain a p-value measuring the extremeness of an (unordered) contingency table, testing the null hypothesis that the proportions in the rows and columns of the table differ from the proportions of the marginal row and column totals only by chance.

If the contingency table with elements $O_{ij}$ has $N$ observations, our “expected” contingency table based on the marginal totals has elements

$$E_{ij} = \frac{R_i C_j}{N},$$

where $R_i$ are the row totals and $C_j$ are the column totals.

The estimated $\chi^2$ value from the Pearson Chi-Squared test with Yates’ correction is

$$\chi^2_{\text{Yates}} = \sum_{i,j} \frac{\left( \left| O_{ij} - E_{ij} \right| - 0.5 \right)^2}{E_{ij}}.$$

This approximates a chi-squared distribution with $(r - 1)(c - 1)$ degrees of freedom, which is used to obtain a p-value for this test.

For the contingency table sizes for which this technique is used in SVS, the degrees of freedom are 1, 2 and 3, respectively.

#### Fisher’s Exact Test¶

The output of this test is the sum of the probabilities of all contingency tables whose marginal sums are the same as those of the observed contingency table and which are as extreme or more extreme (equally probable or less probable) than the observed contingency table.

The probability of a contingency table with elements $n_{ij}$, row totals $R_i$, and column totals $C_j$ is given by

$$p = \frac{\prod_i R_i! \; \prod_j C_j!}{N! \; \prod_{i,j} n_{ij}!}.$$

To reduce the amount of computation, techniques developed by Mehta and Patel [MehtaAndPatel1983] are used for computing Fisher’s Exact Test.
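For the 2×2 case the enumeration is small enough to write out directly (an illustrative sketch using the hypergeometric form of the table probability; the Mehta-Patel network algorithm used by SVS is a faster equivalent for larger tables):

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact p-value for a 2x2 table [[a, b], [c, d]].

    Sums the probabilities of every table with the same margins that is
    as probable as or less probable than the observed table.
    """
    r1, r2 = a + b, c + d            # row totals
    c1 = a + c                       # first column total
    N = r1 + r2
    def prob(x):                     # table with x in the top-left cell
        return comb(r1, x) * comb(r2, c1 - x) / comb(N, c1)
    p_obs = prob(a)
    return sum(prob(x)
               for x in range(max(0, c1 - r2), min(r1, c1) + 1)
               if prob(x) <= p_obs * (1 + 1e-12))
```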

#### Odds Ratio with Confidence Limits¶

For the purposes of this method’s description, we define a contingency table as being organized as “(Case/Control) vs. (Yes/No)” demonstrated in the table below.

          Yes    No     Total
Case      a      b      a + b
Control   c      d      c + d
Total     a + c  b + d  N

The odds ratio is defined as the ratio of the odds for “Case” among the Yes values to the odds for “Case” among the No values, or equivalently the ratio of the odds for “Yes” among the cases to the odds for “Yes” among the controls, or equivalently

$$OR = \frac{a d}{b c}.$$

To obtain confidence limits, we use the standard error of $\ln(OR)$, which is

$$SE = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}.$$

The 95% confidence interval then ranges from $e^{\ln(OR) - 1.96\,SE}$ to $e^{\ln(OR) + 1.96\,SE}$.
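A sketch of the computation, with the cells labeled as in the table above (a = Case/Yes, b = Case/No, c = Control/Yes, d = Control/No; illustrative code):

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and confidence interval for a 2x2 table organized as
    (Case/Control) vs. (Yes/No); z = 1.96 gives the 95% interval."""
    or_ = (a * d) / (b * c)
    # Standard error of ln(OR)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi
```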

#### Analysis of Deviance¶

This is a maximum-likelihood based technique for analyzing a case-control contingency table with $c$ columns. Let $p$ be the proportion of cases in the entire sample, $n_j$ be the number of observations in column $j$ of the contingency table, and $p_j$ be the proportion of cases in column $j$. Then, to perform an analysis of deviance test, we define

$$l_0 = \sum_j n_j \left[ p \ln p + (1 - p) \ln(1 - p) \right]$$

and

$$l_1 = \sum_j n_j \left[ p_j \ln p_j + (1 - p_j) \ln(1 - p_j) \right].$$

The test statistic is then $G = 2 (l_1 - l_0)$, which approximates a chi-squared distribution with $c - 1$ degrees of freedom. A p-value is then obtained based on this chi-squared approximation.

#### F-Test¶

The F-Test applies to a quantitative trait being subdivided into two or more groups according to the category of the predictor variable.

This test is on whether the distributions of the dependent variable within each category are significantly different between the various categories of the predictor variable. Another way to phrase this question is whether the variation of the trait between the categories is substantial by comparison to the variation of the trait within the categories.

If there are $N$ observations $y_{ij}$ (observation $j$ within group $i$) subdivided into $k$ groups of sizes $n_i$, we define

$$\bar{y}_i = \frac{1}{n_i} \sum_j y_{ij}$$

and

$$\bar{y} = \frac{1}{N} \sum_{i,j} y_{ij}.$$

If $df_B = k - 1$ and $df_W = N - k$, then

$$MS_B = \frac{1}{df_B} \sum_i n_i \left( \bar{y}_i - \bar{y} \right)^2$$

is proportional to the variance between the groups, and

$$MS_W = \frac{1}{df_W} \sum_{i,j} \left( y_{ij} - \bar{y}_i \right)^2$$

is proportional to the variance within the groups. The F statistic becomes

$$F = \frac{MS_B}{MS_W}.$$

The p-value is obtained by subtracting the probability of observing the F statistic from an $F(df_B, df_W)$ distribution (where $df_B$ are the numerator degrees of freedom and $df_W$ are the denominator degrees of freedom) from one.
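The between/within decomposition can be sketched as follows (illustrative code; the final p-value lookup from the F distribution is omitted):

```python
def one_way_f(groups):
    """One-way ANOVA F statistic with its degrees of freedom.

    `groups` is a list of lists of quantitative-trait values, one list
    per category of the predictor variable.
    """
    k = len(groups)
    N = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / N
    # Between-group and within-group sums of squares
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (N - k)
    return ms_between / ms_within, k - 1, N - k
```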

### Statistics for Numeric Association Tests¶

#### Correlation/Trend Test¶

The Correlation/Trend Test tests the significance of any correlation between two numeric variables (or two variables which have been encoded as numeric variables). This test may also be thought of as any “trend” which either one of the numeric variables may have taken against the other one.

If we have $N$ pairs of observations $(x_i, y_i)$, the (signed) correlation between them is

$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}.$$

Meanwhile,

$$\chi^2 = (N - 1) r^2$$

follows an approximate chi-squared distribution with one degree of freedom, from which a p-value may be obtained.

Note

This correlation/trend test is also available to be used after PCA correction. However, the formula for the chi-squared statistic is instead $\chi^2 = (N - k - 1) r^2$, where $k$ is the number of principal components that have been removed from the data. The premise is that the PCA correction has removed $k$ degrees of freedom from the data, and only the remaining degrees need to be tested.

#### T-Test¶

The T-Test is a special form of the F-Test in which distributions in only two categories are being compared. (The T statistic is the square root of the corresponding F statistic for two categories.)

In the CNV Association Test, the T-Test is used for a quantitative predictor (independent variable) and a case/control (binary) dependent variable.

The test is on whether the distributions of the quantitative predictor within the two categories of case versus control are significantly different. Another way to phrase this question is whether the variation of the predictor between the categories is substantial by comparison to the variation of the predictor within the categories.

If there are $n_1$ observations $x_i$ corresponding to a true dependent variable value and $n_2$ observations $y_i$ corresponding to a false dependent variable value, we define

$$\bar{x} = \frac{1}{n_1} \sum_i x_i \qquad \text{and} \qquad \bar{y} = \frac{1}{n_2} \sum_i y_i.$$

Then, the pooled variance is

$$s_p^2 = \frac{\sum_i (x_i - \bar{x})^2 + \sum_i (y_i - \bar{y})^2}{n_1 + n_2 - 2}.$$

If $s_p^2$ is less than a small threshold, then the p-value returned is 1.0. Otherwise,

$$T = \frac{\bar{x} - \bar{y}}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}.$$

The p-value may be calculated on the basis of this T value as a “two-sided p-value” using Student’s t distribution with $n_1 + n_2 - 2$ degrees of freedom.
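The pooled two-sample statistic can be sketched as follows (illustrative code; the degenerate-variance threshold of 1e-12 is an assumption, and the p-value lookup from Student’s t distribution is omitted):

```python
import math

def pooled_t_statistic(group_true, group_false):
    """Pooled two-sample t statistic with its degrees of freedom.

    `group_true` / `group_false` hold the quantitative values for the
    observations whose binary dependent variable is true / false.
    """
    n1, n2 = len(group_true), len(group_false)
    m1 = sum(group_true) / n1
    m2 = sum(group_false) / n2
    # Pooled variance over both groups
    sp2 = (sum((v - m1) ** 2 for v in group_true)
           + sum((v - m2) ** 2 for v in group_false)) / (n1 + n2 - 2)
    if sp2 < 1e-12:                   # degenerate: no within-group variance
        return 0.0, n1 + n2 - 2
    t = (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2
```

For two categories, the square of this statistic matches the corresponding F statistic, as noted above.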

### False Discovery Rate¶

When testing multiple hypotheses, there is always the possibility one or more tests have appeared significant just by chance. Various techniques have been proposed to adjust the p-values or to otherwise correct for multiple testing issues. Among these are the Bonferroni adjustment and the False Discovery Rate. The following discussion and technique is used in Golden Helix SVS specifically to correct for multiple testing over many different predictors.

Suppose that $m$ hypotheses are tested, and $R$ of them are rejected (positive results). Of the rejected hypotheses, suppose that $V$ of them are really false positive results, that is, $V$ is the number of type I errors. The False Discovery Rate is defined as

$$FDR = E\left[ \frac{V}{R} \,\middle|\, R > 0 \right] \Pr(R > 0),$$

that is, the expected proportion of false positive findings among all rejected hypotheses times the probability of making at least one rejection.

Suppose we are rejecting (the null hypothesis) on the basis of the p-values from these tests, specifically, when a p-value is less than a parameter . If we can treat the p-values as being independent, then we can estimate as

where is the number of less than or equal to , and use this to estimate the False Discovery Rate as

When this is computed for equal to any particular p-value, these expressions simplify to

and

where is the number of p-values less than or equal to .

See [Storey2002]. (A fixed value of the tuning parameter $\lambda$ is used here.)
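A Storey-style estimator can be sketched as follows; the default `lam=0.5` and the function name are assumptions for illustration, not necessarily SVS’s choices:

```python
def storey_fdr(p_values, lam=0.5):
    """Storey-style FDR estimate at each observed p-value.

    `lam` (lambda) is the tuning parameter used to estimate pi0, the
    proportion of true null hypotheses.
    """
    m = len(p_values)
    # Estimated proportion of true nulls: p-values above lam are
    # assumed to come from the uniform null distribution
    pi0 = sum(1 for p in p_values if p > lam) / ((1 - lam) * m)
    fdr = {}
    for p in p_values:
        r = sum(1 for q in p_values if q <= p)   # rejections at cutoff p
        fdr[p] = min(pi0 * p * m / r, 1.0)
    return fdr
```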

## Permutation Testing Methodology¶

Permutation testing consists of running many tests which are just like the original test except that the dependent variable is permuted differently for each test, and counting the number of times the test statistic resulting from a permuted dependent variable is more significant than the statistic from the original test.

The permuted p-value is the fraction of permuted tests which are more significant than or as significant as the original test (this is the definition of a p-value). The original test is counted as one of the “permuted tests” in this calculation. This is meant to approximate the probability that the test could come out as “positive” by chance alone.

Note

The permutations that are used are not based on a time of day seed, but on a constant seed. This means that permutation results will be identical when the study is re-run.

### Single Value Permutations¶

With single value permutations, the dependent variable is permuted and the given statistical test is performed using the given model on the given marker or predictor. The permutation methodology is the same for genotypic association tests and numerical association tests. The independent variables in the first case are genetic markers, and in the second numerical predictors. In this paragraph and those following, “marker” can be replaced with “predictor” depending on the context of the permutation testing. Only “marker” will be used for simplicity.

This process is repeated the number of times you select. The permuted p-value is defined as the fraction of times a test with a permuted dependent variable on the given marker came out as significant as or more significant than the same test on the same marker with the non-permuted dependent variable.

Note

This single-value permutation technique focuses exclusively on the marker in question, and by itself does not offer any type of correction for multiple testing.

#### Single Value Permutation Example¶

Suppose your approximated p-value (based on a chi-square distribution) is 0.2, you are performing 10 permutations, and the significance of those permutations (based on the same chi-square distribution) may be expressed as the following approximated p-values (counting the original test as the first “permutation”):

0.20
0.15
0.09
0.51
0.31
0.01
0.26
0.11
0.05
0.18


In the above column, there are 7 outcomes where the approximated p-value is equal to or less than 0.2. The permuted p-value is therefore 7/10, or 0.70.
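The procedure can be sketched as follows (illustrative code; `statistic` is a placeholder for whatever test is being permuted, with larger values meaning more significant):

```python
import random

def permuted_p_value(dependent, predictor, statistic, n_perms=1000, seed=0):
    """Single-value permutation p-value for one marker/predictor.

    A constant seed (not a time-of-day seed) makes re-runs identical,
    as in SVS. The original test counts as one of the permutations.
    """
    rng = random.Random(seed)             # constant, not time-based, seed
    observed = statistic(dependent, predictor)
    hits = 1                              # the original test itself
    shuffled = list(dependent)
    for _ in range(n_perms - 1):
        rng.shuffle(shuffled)
        if statistic(shuffled, predictor) >= observed:
            hits += 1
    return hits / n_perms
```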

### Full Scan Permutations¶

The full-scan permutation technique differs from the single-value technique in that it addresses the multiple-testing problem. It does this by comparing the original test result from an individual marker not merely with the permuted result from that marker, but also with the permuted results from all the rest of the markers. If any of these comparisons comes out to be more significant than the unpermuted result of the marker in question, the permutation is considered to have resulted in the “permuted test being more significant than the unpermuted test”.

To be specific, the following procedure is used for full-scan permutation testing:

1. For the number of times you have specified, the dependent variable is permuted and the test is performed over all markers. From each permutation, the most significant results from any of the markers are kept track of.
2. The full-scan permuted p-value of each marker may now be found. It is the fraction of permutations for which the most significant result over all markers for that permutation was as significant as or more significant than the test on the marker in question using the non-permuted dependent variable.

Note

This procedure will tend to multiply the result by a value somewhat equal to the number of markers being tested, and thus gives answers comparable to those from Bonferroni-corrected p-values.
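The bookkeeping can be sketched as follows (illustrative code; applied to the data of Full Scan Example Two below, it reproduces the values shown there):

```python
def full_scan_permuted_p_values(observed, permuted):
    """Full-scan permuted p-values for several markers.

    `observed[i]` is the approximated p-value of marker i on real data;
    `permuted` is a list of permutation rows, each holding one
    approximated p-value per marker (the original test is included as
    the first row).
    """
    # Best (smallest) approximated p-value over all markers, per permutation
    best = [min(row) for row in permuted]
    return [sum(1 for b in best if b <= obs) / len(permuted)
            for obs in observed]
```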

### Full Scan Example One¶

You run the same association test on four spreadsheets. These spreadsheets are all alike, except that one has the real data that you are interested in, and the other three have the same data except that the dependent variable has been permuted randomly in those spreadsheets.

You look at the (approximated) p-values from all of these tests, and note that the best approximated p-value from the first spreadsheet on any marker (which came from marker 4) is 0.0003. You also note that the best approximated p-values (which resulted from various markers) from the second, third, and fourth spreadsheets are 0.005, 0.02, and 0.013.

Using this technique, you could say that the full-scan permuted p-value of marker 4 would be one (the number of spreadsheets whose significance measured as a p-value was 0.0003 or better) divided by four (the total number of spreadsheets), or 0.25.

Normally, to get a better idea of the real p-value, you would perform many more permutations than this! For instance, one thousand permutations or more would be necessary to more accurately gauge a p-value near 0.001.

### Full Scan Example Two¶

Suppose you have 5 markers, and their approximated test p-values (based on a chi-square distribution) are:

0.20 0.10 0.02 0.34 0.52

Their Bonferroni-corrected p-values would be:

1.00 0.50 0.10 1.00 1.00

Also suppose we do full-scan permutation testing using 10 permutations on these markers, and the approximated p-value results of those permutations on those markers (based on the same chi-square distribution and counting the original test as a “permutation”) are:

0.20  0.10  0.02  0.34  0.52
0.15  0.13  0.71  0.05  0.26
0.09  0.17  0.41  0.26  0.22
0.51  0.36  0.47  0.92  0.01
0.20  0.74  0.26  0.87  0.23
0.01  0.46  0.21  0.08  0.05
0.26  0.61  0.92  0.37  0.43
0.11  0.54  0.29  0.08  0.13
0.05  0.06  0.04  0.39  0.42
0.18  0.69  0.21  0.19  0.14

In the above data, we find the following best permuted-test outcomes over all markers for each of the permutations are:

0.02 (from the third marker)
0.05 (from the fourth marker)
0.09 (first marker)
0.01 (fifth marker)
0.20 (first marker)
0.01 (first marker)
0.26 (first marker)
0.08 (fourth marker)
0.04 (third marker)
0.14 (fifth marker)

For the first marker, whose approximated p-value from actual data was 0.20, we find that the best approximated p-values from the permutations other than permutation 7 (0.26) were all as good as or better than 0.20. Thus the full-scan permuted p-value for the first marker is 9/10 or 0.90.

For the second marker, at 0.10, all permutations except permutation 5 (0.20), permutation 7 (0.26), and permutation 10 (0.14) were better than its actual data value. Thus, its full-scan permuted p-value is 7/10 or 0.70.

The third marker has an approximated p-value of 0.02 on actual data, which is much better than that for the other markers, and is only equaled or bettered by itself (“permutation” 1 at 0.02), permutation 4 (0.01) and permutation 6 (0.01). Thus, its full-scan permuted p-value is 3/10 or 0.30.

Meanwhile, every permutation had a better approximated p-value than either the 0.34 or 0.52 values from the last two markers. Thus, these markers both have a full-scan permuted p-value of 10/10 or 1.00.

Comparing the full-scan permuted p-values for every marker with its Bonferroni-corrected p-value for real data we get these similar values:

0.90  0.70  0.30  1.00  1.00
1.00  0.50  0.10  1.00  1.00

### Full Scan Example Three¶

Let’s say your full-scan permuted p-value was 0.463 when you performed 1000 permutations. This means that in 463 of the 1000 permutations, the best statistical result over all markers (kept in an internally saved list) was better than the result for your marker. This would be a poor showing. However, you must take into consideration that each permutation did not do anything like averaging the approximated p-values over all the markers; it took the one marker with the best showing (whose approximated p-value would be the lowest).

## Linear Regression¶

### Genotype or Numeric Association Test – Linear Regression¶

The response, $y$, is fit to every genetic predictor variable or encoded genotype, $x$, in the spreadsheet using linear regression. The results include the regression p-value, intercept and slope, which are output in a new spreadsheet along with other genotypic association test results and any multiple correction results. The response is represented with the formula $y = \beta_0 + \beta_1 x + \epsilon$, where the model is represented by the expression $\beta_0 + \beta_1 x$ and the error term, $\epsilon$, expresses the difference, or residual, between the model of the response and the response itself. For missing values of the predictor, the mean of the response is used.

The regression hypothesis test is the test of:

$$H_0: \beta_1 = 0 \qquad \text{versus} \qquad H_a: \beta_1 \ne 0$$

Assumptions:

$E[\epsilon_i] = 0$ for all $i$,

where the $\epsilon_i$ denote the residuals,

and the $\epsilon_i$ are all independent and follow a normal distribution

and all have equal variance.

The sums of squares and mean sum of squared errors are calculated as follows:

- Number of Observations: $n$
- Rank of the Coefficient Matrix: $p$
- Mean of the response: $\bar{y} = \frac{1}{n} \sum_i y_i$
- Mean of the predictors or genotypes: $\bar{x} = \frac{1}{n} \sum_i x_i$
- Solution to the normal equations: $b = (X^T X)^{-1} X^T y$, where $X$ is the coefficient matrix
- Total Sum of Squares: $SSTO = y^T y - \frac{1}{n} y^T J y$, where $J$ is an $n \times n$ matrix of ones
- Regression Sum of Squares: $SSR = b^T X^T y - \frac{1}{n} y^T J y$
- Error Sum of Squares: $SSE = y^T y - b^T X^T y$
- Residual Sum of Squares (mean squared error): $MSE = SSE / (n - p)$
- Coefficient of determination: $R^2 = SSR / SSTO$
- Adjusted coefficient of determination: $R_a^2 = 1 - \frac{n - 1}{n - p} \left( 1 - R^2 \right)$
- Test Statistic: $F^* = \frac{SSR / (p - 1)}{MSE}$

The test statistic follows the F distribution with $p - 1$ and $n - p$ degrees of freedom; the p-value is $\Pr\left[ F(p - 1, n - p) > F^* \right]$.
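For a single predictor, the quantities above reduce to the following sketch (illustrative code, not SVS’s implementation):

```python
def simple_linear_regression(x, y):
    """Simple linear regression with its sums of squares.

    Returns intercept, slope, SSTO, SSR, SSE, R^2 and the F statistic
    (with p - 1 = 1 and n - p = n - 2 degrees of freedom).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ssto = sum((b - my) ** 2 for b in y)   # total sum of squares
    ssr = slope * sxy                      # regression sum of squares
    sse = ssto - ssr                       # error sum of squares
    r2 = ssr / ssto
    f_stat = ssr / (sse / (n - 2))
    return intercept, slope, ssto, ssr, sse, r2, f_stat
```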

### Multiple Linear Regression Model¶

The Regression Analysis window performs multiple linear regression on the regressors unless only one regressor is specified. A multiple linear regression model takes one or more regressors and fits a regression model to one dependent variable. This model is a generalization of the simple linear regression model used for linear regression in the analysis test dialogs.

#### Full Model Only Regression Equation¶

The regression hypothesis test is the test of:

$$H_0: \beta_1 = \beta_2 = \cdots = \beta_{p-1} = 0 \qquad \text{versus} \qquad H_a: \beta_k \ne 0 \text{ for at least one } k$$

Assumptions:

$E[\epsilon_i] = 0$ for all $i$,

where the $\epsilon_i$ denote the residuals,

and the $\epsilon_i$ are all independent and follow a normal distribution

and all have equal variance.

The sums of squares and mean sum of squared errors are calculated as follows:

- Number of Observations: $n$
- Rank of the Coefficient Matrix: $p$
- Mean of the response: $\bar{y} = \frac{1}{n} \sum_i y_i$
- Mean of the predictors or genotypes: $\bar{x} = \frac{1}{n} \sum_i x_i$
- Solution to the normal equations: $b = (X^T X)^{-1} X^T y$, where $X$ is the coefficient matrix
- Total Sum of Squares: $SSTO = y^T y - \frac{1}{n} y^T J y$, where $J$ is an $n \times n$ matrix of ones
- Regression Sum of Squares: $SSR = b^T X^T y - \frac{1}{n} y^T J y$
- Error Sum of Squares: $SSE = y^T y - b^T X^T y$
- Residual Sum of Squares (mean squared error): $MSE = SSE / (n - p)$
- Coefficient of determination: $R^2 = SSR / SSTO$
- Adjusted coefficient of determination: $R_a^2 = 1 - \frac{n - 1}{n - p} \left( 1 - R^2 \right)$
- Test Statistic: $F^* = \frac{SSR / (p - 1)}{MSE}$

The test statistic follows the F distribution with $p - 1$ and $n - p$ degrees of freedom; the p-value is $\Pr\left[ F(p - 1, n - p) > F^* \right]$.

#### Full Versus Reduced Model Regression Equation¶

In the full versus reduced model regression equation, the regression sums of squares are calculated both for the reduced and for the full model the same way that they are calculated for a regression on just one model. An F test is then performed to find the significance of the full versus the reduced model.

The null hypothesis tested is the model comparison test, where the null hypothesis is that the reduced model is the true model and that the full model is not necessary.

The sums of squares and mean sum of squared errors for the reduced model are calculated as follows:

- Number of Observations: $n$
- Rank of the Reduced Model Coefficient Matrix: $p_R$
- Mean of the response: $\bar{y} = \frac{1}{n} \sum_i y_i$
- Mean of the predictors or genotypes: $\bar{x} = \frac{1}{n} \sum_i x_i$
- Solution to the normal equations: $b_R = (X_R^T X_R)^{-1} X_R^T y$, where $X_R$ is the reduced model coefficient matrix
- Total Sum of Squares: $SSTO = y^T y - \frac{1}{n} y^T J y$, where $J$ is an $n \times n$ matrix of ones
- Regression Sum of Squares: $SSR_R = b_R^T X_R^T y - \frac{1}{n} y^T J y$
- Error Sum of Squares: $SSE_R = y^T y - b_R^T X_R^T y$

The sums of squares and mean sum of squared errors for the full model are calculated similarly:

- Number of Observations: $n$
- Rank of the Full Model Coefficient Matrix: $p_F$
- Mean of the response: $\bar{y} = \frac{1}{n} \sum_i y_i$
- Mean of the predictors or genotypes: $\bar{x} = \frac{1}{n} \sum_i x_i$
- Solution to the normal equations: $b_F = (X_F^T X_F)^{-1} X_F^T y$, where $X_F$ is the full model coefficient matrix
- Total Sum of Squares: $SSTO = y^T y - \frac{1}{n} y^T J y$, where $J$ is an $n \times n$ matrix of ones
- Regression Sum of Squares: $SSR_F = b_F^T X_F^T y - \frac{1}{n} y^T J y$
- Error Sum of Squares: $SSE_F = y^T y - b_F^T X_F^T y$
- Residual Sum of Squares (mean squared error): $MSE_F = SSE_F / (n - p_F)$

The test statistic is:

$$F^* = \frac{(SSE_R - SSE_F) / (p_F - p_R)}{SSE_F / (n - p_F)}$$

The p-value is calculated by $\text{p-value} = \Pr\left[ F(p_F - p_R,\; n - p_F) > F^* \right]$.

#### Regressor Statistics¶

The coefficient of the regressor is calculated with the equation:

where is the sample size, is the mean of the regressor and is the mean of the response.

The Y-intercept of the regression equation is calculated with the equation:

where is the number of regressors, is the coefficient and is the mean of the regressor.

The standard error for the regressor is computed by taking a full model regression equation with all regressors less the regressor. For the purposes of calculating the standard error, the regressor is set as the dependent variable. Let be the regressor sum of squares, be the coefficient of determination for the regressor vs all other regressors model, be the mean square errors for the regression model, and be the error sum of squares. Let the total number of regressors in the model be . Then the standard error of the regressor is calculated as follows:

The value of the t-statistic for the regressor is obtained from the equation:

where is the estimated coefficient for the regressor.

The p-value of the t-statistic for the regressor is the probability of a value as extreme or more extreme than the observed t-statistic from a Student’s T distribution with degrees of freedom.

The p-value for the univariate fit is obtained from a Student’s T distribution where the t-statistic is calculated assuming that the regressor is the only regressor in the model against the dependent variable.

#### Categorical Covariates and Interaction Terms¶

If a covariate is categorical, dummy variables are used to indicate the category of the covariate. A value of “1” for the observation indicates that it is equal to the category the dummy variable represents. Similarly, if the observation is not equal to the category for the dummy variable, then it is assigned the value of “0”. As the values of one dummy variable can be determined by examining all other dummy variables for a covariate, in most cases the last dummy variable is dropped. This avoids using a rank-deficient matrix in the regression equation.

A first-order interaction term is considered a new covariate created from the product of two covariates as specified in either the full- or reduced-model covariates. If one interaction term is categorical, dummy variables for each category of the covariate will be multiplied by the other covariate to create a first-order interaction term. If both covariates are categorical, dummy variables from both covariates will be multiplied by each other.

For example, consider the following covariates for six samples.

Sample Lab Dose Age
sample01 A Low 35
sample02 A Med 31
sample03 A High 37
sample04 B Low 32
sample05 B Med 36
sample06 B High 33

Using dummy variables for the categorical covariates the above table would be:

Sample Lab=A Lab=B Dose=Low Dose=Med Dose=High Age
sample01 1 0 1 0 0 35
sample02 1 0 0 1 0 31
sample03 1 0 0 0 1 37
sample04 0 1 1 0 0 32
sample05 0 1 0 1 0 36
sample06 0 1 0 0 1 33

Interactions Lab*Dose and Lab*Age would be specified as:

Sample A*Low A*Med A*High B*Low B*Med B*High A*Age B*Age
sample01 1 0 0 0 0 0 35 0
sample02 0 1 0 0 0 0 31 0
sample03 0 0 1 0 0 0 37 0
sample04 0 0 0 1 0 0 0 32
sample05 0 0 0 0 1 0 0 36
sample06 0 0 0 0 0 1 0 33
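The dummy-variable and interaction tables above can be generated as follows (an illustrative sketch; the function names are assumptions):

```python
from itertools import product

def dummy_columns(samples, categories):
    """One 0/1 dummy column per category of a categorical covariate."""
    return {cat: [1 if s == cat else 0 for s in samples]
            for cat in categories}

def interaction_columns(cols_a, cols_b):
    """First-order interactions: elementwise products of column pairs."""
    return {f"{na}*{nb}": [x * y for x, y in zip(ca, cb)]
            for (na, ca), (nb, cb) in product(cols_a.items(), cols_b.items())}
```

Applied to the Lab, Dose and Age columns of the example, this reproduces the Lab*Dose and Lab*Age tables shown above.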

#### Stepwise Regression¶

If only a few variables (regressors or covariates) drive the outcome of the response, Stepwise Regression can isolate these variables. The methods for the two types of stepwise regression, forward selection or backward elimination, are described below.

Forward Selection

Starting with either the null model or the reduced model (depending on which type of regression was specified), successive models are created, each one using one more regressor (or covariate) than the previous model.

Each of the unused regressors is added to the current model to create a “trial” model for that regressor. The p-value of the trial model (or full model) versus the current model (or reduced model) is calculated, and the model with the smallest p-value is used as the next model. This method adds the next most significant variable to the current model. If the current model had the smallest p-value, or if no p-value is better than the p-value cut-off specified, then the forward selection method stops and declares the current model as the final model as determined by stepwise forward selection. If the model with all regressors has the smallest p-value then this full model is determined to be the final model.

From the standpoint of further analysis, the final model becomes the “full model” for this set of potential regressors.

Backward Elimination

Starting with the full model, successive models are created, each one using one less regressor (or covariate) than the previous model.

Each of the regressors currently in the model is removed to create a “trial” model excluding that regressor. The p-value of the current model (or full model) versus the trial model (or reduced model) is calculated, and the model with the smallest p-value is used as the next model. This method removes the least significant variable from the current model. If every p-value is smaller than the p-value cut-off specified, the backward elimination method stops. The method also stops if all variables have been removed from the model, or if all variables left are included in the original reduced model.

From the standpoint of further analysis, the final model becomes the “full model” for this set of potential regressors.
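The forward-selection loop can be sketched abstractly as follows (illustrative code; `compare_p_value` is a placeholder for a real full-versus-reduced regression comparison, and the cut-off is an assumed default):

```python
def forward_selection(candidates, compare_p_value, p_cutoff=0.05):
    """Stepwise forward selection sketch.

    `candidates` is the list of potential regressors; `compare_p_value`
    maps (current_model, trial_regressor) to the p-value of the trial
    model (current model plus that regressor) versus the current model.
    """
    current = []                       # start from the null/reduced model
    remaining = list(candidates)
    while remaining:
        # p-value of each trial model against the current model
        trials = [(compare_p_value(current, reg), reg) for reg in remaining]
        best_p, best_reg = min(trials)
        if best_p > p_cutoff:          # nothing significant left to add
            break
        current.append(best_reg)       # add the most significant variable
        remaining.remove(best_reg)
    return current
```

Backward elimination mirrors this loop, removing the least significant regressor at each step instead of adding the most significant one.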

#### Binomial Predictor¶

In this case, all the observations with zero as the predictor variable are placed in one group, and all of the observations with a one as the predictor variable are placed in a second group. A two-sample t-test is used to determine the probability that the two groups have the same mean.

##### Univariate Case (Student’s t-Test)¶

Suppose you have $n$ items that are split into two groups of sizes $n_1$ and $n_2$, and the respective sums of their continuous responses are $S_1$ and $S_2$. Further, let $Q$ be the sum of the squared responses over all $n$ items, and let

$$s^2 = \frac{Q - S_1^2 / n_1 - S_2^2 / n_2}{n - 2}.$$

We can then calculate the statistic

$$t = \frac{S_1 / n_1 - S_2 / n_2}{\sqrt{s^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}},$$

where the p-value is given by the tails of a two-sided Student’s t distribution with $\nu$ degrees of freedom, where $\nu = n - 2$.

##### Multivariate Case (Hotelling’s $T^2$ Test)¶

Suppose you have observations, each with a -dimensional response in an matrix, . All the observations with zero as the predictor variable are placed into one group, , of size , and all of the observations with a one as the predictor variable are placed into a second group, , of size . Note that . Then a Hotelling statistic is computed to compare the multivariate continuous responses in the two groups. Let the -dimensional vector contain the means of responses ,

Define the -dimensional vector , where

Define the entry of the matrix as follows:

Define the entry of the vector as follows:

The matrix equation is solved and the statistic is calculated as follows:

The p-value is computed using the F distribution with and degrees of freedom:

where .
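A minimal numeric sketch of the Hotelling statistic and its F conversion, using hypothetical data and the standard pooled-covariance formulation (an illustration, not the SVS implementation):

```python
import numpy as np

# Hypothetical 2-dimensional responses for the two predictor groups
X0 = np.array([[1.0, 2.0], [1.2, 1.8], [0.9, 2.2], [1.1, 2.1]])  # predictor = 0
X1 = np.array([[1.6, 2.6], [1.8, 2.4], [1.7, 2.9]])              # predictor = 1

n0, n1 = len(X0), len(X1)
p = X0.shape[1]
diff = X1.mean(axis=0) - X0.mean(axis=0)
# Pooled within-group covariance matrix with n0 + n1 - 2 degrees of freedom
S = ((n0 - 1) * np.cov(X0, rowvar=False)
     + (n1 - 1) * np.cov(X1, rowvar=False)) / (n0 + n1 - 2)
# Hotelling's T-squared statistic comparing the two mean vectors
T2 = (n0 * n1) / (n0 + n1) * diff @ np.linalg.solve(S, diff)
# F statistic with p and n0 + n1 - p - 1 degrees of freedom
F = (n0 + n1 - p - 1) / (p * (n0 + n1 - 2)) * T2
```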

## Logistic Regression¶

### Genotype or Numeric Association Test – Logistic Regression¶

The (univariate binary) response, , is fit to the given predictor variable, , using logistic regression. The results include the regression p-value and the parameters and , which are output in a new spreadsheet along with other association test results, any multiple test correction results, and any expected p-values based on the rank of the observed p-value and the number of predictors. The response is represented with the formula , with the model itself being the expression and the error term, , expressing the difference, or residual, between the model of the response and the response itself.

Assuming there are observations, a logit model is used to fit the binary response, , using and a vector of 1’s as a covariate matrix (). (The vector of 1’s facilitates obtaining an intercept.) The Newton’s method approach of maximizing the log likelihood function is used for estimating the logit model [Green1997]. The null hypothesis being tested is that the slope and intercept coefficients in the logit model are zero. A likelihood ratio statistic is calculated, where is the unrestricted likelihood and is the restricted likelihood, and is asymptotically distributed as Chi-Squared with degrees of freedom.

We simplify the notation by using a lower-case to mean “log likelihood”–that is,

and

where base-e (natural) logarithms are used.

Using this notation, the unrestricted log likelihood is

where is the proportion of the dependent observations that are equal to one, and the restricted log likelihood is

and where . Here, is the vector .
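The Newton’s method fit and the likelihood ratio statistic can be sketched as follows. The data values are hypothetical, and the restricted log likelihood is written in terms of the proportion of ones, as in the text:

```python
import math
import numpy as np

def log_lik(beta, X, y):
    """Logit-model log likelihood."""
    eta = X @ beta
    return float(np.sum(y * eta - np.log1p(np.exp(eta))))

def fit_logit(X, y, iters=25):
    """Maximize the log likelihood by Newton's method."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        prob = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = prob * (1.0 - prob)
        grad = X.T @ (y - prob)
        hess = X.T @ (X * w[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

# Hypothetical predictor values and binary responses
x = np.array([0.5, 1.2, 2.3, 3.1, 0.8, 2.9, 1.7, 3.5])
y = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0])
X = np.column_stack([x, np.ones_like(x)])  # predictor plus a column of 1's

beta = fit_logit(X, y)
l1 = log_lik(beta, X, y)          # unrestricted log likelihood
pbar = y.mean()                   # proportion of responses equal to one
l0 = len(y) * (pbar * math.log(pbar) + (1 - pbar) * math.log(1 - pbar))
G = -2.0 * (l0 - l1)              # likelihood ratio statistic
```

`G` would then be compared against a chi-squared distribution to obtain the p-value.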

### Multiple Logistic Regression¶

#### Full Model Only Regression Equation¶

The multiple logistic regression uses a logit model to fit the binary response , using the covariate matrix , consisting of the regression coefficients for continuous predictors and indicator coefficients for categorical predictors, along with a column of 1’s for the intercept. The Newton’s method approach of maximizing the log likelihood function is used for estimating the logit model [Green1997]. The null hypothesis being tested is that the slope and intercept coefficients in the logit model are zero. A likelihood ratio statistic is calculated, where is the unrestricted likelihood and is the restricted likelihood, and is asymptotically distributed as Chi-Squared with degrees of freedom.

We simplify the notation by using a lower-case to mean “log likelihood”–that is,

and

where base-e (natural) logarithms are used.

Using this notation, the unrestricted log likelihood is

where is the proportion of the dependent observations that are equal to one, and the restricted log likelihood is

and where . Here, is the vector of slope coefficients.

#### Full Versus Reduced Model Regression Equation¶

For the full versus reduced logistic regression model, logistic regression equations are obtained for both the full model and for the reduced model. The reduced logistic regression model includes only the dependent and any covariates selected for the reduced model. The full logistic regression model includes all of the variables, including any full-model covariates.

A likelihood ratio statistic is calculated to find the significance of including the full model regressors versus not including these regressors. The restricted likelihood of the reduced model is represented by and is the restricted likelihood of the full model. Both and are computed as below:

and where , where are the degrees of freedom of the full model and are the degrees of freedom of the reduced model. Here, is the reduced model vector of slope coefficients, and is the full model vector of slope coefficients.

#### Regressor Statistics¶

The coefficient of each regressor, along with the y-intercept, is calculated as a part of the Newton’s method approach of maximizing the log likelihood function for the full model.

The standard error of the regressor is found by inverting the information matrix of the regression, which is formed using the intercept as the last coefficient. The square root of the corresponding diagonal element of the inverted matrix is the standard error of the regressor, and the square root of the last diagonal element of the inverted matrix is the standard error of the intercept. (See [HosmerAndLemeshow2000].)

The p-value Pr(Chi) associated with dropping the regressor from the regression is found by running a separate logistic regression using all the regressors as the full model and a model with all the regressors except the one in question as the reduced model. (The “Chi” refers to the likelihood ratio test that is performed between these two models to find the p-value.)

The regression odds ratio for a coefficient is simply . This is interpreted as the ratio by which the odds of the dependent being one change when the given regressor changes by one unit. An example would be the ratio of the odds of being a case rather than a control for a smoker to the odds of being a case rather than a control for a non-smoker.

The p-value for the univariate fit of the regressor is obtained from a separate logistic regression which is calculated as if the regressor were the only regressor in the model against the dependent variable.

#### Categorical Covariates and Interaction Terms¶

If a covariate is categorical, dummy variables are used to indicate the category of the covariate. A value of “1” for the observation indicates that it is equal to the category the dummy variable represents. Similarly, if the observation is not equal to the category for the dummy variable, then it is assigned the value of “0”. As the values of one dummy variable can be determined by examining all other dummy variables for a covariate, in most cases the last dummy variable is dropped. This avoids using a rank-deficient matrix in the regression equation.

A first-order interaction term is a new covariate created from the product of two covariates specified in either the full- or reduced-model covariates. If one covariate in an interaction term is categorical, dummy variables for each category of that covariate will be multiplied by the other covariate to create the first-order interaction terms. If both covariates are categorical, the dummy variables from both covariates will be multiplied by each other.

For example, consider the following covariates for six samples.

| Sample   | Lab | Dose | Age |
|----------|-----|------|-----|
| sample01 | A   | Low  | 35  |
| sample02 | A   | Med  | 31  |
| sample03 | A   | High | 37  |
| sample04 | B   | Low  | 32  |
| sample05 | B   | Med  | 36  |
| sample06 | B   | High | 33  |

Using dummy variables for the categorical covariates the above table would be:

| Sample   | Lab=A | Lab=B | Dose=Low | Dose=Med | Dose=High | Age |
|----------|-------|-------|----------|----------|-----------|-----|
| sample01 | 1     | 0     | 1        | 0        | 0         | 35  |
| sample02 | 1     | 0     | 0        | 1        | 0         | 31  |
| sample03 | 1     | 0     | 0        | 0        | 1         | 37  |
| sample04 | 0     | 1     | 1        | 0        | 0         | 32  |
| sample05 | 0     | 1     | 0        | 1        | 0         | 36  |
| sample06 | 0     | 1     | 0        | 0        | 1         | 33  |

Interactions Lab*Dose and Lab*Age would be specified as:

| Sample   | A*Low | A*Med | A*High | B*Low | B*Med | B*High | A*Age | B*Age |
|----------|-------|-------|--------|-------|-------|--------|-------|-------|
| sample01 | 1     | 0     | 0      | 0     | 0     | 0      | 35    | 0     |
| sample02 | 0     | 1     | 0      | 0     | 0     | 0      | 31    | 0     |
| sample03 | 0     | 0     | 1      | 0     | 0     | 0      | 37    | 0     |
| sample04 | 0     | 0     | 0      | 1     | 0     | 0      | 0     | 32    |
| sample05 | 0     | 0     | 0      | 0     | 1     | 0      | 0     | 36    |
| sample06 | 0     | 0     | 0      | 0     | 0     | 1      | 0     | 33    |
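The dummy-variable and interaction construction shown in these tables can be sketched as follows (a hypothetical helper for illustration, not SVS code):

```python
import numpy as np

labs  = ["A", "A", "A", "B", "B", "B"]
doses = ["Low", "Med", "High", "Low", "Med", "High"]
ages  = np.array([35.0, 31.0, 37.0, 32.0, 36.0, 33.0])

def dummies(values, categories):
    """One column per category; 1 where the observation equals that category."""
    return np.array([[1.0 if v == c else 0.0 for c in categories] for v in values])

lab_d  = dummies(labs, ["A", "B"])
dose_d = dummies(doses, ["Low", "Med", "High"])

# Lab*Dose: every product of a Lab dummy column with a Dose dummy column
lab_dose = np.einsum("ij,ik->ijk", lab_d, dose_d).reshape(len(labs), -1)
# Lab*Age: each Lab dummy column multiplied by the continuous Age covariate
lab_age = lab_d * ages[:, None]
```

In an actual regression, the last dummy column for each covariate would normally be dropped, as described above, to avoid a rank-deficient design matrix.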

#### Stepwise Regression¶

If only a few variables (regressors or covariates) drive the outcome of the response, Stepwise Regression can isolate these variables. The methods for the two types of stepwise regression, forward selection and backward elimination, are described below.

Forward Selection

Starting with either the null model or the reduced model (depending on which type of regression was specified), successive models are created, each one using one more regressor (or covariate) than the previous model.

Each of the unused regressors is added to the current model to create a “trial” model for that regressor. The p-value of the trial model (or full model) versus the current model (or reduced model) is calculated, and the model with the smallest p-value is used as the next model. This method adds the next most significant variable to the current model. If the current model had the smallest p-value, or if no p-value is better than the p-value cut-off specified, then the forward selection method stops and declares the current model as the final model as determined by stepwise forward selection. If the model with all regressors has the smallest p-value then this full model is determined to be the final model.

From the standpoint of further analysis, the final model becomes the “full model” for this set of potential regressors.

Backward Elimination

Starting with the full model, successive models are created, each one using one less regressor (or covariate) than the previous model.

Each of the regressors currently in the model is removed to create a “trial” model excluding that regressor. The p-value of the current model (or full model) versus the trial model (or reduced model) is calculated, and the model with the smallest p-value is used as the next model. This method removes the least significant variable from the current model. If every p-value is smaller than the p-value cut-off specified, the backward elimination method stops. The method also stops if all variables have been removed from the model, or if all variables left are included in the original reduced model.

From the standpoint of further analysis, the final model becomes the “full model” for this set of potential regressors.
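The control flow of forward selection can be sketched generically as below. This toy uses a residual-sum-of-squares improvement as a stand-in significance score with a hypothetical cut-off; SVS itself compares trial models with a likelihood ratio p-value, as described above:

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of a least-squares fit (stand-in model score)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

def forward_select(cols, y, names, cutoff=1e-6):
    """Greedy forward selection: repeatedly add the candidate regressor whose
    trial model scores best, stopping when no trial improves the current
    model by more than the cut-off."""
    chosen = [np.ones(len(y))]           # start from the null model (intercept)
    picked = []
    remaining = dict(zip(names, cols))
    current = rss(np.column_stack(chosen), y)
    while remaining:
        trials = {name: rss(np.column_stack(chosen + [col]), y)
                  for name, col in remaining.items()}
        best = min(trials, key=trials.get)
        if current - trials[best] < cutoff:
            break                        # no remaining variable improves the fit
        chosen.append(remaining.pop(best))
        picked.append(best)
        current = trials[best]
    return picked

rng = np.random.default_rng(0)
x1, x2, x3 = (rng.normal(size=60) for _ in range(3))
y = 3.0 * x1 + 0.1 * x3                  # only x1 and x3 drive the response
picked = forward_select([x1, x2, x3], y, ["x1", "x2", "x3"])
```

Backward elimination is the mirror image: start from the full model and repeatedly drop the regressor whose removal hurts the fit least.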

## Haplotype Frequency Estimation Methods¶

The alleles of multiple markers transmitted from one parent are called a haplotype. Haplotype analysis of safety and efficacy data can incorporate the information from multiple markers from the same gene or genes, which are physically close on a specific chromosome. Genotypic data from unrelated individuals do not contain information on which alleles were transmitted from each parent, but haplotype frequencies can be estimated using several existing in silico methodologies such as the Expectation Maximization (EM) algorithm and the Composite Haplotype Method (CHM).

A common type of genetic data is genetic information independently scored at several markers along a chromosome. Each subject has two copies of the same chromosome, one from the mother and one from the father. It is not clear which alleles at different markers reside on the maternal copy of the chromosome and which reside on the paternal copy. Only individuals that are heterozygous at no more than a single marker can be resolved into a pair of haplotypes unambiguously. This problem, called “genetic phase uncertainty,” is an example of the more general problem of statistical inference in the presence of missing data.

The expectation-maximization (EM) algorithm, formalized by [Dempster1977], is a popular iterative technique for obtaining maximum likelihood estimates of sample haplotype frequencies (see [Excoffier1995] for details of obtaining haplotype frequencies by EM).

Determining the probability of each haplotype for each sample from the overall haplotype probabilities is based on the computation method being used, and does not relate to any estimate of LD between any pair of markers involved.

CHM diplotype probabilities are based on the average of the probabilities of the two haplotypes, while EM diplotype probabilities are based on the product of the probabilities of the two haplotypes. The process of finding the EM diplotype probabilities is equivalent to the “expectation” step of each EM iteration.

### Case/Control Association with Haplotype Frequencies¶

Consider the following simple example. Suppose we observe a sample of 6 individuals typed at two bi-allelic markers. The first three individuals are double homozygotes, AA/BB, contributing six gametes of the type AB to the sample counts. The next two are aa/bb, contributing four gametes of the type ab. The last individual, however, is a double heterozygote, Aa/Bb, and there are two possible pairs of gametes, AB with ab, or Ab with aB. Taking into account the first part of the sample, the EM algorithm assigns an extremely low probability to the second arrangement (Ab with aB), because these types of gametes have not been observed among unambiguous individuals at all. To make such a calculation, it is necessary to assume Hardy-Weinberg Equilibrium (HWE) over the two markers, which means that the two-marker genotypes do not deviate from the products of the haplotype frequencies. On the other hand, the CHM method described below does not assume HWE and maintains that either of the two arrangements (AB with ab, or Ab with aB) is equally likely.
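This example can be worked numerically. The sketch below (illustrative only) runs the EM iteration on the six individuals, splitting the ambiguous double heterozygote between its two possible phases in proportion to the current haplotype frequencies:

```python
# Unambiguous gamete counts from the first five individuals
counts = {"AB": 6.0, "ab": 4.0, "Ab": 0.0, "aB": 0.0}
n_gametes = 12.0  # 6 individuals x 2 gametes

freq = {h: 0.25 for h in counts}   # initial haplotype frequencies
for _ in range(50):
    # E-step: probability that the Aa/Bb individual is (AB with ab)
    # rather than (Ab with aB), assuming HWE over the two markers
    w1 = freq["AB"] * freq["ab"]
    w2 = freq["Ab"] * freq["aB"]
    share = w1 / (w1 + w2) if (w1 + w2) > 0 else 0.5
    # M-step: recompute frequencies from the expected gamete counts
    freq = {
        "AB": (counts["AB"] + share) / n_gametes,
        "ab": (counts["ab"] + share) / n_gametes,
        "Ab": (counts["Ab"] + 1.0 - share) / n_gametes,
        "aB": (counts["aB"] + 1.0 - share) / n_gametes,
    }
```

The iteration drives the probability of the (Ab with aB) arrangement toward zero, as described above; CHM would instead give the two arrangements equal weight.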

Certainly, there will be a discrepancy between the sample frequencies generated by the two methods. Our goal, however, is not simply to report, but to contrast haplotype frequencies between levels of phenotype. Assume for the moment two markers with two alleles and that the phenotype is binary, i.e. the subjects can be classified as either “cases” or “controls”. For simplicity, assume that the haplotype AB determines the probability of an individual being a “case” and individual alleles are unimportant. Then the expected allele frequencies () are the same among cases and controls. We can therefore express the expected frequency of AB haplotypes in cases and controls as

where , are linkage disequilibrium (LD) coefficients. These coefficients are needed to account for the part of gametic frequency not explained by frequencies of alleles, . When we are comparing between cases and controls, we are essentially looking at the size of the difference between and . The EM algorithm attempts to obtain estimates of in cases and controls separately, assuming HWE in these groups. However, even if HWE holds in the population, it is expected not to hold in either cases or controls looked at separately [Nielsen1998], which may lower the power of the test. Moreover, cases and controls exhibit the deviation from HWE in different directions [Zaykin2000], which is exploited by CHM. The CHM estimates a quantity different from , specifically

The last term, , is the frequency of A, B alleles that reside on two different gametes [Weir1996], in contrast to , that measures their joint frequency on the same gamete. This “intra-gametic” frequency can also be written as a product of A, B allele frequencies plus the deviation unexplained by the product. Generally, this deviation is not zero if the HWE at the haplotype level does not hold. It is also expected that this deviation is generally larger in cases than in controls (denote the difference by ), when cases are the less prevalent category in a population. Therefore, when comparing composite frequencies between cases and controls, we are looking at the size of the difference . The term is zero only under the multiplicative model of haplotype action, in which case the two methods are comparing the same quantities.

The quantities , , , and are all unknown. The process of imputing LD involves first imputing two-marker haplotype frequencies by the chosen method, then imputing LD based on those imputed haplotype frequencies.

EM and CHM each have a distinct way of imputing haplotype frequencies, neither of which involves the values of , , , and between any particular pair of markers. EM assumes diplotypes are products of haplotype frequencies (which is a consequence of HWE over the markers), while CHM assumes that any of the haplotype arrangements is equally likely (such as AB with ab or Ab with aB in the example described above) and takes the average of their probabilities. Each algorithm is based on its respective assumption.

The CHM LD method (Formulas for Computing Linkage Disequilibrium (LD)) internally takes a shortcut past the haplotype imputation code and just does the equivalent imputation process directly. For biallelic markers, it additionally compensates for lack of HWE in either or both of the two markers.

In practice, we have found that more often than not EM and CHM result in essentially the same p-values for tests of association with the phenotype, although CHM can be much more powerful under models that include large non-additive components. Compared to EM, CHM is a non-iterative and much faster algorithm, posing no danger of convergence to a local maximum.

### Expectation Maximization (EM)¶

For haplotype inference with Single-Nucleotide Polymorphisms (SNPs), the EM algorithm is popular because it is easy to interpret and is stable. Compared to the Gibbs sampler, another method of estimating haplotype phases, the EM approach is a deterministic procedure. It also requires less computing time, and is easier to check for convergence. The output of the EM algorithm is the maximum-likelihood estimate (MLE), which possesses well-established statistical properties.

Note

Because of the memory-space properties of the EM algorithm, it will perform poorly, if at all, with more than a dozen loci.

The expectation-maximization (EM) algorithm for haplotype imputation is an implementation of the method described in [Excoffier1995].

### Composite Haplotype Method (CHM)¶

The CHM is based on the idea of the genotypic LD coefficient, , [Weir1996]. Estimation of involves calculation of di-genic frequencies. In the two-locus bi-allelic case, they are estimated as

where , for example, is the number of individuals with genotype Aa/Bb, and n is the sample size. The composite disequilibrium is defined as a sum of inter- and intra-gametic components,

Under random mating, , and so assuming random mating, is an unbiased estimate of the LD parameter, . Also, if we do not wish to separate the inter- and intra-gametic components, we may define

which is an observable quantity.

[Zaykin2001] extended the definition of di-genic frequencies to multiple loci and alleles. For the i-th individual multilocus genotype, let be the number of single-locus heterozygotes in . Define the weights as

Sample composite haplotype counts are calculated from summing over individual contributions,

where n is the sample size, and is the indicator function, defined as

Thus, if the i-th individual has at least one copy of all required alleles, it is counted with weight . The composite haplotype frequencies are given by

Note that includes both inter- and intra-gametic component frequencies.

In a two-locus, two-allele case, composite haplotype counts simplify to Weir’s definition, . In a single-locus case, they are the usual definition of the allele count:
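As a numeric illustration of the composite counts (using the same six-individual example as earlier, and a hypothetical helper name), each individual carrying at least one copy of every required allele contributes weight 2^(1 − h), where h is its number of heterozygous loci among the markers involved:

```python
def composite_count(genotypes, alleles):
    """Composite haplotype count: an individual carrying at least one copy of
    every required allele contributes weight 2**(1 - h), where h is its
    number of heterozygous loci among the markers involved."""
    total = 0.0
    for geno in genotypes:              # geno: one (allele, allele) pair per locus
        if all(a in pair for a, pair in zip(alleles, geno)):
            h = sum(pair[0] != pair[1] for pair in geno)
            total += 2.0 ** (1 - h)
    return total

# Six individuals typed at two markers: 3 x AA/BB, 2 x aa/bb, 1 x Aa/Bb
genos = ([(("A", "A"), ("B", "B"))] * 3
         + [(("a", "a"), ("b", "b"))] * 2
         + [(("A", "a"), ("B", "b"))])
n = len(genos)
n_AB = composite_count(genos, ("A", "B"))   # composite count for A with B
p_A = sum(g[0].count("A") for g in genos) / (2 * n)
p_B = sum(g[1].count("B") for g in genos) / (2 * n)
# Composite LD coefficient: inter- plus intra-gametic, minus 2 * p_A * p_B
delta_AB = n_AB / n - 2.0 * p_A * p_B
```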

## Formulas for Computing Linkage Disequilibrium (LD)¶

Two approaches are available for computing linkage disequilibrium (LD), depending upon the method used for imputing the two-marker haplotype frequencies upon which the LD computations depend, expectation-maximization (EM) vs. the composite haplotype method (CHM).

### Computing LD using Expectation-Maximization (EM)¶

First, the EM method (see Expectation Maximization (EM) and Haplotype Frequency Estimation Methods) is used to impute the two-marker haplotype probabilities for the -th and -th allele in the first and second markers, respectively.

Using these, the signed statistics may be calculated as

where is the frequency of allele in the first marker, and is the frequency of allele in the second marker.

If there are alleles in the first marker and alleles in the second, a chi-squared distribution with degrees of freedom may be written as

may then be computed by taking the p-value as

and obtaining from the inverse distribution for one degree of freedom as

For the two-locus two-allele case, this procedure simplifies to the following direct formula

which, for this two-locus two-allele case, may be shown to be equivalent to

where may be chosen as either one of the alleles in the first marker and may be chosen as either one of the alleles in the second marker.
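For the two-locus two-allele case, the direct computation can be sketched from imputed haplotype frequencies (hypothetical values standing in for EM output):

```python
import math

# Imputed two-marker haplotype frequencies (hypothetical EM output)
p = {"AB": 0.50, "Ab": 0.08, "aB": 0.10, "ab": 0.32}

p_A = p["AB"] + p["Ab"]            # frequency of allele A in the first marker
p_B = p["AB"] + p["aB"]            # frequency of allele B in the second marker
D = p["AB"] - p_A * p_B            # signed LD coefficient
# In the bi-allelic case the correlation follows directly:
r = D / math.sqrt(p_A * (1.0 - p_A) * p_B * (1.0 - p_B))
r2 = r * r
```

The same `r2` results whichever allele of each marker is chosen, since switching an allele only flips the sign of `D`.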

### Computing LD using the Composite Haplotype Method (CHM)¶

#### Multi-Allelic¶

If there are alleles in the first marker and alleles in the second, where either or or both, and using the same notation for and as above, a chi-squared distribution with degrees of freedom may be written as

where

and is defined as in ftCHM. Here, we are effectively using

which includes both inter- and intra-gametic component frequencies, as our haplotype frequencies.

may then be computed by taking the p-value as

and obtaining from the inverse distribution for one degree of freedom as

#### Bi-Allelic¶

For the two-locus two-allele case, and using the notation of ftCHM, we compute using the following direct formula

where and are the Hardy-Weinberg coefficients for allele of the first marker and allele of the second marker, respectively. This formula may be thought of as putting a “Hardy-Weinberg correction” onto the formula

which is only completely accurate under the special circumstance of random mating (perfect Hardy-Weinberg equilibrium over the two-marker haplotypes), for which approximates and is an unbiased estimate of .

It may be shown that for the circumstance of perfect linkage disequilibrium, the result of using the “Hardy-Weinberg correction” formula is equivalent to

### The D-Prime Statistic¶

If the minor allele frequencies of the respective markers are small, the magnitude of the statistic cannot become very large, even if the markers are in almost complete linkage disequilibrium, compared with the magnitude it could have attained had the allele frequencies of the markers been almost equal.

The D-prime statistic was designed to compensate for this. is defined as normalized by the maximum possible magnitude that could have attained given the allele frequencies in each of the markers.

Specifically,

if , and

otherwise.

The overall D-prime statistic is defined as
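The per-allele-pair computation can be sketched as follows (a hypothetical function name, using the standard case split on the sign of D):

```python
def d_prime(p_A, p_B, D):
    """D normalized by the maximum magnitude D could attain given the
    allele frequencies of the two markers."""
    if D >= 0:
        d_max = min(p_A * (1.0 - p_B), (1.0 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1.0 - p_A) * (1.0 - p_B))
    return abs(D) / d_max if d_max > 0 else 0.0
```

For a bi-allelic pair of markers in complete LD (only haplotypes AB and ab present, with equal allele frequencies), this yields 1.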

#### Computing D-Prime¶

For EM, the above formula is used directly on the values of , , and , where the are imputed using the technique of Computing LD using Expectation-Maximization (EM).

For multi-allelic CHM, we use

if , and

otherwise, with the overall D-prime statistic being defined as

For bi-allelic CHM, we use the same formulas as for multi-allelic CHM, except that for the final , we take the original overall obtained as above and use a Hardy-Weinberg correction on it:

where , , and are defined as in Bi-Allelic.

## Formulas for Principal Component Analysis¶

This technique was pioneered at the Broad Institute, which distributes a program called “EIGENSTRAT” that implements it. The PCA correction technique is described in [Price2006]. A more thorough discussion of stratification, principal components analysis, and the eigenvalues involved may be found in [Patterson2006].

Note

Golden Helix SVS currently uses PCA with association tests performed against genotypes (using their numeric equivalent values) (Genotypic Principal Component Analysis) and with association tests performed against numerical data (Numeric Principal Component Analysis).

Suppose is the (marker) by (subject or sample) matrix of either numeric values or the numeric equivalent values of the genotypes, with the exception that for each marker, what had been the average value over that marker has first been subtracted out from the marker’s values (data centering by marker).

Note

In our notation, a spreadsheet in SVS on which PCA may be performed may be thought of as , since such a spreadsheet suggests an (subject or sample) by (marker) matrix.

### Motivation¶

We would like to decompose into as a singular value decomposition, where is an by matrix for which , is an by matrix for which , and is an by singular-value matrix. For this decomposition, may be thought of as consisting of principal components or “ancestries” relating the samples to the singular values, while the column of (where and ) may be thought of as containing the “coordinate value” of each marker along the principal component.

We do this by finding the eigenvalues and eigenvectors of . (This matrix is like a Wishart matrix. See Further Motivation: Relation to the Variance-Covariance Matrix.) Rewriting in terms of the singular value decomposition, we have

where is an by matrix containing the eigenvalues of on its diagonal and zeroes elsewhere. Right-hand multiplying the above by gives us

Since is a matrix of the eigenvectors (principal components) of , the above implies, for through , that

where are the eigenvalues of , and the vectors are the columns of and are also the eigenvectors of .

As stated above, the principal components may be thought of as “ancestry” vectors. The most “important” ones correspond to the highest values of . Our objective will be to subtract these most important components out from the rest of the data, so that we may more effectively test for associations with markers that specifically relate to our phenotype, rather than with markers that relate to population ancestry.

Note

Matrix is not unique, except that the first columns have to be eigenvectors of corresponding to the eigenvalues . As stated above, the element of the column of may be thought of as the relative magnitude of the principal component in making up the row of .

However, SVS does not actually need to compute any part of matrix at all. The procedure outlined in Applying the PCA Correction is dependent only upon the input data and upon the (from matrix ), and is otherwise self-contained.

Note

In fact, SVS normalizes by the number of markers, . The corresponding decomposition of becomes , and is the matrix which contains the eigenvalues that we can obtain and output from SVS.
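The relationship between the Wishart-like matrix, its eigenvalues, and the singular value decomposition can be checked numerically on random data (an illustrative check with marker-centered rows, not SVS code):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 50, 8                              # markers x samples
A = rng.normal(size=(m, n))
A = A - A.mean(axis=1, keepdims=True)     # data centering by marker

X = A.T @ A / m                           # Wishart-like matrix, normalized by m
evals, evecs = np.linalg.eigh(X)
order = np.argsort(evals)[::-1]           # sort eigenvalues largest-first
evals, evecs = evals[order], evecs[:, order]

# Cross-check against the singular value decomposition A = U S V^T:
# the eigenvalues of (A^T A) / m are the squared singular values over m.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
```

Because each marker’s row was centered, the smallest eigenvalue is zero, corresponding to the constant “average vector” discussed later in this chapter.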

### Technique¶

When accumulating the elements of for genotypic data, the marker data may be optionally weighted in favor of those markers with less variance by normalizing them according to their variance, either theoretical or actual. See Formulas for PCA Normalization of Genotypic Data regarding marker normalizing for genotypic data.

Otherwise, the technique is to:

• Scan the markers and obtain the Wishart-like matrix.
• Find the desired number of eigenvectors and eigenvalues from .
• If PCA-corrected testing is being performed, apply the PCA corrections, first from the dependent variable (for genotypic data or if correcting the dependent variable) and then during the testing scan from each marker of the genetic data just before it is tested.

### Applying the PCA Correction¶

Suppose is a vector containing the dependent variable (response) values or numeric-equivalent values. Exercising data centering by marker, we subtract the average element value from each element to get ; then,

where . Similarly,

where , etc., up to

where .

Note that the values are obtained through dot products, where, component-wise,

The corrected is obtained (as a part of data centering by marker) by adding back the average element value of the original to each element of .

The idea is that for each principal component, we find what magnitude of it is present in (or ); we then subtract that magnitude of that component from to get .

Exactly the same technique is applied for every vector of data from an individual marker. After subtracting the average element value to get , we compute

where . Similarly,

where , etc., up to

where

Component-wise, we have

We finally get the corrected back (as a part of data centering by marker) by adding back the average element value of the original to each element of .
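The correction loop for a single marker (or for the dependent variable) can be sketched as below (a hypothetical function; the components are assumed orthonormal, so sequential subtraction matches the formulas above):

```python
import numpy as np

def pca_correct(g, components):
    """Subtract from g the magnitude of each principal component present in
    it, then add the original average back (data centering by marker)."""
    avg = g.mean()
    g0 = g - avg
    for e in components:        # each e: a unit-length principal component
        g0 = g0 - (g0 @ e) * e  # the dot product g0 . e is the magnitude r
    return g0 + avg

e1 = np.array([1.0, 1.0, -1.0, -1.0]) / 2.0   # hypothetical unit component
g = np.array([2.0, 3.0, 1.0, 0.5])            # one marker's data
g_corr = pca_correct(g, [e1])
```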

### Formulas for PCA Normalization of Genotypic Data¶

The possibilities for marker normalization are:

• The theoretical standard deviation of this marker’s data at Hardy-Weinberg equilibrium (HWE). This is the standard deviation this marker’s data would have over the population if, in the population, the marker were in Hardy-Weinberg equilibrium and had the same major and minor allele frequencies as are actually measured for it. This is the standard method used by the EIGENSTRAT program.

This theoretical standard deviation is as follows according to the genetic model being used:

• Additive (the only model supported by EIGENSTRAT itself):

where

is a Bayesian estimate of the minor allele probability in terms of the numeric marker values .

• Dominant:

where

is a Bayesian estimate of the probability of being a DD or Dd genotype.

• Recessive:

where

is a Bayesian estimate of the probability of being a DD genotype.

• The actual standard deviation of this marker’s data.

• Don’t normalize.

Note

Numeric data is not normalized in SVS.

### Further Motivation: Relation to the Variance-Covariance Matrix¶

For the purposes of PCA correction in SVS, the marker data may be treated as if it comes from a multivariate normal distribution, where the data from marker , , is picked from a random variable (where is equal to the number of samples). If indeed has a multivariate normal distribution, then there is a random vector whose components are independent standard normal random variables, a vector , and a by matrix such that .

The vector is the expected value of , and the matrix is the variance-covariance matrix of the components .

Thus, the vectors in matrix may be treated as coming from the expression

or

where are elements in matrix .

Let us examine . One element of may be written as

The term expands out to

Expanding the first term above, we have

We may thus write

where m is the number of markers.

However, since the components of the random variable from which is assumed to be drawn are independent standard normal random variables, we may make the following estimates:

Thus, we may estimate the elements and of as

or

Note that

thus, we have the following estimate:

where is the covariance matrix of the multivariate distribution of and is an outer product of the sample averages.

Note

is a Wishart matrix based on . The distribution of this random matrix is a generalization of the chi-square distribution.

### Centering by Marker vs. Not Centering by Marker¶

In the EIGENSTRAT technique (Formulas for Principal Component Analysis), the data is centered by marker–that is, the average of the data within a given marker is first subtracted out from each of that marker’s data values both before they are used for the matrix and before they are PCA-corrected, with the averages being added in after correction.

An alternative approach, available in SVS for numeric or copy number data, is to not center by marker, but instead allow the overall average value of each marker to be a part of the principal components analysis.

Two cases can be made that this approach is preferable to centering by marker for segmenting copy number data. First, if a variation in the data due to a batch effect is “corrected out” using centering by marker, a slight displacement over the remaining markers will ensue. If this displacement is large enough, it might be interpreted as a variation large enough to require a segment boundary, even over samples where there should not be a segment boundary.

Second, leaving the marker average in the principal components analysis may result in overall data smoothing and removing unnecessary segment boundaries.

On the other hand, a true copy-number variation over enough samples may influence the marker average, which, under this approach, could be picked up as a large component or as large parts of two or three large components by principal components analysis. This might result in apparent variations large enough for segment boundaries for samples for which there were no actual variations. However, the good news is that the apparent variations would be in the opposite direction from the true copy-number variations.

Regarding data centering by marker, it should be noted that after the marker averages are subtracted out, we have , where is the element of corresponding to the -th marker and the -th sample. Since

for any pair of samples and , we have

Multiplying a constant vector into the above will therefore obtain a vector with zero for every element, demonstrating that when data centering by marker is used, one of the eigenvalues is zero, corresponding to an “average vector” of constant elements.

If data is not centered by marker, but instead marker averages are incorporated into the components, this zero eigenvalue will not be present–instead, if the data’s averages are indeed not zero, one “extra” eigenvalue will show up. Often, this will be among the larger eigenvalues, before the eigenvalues start becoming very similar to eigenvalues (with one smaller subscript) that occur when data centering by marker is used.

### Centering by Sample vs. Not Centering by Sample¶

A second alternative approach to PCA, available in SVS for numeric or copy number data, is to center by sample, either in addition to centering by marker or instead of centering by marker. This entails effectively subtracting out the sample average before contributing to the matrix (and adding it back in while correcting). (In SVS, if centering by both sample and marker, the overall average is subtracted out only once.)

Whether it is appropriate to subtract out sample averages and add them back in after correction, rather than including the sample averages as part of what may be corrected, depends upon how much of the batch effect or stratification may be built into the various sample averages and their products vs. built into the variation across both markers and samples.

An example of when centering by sample can be useful is when obtaining components of copy number data from chromosomes 1 through 22 and then applying these to copy number data for chromosomes X and Y (see Applying PCA to a Superset of Markers). Centering by sample ensures that the differences between men’s and women’s average X chromosome intensities, and between their average Y chromosome intensities, are preserved through PCA correction.

### Applying PCA to a Superset of Markers¶

Sometimes, it is desirable to obtain principal components for a given set of samples from one set of markers, and then apply these components to a different set of markers. One example is, within a given set of samples, obtaining the components from a set of markers with a known population structure and then applying these to genotypic data over other markers. Another is the example mentioned above, which is to, for a given set of samples, obtain components of copy number data from chromosomes 1 through 22 and then apply these to copy number data for chromosomes X and Y.

Whether it is reasonable to apply PCA to a superset of markers depends on whether we can assume that the behavior of the data over the different set of markers, especially the parts of the data contributing to population structure or batch effects, is governed by essentially the same variance-covariance matrix (and the same set of sample averages, unless data centering by sample is being used) that governs the original data. Under this assumption, the same eigenvectors that describe the original data’s population structure or batch effects will also govern, and may be used to remove, the “new” data’s population structure or batch effects.

### Applying PCA to a Subset of Samples¶

It is also sometimes desirable to obtain components for one set of samples, then use these components to correct data from a subset of these samples, either for the same markers or for a superset of markers. To do this, a subset of the components is first created to match the data’s sample subset, which is then applied to the data.

Here again, if we are correcting over any superset of the markers from which we obtained principal components, we assume consistency in the variance-covariance matrix (and the sample averages, if applicable) over this superset of markers.

However, now we must look at whether two components which are orthogonal in the main sample set are still orthogonal in the sample subset. Suppose $u = (u_1, u_2)$ and $v = (v_1, v_2)$ are two components of the full set, where $u_1$ and $v_1$ are in the original sample set but not the sample subset, and $u_2$ and $v_2$ are in the sample subset. (Without loss of generality, we are writing each vector as if all of the elements in the subset follow after all of the elements not in the subset.) Because these are eigenvectors of the symmetric matrix used for PCA, they will be orthogonal. That is, their dot product will be zero.

This dot product may be written $u \cdot v = u_1 \cdot v_1 + u_2 \cdot v_2$. We see that if the individual dot products, both outside of and inside of the subset, are zero, then the full dot product will be zero. Thus, components that are orthogonal in the subset and its complement will be orthogonal for the overall set.

Unfortunately, the converse is not generally true. It is true that because the two partial dot products must sum to zero, any deviation from zero in the dot product over the subset is exactly compensated by an equal and opposite deviation over the samples outside the subset. On that basis, we could try to say that neither dot product “should” deviate too much from zero, and thus that the components as input “should” be more or less orthogonal over the subset, especially if the subset is of a reasonably large size.

However, because the subset components are not precisely orthogonal except in special cases, a modified Gram-Schmidt process is applied by SVS to make the components that belong to the subset orthonormal (both orthogonal and normalized). Effectively, this is equivalent to “obtaining a PCA correction” for every component $e_k$ based upon the previous components $e_1$, $e_2$, ..., up to $e_{k-1}$.
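A minimal sketch of the modified Gram-Schmidt step is shown below. Storing the subset components as the columns of a matrix is an assumption of this sketch, not a statement about SVS internals.

```python
import numpy as np

def modified_gram_schmidt(V):
    """Orthonormalize the columns of V (one column per component)."""
    Q = V.astype(float).copy()
    k = Q.shape[1]
    for j in range(k):
        Q[:, j] /= np.linalg.norm(Q[:, j])
        # Immediately remove component j from all later columns
        # (the "modified" variant, numerically more stable than
        # classical Gram-Schmidt).
        for i in range(j + 1, k):
            Q[:, i] -= (Q[:, j] @ Q[:, i]) * Q[:, j]
    return Q
```

After this step, `Q.T @ Q` is the identity matrix (up to rounding), i.e. the subset components are orthonormal.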

## Runs Of Homozygosity (ROH) Algorithm¶

The Runs of Homozygosity algorithm is designed to find runs of consecutive homozygous SNPs and find where those runs are common among at least the specified number of samples.

The first part of the analysis is determining the homozygous runs or Runs of Homozygosity (ROH), which can be represented as the index of the SNP where the run started and the index of the last SNP in the run. A homozygous genotype is one where both bases are the same, such as A_A or B_B.

Unlike the PLINK ROH algorithm, which uses a moving window that can introduce artificial runs and fails to capture runs larger than the window, the Golden Helix SVS algorithm works continuously across an entire chromosome, examining every possible run that matches the parameters specified in the ROH GUI Dialog window. The algorithm treats every homozygous SNP as a potential start of a new ROH run. Each SNP is classified as homozygous, heterozygous or missing; each potential run is then either extended (if the SNP is homozygous) or has its heterozygous or missing count incremented. If consecutive heterozygotes are restricted, a run may contain up to that number of heterozygotes in a row; as long as it has not reached the maximum allowed heterozygote count, the run may continue. The same applies to consecutive missing genotypes. There is also an option to allow only a certain number of heterozygotes (or missing genotypes) within a specified window of variants; this is more general than the consecutive restriction, since homozygous calls may fall between the heterozygotes (or missing genotypes). Runs that exceed the allowed number of heterozygous or missing SNPs are then checked against the specified length and SNP density parameters to determine whether they should be discarded. In addition, for NGS data, missing genotypes can be specified to be treated as homozygous reference calls. At the end of a chromosome, the longest runs are greedily chosen from all the potential runs (with no overlapping of runs allowed) until no more runs are available. The result is the set of the longest runs of homozygous SNPs for a given chromosome and sample. This is repeated for each chromosome and sample of the dataset.
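A toy sketch of the run search follows. The genotype encoding, parameter names, and the simplified constraint handling (total rather than consecutive heterozygote/missing limits, no density check) are assumptions for illustration, not the SVS implementation.

```python
def is_hom(g):
    # Genotypes encoded like "A_A" (homozygous), "A_B" (het), "?_?" (missing).
    a, b = g.split("_")
    return a != "?" and a == b

def find_roh(genotypes, min_len=4, max_het=1, max_miss=1):
    """Every homozygous SNP starts a candidate run; runs tolerate a
    limited number of heterozygous and missing calls; the longest
    non-overlapping runs are then kept greedily."""
    n = len(genotypes)
    candidates = []
    for start in range(n):
        if not is_hom(genotypes[start]):
            continue
        het = miss = 0
        end = start
        for j in range(start, n):
            g = genotypes[j]
            if is_hom(g):
                end = j                     # runs always end on a homozygous call
            elif "?" in g:
                miss += 1
                if miss > max_miss:
                    break
            else:
                het += 1
                if het > max_het:
                    break
        if end - start + 1 >= min_len:
            candidates.append((start, end))
    # Greedy selection: longest runs first, no overlaps allowed.
    candidates.sort(key=lambda r: r[1] - r[0], reverse=True)
    chosen = []
    for s, e in candidates:
        if all(e < s2 or s > e2 for s2, e2 in chosen):
            chosen.append((s, e))
    return sorted(chosen)
```

For example, five `A_A` calls, one `A_B`, then four `B_B` calls form a single ten-SNP run when one heterozygote is tolerated.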

A second algorithm is used to create a clustering of runs. Given the final list of ROHs calculated by the previous algorithm and a threshold, $t$, for the minimum number of samples that contain a run, a cluster is defined as a contiguous set of SNPs in which every SNP is covered by a run in at least $t$ samples. The algorithm sweeps across all the SNPs in the dataset looking for clusters that meet the run length requirements provided to the algorithm. A list of clusters is generated in a spreadsheet. Optionally, another spreadsheet is created which calculates, for each sample, the ratio of SNPs in each cluster that are members of a run of homozygosity.
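The sweep for clusters can be illustrated with a small sketch; the run representation (per-sample lists of inclusive start/end index pairs) is an assumption, and the run-length requirements are omitted.

```python
def find_clusters(runs_per_sample, n_snps, t):
    """Contiguous SNP ranges where at least t samples are inside a run.
    runs_per_sample: one list of (start, end) runs per sample."""
    coverage = [0] * n_snps
    for runs in runs_per_sample:
        for s, e in runs:
            for i in range(s, e + 1):
                coverage[i] += 1
    clusters, start = [], None
    for i, c in enumerate(coverage):
        if c >= t and start is None:
            start = i                       # a cluster opens here
        elif c < t and start is not None:
            clusters.append((start, i - 1))
            start = None
    if start is not None:                   # cluster runs to the last SNP
        clusters.append((start, n_snps - 1))
    return clusters
```

With three samples carrying runs (0,5), (2,8) and (3,9) over 12 SNPs and $t = 2$, the single cluster spans SNPs 2 through 8.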

In addition to the clusters described above, Optimal and Haplotypic Similarity clusters can be found. Optimal clusters are found by grouping samples with highly overlapped ROH regions and Haplotypic Similarity clusters are found by grouping samples with high allelic similarity over overlapping ROH regions. The methodology behind finding these clusters is described below.

### Optimal and Haplotypic Similarity Clusters¶

A method that uses normalized spectral clustering to group samples with highly overlapped ROH regions has been developed [Zhang2013]. The advantage of this method is that the new clusters have more stringent boundaries than the original clusters.

This clustering method is performed on each of the clusters found in the original method and repeated on those clusters until local optimum clusters are found. This method is also called repeated binary spectral clustering as for each cluster we group the samples into two clusters and then perform the clustering method again on those clusters. This continues until one of the stopping conditions, described below, is met. The clusters reported in the output spreadsheet are the clusters where the stopping condition was met.

#### Steps¶

The basic steps for finding these clusters are outlined below:

1. Construct a similarity matrix that describes the amount of overlap between each pair of runs.
2. Construct a degree matrix from the similarity matrix.
3. Compute the normalized symmetric Laplacian matrix.
4. Find the eigenvectors of the Laplacian matrix and create a matrix whose columns are the first K eigenvectors (K = 2, the number of groups we will form).
5. Normalize the matrix by normalizing the rows to norm 1.
6. Use k-means to cluster the points of this matrix (the rows) into K groups.
7. Repeat on the resulting clusters until a stopping condition is met: the number of iterations exceeds 100, the number of runs in a group falls to or below the minimum sample threshold, or the lower quartile of the similarity matrix is greater than 0.5.

This is done for each cluster found using the original method.
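Steps 1 through 6 can be sketched with NumPy for a toy similarity matrix. This is a simplified illustration: the k-means seeding here (the two mutually most-distant rows) is a shortcut, and the stopping conditions are omitted.

```python
import numpy as np

def spectral_bisect(S):
    """One round of binary (K = 2) normalized spectral clustering on a
    similarity matrix S; returns a 0/1 label per run."""
    d = S.sum(axis=1)                          # degree of each run
    L = S / np.sqrt(np.outer(d, d))            # elementwise D^-1/2 S D^-1/2
    _, vecs = np.linalg.eigh(L)                # eigenvalues in ascending order
    U = vecs[:, -2:]                           # the K = 2 largest eigenvectors
    T = U / np.linalg.norm(U, axis=1, keepdims=True)   # row-normalize to norm 1
    # Tiny k-means (K = 2) on the rows of T, seeded deterministically
    # with the two mutually most-distant rows.
    d2 = ((T[:, None] - T[None, :]) ** 2).sum(-1)
    i, j = np.unravel_index(np.argmax(d2), d2.shape)
    centers = T[[i, j]].copy()
    for _ in range(50):
        labels = np.argmin(((T[:, None] - centers) ** 2).sum(-1), axis=1)
        for k in range(2):
            if np.any(labels == k):
                centers[k] = T[labels == k].mean(axis=0)
    return labels
```

On a block-structured similarity matrix (two groups of runs with high within-group overlap), the two blocks are recovered as the two clusters.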

#### Similarity Matrix¶

There are two different ways that the similarity matrix is found. One for the optimal clustering method and one for the haplotypic similarity method.

For optimal clustering, the similarity matrix is defined as:

$$S = (s_{ij}), \quad i, j = 1, \ldots, n$$

where

$$s_{ij} = \frac{o_{ij}^2}{l_i \, l_j}$$

where $o_{ij}$ is the amount of overlap between runs $i$ and $j$, in SNPs, $l_i$ and $l_j$ are the lengths of runs $i$ and $j$, in SNPs, and $n$ is the number of runs in this cluster.

For haplotypic similarity clustering, the similarity matrix is defined as:

$$S = (s_{ij}), \quad i, j = 1, \ldots, n$$

where $n$ is the number of runs in a cluster and $i$ and $j$ are the runs, and

$$s_{ij} = \begin{cases} \dfrac{1}{o_{ij}} \sum_{p=p_s}^{p_e} I(a_{i,p} = a_{j,p}) & \text{if } o_{ij} \ge 0.5\, l_i \text{ and } o_{ij} \ge 0.5\, l_j \\ 0 & \text{otherwise} \end{cases}$$

where $p_s$ is the start index (in SNPs) of the overlap region between runs $i$ and $j$, $p_e$ is the end index of that region, $a_{i,p}$ and $a_{j,p}$ are the calls of the two runs at index $p$, and $o_{ij}$ is the same as defined before.

In plain English, the similarity is the percentage of SNPs over the overlapping region of runs $i$ and $j$ with identical alleles. However, if the overlap region, $o_{ij}$, is less than 0.5 of either the $i$-th or $j$-th run, the similarity is set to 0.

#### Degree Matrix¶

The degree matrix D, a diagonal matrix, for the similarity matrix is defined as:

$$D_{ii} = \sum_{j=1}^{n} S_{ij}$$

where $n$ is the number of runs in the cluster.

#### Normalized Symmetric Laplacian¶

The normalized symmetric Laplacian matrix can be found from the degree and similarity matrices, D and S, respectively.

There are different ways to get the Laplacian matrix.

First method:

$$L = D^{-1/2} (D - S) D^{-1/2} = I - D^{-1/2} S D^{-1/2}$$

Second method [Ng2002]:

$$L = D^{-1/2} S D^{-1/2}$$

However, to reduce computation time, the second method can be reduced to the elementwise form:

$$L_{ij} = \frac{S_{ij}}{\sqrt{D_{ii} D_{jj}}}$$

The reduced version of the second method is the one used.

#### Eigenvectors of Laplacian Matrix¶

It has been shown that the eigenvectors of a Laplacian matrix can be used to find semi-optimal cuts in a graph [Ng2002], and through the modifications made above, they can be used to find cuts amongst the runs in our cluster [Zhang2013].

We then put the eigenvectors corresponding to the K largest eigenvalues into the matrix U column-wise.

#### Normalize¶

The last step before the k-means algorithm is performed is to normalize the rows of the matrix U.

The new matrix T is described as:

$$T_{ij} = \frac{U_{ij}}{\left( \sum_k U_{ik}^2 \right)^{1/2}}$$

#### K-Means¶

Now that we have obtained the eigenvectors of the Laplacian matrix and normalized them, we can use the rows of T as points for the k-means algorithm. This essentially uses the second eigenvector to determine the clustering.

There are a few different k-means algorithms; the main ones are Lloyd, Hartigan-Wong, Forgy, and MacQueen. The Hartigan-Wong implementation generally performs better and is the method used.

Before running k-means on matrix T, initial cluster centers are found to help minimize the run time and to help ensure that the algorithm converges on a solution. To find these initial cluster centers, we first find the mean point, whose coordinates are the means of the first and second columns of matrix T. We then find the distance between each point and this mean and sort the points in ascending order by that distance. Lastly, for each cluster $L$ (where $L = 1, 2, \ldots, K$), the point at the corresponding evenly spaced rank in this ordering is chosen as the initial center for cluster $L$ [Hartigan1979].
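The seeding can be sketched as below. The exact rank formula used by SVS is not stated here, so the evenly spaced picks (`(l * n) // K`) are an assumption.

```python
import numpy as np

def initial_centers(T, K):
    """Seed k-means: sort the rows of T by distance to the overall mean
    point and take evenly spaced ranks as the initial centers.
    (The exact rank spacing is an assumption of this sketch.)"""
    n = T.shape[0]
    dists = np.linalg.norm(T - T.mean(axis=0), axis=1)
    order = np.argsort(dists)                  # ascending distance to the mean
    picks = [order[(l * n) // K] for l in range(K)]
    return T[picks]
```

Each returned center is an actual row of T, so the subsequent k-means iterations start from points inside the data.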

The k-means algorithm is then run on the matrix T and these initial centers.

After k-means, the runs are assigned to their respective clusters and the start and end indexes for these new clusters are found.

The start and end indexes of the clusters are found by sorting the start and end indexes of the runs in the cluster in ascending order. Then, while iterating over the list of sorted start indexes, the first index that at least 75% of the runs overlap is chosen as the new start index. The same is done to find the end index. However, if there isn’t an index that reaches 75% coverage, then the closest index is chosen [Zhang2013].

Now that the new clusters are established, the clustering process is started again on them and continues until one of the end conditions, described above in Steps, is met.

## Quantile Normalization of Affymetrix CEL Files¶

As SVS imports Affymetrix CEL files, it creates quantile-normalized log2 ratios in order to more accurately find regions of copy number variation and to allow more accurate association testing on the CNV variants.

The process used to generate normalized log2 ratios is closely analogous to the methodology employed by Affymetrix [Affymetrix2007], scaled to handle thousands of cases and controls. The process is as follows:

1. Depending on the Mapping array type, there are anywhere from 1 to 40 probes used to interrogate a given genotype. The Affymetrix 500k mapping array provides perfect match and mismatch probes, whereas the Affy 5.0 and 6.0 chips only include perfect match probes. We only use the means of the perfect match probes.

• For Affy 500k, the NSP and STY A and B probe intensities are extracted separately. For a given marker (SNP or CNV) there may be anywhere from 1 to 40 probes. These are averaged to get a probe intensity per marker.
• For Affy 5.0 and 6.0, the polymorphic SNP probes and the non-polymorphic copy number probes are extracted separately. The non-polymorphic probes only have an A intensity.
2. The A and B probe intensities are quantile normalized per sample using the approach of [Bolstad2001] and [Bolstad2003]. 500k NSP and STY samples are quantile normalized separately, as are the Affy 5.0 and 6.0 polymorphic and non-polymorphic probes. The process is as follows:

• For each of the autosomal A and B probes: sort the intensities for each sample in ascending order.
• Replace the smallest value in each sample with the mean of the smallest values, the second smallest value in each sample with the mean of the second smallest, and so on for the entire set of probes.
• Reorder the modified A and B intensities into their original order for all samples.
• Calculate modified intensities for the non-autosomal (sex) probes by finding the closest autosomal intensity value and substituting its corresponding quantile-normalized intensity.
3. Calculate a reference distribution and calculate log2 ratios:

• Select a set of samples for the reference distribution, for instance the controls.

• For each polymorphic probe, $p$, calculate the median $\tilde{A}_p$ of the quantile normalized A intensities and the median $\tilde{B}_p$ of the quantile normalized B intensities, for the reference samples. Then for a given pair of probe intensities $A_p$ and $B_p$, the normalized copy number signal is the log2 ratio of the sum of the $A_p$ and $B_p$ probe intensities to the sum of the median A and B probe intensities:

$$\log_2 \left( \frac{A_p + B_p}{\tilde{A}_p + \tilde{B}_p} \right)$$

• For non-polymorphic probes, there is no B intensity, but the analogous normalization is performed:

$$\log_2 \left( \frac{A_p}{\tilde{A}_p} \right)$$

4. Join together distributions:

• Recall, for the 500k, the NSP and STY log2 ratios are calculated separately. Also, for arrays containing non-polymorphic copy number probes, these must be joined with the polymorphic probes. We join different arrays together by the “virtual array generation” procedure outlined in section 7 of [Affymetrix2007].
• The Affymetrix procedure defines a log2 ratio range for defining which markers are copy-neutral. Rather than doing this, we sample from the middle 1/3 of the distribution of autosomal log2 ratios, but follow the same procedure of centering the copy number log2 ratios about zero by subtracting the mean of the copy-neutral markers, and scaling the distributions by their respective signal to noise ratios.
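The quantile-normalization step (step 2) can be illustrated with a toy sketch using a single intensity per probe; the A/B probe split, sex-probe handling, and reference-sample selection are omitted, and taking the reference medians over all samples is an assumption of the toy.

```python
import numpy as np

def quantile_normalize(X):
    """Quantile-normalize the columns (samples) of X: the k-th smallest
    value in every sample is replaced by the mean of the k-th smallest
    values across all samples, then values are put back in their
    original order (the Bolstad-style procedure)."""
    order = np.argsort(X, axis=0)
    ranks = np.argsort(order, axis=0)          # rank of each value in its column
    row_means = np.sort(X, axis=0).mean(axis=1)
    return row_means[ranks]

# Toy intensities: 3 probes (rows) by 2 samples (columns).
A = np.array([[1.0, 2.0],
              [4.0, 3.0],
              [2.0, 6.0]])
An = quantile_normalize(A)

# Per-probe reference medians (toy: over all samples) and log2 ratios.
ref = np.median(An, axis=1)
log2_ratio = np.log2(An / ref[:, None])
```

After normalization every sample has exactly the same distribution of values; probes whose intensity matches the reference median get a log2 ratio of zero.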

## CNAM Optimal Segmentation Algorithm¶

### Overview¶

The copy number segmenting algorithm is designed to find local variations over markers. The segmenting process is optimized by working at three levels:

1. If desired, the region of markers is subdivided into a moving window of sub-regions.
2. A unique segmenting algorithm is applied to find multiple segments wherever possible, and by implication, segment boundaries which are termed “cut-points”.
3. A permutation algorithm is applied to validate the found cut-points.

CNAM offers two types of segmenting methods, univariate and multivariate. These methods are based on the same algorithm, but use different criteria for determining cut-points.

The multivariate method segments all samples simultaneously, finding general copy number regions that may be similar across all samples. This method is preferable for finding very small copy number regions, and for finding conserved regions possibly useful for association studies. For a given sample, the covariate is the mean of the intensities within each segment for that sample. If there are consistent positions for copy number variation across multiple samples, the copy number segments will be found. In reality, there may not always be consistent copy number segments across multiple samples.

The univariate method segments each sample separately, finding the cut-points of each segment for each sample, and outputs a spreadsheet showing all cut-points found among all samples. There are two options for how this spreadsheet can be formatted: either there will be one column for each unique cut-point segment, or there will be a column for every segmented marker in the marker map. The first spreadsheet is good for association analysis because the multiple testing adjustment is smaller; however, the second spreadsheet is better for plotting the results. The results are the same in both cases, but in the second spreadsheet the results of a column are repeated for every marker in each unique segment.

See Using CNAM (Copy Number Analysis Method) Optimal Segmenting for more information.

### Obtaining Segments with a Moving Window¶

CNAM allows segmenting with a moving window, where the segmentation algorithm is applied to each sub-region created by the window. Using the optional moving window will result in an overall faster analysis; however, the results may not be as robust as analyzing the entire chromosome simultaneously. We implement an approach to windowing that minimizes edge effects as the moving window moves along the chromosome:

1. Start out with a sub-region of $w$ markers, where $w$ is the chosen window size.

2. Perform the segmenting algorithm on the current sub-region.

3. If two or more cut-points were found, meaning that there were more than two segments found:

• If Univariate Outlier Removal is selected, delete cut-points that bracket single-marker segments.
• If Univariate Outlier Removal is not selected, use all the cut-points except the last one and begin the next sub-region, of $w$ markers, at the second-to-last cut-point.

Otherwise:

• If no cut-points could be found at all to subdivide the one sub-region, the next sub-region is started in the middle of the current sub-region and extended for $w$ markers. No cut-points are yielded in this case.

Note

This may mean a region of copy number variation which started before this sub-region will be considered to be extended through this sub-region and possibly for much longer. We do not arbitrarily subdivide regions based on moving-window boundaries–instead, this whole procedure is meant to find valid cut-points throughout the main region.

• If the sub-region was successfully segmented into exactly two segments, yielding exactly one cut-point, and the cut-point is in the first half of the sub-region, this cut-point is accepted, and the next sub-region is started at that cut-point and extended for $w$ markers.

• If the sub-region was successfully segmented into exactly two segments, yielding exactly one cut-point, but the cut-point is in the second half of the sub-region, the apparent cut-point may be due to an edge effect. Thus, the cut-point is not used and the moving-window sub-region is changed so as to get better segmenting. If the size of the sub-region is still $w$, the sub-region is simply extended so that its total size is $2w$. If the size of the sub-region is already $2w$, the next sub-region, which goes back to being length $w$, is started at the former start plus $w/2$.

4. Finally, when the end of the main region is encountered, whatever cut-points are obtained from the sub-region are accepted.

You may set the window size as one of the optional segmenting parameters, see Using CNAM (Copy Number Analysis Method) Optimal Segmenting.

### Obtaining Segments without a Moving Window¶

Selecting not to use a moving window in the segmenting analysis results in the sub-region spanning an entire chromosome. The segmenting algorithm is then applied to the whole chromosome. This is more computationally intensive, but it is the most trusted procedure. Note that it may be necessary to increase the Max segments per window to more than the default of 20, as a given chromosome may well have more than 20 regions of copy number variation.

### Copy Number Segmentation within a Sub-Region¶

The segmentation algorithm exhaustively looks through all possible cut-points for a given region of data, to minimize the sum of squared deviations of the data from the mean of its respective segment. Given $n$ points or intensities, and $k$ cut-points defining $k + 1$ segments, the number of possible cut-point placements is $\binom{n-1}{k}$. That is, the search space explodes exponentially. This exponential search space perhaps accounts for the proliferation of algorithms using hidden Markov model approaches or approximate segmenting approaches to find the segment regions.

Ironically, since 1999, within the tree splitting code of Golden Helix SVS, we have had an optimal algorithm that finds segment boundaries, effectively overcoming the exponential search problem. This algorithm, due to Hawkins, see [Hawkins1972], [Hawkins1973], and [Hawkins2002], is based on dynamic programming. It exhaustively searches through all possible cut-point positions to find an optimal segmenting of the data. We employ this approach with a few modifications to the copy number problem.
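The dynamic-programming idea can be sketched as follows. This is a generic $O(k n^2)$ univariate segmentation that minimizes the total sum of squared deviations, in the spirit of the Hawkins approach, not the SVS code itself.

```python
import numpy as np

def optimal_segments(y, k):
    """Split y into k contiguous segments minimizing the total sum of
    squared deviations of the data from each segment's mean."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    p1 = np.concatenate([[0.0], np.cumsum(y)])        # prefix sums
    p2 = np.concatenate([[0.0], np.cumsum(y * y)])    # prefix sums of squares

    def sse(i, j):
        # Sum of squared deviations of y[i..j] (inclusive) from its mean.
        s, s2, m = p1[j + 1] - p1[i], p2[j + 1] - p2[i], j - i + 1
        return s2 - s * s / m

    INF = float("inf")
    cost = [[INF] * n for _ in range(k + 1)]  # cost[s][j]: best SSE of y[0..j] in s segments
    back = [[0] * n for _ in range(k + 1)]    # back[s][j]: start index of segment s
    for j in range(n):
        cost[1][j] = sse(0, j)
    for s in range(2, k + 1):
        for j in range(s - 1, n):
            for i in range(s - 1, j + 1):     # last segment is y[i..j]
                c = cost[s - 1][i - 1] + sse(i, j)
                if c < cost[s][j]:
                    cost[s][j], back[s][j] = c, i
    cuts, j = [], n - 1                       # cut-points = starts of segments 2..k
    for s in range(k, 1, -1):
        i = back[s][j]
        cuts.append(i)
        j = i - 1
    return sorted(cuts), cost[k][n - 1]
```

For the piecewise-constant input `[0, 0, 0, 5, 5, 5, 9, 9]` with `k = 3`, the optimal cut-points fall at the true level changes (indexes 3 and 6) with zero residual.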

The segmentation algorithms in Golden Helix SVS use “segmentation for a normally distributed response for a continuous-ordinal predictor”. This mode of segmentation is partly analogous to a univariate counterpart used in creating tree models. However, there are important differences, the main one being that no final p-value calculation for the entire sub-region is attempted because the main emphasis here is on local variation rather than an entire “split”. Also, the tree models do not currently implement a permutation testing procedure for determining split cardinality.

Note

The following points apply for the copy number segmentation algorithms:

• Within the sub-region under analysis, we consider the number of markers to correspond to the number of observations.
• These markers are considered to be implicitly numbered consecutively. The implicit marker number is used as the (continuous-ordinal) “predictor” variable.
• The object of this segmentation is to find which markers border areas of significant copy number variation–that is to find for which regions of markers there is at least one sample for whom the log2 ratio varies significantly between the regions, presumably indicating copy number variation between those regions.
• As mentioned above, we label the borders between these regions as “cut-points”.

In the multivariate algorithm, the samples (from the respective subjects) are considered to be the respective multivariate “dimensions” of this analysis. In the univariate algorithm, the dimension of the analysis is one, as each sample is analyzed independently. The log2 ratio for every sample and every marker is considered to be the “response” variable.

The model is that the observations segment into groups, and that each group, within each dimension, has a different mean with noise. The cut-points optimally splitting the data in a maximum likelihood sense are found by minimizing the sum over all dimensions of sums of squared deviations of the group means from the observations.

The positions of these cut-points are found in advance for every possible total number of segments, from two to the maximum number of segments allowed. You may set the maximum number of segments (per sub-region) as one of the optional segmenting parameters (see Using CNAM (Copy Number Analysis Method) Optimal Segmenting).

To determine if we can use a given number of segments, $k$, a permutation-testing-based comparison is done between each pair of adjacent segments. The permutation testing procedure is described in the next section. If the segments are significantly different for every pair of adjacent segments, $k$ segments are used. Otherwise, this same determination is tried for $k - 1$ segments, and so on. This process is started at the maximum number of segments allowed and continued down through $k = 2$.

It may turn out that this process continues through $k = 2$ without success. This means it is not possible to divide the observations into more than one segment, nor to get any cut-points (over this sub-region).

### Univariate Outlier Removal¶

The univariate outlier removal option addresses two problems. When a minimum number of markers per segment is specified, such as 10, a single-marker outlier could drag along another 9 markers to create a segment driven only by the one point. This would inflate the number of 10-marker segments, and would incorrectly place the boundaries of the copy number segment if the region was 9 markers or less in length. The second problem arises if the minimum number of markers per segment is set to 1: single-marker segments would be found, but those segments would prove not significant under permutation testing, and because the outliers reduced the sum of squared deviations from the mean more than some of the longer segments did, those longer (but significant) segments would be lost as the next lower cardinality splits were chosen. The underlying problem is that this maximum likelihood estimator, minimizing the sum of squared deviations from the mean, assumes normality, which does not hold in the presence of numerous outliers.

The outlier removal process applies only when the minimum number of markers per segment is set to 1. It starts by finding the maximum number of segments per window and then deletes the cut-points that bracket single-marker segments. Permutation testing is then done on the remaining segments that are longer than length 1 (including the outlier data as part of those larger segments), dropping down to lower cardinality if any pair of adjacent segments is not statistically different.

### Permutation Testing for Verifying Copy Number Segments¶

An optimal $k$-way split is one that defines $k$ segments such that the sum of squared deviations from the mean of each segment is minimized. In other words, there is no better placement of the cut-points that would result in a smaller sum of squared deviations of the data from the respective mean of the segment in which it falls. The challenge, given this optimal segmenting, is determining whether the resulting segments represent significant differences between pairwise adjacent segments.

Consider two adjacent segments. Both the left and right segments have means. The cut-point between them minimizes the sum of squared deviations of the data from the two respective means. If we were to just perform a T-test comparing the adjacent segments, we would need to adjust the resulting p-value with some kind of multiple testing correction for all of the possible cut-points that were searched through to find the best one. A Bonferroni adjustment would be too conservative due to correlation structure in the data. Instead we turn to permutation testing.

The key to performing permutation testing correctly is to randomize and perform the same procedure as was used to find the original statistic. This ends up being fairly straightforward.

Given $\alpha$, the max pairwise permuted p-value parameter, for each adjacent pair of segments we perform the following procedure:

• Calculate the original sum of squared deviations for the means of the two adjacent segments.
• Set $b = 0$.
• Do the following $N$ times:
• Randomly shuffle the marker label columns (i.e. multivariate columns are kept together).
• Find the optimal two-way split that minimizes the sum of squared deviations from the means within the random data.
• If the randomly computed sum of squared deviations is less than or equal to the original sum of squared deviations, set $b = b + 1$.
• The pairwise p-value is $p = (b + 1)/(N + 1)$.

If the permuted pairwise $p$ is less than or equal to $\alpha$ for all adjacent pairwise segments, all of the segments are statistically significant. Otherwise, we repeat the procedure with the optimal $(k - 1)$-way segmenting, until either we find all pairwise segments are significant, or all the data in a region represents a single segment (or possibly part of a larger segment if the moving window procedure is used).
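The pairwise permutation procedure can be sketched in a univariate setting. The number of permutations, the shuffle, and the `(b + 1)/(N + 1)` p-value estimator here are assumptions for illustration; the SVS code shuffles whole multivariate marker columns.

```python
import numpy as np

def best_split_sse(y):
    # Minimum total SSE over all single cut-point positions of y.
    best = np.inf
    for c in range(1, len(y)):
        left, right = y[:c], y[c:]
        best = min(best, ((left - left.mean()) ** 2).sum()
                         + ((right - right.mean()) ** 2).sum())
    return best

def pairwise_permutation_p(y, cut, n_perm=200, seed=1):
    """Permutation p-value for the cut-point between two adjacent
    segments: shuffle the markers, re-find the best two-way split, and
    count how often the shuffled fit is at least as good as observed."""
    y = np.asarray(y, dtype=float)
    observed = (((y[:cut] - y[:cut].mean()) ** 2).sum()
                + ((y[cut:] - y[cut:].mean()) ** 2).sum())
    rng = np.random.default_rng(seed)
    b = 0
    for _ in range(n_perm):
        if best_split_sse(rng.permutation(y)) <= observed:
            b += 1
    return (b + 1) / (n_perm + 1)
```

For a strong step signal the shuffled data almost never splits as cleanly as the original, so the p-value is small.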

Recent improvements in the permutation algorithm (versions 7.1 and higher) enable the use of all specified threads and an updated randomization procedure. The previous, potentially biased, naive randomization procedure was replaced with the unbiased Fisher and Yates method (also known as the Knuth shuffle) [Durstenfeld1964]. This modification, together with a set of 128 random number generators with unique keys run sequentially, allows up to 128 threads to perform permutation testing simultaneously, with the same segmentation results regardless of the number of threads specified. If a non-power-of-two number of threads is specified, it is rounded up or down to the nearest power of two, with an upper bound of 128. Due to differences in the randomization procedure, output can differ from previous versions for segments that fall on the borders of the p-value threshold. A single segmenting job will now utilize close to 100% CPU 100% of the time if all threads are selected, removing the need to run more than one job simultaneously on the same machine.
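The Durstenfeld in-place form of the Fisher and Yates shuffle mentioned above takes only a few lines; this is the textbook algorithm, not the SVS implementation.

```python
import random

def knuth_shuffle(items, rng=None):
    """Unbiased Fisher-Yates shuffle in Durstenfeld's in-place form:
    every one of the n! orderings is equally likely."""
    rng = rng or random.Random()
    a = list(items)
    for i in range(len(a) - 1, 0, -1):
        j = rng.randrange(i + 1)       # pick j uniformly from 0..i
        a[i], a[j] = a[j], a[i]        # swap item i with item j
    return a
```

Seeding the generator (as SVS does with its 128 keyed generators) makes the shuffle, and hence the permutation results, reproducible.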