# Principal Component Analysis¶

## Principal Component Analysis Overview¶

### Correcting for Stratification¶

Sometimes finding an association can be confounded by population stratification. This is because a condition may be more prevalent in one group of people than in a different group, resulting in a spurious association between the condition or trait being tested for and any genetic characteristics which vary between the two different groups of people.

While it is good practice for studies to be based on as homogeneous a group of test subjects as possible, it has been noted in [Price2006] that even the mild variation in genetic characteristics among those who classify themselves as belonging to one ethnic group or another can be problematic enough to confound a study done over thousands of genetic markers.

Note

Hidden population stratification may be thought of as a non-zero between unknown groupings of samples.

Please see *Overview of F-Statistics* for a further discussion of
F-statistics.

Two methods are available in SVS for correction for population stratification:

- Correction of Input Data by Principal Component Analysis, and
- Correction of Output Chi-Square Values and P-Values by Genomic
Control (see
*Correcting for Stratification by Genomic Control*)

### Correcting for Batch Effects and Other Measurement Errors¶

It is noted in [Price2006] that there is evidence that variations in test equipment can easily introduce bias into studies.

Additionally, with Copy Number Variation (CNV) studies, variation between batches (batch effects) may easily affect the association test results. Principal Component Analysis may be used to correct for measurement variations in the input data, as well as for population stratification.

### Correction of Input Data by Principal Component Analysis¶

This technique, pioneered at the Broad Institute, distributes a program called “EIGENSTRAT” which implements the PCA correction technique, as described in [Price2006]. A more thorough discussion of stratification, principal component analysis, and the eigenvalues involved may be found in [Patterson2006].

Note

The technique implemented in SVS uses similar methodology as in the “EIGENSTRAT” program, with several enhancements.

#### Working Premise¶

If there is population stratification or there are batch effects, this pattern of variation over the samples should manifest within, or influence to a greater or lesser degree, the data of many different markers. By contrast, any association with the phenotype in question should manifest only over one or very few markers.

If one can find a way to remove any pattern(s) resulting from stratification or batch effects and test only the data without these patterns, the influence from the stratification or batch effects should be eliminated or minimized.

#### The PCA Technique¶

The object is to first determine the different patterns of variation over the subjects from within the markers, then to remove these patterns.

The following procedure is followed for each marker:

- The average value for the marker across all samples is determined.
- The average value is subtracted from all values for the marker to recenter the data about zero. The result is the total pattern of variation for the marker.
- This pattern might be “normalized” or divided by a value determined
from the marker’s data. The possibilities for normalization in
SVS are:
- The theoretical standard deviation of the marker’s data at Hardy-Weinberg Equilibrium (HWE). This is what the standard deviation of the data for the marker would be over the entire population if the marker in the population were in HWE and had the same major and minor (or reference/alternate) allele frequencies as observed for the marker.
- The actual standard deviation of the data for the marker.
- No normalization.

These (possibly normalized) individual marker patterns over the subjects are then “summed up” into an matrix, referred to as a “Wishart” or “Wishart-like” matrix, or just as .

Note

The effects of normalizing or not normalizing the pattern data are simply to weight the contributions of the various markers to the Wishart matrix differently.

The “eigenvectors” and “eigenvalues” of this matrix are called its “components”, and the “eigenvectors” corresponding with its largest “eigenvalues” are called its “principal components”.

It so happens that (according to the working premise of the “EIGENSTRAT” PCA technique) the first principal component or the first few principal components correspond directly to the stratification pattern(s), based on the prevalence of this pattern or these patterns of stratification over so many markers.

Therefore, the final two steps of stratification correction through the “EIGENSTRAT” PCA technique are to find the top principal components (where you select ) and remove these component/patterns from both the marker data and the dependent variable using a vector-analysis related technique.

For a more precise and “mathematical” explanation of this process,
please see *Formulas for Principal Component Analysis*.

#### How Many Components to Use?¶

This is, to some degree, an open question as it is a matter of personal preference. Note that if you choose as many components as there are markers, you will wind up subtracting out ALL effects, thus getting nothing from your tests. The maximum number of markers that can be selected is the number of samples less one. If a larger number is selected, then the number of principal components found will still be the number of samples less one.

The best answer consists of first simply obtaining the components
themselves and their corresponding eigenvectors for
principal components, where is the total number of samples
in the dataset. This can be done by either running uncorrected tests or
from the separate Principal Component Analysis windows in the Quality
Assurance menu from a spreadsheet. See *Genotypic Principal Component Analysis* or *Numeric Principal Component Analysis*.

Then evaluate the pattern of the eigenvalues in the spreadsheet or plot
the eigenvalue variable column. (Right-click on the column header and
select Plot Variable.) If the first few are very large compared with the
remaining eigenvalues, then use that many components in a second
analysis. Another way to determine the number of components to use is to
examine cluster plots of the first few eigenvalues plotted against each
other, i.e. one versus two, two versus three, and so on. (See
*Multi-Color Scatter Plots for PCA or Gender Analysis*.) When there are no longer discernible clusters of
groups based on phenotype categories, then that is the number of
components to use for PCA correction.

#### A Second Answer (for Genotypic Data Only)¶

The inflation factor can also be used to adjust for stratification. An
inflation factor close to one would indicate stratification is not
present in the data. The number of principal components can be
determined by increasing or decreasing the number of components until
the inflation factor is as close to one as possible. Using the inflation
factor to determine the number of principal components to use is a
variant of Genomic Control. See *Correcting for Stratification by Genomic Control* for more
information.

#### Removing Outlier Subjects¶

If you are interested in what the principal components are, or wish to be more sure you have corrected for stratification through PCA, there is the option of repeating the determination of the principal components after removing patients or subjects from analysis who are found to have extreme values in one or more of the previously-determined principal components.

Golden Helix SVS automates this process. Select the number of principal components that should be involved in this process, how many standard deviations in all of these components a patient or subject should be within, and how many times to repeat this process. The final testing and principal components spreadsheet will avoid using any of the outlier subjects, which will be logged in a separate spreadsheet.

#### PCA-Corrected Association Testing¶

The only tests available, once you have corrected the input data for
stratification using Principal Component Analysis, are the
**Correlation/Trend Test** (genotypic *Correlation/Trend Test* and numeric
*Correlation/Trend Test*) and **Logistic/Linear Regression** (genotypic
*Logistic/Linear Regression* and numeric *Linear/Logistic Regression*).

(The exception to this is that for numeric association tests
on a binary dependent variable, where the dependent variable is not
corrected through PCA but only the predictors are corrected, the
**T-Test** is additionally available.)

This is because the **Correlation/Trend Test** and **Logistic/Linear
Regression** are based on numeric values for both the predictor and
the response. Thus, these tests may still be run, even if genotype
designations or case/control designations have been transformed from
simple categorical numeric designations such as zero, one, or two to
real numbers that may take on any necessary values.

## Genotypic Principal Component Analysis¶

### Data Requirements¶

Genotypic principal components analysis requires a dataset containing
genotypic data. First import genotypic data into an SVS
project (see *Importing Your Data Into A Project*). Once a spreadsheet has been created in a
project with genotypic data, access the Genotypic Principal Component
Analysis options dialog by selecting **Genotype** >
**Genotypic Principal Component Analysis** from the spreadsheet menu.

Before principal components can be computed for genotypic data, the allele classification must be specified. The options are to classify alleles based on allele frequency (major/minor alleles) or by reference/alternate alleles as indicted by a marker map field that indicates the reference allele. Based on the classification of alleles a numeric equivalent to each genotype is established, depending on whether the additive, dominant, or recessive model is selected for the genetic model. (The PCA technique is not available for the basic allelic test or the genotypic test, since establishing a numeric equivalent is not as straightforward or even possible for those models.)

Note

It is common practice to inactivate markers that are known to have data quality issues before using principal components analysis.

### Using the Genotypic Principal Components Analysis Window¶

Note

This window (see *Genotype Principal Component Analysis*) essentially accomplishes
the functions of the PCA Parameters tab from the Genotype Association
Tests dialog obtained through the Analysis menu when **Correct for
Stratification with PCA** is selected on the Association Test
Parameters tab. The only difference is that it is not necessary to
simultaneously perform an association test to use the separate
Genotype Principal Component Analysis window.

### Processing¶

The principal components can be computed, or if they have already been
computed for the dataset, the spreadsheet of principal components can be
selected after selecting the “Use precomputed principal components”
option. See *Applying PCA to a Superset of Markers* and *Applying PCA to a Subset of Samples* for specific
limitations of this feature.

Select the PCA parameters - specifically, classification of alleles,
the genetic model, maximum
number of components to find and correct for, normalization method,
spreadsheets to output, and whether to eliminate component outlier
subjects. When PCA outlier removal is performed by recomputing
components, there are selections for the number of times to recompute
the components, the criteria for determining an outlier, and the number
of components to remove outliers from. See *Correction of Input Data by Principal Component Analysis* for more
information on the options on this dialog.

Select the **Run** button to perform the analysis.

When the analysis is complete, a message indicating the number of markers analyzed and the number of markers that were skipped will be appended to the Node Change Log for each output spreadsheet, and all spreadsheets selected for output will be opened.

### Spreadsheet Outputs¶

The possible output spreadsheets are as follows:

- The corrected input data. (Recall that genotypic data is first converted to numeric data by the selected genetic model.)
- The principal components spreadsheet with rows according to the sample and columns according to the component. These components will be sorted by eigenvalue, large to small. Only the number of components requested will be shown.
- The eigenvalue spreadsheet will simply show the eigenvalues from greatest to smallest (of the number of components requested).
- If re-computation of components after removal of outliers was selected, and outliers were found, then a spreadsheet will be created to list these outliers and the iteration and component in which they were found.

Note

- If you wish to plot any outputs (such as the one column in the eigenvalue
spreadsheet) select
**Plot Numeric Values**from the**Plot**menu. The column of data will be plotted against the row labels or eigenvalue number. - If you wish to plot a principal component eigenvector against another eigenvector,
select
**XY Scatter Plots**from the**Plot**menu. See*Multi-Color Scatter Plots for PCA or Gender Analysis*or*N by N Scatter Plots*for more information.

## Numeric Principal Component Analysis¶

### Data Requirements¶

Numeric principal components analysis requires a dataset containing
numeric data. First import numeric data into an SVS project
(see *Importing Your Data Into A Project*). Once a spreadsheet has been created in a project with
numeric binary, integer, and/or real-valued data, access the Numeric
Principal Component Analysis options dialog by selecting **Numeric** >
**Numeric Principal Component Analysis** from the
spreadsheet menu.

Note

It is common practice to inactivate markers known to have data quality issues before using principal components analysis.

### Using the Numeric Principal Components Analysis Window¶

Note

This window (see *Numeric Principal Component Analysis*) essentially
accomplishes the functions of the PCA Parameters tab from the Numeric
Association Tests dialog obtained through the Analysis menu when
**Correct for Batch Effects/Stratification with PCA** is selected on
the Association Test Parameters tab. The only difference is that it
is not necessary to simultaneously perform an association test to use
the separate Numeric Principal Component Analysis window.

### Processing¶

The principal components can be computed, or if they have already been
computed for the dataset, the spreadsheet of principal components can be
selected after selecting the “Use precomputed principal components”
option. See *Applying PCA to a Superset of Markers* and *Applying PCA to a Subset of Samples* for specific
limitations of this feature.

Select the PCA parameters - specifically, the maximum number of components to find and correct for, the spreadsheets to output, whether to center by marker or by sample or by both marker and sample, and whether to eliminate component outlier subjects. When PCA outlier removal is performed by recomputing components, there are selections for the number of times to recompute the components, the criteria for determining an outlier, and the number of components to remove outliers from.

For new projects, data centering by marker is normally recommended. Data
centering by sample may be useful for components that are applied to
chromosomes or regions other than those from which they are originally
computed. For more information on these options, see *Centering by Marker vs. Not Centering by Marker*
and *Centering by Sample vs. Not Centering by Sample*.

See *Correction of Input Data by Principal Component Analysis* for more information on other options in this
dialog.

Select the **Run** button to perform the analysis.

When the analysis is complete, a message indicating the number of components found and the number of predictors analyzed will be appended to the Node Change Log for each output spreadsheet. All spreadsheets selected for output will be opened.

### Spreadsheet Outputs¶

The possible output spreadsheets are as follows:

- The corrected input data. (Recall that genotypic data is first converted to numeric data by the selected genetic model.)
- The principal components spreadsheet with rows according to the sample and columns according to the component. These components will be sorted by eigenvalue, large to small. Only the number of components requested will be shown.
- The eigenvalue spreadsheet will simply show the eigenvalues from greatest to smallest (of the number of components requested).
- If recomputing components after removal of outliers was selected, and outliers were found, then a spreadsheet will be created to list these outliers as well as the iteration and component in which they were found.

Note

- If you wish to plot any outputs (such as the one column in the
eigenvalue spreadsheet) select
**Plot Numeric Values**from the**Plot**menu. The column of data will be plotted against the row labels or eigenvalue number. - If you wish to plot a principal component eigenvector against
another eigenvector, select
**XY Scatter Plots**from the**Plot**menu. See*Multi-Color Scatter Plots for PCA or Gender Analysis*or*N by N Scatter Plots*for more information.

## Correcting for Stratification by Genomic Control¶

This somewhat older method, pioneered by Devlin and Roeder [DevlinAndRoeder1999], notes that the chi-squared distribution of statistics from association tests being confounded by stratification will be more “spread out” than they should be. The result is a higher median than the median of a true chi-square distribution. Several models exist for how much the distribution should be spread out, depending on the test type, but the distribution will usually be uniformly spread out by a certain “inflation factor” .

This technique of Genomic Control measures this “inflation factor” by taking the median of the distribution of the chi-square statistic from results of an actual test done over a set of markers from the study in question, and dividing this median by the median of the corresponding (ideal) chi-square distribution. If the result is less than one, the distribution is considered close enough to ideal and is taken to be one.

Then, Genomic Control applies its correction by dividing the actual association test chi-square statistic results by this , thus possibly making these results appropriately more pessimistic.

Two approaches exist for this:

- Measure the “inflation factor” over a set of markers designed to indicate population stratification. Then, use this on the actual association test (presumably done for just a few candidate markers). (For studies over a small number of markers.)
- Measure the “inflation factor” over the actual association tests being done. Then afterward, use this on all chi-square results so obtained. (For whole-genome scans or a large number of markers.)

Both of these approaches are available in SVS, through the
**Genotype** > **Genotype Association Tests** dialog in the Analysis
menu.

From the Genotype Association Tests dialog, select **Show Inflation
Factor (Lambda), Chi-Squares, and Corrected Values** to find inflation
factors () and the results of applying the Genomic
Control technique on chi-squares, p-values, Bonferroni-adjusted
p-values, and False Discovery Rates.

Note

The inflation factor will be displayed in the Node Change Log for the Association Results spreadsheet.

From the Genotype Association Tests dialog, select **Correct Using This
Inflation Factor (Lambda) Instead:** and enter a value
to use as an “inflation factor” that was determined from a previous
association test run or from other data previously analyzed.

Note

The inflation factor relates to the chi-square statistic. After a chi-square statistic has been corrected through Genomic Control, the normal procedure for finding the approximate p-value is still followed. If there had been “inflation”, the Genomic Control-corrected p-value will be pushed closer to one.