Principal Component Analysis

Principal Component Analysis Overview

Correcting for Stratification

Tests for association can be confounded by population stratification. A condition may be more prevalent in one group of people than in another, resulting in a spurious association between the condition or trait being tested for and any genetic characteristics that vary between the two groups.

While it is good practice for studies to be based on as homogeneous a group of test subjects as possible, it has been noted in [Price2006] that even mild variation in genetic characteristics among those who classify themselves as belonging to one ethnic group or another can be problematic enough to confound a study done over thousands of genetic markers.

Note

Hidden population stratification may be thought of as a non-zero F_{st} between unknown groupings of samples.

Please see Overview of F-Statistics for a further discussion of F-statistics.

Two methods are available in SVS for correcting for population stratification: Principal Component Analysis (see Correction of Input Data by Principal Component Analysis) and Genomic Control (see Correcting for Stratification by Genomic Control).

Correcting for Batch Effects and Other Measurement Errors

It is noted in [Price2006] that there is evidence that variations in test equipment can easily introduce bias into studies.

Additionally, with Copy Number Variation (CNV) studies, variation between batches (batch effects) may easily affect the association test results. Principal Component Analysis may be used to correct for measurement variations in the input data, as well as for population stratification.

Correction of Input Data by Principal Component Analysis

This technique was pioneered at the Broad Institute, which distributes a program called “EIGENSTRAT” that implements the PCA correction technique described in [Price2006]. A more thorough discussion of stratification, principal component analysis, and the eigenvalues involved may be found in [Patterson2006].

Note

The technique implemented in SVS uses methodology similar to that of the “EIGENSTRAT” program, with several enhancements.

Working Premise

If population stratification or batch effects are present, the resulting pattern of variation over the samples should manifest within, or influence to a greater or lesser degree, the data of many different markers. By contrast, any association with the phenotype in question should manifest over only one or very few markers.

If one can find a way to remove any pattern(s) resulting from stratification or batch effects and test only the data without these patterns, the influence from the stratification or batch effects should be eliminated or minimized.

The PCA Technique

The object is to first determine the different patterns of variation over the subjects from within the markers, then to remove these patterns.

The following procedure is followed for each marker:

  1. The average value for the marker across all samples is determined.
  2. The average value is subtracted from all values for the marker to recenter the data about zero. The result is the total pattern of variation for the marker.
  3. This pattern might be “normalized” or divided by a value determined from the marker’s data. The possibilities for normalization in SVS are:
    • The theoretical standard deviation of the marker’s data at Hardy-Weinberg Equilibrium (HWE). This is what the standard deviation of the data for the marker would be over the entire population if the marker in the population were in HWE and had the same major and minor (or reference/alternate) allele frequencies as observed for the marker.
    • The actual standard deviation of the data for the marker.
    • No normalization.

These (possibly normalized) individual marker patterns over the n subjects are then “summed up” into an n \times n matrix, referred to as a “Wishart” or “Wishart-like” matrix, or just as X^TX.
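
As an illustration of steps 1 through 3 and of how the per-marker patterns are accumulated into the Wishart-like matrix, here is a minimal sketch in Python (using NumPy), assuming an additive 0/1/2 coding in an m-marker by n-sample NumPy array. It is not SVS's implementation, and SVS's frequency estimates and missing-data handling may differ:

    import numpy as np

    def wishart_matrix(G, normalization="hwe"):
        """Build the n x n Wishart-like matrix X^T X from an m x n array
        of additive genotype codes (0/1/2). Illustrative sketch only."""
        m, n = G.shape
        X = np.empty((m, n), dtype=float)
        for j in range(m):
            g = G[j].astype(float)
            mu = g.mean()                          # step 1: marker average
            x = g - mu                             # step 2: recenter about zero
            if normalization == "hwe":
                p = mu / 2.0                       # allele frequency under additive coding
                sd = np.sqrt(2.0 * p * (1.0 - p))  # theoretical SD at HWE
            elif normalization == "actual":
                sd = g.std()                       # actual SD of the marker's data
            else:
                sd = 1.0                           # no normalization
            X[j] = x / sd if sd > 0 else 0.0       # step 3: optional normalization
        return X.T @ X                             # "sum up" the marker patterns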

Note

The effects of normalizing or not normalizing the pattern data are simply to weight the contributions of the various markers to the Wishart matrix differently.

The “eigenvectors” and “eigenvalues” of this matrix are called its “components”, and the “eigenvectors” corresponding with its largest “eigenvalues” are called its “principal components”.

It so happens that (according to the working premise of the “EIGENSTRAT” PCA technique) the first principal component, or the first few principal components, correspond directly to the stratification pattern(s), because these patterns of stratification are prevalent across many markers.

Therefore, the final two steps of stratification correction through the “EIGENSTRAT” PCA technique are to find the top k principal components (where you select k) and to remove these components (patterns) from both the marker data and the dependent variable using a technique related to vector analysis.
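
A minimal sketch of this removal step, assuming X is the centered (and possibly normalized) m x n matrix from the previous sketch and y is the dependent variable, might look like the following; the projection shown is one standard way to subtract the top k components and is not necessarily SVS's exact procedure:

    import numpy as np

    def remove_top_components(X, y, k):
        """Subtract the projections onto the top-k principal components
        from each marker's data and from the dependent variable."""
        evals, evecs = np.linalg.eigh(X.T @ X)   # eigen-decompose the Wishart-like matrix
        order = np.argsort(evals)[::-1]          # sort eigenvalues, large to small
        U = evecs[:, order[:k]]                  # top-k components (n x k)
        X_adj = X - (X @ U) @ U.T                # remove the patterns from the markers
        y_adj = y - U @ (U.T @ y)                # and from the dependent variable
        return X_adj, y_adj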

For a more precise and “mathematical” explanation of this process, please see Formulas for Principal Component Analysis.

How Many Components to Use?

This is, to some degree, an open question, as it is a matter of personal preference. Note that if you choose as many components as there are markers, you will wind up subtracting out ALL effects, thus getting nothing from your tests. The maximum number of components that can be selected is the number of samples less one. If a larger number is selected, the number of principal components found will still be the number of samples less one.

The best answer consists of first simply obtaining the components themselves and their corresponding eigenvalues for N-1 principal components, where N is the total number of samples in the dataset. This can be done either by running uncorrected tests or by using the separate Principal Component Analysis windows in the Quality Assurance menu from a spreadsheet. See Genotypic Principal Component Analysis or Numeric Principal Component Analysis.

Then evaluate the pattern of the eigenvalues in the spreadsheet, or plot the eigenvalue variable column. (Right-click on the column header and select Plot Variable.) If the first few eigenvalues are very large compared with the rest, use that many components in a second analysis. Another way to determine the number of components to use is to examine cluster plots of the first few components (eigenvectors) plotted against each other, i.e. one versus two, two versus three, and so on. (See Multi-Color Scatter Plots for PCA or Gender Analysis.) When there are no longer discernible clusters of groups based on phenotype categories, that is the number of components to use for PCA correction.
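
For example, a quick scree-style inspection of the eigenvalues could be done outside of SVS as follows (a hedged sketch; the file name eigenvalues.txt is a hypothetical one-column export of the eigenvalue spreadsheet):

    import numpy as np
    import matplotlib.pyplot as plt

    evals = np.loadtxt("eigenvalues.txt")        # hypothetical one-column export
    plt.plot(np.arange(1, len(evals) + 1), evals, marker="o")
    plt.xlabel("Component number")
    plt.ylabel("Eigenvalue")
    plt.title("Look for where the eigenvalues level off")
    plt.show()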

A Second Answer (for Genotypic Data Only)

The inflation factor can also be used to adjust for stratification. An inflation factor close to one would indicate stratification is not present in the data. The number of principal components can be determined by increasing or decreasing the number of components until the inflation factor is as close to one as possible. Using the inflation factor to determine the number of principal components to use is a variant of Genomic Control. See Correcting for Stratification by Genomic Control for more information.
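
As a hedged sketch of this search, the loop below chooses the number of components whose inflation factor is closest to one; run_corrected_test is a hypothetical stand-in for rerunning the PCA-corrected association test with k components and returning the per-marker 1-df chi-square statistics:

    import numpy as np
    from scipy.stats import chi2

    def inflation_factor(chisq_1df):
        """lambda = median observed chi-square / median of the ideal
        1-df chi-square distribution; values below one are taken as one."""
        return max(np.median(chisq_1df) / chi2.ppf(0.5, df=1), 1.0)

    def choose_k_by_lambda(run_corrected_test, max_k=20):
        """run_corrected_test(k) is a hypothetical callable that reruns the
        PCA-corrected test with k components and returns chi-square statistics."""
        return min(range(max_k + 1),
                   key=lambda k: abs(inflation_factor(run_corrected_test(k)) - 1.0))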

Removing Outlier Subjects

If you are interested in what the principal components are, or wish to be more certain you have corrected for stratification through PCA, there is the option of repeating the determination of the principal components after removing from the analysis any patients or subjects found to have extreme values in one or more of the previously determined principal components.

Golden Helix SVS automates this process. Select the number of principal components that should be involved, the number of standard deviations within which a subject must fall in each of these components, and the number of times to repeat the process. The final testing and principal components spreadsheets will exclude any outlier subjects, which are logged in a separate spreadsheet.
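
A minimal sketch of this iterative outlier removal, assuming a markers-by-samples matrix X that has already been centered and normalized (the 6-standard-deviation and 5-iteration defaults below are illustrative, not SVS's defaults):

    import numpy as np

    def remove_pca_outliers(X, n_components=10, sigma=6.0, iterations=5):
        """Repeatedly recompute components and drop any sample lying more
        than `sigma` standard deviations from the mean in any of the
        first `n_components` components. Returns the retained sample indices."""
        keep = np.arange(X.shape[1])
        for _ in range(iterations):
            evals, evecs = np.linalg.eigh(X[:, keep].T @ X[:, keep])
            U = evecs[:, np.argsort(evals)[::-1][:n_components]]
            z = (U - U.mean(axis=0)) / U.std(axis=0)     # standardize each component
            ok = (np.abs(z) <= sigma).all(axis=1)
            if ok.all():
                break                                    # no outliers found this pass
            keep = keep[ok]
        return keep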

PCA-Corrected Association Testing

The only tests available, once you have corrected the input data for stratification using Principal Component Analysis, are the Correlation/Trend Test (genotypic Correlation/Trend Test and numeric Correlation/Trend Test) and Logistic/Linear Regression (genotypic Logistic/Linear Regression and numeric Linear/Logistic Regression).

(The exception to this is that for numeric association tests on a binary dependent variable, where the dependent variable is not corrected through PCA but only the predictors are corrected, the T-Test is additionally available.)

This is because the Correlation/Trend Test and Logistic/Linear Regression are based on numeric values for both the predictor and the response. Thus, these tests may still be run, even if genotype designations or case/control designations have been transformed from simple categorical numeric designations such as zero, one, or two to real numbers that may take on any necessary values.
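
For instance, a 1-degree-of-freedom correlation/trend statistic can still be computed on the adjusted, real-valued data. The sketch below uses the Armitage-style form N·r² (some formulations also subtract the k removed components and one from N); it is a hedged illustration, not SVS's exact formula:

    import numpy as np
    from scipy.stats import chi2

    def trend_chisq_on_adjusted(g_adj, y_adj, k=0):
        """Correlation/trend statistic on PCA-adjusted genotype and
        phenotype values, treated as an approximately 1-df chi-square."""
        n = len(y_adj)
        r = np.corrcoef(g_adj, y_adj)[0, 1]
        stat = (n - k - 1) * r * r if k else n * r * r
        return stat, chi2.sf(stat, df=1)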

Genotypic Principal Component Analysis

Data Requirements

Genotypic principal components analysis requires a dataset containing genotypic data. First import genotypic data into an SVS project (see Importing Your Data Into A Project). Once a spreadsheet has been created in a project with genotypic data, access the Genotypic Principal Component Analysis options dialog by selecting Genotype > Genotypic Principal Component Analysis from the spreadsheet menu.

Before principal components can be computed for genotypic data, the allele classification must be specified. The options are to classify alleles based on allele frequency (major/minor alleles) or by reference/alternate alleles as indicated by a marker map field that specifies the reference allele. Based on the classification of alleles, a numeric equivalent for each genotype is established, depending on whether the additive, dominant, or recessive model is selected as the genetic model. (The PCA technique is not available for the basic allelic test or the genotypic test, since establishing a numeric equivalent is not as straightforward, or even possible, for those models.)
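
The numeric recoding itself is straightforward; a hedged sketch (assuming the minor or alternate allele is the one counted, and ignoring missing calls) is shown below:

    def encode_genotype(n_counted_alleles, model="additive"):
        """Numeric equivalent of a genotype given the count (0, 1, or 2) of
        the minor or alternate allele, under the selected genetic model."""
        if model == "additive":
            return n_counted_alleles
        if model == "dominant":
            return 1 if n_counted_alleles >= 1 else 0
        if model == "recessive":
            return 1 if n_counted_alleles == 2 else 0
        raise ValueError("no single numeric code for the basic allelic or genotypic test")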

Note

It is common practice to inactivate markers that are known to have data quality issues before using principal components analysis.

Using the Genotypic Principal Components Analysis Window

Note

This window (see Genotype Principal Component Analysis) essentially accomplishes the functions of the PCA Parameters tab from the Genotype Association Tests dialog obtained through the Analysis menu when Correct for Stratification with PCA is selected on the Association Test Parameters tab. The only difference is that it is not necessary to simultaneously perform an association test to use the separate Genotype Principal Component Analysis window.

Figure: Genotype Principal Component Analysis

Processing

The principal components can be computed, or, if they have already been computed for the dataset, an existing spreadsheet of principal components can be used by selecting the “Use precomputed principal components” option. See Applying PCA to a Superset of Markers and Applying PCA to a Subset of Samples for specific limitations of this feature.

Select the PCA parameters - specifically, classification of alleles, the genetic model, maximum number of components to find and correct for, normalization method, spreadsheets to output, and whether to eliminate component outlier subjects. When PCA outlier removal is performed by recomputing components, there are selections for the number of times to recompute the components, the criteria for determining an outlier, and the number of components to remove outliers from. See Correction of Input Data by Principal Component Analysis for more information on the options on this dialog.

Select the Run button to perform the analysis.

When the analysis is complete, a message indicating the number of markers analyzed and the number of markers that were skipped will be appended to the Node Change Log for each output spreadsheet, and all spreadsheets selected for output will be opened.

Spreadsheet Outputs

The possible output spreadsheets are as follows:

  • The corrected input data. (Recall that genotypic data is first converted to numeric data by the selected genetic model.)
  • The principal components spreadsheet, with rows corresponding to samples and columns corresponding to components. These components will be sorted by eigenvalue, from largest to smallest. Only the number of components requested will be shown.
  • The eigenvalue spreadsheet will simply show the eigenvalues from greatest to smallest (of the number of components requested).
  • If re-computation of components after removal of outliers was selected, and outliers were found, then a spreadsheet will be created to list these outliers and the iteration and component in which they were found.

Note

  1. If you wish to plot any outputs (such as the one column in the eigenvalue spreadsheet) select Plot Numeric Values from the Plot menu. The column of data will be plotted against the row labels or eigenvalue number.
  2. If you wish to plot a principal component eigenvector against another eigenvector, select XY Scatter Plots from the Plot menu. See Multi-Color Scatter Plots for PCA or Gender Analysis or N by N Scatter Plots for more information.

Numeric Principal Component Analysis

Data Requirements

Numeric principal components analysis requires a dataset containing numeric data. First import numeric data into an SVS project (see Importing Your Data Into A Project). Once a spreadsheet has been created in a project with numeric binary, integer, and/or real-valued data, access the Numeric Principal Component Analysis options dialog by selecting Numeric > Numeric Principal Component Analysis from the spreadsheet menu.

Note

It is common practice to inactivate markers known to have data quality issues before using principal components analysis.

Using the Numeric Principal Components Analysis Window

Note

This window (see Numeric Principal Component Analysis) essentially accomplishes the functions of the PCA Parameters tab from the Numeric Association Tests dialog obtained through the Analysis menu when Correct for Batch Effects/Stratification with PCA is selected on the Association Test Parameters tab. The only difference is that it is not necessary to simultaneously perform an association test to use the separate Numeric Principal Component Analysis window.

Figure: Numeric Principal Component Analysis

Processing

The principal components can be computed, or, if they have already been computed for the dataset, an existing spreadsheet of principal components can be used by selecting the “Use precomputed principal components” option. See Applying PCA to a Superset of Markers and Applying PCA to a Subset of Samples for specific limitations of this feature.

Select the PCA parameters - specifically, the maximum number of components to find and correct for, the spreadsheets to output, whether to center by marker or by sample or by both marker and sample, and whether to eliminate component outlier subjects. When PCA outlier removal is performed by recomputing components, there are selections for the number of times to recompute the components, the criteria for determining an outlier, and the number of components to remove outliers from.

For new projects, data centering by marker is normally recommended. Data centering by sample may be useful for components that are applied to chromosomes or regions other than those from which they are originally computed. For more information on these options, see Centering by Marker vs. Not Centering by Marker and Centering by Sample vs. Not Centering by Sample.
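
The centering choices amount to subtracting row means, column means, or both from the markers-by-samples matrix. A hedged illustration follows (random data for demonstration only; the "both" case shown uses ordinary double centering, which may differ from SVS's implementation):

    import numpy as np

    rng = np.random.default_rng(0)
    D = rng.normal(size=(100, 20))                       # 100 markers x 20 samples

    by_marker = D - D.mean(axis=1, keepdims=True)        # subtract each marker's mean
    by_sample = D - D.mean(axis=0, keepdims=True)        # subtract each sample's mean
    by_both = (D - D.mean(axis=1, keepdims=True)
                 - D.mean(axis=0, keepdims=True) + D.mean())   # double centering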

See Correction of Input Data by Principal Component Analysis for more information on other options in this dialog.

Select the Run button to perform the analysis.

When the analysis is complete, a message indicating the number of components found and the number of predictors analyzed will be appended to the Node Change Log for each output spreadsheet. All spreadsheets selected for output will be opened.

Spreadsheet Outputs

The possible output spreadsheets are as follows:

  • The corrected input data.
  • The principal components spreadsheet, with rows corresponding to samples and columns corresponding to components. These components will be sorted by eigenvalue, from largest to smallest. Only the number of components requested will be shown.
  • The eigenvalue spreadsheet will simply show the eigenvalues from greatest to smallest (of the number of components requested).
  • If recomputing components after removal of outliers was selected, and outliers were found, then a spreadsheet will be created to list these outliers as well as the iteration and component in which they were found.

Note

  1. If you wish to plot any outputs (such as the one column in the eigenvalue spreadsheet) select Plot Numeric Values from the Plot menu. The column of data will be plotted against the row labels or eigenvalue number.
  2. If you wish to plot a principal component eigenvector against another eigenvector, select XY Scatter Plots from the Plot menu. See Multi-Color Scatter Plots for PCA or Gender Analysis or N by N Scatter Plots for more information.

Correcting for Stratification by Genomic Control

This somewhat older method, pioneered by Devlin and Roeder [DevlinAndRoeder1999], notes that when association tests are confounded by stratification, the chi-square distribution of their statistics will be more “spread out” than it should be. The result is a higher median than that of a true chi-square distribution. Several models exist for how much the distribution should be spread out, depending on the test type, but the distribution will usually be uniformly spread out by a certain “inflation factor” \lambda.

This technique of Genomic Control measures this “inflation factor” \lambda by taking the median of the distribution of the chi-square statistic from results of an actual test done over a set of markers from the study in question, and dividing this median by the median of the corresponding (ideal) chi-square distribution. If the result is less than one, the distribution is considered close enough to ideal and \lambda is taken to be one.

Then, Genomic Control applies its correction by dividing the actual association test chi-square statistic results by this \lambda, thus possibly making these results appropriately more pessimistic.
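
A minimal sketch of the whole Genomic Control computation for 1-df chi-square statistics (an illustration of the method as described above, not SVS's code):

    import numpy as np
    from scipy.stats import chi2

    def genomic_control(chisq, df=1):
        """Compute lambda as median(observed) / median(ideal chi-square),
        floor it at one, divide each statistic by lambda, and recompute p-values."""
        chisq = np.asarray(chisq, dtype=float)
        lam = max(np.median(chisq) / chi2.ppf(0.5, df=df), 1.0)
        corrected = chisq / lam
        return lam, corrected, chi2.sf(corrected, df=df)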

Two approaches exist for this:

  • Measure the “inflation factor” \lambda over a set of markers designed to indicate population stratification. Then, use this \lambda on the actual association test (presumably done for just a few candidate markers). (For studies over a small number of markers.)
  • Measure the “inflation factor” \lambda over the actual association tests being done. Then afterward, use this \lambda on all chi-square results so obtained. (For whole-genome scans or a large number of markers.)

Both of these approaches are available in SVS, through the Genotype > Genotype Association Tests dialog in the Analysis menu.

From the Genotype Association Tests dialog, select Show Inflation Factor (Lambda), Chi-Squares, and Corrected Values to find inflation factors (\lambda) and the results of applying the Genomic Control technique on chi-squares, p-values, Bonferroni-adjusted p-values, and False Discovery Rates.

Note

The inflation factor will be displayed in the Node Change Log for the Association Results spreadsheet.

From the Genotype Association Tests dialog, select Correct Using This Inflation Factor (Lambda) Instead: and enter a \lambda value, determined from a previous association test run or from other previously analyzed data, to use as the “inflation factor”.

Note

The inflation factor relates to the chi-square statistic. After a chi-square statistic has been corrected through Genomic Control, the normal procedure for finding the approximate p-value is still followed. If there was “inflation”, the Genomic Control-corrected p-value will be pushed closer to one.