2.9. Principal Component Analysis

2.9.1. Principal Component Analysis Overview

Correcting for Stratification

Sometimes finding an association can be confounded by population stratification. This is because a condition may be more prevalent in one group of people than in a different group, resulting in a spurious association between the condition or trait being tested for and any genetic characteristics which vary between the two different groups of people.

While it is good practice for studies to be based on as homogeneous a group of test subjects as possible, it has been noted in [Price2006] that even the mild variation in genetic characteristics among those who classify themselves as belonging to one ethnic group or another can be problematic enough to confound a study done over thousands of genetic markers.

Note

Hidden population stratification may be thought of as a non-zero F_{st} between unknown groupings of samples.

Please see Overview of F-Statistics for a further discussion of F-statistics.

Two methods are available in SVS for correcting for population stratification: Principal Component Analysis, described in the remainder of this section, and Genomic Control (see Correcting for Stratification by Genomic Control).

Correcting for Batch Effects and Other Measurement Errors

It is noted in [Price2006] that there is evidence that variations in test equipment can easily introduce bias into studies.

Additionally, with Copy Number Variation (CNV) studies, variation between batches (batch effects) may easily affect the association test results. Principal Component Analysis may be used to correct for measurement variations in the input data, as well as for population stratification.

Correction of Input Data by Principal Component Analysis

This technique was pioneered at the Broad Institute, which distributes a program called “EIGENSTRAT” implementing the PCA correction technique described in [Price2006]. A more thorough discussion of stratification, principal component analysis, and the eigenvalues involved may be found in [Patterson2006].

Note

The technique implemented in SVS uses methodology similar to that of the “EIGENSTRAT” program, with several enhancements.

Working Premise

If there is population stratification or there are batch effects, this pattern of variation over the samples should manifest within, or influence to a greater or lesser degree, the data of many different markers. By contrast, any association with the phenotype in question should manifest only over one or very few markers.

If one can find a way to remove any pattern(s) resulting from stratification or batch effects and test only the data without these patterns, the influence from the stratification or batch effects should be eliminated or minimized.

The PCA Technique

The object is to first determine the different patterns of variation over the subjects from within the markers, then to remove these patterns.

The following procedure is followed for each marker:

  1. The average value for the marker across all samples is determined.

  2. The average value is subtracted from all values for the marker to recenter the data about zero. The result is the total pattern of variation for the marker.

  3. This pattern may then be “normalized”, that is, divided by a value determined from the marker’s data. The possibilities for normalization in SVS are:

    • The theoretical standard deviation of the marker’s data at Hardy-Weinberg Equilibrium (HWE). This is what the standard deviation of the data for the marker would be over the entire population if the marker in the population were in HWE and had the same major and minor (or reference/alternate) allele frequencies as observed for the marker.

    • The actual standard deviation of the data for the marker.

    • No normalization.

These (possibly normalized) individual marker patterns over the n subjects are then “summed up” into an n \times n matrix, referred to as a “Wishart” or “Wishart-like” matrix, or just as X^TX.
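As a concrete (if simplified) illustration of the steps above, the following numpy sketch builds the Wishart matrix from additively coded genotypes, using the theoretical standard deviation under HWE, sqrt(2p(1-p)) for minor allele frequency p, as the normalization. The array layout, the mean-imputation of missing values, and the function name are illustrative assumptions, not SVS internals (SVS itself simply drops missing values here):

    import numpy as np

    def wishart_matrix(G):
        """G: markers-by-samples array of additively coded genotypes
        (0/1/2 = count of minor alleles; np.nan = missing).
        Returns the n x n "Wishart" matrix X^T X."""
        m, n = G.shape
        W = np.zeros((n, n))
        for g in G:                                # one marker at a time
            mu = np.nanmean(g)                     # 1. average over all samples
            x = np.where(np.isnan(g), mu, g) - mu  # 2. recenter about zero
            p = mu / 2.0                           # estimated minor allele frequency
            sd_hwe = np.sqrt(2.0 * p * (1.0 - p))  # 3. theoretical SD under HWE
            if sd_hwe > 0:
                x = x / sd_hwe
            W += np.outer(x, x)                    # sum marker patterns into X^T X
        return W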

Note

The effects of normalizing or not normalizing the pattern data are simply to weight the contributions of the various markers to the Wishart matrix differently.

The “eigenvectors” and “eigenvalues” of this matrix are called its “components”, and the “eigenvectors” corresponding with its largest “eigenvalues” are called its “principal components”.

It so happens that (according to the working premise of the “EIGENSTRAT” PCA technique) the first principal component, or the first few principal components, correspond directly to the stratification pattern(s), because those patterns of stratification manifest across so many markers.

Therefore, the final two steps of stratification correction through the “EIGENSTRAT” PCA technique are to find the top k principal components (where you select k) and to remove these components/patterns from both the marker data and the dependent variable using a vector-analysis-related technique, as sketched below.

For a more precise and “mathematical” explanation of this process, please see Formulas for Principal Component Analysis.
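Continuing the sketch above, removing the top k components amounts to projecting each marker’s (centered) data, and the dependent variable, onto the span of the top k eigenvectors and subtracting that projection; this mirrors the adjustment described in [Price2006], though SVS’s exact implementation may differ:

    import numpy as np

    def pca_correct(X, y, k):
        """X: markers-by-samples array, rows already centered (and
        possibly normalized); y: length-n phenotype vector.
        Returns copies of X and y with the top k components removed."""
        W = X.T @ X                        # the n x n Wishart matrix
        vals, vecs = np.linalg.eigh(W)     # eigenvalues in ascending order
        V = vecs[:, ::-1][:, :k]           # top k eigenvectors as columns
        P = V @ V.T                        # projection onto their span
        X_adj = X - X @ P                  # remove the patterns from each marker
        y_c = y - y.mean()
        y_adj = y_c - P @ y_c              # and from the dependent variable
        return X_adj, y_adj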

How Many Components to Use?

This is, to some degree, an open question, as it is partly a matter of personal preference. Note that if you were to choose as many components as there are markers, you would wind up subtracting out ALL effects, thus getting nothing from your tests. The maximum number of components that can be selected is the number of samples less one; if a larger number is requested, the number of principal components found will still be the number of samples less one.

The best answer starts with simply obtaining the components themselves and their corresponding eigenvalues for N-1 principal components, where N is the total number of samples in the dataset. This can be done either by running uncorrected tests or from the separate Principal Component Analysis windows in the Quality Assurance menu from a spreadsheet. See Genotypic Principal Component Analysis or Numeric Principal Component Analysis.

Then evaluate the pattern of the eigenvalues in the spreadsheet or plot the eigenvalue variable column. (Right-click on the column header and select Plot Variable.) If the first few are very large compared with the remaining eigenvalues, use that many components in a second analysis. Another way to determine the number of components to use is to examine cluster plots of the first few eigenvectors (component scores) plotted against each other, i.e. one versus two, two versus three, and so on. (See Multi-Color Scatter Plots for PCA or Gender Analysis.) When there are no longer discernible clusters of groups based on phenotype categories, that is the number of components to use for PCA correction.
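Outside of SVS, the same two inspections can be sketched with numpy and matplotlib, assuming the W matrix and V eigenvector columns from the earlier sketches and a phenotype vector for coloring; all names here are illustrative:

    import numpy as np
    import matplotlib.pyplot as plt

    vals = np.linalg.eigvalsh(W)[::-1]          # eigenvalues, largest first

    plt.figure()                                # scree-style eigenvalue plot
    plt.plot(np.arange(1, len(vals) + 1), vals, marker="o")
    plt.xlabel("Component number")
    plt.ylabel("Eigenvalue")
    plt.show()

    plt.figure()                                # cluster plot: component 1 vs 2
    plt.scatter(V[:, 0], V[:, 1], c=phenotype)
    plt.xlabel("Component 1")
    plt.ylabel("Component 2")
    plt.show()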

A Second Answer (for Genotypic Data Only)

The inflation factor can also be used to adjust for stratification. An inflation factor close to one would indicate stratification is not present in the data. The number of principal components can be determined by increasing or decreasing the number of components until the inflation factor is as close to one as possible. Using the inflation factor to determine the number of principal components to use is a variant of Genomic Control. See Correcting for Stratification by Genomic Control for more information.
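A hedged sketch of that search, with X and y as in the earlier sketches, reusing pca_correct from above and a hypothetical trend_test_chisq helper (not an SVS API) that returns one 1-df chi-squared statistic per marker:

    import numpy as np

    CHI2_MEDIAN_1DF = 0.4549364               # median of a 1-df chi-squared

    def inflation_factor(chisq_stats):
        lam = np.median(chisq_stats) / CHI2_MEDIAN_1DF
        return max(lam, 1.0)                  # below one is taken to be one

    best_k, best_lam = 0, float("inf")
    for k in range(1, 21):                    # candidate numbers of components
        X_adj, y_adj = pca_correct(X, y, k)
        lam = inflation_factor(trend_test_chisq(X_adj, y_adj))
        if abs(lam - 1.0) < abs(best_lam - 1.0):
            best_k, best_lam = k, lam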

Removing Outlier Subjects

If you are interested in what the principal components are, or wish to be more sure you have corrected for stratification through PCA, there is the option of repeating the determination of the principal components after removing patients or subjects from analysis who are found to have extreme values in one or more of the previously-determined principal components.

Golden Helix SVS automates this process. Select the number of principal components to include in this process, the number of standard deviations within which a subject's value must fall for each of these components, and the number of times to repeat the process. The final testing and principal components spreadsheets will avoid using any of the outlier subjects, which will be logged in a separate spreadsheet.
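A minimal numpy sketch of this iteration follows; the parameter names and defaults are illustrative, not SVS's:

    import numpy as np

    def remove_pca_outliers(X, k=10, sigmas=6.0, iterations=5):
        """Iteratively drop samples whose score on any of the top k
        components lies more than `sigmas` standard deviations from
        that component's mean. X: markers-by-samples, centered."""
        keep = np.arange(X.shape[1])          # indices of retained samples
        for _ in range(iterations):
            W = X[:, keep].T @ X[:, keep]     # recompute the components
            vals, vecs = np.linalg.eigh(W)
            V = vecs[:, ::-1][:, :k]          # top k component scores
            z = (V - V.mean(axis=0)) / V.std(axis=0)
            ok = (np.abs(z) <= sigmas).all(axis=1)
            if ok.all():
                break                         # no outliers remain
            keep = keep[ok]                   # drop (and log) the outliers
        return keep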

PCA-Corrected Association Testing

The only tests available, once you have corrected the input data for stratification using Principal Component Analysis, are the Correlation/Trend Test (genotypic Correlation/Trend Test and numeric Correlation/Trend Test) and Logistic/Linear Regression (genotypic Logistic/Linear Regression and numeric Linear/Logistic Regression).

(The exception to this is that for numeric association tests on a binary dependent variable, where the dependent variable is not corrected through PCA but only the predictors are corrected, the T-Test is additionally available.)

This is because the Correlation/Trend Test and Logistic/Linear Regression are based on numeric values for both the predictor and the response. Thus, these tests may still be run, even if genotype designations or case/control designations have been transformed from simple categorical numeric designations such as zero, one, or two to real numbers that may take on any necessary values.
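For intuition, a corrected trend statistic of the kind described in [Price2006] can be sketched as roughly (N - k - 1) times the squared correlation between the adjusted genotype and phenotype vectors, compared against a 1-df chi-squared distribution; this is a simplified reading, not SVS's exact formula:

    import numpy as np

    def corrected_trend_stat(g_adj, y_adj, k):
        """g_adj, y_adj: one marker's PCA-adjusted genotype values and
        the PCA-adjusted phenotype; k: number of components removed."""
        r = np.corrcoef(g_adj, y_adj)[0, 1]   # correlation of the residuals
        return (len(g_adj) - k - 1) * r * r   # ~ chi-squared with 1 df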

2.9.2. Genotypic Principal Component Analysis

Data Requirements

Genotypic principal components analysis requires a dataset containing genotypic data. First import genotypic data into an SVS project (see Importing Your Data Into A Project). Once a spreadsheet has been created in a project with genotypic data, access the Genotypic Principal Component Analysis options dialog by selecting Genotype > Genotypic Principal Component Analysis from the spreadsheet menu.

Before principal components can be computed for genotypic data, the allele classification must be specified. The options are to classify alleles based on allele frequency (major/minor alleles) or by reference/alternate alleles as indicated by a marker map field that specifies the reference allele. Based on the classification of alleles, a numeric equivalent to each genotype is established, depending on whether the additive, dominant, or recessive model is selected for the genetic model, as illustrated below. (The PCA technique is not available for the basic allelic test or the genotypic test, since establishing a numeric equivalent is not as straightforward, or even possible, for those models.)
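For illustration, the three models correspond to numeric recodings like the following sketch, where ‘D’ denotes the minor (or alternate) allele, matching the convention used later in this chapter:

    # Hypothetical numeric recodings for a genotype with minor allele D
    ADDITIVE  = {"dd": 0, "Dd": 1, "DD": 2}   # count of minor alleles
    DOMINANT  = {"dd": 0, "Dd": 1, "DD": 1}   # carries D vs. does not
    RECESSIVE = {"dd": 0, "Dd": 0, "DD": 1}   # two D's vs. fewer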

Note

It is common practice to inactivate markers that are known to have data quality issues before using principal components analysis.

Using the Genotypic Principal Components Analysis Window

Note

This window (see Genotype Principal Component Analysis) essentially accomplishes the functions of the PCA Parameters tab from the Genotype Association Tests dialog obtained through the Analysis menu when Correct for Stratification with PCA is selected on the Association Test Parameters tab. The only difference is that it is not necessary to simultaneously perform an association test to use the separate Genotype Principal Component Analysis window.

Figure: Genotype Principal Component Analysis

Processing

The principal components can be computed, or if they have already been computed for the dataset, the spreadsheet of principal components can be selected after selecting the “Use precomputed principal components” option. See Applying PCA to a Superset of Markers and Applying PCA to a Subset of Samples for specific limitations of this feature.

Select the PCA parameters - specifically, classification of alleles, the genetic model, maximum number of components to find and correct for, normalization method, spreadsheets to output, and whether to eliminate component outlier subjects. When PCA outlier removal is performed by recomputing components, there are selections for the number of times to recompute the components, the criteria for determining an outlier, and the number of components to remove outliers from. See Correction of Input Data by Principal Component Analysis for more information on the options on this dialog.

Select the Run button to perform the analysis.

When the analysis is complete, a message indicating the number of markers analyzed and the number of markers that were skipped will be appended to the Node Change Log for each output spreadsheet, and all spreadsheets selected for output will be opened.

Spreadsheet Outputs

The possible output spreadsheets are as follows:

  • The corrected input data. (Recall that genotypic data is first converted to numeric data by the selected genetic model.)

  • The principal components spreadsheet with rows according to the sample and columns according to the component. These components will be sorted by eigenvalue, large to small. Only the number of components requested will be shown.

  • The eigenvalue spreadsheet will simply show the eigenvalues from greatest to smallest (of the number of components requested).

  • If re-computation of components after removal of outliers was selected, and outliers were found, then a spreadsheet will be created to list these outliers and the iteration and component in which they were found.

Note

  1. If you wish to plot any outputs (such as the one column in the eigenvalue spreadsheet) select Plot Numeric Values from the Plot menu. The column of data will be plotted against the row labels or eigenvalue number.

  2. If you wish to plot a principal component eigenvector against another eigenvector, select XY Scatter Plots from the Plot menu. See Multi-Color Scatter Plots for PCA or Gender Analysis or N by N Scatter Plots for more information.

2.9.3. Numeric Principal Component Analysis

Data Requirements

Numeric principal components analysis requires a dataset containing numeric data. First import numeric data into an SVS project (see Importing Your Data Into A Project). Once a spreadsheet has been created in a project with numeric binary, integer, and/or real-valued data, access the Numeric Principal Component Analysis options dialog by selecting Numeric > Numeric Principal Component Analysis from the spreadsheet menu.

Note

It is common practice to inactivate markers known to have data quality issues before using principal components analysis.

Using the Numeric Principal Components Analysis Window

Note

This window (see Numeric Principal Component Analysis) essentially accomplishes the functions of the PCA Parameters tab from the Numeric Association Tests dialog obtained through the Analysis menu when Correct for Batch Effects/Stratification with PCA is selected on the Association Test Parameters tab. The only difference is that it is not necessary to simultaneously perform an association test to use the separate Numeric Principal Component Analysis window.

Figure: Numeric Principal Component Analysis

Processing

The principal components can be computed, or if they have already been computed for the dataset, the spreadsheet of principal components can be selected after selecting the “Use precomputed principal components” option. See Applying PCA to a Superset of Markers and Applying PCA to a Subset of Samples for specific limitations of this feature.

Select the PCA parameters - specifically, the maximum number of components to find and correct for, the spreadsheets to output, whether to center by marker or by sample or by both marker and sample, and whether to eliminate component outlier subjects. When PCA outlier removal is performed by recomputing components, there are selections for the number of times to recompute the components, the criteria for determining an outlier, and the number of components to remove outliers from.

For new projects, data centering by marker is normally recommended. Data centering by sample may be useful for components that are applied to chromosomes or regions other than those from which they are originally computed. For more information on these options, see Centering by Marker vs. Not Centering by Marker and Centering by Sample vs. Not Centering by Sample.
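In numpy terms, the centering choices correspond to subtracting row means, column means, or both, assuming a markers-by-samples array X (an illustrative layout):

    import numpy as np

    X_marker = X - X.mean(axis=1, keepdims=True)   # center by marker (rows)
    X_sample = X - X.mean(axis=0, keepdims=True)   # center by sample (columns)
    X_both = (X - X.mean(axis=1, keepdims=True)    # center by both
                - X.mean(axis=0, keepdims=True) + X.mean())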

See Correction of Input Data by Principal Component Analysis for more information on other options in this dialog.

Select the Run button to perform the analysis.

When the analysis is complete, a message indicating the number of components found and the number of predictors analyzed will be appended to the Node Change Log for each output spreadsheet. All spreadsheets selected for output will be opened.

Spreadsheet Outputs

The possible output spreadsheets are as follows:

  • The corrected input data.

  • The principal components spreadsheet with rows according to the sample and columns according to the component. These components will be sorted by eigenvalue, large to small. Only the number of components requested will be shown.

  • The eigenvalue spreadsheet will simply show the eigenvalues from greatest to smallest (of the number of components requested).

  • If recomputing components after removal of outliers was selected, and outliers were found, then a spreadsheet will be created to list these outliers as well as the iteration and component in which they were found.

Note

  1. If you wish to plot any outputs (such as the one column in the eigenvalue spreadsheet) select Plot Numeric Values from the Plot menu. The column of data will be plotted against the row labels or eigenvalue number.

  2. If you wish to plot a principal component eigenvector against another eigenvector, select XY Scatter Plots from the Plot menu. See Multi-Color Scatter Plots for PCA or Gender Analysis or N by N Scatter Plots for more information.

2.9.4. Finding Principal Components When the Sample Size is Large

When the sample size exceeds roughly 4,000, finding the principal components using Genotypic Principal Component Analysis or Numeric Principal Component Analysis will show a noticeable delay: 14,000 samples take on the order of 12 minutes of principal component processing, and 28,000 samples take more than an hour. There is also an ultimate sample-size limit for these features to function at all, which is just over 40,000; at that size, total processing can take several hours.

To address this issue, a feature has been added that will find a few principal components when the sample size is large or very large. This new feature has been used on datasets containing up to 400,000 samples.

This feature may be accessed from either a spreadsheet with mapped genotypic data or from a spreadsheet with mapped numeric data. In either case, use Genotype > Compute Large Data PCA from the spreadsheet menu. The appropriate user interface will come up according to your data type.

Basic Concept

As indicated in The PCA Technique, obtaining principal components from genotypic or numeric data requires two phases: computing the Wishart matrix from the (possibly normalized) marker patterns, then finding the eigenvalues and eigenvectors of that matrix.

For large-data PCA, (1) the Wishart matrix is not only generated, but also output, (2) the Wishart matrix is stored on the hard drive, with no attempt made to hold the entire matrix in RAM, and (3) only the eigenvalues and eigenvectors that are requested are actually computed. For the details of how these requested eigenvalues and eigenvectors are computed, please see Large Kinship Matrices or Large Numbers of Samples.

Data Requirements for Genotypic Data

For large-data principal components analysis on genotypic data, the data requirements are the same as those stated in Data Requirements, except that (1) only mapped genotypic data will be used, and (2) allele classification will always be based on allele frequency (major vs. minor allele).

Using the Large Data PCA Window for Genotypic Data

The following window (Principal Component Analysis of Genotypes–Large Data Approach) displays when you request Genotype > Compute Large Data PCA with genotypic data.

Figure: Principal Component Analysis of Genotypes–Large Data Approach

Step 1: Compute the Wishart Matrix

If you need to compute the Wishart Matrix, leave STEP 1: COMPUTE THE WISHART MATRIX checked. In this left-hand box, select the following:

  • PCA Marker Normalization: whether and how the data is normalized (re-scaled) before it is added into the Wishart Matrix.

    • Theoretical standard deviation under HWE What the standard deviation of the marker data, as numerically recoded, would be, based on the marker’s allele frequencies, assuming the marker is in Hardy-Weinberg equilibrium.

    • Actual standard deviation The actual standard deviation of the marker data as numerically recoded.

    • Apply no normalization Just use the numerically-recoded data as is.

  • Genetic Model for PCA What numeric recoding to use for the genotypic data. In the following, ‘d’ is the major allele, while ‘D’ is the minor allele.

    • Additive model: (dd)->(Dd)->(DD)

    • Dominant model: (DD,Dd) vs. (dd)

    • Recessive model: (DD) vs. (Dd, dd)

  • Impute missing genotypic data as:

    • Homozygous major allele Basically, this substitutes zero for missing values in the numeric recoding.

    • Numerically as average value The non-missing numerically recoded data is averaged over the current marker, and that average is substituted for every missing value in the current marker.

Note

In Genotypic Principal Component Analysis and Numeric Principal Component Analysis, missing values are simply dropped. This may produce some differences in results when there is missing data.

Note

This step will run faster when you have more RAM dedicated to this process. Specifically, use Transpose and analysis memory usage under Tools > Global Product Options to change this setting. (This option must be accessed when you are not in any project.)

Meanwhile, if you already have a pre-computed Wishart Matrix, uncheck STEP 1: COMPUTE THE WISHART MATRIX and, under STEP 2: FIND THE PRINCIPAL COMPONENTS, check Use Pre-Computed Wishart Matrix and select your pre-computed Wishart Matrix.

Step 2: Find the Principal Components

Note

If you only need to pre-compute a Wishart Matrix for later use, uncheck STEP 2: FIND THE PRINCIPAL COMPONENTS.

This step does the actual eigenvalue and eigenvector computation. An approach based on an algorithm specified in [Halko2010] is used. For the algorithm details, please see Large Kinship Matrices or Large Numbers of Samples.
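For intuition, a randomized subspace-iteration eigensolver in the spirit of [Halko2010] can be sketched as follows; the function and parameter names are illustrative, and this is not SVS's exact implementation:

    import numpy as np

    def top_eigs_randomized(W, k, oversample=10, iters=4, seed=12345678):
        """Approximate the top-k eigenpairs of the symmetric matrix W."""
        rng = np.random.default_rng(seed)
        n = W.shape[0]
        Q = rng.standard_normal((n, k + oversample))  # random test matrix
        for _ in range(iters):                        # subspace iteration
            Q, _ = np.linalg.qr(W @ Q)                # re-orthonormalize
        B = Q.T @ W @ Q                               # small projected problem
        vals, U = np.linalg.eigh(B)
        idx = np.argsort(vals)[::-1][:k]              # largest eigenvalues first
        return vals[idx], (Q @ U)[:, idx]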

  • Use Pre-Computed Wishart Matrix As stated above, if the Wishart Matrix has already been computed and you wish to use it, uncheck STEP 1: COMPUTE THE WISHART MATRIX and check this box, selecting the Wishart Matrix you wish to use.

  • Number of Principal Components / Find __ Principal Components Select how many principal components (eigenvectors and eigenvalues) to compute. This algorithm is optimized for computing just a few of the eigenvector/eigenvalue pairs that have the highest eigenvalues. The remaining eigenvector/eigenvalue pairs do not explicitly enter into (or bog down) the computation.

  • Halko Algorithm Random Number Initialization The [Halko2010] algorithm incorporates pseudo-random numbers that need to be generated. This pseudo-random number generator has to be initialized with a “seed” value from which it can generate the pseudo-random numbers.

    • Fixed random seed (reproducible results) This is the default choice. With this choice, one fixed random seed (specifically 12345678) is always used. If the program must be re-run, using this choice a second time (with all other parameters being the same) will reproduce the exact same results as the original run.

    • Dynamic random seed (investigate precision) Use this choice (on a second or later run) to obtain a gauge of algorithm accuracy.

      “Dynamic” refers to the fact that a different random seed is used every time a run is made using this choice. Due to the differing random seeds, the results will differ between runs, both between a “Fixed” run and any “Dynamic” run, and between any two “Dynamic” runs. These differences are a gauge of the actual algorithm accuracy, which will be proportional to these differences. As a rule of thumb, the actual error will almost always be within 10 times the amount of the differences between the “Fixed” results and a set of “Dynamic” results or between two sets of “Dynamic” results.
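Using the sketch above, the fixed-versus-dynamic comparison might look like the following, with the factor of 10 being the rule of thumb quoted above:

    import numpy as np

    fixed_vals, _ = top_eigs_randomized(W, k=15, seed=12345678)  # "Fixed" run
    dyn_vals, _ = top_eigs_randomized(W, k=15, seed=None)        # "Dynamic" run

    diff = np.abs(fixed_vals - dyn_vals)
    print("rule-of-thumb error bound per eigenvalue:", 10 * diff)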

Note

The components that are more “Principal”, that is, those that have the higher eigenvalues, will always have a higher level of accuracy with this algorithm. For instance, the default Number of Principal Components of 15 is meant to achieve a good level of accuracy for the highest 10 components, while only achieving a reasonable level of accuracy for the other 5 components.

Spreadsheet Outputs

The possible output spreadsheets are as follows:

  • The Wishart Matrix spreadsheet (output when STEP 1 is run).

  • The principal components spreadsheet, with rows according to the sample and columns according to the component, sorted by eigenvalue from largest to smallest.

  • The eigenvalue spreadsheet, listing the eigenvalues from greatest to smallest (of the number of components requested).

Data Requirements for Numeric Data

For large-data principal components analysis on numeric data, the data requirements are the same as those stated in Data Requirements, except that only mapped numeric data will be used for principal components analysis.

Also note that this feature will always center the data by marker (and not center the data by sample).

Using the Large Data PCA Window for Numeric Data

The following window (Principal Component Analysis of Numeric Data–Large Data Approach) displays when you request Genotype > Compute Large Data PCA with numeric data.

Figure: Principal Component Analysis of Numeric Data–Large Data Approach

Step 1: Compute the Wishart Matrix

Step 1 for numeric data is almost, but not quite, the same as Step 1 for genotypic data.

If you need to compute the Wishart Matrix, leave STEP 1: COMPUTE THE WISHART MATRIX checked. In this left-hand box, select the following:

  • PCA Marker Normalization: whether and how the data is normalized (re-scaled) before it is added into the Wishart Matrix.

    • Theoretical standard deviation under HWE (*) What the standard deviation of the data would be if it were genotypic marker data numerically recoded using the additive model: the major and minor allele frequencies are derived under that assumption, and the standard deviation is then computed as if a marker with those frequencies were in Hardy-Weinberg equilibrium.

    Note

    Choosing this normalization will require the input data to be integer (or binary) and only have values of 0, 1, or 2.

    • Actual standard deviation The actual standard deviation of the numeric data.

    • Apply no normalization Just use the numeric data as is.

    Note

    Any numeric data will be considered valid if either of the above two normalization options (Actual standard deviation or Apply no normalization) is chosen.

    Note

    Apply no normalization will process the data the same as would Numeric Principal Component Analysis using Center data by marker (and not using Center data by sample).

  • Impute missing data as:

    • Zero (homozygous major allele *) This substitutes zero for missing values in the numeric data.

    • Numerically as average value The non-missing data is averaged over the current position/marker, and that average is substituted for every missing value in the current position/marker.

Note

In Genotypic Principal Component Analysis and Numeric Principal Component Analysis, missing values are simply dropped. This may produce some differences in results when there is missing data.

Note

This step will run faster when you have more RAM dedicated to this process. Specifically, use Transpose and analysis memory usage under Tools > Global Product Options to change this setting. (This option must be accessed when you are not in any project.)

Meanwhile, if you already have a pre-computed Wishart Matrix, uncheck STEP 1: COMPUTE THE WISHART MATRIX and, under STEP 2: FIND THE PRINCIPAL COMPONENTS, check Use Pre-Computed Wishart Matrix and select your pre-computed Wishart Matrix.

Step 2: Find the Principal Components

Note

If you only need to pre-compute a Wishart Matrix for later use, uncheck STEP 2: FIND THE PRINCIPAL COMPONENTS.

Step 2 for numeric data, which does the actual eigenvalue and eigenvector computation, is exactly the same as Step 2 for genotypic data. Please see Step 2: Find the Principal Components.

Spreadsheet Outputs

The possible output spreadsheets are as follows:

  • The Wishart Matrix spreadsheet (output when STEP 1 is run).

  • The principal components spreadsheet, with rows according to the sample and columns according to the component, sorted by eigenvalue from largest to smallest.

  • The eigenvalue spreadsheet, listing the eigenvalues from greatest to smallest (of the number of components requested).

2.9.5. Correcting for Stratification by Genomic Control

This somewhat older method, pioneered by Devlin and Roeder [DevlinAndRoeder1999], notes that when association tests are confounded by stratification, the distribution of their chi-squared statistics will be more “spread out” than it should be, resulting in a higher median than that of a true chi-squared distribution. Several models exist for how much the distribution should be spread out, depending on the test type, but the distribution will usually be uniformly inflated by a certain “inflation factor” \lambda.

This technique of Genomic Control measures the “inflation factor” \lambda by taking the median of the chi-squared statistics from an actual test done over a set of markers from the study in question, and dividing this median by the median of the corresponding (ideal) chi-squared distribution. If the result is less than one, the distribution is considered close enough to ideal and \lambda is taken to be one.

Then, Genomic Control applies its correction by dividing the actual association test chi-square statistic results by this \lambda, thus possibly making these results appropriately more pessimistic.
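A minimal numpy/scipy sketch of the measurement and correction just described, assuming 1-df chi-squared statistics:

    import numpy as np
    from scipy.stats import chi2

    def genomic_control(chisq_stats):
        """Estimate lambda from the observed statistics and return the
        corrected statistics and their p-values."""
        lam = np.median(chisq_stats) / chi2.ppf(0.5, df=1)  # ideal median ~0.4549
        lam = max(lam, 1.0)            # less than one is taken to be one
        corrected = np.asarray(chisq_stats) / lam
        return lam, corrected, chi2.sf(corrected, df=1)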

Two approaches exist for this:

  • Measure the “inflation factor” \lambda over a set of markers designed to indicate population stratification. Then, use this \lambda on the actual association test (presumably done for just a few candidate markers). (For studies over a small number of markers.)

  • Measure the “inflation factor” \lambda over the actual association tests being done. Then afterward, use this \lambda on all chi-square results so obtained. (For whole-genome scans or a large number of markers.)

Both of these approaches are available in SVS, through the Genotype > Genotype Association Tests dialog in the Analysis menu.

From the Genotype Association Tests dialog, select Show Inflation Factor (Lambda), Chi-Squares, and Corrected Values to find inflation factors (\lambda) and the results of applying the Genomic Control technique on chi-squares, p-values, Bonferroni-adjusted p-values, and False Discovery Rates.

Note

The inflation factor will be displayed in the Node Change Log for the Association Results spreadsheet.

From the Genotype Association Tests dialog, select Correct Using This Inflation Factor (Lambda) Instead: and enter a \lambda value to use as an “inflation factor” that was determined from a previous association test run or from other data previously analyzed.

Note

The inflation factor relates to the chi-square statistic. After a chi-square statistic has been corrected through Genomic Control, the normal procedure for finding the approximate p-value is still followed. If there had been “inflation”, the Genomic Control-corrected p-value will be pushed closer to one.