3.7. Formulas for Principal Component Analysis¶
This technique was pioneered at the Broad Institute, which distributes a program called “EIGENSTRAT” which implements this technique. They describe the PCA correction technique in [Price2006]. A more thorough discussion of stratification, principal components analysis and the eigenvalues involved may be found in [Patterson2006].
Golden Helix SVS currently uses PCA with association tests performed against genotypes (using their numeric equivalent values) (Genotypic Principal Component Analysis) and with association tests performed against numerical data (Numeric Principal Component Analysis).
Suppose is the (marker) by (subject or sample) matrix of either numeric values or the numeric equivalent values of the genotypes, with the exception that for each marker, what had been the average value over that marker has first been subtracted out from the marker’s values (data centering by marker).
In our notation, a spreadsheet in SVS on which PCA may be performed may be thought of as , since such a spreadsheet suggests an (subject or sample) by (marker) matrix.
We would like to decompose into as a singular value decomposition, where is an by matrix for which , is an by matrix for which , and is an by singular-value matrix. For this decomposition, may be thought of as consisting of principal components or “ancestries” relating the samples to the singular values, while the column of (where and ) may be thought of as containing the “coordinate value” of each marker along the principal component.
We do this by finding the eigenvalues and eigenvectors of . (This matrix is like a Wishart matrix. See Further Motivation: Relation to the Variance-Covariance Matrix.) Rewriting in terms of the singular value decomposition, we have
where is an by matrix containing the eigenvalues of on its diagonal and zeroes elsewhere. Right-hand multiplying the above by gives us
Since is a matrix of the eigenvectors (principal components) of , the above implies, for through , that
where are the eigenvalues of , and the vectors are the columns of and are also the eigenvectors of .
As stated above, the principal components may be thought of as “ancestry” vectors. The most “important” ones correspond to the highest values of . Our object will be to subtract these most important out from the rest of the data, so that we may more effectively test for associations with markers that specifically relate to our phenotype, rather than with markers that relate to population ancestry.
Matrix is not unique, except that the first columns have to be eigenvectors of corresponding to the eigenvalues . As stated above, the element of the column of may be thought of as the relative magnitude of the principal component in making up the row of .
However, SVS does not actually need to compute any part of matrix at all. The procedure outlined in Applying the PCA Correction is dependent only upon the input data and upon the (from matrix ), and is otherwise self-contained.
In actual fact, SVS normalizes by the number of markers, . The corresponding decomposition of becomes , and is the matrix which contains the eigenvalues that we can obtain and output from SVS.
3.7.2. Alternate Motivation¶
Remember that is the (marker) by (subject or sample) matrix of the numeric equivalent values of the genotypes, with the average values over the markers subtracted out (data centering by marker). (By “numeric equivalent”, we mean (assuming the additive genetic model) zero for a homozygous major allele genotype, one for a heterozygous genotype, and two for a homozygous minor allele genotype.)
In order to have a better idea about what the data is like, and to make it easier to describe our data graphically if we wish to, we take the columns of matrix , which we will call , and find the linear combinations of these,
for , which have the largest variances, and thus will best “explain” the data. We do this through finding the eigenvalues and eigenvectors of the matrix , which approximates the matrix of the (sample) variances/covariances of the columns of , then picking the eigenvectors corresponding to the largest eigenvalues of .
What we call the “eigenvectors” of are the vectors such that and
where is the “eigenvalue” associated with eigenvector .
(The eigenvectors and eigenvalues for a matrix may be found by the two-step process of first, solving the characteristic equation
for , then second, for each eigenvalue , solving
for the eigenvector . Standard matrix software may be used to perform these operations in the most optimized manner.)
We note that the variance of is approximated by
and so picking the eigenvectors with the highest eigenvalues through gives us the most variance “explained” by through . These eigenvectors with the highest eigenvalues we call the “principal components” of matrix .
In order to visually inspect some of the relationships in the data, we could plot the top three principal components (eigenvectors through corresponding to the highest three eigenvalues through ) against each other in a scatterplot, and observe any clustering of data points that might appear in the scatterplot.
From a “conventional PCA” point of view, what we are doing is using the genetic samples/subjects as various “measurements”, while using the genetic markers as “samples” concerning these “measurements”. This makes sense because we are trying to find a population-structure-type relationship among the subjects (“measurements”) by going through many genetic markers (“samples”) to see what the relationship among the subjects (“measurements”) is.
This also explains why PCA discussions usually use matrix as an (genetic marker) by (subject/sample) matrix, rather than a (subject/sample) by (genetic marker) matrix.
Now that we can distill the essence of the population structure into principal components, where can be whatever number we need, we can subtract this structure essence from our numerically-converted genotypic data, leaving behind more subtle trends that may have associations with one or more phenotypic traits.
When accumulating the elements of for genotypic data, the marker data may be optionally weighted in favor of those markers with less variance by normalizing them according to their variance, either theoretical or actual. See Formulas for PCA Normalization of Genotypic Data regarding marker normalizing for genotypic data.
Otherwise, the technique is to:
Scan the markers and obtain the Wishart-like matrix.
Find the desired number of eigenvectors and eigenvalues from .
If PCA-corrected testing is being performed, apply the PCA corrections, first from the dependent variable (for genotypic data or if correcting the dependent variable) and then during the testing scan from each marker of the genetic data just before it is tested.
3.7.4. Applying the PCA Correction¶
Suppose is a vector containing the dependent variable (response) values or numeric-equivalent values. Exercising data centering by marker, we subtract the average element value from each element to get ; then,
where . Similarly,
where , etc., up to
Note that the values are obtained through dot products, where component wise,
The corrected is obtained (as a part of data centering by marker) by adding back the average element value of the original to each element of .
The idea is for each principal component, we find what magnitude of it is present in (or )–then we subtract that magnitude of that component from to get .
Exactly the same technique is applied for every vector of data from an individual marker. After subtracting the average element value to get , we compute
where . Similarly,
where , etc., up to
Component-wise, we have
We finally get the corrected back (as a part of data centering by marker) by adding back the average element value of the original to each element of .
3.7.5. Formulas for PCA Normalization of Genotypic Data¶
The possibilities for marker normalization are:
The theoretical standard deviation of this marker’s data at Hardy-Weinberg equilibrium (HWE). This means the standard deviation of this marker’s data would have to be over the population if, in the population, it were in Hardy-Weinberg equilibrium and had the same major and minor allele frequencies as are actually measured for this marker. This is the standard method used by the EIGENSTRAT program.
This theoretical standard deviation is as follows according to the genetic model being used:
Additive (the only model supported by EIGENSTRAT itself):
is a Bayesian estimate of the minor allele probability in terms of the numeric marker values .
is a Bayesian estimate of the probability of being a DD or Dd genotype.
is a Bayesian estimate of the probability of being a DD genotype.
The actual standard deviation of this marker’s data.
Numeric data is not normalized in SVS.
3.7.6. Further Motivation: Relation to the Variance-Covariance Matrix¶
For the purposes of PCA correction in SVS, the marker data may be treated as if it comes from a multivariate normal distribution, where the data from marker , , is picked from a random variable (where is equal to the number of samples). If indeed has a multivariate normal distribution, then there is a random vector whose components are independent standard normal random variables, a vector , and a by matrix such that .
The vector is the expected value of , and the matrix is the variance-covariance matrix of the components .
Thus, the vectors in matrix may be treated as coming from the expression
where are elements in matrix .
Let us examine . One element of may be written as
The term expands out to
Expanding the first term above, we have
We may thus write
where m is the number of markers.
However, since the components of the random variable from which is assumed to be drawn are independent standard normal random variables, we may make the following estimates:
Thus, we may estimate the elements and of as
thus, we have the following estimate:
where is the covariance matrix of the multivariate distribution of and is an outer product of the sample averages.
is a Wishart matrix based on . The distribution of this random matrix is a generalization of the chi-square distribution.
3.7.7. Centering by Marker vs. Not Centering by Marker¶
In the EIGENSTRAT technique (Formulas for Principal Component Analysis), the data is centered by marker–that is, the average of the data within a given marker is first subtracted out from each of that marker’s data values both before they are used for the matrix and before they are PCA-corrected, with the averages being added in after correction.
An alternative approach, available in SVS for numeric or copy number data, is to not center by marker, but instead allow the overall average value of each marker to be a part of the principal components analysis.
Two cases can be made that this approach is preferable to centering by marker for segmenting copy number data. First, if a variation in the data due to a batch effect is “corrected out” using centering by marker, a slight displacement over the remaining markers will ensue. If this displacement is large enough, it might be interpreted as a variation large enough to require a segment boundary, even over samples where there should not be a segment boundary.
Second, leaving the marker average in the principal components analysis may result in overall data smoothing and removing unnecessary segment boundaries.
On the other hand, a true copy-number variation over enough samples may influence the marker average, which, under this approach, could be picked up as a large component or as large parts of two or three large components by principal components analysis. This might result in apparent variations large enough for segment boundaries for samples for which there were no actual variations. However, the good news is that the apparent variations would be in the opposite direction from the true copy-number variations.
Regarding data centering by marker, it should be noted that after the marker averages are subtracted out, we have , where is the element of corresponding to the -th marker and the -th sample. Since
for any pair of samples and , we have
Multiplying a constant vector into the above will therefore obtain a vector with zero for every element, demonstrating that when data centering by marker is used, one of the eigenvalues is zero, corresponding to an “average vector” of constant elements.
If data is not centered by marker, but instead marker averages are incorporated into the components, this zero eigenvalue will not be present–instead, if the data’s averages are indeed not zero, one “extra” eigenvalue will show up. Often, this will be among the larger eigenvalues, before the eigenvalues start becoming very similar to eigenvalues (with one smaller subscript) that occur when data centering by marker is used.
3.7.8. Centering by Sample vs. Not Centering by Sample¶
A second alternative approach to PCA, available in SVS for numeric or copy number data, is to center by sample, either in addition to centering by marker or instead of centering by marker. This entails effectively subtracting out the sample average before contributing to the matrix (and adding it back in while correcting). (In SVS, if centering by both sample and marker, the overall average is subtracted out only once.)
Whether it is appropriate to subtract out sample averages and add them back in after correction, rather than including the sample averages as part of what may be corrected, depends upon how much of the batch effect or stratification may be built into the various sample averages and their products vs. built into the variation across both markers and samples.
An example of when centering by sample can be useful is when obtaining components of copy number data from chromosomes 1 through 22 and then applying these to copy number data for chromosomes X and Y (See Applying PCA to a Superset of Markers). Centering by sample insures that the different averages of X chromosome intensities between men and women and the different averages of Y chromosome intensities between men and women will be preserved through PCA correction.
3.7.9. Applying PCA to a Superset of Markers¶
Sometimes, it is desirable to obtain principal components for a given set of samples from one set of markers, and then apply these components to a different set of markers. One example is, within a given set of samples, obtaining the components from a set of markers with a known population structure and then applying these to genotypic data over other markers. Another is the example mentioned above, which is to, for a given set of samples, obtain components of copy number data from chromosomes 1 through 22 and then apply these to copy number data for chromosomes X and Y.
Whether applying PCA to a superset of markers is reasonable to do is governed by whether we can assume that the behavior of the data over the different set of markers, especially the parts of the data contributing to population structure or batch effects, will be governed by essentially the same variance-covariance matrix (and the same set of sample averages, unless data centering by sample is being used) that the original data is governed by. Under this assumption, the same eigenvectors that describe the original data’s population structure or batch effects will also govern, and may be used to remove, the “new” data’s population structure or batch effects.
3.7.10. Applying PCA to a Subset of Samples¶
It is also sometimes desirable to obtain components for one set of samples, then use these components to correct data from a subset of these samples, either for the same markers or for a superset of markers. To do this, a subset of the components is first created to match the data’s sample subset, then are applied to the data.
Here again, if we are correcting over any superset of the markers from which we obtained principal components, we assume consistency in the variance-covariance matrix (and the sample averages, if applicable) over this superset of markers.
However, now we must look at whether two components which are orthogonal in the main sample set are still orthogonal in the sample subset. Suppose and are two components of the full set, where and are in the original sample set but not the sample subset, and and are in the sample subset. (Without loss of generality, we are writing the vector as if all of the elements in the subset follow after all of the elements not in the subset.) Because these are eigenvectors of , they will be orthogonal. That is, their dot product will be zero.
This dot product may be written . We see that if the individual dot products, both outside of and inside of the subset, are zero, then the full dot product will be zero. Thus, components that are orthogonal in the subset and its complement will be orthogonal for the overall set.
Unfortunately, the converse is not generally true. It is true that because the two dot products must sum to zero, any variations of dot products for the subset are perfectly compensated for by corresponding variations in the opposite direction in the product components over data that has not been subset out. On that basis, we could try to say that neither dot product “should” deviate too much from zero, and thus that the components as input “should” be more or less orthogonal over the subset, especially if the subset is of a reasonably large size.
However, because the subset components are not precisely orthogonal except in special cases, a modified Gram-Schmidt process is applied by SVS to make the components that belong to the subset orthonormal (both orthogonal and normalized). Effectively, this is equivalent to “obtaining a PCA correction” for every component based upon the previous components , , …, up to .