3.7. Formulas for Principal Component Analysis¶
This technique was pioneered at the Broad Institute, which distributes a program called “EIGENSTRAT” that implements it. They describe the PCA correction technique in [Price2006]. A more thorough discussion of stratification, principal components analysis and the eigenvalues involved may be found in [Patterson2006].
Note
Golden Helix SVS currently uses PCA both with association tests performed against genotypes, using their numeric equivalent values (Genotypic Principal Component Analysis), and with association tests performed against numerical data (Numeric Principal Component Analysis).
Suppose $X$ is the $m$ (marker) by $n$ (subject or sample) matrix of either numeric values or the numeric equivalent values of the genotypes, with the exception that for each marker, what had been the average value over that marker has first been subtracted out from the marker's values (data centering by marker).
Note
In our notation, a spreadsheet in SVS on which PCA may be performed may be thought of as $X^T$, since such a spreadsheet suggests an $n$ (subject or sample) by $m$ (marker) matrix.
3.7.1. Motivation¶
We would like to decompose $X$ into

$$X = U S V^T$$

as a singular value decomposition, where $U$ is an $m$ by $m$ matrix for which $U^T U = I$, $V$ is an $n$ by $n$ matrix for which $V^T V = I$, and $S$ is an $m$ by $n$ singular-value matrix.
For this decomposition, $V$ may be thought of as consisting of principal components or “ancestries” relating the samples to the singular values, while the $k$-th column of $U$ (where $1 \le k$ and $k \le n$) may be thought of as containing the “coordinate value” of each marker along the $k$-th principal component.
We do this by finding the eigenvalues and eigenvectors of $X^T X$. (This matrix is like a Wishart matrix. See Further Motivation: Relation to the Variance-Covariance Matrix.) Rewriting $X^T X$ in terms of the singular value decomposition, we have

$$X^T X = V S^T U^T U S V^T = V (S^T S) V^T = V \Lambda V^T ,$$

where $\Lambda = S^T S$ is an $n$ by $n$ matrix containing the eigenvalues of $X^T X$ on its diagonal and zeroes elsewhere. Right-hand multiplying the above by $V$ gives us

$$X^T X V = V \Lambda .$$
Since $V$ is a matrix of the eigenvectors (principal components) of $X^T X$, the above implies, for $k = 1$ through $n$, that

$$X^T X v_k = \lambda_k v_k ,$$

where $\lambda_k$ are the eigenvalues of $X^T X$, and the vectors $v_k$ are the columns of $V$ and are also the eigenvectors of $X^T X$.
As stated above, the principal components $v_k$ may be thought of as “ancestry” vectors. The most “important” ones correspond to the highest values of $\lambda_k$. Our object will be to subtract these most important $v_k$ out from the rest of the data, so that we may more effectively test for associations with markers that specifically relate to our phenotype, rather than with markers that relate to population ancestry.
Note
Matrix $U$ is not unique, except that its first $n$ columns have to be eigenvectors of $X X^T$ corresponding to the eigenvalues $\lambda_1$ through $\lambda_n$. As stated above, the $j$-th element of the $k$-th column of $U$ may be thought of as the relative magnitude of the $k$-th principal component in making up the $j$-th row of $X$.
However, SVS does not actually need to compute any part of matrix $U$ at all. The procedure outlined in Applying the PCA Correction is dependent only upon the input data $X$ and upon the $v_k$ (from matrix $V$), and is otherwise self-contained.
Note
In actual fact, SVS normalizes $X^T X$ by the number of markers, $m$. The corresponding decomposition of $X^T X / m$ becomes $V (\Lambda / m) V^T$, and $\Lambda / m$ is the matrix which contains the eigenvalues that we can obtain and output from SVS.
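As a concrete illustration of this decomposition, the following minimal sketch (assuming NumPy; not SVS code, and all names are hypothetical) checks that the right singular vectors of a marker-centered matrix $X$ are the eigenvectors of the Wishart-like matrix $X^T X / m$, with eigenvalues equal to the squared singular values divided by $m$.

```python
import numpy as np

# Minimal sketch (not SVS code): the right singular vectors of the marker-centered
# matrix X are the eigenvectors of the Wishart-like matrix X'X / m.
rng = np.random.default_rng(0)
m, n = 500, 20                                       # m markers, n samples (hypothetical sizes)
X = rng.integers(0, 3, size=(m, n)).astype(float)    # numeric-equivalent genotypes 0/1/2
X -= X.mean(axis=1, keepdims=True)                   # data centering by marker

W = X.T @ X / m                                      # normalized Wishart-like matrix
eigvals, eigvecs = np.linalg.eigh(W)                 # ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # sort descending

U, s, Vt = np.linalg.svd(X, full_matrices=False)     # X = U S V'
assert np.allclose(eigvals, s**2 / m)                # eigenvalues are squared singular values / m
assert np.allclose(np.abs(np.sum(Vt * eigvecs.T, axis=1)), 1.0)  # V columns match eigenvectors up to sign
```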
3.7.2. Alternate Motivation¶
Remember that $X$ is the $m$ (marker) by $n$ (subject or sample) matrix of the numeric equivalent values of the genotypes, with the average values over the markers subtracted out (data centering by marker). (By “numeric equivalent”, we mean (assuming the additive genetic model) zero for a homozygous major allele genotype, one for a heterozygous genotype, and two for a homozygous minor allele genotype.)
In order to have a better idea about what the data is like, and to make it easier to describe our data graphically if we wish to, we take the columns of matrix $X$, which we will call $x_1$ through $x_n$, and find the $k$ linear combinations of these, $X a_i = \sum_{j=1}^{n} a_{ij} x_j$ for $i = 1, \ldots, k$, which have the largest variances, and thus will best “explain” the data. We do this through finding the eigenvalues and eigenvectors of the matrix $X^T X / m$, which approximates the matrix of the (sample) variances/covariances of the columns of $X$, then picking the eigenvectors corresponding to the $k$ largest eigenvalues of $X^T X / m$.
What we call the “eigenvectors” of $X^T X / m$ are the vectors $v_k$ such that

$$\frac{X^T X}{m} v_k = \lambda_k v_k$$

and $v_k^T v_k = 1$, where $\lambda_k$ is the “eigenvalue” associated with eigenvector $v_k$.
(The eigenvectors and eigenvalues for a matrix $A$ may be found by the two-step process of first, solving the characteristic equation

$$\det(A - \lambda I) = 0$$

for $\lambda$, then second, for each eigenvalue $\lambda_k$, solving

$$(A - \lambda_k I) v_k = 0$$

for the eigenvector $v_k$. Standard matrix software may be used to perform these operations in the most optimized manner.)
We note that the variance of $X v_k$ is approximated by

$$\frac{(X v_k)^T (X v_k)}{m} = v_k^T \left( \frac{X^T X}{m} \right) v_k = \lambda_k v_k^T v_k = \lambda_k ,$$

and so picking the eigenvectors with the highest eigenvalues $\lambda_1$ through $\lambda_k$ gives us the most variance “explained” by $X v_1$ through $X v_k$. These $k$ eigenvectors $v_1$ through $v_k$ with the highest eigenvalues we call the “principal components” of matrix $X$.
In order to visually inspect some of the relationships in the data, we could plot the top three principal components (eigenvectors $v_1$ through $v_3$ corresponding to the highest three eigenvalues $\lambda_1$ through $\lambda_3$) against each other in a scatterplot, and observe any clustering of data points that might appear in the scatterplot.
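A minimal sketch of such an inspection, assuming NumPy and matplotlib (illustrative only; the function and variable names are hypothetical, not part of SVS):

```python
import numpy as np
import matplotlib.pyplot as plt

def top_components(X, k=3):
    """Return the k largest eigenvalues of X'X / m and their eigenvectors (principal components)."""
    m = X.shape[0]
    W = X.T @ X / m                          # Wishart-like matrix, n x n
    eigvals, eigvecs = np.linalg.eigh(W)     # ascending order
    order = np.argsort(eigvals)[::-1][:k]
    return eigvals[order], eigvecs[:, order]

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 60))              # hypothetical numeric data: 1000 markers by 60 samples
X -= X.mean(axis=1, keepdims=True)           # data centering by marker

lam, V = top_components(X, k=3)
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, (i, j) in zip(axes, [(0, 1), (0, 2), (1, 2)]):
    ax.scatter(V[:, i], V[:, j], s=10)       # one point per sample
    ax.set_xlabel("component %d" % (i + 1))
    ax.set_ylabel("component %d" % (j + 1))
plt.tight_layout()
plt.show()
```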
Note
From a “conventional PCA” point of view, what we are doing is using the genetic samples/subjects as the various “measurements”, while using the genetic markers as the “samples” of those “measurements”. This makes sense because we are trying to find a population-structure-type relationship among the subjects (“measurements”) by going through many genetic markers (“samples”).
This also explains why PCA discussions usually use matrix $X$ as an $m$ (genetic marker) by $n$ (subject/sample) matrix, rather than an $n$ (subject/sample) by $m$ (genetic marker) matrix.
Now that we can distill the essence of the population structure into $k$ principal components, where $k$ can be whatever number we need, we can subtract this structure essence from our numerically-converted genotypic data, leaving behind more subtle trends that may have associations with one or more phenotypic traits.
3.7.3. Technique¶
When accumulating the elements of $X^T X$ for genotypic data, the marker data may be optionally weighted in favor of those markers with less variance by normalizing them according to their variance, either theoretical or actual. See Formulas for PCA Normalization of Genotypic Data regarding marker normalizing for genotypic data.
Otherwise, the technique is to:
Scan the markers and obtain the Wishart-like matrix $X^T X / m$.
Find the desired number of eigenvectors and eigenvalues from $X^T X / m$.
If PCA-corrected testing is being performed, apply the PCA corrections, first to the dependent variable (for genotypic data, or if correcting the dependent variable) and then, during the testing scan, to the data from each marker just before it is tested.
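The first two steps might look roughly like the following sketch, assuming NumPy (illustrative only; the function names are hypothetical and this is not SVS source code):

```python
import numpy as np

def accumulate_wishart(marker_rows, n_samples):
    """Scan markers and accumulate the Wishart-like matrix X'X / m from centered marker rows."""
    W = np.zeros((n_samples, n_samples))
    m = 0
    for row in marker_rows:          # row: centered (and optionally normalized) values for one marker
        W += np.outer(row, row)      # adds x_i x_i' for marker i
        m += 1
    return W / m

def top_eigenpairs(W, k):
    """Find the k largest eigenvalues of W and their eigenvectors."""
    eigvals, eigvecs = np.linalg.eigh(W)
    order = np.argsort(eigvals)[::-1][:k]
    return eigvals[order], eigvecs[:, order]
```

The third step is described in Applying the PCA Correction below.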
3.7.4. Applying the PCA Correction¶
Suppose $y$ is a vector containing the dependent variable (response) values or numeric-equivalent values. Exercising data centering by marker, we subtract the average element value from each element to get $y_0$; then,

$$y_1 = y_0 - c_1 v_1 ,$$

where $c_1 = v_1 \cdot y_0$. Similarly,

$$y_2 = y_1 - c_2 v_2 ,$$

where $c_2 = v_2 \cdot y_1$, etc., up to

$$y_k = y_{k-1} - c_k v_k ,$$

where $c_k = v_k \cdot y_{k-1}$.
Note that the values $c_i$ are obtained through dot products, where, component-wise,

$$c_i = v_i \cdot y_{i-1} = \sum_{j=1}^{n} v_{ij} \, (y_{i-1})_j .$$
The corrected $y$ is obtained (as a part of data centering by marker) by adding back the average element value of the original $y$ to each element of $y_k$.
The idea is that, for each principal component, we find what magnitude of it is present in $y_0$ (or $y_{i-1}$); then we subtract that magnitude of that component from $y_{i-1}$ to get $y_i$.
Exactly the same technique is applied for every vector $g$ of data from an individual marker. After subtracting the average element value to get $g_0$, we compute

$$g_1 = g_0 - d_1 v_1 ,$$

where $d_1 = v_1 \cdot g_0$. Similarly,

$$g_2 = g_1 - d_2 v_2 ,$$

where $d_2 = v_2 \cdot g_1$, etc., up to

$$g_k = g_{k-1} - d_k v_k ,$$

where $d_k = v_k \cdot g_{k-1}$. Component-wise, we have

$$d_i = v_i \cdot g_{i-1} = \sum_{j=1}^{n} v_{ij} \, (g_{i-1})_j .$$
We finally get the corrected $g$ back (as a part of data centering by marker) by adding back the average element value of the original $g$ to each element of $g_k$.
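The following is a minimal sketch of this correction, assuming NumPy (illustrative only; `pca_correct` and the other names are hypothetical, not the SVS implementation):

```python
import numpy as np

def pca_correct(vec, components):
    """Correct one length-n vector (dependent variable or single-marker data).

    components: n x k array whose columns v_1 ... v_k are the top principal
    components (eigenvectors of X'X / m).
    """
    avg = vec.mean()
    corrected = vec - avg                    # data centering by marker
    for i in range(components.shape[1]):
        v = components[:, i]
        c = v @ corrected                    # magnitude of component i present in the data
        corrected = corrected - c * v        # subtract that magnitude of the component
    return corrected + avg                   # add the original average back in

# Example usage (hypothetical): correct the dependent variable y, then every marker row of X.
# y_corrected = pca_correct(y, V_top)
# X_corrected = np.apply_along_axis(pca_correct, 1, X_raw, V_top)
```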
3.7.5. Formulas for PCA Normalization of Genotypic Data¶
The possibilities for marker normalization are:
The theoretical standard deviation of this marker's data at Hardy-Weinberg equilibrium (HWE). This means the standard deviation this marker's data would have over the population if, in the population, it were in Hardy-Weinberg equilibrium and had the same major and minor allele frequencies as are actually measured for this marker. This is the standard method used by the EIGENSTRAT program.
This theoretical standard deviation is as follows according to the genetic model being used:
Additive (the only model supported by EIGENSTRAT itself):

$$\sqrt{2 \hat p (1 - \hat p)} ,$$

where $\hat p$ is a Bayesian estimate of the minor allele probability in terms of the numeric marker values (0, 1, or 2).
Dominant:

$$\sqrt{\hat q (1 - \hat q)} ,$$

where $\hat q$ is a Bayesian estimate of the probability of being a DD or Dd genotype.
Recessive:

$$\sqrt{\hat r (1 - \hat r)} ,$$

where $\hat r$ is a Bayesian estimate of the probability of being a DD genotype.
The actual standard deviation of this marker’s data.
Don’t normalize.
Note
Numeric data is not normalized in SVS.
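A small sketch of the additive-model computation, assuming NumPy (illustrative; the add-one style “Bayesian” allele-frequency estimate shown here follows the EIGENSTRAT convention in [Price2006] and should be treated as an assumption rather than the exact SVS formula):

```python
import numpy as np

def hwe_std_additive(genotypes):
    """Theoretical HWE standard deviation for one marker's 0/1/2 numeric-equivalent values."""
    g = np.asarray(genotypes, dtype=float)
    g = g[~np.isnan(g)]                          # drop missing values
    n = g.size
    p_hat = (1.0 + g.sum()) / (2.0 + 2.0 * n)    # assumed Bayesian estimate of minor allele probability
    return np.sqrt(2.0 * p_hat * (1.0 - p_hat))  # SD of a 0/1/2 genotype at HWE

# A centered marker row would then be divided by hwe_std_additive(raw_row) before
# contributing to X'X.
```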
3.7.6. Further Motivation: Relation to the Variance-Covariance Matrix¶
For the purposes of PCA correction in SVS, the marker data may be treated as if it comes from a multivariate normal distribution, where the data from marker $i$, $x_i = (x_{i1}, x_{i2}, \ldots, x_{in})$, is picked from a random variable $\mathbf{Y} = (Y_1, Y_2, \ldots, Y_n)$ (where $n$ is equal to the number of samples). If $\mathbf{Y}$ indeed has a multivariate normal distribution, then there is a random vector $\mathbf{Z} = (Z_1, Z_2, \ldots, Z_n)$ whose components are independent standard normal random variables, a vector $\mu = (\mu_1, \mu_2, \ldots, \mu_n)$, and an $n$ by $n$ matrix $A$ such that $\mathbf{Y} = A \mathbf{Z} + \mu$.
The vector $\mu$ is the expected value of $\mathbf{Y}$, and the matrix $A A^T$ is the variance-covariance matrix of the components $Y_j$ of $\mathbf{Y}$.
Thus, the vectors $x_i$ in matrix $X$ may be treated as coming from the expression

$$x_i = A z_i + \mu ,$$

or

$$x_{ij} = \sum_k a_{jk} z_{ik} + \mu_j ,$$

where $a_{jk}$ are elements in matrix $A$ and $z_{ik}$ is the realization of the standard normal component $Z_k$ for marker $i$.
Let us examine $X^T X$. One element of $X^T X$ may be written as

$$(X^T X)_{jl} = \sum_i x_{ij} x_{il} = \sum_i \left( \sum_k a_{jk} z_{ik} + \mu_j \right) \left( \sum_k a_{lk} z_{ik} + \mu_l \right) .$$

The term $\left( \sum_k a_{jk} z_{ik} + \mu_j \right) \left( \sum_k a_{lk} z_{ik} + \mu_l \right)$ expands out to

$$\left( \sum_k a_{jk} z_{ik} \right) \left( \sum_k a_{lk} z_{ik} \right) + \mu_j \sum_k a_{lk} z_{ik} + \mu_l \sum_k a_{jk} z_{ik} + \mu_j \mu_l .$$

Expanding the first term above, we have

$$\left( \sum_k a_{jk} z_{ik} \right) \left( \sum_k a_{lk} z_{ik} \right) = \sum_k a_{jk} a_{lk} z_{ik}^2 + \sum_{k \ne k'} a_{jk} a_{lk'} z_{ik} z_{ik'} .$$

We may thus write

$$(X^T X)_{jl} = \sum_k a_{jk} a_{lk} \sum_i z_{ik}^2 + \sum_{k \ne k'} a_{jk} a_{lk'} \sum_i z_{ik} z_{ik'} + \mu_j \sum_k a_{lk} \sum_i z_{ik} + \mu_l \sum_k a_{jk} \sum_i z_{ik} + m \mu_j \mu_l ,$$

where $m$ is the number of markers.
However, since the components of the random variable from which $x_i$ is assumed to be drawn are independent standard normal random variables, we may make the following estimates:

$$\sum_i z_{ik}^2 \approx m , \qquad \sum_i z_{ik} z_{ik'} \approx 0 \;\; (k \ne k') , \qquad \sum_i z_{ik} \approx 0 .$$

Thus, we may estimate the elements $(X^T X)_{jl}$ of $X^T X$ as

$$(X^T X)_{jl} \approx m \sum_k a_{jk} a_{lk} + m \mu_j \mu_l ,$$

or

$$\frac{1}{m} X^T X \approx A A^T + \mu \mu^T .$$
Note that $A A^T$ is the variance-covariance matrix of the distribution of $\mathbf{Y}$; thus, we have the following estimate:

$$\frac{1}{m} X^T X \approx \Sigma + \mu \mu^T ,$$

where $\Sigma = A A^T$ is the covariance matrix of the multivariate distribution of $\mathbf{Y}$ and $\mu \mu^T$ is an outer product of the sample averages.
Note
$X^T X$ is a Wishart matrix based on the multivariate normal distribution of $\mathbf{Y}$. The distribution of this random matrix is a generalization of the chi-square distribution.
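As a quick numerical illustration of the estimate above (assuming NumPy; simulated data only, not SVS output):

```python
import numpy as np

# Simulate m markers drawn independently from one multivariate normal distribution
# and check that X'X / m approaches Sigma + mu mu^T as m grows.
rng = np.random.default_rng(2)
n, m = 5, 200_000                        # n samples, m simulated markers
A = rng.normal(size=(n, n))
Sigma = A @ A.T                          # variance-covariance matrix of the samples
mu = rng.normal(size=n)                  # per-sample expected values

Z = rng.normal(size=(m, n))              # independent standard normal components
X = Z @ A.T + mu                         # each row is x_i = A z_i + mu

approx = X.T @ X / m
print(np.max(np.abs(approx - (Sigma + np.outer(mu, mu)))))   # small for large m
```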
3.7.7. Centering by Marker vs. Not Centering by Marker¶
In the EIGENSTRAT technique (Formulas for Principal Component Analysis), the data is centered by marker; that is, the average of the data within a given marker is first subtracted out from each of that marker's data values, both before they are used for the $X^T X$ matrix and before they are PCA-corrected, with the averages being added back in after correction.
An alternative approach, available in SVS for numeric or copy number data, is to not center by marker, but instead allow the overall average value of each marker to be a part of the principal components analysis.
Two cases can be made that this approach is preferable to centering by marker for segmenting copy number data. First, if a variation in the data due to a batch effect is “corrected out” using centering by marker, a slight displacement over the remaining markers will ensue. If this displacement is large enough, it might be interpreted as a variation large enough to require a segment boundary, even over samples where there should not be a segment boundary.
Second, leaving the marker average in the principal components analysis may result in overall data smoothing and removing unnecessary segment boundaries.
On the other hand, a true copy-number variation over enough samples may influence the marker average, which, under this approach, could be picked up as a large component or as large parts of two or three large components by principal components analysis. This might result in apparent variations large enough for segment boundaries for samples for which there were no actual variations. However, the good news is that the apparent variations would be in the opposite direction from the true copy-number variations.
Regarding data centering by marker, it should be noted that after the marker averages are subtracted out, we have

$$\sum_{j=1}^{n} x_{ij} = 0 ,$$

where $x_{ij}$ is the element of $X$ corresponding to the $i$-th marker and the $j$-th sample. Since

$$(X^T X)_{jl} = \sum_i x_{ij} x_{il}$$

for any pair of samples $j$ and $l$, we have

$$\sum_{l=1}^{n} (X^T X)_{jl} = \sum_i x_{ij} \sum_{l=1}^{n} x_{il} = 0 .$$

Multiplying a constant vector into the $X^T X$ matrix will therefore obtain a vector with zero for every element, demonstrating that when data centering by marker is used, one of the eigenvalues is zero, corresponding to an “average vector” of constant elements.
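A small check of this property, assuming NumPy (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 8))                   # 100 markers by 8 samples
X -= X.mean(axis=1, keepdims=True)              # data centering by marker (each row sums to zero)

W = X.T @ X
ones = np.ones(X.shape[1])
print(np.max(np.abs(W @ ones)))                 # ~0: the constant vector is mapped to zero
print(np.min(np.abs(np.linalg.eigvalsh(W))))    # ~0: one eigenvalue is zero
```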
If data is not centered by marker, but instead marker averages are incorporated into the components, this zero eigenvalue will not be present; instead, if the data's averages are indeed not zero, one “extra” eigenvalue will show up. Often, this will be among the larger eigenvalues, appearing before the remaining eigenvalues start becoming very similar to the eigenvalues (with subscripts one smaller) that occur when data centering by marker is used.
3.7.8. Centering by Sample vs. Not Centering by Sample¶
A second alternative approach to PCA, available in SVS for numeric or copy number data, is to center by sample, either in addition to centering by marker or instead of centering by marker. This entails effectively subtracting out the sample average before contributing to the $X^T X$ matrix (and adding it back in while correcting). (In SVS, if centering by both sample and marker, the overall average is subtracted out only once.)
Whether it is appropriate to subtract out sample averages and add them back in after correction, rather than including the sample averages as part of what may be corrected, depends upon how much of the batch effect or stratification may be built into the various sample averages and their products vs. built into the variation across both markers and samples.
An example of when centering by sample can be useful is when obtaining components of copy number data from chromosomes 1 through 22 and then applying these to copy number data for chromosomes X and Y (see Applying PCA to a Superset of Markers). Centering by sample ensures that the different averages of X chromosome intensities between men and women, and the different averages of Y chromosome intensities between men and women, will be preserved through PCA correction.
3.7.9. Applying PCA to a Superset of Markers¶
Sometimes, it is desirable to obtain principal components for a given set of samples from one set of markers, and then apply these components to a different set of markers. One example is, within a given set of samples, obtaining the components from a set of markers with a known population structure and then applying these to genotypic data over other markers. Another is the example mentioned above: for a given set of samples, obtaining components of copy number data from chromosomes 1 through 22 and then applying these to copy number data for chromosomes X and Y.
Whether applying PCA to a superset of markers is reasonable to do is governed by whether we can assume that the behavior of the data over the different set of markers, especially the parts of the data contributing to population structure or batch effects, will be governed by essentially the same variance-covariance matrix (and the same set of sample averages, unless data centering by sample is being used) that the original data is governed by. Under this assumption, the same eigenvectors that describe the original data’s population structure or batch effects will also govern, and may be used to remove, the “new” data’s population structure or batch effects.
3.7.10. Applying PCA to a Subset of Samples¶
It is also sometimes desirable to obtain components for one set of samples, then use these components to correct data from a subset of these samples, either for the same markers or for a superset of markers. To do this, the components are first subset to match the data's sample subset; these subset components are then applied to the data.
Here again, if we are correcting over any superset of the markers from which we obtained principal components, we assume consistency in the variance-covariance matrix (and the sample averages, if applicable) over this superset of markers.
However, now we must look at whether two components which are orthogonal
in the main sample set are still orthogonal in the sample subset.
Suppose $u = (u_1, \ldots, u_r, u_{r+1}, \ldots, u_n)$ and $w = (w_1, \ldots, w_r, w_{r+1}, \ldots, w_n)$ are two components of the full set, where elements $u_1, \ldots, u_r$ and $w_1, \ldots, w_r$ correspond to samples in the original sample set but not the sample subset, and elements $u_{r+1}, \ldots, u_n$ and $w_{r+1}, \ldots, w_n$ correspond to samples in the sample subset. (Without loss of generality, we are writing the vectors as if all of the elements in the subset follow after all of the elements not in the subset.) Because these are eigenvectors of $X^T X$, they will be orthogonal. That is, their dot product will be zero.
This dot product may be written

$$u \cdot w = \sum_{j=1}^{r} u_j w_j + \sum_{j=r+1}^{n} u_j w_j .$$

We see that if the individual dot products, both outside of and inside of the subset, are zero, then the full dot product will be zero. Thus, components that are orthogonal in the subset and its complement will be orthogonal for the overall set.
Unfortunately, the converse is not generally true. It is true that because the two partial dot products must sum to zero, any deviation from zero of the dot product over the subset is perfectly compensated for by a corresponding deviation in the opposite direction of the dot product over the remaining samples. On that basis, we could try to say that neither partial dot product “should” deviate too much from zero, and thus that the components as input “should” be more or less orthogonal over the subset, especially if the subset is of a reasonably large size.
However, because the subset components are not precisely orthogonal except in special cases, a modified Gram-Schmidt process is applied by SVS to make the components that belong to the subset orthonormal (both orthogonal and normalized). Effectively, this is equivalent to “obtaining a PCA correction” for every component $v_k$ based upon the previous components $v_1$, $v_2$, ..., up to $v_{k-1}$.
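A minimal sketch of such a modified Gram-Schmidt pass, assuming NumPy (illustrative only; not the SVS implementation):

```python
import numpy as np

def orthonormalize_subset(V_sub):
    """Orthonormalize subset components (one column per component, one row per retained sample)."""
    V = np.array(V_sub, dtype=float, copy=True)
    for k in range(V.shape[1]):
        for j in range(k):                        # "PCA-correct" component k against components 1..k-1
            V[:, k] -= (V[:, j] @ V[:, k]) * V[:, j]
        V[:, k] /= np.linalg.norm(V[:, k])        # normalize
    return V
```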