3.7. Formulas for Principal Component Analysis

This technique was pioneered at the Broad Institute, which distributes a program called “EIGENSTRAT” that implements it. The PCA correction technique is described in [Price2006]. A more thorough discussion of stratification, principal component analysis, and the eigenvalues involved may be found in [Patterson2006].


Golden Helix SVS currently uses PCA both with association tests performed against genotypes, using their numeric equivalent values (Genotypic Principal Component Analysis), and with association tests performed against numeric data (Numeric Principal Component Analysis).

Suppose X is the m (marker) by n (subject or sample) matrix of either numeric values or the numeric equivalent values of the genotypes, where for each marker, the average value over that marker has first been subtracted out from the marker’s values (data centering by marker).


In our notation, a spreadsheet in SVS on which PCA may be performed may be thought of as X^T, since such a spreadsheet suggests an n (subject or sample) by m (marker) matrix.

3.7.1. Motivation

We would like to decompose X into USV^T as a singular value decomposition, where U is an m by m matrix for which U^TU = I, V is an n by n matrix for which V^TV = I, and S is an m by n singular-value matrix. For this decomposition, V may be thought of as consisting of principal components or “ancestries” relating the samples to the singular values, while the k^{th} column of U (where k \ge 1 and k \le n) may be thought of as containing the “coordinate value” of each marker along the k^{th} principal component.

We do this by finding the eigenvalues and eigenvectors of X^TX. (This matrix is like a Wishart matrix. See Further Motivation: Relation to the Variance-Covariance Matrix.) Rewriting X^TX in terms of the singular value decomposition, we have


X^TX = (USV^T)^T(USV^T) = VS^T(U^TU)SV^T = V(S^TS)V^T,

where S^TS is an n by n matrix containing the eigenvalues of X^TX on its diagonal and zeroes elsewhere. Right-hand multiplying the above by V gives us


X^TXV = V(S^TS).

Since V is a matrix of the eigenvectors (principal components) of X^TX, the above implies, for k = 1 through n, that

X^TXv_k = \lambda_k v_k,

where \lambda_k are the eigenvalues of X^TX, and the vectors v_k are the columns of V and are also the eigenvectors of X^TX.
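This relationship between the SVD of X and the eigendecomposition of X^TX can be sketched numerically (a NumPy illustration with hypothetical random data, not SVS code):

```python
import numpy as np

# Hypothetical random data standing in for the m-marker by n-sample matrix X
rng = np.random.default_rng(0)
m, n = 200, 5
X = rng.normal(size=(m, n))
X -= X.mean(axis=1, keepdims=True)        # data centering by marker

# Singular value decomposition X = U S V^T
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Eigendecomposition of X^T X (eigh returns ascending order; reverse it)
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# The eigenvalues of X^T X are the squared singular values of X ...
assert np.allclose(eigvals, s**2)

# ... and each eigenvector v_k matches the k-th column of V up to sign
for k in range(n):
    assert np.isclose(abs(eigvecs[:, k] @ Vt[k]), 1.0)
```

Note that because the data is centered by marker, the smallest eigenvalue is numerically zero (see Centering by Marker vs. Not Centering by Marker).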

As stated above, the principal components v_k may be thought of as “ancestry” vectors. The most “important” ones correspond to the highest values of \lambda_k. Our object will be to subtract these most important v_k out from the rest of the data, so that we may more effectively test for associations with markers that specifically relate to our phenotype, rather than with markers that relate to population ancestry.


Matrix U is not unique, except that the first n columns have to be eigenvectors of XX^T corresponding to the eigenvalues \lambda_k. As stated above, the i-th element of the k-th column of U may be thought of as the relative magnitude of the k-th principal component in making up the i-th row of X.

However, SVS does not actually need to compute any part of matrix U at all. The procedure outlined in Applying the PCA Correction is dependent only upon the input data and upon the v_k (from matrix V), and is otherwise self-contained.


In actual fact, SVS normalizes X^TX by the number of markers, m. The corresponding decomposition becomes \frac{X^TX}{m} = V\left(\frac{S^TS}{m}\right)V^T, and \frac{S^TS}{m} is the matrix which contains the eigenvalues that we can obtain and output from SVS.

3.7.2. Alternate Motivation

Remember that X is the m (marker) by n (subject or sample) matrix of the numeric equivalent values of the genotypes, with the average values over the markers subtracted out (data centering by marker). (By “numeric equivalent”, we mean (assuming the additive genetic model) zero for a homozygous major allele genotype, one for a heterozygous genotype, and two for a homozygous minor allele genotype.)

To get a better idea of what the data is like, and to make it easier to describe the data graphically if we wish to, we take the columns of matrix X, which we will call x_j, and find the p linear combinations of these,

\sum_{j=1}^n v_{jk} x_j,

for k=1...p, which have the largest variances, and thus will best “explain” the data. We do this through finding the eigenvalues and eigenvectors of the matrix \frac{X^TX}{m}, which approximates the matrix C of the (sample) variances/covariances of the columns of X, then picking the eigenvectors corresponding to the p largest eigenvalues of \frac{X^TX}{m}.

What we call the “eigenvectors” of \frac{X^TX}{m} are the vectors v_k such that v_k^T v_k = 1 and

\frac{X^TX}{m} v_k = \lambda_k v_k ,

where \lambda_k is the “eigenvalue” associated with eigenvector v_k.

(The eigenvectors and eigenvalues for a matrix A may be found by the two-step process of first, solving the characteristic equation

det(A - \lambda I) = 0

for \lambda, then second, for each eigenvalue \lambda, solving

(A - \lambda I) x = 0

for the eigenvector x. In practice, standard matrix software computes eigenvalues and eigenvectors directly, using numerical routines that are more efficient than this two-step process.)
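For instance (a small NumPy sketch, not part of SVS), the roots of the characteristic polynomial of a 2 by 2 matrix agree with what an optimized symmetric eigensolver returns:

```python
import numpy as np

# A small symmetric matrix standing in for X^T X / m
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Step 1: solve the characteristic equation det(A - lambda*I) = 0.
# np.poly(A) gives the coefficients of A's characteristic polynomial.
lam = np.sort(np.roots(np.poly(A)).real)

# In practice, an optimized symmetric eigensolver is used directly.
eigvals, eigvecs = np.linalg.eigh(A)
assert np.allclose(lam, eigvals)          # both give 1 and 3

# Step 2: each eigenvector x solves (A - lambda*I) x = 0
for k in range(2):
    assert np.allclose(A @ eigvecs[:, k], eigvals[k] * eigvecs[:, k])
```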

We note that the variance of Xv_k is approximated by

\frac{(Xv_k)^T (Xv_k)}{m} &= v_k^T \frac{X^TX}{m} v_k \\
&= v_k^T \lambda_k v_k = \lambda_k v_k^T v_k = \lambda_k,

and so picking the k eigenvectors with the highest eigenvalues \lambda_1 through \lambda_k gives us the most variance “explained” by Xv_1 through Xv_k. These k eigenvectors with the highest eigenvalues we call the “principal components” of matrix \frac{X^TX}{m}.
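This identity is easy to check numerically (a NumPy sketch with hypothetical random data; since v_k^T v_k = 1, the last step of the derivation is exact):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 500, 4
X = rng.normal(size=(m, n))
X -= X.mean(axis=1, keepdims=True)        # data centering by marker

# Eigenvalues/eigenvectors of X^T X / m, largest first
eigvals, V = np.linalg.eigh(X.T @ X / m)
eigvals, V = eigvals[::-1], V[:, ::-1]

# The (approximated) variance of X v_k equals lambda_k
for k in range(n):
    proj = X @ V[:, k]
    assert np.isclose(proj @ proj / m, eigvals[k])
```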

In order to visually inspect some of the relationships in the data, we could plot the top three principal components (eigenvectors v_1 through v_3 corresponding to the highest three eigenvalues \lambda_1 through \lambda_3) against each other in a scatterplot, and observe any clustering of data points that might appear in the scatterplot.


From a “conventional PCA” point of view, we are using the genetic samples/subjects as the various “measurements”, and the genetic markers as “samples” of those measurements. This makes sense because we are trying to find a population-structure-type relationship among the subjects (“measurements”) by going through many genetic markers (“samples”).

This also explains why PCA discussions usually use matrix X as an m (genetic marker) by n (subject/sample) matrix, rather than an n (subject/sample) by m (genetic marker) matrix.

Now that we can distill the essence of the population structure into k principal components, where k can be whatever number we need, we can subtract this structure essence from our numerically-converted genotypic data, leaving behind more subtle trends that may have associations with one or more phenotypic traits.

3.7.3. Technique

When accumulating the elements of X^TX for genotypic data, the marker data may be optionally weighted in favor of those markers with less variance by normalizing them according to their variance, either theoretical or actual. See Formulas for PCA Normalization of Genotypic Data regarding marker normalizing for genotypic data.

Otherwise, the technique is to:

  • Scan the markers and obtain the Wishart-like X^TX matrix.

  • Find the desired number of eigenvectors and eigenvalues from X^TX.

  • If PCA-corrected testing is being performed, apply the PCA corrections, first from the dependent variable (for genotypic data or if correcting the dependent variable) and then during the testing scan from each marker of the genetic data just before it is tested.

3.7.4. Applying the PCA Correction

Suppose r is a vector containing the dependent variable (response) values or numeric-equivalent values. Exercising data centering by marker, we subtract the average element value from each element to get r_0; then,

r_1 = r_0 - \rho_1v_1,

where \rho_1=v_1\cdot r_0. Similarly,


r_2 = r_1 - \rho_2v_2,

where \rho_2=v_2\cdot r_1, etc., up to

r_k = r_{k-1}-\rho_kv_k,

where \rho_k=v_k\cdot r_{k-1}.

Note that the \rho values are obtained through dot products. Writing a_{jk} for the j-th element of v_k, we have, component-wise,

\rho_k = \frac{\sum_ja_{jk}r_{j(k-1)}}{\sum_ja^2_{jk}}.

(Since each v_k is normalized, the denominator is equal to one.)

The corrected r is obtained (as a part of data centering by marker) by adding back the average element value of the original r to each element of r_k.

The idea is that, for each principal component, we find what magnitude of that component is present in r_{k-1}; we then subtract that magnitude of the component from r_{k-1} to get r_k.

Exactly the same technique is applied for every vector g of data from an individual marker. After subtracting the average element value to get g_0, we compute

g_1 = g_0-\gamma_1v_1,

where \gamma_1 = v_1\cdot g_0. Similarly,

g_2 = g_1 - \gamma_2v_2,

where \gamma_2=v_2\cdot g_1, etc., up to

g_k = g_{k-1} - \gamma_kv_k,

where \gamma_k = v_k\cdot g_{k-1}.

Component-wise, writing a_{jk} for the j-th element of v_k, we have

\gamma_k = \frac{\sum_j a_{jk}g_{j(k-1)}}{\sum_ja^2_{jk}}.

We finally get the corrected g back (as a part of data centering by marker) by adding back the average element value of the original g to each element of g_k.
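The full correction loop can be sketched as follows (a NumPy illustration with hypothetical orthonormal components; the function name pca_correct is ours, not part of SVS):

```python
import numpy as np

def pca_correct(vec, components):
    """Subtract the magnitude of each principal component from the
    (marker-centered) data, then add the original average back in."""
    avg = vec.mean()
    g = vec - avg                        # g_0 (data centering by marker)
    for v in components:                 # v_1, ..., v_k
        gamma = (v @ g) / (v @ v)        # gamma_k = v_k . g_{k-1}
        g = g - gamma * v                # g_k = g_{k-1} - gamma_k v_k
    return g + avg

# Hypothetical orthonormal components over four samples
v1 = np.array([0.5, -0.5, 0.5, -0.5])
v2 = np.array([0.5, 0.5, -0.5, -0.5])
g = np.array([1.0, 2.0, 4.0, 7.0])

corrected = pca_correct(g, [v1, v2])
assert np.allclose(corrected, [4.0, 3.0, 3.0, 4.0])

# After correction, the centered data is orthogonal to both components
assert np.isclose(v1 @ (corrected - corrected.mean()), 0)
assert np.isclose(v2 @ (corrected - corrected.mean()), 0)
```

The same function applies to the dependent variable r and to each marker vector g, as described above.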

3.7.5. Formulas for PCA Normalization of Genotypic Data

The possibilities for marker normalization are:

  • The theoretical standard deviation of this marker’s data at Hardy-Weinberg equilibrium (HWE). This means the standard deviation of this marker’s data would have to be over the population if, in the population, it were in Hardy-Weinberg equilibrium and had the same major and minor allele frequencies as are actually measured for this marker. This is the standard method used by the EIGENSTRAT program.

    This theoretical standard deviation is as follows according to the genetic model being used:

    • Additive (the only model supported by EIGENSTRAT itself):

      2\sqrt{p(1 - p)},

      where

      p = \frac{1 + \sum_j g_j}{2 + 2n}

      is a Bayesian estimate of the minor allele probability in terms of the numeric marker values g_j.

    • Dominant:

      \sqrt{p(1 - p)},

      where

      p = \frac{0.5 + \sum_j g_j}{1 + n}

      is a Bayesian estimate of the probability of being a DD or Dd genotype.

    • Recessive:

      \sqrt{p(1 - p)},

      where

      p = \frac{0.5 + \sum_j g_j}{1 + n}

      is a Bayesian estimate of the probability of being a DD genotype.

  • The actual standard deviation of this marker’s data.

  • Don’t normalize.


Numeric data is not normalized in SVS.
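The theoretical HWE standard deviations above can be sketched as follows (a NumPy illustration; the function name hwe_stddev is ours, not part of SVS or EIGENSTRAT):

```python
import numpy as np

def hwe_stddev(g, model="additive"):
    """Theoretical standard deviation at HWE for one marker's numeric
    values g_j, using the Bayesian probability estimates above."""
    n = len(g)
    if model == "additive":
        p = (1 + np.sum(g)) / (2 + 2 * n)   # minor allele probability
        return 2 * np.sqrt(p * (1 - p))
    # dominant and recessive use the same form with a different meaning of p
    p = (0.5 + np.sum(g)) / (1 + n)
    return np.sqrt(p * (1 - p))

# Additive numeric equivalents (0, 1, or 2 minor alleles) for six samples
g = np.array([0, 1, 2, 1, 0, 0])
p = (1 + 4) / (2 + 12)                      # 5/14
assert np.isclose(hwe_stddev(g), 2 * np.sqrt(p * (1 - p)))
```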

3.7.6. Further Motivation: Relation to the Variance-Covariance Matrix

For the purposes of PCA correction in SVS, the marker data may be treated as if it comes from a multivariate normal distribution, where the data from marker r, x_r = (x_{r1},...,x_{rn}), is drawn from a random variable x = (x^1,...,x^n) (where n is equal to the number of samples). If x indeed has a multivariate normal distribution, then there is a random vector z = (z^1,...,z^k) whose components are independent standard normal random variables, a vector \mu = (\mu^1,...,\mu^n), and a k by n matrix A such that x = zA + \mu.

The vector \mu is the expected value of x, and the matrix A^TA is the variance-covariance matrix of the components x^j.

Thus, the vectors (x_{r1},...,x_{rn}) in matrix X may be treated as coming from the expression

(x_{r1},...,x_{rn}) = (z_{r1},...,z_{rk}) A + (\mu_1,...,\mu_n),


or, component-wise,

x_{rp} = \sum_qz_{rq}a_{qp} + \mu_p,

where a_{qp} are elements in matrix A.

Let us examine X^TX. One element of X^TX may be written as

(X^TX)_{ij} = \sum_rx_{ri}x_{rj}.

The term x_{ri}x_{rj} expands out to

x_{ri}x_{rj} = [(\sum_qz_{rq}a_{qi}) + \mu_i][(\sum_qz_{rq}a_{qj}) + \mu_j]

= (\sum_qz_{rq}a_{qi})(\sum_qz_{rq}a_{qj}) + \mu_i(\sum_qz_{rq}a_{qj}) + \mu_j(\sum_qz_{rq}a_{qi}) + \mu_i\mu_j.

Expanding the first term above, we have

(\sum_qz_{rq}a_{qi})(\sum_qz_{rq}a_{qj}) = \sum_qz_{rq}a_{qi}z_{rq}a_{qj} + \sum_q\sum_{s\ne q}z_{rq}a_{qi}z_{rs}a_{sj}.

We may thus write

(X^TX)_{ij}& = \sum_rx_{ri}x_{rj} = \sum_r[\sum_qz_{rq}a_{qi}z_{rq}a_{qj} + \sum_q\sum_{s\ne q}z_{rq}a_{qi}z_{rs}a_{sj} + \mu_i(\sum_qz_{rq}a_{qj}) + \mu_j(\sum_qz_{rq}a_{qi}) + \mu_i\mu_j]\\
& = \sum_r\sum_qz_{rq}a_{qi}z_{rq}a_{qj} + \sum_r\sum_q\sum_{s\ne q}z_{rq}a_{qi}z_{rs}a_{sj} + \mu_i\sum_r\sum_qz_{rq}a_{qj} + \mu_j\sum_r\sum_qz_{rq}a_{qi} + \sum_r\mu_i\mu_j\\
& = \sum_qa_{qi}a_{qj}\sum_rz_{rq}^2 + \sum_q\sum_{s\ne q}a_{qi}a_{sj}\sum_rz_{rq}z_{rs} + \mu_i\sum_qa_{qj}\sum_rz_{rq} + \mu_j\sum_qa_{qi}\sum_rz_{rq} + m\mu_i\mu_j,

where m is the number of markers.

However, since the components of the random variable from which z is assumed to be drawn are independent standard normal random variables, we may make the following estimates:

& \sum_rz_{rq}^2 = m\\
& \sum_rz_{rq}z_{rs} = 0 \quad (q \ne s)\\
& \sum_rz_{rq} = 0

Thus, we may estimate element (i, j) of X^TX as

(X^TX)_{ij} = m\sum_qa_{qi}a_{qj} + m\mu_i\mu_j,


or, dividing by m,

\frac{(X^TX)_{ij}}{m} = \sum_qa_{qi}a_{qj} + \mu_i\mu_j.

Note that

(A^TA)_{ij} = \sum_qa_{qi}a_{qj};

thus, we have the following estimate:

\frac{X^TX}{m} = A^TA + \mu^T\mu,

where A^TA is the covariance matrix of the multivariate distribution of x and \mu^T\mu is an outer product of the sample averages.


Note that \frac{X^TX}{m} - \mu^T\mu is a Wishart matrix based on (x_{rj} - \mu_j). The distribution of this random matrix is a generalization of the chi-square distribution.
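These estimates can be checked by simulation (a NumPy sketch; the matrix A, the vector \mu, and the marker count m here are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, m = 3, 3, 500_000

A = rng.normal(size=(k, n))          # hypothetical k by n matrix A
mu = rng.normal(size=n)              # hypothetical expected-value vector mu

Z = rng.normal(size=(m, k))          # rows are independent standard normals
X = Z @ A + mu                       # each row follows x = zA + mu

# For large m, X^T X / m approaches A^T A plus the outer product of mu
assert np.allclose(X.T @ X / m, A.T @ A + np.outer(mu, mu), atol=0.1)
```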

3.7.7. Centering by Marker vs. Not Centering by Marker

In the EIGENSTRAT technique (Formulas for Principal Component Analysis), the data is centered by marker; that is, the average of the data within a given marker is first subtracted from each of that marker’s data values, both before they are used for the X^TX matrix and before they are PCA-corrected, with the averages being added back in after correction.

An alternative approach, available in SVS for numeric or copy number data, is to not center by marker, but instead allow the overall average value of each marker to be a part of the principal components analysis.

Two arguments can be made that this approach is preferable to centering by marker when segmenting copy number data. First, if a variation in the data due to a batch effect is “corrected out” using centering by marker, a slight displacement over the remaining markers will ensue. If this displacement is large enough, it might be interpreted as a variation large enough to require a segment boundary, even over samples where there should not be one.

Second, leaving the marker average in the principal components analysis may result in overall data smoothing and removing unnecessary segment boundaries.

On the other hand, a true copy-number variation over enough samples may influence the marker average, which, under this approach, could be picked up as a large component or as large parts of two or three large components by principal components analysis. This might result in apparent variations large enough for segment boundaries for samples for which there were no actual variations. However, the good news is that the apparent variations would be in the opposite direction from the true copy-number variations.

Regarding data centering by marker, it should be noted that after the marker averages are subtracted out, we have \sum_ix_{ri} = 0, where x_{ri} is the element of X corresponding to the r-th marker and the i-th sample. Since

(X^TX)_{ij} = \sum_rx_{ri}x_{rj},

for any pair of samples i and j, we have

\sum_i(X^TX)_{ij} = \sum_i\sum_rx_{ri}x_{rj} = \sum_r\sum_ix_{ri}x_{rj} = \sum_r(\sum_ix_{ri})x_{rj} = 0.

Multiplying X^TX by the constant vector (1,...,1)^T therefore yields the zero vector, demonstrating that when data centering by marker is used, one of the eigenvalues is zero, corresponding to an “average vector” of constant elements.
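A quick numerical check of this fact (a NumPy sketch with hypothetical random data):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 100, 6
X = rng.normal(size=(m, n))
X -= X.mean(axis=1, keepdims=True)   # data centering by marker

W = X.T @ X
ones = np.ones(n)
assert np.allclose(W @ ones, 0)      # the constant vector maps to zero

# So one eigenvalue is (numerically) zero, with a constant eigenvector
eigvals, eigvecs = np.linalg.eigh(W)
assert abs(eigvals[0]) < 1e-8
assert np.allclose(np.abs(eigvecs[:, 0]), 1 / np.sqrt(n))
```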

If data is not centered by marker, but instead marker averages are incorporated into the components, this zero eigenvalue will not be present. Instead, if the data’s averages are indeed not zero, one “extra” eigenvalue will show up, often among the larger eigenvalues, before the eigenvalues start becoming very similar to the eigenvalues (with subscripts one smaller) that occur when data centering by marker is used.

3.7.8. Centering by Sample vs. Not Centering by Sample

A second alternative approach to PCA, available in SVS for numeric or copy number data, is to center by sample, either in addition to centering by marker or instead of centering by marker. This entails effectively subtracting out the sample average before contributing to the matrix X^TX (and adding it back in while correcting). (In SVS, if centering by both sample and marker, the overall average is subtracted out only once.)
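The double-centering arithmetic can be sketched as follows (a NumPy illustration with hypothetical data, not the actual SVS implementation):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(3, 4))            # 3 markers by 4 samples, hypothetical

# Center by both marker and sample; the overall average is subtracted once
Xc = (X
      - X.mean(axis=1, keepdims=True)  # subtract marker averages
      - X.mean(axis=0, keepdims=True)  # subtract sample averages
      + X.mean())                      # add the overall average back once

assert np.allclose(Xc.mean(axis=1), 0)  # every marker average is zero
assert np.allclose(Xc.mean(axis=0), 0)  # every sample average is zero
```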

Whether it is appropriate to subtract out sample averages and add them back in after correction, rather than including the sample averages as part of what may be corrected, depends upon how much of the batch effect or stratification may be built into the various sample averages and their products vs. built into the variation across both markers and samples.

An example of when centering by sample can be useful is when obtaining components of copy number data from chromosomes 1 through 22 and then applying these to copy number data for chromosomes X and Y (see Applying PCA to a Superset of Markers). Centering by sample ensures that the different averages of X chromosome intensities between men and women, and the different averages of Y chromosome intensities between men and women, will be preserved through PCA correction.

3.7.9. Applying PCA to a Superset of Markers

Sometimes, it is desirable to obtain principal components for a given set of samples from one set of markers, and then apply these components to a different set of markers. One example is, within a given set of samples, obtaining the components from a set of markers with a known population structure and then applying these to genotypic data over other markers. Another is the example mentioned above, which is to, for a given set of samples, obtain components of copy number data from chromosomes 1 through 22 and then apply these to copy number data for chromosomes X and Y.

Whether applying PCA to a superset of markers is reasonable to do is governed by whether we can assume that the behavior of the data over the different set of markers, especially the parts of the data contributing to population structure or batch effects, will be governed by essentially the same variance-covariance matrix (and the same set of sample averages, unless data centering by sample is being used) that the original data is governed by. Under this assumption, the same eigenvectors that describe the original data’s population structure or batch effects will also govern, and may be used to remove, the “new” data’s population structure or batch effects.

3.7.10. Applying PCA to a Subset of Samples

It is also sometimes desirable to obtain components for one set of samples, then use these components to correct data from a subset of these samples, either for the same markers or for a superset of markers. To do this, a subset of the components is first created to match the data’s sample subset; this component subset is then applied to the data.

Here again, if we are correcting over any superset of the markers from which we obtained principal components, we assume consistency in the variance-covariance matrix (and the sample averages, if applicable) over this superset of markers.

However, now we must look at whether two components which are orthogonal in the main sample set are still orthogonal in the sample subset. Suppose (a^T b^T)^T and (c^T d^T)^T are two components of the full set, where a and c are in the original sample set but not the sample subset, and b and d are in the sample subset. (Without loss of generality, we are writing the vector as if all of the elements in the subset follow after all of the elements not in the subset.) Because these are eigenvectors of X^TX, they will be orthogonal. That is, their dot product will be zero.

This dot product may be written a\cdot c + b\cdot d. We see that if the individual dot products, both outside of and inside of the subset, are zero, then the full dot product will be zero. Thus, components that are orthogonal in the subset and its complement will be orthogonal for the overall set.

Unfortunately, the converse is not generally true. Because the two dot products must sum to zero, any deviation from zero of the dot product over the subset is exactly compensated by an equal and opposite deviation over the data that has not been subset out. On that basis, we could argue that neither dot product “should” deviate too much from zero, and thus that the components as input “should” be more or less orthogonal over the subset, especially if the subset is of a reasonably large size.

However, because the subset components are not precisely orthogonal except in special cases, a modified Gram-Schmidt process is applied by SVS to make the components that belong to the subset orthonormal (both orthogonal and normalized). Effectively, this is equivalent to “obtaining a PCA correction” for every component v_k based upon the previous components v_1, v_2, …, up to v_{k-1}.
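The orthonormalization step can be sketched as follows (a NumPy illustration with hypothetical components; the function name orthonormalize is ours, not part of SVS):

```python
import numpy as np

def orthonormalize(components):
    """Modified Gram-Schmidt: 'PCA-correct' each v_k against the
    already-processed v_1 ... v_{k-1}, then normalize it."""
    out = []
    for v in components:
        v = v.astype(float).copy()
        for u in out:
            v -= (u @ v) * u          # remove the part along u
        out.append(v / np.linalg.norm(v))
    return np.array(out)

# Components orthonormal over 8 samples, restricted to a 5-sample subset;
# the restricted components are generally no longer orthogonal
rng = np.random.default_rng(2)
full, _ = np.linalg.qr(rng.normal(size=(8, 3)))
subset = [full[:5, k] for k in range(3)]

Q = orthonormalize(subset)
assert np.allclose(Q @ Q.T, np.eye(3))   # orthonormal over the subset
```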