Large Kinship Matrices or Large Numbers of Samples

If there are more than 8,000 samples (for 64-bit machines; 5,500 for 32-bit machines), the memory required to compute mixed models becomes substantial, or even prohibitive. Additionally, with this many samples, computation of eigenvalues and eigenvectors for methods such as GBLUP (Performing GBLUP Analysis) begins to take much longer. Thus, for GBLUP and for K-Fold cross validation using GBLUP, SVS offers a large-data mode for some of the computations and intermediate data storage. This mode consists of:

  • Storing intermediate matrix data in scratch spreadsheets. The data for one matrix may be stored either in a single spreadsheet or in many separate spreadsheets, each containing a portion of the data small enough to be held conveniently in (random-access) memory.
  • Computing basic matrix operations piece by piece: sections of the data are read from the spreadsheet (or spreadsheets) as necessary, computations are made using these sections and, if the computational output is itself a new large matrix, the results are written section by section to a new large spreadsheet or to small separate spreadsheets. (A minimal sketch of this blocked approach appears after this list.)
  • In order to find a matrix B (Solving the Mixed Model Equation) based on the matrix H = K + \delta I, \delta itself is first determined, then a Cholesky decomposition of matrix H is performed using large-data matrix techniques. This avoids the need to determine the eigenvalues and eigenvectors of the kinship matrix K itself.
  • To find the eigenvalues of S(K+I)S and their corresponding eigenvectors, SVS uses an adaptation of the methodology of [Halko2010] (see Decomposition of Large Matrices Using Random Numbers below), along with the large-data techniques mentioned above.
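As a minimal illustration of the blocked approach referred to above, the following NumPy sketch computes a matrix product piece by piece, with the large matrix held on disk. Here np.memmap stands in for SVS's scratch spreadsheets, and all sizes and the file name are toy values chosen for illustration; this is not the actual SVS implementation:

    import numpy as np

    # Toy sizes: A is m x m and lives on disk; Omega is small enough for RAM.
    m, k, block = 2_000, 16, 500

    # np.memmap stands in for a scratch spreadsheet holding the large matrix A.
    A = np.memmap("A_scratch.dat", dtype=np.float64, mode="w+", shape=(m, m))
    rng = np.random.default_rng(0)
    for start in range(0, m, block):                  # fill A one row block at a time
        A[start:start + block, :] = rng.standard_normal((block, m))
    A.flush()

    Omega = rng.standard_normal((m, k))               # kept entirely in memory
    Y = np.empty((m, k))
    for start in range(0, m, block):                  # read a section, compute, store
        Y[start:start + block, :] = A[start:start + block, :] @ Omega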

Decomposition of Large Matrices Using Random Numbers

The authors of [Halko2010] describe a technique using random numbers for finding an orthonormal matrix Q whose range approximates the range of another matrix A of a much larger size. This matrix Q may then be used to (for instance) approximate the largest eigenvalues and their corresponding eigenvectors for a symmetric matrix A. With only one pass over A, this technique brings the overall error of expressing matrix A using its approximate largest eigenvalues and eigenvectors down to the same order of magnitude as the largest of the next-smaller eigenvalues which were not approximated. [Halko2010] also describes a method of increasing accuracy, involving repeatedly post-multiplying A by Q, for use when the eigenvalues do not decrease in value very quickly and/or large data needs to be dealt with.

Golden Helix SVS adapts this technique, using two layers of iteration to approximate all eigenvalues and eigenvectors of such a large-size matrix A. The top layer of iteration finds batches of the largest eigenvalues of first A, then of A - UDU', where UDU' is the eigenvalue/eigenvector decomposition formed from the largest eigenvalues of A, then of A - UDU' - U_2D_2U_2', where U_2D_2U_2' is the eigenvalue/eigenvector decomposition formed from the largest eigenvalues of A - UDU' (which are the next-largest set of eigenvalues of A), and so on. The second layer uses the [Halko2010] technique to obtain estimates of the largest eigenvalues and eigenvectors of A (or of A - UDU', etc.).

Approximating the Largest Eigenvalues in a Matrix

Suppose matrix A is a (symmetric) positive semidefinite m \times m matrix for which we need to find approximations to the k largest eigenvalues and their eigenvectors. The basic technique described in [Halko2010] is as follows:

  • Create a (“tall and slender”) m \times k matrix \Omega which contains elements from the Standard Normal Distribution (normally distributed with mean 0 and variance 1). Presumably, dimension k is small enough so that this matrix fits comfortably into (random-access) memory.

  • Create the m \times k matrix Y as the matrix product A\Omega.

  • Find the set of orthonormal vectors which span the same space as the columns of Y, using, for instance, a QR decomposition or the Gram-Schmidt procedure. Call the matrix containing this set Q.

  • Now, compute B = Q'AQ and find the eigenvalues and eigenvectors of B. Since B is only k \times k and fits very easily into (random-access) memory, this is a much more straightforward and quickly-solved problem than that of finding the eigenvalues and eigenvectors of A itself.

    Call the matrix of resulting eigenvectors U and the matrix containing the eigenvalues on its diagonal (and zeros elsewhere) \Lambda, so that we have

    B = U\Lambda U' .

  • Now, we approximate A as

    A \approx QU \Lambda U'Q' = QBQ' .

  • To increase accuracy, [Halko2010] suggests modifying this algorithm as follows (the “power method”):

    • Follow the first three steps above to create a Q matrix, call this Q_1.
    • Create matrix Y_2 = A Q_1, then find the set of orthonormal vectors that span the same space as the columns of Y_2, and call this Q_2.
    • Repeat the above step q times, to get Q_q.
    • Now, find the eigenvalues and eigenvectors of B = Q_q' A Q_q and follow the last two steps above to approximate A.
  • From equation (6.5) of [Halko2010], we see that the error of either of these two methods (the original method or the power method) is bounded approximately as

    \left\| A - Q_q U \Lambda U' Q_q' \right\| \lessapprox (k m)^{1/(2q)} \lambda_{k+1} ,

    where \lambda_{k+1} is the (k+1)-st largest eigenvalue of A, and for the original method, we take q = 1.

  • As an alternative method of increasing accuracy, which may be used at the same time as the “power method” described above, the authors of [Halko2010] suggest using “oversampling”, meaning that to obtain the k largest eigenvalues and their eigenvectors, a (“tall and slender”) matrix \Omega of dimensions m \times k_{ov} is used, where k_{ov} > k. (This makes matrix Q, and matrices Q_1, etc., if used, also of dimensions m \times k_{ov}.) Normally, k_{ov} does not need to be greater than 2k. (A minimal sketch of the overall procedure appears after this list.)
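The following NumPy sketch illustrates the basic [Halko2010] procedure described above, including the optional oversampling and power-method passes. The function name, the toy matrix sizes, and the choice to keep A fully in memory (rather than in SVS's large-data storage) are illustrative assumptions, not the SVS implementation itself:

    import numpy as np

    def approx_top_eigs(A, k, q=1, k_ov=None, rng=None):
        """Approximate the k largest eigenvalues/eigenvectors of a symmetric
        positive semidefinite matrix A using the randomized technique of
        [Halko2010], with q power-method passes and oversampling width k_ov."""
        rng = np.random.default_rng() if rng is None else rng
        k_ov = k if k_ov is None else k_ov
        m = A.shape[0]

        Omega = rng.standard_normal((m, k_ov))   # "tall and slender" Gaussian matrix
        Q, _ = np.linalg.qr(A @ Omega)           # orthonormal basis for range(A @ Omega)
        for _ in range(q - 1):                   # extra power-method passes (q total)
            Q, _ = np.linalg.qr(A @ Q)

        B = Q.T @ A @ Q                          # small k_ov x k_ov problem
        w, U = np.linalg.eigh(B)                 # eigh returns ascending eigenvalues
        w, U = w[::-1], U[:, ::-1]               # reorder so the largest come first
        return w[:k], Q @ U[:, :k]               # A is approximated by (QU) diag(w) (QU)'

    # Toy comparison against the exact leading eigenvalues.
    rng = np.random.default_rng(1)
    G = rng.standard_normal((500, 200))
    A = G @ G.T / 200                            # symmetric PSD test matrix
    vals, _ = approx_top_eigs(A, k=10, q=3, k_ov=20, rng=rng)
    print(np.round(vals, 3))
    print(np.round(np.sort(np.linalg.eigvalsh(A))[::-1][:10], 3))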

Finding All the Eigenvalues of Matrix S(K+I)S

The above technique works reasonably well for a matrix whose eigenvalues descend in value quickly, or reach almost zero quickly. However, a typical kinship matrix will have many or most of its eigenvalues on the order of magnitude of one. Similarly, the non-zero eigenvalues of the matrix we are interested in, namely S(K+I)S, will often range from a little over 1 to 5 or more. (The number of zero eigenvalues is exactly one more than the number of covariates you have specified.) This renders the above [Halko2010] technique only marginally useful by itself.

A further difficulty is that the EMMA algorithm needs the estimates of the smaller non-zero eigenvalues of S(K+I)S, and of their eigenvectors, to be essentially as precise as the estimates of the larger eigenvalues and their eigenvectors. This is because, for each non-zero eigenvalue, part of the algorithm divides by an expression derived from that eigenvalue, and this expression may come close to zero. (That expression is the eigenvalue minus one plus the value of \delta, and \delta may become as small as 10^{-5}.) (See Finding the Variance Components.)
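As a small numeric illustration of this sensitivity (the values below are hypothetical): if a non-zero eigenvalue of S(K+I)S is only slightly above one, an estimation error that would be negligible for a large eigenvalue can change the divided-by expression by an order of magnitude.

    # Hypothetical values illustrating why small eigenvalues must be precise:
    # EMMA divides by (eigenvalue - 1 + delta), which can be close to zero.
    delta = 1e-5
    true_eig = 1.0001                 # a small non-zero eigenvalue, just above one
    est_eig = true_eig + 1e-3         # an estimate that is off by only 1e-3

    print(1.0 / (true_eig - 1.0 + delta))   # ~9,091
    print(1.0 / (est_eig - 1.0 + delta))    # ~901  -- an order of magnitude smaller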

Therefore, in Golden Helix SVS, we make the following enhancement (a sketch of the overall batch-and-deflate loop appears after this list):

  • Start with the original kinship matrix K.

  • Create the (positive definite) matrix \alpha_1 I + S(K + \alpha_2 I)S, where S = I - X(X'X)^{-1}X', and where, in SVS, we use the value 0.25 for both \alpha_1 and \alpha_2. Call this matrix A.

    Note

    We avoid actually storing matrix S by itself anywhere; instead, we simply use large-data techniques to perform the entire operation of finding \alpha_1 I + S(K + \alpha_2 I)S with a minimum of memory usage.

    Note

    Mathematically speaking, the eigenvalues of \alpha_1 I + S(K + \alpha_2 I)S that are equal to \alpha_1 correspond to, and have the same eigenvectors as, the zero eigenvalues of S(K+I)S. Each of the remaining eigenvalues of \alpha_1 I + S(K + \alpha_2 I)S may be converted to the corresponding remaining eigenvalue of S(K+I)S, for which the corresponding eigenvector is the same, by adding (1 - \alpha_1 - \alpha_2).

  • Use the power-method-with-oversampling variation of the above-mentioned [Halko2010] procedure on matrix A. Use matrices \Omega and Q of width k_{ov}, where k_{ov} is set so that \Omega and Q fit comfortably into (random-access) memory. However, use the following procedure to obtain the best accuracy with minimum effort from this method:

    • Follow the first three steps of Approximating the Largest Eigenvalues in a Matrix (create \Omega, create Y = A\Omega, and orthonormalize) to create a Q matrix, which we call Q_0.

    • Create matrix Y_1 = A Q_0. However, do not yet orthonormalize Y_1.

    • Now, form B_0 = Q_0' A Q_0 = Q_0' Y_1, and find just the eigenvalues of B_0.

    • Then, orthonormalize Y_1 to get Q_1, form Y_2 = A Q_1, form B_1 = Q_1' A Q_1 = Q_1' Y_2, and obtain the eigenvalues of B_1.

    • If enough of the highest eigenvalue estimates have changed slowly enough between those of B_0 and those of B_1, accept these estimates as being the highest eigenvalues of both matrix B_1 and matrix A. “Enough” and “slowly enough” are explained in the next bullet below. Otherwise, repeat the step above followed by this step, except compute Y_3 from Q_2 using an extra scan through the matrix A, as Y_{3a} = A Q_2 and Y_3 = A Y_{3a}, and so on for Y_4, Y_5, etc.

    • To explain the words “enough” and “slowly enough” above: After each iteration r of the step where B_r is formed, compare the magnitudes of the differences between eigenvalue estimates between step r and step r-1, starting with the largest eigenvalue and working through the smaller eigenvalues. As soon as the magnitude of the difference between two estimates of a given eigenvalue is larger than a small \epsilon (you may specify 10^{-3}, 10^{-5}, or 10^{-7}), stop comparing. Call the number of eigenvalues of B_r for which this comparison passed k_r (where k_0 = 0). If r \ge 1 and k_r (r-1) < k_{(r-1)} r, do no further iterations, and “accept” the largest k_{final} = k_r eigenvalues from iteration r.

      Note

      “Accepting” k_{final} eigenvalues from r iterations is equivalent to finding k_{final} eigenvalues using a “power method” with q=r extra passes and an “oversampling” parameter of k_{ov}.

      Note

      The idea behind the stopping criterion is that if r > 1 and \frac{k_r}{r} \ge \frac{k_{(r-1)}}{(r-1)}, the average number of eigenvalues accepted per iteration has increased; thus it must have been worthwhile to perform the current iteration, and it presumably will be worthwhile to continue iterating.

      Additionally, no matter what the value of r, stop iterating if k_{ov} - f eigenvalues have been estimated satisfactorily, where f is one more than the number of covariates you have specified, and “accept” all of the largest k_{final} = k_{ov} - f eigenvalues.

      Note

      It may turn out that the above algorithm, if followed strictly, would not stop until after a very high number of iterations r, or not even stop at all. Alternatively, it may stop after only a very tiny number of eigenvalues are “accepted”.

      For these reasons, when using this large-data EMMA algorithm, SVS allows you to specify the limit (q) for the number of iterations r. In addition, SVS imposes a maximum oversampling ratio ((k_{ov}/k)_{max}) of two.

      The meaning of these restrictions is that in the event that SVS reaches q iterations, this part of the algorithm is stopped and k_{final}=k_{ov}/2 eigenvalues are “accepted”, no matter what.

      Also, when the algorithm stops earlier (r < q), at least k_{final} = \frac{r k_{ov}}{q (k_{ov}/k)_{max}} = \frac{r k_{ov}}{2q} eigenvalues are “accepted”. Hopefully, this will be fewer than the number of eigenvalues k_r that the algorithm has already (tentatively) “accepted”. If not, this will have the effect of pushing the algorithm forward at the minimum rate of \frac{k_{ov}}{2q} per iteration.

      With these restrictions (maximum power-method iterations and maximum oversampling ratio), a maximum execution time can be guaranteed.

    • Now that we have “accepted” k_{final} eigenvalues from iteration r of the algorithm above, compute the eigenvectors V_r of B_r. Designate the matrix with the diagonal containing the highest k_{final} of the eigenvalues of B_r (the ones that we “accepted”) to be \Lambda_{Br1}, and the eigenvectors corresponding to these “accepted” eigenvalues V_{r1}. Use these to approximate the highest k_{final} eigenvalues of A (diagonal of matrix \Lambda_1) and their corresponding eigenvectors U_1 as

      \Lambda_1 = \Lambda_{Br1}

      and

      U_1 = Q_r V_{r1} .

      Since matrix A may be written as

      A = U_1 \Lambda_1 U_1'  +  U_2 \Lambda_2 U_2' ,

      where \Lambda_2 and U_2 represent the remaining eigenvalues and eigenvectors after \Lambda_1 and U_1, consider matrix A as decomposed into

      A = Q_r V_{r1} \Lambda_{Br1} V_{r1}' Q_r'  +  U_2 \Lambda_2 U_2' .

  • If m - f eigenvalues and eigenvectors have been obtained (for matrix A, and thus for matrix S(K+I)S), stop the algorithm. This many eigenvalues and eigenvectors are all that are required by the EMMA technique.

    Otherwise, use large-data techniques to form a new matrix A by subtracting off the first part of our decomposition, as

    A_{new} = A_{old} - Q_r V_{r1} \Lambda_{Br1} V_{r1}' Q_r' .

    Then, repeat the above variation of the power method along with this step.

    Note

    Because the original matrix A was formed from the actual (positive semidefinite) matrix we are interested in plus \alpha_1 I, where \alpha_1 is a smallish positive number, performing the above subtraction will leave behind all of the remaining eigenvalues and eigenvectors of matrix A, while eigenvalues of A_{new} corresponding to the eigenvectors that we have just found will essentially be zero, thus keeping them out of the way of later calculations.

  • If m minus the number of eigenvalues thus far obtained is less than k_{ov}, appropriately reduce the width of the \Omega and Q matrices.

    Note

    In addition, under this circumstance, when the algorithm stops, whether or not q iterations are reached, all remaining eigenvalues (out of (m-f) eigenvalues) are “accepted”, because all the information related to the remaining (non-zero) eigenvalues should have been effectively contained in the \Omega and Q matrices.
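The sketch below (referred to at the start of this list) shows the overall shape of this batch-and-deflate scheme in NumPy. It is a simplification made for illustration: the whole batch is accepted once all of its eigenvalue estimates have stabilized (rather than SVS's partial-acceptance rule and iteration/oversampling limits), the deflation is done on an in-memory copy of A rather than with large-data techniques, and all names and default parameters are assumptions:

    import numpy as np

    def all_eigs_by_deflation(A, batch=50, q_max=12, eps=1e-5, n_skip=0):
        """Estimate eigenpairs of a symmetric PSD matrix A in batches of the
        largest remaining eigenvalues, deflating each accepted batch out of A.
        n_skip is the number of (known-zero) eigenvalues that are not needed."""
        rng = np.random.default_rng(0)
        m = A.shape[0]
        A_work = A.copy()                 # SVS would update scratch storage instead
        all_vals, all_vecs = [], []
        n_found = 0

        while n_found < m - n_skip:       # top layer: one batch per pass
            k_ov = min(batch, m - n_skip - n_found)
            Omega = rng.standard_normal((m, k_ov))
            Q, _ = np.linalg.qr(A_work @ Omega)

            prev = None
            for _ in range(q_max):        # second layer: power-method iterations
                Y = A_work @ Q
                w = np.sort(np.linalg.eigvalsh(Q.T @ Y))[::-1]
                if prev is not None and np.all(np.abs(w - prev) < eps):
                    break                 # simplified "accept the whole batch" test
                prev = w
                Q, _ = np.linalg.qr(Y)

            B = Q.T @ (A_work @ Q)        # SVS would reuse the last product here
            w, V = np.linalg.eigh(B)
            w, V = w[::-1], V[:, ::-1]
            U = Q @ V                     # approximate eigenvectors of A_work

            all_vals.append(w)
            all_vecs.append(U)
            n_found += k_ov
            A_work = A_work - (U * w) @ U.T   # deflate: subtract U diag(w) U'

        return np.concatenate(all_vals), np.hstack(all_vecs)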

Large N Performance Tradeoffs

As mentioned earlier, when dealing with many samples, computation of eigenvalues and eigenvectors for GBLUP (Performing GBLUP Analysis) takes much longer. Thus, when using large-data mode for GBLUP (and for K-Fold cross validation using GBLUP), approximations must be made that trade accuracy for speed and computability. Therefore, when you use either of these features with more than 8,000 samples (5,500 for 32-bit systems), and before you are given the standard options relating to the feature, you are given a choice of which large-data approach to use from the following set of options:

Mode       Approach   Max Iterations per Batch   Early Exit Precision
Slow       Large N    30                         10^{-7}
Medium     Large N    12                         10^{-5}
Quick      Large N    5                          10^{-5}
Quickest   Large N    2                          10^{-3}
Exact      Small N    N/A                        N/A

“Max Iterations per Batch” refers to the maximum number q of iterations to use (see the fifth note in Finding All the Eigenvalues of Matrix S(K+I)S), and “Early Exit Precision” refers to the tolerance to use (see the bullet before the third note in Finding All the Eigenvalues of Matrix S(K+I)S).

Note

If you separately compute a Genomic Relationship Matrix (Separately Computing the Genomic Relationship Matrix) from many samples, SVS will automatically go into large-data mode, but still compute the exact Genomic Relationship Matrix in a reasonable amount of time.

Summary of Performance Tradeoffs

  • If you have more than 8,000 samples (5,500 samples for 32-bit systems), you will be asked to choose your data analysis mode before starting analysis of either GBLUP or of K-Fold cross validation using GBLUP.
  • Before entering your project and starting your calculations, set the “Transpose and analysis memory usage” to the highest memory level that you have available (Global Product Options), ideally to somewhat more than the ideal memory usages documented in Large N Memory Considerations. This will ensure the best accuracy when you use Large Data mode.
  • For only moderately more than 8,000 samples, use the “Medium” or “Slow” Large N mode. Alternatively, if you have sufficient RAM and are willing to wait for eigenvalues to be computed (which may take several minutes) without being able to interrupt these computations, you could use the “Exact” (Small N) mode (Computation Time).
  • For a medium number of samples, such as 15,000, using the “Quick” or “Quickest” method will bring the computation time down. Also, consider using the “Exact” method if you have sufficient RAM and can wait for the eigenvalue computations (see above bullet), or using the approach (mentioned in the next bullet) of basing the kinship matrix on a subset of markers containing many fewer markers than the number of samples you have.
  • For large numbers of samples, you should definitely consider using a kinship matrix based on just one chromosome, or on a marker subset created from LD pruning. (See Alternative Approaches.)
  • If you get messages in the spreadsheet logs starting with “NOTE:” that count how many eigenvalues were approximated or were rounded up, and a significant number of eigenvalues had to be approximated in this way, consider using a slower method (one with more maximum iterations per batch) to improve accuracy, or, if possible, using more memory for the calculations (see the bullet above about memory usage).

Computation Time

With a 64-bit system you will always want to run GBLUP (and K-Fold cross validation using GBLUP) using the Exact method if enough RAM is available and you are willing to wait for eigenvalues to be computed without being able to easily stop or cancel the process. We have been able to process kinship matrices from 54,000 samples using the Exact method on a computer with 128 GB of RAM. At a certain (as yet undetermined) point, having more RAM will no longer make a difference in whether or not you can use the Exact (Small N) method, because the system will not allow the algorithm to use it.

Using a 48k simulated marker dataset, tests were performed with varying sample sizes using the above five modes. On a computer with 8 cores, 8 GB of RAM and a CPU speed of 3.1 GHz, the following table illustrates the time required for the sample sizes listed.

# Samples   Slow                 Medium               Quick                            Quickest                         Exact (Small N Algorithms)
2k          ~1 min               1 min                ~1 min                           ~1 min                           ~1 min
4k          ~10 min              7 min                5 min                            3 min                            ~2 min
8k          76 min               80 min               32 min                           19 min                           ~5 min
20k         1133 min / ~19 hrs   823 min / ~14 hrs    469 min / ~8 hrs                 234 min / ~4 hrs                 ~57 min
40k         Not computed         Not computed         3556 min / ~59 hrs / ~2.5 days   2042 min / ~34 hrs / ~1.5 days   ~5796 min / ~4 days

Based on the data accumulated, and using a cubic best fit to extrapolate (sketched below), the Quickest method should finish 100,000 samples in roughly 26 days. A faster computer with more cores could reduce this computation time.
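For illustration, the following sketch performs the kind of cubic fit and extrapolation described above, using the Quickest-mode timings from the table (with the 2k entry read as one minute). The exact day count it prints depends on how those approximate timings are read, but it lands close to the figure quoted:

    import numpy as np

    # Quickest-mode timings from the table above (samples in thousands, minutes).
    n_thousands = np.array([2.0, 4.0, 8.0, 20.0, 40.0])
    minutes = np.array([1.0, 3.0, 19.0, 234.0, 2042.0])

    coeffs = np.polyfit(n_thousands, minutes, 3)    # least-squares cubic fit
    pred = np.polyval(coeffs, 100.0)                # extrapolate to 100,000 samples
    print(f"~{pred:.0f} min, i.e. roughly {pred / (60 * 24):.0f} days")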

Precision of Results

Precision was measured by taking the variance of the difference between methods divided by the total variability. There was no difference between the Slow method and the Exact method for all sample sizes computed (up to 20k). Variability started to be introduced with the Medium method at 4,000 samples, but the variability was almost negligible. The Quickest method had the most variability for each sample size computed except for the 2,000-sample set. Of the results output, the Random Effects Components for each sample and the Normalized Absolute ASE values are those most affected by the compromises made to obtain speed increases.

The method used for these precision computations is the ratio of the measurement system variability to total variability (\rho_M),

\rho_M = \frac{\sigma^2_{Gauge}}{\sigma^2_{Total}}

where \sigma^2_{Gauge} = \left( \frac{\bar{R}}{d_2} \right)^2, \bar{R} = \mathrm{mean}\left( \left| measurement_1 - measurement_2 \right|_{N \times 1} \right), and d_2 = 1.128 because only two measurements are being compared; and where \sigma^2_{Total} = s^2, where s is the sample standard deviation of the averages of the two measurements (\bar{x}).
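A short sketch of this precision computation, assuming the two methods' results are supplied as paired vectors (the function name and the simulated inputs are illustrative):

    import numpy as np

    def gauge_precision_ratio(x1, x2):
        """rho_M: measurement-system variance over total variance, as above."""
        x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
        r_bar = np.mean(np.abs(x1 - x2))                 # \bar{R}
        d2 = 1.128                                       # d_2 for two measurements
        sigma2_gauge = (r_bar / d2) ** 2
        sigma2_total = np.var((x1 + x2) / 2.0, ddof=1)   # s^2 of the pairwise means
        return sigma2_gauge / sigma2_total

    # Hypothetical usage: the same quantity computed by two modes.
    rng = np.random.default_rng(0)
    exact = rng.standard_normal(1000)
    quick = exact + rng.normal(scale=0.01, size=1000)
    print(gauge_precision_ratio(exact, quick))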

Determining Influential Markers

Of the top 100 markers from the Exact method compared to the top 100 markers from the Quickest method in the 20k sample set, all but three overlap. Thus, it is reasonable to assume that a prediction model based on the top markers from the Quickest method would lead to conclusions very similar to one built on the Exact method. More work should be done to confirm this hypothesis.

Large N Memory Considerations

Large-data-mode GBLUP will automatically detect how much “Transpose and analysis memory usage” you have told SVS is available, and will consider that amount to be available for Large-data-mode GBLUP analysis (Global Product Options).

To ensure the best accuracy, you should allow SVS to use as much memory as possible, up to its upper limit, which is

1104 (N^{3/2})

bytes of memory for its computations, where N is the number of samples.

For the following sample sizes, this formula translates to:

# Samples   Memory
8k          ~790 MB
20k         ~3.1 GB
40k         ~8.8 GB
100k        ~34.9 GB

This limitation is to keep in check the time spent computing one of the steps of the process (namely the Gram-Schmidt orthonormalization step). However, this amount of memory is also an ideal size for these computations.

If this amount of memory is not available, the memory that is available will be used instead. The following table shows, for several cases, the percentage of the ideal memory that this represents:

# Samples   256 MB   512 MB   1 GB     8 GB     16 GB
8k          32%      65%      (100%)   (100%)   (100%)
20k         8%       16%      32%      (100%)   (100%)
40k         3%       6%       11%      91%      (100%)
100k        0.7%     1.5%     3%       23%      46%
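For reference, a short sketch that reproduces the figures in the two tables above, assuming the 1104 N^{3/2} formula and decimal (10^9-byte) gigabytes:

    def ideal_memory_bytes(n_samples):
        """Upper limit on memory used for large-data GBLUP: 1104 * N^(3/2) bytes."""
        return 1104 * n_samples ** 1.5

    for n in (8_000, 20_000, 40_000, 100_000):
        ideal = ideal_memory_bytes(n)
        pct_with_8gb = min(1.0, 8e9 / ideal)        # fraction of ideal with 8 GB free
        print(f"{n:>7} samples: ideal ~{ideal / 1e9:.1f} GB, "
              f"8 GB available gives {pct_with_8gb:.0%} of ideal")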

As stated above, having more memory available for Large-N-mode GBLUP will, in general, improve the accuracy that will be obtained.

Meanwhile, the amount of memory used will not appreciably affect the maximum possible time of execution, one way or the other, because this amount simply trades off how many eigenvalues are computed per top-layer iteration vs. how many top-layer iterations may be required at a maximum.

Comparing Machines

The Large N implementation relies heavily on the ability of the NumPy and SciPy libraries for Python to use multiple threads. In practice (see the table below), this means that the Large N method must be run on a Windows machine to perform the computation in a reasonable amount of time.

This table shows run times for GBLUP using the “Quickest” mode on two different systems:

# Samples   Windows (8 cores, 8 GB RAM)   Linux Server
2000        0.95 min                      5.4 min
4000        3.5 min                       44 min
8000        19 min                        217 min

Alternative Approaches

One suggested alternative approach has been to split up the analysis chromosome by chromosome. However, the bulk of the computation is performed along the sample-by-sample dimension, regardless of the number of markers in the dataset. This means that, for a given kinship matrix for 100,000 samples, the analysis will take roughly the same amount of time whether it covers 5,000 markers, 48,000 markers, or one million markers.

However, the computation needed to process a kinship matrix which is a GBLUP Relationship Matrix formed from many fewer markers than samples need not be as great, since, in this case, many or most of the non-zero eigenvalues of S(K+I)S are at the minimum possible value of one. The Quick method can process these minimum-possible eigenvalues up to seven times faster than would otherwise normally be possible with the Quick method, while larger performance gains are possible with the Medium and Slow methods for these minimum-possible eigenvalues. Therefore, for instance, if the kinship matrix is based on 40,000 samples and 10,000 markers, processing the entire kinship matrix with the Quick method should take about one third the time it would for the Quick method to process a kinship matrix based on 40,000 samples and 40,000 markers.
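The following sketch illustrates this point with toy sizes (a random genotype stand-in, one covariate plus an intercept): when the kinship matrix is built from many fewer markers than samples, most non-zero eigenvalues of S(K+I)S come out exactly one. The sizes and tolerances are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    N, M, n_cov = 200, 20, 1                  # samples, markers, covariates (toy sizes)

    G = rng.standard_normal((N, M))           # genotype stand-in
    K = G @ G.T / M                           # GRM-style kinship with rank <= M

    X = np.column_stack([np.ones(N), rng.standard_normal((N, n_cov))])
    S = np.eye(N) - X @ np.linalg.solve(X.T @ X, X.T)   # S = I - X(X'X)^{-1}X'

    eigvals = np.linalg.eigvalsh(S @ (K + np.eye(N)) @ S)
    n_zero = np.sum(np.abs(eigvals) < 1e-8)             # n_cov + 1 zero eigenvalues
    n_one = np.sum(np.abs(eigvals - 1.0) < 1e-8)        # at least N - M - (n_cov + 1)
    print(f"zero eigenvalues: {n_zero}; eigenvalues equal to one: {n_one}")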

To take advantage of this fact, a good alternative approach is to simply base the kinship matrix on one chromosome (to get a reasonable amount of information about the inheritance pattern) and perform the analysis on all of the markers using this kinship matrix.

Another alternative approach, which may be combined with the approach to base the kinship matrix on only one chromosome, is to base the kinship matrix on a subset of markers obtained by LD Pruning (LD Pruning).

Note

It is important to set the largest possible memory size that you have available in your system for “Transpose and analysis memory usage” (Global Product Options), up to and even beyond the upper limits mentioned in Large N Memory Considerations, in order to obtain the best possible accuracy for these calculations, and to ensure the best likelihood that this optimization actually takes place.