# Large Kinship Matrices or Large Numbers of Samples¶

If there are more than 8,000 samples (for 64-bit machines, 5,500 for 32-bit machines), memory requirements for computing mixed models begin to be substantial, or even overbearing. Additionally, with this many samples, computation of eigenvalues and eigenvectors for methods such as GBLUP (Performing GBLUP Analysis) begins to take much longer. Thus, for GBLUP and for K-Fold cross validation using GBLUP, SVS offers a large-data mode for some of the computations and intermediate data storage. This mode consists of:

• Storing intermediate matrix data in scratch spreadsheets. The data for one matrix may be stored either in one spreadsheet or, on the other hand, in many separate spreadsheets each containing only a small enough portion of the data to allow it to be held conveniently in (random-access) memory.
• Computing basic matrix operations piece by piece, reading sections of the data from the spreadsheet (or spreadsheets) as necessary, making computations using these sections of data and, if the computational outputs consist of a new large matrix, writing the computation results section by section to a new large spreadsheet or small separate spreadsheets.
• In order to find a matrix (Solving the Mixed Model Equation Using EMMA) based on the matrix , itself is first determined, then a Cholesky decomposition of matrix is performed using large-data matrix techniques. This avoids the need to determine the eigenvalues and eigenvectors of the kinship matrix itself.
• To find the eigenvalues of and their corresponding eigenvectors, SVS uses an adaptation of the methodology of [Halko2010] (see Decomposition of Large Matrices Using Random Numbers below), along with the large-data techniques mentioned above.

Note

Large-data mode is not currently available for the Average Information (AI) REML algorithm.

## Decomposition of Large Matrices Using Random Numbers¶

The authors of [Halko2010] describe a technique using random numbers for finding an orthonormal matrix whose range approximates the range of another matrix of a much larger size. This matrix may then be used to (for instance) approximate the largest eigenvalues and their corresponding eigenvectors for a symmetric matrix . This technique only needs one pass to cause the overall error of expressing matrix using its approximate largest eigenvalues and eigenvectors to be of the same magnitude as the largest of the next-smaller eigenvalues which were not approximated. [Halko2010] also describes a method of increasing accuracy involving post-multiplying by when the eigenvalues do not descend in value very quickly, and/or large data needs to be dealt with.

Golden Helix SVS adapts this technique using two layers of iteration to approximate all eigenvalues and eigenvectors of such a large-size matrix . The top layer of iteration finds batches of the largest eigenvalues of first , then of , where is the eigenvalue/eigenvector decomposition formed from the largest eigenvalues of , then of , where is the eigenvalue/eigenvector decomposition formed from the largest eigenvalues of (which are the next-largest set of eigenvalues of ), and so on. The second layer uses the [Halko2010] technique to obtain estimates of the largest eigenvalues and eigenvectors of (or of , etc.).

### Approximating the Largest Eigenvalues in a Matrix¶

Suppose matrix is a (symmetric) positive semidefinite matrix for which we need to find approximations to the largest eigenvalues and their eigenvectors. The basic technique described in [Halko2010] is as follows:

• Create a (“tall and slender”) matrix which contains elements from the Standard Normal Distribution (normally distributed with mean 0 and variance 1). Presumably, dimension is small enough so that this matrix fits comfortably into (random-access) memory.

• Create the matrix as the matrix product .

• Find the set of orthonormal vectors which take up the same space as , using, for instance, the QR approach or the Gram-Schmidt procedure. Call the matrix containing this set .

• Now, compute and find the eigenvalues and eigenvectors of . Since is only and fits very easily into (random-access) memory, this is a much more straightforward and quickly-solved problem than that of finding the eigenvalues and eigenvectors of itself.

Call the matrix of resulting eigenvectors and the matrix containing the eigenvalues on its diagonal (and zeros elsewhere) , so that we have

• Now, we approximate as

• To increase accuracy, [Halko2010] suggests modifying this algorithm as follows (the “power method”):

• Follow the first three steps above to create a matrix, call this .
• Create matrix , then find the set of orthonormal vectors that take up the same space as , and call this .
• Repeat the above step times, to get .
• Now, find the eigenvalues and eigenvectors of and follow the last two steps above to approximate .
• From equation (6.5) of [Halko2010], we see that the accuracy of either of these two methods (the original method or the power method) is

where is the - st - largest eigenvalue of , and for the original method, we take .

• As an alternative method of increasing accuracy, which may be used at the same time as the “power method” described above, the authors of [Halko2010] suggest using “oversampling”, meaning that to obtain the largest eigenvalues and their eigenvectors, use a (“tall and slender”) matrix of dimensions , where . (This will make matrix , and matrices , etc., if used, to also be of dimensions .) Normally, does not need to be greater than .

### Finding All the Eigenvalues of Matrix S(K+I)S¶

The above technique works reasonably well for a matrix whose eigenvalues descend in value quickly, or reach almost zero quickly. However, a typical kinship matrix will have many or most eigenvalues in the general order of a magnitude of one. Similarly, the non-zero eigenvalues of the matrix we are interested in, namely , will often range from a little over 1 to over 5 or more. (The number of zero eigenvalues is exactly equal to one more than the number of covariates you have specified.) This renders the above [Halko2010] technique only marginally useful by itself.

A further difficulty is that the EMMA algorithm needs the estimates of the smaller non-zero eigenvalues and their eigenvectors of to be essentially as precise as the estimates of the larger eigenvalues and their eigenvectors. This is because for each non-zero eigenvalue, part of the algorithm divides by an expression derived from that eigenvalue, which expression may come close to zero. (That expression is the eigenvalue minus one plus the value of . may become as small as ) (See Finding the Variance Components.)

Therefore, in Golden Helix SVS, we make the following enhancement:

• Create the (postive definite) matrix , where , and where, in SVS, we use the value .25 for both and . Call this matrix .

Note

We avoid actually storing matrix by itself anywhere, but we simply use large-data techniques to perform the entire operation of finding with a minimum of memory usage.

Note

Mathematically speaking, those eigenvalues of equal to correspond to, and have the same eigenvectors as, the zero eigenvalues of , while each of the remaining eigenvalues of may be converted to the corresponding remaining eigenvalues of , for which the corresponding eigenvectors are the same, by adding .

• Use the power-method-with-oversampling variation of the above-mentioned [Halko2010] procedure on matrix . Use matrices and of width , where is set so that and fit comfortably into (random-access) memory. However, use the following procedure to obtain the best accuracy with minimum effort from this method:

• Follow the first three steps of Approximating the Largest Eigenvalues in a Matrix (create , create , and orthonormalize) to create a matrix, which we call .

• Create matrix . However, do not yet orthonormalize .

• Now, form , and find just the eigenvalues of .

• Then, orthonormalize to get , form , form , and obtain the eigenvalues of .

• If enough of the highest eigenvalue estimates have changed slowly enough between those of and those of , accept these estimates as being the highest eigenvalues of both matrix and matrix . “Enough” and “slowly enough” are explained in the next bullet below. Otherwise, repeat the step above followed by this step, except compute from using an extra scan through the matrix , as and , and so on for , , etc.

• To explain the words “enough” and “slowly enough” above: After each iteration of the step where is formed, compare the magnitudes of the differences between eigenvalue estimates between step and step , starting with the largest eigenvalue and working through the smaller eigenvalues. As soon as the magnitude of the difference between two estimates of a given eigenvalue is larger than a small (you may specify , , or math:10^{-7}), stop comparing. Call the number of eigenvalues of for which this comparison passed (where ). If and , do no further iterations, and “accept” the largest eigenvalues from iteration .

Note

“Accepting” eigenvalues from iterations is equivalent to finding eigenvalues using a “power method” with extra passes and an “oversampling” parameter of .

Note

The idea behind the stopping criterion is that if and , we see the average number of eigenvalues per iteration has increased, and thus it must have been worthwhile to have the current iteration, and it presumably will be worthwhile to continue iterating.

Additionally, no matter what the value of , stop iterating if eigenvalues have been estimated satisfactorily, where is one more than the number of covariates you have specified, and “accept” all of the largest eigenvalues.

Note

It may turn out that the above algorithm, if followed strictly, would not stop until after a very high number of iterations , or not even stop at all. Alternatively, it may stop after only a very tiny number of eigenvalues are “accepted”.

For these reasons, when using this large-data EMMA algorithm, SVS allows you to specify the limit () for the number of iterations . In addition, SVS imposes a maximum oversampling ratio () of two.

The meaning of these restrictions is that in the event that SVS reaches iterations, this part of the algorithm is stopped and eigenvalues are “accepted”, no matter what.

Also, when the algorithm stops earlier (), at least eigenvalues are “accepted”. Hopefully, this will be fewer than the number of eigenvalues that the algorithm has already (tentatively) “accepted”. If not, this will have the effect of pushing the algorithm forward at the minimum rate of per iteration.

With these restrictions (maximum power-method iterations and maximum oversampling ratio), a maximum execution time can be guaranteed.

• Now that we have “accepted” eigenvalues from iteration of the algorithm above, compute the eigenvectors of . Designate the matrix with the diagonal containing the highest of the eigenvalues of (the ones that we “accepted”) to be , and the eigenvectors corresponding to these “accepted” eigenvalues . Use these to approximate the highest eigenvalues of (diagonal of matrix ) and their corresponding eigenvectors as

and

Since matrix may be written as

where and represent the remaining eigenvalues and eigenvectors after and , consider matrix as decomposed into

• If eigenvalues and eigenvectors have been obtained (for matrix , and thus for matrix ), stop the algorithm. This many eigenvalues and eigenvectors are all that are required by the EMMA technique.

Otherwise, use large-data techniques to form a new matrix by subtracting off the first part of our decomposition, as

Then, repeat the above variation of the power method along with this step.

Note

Because the original matrix was formed from the actual (positive semidefinite) matrix we are interested in plus , where is a smallish positive number, performing the above subtraction will leave behind all of the remaining eigenvalues and eigenvectors of matrix while eigenvalues of corresponding to the eigenvectors that we have just found will essentially be zero, thus keeping them out of the way of later calculations.

• If minus the number of eigenvalues thus far obtained is less than , appropriately reduce the width of the and matrices.

Note

In addition, under this circumstance, when the algorithm stops, whether or not iterations are reached, all remaining eigenvalues (out of eigenvalues) are “accepted”, because all the information related to the remaining (non-zero) eigenvalues should have been effectively contained in the and matrices.

As mentioned earlier, when dealing with many samples, computation of eigenvalues and eigenvectors for GBLUP (Performing GBLUP Analysis) takes much longer. Thus, when using large-data mode for GBLUP (and for K-Fold cross validation using GBLUP), there are approximations that have to be made to gain speed and computability in favor of accuracy. Therefore, when you use either of these features with more than 8,000 samples (5,500 for 32-bit systems), and before you are given the standard options relating to the feature, you are given a choice of which large-data approach to use from the following set of options:

Mode Approach Max Iterations per Batch Early Exit Precision
Slow Large N 30
Medium Large N 12
Quick Large N 5
Quickest Large N 2
Exact Small N N/A N/A

“Max Iterations per Batch” refers to the maximum number of iterations to use (see the fifth note in Finding All the Eigenvalues of Matrix S(K+I)S), and “Early Exit Precision” refers to the tolerance to use (see the bullet before the third note in Finding All the Eigenvalues of Matrix S(K+I)S).

Note

If you separately compute a Genomic Relationship Matrix (Separately Computing the Genomic Relationship Matrix) from many samples, SVS will automatically go into large-data mode, but still compute the exact Genomic Relationship Matrix in a reasonable amount of time.

• If you have more than 8,000 samples (5,500 samples for 32-bit systems), you will be asked to choose your data analysis mode before starting analysis of either GBLUP or of K-Fold cross validation using GBLUP.
• Before entering your project and starting your calculations, set the “Transpose and analysis memory usage” to the highest memory level that you have available (Global Product Options), ideally to somewhat more than the ideal memory usages documented in Large N Memory Considerations. This will insure the best accuracy when you use Large Data mode.
• For only moderately more than 8,000 samples, use the “Medium” or “Slow” Large N mode. Alternatively, if you have sufficient RAM, and you are willing to wait for eigenvalues to be computed (may be several minutes) without being able to interrupt these computations, you could use the “Exact” (Small N) mode (Computation Time).
• For a medium number of samples, such as 15,000, using the “Quick” or “Quickest” method will bring the computation time down. Also, consider using the “Exact” method if you have sufficient RAM and can wait for the eigenvalue computations (see above bullet), or using the approach (mentioned in the next bullet) of basing the kinship matrix on a subset of markers containing many fewer markers than the number of samples you have.
• For large numbers of samples, you should definitely consider using a kinship matrix based on just one chromosome, or on a marker subset created from LD pruning. (See Alternative Approaches.)
• If you get messages in the spreadsheet logs starting with “NOTE:” that count how many eigenvalues were approximate or were rounded up, and there were a significant number of eigenvalues that had to be approximated in this way, consider using a “more slow” method (more maximum iterations per batch) to improve accuracy, or if possible, using more memory for the calculations (see the bullet above about memory usage).

### Computation Time¶

With a 64-bit system you will always want to run GBLUP (and K-Fold cross validation using GBLUP) using the Exact method if enough RAM is available, and you are willing to wait for eigenvalues to be computed without being able to easily stop or cancel out from the process. We have been able to process kinship matrices from 54,000 samples using the Exact method on a computer with 128 GB of RAM. At a certain point (yet undetermined), having more RAM will not make a difference in whether or not you can use the Exact (small N) method, as systems will not allow it to be used for the algorithm.

Using a 48k simulated marker dataset, tests were performed with varying sample sizes using the above five modes. On a computer with 8 cores, 8 GB of RAM and a CPU speed of 3.1 GHz, the following table illustrates the time required for the sample sizes listed.

# Samples Slow Medium Quick Quickest Exact (Small N Algorithms)
2k ~1 min 1 min ~1 min ~1 min ~1 min
4k ~10 min 7 min 5 min 3 min ~2 min
8k 76 min 80 min 32 min 19 min ~5 min
20k 1133 min / ~19 hrs 823 min / ~14 hrs 469 min / ~8 hrs 234 min / ~4 hrs ~57 min
40k Not computed Not computed 3556 min / ~59 hrs / ~2.5 days 2042 min / ~34 hrs / ~1.5 days ~5796 min / ~4 days

Based on the data accumulated, the quickest method, as best-fit by a cubic distribution, should finish 100,000 samples in roughly 26 days. A faster computer with more cores could speed this computation time up.

### Precision of Results¶

Precision was measured by taking the variance of the difference between methods divided by the total variability. There was no difference between the slow method and the exact method for all sample sizes computed (up to 20k). Variability started to be introduced with the Medium method and four thousand samples, but the variability was almost negligible. The quickest method had the most variability for each sample size computed except for the 2000 sample set. Of the results output, the Random Effects Components for each sample and the Normalized Absolute ASE values are those most affected by the compromises made to obtain speed increases.

The method used for these precision computations is the ratio of the measurement system variability to total variability (),

where and and because only two measurements are being compared, and where where is the sample standard deviation of the averages of the two measurements ().

### Determining Influential Markers¶

Of the top 100 markers from the Exact method compared to the top 100 markers for the Quickest method in the 20k sample set, all but three overlap. Thus, it is reasonable to assume that a prediction model based on the top markers from the quickest method would result in a very similar conclusion to that built on the exact method. More work should be done in confirming this hypothesis.

### Large N Memory Considerations¶

Large-data-mode GBLUP will automatically detect how much “Transpose and analysis memory usage” you have told SVS is available, and will consider that amount to be available for Large-data-mode GBLUP analysis (Global Product Options).

To insure the most accuracy, you should allow SVS to use as much memory as possible, up to its upper limit, which is

bytes of memory for its computations, where is the number of samples.

For the following sample sizes, this formula translates to:

# Samples Memory
8k ~790 MB
20k ~3.1 GB
40k ~8.8 GB
100k ~34.9 GB

This limitation is to keep in check the time spent computing one of the steps of the process (namely the Gram-Schmidt orthonormalization step). However, this amount of memory is also an ideal size for these computations.

If this amount of memory is not available, the memory that is available will be used. However, this will impose percentage limitations on the amount of memory that will be used in the following cases, compared with the ideal memory:

# Samples 256 MB 512 MB 1 GB 8 GB 16 GB
8k 32% 65% (100%) (100%) (100%)
20k 8% 16% 32% (100%) (100%)
40k 3% 6% 11% 91% (100%)
100k .7% 1.5% 3% 23% 46%

As stated above, having more memory available for Large-N-mode GBLUP will, in general, improve the accuracy that will be obtained.

Meanwhile, the amount of memory used will not appreciably affect the maximum possible time of execution, one way or the other, because this amount simply trades off how many eigenvalues are computed per top-layer iteration vs. how many top-layer iterations may be required at a maximum.

### Comparing Machines¶

The Large N implementation relies heavily on the ability of the NumPy and SciPy libraries for Python to use multiple threads. This means that the Large N method must be run on a Windows machine to perform the computation in a reasonable amount of time.

This table shows run times for GBLUP using the “Quickest” mode on two different systems:

# Samples Windows 8 Core 8 GB RAM Linux Server
2000 0.95 min 5.4 min
4000 3.5 min 44 min
8000 19 min 217 min

### Alternative Approaches¶

One suggested alternative approach has been to split up the analysis in a Chromosome by Chromosome manner. However, the bulk of the computation is performed on the sample by sample dimension regardless of the number of markers in the dataset. This means that for a given kinship matrix for 100,000 samples, the amount of time it will take to perform the analysis will roughly be the same for five thousand markers, 48 thousand markers or one million markers.

However, the computation needed to process a kinship matrix which is a GBLUP Relationship Matrix formed from many fewer markers than samples need not be as much, since, in this case, many or most of the non-zero eigenvalues of are the minimum possible value of one. The Quick method can process these minimum-possible eigenvalues up to seven times faster than would otherwise normally be possible with the Quick method, while larger performance gains are possible with the Medium and Slow methods for these minimum-possible eigenvalues. Therefore, for instance, if the kinship matrix is based on 40,000 samples and 10,000 markers, processing the entire kinship matrix with the Quick method should take about one third the time it would for the Quick method to process a kinship matrix based on 40,000 samples and 40,000 markers.

To take advantage of this fact, a good alternative approach is to simply base the kinship matrix on one chromosome (to get a reasonable amount of information about the inheritance pattern) and perform the analysis on all of the markers using this kinship matrix.

Another alternative approach, which may be combined with the approach to base the kinship matrix on only one chromosome, is to base the kinship matrix on a subset of markers obtained by LD Pruning (LD Pruning).

Note

It is important to set the largest possible memory size that you have available in your system for “Transpose and analysis memory usage” (Global Product Options), up to and past the upper limits mentioned in Large N Memory Considerations, in order to have the best possible accuracy for these calculations, and to insure the best likelihood that this optimization actually takes place.