2.13.2. Large Kinship Matrices or Large Numbers of Samples¶
The Problem¶
If there are more than 8,000 samples, the memory requirements for computing mixed models become substantial, or even prohibitive. Additionally, with this many samples, computing eigenvalues and eigenvectors begins to take much longer for methods such as GBLUP using the EMMA algorithm (see Performing GBLUP Analysis) or for Principal Components Analysis (PCA) (see Genotypic Principal Component Analysis and Numeric Principal Component Analysis).
The Solution¶
Thus, for GBLUP using the EMMA algorithm, for K-Fold cross-validation using GBLUP, and for Principal Components Analysis (PCA) (Step 2: Find the Principal Components), SVS offers a large-data mode for some of the computations and intermediate data storage. This mode consists of:
Storing intermediate matrix data in scratch spreadsheets. The data for one matrix may be stored either in one spreadsheet or in many separate spreadsheets, each containing a small enough portion of the data to be held conveniently in (random-access) memory.
Computing basic matrix operations piece by piece: reading sections of the data from the spreadsheet (or spreadsheets) as necessary, making computations using these sections of data and, if the computational outputs consist of a new large matrix, writing the results section by section to a new large spreadsheet or to small separate spreadsheets.
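The piece-by-piece idea can be sketched in a few lines of NumPy. This is only an illustration: SVS stores its intermediate data in scratch spreadsheets, for which a memory-mapped file stands in here, and the helper name `chunked_matmul` is ours, not an SVS function.

```python
import numpy as np

def chunked_matmul(a_path, b, out_path, n, k, chunk=1024):
    """Multiply an n-by-n matrix stored on disk by an in-memory n-by-k
    block, writing the n-by-k product back to disk `chunk` rows at a time.
    Only one horizontal section of the big matrix is ever held in RAM."""
    a = np.memmap(a_path, dtype=np.float64, mode="r", shape=(n, n))
    out = np.memmap(out_path, dtype=np.float64, mode="w+", shape=(n, k))
    for start in range(0, n, chunk):
        stop = min(start + chunk, n)
        # read one section of A, multiply, write one section of the result
        out[start:stop] = a[start:stop] @ b
    out.flush()
```

The same read-compute-write pattern generalizes to the other basic operations (sums, transposes, Cholesky steps) used by large-data mode.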
For large-data GBLUP using EMMA and K-Fold cross-validation using GBLUP: To solve the mixed model equation (Solving the Mixed Model Equation Using EMMA), a matrix based on the kinship matrix is first determined, then a Cholesky decomposition of that matrix is performed using large-data matrix techniques. This avoids the need to determine the eigenvalues and eigenvectors of the kinship matrix itself.
To find the eigenvalues and the corresponding eigenvectors of the large matrix that must be dealt with, which is:
For GBLUP using EMMA and K-Fold cross-validation using GBLUP: the matrix S(K+I)S
For large-data Principal Components Analysis (PCA): the Wishart matrix
SVS uses an adaptation of the methodology of [Halko2010] (see Decomposition of Large Matrices Using Random Numbers below), along with the large-data techniques mentioned above.
How to Use¶
For GBLUP using the EMMA algorithm and for K-Fold cross-validation using GBLUP: If there are more than 8,000 samples, an extra dialog will pop up, asking which “mode” you wish to use (see the table shown in For GBLUP using EMMA: Large N Performance Tradeoffs), and whether you wish to use large-data techniques at all.
For Principal Components Analysis (PCA): There is a separate dialog and utility available for finding a few principal components of a dataset with a large number of samples. (This dialog and utility are available even if you have fewer than 8,000 samples.) See Finding Principal Components When the Sample Size is Large.
Note
Large-data mode is not currently available for GBLUP using the Average Information (AI) REML algorithm.
Decomposition of Large Matrices Using Random Numbers¶
The authors of [Halko2010] describe a technique using random numbers for finding an orthonormal matrix whose range approximates the range of another, much larger matrix. This matrix may then be used to (for instance) approximate the largest eigenvalues and their corresponding eigenvectors of a large symmetric matrix. The technique needs only one pass through the large matrix to bring the overall error of expressing that matrix using its approximate largest eigenvalues and eigenvectors down to the same magnitude as the largest of the next-smaller eigenvalues which were not approximated. [Halko2010] also describes a method of increasing accuracy, involving repeated multiplication by the large matrix, for use when the eigenvalues do not descend in value very quickly and/or large data needs to be dealt with.
Golden Helix SVS adapts this technique, using two layers of iteration to approximate all requested eigenvalues and eigenvectors of such a large matrix A. The top layer of iteration finds batches of the largest eigenvalues: first of A itself, then of A₂ = A − V₁Λ₁V₁ᵀ, where V₁Λ₁V₁ᵀ is the eigenvalue/eigenvector decomposition formed from the largest eigenvalues of A, then of A₃ = A₂ − V₂Λ₂V₂ᵀ, where V₂Λ₂V₂ᵀ is the eigenvalue/eigenvector decomposition formed from the largest eigenvalues of A₂ (which are the next-largest set of eigenvalues of A), and so on. The second layer uses the [Halko2010] technique to obtain estimates of the largest eigenvalues and eigenvectors of A (or of A₂, etc.).
For GBLUP using EMMA: All eigenvalues and all eigenvectors are estimated.
For Principal Components Analysis (PCA): Only the requested number of eigenvalues and eigenvectors (components) are estimated. If only a few principal components are needed, the algorithm will never need to deal with any large matrix other than the Wishart matrix itself.
Approximating the Largest Eigenvalues in a Matrix¶
Suppose matrix A is a (symmetric) positive semidefinite n×n matrix for which we need to find approximations to the ℓ largest eigenvalues and their eigenvectors. The basic technique described in [Halko2010] is as follows:
Create a (“tall and slender”) n×ℓ matrix Ω which contains elements drawn from the Standard Normal Distribution (normally distributed with mean 0 and variance 1). Presumably, the dimension ℓ is small enough that this matrix fits comfortably into (random-access) memory.
Create the matrix Y as the matrix product Y = AΩ.
Find the set of orthonormal vectors which span the same space as the columns of Y, using, for instance, the QR approach or the Gram-Schmidt procedure. Call the matrix containing this set Q.
Now, compute T = QᵀAQ and find the eigenvalues and eigenvectors of T. Since T is only ℓ×ℓ and fits very easily into (random-access) memory, this is a much more straightforward and quickly-solved problem than that of finding the eigenvalues and eigenvectors of A itself.
Call the matrix of resulting eigenvectors U and the matrix containing the eigenvalues on its diagonal (and zeros elsewhere) Λ, so that we have T = UΛUᵀ.
Now, we approximate A as A ≈ (QU)Λ(QU)ᵀ.
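The five steps above can be sketched directly in NumPy. This is a minimal in-memory illustration of the basic technique; the function name `randomized_eigh` is ours, and a real large-data implementation would read A in sections rather than holding it in RAM.

```python
import numpy as np

def randomized_eigh(A, ell, seed=0):
    """Approximate the `ell` largest eigenpairs of a symmetric PSD matrix A."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    Omega = rng.standard_normal((n, ell))   # tall-and-slender Gaussian test matrix
    Y = A @ Omega                           # sample the range of A
    Q, _ = np.linalg.qr(Y)                  # orthonormal basis spanning range(Y)
    T = Q.T @ A @ Q                         # small ell-by-ell problem
    evals, U = np.linalg.eigh(T)            # cheap dense eigendecomposition
    order = np.argsort(evals)[::-1]         # largest first
    evals, U = evals[order], U[:, order]
    return evals, Q @ U                     # approximate eigenvalues/eigenvectors of A
```

When A has rapidly decaying (or low-rank) spectrum and ℓ covers the dominant eigenvalues, the returned pairs are very close to the true ones.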
To increase accuracy, [Halko2010] suggests modifying this algorithm as follows (the “power method”):
Follow the first three steps above to create a Q matrix; call this Q₀.
Create matrix Y₁ = AQ₀, then find the set of orthonormal vectors that span the same space as the columns of Y₁, and call this Q₁.
Repeat the above step as needed, to get Q_q.
Now, find the eigenvalues and eigenvectors of T = Q_qᵀAQ_q and follow the last two steps above to approximate A.
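A sketch of this power-method variant (again illustrative only, with our own function name; in SVS the multiplications by A would be done with large-data techniques):

```python
import numpy as np

def randomized_eigh_power(A, ell, q=2, seed=0):
    """Like the basic sketch, but with q extra multiplications by A,
    re-orthonormalizing each time so the basis locks onto the dominant
    eigenspace even when the eigenvalues decay slowly."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    Q, _ = np.linalg.qr(A @ rng.standard_normal((n, ell)))   # Q_0
    for _ in range(q):
        Q, _ = np.linalg.qr(A @ Q)                           # Q_1 ... Q_q
    T = Q.T @ A @ Q
    evals, U = np.linalg.eigh(T)
    order = np.argsort(evals)[::-1]                          # largest first
    return evals[order], Q @ U[:, order]
```

Each extra pass multiplies the relative weight of the smaller eigendirections by their (sub-unit) eigenvalue ratios, which is why accuracy improves quickly with q.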
From equation (6.5) of [Halko2010], we see that the expected error of either of these two methods (the original method or the power method) is on the order of the (ℓ+1)-st largest eigenvalue of A, with a multiplicative factor that is raised to the power 1/(2q+1) and thus approaches one as the number of power iterations q grows; for the original method, we take q = 0.
As an alternative method of increasing accuracy, which may be used at the same time as the “power method” described above, the authors of [Halko2010] suggest using “oversampling”, meaning that to obtain the ℓ largest eigenvalues and their eigenvectors, use a (“tall and slender”) Ω matrix of dimensions n×(ℓ+p), where p > 0. (This will cause matrix Y, and matrices Y₁, Q₁, etc., if used, to also have dimensions n×(ℓ+p).) Normally, only a small value of p is needed.
Since the Ω matrix is somewhat like a “memory buffer”, call the “width” ℓ+p of the “tall and slender” Ω matrix the “buffer width”.
For GBLUP using EMMA: How Large Will the Buffer Width Be?¶
For GBLUP using EMMA, Golden Helix SVS attempts to use an ideal buffer width, an optimum chosen to prevent the calculations of the Gram-Schmidt step from taking excessive and unnecessary time.
If there is insufficient RAM to fit this would-be buffer width, the buffer width is reduced sufficiently to allow the calculations to fit into RAM.
For PCA: How Large Will the Buffer Width Be?¶
For PCA, Golden Helix SVS uses the maximum of
A minimal width required for the PCA computation,
vs.
An optimal width based on the number of samples.
In the unusual event that there is insufficient RAM for this would-be buffer width, the buffer width is reduced sufficiently to allow the calculations to fit into RAM.
Note that if only a small number of principal components are requested, the buffer width used will be far smaller than what would have been used for the GBLUP EMMA algorithm, thus saving calculation time.
For GBLUP using EMMA: Finding All the Eigenvalues of Matrix S(K+I)S¶
The technique in Approximating the Largest Eigenvalues in a Matrix works reasonably well for a matrix whose eigenvalues descend in value quickly, or reach almost zero quickly. However, a typical kinship matrix K will have many or most eigenvalues on the general order of a magnitude of one. Similarly, the nonzero eigenvalues of the matrix we are interested in, namely S(K+I)S, will often range from a little over 1 to over 5 or more. (The number of zero eigenvalues is exactly equal to one more than the number of covariates you have specified.) This renders the above [Halko2010] technique only marginally useful by itself.
A further difficulty is that the EMMA algorithm needs the estimates of the smaller nonzero eigenvalues of S(K+I)S and their eigenvectors to be essentially as precise as the estimates of the larger eigenvalues and their eigenvectors. This is because, for each nonzero eigenvalue, part of the algorithm divides by an expression derived from that eigenvalue, and this expression may come close to zero. (That expression is the eigenvalue, minus one, plus a parameter value that may itself become very small.) (See Finding the Variance Components.)
Therefore, in Golden Helix SVS, we start with the following steps to enhance the [Halko2010] technique for GBLUP using EMMA:
Start with the original kinship matrix K.
Create a (positive definite) matrix from S(K+I)S by scaling it and adding a small positive multiple of the identity; in SVS, the value .25 is used for both the scaling constant and the shift. Call this matrix B.
Note
We avoid actually storing the intermediate matrix by itself anywhere; we simply use large-data techniques to perform the entire operation of finding B with a minimum of memory usage.
Note
Mathematically speaking, those eigenvalues of B equal to the shift constant correspond to, and have the same eigenvectors as, the zero eigenvalues of S(K+I)S, while each of the remaining eigenvalues of B may be converted to the corresponding remaining eigenvalue of S(K+I)S, for which the corresponding eigenvector is the same, by reversing the scaling and the shift.
For PCA: Finding the Requested Eigenvalues of the Wishart Matrix¶
The technique in Approximating the Largest Eigenvalues in a Matrix works reasonably well for a matrix whose eigenvalues descend in value quickly, or reach almost zero quickly. However, a typical Wishart matrix may have eigenvalues that descend in value more slowly, or perhaps quickly at first and then slowly. This renders the above [Halko2010] technique only marginally useful by itself.
Additionally, some eigenvalues of the Wishart matrix may be zero in certain use cases.
Therefore in Golden Helix SVS, to achieve a reasonable level of accuracy without spending excessive compute time, and to guard against possible zero eigenvalues wrecking the algorithm, we start with the following steps to enhance the [Halko2010] technique:
Start with the original Wishart matrix (call it W).
Create the (positive definite) matrix W + εI, where, in SVS, we use the value 0.0002 for ε. Call this matrix B.
Note
We avoid actually storing matrix W by itself anywhere; we simply use large-data techniques to perform the entire operation of finding B with a minimum of memory usage.
Note
Mathematically speaking, those eigenvalues of B equal to ε correspond to, and have the same eigenvectors as, any zero eigenvalues of W, while each of the remaining eigenvalues of B may be converted to the corresponding remaining eigenvalue of W, for which the corresponding eigenvector is the same, by subtracting out ε.
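The effect of the shift can be seen in a few lines of NumPy (illustrative only; the rank-deficient random matrix below stands in for a Wishart matrix with zero eigenvalues):

```python
import numpy as np

eps = 0.0002                       # the shift value SVS uses for PCA
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 6))
W = X @ X.T                        # rank-6 PSD stand-in: 24 zero eigenvalues
B = W + eps * np.eye(30)           # strictly positive definite

wW = np.linalg.eigvalsh(W)         # ascending order
wB = np.linalg.eigvalsh(B)
# Every eigenvalue of B is the matching eigenvalue of W plus eps, with
# identical eigenvectors; subtracting eps back out recovers W's spectrum.
```

Because every eigenvalue of B is at least ε, the decomposition algorithm never has to contend with exactly-zero eigenvalues.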
Further Processing Steps to Find One Batch of Eigenvalues of Matrix A¶
Use the power-method-with-oversampling variation of the above-mentioned [Halko2010] procedure on matrix B. Use matrices Ω and Y of width ℓ+p, where the width is set according to For GBLUP using EMMA: How Large Will the Buffer Width Be? or For PCA: How Large Will the Buffer Width Be?. This setting will be reasonably optimal but still small enough to allow Ω and Y to fit comfortably into (random-access) memory.
However, use the following procedure to obtain the best accuracy with minimum effort from this method:
Follow the first three steps of Approximating the Largest Eigenvalues in a Matrix (create Ω, create Y, and orthonormalize) to create a Q matrix, which we call Q₀.
Create matrix Y₁ = BQ₀. However, do not yet orthonormalize Y₁.
Now, form T₁ = Q₀ᵀY₁, and find just the eigenvalues of T₁.
Then, orthonormalize Y₁ to get Q₁, form Y₂ = BQ₁, form T₂ = Q₁ᵀY₂, and obtain the eigenvalues of T₂.
If enough of the highest eigenvalue estimates have changed slowly enough between those of T₁ and those of T₂, accept these estimates as being the highest eigenvalues of both matrix B and matrix T₂. “Enough” and “slowly enough” are explained in the next bullet below. Otherwise, repeat the step above followed by this step, computing each new Y from the previous Q using an extra scan through the matrix B, as Y₃ = BQ₂ and T₃ = Q₂ᵀY₃, and so on for Y₄, T₄, etc.
To explain the words “enough” and “slowly enough” above: After each iteration of the step where a new T is formed, compare the magnitudes of the differences between the eigenvalue estimates of the previous iteration and those of the current iteration, starting with the largest eigenvalue and working through the smaller eigenvalues. As soon as the magnitude of the difference between the two estimates of a given eigenvalue is larger than a small tolerance, stop comparing eigenvalue estimates.
For GBLUP using EMMA: You may specify one of several preset precision values as the tolerance. (See the table shown in For GBLUP using EMMA: Large N Performance Tradeoffs.)
For PCA: A single fixed precision is always used as the tolerance. (Also, the Max Iterations per Batch is always 60. You might call this mode “Hyper-Slow”.)
Call the number of eigenvalues for which this comparison passed m (where m is at most the buffer width).
When the “batch stopping criteria” are met, stop iterating. The meaning of “batch stopping criteria” is explained in the next two sections.
For succeeding “batches” of eigenvalues, this procedure is repeated for B₂, B₃, etc.
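The “enough / slowly enough” comparison above can be expressed as a small helper (a hypothetical name and a sketch of the logic described, not SVS code):

```python
def count_converged(prev, curr, tol):
    """Count how many of the largest eigenvalue estimates moved by less
    than `tol` between two successive iterations, scanning from the
    largest downward and stopping at the first one that moved too much."""
    m = 0
    for a, b in zip(prev, curr):      # both lists sorted largest-first
        if abs(a - b) > tol:
            break                     # stop comparing at the first failure
        m += 1
    return m
```

The count returned here plays the role of the converged-eigenvalue count used by the batch stopping criteria in the next two sections.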
For GBLUP using EMMA: Batch Stopping Criteria¶
If, after an iteration, the average number of converged eigenvalues per iteration has failed to increase, do no further iterations, and “accept” the converged largest eigenvalues from that iteration.
Note
“Accepting” eigenvalues after some number of iterations is equivalent to finding those eigenvalues using a “power method” with that many extra passes and an “oversampling” parameter equal to the buffer width minus the number of accepted eigenvalues.
Note
The idea behind the stopping criterion is that as long as the average number of converged eigenvalues per iteration keeps increasing, it must have been worthwhile to have performed the current iteration, and it presumably will be worthwhile to continue iterating.
Additionally, no matter what the stopping criterion says, stop iterating once all of the nonzero eigenvalues have been estimated satisfactorily (the number of zero eigenvalues being one more than the number of covariates you have specified), and “accept” all of the estimated largest eigenvalues.
Note
It may turn out that the above algorithm, if followed strictly, would not stop until after a very high number of iterations, or would not stop at all. Alternatively, it may stop after only a very tiny number of eigenvalues have been “accepted”.
For these reasons, when using this large-data EMMA algorithm, SVS allows you to specify a limit (Max Iterations per Batch) on the number of iterations. In addition, SVS imposes a maximum oversampling ratio of two.
The meaning of these restrictions is that in the event that SVS reaches the maximum number of iterations, this part of the algorithm is stopped and the eigenvalues estimated so far are “accepted”, no matter what.
Also, when the algorithm stops earlier, at least half of the buffer width's worth of eigenvalues are “accepted” (a consequence of the maximum oversampling ratio of two). Hopefully, this minimum will be fewer than the number of eigenvalues that the algorithm has already (tentatively) “accepted”. If not, this will have the effect of pushing the algorithm forward at a minimum rate of half the buffer width per batch.
With these restrictions (maximum power-method iterations and maximum oversampling ratio), a maximum execution time can be guaranteed.
For PCA: Batch Stopping Criteria¶
The normal use case for PCA is:
If the number of converged eigenvalues is greater than or equal to the number of eigenvalues requested, and we are working on the first batch (which processes directly from matrix B), stop all iteration using the [Halko2010] method and “accept” the greatest eigenvalues which have been computed as the final answer.
As stated above, this is considered to be the normal use case for PCA, because (1) only a few eigenvalues are requested, so that the number of eigenvalues requested is less than the buffer width, (2) the algorithm converges for all eigenvalues requested before the maximum of 60 iterations is reached, and as a result, (3) only the very first batch, derived directly from matrix B, needs to be processed.
Another use case may sometimes happen:
Suppose (1) convergence turns out to be more difficult than normal, and thus the algorithm fails, after 60 iterations, to find all requested eigenvalues within the normal precision within the first batch, but meanwhile, (2) the buffer width is more than twice the number of requested eigenvalues.
In this case, the largest requested eigenvalues are simply “accepted” as the final answer.
In the rare case that, due either to memory considerations or because a large number of components were requested, the buffer width is less than twice the number of requested eigenvalues, or even less than the number requested itself, multiple batches may be necessitated, which will result in these use cases:
If, on any batch, the number of eigenvalues found plus the number of all previously found eigenvalues is greater than or equal to the number of eigenvalues requested, stop the algorithm and “accept” the greatest eigenvalues from this batch.
On any batch, if all possible batch eigenvalues have been found (that is, as many as the working buffer width allows), “accept” these eigenvalues and start on the next batch.
On any batch, when 60 iterations have been reached but the number of eigenvalues found, plus the number of all previously found eigenvalues, is still less than the number of eigenvalues requested, stop working on the current batch, “bolster up” the working count of found eigenvalues to at least its guaranteed minimum, and “accept” the latest estimates of the greatest eigenvalues.
Call the number of eigenvalues found or accepted when the current batch has stopped m.
Processing to Finish a Batch¶
Now that we have “accepted” m eigenvalues from iteration i of the algorithm above, compute the eigenvectors of T_i. Designate the diagonal matrix containing the highest m of the eigenvalues of T_i (the ones that we “accepted”) to be Λ₁, and the matrix of eigenvectors corresponding to these “accepted” eigenvalues U₁. Use these to approximate the highest m eigenvalues of B (the diagonal of matrix Λ₁) and their corresponding eigenvectors as

V₁ = Q_iU₁

Since matrix B may be written as

B = V₁Λ₁V₁ᵀ + V₂Λ₂V₂ᵀ

where Λ₂ and V₂ represent the remaining eigenvalues and eigenvectors after Λ₁ and V₁, consider matrix B as decomposed into these two terms.
For GBLUP using EMMA: When to Stop the Whole Algorithm¶
Once eigenvalues and eigenvectors have been obtained for all of the nonzero eigenvalues (of matrix B, and thus of matrix S(K+I)S), stop the algorithm. This many eigenvalues and eigenvectors are all that are required by the EMMA technique.
For PCA: When to Stop the Whole Algorithm¶
As stated above:
If we are working on the first batch (which processes directly from matrix B), and the number of converged eigenvalues is greater than or equal to the number of eigenvalues requested, stop all iteration using the [Halko2010] method and “accept” the greatest eigenvalues which have been computed.
If, for some reason, the algorithm needs to work on succeeding batches after the first batch, then it is the number of eigenvalues found in the current batch, plus the number of all previously found eigenvalues, which needs to be greater than or equal to the number of eigenvalues requested in order to stop the algorithm and “accept” the greatest eigenvalues from this batch to finish finding all eigenvalues.
If a New Batch Needs to Be Initiated:¶
Use large-data techniques to form a new matrix B₂ by subtracting off the first part of our decomposition, as

B₂ = B − V₁Λ₁V₁ᵀ
Then, repeat the above variation of the power method along with this step.
Note
Because the original matrix B was formed from the actual (positive semidefinite) matrix we are interested in plus εI, where ε is a smallish positive number, performing the above subtraction will leave behind all of the remaining eigenvalues and eigenvectors of matrix B, while the eigenvalues of B₂ corresponding to the eigenvectors that we have just found will essentially be zero, thus keeping them out of the way of later calculations.
If the number of eigenvalues still to be obtained is less than the buffer width, appropriately reduce the width of the Ω and Y matrices.
Note
In addition, under this circumstance, when the algorithm stops, whether or not the maximum number of iterations is reached, all remaining eigenvalues (out of the nonzero eigenvalues or, for PCA, out of the remaining requested eigenvalues) are “accepted”, because all of the information related to the remaining (nonzero) eigenvalues should have been effectively contained in the Ω and Y matrices.
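The deflation step that prepares a new batch can be illustrated with a small in-memory NumPy stand-in for what SVS does piece by piece:

```python
import numpy as np

rng = np.random.default_rng(2)
Q0, _ = np.linalg.qr(rng.standard_normal((20, 20)))
w = np.linspace(10.0, 1.0, 20)        # all eigenvalues well above zero
B = (Q0 * w) @ Q0.T                   # symmetric matrix with spectrum w

V1, L1 = Q0[:, :5], w[:5]             # the 5 "accepted" eigenpairs
B2 = B - (V1 * L1) @ V1.T             # subtract off the first part of the decomposition

wB2 = np.linalg.eigvalsh(B2)[::-1]    # descending order
# The accepted eigenvalues are now essentially zero; the rest are untouched,
# so the next batch sees only the remaining spectrum.
```

This is why the ε-shift matters: the remaining wanted eigenvalues all stay strictly above the (near-zero) deflated ones.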
For GBLUP using EMMA: Large N Performance Tradeoffs¶
As mentioned earlier, when dealing with many samples, computation of eigenvalues and eigenvectors for GBLUP (Performing GBLUP Analysis) takes much longer. Thus, when using large-data mode for GBLUP (and for K-Fold cross-validation using GBLUP), approximations have to be made that trade accuracy for speed and computability. Therefore, when you use either of these features with more than 8,000 samples, and before you are given the standard options relating to the feature, you are given a choice of which large-data approach to use from the following set of options:
Mode        Approach   Max Iterations per Batch   Early Exit Precision
Slow        Large N    30
Medium      Large N    12
Quick       Large N    5
Quickest    Large N    2
Exact       Small N    N/A                        N/A
“Max Iterations per Batch” refers to the maximum number of iterations to use (see the fifth note in For GBLUP using EMMA: Finding All the Eigenvalues of Matrix S(K+I)S), and “Early Exit Precision” refers to the tolerance to use (see the bullet before the third note in For GBLUP using EMMA: Finding All the Eigenvalues of Matrix S(K+I)S).
Note
If you separately compute a Genomic Relationship Matrix (Separately Computing the Genomic Relationship Matrix) from many samples, SVS will automatically go into large-data mode, but will still compute the exact Genomic Relationship Matrix in a reasonable amount of time.
Summary of Performance Tradeoffs¶
If you have more than 8,000 samples (5,500 samples for 32-bit systems), you will be asked to choose your data analysis mode before starting analysis for either GBLUP or K-Fold cross-validation using GBLUP.
Before entering your project and starting your calculations, set the “Transpose and analysis memory usage” to the highest memory level that you have available (Global Product Options), ideally to somewhat more than the ideal memory usages documented in Large N Memory Considerations. This will ensure the best accuracy when you use Large Data mode.
For only moderately more than 8,000 samples, use the “Medium” or “Slow” Large N mode. Alternatively, if you have sufficient RAM, and you are willing to wait for eigenvalues to be computed (which may take several minutes) without being able to interrupt these computations, you could use the “Exact” (Small N) mode (Computation Time).
For a medium number of samples, such as 15,000, using the “Quick” or “Quickest” method will bring the computation time down. Also, consider using the “Exact” method if you have sufficient RAM and can wait for the eigenvalue computations (see above bullet), or using the approach (mentioned in the next bullet) of basing the kinship matrix on a subset of markers containing many fewer markers than the number of samples you have.
For large numbers of samples, you should definitely consider using a kinship matrix based on just one chromosome, or on a marker subset created from LD pruning. (See Alternative Approaches.)
If you get messages in the spreadsheet logs starting with “NOTE:” that count how many eigenvalues were approximate or were rounded up, and a significant number of eigenvalues had to be approximated in this way, consider using a slower mode (more maximum iterations per batch) to improve accuracy, or, if possible, using more memory for the calculations (see the bullet above about memory usage).
Computation Time¶
With a 64-bit system, you will always want to run GBLUP (and K-Fold cross-validation using GBLUP) using the Exact method if enough RAM is available and you are willing to wait for eigenvalues to be computed without being able to easily stop or cancel the process. We have been able to process kinship matrices from 54,000 samples using the Exact method on a computer with 128 GB of RAM. At a certain point (as yet undetermined), having more RAM will not make a difference in whether or not you can use the Exact (Small N) method, as systems will not allow it to be used for the algorithm.
Using a 48k simulated marker dataset, tests were performed with varying sample sizes using the above five modes. On a computer with 8 cores, 8 GB of RAM and a CPU speed of 3.1 GHz, the following table illustrates the time required for the sample sizes listed.
# Samples   Slow                 Medium              Quick                            Quickest                         Exact (Small N Algorithms)
2k          ~1 min               1 min               ~1 min                           ~1 min                           ~1 min
4k          ~10 min              7 min               5 min                            3 min                            ~2 min
8k          76 min               80 min              32 min                           19 min                           ~5 min
20k         1133 min / ~19 hrs   823 min / ~14 hrs   469 min / ~8 hrs                 234 min / ~4 hrs                 ~57 min
40k         Not computed         Not computed        3556 min / ~59 hrs / ~2.5 days   2042 min / ~34 hrs / ~1.5 days   ~5796 min / ~4 days
Based on the data accumulated, the Quickest method, as best fit by a cubic curve, should finish 100,000 samples in roughly 26 days. A faster computer with more cores could speed this computation up.
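That extrapolation can be reproduced approximately from the table above (a sketch only; the exact fitting procedure behind the quoted 26-day figure is not specified in the text):

```python
import numpy as np

samples = np.array([2, 4, 8, 20, 40], dtype=float)       # thousands of samples
quickest = np.array([1, 3, 19, 234, 2042], dtype=float)  # Quickest-mode minutes, from the table

coeffs = np.polyfit(samples, quickest, 3)                # cubic best fit
pred_minutes = np.polyval(coeffs, 100.0)                 # extrapolate to 100k samples
pred_days = pred_minutes / (60 * 24)                     # on the order of weeks
```

The fitted prediction lands in the same ballpark as the quoted figure, though extrapolating a cubic fit well beyond the measured range is inherently rough.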
Precision of Results¶
Precision was measured by taking the variance of the difference between methods divided by the total variability. There was no difference between the Slow method and the Exact method for any sample size computed (up to 20k). Variability started to be introduced with the Medium method at four thousand samples, but it was almost negligible. The Quickest method had the most variability at every sample size computed except the 2,000-sample set. Of the results output, the Random Effects Components for each sample and the Normalized Absolute ASE values are those most affected by the compromises made to obtain speed increases.
The method used for these precision computations is the ratio of the measurement-system variability to the total variability, computed in the style of a gauge R&R analysis: because only two measurements (the two methods) are being compared, the measurement-system variability is derived from the differences between the two methods' results, while the total variability also incorporates the sample standard deviation of the averages of the two measurements.
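A plain reading of this measure can be sketched as follows. The exact gauge-R&R-style formula the authors used is not fully recoverable from the text, so this hypothetical `precision_ratio` helper uses the simple variance-of-difference over total-variance form stated at the start of this section:

```python
import numpy as np

def precision_ratio(x, y):
    """Fraction of total variability attributable to disagreement between
    two methods' results; a value near 0 means near-perfect agreement."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    diff_var = np.var(x - y, ddof=1)                    # method-to-method variability
    total_var = np.var(np.concatenate([x, y]), ddof=1)  # total variability
    return diff_var / total_var
```

Identical outputs from two modes give a ratio of zero, matching the "no difference between the Slow method and the Exact method" observation above.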
Determining Influential Markers¶
Of the top 100 markers from the Exact method, compared to the top 100 markers from the Quickest method, in the 20k sample set, all but three overlap. Thus, it is reasonable to assume that a prediction model based on the top markers from the Quickest method would reach a conclusion very similar to one built on the Exact method. More work should be done to confirm this hypothesis.
Large N Memory Considerations¶
Large-data-mode GBLUP will automatically detect how much “Transpose and analysis memory usage” you have told SVS is available, and will consider that amount to be available for large-data-mode GBLUP analysis (Global Product Options).
To ensure the most accuracy, you should allow SVS to use as much memory as possible, up to its upper limit, which is a fixed function of N, the number of samples. For the following sample sizes, this limit translates to:
# Samples   Memory
8k          ~790 MB
20k         ~3.1 GB
40k         ~8.8 GB
100k        ~34.9 GB
This limitation is to keep in check the time spent computing one of the steps of the process (namely, the Gram-Schmidt orthonormalization step). However, this amount of memory is also an ideal size for these computations.
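As a rough sanity check on the table above, the documented ideal-memory figures scale almost exactly in proportion to N**1.5. This is an observation fitted to the published numbers, not a formula stated by SVS:

```python
# Ideal-memory figures from the table above (MB), versus sample counts.
samples = [8_000, 20_000, 40_000, 100_000]
mem_mb = [790, 3_100, 8_800, 34_900]

# If memory grows proportionally to N**1.5, these ratios are ~constant.
coeffs = [m / n ** 1.5 for n, m in zip(samples, mem_mb)]
```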
If this amount of memory is not available, the memory that is available will be used. However, this imposes the following percentage limitations on the amount of memory used, compared with the ideal memory:
# Samples   256 MB   512 MB   1 GB     8 GB     16 GB
8k          32%      65%      (100%)   (100%)   (100%)
20k         8%       16%      32%      (100%)   (100%)
40k         3%       6%       11%      91%      (100%)
100k        .7%      1.5%     3%       23%      46%
As stated above, having more memory available for Large-N-mode GBLUP will, in general, improve the accuracy obtained.
Meanwhile, the amount of memory used will not appreciably affect the maximum possible execution time one way or the other, because this amount simply trades off how many eigenvalues are computed per top-layer iteration against how many top-layer iterations may be required at a maximum.
Comparing Machines¶
The Large N implementation relies heavily on the ability of the NumPy and SciPy libraries for Python to use multiple threads. This means that the Large N method must be run on a Windows machine to perform the computation in a reasonable amount of time.
This table shows run times for GBLUP using the “Quickest” mode on two different systems:
# Samples   Windows 8 Core 8 GB RAM   Linux Server
2000        0.95 min                  5.4 min
4000        3.5 min                   44 min
8000        19 min                    217 min
Alternative Approaches¶
One suggested alternative approach has been to split up the analysis chromosome by chromosome. However, the bulk of the computation is performed over the sample-by-sample dimension, regardless of the number of markers in the dataset. This means that for a given kinship matrix for 100,000 samples, the analysis will take roughly the same amount of time whether there are five thousand markers, 48 thousand markers or one million markers.
However, the computation needed to process a kinship matrix which is a GBLUP Relationship Matrix formed from many fewer markers than samples need not be as much, since, in this case, many or most of the eigenvalues of the matrix being processed are at the minimum possible value of one. The Quick method can process these minimum-possible eigenvalues up to seven times faster than would otherwise normally be possible with the Quick method, while larger performance gains are possible with the Medium and Slow methods for these minimum-possible eigenvalues. Therefore, for instance, if the kinship matrix is based on 40,000 samples and 10,000 markers, processing the entire kinship matrix with the Quick method should take about one third the time it would take to process a kinship matrix based on 40,000 samples and 40,000 markers.
To take advantage of this fact, a good alternative approach is to simply base the kinship matrix on one chromosome (to get a reasonable amount of information about the inheritance pattern) and perform the analysis on all of the markers using this kinship matrix.
Another alternative approach, which may be combined with the approach to base the kinship matrix on only one chromosome, is to base the kinship matrix on a subset of markers obtained by LD Pruning (LD Pruning).
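The low-rank eigenvalue structure described above can be verified with a quick check (illustrative NumPy; the random matrix stands in for a normalized genotype matrix):

```python
import numpy as np

n, m = 50, 10                          # many more samples than markers
rng = np.random.default_rng(3)
X = rng.standard_normal((n, m))        # stand-in for normalized genotypes
K = X @ X.T / m                        # GBLUP-style relationship matrix, rank <= m
w = np.linalg.eigvalsh(K + np.eye(n))
# Since K is positive semidefinite with rank at most m, the matrix K + I has
# at least n - m eigenvalues sitting exactly at the minimum value of one.
```

Those repeated unit eigenvalues are what the Quick, Medium and Slow methods can dispose of quickly.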
Note
It is important to set the largest possible memory size available on your system for “Transpose and analysis memory usage” (Global Product Options), up to and past the upper limits mentioned in Large N Memory Considerations, in order to have the best possible accuracy for these calculations, and to ensure the best likelihood that this optimization actually takes place.