2.13.2. Large Kinship Matrices or Large Numbers of Samples¶
The Problem¶
If there are more than 8,000 samples, the memory requirements for computing mixed models become substantial, or even prohibitive. Additionally, with this many samples, computing eigenvalues and eigenvectors begins to take much longer for methods such as GBLUP using the EMMA algorithm (see Performing GBLUP Analysis) or for Principal Components Analysis (PCA) (see Genotypic Principal Component Analysis and Numeric Principal Component Analysis).
The Solution¶
Thus, for GBLUP using the EMMA algorithm, for K-Fold cross-validation using GBLUP, and for Principal Components Analysis (PCA) (Step 2: Find the Principal Components), SVS offers a large-data mode for some of the computations and intermediate data storage. This mode consists of:
Storing intermediate matrix data in scratch spreadsheets. The data for one matrix may be stored either in one spreadsheet or in many separate spreadsheets, each containing a small enough portion of the data to be held conveniently in (random-access) memory.
Computing basic matrix operations piece by piece: reading sections of the data from the spreadsheet (or spreadsheets) as necessary, making computations using these sections of data and, if the computational outputs consist of a new large matrix, writing the results section by section to a new large spreadsheet or to small separate spreadsheets.
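The piece-by-piece idea can be sketched in a few lines of NumPy. This is only an illustration: SVS stores its intermediate data in scratch spreadsheets, for which a memory-mapped file stands in here, and the helper name `chunked_matmul` is ours, not an SVS function.

```python
import numpy as np

def chunked_matmul(a_path, b, out_path, n, k, chunk=1024):
    """Multiply an n-by-n matrix stored on disk by an in-memory n-by-k
    block, writing the n-by-k product back to disk `chunk` rows at a time.
    Only one horizontal section of the big matrix is ever held in RAM."""
    a = np.memmap(a_path, dtype=np.float64, mode="r", shape=(n, n))
    out = np.memmap(out_path, dtype=np.float64, mode="w+", shape=(n, k))
    for start in range(0, n, chunk):
        stop = min(start + chunk, n)
        # read one section of A, multiply, write one section of the result
        out[start:stop] = a[start:stop] @ b
    out.flush()
```

The same read-compute-write pattern generalizes to the other basic operations (sums, transposes, Cholesky steps) used by large-data mode.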
For large-data GBLUP using EMMA and K-Fold cross-validation using GBLUP: To solve the mixed model equation (Solving the Mixed Model Equation Using EMMA), a matrix based on the kinship matrix is first determined, then a Cholesky decomposition of that matrix is performed using large-data matrix techniques. This avoids the need to determine the eigenvalues and eigenvectors of the kinship matrix itself.
To find the eigenvalues and the corresponding eigenvectors of the large matrix that must be dealt with, which is:
For GBLUP using EMMA and K-Fold cross-validation using GBLUP: the matrix S(K+I)S
For large-data Principal Components Analysis (PCA): the Wishart matrix
SVS uses an adaptation of the methodology of [Halko2010] (see Decomposition of Large Matrices Using Random Numbers below), along with the large-data techniques mentioned above.
How to Use¶
For GBLUP using the EMMA algorithm and for K-Fold cross-validation using GBLUP: If there are more than 8,000 samples, an extra dialog will pop up, asking which “mode” you wish to use (see the table shown in For GBLUP using EMMA: Large N Performance Tradeoffs), and whether you wish to use large-data techniques at all.
For Principal Components Analysis (PCA): There is a separate dialog and utility available for finding a few principal components of a dataset with a large number of samples. (This dialog and utility are available even if you have fewer than 8,000 samples.) See Finding Principal Components When the Sample Size is Large.
Note
Large-data mode is not currently available for GBLUP using the Average Information (AI) REML algorithm.
Decomposition of Large Matrices Using Random Numbers¶
The authors of [Halko2010] describe a technique using random numbers for finding an orthonormal matrix whose range approximates the range of another, much larger matrix. This matrix may then be used to (for instance) approximate the largest eigenvalues and their corresponding eigenvectors of a large symmetric matrix. The technique needs only one pass through the large matrix to bring the overall error of expressing that matrix using its approximate largest eigenvalues and eigenvectors down to the same magnitude as the largest of the next-smaller eigenvalues which were not approximated. [Halko2010] also describes a method of increasing accuracy, involving repeated multiplication by the large matrix, for use when the eigenvalues do not descend in value very quickly and/or large data needs to be dealt with.
Golden Helix SVS adapts this technique, using two layers of iteration to approximate all requested eigenvalues and eigenvectors of such a large matrix A. The top layer of iteration finds batches of the largest eigenvalues: first of A itself, then of A₂ = A − V₁Λ₁V₁ᵀ, where V₁Λ₁V₁ᵀ is the eigenvalue/eigenvector decomposition formed from the largest eigenvalues of A, then of A₃ = A₂ − V₂Λ₂V₂ᵀ, where V₂Λ₂V₂ᵀ is the eigenvalue/eigenvector decomposition formed from the largest eigenvalues of A₂ (which are the next-largest set of eigenvalues of A), and so on. The second layer uses the [Halko2010] technique to obtain estimates of the largest eigenvalues and eigenvectors of A (or of A₂, etc.).
For GBLUP using EMMA: All eigenvalues and all eigenvectors are estimated.
For Principal Components Analysis (PCA): Only the requested number of eigenvalues and eigenvectors (components) are estimated. If only a few principal components are needed, the algorithm will never need to deal with any large matrix other than the Wishart matrix itself.
Approximating the Largest Eigenvalues in a Matrix¶
Suppose matrix A is a (symmetric) positive semidefinite n×n matrix for which we need to find approximations to the ℓ largest eigenvalues and their eigenvectors. The basic technique described in [Halko2010] is as follows:
Create a (“tall and slender”) n×ℓ matrix Ω which contains elements drawn from the Standard Normal Distribution (normally distributed with mean 0 and variance 1). Presumably, the dimension ℓ is small enough that this matrix fits comfortably into (random-access) memory.
Create the matrix Y as the matrix product Y = AΩ.
Find the set of orthonormal vectors which span the same space as the columns of Y, using, for instance, the QR approach or the Gram-Schmidt procedure. Call the matrix containing this set Q.
Now, compute T = QᵀAQ and find the eigenvalues and eigenvectors of T. Since T is only ℓ×ℓ and fits very easily into (random-access) memory, this is a much more straightforward and quickly-solved problem than that of finding the eigenvalues and eigenvectors of A itself.
Call the matrix of resulting eigenvectors U and the matrix containing the eigenvalues on its diagonal (and zeros elsewhere) Λ, so that we have T = UΛUᵀ.
Now, we approximate A as A ≈ (QU)Λ(QU)ᵀ.
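The five steps above can be sketched directly in NumPy. This is a minimal in-memory illustration of the basic technique; the function name `randomized_eigh` is ours, and a real large-data implementation would read A in sections rather than holding it in RAM.

```python
import numpy as np

def randomized_eigh(A, ell, seed=0):
    """Approximate the `ell` largest eigenpairs of a symmetric PSD matrix A."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    Omega = rng.standard_normal((n, ell))   # tall-and-slender Gaussian test matrix
    Y = A @ Omega                           # sample the range of A
    Q, _ = np.linalg.qr(Y)                  # orthonormal basis spanning range(Y)
    T = Q.T @ A @ Q                         # small ell-by-ell problem
    evals, U = np.linalg.eigh(T)            # cheap dense eigendecomposition
    order = np.argsort(evals)[::-1]         # largest first
    evals, U = evals[order], U[:, order]
    return evals, Q @ U                     # approximate eigenvalues/eigenvectors of A
```

When A has rapidly decaying (or low-rank) spectrum and ℓ covers the dominant eigenvalues, the returned pairs are very close to the true ones.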
To increase accuracy, [Halko2010] suggests modifying this algorithm as follows (the “power method”):
Follow the first three steps above to create a Q matrix; call this Q₀.
Create matrix Y₁ = AQ₀, then find the set of orthonormal vectors that span the same space as the columns of Y₁, and call this Q₁.
Repeat the above step as needed, to get Q_q.
Now, find the eigenvalues and eigenvectors of T = Q_qᵀAQ_q and follow the last two steps above to approximate A.
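A sketch of this power-method variant (again illustrative only, with our own function name; in SVS the multiplications by A would be done with large-data techniques):

```python
import numpy as np

def randomized_eigh_power(A, ell, q=2, seed=0):
    """Like the basic sketch, but with q extra multiplications by A,
    re-orthonormalizing each time so the basis locks onto the dominant
    eigenspace even when the eigenvalues decay slowly."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    Q, _ = np.linalg.qr(A @ rng.standard_normal((n, ell)))   # Q_0
    for _ in range(q):
        Q, _ = np.linalg.qr(A @ Q)                           # Q_1 ... Q_q
    T = Q.T @ A @ Q
    evals, U = np.linalg.eigh(T)
    order = np.argsort(evals)[::-1]                          # largest first
    return evals[order], Q @ U[:, order]
```

Each extra pass multiplies the relative weight of the smaller eigendirections by their (sub-unit) eigenvalue ratios, which is why accuracy improves quickly with q.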
From equation (6.5) of [Halko2010], we see that the expected error of either of these two methods (the original method or the power method) is on the order of the (ℓ+1)-st largest eigenvalue of A, with a multiplicative factor that is raised to the power 1/(2q+1) and thus approaches one as the number of power iterations q grows; for the original method, we take q = 0.
As an alternative method of increasing accuracy, which may be used at the same time as the “power method” described above, the authors of [Halko2010] suggest using “oversampling”, meaning that to obtain the ℓ largest eigenvalues and their eigenvectors, use a (“tall and slender”) Ω matrix of dimensions n×(ℓ+p), where p > 0. (This will cause matrix Y, and matrices Y₁, Q₁, etc., if used, to also have dimensions n×(ℓ+p).) Normally, only a small value of p is needed.
Since the Ω matrix is somewhat like a “memory buffer”, call the “width” ℓ+p of the “tall and slender” Ω matrix the “buffer width”.
For GBLUP using EMMA: How Large Will the Buffer Width Be?¶
For GBLUP using EMMA, Golden Helix SVS attempts to use an ideal buffer width, an optimum chosen to prevent the calculations of the Gram-Schmidt step from taking excessive and unnecessary time.
If there is insufficient RAM to fit this would-be buffer width, the buffer width is reduced sufficiently to allow the calculations to fit into RAM.
For PCA: How Large Will the Buffer Width Be?¶
For PCA, Golden Helix SVS uses the maximum of
A minimal width required for the PCA computation,
vs.
An optimal width based on the number of samples.
In the unusual event that there is insufficient RAM for this would-be buffer width, the buffer width is reduced sufficiently to allow the calculations to fit into RAM.
Note that if only a small number of principal components are requested, the buffer width used will be far smaller than what would have been used for the GBLUP EMMA algorithm, thus saving calculation time.
For GBLUP using EMMA: Finding All the Eigenvalues of Matrix S(K+I)S¶
The technique in Approximating the Largest Eigenvalues in a Matrix works reasonably well for a matrix whose eigenvalues descend in value quickly, or reach almost zero quickly. However, a typical kinship matrix K will have many or most eigenvalues on the general order of a magnitude of one. Similarly, the nonzero eigenvalues of the matrix we are interested in, namely S(K+I)S, will often range from a little over 1 to over 5 or more. (The number of zero eigenvalues is exactly equal to one more than the number of covariates you have specified.) This renders the above [Halko2010] technique only marginally useful by itself.
A further difficulty is that the EMMA algorithm needs the estimates of the smaller nonzero eigenvalues of S(K+I)S and their eigenvectors to be essentially as precise as the estimates of the larger eigenvalues and their eigenvectors. This is because, for each nonzero eigenvalue, part of the algorithm divides by an expression derived from that eigenvalue, and this expression may come close to zero. (That expression is the eigenvalue, minus one, plus a parameter value that may itself become very small.) (See Finding the Variance Components.)
Therefore, in Golden Helix SVS, we start with the following steps to enhance the [Halko2010] technique for GBLUP using EMMA:
Start with the original kinship matrix K.
Create a (positive definite) matrix from S(K+I)S by scaling it and adding a small positive multiple of the identity; in SVS, the value .25 is used for both the scaling constant and the shift. Call this matrix B.
Note
We avoid actually storing the intermediate matrix by itself anywhere; we simply use large-data techniques to perform the entire operation of finding B with a minimum of memory usage.
Note
Mathematically speaking, those eigenvalues of B equal to the shift constant correspond to, and have the same eigenvectors as, the zero eigenvalues of S(K+I)S, while each of the remaining eigenvalues of B may be converted to the corresponding remaining eigenvalue of S(K+I)S, for which the corresponding eigenvector is the same, by reversing the scaling and the shift.
For PCA: Finding the Requested Eigenvalues of the Wishart Matrix¶
The technique in Approximating the Largest Eigenvalues in a Matrix works reasonably well for a matrix whose eigenvalues descend in value quickly, or reach almost zero quickly. However, a typical Wishart matrix may have eigenvalues that descend in value more slowly, or perhaps quickly at first and then slowly. This renders the above [Halko2010] technique only marginally useful by itself.
Additionally, some eigenvalues of the Wishart matrix may be zero in certain use cases.
Therefore in Golden Helix SVS, to achieve a reasonable level of accuracy without spending excessive compute time, and to guard against possible zero eigenvalues wrecking the algorithm, we start with the following steps to enhance the [Halko2010] technique:
Start with the original Wishart matrix (call it W).
Create the (positive definite) matrix W + εI, where, in SVS, we use the value 0.0002 for ε. Call this matrix B.
Note
We avoid actually storing matrix W by itself anywhere; we simply use large-data techniques to perform the entire operation of finding B with a minimum of memory usage.
Note
Mathematically speaking, those eigenvalues of B equal to ε correspond to, and have the same eigenvectors as, any zero eigenvalues of W, while each of the remaining eigenvalues of B may be converted to the corresponding remaining eigenvalue of W, for which the corresponding eigenvector is the same, by subtracting out ε.
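The effect of the shift can be seen in a few lines of NumPy (illustrative only; the rank-deficient random matrix below stands in for a Wishart matrix with zero eigenvalues):

```python
import numpy as np

eps = 0.0002                       # the shift value SVS uses for PCA
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 6))
W = X @ X.T                        # rank-6 PSD stand-in: 24 zero eigenvalues
B = W + eps * np.eye(30)           # strictly positive definite

wW = np.linalg.eigvalsh(W)         # ascending order
wB = np.linalg.eigvalsh(B)
# Every eigenvalue of B is the matching eigenvalue of W plus eps, with
# identical eigenvectors; subtracting eps back out recovers W's spectrum.
```

Because every eigenvalue of B is at least ε, the decomposition algorithm never has to contend with exactly-zero eigenvalues.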
Further Processing Steps to Find One Batch of Eigenvalues of Matrix A¶
Use the power-method-with-oversampling variation of the above-mentioned [Halko2010] procedure on matrix B. Use matrices Ω and Y of width ℓ+p, where the width is set according to For GBLUP using EMMA: How Large Will the Buffer Width Be? or For PCA: How Large Will the Buffer Width Be?. This setting will be reasonably optimal but still small enough to allow Ω and Y to fit comfortably into (random-access) memory.
However, use the following procedure to obtain the best accuracy with minimum effort from this method:
Follow the first three steps of Approximating the Largest Eigenvalues in a Matrix (create Ω, create Y, and orthonormalize) to create a Q matrix, which we call Q₀.
Create matrix Y₁ = BQ₀. However, do not yet orthonormalize Y₁.
Now, form T₁ = Q₀ᵀY₁, and find just the eigenvalues of T₁.
Then, orthonormalize Y₁ to get Q₁, form Y₂ = BQ₁, form T₂ = Q₁ᵀY₂, and obtain the eigenvalues of T₂.
If enough of the highest eigenvalue estimates have changed slowly enough between those of T₁ and those of T₂, accept these estimates as being the highest eigenvalues of both matrix B and matrix T₂. “Enough” and “slowly enough” are explained in the next bullet below. Otherwise, repeat the step above followed by this step, computing each new Y from the previous Q using an extra scan through the matrix B, as Y₃ = BQ₂ and T₃ = Q₂ᵀY₃, and so on for Y₄, T₄, etc.
To explain the words “enough” and “slowly enough” above: After each iteration of the step where a new T is formed, compare the magnitudes of the differences between the eigenvalue estimates of the previous iteration and those of the current iteration, starting with the largest eigenvalue and working through the smaller eigenvalues. As soon as the magnitude of the difference between the two estimates of a given eigenvalue is larger than a small tolerance, stop comparing eigenvalue estimates.
For GBLUP using EMMA: You may specify one of several preset precision values as the tolerance. (See the table shown in For GBLUP using EMMA: Large N Performance Tradeoffs.)
For PCA: A single fixed precision is always used as the tolerance. (Also, the Max Iterations per Batch is always 60. You might call this mode “Hyper-Slow”.)
Call the number of eigenvalues for which this comparison passed m (where m is at most the buffer width).
When the “batch stopping criteria” are met, stop iterating. The meaning of “batch stopping criteria” is explained in the next two sections.
For succeeding “batches” of eigenvalues, this procedure is repeated for B₂, B₃, etc.
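The “enough / slowly enough” comparison above can be expressed as a small helper (a hypothetical name and a sketch of the logic described, not SVS code):

```python
def count_converged(prev, curr, tol):
    """Count how many of the largest eigenvalue estimates moved by less
    than `tol` between two successive iterations, scanning from the
    largest downward and stopping at the first one that moved too much."""
    m = 0
    for a, b in zip(prev, curr):      # both lists sorted largest-first
        if abs(a - b) > tol:
            break                     # stop comparing at the first failure
        m += 1
    return m
```

The count returned here plays the role of the converged-eigenvalue count used by the batch stopping criteria in the next two sections.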
For GBLUP using EMMA: Batch Stopping Criteria¶
If, after an iteration, the average number of converged eigenvalues per iteration has failed to increase, do no further iterations, and “accept” the converged largest eigenvalues from that iteration.
Note
“Accepting” eigenvalues after some number of iterations is equivalent to finding those eigenvalues using a “power method” with that many extra passes and an “oversampling” parameter equal to the buffer width minus the number of accepted eigenvalues.
Note
The idea behind the stopping criterion is that as long as the average number of converged eigenvalues per iteration keeps increasing, it must have been worthwhile to have performed the current iteration, and it presumably will be worthwhile to continue iterating.
Additionally, no matter what the stopping criterion says, stop iterating once all of the nonzero eigenvalues have been estimated satisfactorily (the number of zero eigenvalues being one more than the number of covariates you have specified), and “accept” all of the estimated largest eigenvalues.
Note
It may turn out that the above algorithm, if followed strictly, would not stop until after a very high number of iterations, or would not stop at all. Alternatively, it may stop after only a very tiny number of eigenvalues have been “accepted”.
For these reasons, when using this large-data EMMA algorithm, SVS allows you to specify a limit (Max Iterations per Batch) on the number of iterations. In addition, SVS imposes a maximum oversampling ratio of two.
The meaning of these restrictions is that in the event that SVS reaches the maximum number of iterations, this part of the algorithm is stopped and the eigenvalues estimated so far are “accepted”, no matter what.
Also, when the algorithm stops earlier, at least half of the buffer width's worth of eigenvalues are “accepted” (a consequence of the maximum oversampling ratio of two). Hopefully, this minimum will be fewer than the number of eigenvalues that the algorithm has already (tentatively) “accepted”. If not, this will have the effect of pushing the algorithm forward at a minimum rate of half the buffer width per batch.
With these restrictions (maximum power-method iterations and maximum oversampling ratio), a maximum execution time can be guaranteed.
For PCA: Batch Stopping Criteria¶
The normal use case for PCA is:
If the number of converged eigenvalues is greater than or equal to the number of eigenvalues requested, and we are working on the first batch (which processes directly from matrix B), stop all iteration using the [Halko2010] method and “accept” the greatest eigenvalues which have been computed as the final answer.
As stated above, this is considered to be the normal use case for PCA, because (1) only a few eigenvalues are requested, so that the number of eigenvalues requested is less than the buffer width, (2) the algorithm converges for all eigenvalues requested before the maximum of 60 iterations is reached, and as a result, (3) only the very first batch, derived directly from matrix B, needs to be processed.
Another use case may sometimes happen:
Suppose (1) convergence turns out to be more difficult than normal, and thus the algorithm fails, after 60 iterations, to find all requested eigenvalues within the normal precision within the first batch, but meanwhile, (2) the buffer width is more than twice the number of requested eigenvalues.
In this case, the largest requested eigenvalues are simply “accepted” as the final answer.
In the rare case that, due either to memory considerations or because a large number of components were requested, the buffer width is less than twice the number of requested eigenvalues, or even less than the number requested itself, multiple batches may be necessitated, which will result in these use cases:
If, on any batch, the number of eigenvalues found plus the number of all previously found eigenvalues is greater than or equal to the number of eigenvalues requested, stop the algorithm and “accept” the greatest eigenvalues from this batch.
On any batch, if all possible batch eigenvalues have been found (that is, as many as the working buffer width allows), “accept” these eigenvalues and start on the next batch.
On any batch, when 60 iterations have been reached but the number of eigenvalues found, plus the number of all previously found eigenvalues, is still less than the number of eigenvalues requested, stop working on the current batch, “bolster up” the working count of found eigenvalues to at least its guaranteed minimum, and “accept” the latest estimates of the greatest eigenvalues.
Call the number of eigenvalues found or accepted when the current batch has stopped m.
Processing to Finish a Batch¶
Now that we have “accepted” m eigenvalues from iteration i of the algorithm above, compute the eigenvectors of T_i. Designate the diagonal matrix containing the highest m of the eigenvalues of T_i (the ones that we “accepted”) to be Λ₁, and the matrix of eigenvectors corresponding to these “accepted” eigenvalues U₁. Use these to approximate the highest m eigenvalues of B (the diagonal of matrix Λ₁) and their corresponding eigenvectors as

V₁ = Q_iU₁

Since matrix B may be written as

B = V₁Λ₁V₁ᵀ + V₂Λ₂V₂ᵀ

where Λ₂ and V₂ represent the remaining eigenvalues and eigenvectors after Λ₁ and V₁, consider matrix B as decomposed into these two terms.
For GBLUP using EMMA: When to Stop the Whole Algorithm¶
Once eigenvalues and eigenvectors have been obtained for all of the nonzero eigenvalues (of matrix B, and thus of matrix S(K+I)S), stop the algorithm. This many eigenvalues and eigenvectors are all that are required by the EMMA technique.
For PCA: When to Stop the Whole Algorithm¶
As stated above:
If we are working on the first batch (which processes directly from matrix B), and the number of converged eigenvalues is greater than or equal to the number of eigenvalues requested, stop all iteration using the [Halko2010] method and “accept” the greatest eigenvalues which have been computed.
If, for some reason, the algorithm needs to work on succeeding batches after the first batch, then it is the number of eigenvalues found in the current batch, plus the number of all previously found eigenvalues, which needs to be greater than or equal to the number of eigenvalues requested in order to stop the algorithm and “accept” the greatest eigenvalues from this batch to finish finding all eigenvalues.
If a New Batch Needs to Be Initiated:¶
Use large-data techniques to form a new matrix B₂ by subtracting off the first part of our decomposition, as

B₂ = B − V₁Λ₁V₁ᵀ
Then, repeat the above variation of the power method along with this step.
Note
Because the original matrix B was formed from the actual (positive semidefinite) matrix we are interested in plus εI, where ε is a smallish positive number, performing the above subtraction will leave behind all of the remaining eigenvalues and eigenvectors of matrix B, while the eigenvalues of B₂ corresponding to the eigenvectors that we have just found will essentially be zero, thus keeping them out of the way of later calculations.
If the number of eigenvalues still to be obtained is less than the buffer width, appropriately reduce the width of the Ω and Y matrices.
Note
In addition, under this circumstance, when the algorithm stops, whether or not the maximum number of iterations is reached, all remaining eigenvalues (out of the nonzero eigenvalues or, for PCA, out of the remaining requested eigenvalues) are “accepted”, because all of the information related to the remaining (nonzero) eigenvalues should have been effectively contained in the Ω and Y matrices.
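The deflation step that prepares a new batch can be illustrated with a small in-memory NumPy stand-in for what SVS does piece by piece:

```python
import numpy as np

rng = np.random.default_rng(2)
Q0, _ = np.linalg.qr(rng.standard_normal((20, 20)))
w = np.linspace(10.0, 1.0, 20)        # all eigenvalues well above zero
B = (Q0 * w) @ Q0.T                   # symmetric matrix with spectrum w

V1, L1 = Q0[:, :5], w[:5]             # the 5 "accepted" eigenpairs
B2 = B - (V1 * L1) @ V1.T             # subtract off the first part of the decomposition

wB2 = np.linalg.eigvalsh(B2)[::-1]    # descending order
# The accepted eigenvalues are now essentially zero; the rest are untouched,
# so the next batch sees only the remaining spectrum.
```

This is why the ε-shift matters: the remaining wanted eigenvalues all stay strictly above the (near-zero) deflated ones.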
For GBLUP using EMMA: Large N Performance Tradeoffs¶
As mentioned earlier, when dealing with many samples, computation of eigenvalues and eigenvectors for GBLUP (Performing GBLUP Analysis) takes much longer. Thus, when using large-data mode for GBLUP (and for K-Fold cross-validation using GBLUP), approximations have to be made that trade accuracy for speed and computability. Therefore, when you use either of these features with more than 8,000 samples, and before you are given the standard options relating to the feature, you are given a choice of which large-data approach to use from the following set of options:
Mode        Approach   Max Iterations per Batch   Early Exit Precision
Slow        Large N    30
Medium      Large N    12
Quick       Large N    5
Quickest    Large N    2
Exact       Small N    N/A                        N/A
“Max Iterations per Batch” refers to the maximum number of iterations to use (see the fifth note in For GBLUP using EMMA: Finding All the Eigenvalues of Matrix S(K+I)S), and “Early Exit Precision” refers to the tolerance to use (see the bullet before the third note in For GBLUP using EMMA: Finding All the Eigenvalues of Matrix S(K+I)S).
Note
If you separately compute a Genomic Relationship Matrix (Separately Computing the Genomic Relationship Matrix) from many samples, SVS will automatically go into large-data mode, but will still compute the exact Genomic Relationship Matrix in a reasonable amount of time.
Summary of Performance Tradeoffs¶
If you have more than 8,000 samples (5,500 samples for 32-bit systems), you will be asked to choose your data analysis mode before starting analysis for either GBLUP or K-Fold cross-validation using GBLUP.
Before entering your project and starting your calculations, set the “Transpose and analysis memory usage” to the highest memory level that you have available (Global Product Options), ideally to somewhat more than the ideal memory usages documented in Large N Memory Considerations. This will ensure the best accuracy when you use Large Data mode.
For only moderately more than 8,000 samples, use the “Medium” or “Slow” Large N mode. Alternatively, if you have sufficient RAM, and you are willing to wait for eigenvalues to be computed (which may take several minutes) without being able to interrupt these computations, you could use the “Exact” (Small N) mode (Computation Time).
For a medium number of samples, such as 15,000, using the “Quick” or “Quickest” method will bring the computation time down. Also, consider using the “Exact” method if you have sufficient RAM and can wait for the eigenvalue computations (see above bullet), or using the approach (mentioned in the next bullet) of basing the kinship matrix on a subset of markers containing many fewer markers than the number of samples you have.
For large numbers of samples, you should definitely consider using a kinship matrix based on just one chromosome, or on a marker subset created from LD pruning. (See Alternative Approaches.)
If you get messages in the spreadsheet logs starting with “NOTE:” that count how many eigenvalues were approximate or were rounded up, and a significant number of eigenvalues had to be approximated in this way, consider using a slower mode (more maximum iterations per batch) to improve accuracy, or, if possible, using more memory for the calculations (see the bullet above about memory usage).
Computation Time¶
With a 64-bit system, you will always want to run GBLUP (and K-Fold cross-validation using GBLUP) using the Exact method if enough RAM is available and you are willing to wait for eigenvalues to be computed without being able to easily stop or cancel the process. We have been able to process kinship matrices from 54,000 samples using the Exact method on a computer with 128 GB of RAM. At a certain point (as yet undetermined), having more RAM will not make a difference in whether or not you can use the Exact (Small N) method, as systems will not allow it to be used for the algorithm.
Using a 48k simulated marker dataset, tests were performed with varying sample sizes using the above five modes. On a computer with 8 cores, 8 GB of RAM and a CPU speed of 3.1 GHz, the following table illustrates the time required for the sample sizes listed.
# Samples   Slow                 Medium              Quick                            Quickest                         Exact (Small N Algorithms)
2k          ~1 min               1 min               ~1 min                           ~1 min                           ~1 min
4k          ~10 min              7 min               5 min                            3 min                            ~2 min
8k          76 min               80 min              32 min                           19 min                           ~5 min
20k         1133 min / ~19 hrs   823 min / ~14 hrs   469 min / ~8 hrs                 234 min / ~4 hrs                 ~57 min
40k         Not computed         Not computed        3556 min / ~59 hrs / ~2.5 days   2042 min / ~34 hrs / ~1.5 days   ~5796 min / ~4 days
Based on the data accumulated, the Quickest method, as best fit by a cubic curve, should finish 100,000 samples in roughly 26 days. A faster computer with more cores could speed this computation up.
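That extrapolation can be reproduced approximately from the table above (a sketch only; the exact fitting procedure behind the quoted 26-day figure is not specified in the text):

```python
import numpy as np

samples = np.array([2, 4, 8, 20, 40], dtype=float)       # thousands of samples
quickest = np.array([1, 3, 19, 234, 2042], dtype=float)  # Quickest-mode minutes, from the table

coeffs = np.polyfit(samples, quickest, 3)                # cubic best fit
pred_minutes = np.polyval(coeffs, 100.0)                 # extrapolate to 100k samples
pred_days = pred_minutes / (60 * 24)                     # on the order of weeks
```

The fitted prediction lands in the same ballpark as the quoted figure, though extrapolating a cubic fit well beyond the measured range is inherently rough.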
Precision of Results¶
Precision was measured by taking the variance of the difference between methods divided by the total variability. There was no difference between the Slow method and the Exact method for any sample size computed (up to 20k). Variability started to be introduced with the Medium method at four thousand samples, but it was almost negligible. The Quickest method had the most variability at every sample size computed except the 2,000-sample set. Of the results output, the Random Effects Components for each sample and the Normalized Absolute ASE values are those most affected by the compromises made to obtain speed increases.
The method used for these precision computations is the ratio of the measurement-system variability to the total variability, computed in the style of a gauge R&R analysis: because only two measurements (the two methods) are being compared, the measurement-system variability is derived from the differences between the two methods' results, while the total variability also incorporates the sample standard deviation of the averages of the two measurements.
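A plain reading of this measure can be sketched as follows. The exact gauge-R&R-style formula the authors used is not fully recoverable from the text, so this hypothetical `precision_ratio` helper uses the simple variance-of-difference over total-variance form stated at the start of this section:

```python
import numpy as np

def precision_ratio(x, y):
    """Fraction of total variability attributable to disagreement between
    two methods' results; a value near 0 means near-perfect agreement."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    diff_var = np.var(x - y, ddof=1)                    # method-to-method variability
    total_var = np.var(np.concatenate([x, y]), ddof=1)  # total variability
    return diff_var / total_var
```

Identical outputs from two modes give a ratio of zero, matching the "no difference between the Slow method and the Exact method" observation above.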
Determining Influential Markers¶
Of the top 100 markers from the Exact method, compared to the top 100 markers from the Quickest method, in the 20k sample set, all but three overlap. Thus, it is reasonable to assume that a prediction model based on the top markers from the Quickest method would reach a conclusion very similar to one built on the Exact method. More work should be done to confirm this hypothesis.
Large N Memory Considerations¶
Large-data-mode GBLUP will automatically detect how much “Transpose and analysis memory usage” you have told SVS is available, and will consider that amount to be available for large-data-mode GBLUP analysis (Global Product Options).
To ensure the most accuracy, you should allow SVS to use as much memory as possible, up to its upper limit, which is a fixed function of N, the number of samples. For the following sample sizes, this limit translates to:
# Samples   Memory
8k          ~790 MB
20k         ~3.1 GB
40k         ~8.8 GB
100k        ~34.9 GB
This limitation is to keep in check the time spent computing one of the steps of the process (namely, the Gram-Schmidt orthonormalization step). However, this amount of memory is also an ideal size for these computations.
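As a rough sanity check on the table above, the documented ideal-memory figures scale almost exactly in proportion to N**1.5. This is an observation fitted to the published numbers, not a formula stated by SVS:

```python
# Ideal-memory figures from the table above (MB), versus sample counts.
samples = [8_000, 20_000, 40_000, 100_000]
mem_mb = [790, 3_100, 8_800, 34_900]

# If memory grows proportionally to N**1.5, these ratios are ~constant.
coeffs = [m / n ** 1.5 for n, m in zip(samples, mem_mb)]
```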
If this amount of memory is not available, the memory that is available will be used. However, this imposes the following percentage limitations on the amount of memory used, compared with the ideal memory:
# Samples   256 MB   512 MB   1 GB     8 GB     16 GB
8k          32%      65%      (100%)   (100%)   (100%)
20k         8%       16%      32%      (100%)   (100%)
40k         3%       6%       11%      91%      (100%)
100k        .7%      1.5%     3%       23%      46%
As stated above, having more memory available for Large-N-mode GBLUP will, in general, improve the accuracy obtained.
Meanwhile, the amount of memory used will not appreciably affect the maximum possible execution time one way or the other, because this amount simply trades off how many eigenvalues are computed per top-layer iteration against how many top-layer iterations may be required at a maximum.
Comparing Machines¶
The Large N implementation relies heavily on the ability of the NumPy and SciPy libraries for Python to use multiple threads. This means that the Large N method must be run on a Windows machine to perform the computation in a reasonable amount of time.
This table shows run times for GBLUP using the “Quickest” mode on two different systems:
# Samples   Windows 8 Core 8 GB RAM   Linux Server
2000        0.95 min                  5.4 min
4000        3.5 min                   44 min
8000        19 min                    217 min
Alternative Approaches¶
One suggested alternative approach has been to split up the analysis chromosome by chromosome. However, the bulk of the computation is performed over the sample-by-sample dimension, regardless of the number of markers in the dataset. This means that for a given kinship matrix for 100,000 samples, the analysis will take roughly the same amount of time whether there are five thousand markers, 48 thousand markers or one million markers.
However, the computation needed to process a kinship matrix which is a GBLUP Relationship Matrix formed from many fewer markers than samples need not be as much, since, in this case, many or most of the eigenvalues of the matrix being processed are at the minimum possible value of one. The Quick method can process these minimum-possible eigenvalues up to seven times faster than would otherwise normally be possible with the Quick method, while larger performance gains are possible with the Medium and Slow methods for these minimum-possible eigenvalues. Therefore, for instance, if the kinship matrix is based on 40,000 samples and 10,000 markers, processing the entire kinship matrix with the Quick method should take about one third the time it would take to process a kinship matrix based on 40,000 samples and 40,000 markers.
To take advantage of this fact, a good alternative approach is to simply base the kinship matrix on one chromosome (to get a reasonable amount of information about the inheritance pattern) and perform the analysis on all of the markers using this kinship matrix.
Another alternative approach, which may be combined with the approach to base the kinship matrix on only one chromosome, is to base the kinship matrix on a subset of markers obtained by LD Pruning (LD Pruning).
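The low-rank eigenvalue structure described above can be verified with a quick check (illustrative NumPy; the random matrix stands in for a normalized genotype matrix):

```python
import numpy as np

n, m = 50, 10                          # many more samples than markers
rng = np.random.default_rng(3)
X = rng.standard_normal((n, m))        # stand-in for normalized genotypes
K = X @ X.T / m                        # GBLUP-style relationship matrix, rank <= m
w = np.linalg.eigvalsh(K + np.eye(n))
# Since K is positive semidefinite with rank at most m, the matrix K + I has
# at least n - m eigenvalues sitting exactly at the minimum value of one.
```

Those repeated unit eigenvalues are what the Quick, Medium and Slow methods can dispose of quickly.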
Note
It is important to set the largest possible memory size available on your system for “Transpose and analysis memory usage” (Global Product Options), up to and past the upper limits mentioned in Large N Memory Considerations, in order to have the best possible accuracy for these calculations, and to ensure the best likelihood that this optimization actually takes place.