K-Fold Cross Validation

Overview

The K-Fold cross validation feature is used to assess how well a model can predict a phenotype. Training data (subjects for which we have both phenotype and genotype data) is partitioned into k subsamples. For each of these k subsamples (or “folds”), a model is created where that subsample is selected to be the validation set and the remaining subjects are selected to be the training set. Using the method or methods that you select, each model is then fit on the training set (phenotype and genotype data) and used to predict the phenotype values of the validation set (using the validation set’s genotype data). Because the actual phenotype values for the validation set are available, the predictions can be compared against them directly to measure how well the model, and each method that you select, was able to predict the validation set phenotypes.

In addition, this procedure may be repeated over several iterations to get a better assessment of how well these models and your selected method(s) work.
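
The partitioning described above can be sketched in a few lines of Python. This is only an illustration of the general k-fold idea, not the software's own implementation; the function names `k_fold_indices` and `train_validation_splits` and the round-robin assignment of shuffled subjects to folds are assumptions made for the example.

```python
import random

def k_fold_indices(n_subjects, k, seed=None):
    """Randomly partition subject indices 0..n_subjects-1 into k folds."""
    rng = random.Random(seed)
    indices = list(range(n_subjects))
    rng.shuffle(indices)
    # Deal the shuffled indices round-robin so fold sizes differ by at most one
    # (an assumption; the software may balance folds differently).
    return [indices[i::k] for i in range(k)]

def train_validation_splits(folds):
    """Yield one (training, validation) pair of index lists per fold."""
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

folds = k_fold_indices(n_subjects=10, k=3, seed=1)
splits = list(train_validation_splits(folds))
for training, validation in splits:
    print(sorted(validation), "predicted from", len(training), "training subjects")
```

Every subject appears in exactly one validation set, so after all k folds have run there is one out-of-sample prediction per subject. Repeating the procedure with a different seed corresponds to running another iteration.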

The methods available to use for K-Fold Cross Validation are:

  • Genomic Best Linear Unbiased Predictors (GBLUP)
  • Bayes C\pi
  • Bayes C

An alternative mode is also available whereby, in each of a number of iterations, a subset of N/k subjects is selected at random (where N is the total number of subjects) and the phenotypes for these subjects are predicted once, based on the remaining data. The advantage of this mode is that every prediction is made for a freshly drawn set of subjects selected from the entire original set. A subject may therefore be selected again even if it belonged to a validation set in a previous iteration.
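
A minimal sketch of this alternative mode follows. The function name `single_prediction_splits` is hypothetical, and rounding N/k down to a whole number is an assumption, since the text does not say how a non-integer N/k is handled.

```python
import random

def single_prediction_splits(n_subjects, k, iterations, seed=None):
    """Yield one (training, validation) split per iteration.

    Each iteration draws a fresh validation set of about N/k subjects
    from all N subjects, independently of earlier iterations.
    """
    rng = random.Random(seed)
    validation_size = n_subjects // k  # assumption: N/k is rounded down
    all_subjects = range(n_subjects)
    for _ in range(iterations):
        validation = set(rng.sample(all_subjects, validation_size))
        training = [i for i in all_subjects if i not in validation]
        yield training, sorted(validation)

alt_splits = list(single_prediction_splits(n_subjects=20, k=5, iterations=3, seed=0))
for training, validation in alt_splits:
    print("validation set:", validation)
```

Unlike the k-fold partition, the validation sets drawn in different iterations may overlap, and some subjects may never be predicted at all.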

Note

The initial spreadsheet, if in genotypic format, will be numerically recoded to ensure the major/minor alleles are the same for each fold.

Note

This method uses (with a genotypic spreadsheet) or assumes (with a numerically recoded spreadsheet) an additive genetic model.
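
To illustrate what the two notes above imply, the sketch below recodes one marker under an additive model (the count of minor alleles, 0, 1, or 2) using a minor allele determined once from the whole dataset, so that every fold shares the same coding. It also shows the centering used by the Centered computation method described under Options. The `"A_B"` genotype string format, the function names, and the tie-breaking rule are assumptions made for the example.

```python
from collections import Counter

def minor_allele(genotypes):
    """Return the less frequent allele among the non-missing calls of one marker.

    Assumes a biallelic marker with both alleles observed; ties are broken
    alphabetically, which is an arbitrary choice for this sketch.
    """
    counts = Counter(a for g in genotypes if g is not None for a in g.split("_"))
    return min(counts, key=lambda a: (counts[a], a))

def additive_recode(genotypes, minor):
    """Recode each genotype as its number of minor alleles (None stays missing)."""
    return [None if g is None else g.split("_").count(minor) for g in genotypes]

def center(values):
    """Subtract the marker mean of the non-missing values from each value."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [None if v is None else v - mean for v in values]

marker = ["A_A", "A_B", "B_B", "A_A", None, "A_B"]
minor = minor_allele(marker)            # determined once, before any fold is formed
as_is = additive_recode(marker, minor)  # "As Is" style coding
centered = center(as_is)                # "Centered" style coding
print(minor, as_is, centered)
```

Determining the minor allele before the data are split is the point of the first note: if it were recomputed per fold, a marker with allele frequencies near 0.5 could be coded in opposite directions in different folds.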

Large N Considerations

If your dataset consists of more than 8,000 samples (5,500 for 32-bit systems), you will first be presented with the following dialog:

Large Data Dialog Window

If you are using the Genomic Best Linear Unbiased Predictors (GBLUP) prediction method (Options), please see Summary of Performance Tradeoffs to decide how best to respond to this prompt.

If you are using Bayes C-pi or Bayes C, please select the Exact method. You may use the tool documented in Separately Computing the Genomic Relationship Matrix, if you wish, to pre-compute a Genomic Relationship Matrix and use that for K-Fold cross validation.

After answering this prompt, you will be taken to the standard K-Fold cross validation options explained below.

Options

K-Fold Cross Validation Dialog Window

  • Method(s): Select one or more of the genomic prediction methods available for cross validation: GBLUP, Bayes C\pi, and Bayes C.

  • Bayesian Options: The following parameters can be set and will affect the Bayes C\pi and Bayes C methods:

    • Number of Iterations: Enter the number of iterations the MCMC loop should run for.
    • Burn-in: How many samples should be thrown out from the beginning of the chain. Set this to 0 for no burn-in.
    • Thinning: Only store one sample every x iterations, where x is the number entered in the box. Set this to 0 for no thinning.
    • Initial Pi: This will be either the initial value of \pi for Bayes C\pi or the fixed value of \pi for Bayes C.
    • Computation Method: As Is will treat the genotype values as 0, 1, or 2 (the number of minor alleles). Centered will code genotypes as 0, 1, or 2 and then subtract the marker mean.
  • Correct For Gender: Assumes the column is coded as if the male were heterozygous for the X-Chromosome allele in question. For GBLUP and Bayesian implementations, please see Correcting the GRM for Gender Using Overall Normalization and Gender Correction, respectively.

    Note

    This option will only be available if there is a marker map and it contains at least one column in a chromosome that is listed in the assembly file as an allosome. The drop down list will only have chromosomes that are both allosomes and in this spreadsheet.

  • Use Pre-Computed Genomic Relationship Matrix: To use, check this option and click on Select Sheet and select the genomic relationship matrix spreadsheet from the window that is presented. To be valid, this spreadsheet must follow the rules outlined in Precomputed Kinship Matrix Option.

    Note

    The HWE variance sum \phi is re-calculated from the genotypic data being used for this analysis.

  • Correct for Additional Covariates: Allows additional fixed effects to be added to this model from this spreadsheet. Fixed effect covariates can be binary, integer, real-valued, categorical, or genotypic. In all cases, if a marker is chosen as an additional fixed effect, it will not be included in the analysis in any other way. To begin, check this option, then click on Add Columns to get a choice of spreadsheet columns to use.

  • Impute Missing Genotypic Data As: Missing genotypic data can be imputed by either of the following methods:

    • Homozygous major allele: All missing genotypic data will be recoded to 0.

    • Numerically as average value: All missing genotypic data will be recoded to the average of all non-missing genotype calls (using the additive model).

      Note

      If Correct For Gender (see above) is also selected, and there is non-missing data for both males and females in a given marker, averages for males and females will be computed and used separately.

  • Stratify Folds by: To use stratified random sampling instead of simple random sampling for splitting up the data into k folds, check this option and choose either a categorical or binary column for the subpopulation column.

  • K-Fold Options: Choose how to select the samples and how many predictions to make.

    • Predict Only Once per Iteration: Checking this will select the alternative mode whereby, for each overall iteration, N/k subjects are selected to be the validation set, the remainder are selected to be the training set, and one prediction is made.

    • Number of Folds: This is the k in K-Fold, where k is the number of subsamples the dataset is randomly partitioned into and (unless Predict Only Once per Iteration is checked) the number of separate models (with distinct validation sets) that will be created and used for prediction and validation.

    • Number of Iterations: If Predict Only Once per Iteration is not checked, the number of times predicting for k folds will be repeated.

      If Predict Only Once per Iteration is checked, the number of times a validation set of N/k subjects is randomly selected from all N subjects and used for prediction based on the remaining subjects.

  • Delete intermediate spreadsheets with results for each fold?: If checked, this option will delete the per-fold result spreadsheets, and only the final results will be visible.

    Note

    If Predict Only Once per Iteration and this option are both checked, no prediction spreadsheets will be shown. Only the statistical results for every iteration and a summary of these results will show for each method you have chosen.
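
The two choices under Impute Missing Genotypic Data As can be sketched as follows for a single marker that has already been recoded additively. This is an illustration under stated assumptions rather than the software's code: the function names are hypothetical, averages are taken per marker, and falling back to the overall average when only one sex has data is an assumption.

```python
def impute_homozygous_major(values):
    """Recode every missing value (None) to 0, the homozygous major genotype."""
    return [0 if v is None else v for v in values]

def impute_average(values, sexes=None):
    """Recode missing values to the average of the non-missing values.

    If `sexes` is given and both sexes have non-missing data, a separate
    average is used for each sex, as described in the note on gender.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to average; leave the marker unchanged
    overall = sum(observed) / len(observed)
    averages = {}
    if sexes is not None:
        by_sex = {}
        for v, s in zip(values, sexes):
            if v is not None:
                by_sex.setdefault(s, []).append(v)
        if len(by_sex) >= 2:
            averages = {s: sum(vs) / len(vs) for s, vs in by_sex.items()}
    labels = sexes if sexes is not None else [None] * len(values)
    return [averages.get(s, overall) if v is None else v
            for v, s in zip(values, labels)]

marker = [0, 2, None, 1, None, 2]
sexes = ["M", "M", "M", "F", "F", "F"]
print(impute_homozygous_major(marker))
print(impute_average(marker))
print(impute_average(marker, sexes))
```

In this example the overall average is 1.25, while the male and female averages are 1.0 and 1.5, so the two missing calls receive different values once gender is taken into account.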

Output for the Normal Mode (K Predictions per Iteration)

When Predict Only Once per Iteration is unchecked, these outputs are created.

For each fold and each iteration the following spreadsheets will be created for each method you have selected:

Note

If Delete intermediate spreadsheets with results for each fold? is checked, the following outputs will be created only on a temporary basis, and will be deleted after the average and summary outputs have been created.

  • (Method Name) Genomic Relationship Matrix: the kinship matrix calculated from the original training set, if you did not specify a Pre-Computed Genomic Relationship Matrix. Please see The Genomic Relationship Matrix. (NOTE: This matrix is only calculated once, and is used for this and all subsequent estimates.)

  • (Method Name) estimates by marker - Fold #: Contains the estimates of the allele substitution effect (ASE) by marker, along with their normalized values and the absolute values of those normalized values. If gender correction is applied, separate ASE columns will be output for males and females. If there was a marker map on the original spreadsheet, it will be applied to this one.

  • (Method Name) fixed effect coefficients - Fold #: Contains the coefficient corresponding to each fixed effect. If there were no fixed effects chosen, the only coefficient will be the intercept. For categorical covariates, the reference category will be listed with missing for the coefficient and a 1 in the “Reference Covariate?” column. The “Reference Covariate?” column will contain a 0 for all non-categorical and non-reference covariates.

  • (Method Name) estimates by sample - Fold #: Contains the actual and predicted phenotypes. The samples with missing values in the actual column form the validation set. For Bayes C/C-\pi, these will be the phenotypes after they are transformed. Please see Standardizing Phenotype Values.

    Note

    The following will only be output for the Bayesian methods.

  • (Method Name) Run Log - Fold #: Lists how many markers were included in every hundredth iteration.

  • (Method Name) Trace Spreadsheet - Fold #: Lists the values sampled during each iteration for \pi, \sigma_M^2, \sigma_e^2, and the number of markers included, for males and females.

  • Plot of Numeric Values from (Method Name) Trace Spreadsheet - Fold #: This is a trace plot representation of the information in the Trace Spreadsheet. Autocorrelation plots can be made from the columns in this spreadsheet; please see Autocorrelation Plots.

Within each iteration, after all the folds have run, the following outputs will be created for each method you have selected:

  • (Method Name) - ASE: These are the average ASE values over all folds.
  • (Method Name) - Fixed Effect Coefficients: These are the average coefficients over all folds.
  • (Method Name) Final Results: These are the predicted phenotype values, taken from the validation set predictions of each fold, combined with the actual phenotype values. The fold number in which each sample was predicted is also displayed.
  • (Method Name) Summary Statistics: This contains the overall and per fold statistics. Please see Statistics. For the Bayes C and Bayes C\pi methods, the mean and standard deviation of the phenotype are also shown as part of the overall statistics.

Note

If you have selected to perform more than one iteration, all of the above outputs will additionally include the iteration number in their headers.

Once all the iterations have finished, and if you have selected to perform more than one iteration, the following outputs will be created for each method you have selected:

  • (Method Name) Iteration Summary Statistics (spreadsheet): This spreadsheet contains one row for each iteration containing the overall statistics for that iteration. The last two rows of this spreadsheet contain, in each column, the mean and the standard deviation of the statistics displayed in the previous rows (elements) of that column.
  • (Method Name) Iteration Summary Statistics (results viewer): This viewer (which repeats the information from the last two rows of the (Method Name) Iteration Summary Statistics spreadsheet described above) summarizes the overall iteration statistics from using this method with the parameters you have specified, by showing the means and standard deviations of these statistics.

Output for the Alternative Mode (One Prediction per Iteration)

When Predict Only Once per Iteration is checked, these outputs are created.

For each iteration, the following spreadsheets are created:

Note

If Delete intermediate spreadsheets with results for each fold? is checked, the following outputs will be created only on a temporary basis, and will be deleted after the summary outputs have been created.

  • (Method Name) Genomic Relationship Matrix: the kinship matrix calculated from the original training set, if you did not specify a Pre-Computed Genomic Relationship Matrix. Please see The Genomic Relationship Matrix. (NOTE: This matrix is only calculated once, and is used for this and all subsequent estimates.)

  • (Method Name) estimates by marker - Iteration #: Contains the estimates of the allele substitution effect (ASE) by marker, along with their normalized values and the absolute values of those normalized values. If gender correction is applied, separate ASE columns will be output for males and females. If there was a marker map on the original spreadsheet, it will be applied to this one.

  • (Method Name) fixed effect coefficients - Iteration #: Contains the coefficient corresponding to each fixed effect. If there were no fixed effects chosen, the only coefficient will be the intercept. For categorical covariates, the reference category will be listed with missing for the coefficient and a 1 in the “Reference Covariate?” column. The “Reference Covariate?” column will contain a 0 for all non-categorical and non-reference covariates.

  • (Method Name) estimates by sample - Iteration #: Contains the actual and predicted phenotypes. The samples with missing values in the actual column form the validation set. For Bayes C/C-\pi, these will be the phenotypes after they are transformed. Please see Standardizing Phenotype Values.

    Note

    The following will only be output for the Bayesian methods.

  • (Method Name) Run Log - Fold 1 - Iteration #: Lists how many markers were included in every hundredth iteration.

  • (Method Name) Trace Spreadsheet - Iteration #: Lists the values sampled during each iteration for \pi, \sigma_M^2, \sigma_e^2, and the number of markers included, for males and females.

  • Plot of Numeric Values from (Method Name) Trace Spreadsheet - Iteration #: This is a trace plot representation of the information in the Trace Spreadsheet. Autocorrelation plots can be made from the columns in this spreadsheet; please see Autocorrelation Plots.

Once all the iterations have finished, the following outputs will be created for each method you have selected:

  • (Method Name) Iteration Summary Statistics (spreadsheet): This spreadsheet contains one row for each iteration containing the prediction statistics for that iteration. The last two rows of this spreadsheet contain, in each column, the mean and the standard deviation of the statistics displayed in the previous rows (elements) of that column.
  • (Method Name) Iteration Summary Statistics (results viewer): This viewer (which repeats the information from the last two rows of the (Method Name) Iteration Summary Statistics spreadsheet described above) summarizes the iteration statistics from using this method with the parameters you have specified, by showing the means and standard deviations of these statistics.

Sampling

There are two different sampling techniques available for partitioning the data.

Simple Random Sampling:

Each sample has the same probability of being selected.

Stratified Random Sampling:

Each subpopulation (grouped by either a categorical or binary variable) is sampled separately and then these samples are combined to ensure a sample with proportional representation of each subgroup.
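
As a rough sketch of stratified random sampling into folds, the code below shuffles each subpopulation separately and deals its members across the k folds, so that each fold receives a near-proportional share of every subgroup. The function name `stratified_folds` and the dealing scheme are assumptions for illustration; the software may combine the per-subgroup samples differently.

```python
import random
from collections import defaultdict

def stratified_folds(strata, k, seed=None):
    """Partition sample indices into k folds, sampling each stratum separately.

    `strata` holds one subpopulation label per sample (categorical or binary).
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for index, label in enumerate(strata):
        by_stratum[label].append(index)

    folds = [[] for _ in range(k)]
    offset = 0  # continue dealing where the previous stratum stopped
    for label in sorted(by_stratum):
        members = by_stratum[label]
        rng.shuffle(members)  # simple random sampling within the stratum
        for position, index in enumerate(members):
            folds[(offset + position) % k].append(index)
        offset += len(members)
    return folds

strata = ["case"] * 6 + ["control"] * 9
strat_folds = stratified_folds(strata, k=3, seed=42)
for fold in strat_folds:
    labels = [strata[i] for i in fold]
    print(labels.count("case"), "cases,", labels.count("control"), "controls")
```

With simple random sampling, by contrast, a fold could by chance contain very few cases, which matters most for small or unbalanced subgroups.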

Statistics

For each fold and the whole dataset, and for each iteration, the following statistics are calculated. Here n is the sample size, y denotes the actual phenotype values, and \hat y denotes the predicted phenotype values.

Overall Statistics (Normal Mode)

The overall statistics listed before the per fold statistics use the actual phenotype values for y and the predicted phenotype value, \hat y, for each sample from the fold where it was part of the validation set. So, if sample 1 was part of the validation set in fold 3, its predicted value from fold 3 will be the value included in \hat y.

These values can be seen in the (Method Name) Final Results spreadsheet. The Fold Predicted in column indicates in which fold a sample was part of the validation set.

Quantitative Statistics

Pearson’s Product-Moment Correlation Coefficient

r_{y, \hat y} = \frac {\sum_{i = 1}^n (y_i - \bar y) (\hat y_i - \bar {\hat y})} {(n - 1) s_y s_{\hat y}}

where \bar y and \bar {\hat y} are the means, and s_y and s_{\hat y} are the sample standard deviations, of the actual and predicted phenotype values.

Residual Sum of Squares

RSS = \sum_{i = 1}^n (y_i - \hat y_i)^2

Total Sum of Squares

TSS = \sum_{i = 1}^n (y_i - \bar y)^2

R-Squared

R^2 = 1 - \frac {RSS} {TSS}

Root Mean Square Error

RMSE = \sqrt {\frac {RSS} {n}}

Mean Absolute Error

MAE = \frac {1} {n} \sum_{i = 1}^n |y_i - \hat y_i |
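
The quantitative statistics above can be computed directly from paired lists of actual and predicted values. The sketch below follows the formulas term by term; the function name `quantitative_statistics` and the example values are made up for illustration, and no guard is included for constant inputs (zero variance).

```python
import math

def quantitative_statistics(y, y_hat):
    """Compute r, RSS, TSS, R^2, RMSE and MAE for actual y and predicted y_hat."""
    n = len(y)
    mean_y = sum(y) / n
    mean_hat = sum(y_hat) / n
    # Sample standard deviations (divisor n - 1), matching the r formula.
    s_y = math.sqrt(sum((a - mean_y) ** 2 for a in y) / (n - 1))
    s_hat = math.sqrt(sum((b - mean_hat) ** 2 for b in y_hat) / (n - 1))
    cross = sum((a - mean_y) * (b - mean_hat) for a, b in zip(y, y_hat))
    rss = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    tss = sum((a - mean_y) ** 2 for a in y)
    return {
        "r": cross / ((n - 1) * s_y * s_hat),
        "RSS": rss,
        "TSS": tss,
        "R2": 1 - rss / tss,
        "RMSE": math.sqrt(rss / n),
        "MAE": sum(abs(a - b) for a, b in zip(y, y_hat)) / n,
    }

stats = quantitative_statistics([1.0, 2.0, 3.0, 4.0], [1.5, 1.5, 3.5, 3.5])
print(stats)
```

Note that R^2 as defined here is not the square of r: it can be negative when the predictions fit worse than the mean of the actual values.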

Binary Statistics

Area Under the Curve (Using the Wilcoxon Mann Whitney method)

AUC = \frac {U_1} {n_1 n_2}

where n_1 is the sample size of observations with actual phenotypes of 0 and n_2 is the sample size of observations with actual phenotypes of 1.

And

U_1 = R_1 - \frac {n_1 (n_1 + 1)} {2}

where R_1 is the sum of the ranks for the predicted phenotypes with actual phenotypes of 0.

Matthews Correlation Coefficient

MCC = \frac {TP \cdot TN - FP \cdot FN} {\sqrt {(TP + FN) \cdot (FP + TN) \cdot (TP + FP) \cdot (FN + TN)}}

where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives in the predicted phenotypes. Since all three methods used here (GBLUP, Bayes C and C\pi) treat binary values as quantitative and predict in the range from 0 to 1, we treat predicted values of 0.5 and higher as 1 and below 0.5 as 0.

Accuracy

ACC = \frac {TP + TN} {TP + FN + FP + TN}

Sensitivity (or true positive rate)

TPR = \frac {TP} {TP + FN}

Specificity (or true negative rate)

SPC = \frac {TN} {FP + TN}

Root Mean Square Error

RMSE = \sqrt {\frac {\sum_{i = 1}^{n} (y_i - \hat y_i)^2} {n}}
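
The confusion-matrix statistics above can be sketched as follows, using the 0.5 cutoff described for converting predicted values to 0 or 1. The function name `binary_statistics` and the example data are hypothetical. Computing the RMSE from the unthresholded predicted values is an assumption, and the sketch assumes both phenotype classes are present, so the sensitivity and specificity denominators are nonzero.

```python
import math

def binary_statistics(actual, predicted, threshold=0.5):
    """Compute MCC, ACC, TPR, SPC and RMSE for a 0/1 phenotype."""
    # Predicted values at or above the threshold are called 1, otherwise 0.
    calls = [1 if p >= threshold else 0 for p in predicted]
    tp = sum(1 for a, c in zip(actual, calls) if a == 1 and c == 1)
    tn = sum(1 for a, c in zip(actual, calls) if a == 0 and c == 0)
    fp = sum(1 for a, c in zip(actual, calls) if a == 0 and c == 1)
    fn = sum(1 for a, c in zip(actual, calls) if a == 1 and c == 0)
    denominator = math.sqrt((tp + fn) * (fp + tn) * (tp + fp) * (fn + tn))
    n = len(actual)
    return {
        "TP": tp, "TN": tn, "FP": fp, "FN": fn,
        # MCC is undefined when any marginal count is zero.
        "MCC": (tp * tn - fp * fn) / denominator if denominator else float("nan"),
        "ACC": (tp + tn) / (tp + fn + fp + tn),
        "TPR": tp / (tp + fn),
        "SPC": tn / (fp + tn),
        # Assumption: RMSE uses the raw predicted values, not the 0/1 calls.
        "RMSE": math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n),
    }

actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [0.9, 0.6, 0.4, 0.2, 0.5, 0.1, 0.3, 0.7]
bstats = binary_statistics(actual, predicted)
print(bstats)
```

In this example the prediction of exactly 0.5 for a sample whose actual phenotype is 0 counts as a false positive, because values of 0.5 and higher are treated as 1.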

Iteration Level Statistics

As noted above (Output for the Normal Mode (K Predictions per Iteration) and Output for the Alternative Mode (One Prediction per Iteration)), if you have selected to perform more than one iteration, or you have specified Predict Only Once per Iteration, the means and standard deviations of the statistics calculated at the end of each iteration are reported in a separate result viewer (one result viewer for each method) after all other results have been estimated and shown.
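
The iteration-level summary amounts to taking, for each statistic, the mean and standard deviation over the per-iteration values. A minimal sketch using Python's standard `statistics` module is shown below. The function name `summarize_iterations` and the example numbers are made up, and the use of the sample standard deviation (divisor n - 1) is an assumption, since the text does not say which form is reported.

```python
from statistics import mean, stdev

def summarize_iterations(rows):
    """Return the column-wise mean and standard deviation of per-iteration statistics.

    `rows` is a list with one dict per iteration, mapping a statistic name
    to its value. At least two iterations are needed for a standard deviation.
    """
    columns = list(rows[0].keys())
    return {
        "mean": {c: mean([r[c] for r in rows]) for c in columns},
        # Assumption: sample standard deviation (divisor n - 1).
        "sd": {c: stdev([r[c] for r in rows]) for c in columns},
    }

iteration_rows = [
    {"R2": 0.5, "RMSE": 1.0},
    {"R2": 0.7, "RMSE": 0.8},
    {"R2": 0.6, "RMSE": 0.9},
]
summary = summarize_iterations(iteration_rows)
print(summary)
```

A small standard deviation across iterations suggests that the assessment does not depend strongly on how the subjects happened to be split.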