K-Fold Cross Validation

Overview

The K-Fold cross validation feature assesses how well a model can predict a phenotype. The training data (subjects for which both phenotype and genotype data are available) is partitioned into k subsamples, or folds. Over k iterations, each fold in turn serves as the validation set: the model is fit on the other k - 1 folds and then predicts the phenotypes of the validation set. Because the true phenotype values of the validation samples are known, the quality of these predictions can be measured.
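As a rough illustration of the procedure described above, the following Python sketch runs the fold loop on toy data. The names `assign_folds`, `fit_line`, and `k_fold_predict` are hypothetical, and the one-predictor least-squares fit is only a stand-in for a real prediction method; it is not how the software fits its models.

```python
import random


def assign_folds(n, k, seed=0):
    """Shuffle the sample indices and deal them into k folds of near-equal size."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    fold_of = [0] * n
    for position, sample in enumerate(order):
        fold_of[sample] = position % k
    return fold_of


def fit_line(xs, ys):
    """Stand-in model (an assumption): least squares on one numeric predictor."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return mean_y - slope * mean_x, slope


def k_fold_predict(xs, ys, k, seed=0):
    """Predict every sample from a model fit on the k - 1 folds that exclude it."""
    fold_of = assign_folds(len(ys), k, seed)
    predicted = [None] * len(ys)
    for fold in range(k):
        train = [i for i in range(len(ys)) if fold_of[i] != fold]
        intercept, slope = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in range(len(ys)):
            if fold_of[i] == fold:  # validation samples for this fold
                predicted[i] = intercept + slope * xs[i]
    return fold_of, predicted
```

Each sample receives exactly one prediction, made in the fold where it was held out, so the predicted values can be compared directly with the known phenotypes.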

The initial spreadsheet, if in genotypic format, will be numerically recoded to ensure the major/minor alleles are the same for each fold.

Note

This method uses (with a genotypic spreadsheet) or assumes (with a numerically recoded spreadsheet) an additive genetic model.

Large N Considerations

If your dataset consists of more than 8,000 samples (5,500 for 32-bit systems), you will first be presented with the following dialog:

Large Data Dialog Window

If you are using the Genomic Best Linear Unbiased Predictors (GBLUP) prediction method (see Options), please see Summary of Performance Tradeoffs to decide how best to respond to this prompt.

If you are using Bayes C-pi or Bayes C, please select the Exact method. You may use the tool documented in Separately Computing the Genomic Relationship Matrix, if you wish, to pre-compute a Genomic Relationship Matrix and use that for K-Fold cross validation.

After answering this prompt, you will be taken to the standard K-Fold cross validation options explained below.

Options

K-Fold Cross Validation Dialog Window

  • Method(s): The following genomic prediction methods are available for cross validation:

  • Bayesian Options: The following parameters can be set and will affect the Bayes C\pi and Bayes C methods:

    • Number of Iterations: Enter the number of iterations the MCMC loop should run for.
    • Burn-in: How many samples should be thrown out from the beginning of the chain. Set this to 0 for no burn-in.
    • Thinning: Only store one sample every x iterations, where x is the number entered in the box. Set this to 0 for no thinning.
    • Initial Pi: This will be either the initial value of \pi for Bayes C\pi or the fixed value of \pi for Bayes C.
    • Computation Method: As Is treats each genotype value as 0, 1, or 2 (the number of minor alleles). Centered codes genotypes as 0, 1, or 2 and then subtracts the marker mean.
  • Correct For Gender: Assumes the column is coded as if the male were heterozygous for the X-Chromosome allele in question. For the GBLUP and Bayesian implementations, please see Correcting for Gender and Gender Correction, respectively.

    Note

    This option will only be available if there is a marker map and it contains at least one column in a chromosome that is listed in the assembly file as an allosome. The drop down list will only have chromosomes that are both allosomes and in this spreadsheet.

  • Use Pre-Computed Genomic Relationship Matrix: To use, check this option, click Select Sheet, and select the genomic relationship matrix spreadsheet from the window that is presented. To be valid, this spreadsheet must follow the rules outlined in Precomputed Kinship Matrix Option.

    Note

    The HWE variance sum \phi is re-calculated from the genotypic data being used for this analysis.

  • Correct for Additional Covariates: Allows additional fixed effects to be added to this model from this spreadsheet. Fixed effect covariates can be binary, integer, real-valued, categorical, or genotypic. In all cases, if a marker is chosen as an additional fixed effect, it will not be included in the analysis in any other way. To begin, check this option, then click Add Columns to choose the spreadsheet columns to use.

  • Impute Missing Genotypic Data As: Missing genotypic data can be imputed by either of the following methods:

    • Homozygous major allele: All missing genotypic data will be recoded to 0.

    • Numerically as average value: All missing genotypic data will be recoded to the average of all non-missing genotype calls (using the additive model).

      Note

      If Correct For Gender (see above) is also selected, and there is non-missing data for both males and females in a given marker, averages for males and females will be computed and used separately.

  • Stratify Folds by: To use stratified random sampling instead of simple random sampling for splitting up the data into k folds, check this option and choose either a categorical or binary column for the subpopulation column.

  • Number of Folds: This is the k in K-Fold, where k is the number of subsamples the dataset is randomly partitioned into and the number of times the cross-validation process is repeated. For each iteration, one subsample is left out of the analysis and is used as the validation set.

  • Number of Iterations: Set the number of times the full k-fold procedure will be run.

  • Delete intermediate spreadsheets with results for each fold?: If checked, the per-fold result spreadsheets will be deleted and only the final results will be visible.
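The imputation and computation-method options above can be sketched for a single marker column as follows. This is a minimal illustration, not the software's implementation: the function names are hypothetical, missing calls are represented by `None` as an assumption, and computing the marker mean after imputation is also an assumption that the text does not confirm.

```python
def impute_marker(column, method="average"):
    """Fill missing additive codes (None) in one marker column.

    method="major":   missing calls become 0 (homozygous major allele).
    method="average": missing calls become the mean of the non-missing codes.
    """
    observed = [g for g in column if g is not None]
    if method == "major" or not observed:
        fill = 0.0
    else:
        fill = sum(observed) / len(observed)
    return [fill if g is None else float(g) for g in column]


def center_marker(column):
    """'Centered' coding: subtract the marker mean from each 0/1/2 code."""
    mean = sum(column) / len(column)
    return [g - mean for g in column]


marker = [0, 2, None, 1, 1]
as_is = impute_marker(marker, method="average")  # "As Is" keeps the 0/1/2 scale
centered = center_marker(as_is)
```

With the "average" method the imputed value can be fractional, which is why the codes are held as floating-point numbers rather than integers.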

Output

For each fold the following spreadsheets will be created:

  • (Method Name) estimates by marker - Fold #: Contains the estimates of the allele substitution effect (ASE) by marker, along with their normalized values and the absolute values of the normalized values. If gender correction is applied, separate ASE columns will be output for males and females. If there was a marker map on the original spreadsheet, it will be applied to this one.

  • (Method Name) fixed effect coefficients - Fold #: Contains the coefficient corresponding to each fixed effect. If no fixed effects were chosen, the only coefficient will be the intercept. For categorical covariates, the reference category will be listed with a missing coefficient and a 1 in the “Reference Covariate?” column. The “Reference Covariate?” column will contain a 0 for all non-categorical and non-reference covariates.

  • (Method Name) estimates by sample - Fold #: Contains the actual and predicted phenotype, with the missing samples in the actual column being the validation set. For Bayes C/C-\pi, this will be phenotypes after they are transformed. Please see Standardizing Phenotype Values.

  • (Method Name) Genomic Relationship Matrix: The kinship matrix calculated from the original training set. Please see The GBLUP Genomic Relationship Matrix.

    Note

    The following will only be output for the Bayesian methods.

  • (Method Name) Run Log - Fold #: Lists how many markers were included in every hundredth iteration.

  • (Method Name) Trace Spreadsheet - Fold #: Lists the values sampled during each iteration for \pi, \sigma_M^2, \sigma_e^2, and the number of markers included, for males and females.

  • Plot of Numeric Values from (Method Name) Trace Spreadsheet - Fold #: This is a trace plot representation of the information in the Trace Spreadsheet. Autocorrelation plots can be made from the columns in this spreadsheet; please see Autocorrelation Plots.

After all the folds have run:

  • (Method Name) - ASE: These are the average ASE values over all folds.
  • (Method Name) - Fixed Effect Coefficients: These are the average coefficients over all folds.
  • (Method Name) Final Results: Contains the predicted phenotype value for each sample, taken from the fold in which that sample was in the validation set, alongside the actual phenotype value and the number of the fold in which the sample was predicted.
  • (Method Name) Summary Statistics: This contains the overall and per-fold statistics. Please see Statistics.

Note

If more than one iteration is performed, the per-fold output spreadsheets will also include an iteration number.

Sampling

There are two different sampling techniques available for partitioning the data.

Simple Random Sampling:

Each sample has the same probability of being selected.

Stratified Random Sampling:

Each subpopulation (grouped by either a categorical or binary variable) is sampled separately and then these samples are combined to ensure a sample with proportional representation of each subgroup.
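One way to picture stratified random sampling is to shuffle each subpopulation separately and deal its members across the folds in turn. The sketch below assumes this round-robin dealing and uses a hypothetical function name; the software's actual partitioning routine may differ. Simple random sampling corresponds to treating all samples as a single subpopulation.

```python
import random
from collections import defaultdict


def stratified_folds(groups, k, seed=0):
    """Assign each sample a fold so every fold mirrors the subgroup proportions.

    groups: one subpopulation label per sample (categorical or binary).
    Returns a list giving the fold number (0 .. k-1) of each sample.
    """
    rng = random.Random(seed)
    members = defaultdict(list)
    for sample, label in enumerate(groups):
        members[label].append(sample)

    fold_of = [0] * len(groups)
    position = 0  # keeps counting across subgroups so fold sizes stay balanced
    for label in sorted(members):
        rng.shuffle(members[label])  # random order within the subgroup
        for sample in members[label]:
            fold_of[sample] = position % k
            position += 1
    return fold_of


labels = ["A"] * 6 + ["B"] * 4
folds = stratified_folds(labels, k=2, seed=3)
```

In this toy case each of the two folds receives three "A" samples and two "B" samples, matching the 60/40 split of the full dataset.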

Statistics

For each fold and the whole dataset, the following statistics are calculated. n is the sample size and y are the phenotype values.

The means and standard deviations are calculated when more than one iteration is performed.

Overall Statistics

The overall statistics listed before the per-fold statistics use the actual phenotype values for y and the predicted phenotype value, \hat y, for each sample from the fold where it was part of the validation set. So, if sample 1 was part of the validation set in fold 3, its predicted value from fold 3 will be the value included in \hat y.

These values can be seen in the (Method Name) Final Results spreadsheet. The Fold Predicted in column indicates in which fold a sample was part of the validation set.

Quantitative Statistics

Pearson’s Product-Moment Correlation Coefficient

r_{y, \hat y} = \frac {\sum_{i = 1}^n (y_i - \bar y) (\hat y_i - {\hat {\bar y}})} {(n - 1) s_y s_{\hat y}}

where s_y and s_{\hat y} are the standard deviations.

Residual Sum of Squares

RSS = \sum_{i = 1}^n (y_i - \hat y_i)^2

Total Sum of Squares

TSS = \sum_{i = 1}^n (y_i - \bar y)^2

R-Squared

R^2 = 1 - \frac {RSS} {TSS}

Root Mean Square Error

RMSE = \sqrt {\frac {RSS} {n}}

Mean Absolute Error

MAE = \frac {1} {n} \sum_{i = 1}^n |y_i - \hat y_i |
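The quantitative statistics above can be computed directly from paired actual and predicted values. The following Python sketch uses a hypothetical function name and assumes that s_y and s_{\hat y} are sample standard deviations with an n - 1 denominator, which is what makes the correlation formula a standard Pearson coefficient.

```python
import math


def quantitative_stats(y, y_hat):
    """Return r, RSS, TSS, R2, RMSE, and MAE for actual y and predicted y_hat."""
    n = len(y)
    y_bar = sum(y) / n
    y_hat_bar = sum(y_hat) / n

    # Sample standard deviations (n - 1 denominator, an assumption).
    s_y = math.sqrt(sum((v - y_bar) ** 2 for v in y) / (n - 1))
    s_hat = math.sqrt(sum((v - y_hat_bar) ** 2 for v in y_hat) / (n - 1))

    cross = sum((a - y_bar) * (b - y_hat_bar) for a, b in zip(y, y_hat))
    rss = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    tss = sum((a - y_bar) ** 2 for a in y)

    return {
        "r": cross / ((n - 1) * s_y * s_hat),
        "RSS": rss,
        "TSS": tss,
        "R2": 1 - rss / tss,
        "RMSE": math.sqrt(rss / n),
        "MAE": sum(abs(a - b) for a, b in zip(y, y_hat)) / n,
    }


stats = quantitative_stats([1, 2, 3, 4], [1.5, 1.5, 3.5, 3.5])
```

For this toy input RSS is 1.0 and TSS is 5.0, so R-squared is 0.8, and both RMSE and MAE are 0.5. Note that R-squared as defined here is not the square of r; it can be negative when predictions are worse than the mean.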

Binary Statistics

Area Under the Curve (using the Wilcoxon-Mann-Whitney method)

AUC = \frac {U_1} {n_1 n_2}

where n_1 is the sample size of observations with actual phenotypes of 0 and n_2 is the sample size of observations with actual phenotypes of 1.

And

U_1 = R_1 - \frac {n_1 (n_1 + 1)} {2}

where R_1 is the sum of the ranks for the predicted phenotypes with actual phenotypes of 0.
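The rank-based AUC calculation can be sketched as below. The text does not state the direction in which ranks are assigned, so this sketch assumes that rank 1 goes to the largest predicted value and that tied predictions share their average rank. Under that assumption, U_1 / (n_1 n_2) equals the fraction of (phenotype 1, phenotype 0) pairs in which the phenotype 1 sample has the higher predicted value. The function names are hypothetical.

```python
def average_ranks_descending(values):
    """Rank 1 = largest value (an assumption); ties share their average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + 1 + end + 1) / 2.0  # mean of the 1-based positions
        for j in range(start, end + 1):
            ranks[order[j]] = shared
        start = end + 1
    return ranks


def auc_wilcoxon(actual, predicted):
    """AUC = U_1 / (n_1 n_2), with R_1 summed over samples whose actual value is 0."""
    ranks = average_ranks_descending(predicted)
    n1 = sum(1 for a in actual if a == 0)
    n2 = sum(1 for a in actual if a == 1)
    r1 = sum(r for r, a in zip(ranks, actual) if a == 0)
    u1 = r1 - n1 * (n1 + 1) / 2.0
    return u1 / (n1 * n2)


auc = auc_wilcoxon([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

In the example, three of the four (phenotype 1, phenotype 0) pairs are ordered correctly, so the result is 0.75. If ranks were instead assigned in ascending order, the same formula would yield one minus this value.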

Matthews Correlation Coefficient

MCC = \frac {TP \cdot TN - FP \cdot FN} {\sqrt {(TP + FN) \cdot (FP + TN) \cdot (TP + FP) \cdot (FN + TN)}}

where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives in the predicted phenotypes. Since GBLUP, Bayes C, and Bayes C\pi treat binary values as quantitative and predict in the range from 0 to 1, predicted values of 0.5 and higher are treated as 1 and values below 0.5 as 0.

Accuracy

ACC = \frac {TP + TN} {TP + FN + FP + TN}

Sensitivity (or true positive rate)

TPR = \frac {TP} {TP + FN}

Specificity (or true negative rate)

SPC = \frac {TN} {FP + TN}

Root Mean Square Error

RMSE = \sqrt {\frac {\sum_{i = 1}^{n} (y_i - \hat y_i)^2} {n}}
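The threshold-based binary statistics can be illustrated with a short Python sketch. The function name is hypothetical, returning NaN for an undefined ratio is an assumption, and so is the choice to compute RMSE from the raw predicted values rather than the thresholded calls.

```python
import math


def binary_stats(actual, predicted, threshold=0.5):
    """Confusion-matrix statistics after calling predictions >= threshold as 1."""
    calls = [1 if p >= threshold else 0 for p in predicted]
    tp = sum(1 for a, c in zip(actual, calls) if a == 1 and c == 1)
    tn = sum(1 for a, c in zip(actual, calls) if a == 0 and c == 0)
    fp = sum(1 for a, c in zip(actual, calls) if a == 0 and c == 1)
    fn = sum(1 for a, c in zip(actual, calls) if a == 1 and c == 0)

    def ratio(num, den):
        # NaN when the denominator is zero (an assumption, not documented behavior).
        return num / den if den else float("nan")

    n = len(actual)
    mcc_den = math.sqrt((tp + fn) * (fp + tn) * (tp + fp) * (fn + tn))
    return {
        "MCC": ratio(tp * tn - fp * fn, mcc_den),
        "ACC": ratio(tp + tn, tp + fn + fp + tn),
        "TPR": ratio(tp, tp + fn),
        "SPC": ratio(tn, fp + tn),
        # RMSE on the raw predicted values (an assumption).
        "RMSE": math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n),
    }


result = binary_stats([1, 1, 0, 0, 1, 0], [0.9, 0.4, 0.5, 0.1, 0.7, 0.2])
```

Here the prediction of exactly 0.5 is called as 1, giving TP = 2, FN = 1, FP = 1, and TN = 2, so accuracy, sensitivity, and specificity are each 2/3 and MCC is 1/3.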

Iteration Level Statistics

When more than one iteration is performed, the statistics calculated at the end of each iteration are stored. When K-Fold finishes, the mean and standard deviation of each statistic across iterations are computed and reported in a separate result viewer.
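A minimal sketch of this summary step is shown below. The function name and the dictionary layout are hypothetical, and using the sample standard deviation (n - 1 denominator) is an assumption.

```python
import statistics


def summarize_iterations(per_iteration):
    """Mean and sample standard deviation of each statistic across iterations.

    per_iteration: one dict per iteration mapping statistic name -> value.
    Requires at least two iterations, since a standard deviation needs two values.
    """
    summary = {}
    for name in per_iteration[0]:
        values = [iteration[name] for iteration in per_iteration]
        summary[name] = (statistics.mean(values), statistics.stdev(values))
    return summary


runs = [
    {"R2": 0.5, "RMSE": 1.0},
    {"R2": 0.7, "RMSE": 1.2},
    {"R2": 0.6, "RMSE": 1.1},
]
summary = summarize_iterations(runs)
```

Each entry of `summary` holds a (mean, standard deviation) pair, for example roughly (0.6, 0.1) for R2 in this toy input.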