Genomic Best Linear Unbiased Predictors Analysis

Performing GBLUP Analysis

The GBLUP method computes or imports a genomic relationship matrix and from that computes the “Genomic Best Linear Unbiased Predictor” (GBLUP) of additive genetic merits by sample and of allele substitution effects (ASE) by marker. [VanRaden2008], [Taylor2013]

Note

This method uses (with a genotypic spreadsheet) or assumes (with a numerically recoded spreadsheet) an additive genetic model.
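
As a rough illustration of what a genomic relationship matrix under an additive model looks like, the following minimal sketch follows the first method described in [VanRaden2008]. It is not taken from the tool itself: the function name `vanraden_grm`, the 0/1/2 coding of the input, and the absence of missing values are all assumptions made for the example. The names `M` and `phi` mirror the matrix M and the HWE variance sum \phi mentioned later in this section.

```python
import numpy as np

def vanraden_grm(genotypes):
    """Sketch of a genomic relationship matrix (hypothetical helper).

    genotypes: (n_samples, n_markers) array of additive codes
    (0, 1, 2 = copies of the counted allele), with no missing values.
    """
    X = np.asarray(genotypes, dtype=float)
    p = X.mean(axis=0) / 2.0            # frequency of the counted allele per marker
    M = X - 2.0 * p                     # center each marker column at its mean
    phi = 2.0 * np.sum(p * (1.0 - p))   # sum of expected variances under HWE
    if phi == 0.0:
        raise ValueError("all markers are monomorphic")
    return M @ M.T / phi

# Toy data: 3 samples by 4 markers
geno = np.array([[0, 1, 2, 0],
                 [1, 1, 0, 2],
                 [2, 0, 1, 1]])
G = vanraden_grm(geno)
```

Because every marker column is centered, each row of `G` sums to zero in this sketch, which is a convenient sanity check on toy data.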

Large N Considerations

If your dataset consists of more than 8,000 samples (5,500 for 32-bit systems), you will first be presented with the following dialog:


Large Data Dialog Window

Please see Summary of Performance Tradeoffs to decide how best to respond to this prompt. Afterward, you will be taken to the standard GBLUP options explained below.

Options


Compute Genomic BLUP (GBLUP) Dialog Window

  • REML Computation Algorithm: Choose the variant of the Restricted Maximum Likelihood method for estimating the genetic variance explained by the genotypes:

    • Use EMMA: Faster and required for predicting missing phenotypes.
    • Use AI REML: The Average Information variant from GCTA, which can integrate multiple GRMs and therefore allows an additional GRM to be used for the gender chromosome. Also required for correcting for Gene by Environment interactions. See [Yang2011].
  • Impute missing genotypic data as: Missing genotypic data can be imputed by either of the following methods:

    • Homozygous major allele: All missing genotypic data will be recoded to 0.

    • Numerically as average value: All missing genotypic data will be recoded to the average of all non-missing genotype calls (using the additive model).

      Note

      If Correct for Gender (see below) is also selected, and there is non-missing data for both males and females in a given marker, averages for males and females will be computed and used separately.

  • Correct for Gender: Assumes the column is coded as if the male were homozygous for the X-Chromosome allele in question. Uses the [Taylor2013] gender-correction algorithm (see Correcting for Gender). Two values of the ASE are output, one for each gender.

    • Choose Sex Column: Choose the spreadsheet column that specifies the gender of the sample. This column may either be categorical (“M” vs. “F”) or binary (0 = male, 1 = female).
    • Chromosome that is hemizygous for males: Usually the X Chromosome, which is the default.
    • Dosage compensation: Modify the dosage compensation.
    • GRM for Gender Chromosomes: AI REML allows multiple GRMs to be used in a single analysis. You can compute a GRM for your gender chromosomes separately and specify the precomputed GRM here.
  • Use Pre-Computed Genomic Relationship Matrix: To use, check this option, then click on Select Sheet and select the genomic relationship matrix spreadsheet from the window that is presented. To be valid, this spreadsheet must follow the rules outlined in Precomputed Kinship Matrix Option.

    Note

    When using a pre-computed genomic relationship matrix, the matrix M and the HWE variance sum \phi are re-calculated from the genotypic data being used for this analysis.

  • Normalization Algorithm (Used or Assumed) for the GRM: If a pre-computed GRM is not selected, this influences the GRM computation. See [Yang2011] for details of the individual marker normalization (not recommended for most use cases).

  • Correct for Additional Covariates: Allows additional fixed effects to be added to this model from columns of this spreadsheet. Fixed effect covariates can be binary, integer, real-valued, categorical or (if actual genotypic data rather than recoded genotypic data is being used for the analysis) genotypic. In all cases, if a marker is used as an additional fixed effect, it will not be included in the analysis in any other way. To begin, check this option, then click on Add Columns to get a choice of spreadsheet columns to use.

  • Correct for Gene by Environment Interactions: Allows correction for gene by environment interactions based on a categorical environment variable. Environment variables can be binary or categorical. To begin, check this option, then click on Add Columns to get a choice of spreadsheet columns to use. See [Yang2011] for a description of the per-variable output and the summary output in the node log produced with this option.

  • Missing Phenotypes: To predict random effects (genomic merit/genomic breeding values) and the phenotypes for samples with missing phenotypes, select Predict random effects for samples with missing phenotypes. Selecting this will also include samples with missing phenotypes as a part of the basis for the ASE calculations. Otherwise, select Drop samples with missing phenotypes.
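
The two imputation choices listed above can be sketched as follows. This is an illustration only, not the tool's implementation: the function names are hypothetical, missing calls are assumed to be stored as NaN in an additively coded array, and the average is assumed to be taken per marker (column). The separate male and female averages used with Correct for Gender are not shown.

```python
import numpy as np

def impute_major(genotypes):
    """Recode every missing call to 0 (homozygous major allele)."""
    X = np.array(genotypes, dtype=float)   # copy so the input is untouched
    X[np.isnan(X)] = 0.0
    return X

def impute_average(genotypes):
    """Recode every missing call to the mean of the non-missing calls
    in the same marker column (assumption: averaging is per marker)."""
    X = np.array(genotypes, dtype=float)
    col_means = np.nanmean(X, axis=0)      # per-marker mean, ignoring NaN
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

# Toy data: 3 samples by 2 markers, NaN marks a missing genotype
geno = np.array([[0.0, np.nan],
                 [2.0, 1.0],
                 [np.nan, 2.0]])
by_major = impute_major(geno)
by_average = impute_average(geno)
```

In this toy example the missing call in the first marker becomes 0 under the first choice and 1.0 (the mean of 0 and 2) under the second.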

GBLUP Output

The following two spreadsheets will always be created:

  • GBLUP estimates by sample: This spreadsheet contains the phenotype (selected dependent variable) of each sample and the Random effects component for each sample.

  • GBLUP estimates by marker: These are the GBLUP estimates of the allele substitution effect (ASE) by marker, along with the absolute magnitude of the ASE and the normalized absolute magnitude of the ASE. If gender correction is applied, separate columns for the ASE, the absolute magnitude of the ASE, and the normalized Abs ASE will be output for both males and females. The marker map will be applied to this spreadsheet.

    Note

    It is recommended that the Normalized Abs ASE value be used for plotting and visualization.

The following values will be output to the node change log of each spreadsheet:

  • The pseudo-heritability ph, which is

    ph = \hat{\sigma^2_G} / Var(y)
       = \hat{\sigma^2_G} / (\hat{\sigma^2_G} + \hat{\sigma^2_e})
       = 1 / (1 + \hat{\delta}).

  • The would-be pseudo-heritability phw if the normalized genomic relationship matrix had been used. This is

    phw = 1 / (1 + \hat{\delta} / w),

    where w is the normalizing factor that would have been necessary to normalize the genomic relationship matrix according to the methodology of Normalizing the Kinship Matrix.

  • The variance and standard error of the pseudo-heritability, Var(h^2) and SE(h^2) = \sqrt{Var(h^2)}. See Estimating the Variance of Heritability for the formula.

  • The p-value, which is

    P(X > -2(l_0 - l_1)),

    where X is chi-square distributed with one degree of freedom, l_1 is the maximized restricted log-likelihood f_R(\delta) found by REML (see Finding the Variance Components), and l_0 is the log-likelihood

    l_0 = \frac{-n}{2}\bigg(1 + \log(2\pi) + \log\big(\frac{rss}{n}\big)\bigg)

    based on the corresponding linear model with no random effects, where rss is the residual sum of squares for that linear model.

  • The genetic component of variance Vg (\hat{\sigma^2_G}).

  • The error component of variance Ve (\hat{\sigma^2_e}).
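
The logged quantities above can be checked by hand from the variance components. The sketch below is a minimal illustration using only the formulas stated in this list. The function names are hypothetical, rss is treated as the sum of squared residuals of the linear model with no random effects, and clamping a negative test statistic to zero is an assumption of the example rather than documented behavior.

```python
import math

def pseudo_heritability(delta, w=1.0):
    """ph = 1 / (1 + delta); with w != 1 this gives phw = 1 / (1 + delta / w)."""
    return 1.0 / (1.0 + delta / w)

def null_log_likelihood(rss, n):
    """l_0 for the linear model with no random effects."""
    return -n / 2.0 * (1.0 + math.log(2.0 * math.pi) + math.log(rss / n))

def lrt_p_value(l0, l1):
    """P(X > -2 (l0 - l1)) for X chi-square with one degree of freedom."""
    stat = max(-2.0 * (l0 - l1), 0.0)      # assumption: clamp at zero
    # Survival function of chi-square(1): P(X > x) = erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(stat / 2.0))

# Toy values: Vg = 0.6 and Ve = 0.4, so delta = Ve / Vg
vg, ve = 0.6, 0.4
ph = pseudo_heritability(ve / vg)          # equals Vg / (Vg + Ve)
l0 = null_log_likelihood(rss=10.0, n=10)
p = lrt_p_value(l0, l0 + 2.0)              # test statistic of 4.0
```

The relation delta = Ve / Vg used here follows from equating the last two expressions for ph given above.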

If you have selected additional covariates, two additional outputs will appear in these node change logs:

  • Proportion of genetic variance and
  • Prop. explained by fixed covariates.

These correspond exactly to the proportion of Genetic variance (pg_j for j=0) and the proportion of Variance explained (p_{fixed}) outputs of the Variance Partition Plot (see The Variance Partition Plot) for the initial model (j=0) of an MLMM run. The pseudo-heritability ph_0 for the unnormalized genomic relationship matrix is used here.

GBLUP Genomic Relationship Matrix

Unless the Use Pre-Computed Genomic Relationship Matrix option is selected, a GBLUP Genomic Relationship Matrix spreadsheet will be created. This spreadsheet can be used not only as a pre-computed relationship matrix for other runs of this GBLUP tool, but also as a pre-computed kinship matrix for the EMMAX and MLMM Mixed Model GWAS methods (Mixed Linear Model Analysis).

This matrix can also be computed when there is no dependent variable available. See Separately Computing the Genomic Relationship Matrix for more information.