Genomic Best Linear Unbiased Predictors Analysis

Performing GBLUP Analysis

The GBLUP method computes or imports a genomic relationship matrix and from that computes the “Genomic Best Linear Unbiased Predictor” (GBLUP) of additive genetic merits by sample and of allele substitution effects (ASE) by marker. [VanRaden2008], [Taylor2013]

Note

This method uses (with a genotypic spreadsheet) or assumes (with a numerically recoded spreadsheet) an additive genetic model.

Large N Considerations

If your dataset consists of more than 8,000 samples, you will first be presented with the following dialog:

gblupLargeData

Large Data Dialog Window

Please see Summary of Performance Tradeoffs to know how best to respond to this prompt. Afterward, you will be taken to the standard GBLUP options explained below.

Note

As noted in this dialog, please select Exact if you plan to use the AI REML algorithm.

Options

gblupDialog

Compute Genomic BLUP (GBLUP) Dialog Window

  • REML Computation Algorithm: Choose the variant of the Restricted Maximum Likelihood method for estimating the genetic variance explained by the genotypes:

    • Use EMMA: Usually faster.
    • Use AI REML: The Average Information algorithm allows integrating multiple GRMs into the model, either for using an additional GRM for the gender chromosome, or for correcting for Gene by Environment interactions. See [Yang2011].
  • Impute missing genotypic data as: Missing genotypic data can be imputed by either of the following methods:

    • Homozygous major allele: All missing genotypic data will be recoded to 0.

    • Numerically as average value: All missing genotypic data will be recoded to the average of all non-missing genotype calls (using the additive model).

      Note

      If Correct for Gender (see below) is also selected, and there is non-missing data for both males and females in a given marker, averages for males and females will be computed and used separately.

  • Correct for Gender: Assumes the column is coded as if the male were homozygous for the X-Chromosome allele in question. Uses the [Taylor2013] gender-correction algorithm. (See Correcting the GRM for Gender Using Overall Normalization and Correcting the GRM for Gender Using Normalization by Individual Marker.) Two values of the ASE are output, one for each gender.

    • Choose Sex Column: Choose the spreadsheet column that specifies the gender of the sample. This column may either be categorical (“M” vs. “F”) or binary (0 = male, 1 = female).
    • Chromosome that is hemizygous for males: Usually the X Chromosome, which is the default.
    • Dosage compensation: Select the dosage compensation to be used. Equal X-Linked Variance is the default.
    • Use Pre-Computed Genomic Relationship Matrix for Gender Chromosome (AI REML only): The AI REML allows for multiple GRMs to be used in a single analysis. You can compute a GRM for your gender chromosome separately and specify that precomputed GRM here. To be valid, this spreadsheet must follow the rules outlined in Precomputed Kinship Matrix Option.
  • Use Pre-Computed Genomic Relationship Matrix: To use, check this option, then click on Select Sheet and select the genomic relationship matrix spreadsheet from the window that is presented. To be valid, this spreadsheet must follow the rules outlined in Precomputed Kinship Matrix Option.

    Note

    When using a pre-computed genomic relationship matrix, the matrix M and the scaling factor \phi are re-calculated from the genotypic data being used for this analysis.

  • Normalization Algorithm (Used or Assumed) for the GRM: If a pre-computed GRM is not selected, this choice influences the GRM computation. In any case, this choice influences ASE computations. See [Yang2011] for details of the individual marker normalization. Overall normalization is the default. (See The Genomic Relationship Matrix.)

  • Correct for Additional Covariates: Allows additional fixed effects to be added to this model from columns of this spreadsheet. Fixed effect covariates can be binary, integer, real-valued, categorical or (if actual genotypic data rather than recoded genotypic data is being used for the analysis) genotypic. In all cases, if a marker is used as an additional fixed effect, it will not be included in the analysis in any other way. To begin, check this option, then click on Add Columns to get a choice of spreadsheet columns to use.

  • Correct for Gene by Environment Interactions: Allows for the correct for gene by environment interactions based on an environment variable, which may be binary or categorical. (See Incorporating Gene by Environment Interactions into GBLUP.) To begin, check this option, then click on Add Columns to get a choice of spreadsheet columns to use.

    Note

    You must use the AI REML algorithm to correct for gene by environment interactions.

    Note

    Ouputs for gene by environment interactions are as described in Spreadsheet Output for Multiple Random Effects and in Node Change Log Output for the AI Algorithm.

  • Missing Phenotypes: To predict random effects (genomic merit/genomic breeding values) and the phenotypes for samples with missing phenotypes, select Predict random effects for samples with missing phenotypes (see Genomic Prediction). Selecting this will cause samples with missing phenotypes to be included in the ASE calculations.

    Otherwise, select Drop samples with missing phenotypes.

    Note

    An alternative prediction procedure is to use Genotype -> Predict Phenotypes From Existing Results. See Predict Phenotypes From Existing Results.

GBLUP Output

Spreadsheet Output for One Random Effect

Unless you have selected Use Pre-Computed Genomic Relationship Matrix, the following spreadsheet is output:

  • GBLUP Genomic Relationship Matrix The relationship between pairs of samples, as determined by actual genomic similarity (or dis-similarity) between samples.

The following three spreadsheets will always be created:

  • GBLUP estimates by sample: This spreadsheet contains a column for the phenotype (selected dependent variable) of each sample and a column for the Random effects component for each sample. If you have selected Predict random effects for samples with missing phenotypes, a third column containing the Predicted phenotype for each sample will be inserted after the first column, which will then be called the Actual Phenotype.

  • GBLUP fixed effect coefficients Contains the coefficient corresponding to each fixed effect. If there were no fixed effects selected, the only coefficient will be the intercept. For categorical covariates, the reference category will be listed with missing for the coefficient and a 1 in the the “Reference Covariate?” column. The “Reference Covariate?” column will contain a 0 for all non categorical and non reference covariates.

  • GBLUP estimates by marker: These are the GBLUP estimates of the allele substitution effect (ASE) by marker, along with the absolute magnitude of the ASE and, if EMMA was used, the normalized absolute magnitude of the ASE. If gender correction is applied, separate columns for the ASE, the absolute magnitude of the ASE, and (with EMMA) the normalized Abs ASE will be output for both males and females.

    The marker map will be applied to this spreadsheet.

If you have selected Use AI REML, the following spreadsheet will also be created:

  • Sampling Var/Covar Matrix of the Variance Comp Estimates for the Full Model: These are the variances and covariances of the estimates of the variance components (V(e) (\sigma^2_e) and V(g) (\sigma^2_g)).

Spreadsheet Output for Multiple Random Effects

Unless you have selected Use Pre-Computed Genomic Relationship Matrix, the following spreadsheet is output:

  • GBLUP Genomic Relationship Matrix The relationship between pairs of samples, as determined by actual genomic similarity (or dis-similarity) between samples.

The following three spreadsheets will always be created:

  • GBLUP estimates by sample: This spreadsheet contains a column for the phenotype (selected dependent variable) of each sample, a column for the Total random effect component for each sample, which is the sum of random effects from all individual random effect components for that sample, and, for each variance component, a column containing the Random effects component related to that variance component for each sample. If you have selected Predict random effects for samples with missing phenotypes, another column containing the Predicted phenotype for each sample will be inserted after the first column, which will then be called the Actual Phenotype.

  • GBLUP fixed effect coefficients Contains the coefficient corresponding to each fixed effect. If there were no fixed effects selected, the only coefficient will be the intercept. For categorical covariates, the reference category will be listed with missing for the coefficient and a 1 in the the “Reference Covariate?” column. The “Reference Covariate?” column will contain a 0 for all non categorical and non reference covariates.

  • GBLUP estimates by marker: For each variance component, this spreadsheet contains a column for the GBLUP estimates of the allele substitution effect (ASE) by marker resulting from this random effect and a column for the absolute magnitude of the ASE from this random effect. If gender correction is applied, separate columns for the ASE and the absolute magnitude of the ASE will be output for both males and females for each random effect.

    Note

    If you selected Correct for Gene by Environment Interactions, the columns corresponding to the interaction or interactions will be estimates of Allele Activity, that is, approximately how sensitive the interaction random effect is to a difference in the genotype. While these columns are computed using the same algorithm as are other columns containing ASE values, these are not, strictly speaking, allele substitution effects, because the corresponding genomic relationship matrix (GRM) or matrices (GRMs) for this or these variance component(s) was or were not generated directly from M_{vc}M'_{vc}/\phi or W_{vc}W'_{vc}/m for any set of markers.

    Note

    If you selected both Correct for Gender and Use Pre-Computed Genomic Relationship Matrix for Gender Chromosome (AI REML only), the male ASE values for the non-gender chromosomes and the male ASE values for the gender chromosome will be placed into the same column as each other, and similarly, the female ASE values for the non-gender chromosomes and the female values for the gender chromosome will be placed into the same column as each other (but in a different column from the male ASE values).

    The marker map will be applied to this spreadsheet.

If you have selected Use AI REML, the following spreadsheet will also be created:

  • Sampling Var/Covar Matrix of the Variance Comp Estimates for the Full Model: These are the variances and covariances of the estimates of the variance components.

    A row and a column is created for each of the variance components (V(e) (\sigma^2_e), V(g) (\sigma^2_g), etc.). The diagonal contains the sampling variance of each variance component, while each off-diagonal element contains the covariance between the row’s variance component and the column’s variance component.

Node Change Log Output for the EMMA Algorithm

The following will be output to the node change log of each spreadsheet:

  • The options used
  • Summary statistics, including numbers of samples and markers scanned and analyzed

If the EMMA algorithm is used, the following will additionally be output to the node change log of all spreadsheets other than the GRM spreadsheet (if that spreadsheet was generated):

  • The pseudo-heritability ph, which is

    ph &= \hat{\sigma^2_G} / Var(y) \\
   &= \hat{\sigma^2_G} / (\hat{\sigma^2_G} + \hat{\sigma^2_e}) \\
   &= 1 / (1 + \hat{\delta}).

  • The would-be pseudo-heritability phw calculated as if the scaled genomic relationship matrix had been used. This is

    phw = 1 / (1 + w \hat{\delta}),

    where w is the scaling factor that would have been necessary to scale the genomic relationship matrix according to the methodology of Scaling the Kinship Matrix.

  • The would-be scaling factor w.

  • The variance and standard error of the pseudo-heritability Var(h^2) and SE(h^2) = \sqrt{Var(h^2)}, see Estimating the Variance of Heritability for One Random Effect for the formula.

  • The p-value, which is

    P(X > -2(l_0 - l_1)),

    where X is chi-square distributed with one degree of freedom, l_1 is the restricted maximum-likelihood (REML) estimate f_R(\delta) (Finding the Variance Components Using EMMA), and l_0 is the restricted log-likelihood for the corresponding linear model with no random effects (\sigma^2_G
= 0). l_0 may be written as

    l_0 = \frac{n-f}{2}\bigg[\log\frac{n-f}{2\pi} - 1 - \log\bigg(\sum_{s=1}^{n-f}\eta^2_s\bigg)\bigg] .

    Note

    The above expression for l_0 may be obtained by taking the expression for f_R(\delta) found in Finding the Variance Components Using EMMA and allowing \delta
(= \sigma_e/\sigma_G) to tend towards infinity, so that f_R(\delta) tends towards

    f_R(\delta) &= \frac{n-f}{2}\bigg[\log\frac{n-f}{2\pi} - 1 - \log\big(\sum_{s=1}^{n-f}\frac{\eta^2_s}{\delta}\big) - \frac{1}{n-f}\sum_{s=1}^{n-f} \log(\delta)\bigg] \\
            &= \frac{n-f}{2}\bigg[\log\frac{n-f}{2\pi} - 1 - \log\big(\sum_{s=1}^{n-f}\eta^2_s\big) + log(\delta) - log(\delta) \bigg] \\
            &= \frac{n-f}{2}\bigg[\log\frac{n-f}{2\pi} - 1 - \log\big(\sum_{s=1}^{n-f}\eta^2_s\big)\bigg] ,

    which is independent of \delta.

  • The genetic component of variance Vg (\hat{\sigma^2_G}).

  • The error component of variance Ve (\hat{\sigma^2_e}).

If you have selected additional covariates, two additional outputs will appear in these node change logs:

  • Proportion of genetic variance and
  • Prop. explained by fixed covariates.

These correspond exactly to the proportion of Genetic variance (pg_j for j=0) and the proportion of Variance explained (p_{fixed}) outputs of the Variance Partition Plot (See The Variance Partition Plot) for the initial model (j=0) of an MLMM run. (The pseudo-heritability ph_0 for the unnormalized genomic relationship matrix is used here.)

  • Any messages from the algorithm concerning steps that had to be taken to make the algorithm converge.

Node Change Log Output for the AI Algorithm

The following will be output to the node change log of each spreadsheet:

  • The options used
  • Summary statistics, including numbers of samples and markers scanned and analyzed

If the Average Information (AI) algorithm is used, the following will additionally be output to the node change log of all spreadsheets other than the GRM spreadsheet (if that spreadsheet was generated):

  • The variances for the full model.

    First, the number of iterations required for convergence and the log(likelihood) for this model are output.

    Then, a table is output containing columns for the description of (Source), the value of (Variance), and the standard error (SE) of

    • Each variance component (including V(e) (\sigma^2_e))
    • Vp (V_p), the sum of all variance components (including V(e) (\sigma^2_e))
    • Each random-effect variance component divided by V_p
    • Sum of V(G)/Vp, the sum of all random-effect variance components divided by V_p, will be shown if there is more than one random effect being analyzed.
  • If there is more than one random effect being analyzed, the variances for each of:

    • the full model without the first random effect
    • the full model without the second random effect
    • ...
    • the full model without the last random effect.

    The same output format is used as for the variances for the full model.

  • The variances for “the completely reduced model”–that is, the model containing only the error term \sigma^2_e I.

    The same output format is used as for the variances for the full model.

  • If there is more than one random effect being analyzed, a likelihood ratio test for each of:

    • the full model vs. the full model without the first random effect
    • the full model vs. the full model without the second random effect
    • ...
    • the full model vs. the full model without the last random effect.

    The following is output for each random effect:

    • logL The full-model likelihood
    • logL0 The likelihood of the full model without this random effect
    • LRT The likelihood ratio test (2 (logL - logL0))
    • df The degrees of freedom for this test
    • pval The p-value for this test
  • An overall random effects likelihood test. The following is output for this test:

    • logL The full-model likelihood
    • logL0 The likelihood of the null model
    • LRT The likelihood ratio test (2 (logL - logL0))
    • df The degrees of freedom for this test
    • pval The p-value for this test
  • Any messages from the algorithm concerning steps that had to be taken to make the algorithm converge.

GBLUP Genomic Relationship Matrix

Unless the Use Pre-Computed Genomic Relationship Matrix option is selected, a GBLUP Genomic Relationship Matrix spreadsheet will be created. This spreadsheet can be used not only as a pre-computed relationship matrix for other runs of this GBLUP tool, but also as a pre-computed kinship matrix for the EMMAX and MLMM Mixed Model GWAS methods (Mixed Linear Model Analysis).

This matrix can also be computed when there is no dependent variable available. See Separately Computing the Genomic Relationship Matrix for more information.