Genomic Best Linear Unbiased Predictors Analysis Using Bins

Performing Binned GBLUP Analysis

To better capture polygenic effects, the GBLUP method may be performed on markers that have been binned into categories, with a separate Genomic Relationship Matrix (GRM) being created (or imported) and used for every category (“bin”).
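As a rough illustration of "one GRM per bin", the sketch below builds a separate relationship matrix from the markers of each bin. It assumes a VanRaden-style overall normalization (centered additive codes divided by the sum of 2p(1-p) over the bin's markers); the exact normalization used by the software is described elsewhere in the manual, and the function name and data layout here are hypothetical.

```python
from collections import defaultdict

def grm_per_bin(genotypes, marker_bins):
    """Build one GRM per bin (illustrative sketch, not the software's code).

    genotypes:   dict marker -> list of additive codes (0/1/2) per sample
    marker_bins: dict marker -> bin label (each marker is in exactly one bin)
    Assumes overall normalization: G = Z Z' / sum_m 2 p_m (1 - p_m).
    """
    markers_by_bin = defaultdict(list)
    for marker, label in marker_bins.items():
        markers_by_bin[label].append(marker)

    grms = {}
    for label, markers in markers_by_bin.items():
        n = len(genotypes[markers[0]])
        acc = [[0.0] * n for _ in range(n)]
        denom = 0.0
        for marker in markers:
            calls = genotypes[marker]
            p = sum(calls) / (2.0 * n)          # allele frequency in the sample
            denom += 2.0 * p * (1.0 - p)
            z = [c - 2.0 * p for c in calls]    # centered additive codes
            for i in range(n):
                for j in range(n):
                    acc[i][j] += z[i] * z[j]
        if denom == 0.0:
            raise ValueError(f"bin {label!r} has only monomorphic markers")
        grms[label] = [[v / denom for v in row] for row in acc]
    return grms

# Hypothetical toy data: three samples, three markers, two bins.
toy_genotypes = {"rs1": [0, 1, 2], "rs2": [2, 1, 0], "rs3": [0, 0, 1]}
toy_bins = {"rs1": "bin1", "rs2": "bin1", "rs3": "bin2"}
toy_grms = grm_per_bin(toy_genotypes, toy_bins)
```

Because each marker appears in exactly one bin, each GRM reflects only the markers of its own bin.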

This method, as well as the feature LD Score Computation and Binning, is partly inspired by the paper [Wainschtein2019], which describes recovering the missing heritability for height and for body mass index (BMI) to the level implied by pedigree studies.

Like the standard GBLUP method (Genomic Best Linear Unbiased Predictors Analysis), binned GBLUP first computes or imports a set of genomic relationship matrices and then computes the “Genomic Best Linear Unbiased Predictor” (GBLUP) of additive genetic merits by sample and of allele substitution effects (ASE) by marker [VanRaden2008], [Taylor2013]. The main difference from standard GBLUP is that a separate GBLUP of additive genetic merits by sample is computed for every GRM, and the total effect is also shown for every sample.
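The per-GRM predictors and their total can be sketched with the textbook mixed-model formulas. This is a minimal illustration that assumes the variance components are already known (the AI-REML estimation step is not shown); it is not the software's implementation, and `binned_blup` and its arguments are hypothetical names.

```python
import numpy as np

def binned_blup(y, X, grms, var_g, var_e):
    """Textbook BLUP with one random effect per GRM (illustrative sketch).

    y: phenotypes (n,), X: fixed-effect design matrix (n, p),
    grms: list of (n, n) GRMs, var_g: variance component per GRM,
    var_e: error variance. All variance components are assumed known.
    """
    n = len(y)
    # Phenotypic covariance: V = sum_k var_g[k] * G_k + var_e * I
    V = var_e * np.eye(n)
    for G, s2 in zip(grms, var_g):
        V = V + s2 * G
    # Generalized least squares estimate of the fixed effects
    Vinv_X = np.linalg.solve(V, X)
    beta = np.linalg.solve(X.T @ Vinv_X, Vinv_X.T @ y)
    # V^{-1} (y - X beta) is shared by every per-bin predictor
    weighted_resid = np.linalg.solve(V, y - X @ beta)
    per_bin = [s2 * (G @ weighted_resid) for G, s2 in zip(grms, var_g)]
    total = np.sum(per_bin, axis=0)   # total effect = sum over bins
    return beta, per_bin, total

# Hypothetical toy input: three samples, intercept only, two bins.
y_toy = np.array([1.0, 2.0, 4.0])
X_toy = np.ones((3, 1))
G1 = np.array([[1.0, 0.5, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 1.0]])
G2 = np.eye(3)
beta_toy, per_bin_toy, total_toy = binned_blup(y_toy, X_toy, [G1, G2], [0.6, 0.3], 1.0)
```

The total returned for each sample is simply the sum of that sample's per-bin predictors, mirroring the total effect reported alongside the per-GRM values.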

Note

Since any marker contributes to only one GRM, the ASE for that marker is based only on that one GRM and its variance component, and on no other GRMs.

Note

As with the standard GBLUP, this method uses (with a genotypic spreadsheet) or assumes (with a numerically recoded spreadsheet) an additive genetic model.

Note

The Average Information REML algorithm (Finding the Variance Components Using the Average Information (AI) Technique) is always used to find the variance components for binned GBLUP.

Note

No sub-models consisting of some variance components/bins but not others are computed. Only the full model and the “completely reduced” model (that has no random effects at all) are computed.

Workflow

Normally, a binned GBLUP analysis consists of two steps:

  1. Categorize your markers into bins (“bin the markers”). This may be done through:

    1. Using Genotype > Quality Assurance and Utilities > LD Score Computation and Binning.
    2. Using File > Convert Genetic Marker Map into Spreadsheet. This will have the effect of binning the markers by chromosome name.
    3. Using DNA-Seq > Variant Binning by Frequency Source.
    4. Any other method that produces a spreadsheet with
      • row labels corresponding to your markers, and
      • a binary, integer, or categorical column somewhere in the spreadsheet that designates a bin number or bin label corresponding to each marker.
  2. Run this feature (Genotype > Compute GBLUP Using Bins), which will use the binning spreadsheet as one of its inputs.

    Optionally, you could first use the feature Genotype > Quality Assurance and Utilities > GBLUP Genomic Relationship Matrix and precompute your set of GRMs using its Create Multiple GRMs by Bin Using a Binning Spreadsheet feature. You would then use this set of GRMs as input to this feature (Genotype > Compute GBLUP Using Bins).
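For the "any other method" route in step 1, the binning spreadsheet only needs marker row labels and one column of bin labels. The sketch below bins markers by chromosome and writes such a table as CSV. The CSV layout, the column headers, and the function names are assumptions made for illustration; consult the software's import documentation for the file formats it actually accepts.

```python
import csv
import io

def bin_by_chromosome(marker_map):
    """Assign each marker a bin label equal to its chromosome name.

    marker_map: iterable of (marker_name, chromosome, position) tuples
                (a hypothetical in-memory form of a marker map).
    """
    return [(marker, str(chrom)) for marker, chrom, _pos in marker_map]

def write_binning_sheet(rows, handle):
    """Write marker row labels plus a single categorical 'Bin' column."""
    writer = csv.writer(handle)
    writer.writerow(["Marker", "Bin"])   # header names are assumptions
    writer.writerows(rows)

# Hypothetical marker map with three markers on two chromosomes.
toy_map = [("rs1", "1", 10500), ("rs2", "1", 20750), ("rs3", "X", 5000)]
buffer = io.StringIO()
write_binning_sheet(bin_by_chromosome(toy_map), buffer)
```

Any other binning rule (for example by LD score or by allele frequency) would only change how the second column is filled in.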

Options

Compute Genomic BLUP (GBLUP) Using Bins Dialog Window

  • Bins/Categories of Markers: Choose the spreadsheet to be used for binning markers and the spreadsheet column containing the bin categories.

  • Impute missing genotypic data as: Missing genotypic data can be imputed by either of the following methods:

    • Homozygous major allele: All missing genotypic data will be recoded to 0.

    • Numerically as average value: All missing genotypic data will be recoded to the average of all non-missing genotype calls (using the additive model).

      Note

      If Correct for Gender (see below) is also selected, and there is non-missing data for both males and females in a given marker, averages for males and females will be computed and used separately.

  • Correct for Gender: Assumes the column is coded as if the male were homozygous for the X-Chromosome allele in question. Uses the [Taylor2013] gender-correction algorithm. (See Correcting the GRM for Gender Using Overall Normalization and Correcting the GRM for Gender Using Normalization by Individual Marker.) Two values of the ASE are output, one for each gender.

    • Choose Sex Column: Choose the spreadsheet column that specifies the gender of the sample. This column may either be categorical (“M” vs. “F”) or binary (0 = male, 1 = female).
    • Chromosome that is hemizygous for males: Usually the X Chromosome, which is the default.
    • Dosage compensation: Select the dosage compensation to be used. Equal X-Linked Variance is the default.
  • Use Pre-Computed Genomic Relationship Matrices: To use, check this option, then click on Select Sheet and, from the window that is presented, select the spreadsheet that lists the set of genomic relationship matrix spreadsheets. The list should have the categories/bins as row labels and the spreadsheet numbers in the first column. To be valid, the individual pre-computed GRM spreadsheets must each follow the rules outlined in Precomputed Kinship Matrix Option.

  • Correct for Additional Covariates: Allows additional fixed effects to be added to this model from columns of this spreadsheet. Fixed effect covariates can be binary, integer, real-valued, categorical or (if actual genotypic data rather than recoded genotypic data is being used for the analysis) genotypic. In all cases, if a marker is used as an additional fixed effect, it will not be included in the analysis in any other way. To begin, check this option, then click on Add Columns to get a choice of spreadsheet columns to use.

  • Normalization Algorithm (Used or Assumed) for the GRM: If pre-computed GRMs are not selected, this choice influences the GRM computations. In any case, this choice influences ASE computations. Overall normalization is the default. (See The Genomic Relationship Matrix.)

  • Missing Phenotypes: To predict random effects (genomic merit/genomic breeding values) and the phenotypes for samples with missing phenotypes, select Predict random effects for samples with missing phenotypes (see Genomic Prediction). Selecting this will cause samples with missing phenotypes to be included in the ASE calculations.

    Otherwise, select Drop samples with missing phenotypes.

    Note

    An alternative prediction procedure is to use Genotype > Predict Phenotypes From Existing Results. See Predict Phenotypes From Existing Results.
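The two imputation options in the dialog above can be sketched as follows. This is an illustration under stated assumptions: genotypes are additive codes (0 = homozygous major), missing calls are represented by `None`, and the sex-specific fallback when only one sex has data is a guess at reasonable behavior rather than documented behavior. The function names are hypothetical.

```python
def impute_major_homozygous(calls):
    """Recode every missing call (None) to 0, the homozygous major code."""
    return [0 if c is None else c for c in calls]

def impute_average(calls, sexes=None):
    """Recode missing calls to the average of the non-missing calls.

    If `sexes` is given and both sexes have non-missing data for this
    marker, each missing call gets the average of its own sex instead.
    """
    observed = [c for c in calls if c is not None]
    if not observed:
        raise ValueError("marker has no non-missing calls")
    overall = sum(observed) / len(observed)
    if sexes is None:
        return [overall if c is None else float(c) for c in calls]

    means = {}
    for sex in set(sexes):
        vals = [c for c, s in zip(calls, sexes) if s == sex and c is not None]
        if vals:
            means[sex] = sum(vals) / len(vals)
    if len(means) < 2:
        means = {}   # assumption: fall back to the overall average
    return [means.get(s, overall) if c is None else float(c)
            for c, s in zip(calls, sexes)]

# Hypothetical marker with one missing call.
toy_calls = [0, None, 2, 1]
filled_major = impute_major_homozygous(toy_calls)
filled_mean = impute_average(toy_calls)
```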

Binned GBLUP Output

Spreadsheet Outputs

Unless you have selected Use Pre-Computed Genomic Relationship Matrices, the following spreadsheets are output:

  • One GBLUP Genomic Relationship Matrix for every category/bin. This is the relationship between pairs of samples, as determined by actual genomic similarity (or dis-similarity) between samples, over the markers contained in the bin.
  • A Genomic Relationship Matrix List spreadsheet, which, for each row, has the category/bin number or label as a row label and the spreadsheet number of the bin’s GRM in that row’s first column.

The following four spreadsheets will always be created:

  • GBLUP estimates by sample: This spreadsheet contains a column for the phenotype (selected dependent variable) of each sample, a column for the Total random effect component for each sample, which is the sum of random effects from all individual random effect components for that sample, and, for each variance component/bin, a column containing the Random effects component related to that variance component/bin for each sample. If you have selected Predict random effects for samples with missing phenotypes, another column containing the Predicted phenotype for each sample will be inserted after the first column, which will then be called the Actual Phenotype.

  • GBLUP fixed effect coefficients: Contains the coefficient corresponding to each fixed effect. If no fixed effects were selected, the only coefficient will be the intercept. For categorical covariates, the reference category will be listed with a missing value for the coefficient and a 1 in the “Reference Covariate?” column. The “Reference Covariate?” column will contain a 0 for all non-categorical and non-reference covariates.

  • GBLUP estimates by marker: This spreadsheet first contains a column showing the bin/category for each marker. Next is a column for the GBLUP estimate of the allele substitution effect (ASE) for each marker (as computed from the random effects corresponding to that marker’s bin/category/GRM/variance component). The final column shows the absolute magnitudes of the ASE values. If gender correction is applied, separate columns for the ASE and the absolute magnitude of the ASE will be output for both males and females.

    The marker map is applied to this spreadsheet.

  • Sampling Var/Covar Matrix of the Variance Comp Estimates for the Full Model: These are the variances and covariances of the estimates of the variance components. A row and a column are created for each of the variance components (V(e) (\sigma^2_e), V(G_bin1) (\sigma^2_{vcbin1}), etc.). The diagonal contains the sampling variance of each variance component estimate, while each off-diagonal element contains the covariance between the estimates of the row’s variance component and the column’s variance component.
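One way to read the sampling variance/covariance matrix: the standard error of each variance component is the square root of its diagonal entry, and the sampling variance of a sum of components (such as V_p) is the sum of all entries of the matrix. The sketch below shows this arithmetic with made-up numbers; it is not necessarily how the software derives the standard errors it reports, and the function names are hypothetical.

```python
import math

def component_standard_errors(cov):
    """SE of each variance component: square root of the diagonal."""
    return [math.sqrt(cov[i][i]) for i in range(len(cov))]

def total_variance_and_se(components, cov):
    """V_p as the sum of all components, with SE from Var(sum) = 1' C 1."""
    vp = sum(components)
    var_vp = sum(sum(row) for row in cov)   # variances plus all covariances
    return vp, math.sqrt(var_vp)

# Hypothetical estimates for V(e) and one bin, with their sampling covariance.
toy_components = [1.0, 0.5]
toy_cov = [[0.04, -0.01],
           [-0.01, 0.09]]
toy_ses = component_standard_errors(toy_cov)
toy_vp, toy_vp_se = total_variance_and_se(toy_components, toy_cov)
```

A negative covariance between two components, as in the toy matrix, makes the standard error of their sum smaller than it would be if the estimates were independent.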

Node Change Log Output

The following will be output to the node change log of each spreadsheet:

  • The options used
  • Summary statistics, including numbers of samples and markers scanned and analyzed

The following will additionally be output to the node change log of all spreadsheets other than (if they were generated) the GRM spreadsheets and the GRM list spreadsheet:

  • The number of markers processed for each bin.

  • The variances for the full model:

    • First, the number of iterations required for convergence and the log(likelihood) for the full model are output.
    • Then, for the full model, a table is output containing columns for the description (Source), the value (Variance), and the standard error (SE) of each of the following:
      • The error variance component V(e) (\sigma^2_e)
      • The variance component V(G_bin) for each bin
      • Vp (V_p), the sum of all variance components (including V(e) (\sigma^2_e))
      • Each random-effect variance component divided by V_p
      • Sum of V(G)/Vp, the sum of all random-effect variance components divided by V_p.
      • The variance per marker for each bin.
  • The variances for “the completely reduced model”, that is, the model containing only the error term \sigma^2_e I.

    The same output format is used as for the variances for the full model.

  • An overall random effects likelihood test. The following is output for this test:

    • logL: The log-likelihood of the full model
    • logL0: The log-likelihood of the null model
    • LRT: The likelihood ratio test statistic (2 (logL - logL0))
    • df: The degrees of freedom for this test
    • pval: The p-value for this test
  • Any messages from the algorithm concerning steps that had to be taken to make the algorithm converge.
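The likelihood ratio test quantities listed above can be reproduced by hand as a sanity check. The sketch below computes LRT = 2 (logL - logL0) and a p-value, assuming a plain chi-square reference distribution with `df` degrees of freedom. That distributional choice is an assumption (tests of variance components sometimes use a mixture of chi-square distributions instead), so the result may not match the reported pval exactly. The function names are hypothetical.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable, integer df >= 1."""
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i < df/2} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) plus a finite series
    term, total = math.sqrt(2.0 * x / math.pi), 0.0
    for j in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * j + 1)
    return math.erfc(math.sqrt(half)) + math.exp(-half) * total

def likelihood_ratio_test(logL, logL0, df):
    """Return (LRT, p-value) with LRT = 2 * (logL - logL0)."""
    lrt = 2.0 * (logL - logL0)
    return lrt, chi2_sf(lrt, df)

# Hypothetical log-likelihoods for a model with two bins.
toy_lrt, toy_pval = likelihood_ratio_test(-100.0, -103.0, 2)
```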

Precomputing the Set of Genomic Relationship Matrices

See Separately Computing the Genomic Relationship Matrix for information about pre-computing the set of Genomic Relationship Matrices for a spreadsheet of markers that have been binned.