Genotype Imputation

SVS implements an adaptation of the BEAGLE 4.1 program to perform genotype phasing and imputation. For BEAGLE information and documentation, please see the University of Washington’s website: https://faculty.washington.edu/browning/beagle/beagle.html

Typical Workflow

Use Premade Human Reference Panel

Premade human reference panels can be downloaded from the Golden Helix server by selecting Download > Imputation Data from within the Project Navigator. Each VCF file is downloaded together with its companion .tbi index file, and both are saved to the ImputationRefPanels folder in your AppData location.

Create your own Reference Panel

Perform quality filtering before creating a reference panel, for example by filtering on minor allele frequency. To create a reference panel, go to Genotype > Create Imputation Reference Panel from your quality-filtered genotype spreadsheet. Select from the provided options, or keep the defaults, and select Run. The genotype assembly will be included in the reference file if Add to Reference Panels Folder is selected.

If Add To Project as Spreadsheet is selected, a spreadsheet will be created in the project. Missing genotypes from the original spreadsheet will be filled in and genotyping errors will be corrected.
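
The quality filtering recommended above is normally done with the SVS filtering tools. As a rough illustration of what a minor allele frequency filter computes, here is a minimal Python sketch. The genotype representation (a pair of allele strings, or None for a missing call), the function name, and the 0.01 cutoff are assumptions made for this example and are not part of SVS.

```python
from collections import Counter

def minor_allele_frequency(genotypes):
    """Return the frequency of the second most common allele at one marker.

    `genotypes` is a list of (allele1, allele2) tuples, with None standing
    in for a missing genotype (a hypothetical representation).
    """
    counts = Counter(
        allele
        for genotype in genotypes
        if genotype is not None
        for allele in genotype
    )
    total = sum(counts.values())
    if total == 0 or len(counts) < 2:
        # No called alleles, or a monomorphic marker: no minor allele.
        return 0.0
    return sorted(counts.values(), reverse=True)[1] / total

# Hypothetical marker: 5 "A" alleles and 3 "G" alleles among called samples.
marker = [("A", "A"), ("A", "G"), None, ("G", "G"), ("A", "A")]
maf = minor_allele_frequency(marker)

# Example cutoff only; choose a threshold that suits your data.
keep_marker = maf >= 0.01
```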

Run Imputation

From a genotype spreadsheet, go to Genotype > Genotype Imputation with BEAGLE. Select a reference panel from the file selection menu. If no files are listed, either navigate to a folder containing a reference panel using the Browse... button, create a reference panel (please see Create your own Reference Panel), or download one from Download > Imputation Data.

Set any remaining options on the Options and Advanced tabs, or keep the defaults, then select Run to start imputation. (See Genotype Imputation Dialog for more details.)

If your data is not whole-genome, the windowing option on the Advanced tab allows you to select a reasonable window of markers around the data.

Allele Encoding

When running imputation, markers are matched between the target and reference panels using the chromosome, position, and alleles in the data. To ensure accurate results, use the same allele representations between the imputation target and reference data sets.

If the reference and imputation data sets come from the same platform provider, “A/B” encoding can be used. However, it is better to use the “Reference/Alternates” option.

If an RSID is available in the marker map, A/B data can be recoded using the Recode SNP to Variants tool. (Please see Recode SNPs as Variants.) If your spreadsheet does not currently have an RSID available in the marker map, it can be added with an available add-on script. (Please see How can I add Gene Name or RS ID to my spreadsheet’s marker map?)

Note

All of the alleles in a target marker must match alleles in the reference panel marker, and the target marker reference allele must match the reference panel marker reference allele. However, it is not necessary for all the alleles in the reference panel marker to be represented in the target marker.

This, for instance, allows a target marker that is homozygous (in the reference allele) for the target samples to match a reference panel marker that has the same reference allele but also one or more alternate alleles in its data.
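
The matching rule in the note above can be sketched as a small Python check. This only illustrates the rule as stated here and is not the SVS or BEAGLE implementation; the Marker type and the function name are hypothetical.

```python
from typing import NamedTuple, Tuple

class Marker(NamedTuple):
    """Hypothetical marker record used only for this illustration."""
    chrom: str
    pos: int
    ref: str
    alts: Tuple[str, ...]  # alternate alleles observed at this marker

def target_matches_reference(target: Marker, reference: Marker) -> bool:
    """Apply the matching rule described in the note above."""
    # Chromosome and position must agree.
    if (target.chrom, target.pos) != (reference.chrom, reference.pos):
        return False
    # The reference alleles must be identical.
    if target.ref != reference.ref:
        return False
    # Every target alternate must also appear in the reference panel marker;
    # the panel marker may carry additional alternates.
    return set(target.alts) <= set(reference.alts)

panel = Marker("1", 1000, "A", ("G", "T"))

# A target marker that is homozygous reference in all samples still matches.
homozygous_ref = target_matches_reference(Marker("1", 1000, "A", ()), panel)

# A target alternate that is absent from the panel marker does not match.
unknown_alt = target_matches_reference(Marker("1", 1000, "A", ("C",)), panel)
```

Because the reference alleles must agree, a target marker whose reference and alternate alleles are swapped relative to the panel fails this check, which is one reason to use the same allele representation in both data sets.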

Create Reference Panel Dialog

Create Imputation Reference Panel - Options Tab

Create Imputation Reference Panel - Options Tab with Add to Project as Spreadsheet

  • Output: Select the output type. Add to Reference Panels Folder creates a tsf file, and Add to Project as Spreadsheet creates a spreadsheet with missing genotypes filled in and genotyping errors corrected.
  • Folder: The folder where the reference panel file will be saved. The default location is in your AppData folder.
  • Base Name: The first part of the reference panel’s name. The file name will consist of this plus the Project Genome.
  • Project Genome: The current project genome. This will be added to the base name to create the reference panel file name.
  • Allele Encoding: Please see Allele Encoding.
  • Included Map Fields: Select the marker map fields to include in the reference file. This data will be added to the marker map of the imputed spreadsheet.

Create Imputation Reference Panel - Advanced Tab

  • General Parameters:

    • # Threads: Number of threads for sample-wise computations. (Approximately three quarters of the internal computations are sample-wise and thus may be multi-threaded.)
    • Window Size: Specifies the number of markers to include in each sliding window.
    • Overlap: Specifies the number of markers of overlap between sliding windows.
  • Phasing and Imputation Algorithm:

    • Beagle 4.1:

      • 4.1 Phasing Iterations: Accuracy increases with the number of iterations, but so does compute time. Phasing iterations are preceded by 10 burn-in iterations using the Beagle 4.0 phasing algorithm.
      • Max Cluster Size in CM: The maximum cM distance between individual markers that are combined into an aggregate marker when imputing ungenotyped markers.
      • Effective Population Size: Effective population size when imputing ungenotyped markers. The default value is suitable for large outbred human populations. Smaller values in the hundreds or thousands may be appropriate for inbred human and animal populations.
      • Allele Miscall Rate: The default value should give good results for most sequence and SNP array data.
    • Beagle 4.0:

      • Use Pedigree: (This option will appear only if your input spreadsheet is a pedigree spreadsheet.) Use pedigree information to achieve higher phasing accuracy.

        Note

        If this option is selected, an output spreadsheet (or an additional output spreadsheet) will be created that reports the number of Mendelian errors observed in the input data.

      • Burn-in Iterations: Number of initial burn-in iterations.

      • Phasing Iterations: Number of iterations for estimating genotype phase. Increasing this parameter will typically increase genotype phase accuracy.
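
To see how Window Size and Overlap relate, the following Python sketch lists the marker index ranges that sliding windows of a given size and overlap would cover. It is an illustration under simple assumptions (each window starts `window_size - overlap` markers after the previous one, and the last window is truncated at the end of the data); how BEAGLE actually places window boundaries may differ, and the function name is hypothetical.

```python
def sliding_windows(n_markers, window_size, overlap):
    """Return (start, end) index pairs, end exclusive, covering all markers."""
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must be non-negative and smaller than window_size")
    if n_markers <= 0:
        return []
    step = window_size - overlap  # markers advanced between window starts
    windows = []
    start = 0
    while True:
        end = min(start + window_size, n_markers)
        windows.append((start, end))
        if end == n_markers:
            break
        start += step
    return windows

# Ten markers, windows of four markers, one marker shared between neighbors.
example = sliding_windows(10, 4, 1)
```

In this sketch, a larger overlap means more markers are processed twice, so the number of windows grows as the overlap approaches the window size.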

Genotype Imputation Dialog

Genotype Imputation with Beagle - Options Tab

  • Reference Panel:

    • Folder: The folder in which the reference panel file is located. The default location is in your AppData folder.
    • Project Genome Filter: The current project genome. This will be added to the base name to create the reference panel file name.
    • Reference File: List of reference panels in the Folder. Select one to use for imputation.
    • Only impute to ref markers within X bp of target markers: Maximum distance between reference and target markers when imputing.
  • Output:

    • Base Name: The first part of the reference panel’s name. The file will have this plus the Project Genome.

    • Spreadsheet as child of: Where the imputed spreadsheet will be created.

    • Output Per Genotype Probabilities Spreadsheet: Contains the posterior genotype probabilities.

    • Output Imputation Statistics Spreadsheet: Outputs a spreadsheet with three columns: whether a marker was imputed, the allelic R-squared value, and the dosage R-squared value.

    • Set genotype to missing if genotype probability is less than X: Only keep genotypes if their probability is above a certain threshold.

    • Keep Target Markers That Do Not Match Any Reference Marker: Include all target markers in the output, whether or not they match the chromosome, position, and alleles of any reference markers.

      Note

      Target markers for chromosomes not present in any reference markers will not be included in the output, even if this box is checked. Additionally, if you have specified Only impute to ref markers within X bp of target markers, target markers farther away than X base pairs from any reference marker will not be included, either.
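
Two of the options above reduce to simple rules, sketched here in Python. This is an illustration only, not the SVS implementation. It assumes biallelic markers with posterior probabilities ordered as homozygous reference, heterozygous, homozygous alternate, and positions taken from a single chromosome; the function names and genotype labels are hypothetical.

```python
from bisect import bisect_left
from typing import Optional, Sequence

# Assumed label order for the three biallelic genotype probabilities.
GENOTYPE_LABELS = ("0/0", "0/1", "1/1")

def call_or_missing(probs: Sequence[float], threshold: float) -> Optional[str]:
    """Return the most probable genotype, or None (missing) when its
    probability is less than the threshold."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    if probs[best] < threshold:
        return None
    return GENOTYPE_LABELS[best]

def near_target(ref_pos: int, target_positions: Sequence[int], max_bp: int) -> bool:
    """Return True if a reference marker lies within max_bp of any target
    marker. target_positions must be sorted and from one chromosome."""
    i = bisect_left(target_positions, ref_pos)
    # Only the nearest target marker on each side needs to be checked.
    neighbors = target_positions[max(i - 1, 0):i + 1]
    return any(abs(ref_pos - pos) <= max_bp for pos in neighbors)

confident = call_or_missing((0.98, 0.02, 0.00), threshold=0.9)
uncertain = call_or_missing((0.50, 0.45, 0.05), threshold=0.9)

targets = [100, 1000, 5000]
close_marker = near_target(1050, targets, max_bp=100)
far_marker = near_target(3000, targets, max_bp=100)
```

A higher probability threshold leaves more genotypes missing in exchange for more confident calls, and a smaller distance limit restricts imputation to reference markers close to the genotyped data.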

Genotype Imputation with Beagle - Advanced Tab

  • General Parameters:

    • # Threads: Number of threads for sample-wise computations. (Approximately three quarters of the internal computations are sample-wise and thus may be multi-threaded.)
    • Window Size: Specifies the number of markers to include in each sliding window.
    • Overlap: Specifies the number of markers that overlap between sliding windows.
  • Phasing and Imputation Algorithm:

    • Beagle 4.1:

      • 4.1 Phasing Iterations: Accuracy (of phasing in the intermediate data which is imputed afterward) increases with the number of iterations, but so does compute time. Phasing iterations are preceded by 10 burn-in iterations using the Beagle 4.0 phasing algorithm.
      • Max Cluster Size in CM: The maximum cM distance between individual markers that are combined into an aggregate marker when imputing ungenotyped markers.
      • Effective Population Size: Effective population size when imputing ungenotyped markers. The default value is suitable for large outbred human populations. Smaller values in the hundreds or thousands may be appropriate for inbred human and animal populations.
      • Allele Miscall Rate: The default value should give good results for most sequence and SNP array data.
    • Beagle 4.0:

      • Use Pedigree: (This option will appear only if your input spreadsheet is a pedigree spreadsheet.) Use pedigree information to achieve higher phasing accuracy.

        Note

        If this option is selected, an additional output spreadsheet will be created that reports the number of Mendelian errors observed in the input data.

      • Burn-in Iterations: Number of initial burn-in iterations.

      • Phasing Iterations: Number of iterations for estimating genotype phase. Increasing this parameter will typically increase genotype phase accuracy (in the intermediate data which is imputed afterward).