CNV Caller on Whole Genomes

Golden Helix SVS integrates the VS-CNV 2.0 algorithm, which calls CNVs on fixed-width bins and is tailored toward whole genome analysis.

The output spreadsheets are designed for Golden Helix SVS statistical analysis.

This algorithm uses sample-level coverage statistics to detect copy number variations (CNVs). Each coverage bin is classified as a homozygous deletion, heterozygous deletion, diploid, or duplication.

Note

By using large bins, large CNV events can be detected accurately on extremely low read-depth WGS data. Even for a sample with 0.02X coverage (~1 million reads), the algorithm can call events down to the million base-pair level and detect chromosomal aneuploidy events with high confidence.
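As a rough back-of-envelope illustration of why larger bins help at low coverage, the sketch below estimates how many reads land in each bin. The genome length of 3.1 Gb, the uniform spread of reads, and the Poisson noise model are assumptions made for illustration; they are not taken from the VS-CNV algorithm.

```python
import math

GENOME_LENGTH_BP = 3_100_000_000  # assumed approximate human genome length


def expected_reads_per_bin(total_reads, bin_size_bp, genome_length_bp=GENOME_LENGTH_BP):
    """Expected reads in one bin if reads were spread uniformly over the genome."""
    return total_reads * bin_size_bp / genome_length_bp


def poisson_cv(mean_reads):
    """Relative noise (standard deviation / mean) of a Poisson count."""
    return 1.0 / math.sqrt(mean_reads)


# Compare a few hypothetical bin sizes for a sample with ~1 million reads.
for bin_size in (10_000, 100_000, 1_000_000):
    reads = expected_reads_per_bin(1_000_000, bin_size)
    print(f"{bin_size:>9} bp bins: ~{reads:.1f} reads/bin, noise ~{poisson_cv(reads):.1%}")
```

Under these assumptions, a 1 Mb bin collects roughly 320 reads, so the per-bin counting noise is small compared with the 50% drop in coverage expected from a heterozygous deletion.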

Reference samples are used to normalize the coverage data, and statistics are reported to provide an overview of the evidence for each classification. This algorithm has been tested on shallow whole genome sequencing data and is capable of calling large cytogenetic events such as whole chromosome duplications. The minimum and maximum reference sample count can be configured. If you have a large number of control samples in your reference folder, we suggest increasing the maximum value.
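To make the idea of normalizing against reference samples concrete, here is a minimal sketch of a per-bin ratio and z-score. The exact formulas used by the CNV Caller are not documented here, so the definitions below (ratio to the reference mean, z-score against the reference standard deviation) are assumptions, as is the premise that depths have already been scaled for overall sequencing depth.

```python
import statistics


def normalize_bin(sample_depth, reference_depths):
    """Return (ratio, z_score) for one bin of one sample.

    sample_depth and reference_depths are assumed to be depths that were
    already scaled for library size. At least two reference depths are needed.
    """
    ref_mean = statistics.mean(reference_depths)
    ref_sd = statistics.stdev(reference_depths)
    if ref_mean == 0 or ref_sd == 0:
        raise ValueError("reference depths carry no usable signal for this bin")
    ratio = sample_depth / ref_mean
    z_score = (sample_depth - ref_mean) / ref_sd
    return ratio, z_score
```

For example, `normalize_bin(50, [90, 100, 110])` gives a ratio of 0.5 and a z-score of -5.0, the kind of evidence that would support a heterozygous deletion in that bin.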

With the addition of the “CNV Caller” add-on to your Golden Helix SVS license, you can add this algorithm to your analysis workflows.

Note

To add the CNV Caller to your Golden Helix SVS license, contact info@goldenhelix.com.

Requirements

To use the CNV Caller on Whole Genomes, start from a spreadsheet that has the sample names for your analysis as row labels, then select Numeric > CNV on NGS Whole Genomes.

CNV on Whole Genomes Window


The algorithm requires BAM files for each sample you would like to perform the analysis on.

Coverage statistics over the binned regions on the BAM file will be computed as the first step of the algorithm.
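The binning step can be pictured with the following sketch, which counts reads per fixed-width bin. It is an illustration only: it assumes read start positions have already been extracted from a BAM file into a plain list of 0-based coordinates, and the function name `bin_read_counts` is hypothetical rather than part of SVS.

```python
def bin_read_counts(read_starts, chrom_length, bin_size):
    """Count reads per fixed-width bin, assigning each read by its 0-based start."""
    n_bins = -(-chrom_length // bin_size)  # ceiling division; last bin may be short
    counts = [0] * n_bins
    for start in read_starts:
        if 0 <= start < chrom_length:  # ignore positions outside the chromosome
            counts[start // bin_size] += 1
    return counts
```

Calling `bin_read_counts([5, 120, 130, 999], 1000, 100)` returns `[1, 2, 0, 0, 0, 0, 0, 0, 0, 1]`.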

By default the samples in the current analysis as well as all previously analyzed samples are used as reference samples. The repository of reference samples can be found by going to Tools > Open Folder > Reference Samples Folder on the main application window.

Input and Output Options

The first tab configures the inputs and the requested outputs for running the algorithm. It has the following options:

  • Bin Size: Defines the size in base pairs of the equally spaced regions over which coverage will be computed.
  • Optional Regions to Mask and Ignore: The masked region file is used to specify regions to be excluded from the coverage computation. A BED file or interval source may be used to define the regions; the file must be indexed.
  • BAM Path Mappings: The sample names come from the rows of the starting spreadsheet. These can be mapped to BAM paths manually by using the Associate BAMs dialog. Alternatively, if your spreadsheet has a column containing the BAM path mappings, you can specify that column.
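As an illustration of how masked regions interact with fixed-width bins, the sketch below finds the bins that overlap any masked interval. It assumes BED-style 0-based, half-open intervals on a single chromosome and assumes a bin is dropped whenever it overlaps a masked region at all; how SVS treats partially masked bins is not stated here, and `masked_bins` is a hypothetical helper.

```python
def masked_bins(n_bins, bin_size, mask_intervals):
    """Return indices of bins overlapping any (start, end) half-open interval."""
    masked = set()
    for start, end in mask_intervals:
        if end <= start:
            continue  # skip empty or inverted intervals
        first = start // bin_size
        last = (end - 1) // bin_size  # end is exclusive
        masked.update(range(first, min(last, n_bins - 1) + 1))
    return masked
```

For instance, with ten 100 bp bins, a masked interval of `(150, 320)` touches bins 1, 2, and 3.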

The Outputs group box contains choices for which outputs you would like to have when the algorithm completes. They are summarized below:

  • Sample summary table: Provides a table with samples as rows and various summary statistics from the CNV calls as columns. See the section below for the detailed description of this table.
  • Whole Genome CNV State: Typically the most useful output for downstream analysis relating copy number state to sample phenotypes and other traits of interest. The output has one column per target defined by the input coverage region file, with the column values being either categorical or numeric based on the user selection.
    • Numeric CNV state: Encoding can be switched to numeric with the following mappings:
      • 0: Homozygous Deletion
      • 1: Heterozygous Deletion
      • 2: Diploid (Copy Neutral)
      • 3: Duplication
    • Filter out calls with QC flags: Sets CNV calls to missing (?) when the algorithm has raised certain QC flags on them. Use this option to keep only high-quality CNV calls in your CNV state table for downstream analysis. The flags are defined below:
      • Low Controls Depth: Mean read depth over controls fell below the threshold.
      • High Controls Variation: Variation coefficient exceeded threshold.
      • Within Regional IQR: Event is not significantly different from surrounding normal regions based on regional IQR.
      • Low Z-Score: Event's z-score is not far enough from zero to be significantly different from normal regions.
  • Whole Genome Z-score table: Provides a spreadsheet with the supporting z-scores for every binned region and every sample.
  • Whole Genome ratio table: Provides a spreadsheet with the supporting ratios for every binned region and every sample.
  • CNV event table: Provides output of the fully constructed CNV events that can only be represented on a per-sample basis. Up to the provided number of samples will have per-sample spreadsheets created with the details of the CNV calls for that given sample. See the section below for details of these spreadsheets.
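The numeric encoding and the QC flag filter described above can be sketched as follows. The numeric codes come from the mapping listed for the Numeric CNV state option; the categorical label strings and the use of `None` to stand for a missing (?) value are assumptions made for this example.

```python
# Numeric codes follow the documented mapping; label strings are assumed.
CNV_STATE_CODES = {
    "Homozygous Deletion": 0,
    "Heterozygous Deletion": 1,
    "Diploid": 2,
    "Duplication": 3,
}


def encode_states(states, flags, filter_flagged=False):
    """Map categorical states to numeric codes.

    flags holds one list of QC flag names per call. When filter_flagged is
    True, any call carrying at least one flag becomes None (missing).
    """
    encoded = []
    for state, call_flags in zip(states, flags):
        if filter_flagged and call_flags:
            encoded.append(None)
        else:
            encoded.append(CNV_STATE_CODES[state])
    return encoded
```

With filtering enabled, `encode_states(["Diploid", "Duplication"], [[], ["Low Z-Score"]], filter_flagged=True)` returns `[2, None]`.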

CNV Algorithm Parameters

The Parameters tab in the dialog allows the user to specify the following options:

  • Minimum Number of Reference Samples: Desired minimum number of reference samples to be selected.
  • Maximum Number of Reference Samples: The maximum number of reference samples to be selected.
  • Exclude reference samples with percent difference greater than: This option will filter reference samples with a percent difference above the specified value after a minimum of 10 samples have been selected.
  • Add samples to reference set: This option adds the current project’s samples to the reference set. Go to Tools > Open Folder > Reference Samples Folder to see all the samples that have been added to your reference set over time.
  • Normalize sex chromosomes separately based on inferred gender: If this option is selected, and there are non-autosomal targets (X and Y), a gender will be inferred for each sample. Based on that gender, a set of gender-matched references will be selected for normalizing these chromosomes. Un-check this if you don’t have enough samples to do gender-matched normalization or expect all samples to be predominantly one sex.
  • Reference Sample Folder: The folder containing the reference samples used to normalize the coverage data.
  • Z-Score Threshold: Threshold for mean z-score for calling events.
  • Controls average target mean depth below: Flags targets with average reference sample depth below the specified value.
  • Controls variation coefficient above: Flags targets for which the variation coefficient is above the specified value. A high variation coefficient indicates that there is extreme variation in reference sample coverage for the target region.
  • Use Optimal Segmentation Algorithm (slower): By default, Circular Binary Segmentation (CBS) is used, but the CNAM optimal segmentation algorithm can be selected instead. In our testing there were only marginal or no benefits to using CNAM for most datasets.
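One plausible reading of how the reference sample and control threshold parameters fit together is sketched below. This is an interpretation, not the documented implementation: the ranking by percent difference, the default values for `max_refs` and `max_percent_diff`, and both function names are hypothetical. The variation coefficient is computed with its standard definition, standard deviation divided by mean.

```python
import statistics


def select_references(percent_diffs, min_refs=10, max_refs=30, max_percent_diff=20.0):
    """Pick reference samples, closest match first.

    percent_diffs maps reference sample name -> percent difference to the sample.
    """
    ranked = sorted(percent_diffs.items(), key=lambda item: item[1])
    selected = []
    for name, diff in ranked[:max_refs]:
        # Keep the closest min_refs regardless; beyond that, apply the cutoff.
        if len(selected) >= min_refs and diff > max_percent_diff:
            break
        selected.append(name)
    return selected


def control_flags(reference_depths, min_mean_depth, max_variation_coefficient):
    """Return QC flags for one target based on its reference sample depths."""
    mean_depth = statistics.mean(reference_depths)
    flags = []
    if mean_depth < min_mean_depth:
        flags.append("Low Controls Depth")
    if mean_depth > 0:
        variation = statistics.stdev(reference_depths) / mean_depth
        if variation > max_variation_coefficient:
            flags.append("High Controls Variation")
    return flags
```

For example, with `min_refs=2` and a 20 percent cutoff, candidates with percent differences of 5, 8, 25, and 40 would yield only the first two as references.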

Columns in the Sample Summary Table

Summary information about each sample is available in this table. These fields are computed across all of the sample's CNVs.

  • Sample Flags: QC warnings for the samples
    • High IQR: High interquartile range for Z-score and ratio. This flag indicates that there is high variance between targets for one or more of the evidence metrics.
    • Low Sample Mean Depth: Sample mean depth below 30.
    • Mismatch to reference samples: Match score indicates low similarity to control samples.
    • Mismatch to non-autosomal reference samples: Match score indicates low similarity to non-autosomal control samples.
    • Few Gender Matches: Not enough reference samples with matching gender to call X and Y CNVs.
  • Inferred Gender: Gender inferred from X chromosome coverage ratio
  • # CNV Events: Number of CNV events
  • # Flagged CNV Events: Number of flagged CNV events
  • # Unflagged CNV Events: Number of unflagged CNV events
  • # Hom Deletions: Number of homozygous deletion events
  • # Het Deletions: Number of heterozygous deletion events
  • # Duplications: Number of duplication events
  • Z-score IQR: Interquartile range of the Z-scores over all targets
  • Ratio IQR: Interquartile range of the ratios over all targets
  • Variants Considered: Variants considered for VAF content
  • Percent Difference: Average percent difference between sample and matched controls for autosomal regions
  • Reference Samples: Samples selected as matched controls
  • X Ratio: Ratio of X chromosome coverage to autosomal chromosome coverage
  • Non-Autosomal Percent Difference: Average percent difference between sample and matched controls for non-autosomal regions
  • Non-Autosomal Reference Samples: Samples selected as matched controls for non-autosomal regions
  • Karyotype: Chromosomal CNV information for this sample.
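A few of the summary columns above can be illustrated with a short sketch. The interquartile range uses Python's `statistics.quantiles`, and the X ratio follows the column description (X chromosome coverage over autosomal coverage). The 0.75 cutoff and the "Male"/"Female" labels used for inferring gender are assumptions chosen for the example; the thresholds the CNV Caller actually uses are not given here.

```python
import statistics


def iqr(values):
    """Interquartile range (Q3 - Q1) of a list of z-scores or ratios."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


def x_ratio(x_mean_depth, autosomal_mean_depth):
    """Ratio of X chromosome coverage to autosomal coverage."""
    return x_mean_depth / autosomal_mean_depth


def infer_gender(ratio, cutoff=0.75):
    """Infer gender from the X ratio.

    One X copy is expected to give a ratio near 0.5 and two copies a ratio
    near 1.0; the cutoff between them is a hypothetical value.
    """
    return "Female" if ratio >= cutoff else "Male"
```

A sample with a mean X depth of 10 against a mean autosomal depth of 20 has an X ratio of 0.5 and would be inferred as male under this cutoff.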

Algorithm Details

The CNV Calling algorithm was developed based on a combination of methods in the existing CNV literature and novel methods developed by our engineers. The algorithm classifies each coverage region and uses control samples for coverage comparison. Classification and event segmentation are performed using either the Circular Binary Segmentation or the CNAM optimal segmentation algorithm.
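To show what turning per-bin classifications into events means, the sketch below merges consecutive bins that share the same non-diploid state into a single event. This is a deliberately simplified stand-in: it is plain run-length merging, not Circular Binary Segmentation or CNAM, and the function name and tuple layout are hypothetical.

```python
from itertools import groupby


def merge_events(states, bin_size, neutral=2):
    """Collapse runs of identical non-neutral bin states into events.

    states is a list of numeric CNV states, one per consecutive bin.
    Returns (start_bp, end_bp, state) tuples with half-open coordinates.
    """
    events = []
    index = 0
    for state, run in groupby(states):
        length = len(list(run))
        if state != neutral:
            events.append((index * bin_size, (index + length) * bin_size, state))
        index += length
    return events
```

For example, `merge_events([2, 2, 1, 1, 1, 2, 3], 100)` returns `[(200, 500, 1), (600, 700, 3)]`: one heterozygous deletion spanning three bins and one single-bin duplication. A real segmentation algorithm works on the noisy z-scores or ratios themselves, which lets it tolerate isolated outlier bins that this simple merge would split on.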