CNV Caller on Binned Regions

This algorithm uses sample-level coverage statistics, computed over fixed-width bins, to detect copy number variations (CNVs). It is tailored toward whole genome analysis. Each coverage bin is classified as a homozygous deletion, heterozygous deletion, diploid, or duplication.

Note

By using large bins, large CNV events can be detected accurately on extremely low read-depth WGS data. Even a sample with 0.02X coverage (~1 million reads) is sufficient to call events down to the million base-pair level and to detect chromosomal aneuploidy events with high confidence.
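
As a rough back-of-the-envelope check of why large bins help, the expected number of reads per bin can be estimated from the read count and the bin width. The sketch below assumes a genome of about 3.1 billion base pairs, reads spread uniformly across it, and Poisson counting noise. The function name and these values are illustrative assumptions and are not part of VarSeq.

```python
import math

GENOME_SIZE_BP = 3.1e9  # assumed approximate human genome size


def expected_reads_per_bin(total_reads, bin_width_bp, genome_size_bp=GENOME_SIZE_BP):
    """Expected read count in one bin if reads fall uniformly across the genome."""
    return total_reads * bin_width_bp / genome_size_bp


# 1 million reads counted in 1 Mb bins
reads = expected_reads_per_bin(1_000_000, 1_000_000)

# Under Poisson noise, the relative standard deviation of a count is 1/sqrt(count).
relative_noise = 1 / math.sqrt(reads)

print(f"{reads:.0f} reads per bin, relative noise {relative_noise:.1%}")
```

Under these assumptions each 1 Mb bin holds a few hundred reads, so counting noise is on the order of a few percent, which is small compared with the roughly 50% drop in depth expected for a heterozygous deletion.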

Reference samples are used to normalize the coverage data, and statistics are reported to provide an overview of the evidence for each classification. This algorithm has been tested on shallow whole genome sequencing data and is capable of calling large cytogenetic events such as whole chromosome duplications.

With the addition of the “CNV Caller” add-on to your VarSeq license, you can add this algorithm to your interactive workflows or to automated workflows executed with VSPipeline.

Note

To add the CNV Caller to your VarSeq license, contact info@goldenhelix.com.

Requirements

Binned Region Coverage must be computed prior to running this algorithm. For best results, we recommend at least 30 reference samples. The current references can be viewed and edited with the Manage Reference Samples dialog. For more information, see VarSeq CNV Reference Manager.

Options

The user may specify the following options:

  • Minimum Number of Reference Samples: Desired minimum number of reference samples to be selected.
  • Maximum Number of Reference Samples: The maximum number of reference samples to be selected.
  • Exclude reference samples with percent difference greater than: This option will filter reference samples with a percent difference above the specified value after a minimum of 10 samples have been selected.
  • Add samples to reference set: This option adds the current project’s samples to the reference set. Go to Tools > Open Folder > Reference Samples Folder to see all the samples that have been added to your reference set over time.
  • Reference Sample Folder: The folder containing the reference samples used to normalize the coverage data.
  • Controls average target mean depth below: Flags targets with average reference sample depth below the specified value.
  • Controls variation coefficient above: Flags targets for which the variation coefficient is above the specified value. A high variation coefficient indicates that there is extreme variation in reference sample coverage for the target region.
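
One possible reading of the reference selection and control flag options can be sketched as follows. This is a minimal illustration and not the VarSeq implementation: the function names, the assumption that percent differences are already computed per reference sample, the use of the sample standard deviation, and all threshold values are hypothetical.

```python
from statistics import mean, stdev


def select_references(percent_differences, max_samples, max_percent_difference,
                      always_keep=10):
    """Pick the closest references; after `always_keep` samples, drop any whose
    percent difference is above the cutoff (assumed interpretation)."""
    ranked = sorted(percent_differences.items(), key=lambda item: item[1])
    selected = []
    for name, diff in ranked[:max_samples]:
        if len(selected) >= always_keep and diff > max_percent_difference:
            break  # ranked ascending, so every remaining sample is also above the cutoff
        selected.append(name)
    return selected


def control_flags(reference_depths, min_mean_depth, max_variation_coefficient):
    """Return QC flags for one target given its depth in each reference sample."""
    avg = mean(reference_depths)
    # Variation coefficient = standard deviation / mean (treated as infinite if mean is 0).
    cv = stdev(reference_depths) / avg if avg > 0 else float("inf")
    flags = []
    if avg < min_mean_depth:
        flags.append("Low Controls Depth")
    if cv > max_variation_coefficient:
        flags.append("High Controls Variation")
    return flags


# Hypothetical thresholds: mean depth below 10, variation coefficient above 0.3
print(control_flags([30, 32, 28, 31], 10, 0.3))
print(control_flags([2, 12, 1, 9], 10, 0.3))
```

The first target has consistent, adequate coverage across references and receives no flags, while the second is both shallow and highly variable.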

Output of the CNVs Table

The CNV Caller algorithm will generate a CNVs table view. This table will include records for all called CNV events.

  • Region: Genomic coordinates (Chr: Start-Stop)
  • # Targets: Number of targets in the event
  • # Samples: Number of samples in the event
  • Span: The width of the event. Computed from the difference between the stop and start positions.
  • CNV State: State of the CNV event. Either Deletion, Het Deletion, Duplicate or CN LoH.
  • Flags: QC warnings for the event.
    • Low Controls Depth: Mean read depth over controls fell below threshold.
    • High Controls Variation: Variation coefficient exceeded threshold.
    • Within Regional IQR: Event is not significantly different from surrounding normal regions based on regional IQR.
  • Avg Target Mean Depth: Average mean depth of the targets in this event, as reported by Coverage Statistics.
  • Avg Z Score: Average Z-score of the event.
  • Avg Ratio: Average ratio of the event.
  • Variants Considered: Number of variants in the event considered for VAF content.
  • Supporting LoH Variants: Total number of variants within an LoH event supporting the called CNV state.
  • p-value: Probability that z-scores at least as extreme as those in the event would occur by chance in a diploid region.
  • Karyotype: Cytogenetic nomenclature for this event.
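
The documentation does not state which statistical test produces the p-value. As one plausible illustration only, the sketch below assumes that target Z-scores in a diploid region are independent standard normal values and computes a two-sided p-value for the mean Z-score of an event. The function name and the choice of test are assumptions, not the VarSeq method.

```python
import math


def event_p_value(z_scores):
    """Two-sided p-value for the mean Z-score of an event under a diploid null."""
    n = len(z_scores)
    mean_z = sum(z_scores) / n
    # Under the assumed null, the mean of n independent standard normal values
    # has standard deviation 1/sqrt(n), so rescale it to a standard normal.
    combined = abs(mean_z) * math.sqrt(n)
    # P(|Z| >= c) for a standard normal Z equals erfc(c / sqrt(2)).
    return math.erfc(combined / math.sqrt(2))


# Four consecutive targets, each about three standard deviations below the controls
print(event_p_value([-3.1, -2.8, -3.3, -2.9]))
```

Under this model, an event spanning more targets yields a smaller p-value for the same average Z-score, which matches the intuition that longer events carry more evidence.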

Output in the Samples Table

Summary fields are appended to the Samples Table. These fields provide summary information computed across all of the CNVs.

  • Sample Flags: QC warnings for the samples
    • High IQR: High interquartile range for Z-score and ratio. This flag indicates that there is high variance between targets for one or more of the evidence metrics.
    • Low Sample Mean Depth: Sample mean depth below 30.
    • Mismatch to reference samples: Match score indicates low similarity to control samples.
    • Mismatch to non-autosomal reference samples: Match score indicates low similarity to non-autosomal control samples.
    • Few Gender Matches: Not enough reference samples with matching gender to call X and Y CNVs.
  • Inferred Gender: Gender inferred from X chromosome coverage ratio.
  • # CNV Events: Number of CNV events.
  • # Flagged CNV Events: Number of flagged CNV events.
  • # Unflagged CNV Events: Number of unflagged CNV events.
  • # Hom Deletions: Number of homozygous deletion events.
  • # Het Deletions: Number of heterozygous deletion events.
  • # Duplications: Number of duplication events.
  • Z-score IQR: Interquartile range of the Z-scores over all targets.
  • Ratio IQR: Interquartile range of the ratios over all targets.
  • Variants Considered: Variants considered for VAF content.
  • Percent Difference: Average percent difference between sample and matched controls for autosomal regions.
  • Reference Samples: Samples selected as matched controls.
  • X Ratio: Ratio of X chromosome coverage to autosomal chromosome coverage.
  • Non-Autosomal Percent Difference: Average percent difference between sample and matched controls for non-autosomal regions.
  • Non-Autosomal Reference Samples: Samples selected as matched controls for non-autosomal regions.
  • Karyotype: Cytogenetic nomenclature for this sample.
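
Two of the summary metrics above, the interquartile range and the X ratio, follow directly from their definitions. The sketch below is a minimal illustration using the Python standard library. The function names are hypothetical, and the quartile interpolation method VarSeq uses is not documented here, so the default of `statistics.quantiles` is an assumption.

```python
from statistics import mean, quantiles


def interquartile_range(values):
    """Difference between the third and first quartiles of the values."""
    q1, _median, q3 = quantiles(values, n=4)
    return q3 - q1


def x_ratio(x_depths, autosomal_depths):
    """Ratio of mean X chromosome coverage to mean autosomal coverage."""
    return mean(x_depths) / mean(autosomal_depths)


# Hypothetical per-target values for one sample
z_scores = [-0.4, 0.1, 0.3, -0.2, 0.0, 0.5, -0.6]
print(interquartile_range(z_scores))
print(x_ratio([15.2, 14.8, 15.1], [30.1, 29.7, 30.4]))
```

In principle, a sample with a single X chromosome should show an X ratio near 0.5 and a sample with two X chromosomes a ratio near 1.0. The cutoffs VarSeq applies when inferring gender are not stated here.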

Output in the Coverage Regions Table

Target level CNV fields are appended to the Coverage Regions Table. These fields provide information computed across all coverage regions.

  • CNV State: State of the CNV call for this target. Either homozygous deletion, heterozygous deletion, diploid, or duplication.
  • Flags: QC flags for the target region.
    • Low Controls Depth: Mean read depth over controls fell below threshold.
    • High Controls Variation: Variation coefficient exceeded threshold.
    • Within Regional IQR: Event is not significantly different from surrounding normal regions based on regional IQR.
    • Few Gender Matches: Not enough reference samples with matching gender to call X and Y CNVs.
  • Z Score: Z-score of the target. Computed as (normalized target depth - mean depth across controls) / standard deviation of depth across controls.
  • Ratio: Ratio of normalized target depth to mean depth across controls.
  • Variants Considered: Variants considered for VAF content.
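
The Z Score and Ratio definitions above can be written out as a short calculation. This is a minimal sketch: the function name is hypothetical, the depths are made-up values, and the use of the sample standard deviation (rather than the population form) is an assumption.

```python
from statistics import mean, stdev


def target_metrics(normalized_depth, control_depths):
    """Return (z_score, ratio) for one target against its reference samples."""
    control_mean = mean(control_depths)
    control_sd = stdev(control_depths)  # assumed: sample standard deviation
    z_score = (normalized_depth - control_mean) / control_sd
    ratio = normalized_depth / control_mean
    return z_score, ratio


# A target at about half the depth seen in the controls
z_score, ratio = target_metrics(50, [98, 100, 102, 100])
print(f"z = {z_score:.1f}, ratio = {ratio:.2f}")
```

A ratio near 0.5 with a strongly negative Z-score is the pattern expected for a heterozygous deletion, while a ratio near 1.0 with a Z-score near zero indicates a diploid target.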

Variants by CNVs Table

This composite table view includes all of the CNVs that cover one or more variants from the filtered Variant table. The CNVs appear in the left-hand table, and the corresponding variants appear in the right-hand table. The variants that fall within each CNV can be viewed by changing the row selection in the CNV table.

Output in the Variant Table

Variants will be matched to any CNVs they fall within. The values for each of the matching CNVs will be listed in their respective fields which are appended to the Variant table.
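
Matching a variant to the CNVs it falls within amounts to an interval containment check. The sketch below is illustrative only: the record layout, the function name, and the choice of inclusive start and stop coordinates are assumptions rather than VarSeq's internal representation.

```python
def cnvs_containing(variant, cnvs):
    """Return every CNV whose region contains the variant position."""
    chrom, pos = variant
    return [
        cnv for cnv in cnvs
        if cnv["chr"] == chrom and cnv["start"] <= pos <= cnv["stop"]
    ]


# Hypothetical CNV events
cnvs = [
    {"chr": "1", "start": 1_000_000, "stop": 3_000_000, "state": "Het Deletion"},
    {"chr": "1", "start": 2_500_000, "stop": 4_000_000, "state": "Duplicate"},
    {"chr": "2", "start": 1_000_000, "stop": 2_000_000, "state": "Deletion"},
]

for match in cnvs_containing(("1", 2_750_000), cnvs):
    print(match["state"])
```

A variant can fall within more than one CNV, which is why the function returns a list rather than a single record.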

Algorithm Details

The CNV Calling algorithm was developed based on a combination of methods in the existing CNV literature and novel methods developed by our engineers. The algorithm classifies each coverage region and uses control samples for coverage comparison. Classification and event segmentation are performed using the CNAM optimal segmentation algorithm.
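
The CNAM optimal segmentation algorithm is not described in this section, so it is not reproduced here. To illustrate only the relationship between per-bin classifications and the events reported in the CNVs table, the sketch below merges runs of adjacent bins that share a non-diploid state. This simple run-merging is a stand-in, not the CNAM method, and the function name, tuple layout, and contiguous bin coordinates are assumptions.

```python
def merge_bins(bins):
    """Merge consecutive bins with the same state into events.

    `bins` is a list of (chr, start, stop, state) tuples sorted by position,
    where each bin starts where the previous one stops (assumed layout).
    """
    runs = []
    for chrom, start, stop, state in bins:
        last = runs[-1] if runs else None
        if last and last["chr"] == chrom and last["state"] == state and last["stop"] == start:
            # Extend the current run by one more bin.
            last["stop"] = stop
            last["targets"] += 1
        else:
            runs.append({"chr": chrom, "start": start, "stop": stop,
                         "state": state, "targets": 1})
    # Diploid runs only act as separators; report the remaining runs as events.
    events = [run for run in runs if run["state"] != "Diploid"]
    for event in events:
        event["span"] = event["stop"] - event["start"]
    return events


bins = [
    ("1", 0, 100, "Diploid"),
    ("1", 100, 200, "Het Deletion"),
    ("1", 200, 300, "Het Deletion"),
    ("1", 300, 400, "Diploid"),
    ("2", 0, 100, "Het Deletion"),
]
for event in merge_bins(bins):
    print(event)
```

Each resulting event carries the fields that correspond to Region, # Targets, Span, and CNV State in the CNVs table.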