CNV Caller on Target Regions

This algorithm uses sample-level coverage statistics to detect copy number variations (CNVs). Each coverage region is classified as either homozygous deletion, heterozygous deletion, diploid, or duplication.

Reference samples are used to normalize the coverage data, and statistics are reported to provide an overview of the evidence for each classification. This algorithm has been tested on gene panel data as well as whole exome data, and is capable of calling events ranging from single-exon deletions to whole-chromosome duplications. The minimum and maximum reference sample counts can be configured; if you have a large number of control samples in your reference folder, we suggest increasing the maximum value.

With the addition of the “CNV Caller” add-on to your VarSeq license, you can add this algorithm to your interactive workflows or to automated workflows executed with VSPipeline.

Note

To add the CNV Caller to your VarSeq license, contact info@goldenhelix.com.

Requirements

Coverage statistics must be computed prior to running this algorithm. For best results, we recommend at least 100x coverage and 30 reference samples. The current references can be viewed and edited with the Manage Reference Samples dialog. For more information, see VarSeq CNV Reference Manager.

Options

The first tab in the dialog allows the user to specify the following options:

  • Sensitivity/Precision: Adjusts the trade-off between sensitivity (the fraction of true CNVs that are called) and precision (the fraction of called CNVs that are true).
  • Minimum Number of Reference Samples: Desired minimum number of reference samples to be selected.
  • Maximum Number of Reference Samples: The maximum number of reference samples to be selected.
  • Exclude reference samples with percent difference greater than: This option will filter reference samples with a percent difference above the specified value after a minimum of 10 samples have been selected.
  • Add samples to reference set: This option adds the current project’s sample to the reference set. Go to Tools > Open Folder > Reference Samples Folder to see all the samples that have been added to your reference set over time.
  • Normalize sex chromosomes separately based on inferred gender: If this option is selected, and there are non-autosomal targets (X and Y), a gender will be inferred for each sample. Based on that gender, a set of gender-matched references will be selected for normalizing these chromosomes. Un-check this if you don’t have enough samples to do gender-matched normalization or expect all samples to be predominantly one sex.
  • Reference Sample Folder: The folder containing the reference samples used to normalize the coverage data.
  • Controls average target mean depth below: Flags targets with average reference sample depth below the specified value.
  • Controls variation coefficient above: Flags targets for which the variation coefficient is above the specified value. A high variation coefficient indicates that there is extreme variation in reference sample coverage for the target region.
  • Blacklist Regions: The blacklist region file is used to specify regions to be excluded from the normalization process.
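
The reference sample options above can be read as a simple selection rule. The following is a minimal sketch of one plausible interpretation, not the actual VarSeq implementation: candidates are ranked by percent difference, the first 10 are accepted regardless of the threshold, and further samples are added only while they stay under the threshold and the maximum count. The function name `select_references` and all default values are hypothetical.

```python
from typing import Dict, List, Tuple


def select_references(
    percent_diff: Dict[str, float],
    min_refs: int = 10,              # hypothetical default
    max_refs: int = 30,              # hypothetical default
    max_percent_diff: float = 20.0,  # hypothetical default
    unfiltered_count: int = 10,      # filter applies after 10 are selected
) -> Tuple[List[str], bool]:
    """Return the chosen reference samples and whether the desired
    minimum count was reached."""
    # Most similar candidates (lowest percent difference) come first.
    ranked = sorted(percent_diff, key=percent_diff.get)
    selected: List[str] = []
    for name in ranked:
        if len(selected) >= max_refs:
            break
        over_threshold = percent_diff[name] > max_percent_diff
        if len(selected) >= unfiltered_count and over_threshold:
            break  # candidates are sorted, so the rest are worse
        selected.append(name)
    return selected, len(selected) >= min_refs


# Fifteen hypothetical candidates with percent differences 1.0 .. 15.0
candidates = {f"s{i}": float(i) for i in range(1, 16)}
chosen, enough = select_references(candidates, max_percent_diff=12.0)
print(len(chosen), enough)
```

Raising `max_refs` in this sketch mirrors the suggestion to increase the maximum when many control samples are available.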

The CNV calling algorithm relies on probability distributions associated with both the Z-score and Ratio metrics. The Z-score for a target measures the number of standard deviations a sample’s coverage is from the mean reference sample coverage, while the Ratio is the target coverage divided by the mean reference sample coverage.

Each metric is associated with three probability distributions; one for each type of CNV: Hom. Deletion, Het. Deletion, and Duplication.

Normal distributions are used for the deletion distributions, while log-normal distributions are used for the duplication distributions.

These parameters can be specified in the advanced tab, which contains the following options:

  • Prior CNV Probability: Specify the prior and transition probabilities for CNV state.
  • Z-score Parameters: Specify the parameters for the Z-score distributions.
  • Ratio Parameters: Specify the parameters for the Ratio distributions.
  • Utilize Variant Allele Frequency: Use Variant Allele Frequency when calling CNVs.
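
To illustrate how distributions of this kind can drive a classification, here is a minimal sketch that scores a single Ratio value against normal deletion models and a log-normal duplication model, weighted by a prior CNV probability. Every parameter value is hypothetical, the diploid model is an added assumption needed to make the comparison complete, and the sketch ignores the Z-score, VAF, and transition probabilities that the real algorithm uses.

```python
import math
from statistics import NormalDist


def lognormal_pdf(x: float, mu: float, sigma: float) -> float:
    """Density of a log-normal distribution; zero for non-positive x."""
    if x <= 0:
        return 0.0
    exponent = -((math.log(x) - mu) ** 2) / (2 * sigma ** 2)
    return math.exp(exponent) / (x * sigma * math.sqrt(2 * math.pi))


# Hypothetical Ratio models: normal for deletions, log-normal for duplication.
RATIO_MODELS = {
    "Hom. Deletion": lambda r: NormalDist(0.0, 0.1).pdf(r),
    "Het. Deletion": lambda r: NormalDist(0.5, 0.1).pdf(r),
    "Diploid": lambda r: NormalDist(1.0, 0.1).pdf(r),  # assumption, not in the text
    "Duplication": lambda r: lognormal_pdf(r, math.log(1.5), 0.15),
}


def classify_ratio(ratio: float, prior_cnv: float = 0.01):
    """Return the most probable state and the posterior for each state."""
    cnv_states = [s for s in RATIO_MODELS if s != "Diploid"]
    # Split the prior CNV probability evenly across the CNV states.
    priors = {s: prior_cnv / len(cnv_states) for s in cnv_states}
    priors["Diploid"] = 1.0 - prior_cnv
    scores = {s: priors[s] * pdf(ratio) for s, pdf in RATIO_MODELS.items()}
    total = sum(scores.values())
    if total == 0:  # all densities underflowed; fall back to diploid
        return "Diploid", priors
    posteriors = {s: v / total for s, v in scores.items()}
    return max(posteriors, key=posteriors.get), posteriors


for r in (0.0, 0.5, 1.0, 1.5):
    print(r, classify_ratio(r)[0])
```

A larger prior CNV probability makes the sketch more willing to leave the diploid state for borderline ratios.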

Output of the CNVs Table

The CNV Caller algorithm will generate a CNVs table view. This table will include records for all called CNV events.

  • Region: Genomic coordinates (Chr: Start-Stop)
  • # Targets: Number of targets in the event.
  • # Samples: Number of samples in the event.
  • Span: The width of the event. Computed from the difference between the stop and start positions.
  • CNV State: State of the CNV event. Either Deletion, Het Deletion, Duplicate or CN LoH.
  • Flags: QC warnings for the event.
    • Low Controls Depth: Mean read depth over controls fell below threshold.
    • High Controls Variation: Variation coefficient exceeded threshold.
    • Within Regional IQR: Event is not significantly different from surrounding normal regions based on regional IQR.
    • Low Z Score: Event is not significantly different from surrounding normal regions based on regional IQR.
  • Avg Target Mean Depth: Average mean depth of the targets in this event as reported by Coverage Statistics.
  • Avg Z Score: Average Z-score of the event.
  • Avg Ratio: Average ratio of the event.
  • Variants Considered: Number of variants in the event considered for VAF content.
  • Supporting LoH Variants: Total number of variants within an LoH event supporting the called CNV state.
  • p-value: Probability that z-scores at least as extreme as those in the event would occur by chance in a diploid region.
  • Karyotype: Cytogenetic nomenclature for this event.

Output in the Samples Table

Summary fields are appended to the Samples Table. These fields provide summary information computed across all of the CNVs.

  • Sample Flags: QC warnings for the samples
    • High IQR: High interquartile range for Z-score and ratio. This flag indicates that there is high variance between targets for one or more of the evidence metrics.
    • High Median Z-score: The median of all the Z-scores was above 0.4. This indicates a general skew of this sample away from the reference samples, which is likely to cause excessive duplication calls.
    • Low Sample Mean Depth: Sample mean depth below 30.
    • Mismatch to reference samples: Match score indicates low similarity to control samples.
    • Mismatch to non-autosomal reference samples: Match score indicates low similarity to non-autosomal control samples.
    • Few Gender Matches: Not enough reference samples with matching gender to call X and Y CNVs.
  • Inferred Gender: Gender inferred from X chromosome coverage ratio and Y coverage when more than 50 targets are present in the Y chromosome
  • # CNV Events: Number of CNV events
  • # Flagged CNV Events: Number of flagged CNV events
  • # Unflagged CNV Events: Number of unflagged CNV events
  • # Hom Deletions: Number of homozygous deletion events
  • # Het Deletions: Number of heterozygous deletion events
  • # Duplications: Number of duplication events
  • Z-score IQR: Interquartile range of the Z-scores over all targets
  • Ratio IQR: Interquartile range of the ratios over all targets
  • Variants Considered: Variants considered for VAF content
  • Percent Difference: Average percent difference between sample and matched controls for autosomal regions
  • Reference Samples: Samples selected as matched controls
  • X Ratio: Ratio of X chromosome coverage to autosomal chromosome coverage
  • Y Ratio: Ratio of Y chromosome coverage to autosomal chromosome coverage
  • Non-Autosomal Percent Difference: Average percent difference between sample and matched controls for non-autosomal regions
  • Non-Autosomal Reference Samples: Samples selected as matched controls for non-autosomal regions
  • Karyotype: Chromosomal CNV information for this sample.
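
As a rough illustration of how some of these sample-level fields relate to the target-level values, the sketch below computes the two IQR fields and applies the two flags whose thresholds are stated above (median Z-score above 0.4, sample mean depth below 30). The function names are hypothetical, the quartile method is whatever Python's `statistics.quantiles` uses by default, and taking the sample mean depth as the plain average of per-target depths is an assumption.

```python
from statistics import mean, median, quantiles
from typing import Dict, List


def iqr(values: List[float]) -> float:
    """Interquartile range: third quartile minus first quartile."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1


def sample_summary(
    z_scores: List[float], ratios: List[float], target_depths: List[float]
) -> Dict[str, object]:
    flags = []
    if median(z_scores) > 0.4:
        flags.append("High Median Z-score")
    # Assumption: sample mean depth is the unweighted mean over targets.
    if mean(target_depths) < 30:
        flags.append("Low Sample Mean Depth")
    return {
        "Sample Flags": flags,
        "Z-score IQR": iqr(z_scores),
        "Ratio IQR": iqr(ratios),
    }


summary = sample_summary(
    z_scores=[0.5, 0.6, 0.7, 0.8, 0.9],
    ratios=[0.95, 1.0, 1.02, 1.05, 1.1],
    target_depths=[20.0, 25.0, 28.0, 22.0, 26.0],
)
print(summary["Sample Flags"])
```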

Output in the Coverage Regions Table

Target level CNV fields are appended to the Coverage Regions Table. These fields provide information computed across all coverage regions.

  • CNV State: State of the CNV call for this target. Either homozygous deletion, heterozygous deletion, diploid, or duplication.
  • Flags: QC flags for the target region.
    • Low Controls Depth: Mean read depth over controls fell below threshold.
    • High Controls Variation: Variation coefficient exceeded threshold
    • Within Regional IQR: Event is not significantly different from surrounding normal regions based on regional IQR.
    • Few Gender Matches: Not enough reference samples with matching gender to call X and Y CNVs
  • Z Score: Z-score of the target. Computed as (normalized target depth - mean depth across controls) / standard deviation
  • Ratio: Ratio of normalized target depth over mean depth across controls
  • Variants Considered: Variants considered for VAF content.
  • Normalized Mean Depth: (hidden by default) Target depth / Mean depth over controls
  • Avg. Normalized Control Depth: (hidden by default) Average normalized depth for the target in all the controls
  • Control Standard Dev.: (hidden by default) Standard deviation of normalized depth in all the controls
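
The Z Score and Ratio definitions above can be sketched directly from the normalized depths. This is an illustrative example only: the function name `target_metrics` is hypothetical, and the use of the sample standard deviation (rather than the population form) is an assumption. The coefficient of variation is included because it is the quantity behind the High Controls Variation flag.

```python
from statistics import mean, stdev
from typing import Dict, List


def target_metrics(sample_depth: float, control_depths: List[float]) -> Dict[str, float]:
    """Compute per-target evidence from normalized depths.

    sample_depth: normalized depth of the target in the sample of interest.
    control_depths: normalized depths of the same target in the controls.
    """
    control_mean = mean(control_depths)
    control_sd = stdev(control_depths)  # assumption: sample standard deviation
    if control_mean == 0 or control_sd == 0:
        raise ValueError("controls have zero mean or zero variation")
    return {
        "Z Score": (sample_depth - control_mean) / control_sd,
        "Ratio": sample_depth / control_mean,
        "Variation Coefficient": control_sd / control_mean,
    }


# A target at half the control depth, as expected for a heterozygous deletion
metrics = target_metrics(0.5, [0.9, 1.0, 1.1])
print(metrics)
```

Note that a noisy set of controls inflates the standard deviation and shrinks the Z-score while leaving the ratio unchanged, which is one reason both metrics are reported.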

Variants by CNVs Table

This composite table view includes all of the CNVs that cover one or more variants from the filtered Variant table. The CNVs appear in the left-hand table, and the corresponding variants appear in the right-hand table. The variants that fall within each CNV can be viewed by changing the row selection in the CNV table.

Output in the Variant Table

Variants will be matched to any CNVs they fall within. The values for each of the matching CNVs will be listed in their respective fields which are appended to the Variant table.
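
The matching described above amounts to an interval containment test. The sketch below shows the idea with hypothetical `CNV` and `Variant` record types; treating both ends of the CNV region as inclusive is an assumption.

```python
from typing import List, NamedTuple


class CNV(NamedTuple):
    chrom: str
    start: int
    stop: int
    state: str


class Variant(NamedTuple):
    chrom: str
    pos: int


def cnvs_for_variant(variant: Variant, cnvs: List[CNV]) -> List[CNV]:
    """Return every CNV whose region contains the variant position."""
    return [
        c for c in cnvs
        if c.chrom == variant.chrom and c.start <= variant.pos <= c.stop
    ]


cnvs = [
    CNV("1", 1000, 5000, "Het Deletion"),
    CNV("1", 4000, 9000, "Duplicate"),
    CNV("2", 1000, 5000, "Deletion"),
]
matches = cnvs_for_variant(Variant("1", 4500), cnvs)
print([c.state for c in matches])
```

A variant can fall within more than one CNV, which is why the function returns a list.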

Algorithm Details

The CNV Calling algorithm was developed based on a combination of methods in the existing CNV literature and novel methods developed by our engineers. The algorithm classifies each coverage region and uses control samples for coverage comparison.

Classification and event segmentation are performed using a probabilistic model that incorporates three evidence metrics: Z-score, ratio, and variant allele frequency (VAF). The Z-score measures the number of standard deviations from the reference sample mean, the ratio is the normalized mean for the sample of interest divided by the average normalized mean for the reference samples, and VAF is the allelic fraction at the variant locus. Using these metrics, the algorithm calls CNV state for each target region. Target regions are then merged to obtain contiguous CNV events.
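
The final merging step can be pictured as collapsing runs of consecutive targets that share a non-diploid state. The sketch below is a simplified illustration under that assumption, not the probabilistic segmentation the algorithm actually performs; the function name and the tuple layout are hypothetical. The span is computed as stop minus start, as in the CNVs table.

```python
from typing import Dict, List, Tuple

Target = Tuple[str, int, int, str]  # (chrom, start, stop, state)


def merge_targets(targets: List[Target]) -> List[Dict[str, object]]:
    """Merge consecutive same-state, non-diploid targets into events.

    Targets must already be sorted by chromosome and start position.
    """
    events: List[Dict[str, object]] = []
    prev = None  # (chrom, state) of the previous target
    for chrom, start, stop, state in targets:
        if state != "Diploid":
            if prev == (chrom, state):
                # Same state as the previous target: extend the open event.
                events[-1]["stop"] = stop
                events[-1]["targets"] += 1
            else:
                events.append({"chrom": chrom, "start": start, "stop": stop,
                               "state": state, "targets": 1})
        prev = (chrom, state)
    for event in events:
        event["span"] = event["stop"] - event["start"]
    return events


targets = [
    ("1", 100, 200, "Het Deletion"),
    ("1", 300, 400, "Het Deletion"),
    ("1", 500, 600, "Diploid"),
    ("1", 700, 800, "Het Deletion"),
    ("2", 100, 200, "Het Deletion"),
]
for event in merge_targets(targets):
    print(event)
```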

Since these metrics can be noisy over very large regions, a segmentation algorithm is used to call large multi-gene and whole-chromosome events. If a region contains many small CNV events, CNAM optimal segmentation is used to segment the region, and small events that share a segmented region are merged.