CNV Caller on NGS Target Regions

Golden Helix SVS integrates the VS-CNV 2.0 algorithm for calling CNVs on targeted gene panels and exomes.

The output spreadsheets are designed for statistical analysis in Golden Helix SVS.

This algorithm uses sample level coverage statistics to detect copy number variations (CNV). Each coverage region is classified as either homozygous deletion, heterozygous deletion, diploid, or duplication.

Reference samples are used to normalize the coverage data, and statistics are reported to provide an overview of the evidence for each classification. This algorithm has been tested on gene panel data as well as whole exome data, and is capable of calling events ranging from single-exon deletions to whole-chromosome duplications. The minimum and maximum reference sample counts can be configured; if you have a large number of control samples in your reference folder, we suggest increasing the maximum value.

With the addition of the “CNV Caller” add-on to your Golden Helix SVS license, you can add this algorithm to your analysis workflows.

Note

To add the CNV Caller to your Golden Helix SVS license, contact info@goldenhelix.com.

Requirements

To use the CNV Caller on Target Regions, start from a spreadsheet that has the sample names for your analysis as row labels and select Numeric > CNV on NGS Target Regions.

CNV on Target Regions Window

The algorithm requires a BAM file for each sample you would like to analyze, as well as a BED file or converted target regions file that defines the target capture regions used by the samples.

Optionally, the algorithm can leverage the Variant Allele Frequency information that can be imported with your genotypes in the VCF Import wizard.

Coverage statistics over the target regions on the BAM file will be computed as the first step of the algorithm. For best results, we recommend at least 100x coverage and 30 reference samples.

By default, the samples in the current analysis as well as all previously analyzed samples for the given target regions are used as reference samples. The repository of reference samples can be found by going to Tools > Open Folder > Reference Samples Folder on the main application window.

Input and Output Options

The first tab configures the inputs and the requested outputs for running the algorithm. It has the following options:

  • Track for Target Regions: This track defines the target regions used by the NGS sample kit and will define the regions used to compute coverage and run the CNV algorithm.
  • BAM Path Mappings: The sample names come from the rows of the starting spreadsheet. These can be mapped to BAM paths manually by using the Associate BAMs dialog, or, if you have BAM path mappings as a column in your spreadsheet, you can specify that column.
  • VAF Spreadsheet: Allows you to specify a spreadsheet containing the Variant Allele Frequencies of variants imported during the Import VCF process. When importing VCF files, be sure to check the “VAF” field as one you would like to import as a sample by variant spreadsheet. The VAF values are used as hints to the Copy Number caller by indicating allelic imbalances.

The Outputs group box contains choices for which outputs you would like to have when the algorithm completes. They are summarized below:

  • Sample summary table provides a table with samples as rows and various summary statistics from the CNV calls as columns. See the section below for the detailed description of this table.
  • Target Region CNV State is typically the most useful output for downstream analysis relating copy number state to sample phenotypes and other traits of interest. The output has one column per target defined by the input coverage region file, with column values that are either categorical or numeric depending on the user selection.
    • Numeric CNV state encoding: Switches the column values from categorical to numeric, using the following mappings:
      • 0: Homozygous Deletion
      • 1: Heterozygous Deletion
      • 2: Diploid (Copy Neutral)
      • 3: Duplication
    • Filter out calls with QC flags: Sets CNV calls to missing (?) when the algorithm has set certain QC flags on them. This is useful when only high-quality CNV calls should appear in the CNV state table used for downstream analysis. The flags are defined as follows:
      • Low Controls Depth: Mean read depth over controls fell below threshold.
      • High Controls Variation: Variation coefficient exceeded threshold.
      • Within Regional IQR: Event is not significantly different from surrounding normal regions based on regional IQR.
      • Low Z-Score: Event is not significantly different from surrounding normal regions based on regional IQR.
  • Target region Z-score table provides a spreadsheet with the supporting Z-scores for the calls in the CNV state output table.
  • Target region ratio table provides a spreadsheet with the supporting ratios for the calls in the CNV state output table.
  • CNV event table provides output of the fully constructed CNV events that can only be represented on a per-sample basis. Up to the provided number of samples will have per-sample spreadsheets created with the details of the CNV calls for that given sample. See the section below for details of these spreadsheets.
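The numeric encoding and the QC-flag filter described above can be illustrated with a minimal Python sketch. This is not the SVS implementation: the function name `encode_state`, the `filter_flags` argument, and the use of `None` to stand in for a missing (?) value are assumptions made for illustration. Only the state-to-number mapping comes from the list above.

```python
# Numeric mapping as listed in the output options above.
CNV_STATE_CODES = {
    "Homozygous Deletion": 0,
    "Heterozygous Deletion": 1,
    "Diploid": 2,
    "Duplication": 3,
}


def encode_state(state, flags=(), filter_flags=frozenset()):
    """Return the numeric code for a CNV state.

    Returns None (standing in for a missing value) when the call carries
    any of the QC flags the user chose to filter on. Hypothetical helper.
    """
    if set(flags) & set(filter_flags):
        return None
    return CNV_STATE_CODES[state]


# Example calls for three targets of one sample: (state, QC flags).
calls = [
    ("Diploid", []),
    ("Heterozygous Deletion", []),
    ("Duplication", ["Low Z-Score"]),
]

# Filter on one flag; the flagged duplication becomes missing.
encoded = [encode_state(s, f, filter_flags={"Low Z-Score"}) for s, f in calls]
print(encoded)  # [2, 1, None]
```

Numeric codes are convenient for regression-style analyses, while the categorical form is easier to read when reviewing individual targets.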

CNV Algorithm Parameters

The Parameters tab in the dialog allows the user to specify the following options:

  • Sensitivity: Adjusts the trade-off between the true positive rate and the true negative rate.
  • Minimum Number of Reference Samples: Desired minimum number of reference samples to be selected.
  • Maximum Number of Reference Samples: The maximum number of reference samples to be selected.
  • Exclude reference samples with percent difference greater than: This option will filter reference samples with a percent difference above the specified value after a minimum of 10 samples have been selected.
  • Add samples to reference set: This option adds the current project’s sample to the reference set. Go to Tools > Open Folder > Reference Samples Folder to see all the samples that have been added to your reference set over time.
  • Normalize sex chromosomes separately based on inferred gender: If this option is selected, and there are non-autosomal targets (X and Y), a gender will be inferred for each sample. Based on that gender, a set of gender-matched references will be selected for normalizing these chromosomes. Un-check this if you don’t have enough samples to do gender-matched normalization or expect all samples to be predominantly one sex.
  • Reference Sample Folder: The folder containing the reference samples used to normalize the coverage data.
  • Controls average target mean depth below: Flags targets with average reference sample depth below the specified value.
  • Controls variation coefficient above: Flags targets for which the variation coefficient is above the specified value. A high variation coefficient indicates that there is extreme variation in reference sample coverage for the target region.
  • Optional Regions Ignored During Normalization: The blacklist region file is used to specify regions to be excluded from the normalization process.
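To make the interaction between the minimum, maximum, and percent difference options more concrete, here is a small Python sketch of one way such a selection could work. It is an illustration under stated assumptions, not the SVS selection routine: the function `select_references`, its default values, and the idea of ranking candidates by ascending percent difference are all hypothetical. Only the rule that the percent difference filter applies after a minimum number of samples has been selected comes from the option descriptions above.

```python
def select_references(percent_diffs, min_refs=10, max_refs=30,
                      max_percent_diff=20.0):
    """Pick reference samples for one sample of interest (hypothetical sketch).

    percent_diffs maps reference sample name -> percent difference from the
    sample of interest. Candidates are taken in order of increasing percent
    difference (an assumption). Once min_refs have been selected, candidates
    above max_percent_diff are excluded; selection stops at max_refs.
    """
    ranked = sorted(percent_diffs.items(), key=lambda item: item[1])
    selected = []
    for name, diff in ranked:
        if len(selected) >= max_refs:
            break
        if len(selected) >= min_refs and diff > max_percent_diff:
            # Candidates are sorted, so every remaining one also exceeds
            # the threshold.
            break
        selected.append(name)
    return selected


# Forty candidate references with percent differences 1.0, 2.0, ..., 40.0.
pool = {f"ref{i}": float(i) for i in range(1, 41)}
chosen = select_references(pool)
print(len(chosen), chosen[-1])
```

With a large pool of closely matched controls, the maximum becomes the limiting factor, which is why raising it is suggested when the reference folder holds many samples.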

Advanced Algorithm Parameters

The CNV calling algorithm relies on probability distributions associated with both the Z-score and Ratio metrics. The Z-score for a target measures the number of standard deviations a sample’s coverage is from the mean reference sample coverage, while the Ratio is the target coverage divided by the mean reference sample coverage.
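As a minimal illustration of the two metrics just defined, the following Python sketch computes a Z-score and a Ratio for a single target. The function name `target_metrics` is hypothetical, the inputs are assumed to be already-normalized coverage values, and the use of the sample standard deviation (rather than the population form) is an assumption; the product's exact normalization is not described here.

```python
from statistics import mean, stdev


def target_metrics(sample_cov, reference_covs):
    """Return (z_score, ratio) for one target region.

    sample_cov: normalized coverage of the sample of interest.
    reference_covs: normalized coverages of the reference samples
    for the same target (at least two values).
    """
    ref_mean = mean(reference_covs)
    ref_sd = stdev(reference_covs)  # sample standard deviation (assumption)
    z_score = (sample_cov - ref_mean) / ref_sd
    ratio = sample_cov / ref_mean
    return z_score, ratio


# A target where the sample has about half the reference coverage,
# which is the pattern expected for a heterozygous deletion.
z, ratio = target_metrics(50.0, [90.0, 100.0, 110.0, 100.0])
print(round(z, 2), ratio)
```

The Ratio conveys the magnitude of the coverage change, while the Z-score conveys how unusual that change is given the spread among the reference samples.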

Each metric is associated with three probability distributions, one for each type of CNV: Hom. Deletion, Het. Deletion, and Duplication.

Normal distributions are used for the deletion distributions, while log-normal distributions are used for the duplication distributions.

These parameters can be specified in the advanced tab, which contains the following options:

  • Z-score Parameters: Specify the parameters for the Z-score distributions.
  • Ratio Parameters: Specify the parameters for the Ratio distributions.
  • Utilize Variant Allele Frequency: Use Variant Allele Frequency when calling CNVs.
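The role of the distribution parameters can be sketched in Python for the Ratio metric. Everything numeric here is hypothetical: the means and standard deviations in `RATIO_MODELS` are illustrative placeholders, not the defaults shipped with the software, and comparing raw densities is a simplification of whatever probabilistic model the algorithm actually uses. The sketch only reflects the stated choice of normal distributions for deletions and a log-normal distribution for duplications.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std dev sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def lognormal_pdf(x, mu, sigma):
    """Density of a log-normal distribution; mu and sigma are on the log scale."""
    if x <= 0:
        return 0.0
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2 * math.pi))


# Hypothetical Ratio parameters, chosen only for illustration.
RATIO_MODELS = {
    "Hom. Deletion": lambda r: normal_pdf(r, 0.0, 0.1),
    "Het. Deletion": lambda r: normal_pdf(r, 0.5, 0.1),
    "Duplication": lambda r: lognormal_pdf(r, math.log(1.5), 0.15),
}


def ratio_densities(ratio):
    """Evaluate each CNV type's Ratio distribution at the observed ratio."""
    return {state: pdf(ratio) for state, pdf in RATIO_MODELS.items()}


def best_supported_cnv_type(ratio):
    """Return the CNV type whose distribution gives the highest density."""
    densities = ratio_densities(ratio)
    return max(densities, key=densities.get)


print(best_supported_cnv_type(0.5))
```

This sketch compares only the three CNV types and does not model the diploid state, so it should not be read as a complete classifier.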

Columns in the Sample Summary Table

Summary information about each sample is available in this table. These fields are computed across all of the CNVs called for the sample.

  • Sample Flags: QC warnings for the samples
    • High IQR: High interquartile range for Z-score and ratio. This flag indicates that there is high variance between targets for one or more of the evidence metrics.
    • Low Sample Mean Depth: Sample mean depth below 30.
    • Mismatch to reference samples: Match score indicates low similarity to control samples.
    • Mismatch to non-autosomal reference samples: Match score indicates low similarity to non-autosomal control samples.
    • Few Gender Matches: Not enough reference samples with matching gender to call X and Y CNVs.
  • Inferred Gender: Gender inferred from X chromosome coverage ratio
  • # CNV Events: Number of CNV events
  • # Flagged CNV Events: Number of flagged CNV events
  • # Unflagged CNV Events: Number of unflagged CNV events
  • # Hom Deletions: Number of homozygous deletion events
  • # Het Deletions: Number of heterozygous deletion events
  • # Duplications: Number of duplication events
  • Z-score IQR: Interquartile range of the Z-scores over all targets
  • Ratio IQR: Interquartile range of the ratios over all targets
  • Variants Considered: Variants considered for VAF content
  • Percent Difference: Average percent difference between sample and matched controls for autosomal regions
  • Reference Samples: Samples selected as matched controls
  • X Ratio: Ratio of X chromosome coverage to autosomal chromosome coverage
  • Non-Autosomal Percent Difference: Average percent difference between sample and matched controls for non-autosomal regions
  • Non-Autosomal Reference Samples: Samples selected as matched controls for non-autosomal regions
  • Karyotype: Chromosomal CNV information for this sample.
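Two of the summary columns above, the IQR fields and the X Ratio, are simple enough to sketch in Python. The helper names are hypothetical, the quantile method is an assumption, and the `cutoff` used to infer gender is an illustrative value rather than the threshold used by the software. The sketch relies only on the general expectation that a sample with one X chromosome shows roughly half the X-to-autosome coverage of a sample with two.

```python
from statistics import mean, quantiles


def iqr(values):
    """Interquartile range (Q3 - Q1) of per-target values."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


def x_ratio(x_coverages, autosomal_coverages):
    """Mean X-chromosome target coverage over mean autosomal target coverage."""
    return mean(x_coverages) / mean(autosomal_coverages)


def infer_gender(ratio, cutoff=0.75):
    """Infer gender from the X ratio using a hypothetical cutoff."""
    return "Female" if ratio >= cutoff else "Male"


z_scores = [-2.0, -1.0, 0.0, 1.0, 2.0]
print(iqr(z_scores))  # 2.0

ratio = x_ratio([50.0, 50.0], [100.0, 100.0])
print(ratio, infer_gender(ratio))  # 0.5 Male
```

A large Z-score or Ratio IQR means the per-target values are widely scattered, which is the situation the High IQR sample flag is meant to report.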

Columns of the Per-Sample CNV Event Calls Spreadsheets

If the CNV event tables were selected as an output for a fixed number of samples, one table will be created for each sample with the full CNV event details. These per-sample outputs can be useful for understanding the details of the events behind the CNV state table, but are less useful for downstream analysis.

  • Regions: Genomic coordinates (Chr: Start-Stop).
  • Type: The type of CNV called. Either a Gain or a Loss.
  • # Targets: Number of targets in the event.
  • # Samples: Number of samples in the event.
  • Span: The width of the event. Computed from the difference between the stop and start positions.
  • CNV State: State of the CNV event. Either Deletion, Het Deletion, Duplicate or CN LoH.
  • Flags: QC warnings for the event.
    • Low Controls Depth: Mean read depth over controls fell below threshold.
    • High Controls Variation: Variation coefficient exceeded threshold.
    • Within Regional IQR: Event is not significantly different from surrounding normal regions based on regional IQR.
    • Low Z Score: Event is not significantly different from surrounding normal regions based on regional IQR.
  • Avg Target Mean Depth: Average mean depth of the targets in this event as reported by Coverage Statistics.
  • Avg Z Score: Average Z-score of the event.
  • Avg Ratio: Average ratio of the event.
  • Variants Considered: Number of variants in the event considered for VAF evidence.
  • Supporting LoH Variants: Total number of variants within an LoH event supporting the called CNV state.
  • P-value: Probability that z-scores at least as extreme as those in the event would occur by chance in a diploid region.
  • Karyotype: Cytogenetic nomenclature for this event.
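The Regions and Span columns are related by simple arithmetic, which the following Python sketch illustrates. The helper names are hypothetical, and the exact string layout (`Chr: Start-Stop`, optionally with thousands separators) is assumed from the column description above; the coordinates in the example are made up for illustration.

```python
def parse_region(region):
    """Split a 'Chr: Start-Stop' string into (chromosome, start, stop)."""
    chrom, _, coords = region.partition(":")
    start, _, stop = coords.strip().partition("-")
    # Tolerate thousands separators such as 117,120,016 (assumption).
    return chrom.strip(), int(start.replace(",", "")), int(stop.replace(",", ""))


def span(region):
    """Width of the event: stop position minus start position."""
    _, start, stop = parse_region(region)
    return stop - start


region = "7: 117120016-117308718"
print(parse_region(region))  # ('7', 117120016, 117308718)
print(span(region))          # 188702
```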

Algorithm Details

The CNV Calling algorithm was developed based on a combination of methods in the existing CNV literature and novel methods developed by our engineers. The algorithm classifies each coverage region and uses control samples for coverage comparison.

Classification and event segmentation are performed using a probabilistic model that incorporates three evidence metrics: Z-score, ratio, and variant allele frequency (VAF). The Z-score measures the number of standard deviations from the reference sample mean, the ratio is the normalized mean for the sample of interest divided by the average normalized mean for the reference samples, and VAF is the allelic fraction at the variant locus. Using these metrics, the algorithm calls CNV state for each target region. Target regions are then merged to obtain contiguous CNV events.

Since these metrics can be noisy over very large regions, a segmentation algorithm is used to call large multi-gene and whole chromosome events. If a region contains many small CNV events, CNAM optimal segmentation is used to segment the region, and small events that share a segmented region are merged.
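The step in which per-target calls are merged into contiguous CNV events can be sketched in Python as follows. This is a simplified illustration, not the product's segmentation logic: it merges only runs of adjacent targets on the same chromosome that share the same non-diploid state, and it ignores the probabilistic model and the CNAM-based handling of large regions. The function name `merge_targets` and the tuple layout are assumptions.

```python
from itertools import groupby


def merge_targets(targets):
    """Merge adjacent targets with the same CNV state into events.

    targets: list of (chrom, start, stop, state) tuples, already sorted by
    chromosome and start position. Diploid runs produce no event.
    """
    events = []
    run_key = lambda t: (t[0], t[3])  # same chromosome and same state
    for (chrom, state), run in groupby(targets, key=run_key):
        run = list(run)
        if state == "Diploid":
            continue
        events.append({
            "chrom": chrom,
            "start": run[0][1],
            "stop": run[-1][2],
            "state": state,
            "n_targets": len(run),
        })
    return events


targets = [
    ("1", 100, 200, "Diploid"),
    ("1", 300, 400, "Het Deletion"),
    ("1", 500, 600, "Het Deletion"),
    ("1", 700, 800, "Diploid"),
    ("2", 100, 200, "Het Deletion"),
]
events = merge_targets(targets)
for event in events:
    print(event)
```

In this example the two adjacent heterozygous deletion targets on chromosome 1 become one two-target event, while the deletion on chromosome 2 stays separate because events do not cross chromosome boundaries.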