VarSeq Algorithms

VarSeq includes various algorithms to create new fields based on existing sources. All algorithms can be run by going to Add > Computed Data... from the File menu or the primary toolbar. See Navigating VarSeq.

Add Menu

Run various algorithms through this option

Algorithm Menu

List of available algorithms

These algorithms are classified into groups based on their domain and the grouping that is applied to the samples. The domains are Variants, Genes, and Samples. Three different sample groupings are defined for the Gene and Variant domains. The ‘Per Sample’ group will examine each sample individually and will create a new value for each gene or variant in the domain, for each sample. The ‘Per Trio’ group algorithms perform analysis like ‘Per Sample’ with the added context of a child, mother, and father. The Project/Cohort grouping examines all of the samples in aggregate, and will create a single value for each gene or variant in the domain.

Genotype Zygosity

This algorithm examines the genotypes for each sample and identifies them as one of the following:

  • Reference: Both alleles match the reference allele
  • Heterozygous: Two different alleles
  • Homozygous Variant: Both alleles are the same but are different from the reference allele.
  • Hemizygous: The single allele present is different from the reference allele.
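
As an illustration, the classification above can be sketched in Python. This is a simplified sketch only; the genotype representation and function name are assumptions, not VarSeq's implementation.

```python
def zygosity(alleles, ref):
    """Classify a called genotype against the reference allele.

    `alleles` holds the called allele bases for one sample (a single
    entry for hemizygous sites); illustrative, not VarSeq's API.
    """
    if len(alleles) == 1:
        # Single allele present (e.g. male X chromosome)
        return "Reference" if alleles[0] == ref else "Hemizygous"
    a, b = alleles
    if a == ref and b == ref:
        return "Reference"          # both alleles match the reference
    if a != b:
        return "Heterozygous"       # two different alleles
    return "Homozygous Variant"     # same allele, differs from reference
```
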

Requirements

Requires a genotype (GT) sample level field.

Output

Creates a zygosity field for each sample.

Frequency Aware Zygosity

This algorithm computes the zygosity of a sample by using the individual’s genotype combined with frequency information available in a variant frequency catalog such as ExAC or 1000 genomes. The zygosity is categorized as follows:

  • Homozygous Major: Two major alleles called (Wild Type)
  • Homozygous Minor: Two minor alleles called
  • Heterozygous: Two different alleles called
  • Hemizygous Major: A single major allele called on sex chromosome (Wild Type)
  • Hemizygous Minor: A single minor allele called on sex chromosome

The advantage of using frequency information when computing zygosity is that at certain positions the reference sequence actually contains the minor allele; the alternate allele frequency at these positions is often greater than 90%. We denote the most common allele as the major allele.

Note

At a site with multiple alternate alleles, the major allele could have a frequency less than 50% (e.g. Ref = 35%, Alt1 = 20%, Alt2 = 45%).
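
A minimal sketch of this classification, assuming the alternate allele frequencies from the catalog are available as a mapping. This is illustrative only, not VarSeq's implementation; the multi-allelic example from the note above is used below.

```python
def freq_aware_zygosity(alleles, ref, alt_freqs):
    """Classify a genotype using catalog frequencies.

    `alt_freqs` maps each alternate allele to its population frequency
    (e.g. from ExAC or 1000 Genomes); illustrative names only.
    """
    # The major allele is the most frequent allele at the site, which
    # may be an alternate allele rather than the reference.
    freqs = dict(alt_freqs)
    freqs[ref] = 1.0 - sum(alt_freqs.values())
    major = max(freqs, key=freqs.get)
    if len(alleles) == 1:
        # Single allele called (sex chromosome)
        return "Hemizygous Major" if alleles[0] == major else "Hemizygous Minor"
    a, b = alleles
    if a != b:
        return "Heterozygous"
    return "Homozygous Major" if a == major else "Homozygous Minor"
```

With Ref = A at 35%, Alt1 = C at 20%, and Alt2 = T at 45%, the major allele is T, so a T/T genotype is Homozygous Major while an A/A genotype is Homozygous Minor.
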

Requirements

Requires a genotype (GT) sample level field.

This algorithm requires first annotating variants using a variant source with an Alternate Allele Frequency field.

Output

Creates a categorical field denoting the frequency aware zygosity for each sample.

GT Style Genotype

This algorithm uses the numeric genotype field to create a representation with the allele bases.

Requirements

  • 0/1 Genotypes Field: The imported variants must have a numeric sample genotype field (0/1).

Options

  • Golden Helix ‘G_T’: This option will create a genotype field with the represented alleles separated with an underscore. The alleles are sorted lexicographically before they are joined.
  • VCF encoding ‘G/T’: This option will create a genotype field with the represented alleles separated by a forward slash (or a pipe ‘|’ for phased genotypes). The alleles will appear in the order that their indexes appear in the original numeric Genotype field.
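
The two encodings can be sketched as follows. The function and argument names are assumptions for illustration; index 0 refers to the reference allele and 1..n to the alternate alleles, as in VCF.

```python
def gt_style(numeric_gt, ref, alts, style="G_T"):
    """Convert a numeric genotype such as "0/1" to allele bases."""
    phased = "|" in numeric_gt
    sep = "|" if phased else "/"
    alleles = [ref] + alts
    bases = [alleles[int(i)] for i in numeric_gt.split(sep)]
    if style == "G_T":
        # Golden Helix style: lexicographic sort, underscore separator
        return "_".join(sorted(bases))
    # VCF style: original index order, slash (or pipe when phased)
    return sep.join(bases)
```
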

Output

A Genotype column is created containing a text representation of the genotype. This column will be appended to the variant sample fields.

Compute Fields

This computation algorithm, or rather data transformation, takes existing annotation fields and creates a new field by evaluating an expression on an existing source.

After clicking OK you will be prompted to select an annotation field source; this can include previously computed fields. Once the source is selected, you can click OK again.

This will open up the Expression Editor. Any number of expressions can be specified using all fields listed. A preview of the evaluated expression will appear in the table at the bottom of the expression editor dialog. For more information about the Expression Editor, please see Expression Editor.

Requirements

None

Output

A column will be created containing the computed output based on the specified expression.

Note

When computing expressions on Sample level fields the output will also be a sample level field. See Compute Fields Data Transformation Examples for specific examples.

Mendel Error

This algorithm computes the Mendel Error status for the child’s genotype. The status is categorized as follows:

  • Untransmitted: The allele was not inherited by the child.
  • Transmitted: The allele was inherited by the child.
  • de Novo Allele: The child’s genotype shows a single de Novo mutation that was not present in either parent.
  • MIE: Mendel Inheritance Error. The child’s genotype cannot be explained by Mendelian inheritance patterns and is more complex than a single de Novo mutation.

Missing alleles are considered to be reference.
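
A simplified sketch of this trio check for a single alternate allele at a site. The function name and genotype representation are assumptions, and the real implementation handles more cases; missing alleles should be passed as the reference base per the note above.

```python
from itertools import product

def mendel_status(alt, child, mother, father):
    """Classify the child's genotype with respect to one alternate allele.

    Each genotype is a pair of allele bases. Illustrative only.
    """
    # Mendelian-consistent: one allele from each parent explains the child
    consistent = any(sorted((m, f)) == sorted(child)
                     for m, f in product(mother, father))
    if consistent:
        return "Transmitted" if alt in child else "Untransmitted"
    # Alleles in the child seen in neither parent
    novel = [a for a in child if a not in mother + father]
    if len(novel) == 1:
        return "de Novo Allele"     # explainable by a single new mutation
    return "MIE"                    # more complex inheritance error
```
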

Requirements

Requires a genotype (GT) sample level field.

Requires that the sample have at least one parent.

Output

Creates a categorical column on each child sample denoting its Mendel Error status.

Variant Type

This algorithm categorizes the variant as one of the following:

  • SNP: Single nucleotide substitution
  • MNP: Multiple nucleotide substitution of same length
  • Insertion: Insertion of nucleotides against the reference
  • Deletion: Gap compared to the reference sequence
  • DeletionInsertion: Multiple nucleotides replaced with differing lengths
  • Complex: More than one variant type
  • Reference: Only a single allele
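
As a sketch, classifying a single REF/ALT pair might look like the following. This is an illustration under assumed semantics, not VarSeq's implementation; the Complex category arises at multi-allelic sites whose alternates fall into more than one type and is not handled here.

```python
def variant_type(ref, alt):
    """Classify one REF/ALT pair of sequence strings."""
    if alt == ref:
        return "Reference"
    if len(ref) == len(alt):
        # Same-length substitution: single or multiple nucleotides
        return "SNP" if len(ref) == 1 else "MNP"
    if len(alt) > len(ref) and alt.startswith(ref):
        return "Insertion"          # extra bases against the reference
    if len(ref) > len(alt) and ref.startswith(alt):
        return "Deletion"           # gap compared to the reference
    return "DeletionInsertion"      # replacement with differing lengths
```
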

Requirements

None

Output

A Variant Type column is created containing the variant categorization.

Count Alleles

This algorithm counts the number of alternate alleles in the genotype field across all of the samples.

Requirements

Requires a genotype (GT) sample level field.

Options

  • Sample Grouping: Optionally takes a categorical sample level field and counts the alleles for each category. You can add these fields during the import process or use a default field such as Affection Status.
  • Remove No-Calls Genotypes: By default, the # Alleles field includes no-call genotypes such as ./., so it will generally be twice the number of samples. If you select this option, no-calls will reduce this value and also change the computed Allele Frequencies to match. This may make sense in multi-sample calling pipelines, but beware that you may encounter high allele frequencies simply because a variant appears in only a few samples and was considered a No-Call in all others.
  • Output Sample Names: When selected, a new Sample Names field is created that lists the names of the samples containing a variant genotype when the number of samples with this condition passes the specified threshold.

Output

  • Allele Counts: Counts of each alternate allele for each site across all samples. In most cases, there is only a single alternate and so the count is the number of observations of this allele across all chromosomes of the samples.

    For example, a homozygous variant for a sample gets a count of 2, while a heterozygous genotype gets a count of 1.

  • Allele Frequencies: The Allele Counts divided by the total number of observed alleles (# Alleles). Missing genotypes are assumed to be bi-allelic, which adds 2 to the total.

  • # Alleles: Total number of observed alleles in called genotypes.

  • # Hets: Count of the number of heterozygous genotypes across all samples.

  • # HomoVar: Count of the number of homozygous (or hemizygous) non-reference called genotypes across all samples.

  • Sample Names: (Optional) The names of the samples containing a variant genotype (not reference or missing).
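
The core counting logic can be sketched as follows, with genotypes given as allele-index pairs (0 = reference) and None for a no-call. The field names mirror the output above, but the function itself is an illustrative assumption, not VarSeq's implementation.

```python
def count_alleles(genotypes, remove_no_calls=False):
    """Count alternate alleles across the samples at one site."""
    n_alleles, n_het, n_homovar = 0, 0, 0
    counts = {}
    for gt in genotypes:
        if gt is None:                  # ./. no-call
            if not remove_no_calls:
                n_alleles += 2          # missing genotypes assumed bi-allelic
            continue
        n_alleles += len(gt)
        for a in gt:
            if a != 0:
                counts[a] = counts.get(a, 0) + 1
        if len(set(gt)) > 1:
            n_het += 1                  # heterozygous genotype
        elif gt[0] != 0:
            n_homovar += 1              # homozygous (or hemizygous) non-reference
    freqs = {a: c / n_alleles for a, c in counts.items()}
    return {"Allele Counts": counts, "Allele Frequencies": freqs,
            "# Alleles": n_alleles, "# Hets": n_het, "# HomoVar": n_homovar}
```

For four samples 0/1, 1/1, ./., and 0/0, the default count is 3 alternate alleles out of 8 (frequency 0.375); with no-calls removed the denominator drops to 6 and the frequency rises to 0.5.
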

Annotate Transcripts

This algorithm annotates variants against overlapping transcripts. The algorithm produces a number of fields for each variant. These fields are documented in the field documentation produced by the computation.

Requirements

A gene source is needed to run the algorithm.

Options

  • Only annotate verified mRNA transcripts: If checked, only verified transcripts will be included.
  • Amino Acid Notation: Amino acids can be represented as either three letter or one letter abbreviations.
  • Splice Site Boundaries: The distances used to classify splice site boundaries can be adjusted as needed.
    • Splice Donor Distance: Default is 2 bp
    • Splice Acceptor Distance: Default is 2 bp
    • Splice Region Exonic Distance: Default is 3 bp
    • Splice Region Intronic Distance: Default is 8 bp
  • Preferred Transcript(s): A list of transcripts that should be preferred as the clinically relevant transcript.

Output

Three column groups will be created: first, a summary group, which collapses overlapping annotations to produce a clinically relevant annotation; second, a full annotation group, which details the interactions for each transcript-variant pair; and finally, a set of columns containing the information in the underlying source.

If multiple transcripts overlap the variant then the results will be joined together in a list for each field. If a variant is intergenic, non-applicable fields will be filled in with missing values.

Annotate Regions

This algorithm identifies variants that overlap features in the selected source. While adding an annotation source directly also identifies variants contained within features, it applies additional logic to match variants. This algorithm performs only the simple check of whether the variant's position is equal to, or contained within, a region in the annotation source.

Requirements

An annotation source to use for annotating the regions or variants.

Output

Columns for the selected source. If multiple features match the variant then the results will be joined together in a list for each field. If a variant does not have an overlapping feature the fields will be filled in with missing values.

Aggregate Compute Fields

This computation algorithm, or rather data transformation, takes existing fields and creates a new field by evaluating an expression on an existing source. Sample fields will be flattened and treated as lists at the site level.

After clicking OK you will be prompted to select field sources; multiple sources may be selected, including previously computed fields. Once the sources are selected, you can click OK again.

This will open up the Expression Editor. Any number of expressions can be specified using all fields listed. A preview of the evaluated expression will appear in the table at the bottom of the expression editor dialog. For more information about the Expression Editor, please see Expression Editor.

Requirements

None

Output

A column will be created containing the computed output based on the specified expression.

Note

When computing expressions on sample level fields the output will be a variant level field. See Aggregate Compute Fields Data Transformation Example for a specific example.

Sample PhoRank Gene Ranking

This algorithm ranks genes based on their relevance to user-specified phenotypes as defined by the GO and HPO biomedical ontologies. PhoRank is modeled on the Phevor algorithm.

Phevor assigns scores to ontology terms based on their proximity to the user-specified phenotypes. Nodes that are connected to a search term, either directly or through a shared gene relationship, are called seed nodes and are assigned an initial score of one. The algorithm propagates this score information through the ontologies, so that genes with high scores are more closely related to the specified phenotypes, while genes with low scores have little or no relation to the phenotypes.

We have modified the Phevor algorithm by assigning initial scores to seed nodes based on their similarity to the initial search terms. We have also modified Phevor’s propagation mechanism so that the score propagated from one node to another is weighted by the similarity of the two nodes. These modifications increase the scores of more specific nodes that are highly related to the search terms, while decreasing the scores of more general nodes with many neighbors.
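
As a toy sketch of the general idea of similarity-weighted score propagation (not the Phevor or PhoRank implementation, which is considerably more involved): seed nodes start at a score of one, and each round a node's score flows to its neighbors scaled by an edge similarity weight.

```python
def propagate(neighbors, similarity, seeds, rounds=2):
    """Toy similarity-weighted score propagation over an ontology graph.

    `neighbors` maps each node to its adjacent nodes, `similarity`
    gives a weight in [0, 1] per undirected edge, and `seeds` receive
    an initial score of 1. Illustrative only.
    """
    scores = {n: 0.0 for n in neighbors}
    scores.update({s: 1.0 for s in seeds})
    for _ in range(rounds):
        new = dict(scores)
        for node, adj in neighbors.items():
            for other in adj:
                # Score passed along an edge is damped by its similarity,
                # so distant or weakly related nodes end up with low scores.
                w = similarity[frozenset((node, other))]
                new[other] = max(new[other], scores[node] * w)
        scores = new
    return scores
```
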

Requirements

This algorithm requires first annotating and classifying variants using a gene annotation source.

After running PhoRank you will be prompted to select a Gene Names field to be used for gene ranking.

After clicking OK you will be prompted to enter a comma delimited list of HPO phenotype terms for each sample. Optionally the list of available phenotypes can be extended to include OMIM provided syndromes and phenotypes. The OMIM content add-on is required for this feature (see OMIM for further details).

Output

  • Gene Rank: Percentile rank of the specific gene for each sample.
  • Gene Score: The score of the gene computed by the ontology propagation algorithm for each sample.
  • Path: A shortest path from the gene to one of the specified phenotypes (there may be many paths to the phenotypes), for each sample.

Count Alleles By Gene

Note

This algorithm runs on the variants in the selected Filter Chain and not all variants originally imported.

Requirements

This algorithm requires first annotating and classifying variants using a gene annotation source.

After clicking OK you will be prompted to select a Gene Names field to group variant counts based on genes.

Output in the Gene Table

The fields in the gene table include sample specific fields for three different categories. The counts in each field are based on the grouping of unique values from the selected Gene Names field from the specified annotation source.

Fields include:

  • # Variants: The total number of variants for the current sample in each gene.
  • Allele Count: The total number of non-reference alleles for the current sample in each gene.
  • # Het: The total number of heterozygous variants for the current sample in each gene.
  • # Hemi: The total number of hemizygous variants for the current sample in each gene.
  • # HomoVar: The total number of homozygous variants for the current sample in each gene.
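
A sketch of the per-gene grouping for one sample, with genotypes as allele-index pairs (0 = reference). Field names mirror the list above; the function is an illustrative assumption, and the hemizygous count is omitted for brevity.

```python
from collections import defaultdict

def count_by_gene(variants):
    """Group one sample's variant genotypes by gene and tally counts.

    `variants` is a list of (gene_name, genotype) pairs.
    """
    stats = defaultdict(lambda: {"# Variants": 0, "Allele Count": 0,
                                 "# Het": 0, "# HomoVar": 0})
    for gene, gt in variants:
        alt = sum(1 for a in gt if a != 0)   # non-reference alleles
        if alt == 0:
            continue                         # reference genotype: not a variant
        s = stats[gene]
        s["# Variants"] += 1
        s["Allele Count"] += alt
        if len(set(gt)) > 1:
            s["# Het"] += 1
        else:
            s["# HomoVar"] += 1
    return dict(stats)
```
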

Output in the Filter Chain

A filter card is automatically created after the algorithm finishes running. This card is placed at the bottom of the filter chain, but can be moved by clicking and dragging the card to the desired location.

If moving the algorithm output card changes the input it will be necessary to rerun the algorithm. This will be indicated by an information icon on the filter card.

By default, the filter card created will correspond to the # Variants field, but it can be changed to other fields created by the algorithm. Additionally, right-clicking on the column header for the other fields produced will present an option to add an additional card for that field.

Match Gene List (Per Sample)

This algorithm determines matches for each sample between the gene annotations of each variant and a user imported list of gene or identifier symbols unique to each sample.

Note

This tool can also be used to match any string field with a list of string entries. For example matching a list of RS IDs to the Identifier field for your variants.

Requirements

Each sample must be paired with a gene list during the initial data import, see Import Sample Information from Text File.

This algorithm requires first annotating and classifying variants using a gene annotation source.

After clicking OK you will be prompted to select a Gene Names field to be used for matching.

After clicking OK a second time you will be prompted to select the sample Gene List field which contains the list of gene names that will be matched against.

Output

A sample specific boolean column indicating whether the gene annotation for each variant matches at least one of the values given.

Compound Het Detection

A compound heterozygous polymorphism refers to a child that has inherited two different heterozygous polymorphisms within the same gene, one from each parent. This could result in both copies of the gene being potentially affected.

This type of polymorphism should also alter the amino acid sequence, that is, be classified as a non-synonymous variant.

Note

This algorithm runs on the variants in the selected Filter Chain and not all variants originally imported.

Requirements

This algorithm requires first annotating and classifying variants using a gene annotation source.

After clicking OK you will be prompted to select a Gene Names field to group variants into genes for the Compound Heterozygous analysis.

Optionally, you may select an Allele Frequency Field to allow one variant to reach 5% frequency while the other is strictly less than 1%. This option requires first annotating against a variant frequency database. If you do not want to include an allele frequency requirement, click Skip, otherwise, select the field and click OK.

Finally you have the choice of changing the default Advanced Parameters, which include:

  • Allow de Novo het mutations to be considered: When a gene contains at least one inherited het as well as a de Novo mutation, it will be considered a compound het gene.
  • Allow duos (one missing parent): If a proband is het and only has one parent specified, this option assumes the missing parent’s genotype is either reference or het, opposite the non-missing parent. While this includes sites where both parents may be heterozygous, it provides useful candidate compound het genes.
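
The gene-level detection can be sketched as follows, assuming each of the proband's heterozygous variants has already been assigned an inheritance origin. This is a simplified illustration, not VarSeq's implementation; frequency thresholds and duo handling are omitted.

```python
from collections import defaultdict

def compound_het_genes(variants, allow_de_novo=False):
    """Return genes fitting the compound heterozygous model.

    `variants` is a list of (gene_name, inherited_from) pairs for the
    proband's het variants, with inherited_from in {"Father", "Mother",
    "de Novo"}. Illustrative names only.
    """
    by_gene = defaultdict(set)
    for gene, origin in variants:
        by_gene[gene].add(origin)
    hits = []
    for gene, origins in by_gene.items():
        # Classic model: at least one het from each parent
        if {"Father", "Mother"} <= origins:
            hits.append(gene)
        # Optional model: one inherited het plus one de Novo mutation
        elif (allow_de_novo and "de Novo" in origins
              and origins & {"Father", "Mother"}):
            hits.append(gene)
    return hits
```
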

Output in the Variant Table

There are several groups of output fields created by this algorithm.

Group by Genes

Grouping of unique values from the selected Gene Names field from the specified annotation source.

Fields include:

  • Gene Names: The set of unique gene names seen in all overlapping transcripts.

Compound Het Variants for Proband

Per-variant compound heterozygous algorithm analysis.

Fields include:

  • Compound Het?: Whether or not this variant is one of the two or more heterozygous genotypes in a gene following the Compound Heterozygous inheritance model.
  • Inherited From: Whether the variant was inherited from the Father or Mother, or de Novo (if the advanced option to consider de Novo was selected).

Compound Het Genes for Proband

Per-gene compound heterozygous algorithm analysis.

Fields include:

  • Has Compound Het?: Whether or not this gene has two or more heterozygous genotypes following the Compound Heterozygous inheritance model.
  • Inherited from Father: Number of heterozygous genotypes unambiguously inherited from the father.
  • Inherited from Mother: Number of heterozygous genotypes unambiguously inherited from the mother.
  • Inherited Total: Number of heterozygous genotypes unambiguously inherited from either the mother or father.
  • Hets in Both Parents: Number of heterozygous genotypes in the affected child and both parents. These are not counted toward a Compound Heterozygous model for a gene, but may be useful to rule out genes with a high level of background variation.
  • de Novo: Number of de Novo genotypes where both parents are reference. This field is only present if the option to consider de Novo mutations was selected.
  • Second Smallest Freq: If a frequency threshold was set to allow one of the two necessary compound heterozygous genotypes to be up to a more lenient threshold, this value is set to the larger frequency of the two rarest variants that construct a compound heterozygous gene.

Output in the Gene Table

The Compound Heterozygous algorithm also creates a split table view with a gene table on the left and a corresponding variant table on the right. The fields in the gene table include:

Group by Genes

See Group by Genes.

Compound Het Genes for Proband

Clicking on a row in the Gene table will bring up a list of variants in the right variant table. The fields displayed in this table will be the same fields that are visible in the main Variant table.

See Compound Het Genes for Proband for information on the fields.

Output in the Filter Chain

A filter card is automatically created after the algorithm finishes running. This card is placed at the bottom of the filter chain, but can be moved by clicking and dragging the card to the desired location.

If moving the Compound Het card changes the input it will be necessary to rerun the algorithm. This will be indicated by an information icon on the filter card.

Variant PhoRank Gene Ranking

This algorithm ranks genes based on their relevance to user-specified phenotypes as defined by the GO and HPO biomedical ontologies. PhoRank is modeled on the Phevor algorithm.

Phevor assigns scores to ontology terms based on their proximity to the user-specified phenotypes. Nodes that are connected to a search term, either directly or through a shared gene relationship, are called seed nodes and are assigned an initial score of one. The algorithm propagates this score information through the ontologies, so that genes with high scores are more closely related to the specified phenotypes, while genes with low scores have little or no relation to the phenotypes.

We have modified the Phevor algorithm by assigning initial scores to seed nodes based on their similarity to the initial search terms. We have also modified Phevor’s propagation mechanism so that the score propagated from one node to another is weighted by the similarity of the two nodes. These modifications increase the scores of more specific nodes that are highly related to the search terms, while decreasing the scores of more general nodes with many neighbors.

Requirements

This algorithm requires first annotating and classifying variants using a gene annotation source.

After running PhoRank you will be prompted to select a Gene Names field to be used for gene ranking.

After clicking OK you will be prompted to enter a comma delimited list of HPO phenotype terms, and a name for the phenotype. Optionally the list of available phenotypes can be extended to include OMIM provided syndromes and phenotypes. The OMIM content add-on is required for this feature (see OMIM for further details).

Output

  • Gene Rank: Percentile rank of the specific gene.
  • Gene Score: The score of the gene computed by the ontology propagation algorithm.
  • Path: A shortest path from the gene to one of the specified phenotypes (there may be many paths to the phenotypes).

Aggregate Filtered Variants

Note

This algorithm runs on the variants in the selected Filter Chain and not all variants originally imported.

Requirements

This algorithm requires first annotating and classifying variants using a gene annotation source.

After clicking OK you will be prompted to select a Gene Names field to group variant counts based on genes, and a grouping field. If no grouping is selected, counts will be created for all of the combined samples. Otherwise, the values in the selected Category field are used to define the sample sets.

Output in the Gene Table

Three different fields will be created for each sample set. A sample set is created for each category in the selected Category field, as well as for an ungrouped set which contains all of the samples. Each category will have the three following fields for each gene.

Fields include:

  • Unique Variant Count: The total number of unique variant sites that are present for this sample category.
  • Unique Sample Count: The total number of samples in this sample set which had variant sites in this gene.
  • Observed Variant Count: The total number of different variant sites observed in this gene across all samples.

Output in the Filter Chain

A filter card is automatically created after the algorithm finishes running. This card is placed at the bottom of the filter chain, but can be moved by clicking and dragging the card to the desired location.

If moving the algorithm output card changes the input it will be necessary to rerun the algorithm. This will be indicated by an information icon on the filter card.

By default the filter card created will correspond to the Unique Variant Count field, but can be changed to other fields created by the algorithm. Additionally, right-clicking on the column header for the other fields produced will present an option to add an additional card for that field.

Match Genes Linked to Phenotypes

This algorithm determines matches between the gene annotations of each variant and list of genes associated with a group of user-specified phenotypes.

Requirements

This algorithm requires first annotating and classifying variants using a gene annotation source.

After clicking OK you will be prompted to select a Gene Names field to be used for matching.

After clicking OK a second time you will be prompted to enter a delimited list of phenotypes to use for selecting a gene list.

Options

This dialog allows the user to specify the following options:

  • New Field Name: Name to be used for the newly created field.
  • Enhance with OMIM phenotypes: Specifies that OMIM terms may be used.
  • Gene Association: Determines how many nodes may be traversed in the ontology to reach a gene.
    • HPO gene association: The matched genes must be directly related to the term in the HPO ontology.
    • HPO +1 hop in GO: Genes will be matched if they share a neighbor with one of the entered phenotype terms in the ontology. This allows for genes indirectly related through GO terms to be included in the gene list.
  • Linked Genes: The computed set of gene names to match.

Output

A boolean column indicating whether the gene annotation for each variant matches at least one of the genes.

Match Gene List

This algorithm determines matches between the gene annotations of each variant and a user selected list of gene or identifier symbols.

Note

This tool can also be used to match any string field with a list of string entries. For example matching a list of RS IDs to the Identifier field for your variants.

Requirements

This algorithm requires first annotating and classifying variants using a gene annotation source.

After clicking OK you will be prompted to select a Gene Names field to be used for matching.

After clicking OK a second time you will be prompted to enter a delimited list of gene symbols or identifiers to use for matching against the field previously selected.

Output

A boolean column indicating whether the gene annotation for each variant matches at least one of the values given.

Targeted Region Coverage

Sample level coverage statistics allow for the computation of basic coverage information in defined regions from a corresponding BAM file. The total coverage as well as strand based coverage is computed from the quality filtered pileup depth for each region. Aggregate statistics are computed for each sample across all of the defined regions to provide a high level overview of the sample’s coverage.

Requirements

BAM File

Each sample must be paired with a BAM file during the initial data import; see Associate Sample BAM Files. Each BAM file should be unique to the sample and have a corresponding index file (.bai) adjacent to it in its file location.

Region File

The Region file is used to define the areas where coverage will be computed. Each region will generate its own record in the final output. A BED file or interval source may be used to define the regions; the file must be indexed.

Options

  • Additional Depth Threshold: The percentage of bases in each region is computed by default for depths of 1x, 20x, 100x, and 500x. An additional depth may be specified to augment these fields. The percentage of bases at this depth will be computed for each region.

Sample Coverage Output

The fields from the file used to define the regions will be included in addition to the fields that are computed by the coverage statistics algorithm.

  • Span: The width of the region. Computed from the difference between the stop and start positions.
  • Mean Depth: The mean coverage depth for all of the bases in the region.
  • Mean Forward Depth: The mean coverage depth for all of the bases in the region on the Forward Strand.
  • Mean Reverse Depth: The mean coverage depth for all of the bases in the region on the Reverse Strand.
  • Mean Filtered Depth: The mean coverage depth for reads that were filtered out of this region. The reads that are filtered have a poor mapping quality, indicating they may map to multiple regions.
  • Min Depth: The minimum total depth (forward depth + reverse depth) across the region pileup.
  • Min Forward Depth: The minimum depth on the forward strand across the region pileup.
  • Min Reverse Depth: The minimum depth on the reverse strand across the region pileup.
  • Max Depth: The maximum total depth (forward depth + reverse depth) across the region pileup.
  • Max Forward Depth: The maximum depth on the forward strand across the region pileup.
  • Max Reverse Depth: The maximum depth on the reverse strand across the region pileup.
  • % 1x: The percentage of bases with a coverage depth of at least 1 in the region.
  • % 20x: The percentage of bases with a coverage depth of at least 20 in the region.
  • % 100x: The percentage of bases with a coverage depth of at least 100 in the region.
  • % 500x: The percentage of bases with a coverage depth of at least 500 in the region.
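
Given per-base pileup depths for a region, these per-region statistics can be sketched as below. The function and field names are illustrative assumptions; strand-specific and filtered depths are omitted for brevity.

```python
def region_coverage(depths, thresholds=(1, 20, 100, 500)):
    """Summarize per-base quality-filtered pileup depths for one region."""
    span = len(depths)
    stats = {
        "Span": span,
        "Mean Depth": sum(depths) / span,
        "Min Depth": min(depths),
        "Max Depth": max(depths),
    }
    for t in thresholds:
        # Percentage of bases covered to at least depth t
        stats[f"% {t}x"] = 100.0 * sum(1 for d in depths if d >= t) / span
    return stats
```
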

Output of the Coverage Regions Table

The coverage statistics algorithm will generate a ‘Coverage Regions’ table view. This table will include records for all of the regions in the region file.

Searching the Regions Table

The regions table can be searched by right clicking on a column title and selecting search this column. This allows for the examination of coverage regions that fall above or below user defined thresholds for the field.

Variants by Region Table

This composite table view includes all of the regions that cover one or more variants from the filtered Variant table. The regions appear in the left hand table, and the corresponding variants in the right hand table. The variants that fall within each region can be viewed by changing the row selection in the region table.

Output in the Variant Table

Variants will be matched to any regions they fall within. The values for each of the matching regions will be listed in their respective fields which are appended to the Variant table.

Output in the Samples Table

Summary statistic fields are appended to the Samples Table. These fields provide summary information computed across all of the regions.

  • Sample Mean Depth: The average coverage of the sample over all of the regions. The average is weighted by the size of the regions to give the average depth over all of the bases that fall within each region.
  • Sample Mean Forward Depth: The average coverage of the sample over all of the regions on the Forward Strand. The average is weighted by the size of the regions to give the average depth over all of the bases that fall within each region.
  • Sample Mean Reverse Depth: The average coverage of the sample over all of the regions on the Reverse Strand. The average is weighted by the size of the regions to give the average depth over all of the bases that fall within each region.
  • Sample %1x: The percentage of bases in all of the regions with at least 1x coverage.
  • Sample %20x: The percentage of bases in all of the regions with at least 20x coverage.
  • Sample %100x: The percentage of bases in all of the regions with at least 100x coverage.
  • Sample %500x: The percentage of bases in all of the regions with at least 500x coverage.
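
The region-size weighting of the sample-level mean can be sketched as follows (an illustrative helper, not VarSeq's implementation):

```python
def sample_mean_depth(regions):
    """Compute the sample-level mean depth from per-region results.

    `regions` is a list of (span, mean_depth) pairs; weighting by span
    yields the average depth over all bases within the regions.
    """
    total_bases = sum(span for span, _ in regions)
    weighted = sum(span * depth for span, depth in regions)
    return weighted / total_bases
```

For example, a 100 bp region at 10x and a 300 bp region at 30x average to 25x, not the unweighted 20x.
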

Note

If an Additional Depth Threshold was specified a corresponding sample level field will also be computed.

Binned Region Coverage

Sample level coverage statistics allow for the computation of basic coverage information over fixed width bins from a corresponding BAM file. The total coverage as well as strand based coverage is computed from the quality filtered pileup depth for each region. Aggregate statistics are computed for each sample across all of the defined regions to provide a high level overview of the sample’s coverage.

Note

This algorithm is designed for the consistent coverage profile of WGS data. It is generally used as the input to the CNV Caller on Binned Regions.

Requirements

BAM File

Each sample must be paired with a BAM file during the initial data import; see Associate Sample BAM Files. Each BAM file should be unique to the sample and have a corresponding index file (.bai) adjacent to it in its file location.

Options

  • Bin Size: Defines the size in base pairs of the equally spaced regions over which coverage will be computed. Each bin will generate its own record in the final output.
  • Additional Depth Threshold: The percentage of bases in each region is computed by default for depths of 1x, 20x, 100x, and 500x. An additional depth may be specified to augment these fields. The percentage of bases at this depth will be computed for each region.
  • Masked Regions: The masked region file is used to specify regions to be excluded from coverage computation. A BED file or interval source may be used to define the regions; the file must be indexed.

Sample Coverage Output

The fields from the file used to define the regions will be included in addition to the fields that are computed by the coverage statistics algorithm.

  • Span: The width of the region. Computed from the difference between the stop and start positions.
  • Mean Depth: The mean coverage depth for all of the bases in the region.
  • Mean Forward Depth: The mean coverage depth for all of the bases in the region on the Forward Strand.
  • Mean Reverse Depth: The mean coverage depth for all of the bases in the region on the Reverse Strand.
  • Mean Filtered Depth: The mean coverage depth for reads that were filtered out of this region. The reads that are filtered have a poor mapping quality, indicating they may map to multiple regions.
  • Min Depth: The minimum total depth (forward depth + reverse depth) across the region pileup.
  • Min Forward Depth: The minimum depth on the forward strand pileup across the region pileup.
  • Min Reverse Depth: The minimum depth on the reverse strand pileup across the region pileup.
  • Max Depth: The maximum total depth (forward depth + reverse depth) across the region pileup.
  • Max Forward Depth: The maximum depth on the forward strand pileup across the region pileup.
  • Max Reverse Depth: The maximum depth on the reverse strand pileup across the region pileup.
  • % 1x: The percentage of bases with a coverage depth of at least 1 in the region.
  • % 20x: The percentage of bases with a coverage depth of at least 20 in the region.
  • % 100x: The percentage of bases with a coverage depth of at least 100 in the region.
  • % 500x: The percentage of bases with a coverage depth of at least 500 in the region.
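The per-region fields above can be computed from per-base pileup depths roughly as follows. This sketch assumes hypothetical per-base forward and reverse depth arrays for one region; it illustrates the statistics, not VarSeq's quality filtering:

```python
# Sketch of per-region coverage statistics from per-base depth arrays.
# depths_fwd/depths_rev are hypothetical per-base pileup depths for a region.

def region_stats(depths_fwd, depths_rev, thresholds=(1, 20, 100, 500)):
    total = [f + r for f, r in zip(depths_fwd, depths_rev)]
    n = len(total)
    stats = {
        "Mean Depth": sum(total) / n,
        "Min Depth": min(total),
        "Max Depth": max(total),
    }
    # Percentage of bases at or above each depth threshold.
    for t in thresholds:
        stats[f"% {t}x"] = 100.0 * sum(d >= t for d in total) / n
    return stats

s = region_stats([5, 10, 0], [5, 10, 1])  # totals: [10, 20, 1]
```

Strand-specific means follow the same pattern applied to one array at a time.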

Output of the Coverage Regions Table

The coverage statistics algorithm will generate a ‘Coverage Regions’ table view. This table will include records for all of the regions in the region file.

Searching the Regions Table

The regions table can be searched by right-clicking on a column title and selecting ‘Search this column’. This allows for the examination of coverage regions that fall above or below user-defined thresholds for the field.

Variants by Region Table

This composite table view includes all of the regions that cover one or more variants from the filtered Variant table. The regions appear in the left hand table, and the corresponding variants in the right hand table. The variants that fall within each region can be viewed by changing the row selection in the region table.

Output in the Variant Table

Variants will be matched to any regions they fall within. The values for each of the matching regions will be listed in their respective fields which are appended to the Variant table.

Output in the Samples Table

Summary statistic fields are appended to the Samples Table. These fields provide summary information computed across all of the regions.

  • Sample Mean Depth: The average coverage of the sample over all of the regions. The average is weighted by the size of the regions to give the average depth over all of the bases that fall within each region.
  • Sample Mean Forward Depth: The average coverage of the sample over all of the regions on the Forward Strand. The average is weighted by the size of the regions to give the average depth over all of the bases that fall within each region.
  • Sample Mean Reverse Depth: The average coverage of the sample over all of the regions on the Reverse Strand. The average is weighted by the size of the regions to give the average depth over all of the bases that fall within each region.
  • Sample %1x: The percentage of bases in all of the regions with at least 1x coverage.
  • Sample %20x: The percentage of bases in all of the regions with at least 20x coverage.
  • Sample %100x: The percentage of bases in all of the regions with at least 100x coverage.
  • Sample %500x: The percentage of bases in all of the regions with at least 500x coverage.

Note

If an Additional Depth Threshold was specified, a corresponding sample level field will also be computed.

Sample Statistics

Sample level statistics are computed for each sample over the called sites. This provides a high level view of the types of variants found for each sample and can be used to make quality control decisions.

Requirements

The project must have one or more samples imported.

Output in the Samples Table

The selected statistics will be appended to the Samples Table. These fields provide a summary of each sample’s variants.

TiTv Ratio

Counts of the two classes of single nucleotide variations: transitions and transversions.

  • Transition Count The number of transitions for each sample.
  • Transversion Count The number of transversions for each sample.
  • TiTv Ratio The ratio of transitions to transversions.
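The classification behind these counts can be sketched directly: transitions are purine-to-purine (A↔G) or pyrimidine-to-pyrimidine (C↔T) changes, and every other single-base change is a transversion. A minimal illustration:

```python
# Sketch of transition/transversion classification for SNVs.
# Transitions: A<->G, C<->T; all other single-base changes are transversions.

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref, alt):
    return (ref in PURINES and alt in PURINES) or \
           (ref in PYRIMIDINES and alt in PYRIMIDINES)

def titv(snvs):
    """snvs: list of (ref, alt) base pairs for one sample."""
    ti = sum(1 for ref, alt in snvs if is_transition(ref, alt))
    tv = len(snvs) - ti
    return ti, tv, (ti / tv if tv else None)

ti, tv, ratio = titv([("A", "G"), ("C", "T"), ("A", "C")])
# 2 transitions, 1 transversion -> ratio 2.0
```

A TiTv ratio far from the expected value (roughly 2 genome-wide, higher in coding regions) is a common quality-control red flag.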

Coding TiTv Ratio

When run after the Annotate Transcripts algorithm, filtering by exon regions can be selected. Counts of the two classes of single nucleotide variations, transitions, and transversions are computed for exonic regions.

  • Transition Count The number of transitions for each sample in exon features.
  • Transversion Count The number of transversions for each sample in exon features.
  • TiTv Ratio The ratio of transitions to transversions in exon features.

Variant Count

  • Var Count The total number of non-missing, non-reference alleles for each sample.

Coding Variant Count

When run after the Annotate Transcripts algorithm, filtering by exon regions can be selected. Counts are reported only for features in exon regions.

  • Var Count The total number of non-missing, non-reference alleles for each sample in exon features.

Singleton Count

The number of times that a sample has an alternate allele which is not found in any other sample at that site. To be counted, a sample may be either homozygous or heterozygous for the singleton alternate allele.

  • Singleton Count The number of singletons for each sample.
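The counting rule can be sketched as follows: an allele is a singleton when exactly one sample carries it at a site, regardless of whether that sample is heterozygous or homozygous for it. The genotype representation here is hypothetical:

```python
# Sketch of singleton counting. Each site maps sample name -> set of
# alternate alleles carried; the data model is illustrative only.

from collections import Counter

def singleton_counts(sites):
    counts = Counter()
    for site in sites:
        carriers = Counter()
        for sample, alts in site.items():
            for allele in alts:
                carriers[allele] += 1  # count carriers, not allele copies
        for sample, alts in site.items():
            counts[sample] += sum(1 for a in alts if carriers[a] == 1)
    return counts

c = singleton_counts([
    {"s1": {"A"}, "s2": {"A"}},   # allele shared by both: no singleton
    {"s1": {"T"}, "s2": set()},   # only s1 carries "T": one singleton
])
```

Counting carriers rather than allele copies is what makes a homozygous singleton count once.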

SNV Count

The number of single nucleotide variants (SNVs) for each sample.

  • SNV Het Count The number of heterozygous SNVs for each sample.
  • SNV Hom Count The number of homozygous SNVs for each sample.
  • SNV Het/Hom Ratio The ratio of heterozygous SNVs to homozygous SNVs.

Note

Hemizygous SNVs, multi-nucleotide polymorphisms (MNPs), and complex variants are not included in this calculation.

Indel Count

The number of insertions and deletions (Indels) for each sample.

  • Indel Het Count The number of heterozygous Indels for each sample.
  • Indel Hom Count The number of homozygous Indels for each sample.
  • Indel Het/Hom Ratio The ratio of heterozygous Indels to homozygous Indels.

Note

Hemizygous Indels are not included in this calculation.

Heterozygous Rate

The number, and ratio, of heterozygous genotypes for each sample.

  • Het Count The number of heterozygous genotypes for each sample.
  • Het Ratio The Het Count divided by the number of non-reference, non-missing genotypes for the sample.

Homozygous Rate

The number, and ratio, of homozygous genotypes for each sample.

  • Hom Count The number of homozygous genotypes for each sample.
  • Hom Ratio The Hom Count divided by the number of non-reference, non-missing genotypes for the sample.

Hemizygous Rate

The number, and ratio, of hemizygous genotypes for each sample.

  • Hemi Count The number of hemizygous genotypes for each sample.
  • Hemi Ratio The Hemi Count divided by the number of non-reference, non-missing genotypes for the sample.

Reference Rate

The number, and ratio, of reference genotypes for each sample.

  • Ref Count The number of reference genotypes for each sample.
  • Ref Ratio The Ref Count divided by the number of non-missing genotypes for the sample.

Note

It is important to remember that the number of reference calls made for a sample can change for gVCF files depending on the other samples it is imported with. A homozygous reference call is inserted for a given sample at any genomic site where another sample has a variant and the given sample has a covered region.
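The rate calculations above all follow the same pattern: count genotypes of a given class, then divide by the appropriate denominator (non-reference, non-missing genotypes for Het/Hom/Hemi; all non-missing genotypes for Ref). A minimal sketch using hypothetical VCF-style genotype strings:

```python
# Sketch of the per-sample genotype rate calculations. Genotype strings are
# hypothetical: "0/0" reference, "0/1" het, "1/1" hom variant, "1" hemizygous,
# "./." missing.

def genotype_rates(gts):
    non_missing = [g for g in gts if "." not in g]
    het = sum(1 for g in non_missing if len(set(g.split("/"))) > 1)
    hom = sum(1 for g in non_missing
              if "/" in g and len(set(g.split("/"))) == 1 and g != "0/0")
    hemi = sum(1 for g in non_missing if "/" not in g and g != "0")
    ref = sum(1 for g in non_missing if g == "0/0")
    non_ref = het + hom + hemi
    return {
        "Het Ratio": het / non_ref if non_ref else None,
        "Hom Ratio": hom / non_ref if non_ref else None,
        "Hemi Ratio": hemi / non_ref if non_ref else None,
        # Ref Ratio uses all non-missing genotypes as the denominator.
        "Ref Ratio": ref / len(non_missing) if non_missing else None,
    }

r = genotype_rates(["0/0", "0/1", "1/1", "1", "./."])
```

Note the differing denominators: the Het/Hom/Hemi ratios sum to 1 over non-reference genotypes, while the Ref Ratio is a fraction of all called genotypes.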

Call Rate

The ratio of genotypes which are non-missing for each sample to the total number of genomic sites in the project. Hemizygous called genotypes are treated as non-missing.

  • Called Genotypes Number of non-missing genotypes for each sample.
  • Call Rate Ratio of Called Genotypes to the total number of genomic sites in the project.

Gender Inference

By specifying a gender chromosome and a heterozygous threshold, the gender of a sample can be inferred from the heterozygous rate.

  • Gender Chromosome Het Ratio The ratio of the number of heterozygous variants in the specified gender chromosome to the total number of variants in the gender chromosome.
  • Inferred Gender The gender (Female or Male) of each sample.
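The inference rests on a simple observation: a sample with one X chromosome should show almost no heterozygous X-chromosome genotypes. A minimal sketch, where the threshold value is an illustrative assumption (VarSeq lets the user configure it):

```python
# Sketch of gender inference from the gender-chromosome het ratio.
# The 0.1 threshold is a hypothetical default, not VarSeq's value.

def infer_gender(het_count_x, var_count_x, het_threshold=0.1):
    """Samples with one X chromosome show almost no X heterozygotes."""
    if var_count_x == 0:
        return None  # no evidence either way
    ratio = het_count_x / var_count_x
    return "Female" if ratio > het_threshold else "Male"

g = infer_gender(het_count_x=40, var_count_x=100)  # ratio 0.4 -> Female
```

A mismatch between inferred and reported gender is a useful sample-swap check.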

Variant Type Count

The variant classification for each site at which the sample has a non-reference, non-missing genotype. The classification is made at the variant site level.

  • SNP Count The number of sites classified as single nucleotide polymorphisms where the sample has a non-missing, non-reference genotype.
  • MNP Count The number of sites classified as multi-nucleotide polymorphisms where the sample has a non-missing, non-reference genotype.
  • Ins Count The number of sites classified as insertions where the sample has a non-missing, non-reference genotype.
  • Del Count The number of sites classified as deletions where the sample has a non-missing, non-reference genotype.
  • DelIns Count The number of sites which have both insertion and deletion alleles where the sample has a non-missing, non-reference genotype.
  • Complex Count The number of sites classified as complex where the sample has a non-missing, non-reference genotype.

CNV Caller on Binned Regions

This algorithm uses sample level coverage statistics to detect copy number variations (CNV). This algorithm uses coverage data computed over fixed width bins and is tailored toward whole genome analysis. Each coverage bin is classified as either homozygous deletion, heterozygous deletion, diploid, or duplication.

Note

By using large bins, large CNV events can be detected accurately on extremely low read-depth WGS data. Even a sample with 0.02X coverage (~1 million reads) will be able to call events down to the million base-pair level, and be able to detect chromosomal aneuploidy events with high confidence.
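The note above can be motivated with back-of-the-envelope arithmetic: at a given coverage, a bin of a given size contains an expected number of reads, and under a Poisson counting assumption the relative noise falls off as one over the square root of that count. The 100 bp read length here is a hypothetical assumption for illustration:

```python
# Back-of-the-envelope sketch of why large bins rescue low-depth WGS.
# Read length and Poisson noise model are illustrative assumptions,
# not a VarSeq computation.

def expected_reads_per_bin(coverage, bin_size, read_length=100):
    return coverage * bin_size / read_length

def relative_noise(n_reads):
    """Poisson coefficient of variation: 1/sqrt(n)."""
    return n_reads ** -0.5

n = expected_reads_per_bin(0.02, 1_000_000)  # 0.02X, 1 Mb bins -> 200 reads
cv = relative_noise(n)  # ~7% relative noise
```

At ~7% noise per bin, the 50% coverage drop of a heterozygous deletion spanning several bins is easily separable from diploid regions.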

Reference samples are used to normalize the coverage data and statistics are reported to provide an overview of the evidence for each classification. This algorithm has been tested on shallow whole genome sequencing data and is capable of calling large cytogenetic events such as whole chromosome duplications.

With the addition of the “CNV Caller” add-on to your VarSeq license, you can add this algorithm to your interactive or automated VSPipeline-executed workflows.

Note

To add the CNV caller to your license of VarSeq contact info@goldenhelix.com.

Requirements

Binned Region Coverage must be computed prior to running this algorithm. For best results, we recommend at least 30 reference samples.

Options

The user may specify the following options:

  • Minimum Number of Reference Samples: Desired minimum number of reference samples to be selected.
  • Maximum Number of Reference Samples: The maximum number of reference samples to be selected.
  • Exclude reference samples with percent difference greater than: This option will filter reference samples with a percent difference above the specified value after a minimum of 10 samples have been selected.
  • Add samples to reference set: This option adds the current project’s samples to the reference set. Go to Tools > Open Folder > Reference Samples Folder to see all the samples that have been added to your reference set over time.
  • Reference Sample Folder: The folder containing the reference samples used to normalize the coverage data.
  • Controls average target mean depth below: Flags targets with average reference sample depth below the specified value.
  • Controls variation coefficient above: Flags targets for which the variation coefficient is above the specified value. A high variation coefficient indicates that there is extreme variation in reference sample coverage for the target region.

Output of the CNVs Table

The CNV Caller algorithm will generate a CNVs table view. This table will include records for all called CNV events.

  • Region: Genomic coordinates (Chr: Start-Stop)
  • # Targets: Number of targets in the event
  • # Samples: Number of samples in the event
  • Span: The width of the event. Computed from the difference between the stop and start positions.
  • CNV State: State of the CNV event. Either Deletion, Het Deletion, Duplication, or CN LoH.
  • Flags: QC warnings for the event.
    • Low Controls Depth: Mean read depth over controls fell below the specified threshold.
    • High Controls Variation: Variation coefficient exceeded threshold.
    • Within Regional IQR: Event is not significantly different from surrounding normal regions based on regional IQR.
  • Avg Target Mean Depth: Average mean depth of the targets in this event as reported by Coverage Statistics
  • Avg Z Score: Average Z-score of the event.
  • Avg Ratio: Average ratio of the event.
  • Variants Considered: Number of variants considered for VAF content.
  • Supporting LoH Variants: Total number of variants within an LoH event supporting the called CNV state.
  • p-value: Probability that z-scores at least as extreme as those in the event would occur by chance in a diploid region.
  • Karyotype: Cytogenetic nomenclature for this event.

Output in the Samples Table

Summary fields are appended to the Samples Table. These fields provide summary information computed across all of the CNVs.

  • Sample Flags: QC warnings for the samples
    • High IQR: High interquartile range for Z-score and ratio. This flag indicates that there is high variance between targets for one or more of the evidence metrics.
    • Low Sample Mean Depth: Sample mean depth below 30.
    • Mismatch to reference samples: Match score indicates low similarity to control samples.
    • Mismatch to non-autosomal reference samples: Match score indicates low similarity to non-autosomal control samples.
    • Few Gender Matches: Not enough reference samples with matching gender to call X and Y CNVs.
  • Inferred Gender: Gender inferred from X chromosome coverage ratio.
  • # CNV Events: Number of CNV events.
  • # Flagged CNV Events: Number of flagged CNV events.
  • # Unflagged CNV Events: Number of unflagged CNV events.
  • # Hom Deletions: Number of homozygous deletion events.
  • # Het Deletions: Number of heterozygous deletion events.
  • # Duplications: Number of duplication events.
  • Z-score IQR: Interquartile range of the Z-scores over all targets.
  • Ratio IQR: Interquartile range of the ratios over all targets.
  • Variants Considered: Variants considered for VAF content.
  • Percent Difference: Average percent difference between sample and matched controls for autosomal regions.
  • Reference Samples: Samples selected as matched controls.
  • X Ratio: Ratio of X chromosome coverage to autosomal chromosome coverage.
  • Non-Autosomal Percent Difference: Average percent difference between sample and matched controls for non-autosomal regions.
  • Non-Autosomal Reference Samples: Samples selected as matched controls for non-autosomal regions.
  • Karyotype: Cytogenetic nomenclature for this sample.

Output in the Coverage Regions Table

Target level CNV fields are appended to the Coverage Regions Table. These fields provide information computed across all coverage regions.

  • CNV State: State of the CNV call for this target. Either homozygous deletion, heterozygous deletion, diploid, or duplication.
  • Flags: QC flags for the target region.
    • Low Controls Depth: Mean read depth over controls fell below the specified threshold.
    • High Controls Variation: Variation coefficient exceeded threshold.
    • Within Regional IQR: Event is not significantly different from surrounding normal regions based on regional IQR.
    • Few Gender Matches: Not enough reference samples with matching gender to call X and Y CNVs.
  • Z Score: Z-score of the target. Computed as (normalized target depth - mean depth across controls) / standard deviation
  • Ratio: Ratio of normalized target depth over mean depth across controls
  • Variants Considered: Variants considered for VAF content.
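The Z Score and Ratio fields above can be sketched directly from their definitions, given a sample's normalized depth for a target and the matched reference samples' depths:

```python
# Sketch of the per-target Z-score and ratio evidence metrics.
# Input depths are hypothetical normalized values.

from statistics import mean, stdev

def target_evidence(sample_depth, control_depths):
    mu = mean(control_depths)
    sd = stdev(control_depths)
    return {
        "Z Score": (sample_depth - mu) / sd,  # std deviations from controls
        "Ratio": sample_depth / mu,           # fraction of control coverage
    }

e = target_evidence(50.0, [100.0, 95.0, 105.0])
# A het deletion is expected to show a Ratio near 0.5 and a strongly
# negative Z-score, as here.
```

The two metrics are complementary: the ratio estimates copy number directly, while the Z-score measures how unusual the depth is relative to control variability.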

Variants by CNVs Table

This composite table view includes all of the CNVs that cover one or more variants from the filtered Variant table. The CNVs appear in the left hand table, and the corresponding variants in the right hand table. The variants that fall within each CNV can be viewed by changing the row selection in the CNV table.

Output in the Variant Table

Variants will be matched to any CNVs they fall within. The values for each of the matching CNVs will be listed in their respective fields which are appended to the Variant table.

Algorithm Details

The CNV Calling algorithm was developed based on a combination of methods in the existing CNV literature and novel methods developed by our engineers. The algorithm classifies each coverage region and uses control samples for coverage comparison. Classification and event segmentation are performed using the CNAM optimal segmentation algorithm.

CNV Caller on Target Regions

This algorithm uses sample level coverage statistics to detect copy number variations (CNV). Each coverage region is classified as either homozygous deletion, heterozygous deletion, diploid, or duplication.

Reference samples are used to normalize the coverage data and statistics are reported to provide an overview of the evidence for each classification. This algorithm has been tested on gene panel data, as well as whole exome data, and is capable of calling events ranging from single exon deletions, to whole chromosome duplications. The minimum and maximum reference sample count can be configured, and, if you have a large number of control samples in your reference folder, we suggest increasing the maximum value.

With the addition of the “CNV Caller” add-on to your VarSeq license, you can add this algorithm to your interactive or automated VSPipeline-executed workflows.

Note

To add the CNV caller to your license of VarSeq contact info@goldenhelix.com.

Requirements

Coverage statistics must be computed prior to running this algorithm. For best results, we recommend at least 100x coverage and 30 reference samples.

Options

The first tab in the dialog allows the user to specify the following options:

  • Expected CNV Rate: The prior probability that a target is within a CNV event in the absence of evidence.
    • Not Common maps to a probability of 1E-10
    • Somewhat Common maps to a probability of 1E-6
    • Common maps to a probability of 1E-4
  • Minimum Number of Reference Samples: Desired minimum number of reference samples to be selected.
  • Maximum Number of Reference Samples: The maximum number of reference samples to be selected.
  • Exclude reference samples with percent difference greater than: This option will filter reference samples with a percent difference above the specified value after a minimum of 10 samples have been selected.
  • Add samples to reference set: This option adds the current project’s samples to the reference set. Go to Tools > Open Folder > Reference Samples Folder to see all the samples that have been added to your reference set over time.
  • Independently normalize non-autosomal targets: If this option is selected, non-autosomal targets will not be normalized using the autosomal targets, but will instead be normalized separately. This option should be used if few non-autosomal targets are present, or if the entire X or Y chromosomes are likely to be deleted or duplicated.
  • Reference Sample Folder: The folder containing the reference samples used to normalize the coverage data.
  • Controls average target mean depth below: Flags targets with average reference sample depth below the specified value.
  • Controls variation coefficient above: Flags targets for which the variation coefficient is above the specified value. A high variation coefficient indicates that there is extreme variation in reference sample coverage for the target region.
  • Blacklist Regions: The blacklist region file is used to specify regions to be excluded from the normalization process.

The CNV calling algorithm relies on probability distributions associated with both the Z-score and Ratio metrics. The Z-score for a target measures the number of standard deviations a sample’s coverage is from the mean reference sample coverage, while the Ratio is the target coverage divided by the mean reference sample coverage.

Each metric is associated with three probability distributions; one for each type of CNV: Hom. Deletion, Het. Deletion, and Duplication.

Normal distributions are used for the deletion distributions, while log-normal distributions are used for the duplication distributions.

These parameters can be specified in the advanced tab, which contains the following options:

  • Custom Z-score Parameters: Specify the parameters for the Z-score distributions.
  • Custom Ratio Parameters: Specify the parameters for the Ratio distributions.
  • Utilize Variant Allele Frequency: Use Variant Allele Frequency when calling CNVs.
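The normal/log-normal split described above can be illustrated by evaluating per-state likelihoods for an observed coverage ratio. All location and scale parameters below are illustrative placeholders, not VarSeq's defaults:

```python
# Sketch of per-state likelihood evaluation: normal densities for deletion
# states, a log-normal density for duplication. Parameters are hypothetical.

import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def lognormal_pdf(x, mu, sigma):
    if x <= 0:
        return 0.0
    return normal_pdf(math.log(x), mu, sigma) / x

def state_likelihoods(ratio):
    return {
        "Hom Deletion": normal_pdf(ratio, 0.0, 0.1),
        "Het Deletion": normal_pdf(ratio, 0.5, 0.1),
        "Duplication": lognormal_pdf(ratio, math.log(1.5), 0.1),
    }

lk = state_likelihoods(0.5)  # a ratio of 0.5 favors Het Deletion
```

The log-normal choice for duplications reflects that coverage ratios above 1 are right-skewed, while deletion ratios cluster symmetrically around their expected values.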

Output of the CNVs Table

The CNV Caller algorithm will generate a CNVs table view. This table will include records for all called CNV events.

  • Region: Genomic coordinates (Chr: Start-Stop)
  • # Targets: Number of targets in the event.
  • # Samples: Number of samples in the event.
  • Span: The width of the event. Computed from the difference between the stop and start positions.
  • CNV State: State of the CNV event. Either Deletion, Het Deletion, Duplication, or CN LoH.
  • Flags: QC warnings for the event.
    • Low Controls Depth: Mean read depth over controls fell below the specified threshold.
    • High Controls Variation: Variation coefficient exceeded threshold.
    • Within Regional IQR: Event is not significantly different from surrounding normal regions based on regional IQR.
    • Low Z Score: The event’s average Z-score is too low to strongly support the called state.
  • Avg Target Mean Depth: Average mean depth of the targets in this event as reported by Coverage Statistics.
  • Avg Z Score: Average Z-score of the event.
  • Avg Ratio: Average ratio of the event.
  • Variants Considered: Number of variants considered for VAF content.
  • Supporting LoH Variants: Total number of variants within an LoH event supporting the called CNV state.
  • p-value: Probability that z-scores at least as extreme as those in the event would occur by chance in a diploid region.
  • Karyotype: Cytogenetic nomenclature for this event.

Output in the Samples Table

Summary fields are appended to the Samples Table. These fields provide summary information computed across all of the CNVs.

  • Sample Flags: QC warnings for the samples
    • High IQR: High interquartile range for Z-score and ratio. This flag indicates that there is high variance between targets for one or more of the evidence metrics.
    • Low Sample Mean Depth: Sample mean depth below 30.
    • Mismatch to reference samples: Match score indicates low similarity to control samples.
    • Mismatch to non-autosomal reference samples: Match score indicates low similarity to non-autosomal control samples.
    • Few Gender Matches: Not enough reference samples with matching gender to call X and Y CNVs.
  • Inferred Gender: Gender inferred from X chromosome coverage ratio.
  • # CNV Events: Number of CNV events.
  • # Flagged CNV Events: Number of flagged CNV events.
  • # Unflagged CNV Events: Number of unflagged CNV events.
  • # Hom Deletions: Number of homozygous deletion events.
  • # Het Deletions: Number of heterozygous deletion events.
  • # Duplications: Number of duplication events.
  • Z-score IQR: Interquartile range of the Z-scores over all targets.
  • Ratio IQR: Interquartile range of the ratios over all targets.
  • Variants Considered: Variants considered for VAF content.
  • Percent Difference: Average percent difference between sample and matched controls for autosomal regions.
  • Reference Samples: Samples selected as matched controls.
  • X Ratio: Ratio of X chromosome coverage to autosomal chromosome coverage.
  • Non-Autosomal Percent Difference: Average percent difference between sample and matched controls for non-autosomal regions.
  • Non-Autosomal Reference Samples: Samples selected as matched controls for non-autosomal regions.
  • Karyotype: Chromosomal CNV information for this sample.

Output in the Coverage Regions Table

Target level CNV fields are appended to the Coverage Regions Table. These fields provide information computed across all coverage regions.

  • CNV State: State of the CNV call for this target. Either homozygous deletion, heterozygous deletion, diploid, or duplication.
  • Flags: QC flags for the target region.
    • Low Controls Depth: Mean read depth over controls fell below the specified threshold.
    • High Controls Variation: Variation coefficient exceeded threshold
    • Within Regional IQR: Event is not significantly different from surrounding normal regions based on regional IQR.
    • Few Gender Matches: Not enough reference samples with matching gender to call X and Y CNVs
  • Z Score: Z-score of the target. Computed as (normalized target depth - mean depth across controls) / standard deviation
  • Ratio: Ratio of normalized target depth over mean depth across controls
  • Variants Considered: Variants considered for VAF content.

Variants by CNVs Table

This composite table view includes all of the CNVs that cover one or more variants from the filtered Variant table. The CNVs appear in the left hand table, and the corresponding variants in the right hand table. The variants that fall within each CNV can be viewed by changing the row selection in the CNV table.

Output in the Variant Table

Variants will be matched to any CNVs they fall within. The values for each of the matching CNVs will be listed in their respective fields which are appended to the Variant table.

Algorithm Details

The CNV Calling algorithm was developed based on a combination of methods in the existing CNV literature and novel methods developed by our engineers. The algorithm classifies each coverage region and uses control samples for coverage comparison.

Classification and event segmentation are performed using a probabilistic model that incorporates three evidence metrics: Z-score, ratio, and variant allele frequency (VAF). The Z-score measures the number of standard deviations from the reference sample mean, the ratio is the normalized mean for the sample of interest divided by the average normalized mean for the reference samples, and VAF is the allelic fraction at the variant locus. Using these metrics, the algorithm calls CNV state for each target region. Target regions are then merged to obtain contiguous CNV events.

Since these metrics can be noisy over very large regions, a segmentation algorithm is used to call large multi-gene and whole chromosome events. If a region contains many small CNV events, CNAM optimal segmentation is used to segment the region, and small events that share a segmented region are merged.

LoH Caller

This algorithm calls Loss of Heterozygosity events based on variant allele frequency. This algorithm is modeled on the H3M2 algorithm.

H3M2 uses a heterogeneous Hidden Markov Model (HMM) that incorporates inter-marker distances to identify LoH events from allele frequency data. The model has three hidden states representing homozygous, non-homozygous, and trisomy states of the genome, while the observations are the variant allele frequency values at each position. The emission probability distributions are defined by two truncated Gaussian mixture models as follows:

P(AF_i| hom_i) = c_1N(0, F \cdot \sigma) + c_3N(1, F \cdot \sigma_3 )

P(AF_i| het_i) = c_1N(0, F \cdot \sigma) + c_2N(\frac{1}{2}, \sigma_2) + c_3N(1, F \cdot \sigma_3 )

P(AF_i| tri_i) = c_1N(0, F \cdot \sigma) + c_2N(\frac{1}{3}, \sigma_2) + c_3N(\frac{2}{3}, \sigma_3) + c_4N(1, F \cdot \sigma_4 )

where c_i is the weight of the i-th component of the mixture model, F is a parameter used to modulate the spread of the distributions, and AF_i denotes the allele frequency. The values hom_i, het_i, and tri_i denote homozygous, heterozygous, and trisomy states respectively.
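As a concrete illustration of the heterozygous-state emission mixture above, the density can be evaluated at a candidate allele frequency. For readability this sketch uses untruncated Gaussian densities, and the weights c_i, spread factor F, and sigmas are illustrative values, not the H3M2 defaults:

```python
# Sketch of the het-state emission probability P(AF | het) as a Gaussian
# mixture with components at 0, 1/2, and 1. Parameters are hypothetical.

import math

def gauss(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def p_af_given_het(af, c=(0.25, 0.5, 0.25), F=1.0, sigmas=(0.05, 0.1, 0.05)):
    c1, c2, c3 = c
    s1, s2, s3 = sigmas
    return (c1 * gauss(af, 0.0, F * s1)      # sequencing-error mass near 0
            + c2 * gauss(af, 0.5, s2)        # true heterozygotes near 1/2
            + c3 * gauss(af, 1.0, F * s3))   # error mass near 1

dense = p_af_given_het(0.5)    # VAF near 1/2: likely under the het state
sparse = p_af_given_het(0.25)  # VAF near 1/4: unlikely under the het state
```

The homozygous and trisomy mixtures follow the same pattern with their component means moved to {0, 1} and {0, 1/3, 2/3, 1} respectively.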

The transition probability for moving from a non-homozygous state to either homozygous or trisomy is given by:

P(het_i | hom_{i-1} \vee tri_{i-1}) = p_1 (1 - e^{- \frac{d_i}{d_{norm}}})

where p_1 denotes the likelihood of moving from the non-homozygous state, d_i is the genomic distance between position i and i-1, and d_{norm} is used to modulate the effect of the genomic distance on the transition probabilities. The probability for moving from a trisomy or homozygous state is defined similarly, using a parameter p_2 to specify the likelihood of the state change.
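The distance dependence of the transition probability above can be sketched directly; the parameter values here are illustrative, not the algorithm's defaults:

```python
# Sketch of the distance-dependent transition probability
# p1 * (1 - exp(-d / d_norm)). p1 and d_norm are hypothetical values.

import math

def p_leave_nonhom(d, p1=0.1, d_norm=100_000):
    """P(het -> hom or tri) grows with inter-marker distance d, toward p1."""
    return p1 * (1 - math.exp(-d / d_norm))

near = p_leave_nonhom(1_000)      # close markers: state change unlikely
far = p_leave_nonhom(1_000_000)   # distant markers: probability approaches p1
```

This is what makes the HMM "heterogeneous": closely spaced markers strongly favor staying in the current state, while sparse marker regions allow state changes more freely.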

Requirements

Requires Variant Allele Frequency (VAF) and Genotype Quality (GQ) sample level fields.

Output

The LoH Caller algorithm will generate a LoHs table view. This table will include records for all called events.

  • Region: Genomic coordinates (Chr: Start-Stop)
  • # Samples: Number of samples in the event
  • Span: The width of the event. Computed from the difference between the stop and start positions.
  • LoH Event: True if the LoH is present in the current sample.
  • LoH State: The state of the event; either LoH or Trisomy.
  • Variants Considered: Number of variants in the LoH
  • Percent in Expected State: Percentage of variants with VAF consistent with the called LoH state.