DNA Sequence Analysis

DNA-Seq Menu

The following DNA-Seq tools are available from the DNA-Seq Menu (see DNA-Seq Menu):

DNA-Seq Menu

  • Annotate and Filter Variants
  • PhoRank Gene Ranking
  • Set Genotypes to No-Call based on Additional Spreadsheets
  • Runs of Homozygosity for NGS
  • Calculate Alt Read Ratio
  • Activate Variants by Sample Genotypes
  • Filter Variants in Reference Sample Spreadsheet
  • Subset Informative Genotypes by Category
  • Variant Binning by Frequency Source
  • Classify by Inheritance Pattern
  • Find de Novo Candidate Variants
  • Score Variants by Recessive Model
  • Score Compound Heterozygous Regions
  • Score Variants by Dominant Model
  • Collapsing Methods
    • Count Variants per Gene
    • CMC with Hotelling T Squared Test
    • CMC with Regression
    • KBAC with Permutation Testing
    • KBAC with Regression
    • Mixed-Model KBAC
    • SKAT-O

DNA Sequence Analysis Overview

DNA Sequence Data

DNA sequence data resembles genotype data, but the genotypes are obtained from a variant file. Instead of having a marker for each probe from the array, genotypes are reported for probes where there is a unique variant site.

Merging multiple Variant Call Format (VCF) files joins datasets that may or may not have a variant at the same location in every file. As a result, the genotypes left missing by merging the probe lists must be filled in with a homozygous reference genotype, obtained from the reference base information for each probe.
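The fill-in step can be illustrated with a minimal Python sketch. The data layout (dictionaries keyed by chromosome and position), the underscore-separated genotype strings, and the function name merge_with_ref_fill are assumptions made for this example; they do not describe how the SVS importer is implemented.

```python
def merge_with_ref_fill(call_sets, ref_bases):
    """Merge per-sample genotype calls, filling absent sites with Ref_Ref.

    call_sets: {sample_name: {(chrom, pos): "A_G", ...}}   (hypothetical layout)
    ref_bases: {(chrom, pos): "A", ...} reference base for every site
    """
    # Union of all variant sites seen in any call set.
    all_sites = sorted(set().union(*(calls.keys() for calls in call_sets.values())))
    merged = {}
    for sample, calls in call_sets.items():
        merged[sample] = {}
        for site in all_sites:
            if site in calls:
                merged[sample][site] = calls[site]
            else:
                # Site was only called in another file: assume homozygous reference.
                ref = ref_bases[site]
                merged[sample][site] = f"{ref}_{ref}"
    return merged


call_sets = {
    "S1": {("1", 100): "A_G"},
    "S2": {("1", 250): "T_T"},
}
ref_bases = {("1", 100): "A", ("1", 250): "C"}
merged = merge_with_ref_fill(call_sets, ref_bases)
```

In this example sample S1 has no call at position 250, so it receives the homozygous reference genotype built from the reference base at that site.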

Thus DNA sequence data must consist of genotypes for variant sites (Single Nucleotide Variants (SNVs), Insertion/Deletion variants (“indels”), or Substitutions (SUBs)) in addition to a reference nucleotide base for each site. The variants may be rare or common; further analysis of this data takes into account the minor allele frequency (MAF), as defined by a reference dataset, to determine which.

DNA Sequence Analysis

In general, DNA sequence analysis is tasked with studying the effects of rare and low-frequency variants, as well as potentially common variants, on the phenotype of interest.

A good DNA-Seq analysis study starts with good study design and with controlling batch effects during the sequencing and variant calling steps of processing the data.

After obtaining the DNA sequence data, the next step is to categorize the relative frequency of the sequenced variants. If only a small number of samples are available, an external catalog such as dbSNP 129 can be used to classify variants.

Note

dbSNP 129 is often considered the last “clean” dbSNP build without many rare and unconfirmed variants from 1000 Genomes and other large scale projects added to the database.
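The frequency categorization described above can be sketched as a simple binning of the catalog MAF. The cut-offs of 1% and 5% and the category names are illustrative assumptions, not values prescribed by SVS; adjust them to the study design.

```python
def classify_by_maf(maf, rare_below=0.01, common_from=0.05):
    """Return a frequency category for a minor allele frequency from a catalog.

    maf is None when the variant is absent from the reference catalog.
    The default cut-offs are assumptions chosen for illustration only.
    """
    if maf is None:
        return "novel"
    if maf < rare_below:
        return "rare"
    if maf < common_from:
        return "low-frequency"
    return "common"


categories = [classify_by_maf(m) for m in (None, 0.002, 0.03, 0.2)]
```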

The next step is to search for regions where the heterogeneous burden of rare, low frequency and common variants is strongly correlated with the trait under study. Traditional association techniques used in GWAS studies do not have the power to detect associations with rare variants individually or provide tools for measuring their compound effect. Thus, analytic approaches for testing disease association with DNA-Seq data have been developed.

Note

These methods will also work with GWAS probes/markers as long as the reference nucleotide base or reference allele is known.

DNA-Seq Analysis

The Golden Helix SVS DNA-Seq Analysis workflows support DNA sequence analysis from VCF files. Other platforms can be supported as long as a text file is available with the genotype calls and reference nucleotide bases. Custom import scripts can be provided for these text files as needed.

Note

In general, the observed allele field from most marker maps is not formatted in the required “Ref/Alt allele” format, and this marker map field should not be used for sequence analysis.

An abbreviated workflow for DNA-Seq Analysis is enumerated below. More specific workflows can be found under the section for each collapsing analysis method.

  1. Import and/or prepare DNA-Seq variant genotype data.
  2. Import phenotypic data containing case/control status.
  3. Follow the steps for preparing the data for the chosen collapsing or analysis method.
  4. Join the phenotype with the DNA-Seq variant genotype data.
  5. Perform the collapsing and/or analysis method on a case/control phenotype from the joined spreadsheet.
  6. Explore the results.

Preparing Genotypes for DNA-Seq Analysis

In order to perform one of the collapsing methods or analysis workflows, a spreadsheet of genotype data with an applied marker map is needed. The marker map must contain either two separate fields, one with the reference nucleotide bases and one with the alternate nucleotide base(s), or a single field containing the “observed” pair of reference and alternate alleles, such as ‘A/G’ or ‘C/T’. It is assumed that the first allele in the observed allele field is the reference allele and the second allele is the alternate allele. Using the Import > Import VCFs and Variant Files tool will automatically import the necessary fields for analysis. See Import VCFs and Variant Files for more information.
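As a small illustration of the combined field convention, the sketch below splits an observed allele string into its reference and alternate parts, treating the first allele as the reference. The function name split_observed_alleles is hypothetical and is not part of SVS.

```python
def split_observed_alleles(observed):
    """Split an observed-allele string such as 'A/G' into (ref, [alts]).

    The first allele is taken to be the reference, as the marker map
    convention described in the text assumes.
    """
    alleles = [a.strip() for a in observed.split("/")]
    if len(alleles) < 2 or not all(alleles):
        raise ValueError(f"expected 'Ref/Alt' format, got {observed!r}")
    return alleles[0], alleles[1:]


ref, alts = split_observed_alleles("A/G")
```

A field that lists several alternates, such as 'C/T/G', yields the reference 'C' and the alternates ['T', 'G'].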

Also necessary for sequence analysis are annotation data sources downloaded and saved in the local annotation source folder. This allows any script that requires these sources to run much faster. Recommended sources for the human GRCh37_g1k assembly to have available for sequence analysis include but are not limited to:

  • RefSeqGenes105v2-NCBI_2013-08-20_GRCh_37_g1k_Homo_sapiens.tsf
  • dbSNP147-NCBI_2016-06-02_GRCh_37_g1k_Homo_sapiens.tsf
  • dbNSFPFunctionalPredictionsandScores3.0-GHI_2015-09-04_GRCh_37_Homo_sapiens.tsf
  • 1kGPhase3-VariantFrequencies5b-GHI_2015-08-18_GRCh_37_Homo_sapiens.tsf
  • ReferenceSequence-1000Genomes_GRCh_37_g1k_Homo_sapiens.tsf

Importing VCF Files

To import data in Variant Call Format (VCF), go to Import > Import VCFs and Variant Files. Multiple VCF files can and should be selected simultaneously so that the data is properly merged together. Numerous marker map fields can be selected for inclusion in the marker map file.

For more information see: Import VCFs and Variant Files.

Runs of Homozygosity for NGS

See Runs of Homozygosity Analysis for full details on this tool.

Annotate and Filter Variants

Annotating and filtering based on Data Sources can be useful for any genetic analysis workflow, including GWAS, CNV analysis, and sequencing analysis.

From a marker mapped spreadsheet go to DNA-Seq > Annotate and Filter Variants. Click the Add Track(s) button to select all data sources to be used for annotating and/or filtering variants.

Select Source Window

Please see the detailed descriptions below for each data source type selected.

Please see below for filtering options.

Annotate Gene Region

When selecting a gene source and setting the Gene Annotation Mode to Annotate Gene Region at the bottom of the source selection dialog, several output options are available.

Note

If your spreadsheet’s marker map does not contain either separate Reference and Alternate allele fields or a combined Ref/Alt field, then Annotate Gene Region will be the only mode available when selecting a gene source.

Annotate Gene Region Options

Options to Specify

Select whether to annotate only against verified mRNA transcripts. For output, a gene region report can be selected, optionally including intergenic variants.

Results

If selected, a gene region report will be created. If a filter was selected, an applied filters spreadsheet and a filtered variant spreadsheet will also be created.

Gene Region Report - A summary of gene region information for each variant.

  • Gene Names - The set of unique gene names seen in all overlapping transcripts.
  • Gene Region (Combined) - The highest priority region found among the variant transcript interactions. The region in which the variant is located. When a variant overlaps multiple regions, the region with the highest precedence is given. The order of precedence is defined as (from highest to lowest): exon, utr5, utr3, intron, intergenic.
  • Transcript Name (Clinically Relevant) - The transcript determined to be clinically relevant among those found for the variant’s interaction with the gene. If the variant affects multiple genes, one transcript from each gene is selected to be clinically relevant. The clinically relevant transcript is the transcript with the longest coding sequence among the transcripts with an LRG annotation. If a gene has no transcripts that are annotated with an LRG ID, the transcript with the longest coding sequence is used. For more information on how transcripts are chosen see our blog post on the topic.
  • Gene Region (Clinically Relevant) - The region in which the variant is located in the clinically relevant transcript.
  • Exon Number (Clinically Relevant) - The number of the exon in which this variant is found in the clinically relevant transcript. Exons are numbered in transcription order, starting with “1”.
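The two selection rules in the list above (region precedence and the choice of the clinically relevant transcript) can be sketched as follows. The dictionary keys name, cds_length and has_lrg are hypothetical, and the tie-breaking behaviour (first transcript wins) is an assumption, since the text does not specify it.

```python
# Precedence from highest to lowest, as listed in the report description.
REGION_PRECEDENCE = ["exon", "utr5", "utr3", "intron", "intergenic"]


def combined_region(regions):
    """Return the highest-precedence region among those a variant overlaps."""
    return min(regions, key=REGION_PRECEDENCE.index)


def clinically_relevant(transcripts):
    """Pick one transcript for a gene.

    Prefer transcripts with an LRG annotation; among the preferred pool,
    take the longest coding sequence. Falls back to all transcripts when
    none carries an LRG ID.
    """
    with_lrg = [t for t in transcripts if t["has_lrg"]]
    pool = with_lrg or transcripts
    return max(pool, key=lambda t: t["cds_length"])


transcripts = [
    {"name": "TX_A", "cds_length": 3000, "has_lrg": False},
    {"name": "TX_B", "cds_length": 2400, "has_lrg": True},
    {"name": "TX_C", "cds_length": 1800, "has_lrg": True},
]
chosen = clinically_relevant(transcripts)
region = combined_region(["intron", "utr3", "exon"])
```

Here TX_B is chosen even though TX_A has a longer coding sequence, because only LRG-annotated transcripts are considered when at least one exists.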

Annotate Variant Effect on Transcripts

When a gene source is selected, this tool allows for several annotation and output options. If only the variant report is selected, one spreadsheet is created. A second spreadsheet with a variant interaction report and (optionally) auxiliary transform fields can also be created.

Annotate Transcript Options

Options To Specify

Select whether to annotate only against verified mRNA transcripts. The splice site distances can also be specified. For output options, a variant report and/or a variant interaction report (with optional auxiliary transform fields and intergenic variants included) can be selected.

Results

If selected, a variant report and a variant interaction report (with optional auxiliary fields) will be created. If a filter was selected, an applied filters spreadsheet and a filtered variant spreadsheet will also be created.

Variant Report - A summary of the computed interactions between each variant and the overlapping transcripts. In the case of multiple interactions, the interaction with the highest priority will be listed.

  • Gene Names - The set of unique gene names seen in all overlapping transcripts.
  • Sequence Ontology (Combined) - The highest priority ontology found among the variant transcript interactions. The predicted interaction between the variant and transcript. The terms used are the standard feature descriptions given by The Sequence Ontology Project. When a variant can be characterized in multiple ways, the highest precedence description is given. The order of precedence is defined as (from highest to lowest): transcript_ablation, exon_loss_variant, stop_lost, stop_gained, initiator_codon_variant, frameshift_variant, splice_acceptor_variant, splice_donor_variant, disruptive_inframe_deletion, disruptive_inframe_insertion, inframe_deletion, inframe_insertion, 5_prime_UTR_premature_start_codon_gain_variant, missense_variant, synonymous_variant, stop_retained_variant, splice_region_variant, 3_prime_UTR_variant, 5_prime_UTR_variant, intron_variant, non_coding_exon_variant, intergenic_variant, unknown.
  • Gene Region (Combined) - The highest priority region found among the variant transcript interactions. The region in which the variant is located. When a variant overlaps multiple regions, the region with the highest precedence is given. The order of precedence is defined as (from highest to lowest): exon, utr5, utr3, intron, intergenic.
  • Effect (Combined) - The highest priority of the effect annotations found among the variant transcript interactions. The likely effect that the variant will have on the transcript’s product. The ontologies that correspond to each effect category can be found at the bottom of this page in the documentation for the effect category.
  • Transcript Name (Clinically Relevant) - The transcript determined to be clinically relevant among those found for the variant’s interaction with the gene. If the variant affects multiple genes, one transcript from each gene is selected to be clinically relevant. The clinically relevant transcript is the transcript with the longest coding sequence among the transcripts with an LRG annotation. If a gene has no transcripts that are annotated with an LRG ID, the transcript with the longest coding sequence is used. For more information on how transcripts are chosen see our blog post on the topic.
  • Exon Number (Clinically Relevant) - The number of the exon in which this variant is found in the clinically relevant transcript. Exons are numbered in transcription order, starting with “1”.
  • HGVS c. (Clinically Relevant) - The associated HGVS coding DNA notation with the clinically relevant transcript(s).
  • HGVS p. (Clinically Relevant) - The associated HGVS protein change notation with the clinically relevant transcript(s).
  • Sequence Ontology (Clinically Relevant) - The sequence ontology associated with the clinically relevant transcript(s).
  • Effect (Clinically Relevant) - The effect of the variant associated with the clinically relevant transcript(s).
  • Delta Length - The change in the length of the genomic sequence caused by the variant.
  • Length of Ref - The length of the reference allele.
  • Length of Alt - The length of the alternate allele.

Variant Interaction Report - These columns display the computed interactions between each variant and the overlapping transcripts at that location. Additionally, certain useful statistics and HGVS nomenclature have been calculated for each variant-transcript pair.

  • Ref/Alt - Reference and Alternate alleles in the format Ref/Alt(s)
  • Transcript Name - Transcript identifier.
  • Sequence Ontology - The predicted interaction between the variant and transcript. The terms used are the standard feature descriptions given by The Sequence Ontology Project. When a variant can be characterized in multiple ways, the highest precedence description is given. The order of precedence is defined as (from highest to lowest): transcript_ablation, exon_loss_variant, stop_lost, stop_gained, initiator_codon_variant, frameshift_variant, splice_acceptor_variant, splice_donor_variant, disruptive_inframe_deletion, disruptive_inframe_insertion, inframe_deletion, inframe_insertion, 5_prime_UTR_premature_start_codon_gain_variant, missense_variant, synonymous_variant, stop_retained_variant, splice_region_variant, 3_prime_UTR_variant, 5_prime_UTR_variant, intron_variant, non_coding_exon_variant, intergenic_variant, unknown.
  • Gene Region - The region in which the variant is located. When a variant overlaps multiple regions, the region with the highest precedence is given. The order of precedence is defined as (from highest to lowest): exon, utr5, utr3, intron, intergenic.
  • Effect - The likely effect that the variant will have on the transcript’s product. The ontologies that correspond to each effect category can be found at the bottom of this page in the documentation for the effect category.
  • Gene Name - The gene which overlaps the variant.
  • Exon Number - The number of the exon in which this variant is found. Exons are numbered in transcription order, starting with “1”. A number is provided for each transcript that the variant overlaps.
  • 5’ Exon Number - Only provided for intronic variants. The exon number on the 5’ side of the intronic variant. This will differ from Exon Number field only if the variant is closer to the exon on the 3’ side of the variant instead of the 5’ side.
  • # of Exons - The number of the exons in a transcript.
  • # AA Codons Changed - The number of codons changed by the variant. For frameshift and stopgain variants all codons following the variant are considered changed.
  • Ref AA - The reference codon(s) for the position of the variant. A maximum of 9 codons reported, with the remaining elided.
  • Alt AA - The changed codon(s) caused by the alternate allele. A maximum of 9 codons reported, with the remaining elided.
  • Dist to Exon Boundary - The distance from the variant to the nearest exonic boundary. Only exonic variants are considered.
  • Dist to Coding Start - The distance from the start of the variant to the coding start (cds start) of the transcript. Only variants in coding regions are considered.
  • % Dist of Tx - The distance of the variant from the start of the coding sequence normalized as a percentage. Only coding variants are considered.
  • HGVS g. - The HGVS notation for the variant using the genomic reference sequence.
  • HGVS c. - The HGVS notation for the variant using the coding reference sequence of the specified transcript.
  • HGVS p. - The HGVS notation for the variant using the sequence of the transcript’s protein product. If the calculated protein sequence does not end in a stop codon no notation is calculated.
  • AA Position - The codon position for the variant.
  • AA Length - The number of codons in the transcript.
  • CDS Length - The length of the transcript’s coding sequence including the stop codon.
  • Transcript Status - Due to errors in transcript mapping, some transcript sequences are unlikely to be biologically feasible. For variants in these transcripts, extra care should be taken to examine their context.

Annotate Variant

When a variant track is selected, options for allelic match mode and annotation mode can be selected.

Annotate Variant Options

Options To Specify

Allelic match mode can be set to:

  • Exact: Match records that contain exactly the observed alleles.
  • Closest: Match records that contain the observed alleles, but prefer the record that most closely matches the observed alleles. For example, if A/C is observed and both an A/C/G record and an A/C record match, the A/C record is preferred.
  • Contains Allele: Match records that contain the observed alleles, without requiring an exact match.
  • Keep All Records: Keep all records, regardless of the allelic match.
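The Exact, Closest and Keep All Records modes can be approximated with set comparisons, as in the sketch below. This is one interpretation of the descriptions above under stated assumptions: "closest" is taken to mean the containing record(s) with the fewest extra alleles, records are modeled as plain allele tuples, and the function name match_records is hypothetical. The Contains Allele mode is left out because its exact semantics are not spelled out here.

```python
def match_records(observed, records, mode="exact"):
    """Return the annotation records that match the observed alleles.

    observed: iterable of alleles, e.g. ("A", "C")
    records:  list of allele tuples from the annotation track (assumed layout)
    """
    observed = set(observed)
    if mode == "keep_all":
        return list(records)
    if mode == "exact":
        return [r for r in records if set(r) == observed]
    if mode == "closest":
        # Candidates must contain every observed allele.
        containing = [r for r in records if observed <= set(r)]
        if not containing:
            return []
        # Prefer the candidates with the fewest alleles beyond the observed ones.
        fewest_extra = min(len(set(r) - observed) for r in containing)
        return [r for r in containing if len(set(r) - observed) == fewest_extra]
    raise ValueError(f"unsupported mode: {mode!r}")


records = [("A", "C", "G"), ("A", "C")]
closest = match_records(("A", "C"), records, mode="closest")
```

With the A/C example from the list above, the closest match is the A/C record rather than the A/C/G record.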

Expect One Matching Variant can be checked or unchecked:

  • One to One Annotation (Checked): Expect that only a single record will match.
  • One to Many Annotation (Unchecked): Expect that multiple records will match.

Results

An optional annotation results spreadsheet can be created. If a filter was selected, an applied filters spreadsheet and a filtered variant spreadsheet will also be created.

Annotate Region

If an interval track is selected, options for annotation reports can be specified.

Annotate Region Options

Options To Specify

For output options, an annotation report can be selected.

Results

A single spreadsheet will be created. If filter options have been selected, an applied filters spreadsheet and a filtered variant spreadsheet will also be created.

Annotate dbNSFP

dbNSFP Annotation tracks will always be run with the default options for the track. Only output and filter options can be selected.

Options to Specify

For output options, optional Voting Report and Annotation Results can be selected.

Results

Voting Report - A tally of the independent functional predictions, reported as N Tolerated and N Damaging. Prediction algorithms include SIFT, PolyPhen2 HVAR, MutationTaster, MutationAssessor, FATHMM, and FATHMM MKL Coding. For a conservative filter, keep variants that have 0, 1 or perhaps 2 Tolerated predictions. A more conservative filter would keep variants with 3 or more Damaging predictions. Many variants do not have non-missing values for all of the algorithms.

Because a single variant may overlap multiple genes or transcripts with different coding frames, the prediction scores are the worst of the scores for each Ref Amino Acid/Alt Amino Acid observed across all overlapping transcripts.

  • N of 6 Predicted Tolerated - The number of independent functional prediction algorithms that had a non-missing value that are classified as Tolerated (Or Non-Functional or Polymorphism)
  • N of 6 Predicted Damaging - The number of independent functional prediction algorithms that had a non-missing value that are classified as Damaging (Or Functional, or Disease Causing)
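The two vote counts can be sketched as a tally over the non-missing predictions. The sketch assumes each algorithm's raw output has already been reduced to "T" (tolerated), "D" (damaging) or None (missing); that reduction, the dictionary keys and the function name tally_votes are assumptions for illustration.

```python
def tally_votes(predictions):
    """Count tolerated and damaging calls among non-missing predictions.

    predictions: {algorithm_name: "T" | "D" | None}
    Returns (n_tolerated, n_damaging).
    """
    called = [p for p in predictions.values() if p is not None]
    n_tolerated = sum(p == "T" for p in called)
    n_damaging = sum(p == "D" for p in called)
    return n_tolerated, n_damaging


predictions = {
    "SIFT": "D",
    "PolyPhen2 HVAR": "D",
    "MutationTaster": "D",
    "MutationAssessor": None,  # missing value: counted in neither column
    "FATHMM": "T",
    "FATHMM MKL Coding": "D",
}
n_tolerated, n_damaging = tally_votes(predictions)
```

Because missing predictions count toward neither column, the two counts can sum to less than 6, which is why filtering on the Tolerated count and filtering on the Damaging count are not equivalent.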

Annotate with Secure Sources

Annotation tracks under the Secure Annotations folder will always be run with the default options for the track. Only output and filter options can be selected.

See OMIM, CADD, and MedGenome OncoMD for information on expected output.

Filter Options

The Optional Filters section has a space for setting up filters that will filter out variants whose value for a particular field in the annotation results doesn’t meet all the selected filter criteria.

The following filtering modes are available; the mode offered depends on the data type of the annotation field.

Is Annotated Filter Options

This will filter out variants that aren’t annotated in the annotation track. A variant is treated as unannotated based on one of the following: the boolean field found in the annotation track (e.g. In dbNSFP? in the dbNSFP track), the track’s right ID (a missing ID indicates that there is no annotation for a particular variant), or whether all of the fields are missing.

String Comparison Filter Options

Multiple string values separated with either a comma, semicolon, or linebreak can be entered. If is one of is selected, at least one of the values must be present to keep a variant. If is not one of is selected, the variant will be filtered out if at least one of these values is present.
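The string comparison logic can be sketched as below. Exact, case-sensitive string equality is an assumption, as are the function names parse_filter_values and keep_by_string; the text does not state how SVS compares the values.

```python
import re


def parse_filter_values(text):
    """Split user input on commas, semicolons or line breaks."""
    return [v.strip() for v in re.split(r"[,;\n]", text) if v.strip()]


def keep_by_string(field_values, filter_values, mode="is one of"):
    """Decide whether a variant is kept.

    field_values:  values of the annotation field for this variant
    filter_values: values entered by the user
    """
    present = any(v in filter_values for v in field_values)
    if mode == "is one of":
        return present
    if mode == "is not one of":
        return not present
    raise ValueError(f"unsupported mode: {mode!r}")


wanted = parse_filter_values("missense_variant; stop_gained\nframeshift_variant")
kept = keep_by_string(["stop_gained"], wanted, mode="is one of")
```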

Categorical Filter Options

Possible values for this field are listed in the selection window. One or more values can be selected. If at least one value is present, the variant will be kept.

Numeric Filter Options

A lower and upper bound can be entered as real numbers to keep values that fall between (and including) these numbers. If just the lower bound is filled in, variants are kept if their value is greater than or equal to it. If just the upper bound is filled in, variants are kept if their value is less than or equal to it. To do an exact value match, enter the same value in both fields. If the Include Missings checkbox is selected, variants that have this field missing will automatically be kept; if it is unchecked, they will be removed.
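The numeric rules above reduce to an inclusive range check with optional bounds. The sketch below is a minimal illustration; the function name keep_by_numeric and the use of None for an empty bound or a missing value are assumptions.

```python
def keep_by_numeric(value, lower=None, upper=None, include_missing=False):
    """Return True if the variant is kept by an inclusive numeric filter.

    value: the annotation value, or None when the field is missing
    lower, upper: bounds, or None when the corresponding box is left empty
    """
    if value is None:
        return include_missing
    if lower is not None and value < lower:
        return False
    if upper is not None and value > upper:
        return False
    return True


results = [keep_by_numeric(v, lower=0.0, upper=0.01) for v in (0.0, 0.005, 0.01, 0.02, None)]
```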

Results

A spreadsheet with the title Applied Filters will be created. This will contain boolean fields for each filter with 1 if the variant passes the filter and 0 if it does not. The first column, Is Filtered? indicates whether a variant has passed all the filters.

A second spreadsheet, a filtered-down version of the original data set, will also be created. It will contain only variants that have passed all specified filters.

PhoRank Gene Ranking

This algorithm ranks genes based on their relevance to user-specified phenotypes as defined by the GO and HPO biomedical ontologies. PhoRank is modeled on the Phevor algorithm.

Phevor assigns scores to ontology terms based on their proximity to the user-specified phenotypes and the algorithm propagates this score information through the ontologies. Genes with high scores are more closely related to the specified phenotypes, while genes with low scores have little or no relation to the phenotypes.

For the PhoRank algorithm in SVS we have modified the Phevor algorithm by assigning initial scores to seed nodes based on their similarity to the initial search terms. We have also modified Phevor’s propagation mechanism so that the score propagated from one node to another is weighted by the similarity of the two nodes. These modifications increase the scores of more specific nodes that are highly related to the search terms, while decreasing the scores of more general nodes with many neighbors.

Options to Specify

From a marker mapped spreadsheet go to DNA-Seq > PhoRank Gene Ranking.

PhoRank Phenotype Dialog

At the top of the dialog, enter phenotype ontologies in a comma separated list; suggestions will be displayed in a drop down list. Phenotypes from the latest OMIM track can also be included by selecting the Enhance with OMIM phenotypes check box. This option requires that the add-on OMIM annotation source be included with your SVS license. The default spreadsheet name can be edited here too.

Note

To add the OMIM annotation source to your license for SVS contact your account manager or support@goldenhelix.com.

At the bottom of the dialog, select a gene track that will be used to create a list of genes that overlap variants from the input spreadsheet. This will be auto-filled with a default gene track based on your project’s default assembly; it can be changed by selecting Select Track. Then select whether to rank genes based only on variants overlapping verified mRNA transcripts.

Output

Two spreadsheets will be created. The name of the first spreadsheet will be the text in the Spreadsheet Name field from the options dialog with “- PhoRank Variant Output” appended. This spreadsheet will have a row for each variant (some will be repeated if there is more than one annotation record) with the PhoRank results:

  • Gene Name: The gene that overlaps the variant.
  • Ranks: Percentile rank of the specific gene.
  • Scores: The score of the gene computed by the ontology propagation algorithm.
  • Paths: A shortest path from the gene to one of the specified phenotypes (there may be many paths to the phenotypes).

The second spreadsheet will be organized by gene with the PhoRank results and an additional field with the number of markers that have an overlapping gene transcript. The spreadsheet name will be the text in the Spreadsheet Name field from the options dialog with “- PhoRank Gene Output” appended.

Set Genotypes to No-Call based on Additional Spreadsheets

Genotypic data will be set to no-call (?_?) given a filtering requirement that is applied to the corresponding numeric or categorical values in several additional spreadsheets. Spreadsheets must have at least one overlapping column header and row label.

Two different filtering mechanisms are available depending on the columns in the additional spreadsheet. If the most common column type in the additional spreadsheet is numeric (Real, Integer or Binary), then a threshold filtering mechanism appears in that spreadsheet’s corresponding tab in the second prompt. If the most common column type is categorical or genotypic, a list of values may be given that correspond to genotypes being set to no-call.

Requirements

  • A spreadsheet containing several genotypic columns. May also contain non-genotypic columns.
  • At least one additional spreadsheet containing several numeric (real, integer or binary) or categorical columns.
  • There must be at least one overlapping column header and row label across each spreadsheet pair (the genotypic spreadsheet and the one(s) to filter against).
  • Zygosity based filtering options will be available if a column oriented marker map is present. A reference field in the marker map is required.

Numeric Filtering

If the most common column type in the additional spreadsheet is numeric, the user must enter the following values in the second prompt.

  • Uniform or Zygosity Based Filtering: Threshold values can be set for either all genotypes (uniform) or separate thresholds can be set for Ref_Ref, Ref_Alt, and Alt_Alt.
  • By Threshold or by Range: Either select to filter by one threshold value, or filter by a range of values. The same options are available in zygosity based filtering.
  • Missing numeric values: If the corresponding numeric value is missing, the genotype can either be set to missing or left as the current genotype.

Threshold Filter in Set Genotypes to No-Call

String Filtering

If the most common column type in the additional spreadsheet is categorical or genotypic, the user must enter the following values in the second prompt.

  • Uniform or Zygosity Based Filtering: lists can be specified for either all genotypes (uniform) or separate lists can be specified for Ref_Ref, Ref_Alt, and Alt_Alt.
  • Comma-separated List: List of values in the categorical spreadsheet that correspond to values in the genotype spreadsheet that you would like to set to no-call. Several values can be specified in the comma-separated list.
  • Missing categorical values: If the corresponding categorical value is missing, the genotype can either be set to missing or left as the current genotype.

Comma-separated List Filter in Set Genotypes to No-Call

Zygosity Based Filtering

If the marker map attached to the genotype spreadsheet contains a Reference Allele field, zygosity based filtering is available. This filtering tool allows the user to specify different filtering criteria for Ref_Ref, Ref_Alt and Alt_Alt calls.

Different threshold values, or lists when string filtering, can be selected for Ref_Ref, Ref_Alt, and Alt_Alt. For numeric filtering, either a single threshold or a range can be selected, just as in the uniform filtering options.
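Zygosity based threshold filtering can be sketched as classifying each genotype against the reference allele and then applying the threshold for that class. Several details here are assumptions made for the example: genotypes are written as underscore-separated allele pairs, a genotype with two different alternate alleles is treated as Alt_Alt, and a genotype is set to no-call when its numeric value falls below the threshold. The actual dialog lets the user choose the comparison, so the direction shown is only one possibility.

```python
NO_CALL = "?_?"


def zygosity(genotype, ref):
    """Classify a genotype such as 'A_G' relative to the reference allele."""
    a, b = genotype.split("_")
    if "?" in (a, b):
        return None  # already a no-call
    n_ref = (a == ref) + (b == ref)
    return {2: "Ref_Ref", 1: "Ref_Alt", 0: "Alt_Alt"}[n_ref]


def filter_genotype(genotype, ref, value, min_value, missing_to_no_call=False):
    """Return the genotype, or NO_CALL if its value is below the class threshold.

    min_value: {"Ref_Ref": x, "Ref_Alt": y, "Alt_Alt": z}  (assumed direction)
    """
    zyg = zygosity(genotype, ref)
    if zyg is None:
        return genotype
    if value is None:
        return NO_CALL if missing_to_no_call else genotype
    return NO_CALL if value < min_value[zyg] else genotype


thresholds = {"Ref_Ref": 10, "Ref_Alt": 8, "Alt_Alt": 4}
het = filter_genotype("A_G", "A", 5, thresholds)
hom_alt = filter_genotype("G_G", "A", 5, thresholds)
```

With the same value of 5, the heterozygous call is set to no-call while the homozygous alternate call is kept, because each zygosity class has its own threshold.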

Zygosity based Filter in Set Genotypes to No-Call

Calculate Alt Read Ratio

Variant Call Format (VCF) files often include an Allelic Depth (AD) field. This tool supports computation of the Alt Read Ratio where the AD field contains a comma separated list of allelic depths. It is assumed that the first value is always the reference allele depth, followed by depths for each alternate allele.

This specification was based on the output from GATK, see GATK AD Specification for more information on the supported format.

The alternate allele ratio can be used to filter variant genotype calls down to only calls with a large fraction of reads supporting the alternate allele.

Note

The GATK documentation for the AD field indicates that an allele ratio calculated from the allelic depths should not be used for filtering, because it can include reads that were filtered before calling genotypes. Please keep this in mind when using the Alt Read Ratio spreadsheet for filtering.

Requirements

  • A mapped spreadsheet containing categorical columns, and columns either consisting of missing values or comma separated lists of allelic depths.
  • Originally designed for the Allelic Depth (AD) field from a VCF file.

Method

If there is data in a cell corresponding to allelic read depths, the first value is always the depth of the reference allele. After the first value are read depths for one or more alternate alleles.

The following cases are handled with their corresponding formulas:

  • List of two values (e.g. “50,10”): The reference allele depth is 50, the alternate allele depth is 10.

    \text{Alt Read Ratio} = \frac{\text{Alt AD}}{\text{Alt AD} + \text{Ref AD}}

  • List of three values where there is a non-zero ref allele depth (e.g. “10,50,0” or “10,0,50” or “10,4,50”): The reference allele depth is 10, there are one or more alternate alleles that have non-zero read depths. Only the alternate allele with the maximum depth is used for the ratio computation.

    \text{Alt Read Ratio} = \frac{\max\{A_1,A_2\}}{\max\{A_1,A_2\} + \text{Ref AD}}

  • List of three values where the ref allele depth is zero (e.g. “0,10,50”): The reference allele depth is 0 and there are one or more alternate alleles that have non-zero read depths. The alternate allele with the maximum depth is used for the numerator. The total of both alternate allelic depths is used for the denominator.

    \text{Alt Read Ratio} = \frac{\max\{A_1,A_2\}}{A_1 + A_2}

  • List of more than three values (three alternates possible): Alt read ratio value set to missing and the count of the number of cells with too many alternate alleles is incremented.
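The cases above can be collected into one function, shown here as a minimal sketch. Returning None for missing, non-numeric or single-value cells, and for cells where the denominator would be zero, is an assumption; the text only specifies that lists of more than three values are set to missing.

```python
def alt_read_ratio(ad):
    """Compute the Alt Read Ratio from an AD string such as '50,10'.

    Returns None when no ratio can be computed.
    """
    if ad is None:
        return None
    try:
        depths = [int(x) for x in ad.split(",")]
    except ValueError:
        return None  # missing or non-numeric cell (assumed handling)

    if len(depths) == 2:
        ref, alt = depths
        denom = alt + ref
        return alt / denom if denom else None

    if len(depths) == 3:
        ref, a1, a2 = depths
        top = max(a1, a2)
        # With no reference reads, both alternates form the denominator.
        denom = (a1 + a2) if ref == 0 else (top + ref)
        return top / denom if denom else None

    # One value (no alternate) or more than three values: no ratio.
    return None


ratios = {ad: alt_read_ratio(ad) for ad in ("50,10", "10,4,50", "0,10,50", "1,2,3,4")}
```

For "10,4,50" the smaller alternate depth of 4 is ignored, giving 50 / (50 + 10), while for "0,10,50" both alternates contribute to the denominator, giving 50 / (10 + 50).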

Output

The resulting spreadsheet contains either missing values or the alternate read ratio as computed above. The final spreadsheet is of the same dimensions as the original spreadsheet.

In the node change log for the resulting spreadsheet, the number of cells where only alternate allele counts were used is reported as well as the number of cells where there were more than two alternate alleles and thus no ratio was computed.

Activate Variants by Sample Genotypes

This tool examines variant data and inactivates genotypic columns that do not follow the specified genotypic patterns for the selected samples. The spreadsheet must contain mapped genotypic columns and the marker map must contain a reference allele field.

Optionally a dependent (binary or categorical) column can be set that allows the user to select variants that all samples in a given category must pass to remain active.

An example use case could be filtering related individuals, such as a trio or small family, so that only variants that follow the specified inheritance pattern remain active.

Options

  • Reference Field: Marker map field containing reference allele.

  • Sample Genotype Pattern: Check genotype pattern for at least one sample or group of samples to activate variants.

    • If no dependent column is selected

    Activate Variants by Sample Genotypes Window

    • If dependent column is selected

    Activate Variants by Sample Genotypes with Dependent Column

Results

Any variants that do not fit the genotypic pattern selected will be inactivated in the original genotype spreadsheet.

Filter Variants in Reference Sample Spreadsheet

This function inactivates uninformative markers as determined by a second Reference Sample spreadsheet. A marker is considered uninformative if its allele set or genotype set (depending on the user-specified option) is a subset of the alleles/genotypes found in the reference spreadsheet’s corresponding marker at the same chromosome and position.

This function requires a column-oriented, marker-mapped spreadsheet to be filtered and a column-oriented, marker-mapped reference spreadsheet with a reference allele field included in the marker map. The user can choose to filter by presence of alleles in the reference spreadsheet or by presence of genotypes in the reference spreadsheet.

The reference allele will always be included in the set of alleles found at that chromosome and position. If multiple columns are found at the same chromosome and position, the allele sets are combined into one unique set that includes all alleles found in all columns as well as all reference alleles.

The function will pass through each column in the spreadsheet and determine whether or not the marker is informative based on the alleles present (and will not consider the reference alleles in this spreadsheet). If multiple columns exist at the same chromosome and position, each will be considered separately.
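The allele-based variant of this filter can be illustrated with a short Python sketch. It is not the SVS implementation: the function names and the representation of a genotype as a tuple of two allele strings (with `None` for a missing genotype) are assumptions made for the example, and only the allele option is shown.

```python
def reference_allele_set(ref_allele, reference_genotypes):
    """Combine the reference allele with every allele seen in the
    reference spreadsheet at one chromosome and position."""
    alleles = {ref_allele}
    for genotype in reference_genotypes:
        if genotype is not None:
            alleles.update(genotype)
    return alleles


def is_informative(genotypes, reference_alleles):
    """A marker is informative if it carries at least one allele that is
    absent from the reference allele set (i.e. it is not a subset)."""
    observed = set()
    for genotype in genotypes:
        if genotype is not None:
            observed.update(genotype)
    return not observed <= reference_alleles
```

With a reference set built from reference allele "A" and one reference-sample genotype ("A", "G"), a marker whose samples carry only A and G is uninformative, while a marker with an ("A", "T") genotype stays active.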

Subset Informative Genotypes by Category

This tool scans genotypic columns to find informative genotypes defined by having at least one non-missing, non-reference allele. Informative genotype column sets are found for each unique category in a user-defined categorical column.

This tool requires a spreadsheet with several mapped genotypic columns. The marker map must contain a reference allele field and may contain a categorical column. The row labels may also be used as the categorical column.

N output spreadsheets are created representing N unique categories. The genotypic columns included in each spreadsheet contain at least one non-missing, non-reference allele over all rows within the category. If no informative columns are found, the output spreadsheet is not created.
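A minimal Python sketch of this per-category scan follows. The function name and the data layout (a list of row categories, a dict of genotype columns holding allele tuples or `None`, and a dict of reference alleles) are hypothetical and chosen only to illustrate the rule stated above.

```python
from collections import defaultdict


def informative_columns_by_category(categories, columns, ref_alleles):
    """Return {category: [column names]} keeping only columns with at least
    one non-missing, non-reference allele among that category's rows."""
    rows_by_category = defaultdict(list)
    for row_index, category in enumerate(categories):
        rows_by_category[category].append(row_index)

    result = {}
    for category, rows in rows_by_category.items():
        keep = []
        for name, genotypes in columns.items():
            ref = ref_alleles[name]
            if any(
                genotypes[i] is not None
                and any(allele != ref for allele in genotypes[i])
                for i in rows
            ):
                keep.append(name)
        if keep:  # no output is created for a category with no informative columns
            result[category] = keep
    return result
```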

Variant Binning by Frequency Source

This feature creates frequency bins based on user-defined thresholds and an external reference population provided through a probe annotation track.

For each genotypic marker, the alternate allele frequency or minor allele frequency (MAF) of a reference population, as specified in a probe annotation track, is used to assign the marker to its frequency bin. This helps to identify rare variants in a SNP dataset when there are not enough samples to calculate in-sample MAF.

This function can be accessed by going to DNA-Seq > Variant Binning by Frequency Source.

Options to specify:

  • Select the probe track with an MAF field, such as:
    • NHLBI_ESP6500SI-V2_Exomes-Variant_Frequencies-2013_03_22_GHI_GRCh_37_Homo_sapiens.idf
  • Select the bins by specifying the MAF thresholds (in ascending order). The minimum threshold value is 0.0 and the maximum value is 0.5. At least one bin must be specified.

Results:

A marker mapped filtering results spreadsheet is created as a child of the original spreadsheet. Markers not in the probe track are assigned an MAF bin value of 0, since the variant is rare enough not to be in the reference population. This spreadsheet contains the following columns (see Variant Frequency Bins Spreadsheet):

  • MAF Bin: An integer number that indicates the bin number assignment based on the specified bin thresholds.
  • MAF: Minor Allele Frequency for the reference population.
  • Alleles Present: Actual alleles observed in the genotypes from the spreadsheet.
  • Ref/Alt Field from Track: Observed alleles in the form Reference/Alternate allele.
  • Additional Columns: Any additional information included in the selected source.

Note

The names of the first two columns and the fourth column will depend on the name of the frequency field selected and the field used to determine the reference and alternate alleles.


Variant Frequency Bins Spreadsheet
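As a rough illustration of threshold-based binning, the Python sketch below assigns a bin number from an ascending list of MAF thresholds. Only the rule that markers absent from the track receive bin 0 comes from the description above. The bin numbering (1 for values up to the first threshold, and so on), the treatment of values equal to a threshold, and the extra bin for values above the last threshold are assumptions of this example, as is the function name.

```python
from bisect import bisect_left


def assign_maf_bin(maf, thresholds):
    """Assign a frequency bin from ascending MAF thresholds.

    maf is None when the marker is not present in the frequency track.
    """
    if not thresholds:
        raise ValueError("at least one bin threshold must be specified")
    if list(thresholds) != sorted(thresholds):
        raise ValueError("thresholds must be in ascending order")
    if thresholds[0] < 0.0 or thresholds[-1] > 0.5:
        raise ValueError("thresholds must lie between 0.0 and 0.5")

    if maf is None:
        return 0  # marker not found in the reference population track
    # Assumed numbering: bin k holds values in (threshold[k-2], threshold[k-1]].
    return bisect_left(thresholds, maf) + 1
```

With thresholds `[0.01, 0.05]`, an MAF of 0.03 falls in bin 2 under these assumptions.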

Classify by Inheritance Pattern

This tool requires a Pedigree + Genotype spreadsheet and will attempt to classify the inheritance pattern for all variants for complete trios. The following classifications are possible.

  • Maternal de Novo: Child has homozygous alternate genotype, mother has heterozygous or homozygous alternate genotype and father has homozygous reference genotype. One alternate allele was inherited from the mother and the other is a de Novo mutation.
  • Paternal de Novo: Child has homozygous alternate genotype, father has heterozygous or homozygous alternate genotype and mother has homozygous reference genotype. One alternate allele was inherited from the father and the other is a de Novo mutation.
  • Het de Novo: Child has heterozygous genotype, mother and father have homozygous reference genotypes. Child has one de Novo mutation not found in parents.
  • Homozygous de Novo: Child has homozygous alternate genotype, mother and father have homozygous reference genotypes. Child has two de Novo mutations not found in parents.
  • Het Either: Child, mother and father all have heterozygous genotypes. Child could have inherited the alternate allele from either parent.
  • Het Maternal: Child has heterozygous genotype that was inherited from the mother. Either mother is heterozygous or homozygous alternate and father is homozygous reference, or mother is homozygous alternate and father is heterozygous.
  • Het Paternal: Child has heterozygous genotype that was inherited from the father. Either father is heterozygous or homozygous alternate and mother is homozygous reference, or father is homozygous alternate and mother is heterozygous.
  • Homozygous Both: Child has homozygous alternate genotype and mother and father either have heterozygous or homozygous alternate genotypes.
  • Het de Novo: Child is heterozygous and mother and father both have homozygous alternate genotypes. In this case the child's reference allele is not found in either parent.

If the inheritance pattern does not fit one of the above descriptions, the classification is missing.
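The classification rules listed above can be written as a small decision function. The following Python sketch is illustrative only: the function name is hypothetical and genotypes are assumed to be encoded as the number of alternate alleles (0, 1 or 2), with `None` for a missing genotype. The returned labels follow the list above.

```python
def classify_trio(child, mother, father, missing_as_ref=False):
    """Classify one variant for one complete trio; None means unclassified."""
    if missing_as_ref:
        # Optionally treat missing genotypes as homozygous reference.
        child, mother, father = (0 if g is None else g
                                 for g in (child, mother, father))
    if None in (child, mother, father):
        return None

    if child == 2:
        if mother == 0 and father == 0:
            return "Homozygous de Novo"
        if father == 0:
            return "Maternal de Novo"   # mother is het or hom alt
        if mother == 0:
            return "Paternal de Novo"   # father is het or hom alt
        return "Homozygous Both"        # both parents carry the alternate

    if child == 1:
        if mother == 0 and father == 0:
            return "Het de Novo"
        if mother == 2 and father == 2:
            return "Het de Novo"        # label as given in the list above
        if mother == 1 and father == 1:
            return "Het Either"
        if father == 0 or (mother == 2 and father == 1):
            return "Het Maternal"
        return "Het Paternal"

    return None  # child is homozygous reference: no listed pattern applies
```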

Options to specify:

  • Reference Allele Field: The marker map field that contains the reference allele
  • Treat missing genotypes as reference: Optionally treat missing data as homozygous reference calls

Classify by Inheritance Pattern Window

Output

The resulting spreadsheet will contain one row for each variant and one column for each child from a complete trio.


Classify by Inheritance Pattern Results

Find de Novo Candidate Variants

This tool uses pedigree information to identify candidate functional polymorphisms, defined as the offspring in a trio having a genotype classified as a Mendelian error. By default, only heterozygous errors are considered candidates. Optionally, homozygous non-reference errors can be considered and require a reference allele field to be present in the marker map. Another option allows the user to restrict computation to affected offspring.

Options to specify:

  • Reference Allele Field: Marker map field containing reference allele. This is optional and only required if homozygous alternatives are included.
  • Treat missing genotypes as reference: Optionally treat missing genotypes as homozygous reference calls.
  • Include informative homozygous errors as candidates: Optionally allow a de Novo candidate to have a homozygous alternate call in the case where both parents are still homozygous reference.
  • Only consider affected children: Limit the search to affected children as defined in the pedigree spreadsheet.

Find de Novo Candidate Variants Window

Output

A new child node is created for each trio found in the original spreadsheet, assuming the trio has candidate variants. A message is added to the log if a trio does not have any candidates. In each new spreadsheet, every genotypic column represents a Mendelian Error (heterozygous and optionally non-reference homozygous) found in the trio’s genotypic data. An additional report spreadsheet is created containing all variants that were found to be candidates in any trio. The variants are listed in the row labels and there is a binary column for each trio (with the child’s label) signifying if that variant was considered a candidate for the trio.

Score Variants by Recessive Model

This feature requires a mapped pedigree spreadsheet and the marker map must include a reference allele field. Variants will be scored based on how well each variant follows the expected recessive model inheritance pattern. One score will be generated for each nuclear family as well as the sum of all scores for all families.

Options to specify:

  • Reference Allele Field: Marker map field containing reference allele.
  • Spreadsheet Action:
    • Score Variants: Only the scores for all variants are computed
    • Score and Filter Variants: The scores for all variants are computed and those variants that do not have a perfect recessive model score equal to one for at least one family are inactivated in the original spreadsheet.
  • Treat missing genotypes as reference: If checked, missing genotypes are counted as homozygous reference calls and are included in the denominator, and the score reported is the Recessive Model Score, see: Recessive Model Score Formula. Otherwise, if this option is not checked, only called genotypes are included in the denominator of the Recessive Model Score, and both the Unweighted and Weighted Scores are included in the score report, see: Recessive Model Score Formula and Weighted Recessive Model Score Formula.
  • Output subset variant spreadsheets per (affected) trio: If checked one trio spreadsheet is created for every affected child in a family. This option is not recommended when there are numerous families and multiple affected children per family.
  • Include all per-family output in score report: If checked the underlying counts that make up the recessive model scores are included in the output.

Output

  • Recessive Model Variant Score Report: This spreadsheet contains at a minimum the following columns:
    • Drop?: Whether or not the variant would be dropped in the original spreadsheet if Score and Filter Variants was selected.
    • Total # Rec Families: Total number of families with a perfect recessive model score (score = 1.0).
    • Sum(Score): Sum of the standard recessive model score for all of the families.
    • Sum(Wt Score): If weighted scores are computed, the sum of all the weighted scores for all of the families is reported in this column.
    • Recessive Model Score (Family ID, Father ID, Mother ID): One column for each nuclear family. See Recessive Model Score Formula.
    • Weighted Recessive Model Score (Family ID, Father ID, Mother ID): (Optional) One column for each nuclear family, contains the weighted score when missing values are not treated as reference. See Weighted Recessive Model Score Formula.
    • Additional Output if all per-family output is included in score report: (Optional)
      • # Hets in Parents (Family ID, Father ID, Mother ID): Total number of heterozygous genotypes in the parents.
      • #HomoAlts in Affected Sibs (Family ID, Father ID, Mother ID): Total number of homozygous alternate genotype calls for affected children.
      • #Ref or Het in Unaffected Sibs (Family ID, Father ID, Mother ID): Total number of homozygous reference or heterozygous genotype calls for unaffected children.
      • Total (Family ID, Father ID, Mother ID): Sum of the previous three columns (Total = # Hets in Parents + # HomoAlts in Affected Sibs + #Ref or Het in Unaffected Sibs)
      • # non-missing Genotypes (Family ID, Father ID, Mother ID): Total number of non-missing genotypes for the nuclear family.

Formulas

Recessive Model Score Formula

In this formula the total number of genotypes that follow the recessive model pattern is divided by the total number of considered genotypes. If missing values are treated as reference then missing values are included in the denominator. Otherwise, only called genotypes are included in the denominator.

score = \frac{numHetParents + numAltAffected + numRefUnaffected}{numGenotypes}

Weighted Recessive Model Score Formula

The weighted recessive model score is only output when missing values are not treated as reference, otherwise there is no difference in the results. This formula adjusts for the number of missing parents and the denominator includes all genotypes, called or not.

score = \frac{numHetParents + numAltAffected + numRefUnaffected-0.5*numMissingParents}{numCalledGenotypes + numMissingGenotypes}
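The two formulas above can be evaluated together as in the Python sketch below. The function name and argument names are hypothetical; the per-family counts are assumed to have been tallied already (with missing genotypes already counted as reference calls when that option is used), and returning `None` for an empty denominator is an assumption.

```python
def recessive_model_scores(num_het_parents, num_alt_affected,
                           num_ref_unaffected, num_called, num_missing,
                           num_missing_parents, missing_as_ref=False):
    """Return (score, weighted_score) for one nuclear family.

    weighted_score is None when missing genotypes are treated as reference,
    because only the standard score is reported in that case.
    """
    fit = num_het_parents + num_alt_affected + num_ref_unaffected
    total = num_called + num_missing

    if missing_as_ref:
        # Missing genotypes count toward the denominator.
        return (fit / total if total else None), None

    # Standard score: only called genotypes in the denominator.
    score = fit / num_called if num_called else None
    # Weighted score: penalize missing parents, divide by all genotypes.
    weighted = (fit - 0.5 * num_missing_parents) / total if total else None
    return score, weighted
```

For a family of four where one parent is missing and the other three genotypes fit the model, the standard score is 3/3 = 1.0 and the weighted score is (3 - 0.5)/4 = 0.625.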

Score Compound Heterozygous Regions

This feature requires a mapped pedigree spreadsheet (map must include a reference allele field) and will calculate the number of compound heterozygous inheritance events within each gene region. A compound heterozygous inheritance event requires that:

  • Children have a heterozygous genotype
  • One parent has a copy of the alternate allele
  • The parental source of the alternate allele is known and not ambiguous
  • A child has two heterozygous genotypes within the same gene where the alternate allele is inherited from each parent at a minimum of two different loci.
  • (Optional) A child’s heterozygous genotype can be counted when one of the parents has two copies of the alternate allele.
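The requirements above can be illustrated with a simplified Python sketch for a single trio and a single gene. It is not the SVS algorithm: the function name is hypothetical, genotypes are assumed to be encoded as alternate allele counts (0, 1, 2, or `None` for missing), variants with any missing genotype are skipped, and a heterozygous child is treated as ambiguous whenever both parents carry the alternate allele. Those simplifications are assumptions of the example.

```python
def score_gene_for_trio(variants, allow_hom_alt_parent=False):
    """variants: iterable of (child, mother, father) alt-allele counts
    for the variants that fall inside one gene region."""
    counts = {"from_father": 0, "from_mother": 0,
              "ambiguous": 0, "mendelian_errors": 0}
    max_copies = 2 if allow_hom_alt_parent else 1

    for child, mother, father in variants:
        if None in (child, mother, father) or child != 1:
            continue  # only called heterozygous child genotypes are scored
        if (mother == 0 and father == 0) or (mother == 2 and father == 2):
            counts["mendelian_errors"] += 1
        elif mother > 0 and father > 0:
            counts["ambiguous"] += 1      # parental source not known
        elif mother > 0:
            if mother <= max_copies:
                counts["from_mother"] += 1
        elif father <= max_copies:
            counts["from_father"] += 1

    # An event needs at least one maternal and one paternal het in the gene;
    # each variant has a single source, so these are at different loci.
    counts["compound_het"] = (counts["from_father"] >= 1
                              and counts["from_mother"] >= 1)
    return counts
```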

Options to specify:

  • Select a gene track: Select a gene track to use as a reference.
  • Reference Allele Field: Marker map field containing reference allele. This is optional and only required if homozygous alternatives are included.
  • Treat missing genotypes as reference: Optionally treat missing genotypes as homozygous reference calls.
  • Score Compound Heterozygous Variants (Create additional per variant output): Checking this will create a per variant spreadsheet that reports which parent gives the heterozygous genotype the alternate allele.
  • One output spreadsheet per trio: Will output one spreadsheet of results per trio instead of all results in one spreadsheet.
  • Consider hets even when one of the parents is homozygous alternate: Will consider heterozygous genotypes in the child even when one of the parents is homozygous for the variant.

Output

An output spreadsheet is created that includes a row for each gene that had at least one inherited heterozygous genotype, an ambiguous heterozygous genotype or a Mendelian error. The columns created include the total number of compound heterozygous events for all trios and individual results for each trio. The individual results include the number of heterozygous genotypes inherited from the father, the number of heterozygous genotypes inherited from the mother, the total number of inherited heterozygous genotypes, the number of ambiguous heterozygous genotypes and the number of Mendelian errors found within the gene region.

Additional output options include creating a per variant spreadsheet that reports which parent gives the heterozygous genotype the alternate allele. Also, one output spreadsheet can be created per trio instead of all results in one spreadsheet.

Score Variants by Dominant Model

This feature requires either a pedigree spreadsheet or a binary dependent column along with several mapped genotypic columns. The marker map must include a reference allele field. The variants will be scored based on how well each variant follows the expected dominant model pattern based on case/control status. In the case of a pedigree spreadsheet, one score will be computed per family. The dialog presented will depend on whether or not the spreadsheet has pedigree information.

Options for a Pedigree Spreadsheet

  • Reference Allele Field: Marker map field containing reference allele.
  • Spreadsheet Action:
    • Score Variants: Only the scores for all variants are computed
    • Score and Filter Variants: The scores for all variants are computed and those variants that do not have a perfect dominant model score equal to one for at least one family are inactivated in the original spreadsheet.
  • Treat missing genotypes as reference: If checked, missing genotypes are included in the numerator and denominator and the score reported is the Dominant Model Score. Otherwise, if this option is not checked, missing values are not included in the numerator and denominator and both the Unweighted and Weighted Scores are included in the score report.
  • Output subset variant spreadsheets per family: If checked, one subset spreadsheet from the original pedigree and genotype spreadsheet is created for every family. The variants included in the subset spreadsheet will be only those variants that have a dominant model score of one.
  • Include all per-family output in score report: If checked, the underlying counts that make up the dominant model scores are included in the output, and the counts will be grouped by family ID. If not checked, only the overall scores will be output per-family.

Score Variants by Dominant Model Dialog for a Pedigree Spreadsheet

Output for Family-Based Dominant Model Score Report

A dominant model score report is created that contains a score for each variant (per family), ranging from 0 to 1 where a value of 1 represents a perfect dominant model fit. If filtering is selected, a subset spreadsheet is created containing all perfectly fitted variants.

  • Dominant Model Variant Score Report: This spreadsheet contains at a minimum the following columns:
    • Total # Dom Families: Total number of families with a perfect dominant model score (score = 1.0).
    • Sum(Score): Sum of the standard dominant model score for all of the families.
    • #Carriers in Affected (Family ID): The number of affected samples with a heterozygous genotype (Alt_Ref) at each variant.
    • Affected Samples (Family ID): A list of affected samples with a heterozygous genotype (Alt_Ref) at each variant.
    • Dominant Model Score (Family ID): The dominant model score (total / denominator). See the formula below for more details.
    • Additional output depending on parameters:
      • Total # Complete Dom Families: Total number of families with a perfect weighted dominant model score (wt score = 1.0).
      • Sum(Wt Score): Sum of the weighted dominant model score for all of the families
      • #Not Carriers in Unaffected (Family ID): The number of unaffected samples with a homozygous reference genotype (Ref_Ref) at each variant.
      • #Unaffected missing Genotypes (Family ID): The number of unaffected samples with a missing genotype.
      • Total (Family ID): The number of affected carriers plus the number of unaffected not carrier samples.
      • Denominator (Family ID): The total number of samples used for the dominant model score. This number depends on if missing values are treated as reference or not. See above for the description of this option.
      • Weighted Numerator (Family ID): The number of samples that fit the dominant model without counting missing values, plus an adjustment. See the weighted model score formula below for details.
      • Weighted Denominator (Family ID): The total number of samples including those with missing genotypes.
      • Weighted Dominant Model Score (Family ID): The weighted dominant model score (weighted numerator / weighted denominator). See the formula below for more details.

Options for a Spreadsheet without a Pedigree

  • Reference Allele Field: Marker map field containing reference allele.
  • Spreadsheet Action:
    • Score Variants: Only the scores for all variants are computed
    • Score and Filter Variants: The scores for all variants are computed and those variants that do not have a perfect dominant model score equal to one are inactivated in the original spreadsheet.
  • Treat missing genotypes as reference: If checked, missing genotypes are included in the numerator and denominator and the score reported is the Dominant Model Score. Otherwise, if this option is not checked, missing values are not included in the numerator and denominator and both the Unweighted and Weighted Scores are included in the score report.
  • Include all output in score report: If checked the underlying counts that make up the dominant model scores are included in the output.

Score Variants by Dominant Model Dialog for a Case/Control Spreadsheet

Output for Case/Control Dominant Model Score Report

A dominant model score report is created that contains a score for each variant, ranging from 0 to 1 where a value of 1 represents a perfect dominant model fit. If filtering is selected, a subset spreadsheet is created containing all perfectly fitted variants.

  • Dominant Model Variant Score Report: This spreadsheet contains at a minimum the following columns:
    • #Carriers in Affected: The number of affected samples with a heterozygous genotype (Alt_Ref) at each variant.
    • Affected Samples: A list of affected samples with a heterozygous genotype (Alt_Ref) at each variant.
    • Dominant Model Score: The dominant model score (total / denominator). See the formula below for more details.
    • Additional output depending on parameters:
      • #Not Carriers in Unaffected: The number of unaffected samples with a homozygous reference genotype (Ref_Ref) at each variant.
      • #Unaffected missing Genotypes: The number of unaffected samples with a missing genotype.
      • Total: The number of affected carriers plus the number of unaffected not carrier samples.
      • Denominator: The total number of samples used for the dominant model score. This number depends on if missing values are treated as reference or not. See above for the description of this option.
      • Weighted Numerator: The number of samples that fit the dominant model without counting missing values, plus an adjustment. See the weighted model score formula below for details.
      • Weighted Denominator: The total number of samples including those with missing genotypes.
      • Weighted Dominant Model Score: The weighted dominant model score (weighted numerator / weighted denominator). See the formula below for more details.

Formulas

The dominant model score is calculated as follows:

x &= \text{number of heterozygous cases} \\
y &= \text{number of homozygous reference controls} \\
S &= \frac{x + y}{\text{number of samples}}

If the option to Treat missing genotypes as reference is not selected then a weighted score is also included in the output.

x &= \text{number of heterozygous cases} \\
y &= \text{number of homozygous reference controls} \\
z &= \text{number of missing genotypes} \\
S_{wt} &= \frac{x + y - 0.5*z}{\text{number of samples}}
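As a worked illustration of the two dominant model formulas, the Python sketch below computes both scores for one variant. The function name and the input encoding (alternate allele counts 0, 1, 2, with `None` for a missing genotype) are assumptions made for this example, as is returning `None` when a denominator is zero.

```python
def dominant_model_scores(cases, controls, missing_as_ref=False):
    """Return (score, weighted_score) for one variant.

    cases / controls: lists of alt-allele counts (0, 1, 2) or None.
    weighted_score is None when missing genotypes are treated as reference.
    """
    n = len(cases) + len(controls)
    x = sum(1 for g in cases if g == 1)          # heterozygous cases

    if missing_as_ref:
        # Missing controls count as homozygous reference.
        y = sum(1 for g in controls if g in (0, None))
        return ((x + y) / n if n else None), None

    y = sum(1 for g in controls if g == 0)       # hom. reference controls
    z = sum(1 for g in cases + controls if g is None)  # missing genotypes
    called = n - z
    score = (x + y) / called if called else None
    weighted = (x + y - 0.5 * z) / n if n else None
    return score, weighted
```

For three cases (two heterozygous, one missing) and three controls (two homozygous reference, one heterozygous), the score is 4/5 = 0.8 and the weighted score is (4 - 0.5)/6.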

Analysis through Collapsing Methods

Traditional association techniques used in GWAS do not have the power to detect the significance of rare variants individually or provide tools for measuring their compound effect, referred to as rare variant burden. To do this, it is necessary to “collapse” several variants into a single covariate based on regions such as genes.

Count Variants per Gene

In order to determine the influence of a variant in a gene or other region, it is useful to know how many variants exist for a gene per sample. This feature allows three optional spreadsheets to be created based on how the variants are identified. The first reports the presence of at least one variant in a gene per sample. The second counts the total number of variants per sample per gene, where a variant is considered to be either a Ref_Alt or Alt_Alt genotype. The third counts the number of alternate alleles in a gene for each sample.

A variant can be counted if the genotype is identified as either a homozygous variant (‘Alt_Alt’) or heterozygous variant (‘Ref_Alt’), only a homozygous variant, or only a heterozygous variant.

For input, this function requires a marker mapped genotype spreadsheet that has a marker map field which contains either the reference bases or a “Ref/Alt Alleles” field. In the case of a “Ref/Alt Alleles” field, only the first allele is used, which is assumed to be the reference allele. The reference base is used to determine if a genotype contains a variant allele or not.

This function can be accessed by going to DNA-Seq > Collapsing Methods > Count Variants per Gene.

In the Count Variants per Gene dialog (see Count Variants per Gene Dialog Window), an annotation gene track needs to be selected. At this time only local annotation tracks can be selected. An example filtered gene track name is: RefSeqGenes-UCSC_GRCh_37_Homo_sapiens.idf:1

Please see The Data Source Library and Downloading Data for information on how to obtain annotation tracks.

The marker map field names are scanned and listed in the Select Map Field chooser dialog. Only string fields are listed, as the reference base should be listed as a string. Select either the reference base or the Ref/Alt alleles field.


Count Variants per Gene Dialog Window

Optionally, variants that are upstream or downstream of a gene may be treated as being inside that gene. This can be accomplished by enabling the option Include upstream/downstream variants. The distance (in bp) to a gene boundary in order to be counted can then be specified. The default is 1000 bp.

Select the types of variants to count: both homozygous and heterozygous variants, only homozygous variants, or only heterozygous variants.

Finally, the type of output can be selected: either a binary presence/absence indicator of a variant (per gene), an integer count of the number of variants per gene, or an integer count of the number of alternate alleles per gene. For any of these output options, if there are no variants for a gene, the gene will not be output in the resulting spreadsheet.

Binary Presence/Absence of Variant (Per Gene)

The row labels for this spreadsheet are the original row labels, which should be the sample IDs or sample names. The column headers are in the format of chromosome:start - stop for each gene. The gene name itself is contained in the marker map ‘Name’ field. If there are multiple occurrences of a gene with the same start and same stop position due to differences in coding region start and stop positions, only one column is created since the counting of variants is only dependent on the start and stop position of a gene region.

This output option results in binary columns. In this spreadsheet, a ‘1’ indicates the presence of at least one variant for the sample for the particular gene, and a ‘0’ indicates there were no variants for the sample for the gene. The types of variants counted are those selected in the prompt dialog and are indicated in the spreadsheet name.

If there are no variants for a particular gene in any sample the gene will not be included in the output spreadsheet.

Count of Number of Variants (Per Gene)

The row labels for this spreadsheet are the original row labels, which should be the sample IDs or sample names. The column headers are in the format of chromosome:start - stop for each gene. The gene name itself is contained in the marker map ‘Name’ field. If there are multiple occurrences of a gene with the same start and same stop position due to differences in coding region start and stop positions, only one column is created since the counting of variants is only dependent on the start and stop position of a gene region.

This output option results in columns of integers. In this spreadsheet, the total number of variants for a sample for a gene is reported. The maximum number of variants is the total number of probes in a gene. The minimum number of variants is ‘0’, which indicates that there were no variants for the sample for the gene. The types of variants counted are those selected in the prompt dialog and are indicated in the spreadsheet name.

If there are no variants for a particular gene in any sample, the gene will not be included in the output spreadsheet.

Count of Number of Alternate Alleles (Per Gene)

The row labels for this spreadsheet are the original row labels, which should be the sample IDs or sample names. The column headers are in the format of chromosome:start - stop for each gene. The gene name itself is contained in the marker map ‘Name’ field. If there are multiple occurrences of a gene with the same start and same stop position due to differences in coding region start and stop positions, only one column is created since the counting of alternate alleles is only dependent on the start and stop position of a gene region.

This output option results in columns of integers. In this spreadsheet, the total number of alternate alleles for a sample for a gene is reported. The maximum number of alternate alleles is the total number of alleles (2 times the number of variant sites) in a gene. The minimum number of alternate alleles is ‘0’, which indicates that there were no variants for the sample for the gene. The types of variants counted are those selected in the prompt dialog and are indicated in the spreadsheet name.

If there are no alternate alleles for a particular gene in any sample, the gene will not be included in the output spreadsheet.

Note

For all output spreadsheets if there are non-overlapping transcripts with the same gene name then multiple columns will be created, one per non-overlapping region. To create unique column headers for these regions an “_” followed by a ‘1’, ‘2’, etc. will be appended to the gene name. The original gene name for each region will be included as a separate field in the marker map.
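The three output types can be illustrated for one sample and one gene with the Python sketch below. It is a simplified sketch rather than the SVS implementation: the function name, the tuple-based genotype encoding, and the `flank` parameter (standing in for the upstream/downstream distance) are hypothetical. It also assumes that any genotype with two non-reference alleles counts as Alt_Alt and that the alternate allele count respects the selected variant types.

```python
def count_in_gene(genotypes, positions, ref_alleles, gene_start, gene_stop,
                  flank=0, count_hom=True, count_het=True):
    """Summarize one sample's variants within one gene region.

    genotypes: allele tuples such as ("A", "G"), or None when missing.
    positions / ref_alleles: marker position and reference base per genotype.
    """
    n_variants = 0
    n_alt_alleles = 0
    for genotype, pos, ref in zip(genotypes, positions, ref_alleles):
        if genotype is None:
            continue
        if not (gene_start - flank <= pos <= gene_stop + flank):
            continue  # marker lies outside the (optionally extended) gene
        alt = sum(1 for allele in genotype if allele != ref)
        if (alt == 1 and count_het) or (alt == 2 and count_hom):
            n_variants += 1
            n_alt_alleles += alt
    return {
        "present": int(n_variants > 0),   # binary presence/absence
        "variants": n_variants,           # count of variant genotypes
        "alt_alleles": n_alt_alleles,     # count of alternate alleles
    }
```

A gene whose three values are all zero for every sample would simply be left out of the corresponding output spreadsheet.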

Collapsing Methods

In Golden Helix SVS, you can use several approaches to study rare variant burden, including the Cohort Allelic Sum Test (CAST), the Combined Multivariate and Collapsing (CMC) method, and the Kernel-Based Adaptive Cluster (KBAC) method. Please see Collapsing Methods for DNA-Sequence Analysis for details about these approaches and how to use them. Using one of these approaches will give greater power to detect the significance of the rarer variants.

DNA Sequence Analysis Tutorials

For more details and up-to-date recommendations on DNA-Seq analysis or variant analysis and classification, please see the Variant tutorials on the Golden Helix website: Golden Helix SVS Tutorials.