4.4. Importing Variants

You can import one or more VCF, gVCF, 23andMe text files, or variant TSF files into a new project. VarSeq also supports annotating from numerous other file types.

If your project was created from a template, the Import Variants dialog will automatically appear after creating a new project. Otherwise, you can click on the Import Variants… button to open the Import Variants Wizard.

Import Variants 1

To Import Variants into a New Project.

4.4.1. Select Files to Import

The Import Wizard will step you through all of the import options to bring variant level data (with or without sample level fields) into VarSeq. The left side of the screen indicates the current step of the import process, and a short description of that step appears on the right.

The first step in the import process is to select the files with your variant calls by clicking on Add Files, browsing to the location and clicking on Open. The files you have selected will be displayed in the Select Files list. Then, select which variant type tables to import and the Sample Name Matching Mode. You may also select Advanced Options, which are described below in Section 4.4.5 Modify How Fields in the Variant Files are Imported.

Import Wizard Step 1

Select the Files to Import in the First Step of the Import Wizard

Note

Indicate whether or not files should be appended if multiple files have the same sample names using the Append together files with matching sample names option.

Here are two examples:

  • If there is one file per chromosome, either for each sample or containing all samples, leave this option Checked.

  • If there are multiple files per sample and the variants are to be compared between files (tumor/normal, various alignment algorithms, etc.), Uncheck this option.

Note

When importing 23andMe data from delimited text files a local copy of a dbSNP annotation source is required to determine reference and alternate alleles for each RS ID listed. You can download a copy through the Data Source Library. Different versions of dbSNP may produce different results for some variants.

If the files have not been compressed or indexed, they will be compressed and indexed after you click Next >. Otherwise, the next step is to select the relationship between or type of samples.

Note

VarSeq supports importing records from VCF files that follow the 4.3 specification. The details can be found in the Variant Call Format Specification. Key points include:

  • VCFs must be encoded using UTF-8

  • The genome assembly of the file is determined by matching the length of the contigs in the header with the known genome assemblies. If that lookup fails, the value of the reference, and finally source fields are used to infer the genome build.

  • Field arity for the INFO and FORMAT fields should use the following specifiers:

    • A If there is one value per alternate allele (in the same order as the alternates)

    • R If there is one value per allele, including the reference (reference first, then the alternates in order)

    • ‘.’ It is unknown how many values will be in the list

    • The values G and specific arity values such as 1, or 2 will still be imported though the number of elements in the field is not guaranteed to remain consistent with the original count or the number of genotypes in the record.

  • Unspecified alleles or REF-only blocks should be specified with the <*> notation. These alleles are used to denote the coverage of regions between variant calls in gVCFs.

  • If the ALT field contains a ‘*’, indicating that there was an overlapping deletion according to the VCF spec, VarSeq will import this record and remove the reference to this allele in the corresponding genotype GT field as well as allele-matching fields. This will in effect make variants haploid for their called alternates.
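As a rough illustration of the arity specifiers listed above, the sketch below computes how many values a field is expected to carry for a record with a given number of alternate alleles. The function name `expected_value_count` is hypothetical and is not part of VarSeq; the counts for A, R and G follow the VCF specification, and the G case assumes a diploid genotype unless another ploidy is passed.

```python
from math import comb
from typing import Optional


def expected_value_count(number: str, n_alt: int, ploidy: int = 2) -> Optional[int]:
    """Return the expected number of values for a VCF Number specifier.

    Returns None when the count is unknown ('.').
    """
    if number == "A":
        # One value per alternate allele.
        return n_alt
    if number == "R":
        # One value per allele, counting the reference allele as well.
        return n_alt + 1
    if number == "G":
        # One value per possible genotype: combinations with repetition
        # of (n_alt + 1) alleles taken `ploidy` at a time.
        n_alleles = n_alt + 1
        return comb(n_alleles + ploidy - 1, ploidy)
    if number == ".":
        return None
    # A fixed arity such as "1" or "2".
    return int(number)
```

For a record with two alternate alleles, an A field would hold two values, an R field three, and a diploid G field six.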

When importing CNV records, the fields listed below are used to determine the type of CNV event encoded in the file. Variants may be encoded using the symbolic alternates DEL, INS, DUP, INV, CNV, or CN:#, or they can be encoded as small variants. The threshold at which variants are treated as CNVs can be adjusted in the variant import.

  • SVTYPE The type of structural variant. This must be one of the following values, DEL, INS, DUP, INV, CNV, BND.

  • END The end position of the record, which should equal POS + [length of the REF allele] - 1. This field is required when symbolic alleles are used.

  • CN This sample level (FORMAT) field refers to the copy number genotype for imprecise events and may be used to infer the copy number state of a sample within a given record.
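The END arithmetic described above can be checked with a few lines of Python. This is only a sketch for records that spell out the REF sequence; the function names `expected_end` and `end_is_consistent` are hypothetical and not part of VarSeq.

```python
def expected_end(pos: int, ref: str) -> int:
    """Return POS + len(REF) - 1, the last reference base covered by REF."""
    return pos + len(ref) - 1


def end_is_consistent(pos: int, ref: str, end: int) -> bool:
    """Check that a record's END value agrees with its POS and REF allele."""
    return end == expected_end(pos, ref)
```

A single-base REF at position 100 therefore has END 100, while a four-base REF at the same position has END 103.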

Breakend events are imported using the END field in addition to the SVTYPE and ALT fields. The ALT field has strict conventions that use the [ and ] symbols to specify the strand and extension direction of the breakend. The direction of the brackets specifies the direction of extension (strand), and their position within the alternate specifies the location of the break event relative to the variant. For more details on how the alternate values are formatted and for specific examples, please refer to the VCF specification linked above.

In addition to the ALT field the MATEID is also used:

  • MATEID Breakends must use this field to specify the other record in the file that completes this event. The MATEID for a record should match the ID field of the paired event on the other side of the breakend event. The ID must be unique for each record in the file.
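Before importing, it can be useful to confirm that breakend records pair up as described. The sketch below assumes each record has already been reduced to a dictionary with hypothetical `id` and `mateid` keys, and it additionally assumes that pairing is reciprocal (each mate points back at its partner). It is not VarSeq code.

```python
from typing import Dict, List, Optional


def find_unpaired_breakends(records: List[Dict[str, Optional[str]]]) -> List[str]:
    """Return the IDs of breakend records whose MATEID does not resolve.

    A record is reported when its MATEID matches no record ID in the file,
    or when the mate does not point back at it.
    """
    ids = [r["id"] for r in records]
    if len(ids) != len(set(ids)):
        raise ValueError("record IDs must be unique within the file")

    by_id = {r["id"]: r for r in records}
    unpaired = []
    for r in records:
        mate = by_id.get(r["mateid"])
        if mate is None or mate["mateid"] != r["id"]:
            unpaired.append(r["id"])
    return unpaired
```

An empty result means every breakend has a matching partner; any IDs returned point to records that are worth inspecting in the source file.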

4.4.2. Select Relationship

If you are importing variants into an empty template and your files contain sample level information (e.g. patient name, mother, collection date, ordering physician) you will be asked if the samples are associated. Select the appropriate relationship or type of import and click Next.

The options include:

  • Individual Samples: Where the samples in the file(s) that you are importing are not related to each other. You will be able to select the affected individuals on the next step.

  • Family Samples: Where the samples in the file(s) that you are importing are related to one another. You will be able to select the affected individuals and specify parent relationships on the following step.

    Note

    Select this option if you have one or more families. This option does not require that all of the samples are in the same family.

  • Cancer Samples: Where the samples in the file(s) that you are importing are from cancer gene panels. You will be able to select the affected individuals on the next step.

  • Tumor/Normal Samples: Where the samples in the file(s) that you are importing contain tumor/normal pairs. You will be able to select the tumor samples and the matching normal samples on the next step.

Import Wizard Step 2

Specify the relationship between the samples

4.4.3. Edit Sample Information

Once the relationship has been specified you will be prompted to edit the sample information. You can change the sample name, affection status (affected, carrier or control), and if the samples are related, set the parents for the children.

If the sample information is contained in a plain text file (or pedigree file) you can also import this information by clicking on From Text File above the sample table. See Import Sample Information from Text File for more information.

If the Family Samples option was selected on the previous page, the sample editing page will look like the image below.

Import Wizard Step 3 Family

Edit the sample information to specify relationships and affection status for family data

If either the Individual Samples or the Cancer Samples option was selected on the previous page, the sample editing page will look like the image below.

Import Wizard Step 3 Individual

Edit the sample information to specify affection status for individual samples

If the Tumor/Normal Samples option was selected on the previous page, the sample editing page will look like the image below.

Import Wizard Step 3 Tumor/Normal

Edit the sample information to specify tumor/normal status and matched normal sample name.

Import Sample Information from Text File

Instead of manually specifying the sample information, it can be imported from a text file. The data in the text file will be matched to the sample by name.

General text file importing parameters are available to handle most text file formats including text pedigree files, or files without a header.

Example Sample Information files can be found by selecting Open Folder, then VarSeq Install Folder from the Tools menu at the top of the screen. In the file browser, open the folder called Data and then the ExampleSampleManifests folder.

Opening the VarSeq Install Folder

These example TSV files have specific fields that are recognized by VarSeq allowing the information to be included in a Patient Report at the end of your workflow. The example files in the folder are:

  • germline_for_report: This file is for use with genomic variants in VarSeq and the VSClinical ACMG workflows.

  • cancer_for_report: This file supports both the VarSeq somatic workflows such as Tumor

  • germline_trio:

  • genes_and_phenotypes:

These files are a starting point for loading your own sample data and can be modified to add additional data, which may be displayed in a patient report by following the instructions in VSReports Templates.

A best attempt is made to detect the column that contains the sample names for matching. If the correct field from the text file is not selected, right click on the column header of the correct column and choose Set as Sample Names.

Basic Sample Info

Importing sample information from a file that just contains the affection status

If the text file contains a field with secondary sample names, the sample names can be renamed using this field. To set the field as Renamed Sample Names click on the column header and select Renamed Sample Names. This fills in the Renamed Sample column in the import wizard.

If the samples are imported as Family Samples, then the text file can specify the Father and Mother IDs in addition to the affection status. The gender can also be imported for later use in algorithms.

Note

If a field is set for Renamed Sample Names, the Father and Mother IDs must match the renamed sample names and not the original sample name.
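The renaming rule in the note above can be checked before import. The sketch below assumes the sample information has already been read into a list of dictionaries; the keys `sample`, `renamed`, `father` and `mother` are hypothetical column names chosen for illustration, not names required by VarSeq.

```python
from typing import Dict, List, Optional, Tuple


def unmatched_parent_ids(rows: List[Dict[str, Optional[str]]]) -> List[Tuple[str, str, str]]:
    """Return (sample, role, parent_id) for parent IDs that match no sample.

    When a row has a renamed sample name, that name is the one parents
    must refer to; otherwise the original sample name is used.
    """
    effective_names = {row.get("renamed") or row["sample"] for row in rows}

    problems = []
    for row in rows:
        for role in ("father", "mother"):
            parent_id = row.get(role)
            if parent_id and parent_id not in effective_names:
                problems.append((row["sample"], role, parent_id))
    return problems
```

A parent ID that still uses an original sample name after renaming is reported, which mirrors the mismatch the note warns about.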

Pedigree Sample Info

Importing pedigree information from a text file

If the samples are imported as Tumor/Normal Samples, then the text file can specify the matched normal sample IDs in addition to the tumor/normal status. The text file needs to specify the affection status for the samples, or tumor/normal status at a minimum. Other phenotype information can be imported as well. To set a column as Affected?, Tumor?, or as a field to import, right click on the column header and select the desired option.

Note

If a field is set for Renamed Sample Names, the matched normal IDs must match the renamed sample names and not the original sample names.

Tumor/Normal Sample Info

Importing tumor/normal information from a text file

Associating Sample Alignment Files

In addition to being specified in a text file, the file path to the sample’s coverage alignment file can also be set through the Alignment File association dialog, which can be opened by selecting Associate Alignment File above the sample field table. The alignment files supported include BAM and CRAM.

Alignment Association Dialog

Pairing samples with corresponding alignment files

The directory of the first input file is recursively searched for alignment files when the dialog is first opened. If the sample alignment files are in a different location, the search directory can be changed by selecting Browse. Match a sample with its corresponding alignment file by typing the name of the file in its File Name text box, or by selecting it from the sample's File Name drop-down menu.

4.4.4. Filter the Records that are Imported

By default, no records are filtered on import. This page of the dialog allows you to specify filters that reduce the number of records imported. Filtering records at this step will speed up downstream computations and improve project performance.

Records that do not pass the filters added are not imported.

Add Import Filters

Add filters to reduce the number of imported variants

If a sample field is selected, the values for any sample that does not pass the filter are removed. If none of the samples pass the filter, the entire record is removed.
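The sample-level filtering behavior described above can be pictured with a small sketch. It assumes a record's sample data is held as a dictionary from sample name to field values, and the function name `apply_sample_filter` is hypothetical. It models the described outcome only, not VarSeq's implementation.

```python
from typing import Callable, Dict, Optional


def apply_sample_filter(
    samples: Dict[str, dict],
    passes: Callable[[dict], bool],
) -> Optional[Dict[str, Optional[dict]]]:
    """Blank out failing samples; drop the record if no sample passes.

    Returns None when the whole record should be removed.
    """
    # Samples that fail the filter keep their slot but lose their values.
    filtered = {
        name: (values if passes(values) else None)
        for name, values in samples.items()
    }
    # If no sample passed, the record itself is not imported.
    if all(values is None for values in filtered.values()):
        return None
    return filtered
```

For example, with a read depth filter of at least 10, a record where one sample has depth 30 and another has depth 5 is kept, but the second sample's values are cleared.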

4.4.5. Modify How Fields in the Variant Files are Imported

Important

This is an Advanced option. To get to this page you will need to check the Advanced Options check box in the lower left hand corner of the wizard then advance past the sample editing page or go back from the final page.

All of the fields from the files selected for import are displayed on the Edit Field Merge and Type Behavior page of the import wizard.

By default all INFO fields from the variant files are imported as variant site fields and all FORMAT fields are imported as sample fields. When merging multiple files together there can be differing values in the INFO fields. The options presented are designed to handle this possibility.

The FILTER and DP (Read Depth) INFO fields are automatically elevated to sample fields. Any of the other fields can also be elevated to a sample field by selecting Sample in the drop-down menu in the Merge Behavior column.

Edit Field Merge Options

Edit field merge and type behavior advanced option dialog

By default, all other INFO fields will be merged by creating a Unique list of values for the field across all samples and files. This keeps the field a variant site field. Other merge options include:

  • NumericMax: For integer, integer array, float or float array field types. Takes the maximum of all values for the field in all files.

  • NumericMin: For integer, integer array, float or float array field types. Takes the minimum of all values for the field in all files.

  • NumericMean: For integer, integer array, float or float array field types. Takes the mean of all values for the field in all files.

  • KeepMatching: All field types. Only keep the value if all files that have a value for the specified field match.

  • TakeFirst: All field types. Take the first value seen.

  • TakeAll: All field types. Takes all values seen.
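To make the merge options concrete, the sketch below applies each behavior to a list holding one value per file for a single field. It is a simplified model written for illustration: the function name `merge_values` is hypothetical, array-valued fields are not handled, and returning None for a KeepMatching mismatch is an assumption about how a dropped value might be represented.

```python
from statistics import mean
from typing import Any, List


def merge_values(values: List[Any], behavior: str) -> Any:
    """Merge per-file values for one field according to a merge behavior."""
    if not values:
        raise ValueError("at least one value is required")
    if behavior == "Unique":
        # Distinct values in first-seen order (the default behavior).
        return list(dict.fromkeys(values))
    if behavior == "NumericMax":
        return max(values)
    if behavior == "NumericMin":
        return min(values)
    if behavior == "NumericMean":
        return mean(values)
    if behavior == "KeepMatching":
        # Keep the value only when every file agrees (assumed: else None).
        return values[0] if all(v == values[0] for v in values) else None
    if behavior == "TakeFirst":
        return values[0]
    if behavior == "TakeAll":
        return list(values)
    raise ValueError(f"unknown merge behavior: {behavior}")
```

Given the values 3, 7 and 5 from three files, NumericMax yields 7, NumericMin yields 3 and NumericMean yields 5, while KeepMatching yields no value because the files disagree.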

4.4.6. Import Summary

The final page of the import wizard is a summary of the import process and some additional options.

Import Summary

The Import Summary for an Individual Sample

Once the variants have been imported, you may annotate the variants or run computation algorithms on the variants. In the Import Summary step you can:

  • Variant/CNV Size Thresholds: Set the size threshold at which variants are treated as CNVs rather than small variants.

CNVs versus variants
  • Use Default Chromosome Names: Import the standard chromosome names defined in the genome assembly if alternate chromosome names such as the NC ID are specified.

Use default names
  • Left Align: Check this option if insertions and deletions that are not in their left-most representation need to be re-aligned with a Smith-Waterman algorithm (PMID:7265238) to give each its canonical representation.

Left Align Example

Left align visual example

Left Align

Note

The Left Align algorithm requires a valid and local reference sequence that matches the assembly of the data being imported.
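To show what a left-most representation means, the sketch below shifts an indel as far left as the reference allows. Note that this uses a simple suffix-trimming normalization, not the Smith-Waterman alignment that VarSeq is described as using, and the function name `left_align` is hypothetical. It assumes `reference` is the full sequence of one chromosome as a string and that `pos` is 1-based.

```python
from typing import Tuple


def left_align(pos: int, ref: str, alt: str, reference: str) -> Tuple[int, str, str]:
    """Shift an indel to its left-most position (1-based coordinates)."""
    if not ref or not alt:
        raise ValueError("REF and ALT must be non-empty")

    # While both alleles end in the same base, drop that base. When an
    # allele would become empty, first extend both alleles one base to
    # the left using the reference sequence.
    while ref[-1] == alt[-1]:
        if min(len(ref), len(alt)) == 1:
            if pos == 1:
                break  # cannot extend past the start of the sequence
            pos -= 1
            base = reference[pos - 1]
            ref, alt = base + ref, base + alt
        ref, alt = ref[:-1], alt[:-1]

    # Remove redundant leading bases, keeping one base of context.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt
```

With the reference `GGCACACAT`, a deletion of `CA` written at position 6 as `ACA` to `A` is moved to position 2 as `GCA` to `G`, which is the left-most equivalent within the repeat.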

  • Allelic Primitives: Check this option to split multi-nucleotide polymorphisms into the SNP representations that provide the best support for annotation.

Allelic Primitives Example

Allelic primitives visual example

In the example above, the original variant (above) has the ref/alt “TCAT/GCAG”. This can be simplified by splitting the variant into two separate “T/G” SNPs (below). The simplified representation is a more general form of the original variation and is more likely to be found in annotation sources.
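The split described in this example can be reproduced with a short sketch. It assumes the reference and alternate alleles have equal length and simply compares them base by base; the function name `split_mnp` and the starting position 100 are hypothetical, and VarSeq's own decomposition may differ in detail.

```python
from typing import List, Tuple


def split_mnp(pos: int, ref: str, alt: str) -> List[Tuple[int, str, str]]:
    """Split a multi-nucleotide polymorphism into its component SNPs.

    Returns one (position, ref_base, alt_base) tuple per mismatching base.
    """
    if len(ref) != len(alt):
        raise ValueError("REF and ALT must have the same length")
    return [
        (pos + offset, ref_base, alt_base)
        for offset, (ref_base, alt_base) in enumerate(zip(ref, alt))
        if ref_base != alt_base
    ]
```

Calling `split_mnp(100, "TCAT", "GCAG")` gives two T/G SNPs, one at position 100 and one at position 103; the matching `CA` bases in the middle produce no records.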

Allelic Primitives
  • Split Variants Based On Unique Genotypes: When multiple individuals have mutations at the same locus (same chromosome and reference alleles), some variant files place them all in one “record”. This option splits that record into records matched to each individual's genotype alleles, allowing annotation and filtering to be precise for each distinct genotype.

Split Variants

Note

This option is available when importing Individual Samples

This provides the following:

  • Each site is broken into all of the possible genotypes (with one or two alternate alleles), that can be constructed from the original alleles.

  • Sample genotypes will be filled from the alleles assigned to each sub-feature when possible.

  • Only the RefAlt field is updated for each feature. The sample fields and the alternates are copied from the original feature.

  • Annotations are most-specific, with samples with only one alternate being properly annotated to annotation records with that alternate allele.

  • Allele counts will be calculated for each of the split features, providing counts for each of the different allele combinations.
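One way to read the first point above is that a multi-allelic site yields one sub-feature per single alternate allele plus one per pair of alternates. The sketch below enumerates those combinations under that interpretation; the function name `split_alternates` is hypothetical and the enumeration order is an assumption.

```python
from itertools import combinations
from typing import List, Tuple


def split_alternates(alts: List[str]) -> List[Tuple[str, ...]]:
    """List the alternate-allele combinations a site could be split into.

    Each combination holds one or two alternate alleles from the record.
    """
    singles = [(alt,) for alt in alts]
    pairs = list(combinations(alts, 2))
    return singles + pairs
```

A site with alternates G and T would give three sub-features under this reading: G alone, T alone, and the G/T combination.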

  • Liftover: Use this function when the genome assembly used to call your variants (e.g. GRCh37) is different from the genome assembly selected for your project (e.g. GRCh38). Selecting your project’s genome assembly from the drop-down will convert, or “liftover”, variant coordinates to those matching the project.

Liftover
  • Subset to Regions: Allows you to limit the data imported from your files to one or more chromosomes and/or a specified range of genomic coordinates.

Subset to genomic regions
  • Subset to Track: Allows you to limit the data imported from your files to the regions specified by an Annotation track or BED file. Clicking on Select Track opens the Data Source Library, where you can select an Annotation source.

    Depending on the Annotation source, you may need to specify additional parameters. The regions defined in the source can be expanded to include nearby variants by using the BP option. To include only variants contained within the regions, set this option to 0 +/- BP. If a gene source is selected, there is the option to filter variants contained within Exons Only or within the Full Transcript.

Subset to Track

Note

It is highly suggested that this option be used for Whole Genome variant files, and that the annotation source be one that would be used during the filtering process anyway.
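The effect of the BP option on Subset to Track can be sketched as an interval test with padding. The example below assumes regions are given as (chromosome, start, end) tuples with 1-based inclusive coordinates; the function name `in_padded_regions` and the coordinates used are hypothetical and are not taken from VarSeq.

```python
from typing import Iterable, Tuple

Region = Tuple[str, int, int]  # (chromosome, start, end), 1-based inclusive


def in_padded_regions(chrom: str, pos: int, regions: Iterable[Region], bp: int = 0) -> bool:
    """Return True if a position lies within any region extended by +/- bp."""
    return any(
        region_chrom == chrom and start - bp <= pos <= end + bp
        for region_chrom, start, end in regions
    )
```

With a region spanning positions 1000 to 2000 on chromosome 1, a variant at position 990 is excluded when BP is 0 and included when BP is 20.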

  • Filter Flags: Allows you to limit the data imported from your files by selecting fields to filter using the available options.

Filter Flags

Once the options are set click Finished to import the data.