Importing Data

You can import one or more VCF, gVCF, 23andMe text files, or variant TSF files into a new project. VarSeq also supports annotating from numerous other file types.

If your project was created from a template, the Import Variants dialog will automatically appear after creating a new project. Otherwise, you can click on the Import Variants... button and select one or more files to import.

Select Files to Import

The import wizard will step you through all of the import options to bring variant level data (with or without sample level fields) into VarSeq.

The first step is to select the files to import.

Import Wizard Step 1

Select the files to import on the first step of the import wizard

Note

Use the Append together files with matching sample names option to indicate whether files that share the same sample names should be appended together.

Here are two examples:

  • If there is one file per chromosome for each sample or containing all samples, leave this option Checked.
  • If there are multiple files per sample and the variants are to be compared between files (tumor/normal, various alignment algorithms, etc.), Uncheck this option.
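
As a rough illustration of what this option changes, the sketch below groups per-file variant lists by sample name. The function `group_files`, the input shape, and the way unappended samples are labeled are all assumptions made for this example; they do not describe how VarSeq names or stores samples.

```python
def group_files(files, append_matching=True):
    """Group variants by sample name (append) or keep one entry per file.

    files: list of (file_name, sample_name, variants) tuples (hypothetical shape).
    """
    grouped = {}
    for file_name, sample, variants in files:
        # When appending, files that share a sample name feed a single sample.
        # Otherwise each file keeps its own entry so the files can be compared.
        key = sample if append_matching else f"{sample} ({file_name})"
        grouped.setdefault(key, []).extend(variants)
    return grouped


per_chromosome = [
    ("chr1.vcf", "NA12878", ["1:100 A/G"]),
    ("chr2.vcf", "NA12878", ["2:200 C/T"]),
]
appended = group_files(per_chromosome, append_matching=True)
separate = group_files(per_chromosome, append_matching=False)
print(list(appended))
print(list(separate))
```

With the option checked, the two per-chromosome files contribute to one sample. With it unchecked, each file is kept as its own entry.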

Note

When importing 23andMe data from delimited text files, a local copy of a dbSNP annotation source is required to determine reference and alternate alleles for each RS ID listed. You can download a copy through the Data Source Library. Different versions of dbSNP may produce different results for some variants.
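
The sketch below shows why the dbSNP source is needed: a 23andMe line carries only an RS ID, chromosome, position, and genotype, so the reference and alternate alleles have to be looked up. The function `resolve_alleles` and the `toy_dbsnp` dictionary are hypothetical stand-ins, the RS ID is made up, and strand differences are ignored.

```python
def resolve_alleles(raw_line, dbsnp):
    """Look up REF/ALT for one 23andMe line using a dbSNP-like mapping.

    dbsnp: dict mapping RS ID -> (ref, [alts]); a stand-in for a local dbSNP source.
    Returns None for comments, blank lines, or RS IDs missing from the mapping.
    """
    if raw_line.startswith("#") or not raw_line.strip():
        return None
    rsid, chrom, pos, genotype = raw_line.rstrip("\n").split("\t")
    if rsid not in dbsnp:
        return None  # alleles cannot be determined without a dbSNP record
    ref, alts = dbsnp[rsid]
    # Alternate alleles actually called in this genotype, in dbSNP order.
    called_alts = [a for a in alts if a in genotype]
    return {"rsid": rsid, "chrom": chrom, "pos": int(pos),
            "ref": ref, "alts": called_alts, "genotype": genotype}


toy_dbsnp = {"rs0000001": ("A", ["G"])}  # made-up record for illustration
record = resolve_alleles("rs0000001\t1\t12345\tAG", toy_dbsnp)
print(record)
```

Because the lookup depends on the contents of the mapping, a different dbSNP version can change or drop the result for a given RS ID.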

If the files have not already been compressed and indexed, they will be compressed and indexed after you click Next >. Otherwise, the next step is to select the relationship between or type of samples.

Select Relationship

If you are importing variants into an empty template and your files contain sample level information (i.e. they contain more than just site level information) you will be asked if the samples are related or unrelated. Select the appropriate relationship or type of import. The options include:

  • Individual Samples: The samples in the file(s) that you are importing are not related to each other. You will be able to select the affected individuals on the next page.

  • Family Samples: The samples in the file(s) that you are importing are related to one another. You will be able to select the affected individuals and specify parent relationships on the following page.

    Note

    Select this option if you have one or more families. This option does not require that all of the samples are in the same family.

  • Cancer Samples: The samples in the file(s) that you are importing are from cancer gene panels. You will be able to select the affected individuals on the next page.

  • Tumor/Normal Samples: The samples in the file(s) that you are importing contain tumor/normal pairs. You will be able to select the tumor samples and the matching normal samples on the next page.

Import Wizard Step 2

Specify the relationship between the samples

Edit Sample Information

Once the relationship has been specified, you will be prompted to edit the sample information. You can change the sample name and the affection status (affected, carrier, or control) and, if the samples are related, set the parents for the children.

If the sample information is also contained in a plain text file (or pedigree file), you can import it by clicking on From Text File above the sample table. See Import Sample Information from Text File for more information.

If the Family Samples option was selected on the previous page, the sample editing page will look like the image below.

Import Wizard Step 3 Family

Edit the sample information to specify relationships and affection status for family data

If either the Individual Samples or the Cancer Samples option was selected on the previous page, the sample editing page will look like the image below.

Import Wizard Step 3 Individual

Edit the sample information to specify affection status for individual samples

If the Tumor/Normal Samples option was selected on the previous page, the sample editing page will look like the image below.

Import Wizard Step 3 Tumor/Normal

Edit the sample information to specify tumor/normal status and matched normal sample name.

Sample information can be edited on this page for a maximum of 100 samples, although it is not practical to specify that much sample information in this format. Alternatively, the sample information can be imported from a text file.

Import Sample Information from Text File

Instead of manually specifying the sample information, it can be imported from a text file. The data in the text file will be matched by sample name.

General text file importing parameters are available to handle most text file formats including text pedigree files, or files without a header.

The text file needs to specify the affection status for the samples, or tumor/normal status at a minimum. Other phenotype information can be imported as well. To set a column as Affected?, Tumor?, or as a field to import, right click on the column header and select the desired option.

A best attempt is made to detect the column that contains the sample names for matching. If the correct field from the text file is not selected, right click on the column header of the correct column and choose Set as Sample Names.

Basic Sample Info

Importing sample information from a file that just contains the affection status
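
A minimal sketch of the matching step, assuming a tab-delimited file with a header row. The function `load_affection_status` and the column names `Sample` and `Affected` are hypothetical and chosen only for this example.

```python
import csv
import io


def load_affection_status(text, sample_column, affected_column, project_samples):
    """Return {sample_name: status} for rows whose name matches a project sample."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    status = {}
    for row in reader:
        name = row[sample_column]
        if name in project_samples:  # rows are matched by sample name
            status[name] = row[affected_column]
    return status


sample_sheet = "Sample\tAffected\nS1\tTrue\nS2\tFalse\nS9\tTrue\n"
result = load_affection_status(sample_sheet, "Sample", "Affected", {"S1", "S2"})
print(result)
```

Rows such as `S9` that do not match any sample being imported are skipped in this sketch.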

If the text file contains a field with secondary sample names, the sample names can be renamed using this field. To set the field as Renamed Sample Names click on the column header and select Renamed Sample Names. This fills in the Renamed Sample column in the import wizard.

If the samples are imported as Family Samples, then the text file can specify the Father and Mother IDs in addition to the affection status. The gender can also be imported for later use in algorithms.

Note

If a field is set for Renamed Sample Names, the Father and Mother IDs must match the renamed sample names and not the original sample names.

Pedigree Sample Info

Importing pedigree information from a text file
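
The rule in the note above can be checked before import with a short script. The sketch below assumes a hypothetical row layout with `sample`, `renamed`, `father`, and `mother` keys and treats an empty string or `0` as "no parent"; none of these names come from VarSeq.

```python
def check_parent_ids(rows):
    """Return a list of problems where a parent ID is not a renamed sample name.

    rows: list of dicts with hypothetical keys
          'sample', 'renamed', 'father', 'mother' ('' or '0' means no parent).
    """
    renamed = {row["renamed"] for row in rows}
    problems = []
    for row in rows:
        for role in ("father", "mother"):
            parent = row[role]
            if parent in ("", "0"):
                continue
            if parent not in renamed:
                problems.append(
                    f"{row['renamed']}: {role} '{parent}' is not a renamed sample name"
                )
    return problems


trio = [
    {"sample": "S1", "renamed": "Child", "father": "Dad", "mother": "S3"},
    {"sample": "S2", "renamed": "Dad", "father": "", "mother": ""},
    {"sample": "S3", "renamed": "Mom", "father": "", "mother": ""},
]
problems = check_parent_ids(trio)
print(problems)
```

Here the mother is given by her original name `S3` rather than the renamed `Mom`, so the check reports it.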

If the samples are imported as Tumor/Normal Samples, then the text file can specify the matched normal sample IDs in addition to the tumor/normal status.

Note

If a field is set for Renamed Sample Names the matched normal IDs must match the renamed sample names and not the original sample names.

Tumor/Normal Sample Info

Importing tumor/normal information from a text file

Associating Sample BAM Files

In addition to being specified in a text file, the file path to the sample’s coverage BAM file can also be set through the BAM File association dialog, which can be opened by selecting Associate BAM File above the sample field table.

BAM Association Dialog

Pairing samples with corresponding BAM files

The directory of the first input file is recursively searched for BAM files when the dialog is first opened. If the sample BAM files are in a different location, the search directory can be changed by selecting Browse. Match a sample with its corresponding BAM file by typing the name of the file in its File Name text box, or by selecting it from the sample’s File Name drop-down menu.
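
To prepare the pairing outside the dialog, a similar recursive search can be sketched in a few lines. The function `find_bam_candidates` and its prefix-matching rule are assumptions for illustration and may not match how the dialog suggests files.

```python
import tempfile
from pathlib import Path


def find_bam_candidates(search_dir, sample_names):
    """Recursively list BAM files and suggest one per sample by file-name prefix."""
    bam_files = sorted(Path(search_dir).rglob("*.bam"))
    matches = {}
    for sample in sample_names:
        # Hypothetical matching rule: the file name starts with the sample name.
        hits = [p for p in bam_files if p.name.startswith(sample)]
        matches[sample] = hits[0] if hits else None
    return matches


# Small demonstration using a temporary directory tree.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "run1").mkdir()
    (Path(tmp) / "run1" / "S1.sorted.bam").touch()
    found = find_bam_candidates(tmp, ["S1", "S2"])
    found_names = {s: (p.name if p else None) for s, p in found.items()}
print(found_names)
```

Prefix matching is deliberately simple: a sample named `S1` would also match `S10.bam`, so the suggestions should be reviewed before use.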

Modify How Fields in the Variant Files are Imported

Important

This is an Advanced option. To get to this page, check the Advanced Options check box in the lower left-hand corner of the wizard, then either advance past the sample editing page or go back from the final page.

All of the fields from the files selected for import are displayed on the Edit Field Merge and Type Behavior page of the import wizard.

By default all INFO fields from the variant files are imported as variant site fields and all FORMAT fields are imported as sample fields. When merging multiple files together there can be differing values in the INFO fields. The options presented are designed to handle this possibility.

Certain INFO fields are automatically elevated to sample fields: FILTER and DP (Read Depth). Any of the other fields can also be elevated to a sample field by selecting Sample in the drop-down menu in the Merge Behavior column.

Edit Field Merge Options

Edit field merge and type behavior advanced option dialog

By default, all other INFO fields will be merged by creating a Unique list of values for the field across all samples and files. This keeps the field a variant site field. Other merge options include:

  • NumericMax: For integer, integer array, float or float array field types. Takes the maximum of all values for the field in all files.
  • NumericMin: For integer, integer array, float or float array field types. Takes the minimum of all values for the field in all files.
  • NumericMean: For integer, integer array, float or float array field types. Takes the mean of all values for the field in all files.
  • KeepMatching: All field types. Only keep the value if all files that have a value for the specified field match.
  • TakeFirst: All field types. Take the first value seen.
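
The merge behaviors above can be sketched for scalar values as follows. The function `merge_values` is hypothetical, array field types are not handled, and treating TakeFirst as "first non-missing value" is an assumption.

```python
from statistics import mean


def merge_values(values, behavior="Unique"):
    """Merge one field's values from several files into a single value.

    values: per-file values in file order; None means the file had no value.
    """
    present = [v for v in values if v is not None]
    if not present:
        return None
    if behavior == "Unique":
        return list(dict.fromkeys(present))  # distinct values, first-seen order
    if behavior == "NumericMax":
        return max(present)
    if behavior == "NumericMin":
        return min(present)
    if behavior == "NumericMean":
        return mean(present)
    if behavior == "KeepMatching":
        # Keep the value only if every file that has one agrees.
        return present[0] if all(v == present[0] for v in present) else None
    if behavior == "TakeFirst":
        return present[0]
    raise ValueError(f"unknown merge behavior: {behavior}")


depths = [30, None, 40, 30]  # one value per file; the second file has none
for name in ("Unique", "NumericMax", "NumericMin", "NumericMean",
             "KeepMatching", "TakeFirst"):
    print(name, merge_values(depths, name))
```

Only Unique can return several values for one field; the other behaviors reduce the per-file values to a single value or to nothing.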

Import Summary

The final page of the import wizard is a summary of the import process. To finalize the import click Finished.

Typical Summary

The import summary for the import wizard for a trio from a single VCF file

Once the variants have been imported you may annotate the variants using annotation sources, or run a computation algorithm on the variants using sources already added.

  • Specify Genomic Regions to Import: To only import variants from a particular region, or one or more chromosomes, enter the region(s) into this option.

    Region Suggestions

    The information for the subset imported chromosome option

  • Import Regions Defined by Annotation File: To only import variants within regions defined by an annotation source or BED file, select this option and select the file defining the regions. Allowed source types are gene or interval sources.

    The regions defined in the source can be expanded to include nearby variants by using the BP option. To include only variants contained within the regions, set this option to 0 +/- BP.

    If a gene source is selected, there is the option to filter variants contained within Exons Only or within the Full Transcript.

    Note

    It is highly suggested that this option be used for Whole Genome variant files, and that the annotation source be one that would be used during the filtering process anyway.

  • Select filters ...: To only import variants that have a particular FILTER value, select those filters by checking the box in front of the available options.
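
The region and FILTER options above act as a pre-import filter. The sketch below shows the idea, with the BP option modeled as `padding_bp`. The function `keep_variant`, its parameters, and the 1-based inclusive coordinates are assumptions for this example (BED files, for instance, use a different coordinate convention).

```python
def keep_variant(chrom, pos, filter_value, regions, padding_bp=0, allowed_filters=None):
    """Return True if a variant passes the region and FILTER selections.

    regions: list of (chrom, start, end), 1-based inclusive (assumed convention).
    allowed_filters: set of accepted FILTER values, or None to accept any.
    """
    if allowed_filters is not None and filter_value not in allowed_filters:
        return False
    for r_chrom, start, end in regions:
        # padding_bp widens each region on both sides; 0 keeps it strict.
        if chrom == r_chrom and start - padding_bp <= pos <= end + padding_bp:
            return True
    return False


regions = [("1", 1000, 2000)]
print(keep_variant("1", 2010, "PASS", regions))                 # outside the strict region
print(keep_variant("1", 2010, "PASS", regions, padding_bp=20))  # inside once padded
```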

To change the variant import algorithms, click on the Advanced Options check-box in the lower left hand corner. This will provide additional options and should look like the image below for a trio imported from a single VCF file.

Advanced Summary

The import summary for the import wizard for a trio from a single VCF file with advanced options visible

Advanced options allow variants in the variant files to be adjusted using the following algorithms:

Left Align

Insertions and deletions that are not in their left-most representation will be re-aligned with a Smith-Waterman algorithm to give them a canonical representation.

Left Align Example

Left align visual example

In the example above, the “CAT” deletion has been moved down 6 bases. Moving variants to the left-most position like this allows for uniform comparison between variants that can be represented at more than one position.

Note

The Left Align algorithm requires a valid and local reference sequence that matches the assembly of the data being imported.
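
The effect of left alignment can be reproduced on a toy sequence. The sketch below does not use Smith-Waterman; it swaps in the simpler trim-and-extend normalization commonly applied to VCF records, and it shows why a reference sequence is needed. The function `left_align` and the example sequence are made up for illustration.

```python
def left_align(seq, pos, ref, alt):
    """Shift an indel to its left-most position within the reference `seq`.

    pos is the 1-based position of the first base of `ref` in `seq`.
    Returns the normalized (pos, ref, alt).
    """
    while True:
        can_trim = ref[-1] == alt[-1] and (min(len(ref), len(alt)) > 1 or pos > 1)
        if not can_trim:
            break
        # Drop the shared trailing base.
        ref, alt = ref[:-1], alt[:-1]
        if not ref or not alt:
            # An allele became empty: extend both one base to the left.
            pos -= 1
            base = seq[pos - 1]
            ref, alt = base + ref, base + alt
    # Drop shared leading bases while both alleles keep at least one base.
    while len(ref) >= 2 and len(alt) >= 2 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


reference = "GGCATCATCATTT"  # toy reference containing three CAT repeats
aligned = left_align(reference, 8, "TCAT", "T")  # deletion of the last CAT
print(aligned)
```

In this toy case the deletion of the last "CAT" repeat moves from position 8 to position 2, a shift of 6 bases, and both forms produce the same sequence after the deletion.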

Allelic Primitives

Multi-nucleotide polymorphisms will be split into the SNP representation that provides the best support for annotation.

Allelic Primitives Example

Allelic primitives visual example

In the example above, the original variant (above) represents a variant with the ref/alt “TCAT/GCAG”. This can be simplified by splitting the variant into two different “T/G” SNPs (below). The simplified representation is a more general form of the original variation and is more likely to be found in annotation sources.
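
For equal-length substitutions, the split can be sketched as a base-by-base comparison. The function `split_mnp` and the position 100 are hypothetical, and the sketch does not attempt to pick the representation with the "best support for annotation" as the import option does.

```python
def split_mnp(pos, ref, alt):
    """Split an equal-length MNP into one SNP per mismatching base."""
    if len(ref) != len(alt):
        raise ValueError("this sketch only handles equal-length substitutions")
    return [(pos + i, r, a) for i, (r, a) in enumerate(zip(ref, alt)) if r != a]


snps = split_mnp(100, "TCAT", "GCAG")
print(snps)
```

The two matching middle bases ("CA") are dropped, leaving one "T/G" SNP at the first position and another three bases later.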

Split Variants Based on Unique Genotypes

When multiple individuals have mutations at the same “site” (same chromosome and reference alleles), some variant files will place these all in one “record”. This option splits that record into records that are matched to each individual’s genotype alleles, allowing annotation and filtering to be precise for each individual genotype.

Note

This option is available when importing Individual Samples.

This provides the following:

  • Each site is broken into all of the possible genotypes (with one or two alternate alleles) that can be constructed from the original alleles.
  • Sample genotypes will be filled from the alleles assigned to each sub-feature when possible.
  • Only the RefAlt field is updated for each feature. The sample fields and the alternates are copied from the original feature.
  • Annotations are most-specific, with samples with only one alternate being properly annotated to annotation records with that alternate allele.
  • Allele counts will be calculated for each of the split features, providing counts for each of the different allele combinations.
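
The enumeration described in the first bullet can be sketched with `itertools.combinations`. The functions `genotype_records` and `record_for_sample` are hypothetical, and returning no record for homozygous-reference samples is an assumption rather than documented behavior.

```python
from itertools import combinations


def genotype_records(ref, alts):
    """Enumerate one record per combination of one or two alternate alleles."""
    records = []
    for size in (1, 2):
        for combo in combinations(alts, size):
            records.append({"ref": ref, "alts": list(combo)})
    return records


def record_for_sample(records, sample_alleles, ref):
    """Find the record whose alternates equal the sample's non-reference alleles."""
    sample_alts = sorted({a for a in sample_alleles if a != ref})
    for record in records:
        if sorted(record["alts"]) == sample_alts:
            return record
    return None  # homozygous reference: no specific record in this sketch


records = genotype_records("A", ["C", "G"])
print([r["alts"] for r in records])
print(record_for_sample(records, ("A", "C"), "A"))
```

A site with alternates C and G yields three records (C, G, and C with G), and a heterozygous A/C sample is matched to the record that carries only C.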

Flatten Variant Genotypes

When a variant at a given “site” (same chromosome and reference alleles) has more than one alternate, some variant files will place these all in one “record”. This option splits that record into records that are matched to each alternate allele, allowing annotation and filtering to be precise for each alternate allele.

Note

This option is available when importing Cancer Samples.

In this mode:

  • No attempt is made to keep genotypes intact. Every record has a single reference and alternate. The ALT field is updated and the appropriate values are taken from all “A” fields.
  • Genotypes that cannot be formed in the new split (i.e. they were 1/2 before) are set to half missing in each of the records (1/. and ./2). The fields are then copied into each of the split records.
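
A minimal sketch of this flattening, assuming genotypes are given as allele indexes (0 for the reference, k for the k-th alternate) and that the original allele numbers are kept in the split records, mirroring the "1/." and "./2" notation above. The function `flatten_record` and its data layout are hypothetical.

```python
def flatten_record(ref, alts, a_fields, genotypes):
    """Split a multi-allelic record into one record per alternate allele.

    a_fields: {field_name: [one value per alternate]}  ("A" fields)
    genotypes: {sample: (allele_index, allele_index)}, 0 = ref, k = k-th alternate
    """
    records = []
    for k, alt in enumerate(alts, start=1):
        split_genotypes = {}
        for sample, alleles in genotypes.items():
            # Alleles that belong to another alternate become missing (".").
            kept = [str(a) if a in (0, k) else "." for a in alleles]
            split_genotypes[sample] = "/".join(kept)
        records.append({
            "ref": ref,
            "alt": alt,
            # Each "A" field contributes the value for this alternate only.
            "info": {name: values[k - 1] for name, values in a_fields.items()},
            "genotypes": split_genotypes,
        })
    return records


flat = flatten_record("A", ["C", "G"], {"AF": [0.1, 0.2]},
                      {"S1": (1, 2), "S2": (0, 1)})
for rec in flat:
    print(rec)
```

Sample `S1`, which was 1/2 in the original record, becomes half missing in both split records, while `S2` keeps its 0/1 genotype in the record for the first alternate.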

Match Variants to Affected Individuals Genotypes

When multiple affected individuals have mutations at the same “site” (same chromosome and reference alleles), some variant files will place these all in one “record”. This option splits that record into records that are matched to each affected individual’s genotype alleles, allowing annotation and filtering to be precise for each affected individual.

Note

This option is available when importing Family Samples.

In this mode:

  • Each site is broken into all of the possible genotypes (with one or two alternate alleles) that can be constructed from the samples marked as “Affected” during import.
  • Alternate Alleles that are not represented in the samples are combined and placed in a separate record.
  • Sample genotypes will be filled from the alleles assigned to each sub-feature when possible.
  • The RefAlt field and the Alternates are updated, as are the “A” (alt matching) fields, to match the new alternates.
  • Annotations are most-specific for the “Affected” samples, with samples with only one alternate being properly annotated to annotation records with that alternate allele.
  • Allele counts will be calculated for each of the split features, providing counts for each of the different allele combinations.
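
The first two bullets can be sketched as follows: one record per combination of alternates carried by an affected sample, plus one record that collects the remaining alternates. The function `split_by_affected`, the sample names, and the data layout are hypothetical, and the updating of "A" fields and allele counts is not modeled.

```python
def split_by_affected(ref, alts, genotypes, affected):
    """Create one record per alternate combination seen in affected samples.

    genotypes: {sample: (allele, allele)} with alleles given as base strings.
    affected: set of sample names marked as Affected.
    Alternates not carried by any affected sample go into one final record.
    """
    alt_sets = []
    for sample in genotypes:  # dict order keeps the output stable
        if sample not in affected:
            continue
        carried = [a for a in alts if a in genotypes[sample]]
        if carried and carried not in alt_sets:
            alt_sets.append(carried)
    used = {a for combo in alt_sets for a in combo}
    leftover = [a for a in alts if a not in used]
    records = [{"ref": ref, "alts": combo} for combo in alt_sets]
    if leftover:
        records.append({"ref": ref, "alts": leftover})
    return records


family = {"Child": ("C", "G"), "Mom": ("A", "C"), "Dad": ("A", "T")}
split = split_by_affected("A", ["C", "G", "T"], family, {"Child"})
print(split)
```

With only the child marked as affected, the alternates C and G stay together in one record and the unrepresented alternate T is placed in a separate record.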

Once the options are set click Finished to import the data.