Importing Your Data Into A Project

Importing Data

The Import menu presents various menu items for importing different types of data. The methods for importing data are described in the following sections.

Text File

Comma-, space-, or tab-delimited text files saved with the extensions .csv, .txt, or .dat can be imported into SVS using the Import Text File dialog. The dialog has two tabs: Input File, which specifies general dataset import options, and Advanced Options. See Text File Import Window.

Text File Import Window

  • Input File: After opening the Import Text window, select a file by clicking on the Browse button, which allows you to navigate your file system.

    You must specify how a file is delimited in order to properly import the file. If the wrong delimiter is specified, a warning message will indicate the file may be using a different delimiter.

    The dataset name may be given at this time. This name will be applied to both the dataset node and the spreadsheet viewer node. The spreadsheet and parent node can be renamed after import.

    If the text file has a row label column, that column can be specified, or generic row labels can be created. A row label is generally a sample name or other information that identifies a row, and it is not generally used for analysis. If a text file is imported with the wrong column specified for row labels, this can be changed using the Spreadsheet Editor (see Editing the Row Label Header and Row Labels) without re-importing the file. The default is to use the first column of data as row labels.

  • Advanced Options: The Advanced Options tab allows you to specify a custom encoding list for missing data if your text file uses characters other than an empty field, period, comma, ?, or --- (three dashes). In the custom encoding box, enter a whitespace-delimited list of missing value encodings. This list will override the built-in missing encoding list, except for the empty field.
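The override rule for custom missing encodings can be pictured with a short Python sketch. This is not SVS code: the function names are hypothetical, and the sketch assumes that fields are compared after surrounding whitespace is stripped.

```python
# Built-in missing-value encodings named in the text: empty field, period,
# comma, question mark, and three dashes.
BUILT_IN_MISSING = {"", ".", ",", "?", "---"}


def missing_encodings(custom=None):
    """Return the set of strings treated as missing.

    `custom` stands for the whitespace-delimited text typed into the custom
    encoding box. A custom list replaces the built-in list, except that the
    empty field is always kept.
    """
    if custom is None or not custom.split():
        return set(BUILT_IN_MISSING)
    return {""} | set(custom.split())


def is_missing(field, encodings):
    # Assumption: surrounding whitespace is ignored when comparing.
    return field.strip() in encodings


default = missing_encodings()
custom = missing_encodings("NA -99")
print(is_missing("?", default))  # "?" is a built-in encoding
print(is_missing("?", custom))   # the custom list replaced the built-in one
print(is_missing("", custom))    # the empty field always counts as missing
```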

Text File Import Window - Advanced Options

If genotype data exists in the text file you can specify whether or not the program should read the data as genotypic. If you un-check the “Read Genotypic Data” box, then all genotype data will be read as categorical. The allele delimiter character can be specified by choosing from the drop-down menu or by choosing “Other ->” and indicating the character in the text box to the right of the menu. If there are non-genotype fields that have an underscore in the field, these columns will be read as genotypic. These columns can be changed to categorical using the spreadsheet editor after the file is imported into the project. The default behavior is to read all fields containing an underscore as genotype data. Columns with all missing values can be encoded as Genotypic by checking “Encode columns with all missing data as genotypic.”
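To see why a non-genotype column containing underscores can be mistaken for genotype data, consider the following Python sketch. The helper names are hypothetical, and the detection rule is a simplified assumption rather than the actual SVS logic.

```python
def split_genotype(field, delimiter="_"):
    """Split a genotype field such as 'A_G' into its two alleles.

    Returns None when the field does not split into exactly two
    non-empty parts.
    """
    parts = field.split(delimiter)
    if len(parts) != 2 or not all(parts):
        return None
    return parts[0], parts[1]


def column_looks_genotypic(values, delimiter="_", missing=("", "?")):
    """Hypothetical check: every non-missing value splits into two alleles."""
    observed = [v for v in values if v not in missing]
    return bool(observed) and all(
        split_genotype(v, delimiter) is not None for v in observed
    )


print(column_looks_genotypic(["A_A", "A_G", "?"]))     # real genotype column
print(column_looks_genotypic(["batch_1", "batch_2"]))  # label column, same shape
```

Both columns pass the check, which is why a column such as the second one may need to be changed to categorical in the spreadsheet editor after import.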

Header lines can be skipped by checking the “Skip” box and selecting the number of rows to skip. The default is to not skip any header lines.

The Base numeric type default is Boolean. This means that if a column of all 0’s was detected, it would be encoded as Boolean. You can also choose Integer, Single or Double precision float for the default type. There is also an option to encode real columns with single precision floats (as opposed to double precision). The values would then be stored in 4 bytes rather than 8 bytes.
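The storage trade-off between single and double precision can be checked with Python's standard `struct` module. This is only an illustration of the 4-byte versus 8-byte sizes; the helper names are hypothetical, and the estimate ignores any overhead in the actual SVS storage format.

```python
import struct

BYTES_SINGLE = struct.calcsize("<f")  # IEEE 754 single precision
BYTES_DOUBLE = struct.calcsize("<d")  # IEEE 754 double precision


def column_bytes(n_rows, double_precision=True):
    """Approximate storage for one real-valued column, ignoring overhead."""
    return n_rows * (BYTES_DOUBLE if double_precision else BYTES_SINGLE)


def round_trip_single(value):
    """Store a value as a single-precision float and read it back."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


print(column_bytes(1000, double_precision=False), column_bytes(1000))
print(round_trip_single(0.1) == 0.1)  # False: some precision is lost
```

Single precision halves the storage per value at the cost of roughly seven significant decimal digits instead of about sixteen.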

Third Party File

Most file formats of statistical and data management programs can be read using the Import Third Party Formats dialog. To import from a file, select the file and format you wish to use by clicking on the Browse button. The files are filtered by file format at the bottom of the Select File dialog. Any of the file types listed in the drop down menu can be imported for use in Golden Helix SVS. After a file has been selected, there are options for specifying column names and the allele delimiter. See Third Party File Import Window.

Third Party File Import Window

Once a file has been selected, clicking Next > will lead to steps appropriate to the file type. If the file format allows for more than one worksheet, such as in Microsoft Excel™, one sheet will have to be selected for import at a time. Other file formats or data sources might have additional steps. A row label column can be selected from the list of available columns or generic row label columns can be generated.

PED/TPED/BED File

Golden Helix SVS can import plain text PED/MAP files, plain text TPED/TFAM files, and optimized binary PED or BED files (which should have corresponding BIM/FAM files). Because these files have full marker map information for each SNP, the marker map is also imported into Golden Helix SVS and automatically applied to the dataset.

PED/TPED/BED File Import Window

In the Import PED/TPED/BED File window you can select the Text PED/MAP format, the Text TPED/TFAM format, or the Binary BED/FAM/BIM format. If you select the PED/MAP format, you can browse for a PED file by clicking on Browse. A corresponding MAP file will be automatically filled in, but you may choose to browse for a different MAP file. Similarly, if you select the TPED/TFAM format, you can browse for a TPED file and the corresponding TFAM file will be automatically filled in. If you select the BED/FAM/BIM format, you can browse for a BED file and the corresponding FAM/BIM files will be automatically filled in.
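The automatic fill-in of companion files can be approximated by swapping file extensions. The sketch below is an assumption about how such a lookup could work (same directory, same base name); `companion_files` is a hypothetical helper, not an SVS function.

```python
from pathlib import Path

# Companion extensions for each primary file type, as described in the text.
COMPANIONS = {
    ".ped": [".map"],
    ".tped": [".tfam"],
    ".bed": [".fam", ".bim"],
}


def companion_files(primary):
    """Guess companion paths by swapping the extension of the primary file.

    Assumption: companions share the primary file's directory and base name.
    """
    path = Path(primary)
    suffixes = COMPANIONS.get(path.suffix.lower())
    if suffixes is None:
        raise ValueError(f"unsupported file type: {path.suffix!r}")
    return [path.with_suffix(s) for s in suffixes]


print(companion_files("data/study.bed"))
```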

For all import formats, you can specify a dataset name, but you can only specify encoding options for the PED/MAP and TPED/TFAM formats. The default is to encode missing phenotypes in the Sex or Affection Status column as “-9” and missing genotypes as “0”. If your PED, TPED, and/or TFAM file has a different encoding you can specify it in the “Missing genotype” or “Missing phenotype” fields. Affection status encoding can also be specified, allowing either 1 or 0 to designate unaffected individuals in the data.

The final option allows you to specify whether you want to map numerical chromosomes (this converts non-autosomes to expected allosomes) or to import the chromosomes “as is”. If mapping numerical chromosomes, you can indicate whether you are importing human genome data or non-human genome data. If you are importing non-human genome data, the number of autosomal chromosomes in the data must be specified.
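As an illustration of why the number of autosomes matters, the sketch below maps numeric chromosome codes to labels. The numbering convention used here (X, Y, XY, and MT following the last autosome) is borrowed from the common PLINK coding and is an assumption; the mapping SVS applies may differ.

```python
def map_numeric_chromosome(code, n_autosomes=22):
    """Map a numeric chromosome code to a label.

    Assumed PLINK-style convention: with n autosomes, n+1 is X, n+2 is Y,
    n+3 is XY (pseudo-autosomal region), and n+4 is MT.
    """
    special = {1: "X", 2: "Y", 3: "XY", 4: "MT"}
    number = int(code)
    if 1 <= number <= n_autosomes:
        return str(number)
    label = special.get(number - n_autosomes)
    if label is None:
        raise ValueError(f"unexpected chromosome code: {code!r}")
    return label


print(map_numeric_chromosome("23"))                 # human data, 22 autosomes
print(map_numeric_chromosome(30, n_autosomes=29))   # a genome with 29 autosomes
```

Under this assumed convention the same numeric code can mean an autosome in one species and a sex chromosome in another, which is why the autosome count has to be supplied for non-human data.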

Note

The first four columns of the PED, TFAM, and FAM file formats are identifiers and encode missing values with the “0” string.
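The default encodings can be illustrated with a minimal Python parser for a single PED line. It assumes the standard six leading columns (family ID, individual ID, father ID, mother ID, sex, affection status) followed by two allele fields per marker; `parse_ped_line` is a hypothetical helper written for this example.

```python
def parse_ped_line(line, missing_genotype="0", missing_phenotype="-9"):
    """Parse one whitespace-delimited PED line into a dictionary."""
    fields = line.split()
    if len(fields) < 6 or (len(fields) - 6) % 2:
        raise ValueError("expected 6 leading columns plus allele pairs")

    def ident(value):
        # Identifier columns encode missing values as "0".
        return None if value == "0" else value

    def pheno(value):
        # Sex and affection status use the missing phenotype code.
        return None if value == missing_phenotype else value

    alleles = fields[6:]
    genotypes = []
    for a, b in zip(alleles[0::2], alleles[1::2]):
        # A genotype is missing if either allele carries the missing code.
        genotypes.append(None if missing_genotype in (a, b) else (a, b))

    return {
        "family": ident(fields[0]),
        "individual": ident(fields[1]),
        "father": ident(fields[2]),
        "mother": ident(fields[3]),
        "sex": pheno(fields[4]),
        "affection": pheno(fields[5]),
        "genotypes": genotypes,
    }


print(parse_ped_line("FAM1 IND1 0 0 1 -9 A G 0 0 C C"))
```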

Golden Helix DSF File

The Dataset Storage Format (DSF) is designed to allow for the sharing and collaboration of datasets between Golden Helix SVS users. The DSF format is also open to third-parties to develop the ability to create DSF files from their own products or data sources and thus more easily integrate with Golden Helix SVS.

DSF files can be imported into SVS in two ways. The first way is through the Import > Golden Helix DSF dialog which prompts the user to select a single DSF file to import. The second way is to select one or more DSF files in a file system browser and drag and drop the DSF files into an open Golden Helix SVS project. The files must be dropped in the Project Navigator Window.

Note

As of SVS version 7.6.0 multiple DSF files can be selected for import by selecting multiple files in a file system browser and dragging and dropping them in an open project.

This action still imports one file at a time, but additional files are imported immediately after the previous file has been completely imported into the project.

Legacy Golden Helix GHD File

Legacy GHD files can be imported into Golden Helix SVS by navigating to the GHD file in the import dialog and selecting it. The dataset name is retained from when the GHD file was created, but an applied marker map is not preserved.

Public Data

Public data, such as HapMap data, can be downloaded and imported directly into Golden Helix SVS. The available datasets are listed in a dialog. The dataset name is retained from the selection menu, but a marker map is not applied.

Affymetrix Files

Both SNP and CNV data from Affymetrix chips can be imported into an SVS project for analysis. SNP data can be imported in either the CHP or CNT formats. CNV data can be imported in the CEL, CNT and CNCHP formats.

Analysis results generated by the Affymetrix Chromosome Analysis Suite (ChAS) in the CYCHP format and CEL files generated by processing the Cytogenetics Whole-Genome Arrays can also be imported into an SVS project for analysis.

CNV Data

For the Affymetrix 500k, SNP 5.0, and SNP 6.0 arrays, Golden Helix SVS supports reading CEL intensity files and calculating normalized log2 ratios for copy number segmentation and association analysis. For the Affymetrix 10k, 100k, and 500k arrays, you may use the Affymetrix CNAT Batch Analysis tool to create CNT files; for the 100k, 500k, and SNP 6.0 arrays, use the Genotyping Console to create CNCHP files. These files contain normalized log2 ratios and can be imported into a dataset for analysis in SVS. See Extracting Affymetrix Copy Number Data for use in SVS for instructions on creating CNT or CNCHP files using Affymetrix tools. Affymetrix CEL, CNT, or CNCHP files can be imported directly into Golden Helix SVS versions 7.0 and higher without any additional steps.

Affymetrix CHP File

Golden Helix SVS can directly import CHP files from the Affymetrix 10k, 100k, 500k, 5.0, and 6.0 GeneChip® mapping arrays.

Affymetrix Files Installation

For mapping arrays prior to the SNP 5.0 and SNP 6.0 arrays, you must have the corresponding library file installed for each type of mapping array you want to import. SNP 5.0 and 6.0 mapping arrays do not require library files. If you have GCOS installed, it is likely the library files are already installed in either C:/GeneChip/Library or C:/GeneChip/Affy_Data/Library (on Windows).

If you do not have GCOS installed, or need other Library files, they can be downloaded from Affymetrix through the NetAffx service. See Genetic Marker Maps and Affymetrix Library Files for importing library files through NetAffx.

By default, Golden Helix SVS will look for mapping array library files in the C:/Program Files/Golden Helix SVS/AffyLibraryFiles directory or the last directory used for library files. There is an option for specifying the directory whenever a library file is needed for importing Affymetrix files.

The library files available from the NetAffx service are for the final versions of these mapping arrays. If you were using an experimental early access array, you will need to get the appropriate library files from Affymetrix. All that is needed for Golden Helix SVS is the CDF file for the array.

Affymetrix Mapping Array Import

To import Affymetrix CHP files you can either select all of the files to import using Add Files from the Import CHP Files... dialog, or add an entire directory by choosing Add Directory. If you wish to remove CHP files from the list in the File box, select the files to remove and click Remove. Multiple selection is allowed by <Shift>-left-click to select a block of files or by <Ctrl>-left-click to select individual files. See Affymetrix CHP File Import Window.

Affymetrix CHP File Import Window

Note

The “100k” and “500k” arrays are composed of two “50k” and two “250k” chips, respectively, each with its corresponding library file. The two chips need to be imported separately and then joined together from the spreadsheet File menu. See Joining or Merging Spreadsheets for instructions on how to join the datasets from the two chips.

The other options in this import window include specifying a dataset name, changing the library path, and filtering calls based on confidence score (p-value). When importing CHP files from SNP 5.0 or 6.0 arrays, the library file location can be ignored, as it is not needed for the import process.

If you wish to use a different threshold for the confidence score, check the box and fill in the desired confidence score (a number between 0 and 1). Changing the confidence score is only valid for certain, more recent file types such as 100k or 500k CHP files. During the import process Golden Helix SVS will screen whether changing the confidence score is valid for your particular files.

Note

  1. The Affymetrix CHP files do not contain phenotypic information about individuals. This data must be imported separately. When doing so, make sure that the label column for those individuals matches the CHP file identifier in the spreadsheet. From either spreadsheet, you can join on column labels to get a combined spreadsheet. See Joining or Merging Spreadsheets for more information on joining spreadsheets.
  2. You may wish to import marker map information for the mapping array dataset. Annotation data can be retrieved from Affymetrix NetAffx service, with appropriate login privileges. See Genetic Marker Maps and Affymetrix Library Files for instructions on how to obtain annotation data from NetAffx.

Affymetrix CEL Files

The Affymetrix CEL import tool reads CEL intensity files, normalizes the intensity values against the chosen or default reference samples, and imports the normalized log2 ratios into Golden Helix SVS. The methodology for calculating and normalizing log2 ratios from the CEL files is described in the Quantile Normalization of Affymetrix CEL Files section.

Affymetrix CEL File Import Window

From the Import Affymetrix CEL file dialog (see Affymetrix CEL File Import Window), first select the CEL files you want to include in the dataset. For Mapping 500k data, you must select files from both the NSP and STY arrays for each sample. To select CEL files, click the Add Files button and use the file browser to select multiple CEL files. The CEL files you selected will appear in the CEL import dialog window. You may add all of the CEL files in a directory by using the Add Directory button. To remove CEL files from the window, select the unwanted samples and click Remove. You may continue adding CEL files by clicking the Add Files or the Add Directory buttons again. Multiple selection is allowed by <Shift>-left-click to select a block of files or by <Ctrl>-left-click to select several individual files.

In the next window, specific import options can be specified.

Affymetrix CEL File Import Window

For the import of Mapping 500k CEL files, a matching spreadsheet containing the file names must be available in Golden Helix SVS. This spreadsheet will tell the CEL import tool how to join the NSP and STY samples together to create one sample per patient. The matching spreadsheet should have a row label column and at least two data columns. The row labels should be the sample names. The first and second columns should be the NSP and STY file names. Other columns in the dataset are optional but may contain the reference status for the sample. For Mapping 500k CEL file import, check the 500k NSP/STY Matching box and select the matching spreadsheet by clicking on Select Sheet.
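The layout of the matching spreadsheet can be sketched as a small tab-delimited table. The Python example below reads such a table into a mapping from sample name to its NSP and STY file names. The header names and file names are made up for illustration, and `read_matching_sheet` is a hypothetical helper.

```python
import csv
import io


def read_matching_sheet(text, delimiter="\t"):
    """Map each sample name to its (NSP file, STY file) pair.

    Assumes a header row, then the row label (sample name) followed by the
    NSP and STY file names; any further columns are ignored here.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    next(reader)  # skip the header row
    pairs = {}
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        if len(row) < 3:
            raise ValueError(f"row needs a label and two file names: {row}")
        pairs[row[0]] = (row[1], row[2])
    return pairs


sheet = (
    "Sample\tNSP\tSTY\n"
    "P001\tP001_Nsp.CEL\tP001_Sty.CEL\n"
    "P002\tP002_Nsp.CEL\tP002_Sty.CEL\n"
)
print(read_matching_sheet(sheet))
```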

The default reference set includes all samples. Another option is to select a subset from a spreadsheet containing the Reference Status for the samples. The row labels should match the sample names. For the SNP 5.0 and SNP 6.0 Array, the row labels should be the file names of the CEL files with the CEL extension removed. The reference status column should contain 0’s and 1’s where 0 denotes reference and 1 denotes non-reference status. All of the samples will be normalized against the reference samples. When a spreadsheet is selected, the 0=Ref 1=Non-Ref Column drop down box will contain the names of columns of binary data in the selected spreadsheet. Select the name of the column to be used as the reference status.

You also have the option to omit samples with the reference designation from the final output spreadsheet. To do this, check the appropriately named box. If this option is selected, reference samples will be used in normalization of data and calculation of LogR values, but will not be included in the output spreadsheet.
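The role of the reference status column can be illustrated with a simplified calculation. The sketch below assumes that each log2 ratio is the sample intensity divided by the per-marker median of the reference samples; the quantile normalization step that SVS performs is left out, and `log2_ratios` is a hypothetical function.

```python
import math
from statistics import median


def log2_ratios(intensities, ref_status, omit_references=False):
    """Compute per-marker log2 ratios against the reference median.

    intensities: {sample: [intensity per marker]}, all values positive.
    ref_status:  {sample: 0 or 1}, where 0 = reference, 1 = non-reference.
    """
    references = [s for s in intensities if ref_status[s] == 0]
    if not references:
        raise ValueError("at least one reference sample is required")
    n_markers = len(next(iter(intensities.values())))
    ref_median = [
        median(intensities[s][m] for s in references) for m in range(n_markers)
    ]
    result = {}
    for sample, values in intensities.items():
        if omit_references and ref_status[sample] == 0:
            continue  # used for normalization, dropped from the output
        result[sample] = [math.log2(v / r) for v, r in zip(values, ref_median)]
    return result


data = {"r1": [100.0, 200.0], "r2": [100.0, 200.0], "t1": [200.0, 100.0]}
status = {"r1": 0, "r2": 0, "t1": 1}
print(log2_ratios(data, status, omit_references=True))
```

With `omit_references=True`, the reference samples still determine the medians but do not appear in the result, mirroring the option described above.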

Another reference set option is to use HapMap precomputed populations. All 270 samples or an ethnic subset can be used.

A Marker Map needs to be selected for use in the analysis. Probes that are not contained in the marker map will not be imported. In other words, if the marker map does not contain copy number probes and the CEL files do, those probes will not be in the resulting spreadsheet. The CEL files are scanned prior to this dialog and the appropriate marker map will be detected and auto-downloaded. If the marker map has already been downloaded, navigate and select it by clicking on Select Marker Map.

The Library Path where the CDF library files are located is also automatically detected. The directory can be changed by clicking on View Library Folder. The library files should contain both SNP and CN Probes.

You may optionally select a temporary directory where intermediate DSF files will be stored. If your project is located on a shared network drive, for performance reasons you should specify a Temp Directory on a local disk.

Output options include the A and B allele values before quantile normalization, the A and B allele values after quantile normalization but before the log ratios are computed, and the LogR ratios with samples arranged column-wise and row-wise.

The name of the Dataset can be specified at this time; the default is to name the dataset “Affy CEL Dataset”.

Note

  1. Affymetrix recommends using at least 25 samples as references in un-paired copy number analysis.
  2. The gender of the reference samples should be considered for copy number analysis of the X and/or Y chromosomes.
  3. The CEL conversion process will take several hours to complete.
  4. NSP files can be imported without the corresponding STY files (or vice versa). To do so, select only the NSP files, do not use a matching spreadsheet, and make sure that the row labels in the reference sheet match the CEL file names exactly.

Affymetrix CNT Files

The Affymetrix CNT import tool converts multiple CNT files into one aggregate spreadsheet that contains the log2 ratio values in a format ready to be used for analysis. CNT files can be created for the Mapping 10k, 100k, and 500k arrays or for any copy number data that can be converted into a text file. See Creating CNT Files using the Affymetrix CNAT Batch Analysis Tool and Affymetrix CNT File Format for information on creating Affymetrix CNT files.

CNT File Import Window

From the Import CNT Files... dialog (see CNT File Import Window), you can click Add Files to select CNT files to convert. This will open a file chooser where you can select one or more CNT files. The CNT files you selected will appear in the CNT file convert window. You can add all of the files in a directory by clicking Add Directory. To remove CNT files from the window, select the unwanted files and click Remove. You may continue adding CNT files by clicking the Add Files button again. Files cannot be added more than once, but files with the same name stored in different locations may be added to the same import.

You can also change the name of the dataset at this time.

Note

Row labels in the output spreadsheet will be determined by the file names, so files with the same name stored in different locations will have the same row labels.
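A quick way to anticipate duplicate row labels before importing is to compare the file names themselves. The Python sketch below assumes that a row label is the file name with its directory and extension removed; the helper names and example paths are hypothetical.

```python
from collections import Counter
from pathlib import PureWindowsPath


def row_labels(paths):
    """Derive a row label from each path (assumed: file name without extension)."""
    return [PureWindowsPath(p).stem for p in paths]


def duplicated_labels(paths):
    """Return the labels that would appear more than once in the output."""
    counts = Counter(row_labels(paths))
    return sorted(label for label, n in counts.items() if n > 1)


files = ["C:/run1/s01.cnt", "C:/run2/s01.cnt", "C:/run2/s02.cnt"]
print(duplicated_labels(files))
```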

Affymetrix CNCHP Files

The Affymetrix CNCHP import tool converts multiple CNCHP files into one aggregate spreadsheet containing the log2 ratio values in a format ready to be used for analysis. CNCHP files can be created for the Mapping 100k, 500k, and SNP 6.0 arrays. See Creating CNCHP Files Using Affymetrix Genotyping Console 2.0 for information on creating Affymetrix CNCHP files.

Affymetrix CNCHP File Import Window

From the Import CNCHP Files... dialog (see Affymetrix CNCHP File Import Window), you can click Add Files to select CNCHP files to convert. This will open a file chooser where you can select one or more CNCHP files. The CNCHP files you selected will appear in the CNCHP file convert window. You can add all of the files in a directory by clicking Add Directory. To remove CNCHP files from the window, select the unwanted files and click Remove. You may continue adding CNCHP files by clicking the Add Files button again. Files cannot be added more than once, but files with the same name stored in different locations may be added to the same import.

You can also change the name of the dataset at this time.

Note

Row labels in the output spreadsheet will be determined by the file names, so files with the same name stored in different locations will have the same row labels.

Affymetrix CYCHP Files

The Affymetrix CYCHP import tool converts multiple CYCHP files into aggregate spreadsheets containing one or more of the possible datasets contained within CYCHP files. The possible datasets are:

  • Log2Ratio: Creates a spreadsheet containing values that are the ratio of signal to median of Reference signal for every sample/marker pair in the array.
  • CN Segments: Creates a spreadsheet listing the segments for the CN dataset. The ‘Value’ column contains the Copy Number State. Also reported is the confidence, an indicator score for non-normal copy numbers.
  • LOH Segments: Creates a spreadsheet listing the segments for the LOH dataset. The value of the ‘Value’ column is 1 when a Loss of Heterozygosity is found, and 0 when not found. Also reported is the confidence, the ratio of the probability of the SCAR measurements under the LOH model to the sum of the probability under each of the LOH and non-LOH models.
  • Normal Diploid Segments: Creates a spreadsheet listing the segments for the Normal Diploid dataset. The value of the ‘Value’ column is 1 when CN State is 2 and LOH is 0. Otherwise, the value of this column is 0. Also reported is the confidence, the ratio of the probability of the SCAR measurements under the LOH model to the sum of the probability under each of the LOH and non-LOH models.
  • CN Neutral LOH Segments: Creates a spreadsheet listing the segments for the CN Neutral LOH dataset. The value of the ‘Value’ column is 1 when CN State is 2 and LOH is 1. Otherwise, the value of this column is 0. Also reported is the confidence, the ratio of the probability of the SCAR measurements under the LOH model to the sum of the probability under each of the LOH and non-LOH models.
  • Mosaicism Segments: Creates a spreadsheet listing the segments for the Mosaicism dataset. The ‘Value’ column contains the Copy Number State. Also reported is the confidence and mosaicism. Confidence is the proportion of markers that are above or below the thresholds required to make a CN change call for a running median segment size of 251. The value of the ‘Mosaicism’ column is 1 if more than one CN call was found in the segment.
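The indicator values and the LOH confidence described in the list above can be written out directly. The following Python sketch restates those definitions; the function names are hypothetical and the probabilities are supplied by the caller rather than derived from SCAR measurements.

```python
def normal_diploid(cn_state, loh):
    """1 when CN State is 2 and LOH is 0, otherwise 0."""
    return 1 if cn_state == 2 and loh == 0 else 0


def cn_neutral_loh(cn_state, loh):
    """1 when CN State is 2 and LOH is 1, otherwise 0."""
    return 1 if cn_state == 2 and loh == 1 else 0


def loh_confidence(p_loh, p_non_loh):
    """Probability under the LOH model divided by the sum under both models."""
    total = p_loh + p_non_loh
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    return p_loh / total


print(normal_diploid(2, 0), cn_neutral_loh(2, 1), loh_confidence(0.75, 0.25))
```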

Affymetrix CYCHP File Import Window

From the Import CYCHP Files... dialog (see Affymetrix CYCHP File Import Window), you can click Add Files to select CYCHP files to convert. This will open a file chooser where you can select one or more CYCHP files. The CYCHP files you selected will appear in the CYCHP file convert window. You can add all of the files in a directory by clicking Add Directory. To remove CYCHP files from the window, select the unwanted files and click Remove. You may continue adding CYCHP files by clicking the Add Files button again. Files cannot be added more than once, but files with the same name stored in different locations may be added to the same import.

You can also change the name of the dataset at this time.

Select the output datasets to create as well as indicate whether or not CN segment covariate spreadsheets should be created. The covariates spreadsheets are defined below.

  • One column per marker: A new column is created for every marker present in any segment for any sample. Column headers are marker names.
  • First column of each segment: A new column is created every time there is a new segment value for any one sample over all the samples. This creates common segments for all samples, although for a particular sample there may be more columns than there are segments. In the case where a new column is introduced but the segment value has not changed, the segment value is repeated for all columns in a segment. Column headers are the marker names of the first marker in each segment.
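The difference between the two covariate layouts can be sketched in Python. The example is one interpretation of the description above, not the SVS implementation: segments are given as (first marker index, last marker index, value) tuples, and a column is kept wherever any sample's value changes from the previous marker.

```python
def expand_per_marker(segments, n_markers):
    """'One column per marker': one value per marker for a single sample.

    segments: list of (first, last, value) with inclusive marker indices.
    """
    values = [None] * n_markers
    for first, last, value in segments:
        for i in range(first, last + 1):
            values[i] = value
    return values


def first_column_of_each_segment(samples, marker_names):
    """Keep only markers where any sample starts a new segment value."""
    n = len(marker_names)
    per_marker = {s: expand_per_marker(segs, n) for s, segs in samples.items()}
    keep = [
        i for i in range(n)
        if i == 0 or any(v[i] != v[i - 1] for v in per_marker.values())
    ]
    header = [marker_names[i] for i in keep]
    rows = {s: [v[i] for i in keep] for s, v in per_marker.items()}
    return header, rows


markers = ["m1", "m2", "m3", "m4", "m5"]
samples = {
    "A": [(0, 1, 2), (2, 4, 3)],  # value changes at m3
    "B": [(0, 2, 2), (3, 4, 1)],  # value changes at m4
}
header, rows = first_column_of_each_segment(samples, markers)
print(header)
print(rows)
```

Sample A has only two segments but receives three columns, with its second value repeated, because sample B introduces a breakpoint at m4.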

Affymetrix DMET Report

Using this option, Affymetrix DMET data is imported. Hemizygous markers are converted to homozygous markers, and tri-allelic markers are converted into two columns, each containing the major allele and one of the minor alleles. A marker map is requested after choosing the DMET file. If there are tri-allelic markers, additional markers are added to the marker map so that both columns will be mapped.

Illumina

Five different Illumina file types can be imported via the Illumina submenu of the Import menu: Illumina DSF File, Illumina Final Report From One or More Files, Illumina Single Final Report, Matrix Text File, and iControlDB Data.

Illumina DSF File

For the Illumina platform, you must use BeadStudio or GenomeStudio with the Golden Helix SVS DSF Plug-In to export the log2 ratio values from your Bead/GenomeStudio project. For instructions on how to install and use the plug-in see Exporting Data from GenomeStudio. With the DSF Plug-In, you can choose to export the entire project or specific chromosomes.

The Import Illumina DSF File dialog allows you to directly import log2 ratio data into a spreadsheet. This process will import all of the chromosomes stored in the DSF file.

Illumina Final Report From One or More Files

This option imports multiple fields of data from Illumina Final Report text files. It can be used to import only genotype data, or multiple real-valued columns such as B-Allele Frequency or Log R Ratios. The user can choose from all fields found in the text file and select which ones should be imported.

A separate dataset is created for each field, except for allele fields which are combined into a genotype column and GC score which is used to filter genotypes.

Single Illumina Final Report

This option imports multiple fields from a single Illumina Final Report text file. It can be used to just import genotype data, or multiple real-valued columns such as B-Allele Frequency or Log R Ratios. The user can choose from all fields found in the text file and select which ones should be imported.

A marker map contained in the Genetic Marker Maps folder can be selected to apply to all datasets after import.

A separate dataset is created for each field, except for allele fields which are combined into a genotype column and GC score which is used to filter genotypes.

Illumina Matrix Text File

This option imports a user-specified file in the Illumina matrix text format, which can be exported using the Final Report wizard in Illumina’s BeadStudio software. The file can be comma or tab delimited, and GC calls can either be included or excluded.

The exported file will contain file information, a header line and then the genotype data. Optionally the genotype data may also contain chromosome and position information.

During the import process, the first dialog asks you to choose a file to import, specify whether the file is tab or comma delimited and whether it contains GC Score data, and name the imported dataset. If you choose to use a GC Score threshold, enter your threshold value (range: 0 - 1). SNPs that have a GC Score below this threshold will be imported as missing (?_?).
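The effect of a GC Score threshold can be shown with a few lines of Python. This is a sketch of the filtering rule only; `filter_by_gc_score` is a hypothetical helper, and it assumes a score exactly equal to the threshold is kept.

```python
MISSING_GENOTYPE = "?_?"


def filter_by_gc_score(genotypes, gc_scores, threshold):
    """Replace genotypes whose GC Score is below the threshold with ?_?."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    return [
        g if score >= threshold else MISSING_GENOTYPE
        for g, score in zip(genotypes, gc_scores)
    ]


calls = ["A_A", "A_B", "B_B"]
scores = [0.9, 0.1, 0.25]
print(filter_by_gc_score(calls, scores, threshold=0.25))
```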

On the second dialog select the data column that represents the marker name as well as the column that represents the first sample. If your file contains chromosome and position information you can select the option to create and apply a marker map to your dataset.

Illumina iControlDB Data

This option imports iControlDB formatted data into SVS. Illumina’s iControlDB is an online database containing publicly available data, to be used as controls, for example, in an association study. The data will contain genotype and phenotype data from individuals generated from Illumina genotyping products.

Agilent Files

The Agilent file import tool reads Agilent text files (TXT tab delimited files) that were created using the Agilent Feature Extraction software and allows you to import various fields into Golden Helix SVS for analysis. All fields are imported into SVS as they are stored in the TXT file except for the LogR field. Agilent uses base 10 for all logarithms. For consistency in analysis, the import process converts this field to a base 2 logarithm. Marker map information from the text files is also imported, and the resulting dataset has the marker map applied.
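The base conversion follows from the change-of-base identity log2(x) = log10(x) / log10(2). A minimal Python sketch, with a hypothetical function name:

```python
import math


def log10_to_log2(value):
    """Convert a base-10 log ratio to base 2 using log2(x) = log10(x) / log10(2)."""
    return value / math.log10(2)


# A ratio of 8 is log10(8) in base 10 and approximately 3 in base 2.
print(log10_to_log2(math.log10(8)))
```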

Agilent File Import Window

From the Import Agilent File dialog (see Agilent File Import Window), first select the TXT files you want to include in the dataset. To select the files, click on the Add Files button and use the file browser to choose the files for import. The files you selected will appear in the import dialog window. You may add all of the TXT files in a directory by using the Add Directory button. To remove files from the window, select the unwanted samples and click Remove. You may continue adding files by clicking the Add Files or the Add Directory buttons again. Multiple selection is allowed by <Shift>-left-click to select a block of files or by <Ctrl>-left-click to select several individual files. All TXT files must be of the same type and length and must contain the same marker map information in order to be imported together. If any files do not match, an error message will be generated and the non-matching files will need to be removed from the list of files to import.

You can also change the name of the dataset at this time.

NimbleGen Data Summary Files

The NimbleGen data summary file import tool reads NimbleGen *_segMNT.txt text files (TXT tab delimited files) that were created using Roche NimbleGen Software and allows you to import various log ratio fields into Golden Helix SVS for analysis. The selected field is imported into SVS, and a marker map is created and applied to the dataset.

NimbleGen File Import Window

From the Import NimbleGen Data dialog (see NimbleGen File Import Window), first select the TXT files you want to include in the dataset. To select the files, click on the Add Files button and use the file browser to choose files for import. The files you selected will appear in the import dialog window. You may add all of the TXT files in a directory by using the Add Directory button. To remove files from the window, select the unwanted samples and click Remove. You may continue adding files by clicking on the Add Files or the Add Directory buttons again. Multiple selection is allowed by <Shift>-left-click to select a block of files or by <Ctrl>-left-click to select several individual files. All TXT files must be of the same type and length and must contain the same marker map information in order to be imported together. If any files do not match, an error message will be generated, and the non-matching files will need to be removed from the list of files to import.

The log ratio fields listed below can be imported one or more at a time. A field can be imported only if it is available in all of the text files.

  • LogR Ratio
  • LogR Ratio Spatial
  • LogR Ratio Corrected

If multiple log ratio data fields are selected, a single dataset will be created for each field.

You can also change the name of the dataset at this time.

Importing Family Pedigree Data

Preparing Family Data

Golden Helix SVS supports the import of family-based data in FBAT/PBAT Pedigree format, FBAT/PBAT Phenotype format, text pedigree format, and family-based text phenotype format.

Note

Golden Helix SVS also supports joining family-based spreadsheets with other spreadsheets. This may be useful if the genetic data for your family-based study comes from sources such as the Affymetrix GeneChip™.

The format of a text pedigree should be as follows:

  • The optional label column, whose position in the file you may specify in the import dialog, is mainly useful for indexing into other genetic data. This data could be, for instance, Affymetrix GeneChip™ data.
  • The first column after the row labels should be the family number or family ID.
  • The second column after the row labels should be the patient number or individual ID.
  • The third column after the row labels should be the father number or father ID.
  • The fourth column after the row labels should be the mother number or mother ID.
  • The fifth column after the row labels should be the gender. Gender should be encoded as 0 = missing, 1 = male, 2 = female. If the encoding 0 = male, 1 = female is used, set ? or the appropriate character as the missing value encoding on the Advanced Options tab.
  • The sixth column after the row labels should be the affection status. Affection should be encoded as 0 = unknown, 1 = unaffected, 2 = affected. If the encoding 0 = unaffected, 1 = affected is used, set ? or the appropriate character as the missing value encoding on the Advanced Options tab.
  • The remaining columns should be genetic markers or any other phenotypic data.
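
As a rough illustration of the column layout above, the following Python sketch reads a comma-delimited text pedigree whose first column holds the row labels. The function name parse_text_pedigree, the dictionary keys, and the three sample rows are hypothetical and are not part of SVS. The sketch assumes the 0 = missing, 1 = male, 2 = female and 0 = unknown, 1 = unaffected, 2 = affected encodings, and it assumes that founders carry 0 as their father and mother IDs.

```python
import csv
import io

# Encodings described in the column list above.
SEX = {"0": None, "1": "male", "2": "female"}
AFFECTION = {"0": None, "1": "unaffected", "2": "affected"}


def parse_text_pedigree(text, delimiter=","):
    """Parse rows laid out as:
    label, family, individual, father, mother, sex, affection, markers...
    """
    records = []
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        if not row:
            continue  # skip blank lines
        label, family, individual, father, mother, sex, affection = row[:7]
        records.append({
            "label": label,
            "family": family,
            "individual": individual,
            "father": father,
            "mother": mother,
            "sex": SEX[sex],
            "affection": AFFECTION[affection],
            "markers": row[7:],  # genetic markers or other phenotypic data
        })
    return records


# Hypothetical trio: two founders (parents coded 0, an assumption) and one child.
example = (
    "S1,1,1,0,0,1,1,A_A,A_G\n"
    "S2,1,2,0,0,2,1,A_G,G_G\n"
    "S3,1,3,1,2,1,2,A_G,A_G\n"
)
pedigree = parse_text_pedigree(example)
```

A file that uses the 0 = male, 1 = female encoding would need a different lookup table, which mirrors the encoding choice made in the import dialog.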

The format of a text phenotype should be as follows:

  • The first column after the row labels should be the family number or family ID.
  • The second column after the row labels should be the patient number or individual ID.
  • The remaining columns should be phenotypic data or genetic markers.

Import FBAT Pedigree

In the Import FBAT Pedigree File dialog (see FBAT Pedigree Import Window), select the pedigree file (file with the .ped extension) by clicking on Browse and navigating through the file manager to the desired file. You may edit the dataset name at this time. You must also specify the Sex and Affection Status Field Encodings your file uses. Additionally, you may change the default missing encoding options by clicking on the Advanced Options tab and indicating the value that your file uses for missing phenotype and/or missing genotype.

importFBATped1

FBAT Pedigree Import Window

Import FBAT Phenotype

In the Import FBAT Phenotype File dialog (see FBAT Phenotype Import Window), select the phenotype file (file with the .phe extension) by clicking on Browse and navigating through the file manager to the desired file. You may edit the dataset name at this time. You may also indicate a custom encoding for missing data by clicking on the Advanced Options tab and indicating the character that your phenotype file uses for missing data.

importFBATphe1

FBAT Phenotype Import Window

Import Text Pedigree

In the Import Text Pedigree dialog (see Text Pedigree Import Window), select the text pedigree file by clicking on Browse and navigating through the file manager to the desired text file. You will need to indicate how the text file is delimited in the drop down menu. Possible options are comma, white-space, tab delimited or “Other ->”. If your text file uses a different delimiter than comma, space or tab, select “Other ->” and indicate the character used in the text box to the right of the menu. You have the option of generating row labels from the Patient ID and Family ID columns or by using the first column in the text file. Additionally, you must specify how Sex and Affection Status are encoded in your file by selecting a specification for each.

On the Advanced Options tab you can indicate a custom encoding for missing data by entering in the string used for missing values in the text box. You can also change the allele delimiter if your text file uses a character different from the default underscore (_). There are several possible options and also an “Other ->” category where you can specify the character used to the right of the menu. If there are header lines in your text file you can skip them by checking in the appropriate box and indicating the number of rows in the header of the file.

importTextPed1

Text Pedigree Import Window

Import Text Phenotype

In the Import Text Phenotype dialog (see Text Phenotype Import Window), select the text phenotype file by clicking on Browse and navigating through the file manager to the desired text file. You will need to indicate how the text file is delimited in the drop down menu. Possible options are comma, white-space, tab delimited or “Other ->”. If your text file uses a different delimiter than comma, space or tab, select “Other ->” and indicate the character used in the text box to the right of the menu. You have the option of generating row labels from the Patient ID and Family ID columns or by using the first column in the text file. On the Advanced Options tab you can indicate a custom encoding for missing data by entering in the string used for missing values in the text box. You can also change the allele delimiter if your text file uses a character different from the default underscore (_). There are several possible options and also an “Other ->” category where you can specify the character used to the right of the menu. If there are header lines in your text file you can skip them by checking in the appropriate box and indicating the number of rows in the header of the file.

importTextPhe1

Text Phenotype Import Window

HapMap

This function imports user-specified text files with extensions .txt or .csv from the HapMap project. Multiple HapMap files cannot be merged together with this script.

Import Complete Genomics Data

This function imports SNP, Insertion, Deletion, and Substitution variants from the var-[ASM-ID].tsv files provided by Complete Genomics (http://media.completegenomics.com/documents/DataFileFormats112.pdf). These files can be imported directly from the provided bzip2 sources; there is no need to decompress them beforehand. The user can choose to import one var file or several var files simultaneously. When multiple files are imported the data will be combined into a single spreadsheet. Each file is assumed to contain data for exactly one sample.

importCompleteGenomics

Complete Genomics Import Window

Options for the import include the base dataset name. If specified this will define the name of the dataset created by the import. If the default value of “*” is used, the dataset will take the name of the first file in the input list.

Complete Genomics VAR files have special encodings for variants. These include:

  • Partial reference match haplotype (no-call-rc)
  • Partial reference mismatch haplotype (no-call-ri)
  • Pseudo autosomal region called in X as diploid haplotype (PAR-called-in-X)

There are options for each of these encodings to either classify them as reference or no-call. Please consult your documentation from Complete Genomics when deciding how to treat these calls.

In addition, there is an option to encode half-called genotypes as either ref-call or missing.

This import tool assumes records in all input files to be grouped by locus and otherwise ordered by chromosome and start position. The chromosome order is assumed to be (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, M). If the input files are found to violate these assumptions the import tool will display a warning. However, checking the option to sort input records within loci should prevent most of these warnings.
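
To check a file against this ordering before import, something along the following lines could be used. This is only a sketch: the function name first_out_of_order is hypothetical, records are reduced to (chromosome, start) pairs, and chromosome names are assumed to be written exactly as in the list above, without any prefix.

```python
# Chromosome order assumed by the import tool, as listed above.
CHROM_ORDER = [str(n) for n in range(1, 23)] + ["X", "Y", "M"]
CHROM_RANK = {name: rank for rank, name in enumerate(CHROM_ORDER)}


def first_out_of_order(records):
    """Return the index of the first (chrom, start) pair that sorts before
    its predecessor, or None if the records are in the expected order."""
    previous = None
    for index, (chrom, start) in enumerate(records):
        key = (CHROM_RANK[chrom], start)
        if previous is not None and key < previous:
            return index
        previous = key
    return None


# Hypothetical records: the second list places chromosome 10 before 2.
sorted_records = [("1", 100), ("1", 250), ("2", 50), ("X", 10), ("M", 5)]
unsorted_records = [("1", 100), ("10", 50), ("2", 75)]

print(first_out_of_order(sorted_records))    # None
print(first_out_of_order(unsorted_records))  # 2
```

The second list is the kind of input that a plain alphabetical sort of chromosome names produces, which is why the ranks are looked up rather than compared as strings.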

Finally, it is possible to only import certain chromosomes or only include intervals from a selected annotation source.

Impute2 GWAS Files

This function reads dosage files created by Impute2 and allows for the option of importing those raw dosages as well as the option to convert the dosage probabilities to genotype calls.

importImpute2

Import Impute2 GWAS Files Window

The genotype files and the sample file are text files (either whitespace, tab, or comma delimited). There should be one genotype file per chromosome with the file extensions as *.impute2 or *.gen and one sample file with extension *.sample.

You can find specifics on how the genotype and sample files are formatted at the following website. http://www.stats.ox.ac.uk/~marchini/software/gwas/file_format.html

The genotype file names are used to determine chromosomal information for the marker map applied to the data on import. If the file names contain anything before the chromosome number other than chr, for example MyData_chr1, then enter MyData_chr under the Genotype file prefix section.
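
One way to picture how the prefix relates to the chromosome label is the sketch below. It is an illustration only and makes no claim about how SVS parses file names internally; the function name chromosome_from_filename and the file names are hypothetical.

```python
import os


def chromosome_from_filename(filename, prefix="chr"):
    """Strip the given prefix and the file extension, leaving the chromosome label."""
    stem, _extension = os.path.splitext(os.path.basename(filename))
    if not stem.startswith(prefix):
        raise ValueError(f"{filename!r} does not start with prefix {prefix!r}")
    return stem[len(prefix):]


print(chromosome_from_filename("chr1.gen"))                           # 1
print(chromosome_from_filename("MyData_chr1.impute2", "MyData_chr"))  # 1
print(chromosome_from_filename("MyData_chrX.gen", "MyData_chr"))      # X
```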

The option to Import dosages as a single B-Allele dosage will create one dosage column for each marker using the separate AA, AB, and BB probabilities provided using the following formula.

B Allele Dosage = AB Probability + 2*(BB Probability)
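
The formula can be applied to a whole row of probabilities at once. The following sketch assumes the probabilities for one marker are available as a flat list with three values (AA, AB, BB) per sample; the function name b_allele_dosages and the example values are hypothetical.

```python
def b_allele_dosages(probabilities):
    """Collapse a flat list of AA, AB, BB probabilities (three per sample)
    into one B-allele dosage per sample."""
    if len(probabilities) % 3 != 0:
        raise ValueError("expected three probabilities per sample")
    dosages = []
    for i in range(0, len(probabilities), 3):
        _p_aa, p_ab, p_bb = probabilities[i:i + 3]
        # B Allele Dosage = AB Probability + 2*(BB Probability)
        dosages.append(p_ab + 2 * p_bb)
    return dosages


# Three hypothetical samples: certain AA, uncertain, certain BB.
dosages = b_allele_dosages([1.0, 0.0, 0.0,
                            0.25, 0.5, 0.25,
                            0.0, 0.0, 1.0])
print(dosages)  # [0.0, 1.0, 2.0]
```

The AA probability does not appear in the formula, so the dosage ranges from 0 (no B alleles) to 2 (two B alleles).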

The option to Import Info File will import the data from the info file associated with each genotype file and put it in the marker map.

Please see the website below for the fields contained in the info file. https://mathgen.stats.ox.ac.uk/impute/impute_v2.html#output_options

Import VCFs and Variant Files

This option imports data from VCF, gVCF or 23andMe text files into multiple spreadsheets. Special handling is provided for genotype data and certain format fields. The user can choose to import one VCF file or several VCF files simultaneously.

You can find specifics on how the VCF files should be formatted at the following website: http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41

Select Files to Import

The import wizard will step you through all of the import options to bring variant level data (with or without sample level fields) into SVS.

The first step is to select the files to import.

Import Wizard Step 1

Select the files to import on the first step of the import wizard

Note

Indicate whether or not files should be appended if multiple files have the same sample names using the Append together files with matching sample names option.

Here are two examples:

  • If there is one file per chromosome for each sample or containing all samples, leave this option Checked.
  • If there are multiple files per sample and the variants are to be compared between files (tumor/normal, various alignment algorithms, etc.), Uncheck this option.

Note

When importing 23andMe data from delimited text files a local copy of a dbSNP annotation source is required to determine reference and alternate alleles for each RS ID listed. You can download a copy through the Data Source Library. Different versions of dbSNP may produce different results for some variants.

If the files have not yet been compressed or indexed, they will be compressed and indexed after you click Next >. Otherwise, the next step is to select the relationship between, or type of, samples.

Select Relationship

If you are importing variants into an empty template and your files contain sample level information (i.e. they contain more than just site level information) you will be asked if the samples are related or unrelated. Select the appropriate relationship or type of import. The options include:

  • Individual Samples: The samples in the file(s) that you are importing are not related to each other. You will be able to select the affected individuals on the next page.

  • Family Samples: The samples in the file(s) that you are importing are related to one another. You will be able to select the affected individuals and specify parent relationships on the following page.

    Note

    Select this option if you have one or more families. This option does not require that all of the samples are in the same family.

  • Cancer Samples: The samples in the file(s) that you are importing are from cancer gene panels. You will be able to select the affected individuals on the next page.

  • Tumor/Normal Samples: The samples in the file(s) that you are importing contain tumor/normal pairs. You will be able to select the tumor samples and the matching normal samples on the next page.

Import Wizard Step 2

Specify the relationship between the samples

Edit Sample Information

Once the relationship has been specified you will be prompted to edit the sample information. This step is OPTIONAL as you can import the sample information directly from numerous sources in SVS.

To skip this step, click Next >. Otherwise, you can change the sample name and affection status (affected, unaffected or missing) and, if the samples are related, set the parents for the children here, so that all of the data is imported simultaneously.

If the sample information is also contained in a plain text file (or pedigree file) you can also import this information by clicking on From Text File above the sample table. See Import Sample Information from Text File for more information.

If the Family Samples option was selected on the previous page, the sample editing page will look like the image below.

Import Wizard Step 3 Family

Edit the sample information to specify relationships and affection status for family data

If either the Individual Samples or the Cancer Samples option was selected on the previous page, the sample editing page will look like the image below.

Import Wizard Step 3 Individual

Edit the sample information to specify affection status for individual samples

If the Tumor/Normal Samples option was selected on the previous page, the sample editing page will look like the image below.

Import Wizard Step 3 Tumor/Normal

Edit the sample information to specify tumor/normal status and matched normal sample name.

Sample information can be edited for a maximum of 100 samples, although it is not practical to specify that much sample information in this format. The sample information can also be imported from a text file.

Import Sample Information from Text File

Instead of manually specifying the sample information, it can be imported from a text file. The data in the text file will be matched by sample name.

General text file importing parameters are available to handle most text file formats including text pedigree files, or files without a header.

The text file needs to specify the affection status for the samples, or tumor/normal status at a minimum. Other phenotype information can be imported as well. To set a column as Affected?, Tumor?, or as a field to import, right click on the column header and select the desired option.

A best attempt is made to detect the column that contains the sample names for matching. If the correct field from the text file is not selected, right click on the column header of the correct column and choose Set as Sample Names.

Basic Sample Info

Importing sample information from a file that just contains the affection status

If the text file contains a field with secondary sample names, the sample names can be renamed using this field. To set the field as Renamed Sample Names click on the column header and select Renamed Sample Names. This fills in the Renamed Sample column in the import wizard.

If the samples are imported as Family Samples, then the text file can specify the Father and Mother IDs in addition to the affection status. The gender can also be imported for later use in algorithms.

Note

If a field is set for Renamed Sample Names the Father and Mother IDs must match the renamed sample names and not the original sample name.

Pedigree Sample Info

Importing pedigree information from a text file

If the samples are imported as Tumor/Normal Samples, then the text file can specify the matched normal sample IDs in addition to the tumor/normal status.

Note

If a field is set for Renamed Sample Names the matched normal IDs must match the renamed sample names and not the original sample names.

Tumor/Normal Sample Info

Importing tumor/normal information from a text file

Modify How Fields in the Variant Files are Imported

All of the fields from the files selected for import are displayed on the Edit Field Merge and Type Behavior page of the import wizard.

The G_T (Genotypes) field is selected for import automatically. Additional fields can be selected by checking the Select Field box. Sample fields (FORMAT fields) will be imported as spreadsheets. INFO fields will be added to the marker map. When merging multiple files together there can be differing values in the INFO fields. The options presented are designed to handle this possibility.

Certain INFO fields are automatically elevated to sample fields: FILTER and DP (Read Depth). Any of the other fields can also be elevated to a sample field by selecting Sample in the drop-down menu in the Merge Behavior column.

Edit Field Merge Options

Edit field merge and type behavior advanced option dialog

By default, all other INFO fields will be merged by creating a Unique list of values for the field across all samples and files; this keeps the field a variant site field. Other merge options include:

  • NumericMax: For integer, integer array, float or float array field types. Takes the maximum of all values for the field in all files.
  • NumericMin: For integer, integer array, float or float array field types. Takes the minimum of all values for the field in all files.
  • NumericMean: For integer, integer array, float or float array field types. Takes the mean of all values for the field in all files.
  • KeepMatching: All field types. Only keep the value if all files that have a value for the specified field match.
  • TakeFirst: All field types. Take the first value seen.
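
The merge behaviors can be pictured with a small sketch. It is not the SVS implementation: the function name merge_values is hypothetical, only scalar values are handled (array fields are left out), None stands for a file that has no value for the field, and returning None when KeepMatching finds a mismatch is an assumption.

```python
from statistics import mean


def merge_values(values, behavior="Unique"):
    """Merge the per-file values of one INFO field at one variant site."""
    present = [v for v in values if v is not None]
    if not present:
        return None
    if behavior == "Unique":
        unique = []
        for value in present:
            if value not in unique:  # keep first-seen order
                unique.append(value)
        return unique
    if behavior == "NumericMax":
        return max(present)
    if behavior == "NumericMin":
        return min(present)
    if behavior == "NumericMean":
        return mean(present)
    if behavior == "KeepMatching":
        # Assumption: a mismatch leaves the merged field empty.
        return present[0] if all(v == present[0] for v in present) else None
    if behavior == "TakeFirst":
        return present[0]
    raise ValueError(f"unknown merge behavior: {behavior}")


print(merge_values([10, None, 30, 10], "Unique"))  # [10, 30]
print(merge_values([10, 20, 30], "NumericMean"))   # 20
print(merge_values([5, 6], "KeepMatching"))        # None
```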

Import Summary

The final page of the import wizard is a summary of the import process. To finalize the import click Finished.

Typical Summary

The import summary for the import wizard for a trio from a single VCF file

Other options on this page include:

  • Sheet Base Name: The base dataset/spreadsheet name to be used for all spreadsheets created. If merging multiple files it is recommended that this name be set to a more informative name.

  • Specify Genomic Regions to Import: To only import variants from a particular region, or one or more chromosomes, enter the region(s) into this option.

    Region Suggestions

    The information for the subset imported chromosome option

  • Split Output Into Files Per Chromosome: This allows you to create one spreadsheet per chromosome for each VCF field imported.

  • Select filters ...: To only import variants that have a particular FILTER value, select those filters by checking the box in front of the available options.

To change the variant import algorithms, click on the Advanced Options check-box in the lower left hand corner. This will provide additional options and should look like the image below for a trio imported from a single VCF file.

Advanced Summary

The import summary for the import wizard for a trio from a single VCF file with advanced options visible

Advanced options allow variants in the variant files to be adjusted using the following algorithms:

Left Align

Insertions and deletions not in the left-most representation will be re-aligned with a Smith-Waterman algorithm to give them their canonical representation.

Left Align Example

Left align visual example

In the example above, the “CAT” deletion has been moved down 6 bases. Moving variants to the left-most position like this allows for uniform comparison between variants that can be represented at more than one position.

Note

The Left Align algorithm requires a valid and local reference sequence that matches the assembly of the data being imported.
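
The effect of left alignment on a deletion inside a repeat can be reproduced with a few lines of Python. Note that this sketch uses a simple base-by-base shift rather than the Smith-Waterman realignment described above, and that the reference string, the coordinates, and the function name left_align_deletion are made up for illustration.

```python
def left_align_deletion(reference, start, length):
    """Shift a deletion of `length` bases at 0-based `start` as far left as
    possible without changing the resulting sequence."""
    # The deletion can move one base left whenever the base entering the
    # deleted window on the left equals the base leaving it on the right.
    while start > 0 and reference[start - 1] == reference[start + length - 1]:
        start -= 1
    return start


reference = "GG" + "CAT" * 3 + "TT"  # GGCATCATCATTT
original_start = 8                   # the last "CAT" copy
aligned_start = left_align_deletion(reference, original_start, 3)

print(aligned_start)  # 2, i.e. the deletion moved 6 bases to the left
# Both placements give the same sequence once the bases are removed.
print(reference[:8] + reference[11:] == reference[:2] + reference[5:])  # True
```

Because both placements describe the same resulting sequence, comparing variants is only reliable once they all use the same (left-most) placement.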

Allelic Primitives

Multi-nucleotide polymorphisms will be split into the SNP representation that provides the best support for annotation.

Allelic Primitives Example

Allelic primitives visual example

In the example above, the original variant (above) represents a variant with the ref/alt “TCAT/GCAG”. This can be simplified by splitting the variant into two different “T/G” SNPs (below). The simplified representation is a more general form of the original variation and is more likely to be found in annotation sources.
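
The split of “TCAT/GCAG” into two “T/G” SNPs can be sketched as follows. The function name split_into_snps and the start position 100 are hypothetical, and the sketch only covers substitutions in which the reference and alternate have the same length.

```python
def split_into_snps(position, ref, alt):
    """Split an equal-length ref/alt pair into one SNP per mismatching base.

    Returns a list of (position, ref_base, alt_base) tuples.
    """
    if len(ref) != len(alt):
        raise ValueError("this sketch only handles equal-length substitutions")
    return [
        (position + offset, ref_base, alt_base)
        for offset, (ref_base, alt_base) in enumerate(zip(ref, alt))
        if ref_base != alt_base
    ]


snps = split_into_snps(100, "TCAT", "GCAG")
print(snps)  # [(100, 'T', 'G'), (103, 'T', 'G')]
```

The two matching bases in the middle ("CA") drop out, leaving one SNP at the first base and one at the fourth.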

Split Variants Based on Unique Genotypes

When multiple individuals have mutations in the same “site” (same chromosome and reference alleles), some variant files will place these all in one “record”. This option splits that record into records that are matched to each individual's genotype alleles, allowing annotation and filtering to be precise for each individual genotype.

Note

This option is available when importing Individual Samples.

This provides the following:

  • Each site is broken into all of the possible genotypes (with one or two alternate alleles), that can be constructed from the original alleles.
  • Sample genotypes will be filled from the alleles assigned to each sub-feature when possible.
  • Only the RefAlt field is updated for each feature. The sample fields and the alternates are copied from the original feature.
  • Annotations are most-specific, with samples with only one alternate being properly annotated to annotation records with that alternate allele.
  • Allele counts will be calculated for each of the split features, providing counts for each of the different allele combinations.

Flatten Variant Genotypes

When a variant at a given “site” (same chromosome and reference alleles) has more than one alternate, some variant files will place these all in one “record”. This option splits that record into records that are matched to each alternate allele, allowing annotation and filtering to be precise for each alternate allele.

Note

This option is available when importing Cancer Samples.

In this mode:

  • No attempt is made to keep genotypes intact. Every record is a single reference and alternate. The ALT field is updated and the appropriate values are taken from all “A” fields.
  • Genotypes that cannot be formed in the new split (i.e. they were 1/2 before) are set to half missing in each of the records (1/. and ./2). The fields are then copied into each of the split records.
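
A simplified picture of this splitting is given below. The sketch is not the SVS algorithm: the function name flatten_record and the data layout are hypothetical, genotypes are written as allele strings instead of indices, the “A” fields are not handled, and it assumes that any allele that is neither the reference nor the record's own alternate is set to missing (the text above only describes the 1/2 case).

```python
def flatten_record(ref, alts, genotypes):
    """Split one multi-allelic record into one record per alternate allele.

    genotypes maps sample name -> tuple of allele strings (None = missing).
    """
    flattened = []
    for alt in alts:
        allowed = {ref, alt}
        split_genotypes = {
            # Assumption: alleles that do not belong to this record become missing.
            sample: tuple(a if a in allowed else None for a in alleles)
            for sample, alleles in genotypes.items()
        }
        flattened.append({"ref": ref, "alt": alt, "genotypes": split_genotypes})
    return flattened


# Hypothetical site with reference A and alternates C and T.
# Sample s2 carries both alternates (1/2 in index notation).
records = flatten_record("A", ["C", "T"], {"s1": ("A", "C"), "s2": ("C", "T")})

print(records[0]["genotypes"]["s2"])  # ('C', None)  half missing
print(records[1]["genotypes"]["s2"])  # (None, 'T')  half missing
```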

Match Variants to Affected Individuals Genotypes

When multiple affected individuals have mutations in the same “site” (same chromosome and reference alleles), some variant files will place these all in one “record”. This option splits that record into records that are matched to each affected individual's genotype alleles, allowing annotation and filtering to be precise for each affected individual.

Note

This option is available when importing Family Samples.

In this mode:

  • Each site is broken into all of the possible genotypes (with one or two alternate alleles), that can be constructed from the samples marked as “Affected” during import.
  • Alternate Alleles that are not represented in the samples are combined and placed in a separate record.
  • Sample genotypes will be filled from the alleles assigned to each sub-feature when possible.
  • The RefAlt field and the Alternates are updated, as are the “A” (alt matching) fields, to match the new alternates.
  • Annotations are most-specific for the “Affected” samples, with samples with only one alternate being properly annotated to annotation records with that alternate allele.
  • Allele counts will be calculated for each of the split features, providing counts for each of the different allele combinations.

Once the options are set click Finished to import the data.

Mach Output

This option imports the mldose, mlgeno and mlinfo files output from MACH. The mlinfo file has to be imported but the mldose and mlgeno files are optional.

importMACH

Import MACH Output Window

Note

If your MACH output data was generated using the HapMap website it will need to be converted from UTF-16 encoding to UTF-8 encoding for this script to work. There are free utilities online for this purpose. One such website is http://www.fileformat.info/convert/text/utf2utf.htm.
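
The conversion can also be done locally with a short Python script. This is a minimal sketch: the function name convert_utf16_to_utf8 and the file names in the usage comment are hypothetical, and the input is assumed to begin with a byte order mark so that Python's utf-16 codec can detect the byte order.

```python
def convert_utf16_to_utf8(source_path, target_path):
    """Rewrite a UTF-16 text file as UTF-8, leaving line endings unchanged."""
    # newline="" disables newline translation on both the read and write side.
    with open(source_path, "r", encoding="utf-16", newline="") as source, \
         open(target_path, "w", encoding="utf-8", newline="") as target:
        for line in source:
            target.write(line)


# Example usage (hypothetical file names):
# convert_utf16_to_utf8("mach_output.mlinfo", "mach_output_utf8.mlinfo")
```

The same call would be repeated for the mldose and mlgeno files if they are to be imported as well.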

Import LDSCORE Output

This option imports ldscore files created by the LDSC module. Please see https://github.com/bulik/ldsc/ for more information on this package.

Import LDSCORE Files Dialog

RNA-Seq Tabularized Quantification

This function imports the Tabularized quantification data from pipeline.goldenhelix.com.

The file should be in a gene (or isoform) as rows and samples as columns format. The first four (optionally five) columns should contain the gene (or isoform) name, chromosome, start, and stop. The fifth column may optionally contain transcript names or the gene (or isoform) counts may begin in this column. All remaining columns contain gene (or isoform) counts.

Gene Example:

ID	Chrom	Start	Stop	transcripts	BT-20	BT549	HCC1187
AF075036	chr19	9649409	9650406	AF075036:uc002mls.3	0.00	1.04	0.00
AK298056	chr6	31783320	31785722	AK298056:uc011dok.1	0.00	111.75	131.96
CCDC22	chrX	49091927	49106987	NM_014008:uc004dnd.2; AK296911:uc011mna.2	306.00	326.00	223.23
DNAJC15	chr13	43597362	43683306	NM_013238:uc001uyy.3	1.00	791.45	130.00
HAND1	chr5	153854532	153857824	NM_004821:uc003lvn.3	0.00	0.00	0.00
LEF1	chr4	108968701	109090112	NM_016269:uc003hyt.2; NM_001166119:uc011cfk.2; NM_001130714:uc003hyv.2; NM_001130713:uc003hyu.2; AK303143:uc011cfj.1; AF294627:uc010imb.2; AF086339:uc003hyw.1	138.00	2.00	3.00
LINC00094	chr9	136890561	136896719	NR_015427:uc004ceu.3	870.15	1067.75	267.50
LOC643650	chr10	47096454	47151400	NR_033957:uc001jef.3	154.24	41.85	28.81
MIR548A1	chr6	18572015	18572111	NR_030312:uc021yme.1	0.00	0.00	0.00
U1 (4)	chr5	40269660	40269821	:uc021xxq.1	1.00	0.00	1.00
UBXN2B	chr8	59323823	59364060	NM_001077619:uc003xtl.3	498.75	1967.07	1421.84

An example of a counts per gene table is above. Notice the file is tab delimited, with transcript names separated by a semicolon, and saved as either a *.tsv or a *.txt file.

Isoform Example:

ID	Chrom	Start	Stop	BT-20	BT549	HCC1187
AY754876:uc010hxn.3	chr3	183701541	183735727	0.00	11.12	20.60
NM_001025107:uc001ffk.3	chr1	154554534	154600456	7891.50	9753.86	18786.33
AK125718:uc002rej.4	chr2	23971534	24055472	16.61	17.08	0.00
BC051736:uc001uya.1	chr13	41792722	41793250	0.00	1.00	1.00
NM_175709:uc003axb.3	chr22	39526779	39548538	195.07	209.05	143.72
NM_017812:uc003vre.3	chr7	132469623	132766828	2064.10	2524.05	3205.69
NM_001142675:uc010qww.2	chr11	867859	915058	0.00	0.00	0.00
AJ890453:uc003srd.3	chr7	7457263	7557468	0.15	0.00	1.00
NM_020420:uc004fvn.3	chrY	25275502	25345254	0.00	0.00	0.00
NM_012182:uc002agj.1	chr15	60296421	60298142	0.00	5.00	0.00

An example of a counts per isoform file is above. The file itself should be tab delimited and saved as either a *.tsv or *.txt file.
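
To make the two layouts concrete, the sketch below parses both the gene and the isoform form of the table. It is an illustration only: the function name read_quantification is hypothetical, and recognising the optional fifth column by a header named "transcripts" is an assumption based on the gene example, not a documented rule of the import tool. The rows are taken from the examples above.

```python
import csv
import io


def read_quantification(text):
    """Parse a tab-delimited table laid out as
    ID, Chrom, Start, Stop, [transcripts], then one count column per sample."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE))
    header, body = rows[0], rows[1:]
    # Assumption: the optional fifth column is identified by its header name.
    has_transcripts = len(header) > 4 and header[4].lower() == "transcripts"
    first_count = 5 if has_transcripts else 4
    samples = header[first_count:]
    records = []
    for row in body:
        if not row:
            continue  # skip blank lines
        transcripts = [t.strip() for t in row[4].split(";")] if has_transcripts else []
        records.append({
            "id": row[0],
            "chrom": row[1],
            "start": int(row[2]),
            "stop": int(row[3]),
            "transcripts": transcripts,
            "counts": dict(zip(samples, map(float, row[first_count:]))),
        })
    return records


gene_table = (
    "ID\tChrom\tStart\tStop\ttranscripts\tBT-20\tBT549\tHCC1187\n"
    "CCDC22\tchrX\t49091927\t49106987\t"
    "NM_014008:uc004dnd.2; AK296911:uc011mna.2\t306.00\t326.00\t223.23\n"
)
isoform_table = (
    "ID\tChrom\tStart\tStop\tBT-20\tBT549\tHCC1187\n"
    "AY754876:uc010hxn.3\tchr3\t183701541\t183735727\t0.00\t11.12\t20.60\n"
)
genes = read_quantification(gene_table)
isoforms = read_quantification(isoform_table)
```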