Spreadsheets

Spreadsheet Overview

Spreadsheets provide the basis for all data manipulation, plotting and analysis in Golden Helix SVS.

All operations performed on spreadsheets are logged in the Node Change Log of the Project Navigator for reference and for replication of procedures.

Special Features of a Pedigree Spreadsheet

A pedigree spreadsheet has six required columns, Family ID, Patient ID, Father ID, Mother ID, Sex, and Affection Status. Any spreadsheet that has these six column headers and the columns are the correct type will be identified as a pedigree spreadsheet. This is indicated not only with a special node icon in the Project Navigator, but also by setting the background color for the column name headers to blue, see Pedigree Spreadsheet.

pedSpreadsheet

Pedigree Spreadsheet

Whenever a pedigree spreadsheet is joined to another spreadsheet the six pedigree columns will be kept as the left most columns regardless of the join order. See Joining or Merging Spreadsheets for more information on joining two spreadsheets.

Relationships and Dependencies Between Spreadsheets

If the navigator node for a spreadsheet has child nodes - either subset spreadsheets, joined or appended spreadsheets, analysis results or plots - then any changes made to the original spreadsheet will result in a new navigator node being created, and a new spreadsheet tab at the bottom of the spreadsheet view. This keeps the parent node in the same state it was when the first child node was created. This is why child nodes are named as such, since they are descendants of the parent nodes.

If you try to delete a parent node, you can only do so by deleting all associated child nodes. In some cases, such as joining two spreadsheets, a dependency is created where one or more spreadsheets rely on the data contained in another spreadsheet. Dependencies are denoted in the Project Navigator by highlighting all associated nodes when the dependent node is selected, and by listing all linked nodes in the “(Linked To)” column. A spreadsheet cannot be deleted if one or more nodes is dependent on it.

If a spreadsheet navigator node does not have any child nodes, then any changes to the spreadsheet will not result in a new node being created.

Row-Information Columns

In any spreadsheet, there are two static columns which always remain visible even when scrolling. They are named the row number and row label columns.

Row Label Headers

The Row Label Header describes what the row labels represent. The default Row Label Header is “Label” if generic row labels are created on import of data. Left-clicking on the Row Label Header sorts the labels first in ascending order and a second click sorts in descending order. Ascending order is indicated by a triangle pointing toward the top of the spreadsheet, and descending is indicated by a triangle pointing toward the bottom of the spreadsheet.

Row Numbers

Row Numbers are static; they never change regardless of any sort operations performed on the spreadsheet. Right-clicking on a Row Number brings up the Row Menu, which contains, at a minimum, the Edit This Spreadsheet option. If the spreadsheet has a genetic marker map applied to row labels, and if the marker map has RS ID or Identifier information, rows containing RS ID’s will also have hyperlinks to websites referencing the RS ID in the Row Menu.

Row Labels

Row Labels are identifiers for the rows. Row labels can be used to sort spreadsheet data ascending or descending. Left-clicking on the row label will set the active/inactive state of the entire row. Right-clicking brings up the Row Menu, which contains, at a minimum, the “Edit This Spreadsheet” option. If the spreadsheet has a genetic marker map applied to row labels, and if the marker map has RS ID information, rows containing RS ID’s will also have hyperlinks to websites referencing the RS ID in the Row Menu.

Row States

There are two possible row states: inactive or active. Inactive means a row will be excluded from any analysis or plot. Active means a row will be included in any analysis or plot. Multiple rows can simultaneously be made active or inactive by toggling the first column to the desired state, then <Shift>-left-clicking on a distant column. All columns in between will be set to the state of the first column clicked.

Column Headers

In any spreadsheet, there are two static rows which always remain visible even when scrolling. They are named the Column Number Headers and Column Label Headers.

Column Number Headers

A Column Number Header indicates the numerical index of a column. On a given Column Number Header the column type is displayed in a large blue letter. See Column Data Types below for a description of the possible column types and icons. Either left- or right-clicking on a Column Number Header opens the Column Header Menu. The Column Header Menu gives options for manipulating the spreadsheet data based on the values in the selected column, or for plotting the selected column. See Column and Row Spreadsheet Operations for more information.

Column Name Headers

A Column Name Header is the name of a column as specified on import of the spreadsheet data or after editing the spreadsheet. Column Name Headers should be informative of the data contained in the column.

Left-clicking on a Column Name Header sets the column state. See Column States for more information.

Right-clicking on a Column Name Header opens the Column Header Menu. See Column and Row Spreadsheet Operations for more information.

Column Data Types

Some column operations are specific to the type of column, indicated by a large blue letter on a column number header. The types are as follows:

  • B’: Indicates a binary column (values 0, 1, ?).
  • C’: Indicates a categorical column (values such as “Low”, “Medium”, or “High”).
  • G’: Indicates a genotype column (bi- or multi-allelic markers with alleles separated by an underscore such as “A_B” or “2_2”).
  • I’: Indicates an integer-valued column (values such as -1, 0, 1, 2, 10, etc.).
  • R’: Indicates a real-valued column (values containing decimal places encoded as single or double precision floating point values).

Column types can be changed or edited using the Spreadsheet Editor, see Editing a Spreadsheet for more information.

Column States

There are three possible column states:

  • Active Means a column is set as an independent variable and included in analysis (except for certain situations, see Genotype Association Tests). The text and data in this column are black.
  • Inactive Means that the column is excluded from any analysis. The text and the data in this column are gray.
  • Dependent Means that the column is set as a dependent variable. Multiple columns are allowed to be set as dependent for a multivariate analysis. The text and data in this column are magenta.

The column state of a column can be set by left-clicking on a column’s column label header. Repeated clicking will cycle through the column states (active, dependent, inactive).

Column state can also be set by right-clicking on either column header (or by left-clicking on the column number header) and selecting Make Active, Make Inactive or Make Dependent. Multiple columns can be made active, dependent, or inactive by toggling the first column to the desired state, then <Shift>-left-clicking on a distant column. All columns in between will be set to the state of the first column clicked.

Genetic Marker Map Information

Applying a genetic marker map to a spreadsheet saves the marker map and associated annotations as special marker map fields in the spreadsheet and reorders all genetic markers first by chromosome and then by position number. This spreadsheet is referred to as a “marker mapped spreadsheet”. This marker map information will be used for certain analyses and can be used for selecting subsets of data. See Genetic Marker Maps Overview for more information. See Marker Mapped Spreadsheet Viewer, A identifies the Map button, and B identifies the marker map information fields.

mmSSView

Marker Mapped Spreadsheet Viewer

Map Button

If a spreadsheet has a genetic marker map applied, a green button with the word “Map” will appear to the left of the Row Label Header above the Row Numbers. Left-clicking on this button will show all available marker map fields either in rows or columns depending on whether the marker names are row labels or column name headers. If no genetic marker map is applied this button will be grayed out.

Marker Map Fields

A marker map field is a row or column of marker map data, such as Chromosome or Position, which may or may not be unique to the marker names. If there is no marker map information for a particular column or row then the marker map fields will be blank for that column or row. If a row or column contains marker map information (at the very minimum Chromosome and Position are required), right-clicking on that field will list hyperlinks to websites that look up the particular marker by either chromosome and position or RS ID (if available) in the supported online databases.

Right-clicking on the Map button allows the user to choose which marker map fields are shown or hidden. The user is not allowed to hide all fields. Clicking on a field name will toggle between showing and hiding the field. To hide all fields, left-click on the Map button.

Marker map information can be searched using the Edit > Find search dialog. Only one marker map field can be searched at a time. See Finding Strings or Values in a Spreadsheet for more information.

Saving/Exporting Spreadsheets

For more information on saving or exporting data from spreadsheets see Exporting Spreadsheet Data.

Spreadsheet Menus Overview

A brief description of all spreadsheet menu items will follow along with links for more information on the topic.

File Menu

  • Save As...:

    A spreadsheet can be saved and exported as a text file, any supported third party format, as family-indexed PED/TPED/BED files, as a Golden Helix DSF file, as a Golden Helix Legacy GHD file, as CNT files, or a VCF file. For more information see Exporting Spreadsheet Data.

  • Apply Genetic Marker Map:

    A genetic marker map can be applied to a spreadsheet when marker names are either row labels or column name headers. See Applying a Genetic Marker Map to a Spreadsheet for more information.

  • Export Genetic Marker Map:

    A genetic marker map applied to a spreadsheet can be exported and saved as a DSM file. This allows the marker map to be applied to other spreadsheets in the current project or in other projects. See Exporting an Applied Genetic Marker Map to a DSM file for more information.

  • Drop Genetic Marker Map:

    It may be desirable to drop or remove an applied marker map. This option does just that. See Dropping a Genetic Marker Map from a Spreadsheet for more information.

  • Append Spreadsheets:

    Appending spreadsheets is used for combining two datasets that in general do not share any row labels, but do share all or most column name headers. For example, adding additional samples to a spreadsheet. See Appending Spreadsheets for more information.

  • Join or Merge Spreadsheets:

    Joining or merging spreadsheets is used for combining two datasets that generally share row labels. For example, joining genotype and phenotype data for the same samples. If two spreadsheets do not share row labels, an option is available to join on sorted row labels. See Joining or Merging Spreadsheets for more information.

  • Create Top-Level Spreadsheet:

    Creates a new dataset with no dependencies as a child of the project node from the current spreadsheet. See Create Top-Level Spreadsheet for more information.

  • Create Marker Map from Spreadsheet:

    Create a marker map from the current spreadsheet. Requires columns containing Marker Name, Chromosome, and Position information. Optionally, a Stop field can be specified. See Create Marker Map from Spreadsheet for more information.

  • Add Columns to Marker Map:

    Add columns from the current spreadsheet to its marker map. Requires a row marker mapped spreadsheet. See Add Columns to Marker Map for more information.

  • Add Alternates to Marker Map:

    Add Alternates field to the marker map of the current spreadsheet that contains a list of alternate alleles for each variant in a genotype spreadsheet. See Add Alternates to Marker Map for more information.

  • Convert Genetic Marker Map to Spreadsheet: Converts the currently visible marker map data into a spreadsheet. The spreadsheet will have the marker names as the row labels and the marker map data in columns.

  • Close:

    Closes the spreadsheet.

Edit Menu

  • Edit This Spreadsheet:

    Opens the Spreadsheet Editor. See Editing a Spreadsheet for more information.

  • Transpose Spreadsheet:

    Transposes all columns in a spreadsheet of a selected type. See Transposing Spreadsheets for more information.

  • Convert to Pedigree Spreadsheet:

    Converts the spreadsheet to a pedigree spreadsheet by either selecting the pedigree fields from among the available spreadsheet columns or by generating generic values. The only required column is a binary affection status column. See Convert to Pedigree Spreadsheet for more information.

  • Recode:

    This sub-menu contains several options for recoding spreadsheet data, such as genotypes and alleles, and renaming row labels and column headers. See Recode for more information.

  • Go To...:

    Opens a dialog to shift the visible data fields to the specified row and column number in the top left corner.

  • Find:

    Searches the spreadsheet for a string or value, or partial string or value, as indicated. See Finding Strings or Values in a Spreadsheet for more information.

  • Find Next:

    Searches for the next occurrence of a string or value as specified in the Find dialog. See Finding Strings or Values in a Spreadsheet for more information.

  • Build Pedigree from Row Labels:

    Takes the existing spreadsheet and row labels and prompts for pedigree information for a single family. This tool can be used on an existing pedigree or to create a pedigree from scratch. See Build Pedigree from Row Labels for more information.

  • Add Allele Delimiter:

    Takes categorical columns of alleles without a delimiter and adds an underscore between the first two characters to convert the alleles to genotypes.

  • Biallelic Expansion:

    Expands genotypic columns in a spreadsheet so that the resulting columns are biallelic.

  • Set Values to Missing based on Second Spreadsheet

    Sets values to missing based on values in a second spreadsheet. See Set Values to Missing based on Second Spreadsheet for more information.

  • Sort by Marker Map Field:

    Sorts a row-mapped spreadsheet by a field in the marker map. See Sort by Marker Map Field for more information.

  • Create Labels from Marker Map Field:

    Creates a new spreadsheet with a chosen marker map field as the column headers or row labels. See Create Labels from Marker Map Field for more information.

Select Menu

  • Row:

    This sub-menu contains several options for row operations. See Row Select Operations for more information.

  • Column:

    This sub-menu contains several options for column operations. See Column Select Operations for more information.

  • Activate All:

    Activates all columns and rows in a spreadsheet.

  • Activate By Chromosomes:

    This option allows for selection of rows or columns based on the chromosome field of a genetic marker map. Requires a marker mapped spreadsheet. See Activate By Chromosome for more information.

  • Subset Active Data:

    Creates a subset spreadsheet based on active columns and rows. See Creating Subset Spreadsheets for more information.

  • Activate or Inactivate Based on Second Spreadsheet:

    This function activates or inactivates all columns or rows in a spreadsheet based on the active row labels or column headers of a second spreadsheet.

  • Apply Current Selection to Second Spreadsheet:

    This function takes a spreadsheet that has had its rows or columns filtered (set to inactive), then applies the active selection to another spreadsheet. The rows/columns may have been activated or inactivated based on Activate By Threshold, Activate by Category or by any other filtering tool in SVS.

    The orientation in the second spreadsheet is selected based on the largest set intersection of labels/headers.

    For example, if the rows option was selected in the dialog, two set intersections are found: the number of overlapping row labels (original spreadsheet) and row labels (selected spreadsheet) and the number of overlapping row labels (original) and column headers (selected). The orientation with the largest number of overlaps is used.

    If the columns option was selected, the number of overlapping column headers (original) and row labels (selected) and the number of overlapping column headers (original) and column headers (selected) are compared with the largest overlap being used.

    Based on orientation, only overlapping labels/headers remain active in the selected spreadsheet.

    Note

    If the original spreadsheet has some rows or columns inactivated, the default orientation selection will be the direction with the largest number inactivated. The default spreadsheet is the parent node if one exists, otherwise there is no default spreadsheet.

  • Activate Rows based on Multiple Column Criteria:

    This tool allows the user to apply threshold or categorical filtering to several columns in one step.

    In the first prompt, the user selects the columns to be filtered against.

    In the second prompt, the user defines the filtering criteria for each previously selected column and whether to require all of the criteria to be satisfied or at least one of the criteria. Categorical, Genotypic and Binary columns have a ‘Activate by Category’ -type filter, while Real and Integer columns have a threshold filter. A row must meet all filtering criteria to remain active.

  • Compare and Activate by Column Agreement:

    This function compares the values of two columns and inactivates the rows in which mismatched values lie. The user can optionally create a subset of the rows with matching data values and/or the rows with differing data values.

    For more information about the details of this function, see Compare and Activate by Column Agreement.

  • Activate by Gene List:

    This function filters mapped markers that are outside gene or transcript regions defined by the selected annotation track, given a separate spreadsheet that contains a list of gene or transcript names.

    • If the selected spreadsheet is row-mapped, either a map field or a column field can be used for the gene list.
    • If the selected spreadsheet is column-mapped, a marker map field can be used for the gene list.
    • If the selected spreadsheet is not marker mapped, a column field must be chosen for the gene list.

    If the label for the selected field has transcript in it, the Transcript Name field of the gene source will be used for filtering otherwise the Gene Name field will be used.

  • Activate by Column or Row Label Pattern

    This tool scans either row labels or column headers and either activates or inactivates selected rows or columns based on a user-specified string and selected matching criteria.

    Options for string matching include full matching, partial matching, or matching by a regular expression. Additionally there is the option for a case sensitive search.

  • Select Variants by Filtering on a Marker Map Field

    This tool selects either marker mapped rows or columns based on filtering on a marker map field. The field can either be numeric or a string field.

    A list of string values can be specified for a string map field or a threshold for numeric fields. The selected variants can either be activated or inactivated in the original spreadsheet.

DNA-Seq Menu

  • Set Genotypes to No-Call based on Additional Spreadsheets:

    Set genotypes in the current spreadsheet to missing based on a user-specified threshold or list filtering criteria applied to the values in several additional spreadsheets. See Set Genotypes to No-Call based on Additional Spreadsheets for more information.

  • Runs of Homozygosity for NGS:

    Calculates runs of homozygosity in genotypic data. This feature requires a genetic marker map be applied to the spreadsheet with a reference field. This version has options more appropriate for NGS data. See Genotype Menu for a version with options for GWAS data. See Runs of Homozygosity for more information.

  • Annotate and Filter Variants:

    Annotates and filters DNA-Seq data based on one or more annotation sources. See Annotate and Filter Variants for more information.

  • Activate Variants by Sample Genotypes

    This function activates genotypic columns that follow the specified genotypic patterns for the selected samples. See Activate Variants by Sample Genotypes for more information.

  • Filter Variants in Reference Sample Spreadsheet:

    This function inactivates uninformative markers as determined by a second Reference Sample spreadsheet. Markers are considered uninformative if the allele set or genotype set (based on the user-specified option) is a subset of the alleles/genotypes found in the reference spreadsheet’s corresponding marker at the same chromosome and position. See Filter Variants in Reference Sample Spreadsheet for more information.

  • Subset Informative Genotypes by Category:

    This function scans genotypic columns to find informative genotypes defined by having at least one non-missing, non-reference allele. See Subset Informative Genotypes by Category for more information.

  • Variant Binning by Frequency Source:

    Create frequency bins based on user-defined thresholds and an external reference population provided through a probe annotation track. This helps to identify rare variants in a SNP dataset when there are not enough samples to calculate in-sample MAF. See Variant Binning by Frequency Source for more information.

  • Variant Classification:

    Classifies variants based on their potential effects on a gene’s output protein sequence. See variant_classification for more information.

  • Find de Novo Candidate Variants:

    This feature uses pedigree information to identify candidate functional polymorphism, defined as the offspring in a trio having a genotype classified as a Mendelian error. See Find de Novo Candidate Variants for more information.

  • Score Variants by Recessive Model:

    This feature scores variants based on how well each variant follows the expected recessive model inheritance pattern. See Score Variants by Recessive Model for more information.

  • Score Compound Heterozygous Regions:

    Calculates the number of compound heterozygous inheritance events within each gene region. See Score Compound Heterozygous Regions for more information.

  • Score Variants by Dominant Model:

    This feature scores variants based on how well each variant follows the expected dominant model pattern based on case/control status. See Score Variants by Dominant Model for more information.

  • Collapsing Methods:

    This sub-menu contains several additional testing options. See Collapsing Methods for DNA-Sequence Analysis and Count Variants per Gene for more information.

Genotype Menu

  • Genotype Statistics by Marker:

    Determine how closely respective genotypes in a dataset approximate a state of Hardy-Weinberg Equilibrium (HWE) by calculating HWE p-values, Fisher’s Exact Test for HWE p-values, and HWE correlation R values. This option also allows for the calculation of allele frequencies for both the major and minor alleles and can provide allele and genotype counts for each marker in the dataset. See Genotype Statistics by Marker for more information.

  • Genotype Filtering by Marker:

    Determine how closely respective genotypes in a dataset approximate a state of Hardy-Weinberg Equilibrium (HWE) by calculating HWE p-values, Fisher’s Exact Test for HWE p-values, and HWE correlation R values. This option also allows for the exclusion of data out of HWE, with a minor allele frequency or call rate below a user-specified threshold or if the data does not meet other quality control thresholds. See Genotype Filtering by Marker for more information.

  • Genotype Statistics by Sample:

    Determine if genotypes for samples violate population-based transmission Hardy-Weinberg principles. See Genotype Statistics by Sample for more information.

  • Quality Assurance:

    This sub-menu contains several options to improve the quality of genotype data, include genotype statistics.

  • LD Reports:

    This sub-menu contains several LD Analysis options. See LD Reports for more information.

  • Genotype Principal Components Analysis:

    Runs principal component analysis on genotypic data using one of three genetic models. See Principal Component Analysis and Genotypic Principal Component Analysis for more information.

  • PBAT Family-Based QA:

    Calculates Family-Based genotype statistics by marker or by proband. A pedigree spreadsheet is required to use this feature. See PBAT Family-Based QA Statistics for more information.

  • PBAT Genotype Analysis:

    Runs the PBAT algorithm from a pedigree spreadsheet containing genotypic data. See PBAT Genotype Analysis for more information.

  • Genotype Association Tests:

    Performs genotype association tests on genotype columns using a quantitative or binary column as the dependent variable. See Genotype Association Tests for more information.

  • Haplotype Association Tests:

    Calculates haplotype frequencies of blocks of genotypes and the association between those frequencies and a binary outcome. See Haplotype Association Tests for more information.

  • Haplotype Block Detection:

    Detect blocks of haplotypes that represent biological groupings of genotypes. See Haplotype Block Detection for more information.

  • Runs of Homozygosity for GWAS:

    Calculates runs of homozygosity in genotypic data. This feature requires a genetic marker map be applied to the spreadsheet. This version has options more appropriate for GWAS data. See DNA-Seq Menu for a version with options for NGS data. See Runs of Homozygosity for more information.

  • Compute Genomic BLUP (GBLUP):

    Computes the genomic best linear unbiased predictors for genotypic data correcting kinship with a genomic relationship matrix. Output includes the random effect component of the predictor and the allele substitution effects of each marker. See Genomic Best Linear Unbiased Predictors (GBLUP) for more information.

  • Mixed Linear Model Analysis:

    Various methods including linear regression with only fixed effects, mixed model GWAS (using EMMAX) or multi-locus mixed model GWAS (MLMM) can be performed on genotypic data with a binary or quantitative (real-valued) dependent variable using Analysis > Mixed Linear Model Analysis. See: Mixed Linear Model Analysis for more information.

Numeric Menu

  • Numeric Principal Component Analysis:

    Runs principal component analysis on numeric data or pre-computed principal components. See Principal Component Analysis and Numeric Principal Component Analysis for more information.

  • Numeric Association Tests:

    Performs association tests on phenotype or other numeric columns. See Numeric Association Tests for more information.

  • Numeric Regression Analysis:

    Performs linear or logistic regression on numeric, categorical or binary covariates. See Numeric Regression Analysis for more information.

  • CNV QA:

    This sub-menu contains several options for identifying poor markers or samples in copy number data. These markers or samples could then be inactivated before performing any analyses.

  • CNAM Optimal Segmenting:

    Performs the optimal segmenting algorithm on numeric copy number data. This feature requires a marker map applied to the spreadsheet’s columns. See Using CNAM (Copy Number Analysis Method) Optimal Segmenting for more information.

  • CNAM Output Analysis:

    This sub-menu contains several options for performing analytic measures on copy number data and the spreadsheets created by CNAM, such as the Segment List spreadsheet. See Manipulating and Analyzing CNAM Output for more information.

  • PBAT CNV Analysis:

    Runs the PBAT algorithm from a marker-mapped pedigree spreadsheet containing numeric CNV intensity data. See PBAT CNV Analysis for more information.

  • Statistics (per Column):

    This function calculates and/or reports the following approximate values for each real-, integer-valued and binary (optional) active column: Lower Outlier Threshold = Q1 - M*IQR, Minimum, Q1 (first quartile), Median, Mean, Q3 (third quartile), Maximum, Upper Outlier Threshold = Q3 - M*IQR, Interquartile Range (IQR), Variance and Standard Deviation. M is a user defined multiplier to define outlier thresholds based on IQR (Inter Quartile Range).

    For more information about the details of this function, see Statistics (per Column).

  • Statistics (per Row):

    This function calculates and/or reports the following approximate values for each row using data from only real-, integer-valued and binary (optional) active columns: Lower Outlier Threshold = Q1 - M*IQR, Minimum, Q1 (first quartile), Median, Mean, Q3 (third quartile), Maximum, Upper Outlier Threshold = Q3 - M*IQR, Interquartile Range (IQR), Variance and Standard Deviation. M is a user defined multiplier to define outlier thresholds based on IQR (Inter Quartile Range).

    For more information about the details of this function, see Statistics (per Column).

  • Multidimensional Outlier Detection:

    This function determines outliers based on user-specified columns. A distance score is computed by summing the squared distances from the median in each column, then taking the square root of the sum. The sample is considered an outlier if its distance score is greater than a threshold, based on a user-specified multiplier and the quartiles of each column.

    For more information about the details of this function, see Multidimensional Outlier Detection.

  • Fisher’s Exact Test for Binary Predictors:

    A Fisher’s Exact Test can be performed on binary predictors and a binary case/control dependent variable using Analysis > Fishers Exact Test for Binary Predictors. See Fisher’s Exact Test for Binary Predictors for more information.

RNA-Seq Menu

  • Activate Genes by Minimum Read Threshold:

    Filter columns based on all samples or subsets of samples passing a minimum read threshold.

  • Normalization and Log Transformation:

    Normalizes and/or log transforms RNA-Seq quantification data. See Normalization and Log Transformation for more information.

  • DESeq Analysis:

    Differential expression analysis method for RNA sequence count data. Fold changes and p-values can be calculated for two conditions delineated in a categorical or binary column. See DESeq Analysis for more information.

  • Dendrograms and Heatmaps:

    From any of the ... Counts for G/T with the Best ... P-Values spreadsheets output by DESeq Analysis creates a heatmap and clustering dendrograms of these counts. See Dendrograms and Heatmap.

GenomeBrowse Menu

  • Numeric Value Plot:

    Plots one or more columns of data against genetic marker map information. See Numeric Value Plot for more information.

  • Linkage Disequilibrium

    Plots linkage disequilibrium between genotypic columns. LD is plotted based on genomic distance. See genomicLD for more information.

  • Heat Map

    Plots an intensity plot for all rows and columns. See genomicHeatMap for more information.

Plot Menu

  • Histograms:

    Plots one or more histograms of column data in one or more graphs. See Histograms for more information.

  • XY Scatter Plots:

    Plots one column of data as an independent variable against one or more columns of data as dependent variables. See XY Scatter Plots for more information.

  • Numeric Value Plot (Uniform):

    Plots one or more columns of data against row labels. See Uniform Numeric Value Plot for more information.

  • Linkage Disequilibrium (Uniform)

    Plots linkage disequilibrium between genotypic columns. LD is plotted on a uniform scale. See Uniform LD Plots for more information.

  • Heat Map (Uniform)

    Plots an intensity plot for all rows and columns. See Uniform Heat Maps for more information.

  • Columnwise Side-by Side Box Plots:

    Visualize 20 columns or less of numeric data and their distributions.

  • NxN Scatter Plot:

    Visualize multiple XY scatter plots in an N by N grid. The diagonal will have histograms and density plots. See N by N Scatter Plots for more information.

  • Plot Proportion by Group with Confidence Intervals:

    Visualize the proportion of cases in a binary column over several groups. See Plot Proportion by Group with Confidence Intervals for more information.

  • Side-by-Side Box Plots by Category:

    Visualize the distribution of a numeric variable over several groups. See Side by Side Box Plots for more information.

  • Stacked Histogram:

    Visualize a numeric column in a stacked histogram by grouping on at most 10 unique grouping values. See Stacked Histogram for more information.

Scripts Menu

This menu is available for launching any scripts to be run on entire spreadsheets. Scripts saved in the directory ”../Golden Helix SVS /AppData/userScripts/Spreadsheet/Scripts” will be available from this menu.

Help Menu

  • Spreadsheet Help:

    Launches the integrated SVS manual chapter on Spreadsheets.

  • Email Feedback:

    Launches an email message with the appropriate address and subject line depending on the type of email feedback selected.

Working with a Single Spreadsheet

There are several options available within a spreadsheet to prepare data for analysis. These operations are detailed below, along with a few other essential spreadsheet operations.

Copying Spreadsheet Information to Clipboard

Right-click menus for Column Headers, Row Labels, and Cells allow spreadsheet fields to be copied into the clipboard for easy copy and pasting either within the program or into a third-party program such as a document editor. If a marker map is applied to a spreadsheet, then information in the marker map fields for a particular row or column (depending on the orientation of the marker map) can also be selected and copied through the right-click menu.

Finding Strings or Values in a Spreadsheet

Golden Helix SVS provides an enhanced dialog for finding strings or parts of strings in not only the spreadsheet data fields, but also in column name headers, row labels and marker map fields. The Find dialog can be opened by either clicking on the magnifying glass in the tool bar or by going to Edit > Find (see Find Dialog for a Marker Mapped Spreadsheet).

findDialog

Find Dialog for a Marker Mapped Spreadsheet

To find a particular string or value, enter it in the text box. There are three matching methods available, Full Match, Partial Match and Regular Expression.

The “Full Match” option requires that the data field or value fully match the search string or value.

If it is desired that all fields contain the string or value be found, including partial matches, then the “Partial Match” option (default) should be selected. For example, if “1” is entered in the text box and the “Partial Match” option is selected, then the first field that contains a “1” anywhere in the specified search area will be found. If the “Full Match” option is selected then the first field containing only a “1” will be found.

Note

This search (with the Case sensitive check box unchecked) is case-insensitive. This means that capital letters will be considered to match lower-case letters when searching for matching column headers, row labels, or data cells.

If you check the Case sensitive check box, you will force capital letters to match capital letters and lower-case letters to match lower-case letters in the search.

It is necessary to specify where to search for a string or value. The string or value can be searched for in:

  • Column Names
  • Row Labels
  • Map Field (A particular map field available in an applied genetic marker map must be selected, such as chromosome)
  • Data

The search process can either start at the beginning of a spreadsheet, or at the upper left corner of the selected search field. The first occurrence of the string will be visible in the upper left corner of the searched field.

In cases where there are multiple occurrences of a search string, click on F3 or go to Edit > Find Next to find the next field. Repetitive use of Find Next will cycle through all occurrences of the string or value to be found. When no other fields match the search string, a message will be displayed indicating as such.

Recode

This sub-menu contains several options for recoding spreadsheet data, such as genotypes and alleles, and renaming row labels and column headers. Each recode function available in this menu is explained in detail below.

Recode Genotypes

This feature provides options for recoding genotypic data based on major or minor alleles, recoding based on reference or alternate alleles, flipping DNA strands for AGCT encoded genotypes, or transcoding AB to AGCT encoding based on a marker map field including Allele A/B information, or transcoding using an allele mapping specified in a marker map field. To open this window go to Edit > Recode > Recode Genotypes.

recodeGeno

Recode Genotypes Window

Recode Based on Major and Minor Alleles

The genotypic data for each marker is recoded based on major and minor alleles. The major and minor alleles can be specified two ways.

  • Observed Frequency: The first way is based on the observed frequency of each allele for each marker. For example, if there are only three occurrences of the A allele in a column and 2507 occurrences of the B allele in a column, then the A alleles will be replace with “D” (the minor allele) and the B alleles will be replaced with “d” (the major allele). If the genotypes for one marker are all homozygous in the same allele (say there are zero occurrences of the A allele and 2510 occurrences of the B allele) then the non-missing alleles in the entire column will be replaced with the major allele “d”. If genotypes are missing for an entire marker, the values will remain missing, ”?_?”.

    After recoding, if the spreadsheet had an applied marker map, two new fields will be added to the marker map, Major Allele and Minor Allele. In these fields will be the alleles used for recoding.

  • Major Allele and Minor Allele from Marker Map: If the spreadsheet has a marker map applied to it, and the marker map contains fields named Major Allele and Minor Allele the genotypes will be recoded based on the major and minor alleles specified in the marker map regardless of the observed frequencies.

The options for recoding are as follows, (see Recode Genotypes Window):

  • Encode genotypes as “DD”, “Dd”, or “dd”: Produces a new spreadsheet where the bi-allelic genotypes are coded as minor allele homozygous (DD), heterozygous (Dd) and major allele homozygous (dd) for each genotypic column. If the genotypes for one marker are homozygous in the same allele then the genotypes are coded as major allele homozygous (dd). This is useful as it gives insight into whether the genetic models view a specific sample as having two minor alleles (DD), one minor allele (Dd) or no minor alleles (dd).
  • Encode genotypes numerically based on genetic model: Produces a new spreadsheet where the bi-allelic genotypes are coded as 2, 1, or 0 based on a specific genetic model and the number of either major or minor alleles.
    • Additive model: Encodes bi-allelic genotypes based on the number of copies of the minor allele. Thus “DD” = 2, “Dd” = 1, and “dd” = 0. If the genotypes for one marker are homozygous in the same allele then the genotypes are coded as major allele homozygous (“dd” = 0).
    • Dominant model: Encodes bi-allelic genotypes based on the presence of a copy of the minor allele. Thus “DD” = 1, “Dd” = 1, and “dd” = 0.If the genotypes for one marker are homozygous in the same allele then the genotypes are coded as major allele homozygous (“dd” = 0).
    • Recessive model: Encodes bi-allelic genotypes based on the presence of two copies of the minor allele. Thus “DD” = 1, “Dd” = 0, and “dd”=0. If the genotypes for one marker are homozygous in the same allele then the genotypes are coded as major allele homozygous (“dd” = 0).
Recode Based on Reference and Alternate Alleles

To classify alleles as Reference/Alternate, a marker map field beginning with ‘Reference’ must be present in the spreadsheet’s marker map. If this is true, the radio option Classify Alleles by reference allele (r) vs. alternate allele (A) from the marker map will be available. The reference field used is specified below the option.

The genotypic data for each marker is recoded based on the specified reference allele in the marker map field. For example, if the reference allele is “C” according to the Reference field in the marker map then the genotypes will be recoded based on whether or not an allele is the reference allele or not. If genotypes are missing for an entire marker, the values will remain missing, ”?_?”.

If the marker map applied to the spreadsheet did not already have an Alternates field, one will be created and all observed alternate alleles (for a marker) will be stored in this field in a comma separated list.

There is also an option to recode all alternate alleles the same way if a column is multi-allelic. If the spreadsheet has multi-allelic columns and this option is not checked, multi-allelic columns will not be recoded.

The options for recoding are as follows, (see Recode Genotypes Window):

  • Encode genotypes as “AA”, “Ar”, or “rr”: Produces a new spreadsheet where the bi-allelic genotypes are coded as alternate allele homozygous (AA), heterozygous (Ar) and reference allele homozygous (rr) for each genotypic column. This is useful as it gives insight into whether the genetic models view a specific sample as having two alternate alleles (AA), one alternate allele(Ar) or no alternate alleles (rr).
  • Encode genotypes numerically based on genetic model: Produces a new spreadsheet where the bi-allelic genotypes are coded as 2, 1, or 0 based on a specific genetic model and the number of alternate alleles.
    • Additive model: Encodes bi-allelic genotypes based on the number of copies of the alternate allele. Thus “AA” = 2, “Ar” = 1, and “rr” = 0. If the genotypes for one marker are homozygous in the same allele then the genotypes are coded as major allele homozygous (“rr” = 0).
    • Dominant model: Encodes bi-allelic genotypes based on the presence of a copy of the alternate allele. Thus “AA” = 1, “Ar” = 1, and “rr” = 0.
    • Recessive model: Encodes bi-allelic genotypes based on the presence of two copies of the alternate allele. Thus “AA” = 1, “Ar” = 0, and “rr” =0.
Recode Based on Other Methods
  • Flip DNA strands for AGCT encoded genotypes: Takes a bi-allelic genotype encoded in the AGCT format and flips the order of the alleles. In other words, the genotypes “A_A”, “A_G”, and “G_G” would become respectively “T_T”, “T_C”, and “C”.
  • Transcode AB to AGCT encoding using mapping: Takes a marker map field such as “Allele A/B” and transforms bi-allelic genotypes “A_B” to AGCT genotypes using the marker map field. For example, if for a given marker the “Allele A/B” field contains [C/T], then the genotypes “A_A”, “A_B”, and “B_B” will be transcoded to “C_C”, “C_T”, and “T_T” respectively.
  • Transcode using allele mapping: Takes a marker map field such as “Allele Mapping” and transforms alleles according to the transformation specified in the marker map field. For example, if for a given marker the “Allele Mapping” field contains ‘A:G B:T’, then the genotypes “A_A”, “A_B” and “B_B” will be transcoded to “G_G”, “G_T”, and “T_T” respectively.

Recoding will create a new spreadsheet. The new spreadsheet can be created as a child of the project root or a child of the current spreadsheet. In either case no dependencies are created, but in the case of creating the spreadsheet as a child of the project root a new dataset is created.

Recode SNPs as Variants

This feature recodes AB genotypes to AGCT encoding using an rsID field from the marker map. To open this window go to Edit > Recode > Recode SNPs as Variants.

Recode SNPs as Variants Window

Recode SNPs as Variants Window

Select the rsID from the marker map, a dbSNP variant source (this must be from a downloaded local copy), and whether the new spreadsheet should be created as a child of the current spreadsheet or at the project root.

This feature will look up the reference and alternate allele(s) from the provided variant track using the selected rsID from the marker map. Then the genotype alleles will be matched to the reference and alternate allele(s) based on the 1kG Variant Frequencies, if available in the track, and their frequency in the spreadsheet. If a column could not be recoded, then the column will be dropped from the final results spreadsheet.

Rename Marker Mapped Labels

This feature auto-detects the applied marker map orientation and allows the Row/Column Name Headers with marker map information to be replaced with a string field in the applied marker map. Renaming row or column headers with information from a marker map requires that a marker mapped spreadsheet be applied.

renameGeno

Rename Marker Mapped Headers Window

With one exception, only string fields in the marker map are displayed in the drop-down box, as only string fields can be used as Row/Column Name Headers. The exception is that you may rename row or column headers with a string of the form “Chromosome:Position” for example, “1:2543” would be used for Chromosome 1 and Position 2543. The “Chromosome:Position” option will appear as the second option in the drop-down box, after “Chromosome”.

Rows/columns that are not renamed can either be kept in the new spreadsheet with their original headers or dropped. The rows/columns that are not renamed will be dropped if the appropriate box is checked.

Once a marker map string field is selected to be used as the new Row/Column Name Headers, a new spreadsheet is created as a child of the project root or a child of the current spreadsheet. In either case, no dependencies are created, but in the case of creating a spreadsheet as a child of the project root a new dataset is created.

A new marker map is also created with the new marker names.

Recode AGCT Alleles to AB

This function recodes AGCT alleles to the standard genotype format of A_A, A_B, and B_B. The marker map fields are scanned and a list of appropriate fields is generated. The user is prompted to choose a marker map strand field to use for recoding from the generated list.

The first allele listed in the reference allele string is recoded as the ‘A’ allele, and the second allele listed is recoded as the ‘B’ allele. Missing genotypes remain missing. The resulting recoded spreadsheet has the original marker map reapplied.

Rename Alleles

This function scans genotype columns for allele names (e.g. 1,2,3,4), then opens a dialog listing the current alleles and prompts for how the alleles should be renamed. Missing alleles remain missing.

The new assignment will apply to all genotype columns, not on an individual column basis. The marker map on the original spreadsheet will be reapplied to the new spreadsheet.

Convert to Pedigree Spreadsheet

A binary case/control or affection status column is required to use this feature. All other fields can be generated.

convertPedSS

Convert to Pedigree Spreadsheet Option

Select the pedigree fields from the drop down menus (see Convert to Pedigree Spreadsheet Option). Only columns of the correct type will be available in the menus. The required column types are:

  • Affection Status: Binary Column (Values of: 0/1 or ?/0/1)
  • Patient ID: Row Label column, or any Integer or Categorical column
  • Sex: Binary Column (Values of: 0/1 or ?/0/1)
  • Family ID: Row Label column, or any Integer or Categorical column
  • Mother ID: Any Integer or Categorical column
  • Father ID: Any Integer or Categorical column

The spreadsheet can be created as either a child of the Project Root or of the Current Spreadsheet.

Build Pedigree from Row Labels

When a spreadsheet contains only a few samples, or a certain pedigree structure is known for a few rows the pedigree structure can be specified explicitly one family at a time. This can be used instead of the above tool (Convert to Pedigree Spreadsheet) which requires an affection status column and generates missing data if the expected columns are not available.

This script is extremely useful for DNA-Seq analysis of small families. The pedigree can be applied directly instead of having to create a pedigree spreadsheet in a text or spreadsheet editor tool such as Notepad or Microsoft Excel.

To use this script, go to Edit > Build Pedigree from Row Labels. The first dialog prompts for the family ID, the parents to be specified, the action for the remaining rows and how the resulting spreadsheet should be sorted. See Build Pedigree from Row Labels - Specify Parents dialog window.

pedBuilder1

Build Pedigree from Row Labels - Specify Parents dialog window

Options for the Parent Specification Dialog

  • Specify Family ID: This should be a unique family identifier unless a more complex family structure is to be created. The family identifier will be used for all samples chosen to be a part of the pedigree.

  • Father: Optional. If this box is not checked no father will be specified for the pedigree and where applicable the father ID fields will contain the missing value indicator ”?”.

    • ID: A selection box containing all of the row labels in the original spreadsheet. Choose the label of the sample that is the father in the pedigree.
    • Affection Status: Select the affection status for the father. Options are Affected, Unaffected and Unknown.
  • Mother: Optional. If this box is not checked no mother will be specified for the pedigree and where applicable the mother ID fields will contain the missing value indicator ”?”.

    • ID: A selection box containing all of the row labels in the original spreadsheet. Choose the label of the sample that is the mother in the pedigree.
    • Affection Status: Select the affection status for the mother. Options are Affected, Unaffected and Unknown.
  • Action for remaining rows: In the second dialog window all remaining rows will be listed for selecting the children. If there are more than 10 remaining rows and not all will be used in the pedigree, it is recommended that Uncheck All is selected to avoid having to uncheck all of the samples that will not be used in the pedigree structure.

  • Sort rows in Father, Mother Children, Other order?: If this box is selected the rows will be resorted to place the father in the first row, the mother in the second row, the children in the following rows and any rows that were kept but not used in the pedigree at the bottom. If this option is not selected the original row order will be preserved in the new spreadsheet.

    If there are many rows it is recommended to sort the rows to place the pedigree at the top.

Note

The same ID cannot be specified for both the father and mother ID.

Once these options have been selected click Next > to select the children or offspring in the pedigree. See Build Pedigree from Row Labels - Select Children dialog window.

pedBuilder2

Build Pedigree from Row Labels - Select Children dialog window

Options for the Children Selection Dialog

All rows after removing those selected for the mother and father (if specified) are included in this dialog. If a particular row label is not selected as a child but kept in the resulting spreadsheet, the pedigree columns will be filled with the missing value indicator ”?”.

The options for this dialog are as follows:

  • Keep subjects not chosen as individuals in the new spreadsheet?: If not all of the remaining row labels are chosen as children, those rows not included in the pedigree can either be dropped from the resulting spreadsheet or kept.
  • Set <Sample Name> as a child: Check the box if the sample is to be included in the pedigree.
    • Specify gender: Set the gender for the selected sample. Options include: Male, Female and Unknown.
    • Specify affection status: Set the affection status for the selected sample. Options include: Affected, Unaffected and Unknown.

Once the samples have been chosen for the children or offspring click OK.

The resulting spreadsheet will be identified as a pedigree spreadsheet and can be used for analyses that require pedigree columns.

Note

  1. This script can be run multiple times to specify more than one pedigree. If the row labels are not reused the pedigree information will be merged.
  2. If samples that already have pedigree information are chosen again in this script the new information entered will overwrite the existing information. Thus if a parent is not specified in the dialog when one was specified before, the information will be lost and replaced with the missing value indicator.

Add Allele Delimiter

This function will convert categorical genotypes such as AA, AG without a delimiter to the standard genotype format accepted by Golden Helix SVS (A_A, A_G, etc). The user is prompted for a missing value indicator, such as ‘?’.

The appropriate columns for this function will be recognized as Categorical upon import, then be converted to Genotype by the function. Only the appropriate categorical columns should be active in the spreadsheet, as the function works on all categorical columns, which may not be appropriate.

Bi-allelic Expansion

This feature allows datasets with multi-allelic variants to be represented in a bi-allelic format. Newer genotyping technologies are sometimes able to detect multiple alternate alleles at a given locus, but many older work-flows only support bi-allelic columns. Here is an simple example of a column with multi-allelic data:

Multi-allelic example

Multi-allelic example

Since there are 3 alleles (A, C, T) this column can’t be used in bi-allelic workflows. However, by expanding into two columns (one for each alternate) it is possible to represent this data as bi-allelic:

Example after bi-allelic expansion

Example after bi-allelic expansion

Now both columns are effectively bi-allelic. Here, !C represents “A known allele that is not C”. We can’t be any more specific without introducing additional alleles.

To perform this expansion, click Edit > Recode > Biallelic Expansion. Please note that columns that are already bi-allelic will not be recoded by this feature.

biallelicExpan

Bi-allelic Expansion Window

The spreadsheet must be marker mapped and contain a “Reference” field that indicates which allele matches the reference sequence. You can also choose from several different notations for “not alt” with the “Not Alternate” option.

In order to avoid confusion, each expanded column is given a unique header. The “Column Rename Format” section can be used to customize the column names. The following special tags will be replaced with actual values from the data:

  • {chr} = Chromosome
  • {pos} = Position
  • {ref} = Reference Allele
  • {alt} = Alternate Allele(s)
  • {allele1} = First Bi-Allelic Column Allele (alphabetical)
  • {allele2} = Second Bi-Allelic Column Allele (alphabetical)
  • {variant} = Variant type (I:Insertion, D:Deletion, S:Substitution)
  • {size} = Variant size(s) in bp (length of ref,alt for uneven substitutions)

Lastly, it is sometimes useful to have a numeric count of alternate alleles at each locus rather than a text representation of the alleles. Enable “Substitute Variation Counts in Output” to get alternate allele counts.

Spreadsheet Tools

Set Values to Missing based on Second Spreadsheet

Spreadsheet data will be set to missing given a filtering requirement that is applied to the corresponding numeric or categorical values in a second spreadsheet. The two spreadsheets must have at least one overlapping column header and row label.

Two different filtering mechanisms are available depending on the columns in the second spreadsheet. If the most common column type in the second spreadsheet is numeric (Real, Integer or Binary), then a threshold filtering mechanism appears in the second prompt. If the most common column type is categorical or genotypic, a list of values may be given that correspond to values being set to missing.

Requirements

  • A second spreadsheet containing several numeric (real, integer or binary) or categorical columns.
  • There must be at least one overlapping column header and row label in the two spreadsheets.

Compare and Activate by Column Agreement

Compares two columns and inactivates rows in which the data values in the two columns differ. The user also has the option to create a subset spreadsheet containing the rows with matching data values in the two columns and/or a subset spreadsheet containing the rows with differing data values in the two columns.

To compare two columns in a spreadsheet, choose Select > Compare and Activate by Column Agreement from the desired spreadsheet. Add the two columns by clicking Add Columns and selecting the appropriate column headers.

Note

Two columns must be selected with the column chooser. An error is thrown if less than or greater than two columns are chosen.

The subset spreadsheets can be created by checking the appropriate check boxes under Create subset spreadsheet(s) of:; Rows with matching data values and/or Rows with differing data values.

Select Variants by Filtering on a Marker Map Field

Marker maps can often contain information that can be used for filtering such as quality scores or flags from VCF files or strand information that can be used for strand flipping. This tool allows for obtaining a subset of variants or SNPs based on the specified filter criteria. The criteria for selection can either be a threshold for a numeric value, a list or both.

Note

If the spreadsheet was imported with the VCF importer, it is possible that numeric fields that contained conflicts were converted to strings with conflicting values listed, separated by semi-colons. If the threshold filtering criteria is chosen, the user has the option to choose how to handle these types of fields.

To select either marker mapped rows or columns by filtering on a marker map field, go to Select > Select Variants by Filtering on a Marker Map Field. Next, click on the Select Map Field button and choose the map field to use for filtering. This field can be numeric or categorical.

There are two options for the filtering criteria:

  1. Use a threshold as criteria: This option activates or inactivates markers with map field values that adhere to the specification.
    • If the field contains a list of values separated by a semi-colon, then either the mean, minimum or maximum value can be used for filtering. These options are provided below the threshold dialog. By default the mean value will be used.
  2. Use this list as criteria: This option allows for specifying a list of map field values that are to be used for filtering. For example, only variants or SNPs with a reference allele of A or C can be selected by entering in A,C in the text box.

The variants or SNPs identified as matching the filtering criteria can either be activated (kept) or inactivated (filtered) in the resulting spreadsheet. Also optional is the creation of a subset spreadsheet containing all columns or rows that remained active after applying the selection and filtering criteria to the original spreadsheet.

Sort by Marker Map Field

A row-mapped spreadsheet is required to use this feature. Choose a marker map field and sort direction (either ascending or descending) to sort the rows based on the marker map.

Create Labels from Marker Map Field

Dialog window for the Create Labels from Marker Map Field feature

Create Labels from Marker Map Field

A marker mapped spreadsheet is required to use this feature. Choose a marker map field to replace either the column headers or row labels with. If the marker map is column oriented, the column headers will be replaced, if row oriented, the row labels will be replaced. The new spreadsheet can be either a child spreadsheet or at the project level.

Row Select Operations

There are several operations for row selection as described below.

Activate/Inactivate All Rows

To activate all rows use the Select > Row > Activate All Rows command.

To inactivate all rows simultaneously use the Select > Row > Inactivate All Rows command. No analysis operations can be performed with all rows inactivated. This option is used in the case when it is desired that most, but not all rows, be set to inactive.

Invert Row Selection

Using the command Select > Row > Invert Row Selection flips the state of the active and inactive rows. By inverting, all of the inactive rows become active and all of the active rows become inactive.

Row Subset Spreadsheet

Creates a row subset spreadsheet will all active rows in the resulting child spreadsheet. See Row Subset for more information.

Selecting a Random Subset of Rows

A random subset of rows can be selected using the Select > Row > Select Row Subset tool. A random subset can be chosen based on:

  • Random fraction: Specifies what percentage of records to use (default = 0.5).
  • Random selection size: Specifies the number of records (default = total number of rows).
  • First N items: Specifies the first N rows to use from the spreadsheet (default = total number of rows).
  • Reset random seed: Changes the random seed although the default of 1 will do in most cases. Resetting the random seed to 1 picks the same random subset as would have occurred if the program had just started up and no random number generation had taken place.

Randomizing Row Order

Rows can also be randomized using the Select > Row > Randomize Row Order tool. Rows can be reverted to the original sort order by either clicking Unsort at the top left of the spreadsheet, or by clicking on a column header to get the Column Header Menu and selecting Unsort.

Column Select Operations

There are several operations for column selection as described below.

Activate/Inactivate All Columns

To activate all columns use the Select > Column > Activate All Columns command.

To inactivate all columns simultaneously use the Select > Column > Inactivate All Columns command. If there are many columns, it is easier to inactivate all columns and then activate the few columns to be analyzed. At least one column state needs to be set to either active or dependent for any analysis or subset spreadsheet options.

Activate By Column Type

It is often useful to only want columns of certain types to be active in a spreadsheet. Choose Select > Activate by Column Type, and the type selection dialog will open. All data types present in the spreadsheet will be listed as check boxes. Check the data types of the columns you would like to make active. All column that are of an un-checked data type will be made inactive.

Invert Column Selection

For convenience, states of columns can be inverted Select > Column > Invert Column Selection. By inverting, all the inactive columns become active, and all the active and dependent columns become inactive. This can be useful for splitting analysis up into two groups of chromosomes.

Column Subset Spreadsheet

Creates a column subset spreadsheet based on all active or dependent columns in the resulting child spreadsheet. See Column Subset for more information.

Random Column Selection

Selects the specified number of columns by type from the original spreadsheet. Optionally create a subset spreadsheet based on the random selection.

Select Columns by Chromosome Region

This tool scans genotypic columns in a marker mapped spreadsheet and activates the columns based on a user-specified chromosome start and chromosome stop position. The defined region may be within one chromosome or overlap several chromosomes.

For example, Chr1:1234-1234567 or Chr5:2345-Chr8:23454.

The marker map for the genotypic spreadsheet must contain at least chromosome and position fields.

Activate By Chromosome

In a marker mapped spreadsheet, the map information can be used to activate all columns corresponding to specified chromosomes. Choose Select > Activate by Chromosomes, and the chromosome selection dialog will open. All chromosomes available in the spreadsheet as determined by the corresponding marker map Chromosome field will be listed. Un-checking all the chromosomes will inactivate all columns corresponding to the chromosomes in the spreadsheet. The Uncheck All button is best used to clear all check marks and then to select a few chromosomes to activate. Chromosomes that have active columns in the spreadsheet will be listed at the top of the dialog. All chromosomes can be selected by clicking on the Check All button.

Activate by Gene List

This feature activates mapped markers that are inside the selected annotation source gene regions, given their existence in an separate spreadsheet’s column.

From a spreadsheet with a marker map applied to the columns, go to Select > Activate by Gene List.

Options to specify:

Choose a separate spreadsheet that contains the specified gene names in a categorical column, then select the appropriate column. Choose a gene track to define gene regions.

Checking the “Include nearby markers” option will treat markers that are within a specified distance from gene region boundaries as part of the gene region for filtering purposes.

Results:

In the original spreadsheet, the columns that meet the inactivation criteria are inactivated. The other columns are left active.

Creating Subset Spreadsheets

If certain rows or columns have been inactivated and need to be removed from a spreadsheet, a subset spreadsheet can be created. There are three ways to create a subset spreadsheet.

Row Subset

A new spreadsheet can be created using only the active rows of a spreadsheet. Using the Select > Row > Row Subset Spreadsheet option creates a new spreadsheet with only the active rows, and all of the columns from the original spreadsheet. This option keeps the states of all the columns the same in the subset spreadsheet. The Row Subset icon can also be used to create a row subset spreadsheet.

Column Subset

A new spreadsheet can be created using only the active columns of a spreadsheet. Using the Select > Column > Column Subset Spreadsheet option creates a new spreadsheet with only the active or dependent columns, and all of the rows from the original spreadsheet. This option keeps the states of all the rows the same in the subset spreadsheet. The Column Subset icon can also be used to create a column subset spreadsheet.

Subset Active Data

A new spreadsheet can be created using only the active rows and active columns of a spreadsheet. Using the Select > Subset Active Data option creates a new spreadsheet with only the active rows and columns from the original spreadsheet. This option keeps the state of any dependent columns in the original spreadsheet dependent in the subset spreadsheet.

Column and Row Spreadsheet Operations

A menu for column operations can be viewed by either left- or right-clicking on a Column Number Header, or by right-clicking on a Column Name Header.

Sorting on a Column

Columns can either be sorted in ascending or descending order. Sorting can be performed on a column by left- or right-clicking on the Column Number Header, or by right-clicking on the Column Name Header and selecting one of the sorting methods (Sort Ascending, Sort Descending, or Unsort). Sorting can be removed by clicking on the Unsort button at the upper left corner of the spreadsheet or by selecting Unsort in the Column Header Menu.

Activating/Inactivating Rows Using Column Values

Rows can be activated or inactivated based on the values of a column. The operation depends on the type of column, and all operations are found in the Column Header Menu.

  • A Binary (B), Categorical (C) or Genotype (G) column has the following options:

    • Activate By Category: The parameters needed for this dialog are a category selection dialog and whether to filter only on currently active rows.

      In the category selection dialog, one or more categories (with a maximum of 200 categories) can be selected by either using the <Ctrl> or <Shift> key and left-clicking on the desired categories to have active. All categories that were not selected will be made or left inactive.

      If Filter currently active rows only is checked, only active rows will be affected by this menu operation. Otherwise, all spreadsheet rows may be affected.

      Note

      1. Missing data (”?” or ”?_?”) is considered one category of data for these column types.
      2. If there are more than 200 categories in a column this feature will not be displayed in the Column Header Menu.
    • Inactivate Missings: Inactivates all rows with a ”?” in the selected column and keeps the state of all other rows the same.

  • A Real (R) or Integer (I) column has the following options:

    • Activate By Threshold: The parameters needed for this dialog are the direction of the threshold to activate vs. inactivate, the numerical threshold value, whether to filter only on currently active rows and whether to inactivate missing values. For example, if you want to activate all rows with column values greater than or equal to 0.986, select >= in the drop-down menu, type 0.986 into the text box in the Activate By Threshold dialog, and uncheck Filter currently active rows only. Missing values will be inactivated by default.

      If you now want to allow only rows with column values greater than or equal to 0.986 but less than or equal to 1.254 to be active, go back to the Activate By Threshold dialog and select <= in the drop-down menu, type 1.254 into the text box, and check Filter currently active rows only. Since rows with column values less than 0.986 have already been made inactive, they will remain inactive, while the (active) rows that are greater than 1.254 will be inactivated.

    • Inactivate Missings: Inactivates all rows with a ”?” in the selected column and keeps the state of all other rows the same.

Plotting from a Column

A column of numeric values can be plotted using the options in the Column Header Menu:

  • Plot Variable (in GenomeBrowse ): Plots the columns of data against row labels or genetic marker map information. See Numeric Value Plot for more information.
  • Plot Histogram: Plots a histogram of the selected column data. See Histograms for more information.
  • Plot X: [Column Name Header] vs Y:Dependent(s): Plots the selected column of data as an independent variable against one or more columns of dataset as dependent variables. This option requires at least one column state be set to dependent (column is magenta). See XY Scatter Plots for more information.

Copy Column Header, Row Label and Marker Map Information

Column Header, Row Label and Marker Map information can be copied from the appropriate column or row menu. Marker Map fields are viewable from the Copy sub-menu.

Edit This Spreadsheet

Opens the Spreadsheet Editor. See Editing a Spreadsheet for more information.

Column Average

This function calculates the average of real- and integer-valued columns.

Column Sum

This function calculates the sum of binary, real- and integer-valued columns.

Value Counts

This function counts the occurrences of values of a categorical, binary, genotypic and integer column.

Pie Chart

This function creates a pie chart for the number of occurrences of values for categorical, binary, integer, and genotypic columns. For real valued columns, the values are binned and a pie chart is displayed for the counts of the bins.

Calculate Expected P-value

This function calculates expected p-values for numeric column. -log10(Pvalues) can optionally be included in the output.

Transposing Spreadsheets

Spreadsheets can be transposed to facilitate analysis and/or plotting. It is important to note that only columns of the same data type can be transposed. If there were three columns, a binary column, a categorical column, and a real-valued column, transposing would create a binary row, a categorical row and a real-valued row, however, the column type would have to be the lowest common divisor, which would be categorical, so most analysis and plotting properties would be lost.

The Transpose Spreadsheet option is located in the Edit menu, Edit > Transpose Spreadsheet.

Transpose Spreadsheet Options

There are several options to use for transposing the data in a spreadsheet.

  • New Dataset Name: The default dataset name can be quite long and is based on the name of the spreadsheet being transposed. It is recommended that this default name be edited to a more informative name.
  • Data Type to Transpose: In the drop down list box, the possible types of data to be transposed are listed as well as the number of columns with that data type in parentheses.
  • Label for Column Names: The new Row Label Header default is a generic “Columns”, a more informative name should be chosen such as “SNPs” or “Genetic Markers”.
  • Memory to Use: (Only available for x64-bit versions.) Depending on the size of the dataset to be transposed, the process could be memory intensive. Changing this default changes the size of the memory cache. A larger value could make the transposing process run faster, but this value should not exceed 70% of the available RAM on the computer.
  • Spreadsheet as Child of: The spreadsheet can either be created as a child of the project root or of the current spreadsheet. The advantage to creating the transposed spreadsheet as a child of the project root is that the original spreadsheet can be deleted. However, sometimes it is advantageous to have all related spreadsheets in a parent/child chain to easily identify how the spreadsheets were created.

Create Top-Level Spreadsheet

A top-level spreadsheet can be created to remove all dependences between navigator window nodes so that unnecessary navigator window nodes can be deleted. A project that only contains top-level spreadsheets will be larger in size than a project that allows some dependent relationships between navigator window nodes.

This option is located in the File menu, File > Create Top-Level Spreadsheet. A dialog prompting for a new dataset name will open. It is recommended that the dataset name be changed to an informative dataset name from the default, if the default is not informative.

Create Marker Map from Spreadsheet

In order to create a genetic marker map from the current spreadsheet, select File > Create Marker Map from Spreadsheet. Columns containing the Marker Name, Chromosome and Position information must be specified. If a field containing “stop” or “end” is detected, this field will be included in the map by default as the interval stop. If this field is not desired, it can be unchecked. Including a “Stop” field allows for displaying map data as intervals instead of single position features.

The name containing Marker Name and Chromosome information must be categorical while the column(s) containing position information must be integer-valued. An informative marker map name can be specified with the default name “New Marker Map.” Additional columns can be added as desired and the column names can be specified as well.

Add Columns to Marker Map

This option allows users to add column data from a spreadsheet to the current marker map. A new map will be created and saved in the marker maps folder.

The Add Columns to Marker Map option is located in the File menu, File > Add Columns to Marker Map.

Add Columns to Marker Map Options

There are several options to use when adding columns to a marker map.

  • Columns to Add: Click the Add Columns button to choose one or more columns from the spreadsheet to add to the marker map.
  • Output Marker Map Name: Use this field to give a name to the new marker map.
  • Apply to Current Spreadsheet: When this option is checked, then the new marker map will be applied to the current spreadsheet.

Add Alternates to Marker Map

This option allows the user to create an Alternates field in the current spreadsheets marker map. The user will be prompted to select the Reference allele field from the map and then the tool will query the sample genotypes to determine the alternate allele(s) for each marker in the dataset.

The Add Alternates to Marker Map option is located in the File menu, File > Add Alternates to Marker Map. An new map will be created and applied to the current spreadsheet.

addAlts

Add alternate alleles to marker map

Exporting Spreadsheet Data

Data from spreadsheets can be exported to many formats, which include text or third party file formats, PED/MAP, TPED/TFAM, or BED/BIM/FAM format files, Golden Helix file format, Marker Map format and CNT file format. Exporting or saving in these file formats are described below.

Saving as a Text or Third Party File

To save a spreadsheet as a text or third party file, select File > Save As... > Text or Third Party Format.

exportText

Saving as a Text or Third Party Window

To select the file format and the location to save the spreadsheet, click the Browse button. This opens a Save As dialog window, with the default file format of .csv (comma-delimited text file). To save the spreadsheet as a .txt file, add the .txt extension after the file name. The file format type can be selected from the drop-down menu or by specifying the extension, such “test.xls”.

There are several Export Options available. You can either export only the active data (this includes columns set as a dependent variable) or export all data. You can also indicate whether you want to include column headers and row labels in the saved spreadsheet. If these boxes are not checked, then the first column and first row exported will contain data. Due to column oriented optimization within Golden Helix SVS, it may be advantageous to transpose the data on export, especially if saving to a data format with either column or row size limitations (like .xls). If the transpose option is selected, The row label header may also be specified, with the default header being “Columns.” For text files (.csv or .txt) you have the option of specifying the field delimiter by choosing from Comma, Tab, Space or specifying a different delimiter by choosing Other ->. You can also indicate how missing values are to be encoded by editing the “Encode missing values as:” text box.

If there are genotypic columns in the spreadsheet, then exporting options for genotypes will be available. The options that can be specified for genotypic export include specifying the allele delimiter and the missing allele encoding. There is a list menu for typical allele delimiters, with ‘_’ (underscore) set as default. A user specific character can be specified by selecting Other -> and specifying the custom delimiter in the text box. The encoding for missing alleles can also be specified in the appropriate text box.

Saving as PED/MAP, TPED/TFAM, or BED/BIM/FAM Files

Golden Helix SVS can export to plain text PED/MAP files, plain text TPED/TFAM files, and optimized binary PED or BED files (which should have corresponding BIM/FAM files). To accommodate the fact that these files are expected to have full marker map and pedigree information for each SNP, Golden Helix SVS can artificially generate marker map and/or pedigree information if the spreadsheet is not marker mapped or is not a pedigree spreadsheet.

exportFam

Saving as a Family Indexed Genetic Format Window

To save a spreadsheet as PED/MAP, TPED/TFAM, or BED/BIM/FAM files, select File > Save As... > PED/TPED/BED.

To select the file format and the location to save the spreadsheet, click the Browse button. This opens a Save As dialog window with the PED/MAP file format type selected as default.

To select the file format type, use the “Save as type” drop-down menu. The file format types appear in this menu as follows:

  • PED/MAP - 2 files (*.ped *.*)
  • TPED/TFAM - 2 files (*.tped *.*)
  • BED/FAM/BIM SNP-Major - 3 files (*.bed *.*)

Three option boxes are available:

  • A general Export Option allows you to either export only the active data (this includes columns set as a dependent variable) or export all data.

  • Family-index column assignments for the first six PED, TFAM, or FAM columns:

    • Affection Status
    • Patient ID
    • Sex
    • Family ID
    • Mother ID
    • Father ID

    If your spreadsheet is a pedigree spreadsheet, these column assignments default to their pedigree-spreadsheet counterparts, while other columns are selectable. In addition, for every column except Affection Status, artificial values may be generated.

    If your spreadsheet is not a pedigree spreadsheet, this export option is still available; however, in that case, while Golden Helix SVS will try to find reasonable defaults, it may be desirable or even required to generate artificial values for some or all of the columns other than the Affection Status.

  • An option to specify whether you want to map numerical chromosomes (this converts non-autosomes to expected allosomes) or to import the chromosomes “as is”. If mapping numerical chromosomes, you can indicate whether you are importing human genome data or non-human genome data. If you are importing non-human genome data, the number of autosomal chromosomes in the data must be specified.

Note

  1. If your spreadsheet does not have a marker map applied, artificial values will be generated for the marker mapping.
  2. In all cases, missing phenotypes in the Sex or Affection Status will be encoded as “-9”, and missing genotypes will be encoded as “0”.
  3. The first four columns of the PED, TFAM, and FAM file formats are identifiers for which missing values will be encoded as “0”.

Saving as Either a DSF or GHD File

DSF File

In order to save a spreadsheet in the Golden Helix DSF format, select File > Save As... > DSF File.

exportDSF

Saving as DSF Window

The DSF format is Golden Helix SVS’s proprietary sparse data storage format and is the only format that preserves all marker map information on export. To select the location and name of the output DSF file, click Browse. You can change the dataset name from the name when the original dataset was imported. This name will be used when imported into another Golden Helix SVS project. There is the option to either save only the active data or all of the data in the current spreadsheet. To save only the active data, make sure that Active Data is selected; to select All Data, click on the label or the radio button.

Legacy GHD File

In order to save a spreadsheet in the Golden Helix Legacy GHD format, select File > Save As... > Legacy GHD File. This file format is the one sparse storage format that is backwards compatible with previous versions of Golden Helix SVS. To select the location and name of the output GHD file, click Browse. There is the option to either save only the active data or all of the data in the current spreadsheet. To save only the active data, make sure that Active Data is selected; to select All Data, click on the label or the radio button.

Save as CNT Files

If a marker map is applied to the columns of the current spreadsheet, separate CNT files can be saved for each sample. For information on CNT file formats, see Affymetrix CNT File Format.

Variant Call Format (VCF)

This export requires a genotypic dataset with a marker map applied with a minimum of one additional field containing the reference allele.

Specific optional map fields include identifier, quality and filter. These fields have a specific columns allocated in the VCF file, which are filled in if the fields are selected and left as missing if not.

All other marker map fields can optionally be included and will appear in the VCF file’s INFO field. Alternate alleles are determined by the alleles present in each column.

The user may also select additional spreadsheets to export as data defined in the FORMAT field. The additional spreadsheets must contain at least one overlapping column header and row label found in the genotype spreadsheet. Examples of additional spreadsheets may include Read Depth or Quality Score.

All of the above options are selected in the first dialog. If at least one additional marker map field is found or at least one additional spreadsheet is selected, a second prompt will appear.

For additional marker map fields, the user may select:

  • Abbreviation: Used in the INFO field such that Abbreviation=Value. To not include the field in the VCF file, leave the abbreviation field blank.

For additional spreadsheets, the user must select:

  • Abbreviation: Used in the format field, e.g. GT:Abbreviation1:Abbreviation2.
  • Description: Used in the VCF header, e.g. FORMAT=<ID=DP,...,Description=”User input”>
  • The column type is automatically detected as the most frequent column type found in the spreadsheet.

Files follow the VCF (Variant Call Format) version 4.1 specifications. More information can be found at http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41.