2.7. Working with Multiple Spreadsheets¶
2.7.1. Difference Between Appending or Joining Two Spreadsheets¶
There are two ways to combine spreadsheets: you can append or join two spreadsheets. An append is used when you want to add more rows and both spreadsheets have more or less the same columns. A join is used when you want to add more columns and both spreadsheets have more or less the same rows.
2.7.2. Appending Spreadsheets¶
To append, select File > Append Spreadsheets and select a spreadsheet to append to the current spreadsheet. There are three options for appending as detailed below.
A new dataset name can be specified, otherwise the default name is Appended Spreadsheet. One convention is to name the appended spreadsheet “A appended to B” where A is an identifiable name (possibly shortened) for the current spreadsheet, and B is an identifiable name for the spreadsheet to be appended.
Column headers can be matched case sensitive or not with the column header of the current spreadsheet appearing in the resultant spreadsheet.
Non-matching columns between the two spreadsheets can be dropped or filled with missing values. Indicate your choice by clicking on the appropriate radio button.
You can also set the spreadsheet to be a child of the project root or of the current spreadsheet. If a child of the project root is selected, there are no dependencies between the appended spreadsheet and the two original spreadsheets, and these sheets can be deleted without deleting the appended dataset.
The appended spreadsheet will not have a marker map applied regardless of the current and selected spreadsheets’ marker map status.
2.7.3. Joining or Merging Spreadsheets¶
To join or merge, select File > Join or Merge Spreadsheets and select a spreadsheet to join to the current spreadsheet.
The current spreadsheet is referred to as the left spreadsheet and the spreadsheet being joined is referred to a right spreadsheet. This convention is used because the new columns are added to the right of the columns of the current spreadsheet. If a spreadsheet is marker mapped the name will appear in green text, otherwise the name of the spreadsheet will be in blue text.
There are five options for joining spreadsheets, as detailed below.
New Dataset Name: A new dataset name can be specified, otherwise the default name is “A + B” where A is the name of the left spreadsheet, and B is the name of a right spreadsheet to be joined.
Matching Criteria: Two spreadsheets can be joined based on row labels, if the row labels for the two spreadsheets are more or less identical. To join based on row labels click on the radio button in front of “Use row labels”. If two spreadsheets are to be joined but do not have identical row labels, the rows can be matched based on a custom sort order. The standard custom sort order is used if the left and right spreadsheets are sorted in the correct order for joining, and the desired row labels are the labels of the left spreadsheet.
If a different sort order of one or both of the spreadsheets, or a different row label convention is desired, click on the radio button for the custom sort order and then click on Define. This dialog allows you to sort both the left and right spreadsheets by any of the first 20 columns. You can sort either ascending or descending by clicking on the column headers. Individual rows can be moved by clicking on a row and clicking on Move Up or Move Down. There are three options for the row labels of the joined spreadsheet: use the labels of the left spreadsheet, use the labels of the right spreadsheet, or join the labels of both spreadsheets by an underscore. The edited custom sort order will be displayed on the Spreadsheet Join dialog.
Case Sensitivity: Column headers and row labels can be match exactly (case sensitive) or case insensitive.
Unmatched Rows: If there are any rows that do not match either by row labels or by custom sort order, you can choose to drop the rows that do not match, or to keep and consider the empty cells to be missing data.
Using custom sort order when the two spreadsheets do not have the same number of rows will result in unmatched rows.
Columns with the Same Name: There are three ways to deal with columns sharing the same name. They can be kept as separate columns or the column from either the left or right sheet can be kept with missing values filled in from the other sheet (if the data is not missing there as well).
Spreadsheet as Child of: The joined spreadsheet can be created as a child of the project root or the current spreadsheet. In either case the joined spreadsheet will be dependent on both the left and right spreadsheets, so neither of these spreadsheets can be deleted without also deleting the joined spreadsheet.
To create a top-level spreadsheet from the joined spreadsheet, see Create Top-Level Spreadsheet.
If the current spreadsheet has a marker map applied and the selected spreadsheet has the same map applied or does not have a map applied, the current spreadsheet’s map is reapplied. If the selected spreadsheet has a different marker map applied, the merged spreadsheet will have a merged map containing the intersection of the map fields applied to the original spreadsheets.
2.7.4. SNP Consensus and Concordance¶
This feature compares genotypes over several markers and subjects, either in one or several spreadsheets. Consensus genotypes and concordance rates are reported in four optional output spreadsheets; Concordance by Subject, Concordance by SNP, Discordant Genotypes and Consensus Genotypes.
This feature is available in the Project Navigator > Tools menu. The user can choose one or more spreadsheets to compare and four optional output spreadsheets are available (at least one must be selected). The user can also optionally prepend the output spreadsheets with a specified string. Comparisons can be case-insensitive (default) or case-sensitive.
After all spreadsheets are selected, the user can choose how to define markers and subjects separately for each input spreadsheet. Subjects can be defined by any categorical column (including row labels) or by row order while markers can be defined by column headers, and if a marker map is applied by Chromosome:Position or a different marker map field. If Marker Map Field is selected, the field can be specified in an additional dialog.
If at least one of the selected spreadsheets has a marker map applied to its columns, a marker map will be applied to the appropriate output spreadsheets. If multiple spreadsheets have marker maps, the map data will be combined, with matching string fields concatenated by semicolons and matching numeric fields kept separately.
The following output spreadsheets are available:
Concordance by Subject: This spreadsheet contains concordance statistics for each subject based on user definitions. The first column contains the # Comparisons, which is defined by the number on SNPs in that subject’s row that had at least two non-missing genotypes. Thus 0 <= # Comparisons <= Number of SNPs. The second column contains the # Concordant, which is defined by the number of comparisons in which all genotypes agree. The third column contains the Concordance Rate or #Concordant/#Comparisons. The fourth column contains the Discordance Rate is defined by 1-Concordance Rate.
Concordance by SNP: This spreadsheet contains concordance statistics for each SNP based on user definitions. The first column contains the # Comparisons, which is defined by the number on Subjects in that SNP’s column that had at least two non-missing genotypes. Thus 0 <= # Comparisons <= Number of Subjects. The second column contains the # Concordant, which is defined by the number of comparisons in which all genotypes agree. The third column contains the Concordance Rate or #Concordant/#Comparisons. The fourth column contains the Discordance Rate is defined by 1-Concordance Rate.
Discordant Genotypes: This spreadsheet contains all genotypes for every Subject/SNP that had at least two non-missing genotypes regardless of whether the calls agree or disagree. The order of genotypes directly relates to the order of spreadsheets listed in the log node, then by row and column order.
Consensus Genotypes: This spreadsheet contains a row for every subject and a column for every SNP. The order of rows will correspond to subjects found in order of spreadsheets then rows. The order of columns will correspond to markers found in order of spreadsheets then columns. The genotypes will be reported as missing if all calls were missing or if at least two genotypes disagreed. Otherwise the agreeing (or only) genotype will be reported.
2.7.5. Compare Variants Across Several Spreadsheets¶
This tool compares variants across several spreadsheets and computes the number of variants shared over all possible spreadsheet combinations. In the case when two or three spreadsheets are given as input, a Venn diagram is created.
Each spreadsheet compared must contain marker mapped columns. Only mapped columns (of any type) are compared.
In all cases a Variant Comparison Counts output spreadsheet is created. The row labels in this output contain the node id’s of the spreadsheets compared. The final column contains the number of overlapping variants found in all spreadsheets. The first N binary columns correspond to N input spreadsheets, where a 1 indicates that the node id is included in the comparison.
In the case when two or three spreadsheets are selected, a Venn Diagram is created. The user is prompted for a shortened spreadsheet name to be displayed as text in the plot and a color for each spreadsheet.
A two- or three-way Venn diagram is then created based on the user-specified options.
2.7.6. Build Sample Collated Spreadsheet¶
This tool transposes and collates several spreadsheets together. The collated spreadsheet contains a row for each intersecting column in the original spreadsheets and several columns for each original row over all spreadsheets.
This was originally designed for use on spreadsheets imported using the VCF importer. For example, if a user imports genotype (GT) and read depth (RD) spreadsheets, the collated spreadsheet would contain a row for each variant and two columns for each sample, e.g. Sample1-GT, Sample1-RD, Sample2-GT, Sample2-RD, etc.
Using the Build Sample Collated Spreadsheet Tool¶
From the project navigator, choose Tools > Build Sample Collated Spreadsheet.
Choose the appropriate spreadsheets and optionally choose a base dataset name that will proceed ‘-Sample Collated Spreadsheet’.
In the Advanced Options tab, the transpose memory limit may be set to something other than the default (based on the operating system and version of SVS).
For each previously selected spreadsheet, choose the new column header suffix (GT and RD in the example above) and the column type to transpose.
The default column type will be automatically detected based on the type representing the largest number of columns in that spreadsheet.
The output spreadsheet contains the original columns as rows and a separate column for each sample from each spreadsheet.