Editing a Spreadsheet

A powerful spreadsheet editor facilitates real-time data manipulation, editing, and enrichment on a grand scale. Go beyond traditional spreadsheet programs by not only performing common edits easily but by transforming your data in a variety of ways, such as automatically expanding categorical variables to binary columns, creating new variables based on thresholds and rounding, editing with regular expressions, and more.

Spreadsheet Editor Overview

When the Spreadsheet Editor is open, no other changes can be made to the dataset associated with the spreadsheet being edited until the editor is closed. This is done to prevent any circular references while the editing process is enabled. Datasets that are not linked with the current spreadsheet being edited can be manipulated and analyses run. This includes viewing spreadsheets or moving any windows. The Python Editor window or the user scripts folder can also be opened and scripts written while in the Spreadsheet Editor.

ssEditor

Spreadsheet Editor Viewer with Changes in Red

There are three major regions to the Spreadsheet Editor (see Spreadsheet Editor Viewer with Changes in Red): the title bar, the menu and tool bars, and the spreadsheet itself.

The Title Bar

The title bar for the Spreadsheet Editor indicates the spreadsheet being edited as well as the navigator window node id of the spreadsheet being edited: “Editing: Spreadsheet Name [Node ID]”.

The Menu and Tool Bars

The Spreadsheet Editor contains four menus: the File, Edit, Scripts and Help menus. The File menu contains the options to Save and to Discard Changes and Close. The Edit menu (The Edit Menu) contains links to the Go To... dialog, the Find Row or Column dialog, the Find Next option, as well as the Replace... dialog. The Scripts menu contains scripts that edit the entire spreadsheet, or several columns of data at once. The Help menu contains a link to this section of the Golden Helix SVS manual as well as links to Email Feedback.

The tool bar contains icon shortcuts for Save, Find Row or Column, Discard Changes and Close, Give Feedback and to this section of the manual.

The Spreadsheet

Options for editing columns and their headers can be seen by clicking on the Column Number Header or by right-clicking on Column Name Header. Options for editing row labels can be seen by right-clicking on the Row Label Header or on the cell above the Row Label Header. Row labels can be edited by right-clicking on the row label. Individual data cells can be edited by clicking on them.

The File Menu and Saving or Discarding Changes

There are two options for exiting the spreadsheet editor, both of which are available from the File menu: save the edited spreadsheet or discard changes. To save changes click on File > Save. When a spreadsheet is created, a new dataset DSF file is saved in the Data folder of the project. The spreadsheet can either be saved as a child of the project root or the current spreadsheet. If it is saved as a child of the project root then top-level dataset and spreadsheet nodes are created in the Project Navigator. If it is saved as a child of the current spreadsheet, the new spreadsheet node will be located as a child node under the current spreadsheet. The top-level edited spreadsheet is independent from all other spreadsheets, including the spreadsheet that was edited. To discard changes, click on File > Discard Changes and Close button. Closing the spreadsheet using the window close button will prompt for instruction on whether the changes should be saved or not. A Yes response is the same as “Save”, whereas a No response is the same as the Discard Changes and Close button. Selecting Cancel does not close the spreadsheet editor allowing for further changes to be made.

The Edit Menu

The following Edit menu options allow scrolling and editing operations over the entire spreadsheet.

  • Go To...:

    A dialog will pop up where you may enter the row number and column number where the upper-left-hand corner of the spreadsheet should be scrolled to.

  • Find Row or Column:

    A dialog will pop up allowing you to search for all or part of a column name, a row label, or a data cell. If the search is successful, the spreadsheet will be scrolled so that its upper-left-hand corner is positioned at the matching column, row, or data cell. If Partial matches is checked, just part of the name, label, or cell data need match. Otherwise, the complete column name, row label, or data cell content must be in agreement for there to be a match.

    You may also specify whether the searching should start at the beginning of the spreadsheet or instead at the cell shown in the upper-left corner of your spreadsheet editor display.

    Note

    This search is case-sensitive. This means that capital letters must match capital letters and lower-case letters must match lower-case letters.

  • Find Next:

    Repeat the search you have just done, except start just after the column, row, or data cell you just found.

  • Replace...:

    This feature allows you to take one column name, one row label, or text of one data cell, or one part of a column name, row label, or text of a data cell and substitute something else for it, and to do this for all columns, rows, or data of the spreadsheet in one operation. Please see Find/Replace and Regular Expressions for the many ways in which this may be done.

Editing the Row Label Header and Row Labels

Right-clicking on the Row Label Header will yield the following menu options for editing the row label header and the row labels.

  • Edit Row Label Header:

    A dialog will pop up to allow you to change the row label header.

  • Find and Replace...:

    This feature allows you to take one row label or one part of a row label and substitute something else for it, and to do this for all rows of the spreadsheet in one operation. Please see Find/Replace and Regular Expressions for the many ways in which this may be done.

  • Revert Row Labels:

    This will remove any changes to the row labels and revert them to the way they were before beginning the spreadsheet editor session.

  • Generate Generic Labels (1,2,3..)

    This will label the rows with their row numbers.

Editing Individual Row Labels

To edit an individual row label, right-click on a row label that needs to be changed and select Edit Row Label.

Any changes can be reverted by right-clicking on the Row Label Header and selecting Revert Row Labels.

Editing Columns

There are numerous ways to edit a column or its name header. To view the options for editing a column or its name header, right-click on the Column Number Header or the Column Name Header (or left-click on the Column Number Header) of the column that is to be edited.

Edit Column Header

Select this option to change the Column Name Header.

Find and Replace...

Select this option to find and replace a data string or value within the data of the column selected.

This feature allows you to take the text of one data cell or one part of the text of a data cell and substitute something else for it, and to do this for all the data cells of the column in one operation. Please see Find/Replace and Regular Expressions for the many ways in which this may be done.

Insert Column

Select this option to insert a column either to the left or right of the current column.

The Insert Column dialog has a text box for specifying the new Column Name Header. The type of column needs to be specified, as well as the default fill value.

Copy This Column

Select this option to create a copy of the current column and place the copy either to the left or the right of the current column.

Move Column

Select this option to move the current column one position either to the left or to the right.

The first column cannot be moved to the left and the last column cannot be moved to the right. Moves made to marker-mapped columns will not be saved as marker map order is preserved.

Set Row Labels as This Column

Select this option to replace the row labels with the values or strings of the current column.

This change can be reverted by right-clicking on the Row Label Header and select “Revert Row Labels”.

Revert Column

Select this option to revert the column to the original state when the Spreadsheet Editor was opened. This only removes changes made to the current column.

Delete Column

Select this option to remove a column from the spreadsheet.

Expand to Binary Columns

This command takes a Categorical (C) or a Genotype (G) column and creates a binary column for every value in the column. The values of these new columns are 1 if the value of the cells in the original column equal the value corresponding to the value assigned to the binary column and 0 otherwise.

For example, if a categorical column “Level” had four values: “Low”, “Medium”, “High”, and ”?”, four binary columns would be created: “Level=Low?”, “Level=Medium?”, “Level=High?”, and “Level=??”. Every instance of “Low” in the “Level” column would be indicated with a “1” in the column “Level=Low?” all other values would be indicated with a “0”. The other binary columns would be created in the same manner.

Expand to Binary Columns (with ?)

This command takes a Categorical (C) or a Genotype (G) column and creates a binary column for every value in the column except for missing values. The values of these new columns are 1 if the value of the cells in the original column equal the value corresponding to the value assigned to the binary column, ? if the value is missing and 0 otherwise.

For example, if a categorical column “Level” had four values: “Low”, “Medium”, “High”, and ”?”, three binary columns would be created: “Level=Low?”, “Level=Medium?”, and “Level=High?”. Every instance of “Low” in the “Level” column would be indicated with a “1” in the column “Level=Low?”, missing values would be indicated with a ”?”, and all other values would be indicated with a “0”. The other binary columns would be created in the same manner.

Convert to

Select this option to convert a column to a different data type. Columns can be converted to Binary, Integer, Real, Categorical or Genotypic depending on the type of data in the column. If the data is not formatted correctly or if the data is not of the correct type an error message will be displayed. The original column can be overwritten with the converted values or a new column can be created immediately to the right of the original column.

The options for converting each column type are as follows:

  • A Binary (B) Column can be converted to:
    • Integers: Numeric values in a column are identified as integers.
    • Reals: Numeric values in a column are identified as real-valued.
    • Strings: Numeric values in a column are identified as strings.
  • An Integer (I) Column can be converted to:
    • Binary By Threshold: All values greater than or equal to the specified value are converted to 1’s, and all values less than the threshold are converted to 0’s.
    • Reals: Numeric values in a column are identified as real-valued.
    • Strings: Numeric values in a column are identified as strings.
  • A Real (R) Column can be converted to:
    • Binary By Threshold: All values greater than or equal to the specified value are converted to 1’s, and all values less than the threshold are converted to 0’s.
    • Integers (Truncation): Decimal places are dropped in order to convert real-valued data to integers. For example, 65.123 is converted to 65, and 123.567 is converted to 123.
    • Integers (Rounding): Decimal places are rounded in order to convert real-valued data to the nearest integer. For example, 65.123 is converted to 65, and 123.567 is converted to 124.
    • Strings: Numeric values in a column are identified as strings.
  • A Categorical (C) Column can be converted to:
    • Binary: A category selection dialog lists all categorical values in the column. One or more categories can be selected for conversion to 1’s. All categories that are not selected will be converted to 0’s. Multiple selections of categories can be done by either <Shift>-left-clicking or <Ctrl>-left-clicking.
    • Integers (Truncation): If all values in a column are numeric after editing, then the column can be identified as an Integer column using this command. Decimal places are dropped in order to convert real-valued data to integers. For example, 65.123 is converted to 65, and 123.567 is converted to 123.
    • Integers (Rounding): If all values in a column are numeric after editing, then the column can be identified as an Integer column using this command. Decimal places are rounded in order to convert real-valued data to the nearest integer. For example, 65.123 is converted to 65, and 123.567 is converted to 124.
    • Reals: If all values in the column are numeric after editing, then the column can be identified as a Real column using this command.
    • Genotypes: If all values in a column are converted to the genotypic format (e.g. “A_B” or “1_2”) after editing, then the column can be identified as a Genotypic column using this command.
  • A Genotypic (G) Column can be converted to:
    • Binary: A category selection dialog lists all genotypic values in the column. One or more genotypes can be selected for conversion to 1’s. All genotypes not selected will be converted to 0’s. Multiple selection of genotypes can be done by either <Shift>-left-clicking or <Ctrl>-left-clicking.
    • Integers (Truncation): If all values in the column are numeric after editing, then the column can be identified as an Integer column using this command. Decimal places are dropped in order to convert real-valued data to integers. For example, 65.123 is converted to 65, and 123.567 is converted to 123.
    • Integers (Rounding): If all values in the column are numeric after editing, then the column can be identified as an Integer column using this command. Decimal places are rounded in order to convert real-valued data to the nearest integer. For example, 65.123 is converted to 65, and 123.567 is converted to 124.
    • Reals: If all values in the column are numeric after editing, then the column can be identified as a Real column using this command.
    • Strings: All genotypes in the column are identified as strings.

Concatenate Columns

This option prompts the user to select a second column and delimiter to concatenate the columns. The concatenated column will be added directly to the right of the column from which the function was called.

Note

If the spreadsheet contains more than 1000 columns, only the first 1000 column headers are displayed in the column chooser. If you would like to concatenate a column that is further right, you will need to first inactivate the preceding columns.

Move Column to Far Right

Select this option to set the current column as the last column of the spreadsheet.

Move Column to Specified Location

Select this option to move the current column to a user-specified position.

Move Column to Front

Select this option to set the current column as the first column of the spreadsheet.

Switch Row Labels with this Column

Select this option to switch the row labels with the current column. A new column at the same location as this column is created with the existing row labels.

Note

This operation can take a very long time for spreadsheets that have tens of thousands of rows.

Compute

This menu becomes available when working with integer or real-valued columns. The following options can then be selected:

  • Absolute Values Computes the absolute value of each cell in an integer or real-valued column. The user is given the option to overwrite the column or add a new column directly to the right of the current column.

  • Log of Numeric Values Computes the log of the values in an integer or real-valued column. The user is given the option to overwrite the column or add a new column directly to the right of the current column.

  • NegLog of Numeric Values Computes the negative log of the values in an integer or real-valued column. The user is given the option to overwrite the column or add a new column directly to the right of the current column.

  • Percent of Column Total Creates a new column that gives the percentage of existing value divided by the column total.

    Note

    Should only be used on a column with positive data.

Transform

This menu is available for binary, categorical, or genotypic columns. The following option can then be selected:

  • Capitalize All Characters Capitalizes all lowercase letters in a categorical or genotypic column only.
  • Invert Boolean Values Switches the values between 0 and 1 in a Boolean column and creates a new column directly to the right of the current column. Missing values remain missing.
  • Append String Appends a string to all values in a column.
  • Prepend String Prepends a string to all values in a column.
  • Split Column on Specified Delimiter Given a string delimiter, N columns will be created based on N-1 maximum occurrences of the string in any row of that column.

Editing Data

Individual data fields can be edited by clicking or double-clicking the data field in the column and typing in a new value or string.

If a string value is entered into a Binary, Integer or Real column then the column type will be automatically changed to Categorical.

If a string is entered into a Genotype column which is incorrectly formated to be a genotype, then the column will be automatically changed to Categorical.

If an integer or real value is entered into a Binary column, then the column type will be automatically changed to Integer or Real, whichever is appropriate.

If a real value is entered into an Integer column, then the column type will be automatically changed to Real.

Find/Replace and Regular Expressions

In three places, namely in the main spreadsheet editor menu (The Edit Menu), the right-click menu for editing row labels (Editing the Row Label Header and Row Labels), and the right-click menu for editing a column (Editing Columns), the spreadsheet editor gives you the option to find and replace data.

To use this feature, enter the data string or value to find in the Find what text box, choose a Matching Method (Full match, Partial Match or Regular Expression) and enter the data string or value to replace the original data string or value in the Replace with text box. The Matching Methods are described below. Also select Column names, Row labels, or Data if you are using Replace... from the main spreadsheet editor menu. All column headers, row labels, or data cells in the spreadsheet or in the column (depending on which Find and Replace you are using) containing the first data string will be changed to contain the second data string.

Note

This search (with the Case sensitive check box unchecked) is case-insensitive. This means that capital letters will be considered to match lower-case letters when searching for matching column headers, row labels, or data cells.

If you check the Case sensitive check box, you will force capital letters to match capital letters and lower-case letters to match lower-case letters in the search.

Full and Partial Matching

If the Full Match option is selected, the data string or value must match the string or value in the chosen fields exactly. For example, if F were searched in the data fields and the spreadsheet had a gender column with the values F and M, the search would locate all female samples. It would not, however, find strings that simply contained an F somewhere in the data.

The Partial Match option (default) locates all matches of the search string or value, even if it is located in the middle of another string or value.

Regular Expressions

If you check the Regular expressions check box, you will enable a powerful feature called Regular Expressions. Regular expressions give you many flexible options for making your changes.

Note

  • This feature replaces the “partial match” feature of earlier versions of Golden Helix SVS.
  • More information about regular expressions may be found at http://en.wikipedia.org/wiki/Regular_expression.
  • The version of regular expressions used by Golden Helix SVS is derived from that used in the Perl computer language.
  • As a convenience, the Case sensitive check box still works when Regular expressions is checked. If the Case sensitive box is unchecked, the resulting case-insensitivity for capital and lower-case letters you have entered in the Find what text box will vary from the standard regular expression behavior.

As an example of how regular expressions can be useful, suppose you have row labels in the form

  • NA06985_GW6_C
  • NA06972_HA5_B
  • NA07011_GA7_C
  • etc.

and you:

  • right-click on the row label header,
  • select Find and Replace...,
  • put “(NA[0-9]*)_.*” as your Find what string,
  • put “\backslash1” as your Replace with string, and finally
  • check Use regular expressions.

By doing this, you will change these labels to just

  • NA06985
  • NA06972
  • NA07011
  • etc.

in just one operation.

Using Regular Expressions in Golden Helix SVS

To use Find and Replace using regular expressions in Golden Helix SVS, you must specify:

  • The portions of the data in each label, header, or cell which are to be preserved.
  • The portions of the data of any label, header, or cell you wish to be found and replaced.
  • Any replacement data.
  • Where the data to be preserved and the replacement data should be put in the resulting strings.

Specify the portions of data to be preserved within a pair or pairs of parentheses in the Find what string. These are called “groups”. Use the parts of the Find what string outside of the group specifications to specify the data to be found and replaced. Specify the replacement data in the Replace with string, interspersing it with backslashes followed by group numbers (“\backslash1”, “\backslash2”, etc.) to specify where the data that is to be preserved should be.

When using regular expressions, most of what you enter is taken literally as what must be preserved or replaced. However, in the Find what string, there are the following exceptions to this:

  • . (Period): Matches any single character, except when used within a bracket expression (see below), in which case it simply matches a period.

  • [ ] (A bracket expression): Matches a single character that is contained within the brackets. For example, [abc] will match ’a’, ’b’, or ’c’. However, using a hyphen between two other characters allows you to match any character within the indicated range. For example, [a-z] means match any of the lower-case characters of the alphabet from ’a’ to ’z’.

  • [^ ]: Matches a single character that is NOT contained within the brackets. For instance, [^abc] matches any character except ’a’, ’b’, or ’c’. This form also uses a hyphen between two other characters to specify a range. For instance, [^a-z] matches any character except the lower-case letters from ’a’ through ’z’.

  • ^: Matches the start of the data within the label, header, or cell.

  • $: Matches the end of the data within the label, header, or cell.

  • ( ): “Capture” a “group” (a portion of the data to be preserved) within parentheses.

  • *: Matches the preceding element zero or more times. For example, ab*c matches “ac”, “abc”, “abbbc”, etc.

  • {m,n}: Matches the preceding element from m times to n times.

  • \: (backslash) This character is used as an “escape character”, meant to be combined with another character to achieve a special meaning. Use \backslash\backslash to specify one backslash in your data. Some other examples of the special meanings available are:

    • \s: A “Whitespace” character such as a space or a tab.

    • \S: A non-whitespace character.

    • \w: A letter, digit, or underline character.

    • \W: NOT a letter, digit, or underline character.

    • \d: A digit.

    • \D: NOT a digit.

    • \1 through \9: means the first through the ninth “captured group” of data to be preserved in the change.

      Note

      Normally, this would only need to be used in your Replace with string.

In the Replace with string, most of what you enter is taken literally as replacement data. However, intersperse your replacement data with “\backslash1” through “\backslash9” to specify where the first through ninth group of data to be preserved should be put within this replacement string.

Practical Tips for Using Regular Expressions in Golden Helix SVS

  • The form .* used in the Find what string will encompass everything in the data label, header, or cell that has not yet been specified. You may often end your specification with .* to match all the rest of your data, or (."\*") to “capture” all the rest of your data.
  • In the Find what string, you can combine * (or \{m,n\}) with other special forms. For instance, [0-9]* means match or capture as many digits in a row as possible where the data is being scanned, and ^[0-9]* means match or capture as many non-digits in a row as possible.
  • You can switch the order of the “captured” parts of the data in the Replace with string by switching the order of, for instance, the “\backslash1” and “\backslash2” expressions.

Scripts for the Spreadsheet Editor

Available scripts are found in the Scripts menu in the menu bar. One script available from this menu is “Create Column From Row Labels”. This script copies the row labels and inserts a copy of the row label column in the spreadsheet. This is useful if another column is to be set as the row labels, but the row label information is still important to have in the spreadsheet.

The other script available is “Expand All Nominals,” which expands every nominal column and deletes the originals. This script can be applied to the entire spreadsheet.

The example scripts provided can be edited and changed to perform specific customized behaviors. See Essential Command Lines for Spreadsheet Editor Scripts for more information.