The Data Source Library

Golden Helix’s Data Source Library provides complete access to local, public, network, and project data sources that can be added to a GenomeBrowse viewer. Public network and cloud-based account sources can also be downloaded through this interface.

The options available in the Data Source Library depend on the context in which it is opened. This section describes all available features and then the options specific to each context.

Data Source Types Available through the Data Source Library

  • BAM:

    A binary sequence alignment file from a secondary analysis DNA-Seq pipeline.

  • BED:

    A tab-delimited text file containing annotation track information. See BED file format for more information.

  • TSF or IDF:

    File formats specifically designed to work with Golden Helix software programs. Possible types of TSF and IDF files include:

    • Cytoband: Contains cytoband intervals including Giemsa stain results.
    • Interval: Contains general purpose intervals. Each track can define the meaning of its intervals independently and multiple data fields of any supported types may be associated with each interval.
    • Gene: Contains information on gene transcripts.
    • Allele Sequence: Contains allele sequences. Typically represents a particular DNA reference sequence as a list of its single letter nucleotide abbreviations.
    • Value: Contains one or more numeric values for each interval or position of interest.
    • Variant: Contains genomic variant data as intervals. The represented variants may include single nucleotide variants (SNVs), insertions, and deletions.
    • Variant Map: Track contains genotypes (variants) for one or more samples. Genotypes are drawn to emphasize deviation from the reference allele sequence.
  • VCF:

    A VCF file typically contains genotypes for one or more samples, which can be visualized as a Variant Map, or genomic variants, which can be visualized as a variant plot. Other types of data stored in a VCF file can also be visualized. VCF files will be indexed and compressed before drawing.

  • Spreadsheet:
    • Heat Map: Plots all numeric variables against the genomic coordinates in the marker map as a heat map. See genomicHeatMap for more information.
    • Value: Plots one numeric variable against the genomic coordinates in the marker map of the spreadsheet.
    • LD: Plots linkage disequilibrium values between each pair of genotypes on a genomic scale. See genomicLD for more information.
    • Variant Map: Plots all genomic variables (deviations from the reference sequence) against the genomic coordinates in the marker map as a variant map. See Variant Maps for more information.

The Data Source Library as a Source Manager

If the Data Source Library is launched from the project navigator or the SVS Welcome Screen, it is used to manage sources. If no project is open, the Project source button is unavailable.

Figure: Data Source Library as Source Manager

The Data Source Library as a Plot Dialog

If the Data Source Library is launched from within a GenomeBrowse window or project, it is used to add data sources to that genome browser. In this case, the dialog is titled Add Data Sources and the Project source button is available.

Figure: Data Source Library as Plot Dialog

The Data Source Library as a Script Widget

If the Data Source Library is called from a Python script, the context of the script can influence the options that are available within the Data Source Library dialog. For instance, if only local data sources are allowed for a particular script, the Data Source Library will warn about a likely slowdown in processing when one or more network sources have been chosen.

Figure: Data Source Library as Script Widget

Downloading Data

Network data sources can be downloaded from most instances of the Data Source Library. When one or more network data sources are selected, the Download button will be enabled. Clicking the Download button will start downloading all selected network data sources in the background. The Download Manager will be activated to display the download progress.

Download Window

The download window lists all recent and active downloads. Downloads continue in the background whether or not this window is open. If SVS is closed, the downloads will resume when SVS is reopened.

Figure: Download Window showing 1 finished download and 3 active downloads

The default target download location is “User Annotations”. This can be changed in the GenomeBrowse options. See General Options for GenomeBrowse for more information. Once a download is complete, the target location can be refreshed and the new local copy of the data source can be used just like any other local data source.

Exporting Data

Local data sources can be exported to delimited text, Variant Call Format (VCF) 4.1, Microsoft Excel XLSX, FASTA, and wiggle track (WIG) formats, depending on the source file type. Only one source can be exported at a time.

To export a source, check its selection box and click the Export button. A dialog will appear listing the compatible export file types. After selecting an export file type, click the Export button and the appropriate subdialog will appear.

Delimited Text

Currently, only sources that are not full-coverage (i.e., cytoband, interval, gene, value, and variant) can be exported to delimited text.

The settings which may be changed are:

  • Header: By default, this setting is selected and will cause the source’s field names to be displayed on the first line of the file.
  • Prefix: A string can be entered to prefix the header line of the file.
  • Delimiter: By default, a “tab” character will delimit the columns in the file. A comma can be selected from the drop down list or a custom string can be entered.
  • Sub Delimiter: By default, a comma will delimit a list of values within a column. A “tab” character can be selected or a custom string can be entered.
  • Coordinates: The genomic coordinates of the exported intervals can be represented as a 0-based interval, a 1-based interval, or a single position.
    • 0-Based Interval: The difference between the stop and the start positions defines the width of the interval. For example, an interval covering the first three positions of a chromosome in 0-based coordinates would be specified as [0, 3), where the stop position is not part of the interval. (Also known as ‘half-open coordinates’.)
    • 1-Based Interval: The difference between the stop and the start positions plus one defines the width of the interval. For example, an interval covering the first three positions of a chromosome in 1-based coordinates would be specified as [1, 3]. (Also known as ‘indexed coordinates’.)
    • Position (1bp width): This option outputs a single coordinate. Thus, it is only useful if all features have a single base pair width. The position is 1-based so the smallest position in a chromosome would be 1.
  • Exported Fields: The desired fields may be selected using this option. The order in which the fields appear in the exported file may be changed by reordering the list by dragging fields up or down.
  • Output File: Clicking the Browse button will bring up a dialog to select the name and location for the exported file.
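
The coordinate and delimiter settings above can be illustrated with a short Python sketch. The function names are hypothetical and are not part of any SVS or GenomeBrowse scripting interface; the sketch only shows the arithmetic behind the three coordinate options and how a delimiter and sub delimiter combine on one line of output.

```python
def to_one_based(start0, stop0):
    """Convert a 0-based half-open interval [start0, stop0)
    to a 1-based closed interval [start, stop]."""
    if stop0 <= start0:
        raise ValueError("interval must cover at least one base")
    return start0 + 1, stop0


def width_zero_based(start0, stop0):
    # 0-based: width is stop minus start.
    return stop0 - start0


def width_one_based(start1, stop1):
    # 1-based: width is stop minus start, plus one.
    return stop1 - start1 + 1


def to_position(start0, stop0):
    """Return the single 1-based position of a 1 bp feature."""
    if stop0 - start0 != 1:
        raise ValueError("position output requires a 1 bp feature")
    return start0 + 1


def format_row(values, delimiter="\t", sub_delimiter=","):
    """Join columns with the delimiter; list-valued columns are
    joined internally with the sub delimiter."""
    cells = []
    for value in values:
        if isinstance(value, (list, tuple)):
            cells.append(sub_delimiter.join(str(v) for v in value))
        else:
            cells.append(str(value))
    return delimiter.join(cells)


# The first three bases of a chromosome in each convention.
start1, stop1 = to_one_based(0, 3)
row = format_row(["1", start1, stop1, ["A", "B"]])
```

Both conventions describe the same three bases: the 0-based width is 3 - 0 and the 1-based width is 3 - 1 + 1.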

Variant Call (VCF) 4.1

VCF files can be created from variant and variant map sources.

The settings which may be changed are:

  • Exported Fields: The desired fields may be selected using this option. The order of the fields in the file will follow the ordering found in the GenomeBrowse table.
  • Exported Flags: The flags for the data features may be selected. These flags correspond to the Flags field in the GenomeBrowse table.
  • Output File: Clicking the Browse button will bring up a dialog to select the name and location for the exported file.

Microsoft Excel XLSX

Currently, only sources that are not full-coverage (i.e., cytoband, interval, gene, value, and variant) can be exported to XLSX.

The settings which may be changed are:

  • Output Options: If the data includes sample-level fields, those fields can be grouped by sample in the exported file. The documentation comments can also be exported.
  • Exported Fields: The fields desired in the output XLSX file may be selected. The fields will appear in the output file in the same order in which they appear in the GenomeBrowse table view.
  • Output File: Clicking the Browse button will bring up a dialog to select the name and location for the exported file.

Note: Output to the Microsoft XLSX format is limited to 1,048,576 rows and 16,384 columns. After 1,048,576 rows have been written, copying will stop and the remaining rows of the source will be omitted from the output. Fields may be unselected to limit the number of columns in the file.
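
As a rough way to check in advance whether a source will fit within these limits, the following Python sketch compares row and column counts against the XLSX maximums. The function name is hypothetical, and the assumption that a single header row counts toward the row limit is made here for illustration only; the exporter's actual accounting is not documented in this section.

```python
XLSX_MAX_ROWS = 1_048_576
XLSX_MAX_COLUMNS = 16_384


def xlsx_overflow(n_data_rows, n_columns, header_rows=1):
    """Return (rows_over, columns_over): how far a table exceeds
    the XLSX limits. header_rows=1 is an assumption."""
    total_rows = n_data_rows + header_rows
    rows_over = max(0, total_rows - XLSX_MAX_ROWS)
    columns_over = max(0, n_columns - XLSX_MAX_COLUMNS)
    return rows_over, columns_over


# A source with exactly 1,048,576 features plus one header row
# would lose its last row under this assumption.
rows_over, columns_over = xlsx_overflow(1_048_576, 10)
```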

FASTA Format

FASTA files may be written for sequence sources with valid assemblies.

The settings which may be changed are:

  • Separate Chromosome Files: Selecting this check box will create a separate file for each chromosome in the source file.
  • Output File: Clicking the Browse button will bring up a dialog to select the name and location for the exported file.

Note: When creating separate files for each of the chromosomes in the source file, the text “%chr%” must be included in the destination file name. When the individual files are written, the “%chr%” will be replaced with the chromosome corresponding to that file.
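
The placeholder substitution described in the note can be sketched in Python as follows. The function name and the example file name are hypothetical; the sketch only illustrates how one destination name containing “%chr%” expands into one name per chromosome.

```python
def chromosome_file_names(template, chromosomes):
    """Expand a destination name containing %chr% into one
    file name per chromosome."""
    if "%chr%" not in template:
        raise ValueError("destination name must contain %chr%")
    return {chrom: template.replace("%chr%", chrom) for chrom in chromosomes}


# Example: a hypothetical destination name and two chromosomes.
names = chromosome_file_names("reference_%chr%.fa", ["1", "X"])
```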

Wiggle Track Format

Variable Step Wiggle track files can be created from files with numeric fields.

The settings which may be changed are:

  • Value: The data value associated with each chromosome position can be selected from the numeric fields in the file.
  • Base Span: If single base span is selected, a data point will be created for every base covered in the file; otherwise, the span from the first feature will be used (in which case the span must be consistent across all of the features).
  • Output File: Clicking the Browse button will bring up a dialog to select the name and location for the exported file.
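
To make the Base Span setting concrete, the sketch below builds the lines of a variableStep wiggle track in the layout of the public UCSC WIG format (a declaration line followed by one "position value" line per data point, with 1-based positions). It is an illustration under stated assumptions, not the exporter used by the software: the function name is hypothetical, and the input is assumed to be a list of (chromosome, start, stop, value) tuples with 1-based inclusive coordinates, already sorted by chromosome and start.

```python
def to_variable_step(features, single_base_span=False):
    """Return variableStep WIG lines for the given features."""
    if not features:
        return []
    if single_base_span:
        span = 1
    else:
        # Use the span of the first feature for the whole track.
        first_start, first_stop = features[0][1], features[0][2]
        span = first_stop - first_start + 1

    lines = []
    current_chrom = None
    for chrom, start, stop, value in features:
        if not single_base_span and stop - start + 1 != span:
            raise ValueError("all features must share the first feature's span")
        if chrom != current_chrom:
            # Start a new declaration line for each chromosome.
            header = f"variableStep chrom={chrom}"
            if span != 1:
                header += f" span={span}"
            lines.append(header)
            current_chrom = chrom
        if single_base_span:
            # One data point for every base covered by the feature.
            for position in range(start, stop + 1):
                lines.append(f"{position} {value}")
        else:
            lines.append(f"{start} {value}")
    return lines


per_base = to_variable_step([("chr1", 1, 3, 0.5)], single_base_span=True)
fixed_span = to_variable_step([("chr1", 1, 3, 0.5), ("chr1", 10, 12, 1.5)])
```

With single base span selected, a three-base feature yields three data points; otherwise each feature yields one data point and the span is declared once per chromosome.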

Utilities

The Utilities menu is located in the lower left corner of the Data Source Library. Functions in this menu are designed to work on current annotation sources or convert external files that are not supported by the data conversion wizard. Custom scripts can be placed in this menu.

Currently the custom script list includes:

Source Information Editor

To edit the documentation, field descriptions or categorical values for a source, click on the local source in the Data Source Library and then click on the Edit hyperlink in the Information panel. Alternatively, right-click on the source and select Edit Source Info. This will open up the Source Information Editor.

Figure: Source Information Editor for a variant source

The lock icon at the top of the window may be locked for sources provided by Golden Helix, in which case the source cannot be edited. Custom-created sources should not be locked, so their information can be edited.

Annotation sources must have a name specified. In addition to the source name, documentation on how the data was converted as well as the date and documentation for each field can be specified or edited. All documentation will be embedded into the TSF file to make sharing files and documentation easy.

There are three sections in the Source Information Editor:

  • Source Definition: This information is used to identify the annotation source, and also indicate the date it was converted, who converted it and any version information.

    • Name: [Required] The name of the annotation source.
    • Curated Date: [Required] By default, this is the date associated with the files that were converted into the annotation source. It can be modified, but a date is required.
    • Curated By: Name of the person or organization curating the data.
    • Series Name: Name of a particular group of data. This field can be used to differentiate between newer versions of the same type of data. For example, RefSeqGenes-UCSC or dbSNP.
    • Version: A version number or date. If there is a particular version name or identifier, it is recommended that it be included in the Name field and that this field be used for a date associated with that version.
  • Fields: The individual field descriptions can be specified in this table.

    • Orient: The orientation of the data (locus or sample) cannot be modified.
    • Type: The type of the data (cannot be modified). If the type needs to be modified, the TSF file can be converted into another TSF file using the Convert Source Wizard. See Converting an IDF or TSF File for more information.
    • Name: Field names can be modified. However, if the name of a required field with an explicit name is modified, the source may no longer be able to be plotted as a specialized track type.
    • Doc: The documentation string for the specific field.
    • URL Template: For fields with information that can be queried on an external site, specify the URL with two dollar signs ($$) marking where the field’s value should be substituted. For example, for an “Identifier” field that contains RS IDs, the URL Template could be http://www.ncbi.nlm.nih.gov/snp/?term=$$.
    • Categories: Click on the Edit button to edit the category names and/or documentation. The documentation can be edited to rename and/or document categories for categorical field types. For instance, change the category “D” to “Damaging” for a more informative name without having to modify the source file(s).
  • HTML Documentation: The four tabs at the bottom of the dialog are for writing HTML documentation of the source. The tabs are to guide the writing of the documentation and to provide nice headers for each section. HTML tags can be used for formatting.

    • Description: Description of the source and where it was obtained.

    • Credit: Any required citations or credits for the source should go in this section.

    • Notes: Any relevant notes on pre-processing that had to be performed on the data or settings used to convert the source.

      Note

      This section will contain statistics about the fields and data in the source.

    • Meta: Any meta information for the data source(s).

      Note

      If the source was created from converted VCF files the header information from the first VCF file will be placed in this section.
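
The URL Template field described in the Fields section above can be illustrated with a short Python sketch of the $$ substitution. The function name is hypothetical, and percent-encoding the value before substitution is an assumption made here for safety; this section does not state how the software handles special characters.

```python
from urllib.parse import quote


def expand_url_template(template, value):
    """Replace the $$ marker in a URL template with a field value."""
    if "$$" not in template:
        raise ValueError("URL template must contain $$")
    # Assumption: the value is percent-encoded before substitution.
    return template.replace("$$", quote(str(value), safe=""))


# Example using the template from the Fields section and a sample RS ID.
url = expand_url_template("http://www.ncbi.nlm.nih.gov/snp/?term=$$", "rs334")
```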