Annotation Convert Source Wizard¶
Annotation sources exist in numerous file formats. Some are text based and easily manipulated; others are binary formats that can only be read by specialized readers. Genome-wide variant text files can quickly grow to an unmanageable size. In addition, separate documentation files must be kept to record the source and data curation information. Finally, unless the data is formatted to very specific requirements, the files cannot be read and handled with ease and speed. To meet the ever-increasing need to visualize numerous file formats, Golden Helix has created a convert source wizard for Golden Helix SVS, VarSeq and GenomeBrowse. This converter takes nearly every data file type and converts it into a ‘TSF’ file. The TSF format contains not only the data but also the coverage information, data extents, documentation and index, all packaged in a compressed format to keep file sizes as small as possible.
Opening the Convert Source Wizard¶
From a Data Source Library click on the Convert... button in the lower left hand corner of the dialog. This brings up the Define Input dialog which is the first step of the conversion wizard. In the left hand pane the source conversion steps are listed. Below the itemized list of steps is an information pane that will contain important instructions or tips for each step in the wizard.
The source conversion process can either use default settings or allow for more control over the process. To have more options or control over how the source is converted, check the Advanced Options box below the information pane.
On the right hand side is the file selection (Add) dialog. Multiple files can be selected at the same time to be converted into one source. However, all files will be added to the same TSF file and are required to be of the same type. For instance, if there is one FASTA file per chromosome, select all of the FA files, one for each chromosome. Similarly, if the source file type is VCF and there is one VCF file for each chromosome, select all of the per chromosome VCF files.
To select files, click on the Add button. Files can also be added to the wizard by dragging and dropping into the selected files dialog.
To remove files from the list, select the file by clicking on it and then click on the Remove button.
Once the desired files have been added, an icon will appear in front of the file names. If the file extension is of a specific type, an icon representing the expected source type will be displayed (Allele Sequence, Interval, Gene, Variant, etc.). If the file is a text file that can be converted into numerous source types, a question-mark icon will be displayed in front of the file(s). If an exclamation point in a red circle is displayed, the file is either an invalid source or multiple sources of different types are selected in the same list of files. To get more information on the error, click on the Next > button.
If all sources are of the same type and are valid, clicking on the Next > button will lead to the next step in the wizard. At any point after the first page you can click < Back to return to previous steps. Any information changed or added will be preserved as long as the list of file sources is not modified. Clicking Cancel will exit the source conversion wizard.
If a source is already indexed it will not need to be scanned to determine the Genomic Coordinates and/or data types. If the file is not indexed a scanning pass will be initiated. This scan may be skipped if the genomic coordinates and/or data types are known. In general the scan is the step immediately following the file selection. For text files (TXT, TSV, CSV, etc.), however, the delimited file characteristics must be specified first.
Each of the data formats is covered below in greater detail.
Convert a 2Bit File¶
A 2Bit file is a packed binary format described at http://genome.ucsc.edu/FAQ/FAQformat.html#format7 and is a very efficient method of packing ACGT sequence data. The 2Bit import requires exactly one input file in 2Bit format. The only available output type is ‘Allele Sequence’. When creating an allele sequence source a new genome assembly can be created at the same time if one does not already exist for the species and build for the 2Bit file.
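To illustrate why the format is compact, the sketch below packs four bases into each byte using the 2-bit codes given in the UCSC specification (T=00, C=01, A=10, G=11, with the first base in the most significant bits). The function names are illustrative and are not part of any Golden Helix API. Real 2Bit files also carry a header, an index and N/mask blocks, which this sketch ignores.

```python
# Illustrative only: 2-bit base packing as described in the UCSC 2Bit format.
CODES = {"T": 0, "C": 1, "A": 2, "G": 3}
BASES = "TCAG"


def pack_bases(seq):
    """Pack an ACGT string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a short final chunk so the first base stays in the high bits.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack_bases(data, length):
    """Recover `length` bases from packed bytes."""
    seq = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            seq.append(BASES[(byte >> shift) & 3])
    return "".join(seq[:length])
```

Because the final byte may be padded, the sequence length has to be stored separately, which is why `unpack_bases` takes it as an argument.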
After selecting a 2Bit source on the file selection page of the convert wizard, click Next >. This brings the wizard to the New Genome Assembly Build page. A 2Bit source can define a genome assembly if one has not already been defined for the species and build.
Note
Throughout the Convert Source Wizard, certain options are considered “Advanced Options” and do not need to be set in most cases. To show “Advanced Options”, check the box on the lower left of the dialog. Any option that is advanced will be labeled as such in the documentation.
Select an Existing Genome Assembly from 2Bit¶
ADVANCED OPTION
To select an existing assembly for the allele sequence, make sure the Advanced Options box is checked and select an assembly from the drop down list. Selecting an existing assembly will inactivate all of the other fields on the Genome Assembly Build page. However, the Source to Segment Mapping table will remain active.
The Source to Segment Mapping table can be used to exclude or include segments in a new/updated genome assembly file. Segments previously included in the selected genome assembly will have a green background. Segments not included in the selected genome assembly will have a white background. Segment names that exist in the selected assembly but have different lengths between the source and selected assembly file will have a warning icon in front of the segment name. See Define the Source to Segment Mapping below for more information.
Define a new Genome Assembly from 2Bit¶
To define a new genome assembly/build, fill in as many of the available fields as possible:
- Species: Either select the species from the list or enter in a new one.
The scientific name is preferred. Examples include: ‘Homo sapiens’, ‘Canis familiaris’, etc.
- Common Name: Enter in a common name for the species, such as ‘Human’ or
‘Dog’.
- Build Name: The NCBI assembly name is preferred, but the assembly synonym
or UCSC assembly name can be used instead.
- Build Date: Submission or published date of the assembly.
- Taxonomy Id: Taxonomy ID for the species. If the species is in the NCBI
database clicking on the link out button will open a web page on NCBI to help identify the taxonomy ID.
- GenBank Id: GenBank Assembly ID. If the species is in the NCBI
database clicking on the link out button will open a web page on NCBI to help identify the GenBank ID. This field can be left empty if the species does not have a GenBank ID.
- RefSeq ID: RefSeq Assembly ID. If the species is in the NCBI
database clicking on the link out button will open a web page on NCBI to help identify the RefSeq ID. This field can be left empty if the species does not have a RefSeq ID.
The Source to Segment Mapping table can be used to exclude or include segments in a new/updated genome assembly file. See Define the Source to Segment Mapping below for more information.
Define the Source to Segment Mapping¶
Segments can be excluded from the assembly or renamed using the fields in this table. The type and the visibility of each segment can be set as well.
- Use: To include a segment in the assembly leave the ‘Use’ box
checked. To “Check All” or “Uncheck All” click on the “Use” column header. “Uncheck All Unmapped” will not work when creating a new assembly as there are no mapped or unmapped segments.
Note
If there are more than 5000 segments only the 5000 longest segments will be included in the assembly file. If there are more than 500 segments they will be arranged in the assembly file in descending order by length.
- Source: The name of the segment from the allele sequence file. To
rename a single segment, double click on the name in the Segment column. If the segment names share a common pattern that should be removed or changed, use the controls below the segment definition table to rename segments programmatically. The options include:
- RegEx: Use regular expressions to rename the “Source”.
- Substring: Remove a substring from all segment names to generate the segment name.
- Prefix: Remove a common prefix from all segment names.
- Suffix: Remove a common suffix from all segment names.
Enter either the regular expression or the string to remove in the first text box. A preview of the renamed segment name will appear in the second text box. To apply the rename to all segments click on the Set Segment to Renamed button.
- Length: The length of each segment is displayed in this column.
- Aliases: If a segment has an alias, it can be specified in this column.
For example, a mitochondrial chromosome might be named “M” or “MT”; the alternate name can be listed in the alias column.
- Type: Set the type of segment. By default ‘Autosomes’ are always visible
and the rest are visible only if there is data. Options include:
- Autosomes
- Allosome
- Mitochondrial
- Fragment
- Scaffold
- Contig
- Unknown
- Visibility: The visibility of the data can be set manually. To show a
segment in GenomeBrowse only when it has data, set the visibility to “With Data”. Options include:
- Always
- Never
- With Data
Once the assembly has been defined click Next >. When the allele sequence is converted to TSF format an assembly file will be created and placed in the User Assembly folder and will be available for use in Golden Helix SVS, VarSeq and GenomeBrowse.
After clicking Next > the wizard will display the documentation page. See Documentation Step for more information.
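The four programmatic renaming modes in the Source to Segment Mapping table can be pictured with a small Python sketch. It reflects an assumption about the behavior based on the descriptions above, not the wizard's actual implementation: `rename_segment` and its arguments are hypothetical, the RegEx mode is shown as removing whatever the pattern matches, and the Substring mode removes every occurrence of the string.

```python
import re


def rename_segment(source, mode, text):
    """Return the segment name derived from a source name (hypothetical helper)."""
    if mode == "regex":
        # Assumption: text matched by the pattern is removed from the name.
        return re.sub(text, "", source)
    if mode == "substring":
        # Assumption: every occurrence of the substring is removed.
        return source.replace(text, "")
    if mode == "prefix":
        return source[len(text):] if source.startswith(text) else source
    if mode == "suffix":
        # The `text and` guard avoids source[:-0], which would be empty.
        return source[:-len(text)] if text and source.endswith(text) else source
    raise ValueError(f"unknown mode: {mode}")
```

For example, `rename_segment("chr1", "prefix", "chr")` yields `"1"`, which is the kind of result the preview text box would be expected to show before the rename is applied to all segments.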
Converting a FASTA File¶
A FASTA file is a text file where each character of data designates the value of a sequence base at each offset in a segment designated by its simple header. After conversion an ‘Allele Sequence’ source is created. When creating an allele sequence source a new genome assembly can be created at the same time if one does not already exist for the species and build for the FASTA file(s).
After selecting one or more FASTA sources (FA, FASTA, FA.GZ, FASTA.GZ, etc.) on the file selection page of the convert wizard, click Next >. If the data has not previously been indexed the file(s) will be scanned to obtain the genomic coordinates. This scan may not be skipped since a new genome assembly may be generated based on the genomic coordinates in the file(s). Next, the wizard will display the New Genome Assembly Build page. FASTA sources can define a genome assembly if one has not already been defined for the species and build.
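Conceptually, the scan for genomic coordinates amounts to collecting each segment name and its length. The sketch below shows one way this could be done in Python; it is not the wizard's implementation. `scan_fasta` is a hypothetical helper, and it assumes the segment name is the first whitespace-delimited token of each header line.

```python
import gzip


def scan_fasta(path):
    """Return {segment_name: length} for a plain or gzipped FASTA file."""
    opener = gzip.open if path.endswith(".gz") else open
    lengths = {}
    name = None
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Assumption: the segment name is the first token of the header.
                tokens = line[1:].split()
                name = tokens[0] if tokens else ""
                lengths[name] = 0
            elif name is not None:
                # Sequence lines contribute one base per character.
                lengths[name] += len(line)
    return lengths
```

The resulting names and lengths correspond to the Source and Length columns of the Source to Segment Mapping table.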
Note
Throughout the Convert Source Wizard, certain options are considered “Advanced” options and do not need to be set in most cases. To show “Advanced” options, check the box on the lower left of the dialog. Any option that is advanced will be labeled as such in the documentation.
Select an Existing Genome Assembly from FASTA¶
ADVANCED OPTION
To select an existing assembly for the allele sequence, make sure the Advanced Options box is checked and select an assembly from the drop down list. Selecting an existing assembly will inactivate all of the species, build and identifier fields on the Genome Assembly Build page. However, the Source to Segment Mapping table will remain active.
The Source to Segment Mapping table can be used to exclude or include segments in a new/updated genome assembly file. Segments previously included in the selected genome assembly will have a green background. Segments not included in the selected genome assembly will have a white background. Segment names that exist in the selected assembly but have different lengths between the source and selected assembly file will have a warning icon in front of the segment name. See Define the Source to Segment Mapping below for more information.
Define a new Genome Assembly from FASTA¶
To define a new genome assembly/build, fill in as many of the available fields as possible:
- Species: Either select the species from the list or enter in a new one.
The scientific name is preferred. Examples include: ‘Homo sapiens’, ‘Canis familiaris’, etc.
- Common Name: Enter in a common name for the species, such as ‘Human’ or
‘Dog’.
- Build Name: The NCBI assembly name is preferred, but the assembly synonym
or UCSC assembly name can be used instead.
- Build Date: Submission or published date of the assembly.
- Taxonomy Id: Taxonomy ID for the species. If the species is in the NCBI
database clicking on the link out button will open a web page on NCBI to help identify the taxonomy ID.
- GenBank Id: GenBank Assembly ID. If the species is in the NCBI
database clicking on the link out button will open a web page on NCBI to help identify the GenBank ID. This field can be left empty if the species does not have a GenBank ID.
- RefSeq ID: RefSeq Assembly ID. If the species is in the NCBI
database clicking on the link out button will open a web page on NCBI to help identify the RefSeq ID. This field can be left empty if the species does not have a RefSeq ID.
The Source to Segment Mapping table can be used to exclude or include segments in a new/updated genome assembly file. See Define the Source to Segment Mapping below for more information.
Define the Source to Segment Mapping¶
Segments can be excluded from the assembly or renamed using the fields in this table. The type and the visibility of each segment can be set as well.
- Use: To include a segment in the assembly leave the ‘Use’ box
checked. To “Check All” or “Uncheck All” click on the “Use” column header. “Uncheck All Unmapped” will not work when creating a new assembly as there are no mapped or unmapped segments.
Note
If there are more than 5000 segments only the 5000 longest segments will be included in the assembly file. If there are more than 500 segments they will be arranged in the assembly file in descending order by length.
- Source: The name of the segment from the allele sequence file. To
rename a single segment, double click on the name in the Segment column. If the segment names share a common pattern that should be removed or changed, use the controls below the segment definition table to rename segments programmatically. The options include:
- RegEx: Use regular expressions to rename the “Source”.
- Substring: Remove a substring from all segment names to generate the segment name.
- Prefix: Remove a common prefix from all segment names.
- Suffix: Remove a common suffix from all segment names.
Enter either the regular expression or the string to remove in the first text box. A preview of the renamed segment name will appear in the second text box. To apply the rename to all segments click on the Set Segment to Renamed button.
- Length: The length of each segment is displayed in this column.
- Aliases: If a segment has an alias, it can be specified in this column.
For example, a mitochondrial chromosome might be named “M” or “MT”; the alternate name can be listed in the alias column.
- Type: Set the type of segment. By default ‘Autosomes’ are always visible
and the rest are visible only if there is data. Options include:
- Autosomes
- Allosome
- Mitochondrial
- Fragment
- Scaffold
- Contig
- Unknown
- Visibility: The visibility of the data can be set manually. To show a
segment in GenomeBrowse only when it has data, set the visibility to “With Data”. Options include:
- Always
- Never
- With Data
Once the assembly has been defined click Next >. When the allele sequence is converted to TSF format an assembly file will be created and placed in the User Assembly folder and will be available for use in Golden Helix SVS, VarSeq and GenomeBrowse.
After clicking Next > the wizard will display the documentation page. See Documentation Step for more information.
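The segment limits from the note in Define the Source to Segment Mapping (only the 5000 longest segments are kept, and more than 500 segments are ordered by descending length) can be expressed as a short sketch. The function name and parameters are hypothetical, and how the wizard orders segments of equal length is not documented; here ties simply keep their original relative order.

```python
def select_segments(lengths, max_segments=5000, sort_threshold=500):
    """Apply the documented limits to a {name: length} mapping.

    Returns a list of (name, length) pairs in assembly order.
    """
    items = list(lengths.items())
    if len(items) > sort_threshold:
        # sorted() is stable, so equal-length segments keep their input order.
        items = sorted(items, key=lambda kv: kv[1], reverse=True)
    # Once sorted, the first max_segments entries are the longest ones.
    return items[:max_segments]
```

With 500 or fewer segments the input order is left unchanged, which matches the note's wording that reordering only applies above that count.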
Converting a VCF File¶
A VCF (Variant Call Format) file is a text file in the format specified by 1000 Genomes (http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41, https://github.com/samtools/hts-specs). VCF files will be converted into a variant source with one feature for each line in the VCF file. One field in the new annotation source will be created for each INFO field unless the type of the field is FLAG. All flags will be combined into one FLAG field. If the VCF file contains sample information, the information for each FORMAT field will be combined into a delimited list of the appropriate type (string, integer, float, etc.).
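As a rough illustration of how INFO entries map to output fields, the sketch below splits an INFO column into one field per key and gathers all flags into a single FLAG field. It is a simplification and not the converter's code: `parse_info` is hypothetical, flags are recognized as entries without an `=` rather than from the header's type declarations, and joining the flags with commas is an assumption.

```python
def parse_info(info):
    """Split a VCF INFO column into key/value fields plus one combined FLAG field."""
    fields = {}
    flags = []
    if info and info != ".":
        for entry in info.split(";"):
            if not entry:
                continue
            key, sep, value = entry.partition("=")
            if sep:
                fields[key] = value
            else:
                # Entries without '=' are treated as flags in this sketch.
                flags.append(key)
    # Assumption: the combined flags are stored as a comma-separated string.
    fields["FLAG"] = ",".join(flags)
    return fields
```

For the INFO value `DP=14;AF=0.5;DB;H2`, this produces separate `DP` and `AF` fields and a single `FLAG` field holding `DB,H2`.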
After selecting one or more VCF sources (VCF, VCF.GZ) on the file selection page of the convert wizard, click Next >. If the data has been previously indexed or if CONTIG information exists in the header, then the file will not be scanned for genomic coordinates. Otherwise, the file(s) will be scanned to determine the genomic coordinates and most likely genome assembly. The scan may be skipped if the genome assembly is known and the segment naming convention is known.
Note
Throughout the Convert Source Wizard, certain options are considered “Advanced” options and do not need to be set in most cases. To show “Advanced” options, check the box on the lower left of the dialog. Any option that is advanced will be labeled as such in the documentation.
Select the Desired Plot Type for VCF¶
The desired plot type will be automatically detected by default. Unless a VCF file violates the file format specifications, this will always be a Variant plot type. To change the desired plot type, change the selection in the drop down box. If the currently selected fields in the file(s) do not meet the specifications for the selected plot type, a warning icon will appear on the upper right. To check the required fields for the selected plot type, or to read the warning message(s), hover over the [i]nformation or [!] warning icons. The tool tips will contain the information or warning messages.
The output fields can be edited on this page as well.
- Use: To remove a field from the output uncheck the box in front of the field. For a VCF file, the “Reference” and “Alternate” fields contain redundant information and can be excluded from the source, if desired. The Chr, Start and Stop fields cannot be modified and are not included in the list of editable fields.
- Rename: To rename the field, either type in the Name box, or select a name of a field required for the plot type in the drop down box. If a required field is renamed a warning will appear for the selected plot type. For a variant source a “Ref/Alt” or “Observed” field is required. From a VCF file the “Ref/Alt” field is created by concatenating the “Reference” and “Alternate” fields.
- Type: The default types of the fields are specified based on the type information in the VCF headers. To change the type, select the appropriate type in the drop down Type box. For instance, Float64 can be changed to Float, etc. If only a few distinct strings are used in a string field, “Category” is a better type to use because it enables filtering on that field. If “Category” is selected but there are too many different categories for a particular field, an error will be generated on convert.
- Reorder Fields: To reorder fields, click on a field to select (highlight) the row and use the directional arrows on the right of the dialog to move the field either up or down in the list. The double up arrow moves the field to the top of the list. The double down arrow moves the field to the bottom of the list.
The preview pane contains the fields selected for import and will update based on changes made in the Edit Output Fields table. By default only the first 1000 features will be read into the preview. To read more features click on Read More.
Once the plot type and fields have been edited as desired, click Next > to move to the next page of the wizard. See Select a Genome Assembly for the next page.
Converting a BED File¶
A BED file is a text file in the format specified by UCSC (http://genome.ucsc.edu/FAQ/FAQformat.html#format1). It has three required fields (Chromosome, Start and Stop also called chrom, chromStart and chromEnd) and can have up to 9 optional fields. The expected order of the fields is as follows:
- Chr (chrom)
- Start (chromStart)
- Stop (chromEnd)
- Name
- Score
- Strand
- thickStart
- thickEnd
- itemRgb
- blockCount
- blockSizes
- blockStarts
If additional fields are included in the BED file, the field names can be specified on the desired plot type and field specification page.
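The field order above can be applied to a BED line as in the following sketch. It assumes tab-delimited input, and `parse_bed_line`, `extra_names` and the `FieldN` fallback names are hypothetical; in the wizard, names for additional fields are entered on the plot type and field specification page.

```python
BED_FIELDS = ["Chr", "Start", "Stop", "Name", "Score", "Strand",
              "thickStart", "thickEnd", "itemRgb",
              "blockCount", "blockSizes", "blockStarts"]


def parse_bed_line(line, extra_names=()):
    """Map one tab-delimited BED line onto field names."""
    values = line.rstrip("\n").split("\t")
    if len(values) < 3:
        raise ValueError("BED lines need at least Chr, Start and Stop")
    names = BED_FIELDS + list(extra_names)
    # Hypothetical fallback names for extra columns that were not named.
    names += [f"Field{i + 1}" for i in range(len(names), len(values))]
    # zip() stops at the shorter sequence, so absent optional fields are skipped.
    record = dict(zip(names, values))
    record["Start"] = int(record["Start"])
    record["Stop"] = int(record["Stop"])
    return record
```

Only the three required fields are converted to integers here; the optional fields are left as strings because their types depend on how the file uses them.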
After selecting one or more BED files (BED, BED.GZ) on the file selection page of the convert wizard, click Next >. If the data has not been previously indexed then the file(s) will be scanned to determine the genomic coordinates and most likely genome assembly. The scan may be skipped if the genome assembly is known and the segment naming convention is known.
Note
Throughout the Convert Source Wizard, certain options are considered “Advanced” options and do not need to be set in most cases. To show “Advanced” options, check the box on the lower left of the dialog. Any option that is advanced will be labeled as such in the documentation.
Select the Desired Plot Type for BED¶
The desired plot type will be automatically detected by default. Unless a BED file violates the file format specifications, this will always be a Generic Interval plot type. To change the desired plot type, change the selection in the drop down box. If the currently selected fields in the file(s) do not meet the specifications for the selected plot type, a warning icon will appear on the upper right. To check the required fields for the selected plot type, or to read the warning message(s), hover over the [i]nformation or [!] warning icons. The tool tips will contain the information or warning messages.
The output fields can be edited on this page as well.
- Use: To remove a field from the output uncheck the box in front of the field. The Chr, Start and Stop fields cannot be modified and are not included in the list of editable fields.
- Rename: To rename the field, either type in the Name box, or select a name of a field required for the plot type in the drop down box. If a required field is renamed a warning will appear for the selected plot type. For a generic interval source a “Name” or “Identifier” field is required.
- Type: The default types of the fields are specified based on the types defined in the BED specifications. To change the type, select the appropriate type in the drop down box. For instance, Float64 can be changed to Float, etc. If only a few distinct strings are used in a string field, “Category” is a better type to use because it enables filtering on that field. If “Category” is selected but there are too many different categories for a particular field, an error will be generated on convert.
- Reorder Fields: To reorder fields, click on a field to select (highlight) the row and use the directional arrows on the right of the dialog to move the field either up or down in the list. The double up arrow moves the field to the top of the list. The double down arrow moves the field to the bottom of the list.
The preview pane contains the fields selected for import and will update based on changes made in the Edit Output Fields table. By default only the first 1000 features will be read into the preview. To read more features click on Read More.
Once the plot type and fields have been edited as desired, click Next > to move to the next page of the wizard. See Select a Genome Assembly for the next page.
Converting a GTF File¶
A GTF (Gene Transfer Format) file is a text file in the format specified by Washington University in St. Louis (http://mblab.wustl.edu/GTF22.html and http://genome.ucsc.edu/FAQ/FAQformat.html#format4). Two requirements are that the GTF file contain a gene_id and transcript_id for each feature. If a gene_name field is also available the gene_name will be preferred over the gene_id for labeling the features in the annotation source.
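The labeling rule can be sketched as follows: parse the attribute column, check for the two required attributes, and prefer gene_name when it is present. The helper names are hypothetical, and the regular expression only captures quoted attribute values, which is a simplification of the GTF attribute syntax.

```python
import re

# Matches attribute pairs such as: gene_id "ENSG00000139618";
ATTR_RE = re.compile(r'([^\s;"]+)\s+"([^"]*)"')


def parse_attributes(column):
    """Parse the GTF attribute column into a dict (quoted values only)."""
    return dict(ATTR_RE.findall(column))


def feature_label(column):
    """Return gene_name if present, otherwise gene_id."""
    attrs = parse_attributes(column)
    for key in ("gene_id", "transcript_id"):
        if key not in attrs:
            raise ValueError(f"missing required attribute: {key}")
    return attrs.get("gene_name", attrs["gene_id"])
```

A feature without a gene_name attribute falls back to its gene_id, and a feature missing either required attribute raises an error in this sketch.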
After selecting one or more GTF files (GTF, GTF.GZ) on the file selection page of the convert wizard, click Next >. If the data has not been previously indexed then the file(s) will be scanned to determine the genomic coordinates and most likely genome assembly. The scan may be skipped if the genome assembly is known and the segment naming convention is known.
Note
Throughout the Convert Source Wizard, certain options are considered “Advanced” options and do not need to be set in most cases. To show “Advanced” options, check the box on the lower left of the dialog. Any option that is advanced will be labeled as such in the documentation.
Select the Desired Plot Type for GTF¶
The desired plot type will be automatically detected by default. Unless a GTF file violates the file format specifications, this will always be a Gene plot type. To change the desired plot type, change the selection in the drop down box. If the currently selected fields in the file(s) do not meet the specifications for the selected plot type, a warning icon will appear on the upper right. To check the required fields for the selected plot type, or to read the warning message(s), hover over the [i]nformation or [!] warning icons. The tool tips will contain the information or warning messages.
The output fields can be edited on this page as well.
- Use: To remove a field from the output uncheck the box in front of the field. The Chr, Start and Stop fields cannot be modified and are not included in the list of editable fields.
- Rename: To rename the field, either type in the Name box, or select a name of a field required for the plot type in the drop down box. If a required field is renamed a warning will appear for the selected plot type. For a gene source the following fields are required: “Gene Name”, “Transcript Name”, “Strand”, “CDS Start”, “CDS Stop”, “Exon Starts” and “Exon Stops”.
- Type: The default types of the fields are specified based on the types detected in the scan pass. To change the type, select the appropriate type in the drop down box. For instance, Float64 can be changed to Float, etc. If only a few distinct strings are used in a string field, “Category” is a better type to use because it enables filtering on that field. If “Category” is selected but there are too many different categories for a particular field, an error will be generated on convert.
- Reorder Fields: To reorder fields, click on a field to select (highlight) the row and use the directional arrows on the right of the dialog to move the field either up or down in the list. The double up arrow moves the field to the top of the list. The double down arrow moves the field to the bottom of the list.
The preview pane contains the fields selected for import and will update based on changes made in the Edit Output Fields table. By default only the first 1000 features will be read into the preview. To read more features click on Read More.
Once the plot type and fields have been edited as desired, click Next > to move to the next page of the wizard. See Select a Genome Assembly for the next page.
Converting a WIG (Fixed or Variable Step) File¶
A WIG file is a text file which assigns a floating point value to each position or interval of interest in genomic space. Wiggle files can either be formatted as variable step or fixed step (variableStep and fixedStep respectively). See http://genome.ucsc.edu/goldenPath/help/wiggle.html for more information on the format.
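The difference between the two layouts is easiest to see in a small reader. The sketch below follows the UCSC wiggle description (a declaration line with `chrom`, optional `span`, and for fixedStep a `start` and `step`; 1-based positions) and is not the wizard's parser. `parse_wig` is a hypothetical helper that reports each value as a 1-based inclusive interval.

```python
def parse_wig(lines):
    """Yield (chrom, start, stop, value) with 1-based inclusive coordinates."""
    mode = None
    params = {}
    position = 0
    span = 1
    step = 1
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        if line.startswith(("fixedStep", "variableStep")):
            # Declaration line: key=value pairs after the mode keyword.
            parts = line.split()
            mode = parts[0]
            params = dict(p.split("=", 1) for p in parts[1:])
            span = int(params.get("span", 1))
            if mode == "fixedStep":
                position = int(params["start"])
                step = int(params.get("step", 1))
            continue
        if mode == "fixedStep":
            # One value per line; the position advances by `step`.
            yield params["chrom"], position, position + span - 1, float(line)
            position += step
        elif mode == "variableStep":
            # Each line carries its own position followed by the value.
            pos_text, value_text = line.split()
            pos = int(pos_text)
            yield params["chrom"], pos, pos + span - 1, float(value_text)
```

In fixedStep data the position is implied by the declaration, while variableStep data states it on every line, which is why fixedStep files are usually smaller for densely sampled regions.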
After selecting one or more WIG files (WIG, WIG.GZ, WIGFIX, WIGFIX.GZ) on the file selection page of the convert wizard, click Next >. If the data has not been previously indexed then the file(s) will be scanned to determine the genomic coordinates and most likely genome assembly. The scan may be skipped if the genome assembly is known and the segment naming convention is known. For WIG files, the only available output is a data sequence source so the next page of the convert wizard is the genome assembly specification page. See Select a Genome Assembly for information on the next page.
Note
Throughout the Convert Source Wizard, certain options are considered “Advanced” options and do not need to be set in most cases. To show “Advanced” options, check the box on the lower left of the dialog. Any option that is advanced will be labeled as such in the documentation.
Converting a Delimited Text File¶
A delimited text file is a text file consisting of rows and columns of data delimited by special characters or strings which are not allowed in the data values themselves. The delimited text import can take multiple files as long as they are formatted in the same manner. Because the input format is customizable, a delimited file characteristics page is included in the convert source wizard to allow the user to indicate how the file(s) should be parsed.
After selecting one or more text files (TXT, TXT.GZ, TSV, TSV.GZ, CSV, CSV.GZ, etc.) on the file selection page, the icon next to the files will be a question mark. This is because text files can contain data for numerous plot types, and the plot type cannot be determined from the file extension. If the files are in a format that cannot be read or processed, or are of inconsistent types, a warning will appear or the wizard will prevent progressing to the next page. To move to the next step of the conversion process, click Next >.
Note
Throughout the Convert Source Wizard, certain options are considered “Advanced” options and do not need to be set in most cases. To show “Advanced” options, check the box on the lower left of the dialog. Any option that is advanced will be labeled as such in the documentation.
Specify the Delimited File Characteristics¶
As delimited text files can come in many different formats, additional information is required to parse the file(s). This information is specified on the Delimited Text File Characteristics page of the convert wizard. The options that can be specified are described below.
- Field Name Line: The files may contain a line which defines the names of
the data fields in the file(s). The header line must be near the top of the
file(s) and any text above the header line will be ignored during import.
The header line may be detected by:
- Starts With: A single or multiple character string
- Line Before Data: The line immediately preceding the first data line. Selecting this option activates the First Data Line option. If not in Advanced mode, it is assumed that the first data line is the second line in the file. [Line 1]
- Manual Names: If no header line exists, select this option to auto generate field names that can be edited on the plot type/output field name specification page. Selecting this option activates the First Data Line option. If not in Advanced mode, it is assumed that the first data line is the first line in the file. [Line 0]
- First Data Line: [Advanced Option] Available when Line Before Data or Manual Names are selected for Field Name Line. Data lines are in 0-based indexes, i.e. the first line of the file is line number 0, the second line is line number 1, etc.
- Ignore Lines: [Advanced Option] The input file(s) may also contain any
number of lines which should be ignored during conversion. Such lines are often
referred to as comments because they are intended to include additional notes
but do not include data. The controls used to indicate which lines should be
treated as comments are similar to the header line controls. Comments may be
detected by:
- Starts With: A single or multiple character string at the beginning of the line.
- Ends With: A single or multiple character string at the end of the line.
- Contains: Any line that contains the specified string.
- Equals: Any line that is exactly equal to the specified string.
- Wildcard: Any line that contains the wildcard character.
- RegEx: A regular expression can be used to indicate lines to ignore using a more complicated pattern.
- Don’t Ignore: Don’t ignore lines that contain the specified string.
- Field Delimiter: The string that separates data values on each line of the input file(s). For correct alignment of the converted data, the delimiter should not occur within any of the data values.
- List Delimiter: [Advanced Option] The list delimiter specifies the string that separates the individual items within a single field. By default this delimiter is a comma if the file is tab delimited or uses a custom delimiter, and a semicolon if the file is comma delimited.
- Missing Values: [Advanced Option] A list of common missing value indicators is included in the conversion wizard by default. To view this list or to add or remove a missing value indicator, select Advanced Options and edit the space delimited list of common missing value strings.
- Coordinates: The coordinates option is used to specify whether the intervals
defined in the input data are 0-based intervals, 1-based intervals or positions.
- 0-Based Interval: The difference between the stop and the start positions defines the width of the interval. For example, an interval covering the first three positions of a chromosome in 0-based coordinates would be specified with a start of 0 and a stop of 3, i.e. [0, 3). (Also known as ‘half-open coordinates’.)
- 1-Based Interval: The difference between the stop and the start positions plus one defines the width of the interval. For example, an interval covering the first three positions of a chromosome in 1-based coordinates would be specified as [1, 3]. (Also known as ‘indexed coordinates’.)
- Position (1bp width): For files with only a single position/coordinate select this option for the coordinates. This option assumes all features have a single base pair width. The position is 1-based so the smallest position in a chromosome would be 1.
- Preview: The chromosome, start, stop or position columns are selected by default whenever possible. However, it is possible that the wrong fields are selected for the chromosome, start, stop or position. To change the fields used for the genomic coordinates, click on the field name and specify whether the field is the Chromosome (Segment) Field, Start Field (for interval coordinates only), Stop Field (for interval coordinates only), or Position Field (for position coordinates only).
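The three coordinate options can be compared with a short sketch. The function names, the string labels for the options, and the choice of a 0-based half-open pair as the common form are assumptions made for illustration; they do not describe how the wizard stores coordinates internally.

```python
def to_half_open(convention, start, stop=None):
    """Return a (start, stop) pair in 0-based half-open form.

    convention is "0-based", "1-based" or "position"; these labels are
    hypothetical shorthand for the three Coordinates options.
    """
    if convention == "0-based":
        # Width is stop - start, so the pair is already half-open.
        return start, stop
    if convention == "1-based":
        # Width is stop - start + 1; only the start shifts down by one.
        return start - 1, stop
    if convention == "position":
        # A single 1-based position covers exactly one base pair.
        return start - 1, start
    raise ValueError(f"unknown convention: {convention}")


def width(interval):
    """Width in base pairs of a 0-based half-open interval."""
    start, stop = interval
    return stop - start


# The first three bases of a chromosome, written in either interval
# convention, describe the same region.
first_three_0 = to_half_open("0-based", 0, 3)
first_three_1 = to_half_open("1-based", 1, 3)
```

Here `first_three_0` and `first_three_1` are equal, which is the practical reason the correct option has to be chosen: picking the wrong one shifts every feature by one base.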
Once the delimited text file characteristics and fields to use to define genomic coordinates are specified, click Next >. If the data has not been previously indexed then the file(s) will be scanned to determine the data types and genomic coordinates and most likely genome assembly. The scan may be skipped if the data types, genome assembly, and the segment naming convention are known.
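As a rough illustration of how the Field Name Line, Ignore Lines, Field Delimiter and Missing Values settings interact, the sketch below splits a small in-memory file. It is not the wizard's parser: the function name, the rule labels, the default missing-value strings, and the decision to strip the header prefix from the first field name are all assumptions.

```python
import re


def split_delimited(lines, header_prefix, ignore_rules, delimiter="\t",
                    missing=("?", "NA", ".")):
    """Return (field_names, rows) from an iterable of text lines.

    header_prefix models "Starts With" for the Field Name Line.
    ignore_rules is a list of (mode, text) pairs modelling Ignore Lines.
    """
    matchers = {
        "starts_with": lambda line, text: line.startswith(text),
        "ends_with": lambda line, text: line.endswith(text),
        "contains": lambda line, text: text in line,
        "equals": lambda line, text: line == text,
        "regex": lambda line, text: re.search(text, line) is not None,
    }
    names, rows = None, []
    for raw in lines:
        line = raw.rstrip("\n")
        if names is None:
            # Any text above the header line is skipped.
            if line.startswith(header_prefix):
                names = line[len(header_prefix):].split(delimiter)
            continue
        if any(matchers[mode](line, text) for mode, text in ignore_rules):
            continue  # treated as a comment line
        # Replace missing-value indicators with None.
        rows.append([None if v in missing else v
                     for v in line.split(delimiter)])
    return names, rows


sample = [
    "produced by an example pipeline\n",
    "#chrom\tstart\tstop\tscore\n",
    "// a comment line\n",
    "1\t0\t3\t0.5\n",
    "1\t10\t12\tNA\n",
]
names, rows = split_delimited(sample, "#", [("starts_with", "//")])
```

With this input, `names` holds the four field names and `rows` holds two data rows, the second with `None` in place of the `NA` score.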
Select the Desired Plot Type for Delimited Text¶
The desired plot type will be automatically detected by default. To change the desired plot type, change the selection in the drop down box. If the currently selected fields in the file(s) do not meet the specifications for the selected plot type, a warning icon will appear on the upper right. To check the required fields for the selected plot type, or to read the warning message(s), hover over the [i]nformation or [!] warning icons. The tool tips will contain the information or warning messages.
The output fields can be edited on this page as well.
- Use: To remove a field from the output uncheck the box in front of the field. The Chr, Start and Stop fields cannot be modified and are not included in the list of editable fields.
- Rename: To rename the field, either type in the Name box, or select a name of a field required for the plot type in the drop down box. If a required field is renamed a warning will appear for the selected plot type.
- Type: The default types of the fields are specified based on the types detected in the scan pass. To change the type, select the appropriate type in the drop down box. For instance, Float64 can be changed to Float, etc. If there are only a few strings used for a string field, “Category” is a better type to use because it enables filtering on that field. If “Category” is selected but there are too many different categories for a particular field, an error will be generated on convert.
- Reorder Fields: To reorder fields, click on a field to select (highlight) the row and use the directional arrows on the right of the dialog to move the field either up or down in the list. The double up arrow moves the field to the top of the list. The double down arrow moves the field to the bottom of the list.
The preview pane contains the fields selected for import and will update based on changes made in the Edit Output Fields table. By default only the first 1000 features will be read into the preview. To read more features click on Read More.
Once the plot type and fields have been edited as desired, click Next > to move to the next page of the wizard. See Select a Genome Assembly for the next page.
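The trade-off between a string type and “Category” can be pictured with a small sketch of type suggestion from scanned values. The type labels other than “Category”, the function name, and the `max_categories` limit are hypothetical; the documentation above does not state the actual category limit or the detection rules used by the scan pass.

```python
def suggest_type(values, max_categories=3):
    """Suggest a field type from scanned values (None means missing).

    max_categories is an illustrative limit, not the wizard's real one.
    """
    present = [v for v in values if v is not None]

    def all_parse(cast):
        # True when every present value converts without error.
        try:
            for v in present:
                cast(v)
        except ValueError:
            return False
        return True

    if present and all_parse(int):
        return "Integer"
    if present and all_parse(float):
        return "Float"
    # Few distinct strings: a categorical type allows filtering.
    if len(set(present)) <= max_categories:
        return "Category"
    return "String"


effect_type = suggest_type(["missense", "synonymous", "missense"])
identifier_type = suggest_type(["rs1", "rs2", "rs3", "rs4"])
```

In this sketch a field with two distinct effect labels is suggested as a category, while a field of unique identifiers exceeds the limit and stays a string.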
Converting an IDF or TSF File¶
An IDF file is an annotation track source from SVS 7. It can be converted to the new TSF format to optimize data storage as well as take advantage of the embedded genome assembly, coverage, and documentation.
If the IDF file is an allele sequence source, the file will be treated like a 2Bit or FASTA file in the convert wizard. See Convert a 2Bit File or Converting a FASTA File for more information.
If the IDF file is any other source type, the file will be treated like an indexed file and the next page in the convert wizard will be specifying the plot type and output fields.
Note
Throughout the Convert Source Wizard there are certain options considered “Advanced” options that do not need to be selected in most cases. To show “Advanced” options, check the box on the lower left of the dialog. Any option that is advanced will be labeled as such in the documentation.
Select the Desired Plot Type for IDF or TSF¶
The desired plot type will be automatically set to the type of the IDF or TSF source being converted. To change the desired plot type, change the selection in the drop down box. If the currently selected fields in the file do not meet the specifications for the selected plot type, a warning icon will appear on the upper right. To check the required fields for the selected plot type, or to read the warning message(s), hover over the [i]nformation or [!] warning icons. The tool tips will contain the information or warning messages.
The output fields can be edited on this page as well.
- Use: To remove a field from the output uncheck the box in front of the field. The Chr, Start and Stop fields cannot be modified and are not included in the list of editable fields.
- Rename: To rename the field, either type in the Name box, or select a name of a field required for the plot type in the drop down box. If a required field is renamed a warning will appear for the selected plot type.
- Type: The default types of the fields are specified based on the types detected in the scan pass. To change the type, select the appropriate type in the drop down box. For instance, Float64 can be changed to Float, etc. If there are only a few strings used for a string field, “Category” is a better type to use because it enables filtering on that field. If “Category” is selected but there are too many different categories for a particular field, an error will be generated on convert.
- Reorder Fields: To reorder fields, click on a field to select (highlight) the row and use the directional arrows on the right of the dialog to move the field either up or down in the list. The double up arrow moves the field to the top of the list. The double down arrow moves the field to the bottom of the list.
The preview pane contains the fields selected for import and will update based on changes made in the Edit Output Fields table. By default only the first 1000 features will be read into the preview. To read more features click on Read More.
Once the plot type and fields have been edited as desired, click Next > to move to the next page of the wizard. See Select a Genome Assembly for the next page.
Select a Genome Assembly¶
Depending on how much of the file(s) are scanned the convert source wizard will pick the most likely genome assembly based on the genomic coordinates (chromosomes and largest positions) as the default genome assembly. If this default is not the correct species or build, select the correct assembly from the drop down list. Note, all species that matched the features read will be at the top of the list. Scrolling down past the list of matches will present the available assemblies in alphabetical order based on the scientific name.
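The matching described above can be sketched as a check of the scanned chromosome names and largest positions against each candidate assembly. This is only an illustration: the function name, the data layout, and the toy assembly names and lengths are assumptions, and sorting by assembly name stands in for the alphabetical ordering by scientific name.

```python
def rank_assemblies(observed_max, assemblies):
    """Order assembly names, listing those consistent with the data first.

    observed_max maps chromosome name -> largest position seen in the scan.
    assemblies maps assembly name -> {chromosome name: length}.
    An assembly matches when every observed chromosome exists in it and
    no observed position exceeds that chromosome's length.
    """
    def matches(lengths):
        return all(
            chrom in lengths and pos <= lengths[chrom]
            for chrom, pos in observed_max.items()
        )

    matched = sorted(n for n, lengths in assemblies.items() if matches(lengths))
    others = sorted(n for n in assemblies if n not in matched)
    return matched + others


# Toy chromosome lengths, not real assembly data.
assemblies = {
    "toy_build_A": {"1": 1000, "2": 800},
    "toy_build_B": {"1": 1200, "2": 900},
    "toy_other": {"I": 500},
}
observed = {"1": 1100, "2": 750}
ranking = rank_assemblies(observed, assemblies)
```

Only `toy_build_B` can hold a position of 1100 on chromosome 1, so it is ranked first. The sketch also shows why a partial scan can pick the wrong default: if the largest positions have not been read yet, several builds may still match.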
Once the assembly has been selected, the source to segment mapping can be modified as required.
Modify the Source to Segment Mapping¶
Segments can be either excluded from the annotation source or renamed using the fields in this table.
- Use: To include a segment in the assembly leave the ‘Use’ box
checked. To “Check All” or “Uncheck All” click on the “Use” column header. “Uncheck All Unmapped” can be used to remove contigs or segments not found in the genome assembly.
Note
If there were more than 5000 segments in the allele sequence, only the 5000 longest segments are included in the assembly file.
- Source: The name of the segment from the allele sequence file. To rename a single segment, double click on its name in the Segment column. If many segment names share a pattern that should be removed or replaced, use the controls below the segment definition table to rename segments programmatically. The options include:
- RegEx: Use regular expressions to rename the “Source”.
- Substring: Remove a substring from all segment names to generate the segment name.
- Prefix: Remove a common prefix from all segment names.
- Suffix: Remove a common suffix from all segment names.
- Manual: Rename segments manually, one at a time. If manual mode is selected you can rename directly in the Renamed cell in the Source to Segment Mapping table.
Enter either the regular expression or the string to remove in the first text box. A preview of the renamed segment name will appear in the second text box. To apply the rename to all segments, click on the Set Segment to Renamed button.
- Length: The length of each segment is displayed in this column.
- Aliases: If a segment has an alias listed in the genome assembly it will be listed in this field.
- Type: The type of segment. By default ‘Autosomes’ are always visible
and the rest are visible only if there is data. Options include:
- Autosomes
- Allosome
- Mitochondrial
- Fragment
- Scaffold
- Contig
- Unknown
- Visibility: The visibility of the data. If the segment should be shown in GenomeBrowse only when it has data, the visibility is set to “With Data”. Options include:
- Always
- Never
- With Data
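The programmatic rename options for the Source column (RegEx, Substring, Prefix, Suffix) can be previewed outside the wizard with a sketch like the one below. The function name, the mode labels, and the `replacement` argument for the regular expression mode are assumptions made for illustration.

```python
import re


def rename_segment(name, mode, text, replacement=""):
    """Preview a renamed segment for one programmatic rename mode."""
    if mode == "prefix":
        # Remove the prefix only when the name actually starts with it.
        return name[len(text):] if name.startswith(text) else name
    if mode == "suffix":
        # Guard against an empty suffix, which would slice away the name.
        return name[:-len(text)] if text and name.endswith(text) else name
    if mode == "substring":
        return name.replace(text, "")
    if mode == "regex":
        return re.sub(text, replacement, name)
    raise ValueError(f"unknown mode: {mode}")


sources = ["chr1", "chr2", "chrX"]
renamed = [rename_segment(s, "prefix", "chr") for s in sources]
```

Applying the prefix mode to every source name mirrors what the Set Segment to Renamed button does for all rows at once.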
Once the assembly has been selected and the segments mapped to the assembly, click Next >.
After clicking Next > the wizard will display the documentation page. See Documentation Step for more information.
Documentation Step¶
Annotation sources must have a name specified. In addition to the source name, documentation on how the data was converted as well as the date and documentation for each field can be specified. All documentation will be embedded into the TSF file to make sharing files and documentation easy.
There are three sections in the documentation specification page of the convert source wizard:
Source Definition: This information is used to identify the annotation source, and also indicate the date it was converted, who converted it and any version information.
- Name: [Required] The name of the annotation source.
- Curated Date: [Required] By default this is a date associated with the files being converted. It can be modified, but a date is required.
- Curated By: Name or organization of who is curating the data.
- Series Name: Name of a particular group of data. This field can be used to differentiate between newer versions of the same type of data. For example, RefSeqGenes-UCSC or dbSNP.
- Version: A version number or date. If there is a particular version name or identifier, it is recommended to include it in the Name field and to use this field for a date associated with that version.
Fields: The individual field descriptions can be specified in this table.
- Orient: The orientation of the data (locus or sample) cannot be modified.
- Type: The type of the data (cannot be modified). If the type needs to be modified, click < Back to go back to the desired plot type and field specification page. Other changes will not be lost.
- Name: The name of fields can be modified. However, if a field name is modified for a required field with an explicit name the source may not be able to be plotted as a specialized track type.
- Doc: The documentation string for the specific field.
- URL Template: For fields with information that can be queried in an external site, specify the URL and two dollar signs ($$) to indicate where the text should be replaced. For example, for an “Identifier” field that contains RS IDs, the URL Template could be http://www.ncbi.nlm.nih.gov/snp/?term=$$.
- Categories: Click on the Edit button to edit the category names and/or documentation. For some data sources including VCF files this option is immediately available as the data types and categories are specified in the header/meta information. For other data sources, after the source is converted, the documentation can be edited to rename and/or document categories for categorical field types. See Source Information Editor for more information.
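The URL Template substitution amounts to replacing the `$$` placeholder with the value of the field for a given feature. A minimal sketch follows; the function name is hypothetical, and percent-encoding the value is an assumption rather than documented behavior.

```python
from urllib.parse import quote


def expand_url_template(template, value):
    """Replace the $$ placeholder with the URL-encoded field value."""
    return template.replace("$$", quote(str(value), safe=""))


url = expand_url_template("http://www.ncbi.nlm.nih.gov/snp/?term=$$", "rs334")
```

For an identifier such as `rs334` the encoding step changes nothing, so the result is simply the template with the identifier appended to the query.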
HTML Documentation: The four tabs at the bottom of the dialog are for writing HTML documentation of the source. The tabs are to guide the writing of the documentation and to provide nice headers for each section. HTML tags can be used for formatting.
Description: Description of the source and where it was obtained from.
Credit: Any required citations or credits for the source should go in this section.
Notes: Any relevant notes on pre-processing that had to be performed on the data or settings used to convert the source.
Note
After the file has been converted statistics about the fields and data in the source will be placed in this section.
Meta: Any meta information for the data source(s).
Note
If VCF files are converted the header information from the first VCF file will be placed in this section.
Advanced Options: When Advanced Options is checked in the lower left corner of the documentation dialog the ability to copy documentation from an existing source becomes available.
Mirror Settings from TSF file: This option allows the user to select an existing annotation source in TSF format. The Curated By, Series Name, Description, Credit, and Notes fields are then automatically copied to the dialog for the new source.
For example: if you were curating the new human dbSNP 144 you could select the previously available dbSNP 142v2 annotation source to automatically fill in the majority of the documentation for the 144 track.
Once the documentation has been filled in as desired, click Next > to go to the confirmation and conversion step.
Confirmation of the Specified Parameters¶
- Ready to convert message: Contains information about the number of input files, the total size of the data to be converted (not the final size of the new file), the number of fields to be included in the converted source, the track type, the assembly and type of coverage to be computed. If any of the information in this message is not correct, back up through the wizard to adjust the parameters.
- Field Indexing: [Advanced Option] String fields can be indexed to enable searching or looking up information from the source in the location bar in GenomeBrowse. Fields should only be indexed if there is a reasonable expectation that the names in the field apply to just a few features, such as RS IDs, gene names or transcript names. By default Identifier, Gene Name, and Transcript Name fields will be indexed. These fields can be unchecked if indexing is not desired.
- Left Align: [Advanced Option] Using the reference and alternate fields, insertions and deletions are shifted to their left most possible representation. The genome assembly selected for the input files will be used to find the sequence surrounding each variant, so that it may be realigned. This option is only available for variant and variant map source types.
- File Name: [Advanced Option] The file name will be auto-generated based on the source name, version, species and build. If a different file name is desired, that can be changed in this advanced option.
- Path: [Advanced Option] By default the converted source will be saved in the user’s default annotation folder. If a different location is desired, it can be specified by entering a path or clicking on Browse....
- Add Path to Library: [Advanced Option] If the path is changed, the path can be saved to the library so the source can be easily accessed in the Data Source Library.
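To make the Left Align option more concrete, the sketch below shifts an insertion or deletion to its left-most representation using the surrounding reference sequence. It follows the commonly used trim-and-extend normalization approach and is not taken from the wizard's implementation; the function name, the use of a plain string for the reference, and the 0-based `pos` offset are assumptions.

```python
def left_align(reference, pos, ref, alt):
    """Shift an indel to its left-most representation.

    reference is the segment sequence and pos is the 0-based offset of
    the first base of ref. Returns the new (pos, ref, alt).
    """
    if ref == alt:
        return pos, ref, alt
    changed = True
    while changed:
        changed = False
        # Drop a shared trailing base, moving the right edge left.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both alleles to the left
        # with the preceding reference base.
        if (not ref or not alt) and pos > 0:
            pos -= 1
            ref = reference[pos] + ref
            alt = reference[pos] + alt
            changed = True
    # Trim shared leading bases, keeping one base in each allele.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


reference = "GGGCACACACAGGG"
# Deleting the last "CA" repeat is equivalent to deleting the first one.
deletion = left_align(reference, 8, "ACA", "A")
insertion = left_align(reference, 10, "A", "ACA")
```

Both variants sit in a `CA` repeat, so each is moved to the start of the repeat and anchored on the `G` at offset 2. Representing equivalent indels the same way is what lets them be matched across sources.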
If an option does not appear correct, click < Back to go back to fix the option. Any information specified will not be lost when backing up through the wizard unless changing an option changes how the data is read.
Once all of the options look correct click Convert to convert the data source(s) into the Annotation Source TSF format.
Converting Data Sources to Annotation Source¶
The convert source wizard will become a progress dialog during the last step, the converting step. An approximate time remaining will be displayed as well. Clicking < Back will stop the conversion but let it be restarted without specifying all of the options again. Clicking Cancel will exit out of the convert source wizard completely.