2. Building Annotation Sources

New in SVS 8 is the Convert Sources Wizard!

  • Open SVS and go to Tools > Manage Data Sources to open the Data Source Library.
_images/dSL.png

Figure 2-1. Data Source Library

  • Click the Convert... button on the bottom left of the dialog to open the wizard.
_images/convertSource.png

Figure 2-2. Convert Source Wizard

Note

Full documentation on this new tool can be found in the SVS manual or by selecting the Help button on the dialog.

A. Creating a Reference Sequence

An allele reference sequence source can be built for any species where there is an available DNA sequence (FASTA) file.

Download the available FASTA file for the Zv9 assembly from the Ensembl FTP site.

  • Step 1: Click the Add button on the Define Input page of the Convert dialog, navigate to the downloaded FASTA file, and select the *.fa.gz file. Then click Next >.
  • Step 2: The converter will scan the file to come up with a list of the chromosomes (or scaffolds) that are included in the FASTA and determine the length of each segment. It will also attempt to match the information found to an existing assembly file.
  • Step 3: If a genome assembly match was found, the next Change Options screen will show it in the Genome Assembly (Build): drop-down box. For this data we have already created the assembly file, but the chromosome names in the FASTA file do not yet match.
  • We will need to rename the segments using the option at the bottom of the dialog before it will correctly match to the Danio rerio Zv9 assembly.
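  • To rename, select RegExp from the drop-down and type (.*) dna(.*) in the first box and \1 in the second. The first box is the pattern to match and the second is the replacement, so everything from " dna" onward is dropped from each segment name. The dialog should look like Figure 2-3.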
_images/segmentMatch.png

Figure 2-3. Assembly match by renaming segments
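To see what the rename rule does before typing it into the dialog, the same pattern and replacement can be tried outside SVS. The sketch below uses Python's re module as a stand-in; it is an assumption that the wizard's regular-expression engine treats this simple pattern the same way, and the header strings are hypothetical examples written in the Ensembl style rather than lines copied from the downloaded file.

```python
import re

# Pattern and replacement as typed into the two RegExp boxes of the wizard.
PATTERN = re.compile(r"(.*) dna(.*)")
REPLACEMENT = r"\1"


def rename_segment(header: str) -> str:
    """Apply the rename rule; names that do not match are returned unchanged."""
    return PATTERN.sub(REPLACEMENT, header)


# Hypothetical Ensembl-style segment headers (illustrative values only).
headers = [
    "1 dna:chromosome chromosome:Zv9:1:1:60348388:1",
    "MT dna:chromosome chromosome:Zv9:MT:1:16596:1",
    "Zv9_NA1 dna:scaffold scaffold:Zv9:Zv9_NA1:1:1000:1",
]

for header in headers:
    print(header, "->", rename_segment(header))
```

Each header is reduced to the text before " dna", which is the short name the assembly file is expected to use.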

  • If you scroll down the segment list, you will start to see additional segments that were not included in the assembly file (unmapped scaffolds). We do not want to include them in the reference sequence, so right-click on the Use column header and select Uncheck Unmapped. Then click Next >.

Note

SVS has an upper limit of 5000 segments that can be included. The wizard will scan all the available segments in the FASTA file but only allow the longest 5000 to be selected for inclusion in the reference sequence source.
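The scan described in Step 2 and the segment limit in this note can be approximated outside SVS to preview what the wizard will find. The following is a minimal sketch, not the wizard's actual implementation: the function names are hypothetical, and it assumes the segment name is the first word of each FASTA header line.

```python
import gzip
import heapq
import io
from typing import Dict, IO, List, Tuple

MAX_SEGMENTS = 5000  # upper limit stated in the note


def scan_fasta(handle: IO[str]) -> Dict[str, int]:
    """Map each segment name (first word of a '>' header) to its length."""
    lengths: Dict[str, int] = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def scan_fasta_gz(path: str) -> Dict[str, int]:
    """Same scan for a gzip-compressed file such as a *.fa.gz download."""
    with gzip.open(path, "rt") as handle:
        return scan_fasta(handle)


def longest_segments(lengths: Dict[str, int],
                     limit: int = MAX_SEGMENTS) -> List[Tuple[str, int]]:
    """Return up to `limit` (name, length) pairs, longest first."""
    return heapq.nlargest(limit, lengths.items(), key=lambda item: item[1])


# Tiny in-memory example with made-up sequences.
demo = io.StringIO(">1 dna:chromosome\nACGT\nAC\n>Zv9_NA1 dna:scaffold\nACG\n")
demo_lengths = scan_fasta(demo)
print(longest_segments(demo_lengths, limit=1))
```

For a real download, scan_fasta_gz would be called with the path to the *.fa.gz file, and the length of the returned dictionary shows whether the file exceeds the 5000-segment limit.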

Note

If no match to an existing assembly file is found, you can have the wizard create a new assembly based on the segments and lengths determined from the FASTA data. Select <Create New> from the genome build drop-down and fill in the required build information.

  • The next window is for labeling the data source and documenting the conversion process. At minimum, enter an informative Name: for the source, then click Next >.

Note

For data sources curated by Golden Helix we will fully document the source of the data including any citations that are required by the provider. See Figure 2-4 for an example.

_images/document.png

Figure 2-4.

  • Step 4: In the last window you can select a location to save the created source. By default, your SVS User Annotation Folder will be selected.
  • Click Convert to create the reference sequence.

B. Creating a Gene Annotation

A gene annotation track can be built for any species where there is an available gene annotation file. Supported file formats are Delimited Text, GTF, and GFF.

Download the available GTF file for the Zv9 assembly from the Ensembl FTP site.

  • Step 1: Click the Add button on the Define Input page of the Convert dialog, navigate to the downloaded GTF file, and select the *.gtf.gz file. Then click Next >.

  • Step 2: The converter will scan the file to come up with a list of the chromosomes (or scaffolds) that can be used to match the information found to an existing assembly file.

  • Step 3: The first screen will be a listing of the fields found in the file along with their type. You can select which fields to include in the track and change the type if necessary. For this set we will leave the default options (Figure 2-5) and then click Next >.

    _images/outputFields.png

    Figure 2-5. Plot Type and Output Options Window

  • If a genome assembly match was found, the next Change Options screen will show it in the Genome Assembly (Build): drop-down box. For this dataset it should match the Zebrafish assembly we have built. There are still a number of unmapped scaffolds, which we will not include in the track.

  • Right-click on the Use column and select Uncheck Unmapped and click Next >.

  • On the next screen, fill in any documentation for the track (Figure 2-6) and click Next >.

    _images/document2.png

    Figure 2-6. Gene Track Documentation

  • Step 4: In the last window, select a location to save the created source. An additional feature available with gene annotation sources is the ability to index certain fields. Indexing makes searching for those values in the GenomeBrowse plot window much faster. In this case, leave the default Gene Name and Transcript Name fields to be indexed (Figure 2-7) and click Convert.

    _images/indexFields.png

    Figure 2-7. Index Field Options
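The idea behind scanning the GTF for segment names and indexing a field can be illustrated with a small script. This is a sketch only and says nothing about how SVS stores its index: the function names are hypothetical, it assumes the standard nine tab-separated GTF columns, and it assumes that the Gene Name field corresponds to the gene_name attribute in the file. The demo records are made up.

```python
import io
from collections import defaultdict
from typing import Dict, IO, List, Set, Tuple

Location = Tuple[str, int, int]  # (segment name, start, end)


def parse_attributes(field: str) -> Dict[str, str]:
    """Parse the GTF attribute column of the form: key "value"; key "value";

    Simplification: values that themselves contain ';' are not handled.
    """
    attrs: Dict[str, str] = {}
    for part in field.strip().split(";"):
        part = part.strip()
        if not part:
            continue
        key, _, value = part.partition(" ")
        attrs[key] = value.strip().strip('"')
    return attrs


def build_index(handle: IO[str],
                index_key: str = "gene_name"
                ) -> Tuple[Set[str], Dict[str, List[Location]]]:
    """Collect segment names and map each indexed value to its locations."""
    seqnames: Set[str] = set()
    index: Dict[str, List[Location]] = defaultdict(list)
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a well-formed GTF record
        seqname, start, end = cols[0], int(cols[3]), int(cols[4])
        seqnames.add(seqname)
        value = parse_attributes(cols[8]).get(index_key)
        if value is not None:
            index[value].append((seqname, start, end))
    return seqnames, dict(index)


# Made-up records: two exons of one named gene and one unnamed scaffold gene.
demo = io.StringIO(
    '1\tprotein_coding\texon\t100\t200\t.\t+\t.\t'
    'gene_id "G1"; transcript_id "T1"; gene_name "demo1";\n'
    '1\tprotein_coding\texon\t300\t400\t.\t+\t.\t'
    'gene_id "G1"; transcript_id "T1"; gene_name "demo1";\n'
    'Zv9_NA1\tprotein_coding\texon\t5\t50\t.\t-\t.\t'
    'gene_id "G2"; transcript_id "T2";\n'
)
demo_seqnames, demo_index = build_index(demo)
print(sorted(demo_seqnames))
print(demo_index["demo1"])
```

With the index built once, looking up a name is a single dictionary access rather than a pass over the whole file, which is the reason indexed fields are faster to search.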

C. Visualizing the Annotation Sources

Now that the tracks have been created, they can be used in SVS for analysis or just for visualization.

  • Open a new GenomeBrowse window by going to Tools > New GenomeBrowse Window.

  • Select the Danio rerio (Zebrafish), Zv9(Jul 2010) assembly from the genome assembly drop-down menu, then click Add

  • Select both of the created sources, "Ensembl Genes 74, Ensembl" and "Reference Sequence Zv9, Ensembl", and then click Plot & Close.

  • You can zoom in on different features or type any Zebrafish gene name to jump to that location. For example, type GCNT7 in the location bar to zoom automatically to this region (Figure 2-8).

    _images/geneView.png

    Figure 2-8. GCNT7 Gene View

  • If you hover your mouse over Exon 1 of the gene and scroll up, you can zoom in and see the amino acids encoded by the exon in the gene annotation source, as well as the nucleotides that make up the reference sequence at that location.