Appendix

Installing the Third-Party Condor Package

Installing Condor Overview

Condor® is a freely available, specialized batch system for managing compute-intensive jobs in a distributed network environment.

This chapter provides a step-by-step guide to installing Condor® on Windows systems connected through a LAN. Although Condor® is cross-platform, this chapter focuses on installing Condor® through the Windows installation wizard.

Documentation on installing Condor® on other platforms, and on security concerns, can be found on the Condor® website at http://www.cs.wisc.edu/condor/.

Downloading and Using the Installation Wizard

Launching the Condor Installer

First, download the installer from the Condor® website (http://www.cs.wisc.edu/condor/). After filling in your information, you should be able to select the target platform for your Condor® installation. After you download and launch the installer on Windows, the first screen of the installation wizard appears.

Creating or Joining a Condor Pool

After accepting the End-User License Agreement, you can either create a new Condor® Pool or join an existing one. If this is the first computer on the LAN on which Condor® is being installed, select Create a new Condor® Pool and choose a display name for the pool. Otherwise, select Join an existing Condor® Pool and enter the host name (machine name) of the first machine on which you installed Condor®.

Execution and Submit Behavior for Jobs

Next, the Condor® wizard lets you select whether the current computer is permitted to submit jobs, run jobs, or both. If the computer will run jobs, you can choose to run jobs only when no user is actively using the computer. This may be a good default if the computer has only one core or CPU, since desktop interactivity will most likely be degraded while Condor® jobs are running. If you choose this setting, you must also decide what happens to jobs that are suspended when a user starts using a previously idle computer. If users generally occupy computers for long periods of time, it is recommended that you select Restart the job on a different machine.

Setting Condor Host Permission

In most cases, you can leave the Accounting Domains and email settings blank. Because SVS does not use Java to execute jobs on Condor®, the Java Settings screen is not important (although you may want to ensure the settings are correct if you plan to use Condor® for your own Java jobs).

The Host Permission Settings screen that follows is extremely important. With settings that are too restrictive, Condor® will appear to be completely dysfunctional. For the most part, if your LAN is behind a proper firewall, there should be no security concerns, because your Condor® Pool will not be accessible from any external network. For this reason, we recommend setting the Hosts with Write access field to * to ease setup and troubleshooting. If your network is entirely exposed to the Internet and each host has a public IP address, you may want to limit these permissions to stricter settings. The other defaults on this screen should be correct. See the Condor® website, specifically the “Administrators’ Manual”, for in-depth documentation.

Finish Up

The next screen offers a custom installation, which lets you specify a target install directory. We recommend clicking Install, which uses the default install directory of C:/condor. After you click Install, Condor® should be properly installed on that machine and the Condor® Windows services should be started. Within a couple of minutes, you should be able to run the condor_status utility from a command prompt whose current directory is C:/condor/bin.

Troubleshooting Techniques and Common Issues

Although not a substitute for the official Condor® documentation (available at http://www.cs.wisc.edu/condor/), this section covers methods to assess the health of your Condor® Pool and the status of the jobs submitted to the pool by SVS.

Command Utility Commands

The following Condor® utility programs can be run from the C:/condor/bin directory through a command prompt.

- condor_status: Use this command to list the machines that are currently in your Condor® pool. This command also displays the state of each machine, which is usually one of the following values:

  - Unclaimed: Available to run jobs.

  - Owner: Configured to run jobs only when no user is using the computer, and currently in use.

  - Claimed: In the process of running jobs.

- condor_q: Use this command to list the jobs that have been submitted by your machine and the state of each job. When you use SVS to run jobs on the Condor® pool, the queue command shows the jobs that are currently running and the ones that are waiting for resources to become available.
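For scripted health checks, the two utilities above can also be launched from Python. The following is only a sketch: it assumes the default C:/condor/bin directory used in this chapter, the helper names condor_command and run_condor_tool are hypothetical, and the captured output is returned as plain text without any attempt to interpret it.

```python
import subprocess
from pathlib import PureWindowsPath

# Default location of the Condor utilities, as used in this chapter.
CONDOR_BIN = "C:/condor/bin"


def condor_command(tool, bin_dir=CONDOR_BIN):
    """Build the argument list for a Condor utility such as condor_status."""
    return [str(PureWindowsPath(bin_dir) / tool)]


def run_condor_tool(tool, bin_dir=CONDOR_BIN, runner=subprocess.run):
    """Run a Condor utility and return whatever it printed to standard output.

    `runner` is injectable so the function can be exercised without Condor.
    """
    result = runner(condor_command(tool, bin_dir), capture_output=True, text=True)
    return result.stdout
```

On a machine where Condor® is installed, print(run_condor_tool("condor_status")) would show the same text as running the utility in a command prompt.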

Condor Issues on Windows

Because Condor® is a complex system designed for multiple platforms and network environments, it may seem like a daunting task to discover the source of problems when things go wrong. In reality, the default settings along with the recommendations made above should provide you with working Condor® configurations. The problems that arise, therefore, are usually caused by external factors that block Condor® from fully functioning.

- Windows Firewall: On Windows XP SP2, the Windows firewall seems to block submitted Condor® jobs from running properly. This symptom may not occur until you reboot after installing Condor®. The simplest solution is to disable the Windows firewall. Alternatively, see the Firewalls section of the Condor® “Administrators’ Manual” to learn how to configure a firewall to work with Condor®.

- Failure to Start Condor® Services: There are many reasons why the Condor® Windows service may fail to start. The log files found in C:/condor/log are sometimes helpful in troubleshooting these errors. Certain third-party anti-virus or firewall programs may block Condor® by overwriting Windows’ WinSock. This causes Condor® to fail when starting and to write a bind failed: WSAError - 10038 error to the MasterLog file.

- Underused Resources: The condor_status output may indicate that not all the machines available on your network are being used to process idle jobs in the queue. Condor® can use various metrics to determine whether a machine is ready to receive jobs. By default, if you choose Always run jobs and never suspend them, it should not use any metrics and should simply run all jobs. If a machine is consistently not running jobs, you may want to check its logs for errors such as permission restrictions.
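When a service fails to start, searching the MasterLog for the WinSock error quoted above can save some reading. The following sketch assumes the log is a plain text file; the function name find_log_lines is hypothetical and not part of Condor®.

```python
from pathlib import Path

# Error text quoted in the troubleshooting notes above.
WINSOCK_ERROR = "bind failed: WSAError - 10038"


def find_log_lines(log_path, needle=WINSOCK_ERROR):
    """Return (line_number, text) pairs for log lines containing `needle`."""
    hits = []
    # errors="replace" keeps the scan going if the log has odd bytes in it.
    with Path(log_path).open("r", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if needle in line:
                hits.append((number, line.rstrip("\n")))
    return hits
```

For the default installation, the call would be find_log_lines("C:/condor/log/MasterLog"); an empty list means the error text was not found in that file.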

Extracting Affymetrix Copy Number Data for use in SVS

Extracting Affymetrix Copy Number Data Overview

Affymetrix provides tools for extracting data from cell intensity (CEL) files for copy number analysis. This chapter will provide instructions for using available tools from Affymetrix to extract copy number data for analysis in SVS.

Creating CNT Files using the Affymetrix CNAT Batch Analysis Tool

About Affymetrix CNAT

The Affymetrix GeneChip® Chromosome Copy Number Analysis Tool (CNAT) is an application used with the Affymetrix GeneChip® Genotyping Analysis (GTYPE) software. The CNAT Batch Analysis Tool uses cell intensity (CEL) files and GTYPE analysis results (CHP files) to create copy number analysis files (CNT files) from Affymetrix Mapping data.

GTYPE analysis supports the following DNA probe arrays:

- Affymetrix GeneChip® Human Mapping 10K Array

- Affymetrix GeneChip® Human Mapping 100K Array

- Affymetrix GeneChip® Human Mapping 500K Array

The data must be accessed from the Affymetrix GCOS database, and therefore you must have the associated library files and the corresponding experimental information files (EXP files).

Creating the CNT Files

CNAT Batch Analysis Window

The CNAT Batch Analysis Window is where the copy number analysis is run. In this window you select the sample type, the analysis type, and the CHP files that will be used for samples and references.

In CNAT, you have three analysis options: copy number analysis (CN), loss of heterozygosity (LOH) analysis, or both analyses. For copy number analysis, the output files will have the form *.cn.cnt; for loss of heterozygosity analysis, the output files will have the form *.loh.cnt. Choosing to perform both analyses will generate both types of CNT files. Golden Helix SVS does not currently support loss of heterozygosity analysis data; therefore, to create CNT files that can be used in Golden Helix SVS, you must select either Copy Number (CN) or CN & LOH.
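If both analyses were run, the destination folder will contain a mix of the two file types. As a small illustration, the naming rule above can be used to pick out the files that SVS can read. The function name split_cnt_outputs is hypothetical, and the case-insensitive match is an assumption made for Windows file names.

```python
def split_cnt_outputs(filenames):
    """Separate CNAT output names into copy number and LOH files by suffix."""
    copy_number = [n for n in filenames if n.lower().endswith(".cn.cnt")]
    loh = [n for n in filenames if n.lower().endswith(".loh.cnt")]
    return copy_number, loh
```

Only the first list returned would be passed on to SVS; the *.loh.cnt files are set aside.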

Advanced Analysis Options

Pressing the Advanced Analysis Options button in the CNAT Batch Analysis Window opens the Advanced Options dialog. In this dialog, you can set copy number parameters that affect the smoothing performed during the copy number analysis.

The Genomic Smoothing feature uses Gaussian smoothing to increase or decrease the significance of small aberrations in the data. In order for the resulting CNT files to be compatible with Golden Helix SVS, the Genomic Smoothing must be set to 0 Mb.

Performing the Analysis

After the options have been set, return to the Batch Analysis Window to start the analysis. When the analysis is running, CNAT accesses both the selected CHP files and their corresponding CEL files. The analysis results files will have the suffix *.cn.cnt and will be put in the destination folder specified in the Batch Analysis Window. These CNT files are then ready to be used.

Creating CNCHP Files Using Affymetrix Genotyping Console 2.0

About Affymetrix Genotyping Console

Affymetrix® Genotyping Console (GTC) is an application used to calculate genotype calls and to generate copy number data. Genotyping Console uses cell intensity (CEL) files and genotyping analysis results (CHP files) to create copy number analysis files (CNCHP files) from Affymetrix Mapping data.

GTC supports the copy number analysis of the following DNA probe arrays:

- Affymetrix GeneChip® Human Mapping 100K Array

- Affymetrix GeneChip® Human Mapping 500K Array

- Affymetrix GeneChip® Genome Wide SNP 6.0 Array

GTC also supports genotype analysis of the Genome Wide SNP 5.0 Array; however, as of the release of Genotyping Console 2.0, the software does not support copy number analysis of the SNP 5.0 Array.

Generating CNCHP Files for the Mapping 100K and Mapping 500K Arrays

Setting Analysis Configurations

To perform copy number analysis of Mapping 100K or Mapping 500K data for analysis in Golden Helix SVS, first set the analysis parameters using the menu Edit > Copy Number / LOH Configurations > New Configuration....

After selecting the array type, a dialog will open where you can set copy number parameters that affect the smoothing performed during the copy number analysis. The Genomic Smoothing feature uses Gaussian smoothing to increase or decrease the significance of small aberrations in the data. In order for the resulting CNCHP files to be compatible with Golden Helix SVS, the Genomic Smoothing must be set to 0 Mb.

Starting the Analysis

Start the copy number analysis by using Workspace > Intensity Data > Perform Copy Number / LOH Analysis....

In the first window, select the sample type and the analysis type. Golden Helix SVS does not currently support paired sample analysis or LOH analysis. Select the Unpaired Sample Analysis sample type and the Copy Number (CN) analysis type.

In the next window, select from the first drop down list the analysis configuration where the Genomic Smoothing was set to 0 Mb. By clicking the Advanced > button, you can verify the settings. Verify the output path and the batch name for the analysis.

In the following window, select the CEL files that will be used for samples and references, select the enzyme’s shared attribute, and begin the analysis.

When the analysis is complete, a *.CN4.cnchp file will be created for each sample. Those files will be located in the directory given by the selected output path and batch name. The *.CN4.cnchp files are ready to be imported into a Golden Helix SVS project using the CNCHP file parsing tool.

Generating CNCHP Files for the Genome Wide SNP 6.0 Array

Setting Analysis Configurations

To perform copy number analysis of Genome Wide SNP 6.0 data for analysis in Golden Helix SVS, first set the analysis parameters using the menu Edit > Copy Number / LOH Configurations > New Configuration....

After selecting the array type, a dialog will open where you can set parameters that affect the copy number analysis. Click on Advanced to view the advanced settings. For copy number analysis in SVS, the data should not have any smoothing applied. In the SmoothSignal Graph Output, set both the Smoothing Gaussian Window and the Smoothing Sigma Multiplier to 0. Under Smoothing Parameters, select Skip any smoothing.

Starting the Analysis

Genotyping Console uses a *.ppw reference model file for copy number analysis of SNP 6.0 array data. GTC allows you to use a reference model file provided by Affymetrix. You may also select to create your own reference model file with samples selected from your Genotyping Console Workspace. To start the copy number analysis using your own reference samples, select Workspace > Intensity Data > Create Copy Number / LOH Reference Model File and Perform Analysis.... To start the analysis using the reference model provided or a reference model you previously created, select Workspace > Intensity Data > Perform Copy Number / LOH Analysis....

In the first window, select from the first drop down list the analysis configuration where the smoothing was inactivated. By clicking the Advanced > button, you can verify the settings. Verify the output path and the batch name for the analysis.

If you selected to create a reference model, the following window will prompt for the samples to be used in the reference model. Select the reference samples and a save location for the reference model. Copy number results will be generated for each of the samples in the reference set. If you selected not to create a reference model, the window will prompt for the samples to be analyzed. You must also select the reference model file you wish to use for this analysis.

When the analysis is complete, a *.CN5.cnchp file will be created for each sample. Those files will be located in the directory given by the selected output path and batch name. The *.CN5.cnchp files are ready to be imported into a Golden Helix SVS project using the CNCHP file parsing tool.

Affymetrix CNT File Format

The Affymetrix CNT file format is a tab separated ASCII file format, in which each file represents data for one sample. Within each CNT file, the data is arranged such that each row contains all data for a given marker. Each row must contain the marker name, chromosome, and position and may contain any other information which should be associated with that marker. Currently, SVS requires a normalized copy number intensity column named Log2Ratio in addition to the name, chromosome and position columns that will be present in all CNT files.

If you have copy number data that you would like to import into Golden Helix SVS, you can use the CNT format to do so.

The CNT format consists of a header section, a column name section, and a data section. The beginning of each of these sections is marked with a specific token, and all sections are required. Each section is briefly described below.

Header Section

The header section of the file can contain meta-data for the file and need not be in tab-delimited form. Any meta information about the file can be listed here; however, a value must be specified for the variable ChipType1, e.g., ChipType1-MappingK_Hind240. This value must appear on its own line in the header section of each file and must match across all CNT files that you will be importing together. It is used to ensure that the CNT files you import into Golden Helix SVS are of the same type. If you are converting your data into this format for use with Golden Helix SVS, you can set ChipType1 to your own ASCII string, as long as the value is consistent across all the files that will be imported together. The start of the header section is indicated by the ‘[Header]’ section token, and the header continues until the column names section.
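Before importing a batch of files, it can be worth confirming that they all carry the same ChipType1 line. The sketch below is illustrative only: the function names are hypothetical, each file is assumed to have been read into a list of lines, and the whole ChipType1 line is compared so that no assumption is needed about how the name and value are separated.

```python
def chip_type_line(lines):
    """Return the header line that sets ChipType1, or None if there is none.

    Only lines between the [Header] token and the [ColumnName] token are examined.
    """
    in_header = False
    for raw in lines:
        line = raw.strip()
        if line == "[Header]":
            in_header = True
        elif line == "[ColumnName]":
            break
        elif in_header and line.startswith("ChipType1"):
            return line
    return None


def same_chip_type(files):
    """True if every file (a list of lines) carries an identical ChipType1 line."""
    values = {chip_type_line(lines) for lines in files}
    return len(values) == 1 and None not in values
```

A False result means at least one file is missing the ChipType1 line or carries a different value from the others.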

Column Names Section

The column names section contains a tab-separated list of the names of the columns contained in the file. This list should be in the same order as the data columns themselves, and each file must contain the following columns:

- ProbeSet: This will become the column name in a dataset. This column must be listed first in all files.

- Chromosome: Chromosome associated with the marker.

- Position: Position of the marker in the genetic marker map.

- Log2Ratio: This represents the normalized copy number intensities and is required for use with copy number analysis in Golden Helix SVS.

The beginning of the column names section is marked by the ‘[ColumnName]’ section token and continues until the data section.

Data Section

The data section contains the actual data for each marker. The data should appear as a tab separated list of values where each line represents the values for one marker. The order of the values must match the order of the columns listed in the column names section. Missing values are indicated by empty strings, i.e., two consecutive tab characters, or a tab followed by the end of the line if the missing value is in the last column.

Note

The markers listed in the data section must be in marker map order, and the markers must appear in the same order across all input files. The start of the data section is indicated by the ‘[Data]’ section token and continues until the end of the file.
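If you are converting your own copy number data, the three sections described above can be produced with a few lines of Python. This is a minimal sketch, not a Golden Helix tool: the function name format_cnt is hypothetical, the ChipType1 line is passed in verbatim by the caller, and None stands for a missing value.

```python
REQUIRED_COLUMNS = ["ProbeSet", "Chromosome", "Position", "Log2Ratio"]


def format_cnt(chip_type_line, columns, rows):
    """Render CNT text: header, column names, then one tab-separated row per marker.

    `rows` holds one sequence per marker, already in marker map order;
    None is written as an empty field.
    """
    if columns[0] != "ProbeSet" or any(c not in columns for c in REQUIRED_COLUMNS):
        raise ValueError("columns must start with ProbeSet and include "
                         + ", ".join(REQUIRED_COLUMNS))
    lines = ["[Header]", chip_type_line, "[ColumnName]", "\t".join(columns), "[Data]"]
    for row in rows:
        if len(row) != len(columns):
            raise ValueError("row length does not match column count")
        lines.append("\t".join("" if v is None else str(v) for v in row))
    return "\n".join(lines) + "\n"
```

The function does not sort the markers; the caller is expected to supply rows in the same marker map order for every file.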

Example File

An example file might look like the following:

[Header]
ChipType1-MyType
[ColumnName]
ProbeSet Chromosome Position Log2Ratio
[Data]
Marker1 1 2224111 0.054294
Marker2 1 3084986 0.051188
Marker3 2 53452 0.288990
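To check a file you have written by hand, it can help to read it back. The following sketch parses CNT text using only the section tokens and tab separation described above; parse_cnt is a hypothetical helper, not the SVS importer, and all values are kept as strings.

```python
def parse_cnt(text):
    """Parse CNT text into (header_lines, column_names, rows).

    Each row is a dict keyed by column name; empty fields become None.
    """
    header, columns, rows = [], [], []
    section = None
    for line in text.splitlines():
        token = line.strip()
        if token in ("[Header]", "[ColumnName]", "[Data]"):
            section = token
        elif section == "[Header]":
            header.append(line)
        elif section == "[ColumnName]" and token:
            columns = line.split("\t")
        elif section == "[Data]" and line:
            values = line.split("\t")
            rows.append({c: (v if v != "" else None)
                         for c, v in zip(columns, values)})
    return header, columns, rows
```

Comparing the ProbeSet values from two parsed files is one way to confirm that the markers appear in the same order across inputs.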

Exporting Data from GenomeStudio

Exporting Data From GenomeStudio Overview

There are a few ways to export data from the Illumina® GenomeStudio Data Analysis Software application for analysis in SVS.

Note

This manual will only refer to GenomeStudio, but the same directions apply for BeadStudio v. 3.0 or higher.

Genotype data can be exported using GenomeStudio’s Final Report feature. Alternatively, using the Custom Report feature and Golden Helix’s custom plug-in, DSF files can be created simultaneously for genotype data, log ratio data, B-allele frequency data, computed CNV values, X/Y value pairs, and X/Y raw value pairs. In both cases, certain options must be set to ensure that the data will import properly into Golden Helix SVS. This chapter outlines those requirements.

Exporting Genotype Data using the Final Report

This section guides you through exporting your genotype data from Illumina® GenomeStudio into the Final Report text format and then importing that data into SVS using the Illumina BeadStudio Final Report by SNP script.

Exporting the Data from GenomeStudio

Begin by opening a GenomeStudio Project. In the GenomeStudio window select from the menu bar Analysis > Reports > Report Wizard.

In the Report Wizard, select Final Report and click Next.

Select the sample groups to be included in the report and click Next. Indicate if zeroed SNPs should be included in the report or not, and click Next. The next screen will ask you how you would like to format your final report. The options to select are:

- Standard output format

- Displayed Fields in this order:

  - SNP Name

  - Sample ID

  - Allele1 - Top (This can be replaced by any of the other Allele1 options.)

  - Allele2 - Top (This can be replaced by any of the other Allele2 options, as long as it is the same type as the Allele1 selection.)

  - GC Score

- Group by SNP

- Tab or Comma delimited

- Create map files

When you are finished selecting the proper parameters, click Next.

Select the output path and file name, and click Finish.

Importing Data into Golden Helix SVS

To import the Final Report text file into Golden Helix SVS, first open a project in SVS. From the Project Navigator Window select Import > Import Scripts > Illumina BeadStudio Final Report by SNP.

Select the Final Report file that you exported from GenomeStudio and click Open. A dialog box will appear asking for the file type, and whether you want to use a GC Score. Using the GC Score will convert genotype calls into missing values if they do not meet a user-specified threshold. Choose your parameters and click OK.

If you chose to use a GC Score threshold, a second dialog box will appear asking you to enter that threshold. Input your GC Score threshold and click OK.
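The effect of the threshold can be pictured with a short sketch. This is not the import script itself: the function name apply_gc_threshold is hypothetical, missing calls are represented as None, and treating a score exactly equal to the threshold as passing is an assumption.

```python
def apply_gc_threshold(records, threshold):
    """Replace genotype calls whose GC Score is below `threshold` with None.

    `records` is an iterable of (allele1, allele2, gc_score) tuples.
    """
    result = []
    for allele1, allele2, gc_score in records:
        if gc_score < threshold:
            result.append(None)  # call does not meet the threshold
        else:
            result.append((allele1, allele2))
    return result
```

For example, with a threshold of 0.25, a call scored 0.1 becomes missing while calls scored 0.25 or 0.9 are kept.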

Your data will then be imported into the current project. Once the import is complete, a message will appear with some import statistics recorded in the Annotation Window within the Project Navigator Window.

Exporting Data using the Golden Helix SVS DSF Export 4.0 Plug-In

This section explains how to download and install the Golden Helix SVS DSF Export 4.0 Plug-In for use with Illumina® BeadStudio or GenomeStudio. You will also be guided through the export of your data from GenomeStudio into Golden Helix’s proprietary sparse data storage format (DSF) files. Note that these instructions apply only to BeadStudio version 3.0 or greater and to GenomeStudio. The directions refer only to GenomeStudio, but the same steps also apply to BeadStudio. A brief discussion of how to import the data into Golden Helix SVS is also provided.

Installation of the Golden Helix Plug-In

Before beginning the installation, make sure there are no instances of GenomeStudio running.

To install the Golden Helix SVS DSF Plug-In into Illumina® GenomeStudio, first download the appropriate plug-in installer from http://www.goldenhelix.com/Support/illumina_support.html.

When the download has completed, open the installer and click Next to continue the installation. The second window will ask for a directory to extract the plug-in to; by default, the location is the standard GenomeStudio installation directory, for example: c:/Program Files/Illumina/GenomeStudio.

Click Next and the installation will be finished.

Exporting DSF Data from GenomeStudio using Plugin version 4.0

The GenomeStudio Report Wizard

To export a Golden Helix SVS DSF, a GenomeStudio project must be open. From GenomeStudio, open the report wizard by selecting Analysis > Reports > Report Wizard.

From the Report Wizard, select Custom Report and, from the drop down menu, select Golden Helix SVS DSF Export 4.0 from Golden Helix. Click Next.

If you have excluded samples inside the GenomeStudio project you will be asked how to handle these excluded samples. The options are to include all of the samples from the report or to only include the selected samples.

Select the sample groups that you wish to include in your report and click Next.

The following window may appear if you have applied filters to your data. Select whether or not you want to include the hidden SNPs in your report, and click Next to continue through the wizard.

Select the Output Path and a Report Name. These parameters will determine the final base file name for all the DSF files created and directory location. Click Finish.

The Golden Helix SVS DSF Export Window

After going through the Report Wizard, a progress bar will appear. After a brief moment, the Golden Helix SVS DSF Export window will open. In this window you must select the output options that determine what type of data you wish to export in DSF format.

The chromosomes to include or exclude from the DSF files can be selected by clicking on Pick Chromosomes. Only chromosomes initially included in the report will be available in the Chromosome Picker.

Information about the selected samples is displayed in the Selected Samples box. If the information in this box does not appear to be correct, cancel the export and adjust the selected samples in the GenomeStudio project. A DSF is automatically created for the samples to export the gender information.

There are six optional DSF output options to choose from:

- SNPs for Genotype Analysis

- Log R Ratios for Copy Number Analysis Module

- B-Allele Frequency

- Computed CNV Values

- X/Y Value Pairs for QC

- X/Y Raw Value Pairs for QC

Any one or all of these DSF files can be exported. These optional files will export data in the optimal export format with samples as columns and markers as rows.

The Output File Name Base is displayed at the bottom of the window. The file name is set by the output directory chosen in the GenomeStudio Report Wizard and cannot be changed in this window.

Click OK in the Golden Helix DSF Export window to begin the export.

Importing the DSF files into SVS

To import the DSF files into SVS, go to Import > Illumina DSF.

The data will be imported in the format where samples are columns and markers are rows for all files, except the Samples DSF file. The sample gender file has samples in rows and one column of gender information. Some quality control measures can be performed on log ratio data in this orientation, but most analyses require that markers be in columns and samples be in rows. This requires transposing the data. See Transposing Spreadsheets for help transposing the data.
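Conceptually, transposing swaps the roles of rows and columns. The short sketch below shows the operation on a small in-memory table; it is an illustration only, the function name transpose is hypothetical, and it is not a replacement for the SVS feature referenced above.

```python
def transpose(table):
    """Swap rows and columns of a rectangular table given as a list of lists."""
    if any(len(row) != len(table[0]) for row in table):
        raise ValueError("all rows must have the same length")
    return [list(column) for column in zip(*table)]
```

Applied to a table with markers in rows and samples in columns, the result has samples in rows and markers in columns, which is the orientation most analyses expect.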


Platform Notes

This section will contain notable information regarding the use of SVS on specific platforms.

On some platforms, the behavior of Golden Helix SVS could vary slightly in specific situations. Also, on some platforms, certain system settings can be used to improve the performance of the program.

Microsoft Windows

Memory Usage

Allow up to 3GB of memory usage under 32-bit Windows

By default, 32-bit Windows versions allow applications to run in a 2GB memory space. Applications that attempt to use more than this 2GB limit will crash. When working with very large datasets, it may no longer be possible to fit the required data into the default memory space supplied by Windows. Using the /3GB switch in boot.ini allows certain applications to access up to 3GB of virtual address space, leaving 1GB for the Windows kernel.

To allow SVS to use more than 2GB of memory, you can edit your system’s boot.ini file.

To open the boot.ini file:

  1. Click Start, click Run
  2. Enter sysdm.cpl, and click OK

On the resulting window, select the Advanced tab, and click Settings under Startup and Recovery. Next, click the Edit button in the System startup group.

Now the boot.ini file should be open in a Notepad editor. Before you edit the file, it is recommended that you create a backup of the original. To do this: select File > Save As..., and choose a location to create the backup file. Close the Notepad editor, and click the Edit button again. Now you should be ready to edit the file.

To allow Golden Helix SVS to use up to 3GB of memory, add the entry /3GB to the end of the line under the [operating systems] section that corresponds to your current configuration. Save the file to keep this option. A reboot is required before the change takes effect.
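The edit itself is a one-line change, which the following sketch illustrates on the text of a boot.ini file. It works on a string only and does not read or write any system file. The function name add_3gb_switch is hypothetical, and taking "the line corresponding to your current configuration" to mean the entry that matches the default= path in the [boot loader] section is an assumption.

```python
def add_3gb_switch(boot_ini_text):
    """Return boot.ini text with /3GB appended to the default OS entry.

    Only the line under [operating systems] that matches the default= path
    is changed; an entry that already contains /3GB is left alone.
    """
    lines = boot_ini_text.splitlines()
    default = None
    for line in lines:
        if line.lower().startswith("default="):
            default = line.split("=", 1)[1].strip().lower()
    section = None
    updated = []
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            section = stripped.lower()
        elif (section == "[operating systems]" and default
              and stripped.lower().startswith(default + "=")
              and "/3gb" not in stripped.lower()):
            line = line.rstrip() + " /3GB"
        updated.append(line)
    return "\n".join(updated) + "\n"
```

Running the function on a copy of the file's text first, and comparing the result with the backup, is a cautious way to preview the change before editing the real file in Notepad.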

Memory Availability Under 64-bit Windows

Windows XP Professional x64 Edition (for PCs with x86-64 processors) can run SVS efficiently in its 32-bit emulation mode and allows the application to address up to 4 GB of virtual memory if available. Windows XP 64-bit Edition (built for Intel’s IA-64 Itanium processors) was discontinued in 2005 and is not supported by Golden Helix, Inc.

Hardware Acceleration

As of version 7.4, SVS supports hardware accelerated analysis through OpenCL. This allows SVS to use highly parallel devices such as Graphics Processing Units (GPUs) to speed up certain tasks. Whereas traditional CPUs use several very fast cores, GPUs use hundreds of moderately fast cores to achieve their high performance. Many analysis tasks (such as CNAM Optimal Segmentation) can be efficiently divided across many cores. We have found that GPUs are able to perform copy number segmentation anywhere from 5 to 20 times faster than traditional CPUs.

Using Hardware Acceleration

In order to take advantage of hardware acceleration, you will need:

  1. An OpenCL-compatible graphics card. All recently released GPUs from NVidia and AMD/ATI support OpenCL.
  2. Up-to-date video drivers. OpenCL is a very new technology, so it is only included in the most recent driver releases. For NVidia cards, go to http://www.nvidia.com/ for drivers. For AMD/ATI cards, go to http://ati.amd.com/. Some AMD/ATI devices may also require the ATI Stream SDK: http://developer.amd.com/gpu/atistreamsdk/

Note

For Windows Remote Desktop users: most GPUs cannot be used when running SVS via Remote Desktop. This is because Remote Desktop sessions use a special video driver that is incompatible with OpenCL. Hopefully a workaround will be available in the future.