CNAM employs an optimal segmenting algorithm that uses dynamic programming to detect CNV segment boundaries. Two methods are available in SVS: univariate and multivariate. The univariate method, which considers only one sample at a time, is ideal for detecting rare and/or large CNVs. The multivariate method, which considers all samples simultaneously, is ideal for detecting small, common CNVs. This tutorial leads you through univariate segmentation.
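To make the idea of optimal segmentation by dynamic programming concrete, here is a minimal sketch for a single sample. It is not CNAM's implementation: the cost function (squared error plus a fixed penalty per segment), the function name segment, and the penalty value are all assumptions chosen for illustration.

```python
def segment(values, penalty):
    """Return the start indices of the segments that minimize
    (sum of squared errors around each segment mean) + penalty * (number of segments).

    Illustrative only: this objective is an assumption, not CNAM's actual criterion.
    """
    n = len(values)
    # Prefix sums let us compute the squared error of any interval in O(1).
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, v in enumerate(values):
        s1[i + 1] = s1[i] + v
        s2[i + 1] = s2[i] + v * v

    def sse(i, j):
        # Squared error of values[i:j] around its own mean.
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    # best[j] is the minimal cost of segmenting values[:j];
    # prev[j] is the start index of the last segment in that solution.
    best = [0.0] + [float("inf")] * n
    prev = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            cost = best[i] + sse(i, j) + penalty
            if cost < best[j]:
                best[j] = cost
                prev[j] = i

    # Walk the back-pointers from the end to recover the segment starts.
    starts = []
    j = n
    while j > 0:
        starts.append(prev[j])
        j = prev[j]
    return starts[::-1]


# Hypothetical log ratios: neutral, a four-marker loss, then neutral again.
log_ratios = [0.0] * 5 + [-0.5] * 4 + [0.0] * 6
print(segment(log_ratios, penalty=0.1))
```

Because every possible last segment is examined for every prefix, the result is the exact optimum for the stated cost, at the price of quadratic running time in the number of markers. That trade-off is the same one the tutorial describes: high-quality results at the expense of computation time.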
IMPORTANT
The segmentation algorithm implemented in CNAM is an optimal algorithm. It produces high-quality results, but at the expense of computation time. As of SVS v7.4, CNAM can take advantage of video graphics cards, or graphical processing units (GPUs), in addition to your computer’s CPUs. This can dramatically decrease computation time while producing the same high-quality results. Depending on computer configuration, the speed increase can be 20 times or more when utilizing the GPU for segmentation. (For more information see Video Graphics and Genomics: A Real Game Changer?)
With that said, segmentation on a whole genome dataset still takes some time. Therefore, rather than have you perform the segmentation yourself, we’ve provided the finished segmentation results in the project, outlining the steps below by which we achieved the results.
Note
The options in Figure 2-1 reflect the fact that the computer used in developing this tutorial has an OpenCL-compatible video graphics card (Nvidia GeForce GTS 240) and is a quad-core (4 threads) machine. When you run this on your own you will need to set these parameters based on your own hardware. For a more thorough description of each parameter in this window, click on the Help button in the lower left corner.
When the segmentation finishes, three spreadsheets and a run log are created. The Segmentation Covariates Every Column spreadsheet (Figure 2-2) contains the mean LR value of a given segment, with this value repeated for every marker in the segment. This spreadsheet is useful for specific plots of the segmentation covariates. The Segmentation Covariates First Column spreadsheet contains the same information as the Segmentation Covariates Every Column spreadsheet, but it includes only the columns that correspond to the first marker of a segment identified in any of the subjects in the data. This spreadsheet is better for association testing, as it is smaller and more accurately reflects the true multiple-testing burden of non-redundant data. The Segment List – Sheet 1 spreadsheet (Figure 2-3) contains more detailed information about each segment for each subject in a list format.
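The relationship between the Every Column and First Column spreadsheets can be sketched as follows. This is an illustration only, not how SVS builds the spreadsheets: the function name first_column_indices and the sample values are hypothetical, and the sketch assumes a new segment can be recognized by a change in the per-marker segment mean (it ignores chromosome boundaries and adjacent segments that happen to share the same mean).

```python
def first_column_indices(every_column):
    """Given one row of per-marker segment means per sample, return the
    marker (column) indices at which a new segment starts in ANY sample."""
    n_markers = len(every_column[0])
    keep = [0]  # the first marker always starts a segment
    for m in range(1, n_markers):
        # Assumption: a change in segment mean marks a segment boundary.
        if any(row[m] != row[m - 1] for row in every_column):
            keep.append(m)
    return keep


# Hypothetical "every column" data: two samples, five markers.
every_column = [
    [0.0, 0.0, 0.3, 0.3, 0.0],    # sample A: segments start at markers 0, 2, 4
    [0.1, 0.1, 0.1, -0.4, -0.4],  # sample B: segments start at markers 0, 3
]
keep = first_column_indices(every_column)
first_column = [[row[m] for m in keep] for row in every_column]
print(keep)
print(first_column)
```

Only four of the five columns survive in this toy example. On real data the reduction is much larger, which is why the First Column spreadsheet carries a lighter multiple-testing burden.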
After segmentation is complete, it is often useful to examine the total number of segments for each subject in the data. An unusually large number of segments is often indicative of data quality problems such as wave effects that were not detected earlier by the log ratios themselves.
The plot should look like Figure 2-4 (with a bin count of 64).
There appear to be outliers on the right side of the distribution. Sorting the Segment Counts spreadsheet will let us know which samples these are.
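Outside SVS, the same check (count the segments per sample, then sort to find the largest counts) could be sketched as below. The row layout of the segment list and the sample names are hypothetical; the actual Segment List spreadsheet has its own columns.

```python
from collections import Counter

# Hypothetical segment list rows: (sample, chromosome, start, stop, mean LR).
segment_list = [
    ("S1", "1", 100, 900, 0.02),
    ("S1", "1", 901, 1500, -0.41),
    ("S2", "1", 100, 1500, 0.01),
    ("S3", "1", 100, 400, 0.03),
    ("S3", "1", 401, 700, 0.35),
    ("S3", "1", 701, 1500, -0.02),
]


def segment_counts(rows):
    """Count segments per sample, sorted by descending count (ties by name)."""
    counts = Counter(row[0] for row in rows)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


ranked = segment_counts(segment_list)
print(ranked)
```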
This produces a second segment counts spreadsheet, Segment Counts - Sheet 2. The sample with the highest segment count is S59. Let’s take a look at this sample’s LRs and Segment results to get an idea of what is going on. To plot the sample data, we must first transpose the PCA-corrected LR spreadsheet as well as the Segmentation output, as SVS requires data to be oriented in columns in order to be plotted in the genome browser view.
Now we start by plotting the corrected LR values.
The plot should look like Figure 2-5.
Now let’s plot the segmentation results on top of the log ratios to see why there were so many segments found.
This creates a second S59 item under Graph 1. We need to move this item to the top to see it in the graph.
Your plot should now look like Figure 2-6.
Zoom in for a closer look.
We can already see a little bit of a wave effect. Applying a small median smooth to the log ratios will make this even more apparent.
Even with a median smooth of 2, the wave effect is much more apparent (Figure 2-7). In the previous Quality Assurance tutorial, sample S59 had an absolute wave factor that was just below the IQR cutoff. This suggests that a more stringent outlier threshold for wave factor may have been appropriate for this dataset.
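For readers unfamiliar with median smoothing, the following sketch shows a running median over a sliding window. How SVS interprets a smooth of 2 is not stated here, so treating the parameter as a half-width (the number of neighbors on each side) is an assumption, as is the function name median_smooth.

```python
from statistics import median


def median_smooth(values, half_width):
    """Replace each value with the median of its neighborhood.

    Assumption: the window spans half_width markers on each side and is
    truncated at the ends of the sequence.
    """
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half_width)
        hi = min(len(values), i + half_width + 1)
        smoothed.append(median(values[lo:hi]))
    return smoothed


# Hypothetical log ratios with two isolated spikes.
noisy = [0.0, 1.0, 0.0, 0.0, 5.0, 0.0]
print(median_smooth(noisy, half_width=1))
```

A median filter suppresses isolated spikes while leaving slower trends largely intact, which is why a gradual wave pattern stands out more clearly after smoothing.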
For this tutorial, we will use IQR on the Segment Counts to determine which samples warrant exclusion.
In this case the Upper Outlier Threshold is 228.5. As we can see in the Segment Counts - Sheet 2 spreadsheet, there are 17 samples that have segment counts above this threshold (Figure 2-8).
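As a rough illustration of how an IQR-based upper outlier threshold can be computed, the sketch below uses the common rule Q3 + 1.5 × IQR. The multiplier of 1.5 and the quartile interpolation method are assumptions, and SVS may compute the threshold differently. The segment counts are made up and do not reproduce the tutorial's value of 228.5.

```python
from statistics import quantiles


def upper_outlier_threshold(counts, multiplier=1.5):
    """Return Q3 + multiplier * IQR (assumed rule; multiplier is hypothetical)."""
    q1, _, q3 = quantiles(counts, n=4, method="inclusive")
    return q3 + multiplier * (q3 - q1)


# Hypothetical segment counts per sample.
segment_counts_by_sample = {
    "S1": 100, "S2": 110, "S3": 120, "S4": 130,
    "S5": 140, "S6": 150, "S7": 160, "S8": 400,
}
threshold = upper_outlier_threshold(list(segment_counts_by_sample.values()))
flagged = sorted(s for s, c in segment_counts_by_sample.items() if c > threshold)
print(threshold, flagged)
```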
Let’s exclude these samples from our Phenotype spreadsheet.
At this point you might continue on to run association tests based on the LR segment covariates. However, it is sometimes useful to discretize the segment means as two state (0,1) or three state (-1,0,1) covariates based on defined thresholds. Discretizing the covariates has the following benefits:
To do this, we first need to determine the appropriate number of copy number classes by examining the segment mean histogram and then discretize the covariates based on the determined thresholds.
The segment means can be accessed from the Segment List spreadsheet.
The histogram in Figure 2-9 appears (with a bin count of 128).
At first glance we can see the distribution is centered around zero as we’d expect, and we can see some additional peaks to the left and right. If you zoom in on the y-axis you can see the peaks more clearly.
You can now begin to see the peaks more clearly (Figure 2-10).
In this case it actually looks like there are four copy number classes. For the purposes of this tutorial we will keep it simple and look for thresholds that distinguish a copy number loss, neutral, or gain. At first glance these appear to be around -0.15 and 0.15 (you can zoom in on the X-axis to see this more clearly).
Note
Now that appropriate thresholds have been determined, you can use these values to discretize the copy number segment covariates (continuous LR values) as three-state (-1,0,1) covariates. We’ll discretize the Segmentation Covariates First Column spreadsheet as we’ll be using this for association testing.
You will be prompted to choose a Three State Model or either of two Two State Models.
This will create two new spreadsheets: Three State Covariates, with -1 for potential losses, 0 for potential neutrals, and 1 for potential gains; and Copy Number State Counts - Mapped Sheet 1, with various statistics about each marker in the segment. We will be using the first spreadsheet as covariates for association testing.
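The three-state discretization itself is simple to express. The sketch below applies the thresholds chosen above (-0.15 and 0.15) to a few hypothetical segment means. Whether values exactly at a threshold count as neutral is an assumption, and the function name discretize is illustrative rather than part of SVS.

```python
def discretize(segment_mean, loss_below=-0.15, gain_above=0.15):
    """Map a continuous segment mean (LR) to -1 (loss), 0 (neutral), or 1 (gain).

    Assumption: values exactly equal to a threshold are treated as neutral.
    """
    if segment_mean < loss_below:
        return -1
    if segment_mean > gain_above:
        return 1
    return 0


# Hypothetical segment means for one sample.
segment_means = [-0.40, -0.10, 0.00, 0.20]
states = [discretize(v) for v in segment_means]
print(states)
```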