2. CNAM Optimal Segmenting

CNAM employs an optimal segmenting algorithm which uses dynamic programming to detect CNV segment boundaries. There are two methods available in SVS: univariate and multivariate. The univariate method, which considers only one sample at a time, is ideal for detecting rare and/or large CNVs. The multivariate method, which considers all samples simultaneously, is ideal for detecting small, common CNVs. This tutorial leads you through univariate segmentation.
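
To give a feel for what "optimal segmenting by dynamic programming" means, here is a minimal Python sketch. It is not CNAM's implementation: the squared-error cost, the fixed number of segments, and the names `segment` and `log_ratios` are all assumptions made for illustration. It finds the split of one sample's log ratios into a given number of contiguous segments that minimizes the total squared deviation from each segment's mean.

```python
from itertools import accumulate


def segment(values, n_segments):
    """Split `values` into exactly `n_segments` contiguous segments that
    minimize the total squared deviation from each segment's mean.
    Returns a list of (start, end) index pairs, end exclusive."""
    n = len(values)
    if not 1 <= n_segments <= n:
        raise ValueError("n_segments must be between 1 and len(values)")

    # Prefix sums give the squared-error cost of any segment in O(1).
    s1 = [0.0] + list(accumulate(values))
    s2 = [0.0] + list(accumulate(v * v for v in values))

    def cost(i, j):
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # best[k][j]: lowest cost of covering values[:j] with exactly k segments.
    best = [[inf] * (n + 1) for _ in range(n_segments + 1)]
    back = [[0] * (n + 1) for _ in range(n_segments + 1)]
    best[0][0] = 0.0
    for k in range(1, n_segments + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                candidate = best[k - 1][i] + cost(i, j)
                if candidate < best[k][j]:
                    best[k][j] = candidate
                    back[k][j] = i

    # Walk the back-pointers from the end to recover the boundaries.
    bounds, j = [], n
    for k in range(n_segments, 0, -1):
        i = back[k][j]
        bounds.append((i, j))
        j = i
    return bounds[::-1]


# Hypothetical log ratios for one sample: neutral, a gain, then neutral again.
log_ratios = [0.0, 0.1, -0.1, 0.0, 0.5, 0.6, 0.4, 0.5, 0.0, 0.1]
print(segment(log_ratios, 3))
```

Because every candidate boundary is evaluated, the result is the best split under the chosen cost rather than a greedy approximation. The work grows with the square of the number of markers, which illustrates why an exhaustive search of this kind is computationally expensive on dense arrays.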

A. Performing CNAM Optimal Segmenting

IMPORTANT

The segmentation algorithm implemented in CNAM is an optimal algorithm that produces high-quality results at the expense of computation time. As of SVS v7.4, CNAM can take advantage of video graphics cards, or graphical processing units (GPUs), in addition to your computer’s CPUs. This can dramatically decrease computation time while producing the same high-quality results. Depending on computer configuration, the speed increase can be 20 times or more when utilizing the GPU for segmentation. (For more information see Video Graphics and Genomics: A Real Game Changer?)

With that said, segmentation on a whole genome dataset still takes some time. Therefore, rather than have you perform the segmentation yourself, we’ve provided the finished segmentation results in the project, outlining the steps below by which we achieved the results.

  • Open PCA-Corrected Data (Center by Marker) and choose Analysis >CNAM Optimal Segmenting.
  • Set the parameters as they are in Figure 2-1.

Figure 2-1. CNAM Optimal Segmenting window

Note

The options in Figure 2-1 reflect the fact that the computer used in developing this tutorial has an OpenCL-compatible video graphics card (Nvidia GeForce GTS 240) and is a quad-core (4 threads) machine. When you run this on your own you will need to set these parameters based on your own hardware. For a more thorough description of each parameter in this window, click on the Help button in the lower left corner.

  • Click Run to begin the segmentation process. Again, you do not need to actually complete the segmentation for this tutorial as the results are already provided in the project.

When the segmentation finishes, three spreadsheets and a run log are created. The Segmentation Covariates Every Column spreadsheet (Figure 2-2) contains the mean LR value of each segment, with that value repeated for every marker in the segment. This spreadsheet is useful for specific plots of the segmentation covariates. The Segmentation Covariates First Column spreadsheet contains the same information as the Segmentation Covariates Every Column spreadsheet, but it only includes columns that correspond to the first column of a segment identified in any of the subjects in the data. This spreadsheet is better for association testing, as it is smaller and more accurately reflects the true multiple-testing burden of non-redundant data. The Segment List – Sheet 1 spreadsheet (Figure 2-3) contains more detailed information about each segment for each subject in a list format.
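
The relationship between the Every Column and First Column spreadsheets can be sketched in a few lines of Python. This is an illustration only: the function name `first_columns` and the toy data are hypothetical, and it assumes a new segment can be recognized by a change in the segment mean from one marker to the next, ignoring chromosome boundaries.

```python
def first_columns(every_column):
    """every_column: one row per sample, each row holding the segment mean
    repeated at every marker. Returns the indices of the marker columns at
    which at least one sample starts a new segment."""
    n_markers = len(every_column[0])
    keep = [0]  # the first marker always starts a segment
    for col in range(1, n_markers):
        if any(row[col] != row[col - 1] for row in every_column):
            keep.append(col)
    return keep


# Two hypothetical samples measured at six markers.
every_column = [
    [0.0, 0.0, 0.0, 0.4, 0.4, 0.0],
    [0.1, 0.1, -0.3, -0.3, -0.3, -0.3],
]
kept = first_columns(every_column)
first_column = [[row[c] for c in kept] for row in every_column]
print(kept)
print(first_column)
```

Only the columns where some sample changes segment are retained, so the reduced table carries the same segment means with far fewer, non-redundant columns.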

Figure 2-2. Segmentation Covariates Every Column spreadsheet

Figure 2-3. Segment List Spreadsheet

B. Filtering on Autosomal Segment Count

After segmentation is complete, it is often useful to examine the total number of segments for each subject in the data. An unusually large number of segments is often indicative of data quality problems such as wave effects that were not detected earlier by the log ratios themselves.

  • Open the Segment List spreadsheet. Choose Analysis >CNAM Output Analysis >Count Number of Segments Per Sample.
  • From the resulting Segment Counts spreadsheet plot the Segment Count column by right-clicking on the column label header and choosing Plot Histogram.
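
Conceptually, the per-sample segment count is a simple tally over the segment list. The sketch below assumes a hypothetical record layout of (sample, chromosome, segment mean) and made-up values; the actual Segment List spreadsheet contains more columns. Since this section concerns autosomal segments, segments on other chromosomes are skipped.

```python
from collections import Counter

AUTOSOMES = {str(c) for c in range(1, 23)}

# Hypothetical segment list rows: (sample, chromosome, segment mean).
segment_list = [
    ("S12", "1", 0.02),
    ("S12", "X", -0.40),
    ("S59", "1", 0.01),
    ("S59", "1", 0.21),
    ("S59", "12", -0.18),
    ("S30", "2", 0.00),
    ("S30", "7", 0.33),
]

# Tally autosomal segments per sample, then rank from most to fewest.
segment_counts = Counter(
    sample for sample, chrom, _mean in segment_list if chrom in AUTOSOMES
)
ranked = segment_counts.most_common()
print(ranked)
```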

The plot should look like Figure 2-4 (with a bin count of 64).

Figure 2-4. Segment Counts Histogram

There appear to be outliers on the right side of the distribution. Sorting the Segment Counts spreadsheet will let us know which samples these are.

  • From the Segment Counts spreadsheet right-click the Segment Count column header and choose Sort Descending.

This produces a second segment counts spreadsheet, Segment Counts - Sheet 2. The sample with the highest segment count is S59. Let’s take a look at this sample’s LRs and Segment results to get an idea of what is going on. To plot the sample data, we must first transpose the PCA-corrected LR spreadsheet as well as the Segmentation output, as SVS requires data to be oriented in columns in order to be plotted in the genome browser view.

  • Open PCA-Corrected Data (Center by Marker) – Sheet 2 and select Edit >Transpose Spreadsheet.
  • Leave the default options and click OK.
  • Repeat the transpose steps for Segmentation Covariates Every Column – Sheet 1.

Now we start by plotting the corrected LR values.

  • Open PCA-Corrected Data (Center by Marker assumed) Transposed - Sheet 1 and choose Plot >Variables in Genome Browser.
  • Check the S59 box and click Plot.

The plot should look like Figure 2-5.

Figure 2-5. Corrected log ratio plot of S59

Now let’s plot the segmentation results on top of the log ratios to see why there were so many segments found.

  • From the newly-created log ratio plot of S59 click on the Graph 1 node in the Graph Control Interface.
  • Under Add Item click the spreadsheet drop down and choose Select Spreadsheet. Select the Segmentation Covariates Every Column Transposed and click OK.
  • Check S59 from the list and click Add.

This creates a second S59 item under Graph 1. We need to move this item to the top to see it in the graph.

  • Right-click on the new S59 item and rename it S59 – Segment Means for the sake of clarity.
  • Click on the new S59 – Segment Means item under Graph 1 and, while holding the mouse button, drag it above the first S59 item.
  • To better display the segmentation covariates, select S59 – Segment Means under Graph 1 and, under the Item tab, change Line to Mid Steps and change the weight to 2. Change the Symbol to None.

Your plot should now look like Figure 2-6.

Figure 2-6. Log ratio plot of S59 with segmentation covariates overlaid

Zoom in for a closer look.

  • Double click on 12 in the Full Domain View to zoom into chromosome 12.

We can already see a little bit of a wave effect. Applying a small median smooth to the log ratios will make this even more apparent.

  • Select the S59 item (for the corrected log ratios) under Graph 1 and under the Smoothing tab select Median Smooth, Symmetric with a Window Radius of 2.
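
A symmetric median smooth with a window radius of 2 replaces each log ratio with the median of itself and the two markers on either side (five values in all). The Python sketch below illustrates the idea; the function name `median_smooth` is hypothetical, and shrinking the window at the ends of the data is an assumption, since the edge handling used by SVS is not described here.

```python
from statistics import median


def median_smooth(values, radius=2):
    """Replace each value with the median of the window centered on it,
    spanning `radius` neighbors on each side (truncated at the ends)."""
    n = len(values)
    return [
        median(values[max(0, i - radius): min(n, i + radius + 1)])
        for i in range(n)
    ]


# Hypothetical log ratios with a single noisy spike at index 2.
raw = [0.0, 0.1, 2.0, 0.1, 0.0, 0.2, 0.1]
smoothed = median_smooth(raw, radius=2)
print(smoothed)
```

Isolated spikes are suppressed while broader trends survive, which is why slow undulations such as a wave effect stand out more clearly after smoothing.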

Even with a median smooth of 2, the wave effect is much more apparent (Figure 2-7). In the previous Quality Assurance tutorial, sample S59 had an absolute wave factor that was just below the IQR cutoff. This suggests that a more stringent outlier threshold for wave factor might have been appropriate for this dataset.

Figure 2-7. Wave effect on chromosome 12

For this tutorial, we will use IQR on the Segment Counts to determine which samples warrant exclusion.

  • Open Segment Counts - Sheet 2, and select Quality Assurance >Column Statistics.
  • Leave 1.5 as the IQR Multiplier, and click OK.
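
An IQR-based upper outlier threshold is conventionally computed as Q3 + multiplier × (Q3 − Q1). The Python sketch below shows the arithmetic on made-up segment counts. The quartile convention (`method="inclusive"`) is an assumption; SVS may compute quartiles differently, so the numbers will not necessarily match the Column Statistics output.

```python
from statistics import quantiles

# Hypothetical segment counts, one per sample.
counts = [100, 110, 120, 130, 140, 150, 160, 170, 400]

q1, _q2, q3 = quantiles(counts, n=4, method="inclusive")
iqr = q3 - q1
multiplier = 1.5
upper_threshold = q3 + multiplier * iqr

# Samples whose segment count exceeds the threshold are flagged.
outliers = [c for c in counts if c > upper_threshold]
print(upper_threshold, outliers)
```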

In this case the Upper Outlier Threshold is 228.5. As we can see in the Segment Counts - Sheet 2 spreadsheet, there are 17 samples that have segment counts above this threshold (Figure 2-8).

Figure 2-8. Samples with a high number of segment counts

Let’s exclude these samples from our Phenotype spreadsheet.

  • From Segment Counts - Sheet 2, inactivate rows 1 - 17 by clicking once on the first row label, holding the shift key and clicking on the seventeenth row label. This will create a new spreadsheet, Segment Counts - Sheet 3.
  • Open Final Sample Set Dataset - Sheet 1 and choose Select >Activate or Inactivate Based on Second Spreadsheet.
  • Set the state of Rows in the current spreadsheet to Active based on active... Rows in the specified spreadsheet.
  • Click on Select Sheet. Choose Segment Counts - Sheet 3 from the list and click OK. Click OK again to finish.
  • Create a subset spreadsheet by choosing Select >Subset Active Data.
  • Rename the resulting spreadsheet in the Project Navigator to Sim_Pheno - Final Sample Set.
  • Close all open windows (except the Project Navigator) before continuing.
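
In effect, the steps above keep only the phenotype rows whose samples remain active in the segment counts spreadsheet. A minimal Python sketch of that filtering logic follows; the sample names and phenotype values are made up for illustration.

```python
# Hypothetical phenotype table: sample -> case/control status.
phenotype = {"S1": 0, "S2": 1, "S59": 1, "S7": 0}

# Samples whose segment count exceeded the upper outlier threshold.
high_segment_count = {"S59"}

# Keep every sample that was not flagged.
final_sample_set = {
    sample: status
    for sample, status in phenotype.items()
    if sample not in high_segment_count
}
print(final_sample_set)
```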

C. Discretizing Copy Number Segment Covariates

At this point you might continue on to run association tests based on the LR segment covariates. However, it is sometimes useful to discretize the segment means as two state (0,1) or three state (-1,0,1) covariates based on defined thresholds. Discretizing the covariates has the following benefits:

  • Approximate copy number calls (potential: deletion, neutral, duplication, not a deletion, or not a duplication) can be made based on thresholds denoting approximate transitions between copy number states.
  • Discretizing can magnify small, statistically significant differences between cases and controls.
  • Using discretized values reduces the influence of outliers (extremely small or large logR values) on a p-value.

To do this, we first need to determine the appropriate number of copy number classes by examining the segment mean histogram and then discretize the covariates based on the determined thresholds.

The segment means can be accessed from the Segment List spreadsheet.

  • Open the Segment List spreadsheet, right-click on the Segment Mean column header (4) and choose Plot Histogram.

The histogram in Figure 2-9 appears (with a bin count of 128).

Figure 2-9. Segment Means Histogram

At first glance we can see the distribution is centered around zero as we’d expect, and we can see some additional peaks to the left and right. If you zoom in on the y-axis you can see the peaks more clearly.

  • To zoom, click on the Graph 1 node in the Graph Control Interface, then under the Graph tab enter 800 in the Y-Max text box and hit Enter.
  • You may also wish to adjust the X axis.

You can now begin to see the peaks more clearly (Figure 2-10).

Figure 2-10. Histogram showing segment means in a four-state model

In this case it actually looks like there are four copy number classes. For the purpose of this tutorial we will keep it simple and look for thresholds that distinguish a copy number loss, neutral, or gain, which at first glance appear to be around -0.15 and 0.15 (you can zoom in on the X-axis to see this more clearly).

Note

  • Beyond visual inspection of the histogram, a more exact approach would be to run CNAM segmenting on the Mean Value column, but that’s beyond the scope of this tutorial.
  • The “peaks” or “clusters” of values seen in this histogram indicate a change in intensity relative to the reference intensity. Although we often refer to the lower peak as “losses” and the upper peak as “gains,” it is important to remember that this is relative, and doesn’t always correlate to actual copy numbers of more than 2 copies or less than 2 copies. For example, if there is a common indel where most of the reference samples have the deletion, a subject with two copies present would appear at the far right of the distribution. We might call it a “gain”, but in reality it is 2 copies, which would typically be “neutral.” The discretization approach described here works well for rare CNVs, but isn’t always appropriate for CNVs that are more common, as the discretization thresholds shift according to the frequency of the CNV in the reference population.
  • If you observe a unimodal distribution (only one hill) regardless of Bin Count and/or a large number of outliers in your data, discretizing is not recommended. In this case, using the PCA-corrected segments as covariates for association testing is recommended.

Now that appropriate thresholds have been determined, you can use these values to discretize the copy number segment covariates (continuous LR values) as three-state (-1,0,1) covariates. We’ll discretize the Segmentation Covariates First Column spreadsheet as we’ll be using this for association testing.

  • Open the Segmentation Covariates First Column and select Analysis >CNAM Output Analysis >Discretize CN Segment Covariates with Counts.

You will be prompted to choose a Three State Model or either of two Two State Models.

  • Choose Three State Model (-1,0,1).
  • Keep -0.15 as the Lower Threshold Level and 0.15 as the Upper Threshold Level and click OK.

This will create two new spreadsheets: Three State Covariates, with -1 for potential losses, 0 for potential neutrals, and 1 for potential gains; and Copy Number State Counts - Mapped Sheet 1, with various statistics about each marker in the segment. We will be using the first spreadsheet as covariates for association testing.
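
The three-state discretization amounts to comparing each segment mean against the two thresholds. The Python sketch below uses the -0.15 and 0.15 thresholds from this tutorial on made-up covariates. The function name `discretize` is hypothetical, and treating a value exactly equal to a threshold as neutral is an assumption. The per-column tallies are only a rough analogue of the state counts spreadsheet.

```python
from collections import Counter


def discretize(segment_mean, lower=-0.15, upper=0.15):
    """Map a segment mean to -1 (potential loss), 0 (neutral) or 1 (potential gain)."""
    if segment_mean < lower:
        return -1
    if segment_mean > upper:
        return 1
    return 0


# Hypothetical segment covariates: sample -> segment means at three columns.
covariates = {
    "S1": [0.02, -0.31, 0.00],
    "S2": [0.22, -0.28, 0.05],
    "S3": [-0.01, 0.03, 0.18],
}

three_state = {s: [discretize(v) for v in row] for s, row in covariates.items()}

# Tally how many samples fall into each state at every column.
state_counts = [Counter(col) for col in zip(*three_state.values())]
print(three_state)
print(state_counts)
```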

  • Close all open windows (except the Project Navigator) before continuing.