CNAM Optimal Segmentation Algorithm

Overview

The copy number segmenting algorithm is designed to find local variations over markers. The segmenting process is optimized by working at three levels:

  1. If desired, the region of markers is subdivided into a moving window of sub-regions.
  2. A unique segmenting algorithm is applied to find multiple segments wherever possible, and by implication, segment boundaries which are termed “cut-points”.
  3. A permutation algorithm is applied to validate the found cut-points.

CNAM offers two types of segmenting methods, univariate and multivariate. These methods are based on the same algorithm, but use different criteria for determining cut-points. The multivariate method segments all samples simultaneously, finding general copy number regions that may be similar across all samples. This method is preferable for finding very small copy number regions, and for finding conserved regions possibly useful for association studies. For a given sample, the covariate is the mean of the intensities within each segment for that sample. If there are consistent positions for copy number variation across multiple samples, the copy number segments will be found. In reality, there may not always be consistent copy number segments across multiple samples. The univariate method segments each sample separately, finding the cut-points of each segment for each sample and outputs a spreadsheet showing all cut-points found among all samples. There are two options for how this spreadsheet can be formatted. Either there will be one column for each unique cut-point segment, or there will be a column for all segmented markers in the marker map. The first spreadsheet is good for association analysis because the multiple testing adjustment is smaller. However, the second spreadsheet is better for plotting the results. The results are the same in both cases, but in the second spreadsheet the results of a column are repeated for every marker in each unique segment.

See Using CNAM (Copy Number Analysis Method) Optimal Segmenting for more information.

Obtaining Segments with a Moving Window

CNAM allows segmenting with a moving window, where the segmentation algorithm is applied to each sub-region created by the window. Using the optional moving window will result in an overall faster analysis; however, the results may not be as robust as analyzing the entire chromosome simultaneously. We implement an approach to windowing that minimizes edge effects as the moving window moves along the chromosome:

  1. Start out with a sub-region of s markers.

  2. Perform the segmenting algorithm on the current sub-region.

  3. If two or more cut-points were found, meaning that there were more than two segments found:

    • If Univariate Outlier Removal is selected, delete cut-points that bracket single marker segments.
    • If Univariate Outlier Removal is not selected, use all the cut-points except the last one and begin the next sub-region at the second-to-last cut-point.

    Use s markers. Otherwise:

    • If no cut-points could be found at all to subdivide the one sub-region, the next sub-region is started in the middle of the current sub-region and extended for s markers. No cut-points are yielded in this case.

      Note

      This may mean a region of copy number variation which started before this sub-region will be considered to be extended through this sub-region and possibly for much longer. We do not arbitrarily subdivide regions based on moving-window boundaries–instead, this whole procedure is meant to find valid cut-points throughout the main region.

    • If the sub-region was successfully segmented into exactly two segments, yielding exactly one cut-point, and the cut-point is in the first half of the sub-region, this cut-point is accepted, and the next sub-region is started at that cut-point and extended for s markers.

    • If the sub-region was successfully segmented into exactly two segments, yielding exactly one cut-point, but the cut-point is in the second half of the sub-region, the apparent cut-point may be due to an edge effect. Thus,the cut-point is not used and the moving-window sub-region is changed so as to get better segmenting. If the size of the sub-region is now still s, the sub-region is simply extended so that the total size of the sub-region is 2s. If the size of the sub-region is already 2s, the next sub-region, which goes back to being length s, is started at the former start plus s/2.

  4. Finally, when the end of the main region is encountered, whatever cut-points are obtained from the sub-region are accepted.

You may set the window size as one of the optional segmenting parameters, see Using CNAM (Copy Number Analysis Method) Optimal Segmenting.

Obtaining Segments without a Moving Window

Selecting to not use a moving window in the segmenting analysis results in the sub-region spanning an entire chromosome. The segmenting algorithm is then applied to the whole chromosome. This is more computationally intensive, and the most trusted procedure. Note, that it may be necessary to increase the Max segments per window to more than the default of 20, as a given chromosome may well have more than 20 regions of copy number variation.

Copy Number Segmentation within a Sub-Region

The segmentation algorithm exhaustively looks through all possible cut-points for a given region of data, to minimize the sum of squared deviations of the data from the mean of its respective segment. Given n points or intensities, and k cut-points defining k+1 segments, the number of possible cut-points is O(n^k). That is, the search space explodes exponentially. This exponential search space perhaps accounts for the explosion of the algorithms using hidden Markov model approaches or approximate segmenting approaches to find the segment regions.

Ironically, since 1999, within the tree splitting code of Golden Helix SVS, we have had an optimal algorithm that finds segment boundaries, effectively overcoming the exponential search problem. This algorithm, due to Hawkins, see [Hawkins1972], [Hawkins1973], and [Hawkins2002], is based on dynamic programming. It exhaustively searches through all possible cut-point positions to find an optimal segmenting of the data. We employ this approach with a few modifications to the copy number problem.

The segmentation algorithms in Golden Helix SVS use “segmentation for a normally distributed response for a continuous-ordinal predictor”. This mode of segmentation is partly analogous to a univariate counterpart used in creating tree models. However, there are important differences, the main one being that no final p-value calculation for the entire sub-region is attempted because the main emphasis here is on local variation rather than an entire “split”. Also, the tree models do not currently implement a permutation testing procedure for determining split cardinality.

Note

The following points apply for the copy number segmentation algorithms:

  • Within the sub-region under analysis, we consider the number of markers to correspond to the number of observations.
  • These markers are considered to be implicitly numbered consecutively. The implicit marker number is used as the (continuous-ordinal) “predictor” variable.
  • The object of this segmentation is to find which markers border areas of significant copy number variation–that is to find for which regions of markers there is at least one sample for whom the log2 ratio varies significantly between the regions, presumably indicating copy number variation between those regions.
  • As mentioned above, we label the borders between these regions as “cut-points”.

In the multivariate algorithm, the samples (from the respective subjects) are considered to be the respective multivariate “dimensions” of this analysis. In the univariate algorithm, the dimension of the analysis is one, as each sample is analyzed independently. The log2 ratio for every sample and every marker is considered to be the “response” variable.

The model is that the observations segment into k groups, and that each group, within each dimension, has a different mean with noise. The k-1 cut-points optimally splitting the data in a maximum likelihood sense are found by minimizing the sum over all dimensions of sums of squared deviations of the group means from the observations.

The positions of these cut-points are found in advance for every possible total number of segments, from two to the maximum number of segments allowed. You may set the maximum number of segments (per sub-region) k_{max} as one of the optional segmenting parameters (see Using CNAM (Copy Number Analysis Method) Optimal Segmenting).

To determine if we can use a given number of segments k, a permutation testing based comparison is done between each pair of adjacent segments. The permutation testing procedure is described in the next section. If the segments are significantly different for every pair of adjacent segments, k segments are used. Otherwise, this same determination is tried for k-1 segments, and so on. This process is started at k=k_{max} and continued through k=1.

It may turn out that this process continues through to k=1. This means it is not possible to divide the observations into more than one segment nor to get any cut-points (over this sub-region).

Univariate Outlier Removal

The univariate outlier removal option addresses a couple of problems. When a minimum number of markers per segment is specified, such as 10, a single marker outlier could drag along another 9 markers to create a segment that was driven only by the one point. This would inflate the number of 10 marker segments, and would incorrectly specify boundaries of the copy number segment if the region was 9 markers or less in length. The second problem is that if the minimum number of markers per segment is set to 1, single marker segments would be found, but those single marker segments would be found to not be significant with permutation testing, and because the outliers reduced the sum of squared deviations from the mean more than for some of the longer segments, those longer (but significant) segments would be lost as the next lower cardinality splits were chosen. The basic problem is that this maximum likelihood estimator of minimizing the sum of squared deviations from the mean assumes normality, which does not hold in the case of numerous outliers. The outlier removal process applies only when the minimum number of markers per segment is set to 1. It starts with finding the maximum number of segments per window and then deletes the cut-points that bracket single marker segments. Then permutation testing is done on the remaining segments that are longer than length 1 (including the outlier data as part of those larger segments), dropping down to lower cardinality if any pair of adjacent segments are not statistically different.

Permutation Testing for Verifying Copy Number Segments

An optimal k-way split is one that defines k segments such that the sum of squared deviations from the mean of each segment is minimized. In other words, there is no better placement of cut-points that would result in a smaller sum of squared deviations of the data from the respective mean of the segment in which it falls. The challenge, given this optimal segmenting, is determining whether the resulting segments represent significant differences between pairwise adjacent segments.

Consider two adjacent segments. Both the left and right segments have means. The cut-point between them minimizes the sum of squared deviations of the data from the two respective means. If we were to just perform a T-test comparing the adjacent segments, we would need to adjust the resulting p-value with some kind of multiple testing correction for all of the possible cut-points that were searched through to find the best one. A Bonferroni adjustment would be too conservative due to correlation structure in the data. Instead we turn to permutation testing.

The key to performing permutation testing correctly is to randomize and perform the same procedure as was used to find the original statistic. This ends up being fairly straightforward.

Given p_{max} is the max pairwise permuted p-value parameter, for each adjacent pair of segments we perform the following procedure:

  • Calculate the original sum of squared deviations for the means of the two adjacent segments.
  • Set count=1
  • Do the following 10/p_{max} times:
    • Randomly shuffle the marker label columns (i.e. multivariate columns are kept together).
    • Find the optimal two-way split that minimizes the sum of squared deviations from the means within the random data.
    • If the randomly computed sum of squared deviations is less than or equal to the original sum of squared deviations, set count=count+1
  • The pairwise p=count/(10/p_{max}).

If the permuted pairwise p is less than or equal to p_{max} for all adjacent pairwise segments, all of the segments are statistically significant. Otherwise, we repeat the procedure with the optimal k-1 way segmenting, until either we find all pairwise segments are significant, or all the data in a region represents a single segment (or possibly part of a larger segment if the moving window procedure is used).

Recent improvements in the permutation algorithm (versions 7.1 and higher) enable the use of all specified threads and an updated randomization procedure. The previous naive potentially biased randomization procedure was replaced with the unbiased Fisher and Yates method (also known as the Knuth shuffle) [Durstenfeld1964]. This modification and a set of 128 random number generators with unique keys run sequentially allows us to simultaneously have up to 128 threads performing permutation testing, and the same segmentation results occur regardless of the number of threads specified. If a non-power-of-two number of threads is specified, it rounds up or down to the nearest power of two, with an upper bound of 128. Due to differences in the randomization procedure, different output can be generated over previous versions for segments that fall on the borders of the p-value threshold. A single segmenting job will now utilize close to 100% CPU 100% of the time if all threads are selected, removing the need to have more than one job running simultaneously on the same machine.