Runs Of Homozygosity (ROH) Algorithm

The Runs of Homozygosity algorithm finds runs of consecutive homozygous SNPs and then identifies where those runs are shared by at least a specified number of samples.

The first part of the analysis is determining the homozygous runs, or Runs of Homozygosity (ROH), each of which can be represented by the index of the SNP where the run starts and the index of the last SNP in the run. A homozygous genotype is one where both bases are the same, such as A_A or B_B.

Unlike the PLINK ROH algorithm, which uses a moving window that can introduce artificial runs and fails to capture runs larger than the window, the Golden Helix SVS algorithm works continuously across an entire chromosome, examining every possible run that matches the parameters specified in the ROH GUI Dialog window. The algorithm treats every homozygous SNP as the potential start of a new ROH run. Each SNP is classified as homozygous, heterozygous, or missing; each potential run is then either extended, if the SNP is homozygous, or has its heterozygous or missing count incremented. If consecutive heterozygotes are restricted, a run may contain at most that many heterozygotes in a row; as long as the run has not reached the maximum allowed heterozygote count, it may continue. The same applies to consecutive missing genotypes. There is also an option to allow only a certain number of heterozygotes (or missing genotypes) within a specified window of variants; this is more general than the consecutive restriction, since homozygous calls may fall between the heterozygotes (or missing genotypes). Runs that exceed the allowed number of heterozygous or missing SNPs are then checked to see whether they should be discarded based on their length and SNP density, according to the specified algorithm parameters. In addition, for NGS data, missing genotypes can optionally be treated as homozygous reference calls.

At the end of a chromosome, the longest runs are greedily chosen from all the potential runs (with no overlapping runs allowed) until no more runs are available. The result is the set of the longest runs of homozygous SNPs for a given chromosome and sample. This is repeated for each chromosome and sample in the dataset.
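To make this concrete, the following is a minimal Python sketch of the run-finding step, simplified to one sample on one chromosome with consecutive-count limits only; the names find_roh_runs, max_het, max_miss, and min_snps are illustrative and are not the actual SVS option names.

    # Genotype codes: 0 = homozygous, 1 = heterozygous, 2 = missing.
    def find_roh_runs(genos, max_het=1, max_miss=5, min_snps=50):
        """Return non-overlapping (start, end) index pairs of homozygous runs.

        Sketch only: every homozygous SNP may start a run, and a run may
        absorb up to max_het heterozygous and max_miss missing calls.
        """
        candidates = []
        n = len(genos)
        for start in range(n):
            if genos[start] != 0:
                continue  # a run can only start on a homozygous call
            het = miss = 0
            end = start
            for k in range(start, n):
                if genos[k] == 1:
                    het += 1
                    if het > max_het:
                        break
                elif genos[k] == 2:
                    miss += 1
                    if miss > max_miss:
                        break
                else:
                    end = k  # extend the run through this homozygous call
            if end - start + 1 >= min_snps:
                candidates.append((start, end))
        # Greedily keep the longest runs, discarding any that overlap a kept run.
        candidates.sort(key=lambda r: r[1] - r[0], reverse=True)
        kept = []
        for run in candidates:
            if all(run[1] < k[0] or run[0] > k[1] for k in kept):
                kept.append(run)
        return sorted(kept)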

A second algorithm is used to cluster the runs. Given the final list of ROHs calculated by the previous algorithm and a threshold, S_{min}, for the minimum number of samples that contain a run, a cluster is defined as a contiguous set of SNPs where every SNP is covered by a run in at least S_{min} samples. The algorithm sweeps across all the SNPs in the dataset, looking for clusters that meet the run-length requirements provided to the algorithm. A list of clusters is generated in a spreadsheet. Optionally, another spreadsheet is created that calculates, for each sample, the ratio of SNPs in each cluster that are members of a run of homozygosity.
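A sketch of the cluster sweep, assuming we have already tallied, for each SNP, the number of samples whose ROH covers it; min_cluster_snps is an assumed stand-in for the run-length requirement.

    def find_clusters(samples_in_run, s_min, min_cluster_snps=20):
        """Emit maximal stretches of SNPs covered by runs in >= s_min samples.

        samples_in_run[k] is the number of samples whose ROH covers SNP k.
        """
        clusters, start = [], None
        for k, count in enumerate(samples_in_run):
            if count >= s_min and start is None:
                start = k                      # open a new cluster
            elif count < s_min and start is not None:
                if k - start >= min_cluster_snps:
                    clusters.append((start, k - 1))
                start = None                   # close the cluster
        if start is not None and len(samples_in_run) - start >= min_cluster_snps:
            clusters.append((start, len(samples_in_run) - 1))
        return clusters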

In addition to the clusters described above, Optimal and Haplotypic Similarity clusters can be found. Optimal clusters are found by grouping samples with highly overlapping ROH regions, and Haplotypic Similarity clusters are found by grouping samples with high allelic similarity over overlapping ROH regions. The methodology behind finding these clusters is described below.

Optimal and Haplotypic Similarity Clusters

A method that uses normalized spectral clustering to group samples with highly overlapping ROH regions has been developed [Zhang2013]. The advantage of this method is that the new clusters have more stringent boundaries than the original clusters.

This clustering method is performed on each of the clusters found by the original method and repeated on the resulting clusters until locally optimal clusters are found. The method is also called repeated binary spectral clustering: for each cluster, the samples are grouped into two clusters, and the clustering method is then performed again on each of those. This continues until one of the stopping conditions, described below, is met. The clusters reported in the output spreadsheet are those for which a stopping condition was met.

Steps

The basic steps for finding these clusters are outlined below:

  1. Construct a similarity matrix that describes the amount of overlap between each pair of runs.
  2. Construct a degree matrix from the similarity matrix.
  3. Compute the normalized symmetric Laplacian matrix.
  4. Find the eigenvectors of the Laplacian matrix and create a matrix whose columns are the eigenvectors corresponding to the K largest eigenvalues (K = 2, the number of groups we will form).
  5. Normalize the rows of this matrix to unit norm.
  6. Use k-means to cluster the points of this matrix (the rows) into K groups.
  7. Repeat on the resulting clusters until any of the stopping conditions is met: the number of iterations exceeds 100, the number of runs in a group is less than or equal to S_{min}, or the lower quartile of the similarity matrix is greater than 0.5.

This is done for each cluster found using the original method.
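The recursion can be sketched as follows; similarity and spectral_bipartition are hypothetical helpers standing in for the steps detailed in the following sections, and recursion depth stands in for the iteration count.

    import numpy as np

    def repeated_binary_clustering(runs, s_min, depth=0, max_iter=100):
        """Recursively bipartition a cluster of runs until a stopping
        condition holds, then report the current cluster as final."""
        S = similarity(runs)  # pairwise run-similarity matrix (see below)
        if (depth >= max_iter or len(runs) <= s_min
                or np.percentile(S, 25) > 0.5):  # lower quartile of S
            return [runs]
        left, right = spectral_bipartition(S, runs)  # steps 2-6 above
        return (repeated_binary_clustering(left, s_min, depth + 1)
                + repeated_binary_clustering(right, s_min, depth + 1))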

Similarity Matrix

The similarity matrix is computed in one of two ways: one for the optimal clustering method and one for the haplotypic similarity method.

For optimal clustering, the similarity matrix is defined as:

S = \{s_{ij}\}_{i,j = 1,...,m}

where

s_{ij} = \min\left\{\frac{\delta}{l_i}, \frac{\delta}{l_j}\right\}

where \delta is the amount of overlap between the two runs, in SNPs; l_i and l_j are the lengths of runs i and j, in SNPs; and m is the number of runs in the cluster.
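For example, in a hypothetical cluster where run i spans 120 SNPs, run j spans 80 SNPs, and their overlap is 60 SNPs, s_{ij} = \min\{60/120, 60/80\} = 0.5. A small sketch of the computation, with runs given as (start, end) SNP index pairs:

    def optimal_similarity(run_i, run_j):
        """s_ij = min(delta / l_i, delta / l_j) for inclusive (start, end) runs."""
        delta = min(run_i[1], run_j[1]) - max(run_i[0], run_j[0]) + 1
        if delta <= 0:
            return 0.0  # the runs do not overlap
        l_i = run_i[1] - run_i[0] + 1
        l_j = run_j[1] - run_j[0] + 1
        return min(delta / l_i, delta / l_j)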

For haplotypic similarity clustering, the similarity matrix is defined as:

S = \{s_{ij}\}_{i,j = 1,...,m}

where m is the number of runs in the cluster, i and j index the runs, and

s_{ij} = \begin{cases} \dfrac{1}{\delta} \sum\limits_{k = start}^{end} \begin{cases} 1, & \text{if } C_{ik} = C_{jk}\\ 0, & \text{otherwise} \end{cases}, & \text{if } \frac{\delta}{l_i} \geq 0.5 \text{ and } \frac{\delta}{l_j} \geq 0.5\\ 0, & \text{otherwise} \end{cases}

where start is the start index (in SNPs) of the overlap region between runs i and j, and end is the end index of that region. C_{ik} and C_{jk} are the calls of runs i and j at index k, and \delta is the same as defined before.

In plain English, the similarity is the proportion of SNPs over the overlapping region of runs i and j with identical alleles. However, if the overlap region, \delta, covers less than 0.5 of either the i-th or j-th run, the similarity is set to 0.
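A corresponding sketch for the haplotypic similarity, where calls_i and calls_j (assumed inputs for illustration) map SNP index to the genotype call of each sample:

    def haplotypic_similarity(run_i, run_j, calls_i, calls_j):
        """Fraction of identical calls over the overlap of runs i and j;
        0 unless the overlap covers at least half of each run."""
        start = max(run_i[0], run_j[0])
        end = min(run_i[1], run_j[1])
        delta = end - start + 1
        if delta <= 0:
            return 0.0  # no overlap
        l_i = run_i[1] - run_i[0] + 1
        l_j = run_j[1] - run_j[0] + 1
        if delta / l_i < 0.5 or delta / l_j < 0.5:
            return 0.0  # overlap too small relative to either run
        matches = sum(1 for k in range(start, end + 1) if calls_i[k] == calls_j[k])
        return matches / delta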

Degree Matrix

The degree matrix D, a diagonal matrix, for the similarity matrix is defined as:

d_i = \sum\limits_{j = 1}^{m}{s_{ij}}

where m is the number of runs in the cluster.

Normalized Symmetric Laplacian

The normalized symmetric Laplacian matrix can be found from the degree and similarity matrices, D and S, respectively.

There are different ways to define the Laplacian matrix.

First method:

L = I - D^{-1/2}SD^{-1/2}

Second method [Ng2002]:

L = D^{-1/2}SD^{-1/2}

However, to reduce computation time, the second method can be reduced to the elementwise form:

l_{ij} = \frac {s_{ij}}{\sqrt {d_i d_j}}

The reduced version of the second method is the one used.
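A minimal numpy sketch of the degree computation and the reduced form (note that s_{ii} = 1 for both similarity definitions, so every degree is positive):

    import numpy as np

    def normalized_laplacian(S):
        """Elementwise l_ij = s_ij / sqrt(d_i * d_j), i.e. L = D^{-1/2} S D^{-1/2}."""
        d = S.sum(axis=1)                   # degree d_i of each run
        return S / np.sqrt(np.outer(d, d))  # divide each s_ij by sqrt(d_i d_j)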

Eigenvectors of Laplacian Matrix

It has been shown that the eigenvectors of a Laplacian matrix can be used to find semi-optimal cuts in a graph [Ng2002], and, through the modifications made above, they can be used to find cuts amongst the runs in our cluster [Zhang2013].

We then place the eigenvectors corresponding to the K largest eigenvalues into the matrix U column-wise.

Normalize

The last step before the k-means algorithm is performed is to normalize the rows of the matrix U.

The new matrix T is described as:

t_{ij} = \frac {u_{ij}}{\sqrt{\sum \limits_{k} u^2_{ik}}}
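Combining the eigen-decomposition and the row normalization, a sketch using numpy; numpy's eigh returns eigenvalues in ascending order, so the last K columns correspond to the K largest eigenvalues.

    import numpy as np

    def spectral_embedding(L, K=2):
        """Rows of T: eigenvectors of the K largest eigenvalues, row-normalized."""
        vals, vecs = np.linalg.eigh(L)  # symmetric eigen-decomposition
        U = vecs[:, -K:]                # K leading eigenvectors, column-wise
        return U / np.linalg.norm(U, axis=1, keepdims=True)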

K-Means

Having obtained and modified the eigenvectors of the Laplacian matrix, we can use the rows of T as points for the k-means algorithm. With K = 2, this essentially uses the second eigenvector to determine the clustering.

There are a few different k-means algorithms; the main ones are Lloyd, Hartigan-Wong, Forgy, and MacQueen. The Hartigan-Wong implementation generally performs better and is the method used.

Before running k-means on matrix T, initial cluster centers are found to help minimize the run time and to help ensure that the algorithm converges on a solution. To find these initial cluster centers, we first find the mean point, whose coordinates are the means of the first and second columns of T. We then find the distance between each point and this mean and sort the points in ascending order by that distance. Lastly, for each cluster L (where L = 1, 2, \ldots, K), the \left(1 + (L - 1)\frac{M}{K}\right)th point, where M is the number of points, is chosen as the initial center for cluster L [Hartigan1979].

The k-means algorithm is then run on the matrix T and these initial centers.
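A sketch of the center selection and the clustering call; scipy's kmeans2 implements a Lloyd-style iteration rather than Hartigan-Wong, so it is used here purely for illustration.

    import numpy as np
    from scipy.cluster.vq import kmeans2

    def initial_centers(T, K=2):
        """Pick the (1 + (L - 1)M/K)th point (0-based here) along the
        distances-to-mean ordering as the initial center of cluster L."""
        dist = np.linalg.norm(T - T.mean(axis=0), axis=1)
        order = np.argsort(dist)  # points sorted by distance to the mean
        M = len(T)
        return T[[order[(L - 1) * M // K] for L in range(1, K + 1)]]

    # With T the normalized embedding from the previous step:
    # centers, labels = kmeans2(T, initial_centers(T), minit='matrix')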

After k-means, the runs are assigned to their respective clusters and the start and end indexes for these new clusters are found.

The start and end indexes of a cluster are found by sorting the start and end indexes of the runs in the cluster into ascending order. Then, iterating over the sorted start indexes, the first index that is overlapped by at least 75% of the runs is chosen as the new start index. The same is done to find the end index. If no index reaches 75% coverage, the closest index is chosen [Zhang2013].
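A sketch of the boundary search for the new start index (the end index is found symmetrically); where no index reaches 75% coverage, this sketch interprets "the closest index" as the one whose coverage comes nearest the threshold, which is an assumption.

    def cluster_start(runs, threshold=0.75):
        """First sorted run-start index overlapped by >= threshold of the runs."""
        starts = sorted(r[0] for r in runs)
        best, best_cover = starts[0], -1.0
        for s in starts:
            cover = sum(1 for r in runs if r[0] <= s <= r[1]) / len(runs)
            if cover >= threshold:
                return s
            if cover > best_cover:
                best, best_cover = s, cover
        return best  # no index reached the threshold; use the closest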

Now that the new clusters are established, the clustering process is started again on them and continues until one of the stopping conditions described above in Steps is met.