# Runs of Homozygosity Analysis¶

## Runs of Homozygosity¶

### Runs of Homozygosity Introduction¶

Case/control genotype studies frequently look at individual SNPs for correlations between genes and phenotypes. SNPs may not be associated with positive case status when adjacent SNPs are, making it difficult to find entire groups of genes associated with diseases or other conditions. It has been found that large regions of homozygous SNPs can be common between groups of people without direct common lineage, and these regions can be areas of functional significance [Lencz2007]. A new approach, termed Runs of Homozygosity (ROH), has been developed to identify susceptibility loci across the genome where there has been significant evolutionary pressure.

## Runs of Homozygosity Formulas¶

The Runs of Homozygosity algorithm is designed to find runs of consecutive homozygous SNPs and find where those runs are common among at least the specified number of samples.

The first part of the analysis is determining the homozygous runs or Runs of Homozygosity (ROH), which can be represented as the index of the SNP where the run started and the index of the last SNP in the run. A homozygous genotype is one where both bases are the same, such as A_A or B_B.

Unlike the PLINK ROH algorithm, which uses a moving window that potentially
introduces artificial runs and fails to capture runs larger than the window,
the Golden Helix SVS algorithm works continuously across an entire chromosome,
examining every possible run that matches the parameters specified for the
algorithm in the *ROH GUI Dialog* window. The algorithm looks at every homozygous
SNP as a potential start of a new ROH run. Each SNP is then determined to be
homozygous, heterozygous or missing. Each potential run is then updated to be
extended if it is homozygous, or modified to have its heterozygous or missing
count incremented. If consecutive heterozygotes are restricted, a run will allow
up to that amount of heterozygotes in a row, however if it isn’t currently at
the maximum allowed heterozygote count, then it may continue its run. The same
applies for consecutive missing genotypes. There is also an option to allow only
a certain amount of heterozygotes (or missing genotypes) within a specified window
of variants, this is more general than consecutive as there could be homozygous
calls in between the heterozygotes (or missing genotypes). Runs that exceed the
allowed number of heterozygotes or missing SNPs are then checked to see if they
should be thrown out based on their length and SNP density according to the specified
algorithm parameters. In addition, for NGS data, missing genotypes can be specified
to be treated as homozygous reference calls. At the end of a chromosome, the longest
runs are greedily chosen out of all the potential runs (with no overlapping of
runs allowed) until no more runs are available. The result is the set of runs
containing the longest runs of homozygous SNPs for a given chromosome and sample.
This is repeated for each chromosome and sample of the dataset.

A second algorithm is used to create a clustering of runs. Given the final list of ROHs calculated by the previous algorithm and a threshold, , for the minimum number of samples that contain a run, a cluster is defined as a contiguous set of SNPs where every SNP has at least samples in a run for each SNP. The algorithm sweeps across all the SNPs in the dataset looking for clusters that meet the run length requirements provided to the algorithm. A list of clusters is generated in a spreadsheet. Optionally, another spreadsheet is created which calculates, for each sample, the ratio of SNPs in each cluster that are members of a run of homozygosity.

In addition to the clusters described above, Optimal and Haplotypic Similarity clusters can be found. Optimal clusters are found by grouping samples with highly overlapped ROH regions and Haplotypic Similarity clusters are found by grouping samples with high allelic similarity over overlapping ROH regions. The methodology behind finding these clusters is described below.

### Optimal and Haplotypic Similarity Clusters¶

A method that uses normalized spectral clustering to group samples with highly overlapped ROH regions has been developed [Zhang2013]. The advantage of this method is that the new clusters have more stringent boundaries then the original clusters.

This clustering method is performed on each of the clusters found in the original method and repeated on those clusters until local optimum clusters are found. This method is also called repeated binary spectral clustering as for each cluster we group the samples into two clusters and then perform the clustering method again on those clusters. This continues until one of the stopping conditions, described below, is met. The clusters reported in the output spreadsheet are the clusters where the stopping condition was met.

Warning

If there are a substantial amount of runs (~498,073,600), a warning will appear letting the user know how much memory it could possibly take to find the optimal and/or haplotypic similarity clusters. If this warning appears and the user continues, there’s a chance for SVS to crash.

Also, if there are a substantial amount of runs in any cluster (~15,000), a matrix (used for the optimal and haplotypic similarity clusters) of this size cannot be created. A warning will appear informing the user of how much memory it could take to store this. If the user continues, there’s a chance for SVS to crash.

#### Steps¶

The basic steps for finding these clusters is outlined below [Zhang2013]:

- Construct a similarity matrix that describes the amount of overlap between each pair of runs.
- Construct a degree matrix from the similarity matrix.
- Compute the normalized symmetric Laplacian matrix.
- Find the eigenvectors of the Laplacian matrix and create a matrix whose columns are the first K eigenvectors (K = 2, the number of groups we will form).
- Normalize the matrix by normalizing the rows to norm 1.
- Use k-means to cluster the points of this matrix (the rows) into K groups.
- Repeat on the resulting clusters until the number of iterations is above 100, the number of runs of each group is less or equal to than , and the lower quartile of the similarity matrix is greater than 0.5.

The is done for each cluster found using the original method.

#### Similarity Matrix¶

There are two different ways that the similarity matrix is found. One for the optimal clustering method and one for the haplotypic similarity method.

For optimal clustering, the similarity matrix is defined as [Zhang2013]:

where

where is the amount of overlap between two runs, in SNPs, and are the lengths of runs and , in SNPs, and is the number of runs in this cluster.

For haplotypic similarity clustering, the similarity matrix is defined as [Zhang2013]:

where is the number of runs in a cluster and and are the runs and

where is the start index (in SNPs) of the overlap region between runs and . And is the end index of that region. and are the calls of the two runs at index . And is the same as defined before.

In plain English, the similarity is the percent of SNPs over the overlapping regions of runs and with identical alleles. However, if their overlap region, , is less than 0.5 of either the th or th run, the similarity is set at 0.

#### Degree Matrix¶

The degree matrix D, a diagonal matrix, for the similarity matrix is defined as [Zhang2013]:

where is the number of runs in the cluster.

#### Normalized Symmetric Laplacian¶

The normalized symmetric laplacian matrix can be found from the degree and similarity matrices, matrices D and S, respectively.

There are different ways to get the laplacian matrix.

First method [Zhang2013]:

Second method [Ng2002]:

However, to reduced computation time, the second method can be reduced to:

The reduced version of the second method is the one used.

#### Eigenvectors of Laplacian Matrix¶

It has been shown that the eigenvectors of a Laplacian matrix can be used to find semi-optimal cuts in a graph [Ng2002] , and through the modifications made above, they can be used to find cuts amongst the runs in our cluster [Zhang2013].

We then put the eigenvectors corresponding to the K largest eigenvalues into the matrix U column wise.

#### Normalize¶

The last step before the k-means algorithm is performed is to normalize the rows of the matrix U.

The new matrix T is described as [Zhang2013]:

#### K-Means¶

Now that we have obtained the eigenvectors, and modified them, from the laplacian matrix, we can use the rows as points for the k-means algorithm. This essentially uses the second eigenvector to determine the clustering.

There are a few different k-means algorithms, the main ones are Lloyd, Hartigan-Wong, Forgy, and MacQueen. The Hartigan-Wong implementation generally performs better and is the method used.

Before running k-means on matrix T, initial cluster centers are found to help minimize the run time and to help ensure that the algorithm converges on a solution. To find these initial cluster centers, we first find the mean point. The mean point is the mean of the first column and the mean of the second column of our matrix T. We then find the distance between each point and this mean. Then the points are sorted in ascending order by their distance to the mean. And lastly, for each cluster , (where L = 1, 2, ... K), the th point is chosen to be the initial center for cluster [Hartigan1979].

The k-means algorithm is then run on the matrix T and these initial centers.

After k-means, the runs are assigned to their respective clusters and the start and end indexes for these new clusters are found.

The start and end indexes of the clusters are found by ordering the start and end indexes of the runs in the cluster into ascending order. Then while iterating over the list of sorted start indexes, the first instance where at least 75% percent of the runs overlap this index is chosen as the new start index. The same is done to find the end index. However, if there isn’t an index that reaches 75% coverage, then the closest index is chosen [Reber2013].

Now that the new clusters are established, the clustering process is started again
on these and continues this until one of the end conditions is met, described above
in *Steps*.

### ROH Association Analysis¶

After the ROH analysis is complete, association studies can be performed by
joining either the “First column of each cluster” or the “One column per marker”
spreadsheet with phenotypic data. Set the phenotypic variable in the joined
spreadsheet as a dependent variable and perform numeric association tests.
See *Numeric Association Tests* for more information.

## ROH GUI Dialog¶

There are two versions of The Runs of Homozygosity (ROH) analysis feature. The
version with options more appropriate for GWAS data can be found in the spreadsheet
menu **Genotype > Runs of Homozygosity for GWAS**. The second version, with options
more appropriate for NGS data can be found in the spreadsheet menu **DNA-Seq > Runs
of Homozygosity for NGS**. For both versions, the spreadsheet must contain genotypic
data and be marker mapped. However, for **Runs of Homozygosity for NGS** the
spreadsheet also have a **Reference Field** in the marker map.

The first tab has options for how a run should be composed, the number of heterozygotes allowed, number of missing genotypes allowed, etc. The second tab has options related to the output spreadsheets, here you can select which spreadsheets to output and settings for clustering.

### ROH Options¶

When the **Runs of Homozygosity** dialog opens (*ROH GUI Dialog*), there are
several analysis options.

#### Minimum Run Length¶

The minimum run length is the length a sequence of homozygous SNPs must be to be considered a run of homozygosity. The minimum run length can be specified in two ways.

**Distance**: Specify the minimum run length in Kb (kilobase pairs) based on the genomic position information in the marker map. This option also allows for the specification of the minimum number of SNPs for the specified minimum distance.**Keyword Argument****minLengthKBase****minSnpsKBase****Value Type**Real Type Integer Type **Constraints**[1.0, inf) Kb [2, inf) **Default Value**500 25 **Inter-option constraints**minLengthSnps is not used minLengthSnps is not used **SNPs**: Specify the minimum run length solely based on the number of SNPs in a run, regardless of how far the markers are apart.**Keyword Argument****minLengthSnps****Value Type**Integer Type **Constraints**[2, inf) **Default Value**100 **Inter-option constraints**minLengthKBase and minSnpsKBase are not used

#### Number of Heterozygotes¶

The optional parameter **Allow runs to contain up to ... heterozygote(s)** allows
for the specification of the number of heterozygotes allowed in a run of
homozygotes in order to be considered a “run of homozygous markers”. If this
option is unchecked then no heterozygous markers are allowed in a run. If,
however, one or more heterozygotes are allowed, check this box and specify the
maximum number allowed. Selecting this option also allows the selection of,
**Allow runs to contain up to ... heterozygote(s) within a ... variant window**
and **Allow runs to contain up to ... consecutive heterozygote(s)**.

**Allow runs to contain up to ... heterozygote(s) within a ... variant window**
specifies how many heterozygote(s) are allowed in a certain window that’s specified
in terms of number of variants.
**Allow runs to contain up to ... consecutive heterozygote(s)** specifies how many
heterozygote(s) are allowed consecutively in a run.

Note

The algorithm will try to make runs at least as long as the minimum run length specified. If at least one heterozygote is allowed the algorithm will pull in flanking heterozygote genotypes to form a run.

If heterozygotes are allowed it is recommended that the genotypes making up the run are examined if it matters if the heterozygotes are inside or flanking the run.

Keyword ArgumentallowHeteromaxHeteroValue TypeBinary Type Integer Type ConstraintsTrue or False [1, inf) Default ValueTrue 1 Inter-option constraintsNone allowHetero = True

Keyword ArgumentuseHeteroDensityheteroDensityheteroDensityRangeValue TypeBinary Type Integer Type Integer Type ConstraintsTrue or False [1, inf) [2, inf) Default ValueFalse 1 5 Inter-option constraintsallowHetero = True allowHetero = True, useHeteroDensity = True allowHetero = True, useHeteroDensity = True

Keyword ArgumentuseHeteroConsecmaxHeteroConsecValue TypeBinary Type Integer Type ConstraintsTrue or False [1, inf) Default ValueFalse 1 Inter-option constraintsallowHetero = True allowHetero = True, useHeteroConsec = True

#### Missing Genotypes¶

One of the following options must be selected:

- If using the NGS dialog:
**Treat missing as homozygous reference calls**: Any missing genotypes in a run will be treated as if they were homozygous reference.**Treat as missing**: Missing genotypes will be treated as missing, checking this will allow you to select other options to restrict the amount of missings in a run.

- If using the GWAS dialog:
**Allow runs to contain up to ... missing genotype(s)**: Specify allowable number of missing genotypes. The number selected must be an integer greater than or equal to 0.**Allow any number of missing genotypes**: Selecting this option still requires that the first and last genotype be homozygote genotypes.

Note

The algorithm will not penalize a run for having a missing genotype at the edge of a run if at least one missing genotype is allowed and if the minimum run length size would have been met if the missing genotype was a homozygous genotype. Instead the total run length will be reduced to reflect the number of flanking missing genotypes.

If a minimum run length of 10 SNPs is specified and one missing genotype is allowed, it is possible to see a run length of 9 with one missing genotype.

If missing genotypes are allowed it is recommended that the genotypes making up the run are examined if it matters if the missing values are inside or flanking the run.

Keyword ArgumenttreatMissingAsHomo(NGS)restrictMissingValue TypeBinary Type Binary Type ConstraintsTrue or False True or False Default ValueFalse False Inter-option constraintsrestrictMissing = False treatMissingAsHomo = False (NGS), None (GWAS)

Keyword ArgumentmaxMissingValue TypeInteger Type Constraints[0, inf) Default Value5 Inter-option constraintsrestrictMissing = True, treatMissingAsHomo = False (NGS)

Keyword ArgumentuseMissingDensitymissingDensitymissingDensityRangeValue TypeBinary Type Integer Type Integer Type ConstraintsTrue or False [1, inf) [2, inf) Default ValueFalse 1 5 Inter-option constraintsrestrictMissing = True, treatMissingAsHomo = False (NGS) restrictMissing = True, treatMissingAsHomo = False (NGS), useMissingDensity = True restrictMissing = True, treatMissingAsHomo = False (NGS), useMissingDensity = True

Keyword ArgumentuseMissingConsecmaxMissingConsecValue TypeBinary Type Integer Type ConstraintsTrue or False [1, inf) Default ValueFalse 1 Inter-option constraintsrestrictMissing = True, treatMissingAsHomo = False (NGS) restrictMissing = True, treatMissingAsHomo = False (NGS), useMissingConsec = True

Note

- To not allow any missing genotypes in a run, select
**Allow runs to contain up to ... missing genotype(s)**and set the value to 0. - If:
- treatMissingAsHomo = True, then any number of missing genotypes are allowed in a run as they will be treated as homozygous reference calls.
- restrictMissing = False, then any number of missing genotypes are allowed in a run of homozygous genotypes.
- restrictMissing = True, then the number of missing genotypes in a run is restricted to the specified number.

#### Maximum Gap Between SNPs¶

The optional parameter **Maximum gap between SNPs in a run: ... kb** restricts how
runs are formed. If the maximum gap is exceeded, shorter runs are formed if
possible. Otherwise, no runs are formed because the distance between at least
two SNPs is more than the specified amount. These options are only available in
the GWAS version of ROH.

Keyword ArgumentrestrictGap maxGap Value TypeBinary Type Real Type ConstraintsTrue or False [1.0, inf) Kb Default ValueTrue 100.0 Inter-option constraintsNone restrictGap = True

Note

If:

- restrictGap = False, then the maximum gap in Kb between SNPs in a run is not restricted.
- restrictMissing = True, then the maximum gap in a run is restricted to the specified size.

#### Minimum Density of a Run¶

The optional parameter **Minimum density of a run: 1 SNP per ... kb** restricts
how runs are formed. If the minimum density is selected and the criterion is not
met, then a run is not formed. These options are only available in the GWAS version
of ROH.

Keyword ArgumentrestrictDensity minDensity Value TypeBinary Type Real Type ConstraintsTrue or False [1.0, inf) Kb Default ValueFalse 50.0 Inter-option constraintsNone restrictDensity = True

Note

If:

- restrictDensity = False, then the minimum density of SNPs in a run in Kb is not restricted.
- restrictMissing = True, then the minimum density in a run is restricted to the specified value.

#### ROH Output Spreadsheets¶

At least one of the following output spreadsheets must be selected.

**Create a spreadsheet of runs per sample:**Creates the

**Homozygous Runs of Length >= ...**spreadsheet. This spreadsheet details each run of homozygosity found. There are columns for the chromosome, start and end genomic position in base pairs, length of the run, start and end column indexes, the number of SNPs in the run, the number of missing values and the number of heterozygotes.**Create a spreadsheet with binary ROH run status:**Creates the

**Binary ROH Status for Runs of Length >= ..**spreadsheet. This spreadsheet indicates if the data from each marker is part of a run of homozygosity for each sample. For instance, if the genotype for a marker is homozygous for a particular sample, and there are enough surrounding and consecutive homozygous genotypes for a sample to constitute a run, then a 1 is placed in the cell for the marker and sample. If, on the other hand, for a particular sample there is a homozygous genotype marker surrounded by two heterozygous genotypes, then a 0 is placed in the cell for the homozygous marker and sample. All heterozygous genotypes are replaced with 0’s unless a run is allowed to have one or more heterozygotes.**Create a spreadsheet with the incidence of common runs per SNP:**Creates the

**Incidence of Common Runs per SNP**spreadsheet. This spreadsheet displays columns for the original SNP column number, the number of runs associated with each SNP (i.e. the number of samples that have a run overlapping the SNP), and the chromosome number for the SNP. Row labels are the SNP names.**Output Cluster of Runs for Association Studies:**This option requires specification of the minimum number of samples for a cluster in the

**Minimum # samples that must contain a run:**option. The minimum number of samples is the number of samples that must share a run of homozygosity for a run to be considered “common”. See*Runs of Homozygosity Formulas*. The**Cluster of Runs with >=...**spreadsheet displays information about the cluster of SNPs where common runs of homozygosity were found including the genomic range of the cluster and the number of SNPs in the region. Additionally, spreadsheets can be generated which contain the proportion of SNPs in each cluster that are members of common runs of homozygosity. This information can be output in two ways as described below.In addition, the optimal and haplotypic similarity cluster spreadsheets can be be outputted, they have the same options as the regular cluster outputs.

**Cluster Outputs:**Spreadsheets that contain information about the proportion of SNPs in each cluster that are members of common runs of homozygosity.**First column of each cluster**: The values are generated for the first column (or marker) of each cluster. This format is best for association tests, as there is a reduction in the total number of tests and in the multiple testing correction.**One column per marker**: The values are generated for all columns (or markers) for each cluster. This format is best if visualization of the data in a heat map is desired.