Runs of Homozygosity Analysis

Runs of Homozygosity

Runs of Homozygosity Introduction

Case/control genotype studies frequently look at individual SNPs for correlations between genes and phenotypes. SNPs may not be associated with positive case status when adjacent SNPs are, making it difficult to find entire groups of genes associated with diseases or other conditions. It has been found that large regions of homozygous SNPs can be common between groups of people without direct common lineage, and these regions can be areas of functional significance [Lencz2007]. A new approach, termed Runs of Homozygosity (ROH), has been developed to identify susceptibility loci across the genome where there has been significant evolutionary pressure.

Runs of Homozygosity Formulas

The Runs of Homozygosity algorithm is designed to find runs of consecutive homozygous SNPs and find where those runs are common among at least the specified number of samples.

The first part of the analysis is determining the homozygous runs or Runs of Homozygosity (ROH), which can be represented as the index of the SNP where the run started and the index of the last SNP in the run. A homozygous genotype is one where both bases are the same, such as A_A or B_B.

Unlike the PLINK ROH algorithm, which uses a moving window that potentially introduces artificial runs and fails to capture runs larger than the window, the Golden Helix SVS algorithm works continuously across an entire chromosome, examining every possible run that matches the parameters specified for the algorithm in the ROH GUI Dialog window. The algorithm looks at every homozygous SNP as a potential start of a new ROH run. Each SNP is then determined to be homozygous, heterozygous or missing. Each potential run is then updated to be extended if it is homozygous, or modified to have its heterozygous or missing count incremented. If consecutive heterozygotes are restricted, a run will allow up to that amount of heterozygotes in a row, however if it isn’t currently at the maximum allowed heterozygote count, then it may continue its run. The same applies for consecutive missing genotypes. There is also an option to allow only a certain amount of heterozygotes (or missing genotypes) within a specified window of variants, this is more general than consecutive as there could be homozygous calls in between the heterozygotes (or missing genotypes). Runs that exceed the allowed number of heterozygotes or missing SNPs are then checked to see if they should be thrown out based on their length and SNP density according to the specified algorithm parameters. In addition, for NGS data, missing genotypes can be specified to be treated as homozygous reference calls. At the end of a chromosome, the longest runs are greedily chosen out of all the potential runs (with no overlapping of runs allowed) until no more runs are available. The result is the set of runs containing the longest runs of homozygous SNPs for a given chromosome and sample. This is repeated for each chromosome and sample of the dataset.

A second algorithm is used to create a clustering of runs. Given the final list of ROHs calculated by the previous algorithm and a threshold, S_{min}, for the minimum number of samples that contain a run, a cluster is defined as a contiguous set of SNPs where every SNP has at least S_{min} samples in a run for each SNP. The algorithm sweeps across all the SNPs in the dataset looking for clusters that meet the run length requirements provided to the algorithm. A list of clusters is generated in a spreadsheet. Optionally, another spreadsheet is created which calculates, for each sample, the ratio of SNPs in each cluster that are members of a run of homozygosity.

In addition to the clusters described above, Optimal and Haplotypic Similarity clusters can be found. Optimal clusters are found by grouping samples with highly overlapped ROH regions and Haplotypic Similarity clusters are found by grouping samples with high allelic similarity over overlapping ROH regions. The methodology behind finding these clusters is described below.

Optimal and Haplotypic Similarity Clusters

A method that uses normalized spectral clustering to group samples with highly overlapped ROH regions has been developed [Zhang2013]. The advantage of this method is that the new clusters have more stringent boundaries then the original clusters.

This clustering method is performed on each of the clusters found in the original method and repeated on those clusters until local optimum clusters are found. This method is also called repeated binary spectral clustering as for each cluster we group the samples into two clusters and then perform the clustering method again on those clusters. This continues until one of the stopping conditions, described below, is met. The clusters reported in the output spreadsheet are the clusters where the stopping condition was met.

Warning

If there are a substantial amount of runs (~498,073,600), a warning will appear letting the user know how much memory it could possibly take to find the optimal and/or haplotypic similarity clusters. If this warning appears and the user continues, there’s a chance for SVS to crash.

Also, if there are a substantial amount of runs in any cluster (~15,000), a matrix (used for the optimal and haplotypic similarity clusters) of this size cannot be created. A warning will appear informing the user of how much memory it could take to store this. If the user continues, there’s a chance for SVS to crash.

Steps

The basic steps for finding these clusters is outlined below [Zhang2013]:

  1. Construct a similarity matrix that describes the amount of overlap between each pair of runs.
  2. Construct a degree matrix from the similarity matrix.
  3. Compute the normalized symmetric Laplacian matrix.
  4. Find the eigenvectors of the Laplacian matrix and create a matrix whose columns are the first K eigenvectors (K = 2, the number of groups we will form).
  5. Normalize the matrix by normalizing the rows to norm 1.
  6. Use k-means to cluster the points of this matrix (the rows) into K groups.
  7. Repeat on the resulting clusters until the number of iterations is above 100, the number of runs of each group is less or equal to than S_{min}, and the lower quartile of the similarity matrix is greater than 0.5.

The is done for each cluster found using the original method.

Similarity Matrix

There are two different ways that the similarity matrix is found. One for the optimal clustering method and one for the haplotypic similarity method.

For optimal clustering, the similarity matrix is defined as [Zhang2013]:

S = \{s_{ij}\}_{i,j = 1,...,m}

where

s_{ij} = min\{\frac{\delta}{l_i}, \frac{\delta}{l_j}\}

where \delta is the amount of overlap between two runs, in SNPs, l_i and l_j are the lengths of runs i and j, in SNPs, and m is the number of runs in this cluster.

For haplotypic similarity clustering, the similarity matrix is defined as [Zhang2013]:

S = \{s_{ij}\}_{i,j = 1,...,m}

where m is the number of runs in a cluster and i and j are the runs and

s_{ij} = \begin{cases}\frac{\sum\limits_{k = start}^{end}{\begin{cases} 1,& \text{if } C_{ik} = C_{jk}\\ 0,& \text{else} \end{cases}}}{\delta},& \text{if } \frac{\delta}{l_i} \geq \text{ 0.5 and } \frac{\delta}{l_j} \geq \text{ 0.5}\\ 0,& \text{else} \end{cases}

where start is the start index (in SNPs) of the overlap region between runs i and j. And end is the end index of that region. C_{ik} and C_{jk} are the calls of the two runs at index k. And \delta is the same as defined before.

In plain English, the similarity is the percent of SNPs over the overlapping regions of runs i and j with identical alleles. However, if their overlap region, \delta, is less than 0.5 of either the i th or j th run, the similarity is set at 0.

Degree Matrix

The degree matrix D, a diagonal matrix, for the similarity matrix is defined as [Zhang2013]:

d_i = \sum\limits_{j = 1}^{m}{s_{ij}}

where m is the number of runs in the cluster.

Normalized Symmetric Laplacian

The normalized symmetric laplacian matrix can be found from the degree and similarity matrices, matrices D and S, respectively.

There are different ways to get the laplacian matrix.

First method [Zhang2013]:

L = I - D^{1/2}SD^{1/2}

Second method [Ng2002]:

L = D^{1/2}SD^{1/2}

However, to reduced computation time, the second method can be reduced to:

l_{ij} = \frac {s_{ij}}{\sqrt {d_i d_j}}

The reduced version of the second method is the one used.

Eigenvectors of Laplacian Matrix

It has been shown that the eigenvectors of a Laplacian matrix can be used to find semi-optimal cuts in a graph [Ng2002] , and through the modifications made above, they can be used to find cuts amongst the runs in our cluster [Zhang2013].

We then put the eigenvectors corresponding to the K largest eigenvalues into the matrix U column wise.

Normalize

The last step before the k-means algorithm is performed is to normalize the rows of the matrix U.

The new matrix T is described as [Zhang2013]:

t_{ij} = \frac {u_{ij}}{\sqrt{\sum \limits_{k} u^2_{ik}}}

K-Means

Now that we have obtained the eigenvectors, and modified them, from the laplacian matrix, we can use the rows as points for the k-means algorithm. This essentially uses the second eigenvector to determine the clustering.

There are a few different k-means algorithms, the main ones are Lloyd, Hartigan-Wong, Forgy, and MacQueen. The Hartigan-Wong implementation generally performs better and is the method used.

Before running k-means on matrix T, initial cluster centers are found to help minimize the run time and to help ensure that the algorithm converges on a solution. To find these initial cluster centers, we first find the mean point. The mean point is the mean of the first column and the mean of the second column of our matrix T. We then find the distance between each point and this mean. Then the points are sorted in ascending order by their distance to the mean. And lastly, for each cluster L, (where L = 1, 2, ... K), the 1 + (L - 1) \frac {M}{K} th point is chosen to be the initial center for cluster L [Hartigan1979].

The k-means algorithm is then run on the matrix T and these initial centers.

After k-means, the runs are assigned to their respective clusters and the start and end indexes for these new clusters are found.

The start and end indexes of the clusters are found by ordering the start and end indexes of the runs in the cluster into ascending order. Then while iterating over the list of sorted start indexes, the first instance where at least 75% percent of the runs overlap this index is chosen as the new start index. The same is done to find the end index. However, if there isn’t an index that reaches 75% coverage, then the closest index is chosen [Reber2013].

Now that the new clusters are established, the clustering process is started again on these and continues this until one of the end conditions is met, described above in Steps.

ROH Association Analysis

After the ROH analysis is complete, association studies can be performed by joining either the “First column of each cluster” or the “One column per marker” spreadsheet with phenotypic data. Set the phenotypic variable in the joined spreadsheet as a dependent variable and perform numeric association tests. See Numeric Association Tests for more information.

ROH GUI Dialog

There are two versions of The Runs of Homozygosity (ROH) analysis feature. The version with options more appropriate for GWAS data can be found in the spreadsheet menu Genotype > Runs of Homozygosity for GWAS. The second version, with options more appropriate for NGS data can be found in the spreadsheet menu DNA-Seq > Runs of Homozygosity for NGS. For both versions, the spreadsheet must contain genotypic data and be marker mapped. However, for Runs of Homozygosity for NGS the spreadsheet also have a Reference Field in the marker map.

The first tab has options for how a run should be composed, the number of heterozygotes allowed, number of missing genotypes allowed, etc. The second tab has options related to the output spreadsheets, here you can select which spreadsheets to output and settings for clustering.

ROH options for GWAS data

Runs of Homozygosity for GWAS GUI Dialog

ROH options for NGS data

Runs of Homozygosity for NGS GUI Dialog

ROH Options

When the Runs of Homozygosity dialog opens (ROH GUI Dialog), there are several analysis options.

Minimum Run Length

The minimum run length is the length a sequence of homozygous SNPs must be to be considered a run of homozygosity. The minimum run length can be specified in two ways.

  1. Distance: Specify the minimum run length in Kb (kilobase pairs) based on the genomic position information in the marker map. This option also allows for the specification of the minimum number of SNPs for the specified minimum distance.

    Keyword Argument minLengthKBase minSnpsKBase
    Value Type Real Type Integer Type
    Constraints [1.0, inf) Kb [2, inf)
    Default Value 500 25
    Inter-option constraints minLengthSnps is not used minLengthSnps is not used
  2. SNPs: Specify the minimum run length solely based on the number of SNPs in a run, regardless of how far the markers are apart.

    Keyword Argument minLengthSnps
    Value Type Integer Type
    Constraints [2, inf)
    Default Value 100
    Inter-option constraints minLengthKBase and minSnpsKBase are not used

Number of Heterozygotes

The optional parameter Allow runs to contain up to ... heterozygote(s) allows for the specification of the number of heterozygotes allowed in a run of homozygotes in order to be considered a “run of homozygous markers”. If this option is unchecked then no heterozygous markers are allowed in a run. If, however, one or more heterozygotes are allowed, check this box and specify the maximum number allowed. Selecting this option also allows the selection of, Allow runs to contain up to ... heterozygote(s) within a ... variant window and Allow runs to contain up to ... consecutive heterozygote(s).

Allow runs to contain up to ... heterozygote(s) within a ... variant window specifies how many heterozygote(s) are allowed in a certain window that’s specified in terms of number of variants. Allow runs to contain up to ... consecutive heterozygote(s) specifies how many heterozygote(s) are allowed consecutively in a run.

Note

The algorithm will try to make runs at least as long as the minimum run length specified. If at least one heterozygote is allowed the algorithm will pull in flanking heterozygote genotypes to form a run.

If heterozygotes are allowed it is recommended that the genotypes making up the run are examined if it matters if the heterozygotes are inside or flanking the run.

Keyword Argument allowHetero maxHetero
Value Type Binary Type Integer Type
Constraints True or False [1, inf)
Default Value True 1
Inter-option constraints None allowHetero = True
Keyword Argument useHeteroDensity heteroDensity heteroDensityRange
Value Type Binary Type Integer Type Integer Type
Constraints True or False [1, inf) [2, inf)
Default Value False 1 5
Inter-option constraints allowHetero = True allowHetero = True, useHeteroDensity = True allowHetero = True, useHeteroDensity = True
Keyword Argument useHeteroConsec maxHeteroConsec
Value Type Binary Type Integer Type
Constraints True or False [1, inf)
Default Value False 1
Inter-option constraints allowHetero = True allowHetero = True, useHeteroConsec = True

Missing Genotypes

One of the following options must be selected:

If using the NGS dialog:
  1. Treat missing as homozygous reference calls: Any missing genotypes in a run will be treated as if they were homozygous reference.
  2. Treat as missing: Missing genotypes will be treated as missing, checking this will allow you to select other options to restrict the amount of missings in a run.
If using the GWAS dialog:
  1. Allow runs to contain up to ... missing genotype(s): Specify allowable number of missing genotypes. The number selected must be an integer greater than or equal to 0.
  2. Allow any number of missing genotypes: Selecting this option still requires that the first and last genotype be homozygote genotypes.

Note

The algorithm will not penalize a run for having a missing genotype at the edge of a run if at least one missing genotype is allowed and if the minimum run length size would have been met if the missing genotype was a homozygous genotype. Instead the total run length will be reduced to reflect the number of flanking missing genotypes.

If a minimum run length of 10 SNPs is specified and one missing genotype is allowed, it is possible to see a run length of 9 with one missing genotype.

If missing genotypes are allowed it is recommended that the genotypes making up the run are examined if it matters if the missing values are inside or flanking the run.

Keyword Argument treatMissingAsHomo (NGS) restrictMissing
Value Type Binary Type Binary Type
Constraints True or False True or False
Default Value False False
Inter-option constraints restrictMissing = False treatMissingAsHomo = False (NGS), None (GWAS)
Keyword Argument maxMissing
Value Type Integer Type
Constraints [0, inf)
Default Value 5
Inter-option constraints restrictMissing = True, treatMissingAsHomo = False (NGS)
Keyword Argument useMissingDensity missingDensity missingDensityRange
Value Type Binary Type Integer Type Integer Type
Constraints True or False [1, inf) [2, inf)
Default Value False 1 5
Inter-option constraints restrictMissing = True, treatMissingAsHomo = False (NGS) restrictMissing = True, treatMissingAsHomo = False (NGS), useMissingDensity = True restrictMissing = True, treatMissingAsHomo = False (NGS), useMissingDensity = True
Keyword Argument useMissingConsec maxMissingConsec
Value Type Binary Type Integer Type
Constraints True or False [1, inf)
Default Value False 1
Inter-option constraints restrictMissing = True, treatMissingAsHomo = False (NGS) restrictMissing = True, treatMissingAsHomo = False (NGS), useMissingConsec = True

Note

  • To not allow any missing genotypes in a run, select Allow runs to contain up to ... missing genotype(s) and set the value to 0.
  • If:
    • treatMissingAsHomo = True, then any number of missing genotypes are allowed in a run as they will be treated as homozygous reference calls.
    • restrictMissing = False, then any number of missing genotypes are allowed in a run of homozygous genotypes.
    • restrictMissing = True, then the number of missing genotypes in a run is restricted to the specified number.

Maximum Gap Between SNPs

The optional parameter Maximum gap between SNPs in a run: ... kb restricts how runs are formed. If the maximum gap is exceeded, shorter runs are formed if possible. Otherwise, no runs are formed because the distance between at least two SNPs is more than the specified amount. These options are only available in the GWAS version of ROH.

Keyword Argument restrictGap maxGap
Value Type Binary Type Real Type
Constraints True or False [1.0, inf) Kb
Default Value True 100.0
Inter-option constraints None restrictGap = True

Note

If:

  • restrictGap = False, then the maximum gap in Kb between SNPs in a run is not restricted.
  • restrictMissing = True, then the maximum gap in a run is restricted to the specified size.

Minimum Density of a Run

The optional parameter Minimum density of a run: 1 SNP per ... kb restricts how runs are formed. If the minimum density is selected and the criterion is not met, then a run is not formed. These options are only available in the GWAS version of ROH.

Keyword Argument restrictDensity minDensity
Value Type Binary Type Real Type
Constraints True or False [1.0, inf) Kb
Default Value False 50.0
Inter-option constraints None restrictDensity = True

Note

If:

  • restrictDensity = False, then the minimum density of SNPs in a run in Kb is not restricted.
  • restrictMissing = True, then the minimum density in a run is restricted to the specified value.

ROH Output Spreadsheets

At least one of the following output spreadsheets must be selected.

  • Create a spreadsheet of runs per sample:

    Creates the Homozygous Runs of Length >= ... spreadsheet. This spreadsheet details each run of homozygosity found. There are columns for the chromosome, start and end genomic position in base pairs, length of the run, start and end column indexes, the number of SNPs in the run, the number of missing values and the number of heterozygotes.

  • Create a spreadsheet with binary ROH run status:

    Creates the Binary ROH Status for Runs of Length >= .. spreadsheet. This spreadsheet indicates if the data from each marker is part of a run of homozygosity for each sample. For instance, if the genotype for a marker is homozygous for a particular sample, and there are enough surrounding and consecutive homozygous genotypes for a sample to constitute a run, then a 1 is placed in the cell for the marker and sample. If, on the other hand, for a particular sample there is a homozygous genotype marker surrounded by two heterozygous genotypes, then a 0 is placed in the cell for the homozygous marker and sample. All heterozygous genotypes are replaced with 0’s unless a run is allowed to have one or more heterozygotes.

  • Create a spreadsheet with the incidence of common runs per SNP:

    Creates the Incidence of Common Runs per SNP spreadsheet. This spreadsheet displays columns for the original SNP column number, the number of runs associated with each SNP (i.e. the number of samples that have a run overlapping the SNP), and the chromosome number for the SNP. Row labels are the SNP names.

  • Output Cluster of Runs for Association Studies:

    This option requires specification of the minimum number of samples for a cluster in the Minimum # samples that must contain a run: option. The minimum number of samples is the number of samples that must share a run of homozygosity for a run to be considered “common”. See Runs of Homozygosity Formulas. The Cluster of Runs with >=... spreadsheet displays information about the cluster of SNPs where common runs of homozygosity were found including the genomic range of the cluster and the number of SNPs in the region. Additionally, spreadsheets can be generated which contain the proportion of SNPs in each cluster that are members of common runs of homozygosity. This information can be output in two ways as described below.

    In addition, the optimal and haplotypic similarity cluster spreadsheets can be be outputted, they have the same options as the regular cluster outputs.

  • Cluster Outputs: Spreadsheets that contain information about the proportion of SNPs in each cluster that are members of common runs of homozygosity.

    • First column of each cluster: The values are generated for the first column (or marker) of each cluster. This format is best for association tests, as there is a reduction in the total number of tests and in the multiple testing correction.
    • One column per marker: The values are generated for all columns (or markers) for each cluster. This format is best if visualization of the data in a heat map is desired.

Runs of Homozygosity References

[Lencz2007] [Zhang2013] [Hartigan1979] [Ng2002] [Reber2013]