For an explanation of the parameters on this window see the Runs of Homozygosity chapter in the SVS Manual.
The algorithm begins by sweeping through the data row-wise for each sample and internally creating a binary matrix, in which a ‘1’ marks a SNP that lies within a homozygous run for that sample. It then runs through the binary matrix column-wise, looking at each SNP’s binary values to determine whether or not a given SNP falls in a common ROH. Within each column, the algorithm counts how many samples have a ‘1’ for that SNP. For this example, to define a cluster of SNPs that fall in ROHs, there must be at least 10 samples with a ‘1’ for each SNP, and runs may include multiple SNPs. That is, a run starts at the first SNP with 10 or more ‘1s’ in the binary matrix and extends until a SNP is found having fewer than 10 ‘1s’. Another run begins when another SNP is found with 10 or more ‘1s’ in the binary matrix.
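The column-wise pass described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the SVS implementation: the function name `find_clusters`, the list-of-lists matrix layout (one row per sample, one column per SNP), and the toy data are all assumptions made for this example.

```python
from typing import List, Tuple


def find_clusters(binary: List[List[int]], min_samples: int) -> List[Tuple[int, int]]:
    """Return (start, end) SNP index pairs (end inclusive) for each cluster.

    Hypothetical helper: `binary` holds one row per sample and one column
    per SNP, with 1 meaning the SNP lies in a homozygous run for that sample.
    """
    if not binary:
        return []
    n_snps = len(binary[0])
    # Column-wise count of samples that carry a 1 at each SNP.
    counts = [sum(row[j] for row in binary) for j in range(n_snps)]

    clusters = []
    start = None  # index of the first SNP of the cluster currently open
    for j, count in enumerate(counts):
        if count >= min_samples:
            if start is None:
                start = j  # a new cluster begins here
        elif start is not None:
            clusters.append((start, j - 1))  # cluster ended at the previous SNP
            start = None
    if start is not None:
        clusters.append((start, n_snps - 1))  # cluster runs to the last SNP
    return clusters


# Toy matrix (made-up data): 4 samples by 7 SNPs.
toy = [
    [1, 1, 1, 0, 0, 1, 1],
    [1, 1, 0, 0, 0, 1, 1],
    [0, 1, 1, 0, 1, 1, 0],
    [0, 0, 1, 0, 1, 0, 0],
]
print(find_clusters(toy, min_samples=2))  # [(0, 2), (4, 6)]
```

Raising `min_samples` to 3 on the same toy matrix shrinks the clusters to `[(1, 2), (5, 5)]`, which shows how the threshold controls both the number and the width of the clusters.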
Illustrative Example
Consider the following abbreviated example of five samples. Let’s say the input parameters are 10 for minimum run length and 3 for minimum # of samples. For each sample, each SNP in a horizontal run of 10 or more homozygous SNPs is denoted with a ‘1’. The highlighted regions are vertical clusters of at least 3 samples with a ‘1’ in the matrix.
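To make the row-wise step concrete, here is a minimal Python sketch that flags the SNPs of a single sample. It is a simplification under stated assumptions: genotypes are written as two-character strings such as "AA" or "AB" (a made-up encoding), "??" stands for a missing call, and any heterozygous or missing genotype ends a run. The actual feature has additional parameters, so treat the names `is_homozygous` and `mark_runs` as hypothetical.

```python
from typing import List


def is_homozygous(genotype: str) -> bool:
    """True for calls like 'AA'; False for 'AB' or the assumed missing code '??'."""
    return len(genotype) == 2 and genotype[0] == genotype[1] and genotype[0] != "?"


def mark_runs(genotypes: List[str], min_run_length: int) -> List[int]:
    """Return one flag per SNP: 1 if the SNP sits in a long enough homozygous run."""
    flags = [0] * len(genotypes)
    run_start = 0  # index where the current homozygous stretch began
    # Iterate one step past the end so the final stretch is also closed.
    for i in range(len(genotypes) + 1):
        at_end = i == len(genotypes)
        if at_end or not is_homozygous(genotypes[i]):
            if i - run_start >= min_run_length:
                for k in range(run_start, i):
                    flags[k] = 1
            run_start = i + 1  # the next stretch can start after this SNP
    return flags


# One made-up sample with 11 SNPs and a short minimum run length of 3.
sample = ["AA", "AA", "AA", "AB", "GG", "GG", "AB", "CC", "CC", "CC", "CC"]
print(mark_runs(sample, min_run_length=3))  # [1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1]
```

With the minimum run length of 10 used in this example, the same toy sample would produce all zeros, because its longest homozygous stretch is only four SNPs.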
Having completed this matrix, the algorithm then computes the fraction of ‘1s’ within each run for every sample. The example matrix above would produce the table below as the First Column – Cluster of Runs … spreadsheet where the label ROH1 will be the first SNP label in the first ROH.
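The fraction calculation can be sketched as follows. As before, this is an illustration rather than the SVS code: the function name `cluster_fractions`, the inclusive (start, end) cluster format, and the toy matrix are assumptions for this example.

```python
from typing import List, Tuple


def cluster_fractions(
    binary: List[List[int]], clusters: List[Tuple[int, int]]
) -> List[List[float]]:
    """Return one row per sample and one column per cluster.

    Each value is the fraction of SNPs in the cluster (start and end
    inclusive) at which that sample has a 1 in the binary matrix.
    """
    table = []
    for row in binary:
        fractions = []
        for start, end in clusters:
            n_snps = end - start + 1
            fractions.append(sum(row[start:end + 1]) / n_snps)
        table.append(fractions)
    return table


# Toy matrix (made-up data): 4 samples by 7 SNPs, with two clusters.
toy = [
    [1, 1, 1, 0, 0, 1, 1],
    [1, 1, 0, 0, 0, 1, 1],
    [0, 1, 1, 0, 1, 1, 0],
    [0, 0, 1, 0, 1, 0, 0],
]
toy_clusters = [(0, 2), (4, 6)]
for fractions in cluster_fractions(toy, toy_clusters):
    print([round(f, 2) for f in fractions])
```

In this toy case the first sample has a ‘1’ at all three SNPs of the first cluster (fraction 1.0) and at two of the three SNPs of the second cluster (fraction of about 0.67).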
For more information about the ROH algorithm, see the Runs of Homozygosity Formulas section in the SVS Manual.
Running Runs of Homozygosity creates six spreadsheets, which are explained below.
The First Column - Clusters of Runs spreadsheet (Figure 5) is analogous to the example table above. It contains the first SNP of each cluster where common ROHs were found. Each column represents a cluster of SNPs labeled by the first SNP’s name. Each row represents a sample from the HapMap spreadsheet. Each column in the spreadsheet shows the fraction of SNPs in the cluster that are members of common ROHs for each sample. Compared to the Every Column - Cluster of Runs spreadsheet, which contains every SNP in the run, this format is better for association testing as there is a reduction in the total number of tests, and therefore, fewer multiple testing corrections are required.
Notice column one, which indicates that 10 or more of the 270 samples in the dataset had a run of at least 25 SNPs and at least 1500 kb in length. This cluster begins at SNP_A-2076482, and if you click on the green Map button in the top-left corner you will see that it starts at physical position 35100862 on Chromosome 1. Keep this spreadsheet open.
This spreadsheet displays the clusters identified by the previous spreadsheet. Notice that there are 24 rows labeled by Cluster ID. These 24 rows correspond to the 24 columns in the First Column - Cluster of Runs spreadsheet. This can be verified by clicking on the green Map button in the top-left corner of the First Column - Cluster of Runs spreadsheet and comparing the chromosome and start position to each cluster in the Cluster of Runs spreadsheet.
This spreadsheet is similar to the Cluster of Runs spreadsheet but contains details of all the ROHs found, not just the ones common among at least 10 samples. Displayed is information for the chromosome, start and end positions of the runs in terms of both physical position in the genome (Start Position) and column number from the marker mapped spreadsheet (Start Index), the run length in number of SNPs and base pairs, the number of missing genotypes in the run and the number of heterozygous markers. Each row represents one ROH and is labeled by the sample name. There may be more than one row for a sample if that sample contains more than one ROH.
This information is analogous to the First Column – Cluster of Runs spreadsheet, but has information repeated for each SNP in every cluster. This format is best for visualization of the data in a heat map.
This plot (Figure 12) differs from the previous heat map because it shows all homozygous runs, not just the ones found in clusters among samples. Also notice that there is more white space along the bottom of the map (Sample index > 180). In the HapMap data, the large sample indexes correspond to the African (YRI) population. African populations are known to be more genetically diverse and therefore show less homozygosity on average than the other HapMap groups.
Since we are testing for association based on homozygosity, it is important to consider population stratification in your own analyses, as we will in this tutorial. Close this plot and spreadsheet.