2. Identify Runs of Homozygosity

  • To identify runs of homozygosity, open 500K HapMap - Sheet 1 and select Analysis > Runs of Homozygosity. The Runs of Homozygosity window will appear.
Figure 4. ROH Parameter Options

  • For this tutorial, choose the Distance: radio button under Minimum run length: and enter 1500. Leave 25 in the second box as the min # SNPs:.
  • Make sure to check all of the boxes under Output Runs and enter 10 under Minimum # samples that must contain a run:. Check that the window matches Figure 4 and click Run.

For an explanation of the parameters on this window see the Runs of Homozygosity chapter in the SVS Manual.

The algorithm first sweeps through the data row-wise, one sample at a time, and internally builds a binary matrix in which a ‘1’ marks a SNP that lies in a run of homozygosity for that sample. It then moves through the binary matrix column-wise, counting how many samples have a ‘1’ at each SNP, to determine whether that SNP falls in a common ROH. For this example, a SNP belongs to a cluster of common ROHs only if at least 10 samples have a ‘1’ for it, and a cluster may span multiple SNPs. That is, a cluster starts at the first SNP with 10 or more ‘1s’ in the binary matrix and extends until a SNP with fewer than 10 ‘1s’ is found. Another cluster begins at the next SNP with 10 or more ‘1s’.
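The column-wise step can be sketched in a few lines of Python. This is only an illustration of the counting logic described above, not the SVS implementation: the function name `find_clusters`, the toy matrix, and the threshold of 3 are all hypothetical, and the sketch ignores chromosome boundaries.

```python
from typing import List, Tuple


def find_clusters(binary: List[List[int]], min_samples: int) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) column indices of common-ROH clusters.

    `binary` holds one row per sample and one column per SNP; a 1 means the
    SNP lies in a run of homozygosity for that sample (assumed input format).
    """
    n_snps = len(binary[0]) if binary else 0
    # Column-wise count of samples that have a 1 at each SNP.
    counts = [sum(row[j] for row in binary) for j in range(n_snps)]

    clusters = []
    start = None  # start column of the cluster currently being extended
    for j, count in enumerate(counts):
        if count >= min_samples and start is None:
            start = j                        # a new cluster begins here
        elif count < min_samples and start is not None:
            clusters.append((start, j - 1))  # the cluster ended at the previous SNP
            start = None
    if start is not None:                    # a cluster that runs to the last SNP
        clusters.append((start, n_snps - 1))
    return clusters


# Hypothetical toy matrix: 4 samples (rows) by 7 SNPs (columns).
toy = [
    [1, 1, 1, 0, 0, 1, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 1, 1, 1, 0, 0, 1],
    [0, 0, 0, 1, 0, 1, 1],
]
print(find_clusters(toy, min_samples=3))  # [(1, 2), (5, 6)]
```

With a threshold of 3, columns 1 to 2 and columns 5 to 6 each have at least three samples with a ‘1’, so two clusters are reported.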

Illustrative Example

Consider the following abbreviated example of five samples. Suppose the input parameters are 10 for minimum run length (in SNPs) and 3 for minimum # of samples. For each sample, every SNP in a horizontal run of 10 or more homozygous SNPs is denoted with a ‘1’. The highlighted regions are vertical clusters in which at least 3 samples have a ‘1’ in the matrix.

ROH Clusters
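The row-wise marking used in this example can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `mark_runs` and the input format (one boolean per SNP, `True` for a homozygous genotype) are hypothetical, only a minimum SNP count is applied, and no heterozygous or missing genotypes are tolerated inside a run, whereas the SVS dialog offers further parameters.

```python
from typing import List


def mark_runs(is_hom: List[bool], min_snps: int) -> List[int]:
    """Flag with 1 every SNP inside a run of at least `min_snps` consecutive
    homozygous genotypes; all other SNPs get 0."""
    flags = [0] * len(is_hom)
    n = len(is_hom)
    i = 0
    while i < n:
        if not is_hom[i]:
            i += 1
            continue
        # Extend j to one past the end of the current homozygous stretch.
        j = i
        while j < n and is_hom[j]:
            j += 1
        if j - i >= min_snps:       # the stretch is long enough to count as a run
            for k in range(i, j):
                flags[k] = 1
        i = j                       # resume scanning after the stretch
    return flags


# Hypothetical sample: four homozygous SNPs, one heterozygous, two homozygous.
sample = [True, True, True, True, False, True, True]
print(mark_runs(sample, min_snps=3))  # [1, 1, 1, 1, 0, 0, 0]
```

The trailing stretch of two homozygous SNPs is shorter than the minimum, so it stays at 0 even though the genotypes themselves are homozygous.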

Having completed this matrix, the algorithm then computes, for every sample, the fraction of ‘1s’ within each cluster. The example matrix above would produce the table below as the First Column – Cluster of Runs … spreadsheet, where the label ROH1 stands for the label of the first SNP in the first cluster.

Samples
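The fraction calculation can be sketched as below. The function name `cluster_fractions`, the toy matrix, and the cluster boundaries are hypothetical and chosen only to illustrate the arithmetic; they are not taken from the tutorial's figures.

```python
from typing import List, Tuple


def cluster_fractions(
    binary: List[List[int]], clusters: List[Tuple[int, int]]
) -> List[List[float]]:
    """For each sample (row), return the fraction of 1s inside each cluster.

    `clusters` holds inclusive (start, end) column indices.
    """
    return [
        [sum(row[start:end + 1]) / (end - start + 1) for start, end in clusters]
        for row in binary
    ]


# Hypothetical binary matrix (4 samples by 7 SNPs) and two clusters.
matrix = [
    [1, 1, 1, 0, 0, 1, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 1, 1, 1, 0, 0, 1],
    [0, 0, 0, 1, 0, 1, 1],
]
example_clusters = [(1, 2), (5, 6)]
for fractions in cluster_fractions(matrix, example_clusters):
    print(fractions)
# [1.0, 1.0]
# [1.0, 1.0]
# [1.0, 0.5]
# [0.0, 1.0]
```

Each output row corresponds to a sample and each value to one cluster, which mirrors the layout of one column per cluster and one row per sample.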

For more information about the ROH algorithm, see the Runs of Homozygosity Formulas section in the SVS Manual.

Running Runs of Homozygosity creates six spreadsheets, each of which is explained below.

  • Close all of the spreadsheets except for First Column – Cluster of Runs.

The First Column – Cluster of Runs spreadsheet (Figure 5) is analogous to the example table above. It contains the first SNP of each cluster where common ROHs were found. Each column represents a cluster of SNPs, labeled by the name of its first SNP. Each row represents a sample from the HapMap spreadsheet. Each value is the fraction of SNPs in that cluster that are members of common ROHs for that sample. Compared to the Every Column – Cluster of Runs spreadsheet, which contains every SNP in each cluster, this format is better for association testing: it reduces the total number of tests, so the multiple-testing correction is less severe.
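To make the multiple-testing point concrete, the sketch below compares per-test significance thresholds under a Bonferroni correction. Bonferroni is used here only as one common correction, not necessarily the one SVS applies; the count of 24 tests matches the 24 cluster columns reported later in this section, while the per-SNP count of 2400 is a made-up figure for illustration.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


alpha = 0.05
per_cluster = bonferroni_threshold(alpha, 24)    # one test per cluster column
per_snp = bonferroni_threshold(alpha, 2400)      # hypothetical: one test per SNP column

print(f"per-cluster threshold: {per_cluster:.6f}")
print(f"per-SNP threshold:     {per_snp:.6f}")
```

With fewer tests the per-test threshold is larger, so a given association needs less extreme evidence to remain significant after correction.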

Figure 5. First Column Cluster of Runs Spreadsheet

Notice column one, which indicates that 10 or more of the 270 samples in the dataset had a run of at least 25 SNPs spanning at least 1500 kb. This cluster begins at SNP_A-2076482; if you click on the green Map button in the top-left corner, you will see that it starts at physical position 35100862 on Chromosome 1. Keep this spreadsheet open.

  • Now open the Cluster of Runs… spreadsheet (Figure 6).
Figure 6. Cluster of Runs spreadsheet

This spreadsheet lists the clusters summarized in the previous spreadsheet. Notice that there are 24 rows labeled by Cluster ID. These 24 rows correspond to the 24 columns in the First Column – Cluster of Runs spreadsheet. You can verify this by clicking on the green Map button in the top-left corner of the First Column – Cluster of Runs spreadsheet and comparing the chromosome and start position to each cluster in the Cluster of Runs spreadsheet.

  • Close both of these spreadsheets.
  • Open the Homozygous Runs spreadsheet (Figure 7).
Figure 7. Homozygous Runs spreadsheet

This spreadsheet is similar to the Cluster of Runs spreadsheet but contains details of all the ROHs found, not just the ones common among at least 10 samples. For each run it lists the chromosome, the start and end of the run given both as a physical position in the genome (Start Position) and as a column number from the marker mapped spreadsheet (Start Index), the run length in number of SNPs and in base pairs, the number of missing genotypes in the run, and the number of heterozygous markers. Each row represents one ROH and is labeled by the sample name. A sample has more than one row if it contains more than one ROH.

Figure 8. Every Column Cluster of Runs spreadsheet

  • Open the Every Column – Cluster of Runs spreadsheet (Figure 8).

This spreadsheet is analogous to the First Column – Cluster of Runs spreadsheet, but repeats the information for each SNP in every cluster. This format is best for visualizing the data in a heat map.

Figure 9. Common ROH Heat Map

  • Create a heat map of this spreadsheet by choosing Plot > Heat Map.
  • Within the plot (Figure 9) select the Every Column – Clusters of Runs… node in the Graph Control Interface. Then choose the Color tab and select 2 Color Auto. Click on the red box and change the color to white. This plot shows you where the clusters are located on the genome.
  • Close this plot and spreadsheet.
Figure 10. Incidence of Common Runs spreadsheet

  • Open the Incidence of Common Runs per SNP spreadsheet (Figure 10). This spreadsheet contains the number of runs that each SNP is included in, along with the column number that corresponds to the columns from the marker mapped spreadsheet. The chromosome in which the SNP is located is also included.
  • Close this spreadsheet.
  • Now open the last spreadsheet, Binary ROH Status… (Figure 11). This spreadsheet is the binarized version of the HapMap data, with a 1 marking a SNP that lies within a run of homozygosity for that sample and a 0 otherwise, as in the illustrative example above. This spreadsheet is also useful for heat map visualization.
Figure 11. Binary ROH Status spreadsheet

  • Select Plot > Heat Map and change the color scheme as you did before.

This plot (Figure 12) differs from the previous heat map because it shows all homozygous runs, not just the ones found in clusters among samples. Notice also that there is more white space along the bottom of the map (Sample index > 180). In the HapMap data, the large sample indices correspond to the African (YRI) population. African populations are known to be more genetically diverse and therefore contain less homozygosity on average than the other HapMap groups.

Figure 12. All ROH Heat Map

Since we are testing for association based on homozygosity, it is important to consider population stratification in your own analyses as we will in this tutorial. Close this plot and spreadsheet.
