Next, find and filter samples that are “related” to other samples. Relatedness usually means family relatedness, but identity by descent (IBD) estimation can also detect duplicate samples, samples duplicated on one of a pair of genotyping chips but not the other, and sample contamination. Before estimating IBD, standard practice is to prune SNPs that are in linkage disequilibrium (LD) with one another, so that the estimate rests on approximately independent markers and the computation runs faster.
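To make the idea of LD pruning concrete, here is a minimal sketch in Python. It is not the algorithm the software runs; it assumes genotypes coded 0/1/2, measures LD as the squared Pearson correlation between two markers, and uses a hypothetical window size and r² threshold chosen only for illustration.

```python
from statistics import mean

def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 coding)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:  # monomorphic marker: correlation is undefined
        return 0.0
    return sxy * sxy / (sxx * syy)

def ld_prune(genotypes, window=50, threshold=0.5):
    """Greedy pruning: keep a marker only if its r^2 with every recently
    kept marker (within `window` positions) is at or below `threshold`.
    `window` and `threshold` are illustrative values, not software defaults."""
    kept = []
    for i, g in enumerate(genotypes):
        recent = [k for k in kept if i - k < window]
        if all(r_squared(g, genotypes[k]) <= threshold for k in recent):
            kept.append(i)
    return kept

# Three toy markers scored on six samples (made-up data).
markers = [
    [0, 1, 2, 1, 0, 2],
    [0, 1, 2, 1, 0, 2],  # identical to marker 0, so r^2 = 1
    [2, 0, 1, 2, 1, 0],  # r^2 = 0.25 with marker 0
]
print(ld_prune(markers))  # [0, 2]
```

Marker 1 is dropped because it is perfectly correlated with marker 0, while marker 2 survives because its r² with marker 0 is below the threshold.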
This will take a few minutes. When it finishes, 286,795 markers are deactivated, as indicated by the grayed-out columns in the spreadsheet. The upper-right portion of the window also shows that only 212,469 of the 499,264 columns are active. IBD estimation should use only the active columns (non-correlated markers) on autosomal chromosomes, so first create a column subset spreadsheet and then inactivate the X chromosome before continuing.
Choose Select > Column > Column Subset Spreadsheet.
Then from the subset spreadsheet, choose Select > Activate by Chromosomes, uncheck the X chromosome and click OK.
Note
In datasets that contain other non-autosomal chromosomes, inactivate these as well.
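Outside the GUI, the same restriction to autosomal markers can be sketched as a simple filter. This is only an illustration: the marker names are hypothetical, and the code assumes human data, where the autosomes are chromosomes 1 through 22.

```python
# Assumption: human data, so autosomes are chromosomes 1-22.
AUTOSOMES = {str(c) for c in range(1, 23)}

def autosomal_only(marker_map):
    """Return the names of markers that lie on an autosome.

    `marker_map` is a list of (marker_name, chromosome) pairs; the
    chromosome label may be an int or a string such as "X" or "MT".
    """
    return [name for name, chrom in marker_map if str(chrom) in AUTOSOMES]

# Hypothetical marker map for illustration only.
marker_map = [
    ("snp_a", "1"),
    ("snp_b", "22"),
    ("snp_c", "X"),
    ("snp_d", "MT"),
    ("snp_e", 7),
]
print(autosomal_only(marker_map))  # ['snp_a', 'snp_b', 'snp_e']
```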
In the Project Navigator rename this node to Pruned SNP Subset.
Now you are ready for IBD estimation.
This will take a few minutes. Upon completion, two spreadsheets are output: IBD Estimate: Estimated PI and Pairwise IBD Estimates (PI >=0).
IBD Estimate: Estimated PI gives an N x N table, where N is the number of samples in the dataset. Plotting a heat map of this table reveals patterns of relatedness. Pairwise IBD Estimates (PI >=0) lists various IBD statistics for each pair of samples. These values can be sorted or plotted to find samples related to one another or to detect sample contamination.
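As a rough sketch of how these two outputs relate, the Python below scans an N x N PI table for pairs at or above a cutoff. It also shows the common definition of PI as P(IBD=2) + 0.5 × P(IBD=1); whether the software computes it exactly this way is an assumption here. The sample IDs and matrix values are made up.

```python
def pi_hat(z1, z2):
    """Proportion of the genome shared IBD, from the probabilities of
    sharing one (z1) and two (z2) alleles IBD."""
    return z2 + 0.5 * z1

def related_pairs(pi, sample_ids, threshold=0.25):
    """List (sample_a, sample_b, pi) for every pair in the upper triangle
    of the symmetric N x N table with PI >= threshold, highest PI first."""
    pairs = []
    n = len(sample_ids)
    for i in range(n):
        for j in range(i + 1, n):  # skip the diagonal and mirrored entries
            if pi[i][j] >= threshold:
                pairs.append((sample_ids[i], sample_ids[j], pi[i][j]))
    return sorted(pairs, key=lambda p: p[2], reverse=True)

# Made-up example: four samples with a symmetric PI table.
ids = ["S1", "S2", "S3", "S4"]
pi = [
    [1.00, 0.50, 0.02, 0.00],
    [0.50, 1.00, 0.26, 0.01],
    [0.02, 0.26, 1.00, 1.00],
    [0.00, 0.01, 1.00, 1.00],
]
for a, b, value in related_pairs(pi, ids):
    print(a, b, value)

print(pi_hat(1.0, 0.0))  # 0.5, the expected value for a parent-child pair
```

Only the upper triangle is scanned because the table is symmetric and the diagonal compares each sample with itself.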
You’ll get the plot in Figure 4-1.
By default, the heat map uses a three-color scheme that is calculated automatically. Here we define the color scheme manually instead, using two colors to highlight sample pairs with a PI estimate of 0.25 or greater. A PI of 0.25 corresponds to second-degree relatives, 0.5 to first-degree relatives, and 1 to identical twins or duplicate samples.
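The PI values above can be turned into a simple labeling rule. The sketch below flags pairs at or above the 0.25 cutoff and assigns each flagged pair to the nearest of the three nominal values. Assigning by nearest nominal value is an illustrative assumption, not a rule taken from the software, and real estimates are noisy enough that borderline pairs deserve a closer look.

```python
# Nominal PI values and their interpretation, as listed in the text.
NOMINAL = [
    (0.25, "second-degree relatives"),
    (0.5, "first-degree relatives"),
    (1.0, "identical twins or duplicates"),
]

def label_pair(pi, cutoff=0.25):
    """Label a sample pair by its PI estimate.

    Pairs below `cutoff` are not flagged; flagged pairs get the label of
    the nearest nominal value (an assumption made for illustration)."""
    if pi < cutoff:
        return "below cutoff"
    _, label = min(NOMINAL, key=lambda item: abs(item[0] - pi))
    return label

for value in (0.10, 0.27, 0.48, 0.97):
    print(value, label_pair(value))
```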
You can begin to see samples along the diagonal line that appear to be related to one another. You can zoom in to see this more clearly by clicking and dragging a red box in the plot area (Figure 4-3).
In most population-based studies you will typically find a few sample pairs that are cryptically related in one way or another, and you can subsequently remove them. This particular study is known to include several family trios, and the cluster pattern above reflects this. For the purpose of this tutorial we won’t exclude any sample pairs here, but if this were your own study, you would need to decide which member(s) of each trio to keep or discard.
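One possible way to decide which samples to discard is a greedy rule: repeatedly drop the sample involved in the most related pairs until no related pairs remain. The sketch below illustrates that heuristic with hypothetical sample names. It is not a feature of the software, and in a real study you would also weigh factors such as call rate and phenotype.

```python
from collections import defaultdict

def samples_to_drop(pairs):
    """Greedily choose samples to remove so that no related pair remains.

    `pairs` is an iterable of (sample_a, sample_b) tuples flagged as related.
    Ties are broken alphabetically so the result is reproducible."""
    neighbours = defaultdict(set)
    for a, b in pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)

    dropped = []
    while any(neighbours.values()):  # some related pair is still intact
        worst = max(sorted(neighbours), key=lambda s: len(neighbours[s]))
        for other in neighbours.pop(worst):
            neighbours[other].discard(worst)
        dropped.append(worst)
    return dropped

# Hypothetical flagged pairs: one trio (child related to both parents)
# and one duplicated sample.
flagged = [("child", "father"), ("child", "mother"), ("dupA", "dupB")]
print(samples_to_drop(flagged))  # ['child', 'dupA']
```

For a trio, this rule drops the child and keeps the two unrelated parents, which retains the largest number of samples.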