The next step is to identify samples that depart from the expected homogeneous ethnicity of your study. You can do this by performing principal component analysis (PCA) on your data and comparing the first two principal components against reference samples of known ethnicities.
There are several ways to perform PCA. Some recommend using the pruned set of SNPs (as was done with IBD), some recommend using a filtered set of SNPs (filtered on minor allele frequency and HWE, for example), and some recommend using the entire SNP set. There are advantages to each. Here we’ll use the pruned set of SNPs we created earlier. First we need to append the HapMap samples to our study samples.
The new spreadsheet should have 734 rows and 207,890 columns.
Note
You don’t want to run PCA on non-autosomal chromosomes. The pruned SNP spreadsheet already had the X chromosome inactive so this chromosome was automatically dropped during the append process.
Now run PCA on this spreadsheet.
Two spreadsheets, the Principal Components (Additive Model) spreadsheet and the PC Eigenvalues (Additive Model) spreadsheet, result from the analysis. To find out how many principal components are required to explain the majority of the population stratification, a PCA plot will be created and the eigenvalues will be visually inspected.
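For readers curious about what the analysis computes, the following is a minimal Python sketch, not the software's implementation. It assumes genotypes are coded additively as 0, 1, or 2 copies of the minor allele, centers each SNP, and uses power iteration to find the first principal component of the sample-by-sample similarity matrix. The function names and the toy genotype matrix are hypothetical, and the tool's normalization may differ.

```python
import math

# Hypothetical toy data: 6 samples x 4 SNPs, additive coding (0, 1, or 2).
# Samples 0-2 and samples 3-5 mimic two distinct populations.
TOY_GENOTYPES = [
    [0, 0, 2, 2],
    [0, 1, 2, 2],
    [0, 0, 2, 1],
    [2, 2, 0, 0],
    [2, 1, 0, 0],
    [2, 2, 0, 1],
]


def center_columns(genotypes):
    """Subtract each SNP's mean so every column averages to zero."""
    n_samples = len(genotypes)
    n_snps = len(genotypes[0])
    means = [sum(row[j] for row in genotypes) / n_samples for j in range(n_snps)]
    return [[row[j] - means[j] for j in range(n_snps)] for row in genotypes]


def sample_similarity(centered):
    """Sample-by-sample matrix: dot products of centered rows, divided by SNP count."""
    n_snps = len(centered[0])
    return [
        [sum(a * b for a, b in zip(row_i, row_j)) / n_snps for row_j in centered]
        for row_i in centered
    ]


def first_component(matrix, iterations=200):
    """Power iteration: return (largest eigenvalue, unit eigenvector)."""
    n = len(matrix)
    vec = [float(i + 1) for i in range(n)]  # deterministic, non-constant start
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0:
            break
        vec = [v / norm for v in nxt]
    # Rayleigh quotient gives the eigenvalue for the converged unit vector.
    product = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(v * p for v, p in zip(vec, product))
    return eigenvalue, vec


similarity = sample_similarity(center_columns(TOY_GENOTYPES))
eigenvalue, component = first_component(similarity)
for sample_index, value in enumerate(component):
    print(sample_index, round(value, 3))
```

In this toy example the first three samples receive values of one sign and the last three receive values of the opposite sign, which is the kind of separation between populations the PCA plot is meant to reveal.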
Look at the PC Eigenvalues (Additive Model) spreadsheet and notice that there is very little change between the third, fourth, and fifth eigenvalues, implying that three principal components explain the majority of the stratification in the SNP data. This is consistent with there being three major populations in the data: CEPH (European), YRI (African), and CHB/JPT (Asian). You can visualize the population stratification by plotting the first few principal components against one another, juxtaposing the HapMap samples with the GEO study data. First you need to join the Population spreadsheet with the Principal Components spreadsheet.
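The visual inspection of eigenvalues described above can also be approximated numerically. The sketch below is one hedged way to do so in Python: it reports each eigenvalue's share of the listed total and finds the position at which successive eigenvalues stop changing much. The eigenvalues, the function names, and the 5% tolerance are illustrative assumptions, not values or methods taken from the tutorial data or the software.

```python
def variance_share(eigenvalues):
    """Share of each eigenvalue relative to the sum of the listed eigenvalues only."""
    total = sum(eigenvalues)
    return [ev / total for ev in eigenvalues]


def successive_drops(eigenvalues):
    """Difference between each eigenvalue and the next one."""
    return [a - b for a, b in zip(eigenvalues, eigenvalues[1:])]


def plateau_start(eigenvalues, tolerance=0.05):
    """1-based position of the first eigenvalue whose drop to the next one
    is smaller than tolerance * largest eigenvalue; None if there is no plateau."""
    cutoff = tolerance * eigenvalues[0]
    for index, drop in enumerate(successive_drops(eigenvalues)):
        if drop < cutoff:
            return index + 1
    return None


# Hypothetical eigenvalues chosen only to illustrate the shape of the curve.
example = [20.0, 8.0, 2.0, 1.9, 1.8]
print([round(share, 3) for share in variance_share(example)])
print(plateau_start(example))
```

With these made-up values the plateau begins at the third eigenvalue, mirroring the pattern described for the PC Eigenvalues spreadsheet. The tolerance is a judgment call, so treat the result as a complement to inspecting the values rather than a replacement for it.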
It is now possible to plot one component against the other and color-code each sample or data point according to its respective ethnicity.
The XY Scatter Parameters dialog appears with two list views. The list view on the left is for selecting the column (principal component) to represent the independent or X axis. The list view on the right is for selecting one or more columns (principal components) to represent the dependent or Y axis.
If we color each data point according to its respective ethnicity, the clusters become more obvious.
In the Graph Control Interface in the upper-left pane of the Plot Viewer, select the Item EV = 15.5349.
Select the Color tab and select the By Variable radio button.
Click Select Variable and select Population from the list. Click OK.
The four population groups are separated by color in the plot (Figure 5-2). We can see that our study consisted mostly of Caucasians (the largest cluster), but also included Asians, some Asian-Americans, and African-Americans. As mentioned previously, these are also the samples that showed an over- or under-abundance of heterozygosity. Correspondingly, four separate graph items are displayed in the Graph Control Interface, making it easy to change the name, color, and symbol for each.
Since the threshold between samples that should be excluded and samples that should be kept can be ambiguous based on visual inspection, we recommend calculating each sample's distance from the centroid (median) of the study population cluster and excluding samples whose distance is more than 1.5 inter-quartile ranges (IQRs) above the third quartile.
Since you want to calculate the centroid of the study population only, exclude the HapMap samples from this spreadsheet.
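The IQR rule described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the software's algorithm: distances are Euclidean, the centroid is the per-component median, quartiles use the "inclusive" method of the standard library, and a sample is flagged when its distance exceeds the third quartile plus 1.5 IQRs. The function names and the example coordinates are hypothetical.

```python
import math
import statistics


def median_centroid(points):
    """Centroid whose coordinates are the per-component medians."""
    return [statistics.median(coords) for coords in zip(*points)]


def iqr_outliers(points, multiplier=1.5):
    """Return (rows, threshold), where each row is (distance, is_outlier)."""
    centroid = median_centroid(points)
    distances = [math.dist(point, centroid) for point in points]
    # Quartile definitions vary between tools; "inclusive" is an assumption here.
    q1, _, q3 = statistics.quantiles(distances, n=4, method="inclusive")
    threshold = q3 + multiplier * (q3 - q1)
    rows = [(distance, distance > threshold) for distance in distances]
    return rows, threshold


# Hypothetical (PC1, PC2) values for study samples only; HapMap samples excluded.
study_points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (-0.1, 0.0),
    (0.0, -0.1), (0.1, 0.1), (-0.1, -0.1), (5.0, 5.0),
]
rows, threshold = iqr_outliers(study_points)
for distance, is_outlier in rows:
    print(round(distance, 3), is_outlier)
```

Each output row holds a distance and a flag, which corresponds to a distance column and a binary outlier column. In this made-up example only the sample at (5.0, 5.0) is flagged.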
Upon completion a new spreadsheet is output, Multidimensional Outlier Detection, which contains two columns: the first is the distance from the median centroid and the second is a binary column which indicates whether a given sample is considered an outlier above the threshold determined by IQR. We can now create an XY scatter plot of the principal components, this time coloring by exclusion.
From the Population + Principal Components (Additive Model) - Sheet 2, select File > Join or Merge Spreadsheets.
Select the Multidimensional Outlier Detection spreadsheet and click OK. In the Join or Merge Spreadsheet window select the Current spreadsheet radio button under Spreadsheet as Child of, leave the rest of the parameters as defaults and then click OK. The combined spreadsheet should look like Figure 5-3.
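Conceptually, joining two spreadsheets matches rows that share a sample identifier and places their columns side by side. The Python sketch below illustrates that idea, assuming an inner join keyed on sample ID. The software's default join behavior is not stated here, and the function name, sample IDs, and column names are all hypothetical.

```python
def join_on_sample(parent, child):
    """Inner join of two tables shaped as {sample_id: {column: value}}.

    Only samples present in both tables are kept; on a column-name clash
    the child's value wins (an arbitrary choice for this sketch).
    """
    joined = {}
    for sample_id, row in parent.items():
        if sample_id in child:
            joined[sample_id] = {**row, **child[sample_id]}
    return joined


# Hypothetical rows standing in for the two spreadsheets.
components = {
    "S1": {"Population": "Study", "PC1": 0.01, "PC2": -0.02},
    "S2": {"Population": "Study", "PC1": 0.45, "PC2": 0.30},
    "H1": {"Population": "YRI", "PC1": 0.60, "PC2": 0.10},
}
outliers = {
    "S1": {"Distance": 0.02, "Outlier": 0},
    "S2": {"Distance": 0.54, "Outlier": 1},
}

combined = join_on_sample(components, outliers)
for sample_id, row in combined.items():
    print(sample_id, row)
```

Under this inner-join assumption, the reference sample "H1" drops out of the combined table because it has no row in the outlier table.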
Now create the scatter plot.
You can now color the points according to whether or not they were considered outliers.
The resulting plot looks like Figure 5-4.
In summary, samples were filtered out if their call rate was below 95% or if their reported and genotypically inferred genders did not match. We have also identified several outlier samples for possible exclusion due to autosomal heterozygosity rates, ethnicity identification, and family-relatedness. For the purposes of this tutorial, these samples will remain in the study.
Next, filter SNPs based on standard SNP quality assurance metrics.