5. Sample QA - III: Population Stratification

The next step is to identify samples that depart from the expected homogeneous ethnicity of your study population. You can do this by performing principal component analysis (PCA) on your data and comparing the first two principal components against reference samples of known ethnicities.

There are several ways to perform PCA. Some recommend using the pruned set of SNPs (as was done with IBD), some recommend using a filtered set of SNPs (filtered on minor allele frequency and HWE, for example), and some recommend using the entire SNP set. There are advantages to each. Here we’ll use the pruned set of SNPs we created earlier. First we need to append the HapMap samples to our study samples.

  • Open the Pruned SNP Subset spreadsheet and choose File >Append Spreadsheets.
  • Choose the 500K Geno HapMap Data - Sheet 1 and click OK.
  • In the Append Spreadsheet window enter Pruned SNPs + HapMap as the New Dataset Name, keep the rest of the parameters as the defaults and click OK.

The new spreadsheet should have 734 rows and 207,890 columns.

Note

You don’t want to run PCA on non-autosomal chromosomes. The pruned SNP spreadsheet already had the X chromosome inactive so this chromosome was automatically dropped during the append process.

Now run PCA on this spreadsheet.

  • From the spreadsheet, select Quality Assurance >Genotype Principal Component Analysis.
  • Under Principal Components, enter 5 for Find up to top ___ components.
  • Leave the defaults for the rest of the options, and click Run.
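To make the additive model concrete, here is a minimal sketch of how a leading principal component could be computed from additive-coded genotypes (0, 1 or 2 copies of the minor allele). It is not the SVS implementation: the function name, the toy genotype matrix, and the choice of plain column centering with power iteration are all assumptions made for illustration, and SVS may normalize the data differently.

```python
import math
import random


def first_principal_component(genotypes, iterations=200, seed=0):
    """Return (eigenvalue, per-sample scores) for the leading component.

    genotypes: one row per sample, each a list of minor-allele counts
    (0, 1 or 2). Missing calls are not handled, and the samples are
    assumed not to be all identical.
    """
    n_samples, n_snps = len(genotypes), len(genotypes[0])
    # Center each SNP column on its mean allele count.
    means = [sum(row[j] for row in genotypes) / n_samples for j in range(n_snps)]
    centered = [[row[j] - means[j] for j in range(n_snps)] for row in genotypes]
    # Sample-by-sample similarity matrix (n_samples x n_samples).
    similarity = [
        [sum(a * b for a, b in zip(r1, r2)) / n_snps for r2 in centered]
        for r1 in centered
    ]

    def multiply(vec):
        return [sum(s * v for s, v in zip(row, vec)) for row in similarity]

    # Power iteration converges to the eigenvector with the largest eigenvalue.
    rng = random.Random(seed)
    vec = [rng.random() for _ in range(n_samples)]
    for _ in range(iterations):
        product = multiply(vec)
        norm = math.sqrt(sum(x * x for x in product))
        vec = [x / norm for x in product]
    eigenvalue = sum(v * p for v, p in zip(vec, multiply(vec)))
    return eigenvalue, vec


# Toy data (hypothetical): the first three samples resemble one group,
# the last three resemble another.
toy = [
    [0, 0, 2, 2], [0, 1, 2, 2], [0, 0, 2, 1],
    [2, 2, 0, 0], [2, 1, 0, 0], [2, 2, 1, 0],
]
eigenvalue, scores = first_principal_component(toy)
print(eigenvalue, scores)
```

In this toy example the two groups receive scores of opposite sign on the first component, which is the same kind of separation the PCA plot later in this section shows between populations.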

The analysis produces two spreadsheets: the Principal Components (Additive Model) spreadsheet and the PC Eigenvalues (Additive Model) spreadsheet. To determine how many principal components are needed to explain most of the population stratification, we will inspect the eigenvalues and create a PCA plot.
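One way to make the inspection of eigenvalues less subjective is to look at how much each eigenvalue drops relative to the one before it. The sketch below is only an illustration: the function name is hypothetical and the eigenvalues are made-up numbers, not the values produced by this tutorial's data.

```python
def successive_drops(eigenvalues):
    """Return the decrease between each eigenvalue and the next one.

    eigenvalues: values sorted from largest to smallest.
    """
    return [a - b for a, b in zip(eigenvalues, eigenvalues[1:])]


# Hypothetical eigenvalues, largest first (not taken from the tutorial data).
example = [30.0, 12.0, 2.0, 1.8, 1.7]
for rank, drop in enumerate(successive_drops(example), start=1):
    print(f"PC{rank} -> PC{rank + 1}: drop of {drop:.2f}")
```

Large drops followed by a run of small ones suggest that the later components add little, which is the pattern to look for in the PC Eigenvalues spreadsheet.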

Look at the PC Eigenvalues (Additive Model) spreadsheet and notice that there is very little change between the third, fourth and fifth Eigenvalues, implying that three principal components explain the majority of stratification in the SNP data. This is consistent with there being three major populations in the data: CEPH (European), YRI (African), and CHB/JPT (Asian). You can visualize the population stratification by plotting the first few principal components against one another, juxtaposing the HapMap samples with the GEO study data. First you need to join the Population spreadsheet with the Principal Components spreadsheet.

  • Open the Population - Sheet 1 spreadsheet and select File >Join or Merge Spreadsheets.
  • Select the Principal Components (Additive Model) spreadsheet.
  • In the Join or Merge Spreadsheet window select the Current spreadsheet radio button under Spreadsheet as Child of, leave the rest of the parameters as defaults and then click OK. The combined spreadsheet should look like Figure 5-1.

Figure 5-1. Population added to the Principal Components spreadsheet
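Conceptually, the join above pairs each sample's population label with its principal component values. The following is a minimal sketch, assuming an inner join keyed on a sample identifier. The function name, sample IDs, and values are hypothetical, and the join options in SVS may behave differently.

```python
def join_by_sample(population, components):
    """Inner-join two mappings keyed by sample ID.

    population: {sample_id: population_label}
    components: {sample_id: [pc1, pc2, ...]}
    Samples missing from either mapping are dropped (an assumption).
    """
    return {
        sample_id: [label] + list(components[sample_id])
        for sample_id, label in population.items()
        if sample_id in components
    }


# Hypothetical sample IDs and values, for illustration only.
population = {"S001": "Study", "HAP_1": "CEPH", "S002": "Study"}
components = {"S001": [0.012, -0.004], "HAP_1": [0.010, -0.002]}
joined = join_by_sample(population, components)
print(joined)
```

Here S002 has no principal component row, so it does not appear in the joined result.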

It is now possible to plot one component against the other and color-code each sample or data point according to its respective ethnicity.

  • From the Population + Principal Components (Additive Model) - Sheet 1 spreadsheet, select Plot >XY Scatter Plots.

The XY Scatter Parameters dialog appears with two list views. The list view on the left is for selecting the column (principal component) to represent the independent or X axis. The list view on the right is for selecting a single or multiple columns (principal components) to represent the dependent or Y axis.

  • In the left list box select EV = 35.8377. In the right list box check EV = 15.5349 and click Plot.

If we color each data point according to its respective ethnicity, the clusters become more obvious.

  • In the Graph Control Interface in the upper-left pane of the Plot Viewer, select the Item EV = 15.5349.

  • Select the Color tab and select the By Variable radio button.

  • Click Select Variable and select Population from the list. Click OK.

    Figure 5-2. PCA plot of 3 HapMap populations and the study population

The four population groups are separated by color in the plot (Figure 5-2). The study sample consists mostly of Caucasians (the largest cluster), but it also includes Asians and some Asian Americans and African Americans. As mentioned previously, these are also the samples that showed an over- or under-abundance of heterozygosity. The Graph Control Interface likewise displays four separate graph items, making it easy to change the name, color, and symbol for each.

  • When finished, close the Plot Viewer and rename its associated node (under the Population spreadsheet) in the Project Navigator to PCA Plot.

Because visual inspection alone can leave the threshold for keeping or excluding samples ambiguous, we recommend calculating each sample's distance from the centroid (median) of the study population cluster and excluding samples whose distance lies more than 1.5 inter-quartile ranges (IQRs) above the third quartile.

  • Open the Population + Principal Components (Additive Model) - Sheet 1.

Since you want to calculate the centroid of the study population only, exclude the HapMap samples from this spreadsheet.

  • Right-click on the Population column header and choose Activate by Category. Select Study and click OK.
  • Now choose Quality Assurance >Multidimensional Outlier Detection.
  • Leave 1.5 for the IQR.
  • Click Add columns and check the first two principal component columns (EV = 35.8402 and EV = 15.5348) and click OK. Click OK again to finish.
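The outlier rule configured above can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance from a component-wise median centroid and a cutoff of Q3 + 1.5 × IQR on those distances. The function name and the sample points are hypothetical, and SVS may compute quartiles or distances differently.

```python
import math
import statistics


def flag_outliers(points, iqr_multiplier=1.5):
    """Flag points that lie far from the median centroid.

    points: list of (pc1, pc2) tuples for the study samples only.
    Returns (distances, threshold, flags).
    """
    # Median centroid: the median of each principal component separately.
    centroid = tuple(statistics.median(axis) for axis in zip(*points))
    # Euclidean distance of every sample from that centroid.
    distances = [math.dist(p, centroid) for p in points]
    # Quartiles of the distances (Python's default "exclusive" method).
    q1, _, q3 = statistics.quantiles(distances, n=4)
    threshold = q3 + iqr_multiplier * (q3 - q1)
    flags = [d >= threshold for d in distances]
    return distances, threshold, flags


# Hypothetical principal component values: a tight cluster plus one stray sample.
points = [
    (0.0, 0.0), (0.01, 0.0), (0.0, 0.01), (-0.01, 0.0),
    (0.0, -0.01), (0.01, 0.01), (-0.01, -0.01), (0.5, 0.5),
]
distances, threshold, flags = flag_outliers(points)
print(threshold, flags)
```

Only the stray sample at (0.5, 0.5) is flagged here. Using the median rather than the mean for the centroid keeps a few extreme samples from pulling the center toward themselves.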

Upon completion a new spreadsheet is output, Multidimensional Outlier Detection, which contains two columns: the first is the distance from the median centroid and the second is a binary column which indicates whether a given sample is considered an outlier above the threshold determined by IQR. We can now create an XY scatter plot of the principal components, this time coloring by exclusion.

  • From the Population + Principal Components (Additive Model) - Sheet 2, select File >Join or Merge Spreadsheets.

  • Select the Multidimensional Outlier Detection spreadsheet and click OK. In the Join or Merge Spreadsheet window select the Current spreadsheet radio button under Spreadsheet as Child of, leave the rest of the parameters as defaults and then click OK. The combined spreadsheet should look like Figure 5-3.

    Figure 5-3. Principal components joined with outlier statistics

Now create the scatter plot.

  • From the joined spreadsheet select Plot >XY Scatter Plots.
  • Again in the left list box select EV = 35.8042. In the right list box check EV = 15.5348 and click Plot.

You can now color the points according to whether or not they were considered outliers.

  • In the Graph Control Interface in the upper-left pane of the Plot Viewer, select the Item EV = 15.5348.
  • Select the Color tab and choose the By Variable radio button.
  • Click Select Variable and select Outlier >=0.0302... from the list. Click OK.

The resulting plot looks like Figure 5-4.

Figure 5-4. PCA plot colored by outlier identification

In summary, samples were filtered out if their call rate was below 95% or if their reported and genotypically inferred genders did not match. We have also identified several outlier samples for possible exclusion based on autosomal heterozygosity rates, ethnicity, and family relatedness. For the purposes of this tutorial, these samples will remain in the study.

Next, filter SNPs based on standard SNP quality assurance metrics.
