3. Sample QA - I: Basics

The following three sections lead you through the quality assurance procedures performed in genome-wide association studies to identify samples of poor quality (low call rates, outlying heterozygosity, etc.) and those whose identity is in question (mismatched gender, ethnicity different from that intended for the study, cryptic relatedness, etc.). In some cases we’ll automatically filter samples; in others we’ll just identify samples for possible exclusion.

A. Filtering Samples with Low Call Rate

The first quality assurance metric we’ll look at is the sample call rate, defined as the number of SNPs called for a sample divided by the total number of SNPs in the dataset. A standard quality threshold is 95%; samples with a call rate below it are excluded. It is important not to include the Y chromosome when calculating per-sample call rates, because female samples have no Y chromosome and would appear to have missing calls. It is up to the researcher whether or not to include the X chromosome. In this case we will use only the autosomes to determine per-sample call rates.

  • Open 500K Geno Training Data - Sheet 1 and choose Select >Activate By Chromosomes. Uncheck any non-autosomes, including but not limited to X, Y, XY, M, and MT (in this project, the only non-autosomal data are from chromosome X).
  • Create a column subset spreadsheet by going to Select >Column >Column Subset Spreadsheet. Rename this spreadsheet by right-clicking on the tab at the bottom of the spreadsheet window and selecting Rename. Change the name to: 500K Geno Training Data - Autosomes Only.
  • Next, from the autosomes only spreadsheet calculate per sample call rates by going to Quality Assurance >Genotype >Filter Samples by Call Rate.
  • Here choose to Drop if call rate < 0.95 and click OK.

Once finished, two new spreadsheets are created. The first, Subset - Samples with Call Rate >=0.95, contains the genotypes for only the samples that had a call rate at or above 95%. Notice in the upper-right portion of this spreadsheet window that there are now a total of 468 rows (out of the original 565), meaning 97 samples had call rates below 95%. Only the 468 remaining samples are used for the rest of the tutorial. The second spreadsheet, Statistics by Sample, lists the actual call rates for every sample.

To prevent confusion in later steps, rename the Subset - Samples with Call Rate >= 0.95 spreadsheet to Subset - Samples with Call Rate >= 0.95 (Autosomes Only).

Next, in the original spreadsheet (which contains all chromosomes), we need to inactivate the samples that do not meet the per-sample call rate threshold.

  • From the 500K Geno Training Data - Sheet 1 spreadsheet, choose Select >Activate by Chromosomes. Check all chromosomes and click OK.
  • Next, go to Select >Activate or Inactivate Based on Second Spreadsheet. Set the state of Rows in the current spreadsheet to Active based on active... Rows in the specified spreadsheet. Click on the Select Sheet button and choose the Subset - Samples with Call Rate >= 0.95 (Autosomes Only) spreadsheet. Click OK.
  • Create a row subset spreadsheet by going to Select >Row >Row Subset Spreadsheet. Rename this spreadsheet Subset - Samples with Call Rate >= 0.95
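Conceptually, activating rows based on a second spreadsheet is a filter on sample IDs: the call rates were computed on autosomes only, but the resulting list of passing samples is applied to the data for all chromosomes. A minimal sketch, with hypothetical names and toy data:

```python
def activate_from_second(all_rows, reference_ids):
    """Keep the rows of all_rows whose sample ID appears in reference_ids.

    The original row order is preserved.
    """
    reference = set(reference_ids)
    return {sid: row for sid, row in all_rows.items() if sid in reference}


# Toy "all chromosomes" data, keyed by sample ID (made-up values).
all_chromosomes = {
    "S1": {"chr1": "A_B", "chrX": "A_A"},
    "S2": {"chr1": "A_A", "chrX": "A_B"},
    "S3": {"chr1": None, "chrX": "B_B"},
}
# IDs of the samples that passed the autosome-only call rate filter.
passed_autosome_filter = ["S1", "S2"]

subset = activate_from_second(all_chromosomes, passed_autosome_filter)
print(list(subset))
```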

B. Genotype Gender Check

Next we will use the X chromosome heterozygosity rate to identify those samples whose inferred gender does not agree with their reported gender.

  • From the Subset - Samples with Call Rate >=0.95 choose Quality Assurance >Genotype >X Heterozygosity Gender Inference.

A new spreadsheet is created called X Heterozygosity with heterozygosity information and inferred gender for each sample (Figure 3-1).

Figure 3-1. X Heterozygosity output spreadsheet
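The idea behind the gender inference can be sketched as follows. Males carry a single X chromosome, so outside the pseudoautosomal regions their X genotypes should show almost no heterozygous calls, while females show a rate comparable to the autosomes. The functions below are an illustration only: the allele-pair encoding, the function names, and especially the 0.1 cutoff are assumptions for the example, not the values the software uses.

```python
def heterozygosity_rate(genotypes):
    """Heterozygous calls divided by all called genotypes.

    Each genotype is a pair of alleles, or None for a missing call.
    Returns None when no genotype was called.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    het = sum(1 for a, b in called if a != b)
    return het / len(called)


def infer_sex(x_het_rate, cutoff=0.1):
    """Infer 'M' or 'F' from the X heterozygosity rate.

    The cutoff is a hypothetical value chosen for this toy example.
    """
    if x_het_rate is None:
        return "?"
    return "M" if x_het_rate < cutoff else "F"


# Toy X-chromosome genotypes (made-up values).
x_genotypes = {
    "S1": [("A", "A")] * 8 + [("B", "B")] * 2,  # no heterozygous calls
    "S2": [("A", "B")] * 3 + [("A", "A")] * 7,  # 3 of 10 heterozygous
    "S3": [None] * 3,                           # nothing called
}
inferred = {sid: infer_sex(heterozygosity_rate(g)) for sid, g in x_genotypes.items()}
print(inferred)
```

In practice the cutoff should be chosen by inspecting the distribution of rates, which is what the histogram in the following steps is for.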

You can join the X Heterozygosity spreadsheet with the phenotype information to check inferred gender against reported gender. The joined spreadsheet is then used to plot a histogram of the heterozygosity rate, colored by reported gender, so that discrepancies stand out.

  • Choose File >Join or Merge Spreadsheet. Select the Edited Phenotype spreadsheet and click OK.
  • Choose Current Spreadsheet under Spreadsheet as Child of. Leave all other parameters as the defaults, and click OK.

In the combined spreadsheet, inferred gender is located in the 5th column (Sex) and reported gender is located in the 13th column (Gender).

  • Right-click on the Heterozygosity Rate column header (2) and choose Plot Histogram. There are two distinct distributions. The left one represents males and the right one females.
  • To better visualize the distributions, in the plot viewer, click on the Graph 1 node in the Graph Control Interface and set the Bin Count parameter to 128.
  • Next, click on the Heterozygosity Rate node in the Graph Control Interface and select the Color tab. Select the By Variable radio button and click Select variable....
  • Scroll to Gender, select it and click OK. With the plot colored on reported gender, the inconsistencies between inferred and reported gender are evident. More inconsistencies can be seen by changing the opacity.
  • Click on the Heterozygosity Rate node in the Graph Control Interface and under the Item tab move the opacity bar to the left, until it is in the center.

Figure 3-2. X Heterozygosity Histogram Colored by Reported Gender

In this example, there are four samples whose reported gender is the opposite of what their heterozygosity rate indicates. Compare the values of the two columns to identify the mismatched samples.

  • From the spreadsheet, X Heterozygosity + Edited Phenotype - Sheet 1, select Quality Assurance >Compare Columns. Click Add Columns and select the two previously mentioned categorical columns (Sex and Gender) in the menu. Click OK.

    Note

    Be sure to check the column that has a C to the left of Sex. The categorical column, rather than the binary column, has values that match the Gender column.

  • Check both Rows with matching data values and Rows with differing data values under Create subset spreadsheet(s) of:. This will allow you to examine samples that have consistent and inconsistent genders.

  • Confirm that the options in the window match those in Figure 3-3. Click OK.

Two subset spreadsheets are created: Rows with matching values in columns Sex and Gender and Rows with differing values in columns Sex and Gender.

Figure 3-3. Compare Columns window
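The join followed by the column comparison amounts to matching samples by ID and splitting them into two groups. A minimal sketch, assuming both columns use the same "M"/"F" coding (which is why the categorical Sex column is the one to compare); the function name and the data are hypothetical:

```python
def compare_columns(inferred, reported):
    """Return (matching, differing) sample IDs.

    Only samples present in both tables are compared, mirroring a join
    on sample ID.
    """
    matching, differing = [], []
    for sample_id, inferred_sex in inferred.items():
        if sample_id not in reported:
            continue  # no phenotype record to compare against
        if inferred_sex == reported[sample_id]:
            matching.append(sample_id)
        else:
            differing.append(sample_id)
    return matching, differing


# Toy data: inferred sex from X heterozygosity and reported gender.
inferred_sex_by_id = {"S1": "M", "S2": "F", "S3": "F", "S4": "M"}
reported_gender_by_id = {"S1": "M", "S2": "F", "S3": "M", "S5": "F"}

matching, differing = compare_columns(inferred_sex_by_id, reported_gender_by_id)
print(matching, differing)
```

The matching list plays the role of the Rows with matching values in columns Sex and Gender spreadsheet, and the differing list holds the candidates for review or exclusion.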

You now want to exclude the mismatched samples from the genotype dataset (unless you can resolve a mismatch by verifying that the reported gender was simply a data entry error and not a genotyping anomaly). The easiest way to do this is to use the rows in the Rows with matching values in columns Sex and Gender spreadsheet to activate their corresponding rows in the Subset - Samples with Call Rate >=0.95 spreadsheet.

  • Open the Subset - Samples with Call Rate >=0.95 spreadsheet and choose Select >Activate or Inactivate Based on Second Spreadsheet.
  • Choose to set the state of Rows in the current spreadsheet to Active based on active... Rows in the specified spreadsheet Rows with matching values in columns Sex and Gender.
    • Click Select Sheet, choose Rows with matching values in columns Sex and Gender, and click OK.

You’ll notice now in the upper-right portion of the window that there are only 464 rows active out of 468. Create a subset with these samples.

  • Choose Select >Row >Row Subset Spreadsheet.
  • Rename this spreadsheet in the Project Navigator (right click on the node and select Rename Node) to Subset - Samples with Call Rate >=.95 and Matched Gender.

C. Outliers in Autosomal Heterozygosity

Next identify samples with an over- or under-abundance of autosomal heterozygous SNPs.

  • From the Subset - Samples with Call Rate >=.95 and Matched Gender spreadsheet, choose Quality Assurance >Genotype >Autosome Heterozygosity. Click OK when prompted.

Though all chromosomes are active, Autosome Heterozygosity will only be calculated on the autosome columns. Upon completion, a new spreadsheet is created, Autosome Heterozygosity Rate (Figure 3-4), which contains one column per chromosome with each sample’s heterozygosity rate for that chromosome and a final column with the overall autosome heterozygosity rate for each sample.

Figure 3-4. Autosome Heterozygosity Rate spreadsheet

To determine outliers with an over- or under-abundance of heterozygosity, we will use thresholds based on 1.5 times the interquartile range (IQR) of the overall autosome heterozygosity rate column.

  • From the Autosome Heterozygosity Rate spreadsheet choose Quality Assurance >Column Statistics.
  • Leave the default values and click OK.

A new spreadsheet is created, Column Statistics (Figure 3-5). Notice that the columns in the previous spreadsheet are now the rows in the current spreadsheet.

Figure 3-5. Column statistics

If you scroll down to the bottom you can see the Lower Outlier Threshold column (1) and Upper Outlier Threshold column (8) for the overall autosome heterozygosity rate (row 23). In this case these values are 0.24603 and 0.28025 respectively. Any sample with an autosome heterozygosity rate below the lower threshold or above the upper threshold is considered an outlier and a candidate for exclusion.
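A common way to derive such thresholds is the 1.5 × IQR rule: the lower threshold is the first quartile minus 1.5 × IQR and the upper threshold is the third quartile plus 1.5 × IQR. The sketch below assumes this rule and uses the "inclusive" quartile method of Python's statistics module (Python 3.8 or later); the software may compute quartiles differently, so the numbers need not match its output exactly. The heterozygosity rates are made-up values.

```python
import statistics


def outlier_thresholds(values, k=1.5):
    """Return (lower, upper) thresholds at Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(rates_by_sample, k=1.5):
    """Return sample IDs whose rate falls outside the thresholds, sorted by rate."""
    low, high = outlier_thresholds(list(rates_by_sample.values()), k)
    flagged = [(rate, sid) for sid, rate in rates_by_sample.items()
               if rate < low or rate > high]
    return [sid for rate, sid in sorted(flagged)]


# Toy overall autosome heterozygosity rates (hypothetical).
rates = {
    "S1": 0.25, "S2": 0.26, "S3": 0.26, "S4": 0.27,
    "S5": 0.27, "S6": 0.28, "S7": 0.20, "S8": 0.33,
}
low, high = outlier_thresholds(list(rates.values()))
print(round(low, 4), round(high, 4), find_outliers(rates))
```

Sorting the flagged samples by rate mirrors the manual approach of sorting the spreadsheet and reading off the rows at either end.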

You can determine which samples these are by sorting the Autosome Heterozygosity Rate spreadsheet by the last column.

  • Open the Autosome Heterozygosity Rate spreadsheet.
  • Right-click on the Autosome het. rate column (23) and click Sort Ascending.

In this particular study, several samples that appear to be African American or Japanese/Chinese American (as we’ll see later) account for the over- and under-abundance of heterozygous SNPs. Normally you would exclude these samples, but for demonstration purposes, keep them in for now.