The following three sections lead you through quality assurance procedures performed in genome-wide association studies to identify samples of poor quality (call rates, heterozygosity, etc.) and those whose identity is in question (mismatched gender, ethnicity different from that intended for the study, cryptic relatedness, etc.). In some cases we’ll automatically filter samples; in others we’ll just identify samples for possible exclusion.
The first quality assurance metric we’ll look at is sample call rate, defined as the number of SNPs called for a sample divided by the total number of SNPs in the dataset. A standard quality threshold for excluding samples with a low call rate is 95%. It is important not to include the Y chromosome when calculating per-sample call rates. It is up to the researcher whether or not the X chromosome should be included in call rate calculations. In this case we will use only autosomes for determining per-sample call rates.
Once finished, two new spreadsheets are created. The first, Subset - Samples with Call Rate >=0.95, contains the genotypes for only the samples that had a call rate at or above 95%. Notice in the upper-right portion of this spreadsheet window that there are now a total of 468 rows (out of the original 565) meaning 97 samples had call rates below 95%. We’ll be using only these samples for the remainder of the tutorial. The second spreadsheet, Statistics by Sample, lists the actual call rates for every sample.
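The call rate filter described above can be sketched outside the application as follows. This is a minimal illustration, not the software's implementation: the genotype encoding (0, 1, 2 for a called genotype, None for a no-call), the function names, and the toy data are all assumptions made for the example.

```python
# Assumed encoding (hypothetical): 0, 1, 2 = called genotype, None = no call.
AUTOSOMES = {str(c) for c in range(1, 23)}


def autosomal_call_rate(genotypes, chromosomes):
    """Fraction of autosomal SNPs with a called genotype for one sample."""
    autosomal = [g for g, c in zip(genotypes, chromosomes) if c in AUTOSOMES]
    if not autosomal:
        raise ValueError("no autosomal SNPs in the marker list")
    called = sum(g is not None for g in autosomal)
    return called / len(autosomal)


def split_by_call_rate(samples, chromosomes, threshold=0.95):
    """Return per-sample call rates and the names of samples at or above threshold."""
    rates = {name: autosomal_call_rate(g, chromosomes) for name, g in samples.items()}
    passing = [name for name, rate in rates.items() if rate >= threshold]
    return rates, passing


# Toy data: five markers, two of them on sex chromosomes (ignored for call rate).
chromosomes = ["1", "1", "2", "X", "Y"]
samples = {
    "S1": [0, 1, 2, None, None],  # all three autosomal SNPs called
    "S2": [0, None, 2, 1, None],  # two of three autosomal SNPs called
}
rates, passing = split_by_call_rate(samples, chromosomes)
```

Here `rates` plays the role of the Statistics by Sample spreadsheet and `passing` the role of the subset spreadsheet; the missing X and Y calls do not lower the call rate of S1 because only autosomal markers are counted.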
To prevent confusion in later steps, rename the Subset - Samples with Call Rate >= 0.95 spreadsheet to Subset - Samples with Call Rate >= 0.95 (Autosomes Only).
Next we need to inactivate the samples in the original spreadsheet with all chromosomes that do not meet the per sample call rate threshold.
Next we will use the X chromosome heterozygosity rate to identify those samples whose inferred gender does not agree with their reported gender.
A new spreadsheet is created called X Heterozygosity with heterozygosity information and inferred gender for each sample (Figure 3-1).
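As a rough sketch of the idea behind this step: a sample with a single X chromosome should show almost no heterozygous X calls, while a sample with two X chromosomes should show a substantial fraction. The code below assumes a hypothetical encoding (1 = heterozygous, 0 or 2 = homozygous, None = no call) and an illustrative cutoff of 0.05; the cutoff is an assumption for this example, not the rule the software applies.

```python
def x_heterozygosity(x_genotypes):
    """Heterozygous calls divided by all called genotypes on the X chromosome."""
    called = [g for g in x_genotypes if g is not None]
    if not called:
        return None  # no X genotypes called, rate is undefined
    return sum(g == 1 for g in called) / len(called)


def infer_sex(rate, male_max=0.05):
    """Infer sex from X heterozygosity; male_max is a hypothetical cutoff."""
    if rate is None:
        return "?"
    return "M" if rate <= male_max else "F"


# Toy samples: the first has no heterozygous X calls, the second has two of four.
rate_a = x_heterozygosity([0, 2, 0, None, 2])
rate_b = x_heterozygosity([0, 1, 1, 2, None])
sex_a = infer_sex(rate_a)
sex_b = infer_sex(rate_b)
```

In practice the cutoff should be chosen by inspecting the distribution of rates in your own data, which is the purpose of the histogram mentioned in the next step.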
You can join this spreadsheet with the phenotype information to check inferred gender against reported gender. The combined spreadsheet can then be used to create a histogram of the heterozygosity rate, filtered by reported gender, to detect discrepancies.
In the combined spreadsheet, inferred gender is located in the 5th column (Sex) and reported gender is located in the 13th column (Gender).
In this example, four samples have a reported gender that conflicts with the gender inferred from the heterozygosity rate. Compare the values of the two columns to identify the mismatched samples.
From the spreadsheet, X Heterozygosity + Edited Phenotype - Sheet 1, select Quality Assurance > Compare Columns. Click Add Columns and select the two previously mentioned categorical columns (Sex and Gender) in the menu. Click OK.
Note
Be sure to check the column that has a C to the left of Sex. The categorical column, rather than the binary column, has values that match the Gender column.
Check both Rows with matching data values and Rows with differing data values under Create subset spreadsheet(s) of:. This will allow you to examine samples that have consistent and inconsistent genders.
Confirm that the options in the window match those in Figure 3-3. Click OK.
Two subset spreadsheets are created: Rows with matching values in columns Sex and Gender and Rows with differing values in columns Sex and Gender.
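The column comparison above amounts to partitioning rows by whether two categorical values agree. A minimal sketch, assuming each row is a dictionary with Sex and Gender keys holding the same category labels (the sample names and values below are made up for illustration):

```python
def compare_columns(rows, col_a="Sex", col_b="Gender"):
    """Split rows into those whose two columns match and those that differ."""
    matching = [row for row in rows if row[col_a] == row[col_b]]
    differing = [row for row in rows if row[col_a] != row[col_b]]
    return matching, differing


# Hypothetical joined spreadsheet: inferred sex next to reported gender.
rows = [
    {"Sample": "S1", "Sex": "M", "Gender": "M"},
    {"Sample": "S2", "Sex": "F", "Gender": "M"},  # mismatch
    {"Sample": "S3", "Sex": "F", "Gender": "F"},
]
matching, differing = compare_columns(rows)
mismatched_names = [row["Sample"] for row in differing]
```

The comparison only works if both columns use the same labels, which is why the categorical Sex column is the one to select rather than the binary one.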
You now want to exclude the mismatched samples from the genotype dataset (unless you can verify that a mismatch was simply a data entry error rather than a genotyping anomaly, in which case you can correct the spreadsheet instead). The easiest way to do this is to use the rows in the Rows with matching values in columns Sex and Gender spreadsheet to activate their corresponding rows in the Subset - Samples with Call Rate >=0.95 spreadsheet.
You’ll notice now in the upper-right portion of the window that there are only 464 rows active out of 468. Create a subset with these samples.
Next identify samples with an over- or under-abundance of autosomal heterozygous SNPs.
Though all chromosomes are active, Autosome Heterozygosity will only be calculated on the autosome columns. Upon completion, a new spreadsheet is created, Autosome Heterozygosity Rate (Figure 3-4), which contains a column for each chromosome with each sample’s heterozygosity rate and a column with the overall autosome heterozygosity rate for each sample.
To determine outliers with an over- or under-abundance of heterozygosity, we will use 1.5 times the interquartile range (IQR) of the overall autosome heterozygosity rate column: samples falling more than 1.5 × IQR below the first quartile or above the third quartile are flagged.
A new spreadsheet is created, Column Statistics (Figure 3-5). Notice that the columns in the previous spreadsheet are now the rows in the current spreadsheet.
If you scroll down to the bottom you can see the Lower Outlier Threshold column (1) and Upper Outlier Threshold column (8) for the overall autosome heterozygosity rate (row 23). In this case these values are 0.24603 and 0.28025 respectively. Any sample with an autosome heterozygosity rate below the lower threshold or above the upper threshold is considered an outlier and a candidate for exclusion.
You can determine which samples these are by sorting the Autosome Heterozygosity Rate spreadsheet by the last column.
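The outlier rule used here can be sketched with the Python standard library. This is an illustration under stated assumptions: the sample names and rates are invented, and the quartiles come from `statistics.quantiles` with the inclusive method, which may not match the quartile convention the software uses, so thresholds computed this way can differ slightly from those in the Column Statistics spreadsheet.

```python
from statistics import quantiles


def iqr_thresholds(values, k=1.5):
    """Lower and upper outlier thresholds: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical overall autosome heterozygosity rates per sample.
het_rates = {
    "A": 0.260, "B": 0.262, "C": 0.264, "D": 0.266,
    "E": 0.268, "F": 0.200, "G": 0.330,
}
lower, upper = iqr_thresholds(list(het_rates.values()))

# Sort by rate, as in the tutorial, and flag samples outside the thresholds.
outliers = [
    name
    for name, rate in sorted(het_rates.items(), key=lambda item: item[1])
    if rate < lower or rate > upper
]
```

With these toy values the two extreme samples, F and G, fall outside the thresholds and the remaining five fall inside.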
In this particular study there are several apparent African American and Japanese/Chinese American samples (as we’ll see later) that account for the over- and under-abundance of heterozygous SNPs. Normally you would exclude these, but for demonstration purposes, keep them in for now.