6. SNP Quality Assurance

Before certain quality assurance measures can be performed on SNPs, the phenotype and genotype information needs to be merged.

  • Open the Edited Phenotype spreadsheet and choose File > Join or Merge Spreadsheets.
  • Select the Subset - Samples with Call Rate >=0.95 spreadsheet and click OK.
  • Leave all the parameters in the Join or Merge Spreadsheet window as the defaults and click OK.
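Conceptually, the join pairs each sample's phenotype row with its genotype row. The following is a minimal Python sketch of that idea, not the SVS implementation. It assumes both tables are keyed by sample label and that only samples present in both are kept (an inner join; the SVS defaults may differ). The function name and toy data are hypothetical.

```python
def join_on_sample(phenotypes, genotypes):
    """Inner-join two {sample_id: {column: value}} tables on sample ID."""
    joined = {}
    for sample_id, pheno_row in phenotypes.items():
        geno_row = genotypes.get(sample_id)
        if geno_row is not None:
            # Phenotype columns first, then genotype columns.
            joined[sample_id] = {**pheno_row, **geno_row}
    return joined


# Toy data (hypothetical): S3 has no genotypes, so it is dropped.
phenotypes = {"S1": {"case": 1}, "S2": {"case": 0}, "S3": {"case": 1}}
genotypes = {"S1": {"rs1": "A_B"}, "S2": {"rs1": "B_B"}}
joined = join_on_sample(phenotypes, genotypes)
```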

Again, exclude the X chromosome from the following analysis. Also specify case/control status, because Hardy-Weinberg Equilibrium (HWE) will be calculated from the control samples only.

  • Choose Select > Activate by Chromosomes, uncheck the X box, and click OK.
  • Next, left-click the Phenotype 1 - Binary column label header.

This turns the column magenta, marking it as the dependent variable. The spreadsheet should look like Figure 6-1.

Figure 6-1. Joined spreadsheet with case/control status selected

  • From the joined spreadsheet choose Quality Assurance > Genotype > Genotype Filtering by Marker.

The Genotype Filtering window lets you set thresholds for several statistics at once and filter out SNPs that fail any of the corresponding quality assurance measures.

  • Check the following options and enter the following thresholds:
      • Drop if Call Rate < 0.9
      • Drop if Minor Allele Frequency < 0.01
      • Perform HWE filtering based on: Controls
      • Drop if Fisher's exact test for HWE P-Value < 1e-4
  • Click Run.
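To make the three filters concrete, here is a rough Python sketch of how such statistics could be computed for a single marker. It is not the SVS implementation. Assumptions: genotypes are coded as 0, 1, or 2 copies of one allele with None for a missing call; call rate and minor allele frequency are computed over all samples; and the HWE p-value comes from a standard exact test that enumerates every heterozygote count compatible with the control allele counts. All function names are hypothetical.

```python
def hwe_exact_p(n_het, n_hom1, n_hom2):
    """Exact HWE p-value from heterozygote and homozygote counts."""
    n = n_het + n_hom1 + n_hom2
    if n == 0:
        return 1.0
    rare = 2 * min(n_hom1, n_hom2) + n_het  # copies of the rarer allele
    probs = [0.0] * (rare + 1)  # relative probability of each het count
    # Start near the expected het count, matching the parity of `rare`.
    mid = rare * (2 * n - rare) // (2 * n)
    if mid % 2 != rare % 2:
        mid += 1
    probs[mid] = 1.0
    # Recurse downward: fewer heterozygotes, more homozygotes.
    het, hom_r = mid, (rare - mid) // 2
    hom_c = n - het - hom_r
    while het >= 2:
        probs[het - 2] = (probs[het] * het * (het - 1.0)
                          / (4.0 * (hom_r + 1.0) * (hom_c + 1.0)))
        het, hom_r, hom_c = het - 2, hom_r + 1, hom_c + 1
    # Recurse upward: more heterozygotes, fewer homozygotes.
    het, hom_r = mid, (rare - mid) // 2
    hom_c = n - het - hom_r
    while het <= rare - 2:
        probs[het + 2] = (probs[het] * 4.0 * hom_r * hom_c
                          / ((het + 2.0) * (het + 1.0)))
        het, hom_r, hom_c = het + 2, hom_r - 1, hom_c - 1
    # Sum every outcome that is no more likely than the observed one.
    observed = probs[n_het] * (1.0 + 1e-9)  # tolerance for float ties
    p = sum(x for x in probs if x <= observed) / sum(probs)
    return min(1.0, p)


def marker_stats(genotypes, is_control):
    """Return (call rate, minor allele frequency, HWE p in controls)."""
    called = [(g, c) for g, c in zip(genotypes, is_control) if g is not None]
    call_rate = len(called) / len(genotypes)
    maf = 0.0
    if called:
        freq = sum(g for g, _ in called) / (2 * len(called))
        maf = min(freq, 1.0 - freq)
    controls = [g for g, c in called if c]
    hwe_p = hwe_exact_p(controls.count(1), controls.count(0),
                        controls.count(2))
    return call_rate, maf, hwe_p


def keep_marker(genotypes, is_control,
                min_call_rate=0.9, min_maf=0.01, min_hwe_p=1e-4):
    """Apply the three 'drop if below threshold' rules from the list above."""
    call_rate, maf, hwe_p = marker_stats(genotypes, is_control)
    return (call_rate >= min_call_rate and maf >= min_maf
            and hwe_p >= min_hwe_p)
```

For example, `keep_marker([0] * 25 + [1] * 50 + [2] * 25, [True] * 100)` keeps a marker whose controls sit exactly at Hardy-Weinberg proportions, while a marker with 50 of each homozygote and no heterozygotes is dropped by the HWE rule.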

Upon completion, SNPs in the Edited Phenotype + 500K Geno Training Data - Sheet 1 spreadsheet that do not meet the specified thresholds are inactivated. A new spreadsheet, Filtering Results, is also created, containing the marker statistics for each SNP.

  • To see how many SNPs were filtered, go to the Project Navigator and select the Edited Phenotype + 500K Geno Training Data - Sheet 1 spreadsheet. The Node Change Log reports how many SNPs were filtered (columns set to inactive).

Assuming all steps were followed correctly to this point, 104,734 SNPs should have been filtered. Although further analyses take only active columns and rows into consideration, it is often preferable to first create a subset spreadsheet containing only the active data.

  • From the Edited Phenotype + 500K Geno Training Data - Sheet 1 spreadsheet, go to Select > Subset Active Data.

The new spreadsheet, Edited Phenotype + 500K Geno Training Data - Active Subset, should have 468 rows and 384,030 columns.
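The relationship between inactive columns and the subset spreadsheet can be pictured as boolean masks over rows and columns. The sketch below is only an illustration in Python with a hypothetical helper and toy data; it does not reflect how SVS stores activation state internally.

```python
def subset_active(rows, row_active, col_active):
    """Return a new table containing only the active rows and columns."""
    return [
        [value for value, keep in zip(row, col_active) if keep]
        for row, row_ok in zip(rows, row_active)
        if row_ok
    ]


# Toy table (hypothetical): 3 samples by 4 markers.
table = [[1, 0, 2, 1],
         [0, 0, 1, 2],
         [2, 1, 1, 0]]
row_active = [True, False, True]        # second sample is inactive
col_active = [True, False, True, True]  # second marker was filtered

n_filtered = col_active.count(False)    # columns set to inactive
subset = subset_active(table, row_active, col_active)
```

The original table is left untouched, which mirrors the tutorial's approach of creating a separate subset spreadsheet rather than deleting data.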

  • Rename this spreadsheet in the Project Navigator to Filtered Data for Association Testing.

You should now have a filtered set of samples and SNPs for association testing.
