Before certain quality assurance measures can be performed on SNPs, the phenotype and genotype information needs to be merged.
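Conceptually, the merge lines up phenotype rows and genotype rows by sample identifier. The sketch below illustrates the idea on made-up data; the sample IDs, column names, and the choice of an inner join (keeping only samples present in both tables) are assumptions for illustration, not a description of how the software performs the merge.

```python
# Hypothetical toy data: IDs, phenotype values, and genotype calls are invented.
phenotypes = {
    "S1": {"case_control": 1},
    "S2": {"case_control": 0},
    "S3": {"case_control": 0},
}
genotypes = {
    "S1": {"snp1": "A_A", "snp2": "A_B"},
    "S2": {"snp1": "A_B", "snp2": "B_B"},
    "S4": {"snp1": "B_B", "snp2": "?_?"},
}


def merge_by_sample(pheno, geno):
    """Inner join on sample ID: keep samples found in both tables."""
    merged = {}
    for sample_id, pheno_cols in pheno.items():
        if sample_id in geno:
            # Phenotype columns first, then genotype columns.
            merged[sample_id] = {**pheno_cols, **geno[sample_id]}
    return merged


merged = merge_by_sample(phenotypes, genotypes)
```

Here S3 has no genotypes and S4 has no phenotype, so only S1 and S2 survive the merge.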
Again, exclude the X chromosome from the following analysis. Also, specify case/control status because Hardy-Weinberg Equilibrium (HWE) will only be calculated based on control samples.
This will turn the column magenta, denoting it as the dependent variable. The spreadsheet should look like Figure 6-1.
The Genotype Filtering window lets you set thresholds for several statistics at once and filter out SNPs that fail the respective quality assurance measures.
- Drop if Call Rate < 0.9
- Drop if Minor Allele Frequency < 0.01
- Perform HWE filtering based on: Controls
- Drop if Fisher’s exact test for HWE P-Value < 1e-4
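To make the thresholds above concrete, here is a minimal sketch of the three checks applied to a single SNP. It assumes genotypes are coded as the count of one allele (0, 1, or 2, with `None` for a missing call), and it uses a commonly used exact test for HWE that sums the probabilities of all heterozygote counts no more likely than the observed one. Whether this matches the software's own calculation in every detail is an assumption; the function names and data layout are hypothetical.

```python
from math import exp, lgamma, log


def hwe_exact_p(n_aa, n_ab, n_bb):
    """Exact HWE p-value from genotype counts (hom, het, hom)."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    rare = min(2 * n_aa + n_ab, 2 * n_bb + n_ab)  # rarer allele count

    def log_weight(n_het):
        # Probability of n_het heterozygotes, up to a constant factor.
        hom_rare = (rare - n_het) // 2
        hom_common = n - n_het - hom_rare
        return (n_het * log(2) - lgamma(n_het + 1)
                - lgamma(hom_rare + 1) - lgamma(hom_common + 1))

    # Possible heterozygote counts share the parity of the rare allele count.
    log_w = {h: log_weight(h) for h in range(rare % 2, rare + 1, 2)}
    top = max(log_w.values())
    weights = {h: exp(v - top) for h, v in log_w.items()}
    observed = weights[n_ab]
    # Small tolerance so that ties with the observed value are included.
    tail = sum(w for w in weights.values() if w <= observed * (1 + 1e-9))
    return tail / sum(weights.values())


def snp_passes(genotypes, is_control,
               min_call_rate=0.9, min_maf=0.01, min_hwe_p=1e-4):
    """Return True if the SNP survives all three filters."""
    called = [g for g in genotypes if g is not None]
    if not called or len(called) / len(genotypes) < min_call_rate:
        return False
    freq = sum(called) / (2 * len(called))
    if min(freq, 1 - freq) < min_maf:
        return False
    # HWE is evaluated on control samples only.
    controls = [g for g, c in zip(genotypes, is_control)
                if c and g is not None]
    counts = [controls.count(k) for k in (0, 1, 2)]
    return hwe_exact_p(counts[0], counts[1], counts[2]) >= min_hwe_p
```

A SNP is dropped as soon as it fails any one criterion, so the order of the checks affects only speed, not the result.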
Upon completion, SNPs in the Edited Phenotype + 500K Geno Training Data - Sheet 1 that do not meet the specified thresholds are inactivated. A new spreadsheet, Filtering Results, is also output with the various marker statistics for each SNP.
Assuming all steps were followed correctly to this point, 104,734 SNPs should have been filtered. Although further analyses take only active columns and rows into consideration, it is often preferable to first create a subset spreadsheet containing only the active rows and columns.
The new spreadsheet, Edited Phenotype + 500K Geno Training Data - Active Subset, should have 468 rows and 384,030 columns.
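The idea of an active subset can be sketched as copying only the rows and columns still flagged active into a new table. The example below uses a tiny made-up table; the boolean-flag representation and all names are assumptions for illustration and say nothing about how the software stores activation state internally.

```python
def active_subset(rows, row_active, col_names, col_active):
    """Return (names, data) restricted to active rows and columns."""
    keep = [j for j, active in enumerate(col_active) if active]
    names = [col_names[j] for j in keep]
    data = [[row[j] for j in keep]
            for row, active in zip(rows, row_active) if active]
    return names, data


# Hypothetical 3-sample, 4-column table; snp2 was inactivated by filtering
# and the second sample was inactivated earlier.
col_names = ["case_control", "snp1", "snp2", "snp3"]
col_active = [True, True, False, True]
rows = [
    [1, "A_A", "A_B", "B_B"],
    [0, "A_B", "?_?", "A_A"],
    [0, "B_B", "A_A", "A_B"],
]
row_active = [True, False, True]

subset_names, subset_rows = active_subset(rows, row_active,
                                          col_names, col_active)
```

The resulting table has two rows and three columns, and no inactive data is carried along.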
You should now have a filtered set of samples and SNPs for association testing.