3. Run CNV PCA Search Script

The search for the optimal number of principal components involves four procedures: correcting the log ratios using each possible number of principal components within the range determined in the previous step, performing association tests on the case/control status using the corrected data, running a regression analysis on the 90% least significant results, and then evaluating the respective Q-Q plots to determine which number gives the best results. A script is available that will automatically perform the first three procedures for each number of principal components within the range.
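As a rough illustration of the first procedure, the sketch below removes the contribution of the first k principal components from one marker's log ratios, for each candidate k. It is not the code of the CNV PCA Search script; the function names (`candidate_counts`, `correct_marker`), the toy data, and the assumption that the components are orthonormal vectors with one value per sample are all hypothetical, and the correction SVS applies internally may differ.

```python
def candidate_counts(minimum, maximum, step=1):
    """Numbers of principal components to try, inclusive of both ends."""
    return list(range(minimum, maximum + 1, step))


def center(values):
    """Center one marker's log ratios on their mean across samples."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def correct_marker(log_ratios, components, k):
    """Remove the part of one marker's log ratios explained by the first k components.

    Assumption: each component holds one value per sample and the components
    are orthonormal, so their projections can be subtracted one after another.
    """
    corrected = center(log_ratios)
    for pc in components[:k]:
        weight = dot(corrected, pc)
        corrected = [x - weight * p for x, p in zip(corrected, pc)]
    return corrected


# Toy data: four samples and two orthonormal components (made up for illustration).
components = [[0.5, -0.5, 0.5, -0.5], [0.5, 0.5, -0.5, -0.5]]
marker = [1.0, -1.0, 0.5, -0.5]

for k in candidate_counts(1, 2):
    corrected = correct_marker(marker, components, k)
    print(k, [round(v, 3) for v in corrected])
```

After correction with k components, the marker's values have no remaining projection onto any of those k components, which is what lets the association test run on data with that source of variation removed.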

A. Download the CNV PCA Search Script

  • Download the script: CNV PCA Search.py
  • Save this script to your *..\AppData\Golden Helix SVS\UserScripts\Spreadsheet\Scripts folder.

Note

The AppData folder is a hidden folder on Windows operating systems, and its location varies between OS versions. The easiest way to locate this directory on your computer is to open SVS and go to Tools > Open Folder > User Scripts Folder.

If saved to the proper folder, this script should show up in the Scripts menu from any spreadsheet.

B. Run the Script

  • Open the Pheno + LogRs - Sheet 1 spreadsheet and select Scripts > CNV PCA Search.
  • Click on Select Sheet and select the Principal Components (Center by Marker) spreadsheet.
  • Set the Minimum components and Maximum components to the values determined in the previous step. In this example, these are 1 and 60.
  • Set the Step size to 1 and make sure Center data by marker is selected. Click OK.

Using the minimum and maximum components considered in this example, the script will generate 60 different Principal Components spreadsheets, as well as 60 different Association Test Result spreadsheets. In the background, a regression analysis will also be computed on the 90% least significant results. From the regression analysis, the slope and F statistic will be stored and –log10(F statistic) will be computed. These values will be output in the PCA Search Results spreadsheet (see Figure 3). For each number of principal components, the slope and F statistic will also be written to the node annotations log of the corresponding association results spreadsheet.

Figure 3. PCA Search results for 1 through 60 principal components.
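
The regression step described above can be sketched as follows. This is a minimal illustration, not the script's implementation: it assumes the regression is an ordinary least-squares fit of observed against expected –log10 p-values (the points of a Q-Q plot), restricted to the 90% least significant results. The function name `qq_regression`, the `keep_fraction` parameter, and the choice of expected quantiles are hypothetical.

```python
import math


def qq_regression(p_values, keep_fraction=0.9):
    """Fit observed vs. expected -log10(p) on the least significant results.

    Returns (slope, f_statistic) of a simple least-squares line.
    Assumption: expected p-values under the null are (rank - 0.5) / n.
    """
    n = len(p_values)
    ordered = sorted(p_values)  # most significant (smallest p) first
    points = [
        (-math.log10((i + 0.5) / n), -math.log10(p))
        for i, p in enumerate(ordered)
    ]
    n_keep = round(keep_fraction * n)
    if n_keep < 3:
        raise ValueError("need at least three points to fit a line")
    kept = points[n - n_keep:]  # drop the most significant results

    xs = [x for x, _ in kept]
    ys = [y for _, y in kept]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in kept)

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in kept)
    ssr = slope * sxy  # regression sum of squares for a one-predictor fit
    f_statistic = ssr / (sse / (len(kept) - 2)) if sse > 0 else math.inf
    return slope, f_statistic


# Toy p-values that follow the null expectation exactly (made up for illustration).
null_p = [(i + 0.5) / 100 for i in range(100)]
slope, f_statistic = qq_regression(null_p)
print(slope, f_statistic)
```

Under these assumptions, a slope near 1 means the bulk of the test statistics follows the null expectation, while a slope well above 1 points to inflation that the chosen number of principal components has not removed.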