2.21. Numeric Association Tests¶
2.21.1. Numeric Association Tests Overview¶
The Golden Helix SVS Numeric Association Tests window offers a straightforward way to test numeric predictors for association with either case/control status or a quantitative trait, using one or more statistical measures.
In addition, the Numeric Association Test window offers batch effects/stratification correction using Principal Component Analysis. (See Correction of Input Data by Principal Component Analysis.) There are also several options for multiple testing corrections on the resulting p-values from the association test.
2.21.2. Tests and Analysis Methods¶
You can choose from the following statistical tests and/or methods where appropriate:
For every individual predictor variable, Golden Helix SVS will always display the number of sample data values which are actually used for testing that predictor. Additionally, for case/control data, Golden Helix SVS will always display the number of case data values and control data values actually used for testing every individual predictor.
Correlation/Trend Test¶
This test is available for both case/control and quantitative dependent variables. This test, logistic regression, and linear regression are the only tests available if principal components analysis (PCA) is used for batch effects/stratification correction on the input data. See PCA-Corrected Association Testing for more information.
For each predictor, this test will show the p-value for the (possibly PCA-corrected) dependent variable value having any correlation with, or “trend” which depends upon, the (possibly PCA-corrected) predictor.
For case/control dependent variables, and before any PCA correction, a “case” is considered to have a value of one, and a “control” is considered to have a value of zero.
In addition, this test will show a signed correlation value indicating the amount and direction of dependency of the (possibly PCA-corrected) predictor on the (possibly PCA-corrected) dependent variable value.
See the Formulas and Theories chapter for an explanation of this statistic (Correlation/Trend Test).
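As a rough illustration (not SVS code), the signed correlation between a predictor and a case/control dependent coded as 1 (case) / 0 (control) can be computed as follows; the reported p-value would additionally come from the corresponding trend-test statistic, which is omitted here:

```python
import math

def signed_correlation(predictor, status):
    """Pearson correlation between a numeric predictor and case/control
    status coded as 1 (case) / 0 (control)."""
    n = len(predictor)
    mx = sum(predictor) / n
    my = sum(status) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(predictor, status))
    sx = math.sqrt(sum((x - mx) ** 2 for x in predictor))
    sy = math.sqrt(sum((y - my) ** 2 for y in status))
    return cov / (sx * sy)

# Hypothetical predictor values; cases tend to be higher, so r is positive.
r = signed_correlation([2.1, 2.5, 3.0, 1.0, 1.2, 0.9], [1, 1, 1, 0, 0, 0])
```

The sign of the result indicates the direction of the dependency, matching the signed correlation value SVS reports.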
T-Test¶
The T-Test can only be run when the dependent (phenotype) variable is binary (which may be the result of transforming a variable such as gender into a binary variable).
This test is not available if Principal Components Analysis (PCA) is used on the dependent variable. See PCA-Corrected Association Testing for more information.
See the Formulas and Theories chapter for an explanation of this statistic (T-Test).
Additionally, several statistics are output for the predictor variable:
The average for cases and for controls
The difference between these averages
The average of taking the anti-log base 2 of the data for cases and for controls
The fold change of the anti-log base 2 of the data, based upon the above averages
The log base 2 of that fold change
If your predictor variable data are the logs base 2 of gene expression levels, the fold change shown will be for your original gene expression levels. (Also, the difference shown between the averages of the logs base 2 of gene expression levels will correspond to an alternative definition of “fold change”.)
When the fold change itself is displayed, the convention is that the magnitude of the result will always be greater than one (if there is any fold change at all), but the result itself will sometimes be negative. That is:
If the case average is larger than the control average, the ratio of the case average to the control average is shown.
If the control average is larger than the case average, minus the ratio of the control average to the case average is shown.
If the two averages are the same, zero is shown.
The log base 2 of the fold change is always displayed as the log base 2 of the ratio of the case average to the control average.
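These conventions can be sketched in Python (an illustration with made-up log2 expression values, not SVS internals):

```python
import math

def fold_change(case_avg, control_avg):
    """Signed fold change: magnitude always greater than one when the
    averages differ, negative when controls exceed cases, 0 when equal."""
    if case_avg == control_avg:
        return 0.0
    if case_avg > control_avg:
        return case_avg / control_avg
    return -(control_avg / case_avg)

# Hypothetical predictor values, assumed to be log2 gene expression levels.
cases = [3.0, 3.4, 3.2]
controls = [1.0, 1.2, 0.8]

# Anti-log base 2 averages, i.e. averages of the original expression levels.
case_avg = sum(2 ** x for x in cases) / len(cases)
control_avg = sum(2 ** x for x in controls) / len(controls)

fc = fold_change(case_avg, control_avg)
log2_fc = math.log2(case_avg / control_avg)   # always the case/control ratio
```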
Linear and Logistic Regression¶
When the dependent is a quantitative (real- or integer-valued) trait, linear regression is available. With linear regression, a line is fit to the response in terms of the predictor’s value, and a p-value is computed for goodness of fit. The output will include not only the regression p-value but also the estimate for the intercept and slope of the regression.
When the dependent is a binary trait, logistic regression is available. With logistic regression, a logistic (sigmoid) curve relating the response to the predictor’s value is fit, and a p-value is computed for goodness of fit. The output will include not only the regression p-value but also the estimates for the intercept and slope of the logistic model.
Bonferroni and False Discovery Rate (FDR) multiple testing corrections can also be applied to the regression results.
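As an illustration of the reported estimates (not SVS code), the least-squares intercept and slope of a simple linear regression can be computed directly; the regression p-value would come from a goodness-of-fit test on this fit, which is omitted here:

```python
def linear_fit(x, y):
    """Least-squares estimates of intercept b0 and slope b1 for y = b0 + b1*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope

# A perfect line y = 1 + 2x recovers its own intercept and slope.
b0, b1 = linear_fit([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

Logistic regression is fit iteratively rather than in closed form, but reports the analogous intercept and slope estimates.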
2.21.3. Note on Missing Values¶
All missing values will be dropped from the analysis both from the predictor variables and from the dependent variable.
2.21.4. Multiple Testing Corrections¶
It may be possible to obtain a good test statistic result by chance alone. Multiple testing corrections are designed to help ensure, if possible, that this is not the case. You may optionally select one or more of the following multiple testing corrections.
Bonferroni Adjustment¶
The Bonferroni adjustment multiplies each individual p-value by the number of times a test was performed. This adjustment, which is quite conservative, estimates the probability that this test would have obtained the same value by chance at least once over all the times this test was performed. (The number of times this test was performed will be equal to the number of predictor variables processed. Other types of tests on the same predictors are not counted.)
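The adjustment itself is simple; a sketch (illustrative only, with arbitrary p-values):

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests performed, capping at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

adjusted = bonferroni([0.001, 0.02, 0.4])   # 3 tests performed
```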
False Discovery Rate¶
The False Discovery Rate (FDR) option calculates the FDR for each statistical test selected. This correction is based on the p-values from the original test.
A general interpretation of the FDR is “What would the rate of false discoveries (false positives) be if I accepted ALL of the tests whose p-value is at or below the p-value of this test?”
See the Formulas and Theories chapter for an explanation of this correction procedure (False Discovery Rate).
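This interpretation can be illustrated with the Benjamini-Hochberg procedure, a common way to compute FDR values from the original p-values (a sketch only; see the Formulas and Theories chapter for the definition SVS actually uses):

```python
def fdr_bh(p_values):
    """Benjamini-Hochberg FDR: for the i-th smallest p-value (1-based rank i),
    q = p * m / i, then enforce monotonicity from the largest rank down."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    prev = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        prev = min(prev, p_values[i] * m / rank)
        q[i] = prev
    return q

q = fdr_bh([0.01, 0.04, 0.03, 0.5])
```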
Permutation Testing¶
Permutation testing is another way of determining whether a significant test statistic value was obtained by chance alone.
Single Value Permutation Testing¶
With single value permutations, the dependent variable is permuted and the given statistical test on the given predictor variable is performed. This process is repeated the number of times you select (counting the original test as one “permutation”). The permuted p-value is the fraction of permutations in which the test came out as significant or as more significant than it did with the non-permuted dependent variable.
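A sketch of this procedure (illustrative only, using a simple mean-difference statistic in place of the tests described above):

```python
import random

def mean_diff(x, y):
    """Difference of x-means between the y == 1 group and the y == 0 group."""
    g1 = [xi for xi, yi in zip(x, y) if yi == 1]
    g0 = [xi for xi, yi in zip(x, y) if yi == 0]
    return sum(g1) / len(g1) - sum(g0) / len(g0)

def single_value_perm_p(predictor, dependent, stat, n_perm, seed=0):
    """Permute the dependent variable and re-run the test; the permuted
    p-value is the fraction of permutations (the original test counting as
    one) at least as significant as the unpermuted result."""
    rng = random.Random(seed)
    observed = abs(stat(predictor, dependent))
    hits = 1                                # the original test is one "permutation"
    dep = list(dependent)
    for _ in range(n_perm - 1):
        rng.shuffle(dep)
        if abs(stat(predictor, dep)) >= observed:
            hits += 1
    return hits / n_perm

p = single_value_perm_p([5.0, 5.2, 5.1, 1.0, 1.1, 0.9],
                        [1, 1, 1, 0, 0, 0], mean_diff, n_perm=1000)
```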
Full Scan Permutation Testing¶
The full-scan permutation technique differs from the single-value technique in that it addresses the multiple testing problem. It does this by comparing the original test result from an individual predictor variable with the most significant permuted results from all tested predictors. The specified number of permutations are done on the dependent variable and these permutations are tested with each predictor. For each permutation only the most significant result statistic of all predictors tested with that permutation is saved.
The p-value is the fraction of permutations in which this best saved value of the test statistic was more significant than the original statistical test on the given predictor.
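A sketch of the full-scan procedure, again substituting a hypothetical mean-difference statistic for illustration:

```python
import random

def mean_diff(x, y):
    """Difference of x-means between the y == 1 group and the y == 0 group."""
    g1 = [xi for xi, yi in zip(x, y) if yi == 1]
    g0 = [xi for xi, yi in zip(x, y) if yi == 0]
    return sum(g1) / len(g1) - sum(g0) / len(g0)

def full_scan_perm_p(predictors, dependent, stat, n_perm, seed=0):
    """For each permutation of the dependent, keep only the most significant
    (largest-magnitude) statistic across all predictors; each predictor's
    adjusted p-value is the fraction of permutations whose best statistic is
    at least as extreme as that predictor's original statistic."""
    rng = random.Random(seed)
    observed = [abs(stat(x, dependent)) for x in predictors]
    exceed = [0] * len(predictors)
    dep = list(dependent)
    for _ in range(n_perm):
        rng.shuffle(dep)
        best = max(abs(stat(x, dep)) for x in predictors)
        for i, obs in enumerate(observed):
            if best >= obs:
                exceed[i] += 1
    return [c / n_perm for c in exceed]

preds = [[5.0, 5.2, 5.1, 1.0, 1.1, 0.9],    # clearly separated by status
         [2.0, 1.0, 3.0, 2.5, 1.5, 2.2]]    # essentially noise
p_adj = full_scan_perm_p(preds, [1, 1, 1, 0, 0, 0], mean_diff, n_perm=500)
```

Because each permutation's best statistic is compared against every predictor, the weak predictor's adjusted p-value is pushed toward one, which is how this technique addresses multiple testing.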
See the Formulas and Theories chapter for a more detailed explanation and examples of permutation testing. (Permutation Testing Methodology).
2.21.5. Principal Components Analysis¶
To correct for stratification, batch effects, or other measurement errors, you may choose to have Golden Helix SVS apply Principal Component Analysis (PCA) to your input data. If you do, you may choose to also correct the dependent variable through PCA or, on the other hand, you may choose to perform the test using the original dependent variable. (The components, if you choose to compute them from the data you are testing, will only be based on the predictor variables in either case.)
The corrected data, which you may request to be output into a separate spreadsheet, is the same as that which could be created through the separate PCA window (see Correction of Input Data by Principal Component Analysis and Using the Numeric Principal Components Analysis Window).
Correcting a binary dependent variable makes it continuous. Thus, linear regression and the Corr/Trend Test become the appropriate tests for numeric association testing when correcting a binary dependent variable through PCA.
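The correction amounts to removing the leading principal components of the centered predictor matrix; a numpy sketch (illustrative only, not SVS internals) might look like:

```python
import numpy as np

def pca_correct(X, k):
    """Compute the top-k principal components of the (column-centered)
    predictor matrix X (samples x predictors) and regress them out of X."""
    Xc = X - X.mean(axis=0)                  # center each predictor
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    comps = U[:, :k]                         # orthonormal component scores
    corrected = Xc - comps @ (comps.T @ Xc)  # residual after removing components
    return corrected, comps

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))                 # toy data: 20 samples, 5 predictors
corrected, comps = pca_correct(X, k=2)
```

A dependent variable corrected the same way (its component projections subtracted) becomes continuous even if it started as 0/1 case/control values.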
2.21.6. Using the Numeric Association Test Window¶
Summary information for the dependent variable is displayed at the top of this window for reference. This information is visible from both tabs in this dialog window.
Numeric Association Tests require a dataset containing numeric data and either case/control or quantitative trait data. To use these tests, first import your data into a Golden Helix SVS project (see Importing Your Data Into A Project). Once you have the spreadsheet for this data, select the column representing the case/control status or quantitative trait as the dependent variable (see Column States) and access the Numeric Association Tests options dialog by selecting Numeric > Numeric Association Tests from the spreadsheet menu.
It is common practice to inactivate those markers known to have data quality issues before testing, especially if you wish to apply PCA correction to your input data.
The numeric association test window consists of two tabs:
Association Test Parameters: This tab contains all the parameters necessary for the association tests themselves, plus options for selecting principal component (PCA) analysis for batch effects/stratification correction of the test input data and for using PCA to correct the dependent variable.
PCA Parameters: This tab contains all of the remaining parameters for principal component analysis (PCA).
These parameters are also available in the stand-alone Numeric Principal Component Analysis window. If you wish to perform principal component analysis on your data without performing an association test, see Using the Numeric Principal Components Analysis Window.
The Association Test Parameters Tab¶
In the Association Test Parameters tab (see Numeric Association Tests Window – Association Test Parameters Tab), select all of the statistical tests you wish to perform, select whether you wish to correct your input data for batch effects or stratification through PCA, and select any multiple-testing corrections to apply to the results.
If an option is hidden, grayed out, or inaccessible, one or more options you previously selected cannot be used simultaneously with it.
Single Value Permutations and Full Scan Permutations can be run individually or together. You must provide a value for the number of permutations used in the test. When running both types of permutations together, the selected number of permutations is the same for both. The number of permutations should be greater than or equal to three.
The PCA Parameters Tab¶
If you selected to correct for batch effects/stratification with PCA, you will be able to select PCA parameters from this tab (see Numeric Association Tests Window – PCA Parameters Tab).
The principal components can be computed, or if they have already been computed for the dataset, the spreadsheet of principal components can be selected after selecting the “Use precomputed principal components” option. See Applying PCA to a Superset of Markers and Applying PCA to a Subset of Samples for specific limitations of this feature.
The other options include the number of components to be found, whether to output a separate eigenvalue spreadsheet, and whether and how to eliminate component outlier subjects and recompute components. See Principal Component Analysis for an explanation of the options for this tab.
Correcting a binary dependent variable makes it continuous, and thus linear regression and the Correlation/Trend Test are the appropriate tests in this situation.
When you have selected all the tests and outputs you wish to perform, select the Run button to start the selected tests and correction procedures. While the association test analysis is running, you can press the Cancel button on the progress bar dialog to stop the analysis.
When the tests are completed, the output spreadsheet(s) will appear.
These can be as follows:
The results of the association tests will be displayed in a spreadsheet. Each of the statistics calculated will be in its own column. If the original dataset was a marker mapped spreadsheet, this spreadsheet will have the rows marker mapped.
If you requested an output spreadsheet of the PCA-corrected input data, this will be created. If you requested PCA correction of the dependent variable for the test, that correction will also be shown in this spreadsheet. Otherwise, the dependent variable column will simply be copied to this spreadsheet.
If you requested a principal components spreadsheet, this will be created with rows according to the patient or subject and columns according to the component. These components will be sorted by eigenvalue, large to small. Only the number of components you requested will be shown.
If you requested an eigenvalue spreadsheet from PCA, it will simply show the eigenvalues from large to small (of the number of components you requested).
If you requested elimination of outlier subjects, and outliers were found, a spreadsheet will be made to list these outliers and the iteration and component in which they were found.
2.21.7. Fisher’s Exact Test for Binary Predictors¶
A Fisher’s Exact Test can be performed on binary predictors and a binary case/control dependent variable using Numeric > Fishers Exact Test for Binary Predictors. This function takes advantage of the contingency table nature of the data to return an exact p-value. See Fisher’s Exact Test.
This is useful for performing analysis on Binary ROH output, Variant Presence/Absence per Gene, and Discretized CN Segments in a Loss or Gain model. See ROH GUI Dialog, Binary Presence/Absence of Variant (Per Gene), and Discretize CN Segment Covariates with Counts, respectively.
There are no parameters to set for this function; all that is required is that a spreadsheet has a binary dependent column as well as active binary columns.
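As an illustration of what is computed for each predictor (not SVS code), a two-sided Fisher's exact p-value and the log odds ratio effect direction can be derived directly from the 2x2 counts, here with hypothetical values:

```python
from math import comb, log

def fisher_exact_2x2(n1, n2, n3, n4):
    """Two-sided Fisher's exact p-value for the 2x2 table with cells
    n1 ("Control = 0"), n2 ("Control = 1"), n3 ("Case = 0"), n4 ("Case = 1"):
    sum the hypergeometric probabilities of all tables with the same margins
    that are no more likely than the observed table."""
    row1, row2 = n1 + n2, n3 + n4            # control and case totals
    col1 = n1 + n3                           # total with value 0
    total = comb(row1 + row2, col1)

    def prob(a):                             # P(top-left cell = a) under H0
        return comb(row1, a) * comb(row2, col1 - a) / total

    p_obs = prob(n1)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small tolerance guards against float ties between equal probabilities.
    return sum(prob(a) for a in range(lo, hi + 1) if prob(a) <= p_obs * (1 + 1e-9))

p = fisher_exact_2x2(8, 2, 1, 9)             # hypothetical counts
effect_direction = log((8 * 9) / (2 * 1))    # ln OR = ln((n1 * n4) / (n2 * n3))
```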
The output spreadsheet contains:
The Fisher’s Exact P-Value
Effect Direction (Natural log of the odds ratio: ln((n1 × n4) / (n2 × n3)))
Bonferroni Corrected P-value
False Discovery Rate (FDR)
OR Lower Confidence Bound
OR Upper Confidence Bound
Case = 1 : Number of cases where the numeric value for the marker = 1
Case = 0 : Number of cases where the numeric value for the marker = 0
# Cases : Total number of cases
Control = 0 : Number of controls where the numeric value for the marker = 0
Control = 1 : Number of controls where the numeric value for the marker = 1
# Controls : Total number of controls
# Samples : Total number of samples with a non-missing value
The last 7 columns are the values that make up the contingency table for the marker, where:

n1 = “Control = 0”
n2 = “Control = 1”
n3 = “Case = 0”
n4 = “Case = 1”

             Observed = 0    Observed = 1
Control      n1              n2
Case         n3              n4
Total        n1 + n3         n2 + n4