Collapsing Methods for DNA-Sequence Analysis

Overview

Collapsing methods are techniques for testing the combined effect of multiple variants, grouped into regions such as genes, in DNA sequence data. Please see DNA Sequence Analysis for an overview of sequence analysis and Golden Helix SVS DNA-Seq Analysis workflows.

Traditional association techniques used in genome-wide association studies (GWAS) are not well suited to sequence analysis: they lack the power to detect the significance of rare variants individually, and they provide no tools for measuring the compound effect of such variants, referred to as rare variant burden. To measure this burden, it is necessary to “collapse” several variants into a single covariate based on regions such as genes.

In Golden Helix SVS, you can use several approaches to study rare variant burden. Two simple approaches are detecting the presence of variants in samples (used in [Cohen2004]) and counting the variants per sample (used in [MorgenthalerAndThilly2007]); in each case, the measure is compared between affected and unaffected groups over a region or gene as defined by a specified annotation track, and an appropriate statistical test is then performed. See Workflow for Detecting Presence of Variants and Workflow for Counting Variants.

A third, more powerful, approach is the Combined Multivariate and Collapsing (CMC) method by Li and Leal [LiAndLeal2008]. Within each region, CMC first bins variants according to a criterion such as minor allele frequency, then collapses the variants within each bin, and finally performs multivariate testing on the counts across the various bins. See Combined Multivariate and Collapsing (CMC) Method.

A fourth approach is the Kernel-Based Adaptive Cluster method by Liu and Leal [LiuAndLeal2010]. It first catalogs and counts multi-marker genotypes within a given region based on the variant data, then performs a special case/control test based on these counts. This test is weighted according to how often each multi-marker genotype is expected to occur according to the null hypothesis. See Kernel-Based Adaptive Cluster (KBAC) Test and Mixed-Model Kernel-Based Adaptive Cluster (KBAC) Method.

Finally, there are the Sequence Kernel Association Test (SKAT), the Generalized Sequence Kernel Association Test (Generalized SKAT), and the Optimized Sequence Kernel Association Test (SKAT-O) ([Lee2012] and [LeeWuLin2012]). The SKAT test combines score tests from individual variants within a given region, assuming that the effects of the individual variants are not correlated; because no common direction of effect is assumed, multiple variants whose effects on the phenotype are in opposite directions can still be detected by the SKAT test. Generalized SKAT combines the score tests assuming that the effects of the individual variants are partly correlated to a degree which must be pre-specified for the test. SKAT-O is a method of combining, into one overall test, a set of different generalized SKAT tests over the same region. See Optimized Sequence Kernel Association Test (SKAT-O).

Using one of the above approaches gives greater power to detect associations involving rare variants than testing each variant individually.

Workflow for Detecting Presence of Variants

One workflow for analyzing sequence data is to start by filtering probes, then perform a Fisher’s Exact Test of a binary dependent variable against the presence or absence of a variant in a gene. This test is performed in [Cohen2004]. (A minimal sketch of the collapse-and-test step appears after the list below.)

  • Filter probes on Call Rate and Minor Allele Frequency, see Genotype Filtering by Marker.
    • Keep probes with a high call rate
    • Keep probes with a low minor allele frequency (rare variants)
  • Perform LD pruning to remove probes that are in high Linkage Disequilibrium with at least one other probe. One probe from a set of probes in LD is kept for analysis, see LD Pruning.
  • (Optional) Filter probes by the SIFT score, see SIFT Filtering.
  • Determine the genes that have at least one variant per sample. See Count Variants per Gene; select the option to output a binary presence/absence indicator of a variant (per gene).
  • Perform a Fisher’s Exact test using a binary case/control phenotype. See Fisher’s Exact Test for Binary Predictors.
  • Explore the results using the SVS Genome Browser.
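
A minimal sketch of the collapse-and-test step, in Python with NumPy and SciPy (this is an illustration outside of SVS; the genotype matrix, gene assignment, and phenotype below are hypothetical placeholders):

    import numpy as np
    from scipy.stats import fisher_exact

    # geno: samples x markers matrix of minor-allele counts (0, 1 or 2);
    # gene_mask: which markers fall within the gene being tested;
    # is_case: case/control status per sample.
    geno = np.array([[0, 1, 0], [0, 0, 0], [2, 0, 1], [0, 0, 0]])
    gene_mask = np.array([True, True, False])
    is_case = np.array([True, True, False, False])

    # Collapse: does each sample carry at least one variant in the gene?
    has_variant = (geno[:, gene_mask] > 0).any(axis=1)

    # 2x2 table of variant presence vs. case/control status, then the test.
    table = [[int(np.sum(has_variant & is_case)), int(np.sum(~has_variant & is_case))],
             [int(np.sum(has_variant & ~is_case)), int(np.sum(~has_variant & ~is_case))]]
    odds_ratio, p_value = fisher_exact(table)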

Workflow for Counting Variants

Another workflow for analyzing sequence data is to test on the number of variants per gene. This test is similar to the Cohort Allelic Sums Test (CAST) discussed in [MorgenthalerAndThilly2007]. (A minimal sketch appears after the list below.)

  • Filter probes on Call Rate and Minor Allele Frequency, see Genotype Filtering by Marker.
    • Keep probes with a high call rate
    • Keep probes with a low minor allele frequency (rare variants)
  • Perform LD pruning to remove probes that are in high Linkage Disequilibrium with at least one other probe. One probe from a set of probes in LD is kept for analysis, see LD Pruning.
  • (Optional) Filter probes by the SIFT score, see SIFT Filtering.
  • Calculate number of variants per gene, see Count Variants per Gene.
  • Create a heat map to see where the samples have the most variants.
  • Perform an association test using either a quantitative or binary phenotype.
  • Explore the results using the SVS Genome Browser.
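
A minimal sketch of the counting step (again outside of SVS, with hypothetical data; the actual CAST statistic of [MorgenthalerAndThilly2007] differs in detail from the simple two-sample comparison shown here):

    import numpy as np
    from scipy.stats import ttest_ind

    # geno, gene_mask and is_case are hypothetical placeholders, as before.
    geno = np.array([[0, 1, 1], [0, 0, 0], [2, 0, 1], [1, 0, 0]])
    gene_mask = np.array([True, True, False])
    is_case = np.array([True, True, False, False])

    # Count the variant alleles each sample carries within the gene.
    burden = geno[:, gene_mask].sum(axis=1)

    # Compare the per-sample counts between cases and controls; for a
    # quantitative phenotype, the counts could instead serve as a regressor.
    stat, p_value = ttest_ind(burden[is_case], burden[~is_case])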

Combined Multivariate and Collapsing (CMC) Method

As noted above, the Combined Multivariate and Collapsing (CMC) method by Li and Leal [LiAndLeal2008] first bins variants within each of a number of regions according to a criterion such as minor allele frequency, then collapses the variants within each bin, and finally performs multivariate testing on the counts across the various bins.

Within each bin, each sample is counted as either having a variant (heterozygous or homozygous) in at least one marker classified to be within that bin, or on the other hand having no variants at all among any of the markers classified to be within the bin. Thus, internally, one binary variable will result from each of the bins in the region.
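
A minimal sketch of this collapsing step in Python with NumPy (the genotype matrix and bin assignments are hypothetical placeholders):

    import numpy as np

    # geno: samples x markers minor-allele counts for one region;
    # bin_of: the bin to which each marker has been assigned
    # (e.g., according to minor allele frequency).
    geno = np.array([[0, 1, 0, 2], [0, 0, 1, 0], [1, 0, 0, 0]])
    bin_of = np.array([0, 0, 1, 1])

    # One binary variable per bin: does the sample carry any variant
    # among the markers classified to that bin?
    bins = np.unique(bin_of)
    bin_vars = np.column_stack([(geno[:, bin_of == b] > 0).any(axis=1)
                                for b in bins]).astype(int)
    # bin_vars holds one column per bin; these binary variables are what
    # the multivariate test operates on.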

There are two algorithms available in Golden Helix SVS for performing multivariate testing on these resulting binary variables.

  1. Hotelling’s T^2 test (See CMC with Hotelling T Squared Test)
  2. Regression (See CMC with Regression)

CMC with Hotelling T Squared Test

This feature will perform the CMC method using its original algorithm, which is Hotelling’s T-squared test. The dependent variable must be a case/control status. This variable is tested against the independent bin variables.
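
For reference, a minimal sketch of a generic two-sample Hotelling’s T^2 computation on the collapsed bin variables (NumPy/SciPy; SVS’s handling of edge cases, such as singular covariance matrices, may differ):

    import numpy as np
    from scipy.stats import f as f_dist

    def hotelling_t2(x_cases, x_controls):
        """Two-sample Hotelling T^2; rows are samples, columns are bin
        variables. Assumes at least two bin variables (columns)."""
        n1, n2 = len(x_cases), len(x_controls)
        p = x_cases.shape[1]
        diff = x_cases.mean(axis=0) - x_controls.mean(axis=0)
        # Pooled covariance matrix of the bin variables.
        s_pooled = ((n1 - 1) * np.cov(x_cases, rowvar=False) +
                    (n2 - 1) * np.cov(x_controls, rowvar=False)) / (n1 + n2 - 2)
        t2 = (n1 * n2 / (n1 + n2)) * diff @ np.linalg.solve(s_pooled, diff)
        # The corresponding F statistic has (p, n1 + n2 - p - 1) degrees of freedom.
        f_stat = (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * t2
        p_value = f_dist.sf(f_stat, p, n1 + n2 - p - 1)
        return t2, f_stat, p_value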

Data Requirements for CMC with Hotelling T Squared Test

CMC analysis with Hotelling’s T^2 test is to be performed from a mapped spreadsheet of sequence data which has a case/control variable as its dependent variable.

A gene annotation track to locate the positions of the genes or the transcripts must be specified. If you wish to perform one test per gene, this annotation track must have a Name field.

In addition, you should have ready a “Variant Bins spreadsheet”, which should be mapped over the same markers as your sequence data. This spreadsheet should have a “Variant Bin” column that you may select from the CMC dialog.

Note

  1. If you wish to generate a “Variant Bins spreadsheet” based on user-defined minor-allele frequency thresholds and an external reference population provided through a probe annotation track, you may use DNA-Seq > Variant Binning by Frequency Source. (Please see Variant Binning by Frequency Source.)
  2. While variant bins may be defined as mentioned just above, the CMC analysis method is flexible as to what binning method you may select.

Usage and Parameters

To use the CMC method with Hotelling’s T^2 test, select DNA-Seq > Collapsing Methods > CMC with Hotelling T Squared Test from the spreadsheet menu, and use the resulting window (see CMC Method with Hotelling’s test – Parameters) to select the following parameters:

cmcHotellingWin

CMC Method with Hotelling’s T^2 test – Parameters

  • Regions for CMC (Li and Leal) Hotelling T Squared Test

    • Use Select Track to specify an annotation gene source that will be used to distinguish regions for testing. Where possible, use a local filtered gene track to determine the gene list, since a local track allows the spreadsheet to be processed faster than a network track does. An example filtered gene track name is: FilteredRefSeqGenes-UCSC_NCBI_36_Homo_sapiens.idf:1

      Please see The Data Source Library and Downloading Data for information on how to obtain annotation tracks.

    • Perform one test per...

      • Select gene (recommended) to use the minimum start position and the maximum stop position of all transcripts to define a gene, for which just one test will be performed.
      • Select transcript to perform one test for every transcript. This option will result in a more severe multiple testing penalty.
    • Include nearby markers (distance in bp) Checking this and entering a distance in base pairs will let you create a “window” around a gene or transcript to include further variants, which you may want to do if you suspect these may affect gene regulation or splicing.

  • Variant Bins for CMC (Please see the notes above about “Variant Bins spreadsheets”.) Here, select the

    • Variant Frequency Bins spreadsheet and the
    • Variant Bin column in that spreadsheet.
      • The variant bin column may either be binary or integer, in which case the bin categories will be numeric, or the column may be categorical, in which case the categories will define the different bins.
  • Significance Testing If your sample size is small (under 300 cases and 300 controls), permutation testing is recommended in addition to obtaining the asymptotic (non-permuted) p-value.

    • To perform permutation testing, check Also compute permuted p-values and enter the Number of permutations to use.
    • To use adaptive permutation testing, check Adaptive permutation testing (threshold alpha) and fill in the threshold alpha value. Please see Adaptive Permutation Testing below for more information about this technique and the possible advantages of using it.
  • Missing Genotype Values If there is any missing genotype data in the current region for a given sample, either the wild type (homozygous reference genotype) can be imputed to have been present and substituted for the missing data, or the sample can be skipped and not used. Select Yes to Impute Wild Type for Genotypic Missing Values to impute the wild type, or No to skip samples containing missing data for the current region.

Output

A marker-mapped spreadsheet will be generated as output containing one row for each region and containing the following columns:

  • Chr The chromosome number or label.
  • Start The starting genetic position of the region.
  • Stop The ending genetic position of the region.
  • Gene Name The gene name.
  • Transcript Name(s) Transcript name(s) for each gene region.
  • Strand Strand for the gene and transcripts.
  • P-Value The p-value resulting from the Hotelling’s T^2 test.
  • -log10 P-Value Minus the base-10 log of the above p-value.
  • T^2 The T^2 statistic itself.
  • F The F-value corresponding to the T^2 statistic.
  • Permuted P-Value If permutation testing was requested, this column will appear and show the permuted p-value.
  • F Bonf. P The Bonferroni-adjusted p-value.
  • F FDR The false discovery rate for all tests having a p-value less than or equal to the current test’s p-value. (A sketch of how the Bonferroni and FDR adjustments are conventionally computed appears after this list.)
  • Sample Size Used The actual sample size used for this test after samples with missing values in the dependent variable and, if requested, the genotypes, have been removed.
  • # Markers Total The total number of markers tested in this region.
  • # in Bin=etc. For each bin, a column is generated containing the number of markers categorized into that bin for this region.
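
For reference, a sketch of how the Bonferroni and FDR columns are conventionally computed from the vector of all regions’ p-values (standard Bonferroni and Benjamini-Hochberg-style calculations; SVS’s exact formulas may differ in detail):

    import numpy as np

    def adjust_pvalues(pvals):
        """Return Bonferroni-adjusted p-values and BH-style FDR values."""
        p = np.asarray(pvals, dtype=float)
        m = len(p)
        bonf = np.minimum(p * m, 1.0)
        # FDR: p * m / rank over ascending p-values, made monotone from the top.
        order = np.argsort(p)
        q = p[order] * m / np.arange(1, m + 1)
        q = np.minimum.accumulate(q[::-1])[::-1]
        fdr = np.empty(m)
        fdr[order] = np.minimum(q, 1.0)
        return bonf, fdr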

CMC with Regression

This feature will perform the CMC method using regression. The dependent variable, which may be either case/control or quantitative, is regressed against the independent bin variables.

This algorithm also allows correction for additional covariates. In this case, a full-model-vs-reduced-model regression is done, where the full model contains all of the independent bin variables plus the additional covariates as regressors, while the reduced model contains only the additional covariates as regressors.
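
A minimal sketch of this full-vs-reduced comparison for a case/control phenotype, using the statsmodels package (all data below are hypothetical placeholders):

    import numpy as np
    import statsmodels.api as sm
    from scipy.stats import chi2

    # y: binary phenotype; bin_vars: collapsed bin indicators for one region;
    # covars: additional covariates to correct for.
    y = np.array([1, 1, 1, 0, 0, 0, 1, 0])
    bin_vars = np.array([[1], [0], [1], [0], [0], [1], [1], [0]])
    covars = np.array([[52.0], [40.0], [61.0], [45.0], [50.0], [38.0], [58.0], [44.0]])

    full = sm.Logit(y, sm.add_constant(np.hstack([covars, bin_vars]))).fit(disp=0)
    reduced = sm.Logit(y, sm.add_constant(covars)).fit(disp=0)

    # Likelihood-ratio test of the bin variables after correcting for the
    # covariates; degrees of freedom = number of bin variables.
    lr = 2.0 * (full.llf - reduced.llf)
    p_value = chi2.sf(lr, bin_vars.shape[1])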

Data Requirements for CMC with Regression

The data requirements for CMC analysis with regression are the same as they are for CMC with Hotelling’s T^2 test (Data Requirements for CMC with Hotelling T Squared Test) with the following exceptions:

  • The dependent variable may be quantitative rather than case/control, in which case linear regression will be used rather than logistic regression.
  • The spreadsheet must contain the covariates, if any, that you wish to control for. Each covariate may be either quantitative or binary.

Usage and Parameters

To use the CMC method with regression, select DNA-Seq > Collapsing Methods > CMC with Regression from the spreadsheet menu. Depending upon whether your dependent variable is case/control or quantitative, you will get either the logistic regression window (CMC Method with Logistic Regression – Parameters) or the linear regression window (CMC Method with Linear Regression – Parameters). Use the resulting window to select the following parameters:

cmcLogRegressWin

CMC Method with Logistic Regression – Parameters

cmcLinRegressWin

CMC Method with Linear Regression – Parameters

  • Regions for CMC (Li and Leal) Logistic Regression Test or Regions for CMC (Li and Leal) Linear Regression Test (prompt depends upon whether your dependent variable is case/control or quantitative)

    • Use Select Track to specify an annotation gene track that will be used to distinguish regions for testing. Where possible, use a local filtered gene track to determine the gene list, since a local track allows the spreadsheet to be processed faster than a network track does. An example filtered gene track name is: FilteredRefSeqGenes-UCSC_NCBI_36_Homo_sapiens.idf:1

      Please see The Data Source Library and Downloading Data for information on how to obtain annotation tracks.

    • Perform one test per...

      • Select gene (recommended) to use the minimum start position and the maximum stop position of all transcripts to define a gene, for which just one test will be performed.
      • Select transcript to perform one test for every transcript. This option will result in a more severe multiple testing penalty.
    • Include nearby markers (distance in bp) Checking this and entering a distance in base pairs will let you create a “window” around a gene or transcript to include further variants, which you may want to do if you suspect these may affect gene regulation or splicing.

  • Variant Bins for CMC (Please see the notes above about “Variant Bins spreadsheets”.) Here, select the

    • Variant Frequency Bins spreadsheet and the
    • Variant Bin column in that spreadsheet.
      • The variant bin column may either be binary or integer, in which case the bin categories will be numeric, or the column may be categorical, in which case the categories will define the different bins.
  • Check Correct for Additional Covariates to utilize the full-vs-reduced regression technique to correct for additional covariates.

    • Click on Add Columns to start selecting which variables (spreadsheet columns) to use for correction.
  • Significance Testing If your sample size is small (under 300 cases and 300 controls), permutation testing is recommended in addition to obtaining the asymptotic (non-permuted) p-value.

    • To perform permutation testing, check Also compute permuted p-values and enter the Number of permutations to use.
    • To use adaptive permutation testing, check Adaptive permutation testing (threshold alpha) and fill in the threshold alpha value. Please see Adaptive Permutation Testing below for more information about this technique and the possible advantages of using it.
  • Missing Genotype Values If there is any missing genotype data in the current region for a given sample, either the wild type (homozygous reference genotype) can be imputed to have been present and substituted for the missing data, or the sample can be skipped and not used. Select Yes to Impute Wild Type for Genotypic Missing Values to impute the wild type, or No to skip samples containing missing data for the current region.

  • Additional Outputs The following additional outputs may be specified:

    • Select Output bin betas and their standard errors to obtain a regression coefficient (beta) and its standard error for
      • the intercept,
      • each additional covariate you are correcting for, if any, and
      • each bin which is used in the regression. If any potential bins are not used for a given region, their corresponding output cells are filled with missing-value indicators (‘?’ values).
    • To aid in comparing CMC to other methods, you may select Output separate spreadsheet of bin variables to get the bin values for every sample output to a second separate spreadsheet. That is, for every region and every sample, if N bins were used for the region, the N bin values for that sample (each value being one or zero) will be output into that sample’s row in each of N separate columns.

Outputs

A marker-mapped spreadsheet will be generated as output containing one row for each region and containing the following columns, depending upon the type of your dependent variable and whether or not you are correcting for covariates:

  • If your dependent variable is case/control and you are not correcting for covariates, the spreadsheet will contain the following columns:
    • Chr The chromosome number or label.
    • Start The starting genetic position of the region.
    • Stop The ending genetic position of the region.
    • Gene Name The gene name.
    • Transcript Name(s) Transcript name(s) for each gene region.
    • Strand Strand for the gene and transcripts.
    • P-Value The p-value resulting from the Likelihood Ratio test.
    • -log10 P-Value Minus the base-10 log of the above p-value.
    • Log Likelihood Full The log likelihood for the regression.
    • Log Likelihood Null The log likelihood for the null model.
    • Permuted P-Value If permutation testing was requested, this column will appear and show the permuted p-value.
    • Bonf. P The Bonferroni-adjusted p-value.
    • FDR The false discovery rate for all tests having a p-value less than or equal to the current test’s p-value.
    • Sample Size Used The actual sample size used for this test after samples with missing values in the dependent variable and any additional covariates and, if requested, the genotypes, have been removed.
    • # Markers Total The total number of markers tested in this region.
    • # in Bin=etc. For each bin, a column is generated containing the number of markers categorized into that bin for this region.
    • If you specified Output bin betas and their standard errors, the following additional outputs will appear:
      • Beta 0 The \beta_0 value for the regression.
      • Beta 0 SE The standard error for \beta_0.
      • Bin etc. Beta For each bin (which was used by the regression for a given region), the \beta value for regressing that bin is displayed. (For bins which are not used for a given region, a missing-value indicator (‘?’) is displayed.)
      • Bin etc. Beta SE For each bin (which was used by the regression for a given region), the standard error for the \beta value for regressing that bin is displayed. (For bins which are not used for a given region, a missing-value indicator (‘?’) is displayed.)
  • If your dependent variable is case/control and you are correcting for covariates, the spreadsheet will contain the following columns:
    • Chr The chromosome number or label.
    • Start The starting genetic position of the region.
    • Stop The ending genetic position of the region.
    • Gene Name The gene name.
    • Transcript Name(s) Transcript name(s) for each gene region.
    • Strand Strand for the gene and transcripts.
    • P-Value The (full vs. reduced) p-value resulting from the logistic regression.
    • -log10 P-Value Minus the base-10 log of the above p-value.
    • Full-Model P The p-value resulting from just regressing the full model.
    • Log Likelihood Full The log likelihood for the full model.
    • Log Likelihood Reduced The log likelihood for the reduced model.
    • Permuted P-Value If permutation testing was requested, this column will appear and show the permuted p-value.
    • Bonf. P The Bonferroni-adjusted p-value.
    • FDR The false discovery rate for all tests having a p-value less than or equal to the current test’s p-value.
    • Sample Size Used The actual sample size used for this test after samples with missing values in the dependent variable and any additional covariates and, if requested, the genotypes, have been removed.
    • # Markers Total The total number of markers tested in this region.
    • # in Bin=etc. For each bin, a column is generated containing the number of markers categorized into that bin for this region.
    • If you specified Output bin betas and their standard errors, the following additional outputs will appear:
      • Beta 0 The \beta_0 value for the region’s regression.
      • Beta 0 SE The standard error for the \beta_0.
      • Covariate Beta For each covariate, the \beta value for regressing that covariate in this region is shown.
      • Covariate Beta SE For each covariate, the standard error for the \beta value for regressing that covariate in this region is shown.
      • Bin etc. Beta For each bin (which was used by the regression for a given region), the \beta value for regressing that bin is displayed. (For bins which are not used for a given region, a missing-value indicator (‘?’) is displayed.)
      • Bin etc. Beta SE For each bin (which was used by the regression for a given region), the standard error for the \beta value for regressing that bin is displayed. (For bins which are not used for a given region, a missing-value indicator (‘?’) is displayed.)
  • If your dependent variable is quantitative and you are not correcting for covariates, the spreadsheet will contain the following columns:
    • Chr The chromosome number or label.
    • Start The starting genetic position of the region.
    • Stop The ending genetic position of the region.
    • Gene Name The gene name.
    • Transcript Name(s) Transcript name(s) for each gene region.
    • Strand Strand for the gene and transcripts.
    • P-Value The p-value resulting from the linear regression.
    • -log10 P-Value Minus the base-10 log of the above p-value.
    • R Squared The coefficient of determination R^2 for the regression.
    • F The value of the F-statistic for the regression model.
    • Permuted P-Value If permutation testing was requested, this column will appear and show the permuted p-value.
    • Bonf. P The Bonferroni-adjusted p-value.
    • FDR The false discovery rate for all tests having a p-value less than or equal to the current test’s p-value.
    • Sample Size Used The actual sample size used for this test after samples with missing values in the dependent variable and any additional covariates and, if requested, the genotypes, have been removed.
    • # Markers Total The total number of markers tested in this region.
    • # in Bin=etc. For each bin, a column is generated containing the number of markers categorized into that bin for this region.
    • If you specified Output bin betas and their standard errors, the following additional outputs will appear:
      • Intercept The intercept value for the regression.
      • Intercept SE The standard error for the intercept.
      • Bin etc. Beta For each bin (which was used by the regression for a given region), the \beta value for regressing that bin is displayed. (For bins which are not used for a given region, a missing-value indicator (‘?’) is displayed.)
      • Bin etc. Beta SE For each bin (which was used by the regression for a given region), the standard error for the \beta value for regressing that bin is displayed. (For bins which are not used for a given region, a missing-value indicator (‘?’) is displayed.)
  • If your dependent variable is quantitative and you are correcting for covariates, the spreadsheet will contain the following columns:
    • Chr The chromosome number or label.
    • Start The starting genetic position of the region.
    • Stop The ending genetic position of the region.
    • Gene Name The gene name.
    • Transcript Name(s) Transcript name(s) for each gene region.
    • Strand Strand for the gene and transcripts.
    • P-Value The (full vs. reduced) p-value resulting from the linear regression.
    • -log10 P-Value Minus the base-10 log of the above p-value.
    • Full-Model P The p-value obtained from regressing only the full model.
    • Full-Model R Squared The coefficient of determination R^2 for the full model.
    • Full-Model F The value of the F-statistic for the full model.
    • Reduced-Model P The p-value obtained from regressing only the reduced model.
    • Reduced-Model R Squared The coefficient of determination R^2 for the reduced model.
    • Reduced-Model F The value of the F-statistic for the reduced model.
    • Full vs Reduced F The value of the F-statistic for the full vs. reduced model.
    • Permuted P-Value If permutation testing was requested, this column will appear and show the permuted p-value.
    • Bonf. P The Bonferroni-adjusted p-value.
    • FDR The false discovery rate for all tests having a p-value less than or equal to the current test’s p-value.
    • Sample Size Used The actual sample size used for this test after samples with missing values in the dependent variable and any additional covariates and, if requested, the genotypes, have been removed.
    • # Markers Total The total number of markers tested in this region.
    • # in Bin=etc. For each bin, a column is generated containing the number of markers categorized into that bin for this region.
    • If you specified Output bin betas and their standard errors, the following additional outputs will appear:
      • Intercept The intercept value for the regression.
      • Intercept SE The standard error for the intercept.
      • Covariate Beta For each covariate, the \beta value for regressing that covariate in this region is shown.
      • Covariate Beta SE For each covariate, the standard error for the \beta value for regressing that covariate in this region is shown.
      • Bin etc. Beta For each bin (which was used by the regression for a given region), the \beta value for regressing that bin is displayed. (For bins which are not used for a given region, a missing-value indicator (‘?’) is displayed.)
      • Bin etc. Beta SE For each bin (which was used by the regression for a given region), the standard error for the \beta value for regressing that bin is displayed. (For bins which are not used for a given region, a missing-value indicator (‘?’) is displayed.)

Additionally, if you have requested Output separate spreadsheet of bin variables, another marker-mapped spreadsheet called CMC Bin Variables will be output containing one row for each sample and the following columns:

  • Your (case/control or quantitative) dependent variable.
  • One column for every covariate you have specified.
  • For each region and each bin which was used for that region, a column is output containing that bin’s values for each sample in the region. Each column header for these columns will contain the chromosome name, followed by an underscore (‘_’), followed by the start position of the region, followed by a second underscore (‘_’), followed by the bin identifier.

Kernel-Based Adaptive Cluster (KBAC) Test

As noted above, the Kernel-Based Adaptive Cluster (KBAC) method by Liu and Leal [LiuAndLeal2010] first catalogs the variant data within each of a number of regions into multi-marker genotypes. (Since the variants are rare, only a relatively small number of distinct multi-marker genotypes will be found in any given region.)

A special case/control test based on these counts is then applied. Each multi-marker genotype is weighted according to how often it would be expected to occur, given the data, under the null hypothesis that there is no association between that genotype and the case/control status of the sample. Under this adaptive weighting procedure, genotypes with higher sample risks (that is, with more occurrences among cases) are given higher weights, which can potentially separate causal from non-causal genotypes. This procedure is meant to attain a good balance between classification accuracy and the number of parameters to be estimated.

One result of this adaptive weighting procedure is that the KBAC test is intrinsically “one-sided”, in the sense that it is highly sensitive to associations from multi-marker genotypes conferring higher sample risks, and largely insensitive to associations from multi-marker genotypes conferring lower sample risks (that is, genotypes with fewer occurrences among cases and more occurrences among controls).

If you suspect associations from multi-marker genotypes conferring lower sample risks, or wish to explore associations in both directions, you may wish to invert the case/control status of your phenotype and use the KBAC procedure to re-test for associations. The inverted test will be highly sensitive to associations from multi-marker genotypes conferring lower sample risks (in relation to your original case/control status), and largely insensitive to associations from multi-marker genotypes conferring higher sample risks (in relation to your original case/control status).

Overview of Theory

Sample Risk for Multi-Site Genotypes

Suppose the total sample size is N, among which there are N^A samples from affected individuals and N^U=N-N^A samples from unaffected individuals, and that there are M sites within the candidate region where rare variants are observed. The rare variant multi-site genotype for each sample may be cataloged as a vector G=(g_1,g_2,...,g_M), where g_j is the number of rare variants observed at the sample’s j^{th} site; that is, g_j is 2 if the site is homozygous for the rare allele, 1 if the site is heterozygous, and 0 if the site is homozygous for the common (wild-type) allele. Further suppose that k+1 distinct multi-site genotype vectors G_0 through G_k can be cataloged from the data, where G_0 represents the wild-type genotype without any rare variants (a vector of all zeros) and G_1 through G_k represent multi-site genotypes with at least one variant. The sample risk for multi-site genotype G_i is defined as

R_i=\frac{N_i^A}{N_i},

where N_i is the number of occurrences of genotype G_i and N_i^A is the number of occurrences of genotype G_i for samples from affected individuals (cases).

The sample risk R_i for multi-site genotype G_i is modeled using a two-component distribution

R_i \sim \pi_ik^0_i(R_i) + (1 - \pi_i)k^A_i(R_i),

where the component k^0_i(R_i), which is known, represents the distribution of the sample risk when multi-site genotype G_i is non-causal, while k^A_i(R_i) represents the unknown distribution of the sample risk when G_i is causal. If the null hypothesis of no disease/gene association holds, all genotypes will be non-causal and we could take \pi_i=1.

Adaptive Weighting

In this method, we adaptively weight each genotype G_i according to the area under the curve from 0 to \hat{R}_i of the known component k^0_i(R_i) of the distribution, where \hat{R}_i = \frac{n^A_i}{n_i} is the estimated sample risk for genotype G_i. This known component k^0_i(\bullet) is referred to as a “kernel”. The weight w_i of genotype G_i may thus be written

w_i = \int_0^{\hat{R}_i} k_i^0(r)dr = K_i^0(\hat{R}_i).

Three methods of calculation are available to obtain values for these weights. (A sketch computing each of them appears after the list below.)

  • Hyper-geometric kernel. This is the exact form of the kernel. Under the null hypothesis, conditioning on the genotype counts \{N_i=n_i\}_{i\le k} and the count of cases \{N^A=n^A\} (vs. controls \{N^U=n^U\}), the number n_i^A=n_ir_i of diseased individuals (cases) having the i^{th} multi-site genotype follows a hyper-geometric distribution with kernel function given by

    k_i^0(r_i)=P[R_i=r_i|\{N_i=n_i\}_{i\le k},N^A=n^A]=\frac{\dbinom{n_i}{n_ir_i}\dbinom{n-n_i}{n^A-n_ir_i}}{\dbinom n{n^A}}.

    Since the hyper-geometric distribution is discrete, the weight integral may be computed through the summation

    w_i = K_i^0(\hat{R}_i) = \sum_{r_i \in \{\frac 0{n_i},...,\hat{R}_i\}}k_i^0(r_i).

  • Marginal binomial kernel. In the marginal condition that there are only a few samples having the variant genotypes but the number of cases is a significant fraction of the number of samples, the number n_i^A=n_ir_i of diseased individuals (cases) having the i^{th} multi-site genotype approximately follows the binomial distribution, that is, n_i^A \sim Binom(n_i,\frac{n^A}n). Based on this, the kernel may be written

    k_i^0(r_i)=P[R_i=r_i]=\dbinom{n_i}{n_ir_i}\left(\frac{n^A}n\right)^{n_ir_i}\left(1-\frac{n^A}n\right)^{n_i(1-r_i)},

    and the weight integral may be computed through the summation

    w_i = K_i^0(\hat{R}_i) = \sum_{r_i \in \{\frac 0{n_i},...,\hat{R}_i\}}k_i^0(r_i).

  • Asymptotic normal kernel. For large sample sizes, we take advantage of the fact that under the null distribution, the sample risk for genotype G_i is asymptotically normal, that is,

    \sqrt{n_i}\left(R_i-\frac{n^A}n\right) \to N\left(0,\frac{n^A}n\left(1-\frac{n^A}n\right)\right),

    so the kernel is given by

    k_i^0(r_i) = \frac{\sqrt{n_i}}{\sqrt{\frac{n^A}n\left(1 - \frac{n^A}n\right)}}\phi\left(\frac{\sqrt{n_i}\left(r_i - \frac{n^A}n\right)}{\sqrt{\frac{n^A}n\left(1 - \frac{n^A}n\right)}}\right),

    where \phi(\bullet) is the probability density function for a standard normal random variable. The weight is thus

    w_i = K_i^0(\hat{R}_i) = \int_0^{\hat{R}_i}k_i^0(r)dr = \int_{\frac{-\sqrt{n_i}\frac{n^A}n}{\sqrt{\frac{n^A}n\left(1 - \frac{n^A}n\right)}}}^{\frac{\sqrt{n_i}\left(R_i - \frac{n^A}n\right)}{\sqrt{\frac{n^A}n\left(1 - \frac{n^A}n\right)}}} \phi(s)ds,

    which may be found through a table or through a utility for computing the standard normal distribution.
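
A minimal sketch computing the weight w_i under each of the three kernels, using SciPy (the helper name kbac_weight and its argument order are hypothetical):

    import numpy as np
    from scipy.stats import binom, hypergeom, norm

    def kbac_weight(n_i, nA_i, n, nA, kernel="hypergeometric"):
        """Weight w_i = K_i^0(R_hat_i) for one multi-site genotype G_i.

        n_i: occurrences of G_i; nA_i: occurrences of G_i among cases;
        n: total sample size; nA: total number of cases."""
        if kernel == "hypergeometric":
            # P[number of cases among the n_i carriers <= nA_i] when drawing
            # n_i samples without replacement from n samples, nA of them cases.
            return hypergeom.cdf(nA_i, n, nA, n_i)
        if kernel == "binomial":
            # Marginal binomial approximation: n_i^A ~ Binom(n_i, nA / n).
            return binom.cdf(nA_i, n_i, nA / n)
        # Asymptotic normal kernel: integrate the standard normal density
        # between the standardized endpoints for r = 0 and r = R_hat_i.
        p0 = nA / n
        scale = np.sqrt(p0 * (1.0 - p0))
        z_low = -np.sqrt(n_i) * p0 / scale
        z_high = np.sqrt(n_i) * (nA_i / n_i - p0) / scale
        return norm.cdf(z_high) - norm.cdf(z_low)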

The Statistic Itself

The KBAC test statistic itself is defined as

KBAC = \left(\sum_{i=1}^k\left(\frac{N_i^A}{N^A} - \frac{N_i^U}{N^U}\right)K_i^0(\hat{R}_i)\right)^2,

which compares the difference of weighted multi-site genotype frequencies between cases and controls. When a one-sided alternative hypothesis is tested, for instance, the enrichment of causal variants in cases, a corresponding one-sided version of KBAC can be used. To test the enrichment of causal variants in cases, one would use

KBAC_1 = \sum_{i=1}^k\left(\frac{N_i^A}{N^A} - \frac{N_i^U}{N^U}\right)K_i^0(\hat{R}_i).
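
A small numeric sketch of the statistic, given per-genotype counts and weights (all values hypothetical; the weights would come from a kernel computation such as the kbac_weight sketch above):

    import numpy as np

    # Counts for the variant genotypes G_1..G_k (the wild type G_0 is
    # excluded) and their weights K_i^0(R_hat_i).
    NA_i = np.array([3, 1, 2])   # occurrences among cases
    NU_i = np.array([0, 1, 0])   # occurrences among controls
    NA, NU = 50, 50              # total cases and total controls
    w = np.array([0.98, 0.52, 0.91])

    kbac_one_sided = np.sum((NA_i / NA - NU_i / NU) * w)
    kbac_two_sided = kbac_one_sided ** 2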

Finding the P-Value

Since the distribution of this statistic is unknown, permutation testing or something approximating it must be used to determine a p-value.

Permutation Testing

Normally, permutation testing is used to determine the p-value of a KBAC statistic. Permutation testing is carried out as follows (a sketch of this loop appears after the list):

  1. Compute the KBAC statistic based upon the original data.
  2. Permute the case/control statuses.
  3. Using the sample risks \hat{R}_i = \frac{n^A_i}{n_i} as based upon the original multi-marker genotypes and the newly-permuted case/control statuses, re-compute the weight K_i^0(\hat{R}_i) = K_i^0\left(\frac{n^A_i}{n_i}\right) associated with each multi-marker genotype G_i.
  4. Re-compute the KBAC statistic based upon the permuted data and the newly re-computed genotype weights.
  5. Repeat steps (2) through (4) until all permutations are finished.
  6. Find the count of how many permutation results (counting the statistic from the original data as one permutation result) are greater than or equal to the original statistic, and divide by the total number of permutation results.
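
A minimal sketch of this permutation loop in Python, reusing the hypothetical kbac_weight helper from the kernel sketch above (hyper-geometric kernel assumed):

    import numpy as np

    rng = np.random.default_rng(0)

    def kbac_stat(is_case, genotype_index, k, NA, NU):
        """One-sided KBAC statistic for one case/control assignment.
        genotype_index holds i(j) for each sample, with 0 for the wild type."""
        stat = 0.0
        for i in range(1, k + 1):
            carriers = genotype_index == i
            n_i = int(np.sum(carriers))
            nA_i = int(np.sum(carriers & is_case))
            w_i = kbac_weight(n_i, nA_i, len(is_case), NA)
            stat += (nA_i / NA - (n_i - nA_i) / NU) * w_i
        return stat

    def kbac_perm_pvalue(is_case, genotype_index, k, n_perm=1000):
        NA, NU = int(np.sum(is_case)), int(np.sum(~is_case))
        observed = kbac_stat(is_case, genotype_index, k, NA, NU)
        hits = 1  # the original statistic counts as one permutation result
        for _ in range(n_perm):
            permuted = rng.permutation(is_case)
            if kbac_stat(permuted, genotype_index, k, NA, NU) >= observed:
                hits += 1
        return hits / (n_perm + 1)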

The Monte-Carlo Method

For large sample sizes, a Monte-Carlo approximation method for finding a p-value has been developed and is available in Golden Helix SVS. As was mentioned above, the number of cases n_i^A for each genotype G_i approximately follows a binomial distribution, that is, n_i^A \sim Binom(n_i,\frac{n^A}n). Due to the low frequencies of each multi-site genotype containing rare variants, we take the n_i^A to be independent of each other. The Monte-Carlo simulation algorithm is thus as follows (a sketch appears after the list):

  1. Compute the KBAC statistic based upon the original data.
  2. Simulate a k-vector of independent binomials (m_1,m_2,...,m_k) with m_i \sim Binom\left(n_i,\frac{n^A}n\right).
  3. Re-compute the genotype weights, based upon the k-vector, as K^0_i(m_i/n_i) for each multi-site genotype G_i.
  4. Re-compute a new KBAC statistic based upon the k-vector, using each m_i as the number of cases found for the multi-site genotype G_i.
  5. Repeat steps (2) through (4), using the KBAC statistics generated from these steps in the same fashion as is done for permutation testing; namely, find the count of how many simulation results (counting the original statistic as one simulation result) are greater than or equal to the original statistic, and divide by the total number of simulation results.
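
A minimal sketch of the Monte-Carlo loop, again reusing the hypothetical kbac_weight helper (the marginal binomial kernel is used here, matching the approximation that the method relies upon):

    import numpy as np

    rng = np.random.default_rng(0)

    def kbac_mc_pvalue(n_i, nA_i, n, nA, n_sims=1000):
        """n_i, nA_i: arrays of per-genotype totals and case counts for the
        variant genotypes G_1..G_k; n, nA: total samples and total cases."""
        NU = n - nA

        def weights(case_counts):
            return np.array([kbac_weight(ni, ci, n, nA, "binomial")
                             for ni, ci in zip(n_i, case_counts)])

        observed = np.sum((nA_i / nA - (n_i - nA_i) / NU) * weights(nA_i))
        hits = 1  # the original statistic counts as one simulation result
        for _ in range(n_sims):
            m = rng.binomial(n_i, nA / n)  # simulated case counts per genotype
            sim = np.sum((m / nA - (n_i - m) / NU) * weights(m))
            if sim >= observed:
                hits += 1
        return hits / (n_sims + 1)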

Correction for Covariates

Alternative View of the Statistic

Suppose, for any sample j, we let Y_j take on either the value

Y_j = \frac{E}{N^A} + C

for a case, and

Y_j = -\frac{E}{N^U} + C

for a control, where E and C are arbitrary constants. It may be shown that

E \bullet KBAC_1 = \sum_iw_i\sum_{j(i)}{(Y_j - \bar{Y})} = \sum_jw_{i(j)}(Y_j - \bar{Y})

holds for the one-sided KBAC statistic documented above, where \bar{Y} is the overall average of all the Y_j, and w_{i(j)} means w_i for the multi-site genotype i contained by sample j.

For future reference, we will find it convenient to set C = \frac{N^A}{N} and E = \frac{N^AN^U}{N}, so as to make Y_j = 1 for the cases and Y_j = 0 for the controls. Using this convention, if for any sample j, we then set

X_j = w_{i(j)},

where w_{i(j)} is defined as above, we have

U_{nocv} = \frac{N^AN^U}{N} \bullet KBAC_1 = \sum_j{X_j(Y_j - \bar{Y})}

as an alternative formulation of the one-sided KBAC statistic for which no covariates are corrected.

Note

The value X_j (for any sample j) may be formulated as

X_j = w_{i(j)} = K_{i(j)}^0(\hat{R}_{i(j)}) = K_{i(j)}^0\left(\frac{n^A_{i(j)}}{n_{i(j)}}\right)

or, when performing the Monte-Carlo approximation,

X_j = w_{i(j)} = K_{i(j)}^0(m_{i(j)}/n_{i(j)}).

The first form depends upon \hat{R}_{i(j)} = \frac{n^A_{i(j)}}{n_{i(j)}} which, in turn, depends upon whether a permutation is being computed and upon which permutation is being computed. The second (Monte-Carlo) form depends upon m_{i(j)}/n_{i(j)} which, in turn, depends upon which iteration of the Monte-Carlo procedure is current.

Therefore, the values of the X_j depend upon which permutation, if any, (or upon which Monte-Carlo iteration) is being performed.

Regression Score

Suppose we have l covariates such as age, sex or eigenvectors for genotypes, and we define Z_{jl} to be the value of the l^{th} covariate for the j^{th} sample. If we define X_j and Y_j as above, the logistic model for association testing will have the form

log\left(\frac{P(Y_j=1|X_j,Z_{jl})}{1 - P(Y_j=1|X_j,Z_{jl})}\right) = \beta_0 + \beta_1X_j + \sum_l{\alpha_lZ_{jl}}.

The score statistic to test the null hypothesis H_0:\beta_1 = 0 of the independence of this model from the X_j is

U = \sum_j{X_j(Y_j - \mu_j)},

where

\mu_j = \frac{e^{(\beta_0 + \sum_l{\alpha_lZ_{jl}})}}{1 + e^{(\beta_0 + \sum_l{\alpha_lZ_{jl}})}}

and \beta_0 and the \alpha_l are determined by a regression on the reduced model

log\left(\frac{P(Y_j=1|Z_{jl})}{1 - P(Y_j=1|Z_{jl})}\right) = \beta_0 + \sum_l{\alpha_lZ_{jl}}.

In a fashion similar to the original KBAC statistics, we define U to be a one-sided score to test for the enrichment of causal variants for higher corrected case/control statuses Y_j - \mu_j, and we define U^2 to be a two-sided score to test for the dependency of causal variants upon the corrected case/control statuses Y_j - \mu_j.
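
A minimal sketch of this score computation, fitting the reduced model with the statsmodels package (all data hypothetical; the X_j values would come from the KBAC weight computation):

    import numpy as np
    import statsmodels.api as sm

    # Y: case/control status (1 = case); Z: covariates; X: per-sample
    # genotype weights w_{i(j)}.
    Y = np.array([1, 1, 1, 0, 0, 0, 1, 0])
    Z = np.array([[52.0], [40.0], [61.0], [45.0], [50.0], [38.0], [58.0], [44.0]])
    X = np.array([0.97, 0.55, 0.97, 0.10, 0.10, 0.55, 0.97, 0.10])

    # Fit the reduced model (covariates only), then form the one-sided score
    # U = sum_j X_j (Y_j - mu_j); U^2 gives the two-sided score.
    reduced = sm.Logit(Y, sm.add_constant(Z)).fit(disp=0)
    mu = reduced.predict(sm.add_constant(Z))
    U = float(np.sum(X * (Y - mu)))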

Note

We see that if there are no covariates, \beta_0 estimates the log of the ratio of cases to controls, and that

\mu_j = \frac{e^{\beta_0}}{1 + e^{\beta_0}}

estimates \bar{Y}, so that

U = U_{nocv} = \sum_j{X_j(Y_j - \bar{Y})} = \frac{N^AN^U}{N} \bullet KBAC_1

emerges as the definition of the score statistic U when there are no covariates. This matches, within a constant multiplier, the original one-sided KBAC statistic.

Note

To derive the regression score, note that the expectation of the Y_j themselves may be written as

E[Y_j] = \mu_j = h(\eta_j) = \frac{e^{\eta_j}}{1 + e^{\eta_j}},

where

\eta_j = \beta_0 + \beta_1X_j + \sum_l{\alpha_lZ_{jl}},

and that the likelihood and log-likelihood for all the observations are

L = \prod_j \mu_j^{Y_j}(1 - \mu_j)^{1 - Y_j}

and

l = \sum_j \left[ Y_j log(\mu_j) + (1 - Y_j) log(1 - \mu_j) \right] .

Since we are using the null hypothesis H_0:\beta_1 = 0, we will determine the score function with respect to \beta_1, which is defined as the derivative of the log-likelihood with respect to \beta_1. We take this derivative and expand the resulting terms:

\frac{\partial l}{\partial \beta_1} = \sum_j \frac{\partial l}{\partial \mu_j} \frac{\partial \mu_j}{\partial \beta_1} = \sum_j \left( \left[ \frac{Y_j}{\mu_j} - \frac{(1 - Y_j)}{(1 - \mu_j)} \right] \frac{\partial \mu_j}{\partial \beta_1} \right)

\frac{\partial \mu_j}{\partial \beta_1} = \frac{\partial \mu_j}{\partial \eta_j} \frac{\partial \eta_j}{\partial \beta_1} = h'(\eta_j) \frac{\partial \eta_j}{\partial \beta_1}

where for any scalar value \eta,

h(\eta) = \frac{e^{\eta}}{1 + e^{\eta}}

(as defined earlier),

h'(\eta) = \frac{ e^{\eta}(1 + e^{\eta}) - e^{2\eta} }{ (1 + e^{\eta})^2 } = \frac{ e^{\eta} + e^{2\eta} - e^{2\eta} }{ (1 + e^{\eta})^2 } = \frac{ e^{\eta} }{ (1 + e^{\eta})^2 } = \frac{ h(\eta) }{ (1 + e^{\eta}) }

= h(\eta) \frac{ 1 + e^{\eta} - e^{\eta} }{ (1 + e^{\eta}) } = h(\eta) (1 - h(\eta)),

and

\frac{\partial \eta_j}{\partial \beta_1} = \frac{\partial \left(\beta_0 + \beta_1X_j + \sum_l{\alpha_lZ_{jl}} \right) }{\partial \beta_1} = X_j .

Thus

\frac{\partial \mu_j}{\partial \beta_1} = h'(\eta_j) \frac{\partial \eta_j}{\partial \beta_1} = h(\eta_j) (1 - h(\eta_j)) X_j = \mu_j (1 - \mu_j) X_j,

and the derivative of the log-likelihood with respect to \beta_1 is

\frac{\partial l}{\partial \beta_1} = \sum_j \left( \left[ \frac{Y_j}{\mu_j} - \frac{(1 - Y_j)}{(1 - \mu_j)} \right] \frac{\partial \mu_j}{\partial \beta_1} \right) = \sum_j \left( \left[ \frac{Y_j}{\mu_j} - \frac{(1 - Y_j)}{(1 - \mu_j)} \right] \mu_j (1 - \mu_j) X_j \right)

= \sum_j \left( \left[ \frac{(1 - \mu_j) Y_j - \mu_j (1 - Y_j)}{\mu_j (1 - \mu_j)} \right] \mu_j (1 - \mu_j) X_j \right)

= \sum_j \left( \left[ Y_j - \mu_j Y_j - \mu_j + \mu_j Y_j \right] X_j \right) = \sum_j \left( \left[ Y_j - \mu_j \right] X_j \right) .

Under the null hypothesis H_0:\beta_1 = 0, we have

\eta_j = \beta_0 + \sum_l{\alpha_lZ_{jl}}.

Also, under the null hypothesis, the X_j do not contribute to the solution to the original model, and thus \beta_0 and the \alpha_l are also, under the null hypothesis, the solution to the reduced model

log\left(\frac{P(Y_j=1|Z_{jl})}{1 - P(Y_j=1|Z_{jl})}\right) = \beta_0 + \sum_l{\alpha_lZ_{jl}}.

Thus, we may set up the score test for H_0:\beta_1 = 0 as

U = \sum_j{X_j(Y_j - \mu_j)},

\mu_j = h(\eta_j),

\eta_j = \beta_0 + \sum_l{\alpha_lZ_{jl}},

where \beta_0 and the \alpha_l are the solution to the reduced model.

Using the Regression Score with Permutation Testing

When there are covariates, we see that \mu_j is dependent upon the \alpha_l, which are dependent upon the Y_j. Thus, for every permutation of the case/control statuses Y_j, the reduced-model regression must be repeated.

Thus, KBAC with Regression can be much more compute-intensive than KBAC with Permutation Testing. For this reason, Golden Helix SVS KBAC with Regression pre-computes the reduced-model regression for all permutations when imputing the wild type, and offers the option of pre-computing the reduced-model regression for all permutations when not imputing the wild type. However, KBAC with Regression is able to correct for covariates, while KBAC with Permutation Testing is not.

Correction for Relationships and/or Pedigree Structure

Please see Mixed-Model Kernel-Based Adaptive Cluster (KBAC) Method.

KBAC with Permutation Testing

This feature will perform the KBAC method using its original algorithm and statistic. The case/control dependent variable is tested against the weighted multi-site genotypes other than the wild-type genotype.

Data Requirements for KBAC with Permutation Testing

KBAC analysis is to be performed from a mapped spreadsheet of sequence data.

Note

  • You should first filter the markers down to rare variants. KBAC with Permutation Testing assumes that only rare-variant markers are active.
  • If you wish to filter based on a user-defined minor-allele frequency threshold and an external reference population provided through a probe annotation track, you may use DNA-Seq > Variant Binning by Frequency Source (please see Variant Binning by Frequency Source), entering the minor-allele threshold as the “bin 0 threshold” to generate a “binning” spreadsheet. Inactivate all rows of the “binning” spreadsheet that are not in bin 0, then use Select > Activate or Inactivate Based on Second Spreadsheet to activate all columns in the original marker spreadsheet corresponding to active rows in the “binning” spreadsheet.

Usage and Parameters

To use KBAC with Permutation Testing, select DNA-Seq > Collapsing Methods > KBAC with Permutation Testing from the spreadsheet menu, and use the resulting window (see KBAC with Permutation Testing – Parameters) to select the following parameters:

kbacPermWin

KBAC with Permutation Testing – Parameters

  • Parameters for KBAC (Liu & Leal, 2010) Using Permutation Testing Enter the kernel type and the permutation parameters.

    • Kernel Type The possible choices are:
      • Hyper-geometric (Normally recommended) Gives the most accurate test.
      • Marginal binomial Less compute-intensive and almost as accurate as the hyper-geometric kernel.
      • Asymptotic normal (For large sample sizes only) You should have more than 400 cases and 400 controls for this kernel type; it is not as accurate as the other two kernels.
    • Permutations The permutation parameters are:
      • Number of permutations to use: The KBAC test is dependent upon permutation testing. Please enter the number of permutations here.
      • Permutation Mode: The possible choices are:
        • Standard C/C permutation procedure This is normally recommended and gives the most accurate test.
        • KBAC Monte-Carlo approximation (For large sample sizes only) You should have more than 400 cases and 400 controls for this permutation mode; it is not as accurate as the standard case/control permutation procedure.
      • To use adaptive permutation testing, check Adaptive permutation testing (threshold alpha) and fill in the threshold alpha value. Please see Adaptive Permutation Testing below for more about this technique and the possible advantages of using it.
  • Outputs Choose one or both of the following:

    • One-sided statistics (recommended) This tests the one-sided alternative hypothesis of the enrichment of causal variants in cases.
    • Two-sided statistics This tests the two-sided alternative hypothesis of the difference between weighted multi-site genotype frequencies between cases and controls.

    Note

    Due to the often asymmetric distribution of the one-sided KBAC test-statistic, two-sided p-values will almost never be twice the one-sided p-values, and will in fact be exactly the same as the one-sided p-values if the statistic’s distribution, given the numbers of cases and controls and numbers of genotypes, is entirely positive.

    Further, in most practical cases where the p-value is less than .05, the two-sided and one-sided p-values will be exactly the same.

  • Regions Specify an annotation gene track that will be used to distinguish regions for testing. Where possible, use a local filtered gene track to determine the gene list, since a local track allows the spreadsheet to be processed faster than a network track does. An example filtered gene track name is: FilteredRefSeqGenes-UCSC_NCBI_36_Homo_sapiens.idf:1

    Please see The Data Source Library and Downloading Data for information on how to obtain annotation tracks.

    • Perform one test per...
      • Select gene (recommended) to use the minimum start position and the maximum stop position of all transcripts to define a gene, for which just one test will be performed.
      • Select transcript to perform one test for every transcript. This option will result in a more severe multiple testing penalty.
    • Include nearby markers (distance in bp) Checking this and entering a distance in base pairs will let you create a “window” around a gene or transcript to include further variants, which you may want to do if you suspect these may affect gene regulation or splicing.
  • Missing Genotype Values If there is any missing genotype data in the current region for a given sample, either the wild type (homozygous reference genotype) can be imputed to have been present and substituted for the missing data, or the sample can be skipped and not used. Select Yes to Impute Wild Type for Genotypic Missing Values to impute the wild type, or No to skip samples containing missing data for the current region.

Output

A marker-mapped spreadsheet will be generated as output containing one row for each region and containing some or all of the following columns, according to what you have requested:

  • Chr The chromosome number or label.
  • Start The starting genetic position of the region.
  • Stop The ending genetic position of the region.
  • Gene Name The gene name.
  • Transcript Name(s) Transcript name(s) for each gene region.
  • Strand Strand for the gene and transcripts.
  • P-Value (One-Sided) The p-value resulting from the one-sided KBAC test.
  • -log10 P-Value (One-Sided) Minus the base-10 log of the above p-value.
  • KBAC (One-Sided) The one-sided KBAC statistic itself.
  • Bonf. P (One-Sided) The Bonferroni-adjusted p-value from the one-sided test.
  • FDR (One-Sided) The false discovery rate for all tests having a one-sided p-value less than or equal to the current test’s one-sided p-value.
  • P-Value (Two-Sided) The p-value resulting from the two-sided KBAC test.
  • -log10 P-Value (Two-Sided) Minus the base-10 log of the above p-value.
  • KBAC (Two-Sided) The two-sided KBAC statistic itself.
  • Bonf. P (Two-Sided) The Bonferroni-adjusted p-value from the two-sided test.
  • FDR (Two-Sided) The false discovery rate for all tests having a two-sided p-value less than or equal to the current test’s two-sided p-value.
  • Actual Permutations Used How many permutations were actually used for this test by the adaptive permutation testing algorithm.
  • Sample Size Used The actual sample size used for this test after samples with missing values in the dependent variable and, if requested, the genotypes, have been removed.
  • # Markers The total number of markers tested in this region.
  • # Multi-Marker Genotypes The total number of distinct multi-marker genotypes discovered in this region.

KBAC with Regression

This feature, which allows additional covariates to be specified, will perform the KBAC method using the regression score function. The case/control variable is tested against the weighted multi-site genotypes other than the wild-type genotype and against the expected values of the case/control variable. The expected values of the case/control variable are corrected by any additional covariates.

Data Requirements for KBAC with Regression

The data requirements for KBAC analysis with regression are the same as they are for KBAC with Permutation Testing (Data Requirements for KBAC with Permutation Testing) with the following addition:

  • The spreadsheet must contain the covariates, if any, that you wish to control for. Each covariate may be either quantitative or binary.

Usage and Parameters

To use KBAC with Regression, select DNA-Seq > Collapsing Methods > KBAC with Regression from the spreadsheet menu, and use the resulting window (see KBAC Method with Regression – Parameters) to select the following parameters:

kbacRegressWin

KBAC Method with Regression – Parameters

  • Parameters for KBAC (Liu & Leal, 2010) Logistic Regression Enter the kernel type and the permutation parameters.

    • Kernel Type The possible choices are:
      • Hyper-geometric (Normally recommended) Gives the most accurate test.
      • Marginal binomial Less compute-intensive and almost as accurate as the hyper-geometric kernel.
      • Asymptotic normal (For large sample sizes only) You should have more than 400 cases and 400 controls for this kernel type; it is not as accurate as the other two kernels.
    • Permutations The permutation parameters are:
      • Number of permutations to use: The KBAC test is dependent upon permutation testing. Please enter the number of permutations here.
      • To use adaptive permutation testing, check Adaptive permutation testing (threshold alpha) and fill in the threshold alpha value. Please see Adaptive Permutation Testing below for more about this technique and the possible advantages of using it.
  • Outputs Choose one or both of the following:

    • One-sided statistics (recommended) This tests the one-sided alternative hypothesis of the enrichment of causal variants for higher corrected case/control statuses Y_j - \mu_j.
    • Two-sided statistics This tests the two-sided alternative hypothesis of the dependency of causal variants upon the corrected case/control statuses Y_j - \mu_j.

    Note

    Due to the often asymmetric distribution of the one-sided KBAC test-statistic, two-sided p-values will almost never be twice the one-sided p-values, and will in fact be exactly the same as the one-sided p-values if the statistic’s distribution, given the numbers of cases and controls and numbers of genotypes, is entirely positive.

    Further, in most practical cases where the p-value is less than .05, the two-sided and one-sided p-values will be exactly the same.

  • Regions Specify an annotation gene track that will be used to distinguish regions for testing. Where possible, use a local filtered gene track to determine the gene list, since a local track allows the spreadsheet to be processed faster than a network track does. An example filtered gene track name is: FilteredRefSeqGenes-UCSC_NCBI_36_Homo_sapiens.idf:1

    Please see The Data Source Library and Downloading Data for information on how to obtain annotation tracks.

    • Perform one test per...
      • Select gene (recommended) to use the minimum start position and the maximum stop position of all transcripts to define a gene, for which just one test will be performed.
      • Select transcript to perform one test for every transcript. This option will result in a more severe multiple testing penalty.
    • Include nearby markers (distance in bp) Checking this and entering a distance in base pairs will let you create a “window” around a gene or transcript to include further variants, which you may want to do if you suspect these may affect gene regulation or splicing.
  • Check Correct for Additional Covariates to utilize the regression score technique to correct for additional covariates.

    • Click on Add Columns to start selecting which variables (spreadsheet columns) to use for correction.
  • Missing Genotype Values If there is any missing genotype data in the current region for a given sample, either the wild type (homozygous reference genotype) can be imputed to have been present and substituted for the missing data, or the sample can be skipped and not used.

    If you wish to impute the wild type, or you have no missing data, select Yes (faster) to Impute Wild Type for Genotypic Missing Values. Reduced-model results for all of the permutations will be pre-computed in advance, and because the samples that will be used for all the regions will be the same (all samples for which the dependent variable and the covariates all have non-missing data), these results will be accurate.

    Otherwise, samples containing missing data for the region being analyzed will be skipped. Select No (slower) to compute the reduced model all over again for each region and obtain accurate results, or select No but precompute reduced models (faster but gives approximate values) to pre-compute the reduced-model results in advance (for all samples for which the dependent variable and the covariates all have non-missing data) and then use these pre-computed results for those samples that are actually being used.

    Note

    When selecting No but precompute reduced models (faster but gives approximate values), and if there are samples containing missing data for the region being analyzed (but not for the dependent variable and covariates), the KBAC statistic values and the p-values will be approximate for two reasons:

    1. The reduced model results were pre-computed for a larger subset than is actually being used.
    2. While the potential genotype weights (X_j or w_{i(j)}) for every region are always pre-computed using the actual subset (and thus are always exact for the unpermuted case), the number of total cases that are in the actual subset but were obtained from permutation on the larger subset may fluctuate as you move through the permutations, making the pre-computation results for the permutations only approximate. (Also, a pre-computation may not have been performed for the number of (permuted) cases that a genotype has, in which case the pre-computed value for the highest number of cases for which pre-computation was done is used.)
  • Intermediate Outputs To aid in comparing KBAC to other methods, you may select Output separate spreadsheets of weights (from unpermuted Y) and genotype indexes to get the vector of X_j (w_{i(j)} weight) values, as calculated from the unpermuted Y_j values, output to a second separate spreadsheet. That is, for every region and for every sample j for which the region contains multi-marker genotype G_{i(j)}, this spreadsheet will output

    X_j = w_{i(j)} = K_{i(j)}^0(\hat{R}_{i(j)}) = K_{i(j)}^0\left(\frac{n^A_{i(j)}}{n_{i(j)}}\right).

    Additionally, if you select Output separate spreadsheets of weights (from unpermuted Y) and genotype indexes, a third spreadsheet will be output containing the index value i(j) corresponding to each multi-marker genotype G_{i(j)}.

    If G_{i(j)} is the wild-type genotype, a zero will be output to both spreadsheets. (A brief sketch of assembling these two outputs follows the notes below.)

    Note

    As stated above in the note to Alternative View of the Statistic, the X_i used in the permutations (and in the iterations of the Monte-Carlo method featured in KBAC with Permutation Testing) will vary from the X_i derived from the original unpermuted data.

    Note

    The X_i values (for the unpermuted Y_j values) which are output to the separate spreadsheet are exact, even if you select No but precompute reduced models (faster but gives approximate values) to Impute Wild Type for Genotypic Missing Values.
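
    As an illustration of how the two extra spreadsheets relate to one another, here is a minimal sketch in Python; the kernel_cdf_0 helper is hypothetical, standing in for evaluating the chosen kernel's K^0_i at \hat{R}_i as defined above.

        def weight_and_index_outputs(genotype_index, n_cases, n_totals, kernel_cdf_0):
            # genotype_index[j] = i(j) for each sample j; 0 denotes the wild type.
            # n_cases[i] = n^A_i and n_totals[i] = n_i for multi-marker genotype i.
            # kernel_cdf_0(i, r): hypothetical stand-in for K^0_i evaluated at r.
            weights, indexes = [], []
            for i in genotype_index:
                indexes.append(i)
                if i == 0:
                    weights.append(0.0)   # wild type: zero in both spreadsheets
                else:
                    weights.append(kernel_cdf_0(i, n_cases[i] / n_totals[i]))
            return weights, indexes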

Outputs

A marker-mapped spreadsheet will always be generated as output containing one row for each region and containing some or all of the following columns, according to what you have requested:

  • Chr The chromosome number or label.
  • Start The starting genetic position of the region.
  • Stop The ending genetic position of the region.
  • Gene Name The gene name.
  • Transcript Name(s) Transcript name(s) for each gene region.
  • Strand Strand for the gene and transcripts.
  • P-Value (One-Sided) The p-value resulting from the one-sided KBAC test.
  • -log10 P-Value (One-Sided) Minus the base-10 log of the above p-value.
  • KBAC Score (One-Sided) The one-sided KBAC score statistic itself.
  • Bonf. P (One-Sided) The Bonferroni-adjusted p-value from the one-sided test.
  • FDR (One-Sided) The false discovery rate for all tests having a one-sided p-value less than or equal to the current test’s one-sided p-value.
  • P-Value (Two-Sided) The p-value resulting from the two-sided KBAC test.
  • -log10 P-Value (Two-Sided) Minus the base-10 log of the above p-value.
  • KBAC Score (Two-Sided) The two-sided KBAC score statistic itself.
  • Bonf. P (Two-Sided) The Bonferroni-adjusted p-value from the two-sided test.
  • FDR (Two-Sided) The false discovery rate for all tests having a two-sided p-value less than or equal to the current test’s two-sided p-value.
  • Actual Permutations Used How many permutations were actually used for this test by the adaptive permutation testing algorithm.
  • Sample Size Used The actual sample size used for this test after samples with missing values in the dependent variable and, if requested, the genotypes, have been removed.
  • # Markers The total number of markers tested in this region.
  • # Multi-Marker Genotypes The total number of distinct multi-marker genotypes discovered in this region.

Additionally, if you have requested Output separate spreadsheets of weights (from unpermuted Y) and genotype indexes, two other marker-mapped spreadsheets will be output as follows:

  • KBAC Regression Weights (from Unpermuted Y) containing one row for each sample and the following columns:
    • Your case/control variable.
    • One column for every covariate you have specified.
    • The X_j = w_{i(j)} weight value (as computed from the unpermuted Y_j) for each sample j in the region. If the sample’s genotype G_{i(j)} is the wild type, zero is output. Each column header for these columns will contain the chromosome name, followed by an underscore (‘_’), followed by the start position of the region.
  • KBAC Regression Genotype Indexes containing one row for each sample and the following columns:
    • Your case/control variable.
    • One column for every covariate you have specified.
    • For each sample j, the i(j) index for that sample’s genotype G_{i(j)} in the region. If the sample’s genotype G_{i(j)} is the wild type, zero is output. Each column header for these columns will contain the chromosome name, followed by an underscore (‘_’), followed by the start position of the region.

Mixed-Model Kernel-Based Adaptive Cluster (KBAC) Method

At Golden Helix, we have adapted mixed models (Methods for Mixed Linear Model Analysis) to the KBAC test (Kernel-Based Adaptive Cluster (KBAC) Test). This method may be used to perform a KBAC test which corrects for genetic relationships or kinships among the samples.

This test works essentially as does KBAC with Regression (KBAC with Regression), except that the reduced model used for scoring is created from a logistic mixed-model equation, and thus implicitly includes the kinship information along with any covariate information.

Overview of Theory

Regression Score

We define the mixed-model regression score in a fashion similar to the definition in Regression Score, except that we will change the notation slightly to better agree with our mixed-model notation (The Mixed Model Equation).

Suppose we have l covariates such as age, sex or eigenvectors for genotypes, and we define X_{fjl} to be the value of the l^{th} (fixed-effect) covariate for the j^{th} sample. Also, suppose we have random effects u_j, written in vector form as u. We take the variance/covariance structure of the random effects to be G. If we define X_j and Y_j as in Alternative View of the Statistic, the logistic model for association testing will have the form

log\left(\frac{P(Y_j=1|X_j,X_{fjl},u_j)}{1 - P(Y_j=1|X_j,X_{fjl},u_j)}\right) = \beta_0 + \beta_1X_j + \sum_l{\beta_{fl}X_{fjl}} + u_j.

The score statistic to test the null hypothesis H_0:\beta_1 = 0 of the independence of this model from the X_j will be

U = \sum_j{X_j(Y_j - \mu_j)},

where

\mu_j = h(\eta_j) = \frac{e^{\eta_j}}{1 + e^{\eta_j}},

\eta_j being defined by

\eta_j = \beta_0 + \sum_l{\beta_{fl}X_{fjl}} + u_j.

\beta_0, the \beta_{fl}, and the predictions of the u_j are determined by solving the (reduced) logistic mixed model

log\left(\frac{P(Y_j=1|X_{fjl},u_j)}{1 - P(Y_j=1|X_{fjl},u_j)}\right) = \beta_0 + \sum_l{\beta_{fl}X_{fjl}} + u_j,

using the same variance/covariance matrix G as above for the u_j.

As with the original KBAC method, we define U to be a one-sided score to test for the enrichment of causal variants for higher corrected case/control statuses Y_j - \mu_j, and we define U^2 to be a two-sided score to test for the dependency of causal variants upon the corrected case/control statuses Y_j - \mu_j.
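
For illustration, here is a minimal sketch (in Python with NumPy; the function name is our own) of computing these two scores from per-sample weights x (the X_j), case/control statuses y (the Y_j), and fitted means mu (the \mu_j from the reduced mixed model):

    import numpy as np

    def kbac_mixed_model_scores(x, y, mu):
        # x: per-sample KBAC weights X_j (zero for wild-type samples)
        # y: case/control statuses Y_j; mu: means fitted by the reduced model
        u = np.sum(x * (y - mu))   # one-sided score U
        return u, u ** 2           # (one-sided score, two-sided score U^2)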

Note

We derive the mixed-model regression score in a manner similar to that for Regression Score. We note that the expectations of the Y_j given the u_j are

E[Y_j | u_j] = \mu_j = h(\eta_j) = \frac{e^{\eta_j}}{1 + e^{\eta_j}},

where

\eta_j = \beta_0 + \beta_1X_j + \sum_l{\beta_{fl}X_{fjl}} + u_j,

and that the likelihood and log-likelihood for all the observations are

L = \prod_j \mu_j^{Y_j}(1 - \mu_j)^{1 - Y_j}

and

l = \sum_j \left[ Y_j log(\mu_j) + (1 - Y_j) log(1 - \mu_j) \right] .

As before, since we are using the null hypothesis H_0:\beta_1 = 0, we will determine the score function with respect to \beta_1, which is the derivative of the log-likelihood with respect to \beta_1:

\frac{\partial l}{\partial \beta_1} = \sum_j \frac{\partial l}{\partial \mu_j} \frac{\partial \mu_j}{\partial \beta_1} = \sum_j \left( \left[ \frac{Y_j}{\mu_j} - \frac{(1 - Y_j)}{(1 - \mu_j)} \right] \frac{\partial \mu_j}{\partial \beta_1} \right)

where

\frac{\partial \mu_j}{\partial \beta_1} = \frac{\partial \mu_j}{\partial \eta_j} \frac{\partial \eta_j}{\partial \beta_1} = h'(\eta_j) \frac{\partial \eta_j}{\partial \beta_1}

where for any scalar value \eta,

h(\eta) = \frac{e^{\eta}}{1 + e^{\eta}}

(as defined earlier), and

h'(\eta) = h(\eta) (1 - h(\eta)),

(as shown in Regression Score). In a fashion similar to before, we have

\frac{\partial \eta_j}{\partial \beta_1} = \frac{\partial \left( \beta_0 + \beta_1X_j + \sum_l{\beta_{fl}X_{fjl}} + u_j \right) }{\partial \beta_1} = X_j .

Thus

\frac{\partial \mu_j}{\partial \beta_1} = h'(\eta_j) \frac{\partial \eta_j}{\partial \beta_1} = h(\eta_j) (1 - h(\eta_j)) X_j = \mu_j (1 - \mu_j) X_j,

and the derivative of the log-likelihood with respect to \beta_1 is

\frac{\partial l}{\partial \beta_1} = \sum_j \left( \left[ \frac{Y_j}{\mu_j} - \frac{(1 - Y_j)}{(1 - \mu_j)} \right] \frac{\partial \mu_j}{\partial \beta_1} \right) = \sum_j \left( \left[ \frac{Y_j}{\mu_j} - \frac{(1 - Y_j)}{(1 - \mu_j)} \right] \mu_j (1 - \mu_j) X_j \right)

= \sum_j \left( \left[ Y_j - \mu_j \right] X_j \right) ,

(as shown in Regression Score, except this time, \mu_j and \eta_j are defined as directly above). Under the null hypothesis H_0:\beta_1 = 0, we have

\eta_j = \beta_0 + \sum_l{\beta_{fl}X_{fjl}} + u_j.

Also, under the null hypothesis, the X_j do not contribute to the solution of the original model, and thus \beta_0, the \beta_{fl}, and the predictions of u_j are also the solution to the reduced mixed model

log\left(\frac{P(Y_j=1|X_{fjl},u_j)}{1 - P(Y_j=1|X_{fjl},u_j)}\right) = \beta_0 + \sum_l{\beta_{fl}X_{fjl}} + u_j

using the same variance/covariance matrix G for the u_j as for the original mixed model.

Thus, we may set up the score test for H_0:\beta_1 = 0 as

U = \sum_j{X_j(Y_j - \mu_j)},

\mu_j = h(\eta_j),

\eta_j = \beta_0 + \sum_l{\beta_{fl}X_{fjl}} + u_j,

where \beta_0, the \beta_{fl}, and the predictions of u_j are the solution to the reduced mixed model.

The Logistic (Reduced) Mixed Model

Define

  • X_f as the matrix consisting of all one’s in the first column, followed by the X_{fjl} values in the remaining columns (with X_{fjl} being the value in row j and column l+1).
  • \beta as the vector consisting of the value \beta_0 as the first element and the value \beta_{fl} as its element number l+1.
  • Y, \mu, and \eta as the vectors of Y_j values, \mu_j values, and \eta_j values, respectively.
  • h(\eta), where \eta is a vector consisting of \eta_j values, as a vector consisting of the h(\eta_j) values using the scalar definition of h(\bullet).

The above reduced model may then be rewritten as

E[Y|u] = h(X_f \beta + u) = h(\eta) = \mu,

where E[Y|u] is the expectation of the Y_j given the u_j, and

Var[Y|u] = A = A^{1/2} A^{1/2},

where Var[Y|u] is the variance of the Y_j given the u_j, and A is the variance of the binomial distribution itself, with diagonal elements A_{jj} = \mu_j(1 - \mu_j) and off-diagonal elements equal to zero.

Note

These expressions are sometimes seen elsewhere as

E[Y|u] = h(X_f \beta + Zu)

and

Var[Y|u] = A^{1/2} R A^{1/2},

but here, we use the identity matrix I for the “random-effect design matrix” Z as well as for the “R-side random effect matrix” R.

The “linear predictor” for this model is

\eta = X \beta + u,

while h(\bullet) is the “inverse link function” for this model.

Solving the Logistic Mixed Model

We do this by iterating back and forth between creating a linear pseudo-mixed-model and finding a solution for that pseudo-model.

To determine how to create the pseudo-model, we write a first-order Taylor series of \mu about some given tentative values of \beta and u, namely \tilde{\beta} and \tilde{u}:

h(\eta) \dot{=} h(\tilde{\eta}) + \tilde{\Delta} X (\beta - \tilde{\beta}) + \tilde{\Delta} (u - \tilde{u}),

where

\tilde{\Delta} = \big( \frac{\partial h(\eta)}{\partial \eta} \big)_{\tilde{\beta}, \tilde{u}}

is a diagonal matrix of derivatives of the conditional mean evaluated at the tentative values of \beta and u. Rearranging terms yields

\tilde{\Delta}^{-1} (\mu - h(\tilde{\eta})) + X \tilde{\beta} + \tilde{u} \dot{=} X \beta + u .

The left side is the expected value, conditional on u, of

P \equiv \tilde{\Delta}^{-1} (Y - h(\tilde{\eta})) + X \tilde{\beta} + \tilde{u}.

The variance of P given u is

Var[P|u] = \tilde{\Delta}^{-1} \tilde{A}^{1/2} \tilde{A}^{1/2} \tilde{\Delta}^{-1} ,

where

\tilde{A}_{jj} = \tilde{\mu}_j (1 - \tilde{\mu}_j), \tilde{A}_{ij} = 0 (i \ne j) .

We now have the form of the pseudo-model as

P = X \beta + u + \epsilon,

which is a linear mixed model with pseudo-response P, fixed effects \beta, random effects u, and Var[\epsilon] = Var[P|u].

The coefficients \beta and the random-effect predictions u from solving this pseudo-model may now be used as the new tentative values \tilde{\beta} and \tilde{u}.

Note

Alternatively, it may be argued that it is better to use the expected values of u, which are all zero, for every iteration’s tentative random effect values \tilde{u}, and to wait until the iteration process is finished before using the predicted (BLUP) values of u from the pseudo-model.

Solving the Pseudo-Model

We would like to use the EMMA algorithm (Finding the Variance Components) to solve this pseudo-model, but there is one problem: EMMA assumes that the variance of the errors is a multiple of the identity matrix (see The Mixed Model Equation), which is not what we have here.

Thus, to use EMMA, we must transform our pseudo-model to another model. To do this, we want to find a matrix T such that Var[T\epsilon] = I, so that we can write

TP = TX \beta + Tu + T\epsilon

and use EMMA to solve the mixed model

TP = TX \beta + Tu + \epsilon^*,

where the variance of \epsilon^* is proportional to I. We note that

Var[T\epsilon] = T Var[\epsilon] T' .

Setting this equal to the identity matrix gives us

T Var[\epsilon] T' = I ,

and

Var[\epsilon] = T^{-1} (T')^{-1} = Var[P|u] = \tilde{\Delta}^{-1} \tilde{A}^{1/2} \tilde{A}^{1/2} \tilde{\Delta}^{-1} .

This is solved by letting

T = \tilde{A}^{-1/2} \tilde{\Delta} .

(Note that A, \tilde{A}^{-1/2}, \tilde{\Delta}, and thus T are all diagonal.)

Now we may use EMMA to solve

TP = TX \beta + Tu + \epsilon^* .

(Note that in this transformed pseudo-model, either T may be thought of as the “Z” matrix or “random-effect design matrix” for u, or Tu may be thought of as its own random-effect variable.)

Summary of the Algorithm

First, pick starting values of \tilde{\beta} and \tilde{u}, such as all zeros. Then repeat the following steps until the changes in \tilde{\beta} and \tilde{u} are sufficiently small:

  1. Find \tilde{\eta} and \tilde{\mu} (the tentative values of \eta and \mu) from the original linear predictor equation and the definition for h(\bullet):

    \tilde{\eta} = X \tilde{\beta} + \tilde{u}

    \tilde{\mu} = h(\tilde{\eta})

  2. Find the (diagonal) \tilde{\Delta} matrix

    \tilde{\Delta} = \big( \frac{\partial h(\eta)}{\partial \eta} \big)_{\tilde{\beta}, \tilde{u}}

    the diagonal-element values of which specifically come out to

    \tilde{\Delta}_{jj} = h'(\tilde{\eta}_j) = h(\tilde{\eta}_j) (1 - h(\tilde{\eta}_j)) = \tilde{\mu}_j (1 - \tilde{\mu}_j)

  3. Find the pseudo-model P:

    P = \tilde{\Delta}^{-1} (Y - h(\tilde{\eta})) + X \tilde{\beta} + \tilde{u}

    or

    P = \tilde{\Delta}^{-1} (Y - \tilde{\mu}) + X \tilde{\beta} + \tilde{u}

  4. Find the (diagonal) matrix T:

    T = \tilde{A}^{-1/2} \tilde{\Delta}

    the diagonal-element values of which specifically come out to

    T_{jj} = \tilde{A}_{jj}^{-1/2} \tilde{\Delta}_{jj} = (\tilde{\mu}_j (1 - \tilde{\mu}_j))^{-1/2} \tilde{\mu}_j (1 - \tilde{\mu}_j) = (\tilde{\mu}_j (1 - \tilde{\mu}_j))^{1/2}

  5. Find new values of \tilde{\beta} and \tilde{u} using EMMA for:

    TP = TX \beta + Tu + \epsilon^*

Note

Using the alternative method (see the note in Solving the Logistic Mixed Model above), this algorithm simplifies somewhat to:

First, pick starting values of \tilde{\beta}, such as all zeros. (Always use zeros in place of \tilde{u}.) Then repeat the following steps until the changes in \tilde{\beta} are sufficiently small:

  1. Find \tilde{\eta} and \tilde{\mu} (the tentative values of \eta and \mu) from the original linear predictor equation, the definition for h(\bullet), and the fact that we are always using zeros for \tilde{u}:

    \tilde{\eta} = X \tilde{\beta}

    \tilde{\mu} = h(\tilde{\eta})

  2. Find the (diagonal) \tilde{\Delta} matrix

    \tilde{\Delta} = \big( \frac{\partial h(\eta)}{\partial \eta} \big)_{\tilde{\beta}, 0}

    the diagonal-element values of which specifically come out to

    \tilde{\Delta}_{jj} = h'(\tilde{\eta}_j) = h(\tilde{\eta}_j) (1 - h(\tilde{\eta}_j)) = \tilde{\mu}_j (1 - \tilde{\mu}_j)

  3. Find the pseudo-model P:

    P = \tilde{\Delta}^{-1} (Y - h(\tilde{\eta})) + X \tilde{\beta}

    or

    P = \tilde{\Delta}^{-1} (Y - \tilde{\mu}) + X \tilde{\beta}

  4. Find the (diagonal) matrix T:

    T = \tilde{A}^{-1/2} \tilde{\Delta}

    the diagonal-element values of which specifically come out to

    T_{jj} = \tilde{A}_{jj}^{-1/2} \tilde{\Delta}_{jj} = (\tilde{\mu}_j (1 - \tilde{\mu}_j))^{-1/2} \tilde{\mu}_j (1 - \tilde{\mu}_j) = (\tilde{\mu}_j (1 - \tilde{\mu}_j))^{1/2}

  5. Find new values of \tilde{\beta} using EMMA for:

    TP = TX \beta + Tu + \epsilon^*

    Because this is the alternative method, the values of u need not be predicted here.

Finally, when iteration has finished, predict the u from the final values of T, P, and \hat{\beta}, then compute the final values of \eta and \mu from this prediction.

Note

The alternative method is used in Golden Helix SVS.
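
For illustration, here is a minimal sketch of the alternative-method iteration in Python with NumPy. The solve_emma helper is a hypothetical stand-in for fitting the linear pseudo-mixed-model TP = TX \beta + Tu + \epsilon^* via EMMA and returning the new \beta estimates; the final prediction of u is omitted.

    import numpy as np

    def h(eta):
        # Inverse link (logistic) function h(eta) = e^eta / (1 + e^eta).
        return 1.0 / (1.0 + np.exp(-eta))

    def fit_logistic_mixed_model(X, Y, solve_emma, tol=1e-6, max_iter=100):
        beta = np.zeros(X.shape[1])            # starting values of beta-tilde
        for _ in range(max_iter):
            eta = X @ beta                     # step 1 (u-tilde is always zero)
            mu = h(eta)
            delta = mu * (1.0 - mu)            # step 2: diagonal of Delta-tilde
            P = (Y - mu) / delta + X @ beta    # step 3: pseudo-response
            T = np.sqrt(delta)                 # step 4: diagonal of T
            new_beta = solve_emma(T[:, None] * X, T * P)   # step 5: EMMA solve
            if np.max(np.abs(new_beta - beta)) < tol:
                return new_beta
            beta = new_beta
        return beta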

Optimizations

  • Since finding the logistic reduced mixed model for any Y is somewhat compute-intensive, the results of finding the reduced mixed model for the un-permuted Y and every permutation of Y are pre-computed and stored in advance of scanning any regions for the KBAC test.

    Note

    To facilitate this, the Mixed-Model KBAC feature of Golden Helix SVS will never drop samples. Instead, it will always handle samples with missing genotypes in any region by imputing the wild type (homozygous reference genotype) for these samples in that region.

  • When there are no covariates, the correct value of \beta is computed directly from

    \beta = log(\frac{\bar{Y}}{1 - \bar{Y}}),

    where \bar{Y} is the average value of the Y. This value is then used as the “initial value” of \tilde{\beta}, after which the mixed model will converge immediately.

    Note

    This optimization only works because we are using the alternative method whereby the tentative random effects \tilde{u} are taken to be zero for the purposes of the iteration process.

  • When there are covariates, the final values of \tilde{\beta} and of the eigen-decompositions for U_{full} and U_{reduced} which were computed for the unpermuted Y are used to initialize the mixed-model calculations for every permuted Y.

  • During logistic mixed-model computations, the EMMA matrix eigen-decompositions are only re-computed for every third iteration unless convergence seems imminent.

  • The following optimization process for the permuted Y, for use with covariates, can speed up computation by almost an order of magnitude, because it completely avoids performing any eigen-decompositions whatsoever. However, this optimization process gives only approximate final results; it is therefore optional and may be selected or left unselected.

    The steps for this process are as follows:

    1. Initialize \tilde{\beta} using the final values of \tilde{\beta} that were computed for the unpermuted Y.

    2. Using the final eigen-decompositions of U_{full} and U_{reduced} that were computed for the unpermuted Y (and EMMA logic), attempt to compute valid values for the variance components (namely, the random effect variance \sigma^2_g and the error variance \sigma^2_e) within three iterations. However, accept whatever the results are, whether or not there was apparent convergence.

    3. Using these variance components (\sigma^2_g and \sigma^2_e) as derived from the EMMA logic, and initializing with the latest values of \tilde{\beta} as derived from the EMMA logic, iterate using the Henderson formula (as modified to accommodate singular G) (see note below) to find successive \tilde{\beta} values until convergence is achieved.

      Note

      The Henderson formula (as modified to accommodate singular G), for any mixed-model equation

      y = X \beta + Zu + \epsilon

      with

      Var(u) = G ,

      is:

      \begin{pmatrix}
X'R^{-1}X  &  X'R^{-1}Z \\
GZ'R^{-1}X  &  GZ'R^{-1}Z + I
\end{pmatrix}
\begin{pmatrix}
\tilde{\beta} \\
\hat{u}
\end{pmatrix} =
\begin{pmatrix}
X'R^{-1}y \\
GZ'R^{-1}y
\end{pmatrix} .

      When this formula is used, \sigma^2_g G is substituted for the “G” in the Henderson formula, the identity matrix is substituted for the “Z” in the Henderson formula, \sigma^2_e \tilde{\Delta}^{-1} \tilde{A}^{1/2} \tilde{A}^{1/2} \tilde{\Delta}^{-1} = \sigma^2_e \tilde{\Delta}^{-1} \tilde{A} \tilde{\Delta}^{-1} is substituted for the “R” in the Henderson formula, and the pseudo-model P is substituted for the “y” in the Henderson formula.

      R^{-1} for the Henderson formula is composed of diagonal-element values

      R^{-1}_{jj} = (\sigma^2_e \tilde{\Delta}^{-1} \tilde{A} \tilde{\Delta}^{-1})^{-1}_{jj} = (\sigma^2_e)^{-1} \tilde{\Delta}_{jj} \tilde{A}^{-1}_{jj} \tilde{\Delta}_{jj} = (\tilde{\mu}_j (1 - \tilde{\mu}_j)) (\tilde{\mu}_j (1 - \tilde{\mu}_j))^{-1} (\tilde{\mu}_j (1 - \tilde{\mu}_j)) / \sigma^2_e = (\tilde{\mu}_j (1 - \tilde{\mu}_j)) / \sigma^2_e

      and off-diagonal-element values of zero.

      Notice that no transformation analogous to finding a “T” for the EMMA technique needs to be done to accommodate the Henderson formula.

    4. Finally, use the (BLUP) prediction of u given by the Henderson formula.
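
    For illustration, a minimal sketch (Python with NumPy; function names are our own) of a single Henderson-formula solve under the substitutions described in the note above is:

        import numpy as np

        def henderson_solve(X, P, G, mu, sigma2_g, sigma2_e):
            # X: fixed-effect design matrix (n x k); P: pseudo-response;
            # G: variance/covariance structure of the random effects (kinship).
            n, k = X.shape
            r_inv = np.diag(mu * (1.0 - mu) / sigma2_e)  # R^{-1}, as derived above
            g_s = sigma2_g * G                           # sigma^2_g G plays "G"
            lhs = np.block([
                [X.T @ r_inv @ X, X.T @ r_inv],          # "Z" is the identity
                [g_s @ r_inv @ X, g_s @ r_inv + np.eye(n)],
            ])
            rhs = np.concatenate([X.T @ r_inv @ P, g_s @ r_inv @ P])
            sol = np.linalg.solve(lhs, rhs)
            return sol[:k], sol[k:]                      # (beta-tilde, u-hat)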

Mixed-Model KBAC

This feature, which allows both additional covariates and a pre-computed kinship matrix to be specified, will perform the KBAC method using a regression score function as does KBAC with Regression. The case/control variable is tested against the weighted multi-site genotypes other than the wild-type genotype and against the expected values of the case/control variable. The expected values of the case/control variable are corrected by any additional covariates and by the relationships specified in the pre-computed kinship matrix.

Data Requirements for Mixed-Model KBAC

The data requirements for Mixed-Model KBAC analysis are the same as they are for KBAC with Permutation Testing (Data Requirements for KBAC with Permutation Testing) with the following additions:

  • The spreadsheet must contain the covariates, if any, that you wish to control for. Each covariate may be either quantitative or binary.
  • You must supply a pre-computed kinship matrix from a separate spreadsheet. The requirements for this kinship spreadsheet are outlined in Precomputed Kinship Matrix Option.

Usage and Parameters

To use Mixed-Model KBAC, select DNA-Seq > Collapsing Methods > Mixed-Model KBAC from the spreadsheet menu, and use the resulting window (see Mixed-Model KBAC Method – Parameters) to select the following parameters:


Mixed-Model KBAC Method – Parameters

  • Parameters for KBAC (Liu & Leal, 2010) Mixed-Model Logistic Regression Enter the kernel type and the permutation parameters.

    • Kernel Type The possible choices are:
      • Hyper-geometric (Normally recommended) Gives the most accurate test.
      • Marginal binomial Less compute-intensive and almost as accurate as the hyper-geometric kernel.
      • Asymptotic normal (For large sample sizes only) (> 400 cases/400 controls) Requires more than 400 cases and 400 controls, and is not as accurate as the other two kernel types.
    • Permutations The permutation parameters are:
      • Number of permutations to use: The KBAC test depends upon permutation testing. Please enter the number of permutations here.
      • To use adaptive permutation testing, check Adaptive permutation testing (threshold alpha) and fill in the threshold alpha value. Please see Adaptive Permutation Testing below for more about this technique and the possible advantages of using it.
  • Outputs Choose one or both of the following:

    • One-sided statistics (recommended) This tests the one-sided alternative hypothesis of the enrichment of causal variants for higher corrected case/control statuses Y_j - \mu_j.
    • Two-sided statistics This tests the two-sided alternative hypothesis of the dependency of causal variants upon the corrected case/control statuses Y_j - \mu_j.

    Note

    Due to the often asymmetric distribution of the one-sided KBAC test-statistic, two-sided p-values will almost never be twice the one-sided p-values, and will in fact be exactly the same as the one-sided p-values if the statistic’s distribution, given the numbers of cases and controls and numbers of genotypes, is entirely positive.

    Further, the two-sided and one-sided p-values will usually be exactly the same in practical cases where the p-value is less than 0.05.

  • Regions Specify an annotation gene track that will be used to distinguish regions for testing. Where possible, it is recommended that a local filtered gene track be used to determine the gene list, since a local track allows the spreadsheet to be processed faster than a network track does. An example filtered gene track name is: FilteredRefSeqGenes-UCSC_NCBI_36_Homo_sapiens.idf:1

    Please see The Data Source Library and Downloading Data for information on how to obtain annotation tracks.

    • Perform one test per...
      • Select gene (recommended) to use the minimum start position and the maximum stop position of all transcripts to define a gene, for which just one test will be performed.
      • Select transcript to perform one test for every transcript. This option will result in a more severe multiple testing penalty.
    • Include nearby markers (distance in bp) Checking this and entering a distance in base pairs will let you create a “window” around a gene or transcript to include further variants, which you may want to do if you suspect these may affect gene regulation or splicing.
  • Pre-Computed Kinship Matrix: See Precomputed Kinship Matrix Option for more information about pre-computed kinship matrices. Click on Select Sheet and select the kinship spreadsheet from the window that is presented.

  • Check Correct for Additional Covariates to utilize the (mixed-model) regression score technique to correct for additional covariates.

    • Click on Add Columns to start selecting which variables (spreadsheet columns) to use for correction.
  • Algorithm Specify the algorithm to use for pre-computing the reduced mixed models for the phenotype permutations (permutations of the Y values).

    Note

    The original reduced mixed model that uses the un-permuted phenotypes is always computed exactly.

    • Select Exact to use the exact algorithm and avoid using the last optimization documented in Optimizations.
    • Select Approximate (faster when correcting for covariates) to use the last optimization documented in Optimizations. If you are correcting for covariates, this option will work much faster than the Exact option, but the results will only be approximate.

Note

If there is any missing genotype data in a region for a given sample, the wild type (homozygous reference genotype) will be imputed to have been present and be substituted for the missing data.

Output

A marker-mapped spreadsheet will be generated as output containing one row for each region and containing some or all of the following columns, according to what you have requested:

  • Chr The chromosome number or label.
  • Start The starting genetic position of the region.
  • Stop The ending genetic position of the region.
  • Gene Name The gene name.
  • Transcript Name(s) Transcript name(s) for each gene region.
  • Strand Strand for the gene and transcripts.
  • P-Value (One-Sided) The p-value resulting from the one-sided KBAC test.
  • -log10 P-Value (One-Sided) Minus the base-10 log of the above p-value.
  • KBAC Score (One-Sided) The one-sided KBAC score statistic itself.
  • Bonf. P (One-Sided) The Bonferroni-adjusted p-value from the one-sided test.
  • FDR (One-Sided) The false discovery rate for all tests having a one-sided p-value less than or equal to the current test’s one-sided p-value.
  • P-Value (Two-Sided) The p-value resulting from the two-sided KBAC test.
  • -log10 P-Value (Two-Sided) Minus the base-10 log of the above p-value.
  • KBAC Score (Two-Sided) The two-sided KBAC score statistic itself.
  • Bonf. P (Two-Sided) The Bonferroni-adjusted p-value from the two-sided test.
  • FDR (Two-Sided) The false discovery rate for all tests having a two-sided p-value less than or equal to the current test’s two-sided p-value.
  • Actual Permutations Used How many permutations were actually used for this test by the adaptive permutation testing algorithm.
  • Sample Size Used The actual sample size used for this test after samples with missing values in the dependent variable and, if requested, the genotypes, have been removed.
  • # Markers The total number of markers tested in this region.
  • # Multi-Marker Genotypes The total number of distinct multi-marker genotypes discovered in this region.

Adaptive Permutation Testing

Overview

Suppose a large series of tests is being performed by permutation testing, such as a GWAS or a collapsing method over the entire genome. In order to overcome multiple testing issues, it is normally necessary to specify a large number of permutations. However, running that many permutations for every test can take a large amount of computation time, even though most tests could have been resolved with far fewer permutations in much less time.

Adaptive permutation testing seeks to break this conundrum by “cutting short” (limiting the number of permutations of) any individual test that is “not going well”, meaning that based on tentative results, it is very unlikely that the p-value could be below (better than) a certain specified value.

Additionally, adaptive permutation testing seeks to assign a tentative p-value to any test based on the permutations thus far performed.

The hope is that if the series of tests has a “reasonable” p-value distribution (as would be determined by, for instance, a Q-Q plot), “most” of the tests will only need a “small” number of permutations, while the remaining tests that are much more “promising” will get a much larger number of permutations. If this hope is realized, computation time can be cut down to only a fraction of what it otherwise would be without losing significance where it is really needed.

Adaptive permutation testing is available in Golden Helix SVS for both versions of the CMC method (Combined Multivariate and Collapsing (CMC) Method) and both versions of the KBAC test (Kernel-Based Adaptive Cluster (KBAC) Test).

Theory

Suppose that at any given time during the process of a permutation test, M permutations have been completed with H “hits”, or permuted tests the outcomes of which are as significant as or more significant than the original unpermuted test. Also, suppose we specify that a p-value must be less than \alpha in order to be “interesting”, or sufficiently significant to justify continuing the permutation test (unless the maximum number of permutations specified has been reached).

When testing, we can always assign a tentative p-value p_M = \frac{H}{M} based on the M permutations we have done so far, and determine if it is greater than \alpha.

The question is to determine whether and when, if indeed p_M > \alpha, we can terminate the test confident that the real p-value lies above the value \alpha, and that we can use \frac{H}{M} as a reasonable final p-value result.

To accomplish this, we first estimate the standard deviation \sigma_M of where the p-value might be, based on the data we have. Then, we require that we have done enough tests to cause

p_M - 6\sigma_M > \alpha,

or

p_M > \alpha + 6\sigma_M.

That is, if we can assure ourselves that our tentative p-value outcome is greater than and more than six standard deviations away from the \alpha value, then it will be OK to stop early.

Variance of the Mean

Suppose X is the random variable which has the value 1 for each “hit”, or permuted test which is as significant as or more significant than the original unpermuted test, and 0 for each permuted test which is less significant than the original unpermuted test.

We take the variance of X corresponding to each of the permutations done so far to be

E[X^2] - E[X]^2 = p_M - p_M^2 = p_M(1 - p_M).

Meanwhile, the variance of the mean of M occurrences of a random variable X is the individual variance divided by M. Thus, in our case, the variance of the mean after M permutations may be defined as

\sigma_M^2 = \frac{p_M(1 - p_M)}{M},

or

\sigma_M = \sqrt{\frac{p_M(1 - p_M)}{M}}.

Our requirement then may be re-stated as doing enough tests to cause

p_M > \alpha + 6\sigma_M = \alpha + 6\sqrt{\frac{p_M(1 - p_M)}{M}}.

A Further Requirement

We note that in the case of getting entirely or almost entirely “hits” after very few permutations, the above expression for the standard deviation will be very small, even though the values of X are completely opposite of what would give us confidence in the p-value being very small. Thus, as a final guard against quitting early, we would like to get rid of the influence of the (1 - p_M) factor, which becomes very small when the fraction of “hits” is high. Noting that

\sqrt{\frac{p_M}{M}} > \sqrt{\frac{p_M(1 - p_M)}{M}},

we impose doing enough tests to cause

p_M > \alpha + 6\sqrt{\frac{p_M}{M}},

which is a more stringent requirement. This not only avoids estimating a small standard deviation of the mean when many tests are “hits” near the beginning of testing, but also, for very small p-values and many iterations, this requirement is only slightly more stringent than the original requirement. This is because, under those circumstances, (1 - p_M) in the original requirement becomes very close to one.

It may be noted that this more stringent requirement prevents early termination until more than 36 permutations have taken place, because taking the maximum possible value of p_M, one, and the minimum possible value of \alpha, zero, we have

1 > 0 + 6\sqrt{\frac{1}{M}},

or

1 > 36\frac{1}{M},

or

M > 36.
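
A minimal sketch of this stopping rule in Python follows; the is_hit callback is a hypothetical stand-in for running one permutation and reporting whether its outcome is as significant as or more significant than the unpermuted test.

    import math

    def adaptive_permutation_p(is_hit, alpha, max_permutations):
        hits = 0
        for m in range(1, max_permutations + 1):
            if is_hit():
                hits += 1
            p_m = hits / m
            # Terminate early once p_M > alpha + 6 * sqrt(p_M / M).
            if p_m > alpha + 6.0 * math.sqrt(p_m / m):
                return p_m, m    # (tentative p-value, permutations actually used)
        return hits / max_permutations, max_permutations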

Summary of Usage

If you request Adaptive permutation testing, enter a threshold alpha.

If, while a test is being performed, it is determined that the actual p-value must be higher than the threshold alpha by more than six estimated standard deviations of the tentative p-value, the test will be terminated early, and the p-value which is output will be based on the permutations done thus far. An additional output of the number of permutations done thus far will appear under the heading Actual Permutations Used.

In the case that all of the permutations originally requested are used, the Actual Permutations Used will be output and will reflect the number of permutations originally requested.

Optimized Sequence Kernel Association Test (SKAT-O)

Overview

If one rare variant (marker) were to be tested by itself for association with a phenotype, the test would have nowhere near as much power as the equivalent test on a common variant. For this reason, rare variants are always tested in groups of several markers at a time, such as all the markers in one region or one gene.

Several approaches to testing rare variants are:

  1. Burden testing. The straightforward form of burden testing is to sum the minor allele counts for each marker in the group, possibly weighted by a function of each marker’s MAF, and test this against the phenotype. This test works well in detecting when each of multiple markers varying in the same direction has an effect on the phenotype in the same direction–in other words, when the marker variations are well-correlated.

  2. Sequence Kernel Association Test (SKAT). Squares of score statistics for testing individual markers are summed. The sum is possibly weighted by, for instance, a function of each marker’s minor allele frequency (MAF). This test works well in detecting when multiple markers vary in different directions but have an effect on the phenotype in the same direction.

    Alternatively, the SKAT test may be thought of as testing on the random-effect variance component for a mixed model in which the effects from the individual markers are treated as uncorrelated random effects–the test is whether that variance component is zero. (In the corresponding mixed model, the individual random effect variances are proportional to the weights that are used.)

  3. Generalized Sequence Kernel Association Test (Generalized SKAT). This test statistic (but not necessarily its p-value) is a linear combination of the SKAT test and the straightforward burden test explained above, with the proportions of each pre-specified.

    One might ask whether there is a best proportion of SKAT and a best proportion of Burden that one could use for given data. That would be difficult to determine in advance. To find out, you could perform many generalized SKAT tests over the same region using different proportions of SKAT vs. Burden–however, should you use the best p-value from these tests? How do you avoid having to apply a Bonferroni or FDR correction to that p-value according to the number of different generalized SKAT tests you used?

  4. The SKAT-O (Optimized SKAT) algorithm, in answer to the above, combines a set of generalized SKAT tests using different proportions of SKAT vs. Burden in a fashion that will obtain one statistic and one p-value.

Every one of these tests may be corrected for covariates.

Note

The SKAT-O test may also be used to analyze common variants. Simply specify a uniform weighting (Usage and Parameters) when performing the analysis.

Availability

The generalized SKAT test (which may also be used to produce a SKAT test or a burden test) and the SKAT-O test are available in Golden Helix SVS for three different ranges of sample sizes:

  1. Very-small-sample corrected [Lee2012]. This is designed for sample sizes up to roughly 500 to 700, and involves a permutation-testing algorithm that is used exclusively to determine the kurtosis of the distribution under the null model.
  2. Moderately-small-sample corrected [Lee2012]. This analytical algorithm is designed for sample sizes over 500 up to 2000.
  3. Liu algorithm [LeeWuLin2012] and [Liu2009]. This algorithm is designed for sample sizes over 2000.

For usage of this feature, see Data Requirements for SKAT-O and Usage and Parameters.

Mathematical Overview

Note

The following are variables which have different symbols between the two papers this method is based on. For each variable, the symbol used in this document is also indicated.

  Variable                                           This Document   [Lee2012]      [LeeWuLin2012]
  Number of variants                                 m               m              p
  Disease probability                                \mu_i           \pi_i          \mu_i
  The SKAT-O Test Statistic                          T               Q_{Optimal}    T
  Main factor of first term of Null Hypothesis Q     \kappa_1        \kappa_1       \kappa
  Main factor of second term of Null Hypothesis Q    \kappa_2        \kappa_2       \eta_0
  Multiplier for second term of Null Hypothesis Q    \tau(\rho)      \psi(\rho)     \tau(\rho)

The Burden and SKAT Test Statistics

Assume n subjects are sequenced in a region that has m variants. For the i^{th} subject, let y_i denote a dichotomous phenotype, G_i=(g_{i1},...,g_{im}) the genotypes of the m variants (g_{ij} = 0,1,2), and X_i = (x_{i1},...,x_{is}) the intercept and covariates. Also, let G and X denote the matrices that contain G_i and X_i, respectively, as their i^{th} rows. Consider the logistic regression model

logit(\mu_i) = X_i\alpha + G_i\beta,

where \mu_i is the disease probability, \alpha is an s \times 1 vector of regression coefficients of the intercept and covariates, and \beta = (\beta_1,...,\beta_m)' is an m \times 1 vector of regression coefficients of genetic variants. If w_j is a weight function that may depend on the properties of the j^{th} variant, the burden test rephrases the model as

logit(\mu_i) = X_i\alpha + \beta_c \left(\sum_{j=1}^m w_j g_{ij} \right),

where the test is for H_0:\beta_c = 0. The burden score statistic (used for the SKAT-O algorithm) is

Q_B = \left[ \sum_{i=1}^n (y_i - \hat{\mu}_i) \left(\sum_{j=1}^m w_j g_{ij} \right) \right]^2 ,

where \hat{\mu}_i is computed from fitting the null model

logit(\mu_i) = X_i \alpha .

Meanwhile, the SKAT test assumes that the \beta_j are independent and follow an arbitrary distribution with mean 0 and variance w_j^2 \sigma^2_g. Here, the test is H_0 : \sigma^2_g = 0. The SKAT statistic may be written as

Q_S = (y - \hat{\mu})' K (y - \hat{\mu}),

where \hat{\mu} = (\hat{\mu}_1, ..., \hat{\mu}_n)' and K=GWWG' is an n \times n “kernel matrix” with W = diag(w_1,...,w_m) as an m \times m diagonal weight matrix. The SKAT statistic may also be written as

Q_S = \sum_{j=1}^m w_j^2 S_j^2,

where S_j = \sum_{i=1}^n g_{ij} (y_i - \hat{\mu}_i) is the score statistic for testing H_0 : \beta_j = 0 in the single-SNP model with only the j^{th} SNP:

logit(\mu_i) = X_i \alpha + g_{ij} \beta_j .

Meanwhile, the original model, which may be written as

(logit(\mu_1), ..., logit(\mu_n))' = X \alpha + G \beta,

may be thought of as a mixed-model equation, where the G matrix is equivalent to the incidence or Z matrix in the usual mixed-model equation, and the \beta are random effects with a variance matrix of \sigma^2_g W.
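
For illustration, a minimal sketch (Python with NumPy; function names are our own) of computing Q_B and Q_S from the quantities defined above is:

    import numpy as np

    def burden_and_skat_statistics(y, mu_hat, G, w):
        # y: phenotypes; mu_hat: null-model fits; G: n x m genotype matrix of
        # minor allele counts (0, 1, 2); w: per-variant weights w_j.
        resid = y - mu_hat
        q_burden = (resid @ (G @ w)) ** 2        # Q_B
        s = G.T @ resid                          # single-variant scores S_j
        q_skat = np.sum((w ** 2) * (s ** 2))     # Q_S
        return q_burden, q_skat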

The Generalized SKAT Test Statistic

This test unifies the burden and SKAT tests as a linear combination of the two. If Q_{Burden} and Q_{SKAT} are the results of the burden and SKAT tests, respectively, the generalized SKAT test for a given \rho, where 0 \le \rho \le 1, is

Q_{\rho} = \rho Q_{Burden} + (1 - \rho) Q_{SKAT}.

Note

Note that the p-value of this test might not be a linear combination of the corresponding burden and SKAT p-values at all.

This generalized SKAT test for \rho turns out to be equivalent to

Q_{\rho} = (y - \hat{\mu})' K_{\rho} (y - \hat{\mu}),

where K_{\rho} = G W R_{\rho} W G' is the “kernel matrix” with R_{\rho} = (1 - \rho) I + \rho (1,...,1)' (1,...,1) (a matrix with 1’s on the diagonal and values of \rho everywhere else). The parameter \rho may be thought of as the presumed correlation of different \beta_j when these are thought of as random effects in the corresponding mixed model.
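
A minimal sketch of this kernel form (Python with NumPy; the function name is our own) follows; note that \rho = 0 reproduces Q_{SKAT} and \rho = 1 reproduces Q_{Burden}.

    import numpy as np

    def generalized_skat_q(y, mu_hat, G, w, rho):
        # K_rho = G W R_rho W G', with R_rho = (1 - rho) I + rho 11'.
        m = G.shape[1]
        W = np.diag(w)
        R_rho = (1.0 - rho) * np.eye(m) + rho * np.ones((m, m))
        K_rho = G @ W @ R_rho @ W @ G.T
        resid = y - mu_hat
        return resid @ K_rho @ resid             # Q_rho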

Finding the P-Value from an Individual Statistic

Note

The following method and its variations are partly based on the following facts about the \chi^2_{df} distribution, where df is the “degrees of freedom” parameter:

  • The mean of the \chi^2_{df} distribution is df.
  • The variance of the \chi^2_{df} distribution is 2df.
  • The skewness of \chi^2_{df} is \sqrt{\frac{8}{df}}.
  • The (“excess”) kurtosis of \chi^2_{df} is \frac{12}{df}.

The p-value for an individual Q_{\rho} may be approximated as

1 - F\left( \frac{(Q_{\rho} - \mu_Q) \sqrt{2 df}}{\sqrt{v_Q}} + df | \chi^2_{df} \right),

where F( \bullet | \chi^2_{df}) is the chi-squared distribution function for df degrees of freedom, and the null-hypothesis mean, variance, and degrees of freedom \mu_Q, v_Q, and df (determined as documented below) adjust the mean and variance of Q_{\rho} to agree with those for the \chi^2_{df} distribution.

This formula, along with its inverse, is also used later on when intermediate p-values must be taken to estimate the overall SKAT-O p-value.
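
A minimal sketch of this moment-matching approximation and its inverse (Python with SciPy; function names are our own) is:

    from scipy.stats import chi2

    def individual_p_value(q, mu_q, v_q, df):
        # Shift and scale q so its null-hypothesis mean and variance match
        # those of the chi-squared distribution with df degrees of freedom.
        q_adj = (q - mu_q) * (2.0 * df) ** 0.5 / v_q ** 0.5 + df
        return chi2.sf(q_adj, df)                # 1 - F(q_adj | chi^2_df)

    def statistic_from_p_value(p, mu_q, v_q, df):
        # The inverse mapping, used later when computing q_min(rho_nu).
        q_adj = chi2.isf(p, df)
        return (q_adj - df) * v_q ** 0.5 / (2.0 * df) ** 0.5 + mu_q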

The above parameters are determined according to one of the following procedures (a brief sketch of procedures 2 and 3 follows this list):

  1. For moderately small sample sizes, theoretically from the eigen-decomposition of \tilde{K} = D^{1/2} K D^{1/2}, where D = diag(\hat{\mu}_i(1 - \hat{\mu}_i)), or from \tilde{K} = HKH', where D is defined the same way and

    H = D^{1/2} - D^{1/2}X(X'DX)^{-1}X'D

    is defined so that

    H'H = D - DX(X'DX)^{-1}X'D

    is the restricted maximum likelihood (REML) estimator of the variance component.

    Note

    Golden Helix SVS uses \tilde{K} = HKH'.

    Specifically, let \lambda_1, ..., \lambda_q be the ordered non-zero eigenvalues of \tilde{K}, U = [u_1 ... u_q] be the n \times q eigenvector matrix of \tilde{K}, and u_{ij} be the i^{th} element of u_j. If we define c_{jk} with the following expression,

    c_{jk} = \sum_{i=1}^n \frac{u^2_{ij} u^2_{ik} (3 \mu^2_i - 3 \mu_i + 1)}{\mu_i(1 - \mu_i)} + \sum_{i_1 \ne i_2}^n u^2_{i_1 j} u^2_{i_2 k} + 2 \sum_{i_1 \ne i_2}^n u_{i_1 j} u_{i_2 j} u_{i_1 k} u_{i_2 k} - 1,

    we then can compute

    \mu_Q = \sum_{j=1}^q \lambda_j,

    v_Q = \sum_{i,j=1}^q \lambda_i \lambda_j \hat{c}_{ij},

    and

    df = \frac{\left(\sum_{j=1}^q \lambda_j^{*2} \right)^2} {\sum_{j=1}^q \lambda_j^{*4}},

    where \lambda_j^{*} = \lambda_j \hat{c}_{jj} / \sqrt{2}, and \hat{c}_{jk} is an estimated c_{jk} with \hat{\mu}.

  2. In the case of very small sample sizes, \mu_Q and v_Q are determined as above for moderately small sample sizes, but df is determined by performing permutation testing to estimate the kurtosis.

    If Q^{*}_{s,b}, with (b=1,...,B), is the SKAT test statistic from the permutation sample y^{*}_b, the sample kurtosis is

    \hat{\gamma} = \frac{\hat{\mu}_4}{\hat{\sigma}^4} - 3,

    where

    \hat{\mu}_4 = \frac{1}{B}\sum_{b=1}^B \left( Q^*_{s,b} - \mu_Q \right)^4

    and

    \hat{\sigma}^2 = \frac{1}{B} \sum_{b=1}^B \left( Q^*_{s,b} - \mu_Q \right)^2 .

    The df is then estimated as df = \frac{12}{\hat{\gamma}}.

    If there are covariates, parametric bootstrapping is used rather than permutation testing. Each bootstrap sample is drawn with replacement, with the modification that measures are taken to obtain the same number of cases and the same number of controls as in the unpermuted y.

  3. For large sample sizes, the eigen-decomposition of \tilde{K} (as defined above) is also used. The difference is that the somewhat more straightforward formulas

    \mu_Q = \sum_{j=1}^q \lambda_j,

    v_Q = 2 \sum_{j=1}^q \lambda_j^2,

    and

    df = \frac{\left(\sum_{j=1}^q \lambda_j^2 \right)^2} {\sum_{j=1}^q \lambda_j^4}

    apply.

    Note

    This formula is a modification of a technique mentioned in [Liu2009]. In that article,

    df = \frac{\left(\sum_{j=1}^q \lambda_j^2 \right)^3} {\left(\sum_{j=1}^q \lambda_j^3 \right)^2}

    is used (for the case that applies here of a “central quadratic form”).
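
As referenced above, here is a minimal sketch (Python with NumPy; function names are our own) of the large-sample formulas of procedure 3, using \tilde{K} = HKH', together with the kurtosis-based estimate of df from procedure 2; the exact c_{jk} computation of procedure 1 is omitted.

    import numpy as np

    def null_moments_large_sample(K, X, mu_hat, eps=1e-10):
        # Procedure 3: moments from the eigenvalues of K-tilde = H K H'.
        D = np.diag(mu_hat * (1.0 - mu_hat))
        d_half = np.sqrt(D)
        H = d_half - d_half @ X @ np.linalg.inv(X.T @ D @ X) @ X.T @ D
        lam = np.linalg.eigvalsh(H @ K @ H.T)
        lam = lam[lam > eps]                     # nonzero eigenvalues
        mu_q = lam.sum()
        v_q = 2.0 * np.sum(lam ** 2)
        df = np.sum(lam ** 2) ** 2 / np.sum(lam ** 4)
        return mu_q, v_q, df

    def df_from_permutation_kurtosis(q_star, mu_q):
        # Procedure 2: df = 12 / gamma-hat from the B permuted statistics.
        dev = np.asarray(q_star) - mu_q
        gamma = np.mean(dev ** 4) / np.mean(dev ** 2) ** 2 - 3.0
        return 12.0 / gamma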

The Optimal (SKAT-O) Test

This test statistic can be written as

T = \min_{0 \le \rho \le 1} p_{\rho}.

In practice, it is computed (or thought of) as

T = \min(p_{\rho_1}, ..., p_{\rho_b}),

where a grid of \rho values has been set up where

0 = \rho_1 < ... < \rho_b = 1.

This is not a p-value itself, but only a statistic. At this point, we must take the p-value of this statistic. This will be somewhat involved.

Alternative View of the Q_{\rho} Statistic

Define Z = D^{-1/2} GW (or Z = HGW, consistent with the definition used above for \tilde{K}) and \bar{z} = (\bar{z}_1, ..., \bar{z}_n)', where \bar{z}_i = \sum_{j=1}^m z_{ij}/m. Also, let M = \bar{z}(\bar{z}'\bar{z})^{-1}\bar{z}' and

\tau(\rho) = m^2 \rho \bar{z}'\bar{z} + \frac{1 - \rho}{\bar{z}'\bar{z}} \sum_{j=1}^m (\bar{z}'z_{.j})^2,

where z_{.j} is the j^{th} column of Z.

It can be shown that under the null hypothesis, Q_{\rho} is asymptotically the same as

(1 - \rho)\kappa_1 + \tau(\rho)\kappa_2,

where

\kappa_1 = \sum_{k=1}^q \lambda_k \eta_k + \zeta ,

\lambda_1,...,\lambda_q are the nonzero eigenvalues of Z'(I-M)Z, \eta_k (k=1,...,q) and \kappa_2 are independent and identically distributed \chi^2_1 random variables, and \zeta satisfies

E(\zeta) = 0,

Var(\zeta) = 4 \operatorname{trace}(Z'MZZ'(I-M)Z),

\operatorname{Corr} \left(\sum_{k=1}^q \lambda_k \eta_k, \zeta \right) = 0,

and

\operatorname{Corr}(\kappa_2, \zeta) = 0.

This is a mixture of two independent random variables, \kappa_1 and \kappa_2.
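
For illustration, a minimal sketch (Python with NumPy; the function name is our own) of computing M, \tau(\rho), Var(\zeta), and the eigenvalues of Z'(I-M)Z is:

    import numpy as np

    def alternative_view_quantities(Z, rho, eps=1e-10):
        # Z: n x m matrix (D^{-1/2} G W or H G W, as defined above).
        n, m = Z.shape
        zbar = Z.mean(axis=1)                    # zbar_i = sum_j z_ij / m
        zz = zbar @ zbar                         # zbar' zbar
        M = np.outer(zbar, zbar) / zz
        tau = m ** 2 * rho * zz + (1.0 - rho) / zz * np.sum((zbar @ Z) ** 2)
        lam = np.linalg.eigvalsh(Z.T @ (np.eye(n) - M) @ Z)
        var_zeta = 4.0 * np.trace(Z.T @ M @ Z @ Z.T @ (np.eye(n) - M) @ Z)
        return M, tau, lam[lam > eps], var_zeta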

The P-Value Algorithm for the Optimal Test

The p-value for the optimal statistic is

1 - P(Q_{\rho_1} < q_{min}(\rho_1),...,Q_{\rho_b} < q_{min}(\rho_b))

= 1 - E[P(\kappa_1 < \min_{\nu}((q_{min}(\rho_{\nu}) - \tau(\rho_{\nu}) \kappa_2)/(1 - \rho_{\nu})) | \kappa_2)] ,

where q_{min}(\rho_{\nu}) “denotes the (1 - T)^{th} percentile of the distribution of Q_{\rho_{\nu}}” for each \nu.

To explain q_{min}(\rho_{\nu}) better, think of it as “going in reverse” from a p-value to a Q_{\rho_{\nu}}-like statistic, except that while parameters similar to the parameters for finding the individual (null-hypothesis distribution for) Q_{\rho_{\nu}} are used, the best p-value T obtained over all \rho is used rather than the p-value originally obtained for the individual \rho_{\nu}.

The expression being minimized is based on substituting q_{min}(\rho) for the value of Q_{\rho} in the expression

q_{min}(\rho) = (1 - \rho)\kappa_1 + \tau(\rho)\kappa_2 ,

and re-arranging terms as

\kappa_1 = \frac{q_{min}(\rho) - \tau(\rho) \kappa_2}{1 - \rho} .

The final expression for approximating the p-value of the optimal statistic is

\text{p-value} = 1 - \int F(\delta(\kappa_2)|\chi^2_{df}) f(\kappa_2|\chi^2_1) d\kappa_2 ,

where

\delta(\kappa_2) = (\min_{\nu}((q_{min}(\rho_{\nu}) - \tau(\rho_{\nu}) \kappa_2)/(1 - \rho_{\nu})) - \mu_Q)\sqrt{\frac{2 df}{v_Q + Var(\zeta)}} + df,

F(\delta(\kappa_2)|\chi^2_{df}) is the chi-squared distribution function for df degrees of freedom, f(\kappa_2|\chi^2_1) is the density function of \chi^2_1, and the \mu_Q, v_Q, and df used here are determined as in Finding the P-Value from an Individual Statistic, as applied to Z'(I-M)Z.
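
A minimal sketch of evaluating this integral (Python with SciPy; names are our own) follows. The grid point \rho = 1, where the denominator vanishes, requires the special handling of the published method and is omitted from this sketch.

    import numpy as np
    from scipy.integrate import quad
    from scipy.stats import chi2

    def skat_o_p_value(q_min, tau, rho, mu_q, v_q, var_zeta, df):
        # q_min, tau, rho: arrays over the grid points with rho_nu < 1.
        def delta(k2):
            vals = (q_min - tau * k2) / (1.0 - rho)
            return (vals.min() - mu_q) * np.sqrt(2.0 * df / (v_q + var_zeta)) + df
        integrand = lambda k2: chi2.cdf(delta(k2), df) * chi2.pdf(k2, 1)
        integral, _ = quad(integrand, 0.0, np.inf)
        return 1.0 - integral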

The P-Value for Very Small Sample Sizes

For very small and moderately small sample sizes, we write

\kappa_{1S} = (1 - \rho)\tilde{y}'(I - M)ZZ'(I - M)\tilde{y} + 2(1-\rho)\tilde{y}'(I - M)ZZ'M\tilde{y}

and

\kappa_{2S} = \frac{\tilde{y}'\bar{z}\bar{z}'\tilde{y}}{\bar{z}'\bar{z}} .

\kappa_{1S} asymptotically follows the same distribution as the \kappa_1 mentioned in the development above, and \kappa_{2S} asymptotically follows the \chi^2_1 distribution as does \kappa_2 in the development above. However, the null distribution of \kappa_{1S} is small-sample adjusted using its computed small-sample variance and kurtosis, and is then used to modify the usage of F in the final p-value formula.

Data Requirements for SKAT-O

SKAT-O analysis is to be performed from a mapped spreadsheet of sequence or other genotypic data. This spreadsheet must also contain the covariates, if any, that you wish to control for. Each covariate may be either quantitative or binary.

SKAT-O analysis may be used with either common variants or rare variants. You may select a marker weighting to either treat all variants equally (to analyze common variants) or, on the other hand, to emphasize rare variants with a greater or lesser amount of contrast between them and common variants. (Usage and Parameters)

If you are analyzing rare variants, you may also follow the procedures outlined in Data Requirements for KBAC with Permutation Testing.

Usage and Parameters

To use SKAT-O analysis, select DNA-Seq > Collapsing Methods > SKAT-O from the spreadsheet menu, and use the resulting window (see SKAT-O Method – Parameters) to select the following parameters:


SKAT-O Method – Parameters

  • Parameters for SKAT-O (Lee et al., Am J Hum Genetics 2012) Optimized Sequence Kernel Association Testing

    Enter the marker weighting. The possible choices are:

    • Uniform All variants (both common and rare) are weighted equally.
    • Madsen and Browning (one over square root of MAF(1 - MAF)) Rare variants are weighted more than common variants.
    • pdf(MAF; 1, 25) of the Beta distribution (normally recommended for rare variant studies) This is the default. Weights rare variants more and common variants less, as does Madsen and Browning, but with a much greater contrast between rare and common variants than that produced by Madsen and Browning.

    Note

    It turns out that all of the above marker weightings may be represented through the Beta distribution. The probability density function of this distribution is

    pdf(x; \alpha, \beta) = \frac{x^{\alpha - 1} (1 - x)^{\beta - 1}}{B(\alpha, \beta)},

    where B(\alpha, \beta) represents the Beta function. These weightings may be represented as follows (a brief sketch follows this parameter list):

    • pdf(MAF; 1, 1) Uniform.
    • pdf(MAF; 0.5, 0.5) Madsen and Browning (1/\sqrt{MAF (1 - MAF)}).
    • pdf(MAF; 1, 25) The default weighting.
  • Tests and Outputs Enter the desired test(s).

    • Generalized SKAT with rho = For this test, also enter the \rho value.

      Note

      A \rho value of zero will yield the original SKAT test. A \rho value of 1 will yield a burden test.

    • SKAT-O with standard grid of rho values This is checked by default.

  • Regions Specify an annotation gene track that will be used to distinguish regions for testing. Where possible, it is recommended that a local filtered gene track be used to determine the gene list, since a local track allows the spreadsheet to be processed faster than a network track does. An example filtered gene track name is: FilteredRefSeqGenes-UCSC_NCBI_36_Homo_sapiens.idf:1

    Please see The Data Source Library and Downloading Data for information on how to obtain annotation tracks.

    • Perform one test per...
      • Select gene (recommended) to use the minimum start position and the maximum stop position of all transcripts to define a gene, for which just one test will be performed.
      • Select transcript to perform one test for every transcript. This option will result in a more severe multiple testing penalty.
    • Include nearby markers (distance in bp) Checking this and entering a distance in base pairs will let you create a “window” around a gene or transcript to include further variants, which you may want to do if you suspect these may affect gene regulation or splicing.
  • Check Correct for Additional Covariates to correct for additional covariates.

    • Click on Add Columns to start selecting which variables (spreadsheet columns) to use for correction.
  • Algorithm for Estimating Distributions Specify the algorithm to use according to your sample size.

    • Select Very-small-sample corrected for up to about 700 samples.
    • Select Somewhat-small-sample corrected for 700 up to 2000 samples.
    • Select Liu algorithm for greater than 2000 samples.
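
As referenced above, a minimal sketch (Python with SciPy; the function name and scheme labels are our own) of the three marker weightings expressed through the Beta density is:

    from scipy.stats import beta

    def marker_weights(maf, scheme="beta_1_25"):
        # maf: array of minor allele frequencies for the markers in a region.
        if scheme == "uniform":
            return beta.pdf(maf, 1, 1)        # all weights equal
        if scheme == "madsen_browning":
            return beta.pdf(maf, 0.5, 0.5)    # proportional to 1/sqrt(MAF(1-MAF))
        return beta.pdf(maf, 1, 25)           # default rare-variant weighting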

Note

If there is any missing genotype data in a region for a given sample, the wild type (homozygous reference genotype) will be imputed to have been present and be substituted for the missing data.

Output

A marker-mapped spreadsheet will be generated as output containing one row for each region and containing some or all of the following columns, according to what you have requested:

  • Chr The chromosome number or label.
  • Start The starting genetic position of the region.
  • Stop The ending genetic position of the region.
  • Gene Name The gene name.
  • Transcript Name(s) Transcript name(s) for each gene region.
  • Strand Strand for the gene and transcripts.
  • SKAT-O P-value The p-value resulting from the SKAT-O test.
  • -log10 SKAT-O P-Value Minus the base-10 log of the above p-value.
  • SKAT-O T statistic The T statistic from the SKAT-O computation.
  • SKAT-O Bonf. P The Bonferroni-adjusted p-value from the SKAT-O test.
  • SKAT-O FDR The false discovery rate for all SKAT-O tests having a p-value less than or equal to the current SKAT-O test’s p-value.
  • SKAT P-value for rho = ## The p-value resulting from the generalized SKAT test for the given \rho.
  • -log10 SKAT P-Value Minus the base-10 log of the above p-value.
  • SKAT Q for rho = ## The Q statistic itself for the generalized SKAT test for the given \rho.
  • SKAT Bonf. P The Bonferroni-adjusted p-value from the generalized SKAT test.
  • SKAT FDR The false discovery rate for all generalized SKAT tests having a p-value less than or equal to the current generalized SKAT test’s p-value.
  • # Markers Found in Region The total number of markers found in this region.
  • # Markers Used The total number of markers used in this region’s test. Monomorphic markers, all-missing markers, and multi-allelic markers are excluded.