# Collapsing Methods for DNA-Sequence Analysis¶

## Overview¶

Collapsing methods are techniques for performing analysis on DNA sequence
data. Please see *DNA Sequence Analysis* for an overview of
sequence analysis and Golden Helix SVS DNA-Seq Analysis workflows.

Traditional association techniques used in GWAS studies are not really suitable for sequence analysis, because they do not have the power to detect the significance of rare variants individually, nor do they provide tools for measuring their compound effect, referred to as rare variant burden. To do this, it is necessary to “collapse” several variants into a single covariate based on regions such as genes.

In Golden Helix SVS, you can use several approaches to study rare variant
burden. Two approaches are a simple detecting the presence of variants
in samples (used in [Cohen2004]) or counting the variants per sample (used in
[MorgenthalerAndThilly2007]) between affected and unaffected groups
over a region or gene as defined by a specified annotation track, then
performing an appropriate statistical test. See
*Workflow for Detecting Presence of Variants* and *Workflow for Counting Variants*.

Another, more powerful, approach is the Combined Multivariate and
Collapsing (CMC) method by Li and Leal [LiAndLeal2008]. Within each
region, CMC first bins variants according to a criterion such as minor
allele frequency, then collapses the variants within each bin, and
finally performs multivariate testing on the counts across the various
bins. See *Combined Multivariate and Collapsing (CMC) Method*.

A fourth approach is the Kernel-Based Adaptive Cluster method by Liu
and Leal [LiuAndLeal2010]. It first catalogs and counts multi-marker
genotypes within a given region based on the variant data, then
performs a special case/control test based on these counts. This test
is weighted according to how often each multi-marker genotype is
expected to occur according to the null hypothesis. See
*Kernel-Based Adaptive Cluster (KBAC) Test* and *Mixed-Model Kernel-Based Adaptive Cluster (KBAC) Method*.

Finally, there are the Sequence Kernel Association Test (SKAT), the
Generalized Sequence Kernel Association Test (Generalized SKAT), and
the Optimized Sequence Kernel Association Test (SKAT-O)
([Lee2012] and [LeeWuLin2012]). The SKAT test combines score
tests from individual variants within a given region assuming that the
effects of the individual variants are not correlated. Multiple
variants that change in opposite directions and contribute in the same
direction to changes in the phenotype will be detected by the SKAT
test. Generalized SKAT combines the score tests assuming that the
effects of the individual variants are partly correlated to a degree
which must be pre-specified for the test. SKAT-O is a method of
combining, into one overall test, a set of different generalized SKAT
tests over the same region. See *Optimized Sequence Kernel Association Test (SKAT-O)*.

Using one of the above approaches will give greater power to detect the significance of the rarer variants.

## Workflow for Detecting Presence of Variants¶

One workflow to analyze sequence data is to start with filtering out probes, then performing a Fisher’s Exact Test on a binary dependent variable against the presence or absence of a variant in a gene. This test is performed in [Cohen2004].

- Filter probes on Call Rate and Minor Allele Frequency, see
*Genotype Filtering by Marker*.- Keep probes with a high call rate
- Keep probes with a low minor allele frequency (rare variants)

- Perform LD pruning to remove probes that are in high Linkage
Disequilibrium with at least one other probe. One probe from a set of
probes in LD is kept for analysis, see
*LD Pruning*. - (Optional) Filter probes by the SIFT score, see
*siftFiltering*. - Determine the genes that have at least one variant per sample. See
*Count Variants per Gene*–select the option to output a binary presence/absence indicator of a variant (per gene). - Perform a Fisher’s Exact test using a binary case/control phenotype.
See
*Fisher’s Exact Test for Binary Predictors*. - Explore the results using the SVS Genome Browser.

## Workflow for Counting Variants¶

Another workflow to analyze sequence data is to test on the number of variants per gene. This test is similar to the test (Cohort Allelic Sums Test or CAST) discussed in [MorgenthalerAndThilly2007].

- Filter probes on Call Rate and Minor Allele Frequency, see
*Genotype Filtering by Marker*.- Keep probes with a high call rate
- Keep probes with a low minor allele frequency (rare variants)

- Perform LD pruning to remove probes that are in high Linkage
Disequilibrium with at least one other probe. One probe from a set of
probes in LD is kept for analysis, see
*LD Pruning*. - (Optional) Filter probes by the SIFT score, see
*siftFiltering*. - Calculate number of variants per gene, see
*Count Variants per Gene*. - Create a heat map to see where the samples have the most variants.
- Perform an association test using either a quantitative or binary phenotype.
- Explore the results using the SVS Genome Browser.

## Combined Multivariate and Collapsing (CMC) Method¶

As noted above, the Combined Multivariate and Collapsing (CMC) method by Li and Leal [LiAndLeal2008] first bins variants within each of a number of regions according to a criterion such as minor allele frequency, then collapses the variants within each bin, and finally performs multivariate testing on the counts across the various bins.

Within each bin, each sample is counted as either having a variant (heterozygous or homozygous) in at least one marker classified to be within that bin, or on the other hand having no variants at all among any of the markers classified to be within the bin. Thus, internally, one binary variable will result from each of the bins in the region.

There are two algorithms available in Golden Helix SVS for performing multivariate testing on these resulting binary variables.

- Hotelling’s
test (See
*CMC with Hotelling T Squared Test*) - Regression (See
*CMC with Regression*)

### CMC with Hotelling T Squared Test¶

This feature will perform the CMC method using its original algorithm, which is Hotelling’s T-squared test. The dependent variable must be a case/control status. This variable is tested against the independent bin variables.

#### Data Requirements for CMC with Hotelling T Squared Test¶

CMC analysis with Hotelling’s test is to be performed from a mapped spreadsheet of sequence data which has a case/control variable as its dependent variable.

A probe annotation track to locate the positions of the genes or the
transcripts must be specified. If you wish to perform one test per
gene, this annotation track must have a **Name** field.

In addition, you should have ready a “Variant Bins spreadsheet”, which should be mapped over the same markers as your sequence data. This spreadsheet should have a “Variant Bin” column that you may select from the CMC dialog.

Note

- If you wish to generate a “Variant Bins spreadsheet” based on
user-defined minor-allele frequency thresholds and an external
reference population provided through a probe annotation track,
you may use
**DNA-Seq**>**Variant Binning by Frequency Source**. (Please see*Variant Binning by Frequency Source*.) - While variant bins may be defined as mentioned just above, the CMC analysis method is flexible as to what binning method you may select.

#### Usage and Parameters¶

To use the CMC method with Hotelling’s test, select
**DNA-Seq** > **Collapsing Methods** > **CMC with Hotelling T Squared
Test** from the spreadsheet menu, and use the resulting window (see
*CMC Method with Hotelling’s test – Parameters*) to select the following parameters:

**Regions for CMC (Li and Leal) Hotelling T Squared Test**Use

**Select Track**to specify an annotation gene source that will be used to distinguish regions for testing. If possible it is recommended that a local filtered gene track be used to determine the gene list. A local track will allow the spreadsheet to be processed faster than if a network track was used. An example filtered gene track name is:`FilteredRefSeqGenes-UCSC_NCBI_36_Homo_sapiens.idf:1`Please see

*The Data Source Library*and*Downloading Data*for information on how to obtain annotation tracks.**Perform one test per...**- Select
**gene**(recommended) to use the minimum start position and the maximum stop position of all transcripts to define a gene, for which just one test will be performed. - Select
**transcript**to perform one test for every transcript. This option will result in a more severe multiple testing penalty.

- Select
**Include nearby markers (distance in bp)**Checking this and entering a distance in base pairs will let you create a “window” around a gene or transcript to include further variants, which you may want to do if you suspect these may affect gene regulation or splicing.

**Variant Bins for CMC**(Please see the notes above about “Variant Bins spreadsheets”.) Here, select the**Variant Frequency Bins spreadsheet**and the**Variant Bin column**in that spreadsheet.- The variant bin column may either be binary or integer, in which case the bin categories will be numeric, or the column may be categorical, in which case the categories will define the different bins.

**Significance Testing**If your sample size is small (under 300 cases and 300 controls), permutation testing is recommended in addition to obtaining the asymptotic (non-permuted) p-value.- To perform permutation testing, check
**Also compute permuted p-values**and enter the**Number of permutations to use**. - To use adaptive permutation testing, check
**Adaptive permutation testing (threshold alpha)**and fill in the threshold alpha value. Please see*Adaptive Permutation Testing*below for more information about this technique and the possible advantages of using it.

- To perform permutation testing, check
**Missing Genotype Values**If there is any missing genotype data in the current region for a given sample, either the wild type (homozygous reference genotype) can be imputed to have been present and substituted for the missing data, or the sample can be skipped and not used. Select**Yes**to**Impute Wild Type for Genotypic Missing Values**to impute the wild type, or**No**to skip samples containing missing data for the current region.

#### Output¶

A marker-mapped spreadsheet will be generated as output containing one row for each region and containing the following columns:

**Chr**The chromosome number or label.**Start**The starting genetic position of the region.**Stop**The ending genetic position of the region.**Gene Name**The gene name.**Transcript Name(s)**Transcript name(s) for each gene region.**Strand**Strand for the gene and transcripts.**P-Value**The p-value resulting from the Hotelling’s test.**-log10 P-Value**Minus the base-10 log of the above p-value.**T^2**The statistic itself.**F**The F-value corresponding to the statistic.**Permuted P-Value**If permutation testing was requested, this column will appear and show the permuted p-value.**F Bonf. P**The Bonferroni-adjusted p-value, based on the number of successful tests.**F FDR**The false discovery rate for all tests having a p-value less than or equal to the current test’s p-value.**Sample Size Used**The actual sample size used for this test after samples with missing values in the dependent variable and, if requested, the genotypes, have been removed.**# Markers Total**The total number of markers tested in this region.**# in Bin=etc.**For each bin, a column is generated containing the number of markers categorized into that bin for this region.

### CMC with Regression¶

This feature will perform the CMC method using regression. The dependent variable, which may be either case/control or quantitative, is regressed against the independent bin variables.

This algorithm also allows correction for additional covariates. In this case, a full-model-vs-reduced-model regression is done, where the full model contains all of the independent bin variables plus the additional covariates as regressors, while the reduced model contains only the additional covariates as regressors.

#### Data Requirements for CMC with Regression¶

The data requirements for CMC analysis with regression are the same as
they are for CMC with Hotelling’s test
(*Data Requirements for CMC with Hotelling T Squared Test*) with the following exceptions:

- The dependent variable may be quantitative rather than case/control, in which case linear regression will be used rather than logistic regression.
- The spreadsheet must contain the covariates, if any, that you wish to control for. Each covariate may be either quantitative or binary.

#### Usage and Parameters¶

To use the CMC method with regression, select **DNA-Seq** >
**Collapsing Methods** > **CMC with Regression** from the spreadsheet
menu. Depending upon whether your dependent variable is case/control
or quantitative, you will get either the logistic regression window
(*CMC Method with Logistic Regression – Parameters*) or the linear regression window
(*CMC Method with Linear Regression – Parameters*). Use the resulting window to select the
following parameters:

**Regions for CMC (Li and Leal) Logistic Regression Test**or**Regions for CMC (Li and Leal) Linear Regression Test**(prompt depends upon whether your dependent variable is case/control or quantitative)Use

**Select Track**to specify an annotation gene track that will be used to distinguish regions for testing. If possible it is recommended that a local filtered gene track be used to determine the gene list. A local track will allow the spreadsheet to be processed faster than if a network track was used. An example filtered gene track name is:`FilteredRefSeqGenes-UCSC_NCBI_36_Homo_sapiens.idf:1`Please see

*The Data Source Library*and*Downloading Data*for information on how to obtain annotation tracks.**Perform one test per...**- Select
**gene**(recommended) to use the minimum start position and the maximum stop position of all transcripts to define a gene, for which just one test will be performed. - Select
**transcript**to perform one test for every transcript. This option will result in a more severe multiple testing penalty.

- Select
**Include nearby markers (distance in bp)**Checking this and entering a distance in base pairs will let you create a “window” around a gene or transcript to include further variants, which you may want to do if you suspect these may affect gene regulation or splicing.

**Variant Bins for CMC**(Please see the notes above about “Variant Bins spreadsheets”.) Here, select the**Variant Frequency Bins spreadsheet**and the**Variant Bin column**in that spreadsheet.- The variant bin column may either be binary or integer, in which case the bin categories will be numeric, or the column may be categorical, in which case the categories will define the different bins.

Check

**Correct for Additional Covariates**to utilize the full-vs-reduced regression technique to correct for additional covariates.- Click on
**Add Columns**to start selecting which variables (spreadsheet columns) to use for correction.

- Click on
**Significance Testing**If your sample size is small (under 300 cases and 300 controls), permutation testing is recommended in addition to obtaining the asymptotic (non-permuted) p-value.- To perform permutation testing, check
**Also compute permuted p-values**and enter the**Number of permutations to use**. - To use adaptive permutation testing, check
**Adaptive permutation testing (threshold alpha)**and fill in the threshold alpha value. Please see*Adaptive Permutation Testing*below for more information about this technique and the possible advantages of using it.

- To perform permutation testing, check
**Missing Genotype Values**If there is any missing genotype data in the current region for a given sample, either the wild type (homozygous reference genotype) can be imputed to have been present and substituted for the missing data, or the sample can be skipped and not used. Select**Yes**to**Impute Wild Type for Genotypic Missing Values**to impute the wild type, or**No**to skip samples containing missing data for the current region.**Additional Outputs**The following additional outputs may be specified:- Select
**Output bin betas and their standard errors**to obtain a regression coefficient (beta) and its standard error for- the intercept,
- each additional covariate you are correcting for, if any, and
- each bin which is used in the regression. If any potential bins are not used for a given region, their corresponding output cells are filled with missing-value indicators (‘?’ values).

- To aid in comparing CMC to other methods, you may select
**Output separate spreadsheet of bin variables**to get the bin values for every sample output to a second separate spreadsheet. That is, for every region and every sample, if bins were used for the region, the bin values for that sample (each value being one or zero) will be output into that sample’s row in each of separate columns.

- Select

#### Outputs¶

A marker-mapped spreadsheet will be generated as output containing one row for each region and containing the following columns, depending upon the type of your dependent variable and whether or not you are correcting for covariates:

- If your dependent variable is case/control and you are not correcting
for covariates, the spreadsheet will contain the following columns:
**Chr**The chromosome number or label.**Start**The starting genetic position of the region.**Stop**The ending genetic position of the region.**Gene Name**The gene name.**Transcript Name(s)**Transcript name(s) for each gene region.**Strand**Strand for the gene and transcripts.**P-Value**The p-value resulting from the Likelihood Ratio test.**-log10 P-Value**Minus the base-10 log of the above p-value.**Log Likelihood Full**The log likelihood for the regression.**Log Likelihood Null**The log likelihood for the null model.**Permuted P-Value**If permutation testing was requested, this column will appear and show the permuted p-value.**Bonf. P**The Bonferroni-adjusted p-value, based on the number of successful tests.**FDR**The false discovery rate for all tests having a p-value less than or equal to the current test’s p-value.**Sample Size Used**The actual sample size used for this test after samples with missing values in the dependent variable and any additional covariates and, if requested, the genotypes, have been removed.**# Markers Total**The total number of markers tested in this region.**# in Bin=etc.**For each bin, a column is generated containing the number of markers categorized into that bin for this region.- If you specified
**Output bin betas and their standard errors**, the following additional outputs will appear:**Beta 0**The value for the regression.**Beta 0 SE**The standard error for .**Bin etc. Beta**For each bin (which was used by the regression for a given region), the value for regressing that bin is displayed. (For bins which are not used for a given region, a missing-value indicator (‘?’) is displayed.)**Bin etc. Beta SE**For each bin (which was used by the regression for a given region), the standard error for the value for regressing that bin is displayed. (For bins which are not used for a given region, a missing-value indicator (‘?’) is displayed.)

- If your dependent variable is case/control and you are correcting
for covariates, the spreadsheet will contain the following columns:
**Chr**The chromosome number or label.**Start**The starting genetic position of the region.**Stop**The ending genetic position of the region.**Gene Name**The gene name.**Transcript Name(s)**Transcript name(s) for each gene region.**Strand**Strand for the gene and transcripts.**P-Value**The (full vs. reduced) p-value resulting from the logistic regression.**-log10 P-Value**Minus the base-10 log of the above p-value.**Full-Model P**The p-value resulting from just regressing the full model.**Log Likelihood Full**The log likelihood for the full model.**Log Likelihood Reduced**The log likelihood for the reduced model.**Permuted P-Value**If permutation testing was requested, this column will appear and show the permuted p-value.**Bonf. P**The Bonferroni-adjusted p-value, based on the number of successful tests.**FDR**The false discovery rate for all tests having a p-value less than or equal to the current test’s p-value.**Sample Size Used**The actual sample size used for this test after samples with missing values in the dependent variable and any additional covariates and, if requested, the genotypes, have been removed.**# Markers Total**The total number of markers tested in this region.**# in Bin=etc.**For each bin, a column is generated containing the number of markers categorized into that bin for this region.- If you specified
**Output bin betas and their standard errors**, the following additional outputs will appear:**Beta 0**The value for the region’s regression.**Beta 0 SE**The standard error for the .**Covariate Beta**For each covariate, the value for regressing that covariate in this region is shown.**Covariate Beta SE**For each covariate, the standard error for the value for regressing that covariate in this region is shown.**Bin etc. Beta**For each bin (which was used by the regression for a given region), the value for regressing that bin is displayed. (For bins which are not used for a given region, a missing-value indicator (‘?’) is displayed.)**Bin etc. Beta SE**For each bin (which was used by the regression for a given region), the standard error for the value for regressing that bin is displayed. (For bins which are not used for a given region, a missing-value indicator (‘?’) is displayed.)

- If your dependent variable is quantitative and you are not correcting
for covariates, the spreadsheet will contain the following columns:
**Chr**The chromosome number or label.**Start**The starting genetic position of the region.**Stop**The ending genetic position of the region.**Gene Name**The gene name.**Transcript Name(s)**Transcript name(s) for each gene region.**Strand**Strand for the gene and transcripts.**P-Value**The p-value resulting from the linear regression.**-log10 P-Value**Minus the base-10 log of the above p-value.**R Squared**The coefficient of determination for the regression.**F**The value of the F-statistic for the regression model.**Permuted P-Value**If permutation testing was requested, this column will appear and show the permuted p-value.**Bonf. P**The Bonferroni-adjusted p-value, based on the number of successful tests.**FDR**The false discovery rate for all tests having a p-value less than or equal to the current test’s p-value.**Sample Size Used**The actual sample size used for this test after samples with missing values in the dependent variable and any additional covariates and, if requested, the genotypes, have been removed.**# Markers Total**The total number of markers tested in this region.**# in Bin=etc.**For each bin, a column is generated containing the number of markers categorized into that bin for this region.- If you specified
**Output bin betas and their standard errors**, the following additional outputs will appear:**Intercept**The intercept value for the regression.**Intercept SE**The standard error for the intercept.**Bin etc. Beta**For each bin (which was used by the regression for a given region), the value for regressing that bin is displayed. (For bins which are not used for a given region, a missing-value indicator (‘?’) is displayed.)**Bin etc. Beta SE**For each bin (which was used by the regression for a given region), the standard error for the value for regressing that bin is displayed. (For bins which are not used for a given region, a missing-value indicator (‘?’) is displayed.)

- If your dependent variable is quantitative and you are correcting
for covariates, the spreadsheet will contain the following columns:
**Chr**The chromosome number or label.**Start**The starting genetic position of the region.**Stop**The ending genetic position of the region.**Gene Name**The gene name.**Transcript Name(s)**Transcript name(s) for each gene region.**Strand**Strand for the gene and transcripts.**P-Value**The (full vs. reduced) p-value resulting from the linear regression.**-log10 P-Value**Minus the base-10 log of the above p-value.**Full-Model P**The p-value obtained from regressing only the full model.**Full-Model R Squared**The coefficient of determination for the full model.**Full-Model F**The value of the F-statistic for the full model.**Reduced-Model P**The p-value obtained from regressing only the reduced model.**Reduced-Model R Squared**The coefficient of determination for the reduced model.**Reduced-Model F**The value of the F-statistic for the reduced model.**Full vs Reduced F**The value of the F-statistic for the full vs. reduced model.**Permuted P-Value**If permutation testing was requested, this column will appear and show the permuted p-value.**Bonf. P**The Bonferroni-adjusted p-value, based on the number of successful tests.**FDR**The false discovery rate for all tests having a p-value less than or equal to the current test’s p-value.**Sample Size Used**The actual sample size used for this test after samples with missing values in the dependent variable and any additional covariates and, if requested, the genotypes, have been removed.**# Markers Total**The total number of markers tested in this region.**# in Bin=etc.**For each bin, a column is generated containing the number of markers categorized into that bin for this region.- If you specified
**Output bin betas and their standard errors**, the following additional outputs will appear:**Intercept**The intercept value for the regression.**Intercept SE**The standard error for the intercept.**Covariate Beta**For each covariate, the value for regressing that covariate in this region is shown.**Covariate Beta SE**For each covariate, the standard error for the value for regressing that covariate in this region is shown.**Bin etc. Beta**For each bin (which was used by the regression for a given region), the value for regressing that bin is displayed. (For bins which are not used for a given region, a missing-value indicator (‘?’) is displayed.)**Bin etc. Beta SE**For each bin (which was used by the regression for a given region), the standard error for the value for regressing that bin is displayed. (For bins which are not used for a given region, a missing-value indicator (‘?’) is displayed.)

Additionally, if you have requested **Output separate spreadsheet of
bin variables**, another marker-mapped spreadsheet called **CMC Bin
Variables** will be output containing one row for each sample and the
following columns:

- Your (case/control or quantitative) dependent variable.
- One column for every covariate you have specified.
- For each region and each bin which was used for that region, a column is output containing that bin’s values for each sample in the region. Each column header for these columns will contain the chromosome name, followed by an underline (‘_’), followed by the start position of the region, followed by a second underline (‘_’), followed by the bin identifier.

## Kernel-Based Adaptive Cluster (KBAC) Test¶

As noted above, the Kernel-Based Adaptive Cluster (KBAC) method by Liu and Leal [LiuAndLeal2010] first catalogs the variant data within each of a number of regions into multi-marker genotypes. (Since the variants are rare, only a relatively few different multi-marker genotypes will be found in any given region.)

A special case/control test based on these counts is then applied. This test is weighted for each multi-marker genotype according to how often that genotype is expected to occur according to the data and according to the null hypothesis that there is no association between that genotype and the case/control status of the sample. Under this adaptive weighting procedure, the genotypes with higher sample risks (meaning that there are more occurrences of those genotypes for cases) will be given higher weights which can potentially separate causal from non-causal genotypes. This procedure is meant to attain a good balance between classification accuracy and the number of parameters which are estimated.

One result of this adaptive weighting procedure is that the KBAC test is intrinsically “one-sided”, in the sense that it is highly sensitive to associations from multi-marker genotypes conferring higher sample risks, and is more or less not sensitive at all to associations from multi-marker genotypes conferring lower sample risks (meaning that there are fewer occurrences of those genotypes for cases and more occurrences of those genotypes for controls).

If you suspect associations from multi-marker genotypes conferring lower sample risks, or wish to explore associations in both directions, you may wish to invert the Case/Control status of your phenotype and use the KBAC procedure to re-test for associations. The latter test will be highly sensitive to associations from multi-marker genotypes conferring lower sample risks (in relation to your original Case/Control status), and will be more or less not sensitive at all to associations from multi-marker genotypes conferring higher sample risks (in relation to your original Case/Control status).

### Overview of Theory¶

#### Sample Risk for Multi-Site Genotypes¶

Suppose the total sample size is , among which there are samples from affected individuals and samples from unaffected individuals, and that there are sites within the candidate region where rare variants are observed. The rare variant multi-site genotype for each sample may be cataloged as a vector , where is the number of rare variants observed at the sample’s site-that is, is 2 if the site is homozygous for the rare allele, 1 if the site is heterozygous, and 0 if the site is homozygous wild-type for the common major alleles. Further suppose that distinct multi-site genotype vectors through can be cataloged from the data, where represents the wild-type genotype without any rare variants (a vector of all zeros) and through represent multi-site genotypes with at least one variant. The sample risk for multi-site genotype is defined as

where is the number of occurrences of genotype and is the number of occurrences of genotype for samples from affected individuals (cases).

The sample risk for multi-site genotype is modeled using a two-component distribution

where the component represents the distribution of the sample risk when multi-site genotype is non-causal and is known, while represents the unknown distribution of sample risk when is causal. If the null hypothesis of no disease/gene associations holds, all genotypes will be non-causal and we could take .

#### Adaptive Weighting¶

In this method, we adaptively weight each genotype according to the area under the curve from to of the known component of the distribution, where is the estimated sample risk for genotype . This known component is referred to as a “kernel”. The weight of genotype may thus be written

Three methods of calculation are available to obtain values for these weights.

**Hyper-geometric kernel.**This is the exact form of the kernel. Under the null hypothesis, conditioning on the genotype counts and the count of cases (vs. controls ), the number of diseased individuals (cases) having the multi-site genotype follows a hyper-geometric distribution with kernel function given bySince the hyper-geometric distribution is discrete, the weight integral may be computed through the summation

**Marginal binomial kernel.**In the marginal condition that there are only a few samples having the variant genotypes but yet the number of cases is a significant fraction of the number of samples, the number of diseased individuals (cases) having the multi-site genotype approximates the binomial distribution, that is, . Based on this, the kernel may be writtenand the weight integral may be computed through the summation

**Asymptotic normal kernel.**For large sample sizes, we take advantage of the fact that under the null distribution, the sample risk for genotype is asymptotically normal, that is,so the kernel is given by

where is the probability density function for a standard normal random variable. The weight is thus

which may be found through a table or through a utility for computing the standard normal distribution.

#### The Statistic Itself¶

The KBAC test statistic itself is defined as

which compares the difference of weighted multi-site genotype frequencies between cases and controls. When a one-sided alternative hypothesis is tested, for instance, the enrichment of causal variants in cases, a corresponding one-sided version of KBAC can be used. To test the enrichment of causal variants in cases, one would use

#### Finding the P-Value¶

Since the distribution of this statistic is unknown, permutation testing or something approximating it must be used to determine a p-value.

#### Permutation Testing¶

Normally, permutation testing is used to determine the p-value of a KBAC statistic. Permutation testing is carried out as follows:

- Compute the KBAC statistic based upon the original data.
- Permute the case/control statuses.
- Using the sample risks as based upon the original multi-marker genotypes and the newly-permuted case/control statuses, re-compute the weight associated with each multi-marker genotype .
- Re-compute the KBAC statistic based upon the permuted data and the newly re-computed genotype weights.
- Repeat steps (2) through (4) until all permutations are finished.
- Find the count of how many permutation results (counting the statistic from the original data as one permutation result) are greater than or equal to the original statistic, and divide by the total number of permutation results.

#### The Monte-Carlo Method¶

For large sample sizes, a Monte-Carlo approximation method for finding a p-value has been developed and is available in Golden Helix SVS. As was mentioned above, the number of cases for each genotype approximates a binomial distribution, that is, . Due to the low frequencies of each multi-site genotype containing rare variants, we take the to be independent of each other. The Monte-Carlo simulation algorithm is thus as follows:

- Compute the KBAC statistic based upon the original data.
- Simulate a -vector of independent binomials with .
- Re-compute the genotype weights, based upon the -vector, as for each multi-site genotype .
- Re-compute a new KBAC statistic based upon the -vector, using each as the number of cases found for the multi-site genotype .
- Repeat steps (2) through (4), using the KBAC statistics generated from these steps in the same fashion as is done for permutation testing–namely, find the count of how many simulation results (counting the original statistic as one simulation result) are greater than or equal to the original statistic, and divide by the total number of simulation results.

### Correction for Covariates¶

#### Alternative View of the Statistic¶

Suppose, for any sample , we let take on either the value

for a case, and

for a control, where and are arbitrary constants. It may be shown that

holds for the one-sided KBAC statistic documented above, where is the overall average of all the , and means for the multi-site genotype contained by sample .

For future reference, we will find it convenient to set and , so as to make for the cases and for the controls. Using this convention, if for any sample , we then set

where is defined as above, we have

as an alternative formulation of the one-sided KBAC statistic for which no covariates are corrected.

Note

The value (for any sample ) may be formulated as

or, when performing the Monte-Carlo approximation,

The first form depends upon which, in turn, depends upon whether a permutation is being computed and upon which permutation is being computed. The second (Monte-Carlo) form depends upon which, in turn, depends upon which iteration of the Monte-Carlo procedure is current.

Therefore, the values of the depend upon which permutation, if any, (or upon which Monte-Carlo iteration) is being performed.

#### Regression Score¶

Suppose we have covariates such as age, sex or eigenvectors for genotypes, and we define to be the value of the covariate for the sample. If we define and as above, the logistic model for association testing will have the form

The score statistic to test the null hypothesis of the independence of this model from the is

where

and and the are determined by a regression on the reduced model

In a fashion similar to the original KBAC statistics, we define to be a one-sided score to test for the enrichment of causal variants for higher corrected case/control statuses , and we define to be a two-sided score to test for the dependency of causal variants upon the corrected case/control statuses .

Note

We see that if there are no covariates, estimates the log of the ratio of cases to controls, and that

estimates , so that

emerges as the definition of the score statistic when there are no covariates. This matches, within a constant multiplier, the original one-sided KBAC statistic.

Note

To derive the regression score, note that the expectation of the themselves may be written as

where

and that the likelihood and log-likelihood for all the observations are

and

Since we are using the null hypothesis , we will determine the score function with respect to , which is defined as the derivative of the log-likelihood with respect to . We take this derivative and expand the resulting terms:

where for any scalar value ,

(as defined earlier),

and

Thus

and the derivative of the log-likelihood with respect to is

Under the null hypothesis , we have

Also, under the null hypothesis, the do not contribute to the solution to the original model, and thus and the are also, under the null hypothesis, the solution to the reduced model

Thus, we may set up the score test for as

where and the are the solution to the reduced model.

#### Using the Regression Score with Permutation Testing¶

When there are covariates, we see that is dependent upon the , which are dependent upon the . Thus, for every permutation of the case/control statuses , the reduced-model regression must be repeated.

Thus, KBAC with Regression can be much more compute-intensive than is KBAC with Permutation Testing. For this reason, Golden Helix SVS KBAC with Regression pre-computes the reduced-model regression for all permutations when imputing the wild type, and offers the option of pre-computing the reduced-model regression for all permutations when not imputing the wild type. However, KBAC with Regression is able to correct for covariates, while KBAC with Permutation Testing is not.

### Correction for Relationships and/or Pedigree Structure¶

Please see *Mixed-Model Kernel-Based Adaptive Cluster (KBAC) Method*.

### KBAC with Permutation Testing¶

This feature will perform the KBAC method using its original algorithm and statistic. The case/control dependent variable is tested against the weighted multi-site genotypes other than the wild-type genotype.

#### Data Requirements for KBAC with Permutation Testing¶

KBAC analysis is to be performed from a mapped spreadsheet of sequence data.

Note

- You should first filter the markers by rare variants. It is assumed by KBAC with Permutation Testing that only rare variant markers will be active.
- If you wish to filter based on a user-defined minor-allele
frequency threshold and an external reference population provided
through a probe annotation track, you may use
**DNA-Seq**>**Variant Frequency Binning by Frequency Source**(Please see*Variant Binning by Frequency Source*), entering the minor-allele threshold as the “bin 0 threshold” to generate a “binning” spreadsheet. Inactivate all rows of the “binning” spreadsheet that are not in bin 0, then use**Select**>**Activate or Inactivate Based on Second Spreadsheet**to activate all columns in the original marker spreadsheet corresponding to active rows in the “binning” spreadsheet.

#### Usage and Parameters¶

To use KBAC with Permutation Testing, select **DNA-Seq** >
**Collapsing Methods** > **KBAC with Permutation Testing** from the
spreadsheet menu, and using the resulting window (see
*KBAC with Permutation Testing – Parameters*) to select the following parameters:

**Parameters for KBAC (Liu & Leal, 2010) Using Permutation Testing**Enter the kernel type and the permutation parameters.**Kernel Type**The possible choices are:**Hyper-geometric (Normally recommended)**Normally recommended and gives the most accurate test.**Marginal binomial**Less compute-intensive and almost as accurate as the hyper-geometric kernel.**Asymptotic normal (For large sample sizes only) (> 400 cases/400 controls)**You should have more than 400 cases and 400 controls for this kernel type. Not as accurate as the other two kernel methods.

**Permutations**The permutation parameters are:**Number of permutations to use:**The KBAC test is dependent upon permutation testing. Please enter the number of permutations here.**Permutation Mode:**The possible choices are:**Standard C/C permutation procedure**This is normally recommended and gives the most accurate test.**KBAC Monte-Carlo approximation (For large sample sizes only) (> 400 cases/400 controls)**You should have more than 400 cases and 400 controls for this permutation mode, which is not as accurate as standard case/control permutation testing.

- To use adaptive permutation testing, check
**Adaptive permutation testing (threshold alpha)**and fill in the threshold alpha value. Please see*Adaptive Permutation Testing*below for more about this technique and the possible advantages of using it.

**Outputs**Choose one or both of the following:**One-sided statistics (recommended)**This tests the one-sided alternative hypothesis of the enrichment of causal variants in cases.**Two-sided statistics**This tests the two-sided alternative hypothesis of the difference between weighted multi-site genotype frequencies between cases and controls.

Note

Due to the often asymmetric distribution of the one-sided KBAC test-statistic, two-sided p-values will almost never be twice the one-sided p-values, and will in fact be exactly the same as the one-sided p-values if the statistic’s distribution, given the numbers of cases and controls and numbers of genotypes, is entirely positive.

Further, the two-sided and one-sided p-values will usually be exactly the same for most practical cases where the p-value is less than .05.

**Regions**Specify an annotation gene track that will be used to distinguish regions for testing. If possible it is recommended that a local filtered gene track be used to determine the gene list. A local track will allow the spreadsheet to be processed faster than if a network track was used. An example filtered gene track name is:`FilteredRefSeqGenes-UCSC_NCBI_36_Homo_sapiens.idf:1`Please see

*The Data Source Library*and*Downloading Data*for information on how to obtain annotation tracks.**Perform one test per...**- Select
**gene**(recommended) to use the minimum start position and the maximum stop position of all transcripts to define a gene, for which just one test will be performed. - Select
**transcript**to perform one test for every transcript. This option will result in a more severe multiple testing penalty.

- Select
**Include nearby markers (distance in bp)**Checking this and entering a distance in base pairs will let you create a “window” around a gene or transcript to include further variants, which you may want to do if you suspect these may affect gene regulation or splicing.

**Missing Genotype Values**If there is any missing genotype data in the current region for a given sample, either the wild type (homozygous reference genotype) can be imputed to have been present and substituted for the missing data, or the sample can be skipped and not used. Select**Yes**to**Impute Wild Type for Genotypic Missing Values**to impute the wild type, or**No**to skip samples containing missing data for the current region.

#### Output¶

A marker-mapped spreadsheet will be generated as output containing one row for each region and containing some or all of the following columns, according to what you have requested:

**Chr**The chromosome number or label.**Start**The starting genetic position of the region.**Stop**The ending genetic position of the region.**Gene Name**The gene name.**Transcript Name(s)**Transcript name(s) for each gene region.**Strand**Strand for the gene and transcripts.**P-Value (One-Sided)**The p-value resulting from the one-sided KBAC test.**-log10 P-Value (One-Sided)**Minus the base-10 log of the above p-value.**KBAC (One-Sided)**The one-sided KBAC statistic itself.**Bonf. P (One-Sided)**The Bonferroni-adjusted p-value from the one-sided test, based on the number of successful one-sided tests.**FDR (One-Sided)**The false discovery rate for all tests having a one-sided p-value less than or equal to the current test’s one-sided p-value.**P-Value (Two-Sided)**The p-value resulting from the two-sided KBAC test.**-log10 P-Value (Two-Sided)**Minus the base-10 log of the above p-value.**KBAC (Two-Sided)**The two-sided KBAC statistic itself.**Bonf. P (Two-Sided)**The Bonferroni-adjusted p-value from the two-sided test, based on the number of successful two-sided tests.**FDR (Two-Sided)**The false discovery rate for all tests having a two-sided p-value less than or equal to the current test’s two-sided p-value.**Actual Permutations Used**How many permutations were actually used for this test by the adaptive permutation testing algorithm.**Sample Size Used**The actual sample size used for this test after samples with missing values in the dependent variable and, if requested, the genotypes, have been removed.**# Markers**The total number of markers tested in this region.**# Multi-Marker Genotypes**The total number of distinct multi-marker genotypes discovered in this region.

### KBAC with Regression¶

This feature, which allows additional covariates to be specified, will perform the KBAC method using the regression score function. The case/control variable is tested against the weighted multi-site genotypes other than the wild-type genotype and against the expected values of the case/control variable. The expected values of the case/control variable are corrected by any additional covariates.

#### Data Requirements for KBAC with Regression¶

The data requirements for KBAC analysis with regression are the same as
they are for KBAC with Permutation Testing
(*Data Requirements for KBAC with Permutation Testing*) with the following addition:

- The spreadsheet must contain the covariates, if any, that you wish to control for. Each covariate may be either quantitative or binary.

#### Usage and Parameters¶

To use KBAC with Regression, select **DNA-Seq** > **Collapsing
Methods** > **KBAC with Regression** from the spreadsheet menu, and
use the resulting window (see *KBAC Method with Regression – Parameters*) to select the
following parameters:

**Parameters for KBAC (Liu & Leal, 2010) Logistic Regression**Enter the kernel type and the permutation parameters.**Kernel Type**The possible choices are:**Hyper-geometric (Normally recommended)**Normally recommended and gives the most accurate test.**Marginal binomial**Less compute-intensive and almost as accurate as the hyper-geometric kernel.**Asymptotic normal (For large sample sizes only) (> 400 cases/400 controls)**You should have more than 400 cases and 400 controls for this kernel type. Not as accurate as the other two kernel methods.

**Permutations**The permutation parameters are:**Number of permutations to use:**The KBAC test is dependent upon permutation testing. Please enter the number of permutations here.- To use adaptive permutation testing, check
**Adaptive permutation testing (threshold alpha)**and fill in the threshold alpha value. Please see*Adaptive Permutation Testing*below for more about this technique and the possible advantages of using it.

**Outputs**Choose one or both of the following:**One-sided statistics (recommended)**This tests the one-sided alternative hypothesis of the enrichment of causal variants for higher corrected case/control statuses .**Two-sided statistics**This tests the two-sided alternative hypothesis of the dependency of causal variants upon the corrected case/control statuses .

Note

Due to the often asymmetric distribution of the one-sided KBAC test-statistic, two-sided p-values will almost never be twice the one-sided p-values, and will in fact be exactly the same as the one-sided p-values if the statistic’s distribution, given the numbers of cases and controls and numbers of genotypes, is entirely positive.

Further, the two-sided and one-sided p-values will usually be exactly the same for most practical cases where the p-value is less than .05.

**Regions**Specify an annotation gene track that will be used to distinguish regions for testing. If possible it is recommended that a local filtered gene track be used to determine the gene list. A local track will allow the spreadsheet to be processed faster than if a network track was used. An example filtered gene track name is:`FilteredRefSeqGenes-UCSC_NCBI_36_Homo_sapiens.idf:1`*The Data Source Library*and*Downloading Data*for information on how to obtain annotation tracks.**Perform one test per...**- Select
**gene**(recommended) to use the minimum start position and the maximum stop position of all transcripts to define a gene, for which just one test will be performed. - Select
**transcript**to perform one test for every transcript. This option will result in a more severe multiple testing penalty.

- Select
**Include nearby markers (distance in bp)**Checking this and entering a distance in base pairs will let you create a “window” around a gene or transcript to include further variants, which you may want to do if you suspect these may affect gene regulation or splicing.

Check

**Correct for Additional Covariates**to utilize the regression score technique to correct for additional covariates.- Click on
**Add Columns**to start selecting which variables (spreadsheet columns) to use for correction.

- Click on
**Missing Genotype Values**If there is any missing genotype data in the current region for a given sample, either the wild type (homozygous reference genotype) can be imputed to have been present and substituted for the missing data, or the sample can be skipped and not used.If you wish to impute the wild type, or you have no missing data, select

**Yes (faster)**to**Impute Wild Type for Genotypic Missing Values**. Reduced-model results for all of the permutations will be pre-computed in advance, and because the samples that will be used for all the regions will be the same (all samples for which the dependent variable and the covariates all have non-missing data), these results will be accurate.Otherwise, samples containing missing data for the region being analyzed will be skipped. Select

**No (slower)**to compute the reduced model all over again for each region and obtain accurate results, or select**No but precompute reduced models (faster but gives approximate values)**to pre-compute the reduced-model results in advance (for all samples for which the dependent variable and the covariates all have non-missing data) and then use these pre-computed results for those samples that are actually being used.Note

When selecting

**No but precompute reduced models (faster but gives approximate values)**, and if there are samples containing missing data for the region being analyzed (but not for the dependent variable and covariates), the KBAC statistic values and the p-values will be approximate for two reasons:- The reduced model results were pre-computed for a larger subset than is actually being used.
- While the potential genotype weights ( or ) for every region are always pre-computed using the actual subset (and thus are always exact for the unpermuted case), the number of total cases that are in the actual subset but were obtained from permutation on the larger subset may fluctuate as you move through the permutations, making the pre-computation results for the permutations only approximate. (Also, a pre-computation may not have been performed for the number of (permuted) cases that a genotype has, in which case the pre-computed value for the highest number of cases for which pre-computation was done is used.)

**Intermediate Outputs**To aid in comparing KBAC to other methods, you may select**Output separate spreadsheets of weights (from unpermuted Y) and genotype indexes**to get the vector of ( weight) values, as calculated from the unpermuted values, output to a second separate spreadsheet. That is, for every region and for every sample for which the region contains multi-marker genotype , this spreadsheet will outputAdditionally, if you select

**Output separate spreadsheets of weights (from unpermuted Y) and genotype indexes**, a third spreadsheet will be output containing the index value corresponding to each multi-marker genotype .If is the wild-type genotype, a zero will be output to both spreadsheets.

Note

As stated above in the note to

*Alternative View of the Statistic*, the used in the permutations (and in the iterations of the Monte-Carlo method featured in*KBAC with Permutation Testing*) will vary from the derived from the original unpermuted data.Note

The values (for the unpermuted values) which are output to the separate spreadsheet are exact, even if you select

**No but precompute reduced models (faster but gives approximate values)**to**Impute Wild Type for Genotypic Missing Values**.

#### Outputs¶

A marker-mapped spreadsheet will always be generated as output containing one row for each region and containing some or all of the following columns, according to what you have requested:

**Chr**The chromosome number or label.**Start**The starting genetic position of the region.**Stop**The ending genetic position of the region.**Gene Name**The gene name.**Transcript Name(s)**Transcript name(s) for each gene region.**Strand**Strand for the gene and transcripts.**P-Value (One-Sided)**The p-value resulting from the one-sided KBAC test.**-log10 P-Value (One-Sided)**Minus the base-10 log of the above p-value.**KBAC Score (One-Sided)**The one-sided KBAC score statistic itself.**Bonf. P (One-Sided)**The Bonferroni-adjusted p-value from the one-sided test, based on the number of successful one-sided tests.**FDR (One-Sided)**The false discovery rate for all tests having a one-sided p-value less than or equal to the current test’s one-sided p-value.**P-Value (Two-Sided)**The p-value resulting from the two-sided KBAC test.**-log10 P-Value (Two-Sided)**Minus the base-10 log of the above p-value.**KBAC Score (Two-Sided)**The two-sided KBAC score statistic itself.**Bonf. P (Two-Sided)**The Bonferroni-adjusted p-value from the two-sided test, based on the number of successful two-sided tests.**FDR (Two-Sided)**The false discovery rate for all tests having a two-sided p-value less than or equal to the current test’s two-sided p-value.**Actual Permutations Used**How many permutations were actually used for this test by the adaptive permutation testing algorithm.**Sample Size Used**The actual sample size used for this test after samples with missing values in the dependent variable and, if requested, the genotypes, have been removed.**# Markers**The total number of markers tested in this region.**# Multi-Marker Genotypes**The total number of distinct multi-marker genotypes discovered in this region.

Additionally, if you have requested **Output separate spreadsheets of
weights (from unpermuted Y) and genotype indexes**, two other
marker-mapped spreadsheets will be output as follows:

**KBAC Regression Weights (from Unpermuted Y)**containing one row for each sample and the following columns:- Your case/control variable.
- One column for every covariate you have specified.
- The weight value (as computed from the unpermuted ) for each sample in the region. If the sample’s genotype is the wild type, zero is output. Each column header for these columns will contain the chromosome name, followed by an underline (‘_’), followed by the start position of the region.

**KBAC Regression Genotype Indexes**containing one row for each sample and the following columns:- Your case/control variable.
- One column for every covariate you have specified.
- For each sample , the index for that sample’s genotype in the region. If the sample’s genotype is the wild type, zero is output. Each column header for these columns will contain the chromosome name, followed by an underline (‘_’), followed by the start position of the region.

## Mixed-Model Kernel-Based Adaptive Cluster (KBAC) Method¶

At Golden Helix, we have adapted mixed models
(*Methods for Mixed Linear Model Analysis*) to the KBAC test
(*Kernel-Based Adaptive Cluster (KBAC) Test*). This method may be used to perform a KBAC
test which corrects for genetic relationships or kinships among the
samples.

This test works essentially as does KBAC with Regression
(*KBAC with Regression*), except that the reduced model used for
scoring is created from a logistic mixed-model equation, and thus
implicitly includes the kinship information along with any
covariate information.

### Overview of Theory¶

#### Regression Score¶

We define the mixed-model regression score in a fashion similar to the
definition in *Regression Score*, except that we will change
the notation slightly to better agree with our mixed-model notation
(*The Mixed Model Equation*).

Suppose we have covariates such as age, sex or eigenvectors
for genotypes, and we define to be the value of the
(fixed-effect) covariate for the
sample. Also, suppose we have random effects , written in
vector form as . We take the variance/covariance structure of
the random effects to be . If we define and
as in *Alternative View of the Statistic*, the logistic model for
association testing will have the form

The score statistic to test the null hypothesis of the independence of this model from the will be

where

being defined by

, the , and the predictions of the are determined by solving the (reduced) logistic mixed model

using the same variance/covariance matrix as above for the .

As with the original KBAC method, we define to be a one-sided score to test for the enrichment of causal variants for higher corrected case/control statuses , and we define to be a two-sided score to test for the dependency of causal variants upon the corrected case/control statuses .

Note

We derive the mixed-model regression score in a manner
similar to that for *Regression Score*. We note that the
expectations of the given the are

where

and that the likelihood and log-likelihood for all the observations are

and

As before, since we are using the null hypothesis , we will determine the score function with respect to , which is the derivative of the log-likelihood with respect to :

where

where for any scalar value ,

(as defined earlier), and

(as shown in *Regression Score*). In a fashion similar to before, we have

Thus

and the derivative of the log-likelihood with respect to is

(as shown in *Regression Score*, except this time,
and are defined as directly
above). Under the null hypothesis , we have

Also, under the null hypothesis, the do not contribute to the solution to the original model, and thus , the , and the predictions of are also, under the null hypothesis, the solution to the reduced mixed model

using the same variance/covariance matrix for the as for the original mixed model.

Thus, we may set up the score test for as

where , the , and the predictions of are the solution to the reduced mixed model.

#### The Logistic (Reduced) Mixed Model¶

Define

- as the matrix consisting of all one’s in the first column, followed by the values in the remaining columns (with being the value in row and column ).
- as the vector consisting of the value as the first element and the value as its element number .
- , , and as the vectors of values, values, and values, respectively.
- , where is a vector consisting of values, as a vector consisting of the values using the scalar definition of .

The above reduced model may then be rewritten as

where is the expectation of the given the , and

where is the variance of the given the , and is the variance of the binomial distribution itself, with diagonal elements and off-diagonal elements equal to zero.

Note

These expressions are sometimes seen elsewhere as

and

but here, we use the identity matrix for the “random-effect design matrix” as well as for the “R-side random effect matrix” .

The “linear predictor” for this model is

while is the “inverse link function” for this model.

#### Solving the Logistic Mixed Model¶

We do this by iterating back and forth between creating a linear pseudo-mixed-model and finding a solution for that pseudo-model.

To determine how to create the pseudo-model, we write a first-order Taylor series of about some given tentative values of and , namely and :

where

is a diagonal matrix of derivatives of the conditional mean evaluated at the tentative values of and . Rearranging terms yields

The left side is the expected value, conditional on , of

The variance of given is

where

We now have the form of the pseudo-model as

which is a linear mixed model with pseudo-response , fixed effects , random effects , and .

The coefficients and the random effect predictions may now be used as the new tentative values and .

Note

Alternatively, it may be argued that it is better to use the expected values of , which are all zero, for every iteration’s tentative random effect values , and to wait until the iteration process is finished before using the predicted (BLUP) values of from the pseudo-model.

#### Solving the Pseudo-Model¶

We would like to use the EMMA algorithm
(*Finding the Variance Components*) to solve this pseudo-model, but
there is only one problem. EMMA assumes that the variance of the
errors is a multiple of the identity matrix (see
*The Mixed Model Equation*), which is not what we have here.

Thus, to use EMMA, we must transform our pseudo-model to another model. To do this, we want to find a matrix such that , so that we can write

and use EMMA to solve the mixed model

where the variance of is proportional to . We note that

Setting this equal to the identity matrix gives us

and

This is solved by letting

(Note that , , , and thus are all diagonal.)

Now we may use EMMA to solve

(Note that in this transformed pseudo-model, either may be thought of as the “” matrix or “random-effect design matrix” for , or may be thought of as its own random-effect variable.)

#### Summary of the Algorithm¶

First, pick starting values of and , such as all zeros. Then repeat the following steps until the changes in and are sufficiently small:

Find and (the tentative values of and ) from the original linear predictor equation and the definition for :

Find the (diagonal) matrix

the diagonal-element values of which specifically come out to

Find the pseudo-model :

or

Find the (diagonal) matrix :

the diagonal-element values of which specifically come out to

Find new values of and using EMMA for:

Note

Using the alternative method (see the note in
*Solving the Logistic Mixed Model* above), this algorithm simplifies
somewhat to:

First, pick starting values of , such as all zeros. (Always use zeros in place of .) Then repeat the following steps until the changes in are sufficiently small:

Find and (the tentative values of and ) from the original linear predictor equation, the definition for , and the fact that we are always using zeros for :

Find the (diagonal) matrix

the diagonal-element values of which specifically come out to

Find the pseudo-model :

or

Find the (diagonal) matrix :

the diagonal-element values of which specifically come out to

Find new values of using EMMA for:

Because this is the alternative method, the values of need not be predicted here.

Finally, when iteration has finished, predict the from the final values of , , and , then compute the final values of and from this prediction.

Note

The alternative method is used in Golden Helix SVS.

#### Optimizations¶

Since finding the logistic reduced mixed model for any is somewhat compute-intensive, the results of finding the reduced mixed model for the un-permuted and every permutation of are pre-computed and stored in advance of scanning any regions for the KBAC test.

Note

To facilitate this, the Mixed-Model KBAC feature of Golden Helix SVS will never drop samples. Instead, it will always handle samples with missing genotypes in any region by imputing the wild type (homozygous reference genotype) for these samples in that region.

When there are no covariates, the correct value of is computed directly from

where is the average value of the . This value is then used as the “initial value” of , after which the mixed-model will converge immediately.

Note

This optimization only works because we are using the alternative method whereby the tentative random effects are taken to be zero for the purposes of the iteration process.

When there are covariates, the final values of and of the eigen-decompositions for and which were computed for the unpermuted are used to initialize the mixed-model calculations for every permuted .

During logistic mixed-model computations, the EMMA matrix eigen-decompositions are only re-computed for every third iteration unless convergence seems imminent.

The following optimization process for the permuted , for use with covariates, can speed up computation by almost an order of magnitude, because it completely avoids performing any eigen-decompositions whatsoever. However, this optimization process gives only approximate final results–therefore, this process is optional and may be selected or not selected.

The steps for this process are as follows:

Initialize using the final values of that were computed for the unpermuted .

Using the final eigen-decompositions of and that were computed for the unpermuted (and EMMA logic), attempt to compute valid values for the variance components (namely, the random effect variance and the error variance ) within three iterations. However, accept whatever the results are, whether or not there was apparent convergence.

Using these variance components ( and ) as derived from the EMMA logic, and initializing with the latest values of as derived from the EMMA logic, iterate using the Henderson formula (as modified to accommodate singular ) (see note below) to find successive values until convergence is achieved.

Note

The Henderson formula (as modified to accommodate singular ), for any mixed-model equation

with

is:

When this formula is used, is substituted for the “” in the Henderson formula, the identity matrix is substituted for the “” in the Henderson formula, is substituted for the “” in the Henderson formula, and the pseudo-model is substituted for the “” in the Henderson formula.

for the Henderson formula is composed of diagonal-element values

and off-diagonal-element values of zero.

Notice that no transformation analogous to finding a “” for the EMMA technique needs to be done to accommodate the Henderson formula.

Finally, use the (BLUP) prediction of given by the Henderson formula.

### Mixed-Model KBAC¶

This feature, which allows both additional covariates and a
pre-computed kinship matrix to be specified, will perform the KBAC
method using a regression score function as does
*KBAC with Regression*. The case/control variable is tested against
the weighted multi-site genotypes other than the wild-type genotype
and against the expected values of the case/control variable. The
expected values of the case/control variable are corrected by any
additional covariates and by the relationships specified in the
pre-computed kinship matrix.

#### Data Requirements for Mixed-Model KBAC¶

The data requirements for Mixed-Model KBAC analysis are the same as
they are for KBAC with Permutation Testing
(*Data Requirements for KBAC with Permutation Testing*) with the following additions:

- The spreadsheet must contain the covariates, if any, that you wish to control for. Each covariate may be either quantitative or binary.
- You must supply a pre-computed kinship matrix from a separate
spreadsheet. The requirements for this kinship spreadsheet are
outlined in
*Precomputed Kinship Matrix Option*.

#### Usage and Parameters¶

To use Mixed-Model KBAC, select **DNA-Seq** > **Collapsing
Methods** > **Mixed-Model KBAC** from the spreadsheet menu, and
use the resulting window (see *Mixed-Model KBAC Method – Parameters*) to select the
following parameters:

**Parameters for KBAC (Liu & Leal, 2010) Mixed-Model Logistic Regression**Enter the kernel type and the permutation parameters.**Kernel Type**The possible choices are:**Hyper-geometric (Normally recommended)**Normally recommended and gives the most accurate test.**Marginal binomial**Less compute-intensive and almost as accurate as the hyper-geometric kernel.**Asymptotic normal (For large sample sizes only) (> 400 cases/400 controls)**You should have more than 400 cases and 400 controls for this kernel type. Not as accurate as the other two kernel methods.

**Permutations**The permutation parameters are:**Number of permutations to use:**The KBAC test is dependent upon permutation testing. Please enter the number of permutations here.- To use adaptive permutation testing, check
**Adaptive permutation testing (threshold alpha)**and fill in the threshold alpha value. Please see*Adaptive Permutation Testing*below for more about this technique and the possible advantages of using it.

**Outputs**Choose one or both of the following:**One-sided statistics (recommended)**This tests the one-sided alternative hypothesis of the enrichment of causal variants for higher corrected case/control statuses .**Two-sided statistics**This tests the two-sided alternative hypothesis of the dependency of causal variants upon the corrected case/control statuses .

Note

Due to the often asymmetric distribution of the one-sided KBAC test-statistic, two-sided p-values will almost never be twice the one-sided p-values, and will in fact be exactly the same as the one-sided p-values if the statistic’s distribution, given the numbers of cases and controls and numbers of genotypes, is entirely positive.

Further, the two-sided and one-sided p-values will usually be exactly the same for most practical cases where the p-value is less than .05.

**Regions**Specify an annotation gene track that will be used to distinguish regions for testing. If possible it is recommended that a local filtered gene track be used to determine the gene list. A local track will allow the spreadsheet to be processed faster than if a network track was used. An example filtered gene track name is:`FilteredRefSeqGenes-UCSC_NCBI_36_Homo_sapiens.idf:1`*The Data Source Library*and*Downloading Data*for information on how to obtain annotation tracks.**Perform one test per...**- Select
**gene**(recommended) to use the minimum start position and the maximum stop position of all transcripts to define a gene, for which just one test will be performed. - Select
**transcript**to perform one test for every transcript. This option will result in a more severe multiple testing penalty.

- Select
**Include nearby markers (distance in bp)**Checking this and entering a distance in base pairs will let you create a “window” around a gene or transcript to include further variants, which you may want to do if you suspect these may affect gene regulation or splicing.

**Pre-Computed Kinship Matrix:**See*Precomputed Kinship Matrix Option*for more information about pre-computed kinship matrices. Click on**Select Sheet**and select the kinship spreadsheet from the window that is presented.Check

**Correct for Additional Covariates**to utilize the (mixed-model) regression score technique to correct for additional covariates.- Click on
**Add Columns**to start selecting which variables (spreadsheet columns) to use for correction.

- Click on
**Algorithm**Specify the algorithm to use for pre-computing the reduced mixed models for the phenotype permutations (permutations of the values).Note

The original reduced mixed model that uses the un-permuted phenotypes is always computed exactly.

- Select
**Exact**to use the exact algorithm and avoid using the last optimization documented in*Optimizations*. - Select
**Approximate (faster when correcting for covariates)**to use the last optimization documented in*Optimizations*. If you are correcting for covariates, this option will work much faster than the**Exact**option, but the results will only be approximate.

- Select

Note

If there is any missing genotype data in a region for a given sample, the wild type (homozygous reference genotype) will be imputed to have been present and be substituted for the missing data.

#### Output¶

A marker-mapped spreadsheet will be generated as output containing one row for each region and containing some or all of the following columns, according to what you have requested:

**Chr**The chromosome number or label.**Start**The starting genetic position of the region.**Stop**The ending genetic position of the region.**Gene Name**The gene name.**Transcript Name(s)**Transcript name(s) for each gene region.**Strand**Strand for the gene and transcripts.**P-Value (One-Sided)**The p-value resulting from the one-sided KBAC test.**-log10 P-Value (One-Sided)**Minus the base-10 log of the above p-value.**KBAC Score (One-Sided)**The one-sided KBAC score statistic itself.**Bonf. P (One-Sided)**The Bonferroni-adjusted p-value from the one-sided test, based on the number of successful one-sided tests.**FDR (One-Sided)**The false discovery rate for all tests having a one-sided p-value less than or equal to the current test’s one-sided p-value.**P-Value (Two-Sided)**The p-value resulting from the two-sided KBAC test.**-log10 P-Value (Two-Sided)**Minus the base-10 log of the above p-value.**KBAC Score (Two-Sided)**The two-sided KBAC score statistic itself.**Bonf. P (Two-Sided)**The Bonferroni-adjusted p-value from the two-sided test, based on the number of successful two-sided tests.**FDR (Two-Sided)**The false discovery rate for all tests having a two-sided p-value less than or equal to the current test’s two-sided p-value.**Actual Permutations Used**How many permutations were actually used for this test by the adaptive permutation testing algorithm.**Sample Size Used**The actual sample size used for this test after samples with missing values in the dependent variable and, if requested, the genotypes, have been removed.**# Markers**The total number of markers tested in this region.**# Multi-Marker Genotypes**The total number of distinct multi-marker genotypes discovered in this region.

## Adaptive Permutation Testing¶

### Overview¶

Suppose a large series of tests is being performed by permutation testing, such as a GWAS or a collapsing method over the entire genome. In order to overcome multiple testing issues, it is normally necessary to specify a large number of permutations. However, this can take large amounts of computation time, while specifying many fewer permutations could have taken much less computation time.

Adaptive permutation testing seeks to break this conundrum by “cutting short” (limiting the number of permutations of) any individual test that is “not going well”, meaning that based on tentative results, it is very unlikely that the p-value could be below (better than) a certain specified value.

Additionally, adaptive permutation testing seeks to assign a tentative p-value to any test based on the permutations thus far performed.

The hope is that if the series of tests has a “reasonable” p-value distribution (as would be determined by, for instance, a Q-Q plot), “most” of the tests will only need a “small” number of permutations, while the remaining tests that are much more “promising” will get a much larger number of permutations. If this hope is realized, computation time can be cut down to only a fraction of what it otherwise would be without losing significance where it is really needed.

Adaptive permutation testing is available in Golden Helix SVS for both
versions of the CMC method (*Combined Multivariate and Collapsing (CMC) Method*) and both versions of
the KBAC test (*Kernel-Based Adaptive Cluster (KBAC) Test*).

### Theory¶

Suppose that at any given time during the process of a permutation test, permutations have been completed with “hits”, or permuted tests the outcomes of which are as significant as or more significant than the original unpermuted test. Also, suppose we specify that a p-value must be less than in order to be “interesting”, or sufficiently significant to justify continuing the permutation test (unless the maximum number of permutations specified has been reached).

When testing, we can always assign a tentative p-value based on the permutations we have done so far, and determine if it is greater than .

The question is to determine whether and when, if indeed , we can terminate the test confident that the real p-value lies above the value , and that we can use as a reasonable final p-value result.

To accomplish this, we first estimate the standard deviation of where the p-value might be, based on the data we have. Then, we require that we have done enough tests to cause

or

That is, if we can assure ourselves that our tentative p-value outcome is greater than and more than six standard deviations away from the value, then it will be OK to stop early.

#### Variance of the Mean¶

Suppose is the random variable which has the value for each “hit”, or permuted test which is as significant as or more significant than the original unpermuted test, and for each permuted test which is less significant than the original unpermuted test.

We take the variance of corresponding to each of the permutations done so far to be

Meanwhile, the variance of the mean of M occurrences of a random variable is the individual mean divided by M. Thus, in our case, the variance of the mean after M permutations may be defined as

or

Our requirement then may be re-stated as doing enough tests to cause

#### A Further Requirement¶

We note that in the case of getting entirely or almost entirely “hits” after very few permutations, the above expression for the standard deviation will be very small, even though the values of are completely opposite of what would give us confidence in the p-value being very small. Thus, as a final guard against quitting early, we would like to get rid of the influence of the factor, which becomes very small when the fraction of “hits” is high. Noting that

we impose doing enough tests to cause

which is a more stringent requirement. This not only avoids estimating a small standard deviation of the mean when many tests are “hits” near the beginning of testing, but also, for very small p-values and many iterations, this requirement is only slightly more stringent than the original requirement. This is because, under those circumstances, in the original requirement becomes very close to one.

It may be noted that this more stringent requirement guarantees that at least 36 iterations take place for every permutation test, because taking the maximum possible value of , one, and the minimum possible value of , zero, we have

or

or

### Summary of Usage¶

If you request **Adaptive permutation testing**, enter a **threshold
alpha.**

If, while a test is performed, it has been determined that the actual
p-value must be higher than the **threshold alpha** by more than six
estimated standard deviations of the mean estimated p-value, the test
will be terminated early, and the p-value which is output will be
based on the permutations done thus far. An additional output of the
number of permutations done thus far will appear under the heading
**Actual Permutations Used**.

In the case that all of the permutations originally requested are
used, the **Actual Permutations Used** will be output and will reflect
the number of permutations originally requested.

## Optimized Sequence Kernel Association Test (SKAT-O)¶

### Overview¶

If one rare variant (marker) were to be tested by itself for association with a phenotype, the test would not have anywhere near as much power as if the one variant had been a common variant. For this reason, rare variants are always tested in groups of several markers at a time, where any group of markers is, for instance, in one region or one gene.

Several approaches to testing rare variants are:

Burden testing. The straightforward form of burden testing is to sum the minor allele counts for each marker in the group, possibly weighted by a function of each marker’s MAF, and test this against the phenotype. This test works well in detecting when each of multiple markers varying in the same direction has an effect on the phenotype in the same direction–in other words, when the marker variations are well-correlated.

Sequence Kernel Association Test (SKAT). Squares of score statistics for testing individual markers are summed. The sum is possibly weighted by, for instance, a function of each marker’s minor allele frequency (MAF). This test works well in detecting when multiple markers vary in different directions but have an effect on the phenotype in the same direction.

Alternatively, the SKAT test may be thought of as testing on the random-effect variance component for a mixed model in which the effects from the individual markers are treated as uncorrelated random effects–the test is whether that variance component is zero. (In the corresponding mixed model, the individual random effect variances are proportional to the weights that are used.)

Generalized Sequence Kernel Association Test (Generalized SKAT). This test statistic (but not necessarily its p-value) is a linear combination of the SKAT test and the straightforward burden test explained above, with the proportions of each pre-specified.

One might ask whether there is a best proportion of SKAT and a best proportion of Burden that one could use for given data. That would be difficult to determine in advance. To find out, you could perform many generalized SKAT tests over the same region using different proportions of SKAT vs. Burden–however, should you use the best p-value from these tests? How do you avoid having to apply a Bonferroni or FDR correction to that p-value according to the number of different generalized SKAT tests you used?

The SKAT-O (Optimized SKAT) algorithm, in answer to the above, combines a set of generalized SKAT tests using different proportions of SKAT vs. Burden in a fashion that will obtain one statistic and one p-value.

Every one of these tests may be corrected for covariates.

Note

The SKAT-O test may also be used to analyze common
variants. Simply specify a uniform weighting
(*Usage and Parameters*) when performing the analysis.

### Availability¶

The generalized SKAT test (which may also be used to produce a SKAT test or a burden test) and the SKAT-O test are available in Golden Helix SVS for three different ranges of sample sizes:

- Very-small-sample corrected [Lee2012]. This is designed for sample sizes under 500 or 700, and involves a permutation-testing algorithm that is used exclusively to determine the kurtosis of the distribution under the null model.
- Moderately-small-sample corrected [Lee2012]. This analytical algorithm is designed for sample sizes over 500 up to 2000.
- Liu algorithm [LeeWuLin2012] and [Liu2009]. This algorithm is designed for samples sizes over 2000.

For usage of this feature, see *Data Requirements for SKAT-O* and
*Usage and Parameters*.

### Mathematical Overview¶

Note

The following are variables which have different symbols between the two papers this method is based on. For each variable, the symbol used in this document is also indicated.

This Document[Lee2012][LeeWuLin2012]Number of variantsDisease probabilityThe SKAT-O Test StatisticMain factor of first term of Null Hypothesis QMain factor of second term of Null Hypothesis QMultiplier for second term of Null Hypothesis Q

#### The Burden and SKAT Test Statistics¶

Assume subjects are sequenced in a region that has variants. For the subject, let denote a dichotomous phenotype, the genotypes of the variants (), and the intercept and covariates. Also, let and denote the matrices that contain and , respectively, as their rows. Consider the logistic regression model

where is the disease probability, is an vector of regression coefficients of the intercept and covariates, and is an vector of regression coefficients of genetic variants. If is a weight function that may depend on the properties of the variant, the burden test rephrases the model as

where the test is for . The burden score statistic (used for the SKAT-O algorithm) is

where is computed from fitting the null model

Meanwhile, the SKAT test assumes that the are independent and follow an arbitrary distribution with mean 0 and variance . Here, the test is . The SKAT statistic may be written as

where and is an “kernel matrix” with as an diagonal weight matrix. The SKAT statistic may also be written as

where is the score statistic for testing in the single-SNP model with only the SNP:

Meanwhile, the original model, which may be written as

may be thought of as a mixed-model equation, where the matrix is equivalent to the incidence or matrix in the usual mixed-model equation, and the are random effects with a variance matrix of .

#### The Generalized SKAT Test Statistic¶

This test unifies the burden and SKAT tests as a linear combination of the two. If and are the results of the burden and SKAT tests, respectively, the generalized SKAT test for a given , where , is

Note

Note that the p-value of this test might not be a linear combination of the corresponding burden and SKAT p-values at all.

This generalized SKAT test for turns out to be equivalent to

where is the “kernel matrix” with (a matrix with 1’s on the diagonal and values of everywhere else). The parameter may be thought of as the presumed correlation of different when these are thought of as random effects in the corresponding mixed model.

### Finding the P-Value from an Individual Statistic¶

Note

The following method and its variations are partly based on the following facts about the distribution, where is the “degrees of freedom” parameter:

- The mean of the distribution is .
- The variance of the distribution is .
- The skewness of is .
- The (“excess”) kurtosis of is .

The p-value for an individual may be approximated as

where is the chi-squared distribution function for degrees of freedom, and the null-hypothesis mean, variance, and degrees of freedom , , and (determined as documented below) adjust the mean and variance of to agree with those for the distribution.

This formula, along with its inverse, is also used later on when intermediate p-values must be taken to estimate the overall SKAT-O p-value.

The above parameters are determined according to one of the following procedures:

For moderately small sample sizes, theoretically from the eigen-decomposition of , where , or from , where is defined the same way and

is defined so that

is the restricted maximum likelihood (REML) estimator of the variance component.

Note

Golden Helix SVS uses .

Specifically, let be the ordered non-zero eigenvalues of , be the eigenvector matrix of , and be the element of . If we define with the following expression,

we then can compute

and

where , and is an estimated with .

In the case of very small sample sizes, and are determined as above for moderately small sample sizes, but is determined by performing permutation testing to estimate the kurtosis.

If , with , is the SKAT test statistic from the permutation sample , the sample kurtosis is

where

and

The is then estimated as .

If there are covariates, parametric bootstrapping is used, rather than permutation testing. Each bootstrap iteration is sampled

*with replacement*, with the variation that measures are taken to get the same number of cases and the same number of controls that the unpermuted had.For large sample sizes, the eigen-decomposition of (as defined above) is also used. The difference is that the somewhat more straightforward formulas

and

apply.

Note

This formula is a modification of a technique mentioned in [Liu2009]. In that article,

is used (for the case that applies here of a “central quadratic form”).

### The Optimal (SKAT-O) Test¶

This test statistic can be written as

In practice, it is computed (or thought of) as

where a grid of values has been set up where

This is not a p-value itself, but only a statistic. At this point, we must take the p-value of this statistic. This will be somewhat involved.

### Alternative View of the Statistic¶

Define (or , consistent with the definition used above for ) and , where . Also, let and

where is the column of .

It can be shown that under the null hypothesis, is asymptotically the same as

where

are the nonzero eigenvalues of , () and are independent and identically distributed random variables, and satisfies

and

This is a mixture of two independent random variables, and .

### The P-Value Algorithm for the Optimal Test¶

The p-value for the optimal statistic is

where “denotes the percentile of the distribution of ” for each .

To explain better, think of it as “going in reverse” from a p-value to a - like statistic, except that while parameters similar to the parameters for finding the individual (null-hypothesis distribution for) are used, the best p-value obtained over all is used rather than the p-value originally obtained for the individual .

The expression being minimized is based on substituting for the value of in the expression

and re-arranging terms as

The final expression for approximating the p-value of the optimal statistic is

where

is the chi-squared
distribution function for degrees of freedom, and
is a density function of
, and the , , and
used here are determined as in *Finding the P-Value from an Individual Statistic* as applied
to .

### The P-Value for Very Small Sample Sizes¶

For very small and moderately small sample sizes, we write

and

asymptotically follows the same distribution as the mentioned in the development above, and asymptotically follows the distribution as does in the development above. However, the null distribution of is small-sample adjusted using its computed small-sample variance and kurtosis, and is then used to modify the usage of in the final p-value formula.

#### Data Requirements for SKAT-O¶

SKAT-O analysis is to be performed from a mapped spreadsheet of sequence or other genotypic data. This spreadsheet must also contain the covariates, if any, that you wish to control for. Each covariate may be either quantitative or binary.

SKAT-O analysis may be used with either common variants or rare
variants. You may select a marker weighting to either treat all
variants equally (to analyze common variants) or, on the other hand,
to emphasize rare variants with a greater or lesser amount of contrast
between them and common variants. (*Usage and Parameters*)

If you are analyzing rare variants, you may also follow the procedures
outlined in *Data Requirements for KBAC with Permutation Testing*.

#### Usage and Parameters¶

To use SKAT-O analysis, select **DNA-Seq** > **Collapsing Methods** >
**SKAT-O** from the spreadsheet menu, and use the resulting window
(see *SKAT-O Method – Parameters*) to select the following parameters:

**Parameters for SKAT-O (Lee et. al., Am J Hum Genetics 2012) Optimized Sequence Kernel Association Testing**Enter the marker weighting. The possible choices are:

**Uniform**All variants (both common and rare) are weighted equally.**Madsen and Browning (one over square root of MAF(1 - MAF))**Rare variants are weighted more than common variants.**pdf(MAF; 1, 25) of the Beta distribution (normally recommended for rare variant studies)**This is the default. Weights rare variants more and common variants less, as does**Madsen and Browning**, but with a much greater contrast between rare and common variants than that produced by**Madsen and Browning**.

Note

It turns out that all of the above marker weightings may be represented through the Beta distribution. The probability density function of this distribution is

where represents the Beta function. These weightings may be represented as:

**pdf(MAF; 1, 1)**Uniform.**pdf(MAF; 0.5, 0.5)**Madsen and Browning ().**pdf(MAF; 1, 25)**The default weighting.

**Tests and Outputs**Enter the desired test(s).**Generalized SKAT with rho =**For this test, also enter the value.Note

A value of zero will yield the original SKAT test. A value of 1 will yield a burden test.

**SKAT-O with standard grid of rho values**This is checked by default.

**Regions**Specify an annotation gene track that will be used to distinguish regions for testing. If possible it is recommended that a local filtered gene track be used to determine the gene list. A local track will allow the spreadsheet to be processed faster than if a network track was used. An example filtered gene track name is:`FilteredRefSeqGenes-UCSC_NCBI_36_Homo_sapiens.idf:1`*The Data Source Library*and*Downloading Data*for information on how to obtain annotation tracks.**Perform one test per...**- Select
**gene**(recommended) to use the minimum start position and the maximum stop position of all transcripts to define a gene, for which just one test will be performed. - Select
**transcript**to perform one test for every transcript. This option will result in a more severe multiple testing penalty.

- Select
**Include nearby markers (distance in bp)**Checking this and entering a distance in base pairs will let you create a “window” around a gene or transcript to include further variants, which you may want to do if you suspect these may affect gene regulation or splicing.

Check

**Correct for Additional Covariates**to correct for additional covariates.- Click on
**Add Columns**to start selecting which variables (spreadsheet columns) to use for correction.

- Click on
**Algorithm for Estimating Distributions**Specify the algorithm to use according to your sample size.- Select
**Very-small-sample corrected**for up to about 700 samples. - Select
**Somewhat-small-sample corrected**for 700 up to 2000 samples. - Select
**Liu algorithm**for greater than 2000 samples.

- Select

Note

If there is any missing genotype data in a region for a given sample, the wild type (homozygous reference genotype) will be imputed to have been present and be substituted for the missing data.

#### Output¶

A marker-mapped spreadsheet will be generated as output containing one row for each region and containing some or all of the following columns, according to what you have requested:

**Chr**The chromosome number or label.**Start**The starting genetic position of the region.**Stop**The ending genetic position of the region.**Gene Name**The gene name.**Transcript Name(s)**Transcript name(s) for each gene region.**Strand**Strand for the gene and transcripts.**SKAT-O P-value**The p-value resulting from the SKAT-O test.**-log10 SKAT-O P-Value**Minus the base-10 log of the above p-value.**SKAT-O T statistic**The T statistic from the SKAT-O computation.**SKAT-O Bonf. P**The Bonferroni-adjusted p-value from the SKAT-O test, based on the number of successful SKAT-O tests.**SKAT-O FDR**The false discovery rate for all SKAT-O tests having a p-value less than or equal to the current SKAT-O test’s one-sided p-value.**SKAT P-value for rho = ##**The p-value resulting from the generalized SKAT test for the given .**-log10 SKAT P-Value**Minus the base-10 log of the above p-value.**SKAT Q for rho = ##**The Q statistic itself for the generalized SKAT test for the given .**SKAT Bonf. P**The Bonferroni-adjusted p-value from the generalized SKAT test, based on the number of successful generalized SKAT tests (for the given ). However, if SKAT-O testing is also being done, the adjustment is instead based on the number of regions for which there were both a successful generalized SKAT test and a successful SKAT-O test.**SKAT FDR**The false discovery rate for all generalized SKAT tests having a p-value less than or equal to the current generalized SKAT test’s p-value.**# Markers Found in Region**The total number of markers found in this region.**# Markers Used**The total number of markers used in this region’s test. Monomorphic markers, all-missing markers, and multi-allelic markers are excluded.