PBAT Family-Based Analysis

This software component is based on PBAT under a license from Harvard University.

PBAT Family-Based Analysis Overview

Golden Helix SVS provides tools for the design and analysis of family-based association studies through the capabilities of the PBAT software package developed by Dr. Christoph Lange of Harvard University. (See: http://www.biostat.harvard.edu/~clange/default.htm)

PBAT incorporates virtually all of the features of the family-based tests of association (FBAT) package (also released by Harvard http://www.biostat.harvard.edu/~fbat/default.html) but also provides many additional options for designing association/linkage studies and analyzing data with multiple continuous traits.

The cornerstone of PBAT is the unified approach to the FBAT statistic, which itself is a generalization of the Transmission Disequilibrium Test (TDT) method, in which alleles transmitted to affected offspring are compared with the expected distribution of alleles among offspring. The FBAT statistic is based on a linear combination of offspring genotypes and traits:

\text{FBAT} = \frac{S-E[S]}{\sqrt{V}},

where

S = \sum_{ij}{T_{ij} X_{ij}},

V = \text{Var}(S) and T_{ij} represents the coded phenotype (i.e. the phenotype adjusted for any covariates) of the j^{th} offspring in family i. X_{ij} denotes the offspring’s coded genotype at the locus being tested, and depends on the genetic model under consideration.
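
As a minimal sketch of the statistic above (not PBAT's implementation), the following assumes parent-offspring trios with both parents genotyped, one offspring per family, independent families and additive coding, so that X counts copies of allele A. The function names are hypothetical. Under Mendelian segregation each parent transmits allele A with probability equal to half its own allele count, which gives E[X] and Var(X) for each trio.

```python
import math

def transmission_moments(father, mother):
    """Mean and variance of the offspring's A-allele count under Mendelian
    segregation, given each parent's A-allele count (0, 1 or 2)."""
    transmit_probs = (father / 2.0, mother / 2.0)
    mean = sum(transmit_probs)
    variance = sum(p * (1.0 - p) for p in transmit_probs)
    return mean, variance

def fbat_statistic(trios):
    """trios: iterable of (T, X, father, mother), where T is the coded
    phenotype and X the offspring's A-allele count."""
    s = expected_s = v = 0.0
    for t, x, father, mother in trios:
        mean, variance = transmission_moments(father, mother)
        s += t * x                # S = sum of T * X
        expected_s += t * mean    # E[S] under the null hypothesis
        v += t * t * variance     # families assumed independent
    if v == 0.0:
        raise ValueError("V is zero: no informative families")
    return (s - expected_s) / math.sqrt(v)

# Four affected offspring (T = 1), mostly from heterozygous parents.
trios = [(1, 2, 1, 1), (1, 2, 1, 1), (1, 1, 1, 0), (1, 1, 1, 1)]
z = fbat_statistic(trios)
```

Only heterozygous parents contribute to V in this sketch, which is why families with two homozygous parents are uninformative.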

The expected distribution is derived using Mendel’s law of segregation and, in PBAT, by conditioning on the sufficient statistics for any nuisance parameters under the null hypothesis. The possible null hypotheses are: “no linkage and no association” or “no association, in the presence of linkage”.

PBAT generalizes this test to cover different genetic models, tests of different sampling designs, tests involving different disease phenotypes, tests with missing parents and tests of different null hypotheses, all in the same framework.

The key concept of PBAT’s screening techniques is the conditional mean model approach, for which the data space is considered to be partitioned into two independent testing sets. This approach may be described as follows:

  1. Find which combinations of phenotypes (as a group) and markers have the highest power when tested not against the actual genotypes, but against those predicted from the parents’ genotypes or, if those are missing, from the sufficient statistic of the marker distribution.
  2. Perform the FBAT test for the selected combinations of phenotypes and markers on the actual genotypes of the patients, both as a group and individually.

This allows one to control the type I error rates and to overcome one of the most important statistical hurdles when analyzing genome-wide association studies with thousands of markers: the multiple comparison problem. The screening methods are only minimally affected by non-causal SNPs. In addition, they are robust against the effects of population stratification and admixture, since the final decision is based on FBAT statistics, which guard against these confounding factors. Finally, the screening tools are successful in detecting common disease susceptibility loci.

PBAT also supports the advance planning of family-based association studies by providing power estimates for virtually any given study design or ascertainment conditions. Ascertainment conditions, in this case, are the phenotype(s) considered important by the lab technician, which therefore determine for which patients data are obtained.

A new feature of PBAT is the support of testing for copy-number variation (CNV) in a family-based setting. All of the test approaches that are used on genotypic data (coded genotypes) may also be performed on copy number intensity data.

Using PBAT Capabilities Through SVS

Golden Helix SVS provides a graphical user interface (GUI) for the PBAT program, as well as the ability to import from and export to FBAT/PBAT-format and CSV-format pedigree and phenotype files. PBAT genotypic and CNV data analysis capabilities are specifically supported by SVS using pedigree and phenotype spreadsheets within an SVS project. The PBAT capability is now current to Version 3.6.0 (2010-01-19).

As mentioned above, PBAT tools support two scenarios: Pre-study and Post-study. In the Pre-study scenario, you can use the Golden Helix SVS PBAT capabilities for power calculations (see Pre-Study Power Calculation) to plan association studies (both family- and population-based) for virtually any given study design and ascertainment conditions.

The power of the statistic on a given study plan can be assessed to decide whether it has sufficient power. Alternatively, SVS allows you to easily (and repeatedly) change the design, the ascertainment conditions, the underlying genetic model or the mode of inheritance to find what parameters will give the study the best possible power.

For the Post-study scenario, the SVS PBAT analysis tools for genotypic and CNV data provide many useful capabilities for the statistical analysis of family-based association studies, including:

  • Simple FBAT or CNV FBAT statistics.
  • Multivariate FBAT statistics (FBAT-GEE and FBAT-PC for genotypic or CNV data).
  • FBAT statistics for time-to-onset data.
  • Power estimations for the actually observed datasets.
  • Options to test linkage or association in the presence of linkage.
  • Options to use (bi-allelic or multi-allelic) marker or haplotype data.
  • Single or multiple traits (either separate traits or measurements recorded repeatedly over time) that may be quantitative, qualitative or time-to-onset.
  • Nuclear families as well as extended pedigrees.

Covariates and gene/covariate interactions can easily be handled in all computed FBAT statistics.

Statistical test results are returned in SVS spreadsheets. These spreadsheets allow you to find the most powerful test statistic and to reduce the large pool of traits and markers down to the most promising combinations in terms of the FBAT statistic.

Note

There is a glossary of terms related to genetic analysis at the end of the manual in Appendix (A Glossary of Terms Used in Genetic Analysis).

Pre-Study Power Calculation

Summary

The PBAT capabilities for power calculations are a software implementation of the approaches to analytical power calculations for FBATs by Dr. Christoph Lange ([Lange2002a], [Lange2002b], [Lange2002c]). The power of family-based association tests (FBATs) and population-based association tests can be assessed for a large variety of study designs:

  • Dichotomous/binary and continuous traits for family designs.
  • Dichotomous/binary and quantitative traits for population designs.
  • Computation of power for a given sample size.
  • Computation of required sample size for a given power and significance level.
  • Missing parental information.
  • Multiple offspring per family.
  • Combinations of different family-types.
  • Different genetic models.
  • Different ascertainment conditions for the first and second proband.
  • Marker and disease locus are not identical.
  • Combination of different family-types and different ascertainment conditions.
  • Verification of all power calculations by Monte-Carlo simulations.

Using Pre-Study Power Calculation

To perform power calculation analysis, select Tools > Pre-Study Power Calculation from the project navigator.

Power Calculation Types

Four types of power calculations are supported:

  • Family-based using a binary trait
  • Family-based using a continuous trait
  • Population-based using case/control status
  • Population-based using a quantitative trait

Parameters for these types of power calculation are organized within four tabs, three of which are used at any given time:

  • Methods
  • Family Design
  • Genetic Model
  • Computational

See the subsection below for the type of power calculation you wish to perform. See A Glossary of Terms Used in Genetic Analysis for definitions of terms. Once you have set the parameters for the type of power calculation you wish to perform, click Run to begin.

A progress dialog will track the progress of the calculations. Pressing Cancel will stop the calculation of power.

When the power calculations have finished, a text viewer will appear. This viewer will be associated with a new Navigator Node.

Methods Tab (all designs)

The Methods tab (see Methods Tab for Power Calculations) contains options to set the type of calculation to be performed, the significance level and the type of computation method to calculate the power.

Methods Tab for Power Calculations

Type of Computation

Select the calculation type that matches the study design under consideration. Other fields and/or tabs will be made accessible depending on the selection.

Statistical Parameters

The Significance Level is the probability of wrongly rejecting the null hypothesis when it is in fact true (the probability of a type I error). A smaller Significance Level reduces the chance of a false positive, but it also reduces the power for a given sample size. The default is 0.01.

Computation Method

Family-Based Studies

Three computation methods apply to family-based studies, and may be used with either binary or continuous traits. PBAT will use the selected method to compute the power of the FBAT statistic. These methods are:

  • Numerical Integration: When the power of the FBAT statistic is computed based on numerical integration, the numerical precision is 0.01. This method can take several minutes depending on the complexity of the study design and the computer speed.
  • Approximation: The analytical power of the FBAT statistic will be computed based on a second-order Taylor expansion. The precision is good for sample sizes of at least 100 families, and it is the fastest approach. This method is described in [Knapp1999] and [Lange2002a].
  • Simulation: The power will be estimated based on one million Monte-Carlo simulations and can take up to several minutes.
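
The idea behind estimating power by simulation, and checking it against an analytical value, can be sketched as follows. This is a generic illustration and not PBAT's simulation: it assumes the test statistic is approximately standard normal under the null hypothesis and shifted by a hypothetical non-centrality value `ncp` under the alternative. The default of 100,000 replicates is chosen only to keep the example fast.

```python
import random
from statistics import NormalDist

_STD = NormalDist()

def analytic_power(ncp, alpha):
    """Two-sided power when the statistic is N(ncp, 1) under the alternative."""
    z_crit = _STD.inv_cdf(1.0 - alpha / 2.0)
    return _STD.cdf(ncp - z_crit) + _STD.cdf(-ncp - z_crit)

def simulated_power(ncp, alpha, n_sim=100_000, seed=0):
    """Fraction of simulated statistics that exceed the critical value."""
    rng = random.Random(seed)
    z_crit = _STD.inv_cdf(1.0 - alpha / 2.0)
    hits = sum(1 for _ in range(n_sim) if abs(rng.gauss(ncp, 1.0)) > z_crit)
    return hits / n_sim

exact = analytic_power(2.5, alpha=0.01)
estimate = simulated_power(2.5, alpha=0.01)
```

With `ncp` set to zero the simulated rejection rate should be close to the significance level itself, which is a quick sanity check on any simulation of this kind.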

Population-Based Studies

Two computation methods apply to population-based studies, and may be used with either case/control or quantitative trait studies of unrelated individuals. These methods are:

  • Compute Power For Given Sample Size: Create a table of predicted powers based on sample size information.
  • Compute Required Sample Size For Given Power and Significance Level: Create a table of sample sizes required to achieve a given power under various tests.
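
The relationship between these two modes can be sketched with a generic normal-approximation power function: the required sample size is the smallest n whose power reaches the target. The power model here, a two-sided z-test with a hypothetical standardized per-observation effect `effect`, is an assumption made for illustration and is not the set of tests PBAT reports.

```python
import math
from statistics import NormalDist

_STD = NormalDist()

def power_for_n(effect, n, alpha):
    """Power of a two-sided z-test with non-centrality effect * sqrt(n)."""
    z_crit = _STD.inv_cdf(1.0 - alpha / 2.0)
    ncp = effect * math.sqrt(n)
    return _STD.cdf(ncp - z_crit) + _STD.cdf(-ncp - z_crit)

def required_n(effect, target_power, alpha):
    """Smallest n whose power is at least target_power (power rises with n)."""
    if effect <= 0 or not 0.0 < target_power < 1.0:
        raise ValueError("effect must be positive and 0 < target_power < 1")
    hi = 1
    while power_for_n(effect, hi, alpha) < target_power:
        hi *= 2                      # find an upper bound by doubling
    lo = hi // 2                     # power at lo is below the target
    while hi - lo > 1:               # bisect between lo and hi
        mid = (lo + hi) // 2
        if power_for_n(effect, mid, alpha) >= target_power:
            hi = mid
        else:
            lo = mid
    return hi

n_needed = required_n(effect=0.2, target_power=0.8, alpha=0.01)
```

The same `power_for_n` function evaluated over a list of sample sizes gives the kind of power table described in the first mode.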

Family Design Tab–Binary Traits

The Family Design tab for a binary trait contains the options for specifying the family types that will be included in the calculations. Multiple family types may be specified. These options include:

  • Number of families
  • Number of offspring per family
  • Number of missing parents
  • Whether additional offspring are phenotyped
  • Ascertainment conditions for the probands (these may be set to Unaffected, Affected, or N/A).

Each of these options can be specified for a given family design, and multiple family designs can be included in a set of calculations (e.g. one set of calculations could be run with 100 families with 2 offspring and 1 missing parent, and 50 families with 3 offspring and 0 missing parents, etc.). Usually more families will increase the power of the study while missing parents will decrease the power of the study. See Family Design Tab for Power Calculations for Binary Traits.

Family Design Tab for Power Calculations for Binary Traits

To add a family type, enter the appropriate values for the options contained in the Change Family Design group, and click the Add Design button. The family design will appear in the list of Family Designs Currently Used, and will be included in the calculation of power.

Similarly, to remove a family type from the calculations, highlight the corresponding entry in the list of included family designs by clicking on it, then click the Remove Design button.
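
For readers who think in terms of data structures, the list of family designs can be pictured as follows. This is only an illustrative model of the dialog's contents: the class name, field names and default values are hypothetical, and SVS does not expose this structure. The two entries reproduce the example given above.

```python
from dataclasses import dataclass

ASCERTAINMENT_CHOICES = ("Unaffected", "Affected", "N/A")

@dataclass(frozen=True)
class FamilyDesign:
    n_families: int
    n_offspring: int
    n_missing_parents: int
    additional_offspring_phenotyped: bool = False
    proband_1: str = "Affected"   # assumed default, for illustration
    proband_2: str = "N/A"        # assumed default, for illustration

    def __post_init__(self):
        if self.n_families < 1 or self.n_offspring < 1:
            raise ValueError("need at least one family and one offspring")
        if self.n_missing_parents not in (0, 1, 2):
            raise ValueError("missing parents must be 0, 1 or 2")
        for status in (self.proband_1, self.proband_2):
            if status not in ASCERTAINMENT_CHOICES:
                raise ValueError(f"unknown ascertainment status: {status}")

designs = []
designs.append(FamilyDesign(100, 2, 1))   # "Add Design"
designs.append(FamilyDesign(50, 3, 0))    # "Add Design"
total_families = sum(d.n_families for d in designs)
designs.remove(FamilyDesign(50, 3, 0))    # "Remove Design"
```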

Family Design Tab–Continuous Traits

The Family Design tab for a continuous trait contains the options for specifying the family types that will be included in the calculations. Multiple family types may be specified. These options include:

  • Number of families
  • Number of offspring per family
  • Number of missing parents
  • Whether additional offspring are phenotyped
  • Ascertainment conditions for the probands

Each of these options can be specified for a given family design, and multiple family designs can be included in a set of calculations (e.g. one set of calculations could be run with 100 families with 2 offspring and 1 missing parent, and 50 families with 3 offspring and 0 missing parents, etc.). Usually more families will increase the power of the study while missing parents will decrease the power of the study. See: Family Design Tab for Power Calculations for Continuous Traits.

Any of ten ascertainment conditions may be specified. The numbers in the ascertainment conditions refer to sampling conditions for the phenotypes of the first and second probands. These are specified by the corresponding probabilities of the phenotypic distributions of the traits.

Ascertainment condition 1 is predefined and may not be changed. It is always equivalent to a total population sample.

Ascertainment conditions 2 through 10 may be specified. For example, suppose ascertainment condition 2 is set as follows:

  • Proband 1 Lower: 0.0
  • Proband 1 Upper: 0.25
  • Proband 2 Lower: 0.85
  • Proband 2 Upper: 1.0

For this condition, the trait of the first proband must be in the lower 25% of the phenotypic distribution, while the trait of the second proband must be in the upper 15% of the phenotypic distribution.
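
The example condition can be expressed as a simple check on where each proband's trait falls in the phenotypic distribution. The sketch below assumes a reference sample of trait values is available to convert a trait value into a quantile. The function names and the tuple layout are hypothetical.

```python
from bisect import bisect_right

def quantile_of(value, sorted_population):
    """Empirical fraction of the reference sample at or below value."""
    return bisect_right(sorted_population, value) / len(sorted_population)

def meets_condition(q_proband_1, q_proband_2, condition):
    """condition: ((lower_1, upper_1), (lower_2, upper_2)) as quantiles."""
    (low_1, up_1), (low_2, up_2) = condition
    return low_1 <= q_proband_1 <= up_1 and low_2 <= q_proband_2 <= up_2

# Ascertainment condition 2 from the example above.
condition_2 = ((0.0, 0.25), (0.85, 1.0))

population = list(range(1, 101))          # hypothetical trait values 1..100
q1 = quantile_of(10, population)          # first proband's trait
q2 = quantile_of(90, population)          # second proband's trait
family_is_sampled = meets_condition(q1, q2, condition_2)
```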

Family Design Tab for Power Calculations for Continuous Traits

To add a family type, enter the appropriate values for the options contained in the Change Family Design group, and click the Add Design button. The family design will appear in the list of Family Designs Currently Used, and will be included in the calculation of power. The currently highlighted ascertainment condition will be associated with the new family design.

Similarly, to remove a family type from the calculations, highlight the corresponding entry in the list of included family designs by clicking on it, then click the Remove Design button.

Genetic Model Tab–Family-design Binary Trait

Under the Genetic Model tab you can specify the genetic model underlying the power calculations. Note that the disease gene is specified by allele “A”.

The basis for defining the genetic model may be specified (within the Specify Basis box) as follows (see Genetic Model Tab for Power Calculations for Binary Traits):

  • MOI, p, K, AF: Mode of inheritance (MOI), allele frequency (p), disease prevalence (K), attributable fraction (AF)
  • Penetrance Values and Allele Frequency
  • MOI, p, K, Odds Ratio
  • MOI, p, K, Allelic Odds Ratio

In addition, two modes exist for power output.

  • Enter zero for the allele frequency increment to allow entering parameters related to the disease gene not being the same as the marker gene. One power value will be output.
  • Enter a non-zero allele frequency increment. The calculations will take place as if the disease gene is the same as the marker gene. Power values will be output for allele frequencies starting with the “allele frequency for the disease gene” and incrementing by the allele frequency increment value.

Selecting the basis and selecting an allele frequency increment of zero vs. non-zero determine which other parameters must be specified in order to fully define the genetic model.
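
The two output modes amount to evaluating power either at a single allele frequency or over a grid of frequencies. A minimal sketch of building such a grid follows. The upper limit `stop` is an assumption made for this example, since the text above does not state where the incrementing ends, and the function name is hypothetical.

```python
import math

def allele_frequency_grid(start, increment, stop=0.5):
    """Frequencies start, start + increment, ... up to stop (assumed limit).
    An increment of zero yields the single starting frequency."""
    if increment < 0:
        raise ValueError("increment must not be negative")
    if increment == 0:
        return [start]
    n_steps = int(math.floor((stop - start) / increment + 1e-9))
    # Round to suppress floating-point noise such as 0.30000000000000004.
    return [round(start + i * increment, 10) for i in range(n_steps + 1)]

single_value = allele_frequency_grid(0.2, 0.0)
sweep = allele_frequency_grid(0.1, 0.1, stop=0.5)
```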

Note

  • The mode of inheritance is the manner in which a particular genetic trait or disorder is passed from one generation to the next. Examples of MOI include autosomal dominant, autosomal recessive, X-linked dominant, X-linked recessive, multifactorial and mitochondrial.
  • Penetrance indicates the likelihood that a given gene will actually result in the disease.
  • Odds ratio compares the odds of a certain event between two groups; an odds ratio of 1 means the odds are the same in both groups.
  • Attributable fraction is the proportion of disease occurrence that can be potentially eliminated if the exposure were prevented.

Genetic Model Tab for Power Calculations for Binary Traits

The parameters for the respective bases are as follows:

  • MOI, p, K, AF: with this basis selected, the following parameters are available for specification:
    • Allele Frequency (Marker Gene): allele frequency of the marker gene. (Enter if entering zero for the allele frequency increment.)
    • Allele Frequency (Disease Gene): allele frequency of the disease gene.
    • Genetic Attributable Fraction: the proportion of the disease occurrence that would potentially be eliminated if the disease gene were not present.
    • Allele Frequency Increment: increment in allele frequency per iteration of the power calculations. Enter zero to allow an offset to be specified.
    • P(Disease allele A|Marker allele A)[0;1]: the conditional probability of observing the disease allele A given the presence of marker allele A. (Enter if entering zero for the allele frequency increment.)
    • Model (mode of inheritance)
      • Additive
      • Multi (Multifactorial)
      • Dominant
      • Recessive
    • Offset: The offset to use for the FBAT statistic. (Enter if entering zero for the allele frequency increment. Otherwise, the offset will be automatically set to the population mean.)
    • Population Prevalence: the percentage of the population estimated to have the particular disease at a specific time.
    • Disease Locus = Marker Locus: indicates whether the disease locus is equal to the marker locus. (Enter if entering zero for the allele frequency increment.)
  • The following parameters will be calculated according to the inputs to the above parameters:
    • Penetrance for AA (genotype)
    • Penetrance for AB (genotype)
    • Penetrance for BB (genotype)
    • Relative Risk RR1 (relative risk for carrying one disease allele)
    • Relative Risk RR2 (relative risk for carrying two disease alleles)
    • Odds Ratio OR1 (odds ratio for carrying one disease allele)
    • Odds Ratio OR2 (odds ratio for carrying two disease alleles)
    • D’ between the marker gene and the disease gene (when zero is entered for the allele frequency increment)
    • Offset (= the population mean) (for a non-zero allele frequency increment)
  • Penetrance Values and Allele Frequency: with this basis selected, the following parameters are available for specification:
    • Penetrance for AA: penetrance fraction for the AA genotype
    • Penetrance for AB: penetrance fraction for the AB genotype
    • Penetrance for BB: penetrance fraction for the BB genotype
    • Allele Frequency (Marker Gene): allele frequency of the marker gene. (Enter if entering zero for the allele frequency increment.)
    • Allele Frequency (Disease Gene): allele frequency of the disease gene.
    • Allele Frequency Increment: increment in allele frequency per iteration of the power calculations. Enter zero to allow an offset to be specified.
    • P(Disease allele A|Marker allele A)[0;1]: the conditional probability of observing the disease allele A given the presence of marker allele A. (Enter if entering zero for the allele frequency increment.)
    • Model (mode of inheritance)
      • Additive
      • Multi (Multifactorial)
      • Dominant
      • Recessive
    • Offset: The offset to use for the FBAT statistic. (Enter if entering zero for the allele frequency increment.)
    • Disease Locus = Marker Locus: indicates whether the disease locus is equal to the marker locus. (Enter if entering zero for the allele frequency increment.)
  • The following parameters will be calculated according to the inputs to the above parameters:
    • Population prevalence of the disease
    • Genetic attributable fraction of the gene
    • Relative Risk RR1 (relative risk for carrying one disease allele)
    • Relative Risk RR2 (relative risk for carrying two disease alleles)
    • Odds Ratio OR1 (odds ratio for carrying one disease allele)
    • Odds Ratio OR2 (odds ratio for carrying two disease alleles)
    • D’ between the marker gene and the disease gene (when zero is entered for the allele frequency increment)
    • Offset (= the population mean) (for a non-zero allele frequency increment)
  • MOI, p, K, Odds Ratio: with this basis selected, the following parameters are available for specification:
    • Allele Frequency (Marker Gene): allele frequency of the marker gene. (Enter if entering zero for the allele frequency increment.)
    • Allele Frequency (Disease Gene): allele frequency of the disease gene.
    • Allele Frequency Increment: increment in allele frequency per iteration of the power calculations. Enter zero to allow an offset to be specified.
    • P(Disease allele A|Marker allele A)[0;1]: the conditional probability of observing the disease allele A given the presence of marker allele A. (Enter if entering zero for the allele frequency increment.)
    • Odds Ratio: Odds ratio for carrying one disease allele. (OR1)
    • Model (mode of inheritance)
      • Additive
      • Multi (Multifactorial)
      • Dominant
      • Recessive
    • Offset: The offset to use for the FBAT statistic. (Enter if entering zero for the allele frequency increment.)
    • Population Prevalence: the percentage of the population estimated to have the particular disease at a specific time.
    • Disease Locus = Marker Locus: indicates whether the disease locus is equal to the marker locus. (Enter if entering zero for the allele frequency increment.)
  • The following parameters will be calculated according to the inputs to the above parameters:
    • Genetic attributable fraction of the gene
    • Penetrance for AA (genotype)
    • Penetrance for AB (genotype)
    • Penetrance for BB (genotype)
    • Relative Risk RR1 (relative risk for carrying one disease allele)
    • Relative Risk RR2 (relative risk for carrying two disease alleles)
    • Odds Ratio OR2 (odds ratio for carrying two disease alleles)
    • Allelic Odds Ratio (odds ratio for an allele being a disease allele)
    • D’ between the marker gene and the disease gene (when zero is entered for the allele frequency increment)
    • Offset (= the population mean) (for a non-zero allele frequency increment)
  • MOI, p, K, Allelic Odds Ratio: with this basis selected, the following parameters are available for specification:
    • Allele Frequency (Marker Gene): allele frequency of the marker gene. (Enter if entering zero for the allele frequency increment.)
    • Allele Frequency (Disease Gene): allele frequency of the disease gene.
    • Allele Frequency Increment: increment in allele frequency per iteration of the power calculations. Enter zero to allow an offset to be specified.
    • P(Disease allele A|Marker allele A)[0;1]: the conditional probability of observing the disease allele A given the presence of marker allele A. (Enter if entering zero for the allele frequency increment.)
    • Allelic Odds Ratio: Odds ratio for an allele being a disease allele.
    • Model (mode of inheritance)
      • Additive
      • Multi (Multifactorial)
      • Dominant
      • Recessive
    • Offset: The offset to use for the FBAT statistic. (Enter if entering zero for the allele frequency increment.)
    • Population Prevalence: the percentage of the population estimated to have the particular disease at a specific time.
    • Disease Locus = Marker Locus: indicates whether the disease locus is equal to the marker locus. (Enter if entering zero for the allele frequency increment.)
  • The following parameters will be calculated according to the inputs to the above parameters:
    • Genetic attributable fraction of the gene
    • Penetrance for AA (genotype)
    • Penetrance for AB (genotype)
    • Penetrance for BB (genotype)
    • Relative Risk RR1 (relative risk for carrying one disease allele)
    • Relative Risk RR2 (relative risk for carrying two disease alleles)
    • Odds Ratio OR1 (odds ratio for carrying one disease allele)
    • Odds Ratio OR2 (odds ratio for carrying two disease alleles)
    • D’ between the marker gene and the disease gene (when zero is entered for the allele frequency increment)
    • Offset (= the population mean) (for a non-zero allele frequency increment)
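
To show how the calculated quantities relate to the inputs, the sketch below derives prevalence, attributable fraction, relative risks and odds ratios from the three penetrances and the disease allele frequency. It assumes Hardy-Weinberg genotype proportions and uses the textbook definitions of each quantity. These assumptions and the function name are illustrative, and the values PBAT reports may be computed differently.

```python
def derived_parameters(p, f_aa, f_ab, f_bb):
    """p: frequency of disease allele A; f_*: penetrance of each genotype."""
    if not 0.0 < p < 1.0:
        raise ValueError("allele frequency must lie strictly between 0 and 1")
    for f in (f_aa, f_ab, f_bb):
        if not 0.0 < f < 1.0:
            raise ValueError("penetrances must lie strictly between 0 and 1")
    q = 1.0 - p

    def odds(f):
        return f / (1.0 - f)

    # Prevalence K as a genotype-frequency-weighted mean of the penetrances.
    prevalence = p * p * f_aa + 2.0 * p * q * f_ab + q * q * f_bb
    return {
        "K": prevalence,
        # Share of disease that would vanish if everyone had BB penetrance.
        "AF": (prevalence - f_bb) / prevalence,
        "RR1": f_ab / f_bb,
        "RR2": f_aa / f_bb,
        "OR1": odds(f_ab) / odds(f_bb),
        "OR2": odds(f_aa) / odds(f_bb),
    }

result = derived_parameters(p=0.2, f_aa=0.3, f_ab=0.2, f_bb=0.1)
```

For these inputs the prevalence is 0.14 and the relative risks are 2 and 3, while the odds ratios are somewhat larger because the disease is not rare in this example.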

Genetic Model Tab–Family-design Continuous Trait

Under the Genetic Model Tab, you can specify the genetic model underlying the power calculations. Note, the disease gene is specified by allele “A”. See Genetic Model Tab for Power Calculations for Continuous Traits.

Genetic Model Tab for Power Calculations for Continuous Traits

Two modes exist for power output.

  • Enter zero for the allele frequency increment to allow entering parameters related to the disease gene not being the same as the marker gene. One power value will be output.
  • Enter a non-zero allele frequency increment. The calculations will take place as if the disease gene is the same as the marker gene. Power values will be output for allele frequencies starting with the “allele frequency for the disease gene” and incrementing by the allele frequency increment value.

Selecting an allele frequency increment of zero vs. non-zero determines which other parameters must be specified in order to fully define the genetic model for family-based calculations with continuous traits.

The following parameters may be used to specify a model for family-based calculations with continuous traits:

  • Allele Frequency (Marker Gene): allele frequency of the marker gene. (Enter if entering zero for the allele frequency increment.)
  • Allele Frequency (Disease Gene): allele frequency of the disease gene.
  • Allele Frequency Increment: increment in allele frequency per iteration of the power calculations. Enter zero to allow an offset to be specified.
  • P(Disease allele A|Marker allele A)[0;1]: the conditional probability of observing the disease allele A given the presence of marker allele A. (Enter if entering zero for the allele frequency increment.)
  • Heritability: A measure of the degree to which the variance in the distribution of a phenotype is due to genetic causes.
  • Model (mode of inheritance)
    • Additive
    • Dominant
    • Recessive
  • Offset: The offset to use for the FBAT statistic. (Enter if entering zero for the allele frequency increment.)
  • Disease Locus = Marker Locus: indicates whether the disease locus is equal to the marker locus. (Enter if entering zero for the allele frequency increment.)

The following parameters will be calculated according to the inputs to the above parameters:

  • Total population mean
  • D’ between the marker gene and the disease gene (when zero is entered for the allele frequency increment)
  • Offset (for a non-zero allele frequency increment)
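
The D’ value can be related to the inputs as follows. This sketch uses Lewontin's standard definition of D’ and treats the conditional probability P(Disease allele A|Marker allele A) as the fraction of marker-A haplotypes that also carry the disease allele. The function name is hypothetical, the signed value is returned, and whether PBAT reports a signed or absolute value is not stated in this manual.

```python
def d_prime(p_disease, p_marker, p_disease_given_marker):
    """Signed Lewontin's D' between the disease and marker loci."""
    # Frequency of the haplotype carrying both the disease and marker allele.
    hap = p_disease_given_marker * p_marker
    lower = max(0.0, p_disease + p_marker - 1.0)
    if hap > p_disease + 1e-12 or hap < lower - 1e-12:
        raise ValueError("inputs imply an impossible haplotype frequency")
    d = hap - p_disease * p_marker
    if d >= 0:
        d_max = min(p_disease * (1 - p_marker), (1 - p_disease) * p_marker)
    else:
        d_max = min(p_disease * p_marker, (1 - p_disease) * (1 - p_marker))
    return 0.0 if d_max == 0 else d / d_max

partial_ld = d_prime(0.2, 0.3, 0.5)
no_ld = d_prime(0.2, 0.3, 0.2)       # conditional equals marginal frequency
complete_ld = d_prime(0.3, 0.3, 1.0)
```

When the conditional probability equals the disease allele frequency the two loci are independent and D’ is zero.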

Genetic Model Tab–Population-design Case/Control Trait

Under the Genetic Model tab you can specify the genetic model underlying the power calculations. Note, the disease gene is specified by allele “A”.

The basis for defining the genetic model may be specified (within the Specify Basis box) as follows (see Genetic Model Tab for Power Calculations for Case/Control Traits):

  • MOI, p, K, Odds Ratio
  • MOI, p, K, Allelic Odds Ratio

Selecting the basis selects the parameters that are used to define the genetic model.

Genetic Model Tab for Power Calculations for Case/Control Traits

The parameters for the respective bases are as follows:

  • MOI, p, K, Odds Ratio: with this basis selected, the following parameters are available for specification:
    • Min allele frequency of the disease allele: allele frequencies for the disease allele are calculated starting from this point based on the Allele Frequency Increment.
    • Allele Frequency Increment: increment in allele frequency per iteration of the power calculations.
    • Odds Ratio OR1 (AB versus BB): Odds ratio for carrying one disease allele.
    • Model (mode of inheritance)
      • Additive
      • Multi (Multifactorial)
      • Dominant
      • Recessive
    • Population Prevalence: the percentage of the population estimated to have the particular disease at a specific time.
  • The following parameters will be calculated according to the inputs to the above parameters:
    • Odds Ratio OR2 (odds ratio for carrying two disease alleles)
    • Allelic Odds Ratio (odds ratio for an allele being a disease allele)
  • MOI, p, K, Allelic Odds Ratio: with this basis selected, the following parameters are available for specification:
    • Min allele frequency of the disease allele: allele frequencies for the disease allele are calculated starting from this point based on the Allele Frequency Increment.
    • Allele Frequency Increment: increment in allele frequency per iteration of the power calculations.
    • Allelic Odds Ratio: Odds ratio for an allele being a disease allele.
    • Model (mode of inheritance)
      • Additive
      • Multi (Multifactorial)
      • Dominant
      • Recessive
    • Population Prevalence: the percentage of the population estimated to have the particular disease at a specific time.
  • The following parameters will be calculated according to the inputs to the above parameters:
    • Odds Ratio OR1 (odds ratio for carrying one disease allele)
    • Odds Ratio OR2 (odds ratio for carrying two disease alleles)

Genetic Model Tab–Population-design Quantitative Trait

Under the Genetic Model tab you can specify the genetic model underlying the power calculations. Note, the disease gene is specified by allele “A”.

The basis for defining the genetic model (automatically specified within the Specify Basis box) is always (see Genetic Model Tab for Power Calculations for Quantitative Traits):

  • MOI, p, K, Heritability

Under this basis, certain parameters that are used to define the genetic model are selected.

Genetic Model Tab for Power Calculations for Quantitative Traits

These parameters are as follows:

  • MOI, p, K, Heritability: with this basis selected, the following parameters are available for specification:
    • Min allele frequency of the disease allele: allele frequencies for the disease allele are calculated starting from this point based on the Allele Frequency Increment.
    • Allele Frequency Increment: increment in allele frequency per iteration of the power calculations.
    • Heritability: A measure of the degree to which the variance in the distribution of a phenotype is due to genetic causes.
    • Model (mode of inheritance)
      • Additive
      • Dominant
      • Recessive
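
For the additive model, heritability and allele frequency together determine the per-allele effect size. The sketch below uses the standard quantitative-genetics result that a biallelic locus with additive effect a per allele has genetic variance 2p(1-p)a² under Hardy-Weinberg proportions. The function names and the default total variance of 1 are assumptions made for illustration, and the dominant and recessive models are not covered.

```python
import math

def additive_effect(heritability, allele_freq, total_variance=1.0):
    """Per-allele effect a such that 2p(1-p)a^2 = heritability * variance."""
    if not 0.0 <= heritability <= 1.0:
        raise ValueError("heritability must lie between 0 and 1")
    if not 0.0 < allele_freq < 1.0:
        raise ValueError("allele frequency must lie strictly between 0 and 1")
    genetic_variance = heritability * total_variance
    return math.sqrt(genetic_variance / (2.0 * allele_freq * (1.0 - allele_freq)))

def heritability_from_effect(effect, allele_freq, total_variance=1.0):
    """Inverse of additive_effect, used here as a consistency check."""
    return 2.0 * allele_freq * (1.0 - allele_freq) * effect ** 2 / total_variance

a_common = additive_effect(0.05, 0.5)
a_rare = additive_effect(0.05, 0.05)
```

For a fixed heritability, a rarer allele must have a larger per-allele effect, which is one reason power tables are produced across a range of allele frequencies.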

Computational Tab–Population-Based Case/Control Trait

The options on this tab depend on whether you are computing power for a given sample size, or the sample size for a required power and significance level.

Compute Power For Given Sample Size

Computational Tab for Power Calculations for Case/Control Traits

The options available for the Computational Tab when computing the power based on the sample size information are (see Computational Tab for Power Calculations for Case/Control Traits):

  • Power and Sample Size Computational Parameters:
    • Number of Simulations
  • Case/Control Computational Parameters
    • Number of Cases
    • Number of Controls
  • Specify GWQ and QC parameters: (GWQ = Genome Wide Quality, QC = Quality Control)
    • Number of genotyped markers
    • Average call rate for common homozygous genotype
    • Average call rate for heterozygous genotype
    • Average call rate for rare homozygous genotype

Compute Required Sample Size For Given Power and Significance Level

Computational Tab for Sample Size Calculations for Case/Control Traits

The options available for the Computational Tab when computing the sample size based on required power and significance level are (see Computational Tab for Sample Size Calculations for Case/Control Traits):

  • Power and Sample Size Computational Parameters:
    • Number of Simulations
    • Achieved power for sample size calculations
  • Case/Control Computational Parameters
    • Ratio: cases vs. controls
  • Specify GWQ and QC parameters: (GWQ = Genome Wide Quality, QC = Quality Control)
    • Number of genotyped markers
    • Average call rate for common homozygous genotype
    • Average call rate for heterozygous genotype
    • Average call rate for rare homozygous genotype

Computational Tab–Population-Based Quantitative Trait

The options for this tab differ depending on whether you are computing power for a given sample size, or the required sample size for a given power and significance level.

Compute Power For Given Sample Size

Computational Tab for Power Calculations for Quantitative Traits

The options available for the Computational Tab when computing the power based on the sample size information are (see Computational Tab for Power Calculations for Quantitative Traits):

  • Power and Sample Size Computational Parameters:
    • Number of Simulations
  • Quantitative Computational Parameters
    • Number of probands
  • Specify GWQ and QC parameters: (GWQ = Genome Wide Quality, QC = Quality Control)
    • Average call rate for common homozygous genotype
    • Average call rate for heterozygous genotype
    • Average call rate for rare homozygous genotype

Compute Required Sample Size For Given Power and Significance Level

Computational Tab for Sample Size Calculations for Quantitative Traits

The options available for the Computational Tab when computing the sample size based on required power and significance level are (see Computational Tab for Sample Size Calculations for Quantitative Traits):

  • Power and Sample Size Computational Parameters:
    • Number of Simulations
    • Achieved power for sample size calculations
  • Specify GWQ and QC parameters: (GWQ = Genome Wide Quality, QC = Quality Control)
    • Average call rate for common homozygous genotype
    • Average call rate for heterozygous genotype
    • Average call rate for rare homozygous genotype

PBAT Pre-Study Power Calculation Results

The results for the Pre-Study Power Calculations are displayed in a text viewer (see Text Viewer for PBAT Pre-Study Power Calculation Results). The output displayed depends on the type of calculation performed.

Text Viewer for PBAT Pre-Study Power Calculation Results

PBAT Genotype Analysis

Summary

The tools which have been implemented in PBAT for the analysis of quantitative and dichotomous traits are discussed in a series of papers by Lange and Laird ([Lange2002a], [Lange2002b], [Lange2002c]). They allow a variety of analysis possibilities:

  • Computation of a large variety of FBAT-statistics and their power for nuclear families and for extended pedigrees.
  • Multivariate FBATs for multiple phenotypes: FBAT-GEE, and FBAT-PC. FBAT-GEE is based on the generalized estimating equation approach. FBAT-PC is based on principal components that maximize the heritability.
  • FBATs for time to onset data/survival data (logrank-FBAT and Wilcoxon-FBAT, FBAT-EXP).
  • Permutation tests for certain FBAT statistics.
  • Transformation tools for continuous phenotypes that are not normally distributed.
  • Conditional power calculations for all implemented FBATs.
  • Construction of the most powerful FBAT-statistic.
  • Including predictor variables in the FBAT.
  • Including gene-environment/drug interactions in the FBAT statistic, as discussed in [Vansteelandt2006].
  • Various estimation routines to estimate the genetic effect size.
  • Screening methods to select the most “promising” combinations of markers and phenotypes without biasing the significance level of the FBAT statistic.

The default settings can be changed and saved by clicking Save Options at the bottom of the PBAT Genotype Analysis dialog window. See PBAT Genotype Analysis dialog – Select Phenotypes tab.

To restore the defaults, select Restore Defaults.

To access this section of the manual from the analysis dialog, select Help.

Using PBAT Genotype Analysis

Getting Started

The first step is to open an existing project or create a new project where you want to perform the data analysis and save the results. See Getting Started for more information on creating a new project or opening an existing one.

Once you have opened or created a project, you must import your pedigree and/or phenotype data into Golden Helix SVS. See Importing Family Pedigree Data for information on how to import pedigree and phenotype files. A properly imported pedigree file will have the six required pedigree columns at the front of the spreadsheet and the column name headers will have a blue background. See Special Features of a Pedigree Spreadsheet for more information about pedigree spreadsheets.

Note

  1. When creating your pedigree, remember to list the parents, even if their genotype information is not known. This ensures that siblings are grouped together properly into families.
  2. If unrelated families are listed together using the same family ID, the results will be unpredictable.

If there is additional phenotype information to be used for the PBAT analysis (over and above the Affection Status), join the pedigree and phenotype spreadsheets together, keeping unmatched rows. See Join or Merge dialog to Join a Pedigree spreadsheet to a Phenotype spreadsheet. The resulting spreadsheet will keep the pedigree columns at the front of the spreadsheet, followed by the phenotype columns and then the genotypes. See Pedigree spreadsheet joined to a Phenotype Spreadsheet.

Note

You don’t have to have additional phenotype columns to perform a PBAT analysis, but if you do, you need to follow the above steps to join the phenotype dataset to the pedigree dataset.

Join or Merge dialog to Join a Pedigree spreadsheet to a Phenotype spreadsheet

Pedigree spreadsheet joined to a Phenotype Spreadsheet

PBAT Genotype Analysis can be performed by opening a pedigree spreadsheet, activating the markers to be analyzed, and selecting Genotype > PBAT Genotype Analysis. A parameter selection dialog will open.

Note

If you have many markers in your pedigree spreadsheet, it may be easiest to use Select > Column > Inactivate All Columns, to inactivate all columns. Then activate any phenotype columns as well as the columns for those markers you wish to analyze before opening the PBAT Genotype Analysis dialog.

The parameters for PBAT Genotype Analysis include phenotype (and other variable) selections, the type of analysis, type of screening, parameters for phenotypes, haplotypes, test statistic and computational algorithm, and types of outputs. In the parameter selection dialog, the parameters are organized into four tabs, which are:

Select Phenotypes

The Select Phenotypes tab of the dialog allows you to select the phenotypes to test. See PBAT Genotype Analysis dialog – Select Phenotypes tab if this dialog was opened from a spreadsheet that does not contain additional phenotype columns. PBAT Genotype Analysis dialog with extra Phenotypes – Select Phenotypes tab illustrates what the tab of this dialog looks like if there are additional phenotype columns joined to the pedigree spreadsheet.

PBAT Genotype Analysis dialog – Select Phenotypes tab

PBAT Genotype Analysis dialog with extra Phenotypes – Select Phenotypes tab

Phenotypes

In this list, select the phenotype or phenotypes to be analyzed for association with the selected markers or with haplotypes from the selected markers. Multi-select operations are valid in this list box. These operations are: <Ctrl>-left-click selects multiple phenotypes one at a time, and <Shift>-left-click selects all phenotypes between the first and last selected phenotypes.

Phenotypes as predictor variables (covariates)

The selected phenotypes may be not only associated with certain markers or haplotypes, but also predicted by other phenotype variables (covariates for the test statistic). Select these other variables in this box to better determine the actual genetic effect after adjusting for the selected predictor variables.

When important covariates for the selected phenotypes are known, adding them to the conditional mean model ([Lange2002b] and [Lange2002c]) and also using them for the offset computation can increase the power of the FBAT statistic substantially.

Double-click on an item in this list to select or deselect it. An option dialog will appear. To select the variable, select the top radio button and enter the maximum power/order of the predictor variable. This determines the covariates that are added to the conditional mean model and to the offset value. For instance, entering “3” will add X_j, X_j^2, and X_j^3, where X_j is the selected predictor variable, to the model. To remove all orders of this predictor variable from the model, select the bottom radio button.
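The effect of the maximum power/order setting can be sketched in a few lines of Python. The function name `polynomial_terms` is hypothetical and only illustrates which columns an order of 3 would contribute for a single predictor variable.

```python
def polynomial_terms(values, max_order):
    """Return the columns [X, X^2, ..., X^max_order] for one predictor."""
    if max_order < 1:
        raise ValueError("max_order must be at least 1")
    return [[x ** k for x in values] for k in range(1, max_order + 1)]


# A predictor observed in three individuals, expanded up to order 3.
for column in polynomial_terms([1, 2, 3], 3):
    print(column)
```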

Phenotypes as interaction variables

To account for interactions of one or more phenotypic variables with the marker or haplotype being tested (“gene/covariate interactions”), select the interaction variables in this box.

Double-click on an item in this list to select it or deselect it. An offset selection dialog will appear. There are three options in this dialog:

  • Offset = mean: To use the mean of the selected variable as the offset, select this option.
  • Specify offset: Use this option to specify an offset for the selected variable. Enter the offset value into the Offset value box.
  • Deselect this interaction variable: To remove the selected variable as an interaction variable select this option.

Note

It is recommended that you use a particular offset choice here only when its effects need to be examined. In a standard data analysis, it is preferable to use “mean” here and allow all offsets to be computed by using one of the estimating procedures specified in the Offset drop-down menu on the next tab.

Subgroups

PBAT analyses may be divided into subgroups of patients (a stratified analysis). The outputs for the separate analyses of the subgroups will be provided on the same output spreadsheet, separated and categorized by subgroup.

To divide your patients into subgroups, click the box labeled Use a variable to define subgroups, and select one of the phenotype variables listed (this will be the grouping variable). Only binary, integer, and categorical variables can be used as grouping variables.

Select subgroup categories

Once the subgroup option is selected, this box becomes available and all subgroups for the selected variable are listed. Select the category or categories from the grouping variable for calculating the PBAT statistics. Multi-select operations are available in this list box.

Censoring Variables for Time-to-Onset Analysis

To do time-to-onset analysis:

  • Select the time-to-onset variable as the phenotype variable in the upper left-hand box.
  • Use the lower right-hand box to select a censoring variable. A censoring variable denotes whether the disease or condition has occurred at all during the study. It should be set to:
    • not censored, if the condition occurred (affected), and
    • censored, if the condition did not occur (unaffected).
  • Select other parameters (phenotype, haplotype, and computational) as necessary. FBAT-LOGRANK will have been automatically selected as the test statistic when the Select censoring variable for time to onset option is selected.

Phenotype and Haplotype Parameters

The next tab in the PBAT Genotype Analysis dialog is the Phenotype and Haplotype Parameters tab, see PBAT Genotype Analysis dialog – Phenotype and Haplotype Parameters tab.

PBAT Genotype Analysis dialog – Phenotype and Haplotype Parameters tab

Maximum and Minimum Number of Phenotypes per Group

  • FBAT-GEE statistic: (See FBAT-GEE under Test Statistic Parameters.) If more than one phenotype is selected, the test can be performed against all of the phenotypes as one group, just one phenotype at a time, or any number of phenotypes combined together. Testing against more than one phenotype at a time will result in a multivariate test. To select the number of phenotypes to “group together” when testing, set the minimum and maximum number in the Min number of phenotypes per group and Max number of phenotypes per group.
  • FBAT-PC statistic: (See FBAT-PC under Test Statistic Parameters.) The FBAT-PC statistic may be used to find the relative weights of many phenotypes within a PBAT principal component analysis. Set both Max number of phenotypes per group and Min number of phenotypes per group to the number of phenotypes selected. FBAT-PC tests against every phenotype individually as a part of its analysis. Select the non-compact output format (Output Format) to see the weight of each phenotype within the principal component.

Offset Choice

The phenotype offset may be specified in this menu and, when applicable, the following text box.

The final trait used in FBAT calculations is the original phenotype value minus the offset.

The offset accomplishes two purposes:

  1. Increases the power of the FBAT statistic by offsetting the mean of the original phenotype from the trait.
  2. Incorporates covariates and interaction variables into the FBAT statistic.

The offset choices in this menu are:

  • No offset: No offset is used; only the original phenotype value is used. Neither covariates nor interaction variables are incorporated into the FBAT statistic. (Useful for affected-only analyses.)

  • Optimal power: Use the offset that maximizes the power of the FBAT-statistic (computationally slow, efficiency dependent on the correct choice of the mode of inheritance).

  • Phenotypic residuals (including E(X|HO)): Offset is based on standard phenotypic residuals obtained by GEE-estimation which includes the expected marker score (E(X|H_0)) as well as all covariates and interaction variables. (This differs from standard phenotypic residuals only in the inclusion of the expected marker score.)

  • Standard phenotypic residuals: Offset is based on standard phenotypic residuals obtained by GEE-estimation which includes all covariates and interaction variables.

    In other words, the final trait will be equal to the difference between the observed phenotype and a predicted phenotype, with the predicted phenotype serving as the offset. The predicted phenotype comes from a regression model that regresses the observed phenotype on all of the selected covariates and interaction variables. If no covariates or interaction variables are selected, this amounts to subtracting the mean phenotype value (for a continuous phenotype) or the sample prevalence (for a dichotomous phenotype).

  • Specify here: (User-specified offset.) Enter the offset to use in the text box to the right of this menu. (Useful for unaffected-only studies, for which you would use an offset of 1, or when the effects of a particular offset need to be examined.)

Standard phenotypic residuals is normally recommended, except for affected-only studies, where No offset is recommended instead.

Other possibilities include:

  • Unaffected-only studies (use an offset of 1).
  • Other studies using binary traits (use the disease prevalence).
  • Total population samples and ascertained samples where the quantitative trait is not highly correlated with the ascertainment criteria (the offset should approximate the phenotypic mean–use Standard phenotypic residuals).
  • Ascertained samples where the quantitative trait is highly correlated with the ascertainment criteria (dichotomize and set the offset to 0–No offset).
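The relationship between the phenotype, the offset and the final trait can be sketched as follows. This Python example is only an approximation of the Standard phenotypic residuals choice: it uses ordinary least squares with at most one covariate as a stand-in for the GEE estimation described above, and the function name `residual_trait` is hypothetical.

```python
from statistics import mean


def residual_trait(phenotype, covariate=None):
    """Return phenotype minus offset for every individual.

    The offset is the fitted value from a least-squares regression of the
    phenotype on one covariate, or the phenotype mean if no covariate is
    given (the sample prevalence for a 0/1 phenotype).
    """
    y_bar = mean(phenotype)
    if covariate is None:
        offsets = [y_bar] * len(phenotype)
    else:
        x_bar = mean(covariate)
        sxx = sum((x - x_bar) ** 2 for x in covariate)
        sxy = sum((x - x_bar) * (y - y_bar)
                  for x, y in zip(covariate, phenotype))
        slope = sxy / sxx
        offsets = [y_bar + slope * (x - x_bar) for x in covariate]
    return [y - o for y, o in zip(phenotype, offsets)]


print(residual_trait([1.0, 2.0, 3.0]))           # mean-centered phenotype
print(residual_trait([1, 0, 0, 1]))              # prevalence subtracted
print(residual_trait([2.1, 3.9, 6.0], [1, 2, 3]))  # covariate-adjusted
```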

Compute All Predictor Sub-Models

Check the Compute all predictor sub-models box to use the covariates (predictors) in all possible combinations, in separate tests.

Uncheck this box to use all of the covariates combined together in one test.

Transformations

The phenotypes can be used as is, without a transformation, or the selected phenotypes can be transformed to ranks or Z-scores (normal scores). The same choice is available for the selected predictor variables and for the selected interaction variables. In practice, it is recommended to transform the data to normal scores, since the asymptotic convergence of the FBAT-statistic is then robust against outliers and skewed data [Lange2002a].
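A rank transformation and a normal-score transformation can be sketched in Python as follows. The sketch assumes average ranks for ties and the common convention of mapping a rank r among n values to the standard normal quantile at r / (n + 1); the exact convention used by PBAT is not stated here, so treat these formulas as assumptions.

```python
from statistics import NormalDist


def ranks(values):
    """Average ranks (1-based); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def normal_scores(values):
    """Map each rank r to the standard normal quantile at r / (n + 1)."""
    n = len(values)
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf(r / (n + 1)) for r in ranks(values)]


skewed = [5, 1, 100]
print(ranks(skewed))
print(normal_scores(skewed))
```

The outlying value 100 receives the same normal score it would have received had it been 6, which is what limits the influence of outliers.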

MFBAT (Multi-Marker/Multi-Phenotype) Test Parameters

For the most common SNP and haplotype tests, multiple markers and/or multiple phenotypes may be subjected to both FBAT-GEE multivariate tests and FBAT-PC tests after the original analysis has finished. These tests are collectively referred to as “MFBAT” tests.

MFBAT testing may be done along with genotypic testing, with rapid analysis, or with haplotype testing performed using sub-haplotypes with adjacent markers. All combinations of M phenotypes and all combinations of N markers will be tested, where M and N take on all values within the bounds specified for the number of phenotypes and the number of markers, respectively.
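The number of tests implied by these bounds grows quickly. The following Python sketch enumerates the groupings for hypothetical phenotype and marker names; it only illustrates the combinatorics and does not reproduce how PBAT orders or labels its output.

```python
from itertools import combinations, product


def groupings(items, min_size, max_size):
    """All combinations of items whose size lies within the given bounds."""
    return [combo
            for size in range(min_size, max_size + 1)
            for combo in combinations(items, size)]


def mfbat_test_plan(phenotypes, markers, phenotype_bounds, marker_bounds):
    """Pair every marker grouping with every phenotype grouping."""
    phenotype_groups = groupings(phenotypes, *phenotype_bounds)
    marker_groups = groupings(markers, *marker_bounds)
    return list(product(marker_groups, phenotype_groups))


plan = mfbat_test_plan(["p1", "p2", "p3"], ["m1", "m2"], (2, 3), (1, 2))
print(len(plan))
print(plan[0])
```

With three phenotypes grouped two or three at a time and two markers grouped one or two at a time, this yields 4 phenotype groupings times 3 marker groupings.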

Note

When performing MFBAT testing using sub-haplotypes of length greater than one, the “marker” referred to in the MFBAT test will mean the first marker of the sub-haplotype being tested.

MFBAT output will be shown in either of two modes:

  • Multiple Marker Testing: If you specify more than one marker as the maximum marker grouping for MFBAT output, the outputs will follow after the outputs for all of the individual markers.
  • Testing with Multiple Phenotypes Only: If you specify only one marker as the maximum marker grouping for MFBAT output, the MFBAT output for any marker or haplotype will follow after the output for that marker’s test.

Check Perform MFBAT test to perform these tests. Fill in the maximum and minimum numbers of phenotypes and of markers to be tested at a time.

The outputs will be identified by marker names separated by plus signs or a single marker name with a plus sign after it. The phrase “FBAT-GEE^2” or “FBAT-PC^2” will be used in place of an allele number or haplotype designation to identify the test.

A p-value will appear in the normal p-value column (either “p-value(FBAT)” or “FBAT-Wilcoxon”) for the “FBAT-GEE^2” test, and two power-related values will appear for the “FBAT-PC^2” test, appearing in the normal p-value column and the next column to the right of the p-value column.

Note

  1. MFBAT testing is valid for either the FBAT-GEE or FBAT-LOGRANK test statistic.
  2. To perform MFBAT testing, no interactions may be specified, no grouping of phenotypes is allowed, and Compute all predictor sub-models must be unchecked.
  3. MFBAT testing with combinations of multiple markers may only be performed when 20 or fewer markers are active in the pedigree spreadsheet.
  4. For MFBAT FBAT-LOGRANK (time-to-onset) testing using only multiple phenotypes, an FBAT-GEE test is made using the censor variable acting as a single phenotype against the marker or haplotype being tested. The result of this test is output after the results of the original test and before the results of the MFBAT test. Its output fields are the same as for an FBAT-GEE test except that “1” is used as the phenotype field indicator.

Check Use simplified variance structure to simplify (average out rows in) the variance/covariance matrix used in the FBAT-PC calculations, thereby improving performance for larger groups of phenotypes and markers.

Alternative Rapid Pedigree Algorithm

Check Use alternative rapid pedigree algorithm to use a new algorithm for processing extended pedigrees. This is currently the default pedigree algorithm. Uncheck this box to use the standard pedigree algorithm.

This new algorithm combines the advantages of the following two strategies:

  • Breaking up the extended pedigrees into trios before analysis.
  • Analyzing the extended pedigrees directly.

Breaking up the extended pedigrees into trios, which is a computationally fast strategy, does not take full advantage of the structure of the known extended pedigree. On the other hand, analyzing extended pedigrees as such, which takes full advantage of all the information and is the most powerful option, can be computationally slow when many of the genotypes in a pedigree are missing.

The standard extended pedigree algorithm is particularly slow in a situation in which families in an extended pedigree for which all genotypes are known are linked only by two or more family members for whom genotypic information is not available. Another situation is of an extended pedigree with “isolated genotypes”, that is, sparse genotypic information spread across the entire pedigree. In either situation, the power gain is minimal and sometimes even jeopardized by the possibility that the linking family member or members have to be removed when the maximum number of founders is reached in PBAT.

The new rapid pedigree algorithm in PBAT identifies clusters of trios within extended pedigrees that share the same parents, and analyzes such clusters as extended pedigrees. At the same time, clusters of trios that do not share the same parents are broken up into separate extended-pedigree clusters. All resulting clusters are analyzed in the same way that extended pedigrees would be under the standard algorithm, but independently of each other.

The extra information provided to the computation of the genetic distribution under the original algorithm by linking together the extended pedigree clusters is minimal, while the effort required for taking advantage of this information is disproportionately enormous. This puts the standard algorithm at a severe disadvantage.

Under the new hybrid approach, however, such links between family clusters within extended pedigrees are dropped. The increased computational speed of a pure nuclear-family analysis is, therefore, achieved while almost completely keeping the statistical power of the standard extended pedigree algorithm.

Perform Rapid Analysis

To perform a rapid scan of markers using only one test per marker, check the Perform rapid analysis box. The minor allele for the marker being tested will be used as the one-marker “haplotype” for a haplotype test. This is repeated over all selected markers. Rapid analysis is now available for any or all of the four genetic models.

Because this rapid analysis approach focuses on just the minor allele, it will yield results roughly twice as fast as the standard genotypic approach. For certain extended pedigrees having many siblings, it can be more than twice as fast, due to the differing algorithms which these two approaches use to infer expected marker scores.

Permutation Testing

Permutation testing may be selected either for Rapid Analysis or for other modes of haplotype analysis. Check the Use permutation testing to obtain p-values box, and enter the Number of permutations to use in the text box.

Haplotype Analysis

To perform haplotype analysis, check the Perform Analysis for Haplotypes box. The haplotype-related choices delineated in the following paragraphs will then become active.

Note

If any active marker is multi-allelic, haplotype testing will select only those two alleles that are most prevalent, and treat the marker as if it is bi-allelic with these two alleles.

  • Overall haplotype test: Check this box to additionally perform an overall haplotype test. This constitutes a multivariate test performed on all the haplotypes whose frequency is greater than the specified cut-off frequency.

    Note

    Checking this option is only valid:

    1. when the Analyze all sub-haplotypes option is not checked,
    2. if no interactions have been specified,
    3. if only one level of grouping (using the Subgroups box on the first tab–see Subgroups) is used, or if no explicit subgrouping is used at all, and
    4. if Screening based on non-parametric approach has been selected in the Test Statistic and Computational tab (See Screening Type).

    When the Overall haplotype test box is checked, the Cut-off frequency for overall haplotype tests box is active. Use this box to enter the minimum frequency a haplotype must have for inclusion in the overall test.

  • Analyze all sub-haplotypes: Check this box to analyze haplotypes that are defined by subsets of the currently selected markers. Checking this box will also activate the Length of sub-haplotypes box. If “0” is entered, haplotypes from the original set and every subset of the markers will be analyzed. Entering “0” is not allowed when 9 or more markers are active in the pedigree spreadsheet. If a non-zero number is entered in this box, only sub-haplotypes of length equal to the specified number of markers will be analyzed. The sub-haplotype length is not allowed to exceed 8.

    In addition, if a number greater than one and less than the total number of active markers is entered for Length of sub-haplotypes, the Only sub-haplotypes defined by adjacent SNPs check box is activated. Checking this will effectively cause the sub-haplotypes to be analyzed in a moving window. Unchecking this, which is not allowed for more than 20 total active markers, will test every combination of the selected markers consisting of the specified sub-haplotype length. This can be very slow because of the large quantity of calculation and output requested.

    Uncheck this box to analyze only the haplotypes defined by all the active markers in the pedigree spreadsheet without analyzing any haplotypes defined by fewer than all of these markers. Only 8 markers may be active for analysis in this mode.

  • Infer missing genotypes in haplotypes: Check this box to include individuals with missing genotype information in the analysis. The algorithm of [Horvath2004] is applied to all individuals, even if they have missing genotype information. Unfortunately, this can result in a greater number of ambiguous haplotypes and can be much more compute-intensive.

    Uncheck this box to exclude individuals with missing genotype information from the analysis.

  • Remove ambiguous haplotypes from the analysis: Check this box to consider only possible haplotypes which can be inferred from the parental genotypes. This is done by excluding families from the analysis which contain ambiguous haplotypes (possible haplotypes which cannot be inferred from the parental genotypes).

    Uncheck this box to include ambiguous haplotypes in the analysis and weight them according to their estimated frequencies in the probands.

  • Maximal number of mating types for computation: One (phased) mating type is one possible combination of one possible diplotype from one parent with one possible diplotype from the other parent. (Note that two “different mating types” that are only “different” because of switching the haplotype order in the diplotypes or because of switching the order of the parents are considered to actually be the same mating type.)

    Use 16 for most haplotype calculations. Use fewer to speed up the calculations for certain cases, and use more to be more certain to use all mating types.
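The difference between the sub-haplotype options described above can be illustrated with a short Python sketch. The function name `sub_haplotypes` and the marker names are hypothetical; the sketch only enumerates which marker sets would be analyzed and does not enforce the limits on marker counts mentioned above.

```python
from itertools import combinations


def sub_haplotypes(markers, length, adjacent_only=True):
    """Marker sets defining the sub-haplotypes to analyze.

    length == 0 selects every non-empty subset of the markers.
    Otherwise only sets of exactly `length` markers are returned, either
    as a moving window of adjacent markers or as all combinations.
    """
    if length == 0:
        return [combo
                for size in range(1, len(markers) + 1)
                for combo in combinations(markers, size)]
    if adjacent_only:
        return [tuple(markers[i:i + length])
                for i in range(len(markers) - length + 1)]
    return list(combinations(markers, length))


markers = ["rs1", "rs2", "rs3", "rs4"]
print(sub_haplotypes(markers, 2))                       # moving window
print(len(sub_haplotypes(markers, 2, adjacent_only=False)))
print(len(sub_haplotypes(markers, 0)))
```

Even for four markers, the moving window gives 3 sets, all pairs give 6, and all subsets give 15, which is why the non-adjacent and "0" settings can be slow for larger marker sets.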

Test Statistic and Computational

The next tab in the PBAT Genotype Analysis dialog is the Test Statistic and Computational tab, see PBAT Genotype Analysis dialog – Test Statistic and Computational tab. On this tab there are options to specify the test statistic parameters, the computational parameters, the screening type, and the output parameters.

PBAT Genotype Analysis dialog – Test Statistic and Computational tab

Test Statistic Parameters

  • Test Statistics: select one of the following test statistics as appropriate.

    • FBAT-GEE: generalized estimating equation for FBAT. If one phenotype is selected, the FBAT-GEE statistic simplifies to the standard univariate FBAT-statistic. If several phenotypes are selected, all phenotypes are tested simultaneously using FBAT-GEE.

      For FBAT-GEE:

      • Both binary and continuous phenotypes will work.
      • Can combine phenotypes with different distributions (e.g. continuous and ordinal).
      • For each phenotype, an additional degree of freedom is used.
      • This statistic is not as good for a large number of phenotypes.

      Generally, the FBAT-GEE statistic can handle a moderate amount of any type of multivariate data, including groups of dichotomous phenotypes.

    • FBAT-PC: principal components FBAT extension for longitudinal phenotypes, repeated measurements and correlated phenotypes.

      This method tests a weighted sum of all the measurements, with the weights determined so as to maximize the genetic component of the overall phenotypes and to minimize the phenotypic/environmental variance. Generalized principal component analysis is used to determine these weights.

      For FBAT-PC:

      • All phenotypes must have the same distribution.
      • Degrees of freedom always equals one regardless of how many phenotypes are used.
      • As the number of phenotypes increases the power increases.
      • Quantitative phenotypes are preferable.
      • Good for a large number of phenotypes.
      • Can be its own type of marker “screening” test, since small genetic effects are amplified.

      Generally, FBAT-PC is more powerful than FBAT-GEE if the phenotypes are correlated and quantitative.

    • FBAT-LOGRANK: this option also includes FBAT-Wilcoxon. These test statistics are FBAT extensions of the classical LOGRANK and Wilcoxon tests for time-to-onset data.

  • Genetic Model: The mode of inheritance of the target/disease allele and the underlying genetic model can be selected here. The choices available are:

    • Additive
    • Dominant
    • Recessive
    • Heterozygous Advantage
    • All (calculates outputs for all four possible models)

    Note

    Ideally, the choice of model should be based on segregation analysis or previous observations. If you had information from previous research indicating that the phenotype is inherited in a manner consistent with a particular model, then that is the model you should choose. (In past decades, when genotyping was prohibitively expensive, this type of observational analysis was an absolute requirement before you could receive any funding for a genetic study, but it is routinely overlooked at present.) It is generally preferred to use an additive model for exploratory tests, because anything that is significant in the dominant or recessive models will usually give a strong signal for additive as well. In practice, most GWAS researchers don’t select a single model, but will often run several models and compare the results to determine which results are the best. GWAS is generally regarded as an exploratory, hypothesis-generating tool, and therefore it is not so important to select a model in advance. A candidate gene test, with the goal of replicating a GWAS finding, is a hypothesis testing procedure and should always use the same models used in the GWAS.

  • Null Hypothesis: Specify the applicable null hypothesis from among the following options.

    • No linkage and no association: Standard hypothesis
    • Linkage and no association: Use if testing in a region with known linkage.
    • Linkage and no association (sw): Use if testing in a region with known linkage when there are large pedigrees. The empirical variance requires estimation of the correlation between all pedigree members, which can be unstable in large pedigrees. Here “sw” stands for “sandwich variance”, which is used to provide a more robust variance estimate.
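As a rough illustration of how the genetic model enters the test, the following Python sketch computes the univariate statistic FBAT = (S - E[S]) / sqrt(V) for complete parent-offspring trios under the null hypothesis of no linkage and no association. It is a minimal sketch, not PBAT's implementation: it assumes one offspring per family, fully genotyped parents, a biallelic marker coded as the number of copies of the target allele, and traits that already have the offset subtracted. All function names are hypothetical.

```python
from math import sqrt

# Probability that a parent with 0, 1 or 2 copies transmits 0 or 1 copy.
TRANSMISSION = {0: {0: 1.0}, 1: {0: 0.5, 1: 0.5}, 2: {1: 1.0}}


def code_genotype(n_copies, model):
    """Coded genotype X for a given mode of inheritance."""
    if model == "additive":
        return n_copies
    if model == "dominant":
        return int(n_copies >= 1)
    if model == "recessive":
        return int(n_copies == 2)
    if model == "heterozygous advantage":
        return int(n_copies == 1)
    raise ValueError(f"unknown model: {model}")


def offspring_distribution(father, mother):
    """Mendelian distribution of the offspring allele count."""
    dist = {}
    for a, pa in TRANSMISSION[father].items():
        for b, pb in TRANSMISSION[mother].items():
            dist[a + b] = dist.get(a + b, 0.0) + pa * pb
    return dist


def fbat_z(trios, model="additive"):
    """trios: iterable of (father, mother, child, trait) tuples."""
    s = e = v = 0.0
    for father, mother, child, trait in trios:
        dist = offspring_distribution(father, mother)
        mean_x = sum(p * code_genotype(g, model) for g, p in dist.items())
        var_x = sum(p * (code_genotype(g, model) - mean_x) ** 2
                    for g, p in dist.items())
        s += trait * code_genotype(child, model)
        e += trait * mean_x
        v += trait ** 2 * var_x
    if v == 0:
        raise ValueError("no informative families")
    return (s - e) / sqrt(v)


# Four trios with one heterozygous parent; the allele is transmitted 3 times.
trios = [(1, 0, 1, 1.0), (1, 0, 1, 1.0), (1, 0, 1, 1.0), (1, 0, 0, 1.0)]
print(fbat_z(trios, "additive"))
```

With all traits equal to 1 and the additive model, the square of this statistic reduces to the TDT statistic (b - c)^2 / (b + c), where b and c count transmitted and non-transmitted alleles from heterozygous parents.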

Screening Type

Screening is useful when the phenotypes with the strongest genetic components are not known prior to the analysis and several markers have to be analyzed. The screening technique can also deal with the multiple comparison problem in genome-wide association studies. Additionally, screening can help the user to decide whether a study has sufficient power to detect a significant association. See Output Spreadsheet for how screening is output from PBAT.

Screening is an integral part of the PBAT workflow. For continuous phenotypes, the screening approach is called the “Conditional Mean Model”.

Two types of screening are available for continuous phenotypes. Both are based on a genetic effect size estimate (i.e. \beta) which is obtained by regressing the observed offspring phenotypes on the expected offspring genotype (given the parental genotypes). The larger the genetic effect size, the larger the estimated power of the FBAT test.

The two screening types are:

  • Screening based on conditional power calculations (parametric approach). The conditional power is the probability that the FBAT test rejects the null hypothesis, given the offspring phenotype and the parental genotypes. Under the “Conditional Mean Model”, the genetic effect size (\beta) is used to obtain the expected value and the variance of the marker scores (i.e. offspring genotypes) under the alternative hypothesis, and thus to obtain the conditional power.
  • Screening based on non-parametric approach (Wald tests). For the Wald test, the genetic effect size is directly tested (i.e. H_0: \beta=0). This method is recommended for use with data containing continuous phenotypes and extended pedigrees.
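The regression step shared by both screening types can be illustrated with a small sketch. It assumes complete trios, an additive coding (the count of the tested allele), and an ordinary least-squares fit with a normal approximation for the Wald p-value. The function names are hypothetical, and this is not PBAT's implementation, which handles covariates, missing parents and extended pedigrees.

```python
import math
from statistics import mean

def expected_additive_genotype(father, mother):
    """E[X | parents] for additive coding: each parent transmits the tested
    allele with probability (copies carried) / 2."""
    return father / 2 + mother / 2

def wald_screen(expected_x, phenotypes):
    """Least-squares slope of phenotype on expected genotype, with a Wald test
    of H_0: beta = 0 (two-sided, normal approximation)."""
    n = len(expected_x)
    mx, my = mean(expected_x), mean(phenotypes)
    sxx = sum((x - mx) ** 2 for x in expected_x)
    sxy = sum((x - mx) * (y - my) for x, y in zip(expected_x, phenotypes))
    beta = sxy / sxx
    intercept = my - beta * mx
    rss = sum((y - intercept - beta * x) ** 2
              for x, y in zip(expected_x, phenotypes))
    se = math.sqrt(rss / (n - 2) / sxx)
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return beta, se, p_value

# Illustrative data only: parental allele counts (father, mother) and
# one offspring phenotype per trio.
parents = [(0, 1), (1, 1), (1, 2), (2, 2), (0, 0), (1, 0)]
phenotypes = [1.1, 2.0, 3.2, 3.9, 0.1, 0.9]
expected_x = [expected_additive_genotype(f, m) for f, m in parents]
beta, se, p_value = wald_screen(expected_x, phenotypes)
```

Because the regression uses only the parental genotypes and the offspring phenotypes, it does not touch the offspring genotypes on which the FBAT statistic itself is based.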

In general, screening by conditional power is recommended over the Wald test (non-parametric approach). The Wald test is a population-based test of the genetic effect size. Unlike the conditional power calculation, it does not require model assumptions under the alternative hypothesis, which is why it is called a non-parametric screening approach.

However, since the Wald test is a purely population-based approach, it is generally less powerful than conditional power, especially when population stratification may be present [Lange2002c].

Unfortunately, the conditional power method is more computationally intensive if there are very large pedigrees in the dataset. The non-parametric Wald test will run more quickly in these cases.

For other types of studies which do not use continuous phenotypes, use Screening based on conditional power calculations and see Empirical distribution for phenotypes in Computational Parameters below.

GFBAT

To adjust the FBAT statistic for environmental correlation between the traits of multiple siblings in a family (GFBATs), select this option [Lange2002b].

Computational Parameters

The following options set additional computational parameters.

  • Number of non-founders in one pedigree must be less than: Enter the maximum number of non-founders (subjects that have at least one parent in the pedigree) that will be in one pedigree, plus one. If a pedigree is found to have this number of non-founders or more, it will be broken up into smaller pedigrees. For instance, if the user wants to restrict pedigrees to only have two non-founders, then enter 3 in this box.

    Note

    1. Under the alternative pedigree algorithm, this parameter refers to the maximum number of non-founders within the family clusters identified by this algorithm, rather than to the maximum number of non-founders within any original extended pedigree.
    2. For haplotype analysis or rapid analysis, a pedigree with too many non-founders will simply not be used.
    3. If you select fewer non-founders than the actual pedigrees have, and you are not using haplotype analysis or rapid analysis, the results may depend on how the data is sorted. In this mode, a larger pedigree is broken up into smaller pedigrees (or, under the alternative pedigree algorithm, a cluster is reduced in size), and this process depends on the order in which the larger pedigree occurs in the pedigree spreadsheet.
    4. Also, when you are not using haplotype analysis or rapid analysis, selecting more than approximately seven non-founders when the actual pedigrees have more than seven non-founders can become computationally intensive, especially when screening by conditional power and especially under the standard pedigree algorithm. Possible remedies are to screen with the non-parametric approach, which can reduce much of the computation, to use the alternative pedigree algorithm (if you have been using the standard algorithm), or to use rapid analysis.
  • Empirical distribution for phenotypes: (This parameter only adjusts the power column or columns in the output.) Screening, which filters the FBAT tests to be considered, relies mainly on the “Conditional Mean Model”. However, the “Conditional Mean Model” assumes that continuous phenotypes are being used. For other phenotypes, a different method of obtaining conditional power needs to be selected, because the expected value and variance of the marker score under H_A must be estimated to obtain the conditional power.

    The following distributions for phenotypes may be selected:

    • Continuous phenotypes: The “Conditional Mean Model” will be used for power calculations.
    • Approach by Jiang et al (2006): Use this option for time-to-onset calculations. Also use if there is no a priori belief that association will only be observed in affected individuals.
    • Approach by Murphy et al (2006): Use for affected-only studies or categorical phenotypes.
    • Naive allele freq estimator: The allele frequencies used for screening are estimated from the parents’ genotypes. This is an alternative to the Murphy method for affected-only studies or categorical phenotypes if there is a reason why the assumption about the relationship of the penetrance functions under the alternative hypothesis made by the Murphy method might be violated.
    • Observed allele frequencies: Another alternative to the Murphy method for affected-only studies or categorical phenotypes.

    Note

    If you select any distribution for your phenotype other than Continuous phenotypes, your phenotype variable should either be the Affection Status or have category numbers ranging between 0 and 199 inclusive.

  • Min. number of informative families: “Informative families” are those families which could be included in the calculation for power and p-value statistics because of the usefulness of their genotypic information for the current marker or markers.

    Specify the minimum number of informative families required for the display of the FBAT-statistics. If “0” is entered, statistics on all tests will be displayed. In a typical analysis, it is not recommended to include markers with fewer than 20 informative families.

    Note

    Families that are informative for one model are not necessarily informative for a different model. For instance, if the parents are A_A and A_B, the offspring is either A_A or A_B. When allele B is being tested, this family will be informative for the dominant model but not for the recessive model.

  • Maximal iterations for GEE: Enter the maximal number of iteration steps in the GEE-estimation procedure. Enter “0” to use least-squares residuals. Otherwise, GEE residuals are computed (useful when multiple correlated phenotypes are analyzed). This choice will be active only if the FBAT-GEE statistic is selected.

  • Significance level: Enter the significance level to be used for the power calculations.

    Typically, 0.0005 might be used. However, for logrank tests, a higher significance level, such as 0.01, is preferable.
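The note on informative families above can be checked with a short sketch. It uses a simplified criterion that is an assumption rather than PBAT's exact definition: a complete trio is treated as informative for a model when Mendelian transmission from the parents can produce at least two different coded offspring genotypes. The function names are hypothetical.

```python
from itertools import product

def code_genotype(genotype, target, model):
    """Code a genotype (a pair of alleles) for the tested allele under a model."""
    copies = genotype.count(target)
    if model == "additive":
        return copies
    if model == "dominant":
        return int(copies >= 1)
    if model == "recessive":
        return int(copies == 2)
    if model == "heterozygous advantage":
        return int(copies == 1)
    raise ValueError(f"unknown model: {model}")

def is_informative(father, mother, target, model):
    """True when the possible offspring of this mating do not all share the
    same coded genotype (simplified criterion, complete trios only)."""
    codes = {code_genotype((a, b), target, model)
             for a, b in product(father, mother)}
    return len(codes) > 1

father, mother = ("A", "A"), ("A", "B")
for target in ("A", "B"):
    for model in ("dominant", "recessive"):
        print(target, model, is_informative(father, mother, target, model))
```

For parents A_A and A_B, the sketch reports the family as informative for the dominant model and uninformative for the recessive model when allele B is tested, and the reverse when allele A is tested.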

Output Format

The parameters in this box select alternative or additional outputs to be included in the resulting spreadsheet.

  • Use compact output format: Select this option to output the shorter format that was developed for the database at the Channing Laboratories. This format contains 17 columns plus a row label column for the marker names, with the following exceptions:

    • Output -log 10 p-values (see below) is selected. This will add exactly 3 additional columns to the output.
    • Output detailed statistics for each test (see Computational Details) is selected.
    • Output informative families for each test (see Computational Details) is selected.
  • Display p-values as signed numbers to show the direction of the main effect: Select this option to place a negative sign on the p-value when there is a negative correlation between the phenotype and the number of transmitted target/disease alleles. If this option is not selected, all p-values will be displayed as positive numbers.

    Note

    1. The signed p-value is a more reliable indicator of the direction of the effect than the heritability output, which is only an approximation to the direction of the effect.
    2. Signed p-values are not available when more than one phenotype is being tested at a time under FBAT-GEE, or when testing for interactions.
  • Output -log 10 p-values: Select this option to output -\log_{10}(\text{p-value}) for all p-values in the output, in addition to the p-values themselves.
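If you export the results and post-process them yourself, a signed p-value has to be split into its magnitude and its direction before any -\log_{10} transform is applied. The sketch below shows one way to do that. It assumes the transform is taken on the magnitude of the p-value; how SVS itself treats the sign in its -log10 columns is not stated here, and the function name is hypothetical.

```python
import math

def decode_signed_pvalue(signed_p):
    """Split a signed p-value into (p-value, direction, -log10 of the p-value).

    A negative sign is read as a negative correlation between the phenotype
    and the number of transmitted target/disease alleles.
    """
    p = abs(signed_p)
    if p == 0:
        raise ValueError("-log10 is undefined for a p-value of 0")
    direction = "negative" if signed_p < 0 else "positive"
    return p, direction, -math.log10(p)

p, direction, neg_log10_p = decode_signed_pvalue(-0.001)
```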

Note on Genetic Models and the Direction of the Main Effect:

In general, with an additive model, the two alleles will come out with the same results but opposite signs. (Factors such as testing with interactions may change this in some cases.) This results directly from the concept that a subject with more of one allele will have to have fewer of the other allele.

With a dominant model, however, the probands with heterozygous genotypes are grouped with the probands who are homozygous for the allele being tested. With a recessive model, the probands with heterozygous genotypes are grouped with the probands who are homozygous for the opposite allele.

For instance, suppose the two alleles are designated as A and B. If we use the dominant model and test on A, we will lump together subjects with genotype A_A with subjects with genotype A_B, and the opposite category of subjects will be those with B_B. If, on the other hand, we were to use the recessive model and test on A, we will have subjects with genotype A_A in one category, and all the other subjects, those with either A_B or B_B, will be in the other category.

As you can see, these tests are not really opposite to each other. What will actually be opposite to testing on A with the dominant model will be testing on B with the recessive model, because both of these tests lump together A_A and A_B and put B_B in the other category. Similarly, testing on A with the recessive model and testing on B with the dominant model will be opposite to each other, as they put A_A in one category and A_B and B_B in the other category.

Therefore, if you look for tests on opposite alleles within opposite models, you will see the same p-values with opposite signs.

Finally, there is the “heterozygous advantage” model. The heterozygotes are lumped into one category, and the homozygotes into the other category. With this model, the answer will always be the same, with the same sign, no matter which allele is selected for testing, because it is the same test in either case, with A_B in one category and A_A and B_B in the other.
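The groupings described above can be verified with a small lookup table. The sketch below assumes the conventional 0/1/2 codings by number of copies of the tested allele. The names are hypothetical and the code only illustrates the reasoning; it is not part of PBAT.

```python
# Coded genotype indexed by the number of copies (0, 1, 2) of the tested allele.
CODING = {
    "additive": (0, 1, 2),
    "dominant": (0, 1, 1),
    "recessive": (0, 0, 1),
    "heterozygous advantage": (0, 1, 0),
}
GENOTYPES = ("A_A", "A_B", "B_B")

def coded(genotype, target, model):
    """Coded value of a genotype such as 'A_B' for the tested allele."""
    copies = genotype.split("_").count(target)
    return CODING[model][copies]

def groups(target, model):
    """Split the genotypes into (coded 1, coded 0) for a two-category model."""
    ones = {g for g in GENOTYPES if coded(g, target, model) == 1}
    return ones, set(GENOTYPES) - ones

# Dominant on A and recessive on B form the same two categories, swapped.
dominant_a = groups("A", "dominant")   # ({'A_A', 'A_B'}, {'B_B'})
recessive_b = groups("B", "recessive") # ({'B_B'}, {'A_A', 'A_B'})

# Heterozygous advantage gives the identical split for either allele.
het_a = groups("A", "heterozygous advantage")
het_b = groups("B", "heterozygous advantage")
```

Swapping the two categories reverses the sign of the effect without changing its magnitude, which is why such pairs of tests give the same p-values with opposite signs.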

Computational Details

The following two checkboxes allow you to see detailed data relating to the individual tests. Bear in mind that depending upon your input data, checking either of these may result in a large volume of data being output.

  • Output detailed statistics for each test: Check this box to examine, for each marker or haplotype (and genetic model) being tested, the individual vector component values, one for each proband, of the following vectors:

    • y1 The value of the phenotype (or the first phenotype) as adjusted for any covariates but before the offset being used has been subtracted.

    • y2, y3, ..., yn If you have selected multiple phenotypes, and you have selected FBAT-GEE as your Test Statistic (See Test Statistic Parameters), these represent the remaining phenotypes as adjusted for any covariates but before the offset has been subtracted.

      Note

      1. If you have selected FBAT-PC as your Test Statistic (See Test Statistic Parameters), the resulting weighted sum is used to form the one and only adjusted phenotype y1 to be used for the final test.
      2. The y1, y2, ..., yn displayed do not reflect the GFBAT adjustment for environmental correlation (see GFBAT), if you have specified that adjustment.
    • z1, z2, ..., zm If you have selected any phenotypes as predictors (covariates), these will be output here.

    • x This is the offspring’s coded genotype at the locus being tested. It is dependent upon the genetic model being used.

    • Ex This is the expectation of the offspring’s coded genotype based on the parental or other family genotypes.

    • Vx This is the diagonal of the variance matrix under the null hypothesis. Depending on the null hypothesis you have selected, this matrix is based upon either the parental genotypes (No linkage and no association) or upon the actual coded genotypes (Linkage and no association or Linkage and no association (sw)). (See Null Hypothesis under Test Statistic Parameters.)

    In the output spreadsheet, each of these vectors will be put into a different row. Enough extra rows will be created for every marker or haplotype (and genetic model) being tested to accommodate all of the detailed statistic vectors. All the columns not relating to detailed statistics will contain redundant values or missing values in these extra rows.

    The first detailed statistic column to be output will contain each row’s vector name (that is, y1, x, Ex, etc.). Then, one detailed statistic column will be output for every proband. Within each row, the detailed statistic column will show the individual vector component for that proband and that row.

  • Output informative families for each test: Check this box to get a list, for each marker or haplotype, of every informative family which was used for the test on that marker or haplotype.

    In the output spreadsheet, enough extra columns will be created to list all possible families.

    In each row (for each marker or haplotype and genetic model being tested), family numbers of families that were informative for that particular test will be listed. Extra columns that were not needed for a given row will be filled with missing values.

    Note

    1. Family IDs as obtained from the pedigree spreadsheet are listed here.
    2. This means that if the Alternative rapid pedigree algorithm was selected for a spreadsheet containing extended pedigrees, there may be some duplicate family IDs corresponding to different clusters of trios into which some families may have been broken up.
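The detailed statistic vectors described above (y1, x, Ex and Vx) are the ingredients of the FBAT statistic, so they can be used to see how each proband contributes to a test. The sketch below recombines them as Z = (S - E[S]) / \sqrt{V} with T = y1 - offset. It assumes a single phenotype and independent probands (for example, one offspring per family), so that only the diagonal Vx is needed. The function name and the offset parameter are hypothetical, and the result is not guaranteed to reproduce PBAT's reported p-value.

```python
import math

def fbat_from_vectors(y, x, ex, vx, offset=0.0):
    """Recombine per-proband vectors into an FBAT-style Z score and p-value.

    y  : adjusted phenotype before the offset is subtracted (y1)
    x  : coded offspring genotype
    ex : expectation of the coded genotype under the null hypothesis (Ex)
    vx : diagonal of the null variance matrix (Vx)
    """
    t = [yi - offset for yi in y]
    s_minus_es = sum(ti * (xi - exi) for ti, xi, exi in zip(t, x, ex))
    v = sum(ti ** 2 * vxi for ti, vxi in zip(t, vx))
    if v == 0:
        return None  # no proband contributes any variance
    z = s_minus_es / math.sqrt(v)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided, normal approximation
    return z, p_value

# Four affected offspring (y = 1), each with one heterozygous parent:
# three received the tested allele from that parent and one did not.
z, p_value = fbat_from_vectors(
    y=[1, 1, 1, 1],
    x=[1, 1, 1, 0],
    ex=[0.5, 0.5, 0.5, 0.5],
    vx=[0.25, 0.25, 0.25, 0.25],
)
```

In this example the squared Z score equals (3 - 1)^2 / (3 + 1), the classical TDT statistic for three transmissions and one non-transmission.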

Multiple Processes

The next tab in the PBAT Genotype Analysis dialog is the Multiple Processes tab (see PBAT Genotype Analysis dialog – Multiple Processes tab). The options on this tab let you run PBAT in multiple processes. This allows you to take advantage of multiple processors on a single machine by selecting Local Machine, or multiple machines in a distributed environment by selecting Run on Condor® Pool. If the option Divide Jobs Into Multiple Processes is not checked, PBAT will run normally on the current computer.

PBAT Genotype Analysis dialog – Multiple Processes tab

Note

Dividing jobs into multiple processes is not allowed for haplotype analysis (other than for using the Perform Rapid Analysis option).

Local Machine

With dual-core and multiple-processor systems now common desktop configurations, it is worthwhile to take full advantage of the extra CPU resources available. It may also be convenient to divide the analysis into multiple jobs to keep memory usage low when analyzing hundreds of thousands of markers.

When running multiple processes on a local machine, setting Maximum number of simultaneous jobs to be less than the total number of jobs will limit the number of jobs that can be run at one time. It is recommended to only run one concurrent job per processor. This will avoid memory access contention which severely impacts performance. So typically, this number should equal the number of processors and/or cores available on the current machine.
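As a conceptual illustration of these two settings, the sketch below divides a set of markers into near-equal jobs and takes the number of available processors as the limit on simultaneous jobs. How Golden Helix SVS actually partitions the markers is not described here, so the function and its splitting rule are assumptions.

```python
import os

def split_markers(n_markers, n_jobs):
    """Return (start, stop) index ranges that cover n_markers in n_jobs
    near-equal consecutive chunks."""
    base, extra = divmod(n_markers, n_jobs)
    ranges, start = [], 0
    for job in range(n_jobs):
        stop = start + base + (1 if job < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges

# One concurrent job per processor or core, as recommended above.
max_simultaneous_jobs = os.cpu_count() or 1

# More jobs than processors keeps each job's memory footprint small;
# only max_simultaneous_jobs of them would run at any one time.
jobs = split_markers(n_markers=500_000, n_jobs=4 * max_simultaneous_jobs)
```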

Run on Condor® Pool

Condor® is a freely available, specialized batch system for managing compute-intensive jobs in a distributed network environment. Condor® and its extensive user manuals can be found at http://www.cs.wisc.edu/condor/. As Condor® is cross-platform, you can easily set up a Condor® pool on Windows, Linux or Mac OS X based systems and take advantage of a distributed computing environment with PBAT Genotype Analysis.

To run multiple jobs through Condor®, select the Run on Condor Pool option and browse to the location of the bin folder inside the directory where Condor® was installed on the system. Click Test to have Golden Helix SVS check that Condor® is configured and connected to a central manager.

It may be advantageous to specify the creation of more jobs than the number of machines available in the Condor® pool. Condor® will properly queue jobs and even out the effect of slower and faster computers taking longer or shorter times on each job.

For instructions on how to install Condor® on your network, see Installing the Third-Party Condor Package.

Output Spreadsheet

When all of the parameters are set, click Run to begin the analysis. A progress dialog will appear. The analysis may be stopped by pressing Cancel on the progress dialog.

If the PBAT analysis finishes normally, and results were obtained using the selected parameters, a results spreadsheet will be created and displayed. If no test has enough informative families for display, no output spreadsheet will be created.

Using Output for Screening

Screening, which filters the FBAT tests to be considered, relies mainly on the “Conditional Mean Model”.

In PBAT, the screening results are output into the same spreadsheet as the results from the actual FBAT tests. This allows you to sort by the screening (power) results and select only those rows which have the highest power. The FBAT tests contained in these rows may be considered as if they had been calculated separately from the other FBAT tests, and the multiple-test correction applied only to these FBAT tests. This may be done because the screening tests are independent of the offspring genotype component of the FBAT tests themselves. Both the screening tests and the FBAT tests are conditioned on the same known quantities, namely the parental genotypes and the offspring phenotype(s).
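The procedure described above can be sketched as follows for results exported from the output spreadsheet. The row layout (dictionaries with marker, power and pvalue keys), the choice of a Bonferroni correction, and the function name are assumptions made for illustration. The absolute value is taken in case signed p-values were requested.

```python
def screen_and_test(rows, top_k, alpha=0.05):
    """Keep the top_k rows by conditional power, then apply a Bonferroni
    correction over those rows only and return the significant markers."""
    ranked = sorted(rows, key=lambda r: r["power"], reverse=True)
    selected = ranked[:top_k]
    if not selected:
        return []
    threshold = alpha / len(selected)
    return [r["marker"] for r in selected if abs(r["pvalue"]) <= threshold]

# Illustrative values only.
rows = [
    {"marker": "rs1", "power": 0.90, "pvalue": 0.004},
    {"marker": "rs2", "power": 0.10, "pvalue": 0.0001},
    {"marker": "rs3", "power": 0.75, "pvalue": 0.20},
    {"marker": "rs4", "power": 0.05, "pvalue": 0.03},
]
significant = screen_and_test(rows, top_k=2)
```

With two tests retained, the threshold is 0.05 / 2 = 0.025 rather than 0.05 / 4. Note that rs2 is not tested at all, despite its small p-value, because it was not selected by the screening step.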

Compact Format

This shorter format was developed for the database at the Channing Laboratories. It contains 17 columns plus a row label column for the marker names (or the first marker of a moving window for haplotype analysis), unless one or more of the options Output -log 10 p-values (3 additional columns), Output detailed statistics for each test (many additional columns), or Output informative families for each test (many additional columns) are selected.

The 17 columns are as follows:

  • Groupname: this is the grouping variable, if grouping is used. Otherwise, the column will be filled with the missing value “?”.

  • Group: this is the group variable value, if grouping is used. Otherwise, the column will be filled with the missing value “?”.

  • Allele: the allele or haplotype tested.

  • Freq: allele or allele combination frequency.

  • HWE: p-value of the Hardy-Weinberg test for the parents.

  • phenos: phenotype(s) used.

  • cov: covariate(s) used, if any.

  • inter: interaction variable(s) used, if any.

  • model: the genetic model used for this test.

    • 0 additive
    • 1 dominant
    • 2 recessive
    • 3 heterozygous advantage
  • test: statistical test used.

    • 1 FBAT-GEE
    • 2 FBAT-PC
    • 3 FBAT-LOGRANK
    • 4 FBAT-Wilcoxon
    • 5 optimal FBAT-LOGRANK (naive weights)
  • #infofam: the number of families that were informative for this test.

  • pvalue: p-value for the FBAT statistic.

    Note

    1. If this test also included an interaction term, this p-value is derived from the overall score test for the null hypothesis of no genetic main effect and no gene-environment interaction. This score test is based upon a “multivariate phenotype”, the first component of which is the original phenotype y_{ij} itself and the second component of which is the inner product y_{ij}z_{ij} of the original phenotype with the interaction term (which is the interaction phenotype minus the interaction offset). (See [Vansteelandt2006].)
    2. If the GFBAT adjustment for environmental correlation has been specified, this statistic will reflect that adjustment.
    3. If you have specified Display p-values as signed numbers to show the direction of the main effect, a negative sign on the p-value will denote a negative correlation between the phenotype and the number of transmitted target/disease alleles.
  • power: conditional power estimate, if screening with conditional power has been selected.

  • wald: the result of the Wald test. The values here will only be meaningful if the conditional mean model would have been meaningful for this test.

  • herit: the heritability of this trait. The heritability is defined as the proportion of phenotypic variance explained by the analyzed marker. A negative sign denotes a negative correlation between the phenotype and the number of transmitted target/disease alleles.

  • FBATI: P-value from testing the null hypothesis of no gene-environment interaction. (Also known as QBAT-I.) If no interaction term was selected, then a value of “1” will be displayed.

  • powerFBATI: power for the FBAT interaction statistic, if an interaction term and screening with conditional power were selected.

If Output -log 10 p-values is selected, these additional columns will be included in the output:

  • -log10 pvalue: -\log_{10}(\text{pvalue}), inserted to the right of the pvalue column
  • -log10 wald: -\log_{10}(\text{wald}), inserted to the right of the wald column
  • -log10 FBATI: -\log_{10}(\text{FBATI}), inserted to the right of the FBATI column

If Output detailed statistics for each test, Output informative families for each test, or both are selected, the (copious) output from these will be included after the powerFBATI column. The fields resulting from these selections are listed in Computational Details.
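When compact-format results are exported for further processing, the numeric model and test codes listed above can be translated back into names. The sketch below assumes each exported row is available as a dictionary keyed by the column names, with the codes stored as integers or numeric strings. The function name is hypothetical.

```python
MODEL_CODES = {
    0: "additive",
    1: "dominant",
    2: "recessive",
    3: "heterozygous advantage",
}
TEST_CODES = {
    1: "FBAT-GEE",
    2: "FBAT-PC",
    3: "FBAT-LOGRANK",
    4: "FBAT-Wilcoxon",
    5: "optimal FBAT-LOGRANK (naive weights)",
}

def describe_row(row):
    """Return a copy of a compact-format row with readable model and test
    names and with the missing value '?' replaced by None in the group columns."""
    out = dict(row)
    out["model"] = MODEL_CODES.get(int(row["model"]), "unknown")
    out["test"] = TEST_CODES.get(int(row["test"]), "unknown")
    for key in ("Groupname", "Group"):
        if out.get(key) == "?":
            out[key] = None
    return out

example = describe_row(
    {"Groupname": "?", "Group": "?", "Allele": "A", "model": "0", "test": 1}
)
```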

Note

See Using Output for Screening concerning output for screening tests vs. output for FBAT tests.

Normal Expanded Format

The normal expanded format output will have a varying number of columns, depending on the parameters selected and how many phenotypes are in the phenotype spreadsheet. Since a column will be present for every possible phenotype, the spreadsheet may be quite wide. However, all output statistics are visible in this format.

See Output for Time-to-Onset Analysis for the time-to-onset analysis output fields in the expanded format. Otherwise, the output spreadsheet columns in the expanded format may be divided into several categories:

  • Row label with marker information
  • Subgroup designation
  • Allele information and genetic model
  • P-value and Power
  • Phenotype columns
  • Extra columns for powers of predictor phenotypes, if necessary
  • Heritability
  • Extra columns relating to FBAT-PC, if necessary
  • Extra columns relating to interactions, if necessary
  • -log10 columns for p-values (if this output option is selected)
  • Many extra columns for the detailed statistics, if selected

Note

See MFBAT (Multi-Marker/Multi-Phenotype) Test Parameters for the extra outputs added when MFBAT testing is selected.

The column groups are:

  • Marker information:

    For SNP analysis, the marker (SNP) name is set as the row label. For haplotype analysis, the first marker (SNP) of the haplotype is set as the row label.

  • Subgroup designation:

    If you have defined sub-groups of the population, the subgroup to which the analysis was restricted is shown in the first column. The missing value “?” in the first column means that all of the samples were analyzed.

  • Marker/allele information and genetic model:

    For SNP analysis, the allele being tested is shown, followed by the following information:

    • freq: Allele frequency overall
    • HW: Hardy-Weinberg p-value overall
    • freq_parent: Allele frequency for the parents
    • HW_parents: Hardy-Weinberg p-value for the parents

    For haplotype analysis, the outputs are instead:

    • markers used: SNPs used in defining the haplotype
    • haplotype: the respective alleles separated by colons
    • hap freq: the haplotype frequency

    These columns are followed (for both SNP and haplotype analysis) by a column for the genetic model. The values in this column (model) represent:

    • 0 additive
    • 1 dominant
    • 2 recessive
    • 3 heterozygous advantage

    If “All” was selected for the genetic model, the analysis will have been run not only for each marker and allele, but also for each model. In this case, an entry in this column will show which genetic model was used for that row’s analysis.

    Following the genetic model is a column (nbr_info_fam) that contains the number of informative families for the marker specified by the row label and allele (or for the haplotype listed for the row).

  • P-values and Power:

    After the marker/allele information and the genetic model are listed in the spreadsheet, the statistical outputs are listed in the following columns:

    • pvalue(FBAT): P-value for the FBAT statistic.

      Note

      1. If this test also included an interaction term, this p-value is derived from the overall score test for the null hypothesis of no genetic main effect and no gene-environment interaction. This score test is based upon a “multivariate phenotype”, the first component of which is the original phenotype y_{ij} itself and the second component of which is the inner product y_{ij}z_{ij} of the original phenotype with the interaction term (which is the interaction phenotype minus the interaction offset). (See [Vansteelandt2006].)
      2. If the GFBAT adjustment for environmental correlation has been specified, this statistic will reflect that adjustment.
      3. If you have specified Display p-values as signed numbers to show the direction of the main effect, a negative sign on the p-value will denote a negative correlation between the phenotype and the number of transmitted target/disease alleles.
    • pvalue(FBATI): P-value for the FBAT statistic based upon only the interaction term (which is the interaction phenotype minus the interaction offset) as the “phenotype”. If no interaction term was selected, this column will be filled with ones.

    • power(FBAT): Conditional power estimate, if screening with conditional power has been selected.

    • power(FBATI): Power for the FBAT interaction statistic, if this test had an interaction term and screening with conditional power has been selected. Otherwise, this column will be filled with the significance level you have selected.

    • pvalue(Wald): P-value of the overall Wald test for a genetic effect in the conditional mean model. These values will be meaningful only if the conditional mean model would have been appropriate for this test.

    • pvalue(WaldI): P-value of the overall Wald test for a gene/covariate interaction in the conditional mean model. These values will be meaningful only if the conditional mean model would have been appropriate for this test.

  • Phenotype columns:

    A column for every phenotype (including Affection Status) that is used in the model is shown. The following notation is used:

    • 0 Not used in the analysis for this row.
    • 1 Selected as a phenotype/trait and tested for association with FBATs in this row’s results.
    • P :1. :1. : Selected and used as a covariate/predictor variable. The 1’s indicate that the covariate/predictor variable is significant at both the 5% and the 1% significance levels in the conditional mean model.
    • P :1. :0. : Selected and used as a covariate/predictor variable. The 1 indicates that the covariate/predictor variable is significant at the 5% level, and the 0 indicates that it is not significant at the 1% level in the conditional mean model.
    • P :0. :0. : Selected and used as a covariate/predictor variable. The 0’s indicate that the covariate/predictor variable is not significant at either the 5% or the 1% significance levels in the conditional mean model.
    • I Selected and used as an interaction variable in this row.
  • Extra columns for powers of predictor phenotypes, if necessary:

    If you used predictor variables with a maximum power greater than one, extra columns are included for the higher power phenotypes. The phenotype column notation indicated above is also used for these columns.

  • Heritability:

    The heritability of the selected phenotype(s) will have associated columns.

    The heritability is defined as the proportion of phenotypic variance explained by the analyzed marker. A negative sign denotes a negative correlation between the phenotype and the number of transmitted target/disease alleles.

    If you selected more than one phenotype, and you also asked for a maximum of more than one phenotype in a group, one column corresponding to each selected phenotype will appear here, and display the heritability whenever the phenotype was involved in the calculations. A value of zero will be used for uninvolved phenotypes.

  • Extra columns relating to FBAT-PC, if necessary:

    If FBAT-PC has been selected as the test statistic, one additional column will be included in the output spreadsheet for every phenotype, indicating that phenotype’s weight in the FBAT-PC calculation.

  • Extra columns relating to interactions, if necessary:

    If one or more interaction variables are selected, additional columns will be included in the output spreadsheet. These columns are (in order):

    • main effect: An estimate of the regression coefficient for the main effect.
    • Std error: Standard error for the main effect coefficient.
    • p-value: P-value for the main effect coefficient. (Also known as QBAT.)
    • interaction: An estimate of the regression coefficient for the interaction term.
    • Std error: Standard error for the interaction coefficient.
    • p-value: P-value for the interaction term coefficient. (Also known as QBAT-E.)
    • FBAT-I: P-value from testing the null hypothesis of no gene-environment interaction. (Also known as QBAT-I.)
    • h-main: The heritability of the main effect.
    • h-interaction: The heritability of the interaction.

    See [Vansteelandt2006] for further information about interactions.

  • -log10 columns for p-values:

    Additional columns containing the -\log_{10}(\text{p-value}) will be added if this output option is selected. The additional columns will be:

    • -log10 pvalue(FBAT): -\log_{10}(\text{pvalue(FBAT)}), inserted to the right of the pvalue(FBAT) column
    • -log10 pvalue(FBATI): -\log_{10}(\text{pvalue(FBATI)}), inserted to the right of the pvalue(FBATI) column
    • -log10 pvalue(Wald): -\log_{10}(\text{pvalue(Wald)}), inserted to the right of the pvalue(Wald) column
    • -log10 pvalue(WaldI): -\log_{10}(\text{pvalue(WaldI)}), inserted to the right of the pvalue(WaldI) column
  • Detailed Statistics:

    If Output detailed statistics for each test, Output informative families for each test, or both are selected, the (copious) output from these will be included here. The fields resulting from these selections are listed in Computational Details.
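The phenotype-column notation described above (0, 1, I, and the P entries with their two significance flags) can be decoded programmatically if the expanded output is exported. The sketch below assumes that the cell text matches the notation as printed in this manual, for example "P :1. :0. :", and tolerates small differences in spacing. The function name and the returned dictionary keys are hypothetical.

```python
import re

# "P", then two 0/1 flags, each preceded by a colon and optionally followed
# by a period; a trailing colon is optional.
_PREDICTOR = re.compile(r"^P\s*:\s*([01])\.?\s*:\s*([01])\.?\s*:?$")

def parse_phenotype_cell(cell):
    """Translate one phenotype-column entry into a role and significance flags."""
    text = str(cell).strip()
    if text == "0":
        return {"role": "unused"}
    if text == "1":
        return {"role": "trait"}
    if text == "I":
        return {"role": "interaction"}
    match = _PREDICTOR.match(text)
    if match:
        return {
            "role": "predictor",
            "significant_5pct": match.group(1) == "1",
            "significant_1pct": match.group(2) == "1",
        }
    raise ValueError(f"unrecognized phenotype entry: {cell!r}")

parsed = parse_phenotype_cell("P :1. :0. :")
```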

Output for Time-to-Onset Analysis

For time-to-onset analysis, the outputs are somewhat different. This output may be divided into the following categories:

  • Row label with marker information
  • Subgroup designation
  • Allele information and genetic model
  • P-value and Power
  • -log10 columns for p-values (if this output option is selected)
  • Many extra columns for the detailed statistics, if selected

Note

See MFBAT (Multi-Marker/Multi-Phenotype) Test Parameters for the extra outputs added when MFBAT testing is selected.

The column groups are:

  • Marker information:

    For SNP analysis, the marker (SNP) name is set as the row label. For haplotype analysis, the first marker (SNP) of the haplotype is set as the row label.

  • Subgroup designation:

    If you have defined subgroups of the population, the subgroup to which the analysis was restricted is shown in the first column. The missing value “?” in the first column means that all of the samples were analyzed.

  • Marker/allele information and genetic model:

    For SNP analysis, the allele being tested is shown, followed by the following information:

    • freq: Allele frequency overall
    • HW: Hardy-Weinberg p-value overall
    • freq_parent: Allele frequency for the parents
    • HW_parents: Hardy-Weinberg for the parents

    For haplotype analysis, the outputs are instead:

    • markers used: SNPs used in defining the haplotype
    • haplotype: the respective alleles separated by colons
    • hap freq: the haplotype frequency

    These columns are followed (for both SNP and haplotype analysis) by a column for the genetic model. The values in this column (model) represent:

    • 0 additive
    • 1 dominant
    • 2 recessive
    • 3 heterozygous advantage

    If “All” was selected for the genetic model, the analysis will have been run not only for each marker and allele, but also for each model. In this case, an entry in this column will show which genetic model was used for that row’s analysis.

    Following the genetic model is a column (nbr_info_fam) that contains the number of informative families for the marker specified by the row label and allele (or for the haplotype listed for the row).

  • P-values and Power:

    After the marker/allele information and the genetic model are listed in the spreadsheet, the statistical outputs are listed in the following columns:

    • FBAT-Wilcoxon: P-value for the FBAT-Wilcoxon statistic.
    • power: Power for the FBAT-Wilcoxon statistic.
    • FBAT-LOGRANK: P-value for the FBAT-LOGRANK statistic.
    • power: Power for the FBAT-LOGRANK statistic.
    • optimal FBAT-LOGRANK (FH-weights): P-value for the optimal FBAT-LOGRANK statistic (with FH-weights).
    • power: Power for the optimal FBAT-LOGRANK statistic (with FH-weights).
    • optimal FBAT-LOGRANK (naive-weights): P-value for the optimal FBAT-LOGRANK statistic (with naive-weights).
    • power: Power for the optimal FBAT-LOGRANK statistic (with naive-weights).
  • -log10 columns for p-values:

    Additional columns containing the -\log_{10}(\text{p-value}) will be added if this output option is selected. The additional columns will be:

    • -log10 FBAT-Wilcoxon: -\log_{10}(\text{FBAT-Wilcoxon}), inserted to the right of the FBAT-Wilcoxon column
    • -log10 FBAT-LOGRANK: -\log_{10}(\text{FBAT-LOGRANK}), inserted to the right of the FBAT-LOGRANK column
    • -log10 optimal FBAT-LOGRANK (FH-weights): -\log_{10}(\text{optimal FBAT-LOGRANK (FH-weights)}), inserted to the right of the optimal FBAT-LOGRANK (FH-weights) column
    • -log10 optimal FBAT-LOGRANK (naive-weights): -\log_{10}(\text{optimal FBAT-LOGRANK (naive-weights)}), inserted to the right of the optimal FBAT-LOGRANK (naive-weights) column
  • Detailed Statistics:

    If Output detailed statistics for each test, Output informative families for each test, or both are selected, the (copious) output from these options will be included here. The fields resulting from these selections are listed in Computational Details.
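The numeric codes in the model column (0 to 3, listed above) correspond to different ways of coding a genotype. The sketch below shows the conventional codings for these four models. The function name and the exact numeric values of the coded genotype are illustrative assumptions; the manual does not state the internal codings PBAT uses.

```python
# Model codes as they appear in the "model" output column.
MODEL_NAMES = {
    0: "additive",
    1: "dominant",
    2: "recessive",
    3: "heterozygous advantage",
}


def code_genotype(allele_count, model):
    """Code a genotype under one of the four genetic models.

    allele_count is the number of copies (0, 1 or 2) of the tested allele.
    The codings are the conventional ones and are an assumption here.
    """
    if allele_count not in (0, 1, 2):
        raise ValueError("allele_count must be 0, 1 or 2")
    if model == 0:   # additive: count the copies of the allele
        return allele_count
    if model == 1:   # dominant: at least one copy
        return 1 if allele_count >= 1 else 0
    if model == 2:   # recessive: two copies
        return 1 if allele_count == 2 else 0
    if model == 3:   # heterozygous advantage: exactly one copy
        return 1 if allele_count == 1 else 0
    raise ValueError("model must be 0, 1, 2 or 3")
```

For example, a heterozygous offspring is coded 1 under the additive, dominant and heterozygous advantage models, and 0 under the recessive model.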

PBAT CNV Analysis

Summary

PBAT also supports testing of copy-number variation (CNV) data in a family-based setting [Ionita-Laza2007].

The normal FBAT statistic is based on the coded genotypes of the family members being tested for each locus, and these depend on the genetic model under consideration. By contrast, the CNV FBAT statistic is based directly on the intensity values themselves, or rather on numbers derived from intensity values, such as \log_2 ratios. These intensity-derived values are used in place of the coded genotypes. This approach bypasses the need for a CNV genotyping algorithm to analyze CNV data.

To obtain the expected intensity value for an offspring, the intensity values of the respective parents are averaged. If the parental information is missing, the intensity values of the siblings are averaged. (This is in place of finding an expected genotypic coding based on the genotypes of the parents or the genotypes of the siblings.)

To obtain a variance, an empirical variance under the null hypothesis is used, since using Mendelian transmissions to compute the theoretical variance is not available in this context.
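The averaging and empirical-variance steps described above can be sketched as follows. This is a minimal illustration, not PBAT's implementation: the function names are hypothetical, one offspring per family is assumed, the caller decides which sibling values to pass when a parent is missing, and the empirical variance is taken to be the sum of squared per-offspring contributions, which is an assumption about the exact estimator.

```python
import math


def expected_intensity(father, mother, sibling_values):
    """Expected offspring intensity under the null hypothesis.

    Averages the two parental values when both are observed; otherwise
    averages the sibling values supplied by the caller.
    """
    if father is not None and mother is not None:
        return (father + mother) / 2.0
    return sum(sibling_values) / len(sibling_values)


def cnv_fbat_z(traits, intensities, expected):
    """Z = (S - E[S]) / sqrt(V), with S - E[S] = sum of T_i * (X_i - E[X_i]).

    V is an empirical variance: the sum of the squared per-offspring
    contributions (an assumed estimator, used here for illustration).
    """
    contributions = [t * (x - e) for t, x, e in zip(traits, intensities, expected)]
    s_centered = sum(contributions)
    variance = sum(c * c for c in contributions)
    if variance == 0:
        return float("nan")  # no informative offspring
    return s_centered / math.sqrt(variance)
```

Here traits holds the coded phenotypes T (phenotype minus offset), intensities holds the offspring values X, and expected holds the values returned by expected_intensity.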

All robustness properties of the genotype FBAT approach are maintained in PBAT CNV analysis. In addition, all previously-developed FBAT extensions, including FBATs for time-to-onset, multivariate FBATs, and FBAT testing strategies, can be directly transferred to the analysis of copy-number variation.

The following PBAT CNV features are available in Golden Helix SVS:

  • Computation of CNV FBAT statistics for nuclear families and for extended pedigrees.
  • Multivariate CNV FBATs for multiple phenotypes: FBAT-GEE and FBAT-PC. FBAT-GEE is based on the generalized estimating equation approach. FBAT-PC is based on principal components that maximize the effective heritability.
  • Transformation tools for continuous phenotypes that are not normally distributed.
  • Inclusion of predictor variables in the CNV FBAT statistic.
  • Inclusion of gene-environment/drug interactions in the CNV FBAT statistic.

The default settings can be changed and saved by clicking Save Options at the bottom of the PBAT CNV Analysis dialog window. See PBAT CNV Analysis dialog – Select Phenotypes tab.

To restore the defaults, select Restore Defaults.

To access this section of the manual from the analysis dialog, select Help.

Using PBAT CNV Analysis

Getting Started

The first step is to open an existing project or create a new project where you want to perform the data analysis and save the results. See Getting Started for more information on creating a new project or opening an existing one.

Once you have opened or created a project, you must import your pedigree and/or phenotype data into Golden Helix SVS. See Importing Family Pedigree Data for information on how to import pedigree and phenotype files. A properly imported pedigree file will have the six required pedigree columns at the front of the spreadsheet and the column name headers will have a blue background. See Special Features of a Pedigree Spreadsheet for more information about pedigree spreadsheets.

The copy number intensity data also needs to be loaded into the SVS project. There are many different ways to import log ratio data. See the appropriate section of Importing Your Data Into A Project that applies to your data format.

Note

  1. When creating your pedigree, remember to list the parents, even if their genotype information is not known. This ensures that siblings are grouped together properly into families.
  2. If unrelated families are listed together using the same family ID, the results will be unpredictable.

If there is additional phenotype information to be used for the PBAT analysis (over and above the Affection Status), join the pedigree and phenotype spreadsheets together, keeping unmatched rows (see Join or Merge dialog to Join a Pedigree spreadsheet to a Phenotype spreadsheet and CNV Pedigree spreadsheet joined to a CNV Phenotype spreadsheet), then take the result of this join and join the copy number intensity data to it.

Otherwise, just join the copy number intensity data to the pedigree spreadsheet.

In either case, the resulting spreadsheet will keep the pedigree columns at the front of the spreadsheet, followed by any phenotype columns, then the CNV copy number intensity data. See CNV Pedigree and Phenotype Spreadsheet joined to CNV data.

Note

The CNV copy number intensity columns must be marker mapped.

CNV Pedigree spreadsheet joined to a CNV Phenotype spreadsheet

CNV Pedigree and Phenotype Spreadsheet joined to CNV data

PBAT CNV Analysis can be performed by opening a marker mapped pedigree spreadsheet with CNV data, activating the markers to be analyzed, and by selecting Numeric > PBAT CNV Analysis. A parameter selection dialog will open.

Note

If you have many markers in your pedigree spreadsheet, it may be easiest to use Select > Column > Inactivate All Columns, to inactivate all columns. Then activate any phenotype columns as well as the columns for those markers you wish to analyze before opening the PBAT CNV Analysis dialog.

The parameters for PBAT CNV Analysis include phenotype (and other variable) selections, phenotype parameters, pedigree algorithm, the test statistic, computational parameters, and types of outputs. In the parameter selection dialog, the parameters are organized into four tabs: Select Phenotypes, Phenotype Parameters, Test Statistic and Computational, and Multiple Processes. Each tab is described below.

Select Phenotypes

The Select Phenotypes tab of the dialog allows you to select the phenotypes to test. PBAT CNV Analysis dialog – Select Phenotypes tab illustrates what the tab of this dialog looks like if there are additional phenotype columns joined to the pedigree and CNV data columns.

PBAT CNV Analysis dialog – Select Phenotypes tab

Phenotypes

In this list, select the phenotype or phenotypes to be analyzed for association with the selected markers. Multi-select operations are valid in this list box. These operations are: <Ctrl>-left-click selects multiple phenotypes one at a time, and <Shift>-left-click selects all phenotypes between the first and last selected phenotypes.

Phenotypes as predictor variables (covariates)

It may be possible that the selected phenotypes are not only associated with certain markers, but also are predicted by other phenotype variables (covariates for the test statistic). Select these other variables in this box to better determine the actual genetic effect after adjusting for the selected predictor variables.

When important covariates for the selected phenotypes are known, adding them to the conditional mean model ([Lange2002b] and [Lange2002c]) and also using them for the offset computation can increase the power of the FBAT statistic substantially.

Double-click on an item in this list to select or deselect it. An option dialog will appear. To select the variable, select the top radio button and enter the maximum power/order of the predictor variable. This determines the covariates that are added to the conditional mean model and to the offset value. For instance, entering “3” will add X_j, X_j^2, and X_j^3, where X_j is the selected predictor variable, to the model. To remove all orders of this predictor variable from the model, select the bottom radio button.
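The effect of the maximum power/order setting can be illustrated with a short sketch. The function name is hypothetical; it simply builds the columns X_j, X_j^2, ..., X_j^k that the text above says are added to the model.

```python
def predictor_powers(values, max_order):
    """Return one column per power of the predictor, from order 1 to max_order."""
    return [[x ** k for x in values] for k in range(1, max_order + 1)]
```

For a predictor with values 1.0 and 2.0 and a maximum order of 3, this gives three columns: the values themselves, their squares, and their cubes.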

Phenotypes as interaction variables

To account for interactions of one or more phenotypic variables with the marker being tested (“gene/covariate interactions”), select the interaction variables in this box.

Double-click on an item in this list to select it or deselect it. An offset selection dialog will appear. There are three options in this dialog:

  • Offset = mean: To use the mean of the selected variable as the offset, select this option.
  • Specify offset: Use this option to specify an offset for the selected variable. Enter the offset value into the Offset value box.
  • Deselect this interaction variable: To remove the selected variable as an interaction variable select this option.

Note

It is recommended that you use a particular offset choice here only when its effects need to be examined. In a standard data analysis, it is preferable to use “mean” here and allow all offsets to be computed by using one of the estimating procedures specified in the Offset drop-down menu on the next tab.
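As a rough illustration of how the interaction offset enters the analysis, the sketch below centers an interaction variable on its offset and forms the two-component phenotype (y, y*z) that the compact-format notes describe for interaction tests. The function name is hypothetical, and whether y has already been adjusted by its own offset at this point is not stated in this manual, so the input is simply treated as the phenotype value.

```python
def interaction_components(phenotypes, interaction_values, offset=None):
    """Return (y, y * z) pairs, where z is the interaction value minus its offset.

    With offset=None the mean of the interaction variable is used,
    mirroring the "Offset = mean" option.
    """
    if offset is None:
        offset = sum(interaction_values) / len(interaction_values)
    centered = [value - offset for value in interaction_values]
    return [(y, y * z) for y, z in zip(phenotypes, centered)]
```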

Subgroups

PBAT analyses may be divided into subgroups of patients (a stratified analysis). The outputs for the separate analyses of the subgroups will be provided on the same output spreadsheet, separated and categorized by subgroup.

To divide your patients into subgroups, click the box labeled Use a variable to define subgroups, and select one of the phenotype variables listed (this will be the grouping variable). Only binary, integer, and categorical variables can be used as grouping variables.

Select subgroup categories

Once the subgroup option is selected, this box becomes available and all subgroups for the selected variable are listed. Select the category or categories from the grouping variable for calculating the PBAT statistics. Multi-select operations are available in this list box.

Censoring Variables for Time-to-Onset Analysis

Time-to-onset analysis is not currently available for Golden Helix SVS CNV PBAT. Thus it will not be possible to select a censor variable.

Phenotype Parameters

The next tab in the PBAT CNV Analysis dialog is the Phenotype Parameters tab (see PBAT CNV Analysis dialog – Phenotype Parameters tab).

PBAT CNV Analysis dialog – Phenotype Parameters tab

Maximum and Minimum Number of Phenotypes per Group

  • FBAT-GEE statistic: (See FBAT-GEE under Test Statistic Parameters.) If more than one phenotype is selected, the test can be performed against all of the phenotypes as one group, just one phenotype at a time, or any number of phenotypes combined together. Testing against more than one phenotype at a time will result in a multivariate test. To select the number of phenotypes to “group together” when testing, set the minimum and maximum number in the Min number of phenotypes per group and Max number of phenotypes per group.
  • FBAT-PC statistic: (See FBAT-PC under Test Statistic Parameters.) The FBAT-PC statistic may be used to find the relative weights of many phenotypes within a PBAT principal component analysis. Set both Max number of phenotypes per group and Min number of phenotypes per group to the number of phenotypes selected. FBAT-PC tests against every phenotype individually as a part of its analysis. Select the non-compact output format (Output Format) to see the weight of each phenotype within the principal component.
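To get a feel for how the minimum and maximum group sizes affect the number of FBAT-GEE tests, the sketch below enumerates phenotype groups. It assumes that every combination of selected phenotypes whose size lies between the two limits is tested; the function name and phenotype names are hypothetical, and the manual does not spell out the exact enumeration.

```python
from itertools import combinations


def phenotype_groups(phenotypes, min_size, max_size):
    """List every group of phenotypes with min_size <= group size <= max_size."""
    groups = []
    for size in range(min_size, max_size + 1):
        groups.extend(combinations(phenotypes, size))
    return groups
```

With three selected phenotypes, limits of 1 and 2 give six groups (three single phenotypes and three pairs), while setting both limits to 3 gives a single group containing all phenotypes, as recommended above for FBAT-PC.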

Offset Choice

The phenotype offset may be specified in this menu and, when applicable, the following text box.

The final trait used in FBAT calculations is the original phenotype value minus the offset.

The offset accomplishes two purposes:

  1. Increases the power of the FBAT statistic by centering the trait, that is, by removing the mean of the original phenotype.
  2. Incorporates covariates and interaction variables into the FBAT statistic.

The offset choices in this menu are:

  • No offset: No offset is used; only the original phenotype value is used. Neither covariates nor interaction variables are incorporated into the FBAT statistic. (Useful for affected-only analyses.)

  • Optimal power: Use the offset that maximizes the power of the FBAT-statistic (computationally slow, efficiency dependent on the correct choice of the mode of inheritance).

  • Phenotypic residuals (including E(X|HO)): Offset is based on standard phenotypic residuals obtained by GEE-estimation which includes the expected intensity value (E(X|H_0)) as well as all covariates and interaction variables. (This differs from standard phenotypic residuals only in the inclusion of the expected intensity value.)

  • Standard phenotypic residuals: Offset is based on standard phenotypic residuals obtained by GEE-estimation which includes all covariates and interaction variables.

    In other words, the trait used in the test will be the difference between the observed phenotype and a predicted phenotype, with the predicted phenotype serving as the offset. This predicted phenotype comes from a regression model that regresses the observed phenotype on all of the selected covariates and interaction variables. If there are no covariates or interaction variables selected, this amounts to subtracting the mean phenotype value (for a continuous phenotype), or the sample prevalence (for a dichotomous phenotype).

  • Specify here: (User-specified offset.) Enter the offset to use in the text box to the right of this menu. (Useful for unaffected-only studies, for which you would use an offset of 1, or when the effects of a particular offset need to be examined.)

Normally, it is recommended to use Standard phenotypic residuals, except in the case of affected-only studies, where it is normally recommended to use No offset.

Other possibilities include:

  • Unaffected-only studies (use an offset of 1).
  • Other studies using binary traits (use the disease prevalence).
  • Total population samples and ascertained samples where the quantitative trait is not highly correlated with the ascertainment criteria (the offset should approximate the phenotypic mean–use Standard phenotypic residuals).
  • Ascertained samples where the quantitative trait is highly correlated with the ascertainment criteria (dichotomize and set the offset to 0–No offset).
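The relationship between phenotype, offset, and the trait used in the test (trait = phenotype minus offset) can be sketched for the Standard phenotypic residuals choice. This is a simplified illustration with hypothetical function names: it uses ordinary least squares with at most one covariate as a stand-in for the GEE estimation that PBAT performs.

```python
def fitted_offsets(phenotypes, covariate=None):
    """Predicted phenotype for each sample, used as the offset.

    Without a covariate the prediction is the phenotype mean (or the
    sample prevalence for a 0/1 phenotype). With one covariate it is the
    least-squares line, an assumed stand-in for GEE estimation.
    """
    n = len(phenotypes)
    y_bar = sum(phenotypes) / n
    if covariate is None:
        return [y_bar] * n
    x_bar = sum(covariate) / n
    sxx = sum((x - x_bar) ** 2 for x in covariate)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(covariate, phenotypes))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return [intercept + slope * x for x in covariate]


def coded_traits(phenotypes, offsets):
    """Trait used in the FBAT calculation: phenotype minus offset."""
    return [y - o for y, o in zip(phenotypes, offsets)]
```

Passing a constant list of zeros as the offsets reproduces the No offset choice, and a constant list of a user-chosen value reproduces Specify here.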

Compute All Predictor Sub-Models

Check the Compute all predictor sub-models box to use the covariates (predictors) in all possible combinations, in separate tests.

Uncheck this box to use all of the covariates combined together in one test.
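The number of tests generated by Compute all predictor sub-models grows quickly with the number of covariates. The sketch below enumerates the non-empty subsets of the selected covariates; the function name and covariate names are hypothetical, and whether PBAT also runs the model with no covariates is not stated here.

```python
from itertools import combinations


def predictor_submodels(covariates):
    """List every non-empty combination of the selected covariates."""
    return [
        combo
        for size in range(1, len(covariates) + 1)
        for combo in combinations(covariates, size)
    ]
```

Three covariates yield 7 sub-models, and k covariates yield 2^k - 1, so checking this box with many covariates can multiply the run time considerably.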

Transformations

The phenotypes can be used as is without a transformation, or the selected phenotypes can be transformed to ranks or Z-scores (normal scores). There is a similar choice for the selected predictor variables and also for the selected interaction variables. In practice, transforming the data to normal scores is recommended, since the asymptotic convergence of the FBAT statistic is then robust against outliers and skewed data [Lange2002a].
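The two transformations can be sketched as follows. The function names are hypothetical, and the normal-score formula is an assumption: the sketch maps each rank r out of n to the standard normal quantile of r / (n + 1), which is one common definition of normal scores. The manual does not state which definition PBAT uses.

```python
from statistics import NormalDist


def to_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = average_rank
        pos = end + 1
    return ranks


def to_normal_scores(values):
    """Map ranks to standard normal quantiles of rank / (n + 1) (assumed formula)."""
    n = len(values)
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf(r / (n + 1)) for r in to_ranks(values)]
```

Because the scores depend only on the ordering of the data, a single extreme phenotype value cannot dominate the statistic after this transformation.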

Alternative Rapid Pedigree Algorithm

Check Use alternative rapid pedigree algorithm to use a new algorithm for processing extended pedigrees. This is currently the default pedigree algorithm. Uncheck this box to use the standard pedigree algorithm.

Please see Alternative Rapid Pedigree Algorithm for a full explanation of each of the two pedigree algorithms and the advantages and disadvantages of each of them when analyzing genotypic data.

Computationally, because a simple averaging technique is used to infer the expected marker scores, PBAT CNV analysis of extended pedigrees under the standard pedigree algorithm does not suffer from long computation times in the same way that PBAT analysis of genotypic data can under the same circumstances. However, for the sake of completeness, both pedigree algorithms are offered for CNV analysis.

Test Statistic and Computational

The next tab in the PBAT CNV Analysis dialog is the Test Statistic and Computational tab (see PBAT CNV Analysis dialog – Test Statistic and Computational tab). On this tab there are options to specify the test statistic parameters, the computational parameters, and the output parameters.

PBAT CNV Analysis dialog – Test Statistic and Computational tab

Test Statistic Parameters

  • Test Statistics: select one of the following test statistics as appropriate.

    • FBAT-GEE: generalized estimating equation for FBAT. If one phenotype is selected, the FBAT-GEE statistic simplifies to the standard univariate FBAT-statistic. If several phenotypes are selected, all phenotypes are tested simultaneously using FBAT-GEE.

      For FBAT-GEE:

      • Both binary and continuous phenotypes will work.
      • Can combine phenotypes with different distributions (e.g. continuous and ordinal).
      • For each phenotype, an additional degree of freedom is used.
      • This statistic is less suitable for a large number of phenotypes, since each phenotype uses an additional degree of freedom.

      Generally, the FBAT-GEE statistic can handle a moderate amount of any type of multivariate data, including groups of dichotomous phenotypes.

    • FBAT-PC: principal components FBAT extension for longitudinal phenotypes, repeated measurements and correlated phenotypes.

      This method tests a weighted sum of all the measurements, with the weights determined so as to maximize the genetic component of the overall phenotypes and to minimize the phenotypic/environmental variance. Generalized principal component analysis is used to determine these weights.

      For FBAT-PC:

      • All phenotypes must have the same distribution.
      • Degrees of freedom always equals one regardless of how many phenotypes are used.
      • As the number of phenotypes increases the power increases.
      • Quantitative phenotypes are preferable.
      • Good for a large number of phenotypes.
      • Can be its own type of marker “screening” test, since small genetic effects are amplified.

      Generally, FBAT-PC is more powerful than FBAT-GEE if the phenotypes are correlated and quantitative.

GFBAT

To adjust the FBAT statistic for environmental correlation between the traits of multiple siblings in a family (GFBATs), select this option [Lange2002b].

Computational Parameters

The following options set the remaining computational parameters.

  • Maximal iterations for GEE: Enter the maximal number of iteration steps in the GEE-estimation procedure. Enter “0” to use least-squares residuals. Otherwise, GEE residuals are computed (useful when multiple correlated phenotypes are analyzed). This choice will be active only if the FBAT-GEE statistic is selected.

  • Significance level: Enter the significance level to be used for the power calculations.

    Typically, 0.0005 might be used. However, for logrank tests, a higher significance level, such as 0.01, is preferable.

Output Format

The parameters in this box select alternative and/or additional outputs to be included in the resulting spreadsheet.

  • Use compact output format: Select this option to output the shorter format that was developed for the database at the Channing Laboratories. This format is normally guaranteed to contain 17 columns plus a row label column for the marker names. The exceptions to this are as follows:

    • Output -log 10 p-values (see below) is selected. This will add exactly 3 additional columns to the output.
    • Output detailed statistics for each test (see Computational Details) is selected.
  • Display p-values as signed numbers to show the direction of the main effect: Select this option to place a negative sign on the p-value when there is a negative correlation between the phenotype and the difference between the actual intensity value and the expected intensity value. If this option is not selected, all p-values will be displayed as positive numbers.

    Note

    Signed p-values are not available when more than one phenotype is being tested at a time under FBAT-GEE, or when testing for interactions.

  • Output -log 10 p-values: Select this option to output -\log_{10}(\text{p-value}) for all p-values in the output, in addition to the p-values themselves.
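The -log10 transformation is convenient for plotting and sorting, because smaller p-values become larger numbers. A minimal sketch follows; the function name is hypothetical, and taking the absolute value first (so that signed p-values can be passed in) is an assumption about how signed values are handled, not documented behavior.

```python
import math


def neg_log10(p_value):
    """Return -log10 of a p-value, ignoring any sign used to show direction."""
    magnitude = abs(p_value)
    if magnitude == 0:
        return math.inf  # a p-value of exactly zero has no finite -log10
    return -math.log10(magnitude)
```

Under this transformation a p-value of 0.001 becomes 3, and a p-value of 1 becomes 0.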

Computational Details

The following checkbox allows you to see detailed data relating to the individual tests. Bear in mind that depending upon your input data, checking this may result in a large volume of data being output.

  • Output detailed statistics for each test: Check this box to examine, for each marker being tested, the individual vector component values, one for each proband, of the following vectors:

    • y1 The value of the phenotype (or the first phenotype) as adjusted for any covariates and minus the offset being used.

    • y2, y3, ..., yn If you have selected multiple phenotypes, and you have selected FBAT-GEE as your Test Statistic (See Test Statistic Parameters), these represent the remaining phenotypes as adjusted for any covariates and minus the offset being used.

      Note

      1. If you have selected FBAT-PC as your Test Statistic (See Test Statistic Parameters), the resulting weighted sum is used to form the one and only adjusted phenotype y1 to be used for the final test.
      2. At this time, the y1, y2, ..., yn displayed do not reflect the GFBAT adjustment for environmental correlation (see GFBAT), if you have specified that adjustment.
    • x This is the offspring’s intensity value at the locus being tested.

    • Ex This is the expectation of the offspring’s intensity value based on the parental or sibling intensity values.

    • Vx This is the diagonal of the variance matrix under the null hypothesis. For CNV analysis, this matrix is always based upon the actual intensity values.

    In the output spreadsheet, each of these vectors will be put into a different row. Enough extra rows will be created for every marker being tested to accommodate all of the detailed statistic vectors. All the columns not relating to detailed statistics will contain redundant values or missing values in these extra rows.

    The first detailed statistic column to be output will contain each row’s vector name (that is, y1, x, Ex, etc.). Then, one detailed statistic column will be output for every proband. Within each row, the detailed statistic column will show the individual vector component for that proband and that row.

Multiple Processes

The next tab in the PBAT CNV Analysis dialog is the Multiple Processes tab (see PBAT CNV Analysis dialog – Multiple Processes tab). On this tab there are options through which you can choose to run PBAT in multiple processes. This allows you to take advantage of multiple processors on a single machine by selecting Local Machine, or multiple machines in a distributed environment by selecting Run on Condor® Pool. If the option Divide Jobs Into Multiple Processes is not checked, PBAT will run normally on the current computer.

PBAT CNV Analysis dialog – Multiple Processes tab

Local Machine

Multi-core and multiple-processor systems are common desktop configurations, and running PBAT in several processes takes full advantage of the extra CPU resources available. It may also be convenient to divide the analysis into multiple jobs to keep memory usage low when analyzing hundreds of thousands of markers.

When running multiple processes on a local machine, setting Maximum number of simultaneous jobs to be less than the total number of jobs will limit the number of jobs that can be run at one time. It is recommended to only run one concurrent job per processor. This will avoid memory access contention which severely impacts performance. So typically, this number should equal the number of processors and/or cores available on the current machine.
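The idea of dividing the markers into jobs and capping the number of simultaneous jobs at the processor count can be sketched as follows. This only illustrates the arithmetic; the function name is hypothetical, and how Golden Helix SVS actually assigns markers to jobs is not described in this manual.

```python
import os


def split_into_jobs(markers, n_jobs):
    """Split the marker list into n_jobs contiguous chunks of near-equal size."""
    n_jobs = max(1, min(n_jobs, len(markers)))
    base, extra = divmod(len(markers), n_jobs)
    chunks = []
    start = 0
    for job in range(n_jobs):
        # The first `extra` jobs receive one additional marker each.
        size = base + (1 if job < extra else 0)
        chunks.append(markers[start:start + size])
        start += size
    return chunks


# One concurrent job per processor or core, as recommended above.
max_simultaneous_jobs = os.cpu_count() or 1
```

Creating more jobs than max_simultaneous_jobs keeps each job's memory footprint small, while the cap prevents the jobs from competing for the same processors.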

Run on Condor® Pool

Condor® is a freely available, specialized, batch system for managing compute-intensive jobs on a distributed network environment. Condor® and its extensive user manuals can be found at http://www.cs.wisc.edu/condor/. As Condor® is cross-platform, you can easily set up a Condor® pool on Windows, Linux or Mac OS X based systems and take advantage of a distributed computing environment with PBAT CNV Analysis.

To run multiple jobs through Condor®, select the Run on Condor Pool option and browse to the location of the bin folder inside the directory where Condor® was installed on the system. Click Test to have Golden Helix SVS check that Condor® is configured and connected to a central manager.

It may be advantageous to specify the creation of more jobs than the number of machines available in the Condor® pool. Condor® will properly queue jobs and even out the effect of slower and faster computers taking longer or shorter times on each job.

For instructions on how to install Condor® on your network, see Installing the Third-Party Condor Package.

Output Spreadsheet

When all of the parameters are set, click Run to begin the analysis. A progress dialog will appear. The analysis may be stopped by pressing Cancel on the progress dialog.

If the PBAT analysis finishes normally, and results were obtained using the selected parameters, a results spreadsheet will be created and displayed.

Using Output for Screening

Screening is used to select which FBAT tests are considered, and the main screening technique is based on the “Conditional Mean Model”.

In PBAT, the screening results are output into the same spreadsheet as the results from the actual FBAT tests. This allows sorting by the screening (power) results, and selecting only those results which have the most significant power. The FBAT tests which are contained in these same spreadsheet rows (indicating the tests with the most power) may be considered as if they had been calculated separately from the other FBAT tests, and the multiple-test correction applied only to these FBAT tests. This may be done because the screening tests are independent of the offspring intensity component of the FBAT tests themselves. Both the screening tests and the FBAT tests are conditioned on the same known quantities, namely the parental intensities and the offspring phenotype(s).

Note

At this time, FBAT CNV power calculations are not available. Therefore, you must use the results of the Wald test for screening. Select the FBAT test results which are in the output spreadsheet rows that have the most significant Wald p-values, and apply the multiple-test correction to these FBAT results.
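The screening procedure in the note above can be sketched on rows exported from the output spreadsheet. The column names wald and pvalue follow the compact format; the function name and marker names are hypothetical, unsigned p-values are assumed, and the Bonferroni correction is used only as one example of a multiple-test correction.

```python
def screen_and_correct(rows, top_k, alpha=0.05):
    """Keep the top_k rows with the smallest Wald p-values, then test their
    FBAT p-values against a Bonferroni threshold of alpha / (rows kept).

    Each row is a dict with the keys "marker", "wald" and "pvalue".
    Returns (marker, FBAT p-value, significant) tuples in screening order.
    """
    ranked = sorted(rows, key=lambda row: row["wald"])
    selected = ranked[:top_k]
    if not selected:
        return []
    threshold = alpha / len(selected)
    return [(row["marker"], row["pvalue"], row["pvalue"] <= threshold) for row in selected]
```

Because the correction is applied only to the selected rows, the threshold is alpha divided by the number of rows kept rather than by the total number of markers tested.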

Compact Format

This shorter format was developed for the database at the Channing Laboratories. It is guaranteed to contain 17 columns plus a row label column for the marker names unless Output -log 10 p-values (3 additional columns), Output detailed statistics for each test (many additional columns), or both are selected.

The 17 columns are as follows:

  • Groupname: this is the grouping variable, if grouping is used. Otherwise, the column will be filled with the missing value “?”.

  • Group: this is the group variable value, if grouping is used. Otherwise, the column will be filled with the missing value “?”.

  • Allele: this column is not relevant for CNV analysis and will be filled with the missing value “?”.

  • Freq: this column is not relevant for CNV analysis and will be filled with the missing value “?”.

  • HWE: this column is not relevant for CNV analysis and will be filled with the missing value “?”.

  • phenos: phenotype(s) used.

  • cov: covariate(s) used, if any.

  • inter: interaction variable(s) used, if any.

  • model: this column is not relevant for CNV analysis and will be filled with 0’s.

  • test: statistical test used.

    • 1 FBAT-GEE
    • 2 FBAT-PC
  • #infofam: this column is not relevant for CNV analysis and will be filled with 0’s.

  • pvalue: p-value for the FBAT statistic.

    Note

    1. If this test also included an interaction term, this p-value is derived from the overall score test for the null hypothesis of no genetic main effect and no gene-environment interaction. This score test is based upon a “multivariate phenotype”, the first component of which is the original phenotype y_{ij} itself and the second component of which is the inner product y_{ij}z_{ij} of the original phenotype with the interaction term (which is the interaction phenotype minus the interaction offset). (See [Vansteelandt2006].)
    2. If the GFBAT adjustment for environmental correlation has been specified, this statistic will reflect that adjustment.
    3. If you have specified Display p-values as signed numbers to show the direction of the main effect, a negative sign on the p-value will denote a negative correlation between the phenotype and the difference between the actual intensity value and the expected intensity value.
  • power: this column will be filled with the selected significance level, since power calculations are not currently available for PBAT CNV.

  • wald: the result of the Wald test.

  • herit: this column is not relevant for CNV analysis and will be filled with 0’s.

  • FBATI: P-value from testing the null hypothesis of no gene-environment interaction. (Also known as QBAT-I.) If no interaction term was selected, then a value of “1” will be displayed.

  • powerFBATI: this column will be filled with the selected significance level, since power calculations are not currently available for PBAT CNV.

If Output -log 10 p-values is selected, these additional columns will be included in the output:

  • -log10 pvalue: -\log_{10}(\text{pvalue}), inserted to the right of the pvalue column
  • -log10 wald: -\log_{10}(\text{wald}), inserted to the right of the wald column
  • -log10 FBATI: -\log_{10}(\text{FBATI}), inserted to the right of the FBATI column
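
As an illustration of the transformation behind these columns, the following minimal Python sketch computes -\log_{10} of a p-value outside of SVS. The function names are hypothetical and are not part of SVS or PBAT. The sketch assumes that a missing value is represented as None and that, when Display p-values as signed numbers is in use, the sign only encodes the direction of the effect, so the magnitude is taken before the logarithm. How SVS itself treats signed or missing values in these columns is not stated here.

```python
import math


def neg_log10_p(p):
    """Return -log10(|p|), or None if p is missing or not a valid p-value.

    Assumption: a negative sign on p only marks the direction of the
    main effect, so the absolute value is used for the transformation.
    """
    if p is None:
        return None
    magnitude = abs(p)
    if magnitude <= 0.0 or magnitude > 1.0:
        return None
    return -math.log10(magnitude)


def effect_direction(p):
    """Return -1 for a negatively signed p-value, +1 otherwise, None if missing."""
    if p is None:
        return None
    return -1 if p < 0 else 1


if __name__ == "__main__":
    # Hypothetical values, as they might be copied from a pvalue column.
    for p in (0.05, -0.001, 1.0, None):
        print(p, neg_log10_p(p), effect_direction(p))
```

Under these assumptions, an FBATI value of 1 (displayed when no interaction term was selected) maps to 0 on the -\log_{10} scale.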

If Output detailed statistics for each test is selected, the (copious) output from this feature will be included after the powerFBATI column. The fields resulting from these selections are listed in Computational Details.

Note

See Using Output for Screening concerning output for screening tests vs. output for FBAT tests.

Normal Expanded Format

The normal expanded format output will have a varying number of columns, depending on the parameters selected and how many phenotypes are in the phenotype spreadsheet. Since a column will be present for every possible phenotype, the spreadsheet may be quite wide. However, all output statistics are visible in this format.

The output spreadsheet columns in the expanded format may be divided into several categories:

  • Row label with marker information
  • Subgroup designation
  • P-values
  • Phenotype columns
  • Extra columns for powers of predictor phenotypes, if necessary
  • Extra columns relating to FBAT-PC, if necessary
  • Extra columns relating to interactions, if necessary
  • -log10 columns for p-values (if this output option is selected)
  • Many extra columns for the detailed statistics, if selected

The column groups are:

  • Marker information:

    The marker name is set as the row label.

  • Subgroup designation:

    If you have defined sub-groups of the population, the subgroup to which the analysis was restricted is shown in the first column. The missing value “?” in the first column means that all of the samples were analyzed.

  • P-values:

    The statistical outputs are listed in the following columns:

    • pvalue(FBAT): P-value for the FBAT statistic.

      Note

      1. If this test also included an interaction term, this p-value is derived from the overall score test for the null hypothesis of no genetic main effect and no gene-environment interaction. This score test is based upon a “multivariate phenotype”, the first component of which is the original phenotype y_{ij} itself and the second component of which is the inner product y_{ij}z_{ij} of the original phenotype with the interaction term (which is the interaction phenotype minus the interaction offset). (See [Vansteelandt2006].)
      2. If the GFBAT adjustment for environmental correlation has been specified, this statistic will reflect that adjustment.
      3. If you have specified Display p-values as signed numbers to show the direction of the main effect, a negative sign on the p-value will denote a negative correlation between the phenotype and the difference between the actual intensity value and the expected intensity value.
    • pvalue(FBATI): P-value for the FBAT statistic based upon only the interaction term (which is the interaction phenotype minus the interaction offset) as the “phenotype”. If no interaction term was selected, this column will be filled with ones.

    • pvalue(Wald): P-value of the overall Wald test for a genetic effect in the conditional mean model.

    • pvalue(WaldI): P-value of the overall Wald test for a gene/covariate interaction in the conditional mean model.

  • Phenotype columns:

    A column for every phenotype (including Affection Status) that is used in the model is shown. The following notation is used:

    • 0 Not used in the analysis for this row.
    • 1 Selected as a phenotype/trait and tested for association with FBATs in this row’s results.
    • P :1. :1. : Selected and used as a covariate/predictor variable. The 1’s indicate that the covariate/predictor variable is significant at both the 5% and the 1% significance levels in the conditional mean model.
    • P :1. :0. : Selected and used as a covariate/predictor variable. The 1 indicates that the covariate/predictor variable is significant at the 5% level, and the 0 indicates that it is not significant at the 1% level in the conditional mean model.
    • P :0. :0. : Selected and used as a covariate/predictor variable. The 0’s indicate that the covariate/predictor variable is not significant at either the 5% or the 1% significance levels in the conditional mean model.
    • I Selected and used as an interaction variable in this row.
  • Extra columns for powers of predictor phenotypes, if necessary:

    If you used predictor variables with a maximum power greater than one, extra columns are included for the higher power phenotypes. The phenotype column notation indicated above is also used for these columns.

  • Extra columns relating to FBAT-PC, if necessary:

    If FBAT-PC has been selected as the test statistic, one additional column will be included in the output spreadsheet for every phenotype, indicating that phenotype’s weight in the FBAT-PC calculation.

  • Extra columns relating to interactions, if necessary:

    If one or more interaction variables are selected, additional columns will be included in the output spreadsheet. These columns are (in order):

    • main effect: An estimate of the regression coefficient for the main effect.
    • Std error: Standard error for the main effect coefficient.
    • p-value: P-value for the main effect coefficient. (Also known as QBAT.)
    • interaction: An estimate of the regression coefficient for the interaction term.
    • Std error: Standard error for the interaction coefficient.
    • p-value: P-value for the interaction term coefficient. (Also known as QBAT-E.)
    • FBAT-I: P-value from testing the null hypothesis of no gene-environment interaction. (Also known as QBAT-I.)
    • h-main: The heritability of the main effect.
    • h-interaction: The heritability of the interaction.

    See [Vansteelandt2006] for further information about interactions.

  • -log10 columns for p-values:

    Additional columns containing the -\log_{10}(\text{p-value}) will be added if this output option is selected. The additional columns will be:

    • -log10 pvalue(FBAT): -\log_{10}(\text{pvalue(FBAT)}), inserted to the right of the pvalue(FBAT) column
    • -log10 pvalue(FBATI): -\log_{10}(\text{pvalue(FBATI)}), inserted to the right of the pvalue(FBATI) column
    • -log10 pvalue(Wald): -\log_{10}(\text{pvalue(Wald)}), inserted to the right of the pvalue(Wald) column
    • -log10 pvalue(WaldI): -\log_{10}(\text{pvalue(WaldI)}), inserted to the right of the pvalue(WaldI) column
  • Detailed Statistics:

    If Output detailed statistics for each test is selected, the (copious) output from this feature will be included here. The fields resulting from these selections are listed in Computational Details.
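
If the expanded-format spreadsheet is exported for downstream processing, the phenotype column notation described above can be decoded programmatically. The following Python sketch is one possible way to do this. The function name and the returned dictionary keys are hypothetical, and the sketch assumes that exported cells contain exactly the codes listed above (0, 1, I, or P followed by two 0/1 flags for the 5% and 1% significance levels, in that order); the exact spacing and punctuation in an exported file may differ.

```python
import re


def parse_phenotype_cell(cell):
    """Decode one phenotype-column cell of the expanded format.

    Assumed codes (taken from the notation list above):
      "0"           -> not used in this row
      "1"           -> tested as a phenotype/trait
      "I"           -> used as an interaction variable
      "P :a. :b. :" -> covariate/predictor; a and b are 0/1 flags for
                       significance at the 5% and 1% levels
    """
    text = cell.strip()
    if text == "0":
        return {"role": "unused"}
    if text == "1":
        return {"role": "trait"}
    if text == "I":
        return {"role": "interaction"}
    if text.startswith("P"):
        # Pull out the two 0/1 flags, ignoring colons, dots and spaces.
        flags = re.findall(r"[01]", text[1:])
        if len(flags) != 2:
            raise ValueError(f"unexpected predictor code: {cell!r}")
        return {
            "role": "predictor",
            "sig_5pct": flags[0] == "1",
            "sig_1pct": flags[1] == "1",
        }
    raise ValueError(f"unrecognized phenotype code: {cell!r}")


if __name__ == "__main__":
    for code in ("0", "1", "P :1. :0. :", "I"):
        print(repr(code), parse_phenotype_cell(code))
```

Raising an error on unrecognized codes, rather than guessing, keeps any mismatch between this assumed notation and the actual export visible.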