Genotype Data Quality Assessment and Utilities

To ensure data is of the highest quality, SVS provides a variety of features that not only help assess the quality of data, but remedy any problems as well.

Genotype Menu

The following data quality tools are available from the Genotype Menu (see Genotype Menu and Quality Assurance Submenu):

Genotype Menu and Quality Assurance Submenu

  • Genotype Statistics by Marker

    Determine how closely respective genotypes in a dataset approximate a state of Hardy-Weinberg Equilibrium (HWE) by calculating HWE p-values, Fisher’s Exact Test for HWE p-values, and HWE correlation R values. This option also allows for the calculation of allele frequencies for both the major and minor alleles and can provide allele and genotype counts for each marker in the dataset. See Genotype Statistics by Marker for more information.

  • Genotype Filtering by Marker

    Determine how closely respective genotypes in a dataset approximate a state of Hardy-Weinberg Equilibrium (HWE) by calculating HWE p-values, Fisher’s Exact Test for HWE p-values, and HWE correlation R values. This option also allows for the exclusion of markers that are out of HWE, that have a minor allele frequency or call rate below a user-specified threshold, or that do not meet other quality control thresholds. See Genotype Filtering by Marker for more information.

  • Genotype Statistics by Sample

    Determine if genotypes for samples violate population-based transmission Hardy-Weinberg principles. See Genotype Statistics by Sample for more information.

  • Quality Assurance and Utilities Menu

    • Identity by Descent Estimation

      Estimate Identity by Descent (IBD) between all pairs of individuals, based on the data in a genotypic spreadsheet. This function should mainly be used as a quality control measure. The samples are required to be row wise, and only the autosomal genotype columns should be active. See Identity by Descent Estimation for more information.

      Note

      It is usually advisable to apply LD pruning before using this feature.

    • Fixation Index Fst

      Estimate the Fixation Index F_{st} between pairs of subpopulations represented by groups of samples of data in a genotypic spreadsheet. See Fixation Index Fst and Fixation Index Fst (by Marker) for more information.

    • Fixation Index Fst (by Marker)

      Estimate the Fixation Index F_{st} for each marker in the spreadsheet. See Fixation Index Fst and Fixation Index Fst (by Marker) for more information.

    • GBLUP Genomic Relationship Matrix

      Estimate the genomic relationship between samples. This matrix can be used for further analysis with Mixed Linear Model Analysis or Genomic BLUP (GBLUP). Hemizygous chromosomes are allowed and can be corrected for in this computation if gender information is available in the spreadsheet. See Separately Computing the Genomic Relationship Matrix for more information.

    • Compute Kinship A Matrix from Pedigree

      This function computes the Numerator Relationship Matrix (which is one type of kinship matrix, sometimes called the “A Matrix”) based on the current spreadsheet’s pedigree information.

      For more information about the details of this function, see Computing the Numerator Relationship Matrix.

    • Filter Samples by Call Rates

      Filter the rows of the spreadsheet (either samples or markers) based on the call rate for the row. See Filter Samples by Call Rate for more information.

    • LD Pruning

      Inactivate (“prune”) genotypic data from the active columns of the current spreadsheet based on pairwise LD. If any pair of markers which are both within a moving window has LD greater than the specified threshold, the first marker of the pair will be inactivated. See LD Pruning for more information.

    • SNP Density

      Report the SNP density of the current marker mapped genotypic columns. See SNP Density for more information.

    • Mendelian Error Check

      Count and report all Mendelian errors. Optionally, replace all errors with missing genotype calls. See Mendelian Error Check for more information.

    • Inbreeding Coefficients

      Calculate the inbreeding coefficients of the individuals corresponding to samples by looking at the samples’ autosomal data. This function requires a marker mapped spreadsheet containing genotype columns with samples row wise. See Inbreeding Coefficients for more information.

      Note

      It is usually advisable to apply LD pruning before using this feature.

  • LD Reports Menu

    Contains various tools for generating LD Reports. See LD Reports for more information.

  • Genotype Principal Component Analysis

    Adjust for population stratification on genotypic markers. See Genotypic Principal Component Analysis for more information.

  • PBAT Family-Based QA

    Determine if genotypes violate PBAT family-based quality control measures, including Mendelian errors. See PBAT Family-Based QA Statistics and PBAT Family-Based Analysis for more information.

Genotype Statistics by Marker

Several types of overall marker statistics and genetic measures are available as output (see Genotype Statistics By Marker – Allele Frequency Classification and Genotype Statistics By Marker – Reference and Alternate Allele Classification).

Genotype Statistics By Marker – Allele Frequency Classification

Genotype Statistics By Marker – Reference and Alternate Allele Classification

Note

  1. Most of these statistics are output only for bi-allelic markers, or, if markers are classified according to reference/alternate alleles, for markers containing one alternate allele. For the other markers, only the statistics that make sense for those other markers are output.
  2. If there is a case/control dependent variable or a categorical dependent variable with fewer than 30 categories, statistics will be output for all samples and for individual sample categories.
  3. If there is a quantitative dependent variable, statistics will be output for all samples and for the categories of dependent not missing and dependent missing.
  4. These statistics can be calculated simultaneously with running a genotypic association test. To do this, see section Genotype Association Tests. (If PCA correction is also used and there are PCA outliers, a separate category of statistics will be output for samples comprising these outliers.)

Warning

Statistics calculated by this function do not adjust for gender and are therefore not always appropriate for non-autosomal chromosomes.

Data Requirements

General marker statistics require a dataset containing genotypic data. Optionally, case/control, categorical, or quantitative phenotype data can be used to subdivide most of these statistics according to “case” and “control”, category, or missing/not missing status. With a quantitative dependent variable, an average of the dependent variable is also output for each genotype category.

First, import your data into a Golden Helix SVS project (see Importing Your Data Into A Project). Once you have the spreadsheet for this data, select the column representing the dependent variable (see Column States) if you wish to subdivide your statistics by “case” and “control” or by category, or to obtain average values. If no dependent variable is selected, only overall statistics will be returned. The marker statistics dialog can be accessed by selecting Genotype > Genotype Statistics by Marker from the spreadsheet menu.

Processing

Select how alleles should be classified, either by allele frequency or by reference/alternate alleles. If reference/alternate allele classification is selected, the marker map field containing the reference allele must also be chosen.

Select your marker statistics options and select the Run button to process. Descriptions of the marker statistics options are detailed below.

One spreadsheet of results will be created as a child of the current spreadsheet navigator window node. Information about the number of markers analyzed and the number of markers having greater than two alleles is entered into the Node Change Log for the Marker Statistics spreadsheet.

Note

As noted above, only a few statistics are displayed for markers having more than two alleles (or more than one alternate allele).

Call Rate

This option displays the fraction of genotypes that are present and not missing for the given marker.

With data from certain providers you can also set a confidence threshold on import to indicate which genotypes are to be called or not.
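As an illustration of the call rate definition above, here is a minimal Python sketch. It is not SVS code: the function name and the representation of a marker as a list of allele pairs, with None standing for a missing genotype, are assumptions made only for this example.

```python
def call_rate(genotypes):
    """Fraction of genotypes that are present (not missing) for one marker.

    `genotypes` holds one entry per sample; None marks a missing call.
    This data layout is hypothetical and chosen only for illustration.
    """
    if not genotypes:
        return 0.0
    called = sum(1 for g in genotypes if g is not None)
    return called / len(genotypes)


# Five samples, two of which have no call at this marker.
marker = [("A", "A"), ("A", "G"), None, ("G", "G"), None]
rate = call_rate(marker)
print(rate)
```

Here three of the five genotypes are called, so the call rate is 3/5.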

Number of Alleles

This option counts the number of distinct alleles for the given marker. If the entire column is missing, 0 is returned.

Allele Names

This output will always be displayed.

If alleles are classified according to allele frequency, the two columns Minor Allele and Major Allele are displayed, each column showing the allele name of the respective allele. If the marker is monomorphic, the lone allele will be reported as the major allele, while the minor allele will be reported as missing.

If alleles are classified according to reference/alternate alleles, the two columns Alternate Allele and Reference Allele are displayed, each column showing the allele name of the respective allele.

If the results of a genotypic association test are also being shown in this spreadsheet, the first column will be labeled Minor Allele (Test Allele) or Alternate Allele (Test Allele) instead of Minor Allele or Alternate Allele.

Allele Frequencies

If alleles are classified according to allele frequency, this option displays the two columns Minor Allele Freq. and Major Allele Freq. The Minor Allele Freq. is the fraction of the given marker’s total alleles that are minor alleles, and similarly, the Major Allele Freq. is the fraction of the given marker’s total alleles that are major alleles. If the marker is monomorphic, the “major allele frequency” will be reported as 1, while the “minor allele frequency” will be reported as 0.

If alleles are classified according to reference/alternate alleles, this option displays the two columns Alternate Allele Frequency and Reference Allele Frequency. The Alternate Allele Frequency is the fraction of the given marker’s total alleles that are alternate alleles, and similarly, the Reference Allele Frequency is the fraction of the given marker’s total alleles that are reference alleles.
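The allele frequency classification described above can be sketched as follows. This is a hedged illustration rather than the SVS implementation: the function name and the genotype representation (allele pairs, None for missing) are assumptions, and ties between equally frequent alleles are broken arbitrarily here.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Return (minor_allele, minor_freq, major_allele, major_freq).

    Only markers with at most two distinct alleles are handled. For a
    monomorphic marker the minor allele is reported as None (missing)
    with frequency 0, and the major allele frequency is 1.
    """
    counts = Counter(a for g in genotypes if g is not None for a in g)
    total = sum(counts.values())
    if total == 0:
        return None, None, None, None  # entire column is missing
    if len(counts) > 2:
        raise ValueError("marker has more than two alleles")
    ranked = counts.most_common()  # most frequent allele first
    major, major_n = ranked[0]
    if len(ranked) == 1:
        return None, 0.0, major, 1.0
    minor, minor_n = ranked[1]
    return minor, minor_n / total, major, major_n / total


marker = [("A", "A"), ("A", "G"), None, ("G", "G"), ("A", "A")]
result = allele_frequencies(marker)
mono = allele_frequencies([("C", "C"), ("C", "C")])
print(result)
print(mono)
```

In the first marker, allele A occurs five times and G three times among the eight called alleles, so G is the minor allele.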

Carrier Count

If alleles are classified according to allele frequency, this option displays the number of genotypes containing at least one minor allele.

If alleles are classified according to reference/alternate alleles then this option displays the number of genotypes containing at least one alternate allele.

Note

If alleles are classified according to reference/alternate alleles and a marker has more than one alternate allele, this option will still display the number of genotypes for which at least one of its alleles is an alternate allele.

Note

If all data is missing for a marker, or alleles are classified according to allele frequency and a marker has more than two alleles, or alleles are classified according to reference/alternate and a marker has no reference allele designated, this option will display zero for that marker.
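The two carrier count definitions can be sketched in Python as below. The function names and the genotype representation (allele pairs, None for missing) are hypothetical; the zero result for a marker with no designated test or reference allele follows the note above.

```python
def carrier_count_minor(genotypes, minor_allele):
    """Number of genotypes containing at least one minor allele."""
    if minor_allele is None:
        return 0  # monomorphic or all-missing marker
    return sum(1 for g in genotypes if g is not None and minor_allele in g)


def carrier_count_alternate(genotypes, reference_allele):
    """Number of genotypes containing at least one non-reference allele.

    Any allele other than the reference counts, so markers with more
    than one alternate allele are handled as well.
    """
    if reference_allele is None:
        return 0  # no reference allele designated
    return sum(
        1
        for g in genotypes
        if g is not None and any(a != reference_allele for a in g)
    )


marker = [("A", "A"), ("A", "G"), None, ("G", "G"), ("A", "T")]
by_minor = carrier_count_minor(marker, "G")
by_alternate = carrier_count_alternate(marker, "A")
print(by_minor, by_alternate)
```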

Hardy-Weinberg Equilibrium P-Value

This option displays the Hardy-Weinberg Equilibrium (HWE) Correlation P-Values for each marker.

This statistic will also be output separately for categories or missing/not missing status, if applicable.

Please see the section in the Formulas and Theories chapter for how this statistic is computed (Hardy-Weinberg Equilibrium Computation).
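For orientation, the following Python sketch computes a textbook Pearson chi-square HWE p-value from the three genotype counts of a bi-allelic marker. This is an assumption about the general approach, not a reproduction of the SVS computation, which is defined in the Formulas and Theories chapter and may differ. The function name is hypothetical. The p-value uses the identity that the upper tail of a chi-square distribution with one degree of freedom equals erfc(sqrt(x/2)).

```python
import math


def hwe_chi_square_p(n_DD, n_Dd, n_dd):
    """Pearson chi-square HWE p-value (1 degree of freedom).

    Arguments are the observed counts of the two homozygotes and the
    heterozygote. Returns None if no genotypes were called.
    """
    n = n_DD + n_Dd + n_dd
    if n == 0:
        return None
    p = (2 * n_DD + n_Dd) / (2 * n)  # frequency of allele D
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    if min(expected) == 0:
        return 1.0  # monomorphic marker: no departure can be measured
    observed = (n_DD, n_Dd, n_dd)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return math.erfc(math.sqrt(chi2 / 2.0))


in_equilibrium = hwe_chi_square_p(25, 50, 25)
no_heterozygotes = hwe_chi_square_p(50, 0, 50)
print(in_equilibrium, no_heterozygotes)
```

Counts of 25, 50, and 25 match the Hardy-Weinberg expectation exactly, so the p-value is 1. Counts of 50, 0, and 50 give a very small p-value.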

Fisher’s Exact Test for HWE P-Value

This option displays Fisher’s Exact Test HWE P-Values for each marker.

This statistic will also be output separately for categories or missing/not missing status, if applicable.

Please see the section in the Formulas and Theories chapter for how this statistic is computed (Fisher’s Exact Test HWE P-Values).

Signed HWE R

This option displays the Signed HWE Correlation R for each marker. This is a measure designed to show specifically if the data for this marker shows a tendency towards being homozygous (positively signed R) or towards being heterozygous (negatively signed R).

This statistic will also be output separately for categories or missing/not missing status, if applicable.

Please see the section in the Formulas and Theories chapter for how this statistic is computed (Signed HWE Correlation R).
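The sign convention can be illustrated with a minimal sketch. For a bi-allelic marker, the correlation between the two alleles within a genotype equals 1 minus the ratio of observed to expected heterozygosity. The sketch assumes this standard formulation, which may not match the SVS formula in every detail; the function name is hypothetical.

```python
def signed_hwe_r(n_DD, n_Dd, n_dd):
    """Correlation between the two alleles of a genotype.

    Positive values indicate an excess of homozygotes and negative
    values an excess of heterozygotes, relative to HWE expectations.
    Returns None when the value is undefined (no calls or monomorphic).
    """
    n = n_DD + n_Dd + n_dd
    if n == 0:
        return None
    p = (2 * n_DD + n_Dd) / (2 * n)
    q = 1.0 - p
    if p * q == 0:
        return None
    observed_het = n_Dd / n
    expected_het = 2 * p * q
    return 1.0 - observed_het / expected_het


all_homozygous = signed_hwe_r(50, 0, 50)
all_heterozygous = signed_hwe_r(0, 100, 0)
in_equilibrium = signed_hwe_r(25, 50, 25)
print(all_homozygous, all_heterozygous, in_equilibrium)
```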

Genotype Count Table(s)

The numbers of samples that contain each genotype are output. These will also be output separately for cases and for controls, if applicable. If a quantitative dependent variable was selected, an average of the dependent variable for each genotype category (DD, Dd, dd, Missing) will be calculated for each marker.

Allele Count Table(s)

The counts for each allele are output. This statistic will also be output separately for categories or missing/not missing status, if applicable.

Genotype Filtering by Marker

The genotype quality assurance filtering dialog (see Genotype Filtering By Marker – Allele Frequency Classification and Genotype Filtering By Marker – Reference and Alternate Allele Classification) offers many options for filtering out markers that do not meet user-defined criteria. Markers can be filtered by call rate, number of alleles, minor allele frequency (MAF), or by three measures of Hardy-Weinberg Equilibrium (HWE).

Genotype Filtering By Marker – Allele Frequency Classification

Genotype Filtering By Marker – Reference and Alternate Allele Classification

Warning

Statistics calculated by this function do not adjust for gender and are therefore not always appropriate for non-autosomal chromosomes.

Alleles can be classified by either allele frequency or by reference/alternate alleles. If reference/alternate allele classification is selected, the marker map field containing the reference allele must also be chosen.

The genotype columns meeting the criteria for filtering can be inactivated in the original spreadsheet, listed in a filtering results spreadsheet, or both. If the filtering results spreadsheet is created by selecting “Output spreadsheet with marker statistics and ‘Drop?’ columns”, every marker that was not skipped due to having more than two alleles is listed along with a ‘Drop?’ value. A ‘1’ in the ‘Drop?’ column indicates that the marker was dropped based on the selected criteria, and a ‘0’ indicates that the marker was not dropped.

The filtering options are separated into two categories, General Statistics Filtering and Hardy-Weinberg Equilibrium (HWE) Filtering. The filtering options for each category are listed below:

  • General Statistics Filtering:
    • Drop if call rate: Drops a marker if the call rate meets the specified criterion. Initial default is to drop a marker if the call rate is less than 0.85.
    • Drop if number of alleles: Drops a marker if the number of alleles meets the specified criterion. Initial default is to drop a marker if the number of alleles is greater than 2.
    • Drop if Minor Allele Frequency (MAF): (This option will be present if alleles are classified by allele frequency.) Drops a marker if the MAF meets the specified criterion. Initial default is to drop a marker if the MAF is less than 0.05.
    • Drop if alternate allele frequency: (This option will be present if alleles are classified by reference/alternate alleles.) Drops a marker if the alternate allele frequency meets the specified criterion. Initial default is to drop a marker if the alternate allele frequency is less than 0.05.
    • Drop if carrier count: Drops a marker if the carrier count meets the specified criterion. Initial default is to drop a marker if the carrier count is less than 10.
  • Hardy-Weinberg Equilibrium (HWE) Filtering:
    • Perform HWE filtering based on: Select whether the filtering is based on all the samples, on cases only, or on controls only. This option is only available if a binary column is selected as a dependent variable.
    • Drop if Hardy Weinberg Equilibrium (HWE) P-value: Drops a marker if the HWE p-value meets the specified criterion. Initial default is to drop a marker if the HWE p-value is less than 0.001.
    • Drop if Fisher’s Exact Test for HWE P-value: Drops a marker if the Fisher’s Exact Test for HWE p-value meets the specified criterion. Initial default is to drop a marker if the value is less than 0.001.
    • Drop if Signed HWE R (positive if more homozygous): Drops a marker if the Signed HWE R meets the specified criterion. Initial default is to drop a marker if the value is greater than 0.2.

At least one filtering criterion and at least one action must be selected in the dialog to obtain results. Multiple filtering criteria are allowed at one time. Depending on the stringency of the filtering criteria, it is possible to filter out all of the markers in a dataset. If this is the case, the filtering should be rerun with less stringent criteria.
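The way several criteria combine into a ‘Drop?’ value can be sketched as follows. The thresholds are the initial defaults listed above. The statistic names, the dictionary layout, and the rule that a marker is dropped as soon as any one selected criterion is met are assumptions made for this illustration and are not taken from SVS.

```python
import operator

# (statistic name, comparison, threshold), using the initial defaults above.
# The statistic names are hypothetical keys, not SVS column names.
DEFAULT_CRITERIA = [
    ("call_rate", operator.lt, 0.85),
    ("num_alleles", operator.gt, 2),
    ("maf", operator.lt, 0.05),
    ("hwe_p", operator.lt, 0.001),
]


def drop_flag(stats, criteria=DEFAULT_CRITERIA):
    """Return 1 if any selected criterion is met for this marker, else 0.

    Statistics that are absent or None are skipped rather than treated
    as failures (an assumption of this sketch).
    """
    for name, compare, threshold in criteria:
        value = stats.get(name)
        if value is not None and compare(value, threshold):
            return 1
    return 0


markers = {
    "marker_1": {"call_rate": 0.99, "num_alleles": 2, "maf": 0.21, "hwe_p": 0.40},
    "marker_2": {"call_rate": 0.80, "num_alleles": 2, "maf": 0.30, "hwe_p": 0.55},
    "marker_3": {"call_rate": 0.97, "num_alleles": 2, "maf": 0.01, "hwe_p": 0.12},
}
flags = {name: drop_flag(stats) for name, stats in markers.items()}
print(flags)
```

In this example the second marker fails the call rate criterion and the third fails the MAF criterion, so both receive a ‘1’.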

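For more information on how the statistics are calculated, see Genotype Statistics by Marker and, in the Formulas and Theories chapter, Hardy-Weinberg Equilibrium Computation, Fisher’s Exact Test HWE P-Values, and Signed HWE Correlation R.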

Genotype Statistics by Sample

Several types of genotypic statistics by sample are available as output (see Genotype Statistics By Sample).

Genotype Statistics By Sample

Gender inference in this dialog reads the project’s default genome assembly, which is set from Tools > Manage Genome Assemblies > Set As Project Default.

Data Requirements

Genotypic sample statistics require a dataset containing genotypic data. First, import your data into a Golden Helix SVS project (See Importing Your Data Into A Project). The sample statistics dialog can be accessed by selecting Genotype > Genotype Statistics By Sample from the spreadsheet menu.

If a binary or categorical column has been made dependent, many of the statistics will additionally be consolidated and reported for each dependent variable category.

Processing

Select your sample statistics options and select the Run button to process. Descriptions of the sample statistics options are detailed below.

At least one but up to four spreadsheets of results will be created as children of the current spreadsheet navigator window node. (See Output for a list of spreadsheets and outputs.) Your options selected and information about the number of markers processed are entered into the Node Change Logs for these spreadsheets.

Inputs

A number of inputs are available for statistics by sample. These are as follows:

  • Genotype Count Statistics: Call rate and heterozygosity are always output. Optionally select the following to obtain additional statistics:
    • Number and fraction of genotypes with a minor allele (as determined from sample data)
  • Variant Statistics (Marker Map “Reference” Field Required): The first marker map field that starts with the characters “Reference” (case-insensitive match) will be used as the reference field. If there is no such field, this input category will not be available. Optionally select from the following three variant statistics:
    • Number of variant genotypes (non reference)
    • Number of singletons (variant genotype present only in given sample)
    • Mean Ti/Tv of variant genotypes: Outputs counts of transitions and transversions and the ratio of transitions to transversions. For further details, see Transitions and Transversions under Count and Variant Statistics below.
  • Autosomal Statistics
    • Hardy-Weinberg Thw P-Value (taken over all autosomal chromosomes and all samples): This option displays, for each sample, the p-value for the genome-wide test for departures of the minor allele count from two times the minor allele frequency of the corresponding markers. This is calculated over all active genotypic markers for the sample that are in autosomal chromosomes. This test does not require absence of linkage disequilibrium from the data and can detect even small deviations from Hardy-Weinberg equilibrium, which may be caused either by violations in the conditions for Hardy-Weinberg equilibrium or by genotyping error.
  • Gender Chromosome Statistics
    • Gender Inference: Select this option to obtain count and variant statistics for the gender chromosome, as well as to infer a sample’s gender based on the gender chromosome heterozygosity. This list of chromosomes is populated from reading the assembly file set as the project’s current default and checking the spreadsheet’s marker map. See Genome Assemblies for how to set this.
    • Threshold of heterozygosity for calling M/F: If a sample’s gender chromosome heterozygosity is higher than the value specified here, the sample is inferred to be female. Otherwise, it is inferred to be male (with the alleles of the one gender chromosome having been duplicated for each genotype data entry).
  • Additional Outputs (Verbose Output)
    • Output count and variant statistics for each autosomal chromosome: This will generate a separate spreadsheet or spreadsheets with a column for each count and variant statistic specified above for each autosomal chromosome encountered.
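The gender inference rule described in the list above can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function name is hypothetical, genotypes are allele pairs with None for missing, every marker passed in is assumed to be bi-allelic or monomorphic, and the threshold value in the example is arbitrary rather than an SVS default.

```python
def infer_gender(gender_chr_genotypes, threshold):
    """Infer 'F' or 'M' from heterozygosity on the gender chromosome.

    Returns None if the sample has no called genotypes on that
    chromosome.
    """
    called = [g for g in gender_chr_genotypes if g is not None]
    if not called:
        return None
    het_rate = sum(1 for a, b in called if a != b) / len(called)
    return "F" if het_rate > threshold else "M"


# Hypothetical samples and a hypothetical threshold of 0.1.
sample_1 = [("A", "G"), ("C", "C"), ("T", "C"), None, ("G", "G")]
sample_2 = [("A", "A"), ("C", "C"), ("T", "T"), ("G", "G"), None]
gender_1 = infer_gender(sample_1, threshold=0.1)
gender_2 = infer_gender(sample_2, threshold=0.1)
print(gender_1, gender_2)
```

The second sample shows no heterozygous calls, which is consistent with a single copy of the chromosome whose alleles were duplicated in each genotype entry.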

Count and Variant Statistics

The count and variant statistics are reported in one or more of the following ways depending upon the options selected and whether a binary or categorical column has been made dependent:

  1. By sample for all markers scanned.

  2. By sample for the gender chromosome (if Gender Inference was selected).

  3. By category for all markers scanned (if a binary or categorical column was made dependent).

  4. By category for the gender chromosome (if a binary or categorical column was made dependent and Gender Inference was selected).

  5. By sample for each individual autosomal chromosome (if Output count and variant statistics for each autosomal chromosome was selected).

  6. By category for each individual autosomal chromosome (if a binary or categorical column was made dependent and Output count and variant statistics for each autosomal chromosome was selected).

The count and variant statistics are specifically as follows:

  • # Called Genotypes: The number of genotypes called for this sample. These genotypes may come from monomorphic, bi-allelic, or multi-allelic markers.
  • Call Rate: The number of called genotypes divided by the total number of genotypes scanned.
  • # from Bi-Allelic and Monomorphic: The number of called genotypes that come from either bi-allelic or monomorphic markers.
  • # Heterozygotes: The number of heterozygous genotypes encountered (that come from bi-allelic markers).
  • Heterozygosity Rate: The number of heterozygotes encountered (that come from bi-allelic markers) divided by the number of called genotypes from bi-allelic or monomorphic markers.
  • # with Minor Allele: (Output if Number and fraction of genotypes with a minor allele (as determined from sample data) was selected.) The number of genotypes with at least one minor allele encountered that come from bi-allelic markers.
  • Fraction with Minor Allele: (Output if Number and fraction of genotypes with a minor allele (as determined from sample data) was selected.) The number of genotypes with at least one minor allele encountered (that come from bi-allelic markers) divided by the number of called genotypes from bi-allelic or monomorphic markers.
  • # Variant Genotypes: (Output if Number of variant genotypes (non reference) was selected.) The number of genotypes (that come from either bi-allelic, monomorphic or multi-allelic markers) containing at least one non-reference allele.
  • # Singletons: (Output if Number of singletons (variant genotype present only in given sample) was selected.) The number of genotypes that come from bi-allelic markers containing only one variant in all of their samples.
  • # Transitions: (Output if Mean Ti/Tv of variant genotypes was selected.) The number of variant genotypes found in markers where the reference allele is “A” and the variant allele is “G”, the reference allele is “G” and the variant allele is “A”, the reference allele is “C” and the variant allele is “T”, or the reference allele is “T” and the variant allele is “C”.
  • # Transversions: (Output if Mean Ti/Tv of variant genotypes was selected.) The number of variant genotypes found in markers where both the reference and variant are any of “A”, “G”, “C”, or “T”, but the variant is not a transition (see above). (There are twice as many possible transversions as there are possible transitions.)
  • Mean Ti/Tv: (Output if Mean Ti/Tv of variant genotypes was selected.) The ratio of the number of transitions to the number of transversions.
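The transition and transversion definitions above translate directly into a small classifier. The following Python sketch is illustrative only: the function names and the input format (one reference/variant allele pair per variant genotype of a sample) are assumptions.

```python
from itertools import permutations

BASES = ("A", "C", "G", "T")
# Purine <-> purine (A, G) and pyrimidine <-> pyrimidine (C, T) changes.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def classify_substitution(ref, alt):
    """Return 'transition', 'transversion', or None if not a base change."""
    if ref not in BASES or alt not in BASES or ref == alt:
        return None
    return "transition" if (ref, alt) in TRANSITIONS else "transversion"


def ti_tv(variant_pairs):
    """Count transitions and transversions and return their ratio.

    The ratio is None when there are no transversions.
    """
    kinds = [classify_substitution(ref, alt) for ref, alt in variant_pairs]
    ti = kinds.count("transition")
    tv = kinds.count("transversion")
    return ti, tv, (ti / tv if tv else None)


sample_variants = [("A", "G"), ("C", "T"), ("A", "C"), ("G", "T"), ("A", "-")]
counts = ti_tv(sample_variants)

# All 12 ordered base changes: 4 transitions and 8 transversions.
all_changes = ti_tv(permutations(BASES, 2))
print(counts, all_changes)
```

The final check enumerates every possible single-base change and confirms that there are twice as many possible transversions as transitions.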

Output

At least one but up to four spreadsheets of results will be created as children of the current spreadsheet navigator window node. These spreadsheets and the data categories reported by them are as follows:

  • Statistics by Sample: The rows in this spreadsheet will correspond to samples and the columns include some or all of the following:
    • (Category header): If you have specified a binary or categorical variable as dependent, that column will be echoed here.
    • Count and variant statistics by sample for all scanned markers. (See Count and Variant Statistics.)
    • Statistics for the gender chromosome (output only if Gender Inference was selected). These include the count and rate statistics for the gender chromosome (see Count and Variant Statistics) plus the following two columns, which are inserted after the Heterozygosity Rate (Chr. Gender) column (“Gender” will be the chromosome chosen in the drop down list):
      • Inferred Gender (Categorical M vs. F.): The inferred gender of the sample based on its gender chromosome heterozygosity rate. The gender is inferred to be female if this rate is above the Threshold of heterozygosity for calling M/F that you have specified, and male otherwise.
      • Inferred Gender (Binary 0 vs. 1.): The same as above, except that 0 is used for male and 1 is used for female.
  • Statistics by Sample Category: (Created if either a binary or a categorical dependent variable was selected in the original spreadsheet.) The first row of this spreadsheet contains totals. Each of the remaining rows shows statistics for one of the dependent variable categories. The columns include some or all of the following:
    • # Samples: The number of samples reflected in this row’s category.
    • Count and variant statistics by category for all scanned markers. (See Count and Variant Statistics.)
    • Count and variant statistics by category for the gender chromosome (output only if Gender Inference was selected). (See Count and Variant Statistics.)
  • Autosome Statistics by Sample: (Created if either Hardy-Weinberg Thw P-Value (taken over all autosomal chromosomes and all samples) or Output count and variant statistics for each autosomal chromosome has been selected.) The rows in this spreadsheet correspond to samples and the columns will include some or all of the following:
    • (Category header): If you have specified a binary or categorical variable as dependent, that column will be echoed here.
    • Hardy-Weinberg Thw statistics (output only if Hardy-Weinberg Thw P-Value (taken over all autosomal chromosomes and all samples) was selected). These are as follows:
      • Thw p-value: P-value of the Hardy-Weinberg Thw statistic.
      • -log10 Thw p-value: Negative base-10 logarithm of the p-value of the Hardy-Weinberg Thw statistic.
      • Thw: The Hardy-Weinberg Thw statistic. Under the null hypothesis of no departure from Hardy-Weinberg equilibrium, this statistic follows an approximate \chi^2 distribution with one degree of freedom.
      • E(delta X): Expected residual marker score. The residual marker score at a given marker and sample is given by \Delta X = X_i - E(X_i) = X_i - 2p_i, where the marker score X_i is the number of minor alleles and E(X_i) = 2p_i is the expected marker score based on the minor allele frequency p_i of the marker.
      • var(delta X): Variance of the residual marker score.
    • Count and variant statistics by sample for each autosomal chromosome (output only if Output count and variant statistics for each autosomal chromosome was selected). (See Count and Variant Statistics.)
  • Autosome Statistics by Sample Category: (Created if a binary or categorical dependent variable was selected and Output count and variant statistics for each autosomal chromosome has been selected.) The first row of this spreadsheet contains totals. Each of the remaining rows shows statistics for one of the dependent variable categories. The columns will include the following:
    • # Samples: The number of samples reflected in this row’s category.
    • Count and variant statistics by category for each autosomal chromosome. (See Count and Variant Statistics.)
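The residual marker score defined in the list above, \Delta X = X_i - 2p_i, can be computed for one sample as in the Python sketch below. The function name is hypothetical. Summarizing the residuals by their sample mean and population variance is an assumption made for illustration; the exact estimators behind the E(delta X) and var(delta X) columns, and the Thw statistic itself, are not reproduced here.

```python
from statistics import fmean, pvariance


def residual_marker_scores(minor_allele_counts, minor_allele_freqs):
    """Residuals X_i - 2 * p_i over the markers of one sample.

    X_i is the sample's minor allele count (0, 1, or 2) at marker i and
    p_i is that marker's minor allele frequency. Markers with a missing
    genotype (None) are skipped.
    """
    return [
        x - 2 * p
        for x, p in zip(minor_allele_counts, minor_allele_freqs)
        if x is not None
    ]


counts = [0, 1, 2, None, 1]
freqs = [0.1, 0.25, 0.5, 0.3, 0.4]
residuals = residual_marker_scores(counts, freqs)
mean_residual = fmean(residuals)
var_residual = pvariance(residuals)
print(residuals, mean_residual, var_residual)
```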

Identity by Descent Estimation

Overview

Identity by Descent (IBD) is a measure of how many alleles at any marker in each of two individuals came from the same ancestral chromosomes. (This is in contrast to the Identity by State (IBS) measure, which is simply a measure of how many alleles at any marker in each of two individuals happen to be the same, for whatever reason.) IBD is therefore a measure of the relatedness of the pair of individuals in question. For instance:

  • The alleles of identical twins should come 100% from the same ancestral chromosomes, because they have the same chromosomes.
  • The alleles of siblings should come approximately 50% from the same ancestral chromosomes.
  • The alleles of half-siblings should come approximately 25% from the same ancestral chromosomes.
  • The alleles of unrelated individuals should not come from the same ancestral chromosomes at all, or in other words approximately 0% from the same ancestral chromosomes.

Meanwhile, it is possible for genotyped samples to exhibit apparent relatedness that has nothing to do with the relatedness or lack of relatedness of the corresponding individuals. For instance:

  • Duplicate samples will exhibit alleles coming 100% from the same chromosomes.
  • In a dual-array system such as the Affy 500K, duplicate samples from one of a pair of genotyping chips but not the other one will exhibit alleles coming 50% from the same chromosomes.
  • Sample contamination will show as one individual seeming to have relatedness to many other individuals.

Golden Helix SVS allows estimation of the Identity by Descent between all pairs of samples, based on the data in your genotypic spreadsheet.

  • It is recommended that IBD estimation in Golden Helix SVS be used for data quality control, rather than for actually attempting to impute relatedness among individuals whose samples you are analyzing.
  • It is usually advisable to apply LD pruning (Genotype > Quality Assurance > LD Pruning from the spreadsheet menu) before using this feature.
  • You will obtain the best values when you use many samples and many markers. This is due to the need to estimate allele frequencies over multiple samples, as well as the need to estimate IBD itself over multiple markers.

Warning

IBD is designed to be estimated only from genotypic data originating from autosomal chromosomes.

Data Requirements

First, import your data into a Golden Helix SVS project (See Importing Your Data Into A Project) to create a genotypic spreadsheet. The samples in your spreadsheet are required to be row wise, and only the autosomal genotype columns should be active. (If necessary, use Select > Activate by Chromosomes from the spreadsheet menu.) The IBD dialog can be accessed by selecting Genotype > Quality Assurance > Identity by Descent Estimation from the spreadsheet menu.

Values Computed

The first available output, the IBS distances, reflects the Identity by State (IBS) between pairs of samples. At each marker, the two samples in a pair will share (for whatever reason) zero, one, or two alleles–these are known as IBS state 0, IBS state 1, and IBS state 2, respectively. For each sample pair, the IBS distance, which may be thought of as one-half of an “average IBS”, is defined as ( (# of markers with IBS state 2) + 0.5 * (# of markers with IBS state 1) ) / (# of non-missing markers).
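The IBS distance formula above can be sketched in Python as follows. The function names and the representation of a sample as a list of allele pairs (None for missing) are assumptions, as is the choice to count a marker as non-missing only when both samples have a call.

```python
def ibs_state(g1, g2):
    """Number of alleles (0, 1, or 2) shared by two unordered genotypes."""
    remaining = list(g2)
    shared = 0
    for allele in g1:
        if allele in remaining:
            remaining.remove(allele)  # each allele can be matched only once
            shared += 1
    return shared


def ibs_distance(sample_a, sample_b):
    """(#IBS2 + 0.5 * #IBS1) / #non-missing markers for a sample pair."""
    states = [
        ibs_state(g1, g2)
        for g1, g2 in zip(sample_a, sample_b)
        if g1 is not None and g2 is not None
    ]
    if not states:
        return None
    return (states.count(2) + 0.5 * states.count(1)) / len(states)


sample_a = [("A", "A"), ("A", "G"), ("A", "A"), None]
sample_b = [("A", "A"), ("A", "A"), ("G", "G"), ("A", "G")]
distance = ibs_distance(sample_a, sample_b)
print(distance)
```

The three markers called in both samples have IBS states 2, 1, and 0, giving (1 + 0.5) / 3 = 0.5.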

The next available outputs are the results of the initial computations for the respective probabilities that zero, one, or two alleles are identical by descent (shared IBD). These are designated P(Z=0), P(Z=1), and P(Z=2), respectively.

Using your genotypic data, Golden Helix SVS will “work backwards” to impute the most reasonable genome-wide IBD probabilities from your data, assuming it came from a homogeneous, random-mating population. For each of your markers, the allele frequencies are estimated. Using these frequencies, P(I=i|Z=z) is estimated for each combination of i, an IBS state, and z, a possible IBD state. For instance, if p and q are the actual respective allele frequencies of the two alleles in a marker, P(I=0|Z=0), the probability of having an IBS state of zero (completely different alleles) between two individuals given an IBD state of zero (completely different alleles by descent) between those same two individuals should be 2p^2q^2. (This reflects both individuals having opposite homozygotes two different ways, AA and aa, or aa and AA, each with probability p^2q^2.) Since allele frequency estimates are made from the spreadsheet data, a correction factor is actually used to obtain unbiased estimates of P(I=i|Z=z), but the results are similar to what would otherwise be obtained.

Estimating these probabilities allows incrementing the expected count of markers with IBS state i, conditioned on IBD state z, for each pair of samples.

After all markers are scanned, a method of moments is used to find, from the expected counts and actual counts of the different IBS states, global estimates for P(Z=0), P(Z=1), and P(Z=2) for each sample pair. In some cases, these values will not be in the range of zero to one–in these cases, values are corrected appropriately to be in the range zero to one before they are output by Golden Helix SVS.
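As a rough sketch of the steps above, the following Python computes P(I=i|Z=z) for each marker from plain Hardy-Weinberg expressions, accumulates expected counts, and solves the moment equations in sequence. Several things here are assumptions rather than the SVS implementation: the function names, the omission of the bias-correction factor, the particular triangular form of the moment solve, and the absence of any correction of out-of-range values. See [Purcell2007] for the actual algorithm.

```python
def ibs_given_ibd(p):
    """Return table[z][i] = P(I=i | Z=z) for a biallelic marker.

    p is the frequency of one allele. Uncorrected Hardy-Weinberg
    expressions are used (assumption: no small-sample correction).
    """
    q = 1.0 - p
    return [
        [2 * p * p * q * q,
         4 * p ** 3 * q + 4 * p * q ** 3,
         p ** 4 + q ** 4 + 4 * p * p * q * q],
        [0.0,
         2 * p * p * q + 2 * p * q * q,
         p ** 3 + q ** 3 + p * p * q + p * q * q],
        [0.0, 0.0, 1.0],
    ]


def estimate_ibd(allele_freqs, observed_ibs_counts):
    """Method-of-moments estimates of P(Z=0), P(Z=1), P(Z=2) for one pair.

    observed_ibs_counts = (n0, n1, n2), the numbers of markers at which
    the pair shows IBS state 0, 1, and 2.
    """
    # expected[z][i]: expected number of markers in IBS state i given IBD state z
    expected = [[0.0] * 3 for _ in range(3)]
    for p in allele_freqs:
        table = ibs_given_ibd(p)
        for z in range(3):
            for i in range(3):
                expected[z][i] += table[z][i]
    if expected[0][0] == 0.0 or expected[1][1] == 0.0:
        return None  # all markers monomorphic: nothing to estimate
    n0, n1, n2 = observed_ibs_counts
    z0 = n0 / expected[0][0]
    z1 = (n1 - z0 * expected[0][1]) / expected[1][1]
    z2 = (n2 - z0 * expected[0][2] - z1 * expected[1][2]) / expected[2][2]
    return z0, z1, z2  # may fall outside [0, 1]; not corrected in this sketch
```

The solve works in sequence because IBS state 0 can only arise from IBD state 0, and IBS state 1 only from IBD states 0 and 1.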

The overall fraction of alleles which are shared IBD between two individuals over the genome may be summarized by the one value

\pi = \frac{P(Z=1)}2 + P(Z=2),

or half of the probability of sharing a single allele IBD plus the probability of sharing both alleles IBD.

It would be expected that the probability of sharing two alleles IBD would be less than the probability of picking one allele shared IBD multiplied by the probability of picking a second allele shared IBD between the same two individuals. If this is not so, namely,

\pi^2 \le P(Z=2),

a set of transformed probabilities is computed which are more biologically plausible, as follows:

P*(Z=0) = (1 - \pi)^2, P*(Z=1) = 2\pi(1 - \pi), and P*(Z=2) = \pi^2.

Otherwise, the values labeled P* for the pair of individuals will be copied from the initial estimates (P).
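The summary value and the transformation rule can be written out as a short Python sketch. The function name and return layout are illustrative assumptions; the rule itself follows the formulas given above.

```python
def summarize_ibd(z0, z1, z2):
    """Return (pi, (P*(Z=0), P*(Z=1), P*(Z=2))) from the initial estimates."""
    pi = z1 / 2.0 + z2
    if pi * pi <= z2:
        # Initial estimates are implausible: replace them with values
        # that depend only on pi.
        transformed = ((1 - pi) ** 2, 2 * pi * (1 - pi), pi ** 2)
    else:
        # Otherwise the P* values are copied from the initial estimates.
        transformed = (z0, z1, z2)
    return pi, transformed
```

For instance, initial estimates of (0.5, 0.0, 0.5) give PI = 0.5 and are transformed to (0.25, 0.5, 0.25), while (0.0, 1.0, 0.0) also gives PI = 0.5 but is left unchanged.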

The complete algorithm used by Golden Helix SVS is spelled out in [Purcell2007].

Using IBD Estimation

Select the computation parameter (if applicable) and output options and select the Run button to process. Descriptions of the computation parameter and output options are detailed below.

One or more spreadsheets of results will be created as children of the current spreadsheet navigator window node. Information about the parameters used will be recorded in the Node Change Log.

Parameters

Allele Counts

If your spreadsheet is a pedigree spreadsheet, you may check Use only founders for allele counts to count alleles only from samples which contain missing values for the Father ID and the Mother ID. This is the default behavior for pedigree spreadsheets and this option is only used for IBD computations. On the other hand, you may leave this box unchecked to count alleles from all samples to determine allele frequencies, which is what is done for a non-pedigree spreadsheet.

Note

All pairs of samples with non-missing data are used for IBS computations and for the final IBD computations. Restricting allele counting to founders affects only the allele frequencies used in the IBD computations.

Identity by Descent Estimation Outputs

The following outputs may be checked or unchecked:

  • Output IBS distances ( (IBS 2 + 0.5 * IBS 1) / # non-missing markers ) (one spreadsheet)

    Note

    If only this output is requested, all computational overhead for computing IBD that is not needed for just computing IBS will be dispensed with. This will speed up IBS computation by approximately a factor of 3.

  • Output untransformed estimates of P(Z=0), P(Z=1), and P(Z=2) (three spreadsheets) (selected by default)

  • Output PI = P(Z=1)/2 + P(Z=2) (one spreadsheet) (selected by default)

  • Output transformed estimates P*(Z=0), P*(Z=1), and P*(Z=2) (three spreadsheets)

All of these outputs are in the form of a spreadsheet with both rows and columns corresponding to the samples, with each cell representing the IBS or IBD value between the two samples represented by its row and its column.

The reason these outputs are in this form is to allow you to view them using the heat map feature of Golden Helix SVS. You may then easily pick out any pair of duplicate samples, or one sample contaminating a number of other samples, or other suspicious values of IBS distance or estimated IBD.

To view a spreadsheet as a heat map, select Plot > Heat Map from the spreadsheet menu.

Additional Outputs

To get a listing of all pairs of samples whose IBD PI estimate is at or above a certain value, check Output all pairs where PI >= (value), and input the value to use.

This listing will output, in one spreadsheet, one row for every pair of samples meeting the criterion above. The sample pair will be output along with the IBS distance and all of the pair’s IBD values.

Fixation Index Fst and Fixation Index Fst (by Marker)

Overview

The Fixation Index F_{st}, also known as the Co-ancestry Coefficient \theta, between two or more subpopulations is a measure of genetic divergence between the subpopulations and from the ancestral population from which they have derived. This parameter, which can range from zero (no genetic divergence between the subpopulations or from the ancestral population) to one (complete isolation of the subpopulations from each other and the overall population), measures the reduction in genotypic heterozygosity (the Wahlund Effect) resulting from inbreeding in the subpopulations to the exclusion of others from the overall population.

Golden Helix SVS allows estimation of F_{st} between all pairs of subpopulations from which you have samples, based on the genotypic data in your spreadsheet as grouped into subpopulations by a categorical grouping variable in your spreadsheet. In the Estimates Made Using All Markers version, 95% confidence intervals around the F_{st} are also reported.

Warning

F_{st} is designed to be estimated only from genotypic data originating from autosomal chromosomes.

Overview of F-Statistics

F_{st} is one of three “F-Statistics” first developed by Sewall Wright:

  • F_{is} (or f): Inbreeding coefficient of individuals (i) with respect to the subpopulations (s) of which they are a part.
  • F_{st} (or \theta): Fixation Index or Co-ancestry Coefficient. Compares the subpopulations (s) with the total population (t).
  • F_{it} (or F): Inbreeding coefficient of individuals (i) with respect to the total population (t).

These are defined in terms of the following measures of heterozygosity:

  • H_i: The observed heterozygosity over all the subpopulations,
  • H_s: The average over all the subpopulations of the expected heterozygosities within each of the subpopulations, and
  • H_t: The expected heterozygosity over all the subpopulations,

as

1 - F_{is} = 1 - f = \frac{H_i}{H_s}

1 - F_{st} = 1 - \theta = \frac{H_s}{H_t}

1 - F_{it} = 1 - F = \frac{H_i}{H_t} .

It can be seen that the three F-Statistics may be thought of as “partitioned” (between the individual, the subpopulations, and the total population) as follows:

1 - F_{it} = (1 - F_{is})(1 - F_{st}), or

1 - F = (1 - f)(1 - \theta).
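As a small numeric check of these definitions, the Python sketch below computes the three F-Statistics from given heterozygosities and confirms the partition identity. The function name and the example values are assumptions chosen for illustration.

```python
def f_statistics(h_i, h_s, h_t):
    """Return (F_is, F_st, F_it) from the three heterozygosity measures."""
    f_is = 1 - h_i / h_s
    f_st = 1 - h_s / h_t
    f_it = 1 - h_i / h_t
    return f_is, f_st, f_it


# Illustrative values: two equal-sized subpopulations with allele
# frequencies 0.2 and 0.8, each in Hardy-Weinberg equilibrium.
h_s = 2 * 0.2 * 0.8   # expected heterozygosity within each subpopulation
h_t = 2 * 0.5 * 0.5   # expected heterozygosity at the pooled frequency
h_i = h_s             # observed equals expected within subpopulations
f_is, f_st, f_it = f_statistics(h_i, h_s, h_t)
```

In this example F_{is} is zero (no inbreeding within subpopulations), so all of the reduction in heterozygosity is attributed to F_{st}, and F_{it} equals F_{st}.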

Note

Golden Helix SVS can compute the inbreeding coefficient for individual samples relative to the entire data in your genotypic spreadsheet. Please see Inbreeding Coefficients for details.

Note

Golden Helix SVS can use Principal Components Analysis to remove or mitigate the effects of hidden population stratification (non-zero F_{st} between unknown groupings of samples) from your analysis. Please see Correcting for Stratification for more details.

Note

F_{is} is normally used as the actual measure of inbreeding among individuals, because it is measured against others who are in the same subpopulation.

Data Requirements

First, import your data into a Golden Helix SVS project (See Importing Your Data Into A Project) to create a genotypic spreadsheet. The samples in your spreadsheet are required to be row wise, and only the autosomal genotype columns should be active. (If necessary, use Select > Activate by Chromosomes from the spreadsheet menu.) Ensure that your spreadsheet also has a categorical variable which groups your samples according to which samples are taken from which subpopulations.

Using F_{st} Estimation

Two modes of F_{st} are available in Golden Helix SVS. They are:

  • F_{st} estimates using all markers, and
  • F_{st} estimates made using one marker at a time.

Estimates Made Using All Markers

The Fixation Index F_{st} computations taken over all markers can be accessed by making the grouping variable dependent and selecting Genotype > Quality Assurance > Fixation Index Fst from the spreadsheet menu.

One spreadsheet of results is created as a child of the current spreadsheet navigator window node. In this spreadsheet, the subpopulations will be used both as rows and columns, with each spreadsheet cell showing the estimated F_{st} between the row’s subpopulation and the column’s subpopulation. This spreadsheet format is suitable for plotting the F_{st} between the pairs of subpopulations as a heat map.

A second spreadsheet is created as a child of the current spreadsheet. This spreadsheet will have the F_{st} between the pairs of subpopulations in tall format with a 95% confidence interval around the F_{st}.

The overall F_{st}, which is the F_{st} taken over all subpopulations at once, is shown in the node change log entry for the first spreadsheet in the project viewer. In the second spreadsheet’s log message, the overall F_{st} is reported with a 95% confidence interval.

Estimates Made Using One Marker at a Time

This mode of computing the Fixation Index F_{st} can be accessed by making the grouping variable dependent and selecting Genotype > Quality Assurance > Fixation Index Fst (by Marker) from the spreadsheet menu. In this mode, Golden Helix SVS performs the equivalent of making an entire F_{st} run using a spreadsheet in which all markers except the one of interest are deactivated, repeating this for every marker.

One spreadsheet of results is created as a child of the current spreadsheet navigator window node. This spreadsheet is organized using the markers as rows and the groupings of subpopulations as columns, with each spreadsheet cell showing the estimated F_{st} between the subpopulations grouped together for the cell’s column, as computed using the one marker associated with the cell’s row.

The final subpopulation grouping is for F_{st} taken over all subpopulations at once (for the individual markers).

If the original spreadsheet was marker mapped, its marker map information will be transferred to the new spreadsheet.

Algorithm Used for F_{st} Estimation

Golden Helix SVS uses the algorithm explained in [WeirCockerham1984] to estimate F_{st}. This algorithm is summarized below. We use the f, \theta, and F notation of [WeirCockerham1984].

The variance of the allele frequency p of a given allele A at any locus may be thought of as divided into three components, a, b, and c, the expectations of which are

Ea = p(1-p)\theta

Eb = p(1-p)(F - \theta)

Ec = p(1-p)(1 - F),

where a is the variance component between subpopulations, b is the variance component within subpopulations and between individuals, c is the sample-based population estimate of the variance component between gametes within individuals, and p is the expected frequency of allele A, which is equal to its frequency in the original ancestral population. From this, we get

\hat{\theta} = \frac{a}{a + b + c}.

Note

An alternative definition of F_{st} (\theta) is the ratio of the variance of allele frequencies between different populations (a) to the overall variance of allele frequencies (p(1-p)).

Note

The expected heterozygosity within a subpopulation i, 2p_i(1 - p_i), also happens to be twice the total variance of p within that subpopulation (“total” counting both within individuals and between individuals). Also, the expected heterozygosity over all the subpopulations may be written as H_t = 2p(1 - p), which happens to be twice the overall variance of p (both within individuals and between individuals). Since we may write

\theta = F_{st} = 1 - \frac{H_s}{H_t} = \frac{H_t - H_s}{H_t},

or

H_t \theta = H_t - H_s,

and since the total variance is the sum of the variance within subpopulations and the variance between subpopulations, we see that H_t \theta is twice the variance between subpopulations, or that p(1 - p) \theta is equal to the variance between subpopulations.

Meanwhile,

1 - F = \frac{H_i}{H_t},

or

H_t(1 - F) = H_i,

the observed heterozygosity within individuals, which is twice the sample-based population estimate of the variance of p within individuals. Thus the population estimate of the variance of p within individuals is p(1-p)(1 - F) = \frac{H_i}2, and the variance of p between individuals is thus p(1 - p)F. Thus, the variance of p between individuals but within subpopulations is

p(1 - p)F - p(1-p)\theta,

or

p(1-p)(F - \theta) .

Note

If we use the actual variances within the data that we have, rather than estimates of population variances based on our data, the above discussion is modified as follows:

Define

G = \frac{1 + F}{2} .

Then, the variance of the allele frequency p of a given allele A at any locus may be thought of as divided into three components, a, b, and c, the expectations of which are

Ea = p(1-p)\theta

Eb = p(1-p)(G - \theta)

Ec = p(1-p)(1 - G),

where a is the variance component between subpopulations, b is the variance component within subpopulations and between individuals, c is the actual variance component between gametes within individuals, and p is the expected frequency of allele A.

While the discussion in the preceding note showing that p(1 - p) \theta is equal to the variance between subpopulations still applies, we now note that since

1 - F = \frac{H_i}{H_t},

we have

1 - G = 1 - \frac{1 + F}{2}
      = \frac{1 - F}{2}
      = \frac{H_i}{2H_t},

or

2H_t(1 - G) = 4p(1 - p)(1 - G) = H_i,

the observed heterozygosity within individuals, which is four times the actual variance of p within individuals. Thus the actual variance of p within individuals is p(1-p)(1 - G) = \frac{H_i}4, and the variance of p between individuals is thus p(1 - p)G. Thus, the variance of p between individuals but within subpopulations is

p(1 - p)G - p(1-p)\theta,

or

p(1-p)(G - \theta) .

The variances a, b, and c are estimated from a given biallelic locus in such a way as to compensate for finite and possibly unequal sample sizes, a finite number of subpopulations, and the fact that the subpopulations are effectively statistical samplings of the original ancestor population from which they came. These estimates are as follows:

a = \frac{\bar{n}}{n_c}\left\{s^2 - \frac{1}{\bar{n} - 1}\left[\bar{p}(1 - \bar{p}) - \frac{r-1}{r}s^2 - \frac{\bar{h}}{4}\right]\right\}

b = \frac{\bar{n}}{\bar{n} - 1}\left[\bar{p}(1 - \bar{p}) - \frac{r-1}{r}s^2 - \frac{2\bar{n} - 1}{4\bar{n}}\bar{h}\right]

c = \frac{\bar{h}}{2},

where:

  • \tilde{p}_i is the frequency for allele A in the sample of size n_i from subpopulation i (i = 1,2,...,r),
  • \tilde{h}_i is the proportion of individuals with heterozygous genotypes in the sample from subpopulation i,
  • \bar{n} = \sum_i\frac{n_i}{r}, the average sample size,
  • n_c = \frac{r\bar{n} - \sum_i\frac{n_i^2}{r\bar{n}}}{r - 1} = \bar{n}\left(1 - \frac{C^2}{r}\right), where C^2 is the squared coefficient of variation of sample sizes,
  • \bar{p} = \sum_i\frac{n_i\tilde{p}_i}{r\bar{n}}, the average sample frequency of allele A,
  • s^2 = \sum_i\frac{n_i(\tilde{p}_i - \bar{p})^2}{(r - 1)\bar{n}}, the sample variance of allele A frequencies over populations, and
  • \bar{h} = \sum_i\frac{n_i\tilde{h}_i}{r\bar{n}}, the average heterozygote frequency for allele A.

For multiple loci, instead of just trying to average the estimates over the individual loci l using the variances a_l, b_l, and c_l, we instead use a weighted average, namely,

\hat{\theta}_W = \frac{\sum_l a_l}{\sum_l(a_l + b_l + c_l)}.

Here, contributions to the numerator and contributions to the denominator are each effectively weighted by \bar{p}(1-\bar{p}), giving more importance to terms coming from markers with a higher minor allele frequency and effectively eliminating terms coming from monomorphic loci.
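The estimators above translate fairly directly into code. The following Python is a minimal sketch, not the SVS implementation: the function names and the input layout (one list of (n_i, p_i, h_i) tuples per locus) are assumptions made for the example.

```python
def wc_components(subpops):
    """Variance components (a, b, c) for one biallelic locus.

    subpops is a list of (n_i, p_i, h_i) tuples: sample size, frequency
    of allele A, and proportion of heterozygotes in subpopulation i.
    """
    r = len(subpops)
    n_total = sum(n for n, _, _ in subpops)          # equals r * n_bar
    n_bar = n_total / r
    n_c = (n_total - sum(n * n for n, _, _ in subpops) / n_total) / (r - 1)
    p_bar = sum(n * p for n, p, _ in subpops) / n_total
    s2 = sum(n * (p - p_bar) ** 2 for n, p, _ in subpops) / ((r - 1) * n_bar)
    h_bar = sum(n * h for n, _, h in subpops) / n_total
    core = p_bar * (1 - p_bar) - (r - 1) / r * s2
    a = n_bar / n_c * (s2 - (core - h_bar / 4) / (n_bar - 1))
    b = n_bar / (n_bar - 1) * (core - (2 * n_bar - 1) / (4 * n_bar) * h_bar)
    c = h_bar / 2
    return a, b, c


def wc_fst(loci):
    """Weighted multi-locus estimate: sum of a over sum of (a + b + c)."""
    numerator = denominator = 0.0
    for subpops in loci:
        a, b, c = wc_components(subpops)
        numerator += a
        denominator += a + b + c
    if denominator == 0.0:
        return None  # every locus is monomorphic
    return numerator / denominator
```

A monomorphic locus yields a = b = c = 0 in this sketch, so it contributes nothing to either sum, which matches the weighting described above.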

Note

When you estimate F_{st} using one marker at a time, Golden Helix SVS will simply output missing values for monomorphic loci.

Note

Even though the estimates (derived from [WeirCockerham1984]) of a, b, and c are meant to compensate, among other things, for smaller sample sizes, this algorithm will still produce better results by using reasonable sample sizes for your subpopulations and using multiple genotypic markers.

For instance, due to the \frac{1}{\bar{n} - 1} factor used in estimating a, it is possible to obtain negative-number estimates of F_{st} (\theta) by using extremely few samples over just a few markers. (F_{st}, since it is one variance divided by another, should always be positive or at least zero.)

Note

If we were to use the actual variances within the data that we have, rather than estimates of population variances based on our data, we would use, for each individual marker,

a = \sigma^2,

c = \frac{\bar{h}}{4}, and

b = \bar{p}(1 - \bar{p}) - a - c, where

\sigma^2 = \sum_i\frac{n_i(\tilde{p}_i - \bar{p})^2}{r\bar{n}}, the actual variance of allele A frequencies over populations.

Algorithm Used for Confidence Intervals

To calculate the 95% confidence intervals around the F_{st} value, Golden Helix SVS uses a percentile-t bootstrapping technique described in [Leviyang2010]. This algorithm is described below. We use the f, \theta, and F notation of [WeirCockerham1984].

To find an estimate of the variance of \hat \theta, we can use jackknifing [WeirCockerham1984].

var(\hat \theta) \mathrel{\widehat{=}} \frac {m - 1} {m} \sum \limits_{L = 1}^{m} (\hat \theta_{(L)} - \frac {1} {m} \sum \limits_{L = 1}^{m} \hat \theta_{(L)})^2

where \hat \theta_{(L)} is the estimate of \hat \theta obtained by omitting locus L and m is the number of loci.

We then perform one thousand bootstrap replicates. In each replicate, a simple random sample of the loci is taken with replacement, and the F_{st} value (\hat \theta^*) is calculated for each subpopulation pair. Jackknifing is then used again to estimate the variance of \hat \theta^*. The last step of each bootstrap replicate is to calculate the t-statistic.

t^* = \frac {\hat \theta^* - \hat \theta} {\hat se_{\hat \theta^*}}

where \hat se_{\hat \theta^*} is the square root of the estimate of the variance of \hat \theta^* found through jackknifing.

These t-statistics are then stored in a list in ascending order to be used after the replicates are finished.

After the bootstrap replicates, the confidence interval around \hat \theta is found with:

(\hat \theta - t^*_{(1 - \alpha / 2)} \hat se_{\hat \theta}; \hat \theta - t^*_{(\alpha / 2)} \hat se_{\hat \theta})

where, since we are finding the 95% confidence interval, \alpha is 0.05, and \hat se_{\hat \theta} is the square root of the estimate of the variance of \hat \theta found earlier.
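The jackknife and percentile-t steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the SVS implementation: the function names, the input layout (one (a_l, b_l, c_l) tuple per locus for a single subpopulation pair), the fixed random seed, and the rule used to pick quantiles from the sorted t-statistics are all choices made for the example.

```python
import random
from math import sqrt


def theta_hat(components):
    """Weighted F_st estimate from per-locus (a, b, c) tuples."""
    numerator = sum(a for a, _, _ in components)
    denominator = sum(a + b + c for a, b, c in components)
    return numerator / denominator


def jackknife_se(components):
    """Square root of the leave-one-locus-out jackknife variance."""
    m = len(components)
    num = sum(a for a, _, _ in components)
    den = sum(a + b + c for a, b, c in components)
    # Estimate with locus L omitted, computed from the running totals.
    omitted = [(num - a) / (den - (a + b + c)) for a, b, c in components]
    mean = sum(omitted) / m
    return sqrt((m - 1) / m * sum((t - mean) ** 2 for t in omitted))


def percentile_t_ci(components, replicates=1000, alpha=0.05, seed=0):
    """Percentile-t bootstrap confidence interval for theta."""
    rng = random.Random(seed)  # assumption: seeded for reproducibility
    theta = theta_hat(components)
    se = jackknife_se(components)
    t_stats = []
    for _ in range(replicates):
        resample = [rng.choice(components) for _ in components]
        se_star = jackknife_se(resample)
        if se_star == 0.0:
            continue  # degenerate resample: t-statistic undefined
        t_stats.append((theta_hat(resample) - theta) / se_star)
    t_stats.sort()
    # Assumption: quantiles are taken by truncating to a list index.
    t_low = t_stats[int((alpha / 2) * (len(t_stats) - 1))]
    t_high = t_stats[int((1 - alpha / 2) * (len(t_stats) - 1))]
    return theta - t_high * se, theta - t_low * se
```

Note that the upper quantile of the t-statistics sets the lower bound and the lower quantile sets the upper bound, as in the interval formula above. The sketch does not guard against resamples whose denominators sum to zero.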

Output

For the all markers mode, two spreadsheets are made. One spreadsheet is made for the by marker mode. Please see the output section of Estimates Made Using All Markers or of Estimates Made Using One Marker at a Time for details.

Separately Computing the Genomic Relationship Matrix

This tool outputs the genomic relationship matrix (The GBLUP Genomic Relationship Matrix) from a genotypic or numerically recoded spreadsheet.

This matrix can be used as a pre-computed genomic relationship matrix for GBLUP computations (Genomic Best Linear Unbiased Predictors Analysis) or as a pre-computed kinship matrix for the EMMAX and MLMM Mixed Model GWAS methods (Mixed Linear Model Analysis). It may also be used for visualization of the cryptic relatedness of samples.

For further details about pre-computed kinship matrices in general, see Precomputed Kinship Matrix Option.

Note

This method uses (with a genotypic spreadsheet) or assumes (with a numerically recoded spreadsheet) an additive genetic model.

gblupGrmDialog

Compute GBLUP Genomic Relationship Matrix Dialog Window

Options

  • Impute missing data as: Missing genotype data can be imputed by either of the following methods:

    • Homozygous major allele: All missing genotype data will be recoded to 0.

    • Numerically as average value: All missing genotype data will be recoded to the average of all non-missing genotype calls (using the additive model).

      Note

      If Correct for Gender (see below) is also selected, and there is non-missing data for both males and females in a given marker, averages for males and females will be computed and used separately.

  • Correct for Gender: Assumes the column is coded as if the male were homozygous for the X-Chromosome allele in question. Uses the [Taylor2013] gender-correction algorithm (see Correcting for Gender).

    • Choose Sex Column: Choose the spreadsheet column that specifies the gender of the sample. This column may either be categorical (“M” vs. “F”) or binary (0 = male, 1 = female).
    • Chromosome that is hemizygous for males: Usually the X Chromosome, which is the default.
    • Dosage Compensation: Modify the dosage compensation.
  • Select Algorithm: Select which form of normalization is preferred for computation.

    • Overall normalization: Normalization is performed on a more global scale.
    • Normalize by individual marker (GCTA method): Normalization is performed on a local, marker-based level.
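To make the difference between the two normalization options more concrete, here is a minimal Python sketch. It assumes, and the text above does not confirm, that overall normalization divides the cross-products of centered genotypes by the sum of 2p(1-p) over markers, and that the per-marker option divides each marker's contribution by its own 2p(1-p) and then averages over markers. The function name, the 0/1/2 additive coding, the absence of missing data, and the dropping of monomorphic markers are further assumptions.

```python
def genomic_relationship_matrix(genotypes, per_marker=False):
    """Genomic relationship matrix from additive 0/1/2 genotype codes.

    genotypes is a list of samples, each a list of allele counts with
    no missing values (assumed already imputed).
    """
    n = len(genotypes)
    m = len(genotypes[0])
    freqs = [sum(row[j] for row in genotypes) / (2.0 * n) for j in range(m)]
    # Assumption: monomorphic markers carry no information and are dropped.
    keep = [j for j in range(m) if 0.0 < freqs[j] < 1.0]
    centered = [[row[j] - 2 * freqs[j] for j in keep] for row in genotypes]
    scales = [2 * freqs[j] * (1 - freqs[j]) for j in keep]
    total_scale = sum(scales)
    grm = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(i, n):
            if per_marker:
                # Scale each marker separately, then average over markers.
                value = sum(zi * zk / s for zi, zk, s
                            in zip(centered[i], centered[k], scales)) / len(keep)
            else:
                # Scale the summed cross-products by one global factor.
                value = sum(zi * zk for zi, zk
                            in zip(centered[i], centered[k])) / total_scale
            grm[i][k] = grm[k][i] = value
    return grm
```

Under these assumptions the two forms agree whenever all markers have the same allele frequency, and differ mainly in how much weight rare-allele markers receive.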

Output

A GBLUP Genomic Relationship Matrix spreadsheet will be created.

Computing the Numerator Relationship Matrix

This tool outputs the numerator relationship matrix (sometimes referred to as the “A Matrix”) from the pedigree information in the current spreadsheet.

This matrix can be used as a pre-computed kinship matrix for the EMMAX and MLMM Mixed Model GWAS methods (Mixed Linear Model Analysis) and for the Mixed Model KBAC method (Mixed-Model Kernel-Based Adaptive Cluster (KBAC) Method).

For further details about pre-computed kinship matrices in general, see Precomputed Kinship Matrix Option.

Overview of Theory

The off-diagonal coefficient a_{ij} (i \ne j) for the i-th row and j-th column of this matrix is, if both parents of pedigree member j are in the pedigree and are designated k and l, the average of the numerator relationship coefficients a_{ik} and a_{il} between each parent and pedigree member i. That is,

a_{ij} = \frac{a_{ik} + a_{il}}{2}.

If only one parent k of pedigree member j is in the pedigree, and pedigree member i is in the same generation or an earlier generation than is pedigree member j, we have

a_{ij} = \frac{a_{ik}}{2}.

If neither parent of either pedigree member i or of pedigree member j is in the pedigree (that is, both i and j are “founders”), we have

a_{ij} = 0.

We also always have

a_{ji} = a_{ij}.

For the diagonal coefficient a_{jj} for pedigree member j, we have, if both parents k and l are in the pedigree,

a_{jj} = 1 + f_j = 1 + \frac{a_{kl}}{2} .

f_j is known as the “coefficient of inbreeding” (as computed from this pedigree) for pedigree member j.

If it is not true that both parents are in the pedigree, we use

a_{jj} = 1 .
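The rules above can be applied recursively once the pedigree is ordered so that parents come before their children. The Python sketch below illustrates this; the function name, the dictionary input format, and the returned data layout are assumptions for the example, and it assumes the pedigree contains no cycles.

```python
def numerator_relationship_matrix(pedigree):
    """Return (order, a), where a[(i, j)] is the numerator relationship.

    pedigree maps each member to (father, mother); None marks an
    unknown parent.
    """
    full = dict(pedigree)
    # Parents named but not listed get founder entries of their own.
    for father, mother in pedigree.values():
        for parent in (father, mother):
            if parent is not None and parent not in full:
                full[parent] = (None, None)

    order, placed = [], set()

    def place(member):
        # Depth-first ordering so that parents precede their children.
        if member in placed:
            return
        for parent in full[member]:
            if parent is not None:
                place(parent)
        placed.add(member)
        order.append(member)

    for member in full:
        place(member)

    a = {}
    for index, j in enumerate(order):
        father, mother = full[j]
        for i in order[:index]:
            value = 0.0
            if father is not None:
                value += 0.5 * a[(i, father)]
            if mother is not None:
                value += 0.5 * a[(i, mother)]
            a[(i, j)] = a[(j, i)] = value
        if father is not None and mother is not None:
            a[(j, j)] = 1.0 + 0.5 * a[(father, mother)]
        else:
            a[(j, j)] = 1.0
    return order, a
```

For a child of two full siblings, this gives a diagonal element of 1.25, reflecting an inbreeding coefficient of 0.25.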

Note

This matrix is called a “numerator” relationship matrix because its coefficients are effectively the numerators of the relationship coefficients r_{kl} given by Sewall Wright [Wright1922], the relation being

r_{kl} = \frac{a_{kl}}{\sqrt{(1 + f_k)(1 + f_l)}} .

Note

See Inbreeding Coefficients to estimate coefficients of inbreeding based on genotypic information.

Data Requirements

This feature must be run from a pedigree spreadsheet that contains pedigree information for every sample to be included in the mixed-model analysis that will use the output of this feature as a pre-computed kinship spreadsheet.

The pedigree does not need to be sorted in any particular order. This tool will determine the proper ordering of the pedigree members for computational purposes.

If there is no row entry for any given parent of a pedigree member, a virtual entry for that parent will be created internally for computational purposes.

Output

A numerator relationship matrix (“A Matrix”) for the current spreadsheet’s pedigree will be generated.

Filter Samples by Call Rate

Genotype Statistics by Sample call rates are calculated, and samples whose call rates do not meet the specified criteria will be inactivated. If at least one sample, but not all of the samples, is inactivated, a subset of active rows is created.

From a spreadsheet containing several genotypic columns, choose Genotype > Quality Assurance > Filter Samples by Call Rate. The spreadsheet output (Statistics by Sample) contains ten columns; the first column contains the number of called genotypes (not including missing values) and the second column contains the call rate, defined as the number of non-missing values divided by the number of genotype columns. The rest of the columns in the Statistics by Sample spreadsheet are heterozygosity statistics. See Genotype Statistics by Sample for more information.
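The call rate definition above amounts to the following Python sketch. The function name, the dictionary input, the use of None for a missing genotype, and the rule that a sample is kept when its call rate is at or above the threshold are assumptions for illustration; the actual criteria are set in the dialog.

```python
def filter_samples_by_call_rate(genotypes, min_call_rate):
    """Return {sample: (called, call_rate, keep)} for each sample.

    genotypes maps a sample name to its list of genotype calls, with
    None for a missing value.
    """
    results = {}
    for sample, calls in genotypes.items():
        called = sum(1 for g in calls if g is not None)
        # Call rate: non-missing values divided by number of genotype columns.
        rate = called / len(calls) if calls else 0.0
        results[sample] = (called, rate, rate >= min_call_rate)
    return results
```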

If at least one row, but not all of the rows, are inactivated, a subset will also be created.

LD Pruning

Overview

Some tests such as Identity by Descent and Inbreeding Coefficient Estimation will obtain better results if the markers used are not in linkage disequilibrium with each other.

Therefore, Golden Helix SVS provides this feature to inactivate (“prune”) markers that are in linkage disequilibrium with other markers that are left active, so that you may do your tests just with those active markers that are not as much in LD with each other.

Data Requirements

First, import your data into a Golden Helix SVS project (See Importing Your Data Into A Project) to create a genotypic spreadsheet. It is recommended that the spreadsheet be marker-mapped to ensure that the markers are in the proper sequence. The samples in your spreadsheet are required to be rowwise. The LD Pruning dialog can be accessed by selecting Genotype > Quality Assurance > LD Pruning from the spreadsheet menu.

Method

All pairs of markers within a moving window, whose size and increment you may specify, are compared with each other to measure their pairwise LD. If any pair of markers which are both within the moving window are in LD greater than the specified threshold, the first marker of the pair will be inactivated (“pruned”).
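The moving-window procedure can be sketched in Python as shown below. This is an illustration only: the squared Pearson correlation of 0/1/2 allele counts is used as a stand-in for the r^2 statistic and is neither the CHM nor the EM method, the data are assumed to have no missing values, and the function names and the choice to skip markers that are already pruned are assumptions.

```python
def r_squared(x, y):
    """Squared Pearson correlation of two allele-count vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((v - mean_x) ** 2 for v in x)
    syy = sum((v - mean_y) ** 2 for v in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic marker has no measurable LD
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def ld_prune(markers, window_size, increment, threshold):
    """Return a list of booleans: True where a marker stays active.

    markers is a list of per-marker allele-count lists in map order.
    """
    active = [True] * len(markers)
    start = 0
    while start < len(markers):
        end = min(start + window_size, len(markers))
        for i in range(start, end):
            for j in range(i + 1, end):
                if not active[i]:
                    break  # marker i is already pruned
                if active[j] and r_squared(markers[i], markers[j]) > threshold:
                    active[i] = False  # prune the first marker of the pair
        start += increment
    return active
```

With a window size of 3 and an increment of 1, each marker is compared with its two successors before the window moves on.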

Parameters

Window Size

Enter the window size in number of markers.

Window Increment

Enter the number of markers by which the beginning window position will be incremented.

LD Statistic

Choose r^2 or D' as the statistic to apply the threshold value to.

LD Threshold

For any pair of markers whose LD statistic is larger than this, the first marker of the pair will be inactivated (“pruned”).

LD Computation Method

Check whether to use CHM or EM. CHM is computationally much faster, and gives almost the same results as the EM method.

End Result

The column inactivation will be on the spreadsheet you are working with, or on a new tab containing a copy of the spreadsheet you have been working with.

SNP Density

Reports various SNP density statistics across all markers in a marker-mapped spreadsheet. To calculate the statistics, open a marker-mapped spreadsheet and select Genotype > Quality Assurance > SNP Density. The following statistics will appear in a window: Minimum Gap (bp), Maximum Gap (kb), Average Gap (kb), and SNP Density (1 SNP per X.XXkb).
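The gap statistics can be illustrated with a short Python sketch. It assumes that a gap is the distance in base pairs between consecutive markers on the same chromosome, which the description above does not spell out, and it does not attempt to reproduce the SNP Density figure, whose exact definition is not given here. The function name and the input format are also assumptions.

```python
def gap_statistics(markers):
    """Return minimum, maximum, and average gaps between adjacent markers.

    markers is a list of (chromosome, position_bp) pairs in any order.
    """
    by_chromosome = {}
    for chromosome, position in markers:
        by_chromosome.setdefault(chromosome, []).append(position)
    gaps = []
    for positions in by_chromosome.values():
        positions.sort()
        # Assumption: gaps are measured only within a chromosome.
        gaps.extend(b - a for a, b in zip(positions, positions[1:]))
    if not gaps:
        return None  # fewer than two markers on every chromosome
    return {
        "min_gap_bp": min(gaps),
        "max_gap_kb": max(gaps) / 1000,
        "avg_gap_kb": sum(gaps) / len(gaps) / 1000,
    }
```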

Mendelian Error Check

This feature can count and report all Mendelian errors, replace all errors with missing calls, or both.

If Report Mendelian Errors is checked, the feature counts Mendelian errors over all trios and reports the total per marker and per child. Partial trios are also examined, but fewer errors can be detected by definition. Two output spreadsheets are created:

  • Mendelian Errors by Marker has one row for each genotypic column found in the original spreadsheet and an integer error count column.
  • Mendelian Errors by Sample has one row for each child found in the original spreadsheet and an integer error count column.

If Remove Mendelian Errors is checked, a child spreadsheet is created with the same dimensions as the original spreadsheet. This spreadsheet has all Mendelian errors removed and replaced with missing values. The number of calls replaced is reported in the node change log and should equal the sum of each Mendelian Errors column in the report spreadsheets.

This feature requires a pedigree spreadsheet with several genotypic columns.
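As an illustration of what counting Mendelian errors per marker and per child involves, here is a minimal Python sketch for autosomal biallelic markers. The function names, the representation of a genotype as a pair of allele labels, the use of None for a missing call, and the partial-trio rule (the child must share at least one allele with the one known parent) are assumptions, not a description of the SVS implementation.

```python
def is_mendelian_error(child, father, mother):
    """True if the child's genotype cannot come from the given parents."""
    if child is None:
        return False  # a missing child call cannot be checked
    c1, c2 = child
    if father is not None and mother is not None:
        # Full trio: one allele must come from each parent.
        consistent = ((c1 in father and c2 in mother)
                      or (c2 in father and c1 in mother))
        return not consistent
    parent = father if father is not None else mother
    if parent is None:
        return False  # neither parent is genotyped
    # Partial trio: only a complete mismatch is detectable.
    return c1 not in parent and c2 not in parent


def count_mendelian_errors(trios, genotypes):
    """Return (errors per marker, errors per child).

    trios is a list of (child, father, mother) sample names; a parent
    may be None. genotypes maps a sample name to its per-marker calls.
    """
    n_markers = len(next(iter(genotypes.values())))
    by_marker = [0] * n_markers
    by_child = {}
    for child, father, mother in trios:
        by_child[child] = 0
        for j in range(n_markers):
            f = genotypes[father][j] if father in genotypes else None
            m = genotypes[mother][j] if mother in genotypes else None
            if is_mendelian_error(genotypes[child][j], f, m):
                by_marker[j] += 1
                by_child[child] += 1
    return by_marker, by_child
```

The sum of the per-marker counts equals the sum of the per-child counts, since each detected error increments both.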

Inbreeding Coefficients

Overview

If there is inbreeding among individuals represented in a dataset, this will reduce the independence of the data. For this reason, and to better assure data quality in your data samples, Golden Helix SVS can estimate the inbreeding coefficient f for each individual represented in your data.

  • It is recommended that estimating inbreeding coefficients in Golden Helix SVS should be used for data quality control, rather than for actually attempting to impute inbreeding on the part of individuals whose samples you are analyzing.
  • It is usually advisable to apply LD pruning (Genotype > Quality Assurance > LD Pruning from the spreadsheet menu) before using this feature.
  • This inbreeding coefficient f is equivalent to Wright’s within-subpopulation fixation index, F_{is}, in population genetics. (See Fixation Index Fst and Fixation Index Fst (by Marker).)
  • Values may range from -1 to +1. Negative values indicate outbreeding (or data quality problems for large negative values), and positive values indicate inbreeding (or other data quality problems).
  • You will obtain the best values when you use many samples and many markers. This is due to the need to estimate allele frequencies over multiple samples, as well as the need to estimate f itself over multiple markers.

Note

If you have a pedigree which may reflect inbreeding, you can use Computing the Numerator Relationship Matrix to check this by computing the numerator relationship matrix on that pedigree. The diagonal elements for any pedigree members for which the pedigree shows inbreeding will be larger than one by the amount of the inbreeding coefficient (as computed from the pedigree).

Data Requirements

First, import your data into a Golden Helix SVS project (See Importing Your Data Into A Project) to create a marker-mapped genotypic spreadsheet. The samples in your spreadsheet are required to be rowwise. Only the autosomal genotype columns will be used by this feature. The inbreeding coefficient dialog can be accessed by selecting Genotype > Quality Assurance > Inbreeding Coefficients from the spreadsheet menu.

Computation

For a particular marker with allele frequencies p and q, the probability that an individual is homozygous is f + (1-f)(p^2 + q^2), or the probability of being homozygous by descent (f) plus the probability of being homozygous by chance. If an individual has L genotyped autosomal markers, O is the number of observed homozygotes for the individual over all markers, and E is the number expected by chance, then O=fL + (1-f)E, or f=\frac{O-E}{L-E}.

Since allele frequencies are estimated from the data in your spreadsheet, an unbiased estimator of E is used. It is summed over all markers that are not missing for the individual: E = \sum_{j=1}^L[1 - 2p_jq_j\frac{T_{A_j}}{(T_{A_j} - 1)}], where T_{A_j} is twice the number of non-missing genotypes for marker j (that is, the number of observed alleles for that marker).
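The computation above can be sketched as follows. This is a minimal illustration of the formulas for f and E, not the SVS implementation; the function names and the encoding of genotypes as minor-allele counts (0, 1, 2, or None for a missing call) are assumptions made for the example.

```python
import math
from typing import List, Optional, Tuple

Genotype = Optional[int]  # minor-allele count 0, 1 or 2; None = missing call


def expected_homozygosity(calls: List[Genotype]) -> Optional[float]:
    """Per-marker term 1 - 2*p*q*T_A/(T_A - 1); None if no genotype is called."""
    observed = [g for g in calls if g is not None]
    t_a = 2 * len(observed)  # T_A: twice the number of non-missing genotypes
    if t_a == 0:
        return None
    p = sum(observed) / t_a  # allele frequency estimated from the data
    q = 1.0 - p
    return 1.0 - 2.0 * p * q * t_a / (t_a - 1)


def inbreeding_coefficients(
    genotypes: List[List[Genotype]],
) -> List[Tuple[float, int, int, float]]:
    """Return (f, L, O, E) per sample; rows are samples, columns are markers."""
    n_markers = len(genotypes[0]) if genotypes else 0
    exp_hom = [
        expected_homozygosity([row[j] for row in genotypes])
        for j in range(n_markers)
    ]
    results = []
    for row in genotypes:
        n_used, obs_hom, exp_total = 0, 0, 0.0
        for g, e in zip(row, exp_hom):
            if g is None or e is None:
                continue  # markers missing for this individual are skipped
            n_used += 1
            obs_hom += 1 if g != 1 else 0  # 0 and 2 are homozygous calls
            exp_total += e
        denom = n_used - exp_total
        f = (obs_hom - exp_total) / denom if denom > 0 else math.nan
        results.append((f, n_used, obs_hom, exp_total))
    return results


# Four samples, two markers, no missing calls.
toy = [[0, 1], [1, 1], [2, 0], [1, 2]]
toy_results = inbreeding_coefficients(toy)
```

Each returned tuple mirrors the output columns described below: the coefficient f, the number of markers analyzed, the observed homozygotes, and the expected homozygotes. With only two markers and four samples the toy values are far too noisy to interpret, which is why many samples and many markers are recommended.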

Parameters

Genome

Check either the Human radio button or the Non-Human radio button.

Number of Autosomes

Enter the number of autosomes which the genome you are using contains.

Output

A spreadsheet is output with one row for each individual. The output columns consist of the inbreeding coefficient (f), the number of markers analyzed for the individual, the number of observed homozygotes for the individual, and the number of expected homozygotes for the individual.

PBAT Family-Based QA Statistics

The quality control statistics for family-based studies are used to measure the genotyping error rate of each proband in a family individually. See [Fardo2009].

PBATqcStats

PBAT Family-Based QA Window

Data Requirements

PBAT family-based QA statistics require a pedigree dataset containing genotypic data. First, import your data into a Golden Helix SVS project (See Importing Your Data Into A Project). The family-based statistics dialog can be accessed by selecting Genotype > PBAT Family-Based QA from the spreadsheet menu.

Processing

Select the computation parameters and output options, then select the Run button to process. The computation parameters and output options are described below.

One spreadsheet of results will be created as a child of the current spreadsheet navigator window node. Information about the parameters used will be recorded in the Node Change Log.

Computation Parameters

Algorithm

If Use alternative rapid pedigree algorithm is not selected, the standard PBAT algorithm for processing extended pedigrees is used and Mendelian errors will not be calculated.

If Use alternative rapid pedigree algorithm is selected (the default), the alternative rapid pedigree algorithm for processing extended pedigrees is used and Mendelian errors will be calculated. See Alternative Rapid Pedigree Algorithm for more information.

Number of non-founders in one pedigree

Enter one more than the maximum number of non-founders allowed in a single pedigree. “Non-founders” refers to subjects in the pedigree who have parents whose data is also in the pedigree. If a pedigree is found to have this number of non-founders or more, it will not be processed. For instance, to restrict processing to pedigrees with at most two siblings plus their parents, enter 3 in this box.

Note

Under the alternative rapid pedigree algorithm, this parameter refers to the maximum number of non-founders within the family clusters identified by this algorithm, rather than to the maximum number of non-founders within any original extended pedigree.
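The effect of this threshold can be sketched as follows. This is an illustration only, not PBAT or SVS code; the record layout, the function names, and the rule that a subject counts as a non-founder when both listed parents are members of the same pedigree are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Hypothetical record layout: (family_id, subject_id, father_id, mother_id).
# None marks a parent that is not recorded.
Record = Tuple[str, str, Optional[str], Optional[str]]


def count_non_founders(records: List[Record]) -> Dict[str, int]:
    """Count subjects whose parents are both present in the same pedigree."""
    members = defaultdict(set)
    for family, subject, _, _ in records:
        members[family].add(subject)
    counts = {family: 0 for family in members}
    for family, _, father, mother in records:
        if father in members[family] and mother in members[family]:
            counts[family] += 1
    return counts


def pedigrees_to_process(records: List[Record], limit: int) -> List[str]:
    """Keep pedigrees with fewer non-founders than the entered value."""
    counts = count_non_founders(records)
    return sorted(family for family, n in counts.items() if n < limit)


records = [
    ("A", "a1", None, None), ("A", "a2", None, None),
    ("A", "a3", "a1", "a2"), ("A", "a4", "a1", "a2"),
    ("B", "b1", None, None), ("B", "b2", None, None),
    ("B", "b3", "b1", "b2"), ("B", "b4", "b1", "b2"), ("B", "b5", "b1", "b2"),
]
kept = pedigrees_to_process(records, limit=3)
```

With the value 3, family A (two non-founders) is kept, while family B (three non-founders) reaches the entered value and is skipped.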

Output

Output by marker

The rows will correspond to markers and the columns in the output spreadsheet will be:

  • MAF: Minor Allele Frequency for the specified marker.

  • Mendelian errors: Number of Mendelian errors for the specified marker.

    Note

    This column will only display if the alternative rapid pedigree algorithm was selected.

  • HW: Hardy-Weinberg Equilibrium value for the specified marker.

  • FBATS: Sum of the transmission scores for the specified marker that would occur if a TDT test were to be done in which all probands were assumed to be “affected”, and the null hypothesis were “no linkage and no association”.

  • FBATV: Sum of terms of the variance matrix for the specified marker that would occur under the above-mentioned test.

  • FBATV2: Sum of squares of the transmission scores over the probands for the specified marker that would occur under the above-mentioned test.

Output by proband

This is the default output selection.

The rows will correspond to probands and the columns in the output spreadsheet will be:

  • # Markers: The number of markers used for the calculation.

  • Mendelian errors: Number of Mendelian errors for the specified proband.

    Note

    This column will only display if the alternative rapid pedigree algorithm was selected.

  • Tgw p-value: P-value of the standardized genome-wide transmission statistic. This statistic follows an approximate \chi^2 distribution with one degree of freedom.

  • Tgw: Standardized genome-wide transmission statistic. A value greater than 30 for this statistic may indicate a substantial amount of genotyping error in the data for this proband.

  • E(delta X): Expected Mendelian residual.

  • var(delta X): Variance of the Mendelian residual.

Output all details

The rows will correspond to markers and the columns in the output spreadsheet will be:

  • MAF: Minor Allele Frequency for the specified marker.

  • Mendelian errors: Number of Mendelian errors for the specified marker.

    Note

    This column will only display if the alternative rapid pedigree algorithm was selected.

  • HW: Hardy-Weinberg Equilibrium value for the specified marker.

  • FBATS: Sum of the transmission scores for the specified marker that would occur if a TDT test were to be done in which all probands were assumed to be “affected”, and the null hypothesis were “no linkage and no association”.

  • FBATV: Sum of terms of the variance matrix for the specified marker that would occur under the above-mentioned test.

  • FBATV2: Sum of squares of the transmission scores over the probands for the specified marker that would occur under the above-mentioned test.

  • Columns for probands: A column for every proband. Each value listed is the contribution to the FBATS statistic and to the Tgw statistic for the specified proband and specified marker.

    Note

    A missing value in any cell of this column indicates that there was a Mendelian error for this proband with this marker’s data.

Output -log 10 p-values

This option is only available for Output by proband. It calculates -\log_{10}(\text{Tgw p-value}) for every proband.
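As a rough illustration of how a Tgw value relates to its p-value and to the -log 10 scale, the sketch below assumes the p-value is the upper tail of a \chi^2 distribution with one degree of freedom, as stated for the Tgw p-value column. The function names are hypothetical and this is not the PBAT code.

```python
import math


def chi2_1df_pvalue(statistic: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom.

    For one degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    """
    if statistic < 0:
        raise ValueError("a chi-square statistic cannot be negative")
    return math.erfc(math.sqrt(statistic / 2.0))


def neg_log10(p_value: float) -> float:
    """Return -log10(p); infinity if p has underflowed to zero."""
    if p_value <= 0.0:
        return math.inf
    return -math.log10(p_value)


# A Tgw of 30 is the level above which genotyping error may be suspected.
p_at_30 = chi2_1df_pvalue(30.0)
score_at_30 = neg_log10(p_at_30)
```

On this scale, larger values correspond to smaller p-values, so probands whose data may contain substantial genotyping error stand out as the largest entries in the column.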