# Genotype Data Quality Assessment and Utilities

To ensure data is of the highest quality, SVS provides a variety of features that not only help assess the quality of data, but remedy any problems as well.

## Genotype Statistics by Marker

Several types of overall marker statistics and genetic measures are
available as output (see *Genotype Statistics By Marker – Allele Frequency Classification* and *Genotype Statistics By Marker – Reference and Alternate Allele Classification*).

Note

- Most of these statistics are output only for bi-allelic markers, or, if markers are classified according to reference/alternate alleles, for markers containing one alternate allele. For the other markers, only the statistics that make sense for those other markers are output.
- If there is a case/control dependent variable or a categorical dependent variable with fewer than 30 categories, statistics will be output for all samples and for individual sample categories.
- If there is a quantitative dependent variable, statistics will be output for all samples and for the categories of dependent not missing and dependent missing.
- These statistics can be calculated simultaneously with running a
genotypic association test. To do this, see section
*Genotype Association Tests*. (If PCA correction is also used and there are PCA outliers, a separate category of statistics will be output for samples comprising these outliers.)

Warning

Statistics calculated by this function do not adjust for gender and are therefore not always appropriate for non-autosomal chromosomes.

### Data Requirements

General marker statistics require a dataset containing genotypic data.
Optionally, case/control, categorical, or quantitative phenotype data can
be used to subdivide most of these statistics according to “case” and
“control”, category, or missing/not missing status. For a quantitative
dependent variable, an average of the dependent variable is also output
for each genotype category.

First, import your data into a Golden Helix SVS
project (see *Importing Your Data Into A Project*). Once you have the spreadsheet for this data,
select the column representing the dependent variable (see
*Column States*) if you wish to subdivide your statistics by “case” and
“control” or by category, or to get average values. If no dependent variable is selected,
then only overall statistics will be returned. The marker statistics
dialog can be accessed by selecting **Genotype** >
**Genotype Statistics by Marker** from the spreadsheet menu.

### Processing

Select how alleles should be classified, either by allele frequency or by reference/alternate alleles. If reference/alternate allele classification is selected, the marker map field containing the reference allele must also be chosen.

Select your marker statistics options and select the **Run** button to
process. Descriptions of the marker statistics options are detailed
below.

One spreadsheet of results will be created as a child of the current spreadsheet navigator window node. Information about the number of markers analyzed and the number of markers having greater than two alleles is entered into the Node Change Log for the Marker Statistics spreadsheet.

Note

As noted above, only a few statistics are displayed for markers having more than two alleles (or more than one alternate allele).

### Call Rate

This option displays the fraction of genotypes that are present and not missing for the given marker.

With data from certain providers you can also set a confidence threshold on import to indicate which genotypes are to be called or not.

### Number of Alleles

This option counts the number of distinct alleles for the given marker. If the entire column is missing, 0 is returned.

### Allele Names

This output will always be displayed.

If alleles are classified according to allele frequency, the two
columns *Minor Allele* and *Major Allele* are displayed, each column
showing the allele name of the respective allele. If the marker is
monomorphic, the lone allele will be reported as the major allele,
while the minor allele will be reported as missing.

If alleles are classified according to reference/alternate alleles,
the two columns *Alternate Allele* and *Reference Allele* are
displayed, each column showing the allele name of the respective
allele.

If the results of a genotypic association test are also being shown in
this spreadsheet, the first column will be labeled *Minor Allele (Test
Allele)* or *Alternate Allele (Test Allele)* instead of *Minor Allele*
or *Alternate Allele*.

### Allele Frequencies

If alleles are classified according to allele frequency, this option
displays the two columns *Minor Allele Freq.* and *Major Allele Freq.*
The *Minor Allele Freq.* is the fraction of the given marker’s total
alleles that are minor alleles, and similarly, the *Major Allele
Freq.* is the fraction of the given marker’s total alleles that are
major alleles. If the marker is monomorphic, the “major allele
frequency” will be reported as 1, while the “minor allele frequency”
will be reported as 0.

If alleles are classified according to reference/alternate alleles,
this option displays the two columns *Alternate Allele Frequency* and
*Reference Allele Frequency*. The Alternate Allele Frequency is the
fraction of the given marker’s total alleles that are alternate
alleles, and similarly, the Reference Allele Frequency is the fraction
of the given marker’s total alleles that are reference alleles.
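The call rate, allele count, allele names, and allele frequency outputs described above can be sketched for a single marker as follows. This is only an illustration under stated assumptions: the function name `marker_stats`, the `"A_G"` string encoding of a genotype, and the use of `None` for a missing call are all hypothetical and are not taken from SVS itself.

```python
from collections import Counter


def marker_stats(genotypes):
    """Summarize one marker under allele-frequency classification.

    `genotypes` holds one entry per sample: a string such as "A_G",
    or None for a missing call (an encoding assumed for this sketch).
    """
    called = [g for g in genotypes if g is not None]
    # Call rate: fraction of genotypes that are present and not missing.
    call_rate = len(called) / len(genotypes) if genotypes else 0.0
    # Count every allele of every called genotype.
    counts = Counter(allele for g in called for allele in g.split("_"))
    total = sum(counts.values())
    stats = {
        "call_rate": call_rate,
        "n_alleles": len(counts),  # 0 if the whole column is missing
        "minor": None,
        "major": None,
        "maf": None,
        "major_freq": None,
    }
    if len(counts) == 1:
        # Monomorphic: the lone allele is the major allele, minor is missing.
        stats.update(major=next(iter(counts)), maf=0.0, major_freq=1.0)
    elif len(counts) == 2:
        (major, major_n), (minor, minor_n) = counts.most_common(2)
        stats.update(major=major, minor=minor,
                     maf=minor_n / total, major_freq=major_n / total)
    # Markers with more than two alleles keep only call rate and allele count.
    return stats
```

For example, `marker_stats(["A_A", "A_G", "G_G", "A_A", None])` counts five `A` alleles and three `G` alleles among four called genotypes, so `A` is reported as the major allele.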

### Carrier Count

If alleles are classified according to allele frequency, this option displays the number of genotypes containing at least one minor allele.

If alleles are classified according to reference/alternate alleles then this option displays the number of genotypes containing at least one alternate allele.

Note

If alleles are classified according to reference/alternate alleles and a marker has more than one alternate allele, this option will still display the number of genotypes for which at least one of its alleles is an alternate allele.

Note

If all data is missing for a marker, or alleles are classified according to allele frequency and a marker has more than two alleles, or alleles are classified according to reference/alternate and a marker has no reference allele designated, this option will display zero for that marker.
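A minimal sketch of the two carrier-count definitions above, assuming the same hypothetical `"A_G"` genotype encoding with `None` for a missing call. The function names are illustrative only.

```python
def carriers_of_allele(genotypes, allele):
    """Count called genotypes containing at least one copy of `allele`
    (for example, the minor allele under allele-frequency classification)."""
    if allele is None:
        return 0
    return sum(1 for g in genotypes
               if g is not None and allele in g.split("_"))


def carriers_of_alternate(genotypes, reference):
    """Count called genotypes with at least one non-reference allele.

    Any allele other than `reference` counts, so a marker with several
    alternate alleles is still handled. Returns 0 when no reference
    allele is designated.
    """
    if reference is None:
        return 0
    return sum(1 for g in genotypes
               if g is not None
               and any(a != reference for a in g.split("_")))
```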

### Hardy-Weinberg Equilibrium P-Value

This option displays the Hardy-Weinberg Equilibrium (HWE) Correlation P-Values for each marker.

This statistic will also be output separately for categories or missing/not missing status, if applicable.

Please see the section in the Formulas and Theories chapter for how this
statistic is computed (*Hardy-Weinberg Equilibrium Computation*).

### Fisher’s Exact Test for HWE P-Value

This option displays Fisher’s Exact Test HWE P-Values for each marker.

This statistic will also be output separately for categories or missing/not missing status, if applicable.

Please see the section in the Formulas and Theories chapter for how this
statistic is computed (*Fisher’s Exact Test HWE P-Values*).

### Signed HWE R

This option displays the Signed HWE Correlation R for each marker. This is a measure designed to show specifically if the data for this marker shows a tendency towards being homozygous (positively signed R) or towards being heterozygous (negatively signed R).

This statistic will also be output separately for categories or missing/not missing status, if applicable.

Please see the section in the Formulas and Theories chapter for how this
statistic is computed (*Signed HWE Correlation R*).
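The exact SVS formulas are given in the Formulas and Theories chapter and are not reproduced here. As a rough illustration of the idea only, the sketch below uses a standard textbook formulation for a bi-allelic marker: the correlation between the two alleles of a genotype equals one minus the ratio of observed to expected heterozygosity, and the usual one-degree-of-freedom chi-square statistic for HWE equals the sample count times the square of that correlation. Whether SVS computes its values in exactly this way is an assumption, and the function name is hypothetical.

```python
import math


def hwe_signed_r_and_p(n_minor_hom, n_het, n_major_hom):
    """Return (signed R, chi-square p-value) from genotype counts.

    Textbook formulation, not necessarily the SVS implementation.
    R > 0 indicates excess homozygotes; R < 0 indicates excess heterozygotes.
    """
    n = n_minor_hom + n_het + n_major_hom
    if n == 0:
        return None, None
    p = (2 * n_major_hom + n_het) / (2 * n)   # major allele frequency
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return None, None                     # monomorphic: undefined
    expected_het = 2.0 * p * q
    observed_het = n_het / n
    r = 1.0 - observed_het / expected_het
    chi2 = n * r * r
    # Upper tail of a chi-square distribution with one degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return r, p_value
```

Genotype counts of 25, 50, and 25 sit exactly at Hardy-Weinberg proportions, so this sketch returns an R of zero and a p-value of one for them.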

### Genotype Count Table(s)

The numbers of samples that contain each genotype are output. These will also be output separately for cases and for controls, if applicable. If a quantitative dependent variable was selected, an average of the dependent variable for each genotype category (DD, Dd, dd, Missing) will be calculated for each marker.

### Allele Count Table(s)

The counts for each allele are output. This statistic will also be output separately for categories or missing/not missing status, if applicable.

## Genotype Filtering by Marker

The genotype quality assurance filtering dialog (see
*Genotype Filtering By Marker – Allele Frequency Classification* and *Genotype Filtering By Marker – Reference and Alternate Allele Classification*) offers many
options for filtering out markers that do not meet user-defined
criteria. Markers can be filtered by call rate, number of alleles,
minor allele frequency (MAF) or alternate allele frequency, carrier count,
or by three measures of Hardy-Weinberg Equilibrium (HWE).

Warning

Statistics calculated by this function do not adjust for gender and are therefore not always appropriate for non-autosomal chromosomes.

Alleles can be classified by either allele frequency or by reference/alternate alleles. If reference/alternate allele classification is selected, the marker map field containing the reference allele must also be chosen.

The genotype columns meeting the criteria for filtering can either be inactivated in the original spreadsheet, listed in a filtering results spreadsheet, or both inactivated and listed in a separate spreadsheet. If the filtering results spreadsheet is created by selecting “Output spreadsheet with marker statistics and ‘Drop?’ columns”, then all of the markers that were not skipped due to having more than two alleles are listed. A ‘1’ in the ‘Drop?’ column indicates that the marker was dropped based on the selected criteria, and a ‘0’ indicates that the marker was not dropped.

The filtering options are separated into two categories, General Statistics Filtering and Hardy-Weinberg Equilibrium (HWE) Filtering. The filtering options for each category are listed below:

**General Statistics Filtering:**

- **Drop if call rate:** Drops a marker if the call rate meets the specified criterion. Initial default is to drop a marker if the call rate is less than 0.85.
- **Drop if number of alleles:** Drops a marker if the number of alleles meets the specified criterion. Initial default is to drop a marker if the number of alleles is greater than 2.
- **Drop if Minor Allele Frequency (MAF):** (This option will be present if alleles are classified by allele frequency.) This option drops a marker if the MAF meets the specified criterion. Initial default is to drop a marker if the MAF is less than 0.05.
- **Drop if alternate allele frequency:** (This option will be present if alleles are classified by reference/alternate alleles.) This option drops a marker if the alternate allele frequency meets the specified criterion. Initial default is to drop a marker if the alternate allele frequency is less than 0.05.
- **Drop if carrier count:** Drops a marker if the carrier count meets the specified criterion. Initial default is to drop a marker if the carrier count is less than 10.

**Hardy Weinberg Equilibrium (HWE) Filtering:**

- **Perform HWE filtering based on:** Select whether the filtering is based on all the samples, on cases only, or on controls only. This option is only available if a binary column is selected as a dependent variable.
- **Drop if Hardy Weinberg Equilibrium (HWE) P-value:** Drops a marker if the HWE p-value meets the specified criterion. The initial default is to drop a marker if the HWE p-value is less than 0.001.
- **Drop if Fisher’s Exact Test for HWE P-value:** Drops a marker if the Fisher’s Exact Test for HWE P-value meets the specified criterion. The initial default is to drop a marker if the value is less than 0.001.
- **Drop if Signed HWE R (positive if more homozygous):** Drops a marker if the Signed HWE R meets the specified criterion. The initial default is to drop a marker if the value is greater than 0.2.

At least one filtering criterion and at least one action must be selected in the dialog to obtain results. Multiple filtering criteria are allowed at one time. Depending on the stringency of the filtering criteria, it is possible to filter out all of the markers in a dataset. If this is the case, the filtering should be rerun with less stringent criteria.
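The way several criteria combine into a single ‘Drop?’ value can be sketched as below. The thresholds are the initial defaults quoted in this section, but the dictionary keys, the function name `drop_column`, and the layout of the per-marker statistics are assumptions made for illustration. The sketch does not model markers that are skipped for having more than two alleles.

```python
import operator

# Initial defaults quoted in the text: (comparison, threshold) per criterion.
DEFAULT_CRITERIA = {
    "call_rate": (operator.lt, 0.85),
    "n_alleles": (operator.gt, 2),
    "maf": (operator.lt, 0.05),
    "carrier_count": (operator.lt, 10),
    "hwe_p": (operator.lt, 0.001),
    "fisher_hwe_p": (operator.lt, 0.001),
    "signed_hwe_r": (operator.gt, 0.2),
}


def drop_column(stats_by_marker, selected):
    """Return {marker: 1 if dropped else 0} for the selected criteria.

    A marker is dropped as soon as any one selected criterion is met.
    Statistics that are absent or None for a marker are ignored.
    """
    if not selected:
        raise ValueError("at least one filtering criterion must be selected")
    flags = {}
    for marker, stats in stats_by_marker.items():
        dropped = False
        for name in selected:
            compare, threshold = DEFAULT_CRITERIA[name]
            value = stats.get(name)
            if value is not None and compare(value, threshold):
                dropped = True
                break
        flags[marker] = int(dropped)
    return flags
```

If every marker ends up flagged with 1, the criteria were too stringent for the dataset and the filter should be rerun with looser thresholds, as noted above.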

For more information on how the statistics are calculated see the following sections:

- For **Call Rate** see *Call Rate*.
- For **Number of Alleles** see *Number of Alleles*.
- For **Minor Allele Frequency** see *Minor Allele Frequency (MAF)*.
- For **Carrier Count** see *Carrier Count*.
- For **Hardy Weinberg Equilibrium P-Value** see *Hardy-Weinberg Equilibrium Computation*.
- For **Fisher’s Exact Test for HWE P-Value** see *Fisher’s Exact Test HWE P-Values*.
- For **Signed HWE R** see *Signed HWE Correlation R*.

## Genotype Statistics by Sample

Several types of genotypic statistics by sample are available as output (see
*Genotype Statistics By Sample*).

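To set the project’s default genome assembly, which is used by the **Gender Inference** option described below, select **Tools** > **Manage Genome Assemblies** > **Set As Project Default**.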

### Data Requirements

Genotypic sample statistics require a dataset containing genotypic data.
First, import your data into a Golden Helix SVS project (See
*Importing Your Data Into A Project*). The sample statistics dialog can be accessed by selecting
**Genotype > Genotype Statistics By Sample** from the
spreadsheet menu.

If a binary or categorical column has been made dependent, many of the statistics will additionally be consolidated and reported for each dependent variable category.

### Processing

Select your sample statistics options and select the **Run** button to
process. Descriptions of the sample statistics options are detailed
below.

Between one and four spreadsheets of results will be created as
children of the current spreadsheet navigator window node. (See
*Output* for a list of spreadsheets and outputs.)
The options you selected and information about the number of markers
processed are entered into the Node Change Logs for these
spreadsheets.

### Inputs

A number of inputs are available for statistics by sample. These are as follows:

**Genotype Count Statistics**: Call rate and heterozygosity are always output. Optionally select **Number and fraction of genotypes with a minor allele (as determined from sample data)** to obtain these additional statistics.

**Variant Statistics (Marker Map “Reference” Field Required)**: The first marker map field that starts with the characters “Reference” (case-insensitive match) will be used as the reference field. If there is no such field, this input category will not be available. Optionally select from the following three variant statistics:

- **Number of variant genotypes (non reference)**
- **Number of singletons (variant genotype present only in given sample)**
- **Mean Ti/Tv of variant genotypes**: Outputs counts of transitions and transversions and the ratio of transitions to transversions. For further details, see **Transitions** and **Transversions** under *Count and Variant Statistics* below.

**Autosomal Statistics**

- **Hardy-Weinberg Thw P-Value (taken over all autosomal chromosomes and all samples)**: This option displays, for each sample, the p-value for the genome-wide test for departures of the minor allele count from two times the minor allele frequency of the corresponding markers. This is calculated over all active genotypic markers for the sample that are in autosomal chromosomes. This test does not require absence of linkage disequilibrium from the data and can detect even small deviations from Hardy-Weinberg equilibrium, which may be caused either by violations in the conditions for Hardy-Weinberg equilibrium or by genotyping error.

**Gender Chromosome Statistics**

- **Gender Inference**: Select this option to obtain count and variant statistics for the gender chromosome, as well as to infer a sample’s gender based on the gender chromosome heterozygosity. The list of chromosomes is populated by reading the assembly file set as the project’s current default and checking the spreadsheet’s marker map. See *Genome Assemblies* for how to set this.
- **Threshold of heterozygosity for calling M/F**: If a sample’s gender chromosome heterozygosity is more than the value specified here, the sample is inferred to be female. Otherwise, it is inferred to be male (with the alleles of the one gender chromosome having been duplicated for each genotype data entry).

**Additional Outputs (Verbose Output)**

- **Output count and variant statistics for each autosomal chromosome**: This will generate a separate spreadsheet or spreadsheets with a column for each count and variant statistic specified above for each autosomal chromosome encountered.
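The gender inference rule described under **Gender Chromosome Statistics** reduces to a single threshold comparison, sketched below. The function name and its arguments are hypothetical. The threshold is left as a required argument because the text does not state a default value.

```python
def infer_gender(n_heterozygotes, n_called, threshold):
    """Infer "F" if the gender chromosome heterozygosity rate is more than
    `threshold`, and "M" otherwise.

    `n_heterozygotes` and `n_called` are the sample's heterozygote count and
    called-genotype count on the chosen gender chromosome.
    """
    rate = n_heterozygotes / n_called if n_called else 0.0
    return "F" if rate > threshold else "M"
```

A rate exactly equal to the threshold is called male in this sketch, following the wording "more than" in the option description.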

### Count and Variant Statistics

The count and variant statistics are reported in one or more of the following ways depending upon the options selected and whether a binary or categorical column has been made dependent:

- By sample for all markers scanned.
- By sample for the gender chromosome (if **Gender Inference** was selected).
- By category for all markers scanned (if a binary or categorical column was made dependent).
- By category for the gender chromosome (if a binary or categorical column was made dependent and **Gender Inference** was selected).
- By sample for each individual autosomal chromosome (if **Output count and variant statistics for each autosomal chromosome** was selected).
- By category for each individual autosomal chromosome (if a binary or categorical column was made dependent and **Output count and variant statistics for each autosomal chromosome** was selected).

The count and variant statistics are specifically as follows:

- **# Called Genotypes**: The number of genotypes called for this sample. These genotypes may come from monomorphic, bi-allelic, or multi-allelic markers.
- **Call Rate**: The number of called genotypes divided by the total number of genotypes scanned.
- **# from Bi-Allelic and Monomorphic**: The number of called genotypes that come from either bi-allelic or monomorphic markers.
- **# Heterozygotes**: The number of heterozygous genotypes encountered (that come from bi-allelic markers).
- **Heterozygosity Rate**: The number of heterozygotes encountered (that come from bi-allelic markers) divided by the number of called genotypes from bi-allelic or monomorphic markers.
- **# with Minor Allele**: (Output if **Number and fraction of genotypes with a minor allele (as determined from sample data)** was selected.) The number of genotypes with at least one minor allele encountered that come from bi-allelic markers.
- **Fraction with Minor Allele**: (Output if **Number and fraction of genotypes with a minor allele (as determined from sample data)** was selected.) The number of genotypes with at least one minor allele encountered (that come from bi-allelic markers) divided by the number of called genotypes from bi-allelic or monomorphic markers.
- **# Variant Genotypes**: (Output if **Number of variant genotypes (non reference)** was selected.) The number of genotypes (that come from either bi-allelic, monomorphic or multi-allelic markers) containing at least one non-reference allele.
- **# Singletons**: (Output if **Number of singletons (variant genotype present only in given sample)** was selected.) The number of genotypes that come from bi-allelic markers containing only one variant in all of their samples.
- **# Transitions**: (Output if **Mean Ti/Tv of variant genotypes** was selected.) The number of variant genotypes found in markers where the reference allele is “A” and the variant allele is “G”, the reference allele is “G” and the variant allele is “A”, the reference allele is “C” and the variant allele is “T”, or the reference allele is “T” and the variant allele is “C”.
- **# Transversions**: (Output if **Mean Ti/Tv of variant genotypes** was selected.) The number of variant genotypes found in markers where both the reference and variant are any of “A”, “G”, “C”, or “T”, but the variant is not a transition (see above). (There are twice as many possible transversions as there are possible transitions.)
- **Mean Ti/Tv**: (Output if **Mean Ti/Tv of variant genotypes** was selected.) The ratio of the number of transitions to the number of transversions.
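The transition and transversion definitions above can be sketched as a small classifier. The function names and the `(reference, variant)` pair representation are assumptions made for illustration.

```python
BASES = {"A", "C", "G", "T"}
# The four transitions listed in the text: A<->G and C<->T.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def classify_substitution(reference, variant):
    """Return "transition", "transversion", or None if either allele
    is not one of A, C, G, T (or the two alleles are identical)."""
    if reference not in BASES or variant not in BASES or reference == variant:
        return None
    if (reference, variant) in TRANSITIONS:
        return "transition"
    return "transversion"


def ti_tv(variant_genotypes):
    """Count transitions and transversions over (reference, variant) pairs,
    one pair per variant genotype, and return (ti, tv, ratio).

    The ratio is None when there are no transversions.
    """
    ti = tv = 0
    for reference, variant in variant_genotypes:
        kind = classify_substitution(reference, variant)
        if kind == "transition":
            ti += 1
        elif kind == "transversion":
            tv += 1
    return ti, tv, (ti / tv if tv else None)
```

Enumerating all ordered pairs of distinct bases with this classifier gives 4 transitions and 8 transversions, which matches the remark that there are twice as many possible transversions as transitions.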

### Output

Between one and four spreadsheets of results will be created as children of the current spreadsheet navigator window node. These spreadsheets and the data categories reported by them are as follows:

**Statistics by Sample**: The rows in this spreadsheet will correspond to samples and the columns include some or all of the following:

- **(Category header)**: If you have specified a binary or categorical variable as dependent, that column will be echoed here.
- Count and variant statistics by sample for all scanned markers. (See *Count and Variant Statistics*.)
- Statistics for the gender chromosome (output only if **Gender Inference** was selected). These include the count and rate statistics for the gender chromosome (see *Count and Variant Statistics*) plus the following two columns, which are inserted after the **Heterozygosity Rate (Chr. Gender)** column (“Gender” will be the chromosome chosen in the drop down list):
  - **Inferred Gender** (Categorical **M** vs. **F**): The inferred gender of the sample based on its gender chromosome heterozygosity rate. The gender is inferred to be female if this rate is above the **Threshold of heterozygosity for calling M/F** that you have specified, and male otherwise.
  - **Inferred Gender** (Binary **0** vs. **1**): The same as above, except that **0** is used for male and **1** is used for female.

**Statistics by Sample Category**: (Created if either a binary or a categorical dependent variable was selected in the original spreadsheet.) The first row of this spreadsheet contains totals. Each of the remaining rows shows statistics for one of the dependent variable categories. The columns include some or all of the following:

- **# Samples**: The number of samples reflected in this row’s category.
- Count and variant statistics by category for all scanned samples. (See *Count and Variant Statistics*.)
- Count and variant statistics by category for the gender chromosome (output only if **Gender Inference** was selected). (See *Count and Variant Statistics*.)

**Autosome Statistics by Sample**: (Created if either **Hardy-Weinberg Thw P-Value (taken over all autosomal chromosomes and all samples)** or **Output count and variant statistics for each autosomal chromosome** has been selected.) The rows in this spreadsheet correspond to samples and the columns will include some or all of the following:

- **(Category header)**: If you have specified a binary or categorical variable as dependent, that column will be echoed here.
- Hardy-Weinberg Thw statistics (output only if **Hardy-Weinberg Thw P-Value (taken over all autosomal chromosomes and all samples)** was selected). These are as follows:
  - **Thw p-value**: P-value of the Hardy-Weinberg Thw statistic.
  - **-log10 Thw p-value**: Negative base-10 logarithm of the p-value of the Hardy-Weinberg Thw statistic.
  - **Thw**: The Hardy-Weinberg Thw statistic. Under the null hypothesis of no departure from Hardy-Weinberg equilibrium, this statistic follows an approximate $\chi^2$ distribution with one degree of freedom.
  - **E(delta X)**: Expected residual marker score. The residual marker score at a given marker and sample is given by $\delta X = X - 2p$, where the marker score $X$ is the number of minor alleles and $2p$ is the expected marker score based on the minor allele frequency $p$ of the marker.
  - **var(delta X)**: Variance of the residual marker score.
- Count and variant statistics by sample for each autosomal chromosome (output only if **Output count and variant statistics for each autosomal chromosome** was selected). (See *Count and Variant Statistics*.)

**Autosome Statistics by Sample Category**: (Created if a binary or categorical dependent variable was selected and **Output count and variant statistics for each autosomal chromosome** has been selected.) The first row of this spreadsheet contains totals. Each of the remaining rows shows statistics for one of the dependent variable categories. The columns will include the following:

- **# Samples**: The number of samples reflected in this row’s category.
- Count and variant statistics by category for each autosomal chromosome. (See *Count and Variant Statistics*.)

## Identity by Descent Estimation

### Overview

Identity by Descent (IBD) is a measure of how many alleles at any marker in each of two individuals came from the same ancestral chromosomes. (This is in contrast to the Identity by State (IBS) measure, which is simply a measure of how many alleles at any marker in each of two individuals happen to be the same, for whatever reason.) IBD is therefore a measure of the relatedness of the pair of individuals in question. For instance:

- The alleles of identical twins should come 100% from the same ancestral chromosomes, because they have the same chromosomes.
- The alleles of siblings should come approximately 50% from the same ancestral chromosomes.
- The alleles of half-siblings should come approximately 25% from the same ancestral chromosomes.
- The alleles of unrelated individuals should not come from the same ancestral chromosomes at all, or in other words approximately 0% from the same ancestral chromosomes.

Meanwhile, it is possible for genotyped samples to exhibit apparent relatedness that has nothing to do with the relatedness or lack of relatedness of the corresponding individuals. For instance:

- Duplicate samples will exhibit alleles coming 100% from the same chromosomes.
- In a dual-array system such as the Affy 500K, duplicate samples from one of a pair of genotyping chips but not the other one will exhibit alleles coming 50% from the same chromosomes.
- Sample contamination will show as one individual seeming to have relatedness to many other individuals.

Golden Helix SVS allows estimation of the Identity by Descent between all pairs of samples, based on the data in your genotypic spreadsheet.

- It is recommended that IBD estimation in Golden Helix SVS should be used for data quality control, rather than for actually attempting to impute relatedness among individuals whose samples you are analyzing.
- It is usually advisable to apply LD pruning (**Genotype** > **Quality Assurance** > **LD Pruning** from the spreadsheet menu) before using this feature.
- You will obtain the best values when you use many samples and many markers. This is due to the need to estimate allele frequencies over multiple samples, as well as the need to estimate IBD itself over multiple markers.

Warning

IBD is designed to be estimated only from genotypic data originating from autosomal chromosomes.

### Data Requirements

First, import your data into a Golden Helix SVS project (See
*Importing Your Data Into A Project*) to create a genotypic spreadsheet. The samples in your
spreadsheet are required to be row wise, and only the autosomal genotype
columns should be active. (If necessary, use **Select** > **Activate
by Chromosomes** from the spreadsheet menu.) The IBD dialog can be
accessed by selecting **Genotype** > **Quality Assurance** >
**Identity by Descent Estimation** from the spreadsheet menu.

### Values Computed

The first available output, the **IBS distances**, reflects the
Identity by State (IBS) between pairs of samples. At each marker, the
two samples in a pair will share (for whatever reason) zero, one, or
two alleles; these are known as IBS state 0, IBS state 1, and IBS
state 2, respectively. For each sample pair, the IBS distance, which
may be thought of as one-half of an “average IBS”, is defined as ( (#
of markers with IBS state 2) + 0.5 * (# of markers with IBS state 1) )
/ (# of non-missing markers).
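The IBS distance formula above can be sketched for one pair of samples as follows. Genotypes are represented here as unordered pairs of alleles with `None` for a missing call, and a marker is treated as non-missing only when both samples have a call. Both choices, along with the function names, are assumptions made for illustration.

```python
def ibs_state(genotype1, genotype2):
    """Number of alleles (0, 1, or 2) shared by two unphased genotypes,
    each given as a pair of alleles such as ("A", "G")."""
    remaining = list(genotype2)
    shared = 0
    for allele in genotype1:
        if allele in remaining:
            remaining.remove(allele)   # each allele can be matched only once
            shared += 1
    return shared


def ibs_distance(sample1, sample2):
    """(#IBS2 + 0.5 * #IBS1) / #non-missing markers for one sample pair.

    Returns None when the pair has no marker called in both samples.
    """
    states = [ibs_state(g1, g2)
              for g1, g2 in zip(sample1, sample2)
              if g1 is not None and g2 is not None]
    if not states:
        return None
    return (states.count(2) + 0.5 * states.count(1)) / len(states)
```

For two samples whose three shared markers have IBS states 2, 1, and 0, the distance is (1 + 0.5) / 3 = 0.5.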

The next available outputs are the results of the initial computations for the respective probabilities that zero, one, or two alleles are identical by descent (shared IBD). These are designated P(Z=0), P(Z=1), and P(Z=2), respectively.

Using your genotypic data, Golden Helix SVS will “work backwards” to impute the most reasonable genome-wide IBD probabilities from your data, assuming it came from a homogeneous, random-mating population. For each of your markers, the allele frequencies are estimated. Using these frequencies, $P(I=i \mid Z=z)$ is estimated for each combination of $i$, an IBS state, and $z$, a possible IBD state. For instance, if $p$ and $q$ are the actual respective allele frequencies of the two alleles in a marker, $P(I=0 \mid Z=0)$, the probability of having an IBS state of zero (completely different alleles) between two individuals given an IBD state of zero (completely different alleles by descent) between those same two individuals, should be $2p^2q^2$. (This reflects both individuals having opposite homozygotes two different ways, AA and aa, or aa and AA, each with probability $p^2q^2$.) Since allele frequency estimates are made from the spreadsheet data, a correction factor is actually used to obtain unbiased estimates of $P(I=i \mid Z=z)$, but the results are similar to what would otherwise be obtained.
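The opposite-homozygote argument can be checked by brute-force enumeration. The sketch below assumes a bi-allelic marker with known allele frequencies in Hardy-Weinberg proportions and two unrelated individuals (IBD state zero). It omits the bias correction mentioned above, and the function name is hypothetical.

```python
from itertools import product


def ibs_probs_given_unrelated(p):
    """Return [P(I=0|Z=0), P(I=1|Z=0), P(I=2|Z=0)] for a bi-allelic marker
    whose alleles "A" and "a" have true frequencies p and 1 - p."""
    q = 1.0 - p
    genotype_prob = {
        ("A", "A"): p * p,
        ("A", "a"): 2.0 * p * q,
        ("a", "a"): q * q,
    }
    probs = [0.0, 0.0, 0.0]
    # Unrelated individuals draw their genotypes independently.
    for (g1, pr1), (g2, pr2) in product(genotype_prob.items(), repeat=2):
        # For a bi-allelic marker, the IBS state is 2 minus the difference
        # in the number of "A" alleles carried.
        shared = 2 - abs(g1.count("A") - g2.count("A"))
        probs[shared] += pr1 * pr2
    return probs
```

The first entry comes only from the two opposite-homozygote pairings, AA with aa and aa with AA, so it equals twice the product of the squared allele frequencies.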

Estimating these probabilities allows incrementing the expected count of markers with IBS state i, conditioned on IBD state z, for each pair of samples.

After all markers are scanned, a method of moments is used to find, from the expected counts and actual counts of the different IBS states, global estimates for P(Z=0), P(Z=1), and P(Z=2) for each sample pair. In some cases, these values will fall outside the range of zero to one; they are then corrected to lie within that range before they are output by Golden Helix SVS.

The overall fraction of alleles which are shared IBD between two individuals over the genome may be summarized by the one value

$$\mathrm{PI} = \frac{P(Z=1)}{2} + P(Z=2),$$

or half of the probability of sharing a single allele IBD plus the probability of sharing both alleles IBD.

It would be expected that the probability of sharing two alleles IBD would be less than the probability of picking one allele shared IBD multiplied by the probability of picking a second allele shared IBD between the same two individuals. If this is not so, namely,

$$P(Z=2) \ge \mathrm{PI}^2,$$

a set of transformed probabilities is computed which are more biologically plausible, as follows:

$$P^*(Z=0) = (1 - \mathrm{PI})^2, \qquad P^*(Z=1) = 2\,\mathrm{PI}\,(1 - \mathrm{PI}),$$

and

$$P^*(Z=2) = \mathrm{PI}^2.$$

Otherwise, the values labeled P* for the pair of individuals will be copied from the initial estimates (P).
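A minimal sketch of the PI summary and of the transformation step follows. The PI formula is the one given in the output option label, PI = P(Z=1)/2 + P(Z=2). The transformed values are assumed here to take the binomial form (1 - PI)^2, 2 PI (1 - PI), and PI^2, applied whenever P(Z=2) is at least PI squared. That form follows the approach of [Purcell2007] as the editor understands it, and should be treated as an assumption rather than as the confirmed SVS implementation. The function names are hypothetical.

```python
def pi_value(p1, p2):
    """Overall fraction of alleles shared IBD: P(Z=1)/2 + P(Z=2)."""
    return p1 / 2.0 + p2


def transformed_estimates(p0, p1, p2):
    """Return (P*(Z=0), P*(Z=1), P*(Z=2)) for one sample pair.

    Assumed rule: if P(Z=2) >= PI**2, replace the estimates by the
    binomial form in PI; otherwise copy the initial estimates unchanged.
    """
    pi = pi_value(p1, p2)
    if p2 >= pi * pi:
        return (1.0 - pi) ** 2, 2.0 * pi * (1.0 - pi), pi * pi
    return p0, p1, p2
```

The transformation leaves PI itself unchanged, because half of 2 PI (1 - PI) plus PI^2 is again PI. It only redistributes probability between the three IBD states.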

The complete algorithm used by Golden Helix SVS is spelled out in [Purcell2007].

### Using IBD Estimation

Select the computation parameter (if applicable) and output options and
select the **Run** button to process. Descriptions of the computation
parameter and output options are detailed below.

One or more spreadsheets of results will be created as children of the current spreadsheet navigator window node. Information about the parameters used will be recorded in the Node Change Log.

### Parameters

#### Allele Counts

If your spreadsheet is a pedigree spreadsheet, you may check **Use
only founders for allele counts** to count alleles only from samples
which contain missing values for the Father ID and the Mother ID.
This is the default behavior for pedigree spreadsheets and this option
is only used for IBD computations. On the other hand, you may leave this
box unchecked to count alleles from all samples to determine allele
frequencies, which is what is done for a non-pedigree spreadsheet.

Note

All pairs of samples with non-missing data are used for IBS computations and for the final IBD computations. The restriction on allele counting to only founders only applies to determining allele frequencies to be used in the IBD computations.

#### Identity by Descent Estimation Outputs

The following outputs may be checked or unchecked:

- **Output IBS distances ( (IBS 2 + 0.5 * IBS 1) / # non-missing markers )** (one spreadsheet)

  Note

  If only this output is requested, all computational overhead for computing IBD that is not needed for just computing IBS will be dispensed with. This will speed up IBS computation by approximately a factor of 3.

- **Output untransformed estimates of P(Z=0), P(Z=1), and P(Z=2)** (three spreadsheets) (selected by default)
- **Output PI = P(Z=1)/2 + P(Z=2)** (one spreadsheet) (selected by default)
- **Output transformed estimates P*(Z=0), P*(Z=1), and P*(Z=2)** (three spreadsheets)

All of these outputs are in the form of a spreadsheet with both rows and columns corresponding to the samples, with each cell representing the IBS or IBD value between the two samples represented by its row and its column.

The reason these outputs are in this form is to allow you to view them using the heat map feature of Golden Helix SVS. You may then easily pick out any pair of duplicate samples, or one sample contaminating a number of other samples, or other suspicious values of IBS distance or estimated IBD.

To view a spreadsheet as a heat map, select **Plot** > **Heat Map**
from the spreadsheet menu.

#### Additional Outputs¶

To get a listing of all pairs of samples whose IBD PI estimate is at or
above a certain value, check **Output all pairs where PI** >=
**(value)**, and input the value to use.

This listing will output, in one spreadsheet, one row for every pair of samples meeting the criterion above. The sample pair will be output along with the IBS distance and all of the pair’s IBD values.

## Fixation Index Fst and Fixation Index Fst (by Marker)¶

### Overview¶

The Fixation Index $F_{ST}$, also known as the Co-ancestry Coefficient $\theta$, between two or more subpopulations is a measure of genetic divergence between the subpopulations and from the ancestral population from which they have derived. This parameter, which can range from zero (no genetic divergence between the subpopulations or from the ancestral population) to one (complete isolation of the subpopulations from each other and the overall population), measures the reduction in genotypic heterozygosity (the Wahlund Effect) resulting from inbreeding in the subpopulations to the exclusion of others from the overall population.

Golden Helix SVS allows estimation of $F_{ST}$ between all pairs
of subpopulations from which you have samples, based on the genotypic
data in your spreadsheet as grouped into subpopulations by a
categorical grouping variable in your spreadsheet. In the **Estimates
Made Using All Markers** version, 95% confidence intervals around the
$F_{ST}$ estimates are also reported.

Warning

$F_{ST}$ is designed to be estimated only from genotypic data originating from autosomal chromosomes.

### Overview of F-Statistics¶

$F_{ST}$ is one of three “F-Statistics” first developed by Sewall Wright:

- $F_{IS}$ (or $f$): Inbreeding coefficient of individuals (i) with respect to the subpopulations (s) of which they are a part.
- $F_{ST}$ (or $\theta$): Fixation Index or Co-ancestry Coefficient. Compares the subpopulations (s) with the total population (t).
- $F_{IT}$ (or $F$): Inbreeding coefficient of individuals (i) with respect to the total population (t).

These are defined in terms of the following measures of heterozygosity:

- $H_I$: The observed heterozygosity over all the subpopulations,
- $H_S$: The average over all the subpopulations of the expected heterozygosities within each of the subpopulations, and
- $H_T$: The expected heterozygosity over all the subpopulations,

as

$$F_{IS} = 1 - \frac{H_I}{H_S}, \qquad F_{ST} = 1 - \frac{H_S}{H_T}, \qquad F_{IT} = 1 - \frac{H_I}{H_T}.$$

It can be seen that the three F-Statistics may be thought of as “partitioned” (between the individual, the subpopulations, and the total population) as follows:

$$\frac{H_I}{H_T} = \frac{H_I}{H_S} \cdot \frac{H_S}{H_T}$$, or

$$(1 - F_{IT}) = (1 - F_{IS})(1 - F_{ST})$$.
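As an illustration of how the three F-Statistics follow from the three heterozygosity measures (the standard definitions $F_{IS} = 1 - H_I/H_S$, $F_{ST} = 1 - H_S/H_T$, $F_{IT} = 1 - H_I/H_T$), here is a minimal Python sketch for a single bi-allelic marker. It assumes genotypes coded as counts (0, 1, 2) of one allele and a sample-size-weighted average for the within-subpopulation heterozygosity; both are assumptions for this example, and this naive calculation is not the estimator SVS uses.

```python
def f_statistics(subpops):
    """subpops: one list of genotypes (allele counts 0/1/2) per subpopulation."""
    total_n = sum(len(g) for g in subpops)
    # H_I: observed proportion of heterozygotes over all samples
    h_i = sum(g.count(1) for g in subpops) / total_n
    # H_S: weighted average of expected heterozygosity 2p(1-p) within subpopulations
    h_s = 0.0
    for g in subpops:
        p = sum(g) / (2 * len(g))
        h_s += len(g) * 2 * p * (1 - p)
    h_s /= total_n
    # H_T: expected heterozygosity from the pooled allele frequency
    p_bar = sum(sum(g) for g in subpops) / (2 * total_n)
    h_t = 2 * p_bar * (1 - p_bar)
    return {"F_IS": 1 - h_i / h_s, "F_ST": 1 - h_s / h_t, "F_IT": 1 - h_i / h_t}
```

The sketch divides by $H_S$ and $H_T$, so it is only meaningful for markers that are polymorphic within at least one subpopulation.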

Note

Golden Helix SVS can compute the inbreeding coefficient for
individual samples relative to the entire data in your genotypic
spreadsheet. Please see *Inbreeding Coefficients* for details.

Note

Golden Helix SVS can use Principal Components Analysis to
remove or mitigate the effects of hidden population stratification
(non-zero $F_{ST}$ between unknown groupings of samples)
from your analysis. Please see *Correcting for Stratification* for
more details.

Note

$F_{IS}$ is normally used as the actual measure of inbreeding among individuals, because it is measured against others who are in the same subpopulation.

### Data Requirements¶

First, import your data into a Golden Helix SVS project (See
*Importing Your Data Into A Project*) to create a genotypic spreadsheet. The samples in your
spreadsheet are required to be row wise, and only the autosomal
genotype columns should be active. (If necessary, use **Select** >
**Activate by Chromosomes** from the spreadsheet menu.) Ensure that
your spreadsheet also has a categorical variable which groups your
samples according to which samples are taken from which
subpopulations.

### Using $F_{ST}$ Estimation¶

Two modes of $F_{ST}$ estimation are available in Golden Helix SVS. They are:

- $F_{ST}$ estimates using all markers, and
- $F_{ST}$ estimates made using one marker at a time.

#### Estimates Made Using All Markers¶

The Fixation Index $F_{ST}$ computations taken over all
markers can be accessed by making the grouping variable dependent and
selecting **Genotype** > **Quality Assurance** > **Fixation Index
Fst** from the spreadsheet menu.

One spreadsheet of results is created as a child of the current spreadsheet navigator window node. In this spreadsheet, the subpopulations will be used both as rows and columns, with each spreadsheet cell showing the estimated $F_{ST}$ between the row’s subpopulation and the column’s subpopulation. This spreadsheet format is suitable for plotting the $F_{ST}$ between the pairs of subpopulations as a heat map.

A second spreadsheet is created as a child of the current spreadsheet. This spreadsheet will have the $F_{ST}$ between the pairs of subpopulations in tall format with a 95% confidence interval around the $F_{ST}$.

The overall $F_{ST}$, which is the $F_{ST}$ taken over all subpopulations at once, is shown in the node change log of the project viewer in the first spreadsheet. In the second spreadsheet’s log message, the overall $F_{ST}$ is reported with a 95% confidence interval.

#### Estimates Made Using One Marker at a Time¶

This mode of computing the Fixation Index $F_{ST}$ can be
accessed by making the grouping variable dependent and selecting
**Genotype** > **Quality Assurance** > **Fixation Index Fst (by
Marker)** from the spreadsheet menu. In this mode, Golden Helix SVS
performs the equivalent of making an entire $F_{ST}$ run using
a spreadsheet in which all markers except the one of interest are
deactivated, repeating this for every marker.

One spreadsheet of results is created as a child of the current spreadsheet navigator window node. This spreadsheet is organized using the markers as rows and the groupings of subpopulations as columns, with each spreadsheet cell showing the estimated $F_{ST}$ between the subpopulations grouped together for the cell’s column, as computed using the one marker associated with the cell’s row.

The final subpopulation grouping is for $F_{ST}$ taken over all subpopulations at once (for the individual markers).

If the original spreadsheet was marker mapped, its marker map information will be transferred to the new spreadsheet.

### Algorithm Used for $F_{ST}$ Estimation¶

Golden Helix SVS uses the algorithm explained in [WeirCockerham1984] to estimate $F_{ST}$. This algorithm is summarized below. We use the $F$, $\theta$, and $f$ notation of [WeirCockerham1984].

The variance of the allele frequency of a given allele at any locus may be thought of as divided into three components, $a$, $b$, and $c$, the expectations of which are

$$E(a) = p(1-p)\,\theta, \qquad E(b) = p(1-p)(F - \theta), \qquad E(c) = p(1-p)(1 - F),$$

where $a$ is the variance component between subpopulations, $b$ is the variance component within subpopulations and between individuals, $c$ is the sample-based population estimate of the variance component between gametes within individuals, and $p$ is the expected frequency of allele $A$, which is equal to its frequency in the original ancestral population. From this, we get

$$\theta = \frac{a}{a+b+c}, \qquad F = 1 - \frac{c}{a+b+c}, \qquad f = 1 - \frac{c}{b+c}$$.

Note

An alternative definition of $\theta$ ($F_{ST}$) is the ratio of the variance of allele frequencies between different populations ($a$) to the overall variance of allele frequencies ($a + b + c$).

Note

The expected heterozygosity within a subpopulation $i$, $2p_i(1-p_i)$, also happens to be twice the total variance of $p$ within that subpopulation (“total” counting both within individuals and between individuals). Also, the expected heterozygosity over all the subpopulations may be written as $H_T = 2p(1-p)$, which happens to be twice the overall variance of $p$ (both within individuals and between individuals). Since we may write

$$F_{ST} = 1 - \frac{H_S}{H_T}$$

or

$$F_{ST} = \frac{H_T - H_S}{H_T} = \frac{(H_T - H_S)/2}{H_T/2},$$

and since the total variance is the sum of the variance within subpopulations and the variance between subpopulations, we see that $H_T - H_S$ is twice the variance between subpopulations, or that $(H_T - H_S)/2$ is equal to the variance between subpopulations.

Meanwhile,

$$1 - F_{IT} = \frac{H_I}{H_T}$$

or

$$H_T\,(1 - F_{IT}) = H_I,$$

the observed heterozygosity within individuals, which is twice the sample-based population estimate of the variance of $p$ within individuals. Thus the population estimate of the variance of $p$ within individuals is $H_I/2$, and the variance of $p$ between individuals is thus $H_T/2 - H_I/2$. Thus, the variance of $p$ between individuals but within subpopulations is

$$\frac{H_T - H_I}{2} - \frac{H_T - H_S}{2}$$

or

$$\frac{H_S - H_I}{2}.$$

Note

If we use the actual variances within the data that we have, rather than estimates of population variances based on our data, the above discussion is modified as follows:

Define

$$F' = \frac{1 + F}{2}.$$

Then, the variance of the allele frequency of a given allele at any locus may be thought of as divided into three components, $a'$, $b'$, and $c'$, the expectations of which are

$$E(a') = p(1-p)\,\theta, \qquad E(b') = p(1-p)(F' - \theta), \qquad E(c') = p(1-p)(1 - F'),$$

where $a'$ is the variance component between subpopulations, $b'$ is the variance component within subpopulations and between individuals, $c'$ is the actual variance component between gametes within individuals, and $p$ is the expected frequency of allele $A$.

While the discussion in the preceding note showing that $(H_T - H_S)/2$ is equal to the variance between subpopulations still applies, we now note that since

$$1 - F' = \frac{1 - F}{2},$$

we have

$$2\,(1 - F') = \frac{H_I}{H_T}$$

or

$$2 H_T\,(1 - F') = H_I,$$

the observed heterozygosity within individuals, which is four times the actual variance of $p$ within individuals. Thus the actual variance of $p$ within individuals is $H_I/4$, and the variance of $p$ between individuals is thus $H_T/2 - H_I/4$. Thus, the variance of $p$ between individuals but within subpopulations is

$$\left(\frac{H_T}{2} - \frac{H_I}{4}\right) - \frac{H_T - H_S}{2}$$

or

$$\frac{H_S}{2} - \frac{H_I}{4}.$$

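The variances $a$, $b$, and $c$ are estimated from a given biallelic locus in such a way as to compensate for finite and possibly unequal sample sizes, a finite number of subpopulations, and the fact that the subpopulations are effectively statistical samplings of the original ancestor population from which they came. These estimates are as follows:

$$a = \frac{\bar n}{n_c}\left\{ s_A^2 - \frac{1}{\bar n - 1}\left[\bar p_A(1-\bar p_A) - \frac{r-1}{r}\,s_A^2 - \frac{1}{4}\bar h_A\right]\right\},$$

$$b = \frac{\bar n}{\bar n - 1}\left[\bar p_A(1-\bar p_A) - \frac{r-1}{r}\,s_A^2 - \frac{2\bar n - 1}{4\bar n}\,\bar h_A\right],$$

$$c = \frac{1}{2}\bar h_A$$,

where:

- $\tilde p_{Ai}$ is the frequency for allele $A$ in the sample of size $n_i$ from subpopulation $i$ ($i = 1, \ldots, r$, with $r$ the number of subpopulations),
- $\tilde h_{Ai}$ is the proportion of individuals with heterozygous genotypes in the sample from subpopulation $i$,
- $\bar n = \sum_i n_i / r$, the average sample size,
- $n_c = \dfrac{r\bar n - \sum_i n_i^2/(r\bar n)}{r-1} = \bar n\left(1 - \dfrac{C^2}{r}\right)$, where $C^2$ is the squared coefficient of variation of sample sizes,
- $\bar p_A = \sum_i n_i\,\tilde p_{Ai}/(r\bar n)$, the average sample frequency of allele $A$,
- $s_A^2 = \sum_i n_i(\tilde p_{Ai} - \bar p_A)^2/\big((r-1)\,\bar n\big)$, the sample variance of allele frequencies over populations, and
- $\bar h_A = \sum_i n_i\,\tilde h_{Ai}/(r\bar n)$, the average heterozygote frequency for allele $A$.

For multiple loci, instead of just trying to average the $\hat\theta$ estimates over the individual loci using the variances $a$, $b$, and $c$, we instead use a weighted average, namely,

$$\hat\theta = \frac{\sum_l a_l}{\sum_l (a_l + b_l + c_l)}.$$

Here, contributions to the numerator and contributions to the denominator are each effectively weighted by $p(1-p)$, giving more importance to terms coming from markers with a higher minor allele frequency and effectively eliminating terms coming from monomorphic loci.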
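The variance-component estimates and the multi-locus weighted average can be sketched in Python as follows. The formulas coded here are the $a$, $b$, $c$ estimators as published in [WeirCockerham1984]; the data layout (one list of 0/1/2 allele counts per subpopulation, no missing calls) and the function names are assumptions for this example, not the SVS implementation.

```python
def wc_components(subpops):
    """Weir-Cockerham (a, b, c) for one bi-allelic locus."""
    r = len(subpops)
    n = [len(g) for g in subpops]
    n_bar = sum(n) / r
    n_c = (r * n_bar - sum(x * x for x in n) / (r * n_bar)) / (r - 1)
    p = [sum(g) / (2 * len(g)) for g in subpops]   # allele frequency per subpopulation
    h = [g.count(1) / len(g) for g in subpops]     # heterozygote proportion per subpopulation
    p_bar = sum(ni * pi for ni, pi in zip(n, p)) / (r * n_bar)
    s2 = sum(ni * (pi - p_bar) ** 2 for ni, pi in zip(n, p)) / ((r - 1) * n_bar)
    h_bar = sum(ni * hi for ni, hi in zip(n, h)) / (r * n_bar)
    core = p_bar * (1 - p_bar) - (r - 1) / r * s2
    a = (n_bar / n_c) * (s2 - (core - h_bar / 4) / (n_bar - 1))
    b = (n_bar / (n_bar - 1)) * (core - (2 * n_bar - 1) / (4 * n_bar) * h_bar)
    c = h_bar / 2
    return a, b, c


def wc_fst(loci):
    """Weighted multi-locus estimate: sum of a over sum of (a + b + c)."""
    num = den = 0.0
    for subpops in loci:
        a, b, c = wc_components(subpops)
        num += a
        den += a + b + c
    return num / den
```

Because a monomorphic locus yields $a = b = c = 0$, it adds nothing to either sum, which matches the remark that such loci are effectively eliminated.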

Note

When you estimate $F_{ST}$ using one marker at a time, Golden Helix SVS will simply output missing values for monomorphic loci.

Note

Even though the estimates (derived from [WeirCockerham1984]) of $a$, $b$, and $c$ are meant to compensate, among other things, for smaller sample sizes, this algorithm will still produce better results by using reasonable sample sizes for your subpopulations and using multiple genotypic markers.

For instance, due to the factor $\frac{1}{\bar n - 1}$ used in estimating $a$, it is possible to obtain negative-number estimates of $\theta$ ($F_{ST}$) by using extremely few samples over just a few markers. ($F_{ST}$, since it is one variance divided by another, should always be positive or at least zero.)

Note

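If we were to use the actual variances within the data that we have, rather than estimates of population variances based on our data, we would use, for each individual marker,

$$a' = {s'_A}^2, \qquad b' = \bar p_A(1-\bar p_A) - {s'_A}^2 - \frac{1}{4}\bar h_A,$$

and

$$c' = \frac{1}{4}\bar h_A,$$

where

$${s'_A}^2 = \sum_i \frac{n_i(\tilde p_{Ai} - \bar p_A)^2}{r\,\bar n},$$

the actual variance of allele frequencies over populations.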

### Algorithm Used for Confidence Intervals¶

To calculate the 95% confidence intervals around the $F_{ST}$ value, Golden Helix SVS uses a percentile-t bootstrapping technique described in [Leviyang2010]. This algorithm is described below. We use the $F$, $\theta$, and $f$ notation of [WeirCockerham1984].

To find an estimate of the variance of $\hat\theta$, we can use jackknifing [WeirCockerham1984]:

$$\widehat{\mathrm{Var}}(\hat\theta) = \frac{m-1}{m}\sum_{i=1}^{m}\left(\hat\theta_{(i)} - \frac{1}{m}\sum_{j=1}^{m}\hat\theta_{(j)}\right)^2,$$

where $\hat\theta_{(i)}$ is the estimate of $\theta$ obtained by omitting locus $i$ and $m$ is the number of loci.

We then perform one thousand bootstrap replicates. In each replicate, a simple random sample with replacement is taken of the loci and the $\theta$ ($F_{ST}$) value $\hat\theta^*$ is calculated for each subpopulation pair. We then use jackknifing again to find an estimate of the variance of $\hat\theta^*$. The last part of each bootstrap replicate is to calculate the t-statistic

$$t^* = \frac{\hat\theta^* - \hat\theta}{\hat\sigma^*},$$

where $\hat\sigma^*$ is the square root of the estimate of the variance of $\hat\theta^*$ found through jackknifing.

These t-statistics are then stored in a list in ascending order to be used after the replicates are finished.

After the bootstrap replicates, the confidence interval around $\hat\theta$ is found with:

$$\left[\hat\theta - t^*_{(1-\alpha/2)}\,\hat\sigma,\;\; \hat\theta - t^*_{(\alpha/2)}\,\hat\sigma\right],$$

where $t^*_{(q)}$ is the $q$ quantile of the sorted t-statistics. Since we are trying to find the 95% confidence interval, $\alpha$ is 0.05, and $\hat\sigma$ is the square root of the estimate of the variance of $\hat\theta$ found before.
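The jackknife and percentile-t bootstrap steps can be sketched generically in Python. This sketch assumes each locus contributes a (numerator, denominator) pair, for example $(a_l,\ a_l + b_l + c_l)$, and that the estimate is the ratio of the summed numerators to the summed denominators. The function names, the quantile indexing, and the fixed random seed are choices made for this example only.

```python
import random


def ratio(pairs):
    """Estimate from per-locus (numerator, denominator) contributions."""
    return sum(a for a, _ in pairs) / sum(d for _, d in pairs)


def jackknife_se(pairs):
    """Square root of the leave-one-locus-out jackknife variance."""
    m = len(pairs)
    loo = [ratio(pairs[:i] + pairs[i + 1:]) for i in range(m)]
    mean = sum(loo) / m
    return ((m - 1) / m * sum((x - mean) ** 2 for x in loo)) ** 0.5


def percentile_t_ci(pairs, alpha=0.05, replicates=1000, seed=0):
    rng = random.Random(seed)
    theta, se = ratio(pairs), jackknife_se(pairs)
    t_stats = []
    for _ in range(replicates):
        sample = [rng.choice(pairs) for _ in pairs]  # resample loci with replacement
        se_star = jackknife_se(sample)
        if se_star > 0:  # skip degenerate replicates
            t_stats.append((ratio(sample) - theta) / se_star)
    t_stats.sort()
    t_lo = t_stats[int((alpha / 2) * (len(t_stats) - 1))]
    t_hi = t_stats[int((1 - alpha / 2) * (len(t_stats) - 1))]
    # The upper t quantile sets the lower bound, and vice versa.
    return theta - t_hi * se, theta - t_lo * se
```

Each replicate repeats the jackknife, so the cost grows with the square of the number of loci per replicate; a production implementation would likely reuse partial sums.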

### Output¶

For the all markers mode, two spreadsheets are made. One spreadsheet
is made for the by marker mode. Please see the output section of
*Estimates Made Using All Markers* or of *Estimates Made Using One Marker at a Time* for details.

## LD Score Computation and Binning¶

This feature computes an LD score, as well as the minor allele frequency (MAF), for each of the markers in your spreadsheet. Optionally, this feature will also categorize the markers into “bins” based on LD score, MAF, and/or both of these at once.

This method, as well as the feature *Genomic Best Linear Unbiased Predictors Analysis Using Bins*, is partly
inspired by the paper [Wainschtein2019], which describes recovering
the missing heritability for height and for body mass index (BMI) to
the level implied by pedigree studies. One possible workflow which
emulates what is done in this paper is to:

- Use this feature to categorize your markers into bins based on both LD score and MAF.
- Use
**Genotype > Compute GBLUP Using Bins**, giving that feature the output of this feature, to complete your analysis.

Another use of the LD Score itself is in LD Score regression
(*LD Score Regression*). The intercept of LD Score regression can
be used as a better correction factor for GWAS studies than Genomic
Control. Also, the slope of the LD Score regression may be used to
estimate the heritability of a phenotype coming from polygenicity.

### What is an LD Score?¶

The LD score of a marker is the sum of the LD values ($r^2$
using the CHM method, see *Bi-Allelic*) between the marker
and all markers within a specified distance window surrounding the
marker. The LD of the marker with itself, which always has a value of
1, is included in this score.

The “window” is normally all markers within a specified genetic distance from the marker in question and which are in the same chromosome as the marker in question. However, this feature will optionally work on spreadsheets without a marker map, in which case it will use a specified number of neighboring markers in the spreadsheet as a type of distance measure from the marker in question.

Note

Practically speaking, this “window” is meant to cover all other markers anywhere near the given marker that would reasonably be expected to be in either complete LD or partial LD with the given marker. The idea is to approximate, for marker $j$,

$$\ell_j = \sum_k r^2_{jk},$$

where the markers $k$ are all the markers in the entire chromosome.
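A windowed LD score can be sketched in Python as below. As a stand-in for the CHM method, this example uses the squared Pearson correlation of genotype allele counts as $r^2$, which is an assumption and not necessarily what SVS computes. It also assumes all markers lie on one chromosome and that `positions` holds base-pair coordinates; the names are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation of two genotype vectors (None = missing)."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic marker: no measurable LD
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    return sxy * sxy / (sxx * syy)


def ld_scores(markers, positions, window_bp=300_000):
    """markers[j] is the genotype vector (0/1/2 per sample) for marker j."""
    scores = []
    for j, pos_j in enumerate(positions):
        score = 1.0  # LD of the marker with itself
        for k, pos_k in enumerate(positions):
            if k != j and abs(pos_k - pos_j) <= window_bp:
                score += r_squared(markers[j], markers[k])
        scores.append(score)
    return scores
```

The double loop is quadratic in the number of markers; with sorted positions a sliding window would avoid scanning markers that are out of range.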

### Data Requirements¶

First, import your data into a Golden Helix SVS project (See
*Importing Your Data Into A Project*) to create a genotypic spreadsheet. It is recommended
that the spreadsheet be marker-mapped, not only to ensure that the
markers are in the proper sequence, but also to provide access to
genetic distance and the locations of chromosome boundaries. The
samples in your spreadsheet are required to be rowwise. The
LD Score dialog can be accessed by selecting **Genotype** > **Quality
Assurance** > **LD Score Computation and Binning** from the
spreadsheet menu.

### Parameters¶

*The LD Score dialog as initially shown* shows the LD Scoring and Binning dialog as
initially presented.

*The LD Score dialog where MAF bins have been selected* shows the same dialog with three MAF bins selected using
two MAF thresholds. Note that for the first frequency threshold, *Frequency
Threshold 3*, a new frequency has been selected, while the default frequency
was used for the other frequency threshold, *Frequency Threshold 5*.

#### Window Size¶

If your data is marker-mapped, enter the window size in genetic
distance by specifying the *Max window Size in kilo base pairs* the
window is to span. The default is 300K base pairs.

If your data is not marker-mapped, enter the *Window Size in
SNPs*. The default is 100 SNPs in the spreadsheet.

Note

A separate window is formed around every marker for the purpose of LD Score computation.

#### MAF Bin Thresholds¶

If you check **Bin by Minor Allele Frequency (MAF) Thresholds**, seven
different thresholds for separating MAF bins will be available from
which you may choose. Check one or more of these thresholds and, if
you want, change the threshold frequencies for those thresholds you
have checked.

Note

If you do not check any minor allele frequency thresholds, MAF binning will not occur.

Note

The default MAF threshold frequencies are taken from the paper [Wainschtein2019].

#### Number of LD Score Bins¶

Check **Bin by LD Score Quantile** to create multiple LD Score bins,
either overall or within each MAF bin, depending on whether you have
also checked **Bin by Minor Allele Frequency (MAF) Thresholds**, and
change the *Number of LD Score bins* if you wish.

If you have also chosen **Bin by Minor Allele Frequency (MAF) Thresholds**
and checked at least one MAF threshold value, the following will happen:

LD Score binning will take place as a sub-binning process within the markers that are in each MAF bin. The markers within each MAF bin will be evenly sub-divided according to their LD Score quantile within the MAF bin.

For instance, if you ask for three bins within each MAF bin (by selecting 3 LD Score bins and selecting two or more MAF bins), the markers with the bottom third of LD Scores within each MAF bin will be placed into LD Score Bin 1, the markers whose LD Scores fall into the middle third of each MAF bin will be placed into LD Score Bin 2, and the markers whose LD Scores fall into the top third within each MAF bin will be placed into LD Score Bin 3.

A final binning output will be generated that puts each marker in a separate bin based on both the marker’s MAF bin and the marker’s LD Score bin. For instance, you could have 7 MAF bins with two LD Score bins for each MAF bin, for a total of 14 bins.

Otherwise, if you have not chosen **Bin by Minor Allele Frequency
(MAF) Thresholds** and checked at least one MAF threshold value, the
following will happen:

LD Score binning will take place over all markers, evenly sub-divided according to their LD Score quantile over all markers.

For instance, if you ask for three bins overall (by selecting 3 LD Score bins and no MAF bins), the markers with the bottom third of LD Scores over all the markers will be placed into (LD Score) Bin 1, the markers whose LD Scores fall into the middle third over all the markers will be placed into Bin 2, and the markers whose LD Scores fall into the top third over all the markers will be placed into Bin 3.

This will be the only binning output in this case.

### Output¶

A spreadsheet with rows corresponding to markers will be created with the following outputs:

- **LD Score**: LD Score for the specified marker.
- **MAF**: Minor Allele Frequency for the specified marker.
- **LD Bin**: Bin number (quantile number) for the LD Score for the specified marker. (Displays if you have selected LD Score binning but not MAF binning.)
- **MAF Bin**: Bin number for the Minor Allele Frequency of the specified marker, as defined by the MAF thresholds. (Displays if you have selected MAF binning, with or without LD binning.)
- **LD Bin within MAF Bin**: Bin number (quantile number) for the LD Score for the specified marker, with the quantiles taken strictly within the LD Scores of those markers that fall into the same MAF bin as does the current marker. (Displays if you have selected both MAF binning and LD binning.)
- **Overall Bin**: Uniquely identifying bin number for the specified marker. The bin number depends upon both the **MAF Bin** and the **LD Bin within MAF Bin**. (Displays if you have selected both MAF binning and LD binning.) The formula for the Overall Bin is

  **Overall Bin** = (**MAF Bin** - 1) × *Number of LD Score bins* + **LD Bin within MAF Bin**
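The combined binning and the Overall Bin formula can be sketched in Python as follows. This is a minimal illustration: the rule that a MAF exactly equal to a threshold falls into the higher bin, and the rank-based way of splitting each MAF bin into equal quantiles, are assumptions, and the function names are hypothetical.

```python
import bisect


def maf_bin(maf, thresholds):
    """1-based MAF bin: one more than the number of thresholds at or below maf."""
    return bisect.bisect_right(sorted(thresholds), maf) + 1


def quantile_bins(values, n_bins):
    """1-based quantile bin for each value, assigned by rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values) + 1
    return bins


def overall_bins(mafs, ld_scores, thresholds, n_ld_bins):
    maf_bins = [maf_bin(m, thresholds) for m in mafs]
    ld_in_maf = [0] * len(mafs)
    for b in set(maf_bins):
        idx = [i for i, mb in enumerate(maf_bins) if mb == b]
        sub = quantile_bins([ld_scores[i] for i in idx], n_ld_bins)
        for i, q in zip(idx, sub):
            ld_in_maf[i] = q  # LD quantile taken strictly within this MAF bin
    # Overall Bin = (MAF Bin - 1) * Number of LD Score bins + LD Bin within MAF Bin
    return [(mb - 1) * n_ld_bins + q for mb, q in zip(maf_bins, ld_in_maf)]
```

With one MAF threshold and two LD Score bins, for example, the overall bin numbers run from 1 to 4.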

If the original spreadsheet is marker mapped, that marker map will be applied to this spreadsheet.

## Separately Computing the Genomic Relationship Matrix¶

This tool outputs the Genomic Relationship Matrix (GRM)
(*The Genomic Relationship Matrix*) from a genotypic or numerically
recoded spreadsheet.

Optionally, this tool can output a set of Genomic Relationship
Matrices and the GRM list spreadsheet for input into the binned GBLUP
feature (*Genomic Best Linear Unbiased Predictors Analysis Using Bins*).

A Genomic Relationship Matrix can be used as a pre-computed genomic
relationship matrix for GBLUP computations
(*Genomic Best Linear Unbiased Predictors Analysis*) or as a pre-computed kinship
matrix for the EMMAX and MLMM Mixed Model GWAS methods
(*Mixed Linear Model Analysis*). This matrix may also be used for
visualization of the cryptic relatedness of samples.

For further details about pre-computed kinship matrices in general,
see *Precomputed Kinship Matrix Option*.

Note

This method uses (with a genotypic spreadsheet) or assumes (with a numerically recoded spreadsheet) an additive genetic model.

### Options¶

- **Impute missing data as:** Missing genotype data can be imputed by either of the following methods:
  - **Homozygous major allele**: All missing genotype data will be recoded to 0.
  - **Numerically as average value**: All missing genotype data will be recoded to the average of all non-missing genotype calls (using the additive model).

    Note

    If **Correct for Gender** (see below) is also selected, and there is non-missing data for both males and females in a given marker, averages for males and females will be computed and used separately.

- **Correct for Gender**: Assumes the column is coded as if the male were homozygous for the X-Chromosome allele in question. Uses the [Taylor2013] gender-correction algorithm. (See *gblupGenderCorrOverall* and *gblupGenderCorrIndivMarker*.)
  - **Choose Sex Column:** Choose the spreadsheet column that specifies the gender of the sample. This column may either be categorical (“M” vs. “F”) or binary (0 = male, 1 = female).
  - **Chromosome that is hemizygous for males:** Usually the X Chromosome, which is the default.
  - **Dosage Compensation**: Select the dosage compensation to be used. Equal X-Linked Variance is the default.

- **Select Algorithm**: Select which form of normalization is preferred for computation.
  - **Overall normalization**: Normalization is performed globally.
  - **Normalize by individual marker (GCTA method)**: Normalization is performed on a per-marker basis.

- **Create Multiple GRMs by Bin Using a Binning Spreadsheet**: To create a set of Genomic Relationship Matrices and a GRM list spreadsheet for binned GBLUP, check this option and select the *Marker Bin spreadsheet* and the *Marker Bin column* within that spreadsheet.

### Output¶

Normally, a **GBLUP Genomic Relationship Matrix** spreadsheet will be created.

However, if you selected **Create Multiple GRMs by Bin Using a Binning
Spreadsheet**, the following spreadsheets are output:

- One **GBLUP Genomic Relationship Matrix** for every category/bin. This is the relationship between pairs of samples, as determined by actual genomic similarity (or dis-similarity) between samples, over the markers contained in the bin.
- A **Genomic Relationship Matrix List** spreadsheet, which, for each row, has the category/bin number or label as a row label and the spreadsheet number of the bin’s GRM in that row’s first column.
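To illustrate the difference between the two normalization choices, here is a minimal Python sketch under the additive model. It assumes that overall normalization divides the centered cross-products by $2\sum_j p_j(1-p_j)$ and that per-marker normalization standardizes each marker by $\sqrt{2p_j(1-p_j)}$ before averaging, as in the commonly cited GCTA formulation. Whether SVS uses exactly these formulas is an assumption here, and missing data and gender correction are not handled.

```python
def allele_freqs(geno):
    """geno: list of samples, each a list of allele counts 0/1/2."""
    n = len(geno)
    return [sum(row[j] for row in geno) / (2 * n) for j in range(len(geno[0]))]


def grm_overall(geno):
    """Center by 2p, then divide all cross-products by one global constant."""
    p = allele_freqs(geno)
    denom = 2 * sum(pj * (1 - pj) for pj in p)
    z = [[g - 2 * pj for g, pj in zip(row, p)] for row in geno]
    return [[sum(a * b for a, b in zip(zi, zk)) / denom for zk in z] for zi in z]


def grm_per_marker(geno):
    """Standardize each marker separately, then average over markers."""
    p = allele_freqs(geno)
    keep = [j for j, pj in enumerate(p) if 0 < pj < 1]  # drop monomorphic markers
    z = [[(row[j] - 2 * p[j]) / (2 * p[j] * (1 - p[j])) ** 0.5 for j in keep]
         for row in geno]
    m = len(keep)
    return [[sum(a * b for a, b in zip(zi, zk)) / m for zk in z] for zi in z]
```

The two versions agree when every marker has the same allele frequency and diverge as frequencies vary, because per-marker normalization gives rare variants more weight.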

## Computing the Numerator Relationship Matrix¶

This tool outputs the numerator relationship matrix (sometimes referred to as the “A Matrix”) from the pedigree information in the current spreadsheet.

This matrix can be used as a pre-computed kinship matrix for the EMMAX
and MLMM Mixed Model GWAS methods (*Mixed Linear Model Analysis*)
and for the Mixed Model KBAC method (*Mixed-Model Kernel-Based Adaptive Cluster (KBAC) Method*).

For further details about pre-computed kinship matrices in general,
see *Precomputed Kinship Matrix Option*.

### Overview of Theory¶

The off-diagonal coefficient $a_{ij}$ for the $i$-th row and $j$-th column of this matrix is, if both parents of pedigree member $j$ are in the pedigree and are designated $s$ and $d$, the average of the numerator relationship coefficients $a_{is}$ and $a_{id}$ between each parent and pedigree member $i$. That is,

$$a_{ij} = \frac{a_{is} + a_{id}}{2}.$$

If only one parent $s$ of pedigree member $j$ is in the pedigree, and pedigree member $i$ is in the same generation or an earlier generation than is pedigree member $j$, we have

$$a_{ij} = \frac{a_{is}}{2}.$$

If neither parent of either pedigree member $i$ or of pedigree member $j$ is in the pedigree (that is, both $i$ and $j$ are “founders”), we have

$$a_{ij} = 0.$$

We also always have

$$a_{ji} = a_{ij}.$$

For the diagonal coefficient $a_{ii}$ for pedigree member $i$, we have, if both parents $s$ and $d$ are in the pedigree,

$$a_{ii} = 1 + \frac{a_{sd}}{2} = 1 + F_i.$$

$F_i$ is known as the “coefficient of inbreeding” (as computed from this pedigree) for pedigree member $i$.

If it is not true that both parents are in the pedigree, we use

$$a_{ii} = 1.$$

Note

This matrix is called a “numerator” relationship matrix because its coefficients are effectively the numerators of the relationship coefficients given by Sewall Wright [Wright1922], the relation being

$$r_{ij} = \frac{a_{ij}}{\sqrt{a_{ii}\,a_{jj}}}.$$

Note

See *Inbreeding Coefficients* to estimate coefficients of inbreeding
based on genotypic information.

### Data Requirements¶

This feature must be run from a pedigree spreadsheet that contains pedigree information for every sample to be included in the mixed-model analysis that will use the output of this feature as a pre-computed kinship matrix.

The pedigree does not need to be sorted in any particular order. This tool will determine the proper ordering of the pedigree members for computational purposes.

If there is no row entry for any given parent of a pedigree member, a virtual entry for that parent will be created internally for computational purposes.

### Output¶

A numerator relationship matrix (“A Matrix”) for the current spreadsheet’s pedigree will be generated.
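The recursion described in the theory overview corresponds to the classic tabular method, which can be sketched in Python as below. The sketch assumes the pedigree is given as a dictionary mapping each member to a (father, mother) pair, with `None` for an unknown parent. Parents without their own entry are treated as founders, mirroring the virtual entries mentioned under Data Requirements. The names are hypothetical and this is not the SVS implementation.

```python
def numerator_relationship_matrix(pedigree):
    """Return (order, A), where A[i][j] is the numerator relationship coefficient."""
    order, seen = [], set()

    def visit(member):
        # Depth-first ordering so that parents always precede their offspring.
        if member is None or member in seen:
            return
        seen.add(member)
        for parent in pedigree.get(member, (None, None)):
            visit(parent)
        order.append(member)

    for member in pedigree:
        visit(member)

    A = {i: {} for i in order}
    for pos, j in enumerate(order):
        s, d = pedigree.get(j, (None, None))
        for i in order[:pos]:
            a_is = A[i][s] if s is not None else 0.0
            a_id = A[i][d] if d is not None else 0.0
            A[i][j] = A[j][i] = 0.5 * (a_is + a_id)
        # Diagonal: 1 plus the inbreeding coefficient when both parents are known.
        both = s is not None and d is not None
        A[j][j] = 1.0 + (0.5 * A[s][d] if both else 0.0)
    return order, A
```

For a child of a father-daughter mating, the diagonal entry comes out as 1.25, that is, one plus an inbreeding coefficient of 0.25.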

## Filter Samples by Call Rate¶

Call rates are calculated for each sample (see *Genotype Statistics by Sample*), and samples whose call rates do not meet the specified criteria will be inactivated. If at least one sample, but not all of the samples, is inactivated, a subset of active rows is created.

From a spreadsheet containing several genotypic columns, choose
**Genotype** > **Quality Assurance** > **Filter Samples by Call Rate**.
The spreadsheet output (**Statistics by Sample**) contains ten
columns; the first column contains the number of called genotypes (not
including missing values) and the second column contains the call rate,
defined as the number of non-missing values divided by the number of
genotype columns. The rest of the columns in the **Statistics by Sample**
spreadsheet are heterozygosity statistics. See *Genotype Statistics by Sample* for more
information.


## LD Pruning¶

### Overview¶

Some tests such as Identity by Descent and Inbreeding Coefficient Estimation will obtain better results if the markers used are not in linkage disequilibrium with each other.

Therefore, Golden Helix SVS provides this feature to inactivate (“prune”) markers that are in linkage disequilibrium with other markers that are left active, so that you may do your tests just with those active markers that are not as much in LD with each other.

### Data Requirements¶

First, import your data into a Golden Helix SVS project (See
*Importing Your Data Into A Project*) to create a genotypic spreadsheet. It is recommended that
the spreadsheet be marker-mapped to ensure that the markers are in the
proper sequence. The samples in your spreadsheet are required to be
rowwise. The LD Pruning dialog can be accessed by selecting **Genotype** >
**Quality Assurance** > **LD Pruning** from the spreadsheet
menu.

### Method¶

All pairs of markers within a moving window are compared with each other to measure their pairwise LD. You may specify both the size of the window and the increment by which it moves. If any two markers within the window are in LD greater than the specified threshold, the first marker of the pair will be inactivated (“pruned”).

### Parameters¶

#### Window Size¶

Enter the window size in number of markers.

#### Window Increment¶

Enter the number of markers by which the beginning window position will be incremented.

#### LD Statistic¶

Choose $r^2$ or $D'$ as the statistic to apply the threshold value to.

#### LD Threshold¶

For any pair of markers whose LD statistic is larger than this, the first marker of the pair will be inactivated (“pruned”).

#### LD Computation Method¶

Check whether to use CHM or EM. CHM is computationally much faster, and gives almost the same results as the EM method.
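The moving-window procedure can be sketched in Python as follows. This example uses the squared Pearson correlation of allele counts as the $r^2$ statistic, which is an assumption standing in for the CHM or EM computation. It also assumes that a marker, once pruned, is not compared again. The parameter names mirror the dialog options but are otherwise hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation of two genotype vectors (None = missing)."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    return sxy * sxy / (sxx * syy)


def ld_prune(markers, window=50, step=5, threshold=0.5):
    """Return one boolean per marker: True means the marker stays active."""
    active = [True] * len(markers)
    for start in range(0, len(markers), step):
        stop = min(start + window, len(markers))
        for i in range(start, stop):
            if not active[i]:
                continue
            for k in range(i + 1, stop):
                if active[k] and r_squared(markers[i], markers[k]) > threshold:
                    active[i] = False  # prune the first marker of the pair
                    break
    return active
```

Pruning the first marker of each pair means that, within a run of markers in strong LD, the last one in the window tends to be the one kept.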

### End Result¶

The column inactivation will be on the spreadsheet you are working with, or on a new tab containing a copy of the spreadsheet you have been working with.

## SNP Density¶

Reports various SNP density statistics across all markers in a marker
mapped spreadsheet. To calculate the statistics, open a marker mapped
spreadsheet and select **Genotype** > **Quality Assurance** > **SNP Density**.
The following statistics will appear in a
window: Minimum Gap (bp), Maximum Gap (kb), Average Gap (kb), and SNP
Density (1 SNP per X.XXkb).

## Mendelian Error Check¶

This feature can count and report all Mendelian errors, replace all errors with missing calls, or both.

If **Report Mendelian Errors** is checked, the feature counts Mendelian errors
over all trios and reports the total per marker and per child. Partial trios are
also examined, but fewer errors can be detected by definition. Two output
spreadsheets are created:

- **Mendelian Errors by Marker** has one row for each genotypic column found in the original spreadsheet and an integer error count column.
- **Mendelian Errors by Sample** has one row for each child found in the original spreadsheet and an integer error count column.

If **Remove Mendelian Errors** is checked, a child spreadsheet is created with the
same dimensions as the original spreadsheet. This spreadsheet has all Mendelian
errors removed and replaced with missing values. The number of calls replaced is
reported in the node change log and should equal the sum of the **Mendelian Errors**
column in each of the report spreadsheets.

This feature requires a pedigree spreadsheet with several genotypic columns.
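The per-marker and per-child error counts can be illustrated with a short Python sketch. It assumes autosomal bi-allelic genotypes coded as counts (0, 1, 2) of one allele with `None` for missing calls, and it treats a missing parent as able to transmit either allele, which is why partial trios reveal fewer errors. The function names and data layout are hypothetical.

```python
def is_mendelian_error(child, father=None, mother=None):
    """True if the child's genotype cannot arise from the given parents."""
    if child is None:
        return False

    def gametes(parent):
        if parent is None:
            return {0, 1}  # unknown parent: either allele may be transmitted
        return {0: {0}, 1: {0, 1}, 2: {1}}[parent]

    possible = {a + b for a in gametes(father) for b in gametes(mother)}
    return child not in possible


def count_errors(trios, genotypes):
    """trios: (child, father, mother) ids; genotypes: id -> list of genotypes."""
    n_markers = len(next(iter(genotypes.values())))
    by_marker = [0] * n_markers
    by_child = {}
    for child, father, mother in trios:
        by_child[child] = 0
        for j in range(n_markers):
            f = genotypes[father][j] if father in genotypes else None
            m = genotypes[mother][j] if mother in genotypes else None
            if is_mendelian_error(genotypes[child][j], f, m):
                by_marker[j] += 1
                by_child[child] += 1
    return by_marker, by_child
```

Both tallies count the same set of errors, so the totals of the two count columns agree.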

## Inbreeding Coefficients¶

### Overview¶

If there is inbreeding among individuals represented in a dataset, this will reduce the independence of the data. For this reason, and to better assure data quality in your data samples, Golden Helix SVS can estimate the inbreeding coefficient for each individual represented in your data.

- It is recommended that inbreeding coefficient estimates from Golden Helix SVS be used for data quality control, rather than for actually attempting to impute inbreeding on the part of individuals whose samples you are analyzing.
- It is usually advisable to apply LD pruning (**Genotype** > **Quality Assurance** > **LD Pruning** from the spreadsheet menu) before using this feature.
- This inbreeding coefficient is equivalent to Wright’s within-subpopulation fixation index, $F_{IS}$, in population genetics. (See *Fixation Index Fst and Fixation Index Fst (by Marker)*.)
- Values may range from -1 to +1. Negative values indicate outbreeding (or data quality problems for large negative values), and positive values indicate inbreeding (or other data quality problems).
- You will obtain the best values when you use many samples and many markers. This is due to the need to estimate allele frequencies over multiple samples, as well as the need to estimate $F$ itself over multiple markers.

Note

If you have a pedigree which may reflect inbreeding, you can
use *Computing the Numerator Relationship Matrix* to check this by computing the numerator
relationship matrix on that pedigree. The diagonal elements for any
pedigree members for which the pedigree shows inbreeding will be
larger than one by the amount of the inbreeding coefficient (as
computed from the pedigree).
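As an illustration of why the diagonal exceeds one for inbred pedigree members, the following is a minimal sketch of the standard tabular method for building a numerator relationship matrix. It is not the SVS implementation; the function name and the `(id, sire, dam)` input format are assumptions, and parents must be listed before their offspring.

```python
from typing import Hashable, List, Optional, Tuple

PedigreeRow = Tuple[Hashable, Optional[Hashable], Optional[Hashable]]


def numerator_relationship_matrix(pedigree: List[PedigreeRow]) -> List[List[float]]:
    """Tabular method. Each row is (id, sire, dam); an unknown parent is None.

    Parents must appear earlier in the list than their offspring, otherwise
    a KeyError is raised.
    """
    n = len(pedigree)
    a = [[0.0] * n for _ in range(n)]
    index = {}
    for i, (ident, sire, dam) in enumerate(pedigree):
        s = index[sire] if sire is not None else None
        d = index[dam] if dam is not None else None
        # Relationship with each earlier individual is the average of that
        # individual's relationships with the two parents.
        for j in range(i):
            value = 0.0
            if s is not None:
                value += 0.5 * a[j][s]
            if d is not None:
                value += 0.5 * a[j][d]
            a[i][j] = a[j][i] = value
        # Diagonal: one plus the inbreeding coefficient, which is half the
        # relationship between the parents.
        inbreeding = 0.5 * a[s][d] if s is not None and d is not None else 0.0
        a[i][i] = 1.0 + inbreeding
        index[ident] = i
    return a
```

For example, the offspring of a mating between two full siblings has a pedigree inbreeding coefficient of 0.25, so the corresponding diagonal element is 1.25.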

### Data Requirements¶

First, import your data into a Golden Helix SVS project (see
*Importing Your Data Into A Project*) to create a marker-mapped genotypic spreadsheet. The
samples in your spreadsheet are required to be rowwise. Only the
autosomal genotype columns will be used by this feature. The inbreeding
coefficient dialog can be accessed by selecting **Genotype** > **Quality Assurance**
> **Inbreeding Coefficients** from the
spreadsheet menu.

### Computation¶

For a particular marker with allele frequencies $p$ and $q$, the probability that an individual is homozygous is

$$f + (1 - f)(p^2 + q^2),$$

or the probability of being homozygous by descent ($f$) plus the probability of being homozygous by chance. If an individual has $L$ genotyped autosomal markers, $O$ is the number of observed homozygotes for the individual over all markers, and $E$ is the number expected by chance, then

$$O = fL + (1 - f)E$$

or

$$f = \frac{O - E}{L - E}.$$

Since allele frequencies are estimated from the data in your spreadsheet, an unbiased estimator of $E$ is used, based on the sum over all markers $j$ not missing for the individual:

$$E = \sum_{j} \left(1 - 2 p_j q_j \frac{T_{Aj}}{T_{Aj} - 1}\right),$$

where $p_j$ and $q_j$ are the estimated allele frequencies for marker $j$ and $T_{Aj}$ is twice the number of non-missing genotypes for marker $j$.
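The computation described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the SVS implementation: the function name and the genotype coding (0, 1 or 2 copies of one designated allele, `None` for missing) are assumptions, and all columns are taken to be bi-allelic autosomal markers.

```python
from typing import List, Optional, Tuple

# Assumed coding: number of copies of one designated allele (0, 1, 2), or None.
Genotype = Optional[int]


def inbreeding_coefficients(
    genotypes: List[List[Genotype]],
) -> List[Tuple[Optional[float], int, int, float]]:
    """Rows are samples, columns are markers.

    Returns one tuple (f, L, O, E) per sample: the estimated inbreeding
    coefficient, the number of markers analyzed, the observed number of
    homozygotes, and the expected number of homozygotes.
    """
    n_markers = len(genotypes[0]) if genotypes else 0

    # Per-marker allele frequency p_j and allele count T_Aj (twice the
    # number of non-missing genotypes).
    marker_stats = []
    for j in range(n_markers):
        calls = [row[j] for row in genotypes if row[j] is not None]
        t_a = 2 * len(calls)
        if t_a == 0:
            marker_stats.append(None)  # marker missing for every sample
        else:
            marker_stats.append((sum(calls) / t_a, t_a))

    results = []
    for row in genotypes:
        n_used, observed, expected = 0, 0, 0.0
        for g, stats in zip(row, marker_stats):
            if g is None or stats is None:
                continue
            p, t_a = stats
            n_used += 1
            observed += 1 if g != 1 else 0
            # Unbiased estimate of the chance homozygosity for this marker.
            expected += 1.0 - 2.0 * p * (1.0 - p) * t_a / (t_a - 1)
        denominator = n_used - expected
        f = (observed - expected) / denominator if denominator > 0 else None
        results.append((f, n_used, observed, expected))
    return results
```

The coefficient is returned as `None` when $L - E$ is zero, which happens if every marker analyzed for a sample is monomorphic in the data; how SVS reports that case is not covered here.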

### Parameters¶

#### Genome¶

Check either the **Human** radio button or the **Non-Human** radio
button.

#### Number of Autosomes¶

Enter the number of autosomes which the genome you are using contains.

### Output¶

A spreadsheet is output with one row for each individual. The output columns consist of the inbreeding coefficient ($f$), the number of markers analyzed for the individual, the number of observed homozygotes for the individual, and the number of expected homozygotes for the individual.

## PBAT Family-Based QA Statistics¶

The quality control statistics for family-based studies are used to measure the genotyping error rate of each proband in a family individually. See [Fardo2009].

### Data Requirements¶

PBAT family-based QA statistics require a pedigree dataset containing
genotypic data. First, import your data into a Golden Helix SVS project
(see *Importing Your Data Into A Project*). The family-based statistics dialog can be accessed by
selecting **Genotype** > **PBAT Family-Based QA** from the
spreadsheet menu.

### Processing¶

Select the computation parameters and output options, then select the **Run**
button to process. The computation parameters and output
options are described below.

One spreadsheet of results will be created as a child of the current spreadsheet navigator window node. Information about the parameters used will be recorded in the Node Change Log.

### Computation Parameters¶

#### Algorithm¶

If the **Use alternative rapid pedigree algorithm** option *IS NOT* selected,
the standard PBAT algorithm for processing extended pedigrees is
used and Mendelian errors will not be calculated.

If the **Use alternative rapid pedigree algorithm** option *IS* selected (the
default), the alternative rapid pedigree algorithm for processing
extended pedigrees will be used and Mendelian errors will be calculated.
See *Alternative Rapid Pedigree Algorithm* for more information.

#### Number of non-founders in one pedigree¶

Enter the maximum number of non-founders plus one that exist in one pedigree. “Non-founders” refers to subjects in the pedigree who have parents whose data is also in the pedigree. If a pedigree is found to have this number of non-founders or more, it will not be processed. For instance, if the user wants to restrict pedigrees to only have two siblings plus their parents, then enter 3 in this box.

Note

Under the alternative rapid pedigree algorithm, this parameter refers to the maximum number of non-founders within the family clusters identified by this algorithm, rather than to the maximum number of non-founders within any original extended pedigree.
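The "plus one" convention for this parameter can be made concrete with a small sketch. The function name and pedigree representation are hypothetical, and the sketch assumes that a subject counts as a non-founder when at least one listed parent is also a member of the pedigree.

```python
from typing import Dict, Hashable, Optional, Tuple

Parents = Tuple[Optional[Hashable], Optional[Hashable]]


def pedigree_is_processed(members: Dict[Hashable, Parents], limit: int) -> bool:
    """Return True if the pedigree has fewer than `limit` non-founders.

    members maps each subject to (father, mother); None means unknown.
    Assumption: a non-founder has at least one parent inside the pedigree.
    """
    non_founders = sum(
        1 for father, mother in members.values()
        if father in members or mother in members
    )
    return non_founders < limit


# Two siblings plus their parents: two non-founders, so a limit of 3 keeps it.
nuclear_family = {
    "father": (None, None),
    "mother": (None, None),
    "sib1": ("father", "mother"),
    "sib2": ("father", "mother"),
}
keep = pedigree_is_processed(nuclear_family, limit=3)
```

Adding a third sibling to this family would give three non-founders, which reaches the entered value of 3, so the pedigree would be skipped.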

### Output¶

#### Output by marker¶

The rows will correspond to markers and the columns in the output spreadsheet will be:

- **MAF**: Minor Allele Frequency for the specified marker.
- **Mendelian errors**: Number of Mendelian errors for the specified marker.

  Note

  This column will only display if the alternative rapid pedigree algorithm was selected.

- **HW**: Hardy-Weinberg Equilibrium value for the specified marker.
- **FBATS**: Sum of the transmission scores for the specified marker that would occur if a TDT test were to be done in which all probands were assumed to be “affected”, and the null hypothesis were “no linkage and no association”.
- **FBATV**: Sum of terms of the variance matrix for the specified marker that would occur under the above-mentioned test.
- **FBATV2**: Sum of squares of the transmission scores over the probands for the specified marker that would occur under the above-mentioned test.

#### Output by proband¶

This is the default output selection.

The rows will correspond to probands and the columns in the output spreadsheet will be:

- **# Markers**: The number of markers used for the calculation.
- **Mendelian errors**: Number of Mendelian errors for the specified proband.

  Note

  This column will only display if the alternative rapid pedigree algorithm was selected.

- **Tgw p-value**: P-value of the standardized genome-wide transmission statistic. This statistic follows an approximate $\chi^2$ distribution with one degree of freedom.
- **Tgw**: Standardized genome-wide transmission statistic. A value greater than 30 for this statistic may indicate substantial amounts of genotyping error in the data for this proband.
- **E(delta X)**: Expected Mendelian residual.
- **var(delta X)**: Variance of the Mendelian residual.

#### Output all details¶

The rows will correspond to markers and the columns in the output spreadsheet will be:

- **MAF**: Minor Allele Frequency for the specified marker.
- **Mendelian errors**: Number of Mendelian errors for the specified marker.

  Note

  This column will only display if the alternative rapid pedigree algorithm was selected.

- **HW**: Hardy-Weinberg Equilibrium value for the specified marker.
- **FBATS**: Sum of the transmission scores for the specified marker that would occur if a TDT test were to be done in which all probands were assumed to be “affected”, and the null hypothesis were “no linkage and no association”.
- **FBATV**: Sum of terms of the variance matrix for the specified marker that would occur under the above-mentioned test.
- **FBATV2**: Sum of squares of the transmission scores over the probands for the specified marker that would occur under the above-mentioned test.
- **Columns for probands**: A column for every proband. Each value listed is the contribution to the FBATS statistic and to the Tgw statistic for the specified proband and specified marker.

  Note

  A missing value in any cell of this column indicates that there was a Mendelian error for this proband with this marker’s data.

#### Output -log 10 p-values¶

This option is only available with **Output by proband**. It calculates
the $-\log_{10}$ of the p-value for every proband.
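As a rough illustration of this transformation, the sketch below converts a Tgw value to a $-\log_{10}$ p-value. It assumes, as an assumption of this sketch rather than a statement about the SVS implementation, that the p-value is the upper-tail probability of a chi-squared distribution with one degree of freedom; the function name is hypothetical.

```python
import math


def neg_log10_p_value(tgw: float) -> float:
    """-log10 of the upper-tail chi-squared (1 d.f.) p-value for `tgw`.

    For one degree of freedom the upper-tail probability has the closed
    form erfc(sqrt(x / 2)).
    """
    if tgw < 0:
        raise ValueError("the statistic must be non-negative")
    p_value = math.erfc(math.sqrt(tgw / 2.0))
    if p_value == 0.0:
        # The p-value underflowed to zero in double precision.
        return math.inf
    return -math.log10(p_value)
```

Larger values correspond to smaller p-values, which makes probands with unusually large Tgw statistics easier to spot when the column is sorted or plotted.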