A Glossary of Terms Used in Genetic Analysis

The following is a short glossary of terms used either as a part of genetic analysis in general or as a part of family-based analysis in particular.


allele

One of two or more alternative nucleotide sequences at a single gene locus on a chromosome.

allele frequency

Allele frequency is a term in population genetics used to characterize the genetic diversity of a species population or, equivalently, the richness of its gene pool. It is defined as follows. Given:


  • a particular chromosome locus,
  • a gene occupying that locus,
  • a population of individuals carrying n loci in each of their somatic cells (e.g. two loci in the cells of diploid species, which contain two sets of chromosomes), and
  • a variant or allele of the gene,

the allele frequency of that allele is the fraction or percentage of loci that the allele occupies within the population. For instance, if the frequency of an allele is 20% in a given population, then among population members, one in five chromosomes carries that allele. The other four out of five are occupied by other variants of the gene, of which there may be one or many.

Example: If there are ten individuals in a population, and at a given locus, there are two possible alleles A and a, and if the genotypes of the individuals are AA, Aa, AA, aa, Aa, AA, AA, Aa, Aa, and AA, then the allele frequencies of allele A and allele a are:

P(A) = \frac{2+1+2+0+1+2+2+1+1+2}{20}=0.7

P(a) = \frac{0+1+0+2+1+0+0+1+1+0}{20}=0.3
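The counting in this example can be sketched in a few lines of Python. This is only an illustration: the function name allele_frequencies and the representation of each genotype as a two-character string are assumptions made for the sketch, not part of any particular software package.

```python
from collections import Counter

def allele_frequencies(genotypes):
    """Return {allele: frequency}, counting every allele copy in the sample.

    Each genotype is assumed to be a string with one character per allele,
    e.g. "Aa" for a diploid individual.
    """
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())  # 2 * number of individuals for diploids
    return {allele: n / total for allele, n in counts.items()}

# The ten genotypes from the example above.
genotypes = ["AA", "Aa", "AA", "aa", "Aa", "AA", "AA", "Aa", "Aa", "AA"]
freqs = allele_frequencies(genotypes)
print(freqs)
```

Allele A is counted 14 times and allele a 6 times out of 20 allele copies, which matches the frequencies of 0.7 and 0.3 given above.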

association studies

Studies that serve as the primary means of establishing an association between a given phenotype and other covariates, such as other phenotype data or genotype data.

attributable fraction

The proportion of disease occurrence that would potentially be eliminated if exposure were prevented.


autosome

A chromosome that is not a sex chromosome. In humans, the autosomal chromosomes are numbered 1 through 22. The non-autosomal chromosomes are X and Y.


base

A single nucleotide, composed of a nucleobase (nitrogenous base), a five-carbon sugar, and one to three phosphate groups. Together, the nucleobase and sugar make up a nucleoside.

base pair

A pair of complementary nucleotides. In DNA, the nucleotide adenine (A) always binds with thymine (T), and guanine (G) binds with cytosine (C). In RNA, uracil (U) takes the place of thymine and binds with adenine.

binary trait

A binary trait has only two possible values, e.g. presence of the trait versus absence of the trait.


centromere

A region of DNA that joins sister chromatids into a replicated chromosome.


chromatid

A complete base pair sequence. A chromatid has a short arm (‘p’ arm) and a long arm (‘q’ arm), separated by the centromere; the short arm contains fewer bases than the long arm. The ends of the arms farthest from the centromere are telomeres.


chromosome

An organized structure of DNA and protein that is found in cells. A replicated chromosome is a pair of sister chromatids joined at a centromere.


codon

A tri-nucleotide sequence associated with a particular amino acid.

continuous trait

A continuous trait is a trait whose variation is measured on a scale or over a range of values, rather than classified into categories. Examples are height, body mass index (BMI) and blood pressure.


covariate

A term synonymous with predictor, explanatory variable and independent variable. It is a variable that is either of direct interest in predicting the response in a study or one that acts as a confounding variable, affecting the relationship between the dependent variable and the independent variables of primary interest.


cytoband

A region of a chromatid that is distinguished from other cytobands by its shade after a staining solution is applied.


diploid

A cell or organism with two complete sets of chromosomes, one inherited from each parent. Healthy human somatic cells are diploid.

disease gene

A gene that carries or is responsible for a disorder, defect or a disease.


exon

The portion of a gene that is ultimately expressed as a protein via transcription into mRNA and translation of that mRNA into protein.


gene

A genomic region composed of exons and introns. Genes represent sequences that are transcribed into RNA, which is translated into proteins.

genetic model

The overall specification of how the disease allele(s) act to influence the disease. For parametric (model-dependent) linkage analysis, the genetic model must be specified for analysis. Components of the genetic model include whether the disorder is autosomal or X-linked, whether it is dominant or recessive, the frequency and penetrance of the disease allele, the frequency of phenocopies and the mutation rate. A genetic model consists of three main components:

  • a model for disease susceptibility, connecting disease phenotypes to genotypes at disease susceptibility (DS) loci for the sibs;
  • a population genetics model, describing the population joint distribution of genotypes at the DS loci of the parents; and
  • a segregation model, describing the segregation of alleles at the DS loci during meiosis.


genotype

May mean the genetic composition (alleles) of an individual in total, but in Golden Helix SVS, refers to the particular pair of alleles that an individual possesses at a single gene locus on a chromosome.


genome

The collection of chromosomes of a particular species, including autosomal and non-autosomal.

genomic position

A position within a chromosome. Usually described in the format <chr>:<pos> (e.g. chr1:50,000). In humans, positions are sorted alphanumerically by chromosome, and numbering starts at the beginning of the ‘p’ arm.

genomic region

A contiguous span of positions within a chromosome. Usually described in the format <chr>:<start>-<stop> (e.g. chr1:50,000-100,000). In humans, regions are sorted alphanumerically by chromosome, and numbering starts at the beginning of the ‘p’ arm.

half-open coordinates

Coordinates are zero-based, and the difference between the stop and start positions defines the width of an interval. An interval covering the first three positions of a chromosome in a half-open system would be specified as [0,3).


haplotype

A set of closely linked genetic markers present on one chromosome which tend to be inherited together (not easily separable by recombination).

Hardy-Weinberg equilibrium

A state attained by a population which displays constant allele and genotype frequencies from generation to generation. In the case of a locus with two alleles, A and B, occurring at frequencies p and q, respectively, the frequency of genotype AA is p^2, the frequency of AB is 2pq and the frequency of BB is q^2. A population in HW equilibrium normally has to be large and random-mating with no selection, mutation or migration.
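As a small illustration of the two-allele case, the expected genotype frequencies can be computed directly from the frequency p of allele A. The function name hardy_weinberg_expected and the value p = 0.7 are chosen only for this sketch.

```python
def hardy_weinberg_expected(p):
    """Expected genotype frequencies at a two-allele locus in HW equilibrium.

    p is the frequency of allele A; allele B then has frequency q = 1 - p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must lie between 0 and 1")
    q = 1.0 - p
    return {"AA": p * p, "AB": 2 * p * q, "BB": q * q}

# Hypothetical allele frequency, used only to illustrate the formulas.
expected = hardy_weinberg_expected(0.7)
print(expected)
```

For p = 0.7 this gives approximately 0.49 for AA, 0.42 for AB and 0.09 for BB. The three frequencies always sum to 1, because p^2 + 2pq + q^2 = (p + q)^2.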


heritability

A measure of the degree to which the variance of the distribution of a phenotype is due to genetic causes. Specifically, heritability is defined as the proportion of phenotypic variance explained by the analyzed marker.

In PBAT, a negative sign for a heritability indicates that the specified allele is under-transmitted in the test statistic, and a positive sign indicates that the specified allele is over-transmitted in the test statistic. The sign is estimated using the between-family information.


intron

The portion of a gene that is removed from the transcribed RNA by splicing, prior to translation.

indexed coordinates

Coordinates are one-based, and the width of an interval is one plus the difference between the stop and start positions. An interval covering the first three positions of a chromosome in an indexed system would be specified as [1,3].
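The relationship between the two coordinate systems can be sketched as a pair of conversion helpers. The function names below are hypothetical and serve only to make the arithmetic explicit.

```python
def half_open_to_indexed(start, stop):
    """Zero-based half-open [start, stop) -> one-based closed [start, stop]."""
    return start + 1, stop

def indexed_to_half_open(start, stop):
    """One-based closed [start, stop] -> zero-based half-open [start, stop)."""
    return start - 1, stop

def width_half_open(start, stop):
    """Width of a half-open interval: stop minus start."""
    return stop - start

def width_indexed(start, stop):
    """Width of an indexed interval: one plus stop minus start."""
    return stop - start + 1

# The first three positions of a chromosome in each system.
print(half_open_to_indexed(0, 3))
print(width_half_open(0, 3), width_indexed(1, 3))
```

Only the start coordinate shifts during conversion. The stop coordinate keeps the same value because the half-open stop is exclusive and the indexed stop is inclusive.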


linkage

Two genes or markers that are so close together on a chromosome that they are rarely separated by recombination are said to be linked.

linkage analysis

A statistical method for detecting linkage between a disease allele and markers of known location by following their inheritance in families.

linkage disequilibrium

Linkage disequilibrium (LD) is the condition in which the haplotype frequencies in a population deviate from the values they would have if the alleles at each locus were combined at random. LD between two loci often indicates that they are physically close to each other on a DNA strand.
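One way to quantify this deviation for two biallelic loci is with the coefficient D (observed haplotype frequency minus the frequency expected under random combination) and the squared correlation r^2. These are standard measures rather than definitions taken from this glossary, and the function name and haplotype frequencies below are hypothetical.

```python
def linkage_disequilibrium(hap_freqs):
    """Return (D, r_squared) for two biallelic loci.

    hap_freqs maps the four haplotypes "AB", "Ab", "aB", "ab" to their
    population frequencies, which are expected to sum to 1. Both loci must
    be polymorphic, otherwise r_squared is undefined (division by zero).
    """
    p_a = hap_freqs["AB"] + hap_freqs["Ab"]  # frequency of allele A
    p_b = hap_freqs["AB"] + hap_freqs["aB"]  # frequency of allele B
    d = hap_freqs["AB"] - p_a * p_b          # deviation from random combination
    r_squared = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, r_squared

# Hypothetical haplotype frequencies, chosen only for illustration.
observed = {"AB": 0.5, "Ab": 0.1, "aB": 0.1, "ab": 0.3}
d, r_squared = linkage_disequilibrium(observed)
print(d, r_squared)
```

When the haplotype frequencies equal the products of the allele frequencies, D is zero and the loci are said to be in linkage equilibrium.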

marker gene

A detectable genetic trait or segment of DNA that can be identified and tracked. A marker gene can serve as a flag for another gene, sometimes called the target gene. A marker gene must be on the same chromosome as the target gene and near enough to it so that the two genes (the marker gene and the target gene) are genetically linked and are usually inherited together.

minor allele frequency (MAF)

The frequency of a SNP’s less frequent allele in a given population.

Monte-Carlo simulation

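A simulation in which random sampling is repeated many times to estimate the likelihood of various outcomes. (See simulation)

multivariate analysis

Statistical, mathematical or graphical technique which considers multiple variables simultaneously.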


nucleoside

A nucleobase (nitrogenous base) bound to a five-carbon sugar (either ribose or 2’-deoxyribose). A nucleoside binds with one or more phosphate groups to form a nucleotide.

null hypothesis

This is usually a statement of “no effect”, that is to say that the independent variable will not have any effect on the dependent variable and that any differences between the experimental and control groups are attributable to chance. The null hypothesis is usually represented by the symbol H_0 and is stated in order that it can be rejected as an explanation for the results of the experiment. For example, in a clinical trial of a new drug, the null hypothesis might be that the new drug is no better, on average, than a placebo. We would write H_0: there is no difference between the new drug and a placebo on average.


odds

The ratio of the number of people who experience an event to the number of people who do not.

odds ratio

The odds ratio is a way of comparing whether the probability of a certain event is the same for two groups. An odds ratio of 1 implies that the event is equally likely in both groups. An odds ratio greater than 1 implies that the event is more likely in the first group, and an odds ratio less than 1 implies that it is less likely in the first group.

For instance, the odds ratio may describe the odds of an experimental patient suffering an adverse event relative to a control patient. Or, it may describe the ratio of the odds of having the target disorder in the experimental group relative to the odds in favor of having the target disorder in the control group. Or, it may describe the odds in favor of being exposed in subjects with the target disorder divided by the odds in favor of being exposed in control subjects (without the target disorder).
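A minimal sketch of the calculation, using the definition of odds as events divided by non-events. The function names and the counts are hypothetical and serve only as an illustration.

```python
def odds(events, non_events):
    """Odds of an event: number with the event over number without it."""
    return events / non_events

def odds_ratio(events_1, non_events_1, events_2, non_events_2):
    """Odds of the event in group 1 divided by the odds in group 2."""
    return odds(events_1, non_events_1) / odds(events_2, non_events_2)

# Hypothetical counts: 20 of 100 experimental patients and 10 of 100
# control patients suffer the adverse event.
ratio = odds_ratio(20, 80, 10, 90)
print(ratio)
```

Here the odds are 20/80 = 0.25 in the experimental group and 10/90 (about 0.11) in the control group, giving an odds ratio of 2.25, so the event is more likely in the experimental group.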


p-value

A measure of how much evidence there is against the null hypothesis: the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one obtained. The smaller the p-value, the more evidence exists against H_0, while a large p-value means little or no evidence against it. Traditionally, researchers reject the null hypothesis if the p-value is less than 0.05.

pedigree files

Pedigree files contain information about family relationships, gender and genetic data.

With minor variations, the pre-MAKEPED format for the LINKAGE program is the de-facto standard for pedigree files. This format contains fields for pedigree number, individual ID, father’s ID, mother’s ID, sex, disease status, and the first and second alleles of each of the markers.
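The field layout described above can be illustrated with a short parser. This is a sketch under stated assumptions: fields are taken to be separated by whitespace, all values are kept as raw strings without interpreting any coding scheme, and the names PedigreeRecord and parse_pedigree_line as well as the sample line are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class PedigreeRecord:
    pedigree: str      # pedigree number
    individual: str    # individual ID
    father: str        # father's ID
    mother: str        # mother's ID
    sex: str           # sex code, kept as written in the file
    status: str        # disease status code, kept as written in the file
    genotypes: List[Tuple[str, str]]  # (first allele, second allele) per marker

def parse_pedigree_line(line):
    """Split one whitespace-delimited pedigree line into a PedigreeRecord."""
    fields = line.split()
    if len(fields) < 6 or (len(fields) - 6) % 2 != 0:
        raise ValueError("expected 6 fixed fields followed by allele pairs")
    alleles = fields[6:]
    # Group the remaining fields two at a time: one pair per marker.
    genotypes = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return PedigreeRecord(*fields[:6], genotypes)

# Hypothetical line: pedigree 1, individual 3, parents 1 and 2, two markers.
record = parse_pedigree_line("1 3 1 2 1 2 1 2 2 2")
print(record)
```

Real files may differ in delimiters, missing-value codes and extra columns, so a production reader would need to follow the documentation of the specific program being used.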

phenotype files

Phenotype files contain information about the individual phenotype values such as height, weight, body mass index (BMI), whether the individual has the disease being studied, severity, etc.

There are many different formats for phenotype files. However, they typically identify the pedigree ID and individual ID so that phenotype and pedigree information may be matched.

power (statistical)

Statistical power is the probability you will detect a meaningful difference, or effect, given that a true difference exists. Ideally, studies should have power levels of 0.80 or higher, an 80% chance or greater of finding an effect if one was really there.

Alternative definition 1: A gauge of the sensitivity of a statistical test, that is, its ability to detect relationships. Specifically, the probability of correctly rejecting a null hypothesis. In general, the statistical power increases with your sample size. Also called the “Power” of a test.

Alternative definition 2: The power of a statistical test is the probability that the test will reject a false null hypothesis, or in other words that it will not make a Type II error. The higher the power, the greater the chance of obtaining a statistically significant result when the null hypothesis is false.
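To make the dependence on sample size concrete, here is a sketch for one simple case: a one-sided z-test of a mean with known standard deviation. The choice of test, the function name z_test_power and the numbers used are assumptions for illustration only; other tests need their own power calculations.

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of a one-sided z-test of H_0: mean = 0 against mean = effect > 0.

    sigma is the known standard deviation, n the sample size and alpha the
    significance level.
    """
    z = NormalDist()                   # standard normal distribution
    critical = z.inv_cdf(1 - alpha)    # rejection threshold for the z statistic
    shift = effect * sqrt(n) / sigma   # mean of the z statistic when H_0 is false
    return 1 - z.cdf(critical - shift)

# Hypothetical effect of half a standard deviation at several sample sizes.
for n in (10, 25, 40):
    print(n, z_test_power(0.5, 1.0, n))
```

With these numbers the power is roughly 0.80 at n = 25 and rises as n grows. When the true effect is zero, the rejection probability equals the significance level alpha.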

predictor variables

A term synonymous with covariate, explanatory variable and independent variable. Variables or factors that are assumed to have an effect or influence on the selected phenotypes. E.g. height, weight, sex, age. However, they are not necessarily the variables of primary interest. (See covariate)


prevalence

Prevalence is the total number of cases of a disease in a given population at a specific time, or the percentage of the population estimated to have that particular disease. “Population”, as used as a denominator, is generally the projected population calculated from the given model.

\text{Prevalence} = \frac{\text{Number of cases of a disease present in a population at a specific time}}{\text{Number of individuals in the population at the specific time}}


proband

The family member through whom a family’s medical history comes to light. For example, a proband might be a baby with Down syndrome. The proband may also be called the index case, propositus (if male), or proposita (if female).

significance level

The significance level of a test is the probability that the test will reject the null hypothesis when the null hypothesis is true.


simulation

The use of a mathematical model to recreate a situation, often repeatedly, so that the likelihood of various outcomes can be more accurately estimated.
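A minimal sketch of the idea, using a hypothetical scenario: two carrier parents (genotype Aa) have four children, each of whom is aa with probability 1/4, and repeated random trials estimate the chance that at least one child is aa. The function name, the scenario and the fixed random seed are assumptions made for this illustration.

```python
import random

def simulate_at_least_one_affected(n_children=4, n_trials=20000, seed=0):
    """Estimate P(at least one aa child) for Aa x Aa parents by simulation."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(n_trials):
        # Each child independently inherits aa with probability 1/4.
        if any(rng.random() < 0.25 for _ in range(n_children)):
            hits += 1
    return hits / n_trials

estimate = simulate_at_least_one_affected()
exact = 1 - 0.75 ** 4  # closed-form answer, available in this simple case
print(estimate, exact)
```

In this case the exact answer (about 0.68) is known, so the simulation can be checked against it. Simulation is most useful when no such closed-form answer is available.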

SNP analysis

Single nucleotide polymorphisms (SNPs) are DNA sequence variations that occur when a single nucleotide (A, T, C, or G) in the genome sequence is changed. This occurs approximately once every 100 to 300 bases. There are many techniques for SNP detection and genotyping, such as restriction fragment length polymorphism PCR (RFLP-PCR), SSCP, allele specific hybridization, primer extension, allele specific oligonucleotide ligation, and sequencing.


telomere

The ends of a chromatid, composed of repetitive DNA that protects the bulk of the information contained in the chromatid during replication.

test statistic

A test statistic is a quantity calculated from a sample of data. Its value is used to decide whether or not the null hypothesis should be rejected in a hypothesis test. The choice of a test statistic will depend on the assumed probability model and the hypothesis under question.


transcription

Process by which messenger RNA is created from DNA.


translation

Process by which messenger RNA is decoded into a specific amino acid chain.