3.9. Quantile Normalization of Affymetrix CEL Files

As SVS imports Affymetrix CEL files, it creates quantile normalized log2 ratios. These make it possible to find regions of copy number variation more accurately and to run more accurate association tests on the CNVs.

The process to generate normalized log2 ratios closely follows the methodology employed by Affymetrix [Affymetrix2007], scaled to handle thousands of cases and controls. The process is as follows:

  1. Depending on the mapping array type, there are anywhere from 1 to 40 probes used to interrogate a given genotype. The Affymetrix 500k mapping array provides perfect match and mismatch probes, whereas the Affy 5.0 and 6.0 chips only include perfect match probes. We use only the means of the perfect match probes.

    • For Affy 500k, the NSP and STY A and B probe intensities are extracted separately. For a given marker (SNP or CNV) there may be anywhere from 1 to 40 probes. These are averaged to get a probe intensity per marker.

    • For Affy 5.0 and 6.0, the polymorphic SNP probes and the non-polymorphic copy number probes are extracted separately. The non-polymorphic probes only have an A intensity.
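The averaging in step 1 can be sketched as follows. This is a minimal illustration, not the SVS implementation: the function name `mean_intensity_per_marker` and the flat "one marker label per probe" layout are assumptions made for the example, and the mismatch probes are assumed to have been dropped already.

```python
import numpy as np

def mean_intensity_per_marker(marker_ids, intensities):
    """Average the perfect match probe intensities that share a marker.

    marker_ids  : 1-D sequence of marker labels, one per probe (hypothetical layout)
    intensities : 1-D sequence of probe intensities, same length
    Returns (markers, means), two arrays aligned with each other.
    """
    marker_ids = np.asarray(marker_ids)
    intensities = np.asarray(intensities, dtype=float)
    # inverse[k] is the position of probe k's marker in `markers`.
    markers, inverse = np.unique(marker_ids, return_inverse=True)
    sums = np.bincount(inverse, weights=intensities)
    counts = np.bincount(inverse)
    return markers, sums / counts

markers, means = mean_intensity_per_marker(
    ["snp1", "snp1", "snp2"], [100.0, 200.0, 50.0]
)
```

For an array with A and B probes, the same function would be called once for the A intensities and once for the B intensities.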

  2. The A and B probe intensities are quantile normalized per sample using the approach of [Bolstad2001] and [Bolstad2003]. 500k NSP and STY samples are quantile normalized separately, as are the Affy 5.0 and 6.0 polymorphic and non-polymorphic probes. The process is as follows:

    • For each of the autosomal A and B probes: sort the intensities for each sample in ascending order.

    • Replace the smallest value in each sample with the mean of the smallest values, the second smallest value in each sample with the mean of the second smallest, and so on for the entire set of probes.

    • Reorder the modified A and B intensities into their original order for all samples.

    • Calculate modified intensities for the non-autosomal (sex) probes by finding the closest autosomal intensity value and substituting its corresponding quantile-normalized intensity.
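The four bullets of step 2 can be sketched in NumPy as below. This is an illustrative sketch under stated assumptions, not the SVS code: the name `quantile_normalize`, the probes-by-samples matrix layout, and the handling of ties (tied values simply receive adjacent rank means) are all choices made for the example. It also assumes at least two autosomal probes and that the sex probes are matched against the autosomal intensities of the same sample.

```python
import numpy as np

def quantile_normalize(autosomal, sex=None):
    """Quantile normalize one probe set (for example, all A intensities).

    autosomal : (n_probes, n_samples) array of autosomal intensities
    sex       : optional (n_sex_probes, n_samples) array of sex probe intensities
    Returns (normalized_autosomal, normalized_sex_or_None).
    """
    autosomal = np.asarray(autosomal, dtype=float)
    # Sort each sample (column) in ascending order.
    order = np.argsort(autosomal, axis=0, kind="stable")
    sorted_vals = np.take_along_axis(autosomal, order, axis=0)
    # Mean of the k-th smallest values across samples.
    target = sorted_vals.mean(axis=1)
    # Write the rank means back in each sample's original probe order.
    normalized = np.empty_like(autosomal)
    np.put_along_axis(normalized, order, target[:, None], axis=0)
    if sex is None:
        return normalized, None

    sex = np.asarray(sex, dtype=float)
    sex_norm = np.empty_like(sex)
    n = sorted_vals.shape[0]
    for j in range(sex.shape[1]):
        col = sorted_vals[:, j]
        # Neighbouring sorted autosomal values around each sex intensity.
        hi = np.clip(np.searchsorted(col, sex[:, j]), 1, n - 1)
        lo = hi - 1
        # Keep whichever neighbour is closer in raw intensity.
        nearest = np.where(sex[:, j] - col[lo] <= col[hi] - sex[:, j], lo, hi)
        # Substitute that neighbour's quantile normalized value.
        sex_norm[:, j] = target[nearest]
    return normalized, sex_norm

raw = np.array([[5.0, 4.0], [2.0, 1.0], [3.0, 6.0]])  # 3 probes, 2 samples
norm, sex_norm = quantile_normalize(raw, sex=np.array([[2.9, 10.0]]))
```

After normalization every sample has the same set of autosomal values (the rank means), which is what makes intensities comparable across samples in the next step.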

  3. Calculate a reference distribution and calculate log2 ratios:

    • Select a set of samples for the reference distribution, for instance the controls.

    • For each polymorphic probe, i, calculate the median of the quantile normalized A intensities and the median of the quantile normalized B intensities, A_{i,med} and B_{i,med}, over the reference samples. Then for a given pair of probe intensities A_i and B_i, the normalized copy number signal is the log2 ratio of the sum of the A_i and B_i probe intensities to the sum of the median A and B probe intensities:

      \text{Log2 ratio} = \log_2\left(\frac{A_i + B_i}{A_{i,med} + B_{i,med}}\right)

    • For non-polymorphic probes, there is no B intensity, but the analogous normalization is performed:

      \text{Log2 ratio} = \log_2\left(\frac{A_i}{A_{i,med}}\right)
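Both formulas in step 3 can be computed with one small function. The sketch below is an assumption-laden illustration rather than the SVS implementation: the name `log2_ratios`, the probes-by-samples layout, and the `reference` argument (column indices or a boolean mask selecting the reference samples) are hypothetical.

```python
import numpy as np

def log2_ratios(a, b=None, reference=None):
    """Log2 ratios against per-probe medians of the reference samples.

    a, b      : (n_probes, n_samples) quantile normalized intensities;
                pass b=None for non-polymorphic probes
    reference : column indices or boolean mask of reference samples
                (all samples if None)
    """
    a = np.asarray(a, dtype=float)
    if reference is None:
        reference = slice(None)
    # Per-probe median over the reference samples, kept as a column.
    a_med = np.median(a[:, reference], axis=1, keepdims=True)
    if b is None:
        return np.log2(a / a_med)
    b = np.asarray(b, dtype=float)
    b_med = np.median(b[:, reference], axis=1, keepdims=True)
    return np.log2((a + b) / (a_med + b_med))

a = np.array([[100.0, 200.0, 400.0]])  # one probe, three samples
b = np.array([[100.0, 200.0, 400.0]])
poly = log2_ratios(a, b, reference=[0, 1, 2])
nonpoly = log2_ratios(a, reference=[0, 1, 2])
```

In this toy input the median sample has a log2 ratio of 0, and the samples with half and double the median signal have log2 ratios of -1 and 1.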

  4. Join together distributions:

    • Recall that for the 500k array, the NSP and STY log2 ratios are calculated separately. Also, for arrays containing non-polymorphic copy number probes, these must be joined with the polymorphic probes. We join different arrays together by the “virtual array generation” procedure outlined in section 7 of [Affymetrix2007].

    • The Affymetrix procedure defines a log2 ratio range that determines which markers are copy-neutral. Rather than doing this, we sample from the middle 1/3 of the distribution of autosomal log2 ratios. We otherwise follow the same procedure: the copy number log2 ratios are centered about zero by subtracting the mean of the copy-neutral markers, and the distributions are scaled by their respective signal-to-noise ratios.
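The centering part of step 4 can be sketched as follows. Several things here are assumptions made for illustration: the function name `center_on_copy_neutral` is hypothetical, the sketch uses every marker between the 1/3 and 2/3 quantiles instead of drawing a sample from that range, and it is applied to one sample and one probe set at a time. The signal-to-noise scaling is deliberately left out, since its definition comes from [Affymetrix2007] and is not spelled out in this section.

```python
import numpy as np

def center_on_copy_neutral(log2r):
    """Center autosomal log2 ratios on the mean of the middle third.

    log2r : 1-D array of autosomal log2 ratios for one sample
            and one probe set (for example, NSP only)
    """
    log2r = np.asarray(log2r, dtype=float)
    # Treat markers in the middle third of the distribution as copy-neutral.
    lo, hi = np.quantile(log2r, [1 / 3, 2 / 3])
    neutral = log2r[(log2r >= lo) & (log2r <= hi)]
    # Shift so the copy-neutral markers average to zero.
    return log2r - neutral.mean()

x = np.array([-2.0, -1.0, 0.1, 0.2, 0.3, 1.0, 5.0, 6.0, 7.0])
centered = center_on_copy_neutral(x)
```

Applying the same centering to each probe set before joining them puts their copy-neutral markers on a common zero baseline.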