# 2.20. Numeric Data Quality Assessment

To ensure numeric data is of the highest quality, SVS provides a variety of features that not only help assess the quality of numeric data (or CNV data specifically), but remedy any problems as well.

The following data quality tools are available from the Numeric Menu (see Numeric Menu and CNV QA submenu):

• Numeric Principal Component Analysis

Adjust for batch effects or population stratification on log2 ratio data or other numeric data. See Numeric Principal Component Analysis for more information.

• Derivative Log Ratio Spread

The derivative log ratio spread (DLRS) is a measurement of point-to-point consistency or noisiness in log ratio data. Samples with higher values of DLRS tend to have poor signal-to-noise properties. See Derivative Log Ratio Spread for more information.

• Percentile Based Winsorizing

Calculates thresholds for the top and bottom percentiles of log ratio data, as specified by the user, for the purpose of winsorizing - replacing extreme log ratio values with the calculated thresholds. Winsorizing data prevents segmentation algorithms from being driven by outlier values and results in a more accurate determination of regions of copy number variation. See Percentile Based Winsorizing for more information.

• Wave Detection/Correction

Detect and optionally correct the genomic wave phenomenon described by Diskin et al. See Wave Detection and Correction for more information.

• Statistics (per Column)

This function calculates and/or reports the following approximate values for each real-valued, integer-valued and (optionally) binary active column: Lower Outlier Threshold = Q1 - M*IQR, Minimum, Q1 (first quartile), Median, Mean, Q3 (third quartile), Maximum, Upper Outlier Threshold = Q3 + M*IQR, Interquartile Range (IQR), Variance and Standard Deviation. M is a user-defined multiplier used to define the outlier thresholds based on the IQR.

• Statistics (per Row)

This function calculates and/or reports the following approximate values for each row using data from only real-valued, integer-valued and (optionally) binary active columns: Lower Outlier Threshold = Q1 - M*IQR, Minimum, Q1 (first quartile), Median, Mean, Q3 (third quartile), Maximum, Upper Outlier Threshold = Q3 + M*IQR, Interquartile Range (IQR), Variance and Standard Deviation. M is a user-defined multiplier used to define the outlier thresholds based on the IQR.

• Multidimensional Outlier Detection

This function determines outliers based on user-specified columns. A distance score is computed by summing the squared distances from the median in each column, then taking the square root of the sum. The sample is considered an outlier if its distance score is greater than a threshold, based on a user-specified multiplier and the quartiles of each column.

• Matched Pairs T-Test

A Matched Pairs T-Test can be run on two dependent samples in the same spreadsheet with a grouping variable to distinguish them. For each numeric (integer or real) column, output will include the T-Statistic, the Mean of the difference, P-Value, -log10(P-Value), Bonferroni P-Value, FDR P-Value, Degrees of Freedom, Standard Error, and the lower and upper confidence bounds of the mean.

## 2.20.2. Derivative Log Ratio Spread

The derivative log ratio spread (DLRS) is a measurement of point-to-point consistency or noisiness in log ratio data. Samples with higher values of DLRS tend to have poor signal-to-noise properties. DLRS was originally developed for use in aCGH analysis. The measurement is based on absolute differences in LR values at consecutive points, rather than deviations from a baseline value. This property makes DLRS robust against signals from true copy number variants, because only the first and last markers in each segment (rather than all markers in a segment) will have a large deviation from normal values.

To calculate the DLRS, open a spreadsheet containing marker-mapped log ratio data and choose Numeric > CNV QA > Derivative Log Ratio Spread.

The spreadsheet output contains DLRS values for each sample, per chromosome and overall, as well as the median DLRS value per chromosome.
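The idea of measuring spread from consecutive differences can be sketched as follows. This is a minimal illustration, not the SVS implementation: the exact formula SVS uses is not given here, so the sketch assumes one common formulation, in which the interquartile range of the point-to-point differences is rescaled to a standard deviation. The `quantile` and `dlrs` helpers and the toy data are hypothetical.

```python
import math

def quantile(ordered, q):
    """Linearly interpolated quantile of an already sorted list."""
    pos = q * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def dlrs(log_ratios):
    """Robust spread of the differences between consecutive log ratios."""
    diffs = sorted(b - a for a, b in zip(log_ratios, log_ratios[1:]))
    iqr = quantile(diffs, 0.75) - quantile(diffs, 0.25)
    # 1.349 converts an IQR to a standard deviation under normality;
    # sqrt(2) accounts for each difference combining two noisy points.
    return iqr / (1.349 * math.sqrt(2))

# A noise-free sample with one copy number gain: only the two segment
# boundaries produce large differences, so the spread stays at zero.
clean = [0.0] * 40 + [0.6] * 20 + [0.0] * 40

# Alternating values mimic a noisy sample with no copy number change.
noisy = [0.2 if i % 2 else -0.2 for i in range(100)]

print(dlrs(clean))
print(dlrs(noisy))
```

The first sample contains a real segment yet scores lower than the second, which illustrates why a difference-based measure is robust against true copy number variants.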

## 2.20.3. Percentile Based Winsorizing

Calculates thresholds for the top and bottom percentiles of log ratio data, as specified by the user, for the purpose of winsorizing - replacing extreme log ratio values with the calculated thresholds. Winsorizing data prevents segmentation algorithms from being driven by outlier values and results in a more accurate determination of regions of copy number variation.

For autosomes, the median threshold is used to winsorize the data. Values less than the lower threshold are replaced with the lower threshold, and values that are greater than the upper threshold are replaced with the upper threshold. For non-autosomes, the thresholds for each chromosome are used.

To use, open a marker-mapped spreadsheet containing log ratio data, with samples as columns. Select Numeric > CNV QA > Percentile Based Winsorizing and enter percentile thresholds to be used for winsorizing in the window, or leave the defaults of 0.002 and 0.998.

The spreadsheet output will contain the same information, with the extreme values winsorized.
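The replacement step can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, the percentile interpolation rule is an assumption, and the per-chromosome handling described above (median threshold for autosomes, separate thresholds for non-autosomes) is not modeled.

```python
import math

def percentile_thresholds(values, lower_pct=0.002, upper_pct=0.998):
    """Return the (lower, upper) percentile values by linear interpolation."""
    ordered = sorted(values)

    def at(q):
        pos = q * (len(ordered) - 1)
        lo, hi = math.floor(pos), math.ceil(pos)
        return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

    return at(lower_pct), at(upper_pct)

def winsorize(values, lower, upper):
    """Replace values outside [lower, upper] with the nearest threshold."""
    return [min(max(v, lower), upper) for v in values]

log_ratios = [-3.0, -0.2, -0.1, 0.0, 0.05, 0.1, 0.2, 4.0]

# Wide percentiles are used here only because the toy list is tiny;
# the defaults mirror the 0.002 and 0.998 values mentioned above.
low, high = percentile_thresholds(log_ratios, 0.2, 0.8)
winsorized = winsorize(log_ratios, low, high)
print(low, high)
print(winsorized)
```

Values between the two thresholds pass through unchanged; only the extremes are pulled in.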

Note

1. This takes about 56 minutes to process 3500 samples by 500k markers on a 32-bit Windows Dual-Core 2.33 GHz computer.

2. The resulting spreadsheet can be used for plotting individual samples. It must be transposed before it can be used for analysis, in particular Copy Number Segmentation.

## 2.20.4. Wave Detection and Correction

Some samples in copy number datasets may suffer from genomic waves, which cause the Log R Ratios to drift up or down in a wave-like fashion. SVS uses the method described in [Diskin2008] to detect and partially correct for this phenomenon.

### Wave Detection

The Wave Factor score is a metric for evaluating the severity of this phenomenon in a sample. SVS Wave Detection computes the absolute value of the Wave Factor for each sample. The GC content in the region around each marker is thought to contribute to the genomic wave phenomenon, so SVS also computes the correlation coefficient between the Log R Values and the GC content around each marker. After performing wave detection, a spreadsheet will be created containing the Abs. Wave Factor and GC Correlation for every sample.

### Wave Correction

Optionally, SVS can use the GC correlation to correct for the waviness contributed by GC content. This will produce a new spreadsheet with the same row and column labels as the input that contains corrected log ratios for every sample.
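The relationship between GC correlation and correction can be illustrated with a small sketch. It assumes a simple linear model in which the log ratio is regressed on GC content and the fitted trend is subtracted; the details of what SVS does internally are not documented here, and the function names and toy data are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def gc_correct(log_ratios, gc):
    """Subtract the least-squares linear fit of log ratio on GC content."""
    n = len(gc)
    mg, ml = sum(gc) / n, sum(log_ratios) / n
    slope = (sum((g - mg) * (l - ml) for g, l in zip(gc, log_ratios))
             / sum((g - mg) ** 2 for g in gc))
    intercept = ml - slope * mg
    return [l - (intercept + slope * g) for g, l in zip(gc, log_ratios)]

# Toy data: GC fraction near each marker, and log ratios that drift with it.
gc = [0.35, 0.40, 0.45, 0.50, 0.55, 0.60]
noise = [0.01, -0.02, 0.015, -0.01, 0.02, -0.015]
log_ratios = [0.5 * g - 0.2 + e for g, e in zip(gc, noise)]

corrected = gc_correct(log_ratios, gc)
print(pearson(gc, log_ratios))  # strong GC correlation before correction
print(pearson(gc, corrected))   # essentially zero after correction
```

The corrected list has the same length and ordering as the input, matching the description of an output spreadsheet with the same row and column labels.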

### Options

The following options are available (see Wave Detection and Correction Dialog Window).

Min Training Marker Distance

When computing the GC correlation, only a subset of available markers are used that will be spaced at least this many kilobase pairs apart. This avoids biasing the correlation towards regions with high marker density.
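One way to picture this option is a greedy pass along a chromosome that keeps a marker only when it lies far enough from the last kept marker. The selection strategy SVS actually uses is not stated here, so the greedy rule, the `thin_markers` name, and the example positions below are all assumptions for illustration.

```python
def thin_markers(positions_bp, min_distance_kb):
    """Greedily keep markers at least min_distance_kb apart on one chromosome."""
    min_bp = min_distance_kb * 1000
    kept = []
    for pos in sorted(positions_bp):
        # Keep the first marker, then only markers far enough from the last kept one.
        if not kept or pos - kept[-1] >= min_bp:
            kept.append(pos)
    return kept

# Hypothetical base-pair positions with two dense clusters.
positions = [1_000, 5_000, 900_000, 1_050_000, 2_100_000, 2_150_000]
training = thin_markers(positions, min_distance_kb=1000)
print(training)
```

Each dense cluster contributes at most one training marker, so densely covered regions cannot dominate the correlation.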

GC Reference

It is not necessary to manually calculate the GC content around each marker in your dataset. SVS will use the specified reference genome to automatically compute the GC percentage in the 1 megabase region around each marker.

Chromosome Selection

If Autosomes Only is selected, only chromosomes with a numeric name (e.g., “1”, “15”, “22”) will be used for wave detection/correction. If All Chromosomes is selected, then every active chromosome in the spreadsheet will be used.

Output

If Detect Only is selected, a new spreadsheet containing the Abs. Wave Factor and GC correlation will be created. Detect and Correct will also produce a new spreadsheet containing corrected Log R Ratios.

## 2.20.5. Statistics (per Column)

Column statistics can be calculated on all real-valued, integer-valued and (optionally) binary columns in a spreadsheet. Each of the following output statistics is optional: Minimum, Q1 (first quartile), Median, Mean, Q3 (third quartile), Maximum, Variance, Standard Deviation, and the Lower and Upper outlier thresholds, defined as Q1 - x*IQR and Q3 + x*IQR, where x is a user-defined multiplier and IQR is the Interquartile Range, IQR = Q3 - Q1.

The resulting spreadsheet will have the original active columns as row labels and the selected summary statistics in columns. If a marker map was applied to the original spreadsheet’s columns it will be reapplied to the new spreadsheet’s rows.

To calculate the Column Statistics, open a spreadsheet containing several quantitative columns. From the spreadsheet, select Numeric > Statistics (per Column). A dialog will appear (see Column Statistics Dialog Window). The dialog allows the user to specify which statistics to output. If the outlier thresholds are selected for output, a multiplier must be specified to be used in the formulas described above. Binary columns can also be included in the calculations; however, not all of the statistics may be appropriate for binary columns.

The following summary statistics may be reported for every active integer-, real-valued or binary column in the spreadsheet:

• Minimum: The minimum value found in the column. If the minimum value is less than the Lower Outlier Threshold, there are outliers present in the column.

• Maximum: The maximum value of the data. If the maximum value is more than the Upper Outlier Threshold, there are outliers present in the column.

• Q1: The first quartile is defined as the value below which 25% of the data fall. Equivalently, the first quartile could be thought of as the median of the first half of the data.

• Q3: The third quartile is defined as the value below which 75% of the data fall. Equivalently, the third quartile could be thought of as the median of the second half of the data.

• Mean: The mean or mathematical average of the data. Comparing the mean and median values of the data can provide information about the skewness or normality of the data.

• Median: The median is defined as the value below which 50% of the data fall.

• IQR: The interquartile range is defined as the first quartile subtracted from the third quartile, Q3 - Q1. The IQR is used in the outlier threshold equations and is a measure of the variability in the data.

• Sum: This is the sum total of the data.

• Variance: The variance of the data values in the columns. Also the square of the standard deviation.

• Standard Deviation: The standard deviation of the data values in the column. Also the square root of the variance.

• Outlier Thresholds: The lower threshold is defined as Q1 - x*IQR, and the upper threshold is defined as Q3 + x*IQR. The multiplier, x, is user-specified. Values that fall below the lower threshold or above the upper threshold can be identified as outliers.
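The statistics listed above can be sketched for a single column as follows. This is an illustration rather than the SVS implementation: the source describes the reported values as approximate, so the linear-interpolation quantile rule and the use of the sample (n - 1) variance are assumptions, and `column_summary` is a hypothetical name.

```python
import math
import statistics

def quantile(ordered, q):
    """Linearly interpolated quantile of an already sorted list."""
    pos = q * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def column_summary(values, multiplier=1.5):
    """Summary statistics and IQR-based outlier thresholds for one column."""
    ordered = sorted(values)
    q1, median, q3 = (quantile(ordered, q) for q in (0.25, 0.5, 0.75))
    iqr = q3 - q1
    return {
        "minimum": ordered[0],
        "q1": q1,
        "median": median,
        "mean": statistics.mean(values),
        "q3": q3,
        "maximum": ordered[-1],
        "iqr": iqr,
        "sum": sum(values),
        "variance": statistics.variance(values),  # sample variance (assumption)
        "std_dev": statistics.stdev(values),
        "lower_threshold": q1 - multiplier * iqr,
        "upper_threshold": q3 + multiplier * iqr,
    }

column = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
summary = column_summary(column, multiplier=1.5)
for name, value in summary.items():
    print(f"{name}: {value}")
```

In this toy column the maximum exceeds the upper threshold, so the column contains at least one outlier, while the minimum stays above the lower threshold.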

## 2.20.6. Statistics (per Row)

Row statistics can be calculated for every row over all real-valued, integer-valued and (optionally) binary columns in a spreadsheet. Each of the following output statistics is optional: Minimum, Q1 (first quartile), Median, Mean, Q3 (third quartile), Maximum, Variance, Standard Deviation, and the Lower and Upper outlier thresholds, defined as Q1 - x*IQR and Q3 + x*IQR, where x is a user-defined multiplier and IQR is the Interquartile Range, IQR = Q3 - Q1. Statistics can also be calculated by chromosome.

To calculate Row Statistics, open a spreadsheet containing several quantitative columns. From the spreadsheet, select Numeric > Statistics (per Row). A dialog will appear. The dialog allows the user to specify which statistics to output. If the outlier thresholds are selected for output, a multiplier must be specified to be used in the formulas described above. Binary columns can also be included in the calculations; however, not all of the statistics may be appropriate for binary columns.

The following summary statistics may be reported for every active row in the spreadsheet:

• Minimum: The minimum value found in the row. If the minimum value is less than the Lower Outlier Threshold, there are outliers present in the row.

• Maximum: The maximum value of the data. If the maximum value is more than the Upper Outlier Threshold, there are outliers present in the row.

• Q1: The first quartile is defined as the value below which 25% of the data fall. Equivalently, the first quartile could be thought of as the median of the first half of the data.

• Q3: The third quartile is defined as the value below which 75% of the data fall. Equivalently, the third quartile could be thought of as the median of the second half of the data.

• Mean: The mean or mathematical average of the data. Comparing the mean and median values of the data can provide information about the skewness or normality of the data.

• Median: The median is defined as the value below which 50% of the data fall.

• IQR: The interquartile range is defined as the first quartile subtracted from the third quartile, Q3 - Q1. The IQR is used in the outlier threshold equations and is a measure of the variability in the data.

• Sum: This is the sum total of the data.

• Variance: The variance of the data values in the rows. Also the square of the standard deviation.

• Standard Deviation: The standard deviation of the data values in the row. Also the square root of the variance.

• Outlier Thresholds: The lower threshold is defined as Q1 - x*IQR, and the upper threshold is defined as Q3 + x*IQR. The multiplier, x, is user-specified. Values that fall below the lower threshold or above the upper threshold can be identified as outliers.

If the spreadsheet has a column-oriented marker map, the user can optionally choose to calculate the statistics per chromosome. If this is desired, the results can be in the format of one spreadsheet per chromosome, one spreadsheet per statistic or all statistics and chromosomes in one spreadsheet.
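The per-chromosome option amounts to grouping a row's values by the chromosome of each column before summarizing. The sketch below illustrates that grouping for one row and a few statistics; the function name, the list-based stand-in for a marker map, and the toy values are hypothetical.

```python
import statistics
from collections import defaultdict

def row_stats_by_chromosome(row, column_chromosomes):
    """Summarize one row separately for each chromosome in the marker map."""
    groups = defaultdict(list)
    for value, chrom in zip(row, column_chromosomes):
        groups[chrom].append(value)
    return {
        chrom: {
            "minimum": min(vals),
            "median": statistics.median(vals),
            "mean": statistics.mean(vals),
            "maximum": max(vals),
        }
        for chrom, vals in groups.items()
    }

# One sample's values, with the chromosome of each column's marker.
row = [0.1, 0.3, -0.2, 0.0, 0.4]
column_chromosomes = ["1", "1", "2", "2", "2"]

stats = row_stats_by_chromosome(row, column_chromosomes)
for chrom, values in stats.items():
    print(chrom, values)
```

The nested result could be laid out as one table per chromosome, one per statistic, or a single combined table, mirroring the three output formats described above.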

## 2.20.7. Multidimensional Outlier Detection

A median centroid vector is calculated as [median(column1), median(column2), … , median(columnN)] based on N columns (dimensions) specified by the user. A distance score is then calculated for each sample or row as follows:

$$d = \sqrt{\sum_{j=1}^{N} (x_j - m_j)^2}$$

where N is the number of dimensions (columns) included in the calculation, $x_j$ is the sample's value in column j, and $m_j$ is the j-th value of the median centroid vector. The outlier threshold is calculated from Q3 and IQR, the third quartile and interquartile range of each column (1…N), and M, a user-specified multiplier.
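The distance score can be sketched as a Euclidean distance from the per-column medians. In this minimal illustration the threshold is passed in as a plain number, because the threshold formula is not reproduced in the text; the function names, the sample values, and the threshold of 3.0 are hypothetical.

```python
import math
import statistics

def distance_scores(rows):
    """Euclidean distance of each row from the per-column median centroid."""
    centroid = [statistics.median(col) for col in zip(*rows)]
    return [
        math.sqrt(sum((x - m) ** 2 for x, m in zip(row, centroid)))
        for row in rows
    ]

def flag_outliers(scores, threshold):
    """Return 1 for rows whose distance score exceeds the threshold, else 0."""
    return [1 if score > threshold else 0 for score in scores]

# Five samples measured in two dimensions; the last one sits far away.
rows = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (4.0, 5.0)]

scores = distance_scores(rows)
flags = flag_outliers(scores, threshold=3.0)
print(scores)
print(flags)
```

Using medians rather than means keeps the centroid from being pulled toward the outlying sample.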

To determine outliers in N dimensions, open a spreadsheet containing several integer- or real-valued columns. From the spreadsheet, select Numeric > Multidimensional Outlier Detection. The default multiplier value is 1.5 but can be changed at the user’s specification. Click Add Columns to add integer- or real-valued columns to be included in the outlier calculation, then click OK to select the columns. Click OK to begin the calculations.

The spreadsheet output, Multidimensional Outlier Detection, will contain two columns. The first column contains the distance score for each sample, and the second column is a binary column in which a 1 indicates an outlier. The threshold is given in the second column's header in the form "Outlier threshold", e.g. "Outlier 0.28".

A common use of this function is to calculate outliers in two dimensions, then filter a scatterplot of the two columns based on outlier status. For example, one could run Principal Component Analysis, plot the first two principal components against each other, and then determine outliers in those two dimensions. After merging the resulting Multidimensional Outlier Detection spreadsheet with the principal components spreadsheet, the plot can be filtered on the binary Outlier column. Outliers fall outside an imaginary circle centered on the median centroid with a radius equal to the threshold.

## 2.20.8. Matched Pairs T-Test

A Matched Pairs T-Test can be run on two dependent samples in the same spreadsheet with a grouping variable to distinguish them. In the results spreadsheet, the following outputs will be available: the T-Statistic, the Mean of the differences, P-Value, -log10(P-Value), Bonferroni P-Value, FDR P-Value, degrees of freedom, Standard Error, the lower and upper bounds of the confidence interval around the mean, the average for each category, the difference between these averages, the average anti-log base 2 of the data for each category, the fold change of the anti-log base 2 of the data (based upon the above averages), and the log base 2 of that fold change.

To run a Matched Pairs T-Test, open a spreadsheet containing at least one categorical or binary column for the grouping. A sample or pair ID column can also be present; if it is not, the row labels will be used instead. The grouping and sample ID columns can be selected in the spreadsheet view by setting them to dependent (magenta). If the column headers can be identified, the dialog will populate the group and sample ID selections with the correct fields. These can also be selected manually in the dialog.

From the spreadsheet, select Numeric > Matched Pairs T-Test. A dialog will appear. The dialog allows the user to select the group column and, optionally, the sample or pair ID column. If no column is selected for the sample ID column, the row labels will be used. Also, the user can select whether to save the results spreadsheet under the current spreadsheet or to save it under the project root.

The following will be in the results spreadsheet after running the analysis:

• T-Statistic: The test statistic from the Student’s T distribution.

• Mean of the difference: The mean of the differences between each matched pair.

• P-Value: The p-value found from the Student’s T distribution for this T-Statistic and degrees of freedom.

• -log10(P-Value): The negative base-10 logarithm of the p-value.

• Bonferroni P-Value: The adjusted p-value using the Bonferroni correction. The number of columns that were tested successfully is counted as the number of tests for this correction.

• FDR P-Value: The adjusted FDR p-values.

• Average (Category 1): The average of all samples in category 1 for each numeric column.

• Average (Category 2): The same as above, but for the second category.

• Difference: The difference in the averages from the first and second categories.

• Average (2^Data for Category 1): The average of taking the anti-log base 2 of the data for the first category.

• Average (2^Data for Category 2): The same as above, but for the second category.

• Fold Change: The fold change of the anti-log base 2 of the data, using the averages above.

• Log2 Fold Change: The log base 2 of the above fold change.

• Degrees of Freedom: The number of samples in each test minus 1.

• Standard Error: The standard error of the mean of the difference.

• Lower and Upper Bounds: The lower and upper bounds of the 95% confidence interval around the mean.
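Several of the outputs above follow directly from the per-pair differences. The sketch below computes the mean of the difference, its standard error, the T-Statistic, and the degrees of freedom for one numeric column. It is illustrative only: the function name and data are hypothetical, the direction of the difference (category 1 minus category 2) is an assumption, and the p-value and confidence bounds are left out because they require the Student's T distribution, which the Python standard library does not provide.

```python
import math
import statistics

def matched_pairs_t(category1, category2):
    """Paired t-test quantities for values already aligned by pair ID."""
    diffs = [a - b for a, b in zip(category1, category2)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    # Standard error of the mean difference, using the sample standard deviation.
    std_err = statistics.stdev(diffs) / math.sqrt(n)
    return {
        "t_statistic": mean_diff / std_err,
        "mean_difference": mean_diff,
        "standard_error": std_err,
        "degrees_of_freedom": n - 1,
    }

# Four matched pairs for a single numeric column.
category1 = [2.0, 4.0, 6.0, 8.0]
category2 = [1.0, 2.0, 3.0, 4.0]

result = matched_pairs_t(category1, category2)
print(result)
```

If SciPy is available, `scipy.stats.ttest_rel` performs a paired t-test and returns the corresponding p-value, which could serve as a cross-check of the statistic computed here.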

If the original spreadsheet contained a column-oriented marker map, this marker map will be applied row-oriented to the results spreadsheet.