# Numeric Data Quality Assessment

SVS provides several features that help assess the quality of numeric data (CNV data in particular) and remedy some of the problems they reveal.

## Derivative Log Ratio Spread

The derivative log ratio spread (DLRS) is a measurement of point-to-point consistency, or noisiness, in log ratio data. Samples with higher DLRS values tend to have poor signal-to-noise properties. DLRS was originally developed for use in aCGH analysis. The measurement is based on absolute differences in log ratio values at consecutive points, rather than deviations from a baseline value. This property makes DLRS robust against signals from true copy number variants, because only the first and last markers in each segment (rather than all markers in a segment) will have a large deviation from normal values.

To calculate the DLRS, open a spreadsheet containing marker-mapped log
ratio data and choose **Numeric** > **CNV QA** > **Derivative Log Ratio Spread**.

The spreadsheet output contains DLRS values for each sample, per chromosome and overall, as well as the median DLRS value per chromosome.
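
The exact formula SVS uses is not given in this section. As a minimal sketch of the idea, the function below takes the median absolute difference between consecutive log ratios and rescales it to the per-marker noise level. The function name `dlrs` and the normal-distribution scaling constant are assumptions made for illustration.

```python
import math
import statistics


def dlrs(log_ratios):
    """Illustrative DLRS-style noise estimate for one sample.

    Assumption: the spread is estimated from the median absolute
    difference of consecutive markers; SVS may use a different formula.
    """
    # Absolute difference between each marker and the next one.
    abs_diffs = [abs(b - a) for a, b in zip(log_ratios, log_ratios[1:])]
    # For normally distributed noise with standard deviation s, the
    # differences have standard deviation s * sqrt(2), and the median of
    # their absolute values is about 0.6745 * s * sqrt(2).
    return statistics.median(abs_diffs) / (0.6745 * math.sqrt(2))
```

Because a copy number segment shifts many consecutive markers by the same amount, only the differences at its two boundaries are affected, and a median over all differences barely moves.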

## Percentile Based Winsorizing

This feature calculates thresholds for the top and bottom percentiles of log ratio data, as specified by the user, for the purpose of winsorizing, that is, replacing extreme log ratio values with the calculated thresholds. Winsorizing prevents segmentation algorithms from being driven by outlier values and results in a more accurate determination of regions of copy number variation.

For autosomes, the median threshold is used to winsorize the data. Values less than the lower threshold are replaced with the lower threshold, and values that are greater than the upper threshold are replaced with the upper threshold. For non-autosomes, the thresholds for each chromosome are used.
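
The replacement step can be sketched as follows. This is a simplified illustration for a single list of log ratios: it does not model the per-chromosome thresholds or the autosome median described above, and the function name `winsorize` and the linear-interpolation quantile rule are assumptions.

```python
def winsorize(values, lower=0.002, upper=0.998):
    """Clip values to the given lower and upper quantiles.

    Returns the winsorized list plus the two thresholds that were used.
    """
    ordered = sorted(values)

    def quantile(p):
        # Linear interpolation between the two nearest order statistics.
        pos = p * (len(ordered) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(ordered) - 1)
        return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])

    low_t, high_t = quantile(lower), quantile(upper)
    # Values below the lower threshold become the lower threshold, and
    # values above the upper threshold become the upper threshold.
    clipped = [min(max(v, low_t), high_t) for v in values]
    return clipped, low_t, high_t
```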

To use, open a marker-mapped spreadsheet containing log ratio data, with
samples as columns. Select **Numeric** > **CNV QA** >
**Percentile Based Winsorizing** and enter percentile
thresholds to be used for winsorizing in the window, or leave the
defaults of 0.002 and 0.998.

The spreadsheet output will contain the same information, with the extreme values winsorized.

Note

- This takes about 56 minutes to process 3500 samples by 500k markers on a 32-bit Windows Dual-Core 2.33 GHz computer.
- The resulting spreadsheet can be used for plotting individual samples. It must be transposed before it can be used for analysis, in particular Copy Number Segmentation.

## Wave Detection and Correction

Some samples in copy number datasets may suffer from genomic waves, which cause the Log R Ratios to drift up or down in a wave-like fashion. SVS uses the method described in [Diskin2008] to detect and partially correct for this phenomenon.

### Wave Detection

The Wave Factor score is a metric for evaluating the severity of this phenomenon in a sample. SVS Wave Detection computes the absolute value of the Wave Factor for each sample. The GC content in the region around each marker is thought to contribute to the genomic wave phenomenon, so SVS also computes the correlation coefficient between the Log R Ratios and the GC content around each marker. After wave detection is performed, a spreadsheet is created containing the Abs. Wave Factor and GC Correlation for every sample.
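
The GC Correlation column can be illustrated with a plain Pearson correlation between one sample's Log R Ratios and the GC content at the same markers. This sketch assumes Pearson correlation is the coefficient meant here, does not compute the Wave Factor itself, and uses a hypothetical function name.

```python
import math


def gc_correlation(log_r_ratios, gc_content):
    """Pearson correlation between Log R Ratios and per-marker GC content."""
    n = len(log_r_ratios)
    mean_lrr = sum(log_r_ratios) / n
    mean_gc = sum(gc_content) / n
    # Sums of squared and cross deviations from the means.
    cov = sum((l - mean_lrr) * (g - mean_gc)
              for l, g in zip(log_r_ratios, gc_content))
    var_lrr = sum((l - mean_lrr) ** 2 for l in log_r_ratios)
    var_gc = sum((g - mean_gc) ** 2 for g in gc_content)
    return cov / math.sqrt(var_lrr * var_gc)
```

A value near +1 or -1 suggests that the sample's log ratios track GC content closely, which is the pattern wave correction tries to remove.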

### Wave Correction

Optionally, SVS can use the GC correlation to correct for the waviness contributed by GC content. This will produce a new spreadsheet with the same row and column labels as the input that contains corrected log ratios for every sample.
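
One simple way to picture this correction is a linear regression of the log ratios on GC content, followed by subtraction of the fitted GC trend. The sketch below assumes that approach; the details of the SVS implementation are not given in this section, and the function name is hypothetical.

```python
def correct_gc_wave(log_r_ratios, gc_content):
    """Remove the linear GC-content trend from one sample's log ratios."""
    n = len(log_r_ratios)
    mean_lrr = sum(log_r_ratios) / n
    mean_gc = sum(gc_content) / n
    # Ordinary least-squares slope of log ratio on GC content.
    sxx = sum((g - mean_gc) ** 2 for g in gc_content)
    sxy = sum((g - mean_gc) * (l - mean_lrr)
              for g, l in zip(gc_content, log_r_ratios))
    slope = sxy / sxx
    # Subtract only the GC-dependent part so the sample mean is kept.
    return [l - slope * (g - mean_gc)
            for l, g in zip(log_r_ratios, gc_content)]
```

The output has one corrected value per input marker, in the same order, which matches the description of a corrected spreadsheet with the same row and column labels as the input.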

### Options

The following options are available (see *Wave Detection and Correction Dialog Window*).

- **Min Training Marker Distance** - When computing the GC correlation, only a subset of the available markers is used, spaced at least this many kilobase pairs apart. This avoids biasing the correlation towards regions with high marker density.
- **GC Reference** - It is not necessary to manually calculate the GC content around each marker in your dataset. SVS will use the specified reference genome to automatically compute the GC percentage in the 1 megabase region around each marker.
- **Chromosome Selection** - If **Autosomes Only** is selected, only chromosomes with a numeric name (e.g. “1”, “15”, “22”) will be used for wave detection/correction. If **All Chromosomes** is selected, then every active chromosome in the spreadsheet will be used.
- **Output** - If **Detect Only** is selected, a new spreadsheet containing the Abs. Wave Factor and GC correlation will be created. **Detect and Correct** will also produce a new spreadsheet containing corrected Log R Ratios.
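
The effect of **Min Training Marker Distance** can be sketched as a greedy pass along one chromosome that keeps a marker only if it lies far enough from the last marker kept. The greedy rule and the function name are assumptions for illustration; SVS may choose its subset differently.

```python
def select_training_markers(positions_bp, min_distance_kb):
    """Return indices of markers spaced at least min_distance_kb apart.

    positions_bp: base-pair positions on one chromosome, sorted ascending.
    """
    min_distance_bp = min_distance_kb * 1000
    selected = []
    last_kept = None
    for index, position in enumerate(positions_bp):
        # Keep the first marker, then any marker far enough from the
        # previously kept one.
        if last_kept is None or position - last_kept >= min_distance_bp:
            selected.append(index)
            last_kept = position
    return selected
```

Dense clusters of markers then contribute a single training point instead of many, so they no longer dominate the GC correlation.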

## Statistics (per Column)

Column statistics can be calculated on all real-valued, integer-valued and (optionally) binary columns in a spreadsheet. Each of the following output statistics is optional: Minimum, Q1 (first quartile), Median, Mean, Q3 (third quartile), Maximum, Variance, Standard Deviation, and the Lower and Upper outlier thresholds, defined as Q1 - x*IQR and Q3 + x*IQR, where x is a user-defined multiplier and IQR is the interquartile range, IQR = Q3 - Q1.

The resulting spreadsheet will have the original active columns as row labels and the selected summary statistics in columns. If a marker map was applied to the original spreadsheet’s columns it will be reapplied to the new spreadsheet’s rows.

To calculate the Column Statistics, open a spreadsheet containing
several quantitative columns. From the spreadsheet, select **Numeric** >
**Statistics (per Column)**. A dialog will appear (see
*Column Statistics Dialog Window*). The dialog allows the user to specify which statistics
to output. If the outlier thresholds are selected for output, a multiplier
must be specified to be used in the formulas described above. Binary columns can
also be included in the calculations, however not all of the statistics
may be appropriate for binary columns.

The following summary statistics may be reported for every active integer-, real-valued or binary column in the spreadsheet:

- **Minimum**: The minimum value found in the column. If the minimum value is less than the Lower Outlier Threshold, there are outliers present in the column.
- **Maximum**: The maximum value found in the column. If the maximum value is more than the Upper Outlier Threshold, there are outliers present in the column.
- **Q1**: The first quartile is defined as the value below which 25% of the data fall. Equivalently, the first quartile could be thought of as the median of the first half of the data.
- **Q3**: The third quartile is defined as the value below which 75% of the data fall. Equivalently, the third quartile could be thought of as the median of the second half of the data.
- **Mean**: The mean, or mathematical average, of the data. Comparing the mean and median values of the data can provide information about the skewness or normality of the data.
- **Median**: The median is defined as the value below which 50% of the data fall.
- **IQR**: The interquartile range is defined as the first quartile subtracted from the third quartile, Q3 - Q1. The IQR is used in the outlier threshold equations and is a measure of the variability in the data.
- **Sum**: The sum total of the data.
- **Variance**: The variance of the data values in the column. This is the square of the standard deviation.
- **Standard Deviation**: The standard deviation of the data values in the column. This is the square root of the variance.
- **Outlier Thresholds**: The lower threshold is defined as Q1 - x*IQR, and the upper threshold is defined as Q3 + x*IQR. The multiplier, x, is user-specified. These thresholds can be used to identify outliers that fall below the lower threshold or above the upper threshold.
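
The definitions above can be checked on a small column with Python's standard `statistics` module. This is a sketch only: quartile conventions differ between tools, and the interpolation method and the sample (n - 1) variance used here are assumptions that may not match SVS exactly.

```python
import statistics


def column_statistics(values, multiplier=1.5):
    """Summary statistics for one numeric column (illustrative only)."""
    # "inclusive" treats the data as the whole population when
    # interpolating quartiles; other conventions give slightly
    # different Q1 and Q3 values.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "Minimum": min(values),
        "Q1": q1,
        "Median": median,
        "Mean": statistics.mean(values),
        "Q3": q3,
        "Maximum": max(values),
        "IQR": iqr,
        "Sum": sum(values),
        "Variance": statistics.variance(values),  # sample variance
        "Standard Deviation": statistics.stdev(values),
        "Lower Outlier Threshold": q1 - multiplier * iqr,
        "Upper Outlier Threshold": q3 + multiplier * iqr,
    }
```

For the column `[1, 2, 3, 4, 5, 6, 7, 8, 100]` with a multiplier of 1.5, this gives Q1 = 3, Q3 = 7 and IQR = 4, so the thresholds are -3 and 13, and the maximum of 100 signals that outliers are present.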

## Statistics (per Row)

Row statistics can be calculated for every row over all real-valued, integer-valued and (optionally) binary columns in a spreadsheet. Each of the following output statistics is optional: Minimum, Q1 (first quartile), Median, Mean, Q3 (third quartile), Maximum, Variance, Standard Deviation, and the Lower and Upper outlier thresholds, defined as Q1 - x*IQR and Q3 + x*IQR, where x is a user-defined multiplier and IQR is the interquartile range, IQR = Q3 - Q1. Statistics can also be calculated by chromosome.

To calculate Row Statistics, open a spreadsheet containing several quantitative
columns. From the spreadsheet, select **Numeric > Statistics (per Row)**. A
dialog will appear. The dialog allows the user to specify which statistics to
output. If the outlier thresholds are selected for output, a multiplier must be
specified to be used in the formulas described above. Binary columns can also be
included in the calculations, however not all of the statistics may be appropriate
for binary columns.

The following summary statistics may be reported for every active row in the spreadsheet:

- **Minimum**: The minimum value found in the row. If the minimum value is less than the Lower Outlier Threshold, there are outliers present in the row.
- **Maximum**: The maximum value found in the row. If the maximum value is more than the Upper Outlier Threshold, there are outliers present in the row.
- **Q1**: The first quartile is defined as the value below which 25% of the data fall. Equivalently, the first quartile could be thought of as the median of the first half of the data.
- **Q3**: The third quartile is defined as the value below which 75% of the data fall. Equivalently, the third quartile could be thought of as the median of the second half of the data.
- **Mean**: The mean, or mathematical average, of the data. Comparing the mean and median values of the data can provide information about the skewness or normality of the data.
- **Median**: The median is defined as the value below which 50% of the data fall.
- **IQR**: The interquartile range is defined as the first quartile subtracted from the third quartile, Q3 - Q1. The IQR is used in the outlier threshold equations and is a measure of the variability in the data.
- **Sum**: The sum total of the data.
- **Variance**: The variance of the data values in the row. This is the square of the standard deviation.
- **Standard Deviation**: The standard deviation of the data values in the row. This is the square root of the variance.
- **Outlier Thresholds**: The lower threshold is defined as Q1 - x*IQR, and the upper threshold is defined as Q3 + x*IQR. The multiplier, x, is user-specified. These thresholds can be used to identify outliers that fall below the lower threshold or above the upper threshold.

If the spreadsheet has a column-oriented marker map, the user can optionally choose to calculate the statistics per chromosome. If this is desired, the results can be in the format of one spreadsheet per chromosome, one spreadsheet per statistic or all statistics and chromosomes in one spreadsheet.
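
The per-chromosome option amounts to grouping columns by the chromosome of their marker and computing each row statistic within every group. A minimal sketch, assuming the marker map is available as a list giving one chromosome name per column (the function and parameter names are hypothetical):

```python
import statistics


def row_statistic_by_chromosome(rows, column_chromosomes,
                                statistic=statistics.median):
    """Compute one statistic per row, separately for each chromosome.

    rows: list of rows, each a list of numeric values (one per column).
    column_chromosomes: chromosome name of each column, in column order.
    Returns {chromosome: [statistic for row 0, row 1, ...]}.
    """
    # Collect the column indices that belong to each chromosome.
    groups = {}
    for index, chromosome in enumerate(column_chromosomes):
        groups.setdefault(chromosome, []).append(index)
    return {
        chromosome: [statistic([row[i] for i in indices]) for row in rows]
        for chromosome, indices in groups.items()
    }
```

Each dictionary entry corresponds to the "one spreadsheet per chromosome" layout for a single statistic; calling the function once per statistic covers the other layouts.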

## Multidimensional Outlier Detection

A median centroid vector is calculated as [median(column1), median(column2), ... , median(columnN)] based on N columns (dimensions) specified by the user. A distance score is then calculated for each sample or row as follows:

$$d_i = \sqrt{\sum_{j=1}^{N} (x_{ij} - m_j)^2}$$

where N = number of dimensions, or columns included in the calculation, $x_{ij}$ is the value of sample $i$ in column $j$, and $m_j$ is the value of the median centroid vector in column $j$. The outlier threshold is calculated as follows:

where Q3 and IQR are the third quartile and interquartile range of each column (1...N) and M is a user-specified multiplier.
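
The median centroid and distance score can be sketched as below, assuming the distance is the Euclidean distance from the centroid (consistent with the "imaginary circle" described later in this section). The threshold formula is not reproduced here, so the sketch takes the threshold as an argument instead of computing it. All function names are hypothetical.

```python
import math
import statistics


def distance_scores(rows):
    """Distance of each row from the median centroid of its columns.

    Assumption: plain Euclidean distance over the selected columns.
    Returns (distances, centroid).
    """
    columns = list(zip(*rows))
    centroid = [statistics.median(column) for column in columns]
    distances = [
        math.sqrt(sum((value - center) ** 2
                      for value, center in zip(row, centroid)))
        for row in rows
    ]
    return distances, centroid


def flag_outliers(distances, threshold):
    """Binary outlier column: 1 if the distance exceeds the threshold."""
    return [1 if distance > threshold else 0 for distance in distances]
```

The two return values map onto the two output columns described below: the distance score and the binary outlier indicator.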

To determine outliers in N dimensions, open a spreadsheet containing
several integer- or real-valued columns. From the spreadsheet, select
**Numeric** > **Multidimensional Outlier Detection**. The
default multiplier value is 1.5 but can be changed at the user’s
specification. Click **Add Columns** to add integer- or real-valued
columns to be included in the outlier calculation, then click **OK** to
select the columns. Click **OK** to begin the calculations.

The spreadsheet output, **Multidimensional Outlier Detection**, will
contain two columns. The first column contains the distance score for
each sample and the second column is a binary column, where a 1
indicates an outlier. The threshold is given in the second column
header as **Outlier** followed by the threshold value, e.g.
**Outlier 0.28**.

A common use of this function is to calculate outliers in two
dimensions, then filter a scatterplot of the two columns based on
outlier status. For example, one could run Principal Component analysis,
plot the first two principal components against each other, and then
determine outliers in two dimensions (the first two principal
components). After merging the resulting **Multidimensional Outlier
Detection** spreadsheet with the principal components spreadsheet, one
could filter on the binary Outlier column. Outliers would fall outside
of an imaginary circle created by the median centroid and threshold
values.

## Matched Pairs T-Test

A Matched Pairs T-Test can be run on two dependent samples in the same spreadsheet, with a grouping variable to distinguish them. The results spreadsheet provides the following outputs: T-Statistic, the mean of the differences, P-Value, -log10(P-Value), Bonferroni P-Value, FDR P-Value, degrees of freedom, Standard Error, the lower and upper bounds of the confidence interval around the mean, the average for each category, the difference between these averages, the average anti-log base 2 of the data for each category, the fold change of the anti-log base 2 of the data (based on those averages), and the log base 2 of that fold change.

To run a Matched Pairs T-Test, open a spreadsheet containing at least one categorical or binary column for the grouping. A sample or pair ID column can also be present; if it is not, the row labels will be used instead. The grouping and sample ID columns can be selected in the spreadsheet view by setting them to dependent (magenta). If the column headers can be identified, the dialog will populate the group and sample ID selections with the correct fields. These can also be selected manually in the dialog.

From the spreadsheet, select **Numeric > Matched Pairs T-Test**. A dialog will appear.
The dialog allows the user to select the group column and, optionally, the sample or
pair ID column. If no column is selected for the sample ID column, the row labels will
be used. Also, the user can select whether to save the results spreadsheet under
the current spreadsheet or to save it under the project root.

The following will be in the results spreadsheet after running the analysis:

- **T-Statistic**: The test statistic from the Student’s T distribution.
- **Mean of the difference**: The mean of the differences between each matched pair.
- **P-Value**: The p-value found from the Student’s T distribution for this T-Statistic and degrees of freedom.
- **-log10(P-Value)**: The negative log base 10 of the p-value.
- **Bonferroni P-Value**: The adjusted p-value using the Bonferroni correction.
- **FDR P-Value**: The adjusted FDR p-values.
- **Average (Category 1)**: The average of all samples in category 1 for each numeric column.
- **Average (Category 2)**: The same as above, but for the second category.
- **Difference**: The difference in the averages from the first and second categories.
- **Average (2^Data for Category 1)**: The average of taking the anti-log base 2 of the data for the first category.
- **Average (2^Data for Category 2)**: The same as above, but for the second category.
- **Fold Change**: The fold change of the anti-log base 2 of the data, using the averages above.
- **Log2 Fold Change**: The log base 2 of the above fold change.
- **Degrees of Freedom**: The number of samples in each test minus 1.
- **Standard Error**: The standard error of the mean of the difference.
- **Lower and Upper Bounds**: The lower and upper bounds of the 95% confidence interval around the mean.

If the original spreadsheet contained a column-oriented marker map, this marker map will be applied row-oriented to the results spreadsheet.
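
The core quantities of a matched pairs t-test can be sketched for a single numeric column as follows. The function name is hypothetical, the pairs are assumed to be already aligned by sample or pair ID, and the difference is taken as category 1 minus category 2 (an assumption about direction). The p-value and confidence bounds are omitted because they require the Student's T distribution, which the Python standard library does not provide.

```python
import math
import statistics


def matched_pairs_t(category1, category2):
    """T-statistic and related values for paired measurements.

    category1[i] and category2[i] must belong to the same pair.
    """
    differences = [a - b for a, b in zip(category1, category2)]
    n = len(differences)
    mean_difference = statistics.mean(differences)
    # Standard error of the mean difference (sample standard deviation).
    standard_error = statistics.stdev(differences) / math.sqrt(n)
    return {
        "T-Statistic": mean_difference / standard_error,
        "Mean of the difference": mean_difference,
        "Degrees of Freedom": n - 1,
        "Standard Error": standard_error,
    }
```

Because the test works on within-pair differences, variation between pairs that affects both members equally cancels out before the statistic is computed.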