2.28. RNA Sequencing Analysis¶
2.28.2. Activate Genes by Minimum Read Threshold¶
This tool is designed to remove genes from an RNA sequencing dataset that fall below a certain minimum read threshold.
- Binary or Categorical Dependent column (Optional): optionally
set a binary or categorical dependent column prior to running this tool. If set, the thresholds set must apply to all subcategories found within this column.
- Minimum Sample #: Percent or count threshold of samples that must
have minimum # of reads.
- percent/occurrences: Choose whether the previous number threshold
refers to a percent of samples or a count of samples that must have minimum # reads.
- Minimum Read threshold: Numeric value that determines the minimum
- Only use marker mapped quantitative columns: Optionally limit the
tool to perform the filtering on marker mapped column only. This option is appropriate if all gene columns are mapped and additional numeric columns exist in your spreadsheet.
The resulting spreadsheet will inactivate any column that has fewer than the percentage or count of values that meet the specified read threshold.
2.28.3. Normalization and Log Transformation¶
This function is intended to be used on RNA-seq quantification data and requires a spreadsheet with several active quantitative columns (could be marker mapped) that contain positive values. The user can choose to normalize the real values with an upper quartile normalization or log transform the values (base two or base ten) or both. If both options are selected, the normalization is done first and the normalized values are log transformed.
The normalization procedure is done if the group box Upper Quartile Normalization of Samples is checked. The following normalization is applied to each value:
where the filtered values depend on the following user-selected options:
Drop zero’d out columns for all samples: This option removes all columns that are entirely 0 from the upper quartile calculation. This option will not remove all zero values, rather only zero values found in all zero columns.
Only use valued >= …: This options filters the values used in the upper quartile calculation. Entering a positive value here will remove all zeros from the percentile calculation.
Scaler in normalization: The scaler used in the normalization formula.
In order to log transform the values, check the Log Transformation group box. The log transformation is either applied to the original values or to the normalized values, if applicable. The user can select a base two or a base 10 log transformation. One is added to all values so that values=0 remain as they are and all transformed values are still positive.
The user can choose to only apply the normalization and/or transformation to all quantitative columns or only to mapped quantitative columns. The latter is appropriate if the spreadsheet contains quantitative phenotypes as well as mapped quantitative data.
2.28.4. DESeq Analysis¶
DESeq is a method developed by [AndersAndHuber2010] to perform differential expression analysis on RNA sequence count data. Fold changes and p-values at individual genes or transcripts can be calculated for two conditions delineated in a categorical or binary column. The method is based on the negative binomial distribution, which allows for less restrictive variance parameter assumptions than does the Poisson distribution.
To compensate for the fact that the sample size is small and the number of genes or transcripts being analyzed is much larger, DESeq provides alternative variance algorithms which can take advantage of a large number of genes or transcripts having valid data.
In addition, Euclidean distances between all pairs of samples based on their count data, as well as moderated fold changes at individual genes or transcripts between two conditions, can be calculated. These results use a variance-stabilizing transformation (VST) applied to a variance-vs-count-mean function estimated for all data.
The current version of SVS DESeq is based on the R DESeq Version 1.10.1 by [AndersAndHuber2010].
DESeq Method Algorithms¶
The main purpose of the DESeq method is to perform statistical testing to decide whether, for a given gene or transcript, the observed difference in read counts is significant, that is, greater than would be expected from just natural random variation.
The DESeq method also provides a method of scaling data and two alternatives for calculating fold changes between conditions, one based on a variance-stabilizing transformation (VST).
DESeq Method Assumptions¶
It would be expected that read counts would follow a multinomial distribution, which may be approximated by a Poisson distribution. However, the Poisson distribution does not predict enough variance compared with what is actually seen in the data.
Therefore, to address this “over dispersion” problem, the DESeq method assumes that the number of reads in sample that are assigned to gene or transcript can be modeled by a negative binomial (NB) distribution, as
which has the two parameters , the mean, and , the variance. The read counts are non-negative integers.
This and the next two assumptions are made to compensate for the typically small sample size used for RNA sequencing analysis by taking advantage of the large number of genes or transcripts that are typically analyzed.
The mean , that is, the expected value of the observed counts for gene or transcript in sample , may be expressed as
where is the experimental condition of sample , is a value only dependent upon the gene or transcript and upon the experimental condition, and is proportional to the expected value of the true (but unknown) concentration of fragments from gene or transcript under condition , and is a size factor proportional to the coverage or sampling depth of library . We refer to the as the “base means”.
The total variance is assumed to be expressible as the sum of a shot noise term and a “raw variance” term , as
where is defined as the “raw variance”.
“Shot noise” is a random variation which occurs from a discrete measurement process such as is used in RNA sequencing analysis and other areas. We assume that the raw variance term is the “real variance” taking place in the counts.
This assumption, and the assumption that the measured read counts are NB-distributed, is discussed further in [AndersAndHuber2010], and are based on the actual read counts having a gamma distribution and the measured counts, if they were based on a fixed actual count instead of actual read counts, having a Poisson distribution.
The raw variance is assumed to be a smooth function of , that is,
Size Factor Estimation¶
Since the total number of reads is not necessarily a good measure of sequencing depth over any but the most highly expressed genes or transcripts, a median is taken to more realistically estimate the size factor parameter :
This may be interpreted as the median of the ratios of the -th sample’s counts to those of a pseudo-reference sample obtained by taking the geometric mean across samples. (Actually, only non-zero count values may be used in this process. This is accomplished in the standard DESeq algorithm by only using genes or transcripts for which all samples have non-zero count values–but see What About Missing Values? below for the SVS DESeq alternative algorithm used when there are too few genes or transcripts (with no missing data and) containing all non-zero count values.)
These size factor estimates are then used in the DESeq method to scale the counts between samples.
Alternative Algorithms for Estimating Variances¶
SVS DESeq has two methods of estimating variances as alternatives to directly estimating them from individual gene or transcript data. These are:
Estimating a relation between the gene or transcript variance vs. the mean of the scaled gene or transcript counts using local regression. (See Variance Estimation Using Local Regression.)
Estimating a relation parametrically between the gene or transcript “dispersion” (raw variance divided by the square of the mean of the scaled count) vs. the mean of the scaled gene or transcript counts using a general linear model. (See Parametric Variance/Dispersion Estimation.)
These methods may be applied to:
All samples aggregated as one condition
All replicated conditions (conditions corresponding to more than one sample) using pooled variances
Finally, when p-values are calculated, the actual variances used can be one of:
The variance as derived from one of the relations described above
The variance as directly estimated from the individual gene or transcript data
The choice of the above two which results in the maximum dispersion estimate
If one or both conditions consists of only one sample, or for whatever other reason no variances can be computed for one or both conditions,
the variance as derived from…
local regression applied to…
all samples aggregated as one condition
will always be used to estimate variances and calculate p-values.
Variance Estimation Using Local Regression¶
First, we estimate what we call the “base mean” for condition , , by averaging the scaled counts from the samples corresponding to condition :
where are the actual counts (as distinct from the statistic ), and is the number of replicates (samples) of condition . The sum is only taken over these replicate samples.
We then estimate what we call the “base variance”, , at gene or transcript for condition as an estimated population variance just using this sample data:
We also define a mean scaling factor for each condition :
Finally, we use a technique called “local regression” (Local Regression within DESeq) on the relation of data points vs. for condition to obtain a function (the fitted base variance) to estimate the base variances for condition . We define the raw variance function as
and estimate the raw variance for each gene or transcript under condition as
This raw variance estimate is based on the overall base mean , which is defined as the base mean for all samples aggregated as one condition.
Before the above raw variance, as estimated through local regression, can be used for p-value testing, it must be bias-corrected. See Variance Bias Correction.
When variances are computed for all samples aggregated as one “condition”, substitute “all samples” for in the discussion above.
To compute variances using a pooled variance estimate, the local regression is done on the relation of vs. to obtain a fitted pooled base variance , where is the pooled variance estimate for gene or transcript over all the conditions having replicated samples, and is the base mean computed for all samples for gene or transcript . The pooled raw variance function is then defined as
where is the mean scaling factor for all samples, and the pooled raw variance is defined as
(A “pooled variance” is basically the variance within conditions excluding the variance between conditions.)
[AndersAndHuber2010] describes a quantity which is equivalent to our . However, working with is more straightforward.
Local Regression within DESeq¶
A technique of data fitting known as “local regression” can be used by DESeq for fitting base variances (as well as for finding dispersion bias corrections–see Variance Bias Correction). This technique may be summarized as follows:
A set of evaluation points are found within the interval enclosing the range of function argument data points. This interval is subdivided until for each subinterval, the subinterval length divided by the lesser of the “vertex bandwidth” of each of its two end points is less than or equal to a “cut parameter”, for which 0.8 is used in DESeq.
The “vertex bandwidth” of a point is the maximum distance between that point and any of the closest data points, where is the local regression “bandwidth” being used. For fitting base variances, 0.7 is used for the “bandwidth” .
The endpoints of the resulting subintervals are used as the evaluation points.
At each evaluation point, a weighted Generalized Linear Model (GLM) fit is done over the closest data points. A “tri-cube weight function”, namely
where is the vertex bandwidth and is the difference between the data point and the evaluation point, is used for the GLM weighting.
DESeq uses a GLM linear predictor , where is an by three matrix the rows of which are , and is a one-by-three vector of parameters.
For fitting the base variances, DESeq’s GLM specifically uses the Gamma family with the natural logarithm as the link function, and uses the natural logarithms of the base means as the actual function arguments.
In the unlikely event that the GLM procedure does not converge for fitting the base variances, a weighted least squares is used between the natural logarithms of the base means and the natural logarithms of the base variances. A message will inform you that this substitution took place.
The result is a set of evaluation points with their corresponding estimated data values and estimated data value first derivatives.
When a function value is desired, a cubic Hermite spline is done using the two evaluation points surrounding the argument along with their corresponding values and first derivatives.
Variance Bias Correction¶
In reality, there is some bias in using raw variances in determining the parameters to use for the two negative binomial distributions that are needed for testing for differential expression (see Testing for Differential Expression). To correct for this, DESeq translates the mean of the scaled counts and the raw variance (see DESeq Method Assumptions) into a mean and a squared coefficient of variation (SCV) or “dispersion” , which is defined as the raw variance over the square of the scaled count mean–that is,
When raw variances have been estimated through local regression, the overall base mean is used to obtain dispersion values from the estimated raw variances–that is,
DESeq then uses the appropriate pre-computed function fit to look up a corrected value for the dispersion. This corrected value may then be used in the computation of the parameters for the negative binomial distributions.
DESeq stores one pre-computed fit function for every possible number of samples from 2 through 15. Each fit function has been obtained through many simulations of counts from the number of samples in question using negative binomial distributions which have varying dispersion values and a consistent high mean. Ninety-nine dispersion values between 0 and 2 and twenty dispersion values between 2 and 10 were used. (The consistent high mean was used in the simulations because it gives either valid or only slightly pessimistic values.)
The data from each of these sets of simulations was fit through local regression (see Local Regression within DESeq), with the “inputs” being the dispersion values computed on the data coming from the simulations, and the “outputs” being the dispersion values originally used for the negative binomial parameters. A weighted GLM with a bandwidth of 0.2 using the Gaussian family with the identity link function was used for fitting these dispersion correction functions. No dispersion correction is used for numbers of samples higher than 15, nor for dispersions lower than .02.
Parametric Variance/Dispersion Estimation¶
An alternative algorithm for estimating the actual raw variance is through the parametric fitting of the dispersion vs. the scaled count mean as
where is the “asymptotic dispersion” and is the “extra Poisson” term.
Before fitting, individual dispersion estimates
are made, where is the base variance, are the base means, and the mean scaling factor is as defined above in Variance Estimation Using Local Regression.
Individual dispersion estimates for all samples aggregated as one “condition” use the same formula except that “all samples” should be substituted as the “condition” .
Individual dispersion estimates for pooled variance estimates use the base mean and mean scaling factor as computed for all samples:
Final dispersion estimates are made from these by:
taking the maxima of the initial estimates and , and
bias-correcting the results (see Variance Bias Correction).
These final estimates are fit against the base means . The fitting algorithm used is a General Linear Model using the Gamma family with the identity link function.
When a parametrically-fitted dispersion is needed for testing differential expression, the substitution
is made, where is the overall base mean and and are the dispersion parameters for condition . No bias correction is done here, since the original fitting function was based on bias-corrected data.
Individual Dispersion Estimates¶
For pooled variances, SVS DESeq offers options for using (pooled) dispersion estimates for individual genes or transcripts (see Parametric Variance/Dispersion Estimation above). When individual dispersion estimates are used in this fashion, they are not bias-corrected, but the maxima of the individual estimates and are still used. These options are:
Pooled: Use individual estimate for gene/transcript Always use individual dispersion estimates to derive raw variances.
Pooled: Take maximum of fitted variance vs. individual estimate Use either the estimated dispersion (based on either local regression or parametric fitting) or the individual dispersion estimate, whichever is higher.
Using the maximum dispersion value removes the effect of outliers, that is, genes or transcripts with a much larger individually estimated dispersion value than what the fitted value calls for.
Testing for Differential Expression¶
Suppose we are testing for differential expression between conditions A and B at gene or transcript , and that we have samples for condition A and samples for condition B. Our null hypothesis will be , where is the expression strength for condition A and is the expression strength for condition B. We define
and their sum as test statistics.
We want to be able to compute the probability of and of for any given or . The random variable is the sum of negative-binomial-distributed random variables. We approximate its distribution by a negative binomial distribution whose parameters we obtain from those of the .
To do this, we compute the pooled mean estimate from the counts of both conditions,
which accounts for the fact that the null hypothesis stipulates that . is called the “combined base mean”. We then compute a summed mean and a variance for condition A as
where is the dispersion being used for condition A for this test.
If a fitted dispersion is being used,
and otherwise, if the individual pooled dispersion estimate is being used,
If the fitted dispersion is being used and local regression was used to compute the raw variances upon which it is based, the bias correction (see Variance Bias Correction) is applied to the dispersion value before using it.
The summed mean and the variance are used to set the probability and size parameters and for the negative binomial distribution we are assuming for as follows:
The summed mean, variance, and NB probability and size parameters for condition B are computed similarly. These parameters for both conditions allow us to compute and .
Now, if we assume the counts from the different samples are independent, we can compute as the probabilities of both and happening at the same time.
The p-value of a pair of observed count sums is then the sum of all probabilities less than or equal to where , divided by the sum of all probabilities where :
The variables and take on the values . The condition may be rewritten as , which makes the above expression rewritable as a summation in the variable only:
Computing the P-Values¶
Normally, the probability function is in a nice “bell” shape with one maximum somewhere in the middle (between zero and ) and tapering off towards zero near the ends. However, if the size parameter for either negative binomial is less than or equal to one, the probability function will not be of this shape. Therefore, three p-value algorithms are available in SVS DESeq:
For cases when the size parameter for both negative binomials is greater than one, the algorithm described above in Testing for Differential Expression.
Note that for this algorithm, the definition of a pair of count sums being “more extreme” than the pair is that their probability is smaller than the probability .
Two-tailed. We define “more extreme” as meaning “a more extreme fold change” , where and are the sums of the size factors of the samples in conditions and , respectively.
We find the values of (where ) having a more extreme fold change than as follows:
If is one borderline between a more extreme fold change and not a more extreme fold change, then its reciprocal, , must also be such a borderline.
Define the “opposite point” as a value between zero and , inclusive, such that substituting for and for in the fold change expression will equal the reciprocal borderline expression . That is,
If is zero, define as , and if is zero, define as zero. Otherwise, we have
To do the actual computation, we round this to an integer:
If , we use
if , we use
and if , we set to one.
Double a one-tailed estimate. For this algorithm, we also define “more extreme” as meaning “a more extreme fold change” , where and are the sums of the size factors of the samples in conditions and , respectively.
Using the summed means and for conditions A and B, approximate the mode of the distribution as
If , use
and if , use
This may give a value above one–in that case, set to one.
Computing the Actual Sums¶
Since count sums may get into the millions for strongly expressed genes or transcripts, these expressions can take a very long time to sum over every possible count value. Therefore, SVS DESeq makes Romberg numerical integration available as a procedure to approximate sums when the number of values to be summed is higher than a selectable value such as 10000. This works because the negative binomial probability mass functions are defined in terms of functions that may be computed for non-integer values.
To better handle the possibility that the largest values of may be near one or both edges of the interval, a sum is taken for the first 1501 values and the last 1501 values of over the interval, and the Romberg integration is done for the remaining values away from the edges. Additionally, in order to better approximate the sum over integer values, the actual Romberg integration is taken over intervals of length one each centered around an integer value. That is, it approximates the sum from (the beginning value + 1501) through (the end value - 1501) by integrating from (the beginning value + 1500.5) to (the end value - 1500.5).
If “more extreme” is being defined as being less than , and is larger than the threshold being used for Romberg integration, the value of , where is an integer on the opposite side of the distribution from , and for which borders on being less than , is found through a binary search. Then, similar to the two-tailed approach, if , we use
if , we use
and if , we set to one.
The above binary search assumes a distribution of that has one maximum in the middle. This results from the size parameters of both negative binomials being greater than one. (The two-tailed approach is always used instead if either size parameter is less than or equal to one.)
If a set of observations has a mean and a variance which is expressible as a function of the mean, , it is possible to apply the transformation
to this data. This is known as a variance-stabilizing transformation (VST), because the variance of the transformed data is approximately the same for all values of .
Several optional outputs of DESeq take advantage of this transformation. These include moderated fold changes (see Moderated Fold Changes) and Euclidean distances (see Euclidean Distances), as well as direct outputs of VST-transformed counts themselves.
If variances are being fitted through local regression, DESeq will obtain a VST transformation numerically by:
Insuring that a variance is fitted using local regression over all samples aggregated as one condition;
Using a numeric integration procedure over 999 points (chosen so that their arc sine values are evenly spaced) to obtain a VST function.
Since the original local regression for fitting variances vs. means works with logarithms of variances and means rather than the variances and means themselves, and thus there is no value obtained for a zero mean, fitting a spline over the lowest four evaluation points in order to use the result to substitute the first derivative used for the first two points.
Scaling the VST function (using count means that are at the 95th percentile and the 99.9 percentile of the count mean data) so that it approaches as the count data gets large.
Storing this function in the same format as are local regression function results, so that it may be used directly on the arc sine values of the scaled count data .
On the other hand, if dispersions are fitted parametrically as
the following closed formula for the variance-stabilizing transformation is used:
where is the VST transform function and is a scaled count.
Note that this formula automatically approaches as becomes large.
Moderated Fold Changes¶
A “fold change” between conditions A and B at a gene or transcript is normally computed as the ratio at gene or transcript of the base mean of scaled counts for condition B to the base mean of scaled counts for condition A. However, this result is not valid if the scaled counts for condition A are all zero, nor is the logarithm of this result valid if the scaled counts are all zero for either condition. In addition, even when it is valid to compute them, these fold changes can sometimes take on extreme values.
However, if a VST is applied to the scaled counts for samples in condition A at gene or transcript and the results are averaged, and similarly a VST is applied to the scaled counts for samples in condition B at gene or transcript and these results are averaged, the A average may be subtracted from the B average to obtain what we call a “moderated fold change”. (Subtraction rather than division is done because the VST function is logarithm-like.)
Moderated fold changes may be computed for all data, not just data for which both of the counts are non-zero. Additionally, moderated fold changes will not take on as extreme values as normal fold changes would.
Sometimes, it is desired to find an “overall distance” between the counts of pairs of samples. To make it possible to more meaningfully compare distances between different pairs of samples, as well as to make the actual distance computations more meaningful, a VST transform is first applied to the scaled counts .
A procedure is then done which is analogous to finding the distance between two points on a plane, which is the square root of the sum of the x-distance squared plus the y-distance squared, except that instead of the two dimensions of the plane, there is one “dimension” for every gene or transcript . This Euclidean distance between samples c and d is defined as
where is the VST transform function.
What About Missing Values?¶
All of the above methods are predicated on missing values not existing in the data. (Note that zero counts are not considered as “missing”.)
What if we do have missing values? Some experiments may have occasional missing values sprinkled throughout their data. Not only that, but some RNA-Seq experiments have samples for which all data is missing.
If we do have missing data, or even a substantial proportion of zero count values, we need to accommodate these in such a way that the original essence of the DESeq algorithm is preserved as much as possible. Specific issues are:
There must be at least a small proportion of non-zero data if fold changes and other measurements are expected to be extracted from the data.
Any computation is more meaningful when a greater proportion of genes or transcripts have data for every sample involved in that computation.
Geometric means (which are used to compute size factors) are only meaningful with non-zero data. The above size factor algorithm avoids using any genes or transcripts that have ANY missing data or have ANY zero counts.
Variances need to be computed for genes or transcripts which have a consistent number of non-missing samples (which might as well be all of the samples available for the condition, if possible) so that applying the dispersion correction will be meaningful.
P-value computations are predicated on the number of samples in each condition being consistent with the numbers of samples used in the dispersion corrections. In addition, in any case, p-values are most meaningful as p-values when data is present for all of the samples in each condition.
Euclidean distances, computed with the above algorithm, grow larger if the number of genes or transcripts considered grows larger and the behavior of the VST-transformed counts for the individual genes or transcripts stays, on the average, the same. For instance, if a pair of samples has data for only 55% of the genes or transcripts, the Euclidean distance will probably be only the square root of .55 of, or about 74.2% of, what it probably would have been had there been data for all of the genes or transcripts.
In order to allow computation to take place and get results approximating those that would be obtained if all data had been present, SVS DESeq handles these issues as follows:
After loading the data, SVS DESeq eliminates, one by one, the samples with the greatest number of missing values (and, in case of a tie, the greatest proportion of non-zero values among those values that are not missing) from being used in the analysis until certain criteria are met:
The proportion of genes or transcripts that are missing-value free for the remaining samples is at least 50%.
The proportion of genes or transcripts that are both missing-value free and are completely free of zero values is at least 1%.
Only after this process is complete is the analysis dialog presented. (Analysis will stop if these conditions are never met despite eliminating all of the samples from consideration.)
If less than 30% of the genes or transcripts have all non-missing and all non-zero data for all of the remaining samples, SVS DESeq will use a modified size factor algorithm, as follows:
The geometric mean used for gene or transcript in the denominator of the expression from which the median is taken (see Size Factor Estimation) is computed only from samples (from among the samples that remain after the elimination process above) for which gene or transcript has non-missing, non-zero values.
Only genes or transcripts for which (remaining) sample has non-missing, non-zero values are used to determine the size factor for sample .
If a given (remaining) sample has no non-missing, non-zero values, the size factor used for that sample is 1.0. This allows the zero values in such a sample to be used in further computations.
SVS DESeq estimates variances for individual conditions or for all samples aggregated as one condition using only genes or transcripts which are completely missing-value free for all of the (remaining) samples in the condition.
When computing pooled variances, SVS DESeq obtains partial sums from an individual condition only for genes or transcripts which are completely missing-value free for all of the (remaining) samples in that condition. (The degrees of freedom will also only be incremented only for those genes or transcripts which are completely missing-value free for all of the (remaining) samples in that condition.)
SVS DESeq requires any gene or transcript for which a p-value is taken to have data present for all of the (remaining) samples in each condition.
SVS DESeq will require only at least one (remaining) sample in each condition to have data in order to compute a moderated fold change.
To compute a Euclidean distance between two (of the remaining) samples, SVS DESeq will only use genes or transcripts for which both samples have data (which is guaranteed to be at least 50% of the total genes or transcripts as a result of the sample elimination process above). In addition, before the square root is taken in the computation, the sum is multiplied by the ratio of the number of total genes or transcripts to the number of genes or transcripts that have data for the two samples.
When outputting count data for plotting hierarchical clustering dendrograms and heatmaps (a feature described in Other Outputs Designed for Heatmap and Dendrogram Plotting below), SVS DESeq will use zero in place of any missing value. (Missing values will only be possible for samples not in either of the two conditions used for p-value testing.)
DESeq Method Calculations and Outputs¶
Test Outputs and Intermediate Outputs¶
The following results may be calculated using DESeq:
P-values and fold changes between two selected conditions at every gene or transcript.
Moderated fold changes may be added to the above results.
Euclidean distances between all samples.
In addition, there are several optional outputs:
The scaled counts for each sample and gene or transcript, as well as the size factors used to scale each sample’s counts.
The VST-transformed counts for each sample and for gene or transcript.
Data related to variance or dispersion computations that must be made as intermediate values to other requested outputs.
Other Outputs Designed for Heatmap and Dendrogram Plotting¶
This category of output, which is available only if you compute p-values and fold changes between two selected conditions, is designed to be used as input for plotting hierarchical clustering dendrograms and heatmaps (Dendrograms and Heatmap). These outputs are as follows:
VST-transformed counts for genes or transcripts with the best p-values
Scaled counts for genes or transcripts with the best p-values
Original count data for genes or transcripts with the best p-values
These may be output either for all samples or just for the samples in the two conditions involved in the p-value calculations.
In general, VST-transformed counts work the best for clustering and heatmapping because these counts are not spread out over such a vast scale as are the original counts or scaled counts. (This is also why VST-transformed counts are used for moderated fold changes (Moderated Fold Changes) and finding Euclidean distances between samples (Euclidean Distances).)
The RNA sequence count data should be active data organized by rows for samples and marker-mapped numeric columns for genes or transcripts. One binary or categorical column should be set as dependent. This column should be used to categorize your samples as to which condition each belongs.
If the numeric columns are real-valued, the results of rounding these values to the nearest integers will be used for this algorithm.
The above having been said, SVS DESeq still uses double-precision variables in its calculations, even for integer-valued data, in order to allow for count values up to and surpassing .
Usage and Parameters¶
In a spreadsheet containing mapped integer or real columns that correspond to the count data, set a categorical or binary column as dependent. Select RNA-Seq > DESeq Analysis.
There will appear two tabs–DESeq Analysis and Advanced Parameters. In the DESeq Analysis tab, two sets of optional parameters are selectable:
P-values and Fold Changes
If p-value and fold change outputs are desired, two conditions must be selected under Conditions to Analyze. The drop down boxes will each contain a list of all unique values found in the condition-specifying categorical or binary column.
The P-values and Fold Changes check box will only be available if the dependent column has at least two unique categories and at least two rows (samples) in each unique category (out of samples that have not been removed due to excessive missing values).
Output the Following Counts for Genes/Transcripts with the Best P-Values Please see Other Outputs Designed for Heatmap and Dendrogram Plotting for more information about this category of output. The following count spreadsheets may be selected for output:
VST-transformed counts These will be counts for the applicable genes/transcripts and samples that have been transformed using the variance-stabilizing transformation (VST).
Scaled counts These will be the scaled counts for the applicable genes/transcripts and samples.
Original counts This will be a copy of your original count data (as rounded to integers, if applicable) for the applicable genes/transcripts and samples.
The following choices determine the genes/transcripts and the samples for which the above counts will be output:
Output the above for this many best p-values (Shown above the choices for category of output.) Select how many of the genes/transcripts having the best p-values will be output. While outputting for the best 50 p-values is the default, you may wish to go up to 100 or 300 to get a bigger picture, or down to 30 to get better resolution.
Output the above:
Only for valid samples having condition A or condition B Select this to limit the output to the samples that helped determine the p-values in the first place.
For all valid samples Select this for output using all samples (that were not eliminated during preliminary data scanning).
Any missing data will be output in these spreadsheets as zero.
This feature was designed to give valid input for the Plot > Dendrograms and Heatmap feature. Please see Dendrograms and Heatmap for information about this plot feature.
Output moderated fold changes using VST Checking this will add these outputs to the p-value and fold change output spreadsheet itself.
Other options This box specifically includes:
Output Euclidean distances using VST. This is a calculation which is independent of the P-values and Fold Changes calculations. Euclidean distances will be calculated between every pair of valid samples.
In the Advanced Parameters tab, the following groups of parameters are selectable:
Size Factors, Scaled Counts, and Variances/Dispersions The following optional outputs are selectable:
Output size factor and scaled count spreadsheet This option shows the size factor for each sample and the sample’s count data for all genes and transcripts as scaled by the sample’s size factor.
Output all variance-stabilizing-transformed (VST) counts This option shows the VST-transformed counts for each sample and for each gene or transcript.
Output spreadsheets related to variance or dispersion computations This option outputs, in one or more spreadsheet(s) data related to variance or dispersion computations that must be made as intermediate values to other requested outputs.
Usage of Variances/Dispersions for P-Value Estimates The selectable parameter is:
Variance/dispersion computation method: The choices are:
Pooled: Take maximum of fitted variance vs. individual estimate
Pooled: Use fitted variance
Pooled: Use individual estimate for gene/transcript
Variance fit for all samples aggregated as one condition
Per-condition: Use fitted variance
See Alternative Algorithms for Estimating Variances and the sections that follow it for more information.
This parameter is only available when the P-values and Fold Changes checkbox is available.
If one or both conditions consists of only one sample, or for whatever other reason no variances can be computed for one or both conditions, the variance/dispersion computation method will always be Variance fit for all samples aggregated as one condition.
Variance/Dispersion Fitting Method The selectable parameter is:
Variance fitting method: The choices are:
If one or both conditions consists of only one sample, or for whatever other reason no variances can be computed for one or both conditions, the variance fitting method will always be Local.
P-Value Computation Algorithm Two parameters are selectable:
Definition of “more extreme” pairs of group sums (a, b): The choices are:
Smaller probability than the observed pair
More extreme fold change: Use two-tailed estimate
More extreme fold change: Double the one-tailed estimate
See Computing the P-Values for a detailed explanation of this parameter.
Approximate with Romberg integration when counts exceed _____ Leave this checked to approximate the sums with Romberg integration. Uncheck it to force all sums to be computed exactly (may take longer). Adjust the count limit upward if it is desired to approximate the sum for fewer genes or transcripts.
See Computing the Actual Sums for more information on this parameter.
This group of parameters is only available when the P-values and Fold Changes checkbox is available.
Size Factors and Scaled Counts Contains one row for each sample which has not been eliminated from analysis due to excessive missing values or excessive zero values. Columns include the dependent column, a column containing the size factors and, for each integer counts column in the original spreadsheet, a column of counts which are scaled according to each sample’s size factor.
VST-Transformed Counts Contains one row for each sample which has not been eliminated from analysis due to excessive missing values or excessive zero values. Columns include the dependent column and, for each integer counts column in the original spreadsheet, a column of VST-transformed counts.
Variance Estimates for ________ The blank here may be “Condition ______”, “All Samples” (all samples being analyzed aggregated as one condition), or “Pooled”. Contains a row for each gene or transcript, and may contain any or all of the following columns as applicable:
Base Mean This is the mean of the scaled counts for this gene or transcript.
If this is the Variance Estimates for Pooled spreadsheet, this column will contain the overall base mean–that is, the mean of the scaled counts for this gene or transcript for all samples being analyzed.
log2(Base Mean) The base-2 logarithm of the above.
Base Variance Estimate The estimate of the variance of the counts for this gene or transcript, based upon just the counts for this gene or transcript.
If this is the Variance Estimates for Pooled spreadsheet, this column will contain the pooled variance estimate for this gene or transcript.
log2(Base Variance Estimate) The base-2 logarithm of the above.
Ind. Est. Variance The individually-estimated “raw variance”. (Shown only if individual dispersion estimates are needed for other calculations.)
Ind. Est. Dispersion The individually-estimated dispersion, not bias-corrected. (Shown only if individual dispersion estimates are needed for other calculations.)
Corrected Ind. Est. Dispersion The bias-corrected individually-estimated dispersion. (Shown only if parametric dispersion fitting is used.)
Overall Base Mean The mean of the scaled counts for this gene or transcript for all samples. (Shown only if Per-condition: Use fitted variance was selected.)
The Overall Base Mean, whether it has just been shown above or has effectively been shown earlier, is always used to compute the next several fields from the Fitted Base Variance W through the Corrected Dispersion.
log2(Overall Base Mean) The base-2 logarithm of the above. (Shown only if Per-condition: Use fitted variance was selected.)
Fitted Base Variance W The result of applying the variance fit function (taken over all data) to the Overall Base Mean. (Shown only if local variance fitting is used.)
log2(Fitted Base Variance W) The base-2 logarithm of the above. (Shown only if local variance fitting is used.)
p(Chi Square of Fit) The (with degrees of freedom) cumulative density function value taken at times the ratio of the Base Variance Estimate to the Fitted Base Variance. Here, is the number of samples in the condition. (Shown only if local variance fitting is used.)
Fitted Raw Variance The result of subtracting the estimated shot noise of this gene or transcript from the Fitted Base Variance W. (Shown only if local variance fitting is used.)
This result may sometimes be negative. If it is, the Fitted Dispersion computed from this value will simply be set to its minimum value of .
Fitted Dispersion The dispersion computed from the Fitted Raw Variance and the Overall Base Mean. If this result is less than , is used instead. (Shown only if local variance fitting is used.)
Corrected Dispersion The result of bias-correcting the Fitted Dispersion. (Shown only if local variance fitting is used.)
Disp. Used The dispersion that will be used for p-value testing. (Shown only if the Pooled: Take maximum of fitted variance vs. individual estimate computation method has been selected and local variance fitting is used.)
VST-Transformed Base Means The result of applying the VST-transformed variance function to the Overall Base Mean of this gene or transcript. (Shown only for the All Samples spreadsheet and only if VST transforms are used elsewhere.)
If Parametric variance fitting was selected, the values of the constants (Asymptotic Dispersion) and (Extra Poisson) will be output to the log message of this spreadsheet.
P-values for Conditions A and B: Contains a row for each quantitative counts column (representing a gene or transcript) in the original spreadsheet, and columns as follows:
p-value The p-value for this gene or transcript for the fold change existing between the two conditions being analyzed. This p-value represents the probability of a fold change of this magnitude or greater existing by chance under the (null hypothesis) assumption that there should really be no fold change at all in the actual data being analyzed.
-Log 10 P-Value The negative of the base-10 logarithm of the above.
FDR-corrected p-value The result of applying a false discovery rate (FDR) correction to the p-value.
-Log 10 (FDR-corrected p-value) The negative of the base-10 logarithm of the above.
Count kiA The observed count sum for condition A.
Count kiB The observed count sum for condition B.
Fold Change The ratio of the base mean of scaled counts for condition B to the base mean of scaled counts for condition A.
If the count for a condition is zero, the Fold Change output will be based upon simulating a count of one for that condition.
In specific mathematical terms, for the ith gene or transcript, this translates to:
If the base mean for condition A, , is zero, substituting for .
If the base mean for condition B, , is zero, substituting for .
These changes are only made as input to computing the Fold Change output. For all other computational purposes, the base means are left as they were originally computed.
The purpose of making these changes for computing the Fold Change output is to make it always possible to take the logarithm of the fold change (Log 2 Fold Change) for all genes or transcripts for which there is a non-zero count for at least one condition.
Log 2 Fold Change The base-2 logarithm of the fold change (as computed above).
Moderated Fold Change (Appears if Output moderated fold changes using VST has been selected.) The difference between the average of the VST-transformed scaled counts for condition B and the average of the VST-transformed scaled counts for condition A.
VST Mean (A) (Appears if Output moderated fold changes using VST has been selected.) The average of the VST-transformed scaled counts for condition A.
VST Mean (B) (Appears if Output moderated fold changes using VST has been selected.) The average of the VST-transformed scaled counts for condition B.
Combined Base Mean The mean of the scaled counts taken over both conditions A and B.
Log 2 (Combined Base Mean) The base-2 logarithm of the above.
Base Mean A The mean of the scaled counts for condition A.
Base Mean B The mean of the scaled counts for condition B.
Dispersion used (A) The dispersion used for condition A.
Dispersion used (B) The dispersion used for condition B.
NB mean (A) The mean (summed mean ) used for the Negative Binomial distribution for condition A.
NB mean (B) The mean (summed mean ) used for the Negative Binomial distribution for condition B.
NB size parm. (A) The size parameter used for the Negative Binomial distribution for condition A.
NB size parm. (B) The size parameter used for the Negative Binomial distribution for condition B.
Original Counts for G/T with the Best … P-Values: Contains your original count data (as rounded to integers, if applicable) for the applicable genes/transcripts and samples. The genes/transcripts are in columns which are sorted in order of p-values with the lowest p-values first.
Scaled Counts for G/T with the Best … P-Values: Contains the scaled counts for the applicable genes/transcripts and samples. The genes/transcripts are in columns which are sorted in order of p-values with the lowest p-values first.
VST-Transformed Counts for G/T with the Best … P-Values: Contains counts for the applicable genes/transcripts and samples that have been transformed using the variance-stabilizing transformation (VST). The genes/transcripts are in columns which are sorted in order of p-values with the lowest p-values first.
Euclidean Distances: An by spreadsheet (based on rows in the original spreadsheet). The first column contains a copy of the dependent column. This is followed by a square matrix containing the Euclidean distances. Each distance shown is between the sample associated with the distance’s row and the sample associated with the distance’s column.
Visualizing Results from DESeq¶
Several of the DESeq outputs lend themselves to visualizations.
Fold Changes vs. Combined Base Means
From a P-Values for Conditions A (…) and B (…) spreadsheet, select Plot > XY Scatter Plots, select Log10 (Combined Base Mean) for the independent variable and Log 2 Fold Change as the single dependent variable. This will come out somewhat as a “sideways volcano plot”.
Again from a P-Values for Conditions A (…) and B (…) spreadsheet, select Plot > XY Scatter Plots, but this time select Log 2 Fold Change as the independent variable. Select either -Log 10 P-Value or -Log 10 (FDR-corrected p-value) for the dependent variable.
Heat map of Euclidean distances
From a Euclidean Distances spreadsheet, select Plot > Heat Map.
Dendrograms and heatmaps of count data
From any of the … Counts for G/T with the Best … P-Values spreadsheets, select RNA-Seq > Dendrograms and Heatmap to get a heatmap and clustering dendrograms of these counts. See Dendrograms and Heatmap.
2.28.5. Dendrograms and Heatmap¶
This plot will show a heatmap of the spreadsheet’s active data with adjacent hierarchical clustering dendrograms. The top dendrogram refers to the spreadsheet rows, while the side dendrogram refers to the spreadsheet columns. The columns of the heatmap itself correspond to the values of the active rows of the spreadsheet, except they are sorted according to the order as shown in the top dendrogram. Likewise, the rows of the heatmap itself correspond to the values of the active columns of the spreadsheet, except they are sorted according to the order as shown in the side dendrogram. Viewing both the dendrograms and heatmap will allow you to visually find how the data is clustered in either direction.
Although this feature may be used with any completely numeric spreadsheet containing fewer than 500 rows and fewer than 3000 columns, it was designed specifically for viewing DESeq count data for those genes or transcripts having the best p-values for differences of expression between two conditions. Used this way, the top dendrogram will represent a hierarchical clustering of data by sample, and the side dendrogram will represent a hierarchical clustering of data by gene or transcript.
It is expected that the count data could be the original input data, scaled counts, or VST-transformed counts. In general, VST-transformed counts work the best for clustering and heatmapping using this feature.
Any missing values will be set to zero before plotting.
For each dendrogram, you may specify both the distance metric and the “linkage” method.
The distance metric algorithms available are:
Correlation The distance between two rows and or between two columns and will be considered to be
where is the correlation of the data between the two rows or columns. Note that perfect correlation would be , leading to a distance of zero, and no correlation () leads to a distance of one. Also note that anti-correlation (correlated but in the “wrong” direction), where , leads to distances between one and two.
If this distance metric is being used and the values of a spreadsheet row or column are all within of each other, will be added to the first value extracted from a column or subtracted from the first value extracted from a row for the purposes of plotting. This is to avoid a zero variance for the data from such columns or rows.
Euclidean The distance between two rows and or between two columns and will be the analog of physical distance,
where and are the elements of row or column and row or column .
The plot will look more or less as displayed here.
Panning mode and zoom mode are both available for this type of plot. (See Viewing a Custom Plot.)
In the initial display, note that there is barely room to show the names of the genes for these best 50 p-values. In fact, for 100 or 300 genes or transcripts (columns in the spreadsheet to be plotted), the gene/transcript names will overlap. Therefore, to see individual gene/transcript (spreadsheet column) names, please use the zoom feature (Viewing a Custom Plot) on either the side dendrogram or the heatmap. You may then use panning mode to move the plot to see, for instance, the names of other genes or transcripts that are nearby.
If there are either too few rows or too few columns of data or both to compute a dendrogram, that dendrogram will not be shown. The other elements of the plot will still be there.