Linear Regression

Genotype or Numeric Association Test – Linear Regression

The response, y, is fit to every genetic predictor variable or encoded genotype, x, in the spreadsheet using linear regression. The results include the regression p-value, intercept, and slope, which are output in a new spreadsheet along with the other genotypic association test results and any multiple-testing correction results. The response is represented by the formula y = b_1 x + b_0 + \epsilon, where the model is the expression b_1 x + b_0 and the error term, \epsilon, expresses the difference, or residual, between the model of the response and the response itself. For missing values of the predictor, the mean of the response is used.

The regression hypothesis test is the test of:

\begin{cases}
H_0: \beta_1 = 0\\
H_a: \beta_1 \neq 0\\
\end{cases}

Assumptions:

\epsilon_i \sim N(0,\sigma^2) for all i=1, \ldots, n

where \epsilon_i denote the residuals

and \epsilon_i are all independent and follow a normal distribution

and all \epsilon_i have equal variance \sigma^2

The sums of squares and mean sum of squared errors are calculated as follows:

Number of Observations: n
Rank of the Coefficient Matrix: m
Mean of the response: \bar{y}=\frac{1}{n}\sum^n_{i=1}{y_i}
Mean of the predictors or genotypes: \bar{x}=\frac{1}{n}\sum^n_{i=1}{x_i}
Solution to the normal equations: \hat{\beta} = \frac{\sum^n_{i=1}(x_i -\bar{x})(y_i - \bar{y})}{\sum^n_{i=1}(x_i-\bar{x})^2} = (\bf{X}^T\bf{X})^{-1}\bf{X}^T\bf{y}
  where \hat{\beta} \sim N(\beta,\sigma^2(\bf{X}^T\bf{X})^{-1})
Total Sum of Squares: SST = \sum^n_{i=1}{y^2_i}-\frac{1}{n}\left(\sum^n_{i=1}{y_i}\right)^2=SSReg + SSE
Regression Sum of Squares: SSReg = \sum^n_{i=1}{\left(\hat{y}_i-\bar{y}\right)^2}= \hat{\beta}^T\bf{X}^T\bf{y}- \frac{1}{n} \left(\bf{y}^T\bf{J}\bf{y}\right)
  where \bf{J} is a matrix of ones
Error Sum of Squares: SSE = \sum^n_{i=1}{\left(y_i-\hat{y}_i\right)^2} = \bf{y}^T\bf{y} - \hat{\beta}^T\bf{X}^T\bf{y}
Residual Standard Error: SE_{resid} = \sqrt{\frac{SSE}{n-m}}
Coefficient of determination: R^2=\frac{SSReg}{SST}
Adjusted coefficient of determination: R^2_{adj}=1-\left((1-R^2)\frac{n-1}{n-m}\right)
Test Statistic: F^* = \frac{R^2/(m-1)}{(1-R^2)/(n-m)}

The test statistic follows the F distribution; the p-value is P(X > F^*), where X \sim F(1,n-m).
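
As a worked illustration only (not the application's actual implementation), the following Python sketch computes the slope, intercept, R^2, F statistic, and p-value for a single predictor from the formulas above; the arrays x and y are hypothetical encoded genotypes and responses, and numpy and scipy are assumed to be available.

    import numpy as np
    from scipy import stats

    # Hypothetical encoded genotypes (0/1/2) and continuous responses
    x = np.array([0, 1, 2, 1, 0, 2, 1, 0], dtype=float)
    y = np.array([1.2, 2.3, 3.1, 2.0, 1.1, 3.4, 2.2, 0.9])

    n = len(y)
    m = 2                                             # rank of the [1, x] coefficient matrix
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    y_hat = b0 + b1 * x

    ss_reg = np.sum((y_hat - y.mean()) ** 2)          # SSReg
    sse = np.sum((y - y_hat) ** 2)                    # SSE
    r2 = ss_reg / (ss_reg + sse)                      # R^2 = SSReg / SST
    f_star = (r2 / (m - 1)) / ((1 - r2) / (n - m))
    p_value = stats.f.sf(f_star, m - 1, n - m)        # P(X > F*), X ~ F(1, n-m)
    print(b0, b1, r2, f_star, p_value)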

Multiple Linear Regression Model

The Regression Analysis window performs multiple linear regression on the regressors unless only one regressor is specified. A multiple linear regression model takes one or more regressors and fits a regression model to one dependent variable. This model is a generalization of the simple linear regression model used for linear regression in the analysis test dialogs.

Full Model Only Regression Equation

The regression hypothesis test is the test of:

\begin{cases}
H_0: \beta_1 = \beta_2 = \beta_3 = \ldots = 0\\
H_a: \beta_i \neq 0 \text{ for at least one } i\\
\end{cases}

Assumptions:

\epsilon_i \sim N(0,\sigma^2) for all i=1,\ldots,n

where \epsilon_i denote the residuals

and \epsilon_i are all independent and follow a normal distribution

and all \epsilon_i have equal variance \sigma^2

The sums of squares and mean sum of squared errors are calculated as follows:

Number of Observations: n
Rank of the Coefficient Matrix: m
Mean of the response: \bar{y}=\frac{1}{n}\sum^n_{i=1}{y_i}
Mean of the predictors or genotypes: \bar{x}=\frac{1}{n}\sum^n_{i=1}{x_i}
Solution to the normal equations: \hat{\beta} = \frac{\sum^n_{i=1}(x_i -\bar{x})(y_i - \bar{y})}{\sum^n_{i=1}(x_i-\bar{x})^2} = (\bf{X}^T\bf{X})^{-1}\bf{X}^T\bf{y}
  where \hat{\beta} \sim N(\beta,\sigma^2(\bf{X}^T\bf{X})^{-1})
Total Sum of Squares: SST = \sum^n_{i=1}{y^2_i}-\frac{1}{n}\left(\sum^n_{i=1}{y_i}\right)^2=SSReg + SSE
Regression Sum of Squares: SSReg = \sum^n_{i=1}{\left(\hat{y}_i-\bar{y}\right)^2} = \hat{\beta}^T\bf{X}^T\bf{y}- \frac{1}{n}\left(\bf{y}^T\bf{J}\bf{y}\right)
  where \bf{J} is a matrix of ones
Error Sum of Squares: SSE = \sum^n_{i=1}{\left(y_i-\hat{y}_i\right)^2} = \bf{y}^T\bf{y} - \hat{\beta}^T\bf{X}^T\bf{y}
Residual Standard Error: SE_{resid} = \sqrt{\frac{SSE}{n-m}}
Coefficient of determination: R^2=\frac{SSReg}{SST}
Adjusted coefficient of determination: R^2_{adj}=1-\left((1-R^2)\frac{n-1}{n-m}\right)
Test Statistic: F^* = \frac{R^2/(m-1)}{(1-R^2)/(n-m)}

The test statistic follows the F distribution; the p-value is P(X > F^*), where X \sim F(m-1,n-m).
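
The following minimal sketch traces the full-model quantities above in matrix form; it is illustrative only and assumes a hypothetical design matrix X whose first column is the intercept term.

    import numpy as np
    from scipy import stats

    # Hypothetical response and two regressors; the first column of X is the intercept
    y = np.array([3.1, 4.0, 5.2, 4.8, 6.1, 5.5])
    X = np.column_stack([np.ones(6),
                         [0, 1, 2, 1, 2, 0],
                         [35, 31, 37, 32, 36, 33]])

    n = len(y)
    m = np.linalg.matrix_rank(X)                      # rank of the coefficient matrix

    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)      # (X^T X)^{-1} X^T y
    sse = y @ y - beta_hat @ X.T @ y                  # SSE
    ss_reg = beta_hat @ X.T @ y - (y.sum() ** 2) / n  # SSReg
    r2 = ss_reg / (ss_reg + sse)
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - m)
    se_resid = np.sqrt(sse / (n - m))

    f_star = (r2 / (m - 1)) / ((1 - r2) / (n - m))
    p_value = stats.f.sf(f_star, m - 1, n - m)        # X ~ F(m-1, n-m)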

Full Versus Reduced Model Regression Equation

In the full versus reduced model regression equation, the regression sums of squares are calculated for both the reduced model and the full model in the same way that they are calculated for a regression on a single model. An F test is then performed to assess the significance of the full model versus the reduced model.

The test performed is a model comparison test, in which the null hypothesis is that the reduced model is the true model and the full model is not necessary.

The sums of squares and mean sum of squared errors for the reduced model are calculated as follows:

Number of Observations: n
Rank of the Reduced Model Coefficient Matrix: r
Mean of the response: \bar{y}=\frac{1}{n}\sum^n_{i=1}{y_i}
Mean of the predictors or genotypes: \bar{x}=\frac{1}{n}\sum^n_{i=1}{x_i}
Solution to the normal equations: \hat{\beta}_R = \frac{\sum^n_{i=1}(x_i -\bar{x})(y_i - \bar{y})}{\sum^n_{i=1}(x_i-\bar{x})^2} = (\bf{X}^T\bf{X})^{-1}\bf{X}^T\bf{y}
  where \hat{\beta}_R \sim N(\beta,\sigma^2(\bf{X}^T\bf{X})^{-1})
Total Sum of Squares: SST_R = \sum^n_{i=1}{y^2_i}-\frac{1}{n}\left(\sum^n_{i=1}{y_i}\right)^2=SSReg_R + SSE_R
Regression Sum of Squares: SSReg_R = \sum^n_{i=1}{\left(\hat{y}_i-\bar{y}\right)^2} = \hat{\beta}^T_R\bf{X}^T\bf{y}- \frac{1}{n}\left(\bf{y}^T\bf{J}\bf{y}\right)
  where \bf{J} is a matrix of ones
Error Sum of Squares: SSE_R = \sum^n_{i=1}{\left(y_i-\hat{y}_i\right)^2} = \bf{y}^T\bf{y} - \hat{\beta}^T_R\bf{X}^T\bf{y}

The sums of squares and mean sum of squared errors for the full model are calculated similarly:

Number of Observations: n
Rank of the Full Model Coefficient Matrix: m
Mean of the response: \bar{y}=\frac{1}{n}\sum^n_{i=1}{y_i}
Mean of the predictors or genotypes: \bar{x}=\frac{1}{n}\sum^n_{i=1}{x_i}
Solution to the normal equations: \hat{\beta}_F = \frac{\sum^n_{i=1}(x_i -\bar{x})(y_i - \bar{y})}{\sum^n_{i=1}(x_i-\bar{x})^2} = (\bf{X}^T\bf{X})^{-1}\bf{X}^T\bf{y}
  where \hat{\beta} \sim N(\beta,\sigma^2(\bf{X}^T\bf{X})^{-1})
Total Sum of Squares: SST_F = \sum^n_{i=1}{y^2_i}-\frac{1}{n}\left(\sum^n_{i=1}{y_i}\right)^2=SSReg_F + SSE_F
Regression Sum of Squares: SSReg_F = \sum^n_{i=1}{\left(\hat{y}_i-\bar{y}\right)^2} = \hat{\beta}^T_F\bf{X}^T\bf{y}- \frac{1}{n}\left(\bf{y}^T\bf{J}\bf{y}\right)
  where \bf{J} is a matrix of ones
Error Sum of Squares: SSE_F = \sum^n_{i=1}{\left(y_i-\hat{y}_i\right)^2} = \bf{y}^T\bf{y} - \hat{\beta}^T_F\bf{X}^T\bf{y}
Residual Standard Error: SE_{resid} = \sqrt{\frac{SSE_F}{n-m}}

The test statistic is:

F^* =\frac{\left(\frac{SSReg_F}{SSE_F}-\frac{SSReg_R}{SSE_R}\right)\times (n-m)}{\left(1+\frac{SSReg_R}{SSE_R}\right)\times (m-r)}.

The p-value is calculated by: p-value = P(X > F^*) where X \sim F(m-r,n-m).
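
A short sketch of the full-versus-reduced comparison under the same definitions; X_reduced and X_full are hypothetical design matrices that share the intercept and the reduced-model covariate, with the full model adding one extra covariate. The F statistic is computed in the algebraically equivalent form ((SSE_R - SSE_F)/(m-r)) / (SSE_F/(n-m)).

    import numpy as np
    from scipy import stats

    def reg_sums(X, y):
        """Return (SSReg, SSE, rank) for design matrix X and response y."""
        n = len(y)
        beta = np.linalg.solve(X.T @ X, X.T @ y)
        sse = y @ y - beta @ X.T @ y
        ss_reg = beta @ X.T @ y - (y.sum() ** 2) / n
        return ss_reg, sse, np.linalg.matrix_rank(X)

    # Hypothetical data: reduced model = intercept + age; full model adds dose
    y = np.array([3.1, 4.0, 5.2, 4.8, 6.1, 5.5])
    age = np.array([35, 31, 37, 32, 36, 33], dtype=float)
    dose = np.array([0, 1, 2, 0, 1, 2], dtype=float)
    X_reduced = np.column_stack([np.ones(6), age])
    X_full = np.column_stack([np.ones(6), age, dose])

    ss_reg_r, sse_r, r = reg_sums(X_reduced, y)
    ss_reg_f, sse_f, m = reg_sums(X_full, y)
    n = len(y)

    f_star = ((sse_r - sse_f) / (m - r)) / (sse_f / (n - m))
    p_value = stats.f.sf(f_star, m - r, n - m)        # X ~ F(m-r, n-m)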

Regressor Statistics

The coefficient of the j^{th} regressor is calculated with the equation:

b_j = \frac{\sum^n_{i=1}{(x_{i,j}-\bar{x}_j)*(y_i-\bar{y})}}{\sum^n_{i=1}{(x_{i,j}-\bar{x}_j)^2}}

where n is the sample size, \bar{x}_j is the mean of the j^{th} regressor and \bar{y} is the mean of the response.

The Y-intercept of the regression equation is calculated with the equation:

b_0 =  \bar{y}-\sum^k_{j=1}{b_j\bar{x}_j}

where k is the number of regressors, b_j is the coefficient and \bar{x}_j is the mean of the j^{th} regressor.

The standard error for the j^{th} regressor is computed by taking a full model regression equation with all regressors less the j^{th} regressor; for the purposes of this calculation, the j^{th} regressor is set as the dependent variable. Let SSR_j=\sum^n_{i=1}{(x_{i,j}-\bar{x}_j)^2} be the regressor sum of squares, R^2_j be the coefficient of determination for the model of the j^{th} regressor versus all other regressors, MSE be the mean square error for the regression model, and SSE be the error sum of squares. Let the total number of regressors in the model be k. Then the standard error of the regressor, SE_j, is calculated as follows:

SE_j =\sqrt{ \frac{MSE}{(1-R^2_j)\times SSR_j}}=\sqrt{\frac{SSE}{(1-R^2_j)\times SSR_j\times (n-k-1)}}=\sqrt{\frac{\sum^n_{i=1}{(y_i-(b_0+\sum^k_{l=1}{b_l x_{i,l}}))^2}}{(1-R^2_j)\times SSR_j \times (n-k-1)}}.

The standard error for the intercept is found as

SE_0 = \sqrt{ MSE (\bf{X}^T\bf{X})^{-1}_{(0,0)} },

where (\bf{X}^T\bf{X})^{-1}_{(0,0)} is the intercept-related element of the inverse of \bf{X}^T\bf{X}, the matrix \bf{X} having been formed from the intercept term (a vector of ones) plus the covariates.

The value of the t-statistic for the j^{th} regressor is obtained from the equation:

t = \frac{\hat{\beta}_j}{SE_j},

where \hat{\beta}_j is the estimated coefficient for the j^{th} regressor.

The p-value of the t-statistic for the j^{th} regressor is the probability of a value as extreme as or more extreme than the observed t-statistic from a Student’s T distribution with n-k-1 degrees of freedom.

P(>|T|) = p-value = 2*P(X > |T|) \text{, where } X\sim t(n-k-1)

The p-value for the univariate fit is obtained from a Student’s T distribution where the t-statistic is calculated assuming that the j^{th} regressor is the only regressor in the model against the dependent variable.
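
The regressor statistics above can be traced with a small sketch, again illustrative only; X is a hypothetical design matrix with an intercept column and k = 2 regressors, and regressor j = 1 is examined.

    import numpy as np
    from scipy import stats

    def fit(X, y):
        """Least-squares coefficients and SSE for design matrix X."""
        beta = np.linalg.solve(X.T @ X, X.T @ y)
        resid = y - X @ beta
        return beta, resid @ resid

    # Hypothetical data: intercept plus k = 2 regressors
    y = np.array([3.1, 4.0, 5.2, 4.8, 6.1, 5.5, 4.2, 5.0])
    x1 = np.array([0, 1, 2, 1, 2, 0, 1, 2], dtype=float)
    x2 = np.array([35, 31, 37, 32, 36, 33, 34, 30], dtype=float)
    X = np.column_stack([np.ones(len(y)), x1, x2])
    n, k = len(y), 2

    beta, sse = fit(X, y)
    mse = sse / (n - k - 1)                           # mean square error of the full model

    # Standard error of regressor j = 1 via its auxiliary regression on the other regressors
    j = 1
    _, sse_aux = fit(X[:, [0, 2]], X[:, j])
    ssr_j = np.sum((X[:, j] - X[:, j].mean()) ** 2)   # SSR_j
    r2_j = 1 - sse_aux / ssr_j                        # R^2_j of x_j versus the other regressors
    se_j = np.sqrt(mse / ((1 - r2_j) * ssr_j))

    # Intercept standard error from the (0,0) element of (X^T X)^{-1}
    se_0 = np.sqrt(mse * np.linalg.inv(X.T @ X)[0, 0])

    t = beta[j] / se_j
    p_value = 2 * stats.t.sf(abs(t), n - k - 1)       # two-sided p-value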

Categorical Covariates and Interaction Terms

If a covariate is categorical, dummy variables are used to indicate the category of the covariate. A value of “1” indicates that the observation belongs to the category the dummy variable represents, and a value of “0” indicates that it does not. Because the values of one dummy variable can be determined by examining all of the other dummy variables for a covariate, in most cases the last dummy variable is dropped. This avoids using a rank-deficient matrix in the regression equation.

A first-order interaction term is a new covariate created from the product of two covariates specified in either the full- or reduced-model covariates. If one of the two covariates is categorical, the dummy variables for each of its categories are multiplied by the other covariate to create the interaction terms. If both covariates are categorical, the dummy variables from the two covariates are multiplied by each other.

For example, consider the following covariates for six samples.

Sample Lab Dose Age
sample01 A Low 35
sample02 A Med 31
sample03 A High 37
sample04 B Low 32
sample05 B Med 36
sample06 B High 33

Using dummy variables for the categorical covariates the above table would be:

Sample Lab=A Lab=B Dose=Low Dose=Med Dose=High Age
sample01 1 0 1 0 0 35
sample02 1 0 0 1 0 31
sample03 1 0 0 0 1 37
sample04 0 1 1 0 0 32
sample05 0 1 0 1 0 36
sample06 0 1 0 0 1 33

Interactions Lab*Dose and Lab*Age would be specified as:

Sample A*Low A*Med A*High B*Low B*Med B*High A*Age B*Age
sample01 1 0 0 0 0 0 35 0
sample02 0 1 0 0 0 0 31 0
sample03 0 0 1 0 0 0 37 0
sample04 0 0 0 1 0 0 0 32
sample05 0 0 0 0 1 0 0 36
sample06 0 0 0 0 0 1 0 33
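
A hedged sketch of how such dummy and interaction columns could be built with pandas (the application's own encoding is described above and may differ in detail); the DataFrame simply reproduces the sample table.

    import pandas as pd

    df = pd.DataFrame({
        "Sample": ["sample01", "sample02", "sample03", "sample04", "sample05", "sample06"],
        "Lab":    ["A", "A", "A", "B", "B", "B"],
        "Dose":   ["Low", "Med", "High", "Low", "Med", "High"],
        "Age":    [35, 31, 37, 32, 36, 33],
    })

    # One dummy column per category (all categories kept here; drop one per covariate
    # before regression to avoid a rank-deficient matrix)
    dummies = pd.get_dummies(df[["Lab", "Dose"]]).astype(int)

    # Lab*Dose interaction: the product of every Lab dummy with every Dose dummy
    lab_cols = [c for c in dummies.columns if c.startswith("Lab_")]
    dose_cols = [c for c in dummies.columns if c.startswith("Dose_")]
    for lab in lab_cols:
        for dose in dose_cols:
            dummies[lab + "*" + dose] = dummies[lab] * dummies[dose]

    # Lab*Age interaction: each Lab dummy multiplied by the numeric Age covariate
    for lab in lab_cols:
        dummies[lab + "*Age"] = dummies[lab] * df["Age"]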

Stepwise Regression

If only a few variables (regressors or covariates) drive the outcome of the response, Stepwise Regression can isolate these variables. The two types of stepwise regression, forward selection and backward elimination, are described below.

Forward Selection

Starting with either the null model or the reduced model (depending on which type of regression was specified), successive models are created, each one using one more regressor (or covariate) than the previous model.

Each of the unused regressors is added to the current model to create a “trial” model for that regressor. The p-value of the trial model (the full model in this comparison) versus the current model (the reduced model) is calculated, and the trial model with the smallest p-value is used as the next model; that is, the most significant remaining variable is added to the current model. If the current model has the smallest p-value, or if no p-value is below the specified p-value cut-off, the forward selection method stops and the current model is declared the final model. If the model containing all regressors has the smallest p-value, that full model is the final model.

From the standpoint of further analysis, the final model becomes the “full model” for this set of potential regressors.
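
The forward-selection loop can be sketched as follows, illustrative only; the candidate regressors, response, and cut-off are hypothetical, the starting model is the null (intercept-only) model, and each comparison uses the full-versus-reduced F test described above. Backward elimination follows the same pattern with removal instead of addition.

    import numpy as np
    from scipy import stats

    def full_vs_reduced_pvalue(X_full, X_reduced, y):
        """p-value of the full-versus-reduced model F test."""
        n = len(y)

        def sums(X):
            beta = np.linalg.solve(X.T @ X, X.T @ y)
            return y @ y - beta @ X.T @ y, np.linalg.matrix_rank(X)   # SSE, rank

        sse_f, m = sums(X_full)
        sse_r, r = sums(X_reduced)
        f_star = ((sse_r - sse_f) / (m - r)) / (sse_f / (n - m))
        return stats.f.sf(f_star, m - r, n - m)

    # Hypothetical data: three candidate regressors, only x1 truly drives the response
    rng = np.random.default_rng(0)
    n = 40
    cand = {name: rng.normal(size=n) for name in ["x1", "x2", "x3"]}
    y = 2.0 + 1.5 * cand["x1"] + rng.normal(size=n)

    cutoff = 0.05
    chosen = []                                       # start from the intercept-only model
    while True:
        current = np.column_stack([np.ones(n)] + [cand[c] for c in chosen])
        trials = {name: full_vs_reduced_pvalue(
                      np.column_stack([current, cand[name]]), current, y)
                  for name in cand if name not in chosen}
        if not trials:
            break                                     # every regressor is already in the model
        best = min(trials, key=trials.get)            # most significant remaining regressor
        if trials[best] >= cutoff:
            break                                     # nothing left passes the cut-off
        chosen.append(best)
    print("Final model regressors:", chosen)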

Backward Elimination

Starting with the full model, successive models are created, each one using one less regressor (or covariate) than the previous model.

Each of the regressors currently in the model is removed in turn to create a “trial” model excluding that regressor. The p-value of the current model (the full model in this comparison) versus the trial model (the reduced model) is calculated, and the trial model with the largest p-value is used as the next model; that is, the least significant variable is removed from the current model. If every p-value is smaller than the specified p-value cut-off, the backward elimination method stops. The method also stops if all variables have been removed from the model, or if all variables left are included in the original reduced model.

From the standpoint of further analysis, the final model becomes the “full model” for this set of potential regressors.

Binomial Predictor

In this case, all of the observations with zero as the predictor variable are placed in one group, and all of the observations with a one as the predictor variable are placed in a second group. A two-sample t-test is used to determine the probability that the two groups have the same mean.

Univariate Case (Student’s T-Test)

Suppose you have n items that are split into two groups of sizes n_0 and n_1, and the respective sums of their continuous responses are s_0 and s_1. Further, let S be the sum of the squared responses, S=\sum^n_{i=1}{y^2_i}, and let

SD = \frac{S-\frac{s^2_0}{n_0} - \frac{s^2_1}{n_1}}{n_0 + n_1 - 2}.

We can then calculate the t statistic:

T = \frac{\left(\frac{s_1}{n_1}-\frac{s_0}{n_0}\right)}{\sqrt{\frac{SD}{n_0}+\frac{SD}{n_1}}},

where the p-value is given by the tails of a two-sided Student’s t distribution with n_0 + n_1 - 2 degrees of freedom:

p-value = 2 \times P(X \geq |T|) where X \sim t(n_0+n_1-2).
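
A minimal sketch of the univariate case from the group sums, with hypothetical responses for the predictor = 0 and predictor = 1 groups:

    import numpy as np
    from scipy import stats

    # Hypothetical responses split by a 0/1 predictor
    y0 = np.array([1.2, 0.9, 1.5, 1.1])               # predictor == 0 group
    y1 = np.array([2.1, 1.8, 2.4, 2.0, 1.9])          # predictor == 1 group

    n0, n1 = len(y0), len(y1)
    s0, s1 = y0.sum(), y1.sum()
    S = np.sum(y0 ** 2) + np.sum(y1 ** 2)             # sum of squared responses over all n items

    SD = (S - s0 ** 2 / n0 - s1 ** 2 / n1) / (n0 + n1 - 2)   # pooled variance estimate
    T = (s1 / n1 - s0 / n0) / np.sqrt(SD / n0 + SD / n1)
    p_value = 2 * stats.t.sf(abs(T), n0 + n1 - 2)     # 2 * P(X >= |T|), X ~ t(n0+n1-2)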

Multivariate Case (Hotelling’s T^2 Test)

Suppose you have n observations, each with a k-dimensional response in an n \times k matrix, Y. All the observations with zero as the predictor variable are placed into one group, g_0, of size n_0, and all of the observations with a one as the predictor variable are placed into a second group, g_1, of size n_1. Note that n=n_0+n_1. Then a Hotelling T^2 statistic is computed to compare the multivariate continuous responses in the two groups. Let the k-dimensional vector M contain the means of responses 1,\ldots,k,

M_j =\frac{\sum^n_{i=1}{Y_{i,j}}}{n}.

Define the n \times k matrix S of centered responses, where

S_{i,j} = Y_{i,j} - M_j.

Define the lm^{th} entry of the k \times k covariance matrix A as follows:

A_{l,m} = \frac{\sum^n_{i=1}{S_{i,l} S_{i,m}}}{n-1}.

Define the j^{th} entry of the k \times 1 vector b as follows:

b_j =\frac{\sum^n_{i=1}{Y_{i,j}} - \sum_{i \in g_1}{Y_{i,j}}}{n_0} - \frac{\sum_{i \in g_1}{Y_{i,j}}}{n_1}.

The matrix equation Ax=b is solved and the T^2 statistic is calculated as follows:

T^2 = \left|\frac{(n-2)\frac{n_0 n_1}{n}x^T b}{n-1-\frac{n_0 n_1}{n}x^T b}\right|\frac{n-k-1}{k(n-2)}.

The p-value is computed using the F distribution with k and n-k-1 degrees of freedom:

p-value = P(X > T^2),

where X \sim F(k, n-k-1).
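
A minimal sketch of the multivariate case, following the M, S, A, b, and x definitions above with a hypothetical k = 2 dimensional response:

    import numpy as np
    from scipy import stats

    # Hypothetical k = 2 dimensional responses and a 0/1 predictor
    Y = np.array([[1.2, 3.4], [0.9, 3.1], [1.5, 3.8], [1.1, 3.3],
                  [2.1, 4.6], [1.8, 4.2], [2.4, 4.9], [2.0, 4.4]])
    group = np.array([0, 0, 0, 0, 1, 1, 1, 1])

    n, k = Y.shape
    n0, n1 = int(np.sum(group == 0)), int(np.sum(group == 1))

    M = Y.mean(axis=0)                                # overall mean of each response
    S = Y - M                                         # centered responses
    A = (S.T @ S) / (n - 1)                           # k x k covariance matrix
    b = Y[group == 0].mean(axis=0) - Y[group == 1].mean(axis=0)

    x = np.linalg.solve(A, b)                         # solve Ax = b
    q = (n0 * n1 / n) * (x @ b)
    T2 = abs((n - 2) * q / (n - 1 - q)) * (n - k - 1) / (k * (n - 2))
    p_value = stats.f.sf(T2, k, n - k - 1)            # X ~ F(k, n-k-1)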