# 3.3. Linear Regression

## 3.3.1. Genotype or Numeric Association Test – Simple Linear Regression

The response, $y$, is fit to every genetic predictor variable or encoded genotype, $x$, in the spreadsheet using linear regression. The results include the regression p-value, intercept, and slope, which are output in a new spreadsheet along with other genotypic association test results and any multiple correction results. The response is represented with the formula $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, where the model is represented by the expression $\beta_0 + \beta_1 x_i$ and the error term, $\epsilon_i$, expresses the difference, or residual, between the model of the response and the response itself. Samples for which either the predictor or the response has a missing value are left out of the regression.

The regression hypothesis test is the test of:

$$H_0\colon \beta_1 = 0 \quad \text{versus} \quad H_A\colon \beta_1 \neq 0$$

Assumptions:

- $E[\epsilon_i] = 0$ for all $i$, where the $\epsilon_i$ denote the residuals
- the $\epsilon_i$ are all independent and follow a normal distribution
- the $\epsilon_i$ all have equal variance $\sigma^2$

The sums of squares and mean sum of squared errors are calculated as follows:

- Number of Observations: $n$
- The Coefficient Matrix: $X = \begin{bmatrix} \mathbf{1} & \mathbf{x} \end{bmatrix}$, where $\mathbf{1}$ is a column vector of 1's, and $\mathbf{x}$ is a column vector of predictor values
- Rank of the Coefficient Matrix: $r = \operatorname{rank}(X)$ (equal to 2 when the predictor is not constant)
- Mean of the response: $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$
- Mean of the predictors or (numerically encoded) genotypes: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
- Solution to the normal equations: $\hat{\beta} = (X^T X)^{-1} X^T y$ and $\hat{y} = X \hat{\beta}$ (Our assumptions imply that $E[\hat{\beta}] = \beta$, where $\beta$ is the true coefficient vector for the model.)
- Regression Sum of Squares: $SSR = \hat{\beta}^T X^T y - \frac{1}{n} y^T J y$, where $J$ is a matrix of ones
- Error Sum of Squares (also called Residual Sum of Squares, RSS): $SSE = y^T y - \hat{\beta}^T X^T y$
- Total Sum of Squares (also abbreviated TSS): $SST = y^T y - \frac{1}{n} y^T J y$
- Coefficient of determination: $R^2 = \frac{SSR}{SST}$
- Adjusted coefficient of determination: $R^2_{adj} = 1 - \left( \frac{n-1}{n-r} \right)(1 - R^2)$
- Test Statistic: $F^* = \frac{SSR/(r-1)}{SSE/(n-r)}$

The test statistic $F^*$ follows the F distribution with $r - 1$ and $n - r$ degrees of freedom; the p-value is $p = P(F_{r-1,\,n-r} > F^*)$.
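The quantities above can be sketched in a few lines of NumPy and SciPy. This is an illustrative reconstruction, not the application's own code; the function name `simple_linear_regression` is hypothetical.

```python
# Sketch of the simple linear regression quantities above: the coefficient
# matrix X = [1  x], the normal-equation solution, SSR/SSE/SST, R^2, and
# the F test with (r - 1, n - r) degrees of freedom.
import numpy as np
from scipy import stats

def simple_linear_regression(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(y)
    X = np.column_stack([np.ones(n), x])          # coefficient matrix [1  x]
    r = np.linalg.matrix_rank(X)                  # rank (2 for non-constant x)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # normal equations
    y_hat = X @ beta_hat
    sse = float(np.sum((y - y_hat) ** 2))         # residual sum of squares
    sst = float(np.sum((y - y.mean()) ** 2))      # total sum of squares
    ssr = sst - sse                               # regression sum of squares
    f_stat = (ssr / (r - 1)) / (sse / (n - r))
    p_value = stats.f.sf(f_stat, r - 1, n - r)
    return {"intercept": beta_hat[0], "slope": beta_hat[1],
            "R2": ssr / sst, "F": f_stat, "p": p_value}
```

Samples with a missing predictor or response would be dropped before calling such a helper, mirroring the behavior described above.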

## 3.3.2. Multiple Linear Regression Model

The Regression Analysis window can perform regression on one regressor (simple linear regression) or more than one regressor (multiple linear regression). A multiple linear regression model fits the multiple regressors (independent variables) to one dependent variable, and may be expressed as $y = X\beta + \epsilon$, where $y$ is the response vector, $X = \begin{bmatrix} \mathbf{1} & x_1 & \cdots & x_k \end{bmatrix}$ is the matrix where $\mathbf{1}$ represents a column vector of 1's to correspond to the intercept and each $x_j$ represents one regressor in column-vector form, and $\beta = (\beta_0, \beta_1, \ldots, \beta_k)^T$, where $\beta_0$ is the intercept term and each $\beta_j$ is a regression coefficient. This model is a generalization of the simple linear regression model used for linear regression in the association test dialogs.

### Full Model Only Regression Equation

The regression hypothesis test is the test of:

$$H_0\colon \beta_1 = \beta_2 = \cdots = \beta_k = 0 \quad \text{versus} \quad H_A\colon \beta_j \neq 0 \text{ for at least one } j$$

Assumptions:

- $E[\epsilon_i] = 0$ for all $i$, where the $\epsilon_i$ denote the residuals
- the $\epsilon_i$ are all independent and follow a normal distribution
- the $\epsilon_i$ all have equal variance $\sigma^2$

The sums of squares and mean sum of squared errors are calculated as follows:

- Number of Observations: $n$
- The Coefficient Matrix: $X = \begin{bmatrix} \mathbf{1} & x_1 & \cdots & x_k \end{bmatrix}$
- Rank of the Coefficient Matrix: $r = \operatorname{rank}(X)$ (equal to $k + 1$ when the columns are linearly independent)
- Mean of the response: $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$
- Solution to the normal equations: $\hat{\beta} = (X^T X)^{-1} X^T y$ (Our assumptions imply that $E[\hat{\beta}] = \beta$, where $\beta$ is the true coefficient vector for the model.)
- Regression Sum of Squares: $SSR = \hat{\beta}^T X^T y - \frac{1}{n} y^T J y$, where $J$ is a matrix of ones
- Error Sum of Squares (also called Residual Sum of Squares, RSS): $SSE = y^T y - \hat{\beta}^T X^T y$
- Total Sum of Squares (also abbreviated TSS): $SST = y^T y - \frac{1}{n} y^T J y$
- Coefficient of determination: $R^2 = \frac{SSR}{SST}$
- Adjusted coefficient of determination: $R^2_{adj} = 1 - \left( \frac{n-1}{n-r} \right)(1 - R^2)$
- Test Statistic: $F^* = \frac{SSR/(r-1)}{SSE/(n-r)}$

The test statistic $F^*$ follows the F distribution with $r - 1$ and $n - r$ degrees of freedom; the p-value is $p = P(F_{r-1,\,n-r} > F^*)$.
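As a sketch (a hypothetical helper, not the application's code), the full-model F test generalizes directly to a coefficient matrix with several regressor columns:

```python
# Sketch of the full-model F test above for k regressors: rank r of X is
# k + 1 when the columns are linearly independent, and
# F* = (SSR / (r - 1)) / (SSE / (n - r)).
import numpy as np
from scipy import stats

def overall_f_test(X, y):
    """X includes a leading column of ones for the intercept."""
    n = len(y)
    r = np.linalg.matrix_rank(X)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares solution
    resid = y - X @ beta
    sse = float(resid @ resid)                    # residual sum of squares
    sst = float(np.sum((y - y.mean()) ** 2))      # total sum of squares
    ssr = sst - sse                               # regression sum of squares
    f_stat = (ssr / (r - 1)) / (sse / (n - r))
    return f_stat, stats.f.sf(f_stat, r - 1, n - r)
```

With a single regressor column this reduces to the simple linear regression test of the previous section.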

### Full Versus Reduced Model Regression Equation

In the full versus reduced model regression equation, the regression sums of squares are calculated for both the reduced and the full model in the same way that they are calculated for a regression on a single model. An F test is then performed to find the significance of the full versus the reduced model.

The hypothesis tested is the model comparison test, in which the null hypothesis is that the reduced model is the true model and that the full model is not necessary.

The sums of squares and mean sum of squared errors for the reduced model are calculated as follows:

- Number of Observations: $n$
- Number of Reduced-Model Regressors: $p$
- The Coefficient Matrix: $X_R = \begin{bmatrix} \mathbf{1} & x_1 & \cdots & x_p \end{bmatrix}$, where $x_j$ is the $j$-th reduced-model regressor
- Rank of the Reduced Model Coefficient Matrix: $r_R = \operatorname{rank}(X_R)$
- Mean of the response: $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$
- Solution to the normal equations: $\hat{\beta}_R = (X_R^T X_R)^{-1} X_R^T y$, where $E[\hat{\beta}_R] = \beta_R$ is the true coefficient vector for the reduced model
- Regression Sum of Squares: $SSR_R = \hat{\beta}_R^T X_R^T y - \frac{1}{n} y^T J y$, where $J$ is a matrix of ones
- Error Sum of Squares (also called Residual Sum of Squares, RSS): $SSE_R = y^T y - \hat{\beta}_R^T X_R^T y$
- Total Sum of Squares (also abbreviated TSS): $SST = y^T y - \frac{1}{n} y^T J y$

The sums of squares and mean sum of squared errors for the full model are calculated similarly:

- Number of Observations: $n$
- Number of Full-Model-Only Regressors: $q$
- The Coefficient Matrix: $X_F = \begin{bmatrix} \mathbf{1} & x_1 & \cdots & x_{p+q} \end{bmatrix}$, where $x_j$ is either the $j$-th reduced-model regressor ($j \le p$) or, if $j > p$, full-model-only regressor number $j - p$
- Rank of the Full Model Coefficient Matrix: $r_F = \operatorname{rank}(X_F)$
- Mean of the response: $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$
- Solution to the normal equations: $\hat{\beta}_F = (X_F^T X_F)^{-1} X_F^T y$, where $E[\hat{\beta}_F] = \beta_F$ is the true coefficient vector for the full model
- Regression Sum of Squares: $SSR_F = \hat{\beta}_F^T X_F^T y - \frac{1}{n} y^T J y$, where $J$ is a matrix of ones
- Error Sum of Squares (also called Residual Sum of Squares, RSS): $SSE_F = y^T y - \hat{\beta}_F^T X_F^T y$
- Total Sum of Squares (also abbreviated TSS): $SST = y^T y - \frac{1}{n} y^T J y$

The test statistic is:

$$F^* = \frac{(SSE_R - SSE_F)/(r_F - r_R)}{SSE_F/(n - r_F)}$$

Another way of expressing this is:

$$F^* = \frac{(SSR_F - SSR_R)/(r_F - r_R)}{SSE_F/(n - r_F)}$$

The p-value is calculated by $p = P(F_{r_F - r_R,\, n - r_F} > F^*)$, where $F_{r_F - r_R,\, n - r_F}$ follows the F distribution with $r_F - r_R$ and $n - r_F$ degrees of freedom.
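The full-versus-reduced test can be sketched as follows, assuming the reduced-model columns are a subset of the full-model columns; the function names are hypothetical, not the application's API.

```python
# Sketch of the full-versus-reduced F test above: compares SSE_R and SSE_F
# with (r_F - r_R, n - r_F) degrees of freedom.
import numpy as np
from scipy import stats

def sse(X, y):
    """Residual sum of squares from a least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def full_vs_reduced_f_test(X_reduced, X_full, y):
    n = len(y)
    r_R = np.linalg.matrix_rank(X_reduced)
    r_F = np.linalg.matrix_rank(X_full)
    sse_R, sse_F = sse(X_reduced, y), sse(X_full, y)
    f_stat = ((sse_R - sse_F) / (r_F - r_R)) / (sse_F / (n - r_F))
    return f_stat, stats.f.sf(f_stat, r_F - r_R, n - r_F)
```

When the reduced model is the intercept-only (null) model, this reproduces the overall F test of the full model.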

### Regressor Statistics

The coefficient of the $j$-th regressor is calculated with the equation:

$$\hat{\beta}_j = \frac{\sum_{i=1}^{n} x_{ij} y_i - n \bar{x}_j \bar{y}}{\sum_{i=1}^{n} x_{ij}^2 - n \bar{x}_j^2}$$

where $n$ is the sample size, $\bar{x}_j$ is the mean of the $j$-th regressor and $\bar{y}$ is the mean of the response.

The Y-intercept of the regression equation is calculated with the equation:

$$\hat{\beta}_0 = \bar{y} - \sum_{j=1}^{k} \hat{\beta}_j \bar{x}_j$$

where $k$ is the number of regressors, $\hat{\beta}_j$ is the $j$-th coefficient and $\bar{x}_j$ is the mean of the $j$-th regressor.

The standard error for the $j$-th regressor is computed by taking a full model regression equation with all regressors less the $j$-th regressor. For the purposes of calculating the standard error, the $j$-th regressor is set as the dependent variable. Let $SSR_j$ be the regressor sum of squares, $R_j^2$ be the coefficient of determination for the $j$-th regressor vs. all other regressors model, $MSE$ be the mean square error for the regression model, and $SSE_j$ be the error sum of squares. Let the total number of regressors in the model be $k$. Then the standard error of the $j$-th regressor is calculated as follows:

$$SE(\hat{\beta}_j) = \sqrt{\frac{MSE}{SSE_j}}$$

The standard error for the intercept is found as

$$SE(\hat{\beta}_0) = \sqrt{MSE \cdot \left[ (X^T X)^{-1} \right]_{00}}$$

where $\left[ (X^T X)^{-1} \right]_{00}$ is the intercept-related element of the inverse of $X^T X$, the matrix $X$ having been formed from the intercept term (a vector of ones) plus the covariates.

Note

It may be shown that an alternative way of computing the standard error for regressor $j$ exists which is similar to the formula mentioned above for the standard error for the intercept, namely

$$SE(\hat{\beta}_j) = \sqrt{MSE \cdot \left[ (X^T X)^{-1} \right]_{jj}}$$

where $\left[ (X^T X)^{-1} \right]_{jj}$ is the element of the inverse of $X^T X$ related to the $j$-th covariate.

This follows because first, for any regression,

$$SSE = y^T y - \hat{\beta}^T X^T y = y^T \left( I - X (X^T X)^{-1} X^T \right) y,$$

so for the $j$-th regressor, we have

$$SSE_j = x_j^T \left( I - X_{-j} (X_{-j}^T X_{-j})^{-1} X_{-j}^T \right) x_j,$$

the residual sum of squares of regressing the $j$-th regressor onto the remaining regressors.

If we call the $j$-th regressor $v$, and we call the matrix formed from the remaining covariates $W$, we can write $SSE_j$ as

$$SSE_j = v^T v - v^T W (W^T W)^{-1} W^T v.$$

Meanwhile, if we have block matrix

$$M = \begin{pmatrix} A & B \\ C & D \end{pmatrix}$$

where $A$, $B$, $C$, and $D$ are appropriately-sized submatrices, and $D$ is invertible, it may be shown (using block matrix manipulation and the Schur complement $S = A - B D^{-1} C$ of matrix $D$) that the inverse of this block matrix is

$$M^{-1} = \begin{pmatrix} S^{-1} & -S^{-1} B D^{-1} \\ -D^{-1} C S^{-1} & D^{-1} + D^{-1} C S^{-1} B D^{-1} \end{pmatrix}.$$

Now define

$$\tilde{X} = \begin{bmatrix} v & W \end{bmatrix}.$$

(Think of $\tilde{X}$ as the result of moving the $j$-th regressor from the $j$-th position in $X$ to the first position in $\tilde{X}$.)

If we let

$$M = \tilde{X}^T \tilde{X} = \begin{pmatrix} v^T v & v^T W \\ W^T v & W^T W \end{pmatrix},$$

we get

$$S = v^T v - v^T W (W^T W)^{-1} W^T v = SSE_j.$$

We now have

$$\left[ (\tilde{X}^T \tilde{X})^{-1} \right]_{11} = S^{-1} = \frac{1}{SSE_j},$$

and since reordering the columns of $X$ simply permutes the rows and columns of $(X^T X)^{-1}$ accordingly, $\left[ (X^T X)^{-1} \right]_{jj} = 1 / SSE_j$. Thus,

$$SE(\hat{\beta}_j) = \sqrt{\frac{MSE}{SSE_j}} = \sqrt{MSE \cdot \left[ (X^T X)^{-1} \right]_{jj}}.$$

The value of the $t$-statistic for the $j$-th regressor is obtained from the equation:

$$t_j = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)}$$

where $\hat{\beta}_j$ is the estimated coefficient for the $j$-th regressor.

The p-value of the $t$-statistic for the $j$-th regressor is the probability of a value as extreme or more extreme than the observed $t$-statistic from a Student's $t$ distribution with $n - k - 1$ degrees of freedom.

The p-value for the univariate fit is obtained from a Student's $t$ distribution where the $t$-statistic is calculated assuming that the $j$-th regressor is the only regressor in the model against the dependent variable.
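The per-regressor statistics above can be sketched using the diagonal of $(X^T X)^{-1}$; `regressor_stats` is a hypothetical helper name, not the application's API.

```python
# Sketch of the regressor statistics above: SE(beta_j) = sqrt(MSE * c_jj),
# where c_jj is the j-th diagonal element of (X^T X)^{-1}, and
# t_j = beta_j / SE(beta_j) with n - k - 1 degrees of freedom.
import numpy as np
from scipy import stats

def regressor_stats(X, y):
    n, p = X.shape                        # p columns, including the intercept
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y              # normal-equation solution
    resid = y - X @ beta
    mse = float(resid @ resid) / (n - p)  # mean square error
    se = np.sqrt(mse * np.diag(XtX_inv))  # standard errors (incl. intercept)
    t = beta / se
    pvals = 2 * stats.t.sf(np.abs(t), n - p)
    return beta, se, t, pvals
```

For a single-regressor model, the squared $t$-statistic of the slope equals the overall F statistic, which provides a quick consistency check.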

### Categorical Covariates and Interaction Terms

If a covariate is categorical, dummy variables are used to indicate the category of the covariate. A value of "1" for an observation indicates that it is equal to the category the dummy variable represents; if the observation is not equal to the category for the dummy variable, it is assigned the value "0". Because the values of one dummy variable can be determined by examining all other dummy variables for a covariate, in most cases the last dummy variable is dropped. This avoids using a rank-deficient matrix in the regression equation.

A first-order interaction term is a new covariate created from the product of two covariates specified in either the full- or reduced-model covariates. If one of the two covariates is categorical, the dummy variables for each of its categories are multiplied by the other covariate to create the first-order interaction terms. If both covariates are categorical, the dummy variables from both covariates are multiplied by each other.

For example, consider the following covariates for six samples.

| Sample   | Lab | Dose | Age |
|----------|-----|------|-----|
| sample01 | A   | Low  | 35  |
| sample02 | A   | Med  | 31  |
| sample03 | A   | High | 37  |
| sample04 | B   | Low  | 32  |
| sample05 | B   | Med  | 36  |
| sample06 | B   | High | 33  |

Using dummy variables for the categorical covariates, the above table would be:

| Sample   | Lab=A | Lab=B | Dose=Low | Dose=Med | Dose=High | Age |
|----------|-------|-------|----------|----------|-----------|-----|
| sample01 | 1     | 0     | 1        | 0        | 0         | 35  |
| sample02 | 1     | 0     | 0        | 1        | 0         | 31  |
| sample03 | 1     | 0     | 0        | 0        | 1         | 37  |
| sample04 | 0     | 1     | 1        | 0        | 0         | 32  |
| sample05 | 0     | 1     | 0        | 1        | 0         | 36  |
| sample06 | 0     | 1     | 0        | 0        | 1         | 33  |

Interactions Lab*Dose and Lab*Age would be specified as:

| Sample   | A\*Low | A\*Med | A\*High | B\*Low | B\*Med | B\*High | A\*Age | B\*Age |
|----------|--------|--------|---------|--------|--------|---------|--------|--------|
| sample01 | 1      | 0      | 0       | 0      | 0      | 0       | 35     | 0      |
| sample02 | 0      | 1      | 0       | 0      | 0      | 0       | 31     | 0      |
| sample03 | 0      | 0      | 1       | 0      | 0      | 0       | 37     | 0      |
| sample04 | 0      | 0      | 0       | 1      | 0      | 0       | 0      | 32     |
| sample05 | 0      | 0      | 0       | 0      | 1      | 0       | 0      | 36     |
| sample06 | 0      | 0      | 0       | 0      | 0      | 1       | 0      | 33     |
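The dummy-variable and interaction coding illustrated in the tables above can be sketched in plain Python. The helper name `dummy_columns` is hypothetical, used only to reproduce the example data.

```python
# Sketch of dummy-variable and first-order interaction coding, reproducing
# the Lab/Dose/Age example above.
def dummy_columns(values):
    """One 0/1 column per category, in order of first appearance."""
    categories = list(dict.fromkeys(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}

lab = ["A", "A", "A", "B", "B", "B"]
dose = ["Low", "Med", "High", "Low", "Med", "High"]
age = [35, 31, 37, 32, 36, 33]

lab_d, dose_d = dummy_columns(lab), dummy_columns(dose)

# Lab*Dose: every dummy of one categorical covariate times every dummy
# of the other.
lab_dose = {f"{a}*{d}": [x * z for x, z in zip(lab_d[a], dose_d[d])]
            for a in lab_d for d in dose_d}

# Lab*Age: each Lab dummy times the numeric covariate.
lab_age = {f"{a}*Age": [x * v for x, v in zip(lab_d[a], age)]
           for a in lab_d}
```

In an actual regression, one dummy variable per categorical covariate would be dropped, as described above, to keep the coefficient matrix full rank.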

### Stepwise Regression

If only a few variables (regressors or covariates) drive the outcome of the response, Stepwise Regression can isolate these variables. The methods for the two types of stepwise regression, forward selection and backward elimination, are described below.

#### Forward Selection

Starting with either the null model or the reduced model (depending on which type of regression was specified), successive models are created, each one using one more regressor (or covariate) than the previous model.

Each of the unused regressors is added to the current model to create a “trial” model for that regressor. The p-value of the trial model (or full model) versus the current model (or reduced model) is calculated, and the model with the smallest p-value is used as the next model. This method adds the next most significant variable to the current model. If the current model had the smallest p-value, or if no p-value is better than the p-value cut-off specified, then the forward selection method stops and declares the current model as the final model as determined by stepwise forward selection. If the model with all regressors has the smallest p-value then this full model is determined to be the final model.

From the standpoint of further analysis, the final model becomes the “full model” for this set of potential regressors.

#### Backward Elimination

Starting with the full model, successive models are created, each one using one less regressor (or covariate) than the previous model.

Each of the regressors currently in the model is removed to create a “trial” model excluding that regressor. The p-value of the current model (or full model) versus the trial model (or reduced model) is calculated, and the model with the smallest p-value is used as the next model. This method removes the least significant variable from the current model. If every p-value is smaller than the p-value cut-off specified, the backward elimination method stops. The method also stops if all variables have been removed from the model, or if all variables left are included in the original reduced model.

From the standpoint of further analysis, the final model becomes the “full model” for this set of potential regressors.
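Forward selection as described above can be sketched as follows; the helper names `f_test_p` and `forward_selection` are hypothetical, and the F test here plays the role of the full-versus-reduced comparison between the trial and current models.

```python
# Sketch of forward selection: at each step, add the candidate regressor
# whose trial model has the smallest full-vs-reduced p-value, stopping when
# no candidate beats the p-value cutoff.
import numpy as np
from scipy import stats

def f_test_p(X_reduced, X_full, y):
    """p-value of the full-vs-reduced F test (reduced columns are a
    subset of the full columns)."""
    n = len(y)
    def sse(X):
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        r = y - X @ b
        return float(r @ r)
    r_R = np.linalg.matrix_rank(X_reduced)
    r_F = np.linalg.matrix_rank(X_full)
    f = ((sse(X_reduced) - sse(X_full)) / (r_F - r_R)) / (sse(X_full) / (n - r_F))
    return stats.f.sf(f, r_F - r_R, n - r_F)

def forward_selection(columns, y, p_cutoff=0.05):
    """columns: dict of name -> 1-D array of candidate regressor values."""
    n = len(y)
    chosen, remaining = [], dict(columns)
    X_cur = np.ones((n, 1))                  # start from the null model
    while remaining:
        trials = {name: f_test_p(X_cur, np.column_stack([X_cur, col]), y)
                  for name, col in remaining.items()}
        best = min(trials, key=trials.get)   # most significant candidate
        if trials[best] >= p_cutoff:
            break                            # no candidate improves enough
        chosen.append(best)
        X_cur = np.column_stack([X_cur, remaining.pop(best)])
    return chosen
```

Backward elimination is the mirror image: start from the full model, and at each step drop the regressor whose removal gives the least significant full-vs-reduced comparison.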

### Binomial Predictor

In this case, all the observations with zero as the predictor variable are placed in one group, and all of the observations with a one as the predictor variable are placed in a second group. A two-sample t-test is used to determine the probability that the two groups have the same mean.

#### Univariate Case (Student's $t$-Test)

Suppose you have $n$ items that are split into two groups of sizes $n_1$ and $n_2$, and the respective sums of their continuous responses are $s_1$ and $s_2$. Further, let $q$ be the sum of the squared responses, $q = \sum_{i=1}^{n} y_i^2$, and let

$$\bar{y}_1 = \frac{s_1}{n_1}, \quad \bar{y}_2 = \frac{s_2}{n_2}, \quad s_p^2 = \frac{q - n_1 \bar{y}_1^2 - n_2 \bar{y}_2^2}{n_1 + n_2 - 2}.$$

We can then calculate the $t$ statistic:

$$t = \frac{\bar{y}_1 - \bar{y}_2}{\sqrt{s_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}$$

where the p-value is given by the tails of a two-sided Student's $t$ distribution with $\nu$ degrees of freedom:

$$p = 2 \, P(T_\nu > |t|), \quad \text{where } \nu = n_1 + n_2 - 2.$$
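The statistic above can be computed directly from the group sums; `two_sample_t_from_sums` is a hypothetical helper name. It agrees with a standard pooled-variance two-sample t-test on the raw responses.

```python
# Sketch of the pooled two-sample t statistic computed from the group
# sums s1, s2 and the overall sum of squared responses q, with
# n1 + n2 - 2 degrees of freedom.
import math
from scipy import stats

def two_sample_t_from_sums(n1, s1, n2, s2, q):
    m1, m2 = s1 / n1, s2 / n2                  # group means
    df = n1 + n2 - 2
    pooled_var = (q - n1 * m1**2 - n2 * m2**2) / df
    t = (m1 - m2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    p = 2 * stats.t.sf(abs(t), df)             # two-sided tail probability
    return t, p
```

Working from sums rather than raw responses lets the statistic be accumulated in a single pass over the data.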

#### Multivariate Case (Hotelling's $T^2$ Test)

Suppose you have $n$ observations, each with an $m$-dimensional response, in an $n \times m$ matrix, $Y$. All the observations with zero as the predictor variable are placed into one group, $G_1$, of size $n_1$, and all of the observations with a one as the predictor variable are placed into a second group, $G_2$, of size $n_2$. Note that $n_1 + n_2 = n$. Then a Hotelling $T^2$ statistic is computed to compare the multivariate continuous responses in the two groups. Let the $m$-dimensional vector $\bar{y}^{(g)}$ contain the means of the responses for group $g$,

$$\bar{y}^{(g)}_j = \frac{1}{n_g} \sum_{i \in G_g} y_{ij}, \quad g = 1, 2.$$

Define the $m$-dimensional vector $d$, where

$$d = \bar{y}^{(1)} - \bar{y}^{(2)}.$$

Define the $(j, k)$ entry of the pooled covariance matrix $S$ as follows:

$$S_{jk} = \frac{1}{n - 2} \left[ \sum_{i \in G_1} \left( y_{ij} - \bar{y}^{(1)}_j \right) \left( y_{ik} - \bar{y}^{(1)}_k \right) + \sum_{i \in G_2} \left( y_{ij} - \bar{y}^{(2)}_j \right) \left( y_{ik} - \bar{y}^{(2)}_k \right) \right].$$

The matrix equation $S z = d$ is solved for $z$ and the $T^2$ statistic is calculated as follows:

$$T^2 = \frac{n_1 n_2}{n} \, d^T z.$$

The p-value is computed using the F distribution with $m$ and $n - m - 1$ degrees of freedom:

$$p = P(F_{m,\, n - m - 1} > F^*), \quad \text{where } F^* = \frac{n - m - 1}{m (n - 2)} \, T^2.$$
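The computation above can be sketched as follows; `hotelling_t2` is a hypothetical helper name. For $m = 1$, the resulting $T^2$ reduces to the square of the two-sample $t$ statistic from the univariate case.

```python
# Sketch of the Hotelling T^2 test above: pooled within-group covariance S,
# solve S z = d, T^2 = (n1 n2 / n) d^T z, and the F approximation with
# (m, n - m - 1) degrees of freedom.
import numpy as np
from scipy import stats

def hotelling_t2(Y1, Y2):
    """Y1: (n1, m) responses for group 1; Y2: (n2, m) for group 2."""
    n1, m = Y1.shape
    n2 = Y2.shape[0]
    n = n1 + n2
    d = Y1.mean(axis=0) - Y2.mean(axis=0)       # difference of mean vectors
    W = (Y1 - Y1.mean(axis=0)).T @ (Y1 - Y1.mean(axis=0)) \
      + (Y2 - Y2.mean(axis=0)).T @ (Y2 - Y2.mean(axis=0))
    S = W / (n - 2)                             # pooled covariance matrix
    z = np.linalg.solve(S, d)                   # solve S z = d
    t2 = (n1 * n2 / n) * float(d @ z)
    f = (n - m - 1) / (m * (n - 2)) * t2        # F approximation
    p = stats.f.sf(f, m, n - m - 1)
    return t2, f, p
```

Solving $S z = d$ rather than inverting $S$ is the usual numerically preferable route when $m$ is more than a few dimensions.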