3.3. Linear Regression
3.3.1. Genotype or Numeric Association Test – Simple Linear Regression
The response, \(y\), is fit to every genetic predictor variable or encoded genotype, \(x\), in the spreadsheet using linear regression, and the results include the regression p-value, intercept, and slope, which are output in a new spreadsheet along with other genotypic association test results and any multiple-testing correction results. The response is represented with the formula \(y_i = \beta_0 + \beta_1 x_i + \epsilon_i\), where the model is represented by the expression \(\beta_0 + \beta_1 x_i\) and the error term, \(\epsilon_i\), expresses the difference, or residual, between the model of the response and the response itself. Samples for which either the predictor or the response has a missing value are left out of the regression.
The regression hypothesis test is the test of:

\(H_0\colon \beta_1 = 0\) versus \(H_a\colon \beta_1 \neq 0\)

Assumptions:

\(y_i = \beta_0 + \beta_1 x_i + \epsilon_i\) for all \(i\),

where the \(\epsilon_i\) denote the residuals,

and the \(\epsilon_i\) are all independent and follow a normal distribution,

and the \(\epsilon_i\) all have equal variance \(\sigma^2\).
The sums of squares and mean sum of squared errors are calculated as follows:

Number of Observations: \(n\)

The Coefficient Matrix: \(X = \begin{bmatrix} \mathbf{1} & x \end{bmatrix}\)

where \(\mathbf{1}\) is a column vector of 1's, and \(x\) is a column vector of predictor values

Rank of the Coefficient Matrix: \(p = \operatorname{rank}(X) = 2\)

Mean of the response: \(\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i\)

Mean of the predictors or (numerically encoded) genotypes: \(\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i\)

Solution to the normal equations:

\(\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\) and \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\),

or \(b = (X^T X)^{-1} X^T y\), where \(b = \begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{bmatrix}\)

(Our assumptions imply that \(E[b] = \beta\), where \(\beta\) is the true \(\begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}\) for the model.)

Regression Sum of Squares: \(SSR = b^T X^T y - \frac{1}{n} y^T J y\)

where \(J\) is an \(n \times n\) matrix of ones

Error Sum of Squares (also called Residual Sum of Squares, RSS): \(SSE = y^T y - b^T X^T y\)

Total Sum of Squares (also abbreviated TSS): \(SST = y^T y - \frac{1}{n} y^T J y = SSR + SSE\)

Coefficient of determination: \(R^2 = \frac{SSR}{SST}\)

Adjusted coefficient of determination: \(R^2_{adj} = 1 - (1 - R^2) \frac{n - 1}{n - p}\)

Test Statistic: \(F = \frac{SSR/(p-1)}{SSE/(n-p)} = \frac{MSR}{MSE}\)

The test statistic follows the F distribution with \(df_1\) and \(df_2\) degrees of freedom, where \(df_1 = p - 1\) and \(df_2 = n - p\); the p-value is \(P(F_{df_1, df_2} \geq F)\).
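As an illustration only (this Python snippet is not part of the software, and the \(x\) and \(y\) values are made up), the simple linear regression quantities above can be computed directly:

```python
# Illustrative sketch of the simple linear regression quantities defined
# above, on a small made-up data set (not software code or output).
x = [1.0, 2.0, 3.0, 4.0, 5.0]   # predictor (e.g. numerically encoded genotypes)
y = [2.0, 4.0, 5.0, 4.0, 5.0]   # response

n = len(x)                       # number of observations
p = 2                            # rank of [1 x]: intercept plus one predictor
x_bar = sum(x) / n
y_bar = sum(y) / n

# Solution to the normal equations (slope and intercept).
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = sxy / sxx                   # slope
b0 = y_bar - b1 * x_bar          # intercept

# Sums of squares.
sst = sum((yi - y_bar) ** 2 for yi in y)
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
ssr = sst - sse                  # equals b1**2 * sxx

r2 = ssr / sst                               # coefficient of determination
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p)    # adjusted R^2
f_stat = (ssr / (p - 1)) / (sse / (n - p))   # F with (p - 1, n - p) df

print(b0, b1, r2, f_stat)        # 2.2 0.6 0.6 4.5
```

The p-value would then be obtained from the upper tail of the \(F(1, n-2)\) distribution evaluated at `f_stat`.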
3.3.2. Multiple Linear Regression Model
The Regression Analysis window can perform regression on one regressor (simple linear regression) or more than one regressor (multiple linear regression). A multiple linear regression model fits the multiple regressors (independent variables) to one dependent variable, and may be expressed as \(y = X\beta + \epsilon\), where \(y\) is the response vector, \(X = \begin{bmatrix} \mathbf{1} & x_1 & \cdots & x_k \end{bmatrix}\) is the matrix where \(\mathbf{1}\) represents a column vector of 1's to correspond to the intercept and each \(x_j\) represents one regressor in column-vector form, and \(\beta = \begin{bmatrix} \beta_0 & \beta_1 & \cdots & \beta_k \end{bmatrix}^T\), where \(\beta_0\) is the intercept term and each \(\beta_j\) is a regression coefficient. This model is a generalization of the simple linear regression model used for linear regression in the association test dialogs.
Full Model Only Regression Equation
The regression hypothesis test is the test of:

\(H_0\colon \beta_1 = \beta_2 = \cdots = \beta_k = 0\) versus \(H_a\colon \beta_j \neq 0\) for at least one \(j\)

Assumptions:

\(y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} + \epsilon_i\) for all \(i\),

where the \(\epsilon_i\) denote the residuals,

and the \(\epsilon_i\) are all independent and follow a normal distribution,

and the \(\epsilon_i\) all have equal variance \(\sigma^2\).
The sums of squares and mean sum of squared errors are calculated as follows:

Number of Observations: \(n\)

The Coefficient Matrix: \(X = \begin{bmatrix} \mathbf{1} & x_1 & x_2 & \cdots & x_k \end{bmatrix}\)

Rank of the Coefficient Matrix: \(p = \operatorname{rank}(X)\)

Mean of the response: \(\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i\)

Solution to the normal equations: \(b = (X^T X)^{-1} X^T y\)

(Our assumptions imply that \(E[b] = \beta\), where \(\beta\) is the true \(\begin{bmatrix} \beta_0 & \beta_1 & \cdots & \beta_k \end{bmatrix}^T\) for the model.)

Regression Sum of Squares: \(SSR = b^T X^T y - \frac{1}{n} y^T J y\)

where \(J\) is an \(n \times n\) matrix of ones

Error Sum of Squares (also called Residual Sum of Squares, RSS): \(SSE = y^T y - b^T X^T y\)

Total Sum of Squares (also abbreviated TSS): \(SST = y^T y - \frac{1}{n} y^T J y = SSR + SSE\)

Coefficient of determination: \(R^2 = \frac{SSR}{SST}\)

Adjusted coefficient of determination: \(R^2_{adj} = 1 - (1 - R^2) \frac{n - 1}{n - p}\)

Test Statistic: \(F = \frac{SSR/(p-1)}{SSE/(n-p)} = \frac{MSR}{MSE}\)

The test statistic follows the F distribution with \(df_1\) and \(df_2\) degrees of freedom, where \(df_1 = p - 1\) and \(df_2 = n - p\); the p-value is \(P(F_{df_1, df_2} \geq F)\).
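As an illustration only (this Python snippet is not part of the software; the regressor and response values are made up), the full-model quantities can be computed by solving the normal equations \(b = (X^T X)^{-1} X^T y\) with a small dependency-free linear solver:

```python
# Illustrative sketch of multiple linear regression via the normal
# equations, using a tiny Gaussian-elimination solver (not software code).

def solve(A, v):
    """Solve A w = v by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [v[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (M[r][n] - sum(M[r][c] * w[c] for c in range(r + 1, n))) / M[r][r]
    return w

# Made-up data: two regressors plus an intercept column.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [5.0, 4.0, 2.0, 3.0, 1.0]
y  = [2.0, 3.0, 5.0, 6.0, 9.0]
X = [[1.0, a, c] for a, c in zip(x1, x2)]
n, p = len(y), 3                       # p = rank(X) = number of columns

XtX = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)] for r in range(p)]
Xty = [sum(X[i][r] * y[i] for i in range(n)) for r in range(p)]
b = solve(XtX, Xty)                    # solution to the normal equations

y_bar = sum(y) / n
fitted = [sum(X[i][c] * b[c] for c in range(p)) for i in range(n)]
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
sst = sum((yi - y_bar) ** 2 for yi in y)
ssr = sst - sse

r2 = ssr / sst                               # coefficient of determination
f_stat = (ssr / (p - 1)) / (sse / (n - p))   # F with (p - 1, n - p) df
print(b, r2, f_stat)
```

The residuals of the least-squares solution are orthogonal to every column of \(X\), which is a useful sanity check on any implementation.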
Full Versus Reduced Model Regression Equation
In the full versus reduced model regression equation, the regression sums of squares are calculated for both the reduced and the full model in the same way that they are calculated for a single-model regression. An F test is then performed to find the significance of the full model versus the reduced model.
The hypothesis tested is the model comparison test: the null hypothesis is that the reduced model is the true model and that the full model is not necessary.
The sums of squares and mean sum of squared errors for the reduced model are calculated as follows:

Number of Observations: \(n\)

Number of Reduced-Model Regressors: \(k\)

The Coefficient Matrix: \(X_R = \begin{bmatrix} \mathbf{1} & x_1 & \cdots & x_k \end{bmatrix}\)

where \(x_j\) is the jth reduced-model regressor.

Rank of the Reduced Model Coefficient Matrix: \(p_R = \operatorname{rank}(X_R)\)

Mean of the response: \(\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i\)

Solution to the normal equations: \(b_R = (X_R^T X_R)^{-1} X_R^T y\)

where \(E[b_R] = \beta_R\), \(\beta_R\) being the true \(\beta\) for the reduced model.

Regression Sum of Squares: \(SSR_R = b_R^T X_R^T y - \frac{1}{n} y^T J y\)

where \(J\) is an \(n \times n\) matrix of ones

Error Sum of Squares (also called Residual Sum of Squares, RSS): \(SSE_R = y^T y - b_R^T X_R^T y\)

Total Sum of Squares (also abbreviated TSS): \(SST = y^T y - \frac{1}{n} y^T J y\)
The sums of squares and mean sum of squared errors for the full model are calculated similarly:

Number of Observations: \(n\)

Number of Full-Model-Only Regressors: \(m\)

The Coefficient Matrix: \(X_F = \begin{bmatrix} \mathbf{1} & x_1 & \cdots & x_k & x_{k+1} & \cdots & x_{k+m} \end{bmatrix}\)

where \(x_j\) is either the jth reduced-model regressor or, if \(j > k\), full-model-only regressor number \(j - k\).

Rank of the Full Model Coefficient Matrix: \(p_F = \operatorname{rank}(X_F)\)

Mean of the response: \(\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i\)

Solution to the normal equations: \(b_F = (X_F^T X_F)^{-1} X_F^T y\)

where \(E[b_F] = \beta_F\), \(\beta_F\) being the true \(\beta\) for the full model.

Regression Sum of Squares: \(SSR_F = b_F^T X_F^T y - \frac{1}{n} y^T J y\)

where \(J\) is an \(n \times n\) matrix of ones

Error Sum of Squares (also called Residual Sum of Squares, RSS): \(SSE_F = y^T y - b_F^T X_F^T y\)

Total Sum of Squares (also abbreviated TSS): \(SST = y^T y - \frac{1}{n} y^T J y\)
The test statistic is:

\(F = \frac{(SSR_F - SSR_R)/(p_F - p_R)}{SSE_F/(n - p_F)}\)

Another way of expressing this is:

\(F = \frac{(SSE_R - SSE_F)/(p_F - p_R)}{SSE_F/(n - p_F)}\)

The p-value is calculated by \(p = P(F_{df_1, df_2} \geq F)\), where \(df_1 = p_F - p_R\) and \(df_2 = n - p_F\).
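As an illustration only (this Python snippet is not part of the software; the data are made up), the simplest nested pair demonstrates the full-versus-reduced F test: a reduced model with only an intercept against a full model with an intercept plus one regressor.

```python
# Illustrative sketch of the full-versus-reduced-model F test using the
# simplest nested pair (not software code): reduced = intercept only,
# full = intercept plus one regressor.
x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 2.0, 4.0]
n = len(y)

# Reduced model: intercept only, so SSE_R equals the total sum of squares.
y_bar = sum(y) / n
sse_r = sum((yi - y_bar) ** 2 for yi in y)
p_r = 1                                   # rank of the reduced coefficient matrix

# Full model: intercept plus x (ordinary simple regression).
x_bar = sum(x) / n
b1 = sum((a - x_bar) * (c - y_bar) for a, c in zip(x, y)) / \
     sum((a - x_bar) ** 2 for a in x)
b0 = y_bar - b1 * x_bar
sse_f = sum((c - (b0 + b1 * a)) ** 2 for a, c in zip(x, y))
p_f = 2                                   # rank of the full coefficient matrix

# F = ((SSE_R - SSE_F) / (p_F - p_R)) / (SSE_F / (n - p_F))
f_stat = ((sse_r - sse_f) / (p_f - p_r)) / (sse_f / (n - p_f))
print(f_stat)                             # about 3.5556 on this data
```

The p-value would come from the upper tail of \(F(p_F - p_R,\, n - p_F)\), here \(F(1, 2)\).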
Regressor Statistics
The coefficient \(b_j\) of the jth regressor is the jth component of the solution \(b = (X^T X)^{-1} X^T y\) to the normal equations; in the single-regressor case this reduces to

\(b_1 = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2}\)

where \(n\) is the sample size, \(\bar{x}\) is the mean of the regressor and \(\bar{y}\) is the mean of the response.

The Y-intercept of the regression equation is calculated with the equation:

\(b_0 = \bar{y} - \sum_{j=1}^{k} b_j \bar{x}_j\)

where \(k\) is the number of regressors, \(b_j\) is the jth coefficient and \(\bar{x}_j\) is the mean of the jth regressor.

The standard error for the jth regressor is computed by taking a full-model regression equation with all regressors less the jth regressor. For the purposes of calculating the standard error, the jth regressor is set as the dependent variable. Let \(SSX_j = \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2\) be the jth regressor sum of squares, \(R_j^2\) be the coefficient of determination for the jth-regressor-versus-all-other-regressors model, \(MSE = SSE/(n - k - 1)\) be the mean square error for the regression model, and \(SSE\) be the error sum of squares. Let the total number of regressors in the model be \(k\). Then the standard error of the jth regressor is calculated as follows:

\(SE(b_j) = \sqrt{\frac{MSE}{(1 - R_j^2)\, SSX_j}}\)

The standard error for the intercept is found as

\(SE(b_0) = \sqrt{MSE \left[ (X^T X)^{-1} \right]_{00}}\)

where \(\left[ (X^T X)^{-1} \right]_{00}\) is the intercept-related element of the inverse of \(X^T X\), the matrix \(X\) having been formed from the intercept term (a vector of ones) plus the covariates.
Note
It may be shown that an alternative way of computing the standard error for regressor \(j\) exists which is similar to the formula mentioned above for the standard error for the intercept, namely,

\(SE(b_j) = \sqrt{MSE \left[ (X^T X)^{-1} \right]_{jj}}\)

where \(\left[ (X^T X)^{-1} \right]_{jj}\) is the element of the inverse of \(X^T X\) related to the jth covariate.

This follows because first, for any regression,

\(SSE = (1 - R^2)\, SST\),

so for the regression of the jth regressor on the remaining regressors, we have

\(SSE_j = (1 - R_j^2)\, SSX_j\),

the residual sum of squares of regressing the jth regressor onto the remaining regressors.

If we call the jth regressor \(x_j\), and we call the matrix formed from the remaining covariates (including the intercept column) \(X_{-j}\), we can write \(SSE_j\) as

\(SSE_j = x_j^T x_j - x_j^T X_{-j} (X_{-j}^T X_{-j})^{-1} X_{-j}^T x_j\)

Meanwhile, if we have block matrix

\(M = \begin{bmatrix} a & b^T \\ b & C \end{bmatrix}\)

where \(a\), \(b\), and \(C\) are appropriately-sized submatrices, and \(C\) is invertible, it may be shown (using block matrix manipulation and the Schur complement of matrix \(C\)) that the upper-left element of the inverse of this block matrix is

\(\left( M^{-1} \right)_{11} = \left( a - b^T C^{-1} b \right)^{-1}\)

Now define

\(X_j^* = \begin{bmatrix} x_j & X_{-j} \end{bmatrix}\)

(Think of \(X_j^*\) as the result of moving the jth regressor from the jth position in \(X\) to the first position in \(X_j^*\).)

If we let

\(M = (X_j^*)^T X_j^* = \begin{bmatrix} x_j^T x_j & x_j^T X_{-j} \\ X_{-j}^T x_j & X_{-j}^T X_{-j} \end{bmatrix}\)

we get

\(\left( M^{-1} \right)_{11} = \left( x_j^T x_j - x_j^T X_{-j} (X_{-j}^T X_{-j})^{-1} X_{-j}^T x_j \right)^{-1} = \frac{1}{SSE_j}\)

We now have, since reordering the columns of \(X\) correspondingly reorders the diagonal of \((X^T X)^{-1}\),

\(\left[ (X^T X)^{-1} \right]_{jj} = \frac{1}{SSE_j} = \frac{1}{(1 - R_j^2)\, SSX_j}\)

Thus,

\(SE(b_j) = \sqrt{\frac{MSE}{(1 - R_j^2)\, SSX_j}} = \sqrt{MSE \left[ (X^T X)^{-1} \right]_{jj}}\)
The value of the t-statistic for the jth regressor is obtained from the equation:

\(t = \frac{b_j}{SE(b_j)}\)

where \(b_j\) is the estimated coefficient for the jth regressor.

The p-value of the t-statistic for the jth regressor is the probability of obtaining a value as extreme as or more extreme than the observed t-statistic from a Student's t distribution with \(n - k - 1\) degrees of freedom.

The p-value for the univariate fit is obtained from a Student's t distribution where the t-statistic is calculated assuming that the jth regressor is the only regressor in the model against the dependent variable.
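As an illustration only (this Python snippet is not part of the software; the data are made up), the single-regressor case shows the regressor t-statistic, where there are no other regressors and so \(R_j^2 = 0\):

```python
import math

# Illustrative sketch of the regressor t-statistic in the single-regressor
# case (not software code), where SE(b1) = sqrt(MSE / SSX) and t^2 equals
# the overall regression F statistic.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n, k = len(x), 1                     # k = number of regressors

x_bar, y_bar = sum(x) / n, sum(y) / n
ssx = sum((a - x_bar) ** 2 for a in x)
b1 = sum((a - x_bar) * (c - y_bar) for a, c in zip(x, y)) / ssx
b0 = y_bar - b1 * x_bar

sse = sum((c - (b0 + b1 * a)) ** 2 for a, c in zip(x, y))
mse = sse / (n - k - 1)              # mean square error, n - k - 1 df

# With one regressor there are no "other regressors", so R_j^2 = 0 and
# SE(b_j) = sqrt(MSE / ((1 - R_j^2) * SSX_j)) reduces to sqrt(MSE / SSX).
se_b1 = math.sqrt(mse / ssx)
t_stat = b1 / se_b1                  # compare to Student's t, n - k - 1 df
print(t_stat)                        # about 2.1213; t^2 = 4.5
```

Note that squaring this t-statistic recovers the overall F statistic of the same regression, as expected for a single regressor.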
Categorical Covariates and Interaction Terms
If a covariate is categorical, dummy variables are used to indicate the category of the covariate. A value of "1" for an observation indicates that it is equal to the category the dummy variable represents; similarly, if the observation is not equal to the category for the dummy variable, it is assigned the value "0". As the values of one dummy variable can be determined by examining all of the other dummy variables for a covariate, in most cases the last dummy variable is dropped. This avoids using a rank-deficient matrix in the regression equation.
A first-order interaction term is a new covariate created from the product of two covariates as specified in either the full-model or reduced-model covariates. If one covariate in the interaction is categorical, the dummy variables for each of its categories are multiplied by the other covariate to create the first-order interaction terms. If both covariates are categorical, the dummy variables from both covariates are multiplied by each other.
For example, consider the following covariates for six samples.

Sample      Lab   Dose   Age
sample01    A     Low    35
sample02    A     Med    31
sample03    A     High   37
sample04    B     Low    32
sample05    B     Med    36
sample06    B     High   33
Using dummy variables for the categorical covariates, the above table would be:

Sample      Lab=A   Lab=B   Dose=Low   Dose=Med   Dose=High   Age
sample01    1       0       1          0          0           35
sample02    1       0       0          1          0           31
sample03    1       0       0          0          1           37
sample04    0       1       1          0          0           32
sample05    0       1       0          1          0           36
sample06    0       1       0          0          1           33
Interactions Lab*Dose and Lab*Age would be specified as:

Sample      A*Low   A*Med   A*High   B*Low   B*Med   B*High   A*Age   B*Age
sample01    1       0       0        0       0       0        35      0
sample02    0       1       0        0       0       0        31      0
sample03    0       0       1        0       0       0        37      0
sample04    0       0       0        1       0       0        0       32
sample05    0       0       0        0       1       0        0       36
sample06    0       0       0        0       0       1        0       33
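The encoding above can be sketched programmatically. This Python snippet is illustrative only (not part of the software); it reproduces the dummy and first-order interaction columns for the Lab/Dose/Age example, keeping all dummy columns for clarity even though in practice the last dummy of each covariate would usually be dropped:

```python
# Illustrative sketch of dummy-variable and first-order interaction
# encoding for the Lab/Dose/Age example above (not software code).
samples = [
    ("sample01", "A", "Low",  35),
    ("sample02", "A", "Med",  31),
    ("sample03", "A", "High", 37),
    ("sample04", "B", "Low",  32),
    ("sample05", "B", "Med",  36),
    ("sample06", "B", "High", 33),
]
labs = ["A", "B"]
doses = ["Low", "Med", "High"]

encoded = []
for name, lab, dose, age in samples:
    lab_d  = [1 if lab == c else 0 for c in labs]     # Lab=A, Lab=B
    dose_d = [1 if dose == c else 0 for c in doses]   # Dose=Low/Med/High
    # First-order interactions are products of the component columns:
    lab_x_dose = [ld * dd for ld in lab_d for dd in dose_d]  # A*Low .. B*High
    lab_x_age  = [ld * age for ld in lab_d]                  # A*Age, B*Age
    encoded.append((name, lab_d + dose_d + [age], lab_x_dose + lab_x_age))

for name, dummies, interactions in encoded:
    print(name, dummies, interactions)
```

The first tuple element of each row matches the dummy-variable table, and the second matches the interaction table.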
Stepwise Regression
If only a few variables (regressors or covariates) drive the outcome of the response, stepwise regression can isolate these variables. The methods for the two types of stepwise regression, forward selection and backward elimination, are described below.
Forward Selection
Starting with either the null model or the reduced model (depending on which type of regression was specified), successive models are created, each one using one more regressor (or covariate) than the previous model.
Each of the unused regressors is added to the current model to create a "trial" model for that regressor. The p-value of the trial model (or full model) versus the current model (or reduced model) is calculated, and the model with the smallest p-value is used as the next model. This method adds the next most significant variable to the current model. If the current model has the smallest p-value, or if no p-value is better than the p-value cutoff specified, then the forward selection method stops and declares the current model the final model as determined by stepwise forward selection. If the model with all regressors has the smallest p-value, then this full model is determined to be the final model.
From the standpoint of further analysis, the final model becomes the “full model” for this set of potential regressors.
Backward Elimination
Starting with the full model, successive models are created, each one using one less regressor (or covariate) than the previous model.
Each of the regressors currently in the model is removed to create a "trial" model excluding that regressor. The p-value of the current model (or full model) versus the trial model (or reduced model) is calculated, and the model with the smallest p-value is used as the next model. This method removes the least significant variable from the current model. If every p-value is smaller than the p-value cutoff specified, the backward elimination method stops. The method also stops if all variables have been removed from the model, or if all variables left are included in the original reduced model.
From the standpoint of further analysis, the final model becomes the “full model” for this set of potential regressors.
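As an illustration only (this Python snippet is not part of the software; the data, the `f_cutoff` threshold, and the helper names are made up), forward selection can be sketched as follows. One simplification is made: since at a given step every trial model adds exactly one regressor and so shares the same degrees of freedom, ranking trials by smallest p-value is equivalent to ranking them by largest nested-model F statistic, so this sketch compares F statistics directly (against an F cutoff rather than a p-value cutoff) to avoid needing an F-distribution CDF.

```python
# Illustrative sketch of stepwise forward selection (not software code),
# starting from the null (intercept-only) model and ranking trial models
# by the nested-model F statistic instead of its p-value.

def sse_of(columns, y):
    """SSE of regressing y on an intercept plus the given columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)] for r in range(p)]
    v = [sum(X[i][r] * y[i] for i in range(n)) for r in range(p)]
    M = [row[:] + [v[r]] for r, row in enumerate(A)]
    for col in range(p):                      # Gaussian elimination
        piv = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, p):
            f = M[r][col] / M[col][col]
            for c in range(col, p + 1):
                M[r][c] -= f * M[col][c]
    b = [0.0] * p
    for r in range(p - 1, -1, -1):
        b[r] = (M[r][p] - sum(M[r][c] * b[c] for c in range(r + 1, p))) / M[r][r]
    fitted = [sum(X[i][c] * b[c] for c in range(p)) for i in range(n)]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

def forward_select(candidates, y, f_cutoff=4.0):
    """Greedily add the regressor giving the largest nested-model F."""
    n, chosen = len(y), []
    while len(chosen) < len(candidates):
        cur_sse = sse_of([candidates[i] for i in chosen], y)
        best, best_f = None, f_cutoff
        for j in range(len(candidates)):
            if j in chosen:
                continue
            trial_sse = sse_of([candidates[i] for i in chosen] + [candidates[j]], y)
            df2 = n - (len(chosen) + 2)       # n - p for the trial model
            if df2 <= 0 or trial_sse <= 0:
                continue
            f = (cur_sse - trial_sse) / (trial_sse / df2)
            if f > best_f:
                best, best_f = j, f
        if best is None:                      # nothing beats the cutoff: stop
            break
        chosen.append(best)
    return chosen

# Toy data where y is driven mainly by the first candidate regressor.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
x3 = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]
y  = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]        # roughly 2 * x1
print(forward_select([x1, x2, x3], y))
```

Backward elimination follows the same pattern in reverse: starting from the full model, drop the regressor whose removal gives the least significant nested-model comparison, until every remaining regressor is significant.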
Binomial Predictor
In this case, all of the observations with zero as the predictor variable are placed in one group, and all of the observations with one as the predictor variable are placed in a second group. A two-sample t-test is used to determine the probability that the two groups have the same mean.
Univariate Case (Student's t-Test)

Suppose you have \(n\) items that are split into two groups of sizes \(n_1\) and \(n_2\), and the respective sums of their continuous responses are \(s_1\) and \(s_2\). Further, let \(q\) be the sum of the squared responses, \(q = \sum_{i=1}^{n} y_i^2\), and let

\(v = \frac{q - s_1^2/n_1 - s_2^2/n_2}{n_1 + n_2 - 2}\)

We can then calculate the statistic:

\(t = \frac{s_1/n_1 - s_2/n_2}{\sqrt{v \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}\)

where the p-value is given by the tails of a two-sided Student's t distribution with \(df\) degrees of freedom:

\(p = 2\, P(T_{df} \geq |t|)\), where \(df = n_1 + n_2 - 2\).
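As an illustration only (this Python snippet is not part of the software; the group responses are made up), the two-sample t statistic can be computed entirely from the group sizes, the response sums, and the sum of squared responses:

```python
import math

# Illustrative sketch of the two-sample t statistic computed from group
# sizes, response sums, and the sum of squared responses (not software code).
g0 = [1.0, 2.0, 3.0]             # responses with predictor = 0
g1 = [4.0, 5.0, 6.0]             # responses with predictor = 1

n1, n2 = len(g0), len(g1)
s1, s2 = sum(g0), sum(g1)
q = sum(v * v for v in g0 + g1)  # sum of squared responses

# Pooled variance from the sums alone.
v = (q - s1 * s1 / n1 - s2 * s2 / n2) / (n1 + n2 - 2)
t_stat = (s1 / n1 - s2 / n2) / math.sqrt(v * (1 / n1 + 1 / n2))
df = n1 + n2 - 2                 # degrees of freedom for the p-value
print(t_stat, df)                # about -3.6742, 4
```

The two-sided p-value would then come from both tails of Student's t distribution with `df` degrees of freedom.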
Multivariate Case (Hotelling's \(T^2\) Test)

Suppose you have \(N\) observations, each with an \(m\)-dimensional response, in an \(N \times m\) matrix, \(Y\). All of the observations with zero as the predictor variable are placed into one group, \(G_0\), of size \(n_0\), and all of the observations with one as the predictor variable are placed into a second group, \(G_1\), of size \(n_1\). Note that \(N = n_0 + n_1\). Then a Hotelling \(T^2\) statistic is computed to compare the multivariate continuous responses in the two groups. Let the \(m\)-dimensional vectors \(\bar{y}_0\) and \(\bar{y}_1\) contain the means of the responses over groups \(G_0\) and \(G_1\), respectively.

Define the \(m\)-dimensional vector \(d\), where

\(d_j = \bar{y}_{0j} - \bar{y}_{1j}\)

Define the \((j, k)\) entry of the pooled within-group covariance matrix \(S\) as follows:

\(S_{jk} = \frac{1}{N - 2} \left[ \sum_{i \in G_0} (y_{ij} - \bar{y}_{0j})(y_{ik} - \bar{y}_{0k}) + \sum_{i \in G_1} (y_{ij} - \bar{y}_{1j})(y_{ik} - \bar{y}_{1k}) \right]\)

The matrix equation \(S w = d\) is solved for \(w\), and the \(T^2\) statistic is calculated as follows:

\(T^2 = \frac{n_0 n_1}{N}\, d^T w\)

The p-value is computed using the F distribution with \(m\) and \(N - m - 1\) degrees of freedom:

\(p = P(F_{df_1, df_2} \geq F)\), where \(F = \frac{N - m - 1}{m (N - 2)}\, T^2\), \(df_1 = m\) and \(df_2 = N - m - 1\).
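As an illustration only (this Python snippet is not part of the software; the 2-dimensional responses are made up), the Hotelling \(T^2\) computation can be sketched for \(m = 2\), where the system \(S w = d\) can be solved with the explicit 2x2 inverse:

```python
# Illustrative sketch of the Hotelling T^2 statistic for 2-dimensional
# responses, solving S w = d via the explicit 2x2 inverse (not software code).
g0 = [(2.0, 4.0), (3.0, 5.0), (4.0, 3.0), (3.0, 4.0)]   # predictor = 0
g1 = [(5.0, 6.0), (6.0, 7.0), (5.0, 8.0), (6.0, 7.0)]   # predictor = 1
n0, n1, m = len(g0), len(g1), 2
N = n0 + n1

mean0 = [sum(p[j] for p in g0) / n0 for j in range(m)]
mean1 = [sum(p[j] for p in g1) / n1 for j in range(m)]
d = [mean0[j] - mean1[j] for j in range(m)]              # mean differences

# Pooled within-group covariance matrix S.
S = [[(sum((p[j] - mean0[j]) * (p[k] - mean0[k]) for p in g0)
       + sum((p[j] - mean1[j]) * (p[k] - mean1[k]) for p in g1)) / (N - 2)
      for k in range(m)] for j in range(m)]

# Solve S w = d (2x2 case via the explicit inverse).
det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
w = [(S[1][1] * d[0] - S[0][1] * d[1]) / det,
     (S[0][0] * d[1] - S[1][0] * d[0]) / det]

t2 = (n0 * n1 / N) * sum(d[j] * w[j] for j in range(m))
f_stat = (N - m - 1) * t2 / (m * (N - 2))   # compare to F(m, N - m - 1)
print(t2, f_stat)
```

The p-value would come from the upper tail of the F distribution with \(m\) and \(N - m - 1\) degrees of freedom evaluated at `f_stat`.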