3.4. Logistic Regression
3.4.1. Genotype or Numeric Association Test – Logistic Regression
The (univariate binary) response, $y$, is fit to the given predictor variable, $x$, using logistic regression, and the results include the regression p-value and the parameters $\beta_0$ and $\beta_1$, which are output in a new spreadsheet along with other association test results, any multiple test correction results, as well as any expected p-values based on the rank of the observed p-value and the number of predictors. The response is represented with the formula

$$y = f(x) + \epsilon,$$

with the model itself being the expression

$$f(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}},$$

and the error term, $\epsilon$, expressing the difference, or residual, between the model of the response and the response itself.
Assuming there are $n$ observations, a logit model is used to fit the binary response, $y$, using $x$ and a vector of 1’s as a covariate matrix ($X = [\mathbf{1} \;\; x]$). (The vector of 1’s facilitates obtaining an intercept.) The Newton’s method approach of maximizing the log likelihood function is used for estimating the logit model [Green1997]. The null hypothesis being tested is that the slope coefficient in the logit model is zero. A likelihood statistic

$$\lambda = \frac{L_r}{L_u}$$

is calculated, where $L_u$ is the unrestricted likelihood and $L_r$ is the restricted likelihood, and $-2\ln(\lambda)$ is asymptotically distributed as Chi-Squared with one degree of freedom.
We simplify the notation by using a lower-case $l$ to mean “log likelihood”; that is, $l_u = \ln(L_u)$ and $l_r = \ln(L_r)$, where base-$e$ (natural) logarithms are used. Using this notation, the unrestricted log likelihood is

$$l_u = \sum_{i=1}^{n} \left[ y_i \ln \Lambda(X_i \beta) + (1 - y_i) \ln \left( 1 - \Lambda(X_i \beta) \right) \right]$$

and the restricted log likelihood is

$$l_r = n \left[ \bar{y} \ln(\bar{y}) + (1 - \bar{y}) \ln(1 - \bar{y}) \right],$$

where $\bar{y}$ is the proportion of the dependent observations $y_i$ that are equal to one, $\Lambda(z) = e^z / (1 + e^z)$ is the logistic function, and $X_i$ is the $i$-th row of $X$. Here, $\beta$ is the vector $(\beta_0, \beta_1)'$.
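As an illustration, the following is a minimal sketch in Python (NumPy/SciPy) of the computation just described: fit the logit model by Newton’s method, then form the likelihood ratio statistic $-2(l_r - l_u)$ and its Chi-Squared p-value. This is not SVS code; the data and function names are illustrative only.

```python
import numpy as np
from scipy.stats import chi2

def fit_logit_newton(X, y, tol=1e-10, max_iter=50):
    """Maximize the logit log likelihood by Newton's method."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))          # Lambda(X_i beta)
        grad = X.T @ (y - p)                           # score vector
        info = X.T @ ((p * (1.0 - p))[:, None] * X)    # information matrix
        step = np.linalg.solve(info, grad)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Illustrative data: one predictor x and a binary response y.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-(0.5 + 1.2 * x)))).astype(float)

X = np.column_stack([np.ones_like(x), x])              # vector of 1's plus x
beta = fit_logit_newton(X, y)                          # (beta_0, beta_1)'

p_hat = 1.0 / (1.0 + np.exp(-(X @ beta)))
l_u = np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))          # unrestricted
ybar = y.mean()
l_r = len(y) * (ybar * np.log(ybar) + (1 - ybar) * np.log(1 - ybar))   # restricted

stat = -2.0 * (l_r - l_u)          # -2 ln(lambda)
p_value = chi2.sf(stat, df=1)      # one slope coefficient under test
```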
3.4.2. Multiple Logistic Regression
Full Model Only Regression Equation
The multiple logistic regression uses a logit model to fit the binary response $y$, using the covariate matrix $X$, which consists of columns for the continuous predictors, indicator (dummy-variable) columns for the categorical predictors, and a column of 1’s for the intercept. The Newton’s method approach of maximizing the log likelihood function is used for estimating the logit model [Green1997]. The null hypothesis being tested is that the $k$ slope coefficients in the logit model are all zero. A likelihood statistic

$$\lambda = \frac{L_r}{L_u}$$

is calculated, where $L_u$ is the unrestricted likelihood and $L_r$ is the restricted likelihood, and $-2\ln(\lambda)$ is asymptotically distributed as Chi-Squared with $k$ degrees of freedom.
We simplify the notation by using a lower-case $l$ to mean “log likelihood”; that is, $l_u = \ln(L_u)$ and $l_r = \ln(L_r)$, where base-$e$ (natural) logarithms are used. Using this notation, the unrestricted log likelihood is

$$l_u = \sum_{i=1}^{n} \left[ y_i \ln \Lambda(X_i \beta) + (1 - y_i) \ln \left( 1 - \Lambda(X_i \beta) \right) \right]$$

and the restricted log likelihood is

$$l_r = n \left[ \bar{y} \ln(\bar{y}) + (1 - \bar{y}) \ln(1 - \bar{y}) \right],$$

where $\bar{y}$ is the proportion of the dependent observations $y_i$ that are equal to one, $\Lambda(z) = e^z / (1 + e^z)$ is the logistic function, and $X_i$ is the $i$-th row of $X$. Here, $\beta$ is the vector $(\beta_0, \beta_1, \ldots, \beta_k)'$ of the intercept and slope coefficients.
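For the multiple-regressor case, the same quantities are exposed by the statsmodels library, whose `Logit` estimator also uses Newton’s method by default. A brief sketch, with illustrative data:

```python
import numpy as np
import statsmodels.api as sm

# Illustrative data: two continuous covariates and a binary response.
rng = np.random.default_rng(1)
Xc = rng.normal(size=(300, 2))
p = 1.0 / (1.0 + np.exp(-(0.3 + Xc @ np.array([0.8, -0.5]))))
y = (rng.random(300) < p).astype(int)

X = sm.add_constant(Xc)                 # column of 1's for the intercept
result = sm.Logit(y, X).fit(disp=0)     # Newton's method by default

print(result.params)        # intercept and slope coefficients
print(result.llf)           # unrestricted log likelihood l_u
print(result.llnull)        # restricted (intercept-only) log likelihood l_r
print(result.llr)           # -2 (l_r - l_u)
print(result.llr_pvalue)    # Chi-Squared p-value with k = 2 degrees of freedom
```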
Full Versus Reduced Model Regression Equation
For the full versus reduced logistic regression model, logistic regression equations are obtained for both the full model and the reduced model. The reduced logistic regression model includes only the dependent variable and any covariates selected for the reduced model. The full logistic regression model includes all of the reduced-model variables along with any full-model covariates.
A likelihood ratio statistic is calculated to find the significance of including the full-model regressors versus not including these regressors. The restricted likelihood, that of the reduced model, is represented by $L_{reduced}$, and $L_{full}$ is the unrestricted likelihood, that of the full model. Both $l_{reduced} = \ln(L_{reduced})$ and $l_{full} = \ln(L_{full})$ are computed as

$$l = \sum_{i=1}^{n} \left[ y_i \ln \Lambda(X_i \beta) + (1 - y_i) \ln \left( 1 - \Lambda(X_i \beta) \right) \right],$$

each using its own covariate matrix $X$ and fitted coefficient vector $\beta$, and the statistic

$$-2\ln(\lambda) = -2 \left( l_{reduced} - l_{full} \right)$$

is asymptotically distributed as Chi-Squared with $df_{full} - df_{reduced}$ degrees of freedom, where $df_{full}$ are the degrees of freedom of the full model and $df_{reduced}$ are the degrees of freedom of the reduced model. Here, $\beta_{reduced}$ is the reduced model vector of slope coefficients, and $\beta_{full}$ is the full model vector of slope coefficients.
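A sketch of the full-versus-reduced test, assuming statsmodels and design matrices that already contain their intercept columns (the helper name `lr_pvalue` is ours, not SVS’s):

```python
import statsmodels.api as sm
from scipy.stats import chi2

def lr_pvalue(y, X_reduced, X_full):
    """Likelihood ratio p-value for a full vs a reduced logit model.
    The reduced model's columns must be a subset of the full model's."""
    l_reduced = sm.Logit(y, X_reduced).fit(disp=0).llf
    l_full = sm.Logit(y, X_full).fit(disp=0).llf
    stat = -2.0 * (l_reduced - l_full)          # -2 ln(lambda)
    df = X_full.shape[1] - X_reduced.shape[1]   # df_full - df_reduced
    return chi2.sf(stat, df)
```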
Regressor Statistics
The coefficient of each regressor, along with the y-intercept, is calculated as a part of the Newton’s method approach of maximizing the log likelihood function for the full model.
The standard error of the $k$-th regressor is found by inverting the information matrix of the regression, which is formed using the intercept as the last coefficient. The square root of the $k$-th diagonal element of the inverted matrix is the standard error of the $k$-th regressor, and the square root of the last diagonal element of the inverted matrix is the standard error of the intercept. (See [HosmerAndLemeshow2000].)
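A sketch of this standard-error computation, assuming a fitted coefficient vector `beta` and a design matrix `X` whose last column is the intercept, as described above:

```python
import numpy as np

def coefficient_standard_errors(X, beta):
    """Standard errors from the inverse of the information matrix.
    X is assumed to carry the intercept as its LAST column."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    info = X.T @ ((p * (1.0 - p))[:, None] * X)   # information matrix
    cov = np.linalg.inv(info)                     # inverted information matrix
    se = np.sqrt(np.diag(cov))
    return se[:-1], se[-1]    # (regressor standard errors, intercept standard error)
```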
The p-value Pr(Chi) associated with dropping the $k$-th regressor from the regression is found by running a separate logistic regression using all the regressors as the full model and a model with all the regressors except the $k$-th regressor as the reduced model. (The “Chi” refers to the likelihood ratio test that is performed between these two models to find the p-value.)
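Pr(Chi) for the $k$-th regressor can then be sketched by reusing the `lr_pvalue` helper above, dropping column $k$ from the full design matrix:

```python
import numpy as np

def pr_chi(y, X_full, k):
    """p-value Pr(Chi) for dropping the k-th regressor (column k of X_full)."""
    X_reduced = np.delete(X_full, k, axis=1)     # all regressors except the k-th
    return lr_pvalue(y, X_reduced, X_full)       # LR test: full vs reduced
```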
The regression odds ratio for a coefficient $\beta_k$ is simply $e^{\beta_k}$. The interpretation is the ratio by which the odds of the dependent being one change if the given regressor changes by one unit. An example would be the ratio of the odds of being a case rather than a control for a smoker to the odds of being a case rather than a control for a non-smoker.
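For instance, with a hypothetical fitted coefficient of 0.693 for a smoking indicator:

```python
import numpy as np

beta_smoker = 0.693                 # hypothetical coefficient for "smoker"
odds_ratio = np.exp(beta_smoker)    # about 2.0: smoking doubles the odds
                                    # of being a case rather than a control
```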
The p-value for the univariate fit of the $k$-th regressor is obtained from a separate logistic regression which is calculated as if the $k$-th regressor were the only regressor in the model against the dependent variable.
Categorical Covariates and Interaction Terms
If a covariate is categorical, dummy variables are used to indicate the category of the covariate. A value of “1” for the observation indicates that it is equal to the category the dummy variable represents. Similarly, if the observation is not equal to the category for the dummy variable, then it is assigned the value of “0”. As the values of one dummy variable can be determined by examining all other dummy variables for a covariate, in most cases the last dummy variable is dropped. This avoids using a rank-deficient matrix in the regression equation.
A first-order interaction term is a new covariate created from the product of two covariates, as specified in either the full-model or reduced-model covariates. If one of the two covariates is categorical, the dummy variable for each of its categories is multiplied by the other covariate to create the first-order interaction terms. If both covariates are categorical, the dummy variables from the two covariates are multiplied by each other pairwise.
For example, consider the following covariates for six samples.
| Sample   | Lab | Dose | Age |
|----------|-----|------|-----|
| sample01 | A   | Low  | 35  |
| sample02 | A   | Med  | 31  |
| sample03 | A   | High | 37  |
| sample04 | B   | Low  | 32  |
| sample05 | B   | Med  | 36  |
| sample06 | B   | High | 33  |
Using dummy variables for the categorical covariates the above table would be:
| Sample   | Lab=A | Lab=B | Dose=Low | Dose=Med | Dose=High | Age |
|----------|-------|-------|----------|----------|-----------|-----|
| sample01 | 1     | 0     | 1        | 0        | 0         | 35  |
| sample02 | 1     | 0     | 0        | 1        | 0         | 31  |
| sample03 | 1     | 0     | 0        | 0        | 1         | 37  |
| sample04 | 0     | 1     | 1        | 0        | 0         | 32  |
| sample05 | 0     | 1     | 0        | 1        | 0         | 36  |
| sample06 | 0     | 1     | 0        | 0        | 1         | 33  |
Interactions Lab*Dose and Lab*Age would be specified as:
| Sample   | A*Low | A*Med | A*High | B*Low | B*Med | B*High | A*Age | B*Age |
|----------|-------|-------|--------|-------|-------|--------|-------|-------|
| sample01 | 1     | 0     | 0      | 0     | 0     | 0      | 35    | 0     |
| sample02 | 0     | 1     | 0      | 0     | 0     | 0      | 31    | 0     |
| sample03 | 0     | 0     | 1      | 0     | 0     | 0      | 37    | 0     |
| sample04 | 0     | 0     | 0      | 1     | 0     | 0      | 0     | 32    |
| sample05 | 0     | 0     | 0      | 0     | 1     | 0      | 0     | 36    |
| sample06 | 0     | 0     | 0      | 0     | 0     | 1      | 0     | 33    |
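The tables above can be reproduced with a short pandas sketch (the column names are chosen to match the tables; this is an illustration, not the SVS implementation):

```python
import pandas as pd

df = pd.DataFrame({
    "Lab":  ["A", "A", "A", "B", "B", "B"],
    "Dose": ["Low", "Med", "High", "Low", "Med", "High"],
    "Age":  [35, 31, 37, 32, 36, 33],
}, index=[f"sample{i:02d}" for i in range(1, 7)])

# One dummy column per category (before any column is dropped).
dummies = pd.get_dummies(df[["Lab", "Dose"]], prefix_sep="=").astype(int)

# Lab*Dose: multiply each Lab dummy by each Dose dummy.
for lab in ("A", "B"):
    for dose in ("Low", "Med", "High"):
        dummies[f"{lab}*{dose}"] = dummies[f"Lab={lab}"] * dummies[f"Dose={dose}"]

# Lab*Age: multiply each Lab dummy by the continuous Age covariate.
for lab in ("A", "B"):
    dummies[f"{lab}*Age"] = dummies[f"Lab={lab}"] * df["Age"]
```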
Stepwise Regression
If only a few variables (regressors or covariates) drive the outcome of the response, Stepwise Regression can isolate these variables. The methods for the two types of stepwise regression, forward selection and backward elimination, are described below.
Forward Selection
Starting with either the null model or the reduced model (depending on which type of regression was specified), successive models are created, each one using one more regressor (or covariate) than the previous model.
Each of the unused regressors is added to the current model to create a “trial” model for that regressor. The p-value of each trial model (as the full model) versus the current model (as the reduced model) is calculated, and the trial model with the smallest p-value becomes the next current model; this adds the most significant remaining variable to the model. If no trial model improves on the current model, or if no p-value is smaller than the p-value cut-off specified, the forward selection method stops and declares the current model to be the final model as determined by stepwise forward selection. If the model with all regressors has the smallest p-value, then this full model is determined to be the final model.
From the standpoint of further analysis, the final model becomes the “full model” for this set of potential regressors.
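A compact sketch of forward selection, reusing the `lr_pvalue` helper from above; `X_start` is the design matrix of the starting (null or reduced) model and `candidates` maps regressor names to their columns. The names and the `alpha` cut-off are illustrative:

```python
import numpy as np

def forward_selection(y, X_start, candidates, alpha=0.05):
    """Add the most significant remaining regressor until none beats alpha."""
    current, remaining = X_start, dict(candidates)
    while remaining:
        trials = {name: lr_pvalue(y, current, np.column_stack([current, col]))
                  for name, col in remaining.items()}
        best = min(trials, key=trials.get)       # smallest p-value
        if trials[best] >= alpha:                # no addition beats the cut-off
            break
        current = np.column_stack([current, remaining.pop(best)])
    return current                               # the final ("full") model
```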
Backward Elimination
Starting with the full model, successive models are created, each one using one less regressor (or covariate) than the previous model.
Each of the regressors currently in the model is removed in turn to create a “trial” model excluding that regressor. The p-value of the current model (as the full model) versus each trial model (as the reduced model) is calculated, and the trial model whose test has the largest p-value becomes the next current model; this removes the least significant variable from the current model. If every p-value is smaller than the p-value cut-off specified, the backward elimination method stops. The method also stops if all variables have been removed from the model, or if all variables left are included in the original reduced model.
From the standpoint of further analysis, the final model becomes the “full model” for this set of potential regressors.
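Backward elimination admits a similar sketch; `protected` holds the indices of columns that are never dropped (the intercept and any original reduced-model covariates). Again, the `lr_pvalue` helper and the cut-off are illustrative:

```python
import numpy as np

def backward_elimination(y, X_full, protected, alpha=0.05):
    """Drop the least significant removable regressor until all are significant."""
    X, protected = X_full, set(protected)
    while True:
        removable = [k for k in range(X.shape[1]) if k not in protected]
        if not removable:                        # nothing left that may be dropped
            break
        # p-value of the current model vs the model without column k
        trials = {k: lr_pvalue(y, np.delete(X, k, axis=1), X) for k in removable}
        worst = max(trials, key=trials.get)      # least significant regressor
        if trials[worst] < alpha:                # every remaining p-value beats the cut-off
            break
        X = np.delete(X, worst, axis=1)
        protected = {p - 1 if p > worst else p for p in protected}
    return X                                     # the final ("full") model
```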