Logistic Regression

Genotype or Numeric Association Test – Logistic Regression

The (univariate binary) response, y, is fit to the given predictor variable, x, using logistic regression. The results include the regression p-value and the parameters \beta_0 and \beta_1, which are output in a new spreadsheet along with other association test results, any multiple test correction results, and any expected p-values based on the rank of the observed p-value and the number of predictors. The response is represented by the formula y = \mathrm{logit}^{-1}(\beta_1 x + \beta_0) + \epsilon, where \mathrm{logit}^{-1}(t) = \frac{1}{1+e^{-t}} is the logistic (inverse logit) function. The model itself is the expression \mathrm{logit}^{-1}(\beta_1 x + \beta_0), and the error term, \epsilon, expresses the difference, or residual, between the model of the response and the response itself.

Assuming there are n observations, a logit model is used to fit the binary response, y, using x and a vector of 1’s as the covariate matrix (z). (The vector of 1’s facilitates obtaining an intercept.) A Newton’s method approach to maximizing the log likelihood function is used to estimate the logit model [Green1997]. The null hypothesis being tested is that the slope coefficient in the logit model is zero. A likelihood ratio statistic is calculated, where L_0 is the restricted (intercept-only) likelihood and L_1 is the unrestricted likelihood, and -2\ln(L_0/L_1) is asymptotically distributed as Chi-Squared with one degree of freedom (one restriction, the slope coefficient).

We simplify the notation by using a lower-case l to mean “log likelihood”; that is,

l_0 = \log(L_0)

and

l_1=\log(L_1),

where base e logarithms are used.

Using this notation, the restricted (intercept-only) log likelihood is

l_0 = n[s\log(s) + (1-s)\log(1-s)],

where s is the proportion of the n dependent observations (y_1,\ldots,y_n) that are equal to one, and the unrestricted log likelihood is

l_1 = \sum^n_{i=1}\left[y_i\log\left(\frac{1}{1+e^{-\hat{\beta}^Tz_i}}\right) + (1-y_i)\log\left(1-\frac{1}{1+e^{-\hat{\beta}^Tz_i}}\right)\right],

and p-value = P(X > -2(l_0 - l_1)), where X \sim \chi^2(1). Here, \hat{\beta} is the vector of estimated coefficients (\hat{\beta}_1, \hat{\beta}_0), and z_i = (x_i, 1) is the i^{th} row of the covariate matrix.
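
As a concrete illustration, the following is a minimal NumPy/SciPy sketch of this test on synthetic data. The helper fit_logit, the simulated x and y, and all variable names are illustrative only and are not part of the implementation described above:

    import numpy as np
    from scipy.stats import chi2

    def fit_logit(z, y, iters=25):
        """Fit a logit model by Newton's method (maximizing the log likelihood).

        A fixed number of iterations is used for simplicity; a production fit
        would also check for convergence.
        """
        beta = np.zeros(z.shape[1])
        for _ in range(iters):
            p = 1.0 / (1.0 + np.exp(-z @ beta))      # fitted probabilities
            w = p * (1.0 - p)                        # diagonal of the weight matrix
            grad = z.T @ (y - p)                     # score vector
            info = z.T @ (z * w[:, None])            # information matrix
            beta += np.linalg.solve(info, grad)      # Newton update
        return beta

    # Illustrative data: x is the predictor, y the binary response.
    rng = np.random.default_rng(0)
    x = rng.normal(size=200)
    y = (rng.uniform(size=200) < 1.0 / (1.0 + np.exp(-(0.8 * x - 0.2)))).astype(float)

    z = np.column_stack([x, np.ones_like(x)])        # x plus a column of 1's
    beta_hat = fit_logit(z, y)                       # estimates (b_1, b_0)

    s = y.mean()                                                  # proportion of ones
    l0 = len(y) * (s * np.log(s) + (1.0 - s) * np.log(1.0 - s))   # restricted log likelihood
    p = 1.0 / (1.0 + np.exp(-z @ beta_hat))
    l1 = np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))      # unrestricted log likelihood
    lr = -2.0 * (l0 - l1)                                         # likelihood ratio statistic
    p_value = chi2.sf(lr, 1)                                      # one restriction: the slope
    print(beta_hat, lr, p_value)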

Multiple Logistic Regression

Full Model Only Regression Equation

The multiple logistic regression uses a logit model to fit the binary response \bf{y} using the covariate matrix \bf{X}, which consists of columns for the continuous predictors and indicator (dummy) columns for the categorical predictors, along with a column of 1’s for the intercept. A Newton’s method approach to maximizing the log likelihood function is used to estimate the logit model [Green1997]. The null hypothesis being tested is that the slope coefficients in the logit model are all zero. A likelihood ratio statistic is calculated, where L_0 is the restricted (intercept-only) likelihood and L_1 is the unrestricted likelihood, and -2\ln(L_0/L_1) is asymptotically distributed as Chi-Squared with k degrees of freedom, where k is the number of slope coefficients in the model.

We simplify the notation by using a lower-case l to mean “log likelihood”; that is,

l_0 = \log(L_0)

and

l_1=\log(L_1),

where base e logarithms are used.

Using this notation, the restricted (intercept-only) log likelihood is

l_0 = n[s\log(s) + (1-s)\log(1-s)],

where s is the proportion of the n dependent observations (y_1,\ldots,y_n) that are equal to one, and the unrestricted log likelihood is

l_1 = \sum^n_{i=1}\left[y_i\log\left(\frac{1}{1+e^{-\hat{\beta}^Tz_i}}\right) + (1-y_i)\log\left(1-\frac{1}{1+e^{-\hat{\beta}^Tz_i}}\right)\right],

and p-value = P(X > -2(l_0 - l_1)), where X \sim \chi^2(k). Here, \hat{\beta} is the vector (b_0,b_1,\ldots,b_k) of estimated coefficients: the intercept b_0 and the k slope coefficients.
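
The same full-model-only test can be sketched with a general-purpose fitter such as statsmodels (used here only as a stand-in for the Newton’s method estimation described above); the covariates age, bmi, and geno are synthetic and purely illustrative. In statsmodels’ notation, llnull corresponds to l_0 and llf to l_1:

    import numpy as np
    import statsmodels.api as sm

    # Synthetic covariates: two continuous predictors and a genotype coded 0/1/2.
    rng = np.random.default_rng(1)
    n = 300
    age = rng.normal(50, 8, n)
    bmi = rng.normal(27, 4, n)
    geno = rng.integers(0, 3, n)
    eta = 0.03 * (age - 50) + 0.1 * (bmi - 27) + 0.5 * geno - 0.4
    y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-eta))).astype(float)

    X = sm.add_constant(np.column_stack([age, bmi, geno]))  # column of 1's for the intercept
    res = sm.Logit(y, X).fit(disp=0)                        # Newton-type maximization

    l0 = res.llnull               # restricted (intercept-only) log likelihood
    l1 = res.llf                  # unrestricted log likelihood of the full model
    lr = -2.0 * (l0 - l1)         # equals res.llr
    p_value = res.llr_pvalue      # Chi-Squared with k = 3 degrees of freedom (the slopes)
    print(res.params, lr, p_value)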

Full Versus Reduced Model Regression Equation

For the full versus reduced logistic regression model, logistic regression equations are obtained for both the full model and the reduced model. The reduced model regresses the dependent variable on only the covariates selected for the reduced model, while the full model includes all of the variables, including any full model covariates.

A likelihood ratio statistic is calculated to find the significance of including the full model regressors versus not including them. The restricted likelihood, that of the reduced model, is represented by L_0, and L_1 is the unrestricted likelihood, that of the full model. Both L_0 and L_1 are computed as below:

l_0 = \log{L_0} = \sum^n_{i=1}\left[y_i\log\left(\frac{1}{1+e^{-\hat{\beta}^T_Rz_i}}\right) + (1-y_i)\log\left(1-\frac{1}{1+e^{-\hat{\beta}^T_Rz_i}}\right)\right],

l_1 = \log{L_1} = \sum^n_{i=1}\left[y_i\log\left(\frac{1}{1+e^{-\hat{\beta}^T_Fz_i}}\right) + (1-y_i)\log\left(1-\frac{1}{1+e^{-\hat{\beta}^T_Fz_i}}\right)\right],

and p-value = P(X > -2(l_0 - l_1)), where X \sim \chi^2(m-k), with m being the number of degrees of freedom of the full model and k the number of degrees of freedom of the reduced model. Here, \hat{\beta}_R is the reduced model vector of estimated coefficients, and \hat{\beta}_F is the full model vector of estimated coefficients.
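
A sketch of the full versus reduced comparison under the same assumptions (synthetic data; statsmodels as a stand-in fitter): here the reduced model contains only age, and the full model adds the genotype:

    import numpy as np
    from scipy.stats import chi2
    import statsmodels.api as sm

    # Synthetic data: the reduced model holds age only; the full model adds geno.
    rng = np.random.default_rng(2)
    n = 300
    age = rng.normal(50, 8, n)
    geno = rng.integers(0, 3, n)
    eta = 0.03 * (age - 50) + 0.6 * geno - 0.5
    y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-eta))).astype(float)

    X_reduced = sm.add_constant(age)                          # intercept + reduced-model covariates
    X_full = sm.add_constant(np.column_stack([age, geno]))    # intercept + all covariates

    l0 = sm.Logit(y, X_reduced).fit(disp=0).llf   # log likelihood of the reduced model
    l1 = sm.Logit(y, X_full).fit(disp=0).llf      # log likelihood of the full model
    lr = -2.0 * (l0 - l1)
    df = X_full.shape[1] - X_reduced.shape[1]     # m - k: extra coefficients in the full model
    p_value = chi2.sf(lr, df)
    print(lr, df, p_value)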

Regressor Statistics

The coefficient of each regressor, along with the intercept, is calculated as part of the Newton’s method maximization of the log likelihood function for the full model.

The standard error of the j^{th} regressor is found by inverting the information matrix of the regression, which is formed using the intercept as the last coefficient. The square root of the j^{th} diagonal element of the inverted matrix is the standard error of the j^{th} regressor, and the square root of the last diagonal element of the inverted matrix is the standard error of the intercept. (See [HosmerAndLemeshow2000].)
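
A sketch of that standard-error computation, again on synthetic data with statsmodels standing in for the Newton’s method fit; the covariate matrix is formed with the intercept column last, matching the ordering described above:

    import numpy as np
    import statsmodels.api as sm

    # Synthetic data with two regressors; the intercept column is placed last.
    rng = np.random.default_rng(3)
    n = 250
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)
    y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-(0.7 * x1 - 0.3 * x2 + 0.1)))).astype(float)

    X = np.column_stack([x1, x2, np.ones(n)])     # regressors first, intercept last
    res = sm.Logit(y, X).fit(disp=0)

    p = res.predict(X)                            # fitted probabilities
    info = X.T @ (X * (p * (1.0 - p))[:, None])   # information matrix at the fitted coefficients
    cov = np.linalg.inv(info)                     # inverted information matrix
    se = np.sqrt(np.diag(cov))                    # se[j] for regressor j; se[-1] for the intercept
    # se agrees with res.bse, the package's own standard errors.
    print(se, res.bse)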

The p-value Pr(Chi) associated with dropping the j^{th} regressor from the regression is found by running a separate logistic regression using all the regressors as the full model and a model with all the regressors except the j^{th} regressor as the reduced model. (The “Chi” refers to the likelihood ratio test that is performed between these two models to find the p-value.)
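
A sketch of the Pr(Chi) computation under the same assumptions: each regressor is dropped in turn, and a likelihood ratio test of the full model against the model without that regressor gives the p-value:

    import numpy as np
    from scipy.stats import chi2
    import statsmodels.api as sm

    # Synthetic data with three regressors (x2 has no real effect).
    rng = np.random.default_rng(4)
    n = 250
    x1, x2, x3 = rng.normal(size=(3, n))
    y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-(0.6 * x1 + 0.2 * x3 - 0.1)))).astype(float)

    cols = np.column_stack([x1, x2, x3])
    X_full = sm.add_constant(cols)
    l_full = sm.Logit(y, X_full).fit(disp=0).llf

    pr_chi = []
    for j in range(cols.shape[1]):
        X_drop = sm.add_constant(np.delete(cols, j, axis=1))  # full model minus regressor j
        l_drop = sm.Logit(y, X_drop).fit(disp=0).llf
        pr_chi.append(chi2.sf(-2.0 * (l_drop - l_full), 1))   # Pr(Chi) for dropping regressor j
    print(pr_chi)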

The regression odds ratio for a coefficient \beta is simply e^\beta. It is interpreted as the ratio by which the odds of the dependent variable being one change when the given regressor increases by one unit. An example would be the ratio of the odds of being a case rather than a control for a smoker to the corresponding odds for a non-smoker.
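
For instance, a hypothetical smoking coefficient of \beta = 0.7 would give an odds ratio of e^{0.7} \approx 2.01, meaning the odds of being a case are roughly doubled for smokers relative to non-smokers, holding the other regressors fixed.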

The p-value for the univariate fit of the j^{th} regressor is obtained from a separate logistic regression, calculated as if the j^{th} regressor were the only regressor in the model for the dependent variable.

Categorical Covariates and Interaction Terms

If a covariate is categorical, dummy variables are used to indicate the category of each observation. A dummy variable takes the value “1” for an observation that belongs to the category the dummy variable represents, and “0” otherwise. Because the values of one dummy variable can be determined by examining all the other dummy variables for a covariate, in most cases the last dummy variable is dropped. This avoids using a rank-deficient matrix in the regression equation.

A first-order interaction term is a new covariate created from the product of two covariates specified in either the full- or reduced-model covariates. If one covariate in the interaction is categorical, the dummy variables for each of its categories are multiplied by the other covariate to create the first-order interaction terms. If both covariates are categorical, the dummy variables of one covariate are multiplied by the dummy variables of the other. (A code sketch reproducing this construction follows the tables below.)

For example, consider the following covariates for six samples.

Sample     Lab   Dose   Age
sample01   A     Low    35
sample02   A     Med    31
sample03   A     High   37
sample04   B     Low    32
sample05   B     Med    36
sample06   B     High   33

Using dummy variables for the categorical covariates the above table would be:

Sample     Lab=A   Lab=B   Dose=Low   Dose=Med   Dose=High   Age
sample01   1       0       1          0          0           35
sample02   1       0       0          1          0           31
sample03   1       0       0          0          1           37
sample04   0       1       1          0          0           32
sample05   0       1       0          1          0           36
sample06   0       1       0          0          1           33

Interactions Lab*Dose and Lab*Age would be specified as:

Sample     A*Low   A*Med   A*High   B*Low   B*Med   B*High   A*Age   B*Age
sample01   1       0       0        0       0       0        35      0
sample02   0       1       0        0       0       0        31      0
sample03   0       0       1        0       0       0        37      0
sample04   0       0       0        1       0       0        0       32
sample05   0       0       0        0       1       0        0       36
sample06   0       0       0        0       0       1        0       33
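
The construction above can be sketched in a few lines of pandas; the data frame, the column names, and the naming of the interaction columns are illustrative only:

    import pandas as pd

    # The example covariates from the tables above.
    df = pd.DataFrame(
        {
            "Lab": ["A", "A", "A", "B", "B", "B"],
            "Dose": ["Low", "Med", "High", "Low", "Med", "High"],
            "Age": [35, 31, 37, 32, 36, 33],
        },
        index=[f"sample0{i}" for i in range(1, 7)],
    )

    # Dummy variables for the categorical covariates (all categories kept,
    # as in the tables; in a regression the last dummy would usually be dropped).
    lab = pd.get_dummies(df["Lab"], prefix="Lab", prefix_sep="=").astype(int)
    dose = pd.get_dummies(df["Dose"], prefix="Dose", prefix_sep="=").astype(int)

    # First-order interactions: Lab*Dose multiplies dummy columns pairwise,
    # and Lab*Age multiplies each Lab dummy by the continuous Age covariate.
    lab_dose = pd.DataFrame(
        {f"{a}*{d}": lab[a] * dose[d] for a in lab.columns for d in dose.columns}
    )
    lab_age = pd.DataFrame({f"{a}*Age": lab[a] * df["Age"] for a in lab.columns})
    print(pd.concat([lab, dose, df["Age"]], axis=1))
    print(pd.concat([lab_dose, lab_age], axis=1))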

Stepwise Regression

If only a few variables (regressors or covariates) drive the outcome of the response, Stepwise Regression can isolate these variables. The methods for the two types of stepwise regression, forward selection and backward elimination, are described below.

Forward Selection

Starting with either the null model or the reduced model (depending on which type of regression was specified), successive models are created, each one using one more regressor (or covariate) than the previous model.

Each of the unused regressors is added to the current model to create a “trial” model for that regressor. The p-value of the trial model (as the full model) versus the current model (as the reduced model) is calculated, and the trial model with the smallest p-value is used as the next model. This method adds the most significant remaining variable to the current model. If no trial model improves on the current model, or if no p-value is below the p-value cut-off specified, the forward selection method stops and declares the current model to be the final model as determined by stepwise forward selection. If the model containing all of the regressors has the smallest p-value, this full model is determined to be the final model.

From the standpoint of further analysis, the final model becomes the “full model” for this set of potential regressors.
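
A minimal sketch of the forward-selection loop, assuming candidate regressors are supplied as a name-to-column mapping and that nested-model p-values come from a likelihood ratio test (statsmodels is used as a stand-in fitter); all helper names are illustrative:

    import numpy as np
    from scipy.stats import chi2
    import statsmodels.api as sm

    def lr_pvalue(y, X_small, X_big):
        """p-value of the larger (nested) model versus the smaller model."""
        l0 = sm.Logit(y, X_small).fit(disp=0).llf
        l1 = sm.Logit(y, X_big).fit(disp=0).llf
        return chi2.sf(-2.0 * (l0 - l1), X_big.shape[1] - X_small.shape[1])

    def forward_selection(y, candidates, p_cutoff=0.05):
        """candidates: dict mapping regressor name -> 1-D array of values."""
        current = np.ones((len(y), 1))       # start from the null (intercept-only) model
        chosen, remaining = [], dict(candidates)
        while remaining:
            # Trial model for each unused regressor: current model plus that regressor.
            trials = {name: lr_pvalue(y, current, np.column_stack([current, col]))
                      for name, col in remaining.items()}
            best = min(trials, key=trials.get)   # most significant addition
            if trials[best] > p_cutoff:          # nothing beats the cut-off: stop
                break
            current = np.column_stack([current, remaining.pop(best)])
            chosen.append(best)
        return chosen

The list returned would contain the regressors of the final model, which then serves as the “full model” for further analysis.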

Backward Elimination

Starting with the full model, successive models are created, each one using one less regressor (or covariate) than the previous model.

Each of the regressors currently in the model is removed in turn to create a “trial” model excluding that regressor. The p-value of the current model (as the full model) versus each trial model (as the reduced model) is calculated, and the trial model with the largest p-value is used as the next model. This method removes the least significant variable from the current model. If every p-value is smaller than the p-value cut-off specified, the backward elimination method stops. The method also stops if all variables have been removed from the model, or if the only variables left are those included in the original reduced model.

From the standpoint of further analysis, the final model becomes the “full model” for this set of potential regressors.
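
A matching sketch of backward elimination under the same assumptions; for simplicity it starts from all supplied regressors and has no protected reduced-model covariates:

    import numpy as np
    from scipy.stats import chi2
    import statsmodels.api as sm

    def lr_pvalue(y, X_small, X_big):
        """p-value of the larger (nested) model versus the smaller model."""
        l0 = sm.Logit(y, X_small).fit(disp=0).llf
        l1 = sm.Logit(y, X_big).fit(disp=0).llf
        return chi2.sf(-2.0 * (l0 - l1), X_big.shape[1] - X_small.shape[1])

    def backward_elimination(y, regressors, p_cutoff=0.05):
        """regressors: dict mapping regressor name -> 1-D array of values."""
        ones = np.ones(len(y))
        current = dict(regressors)
        while current:
            X_full = np.column_stack([ones] + list(current.values()))
            # Trial model for each regressor: the current model with that regressor removed.
            trials = {}
            for name in current:
                rest = [col for other, col in current.items() if other != name]
                X_trial = np.column_stack([ones] + rest)
                trials[name] = lr_pvalue(y, X_trial, X_full)
            worst = max(trials, key=trials.get)   # least significant regressor
            if trials[worst] < p_cutoff:          # every regressor beats the cut-off: stop
                break
            del current[worst]
        return list(current)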