The University of North Carolina at Pembroke
DSC 510--Quantitative Methods
Logistic Regression

Performing a Logistic Regression with Computers
JMP IN
SPSS
SAS
Interpreting the output

Suppose that as the manager of a store's credit department, you have access to financial information (including categorical and numeric variables) about customers from credit-reporting agencies.  Furthermore, based on your experience with these customers in the past you can objectively classify them as good borrowers and those who have defaulted on a payment to the store.  Logistic regression lets you see how the financial data relate to being a good borrower.  With this information you could tell which new customers should be granted store credit and which should not.

Independent variables: One or more numeric variables (may include dummy variables).
Dependent variable: A categorical variable with only two values, such as Yes/No or On/Off. (Such a variable is called a dichotomous variable.) CAUTION: if Yes or No refers to whether an event happens after a variable length of time, logistic regression is inappropriate.  See time-to-event (survival) models.
Null hypothesis (H0): None of the independent variables affects the probability that the dependent variable will be Yes or No.  This implies that ß1, ß2, and ß3 are all zero and that only ß0 may differ from zero.
Research hypothesis: The dependent variable is more likely to be Yes for some values of the independent variables than for others.  This implies that at least one of ß1, ß2, and ß3 differs from zero.
Test statistic: χ² (chi-square)
Rejection region: Right tail (values of χ² that are significantly larger than its d.f.)

The Fundamental Equation

R = e^(ß0 + ß1X1 + ß2X2 + ß3X3)

or

lnR = ß0 + ß1X1 + ß2X2 + ß3X3

where

R is the odds ratio that an event will occur,
Xi are the independent variables,
Y is the dependent variable, called the response variable.

Generally Y is a dichotomous (0/1) variable, but some computer programs, such as JMP, allow Y to have several levels.
ßi are the coefficients to be estimated,
i  indexes a coefficient or the variable associated with that coefficient,
j  indexes an observation, j = 1, 2, . . ., n, and
e  is the base of the natural logarithms: 2.71828182846 . . .

This equation says that the odds that something will happen (and hence the probability that it will happen) depend on some explanatory variables, the Xi.
For example, if ß3 is positive, then an increase in X3 will increase the odds that the event will occur.

Working with odds ratios
The odds ratio, as the term is used here, is the ratio of the probability that something will happen to the probability that it will not happen.  (Many texts call this quantity simply the odds.)
Let R = the odds ratio and let P = the probability that something will happen.
Then the odds ratio is R = P ÷ (1 - P) and P = R ÷ (1 + R).
Furthermore, lnR = ln P - ln(1 - P).
For example,

• if P = 2/5 then the odds ratio is R = 2/5 ÷ (1 - 2/5) = 0.4 ÷ 0.6 = 2/3.  We say that the odds are 2 to 3 (2 : 3).  lnR = -0.40547
• if P = 5/8 then the odds ratio is R = 5/8 ÷ 3/8 = 5/3 = 1.667.  We say that the odds are 5 to 3 (5 : 3).  lnR = 0.51083
Note that if P = 1/2, then R = 1/2 ÷ 1/2 = 1.  Note also that ln 1 = 0.
• if R = 2/5 (that is, if the odds are 2 to 5 or 2 : 5) then P = 2/5 ÷ (1 + 2/5) = 0.4 ÷ 1.4 = 0.2857.  Another way to do this is: P = 2 ÷ (2 + 5).
• if R = 7/8 (that is, if the odds are 7 to 8 or 7 : 8) then P = 7/8 ÷ (1 + 7/8) = 7 ÷ 15 = 0.4667.
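
The conversions above can be checked with a few lines of Python. This is only an illustrative sketch; the function names odds_from_probability and probability_from_odds are made up for this example and do not come from JMP IN, SPSS, or SAS.

```python
import math

def odds_from_probability(p):
    """Return R = P / (1 - P) for a probability 0 <= p < 1."""
    return p / (1.0 - p)

def probability_from_odds(r):
    """Return P = R / (1 + R) for odds r >= 0."""
    return r / (1.0 + r)

# Probabilities from the examples above: 2/5, 5/8, and 1/2.
for p in (2 / 5, 5 / 8, 1 / 2):
    r = odds_from_probability(p)
    print(f"P = {p:.4f}   R = {r:.4f}   ln R = {math.log(r):.5f}")

# Odds from the examples above: 2 : 5 and 7 : 8.
for r in (2 / 5, 7 / 8):
    print(f"R = {r:.4f}   P = {probability_from_odds(r):.4f}")
```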

What happened to the error term?
There is no error term in logistic regression.  In the familiar least-squares regression, the randomness in the problem is accommodated by the error term, ε.  In least-squares regression, if the observations don't work out exactly as predicted, we need the error term to make the equation hold true.  In logistic regression we are talking about probabilities, so we don't need the error term.  When we predict that there is a 70% chance that the NASDAQ will rise tomorrow but tomorrow we observe that the NASDAQ fell, we can just say that a 70% chance of rising means that there is a 30% chance of falling.  Probabilities are themselves ways of dealing with randomness and imprecision.

Why isn't Y in the fundamental equation?
On the one hand, since Y is a dichotomous (0/1) variable we cannot simply run a least-squares regression of Y on the Xi.  We could perform the calculations, but the t tests and the F test associated with the least-squares regression would be meaningless.  Those tests assume that the error comes from a normal distribution, which cannot be true if Y is forced to be 0 or 1.

We might use the probability that the result would occur.  This would be better than a dichotomous variable in that a probability can be any real number between 0 and 1.  But the regression could predict that the probability is 1.2 or -0.3, and we know that probabilities cannot have such values.

The odds ratio is even better because it can have any non-negative real value.

On the other hand, the logistic regression method discussed here uses maximum-likelihood estimation.  For each observation j, the logistic regression method calculates the probability of observing Yj based on the odds ratio Rj for that observation.  The method then chooses the set of ßi that maximizes the probability of observing the Yj that we did observe in the sample.

Maximum Likelihood Estimation

To perform a maximum-likelihood estimation we need a function that tells how the probabilities of the results are determined.  The fundamental equation gives us this information by giving us a formula for the odds ratio.  These probabilities are applied to the Yj.

The maximum-likelihood criterion for estimating a set of parameters specifies a probability function in terms of a set of parameters and then finds the set of values for those parameters that gives the greatest likelihood of observing the results that we actually did observe.

For example, take a simple one-parameter function: the probability that one of Acme's widgets will be good is P and the probability that it will be defective is (1 - P).  We sample four of their widgets and find that the third one is defective and the others are good.  To apply this probability scheme to the observed results (the Yj) we assume that these widgets are selected randomly and independently.  The probability of finding such a sample is then L = P × P × (1 - P) × P = P^3 - P^4.  Use calculus to find the P that maximizes this function:
dL/dP = 3P^2 - 4P^3 = 0.  So P = 3/4, which happens to be the sample proportion.  (The other root, P = 0, gives L = 0, which is a minimum.)  This illustrates that the sample proportion is a maximum-likelihood estimator of the probability that a randomly chosen Y will be a success.

Essentially, this is what logistic regression does except that instead of having a simple constant P as we had, logistic regression lets P be a function of the independent variables, Xi.
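
The widget example can also be checked numerically instead of with calculus. The sketch below simply evaluates the likelihood on a grid of candidate values of P and keeps the best one; the grid search is an illustration only and is not how JMP IN or any other package estimates parameters.

```python
def likelihood(p):
    """Likelihood of observing good, good, defective, good."""
    return p * p * (1.0 - p) * p

# Candidate values of P from 0.000 to 1.000 in steps of 0.001.
grid = [k / 1000 for k in range(1001)]

# Keep the candidate with the greatest likelihood.
p_hat = max(grid, key=likelihood)
print("maximum-likelihood estimate of P:", p_hat)
print("likelihood at that P:", likelihood(p_hat))
```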

Interpreting the ßi
Recall that e^(x+y) = e^x × e^y.  So the fundamental equation can be written as

R = e^ß0 × e^(ß1X1) × e^(ß2X2) × e^(ß3X3)

If one of the ßi is zero, then e^(ßiXi) is 1, so that term drops out of the equation--multiplying by 1 has no effect.

If ßi is different from zero then changes in Xi will have a multiplicative effect on R.
A one-unit increase in Xi will change e^(ßiXi) to e^(ßi(Xi + 1)), which is e^(ßiXi) × e^ßi.  So a one-unit increase in Xi will multiply R by a factor of e^ßi.
So, if  ßi = 2.0 then a one-unit change in Xi will increase R by a factor of 7.389.
If ßi = 0.5 then a one-unit change in Xi will increase R by a factor of 1.649.
If ßi = -0.4 then a one-unit change in Xi will multiply R by a factor of 0.6703, which is really a decrease of 32.97%.
R would double as a result of a one-unit increase in Xi if ßi = 0.6931.

When ßi = 2.0 then a one-unit change in Xi will increase R by a factor of 7.389.  So a three-unit change in Xi will increase R by a factor of 3 × 7.389, right?
WRONG!   Each one-unit increase in Xi causes R to increase by a multiplicative factor of 7.389.  So three one-unit increases in Xi will cause R to increase by a factor of

7.389 × 7.389 × 7.389 = 7.389^3 = 403.42.
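
The factors quoted in this section can be reproduced with the short sketch below. The helper name odds_factor is hypothetical and used only for this illustration.

```python
import math

def odds_factor(beta, delta_x=1.0):
    """Factor by which the odds R are multiplied when X rises by delta_x."""
    return math.exp(beta * delta_x)

# One-unit changes for the coefficients discussed above.
for beta in (2.0, 0.5, -0.4, math.log(2)):
    print(f"beta = {beta:7.4f}   factor = {odds_factor(beta):.4f}")

# A three-unit change compounds: e^(3*beta) = (e^beta)^3, not 3 * e^beta.
print("three-unit change, beta = 2.0:", odds_factor(2.0, 3))
print("same thing as a cube:         ", odds_factor(2.0) ** 3)
```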

Testing the significance of the whole equation

To test the significance of the whole equation (or the whole model) means to test whether any of the explanatory variables (the Xs) have an effect on the dependent variable (Y).  If we find that the answer is "yes", we would also want to test the significance of individual explanatory variables.  The approach used here is to measure the "fit" of the model to the data with and without the explanatory variables.  The "fit" is measured by the likelihood (probability) that the observed sample data should be observed under the assumptions of the model.  Technically, the fit is measured as the negative of twice the natural logarithm of the likelihood, -2LL.

The discussion below uses the following dataset for an example:

X    Y       Predicted Y    Predicted Y
2    Hit
2    Hit
2    Miss
2    Hit
4    Miss
4    Hit
4    Miss
4    Miss
6    Miss
6    Miss
6    Miss
6    Miss

-2LL.  The natural logarithm of the likelihood function can be used to test the null hypothesis that all of the ßi other than ß0 are equal to zero.  This is similar to the null hypothesis of the F statistic in a familiar multiple linear regression.  If all of the ßi except for ß0 are equal to zero, the best prediction for the odds ratio is the odds ratio for all Y in the whole sample.

Prediction under the null hypothesis.  The null hypothesis says that none of the X variables has any influence on the probability or on the odds that Y will be a success.  If this is true, we could get the best estimates of the odds ratio by running a logistic regression with no X variables. (JMP IN actually lets us do this.) Under this circumstance, the best estimate of the odds ratio, R, would be the odds ratio for the whole sample.  For example, consider a sample of 12 observations in which one-third of the Ys are successes and two-thirds are failures.  The naive estimate (using the null hypothesis) of the odds ratio would be 1/3 ÷ 2/3 = 1/2.  R will be 1/2 if ß0 is ln(1/2) = -0.69315.  (To verify this, note that e^(-0.69315) = 1/2.)  If the null hypothesis is true, each success had a 1/3 probability of occurring and each failure had a 2/3 probability.  Based on these probabilities, the probability of observing four successes and eight failures in the order in which we observed them in this sample would be (1/3)^4 × (2/3)^8 = 0.00048170916.  The natural logarithm of this probability is -7.63817.

The JMP IN output for this model--using no X variable--is as shown below:

Model        -LogLikelihood   DF   ChiSquare   Prob>ChiSq
Difference   0                0    0           0.0000
Full         7.63817002
Reduced      7.63817002

RSquare (U)    0
Observations   12

Term        Estimate     Std. Error   Chi Square   Prob>ChiSq
Intercept   -0.6931472   0.61237244   1.28120803   0.2577

As we calculated, the estimate of ß0 is -0.69315.  Also note that the negative of the LogLikelihood value is 7.63817, as we calculated.
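
The hand calculation for the no-X model can be reproduced as follows. This is a sketch under the same assumption made in the text (four successes and eight failures, each observation independent); the variable names are chosen for this example only.

```python
import math

n_success, n_failure = 4, 8            # counts in the 12-observation sample

# Under the null hypothesis every observation has the same probability.
p = n_success / (n_success + n_failure)
odds = p / (1 - p)                     # naive estimate of R
b0 = math.log(odds)                    # intercept that produces those odds

# Log of the likelihood (1/3)^4 * (2/3)^8, computed on the log scale.
log_likelihood = n_success * math.log(p) + n_failure * math.log(1 - p)

print("estimate of ß0: ", round(b0, 7))
print("-LogLikelihood: ", round(-log_likelihood, 5))
```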

Prediction under the alternative hypothesis.  The alternative hypothesis says that the X variables do influence the probability of success and the odds ratio.  The alternative hypothesis does not constrain the ß1, ß2, and ß3 to equal zero.  In the 12-observation sample used above, it is apparent that a success ("Hit") is more likely when X is 2 than when X is 6.  The JMP IN results of this logistic regression are the following:

Whole Model Test

Model        -LogLikelihood   DF   Chi Square   Prob>ChiSq
Difference   3.03305336       1    6.0661067    0.01378004
Full         4.60511666
Reduced      7.63817002

RSquare (U)    0.3971
Observations   12

Term        Estimate     Std. Error   Chi Square   Prob>ChiSq
Intercept   3.75134428   2.3346487    2.58184856   0.10809536
X           -1.2703325   0.69619146   3.3294876    0.06804807
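
For readers who want to see where such estimates come from, the following sketch fits the same one-variable model to the 12-observation example by maximum likelihood, using Newton's method written out by hand. It is an assumption that this resembles what a statistics package does internally; the source does not say which algorithm JMP IN uses. Hit is coded as 1 and Miss as 0, and all names are invented for this illustration.

```python
import math

# The 12-observation example: X values and outcomes (1 = Hit, 0 = Miss).
xs = [2, 2, 2, 2, 4, 4, 4, 4, 6, 6, 6, 6]
ys = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0]

def prob(b0, b1, x):
    """Probability of success implied by ln R = b0 + b1*x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def neg_log_likelihood(b0, b1):
    """Negative log of the probability of the observed sample."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = prob(b0, b1, x)
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total

def fit(iterations=25):
    """Newton's method for the two coefficients, starting from zero."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = prob(b0, b1, x)
            w = p * (1.0 - p)
            g0 += y - p              # gradient of the log-likelihood
            g1 += x * (y - p)
            h00 += w                 # negative second derivatives
            h01 += x * w
            h11 += x * x * w
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det   # solve the 2x2 system
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

b0, b1 = fit()
print("Intercept:", b0, "  X:", b1)
print("-LogLikelihood (Full):", neg_log_likelihood(b0, b1))
```

The printed values can be compared with the Estimate column and the Full -LogLikelihood in the output above.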

This time, notice that the -LogLikelihoods for the Full and Reduced models are different.  The -LogLikelihood for the Reduced model (without the X variables, or with ß1 set equal to zero) is the same as before.  The -LogLikelihood for the Full model (with the X variables in the equation) is smaller, indicating that the estimated likelihood of getting the sample that we observed is greater than the likelihood that the naive model gives us.  The -LogLikelihood for the Difference between the models measures how much adding the X variables improves the model's fit and predictive power.

Statisticians have demonstrated that, for large sample sizes and when the null hypothesis is true, twice the difference in the -LogLikelihoods has a probability distribution that is approximately a chi-square distribution.  Note that the chi-square statistic is exactly twice the -LogLikelihood for the Difference: 2 × 3.03305336 = 6.0661067.  The number of degrees of freedom is equal to the number of X variables in the model.  The Prob>ChiSq is the p value for the whole-model test.

If there is a significant improvement in the fit of the model, the difference in the -LogLikelihoods will be great and the chi-square will be large.  The probability that we would randomly find a chi-square value this large or larger will therefore be small.  So, small p values (less than 5% usually, sometimes less than 1% if we want to be more cautious) indicate that changes in X do have an effect on the probability of success.  (This is a whole-model test.  If there had been more than one X variable, a small p value would indicate that at least one, but not necessarily all, of the X variables has an effect on the probability of success.)  On the other hand, if there really is no effect of X on the probability of success, the difference would be small (it should be zero but there is always some randomness), the chi-square value would be small, and the p value would be large (0.5, say, or even close to 1).
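
The whole-model chi-square and its p value can be recomputed from the two -LogLikelihood figures in the output. The sketch below handles only the one-degree-of-freedom case of this example, where the right-tail chi-square probability equals erfc(sqrt(chi-square / 2)); with more X variables a general chi-square routine would be needed.

```python
import math

neg_ll_reduced = 7.63817002   # -LogLikelihood without X (from the output)
neg_ll_full = 4.60511666      # -LogLikelihood with X (from the output)

# Chi-square statistic: twice the improvement in -LogLikelihood.
chi_square = 2.0 * (neg_ll_reduced - neg_ll_full)

# Right-tail probability for a chi-square with exactly 1 d.f.
# (a chi-square with 1 d.f. is the square of a standard normal variable).
p_value = math.erfc(math.sqrt(chi_square / 2.0))

print("ChiSquare: ", round(chi_square, 7))
print("Prob>ChiSq:", round(p_value, 8))
```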

The Logit R² (sometimes denoted by U) is similar to the R² from a linear regression analysis.  It gives the percentage of the overall fit that is due to using the X variables in the model to change the probabilities.  It is calculated as

R² = [(-2LLnull) - (-2LLmodel)]÷[-2LLnull]

or in terms of the JMP IN output

R² = (-LogLikelihood for Difference) ÷ (-LogLikelihood for Reduced)

In the example above, we have R² = 3.033 ÷ 7.6381 = 0.3971.  If the X variable had not had any influence on the probability of success, R² would be close to zero.  At the other extreme, if knowing X could let us predict success or failure perfectly, R² would be 1.00.

Making predictions.  Once we have estimates of the parameters (the ßi) of the fundamental equation, we can estimate the probability of success for any observation or any hypothetical values of the X variables.  A natural criterion is to predict success only when the probability is greater than 1/2 (equivalently, when the odds ratio, R, is greater than 1.0, or when ß0 + ß1X1 + ß2X2 + ß3X3 is greater than zero).  However, sometimes we may be willing to miss a few true successes so that we are less apt to incorrectly predict success when in fact the case turns out to be a failure.  For example, we might be trying to predict which patients will be suitable for a new drug.  If the drug has fatal side-effects, we would want to reduce the chance of falsely predicting success.
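
As an illustration, the sketch below plugs the estimates from the JMP IN output for the 12-observation example into the fundamental equation and applies a cutoff. The function names and the alternative cutoff of 0.8 are assumptions made for this example.

```python
import math

# Estimates taken from the JMP IN output for the example.
B0, B1 = 3.75134428, -1.2703325

def predicted_probability(x):
    """P = R / (1 + R), where ln R = B0 + B1*x."""
    odds = math.exp(B0 + B1 * x)
    return odds / (1.0 + odds)

def predict_success(x, cutoff=0.5):
    """Predict success only when the probability exceeds the cutoff."""
    return predicted_probability(x) > cutoff

for x in (2, 4, 6):
    p = predicted_probability(x)
    print(f"X = {x}   P = {p:.4f}   predict success: {predict_success(x)}")

# A stricter (hypothetical) cutoff makes false predictions of success rarer.
print("X = 2 with cutoff 0.8:", predict_success(2, cutoff=0.8))
```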

Classification matrix:  Whatever criterion we use, we can classify our predictions according to the table below.  The percentage correctly classified (also called the hit ratio) is the sum of the number of successes correctly predicted and the number of failures correctly predicted divided by the total number of cases.  Some researchers choose their criterion so as to maximize their hit ratio.  JMP IN's Receiver Operating Characteristic curve (ROC) plots similar information for various criterion levels.

                 Predicted success   Predicted failure
Actual success   3                   1
Actual failure   1                   7
Hit ratio = (3 + 7)/12 = 0.833
Cpro (proportional chance criterion)
Press's Q (not to be confused with the PRESS statistic for validation of a regression)
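
The classification matrix and hit ratio above can be rebuilt from the example data and the fitted coefficients. This sketch assumes the 1/2 cutoff and codes Hit as 1 and Miss as 0; the function name classification_matrix is invented for the illustration.

```python
import math

# The 12-observation example (1 = Hit, 0 = Miss) and the JMP IN estimates.
xs = [2, 2, 2, 2, 4, 4, 4, 4, 6, 6, 6, 6]
ys = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0]
B0, B1 = 3.75134428, -1.2703325

def classification_matrix(cutoff=0.5):
    """Return counts in the order: actual success predicted success,
    actual success predicted failure, actual failure predicted success,
    actual failure predicted failure."""
    ss = sf = fs = ff = 0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-(B0 + B1 * x)))
        predicted = 1 if p > cutoff else 0
        if y == 1 and predicted == 1:
            ss += 1
        elif y == 1:
            sf += 1
        elif predicted == 1:
            fs += 1
        else:
            ff += 1
    return ss, sf, fs, ff

ss, sf, fs, ff = classification_matrix()
hit_ratio = (ss + ff) / (ss + sf + fs + ff)
print("Actual success:", ss, sf)
print("Actual failure:", fs, ff)
print("Hit ratio:", round(hit_ratio, 3))
```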

Testing the significance of an explanatory variable
The Wald statistic is used to test the null hypothesis that a ßi is equal to zero, similar to a t statistic in the familiar multiple linear regression.   If the p value for a coefficient ßi is small (less than or equal to α), the null hypothesis should be rejected, implying that the variable Xi does influence the probability that Y will be 1.0 (a "success").
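
The Chi Square column in the parameter table of the example appears to be the square of Estimate ÷ Std. Error, referred to a chi-square distribution with one degree of freedom. The sketch below recomputes it on that assumption; wald_test is a hypothetical helper, not a function of any statistics package.

```python
import math

def wald_test(estimate, std_error):
    """Return the Wald chi-square (1 d.f.) and its right-tail p value."""
    chi_square = (estimate / std_error) ** 2
    # Right-tail probability for a chi-square with 1 d.f.
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return chi_square, p_value

# Estimates and standard errors from the JMP IN output for the example.
for term, est, se in (("Intercept", 3.75134428, 2.3346487),
                      ("X", -1.2703325, 0.69619146)):
    chi_square, p_value = wald_test(est, se)
    print(f"{term:10s} Chi Square = {chi_square:.5f}   Prob>ChiSq = {p_value:.5f}")
```

In this small sample the Wald p value for X is above 0.05 even though the whole-model test is below it, as the two output tables show.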

If we find that some of the variables do not seem to have an effect on Y, it is usually good practice to reestimate the model without those variables.  (However, if there is a theoretical reason to believe that a variable should have an effect on Y, then it would be better to leave the variable in the equation.  Failing to prove that a variable influences Y is not the same as proving that it does not influence Y.)

Glossary

• level of significance: the probability of making a type I error when the null hypothesis is true.  Denoted by α (alpha).
• likelihood function
• maximum likelihood
• odds: a way of expressing a probability:  If the odds are 3 to 2, then the probability is 3/5.
• odds ratio: R, the probability that an event will happen divided by the probability that it will not happen.  R = P/Q = P/(1 - P)     So, P = R/(1+R)
• Wald statistic: a chi-square statistic used to test either the significance of the whole model or the significance of a single explanatory variable in the model.
Last updated March 16, 2003, by James R. Frederick