Glossary
The Fundamental Equation
Working with odds ratios
Why no error term?
Why isn't Y in the fundamental equation?
How does maximum-likelihood estimation work?
Testing the significance of the whole model
Testing the significance of an explanatory variable
Performing a Logistic Regression with Computers
JMP IN
SPSS
SAS
Interpreting the output
Suppose that as the manager of a store's credit department, you have access to financial information (including categorical and numeric variables) about customers from credit-reporting agencies. Furthermore, based on your experience with these customers in the past, you can objectively classify them as good borrowers or as customers who have defaulted on a payment to the store. Logistic regression lets you see how the financial data relate to being a good borrower. With this information you could tell which new customers should be granted store credit and which should not.
Glossary
Independent variable: One or more numeric variables (may include dummy variables).
Dependent variable: A categorical variable with only two values, such as Yes/No or On/Off (such a variable is called a dichotomous variable for obvious reasons). CAUTION: if Yes or No refers to whether an event happens after a variable length of time, logistic regression is inappropriate. See time-to-event (survival) models.
Null hypothesis (H_{0}): None of the independent variables affects the probability that the dependent variable will be Yes or No. This implies that ß_{1}, ß_{2}, and ß_{3} are all zero and that only ß_{0} differs from zero.
Research hypothesis: The dependent variable is more likely to be Yes for some values of the independent variables than for others. This implies that some of ß_{1}, ß_{2}, and ß_{3} differ from zero.
Test statistic: χ²
Rejection region: Right tail (values of χ² that are significantly larger than its d.f.)
The Fundamental Equation
R = e^{ß_{0} + ß_{1}X_{1} + ß_{2}X_{2} + ß_{3}X_{3}}
or
ln R = ß_{0} + ß_{1}X_{1} + ß_{2}X_{2} + ß_{3}X_{3}
where
R is the odds ratio that an event will occur,
X_{i} are the
independent variables,
Y is the dependent variable, called the response
variable.
This equation says that the odds that something will happen
(and hence the probability that it will happen) depend on some explanatory
variables, the X_{i}.
For example, if ß_{3} is positive, then an increase in X_{3} will increase the odds that the event will occur.
Working with odds ratios
The odds ratio is the ratio of the probability
that something will happen to the probability that it will not happen.
Let R = the odds ratio and let P = the probability
that something will happen.
Then the odds ratio is R = P ÷ (1 - P) and P = R ÷ (1 + R).
Furthermore, ln R = ln P - ln(1 - P).
For example, if P = 0.75 then R = 0.75 ÷ 0.25 = 3: the event is three times as likely to happen as not to happen.
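These conversions are easy to check numerically. Below is a minimal sketch in Python; the function names are illustrative only, not from any statistics package:

```python
import math

def odds_from_prob(p):
    """Odds ratio R = P / (1 - P)."""
    return p / (1.0 - p)

def prob_from_odds(r):
    """Probability P = R / (1 + R)."""
    return r / (1.0 + r)

p = 0.75
r = odds_from_prob(p)                  # 0.75 / 0.25 = 3.0
print(r)                               # 3.0
print(prob_from_odds(r))               # back to 0.75
print(math.log(r))                     # ln R, about 1.0986
print(math.log(p) - math.log(1 - p))   # ln P - ln(1 - P), same value
```

The last two lines confirm the identity ln R = ln P - ln(1 - P).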
Why isn't Y in the fundamental equation?
We might use the probability that the result would occur. This would be better than a dichotomous variable in that a probability must be a real number between 0 and 1. But the regression could predict that the probability is 1.2 or -0.3, and we know that probabilities cannot have such values.
The odds ratio is even better because it can have any nonnegative real value.
Why no error term?
On the other hand, the logistic regression method discussed here uses maximum-likelihood estimation. For each observation j, the method calculates the probability of observing Y_{j} based on the odds ratio R_{j} for that observation. The method then chooses the set of ß_{i}s that maximizes the probability of observing the Y_{j} that we did observe in the sample.
To perform a maximum-likelihood estimation we need a function that tells how the probabilities of the results are determined. The fundamental equation gives us this information by giving us a formula for the odds ratio. These probabilities are applied to the Y_{j}.
How does maximum-likelihood estimation work?
The maximum-likelihood criterion for estimating a set of parameters specifies a probability function in terms of those parameters and then finds the set of values for the parameters that gives the greatest likelihood of observing the results that we actually did observe.
For example, take a simple one-parameter function: the probability that one of Acme's widgets will be good is P and the probability that it will be defective is (1 - P). We sample four of their widgets and find that the third one is defective and the others are good.
To apply this probability scheme to the observed results (the Y_{j}) we assume that these widgets are selected randomly and independently.
The probability of finding such a sample is then L = P × P × (1 - P) × P = P^{3}(1 - P) = P^{3} - P^{4}. Use calculus to find the P that maximizes this function: dL/dP = 3P^{2} - 4P^{3} = 0, so P = 3/4, which happens to be the sample proportion. This illustrates that the sample proportion is a maximum-likelihood estimator of the probability that a randomly chosen Y will be a success.
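The widget calculation can be checked numerically by evaluating L = P^{3}(1 - P) over a grid of candidate values of P; the grid search stands in for the calculus. A minimal sketch:

```python
def likelihood(p):
    """Probability of observing good, good, defective, good: P^3 * (1 - P)."""
    return p**3 * (1.0 - p)

# Evaluate L over a fine grid of candidate probabilities and take the best.
grid = [i / 1000.0 for i in range(1001)]
p_hat = max(grid, key=likelihood)
print(p_hat)               # 0.75, the sample proportion of good widgets
print(likelihood(p_hat))   # about 0.1055, the maximized likelihood
```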
Essentially, this is what logistic regression does except
that instead of having a simple
constant P as we had, logistic regression lets
P be a function of the independent variables, X_{i}.
Interpreting the ß_{i}
Recall that e^{x+y} = e^{x}e^{y}. So the fundamental equation can be written as
R = e^{ß_{0}} e^{ß_{1}X_{1}} e^{ß_{2}X_{2}} e^{ß_{3}X_{3}}
If one of the ß_{i} is zero, then e^{ß_{i}X_{i}} is 1, so that term drops out of the equation (multiplying by 1 has no effect).
If ß_{i} is different from zero, then changes in X_{i} will have a multiplicative effect on R. A one-unit change in X_{i} will change e^{ß_{i}X_{i}} to e^{ß_{i}(X_{i} + 1)}, which is e^{ß_{i}X_{i}} e^{ß_{i}}. So a one-unit change in X_{i} will change R by a factor of e^{ß_{i}}.
So, if ß_{i} = 2.0 then a one-unit change in X_{i} will increase R by a factor of 7.389.
If ß_{i} = 0.5 then a one-unit change in X_{i} will increase R by a factor of 1.649.
If ß_{i} = -0.4 then a one-unit change in X_{i} will change R by a factor of 0.6703, which is really a decrease of 32.97%.
R would double as a result of a one-unit increase in X_{i} if ß_{i} = 0.6931.
When ß_{i} = 2.0, a one-unit change in X_{i} will increase R by a factor of 7.389. So a three-unit change in X_{i} will increase R by a factor of 3 × 7.389, right? WRONG! Each one-unit increase in X_{i} causes R to increase by a multiplicative factor of 7.389. So three one-unit increases in X_{i} will cause R to increase by a factor of 7.389 × 7.389 × 7.389 = 7.389³ = e^{6} ≈ 403.4.
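These factors are just powers of e^{ß_i}, which is easy to confirm with a short Python check (the function name is illustrative):

```python
import math

def odds_factor(beta, delta_x):
    """Multiplicative change in the odds ratio R when X_i changes by delta_x."""
    return math.exp(beta * delta_x)

print(odds_factor(2.0, 1))       # 7.389...: one-unit change with beta = 2.0
print(odds_factor(0.5, 1))       # 1.649...
print(odds_factor(-0.4, 1))      # 0.6703..., a decrease of about 33%
print(odds_factor(2.0, 3))       # 403.4...: three-unit change is 7.389^3, NOT 3 x 7.389
print(odds_factor(2.0, 1) ** 3)  # same thing
```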
Testing the significance of the whole equation
To test the significance of the whole equation (or the whole model) means to test whether any of the explanatory variables (the Xs) have an effect on the dependent variable (Y). If we find that the answer is "yes", we would also want to test the significance of individual explanatory variables. The approach used here is to measure the "fit" of the model to the data with and without the explanatory variables. The "fit" is measured by the likelihood (probability) that the observed sample data should be observed under the assumptions of the model. Technically, the fit is measured as the negative of twice the natural logarithm of the likelihood, -2LL.
The discussion below uses the following dataset for an example:
X   Y
2   Hit
2   Hit
2   Miss
2   Hit
4   Miss
4   Hit
4   Miss
4   Miss
6   Miss
6   Miss
6   Miss
6   Miss
-2LL. The natural logarithm of the likelihood function can be used to test the null hypothesis that all of the ß_{i} are equal to zero. This is similar to the null hypothesis of the F statistic in a familiar multiple linear regression.
If all of the ß_{i} except for ß_{0} are equal to zero, the best prediction
for the odds ratio is the odds ratio for all Y in the whole sample.
Prediction under the null hypothesis. The
null hypothesis says that none of the X variables has any influence
on the probability or on the odds that Y will be a success. If
this is true, we could get the best estimates of the odds ratio by running
a logistic regression with no X variables. (JMP IN actually lets us do
this.) Under this circumstance, the best estimate of the odds ratio,
R, would be the odds ratio for the whole sample. For example, consider a sample of 12 observations in which one-third of the Ys are successes and two-thirds are failures. The naive estimate (using the null hypothesis) of the odds ratio would be 1/3 ÷ 2/3 = 1/2. R will be 1/2 if ß_{0} is ln(1/2) = -0.69315. (To verify this, note that e^{-0.69315} = 1/2.) If the null hypothesis is true, each success had a 1/3 probability of occurring and each failure had a 2/3 probability. Based on these probabilities, the probability of observing four successes and eight failures (as we did observe in this sample) would be (1/3)^{4}(2/3)^{8} = 0.00048170916. The natural logarithm of this probability is -7.63817.
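The arithmetic behind this naive (reduced) model can be reproduced directly from the sample proportions. A minimal Python sketch:

```python
import math

successes, failures = 4, 8
n = successes + failures
p = successes / n                         # 1/3, the sample proportion of successes

# Probability of observing this particular sequence of 4 successes and 8 failures
likelihood = p**successes * (1 - p)**failures
print(likelihood)                         # about 0.00048170916

log_likelihood = math.log(likelihood)
print(log_likelihood)                     # about -7.63817
print(-2 * log_likelihood)                # -2LL, about 15.27634
```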
The JMP IN output for this model, using no X variables, is shown below:
Model         LogLikelihood   DF   ChiSquare    Prob>ChiSq
Difference    0               0    0            0.0000
Full          7.63817002
Reduced       7.63817002

RSquare (U)    0
Observations   12

Term        Estimate     Std. Error   ChiSquare    Prob>ChiSq
Intercept   -0.6931472   0.61237244   1.28120803   0.2577
Prediction under the alternative hypothesis. The
alternative hypothesis says that the X variables do influence the
probability of success and the odds ratio. The alternative hypothesis
does not constrain the ß_{1}, ß_{2}, and ß_{3} to equal zero. In the 12-observation
sample used above, it is apparent that a success ("Hit") is more likely
when X is 2 than when X is 6. The JMP IN results of this logistic
regression are the following:
Model         LogLikelihood   DF   ChiSquare   Prob>ChiSq
Difference    3.03305336      1    6.0661067   0.01378004
Full          4.60511666
Reduced       7.63817002

RSquare (U)    0.3971
Observations   12

Term        Estimate     Std. Error   ChiSquare    Prob>ChiSq
Intercept   3.75134428   2.3346487    2.58184856   0.10809536
X           -1.2703325   0.69619146   3.3294876    0.06804807
This time, notice that the LogLikelihoods for the Full
and Reduced models are different. The LogLikelihood for the Reduced
model (without the X variables, or with ß_{1} set equal to zero) is the same as before. The LogLikelihood
for the Full model (with the X variables in the equation) is less, indicating
that the estimated likelihood of getting the sample that we observed is
greater than the likelihood that the naive model gives us. The
LogLikelihood for the Difference between the models is a measure of how adding the X variables to the model improves the model's fit and predictive power. Statisticians have demonstrated that for large sample sizes, 2 × LogLikelihood has a probability distribution that is approximately a chi-square distribution. Note that the chi-square statistic is exactly
twice the LogLikelihood for the Difference. The number of degrees of freedom is equal to the number of X variables in the model. The Prob>ChiSq is the p value for the whole-model test. If there is a significant improvement in the fit of the model, the difference in the LogLikelihoods will be great and the chi-square will be large. The probability that we would randomly find a chi-square value this large or larger will therefore be small. So, small p values (less than 5% usually, sometimes less than 1% if we want to be more cautious) indicate that changes in X do have an effect on the probability of success. (This is a whole-model test. If there had been more than one X variable, a small p value would indicate that at least one, but not necessarily all, of the X variables has an effect on the probability of success.) On the other hand, if there really is no effect of X on the probability of success, the difference would be small (it should be near zero, but there is always some randomness), the chi-square value would be small, and the p value would be closer to 0.5 or even 1.0.
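The whole-model test can be reproduced from the two LogLikelihoods alone. The sketch below uses only the Python standard library, relying on the fact that the right-tail probability of a chi-square with 1 degree of freedom equals erfc(√(x/2)):

```python
import math

neg_ll_reduced = 7.63817002   # -LogLikelihood, intercept-only (reduced) model
neg_ll_full = 4.60511666      # -LogLikelihood, model with X included

# Likelihood-ratio chi-square: twice the LogLikelihood for the Difference
chi_square = 2 * (neg_ll_reduced - neg_ll_full)
print(chi_square)             # 6.0661..., matching the JMP IN output

# Right-tail p value for a chi-square with 1 degree of freedom
p_value = math.erfc(math.sqrt(chi_square / 2))
print(p_value)                # about 0.0138

# Logit R-squared (U): share of the naive model's fit explained by adding X
r_squared = (neg_ll_reduced - neg_ll_full) / neg_ll_reduced
print(r_squared)              # about 0.3971
```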
The Logit R² (sometimes denoted by U) is similar to the R² from a linear regression analysis. It gives the percentage of the overall fit that is due to using the X variables in the model to change the probabilities. In terms of the JMP IN output, it is calculated as the LogLikelihood for the Difference divided by the LogLikelihood for the Reduced model. In the example above, we have R² = 3.033 ÷ 7.6381 = 0.3971. If the X variable had not had any influence on the probability of success, R² would be close to zero. At the other extreme, if knowing X could let us predict success or failure perfectly, R² would be 1.00.
Making predictions. Once we have estimates
of the parameters (the ß_{i}) of the fundamental equation, we can estimate the
probability of success for any observation or any hypothetical values of
the X variables. A natural criterion for predicting success is to only
predict success when the probability is greater than 1/2 (and the odds ratio,
R, is greater than 1.0 and ß_{0} + ß_{1}X_{1} + ß_{2}X_{2} +
ß_{3}X_{3} is greater than zero). However, sometimes we may be willing to miss a few true successes so that we are less apt to incorrectly predict success when in fact the case turns out to be a failure. For example, we might be trying to predict which patients will be suitable for a new drug. If the drug has fatal side-effects, we would want to reduce the chance of falsely predicting success.
Classification matrix: Whatever criterion we use, we can classify our predictions according to the table below. The percentage correctly classified (also called the hit ratio) is the sum of the number of successes correctly predicted and the number of failures correctly predicted divided by the total number of cases. Some researchers choose their criterion so as to maximize their hit ratio. JMP IN's Receiver Operating Characteristic curve (ROC) plots similar information for various criterion levels.
                  Predicted success   Predicted failure
Actual success    3                   1
Actual failure    1                   7
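Using the estimates from the output above and the greater-than-1/2 criterion, the classification matrix and hit ratio for the 12-observation sample can be reproduced. A minimal sketch (the TP/FN/FP/TN labels are illustrative shorthand for the four cells):

```python
import math

b0, b1 = 3.75134428, -1.2703325   # estimates from the JMP IN output above

def prob_success(x):
    """P = R / (1 + R), where ln R = b0 + b1 * x."""
    r = math.exp(b0 + b1 * x)
    return r / (1.0 + r)

# The 12-observation sample (True = Hit, False = Miss)
data = [(2, True), (2, True), (2, False), (2, True),
        (4, False), (4, True), (4, False), (4, False),
        (6, False), (6, False), (6, False), (6, False)]

# Classify with the P > 1/2 criterion and tally the classification matrix
matrix = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
for x, hit in data:
    predicted = prob_success(x) > 0.5
    if hit and predicted:
        matrix["TP"] += 1
    elif hit:
        matrix["FN"] += 1
    elif predicted:
        matrix["FP"] += 1
    else:
        matrix["TN"] += 1

print(matrix)      # {'TP': 3, 'FN': 1, 'FP': 1, 'TN': 7}, matching the table
hit_ratio = (matrix["TP"] + matrix["TN"]) / len(data)
print(hit_ratio)   # 10/12, about 0.833
```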
Testing the significance of an explanatory variable
The Wald statistic is used to test the null
hypothesis that a ß_{i} is equal
to zero, similar to a t statistic in the familiar multiple linear
regression. If the p value for a coefficient b_{i} is small (less than or equal to the significance level α), the null hypothesis should be rejected, implying that the variable X_{i} does influence the probability that Y will be 1.0 (a "success").
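The Wald chi-square is the squared ratio of an estimate to its standard error. A short sketch reproducing the X row of the output above (stdlib only; for 1 degree of freedom the chi-square tail probability is erfc(√(x/2))):

```python
import math

estimate, std_error = -1.2703325, 0.69619146   # X row of the JMP IN output

# Wald chi-square statistic: the squared ratio of the estimate to its SE
wald = (estimate / std_error) ** 2
print(wald)          # about 3.3295, matching the output

# Right-tail p value for a chi-square with 1 degree of freedom
p_value = math.erfc(math.sqrt(wald / 2))
print(p_value)       # about 0.0680
```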
If we find that some of the variables do not seem to have
an effect on Y, it is usually good practice to reestimate the model without
those variables. (However, if there is a theoretical reason to believe that a variable should have an effect on Y, then it would be better to leave the variable in the equation. Failing to prove that a variable influences Y is not the same as proving that it does not influence Y.)