It measures the association between actual events and predicted probability. In other words, it is a measure of how close the predicted probabilities are to the actual rate of events.
In HL test, null hypothesis states that sample of observed events and non-events supports the claim about the predicted events and non-events. In other words, the model fits data well.
You need to have two variables - Dependent Variable and Probability. Fit the binary classification model (like logistic regression) to your data and get the estimated probability for each observations.
Split data into 10 sections based on ascending order of probability.
In Excel, let's say you have probability values in cells C3:C4692. Create a new variable called "Group" and then enter the following formula in cell D3 and then paste it down to the last observation.
=ROUND(RANK(C3,$C$3:$C$4692)/COUNT($C$3:$C$4692)*10,0)+1
Calculate the number of actual events and non-events in each group.
In Excel, enter unique values of column D from cell F4.
In cell G4, enter the following formula and then paste it down.
=COUNTIFS($D$3:$D$4692,F4,$B$3:$B$4692,1)
In cell H4, enter the following formula and then paste it down.
=COUNTIFS($D$3:$D$4692,F4,$B$3:$B$4692,0)
- Calculate Predicted Probability = 1 by averaging probability in each group. In cell I4, enter the following formula and then paste it down.
=AVERAGEIF($D$3:$D$4692,F4,$C$3:$C$4692)
- Calculate Predicted Probability = 0 by subtracting Predicted Probability=1 from 1. In cell J4, enter the following formula and then paste it down.
=1-I4
- Calculate expected frequency by multiplying number of cases by Predicted Probability = 1.
In cell K4, enter the following formula and then paste it down.
=I4*SUM(G4:H4)
In cell L4, enter the following formula and then paste it down.
=J4*SUM(G4:H4)
- Calculate chi-square statistics by taking frequency of observed (actual) and predicted events and non-events.
(a) Calculate the HL Statistics : In cell M4, enter the following formula and then paste it down.
=((G4-K4)^2/K4)+((H4-L4)^2/L4)
(b) Chi-square is calculated by summing the HL Statistic using the formula
=SUM(M4:M11)
.(c) P-value for the Chi-square is calculated using the following formula :
=CHIDIST(SUM(M4:M11),COUNT(F4:F11)-2)
Hosmer Lemeshow Test |
Rule : If p-value > .05, the model fits data well.
HI Deepanshu I was looking at the attached excel for HL in excel. In the column H, you do an average of all probability (irrespective of dependent variable =0 or 1) and then put that average value in Column I. I was wondering if that is the correct way or we need to filter only those probabilities where dependent variable =1 and then average those probabilities?
ReplyDeleteThis is the correct way. Thanks!
DeleteThank you very much for posting such a wonderfull tool, it saves a lot of programming and accelerates the work however, I integrated it on my data set and I'm receiving two errors: I have 11 groups instead of 10 and there are negative values. I'm not sure if I missed something during the adaptation or if the problem arises of the wide probabilities range. Would it be possible for you to check my instalation? Thank you very much.
ReplyDelete