Oversampling
Oversampling is used when you have fewer than 10 events per independent variable in your logistic regression model. Suppose there are 9,900 non-events and 100 events in 10,000 cases. You need to oversample the events, which here means reducing the number of non-events so that the proportions of events and non-events become balanced. Strictly speaking, this is downsampling the majority class, which is why the code below uses caret's downSample function.
You take a small proportion of the many non-event cases and a large proportion of the relatively few event cases.
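To make the 10k-case example concrete, here is a minimal base R sketch. The data frame `toy` and its columns `id` and `event` are made up for illustration and are not part of the original program. It keeps every event and draws a random sample of non-events of the same size.

```r
# Hypothetical data matching the example: 100 events and 9,900 non-events
set.seed(1)
toy <- data.frame(id = 1:10000, event = c(rep(1, 100), rep(0, 9900)))

events     <- toy[toy$event == 1, ]
non_events <- toy[toy$event == 0, ]

# Keep all events (a large proportion of the rare class) and sample
# the same number of non-events (a small proportion of the common class)
keep     <- sample(nrow(non_events), nrow(events))
balanced <- rbind(events, non_events[keep, ])

table(toy$event)       # class counts before sampling
table(balanced$event)  # class counts after sampling
```

Here all of the events are kept, but only about 1% of the non-events (100 out of 9,900).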
R Code: Oversampling for Rare Event Model
In the program below, we are keeping all the events and the same number of non-events.
library(caret)
# Read data file
mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
table(mydata$admit)
#OverSampling - 50:50
mydata$admit = as.factor(mydata$admit)
set.seed(9)
down_train <- downSample(x = subset(mydata, select = -c(admit)), y = mydata$admit, yname = "admit")
table(down_train$admit)
In the program below, we keep all the events and only as many non-events as needed so that events make up 40% of the sample after oversampling.
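# Target percentage of events in the final sample
samplepcnt = 40
# Number of non-events to keep = events * (100/samplepcnt - 1)
# (assumes events are the smaller class, so min(table(...)) is the event count)
minClass <- floor(min(table(mydata$admit))*(100/samplepcnt-1))
dt = subset(mydata, admit==0)
set.seed(112)
# Randomly pick that many non-event rows
dt2 = sort(sample(nrow(dt), minClass))
dt3 = dt[dt2,]
# Combine all events with the sampled non-events
dt4 = rbind(subset(mydata, admit==1),dt3)
nrow(dt4)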
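The arithmetic behind the program above: with E events and a target event percentage p, the number of non-events to keep is E * (100/p - 1). The sketch below wraps the same steps in a function and checks the resulting ratio on made-up data. The function name `sample_to_event_pct`, its arguments, and the `toy` data frame are hypothetical, not part of the original code. It assumes the target column is coded 1 for events and 0 for non-events.

```r
# Hypothetical helper: keep all events and sample non-events so that
# events make up roughly `event_pct` percent of the returned data.
sample_to_event_pct <- function(data, target, event_pct, seed = 112) {
  events     <- data[data[[target]] == 1, , drop = FALSE]
  non_events <- data[data[[target]] == 0, , drop = FALSE]
  # Non-events needed: events * (100 / event_pct - 1),
  # capped at the number of non-events actually available
  n_needed <- floor(nrow(events) * (100 / event_pct - 1))
  n_needed <- min(n_needed, nrow(non_events))
  set.seed(seed)
  idx <- sort(sample(nrow(non_events), n_needed))
  rbind(events, non_events[idx, , drop = FALSE])
}

# Toy data (made up for illustration): 100 events and 900 non-events
toy <- data.frame(x = seq_len(1000), admit = c(rep(1, 100), rep(0, 900)))
out <- sample_to_event_pct(toy, "admit", event_pct = 40)

nrow(out)                      # total rows kept
prop.table(table(out$admit))   # share of non-events and events
```

With 100 events and a 40% target, the formula asks for 100 * (100/40 - 1) = 150 non-events, so the result has 250 rows. The cap on `n_needed` is a design choice in this sketch: without it, `sample()` would raise an error when the target rate requires more non-events than the data contains.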