Oversampling for Rare Event with R

Deepanshu Bhalla
Oversampling

Oversampling is used when events are rare, for example when you have fewer than 10 events per independent variable in your logistic regression model. Suppose there are 9,900 non-events and 100 events in 10,000 cases. You oversample the events relative to the non-events. In practice this means keeping the events and decreasing the volume of non-events so that the proportions of events and non-events become balanced.
You take a small proportion of the many non-event cases and a large proportion of the relatively few event cases.
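The arithmetic behind the example above can be sketched in a few lines of R. The counts come from the text; the variable names are illustrative only, and the 50:50 target is an assumption used to show how different the two sampling fractions are.

```r
# Counts taken from the example in the text
n_events     <- 100
n_non_events <- 9900

# Share of events before any sampling
event_rate <- n_events / (n_events + n_non_events)

# Rule of thumb from the text: at least 10 events per independent variable
max_predictors <- floor(n_events / 10)

# Assumed 50:50 target: keep every event, draw as many non-events as events
frac_events     <- n_events / n_events
frac_non_events <- n_events / n_non_events

c(event_rate      = event_rate,
  max_predictors  = max_predictors,
  frac_events     = frac_events,
  frac_non_events = round(frac_non_events, 4))
```

Here every event is kept while only about 1% of the non-events are sampled, which is the "large proportion of events, small proportion of non-events" idea.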
R Code: Oversampling for Rare Event Model
# Load caret (for downSample) and read the data file
library(caret)
mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
table(mydata$admit)
In the program below, we keep all the events and an equal number of randomly selected non-events.
# Oversampling - 50:50 (caret's downSample randomly reduces each class to the size of the smallest class)
mydata$admit = as.factor(mydata$admit)
set.seed(9)
down_train <- downSample(x = subset(mydata, select = -c(admit)), y = mydata$admit, yname = "admit")
table(down_train$admit)
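For readers without caret installed, the same 50:50 idea can be approximated in base R. This is a minimal sketch, not caret's implementation: the helper name `down_sample_base` and the toy data frame (300 non-events, 100 events, with a made-up `gpa` column) are hypothetical and exist only so the block runs without a download.

```r
# Hypothetical helper: base-R version of the 50:50 idea (not caret's code)
down_sample_base <- function(data, class_col) {
  # Row indices grouped by class; drop = TRUE ignores unused factor levels
  groups <- split(seq_len(nrow(data)), data[[class_col]], drop = TRUE)
  n_min  <- min(lengths(groups))
  # Draw n_min row indices from every class; indexing through sample.int()
  # avoids sample()'s special behaviour when given a single number
  keep <- unlist(
    lapply(groups, function(idx) idx[sample.int(length(idx), n_min)]),
    use.names = FALSE
  )
  data[sort(keep), , drop = FALSE]
}

# Toy data (an assumption, not the admissions file used above)
set.seed(9)
toy <- data.frame(
  gpa   = round(runif(400, 2, 4), 2),
  admit = factor(rep(c(0, 1), times = c(300, 100)))
)

balanced <- down_sample_base(toy, "admit")
table(balanced$admit)
```

Both classes end up with the size of the smaller class. With caret available, `downSample` should produce the same class counts, although not necessarily the same rows.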
In the program below, we keep only as many non-events as needed so that events make up 40% of the sample after oversampling.
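# Target share of events (in %) after sampling
samplepcnt = 40
# Number of non-events to keep; assumes events are the smaller class
minClass <- floor(min(table(mydata$admit))*(100/samplepcnt-1))
# All non-events
dt =  subset(mydata, admit==0)
set.seed(112)
# Randomly pick the non-event rows to keep
dt2 = sort(sample(nrow(dt), minClass))
dt3 = dt[dt2,]
# Combine all events with the sampled non-events
dt4 =  rbind(subset(mydata, admit==1),dt3)
nrow(dt4)
# Check the resulting class shares
prop.table(table(dt4$admit))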
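The steps above can be wrapped in a small function so that the target event share becomes a parameter. This is a sketch under stated assumptions: the function name `sample_to_event_share`, its arguments, and the toy data frame are hypothetical, and the formula for the number of non-events is the same `100/samplepcnt - 1` multiplier used above.

```r
# Hypothetical helper: keep all events and sample non-events so that
# events make up roughly share_pct percent of the result
sample_to_event_share <- function(data, class_col, event_value, share_pct,
                                  seed = NULL) {
  is_event <- data[[class_col]] == event_value
  n_events <- sum(is_event)
  # Non-events needed for the target share (same multiplier as above)
  n_keep <- floor(n_events * (100 / share_pct - 1))
  non_event_rows <- which(!is_event)
  if (n_keep > length(non_event_rows)) {
    stop("Not enough non-events to reach the requested event share")
  }
  if (!is.null(seed)) set.seed(seed)
  keep <- non_event_rows[sample.int(length(non_event_rows), n_keep)]
  data[sort(c(which(is_event), keep)), , drop = FALSE]
}

# Toy data (an assumption): 300 non-events and 100 events
set.seed(1)
toy <- data.frame(x = rnorm(400),
                  admit = rep(c(0, 1), times = c(300, 100)))

out <- sample_to_event_share(toy, "admit", event_value = 1,
                             share_pct = 40, seed = 112)
prop.table(table(out$admit))
```

Because `floor()` rounds the number of non-events down, the event share can land slightly above the target when the arithmetic does not divide evenly.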
