In a random forest, you can oversample events without losing data.
There are 2 arguments in the randomForest package that control sampling :
1. strata - A (factor) variable that is used for stratified sampling.
2. sampsize - Size(s) of sample to draw. For classification, if sampsize is a vector whose length equals the number of strata, then sampling is stratified by strata, and the elements of sampsize indicate the numbers to be drawn from each stratum.
Example : sampsize = c(100, 50) OR you can write : sampsize = c('0' = 100, '1' = 50)
Meaning : To grow each tree, this randomly samples 100 and 50 entities from the two classes (with replacement by default).
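To make the per-tree draw concrete, here is a minimal base R sketch of a stratified draw with replacement. It only illustrates the idea and is not the package's internal code; the labels `y` and the helper `draw_stratified` are hypothetical names made up for this example.

```r
set.seed(1)

# Hypothetical imbalanced labels: 900 rows of class "0", 100 rows of class "1"
y <- factor(rep(c("0", "1"), times = c(900, 100)))
sampsize <- c("0" = 100, "1" = 50)

# Draw the requested number of row indices from each stratum, with replacement
draw_stratified <- function(strata, sampsize) {
  unlist(lapply(names(sampsize), function(lv) {
    idx <- which(strata == lv)
    idx[sample.int(length(idx), sampsize[[lv]], replace = TRUE)]
  }))
}

in_bag <- draw_stratified(y, sampsize)

# Class counts in the rows one tree would be grown on
table(y[in_bag])
```

Every tree gets a fresh draw like this, so across many trees most rows of the larger class are still used at some point.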
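library(caret)
set.seed(1401)
# Each tree draws the same number of rows from both classes (the count of class 1, assumed to be the rarer class)
# Note: metric = "ROC" only takes effect with trControl = trainControl(classProbs = TRUE, summaryFunction = twoClassSummary); otherwise caret warns and uses Accuracy
rf = train( class ~ ., data = training, method = "rf", ntree = 500, strata = training$class,
sampsize = rep(sum(training$class == 1), 2), metric = "ROC")
# Predicted probability of the first class level
testing$rf = predict(rf, testing, type = "prob")[,1]
library(pROC)
auc <- roc(testing$class, testing$rf, levels = rev(levels(training$class)))
plot(auc, col = rgb(1, 0, 0, .5), lwd = 2)
In the above code, sampsize = rep(sum(training$class == 1), 2) means both classes are drawn with the same frequency, e.g. sampsize = c(100 cases of 0, 100 cases of 1).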
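The same balanced draw can also be tried by calling randomForest directly, without caret. The sketch below assumes the randomForest package is installed; the data frame `toy` and its columns are synthetic and made up for illustration only.

```r
library(randomForest)

set.seed(1401)

# Hypothetical imbalanced data: 900 rows of class "0", 100 rows of class "1"
toy <- data.frame(
  x1 = c(rnorm(900, mean = 0), rnorm(100, mean = 1.5)),
  x2 = c(rnorm(900, mean = 0), rnorm(100, mean = 1)),
  class = factor(rep(c("0", "1"), times = c(900, 100)))
)

# Size of the rarer class, used as the per-class draw for every tree
nmin <- min(table(toy$class))

fit <- randomForest(class ~ ., data = toy, ntree = 200,
                    strata = toy$class,
                    sampsize = rep(nmin, 2))

# Out-of-bag confusion matrix with per-class error
fit$confusion
```

Using min(table(...)) instead of counting one hard-coded label keeps the code working whichever class happens to be the rarer one.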
Awesome.
What about regression (not classification)?