Feature Selection : Select Important Variables with Boruta Package

This article explains how to select important variables using the boruta package in R. Variable selection is an important step in a predictive modeling project. It is also called 'feature selection'. Every private and public agency has started tracking data and collecting information on various attributes, which gives access to a very large number of candidate predictors for a predictive model. But not every variable is useful for predicting a particular outcome, so it is essential to identify the important variables and remove the redundant ones. Before building a predictive model, the exact list of important variables that yields an accurate and robust model is generally not known.

Why is Variable Selection important?
  1. Removing a redundant variable helps improve accuracy. Similarly, including a relevant variable has a positive effect on model accuracy.
  2. Too many variables might result in overfitting, which means the model fails to generalize patterns beyond the training data.
  3. Too many variables lead to slow computation, which in turn requires more memory and hardware.

Why Boruta Package?

There are a lot of packages for feature selection in R. The question arises: "What makes the boruta package so special?" See the following reasons to use the boruta package for feature selection.
  1. It works well for both classification and regression problems.
  2. It takes into account multi-variable relationships.
  3. It is an improvement on the random forest variable importance measure, which is a very popular method for variable selection.
  4. It follows an all-relevant variable selection method, in which it considers all features that are relevant to the outcome variable. In contrast, most other variable selection algorithms follow a minimal-optimal method, relying on a small subset of features that yields a minimal error on a chosen classifier.
  5. It can handle interactions between variables.
  6. It can deal with the fluctuating nature of a random forest importance measure.

Basic Idea of Boruta Algorithm
Create shuffled copies of the predictors, join them with the original predictors, and build a random forest on the merged dataset. Then compare the original variables with the randomised (shadow) variables to measure variable importance. Only variables whose importance is higher than that of the randomised variables are considered important.

How Boruta Algorithm Works

Follow the steps below to understand the algorithm -
  1. Create duplicate copies of all independent variables. When the number of independent variables in the original data is less than 5, create at least 5 copies using the existing variables.
  2. Shuffle the values of the added duplicate copies to remove their correlations with the target variable. These are called shadow features or permuted copies.
  3. Combine the original variables with their shuffled copies.
  4. Run a random forest classifier on the combined dataset and apply a variable importance measure (the default is Mean Decrease Accuracy) to evaluate the importance of each variable, where higher means more important.
  5. Then a Z score is computed for each variable: the mean accuracy loss divided by the standard deviation of the accuracy loss.
  6. Find the maximum Z score among the shadow attributes (MZSA).
  7. Tag a variable as 'unimportant' when its importance is significantly lower than MZSA, and permanently remove it from the process.
  8. Tag a variable as 'important' when its importance is significantly higher than MZSA.
  9. Repeat the above steps for a predefined number of iterations (random forest runs), or until all attributes are tagged either 'unimportant' or 'important', whichever comes first. A minimal sketch of the shadow-feature idea is shown below.
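
To make the idea concrete, here is a minimal sketch of a single round of the shadow-feature comparison, using the randomForest package and the built-in iris data. This is illustrative only; the Boruta package automates, repeats and statistically tests this procedure rather than relying on one forest.
# One round of the shadow-feature idea (illustrative; Boruta repeats this many times)
library(randomForest)

set.seed(1)
X <- iris[, 1:4]                                     # original predictors
shadow <- as.data.frame(lapply(X, sample))           # permuted (shadow) copies
names(shadow) <- paste0("shadow_", names(X))

combined <- cbind(X, shadow)
rf <- randomForest(combined, iris$Species, importance = TRUE)

imp  <- importance(rf, type = 1)                     # mean decrease in accuracy
mzsa <- max(imp[grep("^shadow_", rownames(imp)), ])  # max shadow importance (MZSA)
# Original variables beating the best shadow variable are candidates for 'important'
imp[rownames(imp) %in% names(X) & imp[, 1] > mzsa, , drop = FALSE]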


Difference between Boruta and Random Forest Importance Measure

When I first learnt this algorithm, the question 'RF importance measure vs. Boruta' puzzled me for hours. After reading a lot about it, I figured out the exact difference between these two variable selection approaches.

In random forest, the Z score is computed by dividing the average accuracy loss by its standard deviation, and it is used as the importance measure for all the variables. However, we cannot use this Z score directly as a measure of variable importance, because it is not directly related to the statistical significance of the variable importance. To work around this problem, the boruta package runs random forest on both the original and the random (shadow) attributes and computes the importance of all variables. Since the whole process depends on permuted copies, the random permutation procedure is repeated to get statistically robust results.
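
As a toy illustration of that Z score (the numbers below are made up, not taken from any real model), the accuracy losses of one variable across the trees of a forest are averaged and divided by their standard deviation:
# Toy Z score: mean accuracy loss divided by its standard deviation (illustrative numbers)
acc_loss <- c(0.021, 0.018, 0.025, 0.017, 0.022)
z_score  <- mean(acc_loss) / sd(acc_loss)
z_score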


Is Boruta a solution for all?
The answer is NO. You still need to test other algorithms; it is not possible to judge the best algorithm without knowing the data and its assumptions. That said, since Boruta is an improvement on the random forest variable importance measure, it should work well most of the time.

What is shuffled feature or permuted copies?

It simply means changing the order of the values of a variable. See the practical example below -
set.seed(123)
mydata = data.frame(var1 = 1:6, var2 = runif(6))
shuffle = data.frame(apply(mydata, 2, sample))   #permute each column independently
head(cbind(mydata, shuffle))
  
    Original         Shuffled
   var1   var2    var1      var2
1    1 0.2875775    4 0.9404673
2    2 0.7883051    5 0.4089769
3    3 0.4089769    3 0.2875775
4    4 0.8830174    2 0.0455565
5    5 0.9404673    6 0.8830174
6    6 0.0455565    1 0.7883051

R : Feature Selection with Boruta Package

1. Get Data into R

The read.csv() function is used to read data from a CSV file and import it into the R environment.
#Read data
df = read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
2. List of variables
#Column Names
names(df)
Result : "admit" "gre"   "gpa"   "rank"

3. Define categorical variables
df$admit = as.factor(df$admit)
df$rank = as.factor(df$rank)

4. Explore Data
#Summarize Data
summary(df)

#Check number of missing values
sapply(df, function(y) sum(is.na(y)))
 admit        gre             gpa        rank   
 0:273   Min.   :220.0   Min.   :2.260   1: 61  
 1:127   1st Qu.:520.0   1st Qu.:3.130   2:151  
         Median :580.0   Median :3.395   3:121  
         Mean   :587.7   Mean   :3.390   4: 67  
         3rd Qu.:660.0   3rd Qu.:3.670          
         Max.   :800.0   Max.   :4.000 

No missing values in the dataframe df.

Handle Missing Values
In this dataset, we have no missing values. If they exist in your dataset, you need to impute them before running the boruta algorithm, as sketched below.
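As a minimal sketch of a simple imputation step (median for numeric columns, mode for categorical ones; packages such as mice or missForest offer more rigorous approaches), you could do something like this before running Boruta:
# Simple imputation sketch: median for numeric columns, mode for factors
impute_simple <- function(data) {
  for (col in names(data)) {
    if (is.numeric(data[[col]])) {
      data[[col]][is.na(data[[col]])] <- median(data[[col]], na.rm = TRUE)
    } else {
      mode_val <- names(which.max(table(data[[col]])))
      data[[col]][is.na(data[[col]])] <- mode_val
    }
  }
  data
}
df <- impute_simple(df)   # a no-op here since df has no missing values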
5. Run Boruta Algorithm
#Install and load Boruta package
install.packages("Boruta")
library(Boruta)

# Run Boruta Algorithm
set.seed(456)
boruta <- Boruta(admit~., data = df, doTrace = 2)
print(boruta)
plot(boruta)

Boruta performed 9 iterations in 4.870027 secs.
 3 attributes confirmed important: gpa, gre, rank;
 No attributes deemed unimportant.

It shows that all three variables are considered important and none is tagged 'unimportant'. The plot() function shows box plots of all the attributes plus the minimum, average and maximum shadow scores. Green box plots indicate important predictors, red ones indicate rejected predictors, and yellow ones indicate tentative attributes.
Tentative attributes have importance scores so close to their best shadow attribute that Boruta is unable to make a decision within the default number of random forest runs.
Box Plot - Variable Selection
As you can see above, the label of shadowMean is not displayed because it got truncated due to insufficient space. To fix this problem, run the following program.
# Draw the plot without x-axis labels, then add rotated labels manually
plot(boruta, xlab = "", xaxt = "n")
# Keep only the finite importance values for each attribute
k <- lapply(1:ncol(boruta$ImpHistory), function(i)
  boruta$ImpHistory[is.finite(boruta$ImpHistory[, i]), i])
names(k) <- colnames(boruta$ImpHistory)
# Order the labels by median importance, matching the order of the box plots
Labels <- sort(sapply(k, median))
axis(side = 1, las = 2, labels = names(Labels),
     at = 1:ncol(boruta$ImpHistory), cex.axis = 0.7)

Let's add some irrelevant data to our original dataset

This is to check whether the boruta package is able to identify unimportant variables. In the following program, we create duplicate copies of the original 3 predictors and then randomise the order of values within these copies.
#Add some random permuted data
set.seed(777)
df.new <- data.frame(df, apply(df[, -1], 2, sample))   #shuffled copies of gre, gpa, rank
names(df.new)[5:7] <- paste("Random", 1:3, sep = "")
df.new$Random1 = as.numeric(as.character(df.new$Random1))
df.new$Random2 = as.numeric(as.character(df.new$Random2))
head(df.new)
  admit gre  gpa rank Random1 Random2 Random3
1     0 380 3.61    3     600    3.76       4
2     1 660 3.67    3     660    3.30       4
3     1 800 4.00    1     700    3.37       2
4     1 640 3.19    4     620    3.33       3
5     0 520 2.93    4     600    3.04       2
6     1 760 3.00    2     520    3.64       4

Run Boruta Algorithm
set.seed(456)
boruta2 <- Boruta(admit~., data = df.new, doTrace = 1)
print(boruta2)
plot(boruta2)
Boruta performed 55 iterations in 21.79995 secs.
 3 attributes confirmed important: gpa, gre, rank;
 3 attributes confirmed unimportant: Random1, Random2, Random3;

The irrelevant variables we added to the dataset came out as unimportant, as per the boruta algorithm.
> attStats(boruta2)
            meanImp   medianImp    minImp    maxImp   normHits  decision
gre      5.56458881  5.80124786  2.347609  8.410490 0.90909091 Confirmed
gpa      9.66289180  9.37140347  6.818527 13.405592 1.00000000 Confirmed
rank    10.16762154 10.22875211  6.173894 15.235444 1.00000000 Confirmed
Random1  0.05986751  0.18360283 -1.281078  2.219137 0.00000000  Rejected
Random2  1.15927054  1.35728128 -2.779228  3.816915 0.29090909  Rejected
Random3  0.05281551 -0.02874847 -3.126645  3.219810 0.05454545  Rejected

To save a final list of important variables in a vector, use the getSelectedAttributes() function.
#See list of final variables
finalvars = getSelectedAttributes(boruta2, withTentative = F)
finalvars
[1] "gre"  "gpa"  "rank"

In case you get tentative attributes in your dataset, you need to treat them; in this dataset, we did not get any. The following function compares the median Z score of each tentative variable with the median Z score of the best shadow attribute and then decides whether the attribute should be confirmed or rejected.
Tentative.boruta <- TentativeRoughFix(boruta2)
List of parameters used in Boruta
  1. maxRuns: the maximal number of random forest runs. Default is 100.
  2. doTrace: the verbosity level. 0 means no tracing, 1 means reporting an attribute decision as soon as it is made, and 2 means all of 1 plus reporting each iteration. Default is 0.
  3. getImp: the function used to obtain attribute importance. The default is getImpRfZ, which runs random forest from the ranger package and gathers Z scores of the mean decrease accuracy measure.
  4. holdHistory: the full history of importance runs is stored if set to TRUE (the default).
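
The call below makes these parameters explicit (the values are illustrative; the defaults are usually a reasonable starting point):
# Boruta with the main parameters spelled out (illustrative values)
set.seed(456)
boruta_tuned <- Boruta(admit ~ ., data = df,
                       maxRuns = 200,       # allow more runs to resolve tentative attributes
                       doTrace = 1,         # report each attribute decision as it is made
                       holdHistory = TRUE)  # keep the full importance history for plotting
print(boruta_tuned)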

Compare Boruta with RFE Algorithm

In caret, there is a variable selection algorithm called recursive feature elimination (RFE). It is also called backward selection. A brief explanation of the algorithm is given below -
  1. Fit the model using all independent variables.
  2. Calculate the variable importance of all the variables.
  3. Rank each independent variable by its importance to the model.
  4. Drop the weakest (worst-ranked) variable, rebuild the model using the remaining variables, and calculate model accuracy.
  5. Repeat step 4 until all variables have been considered.
  6. Variables are then ranked according to when they were dropped.
  7. For regression, RMSE and R-Squared are used as metrics. For classification, it is 'Accuracy' and 'Kappa'.
In the code below, we build a random forest model within the RFE algorithm. The argument 'rfFuncs' specifies random forest as the underlying model.
library(caret)
library(randomForest)
set.seed(456)
control <- rfeControl(functions=rfFuncs, method="cv", number=10)
rfe <- rfe(df.new[,2:7], df.new[,1], rfeControl=control)
print(rfe, top=10)
plot(rfe, type=c("g", "o"), cex = 1.0)
predictors(rfe)
head(rfe$resample, 10)
Outer resampling method: Cross-Validated (10 fold) 

Resampling performance over subset size:

 Variables Accuracy  Kappa AccuracySD KappaSD Selected
         4   0.6477 0.1053    0.07009  0.1665         
         6   0.7076 0.2301    0.06285  0.1580        *

The top 6 variables (out of 6):
   gpa, rank, gre, Random2, Random3, Random1

RFE - Variable Selection
In this case, the RFE algorithm retained all the variables based on model accuracy. Compared to RFE, the final variables from boruta make more sense in terms of interpretation. It all depends on the data and the distribution of its variables. As analysts, we should explore both techniques and see which one works better for the dataset. There are many packages in R for variable selection; every technique has pros and cons.

The following functions can be used for model fitting in RFE selection (a sketch swapping in one of them follows the list):
  1. linear regression (lmFuncs)
  2. random forests (rfFuncs)
  3. naive Bayes (nbFuncs)
  4. bagged trees (treebagFuncs)
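
To swap the underlying model, pass a different functions object to rfeControl(). A sketch using bagged trees (treebagFuncs, which requires the ipred package) on the same data:
#RFE with bagged trees instead of random forest (requires the ipred package)
library(caret)
set.seed(456)
control_bag <- rfeControl(functions = treebagFuncs, method = "cv", number = 10)
rfe_bag <- rfe(df.new[, 2:7], df.new[, 1], rfeControl = control_bag)
print(rfe_bag)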

Does Boruta handle multicollinearity?

Multicollinearity means high correlation between independent variables. It is an important assumption in linear and logistic regression models, because it inflates the standard errors of the coefficients (or estimates) and makes them unstable. Let's check whether the boruta algorithm takes care of it. Let's create some sample data: 3 predictors x1-x3 and a target variable y.
set.seed(123)
x1 <- runif(500)
x2 <- rnorm(500)
x3 <- x2 + rnorm(500, sd=0.5)
y <- x3 + runif(500) 
cor(x2,x3)
[1] 0.8981247

The correlation of variables x2 and x3 is very high (close to 0.9). It means they are highly correlated. 
mydata = data.frame(x1,x2,x3)
Boruta(mydata, y)
Boruta performed 9 iterations in 7.088029 secs.
 2 attributes confirmed important: x2, x3;
 1 attributes confirmed unimportant: x1;

Boruta considered both highly correlated variables to be important, which implies it does not treat collinearity while selecting important variables. This follows from the way the algorithm works: it is an all-relevant method, so any variable carrying information about the target is kept, even if another variable carries the same information.

Important points related to Boruta
  1. Impute missing values - Make sure missing or blank values are filled in before running the boruta algorithm.
  2. Collinearity - It is important to handle collinearity after getting the important variables from boruta, as sketched below.
  3. Slow Speed - It is slow compared to other traditional feature selection algorithms.
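
One simple way to screen the confirmed variables for collinearity afterwards is caret::findCorrelation() on the numeric predictors; the 0.75 cutoff below is a judgment call, not a rule:
#Check collinearity among the (numeric) confirmed variables
library(caret)
num_vars <- df[, sapply(df, is.numeric)]                 # numeric predictors only
cor_mat  <- cor(num_vars, use = "pairwise.complete.obs")
findCorrelation(cor_mat, cutoff = 0.75, names = TRUE)    # candidates to drop or combine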