Impute Missing Values with Decision Tree

CART has a built-in mechanism for handling missing data: surrogate variables. A surrogate split divides the data in nearly the same way as the primary split. In other words, we look for clones or close approximations: something else in the data that can do the same work the primary split accomplished.

Imputation Process

Suppose there are 10 predictors, x1 to x10, in the CART analysis, and suppose only x1 has missing values. x1 also happens to be the “best” predictor, the one chosen to define the “optimal” split.

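The primary split turns x1 into a binary variable: each case falls into one of two daughter nodes. CART is then applied with this binary version of x1 as the dependent variable and x2 to x10 as potential splitting variables. Only one partitioning is allowed here; a full tree is not constructed. The nine predictors are then ranked by the proportion of cases whose x1 class they misclassify. Predictors that do no better than the marginal distribution of x1 (that is, no better than assigning every case to the larger daughter node) are dropped from further consideration.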

The variable with the lowest classification error for x1 is then used in place of x1 to assign cases with missing values on x1 to one of the two daughter nodes. That is, “the predicted classes for x1 are used when the actual classes for x1 are missing”. If a case is also missing a value on the “best” surrogate (say x2), the second “best” surrogate (say x3) is used instead, and so on.
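The ranking step described above can be sketched by hand with rpart. This is only an illustration under stated assumptions: mtcars stands in for the data, mpg is the outcome, wt plays the role of x1, and each candidate is scored with a one-split classification tree. rpart's internal surrogate search is not identical to this, so treat the numbers as a rough guide rather than what rpart would report.

```r
library(rpart)

# Assumption for illustration: mtcars stands in for the data, mpg is the
# outcome, and wt plays the role of x1 (the primary splitting variable).
dat <- mtcars[, c("mpg", "wt", "disp", "hp", "qsec", "drat")]
one_split <- rpart.control(maxdepth = 1, maxsurrogate = 0, maxcompete = 0)

# Primary split: a single partition of mpg on wt.
primary <- rpart(mpg ~ wt, data = dat, control = one_split)

# Daughter-node membership of every case under the primary split.
side <- factor(primary$where)

# Baseline error: send every case to the larger daughter node.
baseline <- 1 - max(table(side)) / length(side)

# For each candidate, allow one split only and record the share of cases
# it assigns to a different daughter node than the primary split does.
candidates <- setdiff(names(dat), c("mpg", "wt"))
err <- sapply(candidates, function(v) {
  d <- data.frame(side = side, x = dat[[v]])
  fit <- rpart(side ~ x, data = d, method = "class", control = one_split)
  mean(as.character(predict(fit, type = "class")) != as.character(side))
})

# Keep only candidates that beat the baseline, ranked best first.
surrogates <- sort(err[err < baseline])
print(surrogates)
```

The first name in surrogates would be used to route cases with a missing wt, the second would be the fallback when the first is missing too, and so on.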

Surrogate splitting rules enable you to use the values of other input variables to perform a split for observations with missing values.

Important Note: The tree surrogate splitting method can impute missing values for both numeric and categorical variables.

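In R, this is controlled by the usesurrogate argument of rpart.control in the rpart package. usesurrogate = 2, which is the default, uses the surrogates in order and, if all of them are missing, sends the observation in the majority direction. Check out: GBM Missing Imputation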
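A minimal sketch of surrogate splits in rpart follows. The data set (mtcars) and the rows blanked out are assumptions chosen for illustration; usesurrogate and maxsurrogate are real rpart.control arguments.

```r
library(rpart)

# Assumption for illustration: mtcars with a few wt values blanked out.
dat <- mtcars[, c("mpg", "wt", "disp", "hp", "qsec")]
dat$wt[c(3, 10, 17)] <- NA

# Rows with a missing predictor are kept; surrogates handle them.
fit <- rpart(mpg ~ ., data = dat,
             control = rpart.control(usesurrogate = 2, maxsurrogate = 5))

# summary() prints the primary and surrogate splits of each internal node.
summary(fit)

# Rows with a missing wt still receive a prediction, because a surrogate
# (or, failing that, the majority direction) routes them down the tree.
pred <- predict(fit, newdata = dat)
print(pred[c(3, 10, 17)])
```

Note that surrogates route incomplete cases through the tree; they do not write a filled-in value back into the data.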

mice Package: Imputing Missing Values with CART

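library(mice)

# Introduce missing values into the built-in anscombe data
anscombe <- within(anscombe, {
  y1[1:3] <- NA
  y4[3:5] <- NA
})

# Impute with CART; minbucket sets the minimum leaf size of each tree
imp <- mice(anscombe, method = "cart", minbucket = 4)

# Extract the first completed data set
imp1 <- complete(imp)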
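To make the idea behind tree-based imputation concrete, here is a hand-rolled single imputation using rpart alone. It is a sketch of the general approach (grow a tree on the observed cases, then draw a donor value from the leaf each incomplete case lands in), not the mice source code. The choice of x1 as the only predictor and the lowered minsplit and minbucket values are assumptions made because only 8 complete rows remain.

```r
library(rpart)

dat <- datasets::anscombe
dat$y1[1:3] <- NA                 # same y1 pattern as the example above
obs <- !is.na(dat$y1)

# Regression tree for y1, grown on the observed cases only.
fit <- rpart(y1 ~ x1, data = dat[obs, ],
             control = rpart.control(minsplit = 4, minbucket = 2, cp = 1e-4))

# Relabel every node with its row index in fit$frame, so that predict()
# returns the leaf a case falls into instead of the leaf mean.
leaf_fit <- fit
leaf_fit$frame$yval <- seq_len(nrow(fit$frame))
leaf_obs <- fit$where
leaf_mis <- predict(leaf_fit, newdata = dat[!obs, ])

# Impute each missing y1 with a randomly drawn observed y1 from the same leaf.
set.seed(1)
y_obs <- dat$y1[obs]
dat$y1[!obs] <- sapply(leaf_mis, function(k) {
  donors <- y_obs[leaf_obs == k]
  donors[sample.int(length(donors), 1)]
})
print(dat$y1)
```

This produces one completed data set. mice repeats the process across variables and iterations and returns several imputations, so it should be preferred for real analyses.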

Source : Mice Package in Detail

About Author:
Deepanshu Bhalla

Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has over 10 years of experience in data science. During his tenure, he worked with global clients in various domains like Banking, Insurance, Private Equity, Telecom and HR.
