CART has a built-in algorithm for handling missing data with surrogate variables. A surrogate split divides the data in nearly the same way as the primary split. In other words, we look for clones or close approximations: something else in the data that can do the same work as the primary split.
Imputation Process
Suppose the primary split is on x1, and some cases have missing values on x1. CART is applied with the split on x1 (its two daughter-node classes) as the dependent variable and x2 through x10 as potential splitting variables. Only one partitioning is allowed here; a full tree is not constructed. The nine predictors are then ranked by the proportion of cases whose x1 class is misclassified. Predictors that do no better than the marginal distribution of x1 (simply sending every case to the larger node) are dropped from further consideration.
The variable with the lowest classification error for x1 is then used in place of x1 to assign cases with missing values on x1 to one of the two daughter nodes. That is, “the predicted classes for x1 are used when the actual classes for x1 are missing”. This variable is the “best” surrogate. If a case is also missing the “best” surrogate, the second “best” surrogate is used instead, and so on down the ranking.
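The ranking step described above can be sketched in base R. This is only an illustration on simulated data, not rpart's internal code: the variables x2 to x4, the cut point of 0 on x1, and the helper `surrogate_error` are all assumptions made up for the example.

```r
# Sketch: rank candidate surrogates for a primary split on x1 (simulated data).
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)   # strongly related to x1
x3 <- x1 + rnorm(n, sd = 1.5)   # weakly related to x1
x4 <- rnorm(n)                  # unrelated to x1

# Hypothetical primary split: cases with x1 < 0 go to the left daughter node.
goes_left <- x1 < 0

# Best single cut on a candidate variable: the lowest share of cases sent to
# a different side than the primary split (either direction is allowed).
surrogate_error <- function(x, goes_left) {
  xs   <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2   # midpoints between values
  err  <- vapply(cuts, function(cut) {
    e <- mean((x < cut) != goes_left)
    min(e, 1 - e)
  }, numeric(1))
  c(cut = cuts[which.min(err)], error = min(err))
}

candidates <- data.frame(x2, x3, x4)
ranking <- t(sapply(candidates, surrogate_error, goes_left = goes_left))
ranking <- ranking[order(ranking[, "error"]), ]

# Baseline: send every case to the larger daughter node (marginal distribution).
baseline <- min(mean(goes_left), 1 - mean(goes_left))
print(ranking)
print(baseline)
```

The variable at the top of `ranking` plays the role of the “best” surrogate; a candidate whose error is not clearly below `baseline` would add nothing over the majority rule.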
Surrogate splitting rules enable you to use the values of other input variables to perform a split for observations with missing values.
Important Note: The tree surrogate splitting rule method can handle missing values for both numeric and categorical variables.
In R, it is implemented with usesurrogate = 2 (the default) in the rpart.control options of the rpart package. Check out: GBM Missing Imputation
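As a minimal sketch of that option in use, the example below fits a one-split classification tree on simulated data. The data, variable names, and the choice of maxdepth = 1 are assumptions for illustration; only `rpart()`, `rpart.control()`, and `predict()` come from the rpart package.

```r
library(rpart)  # recommended package, included with standard R installations

set.seed(42)
n  <- 300
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.5)   # related to x1, so a plausible surrogate
x3 <- rnorm(n)                  # unrelated noise
y  <- factor(ifelse(x1 + rnorm(n, sd = 0.2) > 0, "yes", "no"))
dat <- data.frame(y, x1, x2, x3)
dat$x1[sample(n, 20)] <- NA     # knock out some values of x1

fit <- rpart(y ~ x1 + x2 + x3, data = dat, method = "class",
             control = rpart.control(usesurrogate = 2, maxsurrogate = 5,
                                     maxdepth = 1))
summary(fit)   # the "Surrogate splits" section lists the stand-in variables

# Cases with x1 missing are still routed down the tree and get a prediction.
new_cases <- data.frame(x1 = NA_real_, x2 = c(-2, 2), x3 = 0)
pred <- predict(fit, newdata = new_cases, type = "class")
print(pred)
```

With usesurrogate = 2, a case that is missing the primary variable and every surrogate is sent in the majority direction rather than being dropped at that node.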
Mice Package: Imputing Missing Values with CART
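library(mice)  # provides mice() and complete()
# Introduce missing values into the built-in anscombe data set
anscombe <- within(anscombe, {
  y1[1:3] <- NA
  y4[3:5] <- NA
})
# Impute with CART; minbucket is passed on to the underlying tree fit
imp <- mice(anscombe, method = "cart", minbucket = 4)
# Extract the first completed data set
imp1 <- complete(imp)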
Source : Mice Package in Detail