Dan Steinberg's Blog

How many levels can a target variable have in CART® and other SPM data mining engines?

The Salford CART decision tree is exceptional in supporting an essentially unlimited number of target levels. Of course the vast majority of classification problems tackled by analysts have two classes, or are reformulated to have two classes. There is no reason, however, to confine yourself to just two levels if you are working with CART. In our training materials we discuss three–level, five–level, and ten–level examples in detail. The ten–level example concerns the reverse engineering of a clustering solution, in which a market researcher was looking to extract a simple set of rules that could be used to assign new records to a previously constructed clustering solution based on a very large number of variables. Ten levels is a rather small number when considering how far you might be able to stretch the CART machinery. In our work with a car manufacturer our goal was to predict the specific car model chosen by a new car buyer from a set of more than 400 alternatives. The analysis was based on survey responses to several hundred attitude and preference questions administered to more than 20,000 new car buyers, and the results yielded extraordinary insight into the needs and wants driving ultimate car model selection. In our own internal testing of CART classification based on synthetic data, we have successfully run CART models on targets with 1,000 levels.

Several important points should be kept in mind if you are serious about analyzing a high cardinality categorical variable (a target with many levels). First, you need to review the distribution of your data (training and test data) across all the target levels you plan to include in your analysis. In our auto choice analysis we worked with a data set in which there were many buyers for ultra–popular cars, and very few for rarely–chosen alternatives. Specific car models may be encountered rarely because they are expensive, impractical, offer untested new technology, or are released in low numbers. Whatever the reason, these car models, represented by levels of the target, may not appear with sufficient frequency in the data to support reasonable analysis. In our real world data set designed to support real world decisions, we encountered some models with fewer than 10 purchases. Clearly, such small samples cannot support reliable analysis. However, if you have sufficient data for every level of your target then moving forward with a CART analysis can be very productive. We have a few further comments to make about such analysis below.
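The distribution check described above can be sketched in a few lines of Python. This is only an illustration and not part of SPM: the function name `sparse_levels`, the toy car-model labels, and the cutoff of 10 records are all assumptions made for the example, and the right cutoff will depend on your own data.

```python
from collections import Counter

def sparse_levels(targets, min_count=10):
    """Return {level: count} for target levels with fewer than min_count records.

    min_count is an illustrative threshold (an assumption), not a rule.
    """
    counts = Counter(targets)
    return {level: n for level, n in counts.items() if n < min_count}

# Hypothetical toy data: three car models with very unequal popularity.
train_targets = ["sedan_a"] * 500 + ["suv_b"] * 120 + ["roadster_c"] * 4

rare = sparse_levels(train_targets, min_count=10)
print(rare)  # levels that may be too thin to support reliable analysis
```

Running the same check on the test data as well as the training data shows whether a level is adequately represented in both.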

Other SPM data mining engines

Technically any data mining engine capable of handling a binary target variable can be adapted to handle an unlimited number of target values. You accomplish this by building a binary model for each level of the target, contrasting the level in question against all other levels. Thus, a three-level target with values of say "A," "B" and "C" would be tackled with three binary targets {"A" vs "not–A"}, {"B" vs "not–B"}, {"C" vs "not–C"}. This approach has two problems. The first is that you have to build a separate model for each level. CART handles all levels simultaneously and thus builds one efficient model. The multi–model approach requires a complete new analysis for every level. If you have 50 levels you will have to wait for the 50 models to complete. Of course, such an approach would benefit dramatically from parallel processing. The second problem is the assembly of the separate models into a coherent single model. Having made these introductory comments we now review the engines available in SPM.
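As a rough sketch of the one-model-per-level scheme described above, the Python below builds the binary targets for the "A," "B," "C" example and then combines per-level scores into a single prediction. The function names and the score values are hypothetical, and choosing the level with the highest score is just one common way to assemble the separate models; it is an assumption here, not a description of how any SPM engine does it.

```python
def one_vs_rest_targets(labels):
    """Build one 0/1 target per level: 1 for the level, 0 for all other levels."""
    levels = sorted(set(labels))
    return {level: [1 if y == level else 0 for y in labels] for level in levels}

def combine_scores(scores_by_level):
    """Assign each record to the level whose binary model scored it highest.

    scores_by_level maps level -> list of scores, one per record.
    Taking the maximum is an assumed combination rule for illustration.
    """
    levels = list(scores_by_level)
    n_records = len(scores_by_level[levels[0]])
    return [
        max(levels, key=lambda level: scores_by_level[level][i])
        for i in range(n_records)
    ]

labels = ["A", "B", "C", "A"]
binary = one_vs_rest_targets(labels)
print(binary["A"])  # the {"A" vs "not-A"} target

# Made-up scores standing in for the output of three separately built binary models.
scores = {
    "A": [0.8, 0.2, 0.3, 0.6],
    "B": [0.1, 0.7, 0.3, 0.3],
    "C": [0.1, 0.1, 0.4, 0.1],
}
predicted = combine_scores(scores)
print(predicted)
```

Each binary target would be modeled independently, which is why the runs lend themselves to parallel processing, while the combination step is where the separate models have to be reconciled.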

MARS

MARS was designed originally as a regression tool to capture the partial linearity and smoothness of responses that can be expected in most successful regression models. It was never a surprise that MARS could also be used to model a binary, two–valued target (Yes/No or 1 vs 0) as a form of logistic regression. This, however, is as far as MARS can be expected to go when it comes to modeling multi–class problems out of the box. MARS could be used to develop a series of binary response models, one for each level of a multi–class target, but at the moment SPM provides no additional support for refining or combining the separate models into a coherent whole.

TreeNet

TreeNet was designed to handle the multi–class target automatically. TreeNet offers some very useful reports in such models, chiefly the level–specific variable importance list and the level–specific partial dependency plots. TreeNet accomplishes this by using the strategy of one model per target level and then automatically combining the separate submodels into a single coherent whole. In general, this strategy works well for a relatively small number of levels. Because one model must be built for each level, care should be taken when working with, for example, 50 levels, as TreeNet will need to build 50 distinct submodels. With today's multi–core processors the 2012 edition of SPM can leverage the parallel processing possibilities to accelerate the modeling, but the runs can be expected to take longer than, for example, a CART run.

RandomForests

Like CART, RF is inherently designed to handle the multi–class target, and some of RF's notable successes and interesting visual displays are seen in three–class problems. RF models are not easy to explain and are not as robust as the original CART engine. However, it is always worth experimenting with RF at a minimum to benchmark the potential predictability existing within the data.

Conclusion

We recommend starting with CART for multi–class problems; further, the larger the number of target levels the greater the strengths of CART become. The 2012 edition of SPM includes a variety of multi–tree options as well, including bagging and RF–style forest ensemble construction, that offer both the ultra–robustness of CART and the potential predictive accuracy advantages of randomized tree ensembles.


Tags: CART, SPM, MARS, TreeNet, Target Variables