Dan Steinberg's Blog

BATTERY BOOTSTRAP in Salford Predictive Modeler

The most recent versions of Salford Predictive Modeler™ SPM PRO EX include a new BATTERY to invoke bootstrapped replication of most model types available in SPM. One of our reasons for adding this BATTERY was to provide access to the full CART engine when generating RandomForests® (RF) models. The principal advantages of this are:

Breiman's original RF uses a stripped-down, simplified tree-growing algorithm designed for speed. It lacks tree-growing options and missing-value handling, and for many users Breiman's RF is confined to classification problems. By accessing the full CART engine with all of its Salford extensions and customized controls, modelers can accomplish far more sophisticated analyses, handle missing values with surrogates, and apply penalties and constraints. Most importantly for those interested in continuous dependent variables, BATTERY BOOTSTRAP gives access to both Least Squares (LS) and Least Absolute Deviation (LAD) regression trees.

The principal drawback of BATTERY BOOTSTRAP is that the extra machinery comes with a computational price: RF runs under BATTERY BOOTSTRAP are much slower than under Breiman-RF. The extra robustness, ability to handle huge problems, and added controls should often make the slower runs worthwhile. Also note that at the moment the RF post-model visualization machinery is not available.

To run a classic RF-style regression using CART regression, follow these steps:

MODEL TARGET

KEEP varlist

LIMIT ATOM=2 MINCHILD=1

ERROR EXPLORE

These commands set up the model (make sure that the target is not declared categorical) and ensure that trees will be grown to the maximum possible depth (proven to be the best practice for RF models). Then follow the model setup with:

GROVE grove_name

BATTERY BOOTSTRAP TEST=EXPLORE RSPLIT=5 REPEAT=100

BUILD

The above commands tell the BATTERY not to prune the trees, to select 5 predictors at random at each node as candidate splitters, and to repeat the bootstrap replication 100 times. The GROVE command is needed to ensure that a copy of the forest, including all of the individual trees, is saved. For testing, one would then follow this with:

GROVE grove_name

SAVE predicted_values

SCORE COMMITTEE=YES

If you also want to save the separate predictions of each tree in the forest, issue:

SCORE DCM=YES COMMITTEE=YES

and the output data set will include both the overall forest prediction and the prediction for every tree in the forest. The latter might be useful for experimental post-processing and reweighting.
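As a rough illustration of the kind of post-processing the per-tree predictions allow, here is a small Python sketch. It assumes, as is usual for bagged regression trees, that the committee prediction is the plain average of the individual tree predictions. The function names and the toy numbers are hypothetical and are not SPM output.

```python
def committee_prediction(tree_preds):
    """Equal-weight average of the per-tree predictions for one record."""
    return sum(tree_preds) / len(tree_preds)


def reweighted_prediction(tree_preds, weights):
    """Weighted average of the per-tree predictions; weights need not sum to one."""
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, tree_preds)) / total


# Hypothetical predictions of a four-tree forest for a single record.
preds = [10.0, 12.0, 11.0, 15.0]

print(committee_prediction(preds))                 # equal weights
print(reweighted_prediction(preds, [1, 1, 1, 0]))  # give the last tree zero weight
```

Any reweighting scheme (for example, weights based on each tree's OOB accuracy) could be plugged in where the hand-written weights appear.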

To record the INBAG/OOB status of every record in the data for every tree in the forest just issue a SAVE command right after the BATTERY command.

To obtain RF-style variable importance measures, based on the loss of predictive accuracy on the OOB data induced by randomly permuting a predictor before the data are dropped down each tree, add the VARIMP option to the BATTERY command.
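The idea behind this measure can be sketched in a few lines of Python. This is only a conceptual illustration for a single tree and a squared-error loss, not a description of how SPM computes it internally. The toy model, the data, and all function names are assumptions made for the example.

```python
import random


def mse(model, rows, targets):
    """Mean squared error of the model over the given records."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(model, oob_rows, oob_targets, column, rng):
    """Increase in OOB error after randomly permuting one predictor column."""
    baseline = mse(model, oob_rows, oob_targets)
    shuffled = [r[column] for r in oob_rows]
    rng.shuffle(shuffled)
    permuted = [r[:column] + [v] + r[column + 1:] for r, v in zip(oob_rows, shuffled)]
    return mse(model, permuted, oob_targets) - baseline


def toy_model(row):
    """Hypothetical stand-in for one tree: uses predictor 0, ignores predictor 1."""
    return 2.0 * row[0]


rng = random.Random(0)
oob_rows = [[float(i), float(i % 3)] for i in range(20)]
oob_targets = [toy_model(r) for r in oob_rows]

imp_used = permutation_importance(toy_model, oob_rows, oob_targets, 0, rng)
imp_ignored = permutation_importance(toy_model, oob_rows, oob_targets, 1, rng)
print(imp_used, imp_ignored)
```

Permuting a predictor the model never uses leaves the error unchanged, so its importance is zero, while permuting a predictor the model relies on raises the error. In a forest the per-tree increases would be combined across all trees.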

To obtain the RF proximity matrix, add the PROX=filename option to the BATTERY command. This matrix will be of size NOBS x NOBS, where NOBS is the number of records in the training data set. Keep in mind that a 10,000-record data set will produce a matrix of 10,000 rows by 10,000 columns (100 million elements). Such files can easily exceed 2 GB in size, which will cause problems on 32-bit platforms.
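A quick back-of-the-envelope calculation shows how fast the matrix grows. The Python sketch below assumes 8 bytes per element (a double-precision number). That is an assumption made purely for the arithmetic, since the actual storage format of the PROX file is not described here.

```python
def proximity_elements(nobs):
    """Number of cells in an NOBS x NOBS proximity matrix."""
    return nobs * nobs


def proximity_bytes(nobs, bytes_per_element=8):
    """Approximate size in bytes; 8 bytes per element is an assumption."""
    return proximity_elements(nobs) * bytes_per_element


TWO_GB = 2 * 1024 ** 3

for nobs in (10_000, 16_384, 20_000):
    size = proximity_bytes(nobs)
    print(nobs, proximity_elements(nobs), size, size >= TWO_GB)
```

Under this assumption a 10,000-record data set needs about 800 MB, and the 2 GB mark is reached at roughly 16,000 records.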

The HELP entry states:

BATTERY BOOTSTRAP TEST=ORIG|OOB|POOL|EXPLORE|LEARN|CROSS

REPEAT=, REFERENCE=, RSPLIT=

LDRAW=, SAVE="filename.ext",

VARIMP=, NPREPS=,

PROX="filename.ext", NODE="filename.ext",

TREESIZE=

BATTERY BOOTSTRAP builds a series of models, all sharing a common set of options with the exception that each is built from a bootstrapped version of the learn sample. Bootstrapping is done with replacement, meaning that some records in the learn sample may appear more than once in the bootstrapped sample while other records may not appear at all.
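To make the sampling scheme concrete, here is a minimal Python sketch of drawing one bootstrapped learn sample and recording which records end up in-bag and which are out-of-bag (OOB). The function name and the optional ldraw argument (a sample size other than the original) are hypothetical choices for this illustration and do not show SPM internals.

```python
import random


def bootstrap_sample(n_records, rng, ldraw=None):
    """Draw record indices with replacement.

    Returns (counts, oob): counts[i] is how many times record i was drawn,
    and oob lists the records that were never drawn.
    """
    size = n_records if ldraw is None else ldraw
    counts = [0] * n_records
    for _ in range(size):
        counts[rng.randrange(n_records)] += 1
    oob = [i for i, c in enumerate(counts) if c == 0]
    return counts, oob


rng = random.Random(42)
counts, oob = bootstrap_sample(1000, rng)

print(sum(counts))      # total number of draws
print(max(counts))      # largest number of times any one record was drawn
print(len(oob) / 1000)  # share of records left out of bag
```

When the bootstrapped sample is the same size as the original, about 37% of the records are expected to be OOB for any given replication, which is what makes OOB testing possible.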

The options are:

REFERENCE — an initial reference model, one that does not employ any bootstrap sampling or manipulation of the learn or test samples, can be built. The reference model is essentially what you would build if, instead of using BATTERY BOOTSTRAP, you built a single model. The default is REFERENCE=NO, in which case no reference model is built.

TEST — the TEST option specifies how a test sample, if any, is to be defined for each model. ORIG uses the original test sample as the test sample for all cycles, if a test sample is available, such as one generated with the ERROR command. OOB uses the out-of-bag learn sample records as the test sample for the cycles other than the reference model. EXPLORE does not use any test sample for the cycles, other than for the reference model. LEARN uses the unsampled original learn sample as the test sample for all cycles other than the reference model. CROSS uses cross validation for all cycles other than the reference model. The default is ORIG.

REPEAT — specifies the number of cycles, or models, that are to be built, in addition to a possible reference model. The default is 10.

LDRAW — specifies a target size of the bootstrapped learn sample. Normally, the bootstrapped learn sample has as many records as the original learn sample. If you wish to force the bootstrapped learn sample to have, say, 20000 records instead, use LDRAW=20000.

SAVE — saves in-bag/OOB indicators and scores, for CART and TN models only, to a dataset.

RSPLIT — for CART models, considers splitting each node on just a random subset of the available predictors. For instance, if you wish to consider only 4 predictors at each node, independently sampled for each node, use RSPLIT=4. This is similar to the RandomForests algorithm.

VARIMP — produces RandomForests-type variable importance measures by randomly permuting in–bag and out–of–bag data to evaluate the impact that a predictor has on each model. Note that this option is potentially very memory intensive and time consuming.

NPREPS specifies the number of random perturbations to be done for each model, when VARIMP=YES. The default is 1.

PROX="filename" produces a proximity matrix based on OOB data for CART models only.

NODE="filename" stores terminal nodes (that are used to determine the proximity matrix) for CART models only.

TREESIZE=POISSON causes tree sizes to be random based on the Poisson distribution using the LIMIT NODES setting as the mean. This affects CART models only.
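Since the proximity matrix is derived from terminal node assignments, a short Python sketch may help show what it contains. It uses the usual RandomForests definition, in which the proximity of two records is the fraction of trees in which they land in the same terminal node. The restriction to OOB data mentioned above and any normalization specific to SPM are not reproduced, and the node assignments are made up for the example.

```python
def proximity_matrix(terminal_nodes):
    """Build a proximity matrix from terminal node assignments.

    terminal_nodes[t][i] is the terminal node that record i reaches in tree t.
    """
    n_trees = len(terminal_nodes)
    n_obs = len(terminal_nodes[0])
    counts = [[0] * n_obs for _ in range(n_obs)]
    for nodes in terminal_nodes:
        for i in range(n_obs):
            for j in range(n_obs):
                if nodes[i] == nodes[j]:
                    counts[i][j] += 1
    return [[c / n_trees for c in row] for row in counts]


# Hypothetical assignments: three trees, four records.
terminal_nodes = [
    [1, 1, 2, 2],
    [1, 2, 2, 3],
    [1, 1, 1, 2],
]

prox = proximity_matrix(terminal_nodes)
for row in prox:
    print(row)
```

The double loop over records is what makes the result NOBS x NOBS, so both the run time and the file size grow with the square of the number of records.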

For example:

BATTERY BOOTSTRAP TEST=EXPLORE REPEAT=100 RSPLIT=4

This will repeat the model 100 times, without using any test data, randomly selecting 4 potential splitters at each node.
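The random selection of candidate splitters can be pictured with the following Python sketch. It simply draws a fresh subset of predictors for each node. The predictor names and the function are hypothetical, and the sketch says nothing about how SPM then evaluates the splits.

```python
import random


def candidate_splitters(predictors, rsplit, rng):
    """Pick rsplit distinct predictors at random for one node."""
    return rng.sample(predictors, min(rsplit, len(predictors)))


rng = random.Random(7)
predictors = ["X%d" % i for i in range(1, 11)]  # ten hypothetical predictors

# Each node gets its own independent draw of 4 candidates.
for node in range(1, 4):
    print(node, candidate_splitters(predictors, 4, rng))
```

Because the draw is repeated independently at every node, a predictor that is skipped at one node can still be chosen further down the same tree.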


Tags: CART, Blog, RandomForests, SPM, BATTERY