Dan Steinberg's Blog

Rules of Thumb When Working With Small Data Samples

Binary Classification

CART®

The original CART monograph discusses a study the authors performed with 215 observations and 19 predictors, where 37 records were of class 1 and 178 of class 0. We think that this example, with 37 records in the smaller class, is close to the smallest sample size you can usefully work with in CART.

Recommendation: We suggest a minimum of 100 records, with the target variable no more unbalanced than (1/3, 2/3), for up to 30 predictors. We recommend repeated cross-validation to estimate out-of-sample performance, that is, performance on previously unseen data.
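To make "repeated cross-validation" concrete, here is a minimal sketch in plain Python of how the folds might be generated. It is independent of the CART software itself; the function name, the fold count of 10, and the 5 repeats are illustrative assumptions, not settings taken from the product. The class sizes are the ones from the monograph example.

```python
import random
from collections import defaultdict


def repeated_stratified_folds(labels, n_folds=10, n_repeats=5, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated stratified CV.

    Hypothetical helper: each repeat reshuffles the records within each
    class and deals them round-robin into folds, so every fold keeps
    roughly the same class mix as the full sample.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    for rep in range(n_repeats):
        folds = [[] for _ in range(n_folds)]
        offset = 0  # continue dealing where the previous class stopped
        for y in sorted(by_class):
            idx = by_class[y][:]
            rng.shuffle(idx)
            for j, i in enumerate(idx):
                folds[(j + offset) % n_folds].append(i)
            offset += len(idx)
        for k in range(n_folds):
            test = sorted(folds[k])
            train = sorted(
                i for f, fold in enumerate(folds) if f != k for i in fold
            )
            yield rep, k, train, test


# Class sizes from the CART monograph example: 37 of class 1, 178 of class 0.
labels = [1] * 37 + [0] * 178
splits = list(repeated_stratified_folds(labels, n_folds=10, n_repeats=5))

# Number of class-1 records that land in each test fold.
minority_per_fold = [sum(labels[i] for i in test) for _, _, _, test in splits]
```

With only 37 records in the smaller class, each of the 10 test folds holds just 3 or 4 of them, which is why a single round of cross-validation is noisy and averaging over several reshuffled repeats gives a steadier estimate.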

MARS®

Some of our clients have reported working with extremely small samples for regression, using about 30 records and 10 predictors to develop very compact models. We do not have much experience with small-sample binary response models.

Recommendation: A minimum of 30 records in the smaller of the two classes, and thus a sample size of 60 for a balanced sample, working with up to 15 predictors.
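The CART and MARS rules of thumb above reduce to simple arithmetic, which the sketch below encodes. The function name and its arguments are hypothetical, chosen only for illustration; the thresholds are the ones stated in the two recommendations.

```python
def meets_guideline(class_counts, n_predictors, tool):
    """Check a binary-target sample against the rules of thumb above.

    class_counts: (count of one class, count of the other class)
    tool: "CART" or "MARS" (hypothetical labels for the two rules)
    """
    n_small, n_large = sorted(class_counts)
    n_total = n_small + n_large

    if tool == "CART":
        # At least 100 records, smaller class at least 1/3 of the sample,
        # and no more than 30 predictors.
        return (
            n_total >= 100
            and 3 * n_small >= n_total
            and n_predictors <= 30
        )
    if tool == "MARS":
        # At least 30 records in the smaller class, up to 15 predictors.
        return n_small >= 30 and n_predictors <= 15
    raise ValueError("tool must be 'CART' or 'MARS'")


balanced_mars = meets_guideline((30, 30), 15, "MARS")       # True
minimal_cart = meets_guideline((40, 60), 30, "CART")        # True
monograph_example = meets_guideline((37, 178), 19, "CART")  # False
```

The monograph example falls outside the CART guideline because 37 of 215 records is well under one third, which fits the description of that study as close to the smallest usable sample rather than a comfortable one.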

TreeNet®

TreeNet is probably our most effective tool for working with very small samples in the context of many predictors. We have seen successful results with as few as 30 records in the smaller class while working with several thousand predictors in genetics research.

Recommendation: We strongly advise lowering the minimum node size from the default of 10 to as low as 3. Repeated cross-validation (CV) is of course required to determine the likely out-of-sample performance of the final model. Repeated CV and bootstrap resampling (via BATTERY BOOTSTRAP) are best for honest performance assessments.
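For readers unfamiliar with bootstrap resampling, the following is a generic sketch in plain Python. It is not the BATTERY BOOTSTRAP implementation, whose internals are not described here; it only illustrates the general idea of fitting on a resample drawn with replacement and scoring on the records left out. The function names, the 100 replicates, and the toy majority-class "model" are all assumptions made for illustration.

```python
import random


def bootstrap_splits(n_records, n_replicates=100, seed=0):
    """Yield (in_bag, out_of_bag) index lists for each bootstrap replicate."""
    rng = random.Random(seed)
    for _ in range(n_replicates):
        # Draw n_records indices with replacement.
        in_bag = [rng.randrange(n_records) for _ in range(n_records)]
        drawn = set(in_bag)
        out_of_bag = [i for i in range(n_records) if i not in drawn]
        yield in_bag, out_of_bag


def bootstrap_estimate(n_records, fit, score, n_replicates=100, seed=0):
    """Average the out-of-bag score over bootstrap replicates.

    fit(train_idx) returns a model; score(model, test_idx) returns a number.
    Both are hypothetical callables supplied by the caller.
    """
    scores = []
    for in_bag, oob in bootstrap_splits(n_records, n_replicates, seed):
        if not oob:  # skip the (very unlikely) replicate with nothing held out
            continue
        model = fit(in_bag)
        scores.append(score(model, oob))
    return sum(scores) / len(scores), scores


# Toy stand-in for a real learner: predict the majority class of the resample.
labels = [1] * 30 + [0] * 30


def fit_majority(train_idx):
    ones = sum(labels[i] for i in train_idx)
    return 1 if 2 * ones >= len(train_idx) else 0


def accuracy(model, test_idx):
    return sum(labels[i] == model for i in test_idx) / len(test_idx)


mean_acc, all_scores = bootstrap_estimate(len(labels), fit_majority, accuracy)
```

In practice `fit_majority` would be replaced by a call that trains the actual model on the in-bag records. The spread of `all_scores` across replicates is as informative as the mean, since it shows how much the assessment moves with a sample this small.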


Tags: CART, MARS, TreeNet, Datasets