Dan Steinberg's Blog

Dan Steinberg, President and Founder of Salford Systems, is a well-respected member of the statistics and econometrics communities. In 1992, he developed the first PC-based implementation of the original CART procedure, working in concert with Leo Breiman, Richard Olshen, Charles Stone and Jerome Friedman. He has also provided consulting services on a number of biomedical and market research projects, which have sparked further innovations in the CART program and methodology.
Dr. Steinberg received his Ph.D. in Economics from Harvard University and has given full-day presentations on data mining for the American Marketing Association, the Direct Marketing Association and the American Statistical Association. A book he co-authored on Classification and Regression Trees was awarded the 1999 Nikkei Quality Control Literature Prize in Japan for excellence in statistical literature promoting the improvement of industrial quality control and management.

SPM Command Line Options

The non-GUI (command-line) version of SPM includes a number of command-line options that give the user some additional flexibility. To display them, launch SPM with the -h flag, like so:

Continue Reading

Using Your Own Cross-Validation Bins

Users of cross-validation (CV) in CART, MARS, and TreeNet have become accustomed to simply requesting this testing method when setting up a predictive model and letting the software take care of the details. The Salford software prepares the data automatically and uses stratified sampling to randomly assign each record to a CV bin. The user has no influence or control over how the bins are managed.

Continue Reading
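To make the idea of stratified random bin assignment concrete, here is a minimal Python sketch. It is not Salford's implementation: the function name `assign_cv_bins`, its parameters, and the round-robin dealing scheme are all assumptions chosen for illustration. Records are grouped by target class, shuffled within each class, and dealt out to bins in turn so that every bin receives roughly the same class mix.

```python
import random
from collections import defaultdict


def assign_cv_bins(labels, n_bins=10, seed=0):
    """Return one bin number (0..n_bins-1) per record, stratified by label.

    Hypothetical helper for illustration only; not part of any Salford tool.
    """
    rng = random.Random(seed)

    # Group record indices by their target class.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    bins = [None] * len(labels)
    offset = 0  # carries the dealing position across classes to balance bin sizes
    for indices in by_class.values():
        rng.shuffle(indices)  # random order within the class
        for position, index in enumerate(indices):
            bins[index] = (offset + position) % n_bins
        offset += len(indices)
    return bins


if __name__ == "__main__":
    example_labels = [0] * 20 + [1] * 10
    print(assign_cv_bins(example_labels, n_bins=5, seed=1))
```

A list like the one returned here could be stored as an extra column in the data, which is the general shape a user-supplied bin variable would take.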

Getting to the Command Prompt FAST in CART, MARS, TreeNet, and Random Forests

This is a very short post, but it could be very useful for the expert user. As you know, Salford data mining and predictive modeling tools produce both polished GUI (Graphical User Interface) displays of the results and classical, plain text output. The plain text shows up in the “Classic Output” window, and it can occasionally be very convenient to issue commands directly from this window. To do so, you need to get to the bottom of the window and activate the command prompt. You can do this instantly, no matter where your cursor is positioned in the Classic Output window: simply click (repeatedly, if necessary) on the “Command Line” icon on the toolbar. As soon as you see the command prompt displayed at the bottom of the window, you can type commands directly.

Continue Reading

Regression Tree Ensembles

Many have asked if RandomForests (RF) supports regression analysis.

The short answer is: not with the current implementation. Salford Systems plans to support RF regression in our next release.

That said, if you have been thinking about RF regression we urge you to consider using TreeNet regression instead. Some reasons follow:

Continue Reading

R-Squared for CART Regression Trees

CART users often ask where they can find the value of the R-Squared for their regression trees. The answer is simple. In conventional statistics,

R-Squared = 1 - SSE/SST, (1)

where SSE is the sum of squared errors of the actual data around the model predictions, and SST, the total sum of squares, is the sum of squared deviations of the dependent variable around its mean. In traditional statistics, R-Squared is always calculated using the training data (LEARN SET). CART users can read the R-Squared directly from the output:

Continue Reading
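As a worked illustration of equation (1), the following Python sketch computes R-Squared from a list of actual values and a list of model predictions. The function name `r_squared` and the sample numbers are assumptions made up for this example; they are not CART output.

```python
def r_squared(actual, predicted):
    """Compute R-Squared = 1 - SSE/SST for paired actual and predicted values."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and equal in length")

    mean_actual = sum(actual) / len(actual)
    # SSE: squared errors of the actual data around the model predictions.
    sse = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    # SST: squared deviations of the dependent variable around its mean.
    sst = sum((y - mean_actual) ** 2 for y in actual)
    if sst == 0:
        raise ValueError("SST is zero: the dependent variable is constant")
    return 1 - sse / sst


if __name__ == "__main__":
    # Hypothetical two-leaf regression tree: each leaf predicts its own mean.
    actual = [1.0, 2.0, 3.0, 4.0]
    predicted = [1.5, 1.5, 3.5, 3.5]
    print(r_squared(actual, predicted))
```

In this made-up example SSE is 1 and SST is 5, so R-Squared is about 0.8. Passing test-sample values instead of learn-sample values to the same function gives the corresponding out-of-sample figure.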

Use of CART and TreeNet on Microarray Datasets

CART (Classification and Regression Trees) was originally developed by Breiman, Friedman, Olshen, and Stone to construct data-driven solutions to predictive modeling problems. The essence of the technology is recursive partitioning, in which the original dataset is progressively split into mutually exclusive regions using a series of binary splits. The resulting solution is presented in the form of a binary tree, with the key variable shown at each node of the tree.

Continue Reading
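The recursive partitioning idea described above can be sketched in a few lines of Python. This is a simplified illustration, not the CART algorithm as implemented by Salford: it assumes numeric predictors, uses Gini impurity to choose splits, stops at a fixed depth, and omits pruning, surrogate splits, and the other refinements of the full method. All function names here are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return the (feature, threshold) that most reduces impurity, or None."""
    best, best_score = None, gini(labels)
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[feature] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score:
                best, best_score = (feature, threshold), score
    return best


def grow_tree(rows, labels, max_depth=3):
    """Recursively split the data into two mutually exclusive regions per node."""
    split = best_split(rows, labels) if max_depth > 0 else None
    if split is None:  # no improving split (or depth limit): make a leaf
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    feature, threshold = split
    left = [i for i, row in enumerate(rows) if row[feature] <= threshold]
    right = [i for i, row in enumerate(rows) if row[feature] > threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": grow_tree([rows[i] for i in left], [labels[i] for i in left], max_depth - 1),
        "right": grow_tree([rows[i] for i in right], [labels[i] for i in right], max_depth - 1),
    }


def predict(tree, row):
    """Follow the binary splits from the root down to a leaf."""
    while "leaf" not in tree:
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["leaf"]
```

Each internal node of the returned dictionary records the variable and threshold used for its binary split, which mirrors how a tree display shows the splitting variable at each node.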

Which Data Mining, Predictive Modeling Engine is Best for Me?

We are often asked “Which analytical technology is best for my problem?” This topic not only comes up in practical, day-to-day modeling, but has also been the subject of a few (largely disappointing) academic studies. The short answer we usually give is that for many modeling problems it doesn’t require much time to run several different analyses, so why not rely on experimentation rather than some rule of thumb? If one method stands out for any reason, such as accuracy or intuitive attractiveness of the model, then you have your answer.

Continue Reading

Battery Nodes

Did you know you can easily build a family of CART models with the BATTERY feature? It's true! BATTERY is one of the most powerful aspects of the Salford Predictive Modeling Suite (SPM). For instance, suppose you wish to consider how the size of your CART tree affects the tree's predictive accuracy. You might build a series of individual trees yourself, or you can let BATTERY do it for you. Four batteries -- ATOM, MINCHILD, DEPTH and NODES -- work in similar ways: they vary, respectively, the atom size, the minchild size, the tree depth and the number of nodes permitted in the maximal tree. These controls constrain how large your CART tree is permitted to grow. Because they are tree-oriented controls, they work with TreeNet and RandomForests models too. For example, by issuing just the following simple series of commands you will find yourself with eight CART trees, which you can easily compare against one another to find the tradeoff between predictive accuracy and tree complexity that works best for you:

Continue Reading