Dan Steinberg's Blog

Dan Steinberg, President and Founder of Salford Systems, is a well-respected member of the statistics and econometrics communities. In 1992, he developed the first PC-based implementation of the original CART procedure, working in concert with Leo Breiman, Richard Olshen, Charles Stone and Jerome Friedman. In addition, he has provided consulting services on a number of biomedical and market research projects, which have sparked further innovations in the CART program and methodology.
Dr. Steinberg received his Ph.D. in Economics from Harvard University, and has given full-day presentations on data mining for the American Marketing Association, the Direct Marketing Association and the American Statistical Association. A book he co-authored on Classification and Regression Trees was awarded the 1999 Nikkei Quality Control Literature Prize in Japan for excellence in statistical literature promoting the improvement of industrial quality control and management.

Using Batteries in The Salford Predictive Modeling Suite

The Salford Predictive Modeler™ suite (SPM) includes a number of automated tools to assist in the process of feature selection under the BATTERY mechanism. For example,

BATTERY KEEP

Selects a subset of features at random and builds a model from this random subset only. The GUI will guide you in how to use this option, but from the command line you would issue something like:

BATTERY KEEP=100, 15

This requests 100 models, each of which includes 15 randomly selected predictors. If we are sure that we want certain variables included in every such model, the command would look like:

BATTERY KEEP=100, 15 CORE= X1, X2, X3, X4, X5
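
To make the idea concrete outside of SPM, here is a minimal Python sketch of drawing random predictor subsets with a fixed core. It is not SPM's implementation: the function name `random_feature_subsets`, the 50 predictor names, and the seed are hypothetical, and the sketch assumes the core variables are added on top of the 15 random draws, which the text above does not specify.

```python
import random


def random_feature_subsets(predictors, n_models, n_random, core=(), seed=0):
    """Return n_models predictor lists: the core variables plus
    n_random others drawn at random without replacement.

    Assumption: core variables do not count toward n_random.
    """
    rng = random.Random(seed)
    core = list(core)
    # Core variables are always kept, so they are excluded from the random pool.
    pool = [p for p in predictors if p not in core]
    if n_random > len(pool):
        raise ValueError("not enough non-core predictors to sample from")
    return [core + rng.sample(pool, n_random) for _ in range(n_models)]


# Hypothetical data set with predictors X1..X50.
predictors = [f"X{i}" for i in range(1, 51)]
core = ["X1", "X2", "X3", "X4", "X5"]
subsets = random_feature_subsets(predictors, n_models=100, n_random=15, core=core)
print(len(subsets), "subsets; first one:", subsets[0])
```

Each list in `subsets` would then be handed to whatever model-building routine is in use, and the resulting models compared on test data.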

Academic Articles on the Performance of New Methodology

When we began work on our first release of CART (a 1993 command-line version that ran very nicely on UNIX), I was startled by some (now long forgotten) articles claiming to describe a new technology that was more accurate, or faster, on some class of analytic problem. At the time, I assumed that such articles needed to be taken seriously because they represented peer-reviewed, solidly researched scientific advances.

BATTERY BOOTSTRAP in Salford Predictive Modeler

The most recent versions of Salford Predictive Modeler™ SPM PRO EX include a new BATTERY to invoke bootstrapped replication of most model types available in SPM. One of our reasons for adding this BATTERY was to provide access to the full CART engine when generating RandomForests® (RF) models. The principal advantages of this are:

Breiman's original RF uses a stripped-down and simplified tree-growing algorithm designed for speed. It lacks tree-growing options and missing-value handling, and for many users Breiman's RF is confined to classification problems. By accessing the full CART engine with all of its Salford extensions and customized controls, modelers can accomplish far more sophisticated analyses, handle missing values with surrogates, and apply penalties and constraints. Most importantly for those interested in continuous dependent variables, BATTERY BOOTSTRAP gives access to both Least Squares (LS) and Least Absolute Deviation (LAD) regression trees.

The principal drawback of BATTERY BOOTSTRAP is that the extra machinery comes with a computational price: RF runs under BATTERY BOOTSTRAP are much slower than under Breiman-RF. The extra robustness, ability to handle huge problems, and added controls should often make the slower runs worthwhile. Also observe that at the moment the RF post-model visualization machinery is not available.
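
For readers who want to see the mechanics of bootstrapped replication with LS and LAD leaves, here is a toy Python sketch. It is only an illustration under simplifying assumptions, not SPM's or Breiman's code: each replicate is a one-split regression tree on a single predictor, there is no random predictor sampling at each node, and the names `fit_stump`, `bootstrap_stumps`, and `predict` are hypothetical. LS leaves predict the mean and are scored by squared error; LAD leaves predict the median and are scored by absolute error.

```python
import random
import statistics


def fit_stump(xs, ys, loss="ls"):
    """Fit a one-split regression tree; returns (threshold, left_value, right_value)."""
    center = statistics.fmean if loss == "ls" else statistics.median

    def cost(vals):
        c = center(vals)
        if loss == "ls":
            return sum((v - c) ** 2 for v in vals)
        return sum(abs(v - c) for v in vals)

    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        total = cost(left) + cost(right)
        if best is None or total < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (total, threshold, center(left), center(right))
    if best is None:  # every x identical: fall back to a single leaf
        c = center(ys)
        return (float("inf"), c, c)
    return best[1:]


def bootstrap_stumps(xs, ys, n_reps=25, loss="ls", seed=0):
    """Fit one stump per bootstrap resample (rows drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_reps):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx], loss))
    return models


def predict(models, x):
    """Average the predictions of all replicates."""
    return statistics.fmean(lo if x <= t else hi for t, lo, hi in models)


# Made-up data: a step function with one gross outlier at x = 3.
xs = list(range(20))
ys = [1.0 if x < 10 else 5.0 for x in xs]
ys[3] = 100.0
ls_models = bootstrap_stumps(xs, ys, loss="ls")
lad_models = bootstrap_stumps(xs, ys, loss="lad")
print("LS prediction at x=2: ", predict(ls_models, 2))
print("LAD prediction at x=2:", predict(lad_models, 2))
```

Comparing the two printed values on data with an outlier gives a feel for why a choice between LS and LAD can matter for continuous targets.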

Improving a CART Tree by Reducing the Number of Predictors

The basic question is:

If CART is a great variable selector, why should I do any variable selection at all? Isn't it better to let CART do everything automatically?

More technically:

If I have already built a CART tree using a given list of variables, why would rebuilding with fewer predictors sometimes yield a better-performing tree? Didn't CART already make the best possible decisions regarding which variables to use in any part of the tree?

A number of points should be made regarding these issues. The first, and the simplest to understand, is that CART is a myopic model builder that looks only at the split it is currently working on. This means that CART does not look ahead to future splits to be made on the children and grandchildren of the current split. Consider just the root node for the sake of argument. Suppose that we have five relatively strong splitters for the root. Of course, CART chooses the split generating the greatest reduction in Gini impurity (by default) and then goes on to build the entire tree using the same split selection criterion. Suppose we were to split the root on the second-best root node splitter. It could happen that the overall tree now generated is a better performer on test data than the default tree.
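
The myopia argument can be reproduced on a tiny made-up data set. The Python sketch below is a toy greedy Gini tree builder, not CART itself, and all names in it are hypothetical. The target is the XOR of binary predictors A and B, while C agrees with the target 75% of the time. Greedy selection puts C at the root because A and B show no gain on their own, yet forcing a lower-ranked root splitter (A) lets the next level find B and classify every row correctly. For simplicity the comparison is made on the training rows with a depth limit of two, rather than on test data.

```python
from collections import Counter
from itertools import product


def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_gain(rows, feature):
    """Reduction in Gini impurity from splitting rows on a binary feature."""
    left = [r["y"] for r in rows if r[feature] == 0]
    right = [r["y"] for r in rows if r[feature] == 1]
    child = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
    return gini([r["y"] for r in rows]) - child


def build(rows, features, depth, root=None):
    """Greedy tree: a leaf is a class label, a node is (feature, left, right).
    If root is given, that feature is forced at the top split only."""
    labels = [r["y"] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if depth == 0 or len(set(labels)) <= 1:
        return majority
    feature = root if root is not None else max(
        features, key=lambda f: split_gain(rows, f))
    left = [r for r in rows if r[feature] == 0]
    right = [r for r in rows if r[feature] == 1]
    if not left or not right:
        return majority
    return (feature,
            build(left, features, depth - 1),
            build(right, features, depth - 1))


def predict(tree, row):
    while isinstance(tree, tuple):
        feature, left, right = tree
        tree = right if row[feature] == 1 else left
    return tree


def accuracy(tree, rows):
    return sum(predict(tree, r) == r["y"] for r in rows) / len(rows)


# 16 rows: y = A xor B; C matches y in 3 of every 4 copies.
rows = []
for a, b in product((0, 1), repeat=2):
    y = a ^ b
    for copy in range(4):
        rows.append({"A": a, "B": b, "C": y if copy < 3 else 1 - y, "y": y})

features = ["A", "B", "C"]
greedy_tree = build(rows, features, depth=2)
forced_tree = build(rows, features, depth=2, root="A")
print("greedy root:", greedy_tree[0], "accuracy:", accuracy(greedy_tree, rows))
print("forced root: A accuracy:", accuracy(forced_tree, rows))
```

Removing C from the predictor list has the same effect as forcing the root, which is one way to see why rebuilding with fewer predictors can improve a tree.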

Introduction

Dan Steinberg, CEO of Salford Systems, has initiated a blog principally devoted to technical matters pertaining to our core products CART, MARS, TreeNet, RandomForests, Generalized Path Seeker, and RULEFIT, among others. This new blog focuses on the fields of data mining, machine learning, predictive analytics, and business intelligence, but with a personal perspective. Entries here could well recount conversations with product developer Jerry Friedman, or some time ago with Leo Breiman, or could reflect his thoughts on the art and practice of advanced analytics and the development of new analytics methodology.

How to Control Salford Console Applications From an External Process

Warning: The following article is 'geeky.' It has to be, since it discusses programming techniques (but non-geeks are welcome to continue).

One of the questions our more technically adept users sometimes ask is how to run our applications from an external program written in a language such as Perl, Python, or Microsoft Visual Basic. While our standard GUI programs do allow this, it is much more conveniently done with the console (command-line or non-GUI) versions, which are standard on UNIX and Linux but are also available for MS Windows. The reasons for this are as follows:

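
As a rough illustration of driving a console program from Python, the sketch below feeds a list of command lines to a child process and collects its output. It rests on assumptions the text above does not confirm: that the console application reads its commands from standard input, and that its output arrives on standard output. The helper name `run_console` is hypothetical, and because the name and location of the console executable are not given here, a small Python echo process stands in for it.

```python
import subprocess
import sys


def run_console(argv, commands, timeout=60):
    """Run a console program, send it the commands on stdin,
    and return (exit_code, stdout_text, stderr_text)."""
    script = "\n".join(commands) + "\n"
    completed = subprocess.run(
        argv,
        input=script,          # delivered to the child's standard input
        capture_output=True,   # collect stdout and stderr for inspection
        text=True,
        timeout=timeout,
        check=False,           # let the caller decide how to treat failures
    )
    return completed.returncode, completed.stdout, completed.stderr


# Stand-in child process that simply echoes what it is sent.
# Replace echo_argv with the path to the real console executable.
echo_argv = [sys.executable, "-c",
             "import sys; sys.stdout.write(sys.stdin.read())"]
code, out, err = run_console(echo_argv, ["BATTERY KEEP=100, 15"])
print(code, out.strip())
```

Returning the exit code and both output streams, rather than raising on failure, leaves the calling program free to parse the log and decide what counts as an error.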

Working with the Salford Predictive Modeler and Scratch Directories

Like many programs, the Salford Predictive Modeler reads, writes, and otherwise manages temporary files in the course of its work. These are written to a particular directory on your computer called a "scratch directory". SPM also writes a command log to the scratch directory. The GUI version of SPM allows the location of this directory to be set as an option (with a sensible default), but non-GUI versions determine where to write temporary files by means of environment variables. Presently, SPM searches for the following environment variables and uses the value of the first one defined as its scratch directory:

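
The "first one defined wins" lookup described above can be sketched in a few lines of Python. The environment variable names below are placeholders chosen for illustration, not the names SPM actually searches for, and the fallback to the system temporary directory is likewise an assumption.

```python
import os
import tempfile


def resolve_scratch_dir(candidates, environ=None, default=None):
    """Return the value of the first candidate variable that is defined.

    candidates: variable names in priority order.
    environ:    mapping to search (defaults to the process environment).
    default:    value used when none is defined (assumed fallback:
                the system temporary directory).
    """
    environ = os.environ if environ is None else environ
    for name in candidates:
        if name in environ:
            return environ[name]
    return default if default is not None else tempfile.gettempdir()


# Placeholder names, in priority order (hypothetical, not SPM's list).
names = ["FIRST_CHOICE_DIR", "SECOND_CHOICE_DIR", "THIRD_CHOICE_DIR"]
print(resolve_scratch_dir(names))
```

Passing the environment in as a mapping makes the priority order easy to check without changing the real process environment.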