Dan Steinberg's Blog

Dan Steinberg, President and Founder of Salford Systems, is a well-respected member of the statistics and econometrics communities. In 1992, he developed the first PC-based implementation of the original CART procedure, working in concert with Leo Breiman, Richard Olshen, Charles Stone, and Jerome Friedman. In addition, he has provided consulting services on a number of biomedical and market research projects, which have sparked further innovations in the CART program and methodology.
Dr. Steinberg received his Ph.D. in Economics from Harvard University, and has given full-day presentations on data mining for the American Marketing Association, the Direct Marketing Association, and the American Statistical Association. A book he co-authored on Classification and Regression Trees was awarded the 1999 Nikkei Quality Control Literature Prize in Japan for excellence in statistical literature promoting the improvement of industrial quality control and management.

Modeling tricks with TreeNet: Treating Categorical Variables as Continuous

Every experienced modeler knows that it is important to differentiate between ordered and unordered variables. If a variable X happens to be coded as 1, 2, or 3 but is unordered, then the three possible values are arbitrary labels not intended to convey any sense of order. In other words, for a record with X=1, that value is neither larger nor smaller than the possible values 2 or 3; it is simply different.

Therefore, were we to run a regression that treated X as continuous, any slope we discovered would be an illusion. Further, X treated as continuous in a regression would embed the notion that a value of "3" for X is not just larger than the value of "1," but is specifically three times larger.
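
To make the contrast concrete, here is a minimal sketch in Python (the data and variable names are invented for illustration): fitting a regression on the raw codes imposes the spurious ordering, while dummy (one-hot) coding treats the levels as the arbitrary labels they are.

import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: X is an unordered label that happens to be coded 1, 2, 3
df = pd.DataFrame({"X": [1, 2, 3, 1, 3, 2],
                   "y": [5.0, 9.0, 4.0, 5.5, 4.2, 8.8]})

# Misleading for unordered data: implies "3" is three times larger than "1"
as_continuous = LinearRegression().fit(df[["X"]], df["y"])

# Safer: dummy (one-hot) coding treats the three values as mere labels
dummies = pd.get_dummies(df["X"], prefix="X")
as_categorical = LinearRegression().fit(dummies, df["y"])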

Continue Reading

A Simple Explanation Of TreeNet Models For Regulators

Data mining and machine learning are technological fields that are having a substantial impact on how data is analyzed and how predictive models are built in virtually all industries. The new methods are being adopted rapidly by expert data analysts because of their extraordinary power, but consumers of the models and results have lagged in their understanding of what the new methods actually do and how they work. To the extent that the new methods appear to be "black boxes" that mysteriously produce results, those outside the data mining field are sometimes hesitant to trust the results. This brief note is intended to address this matter, focusing on TreeNet, introduced into the literature as "stochastic gradient boosting" and "multiple additive regression trees."

TreeNet is the brainchild of celebrated Stanford University Professor Jerome H. Friedman (Statistics Department), and was announced in a paper in the prestigious Annals of Statistics in 1999. Friedman explains the methodology in a series of highly technical papers, and Salford Systems has followed with a variety of tutorial materials presented several times at the Joint Statistical Meetings of the American Statistical Association. We will not attempt to summarize that material here, other than to provide references; instead, we offer a somewhat different and succinct explanation of the underlying logic of this innovation.
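
As a rough illustration of that logic (a toy sketch, not the TreeNet implementation itself), the following Python code hand-rolls a stochastic gradient boosting loop for squared-error loss using small scikit-learn regression trees: each tree is fit to the current residuals on a random subsample and added to the ensemble with a shrinkage factor. All names and parameter values here are ours, chosen only for illustration.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, n_trees=100, depth=2, shrinkage=0.1, subsample=0.5):
    # Toy stochastic gradient boosting for squared-error loss.
    # X, y: NumPy arrays of predictors and target.
    rng = np.random.default_rng(0)
    pred = np.full(len(y), y.mean())      # start from the overall mean
    trees = []
    for _ in range(n_trees):
        resid = y - pred                  # negative gradient of squared error
        # The "stochastic" part: fit each tree on a random subsample
        idx = rng.choice(len(y), size=int(subsample * len(y)), replace=False)
        tree = DecisionTreeRegressor(max_depth=depth).fit(X[idx], resid[idx])
        pred += shrinkage * tree.predict(X)   # take a small step toward the residuals
        trees.append(tree)
    return y.mean(), trees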

Continue Reading

Modeling Competitions and Illegitimate Predictors

Our recent set of posts on the topic of illegitimate predictors was provoked by a fascinating recent paper on the topic presented at the KDD2011 technical data mining conference. The paper, by Kaufman, Rosset, and Perlich, focused on public modeling competitions. Such competitions have been in vogue since at least 1997, when the first KDD Cup was conducted, and were given a terrific boost when Netflix offered a $1 million prize for a movie recommendation system that could beat their own internally developed system.

Continue Reading

Unintended Use of Illegitimate Data in Predictive Modeling Part 1

At Salford Systems we take pride in pointing out that much of the work of modern analytics can be automated using our advanced technology. And indeed, our process of going from raw data to high-quality predictive models is vastly faster than it was when we used classical statistical models some 20 years ago. But not everything that needs to be done in model construction is 100% automatable, and this is especially true when it comes to avoiding certain common blunders. In this article our focus is on the inadvertent use of information that should never have entered model construction in the first place. Although we can provide some rules of thumb and some management advice to protect against this type of blunder, at present it appears that avoiding these errors requires specific knowledge of the details of all of the potential fields in the database. In other words, some errors will probably always be avoided only through the good judgment and vigilance of human experts. This article is devoted to one such problem: the use of predictors that should never be used even though they appear in the database.
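
One rule of thumb that can be partially automated is to screen each candidate predictor on its own and flag any single field that predicts the target almost perfectly; such "too good to be true" predictors are often proxies for the outcome itself. Below is a minimal sketch, assuming a pandas DataFrame df of numeric predictors with a binary target column; the names, threshold, and model settings are all invented for illustration.

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def flag_suspicious(df, target="target", threshold=0.95):
    # Flag predictors whose solo cross-validated AUC looks too good to be true
    suspects = []
    y = df[target]
    for col in df.columns.drop(target):
        auc = cross_val_score(DecisionTreeClassifier(max_depth=3),
                              df[[col]], y, cv=5, scoring="roc_auc").mean()
        if auc > threshold:
            suspects.append((col, auc))
    return suspects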

Continue Reading

Linear Combination Splits in Decision Trees

Decision trees, and CART® in particular, were originally designed to split data using a single predictor. For continuous predictors the split is of the form:

if X <= c then go left

else go right

For categorical predictors the split is of the form:

if X is in {value1, value2, ...} then go left

else go right

The advantage of such splitting rules lies principally in their simplicity and understandability, and the terminal nodes of the tree are conjunctions of such splitters that define rules (or data segments) that may also be easy to understand. But for continuous variables the advantages go far beyond comprehensibility, as the split is essentially unchanged when the predictor in question is transformed via common transforms such as log(X) or SQRT(X). Technically, so long as the transform does not change the rank ordering of the values of the predictor (or simply inverts the order), the splitting rule is essentially unchanged. Further, extreme values of the predictor should not affect the selection of the best splitter, as the splitter generally lies toward the interior of the split variable's distribution. These characteristics go a long way toward making CART such an effective analytical tool even in the presence of flawed data.
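
That invariance is easy to verify directly: an exhaustive single-variable split search depends only on how a threshold partitions the ranked observations, so a monotone transform moves the threshold c but leaves every record's left/right assignment unchanged. A minimal Python sketch, with invented data and a squared-error criterion:

import numpy as np

def best_split(x, y):
    # Exhaustive search for the split "x <= c" minimizing total squared error
    best_c, best_sse = None, np.inf
    for c in np.unique(x)[:-1]:           # candidate thresholds
        left, right = y[x <= c], y[x > c]
        sse = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
        if sse < best_sse:
            best_c, best_sse = c, sse
    return best_c

x = np.array([1.0, 4.0, 9.0, 16.0, 25.0])
y = np.array([2.0, 2.1, 7.9, 8.0, 8.2])
# The same records go left whether we split on x or on sqrt(x);
# only the numeric threshold changes (4.0 vs. 2.0)
print(best_split(x, y), best_split(np.sqrt(x), y))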

Continue Reading

Using Dates In Data Mining Models

Using dates in any kind of predictive model can be tricky to get right. It is important to be clear about what you are trying to accomplish. Suppose, for example, we are trying to predict sales of a specific brand of beer in a given store and have daily sales data going back several years. One of the patterns we are going to want to track and capture is “seasonality,” which refers to changes in sales levels due to the season of the year. We might find that beer sales of all types are typically highest in the summer months, lowest in the winter, and intermediate in spring and fall. Of course, seasonality is only one factor among many, and good forecasts will require much more information than the date. To capture seasonality, statisticians and econometricians have long resorted to introducing variables to reflect the season of the year. This could be captured by a categorical variable coded, say, “fall,” “winter,” “spring,” and “summer.” A modeler might instead prefer to introduce a variable for the month of the year, or even the week or the day of the year. The point is that this variable would be extracted from the date, and we would leverage the fact that we can observe the seasonal pattern more than once to draw conclusions about something like a “summer effect.”
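
A minimal sketch of this kind of extraction in Python with pandas (the data and column names are invented, and the season mapping assumes the northern hemisphere):

import pandas as pd

# Hypothetical daily sales data with a raw date column
df = pd.DataFrame({"date": pd.to_datetime(["2010-01-15", "2010-07-04", "2011-07-10"]),
                   "sales": [80, 240, 250]})

# Derive repeatable calendar features from the date
df["month"] = df["date"].dt.month
df["day_of_year"] = df["date"].dt.dayofyear

# Coarse season label derived from the month
season = {12: "winter", 1: "winter", 2: "winter",
          3: "spring", 4: "spring", 5: "spring",
          6: "summer", 7: "summer", 8: "summer",
          9: "fall", 10: "fall", 11: "fall"}
df["season"] = df["month"].map(season)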

Continue Reading

Feature Selection for CART® using the Salford Predictive Modeler® Software Suite

We can dig deeper than we did in our previous post into the reasons why more compact predictor lists can improve decision trees. Recall that a CART tree is grown by searching across all predictors and all possible split points in a given partition of the learning data. There is no guarantee that the best split found this way will be as good on previously unseen test data. Occasionally, the best split on the learn data will be a lucky draw, and the split will not be confirmed on test data. In the original CART monograph, large-sample theory was invoked to assure that in very large samples CART will always correct any unfortunate splits made as the tree evolves by making the correct splits lower down in the tree. With sufficiently large samples, enough data always remain to converge to the best model. In most real-world situations, however, we will not want to rely on massive data sets to get to the best model, and we may not have enough data to assure the desired result.
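
One practical way to arrive at a more compact predictor list is backward "shaving" by variable importance: fit a tree, drop the least important predictors, and refit as long as test performance holds up. The following is a rough Python sketch of that idea using scikit-learn trees, not the exact SPM workflow; the function and parameter names are invented for illustration.

from sklearn.tree import DecisionTreeClassifier

def shave(X_learn, y_learn, X_test, y_test, drop_per_round=1, min_keep=3):
    # Iteratively drop the least important predictors while test accuracy holds
    cols = list(X_learn.columns)
    while len(cols) > min_keep:
        tree = DecisionTreeClassifier(max_depth=5).fit(X_learn[cols], y_learn)
        score = tree.score(X_test[cols], y_test)
        ranked = sorted(zip(tree.feature_importances_, cols))
        trial = [c for _, c in ranked[drop_per_round:]]   # drop the weakest
        trial_tree = DecisionTreeClassifier(max_depth=5).fit(X_learn[trial], y_learn)
        if trial_tree.score(X_test[trial], y_test) < score:
            break                                         # shaving hurt; stop
        cols = trial
    return cols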

Continue Reading