Dan Steinberg's Blog

Data Mining for "What-If" Analysis

Once you have generated a model (or models) you are happy with and have saved it to a grove file, you are in a position to run simulations and explore what-if scenarios. To do this you normally start by creating a new artificial data set containing REFERENCE or BASELINE observations together with variations, which are copies of your baseline record with systematic changes applied. Your baseline record could well be a record with average and/or modal values for every variable appearing in the model, or it could be an actual data record selected because it is somehow "typical". We will illustrate the process of creating the variations below.

As an example, take the case of a retail grocery chain with a predictive model for NOVA PEIXE brand tuna fish in a 250 gram tin. Our predictors include DISCOUNT (which could be negative) from the average non-sale price over the last 3 months, promotion type, month, and day of week of the promotion. The baseline record might look like this:

DISCOUNT PROMO_TYPE MONTH DAY_OF_WEEK
10 TV 06 TUESDAY

Now we want to explore the effects of varying the discount and would thus prepare a file containing these records:

 
DISCOUNT PROMO_TYPE MONTH DAY_OF_WEEK
10 TV 06 TUESDAY
00 TV 06 TUESDAY
01 TV 06 TUESDAY
02 TV 06 TUESDAY
03 TV 06 TUESDAY
04 TV 06 TUESDAY
05 TV 06 TUESDAY
06 TV 06 TUESDAY
07 TV 06 TUESDAY
08 TV 06 TUESDAY
09 TV 06 TUESDAY

Observe that here we have allowed only the DISCOUNT to vary and all other values of the baseline record remain the same. The example above only varies DISCOUNT over a narrow range and in small steps. Instead, we could vary DISCOUNT more widely and also include negative discounts (price increases). The simulation file could include several such sets of simulation records for other predictors, for example, MONTH could be varied from 01 to 12, and DAY_OF_WEEK could be stepped through all seven days. We could also explore joint variation of predictors such as varying the discounts through their entire range for every month of the year. There is no limit on the number of variations we can create for this set of records. We then save this file and score it to produce the model predictions; writing the predictions out to an Excel file would facilitate pretty graphing of the results. This is a classic model simulation methodology.
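If you would rather generate the simulation file programmatically than type it by hand, the records described above can be sketched in a few lines of Python. This is only an illustration: the helper names (make_variations, to_csv) and the chosen discount range are assumptions made for the example, not part of SPM, and MONTH is held as an integer rather than the two-digit form shown above.

```python
import csv
import io
from itertools import product

FIELDS = ["DISCOUNT", "PROMO_TYPE", "MONTH", "DAY_OF_WEEK"]
BASELINE = {"DISCOUNT": 10, "PROMO_TYPE": "TV", "MONTH": 6, "DAY_OF_WEEK": "TUESDAY"}


def make_variations(baseline, grid):
    """Return copies of `baseline` for every combination of values in `grid`.

    `grid` maps a predictor name to the values to try; predictors that are
    not named in `grid` keep their baseline value.
    """
    names = list(grid)
    rows = []
    for combo in product(*(grid[name] for name in names)):
        row = dict(baseline)           # copy, so the baseline is never modified
        row.update(zip(names, combo))  # install the systematic changes
        rows.append(row)
    return rows


def to_csv(rows):
    """Render the records as CSV text, ready to be saved and scored."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Vary DISCOUNT alone, from -10 (a price increase) to 30 in steps of 5.
discount_rows = make_variations(BASELINE, {"DISCOUNT": range(-10, 31, 5)})

# Joint variation: every discount level for every month of the year.
joint_rows = make_variations(
    BASELINE, {"DISCOUNT": range(-10, 31, 5), "MONTH": range(1, 13)}
)

print(to_csv(discount_rows))
```

The joint grid grows multiplicatively (here 9 discount levels times 12 months), so it is worth keeping the step sizes coarse when several predictors are varied at once.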

A more sophisticated methodology does not rely on a single baseline record but instead uses the entire training file as the baseline. Starting from the actual training data we manipulate a predictor, such as discount, by setting it equal to say 10 for every record in the file, leaving all other data as is. We then score the entire file and save the average predicted target. Then we repeat this for other values of the predictor, each time scoring the entire training file just to extract the average prediction. Clearly this involves quite a bit more computation but is generally considered a more reliable way to conduct the simulations.
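The logic of this whole-file approach can be sketched in Python as follows. Everything here is an assumption made for illustration: toy_predict is a made-up stand-in for a real fitted model, the four training records are invented, and average_prediction_curve is a hypothetical helper name.

```python
from statistics import mean


def average_prediction_curve(records, predictor, values, predict):
    """For each value: overwrite `predictor` in every record, score, average."""
    curve = []
    for value in values:
        # Copy each record so the original training data stays untouched.
        modified = [{**rec, predictor: value} for rec in records]
        curve.append((value, mean(predict(rec) for rec in modified)))
    return curve


def toy_predict(rec):
    """Made-up scoring rule standing in for a real model (illustration only)."""
    lift = 2.0 * rec["DISCOUNT"]
    if rec["PROMO_TYPE"] == "TV":
        lift += 5.0
    return 100.0 + lift


# Invented training records; a real run would use the actual training file.
training = [
    {"DISCOUNT": 0, "PROMO_TYPE": "TV"},
    {"DISCOUNT": 5, "PROMO_TYPE": "FLYER"},
    {"DISCOUNT": 15, "PROMO_TYPE": "TV"},
    {"DISCOUNT": 10, "PROMO_TYPE": "FLYER"},
]

curve = average_prediction_curve(training, "DISCOUNT", [0, 10, 20], toy_predict)
for value, avg in curve:
    print(f"DISCOUNT={value:>2}  average prediction={avg:.1f}")
```

Because every other predictor keeps its observed value, each average reflects the realistic mix of promotion types, months, and days in the data rather than one hand-picked baseline record.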

Within SPM you could run the latter type of simulation using the following template:

USE myfile            Load the training file (or a random sample)
GROVE mymodel.grv     Load the predictive model
%let DISCOUNT=10      Reset the value of DISCOUNT to 10 for every record in the file
SAVE predict_10       Create the file to hold the predictions
SCORE GO              Score the file
%let DISCOUNT=20      Reset the value of DISCOUNT to 20 for every record in the file
SAVE predict_20       Create the file to hold the predictions
SCORE GO              Score the file

etc.

The raw predictions then need to be averaged and the results collated to report and chart. This method can be used with any SPM data mining engine such as CART, MARS, TreeNet, and RandomForests.
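The averaging and collation step might look something like the Python sketch below. It assumes the predictions from each scoring run have already been read into memory, keyed by the DISCOUNT value that was used; the numbers are invented for illustration, and collate is a hypothetical helper name. How you read the saved prediction files depends on the format and column names your scoring run produces, so that part is left out.

```python
from statistics import mean


def collate(predictions_by_value):
    """Return (value, n_records, average_prediction) rows sorted by value."""
    return [
        (value, len(preds), mean(preds))
        for value, preds in sorted(predictions_by_value.items())
    ]


# Invented raw predictions, one list per scoring run (illustration only).
scored = {
    20: [14.0, 16.0, 18.0],
    10: [11.0, 12.0, 13.0],
}

table = collate(scored)
for value, n, avg in table:
    print(f"DISCOUNT={value:>3}  n={n}  mean prediction={avg:.2f}")
```

The resulting table has one row per simulated value, which is the shape you want for charting the average prediction against the predictor.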

It turns out that TreeNet actually runs this type of simulation for you automatically when you request PLOTS, and the results are displayed in the TreeNet post-model reports. Also check the TreeNet HELP entry for information on the TREENET PF= option, which saves the plot data to a file you can further manipulate in Excel or other software.


Tags: Data Mining