Dan Steinberg's Blog

How to Control Salford Console Applications From an External Process

Warning: The following article is 'geeky.' It has to be, since it discusses programming techniques (but non-geeks are welcome to continue).

One of the questions our more technically adept users sometimes ask is how to run our applications from an external program written in a language such as Perl, Python, or Microsoft Visual Basic. While our standard GUI programs do allow this, it is much more conveniently done with the console (command-line or non-GUI) versions, which are standard on UNIX and Linux and are also available for MS Windows. The reasons for this are as follows:

Windows GUI applications disable standard input and output, which are the most convenient means of communicating with an external program.

Our GUI applications necessarily use more memory and are slower than their non-GUI counterparts (after all, they produce a lot of graphical displays and reports that our console applications do not), and they take much longer to start up.

At the present time, our GUI applications do not allow the display to be suppressed, which makes it inconvenient to run them from Windows services.

Standard Input and Output

These terms refer to what used to be the most common means of interacting with a computer program: typing messages at a terminal (or a virtual terminal like the Windows Command Prompt), and receiving responses. In the days before CRT terminals or personal computers were widespread, the input messages were usually written on punch cards and the responses normally came back over the printer. More recently, users would send messages to the program from their keyboards and the results would return to their monitors. The stream of messages to the program is called 'standard input,' while the stream of messages from the program is called 'standard output.' By default, standard input comes from your keyboard, and standard output goes to a terminal display, but all modern operating systems allow standard input and output to be redirected in many different ways.
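As a minimal sketch of such redirection, the following shell fragment uses tr as a stand-in for a console program. That stand-in is an assumption made so the example runs anywhere; it is not a Salford tool, and the file names are invented. The fragment shows input taken from a file, output sent to a file, and input supplied by another process through a pipe.

```shell
#!/bin/sh
# Hypothetical stand-in for a console program: upper-cases whatever it reads.
console_app() { tr 'a-z' 'A-Z'; }

workdir=$(mktemp -d)

# 1. Standard input redirected from a file instead of the keyboard.
printf 'use boston\nquit\n' > "$workdir/commands.txt"
from_file=$(console_app < "$workdir/commands.txt")

# 2. Standard output redirected to a file instead of the display.
printf 'use boston\n' | console_app > "$workdir/session.log"
logged=$(cat "$workdir/session.log")

# 3. Standard input supplied by another process through a pipe.
piped=$(printf 'quit\n' | console_app)

printf '%s\n' "$from_file" "$logged" "$piped"
rm -rf "$workdir"
```

The third form is the one the rest of this article relies on: the program on the left of the pipe plays the part of the user at the keyboard.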

For example, if you launch console CART interactively, you will get something like this:

CART ProEX version 6.2.0.162

Copyright, 1991–2006, Salford Systems, San Diego, California, USA

Launched on 9/8/2009 Licensed until 12/31/2010.

This launch supports up to 32768 variables.

256 MB RAM allocated at launch, partitioned as:

Real : 65109998 cells

Integer : 1114112 cells

Character: 3539016 cells

The license supports up to 9999999 MB of learn sample data.

StatTransfer enabled.

 

CART then waits for you to type commands. At this point, if you opened a dataset with the USE command, CART would report its success or failure to do so and, if successful, would list the variables in the dataset thus:

>use boston

Opening text file as dataset:

/test/datasets_csv/BOSTON.CSV

/test/datasets_csv/BOSTON.CSV uses, as delimiter.

VARIABLES IN RECT FILE ARE:

CRIM ZN INDUS CHAS NOX

RM AGE DIS RAD TAX

PT B LSTAT MV

/test/datasets_csv/BOSTON.CSV: 506 records.

>

From there you could perform the desired analyses by typing additional commands. When finished, you could exit the session with the QUIT command, at which point you would be returned to your friendly command prompt (or perhaps, depending upon how you launched CART, the window would simply close).

While this is not a very pleasant way for humans to interact with CART (the GUI is much better for interactive use), it does provide an easy way to write programs that interact with CART (or whichever Salford console application is desired). This is accomplished by redirecting standard input and output so that the application communicates with an external process rather than with the user directly.

Korn Shell Example

In the course of a data mining competition in which Salford recently took part, I wrote a script that ran a battery of TreeNet models, all specified in the same manner but each using a different division between the learning and test samples. I wrote the script in Korn Shell 93 (the most recent specification of the Korn Shell language, as published by AT&T; see http://www.kornshell.org) and ran it under CentOS 5.1 (CentOS is a Linux distribution built from the sources of Red Hat Enterprise Linux). Since the script uses ksh93-specific constructs, it will not run under versions of the Korn Shell such as pdksh, which are based upon older specifications. The script is as follows:

#!/bin/ksh

SPM=spm640205

ND=26

{

print submit fpath

print use \"../Data/cacmps5x3.sas7bdat\"

print submit class2

print submit keep66

print category mps CAC_GENDER DWELL_TYPE cred mail donor resp,

print LIFESTAGE_GRP2 LIFESTYLE_DIM1 LIFESTYLE_DIM2 SILHOUETTE,

print activeyr CNTY_SIZEN CAC_MARSTAT2 OCCUP_CD2 agecatd CRED_ANY,

print dimn DIM_ANY DONOR_ANY RESP_ANY kids Lifestage_Bin Lifestage_Grp2_Bin,

print Lifestyle_Dim1_Bin Lifestyle_Dim2_Bin, Silhouette_Bin binarydpv

print loptions timing pred gains roc

print mart trees=2000 nodes=6 learnrate=.01 fullreport=yes

print model binarydpv

print keep copy keep66

print cw unit

for ((i=1; i<=$ND; i++)); do

model=bt66tlun6d$i

print output $model

print grove $model

print save $model.csv /model

print note \"TreeNet model on BinaryDPV \(KEEP66\)\"

print note \"Logistic/CW UNIT\"

print note \"ERROR SEPVAR=D$i\"

print id respid d$i

print error sepvar=d$i

print mart go

done

}|$SPM >/dev/null

spm640205 is Salford Predictive Miner (SPM) 6.4.0.205, a developmental version of our data mining suite, which I used in this analysis. I previously created a series of variables D1-D26, which specified 26 different divisions between the learning and test samples. The print statements in the code between '{' and '}' write commands to SPM, which in turn executes them. The base model is specified first, and then the 26 models are built and scored with a for loop. In this example, no attempt is made to read the responses to the commands. Instead, standard output is redirected to /dev/null which is the standard null device on UNIX and Linux systems (NUL: is the standard null device under DOS/Windows); and the text (classic) output for each model is saved to individual files with the OUTPUT command. No QUIT command is required because SPM interprets the end of the input stream as QUIT. In my next post, I will demonstrate how to extract useful data from standard output and use them to create a report.
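To see why no QUIT is needed, it can help to look at the pipeline from the receiving end. The sketch below replaces SPM with a hypothetical shell function, fake_console, that reads commands until its input ends. The command text is borrowed from the script above, but the function is only an illustration and says nothing about how SPM itself is implemented.

```shell
#!/bin/sh
# Hypothetical stand-in for a console application: it reads one command
# per line and stops when standard input reaches end-of-file.
fake_console() {
    count=0
    while IFS= read -r cmd; do
        count=$((count + 1))
    done
    printf 'commands received: %d\n' "$count"
}

# The brace group's combined output becomes the stand-in's standard input.
# When the group finishes, the pipe closes and the reader sees end-of-file,
# so the session ends without an explicit quit command.
{
    printf 'model binarydpv\n'
    printf 'output bt66tlun6d1\n'
    printf 'error sepvar=d1\n'
    printf 'mart go\n'
} | fake_console
```

Appending >/dev/null after the stand-in would discard its report in the same way the script above discards SPM's text output.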

Perl Example

When I wrote the above shell script, I took the 'quick and dirty' approach, rather than trying to write something generic. The 'right way' would be to write a parameterized script that could be used on a variety of datasets. I recently wrote such a script in Perl, as shown below:

#!/usr/bin/perl

use Getopt::Std;

#Set Command-line flags

getopts("cdgi:sx:",\%opt);

$WRTCMD=$opt{"c"};

$DEBUG=$opt{"d"};

$SAVEGRV=$opt{"g"};

$ID=$opt{"i"};

$SCORE=$opt{"s"};

if (defined $opt{"x"}) {$EXEC=$opt{"x"}}

else {$EXEC="treenet"}

#Set constants

$ARGC=@ARGV;

$NULL="/dev/null";

#Help message

if ($ARGC<4) {

print "Run battery of TN models, each on a different learn/test division\n";

print "Usage:\n";

print "$0 [-cdgs] [-i <idvars>] [-x <exec>] <cmd> <basename> <basefold> <nflds>\n";

print "cmd: Name of the command file specifying the base model\n";

print "basename: Name to use as base for output, grove file names, and scores\n";

print "basefold: Base name of variables defining learn/test divisions\n";

print "nflds: Number of learn/test divisions\n";

print "\n";

print "Options:\n";

print " -c: Write commands generated to standard output and then exit\n";

print " -d: Run in debug mode. Text output not suppressed\n";

print " -g: Save grove files (name=basename+index.GRV)\n";

print " -i <idvars>: Specify one or more ID variables\n";

print " -s: Save scores to CSV (name=basename+index.CSV)\n";

print " -x <exec>: Use TN executable named <exec>\n";

exit}

#Set command line parameters

$CMD=$ARGV[0];

$BASENAME=$ARGV[1];

$BASEFOLD=$ARGV[2];

$NFOLD=$ARGV[3];

#Run the battery

if ($WRTCMD) {$XPROC="cat"}

elsif ($DEBUG) {$XPROC=$EXEC}

else {$XPROC="$EXEC>$NULL"}

open(TN,"|-",$XPROC) or die "Failure to run TreeNet executable $EXEC\n";

print TN "submit '$CMD'\n";

for ($index=1; $index<=$NFOLD; $index++) {

$model=$BASENAME.$index;

$fold=$BASEFOLD.$index;

print TN "output $model\n";

if ($SAVEGRV) {print TN "grove $model\n"}

if ($SCORE) {

print TN "save '$model\.csv' /model\n";

print TN "idvar $ID $fold\n"}

print TN "error sepvar=$fold\n";

print TN "mart go\n"}

print TN "quit\n";

close(TN) or die "Failure to close TreeNet process\n";

In this example the base model specification is contained in an external file, whose name is passed to the script as a parameter, together with a base model name, a base name for the learn/test indicator variables, and the number of learn/test divisions. There are also some options which may be set, such as the name of the TreeNet executable and any ID variables. For our purposes, however, the most important section of the script is the one labeled "Run the battery." Here we open the TreeNet process as if it were a file and then write the appropriate commands to it. We save the classic output for each model in a separate file; optionally we save the groves and scores as well. When processing is finished, we write the QUIT command to TreeNet and close the process. As in the previous example, no attempt is made to parse the responses written to standard output; by default they are sent to the null device. When generating the commands without executing them for debugging purposes (the -c flag), the script sends them to the UNIX cat utility, which will not work under Windows unless an appropriate set of UNIX utilities is installed.
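The choice of where the commands go can also be sketched in shell, the language of the earlier example. This is an illustrative analogue rather than a translation of the Perl script: the function name choose_sink and the file name base.cmd are invented, while the flags and the default executable name treenet follow the listing above.

```shell
#!/bin/sh
# choose_sink [-c] [-d] [-x exec]
# Prints the command line that should receive the generated commands.
choose_sink() {
    exec_name=treenet   # default executable name, as in the Perl listing
    write_only=0
    debug=0
    OPTIND=1            # reset so the function can be called repeatedly
    while getopts "cdx:" flag; do
        case $flag in
            c) write_only=1 ;;
            d) debug=1 ;;
            x) exec_name=$OPTARG ;;
            *) return 2 ;;
        esac
    done
    if [ "$write_only" -eq 1 ]; then
        printf 'cat\n'                         # show commands, run nothing
    elif [ "$debug" -eq 1 ]; then
        printf '%s\n' "$exec_name"             # keep text output visible
    else
        printf '%s >/dev/null\n' "$exec_name"  # discard text output
    fi
}

# Dry run: the commands pass through cat and appear on the terminal.
# base.cmd is a hypothetical command file name.
sink=$(choose_sink -c)
printf 'submit base.cmd\nquit\n' | sh -c "$sink"
```

Because the sink is an ordinary command line, the redirection to the null device is handled by the shell that launches it, not by the executable.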

I have found that the easiest way to pass model specifications to a generic script is in the form of a command file. The script executes the command file with a SUBMIT command, followed by the specific commands required to run the battery. For example, the commands specifying the base model used in the Korn Shell example would look like this:


submit fpath

use "../Data/cacmps5x3.sas7bdat"

submit class2

submit keep66

category mps CAC_GENDER DWELL_TYPE cred mail donor resp,

LIFESTAGE_GRP2 LIFESTYLE_DIM1 LIFESTYLE_DIM2 SILHOUETTE,

activeyr CNTY_SIZEN CAC_MARSTAT2 OCCUP_CD2 agecatd CRED_ANY,

dimn DIM_ANY DONOR_ANY RESP_ANY kids Lifestage_Bin Lifestage_Grp2_Bin,

Lifestyle_Dim1_Bin Lifestyle_Dim2_Bin, Silhouette_Bin binarydpv

loptions timing pred gains roc

mart trees=2000 nodes=6 learnrate=.01 fullreport=yes

model binarydpv

keep copy keep66

cw unit

Note that the MART GO command, which would actually build the model, is absent. The purpose of the file is not to build a model, but merely to specify one. The script itself will generate the commands necessary to build the appropriate models.
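Given such a command file, a generic driver needs only its name and the naming parameters. The following shell sketch emits the same command sequence the Perl script writes when no options are set (SUBMIT, then OUTPUT, ERROR SEPVAR and MART GO for each division, then QUIT). The function name battery_commands and the file name base.cmd are made up for illustration, and the optional grove, score and ID commands are left out for brevity.

```shell
#!/bin/sh
# battery_commands cmdfile base fold nfolds
# Writes the command stream for a battery of models to standard output.
battery_commands() {
    cmdfile=$1
    base=$2
    fold=$3
    nfolds=$4
    # The command file specifies the base model but does not build it.
    printf "submit '%s'\n" "$cmdfile"
    index=1
    while [ "$index" -le "$nfolds" ]; do
        printf 'output %s%d\n' "$base" "$index"
        printf 'error sepvar=%s%d\n' "$fold" "$index"
        printf 'mart go\n'
        index=$((index + 1))
    done
    printf 'quit\n'
}

# Preview the stream for two divisions; pipe it to the console
# executable instead of the terminal to actually build the models.
battery_commands base.cmd bt66tlun6d d 2
```

Keeping the generator separate from the process that consumes its output makes it easy to inspect the commands before committing to a long run.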
