Getting Started with Text Mining: STM™, CART® and TreeNet®
Dan Steinberg, Mykhaylo Golovnya, Ilya Polosukhin May, 2011
Text Mining and Data Mining
Text mining is an important and fascinating area of modern analytics. On the one hand, text mining can be thought of as just another application area for powerful learning machines. On the other hand, text mining is a distinct field with its own dedicated concepts, vocabulary, tools, and techniques. In this tutorial we aim to illustrate some important analytical methods and strategies from both perspectives:
• introducing tools specific to the analysis of text, and
• deploying general machine learning technology
The Salford Text Mining utility (STM) is a powerful text processing system that prepares data for advanced machine learning analytics. Our machine learning tools are the Salford Systems flagship CART® decision tree and the stochastic gradient boosting engine TreeNet®. Evaluation copies of the proprietary technology in CART and TreeNet, as well as the STM, are available from http://www.salford-systems.com
Salford Systems © Copyright 2011
2
For Readers of this Tutorial
To follow along with this tutorial we recommend that you have the analytical tools we use installed on your computer. Everything you need may already be on a CD containing this tutorial and the analytical software. Create an empty folder named "stmtutor"; this is the root folder where all of the work files related to this tutorial will reside. You may also use the following link to download the Salford Systems Predictive Modeler (SPM): http://www.salford-systems.com/dist/SPM/SPM680_Mulitple_Installs_2011_06_07.zip
After downloading the package, unzip its contents into "stmtutor", which will create a new folder named "SPM680_Mulitple_Installs_2011_06_07". Follow the installation steps described on the next slide. For the original DMC 2006 competition website visit http://www.data-mining-cup.de/en/review/dmc-2006/ We recommend that you visit the above site for information only; data and tools for preparing that data are available at the URL below. For the STM package, prepared data files, and other utilities developed for this tutorial please visit http://www.salford-systems.com/dist/STM.zip After downloading the archive, unzip its contents into "stmtutor".
Salford Systems © Copyright 2011
3
Important! Installing the SPM Software
The Salford Systems software you've just downloaded needs to be both installed and licensed. No-cost license codes for a 30-day period are available on request to readers of this tutorial.* Double-click on the "Install_a_Transform_SPM.exe" file located in the "SPM680_Mulitple_Installs_2011_06_07" folder (see the previous slide) to install the specific version of SPM used in this tutorial.
• Following the above procedure will ensure that all of the currently installed versions of SPM, if any, will remain intact!
Follow the simple installation steps on your screen.
* Salford Systems reserves the right to decline to offer a no-cost license at its sole discretion
Salford Systems © Copyright 2011
4
Important! Licensing the SPM Software
When you launch the Salford Systems Predictive Modeler (SPM) you will be greeted with a License dialog containing the information needed to secure a license via email. Please send this information to Salford Systems; you then activate your license by entering the "Unlock Code" that will be emailed back to you. The software will operate for 3 days without any licensing; however, you can secure a 30-day license on request.
Salford Systems © Copyright 2011
5
Installing the Salford Text Miner (STM)
In addition to the Salford Predictive Modeler (SPM) you will also work with the Salford Text Miner (STM) software. No installation is needed: you should already have the "stm.exe" executable in the "stmtutor\STM\bin" folder as a result of unzipping the "STM.zip" package earlier. STM builds upon the Python 2.6 distribution and the NLTK (Natural Language Toolkit) but makes text data processing for analytics very easy to conduct and manage.
• You do not need to add any other support software to use STM
Expect to see several folders and a large number of files located under the "stmtutor\STM" folder. It is important to leave these files in the location to which you have installed them.
• Please do not MOVE or alter any of the installed files other than those explicitly listed as user-modifiable!
"stm.exe" will expire in the middle of 2012; contact Salford Systems to get an updated version beyond that.
Salford Systems © Copyright 2011
6
The Example Project
The best examples are drawn from real-world data sets, and we were fortunate to locate data publicly released by eBay. Good teaching examples also need to be simple.
Unfortunately, real-world text mining can easily involve hundreds of thousands if not millions of features characterizing billions of records. Professionals need to be able to tackle such problems, but to learn we need to start with simpler situations. Fortunately, there are many applications in which text is important but the dimensions of the data set are radically smaller, either because the data available is limited or because a decision has been made to work with a reduced problem.
We use our simpler example to illustrate many useful ideas for beginning text miners while pointing the way to working on larger problems.
Salford Systems © Copyright 2011
7
The DMC2006 Text Mining Challenge
In 2006 the DMC data mining competition (restricted to student competitors only) introduced a predictive modeling problem for which much of the predictive information was in the form of unstructured text. The datasets for the DMC 2006 data mining competition can be downloaded from http://www.data-mining-cup.de/en/review/dmc-2006/
• For your convenience we have re-packaged this data and made it somewhat easier to work with. This re-packaged data is included in the STM package described near the beginning of this tutorial.
The data summarizes 16,000 iPod auctions held at eBay in Germany from May 2005 through May 2006. Each auction item is represented by a text description written by the seller (in German) as well as a number of flags and features available to the seller at the time of the auction. Auction items were grouped into 15 mutually exclusive categories based on distinct iPod features: storage size, type (regular, mini, nano), and color. The competition goal was to predict whether the closing price would be above or below the category average.
Salford Systems © Copyright 2011
8
Comments on the Challenge
One might think that a challenge with text in German might not be of general interest outside of Germany. However, working with a language essentially unfamiliar to any member of the analysis team helps to illustrate one important point:
• Text mining via tools that have no "understanding" of the language can be strikingly effective
We have no doubt that dedicated tools which embed knowledge of the language being analyzed can yield predictive benefits.
• We also believe we could have gained further valuable insight into the data if any of the authors spoke German! But our performance without this knowledge is still impressive.
In contexts where simple methods can yield more than satisfactory results, or in contexts where the same methods must be applied uniformly across multiple languages, the methods described in this tutorial will be an excellent guide.
Salford Systems © Copyright 2011
9
Configuring Work Location in SPM
The original datasets from the DMC 2006 challenge reside in the "stmtutor\STM\dmc2006" folder. To facilitate further modeling steps, we will configure SPM to use this location as the default location:
• Start SPM
• Go to the Edit – Options menu
• Switch to the Directories tab
• Enter the "stmtutor\STM\dmc2006" folder location in all text entry boxes except the last one
• Press the [Save as Defaults] button so that the configuration is restored the next time you start SPM
Salford Systems © Copyright 2011
10
Configuring TreeNet Engine
Now switch to the TreeNet tab:
• Configure the Plot Creation section as shown on the screenshot
• Press the [Save as Defaults] button
• Press the [OK] button to exit
Salford Systems © Copyright 2011
11
Steps in the Analysis: Data Overview
1. Describe the data (Data Dictionary and Dimensions of Data)
   a. What is the unit of observation? Each record of data is describing what?
   b. What is the dependent or target variable?
   c. What other variables (database fields) are available?
   d. How many records are available?
2. Statistical Summary
   a. Basic summary including means, quantiles, frequency tables
   b. Dimensions of categorical predictors
   c. Number of distinct values of continuous variables
3. Outlier and Anomaly Assessment
   a. Detection of gross data errors such as extreme values
   b. Assessment of usability of levels of categorical predictors (rare levels)
Salford Systems © Copyright 2011
12
Data Fundamentals
The original dataset is called "dmc2006.csv" and resides in the "stmtutor\STM\dmc2006" folder. Its 16,000 records are divided into two equal-sized partitions:
• Part 1: Complete data including the target, available for training during the competition
• Part 2: Data to be scored; during the competition the target was not available
There are 25 database fields, two of which are unstructured text written by the seller. Each line of data describes an auction of an iPod, including the final winning bid price. An eBay seller must construct a headline and a description of the product being sold. Sellers can also pay for selling assistance:
• E.g. a seller can pay to list the item title in BOLD
Salford Systems © Copyright 2011
13
The Data: Available Fields
The following variables describe general features of each auction event:

Variable                      Description
AUCT_ID                       ID number of the auction
ITEM_LEAF_CATEGORY_NAME       product category
LISTING_START_DATE            start date of the auction
LISTING_END_DATE              end date of the auction
LISTING_DURTN_DAYS            duration of the auction
LISTING_TYPE_CODE             type of auction (normal auction, multi auction, etc.)
QTY_AVAILABLE_PER_LISTING     number of offered items for a multi auction
FEEDBACK_SCORE_AT_LISTIN      feedback rating of the seller of this auction listing
START_PRICE                   start price in EUR
BUY_IT_NOW_PRICE              buy-it-now price in EUR
BUY_IT_NOW_LISTING_FLAG       option for buy-it-now on this auction listing
Salford Systems © Copyright 2011
14
Available Data Fields
In addition, there are binary indicators of various "value added" features that can be turned on for each auction:

Variable                      Description
BOLD_FEE_FLAG                 option for bold font on this auction listing
FEATUERD_FEE_FLAG             show this auction listing on top of the homepage
CATEGORY_FEATURED_FEE_FLAG    show this auction listing on top of the category
GALLERY_FEE_FLAG              auction listing with picture gallery
GALLERY_FEATURED_FEE_FLAG     auction listing with gallery (in gallery view)
IPIX_FEATURED_FEE_FLAG        auction listing with IPIX (additional xxl, picture show, pack)
RESERVE_FEE_FLAG              auction listing with reserve price
HIGHLIGHT_FEE_FLAG            auction listing with background color
SCHEDULE_FEE_FLAG             auction listing including the definition of the starting time
BORDER_FEE_FLAG               auction listing with frame
Salford Systems © Copyright 2011
15
Target Variable
Finally, the target variable is defined based on the winning bid revenue relative to the category average:

Variable            Description
GMS                 scored sales revenue in EUR
CATEGORY_AVG_GMS    average sales revenue for the product category
GMS_GREATER_AVG     zero when the revenue is less than or equal to the category average and one otherwise

The values were only disclosed for a randomly selected set of 8,000 auctions, which we use to train a model:
• 4,199 auctions with the revenue below the category average
• 3,801 auctions with the revenue above the category average
During the competition the results for the remaining 8,000 auctions were kept secret and used to score competitive entries. We will only use these records at the very end of this tutorial to validate the performance of the various models that will be built.
Salford Systems © Copyright 2011
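The target definition above amounts to a one-line rule; the short Python sketch below simply restates it for clarity (the function name is ours, and the field names follow the table above).

    def gms_greater_avg(gms, category_avg_gms):
        # 1 when the auction revenue exceeds its category average, 0 otherwise
        return 1 if gms > category_avg_gms else 0

    print(gms_greater_avg(120.0, 95.5))   # -> 1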
16
Comments on Methodology
Predictive modeling and general analytics competitions are increasingly being launched both by private companies and by professional organizations, and they provide both public data sets and a wealth of illustrative examples using different analytic techniques. When reviewing results from a competition, and especially when comparing results generated by analysts running models after the competition, it is important to keep in mind that there is an ocean of difference between being a competitor during the actual competition and being an after-the-fact commentator. Regardless of what is reported, the after-the-fact analyst does have access to "what really happened," and it is nearly impossible to simulate the competitive environment once the results have been published.
• We all learn in both direct and indirect ways from many sources, including the outcomes of public competitions. This can affect anything that comes later in time.
In spite of this, we have tried to mimic the circumstances of the competitors by presenting analyses based only on the original training data, and by using well-established guidelines we have been promoting for more than a decade to arrive at a final model. We urge you never to take at face value an analyst's report on what would have happened had they hypothetically participated.
Salford Systems © Copyright 2011
17
First Round Modeling: Ignoring the TEXT Data
Even before doing any type of data preparation it is always valuable to run a few preliminary CART models:
• CART automatically handles missing values and is immune to outliers
• CART is flexible enough to adapt to any type of nonlinearity and interaction effects among predictors; the analyst does not need to do any data preparation to assist CART in this regard
• CART performs well enough out of the box that we are guaranteed to learn something of value without conducting any of the common data preparation operations
The only requirement for useful results is that we exclude any possible perfect or near-perfect illegitimate predictors:
• Common examples of illegitimate predictors include repackaged versions of the dependent variable, ID variables, and data drawn from the future relative to the data to be predicted
We start with a quick model using 20 of the 25 available predictors. None of these involve any of the text data we will focus on later.
Salford Systems © Copyright 2011
18
Quick Modeling Round with CART
We start by building a quick CART model using the original raw variables and all 8,000 complete auction records. Assuming that you already have SPM launched:
• Go to the File – Open – Data File menu
• Note that we have already configured the default working folder for SPM
• Make sure that the Files of Type is set to ASCII
• Highlight the dmc2006.csv dataset
• Press the [Open] button
Salford Systems © Copyright 2011
19
Dataset Summary Window
The resulting window summarizes basic facts about the dataset. Note that even though the dataset has 16,000 records, only the top 8,000 will be used for modeling, as was already pointed out.
Salford Systems © Copyright 2011
20
The View Data Window
Press the [View Data…] button to get a quick impression of the physical contents of the dataset. Our goal is to eventually use the unstructured information contained in the text fields right next to the auction ID.
Salford Systems © Copyright 2011
21
Requesting Basic Descriptive Stats
We next produce some basic stats for all available variables:
• Go to the View – Data Info… menu
• Set the Sort mode to File Order
• Highlight the Include column
• Check the Select box
• Press the [OK] button
Salford Systems © Copyright 2011
22
Data Information Window
All basic descriptive statistics for all requested variables are now summarized in one place. Note that the target variable GMS_GREATER_AVG is not defined for one half of the dataset (N Missing 8,000); all those records will be automatically discarded during model building. Press the [Full] button to see more details.
Salford Systems © Copyright 2011
23
Setting Up CART Model
We are now ready to set up a basic CART run:
• Make the Classic Output window active
• Go to the Model – Construct Model… menu (alternatively, you could press one of the buttons located on the bar right below the menu bar)
• In the resulting Model Setup window make sure that the Analysis Method is set to CART
• In the Model tab make sure that the Sort is set to File Order and the Tree Type is set to Classification
• Check GMS_GREATER_AVG as the Target
• Check all of the remaining variables except AUCT_ID, LISTING_TITLE$, LISTING_SUBTITLE$, GMS, and CATEGORY_AVG_GMS as predictors
• You should see something similar to what is shown on the next slide
Salford Systems © Copyright 2011
24
Model Setup Window: Model Tab
Salford Systems © Copyright 2011
25
Model Setup Window: Testing Tab Switch to the Testing tab and confirm that the 10-fold cross-validation is used as the optimal model selection method
Salford Systems © Copyright 2011
26
Model Setup Window: Advanced Tab
Switch to the Advanced tab and set the minimum required number of records for the parent nodes and the child nodes to 15 and 5, respectively. These limits were chosen to avoid extremely small nodes in the resulting tree.
Salford Systems © Copyright 2011
27
Building CART Model
Press the [Start] button. A progress window will appear for a while, and then the Navigator window containing the model results will be displayed. Press the little button right above the [+][-] pair of buttons, along the left border of the Navigator window; note that all trees within one standard error (SE) of the optimal tree are now marked in green. Use the arrow keys to select the 64-node tree from the tree sequence, which is the smallest 1SE tree.
Salford Systems © Copyright 2011
28
CART Model Observations
The selected CART model contains 64 terminal nodes; it is the smallest model with a relative error still within one standard error of the optimal model (the model with the smallest relative error, marked by the green bar).
• This approach to model selection is usually employed for easy comprehension
• We might also want to require terminal nodes to contain more than the 6-record minimum we observe in this out-of-the-box tree
All 20 predictor variables play a role in the tree construction,
• but there is more to observe about this when we look at the variable importance details.
The area under the ROC curve is a respectable 0.748.
Salford Systems © Copyright 2011
29
CART Model Performance
Press the [Summary Reports…] button in the Navigator, select the Prediction Success tab, and press the [Test] button to display the cross-validated test performance of 68.66% classification accuracy. Now select the Variable Importance tab to review which variables entered into the model. Interestingly enough, none of the "added value" paid options are important; they exhibit practically no direct influence on the sales revenue. A detailed look at the nodes might also be instructive for understanding the model.
Salford Systems © Copyright 2011
30
Experimenting with TreeNet
We almost always follow initial CART models with similar TreeNet models. We start with CART because some glaring errors, such as perfect predictors, are more quickly found and obviously displayed in CART:
• A perfect predictor often yields a single-split tree (two terminal nodes) for classification trees
TreeNet models have strengths similar to CART regarding flexibility and robustness, and TreeNet has both advantages and disadvantages relative to CART:
• TreeNet is an ensemble of small CART trees that have been linked together in special ways; thus TreeNet shares many desirable features of CART
• TreeNet is superior to CART in the context of errors in the dependent variable (not relevant in this data)
• TreeNet yields much more complex models but generally offers substantially better predictive accuracy; TreeNet may easily generate thousands of trees to arrive at an optimal model
• TreeNet yields more reliable variable importance rankings
Salford Systems © Copyright 2011
31
A Few Words about TreeNet
TreeNet builds predictive models in stages. It starts with a deliberately very small first-round tree (essentially a CART tree). TreeNet then calculates the prediction error made by this simple model and builds a second tree to model that prediction error. The second tree serves as a tool to update, refine, and improve the first-stage model. A TreeNet model produces a "score" which is a simple sum of the predictions made by each tree in the model. Typically the TreeNet score becomes progressively more accurate as the number of trees is increased, up to an optimal number of trees. Rarely is the optimal number of trees just one! Occasionally a handful of trees is optimal; more typically, hundreds or thousands of trees are optimal. TreeNet models are very useful for the analysis of data with large numbers of predictors, as the models are built up in layers, each of which makes use of just a few predictors. More detail on TreeNet can be found at http://www.salford-systems.com
Salford Systems © Copyright 2011
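To make the stagewise idea concrete, here is a minimal Python sketch of boosting small regression trees on residuals. It is an illustration only, not the TreeNet algorithm: it uses squared-error residuals rather than TreeNet's logistic loss, it assumes scikit-learn is available, and the data X, y and the parameter names are hypothetical.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def boost(X, y, n_trees=800, learn_rate=0.05, max_leaf_nodes=6):
        base = y.mean()                              # stage 0: a constant prediction
        score = np.full(len(y), base)
        trees = []
        for _ in range(n_trees):
            residual = y - score                     # what the current model still gets wrong
            tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes)
            tree.fit(X, residual)                    # the next small tree models that error
            score += learn_rate * tree.predict(X)    # the score is a sum of small tree predictions
            trees.append(tree)
        return base, trees

    def predict(base, trees, X, learn_rate=0.05):
        return base + learn_rate * sum(t.predict(X) for t in trees)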
32
Setting Up TN Model
Switch to the Classic Output window and go to the Model – Construct Model… menu. Choose TreeNet as the Analysis Method. In the Model tab make sure that the Tree Type is set to Logistic Binary.
Salford Systems © Copyright 2011
33
Setting Up TN Parameters
Switch to the TreeNet tab and do the following:
• Set the Learnrate to 0.05
• Set the Number of trees to use: to 800
• Leave all of the remaining options at their default values
Salford Systems © Copyright 2011
34
TN Results Window
Press the [Start] button to initiate the TN modeling run; the TreeNet Results window will appear when the run completes.
Salford Systems © Copyright 2011
35
Checking TN Performance
Press the [Summary…] button and switch to the Prediction Success tab. Press the [Test] button to view the cross-validation results. Lower the Threshold: to 0.45 to roughly equalize the classification accuracy in both classes (this makes it easier to compare the TN performance with the earlier reported CART performance).
Salford Systems © Copyright 2011
36
The Performance Has Improved!
The overall classification accuracy goes up to about 71%. Press the [ROC] button to see that the area under the ROC curve is now a solid 0.800. This comes at the cost of added model complexity: 796 trees, each with about 6 terminal nodes. Variable importance remains similar to CART.
Salford Systems © Copyright 2011
37
Understanding the TreeNet Model
TreeNet produces partial dependency plots for every predictor that appears in the model; the plots can be viewed by pressing the [Display Plots…] button. Such plots are generally 2D illustrations of how the predictor in question affects the outcome:
• For example, in the graph below the Y axis represents the probability that an iPod will sell at an above-category-average price
We see that for a BUY_IT_NOW price between 200 and 300 the probability of an above-average winning bid rises sharply with BUY_IT_NOW_PRICE. For prices above 300 or below 200 the curve is essentially flat, meaning that changes in the predictor do not result in changes in the probable outcome.
Salford Systems © Copyright 2011
38
Understanding the Partial Dependency Plot (PD Plot)
The PD Plot is not a simple description of the data. If you plotted the raw data as, say, the fraction of above-average winning bids against price intervals, you might see a somewhat different curve. The PD Plot is extracted from the TreeNet model and is generated by examining TreeNet predictions (not the input data). The PD Plot appears to relate two variables, but in fact other variables may well play a role in the graph construction. Essentially the PD Plot shows the relationship between a predictor and the target variable taking all other predictors into account. The important points to understand are that:
• the graph is extracted from the model and not directly from raw data
• the graph provides an honest estimate of the typical effect of a predictor
• the graph displays not absolute outcomes but typical expected changes from some baseline as the predictor varies; the graph can be thought of as floating up or down depending on the values of the other predictors
A rough sketch of how such a plot can be computed from any fitted model appears after this list.
Salford Systems © Copyright 2011
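The sketch below is a generic illustration, not Salford's implementation; the function and argument names are ours. For each grid value of one predictor, the predictor is forced to that value for every record, the fitted model is asked for predictions, and the predictions are averaged.

    import numpy as np

    def partial_dependence(model, X, col, grid):
        # X is a numeric array of predictors; col is the column index of the predictor of interest
        averages = []
        for value in grid:
            X_mod = X.copy()
            X_mod[:, col] = value                         # force the predictor to the grid value
            averages.append(model.predict(X_mod).mean())  # average prediction over all records
        return np.array(averages)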
39
More TN Partial Dependency Plots
Salford Systems © Copyright 2011
40
Introducing the Text Mining Dimension
To this point, we have been working only with the traditional structured data fields: continuous and categorical variables. Further substantial performance improvement can be achieved only if we utilize the text descriptions supplied by the seller in the following fields:

Variable            Description
LISTING_TITLE       title of the auction
LISTING_SUBTITLE    subtitle of the auction

Unfortunately, these two variables cannot be used "as is". Sellers were free to enter free-form text including misspellings, acronyms, slang, etc. So we must address the challenge of converting unstructured text strings of the type shown here into a well-structured representation.
Salford Systems © Copyright 2011
41
The Bag of Words Approach to Text Mining
The most straightforward strategy for dealing with free-form text is to represent each "word" that appears in the complete data set as a dummy (0/1) indicator variable. For iPods on eBay we could imagine sellers wanting to use words like "new", "slightly scratched", "pink", etc. to describe their iPod. Of course the descriptions may well be complete phrases like "autographed by Angela Merkel" rather than just single-term adjectives. Nevertheless, in the simplest Bag of Words (BOW) approach we just create dummy indicators for every word. Even though the headlines and descriptions are space limited, the number of distinct words that can appear in collections of free text can be huge. In text mining applications involving complete documents, e.g. newspaper articles, the number of distinct words can easily reach several hundred thousand or even millions.
Salford Systems © Copyright 2011
42
The End Goal of the Bag of Words

Record_ID   RED   USED   SCRATCHED   CASE
1001        0     1      0           1
1002        0     0      0           0
1003        1     0      0           0
1004        0     0      0           0
1005        1     1      1           0
1006        0     0      0           0
Above we see an example of a database intended to describe each auction item by indicating which words appeared in the auction announcement. Observe that Record_ID 1005 contains the three words "RED", "USED" and "SCRATCHED". Data in the above format looks just like the kind of numeric data used in traditional data mining and statistical modeling. We can use data in this form, as is, feeding it into CART, TreeNet, or regression tools such as Generalized Path Seeker (GPS) or everyday regression. Observe that we have transformed the unstructured text into structured numerical data.
Salford Systems © Copyright 2011
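A minimal Python sketch of how such a 0/1 matrix can be built from raw strings follows. The example documents are invented, and a real system would first apply the cleaning steps described later in this tutorial.

    docs = ["used ipod with case", "new ipod in box", "red ipod slightly scratched"]
    vocab = sorted({word for doc in docs for word in doc.lower().split()})   # the "dictionary"
    matrix = [[1 if word in doc.lower().split() else 0 for word in vocab]    # presence/absence indicators
              for doc in docs]
    print(vocab)
    print(matrix)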
43
Coding the Term Vector and TF Weighting
In the sample data matrix on the previous slide we coded all of our indicators as 0 or 1 to indicate presence or absence of a term. An alternative coding scheme is based on the FREQUENCY COUNT of the terms, with these variations:
• 0 or 1 coding for presence/absence
• Actual term count (0, 1, 2, 3, …)
• Three-level indicator for absent, one occurrence, and more than one (0, 1, 2)
The text mining literature has established some useful weighted coding schemes. We start with term frequency weighting (tf):
• Text mining can involve blocks of text of considerably different lengths
• It is thus desirable to normalize counts based on relative frequency. Two text fields might each contain the term "RED" twice, but one of the fields contains 10 words while the other contains 40 words. We might want our coding to reflect the fact that 2/10 is more frequent than 2/40.
• This is nothing more than making counts relative to the total length of the unit of text (or document), and such coding yields the term frequency weighting.
Salford Systems © Copyright 2011
44
Inverse Document Frequency (IDF) Weighting
IDF weighting is drawn from the information retrieval literature and is intended to reflect the value of a term in narrowing the search for a specific document within a larger corpus of documents. If a given term occurs very rarely in a collection of documents, then that term is very valuable as a tag to target those documents accurately. By contrast, if a term is very common, then knowing that the term occurs within the document you are looking for is not helpful in narrowing the search. While text mining has somewhat different goals than information retrieval, the concept of IDF weighting has caught on. IDF weighting serves to upweight terms that occur relatively rarely.

IDF(term) = log( (Number of documents) / (Number of documents containing term) )

The IDF increases with the rarity of a term and is maximal for words that occur in only one document. A common coding of the term vector uses the product tf * idf.
Salford Systems © Copyright 2011
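The Python sketch below restates the tf, idf, and tf-idf weightings to match the formula above; the toy documents are invented for illustration.

    import math

    docs = [["red", "used", "ipod"], ["new", "ipod", "ipod", "mini"]]

    def tf(term, doc):
        return doc.count(term) / float(len(doc))       # count relative to document length

    def idf(term):
        df = sum(1 for doc in docs if term in doc)     # number of documents containing the term
        return math.log(len(docs) / float(df))         # rarer terms get larger weights

    def tf_idf(term, doc):
        return tf(term, doc) * idf(term)

    print(tf_idf("ipod", docs[1]))   # term in every document -> idf of 0, so weight 0
    print(tf_idf("mini", docs[1]))   # rarer term -> positive weight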
45
Coding the DMC2006 Text Data
The DMC2006 text data is unusual principally because of the limit on the amount of text a seller was allowed to upload. This has the effect of making the lengths of all the documents very similar. It also sharply limits the possibility that a term in a document would occur with a high frequency. These factors contribute to making the TF-IDF weighting irrelevant to this challenge; in fact, for this prediction task other coding schemes allow more accurate prediction. STM offers these options for term vector coding:
• 0 – no/yes
• 1 – no/yes/many – this one will be used in the remainder of this tutorial
• 2 – 0/1
• 3 – 0/1/2
• 4 – term frequency (relative to document)
• 5 – inverse document frequency (relative to corpus)
• 6 – TF-IDF (traditional IR coding)
Salford Systems © Copyright 2011
46
Text Mining Data Preparation
The heavy lifting in text mining technology is devoted to moving us from raw unstructured text to structured numerical data. Once we have structured data we are free to use any of a large number of traditional data mining and statistical tools to move forward. Typical analytical tools include logistic and multiple regression, predictive modeling, and clustering tools. But before diving into the analysis stage we need to move through the text transformation stage in detail. The first step is to extract and identify the words or "terms", which can be thought of as creating the list of all words recognized in the training data set. This stage is essentially one of defining the "dictionary", the list of officially recognized terms. Any new term encountered in the future will be unrecognizable by the dictionary and will represent an unknown item. It is therefore very important to ensure that the training data set contains almost all terms of interest that would be relevant for future prediction.
Salford Systems © Copyright 2011
47
Automatic Dictionary Building
The following steps will build an active dictionary for a collection of documents (in our case, auction item description strings); a rough code sketch of these steps appears after the list:
• Read all text values into one character string
• Tokenize this string into an array of words (tokens)
• Remove words without any letters or digits
• Remove "stop words" (words like "the", "a", "in", "und", "mit", etc.) for both the English and German languages
• Remove words that have fewer than 2 letters and are encountered less than 10 times across the entire collection of documents (rare small words)
  - At this point the too-common, too-rare, weird, obscure, and useless combinations of characters should have been eliminated
• Lemmatize words using the WordNet lexical database
  - This step combines words present in different grammatical forms ("go", "went", "going", etc.) into the corresponding stem word ("go")
• Remove all resulting words that appear less than MIN times (5 in the remainder of this tutorial)
Salford Systems © Copyright 2011
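For readers who want to see how these steps look in code, here is a rough NLTK-based sketch. It is not the STM implementation: the function name and thresholds are ours, some of the filters are simplified, and the NLTK corpora (stopwords, WordNet) must be downloaded separately.

    from collections import Counter
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    def build_dictionary(texts, min_count=5):
        stop = set(stopwords.words('english')) | set(stopwords.words('german'))
        lemmatizer = WordNetLemmatizer()
        counts = Counter()
        for text in texts:
            for token in nltk.word_tokenize(text.lower()):
                if not any(ch.isalnum() for ch in token):    # drop tokens with no letters or digits
                    continue
                if token in stop or len(token) < 2:          # drop stop words and one-character tokens
                    continue
                counts[lemmatizer.lemmatize(token)] += 1     # collapse grammatical variants
        return sorted(term for term, n in counts.items() if n >= min_count)   # drop rare terms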
48
Build the Dictionary (or Term Vector)
For the purpose of automatic dictionary building and data preprocessing we developed the Salford Text Mining (STM) software: a stand-alone collection of tools that performs all the essential steps in preparing text documents for text mining. STM builds on the Python "Natural Language Toolkit" (NLTK). From NLTK we use the following tools:
• Tokenizer (extract items most likely to be "words")
• Porter Stemmer (recognize different simple forms of the same word, e.g. plurals)
• WordNet lemmatizer (more complex recognition of same-word variations)
• stop word list (words that contribute little to no value, such as "the", "a")
Future versions of STM might use other tools to accomplish these essential tasks. "stm.exe" is a command line utility that must be run from a Command Prompt window (assuming you are running Windows, go to the Start – All Programs – Accessories – Command Prompt menu). The version provided here resides in the "stmtutor\STM\bin" folder.
Salford Systems © Copyright 2011
49
STM Commands and Options
Open a Command Prompt window in Windows, then CD to the "stmtutor\STM" folder location; for example, on our system you would type in:
cd c:\stmtutor\STM
To obtain help, type the following at the prompt:
bin\stm --help
This command will return very concise information about STM:
stm [-h] [-data DATAFILE] [-dict DICTFILE] [-source-dict SRCDICTFILE] [-score SCOREFILE] [-spm SPMAPP] [-t TARGET] [-ex EXCLUDE] etc.
The details for each command line option are contained in the software manual appearing in the appendix. You will also notice the "stm.cfg" configuration file; this file controls the default behavior of the STM module and relieves you of specifying a large number of configuration options each time "stm.exe" is launched.
• Note the TEXT_VARIABLES : 'ITEM_LEAF_CATEGORY_NAME, LISTING_TITLE, LISTING_SUBTITLE' line, which specifies the names of the text variables to be processed
50
Create Dictionary Options
For the purposes of this tutorial, we have prepackaged all of the text processing steps into individual command files (extension *.bat). You can either double-click on the referenced command file or, alternatively, type its contents into a Command Prompt window opened in the directory that contains the files. The most important arguments for our purposes now are:
• --dataset DATAFILE      name and location of your input CSV-format data set
• --dictionary DICTFILE   name and location of the dictionary to be created
These two arguments are all you need to create your dictionary. By default, STM will process every text field in your input data set to create a single omnibus dictionary. Simply double-click on "stm_create_dictionary.bat" to create the dictionary file for the DMC 2006 dataset, which will be saved as "dmc2006_ynm.dict" in the "stmtutor\STM\dmc2006" folder. In typical text mining practice the process of generating the final dictionary is iterative: a review of the first dictionary might reveal further words you wish to exclude ("stop" words).
Salford Systems © Copyright 2011
51
Internal Dictionary Format
The dictionary file is a simple text file with extension *.dict, and its contents can be viewed and edited in a standard text editor. The name of the text mining variable that will be created later on appears on the left of the "=" sign on each un-indented line. The default value that will be assigned to this variable appears on the right side of the "=" sign of the un-indented line, and it usually means the absence of the word(s) of interest. Each indented line gives the value (left of the "=") which will be entered for a single occurrence in a document of any of the word(s) appearing on the right of the "=".
• More than one occurrence will be recorded as "many" when requested (always the case in this tutorial)
Salford Systems © Copyright 2011
52
Hand Made Dictionary
To use multi-level coding you need to create a "hand made dictionary", which is already supplied to you as "hand.dict" in the "stmtutor\STM\dmc2006" folder. Here is an example of an entry in this file:
hand_model=standard
    mini
    nano
    standard
The un-indented line of the entry starts with the name we wish to give to the term (HAND_MODEL) and also indicates that a BLANK or missing value is to be coded with the default value of "standard". The remaining indented entries are listed one per line and are an exhaustive list of the acceptable values which the term HAND_MODEL can receive in the term vector. Another coding option is, for example:
hand_unused=no
    yes=unbenutzt,ungeoffnet
which sets "no" as the default value but substitutes "yes" if one of the two words listed above is encountered. You may study additional examples in the "stmtutor\STM\dmc2006\hand.dict" file on your own; all of them were created manually based on common-sense logic.
53
Why Create Hand Made Dictionary Entries
Let's revisit the variable HAND_MODEL, which brings together the terms:
• standard, mini, nano
Without a hand made dictionary entry we would have three terms created, one for each model type, with "yes" and "no" values, and possibly "many". By creating the hand made entry we:
• Ensure that every auction is assigned a model (default = "standard")
• Bring all three models together into one categorical variable with three possible values: "standard", "mini", and "nano"
This representation of the information is helpful when using tree-based learning machines but not helpful for regression-based learning machines:
• The best choice of representation may vary from project to project
• Salford regression-based learning machines automatically repackage categorical predictors into 0/1 indicators, meaning that you can work with one representation
• But if you need to use other tools you may not have this flexibility
Salford Systems © Copyright 2011
54
Further Dictionary Customization
The following table summarizes some of the important fields introduced in the custom dictionary for this tutorial:
• CAPACITY (values 20, 30, 40, 80, …) combines word variants per value:
  20: 20gb, 20 gb, 20 gigabyte
  30: 30gb, 30 gb, 30 gigabyte
  40: 40gb, 40 gb, 40 gigabyte
  80: 80gb, 80 gb, 80 gigabyte
  …
• STATUS (values Wieneu, Neu, Unbenutzt, defekt) combines word variants per value:
  Wieneu: Wie neu, super gepflegt, top gepflegt, top zustand, neuwertig
  Neu: neu, new, brandneu, brandneues
  Unbenutzt: Unbenu
  defekt: defekt., --defekt--, defekt, -defekt-, -defekt, defekter, defektes
• MODEL (values Mini, nano, standard) captures the presence of the corresponding word in the auction description
• COLOR (values Black, white, Green, etc.) captures the presence of the corresponding words or variants in the auction description
• IPOD_GENERATION (values First, second, etc.) identifies the iPod generation from the information available in the text description
Salford Systems © Copyright 2011
55
Final Stage Dictionary Extraction
To generate a final version of the dictionary in most real-world applications you would also need to prepare an expanded list of stopwords. The NLTK provides a ready-made list of stopwords for English and another 14 major languages spanning Europe, Russia, Turkey, and Scandinavia.
• These appear in the directory named "stmtutor\STM\data\corpora\stopwords" and should be left as they are
Additional stopwords, which might well vary from project to project, can be entered into the file named "stopwords.dat" in the "stmtutor\STM\data" folder.
• In the package distributed with this tutorial the "stopwords.dat" file is empty
• You can freely add words to this file, one stopword per line
Once the custom "stopwords.dat" and "hand.dict" files have been prepared, you just run the dictionary extraction again with the "--source-dictionary" argument added (see the command files introduced in the later slides). The resulting dictionary will now include all of the introduced customizations.
Salford Systems © Copyright 2011
56
Creating Structured Text Mining Variables
The resulting dictionary file "dmc2006_ynm.dict" contains about 600 individual stems. In the final step of text processing the data dictionary is applied to each document entry. Each stem from the dictionary is represented by a categorical variable (usually binary) with the corresponding name. The preparation process checks whether any of the known word variants associated with each stem from the dictionary are present in the current auction description; if yes, the corresponding value is set to "yes", otherwise it is set to "no".
• When the "--code YNM" option is set, multiple instances of "yes" will be coded as "many"
• You can also request integer codes 0, 1, 2 in place of the character codes "yes/no/many"
• We have experimented with alternative variants of coding (see the "--code" help entry in the STM manual) and came to the conclusion that the "YNM" approach works best in this tutorial
• Feel free to experiment with alternative coding schemes on your own
The resulting large collection of variables will be used as additional predictors in our modeling efforts. Even though other, more computationally intense text processing methods exist, further investigation failed to demonstrate their utility on the current data, which is most likely related to the extremely terse nature of the auction descriptions.
Salford Systems © Copyright 2011
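A rough Python sketch of the yes/no/many coding for a single dictionary entry follows. It is an illustration only: STM's matching of multi-word variants is more involved than this simple word count, and the example stem and variants are ours.

    def code_ynm(description, variants):
        words = description.lower().split()
        hits = sum(words.count(v) for v in variants)   # occurrences of any known variant
        if hits == 0:
            return "no"
        return "yes" if hits == 1 else "many"

    # Hypothetical stem "defekt" with a few of its spelling variants
    print(code_ynm("ipod mini defekt aber defekter akku", ["defekt", "defekter", "defektes"]))   # -> "many"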
57
Creating Additional Variables
Finally, we spent additional effort on reorganizing the original raw variables into more useful measures:
• MONTH_OF_START – based on the recorded start date of the auction
• MONTH_OF_SALE – based on the recorded closing date of the auction
• HIGH_BUY_IT_NOW – set to "yes" if BUY_IT_NOW_PRICE exceeds CATEGORY_AVG_GMS, as suggested by common sense and the nature of the classification problem
• In the original raw data, BUY_IT_NOW_PRICE was set to 0 on all items where that option was not available – we reset all such 0s to missing
All of these operations are encoded in the "preprocess.py" Python file located in the "stmtutor\STM\dmc2006" folder (a hypothetical sketch of this kind of transformation appears after this list):
• This component of the STM is under active development
• The file is automatically called by the main STM utility
• You may add to or modify the contents of this file to allow alternative transformations of the original predictors
Salford Systems © Copyright 2011
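The sketch below shows the kind of transformation described above; it is hypothetical, not the actual contents of preprocess.py, and the date format used here is an assumption.

    import datetime

    def derive_fields(row):
        # row is a dict keyed by the original column names
        start = datetime.datetime.strptime(row["LISTING_START_DATE"], "%Y-%m-%d")   # format is assumed
        row["MONTH_OF_START"] = start.month
        price = float(row["BUY_IT_NOW_PRICE"] or 0)
        row["BUY_IT_NOW_PRICE"] = price if price > 0 else ""    # 0 meant "option not offered" -> missing
        row["HIGH_BUY_IT_NOW"] = "yes" if price > float(row["CATEGORY_AVG_GMS"]) else "no"
        return row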
58
Generation of the Analysis Data Set
At this point we are ready to move on to the next step, which is data creation. This is nothing more than appending the relevant columns of data to the original data set. Remember that a dictionary may contain tens of thousands if not hundreds of thousands of terms; for the DMC2006 dataset the dictionary is quite small by text mining standards, containing just a little over 600 words. To generate the processed dataset simply double-click on the "stm_ynm.bat" command file or explicitly type its contents into the Command Prompt:
• The "--dataset" option specifies the input dataset to be processed
• The "--code YNM" option requests the "yes/no/many" style of coding
• The "--source-dictionary" option specifies the hand dictionary
• The "--process" option specifies the output dataset
• Of course you may add other options as you prefer
This creates a processed dataset named "dmc2006_res_ynm.csv" which resides in the "stmtutor\STM\dmc2006" folder.
Salford Systems © Copyright 2011
59
Analysis Data Set Observations
At this point we have a new modeling dataset with the text information represented by the extra variables.
• Note that the raw input data set is just shy of 3 MB in plain text format, while the prepared analysis data set is about 40 MB, 13 times larger
Process only training data or all data?
• For prediction purposes all data needs to be processed, both the data that will be used to train the predictive models and the holdout or future data that will receive predictions later
• In the DMC2006 data we happen to have access to both training and holdout data and thus have the option of processing all the text data at the same time
• Generating the term vector based only on the training data would generally be the norm, because future data flows have not yet arrived
• In this project we elected to process all the data together for convenience, knowing that the train and holdout partitions were created by a random division of the data
• It is worth pointing out, though, that a final dictionary generated from training data only might be slightly different, due to the infrequent-word elimination component of the text processor
Salford Systems © Copyright 2011
60
Quick Modeling Round with CART
We are now ready to proceed with another CART run, this time using all of the newly created text fields as additional predictors. Assuming that you already have SPM launched:
• Go to the File – Open – Data File menu
• Make sure that the Files of Type is set to ASCII
• Highlight the dmc2006_res_ynm.csv dataset
• Press the [Open] button
Salford Systems © Copyright 2011
61
Dataset Summary Window
Again, the resulting window summarizes basic facts about the dataset. Note the dramatic increase in the number of available variables.
Salford Systems © Copyright 2011
62
The View Data Window
Press the [View Data…] button to have a quick look at the physical contents of the dataset. Note how the individual dictionary word entries are now coded with the "yes", "no", or "many" values for each document row.
Salford Systems © Copyright 2011
63
Setting Up CART Model
Proceed with setting up a CART modeling run as before:
• Make the Classic Output window active
• Go to the Model – Construct Model… menu (alternatively, you could use one of the buttons located on the bar right below the menu)
• In the resulting Model Setup window make sure that the Analysis Method is set to CART
• In the Model tab make sure that the Sort is set to File Order and the Tree Type is set to Classification
• Check GMS_GREATER_AVG as the Target
• Check all of the remaining variables except AUCT_ID, LISTING_TITLE$, LISTING_SUBTITLE$, GMS, and CATEGORY_AVG_GMS as predictors
• You should see something similar to what is shown on the next slide
Salford Systems © Copyright 2011
64
Model Setup Window: Model Tab
Salford Systems © Copyright 2011
65
Model Setup Window: Testing Tab Switch to the Testing tab and confirm that the 10-fold cross-validation is used as the optimal model selection method
Salford Systems © Copyright 2011
66
Model Setup Window: Advanced Tab
Switch to the Advanced tab and set the minimum required number of records for the parent nodes and the child nodes to 15 and 5, respectively. These limits were chosen to avoid extremely small nodes in the resulting tree.
Salford Systems © Copyright 2011
67
Building CART Model
Press the [Start] button. A progress window will appear for a while, and then the Navigator window containing the model results will be displayed (this time, the process takes a few minutes!). Press the little button right above the [+][-] pair of buttons, along the left border of the Navigator window; note that all trees within one standard error (SE) of the optimal tree are now marked in green. Use the arrow keys to select the 102-node tree from the tree sequence, which is the smallest 1SE tree.
Salford Systems © Copyright 2011
68
CART Model Performance
The selected CART model contains 102 terminal nodes, and nearly all available predictor variables play a role in the tree construction. The area under the ROC curve (Test) is now an impressive 0.830, especially when compared to the 0.748 reported earlier for the basic CART run or the 0.800 for the basic TN run. Press the [Summary Reports] button in the Navigator window, select the Prediction Success tab, and finally press the [Test] button to see the cross-validated test performance of 76.58% classification accuracy – a significant improvement! Also note the presence of both original and derived variables on the list shown in the Variable Importance tab.
Salford Systems © Copyright 2011
69
Setting Up TN Model
Now switch to the Classic Output window and go to the Model – Construct Model… menu. Choose TreeNet as the Analysis Method. In the Model tab make sure that the Tree Type is set to Logistic Binary.
Salford Systems © Copyright 2011
70
Setting Up TN Parameters
Switch to the TreeNet tab and do the following:
• Set the Learnrate: to 0.05
• Set the Number of trees to use: to 800
• Leave all of the remaining options at their default values
Salford Systems © Copyright 2011
71
TN Results Window
Press the [Start] button to initiate the TN modeling run; the TreeNet Results window will appear when the run completes, though you might want to take a coffee break while you wait.
Salford Systems © Copyright 2011
72
Checking TN Performance
Press the [Summary] button and switch to the Prediction Success tab. Press the [Test] button to view the cross-validation results. Lower the Threshold: to 0.47 to roughly equalize the classification accuracy in both classes (this makes it easier to compare the TN performance with the earlier reported CART and TN model performance). You can clearly see the improvement!
Salford Systems © Copyright 2011
73
Requesting TN Graphs
Here we present a sample collection of the 2-D contribution plots produced by TN for the resulting model. The plots are available by pressing the [Display Plots…] button in the TreeNet Results window. The list is arranged according to the variable importance table.
74
More Graphs
Salford Systems © Copyright 2011
75
Insights Suggested by the Model
Here is a list of insights we arrived at by looking into the selection of plots:
• There is a distinct effect of the iPod category once all the other factors have been accounted for
• A larger start price means an above-average sale (most likely related to the quality of the item)
• A "new" and "unpacked" item should fetch a better price, while any "defect" brings the price down
• End of the year means better sales
• Having a good feedback score is important
• It is best to wait 10 days or more before closing the deal
• Interestingly, the 1st and 3rd generations of iPod show poorer sales than the 2nd and 4th
• 2G started to fall out of favor in 2005-2006
• Black is much more popular in Germany than other colors
• Mentioning "photo", "video", "color display", etc. helps get a better price
• The paid advertising features are of little or marginal importance
Salford Systems © Copyright 2011
76
Final Validation of Models
At this point we are ready to check the performance of all our models using the remaining 8,000 auctions originally not available for training. This way each model can be positioned with respect to the 173 official entries originally submitted to the DMC 2006 competition. However, in order to proceed with the evaluation, we must first score the input data using all of the models we have generated up until now. The following slides explain how to score the most recently constructed CART and TN models; the earlier models can be scored using similar steps. You may choose to skip the scoring steps, as we have already included the results of scoring in the "stmtutor\STM\scored" folder:
• Score_cart_raw.csv – simple CART model predictions
• Score_tn_raw.csv – simple TN model predictions
• Score_cart_txt.csv – text mining enhanced CART model predictions
• Score_tn_txt.csv – text mining enhanced TN model predictions
Salford Systems © Copyright 2011
77
Scoring a CART Model
Select the Navigator window for the model you wish to score. Select the tree from the tree sequence (in our runs we pick the 1SE trees as more robust). Press the [Score] button to open the "Score Data" window. Make sure that the "Data file" is set to "dmc2006_res_ynm.csv"; if not, press the [Select…] button on the right and select the dataset to be scored. Place a checkmark in the "Save results to a file" box, then press the [Select] button right next to it; this will open the "Save As" window. Navigate to the "stmtutor\STM\scored" folder in the "Save in:" selection box, enter "Scored_cart_txt.csv" in the "File name:" text entry box, and press the [Save] button. You should now see something similar to what's shown on the next slide. Press the [OK] button to initiate the scoring process. You should now have the Scored_cart_txt.csv file in the stmtutor\STM\scored folder.
Salford Systems © Copyright 2011
78
Scoring CART
Salford Systems © Copyright 2011
79
Scoring a TN Model
Select the "TreeNet Results" window for the model you wish to score. Go to the "Model – Score Data…" menu to open the "Score Data" window. Make sure that the "Data file" is set to "dmc2006_res_ynm.csv"; if not, press the [Select…] button on the right and select the dataset to be scored. Place a checkmark in the "Save results to a file" box, then press the [Select] button right next to it; this will open the "Save As" window. Navigate to the "stmtutor\STM\scored" folder in the "Save in:" selection box, enter "Scored_tn_txt.csv" in the "File name:" text entry box, and press the [Save] button. You should now see something similar to what's shown on the next slide. Press the [OK] button to initiate the scoring process. You should now have the Scored_tn_txt.csv file in the stmtutor\STM\scored folder.
Salford Systems © Copyright 2011
80
Scoring TN
Salford Systems © Copyright 2011
81
Using STM to Validate Performance
We can now use the STM machinery to do the final model validation. Simply double-click the "stm_validate.bat" command file to proceed. Note the use of the following options inside the command file:
• "-score" – specifies the output dataset where the model predictions were written
• "--score-column" – specifies the name of the variable containing the actual model predictions (these variables are produced by CART or TN during the scoring process)
• "--check" – specifies the name of the dataset that contains the originally withheld values of the target
  - this dataset was used by the organizers of the DMC 2006 competition to select the actual winners
• STM is currently configured to validate only the bottom 8,000 of the 16,000 predictions generated by the model; the top 8,000 records (used for learning) are simply ignored
The results will be saved into text files with the extension "*.result" appended to the original score file names in the "stmtutor\STM\scored" folder.
Salford Systems © Copyright 2011
82
Validation Results Format The following window shows the validation results of the final TN model we built
8,000 validation records were scored, of which:
• 719 ones were misclassified as zeroes
• 807 zeroes were misclassified as ones
Thus 1,526 documents were misclassified. This gives the final score of 8,000 - (1,526 * 2) = 4,948.
Salford Systems © Copyright 2011
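Restating the scoring arithmetic above as a short calculation:

    n_validation = 8000
    misclassified = 719 + 807            # ones missed as zeroes + zeroes missed as ones
    score = n_validation - 2 * misclassified
    print(score)                         # 4948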
83
Final Validation of Models
Based on the predicted class assignments, the final performance score is calculated as 8,000 minus twice the total number of auction items misclassified. The following table summarizes how these virtually out-of-the-box elementary models perform on the holdout data (the values are extracted from the four *.result files produced by the STM validator):

Model            ROC Area   Missed 0s   Missed 1s   Score
CART raw data    75%        1123        1387        2980
TN raw data      80%        1308        926         3532
CART text data   83%        981         848         4342
TN text data     89%        807         719         4948
Salford Systems © Copyright 2011
84
Visual Validation of the Results
The following graph summarizes the positioning of the four basic models (CART raw, TN raw, CART text, TN text) with respect to the 173 official competition entries. The TN model with text mining processing is among the top 10 winners!
Salford Systems © Copyright 2011
85
Observations on the Results
We used the most basic form of text mining, the Bag of Words, with minor emendations.
• None of the authors speaks German, although we did look up some of the words in an on-line dictionary. If there are any subtleties to be picked up from the sellers' wording choices, we would have missed them.
We chose the coding scheme that performed best on the training data; we have six coding options and one stands out as clearly best. We used common settings for the controls of CART and TreeNet. We did not use any of the modeling refinement techniques we teach in our CART and TreeNet tutorials. We thus invite you to see if you can tweak the performance of these models even higher.
Salford Systems © Copyright 2011
86
Command Line Automation in SPM
SPM has a powerful command line processing component which allows you to completely reproduce any modeling activity by creating and later submitting a command file. We have packaged the command files for the four modeling and scoring runs you have conducted in the course of this tutorial:
• SPM command files must have the extension *.cmd
• The four command files are stored in the "stmtutor\STM\dmc2006" folder
You can create, open, or edit a command file using a simple text editor, like Notepad; SPM also has a built-in editor, just go to the File – New Notepad… menu. You may also access the command line directly from inside the SPM GUI; just make sure that the File – Command Prompt menu item is checked. Type "help" at the command prompt (it starts with the ">" mark) in the Classic Output window to get a listing of all available commands. You can then request more detailed help for any specific command of interest; for example, "help battery" will produce a long list of the various batteries of automated runs available in SPM. Furthermore, you may view all of the commands issued during the current session by going to the View – Open Command Log… menu; this way you can quickly learn which commands correspond to the recent GUI activity you were involved in.
Salford Systems © Copyright 2011
87
Basic CART Model Command File
You may now restart SPM to emulate a fresh run. Go to the File – Open – Command File… menu, select the "cart_raw.cmd" command file, and press the [Open] button. The file is now opened in the built-in Notepad window.
Salford Systems © Copyright 2011
88
CART Command File Contents
OUT – saves the classic output into a text file
USE – points to the modeling dataset
GROVE – saves the model as a binary grove file
MODEL – specifies the target variable
CATEGORY – indicates which variables are categorical, including the target
KEEP – specifies the list of predictors
LIMIT – sets the node limits
ERROR – requests cross-validation
BUILD – builds a CART model
SAVE – names the file where the CART model predictions will be saved
HARVEST – specifies which tree is to be used in scoring
IDVAR – requests saving of the additional variables into the output dataset
SCORE – scores the CART model
OUTPUT * – closes the current text output file
Note the use of relative paths in the GROVE and SAVE commands, and the use of the forward slash "/" to separate folder names. A rough sketch of how these commands fit together is shown below.
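For orientation only, here is a minimal sketch of how such a command file might be laid out. The authoritative versions are the actual *.cmd files shipped in "stmtutor\STM\dmc2006"; in this sketch the data file name, the KEEP list, the IDVAR choice, and the exact argument syntax are illustrative placeholders, and the LIMIT and HARVEST arguments used in the real file are omitted.

OUT "cart_raw.dat"
USE "dmc2006_train.csv"
GROVE "../models/cart_raw.grv"
MODEL GMS_GREATER_AVG
CATEGORY GMS_GREATER_AVG
KEEP START_PRICE, CATEGORY_ID, WORD_IPOD
ERROR CROSS
BUILD
SAVE "../scored/Score_cart_raw.csv"
IDVAR AUCT_ID
SCORE
OUTPUT *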
Salford Systems © Copyright 2011
89
Submitting Command File
With the Notepad window active, go to the File – Submit Window menu to submit the command file to SPM. In the end you will see the Navigator and the Score windows opened, which should be identical to the ones you have already seen at the beginning of this tutorial. Furthermore, you should now have:
- a "cart_raw.dat" text file in the "stmtutor\STM\dmc2006" folder; it contains the classic output you normally see in the "Classic Output" window
- a "cart_raw.grv" binary grove file in the "stmtutor\STM\models" folder; it contains the CART model itself and can be opened in the GUI using the File – Open – Open Grove… menu (which reopens the Navigator window); this file will also be needed for future scoring or translation
- a "Score_cart_raw.csv" data file in the "stmtutor\STM\scored" folder; it contains the selected CART model predictions on your data
You may now proceed by opening the "tn_raw.cmd" file using the File – Open – Command File… menu
Salford Systems © Copyright 2011
90
TN Command File Contents
OUT, USE, GROVE, MODEL, CATEGORY, KEEP, ERROR, SAVE, IDVAR, SCORE, OUTPUT – same as in the CART command file introduced earlier
MART TREES – sets the TN model size in trees
MART NODES – sets the tree size in terminal nodes
MART MINCHILD – sets the minimum individual node size in records
MART OPTIMAL – sets the evaluation criterion that will be used for optimal model selection
MART BINARY – requests logistic regression processing (in our case)
MART LEARNRATE – sets the learn rate parameter
MART SUBSAMPLE – sets the sampling rate
MART INFLUENCE – sets the influence trimming value
The rest of the MART commands request automatic saving of the 2-D and 3-D plots into the grove; type "help mart" to get full descriptions. A rough illustrative sketch of the MART block follows.
Salford Systems © Copyright 2011
91
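Purely as a hedged sketch of the MART block: the 500-tree and 6-node figures below echo the defaults quoted in the STM configuration reference later in this document, while the MINCHILD, OPTIMAL, LEARNRATE, SUBSAMPLE, and INFLUENCE values are illustrative placeholders, and the exact argument syntax may differ from the shipped "tn_raw.cmd" / "tn_txt.cmd" files – use "help mart" for the definitive forms.

MART TREES = 500
MART NODES = 6
MART MINCHILD = 10
MART OPTIMAL ROC
MART BINARY
MART LEARNRATE = 0.01
MART SUBSAMPLE = 0.5
MART INFLUENCE = 0.1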
Submitting the Rest of the Command Files
Again, with the current Notepad window active, use the File – Submit Window menu to launch the basic TN modeling run, automatically followed by scoring. This will create the output, grove, and scored data files in the corresponding locations for the chosen TN model; also note the use of the EXCLUDE command in place of the KEEP command inside the command file – this saves a lot of typing.
Now go back to the Classic Output window and notice that the File menu has changed. Go to the File – Submit Command File… menu, select the "cart_txt.cmd" command file, and press the [Open] button.
Notice the modeling activity in the Classic Output window, but no Results window is produced – this is how the Submit Command File… menu item differs from the Submit Window menu item used previously; nonetheless, the output, grove, and score files are still created in the specified locations.
Use the File – Open – Open Grove… menu to open the "tn_raw.grv" file located in the "stmtutor\STM\models" folder; you will need to navigate into this folder using the Look in: selection box in the Open Grove File window.
You may now proceed with the final TN run by submitting the "tn_txt.cmd" command file using either the File – Open – Command File… / File – Submit Window or the File – Submit Command File… menu route – don't forget that it takes a long time to run!
Salford Systems © Copyright 2011
92
Final Remarks
This completes the Salford Systems Data Mining and Text Mining tutorial. In the process of going through the tutorial you have learned how to use both the GUI and command line facilities of SPM, as well as the command line text mining facility STM. You built two CART models and two TN models, and enriched the original dataset with a variety of text mining fields. The final model puts you among the top winners in a major text mining competition – a proud achievement.
Even though we have barely scratched the surface, you are now ready to proceed with exploring the remainder of the vast data mining capabilities offered within SPM and STM on your own. We wish you the best of luck on the exciting and never-ending road of modern data analysis and exploration. And don't forget that you can always reach us at www.salford-systems.com should you have further modeling questions and needs.
Salford Systems © Copyright 2011
93
References
Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and Regression Trees. Pacific Grove: Wadsworth.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123-140.
Hastie, T., Tibshirani, R., and Friedman, J.H. (2000). The Elements of Statistical Learning. Springer.
Freund, Y. and Schapire, R.E. (1996). Experiments with a new boosting algorithm. In L. Saitta, ed., Machine Learning: Proceedings of the Thirteenth International Conference, Morgan Kaufmann, pp. 148-156.
Friedman, J.H. (1999). Stochastic gradient boosting. Stanford: Statistics Department, Stanford University.
Friedman, J.H. (1999). Greedy function approximation: a gradient boosting machine. Stanford: Statistics Department, Stanford University.
Weiss, S.M., Indurkhya, N., Zhang, T., and Damerau, F.J. (2004). Text Mining: Predictive Methods for Analyzing Unstructured Information. Springer.
Salford Systems © Copyright 2011
94
STM Command Reference
Salford Text Miner is a simple utility intended to make the text mining process much easier. To this end, the application described in this manual accepts a number of parameters and can run Salford Predictive Miner as the data mining back end.
STM Workflow:
1. Automatically generate a dictionary based on the dataset
2. Process the dataset and generate a new one with additional columns based on the dictionary
3. Generate a model folder with the dataset, command file, and dictionary
4. Run Salford Predictive Miner with the generated command file
5. Run the checking process, comparing the scoring results with the real classes
All of these steps can be done in separate STM calls or in one call; an illustrative command sequence is sketched below.
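As a purely illustrative sketch of the separate-call route, the sequence might look roughly like this; the executable name "stm", the file names, and the variable list are placeholders, while the option spellings follow the reference tables on the next slides:

stm --extract --dataset train.csv --text-variables LISTING_TITLE,LISTING_SUBTITLE
stm --process train_txt.csv --dataset train.csv --dictionary dictionary.dat
stm --generate --model --dataset train_txt.csv --dictionary dictionary.dat
stm --check realclasses.csv --scoreresult score.csv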
Salford Systems © Copyright 2011
95
STM Command Reference
-data DATAFILE, --dataset DATAFILE – specify the dataset to work with
-dict DICTFILE, --dictionary DICTFILE – specify the dictionary to work with
-source-dict SDFILE, --source-dictionary SDFILE – dictionary used as the source for the automatic dictionary retrieval process
-score SFILE, --scoreresult SFILE – file with the score result, used by the checking process; default – 'score.csv'
-spm SPMAPP, --spmapplication SPMAPP – path to the SPM application; default – 'spm.exe'
-t TARGET, --target TARGET – target variable used when generating the command file; default – 'GMS_GREATER_AVG'
-ex EXCLUDE, --exclude EXCLUDE – list of variables to exclude from the keep list when generating the command file
-cat CATEGORY, --category CATEGORY – list of variables to treat as categorical when generating the command file
Salford Systems © Copyright 2011
96
STM Command Reference
-templ CMDTEMPL, --cmdtemplate CMDTEMPL – template of the command file used for generation; default – 'data/template.cmd'
-md MODEL_DIR, --modeldir MODEL_DIR – directory where the model folders will be created; default – 'models'
-trees TREES, --trees TREES – for TreeNet command files, the number of trees to build; default – 500
-maxnodes MAXNODES, --maxnodes MAXNODES – for TreeNet command files, the number of nodes per tree; default – 6
-fixwords, --fixwords – enables heuristics that try to fix words (finding the nearest word by various metrics, spell checking, etc.)
-textvars VARLIST, --text-variables VARLIST – comma-separated list of variables used in the dictionary retrieval process
Salford Systems © Copyright 2011
97
STM Command Reference
-outrmwords, --output-removedwords – output removed stop words to the file 'data/removed.dat'
-code CODE, --column-coding CODE – how to code the absence/presence of a word in a row; default – YN:
    YN or 0 – no/yes
    YNM or 1 – no/yes/many
    01 or 2 – 0/1
    012 or 3 – 0/1/2
    TF or 4 – term frequency
    IDF or 5 – inverse document frequency
    TF-IDF or 6 – TF-IDF
    TC or 7 – term count (0, 1, 2, …)
-mp MODELPATH, --model-path MODELPATH – path where the model files will be created
-cmd-path CMDPATH, --command-file-path CMDPATH – path to the command file that will be executed by Salford Predictive Miner
-ppfile PPFILE, --preprocess-file PPFILE – path to Python code that will be executed during the process step to manipulate the data
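To make the codings concrete, here is a small hedged illustration; STM's exact normalization, tie-handling, and logarithm conventions are not documented on these slides, so the numbers below assume the natural reading of each scheme and textbook TF/IDF definitions. Suppose the word "ipod" appears twice in a 10-word listing title and occurs in 100 of 8,000 documents: YN codes the column as "yes", 01 as 1, 012 as 2, TC as 2, TF as 2/10 = 0.2, IDF as log(8000/100) ≈ 4.38 (natural log), and TF-IDF as 0.2 × 4.38 ≈ 0.88.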
Salford Systems © Copyright 2011
98
STM Command Reference
-rc NAME, --realclasscolumn-name NAME – column name in the real-class dataset used by the check step; default – GMS_GREATER_AVG
-e, --extract – run the first step: automatic extraction of the dictionary from the dataset; requires --dataset
-p OUTFILE, --process OUTFILE – run the second step: process the dataset and create a new dataset named OUTFILE in which new columns are created based on the dictionary; requires --dataset and --dictionary
-g, --generate – run the third step: generate the model folder with the command file; requires --dataset and --dictionary
-m, --model – run the fourth step: run Salford Predictive Miner with the generated command file; works only with --generate
-c DATASET, --check DATASET – run the fifth step: compare the score file with the real classes (from the specified real-class file) and output a misclassification table; requires --scoreresult
-h, --help – show help
Salford Systems © Copyright 2011
99
STM Configuration File
SPM_APPLICATION – path to Salford Predictive Miner; default – spm.exe
CMD_TREES – number of trees to build in TN models; default – 500
CMD_NODES – tree size for TN models; default – 6
CMD_TEMPLATE – command file template; default – data/template.cmd
MODELS_DIR – directory where the model folders will be created; default – models
LANGUAGES – languages whose stop words will be used; default – English, German
SPELLCHECKER_DICT – additional spell checker dictionary with words that are allowed (like "ipod"); default – data/spellchecker_dict.dat
SPELLCHECKER_LANGUAGE – language for the spell checker; default – de_DE
ADDITIONAL_STOPWORDS – file with additional stop words, which the user can edit; default – data/stopwords.dat
REMOVED_WORDS_FILE – file where removed words will be written during the "extract" step; default – data/removed.dat
WORD_FREQUENCY_THRESHOLD – lower word-frequency threshold; words below it are deleted during the "extract" step; default – 5
PREPROCESS_FILE – script included to do additional processing; default – dmc2006/preprocess.py
Salford Systems © Copyright 2011
100
STM Configuration File
CHECK_RESULTS_FILE – default – data/score_results.csv
LOGFILE – path to the log file; can be a mask (%s for the date); default – log/stm%s.log
TARGET – default value for the target argument, used to fill the command file template; default – GMS_GREATER_AVG
EXCLUDE – default value for the exclude argument (variables dropped from the keep list), used to fill the command file template; default – AUCT_ID, LISTING_TITLE$, LISTING_SUBTITLE$, GMS, GMS_GREATER_AVG
CATEGORY – default value for the category argument, used to fill the command file template; default – GMS_GREATER_AVG
SCORE_FILE – name of the score file that needs to be checked; default – Score.csv
TEXT_VARIABLES – comma-separated list of text variables in the dataset; default – ITEM_LEAF_CATEGORY_NAME, LISTING_TITLE, LISTING_SUBTITLE
DEFAULT_CODING – default coding for the extract and preprocess steps; default – YN
REALCLASS_COLUMN_NAME – name of the column in the real-class file used in the check step; default – GMS_GREATER_AVG
SCORE_COLUMN_NAME – name of the column in the score file used in the check step; default – PREDICTION
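As a hedged illustration only – the actual configuration file name, its key/value syntax, and its comment conventions are not documented on these slides and may differ – an STM configuration built from the defaults above might look something like:

SPM_APPLICATION = spm.exe
CMD_TREES = 500
CMD_NODES = 6
CMD_TEMPLATE = data/template.cmd
MODELS_DIR = models
LANGUAGES = English, German
TARGET = GMS_GREATER_AVG
TEXT_VARIABLES = ITEM_LEAF_CATEGORY_NAME, LISTING_TITLE, LISTING_SUBTITLE
DEFAULT_CODING = YN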
Salford Systems © Copyright 2011
101