D. Delen. KPM Symposium â Tulsa, OK â October 3-4, 2007. 3. Barcodes ... 3. Cross-Industry Standard. Process for Data Mining. CRISP DM ..... Spider-Man 3. 9.
Knowledge Discovery with Data and Text Mining
Dursun Delen, Ph.D. Associate Professor of MSIS William S. Spears School of Business Oklahoma State University - Tulsa
Introduction
2
Why data mining?
What is data mining?
Data Mining Process
Data Mining Techniques
Text Mining
Data & Text Mining Examples
A Demonstration of Data Mining in Action
© D. Delen
KPM Symposium – Tulsa, OK – October 3-4, 2007
Data Deluge Why Now? Barcodes RFIDs Magnetic Strips Sensors…
3
© D. Delen
KPM Symposium – Tulsa, OK – October 3-4, 2007
1
Motivation: “Necessity is the Mother of Invention”
Competitive pressure to make better decisions
Data explosion problem (or opportunity)
Why do we have more data now, then we had before?
Technology driven reasons: Automated data collection t l and tools d techniques… t h i Software driven reasons: Mature database technology… Cost driven reasons (both hardware and software)
We are drowning in data, but starving for knowledge!
Solution: Data and data mining 4
© D. Delen
KPM Symposium – Tulsa, OK – October 3-4, 2007
What Is Data Mining?
Data mining :
Alternative names…
5
Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information (or patterns) from data
© D. Delen
Data mining: a misnomer? Knowledge discovery, knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc. KPM Symposium – Tulsa, OK – October 3-4, 2007
Standardized DM Processes
CRISP-DM
Cross-Industry Standard Process for Data Mining
www.crisp-dm.org
SEMMA
9
Sample the data Explore the data Modify the data Model Assess
http://www.sas.com/technologies/analytics/datamining/miner/semma.html
9 9 9 9
6
© D. Delen
KPM Symposium – Tulsa, OK – October 3-4, 2007
2
Standardized Data Mining Processes
Step 1: Business Understanding Cross-Industry Standard forthe Data Mining Process Determine business objectives CRISP DM CRISP-DM Assess the situation Established in 1996 Determine the data mining By Daimler-Benz, goals Integral Solutions Ltd. (ISL), NCR, Produce a project plan and OHRA
Cross-Industry Standard Process for Data Mining CRISP-DM 7
© D. Delen
KPM Symposium – Tulsa, OK – October 3-4, 2007
Standardized Data Mining Processes
Step 2: Data Understanding Collect the initial data Describe the data Explore the data Verify the data
Cross-Industry Standard Process for Data Mining CRISP-DM 8
© D. Delen
KPM Symposium – Tulsa, OK – October 3-4, 2007
Standardized Data Mining Processes
Step 3: Data Preparation Select data Clean data Construct data Integrate data Format data
Cross-Industry Standard Process for Data Mining CRISP-DM 9
© D. Delen
KPM Symposium – Tulsa, OK – October 3-4, 2007
3
Standardized Data Mining Processes
Step 4: Modeling Select the modeling technique Generate test design Build the model Assess the model
Cross-Industry Standard Process for Data Mining CRISP-DM 10
© D. Delen
KPM Symposium – Tulsa, OK – October 3-4, 2007
Standardized Data Mining Processes
Step 5: Evaluation Evaluate results Review process Determine next step
Cross-Industry Standard Process for Data Mining CRISP-DM 11
© D. Delen
KPM Symposium – Tulsa, OK – October 3-4, 2007
Standardized Data Mining Processes
Step 6: Deployment Plan deployment Plan monitoring and maintenance Produce final report Review the project
Cross-Industry Standard Process for Data Mining CRISP-DM 12
© D. Delen
KPM Symposium – Tulsa, OK – October 3-4, 2007
4
Data Mining Techniques (1)
Association (correlation and causality)
age(X, “20..29”) ^ income(X, “20..29K”) Æ buys(X, “PC”) [support = 2%, confidence = 60%] contains(T, “computer”) Æ contains(x, “software”) [[support pp = 1%,, confidence = 75%]]
Prediction (Classification & Regression)
13
© D. Delen
Classification: developing models (functions) that describe and distinguish classes or concepts for future prediction Predicting whether it will rain tomorrow Classifying loan applicant as “good” or “bad”
KPM Symposium – Tulsa, OK – October 3-4, 2007
Data Mining Techniques (2)
Prediction (Classification & Regression)
Cluster analysis
14
Regression: developing models (functions) that forecast the value of a continuous numerical variable Predicting what the temperature will be tomorrow Forecasting the future value of a stock a year from now
© D. Delen
Class label is unknown: Group the samples of objects based on their quantifiable characteristics Cluster the customers who share the same characteristics: interest, income level, spending habits, etc. Clustering is done by maximizing the intra-class similarity and minimizing the interclass similarity KPM Symposium – Tulsa, OK – October 3-4, 2007
Text Mining
Text mining is a process that employs
Statistical Natural Language Processing g
Data Mining
15
a set of algorithms for converting unstructured text into structured data objects, and the quantitative methods that analyze these data objects to discover knowledge
Text Mining = Statistical NLP + DM
© D. Delen
KPM Symposium – Tulsa, OK – October 3-4, 2007
5
Text Mining Process Map Unstructured Text Document Structuring - Extract - Transform - Load
- Documents (.doc, .pdf, .ps, etc.) - Fragments (sentences, paragraphs, speech encodings, etc.) - Web pages (textual sections, XML files, etc.) - Etc.
Problems / Opportunities - Environmental scanning
Term Generation - Natural Language Processing - Term Filtering: use of domain knowledge (stemming, stop word list, synonym list, acronym list, etc.)
Structured Text Feedback
- Tabular representation - Collection of XML files - Etc.
Term by Document Matrix T_1 T_2
Feedback
Doc 1
1
Doc 2
1
...
... Doc M
16
© D. Delen
T_N
2 3 1
Information / Knowledge - Clustering - Classification - Association - Link analysis
1
KPM Symposium – Tulsa, OK – October 3-4, 2007
Data Mining Applications… Military Health System (1/3)
KBSI’s Phase II SBIR research project
Funded through SBIR program by the Offices of the Secretary of Defense SBIR: Phase I ⇒ Phase II ⇒ Phase III
DM in Healthcare
Managerial Clinical
Reference
17
Delen, D. and S. Ramachandran (2003). A Hybrid Approach to Knowledge
Discovery from Military Health Systems. Journal of Neural, Parallel and Scientific Computing, Volume 11, Number 1&2, pp. 161-183.
© D. Delen
KPM Symposium – Tulsa, OK – October 3-4, 2007
Data Mining Applications… Military Health System (2/3)
One of the largest health systems in the US
Purpose/mission is to
18
90+ hospitals 100s of outpatient clinics, treatment facilities 180,000 employees (doctors, nurses, other staff) 8 million beneficiaries $20 Billion/year Billi / budget b d t Provide healthcare to eligible veterans Provide education and training opportunities to health profession (residency, practical training, etc.) Conduct medical research, create innovation Provide public health service at the time of natural disasters
Problem: ⇑ demand, ⇓ budget
© D. Delen
KPM Symposium – Tulsa, OK – October 3-4, 2007
6
Data Mining Applications… Military Health System (3/3) Classification/prediction
Demand forecasting Resource allocation
Load Forecast DMIS Region
FACILITY LEVEL Load Forecast DMIS Facility 1
Load Forecast DMIS Facility 2
GRANULAR
REGIONAL LEVEL
AGGREGATE
SERVICE LEVEL
Load Forecast DMIS Facility 1 Diagnosis Type C, D
Association rules
19
© D. Delen
Patient diagnosis
Load Forecast DMIS Facility 1 Diagnosis Type A, B
Load Forecast DMIS Facility 2 Diagnosis Type X, Y
Load Forecast DMIS Facility 2 Diagnosis Type P, Q, R
Primary/Secondary Diagnosis Patient ID
D1
D2
D3
...
10001 10002 10003 . .
0 0 1 . .
0 1 1 . .
1 0 1 . .
… … … . .
Association Rules Disease 724, 653, 366 => 343, 453 Disease 741. 606, 42.01 => 42.05, 35.0, 23.0 Disease 724, 653, 366 => 343, 453 Disease 724, 653, 366 => 343, 453 ...
Support 60%, Conf. 40%, Lift 79% Support 30%, Conf. 50%, Lift 49% Support 50%, Conf. 50%, Lift 45% Support 30%, Conf. 40%, Lift 49% ...
KPM Symposium – Tulsa, OK – October 3-4, 2007
Data Mining Applications… Predicting Breast Cancer Survivability
Objective: Predicting breast cancer survivability Methods/Materials:
Used artificial neural networks (MLP), decision trees (C5) and logistic regression Used 10-fold cross validation and more than 200,000 cases
Results:
Decision tree (C5) is the best predictor with 93.6% accuracy, artificial neural networks (MLP) came out to be the second with 91.2% accuracy, and the logistic regression models came out to be the worst of the three with 89.2% accuracy
Reference:
Delen, D., G. Walker and A. Kadam (2004). Predicting Breast Cancer Survivability: A Comparison of Three Data Mining Methods, to appear in Artificial Intelligence in Medicine.
20
© D. Delen
KPM Symposium – Tulsa, OK – October 3-4, 2007
A Brief List of Text Mining Applications
Exploring trends in research literature
Using text mining to detect deception
21
We did this for IS literature
Identifying functional relationships between different genes using large collections of interdisciplinary research literature A l i records Analyzing d for f customer t contact t t centers t
Both audio as well as text (SAS does this) One of our Ph.D. student working on this
“Reading”, summarizing classifying emails Text mining the Web content Anti-terrorism initiatives (CIA is using text mining)
© D. Delen
KPM Symposium – Tulsa, OK – October 3-4, 2007
7
Forecasting Box Office Success of Hollywood Movies
Ramesh Sharda and Dursun Delen Institute for Research in Information Systems Oklahoma State University
Forecasting BoxBox-Office Receipts: A Tough Problem! “… No one can tell you how a movie is going to do in the marketplace… not until the film opens in darkened theatre and sparks fly up between the screen and the audience” Mr. Jack Valenti
President and CEO of the Motion Picture Association of America
23
© D. Delen
KPM Symposium – Tulsa, OK – October 3-4, 2007
Current work on prediction…
A lot of research have been done
Our approach
24
Behavioral models Analytical models ' Predict after the initial release
© D. Delen
Use a data mining approach Use as much historical data as possible Make it web-enabled & Predict before the initial release KPM Symposium – Tulsa, OK – October 3-4, 2007
8
Our Approach – Movie Forecast Guru 9
DATA – 849 Movies released between 1998-2006
9
Movie Decision Parameters:
9
Intensity of competition rating MPAA Rating Star power Genre Technical Effects Sequel ? Estimated screens at opening …
Output: Box office gross receipts (flop → blockbuster) Class No. Range (in Millions)
25
© D. Delen
1 1 < 10
3 > 10 < 20
4 > 20 < 40
5 > 40 < 65
6 > 65 < 100
7 > 100 < 150
8 > 150 < 200
9 > 200 (Blockbuster)
KPM Symposium – Tulsa, OK – October 3-4, 2007
Our Approach – Movie Forecast Guru PREDICTION MODELS 9
Statistical Models:
9
Discriminant Analysis Ordinal Multiple Logistic Regression
Machine Learning Models:
Artificial Neural Networks Decision Tree Induction
26
© D. Delen
CART - Classification & Regression Trees C5 - Decision Tree
Support Vector Machines Rough Sets …
KPM Symposium – Tulsa, OK – October 3-4, 2007
Prediction Results… Models include plots via texttext-mining Text Mining Models - Tested on 2006 Data
100.00% 90.00% 80 00% 80.00% 70.00% 60.00% 50.00% 40.00% 30.00% 20.00% 10.00% 0.00%
27
No Category
War
Life
Organization
Character
Bingo
49.86%
53.03%
52.16%
51.59%
53.60%
1-Away
88.18%
87.32%
88.18%
89.63%
89.63%
© D. Delen
KPM Symposium – Tulsa, OK – October 3-4, 2007
9
Sample Predictions… Movie
Actual
Spider-Man 3
9
9
9
9
9
9
300
9
5
7
5
5
5
War
Organization
Life
Character
The Simpsons Movie
8
9
9
9
9
9
The Bourne Ultimatum
8
6
7
9
9
7
Knocked Up
7
5
5
5
5
7
Live Free or Die Hard
7
9
7
9
9
9
Evan Almighty
6
6
7
7
7
7
1408
6
5
5
4
5
5
TMNT
28
No Category
5
4
3
4
5
5
Music and Lyrics
5
4
5
5
5
5
Freedom Writers
4
4
3
4
4
5
Reign Over Me
3
4
4
4
4
4
Black Snake Moan
2
3
3
3
3
3
Ta Ra Rum Pum
1
1
1
1
1
1
© D. Delen
KPM Symposium – Tulsa, OK – October 3-4, 2007
Web--based DSS Web Prediction Models Remote Models
Movie Forecast Guru (MFG)
Local Models
GUI (Internet Browser) HTML TCP/IP
User (Manager)
Remote Data Sources
Web Services XML / SOAP
MFG Engine (Web Server)
ETL
ODBC & ETL
MFG Database XML
Knowledge Base (Business Rules)
29
© D. Delen
KPM Symposium – Tulsa, OK – October 3-4, 2007
Demonstration of MGF
30
© D. Delen
KPM Symposium – Tulsa, OK – October 3-4, 2007
10