Knowledge Discovery with Data and Text Mining

4 downloads 67 Views 264KB Size Report
D. Delen. KPM Symposium – Tulsa, OK – October 3-4, 2007. 3. Barcodes ... 3. Cross-Industry Standard. Process for Data Mining. CRISP DM ..... Spider-Man 3. 9.
Knowledge Discovery with Data and Text Mining

Dursun Delen, Ph.D. Associate Professor of MSIS William S. Spears School of Business Oklahoma State University - Tulsa

Introduction

2

„

Why data mining?

„

What is data mining?

„

Data Mining Process

„

Data Mining Techniques

„

Text Mining

„

Data & Text Mining Examples

„

A Demonstration of Data Mining in Action

© D. Delen

KPM Symposium – Tulsa, OK – October 3-4, 2007

Data Deluge Why Now? Barcodes RFIDs Magnetic Strips Sensors…

3

© D. Delen

KPM Symposium – Tulsa, OK – October 3-4, 2007

1

Motivation: “Necessity is the Mother of Invention” „

Competitive pressure to make better decisions

„

Data explosion problem (or opportunity) „

Why do we have more data now, then we had before? „

„ „

Technology driven reasons: Automated data collection t l and tools d techniques… t h i Software driven reasons: Mature database technology… Cost driven reasons (both hardware and software)

We are drowning in data, but starving for knowledge!

Solution: Data and data mining 4

© D. Delen

KPM Symposium – Tulsa, OK – October 3-4, 2007

What Is Data Mining? „

Data mining : „

„

Alternative names… „ „

5

Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information (or patterns) from data

© D. Delen

Data mining: a misnomer? Knowledge discovery, knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc. KPM Symposium – Tulsa, OK – October 3-4, 2007

Standardized DM Processes „

„

CRISP-DM „

Cross-Industry Standard Process for Data Mining

„

www.crisp-dm.org

SEMMA

9

Sample the data Explore the data Modify the data Model Assess

„

http://www.sas.com/technologies/analytics/datamining/miner/semma.html

9 9 9 9

6

© D. Delen

KPM Symposium – Tulsa, OK – October 3-4, 2007

2

Standardized Data Mining Processes

Step 1: Business Understanding Cross-Industry Standard forthe Data Mining „ Process Determine business objectives CRISP DM CRISP-DM „ Assess the situation „ Established in 1996 „ Determine the data mining „ By Daimler-Benz, goals Integral Solutions Ltd. (ISL), „ NCR, Produce a project plan and OHRA

Cross-Industry Standard Process for Data Mining CRISP-DM 7

© D. Delen

KPM Symposium – Tulsa, OK – October 3-4, 2007

Standardized Data Mining Processes

Step 2: Data Understanding „ Collect the initial data „ Describe the data „ Explore the data „ Verify the data

Cross-Industry Standard Process for Data Mining CRISP-DM 8

© D. Delen

KPM Symposium – Tulsa, OK – October 3-4, 2007

Standardized Data Mining Processes

Step 3: Data Preparation „ Select data „ Clean data „ Construct data „ Integrate data „ Format data

Cross-Industry Standard Process for Data Mining CRISP-DM 9

© D. Delen

KPM Symposium – Tulsa, OK – October 3-4, 2007

3

Standardized Data Mining Processes

Step 4: Modeling „ Select the modeling technique „ Generate test design „ Build the model „ Assess the model

Cross-Industry Standard Process for Data Mining CRISP-DM 10

© D. Delen

KPM Symposium – Tulsa, OK – October 3-4, 2007

Standardized Data Mining Processes

Step 5: Evaluation „ Evaluate results „ Review process „ Determine next step

Cross-Industry Standard Process for Data Mining CRISP-DM 11

© D. Delen

KPM Symposium – Tulsa, OK – October 3-4, 2007

Standardized Data Mining Processes

Step 6: Deployment „ Plan deployment „ Plan monitoring and maintenance „ Produce final report „ Review the project

Cross-Industry Standard Process for Data Mining CRISP-DM 12

© D. Delen

KPM Symposium – Tulsa, OK – October 3-4, 2007

4

Data Mining Techniques (1) „

Association (correlation and causality) „

„

age(X, “20..29”) ^ income(X, “20..29K”) Æ buys(X, “PC”) [support = 2%, confidence = 60%] contains(T, “computer”) Æ contains(x, “software”) [[support pp = 1%,, confidence = 75%]]

„

Prediction (Classification & Regression) „

„ „

13

© D. Delen

Classification: developing models (functions) that describe and distinguish classes or concepts for future prediction Predicting whether it will rain tomorrow Classifying loan applicant as “good” or “bad”

KPM Symposium – Tulsa, OK – October 3-4, 2007

Data Mining Techniques (2) „

Prediction (Classification & Regression) „

„ „

„

Cluster analysis „

„

„

14

Regression: developing models (functions) that forecast the value of a continuous numerical variable Predicting what the temperature will be tomorrow Forecasting the future value of a stock a year from now

© D. Delen

Class label is unknown: Group the samples of objects based on their quantifiable characteristics Cluster the customers who share the same characteristics: interest, income level, spending habits, etc. Clustering is done by maximizing the intra-class similarity and minimizing the interclass similarity KPM Symposium – Tulsa, OK – October 3-4, 2007

Text Mining „

Text mining is a process that employs

Statistical Natural Language Processing g

„

Data Mining

„

„

15

a set of algorithms for converting unstructured text into structured data objects, and the quantitative methods that analyze these data objects to discover knowledge

Text Mining = Statistical NLP + DM

© D. Delen

KPM Symposium – Tulsa, OK – October 3-4, 2007

5

Text Mining Process Map Unstructured Text Document Structuring - Extract - Transform - Load

- Documents (.doc, .pdf, .ps, etc.) - Fragments (sentences, paragraphs, speech encodings, etc.) - Web pages (textual sections, XML files, etc.) - Etc.

Problems / Opportunities - Environmental scanning

Term Generation - Natural Language Processing - Term Filtering: use of domain knowledge (stemming, stop word list, synonym list, acronym list, etc.)

Structured Text Feedback

- Tabular representation - Collection of XML files - Etc.

Term by Document Matrix T_1 T_2

Feedback

Doc 1

1

Doc 2

1

...

... Doc M

16

© D. Delen

T_N

2 3 1

Information / Knowledge - Clustering - Classification - Association - Link analysis

1

KPM Symposium – Tulsa, OK – October 3-4, 2007

Data Mining Applications… Military Health System (1/3) „

KBSI’s Phase II SBIR research project „

„

„

Funded through SBIR program by the Offices of the Secretary of Defense SBIR: Phase I ⇒ Phase II ⇒ Phase III

DM in Healthcare „ „

Managerial Clinical

Reference „

17

Delen, D. and S. Ramachandran (2003). A Hybrid Approach to Knowledge

Discovery from Military Health Systems. Journal of Neural, Parallel and Scientific Computing, Volume 11, Number 1&2, pp. 161-183.

© D. Delen

KPM Symposium – Tulsa, OK – October 3-4, 2007

Data Mining Applications… Military Health System (2/3) „

One of the largest health systems in the US „ „ „ „ „

„

Purpose/mission is to „ „

„ „

„ 18

90+ hospitals 100s of outpatient clinics, treatment facilities 180,000 employees (doctors, nurses, other staff) 8 million beneficiaries $20 Billion/year Billi / budget b d t Provide healthcare to eligible veterans Provide education and training opportunities to health profession (residency, practical training, etc.) Conduct medical research, create innovation Provide public health service at the time of natural disasters

Problem: ⇑ demand, ⇓ budget

© D. Delen

KPM Symposium – Tulsa, OK – October 3-4, 2007

6

Data Mining Applications… Military Health System (3/3) Classification/prediction „

Demand forecasting Resource allocation

Load Forecast DMIS Region

FACILITY LEVEL Load Forecast DMIS Facility 1

Load Forecast DMIS Facility 2

GRANULAR

„

REGIONAL LEVEL

AGGREGATE

„

SERVICE LEVEL

Load Forecast DMIS Facility 1 Diagnosis Type C, D

„

Association rules „

19

© D. Delen

Patient diagnosis

Load Forecast DMIS Facility 1 Diagnosis Type A, B

Load Forecast DMIS Facility 2 Diagnosis Type X, Y

Load Forecast DMIS Facility 2 Diagnosis Type P, Q, R

Primary/Secondary Diagnosis Patient ID

D1

D2

D3

...

10001 10002 10003 . .

0 0 1 . .

0 1 1 . .

1 0 1 . .

… … … . .

Association Rules Disease 724, 653, 366 => 343, 453 Disease 741. 606, 42.01 => 42.05, 35.0, 23.0 Disease 724, 653, 366 => 343, 453 Disease 724, 653, 366 => 343, 453 ...

Support 60%, Conf. 40%, Lift 79% Support 30%, Conf. 50%, Lift 49% Support 50%, Conf. 50%, Lift 45% Support 30%, Conf. 40%, Lift 49% ...

KPM Symposium – Tulsa, OK – October 3-4, 2007

Data Mining Applications… Predicting Breast Cancer Survivability „ „

Objective: Predicting breast cancer survivability Methods/Materials: „

„

„

Used artificial neural networks (MLP), decision trees (C5) and logistic regression Used 10-fold cross validation and more than 200,000 cases

Results: „

Decision tree (C5) is the best predictor with 93.6% accuracy, artificial neural networks (MLP) came out to be the second with 91.2% accuracy, and the logistic regression models came out to be the worst of the three with 89.2% accuracy

Reference:

Delen, D., G. Walker and A. Kadam (2004). Predicting Breast Cancer Survivability: A Comparison of Three Data Mining Methods, to appear in Artificial Intelligence in Medicine.

20

© D. Delen

KPM Symposium – Tulsa, OK – October 3-4, 2007

A Brief List of Text Mining Applications „

Exploring trends in research literature „

„

„

Using text mining to detect deception

„

„

„ „ „ 21

We did this for IS literature

Identifying functional relationships between different genes using large collections of interdisciplinary research literature A l i records Analyzing d for f customer t contact t t centers t

„

Both audio as well as text (SAS does this) One of our Ph.D. student working on this

“Reading”, summarizing classifying emails Text mining the Web content Anti-terrorism initiatives (CIA is using text mining)

© D. Delen

KPM Symposium – Tulsa, OK – October 3-4, 2007

7

Forecasting Box Office Success of Hollywood Movies

Ramesh Sharda and Dursun Delen Institute for Research in Information Systems Oklahoma State University

Forecasting BoxBox-Office Receipts: A Tough Problem! “… No one can tell you how a movie is going to do in the marketplace… not until the film opens in darkened theatre and sparks fly up between the screen and the audience” Mr. Jack Valenti

President and CEO of the Motion Picture Association of America

23

© D. Delen

KPM Symposium – Tulsa, OK – October 3-4, 2007

Current work on prediction… „

A lot of research have been done „ „

„

Our approach „ „ „

24

Behavioral models Analytical models ' Predict after the initial release

© D. Delen

Use a data mining approach Use as much historical data as possible Make it web-enabled & Predict before the initial release KPM Symposium – Tulsa, OK – October 3-4, 2007

8

Our Approach – Movie Forecast Guru 9

DATA – 849 Movies released between 1998-2006

9

Movie Decision Parameters: ‰ ‰ ‰ ‰ ‰ ‰ ‰ ‰

9

Intensity of competition rating MPAA Rating Star power Genre Technical Effects Sequel ? Estimated screens at opening …

Output: Box office gross receipts (flop → blockbuster) Class No. Range (in Millions)

25

© D. Delen

1 1 < 10

3 > 10 < 20

4 > 20 < 40

5 > 40 < 65

6 > 65 < 100

7 > 100 < 150

8 > 150 < 200

9 > 200 (Blockbuster)

KPM Symposium – Tulsa, OK – October 3-4, 2007

Our Approach – Movie Forecast Guru PREDICTION MODELS 9

Statistical Models: ‰ ‰

9

Discriminant Analysis Ordinal Multiple Logistic Regression

Machine Learning Models: ‰ ‰

Artificial Neural Networks Decision Tree Induction ‰ ‰

‰ ‰ ‰

26

© D. Delen

CART - Classification & Regression Trees C5 - Decision Tree

Support Vector Machines Rough Sets …

KPM Symposium – Tulsa, OK – October 3-4, 2007

Prediction Results… Models include plots via texttext-mining Text Mining Models - Tested on 2006 Data

100.00% 90.00% 80 00% 80.00% 70.00% 60.00% 50.00% 40.00% 30.00% 20.00% 10.00% 0.00%

27

No Category

War

Life

Organization

Character

Bingo

49.86%

53.03%

52.16%

51.59%

53.60%

1-Away

88.18%

87.32%

88.18%

89.63%

89.63%

© D. Delen

KPM Symposium – Tulsa, OK – October 3-4, 2007

9

Sample Predictions… Movie

Actual

Spider-Man 3

9

9

9

9

9

9

300

9

5

7

5

5

5

War

Organization

Life

Character

The Simpsons Movie

8

9

9

9

9

9

The Bourne Ultimatum

8

6

7

9

9

7

Knocked Up

7

5

5

5

5

7

Live Free or Die Hard

7

9

7

9

9

9

Evan Almighty

6

6

7

7

7

7

1408

6

5

5

4

5

5

TMNT

28

No Category

5

4

3

4

5

5

Music and Lyrics

5

4

5

5

5

5

Freedom Writers

4

4

3

4

4

5

Reign Over Me

3

4

4

4

4

4

Black Snake Moan

2

3

3

3

3

3

Ta Ra Rum Pum

1

1

1

1

1

1

© D. Delen

KPM Symposium – Tulsa, OK – October 3-4, 2007

Web--based DSS Web Prediction Models Remote Models

Movie Forecast Guru (MFG)

Local Models

GUI (Internet Browser) HTML TCP/IP

User (Manager)

Remote Data Sources

Web Services XML / SOAP

MFG Engine (Web Server)

ETL

ODBC & ETL

MFG Database XML

Knowledge Base (Business Rules)

29

© D. Delen

KPM Symposium – Tulsa, OK – October 3-4, 2007

Demonstration of MGF

30

© D. Delen

KPM Symposium – Tulsa, OK – October 3-4, 2007

10