Data Mining
CS 341, Spring 2007

Prof. Xiaoyan Li
Visiting Assistant Professor of Computer Science
Mount Holyoke College

Course Information
• Instructor: Xiaoyan Li
• Lecture: Mon & Wed 2:40pm – 3:55pm
  – Room: Kendade Hall 107
• Office hours: Tu/Th 10:00am – 11:00am (or by appointment)
  – Office: Clapp 227
  – Email: [email protected]

© Prentice Hall

Course Information
• Textbook
  – Data Mining: Introductory and Advanced Topics
    » by Margaret H. Dunham, ISBN 0-13-088892-3
• Topics
  – Related Concepts & Basic Techniques
  – Core Topics
    » Classification, Clustering and Association Rules
  – Advanced Topics
    » Web Mining, Spatial Mining & Temporal Mining

Course Structure
• The course is divided into 3 parts
  – Related concepts and basic techniques
  – Core Topics
    » Classification, clustering, association rules
  – Perl programming language, final projects
• The first 2/3 of the course is lectures; the remaining 1/3 is seminars.

Tentative schedule:
• CS-341 Data Mining

Grading
• Class participation: 20%
• Four homework assignments: 20%
• One midterm: 20%
• One final project: 40%

Some slides are adapted from:

DATA MINING
Introductory and Advanced Topics
Part I

Margaret H. Dunham
Department of Computer Science and Engineering
Southern Methodist University

Companion slides for the text by Dr. M. H. Dunham, Data Mining, Introductory and Advanced Topics, Prentice Hall, 2002.

Introduction Outline

Goal: Provide an overview of data mining.
• Define data mining
• Basic data mining tasks
• Data mining vs. database & KDD
• Data mining development
• Data mining issues

Introduction
• Data is growing at a phenomenal rate
• Users expect more sophisticated information
• How?

UNCOVER HIDDEN INFORMATION
DATA MINING

Data Mining Definition
• Finding hidden information in a database
• Similar terms
  – Exploratory data analysis
  – Data driven discovery
  – Deductive learning

Example 1.1
• Credit card company must determine whether to authorize credit card purchases.
• Four classes:
  – 1) Authorize
  – 2) Ask for further identification before authorization
  – 3) Do not authorize
  – 4) Do not authorize but contact police
• How to classify a purchase?
  – Examine historical data and determine how data fit into the four classes.
  – Apply the model to new purchases.

Data Mining Algorithm
• Purpose: Fit Data to a Model
• Preference – Criteria to choose the best model
• Search – Technique to search the data
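Example 1.1 can be sketched in a few lines of Python. This is only an illustration, not the method used in the text: the two features (purchase amount and distance from home), the historical records, and the choice of a 1-nearest-neighbor rule as the "model" are all assumptions made for the sketch.

```python
import math

# Hypothetical historical purchases: (amount in dollars, km from home) -> class.
# All values are invented for illustration.
HISTORY = [
    ((25.0, 2.0), "authorize"),
    ((60.0, 5.0), "authorize"),
    ((900.0, 40.0), "ask for identification"),
    ((1500.0, 300.0), "do not authorize"),
    ((5000.0, 8000.0), "do not authorize, contact police"),
]

def classify(purchase):
    """Return the class of the closest historical purchase (1-nearest neighbor)."""
    _, label = min(HISTORY, key=lambda record: math.dist(record[0], purchase))
    return label

# Apply the "model" (here, the stored history) to new purchases.
print(classify((40.0, 3.0)))
print(classify((4800.0, 7500.0)))
```

The historical data plays the role of the fitted model, and `classify` applies it to a new purchase. A real system would scale the features and use far more data.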

Data Mining Models
• Predictive:
  – A predictive model makes a prediction about values of data using known results found from different data.
• Descriptive:
  – A descriptive model identifies patterns or relationships in data.

[Figure: Data Mining Models and Tasks]

Basic Data Mining Tasks
• Classification maps data into predefined groups or classes
  – Pattern recognition
• Example 1.1 is a general classification problem
• Example 1.2 is an example of pattern recognition
  – Airport screening is used to determine whether passengers are potential terrorists or criminals
  – Basic patterns: distance between eyes, size and shape of mouth, etc.

Basic Data Mining Tasks
• Regression is used to map a data item to a real valued prediction variable.
  – Assume some known type of function (e.g. linear) and select the best one.
• Example 1.3
  – A college professor wishes to reach a certain level of savings before her retirement.
  – She predicts what her retirement savings will be based on their current value and several past values.
  – She uses a linear regression formula to predict her retirement savings.
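The linear regression in Example 1.3 can be sketched as an ordinary least-squares line fit. The savings figures and the 13-year horizon below are made up for illustration, and `fit_line` is a hypothetical helper name, not something from the text.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Made-up data: years relative to today (0 = current value) and savings in dollars.
years = [-4, -3, -2, -1, 0]
savings = [50_000, 60_000, 70_000, 80_000, 90_000]

slope, intercept = fit_line(years, savings)
predicted = slope * 13 + intercept  # hypothetical retirement 13 years from now
print(f"Predicted savings at retirement: {predicted:,.0f}")
```

"Assume some known type of function and select the best one" corresponds here to fixing the form y = slope * x + intercept and choosing the slope and intercept that minimize squared error.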

Basic Data Mining Tasks (cont’d)
• Clustering groups similar data together into clusters. (The clusters are not predefined.)
• Example 1.6
  – A department store chain creates special catalogs targeted to various demographic groups based on attributes such as income, location, etc.

Basic Data Mining Tasks (cont’d)
• Summarization maps data into subsets with associated simple descriptions.
  – Characterization
  – Generalization
• Example 1.7
  – The average SAT score is one of the criteria used to compare universities by the U.S. News & World Report.
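As a rough illustration of clustering in the spirit of Example 1.6, the sketch below groups customers by income with a one-dimensional k-means loop. The income values, the starting centroids, and the choice of k-means itself are assumptions for the sketch; the example in the text does not say which algorithm the store uses.

```python
def k_means_1d(values, centroids, iterations=10):
    """Alternate between assigning values to the nearest centroid and re-averaging."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Made-up annual incomes in thousands of dollars.
incomes = [22, 25, 27, 60, 64, 66, 150, 155]
centroids, clusters = k_means_1d(incomes, centroids=[20, 70, 140])

for center, members in zip(centroids, clusters):
    print(f"cluster around {center:.1f}: {members}")
```

The groups are not given in advance; they emerge from the data. Each centroid is also a simple summary (the mean) of its cluster, which is the kind of description summarization produces.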

Basic Data Mining Tasks (cont’d)
• Link Analysis uncovers relationships among data.
  – Affinity analysis
  – Association rules – identify items that are frequently purchased together.
• Example 1.8
  – A grocery store retailer is trying to decide whether to put bread on sale.
  – Using association rules, he finds that 60% of the time that bread is sold, pretzels are also sold, and 70% of the time jelly is also sold.
  – Decisions?

Ex: Time Series Analysis
• Example: Stock Market
• Predict future values
• Determine similar patterns over time
• Classify behavior
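The percentages in Example 1.8 can be read as rule confidences: of the transactions containing bread, the fraction that also contain the other item. The sketch below computes them from a small transaction list that was invented so the numbers come out to 60% and 70%; `confidence` is a hypothetical helper name.

```python
def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing `antecedent` that also contain `consequent`."""
    with_antecedent = [t for t in transactions if antecedent in t]
    if not with_antecedent:
        return 0.0
    both = sum(1 for t in with_antecedent if consequent in t)
    return both / len(with_antecedent)

# Invented market-basket data: 10 of the 12 transactions contain bread.
transactions = [
    {"bread", "pretzels", "jelly"},
    {"bread", "pretzels", "jelly"},
    {"bread", "pretzels", "jelly"},
    {"bread", "pretzels", "jelly"},
    {"bread", "pretzels"},
    {"bread", "pretzels"},
    {"bread", "jelly"},
    {"bread", "jelly"},
    {"bread", "jelly"},
    {"bread"},
    {"milk"},
    {"milk", "jelly"},
]

print(f"bread -> pretzels: {confidence(transactions, 'bread', 'pretzels'):.0%}")
print(f"bread -> jelly:    {confidence(transactions, 'bread', 'jelly'):.0%}")
```

A full association rule algorithm would also check support (how often the items occur at all) before reporting a rule; this sketch shows only the confidence calculation.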

Data Mining vs. Database Processing: Query Examples
• Database
  – Find all credit applicants with last name of Smith.
  – Identify customers who have purchased more than $10,000 in the last month.
  – Find all customers who have purchased milk.
• Data Mining
  – Find all credit applicants who are poor credit risks. (classification)
  – Identify customers with similar buying habits. (clustering)
  – Find all items which are frequently purchased with milk. (association rules)

Database Processing vs. Data Mining Processing

Database Processing
• Query
  – Well defined
  – SQL
• Data
  – Operational data
• Output
  – Precise
  – Subset of database

Data Mining Processing
• Query
  – Poorly defined
  – No precise query language
• Data
  – Not operational data
• Output
  – Fuzzy
  – Not a subset of database

Data Mining vs. KDD
• Knowledge Discovery in Databases (KDD): process of finding useful information and patterns in data.
• Data Mining: Use of algorithms to extract the information and patterns derived by the KDD process.
• Another opinion:
  – There is no difference.

KDD Process
[Figure: KDD process, modified from [FPSS96C]]
• Selection: Obtain data from various sources.
• Preprocessing: Cleanse data.
• Transformation: Convert to common format. Transform to new format.
• Data Mining: Obtain desired results.
• Interpretation/Evaluation: Present results to user in meaningful manner.
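The five KDD steps can be sketched as a chain of small functions. Everything here is a toy assumption: the two "stores", their record layout, and the mining step, which is only a frequency count standing in for a real data mining algorithm.

```python
from collections import Counter

def select(*sources):
    """Selection: gather raw records from several sources."""
    return [record for source in sources for record in source]

def preprocess(records):
    """Preprocessing: drop records with a missing item or price."""
    return [r for r in records if r.get("item") and r.get("price") is not None]

def transform(records):
    """Transformation: convert to a common format (lowercase names, float prices)."""
    return [
        {"item": r["item"].strip().lower(), "price": float(r["price"])}
        for r in records
    ]

def mine(records):
    """Data mining: a stand-in step that finds the most frequently sold item."""
    return Counter(r["item"] for r in records).most_common(1)[0]

def interpret(result):
    """Interpretation/Evaluation: present the result in readable form."""
    item, count = result
    return f"Most frequently sold item: {item} ({count} sales)"

# Invented records from two hypothetical sources with inconsistent formats.
store_a = [{"item": "Bread", "price": "2.50"}, {"item": "Milk", "price": "1.20"}]
store_b = [
    {"item": " bread", "price": 2.75},
    {"item": None, "price": 3.00},
    {"item": "jelly", "price": None},
]

report = interpret(mine(transform(preprocess(select(store_a, store_b)))))
print(report)
```

Data mining is one step inside the larger process: the quality of its result depends on the selection, cleansing, and transformation that precede it.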

Data Mining Development

• Similarity Measures
• Hierarchical Clustering
• IR Systems
• Imprecise Queries
• Textual Data
• Web Search Engines

• Relational Data Model
• SQL
• Association Rule Algorithms
• Data Warehousing
• Scalability Techniques

• Bayes Theorem
• Regression Analysis
• EM Algorithm
• K-Means Clustering
• Time Series Analysis
• Algorithm Design Techniques
• Algorithm Analysis
• Data Structures

• Neural Networks
• Decision Tree Algorithms

Data Mining Metrics
• Usefulness
• Return on Investment (ROI)
• Accuracy
• Space/Time

Database Perspective on Data Mining (what is a good data mining tool?)
• Scalability
• Real World Data
• Updates
• Ease of Use

Social Issues
• Privacy?

Announcements:
• Next Lecture:
  – Database, Decision Support System & Warehousing
• Reading assignments:
  – Chapter 2