Data Mining
CS 341, Spring 2007
Prof. Xiaoyan Li
Visiting Assistant Professor of Computer Science
Mount Holyoke College

Course Information
• Instructor: Xiaoyan Li
– Lecture: Mon & Wed 2:40pm – 3:55pm
– Room: Kendade Hall 107
– Office hour: Tu/Th 10:00am – 11:00am (or by appointment)
– Office: Clapp 227
– Email: [email protected]
© Prentice Hall

Course Information
• Textbook
– Data Mining: Introductory and Advanced Topics
» by Margaret H. Dunham, ISBN 0-13-088892-3
• Topics
– Related Concepts & Basic Techniques
– Core Topics
» Classification, Clustering and Association Rules
– Advanced Topics
» Web Mining, Spatial Mining & Temporal Mining

Course Structure
• The course is divided into 3 parts
– Related concepts and basic techniques
– Core Topics
» Classification, clustering, association rules
– Perl programming language, final projects
• The first 2/3 of the course is lectures; the remaining 1/3 is seminars.

Tentative schedule:
• CS-341 Data Mining

Grading
• Class participation: 20%
• Four homework assignments: 20%
• One midterm: 20%
• One final project: 40%

Some slides are adapted from:
DATA MINING
Introductory and Advanced Topics
Part I
Margaret H. Dunham
Department of Computer Science and Engineering
Southern Methodist University
Companion slides for the text by Dr. M. H. Dunham, Data Mining: Introductory and Advanced Topics, Prentice Hall, 2002.

Introduction Outline
Goal: Provide an overview of data mining.
• Define data mining
• Basic data mining tasks
• Data mining vs. database & KDD
• Data mining development
• Data mining issues

Introduction
• Data is growing at a phenomenal rate
• Users expect more sophisticated information
• How?
UNCOVER HIDDEN INFORMATION
DATA MINING

Data Mining Definition
• Finding hidden information in a database
• Similar terms
– Exploratory data analysis
– Data driven discovery
– Deductive learning

Example 1.1
• Credit card company must determine whether to authorize credit card purchases.
• Four classes:
– 1) Authorize
– 2) Ask for further identification before authorization
– 3) Do not authorize
– 4) Do not authorize but contact police
• How to classify a purchase?
– Examine historical data and determine how data fit into the four classes.
– Apply the model to new purchase

Data Mining Algorithm
• Purpose: Fit Data to a Model
• Preference – Criteria to choose the best model
• Search – Technique to search the data

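The model / preference / search breakdown can be sketched on a two-class simplification of Example 1.1. This is only an illustration: the purchase amounts, the fraud labels, and the single-threshold model are all invented here and do not come from the textbook.

```python
# Hypothetical historical purchases: (amount in dollars, was it fraudulent?)
HISTORY = [(20, False), (35, False), (60, False), (120, False),
           (400, True), (650, True), (90, False), (900, True)]

def predict(threshold, amount):
    """Model: flag a purchase as suspicious when it exceeds the threshold."""
    return amount > threshold

def accuracy(threshold, history):
    """Preference criterion: fraction of historical purchases classified correctly."""
    correct = sum(predict(threshold, amt) == label for amt, label in history)
    return correct / len(history)

def fit(history):
    """Search: try each observed amount as a threshold and keep the best one."""
    candidates = sorted(amt for amt, _ in history)
    return max(candidates, key=lambda t: accuracy(t, history))

threshold = fit(HISTORY)
print(threshold, accuracy(threshold, HISTORY))

# Apply the fitted model to a new purchase of $75.
print("ask for identification" if predict(threshold, 75) else "authorize")
```

A real system would use many attributes and all four classes; the point is only that the fitted threshold comes from historical data and is then applied to a new purchase.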
Data Mining Models
• Predictive:
– A predictive model makes a prediction about values of data using known results found from different data.
• Descriptive:
– A descriptive model identifies patterns or relationships in data.

Data Mining Models and Tasks

Basic Data Mining Tasks
• Classification maps data into predefined groups or classes
– Pattern recognition
• Example 1.1 is a general classification problem
• Example 1.2 is an example of pattern recognition
– Airport screening is used to determine whether passengers are potential terrorists or criminals
– Basic patterns: distance between eyes, size and shape of mouth, etc.

Basic Data Mining Tasks
• Regression is used to map a data item to a real valued prediction variable.
– Assume some known type of function (e.g. linear) and select the best one.
• Example 1.3
– A college professor wishes to reach a certain level of savings before her retirement.
– She predicts what her retirement savings will be based on its current values and several past values.
– She uses a linear regression formula to predict her retirement savings.

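A minimal sketch of the regression idea in Example 1.3, assuming the "known type of function" is a straight line and "best" means least squared error. The yearly savings figures are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # intercept
    return a, b

# Hypothetical savings (in thousands of dollars) at the end of each year.
years = [2003, 2004, 2005, 2006, 2007]
savings = [100.0, 110.0, 120.0, 130.0, 140.0]

a, b = fit_line(years, savings)
predicted_2017 = a + b * 2017  # extrapolate the fitted line ten years ahead
print(round(b, 2), round(predicted_2017, 1))
```

The slope `b` is the estimated yearly growth; the prediction is only as good as the assumption that past growth continues linearly.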
Basic Data Mining Tasks (cont'd)
• Clustering groups similar data together into clusters. (The clusters are not predefined)
• Example 1.6
– A department store chain creates special catalogs targeted to various demographic groups based on attributes such as income, location, etc.

Basic Data Mining Tasks (cont'd)
• Summarization maps data into subsets with associated simple descriptions.
– Characterization
– Generalization
• Example 1.7
– The average SAT score is one of the criteria used to compare universities by the U.S. News & World Report.

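Clustering as in Example 1.6 can be sketched with a plain k-means loop over a single attribute. This is an assumption-laden toy: the slides do not name an algorithm for this example, the income values are invented, and the function `kmeans_1d` is hypothetical.

```python
def kmeans_1d(values, k, iterations=20):
    """Plain k-means on numbers; initial centers are k evenly spaced sorted values."""
    ordered = sorted(values)
    step = (len(ordered) - 1) / (k - 1) if k > 1 else 0
    centers = [ordered[int(i * step)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster with the nearest center.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster (keep it if empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical annual incomes in thousands of dollars.
incomes = [22, 25, 27, 58, 61, 64, 140, 150]
centers, clusters = kmeans_1d(incomes, k=3)
for center, members in zip(centers, clusters):
    print(round(center, 1), members)
```

Unlike classification, no class labels are supplied; the three income groups emerge from the data, and each could then receive its own catalog.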
Basic Data Mining Tasks (cont'd)
• Link Analysis uncovers relationships among data.
– Affinity analysis
– Association rules – identify items that are frequently purchased together.
• Example 1.8
– A grocery store retailer is trying to decide whether to put bread on sale.
– Using association rules, he finds that 60% of the time that bread is sold, pretzels are also sold, and 70% of the time jelly is also sold.
– Decisions?

Ex: Time Series Analysis
• Example: Stock Market
• Predict future values
• Determine similar patterns over time
• Classify behavior

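The percentages in Example 1.8 are what association-rule mining calls the confidence of a rule. A small sketch follows, with invented transactions chosen so that the counts reproduce the 60% and 70% figures; the helper name `confidence` is hypothetical.

```python
# Hypothetical market-basket data: each set is one customer transaction.
transactions = (
    [{"bread", "pretzels", "jelly"}] * 4
    + [{"bread", "pretzels"}] * 2
    + [{"bread", "jelly"}] * 3
    + [{"bread"}]
    + [{"milk"}] * 2
)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing antecedent, the fraction that also contain consequent."""
    with_antecedent = [t for t in transactions if antecedent <= t]
    with_both = [t for t in with_antecedent if consequent <= t]
    return len(with_both) / len(with_antecedent)

print(confidence({"bread"}, {"pretzels"}, transactions))  # rule: bread => pretzels
print(confidence({"bread"}, {"jelly"}, transactions))     # rule: bread => jelly
```

With numbers like these, the retailer might discount bread while keeping pretzels and jelly at full price nearby.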
Data Mining vs. Database Processing -- Query Examples
• Database
– Find all credit applicants with last name of Smith.
– Identify customers who have purchased more than $10,000 in the last month.
– Find all customers who have purchased milk.
• Data Mining
– Find all credit applicants who are poor credit risks. (classification)
– Identify customers with similar buying habits. (clustering)
– Find all items which are frequently purchased with milk. (association rules)

Database Processing vs. Data Mining Processing
Database Processing
• Query
– Well defined
– SQL
• Data
– Operational data
• Output
– Precise
– Subset of database
Data Mining Processing
• Query
– Poorly defined
– No precise query language
• Data
– Not operational data
• Output
– Fuzzy
– Not a subset of database

Data Mining vs. KDD
• Knowledge Discovery in Databases (KDD): process of finding useful information and patterns in data.
• Data Mining: use of algorithms to extract the information and patterns derived by the KDD process.
• Another opinion:
– There is no difference.

KDD Process
Modified from [FPSS96C]
• Selection: Obtain data from various sources.
• Preprocessing: Cleanse data.
• Transformation: Convert to common format. Transform to new format.
• Data Mining: Obtain desired results.
• Interpretation/Evaluation: Present results to user in meaningful manner.

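The five KDD steps can be read as a pipeline of functions, each feeding the next. The sketch below is a toy under stated assumptions: the two record sources, the field names, and the "high spender" threshold are all invented, and the mining step is a trivial stand-in for a real algorithm.

```python
# Hypothetical raw records from two sources with inconsistent formats.
SOURCE_A = [{"customer": "Ann", "spend": "120.50"}, {"customer": "Bob", "spend": ""}]
SOURCE_B = [{"customer": "cy", "spend": "80"}, {"customer": "Dee", "spend": "300.00"}]

def select(*sources):
    """Selection: gather records from the various sources."""
    return [record for source in sources for record in source]

def preprocess(records):
    """Preprocessing: cleanse the data by dropping records with a missing spend value."""
    return [r for r in records if r["spend"].strip()]

def transform(records):
    """Transformation: convert to a common format (title-case name, float spend)."""
    return [(r["customer"].title(), float(r["spend"])) for r in records]

def mine(rows, threshold=100.0):
    """Data mining stand-in: label each customer as a high or low spender."""
    return {name: ("high" if spend >= threshold else "low") for name, spend in rows}

def interpret(labels):
    """Interpretation/evaluation: present the results in a readable form."""
    return [f"{name}: {label} spender" for name, label in sorted(labels.items())]

report = interpret(mine(transform(preprocess(select(SOURCE_A, SOURCE_B)))))
print("\n".join(report))
```

Only the `mine` step is data mining in the narrow sense used on the slide; the other four functions are the surrounding KDD process.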
Data Mining Development
• Similarity Measures
• Hierarchical Clustering
• IR Systems
• Imprecise Queries
• Textual Data
• Web Search Engines

• Relational Data Model
• SQL
• Association Rule Algorithms
• Data Warehousing
• Scalability Techniques

• Bayes Theorem
• Regression Analysis
• EM Algorithm
• K-Means Clustering
• Time Series Analysis
• Algorithm Design Techniques
• Algorithm Analysis
• Data Structures

• Neural Networks
• Decision Tree Algorithms

Data Mining Metrics
• Usefulness
• Return on Investment (ROI)
• Accuracy
• Space/Time

Database Perspective on Data Mining (what is a good data mining tool?)
• Scalability
• Real World Data
• Updates
• Ease of Use

Social Issues
• Privacy?

Announcements:
• Next Lecture:
– Database, Decision Support System & Warehousing
• Reading assignments:
– Chapter 2