Introduction

Data Mining

Abraham Otero

Agenda

- The problem
- Data Mining Overview
- Data Mining Tools
- Final Remarks

The problem

Information systems produce an ever greater volume and variety of information that is stored in digital databases. This growth is exponential.

Examples: navigation patterns on an e-commerce site, retail sales, information about providers, information about employees and projects, etc.

The root of the problem is that converting information into digital format and storing it are technologies that have been fully mastered.

This information can be extremely valuable to the organization that generated it. It can help to:

- Optimize processes.
- Maximize customer satisfaction.
- Create more targeted marketing campaigns.
- Improve the navigation of the organization's Web sites.
- Detect fraudulent and suspicious transactions.
- Identify the most profitable customers.
- Etc.

But there is simply too much information to analyze. This is the cognitive overload problem: we are presented with more information than we can assimilate.

In addition, the individuals responsible for analyzing these data are not experts in statistics or machine learning.

The solution: data mining.

Data Mining Overview

In fact, data mining is just one of the steps in the knowledge discovery process.

[Figure: the knowledge discovery process. Collection and integration (data warehouse), then data cleaning, selection and transformation, then data mining (patterns), then interpretation and evaluation (knowledge), then exploitation.]

Phase 1: Collection and integration of the data.

Quite often, we will have to integrate several databases, some of which may be external to the organization. The ultimate goal is to create a data warehouse, a "database" that is specially designed for data mining tasks.
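
As a minimal sketch of the integration step, assume two hypothetical sources (an internal sales table and an external demographics table) that share a `customer_id` key. The field names and the `integrate` helper are illustrative assumptions, not part of the original slides; a real data warehouse would involve far more than a single join.

```python
# Hypothetical records from two separate sources sharing a customer_id key.
internal_sales = [
    {"customer_id": 1, "total_spent": 250.0},
    {"customer_id": 2, "total_spent": 90.5},
    {"customer_id": 3, "total_spent": 410.0},
]
external_demographics = [
    {"customer_id": 1, "region": "north"},
    {"customer_id": 3, "region": "south"},
]


def integrate(primary, secondary, key):
    """Left-join two lists of records on `key`.

    Every primary record is kept; fields that the secondary source
    does not provide are simply absent from the merged record.
    """
    lookup = {row[key]: row for row in secondary}
    merged = []
    for row in primary:
        combined = dict(row)
        combined.update(lookup.get(row[key], {}))
        merged.append(combined)
    return merged


warehouse = integrate(internal_sales, external_demographics, "customer_id")
for record in warehouse:
    print(record)
```

Customer 2 has no external record, so its merged entry lacks a `region` field. Gaps of this kind are what the cleaning phase has to deal with.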

Phase 2: Data cleaning, selection and transformation.

Cleaning: Some of the collected data may not be relevant to our goals. Sometimes there are too many missing values, or some of the data is unreliable. It may be desirable to remove this data so that it does not “contaminate” the knowledge that we discover.

Selection: Before beginning the data mining process, we must select the subset of the available data to work on.
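
The cleaning and selection steps can be sketched as follows. The records, the field names and the 50% missing-value threshold are all assumptions made for illustration; the slides do not prescribe any particular criterion.

```python
# Hypothetical customer records; None marks a missing value.
records = [
    {"age": 34, "income": 42000, "city": "Madrid", "segment": "retail"},
    {"age": None, "income": None, "city": None, "segment": "retail"},
    {"age": 51, "income": None, "city": "Zaragoza", "segment": "corporate"},
    {"age": 27, "income": 31000, "city": "Madrid", "segment": "corporate"},
]

MAX_MISSING_FRACTION = 0.5  # assumed threshold, to be tuned per project


def missing_fraction(record):
    """Fraction of fields in a record whose value is missing (None)."""
    return sum(value is None for value in record.values()) / len(record)


def clean(rows, max_missing=MAX_MISSING_FRACTION):
    """Cleaning: drop records with too many missing values."""
    return [row for row in rows if missing_fraction(row) <= max_missing]


def select(rows, predicate, fields):
    """Selection: keep only matching records and the fields of interest."""
    return [{f: row[f] for f in fields} for row in rows if predicate(row)]


cleaned = clean(records)
subset = select(cleaned, lambda r: r["segment"] == "retail", ["age", "income"])
print(len(cleaned), subset)
```

The second record is dropped because three of its four fields are missing. The third record is kept because only one field is missing.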

Transformation: Since we have integrated data from different sources, it will often be necessary to make formatting changes to homogenize the information. In addition, the data mining techniques we choose may require specific transformations of the data (converting to numeric values, discretizing, normalizing, ...). These transformations must be applied before starting the data mining process.
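
A minimal sketch of the three transformations named above. The function names, the integer coding scheme, the min-max formula and the interval boundaries are illustrative choices, not the only possible ones.

```python
from bisect import bisect_right


def numerize(values):
    """Map each distinct category to an integer code, in order of first appearance."""
    codes = {}
    return [codes.setdefault(v, len(codes)) for v in values]


def normalize(values):
    """Min-max normalization: rescale numeric values to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def discretize(values, boundaries, labels):
    """Replace each numeric value by the label of the interval it falls in.

    `boundaries` must be sorted; a value equal to a boundary belongs to the
    interval that starts there, so len(labels) == len(boundaries) + 1.
    """
    return [labels[bisect_right(boundaries, v)] for v in values]


print(numerize(["red", "green", "red", "blue"]))   # [0, 1, 0, 2]
print(normalize([10, 20, 30]))                      # [0.0, 0.5, 1.0]
print(discretize([18, 30, 45, 72], [30, 60], ["young", "adult", "senior"]))
# ['young', 'adult', 'adult', 'senior']
```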

Phase 3: Data mining.

This is the phase that seeks to discover new knowledge. It is important to distinguish between tasks and techniques in this phase.

Data mining tasks: They are “what we want to achieve”. There are two main types:

Predictive tasks: tasks that try to predict one or more values for each instance.

- Classification: given a new instance, we want to determine which class it belongs to (for example, to assign clients to categories, offer more targeted advertising, or detect fraudulent transactions). The classes are always known in advance. The goal is to learn a function that associates a class with each instance.

- Estimating the probability of classification: instead of learning a function that associates a single class with each instance, we want to learn, for each class, a function that gives the probability that a given instance belongs to that class.
- Categorization: instead of trying to learn a function that associates one class with each instance, we try to learn a correspondence. That is, each instance can belong to multiple classes.
- Regression: we try to learn a function that assigns a real value to each instance.
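
As an illustration only, the following sketch applies one simple technique (k nearest neighbours, chosen here for brevity and not prescribed by the slides) to three of the predictive tasks above: classification, class probability estimation and regression. The training data, the function names and the value k=3 are all hypothetical. Categorization (multiple classes per instance) is left out.

```python
from collections import Counter
from math import dist

# Hypothetical training data: (features, class label, numeric target).
TRAINING = [
    ((1.0, 1.0), "low", 10.0),
    ((1.5, 2.0), "low", 12.0),
    ((2.0, 1.0), "low", 11.0),
    ((6.0, 6.0), "high", 30.0),
    ((7.0, 7.5), "high", 34.0),
    ((6.5, 8.0), "high", 35.0),
]
CLASSES = sorted({label for _, label, _ in TRAINING})


def nearest(instance, k=3):
    """Return the k training examples closest to `instance` (Euclidean distance)."""
    return sorted(TRAINING, key=lambda ex: dist(ex[0], instance))[:k]


def classify(instance, k=3):
    """Classification: a single class per instance (majority vote)."""
    votes = Counter(label for _, label, _ in nearest(instance, k))
    return votes.most_common(1)[0][0]


def class_probabilities(instance, k=3):
    """Probability estimation: one value per class (fraction of neighbours)."""
    votes = Counter(label for _, label, _ in nearest(instance, k))
    return {label: votes[label] / k for label in CLASSES}


def regress(instance, k=3):
    """Regression: a real value per instance (mean of the neighbours' targets)."""
    return sum(target for _, _, target in nearest(instance, k)) / k


query = (2.0, 2.0)
print(classify(query))             # low
print(class_probabilities(query))  # {'high': 0.0, 'low': 1.0}
print(regress(query))              # 11.0
```

The same neighbours answer three different questions, which is one way to see the distinction between a task (what we want to obtain) and a technique (how we obtain it).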

Descriptive tasks: instead of trying to make a prediction, they try to describe the data.

- Clustering: it seeks to group the data into clusters. We do not know in advance how many clusters there are in the data; we need to discover it.

- Association rules: they try to identify relationships between categorical attributes.
- Correlations: they try to identify dependencies (correlations) between different attributes.
- Detection of anomalous values or anomalous instances: they try to find attribute values, or instances, that are anomalous. For example, fraudulent transactions on a credit card.
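
Two of the descriptive tasks above can be sketched in a few lines. The first part computes the support and confidence of an association rule over hypothetical market baskets. The second flags anomalous values with a z-score test, a deliberately simple stand-in for real anomaly detection methods. The data, the names and the threshold of 2 standard deviations are assumptions made for illustration.

```python
from statistics import mean, pstdev

# Hypothetical market baskets (categorical data).
BASKETS = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]


def count(itemset, baskets=BASKETS):
    """Number of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets)


def support(itemset, baskets=BASKETS):
    """Fraction of baskets that contain every item in `itemset`."""
    return count(itemset, baskets) / len(baskets)


def confidence(antecedent, consequent, baskets=BASKETS):
    """Among baskets with `antecedent`, the fraction that also have `consequent`."""
    covered = count(antecedent, baskets)
    if covered == 0:
        return 0.0
    return count(antecedent | consequent, baskets) / covered


def anomalies(values, threshold=2.0):
    """Values lying more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Rule "bread -> milk": how often it applies and how often it holds.
print(support({"bread", "milk"}), confidence({"bread"}, {"milk"}))  # 0.6 0.75

# Hypothetical card transaction amounts with one suspicious value.
print(anomalies([20, 25, 22, 19, 24, 21, 23, 500]))  # [500]
```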

Data mining techniques: There are several "tools" available to address these tasks. Often, a single task can be approached with different techniques. The most common techniques are: algebraic and statistical techniques, Bayesian networks, decision trees, association rules, clustering, neural networks, genetic algorithms, etc.

Phase 4: Interpretation and evaluation.

Once the new knowledge has been discovered, it is necessary to interpret what it means and to evaluate it:

- Is this knowledge something new, or was it already known?
- If it is new knowledge, is it useful?
- How reliable is the knowledge that has been discovered? What guarantees do we have that it will apply to other data?
- If we have built different models (using different techniques, or the same technique with different parameters), can we combine them to improve the results?
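
Two of the questions above (does the knowledge apply to other data, and can several models be combined) can be illustrated with a holdout evaluation and a majority vote. This is a sketch under stated assumptions: the data set, the threshold "models" and the split are hypothetical, and real evaluations typically use more careful schemes such as cross-validation.

```python
from collections import Counter

# Hypothetical labelled data: (value, label). The label is "high" when the
# value is at least 50, except for one noisy example, (45, "high").
DATA = [(12, "low"), (35, "low"), (48, "low"), (55, "high"), (61, "high"),
        (70, "high"), (20, "low"), (90, "high"), (45, "high"), (52, "high")]

TRAIN, TEST = DATA[:6], DATA[6:]  # simple holdout split


def threshold_model(threshold):
    """A deliberately simple model: predict "high" at or above the threshold."""
    return lambda value: "high" if value >= threshold else "low"


def accuracy(model, examples):
    """Fraction of examples for which the model predicts the true label."""
    return sum(model(x) == y for x, y in examples) / len(examples)


def majority_vote(models):
    """Combine several models: predict the label most of them agree on."""
    def combined(value):
        return Counter(m(value) for m in models).most_common(1)[0][0]
    return combined


# Same technique, three different parameter values.
models = [threshold_model(t) for t in (40, 50, 60)]
ensemble = majority_vote(models)

for name, model in [("t=40", models[0]), ("t=50", models[1]),
                    ("t=60", models[2]), ("vote", ensemble)]:
    print(name, accuracy(model, TRAIN), accuracy(model, TEST))
```

The model with threshold 50 is perfect on the training examples but reaches only 0.75 on the held-out ones, which is why reliability should be measured on data not used to build the model.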

Phase 5: Exploitation of the new knowledge.

Finally, it is time to incorporate the new knowledge into the organization's processes and use it in production.

These phases are not strictly linear; there are feedback loops between them. For example, in the data mining phase we may discover that we are missing relevant data and must return to the collection phase. Or in the evaluation phase we may find that our models do not behave properly with data different from those used in training, and we must return to the data mining phase.

Commercial Tools

| Product | Company | Techniques | Platforms | Interface |
|---|---|---|---|---|
| Knowledge Seeker | Angoss, http://www.angoss.com/ | Decision Trees, Statistics | Win NT | ODBC |
| CART | Salford Systems, www.salford-systems.com | Decision Trees | UNIX/NT | |
| Clementine | SPSS/Integral Solutions Limited (ISL), www.spss.com | Decision Trees, ANN, Statistics, Rule Induction, Association Rules, K Means, Linear Regression | UNIX/NT | ODBC |
| Data Surveyor | Data Distilleries, http://www.datadistilleries.com/ | Wide range | UNIX | ODBC |
| GainSmarts | Urban Science, www.urbanscience.com | Specialized in gain charts for customer campaigns (only Decision Trees, Linear Statistics and Logistic Regression) | UNIX/NT | |
| Intelligent Miner | IBM, http://www.ibm.com/software/data/iminer | Decision Trees, Association Rules, ANN, RBF, Time Series, K Means, Linear Regression | UNIX (AIX) | IBM, DB2 |
| Microstrategy | Microstrategy, www.microstrategy.com | Data warehouse only | Win NT | Oracle |
| Polyanalyst | Megaputer, http://www.megaputer.com/html/polyanalyst4.0.html | Symbolic, Evolutionary | Win NT | Oracle, ODBC |
| Darwin | Oracle, http://www.oracle.com/ip/analyze/warehouse/datamining/index.html | Wide range (Decision Trees, ANN, Nearest Neighbour) | UNIX/NT | Oracle |
| Enterprise Miner | SAS, http://www.sas.com/software/components/miner.html | Decision Trees, Association Rules, ANN, Regression, Clustering | UNIX (Sun), NT, Mac | Oracle, ODBC |
| SGI MineSet | Silicon Graphics, http://www.sgi.com/software/mineset/ | Association rules and classification models, used for prediction, scoring, segmentation, and profiling | UNIX (Irix) | Oracle, Sybase, Informix |
| Wizsoft/Wizwhy | http://www.wizsoft.com/ | | | |

Open-source tools

R Project: http://www.R-project.org/

WEKA: http://www.cs.waikato.ac.nz/ml/weka/

Final Remarks

Differences between data mining and OLAP (On-Line Analytical Processing): OLAP tools allow us to manage and transform data, but what they produce is other new data or different views of the same data. Data mining tools extract patterns, models, relationships and regularities (knowledge) from the data. These tools do not transform the data to facilitate the process of analyzing it; they analyze the data themselves.

Differences between statistics and data mining: Statistical methods attempt to validate or parametrize a model suggested by the user, whereas data mining techniques try to discover the model. Data mining also often faces problems involving larger volumes of data than statistical techniques usually handle, and the nature of the data is usually much more heterogeneous.

Differences between data mining and machine learning: The data on which data mining techniques are used is usually much more heterogeneous. The end users of machine learning techniques are usually scientific or technical staff, while data mining techniques are used by people without a strong mathematical and technical background.
