An Exploration of Knowledge Discovery from Data (KDD) Tools
Koushik Dutta* and Dr. T. V. Prasad**
* Deputy Director, IT Services Department, Bureau of Indian Standards, New Delhi 110 002
** Professor & Head, Dept. of Computer Science & Engg., Lingaya’s University, Faridabad 121 002, Haryana
E-Mail: [email protected], [email protected]
Abstract- The rapid evolution of science and technology has led to the generation of huge volumes of data. Data are now abundant: scientific, medical, demographic, financial and marketing data, to name a few. As the volume keeps growing, the emphasis is shifting to automatic analysis. Data analysis is important for extracting knowledge or information from huge data sets and using it to discover patterns or to predict future trends. Thus the explosive growth of data has necessitated the development of new techniques and automated tools for transforming data into knowledge.
Keywords: Knowledge mining, KDD tools
1. INTRODUCTION
Data Mining, popularly known as Knowledge Discovery from Data (KDD), is the automated or convenient extraction of hidden predictive information implicitly stored or captured in massive data repositories. Since its beginning in the 1980s, the subject has made rapid and significant progress [1]. Data mining, like statistics, is not a business solution; it is just a technology. In the last two decades, numerous commercial data mining and data analysis tools have been built to solve problems across fields such as financial services, life sciences, telecom and insurance. Data mining software allows users to analyze large databases to solve business decision problems. The data mining process operates on information held in historical databases that record previous interactions with customers. The software uses this historical information to build a model that predicts customer behaviour, e.g., which customers are likely to respond to a new product [2]. This treatise attempts to compare some of the prevailing KDD tools that organisations use to take appropriate business decisions and to make optimal use of resources for business development.
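The idea of building a predictive model from historical customer records can be illustrated with a minimal sketch. It is not taken from any of the tools surveyed here: the records, the single attribute (number of past purchases) and the function names `fit_threshold` and `predict` are all hypothetical, and the "model" is only a one-attribute threshold rule, the simplest form of a decision tree.

```python
from typing import List, Tuple

# Hypothetical historical records: (past purchases, responded to last campaign?)
HISTORY: List[Tuple[int, bool]] = [
    (0, False), (1, False), (1, False), (2, False),
    (3, True), (4, True), (5, True), (2, True),
]


def fit_threshold(records: List[Tuple[int, bool]]) -> int:
    """Return the purchase-count threshold that misclassifies the fewest records.

    The resulting model predicts "will respond" when purchases >= threshold.
    Ties are resolved in favour of the smallest threshold.
    """
    candidates = sorted({purchases for purchases, _ in records})
    best_threshold, best_errors = candidates[0], len(records) + 1
    for t in candidates:
        errors = sum((purchases >= t) != responded
                     for purchases, responded in records)
        if errors < best_errors:
            best_threshold, best_errors = t, errors
    return best_threshold


def predict(purchases: int, threshold: int) -> bool:
    """Apply the learned rule to a new customer."""
    return purchases >= threshold


threshold = fit_threshold(HISTORY)
likely_responder = predict(4, threshold)
unlikely_responder = predict(0, threshold)
```

Commercial tools build far richer models (full decision trees, neural networks, regression), but the workflow is the same: fit on historical interactions, then score new customers.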
2. IMPORTANCE OF KDD TOOLS
Numerous commercial data mining systems are available in the market today. Because data mining is used in a vast array of areas, tools are needed for recognizing and tracking patterns within the data. Such KDD and mining tools help organizations sift through volumes of real-time data to extract meaningful relationships. This helps businesses anticipate customer needs rather than simply react to them, and it allows business users to make informed decisions from the available data that can put a company ahead of its competitors. Acquisition of new customers is the primary means of growth for many businesses. This involves wooing prospects who have never used the company’s products. KDD tools can help segment those prospective customers and increase the response rate that an acquisition marketing campaign can achieve. Thus, in the present day, use of efficient KDD tools is critical for business analysis and information operations [3].
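Customer segmentation of the kind described above is commonly done by clustering. The following is a minimal K-Means sketch under stated assumptions: the prospect data (age, annual spend in thousands) are invented for illustration, the initial centroids are fixed rather than chosen at random so that the result is reproducible, and the function name `k_means` is hypothetical rather than the API of any surveyed tool.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def squared_distance(a: Point, b: Point) -> float:
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def k_means(points: List[Point], centroids: List[Point],
            iterations: int = 10) -> Tuple[List[Point], List[int]]:
    """Plain K-Means: assign each point to its nearest centroid, then move
    each centroid to the mean of its members, until nothing changes."""
    assignment: List[int] = []
    for _ in range(iterations):
        assignment = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        new_centroids: List[Point] = []
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                new_centroids.append((
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                ))
            else:
                # Keep an empty cluster's centroid where it was.
                new_centroids.append(centroids[k])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignment


# Hypothetical prospects: (age, annual spend in thousands)
prospects: List[Point] = [
    (25, 2.0), (27, 3.0), (23, 1.0),
    (52, 9.0), (55, 11.0), (58, 10.0),
]
centroids, segments = k_means(prospects, [prospects[0], prospects[3]])
```

Each resulting segment can then be targeted with its own campaign; production tools add steps this sketch omits, such as attribute scaling and choosing the number of clusters.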
3. PARAMETERS USED FOR COMPARISON
Numerous tools are available in the market, each offering a range of functionalities. We have analyzed 15 important ones which, apart from performing basic data mining tasks like classification, prediction and clustering, specialize in feature reduction, pattern recognition, anomaly detection, etc. The tools make use of techniques like decision trees, neural networks and K-Means, to name a few. The features of the tools have been highlighted. Most of the tools have been designed to work on Windows and Linux platforms. A few of them, namely SAS Enterprise Miner, IBM Intelligent Miner, Oracle Data Miner and IBM SPSS Modeler, also work on other platforms like AIX, HP-UX and Sun Solaris. Most of the tools have an easy-to-use GUI, while a few have adopted an MS Office based environment. The details are given in the Annexure.
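Of the tasks named above, anomaly detection is perhaps the easiest to show in a few lines. The sketch below is only one simple statistical approach (flagging values more than a chosen number of standard deviations from the mean); it is not claimed to be the method used by any of the compared tools, and the data, the threshold of 2.0 and the function name `z_score_anomalies` are assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import List


def z_score_anomalies(values: List[float], threshold: float = 2.0) -> List[float]:
    """Return the values whose distance from the mean exceeds
    `threshold` population standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values identical: nothing can be called an anomaly.
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical daily transaction counts with one unusual day.
daily_transactions = [102, 98, 101, 99, 100, 97, 103, 250]
anomalies = z_score_anomalies(daily_transactions)
```

A single extreme value inflates both the mean and the standard deviation, which is why robust variants (for example, based on the median) are often preferred on real data.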
5. CONCLUSION
It has taken human society more than 300,000 years to create 12 exabytes of data (1 exabyte is 1 billion gigabytes), and the amount of data is expected to double in the next three years, according to the School of Information Management and Systems at the University of California, Berkeley. With the ever increasing volume of data, nearly all data mining applications feature expanded analytics, user-friendly interfaces and powerful algorithms that allow analysis of structured and unstructured data. The KDD tools accept data from multiple sources like MS Excel, MS Access, MS SQL Server, Oracle and other relational databases, and they have interactive user interfaces. A few of the tools also offer a complete end-to-end solution, from data importing to data scoring and reporting. Text and web mining form an integral part of many of the tools, given that a huge chunk of data is unstructured, residing in websites and e-mails.
REFERENCES
[1] Han Jiawei and Kamber Micheline, “Data Mining: Concepts and Techniques”, 2nd Edition, Morgan Kaufmann Publishers
[2] Home page of Kurt Thearling,
[3] Berson Alex, Smith Stephen and Thearling Kurt, “Building Data Mining Techniques”, New York McGraw Hill, 2000
[4] Salford Systems,
[5] IBM Cognos,
[6] C 5.0.
[7] Olson Louis Davis and Dursun Delen, “Advanced Data Mining Techniques”, Berlin Heidelberg Springer, 2000
[8] Wizwhy,
[9] Elder John and Abbott Dean, “A Comparison of Leading Data Mining Tools”, Elder Research
[10] Features.pdf
[11] Superquery,
[12] Statistica Dataminer,
[13] DBMiner,
[14] IntelligentMiner,
[15] Polyanalyst,
[16] Oracle Data Miner, /odm/index.html
[17] SAS Enterprise Miner, amining/miner
[18] IBM SPSS Modeler, http:// ler
[19] Data Engine, http://
[20] Knowledge Studio, owledgeSTUDIO.php
[21] AI Trilogy,
Cognos Business Intelligence
C 5.0/SEE 5.0
Azmy Thinkware Inc.
Statistica Inc.
DB Technologies Inc.
Classification, Clustering, Prediction
Association Rules, Classification, Clustering
Decision Trees, K-Means
Statistica Dataminer
Salford Systems
IBM Corporation
RuleQuest Research Pty Ltd.
Classification
Decision Trees (Classification and Regression Trees)
Neural Networks, Rule Induction (if-then)
Decision Tree, Rule Induction (if-then)
Rule Induction (if-then and if-and-only-if)
Rule Induction
Windows, Linux
Windows, Linux
Is a decision tree tool that uses the CART algorithm. Makes use of seven different splitting criteria. Specialized backup rule available to handle missing data. Rules do not assume that the values for a missing attribute are the same. Data from 80 different file formats (including Excel, Lotus, and Oracle) can be used.
Windows, Linux, Solaris, AIX, HP Itanium, HP-UX
Deployment and architecture simplified by Web services architecture. Functionalities like reporting, analysis, scorecards, dashboards and business event management are possible with the software. Single, open API enables integration with existing security, portals, and IT infrastructure.
Decision Trees (CART, CHAID), Neural Networks (including back propagation), Regression
Windows
See5.0 runs on Windows machines and C 5.0 runs on Unix. Designed to analyze substantial databases containing thousands to hundreds of thousands of records and tens to hundreds of numeric, time, date, or nominal fields. Easy to use and does not presume any special knowledge of Statistics or Machine Learning. Source code is provided to embed classifiers generated by See 5.0/C5.0 in applications [6]. To maximize interpretability, classifiers are expressed as decision trees or sets of if-then rules, forms that are generally easier to understand than ANN.
Can be used for data analysis, making predictions and revealing cases that deviate from the rules [8]. It is a rule induction data mining tool that discovers the if-then rules in the data and reveals the necessary and sufficient conditions [9]. Calculates the error probability of each rule and summarizes the data graphically by presenting the main rules and trends.
Classifies data and discovers all the facts. Automatically draws graphs and calculates totals and statistics for any column and any filter. With SuperQuery, there is no need to know SQL or any statistical language.
Makes use of statistical methods to address data mining issues. Processes remote databases without creating local copies, which enhances performance in case of large data repositories.
Uses intelligent and automated processes to analyze large volumes of detailed data from relational databases, data warehouses and web data. Also accepts data from multiple sources including MS SQL Server, Excel, OLEDB and other relational databases. DBMiner Insight Solutions provide association, sequence and differential mining capabilities for MS SQL Server Analysis Services Platform, and they also provide market basket, sequence discovery and profit optimization for MS Accelerator for Business Intelligence.
Regular GUI version of CART available.
Has a launch menu to access IBM Cognos 8 Administration and Studios. Has a unique “My Area” icon which grants the user access to a customized workspace.
Application has main window containing buttons for all applications.
Has a main project view with multiple components integrated into a single project window. A project is organised through the metaphor of creating and manipulating a variety of objects including data objects, decision tree objects, cluster objects and so on.
MS Windows 98/XP/Vista; 32 MB RAM
Has a click icon based GUI to create a workflow description of the tasks to be performed.
DMSQL interface or a GUI interface is used. Allows a cube view of data by interfacing through MS SQL Server’s OLAP.
486 PC or better, MS-Windows 2000/XP/Vista, 8 MB RAM, and 8MB free space on HDD
MS Windows XP/Vista MS SQL Server’s OLAP Service MS Excel
MS Windows-compatible CPU, 32 MB RAM, Windows 2000/XP/Vista/7, and a network server connected to workstations (an existing enterprise database application can be used or it can be provided by StatSoft) [12]
System Requirements
MS Windows, 2.0 GHz P IV Processor, 2 GB RAM, 2 GB Free Hard Disk Storage Space Linux: 32 MB RAM (min.), Min. 40 GB of free storage space on Hard Disk
MS Windows, IE 6, 128 MB RAM (min.)
MS Windows 2000/XP/Vista; UNIX: Linux/Iris/Solaris; Processor: Intel Pentium IV; RAM: 64 MB (min.)
Statistica Dataminer
DB2 Intelligent Miner
Oracle Data Miner
SAS Enterprise Miner
IBM SPSS Modeler
IBM Corporation
Megaputer Intelligence Inc.
Oracle Corporation
SAS Institute Inc.
IBM Corporation
Association Rules, Clustering, Classification, Prediction
Classification, Prediction, Anomaly Detection, Clustering, Association Rules, Feature Extraction
Market Basket Analysis, Predictive Modeling, Time Series Data Preparation and Analysis
Association Rules, Classification, Clustering, Prediction
Decision Trees (CART), K-Means, Neural Networks, Linear Regression
Association Rules, Classification, Clustering, Prediction, Anomaly Detection, Pattern Recognition
Decision Trees, Neural Networks
Windows, Solaris, AIX, OS/390, OS/400
Family comprises three products, namely DB2 Intelligent Miner for Data, which mines DB2 databases or flat files, DB2 Intelligent Miner for Text, which mines textual data including flat files and web pages, and DB2 Intelligent Miner for Scoring documentation. IBM's in-database mining capabilities integrate with existing systems to provide scalable, high performing predictive analysis without moving the data into proprietary data mining platforms.
Simple GUI interface is provided for user convenience.
Decision Trees, Regression, Naïve Bayes, SVM, Enhanced K-Means, Apriori
Windows, Linux, Solaris
Neural Networks, Regression, Ensemble Methods, Decision Trees
Windows, Solaris, Linux, AIX, HP-UX
Creates descriptive and predictive models by analysing voluminous data. Apart from fraud detection, the software can be used for business based model comparisons, reporting and management. Data access, management and cleansing are integrated, thereby making data analysis easier. Also supports scalable batch processing through GUI with access to more than 50 file structures.
Interactive GUI with easy to use Graphics Explorer Wizard and Graphics Explore Node.
Windows, Unix
Accesses data stored in relational databases using the ODBC interface. Can process flat files, MS Excel and DBF files. Enables data modeling and testing using different machine learning algorithms. Offers complete end to end solution from data importing, cleaning, manipulation, visualization, modeling, scoring and reporting.
Embedded in the Oracle database. It identifies patterns and key attributes and discovers associations and clusters. ODM moves the analytical functions into traditional mining servers [16].
Data Engine
Knowledge Studio
AI Trilogy
Management Intelligenter Technologien GmbH
Classification, Clustering, Decision Trees
Angoss Software Corporation
Ward Systems Group Inc.
Classification, Clustering, Prediction, Rules
Classification, Forecasting, Prediction
Apriori, Decision Trees, K-Means Clustering, Neural Networks, Regression, Rule Induction
Windows, Solaris, IBM AIX, HP-UX
Decision Trees, K-Means, Neural Networks, Linear Regression
Windows
Decision Trees, K-Means Clustering, Neural Networks, Regression
Windows, Solaris
Genetic Algorithms, Neural Networks
Has a number of descriptive icons where each icon represents steps like accessing data, preparing data, data visualisation and modelling. It mines large data sets using a client/server model. Server converts data access requests into SQL queries which can then access a relational database [18].
Integrates statistical tools with neural networks. It has many methods for data cleansing, transformation and for handling missing data. The Data Engine ADL generates C code or produces DLLs which can be incorporated in the application code for subsequent use.
Provides a set of scoring and deployment tools in a single workflow environment. It allows analysts to generate application code that can be exported to Visual Basic, C++, Java, XML, PMML and SAS generators, thereby facilitating integration with all data sources within the organization.
Is a suite of three products: NeuroShell Predictor, NeuroShell Classifier and GeneHunter. Supports ASCII, CSV and Excel files. Application serves as a tool to extract the prominent relationships among process variables.
Object-oriented GUI available. It is a self-documenting system that provides visual tools for data analysis.
User interacts with the software through the Oracle DM GUI, PL/SQL and Java API, Predictive Analytics PL/SQL package, and Oracle Spreadsheet Add-In for Predictive Analytics.
Makes use of descriptive icons to create a data flow description of the functions to be performed.
User interacts with the software through a project window that gives a survey of data, graphics and models.
Familiar interface based on the MS Office environment. The package has an intuitive GUI for easy deployment and ease of use.
Has a Windows icon driven user interface and a host of other utilities to provide users with a neural network experimental environment.
System Requirements
DB2 Intelligent Miner: MS Windows XP with SP2/Vista with SP1, 1 GHz Processor, 512 MB RAM, Hard Disk 60 MB, MS Office 2003 with SP1/MS Office 2007, IE 6.0 with SP2/7.0/8.0 [14]
MS Windows 2000/XP, 1.0 GHz Processor, 128 MB RAM, 50 MB free space on HDD, MS Internet Explorer 6.0
Oracle Data Miner: MS Windows XP Prof./Vista Business, Enterprise and Ultimate/2000 server with SP1 and all editions of 2003, 512 MB RAM, 2.04 GB free space on HDD [16]
SAS Enterprise Miner: MS Windows, Solaris, Linux, SAS/STAT and Base SAS
IBM SPSS Modeler: MS Windows XP Prof./Vista/Server 2003, Intel Pentium, AMD 64 & EM64T, 1 GB RAM and 1 GB free space on HDD
Data Engine: MS Windows 2000/XP, IBM compatible PC with 886/50MHz Processor or higher, 64 MB RAM, 25 MB free space on HDD, LabVIEW 6.x
AI Trilogy: MS Windows 2000 (with SP4)/XP/Vista/7, Intel Pentium compatible processor, 256 MB RAM