Proceedings of the 2004 IEEE/WIC/ACM International Conference on Web Intelligence (WI2004) Beijing, China, Sep. 20-24, 2004
Grid-based Data Mining in Real-life Business Scenario Tianchao Li Institut für Informatik, Technische Universität München, Germany
[email protected] Toni Bollinger, Nikolaus Breuer, Hans-Dieter Wehle IBM Development Laboratory Böblingen, Germany {toni.bollinger,hdwehle,nbreuer}@de.ibm.com
Abstract This paper presents a Grid-based distributed and parallel data mining system targeting a real-life application scenario typical in the business realm – franchise supermarket basket analysis. Following a layered design of three tiers, this system enables parallel association rule mining on a farm of Grid servers, offers a standard service interface for custom applications, and provides a friendly user portal. The work presented in this paper reveals specific requirements for applying Grid-based data mining in the business realm, which is helpful for the design and implementation of a generic Grid-based data mining system.
The rest of the paper is organized as follows. Section 2 describes a typical application scenario in the realworld business realm, the analysis of which reveals some specific requirements for implementing such a Grid-based data mining system. Section 3 describes the system infrastructure, which follows a layered design of three tiers – Grid tier, service tier, and client/portal tier. Focusing on a general description of the parallel mining task submission workflow, Section 4 summarizes the design and implementation. Section 5 discusses future developments and the paper concludes with section 6.
2. A Real-life Business Application Scenario 2.1. Franchise Supermarket Basket Analysis
1. Introduction Data mining, which targets the goal of retrieving information automatically from large data sets, is one of the most important business intelligence technologies. Because of its high computational intensiveness and data intensiveness, data mining serves a good field of application for Grid technology. The idea of data mining on the Grid is not new, but it has become a hot research topic only recently. The number of research efforts up to now is still quite limited (for a short summary see [1]). Many of the existing systems, such as NASA’s Information Power Grid [2], TeraGrid [3] and Discovery Net [4] are either utilizing non-standard data mining techniques, or restricted to a special domain in the scientific realm. The implementation of a Grid-based data mining system targeting real-life application scenarios in the business realm will reveal the importance of specific requirements that are not so evident in the scientific realm, and will contribute to a generic design and implementation.
1
A typical application scenario of Grid-based data mining in real-life enterprises is presented in Figure 1, where there is a franchise supermarket formed by headquarter, regional branches and distributed member stores. Boston Store
Chicago Store
Paris Store
Berlin Store
Germany Branch Data Center
America Branch Data Center
Headquarter Miami Store
New York Store
Stuttgart Store
Munich Store
Figure 1. Franchise-supermarket scenario
Each store collects transaction data by scanning bar codes at the till when customers buy products. Most stores rely on regional branch to store their transaction data, while some stores have local databases. For the
sake of safety or convenience, the transaction data for certain stores are optionally replicated in several different data storages. Each regional branch or member stores have their own choice of data mining software, and deploys the software locally. The headquarter employ data mining to analyze the transactional data of selected member stores. The franchise supermarket basket analysis scenario described above, though specific, is typical for Gridbased data mining in the business realm. Another example situation that follow similar scenario is the cooperation among several financial organizations. They need to share data patterns relevant to the data mining task, but they do not want to share the data since they are sensitive.
2.2. Rational Grid-based Approach
3. System Infrastructure The system follows a layered design (Figure 2), where the infrastructure is divided into three tiers in general – Grid tier, service tier, and client/portal tier [5]. Client/Portal Tier Service Tier
Server Browser Module
New Task Module
Task Browser Module
Result Visualization Module
Common Parts Service Interfaces
data mining script-template repository
Grid Tier
The application scenario described above reveals some important properties which advocate the adoption of a Grid-based data mining approach. First, security is crucial for distributed data mining in enterprises. The business data are so sensitive that they should in no case be accessed without proper privileges. Second, Grid-based data mining system for enterprises should be able to cope with the heterogeneity of not only computers, operating systems, networks, but also that from the different sources of data and different data mining software. Third, in the highly competitive business environment, enterprises tend to be more flexible and dynamic. As a result, data mining systems in enterprises face stronger demand on the capability of processing data that are dynamically organized. In the franchise supermarket scenario, it often happens that new stores are opened for business expansion and non-profitable stores are closed. With all above properties, a Grid-based approach becomes the natural choice, although there could be other similar techniques that are capable of achieving the same goal.
The inadequacy of data shipping is mainly attributed to security and efficiency considerations. With data shipping, when the business data is transmitted through the network and copied to other sites, access to the business data only with proper privileges can hardly be guaranteed. Data mining is highly data intensive, thus the network can often be a bottleneck for a distributed data mining system, which results in efficiency degradation. Enterprises tend to use standard data mining techniques from commercial data mining software rather than domain specific knowledge extraction techniques from self-implemented software. The complexity and differences in installation and configuration procedures of those software also inadequate program shipping, especially in cases that data mining programs are tightly coupled with the database. Instead of data shipping and program shipping, it should be more appropriate to ship the data pattern of each data partition rather than the data, and to ship script for invoking the program rather than the whole executable.
Grid-enabled data mining manager
Grid Middleware
2.3. Scenario Specific Requirements The application scenario described above also reveals specific requirements that are crucial for the design and implementation of the Grid-based data mining system. It is common practice for Grid applications to decide on program shipping versus data shipping, i.e. to move program to the place of data or to move data to the place of program. However, neither of these two approaches is suitable for application scenarios specified above.
2
Figure 2. System infrastructure
3.1. Grid Tier The underlying Grid tier is composed of a set of heterogeneous compute nodes. The business data is partitioned and deployed on each server as databases. Software enabling data mining on local partial of data are installed and configured on each Grid node. Grid middleware enables a secure, coordinated and dynamic
resource sharing among Grid nodes. Custom information providers are implemented and deployed to provide additional information about configurations and data content.
3.2. Service Tier The service tier implements a Grid-enabled data mining system and provides standard service interfaces. The Grid-enabled data mining system consists of a data mining script template repository and a Grid-enabled data mining manager. The data mining script template repository stores the scripts and script templates for all kinds of data mining software deployed on the servers. The script templates are transformed into actual mining scripts in an adaptive way. The Grid-enabled data mining manager controls the process of distributed data mining on the Grid servers until finally the result of data mining is retrieved.
whole dataset, these conditions are not fulfilled on one specific server. In this case, this rule would not be in the result set of this server such that we will not be able to get the corresponding confidence and support values for this rule on this server. As a result, the confidence and support for this association rule after merging will not be exact, even though it can be quite close to the exact value. Corrections are necessary to be made in order to get accurate results while stick to the same parallelization paradigm. This is made though additionally executing small pieces of mining jobs that will do some makeup works to complement the missing information. Trading accuracy with performance, this “accurate” mode emphasis more on availability and accuracy, rather than efficiency as it is in the case of a “rough” mode.
Schedule
Detect Incomplete Rules
Prepare Scripts
Prepare Makeup Scripts
Submit Jobs
Submit Makeup Jobs
Read Results
Read Makeup-Job Results
Merge Results
Merge Makeup-Job Results
3.3. Client/Portal Tier The Grid-enabled data mining services can be accessed with custom applications that are either programmed or generated with tools according to the Web service interface definitions. For casual users that do not intend to have custom implementation of service clients, a portal is implemented to provide easy access. The portal provides a user interface for the Grid system, with which the users submit new data mining tasks, browse the statuses of tasks, and explore the visualization of data mining results. The portal is implemented as a J2EE compliant application, so that it is accessible by the end users simply through a Web browser.
[Rough mode]
[Exact mode]
Generate PMML
4. Parallel Mining on Grid Servers The market basket analysis is based on association rule mining, one of the major standard data mining techniques. Based on algorithms like Apriori algorithm [6], this mining technique is applicable for an “embarrassingly parallel” paradigm, where a task can be divided into small independent jobs. These independent jobs can then be distributed on the farm of Grid servers, as is assigned by a scheduler. A little bit complexity comes from the fact that the result returned by data mining software usually only contains association rules that satisfy the requirements specified in mining parameters – minimal support, minimal confidence, maximal rule length etc. This might lead to an inaccuracy in the result of merging: It may happen that even if an association rule satisfies the conditions for minimal support and confidence on the
3
Figure 3. Parallel mining on Grid servers
Figure 3 shows the general process of submitting association rule mining task on the Grid. The mining manager starts a mining task with scheduling, when the scheduler matches the mining task to available servers in an optimal way. The scripts for the mining task are prepared separately for each destination server, adapted to the specific data mining software on that server. These scripts inherit data mining parameters from the mining request, especially the requirement of minimal support, minimal confidence, and maximal rule length. Grid jobs are then submitted to the Grid servers for performing data mining on local partition of data. The contents of Grid job results, i.e. the result of data mining on Grid servers, are read into memory and are merged
together. Unlike “rough” mode, which accepts the inexact result of simple merging, the “exact” mode tries to make corrections. Rules that possibly suffer from the absence of information are detected. Scripts for makeup mining jobs, each with very small sets of items, are prepared while removing the requirements for minimal support and confidence. The makeup mining jobs are submitted, and the results are collected. A makeup merge is performed to update the association rule parameters from the initial merge. For detailed discussion of some specific issues of design and implementation, especially those for the custom information providers and brokers, task scheduling, adaptive script preparation, and XML-based message formats based on PMML [7] etc, please reference the paper by Li and Bollinger [8].
entific realm. The design of a generic Grid-based data mining system should take full consideration of application scenarios in both realms. A real-life application scenario typical in the business realm, the franchise supermarket basket analysis, has been discussed. A Grid-based data mining system has been implemented targeting this scenario, the infrastructure of which follows a layered design. With underlying custom information providers and brokers, optimistic task scheduling, adaptive script preparation, and scenario specific parallelization with “embarrassingly parallel” paradigm, the system offers service interface for a distributed and parallel Grid-based data mining system.
References 5. Further Developments The application scenario discussed in this paper utilizes the association rule mining. Algorithms for this mining technique are applicable for a coarse-grained parallelism, where a job is divided into small independent sub-jobs. For other data mining techniques, such as clustering, classification, and regression, studies should be carried out to find appropriate algorithms for a similar parallelism. While the current implementation is based on Globus Toolkit 2 (GT2) [9] and Web service, migration to Globus Toolkit 3 (GT3) and Grid service is to be carried out. From implementation perspective, the resource management services and the information services offered by GT2 and GT3 keeps much consistency, thus the migration of the Grid client and Grid servers are relatively easy. For the service interface, migrating from stateless and persistent Web service to stateful and transient Grid service demands however a moderate work. Despite the effort in the migration, offering a Grid service instead of Web service can greatly simplify the implementation of the Grid portal and other custom client applications that consume the Grid-based data mining service. With Web service, the Grid portal has to maintain a long running process (running as a message-driven EJB) for each Web service invocation to keep track of it. With Grid service, this long running process can be broken into several very short processes, each responsible separately for invoke the service, to get the status or result of the service invocation.
6. Conclusions Real-world application scenarios in business realm have specific requirements that are not evident in sci-
4
[1]Cannataro, M.; Talia, D.: The Knowledge Grid, Communications of the ACM, Vol. 46, No. 1, pp89-93, Jan. 2003 [2] Hinke, T.; Novotny, J.: Data Mining on NASA’s Information Power Grid, Proc. 9th IEEE International Symposium on High Performance Distributed Computing, Pittsburgh, Pennsylvania, Aug. 1-4, 2000 [3]Conover, H. et.al.: Data Mining on the TeraGrid, Poster Presentation, Supercomputing Conference 2003, Phoenix, AZ, Nov. 15, 2003 [4]Ćurčin, V. et.al.: Discovery Net: Towards a Grid of Knowledge Discovery, Proc. 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada, pp658-663, 2002 [5] Li, T.: A Grid-based Business Intelligence Infrastructure – Concept, Design and Prototype Implementation of Business Intelligence Grid, Master Thesis, Technische Universität München, 2003 [6] Agrawal, R.; Srikant, A.: Fast algorithms for mining association rules, Proc. 20th International Conference on Very Large Data Bases, pp487-499, 1994 [7] Data Mining Group, PMML 2.0 Specification, http://www.dmg.org/v2-0/GeneralStructure.html [8] Li, T.; Bollinger, T.: Distributed and Parallel Data Mining on the Grid, Proc. 7th Workshop Parallel Systems and Algorithms, Augsburg, Germany, Mar. 26, 2003 [9] Globus Toolkit, http://www.globus.org