Federated Data Mining Services and a Supporting XML Markup Language

S. Krishnaswamy¹, A. Zaslavsky¹, S. W. Loke²
School of Computer Science & Software Engineering¹, CRC for Enterprise Distributed Systems Technology², Monash University, 900 Dandenong Road, Caulfield, VIC 3145, Australia
Email: {shonali.krishnaswamy, arkady.zaslavsky}@broncho.ct.monash.edu.au
[email protected]

Abstract

E-businesses are increasingly looking towards data mining systems to meet their business intelligence needs. However, the current state of the art in data mining does not allow any single system to meet the diverse business intelligence needs of e-businesses. A second bottleneck is the high initial cost of establishing a data mining infrastructure within an organisation. In this paper, we propose the concept of a federated data mining system hosted by an Application Service Provider (ASP) as a means of alleviating the bottlenecks of high cost and diverse data mining needs. We also present an XML-based language which provides the basis for data mining systems to describe their services and architecture to the rest of the federation.
1. Introduction

E-businesses are increasingly looking towards business intelligence tools to provide them with a competitive edge by maximising the gain obtained from their information resources and supporting their strategic decision-making process. This demand for business intelligence by organisations operating in e-markets is driving data mining systems to support a wide variety of algorithms (to fulfil the diverse requirements of e-businesses) and to provide access to data mining services from web interfaces and even from mobile users with hand-held devices. Thus, for data mining systems to integrate well with the e-commerce world, they need to be able to operate in heterogeneous and distributed environments.

Since the early 1990s, data mining systems have evolved from stand-alone systems characterised by single algorithms with little support for the knowledge discovery process to integrated systems incorporating several mining algorithms, multiple users, various data formats and distributed data sources. We characterise the growth and evolution of data mining systems in a three-dimensional space as illustrated in figure 1. The orthogonal dimensions that specify the directions in which data mining systems are advancing are: complexity of data, level of support for knowledge discovery (KD) and complexity of distribution. This continuous growth and evolution notwithstanding, the current state of the art in data mining systems makes it unlikely that any one system can support the wide and varied business intelligence needs of e-markets.

In this paper we propose a federation of data mining systems as a means of meeting the diverse data mining needs of the e-commerce world. We also present the Distributed Data Mining System Markup Language (DDMS-ML), a markup language that uses XML as its meta-language. DDMS-ML is used by data mining systems participating in a federation to communicate with each other and share information about the services that they offer. DDMS-ML allows data mining systems to describe their respective functionality and architecture. We have developed an XML DTD for DDMS-ML, which is both generic and extensible, to describe the various characteristics of distributed data mining systems.
[Figure 1 plots three axes: Complexity of Data (flat files, relational, object-oriented, unstructured text, spatial, temporal, multimedia), Level of KD Support (multiple users, algorithm selection, automated preprocessing, optimisation) and Distribution (stand-alone, centralised, distributed, heterogeneous, mobile, ubiquitous).]
Figure 1. Evolution of Data Mining Systems
This paper is organised as follows. In section 2, we illustrate scenarios of federated data mining systems and how they can benefit data mining in a B2B (business-to-business) context and a C2B (customer-to-business) context. In section 3, we present DDMS-ML and discuss the components of its DTD. Section 4 presents a case study of a DDMS-ML document for the DAME distributed data mining system as a proof of concept for DDMS-ML. In section 5 we discuss related work. Finally, in section 6 we conclude with the current status and future directions of our work.

2. Federated Data Mining Systems

The primary motivation for a federation of data mining systems is that users can benefit from the data mining services of several systems. Thus, a user who wants to mine data in a special format (e.g. spatial or temporal data) can request this from a system that supports mining such data types. Similarly, a user who is on the move but requires a data mining task may access a data mining system that specialises in meeting mobile users’ requirements.

Consider an on-line shopping centre which consists of buyers, dealers and a broker. The buyers access the shopping centre through a web interface and interact with the dealers via the broker. At one level, the broker provides catalogue services to customers in terms of dealer profiles and availability of goods and services. At another level, the broker negotiates transactions between the buyers and the dealers. The need for distributed data mining in such a scenario arises from two possible sources, namely, the dealers and the broker. The dealers’ data mining requirements have their origins in traditional data mining applications such as market basket analysis. The broker’s data mining needs will be centred on customer profiling to improve the level of service provided to individual customers.

The environment is inherently distributed and heterogeneous. In addition to the complexity of distribution, e-commerce adds a further dimension of complexity to the mining process by emphasising the importance of optimised response time. For example, where a product required by a customer is not currently available, the broker might want to tell the customer when the product is likely to become available, by analysing past trends or similar products offered by dealers. The broker might also want to give the customer an incentive for waiting by analysing dependencies with seasonal specials.

In Krishnaswamy et al (2000a), we presented a framework whereby Application Service Providers (ASPs) can host distributed data mining services, together with a framework for costing and billing for those services. The advantage is that organisations can access data mining services without having to be concerned with set-up costs. Similarly, the application service provider paradigm can be used to host a federation of data mining systems that e-businesses can access to fulfil their data mining tasks. There are two models for hosting a federation of data mining systems using application service providers, namely, the customer-to-business (C2B) model and the business-to-business (B2B) model.
2.1 Customer to Business Model (C2B)

In this model, one ASP hosts a federated data mining system as illustrated in figure 2. Thus, several data mining systems are registered with one ASP, which manages the federation and services the data mining needs of e-businesses. In this model the “customers” are the components of e-commerce systems such as the buyers, dealers and brokers, and the “business” is the ASP which provides data mining services to its clients. The components of the figure are:

Buyers. The buyers use the on-line shopping centre to procure goods and services.

E-Commerce system. The e-commerce system provides the infrastructure for the on-line shopping centre. It comprises a web interface, an e-catalog, a broker and a database. The web interface is the point of access for the customers into the shopping centre. The “e-catalog” is a directory of the goods, services and dealer profiles. The broker negotiates transactions between the buyers and the dealers. The “database” is used to maintain transaction details and dealer and customer information for use by the e-catalog and the broker.

Dealers. The dealers are the businesses that use the on-line shopping centre as a medium for marketing and selling their products.
[Figure 2 shows the on-line shopping centre (web interface, e-catalog, broker, database) and two dealers with Oracle, Sybase, flat-file and legacy data sources, together with an Application Service Provider whose federation manager sits in front of Distributed Data Mining Systems 1, 2 and 3.]
Figure 2 Federation of Data Mining Systems Hosted by an ASP
Application Service Provider (ASP). The ASP provides application services to the e-commerce system components and the dealers. The focus in the above scenario is on the federated data mining service that is provided by the ASP. The dealers and the broker requiring this service pay the ASP for accessing the data mining systems that form the federation. The ASP must have an infrastructure for costing and billing clients for data mining services.

Federated Data Mining System. The federated data mining system consists of several distributed data mining (DDM) systems and a federation manager. The DDM systems support different functionality and special features that are tailored for specific data mining tasks in a distributed and heterogeneous environment. Several DDM systems have been proposed, including Papyrus (Ramu, 1998), JAM (Stolfo et al, 1997), DecisionCentre (Chattratichat et al, 1999), Bodhi (Kargupta et al, 1998), IntelliMiner (1999) and DAME (Krishnaswamy et al, 2000a). The federation manager maintains information about the different data mining systems that are part of the federation, such as the services they provide. This information is obtained from the individual DDM systems in the form of DDMS-ML documents. DDMS-ML forms the basis for the federation manager to locate the DDM system that is best suited to perform a given task. The DDMS-ML tags and the document type definition for DDMS-ML documents are presented in section 3 of this paper.

2.2 Business to Business Model (B2B)

In this model there are several ASPs which provide data mining services as illustrated in figure 3. The ASPs participate in a federation consisting of their individual data mining systems. Each ASP has its own customers. When an ASP receives a data mining request that it is unable to carry out, it propagates this request to the federation. This scheme can be viewed as a B2B model since each ASP is a business and it out-sources those requests from its clients that it does not have the resources to perform.
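The propagation of a request that an ASP cannot serve can be illustrated with a minimal sketch. The class names (`Asp`, `Federation`), the use of a supported data type as the only matching criterion and the first-match policy are all assumptions made for illustration; the paper does not prescribe a propagation protocol.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Asp:
    """Hypothetical ASP that can serve only some data types."""
    name: str
    supported_data_types: Set[str]

    def can_serve(self, data_type: str) -> bool:
        return data_type in self.supported_data_types


@dataclass
class Federation:
    """Hypothetical federation manager holding the member ASPs."""
    members: List[Asp] = field(default_factory=list)

    def propagate(self, origin: Asp, data_type: str) -> Optional[Asp]:
        # Offer the request to every member except the ASP that forwarded it;
        # the first member able to serve it is chosen (an assumed policy).
        for asp in self.members:
            if asp is not origin and asp.can_serve(data_type):
                return asp
        return None


def handle_request(asp: Asp, federation: Federation, data_type: str) -> str:
    """Serve locally when possible; otherwise out-source via the federation."""
    if asp.can_serve(data_type):
        return asp.name
    partner = federation.propagate(asp, data_type)
    return partner.name if partner else "unserved"
```

Here `handle_request` returns the name of the ASP that ends up performing the task, so a request for spatial data received by an ASP that only mines relational data is answered by whichever federation member supports spatial data.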
[Figure 3 shows the on-line shopping centre (web interface, e-catalog, broker, database) and two dealers with Oracle, Sybase, flat-file and legacy data sources, together with Application Service Providers #1, #2 and #3, Distributed Data Mining Systems X, Y and Z, and a federation manager.]
Figure 3 Federation of Data Mining Systems Hosted by several ASPs
The federation manager, as before, maintains profiles of the data mining systems participating in the federation in the form of DDMS-ML documents and co-ordinates the interaction between the different ASPs.

2.3 Federation Issues

Managing a federation of distributed data mining systems involves several issues. The federation must provide the infrastructure for the following:

• Registration. Data mining systems must be able to register and de-register themselves to join and leave the federation respectively.
• Access and Usage. By joining a federation, the data mining system must provide access for the federation (and its users). Thus, it must allow the federation to use its services and must provide the federation with information about access and usage.
• Communication Protocols. The protocols for communication and interaction between the federation and the data mining systems need to be specified.
• Billing. Data mining systems must have the means for billing users for their services. In the C2B model, this is not as significant an issue since the ASP bills clients for data mining services. However, in the B2B model, when one ASP decides to out-source a task to another ASP, this raises issues such as the federation’s ability to find a data mining system that will provide a service for the lowest price. Thus, the federation must allow the participating data mining systems to quote what they will charge and then choose the best offer.
• Security and Trust. The federation must provide good security for its clients so that they trust the federation with their data mining tasks.
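The registration and billing issues listed above can be sketched as follows. This is only an illustration under stated assumptions: `FederationRegistry`, its method names and the flat per-service price are hypothetical, and "best offer" is taken to mean simply the lowest quoted price.

```python
from typing import Dict, Optional


class FederationRegistry:
    """Hypothetical registry; prices are in arbitrary currency units."""

    def __init__(self) -> None:
        # system name -> {service name -> quoted price}
        self._quotes: Dict[str, Dict[str, float]] = {}

    def register(self, system_name: str) -> None:
        # A system joins the federation; names are assumed to be unique.
        if system_name in self._quotes:
            raise ValueError(f"{system_name!r} is already registered")
        self._quotes[system_name] = {}

    def deregister(self, system_name: str) -> None:
        # A system leaves the federation; its quotes are discarded.
        self._quotes.pop(system_name, None)

    def quote(self, system_name: str, service: str, price: float) -> None:
        # A registered system states what it will charge for a service.
        self._quotes[system_name][service] = price

    def cheapest(self, service: str) -> Optional[str]:
        # Choose the lowest quote among systems offering the service.
        offers = {name: q[service]
                  for name, q in self._quotes.items() if service in q}
        return min(offers, key=offers.get) if offers else None
```

A real federation would presumably weigh price against other criteria such as response time and trust; the sketch isolates the quote-and-choose step only.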
Federated data mining systems are a new concept and the above issues remain open for examination. The focus of this paper is to present DDMS-ML, a markup language that allows data mining systems to describe their services and infrastructure to the rest of the federation.

3. Distributed Data Mining Systems – Markup Language (DDMS-ML)

DDMS-ML is a markup language for distributed data mining systems to describe their functionality, features and structure. DDMS-ML is an application of XML (its documents are XML documents that conform to the DDMS-ML DTD) and allows distributed data mining systems that are part of a federation to exchange information about their respective architectures and services. XML is widely used in metadata standards such as Microsoft’s Channel Definition Format (CDF) and is suitable for standardised information interchange in many domain-specific contexts (Goldfarb et al, 1998). We believe that XML provides a suitable basis for DDMS-ML for the following reasons:

• XML query languages can be used to query a collection of XML documents (XML Query Languages, 1999). Thus, the federation manager can query a collection of DDMS-ML documents to locate a DDM system in the federation that provides a particular service.
• XML is human-readable.
• The Document Object Model (DOM) allows access to XML documents from within programming languages.
• XSLT (Extensible Stylesheet Language Transformations) (W3C, 1999) allows XML documents to be converted to other languages such as HTML and WML.

In this section, we present an XML DTD for DDMS-ML and discuss its structure and components. A DDMS-ML document consists of seven distinct parts, which are discussed below.

Meta Information. This part of a DDMS-ML document contains information about the DDM system such as its name, version, date of development, organisation and developer. Only the name and date of development are required attributes, while the remaining attributes are optional. We impose the constraint that within a federation each DDM system must have a unique name.
Connection and Access. This component of a DDMS-ML document provides information about the DDM server and how the federation can access it. The DDM server’s hostname, IP address and the usernames and passwords for use by the federation (including optional instructions for connection) are provided. We allow for a DDM system to be hosted by more than one server and for the federation to have one or more accounts on any server.

Computational Resources. This component includes information about the DDM system’s computational resources. A DDM system can have several servers at its disposal. A server is either a stand-alone system, a parallel server or a cluster. A server may or may not be dedicated to distributed data mining. A server’s physical configuration, such as the operating system, the CPU, the memory and the number of nodes (if the server is a cluster), is recorded. This information allows the federation manager to make a preliminary assessment of a system’s relative suitability for a given task.

Architectural Model. This component of a DDMS-ML document states whether the DDM system uses the client-server approach, the mobile agent paradigm or a hybrid model for distributed data mining. This information is important in situations where a client requires a particular architectural model. For instance, if the data to be mined is sensitive and the client does not want the data to be transported to the DDM server, the task calls for a DDM system that uses mobile agents. To protect the client further, the mobile agent can be destroyed once it has performed its task rather than being allowed to leave the site.

Data Types. This part of a DDMS-ML document states the data types that can be mined using a given DDM system. The following options have been specified: text, relational, spatial, temporal, image, video, multimedia, object-oriented and hypertext. However, to allow for flexibility and extensibility, the DTD allows data formats other than those listed above to be specified.

Specialisations and Features. This section of the document allows DDM systems in the federation to describe their distinguishing functions and special services. As with the data types, the DTD has some pre-specified options such as support for parallel algorithms, optimisation, cost-efficiency, pre-processing, mobile users and visualisation. However, it also allows a DDM system to present any other special features that it may possess. This component includes information about support for “knowledge integration”, which is the process of integrating results obtained from distributed data sets.

Algorithms. This component of a DDMS-ML document specifies the mining algorithms that are supported by the DDM system. The specification includes details such as the algorithm’s name, version and developer. This is followed by details regarding the structure of the input file for the algorithm, the input parameters, the command to call the algorithm and the output model produced. The current version of DDMS-ML only allows the specification of a text input file. We are working on a format for specifying relational data. Specifying the structure of complex data files is a non-trivial task and is not part of our current focus. For such input data, the DTD only allows the data type and the required file extension to be specified. This component of the document allows the federation manager to present details about algorithm usage to clients who might wish to use a particular system.

The DTD for DDMS-ML is as follows.
_______________________________________________________________________
<?xml version="1.0"?>
> file_extension CDATA #REQUIRED ______________________________________________________________________________________
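To illustrate how a federation manager might use such descriptions to locate a suitable DDM system, the following sketch parses two small XML documents with Python's standard `xml.etree.ElementTree` module and filters them by architectural model and data type. The element and attribute names (`ddm_system`, `architecture`, `model`, `data_type`) and the system names are hypothetical and are not taken from the DDMS-ML DTD; they only mirror the document parts described in this section.

```python
import xml.etree.ElementTree as ET

# Hypothetical DDMS-ML-like documents; element and attribute names are
# illustrative and are NOT taken from the DDMS-ML DTD.
DOCS = [
    """<ddm_system name="SystemA">
         <architecture model="client-server"/>
         <data_types><data_type>relational</data_type></data_types>
       </ddm_system>""",
    """<ddm_system name="SystemB">
         <architecture model="mobile-agent"/>
         <data_types>
           <data_type>text</data_type>
           <data_type>spatial</data_type>
         </data_types>
       </ddm_system>""",
]


def find_systems(docs, model, data_type):
    """Return names of systems matching an architectural model and data type."""
    matches = []
    for doc in docs:
        root = ET.fromstring(doc)
        arch = root.find("architecture")
        types = {t.text for t in root.iter("data_type")}
        if arch is not None and arch.get("model") == model and data_type in types:
            matches.append(root.get("name"))
    return matches


print(find_systems(DOCS, "mobile-agent", "text"))  # prints ['SystemB']
```

A client with sensitive text data that must not leave its site would thus be directed to the system that declares a mobile agent model and support for text. An XML query language could express the same selection declaratively; the loop above is only the simplest equivalent.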
The DDMS-ML DTD has been designed to allow DDM systems to describe themselves in a flexible but structured manner. It is easily extensible. The DTD is expressive enough to allow DDM systems to describe their distinguishing features, while at the same time enforcing a structure that captures the information a federation of data mining systems needs in order to function.

4. Case Study - A DDMS-ML Document for the DAME System

The DAME architecture is a hybrid model for distributed data mining which integrates the client-server and mobile agent paradigms. The system has evolved to address the issue of cost-efficient distributed data mining. A detailed discussion of the DAME architecture and cost models for optimisation of the distributed data mining process can be found in Krishnaswamy et al (2000b). In this section, we describe DAME using DDMS-ML to illustrate the applicability of the markup language. We first present an overview of the DAME architecture and then present a DDMS-ML document that specifies DAME.

4.1 Overview of the DAME Architecture

In this section we present a brief overview of the DAME system. The components of the DAME architecture illustrated in figure 4 are as follows:

Users. The users request data mining services by connecting to the distributed data mining server. The users communicate with the distributed data mining management system through the user manager component.

Dedicated Distributed Data Mining Server. This is a server with high computational power that acts both as the point of control for the distributed data mining process and as the provider of dedicated resources for mining. The server maintains the distributed data mining management system.

Distributed Data Mining Management System (DDMMS). The DDMMS is the software that performs the various tasks associated with the distributed data mining process. The DDMMS forms the core of this architecture and the way it is structured encapsulates the framework for resource optimisation. The components within the DDMMS are a user manager, an algorithm manager, an optimiser, a mining process manager and an agent control centre. We now outline the functionality and structure of each of these sub-components.
User Manager. The users connect to the distributed data mining system through the user manager. The user manager performs the following functions: authentication of users; profiling of the data mining task in terms of the user requirements, including the data mining query, the output required and the time frame within which the output is required; and supporting mobility of users by providing results and updates to users who are on the move and may not remain connected throughout the duration of the data mining process.

Algorithm Manager. The algorithm manager’s primary task is to maintain the data mining algorithms that are part of the distributed data mining system. Users can register any mining algorithm with the system. The users can choose to make the algorithms that they have registered available to other users. At the time of incorporating an algorithm into the system, the algorithm manager records meta-level information about the algorithm and its characteristics, such as name, version, input parameters, operating environment and output produced. The algorithm manager feeds this information to the mining process manager, which maintains profiles of algorithmic characteristics.
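The meta-level information recorded by the algorithm manager can be pictured as a small record per algorithm. The sketch below is an illustration only, not the DAME implementation (which is written in Java): the class names, the `owner` and `shared` fields and the keying by name and version are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class AlgorithmProfile:
    """Meta-level information recorded when an algorithm is registered."""
    name: str
    version: str
    input_parameters: Tuple[str, ...]
    operating_environment: str
    output_produced: str
    owner: str            # hypothetical field: the registering user
    shared: bool = False  # hypothetical flag: available to other users?


class AlgorithmManager:
    """Hypothetical in-memory store of algorithm profiles."""

    def __init__(self) -> None:
        self._profiles: Dict[Tuple[str, str], AlgorithmProfile] = {}

    def register(self, profile: AlgorithmProfile) -> None:
        # Key by (name, version) so several versions can coexist.
        self._profiles[(profile.name, profile.version)] = profile

    def visible_to(self, user: str) -> List[str]:
        """Names of algorithms a user owns or that others have shared."""
        return sorted(p.name for p in self._profiles.values()
                      if p.shared or p.owner == user)
```

For example, an algorithm registered by one user with `shared=True` appears in every user's list, while an unshared algorithm is visible only to its owner.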
Optimiser. The optimiser is the component that is primarily responsible for building an estimated cost of alternative strategies and determining the best option for performing the data mining task to meet user needs. The optimiser interacts with the mining process manager in order to collect statistics regarding the current status of the communication channels and the task profile (specifically to determine the user requirements for task completion and the algorithm allocated for the task). It also interacts with the agent control centre (i.e. the mine sweeper agent) for details regarding the data set size. Using the data collected by the mine sweeper and the mining process manager, the optimiser builds an estimated cost model for the alternative ways to perform the data mining and decides on the option that will meet the user requirements as closely as possible.
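The kind of comparison the optimiser performs can be sketched with a deliberately simplified cost estimate. The formulas below are assumptions made for illustration and are not the DAME cost models (for those, see Krishnaswamy et al (2000b)): the client-server cost is taken as the time to ship the data plus the time to mine it on the server, and the mobile agent cost as the time to ship the agent and its results plus the time to mine at the data site. All parameter names and default sizes are hypothetical.

```python
def client_server_cost(data_mb, bandwidth_mbps, server_rate_mb_s):
    """Ship the data to the DDM server, then mine it there (seconds)."""
    transfer = data_mb * 8 / bandwidth_mbps  # megabytes -> megabits
    return transfer + data_mb / server_rate_mb_s


def mobile_agent_cost(data_mb, bandwidth_mbps, site_rate_mb_s,
                      agent_mb=1.0, result_mb=0.5):
    """Ship the agent to the data, mine on site, ship results back (seconds)."""
    transfer = (agent_mb + result_mb) * 8 / bandwidth_mbps
    return transfer + data_mb / site_rate_mb_s


def choose_strategy(data_mb, bandwidth_mbps, server_rate_mb_s, site_rate_mb_s,
                    deadline_s=None):
    """Pick the cheaper strategy and report whether it meets the deadline."""
    costs = {
        "client-server": client_server_cost(data_mb, bandwidth_mbps,
                                            server_rate_mb_s),
        "mobile-agent": mobile_agent_cost(data_mb, bandwidth_mbps,
                                          site_rate_mb_s),
    }
    best = min(costs, key=costs.get)
    meets_deadline = deadline_s is None or costs[best] <= deadline_s
    return best, costs[best], meets_deadline
```

With a 1000 MB data set, an 8 Mbit/s link, a server that mines 50 MB/s and a data site that mines 10 MB/s, the estimates are 1020 s for client-server and 101.5 s for mobile agents, so the mobile agent strategy is chosen; with a small data set and a fast link the balance tips towards the client-server strategy.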
[Figure 4 labels: users (notebook, workstation, PC) and result; DDM server; DDM management system (optimiser, algorithm manager, user manager, mining process manager, knowledge integrator); agent control centre (mine sweeper agent, user agent, mining agent); local computational resources; resource monitoring agents; network monitoring agent; data servers 1 and 2; data transfer for mining locally; status monitoring information; legend: client-server model, mobile agent model.]
Figure 4 DAME Architecture for Distributed Data Mining
Mining Process Manager. This module forms the core of the distributed data mining system. It is the co-ordinating facility and the directory service for the different components of the system. The other components in the system access the mining process manager as a point of reference from which information can be obtained regarding the current status of various aspects of the system. To the best of our knowledge, the mining process manager is the first integrated attempt at dynamically tracking and specifying the components and their interactions within the distributed data mining framework. Broadly, the components whose states and status are relevant include users, data mining tasks, data mining servers, data servers, communication resources and algorithms. Because other components base their decisions on it, the recorded status information must be kept current at all times.
Agent Control Centre (ACC). The agent control centre is the framework within which the agent activities in the distributed data mining system take place. The ACC is responsible for activating/generating/assembling other agents required for the data mining process. It interacts closely with the DDMMS, particularly with the mining process manager. The ACC activates a distributed data mining task on the basis of instructions received from the DDMMS. The optimiser determines the appropriate model for mining. If this model involves mobile agents, then the ACC is responsible for controlling and co-ordinating the agents that perform the task. When the mining model to be deployed is client-server, the mining agents are instructed to perform the task locally. The ACC also provides the monitoring and status information from the network and the data sources to the mining process manager. The different agent types in the system are user agents, network monitoring agents, data resource monitoring agents, mine sweeper agents and mining agents.

4.2 DAME System in DDMS-ML

We now present a DDMS-ML document that describes the DAME architecture presented in section 4.1. The DAME system is currently being implemented in Java and uses the WEKA package of data mining algorithms (Witten et al, 1999). The current implementation only includes the ID3 algorithm, which mines text data in the ARFF file format (Witten et al, 1999). The DAME system currently uses three servers with the following operating systems: Solaris, Linux and NT. The following DDMS-ML document represents the current status of DAME.
<?xml version="1.0"?>