A Web Service-based approach for data mining in distributed environments

Ning Chen1, Nuno C. Marques1, and Narasimha Bolloju2

1 CENTRIA, Department of Information FCT, New University of Lisbon, Portugal
{nchen, nmm}@di.fct.unl.pt
http://centria.di.fct.unl.pt
2 Department of Information Systems, City University of Hong Kong
[email protected]
http://www.is.cityu.edu.hk

Abstract. Data mining is usually associated with centralized data mining systems. Here we present an approach to developing a data mining system in distributed environments. The main difficulties in such systems are the unrestricted sharing of information and the dynamic integration of components. In this paper, we present a Web Service-based approach to solve these problems. The system built using this approach offers a uniform presentation and storage mechanism, a platform-independent interface, and a dynamically extensible architecture. To achieve this, techniques and languages such as XML, SOAP, WSDL and UDDI are employed. A prototype system for a classification problem solving service is described as an example of the proposed approach.

1 Introduction

Data mining is the process of extracting interesting, implicit, previously unknown and potentially useful knowledge or patterns from data in large databases [1]. The application of data mining has grown explosively in diverse domains such as financial management, the retail industry and customer relationship management. A large variety of data mining tools, including commercial products, research prototypes and free packages, have appeared in the last decade. Most of these tools are designed for centralized data mining tasks, executing as stand-alone programs and serving only one user at a time. However, growing market demand is drawing attention to distributed data mining. Firstly, data and software are geographically distributed over a network instead of being located on a single computer. For example, each location of a company has a local database for its district. In such a situation, it is more practical to summarize the local models derived from district data than to combine the distributed databases and generate models on the whole data. Secondly, cost is another reason for distribution. In order to support various types of data mining, a company may have to use multiple data mining tools. Data mining in a distributed environment would further increase the cost of licensing because of the need for multiple copies of different types of tools. To protect their investment, users prefer to utilize


the components which meet their requirements instead of the whole package. A typical example is to develop a model within one vendor’s tool (possibly a new and free academic tool), and then visualize the model using a commercial visualization tool. Moreover, different data mining tools have distinct functionality and degrees of specialization. To obtain the optimal patterns, multiple data mining tools are usually involved in an application. The integration of multiple data mining tools in a distributed environment could be a solution to these problems.

Distributed data mining deals with the problem of finding patterns in an environment with distributed data and computation. It implies that the data mining tasks are done in an environment where the users, data, hardware, and mining software are geographically dispersed [2]. Centralized data mining systems do not satisfy the requirements of dynamic behaviour, cooperation and extensibility in a distributed environment. To establish a distributed data mining system, the following issues must be addressed:

– Multiple formats. In many organizations users hope not only to discover new models by themselves, but also to share acquired models in a distributed manner or to summarize a set of models for some domain. As the models generated by different tools are usually represented in different formats and structures, these models are difficult to exchange and share between tools.
– Inadequate methods for model validation and analysis. Model validation and analysis are important steps in deploying the discovered models to applications. Usually, it is easy to obtain models, but difficult to choose suitable models for applications. Also, effective methods for model analysis are seldom provided in current data mining systems. The difficulty lies in the diverse presentation and semantics of the models.
– Platform-dependent interfaces. Users may need to share their digital assets worldwide and utilize other useful assets quickly and cheaply. Since most data mining tools are implemented as stand-alone packages or with platform-dependent interfaces, integrating applications from different vendors is not simple.

This paper offers a viable approach to the above issues through Web services. A Web service is a collection of functions published to the network for use by other applications through a standard protocol. It offers the possibility of transparent integration between heterogeneous platforms and applications. Through Web services, various operations supplied by different providers can be aggregated to achieve a higher-level set of features. In this paper, we provide a framework based on Web services, integrating a number of data mining components, including data preprocessing, model construction and model deployment. In this framework, geographically distributed software components cooperate with each other and execute dynamically according to users’ preferences. Unlike prior contributions [2, 3], we care about the distribution of programs more than the distribution of data sources.

Based on this approach, a classification problem solving service prototype is implemented. In this prototype, a number of classification methods are implemented as Web services and published to UDDI (Universal Description, Discovery and Integration). Users are able to browse the available classification methods provided by the service providers and generate models using the chosen method via a platform-independent interface. Represented in a standard format, the models derived by different methods on different data can be interpreted by various tools.


The rest of the paper is organized as follows. Section 2 presents a general approach to a distributed data mining system and discusses the possible techniques for its implementation. Then in section 3, a classification problem solving service system is developed as an experimental prototype for the proposed approach. Some results are discussed in section 4. Section 5 concludes the paper and proposes some future work.

2 An approach for distributed data mining systems

In centralized data mining systems, the data mining process is composed of a sequence of closely coupled steps. This can produce a barrier to information exchange and system integration. We focus on the components needed to implement a solution supporting model integration and data sharing. With these in place, third-party components for data mining steps such as data selection, data cleaning, model extraction, model validation or model visualization [1] can be used.

A uniform data representation and storage mechanism is essential for distributed systems. To make data exchangeable, a cross-platform approach to data encoding and formatting is required. PMML (Predictive Model Markup Language) [4] is an XML-based language that provides a standard description of predictive models and enables the models to be shared. It gives applications a vendor-independent method of model definition, so that proprietary issues and incompatibilities are no longer a barrier to the exchange of models between applications. Data mining systems such as Intelligent Miner [5] and Enterprise Miner [6] can share models with other systems through import or export of PMML/XML models.

A module providing algorithms for effective model analysis allows us to perform model validation. Model validation identifies the interesting and useful models. Abstracting and comparing models to provide higher-level patterns and to discover trends over long periods is useful for decision makers [7]. Models derived by different users differ because of the mining strategy, subjective settings and local information. The use of XML for representing models gives a uniform representation of model knowledge on which analysis and comparison are possible. Since this analysis depends on user decisions, models are represented in XML and then stored in databases.
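As a rough illustration of how an XML encoding makes a tree model exchangeable between tools, the sketch below serializes a small decision tree into a PMML-like document and reads it back using only Python's standard library. The element and attribute names (DataDictionary, DataField, TreeModel, Node, SimplePredicate, score) follow the general shape of PMML tree models, but the document is a simplified assumption and is not claimed to be schema-valid PMML 1.1; the tree itself and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_model_xml():
    """Serialize a tiny decision tree as a simplified, PMML-like XML string."""
    pmml = ET.Element("PMML", version="1.1")
    dictionary = ET.SubElement(pmml, "DataDictionary")
    for name, optype in [("age", "continuous"), ("drug", "categorical")]:
        ET.SubElement(dictionary, "DataField", name=name, optype=optype)

    tree = ET.SubElement(pmml, "TreeModel", modelName="drug_tree")
    root = ET.SubElement(tree, "Node", score="A")  # default class at the root
    # Each child node carries one predicate (field, operator, value) and a class.
    young = ET.SubElement(root, "Node", score="A")
    ET.SubElement(young, "SimplePredicate",
                  field="age", operator="lessOrEqual", value="40")
    old = ET.SubElement(root, "Node", score="B")
    ET.SubElement(old, "SimplePredicate",
                  field="age", operator="greaterThan", value="40")
    return ET.tostring(pmml, encoding="unicode")


def classify(model_xml, record):
    """Apply the one-level tree stored in model_xml to a record (a dict)."""
    root = ET.fromstring(model_xml).find("TreeModel/Node")
    for child in root.findall("Node"):
        pred = child.find("SimplePredicate")
        value = float(record[pred.get("field")])
        threshold = float(pred.get("value"))
        op = pred.get("operator")
        if (op == "lessOrEqual" and value <= threshold) or \
           (op == "greaterThan" and value > threshold):
            return child.get("score")
    return root.get("score")  # no predicate matched: fall back to root class
```

Because the model travels as plain text, any tool that can parse XML can interpret it without access to the program that produced it.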
The analysis of models requires new solutions to the problem of mining XML data [8].

A platform-independent interface implemented as Web services is the central and unifying module of this approach. Once the services are deployed on a network, they can be used from anywhere through SOAP (Simple Object Access Protocol) and WSDL (Web Services Description Language). Thanks to the platform-independent interface, the system has a program-to-program architecture instead of the traditional customer-to-program architecture. Several data mining systems such as IBM Intelligent Miner [5], XELOPES [9] and PolyAnalyst [10] provide platform-independent interfaces, and can be easily integrated into the system.

Finally, the field of machine learning and data mining is advancing rapidly. A dynamic and extensible structure is needed to keep a data mining system up to date with the state of the art. In the proposed system, the components are dynamically organized and the operations provided by components are selected at runtime.


The operations are discovered through the UDDI interface; UDDI provides a standard way to describe and discover any kind of Web service [11].
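To make the program-to-program exchange more concrete, the following minimal sketch builds a SOAP 1.1 request envelope and shows how a service could select the operation to run at runtime from the message itself. Only the SOAP 1.1 envelope namespace is standard; the service namespace, the operation name BuildModel, its parameters and the dispatch table are assumptions made for illustration and do not reproduce the prototype's actual interface.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:classification"              # hypothetical service namespace


def build_request(operation, params):
    """Wrap an operation call and its parameters in a SOAP envelope."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, name).text = str(value)
    return ET.tostring(envelope, encoding="unicode")


def handle_request(message, operations):
    """Service side: read the operation name from the body and dispatch it."""
    body = ET.fromstring(message).find(f"{{{SOAP_NS}}}Body")
    call = body[0]                             # first child of Body is the call
    operation = call.tag.split("}", 1)[1]      # strip the "{namespace}" prefix
    params = {child.tag: child.text for child in call}
    return operations[operation](**params)


# Hypothetical operation table; a real service would run a mining algorithm here.
OPERATIONS = {
    "BuildModel": lambda method, target: f"{method} model for {target}",
}
```

A client would call `build_request("BuildModel", {"method": "C4.5", "target": "drug"})` and post the resulting text to the service; because both sides agree only on the message format, neither needs to know the other's platform or implementation language.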

3 A prototype of classification problem solving service

Classification is a supervised data mining process that derives models from a set of training data and then uses those models to classify new data. Decision trees are among the fastest and most easily interpreted methods for the classification task. We developed a classification problem solving service prototype using the approach described in Section 2. The prototype was developed on a Windows 98 system with Microsoft Personal Web Server, Visual Studio 6.0, SQL Server 2000, Microsoft UDDI registration, SOAP Toolkit 2.0, and the UDDI SDK. We are studying the possibility of migrating to an IBM platform.

In the prototype shown in Fig. 1, we implemented a set of components including data selection, method selection, model extraction, model representation, model visualization, model validation and model execution. Three decision tree methods (C4.5, EC4.5, DTree) are provided for model construction. The model construction tools, model validation tools and model execution tools are implemented as Web services and published to UDDI. Other components are implemented in ASP (Active Server Pages). In order to enable users to carry out their mining tasks, a Web site is provided. The prototype consists of the following components.

[Figure 1 omitted. Labelled elements: client users, network, Web services, ASP scripts, PWS / IIS, UDDI registration, MS Windows, Web server.]

Fig. 1. System architecture

UDDI registration. UDDI describes the information of Web services exposed in the registry. We used the Microsoft public UDDI registration [12] in this prototype.


Web services. The services are implemented using Visual C++ 6.0 as DLL wrappers containing a set of COM methods. The prepared services are published to the UDDI registration, describing the business, service, access point and service location.

Model base. The derived decision tree models are transformed to PMML and stored in a model base. Other information, including the username, the data set from which the model is derived, and the method and parameters used to create the model, is also stored. The model base provides acquisition, storage, and access to the models. A model warehouse can be built on top of model bases to store the primitive models, high-level models and historical models.

Database. The database and data warehouse store the data used to train, validate and execute models.

Web server. IIS/PWS serves as the Web server built on top of the Windows operating system. To build the Web site, HTML, ASP and JSP files are developed on the server.
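The model base described above can be pictured as a single table holding the PMML text together with its metadata. The sketch below is a minimal stand-in that uses SQLite from Python's standard library in place of the SQL Server 2000 installation used by the prototype; the table layout, column names and function names are assumptions chosen to mirror the fields listed in the text (username, data set, method, parameters, model).

```python
import sqlite3


def open_model_base(path=":memory:"):
    """Open (or create) a model base; an in-memory database by default."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS models (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               username TEXT NOT NULL,
               dataset TEXT NOT NULL,
               method TEXT NOT NULL,
               parameters TEXT,
               pmml TEXT NOT NULL)"""
    )
    return conn


def store_model(conn, username, dataset, method, parameters, pmml):
    """Store one model with its metadata and return its identifier."""
    cur = conn.execute(
        "INSERT INTO models (username, dataset, method, parameters, pmml) "
        "VALUES (?, ?, ?, ?, ?)",
        (username, dataset, method, parameters, pmml),
    )
    conn.commit()
    return cur.lastrowid


def find_models(conn, dataset):
    """Return (id, method, pmml) for every model derived from a data set."""
    rows = conn.execute(
        "SELECT id, method, pmml FROM models WHERE dataset = ? ORDER BY id",
        (dataset,),
    )
    return rows.fetchall()
```

Keeping the data set name and method next to each model is what later allows the system to retrieve all models applicable to a new classification request.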

[Figure 2 omitted. Labelled elements: Web user, Web server (service integration, model storage and retrieval), model base, UDDI services registration (UDDI API, services discovery, service description), Web Services API, remote Web service. Registry entry fields shown: name, contacts, description, identifiers, categories; service (1..n): name, description, technical information, categories; name, description, URL pointers to specifications. Numbered steps: (1) UDDI specification, (2) service description, (3) XML request, (4) operation call, (5) operation result, (6) XML response.]

Fig. 2. A procedure of model creation
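The numbered steps of Fig. 2 (discovery through the registry, followed by a request/response exchange with the chosen remote service) can be mimicked in a few lines. The sketch below is only a simulation under stated assumptions: the Registry class is an in-memory stand-in and does not use the UDDI Programmer's API, the access points are made-up URLs, and the "remote" services are local functions that return a placeholder result instead of a trained model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ServiceEntry:
    name: str
    category: str
    description: str
    access_point: str  # location at which the service can be reached


class Registry:
    """In-memory stand-in for a service registry (not the UDDI API)."""

    def __init__(self):
        self._entries: List[ServiceEntry] = []

    def publish(self, entry):
        self._entries.append(entry)

    def find(self, category):
        return [e for e in self._entries if e.category == category]


def make_stub(method):
    """Build a fake remote service that reports what it was asked to do."""
    return lambda data, params: {
        "method": method, "records": len(data), "params": params}


# Hypothetical remote services, keyed by their (made-up) access points.
REMOTE_SERVICES: Dict[str, Callable] = {
    "http://example.org/c45": make_stub("C4.5"),
    "http://example.org/ec45": make_stub("EC4.5"),
}


def create_model(registry, category, choose, data, params):
    candidates = registry.find(category)            # steps 1-2: discover services
    chosen = choose(candidates)                     # the user picks a method
    service = REMOTE_SERVICES[chosen.access_point]  # steps 3-4: request, operation call
    return service(data, params)                    # steps 5-6: result and response


registry = Registry()
registry.publish(ServiceEntry("C4.5", "classification", "decision tree",
                              "http://example.org/c45"))
registry.publish(ServiceEntry("EC4.5", "classification", "decision tree",
                              "http://example.org/ec45"))
```

The client code never names a concrete service; it only asks the registry for a category, so newly published methods become available without changing the client.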

Fig. 2 describes the steps to create classification models. An interface built on the UDDI Programmer’s API is provided for users to discover services dynamically and quickly. When the SOAP client (the Web site) queries the UDDI registration about the classification services, the UDDI registration looks up all the registered entries and returns those that satisfy the query. The result contains the tModel key, location, category and description of the services. The discovered services are listed in a Web page as the available methods for users. When a user chooses a classification method, the system gets the location of the chosen service and sends a request message containing the user data and parameters to the remote service server. After receiving the request, the service server performs the corresponding operation and returns a response message to the client, containing the derived model. Users are able to view the models using a visualization tool or to


store them in the model base. Models can also be tested by the model validation tool or applied to new data sets by the model execution tool.

As shown in Fig. 3, the system provides a set of operations through which users can perform their classification tasks. Table 1a presents the small drug data set, which records the features of 12 patients and the effective drug for some unspecified disease [13]. The data set is characterized by four fields (sex, age, blood pressure and drug), shown in Table 1b. The classification task finds predictive models of drug, called the target variable, depending on the other fields, called independent variables. The methods and parameters of the example application are described in Table 1c. In this case, the user creates models using each of the three methods. Fig. 3 displays the derived model in a graphical interface. The model is composed of a data dictionary and a tree model. The data dictionary describes the name and type of all fields and the values of categorical fields. The tree model shows the structural description of the decision tree, in which each node involves a predicate consisting of field, operator and value, together with a class.

The derived model can be stored for later decisions. That way, when users want to classify a new, previously unknown instance, several previously stored models may be available. In this case, a simple approach was chosen. We ask the user to supply a test data set (with known classes). The test data set is used to validate all the previously stored applicable models, and the best model is then chosen. If no such test data set exists, it is also possible to classify the unknown data using several known models and then assign the class given by the majority of them.

(a) Drug data

sex     age  blood pressure  drug
male    20   normal          A
female  73   normal          B
female  37   high            A
male    33   low             B
female  48   high            A
male    29   normal          A
female  52   normal          B
male    42   low             B
male    61   normal          B
female  30   normal          A
female  26   low             B
male    54   high            A

(b) Field description

field           type         values
sex             categorical  male, female
age             continuous   0-100
blood pressure  categorical  normal, high, low
drug            categorical  A, B

(c) Method setting

method  parameters
C4.5    use subset split on categorical variables; minimum records is 2, pruning rate is 25%
EC4.5   use gain as split criterion; strategy I for continuous variables
DTree   maximum height is 100; pessimistic pruning method

Table 1. Drug data and method setting
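The two strategies for using stored models (choose the model with the best accuracy on a labelled test set, or fall back to a majority vote when no test set exists) can be sketched as follows. The three models here are hypothetical hand-written rules standing in for stored decision trees, and the test records are five rows taken from Table 1a purely for illustration; the function names and the "bp" key for blood pressure are likewise assumptions.

```python
from collections import Counter


def accuracy(model, test_set):
    """Fraction of labelled records that the model classifies correctly."""
    hits = sum(1 for record, label in test_set if model(record) == label)
    return hits / len(test_set)


def select_best(models, test_set):
    """Return the name of the stored model with the highest test accuracy."""
    return max(models, key=lambda name: accuracy(models[name], test_set))


def majority_vote(models, record):
    """Classify a record with every model and return the most common label."""
    votes = Counter(model(record) for model in models.values())
    return votes.most_common(1)[0][0]


# Hypothetical stored models, written as plain functions for brevity.
MODELS = {
    "by_pressure": lambda r: "A" if r["bp"] == "high" else "B",
    "by_age": lambda r: "A" if r["age"] <= 40 else "B",
    "pressure_then_age": lambda r: (
        "A" if r["bp"] == "high" else
        "B" if r["bp"] == "low" else
        ("A" if r["age"] <= 40 else "B")),
}

# Five labelled records from Table 1a, used here as a small test set.
TEST_SET = [
    ({"age": 20, "bp": "normal"}, "A"),
    ({"age": 73, "bp": "normal"}, "B"),
    ({"age": 37, "bp": "high"}, "A"),
    ({"age": 33, "bp": "low"}, "B"),
    ({"age": 54, "bp": "high"}, "A"),
]
```

With a test set available, `select_best(MODELS, TEST_SET)` picks a single model to apply; without one, `majority_vote(MODELS, record)` combines all applicable models, which is the fallback described in the text.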

4 Results

The prototype is developed for Web users to generate models using a dynamic set of classification tools. Users are able to share the models derived by different classification methods applied to different data. The prototype illustrates how the requirements of data mining, such as the cost of using a variety of tools, dealing with multiple formats,


platform-dependency, etc. can be supported using Web Services. As a whole, it can be stated that Web services are a viable approach to building an open distributed data mining application.

Although the results are satisfactory, some improvements to the prototype are expected. In the present system, all services are implemented on the Windows platform. To test interoperability independently of the underlying languages and operating systems, one task is to implement different services in different environments. Also, the utilization of existing software is an important topic for preserving the investment in legacy systems. The inclusion of a commercial package, besides giving an empirical demonstration of platform independence, will allow us to use state-of-the-art commercial data mining tools to offer more data mining services.

This prototype uses a simple approach to execute the data mining service: it sends the data to the service and returns the obtained models to the application. Distributed data mining system techniques could be used to solve these problems [14]. In addition, the inefficiency of moving data from the source database to the destination where the data mining Web Service is located can be a limitation if large data sets are used. Mobile agents could be considered in such cases [2], transferring the programs to the clients and running them there.

5 Conclusion

In this paper, we presented an approach for a distributed data mining system and discussed its implementation strategy using Web Services. A prototype of a classification service system is described to illustrate the proposed approach. Currently, the prototype provides only three decision tree methods. However, the proposed system is extensible and will allow the inclusion of other supervised and unsupervised machine learning methods, such as feedforward neural networks and clustering through self-organizing maps. We hope to test platform independence by implementing an API interface to Intelligent Miner [5] and Matlab. Data preparation tools and model analysis tools will also be added.

We are now planning to apply the proposed system to a real-world problem. We will use the part-of-speech tagging problem in corpus-based analysis [15]. Currently several distinct users need to classify distinct, but compatible data sets. Also, multiple pre-learned classification models (part-of-speech taggers) have already been developed for distinct types of text. The approach proposed in this paper will allow users to classify new incoming data by selecting one of the previously learnt models. Since distinct types of text require distinct models, users should also be allowed to add new classification models to enrich system knowledge. Also, to help users better, we are considering learning user profile models (in a way similar to what is done in the PGR system [16]).

Fig. 3. Model visualization

References

1. Han Jiawei, Kamber Micheline: Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers (2001)
2. Krishnaswamy, S., Loke, S. W., and Zaslavsky, A.: Cost Models for Heterogeneous Distributed Data Mining. In: Proceedings of the 12th International Conference on Software Engineering and Knowledge Engineering (SEKE), Chicago, Illinois (2000) 31-38
3. Krishnaswamy, S., Zaslavsky, A., and Loke, S. W.: An Architecture to Support Distributed Data Mining Services in E-Commerce Environments. In: Proceedings of the 2nd International Workshop on Advanced Issues of E-Commerce and Web-Based Information Systems (WECWIS), Milpitas, CA, USA, IEEE Press (2000)
4. Data Mining Group: An Introduction to PMML 1.1-Ratified. http://www.dmg.org/v1-1/pmml_v1_1.htm
5. IBM: Intelligent Miner. http://www.ibm.com/
6. SAS: Enterprise Miner. http://www.sas.com/
7. Bolloju Narasimha: Extended Role of Knowledge Discovery Techniques in Enterprise Decision Support Environments. In: Proceedings of the 34th Annual Hawaii International Conference on Systems Science, Maui, Hawaii, USA (2001)
8. Azuaje Francisco: Advancing Post-genome Data and System Integration through Machine Learning. Comparative and Functional Genomics (2002) 3:28-31
9. ZSoft: Platform-independent Solutions for Data Mining. http://www.zsoft.ru
10. Megaputer: PolyAnalyst. http://www.megaputer.com/
11. UDDI: The UDDI Technical White Paper. http://www.uddi.org
12. Microsoft: UDDI Universal Discovery, Description and Integration. http://uddi.microsoft.com/inquire
13. Borgelt Christian: A Decision Tree Plug-In for DataEngine. In: Proceedings of 2nd Data Analysis Symposium, Germany (1998) Vol. 2, 1299-1303
14. Silberschatz Abraham, Korth Henry F., Sudarshan S.: Database System Concepts. McGraw-Hill (2001)
15. Marques Nuno, Lopes Gabriel: Tagging With Small Training Corpora. In: Hoffmann, F., Hand, D., Adams, N., Fisher, D. and Guimares, G. (eds.): Advances in Intelligent Data Analysis (LNCS 2189), 4th International Conference IDA, Springer Verlag (2001) 63-72
16. Quaresma Paulo, Rodrigues Irene, Lopes Gabriel, Almeida Teresa, Garcia Elsa, Lima Ana: PGR Project: The Portuguese Attorney General Decisions on the Web. In: Ciampi, C., Marinai, E. (eds.): Proceedings of The Law in the Information Society. Florence, Italy (1998)