Towards an Autonomic Wisdom Grid

Peter Brezany, Ivan Janciak, Andrzej Goscinski, and A. Min Tjoa

Institute for Software Science, University of Vienna, Nordbergstrasse 15/C/3, A-1090 Vienna, Austria

Institute for Software Technology and Multimedia Systems, Vienna University of Technology, Favoritenstrasse 9-11/E188, A-1040 Vienna, Austria

School of Information Technology, Deakin University, Geelong, Vic 3217, Australia

Abstract. Multi-agent systems, Grid technology, the Semantic Web, Autonomic Computing, and Web Intelligence are modern paradigms in information technology. The research effort described in this paper combines them to create a new-generation infrastructure called the Autonomic Wisdom Grid (AWG), whose mission is to maintain, share, discover, and expand knowledge in geographically distributed environments. This paper introduces the motivating ideas for this project, proposes the system architecture of an instance of AWG, and describes its functionality. The AWG represents an original framework for creating advanced applications that integrate knowledge discovery and knowledge management in Autonomic Grid and Web environments.

The work described in this paper is being carried out as part of the research projects “Aurora” and “Modern Data Analysis on Computational Grids” supported by the Austrian Research Foundation.

1 Introduction

Information technologies have become an inseparable component of modern society. Many professions require excellent information technology support to provide the best possible services. Examples are medicine, where a specialist must deal with difficult health problems and make a correct decision to help a patient; banking and money lending, where a lending officer must decide whether to offer or refuse a loan; and industry, where an engineer must assess a risk in order to avoid losses. Making these decisions (and others) is very difficult, as they require excellent knowledge of a given area, access to a variety of information sources and supporting tools, and advice from specialists in “neighboring” areas, all provided in soft real-time.

The situation has been improved in the last decade by the development and availability of the Web and data mining tools. However, many of the specialists working in the same area use different terminologies. This implies that the databases that have been developed and are in use suffer from an interoperability problem: communication and cooperation are restricted. In response to these problems “a new generation of Web technology, called the Semantic Web, is designed to improve communications between people using different terminologies, to extend interoperability of databases, to provide tools for interacting with multimedia collections” [1]. Further, there has been a need to automatically generate and store knowledge to support professionals and specialists. This issue has been addressed by Knowledge Discovery in Databases [6]. However, the challenging problem is how to mine huge, geographically distributed data repositories to provide the requested information and enhance existing knowledge.

Web Intelligence (WI) is a new direction for scientific research and development that explores the fundamental roles as well as practical impacts of Artificial Intelligence (AI) and advanced Information Technology (IT) on the next generation of Web-empowered products, systems, services, and activities [8]. WI aims to develop the Wisdom⁴ Web in order to help people achieve better ways of living, working, and learning, and so improve their quality of life. Several research areas are related to WI topics, for instance Web Agents, Web Farming, Web Mining, Web Based Application, Web Information Management, and Web Human-Media Engineering.

Our contribution to this research area is to integrate knowledge discovery and knowledge management into an autonomic system that can give strong support to other intelligent entities in their need for knowledge and for appropriate mechanisms for applying that knowledge in practice. We see its place in the space of WI-related topics, where it can support them by managing the whole life cycle of knowledge, from its discovery to its reuse and practical application.

Currently, professionals accessing databases and other data repositories within the Web and the Grid must be involved in many activities that are of a system nature.
As the problems that professionals must deal with are getting more and more complicated, professionals should concentrate their efforts on the activities of their own area. They should be supported by computing and communication services transparently. Professionals should not need to be aware that databases and supporting tools are distributed: the Grid, the Web and the Internet infrastructures should not be visible to them. It is well known that the computers and networks that form the Internet suffer from slow response and require failures and security to be dealt with. Another important issue is the administration of such information and knowledge supporting systems. They are growing rapidly in scale and complexity and, consequently, their efficient administration is becoming a serious problem. A response to these requirements and problems can be found if the approach of Autonomic Computing [5, 7] is applied in the development of the whole system.

The aim of this paper is to report the outcome of our research on the development of an Autonomic Wisdom Grid (AWG)⁵ that is able to efficiently support specialists in a given area by exploiting data mining and knowledge management, provides the services they request using available systems (databases, computers, software tools), is able to configure and reconfigure itself under varying and unpredictable conditions, optimizes its operation, performs something akin to healing, and provides self-learning and self-protection.

⁴ Based on the definitions provided by the Webster’s Dictionary and the Oxford Encyclopedic Dictionary, we understand wisdom as knowledge and experience, and the capacity to make due use of them.

⁵ The new Web Services Resource Framework and Web Services Notification [3] represent an important contribution to converging Grid and Web technologies. For this reason, we use the term Grid to denote an integrated Web and Grid infrastructure.

2 Towards the Autonomic Wisdom Grid

To give a better picture of the whole concept, we distinguish between two basic levels that form the AWG: (1) the user level and (2) the infrastructure level. The user level is the higher level, which provides services directly to a user (medical specialist, lending manager, engineer). The lower level offers the computing and communication services of the AWG.

[Figure 1, diagram not reproduced. Components shown: Intelligent Interface (Problem Solver); Knowledge Provider; Data Mining Infrastructure (GridMiner); Knowledge Management Infrastructure; Generic Grid Services (Globus Toolkit); Autonomic Support.]

Fig. 1. The AWG architecture

Fig. 1 depicts a layered architecture view of the AWG. All the layers are supported by the Autonomic Support component of the AWG. The first step of the knowledge discovery process is handled by the Intelligent Interface and the rest by the Knowledge Management Infrastructure and cooperating Data Mining Infrastructure, the functionality of which is based on the Generic Grid Services layer. In the following sections we describe the functionality of these architecture components and the way they cooperate.

3 Knowledge Discovery Subsystem

In this section we deal with the heart of the AWG, the Knowledge Discovery Subsystem, and its components.

Intelligent Interface - Problem Solver

The Intelligent Interface is able to communicate with the user in the Problem Solver Markup Language, construct workflows, execute data mining activities, act as a client to other services, and present results. It is implemented by a set of intelligent agents called Problem Solver Agents (PSA). These agents are associated with the ontologies in the Knowledge Base (KB), which determine their domains of interest. This means that each agent is identified by its associated domain and by the range of questions it is able to answer. The agents are not isolated: they cooperate with each other and communicate with other agents or other virtual entities outside the system.
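The identification of agents by domain and by range of answerable questions can be illustrated with a minimal Python sketch. It is not the AWG implementation: the class `ProblemSolverAgent`, the function `select_agents`, and the agent, domain, and question names are all hypothetical, and the ontology lookup is reduced to plain string matching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProblemSolverAgent:
    """Hypothetical stand-in for a PSA: a domain plus the questions it can answer."""
    name: str
    domain: str
    questions: frozenset

    def can_answer(self, domain, question):
        # An agent is eligible only if both its domain and the question match.
        return domain == self.domain and question in self.questions


def select_agents(agents, domain, question):
    """Return the names of all agents able to answer a question in a domain."""
    return [agent.name for agent in agents if agent.can_answer(domain, question)]


# Illustrative agents; in the AWG the domains would come from KB ontologies.
agents = [
    ProblemSolverAgent("psa-cardiology", "cardiology", frozenset({"diagnosis", "risk"})),
    ProblemSolverAgent("psa-lending", "lending", frozenset({"risk"})),
]
print(select_agents(agents, "lending", "risk"))  # ['psa-lending']
```

A request that no agent can handle yields an empty list, which is the point at which a real system might forward the question to agents or entities outside the system.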

[Figure 2, diagram not reproduced. Layers and components shown, from top to bottom: Collaboration & Collective Service Layer: GMOrchS (Orchestration). Knowledge Discovery Services: GMPPS (Data Preprocessing), GMDMS (Data Mining & OLAP). Workload Management: GMRB (Resource Broker). Information and Logging: GMIS (Information Service), GMLB (Logging Bookkeeping). Repository Access and Manipulation: OGSA-DAI, GM-Mediator (DS-Mediation). Replica Management: GMRM (Replica Manager), GMRC (Replica Catalog). Generic Grid Services.]

Fig. 2. The GridMiner architecture

Data Mining Infrastructure - GridMiner

The data mining infrastructure is realized by GridMiner⁶ [2], a service-oriented, grid-aware data mining infrastructure with several knowledge discovery services able to perform high performance data mining and OLAP algorithms. The GridMiner architecture (Fig. 2) consists of components organized as layers built on top of the Generic Grid Services. The lowest layer provides services that are common to a variety of knowledge discovery applications and help reduce the complexity of the upper-layer services. This set of services includes workload management, information and logging services, access to and manipulation of data repositories, and the management of replicas. On top of these common services, the knowledge discovery service layer contains services for preprocessing datasets and for data mining within these repositories. At the very top of the framework architecture, the GridMiner Orchestration Service allows workflows to be composed from activities (i.e., it groups several service invocations into workflows), handles failures, and interacts transparently with optional components such as the Resource Broker or the Replica Management components.

⁶ http://www.gridminer.org

Knowledge Management Infrastructure

The Knowledge Management Infrastructure (KMI) aims at fulfilling the following objectives: it is responsible for maintaining the domain ontologies stored in the shareable knowledge base; it manages and controls the life cycle of knowledge from its discovery to its use; it stores the knowledge representing domains in ontologies, metadata and facts; and it allows newly discovered knowledge to be added to the knowledge base and shared. KMI is supported by a set of services implemented in the Knowledge Provider. The Knowledge Provider architecture consists of three basic components: the inference engine (which reasons over and retrieves knowledge from the KB), the generator of rules and facts (which processes the results of knowledge discovery tasks), and the Knowledge Base.
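How the three Knowledge Provider components could interact is sketched below in Python. This is only an illustration under strong assumptions: the paper does not specify the reasoning strategy, so naive forward chaining over propositional facts is swapped in here, and the names `KnowledgeBase`, `generate_rule`, and `infer`, as well as the lending facts, are hypothetical.

```python
class KnowledgeBase:
    """Toy KB holding facts (strings) and rules (premises -> conclusion)."""

    def __init__(self):
        self.facts = set()
        self.rules = []


def generate_rule(kb, premises, conclusion):
    """Generator role: turn a knowledge discovery result into a stored rule."""
    kb.rules.append((frozenset(premises), conclusion))


def infer(kb):
    """Inference engine role: naive forward chaining until nothing new is derived."""
    derived = set(kb.facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in kb.rules:
            if premises <= derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return derived


kb = KnowledgeBase()
kb.facts.update({"stable_income", "no_defaults"})
# Rules such as these would be produced from data mining results.
generate_rule(kb, {"stable_income", "no_defaults"}, "low_risk")
generate_rule(kb, {"low_risk"}, "offer_loan")
print(sorted(infer(kb)))
```

The second rule fires only after the first has added `low_risk`, which is why the engine loops until a full pass derives nothing new.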

[Figure 3, diagram not reproduced. Components shown: Knowledge Base; Facts; Rules; Ontologies (Data Mining Ontology, Data Source Ontology, Activity Ontology, Domain Ontology); Metadata.]

Fig. 3. The Knowledge Base architecture

The basic building blocks of the KB are facts, rules, ontologies, and metadata, as depicted in Fig. 3. Ontologies describe particular domains and offer a semantic model of the resources together with a detailed description of their functionality and usability. We have created our own ontologies, which describe three basic domains: data mining, activities and data sources. The data mining ontology describes the highest abstraction level of the data mining domain, including data mining tasks, algorithms and models. The activity ontology supplies the system with descriptions of particular services, their input parameters and results. The different data types and data repositories that could be involved in knowledge discovery are described in the data source ontology. All ontologies are implemented in the KB and specified in the Web Ontology Language (OWL)⁷. OWL is based on the Resource Description Framework (RDF)⁸, which makes it possible to build a distributed and sharable KB that can easily be extended by referencing other ontology servers.

Generic Grid Services

The lowest layer of the AWG layered architecture (Fig. 1) provides a set of basic Grid mechanisms. Essentially all major Grid projects are currently built on protocols and services provided by the Globus Toolkit. In our current AWG implementation effort, the Generic Grid Services layer is based on the Globus Toolkit 3, which aims to implement the Open Grid Services Architecture [4] concepts.

4 Autonomic Support

The development of AWG is determined by the requirements of the Wisdom Grid and by the required Autonomic Computing features. Together they must provide data mining and knowledge management services transparently, relieve professionals of activities that are of a system nature so that they can concentrate on the activities of their own area, and offer computing and communication services quickly, reliably and securely.

4.1 Usability and Design Considerations

An example of a patient visiting a medical specialist is used to explain the need for autonomic services of the Grid.

The medical specialist should be able to communicate with the system using their professional language. There must be a Diagnosis Solver Markup Language (DSML) in which to specify their requirements and settings as well as relationships with other services, e.g., hospitals and pharmacy stores. This language should exploit an inference engine that can do automatic reasoning, form a diagnosis workflow based on information provided by the specialist and received from the patient, the specialist’s knowledge, and the patient’s record, and help interpret this workflow. The engine should also update the local database and enhance the reasoning rules. The language should understand the specialist’s terminology to save time when communicating with the Web.

Communication between the patient and the specialist is private (although a person may accompany the patient, as each visit is a stressful event). Information exchanged between these two parties is confidential. This information is put into AWG in the belief that it will not be accessed or changed by anybody but trusted entities, and that it will be transferred securely, meaning that it will not be monitored (read), changed or destroyed in transfer. This implies that AWG provides self-protection at the resource, service and communication levels. Both the patient and the specialist must be sure that the secrecy and integrity of information are guaranteed.

⁷ http://www.w3.org/2001/sw/WebOnt

⁸ http://www.w3.org/RDF

AWG should know its components, and configure and reconfigure itself under varying and unpredictable conditions. The whole Grid infrastructure is composed of many computing systems, databases, data mining tools and applications that are connected by a number of networks forming a subset of the Internet. The system should know its static and dynamic attributes. The interfaces of all components should be well defined and easily accessible. Communication patterns and volume must be known in order to provide fast communication services. To support the medical specialist in forming a diagnosis, databases could be selected and connected automatically, and reconnected when they are, for example, overloaded or re-established.

Supporting the specialist requires a soft real-time reaction to any request. This implies that AWG must optimize its operation. Replica databases should be exploited, and communication cost should be taken into consideration when remote devices are accessed. Computation and communication load must be balanced to improve performance. As some requests are urgent, there is a need for real-time responsiveness of the Web. This makes the communication and execution requirements even stronger.

In a large Web, hardware and software faults can happen reasonably often. The specialist should not be aware of any faults, and the requested execution performance should not deteriorate. Thus, AWG must perform something akin to healing. This requires replication of databases and distributed transactions. Furthermore, as computation processes, e.g., data mining services, can be affected, checkpointing and recovery must be employed to avoid restarting them. Even migration of processes could be employed to improve the overall performance.

Although AWG is itself based on a large distributed system, it should be aware of other similar or complementary Grids, e.g., pharmacy and hospital grids.
This means that it knows its surrounding environment, in particular the available resources (computers, clusters, unique peripheral devices, databases) and the services provided (gridminers). This implies a need for cooperation. This cooperation requires that resource discovery across other Webs is provided, that service advertising is in place, and that negotiation and brokerage of resources and services are used. Thus, resources and services are shared in a distributed manner. This cooperation also requires knowledge of the interfaces and descriptions of resources and services, to allow individual grids to select, negotiate and use the selected entities.

4.2 Development of Autonomic Services

AWG involves autonomic support for each of its architecture levels, as depicted in Fig. 1. Recently, we started a research effort that plans to extend the Globus Toolkit 4, which is under development and will be distributed by the end of 2004, with autonomic system features as specified by Horn [7]. This work will profit from the research results gained during the development of an autonomic operating system [5] for nondedicated computing cluster systems.

The GridMiner infrastructure already includes an advanced data mediation service, which supports the automatic integration of geographically distributed heterogeneous data sources into one virtual data repository. This significantly reduces the complexity of the interface between data mining and OLAP services and the data sources, and simplifies system administration as well. The Intelligent Interface is supported by an advanced Graphical User Interface, which significantly reduces the system complexity seen by the user.

The advanced resource management notification mechanisms of the Globus Toolkit 4 will allow data sources to signal changes in their contents to the data mining and OLAP services that have subscribed to them. These services can then decide autonomously when to start new data analyses and update the knowledge base accordingly.
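The subscription-driven behaviour described above can be sketched as a plain in-process observer pattern in Python. This is an assumption-laden illustration, not the Globus Toolkit 4 or WS-Notification API: the classes `DataSource` and `MiningService`, the row-count threshold policy, and the source name are hypothetical.

```python
class DataSource:
    """Hypothetical data source that notifies subscribers when content changes."""

    def __init__(self, name):
        self.name = name
        self.rows = 0
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def add_rows(self, count):
        self.rows += count
        # Signal the change to every subscribed service.
        for callback in self._subscribers:
            callback(self.name, count)


class MiningService:
    """Hypothetical service that decides on its own when to re-run an analysis."""

    def __init__(self, threshold):
        self.threshold = threshold  # assumed policy: re-analyse after this many new rows
        self.pending = 0
        self.runs = []

    def on_change(self, source_name, new_rows):
        self.pending += new_rows
        if self.pending >= self.threshold:
            # A real service would start a mining job and update the knowledge base.
            self.runs.append((source_name, self.pending))
            self.pending = 0


source = DataSource("lab_results")
service = MiningService(threshold=100)
source.subscribe(service.on_change)
source.add_rows(60)   # below threshold: no analysis yet
source.add_rows(50)   # 110 pending rows: analysis triggered
print(service.runs)   # [('lab_results', 110)]
```

Keeping the trigger policy inside the service, rather than in the data source, mirrors the idea that the services decide autonomously when a new analysis is worthwhile.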

5 Conclusions

In this paper we have presented the Autonomic Wisdom Grid Framework, which aims to support grid-based distributed knowledge discovery and management applications in which the full power of the Grid and intelligent system methodology are combined to fulfil the requirements of Web Intelligence, a novel development in Web research. To satisfy the user requirements of high performance, the ability to access any available database and computational tool, easy communication with the system, reliability and security, the Wisdom Grid infrastructure (currently being built) exploits the approaches of Autonomic Computing. The Autonomic Wisdom Grid Framework combines the Grid Data Mining Infrastructure, built on top of the Autonomic Grid and able to perform high performance data pre-processing, mining, analysis and OLAP algorithms, with the Knowledge Management Infrastructure, which is able to process and share knowledge. This framework allows for the development of advanced Grid applications that offer better usability and more complex and intelligent functionality to a wider user community.

References

1. T. Berners-Lee, J. Hendler, and O. Lassila. The Semantic Web. Scientific American, May 2001.
2. P. Brezany, J. Hofer, A Min Tjoa, and A. Woehrer. GridMiner: An Infrastructure for Data Mining on Computational Grids. In Australian Partnership for Advanced Computing (APAC), 2003.
3. K. Czajkowski, D. F. Ferguson, I. Foster, J. Frey, S. Graham, et al. The WS-Resource Framework. http://www-106.ibm.com/developerworks/library/ws-resource/wswsrf.pdf.
4. I. Foster, C. Kesselman, J. Nick, and S. Tuecke. The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration, January 2002.
5. A. Goscinski, J. Silcock, and M. Hobbs. Toward Highly Available, Self-Healing, Adaptable, Gridlike and User Friendly Nondedicated Clusters. APAC Conference and Exhibition on Advanced Computing, Grid Applications and eResearch, Gold Coast, 2003.
6. J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000.
7. P. Horn. Autonomic Computing: IBM’s Perspective on the State of Information Technology. IBM, 2000.
8. N. Zhong, J. Liu, and Y. Yao (eds.). Web Intelligence. Springer-Verlag, 2003.