The Development of a Wisdom Autonomic Grid

Peter Brezany1, Andrzej Goscinski2, Ivan Janciak3, and A Min Tjoa3

1 Institute for Software Science, University of Vienna, Nordbergstraße 15/C/3, A-1090 Vienna, Austria
E-mail: [email protected]
2 School of Information Technology, Deakin University, Geelong, Vic 3217, Australia
E-mail: [email protected]
3 Institute of Software Technology and Interactive Systems, Vienna University of Technology, Favoritenstraße 9-11/E188, A-1040 Vienna, Austria
E-mail: [email protected], [email protected]
Abstract

It is not a simple matter to develop an integrative approach that exploits the synergies between knowledge management and knowledge discovery in order to monitor and manage the full lifecycle of knowledge and to provide services quickly, reliably and securely. One of the main problems is the heterogeneity of the involved resources that represent knowledge. Data mining systems produce knowledge in a form meant to be understandable to machines, whereas in knowledge management systems the priority is placed on the readability and usability of knowledge by humans. The Semantic Web is a promising platform for unifying this heterogeneity and, in conjunction with novel techniques for Web Intelligence, it could offer more than just knowledge - wisdom. The Wisdom Autonomic Grid is an original proposal of a knowledge-based Grid that is able to configure and reconfigure itself under varying and unpredictable conditions, optimizes its working, performs something akin to healing and provides self-protection, as envisioned in the IBM Autonomic Computing initiative. This paper presents an original framework for creating advanced applications that integrate knowledge discovery and knowledge management in the Autonomic Grid and Web environments.
1. Introduction

There are many professions and professionals who require excellent information technology support to provide the best possible services. Examples of such professions are:
• Medicine, where a specialist must deal with difficult health problems and make correct decisions to help a patient;
• Banking and money lending, where a lending officer must make a decision as to whether to offer or refuse a loan;
• Industry, where an engineer must assess a risk in order to avoid losses.
Making these decisions (and others) is very difficult, as they require excellent knowledge of a given area, access to a variety of sources of information and supporting tools, advice from specialists of "neighbor" areas, and all of this provided in soft real-time. The situation has improved in the last decade with the development and availability of the Web and data mining tools. However, many of the specialists working in the same area use different terminologies. This implies that the databases that have been developed and are in use suffer from the interoperability problem - communication and cooperation are restricted. In response to these problems, "a new generation of Web technology, called the Semantic Web, is designed to improve communications between people using different terminologies, to extend interoperability of databases, to provide tools for interacting with multimedia collections" [2]. These ideas led Grid scientists to the notion of the Semantic Grid, where they plan to apply Semantic Web technologies in Grid computing developments [17]. Semantic Web technologies improve the interoperability of databases and allow users and
computers to communicate and cooperate efficiently. However, there is a problem with accessing the knowledge offered by specialists. Specialists working in a given or neighboring area are not available all the time to provide services to their peers. Furthermore, employing them at any time would be very expensive. For these reasons, there has been a need to automatically store and generate knowledge to support professionals and specialists. This issue has been addressed by Knowledge Discovery in Databases [15]. However, the challenging problem is how to mine huge, geographically distributed data repositories to provide the requested knowledge and enhance the existing knowledge.

Web Intelligence (WI) is a new direction for scientific research and development that explores the fundamental roles as well as practical impacts of Artificial Intelligence (AI) and advanced Information Technology (IT) on the next generation of Web-empowered products, systems, services, and activities [20]. WI aims to develop the Wisdom Web1 in order to help people achieve better ways of living, working, learning, etc., and so improve the quality of life. There are several research areas related to WI topics, for instance Web Agents, Web Farming, Web Mining, Web-Based Applications, Web Information Management, Web Human-Media Engineering, etc. Our contribution to this research area is to integrate knowledge discovery and knowledge management into an autonomic system that can give strong support to other intelligent entities in their needs for knowledge and appropriate ways (mechanisms) for practical knowledge application. We see its place in the space of the WI-related topics, where it can support them by managing the whole lifecycle of knowledge, from its discovery to its reuse and practical application.

Currently, professionals accessing databases and other data repositories within the Web and the Grid must be involved in many activities that are of a system nature. As the problems that professionals must deal with are getting more and more complicated, professionals should concentrate their efforts on their area activities. They should be supported by computing and communication services transparently. The professionals should not be aware that databases and supporting tools are distributed - the Grid, the Web and the Internet infrastructures should not be visible to them. It is very well known that computers and networks that form the Internet generate the problems
1 Based on the definitions provided by Webster's Dictionary and the Oxford Encyclopedic Dictionary, we understand wisdom as knowledge and experience, and the capacity to make due use of them.
of slow response, the need for dealing with failures, and security. Another important issue is the administration of such information and knowledge supporting systems. They are growing rapidly in scale and complexity and, consequently, their efficient administration is becoming a serious problem. A response to these requirements and problems can be found if the approach of Autonomic Computing [13,16] is applied in the development of the whole system.

The aim of this paper is to report on the outcome of our research on the development of a Wisdom Autonomic Grid (WAG)2 that is able to efficiently support specialists in a given area by exploiting data mining and knowledge management, provides the services requested by them using the available systems (databases, computers, software tools), is able to configure and reconfigure itself under varying and unpredictable conditions, optimizes its working, performs something akin to healing and provides self-learning and self-protection.

The structure of the rest of the paper is organized as follows. Section 2 presents the basic concepts that form WAG, specifies a generic set of requirements and refines them into a set of functional specifications. We explicitly state our design philosophy and goals, consider various design alternatives under those constraints, and present the WAG architecture design. Section 3 describes the components forming the wisdom functionality of WAG. The lowest layer of the WAG architecture, involving the Generic Grid services, is discussed in Section 4. The process of WAG Knowledge Discovery is introduced in Section 5, followed by the design of the autonomic services in Section 6. Related work is presented in Section 7, and we briefly conclude in Section 8. Throughout the paper, we use examples from an application addressing the management of hypertension risks as a case study.
2. Towards the Wisdom Autonomic Grid

To give a better picture of the whole concept, we distinguish between two basic levels that form WAG: (1) the user level and (2) the infrastructure level. The user level is the higher level, which provides services to a user (a medical specialist, a lending manager, an engineer) directly. The lower level offers the computing and communication services of the WAG. Their basic
2 The new Web Services Resource Framework and Web Services Notification [5] represent an important contribution to converging Grid and Web technologies. Due to this fact, we are using the term Grid to denote an integrated Web and Grid infrastructure.
assumptions, design goals, functionality, and techniques are illustrated in terms of the following scenario.
2.1. User level

Suppose a medical specialist has a patient with some health problems. The specialist wants to know what should be done next in the treatment of this patient, what the possible risks are, what kind of drugs could be used and what their impact on the patient's condition is, what outcome can be assumed for this and similar patients, etc. The sequence of steps carried out in this situation is as follows:
1. The patient introduces her/his health problems: some basic symptoms are described.
2. The specialist studies the results of some basic measurements carried out by a nurse.
3. The specialist, based on their experience, identifies a basic possible course of action and, with the gathered information in the form of facts and attributes, accesses a local database.
4. The first workflow, which specifies initial basic activities, is generated; it starts two basic communication and computation activities. The user front-end accesses other databases to collect information to support the diagnosis making process. The user front-end tries to access surgeries and hospitals in all places where the patient was living, to collect the patient's medical history.
5. The collected data are processed and a medical treatment workflow is automatically generated.
6. The specialist carries out an analysis of the received workflow and, if it is acceptable, starts the whole treatment process.

There are two sets of requirements that must be satisfied:
1. In the interest of the patient, all these steps must be performed efficiently, to allow the specialist to provide a good diagnosis during the appointment time; securely, to protect confidential information about the patient; and reliably, to save processing time should any computation carried out within the workflow fail.
2. In the interest of the specialist, all these steps must be performed trustworthily, to allow the specialist to be sure that all information provided by the system is current and comes from the most advanced sources in a given area; transparently, which means that the specialist is not aware of the locations of the accessed databases; securely, to provide secrecy and integrity of message content; and quickly, to serve a patient in a professional and timely manner.

In our example the specialist has several options of how to find answers to these questions. Traditionally, they can discuss these issues with their colleagues or with a specialist via an on-line chat, they can look into a patient database to see what development can be observed in other patients with the same or similar symptoms and diagnoses, they could search the Internet for appropriate medical articles and reports on this problem, etc. However, in this way it is not possible to obtain information urgently, which may often be a very critical issue in many medical applications, like the management of patients with traumatic brain injuries [3], or in crisis management environments in general.

The basic assumption of supporting professionals is that they should not be involved in any activities that are of a system nature; they should mainly concentrate their efforts on their area activities. To solve the above problem, advanced information technology support in computing and communication is needed which fulfills a set of requirements, including:
1. the ability to access and analyze a huge amount of information, which is typically heterogeneous and geographically distributed;
2. intelligent behavior - the ability to maintain, discover, extend, present, and communicate knowledge;
3. high performance (real-time or soft real-time) query processing;
4. high security guarantees.
This combination of requirements results in complex and stringent demands that, until recently, could not be satisfied by any existing computational and data management infrastructure.
2.2. Infrastructure Level

An analysis of the steps carried out at the user level shows that the basic infrastructure is based on the Internet, databases and Web services. However, these components satisfy only the requirements for communication and for remote computation and data access services. There are other requirements identified at the user level of the WAG system. Their analysis shows that they could be satisfied, and the WAG system offered, if the whole infrastructure is designed and
implemented with the Autonomic Computing characteristics in mind, thus forming an autonomic Grid. The goal is to hide all available information sources behind intelligent entities, for example agents, which are able to answer questions in real-time communication with a user. An agent by itself is not able to perform all the required activities and needs support from other systems in its search for answers; that is the reason why it needs partners which are able to transform its high-level requests into simple tasks. In tackling these problems, we have initiated a research effort to design and implement a novel framework called the Wisdom Autonomic Grid (WAG), which aims to fulfill the requirements outlined above.
2.3. Design and Development Goals

The above requirements were illustrated in the medical domain. However, a similar set of requirements can be derived from other application fields. They tell us what properties WAG must have in order to serve its users - to be successful. The next subsection introduces a validated set of requirements, which allows us to steer our efforts towards the important issues. However, it is just as important to have a clear set of design and development goals that we can use to guide the technical and methodological decisions that must be made as we develop the solution that meets those requirements. There are some fundamental issues that will drive our design and development. The WAG provides a framework that enables the development of intelligent Grid applications. This framework can be divided into two main categories:
• Technologies and Tools. They are intended to assist application developers in building wisdom and autonomic capabilities into their Grid and Web applications.
• Use Cases. They show how the technologies and tools available in the WAG work together, and how they can be used in realistic situations.

The WAG framework we are developing must be easily understood and straightforward to use. The primary aim of developing this framework is to illustrate how knowledge-based technologies, autonomic computing technologies, and Grid technologies can be used to significantly enhance the quality and usability of applications. Further, the framework must also be flexible, so that support for new applications, new Semantic Web techniques, new basic Grid technologies and other features can be easily added.

2.4. Functional Specifications

In this subsection, we take the requirements and our design goals and turn them into a list of functions that satisfy those requirements and goals. This defines what we have to build. Here is the functionality we need:
• Open architecture. In the context of the Grid documents [10,9], open is being used to communicate extensibility, vendor neutrality, and commitment to a community standardization process.
• Intelligent behavior. Interaction between a user and the WAG should be based on mutual understanding, and therefore an intelligent agent interface solution has to be considered.
• Semantic richness. A detailed semantic description of all system parts and their functionalities is required so that the system can be easily integrated as a Semantic Web application.
• Data distribution, complexity, heterogeneity, and large data size. WAG must be able to cope with very large and high dimensional data sets that are geographically distributed and stored in different types of repositories.
• Compatibility with existing Grid infrastructure. The higher levels of the WAG architecture use the basic Grid services for implementing wide-area communication, cooperation and resource management. Autonomic computing features enhance, at some level, the existing Grid services frameworks.
• Openness to tools and algorithms. The WAG architecture must be open to the integration of new data mining tools and algorithms. A wide spectrum of data mining strategies should be considered and accommodated in the system, and their impact on performance has to be investigated.
• Scalability. Architecture scalability is needed both in terms of the number of nodes used for performing the distributed knowledge discovery tasks and in terms of the performance achieved by using parallel computers to speed up the data mining task.
• Grid, network, and location transparency. Users should be able to run their data mining applications on the Grid in an easy and transparent way, without needing to know details of the Grid structure and operation, network features and the physical location of data sources. This is supported by a layer of Grid data virtualization services.
• Security and data privacy. Security and privacy issues are vital features in many data mining applications [4]. The Grid services offer valid support to the WAG system to cope with user authentication, security and privacy of data. However, some special security and data privacy services required, e.g., by medical and finance applications are not provided by Grid middleware; thus they are implemented as parts of the WAG services.
• OLAP Support. The architecture must allow interoperability with data warehouse and Online Analytical Processing (OLAP) services. Data mining and OLAP are two complementary methodologies which, if applied in conjunction [10,19], can provide powerful advanced data analysis solutions.
• Resource Discovery. WAG must be able to discover appropriate resources that can take part in the knowledge discovery process and be used in solving the desired problem.
• Automatic updating of domain knowledge. Discovered knowledge should be automatically deployed into the knowledge base for possible reuse.

Figure 1. Wisdom Autonomic Grid Architecture
2.5. The WAG Architecture

Now that we have specified the requirements, the design goals and the functions that our WAG architecture must provide, we have to make our design decisions. The first step is architecture design. Figure 1 depicts a layered architecture view of the WAG. All
the layers are supported by the Autonomic Support component of the WAG. The first step of the knowledge discovery process is handled by the Intelligent Interface and the rest by the Knowledge Management Infrastructure and cooperating Data Mining Infrastructure, the functionality of which is based on the Generic Grid Services layer. In the following sections we describe the functionality of these architecture components and the way they cooperate.
3. Knowledge Discovery Subsystem

In this section we deal with the heart of the WAG - the Knowledge Discovery Subsystem - and its components.
3.1. Intelligent Interface - Problem Solver

The Intelligent Interface is an autonomous agent which hides the whole complexity of the WAG structure and offers its capabilities to the user. Its autonomy resides in its ability to decide which activities will be executed in the whole process of knowledge discovery, and also to control these activities. It is able to communicate with the user in the Problem Solver Markup Language, construct workflows, execute data mining activities, act as a client to other services and present results. The Intelligent Interface is implemented by a set of intelligent agents called the Problem Solver Agents (PSA). These agents are associated with the ontologies in the Knowledge Base that determine their domains of interest. This means that each agent is identified by its associated domain and also by the range of questions to which it is able to give an answer. These agents are not isolated but cooperate and communicate with other agents or other virtual entities outside the system. To support the autonomic behavior of the whole system, the Intelligent Interface implements a highly specialized agent which initializes all the activities, controls their execution and tries to automatically solve their problems. The Intelligent Interface closely cooperates with the Knowledge Provider, which supports the Problem Solver Agents with information from the Knowledge Base, and also with the GridMiner system and its particular data mining services in the process of Knowledge Discovery in Databases. The PSA receives questions as messages in the FIPA ACL format [7], the bodies of which are specified in RDF [1]. As an example, the content of an
ACL message that invokes the process of searching for the hypertension risk of a given patient is shown below. The whole process is described in detail in Section 5.
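The message content can be sketched as follows; this is a minimal illustration based on the blood pressure ontology of Section 3.3.1 (the patient identifier ID123 is taken from that example, and the values 150 and 95 are the measured systolic and diastolic blood pressure):

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:bp="http://www.gridminer.org/kb/domain/health/bp.owl#">
  <bp:patient rdf:about="http://www.gridminer.org/kb/domain/health/bp.owl#ID123">
    <bp:hasBloodPressure>
      <bp:BloodPressure>
        <bp:SystolicBP>150</bp:SystolicBP>
        <bp:DiastolicBP>95</bp:DiastolicBP>
      </bp:BloodPressure>
    </bp:hasBloodPressure>
  </bp:patient>
</rdf:RDF>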
3.2. Data Mining Infrastructure - GridMiner

The Data Mining Infrastructure (DMI) provides WAG with the functionality of knowledge discovery in databases. It offers a wide range of data mining techniques which are applied in the process of knowledge extraction from various types of data sets, which are usually geographically distributed. For this purpose we use an already developed prototype for data mining on the Grid called GridMiner3. It is a service-oriented, Grid-aware data mining infrastructure with several knowledge discovery services able to perform high performance data mining and On-Line Analytical Processing (OLAP) algorithms. It allows an easy integration of existing and newly developed data preprocessing and data mining applications into the Grid.
3 http://www.gridminer.org
Figure 2. GridMiner architecture

GridMiner Architecture

Figure 2 illustrates the GridMiner architecture, consisting of components organized as layers. It is built on top of the Generic Grid Services (see Section 4). The lowest layer provides services that are common to a variety of knowledge discovery applications and helps reduce the complexity of the upper-layer services. The set of these services includes workload management, information and logging services, access to and manipulation of data repositories, and the management of replicas. On top of these common services, the knowledge discovery service layer contains services for preprocessing data sets and for data mining within these repositories. At the very top of the framework architecture, the GridMiner Orchestration Service allows workflows to be composed from activities (i.e. it groups several service invocations into workflows), handles failures and interacts transparently with optional components like the Resource Broker or the Replica Management components. The functionalities of the particular blocks are as follows:
• Workload management. Distributed scheduling and Grid resource management. The main objective is to allocate and assign resources to Grid applications effectively and efficiently.
• Information and logging. Information about Grid resources and scheduling, logging of job states and service messages.
• Repository Access and Manipulation. Access to and manipulation of data sources; currently OGSA Data Access and Integration4 is implemented.
• Replica management. Optimization of access to geographically distributed data sets and their management.
• Knowledge discovery services. Data preprocessing and data mining functionality.
• Coordinative and Collective Services (Orchestration). Composition of workflows and service invocations.

GridMiner services produce outputs in the Predictive Model Markup Language (PMML)5 format, which is an interchange format for the results of data mining tasks, depending on the selected method. The following example shows a "TreeModel" element, which is a part of a PMML document produced as a result of a classification task on the data stored in a cardiological database describing patients and their hypertension risk.
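A fragment of such a "TreeModel" might look as follows. This is an illustrative sketch only: the field names SBP and DBP and the simple two-level tree are assumptions, chosen to be consistent with the hypertension rule used in Section 3.3.3.

<TreeModel modelName="HypertensionRisk" functionName="classification">
  <MiningSchema>
    <MiningField name="SBP"/>
    <MiningField name="DBP"/>
    <MiningField name="risk" usageType="predicted"/>
  </MiningSchema>
  <!-- default class: low risk -->
  <Node score="low">
    <True/>
    <!-- high risk if both pressures exceed the thresholds -->
    <Node score="high">
      <CompoundPredicate booleanOperator="and">
        <SimplePredicate field="SBP" operator="greaterThan" value="140"/>
        <SimplePredicate field="DBP" operator="greaterThan" value="90"/>
      </CompoundPredicate>
    </Node>
  </Node>
</TreeModel>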
Figure 3. Knowledge Provider Architecture

3.3. Knowledge Management Infrastructure - Knowledge Provider

The Knowledge Management Infrastructure (KMI) aims at fulfilling the following objectives: it is responsible for maintaining the domain ontologies stored in the shareable knowledge base; it manages the life cycle of the knowledge from its discovery to its use; it stores the knowledge representing domains in ontologies and metadata and allows newly discovered knowledge to be added to the knowledge base; and it also stores information about sources (services) that can be used in solving data mining tasks. KMI is provided by a set of Web services implemented in the Knowledge Provider subsystem. Figure 3 illustrates the architecture of the Knowledge Provider, supported by the inference engine, the knowledge base, and the generator of rules and facts.
4 http://www.ogsadai.org.uk
5 http://www.dmg.org
Figure 4. Knowledge Base Architecture
3.3.1. Knowledge Base

As a well prepared knowledge base is half of the success of any intelligent system, its structure and organization are very important. The basic building blocks of the knowledge base in the Knowledge Management Infrastructure are facts, rules, ontologies, and metadata, as depicted in Figure 4. Ontologies describe particular domains and offer a semantic model of the resources and a detailed description of their functionality and usability. We have created our own ontologies which describe three basic domains: data sources, activities and data mining. All ontologies are implemented in the KB and specified in the Web Ontology Language (OWL)6. The following parts discuss the basic building blocks of the KB.

Data source ontology

The data source ontology is one of the basic building blocks of the KB. It describes all frequently used data types which could be involved in knowledge discovery. The main idea in creating this ontology was to separate the data repository from its content. Therefore, the main ontology class, datasource, has two properties: data repository and content. The data repository tells us what kind of technology is used to access the data source (e.g. a MySQL database, a binary/text file, OGSA-DAI, etc.). The content describes what is stored in this repository (e.g. a table, a picture, a text, etc.). The information about the content is usually also connected with classes of the domain ontology, e.g. the table PATIENTS keeps information about patients. This is extremely useful in searching the KB, because it is then possible to answer questions like "Where is a table about patients' diagnoses stored?" or "What is stored in the file CT.jpg?" and so on. A lot of other classes and properties precisely specify the special types and components of data structures, like attribute, xml element, text row, etc. Instances of this ontology are data sources which enter the process of knowledge discovery and also of knowledge storage and retrieval. The following example shows a fragment of the data source ontology specified in the OWL abstract syntax.

Annotation(rdfs:comment "Data source ontology")
Class(a:table partial
  restriction(a:hasAttribute minCardinality(0))
  restriction(a:table-name cardinality(1)))
Class(a:table partial annotation(rdfs:label "table"))
Class(a:attribute partial
  restriction(a:attribute-name cardinality(1))
  restriction(a:type cardinality(1)))
ObjectProperty(a:hasAttribute domain(a:table) range(a:attribute))
DatatypeProperty(a:attribute-name)
DatatypeProperty(a:table-name)
DatatypeProperty(a:type)
Data Mining Ontology

The data mining ontology is a special ontology about the data mining domain and the areas related to it. This ontology is helpful in the process of knowledge discovery, when an appropriate method and algorithm must be selected to obtain the demanded results. The ontology is based on the categorization of data mining tasks used in the latest Weka release (Weka7 is an open-source, Java-based data mining system), where three main data mining areas are defined, namely classification, association and clustering. We have also included descriptions of OLAP and some statistical functions. The basic building block of the ontology is a data mining task associated with a method and an algorithm. This ontology is not isolated but is very closely related to the data source ontology, which helps to define the data mining methods. It is mainly used to define the required inputs and outputs of these methods. For example, the class describing one of the algorithms, "DecisionTreeMethod", says that it needs a database table and one attribute to create a decision tree. (A sketch of this class is given after the activity ontology example below.)

Activity ontology

This ontology describes the executive parts of the system responsible for knowledge discovery. Such a part could be a Web or Grid service or an application able to perform the demanded tasks and return the required results. Our system is Grid service oriented and therefore this ontology deals with a detailed description of services. For this purpose, we use the OWL-S8 ontology, which allows a very elegant semantic definition of any Web service. OWL-S organizes a service description into three conceptual areas: the process model, the profile and the grounding, which describe what the service does, how the service works and how the service is used. It is not a problem that OWL-S is designed for Web services while our system is Grid service oriented, because we just want to express the semantics of services, and for such a purpose it is very suitable. The following example shows a fragment of the activity ontology. The basic class is activity, and the OWL-S class
6 http://www.w3.org/2001/sw/WebOnt
7 http://www.cs.waikato.ac.nz/~ml/weka
8 http://www.daml.org/services/owl-s/1.0
service is related to it with the hasService object property. Specific properties of Grid services, like the Grid Service Handle (GSH) or Service Data Elements (SDE), are added as properties of the class activity and do not extend or change the OWL-S ontology. This ontology also allows one service to be identified by several activities if the service provides different data mining tasks via different methods.

Annotation(rdfs:comment "Activity ontology")
ObjectProperty(a:hasService domain(a:activity) range(c:Service))
ObjectProperty(a:presents domain(a:activity) range(b:datamining))
Class(c:Service partial)
Class(a:activity partial
  restriction(a:presents someValuesFrom(b:datamining))
  restriction(a:hasService minCardinality(0))
  restriction(a:GSH cardinality(1))
  restriction(a:SDE maxCardinality(1)))
DatatypeProperty(a:GSH)
DatatypeProperty(a:SDE)
Class(d:resource partial)
Class(b:datamining partial)
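Returning to the data mining ontology, the "DecisionTreeMethod" class mentioned there could be sketched in the same abstract syntax as follows; all class and property names in this fragment are illustrative assumptions, not taken verbatim from the KB:

Class(b:DecisionTreeMethod partial b:ClassificationMethod
  restriction(b:hasInputTable someValuesFrom(a:table))
  restriction(b:hasTargetAttribute cardinality(1)))
ObjectProperty(b:hasInputTable domain(b:DecisionTreeMethod) range(a:table))
ObjectProperty(b:hasTargetAttribute domain(b:DecisionTreeMethod) range(a:attribute))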
Domain Ontology

This ontology is independent of the above ontologies and describes the domain of interest, i.e. an area where knowledge discovery is used. Such ontologies usually describe domains of real life, e.g. health care, insurance, sales, etc. Many ontologies are currently available on special ontology servers and do not have to be physically stored as parts of our knowledge base, because they can simply be referenced, which is a great advantage. The main goal of a domain ontology is to give a better understanding of the discovered knowledge and to help its sharing and reuse. Knowledge discovery processes that produce some rules can extend the knowledge about domains with new facts. The following example shows a very simple ontology about patients and their potential hypertension risk.

Annotation(rdfs:comment "Blood pressure ontology")
Class(a:patient partial
  restriction(a:hasHypertensionRisk allValuesFrom(a:Risk))
  restriction(a:hasBloodPressure maxCardinality(1)))
Class(a:BloodPressure partial
  restriction(a:DiastolicBP cardinality(1))
  restriction(a:SystolicBP cardinality(1)))
EnumeratedClass(a:Risk a:low a:high)
ObjectProperty(a:hasBloodPressure domain(a:patient) range(a:BloodPressure))
ObjectProperty(a:hasHypertensionRisk Functional domain(a:patient) range(a:Risk))
DatatypeProperty(a:DiastolicBP)
DatatypeProperty(a:SystolicBP)
Individual(a:high type(owl:Thing) type(a:Risk))
Individual(a:low type(owl:Thing) type(a:Risk))
DifferentIndividuals(a:low a:high)
Individual(a:ID123 type(a:patient) value(a:hasHypertensionRisk a:high))
Facts

Facts are the basic nuggets of knowledge and therefore they must be clearly defined and easily accessible. In our case, facts are instances of ontologies. They are used to state things that are unconditionally true in the domain of interest. For example, from the previous example of the domain ontology, we can state that the patient ID123 has a high hypertension risk. From a combination of facts and rules it is also possible to deduce new facts (inferred knowledge) that are not explicitly stored in the KB, but are implicitly present there.
Rules
This part of the KB contains rules that can be applied to the domain ontologies in order to retrieve some knowledge about those domains. Rules are generated in the process of knowledge discovery described in Section 5 and then inserted into the KB by the Generator of Rules and Facts. Rules are stored in the SWRL9 format, based on a combination of the Web Ontology Language and the Rule Markup Language (RuleML)10.

Metadata

Metadata represent additional information about the instances stored as facts. They extend the characteristics of data sources and activities, or of classes from the domain ontology, describing their content, quality, data structures, etc. They are stored in XML files and are referenced as properties of instances. For example, the metadata about a database table describe in detail all table attributes, their types, names, labels, etc. The structure of the metadata file is usually known and therefore its elements can be referenced from the instances by means of an XPath11 expression. The following instance of the attribute class from the data source
9 http://www.w3.org/Submission/SWRL
10 http://www.ruleml.org
11 http://www.w3.org/TR/xpath
ontology introduced above shows how metadata describing a database table in the WebRowSet format can be referenced from the attribute instance using XPath:

/webRowSet/metadata/column-definition[1]/column-type
/webRowSet/metadata/column-definition[1]/column-name
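For illustration, such an attribute instance could be written in the OWL abstract syntax used above (the individual name is our own placeholder):

Individual(a:patientNameAttribute type(a:attribute)
  value(a:attribute-name "/webRowSet/metadata/column-definition[1]/column-name")
  value(a:type "/webRowSet/metadata/column-definition[1]/column-type"))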
3.3.2. Inference Engine

The Inference Engine [18] operates on the KB with the help of its ontology models. It collects information from the KB, applies rules to it and locates appropriate patterns. It is used to retrieve facts from the KB based on other facts and rules. It is also able to deduce new, implicitly present facts and perform reasoning about them. The Inference Engine gives strong support to the Knowledge Provider in the process of retrieving knowledge from the KB.

3.3.3. Generator of Rules and Facts

This component is activated when newly discovered knowledge is going to be added to the KB. It takes the output from the KDD process and tries to incorporate it into the KB. The output from the GridMiner, which is represented in PMML, does not say anything about the domain ontology, because it deals with the data in the databases and not with their semantics. To obtain the knowledge, the result of the KDD process must first be mapped to the domain ontology and then converted into facts or sets of rules that can be applied to the domain ontology. This set of rules is generated in the process of transforming PMML into the Semantic Web Rule Language (see also Section 3.3.1). An example of a rule generated from the PMML example presented in Section 3.2 and mapped to the ontology about the hypertension risk is, written in the human-readable SWRL syntax:

hasBloodPressure(?patient, ?bp) ∧ SystolicBP(?bp, ?SBP) ∧ DiastolicBP(?bp, ?DBP)
  ∧ swrlb:greaterThan(?SBP, 140) ∧ swrlb:greaterThan(?DBP, 90)
  → hasHypertensionRisk(?patient, high)
4. Generic Grid Services

The lowest layer of the WAG layered architecture (Figure 1) provides a set of basic Grid mechanisms. Essentially all major Grid projects are currently built on protocols and services provided by the Globus Toolkit. The Globus internal structure supports a layered basic Grid architecture [11]. The Grid Fabric layer provides the resources to which shared access is mediated by Grid protocols. The Connectivity layer defines the core communication, authentication and authorization protocols required for Grid-specific transactions. The Resource layer defines protocols for the secure negotiation, initiation, monitoring, control, accounting, and payment of sharing operations on individual resources. The Collective layer contains protocols and services that are not associated with any one specific resource but rather are global in nature and capture interactions across collections of resources. Applications are constructed in terms of, and by calling upon, services defined at any layer. The Global Grid Forum [8] is currently working on several projects [10,12] which allow Grid services to be defined and accessed as Web services [14]. In the current WAG implementation effort, the Generic Grid Services layer is based on the Globus Toolkit 3, which aims to implement the Open Grid Services Architecture [10] concepts. In the near future, we are also going to start a new research task addressing the Web Services Resource Framework [5] and the associated software support tools.
5. Knowledge Discovery Process

The process of knowledge discovery comprises all activities that support the life cycle of the knowledge, from formulating a demand to presenting the results. The activity diagram shown in Figure 5 presents the flow of activities involved in the process of knowledge discovery in WAG. This process consists of two subprocesses: (1) retrieving available knowledge from the KB (Knowledge Retrieval from KB - KRKB) and (2) searching for unavailable knowledge by applying data mining techniques to data sets (Knowledge Discovery in Databases - KDD). We consider four components involved in the process of knowledge discovery, as depicted in Figure 5: (1) the User, (2) the Problem Solver, (3) the Knowledge Provider, and (4) the GridMiner.
• User - With this term we denote an actor asking for knowledge. It could be an agent, a service or a graphical user interface able to construct questions in a way that the intelligent interface understands. In our case, it should be able to formulate questions in the FIPA ACL/RDF message format. The user initializes the whole process of the knowledge discovery and receives the final results.
• Problem Solver - It is an intelligent interface able to receive messages with questions and transform them into particular actions. Its intelligence resides in the way it makes decisions about the steps involved in the whole process. It closely cooperates with the Knowledge Provider, which supports the Problem Solver with information stored in the KB.
• Knowledge Provider - This component manages the KB and has control of all related data entering the process of knowledge discovery. It supports the Problem Solver Agents with the knowledge stored in the KB, verifies the given data against the ontologies, and extends the facts, rules and metadata with new information.
• GridMiner - It is used to perform data mining activities on the selected data sets. It supports the process with newly discovered knowledge in a form that can be easily integrated into the KB.
5.1. Knowledge Retrieval from Knowledge Base

This is a process of reusing the knowledge stored in the KB. The process of knowledge retrieval (KR) from
KB (KRKB) starts when the received question is validated and transformed by the Problem Solver into a set of actions to be performed by the Knowledge Provider. KR uses the inference engine and its search ability to successfully find the required information in the KB. This process ends by transforming the results obtained from the Knowledge Provider into a form which can be presented to the user. The following scenario discusses this process in detail.

Figure 5. Activity diagram of the knowledge processing

The medical specialist wants to know what the hypertension risk of a patient who comes for a regular medical check is. To answer this question, the medical specialist must know all the factors that influence blood pressure and lead to hypertension, and also what categories of hypertension exist. This information can be retrieved from the knowledge base, where an ontology about the health care domain, and blood pressure in particular, is stored. The first step is to formulate a question for the Problem Solver. This is done by the specialist's user interface (it could be a personal agent with a special graphical interface), which is able to create and send a message in the FIPA/ACL format, as presented in Section 3.1. The Problem Solver Agent validates the content of the message and tries to associate it with the domain ontology. For this purpose, the Knowledge Provider is invoked to process the message and match all used terms with the domain ontology, to guarantee that the content of the message is fully understandable. When the matching is successful (the Problem Solver Agent understands the message), the knowledge retrieval process can start. At the beginning, a new model of the domain is created. It includes all known information about the selected domain, including facts and rules. When such a model has been created, the Knowledge Provider can receive queries and return answers about this domain, together with the reasoning of the inference engine. A very important part is the transformation of the message into a query or a set of queries that can answer the request contained in the message. This is done by the Problem Solver Agent, which is able to resolve this request from the message body. Queries are formulated in the Query Language for RDF (RDQL)12 implemented by the Jena Toolkit13 and used together with the external inference engine, which also reasons about the rules present in the model. Let us assume that there are no rules in the KB that can determine the hypertension category based on the given values of the blood pressure. Then the query constructed from the message presented in Section 3.1 is as follows; it returns only all the available categories of hypertension risk present in the domain ontology example in Section 3.3.1.

SELECT ?x
FROM <http://www.gridminer.org/kb/domain/health/bp.owl>
WHERE (?x, <rdf:type>, <bloodPressure:Risk>)
USING bloodPressure FOR <http://www.gridminer.org/kb/domain/health/bp.owl#>,
      rdf FOR <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

12 http://w3.org/Submission/RDQL
13 http://www.hpl.hp.com/semweb
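The following Java fragment sketches how the Problem Solver Agent might execute this query; it is a minimal illustration assuming the RDQL API of the Jena 2 toolkit, and it is not taken from the actual WAG implementation.

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdql.Query;
import com.hp.hpl.jena.rdql.QueryEngine;
import com.hp.hpl.jena.rdql.QueryResults;
import com.hp.hpl.jena.rdql.ResultBinding;

public class RiskCategoryQuery {
    public static void main(String[] args) {
        // Load the blood pressure domain ontology into a Jena model.
        Model model = ModelFactory.createDefaultModel();
        model.read("http://www.gridminer.org/kb/domain/health/bp.owl");

        // The RDQL query of Section 5.1; the FROM clause is replaced
        // by setting the source model programmatically.
        String queryString =
            "SELECT ?x "
            + "WHERE (?x, <rdf:type>, <bloodPressure:Risk>) "
            + "USING bloodPressure FOR <http://www.gridminer.org/kb/domain/health/bp.owl#>, "
            + "rdf FOR <http://www.w3.org/1999/02/22-rdf-syntax-ns#>";

        Query query = new Query(queryString);
        query.setSource(model);
        QueryResults results = new QueryEngine(query).exec();
        while (results.hasNext()) {
            ResultBinding binding = (ResultBinding) results.next();
            // Expected bindings: the individuals low and high.
            System.out.println(binding.get("x"));
        }
        results.close();
    }
}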
At the end, the Problem Solver Agent processes the information returned by the query, transforms it into a reply message and sends it to the user.
5.2. Knowledge Discovery in Databases

The KDD subprocess represents not only pure data mining; it also involves a set of activities needed to successfully start the data mining process and, at the end, to combine its results with the already available knowledge. This process starts with the identification of appropriate data mining tasks by searching the KB for eligible services and data sources. This process is supported by the GridMiner infrastructure.
As discussed in the scenario in the previous section, the medical specialist also wants to know the hypertension risk of the particular patient. This can be determined by classifying the patient into a risk category (class). Here we come to the point where some rules are needed that can be applied to the patient's data (the factors that influence blood pressure) to find which category should be chosen. Let us assume that there are no such rules in the KB and that they must be discovered in the available databases. The Problem Solver must identify from the KB all the factors that influence blood pressure, and then find an appropriate data mining method and a service for performing the classification task. It must select the data source containing the attributes related to the hypertension factors, as well as the attribute holding the risk category. Once all the needed information has been collected, the KDD process can start. As presented in Section 3.1, the Problem Solver creates an appropriate workflow of service executions and can act as a client for those services. The result of the KDD process can be a PMML document (see the example in Section 3.2), which is
parsed by the Generator of Rules and Facts and stored as an SWRL file, as shown in the example in Section 3.3.3. These new rules are combined with the data about the patient, and the new fact about the patient's risk category is communicated to the personal agent of the medical specialist.
6. Autonomic Support

The development of a Wisdom Autonomic Grid is determined by the requirements of the Wisdom Grid and the required Autonomic Computing features, which make it able to provide data mining and knowledge management services transparently, relieve professionals from activities that are of a system nature and allow them to concentrate their efforts on their area activities, and offer computing and communication services quickly, reliably and securely.
6.1. Usability and Design Considerations

The example of a patient visiting a medical specialist is used to explain the need for autonomic services of the Grid. The medical specialist should be able to communicate with the system using their professional language. There must be a Diagnosis Solver Markup Language (DSML) to specify their requirements and settings, as well as relationships with other services, e.g., hospitals and pharmacy stores. This language should exploit an inference engine that can do automatic reasoning, form a diagnosis workflow based on the information provided by the specialist (received from the patient), the specialist's knowledge and the patient's record, and help interpret this workflow. The engine should also update the local database and enhance the reasoning rules. The language should understand the specialist's terminology to save time when communicating with the Web.

Communication between the patient and the specialist is private (of course, there could be a person who accompanies the patient, as each visit is a stressful event). The information exchanged between these two parties is confidential. This information is put into the Wisdom Autonomic Grid in the belief that it will not be accessed or changed by anybody but trusted entities, and that it will be transferred securely, which means that it will not be monitored (read), changed or destroyed while in transfer. This implies that the Wisdom Autonomic Grid provides self-protection at the resource, service and communication levels. Both the patient and the specialist must be sure that the secrecy and integrity of the information is guaranteed. A Wisdom Autonomic Grid
should be able to know its components, and configure and reconfigure itself under varying and unpredictable conditions. The whole Grid infrastructure is composed of many computing systems, databases, data mining tools and applications that are connected by a number of networks that form a subset of the Internet. The system should know its static and dynamic attributes. The interfaces of all components should be well defined and easily accessible. Communication patterns and volume must be known to provide fast communication services. To support the medical specialist in forming a diagnosis, databases could be selected and connected automatically, and reconnected when they are, for example, overloaded or reestablished.

Supporting the specialist requires a soft real-time reaction to any request. This implies that the Wisdom Autonomic Grid must optimize its working. Replica databases should be exploited, and the communication cost taken into consideration when far remote devices are accessed. Computation and communication load must be balanced, which guarantees improved performance. As some requests are urgent, there is a need for real-time responsiveness of the Web. This makes the communication and execution requirements even stronger.

In a large Web, hardware and software faults can happen reasonably often. The specialist should not be aware of any faults. The requested execution performance should not deteriorate. Thus, the Wisdom Autonomic Grid must perform something akin to healing. This requires the replication of databases and distributed transactions. Furthermore, as computation processes, e.g., data mining services, can be affected, checkpointing and recovery must be employed to avoid restarting them. Even the migration of processes could be employed to improve the overall performance.

The Wisdom Autonomic Grid, despite the fact that it is based on a large distributed system, should be aware of other similar or complementary Grids, e.g., pharmacy and hospital grids. This means that it knows its surrounding environment, in particular the available resources (computers, clusters, unique peripheral devices, databases) and the provided services (gridminers). This implies a need for some cooperation. This cooperation requires that resource discovery of other Grids is provided, that the advertising of services is in place, and that negotiation and brokerage of resources and services are used. Thus, resources and services are shared in a distributed manner. This cooperation requires knowledge of the interfaces and descriptions of resources and services, to allow individual grids to select, negotiate and use the selected entities.
The Wisdom Autonomic Grid is a rather complex system requiring demanding administration activities. Therefore, another important functionality to be provided is the system administrator's interface, the handling of exceptions which would otherwise have resulted in system alerts, and the learning, by the system, of the actions taken by the administrator. The first research effort addressing these issues was conducted by IBM [6].
6.2. Development of Autonomic Services

The Wisdom Autonomic Grid involves autonomic support for each of its architecture levels, as depicted in Figure 1. Recently, we started a research effort which plans to extend the ongoing Globus Toolkit 4 (GT4) with autonomic system features, as they were specified by Horn [16]. This work will profit from the research results gained during the development of an autonomic operating system [13] for non-dedicated computing cluster systems. The GridMiner infrastructure already includes an advanced data mediation service, which supports the automatic integration of geographically distributed heterogeneous data sources into one virtual data repository. This significantly reduces the complexity of the interface between the data mining and OLAP services and the data sources, and simplifies the system administration as well. The advanced resource management and notification mechanisms of GT4 will allow data sources to signal changes in their contents to the data mining and OLAP services, according to their subscriptions. These services can then decide autonomously when to automatically start new data analyses and appropriately update the knowledge base. The Intelligent Interface is supported by an advanced Graphical User Interface, which provides a significant reduction in system complexity.
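As an illustration of this intended autonomic behavior, the following Java sketch shows how a data mining service, subscribed to a data source, might react to change notifications. The interfaces here are our own placeholders standing in for the GT4 notification mechanisms and the GridMiner and KB services; they are not the actual APIs.

public class AutonomicUpdateSketch {

    /** Placeholder for a change notification delivered via a subscription. */
    interface DataSourceListener {
        void contentChanged(String dataSourceUri);
    }

    /** Minimal facades standing in for the GridMiner and the KB. */
    interface MiningEngine {
        boolean changeIsSignificant(String dataSourceUri);
        String mineToSWRL(String dataSourceUri); // discovered rules as SWRL
    }
    interface KnowledgeBase {
        void addRules(String swrlRules);
    }

    static class AutonomicMiningService implements DataSourceListener {
        private final MiningEngine engine;
        private final KnowledgeBase kb;

        AutonomicMiningService(MiningEngine engine, KnowledgeBase kb) {
            this.engine = engine;
            this.kb = kb;
        }

        /** Called when a subscribed data source signals a content change. */
        public void contentChanged(String dataSourceUri) {
            // Decide autonomously whether a new analysis is worthwhile,
            // and if so, re-run the mining task and update the KB.
            if (engine.changeIsSignificant(dataSourceUri)) {
                kb.addRules(engine.mineToSWRL(dataSourceUri));
            }
        }
    }
}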
7. Related Work

The need for integrated knowledge discovery and knowledge management systems was expressed in the call for the workshop on Integrating Data Mining and Knowledge Management, held at the IEEE International Conference on Data Mining in 2001 (ICDM'01). Several approaches to such integrated systems have been proposed14. We are focusing our research effort on the Grid computing area, where there are also several projects aiming to build data mining systems. One of the first attempts to build a domain-independent knowledge discovery environment on the Grid is the Knowledge Grid project in Italy15. The Knowledge Grid is a system which is built on top of Grid environments and allows the creation of geographically distributed knowledge discovery applications. The Knowledge Grid uses the basic grid services, such as communication, authentication, information, and resource management, to build more specific parallel and distributed knowledge discovery tools and services. The Knowledge Grid services are organized into two layers: (1) the Core K-grid Layer, which is built on top of generic grid services and supports the definition, composition, execution and metadata management of data mining computations over the Grid, and (2) the High Level K-grid Layer, which is implemented over the core layer and provides services and tools for data access, tool and algorithm access, execution plan management and result presentation. The China Knowledge Grid Research Group16 goes further, putting Knowledge Grid ideas into practice in China's e-science Knowledge Grid environment. This group has been pursuing solutions for three fundamental issues of the future Knowledge Grid - normally reorganizing, semantically interconnecting, and dynamically clustering and fusing globally distributed resources [21].
8. Conclusions

In this paper we have presented the Wisdom Autonomic Grid Framework, which aims to build grid-based distributed knowledge discovery and management applications where the full power of the Grid and the intelligent system methodology are combined to fulfil the requirements of Web Intelligence - a novel development in Web research. To satisfy the user requirements of high performance, the ability to access any available database and computational tool, easy communication with the system, reliability and security, the Wisdom Grid infrastructure (currently being built) exploits the approaches of Autonomic Computing. The Wisdom Autonomic Grid Framework is a conjunction of the Grid Data Mining Infrastructure for data mining on top of the Autonomic Grid, able to perform high performance data pre-processing, mining, analysis and OLAP algorithms, and the Knowledge Management Infrastructure, able to process and share knowledge. This framework allows for the development of advanced Grid applications that enable better usability and a more complex and intelligent functionality for a wider scientific community, and also helps people achieve better ways of living, doing business, treating patients, working, learning, etc. The proposed Autonomic Grid is able to configure and reconfigure itself under varying and unpredictable conditions, optimizes its working, performs something akin to healing and provides self-protection, and thus relieves users from many activities that are of a system nature. We have also presented an implementation of the Wisdom Grid Framework, based on the GridMiner and Knowledge Provider infrastructures, that demonstrates the feasibility of the proposed concept and design.

14 http://cui.unige.ch/~hilario/icdm-01/cfp.html
15 http://dns2.icar.cnr.it/kgrid
16 http://kg.ict.ac.cn
Acknowledgements

This research is being carried out as part of the research projects "Modern Data Analysis on Computational Grids" and "Aurora", supported by the Austrian Research Foundation.
References

[1] Resource Description Framework (RDF). http://www.w3.org/RDF, February 2004.
[2] T. Berners-Lee, J. Hendler, and O. Lassila. The Semantic Web. Scientific American, May 2001.
[3] P. Brezany, M. Rusnak, and P. Tomsich. Knowledge grid support for treatment of traumatic brain injury victims. Submitted to the Conference on High-Performance Distributed Computing, Edinburgh, Scotland, July 2002.
[4] M. Cannataro, D. Talia, and P. Trunfio. Distributed data mining on the grid. Future Generation Computer Systems, 18:1101–1112, 2002.
[5] K. Czajkowski, D. F. Ferguson, I. Foster, J. Frey, S. Graham, et al. The WS-Resource Framework. http://www-106.ibm.com/developerworks/library/wsresource/wswsrf.pdf.
[6] R. T. et al. Usability and design considerations for an autonomic relational database management system. IBM Systems Journal, 42(4):568–580, 2003.
[7] Foundation for Intelligent Physical Agents. FIPA ACL message structure specification. http://www.fipa.org/specs/fipa00061/, 2000.
[8] Global Grid Forum. Mission and goals. http://www.globalgridforum.org.
[9] Global Grid Forum. Grid Service Specification, Nov 2002.
[10] I. Foster, C. Kesselman, J. Nick, and S. Tuecke. The physiology of the grid: An open grid services architecture for distributed systems integration, January 2002.
[11] I. Foster, C. Kesselman, and S. Tuecke. The anatomy of the Grid: Enabling scalable virtual organizations. Intl. J. Supercomputer Applications, 15(3), 2001.
[12] D. Gannon. Grid application design using software components and Web services. In Proceedings of the Workshop on Grid Computing, Denver, Colorado, USA, November 2001.
[13] A. Goscinski, J. Silcock, and M. Hobbs. Toward highly available, self-healing, adaptable, grid-like and user friendly nondedicated clusters. APAC Conference and Exhibition on Advanced Computing, Grid Applications and eResearch, Gold Coast, 2003.
[14] S. Graham, S. Simeonov, T. Boubez, G. Daniels, D. Davis, Y. Nakamura, and R. Neyama. Building Web Services with Java: Making Sense of XML, SOAP, WSDL, and UDDI. Sams, 2001.
[15] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000.
[16] P. Horn. Autonomic Computing: IBM's Perspective on the State of Information Technology. IBM, 2000.
[17] D. De Roure, N. R. Jennings, and N. R. Shadbolt. The semantic grid: A future e-science infrastructure. In F. Berman, A. J. G. Hey, and G. Fox, editors, Grid Computing: Making The Global Infrastructure a Reality, pages 437–470. John Wiley & Sons, 2003.
[18] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 2003.
[19] Y. J. Tam. Datacube: Its implementation and application in OLAP mining. MSc. thesis, Simon Fraser University, Canada, September 1998.
[20] N. Zhong, J. Liu, and Y. Yao, editors. Web Intelligence. Springer-Verlag, 2003.
[21] H. Zhuge. China's e-science knowledge grid environment. IEEE Intelligent Systems, 19:13–17, 2004.