Creating a Java- and CORBA-Based Enterprise Knowledge Grid Using Topic Maps Axel Korthaus, Tobias Hildenbrand University of Mannheim, Germany {korthaus|hildenbrand}@wifo3.uni-mannheim.de
Abstract
As the demand for structuring, retrieving, and discovering knowledge in distributed multi-site computer environments grows and grid computing provides a new IT infrastructure for secure high-performance applications, new concepts are needed to bring together current knowledge management requirements and technological opportunities. In this paper, we present some basic considerations with respect to the design and implementation of an open standards-based enterprise Knowledge Grid architecture using Topic Maps, the “Topic Map Grid”.
1. Introduction
With the development of the Grid towards a transparent and reliable infrastructure for both distributed computation and distributed data, new opportunities for effective and efficient knowledge management approaches are emerging. In the context of knowledge management, we think of the Grid as a means of utilizing geographically distributed information from several authorized entities within the network to retrieve and generate knowledge. In order to realize a new kind of knowledge retrieval grid service based on the metaphor of inducing “knowledge current” in a way comparable to electric power served by the power grid, not only a suitable grid infrastructure is needed, but also a standardized, simple, yet semantically sophisticated approach to the representation of information and a layer of tools and application services for the knowledge-oriented management of this information. The central goal is to enable a process of generating specific knowledge from a bulk of data through the Grid, which can be seen as a value-added service, enabling new forms of business models. In our paper we are presenting a service-oriented Knowledge Grid design exclusively using open standards, particularly Java and CORBA technology. The basic idea is to employ Topic Maps distributed over the grid nodes as a means of structuring and retrieving knowledge on a semantically rich level, and to enable queries directed to groups of grid nodes, each providing Topic Maps, which might be statically or dynamically merged in order to handle spanning queries. The notion of Topic Maps and our
concept of a Topic Map Grid will be explained in more detail in sections 2 and 3. Section 4 contains architectural considerations with respect to the design of our solution.
The work presented here is part of a project called “KnowME” (“Knowledge Management Environment”), which was started recently at the University of Mannheim and aims at conceptualizing and providing a comprehensive, service-oriented, open standards-based Knowledge Management architecture for small and medium enterprises, consisting of pluggable components, e.g. for document management, mobile knowledge management etc. The approach presented here will eventually cover all the layers of Keith G. Jefferey’s Grid vision of 1999 as described by De Roure et al. in [6] (cf. figure 1): the data/computation layer is realized via suitable grid and networking technologies; the information layer is formed by distributed information resources such as documents, web pages, graphics, emails etc.; and the value-adding knowledge layer services are implemented on top of the Topic Maps for Java (TM4J) Topic Map Engine, using XML Topic Maps (XTM) files or their respective database representation to provide the required knowledge retrieval and merging functionality (for more information about these technologies see the following sections).
Figure 1. Three Layer Grid Vision (Jefferey). Layers: Computation/Data Grid, Information Grid, Knowledge Grid; further labels: Data to Knowledge, Control.
2. The Topic Map Grid Concept
2.1. Foundation
As the basic data structure for representing associative knowledge and referencing external information resources we chose Topic Maps, which are derived from the theoretical concept of semantic networks often used to describe how human memory works. Since they facilitate semantic-based locating of information, Topic Maps are often dubbed “the GPS of the information universe”. [24] As figure 2 indicates, we consider the Topic Map concept as being institutionalized in a stack of open standards. The foundation of this stack is laid by the general concept of cerebral knowledge structures.
Figure 2. Topic Map Open Standards Layer Model. Layers, from top to bottom: Java level (TMAPI/TM4J), XML level (XML Topic Maps, XTM), ISO/IEC level (Topic Maps, 13250), conceptual level (Semantic Networks).
The original ISO specification was the first effort to operationalize this theoretical basis in the context of knowledge representation and retrieval. [15] Subsequently, the XML Topic Maps (XTM) standard simplified the syntax introduced by the ISO standard and paved the way for Topic Maps to become a knowledge interchange method. XTM uses standard XML and thus allows straightforward visualization by transforming XTM documents (e.g. to HTML) using eXtensible Stylesheet Language Transformations (XSLT). ([25], [22]) The Topic Map Application Programming Interface (TMAPI) represents an open standard collaboratively established by several Java Topic Map Engine vendors. This standard was strongly influenced by the Topic Maps for Java (TM4J) implementation, which we chose for our project because of its elaborate Topic Map Engine (TME) and persistence functionality. [26]
Figure 3 roughly illustrates our concept of a Topic Map Grid. We have a grid consisting of several grid nodes, each of which exposes one or more Topic Maps via a Java interface and optionally hosts a number of local information resources. More precisely, the resources depicted in figure 3 can be located either on the grid nodes or on the general Web. Information resources include inline Topic Map Occurrences (see section 3), different kinds of documents and, in general, all elements that can be addressed by a Uniform Resource Identifier (URI). Each grid member runs a service providing access to Topic Map objects, Topic Map merging functionality, and a query interface returning relevant Topics as a result. Queries can be directed to specific groups of grid nodes, potentially containing all available grid nodes, so that the Grid Architecture can be conceived as one single knowledge structure or knowledge resource, respectively, transparently integrating the computers connected to the Grid. The group-based query functionality, which involves merging of different topic maps, is provided by a special service component as described in section 4.
Figure 3. Topic Map Grid. Labels: The Grid; Topic Maps tm_1, tm_2, tm_3, tm_i, tm_n; Resources.
Before presenting the architectural details, we briefly discuss application domains as well as economic and business issues of the Knowledge Grid, and then offer detailed background information about the theoretical concepts and practical implementations of Topic Maps.
2.2. Application Domains
The idea of a knowledge grid environment arose from the e-science context. The Grid was intended to facilitate research work and to exchange knowledge effectively among collaborating distributed institutions. The World Wide Web was initially also created with this motivation and thought of as a purely scientific network. [6] We consider knowledge grids to be the future means for conducting both intra-organizational and inter-organizational knowledge management for companies, and thus an enabling technology for optimization and innovation of e-business processes and collaboration. XTM, as a simple and lean knowledge interchange standard, serves this purpose well and integrates with current Enterprise Application Integration (EAI) efforts using XML. Using an XML-based knowledge representation facilitates the establishment of portal applications directly derived from corporate knowledge, e.g. through transformation into HTML via XSLT.
Installing an intra-organizational grid represents the easier case in terms of control and coordination issues. If all grid nodes are subject to a common administration, corporate ontologies and security policies can be readily established. Devising domain-specific Topic Map Templates and ontologies, as intended by the ISO and XTM standard creators, offers opportunities for new business models. In the case of several independent organizations being involved in one grid with the shared goal of knowledge collaboration, issues of authorization, confidentiality, and common language arise. Considering the tremendous impact that the establishment of electronic data interchange (EDI) and follow-up technologies, e.g. e-business XML (ebXML), had on e-business processes, we anticipate prosperous opportunities for electronic knowledge interchange via XML Topic Maps.
2.3. Economic and Business Issues
From a theoretical economic perspective, a transparent knowledge grid connecting a selected number of business entities has the potential to save transaction costs. In particular, initiation costs for establishing mutual knowledge exchange are cut by the Grid’s “knowledge on demand” concept, where initiation efforts are conducted by the grid fabric or middleware, respectively. Regarding knowledge retrieval as a service to a Grid user, narrowing available information down to a pragmatic piece of information, i.e., knowledge, constitutes a value-added process in terms of value chain analysis and opens up room for new business models. The value of a knowledge grid also increases with the number of grid nodes providing their knowledge structures as Topic Maps. According to Metcalfe’s Law, “the value of a network grows in proportion to the square of the number of people using it”, provided that limiting factors such as search costs, clustering, saturation etc. as described in [20] do not prevail. The resulting phenomenon, the so-called “Network Effect”, states that the adoption rate of the network increases in proportion to its value. In our opinion, this self-reinforcing process will apply to knowledge grids as well.
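As a rough worked reading of Metcalfe’s Law, one common interpretation (an assumption here, not a statement from the cited sources) ties the value V of a grid with n participating nodes to the number of possible pairwise connections:

```latex
V(n) \;\propto\; \binom{n}{2} \;=\; \frac{n(n-1)}{2} \;\approx\; \frac{n^{2}}{2} \qquad (n \gg 1)
```

Under this reading, growing a grid from 10 to 20 nodes raises the number of possible node pairs from 45 to 190, i.e., roughly by a factor of four.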
3. Topic Map Standards
Since our vision of the Semantic Grid is based upon Topic Maps as a way to express meaning and therefore knowledge, we also have to present Tim Berners-Lee’s vision of the Semantic Web, which is based on a standard called Resource Description Framework (RDF), web agents, and ontologies, and compare both concepts. [7] Similar to our vision of a Semantic Grid, Berners-Lee deems the current Web the appropriate underlying infrastructure for his Semantic Web. Via RDF, meaning or semantics, respectively, is added to regular web resources. The three basic RDF components are the RDF data model, syntax, and the schema language RDF-Schema (RDFS). Resources are identified by a Uniform Resource Identifier (URI) and each Resource can be described by RDF Attributes. A Statement expresses a binary relationship between two Resources as a triple of subject, predicate, and object, i.e., one Resource is considered the subject and the other one the object, with the predicate describing the semantics of the relationship. [28]
The RDF meta-model is quite different from that of Topic Maps even though their underlying intentions are related. RDF solely focuses on describing and characterizing information resources, whereas Topic Maps try to establish a separate semantic network spanning those information resources in order to conduct effective knowledge management. In this respect both approaches model semantically interconnected information resources, but RDF starts with the resources as a basis, while Topic Maps evolve from semantics. The Topic Maps’ “unique selling proposition” lies in the fact that the actual data structure itself carries knowledge expressed by a markup language such as the Standard Generalized Markup Language (SGML) or the eXtensible Markup Language (XML). In addition, external resources can be referenced as so-called Occurrences. In this way, a Topic Map constitutes a superimposed layer or map of resources and, once established, provides a means for navigating through an arbitrarily large amount of information in order to retrieve knowledge. [25]
Figure 4 shows an example Topic Map with six topics, namely: University, University of Waterloo, University of Mannheim, Axel, Tobi, and Germany. We will refer to this example throughout the remainder of this section.
Figure 4. Example Topic Map. Node labels: University, University of Waterloo, University of Mannheim, Axel, Tobi, Germany, Canada; relation labels: of type, Works for, partners, citizen of; resources: www.uwaterloo.ca, www.uni-mannheim.de, German Citizens DB.
In the following subsections we will elaborate on each of the four layers of the Topic Map open standards stack as shown in figure 2.
3.1. Conceptual Level
As mentioned above, the Topic Map idea arose from the concept of semantic networks representing human memory. This concept of subjects or topics, respectively, interconnected by weighted associations, originates from cognitive psychology. The building blocks of those conceptual graphs are concepts (topics) and their conceptual relations. [24] The concept of semantic networks has proved useful in several application domains such as business administration: in marketing theory and consumer research, it is applied to explain consumer behavior from a theoretical perspective, e.g. to establish a schema of how consumers associate brands with other values, thus creating a brand image. [17]
3.2. ISO/IEC Level
The first attempt to institutionalize and standardize the Topic Map idea was the international standard ISO 13250, which describes an SGML notation for Topic Maps. [8] Each Topic Map entity, as a self-contained, interchangeable unit of knowledge, always consists of at least one SGML document. It may include and/or refer to additional information resources, i.e., inline text fragments or hyperlinks, respectively. A set of information resources comprising a complete interchangeable Topic Map can be specified using a bounded object set (BOS) as defined by the Hypermedia/Time-based Structuring Language (HyTime) standard. Topic Maps are a HyTime application, i.e., they utilize an adequate subset of this standard. The data exchange format for HyTime is also SGML. [15]
Abstract subjects/topics represent entities within the modeled knowledge context. As figure 4 indicates, topics can be persons, institutions, countries, and in general everything that can be described. A Topic can be an instance of several Topic Types, which are themselves defined as Topics. In our example, ‘University of Waterloo’ is of type ‘University’. Thus, hierarchical structures can be established.
Topic Characteristics comprise Topic Names, Occurrences, and Roles. Each Topic must have a Base Name. Optionally, a Display Name and a Sort Name may be specified; otherwise, the Base Name is used for visualization and sorting. Any Topic can refer to an arbitrary number of information resources, so-called Occurrences. The semantics of an Occurrence are described by its Occurrence Role, which is also modeled as a Topic. For example, the resource ‘www.uni-mannheim.de’ has the Occurrence Role ‘Institution Homepage’, as does ‘www.uwaterloo.ca’. Technically, Occurrences are implemented via HyTime or the XML technologies XLink and XPointer. Topics are uniquely identified by a Public Subject Descriptor (PSD), e.g. Axel’s social security number.
Associations describe relations between Topics. An arbitrary number of Topics may participate as members in one single Association. An Association is described by exactly one Association Type, which is also modeled as a Topic. An Association Role, itself a Topic instance, can be assigned to each Topic participating in an Association. The Association Role concept used here has semantics similar to those of roles in the well-known Unified Modeling Language (UML) and is part of the Topic Characteristics mentioned above.
Scope is another important constituent of the Topic Map ISO standard. Scopes are particularly important for solving the problem of homonymous subjects, i.e., two Topics with identical names but different meanings. Paris, for example, represents the French capital in the Scope of ‘Geography’, whereas in the Scope of ‘Greek Mythology’, Paris denotes a heroic figure. Scopes consist of at least one Theme, which in turn is a Topic instance itself. ‘Greek’ and ‘Mythology’ are both Themes forming the Scope ‘Greek Mythology’. Besides providing namespaces, Scopes can be used for many other purposes as well: access rights, expertise levels, validity limits, security, knowledge domains, product destinations, workflow management, and so on. [22]
Facets provide a means of adding optional name/value tuples to a Topic construct. Facets can be specified recursively by Sub-Facets.
The components mentioned above make up the actual Topic Map construct. The detailed syntax is defined in the Topic Map Meta Document Type Definition (DTD), i.e., the ISO standard document. Apart from the basic components, the authors of this standard propose to provide and use so-called Topic Map Templates, which contain the most important and frequently used Topics for certain domains and can be included in custom Topic Maps to serve as Types.
Providing standardized domain Templates constitutes a new business area pertaining to Topic Map-enabled Knowledge Management. [28]
When comparing the standards for Topic Maps and RDF, both provide a semantic annotation and classification for information objects, make use of references to both inline and external resources, and enable complex semantic-based queries. Instance models of each approach can be described in a standardized markup language, i.e., SGML or XML, but the RDF meta-model does not specify a separate semantic data structure, as it solely describes resources. Therefore, Topic Maps represent a topic-centric and RDF a resource-centric view on a semantically interwoven structure overlaying information resources. The two concepts differ only in their starting points and thus can be both transformed into each other and combined. [22]
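To make the constructs of this section more tangible, the following is a minimal Java sketch of Topics, Topic Types, Occurrences with Occurrence Roles, Associations with Association Roles, and Scopes. It is an illustration under our own assumptions, not part of ISO 13250 or of any Topic Map engine: all class, field, and method names are hypothetical, the role Topics ‘Employee’ and ‘Employer’ are invented for the example, and a Scope is reduced to a set of Theme names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative model only; names are hypothetical and not taken from any standard or library.
class TopicMapModelSketch {

    /** A Topic with a Base Name, an optional Scope, Topic Types, and Occurrences. */
    static class Topic {
        final String baseName;
        final Set<String> scope;                                      // Theme names; empty = unscoped
        final List<Topic> types = new ArrayList<>();                  // Topic Types are Topics themselves
        final Map<String, Topic> occurrences = new LinkedHashMap<>(); // resource -> Occurrence Role

        Topic(String baseName, String... themes) {
            this.baseName = baseName;
            this.scope = new HashSet<>(Arrays.asList(themes));
        }
    }

    /** An Association has exactly one type and any number of members, each playing a role. */
    static class Association {
        final Topic type;
        final Map<Topic, Topic> members = new LinkedHashMap<>();      // member -> Association Role
        Association(Topic type) { this.type = type; }
    }

    final List<Topic> topics = new ArrayList<>();
    final List<Association> associations = new ArrayList<>();

    Topic add(Topic topic) { topics.add(topic); return topic; }

    /** Returns the Topics with the given name whose Scope contains the given Theme. */
    List<Topic> find(String name, String theme) {
        List<Topic> hits = new ArrayList<>();
        for (Topic t : topics) {
            if (t.baseName.equals(name) && t.scope.contains(theme)) hits.add(t);
        }
        return hits;
    }

    /** Builds a small map from the examples used in the text. */
    static TopicMapModelSketch example() {
        TopicMapModelSketch tm = new TopicMapModelSketch();
        Topic university = tm.add(new Topic("University"));
        Topic homepage = tm.add(new Topic("Institution Homepage"));
        Topic worksFor = tm.add(new Topic("Works for"));
        Topic employee = tm.add(new Topic("Employee"));   // hypothetical Association Role
        Topic employer = tm.add(new Topic("Employer"));   // hypothetical Association Role

        Topic mannheim = tm.add(new Topic("University of Mannheim"));
        mannheim.types.add(university);
        mannheim.occurrences.put("www.uni-mannheim.de", homepage);
        Topic waterloo = tm.add(new Topic("University of Waterloo"));
        waterloo.types.add(university);
        waterloo.occurrences.put("www.uwaterloo.ca", homepage);
        Topic axel = tm.add(new Topic("Axel"));

        Association employment = new Association(worksFor);
        employment.members.put(axel, employee);
        employment.members.put(mannheim, employer);
        tm.associations.add(employment);

        // Homonymous Topics, told apart by their Scopes.
        tm.add(new Topic("Paris", "Geography"));
        tm.add(new Topic("Paris", "Greek", "Mythology"));
        return tm;
    }

    public static void main(String[] args) {
        TopicMapModelSketch tm = example();
        for (Topic t : tm.find("Paris", "Geography")) {
            System.out.println(t.baseName + " in scope " + t.scope);
        }
    }
}
```

Modeling types and roles as references to other `Topic` objects mirrors the self-describing character of the standard, in which Topic Types, Occurrence Roles, Association Types, and Association Roles are all Topics.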
Figure 5. Example XTM File (excerpt containing the topics Axel and University of Mannheim and the worksFor Association).
3.3. XML Level
The XML Topic Maps (XTM) specification issued in 2001 basically adheres to the primary Topic Map constructs like Topics, Associations, Names, and Scopes. Through the transformation of the ISO 13250 Meta DTD into one single XML DTD, the usability of Topic Maps increased significantly, not only as a means of knowledge representation but also for standardized knowledge interchange, e.g. in the realm of Enterprise Application Integration (EAI). Since Topic Maps are highly mergeable, their capability to be combined by a set of merging rules and/or ontologies enables novel knowledge management practices. [22]
In the same way XML was created to reduce SGML’s complexity to those features essentially needed on the Web, the XTM standard was designed to simplify the initial ISO specification “for optimized use on the web”. XTM limits addressing to simple XLink URI syntax and no longer includes the original Facet concept, which can also be expressed with inline Occurrences. Furthermore, the conceptual Topic Map model is expressed in a more explicit way. XTM specifies a set of fixed DTDs and no longer Architectural Forms only, as is the case with ISO Topic Maps and HyTime. [8] [15] Other slight differences are the preferred use of element types instead of attributes on the XML level, and the generalization of Display Names and Sort Names into Variant Names on the XTM level. A complete list of innovations can be found in [22]. The basic design principles for XTM are simplicity and neutrality, and thus flexibility in knowledge engineering.
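As a rough illustration of the XTM 1.0 syntax described here, the sketch below embeds a small hand-written XTM fragment in a Java program and reads it with the JDK's standard DOM parser. The fragment is not the original listing from the paper: the topic ids (`axel`, `uni-mannheim`, `worksFor`) are assumptions made for this example, and the helper assumes that every `xlink:href` is a local reference starting with `#`.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Sketch only: reads associations from a hand-written XTM 1.0 fragment.
class XtmReadSketch {
    static final String XTM_NS = "http://www.topicmaps.org/xtm/1.0/";
    static final String XLINK_NS = "http://www.w3.org/1999/xlink";

    // Illustrative fragment; the ids are assumptions, not taken from the paper's listing.
    static final String SAMPLE =
        "<topicMap xmlns='" + XTM_NS + "' xmlns:xlink='" + XLINK_NS + "'>"
      + "<topic id='worksFor'><baseName><baseNameString>Works for</baseNameString></baseName></topic>"
      + "<topic id='axel'><baseName><baseNameString>Axel</baseNameString></baseName></topic>"
      + "<topic id='uni-mannheim'><baseName><baseNameString>University of Mannheim"
      + "</baseNameString></baseName></topic>"
      + "<association>"
      + "<instanceOf><topicRef xlink:href='#worksFor'/></instanceOf>"
      + "<member><topicRef xlink:href='#axel'/></member>"
      + "<member><topicRef xlink:href='#uni-mannheim'/></member>"
      + "</association>"
      + "</topicMap>";

    /** Returns one string per association, e.g. "type[member1, member2]". */
    static List<String> describeAssociations(String xtm) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // XTM and XLink live in separate namespaces
            Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xtm.getBytes(StandardCharsets.UTF_8)));
            List<String> result = new ArrayList<>();
            NodeList assocs = doc.getElementsByTagNameNS(XTM_NS, "association");
            for (int i = 0; i < assocs.getLength(); i++) {
                Element assoc = (Element) assocs.item(i);
                String type = "";
                List<String> members = new ArrayList<>();
                NodeList refs = assoc.getElementsByTagNameNS(XTM_NS, "topicRef");
                for (int j = 0; j < refs.getLength(); j++) {
                    Element ref = (Element) refs.item(j);
                    // Assumes a local reference of the form "#id".
                    String target = ref.getAttributeNS(XLINK_NS, "href").substring(1);
                    String parent = ref.getParentNode().getLocalName();
                    if ("instanceOf".equals(parent)) type = target;      // the Association Type
                    else if ("member".equals(parent)) members.add(target);
                }
                result.add(type + members);
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("could not read XTM fragment", e);
        }
    }

    public static void main(String[] args) {
        for (String line : describeAssociations(SAMPLE)) System.out.println(line);
    }
}
```

Note how the Association Type is not a literal string inside the `association` element but a reference to a separate `topic` element, resolved through its id.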
Figure 5 shows an excerpt from the XTM file modeling the knowledge structure from the university context depicted in figure 4. The listing underlines the fact that Association Types are also modeled as Topics and referenced by their Topic ID via XLink. Using pure XML syntax also brings the advantage of easy transformation and graphical representation of Topic Maps via XSLT, e.g. for constructing dynamic web sites and portal applications from basic Topic Maps. It has to be made clear that creating and maintaining semantically correct and reasonably complete Topic Maps implies an immense amount of effort for an organization. In order to exploit XTM’s knowledge interchange and merging capabilities, common vocabularies, so-called ontologies, need to be defined. They have to ensure that distributed knowledge structures with varying identifiers merge correctly and make use of the possibility to generate new knowledge. In order to express a certain consensus among the parties involved, the Published Subject Indicator (PSI) concept can be utilized, which is explained in ([26], [22]). After having completed their work on the XTM specification, TopicMaps.org dissolved into the Organization for the Advancement of Structured Information Standards (OASIS), which is now concerned with the application level of XTM. OASIS provides recommendations about standard processes and best practices. [22]
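The role of a shared subject identity in merging can be sketched in a few lines of Java. This is a deliberately simplified illustration, not the merging rules of the XTM specification or of TM4J: Topics from different maps are treated as the same subject if and only if they carry the same PSI string, and merging just forms the union of their names and occurrences. The PSI URIs under `psi.example.org` and all class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified illustration of PSI-based merging; not the full XTM merging rules.
class PsiMergeSketch {

    static class Topic {
        final String psi;                                   // shared subject identity
        final Set<String> names = new LinkedHashSet<>();
        final Set<String> occurrences = new LinkedHashSet<>();
        Topic(String psi) { this.psi = psi; }
        Topic name(String n) { names.add(n); return this; }
        Topic occurrence(String uri) { occurrences.add(uri); return this; }
    }

    /** Builds a Topic Map as a lookup table keyed by PSI. */
    static Map<String, Topic> topicMap(Topic... topics) {
        Map<String, Topic> map = new LinkedHashMap<>();
        for (Topic t : topics) map.put(t.psi, t);
        return map;
    }

    /** Merges any number of maps: Topics with equal PSI collapse into one Topic. */
    static Map<String, Topic> merge(List<Map<String, Topic>> maps) {
        Map<String, Topic> merged = new LinkedHashMap<>();
        for (Map<String, Topic> map : maps) {
            for (Topic t : map.values()) {
                Topic target = merged.computeIfAbsent(t.psi, Topic::new);
                target.names.addAll(t.names);
                target.occurrences.addAll(t.occurrences);
            }
        }
        return merged;
    }

    static final String UNI_PSI = "http://psi.example.org/university-of-mannheim"; // hypothetical
    static final String AXEL_PSI = "http://psi.example.org/axel";                   // hypothetical

    static Map<String, Topic> example() {
        Map<String, Topic> nodeA = topicMap(
            new Topic(UNI_PSI).name("University of Mannheim").occurrence("www.uni-mannheim.de"),
            new Topic(AXEL_PSI).name("Axel"));
        Map<String, Topic> nodeB = topicMap(
            new Topic(UNI_PSI).name("Uni Mannheim"));       // same subject, different local name
        return merge(Arrays.asList(nodeA, nodeB));
    }

    public static void main(String[] args) {
        for (Topic t : example().values()) {
            System.out.println(t.psi + " -> " + t.names + " " + t.occurrences);
        }
    }
}
```

Without the agreed identifier, the two university Topics would remain separate after merging, which is exactly the situation the common vocabularies are meant to prevent.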
3.4. Current ISO Standardization Efforts
The International Organization for Standardization (ISO) included XTM in its latest activities and drafted a three layer model for its efforts pertaining to Topic Maps.
First, there is the Modeling Layer, consisting of a Reference Model and a Standard Application Model. Second, the Syntax Layer comprises the ISO 13250 and XTM syntax. On the third level, the Constraints and Query Layer, the Topic Map Query Language (TMQL) and the Topic Map Constraint Language (TMCL) are introduced [22].
3.5. TM4J and Topic Map API (TMAPI)
The open source Java API Topic Maps for Java (TM4J), among other things, grants Java programmers access to knowledge structures previously defined as a flat XTM file. TM4J provides an object-oriented model onto which the Topic Map constructs are mapped. Based on this object model, Java applications are able to read, alter, and write XTM files. TM4J also implements a flexible backend persistence system in which Topic Map object structures can be kept in memory, written into an integrated object-oriented database (OODB), or mapped onto a Relational Database Management System (RDBMS). Figure 6 depicts the basic architecture of TM4J.
Figure 6. TM4J Architecture. Labels: .xtm files, Topic Map Engine, Access, Java Object Model; backends: Memory, OODB, RDBMS.
The framework not only manages access to XTM files but also includes a Topic Map Engine (TME) that provides additional services such as merging several Topic Maps and a semantic query interface. For query purposes, TM4J is equipped with an indexing subsystem accepting queries about Topic Types, Association Types, Occurrence Locations, and Scopes. The TME manages the merging of Topic Maps, i.e. the combination of an arbitrary number of Topic Maps to generate an enriched Topic Map structure. For this purpose, ontologies again come into play, which are in turn defined as Topic Maps. TMQL is currently being developed by an ISO task force and will eventually be included as the query language of choice. At the moment, TM4J supports tolog, a Prolog- and SQL-related query language developed by Ontopia. [26]
Several TME vendors decided to define a vendor-independent Java Topic Map API (TMAPI), creating an abstract knowledge layer based on Topic Maps by providing reference interfaces. Programmers must adhere to this API in order to keep their Java Topic Map applications portable with respect to different vendor-specific implementations of the TMAPI functionality. The customer-oriented goal of this alliance is to enable easy substitution of TMEs from different vendors abiding by the TMAPI standard.
4. The Knowledge Grid Architecture
After having provided the technical background of Topic Maps, we can now take a closer look at the architectural design of our Topic Map Grid solution. Since Java and CORBA are the central technologies in our current approach, we start with a brief rationale for choosing these two technologies, before we elaborate on our layered architecture model of a Topic Map Grid and present a core component of the approach, namely a CORBA-based Object Group Service and a Join Service, in detail.
4.1. Java-Based Implementation
The general advantages of using a Java-based approach in combination with the Topic Map standards stack illustrated in figure 2 originate from Java being an open standard as well, though financially fostered by Sun Microsystems. Thus, we are using open standards consistently to develop our Knowledge Grid Architecture. Java is a modern, object-oriented, and platform-independent programming language supported by a wide range of programmers spread over several communities. It is well suited to network-intensive applications such as the Knowledge Grid and offers elaborate class libraries for them. Java 2 Standard Edition (J2SE) also includes the libraries necessary to parse and transform XML, e.g. in order to construct web sites or enterprise portals from Topic Map structures. As we try to focus on future business applications boosted by knowledge grid technology as an enabling infrastructure, another reason to choose Java was that the Java 2 Enterprise Edition (J2EE) and related technologies are widely used and established in international business contexts. By means of J2EE technology, publishing a grid service as a Web Service can readily be realized.
4.2. CORBA-Based Infrastructure
We chose the Common Object Request Broker Architecture (CORBA) [21] as the basic communication architecture for our prototypical design. The CORBA standard is widespread in the field of object-oriented and distributed systems. It provides independence from computer architectures and programming languages as well as the possibility for the user to choose an Object Request Broker (ORB) product independently of the vendor. A prerequisite for this last characteristic was the introduction, in CORBA 2.0, of a mechanism for uniquely referencing objects on the basis of so-called Interoperable Object References (IORs) and of a standardized transmission protocol, the Internet Inter-ORB Protocol (IIOP). Thus, applications that were developed using different programming languages can be made to interoperate.
For the description of interfaces belonging to classes that offer their services, the Object Management Group (OMG) has specified the Interface Definition Language (IDL). IDL is a declarative language, i.e., it is used to describe data types and interfaces by specifying their attributes, operations, and exceptions, but not their actual implementation algorithms. IDL is the basis for programming language independence: the transformation into a concrete programming language does not take place until an IDL interface is compiled using an IDL compiler for the target language. Besides the language mappings described in the OMG standard, which include the mappings from IDL to Ada, C, C++, COBOL, Java, Lisp, Python, and Smalltalk, there are a number of non-standard language mappings to further programming languages such as Eiffel, Objective-C, and Perl, which only exist in certain ORB products.
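To illustrate the separation between an IDL declaration and its implementation, the sketch below shows a hypothetical IDL interface (in the comment) next to hand-written Java that approximates the operations interface an IDL-to-Java compiler would derive from it. The IDL itself, the names `TopicMapQuery`, `QueryFailed`, and `topicsOfType`, and the in-memory implementation are assumptions made for this example; no ORB classes are used, so that the code runs on any current JDK, and a real servant would additionally have to be registered with an ORB.

```java
import java.util.ArrayList;
import java.util.List;

/*
 * Hypothetical IDL declaration mirrored by this sketch (not taken from the paper):
 *
 *   module topicmapgrid {
 *     typedef sequence<string> TopicNames;
 *     exception QueryFailed { string reason; };
 *     interface TopicMapQuery {
 *       readonly attribute string nodeName;
 *       TopicNames topicsOfType(in string typeName) raises (QueryFailed);
 *     };
 *   };
 */
class IdlMappingSketch {

    /** Java counterpart of the IDL exception; its member becomes a field. */
    static class QueryFailed extends Exception {
        final String reason;
        QueryFailed(String reason) { super(reason); this.reason = reason; }
    }

    /** Approximation of the generated operations interface: only signatures, no logic. */
    interface TopicMapQueryOperations {
        String nodeName();                                         // readonly attribute -> accessor
        String[] topicsOfType(String typeName) throws QueryFailed; // sequence<string> -> String[]
    }

    /** Local in-process implementation standing in for a CORBA servant. */
    static class InMemoryTopicMapQuery implements TopicMapQueryOperations {
        private final String nodeName;
        private final String[][] typedTopics; // pairs of {topic name, type name}

        InMemoryTopicMapQuery(String nodeName, String[][] typedTopics) {
            this.nodeName = nodeName;
            this.typedTopics = typedTopics;
        }

        @Override public String nodeName() { return nodeName; }

        @Override public String[] topicsOfType(String typeName) throws QueryFailed {
            if (typeName == null || typeName.isEmpty()) {
                throw new QueryFailed("type name must not be empty");
            }
            List<String> hits = new ArrayList<>();
            for (String[] pair : typedTopics) {
                if (pair[1].equals(typeName)) hits.add(pair[0]);
            }
            return hits.toArray(new String[0]);
        }
    }

    static TopicMapQueryOperations exampleNode() {
        return new InMemoryTopicMapQuery("node-1", new String[][] {
            {"University of Mannheim", "University"},
            {"University of Waterloo", "University"},
            {"Axel", "Person"}                      // "Person" is an illustrative type
        });
    }

    public static void main(String[] args) throws QueryFailed {
        TopicMapQueryOperations node = exampleNode();
        for (String name : node.topicsOfType("University")) {
            System.out.println(node.nodeName() + ": " + name);
        }
    }
}
```

Client code depends only on the operations interface, so the same call could be served by an implementation written in any other language for which an IDL mapping exists.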
4.3. The Basic Topic Map Grid Architecture
Our goal was to describe a flexible, structured, and well-defined system, so we chose to subdivide the architecture of our system into several modular layers. We distinguish the Technical Foundation Layers, which provide the basic infrastructure including the Grid fabric and services, from the Knowledge Application Layers, which contain Topic Map-oriented services, tools, and applications (cf. figure 7).
According to Foster, Kesselman, and Tuecke [14], the Grid is an infrastructure that allows end users to share both information and computing resources in secure environments. It has been proposed to solve the general problem of “coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations”. These virtual organizations are formal or informal communities sharing a set of resources under some well-defined rules. In their paper, they also present the idea of a layered Grid architecture consisting of lower levels providing middleware support for higher level application-specific services. Among others, Grid middleware addresses technical challenges with regard to communication, scheduling, security, information, data access, and fault detection [9].
Several technologies are discussed in the literature with respect to the implementation of Grid services, among them Java technologies such as Jini, JXTA for peer-to-peer solutions, and Java Remote Method Invocation (RMI), as well as middleware technologies for distributed objects such as CORBA and, more recently, Web Services ([5], [6]). In addition, there are specialized and comprehensive grid infrastructures such as Legion or the widely acknowledged Globus Toolkit. [12] The latter is probably the most sophisticated Grid middleware available today, although it does not run on Windows. The current version of Globus (GT3) includes some Java components, but also core C components which only run on Unix and Linux machines. Beneficially, an Open Grid Services Architecture (OGSA) [13] implementation is now integrated that combines the existing Grid work with the broad experience of Web service technologies to provide an industry-usable platform [10].
We envision our system to be built on top of these basic Grid services, so that we can assume that this lower layer of services dealing with security, authentication, resource management, and communication issues is robust and stable. The Topic Map Grid uses the basic Grid services and defines a set of additional layers to implement the services of distributed knowledge discovery on globally connected computers where each node provides its own set of Topic Maps. Thus, the Topic Map Grid enables queries spanning Topic Maps on computers located in different company establishments. Basic principles that motivated the architecture design of our Topic Map Grid include
• openness,
• scalability,
• security and data privacy, and
• compatibility with grid infrastructure.
Figure 7. Basic Architecture of the Topic Map Grid. Knowledge Application Layers: Topic Map Grid Knowledge Applications; Topic Map Tools / Services (Topic Map Navigator, Topic Map Engine, Topic Map Designer, ...); Topic Map Interfaces (TMAPI etc.). Technical Foundation Layers: OGS/JS; Middleware/Grid Fabric etc. (e.g. Globus, Legion, OGSA, CORBA, Web Services, Jini/JXTA/RMI); Java Virtual Machine; Network; Operating System.
Figure 7 shows the central elements of our layered Topic Map Grid architecture. The Technical Foundation Layers comprise functionalities provided by the basic operating system, on top of which network protocol stacks are working and the Java Virtual Machine offers platform independence. Here, we also find the middleware layer providing grid characteristics, based on technologies such as Globus, Legion, OGSA, CORBA, Web Services, Jini, JXTA, RMI etc. As can be seen, the architectural model leaves the actual choice of the grid middleware technology to the developer. However, in our current prototypical design, we have not yet evaluated the use of sophisticated tools such as the Globus Toolkit, but rely on a simple CORBA-based approach as a first step. Yet, it should be mentioned at this point that there is a CORBA Commodity Grid Kit project [29] and a Java Commodity Grid Kit project [19] which offer solutions to provide access to Grid Services delivered by the Globus Toolkit through CORBA interfaces and a Java framework, respectively. On top of CORBA, we have implemented an Object Group Service (OGS) and a Join Service (JS) in Java in order to provide functionality for dispatching queries to different grid nodes hosting Topic Maps and for joining the partial results so that a complete end result can be returned to the user. This approach will be presented in section 4.4.
The Knowledge Application Layers consist of the Topic Map Interfaces (TMAPI), which provide standardized access to Topic Map structures and their repositories. They are used by Topic Map tools and services such as the Topic Map Engine (TM4J in our case) to define, create and merge Topic Maps, a Topic Map Navigator to navigate and query Topic Maps, and the Topic Map Designer to modify existing Topic Maps. The top layer is represented by Topic Map Grid knowledge applications, which actually implement the functionalities needed by end users in the organization to define, archive, retrieve and discover knowledge modeled in Topic Map structures.
The design of the architecture constituents should be interface-based in order to be as modular as possible. Thus, specific component implementations can be exchanged without affecting the whole system, which is a characteristic of stable and robust architectures.
4.4. The CORBA-Based Group and Join Service Component
In this section, we present two exemplary service implementations from the Technical Foundation Layers which are used to provide a middleware-level means for the implementation of spanning Topic Map queries. As mentioned before, we refer to plain CORBA as our middleware, not taking into account more sophisticated grid middleware products. Since one of the desired functionalities in our system is the possibility to query groups of Topic Maps residing on different nodes in the grid, we need a mechanism to manage such groups, to distribute and dispatch queries to the nodes belonging to the requested group, and to collect and merge the returned results in order to offer a consistent end result to the user. In the context of a project on CORBA at the University of Mannheim, we have already designed and implemented a generic and reusable Object Group Service and a Join Service, which can be used in our architecture for the purposes mentioned above ([2], [3], [4]). The remainder of this section provides some background information about this approach and how it can be applied in the Topic Map Grid architecture.
4.4.1. Background
Especially in the grid sector, one-to-many-to-one architectures (1:N:1) are gaining importance. Here, the request of a client – called master in the following – is delivered to an arbitrary number of servers – called workers in the following (1:N). The workers execute their task and deliver the response to the master, who has to aggregate and evaluate the individual results (N:1). This 1:N:1 architecture should not be mistaken for the standard three-tier architecture with clients, application server, and database backend.
If we analyze the processing of such a call in more detail, the necessary communication techniques become apparent:
• group communication and
• asynchronous invocations.
Group communication is needed to dispatch the request to the workers that will execute it in parallel. Asynchronous invocations are required since the request must be distributed asynchronously and the workers’ responses will arrive at different times. The reusable core architecture we propose, which can be used in the context of many different kinds of applications and not only for the Topic Map Grid, comprises the following components:
• an Object Group Service,
• a Join Service,
• one or more Masters (here: clients that need to send queries to groups of grid nodes), and
• one or more Workers (here: grid nodes hosting Topic Maps).
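The interplay of group communication and asynchronous invocation can be illustrated with a minimal sketch in plain Java. It is not the CORBA-based implementation itself: threads from `java.util.concurrent` stand in for remote calls, and the class name `ScatterGather` and the method `scatterGather` are hypothetical names chosen for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

/** Hypothetical 1:N:1 helper; local threads stand in for CORBA invocations. */
final class ScatterGather {

    /** Sends one request to every worker (1:N) and joins the replies (N:1). */
    static <Q, R> List<R> scatterGather(Q request, List<Function<Q, R>> workers) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, workers.size()));
        try {
            // Group communication: the same request is handed to each worker.
            List<CompletableFuture<R>> pending = new ArrayList<>();
            for (Function<Q, R> worker : workers) {
                // Asynchronous invocation: the master does not block on this call.
                pending.add(CompletableFuture.supplyAsync(() -> worker.apply(request), pool));
            }
            // Replies may complete in any order; join() waits for each one in turn,
            // so the result list follows the order of the worker list.
            List<R> results = new ArrayList<>();
            for (CompletableFuture<R> reply : pending) {
                results.add(reply.join());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Function<String, String>> workers = List.of(
                q -> "node-A:" + q,
                q -> "node-B:" + q);
        System.out.println(scatterGather("grid", workers)); // [node-A:grid, node-B:grid]
    }
}
```

In this sketch the master itself waits for the replies. In the architecture described in this section, that waiting is delegated to a separate Join Service, which decouples the master from the workers.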
The general idea of a CORBA-based Object Group Service (OGS) is based on the work by Felber [11], who uses this approach in order to facilitate the replication of data. Our design and the corresponding implementation, on the other hand, are aimed at the parallel processing of CORBA calls, i.e., a message sent by a master is concurrently delivered to any number of workers, and the workers process the data included in the message independently of each other (i.e., in our case they process Topic Map queries in parallel). To that end, our solution supports the transmission of the complete data sets (i.e., the queries or search keywords in our context) to all the workers as well as the application of several distinct data dispatching policies.
The design of our OGS makes it possible to distribute the source data (queries, search keywords, etc.) independently of the data types and data structures used. The only restriction that has to be observed by the programmer is the need to organize the data as a CORBA sequence, i.e., a vector of variable length, if a dispatching policy is to be applied that differs from the policy of sending all data to all workers. However, the individual data elements in the sequence can have any simple or complex structure. Basic data types are just as permissible as complex data types that contain other types, such as basic data types, arrays, sequences, or user-defined, possibly nested data structures. At run-time, the OGS dynamically determines the data types contained in the CORBA sequences and copies the data into new subsequences of the same data type. How many of those subsequences have to be created depends on the number of workers involved and the number of data sets to be dispatched. One positive aspect of this approach is that it enables the developer to dispatch even sequences of data whose types were not foreseen at the time the OGS was constructed. Therefore, the OGS is capable of supporting a broad scope of current and future fields of application. The price for this gain in flexibility is a certain loss of performance, because the marshalling and unmarshalling of the CORBA type any requires more time than that of simple data types. Thus, a more specific solution could of course be built for our purposes in the context of the Topic Map Grid, but the possibility of reuse makes up for the minor disadvantages of using the generic solution.
To be able to use the OGS, an OGS client has to implement an IDL interface called Master (note: a detailed description of the whole OGS/JS architecture can be found in [2]). It contains only one operation, receive(), which is called by the Join Service (JS) and is handed a collection of the results produced by the workers. The grid servers representing workers, on the other hand, have to implement an interface called Worker including the operation send(), which has to be provided not only with the actual message data but also with the Interoperable Object Reference (IOR) of the JS. This is necessary to be able to apply a callback. First, the master sends a message (including its IOR) to a certain group by calling send(). Then, the group forwards the call to each of its members (the workers) and informs the JS of the fact that a number of worker results are to be expected. Later, the results produced by the workers arrive at the JS, which collects them and finally calls the operation receive() on the master in order to inform the master of the complete end result of the processing. The purpose of the IDL interface Group is to specify functionality for forwarding a call issued by a client to each member of the respective group.
It also contains operations for the attachment and detachment of servers. In order to be able to manage different groups, we defined the GroupManager interface. It is equipped with operations that can be used to create, list, retrieve, and delete groups. The JS represents the counterpart of the OGS. Like the OGS, the JS supports different policies:
• COMPLETE and
• PARTIAL.
Choosing the COMPLETE policy has the effect that the JS will in any case wait for the results of all the workers. This means that if one of the workers crashes during processing, no result will be produced and, consequently, the waiting thread will be blocked indefinitely.
This drawback can be avoided by applying the PARTIAL policy. In that case, the JS has to be provided with an additional time value. The JS then waits at most for this specified amount of time, and if not all results have arrived from the workers before the deadline, it reports only those results that have already been delivered. Should all the workers call back before the timeout period has passed, the complete end result is delivered immediately. Figure 8 shows the collaboration of the architecture’s main components: the masters m1 to mn, the OGS, the JS, and the workers w11 to wnk.
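The difference between the COMPLETE and the PARTIAL policy can be made concrete with a small sketch. It is an assumption-laden illustration in plain Java rather than the IDL-based Join Service: the class `JoinSketch` and its methods `deliver` and `collect` are hypothetical, and a `CountDownLatch` stands in for the bookkeeping of outstanding worker callbacks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical join collector; names and signatures are illustrative only. */
final class JoinSketch<R> {
    enum Policy { COMPLETE, PARTIAL }

    private final List<R> results = new ArrayList<>();
    private final CountDownLatch outstanding;

    /** The group announces how many worker results are to be expected. */
    JoinSketch(int expectedWorkers) {
        this.outstanding = new CountDownLatch(expectedWorkers);
    }

    /** Callback invoked by a worker once its result is ready. */
    void deliver(R result) {
        synchronized (results) {
            results.add(result);
        }
        outstanding.countDown();
    }

    /**
     * COMPLETE blocks until every worker has called back (forever if one crashed).
     * PARTIAL waits at most timeoutMillis and returns early once all have arrived.
     */
    List<R> collect(Policy policy, long timeoutMillis) {
        try {
            if (policy == Policy.COMPLETE) {
                outstanding.await();
            } else {
                outstanding.await(timeoutMillis, TimeUnit.MILLISECONDS);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // stop waiting, keep the interrupt flag set
        }
        synchronized (results) {
            return new ArrayList<>(results); // snapshot of what has arrived so far
        }
    }

    public static void main(String[] args) {
        JoinSketch<String> join = new JoinSketch<>(3);
        join.deliver("result from w1");
        join.deliver("result from w2");
        // The third worker never calls back, so only PARTIAL can return here.
        System.out.println(join.collect(Policy.PARTIAL, 100)); // [result from w1, result from w2]
    }
}
```

The list returned by `collect` corresponds to the collection that the JS would hand to the master's receive() operation. Calling `collect(Policy.COMPLETE, 0)` in the example above would never return, which is exactly the drawback of the COMPLETE policy noted in the text.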
Figure 8. Structure of the core architecture
The core architecture allows a from-one-to-many/from-many-to-one communication model. It must be considered, however, that with an increasing number of masters and workers or large amounts of data to be dispatched the two services can become considerable bottlenecks. To avoid this situation it is possible to start more than one instance of each of the two services. Using a Naming Service or a Trading Service, respectively, a master can look up the IORs of different OGSs and can subsequently decide to which one it wants to send its request. Figure 9 illustrates this process. Furthermore, the OGS can prescribe on the group level to which JS instance the workers have to deliver their results. For example, the selection of the responsible JS instance is based on a round-robin load balancing strategy.
Figure 9. Scalability of the core architecture
4.4.2. Application to the Topic Map Grid
As already indicated, we can use the OGS and the JS in our Topic Map Grid architecture as the foundation for the implementation of a query service as part of the Topic Map Navigator, which allows for queries not restricted to one single Topic Map but spanning multiple Topic Maps
residing on different nodes in the Grid. The OGS provides the possibility to manage groups of grid nodes to which a query can be directed. This might also include a complete group of all known grid nodes as members. Each node runs a Topic Map Engine service granting access to Topic Map objects, a TMQL query interface etc. Some basic methods of a Java interface TMGridNode that might be provided by each node could include
    String getXTM()
    TopicMap getTopicMap(String name)
    Collection getTopicMaps()
    Collection getScopes()
    Collection query(String request)
    Topic getTopic(String name)
    ...
A client (called master in the OGS/JS design) directs a query to a specific group managed by the OGS, which is then dispatched to all the members of the group. The members handle the queries in parallel, based on their Topic Map query functionality. Results might be smaller Topic Maps, Topics etc., which are returned to the JS which collects the results and merges them, based on standard Topic Map merging functionality. Finally, the result can be presented to the client who initiated the whole process in the first place. Of course, the processing of queries has to take authorization aspects into account to make sure that information and knowledge chunks are only presented to users who are allowed to see them. Based on the global definition of user roles and an appropriate application of the Topic Map concept of Scopes and Themes, this problem can be solved.
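A spanning query over a group of nodes can be sketched as follows. The sketch makes several simplifying assumptions that should be read as such: `GridNodeSketch` is a reduced, hypothetical stand-in for the TMGridNode interface that keeps only query(), query results are plain topic names instead of Topic Map objects, and merging is approximated by removing duplicate names, which is far simpler than standard Topic Map merging.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Reduced, hypothetical form of the TMGridNode interface: only query() is kept. */
interface GridNodeSketch {
    Collection<String> query(String request);
}

final class SpanningQuery {

    /**
     * Sends the request to every node of a group and merges the partial results.
     * Merging is approximated by dropping duplicate topic names; the merging
     * offered by a real Topic Map engine is considerably richer.
     */
    static Set<String> run(String request, List<GridNodeSketch> group) {
        Set<String> merged = new LinkedHashSet<>();
        for (GridNodeSketch node : group) {
            // Sequential for brevity; in the architecture the OGS dispatches in parallel.
            merged.addAll(node.query(request));
        }
        return merged;
    }

    public static void main(String[] args) {
        // Two illustrative nodes whose Topic Maps overlap in one topic.
        GridNodeSketch nodeA = q -> List.of("CORBA", "Topic Maps");
        GridNodeSketch nodeB = q -> List.of("Topic Maps", "XTM");
        System.out.println(run("standards", List.of(nodeA, nodeB))); // [CORBA, Topic Maps, XTM]
    }
}
```

An authorization check based on user roles and Scopes would naturally sit inside each node's query() implementation, so that only permitted topics ever leave the node.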
5. Related Work
Smolnik and Nastansky [27] describe a project called K-Discovery which uses Topic Maps to identify distributed knowledge structures in groupware-based organizational memories. Their special contribution is the seamless integration with groupware environments and considerations for the automatic generation of Topic Maps. There are also many Knowledge Grid-based approaches not using Topic Maps, like those described in [9] or [10]. Most of these approaches are concerned with knowledge discovery in databases (KDD) and newer, grid-based variants of this technique, called parallel and distributed knowledge discovery (PDKD). For further information see the Related Work section in [9].
6. Conclusion and Future Work
Today, grid computing appears to be the most promising framework for future implementations of high-performance data-intensive distributed applications. It is expected that Grid usage will quickly expand from the domain of scientific applications to industrial and commercial applications where knowledge discovery is very important and critical [9]. Thus, the Grid provides an infrastructure well suited as a foundation for distributed knowledge management applications seamlessly spanning all or parts of the grid nodes to create a “Knowledge Grid” on top of the Data/Computation and Information Grid. In this paper, we have presented an approach to the design of a Knowledge Grid based on the use of distributed Topic Maps, which we have dubbed the “Topic Map Grid”. It is a service-oriented design exclusively using open standards, particularly XTM, TMAPI, Java, and CORBA. Since Topic Maps are the central knowledge management concept of our approach, we have provided in-depth background information on this concept. Our basic Topic Map Grid architecture is a layered model whose layers can be grouped into Technical Foundation Layers and Knowledge Application Layers. As part of the Technical Foundation Layers, we have employed a CORBA-based Object Group Service and a Join Service as middleware components to enable the transparent distribution of Topic Map queries to groups of grid nodes hosting Topic Maps and to collect and merge the partial results. The design of these reusable service components, which have been fully designed and implemented in an earlier project, is also described. At the current stage, we are implementing a prototype based on TM4J and the CORBA OGS/JS services mentioned before, but we are not yet working with a fully-fledged grid infrastructure such as the Globus Toolkit.
After finishing the first prototype, it will be our future work to evaluate the integration of our OGS/JS-based approach with Globus, probably based on the Java and CORBA Commodity Grid Kits. Topic Maps as a standardized base technology for knowledge representation are a promising approach in the context of knowledge management to provide an additional semantic level on top of distributed information resources. However, the process of gathering knowledge to be stored in Topic Maps is mission critical, because usually a larger company produces an enormous amount of information. Thus, at least semi-automatic processes are needed to support the creation and maintenance of Topic Maps. At the moment, we are doing some research work in the context of the KnowME project in order to provide semi-automatic Topic Map generation processes
based on document indexes created with the Java Lucene information indexing and retrieval API [16]. Furthermore, we will have to investigate the organizational integration of our concepts to provide better solutions. A first approach to these questions has been presented by Smolnik and Nastansky [27] who assign persons to abstract organization and structure entities, called roles and groups, in order to be able to model the different tasks, steps, and skills involved in managing Topic Maps. Thus, they distinguish end users with informational access, knowledge authors who create specific Topic Map objects, knowledge editors who maintain specific Topic Map objects, knowledge managers with the right to maintain all Topic Map objects, designers who maintain Topic Map Templates (i.e., type systems or ontologies), and last but not least administrators who take care of the technical infrastructure. Furthermore, they identify three core workflows to describe and support the process chain of creating, publishing, archiving, and maintaining Topic Map objects: the content approval workflow, the content expiry workflow, and the archiving workflow. Based on these considerations, we will have to work on organizational integration concepts ourselves to be able to refine our current architecture in order to provide the right functionality with respect to meeting business requirements and not only offering a technology-centric solution.
References
[1] Ahmed, K.: “Topic Maps, the Business Case”, Tequila.com, Oxford (UK), 2001.
[2] Aleksy, M.: Entwicklung einer komponentenbasierten Architektur zur Implementierung paralleler Anwendungen mittels CORBA, Peter Lang, Frankfurt a.M., 2003.
[3] Aleksy, M. and Korthaus, A.: “A CORBA-Based Object Group Service and a Join Service Providing a Transparent Solution for Parallel Programming”, Proc. Int. Symp. on Software Engineering for Parallel and Distributed Systems (PDSE 2000), 10.-11. June 2000, Limerick, Ireland, IEEE Computer Society Press, Los Alamitos, California, pp. 123-134.
[4] Aleksy, M., Korthaus, A., and Schader, M.: “Implementing Distributed Electronic Auction Applications Using CORBA”, ACIS International Journal of Computer & Information Science (IJCIS), Vol. 3, No. 3, Sept. 2002, ACIS, Mt. Pleasant, USA, pp. 217-226.
[5] Baker, M., Buyya, R., and Laforenza, D.: “Grids and Grid technologies for wide-area distributed computing”, Software: Practice and Experience, John Wiley & Sons, England, 2002.
[6] Berman, F., Fox, G.C., and Hey, A.J.G.: Grid Computing – Making the Global Infrastructure a Reality, John Wiley & Sons, England, 2003.
[7] Berners-Lee, T., Hendler, J., and Lassila, O.: “The Semantic Web”, Scientific American, US, 2001.
[8] Biezunski, M., Bryan, M., and Newcomb, S.R. (ed.): ISO/IEC 13250:2002 Topic Maps, International Organization for Standardization (ISO), May 2002.
[9] Cannataro, M., and Talia, D.: “The Knowledge Grid: Designing, building, and implementing an architecture for distributed knowledge discovery”, Communications of the ACM, vol. 46, no. 1, January 2003, pp. 89-93.
[10] Ćurčin, V., Ghanem, M., Guo, Y., Köhler, M., Rowe, A., Syed, J., and Wendel, P.: Discovery Net – Towards a Grid of Knowledge Discovery, ACM, 2002.
[11] Felber, P., Garbinato, B., and Guerraoui, R.: “The design of a CORBA group communication service”, in: Proc. 15th IEEE Symp. on Reliable Distributed Systems, Niagara-on-the-Lake, 1996, pp. 150-159.
[12] Foster, I., and Kesselman, C.: The Grid: Blueprint for a New Computing Infrastructure, Morgan Kaufmann, San Francisco (US), 2002.
[13] Foster, I., Kesselman, C., Nick, J.M., and Tuecke, S.: “The Physiology of the Grid – An Open Grid Services Architecture for Distributed Systems Integration”, Technical Report, 2002.
[14] Foster, I., Kesselman, C., and Tuecke, S.: “The Anatomy of the Grid – Enabling Scalable Virtual Organization”, The International Journal of High Performance Computing Applications, 15(3), Fall 2001, pp. 200-222.
[15] Goldfarb, C.F., Kimber, E., and Newcomb, P.J.: ISO/IEC 10744:1997 Hypermedia Time-based Structuring Language (HyTime), International Organization for Standardization (ISO), 1997.
[16] Hardt, M.: “Suchmaschinen entwickeln mit Java und Lucene”, Javamagazin, S&S, Germany, September 2002, pp. 39-46.
[17] Hoyer, W.D., and MacInnis, D.J.: Consumer Behavior, Houghton Mifflin, US, 2001.
[18] Lassila, O., and Swick, R.R.: Resource Description Framework (RDF) Model and Syntax Specification, W3C, February 1999.
[19] Laszewski, G.v., Foster, I., Gawor, J., and Lane, P.: “A Java Commodity Grid Kit”, Concurrency and Computation: Practice and Experience, vol. 13, no. 8-9, pp. 643-662, 2001, http://www.cogkits.org/.
[20] McAfee, A., and Oliveau, F.-X.: “Confronting the Limits of Networks”, MIT Sloan Management Review, Summer 2002, pp. 85-87.
[21] OMG: “Common Object Request Broker Architecture: Core Specification”, Version 3.02, OMG Technical Document Number formal/02-12-06, 2002, http://www.omg.org/cgi-bin/doc?formal/02-12-06.pdf
[22] Park, J., and Hunting, S.: XML Topic Maps, Addison Wesley, US, 2003.
[23] Pepper, S.: Navigating haystacks and discovering needles, MIT Press, Cambridge (US), 1999.
[24] Pepper, S.: The TAO of Topic Maps, Infotek, Norway, November 2000.
[25] Pepper, S., and Moore, G.: XML Topic Maps Specification, TopicMaps.Org, UK, August 2001.
[26] Schmuck, N.: “Finden in Landkarten”, Javamagazin, S&S, Germany, May 2004, pp. 96-100.
[27] Smolnik, S., and Nastansky, L.: “K-Discovery – Using Topic Maps to Identify Distributed Knowledge Structures in Groupware-based Organizational Memories”, Proc. 35th Annual Hawaii Int. Conf. on System Sciences (HICSS’02) Vol. 4, Jan. 07-10, 2002, Big Island, Hawaii, pp. 106b ff.
[28] Widhalm, R., and Mueck, T.: Topic Maps, Springer / Xpert.press, Germany, 2002.
[29] Verma, S., Parashar, M., Gawor, J., and Laszewski, G.: “Design and Implementation of a CORBA Commodity Grid Kit”, in: Proc. of the 2nd Int. Workshop on Grid Computing, Nov. 2001, Denver, USA, Springer, pp. 2-13.