AN INDEXING AND CLUSTERING ARCHITECTURE TO SUPPORT DOCUMENT RETRIEVAL IN THE MAINTENANCE SECTOR

Andrea Passadore
Giorgio Pezzuto
D’Appolonia S.p.A.
Via San Nazaro 9
16145 Genova - Italy
KEYWORDS
Indexing, clustering, ontology, multi-agent, document.

ABSTRACT
This paper presents the solution provided by D’Appolonia to support maintenance companies, and any other enterprise that needs to manage a large number of electronic documents such as text files, pictures, and CAD files. The architecture of the system combines indexing techniques, clustering algorithms, and ontologies. The final product will be a tool that helps end-users find documents, automatically organized into categories depending on the context (searched key words, topics, and industrial fields).

INTRODUCTION
In the ICT era the problem is neither how and where to store information and data, nor the cost of storing them. The problem is managing the content. Today it is easy to collect content from different sources (Internet web sites, manuals and guides furnished by vendors, internal reports, and so on). We can accumulate these files on our capacious hard disks, but when we need a particular document, an ordinary indexing engine is often neither sufficient nor efficient. Information and Communication Technologies often provide a large amount of data that we are not able to manage, maintain, and use efficiently. The main goal of the solution provided by D’Appolonia is to simplify the work of all those employees and technicians who need to find a piece of information quickly. The application provides high-quality results by combining the usual indexing techniques with recent clustering algorithms (Landau and Leese 2001) and by exploiting Semantic Web (Berners-Lee et al. 2001) features such as the ontological representation of a domain of discourse. An interesting aspect of the proposed solution is the user-friendly interface: the end-user need not be an expert in semantics or in the other advanced techniques involved in the project.
We propose a Google-like graphical user interface for easy use of the system, plus advanced features for detailed document searches. In contrast with the simplicity of the user side, the core of the application is a complex flow of data, resulting from an orchestration of processes that brings the different technologies together.
In the following sections we introduce the proposed architecture, showing how each technology is involved in the project; in particular we describe the adopted indexing engine, the clustering engine, the ontology support, the user functionalities, and the administrator activities. A test case is briefly presented to illustrate the usefulness of the application in the maintenance field.

THE ARCHITECTURE
Figure 1 gives an overview of the architecture of the document retrieval system. The basis of the system is the document corpus of the company. Typical files are those most used in everyday work: textual files (pdf, doc, rtf, html), images (jpg, bmp, gif), CAD files (dwg, dxf), etc. These files are stored on a file server with an arbitrary hierarchical folder organization. It is also possible to manage a distributed environment, with different repositories connected to the same network.
Figure 1: the architecture of the document retrieval system.

An important aspect of the proposed solution is the possibility of adding tags to the documents, in order to include helpful metadata which ease the work of the whole system. The first step is to index the document collection. For textual documents, the Apache Lucene indexing engine has been chosen to create an index which can be read by the clustering engine. Carrot² is the selected clustering engine: it provides an automatic categorization of documents starting from a key word submitted by the user. The administrator plays a crucial role: he can tune several parameters of the system and customize the service in order to adapt it to different industrial contexts. Figure 1 also shows another feature of the document retrieval system: the use of ontologies is transversal to the whole architecture. Each of these components is described in detail below.

Implementation based on a multi-agent system
As briefly shown above, the document retrieval system is composed of different layers that must cooperate to provide significant results. For the implementation of the tool we chose a new advanced technology: a multi-agent platform. Carrot² and Lucene are written in Java, so a multi-agent system (MAS) written in the same language is a major requirement in order to build a homogeneous system. The selected framework for the development of the multi-agent system is the well-known JADE (Bellifemine et al. 2007), which represents the benchmark in the field of MASs. In our project the agents are the glue that establishes cooperation among the different parts of the architecture. An agent is a software entity characterized by three features: it is reactive to stimuli from the environment, it is proactive in reaching its goal (that is, it is able to take the initiative), and it interacts with its peers in order to cooperate. An agent is not an omniscient entity: it is specialized in achieving a goal and it can provide a service to share its ability with the other agents of the same community. The combined use of agent services is the strength of a multi-agent system. Other features make a multi-agent framework very interesting: the system is scalable, because if its tasks grow, more agents can be instantiated in order to split the computation; the system is also robust, because if one agent crashes, the rest of the system continues to run. Moreover, a multi-agent platform provides a set of basic features that simplify the implementation of a software application: the scheduler for the agent activities, the message transport system, monitoring tools, and management utilities are the most relevant.
Regarding the concrete implementation of the document retrieval system, one or more agents are in charge of indexing the document corpus. The number of indexing agents depends on the size of the document collection and on the distribution of the document folders. One particular indexing agent implements the functionalities of a web spider bot, with the main goal of indexing web pages from a selected list of sites relevant to the current industrial sector. These agents run in batch mode, during the idle time of the system (e.g. at night). Another agent is dedicated to the execution of clustering algorithms. This agent cooperates with the indexing agents by sending queries to the indexing agent that implements the Lucene API (Application Programming Interface). The clustering agent makes extensive use of the ontologies describing the principal topics in the enterprise domain. It performs several functions: it matches the categorization of textual documents with the classification of other types of files, it selects synonyms for the searched key word, and it exploits the ontological representations as described in the following sections.
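The cooperation between the clustering agent and an indexing agent can be pictured as a simple request/reply exchange of messages. The following is only a minimal sketch in Python, chosen for brevity; the real system is written in Java on top of JADE, and this sketch does not use or imitate the JADE API. The class names, the `Message` fields, and the substring matching are all assumptions made for illustration.

```python
import queue
from dataclasses import dataclass


@dataclass
class Message:
    sender: str
    performative: str  # "request" or "inform" (hypothetical labels)
    content: object


class IndexingAgent:
    """Hypothetical agent that owns a tiny in-memory document store."""

    def __init__(self, name, documents):
        self.name = name
        self.inbox = queue.Queue()
        self.documents = documents  # {path: text}

    def step(self, reply_queue):
        """Process one pending request and put the answer on reply_queue."""
        msg = self.inbox.get_nowait()
        keyword = msg.content.lower()
        hits = sorted(
            path for path, text in self.documents.items()
            if keyword in text.lower()
        )
        reply_queue.put(Message(self.name, "inform", hits))


class ClusteringAgent:
    """Hypothetical agent that asks an indexing agent for matching files."""

    def __init__(self, name):
        self.name = name
        self.inbox = queue.Queue()

    def ask(self, indexer, keyword):
        # Send a request, let the indexer handle it, then read the reply.
        indexer.inbox.put(Message(self.name, "request", keyword))
        indexer.step(self.inbox)
        return self.inbox.get_nowait().content


docs = {"a.txt": "Boiler maintenance guide", "b.txt": "Pump manual"}
indexer = IndexingAgent("indexer-1", docs)
clusterer = ClusteringAgent("clusterer")
matching_files = clusterer.ask(indexer, "boiler")
```

In a real multi-agent platform the two agents would run concurrently and the platform's message transport would deliver the messages; here the exchange is driven synchronously to keep the sketch short.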
The key role of ontologies
Before explaining the indexing and clustering layers, it is necessary to introduce the role of ontologies in the document retrieval system. Ontology is a term deriving from Philosophy, where it denotes the study of being and its categories. More pragmatically, an ontology in Computer Science is a data model which contains a collection of concepts within a domain and the relations among them. The ontological model of a domain (that is, a specific and delimited topic) is a detailed and rigorous description of every concept through its properties and its relations with other concepts. This description is written in a formal language. At present there are several languages for the description of ontologies: RDF (Lassila and Swick 1998), OWL (McGuinness and van Harmelen 2004), KIF (DARPA 2006), OKBC (Chaudhri et al. 1998), DAML+OIL (Joint US/EU 2001), etc., giving rise to a Babel of dialects. With the recent but by now well-established studies on the Semantic Web (Berners-Lee et al. 2001), one language is emerging: OWL. For this reason we chose OWL to represent our ontologies. OWL is based on XML and is used in several projects. Many visual editors allow users to develop their ontologies by linking shapes with arrows and inserting customized properties. An interesting editor is Protégé (Gennari et al. 2002), which supports different languages in a comfortable graphical environment. For our purposes, the ontological models are useful for fully describing particular industrial domains. The document retrieval system is aimed at maintenance companies. It is worth noting that every company is specialized in one or more fields, so a tool based on a generic knowledge base is not an efficient solution. The document retrieval system is therefore able to support different domains in order to adapt itself to a particular context.
The definition of an ontology for each maintenance field is the solution to this issue. A maintenance ontology describes all the relevant entities in the related domain, for instance the equipment, the machines, the tools, the parts, and the actors involved in maintenance processes. Relations among these entities are established. The relations usually defined in an ontology are: is a, is a part of, same as, disjoint with. These relations help the clustering algorithms to handle a particular domain and to produce results that are specific to it. In conclusion, the role of ontologies in our system is to add a semantic component to every layer depicted in figure 1. Ontologies are used to label the documents and files with tags. Tags are instances of the concepts described in a particular ontology. As an example, if a maintenance ontology defines the concept hot-water heater with the properties model name, vendor, and fuel, a picture showing the heater installed at the customer's site can be tagged with an instance of the concept hot-water heater, that is, by filling in its properties with the appropriate information. Of course, customer can be another concept contained in the maintenance ontology, and the same picture can also be tagged with an instance of this concept.
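The tagging mechanism described above can be sketched as a small data structure. This is an illustrative Python sketch, not the OWL representation used by the system: the class names `Concept` and `Tag`, the parent concept `equipment`, the file path, and all property values are hypothetical. Only the concepts hot-water heater and customer and the properties model name, vendor, and fuel come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    name: str
    properties: tuple
    parent: "Concept | None" = None  # models the "is a" relation


@dataclass
class Tag:
    """An instance of a concept: the concept plus concrete property values."""
    concept: Concept
    values: dict

    def __post_init__(self):
        # Reject values for properties the concept does not define.
        unknown = set(self.values) - set(self.concept.properties)
        if unknown:
            raise ValueError(f"unknown properties: {sorted(unknown)}")


def is_a(concept, ancestor):
    """Follow the 'is a' chain upwards from concept looking for ancestor."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = concept.parent
    return False


# Hypothetical fragment of a maintenance ontology.
equipment = Concept("equipment", ("vendor",))
heater = Concept("hot-water heater", ("model name", "vendor", "fuel"),
                 parent=equipment)
customer = Concept("customer", ("name",))

# One picture tagged with two instances (all values are made up).
tags_by_file = {
    "site/heater.jpg": [
        Tag(heater, {"model name": "HW-200", "vendor": "ExampleVendor",
                     "fuel": "gas"}),
        Tag(customer, {"name": "Example Customer"}),
    ],
}


def files_tagged_with(tags, concept):
    """Return the files carrying a tag whose concept is (a kind of) concept."""
    return sorted(
        path for path, file_tags in tags.items()
        if any(is_a(tag.concept, concept) for tag in file_tags)
    )
```

Because the lookup follows the is a relation, a search for the broader concept `equipment` also finds the picture tagged as a hot-water heater, which is the kind of semantic help the ontology is meant to give to the other layers.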
The user functionalities
The interaction of the system with the end-user is a key feature of any software application, and the document retrieval system is no exception. In addition, our tool is aimed at a heterogeneous public, with varied skills and sometimes little experience in using computers. A simple interface is therefore a strong requirement, in order to make the tool accessible to a large audience. The main feature is also the simplest: as shown in figure 2, the user enters the key word he wants to search for and the system provides a list of files organized into thematic categories. The resulting hierarchical tree is a classification of documents according to topics belonging to a domain in which the user is experienced.

Figure 2: the results of a search query.

Other useful features are at the user’s disposal. The directory service provides a static classification of documents (a typical example is the Google Directory service). A history of read files is maintained. New and updated documents are reported in a dedicated section. The user can rate each file with a vote (figure 3); the vote is a parameter taken into account when the documents in a cluster are listed, so that the user's preferred files appear in the first positions.

Figure 3: every document can be voted.

Figure 4 illustrates another feature: a user can see a preview of the sections of the selected document that contain the searched key word.

Figure 4: the document preview.

Regarding these features, the use of ontologies contributes to the generation of meaningful results but does not appear directly in the user interface. For demanding users, a conceptual search is provided. As shown in figure 5, the user can search for a particular instance of an ontological concept coming from the domain ontology. For instance, if he has to search for a particular microprocessor in the enterprise document corpus, he can browse the hierarchical tree representing the ontology and select the processor concept; he can then fill in the input boxes and start the query. The inputs provided during the instantiation are used to create meaningful clusters.
Related to this service, the user can search for files by instantiating concepts from a particular ontology containing the definitions of document types. If the user wants to find all the documents that are guides and contain specific words, he must instantiate the corresponding concept.

Figure 5: the conceptual search.

The administrator functionalities
The main task of the administrator is to set and tune various parameters and configuration files in order to provide a well-performing application tailored to a specific industrial maintenance domain. The administrator manages the entire file collection and is able to add, update, and delete files. He can link an ontological tag to a file by instantiating a concept. An interesting feature is the management of user profiles. A profile is a particular configuration of the document retrieval system. Every user can be linked to a profile according to his skills and specializations. An ontology is defined for each maintenance field; when a new profile is built, an ontology is selected. For example, if we want to create a profile for the typical technicians involved in building maintenance, the corresponding ontology is chosen. A profile also contains a particular definition of the thematic directories which represent the static clusters. It seems useful to set a particular directory layout for each maintenance sector. Similarly, a domain-specific configuration of the cluster label preferences is set: it is possible to boost particular cluster names and to suppress unwanted labels. Finally, Carrot² allows the tuning of several parameters. Changing these parameters affects the performance of the algorithm and the clustering results (such as the total number of clusters, the number of documents per cluster, the number of sub-clusters, etc.). Every profile can have its own clustering setup, in order to create a dedicated configuration for demanding users.

THE CLUSTERING AND INDEXING ENGINES
The document retrieval system is based on two engines which work together to provide a smart service. Data clustering techniques select and group homogeneous elements of a data set. Clustering has many applications in various fields: biology, market research, data mining, social network analysis, etc. For document clustering in particular there are several solutions. Vivisimo is arguably the most complete solution in the field of document clustering and is an application designed for big enterprises. Vivisimo is fully customizable and adaptable to each context, but a base license is too expensive for a small company. Only enterprises such as Airbus, Cisco, and P&G are able to maintain and fully exploit a product like Vivisimo. Vivisimo basically offers a search engine for the indexing of documents and a content integrator which enables the clustering engine to return a set of significant clusters.
Many projects are available online as services, like Google and Yahoo: Exalead, Clusty (based on Vivisimo), SnakeT, iBoogie, and KWMAP. They are good examples to explain what clustering is, but they are not useful for building an offline application running exclusively for an enterprise. Considering the high level of complexity of a state-of-the-art clustering engine, developing a new one is not a good solution, especially since one of the goals of our document retrieval system is a low cost, so that the service can be offered to small and medium enterprises. There are essentially two open source projects available: Carrot² and Egothor. Egothor is an interesting solution that combines an indexing engine and a clustering engine. It seems a promising tool but at present it is still under construction. Carrot² is instead a good solution for our application. It has reached a certain maturity and supports several clustering algorithms: Rough k-Means (Ngo and Nguyen 2004), Fuzzy Ants (Schockaert et al. 2004), HAOG-STC, STC (Stefanowski and Weiss 2003), and in particular the Lingo algorithm (Osinski and Weiss 2005), developed by the authors of Carrot². Table 1 shows the performance of these algorithms.
Table 1: performance of the algorithms supported by Carrot² (speed in seconds for 100, 200, and 400 documents).

Algorithm         Speed [s]   Speed [s]   Speed [s]   Hierarchical
                  100 docs    200 docs    400 docs    clustering
Fuzzy Ants        2.17        8.70        16.93       yes
HAOG-STC          0.04        0.11        0.28        yes
Lingo (Carrot²)   0.34        0.52        0.84        no
Rough k-Means     1.38        6.76        27.73       no
STC               0.04        0.10        0.23        no
Lingo3G           0.03        0.06        0.13        yes
Carrot²
The architecture of Carrot² is a pipeline that receives the input query, obtains search results from a search engine, applies a clustering algorithm to the results, and returns the clusters (figure 6). The supported search engines are Yahoo, Google, MSN, eTools, Alexa, OpenSearch, Lucene, and SOLR. Carrot² either accesses the index of these engines or uses the API published by the services. The support for different search engines and clustering algorithms makes Carrot² a very versatile application.
Figure 6: the architecture of Carrot².

The team who developed Carrot² is also developing a commercial version named Lingo3G. It is essentially Carrot² improved with very interesting features. Table 2 shows the differences between Carrot² and Lingo3G. The commercial edition allows the definition of sub-clusters, has built-in cluster label filtering and boosting, supports the definition of synonyms (e.g. if the user searches for the key word “car”, the clustering engine extends the search to “automobile”, “vehicle”, etc.), and offers 44 tuning parameters. The most significant ones are the depth of sub-clusters, the percentage of documents that must be included in clusters, the maximum and minimum cluster size, and the composition of cluster labels. The multilingual parameters deserve a particular remark: Lingo3G supports multilingual clustering because it is able to recognize the language of a document. At present, only the main European languages are supported. Considering the interesting features of Lingo3G, we plan to support both Carrot² and Lingo3G in our document retrieval system and to let the customer choose which clustering engine to use. The similar APIs of the two engines ease their integration into the document retrieval system.
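The pipeline described in this section (query, search results, clustering, clusters), together with the synonym expansion mentioned for Lingo3G, can be illustrated with a toy example. The sketch below is written in Python for brevity and does not use the Carrot², Lingo3G, or Lucene APIs; the grouping by shared words is a deliberately naive stand-in for a real algorithm such as Lingo or STC. The function names, the stop word list, the sample documents, and the `min_size` parameter are assumptions; only the "car" synonym example comes from the text.

```python
from collections import defaultdict

# Synonym example taken from the text; the table itself is hypothetical.
SYNONYMS = {"car": {"automobile", "vehicle"}}
STOP_WORDS = {"the", "a", "of", "for", "and", "in", "to"}


def expand(keyword):
    """Return the keyword together with its configured synonyms."""
    return {keyword} | SYNONYMS.get(keyword, set())


def search(documents, keyword):
    """Stage 1: keep the documents containing the keyword or a synonym."""
    terms = expand(keyword.lower())
    return {
        doc_id: text for doc_id, text in documents.items()
        if terms & set(text.lower().split())
    }


def cluster(results, query_terms, min_size=2):
    """Stage 2: group results under each non-query word they share."""
    by_label = defaultdict(list)
    for doc_id, text in results.items():
        for word in set(text.lower().split()):
            if word not in STOP_WORDS and word not in query_terms:
                by_label[word].append(doc_id)
    # Keep only labels that gather at least min_size documents.
    return {
        label: sorted(ids) for label, ids in by_label.items()
        if len(ids) >= min_size
    }


def pipeline(documents, keyword):
    """Query -> search results -> clustering -> labelled clusters."""
    results = search(documents, keyword)
    return cluster(results, expand(keyword.lower()))


docs = {
    "d1": "car engine repair",
    "d2": "automobile engine oil",
    "d3": "vehicle brake repair",
    "d4": "boiler repair",
}
clusters = pipeline(docs, "car")
```

With these sample documents the query "car" also retrieves the documents that only mention "automobile" or "vehicle", and the results fall into two overlapping clusters labelled "engine" and "repair". A parameter such as `min_size` plays the role of the minimum cluster size that the administrator can tune.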
Table 2: comparison between Carrot² and Lingo3G.

Feature                        Carrot² (Open Source)     Lingo3G (Commercial)
Hierarchical clustering        no                        yes
Customizable stop word list    yes                       yes
Label filtering                no                        yes
Label boosting                 no                        yes
Synonyms                       no                        yes
Tunable parameters             no                        yes
Further development            Only critical bug fixes   New features planned
Apache Lucene
Carrot² supports several search engines, but most of them are online services. The document retrieval system needs a search engine able to run on a server and to index the internal files of an enterprise. Google is probably the most popular and advanced search engine on the web. There is an offline version called Google Desktop which enables users to index their computers in order to retrieve a large number of document types. Although Google Desktop seems to be an optimal solution for a private user, it is not suitable for an industrial solution, because the application is difficult to modify. Lucene is a good alternative to Google Desktop and has some interesting features: it is open source, is written in Java, is customizable, performs well, and in particular is compatible with Carrot². As confirmation of its quality, Lucene has been used in many well-known projects: MediaWiki (the engine of Wikipedia), Beagle (based on the .NET port of Lucene), DSpace (a project managed by MIT and HP Labs), LjFind (an indexing engine for 110,000,000 blogs), and Eclipse (the development framework for Java, which uses Lucene to index its guide). Lucene supports the definition of plug-ins that allow the indexing spider to read different types of files. The basic distribution of Lucene is not able to parse pdf files, but several plug-ins add this feature. For our project, we are planning to create a parser for CAD files in order to index the text contained in files with the extensions dwg and dxf. Lucene is used as a search engine for intranets or as a web crawler for Internet pages. In the document retrieval system, an agent will implement the features of a web spider in order to read the content of web sites which publish helpful information for the maintenance sector.
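The core idea behind an indexing engine such as Lucene is the inverted index, which maps each term to the documents that contain it. The following minimal Python sketch illustrates that idea, including the update of an already indexed file; it is not Lucene code and does not reproduce Lucene's API, analyzers, or scoring. The class name, the tokenizer, and the sample file names are assumptions.

```python
import re
from collections import defaultdict

# Very simple tokenizer: lower-case alphanumeric runs (an assumption).
TOKEN = re.compile(r"[a-z0-9]+")


class InvertedIndex:
    """Toy inverted index: term -> set of document identifiers."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        """Index every token of the text under the given document id."""
        for term in TOKEN.findall(text.lower()):
            self.postings[term].add(doc_id)

    def remove(self, doc_id):
        """Drop the document from every posting list."""
        for ids in self.postings.values():
            ids.discard(doc_id)

    def update(self, doc_id, text):
        """Re-index a document whose content has changed."""
        self.remove(doc_id)
        self.add(doc_id, text)

    def query(self, *terms):
        """Return the documents that contain all of the given terms."""
        sets = [self.postings.get(term.lower(), set()) for term in terms]
        return sorted(set.intersection(*sets)) if sets else []


index = InvertedIndex()
index.add("manual.pdf", "Pump maintenance manual")
index.add("report.doc", "Pump failure report")
pump_files = index.query("pump")
pump_manuals = index.query("pump", "manual")
```

A format-specific parser, such as the planned one for dwg and dxf files, would only need to extract plain text and pass it to `add`; the index itself is independent of the file type.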
Another indexing agent has a list of folders distributed over the enterprise intranet and periodically parses every file in order to update the index.

A TEST CASE: THE E-SUPPORT PROJECT
A first test case for our application is the E-Support project, managed by a European partnership. E-Support is a collective European project which involves RTD (Research and Technology Development) companies, small and medium enterprises, and associations of maintenance companies. The project is addressed to those maintenance companies which send small teams of engineers and technicians to the customer's site in order to fix, upgrade, or simply manage a plant. Personnel working away from the headquarters often need real-time information about customers, vendors, and parts to replace, as well as any other data useful in their everyday work. The aim of E-Support is to provide these data in the field, directly at the plant, by connecting a mobile device to the remote server containing all the information owned by the company and the information offered by the associations of maintenance companies (such as regulations, norms, standards, etc.). E-Support enables technicians to use every type of mobile device: mobile phones, smart phones, PDAs (Personal Digital Assistants), and notebooks. It is the task of the remote system to send to the mobile device only the information that the device is able to display. The entire system runs on a multi-agent platform. Considering the need to run remote agents representing the mobile devices, we chose the well-known multi-agent system (MAS) JADE, with its extension for mobile devices named LEAP. A field service engineer (FSE) who wants to access the enterprise knowledge base connects his mobile device to the available network infrastructure (GPRS, UMTS, Wi-Fi, WiMax) and contacts (through his mobile agent) a remote agent running on the server platform. This remote agent (called the interface agent) processes the query and forwards the related tasks to the agents in charge of managing the databases and the document collection. The interface agent then collects the results of these tasks, generates an html page that takes into account the display capabilities of the mobile device, and sends it to the FSE. Four major challenges have been identified in providing a useful tool; they are briefly described in the following paragraphs.
Document retrieval system
This is the main contribution of D’Appolonia to the project. The core of the document retrieval system is essentially the one proposed in this paper. A peculiarity of E-Support is the mobility of its end-users, and the document retrieval system must be adapted to it. In particular, the mobile device of the user may be unable to display particular file types (especially if it is a PDA or a smart phone), and the E-Support system must be able to adapt the information generated by the document retrieval system in order to guarantee that documents can be displayed on the terminals.

Knowledge sharing and learning
The goal is to develop an intelligent, open knowledge sharing and learning system able to provide technical training to FSEs who work on-site. The FSE-Master includes intelligent user profiles and personalization capabilities in a user-friendly web-based context. The learning audience benefits from a modular step-by-step approach, following several training processes: learning, practicing, testing, and assessing. In keeping with the spirit of E-Support, the field service “students” can access the “classroom” when and where they need it, through their mobile devices. The e-learning tool is enriched with the possibility of sharing personal knowledge with colleagues, offering one's own experience to help them solve a problem.

Wireless mobile client
The main problem in reaching workers in the field is the network infrastructure. For this reason the E-Support system must be versatile, offering different media to connect a mobile device to the central server. While keeping the cost of the E-Support product low, as it is oriented towards SMEs, the mobile device is able to handle different standards, choosing among those that are available on site. The E-Support tool must tolerate several bandwidths and rate profiles, adapting its throughput to the current context. Several wireless technologies have been evaluated considering their speed, price, compatibility with the devices, and coverage. The E-Support project took into account existing and upcoming standards (Tanenbaum 2002) such as Wi-Fi, Wi-Max, UMTS, GPRS, and EDGE. The main considerations concern coverage and speed. Wi-Fi is quite widespread in offices, factories, and production areas and has adequate bandwidth. UMTS and GPRS cover every populated area of Europe but have low bandwidth. The Wi-Max technology represents the best solution but, at present, it is not commercially available and not widespread. Given these points, the main wireless medium for the E-Support system is Wi-Fi where possible, switching to GSM standards otherwise. Wi-Max will be monitored in order to introduce it into the system as soon as possible. Other issues related to the communication layer are the security of transactions and the compression of transmitted data (Salomon 2003): S-HTTP and DES (Data Encryption Standard) ensure secure communications, and the ZIP method is used to compress the exchanged files. The connection switching and security functionalities are hidden from the end-user, who can use the E-Support device without caring about the communication status.
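The selection rule stated above (Wi-Fi where possible, cellular standards otherwise) amounts to picking the first available medium from a preference list. The following Python sketch only illustrates that rule and is not taken from the E-Support implementation; the function name and the relative order of UMTS and GPRS in the fallback are assumptions.

```python
# Preference order: Wi-Fi first, then cellular fallbacks.
# The order between UMTS and GPRS is an assumption made for this sketch.
PREFERENCE = ["Wi-Fi", "UMTS", "GPRS"]


def choose_medium(available, preference=PREFERENCE):
    """Return the most preferred medium that is available on site."""
    for medium in preference:
        if medium in available:
            return medium
    return None  # no supported network is reachable


office = choose_medium({"GPRS", "Wi-Fi"})
remote_plant = choose_medium({"GPRS", "UMTS"})
```

Adding Wi-Max later, as the text anticipates, would only require inserting it at the appropriate position in the preference list.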
The multi-agent platform
The main reasons that led the project consortium to adopt a multi-agent platform are:
• The dynamicity of the system, considering the mobile devices, the different data sources, and the intelligence and pro-activity required of the software components.
• The scalability of the system, which has to be adapted to different scenarios, with enterprises having different sizes and requirements.
• The naturally distributed environment.
To implement the E-Support multi-agent society, the JADE platform has been chosen because of its good reputation, its stability, and above all the possibility of distributing agents and agent containers over a network of mobile devices by using the JADE LEAP (Berger et al. 2002) extension. Another consideration is that other E-Support components are written in Java and can therefore be easily integrated with the multi-agent platform. Regarding the compatibility with the document retrieval system, the multi-agent platform allows the simple integration of the indexing and clustering agents. A mediator agent can adapt the results provided by the clustering agent so that only displayable information is sent to the mobile devices.

CONCLUSIONS
The document retrieval system is a tool to support all workers who manage electronic documentation. A first application of the proposed solution will be the E-Support project, which faces the main challenges of the ICT field: the mobility of users, the proactivity of software components, and knowledge management. The document retrieval system is an innovative solution that combines state-of-the-art technologies such as clustering techniques, multi-agent systems, and ontologies. In contrast with these advanced techniques and the complexity of the system, the user’s interactions are simple and intuitive. The proposed solution is thus a user-friendly application that aims to exploit and emphasize the skills of the user, in order to provide a smart categorization of the knowledge base of the enterprise.

REFERENCES
Bellifemine, F.; G. Caire; and D. Greenwood. 2007. “Developing Multi-agent Systems with JADE”. Wiley.
Berger, M.; S. Rusitschka; D. Toropov; M. Watzke; and M. Schlichte. 2002. “Porting Distributed Agent-Middleware to Small Mobile Devices”. In Proceedings of the Workshop on Ubiquitous Agents on Embedded, Wearable, and Mobile Devices, Bologna.
Berners-Lee, T.; J. Hendler; and O. Lassila. 2001. “The Semantic Web”. Scientific American Magazine.
Chaudhri, V. K.; A. Farquhar; R. Fikes; P. D. Karp; and J. P. Rice. 1998. “Open Knowledge Base Connectivity 2.0.3”. SRI International.
DARPA. 2006. “Knowledge Interchange Format”. Draft proposed American National Standard.
Gennari, J.; M. A. Musen; R. W. Fergerson; W. E. Grosso; M. Crubezy; H. Eriksson; N. F. Noy; and S. W. Tu. 2002. “The Evolution of Protégé: An Environment for Knowledge-Based Systems Development”. Stanford University SMI Report.
Joint US/EU ad hoc Agent Markup Language Committee. 2001. “DAML+OIL”. DARPA and European Union Information Society Technologies Programme (IST).
Landau, S. and M. Leese. 2001. “Cluster Analysis”. Oxford University Press US.
Lassila, O. and R. R. Swick. 1998. “Resource Description Framework (RDF) model and syntax specification”. W3C Working Draft WD-rdf-syntax-19981008.
McGuinness, D. L. and F. van Harmelen. 2004. “OWL Web Ontology Language Overview”. W3C Recommendation.
Ngo, C. L. and H. D. Nguyen. 2004. “A Tolerance Rough Set Approach to Clustering Web Search Results”. In Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2004), Italy.
Osinski, S. and D. Weiss. 2005. “A Concept-Driven Algorithm for Clustering Search Results”. IEEE Intelligent Systems, 3 (vol. 20), pp. 48-54.
Salomon, D. 2003. “Data Privacy and Security: Encryption and Information Hiding”. Springer-Verlag, New York.
Schockaert, S.; M. de Cock; C. Cornelis; and E. E. Kerre. 2004. “Efficient Clustering with Fuzzy Ants”. In Proceedings of Ant Colony Optimization and Swarm Intelligence (ANTS 2004), pp. 342-349.
Stefanowski, J. and D. Weiss. 2003. “Carrot and Language Properties in Web Search Results Clustering”. In Proceedings of the First International Atlantic Web Intelligence Conference, Madrid, Spain, pp. 240-249.
Tanenbaum, A. 2002. “Computer Networks”. Prentice Hall.