Use of semantic principles in a collaborative system in order to support effective information retrieval
František Babič, Karol Furdík, Ján Paralič, Peter Bednár, Jozef Wagner
Centre for Information Technologies, Technical University of Košice, Letná 9/B, Košice, Slovakia
{frantisek.babic, karol.furdik, jan.paralic, peter.bednar, jozef.wagner}@tuke.sk
Abstract. This paper focuses on the information retrieval aspects of a new application in the domain of collaborative systems, based on the use of semantic principles for the representation of different types of knowledge, collaborative objects and the relations between them. The proposed collaborative system (developed within the European IST project KP-Lab 1) uses ontologies as a common communication framework and exchange format for different types of end-user tools. The theoretical background is provided by an innovative approach called trialogical learning. Information retrieval in the KP-Lab System is supported by purpose-built text-mining and search services. These two sets of functionalities provide features for the management of shared objects that are described by content in textual format as well as by semantic metadata.

Keywords: ontologies, semantic metadata, text mining, semantic search, collaborative system
1 http://www.kp-lab.org/

1 Introduction

Collaborative systems have become an important part of teaching and learning in recent years. Importantly, this approach should not completely replace face-to-face seminars and lectures. The main trend is to support traditional learning processes effectively with new information and communication technologies, in order to investigate “best practices” and interesting innovative elements and approaches. This approach makes it possible to create applications that build on social relations, online or offline communication channels, sharing of different types of objects, awareness support, effective and user-friendly search features, and a simple and intuitive user environment with the possibility to manage video or sound files, etc.

Sharing features in collaborative systems lead to the accumulation of large volumes of data in different formats, e.g. doc, pdf, rtf, avi, mpeg, mp3, etc. It is necessary to provide effective support for representation and search methods in these large databases. One possibility is to combine the principles of the Semantic Web for data representation with Web 2.0 approaches for a user-friendly access environment. The presented collaborative system is based on exactly this idea and provides functionalities for the support of collaborative learning or working processes based on the underlying trialogical learning theory (TL). TL provides a theoretical framework for modelling knowledge creation processes in a collaborative manner, based on a suitable technological solution. This approach was one of the main motivations for technological development within the KP-Lab project, an integrated, EU-funded IST project contracted for 5 years (2006-2011).

This project and comparable solutions are briefly introduced in the following subsections. The result of the implementation work in the project, the KP-Lab System, is described in section 2. Section 3 focuses on our information retrieval functionalities enhanced with semantic technologies. Finally, section 4 concludes the paper with a brief summary.

1.1 Related work

The domain of collaborative systems is nowadays very extensive and includes such fields as Computer-supported Collaborative Learning, Computer-supported Collaborative Work, Virtual Learning Environments, Collaborative Working Environments, Learning Management Systems, etc. However, as the European Commission states in its report [8]: “the characteristic of current collaborative environments is that they are not integrated and interoperable, that they support mainly point to point and not multipoint conferencing, that they are defined mainly for structured environment providing static artefacts and that they do not support the unstructured orchestration of activities using collaboration aware objects.
Finally they focus primarily on peer communication and not flexible team interaction.” The report describes a vision and research topics for further development in the domain of collaborative environments, namely context-oriented data mapping, support of the shared object lifecycle, synchronous and asynchronous cross-domain communication and collaboration, etc. The proposed collaborative system, KP-Lab, aims to address these principles, as described in the following sections.

The KP-Lab System can be compared with different representatives of the domain of collaborative systems. For the comparison we chose the open-source applications Claroline, FLE3, Moodle and SAKAI, the commercial product BSCW, and a newer solution based on Web 2.0 principles, Google Apps. These environments were chosen because they share many functions and features with the KP-Lab System; the comparison therefore highlights the advantages, benefits and innovations of the proposed approach. Furthermore, all the selected tools are widely used, so it is essential to show the benefits of the new system in order to motivate and convince future users. Based on the comparison, several main advantages of the proposed system can be identified:

• Multifunctionality: the selected open-source solutions do not provide as many features as the KP-Lab System. The exception is the commercial BSCW, which provides many similar functions, but these are based on a different development approach (a transactional database instead of a semantic repository, processes modelled with workflows, etc.).
• Orientation on semantic features: the integrated tools share the same semantics, so semantic information can be reused across the tools. The description of a shared object consists of two parts:
  o the metadata (semantic information) is saved in the knowledge repository (a semantic repository based on ontologies, SWKM);
  o the content is saved in the content repository, which is based on Java technologies and accessed through G2CR (gateways to content repositories).
• An easily extendable and highly interoperable system, e.g. access to different types of content repositories through G2CR (migration of data from previously used systems) or import and export of learning objects based on the most widely used standards (IMS, SCORM).

Some examples are provided below. Moodle is strongly oriented towards integrated modules, while its semantic aspects and capabilities to analyse users' practices are weakly developed. Google Apps provides a set of tools that are connected through a predefined API; users can select any combination and customize it. Search is provided by Google search engines, with the possibility to save search results and queries. SAKAI emphasizes the phase of development and, like Moodle, can be extended with modules that add new features. SAKAI provides functionalities similar to those of the KP-Lab System, such as a shared workspace, a job scheduler with calendar, portfolio, discussion and blog facilities. However, it lacks a focus on the semantics of shared objects and on capabilities for editing, managing and visualizing them. BSCW provides advanced functionalities such as tagging, communities, templates, and search on different indexing services. Its support for editing tags or indexes is limited, and collaborative and idea-generation tools are not available.

Each of the collaborative systems includes some search facilities that enable accessing and retrieving the stored information.
Searching is usually based on full-text indexing. More advanced retrieval facilities are rarely used; the tag-based search provided by BSCW can be mentioned as an example. The KP-Lab System employs semantic enhancements of the stored information to provide semantic-based retrieval combined with text-mining methods (cf. section 3).

1.2 Trialogical learning

Trialogical learning [3] refers to the process in which participants collaboratively develop shared objects of activity (such as conceptual artefacts, practices, or products) in a systematic fashion. It concentrates on interaction through these common objects (or artefacts) of activity: interactions between people are mediated by various types of knowledge artefacts, rather than taking place only among people (as in dialogical learning) or within one's mind (as in monological learning). This approach provides a theoretical framework for knowledge creation processes analogous to other existing approaches in this domain, such as the Socialisation-Externalisation-Combination-Internalisation (SECI) model, the theory of knowledge building, the theory of expansive learning, or Activity theory. The “trialogue” in trialogical learning is not a discussion between three persons; it means that individuals (or groups of people) develop shared objects of activity within some social or cultural setting, see Fig. 1.
Fig. 1. The activity model [7]

The basic concept of trialogical learning is an activity. An activity is composed of two elements (subject and object) and is mediated by artefacts. A subject is a person or a group engaged in an activity. An object (in the sense of “objective”) is held by the subject and motivates the activity, giving it a specific direction. The mediation can occur through many different types of artefacts, material tools as well as mental tools, including culture, ways of thinking and language. Transforming the object into an outcome motivates the existence of the activity.

As an example of this approach, consider the collaborative creation of an essay. The object is the essay itself; it acts as the motivational element of the planned activity. The subjects are the individual participants engaged in the evolutionary process, e.g. students and teachers. The artefact or mediator is the environment in which the essay will be created, such as a collaborative document creation tool or a wiki engine. The community consists of all engaged participants and has some rules. These rules are defined for a single community member or for the community as a whole. The division of effort describes the decomposition of the whole activity, e.g. by means of various process elements (such as tasks), which usually have associated responsible member(s) and may also have a defined start and end time, inputs and outputs, etc. After successful evaluation by the teacher, the essay becomes the outcome of the activity. Every participant can write his/her own part or edit and modify the other parts. Every action and every change is saved for monitoring purposes. When problems emerge, participants can discuss them via chat or in virtual meetings. Every participant can make his/her work public or keep it private.
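The elements of the activity model can be written down as a small data structure. The following Python sketch is only an illustration of the essay example above; the class name, the field names and the sample values are assumptions made for this sketch and do not come from the KP-Lab ontologies.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Activity:
    """One activity in the sense of the activity model (illustrative names only)."""
    subjects: List[str]                 # people or groups engaged in the activity
    obj: str                            # the shared object that motivates the activity
    artefacts: List[str]                # mediating tools, material or mental
    community: List[str] = field(default_factory=list)
    rules: List[str] = field(default_factory=list)
    # division of effort: task name -> responsible member
    division_of_effort: Dict[str, str] = field(default_factory=dict)
    outcome: Optional[str] = None       # set once the object has been transformed

    def complete(self, outcome: str) -> None:
        """Record the outcome into which the object has been transformed."""
        self.outcome = outcome


# Hypothetical instance for the collaborative essay example.
essay = Activity(
    subjects=["student A", "student B", "teacher"],
    obj="essay",
    artefacts=["wiki engine"],
    community=["student A", "student B", "teacher"],
    rules=["every change is logged"],
    division_of_effort={"write introduction": "student A", "review": "teacher"},
)
essay.complete("evaluated essay")
```

Keeping the outcome optional mirrors the description above: the essay becomes an outcome only after the teacher has evaluated it.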
2 KP-Lab System

The design and proposed functionalities of the KP-Lab System are the result of an internal co-evolutionary design process based on long-term discussion with pedagogical partners about their expectations and requirements, which cover the basic concepts of trialogical learning and their real work. The whole development stressed the use of semantic principles and effective management of the created ontologies; the designed and implemented semantic features are available to users through end-user tools. The architecture of the KP-Lab System consists of the platform and a collaborative user environment (KP-environment) with integrated end-user applications.

2.1 KP-Lab platform

The KP-Lab platform (see Fig. 3) provides integrated semantic middleware based on a common semantic framework represented by an internal ontology architecture [6]. This set of ontologies makes it possible to connect heterogeneous technologies in the KP-Lab platform through web services, serving as the common language for communication and for functionalities around shared objects. The core of the architecture is the Trialogical Learning Ontology (TLO), which defines the core concepts and principles of trialogical learning and provides the common semantics for data interoperability in the whole KP-Lab System [6]. The KP-Lab platform is composed of several groups of services and libraries:

• Semantic Knowledge Middleware Services (SWKM Services in Fig. 3) provide storage and management services for semantic descriptions (metadata) of the shared objects created by the KP-Lab tools. The Knowledge repository is implemented with RDFSuite [1], which is based on the RDF (Resource Description Framework 2) standard. This standard enables the creation and exchange of resource metadata as normal Web data.
RDFSuite is being developed at FORTH-ICS in Greece and comprises the Validating RDF Parser (VRP), the Schema-Specific Data Base (RSSDB), and interpreters for the RDF Query Language (RQL) and the RDF Update Language (RUL).
• Content Management Services (Content Transfer Module, CTM in Fig. 3) are dedicated to the creation and management of regular content (documents in various formats) used in shared objects (content described by metadata), either in KP-Lab's own content repositories or in external content repositories. The KP-Lab Content repository is implemented with the Jackrabbit 3 engine for compatibility with the JSR-170 standard.
• Persistence-API (P-API) is a client library used for all Knowledge repository tasks: generating RQL/RUL and persisting data to and fetching data from the repository. This library [9] provides a generic RDF persistence framework that allows serialization of Java objects into RDF repositories and their deserialization. It allows developers to focus on the application logic rather than on the RDF language or RQL/RUL, the low-level mechanisms for storing metadata in the SWKM.
• Technical services cover middleware support services dedicated to authorization and identity management, user management, routing, etc.

2 http://www.w3.org/RDF/
3 http://jackrabbit.apache.org/
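The kind of mapping that an object-to-RDF persistence layer performs can be illustrated with a short sketch. The Python code below is not the P-API (which is a Java library) and does not reproduce its interface; it only shows, under simplifying assumptions, how the fields of an object can be turned into RDF triples and read back. The `SharedObject` class and the `BASE` namespace are hypothetical; the predicate URIs are taken from the Dublin Core element set.

```python
from dataclasses import dataclass, fields

DC = "http://purl.org/dc/elements/1.1/"      # Dublin Core element namespace
BASE = "http://example.org/kplab/object/"    # hypothetical namespace for object URIs


@dataclass
class SharedObject:
    identifier: str
    title: str
    creator: str
    type: str


def to_triples(obj):
    """Map every field except the identifier to one (subject, predicate, literal) triple."""
    subject = BASE + obj.identifier
    return [
        (subject, DC + f.name, getattr(obj, f.name))
        for f in fields(obj)
        if f.name != "identifier"
    ]


def to_ntriples(triples):
    """Serialize the triples as N-Triples lines with plain string literals."""
    lines = []
    for s, p, o in triples:
        literal = o.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{s}> <{p}> "{literal}" .')
    return "\n".join(lines)


def from_triples(triples):
    """Rebuild a SharedObject from triples produced by to_triples."""
    values = {p[len(DC):]: o for _, p, o in triples}
    identifier = triples[0][0][len(BASE):]
    return SharedObject(identifier=identifier, **values)


doc = SharedObject("42", "Essay draft", "student A", "Text")
triples = to_triples(doc)
print(to_ntriples(triples))
```

With such a layer in place, tool code works with plain objects, and only the mapping functions need to know about namespaces and serialization, which is the separation of concerns described for the P-API above.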
[Figure: architecture diagram of the KP-Lab platform. Labelled elements: Tools (Shared Space Tools, Multimedia Tools, Mobile Tools, Meeting Tools); Gateways Engine; Authorization and Authentication Services; User Management Services; CIS - Content Item Services; Multimedia Content Management Services; Persistence API (Library); CTM - Content Transfer Module (Library); Technical Services; Knowledge Matchmaker; Knowledge Mediator; SWKM Services; Knowledge Repository; Content Repository (Jackrabbit); Streaming Servers. Labelled connections: SOAP, HTTP(S), WebDAV, RTSP/RTP, RUL/RQL.]
Fig. 3. The KP-Lab platform [4]

2.2 KP-environment

The KP-environment provides a virtual user environment that mediates all user activities and actions within different types of shared objects, based on the users' goals or expectations. The integrated end-user applications have been implemented on the basis of initial analyses, case studies, generic scenarios and requirements identification. They enable the representation, realization, analysis and adaptation of existing or innovative knowledge practices in a collaborative manner around shared objects, see Fig. 4.

Shared Space provides the main features of a learning system aimed at facilitating innovative practices of sharing, creating and working with knowledge in education and workplaces through different types of views. It supports users' collaboration according to different working practices and allows flexible viewing of shared knowledge. It provides a set of tools for knowledge building and process management (parts of knowledge practices can be represented as knowledge processes). The personalised, temporal and faceted views allow users to describe and visualise shared knowledge objects, their associations and their state in different arrangements.

Support tools provide functionalities that are necessary for effective collaborative work or learning within the virtual space, e.g. awareness (online, or historical based on logs of events), search (free-term or semantic) and help (simple help pages, wiki, interactive help based on the user's interests or performed actions).

Common tools are the tightly integrated tools of the KPE that are available inside a shared space for working with shared objects. Examples include commenting on or tagging relevant concepts (with tags from predefined vocabularies or the user's own tags), creating a chat or virtual meeting with semi-automatic generation of discussion maps, re-using interesting concepts based on created templates, importing learning objects as packages from other types of collaborative systems, and support for personal work through a to-do list or calendar.
Fig. 4. Integrated view of the whole KPE architecture [5]

Optional tools are loosely integrated applications that users can select according to their preferences. An optional tool opens directly in the KPE graphical user interface or in a separate browser window. These applications make it possible, for example, to analyse multimedia video clips through tagging (with tags from predefined vocabularies or the user's own tags), to export data for research purposes (from three different types of repositories: knowledge, content and awareness), to interact with external Web 2.0 applications such as Google Calendar or Google Docs and, last but not least, to create and manage visual models and custom visual modelling languages.

Stand-alone applications are used separately from the KPE, either because they are implemented as, e.g., mobile applications (CASS Memo) or because they focus on supporting a specific pedagogical research methodology (CASS Query).
3 Information retrieval in KP-Lab System

Shared objects in the KP-Lab System consist of a content part and a metadata part. The content is stored in the Content repository and the metadata in the Knowledge repository (see section 2.1). The content is represented by documents in different formats, e.g. doc, pdf or rtf, and the metadata provide a more structured description of these documents by title, author, type of document, etc., following the Dublin Core standard 4.

Real usage of the KP-Lab System within pilot courses, evaluation cases and other experiments brought a need to store and manipulate large volumes of data. This led to the design and implementation of services for effective information retrieval, which employ the access features of the Content and Knowledge repositories (P-API or CTM), advanced text-mining methods and semantic search capabilities.

Various learning or working materials are uploaded into the KP-Lab System as shared objects and are further investigated in a collaborative learning process. These materials can be semantically annotated through predefined vocabularies, taxonomies, concept maps, or domain ontologies, or through free tags, i.e. textual descriptions created by users. The ontologies provide a conceptual framework for the semantic annotation by defining the structure of a domain of discourse. In trialogical learning, the ontology itself is a subject of creation, modification, and evolution in the process of learning as a socially determined and interactive activity [10].

3.1 Motivating scenario

The KP-environment enables users (students, teachers, etc.) to work on shared objects in one virtual place. It allows the participants of collaborative processes to perceive and handle shared materials, knowledge representations and the respective processes in an integrated way, in order to support the creation of new and innovative knowledge.

As an example, let us assume that a group of students aims to create documentation resources for a given product. The collection of text-based materials should be organized into a defined process. The collaborative procedure starts with the design of the process elements: tasks accompanied by the relevant resources. Students use the KP-Lab search functions to retrieve materials from the internal collaboration space or from outside.
A query can be formulated on the title of a document, its author, type or keywords. These metadata are specified by the semantic tags that accompany each document stored in the KP-Lab Shared Space. External documents can be retrieved by a full-text search and then stored in the Content Repository. The collected materials are then assigned to the process; their context is defined by a set of semantic tags specified by the students. Such a semantic description consists of predefined as well as ad hoc created, customizable semantic tags. Based on text-mining methods, KP-Lab supports this process by recommending suitable tags. In addition, it allows clustering and categorization of materials into topic groups, enhancement of the semantic tag vocabulary (ontology), and an advanced search that employs the built-in semantic information.
4 http://dublincore.org/
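The query step of the scenario can be sketched in a few lines of Python. The example is a simplified in-memory illustration, not the KP-Lab search implementation: the `Document` class, its fields and the sample records are assumptions chosen to mirror the metadata named above (title, author, type, and keywords given as predefined or free tags).

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Document:
    title: str
    author: str
    doc_type: str
    vocabulary_tags: Set[str] = field(default_factory=set)  # tags from a predefined vocabulary
    free_tags: Set[str] = field(default_factory=set)        # ad hoc tags created by users


def search(documents, title=None, author=None, doc_type=None, keywords=()):
    """Return the documents that match all given criteria; omitted criteria are ignored."""
    wanted = {k.lower() for k in keywords}
    hits = []
    for d in documents:
        tags = {t.lower() for t in d.vocabulary_tags | d.free_tags}
        if title is not None and title.lower() not in d.title.lower():
            continue  # title is matched as a case-insensitive substring
        if author is not None and author.lower() != d.author.lower():
            continue
        if doc_type is not None and doc_type.lower() != d.doc_type.lower():
            continue
        if not wanted <= tags:
            continue  # every requested keyword must appear among the tags
        hits.append(d)
    return hits


# Hypothetical sample collection.
docs = [
    Document("Product manual", "Novak", "pdf", {"documentation"}, {"draft"}),
    Document("Installation guide", "Kovac", "doc", {"documentation", "installation"}),
    Document("Meeting notes", "Novak", "rtf", free_tags={"meeting"}),
]

by_author_and_tag = search(docs, author="novak", keywords=["documentation"])
print([d.title for d in by_author_and_tag])
```

Treating predefined and free tags as one keyword set at query time is a design choice of this sketch; it lets a student find a document regardless of which kind of tag carries the keyword.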
3.2 Text-mining services

KP-Lab text-mining services have been designed to assist users in creating or updating the semantic descriptions of KP-Lab shared objects [10]. The semi-automatic generation of these descriptions, or even of new KP-Lab ontologies, relies on the textual information attached to particular objects. The textual description is analyzed and processed by parsing, part-of-speech tagging, lemmatization, and keyword extraction techniques. The text-mining classification of shared objects, which is based on matching the extracted keywords against an existing conceptual model, proposes a set of the most relevant ontology concepts, which are suggested to users as suitable for semantic annotation. The classification works in two modes: 1) as a supervised method, employing classification models created from previously processed training examples, and 2) as a decision-support system employing heuristic rules in the following form: IF THEN (optional: weight=N); N=. In addition, unsupervised text-mining techniques such as clustering algorithms are used to find previously unseen concepts (or clusters) in the set of analyzed textual resources. These may lead, e.g., to a suggestion to upgrade existing KP-Lab ontologies as the knowledge of a user group evolves.

Functional requirements for the text-mining services emerged from a discussion of user expectations and of the use of the services within end-user applications:

• Creation of a training data set from shared objects that are already annotated with a predefined set of categories, i.e. concepts of an existing domain ontology. The textual descriptions of the objects are pre-processed and transformed into a term-document matrix. The classification service indexes the training data set and stores it in the Mining Object Repository.
• Creation of a classification model, based on the selected algorithm and on a given training data set.
Based on the selected implementation platform [2], kNN (k Nearest Neighbours), SVM (Support Vector Machine), and Perceptron were employed as the basic classification algorithms.
• Modification (improvement) of the applied classification model by changing the texts and/or categories in the training data set, by editing the settings of the algorithm, or by switching to another algorithm.
• Provision of basic quality measures for the applied classification model, e.g. precision and recall. Support for the creation, indexing, and storage of testing data set(s) that can be used for a more exact evaluation of the quality of the classification process.
• Verification and validation of the applied classification model. The model is no longer valid if a portion of the training data set (e.g., the term-document matrix or the set of predefined categories) has been modified. In this case, the model must be reindexed to make it valid again.
• Classification of a set of unknown (not annotated) objects into the categories used for training. The output of this function is a set of weighted categories (concepts, terms) for each of the classified objects.

Implementation of the classification service (see Fig. 5) is based on the JBowl (Java Bag of Words) library [2], which provides a platform for several classification algorithms, tools for processing natural language texts, and some clustering techniques [10].
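The first and last requirements in the list above (building a term-document matrix from annotated texts and classifying an unknown object into weighted categories) can be sketched as follows. This is a minimal Python illustration using raw term counts, cosine similarity and a k-nearest-neighbours vote; it does not use JBowl and makes no claim about how JBowl implements these steps. The tokenizer, the training texts and the category names are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace, keeping purely alphabetic tokens."""
    return [w for w in text.lower().split() if w.isalpha()]


def build_matrix(texts):
    """Return (vocabulary, rows); rows[i][j] is the count of term j in text i."""
    vocabulary = sorted({t for text in texts for t in tokenize(text)})
    index = {term: j for j, term in enumerate(vocabulary)}
    rows = []
    for text in texts:
        row = [0] * len(vocabulary)
        for term, count in Counter(tokenize(text)).items():
            row[index[term]] = count
        rows.append(row)
    return vocabulary, rows


def vectorize(text, vocabulary):
    """Project a new text onto the training vocabulary; unknown terms are dropped."""
    counts = Counter(tokenize(text))
    return [counts.get(term, 0) for term in vocabulary]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text, vocabulary, rows, labels, k=3):
    """Weight each category by the summed similarity of the k nearest training texts."""
    query = vectorize(text, vocabulary)
    nearest = sorted(
        ((cosine(query, row), label) for row, label in zip(rows, labels)),
        reverse=True,
    )[:k]
    weights = defaultdict(float)
    for similarity, label in nearest:
        if similarity > 0:          # neighbours with no shared terms do not vote
            weights[label] += similarity
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()} if total else {}


# Hypothetical training set: already annotated object descriptions.
training_texts = [
    "ontology concept semantic annotation",
    "semantic ontology repository",
    "video clip multimedia tagging",
    "multimedia video streaming server",
]
training_labels = ["Semantics", "Semantics", "Multimedia", "Multimedia"]

vocabulary, rows = build_matrix(training_texts)
result = classify("semantic annotation of an ontology", vocabulary, rows, training_labels)
print(result)
```

The returned dictionary plays the role of the set of weighted categories: the weights are normalized to sum to one, so they can be presented to the user as a ranked list of suggested concepts.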
Fig. 5. Position of the text-mining services in the whole KP-Lab System

These basic functionalities of the text-mining services are extended with new features that improve the support for semantic tagging and semantic maintenance of the shared objects:

• Assistance in the process of semantic tagging, namely the suggestion of tags that semantically match the textual content of the objects.
• Transformation of free-text tags into tags predefined in the vocabulary.
• Consistency checking of the semantic tagging: evaluation of homogeneity, similarities, and differences between the tagging performed by different users.
• Maintenance of the tag vocabulary: proposals for adding a semantic tag to the vocabulary, modifying it, or removing it.
• Extension of search results and query expansion, with the possibility to tag the search results. Searching for similar objects based on the textual content, metadata properties, or semantic annotations of a shared object.

3.3 KP-Lab Search services

KP-Lab Search services provide an integrated interface for semantic search and free-term search. To evaluate a search query effectively, the properties of the objects of activity have to be indexed using the Indexing service. The implementation of the Search services is based on the Solr search server [11], which provides an API for both indexing and faceted searching. The Indexing service is integrated with the Persistence API and CTM to simplify maintenance of the index for tool developers.
If the Persistence API is not used to manage objects of activity, the Indexing service can be invoked directly through the Solr HTTP interface. The Search services enable advanced faceted search (see Fig. 6) based on the metadata and content of shared objects, and they support user-defined, flexible visualization as well as semantics-based classification and clustering of search results.
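To make the direct use of the Solr HTTP interface more concrete, the following Python sketch builds the two kinds of requests involved: an XML update message that adds a document to the index, and a select URL for a faceted query. The request parameters `q`, `fq`, `facet` and `facet.field` and the `<add><doc><field>` message format are standard Solr conventions; the server address, the handler paths and all field names are assumptions for this example and do not describe the actual KP-Lab index schema. No request is sent; the sketch only constructs the message and the URL.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

SOLR = "http://localhost:8983/solr"   # assumed server location


def add_message(doc):
    """Build the XML body of a Solr update request that adds one document."""
    add = ET.Element("add")
    node = ET.SubElement(add, "doc")
    for name, value in doc.items():
        # A list or tuple becomes a multi-valued field (one <field> per value).
        values = value if isinstance(value, (list, tuple)) else [value]
        for v in values:
            element = ET.SubElement(node, "field", name=name)
            element.text = str(v)
    return ET.tostring(add, encoding="unicode")


def facet_query_url(text, facet_fields, filters=None):
    """Build a select URL that asks for facet counts and applies optional filters."""
    params = [("q", text), ("facet", "true")]
    params += [("facet.field", f) for f in facet_fields]
    params += [("fq", f'{name}:"{value}"') for name, value in (filters or {}).items()]
    return f"{SOLR}/select?" + urlencode(params)


# Hypothetical shared object and query.
message = add_message({"id": "obj-42", "title": "Essay draft", "tag": ["essay", "draft"]})
url = facet_query_url("essay", ["author", "type"], {"type": "Text"})
print(message)
print(url)
```

In such a setup the message would be posted to the update handler of the server and the URL requested with HTTP GET; each selected facet value in the user interface then translates into one additional `fq` filter.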
Fig. 6. Facet GUI of KP-Lab Search services

Users can save their search results as well as the queries they have made. In the next release, the Search services will offer suggestions based on saved queries in order to help the user find the most appropriate way of executing a search.
4 Conclusion

Information retrieval in the KP-Lab System is provided by purpose-built text-mining and search services. These two sets of functionalities provide features for the management of shared objects, mainly content in textual format. The services are implemented on the middleware layer, and end users can access them through the facet GUI, an integrated part of the virtual user environment called KP-environment. The current version of the system was published at the beginning of this year and is available at http://2d.mobile.evtek.fi/shared-space/. The Centre for Information Technologies 5, Technical University of Košice, Slovakia, is responsible for the design and implementation of the Persistence-API, the Text-mining and Search services, the Knowledge process service, the To-Do service, Historical awareness, and, partially, CTM.
5 http://web.tuke.sk/fei-cit/index-a.html
Acknowledgments. The work presented in this paper was supported by the following projects: the KP-Lab project, which is supported by European Commission DG INFSO under the IST program, contract No. 27490; the Slovak Research and Development Agency under the contracts No. APVV-0391-06 and No. RPEU-001106; the Slovak Grant Agency of Ministry of Education and Academy of Science of the Slovak Republic under grant No. 1/4074/07. The KP-Lab Integrated Project is sponsored under the 6th EU Framework Programme for Research and Development. The authors are solely responsible for the content of this article. It does not represent the opinion of the KP-Lab consortium or the European Community, and the European Community is not responsible for any use that might be made of data appearing therein.
References

1. Alexaki, S., Christophides, V., Karvounarakis, G., Plexousakis, D., Tolle, K.: The ICS-FORTH RDFSuite: Managing Voluminous RDF Description Bases. In: Proc. of the 2nd International Workshop on the Semantic Web (SemWeb'01), in conjunction with the Tenth International World Wide Web Conference (WWW10), Hong Kong (2001), pp. 1-13
2. Bednar, P., Butka, P., Paralic, J.: Java Library for Support of Text Mining and Retrieval. In: Proceedings of Znalosti 2005, Stará Lesná, Slovakia (2005), pp. 162-169
3. Hakkarainen, K., Paavola, S.: From monological and dialogical to trialogical approaches to learning. Paper at the international workshop "Guided Construction of Knowledge in Classrooms", February 5-8, 2007, Hebrew University, Jerusalem (2007)
4. Ionescu, M., et al.: KP-Lab Platform Architecture Dossier. KP-Lab public deliverable D4.2.3, November 2007
5. Markkanen, H., et al.: M33 specification of end-user applications. KP-Lab public deliverable D6.6, December 2008
6. Markkanen, H., Holi, M.: Ontology-driven knowledge management and interoperability in trialogical learning applications. Article in the KP-Lab book, 2008
7. Nardi, B. A.: Activity Theory and Human-Computer Interaction. In: Context and Consciousness: Activity Theory and Human-Computer Interaction. MIT Press, Cambridge (1996)
8. New Collaborative Working Environments 2020. Report on industry-led FP7 consultations and 3rd Report of the Experts Group on Collaboration@Work. European Commission, Information Society and Media, February 2006
9. Persistence API, available at http://kplab.tuke.sk/trac/wiki/kms-persistence-3
10. Smrž, P., Paralič, J., Smatana, P., Furdík, K.: Text Mining Services for Trialogical Learning. In: Proc. of the 6th annual conference Znalosti 2007, Ostrava, Czech Republic (2007), pp. 97-108, ISBN 978-80-248-1279-3
11. Solr search server specification, available at http://wiki.apache.org/solr/