KEA - a Mathematical Knowledge Management System combining Web 2.0 with Semantic Web Technologies Sabina Jeschke, Marc Wilke University of Stuttgart Center of Information Technologies, RUS
[email protected] [email protected] Abstract We present a knowledge management system suggesting a combination of Web 2.0 and Semantic Web technologies. With regard to the accumulation of knowledge as a consequence of the increasing amount of scientific publications, the demand for sophisticated knowledge management systems in the scientific education has increased lately. Above all, mathematical publications remain a major challenge, caused by the need for complicated analysis of natural language texts and mathematical formulas. The web-based, mathematical knowledge management system KEA provides natural language processing techniques, and benefits from new web technologies.
1. Introduction The knowledge management systems in use today are more or less document or content management systems similar to yellow pages for accumulating a wide range of information and providing straightforward search mechanisms. Nowadays, in a new generation of knowledge management systems, more and more emphasis is placed on the integration of semantic retrieval technologies, state-of-the-art search routines, social networking mechanisms and collaborative technologies [1]. To ensure lasting success, they have to combine different aspects for retrieving, managing and imparting knowledge. Knowledge Relationship Discovery [1] deals with the representation of knowledge as networks of information from different sources. The information is collected by knowledge management systems. In this context, Google [2] provides a renowned combination of services merging new ideas of upcoming knowledge management systems. Moreover, knowledge management systems can take advantage of social networking techniques (used i.e. in Flickr [3] or
978-1-4244-1841-1/08/$25.00 ©2008 IEEE
Nicole Natho, Olivier Pfeiffer Berlin University of Technology Center for Multimedia in Education and Research, MuLF {natho, pfeiffer}@math.tu-berlin.de del.icio.us [4]) to establish new applications for tagging and semantic annotation mechanisms for better search results within large knowledge bases. New light-weight web interfaces with reusable components such as Mashups offer new opportunities for more comfortable and ubiquitous knowledge management systems. The idea of desktop applications within a browser e.g. GMail [5] or iGoogle, yields the possibilities for a high degree of user personalization. Collaborative techniques (i.e. trackbacks, blogs or wikis) help spreading knowledge through different types of perceptions such as wikipedia and sister projects [6], LycosIQ [7] or ontoworld [8]. The Knowledge Management System KEA is exemplified by means of an example of a mathematical encyclopedia. In this ambitious project, we provide automated creation mechanisms for knowledge bases from diverse mathematical texts, easy-usable administration interfaces, and intelligent retrieval interfaces to create a web service-based and desktoplike knowledge management system.
2. The KEA System KEA is a semi-automated web-based knowledge management system constructing and organizing knowledge bases from mathematical language texts to store, maintain and retrieve information with natural language processing methods. An overview of the system design is given in Figure 1.
2.1. The NLP System The linguistic approach to the mathematical language is straightforward and is performed by the subsystem MARACHNA [9]. Entities are principal carriers of information in mathematical texts. They are analyzed using natural language processing (NLP)
techniques based on a linguistic classification scheme [10]. MARACHNA is made up of separate components: a pre-processed analysis for different text sources, the syntactic and the semantic analysis. Each extracted natural language sentence is analyzed using MARACHNA’s NLP methods. The NLP analysis is implemented using TRALE [11, 12] based on headdriven phrase structure grammars, providing detailed syntactic and semantic information about each sentence. TRALE has been extended by expanding the underlying dictionary and grammar to include the specifics of mathematical language. Mathematical equations cannot be processed by TRALE, and therefore have to be treated separately. The output of the NLP analysis is an abstract syntax tree representing the sentence structure within the analyzed text. Knowledge processing Knowledge Acquisition
Web Interface
Math. Texts
mArachna NLP: TRALE
Lucene
Administrator Workbench
Jena Ontologies
Knowledge Base Knowledge Retrieval Developer Workbench App 1
App 2 User Web Interfaces
Figure 1: KEA System Design The implementation of the semantic analysis is performed by an embedded JavaScript interpreter [13] in which the syntax tree is categorized according to characteristic structures of the mathematical language. Each category is translated in a specific triple structure defined by external JavaScript rules. In addition, the triples are annotated with supplementary information. Typically, the rules reflect characteristic mathematical language constructs such as propositions, assumptions or definitions. Equations are essential carriers of mathematical information. Consequently, it is desirable to be capable of analyzing and representing their content in correlation with the NLP analysis. Currently, we are investigating a syntactical analysis approach, similar to
those used in computer algebra systems in combination with contextual grammars to correlate information given in equations with the surrounding text. Therefore, equations will be reintegrated during the NLP semantic analysis1.
2.2. Knowledge Base Concept Information fragments resulting from the semantic analysis are integrated into knowledge bases managed and developed by the Java-based Jena Semantic Web Framework [14]. Jena is a toolset consisting of several APIs to manipulate RDF [15], OWL [16] and RDFS [17] as Java classes. RDF representations are triples which are forming statements. Jena supports different kinds of representations forms for serialization (reading and writing) RDF models such as N-Triples or RDF/XML. In addition, there are different possibilities for storing RDF models, e.g. data bases. Furthermore, Jena provides SPARQL [18] and an interference engine to derive additional RDF assertions. SPARQL is a key component when combining Web 2.0 and semantic web technologies [19]. 2.2.1. Integration of triples. Triples generated by the NLP analysis are made up of two nodes and a relation between them. Each node represents a mathematical expression, phrase or term, and the relation describes the association between these nodes. Therefore, each relation depicts different types of linguistic phrases or key words in mathematical texts (e.g. the relation “is equivalent to”). To support complex interconnections, triples themselves are often used as nodes of other triples. In addition, for each integrated triple element, it has to be decided whether it represents an OWL class or instance of OWL classes. This distinction poses a considerable challenge for such an automated analysis, forcing us to use OWL Full. The whole process of integration maps the language structure into very fine-grained knowledge bases. In this regard, a primary knowledge base contains some mathematical information fragments encompassing axiomatic set theory and first order logic. To avoid inconsistencies within the knowledge base new information is added using as a semiautomated approach. Hence, new information 1
It should be pointed out that KEA does not aim at machine-based understanding of equations and mathematical language as this would require an automatic reasoning system. Equations and mathematical language constructs are integrated straightforward into knowledge bases without any interpretation. This approach is employed as a testbed for further investigations to discover which parts really need assistance of reasoning processes.
fragments are integrated if and only if there are no duplications or contradictions with existing triples. Inconsistencies have to be eliminated by an administrator through a web interface which manipulates the data base2. Additionally, each analyzed text generates a separate, independent knowledge base. This separation is mandatory for retaining the idiosyncrasies and preferences of each author. 2.2.2. Merging process. By reason of categorization and management of huge knowledge bases, one part of our investigations is the process of merging. In this context, we unify all knowledge bases from different authors into one mathematical field. Trying to do this we encounter some problems: even in mathematics, different authors use different types of definitions and theorems. The differences are so subtle that even a human reader needs profound knowledge to be able to handle them. Therefore in this case human intervention through the administration web-based interface is needed. Fortunately, this intervention is not the rule. In addition, tools and strategies for partitioning the unified knowledge bases into smaller knowledge bases have to be developed and realized. Such smaller knowledge bases can be specifically adapted to the preferences and requirements of specialized user groups. For example, we are using different versions of entities for elementary or high schools. Moreover partitioned knowledge bases containing different notations could contribute to the harmonization of teaching material within courses. 2.2.3. Administrator interface. Providing an easy access for administrators, web-based interfaces control the semi-automated complex processes of KEA’s knowledge engineering. We provide different interfaces for direct access to the separate stages of the analysis such as graphical representations of different levels of knowledge bases or of syntax trees as well as text-based highlighted documents. Following the metaphor of a desktop and the “Personalized Home”, administrators have the possibility to arrange their workspace. They can select preferred working tools (so called widgets or intelligent desktop agents), and display only those tools 2
KEA’s information processing model is based on the human knowledge processing model: humans can integrate new knowledge into their world view only if there are no possible misinterpretations. As a consequence, incorrect knowledge may be deleted or corrected under certain conditions. It should be noted that, given the nature of the sources, human intervention into the automated integration is not the rule; most of the knowledge presented in a textbook follows the rules of consistency required for the automated integration.
within the browser (such as searching tools, graphical representations based on Graphviz [20], history log files, tagging tools, notepads, etc.). Moreover, we want to provide tools like small forums and wikis for efficient communication exchange between different administrators. For easy implementation of these exchangeable desktop agents we have developed an extensive free JavaScript library. Thus, we provide the opportunity to write or extend desktop agents in an easy way similar to writing an extension for Typo3 [21] or Eclipse [22].
2.3. Information Retrieval of KEA After the acquisition process (semantic analysis), the knowledge bases consist of linked mathematical information fragments. Besides individual text ontologies, the KEA system generates different levels of upper ontologies of each knowledge base. The lowlevel ontology describes the correlation between mathematical terms both within the mathematical entities and between these entities themselves. The upper ontology reproduces the structure of the processed text and thus the field ontology used by the author. Both of these ontologies facilitate data retrieval, the upper ontology in direct response to queries, the low-level ontology by showing interconnections and relations not directly included by the author or by offering additional links between the upper ontologies of different authors. Like the administration interface, the information retrieval web interface of KEA also follows the metaphor of desktop of iGoogle or GMail allowing a high degree of personalization to the user. Therefore the interface is implemented in JavaScript offering desktop agents that can easily be adapted to the preferences of the user. Predefined desktop agents include: advanced search mechanisms, personalized history of queries and results, including social tagging in form of personal annotation for relevance, graphical representations of knowledge bases including controls for choosing the referred ontology (low level, different upper level ontologies based on author, field etc.) or the context and type of information (choosing e.g. examples, definitions or theorems). In addition, the user also has the possibility to write new desktop agents.
3. Related Works To date we have not learned of a project similar to KEA regarded as a turn-key solution. However, KEA consists of different building blocks, which can be compared with related projects. Basic results on
mathematical ontologies were accomplished by Gruber and Olsen [23]. MBASE [24] is an example of a manually written mathematical ontology. Investigating English mathematical texts is accomplished by Baur [25]. An example of a manually written mathematical encyclopedia is MathWord [26]. WordNet [27, 28] and GermaNet [29] are intelligent thesauri providing a number of suitable synonyms and definitions for given queried expressions similar to the semantic analysis of MARACHNA. Concepts for automated semantic analysis of the German language were investigated by Helbig [30]. Mizar [31, 32] is used for describing mathematical proofs through a formal human and machine-readable language as input for an automated reasoning system. DIALOG [33] processes natural language for mathematical validation in an automated reasoning system. HELM [34] uses metadata provided by authors for semantic annotation of mathematical texts used in digital libraries. In combination with HELM MoWGLi [35] is a retrieval interface based on pattern-matching within mathematical equations and logical expressions.
4. Conclusion and Evaluation KEA is a semi-automatic web-based knowledge management system constructing and organizing knowledge bases for mathematical language to store, maintain and retrieve mathematical information combining different technologies. Metadata is generated in the form of RDF/OWL triples, which can be used in Semantic Web environments. Social tagging technologies improve our knowledge results based on tracking methods of users’ knowledge. We provide desktop agents to maintain and support the individual learning process and motivation of each user. During further development a collaborative environment shall be evaluated to enhance the user-friendliness of the knowledge management system. Using such an environment the users’ and administrators’ individual preferences are preserved. Currently, KEA is a testbed analyzing mathematical texts written in German and English, processing selected entities and texts blocks and successfully integrating the extracted information into knowledge bases. We provide parts of the user interfaces for administrator and the user interface for a mathematical encyclopedia. In further developments, we plan to analyze complete textbooks for different fields of mathematics, starting with linear algebra. For evaluation purposes KEA will be integrated in a cooperative knowledge space system used for mathematical learning environments. Such systems contain numerous entries of mathematical language
constructs. This approach confronts KEA with a wide range of different correct and incorrect usage of mathematical language constructs. In addition, user activities can be traced for direct feedback. We can examine users’ knowledge skills through tracking methods and questionnaires. The differences between “correct knowledge” and the user skills are revealed by merging techniques of mathematical knowledge bases and user skills knowledge nets. This is only the first step in documenting the users’ skills and allows for guiding the user to improve or complete his/her knowledge, e.g. by taking additional courses, hints for additional literature and check for correlations between language and mathematical skills. Additionally, we are interested to see which types of desktop agents will be developed by the users within the framework of KEA.
5. References [1] Semantic Web Activity http://www.w3.org/2001/sw/ [2] Google Labs, http://labs.google.de/ [3] Flickr, http://www.flickr.com/ [4] del.icio.us, http://del.icio.us/ [5] GMail, http://mail.google.com/ [6] Wikipedia and sister projects, http://www.wikipedia.org/ [7] LycosIQ, http://iq.lycos.de/ [8] ontoworld, http://ontoworld.org/wiki/Main_Page [9] S. Jeschke, N. Natho, M. Wilke MARACHNA - Ontology Engineering for Mathematical Natural Language Texts, IEEE Symposium on Computers and Communications, pp. 1035 – 1042 (July 2007), ISCC’07, Aveiro, Portugal [10] N. Natho, „MARACHNA: Eine semantische Analyse der mathematischen Sprache für ein computergestütztes Information Retrieval.“, PhD thesis, Technische Universität Berlin (2005) [11] S. Müller, „TRALE“, www.cl.unibremen.de/Software/Trale [12] S. Müller: „Deutsche Syntax deklarativ: Head-Driven Phrase Structure Grammar für das Deutsche.“ Linguistische Arbeiten, No. 394. Max Niemeyer Verlag, Tübingen (2005)
[13] Mozilla Foundation: Rhino. http://www.mozilla.org/rhino/
[26] Wolfram Research, “MathWorld.” http://mathworld.wolfram.com
[14] Jena, “A Semantic Web Framework for Java.”, http://jena.sourceforge.net
[27] C. Fellbaum, “WordNet: An Electronic Lexical Database”. MIT Press, Cambridge, London (1998)
[15] W3C, “RDF”,. http://www.w3c.org/RDF
[28] C. Fellbaum, “WordNet”, http://wordnet.princeton.edu
[16] W3C, “OWL”,. http://www.w3c.org/2004/OWL
[29] GermaNet Team, “GermaNet”, www.sfs.unituebingen.de/lsd/english.html
[17] W3C, “RDFS”,. http://www.w3c.org/TR/rdf-schema/ [18] W3C, “SPARQL”,. http://www.w3.org/TR/rdf-sparqlquery/ [19] L. Dodds, „Introduction SPARQL: Querying the SemanticWeb“, http://xml.com/lpt/a/2005/11/16/introducingsparql-querying-semantic-web-tutorial.html, 2006 [20] AT&T Research, “Graphviz”,. www.graphviz.org/ [21]Typo3,. http://typo3.org/ [22] Eclipse,. http://www.eclipse.org/ [23] T. Gruber, G. Olse, “An Ontology for Engineering Mathematics.”, Technical Report KSL- 94-18, Stanford University (1994) [24] M. Kohlhase, A. Franke, “MBase: Representing Knowledge and Context for the Intergration of Mathematical Software Systems.”, Journal of Symbolic Computation 23:4, pp. 365 - 402 (2001) [25] J. Baur, „Syntax und Semantik mathematischer Texte“, Master’s thesis, Universität des Saarlandes, FB Computerlinguistik. (Nov. 1999)
[30] H. Helbig, “Knowledge Representation and the Semantics of Natural Language (Cognitive Technologies)“, (November 2005), Springer, Berlin [31] J. Urban, “MoMM – Fast Interreduction and Retrieval in Large Libraries of Formalized Mathematics.” Internation Journal on Artificial Intelligence Tools (15(1)), 2006, pp. 109–130 [32] J. Urban, “MizarMode – An Integrated Proof Assistance Tool for the Mizar Way of Formalizing Mathematics.” J. of Applied Logic, 2005 [33] M. Pinkall, J. Siekmann, C. Benzmüller, I. KruijffKorbayova, „DIALOG“, http://www.ags.uni-sb.de/~dialog/ [34] A. Asperti, I. Padovani, C. Sacerdoti, I. Schena, “HELM and the Semantic Web.” In Boulton, R.J., Jackson, P.B., eds.: Theorem Proving in Higher Order Logics, 14th Internat. Conf., TPHOLs 2001, Edinburgh, Scotland, UK, September 3-6, 2001, Proc.. Vol. 2152 of Lect. Notes in Computer Science., Springer, 2001 [35] A. Asperti, S. Zacchiroli, “Searching Mathematics on the Web: State of the Art and Future Developments.” In: Joint Proceedings of the ECM4 Satellite Conference on Electronic Publishing at KTH Stockholm, AMS - SM M Special Session, Houston, 2004