KEA - A Knowledge Management System for ...

KEA - A Knowledge Management System for Mathematics Sabina Jeschke1, Nicole Natho2 and Marc Wilke1 University of Stuttgart, Center of Information Technologies – RUS, Stuttgart, 70550, Germany 2 TU Berlin, Center for Multimedia in Education and Research - MuLF, Berlin, 10623, Germany 1

Abstract — Information contained in mathematical digital writings creates new challenges for knowledge management systems because automated knowledge extraction from natural language texts is a technical challenge. However, mathematical texts possess a distinctive structure. KEA benefits from this structure, and extracts linguistic relations from texts, creates ontologies, and combines them to a knowledge base. This knowledge base is the core of an enhanced retrieval system. Responses of this system are user-adaptive text-based and visualized. Possible applications are mathematical encyclopaedias, context sensitive library search, intelligent book indexes and e-learning software. Index Terms — Knowledge Mangementment System, Knowledge Engineering, Web Engineering, Semantic Web, Natural Language Processing, Information Retrieval

I. INTRODUCTION Knowledge management takes an important function in the modern society. Numerous publications are released to impart new knowledge. Especially, the Internet is an example of an unsystematic knowledge base (KB). Too much information leads to an explosion of knowledge called “information glut”. Applications to specialize and generalize information gain in importance finding solutions to face these new challenges. In this paper, we present a knowledge management system (KMS) of the third generation to access mathematical information from different sources (i.e. textbooks, publications). Information snippets are extracted and prepared for integration into a KB. Many of the current information extraction approaches are based on statistical methods, e.g. Google. These approaches – while very successful – neglect the exactness of the mathematical language (ML). This exactness encourages the use of natural language processing (NLP) methods like “head-driven phrase structure grammar” and allows an automatic semantic annotation of texts. We obtain information in context (knowledge) that imparts multifaceted correlations between terms and concepts. However, a KMS including this technology provides new opportunities: automated knowledge acquisition, sophisticated user interfaces, intelligent retrieval methods, intelligent visualization, an ontology that projects an overview about a field of mathematics, etc. KEA is a comprehensive KMS to accumulate, store, merge, evaluate and visualize mathematical information for applications like encyclopaedias, context sensitive library search systems, intelligent book indexes and elearning software using standardized methods. It supports different types of users (administrators, developers and

end-users) by using specific web interfaces. The administrator manages the workflow of knowledge acquisition, storing and retrieval. A developer generates extensions and web interfaces for different applications. The workflow of providing such a system is designed by Probst et al. [1] consisting of the components: knowledge representation (storing, merging and splitting), acquisition, communication, usage and evaluation. There are many concepts to accomplish the key components: manual methods of knowledge acquisition and representation are time-consuming; a semi-automated method is used to retrieve information from ML texts. The knowledge acquisition from ML texts is implemented by the MARACHNA subsystem. MARACHNA generates OWL documents [2] of extracted mathematical information, representing elements of concepts and objects as well as relations and connections between them. These OWL documents are parts of the structure of the KBs. This is feasible because of the distinctive structure of ML texts itself: It is characterized by representative text modules “entities” such as definitions, theorems and proofs, which are used to describe mathematical concepts and objects. The information in these entities forms a complex relationship model in terms of a network defining an ontology of a specific ML text. The basic ontology which forms the skeletal for all KEA ontologies follows the ideas of Hilbert [3] and Bourbaki [4]: mathematics as a whole can be derived from a small set of axioms using propositional logic1. This logic is reflected in the structure and diction of ML texts. Therefore, an ontology of mathematical content can be directly extracted from the texts. MARACHNA utilizes both the ontology inherent to mathematics itself (domain ontology) and the ontology defined by the structure of the text chosen by the author (upper ontology). Thus, the MARACHNA subproject provides mechanisms to create knowledge networks from mathematical texts. The KEA system additionally provides mechanisms to navigate on these networks that reflect the contextual relations between mathematical terms and concepts. They are further annotated regarding their role in the ontology of the original text, thus preserving the upper ontology used by the authors, and can map relationships in a fine-grained way. In addition, these networks can be used to create an overview of mathematical content, offering different levels of information detail. To store and manage these information snippets of the texts in a KB the JENA framework is used. Furthermore the search engine Apache Lucene [5] is used for information retrieval and testing purposes. 1

This approach is valid within the context of mArachna, despite Gödel’s Incompleteness Theorem [6], since we map existing and proven mathematical knowledge into a KB and do not want to prove new theorems or check the consistency of the theorems we store.

For use with the KEA system it consists of several web interfaces (workbenches) for administrators, developers and end-users: Administrator workbench: Administration of the semi-automated natural language processing, KB and the retrieval methods; Developer workbench: Developing tools for web interfaces of the different applications of KEA; End-user: User-adaptive desktop solutions for different uses (Semantic Desktop) with personal ontologies and social network components. II. RELATED WORKS Being unaware of any current research mirroring KEA’s global concept, there are a number of projects with similar aims (yet different approaches). Fundamental research into ontologies for mathematics was performed by T. Gruber and G. Olsen [7]. Baur analyzed the language used in English mathematical texts [8]. The linguistic analysis of mathematical language is based on TRALE [9]. MBASE is an example of a mathematical ontology created by humans. The structuring of mathematical language and the modelling of the resulting ontology is similar to the human work performed in the creation of several mathematical encyclopaedias like MathWorld [10]. The ontologies represented in these projects can be used as an interesting comparison to the ontologies and KBs automatically generated by KEA. WordNet/GermaNet [11-13] are intelligent thesauri providing sets of suitable synonyms and definitions to a given queried expression. WordNet uses a similar approach to the automated semantic analysis of natural language compared to MARACHNA. However, WordNet does not provide detailed correlations between dissimilar or not-directly related terms. Helbig [14] has designed a concept for automated semantic analysis of German natural language texts. The Mizar [15, 16] system uses a both human and machine-readable formal language for the description of mathematical content as input for automated reasoning systems. An example for a different approach towards information extraction from ML texts, based on automated reasoning, is the work of the DIALOG [17] project. DIALOG is aimed at processing natural language user input for mathematical validation in an automated reasoning system. Both Mizar and DIALOG, in difference to KEA, require complex reasoning systems for their extraction of information. The HELM [18] project defines metadata for semantic annotation of mathematical texts for use in digital libraries. However, this metadata has to be provided by the original authors. MoWGLi [19] extracts the metadata provided in HELM and provides a retrieval interface based on pattern-matching within equations and logic expressions. The PIA [20, 21] project extracts information from natural language texts (not limited to mathematics) using machine learning techniques. PIA requires an initial set of human expert annotated texts as a basis for the information extraction performed by intelligent agents. An interesting KMS is KAON [22] to manage ontologies together with reasoning mechanisms.

III. LINGUISTIC APPROACH WITH THE MARACHNA SUBSYSTEM Entities are the principal carriers of information in ML texts. They are analysed using NLP techniques, based on a linguistic classification scheme [23]. This scheme defines four levels: relations between different types of entities are described on the entity level. The internal entity structure level specifies the internal structure of an entity (e.g. the assumptions and proposition of a theorem). Characteristic sentence structures which are commonly found in ML texts are described on the sentence level. On the word and symbol level, the lowest and most detailed level of our classification scheme, single symbols and words and their relations between each other are schematized. Mathematical information is extracted from a text using the structures and linguistic relations as defined by this classification scheme. The information is integrated into a KB, consisting of one or more directed graphs representing terms, concepts and the relations between each other. It is based on an ontology of the language of mathematics encoded with the web ontology language OWL [24]. The linguistic analysis of entities yields triples consisting of two nodes and one relation. Nodes represent mathematical terms and propositions, with the relation describing how they are connected to each other. It should be noted that triples themselves can be used as nodes in other triples, allowing the representation of more complex interrelations. In this context, different types of relations describe different types of linguistic phrases or key words in ML texts (e.g. two nodes corresponding to two propositions A and B, connected by the relation “is equivalent to”). These triples are integrated into the KB. This process closely maps the actual language structure, resulting in a very fine-grained KB. A.Technical Aspects of MARACHNA subsystem MARACHNA consists of separate modules: preliminary analysis of the input, syntactic analysis, semantic analysis and integration of results into the KB. Each extracted natural language sentence is analysed using NLP methods. The NLP analysis is implemented using the TRALE-System [9] based on a Head-Driven Phrase Structure Grammar. TRALE is a PROLOG adaptation of ALE for the German language. TRALE, as used in MARACHNA, has been extended by expanding the underlying dictionary and grammar to include the specifics of the ML. Complex equations have to be processed separately from the natural language analysis. This process is, however, not yet implemented. The NLP analysis provides detailed syntactic and even some limited semantic information about each sentence. The output is converted into an abstract syntax tree representing the structure of the analysed sentences. At this point, the separately processed equation will have to be reintegrated with the natural language text for further semantic analysis. The semantic analysis is implemented in the form of an embedded JavaScript interpreter [25]. The syntax trees are categorized according to typical structures characteristic for specific mathematical entities and semantic constructs. Each category is transformed into a specific triple struc-

ture defined by external JavaScript rules. These rules map typical mathematical language constructs onto the corresponding basic mathematical concepts (e.g. proposition, assumption, definition of a term etc). The resulting triples are annotated with additional information. This includes both the context within the original document as well as their classification within the context of the final OWL documents generated by MARACHNA. For each element of the triples it has to be decided whether they represent OWL classes or individuals of OWL classes. This distinction poses a significant challenge for the automated analysis, currently forcing MARACHNA to use OWL Full. Equations form a major portion of mathematical texts and constitute a primary source of information. Thus it is desirable to be able to include their content in the analysis and representation created by KEA. Currently, KEA is not capable of this important feature yet. However, we are investigating an approach to rectify this deficiency. Therefore the use of a syntactic analysis, similar to those used in computer algebra systems in combination with contextual grammars (e.g. Montague grammars) to correlate the information given in a formula with information already provided in the surrounding natural language text, is proposed. Using this approach should facilitate the integration of formulae and their informational content in the network created by the analysis of the natural language text. We do not aim for machine-based understanding of the formulae, as automatic reasoning systems would require. Instead, formulae are to be treated as a different representation of mathematical knowledge, to be integrated into the KB in a similar manner to that used for the natural language text. The analysis proposed here can be used as a first step in a further process leading to viable input for such reasoning systems, providing additional assistance in building the KB. B. Linguistic Analysis: An Example Take as an example the definition of a group [26]:

Let G be a set, G≠∅, a,b ∈ G: * : G×G | G, (a,b)| a*b. The tupel (G,*) is called a Group if the following conditions hold: 1. Associativity: ∀a,b,c∈G, a*(b*c)=(a*b)*c, 2. Neutral Element: ∃ e∈G such that ∀a∈G , a*e=e* =a, 3. Inverse Element: ∀a∈G ∃ e∈G such that a-1* =a*a =e where e is the neutral element.

Based on the knowledge of the use of certain keywords in mathematical texts, the linguistic analysis has identified “Let G be a set ... ” as the assumption, “The tupel (G,*) is called ... conditions hold:” as the proposition and the following three points as the properties of a definition. The linguistic analysis results in the representation as triples of predicate, subject and object. For example, the phrase “Let G be a set” is mapped to the triple (is_a, set, G). Finally, identifying identical subjects and objects of different triples with each other generates the relation-network. The resulting structure of triples describing the entities and the relations between them are coded into the database using OWL. The OWL representation, generated by extracting relevant information from the text, provides an overview of the mathematical knowledge presented by the

author. The upper ontology used by the author is reproduced. IV.

KNOWLEDGE ENGINEERING OF KEA

KEA is a semi-automated web-based system constructing and organizing a KB for mathematical language to store, maintain and retrieve information. The generated, annotated triples of the mathematical texts are organized in OWL-documents. Each analyzed text generates a separate independent document and represents a semiautomated generated ontology of this text by the semantic analysis of MARACHNA. This separation is necessary to retain the intrinsic preferences of each author. Therefore, these documents are the basic elements of the KB. A. KB of KEA - Ontology of Mathematical Texts with Jena The KB is organized and developed within the Javabased Jena Framework (“A semantic Web Framework for Java” [27]). This framework is open source and consists on several APIs (RDF – Resource Description Framework, OWL – Web Ontology Language, RDFS – RDF Scheme etc.) providing an opportunity to generate and manipulate RDF data models (in detail graphs) as Java classes. Several implementations of Java interfaces represent RDF-constituents like resources, literals and statements. Generally, RDF consists of representations forms like triples which forms statements. Jena can read and write different representations forms (M-triples or N3). There are different possibilities to store RDF models: primary memory or/and miscellaneous data bases (persistence models). In addition, Jena provides a RDF query language SPARQL [28] which is oriented on SQL, and an interference engine to derive additional RDF assertions. Initially, the KB contains basic mathematical information snippets comprising axiomatic set theory and first order logic. To avoid inconsistencies within the KB new information is added using as required a semi-automated approach: information is integrated into the KB if and only if there are no conflicts with existing entries. This means that new nodes have to be connected to existing ones; duplications or contradictions are forbidden. In case of conflicts, an administrator may provide additional information or delete the conflicting entries manually. The administrator-intervention is performed through a web interface manipulating the Jena Java classes. This model of information processing is based on well-known models of human knowledge processing: humans are able to integrate new knowledge into their world view only if it can be linked to already existing knowledge without misinterpretations of new information. As a consequence, incorrect knowledge may be deleted or corrected under certain conditions. It should be noted that, given the nature of the sources, human intervention into the automated integration is not the rule; most of the knowledge presented in a textbook follows the rules of consistency required for the automated integration.

B. Knowledge Retrieval of KEA – the Web-based Retrieval Interface After the acquisition process (semantic extraction), the KB consists of mathematical information snippets without need of interpretation. Beside the individual text ontologies, the KEA system generates several levels of superior ontologies of the KB. The low-level ontology describes the correlation between mathematical terms both within mathematical entities and between these entities. The upper-ontology reproduces the structure of the processed text and thus the field ontology used by the author. Both of these ontologies facilitate data-retrieval, the upper ontology in direct response to queries, the low-level ontology by showing interconnections and relations not directly included by the author or by offering additional links between the upper ontologies of several authors. Currently a web interface for the retrieval of information from MARACHNA is being implemented. The interface follows the basic philosophy of the “Personalised Home” of Google. The interface adheres to the metaphor of a desktop, allowing a high degree of personalization to the user. The desktop is implemented in JavaScript, offering predefined widgets that can easily be adapted to the preference of the user provided in the form of an extensive, free library. JavaScript was chosen for its wide-spread use and familiarity, allowing for uncomplicated modification by advanced users. Predefined widgets include (among others): Advanced search, Personalised history of queries and results, including social tagging and personal annotation for relevance, Graphic display of the KB, based on Graphviz [29] (an open source graph-visualisation project) including controls for choosing the referred ontology (low-level, different upper level ontologies based on author, field, etc.) or the context and type of information (choosing e.g. examples, definitions or theorems). Collaborative tagging methods will be used to categorize content and classify user types. Social networking services (collaborative tools) help to manage user groups at different knowledge levels (pupils, students, graduates, laities, etc). V.

a "standard" mathematical encyclopaedia, particularly concerning the generated (or, in the case of the standard encyclopaedia, used) ontologies for this field of mathematics. REFERENCES [1]

[2] [3] [4] [5] [6]

[7] [8] [9] [10] [11] [12] [13] [14] [15]

[16]

[17] [18]

[19]

CONCLUSIONS AND EVALUTATION

A prototypical implementation to analyze text written in German currently exists (it will be extended to analyze English texts). The prototype is capable of analyzing short passages of text, such as single mathematical entities as described in this paper. The semantic extraction leads to information snippets of the analyzed mathematical text. This information is successfully integrated into the discussed KB. As such, the prototype serves as a proof-ofprinciple, validating the approach of MARACHNA. The system is capable of automatically generating a low-level ontology of mathematics from natural langue texts based solely on NLP techniques. Future evaluation will include analysing complete textbooks on Linear Algebra and merging the resulting KBs to test the validity of the ontology mapping approach. The results will be compared with

[20]

[21] [22] [23]

[24] [25] [26] [27] [28] [29]

Probst, G., Raub S., Romhardt K. (2006): Wissen managen: Wie Unternehmen ihre wertvollsten Ressourcen optimal nutzen“ , Betriebwirtschaftl. Verlag Dr. Th. Gabler, Edition 5th W3C: OWL. http://www.w3c.org/2004/OWL D. Hilbert: Die Grundlagen der Mathematik. Abhandl. aus dem math. Seminar der Hamburgischen Universität 6 (1928) 65–85 N. Bourbaki: Die Architektur der Mathematik. Mathematiker über die Mathematik. Springer, Berlin, Heidelberg, New York (1974) Apache Lucene: http://lucene .apache.org K. Gödel: „Über formal unentscheidbare Sätze der Principia Mathematica und verwandte Systeme I“,. Monatsheft f. Mathematik und Physik (1931-1932) 147f T. Gruber, G. Olsen: “An Ontology for Engineering Mathematics.” Technical Report KSL- 94-18, Stanford University (1994) Baur, J.: Syntax und Semantik mathematischer Texte. Master’s thesis, Universität des Saarlandes, FB Computerlingui. (Nov. 1999) S. Müller TRALE. www.cl.uni-bremen.de/Software/Trale Wolfram Research: MathWorld. http://mathworld.wolfram.com Fellbaum, C.: WordNet: An Electronic Lexical Database. MIT Press, Cambridge, London (1998) C. Fellbaum: WordNet. http://wordnet.princeton.edu GermaNet Team: GermaNet. www.sfs.unituebingen.de/lsd/english.html Helbig, H.: Knowledge Representation and the Semantics of Natural Language. Springer, Berlin, Heidelberg, New York (2006) J. Urban: “MoMM – Fast Interreduction and Retrieval in Large Libraries of Formalized Mathematics.” Internation Journal on Artificial Intelligence Tools (15(1)) (2006) 109–130p J. Urban: MizarMode – “An Integrated Proof Assistance Tool for the Mizar Way of Formalizing Mathematics.” J. of Applied Logic (2005) M .Pinkall, J. Siekmann, C. Benzmüller, I. Kruijff-Korbayova: DIALOG. http://www.ags.uni-sb.de/~dialog/ Asperti, A. and Padovani, I. and Sacerdoti Coen, C. and Schena, I.: “HELM and the Semantic Web.” In Boulton, R.J., Jackson, P.B., eds.: Theorem Proving in Higher Order Logics, 14th Internat. Conf., TPHOLs 2001, Edinburgh, Scotland, UK, September 3-6, 2001, Proc.. Vol. 2152 of Lect. Notes in Computer Science., Springer (2001) Asperti, A., Zacchiroli, S.: “Searching Mathematics on the Web: State of the Art and Future Developments.” In: Joint Proceedings of the ECM4 Satellite Conference on Electronic Publishing at KTH Stockholm, AMS - SM M Special Session, Houston / (2004) Collier, N., K, T.: “PIA-Core: Semantic Annotation through Example-based Learning,” Third International Conference on Language Resources and Evaluation (May 2002) 1611–1614 The PIA Project: PIA. http://www.pia-services.de/ KAON: http://kaon2.semanticweb.org/ N. Natho: mArachna: Eine semantische Analyse der mathematischen Sprache für ein computergestütztes Information Retrieval. PhD thesis, Technische Universität Berlin (2005) W3C: MATHML. http://www.w3.org/Math Mozilla Foundation: Rhino. http://www.mozilla.org/rhino/ Wüst: Mathematik f. Physiker u. Mathematiker, Bd1. WileyVCH(2005) Jena: A Semantic Web Framework for Java. http://jena.sourceforge.net W3C: SPARQL. http://www.w3.org/TR/rdf-sparql-query/ AT&T Research: Graphviz. www.graphviz.org/