Knowledge Bases in mArachna

Sven Grottke, Sabina Jeschke, Nicole Natho, Ruedi Seiler
TU Berlin, Berlin, Germany

Abstract: Automated extraction of knowledge from natural language texts is a major technical challenge that remains largely unsolved. Scientific texts in general, and mathematical texts in particular, are characterized by the use of complex language constructs with the intent to transfer knowledge. To a large extent, mathematical texts possess a strict internal structuring and can be separated into text elements such as definitions, theorems etc. These text elements are principal carriers of mathematical information. In addition, these elements show a characteristic linguistic structuring well suited for natural language processing techniques. In this paper we present MARACHNA, a system for extracting mathematical relations from texts and integrating them into a knowledge base. In response to user queries, parts of the knowledge base are visualized using XML Topic Maps and OWL. In particular, MARACHNA aims to provide an overview of single fields of mathematics, as well as showing intra-field relations between mathematical objects and concepts.

Background

Information and knowledge are central concepts in today’s society. Numerous publications, books and the World Wide Web create an “infoglut” that is not easy to manage. Furthermore, manual information processing is a very time-consuming process. A reasonable approach to this problem is the development of mechanisms for the automated extraction of knowledge from natural language texts. Such an automated extraction requires methods for natural language processing and for the classification and visualization of knowledge. Artificial intelligence and psychology provide key impulses for the development of knowledge classification mechanisms such as semantic networks, associative networks and knowledge maps. These mechanisms provide an effective way to organize knowledge, a prerequisite for modeling knowledge. In the context of this work, knowledge is represented as relations between propositions and terms. Hence, knowledge classification mechanisms help to answer the question of how to make knowledge accessible for and processable by a computer, a crucial issue in developing eLearning, eTeaching and eResearch applications.

Basic concepts

MARACHNA is a system for automatically extracting knowledge from mathematical natural language texts. To achieve this, MARACHNA uses well-known natural language processing and knowledge representation techniques (e.g. Chomsky’s grammar). One sample application of MARACHNA is the development of a user-adaptive mathematical information retrieval system for the MUMIE eLearning environment (see Mumie community; Jeschke 2004). The MUMIE system focuses on teaching mathematics to engineering students at undergraduate university level.

Mathematical texts show a distinctive structure, both on the linguistic level and in the presentation of knowledge chosen by the author. This structure is characterized by typical text elements, such as definitions, theorems and proofs. In the following, we refer to these text elements as entities. In mathematical textbooks, entities are commonly used to describe mathematical objects and concepts. These entities form a complex network of relationships that can be described by an ontology. Since mathematics as a whole can be derived from a small set of axioms using propositional logic, it can be said that mathematics possesses an inherent structure, or, in other words, an inherent ontology. The network of mathematical terms and their relations as created by MARACHNA should closely recreate that structure inherent to mathematics itself. As a result, we expect MARACHNA to be able to integrate mathematical entities from very different sources, such as different mathematical textbooks, independent of the upper ontology preferred and used by the authors of those sources.

MARACHNA provides mechanisms for the creation of retrieval networks from mathematical texts, such as textbooks, papers or online content, as well as mechanisms for navigation on these networks. These networks reflect the contextual relations between mathematical terms and concepts. They are tightly connected to the original text, thus preserving the upper ontology used by the original authors, and can map relationships in a very fine-grained way. In addition, these networks can be used to create an overview of mathematical content. They therefore offer different levels of detail.

User-adaptive retrieval and navigation mechanisms are an important aspect of the MARACHNA project. User queries are implemented using selection mechanisms and keyword searches. Based on these queries, knowledge networks are dynamically generated at runtime. In the future, the MARACHNA engine will be able to process natural language queries. In addition, the answers will be adapted to a user’s knowledge level. This can be done by keeping a specific profile of each user.

Linguistic Approach

Entities are the principal carriers of information in mathematical texts. They are analyzed using natural language processing techniques, based on a linguistic classification scheme (see Jeschke 2004; Natho 2005). This scheme defines four levels (cf. Fig. 1). Relations between different types of entities are described on the entity level.

Figure 1: Linguistic classification scheme. First level: entity level (axiom, definition, theorem, proof). Second level: internal entity structure level. Third level: sentence level (e.g. “if and only if”). Fourth level: word and symbol level (e.g. word: set; symbol: S).

On the internal entity structure level, the internal structure of an entity (e.g. the assumptions and proposition of a theorem) is specified. Characteristic sentence structures, which are commonly found in mathematical texts, are described on the sentence level. On the word and symbol level at the bottom, single symbols and words and the relations between them are schematized (see Grottke, Jeschke, Natho, Seiler 2005).

Mathematical information is extracted from a text using the structures and linguistic relations defined by this classification scheme. Finally, this information is integrated into a knowledge base. This knowledge base consists of one or more directed graphs representing terms, concepts and the relations between them. It is based on an ontology of the language of mathematics encoded in the Web Ontology Language OWL (see W3Cb). OWL itself is a W3C specification and builds on RDF (Resource Description Framework) and RDF Schema (see W3Ca).

The linguistic analysis of entities yields triples consisting of two nodes and one relation (see TopicMaps.org). Nodes represent mathematical terms and propositions, with the relation describing how they are connected to each other. In this context, different types of relations describe different types of linguistic phrases or key words in mathematical texts (e.g. two nodes corresponding to two propositions A and B, connected by the relation “is equivalent to”). These triples are then integrated into the knowledge base. This process closely maps the actual language structure, resulting in a very fine-grained knowledge base.

The relations between entities are described using topic maps (see TopicMaps.org). They give an overview of the mathematical knowledge that is generated by extracting relevant information from the knowledge base, with special attention given to the underlying field ontology.
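The triple structure described above can be illustrated with a minimal sketch. It is not MARACHNA's implementation (the system stores its knowledge base in RDF/OWL); the class names `Triple` and `KnowledgeGraph` and the second example triple are assumptions made purely for illustration.

```python
from __future__ import annotations

from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    """One extracted statement: two nodes joined by a typed relation."""
    subject: str
    relation: str
    obj: str


class KnowledgeGraph:
    """Directed, edge-labelled graph built from triples (illustrative only)."""

    def __init__(self) -> None:
        # Maps each node to the set of (relation, target node) pairs leaving it.
        self.edges = defaultdict(set)

    def add(self, triple: Triple) -> None:
        self.edges[triple.subject].add((triple.relation, triple.obj))
        # Register the target as a node even if it has no outgoing edges.
        self.edges.setdefault(triple.obj, set())

    def related(self, node: str, relation: str) -> set[str]:
        """Return all nodes reachable from `node` via one edge of type `relation`."""
        return {target for rel, target in self.edges.get(node, set()) if rel == relation}


graph = KnowledgeGraph()
# The equivalence example from the text; the second triple is hypothetical.
graph.add(Triple("proposition A", "is equivalent to", "proposition B"))
graph.add(Triple("group", "is a", "set"))
```

Keeping the relation as an edge label, rather than encoding it in the node, mirrors the description above: different relation types correspond to different linguistic phrases, and queries can filter on them.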

Approach for Integrating Mathematical Formulae

As formulae form a major portion of mathematical texts and constitute a primary source of information in these texts, it is desirable to include their content in the analysis and representation created by MARACHNA. Currently, MARACHNA does not yet support this. However, we are investigating an approach to rectify this deficiency. We propose using a syntactical analysis similar to those used in computer algebra systems, in combination with contextual grammars (e.g. Montague grammars), to correlate the information given in a formula with information already provided in the surrounding natural language text. This approach should enable MARACHNA to integrate formulae and their informational content into the network created by the analysis of the natural language text.

It should be pointed out that we do not aim for machine-based understanding of the formulae, as automatic reasoning systems would require. Instead, formulae are to be treated as a different representation of mathematical knowledge, to be integrated into the knowledge base in a manner similar to that used for the natural language text. However, the analysis proposed here can be used as a first step in a further process leading to viable input for such reasoning systems, providing additional assistance in building the knowledge base.
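As a rough illustration of the proposed idea, the sketch below parses a formula syntactically and links each symbol to a term declared in the surrounding text. It is only a stand-in under strong assumptions: Python's own expression syntax (via the standard `ast` module) replaces a real formula grammar, no contextual or Montague grammar is involved, and the `declared` mapping, the relation name "denotes" and the example formula are hypothetical.

```python
from __future__ import annotations

import ast


def formula_symbols(formula: str) -> set[str]:
    """Parse the formula as an expression and collect the symbol names it uses."""
    tree = ast.parse(formula, mode="eval")
    return {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}


def link_formula(formula: str, declared: dict[str, str]) -> list[tuple[str, str, str]]:
    """Link each formula symbol to the term the surrounding text declared for it.

    `declared` is assumed to come from the natural language analysis,
    e.g. from a sentence such as "let a and b be the legs and c the hypotenuse".
    """
    triples = []
    for symbol in sorted(formula_symbols(formula)):
        if symbol in declared:
            triples.append((symbol, "denotes", declared[symbol]))
    return triples


# Hypothetical input: symbol declarations found in the text around the formula.
declared = {"a": "leg", "b": "leg", "c": "hypotenuse"}
links = link_formula("a**2 + b**2 == c**2", declared)
```

The resulting triples have the same shape as those produced from natural language text, which is the point of the proposal: the formula is treated as another representation to be integrated, not as something to be reasoned about.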

Knowledge Base

Information extracted from mathematical texts is organized by integrating it into a knowledge base. The knowledge base consists of basic mathematical knowledge (see Bourbaki 1974) that comprises axiomatic set theory and first order logic. To avoid inconsistencies within the knowledge base, new information is added using a semi-automated approach: information is integrated into the knowledge base if and only if there are no conflicts with existing entries. This means that new nodes have to be connected to existing ones; duplications or contradictions are inadmissible. In case of conflicts, the user may provide additional information or delete the conflicting entries manually (see Fig. 2 for an excerpt from the knowledge base).

This dual model of information management is based on well-known models of human knowledge processing: humans are able to integrate new knowledge into their worldview only if it can be linked to existing knowledge. Insufficient or incorrect prior knowledge may lead to misinterpretations of new information. As a consequence, incorrect knowledge may be deleted or corrected under certain conditions (see Anderson 2001).

Figure 2: Basic structures of the knowledge base.
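The integration rule described in this section (add a triple only if it is connected to existing nodes and is neither a duplicate nor a contradiction) can be sketched as follows. This is a simplified illustration, not the system's actual procedure: the `CONTRADICTS` table, the conflict labels and the example triples are assumptions, and real contradiction detection would be considerably more involved.

```python
from __future__ import annotations

from typing import Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

# Hypothetical table of mutually exclusive relations, assumed for illustration.
CONTRADICTS = {"is a": "is not a", "is not a": "is a"}


class KnowledgeBase:
    def __init__(self, seed: list[Triple]) -> None:
        # The seed stands in for the basic mathematical knowledge.
        self.triples = set(seed)

    def nodes(self) -> set[str]:
        return {node for s, _, o in self.triples for node in (s, o)}

    def conflict(self, triple: Triple) -> Optional[str]:
        """Return the reason a triple cannot be integrated, or None if it can."""
        subject, relation, obj = triple
        if triple in self.triples:
            return "duplicate"
        opposite = CONTRADICTS.get(relation)
        if opposite and (subject, opposite, obj) in self.triples:
            return "contradiction"
        known = self.nodes()
        if subject not in known and obj not in known:
            return "unconnected"
        return None

    def integrate(self, triple: Triple) -> Optional[str]:
        """Add the triple only if no conflict exists; otherwise report the conflict."""
        reason = self.conflict(triple)
        if reason is None:
            self.triples.add(triple)
        return reason


kb = KnowledgeBase([("set", "is a", "mathematical object")])
first = kb.integrate(("group", "is a", "set"))       # connected, no conflict
second = kb.integrate(("group", "is a", "set"))      # same triple again
third = kb.integrate(("group", "is not a", "set"))   # contradicts an entry
fourth = kb.integrate(("ring", "extends", "field"))  # no link to known nodes
```

Returning the conflict reason instead of raising an error reflects the semi-automated workflow: a rejected triple is handed back to the user, who may supply additional information or remove the conflicting entry.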

Knowledge Representation

Based on the knowledge base described in the previous section, MARACHNA creates visual knowledge representations. These representations provide a clear view of mathematical fields at varying levels of detail. Visualization is accomplished by an interactive directed graph that can be navigated using selection and zooming.

XML Topic Maps (see TopicMaps.org) are used to display knowledge representations: based on the XML meta language, they were designed for the purpose of information management and organization, taking into account natural human knowledge structuring mechanisms. In contrast, RDF focuses on structuring meta-data in such a way as to allow for more efficient processing by a computer (see Garshol). MARACHNA therefore uses two standard mechanisms: the knowledge base is based on RDF/OWL, while the information retrieval system, operating on a mirror of the knowledge base, uses XML Topic Maps (cf. Fig. 3). Several software frameworks for topic maps exist, such as the Java-based TM4J framework (see tm4j.org), which is also used in the MUMIE environment.

Definitions of mathematical terms and concepts are usually clear and precise. In addition, the description mechanisms used in mathematics are well-structured and strict. Hence, topic maps are well-suited to represent mathematical concepts, and the structure of the knowledge base as well as that of the mathematical text is easily adapted to the topic map structure. XML Topic Maps extract relevant information from the knowledge base, providing a selection of mathematical knowledge suitable for acquisition by a human. They offer a clear and efficient representation of pieces of information and the relations between them. In addition, they allow for user-specific views of a mathematical knowledge area, for example by specifying limiting conditions (e.g. scopes) for the visualized area.
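The idea of a scope-limited, zoomable view can be sketched in a few lines. The sketch does not use the XML Topic Maps format or the TM4J API; the `Association` class, the `scoped_view` function, the use of a depth limit as a stand-in for zooming, and the sample data are all assumptions for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Association:
    source: str
    relation: str
    target: str
    scopes: frozenset  # hypothetical scope labels, e.g. "linear algebra"


def scoped_view(associations, scope, start, depth):
    """Return the associations valid in `scope` that are reachable from `start`
    by following at most `depth` edges (a simple stand-in for zooming)."""
    in_scope = [a for a in associations if scope in a.scopes]
    frontier, seen, view = {start}, {start}, []
    for _ in range(depth):
        step = [a for a in in_scope if a.source in frontier and a not in view]
        view.extend(step)
        # Continue only from nodes that have not been expanded before.
        frontier = {a.target for a in step} - seen
        seen |= frontier
    return view


# Hypothetical sample data.
associations = [
    Association("vector space", "is a", "group", frozenset({"linear algebra"})),
    Association("group", "is a", "set", frozenset({"linear algebra", "algebra"})),
    Association("group", "has", "neutral element", frozenset({"algebra"})),
]

overview = scoped_view(associations, "linear algebra", "vector space", depth=1)
detail = scoped_view(associations, "linear algebra", "vector space", depth=2)
```

Increasing `depth` reveals more of the graph around the selected term, while the scope restricts which associations are shown at all, so the same underlying data can yield different user-specific views.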

Knowledge Management in MARACHNA: An Outlook

At the moment, MARACHNA exists as a prototypical implementation that analyzes texts written in German (it will be extended to analyze texts written in English). For selected text elements, the prototype demonstrates the feasibility of a semi-automated approach to semantic extraction as described in this paper. The semantic extraction yields information snippets from the analyzed mathematical text, which can be integrated into the knowledge base discussed above. Based on the knowledge base entries, it is possible to create and display different forms of knowledge representations, for example based on Topic Maps and the Web Ontology Language (OWL).

The concept of the knowledge base is simple: the semantic analysis creates triples consisting of a subject, an object and a relation between them. These triples are inserted into the knowledge base, where they form a complex network of concepts and relations. The advantage of mathematical language is that it contains neither emotional nor informal concepts. Mathematical language, in particular within the entities, places a strict emphasis on the transfer of knowledge. Thus, the knowledge base consists of pure mathematical knowledge that requires no interpretation. The acquisition process (semantic extraction) follows strict rules based on the structure of entities. The same is true for the organization and storage of mathematical content in the knowledge base.

Still, some problems remain with regard to merging or extracting knowledge. In mathematics, there are often different notations or phrasings describing the same concept, each customized for a specific audience. For example, in education it is advisable to use different versions of an entity for elementary school and high school. Likewise, when entities taken from courses for students of mathematics are integrated into the knowledge base, it is difficult to reuse the same information for teaching engineering students. Hence, there is a problem with the reusability of information in the knowledge base.

Figure 3: Relations on the knowledge base (left) - Representation on the topic maps (right).

In addition, the content in the knowledge base consists of very fine-grained structures based on the actual mathematical text. Often, this is too much information to handle in any useful way. Furthermore, MARACHNA is not capable of integrating different entities with the same meaning into the knowledge base. This means that the same triple cannot be integrated twice, which is one of the main reasons for using a semi-automated approach: a human administrator has to decide which entity is the correct or most useful one. It is desirable to automate this process, allowing the integration of different forms of the same entity, and to create filtering mechanisms for differentiating between the different uses of knowledge during the retrieval process.

Furthermore, an automated process for integrating different (alternative or even contradicting) relations between entities would facilitate the merging of different knowledge bases, e.g. those gained from different textbooks. One approach under consideration would attempt to identify equivalent entities in the different databases by matching identical key terms, which are often introduced in definitions. Once these equivalent entities are identified, we propose to collapse the identical nodes of their representation within the knowledge base and to store the remaining nodes (grouped together by original entities and annotated) as alternatives connected to the collapsed identical nodes.

Implementing this concept facilitates user adaptivity, i.e. the ability to display selected knowledge that is relevant to a specific user. For this purpose, it is necessary to store information about a user’s existing knowledge and skill level. This could be done by keeping track of a user’s previous activities, such as completed courses, a query history, exam results etc.
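The merging approach under consideration, described above, can be sketched as follows: entities from different sources are grouped by key term, the triples they share are collapsed, and the remaining triples are kept as alternatives annotated with their source. This is an illustrative sketch only; the dictionary layout, the case-insensitive key term matching and the sample entities are assumptions, and the sketch assumes at most one entity per key term and source.

```python
from __future__ import annotations

from collections import defaultdict


def merge_entities(entities):
    """Merge entities that share a key term.

    Each entity is a dict with the keys 'key_term', 'source' and 'triples'
    (a set of (subject, relation, object) tuples).
    """
    by_term = defaultdict(list)
    for entity in entities:
        by_term[entity["key_term"].lower()].append(entity)

    merged = {}
    for term, group in by_term.items():
        # Triples present in every version of the entity are collapsed.
        shared = set.intersection(*(e["triples"] for e in group))
        # The rest is stored per source as an annotated alternative.
        alternatives = {
            e["source"]: e["triples"] - shared
            for e in group
            if e["triples"] - shared
        }
        merged[term] = {"shared": shared, "alternatives": alternatives}
    return merged


# Hypothetical entities extracted from two different textbooks.
entities = [
    {
        "key_term": "Group",
        "source": "textbook A",
        "triples": {("group", "is a", "set"), ("group", "has", "neutral element")},
    },
    {
        "key_term": "group",
        "source": "textbook B",
        "triples": {("group", "is a", "set"), ("group", "has", "identity element")},
    },
]

merged = merge_entities(entities)
```

Keeping the alternatives keyed by source is what would later allow filtering during retrieval, for example showing a user only the phrasing from the textbook that matches their course.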

References

Anderson, J. R. (2001). Kognitive Psychologie. Spektrum Akademischer Verlag, Heidelberg, Berlin, 3rd edition.
Bourbaki, N. (1974). Die Architektur der Mathematik. Mathematiker über die Mathematik. Springer, Berlin, Heidelberg, New York.
Dahlmann, N., Jeschke, S., Seiler, R., Sinha, U. (2003). MOSES meets MUMIE: Multimedia-based Education in Mathematics. International Conference on Education and Information Systems: Technologies and Applications, International Institute of Informatics and Systemics, Orlando, Fla., 370-375.
Garshol, L. M. Living with topic maps and RDF. http://www.ontopia.net/topicmaps/materials/tmrdf.html
Grottke, S., Jeschke, S., Natho, N., Seiler, R. (2005). mArachna: A Classification Scheme for Semantic Retrieval in eLearning Environments in Mathematics. Proceedings of the 3rd International Conference on Multimedia and ICTs in Education, June 7-10, 2005, Caceres/Spain.
Helbig, H. (2000). Die semantische Struktur natürlicher Sprache. Wissensrepräsentation mit MultiNet. Springer, Berlin, Heidelberg.
Jeschke, S. (2004). Mathematik in Virtuellen Wissensräumen - IuK-Strukturen und IT-Technologien in Lehre und Forschung. PhD thesis, Technische Universität Berlin, Berlin.
Jeschke, S., Keil-Slawik, R. (2004). Next Generation in eLearning Technology - Die Elektrifizierung des Nürnberger Trichters und die Alternativen. Informationsgesellschaft. Alcatel SEL Stiftung.
Jeschke, S., Kohlhase, M., Seiler, R. (2004). eLearning-, eTeaching- & eResearch-Technologien - Chancen und Potentiale für die Mathematik. DMV-Nachrichten.
Jeschke, S., Richter, T., Seiler, R. (2005). VideoEasel: Architecture of Virtual Laboratories on Mathematics and Natural Sciences. Proceedings of the 3rd International Conference on Multimedia and ICTs in Education, June 7-10, 2005, Caceres/Spain. To appear.
Mumie community. Mumie. http://www.mumie.net
Natho, N. (2005). mArachna: Eine semantische Analyse der mathematischen Sprache für ein computergestütztes Information Retrieval. PhD thesis, Technische Universität Berlin, Berlin.
Richter, T. VideoEasel. http://www.math.tu-berlin.de/~thor/videoeasel
tm4j.org. TM4J - Topic Maps 4 Java. http://tm4j.org
TopicMaps.org. Topic Maps. http://www.topicmaps.org
W3Ca. Resource Description Framework (RDF). http://www.w3.org/RDF/
W3Cb. Web Ontology Language (OWL). http://www.w3.org/2004/OWL/