Information extraction from mathematical texts by means of natural language processing techniques
Sabina Jeschke, Marc Wilke
Marie Blanke, Nicole Natho, Olivier Pfeiffer
University of Stuttgart, Center of Information Technologies Institute of Information Technology Services (IITS) RUS, Allmandring 30a, 70550 Stuttgart, Germany +49-711-685-{88000, 60476}
Berlin University of Technology Center for Multimedia in Education and Research MuLF - Sekretariat MA 7-2, Straße des 17. Juni 136 10623 Berlin, Germany +49-30-314-{79155, 79155, 24603}
{sabina.jeschke, marc.wilke}@rus.uni-stuttgart.de
{blanke, natho, pfeiffer}@math.tu-berlin.de
ABSTRACT
Particularly with regard to the widespread use of the internet, the increasing amount of scientific publications creates new requirements for sophisticated information retrieval systems. The automated generation of semantic annotation describing both mathematical texts themselves and the structure of the underlying mathematical field is an important building block for such information retrieval systems. Many powerful statistical approaches for finding correlations in texts exist, e.g. those used by Google. mArachna follows a different approach and uses natural language processing techniques to recover the fine-grained information snippets within mathematical texts. The extracted information is stored in knowledge bases, creating a low-level ontology of mathematics. In this article we present our further developments in this field and the technical implementation of the mArachna prototype.
Categories and Subject Descriptors I.7.0 [Document and Text Processing]: General
General Terms Algorithms, Management, Design, Languages, Theory.
Keywords Ontology Engineering, Natural Language Processing, Semantic Annotation, Automated generation of metadata.
1. General Concepts of mArachna
Processing and managing knowledge is an important matter in modern organizations and societies. Numerous publications are released daily, spreading information and expanding our knowledge reservoir. The Internet in particular is an important example of a non-systematic knowledge source providing access to an overwhelming amount of information. This "information glut" causes severe problems and can hardly be controlled, if at all. Therefore, finding applications for specializing and generalizing knowledge is an outstanding challenge. In this article, a sophisticated information management system for accessing mathematical information from different text sources such as textbooks or digital publications is presented. The extracted information is integrated into appropriate knowledge bases. The advantage of the mathematical language is its precision, facilitating the application of natural language processing (NLP) methods like Head-driven Phrase Structure Grammar for the automated semantic annotation of texts. In this way, we obtain information snippets from a given context that capture the mathematical correlations between terms and concepts. An information management system based on these mechanisms poses challenges such as automated knowledge acquisition, sophisticated user interfaces, intelligent retrieval mechanisms, visualization of knowledge, and the generation of different types of ontologies for several mathematical courses. A comprehensive knowledge/information management system comprises the accumulation, storage, merging, evaluation and representation of mathematical information for different purposes (e.g. a context-sensitive library search, a book indexer, teaching software or encyclopedias).
The mArachna system combines these knowledge acquisition methods: it extracts mathematical ontologies from mathematical texts, represented as mathematical concepts and objects as well as relations between them. These ontologies are mapped onto the structure of the knowledge bases. This process is possible because of the special structure of mathematical texts, induced by the design of mathematics itself and by the authors' preferences. This special structure is characterized by particular basic text elements called "entities", such as definitions, theorems and proofs. Entities are generally used to present mathematical objects and concepts and constitute the key components of mathematical texts. The information within the entities delineates a complex network model, defining an ontology of the analyzed text. The fundamental ontology (the basic mathematical structure underlying all ontologies) follows the ideas of Hilbert [1] and Bourbaki [2]: mathematics as a whole can be derived from a selected set of axioms using propositional logic. All mathematical texts reflect this axiomatic and logical structure, allowing for the direct extraction of the mathematical content into the ontology. Hence mArachna is capable of integrating mathematical entities from diverse sources (e.g. lectures, articles, textbooks) independent of the upper ontology, which represents the corresponding author's preferences. The analysis is based on natural language processing techniques and utilizes both the ontology inherent in mathematics itself and the ontology defined by the structure of the text chosen by the author. Additionally, mArachna provides several techniques for creating semantic networks from mathematical texts. The extracted mathematical information is annotated according to its role in the original text and maps correlations in a complex and fine-grained way. Furthermore, these networks are capable of generating an overview of mathematical content and offer different levels of information detail. For storage, management and retrieval in the knowledge base, the Jena framework [3] and the search engine Apache Lucene [4] are used.
2. Related Works
Our investigation of related work showed that we could not identify any similar overall project comparable to mArachna as a whole. Nevertheless, mArachna can be decomposed into different modules that are comparable to work done by other groups. Basic results for mathematical ontologies were accomplished by T. Gruber and G. Olsen [5]. MBase [6, 7] is a mathematical ontology generated from manually written entries. Baur investigated the use of the English language in mathematical texts [8]. Our natural language processing analysis is based on TRALE [9]. The structuring and modeling mechanisms for mathematical knowledge are comparable to those of manually written mathematical encyclopedias like MathWorld [10]. The ontologies used in these projects can be compared to the knowledge bases generated by mArachna. WordNet [11, 12] and GermaNet [13] are intelligent thesauri providing a number of suitable synonyms and definitions for queried expressions. Both projects follow an approach to automated semantic analysis similar to that of mArachna. However, WordNet and GermaNet do not provide detailed correlations between dissimilar or indirectly related terms. A concept for the automated semantic analysis of German natural language texts was designed by Helbig [14]. For describing mathematical content as input for automated reasoning systems, the Mizar [15-17] system uses a human- and machine-readable formal language. An example of an automated reasoning system is the DIALOG [18] project, which processes natural language input for mathematical validation in an automated reasoning system. The main difference between mArachna and Mizar or DIALOG is that both of the latter systems rely on complex reasoning systems. The HELM [19] project defines metadata – which have to be provided by the original authors – for the semantic annotation of mathematical texts for use in digital libraries. MoWGLI [20] uses these metadata from HELM and offers a retrieval interface based on pattern matching within mathematical equations and logical expressions. For extracting information from any kind of natural language text, the PIA [21-23] project uses machine learning techniques and requires an initial set of texts annotated by human experts as a basis for information extraction performed by intelligent agents.
3. Natural Language Processing Approach
Entities (see chapter 1) are the basic carriers of information in mathematical texts. A linguistic classification scheme [24, 25] is applied to analyze the entities using natural language processing methods. It operates on the following levels:
• Entity Level: the entities used and their composition within the text
• Internal Structure Level: text blocks within an entity, such as propositions and assumptions
• Sentence Level: characteristic, frequently used sentence structures
• Word and Symbol Level: single symbols and words and the relations between them [26].
This scheme with its levels is depicted in figure 1:
Figure 1. Classification Scheme
The result of analyzing a text with this scheme consists of one or more directed graphs representing terms, concepts, and their relations. These graphs form the knowledge base, encoded in the Web Ontology Language (OWL) [27]. A single information snippet is a triple of two nodes (representing mathematical terms or concepts) and one relation (representing the type of relation between the nodes). These triples are nested and form complex structures, because triples can themselves be used as nodes of other triples. Different types of keywords and linguistic phrases in mathematical texts are mapped to different types of relations: for example, two nodes representing propositions may be connected by the "is equivalent to" relation. This mechanism transfers the information from the mathematical text into the knowledge base, generating a very fine-grained structure.
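One way to realize such nesting in plain RDF is reification, where a statement is turned into a resource that can itself be the subject or object of further statements. The text does not state whether mArachna uses exactly this mechanism, so the following Java sketch, written against the current Apache Jena API (the successor of the SourceForge Jena referenced in [3]) and using a purely illustrative namespace, should be read only as one possible way a triple can serve as a node of another triple.

```java
import org.apache.jena.rdf.model.*;

public class NestedTripleSketch {
    public static void main(String[] args) {
        // Hypothetical namespace; the real mArachna vocabulary may differ.
        String NS = "http://example.org/maths#";
        Model model = ModelFactory.createDefaultModel();

        Resource propA = model.createResource(NS + "propositionA");
        Resource propB = model.createResource(NS + "propositionB");
        Property isEquivalentTo = model.createProperty(NS, "isEquivalentTo");
        Property statedIn = model.createProperty(NS, "statedIn");

        // Basic triple: propositionA isEquivalentTo propositionB.
        Statement equivalence = model.createStatement(propA, isEquivalentTo, propB);
        model.add(equivalence);

        // Reify the triple so that it can act as a node of a further triple.
        ReifiedStatement asNode = equivalence.createReifiedStatement();
        asNode.addProperty(statedIn, model.createResource(NS + "theorem42"));

        model.write(System.out, "N3"); // serialize the nested structure, e.g. in N3
    }
}
```

Named graphs or OWL annotation constructs would be alternative encodings with the same effect.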
4. Technical Overview
mArachna is composed of three separate interfaces (see figure 2):
• Text Preparation Interface: the LaTeX converter,
• NLP Interface: the syntactic and semantic analysis,
• Knowledge Base and Retrieval Interface.
In addition, mArachna strictly follows general directives:
• Modularized components: Each interface is designed as an independent component wherever possible.
• Use of standardized formats: All data are represented in standardized formats such as TEI, MathML, XML, RDF/OWL, etc.
• Access by non-programmers: All rules of the NLP analysis should be easily modifiable by non-programmers.
Figure 2. Interfaces of mArachna (text preparation with the LaTeX converter, NLP with TRALE and the JS rules engine, knowledge base with Jena, and the retrieval interface)
4.1 Text Preparation
Currently, the only input format of mArachna is LaTeX, since it is the standard format for publications in mathematics and the natural sciences. The internal representation format of the processed texts is TEI (Text Encoding Initiative [28]); the representation format for mathematical symbols and equations is MathML [29]. For this reason, the system provides a converter from LaTeX to TEI/MathML. The internal structure of the raw mathematical text is decomposed into its components, such as chapters, entities and paragraphs, and this information is encoded in the TEI representation. In a following step, the text is separated into single sentences and annotated with metadata such as the original TEI representation and the structural role within an entity. Symbols and equations are processed separately from the text: some simple symbols and equations can be replaced by natural text elements, the rest is encoded in Presentation MathML. The converter is implemented in Java using rule-based string comparison.
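The converter's rule set is not spelled out in the paper; the following Java fragment is only a minimal sketch of what such a rule-based string conversion from LaTeX to TEI could look like. Both the patterns and the TEI element names are assumptions made for illustration, not the actual mArachna rules.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative rule-based LaTeX-to-TEI conversion (not the actual mArachna converter). */
public class LatexToTeiSketch {

    // Hypothetical rewrite rules: LaTeX pattern -> TEI/text replacement.
    private static final Map<String, String> RULES = new LinkedHashMap<>();
    static {
        RULES.put("\\\\section\\{(.+?)\\}", "<div type=\"chapter\"><head>$1</head>");
        RULES.put("\\\\begin\\{definition\\}", "<div type=\"entity\" subtype=\"definition\">");
        RULES.put("\\\\end\\{definition\\}", "</div>");
        // Simple symbols can be replaced by natural text elements.
        RULES.put("\\\\forall", "for all");
        RULES.put("\\\\exists", "there exists");
    }

    public static String convert(String latex) {
        String result = latex;
        for (Map.Entry<String, String> rule : RULES.entrySet()) {
            result = result.replaceAll(rule.getKey(), rule.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(convert(
            "\\begin{definition} \\forall x \\in A ... \\end{definition}"));
    }
}
```

Symbols that no rule matches would remain untouched and be handed over to the MathML encoding step.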
4.2 NLP: Syntactical and Semantic Analysis
All sentences in TEI and MathML format are analyzed by natural language processing methods. Of course, mathematical equations cannot be analyzed by general natural language processing methods and must be handled separately by their own syntax analyzer (see chapter 6.4). This process is, however, not yet implemented. Therefore, all non-analyzed symbols and equations are tagged with an identity number and treated like nouns in the NLP analysis. The syntax analysis of the natural language parts uses the TRALE system (a PROLOG adaptation [9] of ALE for German [30]) based on Head-Driven Phrase Structure Grammar. The existing dictionary and grammar of TRALE for the German language had to be extended with the properties of the mathematical language in order to provide a sophisticated syntactic analysis and parts of the semantic analysis. The result of the TRALE analysis is represented as an abstract syntax tree of the analyzed text. At this point, all separated symbols and equations are fully reintegrated. A JavaScript interpreter [31] implements the semantic analysis: the components of the syntax trees are matched against typical constructs of the mathematical language, which are implemented as JavaScript rules. These rules map typical mathematical language constructs and prepare the structure of the mathematical language in a machine-readable form. As a result of the semantic analysis, we obtain a specific triple structure with additional information from the analyzed text, defined by these rules. The resulting triple structure is encoded in OWL documents. Unfortunately, these documents only satisfy the OWL Full standard, since it cannot generally be determined whether a given triple element represents an OWL class or an individual of an OWL class.
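To illustrate how JavaScript rules can be driven from Java, the following sketch embeds the Rhino interpreter [31] and runs a single hypothetical rule against a simplified syntax-tree node. The rule format, the node object and the triple collector are invented for this example and do not reflect the actual mArachna rule API.

```java
import org.mozilla.javascript.Context;
import org.mozilla.javascript.Scriptable;
import org.mozilla.javascript.ScriptableObject;

/** Illustrative use of Rhino to apply a semantic rule to a parsed sentence (not the real mArachna rules). */
public class RuleEngineSketch {

    /** Collects triples emitted by the JavaScript rules; here it simply prints them. */
    public static class TripleCollector {
        public void emit(String relation, String subject, String object) {
            System.out.println(relation + "(" + subject + ", " + object + ")");
        }
    }

    public static void main(String[] args) {
        // Hypothetical rule: an "X is called Y" construct yields the triple (is_called, Y, X).
        String rule =
            "if (node.construct == 'is_called') {"
          + "  triples.emit('is_called', node.predicateNoun, node.subject);"
          + "}";

        Context cx = Context.enter();
        try {
            Scriptable scope = cx.initStandardObjects();

            // Expose a minimal node of the abstract syntax tree to the script (invented attributes).
            Scriptable node = cx.newObject(scope);
            ScriptableObject.putProperty(node, "construct", "is_called");
            ScriptableObject.putProperty(node, "predicateNoun", "group");
            ScriptableObject.putProperty(node, "subject", "(A,*)");
            ScriptableObject.putProperty(scope, "node", node);
            ScriptableObject.putProperty(scope, "triples", Context.javaToJS(new TripleCollector(), scope));

            cx.evaluateString(scope, rule, "rule.js", 1, null);
        } finally {
            Context.exit();
        }
    }
}
```

Keeping the rules in plain script files is what makes them modifiable by non-programmers, as demanded by the directives in chapter 4.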
4.3 Knowledge Base and Retrieval
With the help of the Java-based framework Jena [3], the triples are inserted into the knowledge bases. Jena processes OWL documents and provides APIs for persistent storage. In addition, it provides a SPARQL [32] engine for querying RDF and OWL documents. Each analyzed text creates its own knowledge base, because each author has an individual style and notation for representing a specific mathematical field. To avoid inconsistencies in a single knowledge base, the buildup of single knowledge bases and the merging of different knowledge bases for a mathematical field are separated. A plain knowledge base consists of the fundamental mathematical knowledge needed for the analyzed text and some aspects of axiomatic set theory and first order logic. Inconsistency avoidance in the knowledge bases is ensured by a semi-automated approach used for the integration of new information snippets (triples): triples are only integrated if they are not redundant and do not contradict existing triples in the knowledge base. Moreover, new nodes have to be linked to already existing ones. Where this is not possible automatically, the user has to eliminate such conflicts manually through a web-based interface built on the Jena framework. The fundamental idea of this approach is based on models of human knowledge acquisition from psychology: humans can only integrate new knowledge if it can be consistently integrated into their existing knowledge. Fortunately for the mArachna system, textbooks use strict and consistent rules and therefore facilitate automated integration. Hence, seamless integration should be the normal case, with user intervention as the exception. The information retrieval interface is not fully implemented yet (see chapter 6.3).
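A minimal sketch of this semi-automated integration step, written against the current Apache Jena API and an illustrative namespace, could look as follows: a triple is only added automatically if it is not redundant and if at least one of its nodes already exists in the knowledge base. Contradiction detection and the web-based conflict resolution described above are omitted here.

```java
import org.apache.jena.rdf.model.*;

/** Illustrative semi-automated integration of a new triple into a Jena-backed knowledge base. */
public class IntegrationSketch {

    /** Returns true if the triple was integrated automatically, false if user intervention is needed. */
    static boolean integrate(Model kb, Resource subject, Property relation, RDFNode object) {
        // Redundant triples are not added again.
        if (kb.contains(subject, relation, object)) {
            return true;
        }
        // New nodes must be linked to already existing ones.
        boolean subjectKnown = kb.containsResource(subject);
        boolean objectKnown = object.isResource() && kb.containsResource(object.asResource());
        if (!subjectKnown && !objectKnown) {
            return false; // defer to the web-based conflict resolution interface
        }
        kb.add(subject, relation, object);
        return true;
    }

    public static void main(String[] args) {
        String NS = "http://example.org/maths#"; // hypothetical namespace
        Model kb = ModelFactory.createDefaultModel();
        Resource group = kb.createResource(NS + "Group");
        Property hasProperty = kb.createProperty(NS, "hasProperty");
        kb.add(group, hasProperty, kb.createResource(NS + "associativity"));

        // The node "Group" already exists, so this snippet is integrated automatically.
        boolean ok = integrate(kb, group, hasProperty, kb.createResource(NS + "neutralElement"));
        System.out.println("integrated automatically: " + ok);
    }
}
```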
5. An Example
To explain how mArachna works, the definition of a mathematical group [33] (cf. figure 3) is presented:
Let A be a set, A≠∅, x,y ∈ A: * : A×A → A, (x,y) ↦ x*y. The tuple (A,*) is called a group if the following conditions hold:
1. Associativity: ∀x,y,z∈A: x*(y*z) = (x*y)*z,
2. Neutrality: ∃e∈A such that ∀x∈A: x*e = e*x = x,
3. Inverse: ∀x∈A ∃x⁻¹∈A such that x⁻¹*x = x*x⁻¹ = e, where e is the neutral element.
In the semantic analysis, "Let ..." is identified as a keyword of an assumption. In the same manner, "The tuple (A,*) is called ... conditions hold" is recognized as a proposition with three properties (associativity, neutral element and inverse element). The natural language processing analysis then yields a number of connected triples, each consisting of a predicate, a subject and an object. For example, the phrase "The tuple (A,*) is called a group…" is mapped to the simple triple (is_called, group, (A,*)). The result is a network of relations, built up by identifying identical subjects and objects of the integrated triples. The whole network, coded in OWL, is generated by extracting the information from a specific text and represents an overview of the mathematical knowledge of a certain author.
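Using the illustrative vocabulary of the earlier sketches, the triples extracted from this definition could be assembled into a Jena model as shown below. The resource and property names are hypothetical stand-ins for the nodes and relations shown in figure 4, not the actual mArachna schema.

```java
import org.apache.jena.rdf.model.*;

/** Illustrative construction of the triple network for the group definition (hypothetical vocabulary). */
public class GroupExampleSketch {
    public static void main(String[] args) {
        String NS = "http://example.org/maths#";
        Model kb = ModelFactory.createDefaultModel();

        Resource group = kb.createResource(NS + "Group");
        Resource tuple = kb.createResource(NS + "tupleA_star"); // stands for the tuple (A,*)
        Property isCalled = kb.createProperty(NS, "isCalled");
        Property hasProperty = kb.createProperty(NS, "hasProperty");

        // (is_called, group, (A,*)) in the paper's (predicate, subject, object) order.
        kb.add(group, isCalled, tuple);

        // "... if the following conditions hold" -> three hasProperty relations.
        kb.add(group, hasProperty, kb.createResource(NS + "associativity"));
        kb.add(group, hasProperty, kb.createResource(NS + "neutralElement"));
        kb.add(group, hasProperty, kb.createResource(NS + "inverseElement"));

        kb.write(System.out, "N3"); // the network, serialized as it would be stored in RDF/OWL
    }
}
```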
Figure 4. Knowledge Base: Definition of a Group (triple network linking nodes such as Group, (A,*), tuple and set with the properties associativity, neutral element and inverse element via relations such as isCalled, definedAs, hasProperty and elementOf)

6. Conclusions, Evaluation and Outlook
mArachna is a semi-automated web-based system constructing and organizing knowledge bases for mathematical language in order to store, maintain and retrieve mathematical information. Precise conversion of knowledge is vital when working with mathematical texts. Hence, the acquisition process (semantic analysis) and the organization of the knowledge bases are based on well-defined rules representing the structure of mathematical texts. The knowledge base itself consists of annotated triples generated from the analyzed text and organized in OWL documents. These triples form a very complex structure. Each text is represented in a separate, independent OWL document and hence in an individual knowledge base. This separation is important to preserve the individual preferences of each author. Currently, mArachna is a prototype analyzing texts written in German and English, processing selected entities and text blocks, and successfully integrating the extracted information into knowledge bases. For prospective development we will analyze complete textbooks for different fields of mathematics, starting with linear algebra.
For evaluation purposes, mArachna will be integrated into a cooperative knowledge space system used for mathematical learning environments, which contains numerous mathematical language constructs in written communication between learners as well as in documents. This approach allows for testing mArachna with a wide range of correct and presumably also incorrect usages of mathematical language constructs and a subsequent evaluation of the results. In addition, user activities can be traced to obtain direct feedback from the users. For the ongoing development, we are challenged by the structured organization and extraction of mathematical information from our knowledge bases (knowledge engineering), the merging process and visualization concepts for the knowledge bases, the syntactical analysis of equations and symbols, and the development of a desktop-based web retrieval interface.

6.1 Knowledge Engineering
In mArachna, each analyzed mathematical text produces different ontologies within the knowledge base. Our upper ontology, describing general concepts shared by all texts, consists of information about general language constructs ("is called", "Let ...", etc.). Because of the linguistic diversity of different authors, further upper ontologies exist within the knowledge base. These upper ontologies describe the different styles of phrasing and notation, or the author's didactical preferences. Obviously, the sheer amount of data is a serious problem for retrieving information from the knowledge bases. To cope with this challenge, the knowledge base is developed within the Java-based Jena framework ("A Semantic Web Framework for Java" [3]). This open-source framework provides several useful APIs for the generation and manipulation of knowledge bases using RDF (Resource Description Framework), OWL (Web Ontology Language [27]), RDFS (RDF Schema), etc. Generally, RDF forms triples to create statements. Several implementations of Jena interfaces represent the RDF basic elements (resources, literals, statements, etc.). Additionally, Jena can read and write different representation formats (N-Triples or N3) and handle different storage mechanisms for RDF models (main memory and/or various databases). In particular, Jena provides the RDF query language SPARQL, which is oriented on SQL, as well as inference support to derive additional RDF assertions. The basic knowledge base consists of mathematical information snippets based on axiomatic set theory and first order logic. To avoid inconsistencies within the knowledge base, new information is added using a semi-automated approach when required. Information is integrated into the knowledge base only if there are no conflicts with existing entries. This means that new nodes have to be connected to existing ones; duplications or contradictions are not allowed. A preliminary evaluation showed that conflicts do not occur very frequently, and thus they are the exception rather than the rule. The semi-automated administrator intervention is performed through a comfortable web interface. However, extracting information from such a fine-grained structure in our knowledge bases remains problematic: we face a complex structure of combined triples, each carrying additional attached information. Therefore it makes sense to consider the different language preferences of the authors and to account for them as early as possible. One idea is to store all interchangeable textual abbreviations and symbols in separate text files. Furthermore, we categorize different linguistic levels, such as the axiomatic structure or the internal structure, and search top-down from these levels. It is also possible to use the natural structure of the text itself: chapters, subchapters, paragraphs, etc. In addition, we can track users' search results and techniques within our knowledge bases to imitate their structural searching routines. Due to the sheer amount of data and the complexity of the information processing, the categorization of knowledge bases is our main focus of development.
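As a hedged illustration of how Jena's SPARQL support could be used to retrieve information from such a knowledge base, the following fragment queries all properties attached to the Group node of the example in chapter 5. The vocabulary is again the illustrative one from the earlier sketches, not the actual mArachna schema.

```java
import org.apache.jena.query.*;
import org.apache.jena.rdf.model.*;

/** Illustrative SPARQL retrieval from a Jena model (hypothetical vocabulary). */
public class RetrievalSketch {
    public static void main(String[] args) {
        String NS = "http://example.org/maths#";
        Model kb = ModelFactory.createDefaultModel();
        Resource group = kb.createResource(NS + "Group");
        Property hasProperty = kb.createProperty(NS, "hasProperty");
        kb.add(group, hasProperty, kb.createResource(NS + "associativity"));
        kb.add(group, hasProperty, kb.createResource(NS + "neutralElement"));
        kb.add(group, hasProperty, kb.createResource(NS + "inverseElement"));

        String queryString =
            "PREFIX m: <" + NS + "> " +
            "SELECT ?p WHERE { m:Group m:hasProperty ?p }";

        Query query = QueryFactory.create(queryString);
        try (QueryExecution qexec = QueryExecutionFactory.create(query, kb)) {
            ResultSet results = qexec.execSelect();
            while (results.hasNext()) {
                QuerySolution solution = results.next();
                System.out.println(solution.getResource("p").getLocalName());
            }
        }
    }
}
```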
6.2 Merging Process
The merging process is one part of our efforts to categorize the knowledge bases. On the one hand, we want to unify all knowledge bases of the different authors of one mathematical field. In doing so, we encounter several problems: even in mathematics, different authors use different types of definitions and theorems, and the differences can be so subtle that even a human reader needs careful consideration to detect them. On the other hand, tools and strategies for splitting the unified knowledge bases into smaller knowledge bases have to be developed and realized. Such smaller knowledge bases can be adapted to the preferences and requirements of specialized user groups and can serve as the basis of an intelligent retrieval system built on the concept of personal information agents (PIA). For example, different versions of entities can be used for elementary or high schools. Moreover, smaller knowledge bases containing different notations could contribute to the harmonization of teaching material within courses.

6.3 Web-based Retrieval Interface
Currently, a web interface for information retrieval from mArachna is being implemented. The interface pursues the basic philosophy of the "Personalized Home" of iGoogle and follows the metaphor of a desktop within a browser, allowing a high adaptability to the needs of a single user. The web-based desktop is implemented with JavaScript and XML technologies (Ajax applications), offering predefined widgets, provided in the form of a substantial free library, that can easily be adapted to the preferences of the user. These widgets include, for example: advanced search tools; personalized histories of queries and results, including social tagging; and graphical representation mechanisms for different views (low-level and upper ontologies, different entity levels, etc.) on the knowledge base, generated via Graphviz [34]. We use Ajax technologies to show how semantic web technologies and new generations of web-based applications like wikis and folksonomies can be combined in a suggestive and mutually supportive way.

6.4 Syntactic Analysis of Mathematical Symbols and Equations
Symbols, abbreviations and equations are the most important elements in mathematical texts and therefore an essential source of information. It is important to be able to evaluate their content in the natural language analysis. Currently, only simple symbols and equations can be processed. However, we are developing an approach to improve the analysis of mathematical formulae: a syntactical analysis, similar to those used in computer algebra systems, will be made available in combination with contextual grammars (e.g. Montague grammars) to correlate the information given in an equation with information already provided in the surrounding text. We would like to point out that our main goal is not a machine-based understanding of mathematical information. Hence, we try to avoid the use of automated reasoning systems and instead treat equations and symbols as a different representation of mathematical knowledge to be integrated into the knowledge base, similar to the natural language texts.

7. References
[1] Hilbert, D.: Die Grundlagen der Mathematik. Abhandlungen aus dem mathematischen Seminar der Hamburgischen Universität 6 (1928) 65–85
[2] Bourbaki, N.: Die Architektur der Mathematik. Mathematiker über die Mathematik. Springer, Berlin, Heidelberg, New York (1974)
[3] Jena: A Semantic Web Framework for Java. http://jena.sourceforge.net
[4] The Apache Software Foundation: Apache Lucene. http://lucene.apache.org/
[5] Gruber, T., Olsen, G.: An Ontology for Engineering Mathematics. Technical Report KSL-94-18, Stanford University (1994)
[6] MBase. http://www.mathweb.org/mbase/
[7] Franke, A., Kohlhase, M.: MBase: Representing Knowledge and Context for the Integration of Mathematical Software Systems. J. Symbolic Computation 23(4) (2001) 365–402
[8] Baur, J.: Syntax und Semantik mathematischer Texte. Master's thesis, Universität des Saarlandes, Fachbereich Computerlinguistik (1999)
[9] Müller, S.: TRALE. http://www.cl.uni-bremen.de/Software/Trale/index.html
[10] Wolfram Research: MathWorld. http://mathworld.wolfram.com
[11] Fellbaum, C.: WordNet: An Electronic Lexical Database. MIT Press, Cambridge, London (1998)
[12] Fellbaum, C.: WordNet. http://wordnet.princeton.edu
[13] GermaNet Team: GermaNet. http://www.sfs.uni-tuebingen.de/lsd/english.html
[14] Helbig, H.: Knowledge Representation and the Semantics of Natural Language. Springer, Berlin, Heidelberg, New York (2006)
[15] Urban, J.: MoMM – Fast Interreduction and Retrieval in Large Libraries of Formalized Mathematics. International Journal on Artificial Intelligence Tools 15(1) (2006) 109–130
[16] Urban, J.: MizarMode – An Integrated Proof Assistance Tool for the Mizar Way of Formalizing Mathematics. Journal of Applied Logic (2005)
[17] Urban, J.: XML-izing Mizar: Making Semantic Processing and Presentation of MML Easy. MKM 2005 (2005)
[18] Pinkal, M., Siekmann, J., Benzmüller, C., Kruijff-Korbayová, I.: DIALOG. http://www.ags.uni-sb.de/~dialog/
[19] Asperti, A., Padovani, L., Sacerdoti Coen, C., Schena, I.: HELM and the Semantic Web. In: Boulton, R.J., Jackson, P.B. (eds.): Theorem Proving in Higher Order Logics, 14th International Conference, TPHOLs 2001, Edinburgh, Scotland, UK, September 3–6, 2001, Proceedings. Lecture Notes in Computer Science, Vol. 2152. Springer (2001)
[20] Asperti, A., Zacchiroli, S.: Searching Mathematics on the Web: State of the Art and Future Developments. In: Joint Proceedings of the ECM4 Satellite Conference on Electronic Publishing at KTH Stockholm, AMS-SMM Special Session, Houston (2004)
[21] Albayrak, S., Wollny, S., Varone, N., Lommatzsch, A., Milosevic, D.: Agent Technology for Personalized Information Filtering: The PIA System. ACM Symposium on Applied Computing (2005)
[22] Collier, N., Takeuchi, K.: PIA-Core: Semantic Annotation through Example-based Learning. Third International Conference on Language Resources and Evaluation (2002) 1611–1614
[23] The PIA Project: PIA. http://www.pia-services.de/
[24] Jeschke, S.: Mathematik in Virtuellen Wissensräumen – IuK-Strukturen und IT-Technologien in Lehre und Forschung. PhD thesis, Technische Universität Berlin (2004)
[25] Natho, N.: MARACHNA: Eine semantische Analyse der mathematischen Sprache für ein computergestütztes Information Retrieval. PhD thesis, Technische Universität Berlin (2005)
[26] Grottke, S., Jeschke, S., Natho, N., Seiler, R.: mArachna: A Classification Scheme for Semantic Retrieval in eLearning Environments in Mathematics. In: Proceedings of the 3rd International Conference on Multimedia and ICTs in Education, June 7–10, 2005, Caceres, Spain (2005)
[27] W3C: Web Ontology Language. http://www.w3c.org/2004/OWL
[28] The TEI Consortium: Text Encoding Initiative. http://www.tei-c.org
[29] W3C: MathML. http://www.w3.org/Math
[30] Müller, S.: Deutsche Syntax deklarativ: Head-Driven Phrase Structure Grammar für das Deutsche. Linguistische Arbeiten, No. 394. Max Niemeyer Verlag, Tübingen (2005)
[31] Mozilla Foundation: Rhino. http://www.mozilla.org/rhino/
[32] W3C: SPARQL. http://www.w3.org/TR/rdf-sparql-query/
[33] Wüst, R.: Mathematik für Physiker und Mathematiker, Bd. 1. Wiley-VCH (2005)
[34] AT&T Research: Graphviz. http://www.graphviz.org/