Natural language processing methods for extracting information from mathematical texts

Nicole Natho¹, Sabina Jeschke², Olivier Pfeiffer¹, and Marc Wilke²

¹ Berlin University of Technology, MuLF – Center for Multimedia in Education and Research, 10623 Berlin, Germany
² University of Stuttgart, RUS – Center of Information Technologies, 70550 Stuttgart, Germany

Introduction

Managing knowledge plays a significant role in modern organizations and society. Every day, numerous books and publications are released to disseminate new knowledge. The World Wide Web, in particular, is an example of a huge, unsystematic knowledge base that provides access to a great deal of information. Too much information leads to a proliferation of knowledge, the so-called "information glut", that is not easy to handle. Applications that specialize and generalize information are therefore gaining importance as solutions to these new challenges. In this work, a knowledge management system approach for accessing mathematical information from different sources, such as textbooks and publications, is presented. Information is extracted from these sources and put into a suitable format for integration into a knowledge base. Many current approaches to information extraction are based on statistical analysis of the correlation between terms and knowledge elements. These approaches, while very successful, neglect the advantages offered by the precision of mathematical language. This precision favors the use of alternative natural language processing methods such as Head-Driven Phrase Structure Grammar for the automated semantic annotation of mathematical texts.

In this way, we obtain information within a given context (knowledge) that captures the mathematical correlations between terms and concepts. A knowledge management system based on this technology provides new opportunities and challenges: automated knowledge acquisition, sophisticated user interfaces, the need for intelligent search and retrieval mechanisms, visualization of knowledge, an extended ontology spanning several sources to provide an overview of mathematics itself, etc. A comprehensive knowledge management system needs to address the accumulation, storage, merging, evaluation and representation of mathematical information for different applications, such as encyclopedias, context-sensitive library search systems, intelligent book indexes and e-learning software, using standardized methods. The MARACHNA system implements the knowledge acquisition of such a knowledge management system. MARACHNA generates mathematical ontologies from extracted information, represented as elements of mathematical concepts and objects as well as relations and interconnections between them. These ontologies are mapped onto the structure of a knowledge base. This is possible because mathematical texts possess a distinctive structure caused by the theoretical design of mathematics itself and by the authors' preferences. This structure is characterized by representative text modules, "entities", such as definitions, theorems and proofs. Entities are commonly used to describe mathematical objects and concepts and represent the key content of mathematical texts. Furthermore, the information contained within these entities forms a complex relationship model in the shape of a network, defining an ontology of the specific mathematical text. The basic ontology that forms the skeletal mathematical structure for all ontologies follows the ideas of Hilbert [1] and Bourbaki [2]: mathematics as a whole can be derived from a small set of axioms using propositional logic¹. Propositional logic is reflected in the structure and diction of mathematical texts. Therefore, an ontology of mathematical content can be extracted directly from mathematical texts. As a result, MARACHNA is able to integrate mathematical entities from very different sources, such as different mathematical textbooks, articles or lectures, independent of the upper ontology (describing general concepts) preferred and used by the authors of those sources.

¹ This approach is valid within the context of MARACHNA, despite Gödel's Incompleteness Theorem [3], since we map existing and proven mathematical knowledge into a knowledge base and do not aim to prove new theorems or check the consistency of the theorems we store.

This approach, based on natural language processing techniques, offers several advantages over the more common approaches based on purely statistical methods (as used, e.g., by Google), while avoiding the increased complexity of automated reasoning systems. It utilizes both the ontology inherent in mathematics itself and the ontology defined by the structure of the text chosen by the author. Thus, MARACHNA provides mechanisms for the creation of knowledge networks from mathematical texts, such as textbooks, digital lectures or articles. The extracted mathematical terms and concepts are further annotated with respect to their role in the ontology of the original text, thus preserving the upper ontology used by the authors, and relationships can be mapped in a very fine-grained way. In addition, these networks can be used to create an overview of mathematical content; therefore, they offer different levels of information detail. To store and manage these information snippets in a knowledge base, the Jena framework is used. Additionally, the search engine Apache Lucene [4] is utilized for information retrieval and testing purposes.

Linguistic Approach

The main carriers of information in mathematical texts are entities. Based on a linguistic classification scheme [5, 6], they are analyzed using natural language processing methods. The levels of this scheme are depicted in figure 1:

Fig. 1. Linguistic Classification Scheme



• On the entity level, the entities used and their arrangement within the text are described.
• On the structure level, the internal structure of an entity is described (e.g. the proposition and assumptions of a theorem).
• On the sentence level, characteristic sentence structures frequently found in mathematical texts are described.
• On the word and symbol level, the bottommost and most detailed level of this scheme, single symbols and words and the relations between them are schematized [7].

Using these structures and linguistic relations, mathematical information is extracted from the text and integrated into a knowledge base consisting of one or more directed graphs representing terms, concepts, and their interrelations. The knowledge base is encoded in the Web Ontology Language OWL [8] and is based upon an ontology of the mathematical language. Linguistic analysis of entities produces triples of two nodes (representing mathematical terms and relations) and one relation (describing the type of connection between the nodes). The triples can be nested, i.e. they can be used as nodes of other triples, thus facilitating the representation of complex interrelations. Different types of relations correspond to different keywords or linguistic phrases in mathematical texts (e.g. two nodes representing two theorems, connected by the relation "is equivalent to"). Finally, the triples are integrated into the knowledge base. This transfers the knowledge from the mathematical text to the knowledge base, generating a very fine-grained structure.
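As an illustration of this nested triple structure, the following sketch shows a minimal Java representation in which a node may be either a plain term or another triple. The class and relation names are hypothetical and only serve to illustrate the idea; they are not MARACHNA's actual data model.

```java
/** Minimal sketch of a nestable triple: a node is either a term (String) or another Triple. */
public final class Triple {
    private final Object subject;   // String or Triple
    private final String relation;  // type of connection, e.g. "is_equivalent_to"
    private final Object object;    // String or Triple

    public Triple(Object subject, String relation, Object object) {
        this.subject = subject;
        this.relation = relation;
        this.object = object;
    }

    @Override
    public String toString() {
        // Paper-style notation: (relation, subject, object)
        return "(" + relation + ", " + subject + ", " + object + ")";
    }

    public static void main(String[] args) {
        // Two theorems connected by an equivalence relation ...
        Triple equivalence = new Triple("theorem A", "is_equivalent_to", "theorem B");
        // ... used as a node of another (hypothetical) triple, illustrating nesting.
        Triple nested = new Triple(equivalence, "stated_in", "chapter 2");
        System.out.println(nested);
    }
}
```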

Technical Aspects of MARACHNA

MARACHNA consists of four separate modules (see figure 2):

• the preliminary analysis of the input,
• the syntactic analysis,
• the semantic analysis, and
• the integration of the results into a knowledge base.

Design Decisions

The basic design of MARACHNA adheres to the following general guidelines:

• Modularization: Each component (input segmentation, syntactic analysis, semantic analysis, knowledge base, retrieval) has been designed as an independent module that can be replaced easily.
• LaTeX as input format: LaTeX was chosen as the input format since it is the standard format for publications in mathematics and the natural sciences.
• Standard internal data formats: All data is represented in standard data formats (TEI, MathML, XML, RDF/OWL).
• Control of the syntactic and semantic analysis: All rules governing the syntactic and semantic analysis should be modifiable by non-programmers to facilitate future extensions.

Fig. 2. Processing of Natural Language Texts

Preliminary Analysis

For the internal representation of the processed texts, MARACHNA currently uses TEI (Text Encoding Initiative [9]), with all mathematical symbols and formulae encoded in Presentation MathML [10]. For this purpose, MARACHNA provides a LaTeX-to-TEI converter. The entities and their interrelations, as well as the information provided on their internal structure level, are analyzed and extracted from the TEI representation. In the next step, the entities are segmented into single sentences and annotated with meta-information (their parent entity and their role within the internal structure of that entity). For further processing, formulae are separated from the enveloping text, with simple formulae being replaced by the associated natural language text. The described preliminary analysis is implemented in Java using rule-based string comparison.
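The following sketch illustrates what such a segmentation and annotation step might look like. It is not MARACHNA's actual implementation; the class, the metadata fields, and the example entity are illustrative assumptions.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Illustrative sentence segmentation and annotation of an entity extracted from TEI. */
public class EntitySegmenter {

    /** A sentence annotated with its parent entity and its role within that entity. */
    public record AnnotatedSentence(String text, String parentEntity, String role) {}

    public static List<AnnotatedSentence> segment(String entityText,
                                                  String parentEntity,
                                                  String role) {
        List<AnnotatedSentence> sentences = new ArrayList<>();
        // Locale-aware sentence boundaries; MARACHNA's prototype targets German texts.
        BreakIterator boundary = BreakIterator.getSentenceInstance(Locale.GERMAN);
        boundary.setText(entityText);
        int start = boundary.first();
        for (int end = boundary.next(); end != BreakIterator.DONE;
             start = end, end = boundary.next()) {
            String sentence = entityText.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(new AnnotatedSentence(sentence, parentEntity, role));
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String definition = "Sei G eine Menge. Das Tupel (G,*) heisst Gruppe, "
                + "wenn die folgenden Bedingungen gelten.";
        segment(definition, "Definition: Gruppe", "definition")
                .forEach(System.out::println);
    }
}
```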

Syntactical Analysis

At this point, natural language processing techniques are employed to analyze every extracted natural language sentence. This analysis is implemented using the TRALE system (a PROLOG adaptation [11] of ALE for German [12]) based on Head-Driven Phrase Structure Grammar. For use in MARACHNA, the underlying dictionary and grammar of TRALE have been expanded to include the particularities of mathematical language, in order to provide comprehensive syntactic and partial semantic information about each sentence. The output of TRALE is then transformed into an abstract syntax tree representing the structure of the analyzed sentence. At this stage, the formulae, which have been processed separately, and the natural language part of the text have to be recombined for further semantic analysis. The processing of complex formulae has to be done in a separate step and is not yet implemented.

Semantic Analysis

The semantic analysis is implemented using an embedded JavaScript interpreter [13]. The syntax trees are categorized according to typical structures characteristic of specific mathematical entities and semantic constructs. Each category is transformed into a specific triple structure defined by external JavaScript rules.

These rules map typical mathematical language constructs onto the corresponding basic mathematical concepts (e.g. proposition, assumption, definition of a term, etc.). The resulting triples are annotated with additional information from the original text. At present, the resulting OWL documents satisfy only the OWL Full standard. In addition, for each triple element it has to be determined whether it represents an OWL class or an individual of an OWL class, which complicates the semantic analysis.

Knowledge Base

The Java-based framework Jena [14] is responsible for inserting the created triples into knowledge bases. Jena works with OWL documents and provides APIs for persistent storage. Additionally, it contains a SPARQL [15] engine for querying RDF documents. It is of particular importance that each analyzed text creates its own knowledge base, since each author has his or her own preferences and idiosyncrasies in representing a mathematical field. The plain knowledge base consists of fundamental mathematical knowledge, including axiomatic field theory and first-order logic (cf. figure 3).
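As a rough illustration of this step, the sketch below builds an in-memory Jena model, inserts a single extracted triple, and serializes it. It is a minimal sketch, not MARACHNA's actual insertion code; it assumes the current Apache Jena namespace (org.apache.jena) and an invented example namespace.

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.Resource;

/** Minimal sketch: inserting one extracted triple into a Jena model and serializing it. */
public class TripleStoreSketch {
    // Hypothetical namespace for concepts extracted by the analysis.
    private static final String NS = "http://example.org/marachna#";

    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();

        // Two theorems connected by an equivalence relation, as in the earlier example.
        Resource theoremA = model.createResource(NS + "theoremA");
        Resource theoremB = model.createResource(NS + "theoremB");
        Property isEquivalentTo = model.createProperty(NS, "is_equivalent_to");

        model.add(theoremA, isEquivalentTo, theoremB);

        // Serialize the model; Jena can also persist models via its storage APIs.
        model.write(System.out, "RDF/XML");
    }
}
```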

Fig. 3. Basic Knowledge Base

When adding new knowledge to the knowledge base, inconsistencies have to be avoided. Therefore, a semi-automated approach for integrating new information into the knowledge base is used. Information is only integrated if it is not redundant and does not conflict with or contradict older entries. Consequently, a new node has to be linked to an already existing one. Semi-automated means here that the user of the MARACHNA system eliminates such conflicts by adding additional information or by manually deleting the conflicting entries. This user intervention is performed through a web-based interface built on the Jena framework. The basic idea behind this approach is the well-known model of human knowledge processing in psychology: humans can only integrate new knowledge if it can be linked to existing knowledge that conforms to their world view; otherwise, the existing knowledge has to be reinterpreted. Fortunately, textbooks follow strict rules, so automated integration should be the normal case and semi-automated integration the exception.
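A simple way to support such a check is to query the knowledge base before insertion. The sketch below uses Jena's SPARQL engine (again assuming the current Apache Jena API and an invented namespace) to ask whether a relation between two nodes is already present; a real redundancy and conflict check would of course be more involved.

```java
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;

/** Illustrative pre-insertion check: is a given statement already in the knowledge base? */
public class ConsistencyCheckSketch {

    /** Returns true if the (subject, predicate, object) pattern already exists in the model. */
    static boolean alreadyPresent(Model model, String subjectUri,
                                  String predicateUri, String objectUri) {
        String ask = String.format(
                "ASK { <%s> <%s> <%s> }", subjectUri, predicateUri, objectUri);
        try (QueryExecution qe = QueryExecutionFactory.create(ask, model)) {
            return qe.execAsk();
        }
    }

    public static void main(String[] args) {
        String ns = "http://example.org/marachna#";   // hypothetical namespace
        Model model = ModelFactory.createDefaultModel();
        // Prints "false": the empty model does not yet contain this statement.
        System.out.println(alreadyPresent(model,
                ns + "theoremA", ns + "is_equivalent_to", ns + "theoremB"));
    }
}
```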

Linguistic Analysis: An Example

In the following, the definition of a mathematical group [16] is illustrated as an example (cf. figure 4).

Let G be a set,

G ≠ ∅, a,b ∈ G: * : G×G → G, (a,b) ↦ a*b.

The tuple (G,*) is called a group if the following conditions hold:
1. Associativity: ∀a,b,c∈G: a*(b*c) = (a*b)*c,
2. Neutral Element: ∃e∈G such that ∀a∈G: a*e = e*a = a,
3. Inverse Element: ∀a∈G ∃a⁻¹∈G such that a⁻¹*a = a*a⁻¹ = e, where e is the neutral element.

The assumption "Let G be a set ..." is recognized based on knowledge of the characteristic keywords of mathematical texts. In the same way, "The tuple (G,*) is called ... conditions hold:" is identified as the proposition, and the three following items (Associativity, Neutral Element, and Inverse Element) as the definition or properties.

Morphological, syntactic, and semantic analysis results in a representation as triples of predicate, subject, and object. The phrase "Let G be a set" is mapped to (is_a, set, G), for example. In the end, the network of relations originates from the identification of identical subjects and objects across different triples. The generated pattern of triples describing the entities and the relations between them is integrated into the knowledge base using OWL. The OWL representation is generated by extracting the significant information from the text and supplies an outline of the mathematical knowledge presented by the author.
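To illustrate how identical subjects and objects knit individual triples into a network, the following sketch (again assuming the current Apache Jena API, an invented namespace, and an invented second relation) adds two statements that share the resource G; because RDF resources are identified by URI, both statements automatically attach to the same node of the graph.

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.RDFNode;
import org.apache.jena.rdf.model.Resource;

/** Sketch: two triples about the same resource G form a small network. */
public class GroupExampleSketch {
    public static void main(String[] args) {
        String ns = "http://example.org/marachna#";   // hypothetical namespace
        Model model = ModelFactory.createDefaultModel();

        Resource g = model.createResource(ns + "G");
        Resource set = model.createResource(ns + "set");
        Resource group = model.createResource(ns + "group");
        Property isA = model.createProperty(ns, "is_a");
        Property carrierOf = model.createProperty(ns, "carrier_of");  // invented relation

        model.add(g, isA, set);           // "Let G be a set"
        model.add(g, carrierOf, group);   // G also participates in the group definition

        // All statements about G: both triples meet in the same node.
        model.listStatements(g, null, (RDFNode) null)
             .forEachRemaining(System.out::println);
    }
}
```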

Fig. 4. Knowledge Base: Definition of a Group

Evaluation of the Basic Concept

While the prototypical implementation of MARACHNA is able to analyze German texts only, its capabilities are currently being extended to analyze English texts as well. The current implementation proves the viability of this semi-automated method of semantic extraction for selected text elements, as illustrated above. Semantic extraction results in information snippets of the analyzed mathematical text which can be assimilated into the knowledge base (cf. figure 4). The ontology mapping approach will be evaluated in a next step by analyzing whole textbooks on Linear Algebra, merging the resulting knowledge bases, and comparing the generated outcome with a "standard" mathematical encyclopedia, or with the corresponding part of that encyclopedia.

Related Works

MARACHNA consists of several separate subprojects. While we are unaware of any current research mirroring MARACHNA's global concept, there are a number of projects with aims similar to those of the subprojects within MARACHNA, if with different approaches. Fundamental research into ontologies for mathematics was performed by T. Gruber and G. Olsen [17]. Baur analyzed the language used in mathematical texts in English [18]. The linguistic analysis of mathematical language is based on TRALE [11]. MBASE is an example of a mathematical ontology created by humans. The structuring of mathematical language and the modeling of the resulting ontology is similar to the work performed by humans in the creation of several mathematical encyclopedias such as MathWorld [19]. The ontologies represented in these projects can serve as an interesting comparison to the ontologies and knowledge bases automatically generated by MARACHNA. WordNet/GermaNet [20-22] are intelligent thesauri providing sets of suitable synonyms and definitions for a queried expression. WordNet uses an approach to the automated semantic analysis of natural language similar to MARACHNA's. However, WordNet does not provide detailed correlations between dissimilar or not directly related terms. Helbig [23] has designed a concept for the automated semantic analysis of German natural language texts. The Mizar [24-26] system uses a formal language, readable by both humans and machines, for the description of mathematical content as input for automated reasoning systems.

An example of a different approach towards information extraction from mathematical natural language texts, based on automated reasoning, is the work of the DIALOG [27] project. DIALOG is aimed at processing natural language user input for mathematical validation in an automated reasoning system. Both Mizar and DIALOG, unlike MARACHNA, require complex reasoning systems for their extraction of information. The HELM [28] project defines metadata for the semantic annotation of mathematical texts for use in digital libraries. However, these metadata have to be provided by the original authors. MoWGLi [29] extracts the metadata provided in HELM and provides a retrieval interface based on pattern matching within mathematical formulae and logic expressions. The PIA [30-32] project extracts information from natural language texts (not limited to mathematics) using machine learning techniques. PIA requires an initial set of texts annotated by human experts as a basis for the information extraction performed by intelligent agents.

Ontology Engineering in MARACHNA - An Outlook

Summary

Exact transfer of knowledge is the central point of mathematical language. Hence, the knowledge base contains pure mathematical knowledge that needs no interpretation. The semantic extraction (acquisition process) obeys well-defined rules based on the structure of the entities. The same holds true for the storage and organization of the mathematical content in the knowledge base. Still, some subproblems regarding the merging of knowledge bases and the extraction of knowledge from them remain unsolved and are the subject of further development.

Knowledge Management and Ontology Engineering

MARACHNA generates several levels of ontologies from mathematical texts. The low-level ontology describes the correlation between mathematical terms both within mathematical entities and between these entities. The upper ontology reproduces the structure of the processed text and thus the field ontology used by the author. Both of these ontologies facilitate data retrieval.

The upper ontology supports retrieval in direct response to queries, while the low-level ontology reveals interconnections and relations not directly included by the author and offers additional links between the upper ontologies of several authors. Different authors use their own notations and phrasing, based on their personal preferences, their audience, and their didactical goals. Therefore, each text is stored in its own dedicated knowledge base. On the one hand, creating a sensible ontology mapping algorithm that unifies knowledge from numerous small knowledge bases, each of them related to one text, into one superior knowledge base is one of the developments intended for MARACHNA's near future. On the other hand, tools and strategies for partitioning the unified knowledge base into smaller knowledge bases, tailored to the preferences and requirements of specialized user groups, have to be devised and implemented. The partitioned knowledge bases could be used as the basis of an intelligent retrieval system based on the concept of personal information agents (PIAs): using different versions of an entity for elementary and high school students might be wise, for example. Moreover, partitioned knowledge bases using different notations could contribute to the harmonization of the teaching material within a sequence of courses.

Retrieval

The design of the retrieval interface aims at supporting users in understanding and learning mathematics. The interface shall offer tools to select information based on the user's personal preferences (e.g. an example-oriented vs. an axiomatic approach). Additionally, the integration of different roles (learner, instructor, administrator) into a generic interface is desirable. The retrieval interface shall be implemented as a web service.

Processing of Mathematical Formulae

As formulae form a major portion of mathematical texts and constitute a primary source of information in these texts, it is desirable to include their content in the analysis and representation created by MARACHNA. Currently, this important feature is not implemented. However, we are investigating an approach to rectify this deficiency. We propose the use of a syntactical analysis, similar to those used in computer algebra systems, in combination with contextual grammars (e.g. Montague grammars), to correlate the information given in a formula with the information already provided in the surrounding natural language text.

Using this approach should allow the integration of formulae and their informational content into the network created by the analysis of the natural language text. It should be pointed out that we do not aim for machine-based understanding of the formulae, as automated reasoning systems would require. Instead, formulae are to be treated as a different representation of mathematical knowledge, to be integrated into the knowledge base in a manner similar to that used for the natural language text. However, the analysis proposed here can be used as a first step in a further process leading to viable input for such reasoning systems, providing additional assistance in building the knowledge base.
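As a purely illustrative toy example of deriving triples from a formula (far simpler than the proposed CAS-style analysis with contextual grammars), the sketch below splits a chained equation into pairwise equality triples that could then be correlated with the surrounding text; the relation name and output format are assumptions, not part of MARACHNA.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy sketch: turn a chained equation like "a*e = e*a = a" into pairwise "equals" triples. */
public class FormulaTripleSketch {

    static List<String> equationToTriples(String formula) {
        String[] sides = formula.split("=");
        List<String> triples = new ArrayList<>();
        for (int i = 0; i + 1 < sides.length; i++) {
            // Paper-style triple notation: (relation, subject, object).
            triples.add("(equals, " + sides[i].trim() + ", " + sides[i + 1].trim() + ")");
        }
        return triples;
    }

    public static void main(String[] args) {
        // From the neutral-element condition of the group definition above.
        equationToTriples("a*e = e*a = a").forEach(System.out::println);
    }
}
```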

References

1. Hilbert, D.: Die Grundlagen der Mathematik. Abhandlungen aus dem mathematischen Seminar der Hamburgischen Universität 6 (1928) 65–85
2. Bourbaki, N.: Die Architektur der Mathematik. In: Mathematiker über die Mathematik. Springer, Berlin, Heidelberg, New York (1974)
3. Gödel, K.: Über formal unentscheidbare Sätze der Principia Mathematica und verwandter Systeme I. Monatshefte für Mathematik und Physik (1931–1932) 147f
4. The Apache Software Foundation: Apache Lucene. http://lucene.apache.org/
5. Jeschke, S.: Mathematik in Virtuellen Wissensräumen – IuK-Strukturen und IT-Technologien in Lehre und Forschung. PhD thesis, Technische Universität Berlin (2004)
6. Natho, N.: MARACHNA: Eine semantische Analyse der mathematischen Sprache für ein computergestütztes Information Retrieval. PhD thesis, Technische Universität Berlin (2005)
7. Grottke, S., Jeschke, S., Natho, N., Seiler, R.: mArachna: A Classification Scheme for Semantic Retrieval in eLearning Environments in Mathematics. In: Proceedings of the 3rd International Conference on Multimedia and ICTs in Education, June 7–10, 2005, Caceres, Spain (2005)
8. W3C: Web Ontology Language. http://www.w3c.org/2004/OWL
9. The TEI Consortium: Text Encoding Initiative. http://www.tei-c.org
10. W3C: MathML. http://www.w3.org/Math
11. Müller, S.: TRALE. http://www.cl.uni-bremen.de/Software/Trale/index.html
12. Müller, S.: Deutsche Syntax deklarativ: Head-Driven Phrase Structure Grammar für das Deutsche. In: Linguistische Arbeiten, No. 394. Max Niemeyer Verlag, Tübingen (2005)
13. Mozilla Foundation: Rhino. http://www.mozilla.org/rhino/
14. Jena: A Semantic Web Framework for Java. http://jena.sourceforge.net
15. W3C: SPARQL. http://www.w3.org/TR/rdf-sparql-query/
16. Wüst, R.: Mathematik für Physiker und Mathematiker, Bd. 1. Wiley-VCH (2005)

17. Gruber, T., Olsen, G.: An Ontology for Engineering Mathematics. Technical Report KSL-94-18, Stanford University (1994)
18. Baur, J.: Syntax und Semantik mathematischer Texte. Master's thesis, Universität des Saarlandes, Fachbereich Computerlinguistik (November 1999)
19. Wolfram Research: MathWorld. http://mathworld.wolfram.com
20. Fellbaum, C.: WordNet: An Electronic Lexical Database. MIT Press, Cambridge, London (1998)
21. Fellbaum, C.: WordNet. http://wordnet.princeton.edu
22. GermaNet Team: GermaNet. http://www.sfs.uni-tuebingen.de/lsd/english.html
23. Helbig, H.: Knowledge Representation and the Semantics of Natural Language. Springer, Berlin, Heidelberg, New York (2006)
24. Urban, J.: MoMM – Fast Interreduction and Retrieval in Large Libraries of Formalized Mathematics. International Journal on Artificial Intelligence Tools 15(1) (2006) 109–130
25. Urban, J.: MizarMode – An Integrated Proof Assistance Tool for the Mizar Way of Formalizing Mathematics. Journal of Applied Logic (2005)
26. Urban, J.: XML-izing Mizar: Making Semantic Processing and Presentation of MML Easy. MKM 2005 (2002)
27. Pinkall, M., Siekmann, J., Benzmüller, C., Kruijff-Korbayova, I.: DIALOG. http://www.ags.uni-sb.de/~dialog/
28. Asperti, A., Padovani, L., Sacerdoti Coen, C., Schena, I.: HELM and the Semantic Web. In: Boulton, R.J., Jackson, P.B. (eds.): Theorem Proving in Higher Order Logics, 14th International Conference, TPHOLs 2001, Edinburgh, Scotland, UK, September 3–6, 2001, Proceedings. Lecture Notes in Computer Science, Vol. 2152. Springer (2001)
29. Asperti, A., Zacchiroli, S.: Searching Mathematics on the Web: State of the Art and Future Developments. In: Joint Proceedings of the ECM4 Satellite Conference on Electronic Publishing at KTH Stockholm, AMS–SMM Special Session, Houston (2004)
30. Albayrak, S., Wollny, S., Varone, N., Lommatzsch, A., Milosevic, D.: Agent Technology for Personalized Information Filtering: The PIA System. In: ACM Symposium on Applied Computing (2005)
31. Collier, N., K., T.: PIA-Core: Semantic Annotation through Example-based Learning. In: Third International Conference on Language Resources and Evaluation (May 2002) 1611–1614
32. The PIA Project: PIA. http://www.pia-services.de/