A semantic search framework for document retrieval (literature, art and history) based on a MultiWordNet-like thesaurus

Vitoantonio Bevilacqua1,2*, Vito Santarcangelo1,2, Alberto Magarelli1, Annalisa Bianco3, Giuseppe Mastronardi1,2, Egidio Cascini4

1 Department of Electrical and Electronics, Polytechnic of Bari, Via E. Orabona, 4 – 70125 Bari – Italy
2 e.B.I.S. s.r.l. (electronic Business in Security), Spin-Off of Polytechnic of Bari, Via Pavoncelli, 139 – Bari – Italy
3 Gius. Laterza & Figli S.p.A., Piazza Umberto I 54, 70121 Bari – Italy
4 Accademia Italiana del Sei Sigma, c/o Università degli Studi Guglielmo Marconi, Via Giusti 7, 50121 Firenze – Italy
*Corresponding author:
[email protected]
Abstract. The aim of this paper is to show the application of a thesaurus-based approach to several issues of interest for an Italian publishing house. The final experimental results reveal good performance in terms of retrieval rate and recall, and therefore encourage the continuation of this multidisciplinary research.

Keywords: Thesaurus, MultiWordNet, Monte Carlo Method, retrieval
1 Introduction

This paper introduces the work carried out by the Polytechnic of Bari in collaboration with the Gius. Laterza and Figli publishing house, as part of the project Diderot: Data Oriented Integrated System for the Rapid Definition of Multimedial Object and Texts. Diderot is a research project funded under the National Innovation Funding of the Italian Government, in which the editorial sector, ICT industries and research institutions have been working together to design and develop a system able to govern all the steps related to the production of educational content, content to be delivered in different formats through different distribution channels (printed books, e-books, learning objects for the web or the interactive whiteboard, etc.). New educational products can be created either from pre-existing raw material, reshaped and modified to satisfy the requirements of a specific output and target, or from newly written content. The real goal for a twenty-first-century publishing house, however, is to be able to re-use its wide pre-existing catalogue in order to give shape to new content and to adapt or re-create the original material to give birth to innovative products. The new ways of teaching, the increasing use of interactive whiteboards and computers in the classroom and at
home, and the new ways of learning that the new media, and the habit of using them, foster in the younger generation, make it a must for a publisher to rethink its content and innovate its production process. Diderot therefore had to deal with two different, necessarily intertwined priorities: implementing a new production system and, at the same time, an information retrieval system able to store and search raw material from published books. This second objective implied dealing with and solving several problems we came across during the research work.

First of all, a published book is not just a set of self-contained assets ready to be used, but a linear structured work, where style and content are normally associated in a single body. The shape of the product is correlated to the content [6], and this implies that different products are released in different shapes, usually not marked up according to a standard set of indexing keys. This feature results from the production process of an educational book: every single book has its own graphic design, the shape is part and expression of the content, and the final product is released after having been processed in authoring systems such as InDesign. The lack of separation between content and shape creates quite a few problems when the goal is to store and search the content inside the book. The first problem the Diderot team had to face was therefore finding a way to index already produced content according to a set of standard keys. The study of editorial standards and the analysis of the Laterza educational books led to the choice of DocBook as the XML markup language used to index published works. The team then created a table of correspondence between the styles as set in the books (chapters, paragraphs, indexes, notes, exercises, etc.) and the DocBook tags. Two further problems then arose. Which kind of content should represent the raw material to be converted into DocBook files: the InDesign XML files, the final result of the production process, or the revised Word files, the final content after revision but before a graphic designer has given the book its final shape? Answering this question was not a simple task, as we explain later in this paper.

The second problem was the need to index the content with keywords that would make it possible to store and retrieve it through semantic keys. We therefore had to design a semantic engine that allowed retrieval through indexing keys. But how should these keys be assigned? Creating new entries every time we had to index a book, without referring to a structured vocabulary, was not practicable, because we would have accumulated a wide number of variations and similar expressions that would have made the system increasingly difficult and burdensome to maintain as it was populated with new material. The solution was to adopt a structured vocabulary to be extended with terms appropriate for classifying the Laterza educational content. Two new lines of research were therefore undertaken: searching for a multilingual lexical database that could be the basis for future expansion, and identifying the classification areas suitable for indexing the educational products published by Laterza. The original material was related to different fields of study (language and literature, history, history of art), which implied that a multilingual thesaurus limited to a single cultural domain would not have satisfied the purpose.
We needed, instead, a general multilingual lexical database that could be expanded with terms related to our material. The investigation led us to select MultiWordNet, a multilingual lexical database in which the Italian WordNet is strictly aligned with the original project developed by Princeton University: WordNet. This lexicon had to be extended with terms related to the specific domains and with the key values useful for setting up a semantic query. Therefore, for every single domain the set of data directly related to the field was identified, and for each of them a set of Italian keywords was defined and uploaded to the system. This methodology has the advantage of maintaining the consistency and integrity of the lexicon and of producing a system that can be enhanced over time by adding new terms.
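As an illustration, the sketch below shows one way such domain keyword sets could be organised and loaded into a store that sits alongside the lexicon. The table layout, function name and sample keywords are assumptions made for the example; they do not reproduce MultiWordNet's actual schema or the project's actual term lists.

```python
import sqlite3

# Illustrative domain -> Italian keyword sets (sample values, not the project's real lists).
DOMAIN_KEYWORDS = {
    "literature": ["romanticismo", "sonetto", "poetica"],
    "history": ["risorgimento", "feudalesimo"],
    "history_of_art": ["chiaroscuro", "affresco"],
}

def load_domain_terms(db_path: str, domains: dict) -> None:
    """Insert domain-specific lemmas into a hypothetical extension table,
    leaving the original MultiWordNet tables untouched."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS domain_lemma ("
        "lemma TEXT NOT NULL, domain TEXT NOT NULL, "
        "PRIMARY KEY (lemma, domain))"
    )
    for domain, lemmas in domains.items():
        conn.executemany(
            "INSERT OR IGNORE INTO domain_lemma (lemma, domain) VALUES (?, ?)",
            [(lemma, domain) for lemma in lemmas],
        )
    conn.commit()
    conn.close()

if __name__ == "__main__":
    load_domain_terms("diderot_lexicon.db", DOMAIN_KEYWORDS)  # placeholder database name
```

Keeping the project-specific terms in a separate table is one way to preserve the integrity of the original lexicon while still allowing the extension to grow over time.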
2 Background [1]

A semantic search engine [4] is a retrieval system that allows users to conduct searches by expressing questions in the same way that humans request information from each other. Through the network of word meanings (the knowledge base of the semantic search engine) it is possible to correctly identify the meaning of the query, so the search engine is able to return all the content related to the initial question. The more precise the query, the better the performance of the engine. A query such as "the Italian poetry in Romanticism" can be hard to understand for a classical search engine (it would consider the occurrences of each term but not the meaning of the sentence), but it is very simple for a semantic search engine, which would consider the set of meanings and relations that exist between the words of the query. Examples of semantic search engines are www.hakia.com and http://www.sensebot.net. Hakia returns results from the Web, blogs, images and video; SenseBot shows the concepts of its knowledge base. The different sizes of the concepts in this graphical representation reflect their ranking by relevance to the input query.
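To make the idea concrete, the following minimal sketch expands a query with synonyms drawn from WordNet synsets, so that a document mentioning a related word can still match. It uses the English Princeton WordNet through NLTK purely as a stand-in for the MultiWordNet lexicon discussed here; it is not the engine developed in the project.

```python
from nltk.corpus import wordnet as wn  # requires a prior nltk.download("wordnet")

def expand_query(query: str) -> set:
    """Return the query terms plus the lemma names of their WordNet synsets."""
    expanded = set()
    for term in query.lower().split():
        expanded.add(term)
        for synset in wn.synsets(term):
            for lemma in synset.lemma_names():
                expanded.add(lemma.replace("_", " ").lower())
    return expanded

print(expand_query("poetry romanticism"))
# e.g. {'poetry', 'poesy', 'verse', 'romanticism', 'romantic movement', ...}
```

Matching documents against the expanded set, rather than against the literal query terms, is the simplest form of the meaning-based retrieval described above.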
3 Analysis of the problem and Materials

As previously said, in order to populate the Diderot database it was necessary to find a way of feeding the raw material of the Laterza school books into the system. A file format initially considered for indexing the books was IDML (InDesign Markup Language). The inability of PhpDig to index texts in this format led us to create an algorithm that converts such books into DocBook files. DocBook is a widely used XML standard that mirrors the hierarchical structure of a book ready to be published. The DocBook linear structure indexes the chapters, paragraphs and notes of the book: this indexing allows the book to be released in multiple outputs: volumes to be printed, e-books, content to be delivered on a web site, and so on. Indexing did not yield, at first, significant results because of the complexity of the IDML standard. The first step was to understand the correspondence between the IDML tags and the DocBook tags. In this regard, we considered conversion tables. IDML is the XML output of a book produced using InDesign. In order to convert the raw material into a text ready to be published, specific styles are assigned to the content, each of which defines a part of the work (chapter title, chapter number, and so on). The same styles are marked up in DocBook: in theory, it would therefore be possible to set up a table of correspondence between the IDML and DocBook XML. In practice, we realized that the resulting table
could not be applied to every occurrence, because each designer uses InDesign with a different scheme that cannot be represented in a single conversion table. In addition, IDML spreads the linear content over single sub-units that are not understandable as self-contained assets, but only as parts of a larger referenced structure: many of these sub-units do not contain any content, but just information about it. This makes it difficult to reconstruct the original chapter when it is viewed as separate multiple files. For all these reasons, the final analysis led us to choose the .doc format as the content source to be converted into DocBook files.

The conversion process was realized as follows. The .doc files were first converted into HTML pages so that they could be quickly indexed by a search engine. For this step the "Word to HTML Converter" was used. The operation is simple: we enter the location of the source file to convert and the folder in which we want to place the resulting HTML file. At this point, after invoking a script that removes the head generated by the converter, we create hyperlinks to each paragraph of a chapter, so as to obtain the maximum benefit from indexing. This can be done manually (with an HTML/text editor) or with a script that binds the tag
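The sketch below shows one way such a clean-up and anchoring step could be scripted; the file names are placeholders and the script is an illustrative assumption, not the actual tool used in the project.

```python
import re

def clean_and_anchor(html: str) -> str:
    """Drop the converter-generated <head> and give each paragraph an anchor id."""
    # Remove everything between <head> and </head> (case-insensitive, across lines).
    html = re.sub(r"<head>.*?</head>", "", html, flags=re.IGNORECASE | re.DOTALL)

    # Number the paragraphs so each one can be reached with a #par-N hyperlink.
    # For simplicity, any existing attributes on <p> are discarded.
    counter = 0
    def add_anchor(match: re.Match) -> str:
        nonlocal counter
        counter += 1
        return f'<p id="par-{counter}">'
    return re.sub(r"<p\b[^>]*>", add_anchor, html, flags=re.IGNORECASE)

if __name__ == "__main__":
    with open("chapter1.html", encoding="utf-8") as src:            # placeholder input file
        cleaned = clean_and_anchor(src.read())
    with open("chapter1_indexed.html", "w", encoding="utf-8") as dst:
        dst.write(cleaned)
```

Anchors of this kind let the search engine point directly to the paragraph that matches a query, instead of only to the whole chapter.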