A semantic search framework for document retrieval (literature, art and history) based on a Multiwordnet-like thesaurus

Vitoantonio Bevilacqua1,2*, Vito Santarcangelo1,2, Alberto Magarelli1, Annalisa Bianco3, Giuseppe Mastronardi1,2, Egidio Cascini4

1 Department of Electrical and Electronics, Polytechnic of Bari, Via E. Orabona 4, 70125 Bari, Italy
2 e.B.I.S. s.r.l. (electronic Business in Security), Spin-Off of Polytechnic of Bari, Via Pavoncelli 139, Bari, Italy
3 Gius. Laterza & Figli S.p.A., Piazza Umberto I 54, 70121 Bari, Italy
4 Accademia Italiana del Sei Sigma, c/o Università degli Studi Guglielmo Marconi, Via Giusti 7, 50121 Firenze, Italy
* Corresponding author: [email protected]

Abstract. The aim of this paper is to show the application of a thesaurus-based approach to several issues of interest for an Italian publishing house. The final experimental results reveal good performance in terms of retrieval rate and recall, and thus encourage the continuation of this multidisciplinary research.

Keywords: Thesaurus, Multi-Wordnet, Monte Carlo Method, retrieval

1 Introduction

This paper introduces the work carried out by the Polytechnic of Bari in collaboration with the Gius. Laterza and Figli publishing house as part of the project Diderot: Data Oriented Integrated System for the Rapid Definition of Multimedial Object and Texts. Diderot is a research project funded within the National Innovation Funding programme of the Italian Government, in which the editorial sector, ICT industries and research institutions have been working together to design and develop a system able to govern all the steps of the production of educational content, content to be delivered in different outputs through different distribution channels (printed books, e-books, learning objects for the web or the interactive whiteboard, etc.). New educational products could be created either from pre-existent raw material, reshaped and modified to satisfy the requirements of a specific output and target, or from new content written for the purpose. But the real goal for a twenty-first-century publishing house is to be able to re-use its wide pre-existent catalogue in order to give shape to new content and to adapt or re-create the original material into innovative products. The new ways of teaching, the increasing use of interactive whiteboards and computers in the classroom and at home, and the new ways of learning that new media foster in the younger generation make it necessary for a publisher to rethink its content and innovate its production process.

Therefore, Diderot had to deal with two different but necessarily intertwined priorities: implementing a new production system and, alongside it, an information and retrieval system able to store and search raw material from published books. This second objective required dealing with and solving several problems that we came across during the research work. First of all, a published book is not just a set of self-contained assets ready to be used, but a linearly structured work where style and content are normally merged in a single body. The shape of the product is correlated to the content [6], and this implies that different products are released in different shapes, most often not marked up according to a standard set of indexing keys. This feature results from the production process of an educational book: every single book has its own graphic design, the shape is part and expression of the content, and the final product is released after having been processed in authoring systems such as InDesign. The lack of separation between content and shape creates quite a few problems when the goal is to store and search the content of the book.

The first problem that the Diderot team had to face was therefore finding a way to index already produced content according to a set of standard keys. The study of editorial standards and the analysis of the Laterza education books led to the choice of DocBook as an XML markup language suitable for indexing published work. The team then created a table of correspondence between the styles as set in the books (chapters, paragraphs, indexes, notes, exercises, etc.) and the DocBook tags; an illustrative sketch of such a mapping is given at the end of this section. Two more problems then arose. Which kind of content could represent the raw material to be converted into DocBook files: the InDesign XML files, the final result of the production process, or the revised Word files, the final content after revision but before a graphic designer has given the book its final shape? Answering this question was not a simple task, as we explain in the following part of this paper. The second problem was the need to index the content with keywords that make it possible to store and retrieve it through semantic keys. We therefore had to design a semantic engine allowing retrieval by means of indexing keys. But how should these keys be assigned? Creating new entries every time we had to index a book, without referring to a structured vocabulary, was not viable, because we would have accumulated a wide number of variations and similar expressions that would have made the system increasingly difficult and burdensome as it was populated with new material. The solution was to turn to a structured vocabulary to be enriched with terms appropriate to classify the Laterza educational content.

Therefore, two new lines of research were undertaken: searching for a multilingual lexical database that could be the basis for future expansion, and identifying the classification areas suitable to index the educational products published by Laterza. The original material was related to different fields of study (language and literature, history, history of art), which implied that a multilingual thesaurus restricted to a single cultural domain would not have satisfied the purpose. We needed instead a multilingual lexical database that could be expanded with terms related to our material. The investigation led us to select Multiwordnet, a multilingual lexical database in which the Italian Wordnet is strictly aligned with the original project developed by Princeton University: WordNet. This lexicon had to be extended with terms related to the specific domains and with the key values that could be useful for a semantic query. Therefore, for every single domain the set of data directly related to the field has been identified, and for each of them a set of Italian keywords has been defined and uploaded to the system. This methodology has the advantage of maintaining the consistency and integrity of the lexicon and of delivering a system that can be enhanced over time by adding new terms.
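As an illustration of the style-to-DocBook correspondence mentioned above, the following Python sketch shows what such a mapping table could look like. The actual mapping chosen by the Diderot team is not reported in the paper: the DocBook element names (chapter, title, sect1, para, footnote, indexterm) are standard DocBook, but their pairing with the Laterza book styles is hypothetical.

# Illustrative guess at a style-to-DocBook correspondence table; the real
# table used in Diderot may differ. Element names are standard DocBook,
# the pairing with book styles is hypothetical.
STYLE_TO_DOCBOOK = {
    "chapter":         "chapter",
    "chapter title":   "title",
    "paragraph":       "sect1",
    "paragraph title": "title",
    "body text":       "para",
    "note":            "footnote",
    "index entry":     "indexterm",
}

def docbook_element(book_style: str) -> str:
    """Return the DocBook element assumed to correspond to a book style."""
    return STYLE_TO_DOCBOOK.get(book_style, "para")   # fall back to a plain paragraph

print(docbook_element("chapter title"))   # title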

2 Background [1]

A semantic search engine [4] is a retrieval system that allows users to formulate queries in the same way that humans request information from each other. Through the network of word meanings (the knowledge base of the semantic search engine) it is possible to correctly identify the meaning of the sentence, so the search engine is able to return all the content related to the initial question. The more precise the query, the better the performance of the engine. A query such as “the Italian poetry in romanticism” can be hard to understand for a classical search engine (it would consider the occurrences of each term but not the meaning of the sentence), but it is very simple for a semantic search engine, which considers the set of meanings connecting the words of the query. Examples of semantic search engines are www.hakia.com and http://www.sensebot.net. Hakia returns results from the Web, blogs, images and video; Sensebot shows the concepts of its knowledge base as a map, where the size of each concept reflects its relevance to the input query.

3 Analysis of the problem and Materials

As previously said, in order to populate the Diderot database it was necessary to find a way to feed the raw material of the Laterza school books into the system. A file format initially considered for the purpose of indexing books was IDML (InDesign Mark-up Language). The inability of PhpDig to index texts in this format led us to create an algorithm that converts such books into DocBook files. DocBook is a widely used XML standard that mirrors the hierarchical structure of a book ready to be published. The DocBook linear structure indexes the chapters, paragraphs and notes of the book, and this indexing allows the book to be released in multiple outputs: volumes to be printed, e-books, content to be delivered on a web site, and so on. At first, indexing did not yield significant results because of the complexity of the IDML standard. The first step was to understand the correspondence between the IDML tags and the DocBook tags; in this regard, we considered conversion tables. IDML is the XML output of a book produced with InDesign. In order to convert the raw material into a text ready to be published, specific styles are assigned to the content, each of which defines a part of the work (chapter title, chapter number, and so on). The same styles are marked up in DocBook: in theory, it would therefore be possible to set up a table of correspondence between the IDML and DocBook XML. In practice, we realized that the resulting table could not be applied to every occurrence, because each designer uses InDesign with a different scheme that cannot be represented in a single conversion table. In addition, IDML spreads the linear content over single sub-units that are not understandable as self-contained assets, but only as parts of a larger referenced structure: many of these sub-units do not contain any content, just information about it. This makes it difficult to reconstruct the original chapter when it is split over separate files. For all these reasons, the final analysis led to the choice of the .doc format as the content source to be converted into DocBook files.

The conversion process has been realized as follows. The .doc files are first converted into HTML pages, so that they can be quickly indexed by a search engine. For this step the “Word to HTML Converter” tool has been used: we enter the location of the source file to convert and the folder in which we want to place the resulting HTML file. At this point, after invoking a script that removes the head generated by the converter, we create hyperlinks to each paragraph of a chapter, so as to obtain maximum benefit from indexing. This can be done manually (with an HTML/text editor) or with a script that binds one heading tag to chapter titles and another heading tag to paragraph titles. In this way, we can identify the paragraph titles and create hyperlinks between them (with the anchor tag). Finally, the resulting file is uploaded to the server after logging in with administrator credentials. Once the indexing procedure has ended, semantic searches within the documents are possible through the knowledge base, the thesaurus, which is properly populated.
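The following is a minimal Python sketch of this post-processing step, assuming the converter output is plain HTML. The file names, the regular expressions and the style class used to recognise paragraph titles are hypothetical, and the anchor-generation logic is only illustrative of the idea described above, not the actual Diderot script.

# Minimal sketch of the HTML post-processing step (hypothetical file names
# and title-detection rules; the real Diderot script may differ).
import re
from pathlib import Path

def postprocess(html_path: str, out_path: str) -> None:
    html = Path(html_path).read_text(encoding="utf-8")

    # 1) Remove the <head> block generated by the Word-to-HTML converter.
    html = re.sub(r"<head>.*?</head>", "", html, flags=re.S | re.I)

    # 2) Wrap paragraph titles in named anchors so the indexer and the
    #    retrieval front end can jump straight to them. Here we assume the
    #    paragraph titles were marked with the (hypothetical) class "par-title".
    counter = 0
    def add_anchor(match: re.Match) -> str:
        nonlocal counter
        counter += 1
        return f'<a name="par{counter}"></a>{match.group(0)}'

    html = re.sub(r'<p class="par-title">.*?</p>', add_anchor, html, flags=re.S)

    # 3) Prepend a small list of hyperlinks pointing at each paragraph anchor.
    toc = "".join(f'<a href="#par{i}">paragraph {i}</a><br/>'
                  for i in range(1, counter + 1))
    Path(out_path).write_text(toc + html, encoding="utf-8")

postprocess("chapter19.html", "chapter19_indexed.html")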

4 Methods: thesaurus and retrieval

The thesaurus is based on the MultiWordnet structure [2]. MultiWordnet is an international lexical database with a simple structure (index tables and a relational database). As part of the Diderot project, a thesaurus has been implemented that is based on three index databases and a database of relationships. The index databases refer to three semantic areas, “History”, “Art History” and “Literature”, whose relationships have been saved in the “Common Relation” database. In order to preserve integration and consistency with the MultiWordnet research project, the structure of the index fields has been maintained and we have used the same type of notation for the letter of the id#. To avoid overlapping with the id# already present in MultiWordNet, different letters have been chosen for the id# of the specific domains: the letter “S” has been used for historical terms, “L” for literature terms, and “K” for words referring to the History of Art.

The semantic search engine [5] can be represented by the simple block diagram shown in Figure 2. The INPUT data is the sentence (made up of words, punctuation, prepositions and conjunctions) written by the user in the INPUT FORM. The sentence is filtered by the FILTERING block, which removes stop words, and its output is a VECTOR OF WORDS. These words are the input of the THESAURUS QUERIES block, which queries the thesaurus with each word and obtains the associated ENTITIES. These are the input of the LOGICS block, which processes them and keeps the related ones, which in turn are the inputs of the DOCS QUERIES block. This block queries the file server and obtains the relevant documents with their relative scores. The DOCS QUERIES block is the PhpDig web spider and search engine.
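As a rough illustration of how the domain-specific id# prefixes and the “Common Relation” table might fit together, the sketch below builds the three index tables and the relation table in an in-memory SQLite database. The table and column names are hypothetical; only the “S”/“L”/“K” prefixes and the “%p” (has part) relation type come from the text and from Figure 1, under the assumption that the relation points from the historical period to its literary figures.

# Hypothetical sketch of the thesaurus layout: three domain indexes whose
# id# letters ("S" history, "L" literature, "K" art history) avoid clashes
# with the original MultiWordnet ids, plus a common relation table.
# Table and column names are illustrative, not the actual Diderot schema.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE history_index    (id TEXT PRIMARY KEY, lemma TEXT);
CREATE TABLE literature_index (id TEXT PRIMARY KEY, lemma TEXT);
CREATE TABLE art_index        (id TEXT PRIMARY KEY, lemma TEXT);
CREATE TABLE common_relation  (type TEXT, id_source TEXT, id_target TEXT);
""")

db.executemany("INSERT INTO history_index VALUES (?, ?)",
               [("S#0001", "800")])                          # the 19th century
db.executemany("INSERT INTO literature_index VALUES (?, ?)",
               [("L#0001", "Manzoni"), ("L#0002", "Leopardi")])
db.executemany("INSERT INTO art_index VALUES (?, ?)",
               [("K#0001", "Bernini")])

# "%p" = has-part relation, as in Figure 1: "800" (History) has part "Manzoni".
db.execute("INSERT INTO common_relation VALUES ('%p', 'S#0001', 'L#0001')")

# Example lookup: which literature lemmas are parts of the historical lemma "800"?
rows = db.execute("""
    SELECT l.lemma FROM common_relation r
    JOIN history_index h    ON h.id = r.id_source
    JOIN literature_index l ON l.id = r.id_target
    WHERE r.type = '%p' AND h.lemma = '800'
""").fetchall()
print(rows)   # [('Manzoni',)]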

Fig. 1 - The figure shows the relation between the lemmas “Manzoni” (Literature) and “800” (History) in the Relation Database with the type “%p” (has part in).

Fig. 2 - System Block Diagram

An example is useful to show the data flow through this block diagram. The user writes the sentence “the Italian romanticism in poetry”, so the FILTERING block returns the vector of words [Italian, romanticism, poetry]. The system queries the THESAURUS with these words and obtains, for the word “Italian”, the names of Italian events, authors and artists set in the thesaurus [Boccaccio, Manzoni, Leopardi]; for the word “romanticism”, the names of the related international events, authors and artists [800, Manzoni, Leopardi, VanGogh]; and for “poetry”, the words generalized by this term and the international poets and authors [lyrics, rhymes, Leopardi]. By matching these results it is possible to invoke PhpDig with the logic words obtained from the thesaurus [Leopardi].
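A compact Python sketch of this data flow is given below, assuming the LOGICS block keeps the entities common to all query words, which is consistent with the example above, where only “Leopardi” survives. The stop-word list, the in-memory thesaurus dictionary and the phpdig_search stub are hypothetical stand-ins for the real components; only the entity sets come from the worked example.

# Sketch of the block-diagram data flow for "the Italian romanticism in poetry".
# The thesaurus dictionary, the stop-word list and the document search stub
# are illustrative stand-ins; only the entity sets come from the worked example.

STOP_WORDS = {"the", "in", "of", "and", "a"}

THESAURUS = {                       # THESAURUS QUERIES block (toy version)
    "italian":     {"Boccaccio", "Manzoni", "Leopardi"},
    "romanticism": {"800", "Manzoni", "Leopardi", "VanGogh"},
    "poetry":      {"lyrics", "rhymes", "Leopardi"},
}

def filtering(sentence: str) -> list[str]:
    """FILTERING block: remove stop words (a real filter also strips punctuation)."""
    return [w for w in sentence.lower().split() if w not in STOP_WORDS]

def logics(entity_sets: list[set[str]]) -> set[str]:
    """LOGICS block (assumed): keep the entities shared by every query word."""
    return set.intersection(*entity_sets) if entity_sets else set()

def phpdig_search(terms: set[str]) -> list[str]:
    """DOCS QUERIES block stand-in: in Diderot this is the PhpDig engine."""
    return [f"documents matching '{t}'" for t in sorted(terms)]

words = filtering("the Italian romanticism in poetry")     # [italian, romanticism, poetry]
entities = [THESAURUS.get(w, set()) for w in words]        # ENTITIES per word
print(phpdig_search(logics(entities)))                     # ["documents matching 'Leopardi'"]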

5 Experimental Results and Validation

For the validation of the system, a bandwidth validation and a classical parametric validation of the retrieval system have been considered. The first is based on a statistical tool built on the Monte Carlo method, the second on the measurement of the precision and recall parameters of the system.
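In the usual formulation, with "retrieved" the set of documents returned for a query and "relevant" the set of documents that should have been returned, the two parameters are:

\text{precision} = \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{retrieved}|}, \qquad \text{recall} = \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{relevant}|}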

5.1 Statistical bandwidth validation

The Monte Carlo method is based on simulated sampling, representing the cumulative probability distribution of X. The procedure to choose an element x of X is:
1) a random decimal number between 0 and 1 is drawn by means of a random number generator;
2) the obtained number is placed on the y axis;
3) this number is projected onto the curve y = F(x);
4) the x value determined in this way is an element of a sample extracted from X.
This procedure is indeed adequate to give the x value determined as illustrated above the meaning of an element of a sample extracted from X. In fact, we can write:

P(x_1 < x_{\text{sample}} < x_1 + dx) = dy_1 = \frac{dF(x_1)}{dx_1}\, dx = f(x_1)\, dx \qquad (1)

in order to consider x an element of a sample from X. For further details about the Monte Carlo method it is useful to read the paper [3]. Obviously, the way the method is applied depends on the specific problem under consideration.

An important statistical tool, based on the Monte Carlo method, has been implemented to help control the upload bandwidth requested of the server. The input variables of the system are X1 (number of users) and X2 (average number of requests per user), observed for 15 days. For example, on the first day the number of users connected to the system is 30, so X1=30, and the total number of requests is 300 (X2=300/30=10). The 15 X1 values are sorted in increasing order and each of them is associated with a value between 0 and 1. Then 1000 random numbers are generated and a value of X1 is associated with each of them. The same procedure is followed for X2, and the 1000 products X1*X2 are calculated. After that, the mean value of these results is calculated and compared with the upload bandwidth of the server. The analysis of this bias value is the output of the system. Once the result is obtained, it is possible to draw the cumulative probability distribution of the bandwidth, from which all the needed information can be derived.
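A minimal Python sketch of this simulation is shown below, assuming the inverse-transform sampling is done against the empirical cumulative distribution of the 15 observed values. The daily figures and the server bandwidth are made-up numbers, and the paper does not specify the units in which the X1*X2 products are compared with the bandwidth, so that step is left as a plain comparison.

# Minimal sketch of the Monte Carlo bandwidth check, with made-up daily data.
# Each observed value gets an equal slice of the empirical CDF; a uniform
# random number is then projected back onto the sorted observations
# (inverse-transform sampling, as in equation (1)).
import random

x1_days = [30, 25, 40, 35, 28, 33, 45, 38, 26, 31, 29, 42, 37, 34, 27]   # users/day (hypothetical)
x2_days = [10, 12,  8, 11,  9, 10,  7, 12, 13,  9, 10,  8, 11, 10,  9]   # avg requests/user (hypothetical)

def sample_empirical(values: list[float], n: int) -> list[float]:
    """Draw n values by inverse transform over the empirical distribution."""
    ordered = sorted(values)
    k = len(ordered)
    return [ordered[min(int(random.random() * k), k - 1)] for _ in range(n)]

N = 1000
s1 = sample_empirical(x1_days, N)                 # simulated numbers of users
s2 = sample_empirical(x2_days, N)                 # simulated requests per user
products = [a * b for a, b in zip(s1, s2)]        # simulated total requests

mean_load = sum(products) / N
server_bandwidth = 400                            # hypothetical capacity, same units as the load
print(f"mean simulated load: {mean_load:.1f}, bias vs bandwidth: {mean_load - server_bandwidth:.1f}")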

5.2 Comparison between semantic search engines

We have considered some queries about literature to compare the performance of our semantic web search engine with that of www.hakia.com and www.sensebot.net. Hakia supports only English, while the second supports English, French, German and Spanish. The queries used are “the Italian poetry of 1300”, “the Italian satire in the Renaissance” and “the Sculpture in the Baroque”. For the first query our system returns documents about “Landini, Sacchetti, Petrarca”, Hakia returns “Cavalcanti, Petrarca, Giacomo da Lentini”, and Sensebot returns a meaning map that shows Dante and Petrarca as correct results. For the query “the Italian satire in the Renaissance” our system produces “Ariosto”, Hakia produces “William Baldwin, Ariosto”, and Sensebot produces the semantic map in Figure 3, in which we can see that the meaning is wrong.

Fig. 3: Sensebot map of the query "The Italian satire in the Renaissance"

The output of our system for the query “the Italian sculpture in the Baroque” is “Bernini”, with its related documents scored as shown in Figure 4:

100.00  19.1 GIAN LORENZO BERNINI.html
 41.34  19.2 FRANCESCO BORROMINI.html
 37.31  19.4 BAROCCO.html
 26.39  19.3 PIETRO DA CORTONA.html
 13.78  19 IL BAROCCO.html
  7.96  18 NATURALISMO E CLASSICISMO.html

Fig. 4: Output of our system for the query "The Italian sculpture in the Baroque"

The output of Hakia is generic, but in its results we can see the word “Bernini”; the map returned by Sensebot is shown in Figure 5.

Fig. 5: Sensebot map of the query "The Italian sculpture in the Baroque"

Our system performs well thanks to a thesaurus dedicated to Literature, Art and History, but this has been made possible by the insertion of domain words and by the stemming of the data. Hakia returns very good results even if the historical period is not correct in some outputs; Sensebot has a very innovative graphical representation of the semantic map, but its results are sometimes wrong.

6 Conclusions

This paper shows the features and the potential of a semantic search engine used in the context of literature, history of art and history. The search engine implemented can be adapted to other semantic contexts by building a related thesaurus. The limits of our search engine are the use of the single words of the sentence and a poor graphical interface for the query results. The strengths of our system are the full control of the thesaurus, the possibility of implementing new relations, and the good relevance performance in document retrieval.

References

[1] Tumer, D., Shah, M.A., Bitirim, Y., An Empirical Evaluation on Semantic Search Performance of Keyword-Based and Semantic Search Engines: Google, Yahoo, Msn and Hakia, Internet Monitoring and Protection (ICIMP '09), 2009, pp. 51-55.
[2] Pianta, E. et al., MultiWordNet: Developing an Aligned Multilingual Database, Proceedings of the 1st International WordNet Conference, Mysore, Jan 2002, pp. 293-302.
[3] Cascini, E., Considerazioni sulla variabilità di un processo di misura con una applicazione nell'area della Customer Satisfaction, Sei Sigma & Qualità, Vol. 1, No. 2, pp. 181-210.
[4] Kassim, J. et al., Introduction to Semantic Search Engine, 2009 International Conference on Electrical Engineering and Informatics, pp. 380-385.
[5] Loganantharaj, R. et al., An Ontology Based Semantic Literature Retrieval System, Proc. of the 19th IEEE Symposium on Computer-Based Medical Systems (CBMS'06).
[6] He Ruiying, Liu Lingling, Educational Resource Sharing Model Based on Semantic Grid, Computational Intelligence and Software Engineering (CiSE 2009), 2009.