International Journal of Applied Engineering Research ISSN 0973-4562 Volume 10, Number 16 (2015) pp 36162-36168 © Research India Publications. http://www.ripublication.com

TRET: A Text Retrieval Efficiency Testing Tool for Different Document Types/Formats and Calculating Evaluation Measures for XML Retrieval

SARANYA.S1,

B.S.E.ZORAIDA2,

COMPUTER SCIENCE AND ENGINEERING, BHARATHIDASAN UNIVERSITY, TRICHY-23.

COMPUTER SCIENCE AND ENGINEERING, BHARATHIDASAN UNIVERSITY, TRICHY-23
[email protected]

Abstract−The web is a huge collection of heterogeneous documents of different types and formats. Retrieving information from documents of different types is a difficult task, due to the differing structure or schema of the documents. To study the retrieval efficiency of heterogeneous document types, a software tool TRET (Text Retrieval Efficiency Testing) is designed and built in this paper using the CCA (Competent Crawling Algorithm) web search algorithm. TRET performs two phases of analysis. In the first phase, keyword-based information retrieval is carried out on different document formats: txt, doc/docx, pdf, html, xml, rdf, and json. The first-phase results show that XML documents take the minimum retrieval time compared to all other document formats. In the second phase, XML documents are evaluated for search effectiveness by measuring recall, precision, and F-score; the results show that XML documents have high precision and recall values for the retrieval process. The objective of this work is to identify the document type with the minimum retrieval time and to obtain maximum recall and precision values for the corresponding retrieval.

Keywords: XML Retrieval, Recall, Precision, Information searching, Retrieval Efficiency.

I. INTRODUCTION

Since the beginning of written language, humans have been developing ways of quickly indexing and retrieving information [1]. Information Retrieval (IR) is the act of storing, searching, and retrieving information that matches a user's request [2]. Due to the dynamic nature and heterogeneity of the web, organizing its information is a difficult task. The web holds approximately 15 to 30 billion web pages with heterogeneous information. Classifying this information is a challenging process, and that makes retrieving information from the web difficult [4]. The different types of documents present on the web consume different amounts of time to search and retrieve information for a query. To overcome information overload, a tremendous amount of research has gone into information retrieval systems over the last few decades. To enhance retrieval efficiency, much effort has been invested in developing query and filtering languages as means of consulting XML documents [3]. Semantic information retrieval tries to go beyond traditional methods by defining the concepts in documents and in queries to improve retrieval [1]; some implicit semantic information is also discovered by the text mining process. A semantic relevance measure is used for retrieval [9]. For a semantic Web application to make use of a Web data source, the data needs to be transformed into a machine-understandable format; the most common format for the extracted information is RDF [8]. Many information retrieval systems and searching tools have been developed in recent years. For instance, one of the main novel approaches for searching information in Web pages is Nutch [5], which is built on top of Lucene [6], an API for text indexing and searching [4]. Chien, L. F. [11] proposed a keyword extraction method based on the PAT tree; its major disadvantage is that constructing and retrieving information from the PAT tree is time-consuming. Eibe Frank [12][13] proposed the KEA algorithm for keyword extraction, a simple Bayesian machine learning approach. Bao Hong [14] proposed an extended term frequency (Extended TF) method that combines Chinese linguistic characteristics with the basic TF method to improve precision; its major demerit is a low recall value. The most widely used tool to search information on the Internet is Google [7], a generic search system whose data can be of several types (images, videos, documents, etc.) [10]. To enhance text retrieval efficiency and accuracy, this paper develops a tool that both compares the text retrieval times of different document types and evaluates the efficiency measures of XML text retrieval. The tool, TRET, retrieves information from seven different document types and compares their retrieval efficiency based on retrieval time. To obtain improved recall and precision measures, the CCA web search algorithm is implemented in TRET. Thus an attempt has been made in this paper to design and develop a tool with these functionalities.

II. CLASSIFICATION OF TEXT STRUCTURES

Based on structure, textual information can be divided into three categories: documents are classified as structured, semi-structured, or unstructured according to variations in their internal and external information.

A. Unstructured data: Unstructured data is raw text without markup or syntactic labels.

Text files (TXT): A text file is a simple file format that stores information as a sequence of lines. Text files generally use ASCII or UTF-8 encoding and usually have the MIME type "text/plain". Text files are saved with the .txt extension and can easily be read or opened by any program that reads text; for that reason they are considered universal (or platform independent) [15].
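As a minimal illustration of keyword-based retrieval over plain text (a sketch, not TRET's implementation), a search over a UTF-8 text file can be written as:

```python
# Illustrative sketch (not part of TRET): keyword lookup over a plain-text
# file, returning the count and line numbers where the query term occurs.
def search_txt(path, keyword):
    """Case-insensitive keyword search in a UTF-8 text file."""
    hits = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if keyword.lower() in line.lower():
                hits.append(lineno)
    return len(hits), hits
```

Because a .txt file carries no structure, the finest retrieval unit available is the line (or the whole file), which is exactly the contrast with element-level XML retrieval discussed later.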


Word Document (DOC): A Word document is a Microsoft product. The Word document format is a binary file format designed with very different goals than HTML; its native file extensions are .doc/.docx. Older versions of Word (1997-2003) use the binary .doc format by default. From MS Word 2003 onward an XML-based file format was introduced, so MS Word 2007 and later versions support both the default binary format and XML-based formats [16].

Portable Document Format (PDF): PDF is a multiplatform file format developed by Adobe Systems. A "tagged" PDF (ISO 32000-1:2008, 14.8) includes document structure and semantics information to enable reliable text extraction and accessibility. Tagged PDF builds on the logical structure framework of PDF 1.3 and defines a set of standard structure types and attributes that allow page content (text, graphics, and images) to be extracted and reused for other purposes [17].

B. Structured data: Structured data contains well-defined, processed data; from structured data the user can extract specific and exact information for a query.

Resource Description Framework (RDF): RDF is a W3C standard used to describe metadata on the web. RDF was designed to allow developers to build search engines that rely on metadata and to allow Internet users to share web site information more readily. RDF uses XML as an interchange syntax for creating an ontology system to exchange information on the web [18].

JavaScript Object Notation (JSON): JSON is an open-standard, lightweight, human-readable data interchange format. JSON is used as an alternative to XML and is among the most widely used formats for exchanging data on the web [19].

C. Semi-structured data: Semi-structured data contains semi-processed data, lying between unstructured and structured data. Semi-structured data has information constrained by a schema as well as unconstrained information.

Hyper Text Markup Language (HTML): HTML is a standard markup language used for creating and formatting web pages. HTML has a hierarchical structure in the form of HTML elements and tags and describes the semantic structure of a website. HTML allows images and objects to be embedded and can be used to create interactive forms. It provides a means to create structured documents by denoting text in hierarchical structures such as headings, paragraphs, and so on [20].

Extensible Markup Language (XML): XML is a W3C standard. It is a structured markup language for describing data in terms of attributes and elements in a tree structure [21]. The first challenge in structured retrieval is to return parts of documents (i.e., XML elements), not entire documents as Information Retrieval systems usually do in unstructured retrieval [22]. Web documents show extreme variation in their internal and external meta-information. For example, documents differ internally in their language (both human and programming), vocabulary (email addresses, links, zip codes, phone numbers, product numbers), type or format (text, HTML, PDF, images, sounds), and may even be machine generated (log files or output from a database) [23]. It is therefore very difficult to search and retrieve accurate information from the heterogeneous documents of the web with good retrieval time.

III. TRET (TEXT RETRIEVAL EFFICIENCY TESTING TOOL)

TRET is a web application tool that allows users to search, retrieve, and evaluate indexed documents of type txt, doc/docx, pdf, html, xml, rdf, and json in the TRET collections. TRET provides interactive retrieval of information. The tool is developed in C# on .NET Framework 4.5 and is designed around a web search crawling algorithm. To enhance retrieval performance, the Competent Crawling Algorithm (CCA) is implemented in TRET [24]. CCA is a web search algorithm derived from the existing functionality of both the PageRank and BFS web search crawling algorithms. Compared to existing web search algorithms, CCA improves time efficiency by dequeuing already-visited URLs from the buffer before the crawler encounters them. In CCA, dynamic hash tables are used for scalability, and the system is resilient to crawler crashes. The difficulty of searching and retrieving information from different document types and structures is addressed by using the CCA web search algorithm in TRET [24].

Pseudo code [24]:

    CCA(StartingUrl) {
        for (p = 0; p < max_buffer; p++) {
            ...
            DEQUEUE(Frontier);
        }
        Merge(Frontier, max_trip, BufferedPages);
    }
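The published pseudocode is abridged. Purely as an illustrative sketch (not the authors' implementation), a BFS-style crawler that keeps a hash set of visited URLs and discards already-seen links before fetching them could look like this; `fetch_links` is a hypothetical caller-supplied function standing in for page download and parsing:

```python
from collections import deque

def crawl(start_url, fetch_links, max_pages=100):
    """BFS-style crawl sketch in the spirit of CCA [24]: a hash set of
    visited URLs lets the crawler drop already-seen links before fetching.
    fetch_links(url) -> list of outgoing URLs (hypothetical; a real
    crawler would download and parse the page here)."""
    frontier = deque([start_url])
    visited = set()
    pages = []
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if url in visited:          # dequeue visited URLs early
            continue
        visited.add(url)
        pages.append(url)           # stand-in for "fetch and index"
        for link in fetch_links(url):
            if link not in visited:
                frontier.append(link)
    return pages
```

The hash set gives O(1) membership tests, which is the mechanism behind the time-efficiency claim: duplicate URLs cost a lookup rather than a fetch.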

With the implementation of CCA [24], TRET is a highly flexible, efficient, and effective keyword-based search tool, readily deployable on large-scale collections of documents. TRET implements state-of-the-art searching and retrieval functionalities and provides an ideal platform for the rapid development and evaluation of large-scale retrieval applications, making it well suited to information retrieval investigations. TRET calculates the information retrieval time and compares the retrieval times of different document types/formats in the first phase of investigation. In the second phase, TRET calculates and compares the efficiency measures (recall and precision) of XML retrieval. The design view of the TRET tool is shown in Fig 1.

Fig 1: Design view of TRET

This tool has the functionalities to evaluate the information retrieval time of different document types and to present the investigation results as graphical representations.

IV. EVALUATION MEASURES

To evaluate the quality of a retrieval system, it is important to analyze whether the results returned by the system for a query are relevant to that query. This can be done by issuing a query over a set of documents and comparing the retrieved results with the query to count the number of relevant results returned by the retrieval system [25]. With the rapid growth in web content, the answers provided by a search tool for specific keywords typically show high recall but low precision. To mitigate this problem, the principle of semantic efficiency is implemented in TRET using the CCA web search algorithm. To formalize this notion of quality, several measures have been defined; this paper focuses on precision, recall, and F-score, and the relationship between them. Precision measures the fraction of predicted matches that are correct, i.e., the number of true positives over the number of pairs predicted as matched. Recall measures the fraction of correct matches that are predicted, i.e., the number of true positives over the number of pairs that are actually matched [26]. F-score combines precision and recall to evaluate the accuracy of a retrieval. They are computed as follows [24]:

    Precision (P) = Number of relevant pages retrieved / Total number of pages retrieved    (1)

    Recall (R) = Number of relevant pages retrieved / Total number of relevant pages        (2)

    F-score (F) = 2 (Precision * Recall) / (Precision + Recall)                             (3)

The TRET tool calculates the precision and recall values of the documents present in the TRET collection for a given query. The F-score is then computed from the precision and recall values. From these evaluation measures, the efficiency and accuracy of the retrieved information are determined.

V. INVESTIGATION AND RESULTS

The investigation is carried out by TRET on different document formats of different sizes, from collections of less than 1 gigabyte of text up to 2.5 or 3 gigabytes. TRET performs the analysis in two phases.

A. First Phase (Calculating Retrieval Time)

In the first phase of investigation, seven formats of documents (txt, doc, pdf, html, xml, rdf and json) with the same content are indexed in the TRET collection. The first phase is tested under two cases for calculating information retrieval time by file size.

Case 1: In the first case, documents with medium file size (0 to 50,000 bytes) are taken for investigation. The word "Fantasy" is given as a query for all documents to calculate the information retrieval time. Initially, the book.txt file is used and the retrieval time is calculated for the query "Fantasy"; the result is represented as a bar chart. Subsequently, book.doc and book.pdf are tested with the same query. The screenshot of the TRET tool with the retrieval times for the three documents is shown in Fig 2, and the graph is shown more clearly in Fig 3.


Fig 2: Retrieving the term "Fantasy" from unstructured (txt, doc and pdf) documents

The final result of case 1 is shown in Fig 5. Comparing the retrieval times of the seven document formats at medium file size, the txt format has the minimum retrieval time among unstructured documents, and XML has the best retrieval time among structured documents.

The first three document types, txt, doc/docx and pdf, are unstructured and uncontrolled documents. Retrieving information from unstructured and structured documents requires different interpretation strategies. The major principle of structured document retrieval is that a system should always retrieve the most specific part of a document answering the query [27].
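To illustrate this principle, the sketch below (not part of TRET; the sample document is invented) returns only the XML elements that match a query rather than the whole document, using Python's standard library:

```python
# Element-level XML retrieval sketch: return the most specific matching
# parts (elements) of a document instead of the entire document.
import xml.etree.ElementTree as ET

def retrieve_elements(xml_text, keyword):
    """Return (tag, text) pairs for elements whose text contains keyword."""
    root = ET.fromstring(xml_text)
    matches = []
    for elem in root.iter():                      # document-order traversal
        if elem.text and keyword.lower() in elem.text.lower():
            matches.append((elem.tag, elem.text.strip()))
    return matches

doc = """<library>
  <book><title>Fantasy Tales</title><genre>fantasy</genre></book>
  <book><title>History of Rome</title><genre>history</genre></book>
</library>"""
print(retrieve_elements(doc, "fantasy"))
# [('title', 'Fantasy Tales'), ('genre', 'fantasy')]
```

A plain-text search over the same content could only say that the file matches; the element-level result pinpoints which title and genre matched.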

Case 2: In the second case, the seven document formats (txt, doc, pdf, html, xml, rdf and json) with large file size (50,000 to 500,000 bytes) are analyzed for the information retrieval process, and the result is shown in Fig 6. Comparing the TRET results of case 2 for large file size, XML documents have the best retrieval time for structured information retrieval. Analyzing the final results of case 1 and case 2, XML has the best retrieval time over the other structured document formats. Although the txt format has the minimum overall retrieval time, it is not considered further because of its unstructured format.
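Retrieval-time comparisons of this kind can be sketched with a simple wall-clock timer; `timed_search` and the per-format search functions named in the comment are hypothetical illustrations, not TRET's API:

```python
import time

def timed_search(search_fn, *args):
    """Run any search callable and return (result, elapsed_seconds).
    time.perf_counter() is a monotonic, high-resolution clock suited
    to measuring short durations like a single retrieval."""
    start = time.perf_counter()
    result = search_fn(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Hypothetical usage, comparing two formats on the same query:
# hits_xml, t_xml = timed_search(search_xml, "books.xml", "Fantasy")
# hits_txt, t_txt = timed_search(search_txt, "books.txt", "Fantasy")
```

Averaging several runs per format would reduce noise from caching and I/O, which matters when the differences between formats are small.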

Fig 3: Result of book.txt, book.doc & book.pdf

HTML is a language used for describing structured documents on the web; XML, RDF, and JSON are structured languages. Retrieval times of txt, doc, pdf, html and xml for medium file size are shown in Fig 4.

Fig 6: Comparison of Retrieval time for Large File Size

From the final result of phase 1, it is concluded that XML information retrieval has the most efficient retrieval time among the tested document formats/types.

B. Second Phase (Calculating and Comparing the Efficiency Measures of XML documents)

In the second phase of investigation, XML documents are taken for efficiency analysis. The efficiency of the retrieved information is calculated from the precision and recall measures [28]. Precision measures the ability of a search engine to produce only relevant results: it is the ratio between the number of relevant documents retrieved by the system and the total number of documents retrieved. In an ideal scenario, a search would have a precision score of 1, i.e., every document retrieved is relevant [28]. However, in search engines, the number of results returned for typical queries usually runs into the thousands. Here, the precision value is calculated over the 100 XML documents in the TRET collection. Precision and recall values are calculated for ten queries over these 100 XML documents, and the F-score is computed from the precision and recall values to measure the accuracy of the retrieval. The precision, recall, and F-score values of the ten queries are shown in Table 1.
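Measures of this kind follow directly from equations (1)-(3). The following sketch is illustrative only, using hypothetical retrieved/relevant sets rather than the actual TRET relevance judgments:

```python
def evaluation_measures(retrieved, relevant):
    """Precision, recall and F-score per equations (1)-(3), given the set
    of retrieved documents and the set of truly relevant documents.
    (Illustrative; real relevance judgments live in the TRET collection.)"""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score
```

As a cross-check against Table 1, plugging Q1's precision 0.82 and recall 0.67 into equation (3) gives 2(0.82)(0.67)/(0.82+0.67) ≈ 0.74, matching the reported F-score of 74%.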

Fig 4: Retrieval time of txt, doc, pdf, html and xml

Fig 5: Comparison of Retrieval time for medium file size


Recall measures the ability of an information retrieval system to find the complete set of relevant results in a set of documents: it is the ratio of the number of relevant documents retrieved to the total number of relevant documents for a given query [28]. For a search engine, the total number of relevant documents can be all relevant documents on the Web. A sample result of recall and precision calculation using TRET is shown in Fig 7.

Fig 7: Evaluation of Precision and Recall Measures

Table 1: Recall and Precision Rate (ten queries over 100 XML documents)

    Query   Recall (%)   Precision (%)   F-score (%)
    Q1      67           82              74
    Q2      69           92              79
    Q3      55           68              61
    Q4      33           100             50
    Q5      87           95              91
    Q6      42           85              56
    Q7      86           89              87
    Q8      95           91              93
    Q9      78           77              77
    Q10     92           98              95
    AVG     70.4         87.7            76.3

In Fig 8, the recall values calculated for the ten queries over the hundred XML documents are shown graphically.

Fig 8: Recall rate of XML Retrieval (query vs. recall efficiency, in percent)

In Fig 9, the precision values for the same ten queries, calculated using TRET, are shown graphically.

Fig 9: Precision rate of XML Retrieval (query vs. precision efficiency, in percent)

In Fig 10, the F-score calculated from the precision and recall values of the ten queries is shown.

Fig 10: F-score Measure for XML Retrieval

Precision and recall are generally inversely proportional, i.e., the higher the precision of a result set, the lower the recall, and vice versa [28]. To increase the accuracy of information retrieval, however, the difference between the precision and recall values should be small. With the implementation of the CCA web search algorithm in TRET, an attempt has been made in this paper to obtain both high precision and high recall for XML search.

Fig 11: Comparison of Average Recall and Precision Rate (Recall 70.4%, Precision 87.7%)

In Fig 11, the average values of precision and recall are calculated and compared using TRET. The result in Fig 11 shows that XML information retrieval has a small difference between precision and recall for a search. Thus, with the CCA web search algorithm, XML text retrieval using TRET achieves good efficiency and accuracy.

VI. CONCLUSION

A web application tool, TRET, is designed and developed in this paper to search, retrieve, evaluate, and compare the efficiency of information retrieval for heterogeneous document types/formats, and a web search algorithm (CCA) is implemented in it. TRET performs two phases of analysis. In the first phase, the retrieval times of the different document formats are calculated and compared for medium and large file sizes, and XML is determined to have the best information retrieval time among the structured formats. In the second phase, the efficiency of XML retrieval is evaluated through its precision, recall, and F-score values. The objective of this paper is achieved by evaluating the retrieval time and efficiency of XML retrieval: the F-score values obtained from recall and precision show that XML retrieval has high accuracy. From the investigations, it is concluded that TRET provides good accuracy and efficiency for text retrieval, and that XML text retrieval has the best retrieval time.

REFERENCES

1. Feji Ren, David B. Bracewell, Advanced Information Retrieval, Elsevier Electronic Notes in Theoretical Computer Science 225 (2009) 303-317.
2. Korfhage, R. R., Information Storage and Retrieval, John Wiley and Sons, 1997.
3. M. Baggi, An Ontology-based System for Semantic Filtering of XML Data, Elsevier Electronic Notes in Theoretical Computer Science 235 (2009) 19-33.
4. Josep Silva, Information Filtering and Information Retrieval with the Web Filtering Toolbar, Elsevier Electronic Notes in Theoretical Computer Science 235 (2009) 125-136.
5. The Apache Software Foundation, Nutch version 0.7 tutorial. Accessed October 10, 2007. URL: http://lucene.apache.org/nutch/tutorial.pdf.
6. E. Hatcher and O. Gospodnetic, Lucene in Action (In Action series), Manning Publications Co., Greenwich, CT, USA, 2004.
7. N. Blachman, Google guide: Making searching even easier. Accessed February 24, 2015. URL: http://www.googleguide.com.
8. Saeed Al-Bukhitan, Tarek Helmy, Mohammed Al-Mulhem, Semantic Annotation Tool for Annotating Arabic Web Documents, Elsevier Procedia Computer Science 32 (2014) 429-436.
9. P. Shanmuga Vadivu, P. Sumathy, A. Vadivel, Image Retrieval From WWW Using Attributes in HTML Tags, Elsevier Procedia Technology 6 (2012) 509-516.
10. Sara Paiva, Manuel Ramos-Cabrer, The relevance of profile-based disambiguation and citations in a fuzzy algorithm for semantic document search, Elsevier Procedia Technology 16 (2014) 22-31.
11. Chien, L. F., PAT-Tree-Based Keyword Extraction for Chinese Information Retrieval, Proceedings of the ACM SIGIR International Conference on Information Retrieval, 1997, pp. 50-59.
12. Frank, E., Paynter, G.W., Witten, I.H., Gutwin, C., and Nevill-Manning, C.G., Domain-specific keyphrase extraction, Proc. Sixteenth International Joint Conference on Artificial Intelligence, Morgan Kaufmann Publishers, San Francisco, CA, 1999, pp. 668-673.
13. Witten, I.H., Paynter, G.W., Frank, E., Gutwin, C., and Nevill-Manning, C.G., KEA: Practical automatic keyphrase extraction, Proceedings of Digital Libraries 99 (DL'99), ACM Press (1999) 254-256.
14. Bao Hong, Deng Zhen, An Extended Keyword Extraction Method, Elsevier Physics Procedia 24 (2012) 1120-1127.
15. http://en.wikipedia.org/wiki/Text_file, accessed 2015.
16. http://en.wikipedia.org/wiki/Doc_%28computing%29
17. http://en.wikipedia.org/wiki/Portable_Document_Format
18. http://www.webopedia.com/TERM/R/RDF.html
19. http://rapidvaluesolutions.com/whitepapers/mobility-informationseries.html
20. http://en.wikipedia.org/wiki/HTML
21. Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press.
22. http://nlp.stanford.edu/IR-book/html/htmledition/challenges-in-xml-retrieval-1.html
23. Sergey Brin and Lawrence Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, http://google.stanford.edu.
24. S. Saranya, B.S.E. Zoraida and P. Victor Paul, A Study on Competent Crawling Algorithm (CCA) for Web Search to Enhance Efficiency of Information Retrieval, Springer 2015, Artificial Intelligence and Evolutionary Algorithms in Engineering Systems, Advances in Intelligent Systems and Computing, Volume 325, 2015, pp. 9-16.
25. Christian Middleton, Ricardo Baeza-Yates, A Comparison of Open Source Search Engines.
26. Pan Jiayi, Chin-Pang Jack Cheng, Utilizing Statistical Semantic Similarity Techniques for Ontology Mapping, with Applications to AEC Standard Models, ISSN 1007-0214, pp. 217-222, Volume 13, Number S1, October 2008.
27. Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Chapter 10: XML Retrieval, Cambridge University Press.
28. Sowmya Kamath S., Dhivya Piraviperumal, Garima Meena, Srijana Karkidholi, A Semantic Search Engine for Answering Domain Specific User Queries, International Conference on Communication and Signal Processing, April 3-5, 2013, India.