ISSN:2229-6093 Sarang Pitale et al, Int. J. Comp. Tech. Appl., Vol 2 (6), 2047-2051

Information Extraction Tools for Portable Document Format

Sarang Pitale (1), Tripti Sharma (2)

(1) Research Scholar, Department of Computer Science, CSIT, Durg, C.G., INDIA, [email protected]
(2) Assistant Professor, Department of Computer Science, CSIT, Durg, C.G., INDIA, [email protected]

Abstract

Interest in the new publishing phenomenon known as the e-book has grown enormously in the last few years. There are now at least 150 companies involved in various ways in the development of e-books. Despite this involvement, the spread of e-books has not yet proved useful in the implementation of digital libraries. Using e-books in PDF format to implement a digital library requires a robust information extraction system. In this paper we survey ten tools for extracting content such as text, images, tables and fonts from e-books in PDF format. We also compare the tools on the basis of various factors.

1. Introduction

PDF is currently one of the most popular formats for publishing e-books. This popularity stems from the wrapping ability of a PDF document, i.e. it wraps or embeds many types of data (images, forms, text, fonts etc.). It also restricts direct editing of the contents, which makes the documents more secure. These features make PDF a good input format for the e-books of a digital library, but they also make information retrieval harder. Information retrieval from e-books covers the retrieval of text, images, fonts, pages and more, and because of the secure, wrapped nature of PDF, extracting these objects becomes a tedious task. In this paper we therefore survey easily available information extraction tools and compare their features.

IJCTA | NOV-DEC 2011 Available [email protected]

2. Extraction Tools

2.1. SciPlore Xtract

SciPlore Xtract [1] was used by Jöran Beel, Bela Gipp, Ammar Shaker and Nick Friedrich for extracting titles from scientific PDF documents by analyzing style information (font size). It is an open source Java program that is based on pdftohtml and runs on Windows, Linux and MacOS. The basic idea is to identify the title by the rule that it is set in the largest font in the upper third of the first page. In the first step, SciPlore Xtract converts the entire PDF to an XML file. In contrast to many other converters, SciPlore Xtract keeps all layout information regarding text size and text position. The results show that SciPlore Xtract performs better than the PDFBox and PDFtohtml tools on scientific research papers.
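The title rule described above can be sketched in a few lines of Python. This is only an illustration of the heuristic, not SciPlore Xtract's actual code: the `TextFragment` record, its field names and the sample values are assumptions standing in for whatever layout information the intermediate XML file provides.

```python
from dataclasses import dataclass


@dataclass
class TextFragment:
    """One run of text with layout information (hypothetical record)."""
    text: str
    page: int         # 1-based page number
    top: float        # distance from the top edge of the page
    font_size: float


def guess_title(fragments, page_height):
    """Return the largest-font text in the upper third of page 1, or None."""
    candidates = [
        f for f in fragments
        if f.page == 1 and f.top <= page_height / 3 and f.text.strip()
    ]
    if not candidates:
        return None
    largest = max(f.font_size for f in candidates)
    # A title may wrap over several lines, so join every fragment set in the
    # largest size, in top-to-bottom order.
    title_parts = sorted(
        (f for f in candidates if f.font_size == largest),
        key=lambda f: f.top,
    )
    return " ".join(f.text.strip() for f in title_parts)


# Made-up sample data for a 792-unit-high page.
sample = [
    TextFragment("ISSN:2229-6093", 1, 20.0, 8.0),
    TextFragment("Information Extraction Tools for", 1, 60.0, 18.0),
    TextFragment("Portable Document Format", 1, 82.0, 18.0),
    TextFragment("Abstract", 1, 200.0, 10.0),
    TextFragment("LARGE TEXT NEAR THE BOTTOM", 1, 780.0, 30.0),
]
print(guess_title(sample, page_height=792.0))
```

The large text near the bottom of the page is ignored because it lies outside the upper third, which is the point of combining the position test with the font-size test.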

2.2. PDFlib TET

PDFlib TET was used by Qingzhao Tan et al. [2] for metadata extraction and indexing for map search in web documents. PDFlib TET [3] (Text Extraction Toolkit) is a tool for extracting text, images and metadata from PDF documents. TET extracts the text contents of a PDF as Unicode strings, plus detailed glyph and font information as well as the position on the page. Raster images are extracted in common raster formats. TET also converts PDF documents to an XML-based format called TETML, which contains text and metadata as well as resource information. TET contains advanced content analysis algorithms for determining word boundaries, grouping text into columns and removing redundant text. TET supports all relevant flavours of PDF input:
• All PDF versions up to Acrobat 9
• Protected PDF documents can be opened
• Damaged PDF documents are repaired
• Text other than page content can also be extracted
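To make the idea of determining word boundaries from glyph positions concrete, the following is a minimal Python sketch of one common approach, a gap threshold between neighbouring glyphs. It is not TET's algorithm, which the source does not describe; the `Glyph` record, the `gap_ratio` parameter and the sample coordinates are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """A single positioned character on one text line (hypothetical record)."""
    char: str
    x: float      # left edge of the glyph
    width: float  # advance width of the glyph


def group_into_words(glyphs, gap_ratio=0.3):
    """Split one line of glyphs into words.

    A new word starts at an explicit space glyph, or wherever the horizontal
    gap to the previous glyph exceeds gap_ratio times that glyph's width.
    """
    words = []
    current = ""
    prev = None
    for g in sorted(glyphs, key=lambda g: g.x):
        if g.char.isspace():
            if current:
                words.append(current)
                current = ""
            prev = None
            continue
        if prev is not None:
            gap = g.x - (prev.x + prev.width)
            if gap > gap_ratio * prev.width and current:
                words.append(current)
                current = ""
        current += g.char
        prev = g
    if current:
        words.append(current)
    return words


# "PDF" and "to" drawn without any space glyph between them, only a gap.
line = [
    Glyph("P", 0.0, 6.0), Glyph("D", 6.0, 6.0), Glyph("F", 12.0, 6.0),
    Glyph("t", 24.0, 6.0), Glyph("o", 30.0, 6.0),
]
print(group_into_words(line))
```

A fixed ratio is the simplest possible choice; real documents with justified text or unusual kerning would need a more careful threshold.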

2.3. Greenstone

Greenstone [4] is open source software that provides tools for organizing information and making it available over the Internet, with a uniform interface to access all documents in a collection. Greenstone can be used to build personal digital libraries of research papers gathered from the Internet or collected from conference proceedings. The plug-in metadataPDFPlug, defined to import PDF documents, is based on the pdftohtml software suite, which in turn relies on the PDF reader Xpdf and the Ghostscript libraries. The pdftohtml tool works well for merging text lines into paragraphs with uniform formatting rules. However, this general-purpose tool does not handle technical papers in Greenstone very well. One reason is that documents are actually imported into Greenstone by the HTML plug-in, which in turn processes the output of pdftohtml. Moreover, metadataPDFPlug does not automatically extract administrative metadata. Greenstone plug-ins are written in Perl, and customizing the existing plug-ins in a suitable way turned out not to be easy.
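Merging text lines into paragraphs under uniform formatting rules can be illustrated with a short Python sketch. It is an assumption about what such a rule might look like (same font size, small vertical distance), not the logic of pdftohtml itself; the `Line` record, the `max_gap_factor` parameter and the sample text are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """One extracted text line with basic layout data (hypothetical record)."""
    text: str
    top: float        # distance from the top edge of the page
    font_size: float


def merge_lines(lines, max_gap_factor=1.5):
    """Merge consecutive lines into paragraphs.

    A line joins the previous one when both use the same font size and the
    vertical distance between them is at most max_gap_factor * font size.
    """
    paragraphs = []
    prev = None
    for line in sorted(lines, key=lambda l: l.top):
        same_block = (
            prev is not None
            and line.font_size == prev.font_size
            and line.top - prev.top <= max_gap_factor * prev.font_size
        )
        if same_block:
            paragraphs[-1] += " " + line.text.strip()
        else:
            paragraphs.append(line.text.strip())
        prev = line
    return paragraphs


page = [
    Line("A section heading", 100.0, 12.0),
    Line("First line of a paragraph", 120.0, 10.0),
    Line("that continues here.", 132.0, 10.0),
    Line("A second paragraph.", 160.0, 10.0),
]
print(merge_lines(page))
```

A single-column rule like this one also hints at why technical papers are harder: in a two-column layout, sorting by vertical position alone interleaves lines from both columns.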

2.4. JPedal

JPedal was used by Simone Marinai for extracting metadata from PDF papers [5]. JPedal is developed by IDRsolutions. It is a Java PDF development library that provides a Java PDF viewer, PDF to image conversion, PDF printing, PDF search and PDF extraction features. JPedal can be used as part of a client or server Swing or SWT application, thin client, applet, JavaFX, JSP or Web Start application. There is also a JavaME viewer. It provides a complete replacement for a PDF reader and more. JPedal is available commercially.

2.5. Solid Converter

Solid Converter [6] is commercial, multi-function GUI software. It can convert PDF to plain text, DOC and HTML while preserving the original text formatting (font and size). Its drawbacks are that it is limited to the Windows platform and that image extraction is difficult.

2.6. 3-Heights PDF Extract

3-Heights PDF Extract [7] is a tool or component for reading out the contents and properties of PDF documents quickly and efficiently. Its features are:
• Extract text by character, word or page (including invisible text)
• Search for keywords and retrieve their position
• Extract images (including alternative images)
• Retrieve form fields
• Extract document information such as version, encryption, linearization and metadata
• List fonts and colour spaces
• Extract page information and page descriptions (graphic objects, position and other attributes)
• Extract bookmarks

2.7. PDFtohtml

PDFtohtml [8] is a command-line tool for converting PDF to HTML, based on the open source viewer Xpdf. The commercial application is available only in executable format. Xpdf [9] is an open source viewer for Portable Document Format (PDF) files. The Xpdf project also includes a PDF text extractor, a PDF-to-PostScript converter, and various other utilities. Xpdf runs under the X Window System on UNIX, VMS, and OS/2, and should work on pretty much any system that runs X11 and has Unix-like (POSIX) libraries. The non-X components (pdftops, pdftotext, etc.) also run on Win32 systems and should run on pretty much any system with a decent C++ compiler. Xpdf is designed to be small and efficient. It can use Type 1, TrueType, or standard X fonts. Compiling it requires ANSI C++ and C compilers. The main problems with this tool are that image extraction is difficult and that the commercial version is distributed only as an executable.

2.8. PDFBox

PDFBox was used by Deliang Jiang et al. [10] for converting PDF to HTML for text detection, and by Emma Tonkin and Henk L. Muller [11] for extracting keywords and metadata, while R. Mishra et al. [12] used it to develop an ETD repository. Fang Yuan et al. [13] used it for extracting the title, author, address, abstract, keywords and class number of papers. It is also used to extract text from PDF files for GATE (a natural language processing tool). Apache PDFBox [14] is an open source Java PDF library for working with PDF documents. The project allows the creation of new PDF documents, the manipulation of existing documents and the extraction of content from documents. Apache PDFBox also includes several command-line utilities.

2.9. ICEpdf

ICEpdf [15] is an open source Java PDF tool that can render, convert, or extract PDF content within any Java application or on a web server. It has a graphical interface with options such as switching between pages, zooming, and selecting and copying text, and it closely mimics Adobe Reader. ICEsoft also provides a commercial version called ICEpdf Pro. ICEpdf is a pure Java PDF document rendering and viewing solution. It can parse and render documents based on the latest PDF standards [16] (Portable Document Format v1.6/Adobe Acrobat 7) with superior rendering accuracy and performance. ICEpdf is designed to support PDF document viewing within Java applications in a manner not possible with the native Acrobat Reader application. Benefits include:
• Seamless integration with Java client applications, allowing complete control over the configuration, exposed functionality and user interface.
• A lightweight static and dynamic memory footprint.
• Easy deployment to any Java platform without the hassles of Java-to-native integration issues.

ICEpdf supports:
• PDF Viewing: ICEpdf can easily be integrated into any Java client application to provide PDF document viewing and navigation in a manner not possible with the Acrobat Reader application. ICEpdf includes an embeddable PDF document viewer component for easy integration within Java client applications. ICEpdf can also be used standalone as an industrial-strength PDF viewer application.
• Multipage view support: Continuous and side-by-side view types.
• Text Selection: A multi-page text selection tool allows users to select and copy text to the system clipboard.
• Search Highlighting: Advanced contextual search results and highlighting of found words.
• PDF Content Conversion: Convert rendered PDF pages to other formats, such as images, SVG documents, etc.
• PDF Content Extraction: Extract PDF document metadata, text, and images.
• PDF Link Annotations: Developers can optionally configure ICEpdf to support interactive link annotations via a mouse. An annotation callback gives developers flexibility in which types of link annotation actions they wish to support.
• PDF Link Annotation Editing: Users can now configure the UI to support creation, editing and deletion of link annotations and their respective URI, Launch or GoTo actions.

2.10. iText

iText was used by César García-Osorio et al. [17] for developing a tool for teaching LL and LR parsing algorithms. iText [18] is a free and open source library for creating and manipulating PDF files in Java. Developers use iText to:
• Serve PDF to a browser
• Generate dynamic documents from XML files or databases
• Use PDF's many interactive features
• Add bookmarks, page numbers, watermarks, barcodes, etc.
• Split, concatenate and manipulate PDF pages
• Automate filling out PDF forms
• Add digital signatures to a PDF file

Typically, iText is used in projects that have one of the following requirements:
• The content isn't available in advance: it's calculated based on user input or real-time database information.
• The PDF files can't be produced manually due to the massive volume of content: a large number of pages or documents.
• Documents need to be created in unattended mode, in a batch process.
• The content needs to be customized or personalized; for instance, the name of the end user has to be stamped on a number of pages.

These requirements often arise in web applications, where content needs to be served dynamically to a browser. Normally this information would be served as HTML, but for some documents PDF is preferred for better printing quality, for identical representation on a variety of platforms, for security reasons, or to reduce the file size. iText supports most advanced PDF features, such as PKI-based signatures, 40-bit and 128-bit encryption, colour correction, PDF/X, colour management via ICC profiles, and barcodes.
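To give a feel for what generating a personalized PDF in an unattended process involves at the file-format level, here is a minimal Python sketch that writes a one-page PDF by hand. It does not use or imitate the iText API; the function names are hypothetical, the page size and font are arbitrary choices, and the sketch assumes the message contains only Latin-1 characters.

```python
def escape_pdf_text(text):
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def build_pdf(message):
    """Return the bytes of a one-page PDF showing `message` in Helvetica."""
    content = (
        "BT /F1 18 Tf 72 720 Td (%s) Tj ET" % escape_pdf_text(message)
    ).encode("latin-1")
    # Object bodies, numbered 1..5 in this order.
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of this object, for the xref
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"
    xref_start = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"      # mandatory free entry for object 0
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_start
    return bytes(out)


pdf_bytes = build_pdf("Prepared for A. Reader (copy 1)")
print(len(pdf_bytes), "bytes generated")
```

Writing `pdf_bytes` to a file opened in binary mode should give a document that a PDF viewer can open. A library such as iText exists precisely so that applications do not have to manage object numbers, byte offsets and string escaping themselves.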

3. Comparison

3.1. General Characteristic

Table 1 shows the comparison of the tools based on general features like type, licence, platform and language used.

Table 1. Comparison based on General Characteristic

Tool                    License                  Platform         Language
SciPlore Xtract         Open Source              Multi Platform   Java
PDFlib TET              Commercial               Multi Platform   Multi Language
Greenstone              Open Source              Multi Platform   Perl
3-Heights PDF Extract   Commercial               Multi Platform   Multi Language
Solid Converter         Commercial               Windows          -
JPedal                  Open Source              Linux/Windows    Java
PDFtohtml               Open Source              Linux/Windows    C++
PDFBox                  Open Source              Multi Platform   Java
ICEpdf                  Open Source/Commercial   Multi Platform   Java
iText                   Open Source              Multi Platform   Java
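A comparison like Table 1 is easy to query once it is held as data. The Python sketch below transcribes the ten rows of Table 1 and filters them by licence, platform and language; the `Tool` record and the `select_tools` function are hypothetical helpers, and matching is a simple case-insensitive substring test, so that "open source" also matches "Open Source/Commercial".

```python
from typing import NamedTuple, Optional


class Tool(NamedTuple):
    name: str
    license: str
    platform: str
    language: str


# Rows transcribed from Table 1; "-" means no language is given.
TOOLS = [
    Tool("SciPlore Xtract", "Open Source", "Multi Platform", "Java"),
    Tool("PDFlib TET", "Commercial", "Multi Platform", "Multi Language"),
    Tool("Greenstone", "Open Source", "Multi Platform", "Perl"),
    Tool("3-Heights PDF Extract", "Commercial", "Multi Platform", "Multi Language"),
    Tool("Solid Converter", "Commercial", "Windows", "-"),
    Tool("JPedal", "Open Source", "Linux/Windows", "Java"),
    Tool("PDFtohtml", "Open Source", "Linux/Windows", "C++"),
    Tool("PDFBox", "Open Source", "Multi Platform", "Java"),
    Tool("ICEpdf", "Open Source/Commercial", "Multi Platform", "Java"),
    Tool("iText", "Open Source", "Multi Platform", "Java"),
]


def _matches(value: str, wanted: Optional[str]) -> bool:
    """True when no filter is given or the filter occurs in the value."""
    return wanted is None or wanted.lower() in value.lower()


def select_tools(tools, license=None, platform=None, language=None):
    """Return the names of the tools that satisfy every given filter."""
    return [
        t.name for t in tools
        if _matches(t.license, license)
        and _matches(t.platform, platform)
        and _matches(t.language, language)
    ]


print(select_tools(TOOLS, license="open source",
                   platform="multi platform", language="java"))
```

Substring matching keeps the sketch short but is deliberately loose; a stricter version would compare against a fixed set of licence and platform values.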

3.2. Working Characteristic

Table 2 shows the comparison of the tools based on working features like image extraction capability, text extraction capability, font extraction capability and available documentation.

Table 2. Comparison based on Working Characteristic
Columns: Extraction Tool; Image; Text; Font; Documentation available.
Rows: SciPlore Xtract, PDFlib TET, Greenstone, 3-Heights PDF Extract, Solid Converter, JPedal, PDFtohtml, PDFBox, ICEpdf, iText.
[Per-tool check marks not recoverable.]


4. Results and discussions

For the implementation of a digital library, the selection of the extraction tool is a crucial step. This survey presents the capabilities and features of ten easily available PDF extraction tools. A digital library framework can be implemented in any language; the extraction tool is one part of the overall framework. The tables show that only five of the tools are suitable for the extraction work:
• PDFlib TET
• 3-Heights PDF Extract
• JPedal
• ICEpdf
• iText

The others have problems related to their extraction features or their documentation. All five listed tools have all the capabilities needed. The first four, however, are commercial tools, so iText stands out as a tool with all the desired functionality that is free of cost.

References

[1] Jöran Beel, Bela Gipp, Ammar Shaker and Nick Friedrich, "SciPlore Xtract: Extracting Titles from Scientific PDF Documents by Analyzing Style Information (Font Size)", Proceedings of the 14th European Conference on Digital Libraries (ECDL'10), volume 6273 of Lecture Notes of Computer Science (LNCS), September 2010, pp. 413-416, Glasgow (UK), Springer.
[2] Qingzhao Tan, Prasenjit Mitra and C. Lee Giles, "Metadata Extraction and Indexing for Map Search in Web Documents", Proceeding of the 17th ACM conference on Information and knowledge management, ACM New York, NY, USA, 2008, pp. 1367-1368.
[3] http://www.pdflib.com/products/tet/
[4] http://www.greenstone.org/
[5] Simone Marinai, "Metadata Extraction from PDF Papers for Digital Library Ingest", 10th International Conference on Document Analysis and Recognition.
[6] http://www.soliddocuments.com/
[7] http://www.pdf-tools.com/
[8] http://pdftohtml.sourceforge.net/
[9] http://www.foolabs.com/xpdf/
[10] Deliang Jiang and Xiaohu Yang, "Converting PDF to HTML approach based on Text Detection", Proceedings of the 2nd International Conference on Interaction Sciences: Information Technology, Culture and Human, ACM New York, NY, USA, 2009, pp. 982-985.
[11] Emma Tonkin and Henk L. Muller, "Semi automated metadata extraction for preprints archives", Proceedings of the 8th ACM/IEEE-CS joint conference on Digital libraries, ACM New York, NY, USA, 2008, pp. 157-166.
[12] R. Mishra, S. K. Vijaianand, Noufal P. P. and Gaurav Shukl, "Development of ETD Repository at IITK Library using Dspace", ICSD conference Proceedings, DRTC, 2007, pp. 249-259.
[13] Fang Yuan and Bo Lu, "A new method of information extraction from PDF files", Proceedings of 2005 International Conference on Machine Learning and Cybernetics, 18-21 Aug. 2005, Guangzhou, China, pp. 1738-1742.
[14] http://pdfbox.apache.org/
[15] http://www.icepdf.org/
[16] http://www.adobe.com/enterprise/standards/
[17] César García-Osorio, Carlos Gómez-Palacios and Nicolás García-Pedrajas, "A Tool for Teaching LL and LR Parsing Algorithms", Proceedings of the 13th annual conference on Innovation and technology in computer science education, ACM New York, NY, USA, 2008, pp. 317-317.
[18] http://itextpdf.com/
