Performance of Document Image OCR Systems for Recognizing Video Texts on Embedded Platform

Tanushyam Chattopadhyay, Priyanka Sinha, Provat Biswas
Innovation Labs, Tata Consultancy Services Limited, Kolkata, India
[email protected], [email protected], [email protected]

Abstract—Market demand for an embedded realization of video OCR motivated the authors to evaluate the performance of existing document image OCR techniques for this task. To that end, the authors ported the open source OCR systems GOCR and Tesseract to an embedded platform. Their performance on the embedded platform shows that the character level and word level recognition accuracy is unacceptable for video text. This paper compares these two open source OCR systems on Indian TV videos and proposes techniques that improve the recognition accuracy from 62% to 93%. The challenges of porting the code to an embedded platform are also analyzed.

Figure 1. News Video and Active Page

Figure 2. Guide Page and Recipe Video

Keywords-OCR; video; Tesseract; ABBYY FineReader; GOCR

I. INTRODUCTION

In present-day television and video, a lot of text is embedded to augment the images. For example, sports events usually come with scores and player information overlaid as text on top of the video. Business news usually features a ticker at the bottom displaying stock information as text. Advertisements feature product details and telephone numbers over the video. We have identified five such categories of videos in which textual content is very high:

• News videos: News videos contain a lot of textual information such as breaking news, ticker text and stock tickers. Possible applications include duplicate news story detection, a personalized stock market ticker, and a personalized mash-up of internet news with TV news. One example of a news video is shown in Figure 1.

• Active pages and guide pages of DTH services: Active pages usually contain an alphanumeric string that needs to be sent to a particular subscriber number to avail a service such as downloading a wallpaper, a ring tone, or any other service. As existing DTH services lack interactivity, the user has to manually type the content and send it to the specific recipient number as a short message service (SMS) message. We recognize the text from those active pages automatically so that an application can be built to send that text automatically as an SMS. Two such examples are shown in Figures 1 and 2.

• Recipe shows: Recipe shows, which are particularly popular with women in India, usually present the ingredients and the recipe as text at the end of the show. Many applications can be developed on top of this, such as automatically sending an SMS containing the recipe to an interested user. One such example is shown in Figure 2.

• Sports videos: Sports videos contain different textual information such as scores and player names. One such example is shown in Figure 3.

• Movies and music with subtitles: Music channels usually show the album name and the singer name as text during a music video, and movies carry subtitles. This textual information can be used to build different applications. One such example is shown in Figure 3.

These texts may not always be available as metadata in digital TV systems and are not extractable in analog TV. Therefore, one approach to extract such text from video is to run OCR on successive frames captured from the TV.

Figure 3. Video with Subtitle and Sports Video

Table I. List of Context Based Applications

Genre | Context | Application
News | Breaking News | Related web information
News | News text | Cross lingual information
Business | Stock Ticker | Personalized stock ticker
Music | Album and Singer name | Automatic song download
Movies | Subtitle | Cross lingual subtitle
Sports | Player name and score | Statistics of player
This extracted text provides metadata that can form contextual information for the viewer. On the other hand, English optical character recognition (OCR) is a mature technology in the field of printed document image processing. Thus there is a strong motivation to be able to OCR text from video. Some example applications are listed in Table I.

A. Existing OCR

There are a number of existing OCR solutions for converting images into text. Amongst the commercially available solutions, ABBYY FineReader has very good performance. ABBYY FineReader is a highly intelligent OCR (optical character recognition) software for text recognition that creates editable and searchable electronic files from scanned paper documents, PDFs and digital photographs. Being commercial, ABBYY FineReader is harder to integrate into other applications, which is a disadvantage when using it for OCR on video frames.

Amongst open source variants, GOCR and Tesseract are well known. GOCR is an OCR program developed under the GNU Public License. It converts scanned images of text back into text files. It can be used with different front-ends, which makes it easier to port to different operating systems and architectures. It can open many different image formats and is able to recognize and translate barcodes as well. Another advantage of GOCR is that it requires neither training nor a large font base. However, characters need to be well separated for good accuracy, and GOCR does not support document layouts, does not use a dictionary for recognizing text, and does not support italic fonts. In a video frame, touching characters are very common, and the background is very complex in some cases.
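GOCR is typically driven as a command-line tool. A minimal sketch of invoking it from C++ on a frame that has already been saved as a PNM/PGM file is shown below; the wrapper function and the file path are illustrative, and it assumes a POSIX system with the gocr binary available on the PATH rather than any code from our system.

// Minimal sketch: run the gocr command-line tool on a saved frame and
// capture the recognized text from its standard output. Assumes POSIX
// popen() and a gocr binary on the PATH; the file name is illustrative.
#include <cstdio>
#include <string>

std::string RunGocr(const std::string& pnmPath) {
    std::string cmd = "gocr " + pnmPath;   // gocr prints recognized text to stdout
    std::string text;
    FILE* pipe = popen(cmd.c_str(), "r");
    if (!pipe) return text;
    char buf[256];
    while (fgets(buf, sizeof(buf), pipe) != NULL)
        text += buf;
    pclose(pipe);
    return text;
}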

Table II. Comparison of OCR Software

OCR Software | Memory Required | Advantages | Disadvantages
ABBYY FineReader | 512MB | Full featured, good performance, understands layout | Commercial, harder to integrate
GOCR | 17MB | No training or large font base required, can open many different formats | Does not support italic fonts, does not understand layout, characters need to be well separated
Tesseract | | Good accuracy, easily integrable | Does not understand layout, not thread-safe

Tesseract is a free software optical character recognition engine for various operating systems, and is considered one of the most accurate free software OCR engines currently available. A lot of the code was written in C, and then some more was written in C++; since then all of the code has been converted to at least compile with a C++ compiler. Currently Tesseract builds under Linux with GCC 2.95 or later and under Windows with Visual C++ 6. The C++ code makes heavy use of a list system implemented with macros. This predates the C++ Standard Template Library and may be more efficient than Standard Template Library lists, but is reportedly harder to debug in the event of a segmentation fault. Another side-effect of the C/C++ split is that the C++ data structures are converted to C data structures in order to call the low-level C code. The migration to C++ is a step towards eliminating this conversion, though it is not yet complete. Tesseract is a raw OCR engine, with no document layout analysis and no output formatting. It does not come with a GUI and is instead run from the command-line interface; there are several separate projects that provide a GUI for it.

From the above description, we can see that OCR software exists for converting images to text, but text extraction from video is an open area of research. In the following sections, we describe how we adapted an image-to-text OCR system, Tesseract, to extract text from video. Thereafter we analyse its performance on various parameters. Finally we outline our observations and conclude by suggesting improvements.

II. PERFORMANCE OF DIFFERENT OCRS

In this section we discuss the performance of three OCR systems, namely ABBYY FineReader, GOCR and Tesseract, on TV videos.

A. FineReader

FineReader is commercially available software, so it was not possible for us to integrate it into our system.

So we saved some of the text-rich video frames in an image format and checked the recognition accuracy of the trial version of FineReader. A screen shot showing the performance of FineReader on these video frames is shown in Figure 4. We observed that the character level accuracy of FineReader on our video corpus was nearly 86.57%.

B. GOCR

We found that the performance of GOCR on the video frames is very poor. The main reason behind this unacceptable performance is that most of the video frames include a lot of touching characters. Moreover, the video frames are of very low resolution and thus many of the characters are broken. Unless an image enhancement algorithm is applied before GOCR, the character level accuracy on our video corpus is only about 37.34%.

C. Tesseract

In order to use Tesseract to OCR text from video, it was incorporated into the Set Top Box (STB). This setup is fed video frames periodically in the form of YUV images, which Tesseract uses to OCR text and feed into other applications. Tesseract, as provided by Google, takes a TIFF image as input and outputs text. In our case, Tesseract runs in a thread that takes a frame from the currently playing video buffer as a YUV image, recognizes the text in it, and returns the output text to JavaScript plugins (a minimal sketch of this per-frame call appears at the end of this subsection). The Tesseract source consists of a main program, a word level recognizer, a module that organizes (orders) text into lines and words, a low-level character classifier and a dictionary that is used by the classifier.

We have been running the modified version of Tesseract on an embedded platform with limited resources, a low-end TI DaVinci platform. This limits the processing power and memory available. By profiling Tesseract, we found that the dictionary functions and classification take the most time. The user-words and frequent-words dictionaries are loaded for every frame of the video that is being OCRed. This increases both the memory and the processing power requirements, and introduces delay in the processing of subsequent frames. Removing the dictionary or limiting it would therefore reduce the load.

Another observation is that the greater the number of font files used for training, the larger the number of features generated for classification, which increases the demand on processing power. Hence, reducing the number of font files reduces the running time drastically. From the experiment whose results are shown in Table III, we can see that if the detected text is expected not to be italicized, detection is not improved by including italicized fonts during training. Further, if the detected text is expected not to contain any special characters, they can be removed from the training data set.
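A minimal sketch of such a per-frame call against the standard Tesseract C++ API is given below. It assumes a YUV 4:2:0 buffer whose luminance plane can be passed directly as an 8-bit grayscale image, and it shows how the dictionaries can be disabled through Tesseract's configuration variables; the wrapper function and the optional character whitelist are illustrative, not the exact code running on the STB.

// Minimal sketch: OCR the luminance plane of a YUV 4:2:0 frame with the
// Tesseract C++ API, with the dictionaries disabled. The function name and
// the whitelist value are assumptions for illustration only.
#include <tesseract/baseapi.h>
#include <string>

std::string OcrYuvFrame(const unsigned char* yuv, int width, int height) {
    tesseract::TessBaseAPI api;
    // Initialise with the English traineddata; NULL lets Tesseract pick up
    // the data path from TESSDATA_PREFIX.
    if (api.Init(NULL, "eng") != 0)
        return "";

    // Skip the system and frequent-word dictionaries to cut the per-frame
    // memory and time cost on the embedded target (see discussion above).
    api.SetVariable("load_system_dawg", "F");
    api.SetVariable("load_freq_dawg", "F");
    // Optionally restrict the character set when the genre is known, e.g.
    // capitals and digits for a score ticker (assumed value, not one used
    // in our experiments):
    // api.SetVariable("tessedit_char_whitelist", "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789:-");

    // For YUV 4:2:0 the first width*height bytes are the Y (luminance)
    // plane, which already serves as an 8-bit grayscale image.
    api.SetImage(yuv, width, height, /*bytes_per_pixel=*/1, /*bytes_per_line=*/width);

    char* out = api.GetUTF8Text();
    std::string text = out ? out : "";
    delete [] out;
    api.End();
    return text;
}

In practice the same TessBaseAPI object would typically be kept alive across frames so that the language data is loaded only once, which directly addresses the per-frame dictionary loading cost noted above.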

Table III. Effect of Italicized Fonts on Tesseract Accuracy

Font combination | Character level with Italics (%) | Character level without Italics (%) | Word level with Italics (%) | Word level without Italics (%)
Times, Arial | 90.14 | 89.15 | 23.98 | 33.47

Table IV. Recognition Accuracy of Tesseract

Number of fonts trained | Character level (%) | Word level (%)
4 | 86.72 | 0.09
8 | 89.36 | 21.8
16 | 94.78 | 68.7
20 | 99.42 | 96.53

Even though some font families may not be able to detect text individually, when used along with other font families they help to increase the correctness of the detected text, as observed from Table IV. From Table V we observe that recognition accuracy rises when the font that has been trained is similar to the font that is being recognized. In our experiment, the number of training fonts was reduced to the font nearest to the text in the video and a new language data set was generated. The total processing time for 43 frames went down from 56 seconds to under 30 seconds.

III. SCOPE OF IMPROVEMENT OF EXISTING OCRS

In this section we discuss the performance of the proposed system by comparing the accuracy of the available OCR engines (GOCR and Tesseract) before and after applying the image enhancement techniques. Some example images used in the experiment are shown in Figure 7. The output of the different OCR engines before and after applying the proposed image processing algorithms is shown in Figure 6. From Figure 5 it is evident that considerable improvement is obtained in character recognition after restricting the ROI and applying preprocessing and touching character segmentation before providing the final image to the OCR engine; a simple sketch of such a touching character segmentation step is given after Table V.

Table V. Font Specific Accuracy of Tesseract, Including Both Bold and Italic Fonts

Font Family | Character level (%) | Word level (%)
Times | 88.25 | 28.86
Arial | 84.89 | 22.36
Courier | 82.62 | 0
Verdana | 83.33 | 0
Georgia | 83.33 | 0
Trebuchet | 84.88 | 9.3
Bookman | 83.33 | 0
Century | 96.51 | 79.07
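One simple way to realize the touching character segmentation mentioned above is a vertical projection profile heuristic: a connected component that is wider than a typical character is split at the column crossed by the fewest ink pixels. The sketch below illustrates the idea only; the binary image representation, the function names and the maxCharWidth parameter are assumptions, not the exact algorithm used in our system.

// Illustrative sketch: split over-wide (touching) character blobs at the
// column with the least ink, using a vertical projection profile on a
// binarized text image (0 = background, 1 = ink). maxCharWidth is assumed
// to be at least 2 columns.
#include <vector>
#include <limits>

// Column in [left, right) crossed by the fewest ink pixels.
int BestCutColumn(const std::vector<std::vector<int> >& img, int left, int right) {
    int bestCol = left, bestInk = std::numeric_limits<int>::max();
    for (int x = left; x < right; ++x) {
        int ink = 0;
        for (size_t y = 0; y < img.size(); ++y) ink += img[y][x];
        if (ink < bestInk) { bestInk = ink; bestCol = x; }
    }
    return bestCol;
}

// Recursively split the span [left, right) until each piece is at most
// maxCharWidth columns wide; cut columns are collected in 'cuts'.
void SplitTouching(const std::vector<std::vector<int> >& img, int left, int right,
                   int maxCharWidth, std::vector<int>& cuts) {
    if (right - left <= maxCharWidth) return;
    // Keep the cut strictly inside the span so recursion always shrinks it.
    int cut = BestCutColumn(img, left + 1, right - 1);
    cuts.push_back(cut);
    SplitTouching(img, left, cut, maxCharWidth, cuts);
    SplitTouching(img, cut, right, maxCharWidth, cuts);
}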

Figure 4. Screen shot of FineReader

Table VI. Example Text: Input and Output of Different Approaches

Original text: SHILPA ON IPL: LEGAL ACTION WILL BE TAKEN IN CASE OF IRREGULARITIES
  O/P without any processing: SIIILPA DN IPL LEGAL ACTION WILL BE LAKENJI LBBSLQFJVKBEGUJJIRUIES rrrr ,
  O/P with pixel interpolation: SHILPA UN IPI; LEGAL ACTION WILL BE LMB! JILDASL QEJBB EGLLLAKLUES. —.
  O/P with super resolution: SHILPA UN IPI; LEGAL ACTION WILL BE TAKEN IN CASE UF IRREGULARITIES

Original text: SHARAD PAWAR CALLS LALIT MODI TO DELHI
  O/P without any processing: SHARAD PAWAR CALLS LALIT MIJUI I0 DELHI
  O/P with pixel interpolation: SHARAD PAWAR CALLS LALIT MUDI T0 DELHI
  O/P with super resolution: SHARAD PAWAR CALLS LALIT MUDI T0 DELHI

Original text: SHILPA SHETTY ON IPL : LET IT BE AN I-T PROBE, NOT A MEDIA PROBE
  O/P without any processing: SHILPA SIIETTY ON IPL LEI IT IIE AN I-I BKDBLH, DI,&M,ED!!LEKDBL,,m-w,-,,,
  O/P with pixel interpolation: SHILPA SHETTY UN IPI; LET IT BEAN IT R DB, EJiiJI-lLM.EDILEBQBL-
  O/P with super resolution: SHILPA SHETIY ON IPL LEI IT BE AR I-T PROBE NOT A MEOIA PROBE

Table VII. Recognition Accuracy in Different Approaches

Approach | Character level (%) | Word level (%)
Without any processing | 63.504 | 41.176
Super resolution | 92.700 | 82.353

Figure 5. Performance of different OCR engines before and after applying touching character segmentation
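Table VII reports the improvement obtained with the super resolution step. One simple way to construct such an image, sketched below for illustration, is to take the same static text region from several consecutive frames, upscale each by bilinear interpolation and average the results to suppress noise before OCR; the fixed 2x factor and the function names are assumptions, not necessarily the exact method used in our experiments.

// Illustrative sketch: multi-frame "super resolution" for a static text
// region. Each 8-bit grayscale frame (w x h) is upscaled 2x by bilinear
// interpolation and the upscaled frames are averaged. 'frames' is assumed
// to be non-empty and all frames are assumed to be aligned.
#include <vector>
#include <cstdint>

// Bilinear 2x upscale of an 8-bit grayscale image (w x h -> 2w x 2h).
static std::vector<uint8_t> Upscale2x(const std::vector<uint8_t>& src, int w, int h) {
    std::vector<uint8_t> dst(4 * w * h);
    for (int y = 0; y < 2 * h; ++y) {
        for (int x = 0; x < 2 * w; ++x) {
            float sx = x / 2.0f, sy = y / 2.0f;
            int x0 = (int)sx, y0 = (int)sy;
            int x1 = (x0 + 1 < w) ? x0 + 1 : x0;
            int y1 = (y0 + 1 < h) ? y0 + 1 : y0;
            float fx = sx - x0, fy = sy - y0;
            float top = src[y0 * w + x0] * (1 - fx) + src[y0 * w + x1] * fx;
            float bot = src[y1 * w + x0] * (1 - fx) + src[y1 * w + x1] * fx;
            dst[y * 2 * w + x] = (uint8_t)(top * (1 - fy) + bot * fy + 0.5f);
        }
    }
    return dst;
}

// Average the upscaled versions of several frames of the same text region.
std::vector<uint8_t> SuperResolve(const std::vector<std::vector<uint8_t> >& frames, int w, int h) {
    std::vector<uint32_t> acc(4 * w * h, 0);
    for (size_t f = 0; f < frames.size(); ++f) {
        std::vector<uint8_t> up = Upscale2x(frames[f], w, h);
        for (size_t i = 0; i < up.size(); ++i) acc[i] += up[i];
    }
    std::vector<uint8_t> out(4 * w * h);
    for (size_t i = 0; i < out.size(); ++i)
        out[i] = (uint8_t)(acc[i] / frames.size());
    return out;
}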

IV. CONCLUSION

In this paper we have compared the recognition accuracy of three OCR solutions on video frames. From this study we conclude that Tesseract is the most suitable for an embedded video OCR solution. The main limitation of Tesseract is that it takes a huge number of cycles when used as it is for still images, so its performance needs to be optimized before it can be deployed directly in a system. In particular, if the font family, font size and a reduced character set of the text on the video can be identified, the size of the trained data can be reduced, which in turn reduces the time taken by the classifier. Reducing the dictionary reduces both space and time requirements. We have also shown that there is a need to enhance the image quality before giving it as input to the OCR engine, and have demonstrated two such techniques: construction of a super resolution image from multiple frames and touching character segmentation.

ACKNOWLEDGMENT

We are grateful to the entire Home Infotainment Platform research team at TCS Innovation Labs Kolkata for their support in integrating Tesseract into it.

Figure 6. Output of the different OCR engines before and after applying the image processing algorithm

Figure 7. (a)-(j) represent different images under consideration.

REFERENCES

[1] R. Lienhart and W. Effelsberg, "Automatic Text Segmentation and Text Recognition for Video Indexing," ACM/Springer Multimedia Systems, vol. 8, no. 1, pp. 69-81, January 2000.

[2] R. Smith. (2011). tesseract-ocr. [Online]. Available: http://code.google.com/p/tesseract-ocr/

[3] J. Schulenburg. (2010). GOCR. [Online]. Available: http://jocr.sourceforge.net/

[4] ABBYY FineReader 10. [Online]. Available: http://finereader.abbyy.com/

[5] A. Pal, M. Prashant, A. Ghose and C. Bhaumik, "Home Infotainment Platform - A Ubiquitous Access Device for Masses," in Ubiquitous Computing and Multimedia Applications, vol. 75, G. S. Tomar, W. I. Grosky, T. Kim, S. Mohammed and S. K. Saha, Eds. Berlin Heidelberg: Springer, 2010, pp. 11-19.