Text Area Identification in Web Images
Stavros J. Perantonis1, Basilios Gatos1, Vassilios Maragos1,3, Vangelis Karkaletsis2, and George Petasis2
1 Computational Intelligence Laboratory, Institute of Informatics and Telecommunications, National Research Center "Demokritos", 153 10 Athens, Greece {sper,bgat}@iit.demokritos.gr http://www.iit.demokritos.gr/cil
2 Software and Knowledge Engineering, Institute of Informatics and Telecommunications, National Research Center "Demokritos", 153 10 Athens, Greece {vangelis,petasis}@iit.demokritos.gr http://www.iit.demokritos.gr/skel
3 Department of Computer Science, Technological Educational Institution of Athens, 122 10 Egaleo, Greece
Abstract. With the explosive growth of the World Wide Web, millions of documents are published and accessed on-line. Statistics show that a significant part of Web text information is encoded in Web images. Web images have special characteristics that distinguish them from other types of images, and commercial OCR products therefore often fail to recognize the text in them. This paper proposes a novel Web image processing algorithm that aims to locate text areas and prepare them for the OCR procedure, so that better recognition results are obtained. Our methodology for text area identification has been fully integrated with an OCR engine and with an Information Extraction system. We present quantitative results for the performance of the OCR engine as well as qualitative results concerning its effects on the Information Extraction system. Experimental results obtained from a large corpus of Web images demonstrate the efficiency of our methodology.
G.A. Vouros and T. Panayiotopoulos (Eds.): SETN 2004, LNAI 3025, pp. 82–92, 2004. © Springer-Verlag Berlin Heidelberg 2004
1 Introduction
With the explosive growth of the World Wide Web, millions of documents are published and accessed on-line. The World Wide Web contains a large amount of information, but even modern search engines index only a fraction of it. This issue poses new challenges for Web Document Analysis and Web Content Extraction. While there has been active research on Web Content Extraction using text-based techniques, documents often include multimedia content. It has been reported [1][2] that, of the total number of words visible on a Web page, 17% are in image form, and those words are usually the most semantically important. Unfortunately, commercial OCR engines often fail to recognize Web images because of their key characteristics. Web images are usually of low resolution, consist mainly of graphic objects, are usually noiseless and have the anti-aliasing property (see Fig. 1). Anti-aliasing smooths out the discretization of an image by padding pixels with intermediate colors.
Several approaches in the literature deal with text location in color images. In [3], characters are assumed to be of almost uniform colour. In [4], foreground and background segmentation is achieved by grouping colours into clusters. A resolution enhancement to facilitate text segmentation is proposed in [5]. In [6], texture information is combined with a neural classifier. Recent work on locating text in Web images is based on merging pixels of similar colour into components and selecting text components by using a fuzzy inference mechanism [7]. Another approach is based on information about the way humans perceive colour differences and uses different colour spaces in order to approximate human colour perception [8]. Finally, approaches [9][10] restrict their operations to the RGB colour space and assume text areas of uniform colour.
Fig. 1. A Web image example (a) and a zoomed-in view of it (b), demonstrating the key characteristics of Web images.
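To make the anti-aliasing property concrete, the following minimal Python sketch renders a dark shape on a white background by supersampling and averaging. The shape, the function name `render_antialiased`, and the `factor` parameter are hypothetical choices made only for this illustration; the point is that pixels crossed by the shape boundary receive intermediate gray levels rather than pure black or white.

```python
def render_antialiased(width, height, factor=4):
    """Render a dark shape on white using box-filter anti-aliasing.

    The shape is sampled on a grid `factor` times finer than the output,
    and each factor x factor block is averaged into one output pixel.
    """
    def inside(x, y):
        # Hypothetical shape: the region strictly below the diagonal y = x.
        return y > x

    image = []
    for py in range(height):
        row = []
        for px in range(width):
            covered = 0
            for sy in range(factor):
                for sx in range(factor):
                    # Sample at the centre of each sub-pixel cell.
                    x = px + (sx + 0.5) / factor
                    y = py + (sy + 0.5) / factor
                    if inside(x, y):
                        covered += 1
            coverage = covered / (factor * factor)
            # 255 = white background, 0 = black shape.
            row.append(round(255 * (1 - coverage)))
        image.append(row)
    return image


img = render_antialiased(4, 4)
for row in img:
    print(row)
```

Pixels entirely outside the shape stay white, pixels entirely inside become black, and the boundary pixels along the diagonal take an in-between value. A binarization step tuned for scanned documents has to decide what to do with exactly these in-between pixels.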
In this paper, we pursue two objectives: (a) the development of new technologies for extracting text from Web images for Information Extraction purposes and (b) the creation of an evaluation platform to measure the performance of the newly introduced technologies. Recently, some of the authors proposed a novel method for text area identification in Web images [11]. The method was developed in the framework of the EC-funded R&D project CROSSMARC, which aims to develop technology for extracting information from domain-specific Web pages.
Our approach is based on the transitions of brightness as perceived by the human eye. An image segment is classified as text by the human eye if characters are clearly distinguished from the background. This means that the brightness transition from the text body to the background exceeds a certain threshold. Additionally, the area of all characters observed by the human eye does not exceed a certain value, since text bodies are of restricted thickness. These characteristics of human eye perception are embodied in our approach. Accordingly, the Web color image is converted to gray scale in order to record the transitions of brightness perceived by the human eye. Then, an edge extraction technique facilitates the extraction of all objects as well as of all inverted objects. A conditional dilation technique helps to choose text and inverted text objects among all objects. The criterion is the thickness of the objects, which in the case of characters is of restricted value. Our approach is thus mainly based on the detected character edges and character thickness, which are the main human eye perception characteristics.
The evaluation platform used to assess the performance of the proposed method for text area location was based on the Segmentation Evaluation Tool v.2 of the Computational Intelligence Laboratory (NCSR “DEMOKRITOS”) [12]. We measured the performance of the proposed scheme for text area identification and found that it significantly facilitates the recognition task of the OCR engine. Our methodology for text area identification has been fully integrated with an OCR engine and with an Information Extraction system (NERC module [13]). We present quantitative results for the performance of the OCR engine as well as qualitative results concerning its effects on the Information Extraction system. Experimental results obtained from a large corpus of Web images demonstrate the efficiency of our methodology.
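As a rough illustration of the brightness-transition idea described above, the following Python sketch converts a small color image to gray scale and marks pixels that have a markedly brighter neighbour (edge image) or a markedly darker neighbour (inverted edge image). The ITU-R BT.601 luma weights, the function names, and the `radius` neighbourhood parameter are assumptions made for this sketch; it is not the authors' implementation.

```python
def to_gray(rgb_image):
    """Convert rows of (R, G, B) tuples to brightness values in 0..255.

    Assumption: ITU-R BT.601 luma weights; the text does not state which
    gray scale conversion is used.
    """
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
            for row in rgb_image]


def edge_images(gray, threshold, radius=1):
    """Return (edge, inverted_edge) binary images.

    edge[y][x] is 1 when some pixel within `radius` is brighter than
    (x, y) by more than `threshold`; inverted_edge[y][x] is 1 when some
    pixel within `radius` is darker by more than `threshold`.
    `radius` is a hypothetical parameter introduced for this sketch.
    """
    height, width = len(gray), len(gray[0])
    edge = [[0] * width for _ in range(height)]
    inverted = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            for n in range(max(0, y - radius), min(height, y + radius + 1)):
                for m in range(max(0, x - radius), min(width, x + radius + 1)):
                    diff = gray[n][m] - gray[y][x]
                    if diff > threshold:
                        edge[y][x] = 1
                    elif -diff > threshold:
                        inverted[y][x] = 1
    return edge, inverted


# Toy example: a dark vertical stroke, two pixels wide, on a light background.
light, dark = (230, 230, 230), (20, 20, 20)
image = [[light, light, dark, dark, light, light] for _ in range(3)]
gray = to_gray(image)
edge, inverted = edge_images(gray, threshold=100)
print(edge[1])      # marks the dark stroke pixels that touch the background
print(inverted[1])  # marks the light background pixels that touch the stroke
```

In this toy case the edge image covers the thin dark stroke, while the inverted edge image covers the background pixels bordering it. For light text on a dark background the roles would swap, which is why both images are kept.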
2 Text Area Location Algorithm
2.1 Edge Extraction
Consider a color Web image I. First, we convert it to the gray scale image Ig. Then, we define as e and e⁻¹ the B/W edge and inverted edge images that encapsulate the abrupt increase or decrease in image brightness: e(x, y) = 1, if ∃ (m, n) : Ig(m, n) - Ig(x, y) > D ∧ m - x