Text Area Identification in Web Images

Stavros J. Perantonis1, Basilios Gatos1, Vassilios Maragos1,3, Vangelis Karkaletsis2, and George Petasis2

1 Computational Intelligence Laboratory, Institute of Informatics and Telecommunications, National Research Center "Demokritos", 153 10 Athens, Greece
{sper,bgat}@iit.demokritos.gr
http://www.iit.demokritos.gr/cil

2 Software and Knowledge Engineering, Institute of Informatics and Telecommunications, National Research Center "Demokritos", 153 10 Athens, Greece
{vangelis,petasis}@iit.demokritos.gr
http://www.iit.demokritos.gr/skel

3 Department of Computer Science, Technological Educational Institution of Athens, 122 10 Egaleo, Greece

Abstract. With the explosive growth of the World Wide Web, millions of documents are published and accessed on-line. Statistics show that a significant part of Web text information is encoded in Web images. Because Web images have special characteristics that distinguish them from other types of images, commercial OCR products often fail to recognize the text they contain. This paper proposes a novel Web image processing algorithm that locates text areas and prepares them for the OCR procedure, with the aim of obtaining better recognition results. Our methodology for text area identification has been fully integrated with an OCR engine and with an Information Extraction system. We present quantitative results for the performance of the OCR engine as well as qualitative results concerning its effects on the Information Extraction system. Experimental results obtained from a large corpus of Web images demonstrate the efficiency of our methodology.

G.A. Vouros and T. Panayiotopoulos (Eds.): SETN 2004, LNAI 3025, pp. 82–92, 2004. © Springer-Verlag Berlin Heidelberg 2004

1 Introduction

With the explosive growth of the World Wide Web, millions of documents are published and accessed on-line. The World Wide Web contains a vast amount of information, but even modern search engines index only a fraction of it. This issue poses new challenges for Web Document Analysis and Web Content Extraction. While there has been active research on Web Content Extraction using text-based techniques, documents often include multimedia content. It has been reported [1][2] that, of the total number of words visible on a Web page, 17% are in image form, and those words are usually the most semantically important.

Unfortunately, commercial OCR engines often fail to recognize Web images because of their special characteristics. Web images are usually of low resolution, consist mainly of graphic objects, are usually noiseless and exhibit anti-aliasing (see Fig. 1). Anti-aliasing smooths out the discretization of an image by padding pixels with intermediate colors.

Several approaches in the literature deal with locating text in color images. In [3], characters are assumed to be of almost uniform colour. In [4], foreground and background segmentation is achieved by grouping colours into clusters. A resolution enhancement to facilitate text segmentation is proposed in [5]. In [6], texture information is combined with a neural classifier. Recent work on locating text in Web images is based on merging pixels of similar colour into components and selecting text components by using a fuzzy inference mechanism [7]. Another approach is based on information about the way humans perceive colour differences and uses different colour spaces in order to approximate the way humans perceive colour [8]. Finally, approaches [9][10] restrict their operations to the RGB colour space and assume text areas of uniform colour.

Fig. 1. A Web image example (a) and a zoomed-in view of it (b), demonstrating the key characteristics of Web images.

In this paper, we pursue two objectives: (a) the development of new technologies for extracting text from Web images for Information Extraction purposes and (b) the creation of an evaluation platform in order to measure the performance of all newly introduced technologies. Recently, some of the authors have proposed a novel method for text area identification in Web images [11]. The method has been developed in the framework of the EC-funded R&D project CROSSMARC, which aims to develop technology for extracting information from domain-specific Web pages.

Our approach is based on the transitions of brightness as perceived by the human eye. An image segment is classified as text by the human eye if characters are clearly distinguished from the background. This means that the brightness transition from the text body to the background exceeds a certain threshold. Additionally, the area of all characters observed by the human eye does not exceed a certain value, since text bodies are of restricted thickness. These characteristics of human eye perception are embodied in our approach. Accordingly, the Web color image is converted to gray scale in order to record the transitions of brightness perceived by the human eye. Then, an edge extraction technique facilitates the extraction of all objects as well as of all inverted objects. A conditional dilation technique helps to choose text and inverted text objects among all objects. The criterion is the thickness of the objects, which in the case of characters is of restricted value. Our approach is thus mainly based on the detected character edges and character thickness, which are the main human eye perception characteristics.

The evaluation platform used to assess the performance of the proposed method for text area location was based on the Segmentation Evaluation Tool v.2 of the Computational Intelligence Laboratory (NCSR "DEMOKRITOS") [12]. We measured the performance of the proposed scheme for text area identification and observed that it significantly facilitates the recognition task of the OCR engine. Our methodology for text area identification has been fully integrated with an OCR engine and with an Information Extraction system (NERC module [13]). We present quantitative results for the performance of the OCR engine as well as qualitative results concerning its effects on the Information Extraction system. Experimental results obtained from a large corpus of Web images demonstrate the efficiency of our methodology.
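The gray-scale conversion step mentioned above can be illustrated with a minimal sketch. The text does not state which conversion is used, so the ITU-R BT.601 luma weights below are an assumption, as are the function name `to_gray` and the nested-list image representation.

```python
# Minimal sketch: convert an RGB image to gray scale (brightness).
# ASSUMPTION: ITU-R BT.601 luma weights; the paper does not specify
# which gray-scale conversion it uses. `to_gray` is a hypothetical name.

def to_gray(rgb_image):
    """Return a 2-D list of integer brightness values in 0..255.

    rgb_image is a list of rows, each row a list of (r, g, b) tuples
    with components in 0..255.
    """
    return [
        [int(round(0.299 * r + 0.587 * g + 0.114 * b)) for (r, g, b) in row]
        for row in rgb_image
    ]


if __name__ == "__main__":
    image = [
        [(255, 255, 255), (0, 0, 0)],
        [(255, 0, 0), (0, 0, 255)],
    ]
    for row in to_gray(image):
        print(row)
```

Weighting the channels unequally reflects the fact that the eye is more sensitive to green than to red or blue, which fits the stated goal of recording brightness transitions as a human observer would perceive them.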

2 Text Area Location Algorithm

2.1 Edge Extraction

Consider a color Web image $I$. First, we convert it to the gray scale image $I_g$. Then, we define as $e$ and $e^{-1}$ the B/W edge and inverted edge images that encapsulate the abrupt increase or decrease in image brightness:

$$e(x, y) = \begin{cases} 1, & \text{if } \exists\,(m, n) : I_g(m, n) - I_g(x, y) > D \;\wedge\; m - x \ldots \end{cases}$$
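As an illustration of the edge definition above, the following minimal sketch computes both binary images for a gray-scale image stored as a nested list. The definition is cut off in this text, so several points are assumptions rather than the authors' formulation: that the truncated condition restricts (m, n) to a small window around (x, y), the hypothetical window half-width `d`, and the form of the inverted edge image (the same test with the sign of the brightness difference reversed).

```python
# Minimal sketch of the edge / inverted edge images e and e^{-1}.
# ASSUMPTIONS (not confirmed by the text, whose formula is truncated):
#   * (m, n) ranges over a (2d+1) x (2d+1) window centred on (x, y);
#   * `d` is a hypothetical window half-width;
#   * e^{-1} uses Ig(x, y) - Ig(m, n) > D, i.e. the reversed difference.

def edge_maps(gray, D, d=1):
    """Return (e, e_inv) as 2-D lists of 0/1 values.

    gray is indexed as gray[y][x]; D is the brightness threshold.
    """
    height, width = len(gray), len(gray[0])
    e = [[0] * width for _ in range(height)]
    e_inv = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Visit every neighbour (m, n) inside the clipped window.
            for n in range(max(0, y - d), min(height, y + d + 1)):
                for m in range(max(0, x - d), min(width, x + d + 1)):
                    diff = gray[n][m] - gray[y][x]
                    if diff > D:       # a neighbour is much brighter
                        e[y][x] = 1
                    if -diff > D:      # a neighbour is much darker
                        e_inv[y][x] = 1
    return e, e_inv


if __name__ == "__main__":
    row = [[0, 0, 200, 200]]           # dark-to-bright step
    e, e_inv = edge_maps(row, D=100, d=1)
    print(e)
    print(e_inv)
```

Under these assumptions, a dark pixel next to a much brighter one is marked in e, while the bright pixel on the other side of the step is marked in the inverted image, so the two maps together trace both sides of each brightness transition.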
