Text detection and restoration in natural scene images


J. Vis. Commun. Image R. 18 (2007) 504–513 www.elsevier.com/locate/jvci

Text detection and restoration in natural scene images Qixiang Ye *, Jianbin Jiao, Jun Huang, Hua Yu College of Engineering, Graduate University of Chinese Academy of Sciences, No. 19 Yu Quan Road, Shi Jing Shan District, Beijing, PR China Received 12 May 2006; accepted 10 July 2007 Available online 28 July 2007

Abstract

A new method for text detection and recognition in natural scene images is presented in this paper. In the detection process, color, texture, and OCR statistic features are combined in a coarse-to-fine framework to discriminate text from non-text patterns. In this approach, the color feature is used to group text pixels into candidate text lines, the texture feature is used to capture the "dense intensity variance" property of the text pattern, and statistic features from OCR (Optical Character Reader) results are employed to further reduce detection false alarms. After the detection process, a restoration process based on plane-to-plane homography is carried out to rectify the background plane of the text when an affine deformation is detected on a located text line; the process is independent of camera parameters. Experimental results on a large dataset demonstrate that the proposed method is effective and practical.
© 2007 Elsevier Inc. All rights reserved.

Keywords: Text detection; Text recognition; Text restoration

1. Introduction

Texts are important objects embedded in natural scenes. They often carry useful information such as traffic signals, advertisement billboards, and danger warnings [1]. Automatic text recognition enables many potential applications, such as: (1) helping a foreigner understand the contents of an information board (by translating the recognized text into his or her native language); (2) helping blind people walk freely in a street; and (3) drawing a driver's attention to traffic signs. With the growing availability of powerful portable digital devices, automatically extracting useful information from the natural scene images they capture is becoming a practical application. As a result, research on text recognition from scene images has become a hot topic in recent years [1–7].

Corresponding author. Fax: +86 10 8825 6278. E-mail address: [email protected] (Q. Ye).

1047-3203/$ - see front matter © 2007 Elsevier Inc. All rights reserved. doi:10.1016/j.jvcir.2007.07.003

Text detection is the very first step toward correct text recognition from an image. In a natural scene there are many text-like patterns, such as building windows and fences, which are easily misidentified as text. Unlike patterns such as human faces or single characters, a text pattern is not a square image "block": its size varies with the number of characters it contains (as shown in Fig. 2a). Furthermore, text appearance varies dramatically with the characters and their positions, making its structure uncertain. An intuitive comparison between text and human face patterns is shown in Fig. 2b. From this figure we can see that an "average face", obtained by summing the gray values of the images and dividing by the number of images, keeps the basic facial structure, while the "average text" contains no information. Fig. 2c shows that, when PCA (principal component analysis) is applied to texts, there are far more non-zero eigenvalues than for human faces, which implies that building a model to represent the text pattern is challenging. The success of OCR techniques has made it an easy task to read frontal text lines. However, in some cases the camera is not perpendicular to the text plane, and the text in the image plane may deform (as shown in Fig. 1b). This will dramatically affect the recognition result. Therefore, before feeding a located text into an OCR system, it is necessary to rectify text with affine deformation into frontal text, a process we call text restoration in this paper.

There is a considerable body of work on text detection in the literature, including overlay text location in both images and videos. Edge (gradient) features and edge layout are often used [1,9,10]. In videos produced by video editors, overlay text usually has marked contrast with its background, which makes it easy to locate with edge (gradient) features. In natural scenes, however, pure edge (gradient) features are not very effective for discriminating text from its background, because the text is embedded in the scene and surrounded by text-like objects, such as tree leaves and window curtains, whose gradients are as strong as those of text. In [11], Jain et al. proposed a classic method called connected color component analysis (CCA) for text location. CCA uses spatial structure analysis of color-connected components and works well on most kinds of text, such as characters on book covers, news titles, and video captions. The authors obtained a high recall rate at the cost of a high false alarm rate. In [12], Gao et al. combined edge information with color layout analysis to detect scene text; the principal idea was to rely on the "dense edge" property to locate text regions. However, their method may fail when text is surrounded by, or near, objects with dense edges. In [8,13–15], texture discrimination methods were used to distinguish text from non-text. Unfortunately, as the experiments reported in [15] show, the variable content of text gives it only a weak and irregular texture property, and it is difficult to discriminate text, especially scene text, from general textures by pure texture features. For text restoration, Chen et al. [1,16] proposed a method for affine deformation restoration using camera parameters and the vanishing points of the parallel lines of the text bounding box. Clark et al. [17] used a similar method for text line rectification from scanned documents. Both methods need precise outline rectangles of text lines, which are sensitive to shape and texture changes of the characters and thus affect the restoration result dramatically. Furthermore, these two methods rely on precise camera parameters, which limits their usefulness.


In this paper, an innovative and robust method for text detection and restoration is proposed for text recognition in natural scene images. In text location, color, texture, and OCR feedback are combined in different stages to segment the text area. We use a color quantization method to separate text from its background and then use region spatial layout analysis to locate candidate text lines. Since text is made up of vertical, horizontal, and skewed strokes, histogram features of wavelet coefficients and color variance features are extracted to capture the texture properties of text. Although these texture features can eliminate most of the non-text regions in the images, they are not effective for discriminating text from image blocks with large gradients, especially general textures. Statistic features from OCR are therefore used to further reduce false alarms after the texture-based text/non-text classification: the OCR engine, which is based on shape features of isolated characters, finally classifies each candidate as text or non-text. For each located text line, a procedure is carried out to judge whether the text plane has undergone an affine transformation. If it has, a homography operation between the image plane and the text plane is applied to rectify the distorted text line. In this procedure, we use line intersections as the corresponding points for the homography operation. Fig. 3 shows the flow chart of the proposed method.

Compared with existing approaches, the proposed algorithm has the following advantages: (1) Scale invariance: we use color information and adaptive region layout analysis to locate candidates, which makes it unnecessary to consider the scale problem. (2) "Uncertain" pattern detection by feature combination: the text pattern is "uncertain" in both length and appearance. To detect this "uncertain" pattern, a feature combination method is proposed in which color, spatial layout, texture, and statistics on the OCR result are used in the region location, candidate text line location, text/non-text classification, and text verification stages, respectively. This is a new feature combination method for text detection. (3) Affine restoration without camera parameters: by using a simple but effective homography operation between two planes, we can restore a text plane with affine deformation. No camera parameters are needed in the process.

Fig. 1. Text examples. (a) Text with complex scene and (b) with affine deformation.



Fig. 2. Comparison between text and human face patterns. (a) Size comparison, (b) the ‘‘average face’’ and ‘‘average text’’, and (c) PCA comparison.

[Fig. 3 flowchart: Scene image → Location (candidate text line location; text/non-text classification) → Restoration and recognition (homography-based rectification; OCR-based recognition)]

Fig. 3. Flow chart of the proposed method.

In the following, we first present the proposed text location algorithm (Section 2) and then the text restoration algorithm (Section 3). Following that, we present our experimental results (Section 4), and finally we draw our conclusions (Section 5).

2. Text location

In this section, we present the text location algorithm, which consists of candidate location, text/non-text classification, and OCR feedback.

2.1. Candidate text location

By assuming that either the text or its background is uniform in color, we can locate candidate text by grouping text pixels through image segmentation and region layout analysis.

In the image segmentation step, a GLVQ (generalized learning vector quantization) algorithm is adopted to group pixels of similar color into the same cluster in the LUV color space. The key issue in this process is how to decide the color quantization number NQuan. A method based on color variance analysis of the whole image is proposed to calculate NQuan. Suppose the image is divided into M windows of size w1 × w2, each containing n pixels. We define Sm as the color coarseness of window m, which represents the color variance of that window. Generally, the larger the color variance is, the larger the color number should be. We therefore use the average color coarseness of the whole image, Savg, to calculate NQuan. Sm and Savg are calculated as

$$ S_m = \left( \frac{1}{n} \sum_{i=0}^{n-1} \left\| \vec{x}_i^{(m)} - \vec{x}_{mean}^{(m)} \right\|^2 \right)^{1/2}, \qquad S_{avg} = \frac{1}{M} \sum_{m=0}^{M-1} S_m, \qquad (1) $$

where $\vec{x}_i^{(m)}$ is the color vector of pixel i in the w1 × w2 window m, $\vec{x}_{mean}^{(m)}$ is the average color vector of that window, and $\|\cdot\|$ is the Euclidean distance. In our experiments, w1 and w2 are set to 1/10–1/20 of the width and height of the image. The larger Savg is, the larger NQuan should be, and NQuan is calculated as a linear function of Savg:

$$ N_{Quan} = a \cdot S_{avg} + 1, \qquad (2) $$

where a is a coefficient set to 0.5 empirically.
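To make Eqs. (1) and (2) concrete, here is a small Python sketch (ours, not from the paper) that computes the per-window color coarseness and the resulting quantization number; the non-overlapping tiling of windows and the rounding of NQuan to an integer are assumptions.

import numpy as np

def color_quantization_number(luv_image, win_w, win_h, a=0.5):
    # Sketch of Eqs. (1)-(2): per-window color coarseness S_m, image average
    # S_avg, and the color quantization number N_Quan = a * S_avg + 1.
    # luv_image: H x W x 3 array in LUV color space (assumed layout).
    H, W, _ = luv_image.shape
    coarseness = []
    for top in range(0, H - win_h + 1, win_h):          # non-overlapping windows (assumption)
        for left in range(0, W - win_w + 1, win_w):
            win = luv_image[top:top + win_h, left:left + win_w]
            pixels = win.reshape(-1, 3).astype(np.float64)
            mean = pixels.mean(axis=0)
            # S_m: root of the mean squared Euclidean distance to the mean window color
            s_m = np.sqrt(np.mean(np.sum((pixels - mean) ** 2, axis=1)))
            coarseness.append(s_m)
    s_avg = float(np.mean(coarseness))                  # Eq. (1)
    n_quan = int(round(a * s_avg + 1))                  # Eq. (2), rounded to an integer
    return n_quan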



Once NQuan is decided, we use the GLVQ algorithm to cluster pixels into NQuan color clusters. After a region-growing operation, image pixels with the same color label and spatial connection are segmented into the same region. In Fig. 4a, all segmented regions are marked on the original image with outline rectangles.

After the image segmentation process, a spatial layout analysis procedure is applied to obtain candidate text lines. For Chinese (or Korean, Japanese) characters, the components that constitute a character have mainly two spatial relations, the up–down and the left–right relation, as shown in Fig. 5a. Within a text line, character regions are approximately aligned in the horizontal (vertical) orientation, as shown in Fig. 5b. Based on these observations, we propose a procedure that first connects color regions into words or characters and then into candidate text lines. The first procedure is:

(1) Search for unprocessed regions;
(2) If an unprocessed region Ri (the black region in Fig. 5a) is found, create a new text region;
(3) Iteratively collect unprocessed regions Rj (gray regions in Fig. 5a) that have the same color label as Ri and are "adjacent" to Ri, and merge them into Ri. The criterion for "adjacent" is that the maximum horizontal (or vertical) distance Dh (Dv) from Rj to the gravity center of Ri (as shown in Fig. 5a) is smaller than a threshold Tc, that is, Dh (Dv) < Tc. To make the procedure adaptive to texts of different font sizes, the threshold Tc is calculated from the height Hr of region Ri as Tc = 2.0Hr;
(4) If unprocessed regions with the same color label as Ri still exist, go to step (3);
(5) Otherwise, go to step (2).

Using the above procedure, most of the character components are merged into character regions; a sketch of the merging loop is given below.
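The following Python sketch illustrates the component-merging loop above. The region representation (color label, gravity center, height) and the center-to-center distance test are simplifications of the paper's "adjacent" criterion, introduced here only for illustration.

def merge_components(regions, t_scale=2.0):
    # Sketch of steps (1)-(5): regions are dicts with hypothetical fields
    # 'label' (color label), 'cx', 'cy' (gravity center) and 'h' (height).
    groups = []
    unprocessed = list(regions)
    while unprocessed:                                   # steps (1)-(2): start a new text region
        seed = unprocessed.pop(0)
        group = [seed]
        absorbed = True
        while absorbed:                                  # steps (3)-(4): absorb adjacent regions
            absorbed = False
            t_c = t_scale * seed['h']                    # T_c = 2.0 * H_r, adaptive to font size
            for r in list(unprocessed):
                same_color = (r['label'] == seed['label'])
                adjacent = (abs(r['cx'] - seed['cx']) < t_c or
                            abs(r['cy'] - seed['cy']) < t_c)
                if same_color and adjacent:
                    group.append(r)
                    unprocessed.remove(r)
                    absorbed = True
        groups.append(group)                             # step (5): continue with the next region
    return groups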

Fig. 4. Illustration of text detection process. (a) Outline rectangles of segmentation regions. (b) Spatial layout analysis result of color connected regions. (c) The classification result. (d) Final result after OCR feedback.

Fig. 5. (a) Region connection. (b) Character connection.



These character regions are then connected into text lines by a second procedure:

(1) Iteratively search for region pairs that meet the following conditions and connect them:
(a) the regions have the same color label;
(b) the regions are horizontally aligned (as shown in Fig. 5b, the alignment bias Db of the two regions should be small enough, i.e., Db < 2.0·Tw, where Tw is a threshold calculated adaptively as Tw = 2.0Hw and Hw is the height of the larger region);
(c) the regions are horizontally "adjacent" (as shown in Fig. 5b, the horizontal distance Dw of the two regions is small enough, i.e., Dw < 1.0·Tw);
(2) If no region pair meets the above conditions, the procedure exits.

After the region spatial layout analysis, candidate text lines are located. The procedure is adaptive to text font size since all of its thresholds are calculated from the sizes of the regions themselves. However, the located candidates contain many false alarms, so a supervised classification procedure is needed to identify true text.

2.2. Text and non-text classification

Text can be considered a special texture made up of positive and negative sharp signals, and the color variance in a text region is often large. Based on these observations, wavelet coefficient histogram features together with color variance features are extracted to represent the text pattern. Wavelet coefficients can effectively identify these sharp signals because the filters in the 2D wavelet transformation capture both the signal variation and its orientation. The wavelet coefficients in the LH and HL bands [18] are used to calculate a histogram (Hi, i = 0, ..., 15) over the coefficients of all pixels. The coefficients are first quantized into 16 levels:

$$ C_q = C \cdot 16 / (C_{max} - C_{min}), \qquad (3) $$

where C is the wavelet coefficient of a pixel and Cmax and Cmin are the maximum and minimum wavelet coefficients in the corresponding band.
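As an illustration of Eq. (3), the sketch below quantizes the LH/HL wavelet coefficients of a candidate block into 16 levels and forms the histograms Hi. The Haar wavelet and the shift by Cmin (so that the quantized indices fall into 0..15) are our own assumptions; the paper specifies neither.

import numpy as np
import pywt  # PyWavelets, used here only to obtain the detail bands

def wavelet_histogram_features(gray_block, bins=16):
    # Quantize the detail-band coefficients into 16 levels (a shifted variant
    # of Eq. (3)) and return the normalized histograms H_i for both bands.
    _, (cH, cV, _) = pywt.dwt2(gray_block.astype(np.float64), 'haar')
    features = []
    for band in (cH, cV):                                 # LH and HL detail bands
        c_min, c_max = band.min(), band.max()
        q = (band - c_min) * bins / (c_max - c_min + 1e-9)
        q = np.clip(q.astype(int), 0, bins - 1)
        hist = np.bincount(q.ravel(), minlength=bins) / q.size   # H_i as fractions of pixels
        features.append(hist)
    return np.concatenate(features)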

The wavelet histogram is then computed on the quantized coefficients: the value of Hi is the percentage of pixels whose quantized coefficient equals i. Compared with the histogram of a non-text area, the histogram of a text line has large bins at the front and tail of Hi in both the vertical and horizontal bands, while the bins in the middle of Hi are small [15]. This arises from the contrast between text and its background. Color variance features are calculated to capture the large color contrast between the text foreground and its background. The color variance of each of the L, U, and V components is computed as a standard deviation:

$$ V_c = \sqrt{ \frac{1}{N} \sum_{p=1}^{N} \left( I_c^p - \bar{I}_c \right)^2 }, \qquad (4) $$

where Vc is the color variance of one color component, Ic^p is the color value of pixel p in that component, Īc is its mean value over the text line, and N is the number of pixels in the text line.

As for the classifier, an SVM (support vector machine) [22] is used in this work. Compared with other classifiers such as neural networks and decision trees (C4.5), the SVM is easier to train, needs fewer training samples, and has better generalization ability. The kernel function of the SVM is chosen as an RBF (radial basis function) due to its better performance than other kernels for this task. Although text can be considered a special texture, its regularity and directionality are quite weak, so it is difficult to discriminate text from non-text by pure texture classification. For this reason, we keep a high recall rate at the cost of a high false alarm rate in this stage; the remaining false alarms are further processed by an OCR feedback procedure.

2.3. OCR-based recognition and feedback for detection

Characters in a text line are binarized by our previous text segmentation algorithm [20] and then fed into an OCR engine for recognition. Since a true text line must contain characters, it will be correctly recognized into characters, whereas a non-text candidate will be recognized into garbage. Moreover, the repetition of characters should not be very high, since high repetition often implies some other structure or texture rather than text. Using these facts, we compute the number of recognized characters (Rn) and the character repetition rate (Rr) as statistics to verify a located candidate. Rr is calculated as the number of repeated characters in the text line divided by the total number of recognized characters Rn. A candidate text line meeting either of the following conditions is discarded as non-text:

$$ R_n < \frac{1}{2} \cdot \frac{W_t}{H_t}, \qquad (5) $$

$$ R_r > T_r, \qquad (6) $$

where Wt and Ht are the width and height of the text line, and Tr is a threshold set to 0.3 empirically. After the OCR-based recognition and feedback, the final true text lines are located together with their recognized characters.
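The rules of Eqs. (5) and (6) can be expressed as a short verification function. The sketch below is ours; counting repeated characters through a set difference is an assumption about how Rr is obtained.

def verify_by_ocr(recognized_chars, width, height, t_r=0.3):
    # Discard a candidate when Eq. (5) or Eq. (6) holds.
    # recognized_chars: characters returned by the OCR engine (hypothetical interface).
    r_n = len(recognized_chars)
    if r_n == 0:
        return False                                   # nothing recognized: treat as non-text
    repeated = r_n - len(set(recognized_chars))        # repeated characters in the line (assumption)
    r_r = repeated / r_n                               # repetition rate R_r
    if r_n < 0.5 * width / height:                     # Eq. (5)
        return False
    if r_r > t_r:                                      # Eq. (6)
        return False
    return True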


3. Text restoration

Text in the image plane may deform (as in the example of Fig. 1b) when the camera's optical axis is not perpendicular to the text plane, which dramatically affects the recognition result. Traditionally, restoring a text line under perspective transformation requires camera parameters, which may not be available in some applications. To avoid using camera parameters, the correspondence between the two planes can be computed through a perspective projection matrix that relates the image coordinate system to the world coordinate system [21]; this is called a homography operation. Assuming a pinhole camera model, the image coordinate system o–xyz and the world coordinate system O–XYZ are related in homogeneous coordinates by

$$ \begin{bmatrix} x' \\ y' \\ h \\ 1 \end{bmatrix} = H \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} = \begin{bmatrix} H_{11} & H_{12} & H_{13} & H_{14} \\ H_{21} & H_{22} & H_{23} & H_{24} \\ H_{31} & H_{32} & H_{33} & H_{34} \\ H_{41} & H_{42} & H_{43} & H_{44} \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}, \qquad \begin{bmatrix} x \\ y \end{bmatrix} = \frac{1}{h} \begin{bmatrix} x' \\ y' \end{bmatrix}, \qquad (7) $$

where H performs translation, rotation, and perspective projection. From these relations we obtain

$$ X H_{11} + Y H_{12} + Z H_{13} + H_{14} - X x H_{31} - Y x H_{32} - Z x H_{33} - x H_{34} = 0, $$
$$ X H_{21} + Y H_{22} + Z H_{23} + H_{24} - X y H_{31} - Y y H_{32} - Z y H_{33} - y H_{34} = 0. $$

Z in the world coordinate system is equal to zero since the text lies on a plane. By further assuming H34 = 1, we can build the following system:

$$ \begin{bmatrix} X^1 & Y^1 & 1 & 0 & 0 & 0 & -X^1 x^1 & -Y^1 x^1 \\ 0 & 0 & 0 & X^1 & Y^1 & 1 & -X^1 y^1 & -Y^1 y^1 \\ X^2 & Y^2 & 1 & 0 & 0 & 0 & -X^2 x^2 & -Y^2 x^2 \\ 0 & 0 & 0 & X^2 & Y^2 & 1 & -X^2 y^2 & -Y^2 y^2 \\ X^3 & Y^3 & 1 & 0 & 0 & 0 & -X^3 x^3 & -Y^3 x^3 \\ 0 & 0 & 0 & X^3 & Y^3 & 1 & -X^3 y^3 & -Y^3 y^3 \\ X^4 & Y^4 & 1 & 0 & 0 & 0 & -X^4 x^4 & -Y^4 x^4 \\ 0 & 0 & 0 & X^4 & Y^4 & 1 & -X^4 y^4 & -Y^4 y^4 \end{bmatrix} \begin{bmatrix} H_{11} \\ H_{12} \\ H_{14} \\ H_{21} \\ H_{22} \\ H_{24} \\ H_{31} \\ H_{32} \end{bmatrix} = \begin{bmatrix} x^1 \\ y^1 \\ x^2 \\ y^2 \\ x^3 \\ y^3 \\ x^4 \\ y^4 \end{bmatrix}. \qquad (8) $$

From this system, given four corresponding points in the real-world plane and the image plane, the values of H11, H12, H14, H21, H22, H24, H31, and H32 can be calculated. If more than four point correspondences are detected, an over-determined system of the same form as (8) is built. Once H is calculated, given a point in the image plane, its (X, Y) value in the real world can be obtained by solving function (8) [21].
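The homography estimation of Eq. (8) amounts to solving a small linear system. The following Python sketch (ours, using NumPy) builds that system from four or more point correspondences and maps image points back to the world plane; the assembly of a 3×3 plane-to-plane matrix is a conventional rearrangement, not notation from the paper.

import numpy as np

def estimate_homography(world_pts, image_pts):
    # Build and solve the system of Eq. (8) for H11, H12, H14, H21, H22, H24,
    # H31, H32 (Z = 0, H34 = 1). Least squares handles the over-determined
    # case when more than four correspondences are given.
    A, b = [], []
    for (X, Y), (x, y) in zip(world_pts, image_pts):
        A.append([X, Y, 1, 0, 0, 0, -X * x, -Y * x]); b.append(x)
        A.append([0, 0, 0, X, Y, 1, -X * y, -Y * y]); b.append(y)
    h, *_ = np.linalg.lstsq(np.asarray(A, float), np.asarray(b, float), rcond=None)
    H11, H12, H14, H21, H22, H24, H31, H32 = h
    # Rearranged as a conventional 3x3 plane-to-plane homography (Z dropped, H34 = 1)
    return np.array([[H11, H12, H14],
                     [H21, H22, H24],
                     [H31, H32, 1.0]])

def image_to_world(H, x, y):
    # Map an image point back onto the world (text) plane, i.e. "solving function (8)".
    X, Y, w = np.linalg.inv(H) @ np.array([x, y, 1.0])
    return X / w, Y / w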

3.1. Deformation judgment

For a detected text line, we first judge whether it is deformed (rotated out of the image plane). Taking a horizontal text line as an example, we estimate its upper bound line and calculate the angle a between this upper bound and the outline rectangle (as shown in Fig. 6a). If a > Tangle, the text line is judged to be deformed. However, in some cases a text line is deformed while its upper (lower) bound remains horizontal (as shown in the second and third rows of Fig. 6a). For such cases, we define that if the average character width/height ratio is outside the range (0.5, 2.0), the text line needs to be restored. Experiments have shown that recognition performance drops dramatically when the width/height ratio falls outside this range. A sketch of the complete test is given below.
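This is a minimal sketch of the two-part test; the value of Tangle, the two-point slope estimate of the upper bound, and the use of per-character bounding boxes are assumptions made for illustration.

import math

def is_deformed(upper_bound_pts, char_boxes, t_angle_deg=5.0):
    # Part 1: angle between the fitted upper bound and the horizontal outline rectangle.
    (x0, y0), (x1, y1) = upper_bound_pts[0], upper_bound_pts[-1]
    alpha = abs(math.degrees(math.atan2(y1 - y0, x1 - x0)))
    if alpha > t_angle_deg:                       # a > T_angle: deformed
        return True
    # Part 2: the average character width/height ratio must stay inside (0.5, 2.0).
    ratios = [w / h for (w, h) in char_boxes if h > 0]
    if not ratios:
        return False
    avg_ratio = sum(ratios) / len(ratios)
    return not (0.5 < avg_ratio < 2.0)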

Fig. 6. Text restoration illustration. (a) Text of affine deformation, (b) deformed text lines, (c) corresponding points (black points), and (d) refined text plane.



3.2. Corresponding points for restoration

A homography operation needs at least four corresponding points. Two pairs of lines in the horizontal and vertical orientations form four intersection points for correspondence. The two pairs should approximately form a closed rectangle, which is usually the background board of the text line. We then build a model rectangle with four corresponding points. By assuming that all the text lies on a rectangular board (as shown in Fig. 6d), the height of the model rectangle is set to the larger of the width and height of the board. The width of the model rectangle is calculated by an optimization procedure: because a single Chinese (Japanese, Korean) character is "foursquare" in shape, the procedure adjusts the width of the model rectangle pixel by pixel and evaluates the restoration result, selecting the width for which the average width/height ratio of all the characters in the restored image is nearest to 1.0 (see the sketch below).

It is worth pointing out that, in our experiments, line detection is performed manually; our focus is to validate the feasibility of the proposed method, and robust line detection and feature point correspondence are not the focus of this paper. An existing automatic line detection algorithm can be integrated into the program in the future.
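The width search can be written as a simple one-dimensional optimization. In the sketch below, restore_and_measure is a hypothetical callback that warps the text line with a model rectangle of the given width and returns the average character width/height ratio of the restored result.

def choose_model_width(width_range, restore_and_measure):
    # Pixel-by-pixel search for the model-rectangle width whose restored
    # characters have an average width/height ratio closest to 1.0.
    best_width, best_err = None, float('inf')
    for w in range(width_range[0], width_range[1] + 1):
        err = abs(restore_and_measure(w) - 1.0)
        if err < best_err:
            best_width, best_err = w, err
    return best_width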

4. Experimental results

We collected a dataset of 1500 images captured from natural scenes for our experiments. The image sizes range from 640 × 480 to 1024 × 768 pixels. The test set covers a variety of situations, such as text of different font sizes and colors, light text on dark backgrounds, text with textured backgrounds, and text of poor quality. Different illumination conditions, background materials, and cameras are also considered, as shown in Table 1. We believe that our dataset represents most of the text in the real world; readers can obtain the dataset from the authors. Eight hundred images are selected for building the training set and 500 images are used for performance evaluation; 65 of the test images contain deformed text lines.

Table 1
Text obtained under various situations

              Illumination   Background material         Cameras
Condition 1   Daytime        Cloth                       Sony professional camera
Condition 2   Night          Board                       HP camera
Condition 3   Biased light   Electronic bulletin board   Canon camera

Fig. 7. Marked text and negatives.

Some marked examples are shown in Fig. 7. In this figure, the first column shows the original images, the second column the marked text line regions (white rectangular areas), and the third column the marked negative regions. The proposed method showed robust performance on a majority of the test images. In Fig. 8 we illustrate some examples of the detected text lines. It can be seen from the results that most of the text is well detected despite different font sizes and complex backgrounds. The results also show that even in a cluttered background the proposed method performs well on most of the scene text. Fig. 8d is an example of a restored text line; after the restoration procedure, all of the located characters are correctly recognized. Fig. 8h is an example of a vertical text line. In this paper, we locate vertical text lines by rotating the original image by 90° and then applying the horizontal text location procedure described previously. Some text lines with small characters are missed because the character components are too small to form text regions. Fig. 8e shows an example of a missed text line whose color is similar to that of the background; it is missed because the color quantization procedure cannot correctly separate the text from the background. In real applications this condition can usually be ignored, since it is rare for text to have a color similar to its background. Fig. 8f shows a false alarm: some tree branches appear quite like Chinese characters, so they pass both the supervised classification and the OCR verification. Such false alarms could be eliminated by analyzing the linguistic content of the recognized text line in future work; the recognized characters can also be translated into other languages using existing language analysis software. We use existing OCR software for the recognition procedure. The software recognizes binarized images of Chinese characters; for printed Chinese characters, its reported recognition rate is about 95.2%. Given the marked ground truth and the detection result of the algorithm, we calculate the recall and false alarm rates as



Fig. 8. Experimental results. (a) Text of low resolution, (b) text with complex background, (c) text at night, (d) text of restoration, (e) missed text, (f) false alarm, (g) text with cloth background, and (h) vertical text line.

Fig. 9. Examples of the training data. (a) The text data. (b) Non-text data.

$$ \text{Recall} = \frac{\text{Number of correctly located text lines}}{\text{Number of text lines}}, \qquad (9) $$

$$ \text{False alarm rate} = \frac{\text{Number of falsely located text lines}}{\text{Number of located text lines}}. \qquad (10) $$

The SVM was trained on a dataset consisting of 1100 text and 3000 non-text labeled samples. Fig. 9 shows some of the training examples. As stated in [19], although positive samples are easy to obtain, it is difficult to get representative negative samples since they may have various appearances. After a trained model is obtained, a "bootstrap" process is carried out to improve the performance of the classifier; that is, false alarms are added into the training set for re-training.

Table 2
Performance of coarse and fine detection

                      Recall rate (%)   False alarm rate (%)   Recognition rate (%)
SVM classification    92.5              13.4                   —
Final result (OCR)    86.2              6.7                    93.9
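The bootstrap re-training described above can be sketched as follows; the use of scikit-learn's SVC with an RBF kernel is our own choice of tooling, and nontext_pool stands for feature vectors extracted from regions known to contain no text.

import numpy as np
from sklearn.svm import SVC

def train_with_bootstrap(text_feats, nontext_feats, nontext_pool, rounds=2):
    # Train an RBF-kernel SVM, then add false alarms (pool samples wrongly
    # classified as text) to the negative set and re-train.
    X = np.vstack([text_feats, nontext_feats])
    y = np.hstack([np.ones(len(text_feats)), np.zeros(len(nontext_feats))])
    clf = SVC(kernel='rbf')
    for _ in range(rounds):
        clf.fit(X, y)
        false_alarms = nontext_pool[clf.predict(nontext_pool) == 1]
        if len(false_alarms) == 0:
            break
        X = np.vstack([X, false_alarms])
        y = np.hstack([y, np.zeros(len(false_alarms))])
    return clf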



Fig. 10. Text restoration examples.

For the SVM-based classification, a 92.5% recall rate and a 13.4% false alarm rate are obtained; the algorithm deliberately produces a higher recall rate at the cost of a higher false alarm rate in this stage. After the feedback from the OCR, an 86.2% recall rate and a 6.7% false alarm rate are obtained. This shows that the SVM classification eliminates most of the false candidates, while the OCR feedback reduces the false alarms to a lower level, as shown in Table 2.

To show the contribution of the text restoration procedure, we selected 20 images of text lines with affine deformation for experiments. Two examples of text restoration are given in Fig. 10. Table 3 shows the recognition results with and without restoration. It can be seen that the restoration procedure greatly improves the recognition of deformed text. However, the restoration algorithm fails in some conditions: for example, when the background of the text is not a rectangle (e.g., a circle or another shape), it is impossible to find enough feature points and the restoration algorithm will not work well. Nevertheless, the restoration algorithm is effective in most cases.

Table 3
Recognition performance before/after restoration

                        Recognition rate (%)
Without restoration     42.5
With restoration        78.8

We have stated that the width of the restoration model is an undetermined parameter that is found experimentally. Fig. 11 shows restoration results obtained with models of different widths. It can be seen that in the fourth image the restored text keeps a width/height ratio very near to 1.0, which shows that the proposed scheme for determining the model size is feasible.

5. Conclusions and future work

A new method for text location and restoration in natural scene images has been presented in this paper. The method avoids the scale problem by using an adaptive region layout analysis procedure. The text restoration procedure improves the recognition performance on deformed text lines and does not require camera parameters. Compared with mature pattern recognition techniques such as face detection and optical character recognition, the performance of the proposed text detection and recognition method is encouraging. Currently, we use OCR software to recognize the detected text lines; in future work, context information can be integrated to improve the recognition performance. In the restoration procedure, a feature point detection algorithm needs to be developed to make the method fully automatic. Furthermore, the restoration procedure needs to be extended to other surfaces, such as cylindrical and spherical surfaces.

Acknowledgments

The authors thank Professor Anil K. Jain for his advice on the text detection method and the anonymous reviewers for their constructive comments. This work is partly supported by the Bairen Project of the Chinese Academy of Sciences.

Fig. 11. Restoration results by models of different width.


References

[1] X. Chen, J. Yang, J. Zhang, A. Waibel, Automatic text detection and recognition in natural scene images, IEEE Transactions on Image Processing 13 (2004) 87–99.
[2] X.R. Chen, A.L. Yuille, Detecting and reading text in natural scenes, IEEE International Conference on Computer Vision and Pattern Recognition, USA (2004) 366–373.
[3] J. Zhang, X. Chen, A. Hanneman, J. Yang, A. Waibel, A robust approach for recognition of text embedded in natural scenes, International Conference on Multi-modal Interfaces (2002) 204–207.
[4] J. Gllavata, R. Ewerth, B. Freisleben, Text detection in images based on unsupervised classification of high-frequency wavelet coefficients, International Conference on Pattern Recognition (2004) 425–428.
[5] D. Karatzas, A. Antonacopoulos, Text extraction from web images based on a split-merge segmentation method using color perception, International Conference on Pattern Recognition (2004) 634–637.
[6] N. Ezaki, M. Bulacu, L. Schomaker, Text detection from natural scene images: towards a system for visually impaired persons, International Conference on Pattern Recognition (2004) 683–686.
[7] Y. Baba, A. Hirose, Proposal of the hybrid spectral gradient method to extract character/text regions from general scene images, International Conference on Image Processing (2004) 211–214.
[8] K.C. Kim, H.R. Byun, Y.J. Song, Y.W. Choi, S.Y. Chi, K.K. Kim, Y.K. Chung, Scene text extraction in natural scene images using hierarchical feature combining and verification, International Conference on Image Processing (2004) 679–682.
[9] V. Wu, R. Manmatha, E.M. Riseman, TextFinder: an automatic system to detect and recognize text in images, IEEE Transactions on PAMI 20 (1999) 1224–1229.
[10] R. Lienhart, A. Wernicke, Localizing and segmenting text in images and videos, IEEE Transactions on Circuits and Systems for Video Technology 12 (2002) 256–268.
[11] A.K. Jain, B. Yu, Automatic text location in images and video frames, Pattern Recognition 31 (1998) 2055–2076.
[12] J. Gao, J. Yang, An adaptive algorithm for text detection from natural scenes, International Conference on Computer Vision and Pattern Recognition (2001).
[13] H. Li, D. Doermann, O. Kia, Automatic text detection and tracking in digital video, IEEE Transactions on Image Processing 9 (2000) 147–156.
[14] K.I. Kim, K. Jung, J.H. Kim, Texture-based approach for text detection in images using support vector machines and continuously adaptive mean shift algorithm, IEEE Transactions on PAMI 25 (2003) 1631–1639.
[15] Q.X. Ye, Q.M. Huang, W. Gao, D.B. Zhao, Fast and robust text detection in images and video frames, Image and Vision Computing 23 (2005) 565–576.
[16] X. Chen, J. Yang, J. Zhang, A. Waibel, Automatic detection of signs with affine transformation, Applications of Computer Vision (2002) 32–36.
[17] P. Clark, M. Mirmehdi, Rectifying perspective views of text in 3D scenes using vanishing points, Pattern Recognition (2004).
[18] S. Mallat, A theory for multiresolution signal decomposition: the wavelet representation, IEEE Transactions on PAMI 11 (1989) 674–693.
[19] K. Sung, T. Poggio, Example-based learning for view-based human face detection, Mass. Inst. Technol., Cambridge, MA, A.I. Memo 1521, 1994.
[20] Q.X. Ye, W. Gao, Automatic text segmentation from complex background, International Conference on Image Processing (2004) 2905–2908.
[21] J. Mundy, A. Zisserman (Eds.), Geometric Invariance in Computer Vision, The MIT Press, Cambridge, MA, USA, 1992.
[22] V. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, 1995.