Image and Vision Computing 31 (2013) 823–840
Integrating multiple character proposals for robust scene text extraction
SeongHun Lee, Jin Hyung Kim
Computer Science, KAIST, 373-1 Guseong-dong, Yuseong-gu, Daejeon 305-701, Republic of Korea
Article info
Article history: Received 9 October 2012; Received in revised form 1 August 2013; Accepted 29 August 2013; Available online 9 September 2013
Keywords: Scene text extraction; Two-stage CRF models; Multiple image segmentations; Component; Character proposal
Abstract
Text contained in scene images provides the semantic context of the images. For that reason, robust extraction of text regions is essential for successful scene text understanding. However, separating text pixels from scene images still remains a challenging issue because of uncontrolled lighting conditions and complex backgrounds. In this paper, we propose a two-stage conditional random field (TCRF) approach to robustly extract text regions from scene images. The proposed approach models the spatial and hierarchical structures of scene text, and it finds text regions based on the scene text model. In the first stage, the system generates multiple character proposals for the given image by using multiple image segmentations and a local CRF model. In the second stage, the system selectively integrates the generated character proposals to determine proper character regions by using a holistic CRF model. Through the TCRF approach, we cast the scene text separation problem as a probabilistic labeling problem, which yields the optimal label configuration of pixels that maximizes the conditional probability of the given image. Experimental results indicate that our framework exhibits good performance on public databases.
© 2013 Elsevier B.V. All rights reserved.
1. Introduction

Scene texts such as bottle labels, street signs, and license plates are text regions that are present as integral parts of pictures (Fig. 1). The text is usually linked to the semantic context of the image, and it constitutes a relevant descriptor for content-based image indexing. Direct recognition of scene text from images captured by mobile devices could facilitate the development of a variety of new applications, such as translation, navigation, and tour guide services. Scene text understanding refers to an attempt to recognize text in camera-captured images. It consists of two parts: scene text extraction and scene text recognition. The purpose of the extraction process is to separate text regions from the natural image. The purpose of the recognition process is to determine labels for the extracted text regions. Robust extraction of text from scene images is an essential step for successful scene text recognition. A very efficient text extraction method would enable the direct use of commercial OCR engines, which are normally optimized for binarized document images. However, errors due to a poor extraction method could be propagated in the recognition process. Therefore, we focus on the extraction process in this paper. Extracting text from unconstrained images of natural scenes is difficult owing to the lack of any prior knowledge about the text regions, such as the color, font, size, orientation, or even the location of the text. In addition, scene images usually have uneven illumination, reflections
on objects, and inter-reflection between objects owing to uncontrolled lighting conditions and the presence of shadows. These conditions make colors vary drastically, so the text regions may be fragmented, or the boundaries of the text region may be faint. It is also common for outdoor images to have complex layouts in which the content and background are mixed. Shapes in the background can be similar to characters, particularly for textured objects such as windows or buildings. Such complications make extracting text from scene images a persistent challenge. A two-stage conditional random field (TCRF) approach with multiple image segmentations is proposed to overcome these challenging problems and to extract proper character regions regardless of complex backgrounds and outdoor environments. One assumption in dealing with multiple segmentations is that there exists at least one correct segmentation for each character instance. Multiple image segmentations with different partition parameters provide alternative ways of grouping pixels to form homogeneous regions in the image, so having multiple image segmentations increases the possibility of some homogeneous regions matching character regions. Although some segmentations are inaccurate, others are accurate and would prove useful for the extraction task. Hence, having multiple segmentations can provide a more robust basis for character regions than any single image segmentation. The proposed TCRF approach finds coherent groups of correctly segmented character regions within a large pool of multiple candidate regions by utilizing the properties and hierarchical structure of the scene text. Commonly used structural levels of scene text are the following: the textline (composed of characters), the characters, and the pixels. An image contains one or more textlines; they are aligned horizontally, vertically, or diagonally. These textlines have distinguishable
Fig. 1. Examples of scene text images.
texture patterns. Each textline does not vary too much in height, and it has sufficient width to contain more than a single character. Most characters (letters and numbers) are assumed to appear as single regions. Characters in the same textline exhibit common properties such as font and color. For instance, characters have no steep changes in their stroke thickness, and the insides of character regions have homogeneous colors, whereas the character boundaries show a clear distinction from the background. By satisfying these relationships of characters based on the scene text model, the proposed method can extract the most plausible configuration of character regions among all possible combinations of segmented regions. As a result, the pixels in the character regions are labeled as foreground, and the pixels in the other regions (non-text regions, i.e., noise) are labeled as background.

The rest of this paper is organized as follows. Section 2 presents related work on scene text extraction. In Section 3, we briefly introduce the proposed extraction framework. The algorithms for partitioning an image into multiple segmentations and for generating multiple character proposals are explained in Sections 4 and 5. We explain how to integrate character proposals into textlines in Section 6. The efficiency and performance of the suggested system are experimentally evaluated in Section 7. Finally, the paper is concluded in Section 8.

2. Related work

To make scene text extraction a manageable problem, previous researchers made some assumptions based on domain knowledge about scene text: one is that the general shapes of text regions are distinctive from those of background regions, and the other is that characters are homogeneous in color (or stroke thickness). The former leads to texture-based approaches, and the latter leads to region-based approaches. Texture-based approaches focus on the local distinctiveness of text. Researchers assume that the text regions form certain patterns of texture features [1–3]. These methods first try to roughly locate a region of text by analyzing texture features. An additional stage may then follow to separate the text pixels from the background within the located region. Texture features, such as DCTs, wavelet features, or edges, are commonly used for rough segmentation of regions. However, these methods only consider local information on text, so they cannot gather high-level
information such as character relationships. As a result, many complex background regions (human-made objects or leaves) that are similar in shape to characters are misclassified as character regions. Region-based approaches are inspired by the observation that pixels constituting a particular region with the same color or stroke thickness often belong to the same object. Several previous extraction algorithms [4–6,37] assumed that text pixels are homogeneous in color so that they can be separated from the background and grouped on the basis of color values. Epshtein et al. [7] identified text regions on the basis of local homogeneity of stroke thickness. This method measured the stroke width for each pixel and merged neighboring pixels having approximately similar stroke thickness into a region. Neumann and Matas [8] assumed that characters are extremal regions (ERs) in some scalar projection of the pixel values, and they used the maximally stable extremal region (MSER) algorithm to detect stable regions in an image. The separated regions (called connected components) obtained from region-based approaches are then classified in turn as foreground or background by further analyzing them using heuristic methods or random field models. However, for region-based methods, character regions cannot be segmented well without proper text information such as position and scale. Region-based methods also cannot handle varying environmental conditions, and they might produce an unwanted separation of regions under large variations in illumination and color. Moreover, it is difficult to recover from errors in the initial segmentation. From this review of the literature, we may conclude that extracting text from images of natural scenes is still a challenging problem for images with complex backgrounds and natural illumination. In particular, it is difficult to extract character regions directly with a single-pass process without any prior knowledge of the character regions. To handle these challenging issues, the scene image needs to be analyzed from various perspectives with multiple image segmentations.

3. Overview of the proposed method

The goal of the proposed method is to isolate a character region as a single continuous region in any kind of outdoor environment. To handle the complex backgrounds and natural illumination of scene images, the proposed system generates diverse character proposals based on bottom-up image processing. Multiple image segmentations with
different partition parameters provide alternative ways of grouping the pixels to form homogeneous regions in the image, so they provide different interpretations of the given image. As a result, multiple image segmentations increase the possibility of some homogeneous regions matching character regions. This approach allows label decisions regarding pixels to be postponed until the evidence across multiple segmentations has all been collected. Integrating the information from the multiple image segmentations can provide a more robust basis for text extraction than using the information from one image segmentation alone. The proposed framework then needs to find groups of correctly segmented character regions within a large pool of multiple candidate regions. We regard the text segmentation problem as a labeling problem for each region obtained from the multiple segmentation processes. A two-stage conditional random field (TCRF) approach is proposed to find coherent groups of correctly segmented character regions based on the properties and hierarchical structures of the scene text. The proposed TCRF models are graphical models that represent segmented regions as nodes and the relationships among segmented regions as edges in undirected graph structures [9]. Labels of segmented regions correspond to the random variables of the CRF model. The evidence at each node is computed to tell whether the node belongs to a character region or a background region using a discriminative classifier on the character features. The structure and strength of the edges in the graph are determined dynamically from the segmentations. The proposed TCRF approach consists of two types of CRF models: one is a local CRF model and the other is a holistic CRF model. In the first stage, each local CRF model in a single segmentation identifies disjoint regions as foreground or background by analyzing shape and geometric features of the regions. The local CRF model alleviates the computational overhead of exhaustively searching for proper character regions by pruning the most apparent non-text regions and leaving the remaining regions as character proposals. In the second stage, a holistic CRF model integrates multiple character proposals and finds the most plausible configurations of character regions among all possible combinations of character proposals. The flowchart of the proposed method is given in Fig. 2. First, the input image is segmented into multiple regions with various measures of color distance and with the two constraints on the scene text. The labels of segmented regions are inferred by the local CRF model based on the combination of the multiple character features and the relationships among the regions for each segmentation. Each segmentation process provides proposals for character regions, which have high probabilities of belonging to character regions. These proposals are grouped into several textlines. The holistic CRF model finds the most plausible configuration among all possible combinations of character proposals in the potential textline that maximally satisfies the properties of the scene text model. Finally, the proposed system provides text-only regions as the final result. A detailed description is given in the following sections.

4. Multiple image segmentations

Our text extraction methodology relies on a set of image segmentations generated by bottom-up processes. Segmentation refers to one possible way of grouping pixels to generate a set of disjoint regions.
The perfect image segmentation would generate homogeneous regions consistent with object boundaries. However, it is difficult to obtain a perfect segmentation under the variety of color gradations and outdoor environments. For better recognition afterwards, it is necessary to isolate character regions from the background without a loss of small details. To partition an image robustly into the desired regions, we propose a generalized K-means clustering algorithm that seamlessly combines color, texture, and edge information. In addition, we generate multiple image segmentations with different partition criteria on homogeneous regions to provide various interpretations for the given image [10]. Even though
Fig. 2. Flowchart (main components) of the proposed scene text extraction method: multiple image segmentations (text saliency map, edge map, and generalized K-means clustering), the multiple character proposals generator (component verifier, feature extractor, and local CRF model producing character proposals for each segmentation), and the multiple character proposals integrator (potential textline region estimator and character proposals integrator with the holistic CRF model).
a single segmentation cannot find all text regions, the set of all regions segmented by multiple segmentations could contain all text regions. The proposed clustering algorithm estimates the distributions of colors in the image, and it finds the modes of the color clusters. These clusters can be formed in different ways based on the definition of color similarity or the number of clusters. In some environments, a specific color distance metric is suitable for distinguishing text colors from background colors, but it might not be proper for other environments. As mentioned by Berkhin [11], there is no way to find an appropriate number of clusters with respect to all kinds of natural scenes. Therefore, the extraction system cannot identify the appropriate color metric and number of text colors in advance without any prior knowledge about the image. Instead, the proposed system generates multiple segmentations by varying the partition parameters of the segmentation algorithm so that it can extract character regions regardless of the outdoor environment or lighting conditions. Our aim is to produce sufficiently many segmentations of the input image to have a high chance of obtaining character regions.

4.1. Single image segmentation with a specific partition criterion

Most characters are assumed to appear as single regions in scene images, as shown in Fig. 1. For isolating text regions, we utilize three characteristics of scene text: color homogeneity, geometric alignment, and distinctiveness from the background. First, characters are assumed to have uniform colors so that the pixels in a character region can be grouped on the basis of their color values. Second, a text region, as a periodic repetition of similarly shaped objects with a specific alignment, presents some of the fundamental characteristics of texture. Even though the appearance of a single character depends on a certain font in an image, a text region has common block-based information that is distinctive from the background [12]. For instance, a text region can be characterized as a horizontal or vertical rectangular structure of
clustered sharp edges [13]. This fact motivated us to use several heterogeneous texture features from the local and global spatial distribution of the text region, which represent those characteristics of text well. Third, characters usually have sharp boundaries with the background regions because scene text is superimposed on objects for easy detection by humans. An edge map is utilized to force the groups of interior pixels and exterior pixels around the boundaries into two different clusters. The proposed K-means clustering algorithm finds K dominant colors (centroids of clusters) such that the differences between the color of a cluster centroid and the colors of the pixels belonging to the same cluster are minimized. Pixels in a character region can be grouped into a single region based on one of the K dominant colors. However, bare K-means clustering based on the color distribution often yields an inadequate segmentation result. The reason is that the text region generally occupies a relatively small portion of the image, so text colors are rarely found among the dominant colors. To overcome this problem, our image segmentation algorithm first estimates the probabilities of pixels belonging to text regions based on the combination of the texture features. This per-pixel text likelihood, called the text saliency map, is used as auxiliary information in K-means clustering to find text colors by adjusting the weights of the color frequencies. In other words, the color instances that appear in potential text regions receive high weights via the text saliency map, so they are likely to be chosen as dominant colors. In detail, a text localizer is designed to generate a text saliency map based on the combination of heterogeneous texture features: Mean Difference Features (MDFs), Standard Deviations (SDs), Histograms of Oriented Gradients (HOGs), and the Edge Local Binary Pattern (eLBP). MDFs are calculated as the weighted mean of each text block, and they represent the spatial relationship among text blocks [14]. SDs show statistics on the intensity variance of blocks in the text region, and HOGs describe the strength regularity of text contours [15]. eLBP can capture not only the texture characteristics but also the structure characteristics of the given region, which is suitable for text detection [16]. For the text localizer, the AdaBoost algorithm has been shown to be arguably the most effective method for detecting target objects in an image [17]. The selection of features and weights is learned through supervised training (offline mode) [18]. The AdaBoost algorithm with the cascade approach enables most of the image regions to be ruled out as non-text locations with a few tests. The text localizer is applied to all sub-regions of the whole image at multiple scales to capture various font sizes. Most falsely detected text locations appear inconsistently across scales, while truly detected text locations appear coherently at multiple scales. Therefore, a text saliency map is created by projecting the responses (text = 1, non-text = 0) of the classifiers at the different scales back to the original scale of the input image. In other words, a text saliency map is initialized to zero. For each pixel at each scale, its response value for a text location is added to the text saliency map at the original
image scale. Then, the responses in the text saliency map are normalized by the number of scales. Fig. 3(b) shows the text saliency map of Fig. 3(a), where confidence in text locations is represented by darkness. Separating a text region from the background is difficult when the text has low contrast with its background because of color similarity or shading. When clustering all pixels in the image, the color of the text boundary pixels can be chosen as a representative color while the true text color is dropped. This can produce unexpected results: the text region may be fused with the background region, or it may be over-segmented into small pieces (Fig. 4). To prevent the adverse effect of boundary pixels, we propose an edge map constraint, which forces text-colored pixels and background-colored pixels to be assigned to different clusters. In the scene image, the text pixel values are different from the background pixel values, so edges are formed at the boundary between text and background regions. Since the edge gives the boundaries of homogeneous color regions, the color values of a few pixels that lie along the normal to the contour are good indicators of colors that should not be assigned to the same cluster. Lists of two-color pairs in the edge map constraint are obtained from the normal vectors of edge contour pixels [19] (Fig. 3(c)). To satisfy the edge map constraint, one color is assigned to the cluster of the closest centroid, and the other is assigned to the cluster of another centroid. The proposed generalized K-means clustering algorithm finds the K most dominant colors by calculating the weighted frequencies of color values and penalizing constraint violations in the edge-constraint map. By weighting pixels in salient sections and removing the effects of boundary pixels, colors in text regions and background regions are well separated. The proposed approach can be regarded as finding a solution that minimizes the constrained vector quantization error (CVQE) [20,21]. Given a pair of colors, sp and sq, the formula for the CVQE is the following:

CVQE = (1/2) Σk=1..K [ Σsp∈Ck ωp DCol(ck, sp) + Σ(sq,sr)∈EC DCol(cy(sq), c*qr) Δ(y(sq), y(sr)) ],    (1)
where sp, sq, and sr are color instances, and y(sq) is the index of the cluster to which the data point sq belongs. C1, …, CK are the K clusters, and c1, …, cK are their centroids. DCol(sp, sq) is a function that calculates the distance between sp and sq based on the specific color distance metric Col. The distance metrics used in this paper are explained in Section 4.2. The text confidence score ωp, which is obtained from the text saliency map, changes the weights of the color frequencies of potential text regions. The weighted frequency of a color cluster is calculated by multiplying the likelihood of the text region by the count of pixels for the color
Fig. 3. Two assisting forms of information for estimating characteristics of scene text: (a) original image, (b) text saliency map, and (c) color pairs around edge map. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 4. Side effects on segmentation without considering edge map.
cluster. Because pixels in the text areas have a high confidence score and most of the background has a low confidence score, the text colors can be chosen among the dominant colors by changing the weights of the color frequencies of regions that may contain text. The edge map constraint, EC in Eq. (1), is also utilized to force text-colored pixels and background-colored pixels into different clusters. When a cluster contains both color instances of a pair in EC, one color instance is moved into the cluster whose centroid is the next closest (c*qr is the next closest centroid to either sq or sr). That is to say, when colors in EC are assigned to the same cluster cy(sq), the edge map constraint forces them to be assigned to different clusters c*qr by adding a penalty DCol(cy(sq), c*qr). Therefore,
character colors and background colors around boundaries are likely to be assigned to different clusters. The CVQE equation is solved by using the expectation–maximization (EM) algorithm. The CVQE is calculated for each possible combination of cluster assignments in every iteration, and the instances are assigned to the clusters that minimally increase the CVQE. By continuously updating the centroids of the clusters while minimizing the CVQE, the K dominant colors are obtained. For instance, as shown in Fig. 5, the pixels of the scene image in Fig. 3(a) are separated into four different clusters according to the four dominant colors (the centroids of the clusters). Adjacent pixels with the same dominant color are grouped into a homogeneous region. A homogeneous region is called a connected component or simply a component. The pixels within a component are assumed to belong to the same object. That is to say, the label of each pixel is determined by the label of the component to which it belongs.
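To make the clustering step concrete, the following is a minimal sketch (not the authors' implementation) of a saliency-weighted K-means with cannot-link pairs derived from the edge map; the function and variable names are hypothetical, and the cannot-link handling is simplified to the reassignment rule described above.

import numpy as np

def constrained_kmeans(colors, saliency, cannot_links, K, dist, iters=20):
    # colors: (N, 3) array of pixel colors; saliency: (N,) text confidence weights
    # cannot_links: list of (p, q) index pairs that should not share a cluster
    # dist: color distance function dist(a, b); K: number of dominant colors
    centroids = colors[np.argsort(-saliency)[:K]].astype(float)  # seed from salient pixels
    for _ in range(iters):
        # Assignment step: nearest centroid under the chosen color metric.
        d = np.array([[dist(c, x) for c in centroids] for x in colors])
        labels = d.argmin(axis=1)
        # Edge map (cannot-link) constraint: if both ends of a pair share a
        # cluster, move one of them to its next-closest centroid (Eq. (1) penalty).
        for p, q in cannot_links:
            if labels[p] == labels[q]:
                labels[q] = np.argsort(d[q])[1]
        # Update step: saliency-weighted centroid of each cluster.
        for k in range(K):
            mask = labels == k
            if mask.any():
                w = saliency[mask] + 1e-6
                centroids[k] = (colors[mask] * w[:, None]).sum(0) / w.sum()
    return centroids, labels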
4.2. Multiple image segmentations with various partition criteria

To robustly determine text colors regardless of large variations in illumination and color, different partition parameters impose different criteria on homogeneous regions. By applying several color distance measures and various numbers of clusters in the CVQE equation, different interpretations are generated for the given image. In other words, the proposed multiple image segmentations produce a sufficiently large set of
plausible character region candidates. As a result, this increases the chance of obtaining accurate text regions from the multiple segmentations. Therefore, combining the information of multiple segmentations is robust against a misleading single segmentation. In addition, it can handle diverse image resolutions and various natural scene complexities. The first parameter in the proposed generalized K-means clustering algorithm is the color distance metric (DCol). The metrics used in our framework are the Euclidean, HCL (hue, chroma, and luminance) [22], and NBS (National Bureau of Standards) [23] color distances. The Euclidean metric in the RGB color space measures the difference of color magnitudes, and it also captures changes in the luminance information. The HCL color metric is an angle-based similarity metric, and it emphasizes hue differences. The NBS color metric provides a perceptually uniform color distance. As opposed to the RGB color space, the color space used in the NBS color metric is considered a natural representation of color (i.e., close to the physiological perception of the human eye) [24]. If the NBS distance is less than 1.5, people can hardly distinguish the difference between two colors. We found that these different metrics play complementary roles in color segmentation (Fig. 6). Images that exhibit a strong contrast between the foreground and background are usually better segmented with the Euclidean distance. In contrast, color variation within the text region is better handled using the HCL or NBS distance metrics. In particular, hue is less likely to be affected by illumination changes than the other color components. When images are affected by uneven lighting or curved surfaces, the HCL distance performs better. For instance, bright red and dark red have different magnitudes but similar color orientations. These colors, which are perceived as slightly different by the camera, can be merged by the HCL color distance. The NBS color distance was devised to approximate human color perception. Because the colors of scene text are chosen for effective notification to humans, it is effective to use a color model close to human perception of colors. For calculating the HCL or NBS color distance, we need to convert the RGB color space to the HCL or HCV color space. Color space conversions from RGB to HCL or HCV are explained in the literature [23,24]. Given a pair of RGB colors, p = (rp,gp,bp) and q = (rq,gq,bq), as well as the
Fig. 5. Color clustering results in Fig. 3(a): four dominant colors of the clusters are shown on the bottom row, and pixels assigned to the corresponding clusters are marked in black on the top row. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 6. Clustering result based on the different color distance metrics (K is four on the top row, three on the bottom row): (a) original image, (b) clustering with Euclidean distance, and (c) clustering with HCL distance. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
corresponding converted colors in the HCL and HCV color spaces, the color distance DCol in Eq. (1) is defined as follows:

DEuclidean(p, q) = (rp − rq)² + (gp − gq)² + (bp − bq)²,
DHCL(p, q) = AL (lp − lq)² + ACH (cp² + cq² − 2 cp cq cos(hp − hq)),
DNBS(p, q) = 2 cp cq (1 − cos(2π (hp − hq) / 100)) + (cp − cq)² + 16 (vp − vq)²,    (2)
where hue (hp, hq) refers to the pure spectrum colors and corresponds to the prominent color as perceived by a human. Chroma (cp, cq) corresponds to colorfulness relative to the brightness of another color that appears white under similar viewing conditions. Luminance (lp, lq) or volume (vp, vq) refers to the amount of light in a color. Further, AL is a constant of linearization for luminance, and it is defined as 0.1. ACH is a parameter that helps reduce the distance between colors having the same hue as the target color, and it is defined as 0.2 + (hp − hq) / 2. The other parameter in the proposed generalized K-means clustering algorithm is the number of clusters, K. The shapes and sizes of the regions in the different segmentations change with the number of clusters. When a small K is used for a complex image, the text color may not be chosen as one of the dominant colors, so the text region and the background region can be merged. On the other hand, when a large K is used for a simple image, many colors similar to the text color can be chosen as dominant colors, so the text region can be over-segmented into sub-regions. Four different cluster numbers (K from 2 to 5) are applied to provide different scales of partitioned regions for the given image. As a result, coarsely segmented regions as well as finely segmented regions can be obtained from a single original image. These combinations of multiple-scale segmentations are robust to variations in the image conditions (Fig. 7).
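As a concrete illustration of Eq. (2), the three distance metrics could be computed as in the sketch below; the RGB-to-HCL and RGB-to-HCV conversions from [23,24] are assumed to be available elsewhere and are not reproduced here.

import math

def d_euclidean(p, q):
    # p, q are (r, g, b) tuples
    return sum((a - b) ** 2 for a, b in zip(p, q))

def d_hcl(p_hcl, q_hcl, a_l=0.1):
    # p_hcl, q_hcl are (hue, chroma, luminance) tuples; hue in radians
    hp, cp, lp = p_hcl
    hq, cq, lq = q_hcl
    a_ch = 0.2 + abs(hp - hq) / 2.0  # ACH as defined in the text
    return a_l * (lp - lq) ** 2 + a_ch * (cp ** 2 + cq ** 2 - 2 * cp * cq * math.cos(hp - hq))

def d_nbs(p_hcv, q_hcv):
    # p_hcv, q_hcv are (hue, chroma, value) tuples; hue on the NBS 0-100 scale
    hp, cp, vp = p_hcv
    hq, cq, vq = q_hcv
    return (2 * cp * cq * (1 - math.cos(2 * math.pi * (hp - hq) / 100.0))
            + (cp - cq) ** 2 + 16 * (vp - vq) ** 2)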
5. Generation of multiple character proposals

Image segmentation generates not only text components but also non-text components. From a large pool of components in multiple segmentations, we now need to find coherent groups of correctly segmented character regions and to prune the noise components. Because characters have common shape characteristics that are distinctive from those of background regions, the geometric shape of a single component is considered as a verification measure. In addition, characters usually have a similar font and color, so spatial relationships among neighboring components are also important factors in determining the labels of these
Fig. 7. Example of multiple image segmentations: (a) original image, (b) dominant colors according to specific partition parameters (number of clusters and color distance metric, e.g., K = 3 with Euclidean distance, K = 4 with Euclidean distance, and K = 4 with HCL distance), and (c) image segmentation by grouping pixels. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
components. Probabilistic graphical models such as the Markov random field (MRF) model [25] or the conditional random field (CRF) model [9] provide a convenient way to model the properties of individual components as well as the spatial relationships among components in an undirected graph structure. These graphical models encourage adjacent nodes to take the same label via the spatial structure. The CRF model has the capability of unifying multiple features simultaneously in a single unified model. It fuses the local features and learns the conditional distribution of the class labeling given the components in the image. We propose a local CRF model to label each component with one of two classes (foreground or background) based on the properties of text. When the image is partitioned into multiple fragments with H different segmentation parameters, there are at most H image segmentations (Fig. 8). In each segmentation, a single component can be regarded as a candidate for a character region. Component verification processes using the local CRF model are applied to each segmentation in parallel. Each local CRF model can measure the posterior of a component being foreground or background by analyzing the characteristics of character regions. As a result, if the characteristics of a component are far from the normal properties of character regions, the component is regarded as noise and removed. In the CRF model, components are represented as nodes, and relationships among components are represented as edges (Fig. 8). The edges are formed when two components are located within a certain distance of each other. The distance is heuristically defined as three times the width of the narrower of the two components. The nodes are denoted as x ∈ X, and their hidden labels are denoted as y ∈ Y. The nodes, with the connectivity structure imposed by the undirected edges between them, define the conditional distribution P(y|x) over the hidden labels y. Formulated probabilistically, the extraction problem is derived as the conditional probability of the component labels for the given segmented image. Based on the Hammersley–Clifford theorem [25], the conditional distribution of the local CRF is factorized into a product of clique potentials ϕc(xc, yc), where a clique is a fully connected subgraph of the local CRF. These potentials evaluate how many and which labels are likely to occur, given information on the components. Using the clique potentials, the conditional distribution over the hidden labels is expressed as

P(y|x) = (1/Z(x)) ∏c∈C ϕc(xc, yc),    (3)

where Z(x) = Σy ∏c∈C ϕc(xc, yc) is the partition function for normalization.
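For illustration, the graph construction described above (one node per component, an edge between two components whose distance is within three times the width of the narrower one) might be sketched as follows; the Component fields are hypothetical, and the centroid distance is used as a simple stand-in for the inter-component distance.

from dataclasses import dataclass
from itertools import combinations

@dataclass
class Component:
    cx: float      # centroid x
    cy: float      # centroid y
    width: float   # bounding-box width

def build_crf_edges(components):
    # Connect two components when their distance is within three times
    # the width of the narrower component (the heuristic in the text).
    edges = []
    for i, j in combinations(range(len(components)), 2):
        a, b = components[i], components[j]
        dist = ((a.cx - b.cx) ** 2 + (a.cy - b.cy) ** 2) ** 0.5
        if dist <= 3.0 * min(a.width, b.width):
            edges.append((i, j))
    return edges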
In this local CRF model, two clique potentials are designed on the basis of the clique configuration, namely the unary potential and the pairwise potential (the subscript u stands for unary, and p for pairwise). The unary potential at each node i, ϕu(yi, xi), measures the likelihood of variable xi taking a label yi based on the character features. Non-text components have irregular feature values that are far from those of the text components for the specific properties. In addition, the average saliency confidence value is used to reflect that a component in a potential text region has a high chance of being in the foreground. The definitions of the character features are given in Table 1, and their characteristics are explained in detail in our previous paper [26]. For instance, if the size of a component is too large or too small, it is likely a noise component and should be discarded. A long bar-shaped component can be identified as noise by the aspect ratio feature, and a component having a more complex contour shape than that of a typical character can be identified as noise by the compactness feature. These shape features and the text saliency confidence score describe the geometric and texture properties of the single component. By utilizing these character-related features, the proposed system can determine the labels of the components. Machine learning classifiers such as artificial neural networks (ANNs) learn the distributions of feature values on the training data, and they find a discriminant hyperplane that separates the two classes based on the feature vectors. As a discriminative classifier, a multilayer perceptron (MLP) is used to produce the likelihood f(yi|xi) of the component over the label variable yi given the character features xi. The classification score f(yi|xi) is normalized into the range between zero and one. As a result, the output of the classifier on the given component can be viewed directly as an approximate conditional probability for each class. The unary potential is defined as

ϕu(yi, xi) = wu f(yi|xi),    (4)
where wu is a weight that represents the importance of the different potentials for correctly identifying the hidden labels. The weights are learned from labeled training data by using a stochastic gradient descent algorithm [27]. In most images, text characters do not appear alone but together with other characters. Characters are subject to certain geometric restrictions, i.e., their heights, widths, and stroke thicknesses usually fall into specific ranges of values. The pairwise potential, ϕp(yi, yj, xij), is intended to represent the probability of a given set of two neighboring components belonging to one of three classes (both foreground, both background, or one foreground and one background) based on the geometric relationships between the components.
Fig. 8. Graphical view of components for each image segmentation: different shapes of components are generated by H different partition parameters on the top row. Each component is represented as a node, and two neighboring components are connected by edges on the bottom row.
Table 1
Character features used for calculation of unary potential.

Character feature              Definition
Normalized size                NS(Ci) = height(Ci)/height(Image), width(Ci)/width(Image)
Aspect ratio                   AR(Ci) = height(Ci)/width(Ci)
Occupancy ratio                OR(Ci) = Area(Ci)/Area(Bounding box(Ci))
Compactness                    Compactness(Ci) = Area(Ci)/Length(Contour(Ci))²
Contour roughness              CR(Ci) = |Ci − Open(Ci)|/|Ci|
Edge overlap ratio             EO(Ci) = |Edge(Ci) ∩ Contour(Ci)|/|Contour(Ci)|
Stroke thickness variation     SV(Ci) = Deviation(Stroke thickness(Ci))/Mean(Stroke thickness(Ci))
Saliency confidence            SC(Ci) = Mean(Saliency(Ci))

Table 2
Character similarity features used for calculation of pairwise potential.

Color similarity
Texture similarity (histogram of oriented gradients [15])
Geometric characteristics similarity (height, width, area, and aspect ratio)
Shape similarity (compactness, occupancy ratio, roughness, stroke thickness, and stroke variance)
Text saliency confidence similarity
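A minimal sketch of the unary potential of Eq. (4), using an MLP over Table 1-style features, is given below; the feature field names are hypothetical, scikit-learn is used purely for illustration, and the training data here are random placeholders rather than real labeled components.

import numpy as np
from sklearn.neural_network import MLPClassifier

def component_features(comp):
    # Table 1-style features for one component; dictionary keys are hypothetical.
    return np.array([
        comp["height"] / comp["image_height"],        # normalized size
        comp["height"] / comp["width"],               # aspect ratio
        comp["area"] / comp["bbox_area"],             # occupancy ratio
        comp["area"] / comp["contour_length"] ** 2,   # compactness
        comp["contour_roughness"],                    # contour roughness
        comp["edge_overlap_ratio"],                   # edge overlap ratio
        comp["stroke_std"] / comp["stroke_mean"],     # stroke thickness variation
        comp["mean_saliency"],                        # saliency confidence
    ])

# Train the discriminative classifier on labeled components
# (random placeholders stand in for real training features/labels).
X_train = np.random.rand(200, 8)
y_train = np.random.randint(0, 2, 200)
mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500)
mlp.fit(X_train, y_train)

def unary_potential(comp, w_u=1.0):
    # phi_u(yi, xi) = w_u * f(yi | xi), Eq. (4); returns scores for both labels.
    probs = mlp.predict_proba(component_features(comp).reshape(1, -1))[0]
    return {"background": w_u * probs[0], "foreground": w_u * probs[1]}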
This pairwise potential encourages the smoothness of the labels of neighboring nodes when they have similar properties in the image. Two different geometric relationships are considered: the first is a discontiguous relationship, and the second is a contiguous relationship. When two discontiguous components are located within a certain distance of each other, we evaluate the compatibility between their labels (Fig. 9). When they have similar properties, they have a high chance of sharing the same labels. The character features used for this purpose are color, texture, geometry, shape, and text saliency confidence (Table 2). Each similarity feature value is calculated as the ratio between the lesser and greater values of the corresponding character features of the two components (Eq. (5)). If the similarity values are close to one, then the components tend to be classified with the same labels. A statistical classifier is also trained to produce a classification score, f(yi, yj|xij), over the label variables yi and yj, given the character similarity features xij.

xij = min(xi, xj) / max(xi, xj)    (5)
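As an illustration of Eq. (5), the similarity features of Table 2 reduce to min/max ratios of the corresponding per-component feature values; a sketch with hypothetical feature dictionaries follows.

def similarity_features(feat_i, feat_j,
                        keys=("height", "width", "area", "aspect_ratio",
                              "compactness", "stroke_thickness", "saliency")):
    # Eq. (5): each similarity value is min/max of the two components' features,
    # so values close to 1 indicate components with similar properties.
    sims = {}
    for k in keys:
        lo, hi = sorted((feat_i[k], feat_j[k]))
        sims[k] = lo / hi if hi > 0 else 1.0
    return sims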
Conversely, when two components touch, we merge the components and evaluate the adjacent pairwise potential score by using the classifier employed for the unary potential calculation. If the merged component has a high foreground likelihood, the adjacent pairwise potential gives a high probability of both components being in the foreground. Therefore, the two components are both labeled as foreground, and they are merged into a single component after the inference process (Fig. 10). Even though some character regions are fragmented into two or three pieces by improper image segmentation, this adjacent pairwise potential function can mitigate the side effect of fragmented regions being misclassified as background. By considering the two different types of relationships between two components, the pairwise potential is defined as

ϕp(yi, yj, xij) = wp f(ymerged | xijmerged)    if xi, xj are contiguous and yi, yj are FG,
ϕp(yi, yj, xij) = wp f(yi, yj | xij)           otherwise,
where node j is a neighbor of node i, obtained from the Markov blanket in the graph. Given all the potential function values at the nodes and edges in the graph, the labels of the nodes need to be determined in order to prune the apparent non-text components. Inference techniques find the optimal configuration of labels, which maximizes the posterior probability of the hidden variables Y given the set of observations X among all possible configurations. In other words, the system finds the most probable joint class assignment y* for the components given the local CRF model. This is defined as

y* = arg maxy P(y|x).    (6)
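A compact, illustrative max-product message-passing routine for approximating the MAP assignment of Eq. (6) on a pairwise component graph is sketched below (binary labels; unary and pairwise potentials are assumed to be given as tables; this is not the authors' implementation).

import numpy as np

def loopy_map(unary, pairwise, edges, n_iters=10):
    # Approximate y* = argmax_y P(y|x) with max-product loopy belief propagation.
    # unary: {node: array of shape (2,)}; pairwise: {(i, j): array of shape (2, 2)}
    # edges: list of undirected (i, j) pairs; labels are 0 = background, 1 = foreground.
    nbrs = {n: [] for n in unary}
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    msgs = {}
    for i, j in edges:
        msgs[(i, j)] = np.ones(2)
        msgs[(j, i)] = np.ones(2)
    for _ in range(n_iters):
        for (i, j) in list(msgs):
            pot = pairwise[(i, j)] if (i, j) in pairwise else pairwise[(j, i)].T
            others = [msgs[(k, i)] for k in nbrs[i] if k != j]
            incoming = np.prod(others, axis=0) if others else np.ones(2)
            # Message from i to j: maximize over the label of i.
            m = np.max(unary[i][:, None] * pot * incoming[:, None], axis=0)
            msgs[(i, j)] = m / m.sum()
    labels = {}
    for n in unary:
        inc = [msgs[(k, n)] for k in nbrs[n]]
        belief = unary[n] * (np.prod(inc, axis=0) if inc else 1.0)
        labels[n] = int(np.argmax(belief))
    return labels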
Inferring the most probable solution of a CRF model is equivalent to minimizing an energy function. However, the energy minimization problem is NP-hard in general [28]. To solve the problem in a reasonable time, an approximation approach called loopy belief propagation (loopy BP) [29] is used in this paper. This loopy BP method is simple to implement. Lan et al. [30] and Potetz [31] showed how belief propagation can be performed efficiently in graphical models containing moderately large cliques. Through the inference algorithm, the most apparent non-text regions are pruned out, and the remaining regions are left as character proposals. Fig. 11 shows some examples of multiple image segmentations with their corresponding unary potentials and character proposals. Each component in each segmentation is marked in a random color in the second column of Fig. 11. The corresponding unary potential values on the components are marked in gray-scale values, where intensity represents the probability of belonging to the foreground label: components having high potential values for being foreground are shown in dark colors. By considering the likelihood of single regions and the relationships among neighboring regions together in the local CRF model, components having high maximum a posteriori (MAP) probabilities are shown in the fourth column of Fig. 11 (MAP probabilities on the components are marked in gray-scale values). We now need to find the true character regions from the pool of remaining character candidates, which is explained in Section 6.

6. Integration of multiple character proposals

When generating multiple proposals for text regions from the multiple segmentations, most text components have high foreground belief from the local CRF models, so they are retained as character
Fig. 9. Edges among neighboring components (marked in green): character regions are marked in blue and background regions are marked in red. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 10. Fragmented character regions (two contiguous components are shown in different colors) are merged into a single component based on adjacent pairwise potential. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
proposals. However, some non-text components might also have high foreground belief, so they are retained as character proposals as well. Faced with many character proposals obtained from multiple segmentations, we essentially need to look for the good components that are the most consistent with the scene text model. Most are true positive regions, but some may not be consistent with character boundaries. For instance, the shapes and sizes of the regions depend on the partition parameters. Some components can be over-segmented or merged with background regions because of incorrect segmentation. Even though a single segmentation cannot find all character regions, the set of all proposals could contain most character regions. Therefore, instead of choosing the components in the best single segmentation and discarding all other components in the different segmentations, we suggest integrating all the information of the character proposals obtained from multiple segmentations through a holistic CRF model. The process of multiple proposal integration is the following: textline estimation and multiple proposal validation by the holistic CRF model. We assume that characters are linearly aligned in a textline and that different textlines have different characteristics such as color, font, or size. Multiple proposals are grouped into several textlines based on the locations and sizes of the proposals. For each textline, the most plausible combination of foreground regions is selected by validating the consistency of the proposals in the textline.

6.1. Estimation of potential textlines

We need to find the most plausible configuration among all possible character proposals while interactively building character sequences. However, exhaustive enumeration of character sequences is intractable. Instead of exhaustively integrating all character proposals at once, we group the proposals into several potential textlines. A textline can be regarded as a linear sequence of characters. All components of a text string are assumed to be roughly aligned. Characters generally do not appear alone but with
other characters having similar properties, such as font, size, and color. The fonts of characters in a textline rarely change, which implies that characteristic properties such as height and stroke thickness are maintained roughly constant. By grouping proposals into textlines, we can utilize the hierarchical structure of scene text, such as the collinearity among characters in a textline. In addition, this divide-and-conquer approach can reduce the complexity of the problem space. A potential textline is constructed in the following manner. We focus on horizontal textlines (straight or slightly curved) with angles of less than 30°; nonetheless, vertical textlines can be added if needed. Based on the assumption that components with higher posteriors on the foreground are more likely to be character regions, we sort all character proposals by their foreground posterior values, which are calculated by the local CRF models. The top-ranked unprocessed character proposal is selected first, creating an initial textline. This textline is expanded to the left and right by adding all neighboring unprocessed character proposals that satisfy five constraints, which are defined to decide whether two components are neighbors of each other [5]. When the expansion is complete and more than two non-overlapping components exist within the textline, it is selected as a valid textline. All character proposals in the valid textline are marked as processed. All neighboring character proposals are connected with links during estimation of the textline. However, when there is only a single proposal in a textline, this isolated proposal is regarded as an outlier and pruned. When the textline construction is done, the next unprocessed top-ranked character proposal is selected to initialize another textline, and the expansion process is repeated until no more unprocessed character proposals are available. For instance, the letter "s" in Fig. 12(a) is selected as the initial element of the set, and its neighboring components ("A," "h," and some noise) are added to the set in Fig. 12(b). Expansion to the left and right is repeated alternately (Fig. 12(c), (d)). In Fig. 12(c), character proposals in the same row (horizontal axis) are obtained from the same image segmentation,
Fig. 11. Processing of multiple character proposals: three sets of character proposals are obtained according to three different types of partition criteria (rows labeled ECL, HCL, and NBS; columns range from the segmented regions to the joint decision on the regions).
Fig. 12. Potential textline construction (a) original image, (b) initial element and expansion, (c) proposals in textline (each row represents a specific image segmentation and each column represents the location of character proposals in the image), (d) connections among proposals, and (e) potential textlines.
on the other hand, character proposals in the same column (vertical axis) are overlapping proposals. Only one of the overlapping proposals should be selected as foreground, or all of them should be regarded as background. In Fig. 12(d), character proposals that are in different columns but connected with links are neighboring components. Finally, a minimum bounding box containing all elements of the set is selected as a textline. Other potential textlines are constructed in the same manner, and a total of five textlines are extracted, as shown in Fig. 12(e).

6.2. Integration of multiple character proposals in a textline

Given all possible combinations of the character proposals in a textline, we need to find the most plausible configuration while maximally satisfying the conditions of the scene text model. A probabilistic integration method based on a graph optimization algorithm, called the holistic CRF model, is proposed for grouping character proposals into a text string by modeling the global information of multiple characters. It can capture global consistency among characters by reflecting high-order relationships such as text alignment, where more than two characters are aligned on a straight line or a smooth curve. In addition, the proposed model can handle overlapping components from different segmentations together. This holistic CRF model is also formulated within a probabilistic framework, where the maximum conditional probability represents the ideal character proposal configuration that we aim to find. In the holistic CRF model, the character proposals are represented as nodes, and links between two adjacent character proposals located within a certain distance are represented as edges. The distance for connecting the links is heuristically defined as three times the width of the narrower of the two proposals. A CRF graph for the character proposals in the given textline is shown in Fig. 13; only a partial CRF graph is shown due to the complexity of the graph and the limited space. When two proposals are adjacent, they
are linked as neighbors (links are marked in blue). If they have similar characteristics, they have a high probability of being regarded as the same object. Conversely, the local CRF models produce many potential character regions in multiple segmentations, and some of these overlap. If two proposals overlap, they are linked as mutually exclusive (links are marked in red). Overlapping components from different hypotheses may have different class labels. These mutual exclusion constraints prevent two overlapping proposals from both having foreground labels. In order to alleviate the computational burden, obviously overlapping proposals that have identical shapes and lower foreground likelihoods are filtered out by non-maximum suppression. We retain the character proposals that have the highest likelihoods or that do not overlap significantly. For instance, two proposals among the three 'e' regions in Fig. 13 are pruned so that the graph becomes much simpler. In the holistic CRF model, the unary potential is calculated based on the foreground likelihood obtained from the local CRF model. The pairwise potentials, which represent the relationship between two proposals, are defined differently according to the two types of edges: one for the mutually exclusive relationship between overlapping proposals, and the other for the neighboring relationship between adjacent proposals. The pairwise potential of overlapping components is defined to be zero when both proposals would be assigned to the foreground, so that they are never both selected as foreground. The pairwise potential of adjacent proposals is defined as the product of the pairwise potential of the local CRF and a collinearity weight. The pairwise potential functions are defined as

ϕp(yi, yj, xij) = 0                        if xi, xj are overlapping and yi, yj are FG,
ϕp(yi, yj, xij) = wp λij f(yi, yj | xij)    otherwise.
The collinearity weight is calculated on the basis of the angular differences among up to four proposals. This weight is used to satisfy the
Fig. 13. Process for building holistic CRF graph in textline (a) potential textline in Fig. 12(e), (b) character proposals in textline (each row represents character proposals obtained from different image segmentations), and (c) partial view on holistic CRF graph with mutual exclusion links. (For interpretation of the references to color in this figure, the reader is referred to the web version of this article.)
Fig. 14. Example of integration of character proposals (a) multiple proposals, (b) textline estimation, (c) multiple proposal integration into textline, (d) multiple proposal validation in textline, and (e) final result. (For interpretation of the references to color in this figure, the reader is referred to the web version of this article.)
assumption that neighboring components with the same alignment should both have higher belief of being foreground [12]. The collinearity weight between two components i and j, λij, is defined as a function of θij in Eq. (7):

λij = (1/2) [ exp(−(θij − θhi)² / σθ²) + exp(−(θij − θjk)² / σθ²) ],    (7)

where θhi = arg min |θij − θhi| over h ∈ ni, h ≠ j, and θjk = arg min |θij − θjk| over k ∈ nj, k ≠ i (ni and nj denote the sets of neighboring proposals of i and j).
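A sketch of the collinearity weight of Eq. (7) is shown below; the neighboring angles are assumed to be precomputed, and the value of σθ is an illustrative placeholder since it is not specified here.

import math

def collinearity_weight(theta_ij, thetas_i, thetas_j, sigma_theta=0.2):
    # Eq. (7): reward a pair (i, j) whose alignment angle agrees with the most
    # similar neighboring angles on both sides.
    # theta_ij : alignment angle between proposals i and j
    # thetas_i : angles between i and its other neighbors h (h != j)
    # thetas_j : angles between j and its other neighbors k (k != i)
    theta_hi = min(thetas_i, key=lambda t: abs(theta_ij - t), default=theta_ij)
    theta_jk = min(thetas_j, key=lambda t: abs(theta_ij - t), default=theta_ij)
    return 0.5 * (math.exp(-(theta_ij - theta_hi) ** 2 / sigma_theta ** 2)
                  + math.exp(-(theta_ij - theta_jk) ** 2 / sigma_theta ** 2))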