Fuzzy Techniques for Text Localisation in Images

Przemysław Górecki¹, Laura Caponetti², and Ciro Castiello²

¹ Department of Mathematics and Computer Science, University of Warmia and Mazury, ul. Oczapowskiego 2, 10-719 Olsztyn, Poland
[email protected]

² Department of Computer Science, University of Bari, via E. Orabona, 4-70125 Bari, Italy
[email protected], [email protected]
Summary. Text information extraction represents a fundamental issue in the context of digital image processing. Inside this wide area of research, a number of specific tasks can be identified, ranging from text detection to text recognition. In this chapter, we deal with the particular problem of text localisation, which aims at determining the exact location where text is situated inside a document image. The strict connection between text localisation and image segmentation is highlighted in the chapter, and a review of methods for image segmentation is proposed. In particular, the benefits coming from the employment of fuzzy and neuro-fuzzy techniques in this field are assessed, thus indicating a way to combine Computational Intelligence methods and document image analysis. Three particular methods based on image segmentation are presented to show different applications of fuzzy and neuro-fuzzy techniques in the context of text localisation.
P. Górecki et al.: Fuzzy Techniques for Text Localisation in Images, Studies in Computational Intelligence (SCI) 96, 233–270 (2008). © Springer-Verlag Berlin Heidelberg 2008, www.springerlink.com

1 Introduction

Text information represents a very important component among the contents of a digital image. This kind of information is related to the category usually referred to as semantic content. By contrast with perceptual content, related to low-level characteristics including colour, intensity or texture, semantic content involves the recognition of components, such as text, objects or graphics, inside a document image [1–3]. The importance of deriving text information by means of image analysis is straightforward. In fact, text can be used to describe the content of a document image, can be converted into electronic formats (for memorisation and archiving purposes), and can be exploited to ultimately understand documents, thus enabling a plethora of applications ranging from document indexing to information extraction and automatic annotation of documents [4–6]. Additionally, with the increasing use of web documents, a lot of multimedia content is available in different page representation forms, which do not lend themselves easily to automatic analysis. Text stands
as the most appropriate medium for allowing a suitable analysis of such contexts, with additional benefits deriving from possible conversions into other multimedia modalities (such as voice signal), or representations in natural language of the web page contents. The recognition of text in images is a step towards achieving such a representation [7]. The presence of text inside a digital image can be characterised by different properties: text size, alignment, spacing and colour. In particular, text can exhibit varying size, since text dimension cannot be assumed a priori. Also, text alignment and text spacing are relevant properties that can variegate a document appearance in several ways, and presumptions about horizontal alignment of text can be made only when specific contexts are investigated. Usually text characters tend to have the same (or similar) colours inside an image; however, the chromatic visualisation may represent a fundamental property, especially when contrasting colours are employed to enhance text among other image regions. Automatic methods for text information extraction have been investigated in a comprehensive way, in order to define different mechanisms that, starting from a digital image, could ultimately derive plain text to be memorised or processed. By loosely referring to [8], we can define the following steps corresponding to the sequential sub-problems which characterise the general text information extraction task:

• Text Detection and Localisation. In some circumstances there is no certainty about the presence of text in a digital image; therefore the text detection step is devoted to the process of determining whether a text region is present or not inside the image under analysis. In this phase no proper text information is derived, but only a boolean response to a detection query. This is common when no a priori knowledge about the characteristics of an image is available. Once the presence of text inside an image has been assessed, the next step is devoted to determining the exact location where the text is situated. This phase is often combined with different techniques purposely related to the problem of image segmentation, thus configuring text regions as specific components to be isolated in digital images.
• Text Tracking. Text tracking represents a support activity correlated to the previously described step of text localisation whenever the task of text information extraction is performed over motion images (such as videos). Even if this kind of process has been frequently overlooked in the literature, it could prove useful also for verifying the results of the text detection and localisation steps, or for shortening their processing times.
• Text Recognition and Understanding. Text recognition represents the ultimate step when analysing a digital image with the aim of deriving plain text to be stored or processed. This phase is commonly carried out by means of specific Optical Character Recognition (OCR) technologies. Moreover, text understanding aims to classify text into logical elements, such as headings, paragraphs, and so on.
In this chapter, we are going to address the localisation step; the interested reader is referred to a number of papers directly devoted to the analysis of the other sub-problems [9–15]. In particular, the additional contribution of this chapter consists in introducing novel text localisation approaches based on fuzzy segmentation techniques. When dealing with text localisation, we are particularly involved with the problem of digital image segmentation. The amount and complexity of information in images, together with the process of image digitalisation, lead to a large amount of uncertainty in the image segmentation process. The adoption of the fuzzy paradigm is desirable in image processing because of the uncertainty and imprecision present in images, due to noise, image sampling, lighting variations and so on. Fuzzy theory provides a mathematical tool to deal with this imprecision and ambiguity in an elegant and efficient way. Fuzzy techniques can be applied to different phases of the segmentation process; additionally, fuzzy logic allows the knowledge about a given problem to be represented in terms of linguistic rules with meaningful variables, which is the most natural way to express and interpret information. The rest of the chapter is organised as follows. Section 2 is devoted to a brief review of methods for image segmentation, proposing different lines of categorisation. Section 3 introduces some concepts related to fuzzy and neuro-fuzzy techniques, discussing their usefulness in the field of digital image processing. Specifically, a particular model of a neuro-fuzzy system is illustrated: its formalisation is useful for the subsequent presentation carried out in Sect. 4, where three particular text localisation approaches are reported for the sake of illustration. In Sect. 5 the outcomes of the experiments are reported and discussed. Section 6 closes the chapter with some conclusive remarks.
2 A Categorisation of Image Segmentation Approaches

Image segmentation is widely acknowledged to play a crucial role in many computer vision applications, and its relevance in the context of the text localisation process has already been mentioned. In this section we discuss this technique in the general field of document image analysis. Image segmentation represents the first step of document image analysis, with the objective of partitioning a document image into regions of interest. Generally, in this context, image segmentation is also referred to as page segmentation. High-level computer vision tasks, related to text information extraction, often utilise information about regions extracted from document pages. In this sense, the final purpose of page segmentation is to classify different regions in order to discriminate between text and non-text areas¹. Moreover, image segmentation is critical because segmentation results
¹ Non-text regions may be further distinguished as graphics, pictures, background, and so on, in accordance with the requirements of the specific problem context.
will affect all subsequent steps of image analysis. In recent years image segmentation techniques have been variously applied to the analysis of different types of documents, with the aim of text information extraction [16–21]. Closely related to image segmentation is the problem of feature extraction. The goal is to extract the most salient characteristics of an image for the purpose of its segmentation: an effective set of features is one of the requirements for successful image segmentation. Information in the image, coded directly in pixel intensities, is highly redundant: the major problem here is the number of variables involved. The direct transformation of an image f(x, y) of size M × N to a point in an (M · N)-dimensional space is impractical, due to the number of dimensions involved. To solve this problem, the image representation must be simplified by minimising the number of dimensions needed to describe the image or some part of it. Therefore, a set of features is extracted from a region of interest in the image. It is common in the literature to distinguish between natural features, defined by the visual appearance of the image (i.e. the intensity of a region), and artificial features, such as intensity histograms, frequency spectra, or co-occurrence matrices [22]. Moreover, first-order statistical features, second-order statistics, and higher-order statistics can be distinguished, depending on the number of points defining the local feature [23, 24]. In the first case, features convey information about intensity distributions, while in the second case, information about pixel pairs is exploited in order to take the spatial information of the distribution into account. In the third case, more than two pixels are considered. Second-order and higher-order features are especially useful in describing texture, because they can capture relations in the repeating patterns that define the visual appearance of a texture.
There is no single segmentation method that provides acceptable results for every type of image. General methods exist, but those designed for particular classes of images often achieve better performance by exploiting prior knowledge about the problem. For our purposes, we discuss segmentation methods along two distinct lines of classification (a diagram of the proposed categorisation is reported in Fig. 1).

[Figure: diagram with the labels top-down, bottom-up; region-based methods, edge-based methods, texture-based methods]

Fig. 1. The categorisation of the image segmentation approaches

On the one hand, by referring to the working mechanism of the segmentation approaches, it is possible to distinguish three classes: top-down approaches, bottom-up approaches and hybrid approaches. Top-down algorithms start from the whole document image and iteratively subdivide it into smaller regions (blocks). The subdivision is based on a homogeneity criterion: the splitting procedure stops when the criterion is met, and the blocks obtained at this stage constitute the final segmentation result. Some examples of top-down algorithms are reported in [25, 26]. Bottom-up algorithms start from document image pixels and cluster the pixels into connected components (such as characters). The procedure can be iterated, giving rise to a growing process which adjoins unconnected adjacent components in order to cluster higher-order components (such as words, lines, document zones). Typical bottom-up algorithms can be found in [27–30]. Hybrid algorithms can be regarded as a mix of the previous approaches, thus configuring a procedure which involves both splitting and merging phases. Hybrid algorithms have been proposed in [31–33]. The second line of classification to categorise segmentation approaches relies on the features utilised during the process. Methods can be categorised into region-based methods, edge-based methods and texture-based methods. In the first case, properties such as intensity or colour are used to derive a set of features describing regions. Edge-based and texture-based methods, instead, derive a set of local features, concerning not only the analysis of a single pixel but also its neighbourhood. In particular, the observation that image text regions have textural properties different from background or graphics represents the foundation of texture-based methods. In the following sections we discuss the above segmentation methods in more detail.
2.1 Region-Based Methods

Region-based methods for image segmentation use the colour or grey-scale properties of a region; when text regions are to be detected, the differences between these properties and the corresponding properties of the background can be highlighted for the purpose of text localisation. The key to region-based segmentation consists in first devising suitable methods for partitioning an image into a number of connected components, according to some specific homogeneity criteria to be applied during the image feature analysis. Once the initial subdivision of the image into a grid of connected regions has been obtained, an iterative grouping process of similar regions is started in order to update the partition of the image. In this way, it is possible to create a final segmentation of regions which are meant to be purposely classified. It should be observed that the term “grouping” is used here in a loose sense. We intend to address a process which can originate an incremental or decremental assemblage of regions, with reference to region growing (bottom-up) methods, region splitting (top-down) methods and split-and-merge (hybrid) methods.
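As a minimal sketch of the grouping process just described, the following Python fragment (illustrative, not taken from the chapter; the 4-connectivity and intensity tolerance are assumptions) grows a region from a seed pixel by adjoining neighbours whose intensity is close to the seed intensity:

```python
import numpy as np
from collections import deque

def region_grow(image, seed, tolerance):
    """Grow a region from a seed pixel, adjoining 4-connected neighbours
    whose intensity lies within `tolerance` of the seed intensity."""
    h, w = image.shape
    mask = np.zeros((h, w), dtype=bool)
    seed_value = float(image[seed])
    queue = deque([seed])
    mask[seed] = True
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and not mask[ny, nx]:
                if abs(float(image[ny, nx]) - seed_value) <= tolerance:
                    mask[ny, nx] = True
                    queue.append((ny, nx))
    return mask
```

A full region-based segmenter would repeat this from many seeds and then merge (or split) the resulting regions according to the homogeneity criterion.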
The analysis of the image features can be performed on the basis of different techniques: among them, thresholding represents one of the simplest methods for segmentation. In some images, an object can be easily separated from the background if the intensity levels of the object fall outside the range of intensity levels of the background. This represents a perfect case for applying a thresholding approach. Each pixel of the input image f(x, y) is compared with the threshold t in order to produce the segmented image l(x, y):

l(x, y) = { 1 if f(x, y) > t (object),
            0 if f(x, y) ≤ t (background).    (1)

The selection of an appropriate threshold value is essential in this technique. Many authors have proposed to find the threshold value by means of an image histogram shape analysis [34–37]. Global thresholding techniques use a fixed threshold for all pixels in the image and therefore work well only if the intensity histograms of the objects and background are well separated. Hence, this kind of technique cannot deal with images containing, for example, a strong illumination gradient. On the other hand, local adaptive thresholding selects an individual threshold for each pixel based on the range of intensity values in its local neighbourhood. This allows for the thresholding of an image whose global intensity histogram does not contain distinctive peaks [38]. The thresholding approach has been successfully applied in many image segmentation problems with the goal of text localisation [39–41]. Clustering can be seen as a generalisation of the thresholding technique. In fact, it allows for partitioning data into more than two clusters, dealing with a space of higher dimensionality than thresholding, where data are one-dimensional. Similarly to thresholding, clustering is performed in the image feature space, and it aims at finding structures in the collection of data, so that data can be classified into different groups (clusters).
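The thresholding rule of Eq. (1), together with a naive local adaptive variant, can be sketched in Python as follows (the window size and offset are illustrative parameters, not prescribed by the chapter):

```python
import numpy as np

def global_threshold(image, t):
    """Eq. (1): label pixels above threshold t as object (1), the rest as background (0)."""
    return (image > t).astype(np.uint8)

def local_adaptive_threshold(image, window=3, offset=0.0):
    """Threshold each pixel against the mean of its local neighbourhood."""
    padded = np.pad(image.astype(float), window // 2, mode='edge')
    h, w = image.shape
    out = np.zeros((h, w), dtype=np.uint8)
    for y in range(h):
        for x in range(w):
            patch = padded[y:y + window, x:x + window]
            out[y, x] = 1 if image[y, x] > patch.mean() + offset else 0
    return out
```

The adaptive variant computes one threshold per pixel, which is what makes it robust to illumination gradients that defeat a single global threshold.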
More precisely, data are partitioned into different subsets, and data in each subset are similar in some way. During the clustering process, structures in data are discovered without any a priori knowledge and without providing an explanation or interpretation of why they exist [42]. Clustering techniques for image segmentation have been adopted for the purpose of text localisation [43–45].

2.2 Edge-Based Methods

Edge-based techniques, rather than finding regions by adopting a grouping process, aim at identifying explicit or implicit boundaries between regions. Edge-based methods represent the earliest segmentation approaches and rely on the process of edge detection. The goal of edge detection is to localise the points in the image where abrupt changes in intensity take place. In document images, edges may appear at discontinuity points between the text and the background. The simplest mechanism to detect edges is the differential detection approach. As images are two-dimensional, the gradient ∇ is calculated from the partial derivatives of the image f(x, y):
∇f(x, y) = (∂f(x, y)/∂x, ∂f(x, y)/∂y)ᵀ.    (2)
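A minimal sketch of this differential detection approach, estimating the two partial derivatives of Eq. (2) by convolution with Sobel kernels (one common choice of filter; the chapter does not prescribe a particular one):

```python
import numpy as np

# Sobel kernels approximating the partial derivatives in Eq. (2)
SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def convolve2d(image, kernel):
    """Naive 'same'-size 2-D convolution with edge padding."""
    kh, kw = kernel.shape
    padded = np.pad(image.astype(float), ((kh // 2,), (kw // 2,)), mode='edge')
    flipped = kernel[::-1, ::-1]  # convolution flips the kernel
    out = np.zeros(image.shape, dtype=float)
    for y in range(image.shape[0]):
        for x in range(image.shape[1]):
            out[y, x] = np.sum(padded[y:y + kh, x:x + kw] * flipped)
    return out

def gradient_magnitude(image):
    """|∇f| from the two partial-derivative estimates."""
    gx = convolve2d(image, SOBEL_X)
    gy = convolve2d(image, SOBEL_Y)
    return np.hypot(gx, gy)
```

The magnitude map is large at text/background transitions and near zero in flat regions, which is exactly what an edge tracking stage then follows.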
The computation of the partial derivatives is usually realised by convolving the image with a given filter which estimates the gradient. The maps of edge points obtained at the end of this process can successively be utilised by an edge tracking technique, so that the contours of different regions may be highlighted inside the image. Generally, the Canny operator, one of the most powerful edge filters, can be applied to detect edge points in document images [46]. In the case of text localisation, edge-based methods aim at exploiting the high contrast between the text and the background. The edges of text boundaries are identified and merged, and then several heuristics are used to filter out the non-text regions [47–49].

2.3 Texture-Based Methods

Texture-based methods consider a document image as a composite of textures of different classes. With this approach, various texture segmentation and classification techniques can be used directly or with some modifications. Some texture segmentation approaches apply splitting and merging or clustering methods to feature vectors computed from the image which describe its texture information. When a document image is considered as texture, text regions are assumed to have texture features different from the non-text ones. Text regions are modelled as regular periodic textures, because they contain text lines with the same orientation; their interline spacings are also approximately the same. Non-text regions, instead, correspond to irregular textures. Generally, the problem is how to separate two or more different texture classes. Techniques based on Gabor filters, Wavelets, FFT and spatial variance can be used to detect the textural properties of an image text region [50–52]. In the following, we describe two fundamental approaches: Gabor filtering and multi-scale techniques.

Gabor Filtering

Gabor filtering is a classical approach to describe the textural properties of an image.
A two-dimensional Gabor filter is a complex sinusoid (with a wavelength λ and a phase offset ψ) modulated by a two-dimensional Gaussian function (with an aspect ratio of γ). The Gabor filter with orientation θ is defined as follows:

G(x, y) = exp(−(x′² + γ²y′²)/(2σ²)) cos(2πx′/λ + ψ),    (3)

where x′ = x cos θ + y sin θ and y′ = −x sin θ + y cos θ.
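Equation (3) can be sampled on a discrete grid to obtain a convolution kernel; the following sketch assumes a square, odd-sized support (the size and the default γ and ψ values are illustrative choices):

```python
import numpy as np

def gabor_kernel(size, wavelength, theta, sigma, gamma=0.5, psi=0.0):
    """Sample the Gabor filter of Eq. (3) on a size x size grid."""
    half = size // 2
    grid = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    y, x = grid
    # rotated coordinates x', y' from Eq. (3)
    x_r = x * np.cos(theta) + y * np.sin(theta)
    y_r = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_r ** 2 + gamma ** 2 * y_r ** 2) / (2 * sigma ** 2))
    carrier = np.cos(2 * np.pi * x_r / wavelength + psi)
    return envelope * carrier
```

A filter bank is then obtained simply by instantiating this kernel for several values of θ and λ, and each filtered image contributes one component of the per-pixel texture feature vector.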
In the context of text extraction, a filter bank consisting of several orientation-selective 2-D Gabor filters can be used to detect texture features of text and non-text components. As an illustrative example, in [53] the Gabor transform with m different spatial frequencies and p different orientations is applied to the input image, producing mp filtered images. A texture feature is computed as the mean value in small overlapping windows centred at each pixel. The values of each pixel in the n feature images form an n-dimensional feature vector. These vectors are grouped into K clusters using a squared-error clustering algorithm.

Multi-Scale Techniques

One problem associated with document texture-based approaches is due to both large intra-class and inter-class variations in textural features. To solve this problem, multi-scale analysis and feature extraction at different scales have been introduced by some authors [54, 55]. In [56], Wavelet decomposition is used to define local energy variations in the image at several scales. The binary image obtained by thresholding the local energy variation is analysed by connected component-based filtering using geometric attributes such as size and aspect ratio. All the text regions detected at the several scales are merged to give the final result. Wavelet packet analysis is an important generalisation of Wavelet analysis [57, 58]. Wavelet packet functions are localisable in space, like Wavelet functions, but offer more flexibility in the decomposition of signals. Wavelet packet approximators are based on translated and scaled Wavelet packet functions Wj,b,k, which are generated from the base function [59], according to the following equation:

Wj,b,k(t) = 2^{j/2} Wb(2^{−j}(t − k)),    (4)
where j is the resolution level, Wb is the Wavelet packet function generated by scaling and translating a mother Wavelet function, b is the number of oscillations (zero crossings) of Wb and k is the translation shift. In Wavelet packet analysis, a signal x(t) is represented as a sum of orthogonal Wavelet packet functions Wj,b,k(t) at different scales, oscillations and locations:

x(t) = Σ_j Σ_b Σ_k wj,b,k Wj,b,k(t),    (5)
where each wj,b,k is a Wavelet packet coefficient. To compute the Wavelet packet coefficients, a fast splitting algorithm [60] is used, which is an adaptation of the pyramid algorithm [61] for the discrete Wavelet transform. The splitting algorithm differs from the pyramid algorithm in that both the low-pass (L) and high-pass (H) filters are applied to the detail coefficients, in addition to the approximation coefficients, at each stage of the algorithm. Moreover, the splitting algorithm retains all the coefficients, including those at intermediate filtering stages.
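One stage of the splitting algorithm can be sketched with Haar analysis filters (an illustrative choice of mother Wavelet). In the full splitting algorithm the same decomposition would then be applied recursively to all four subbands, not only to LL:

```python
import numpy as np

# Haar analysis filters (illustrative choice of mother Wavelet)
LOW = np.array([1.0, 1.0]) / np.sqrt(2.0)
HIGH = np.array([1.0, -1.0]) / np.sqrt(2.0)

def analyse_rows(image, taps):
    """Filter each row with a 2-tap filter and downsample by two."""
    return taps[0] * image[:, 0::2] + taps[1] * image[:, 1::2]

def wavelet_split(image):
    """One stage of the splitting algorithm: LL, LH, HL, HH subbands."""
    lo = analyse_rows(image, LOW)    # low-pass along rows
    hi = analyse_rows(image, HIGH)   # high-pass along rows
    ll = analyse_rows(lo.T, LOW).T   # then filter along columns
    lh = analyse_rows(lo.T, HIGH).T
    hl = analyse_rows(hi.T, LOW).T
    hh = analyse_rows(hi.T, HIGH).T
    return ll, lh, hl, hh
```

For a flat image the detail subbands (LH, HL, HH) vanish, while textured text regions leave energy in them, which is the property the multi-scale text detectors above exploit.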
The Wavelet packet decomposition process can be represented with a quadtree in which the root node is assigned the highest scale coefficients, that is, the original image itself, while the leaves represent the outputs of the LL, LH, HL and HH filters. Assuming that similar regions of an image have similar frequency characteristics, we infer that these characteristics are captured by some nodes of the quadtree. As a consequence, a proper selection of quadtree nodes should allow for the localisation of similar regions in the image. Learning-based methods are proposed for the automatic selection of nodes describing text or background, as we will illustrate in Sect. 4.3.
3 Fuzzy Techniques in Image Segmentation

In the previous section, we have discussed different techniques for image segmentation. Some of the feature extraction methods and most of the algorithms are based on crisp relations, comparisons and thresholding. Such constraints are not well suited to cope with the ambiguity and imprecision present in images, which are very often degraded by noise coming from various sources such as imperfect capturing devices, image digitalisation and sampling. Fuzzy techniques provide a mathematical tool to deal with such imprecision and ambiguities in an elegant and efficient way, making it possible to eliminate some of the drawbacks of classical segmentation algorithms. Additionally, the hybrid approach based on the integration of fuzzy logic and neural networks has proved to be very fruitful. This hybridisation strategy allows the benefits of both methods to be combined while eliminating their drawbacks. Neuro-fuzzy networks can be trained in a similar fashion to classical neural networks, but they are also capable of explaining the decision process by representing the knowledge in terms of fuzzy rules. Moreover, the rules can be discovered automatically from data, and their parameters can be easily fine-tuned in order to maximise the classification accuracy of the system. Neuro-fuzzy hybridisation belongs to the research field of Computational Intelligence, which is an emerging area in the field of intelligent systems development. This novel paradigm results from a partnership of different methodologies: Neural Computation, Fuzzy Logic, Evolutionary Programming. Such a consortium is employed to cope with the imprecision of real-world applications, allowing the achievement of robustness, low solution cost and a better rapport with reality [62, 63]. In this section, we introduce the basics of fuzzy theory and neuro-fuzzy hybridisation, while discussing their relevance and application in the context of image analysis.
3.1 General Theory of Fuzzy Sets

The incentive for the development of fuzzy logic originates from observing that people do not require precise, numerical information in order to describe events or facts, but rather they do it by using imprecise and fuzzy linguistic
terms. Yet, they are able to draw the right conclusions from fuzzy information. The theory of fuzzy sets, underpinning the mechanisms of fuzzy logic, was introduced to deal mathematically with the imprecise or vague information that is present in everyday life [64]. In bi-valued logic, any relation can be either true or false, which is defined by the crisp criteria of membership. For example, it is easy to determine precisely whether a variable x is greater than a certain number. On the other hand, evaluating whether x is much greater than a certain number is ambiguous. In the same way, when looking at a digital document image, we can say that the background is bright and the letters are dark. We are able to identify the above classes, despite the lack of precise definitions for the words “bright” and “dark”: this relies on the fact that many objects do not have clear criteria of membership. Fuzzy logic allows such situations to be handled by introducing continuous intermediate states between true and false. This also allows numerical variables to be represented in terms of linguistic labels. Actually, the means for dealing with such linguistic imprecision is the concept of a fuzzy set, which permits a gradual degree of membership of an object in relation to a set. Let X denote a universe of discourse, or space of points, with its elements denoted as x. A fuzzy set A is defined as a set of ordered pairs:

A = {(x, µA(x)) | x ∈ X},    (6)

where µA(x) is the membership function of A:

µA : X → [0, 1],    (7)
representing the degree of membership of x in A. A single pair (x, µ(x)) is called a fuzzy singleton; thus a fuzzy set can be defined in terms of the union of its singletons. Based on the above definitions, an ordinary set can be derived by imposing the crisp membership condition µA(x) ∈ {0, 1}. Graphical examples of crisp and fuzzy sets are shown in Fig. 2. Analogously, it is possible to extend the operators of ordinary sets to their fuzzy counterparts, giving rise to fuzzy extensions of relations, definitions and
Fig. 2. An example of a crisp set and a fuzzy set with Gaussian membership function
so on [65, 66]. In the following, we shall review different fuzzy image features which are employed in the field of digital image processing. Moreover, we are interested in dealing with the particular aspects of fuzzy clustering and the definition of fuzzy and neuro-fuzzy systems.

3.2 Fuzzy Image Features

An M × N image f(x, y) can be represented as an array of fuzzy singletons, denoting pixel grey level intensities. However, due to the imprecise image formation process, it is more convenient to treat the pixel intensity (or some other image feature, such as edge intensity) as a fuzzy number, having a non-singleton membership function, rather than as a crisp number (corresponding to the fuzzy singleton). A fuzzy number is a fuzzy set defining a fuzzy interval for a real number, with a membership function that is piecewise continuous. One way of expressing fuzzy numbers is by means of triangular fuzzy sets. A triangular fuzzy number is defined as A = (a1, a2, a3), where a1 ≤ a2 ≤ a3 are the numbers describing the shape of a triangular membership function:

µA(x) = { 0                     if x < a1,
          (x − a1)/(a2 − a1)    if a1 ≤ x < a2,
          1                     if x = a2,
          (a3 − x)/(a3 − a2)    if a2 < x ≤ a3,
          0                     if x > a3.       (8)

Fuzzy numbers can be applied to incorporate imprecision into image statistics (i.e. histograms). This improves the noise invariance of this kind of feature, which is especially important in situations where the image statistics are derived from small regions, so that the number of observations is small.

Fuzzy Histogram

A crisp histogram represents the distribution of pixel intensities in the image over a certain number of bins; hence it reports the probability of observing a pixel with a given intensity. In order to obtain the histogram, the intensity value of each pixel in the image is accumulated in the bin corresponding to this value. In this way, for an image containing n pixels, a histogram representation H = {h(1), h(2), . . .
, h(b)} can be obtained, comprising b bins. Therefore h(i) = ni/n denotes the probability that a pixel belongs to the i-th intensity bin, where ni is the number of pixels in the i-th bin. However, as the measurements of the intensities are imprecise, each accumulated intensity should also affect the nearby bins, introducing a fuzziness in the histogram. The value of each bin in a fuzzy histogram represents the typicality of a pixel
within the image rather than its probability. The fuzzy histogram can be defined as F H = {f h(1), . . . , f h(b)} where f h(i) is expressed as following: f h(i) =
n
µj (i),
i = 1, . . . , b,
(9)
j=1
where b is the number of bins (corresponding to the number of intensity levels), n is the number of pixels in the image and µj (i) is the membership degree of the intensity level of the j-th pixel with respect to the i-th bin. Therefore, µj (i) denotes the membership function of a fuzzy number, related to the value of the pixel intensity. The value f h(i) can be expressed as the linear convolution between the conventional histogram and the filtering kernel provided by the function µj (i). This approach is possible if all fuzzy numbers have the membership function of the same shape. Hence, the membership function µl of a fuzzy number corresponding to a crisp intensity level l, can be expressed as µl (x) = µ(x − l), where µ denotes the general membership function, common to all fuzzy numbers accumulated in the histogram. By representing µ as a convolution kernel, the fuzzy histogram F H = {f h(1), . . . , f h(b)} is smoothed as following: f h(i) = (h ∗ µ)(i) = h(i + l)µ(l), i = 1, . . . , b, (10) l
where h(i) denotes the i-th bin of a crisp histogram. In [67] such a smoothing-based approach, where the influence of neighbouring bins is expressed by triangular membership functions, has been used to extract fuzzy histograms of grey-level images.

Fuzzy Co-occurrence Matrix

The fuzzy co-occurrence matrix is another example of fuzzifying a crisp feature measure. Similarly to the second-order statistics, it is often employed for measuring the texture features of images. The idea of the classical co-occurrence matrix is to accumulate in a matrix C the co-occurrences of the intensity values i = f(x_i, y_i) and j = f(x_j, y_j) of the pixels (x_i, y_i) and (x_j, y_j), given the spatial offset (δ_x, δ_y) separating the pixels. Therefore, the spatial co-occurrence of the intensities i and j is accumulated in the bin C(i, j) of the matrix, by increasing the value of the bin by one. In the case of the fuzzy co-occurrence matrix F, the intensity values of the pixels (x_i, y_i) and (x_j, y_j) are represented by fuzzy numbers having the membership functions µ_i(x) and µ_j(x). Thus, not only the bin (i, j) should be incremented, but also its neighbouring bins. The amount of the increment ∆F(k, l) for the bin F(k, l) depends on the fulfilment degrees of the membership functions µ_i(k) and µ_j(l), and it is calculated as follows:

∆F(k, l) = µ_i(k) µ_j(l).   (11)
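As a concrete illustration of (10), the fuzzy histogram can be computed by convolving the crisp histogram with a small membership kernel. The sketch below assumes a triangular kernel of width three, in the spirit of [67]; the function name and kernel values are our own illustrative choices:

```python
import numpy as np

def fuzzy_histogram(image, bins=256, kernel=(0.25, 0.5, 0.25)):
    # Crisp histogram: one bin per intensity level
    h, _ = np.histogram(image, bins=bins, range=(0, bins))
    # Eq. (10): fh(i) = sum_l h(i + l) mu(l), with a triangular kernel;
    # mode="same" keeps b bins (the two border bins lose a little mass)
    return np.convolve(h, kernel, mode="same")
```

Each pixel now contributes fractionally to the neighbouring bins as well, which smooths spurious gaps in sparse histograms.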
Similarly to the fuzzy histogram, a fuzzy co-occurrence matrix can be obtained from a crisp co-occurrence matrix by means of the convolution operator. However, as the matrix is two-dimensional, the convolution is performed first along its rows and then along its columns.

3.3 Fuzzy Systems

Fuzzy systems are designed to cope with the imprecision of the input and output variables by defining fuzzy numbers and fuzzy sets that can be expressed by linguistic variables. The working scheme of a fuzzy system is based on a particular inference mechanism where the involved variables are characterised by a number of fuzzy sets with meaningful labels. For example, a pixel grey value can be described using the {“bright”, “grey”, “dark”} fuzzy sets, an edge can be characterised by the {“weak”, “strong”} fuzzy sets, and so on. In detail, each fuzzy system is designed to tackle a decision problem by means of a set of N fuzzy rules, called the fuzzy rule base R. The rules incorporate a number of fuzzy sets whose membership functions are usually designed by experts in the field of the problem at hand. The j-th fuzzy rule in a fuzzy rule base R has the general form:

R_j: If x_1 is A_1^j and x_2 is A_2^j and ... and x_n is A_n^j then y is B^j,   j = 1, 2, ..., N,   (12)

where x = (x_1, x_2, ..., x_n) is an input vector, y is an output value, and A_i^j and B^j are fuzzy sets. The overall process of fuzzy inference is articulated in consecutive steps [68]. At first, in order to infer the output from a crisp input, a fuzzification of the input values is needed; this is achieved by evaluating the degree of membership in each of the fuzzy sets describing the variable. In this way, an expression for the relation of the j-th rule can be found. By interpreting the rule implication with a conjunction-based representation², it is possible to express the relation of the j-th rule as follows:

µ_{R_j}(x_1, x_2, ..., x_n, y) = µ_{A_1^j}(x_1) ∧ µ_{A_2^j}(x_2) ∧ ... ∧ µ_{A_n^j}(x_n) ∧ µ_{B^j}(y),   (13)
where ∧ denotes the operator generalising the fuzzy “AND” connective. The aggregation of all fuzzy rules in the rule base is achieved by:

µ_R(x_1, x_2, ..., x_n, y) = ∨_{j=1}^{N} µ_{R_j}(x_1, x_2, ..., x_n, y),   (14)
² This kind of interpretation for an IF-THEN rule assimilates the rule with the Cartesian product of the input/output variable space. Such an interpretation is commonly adopted, as in the Mamdani [69] and Takagi-Sugeno-Kang (TSK) [70] models, but it does not represent the only kind of semantics for fuzzy rules [71].
where ∨ is the operator generalising the fuzzy “OR” connective, and µ_R is a membership function characterising the fuzzy output variable. The last step of the process is defuzzification, which assigns an appropriate crisp value to the fuzzy set R described by the membership function (14), so that an output crisp value is provided at the end of the inference process. For selecting this value, different defuzzification operators can be employed [72], among them the centre of area (evaluating the centroid of the fuzzy output membership function) and the mean, smallest or largest of maxima (evaluating, respectively, the mean, smallest or largest of all maximum points of the membership function). No standard techniques are applicable for transforming human knowledge into a set of rules and membership functions. Usually, the first step is to identify and name the system inputs and outputs. Then, their value ranges should be specified and a fuzzy partition of each input and output should be made. The final step is the construction of the rule base and the specification of the membership functions for the fuzzy sets. As an illustrative example, we show how fuzzy systems can be employed to obtain a simple process of text information extraction. Let us consider a decision task based on the classification of small image blocks as text or background. By examining the blocks extracted from the image, it can be observed that the background is usually bright, with little or no variation in grey-scale. On the other hand, text exhibits high variations in grey-scale, as the block contains black text pixels and white background pixels, or it is black with a small grey-scale variance (in the case of larger heading fonts). The above observations allow us to formulate a set of rules, containing linguistic variables, employing such features as the mean and the standard deviation of the pixel values:

• R1: IF mean is dark AND std. dev. is low THEN background is low.
• R2: IF mean is dark AND std. dev. is high THEN background is low.
• R3: IF mean is grey AND std. dev. is high THEN background is low.
• R4: IF mean is white AND std. dev. is low THEN background is high.
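To make the rule base concrete, the following sketch evaluates the four rules with a Mamdani-style inference (min for AND, output clipping, max aggregation and centre-of-area defuzzification, cf. (13) and (14)). The trapezoidal membership functions are illustrative assumptions, since the exact shapes of Fig. 3 are not reproduced here, so the resulting crisp values differ from the ones reported in the text:

```python
import numpy as np

def trapmf(x, a, b, c, d):
    # Trapezoidal membership: rises on [a, b], plateau on [b, c], falls on [c, d]
    return np.clip(np.minimum((x - a) / (b - a + 1e-12),
                              (d - x) / (d - c + 1e-12)), 0.0, 1.0)

# Assumed input/output fuzzy sets (illustrative, not the exact shapes of Fig. 3)
mean_mf = {"dark":  lambda v: trapmf(v, -1, 0, 64, 128),
           "grey":  lambda v: trapmf(v, 64, 128, 128, 192),
           "white": lambda v: trapmf(v, 128, 192, 255, 256)}
std_mf = {"low":  lambda v: trapmf(v, -1, 0, 16, 48),
          "high": lambda v: trapmf(v, 16, 48, 128, 129)}
y = np.linspace(0.0, 1.0, 201)  # discretised output domain of "background"
out_mf = {"low": trapmf(y, -0.1, 0.0, 0.2, 0.6),
          "high": trapmf(y, 0.4, 0.8, 1.0, 1.1)}

rules = [("dark", "low", "low"), ("dark", "high", "low"),
         ("grey", "high", "low"), ("white", "low", "high")]

def background_degree(mean, std):
    agg = np.zeros_like(y)
    for m_lab, s_lab, o_lab in rules:
        w = min(mean_mf[m_lab](mean), std_mf[s_lab](std))    # AND as min, cf. (13)
        agg = np.maximum(agg, np.minimum(w, out_mf[o_lab]))  # clip + OR as max, cf. (14)
    return 0.5 if agg.sum() == 0 else float((y * agg).sum() / agg.sum())  # centre of area
```

A bright, low-variance block yields a high background degree, while a dark, low-variance block (a large heading) yields a low one.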
The foregoing simple rules allow us to infer the membership degree b_i of the i-th block to the background class, while the membership degree t_i to the text class can be obtained simply as t_i = 1 − b_i. In order to obtain the segmentation of a document image, the image should be partitioned into a regular grid of small blocks (e.g. of size 4 × 4 or 8 × 8 pixels, depending on the size of the image). Successively, the fuzzy rules are evaluated based on the features of each block. Figure 3 illustrates the sets of membership functions defined for the input values. Figure 4 illustrates the inference process for a sample input value: each row corresponds to one of the rules in the rule base previously described, with two input membership functions and one output membership function. Degrees of membership (vertical lines) are calculated based on illustrative crisp inputs (mean = 193, std. dev. = 32). The activation function of each rule is calculated by adopting the min function, according to (13). Finally, all activation functions are aggregated using the
Fig. 3. Membership functions of the variables mean (a) and std. dev. (b) employed for the segmentation of document images
Fig. 4. Fuzzy inference process performed over illustrative input values
max function, according to (14). The crisp output value (equal to 0.714, as shown in Fig. 4) is calculated by defuzzifying the aggregated output, employing the centre of area method. Results obtained by employing this approach on a sample document image are presented in Fig. 5.

3.4 Fuzzy C-Means Clustering

Traditional clustering approaches generate partitions in which each pattern belongs to one and only one cluster. Fuzzy clustering extends this notion using the concept of membership function: the output of this kind of algorithm is a fuzzy clustering rather than a crisp partition. The Fuzzy C-Means (FCM) method was developed by Dunn [73] and improved by Bezdek [74], and it is frequently used in data clustering problems. FCM is a partitional method derived from K-Means clustering [75]. The main difference between FCM and K-Means is that the former allows one piece of data to belong to many clusters with certain membership degrees; in other words, the partitioning of the data is fuzzy rather than crisp. Given the number of clusters m, the distance metric d(x, y) and an objective function J, the goal is to assign the samples {x_i}_{i=1}^{k} into clusters.
Fig. 5. Document image segmentation with employment of a fuzzy system. Original document image (a), obtained segmentation (b)
In particular, the Fuzzy C-Means algorithm is based on the minimisation of the following objective function:

J_s = Σ_{j=1}^{m} Σ_{i=1}^{k} (u_ij)^s d(x_i, c_j)²,   1 < s < ∞,   (15)
where the distance metric d(x, y) is represented by any norm expressing the similarity between the measured data and the centres (most frequently, the Euclidean distance); s is the parameter determining the fuzziness of the clustering; m is the number of clusters; k is the number of observations; u_ij is the membership degree of the observation x_i belonging to the cluster c_j, calculated as follows:

u_ij = 1 / Σ_{l=1}^{m} ( d(x_i, c_j) / d(x_i, c_l) )^{2/(s−1)}.   (16)
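Combining (15) and (16) with the standard FCM centre update c_j = Σ_i (u_ij)^s x_i / Σ_i (u_ij)^s (not restated in the text) gives the usual alternating optimisation. A minimal sketch, with our own function name and random initialisation:

```python
import numpy as np

def fcm(X, m=2, s=2.0, iters=100, seed=0):
    """Minimal Fuzzy C-Means: alternates the membership update of Eq. (16)
    with the standard fuzzy-weighted centre update."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), m))
    U /= U.sum(axis=1, keepdims=True)            # memberships sum to 1 per sample
    for _ in range(iters):
        W = U ** s
        C = (W.T @ X) / W.sum(axis=0)[:, None]   # centres c_j
        d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2) + 1e-12
        # Eq. (16): u_ij = 1 / sum_l (d_ij / d_il)^(2/(s-1))
        U = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2.0 / (s - 1.0))).sum(axis=2)
    return U, C
```

On two well-separated point clouds the centres converge to the cloud means, with every row of U summing to one as required by the constraint below Eq. (16).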
The values of the membership degrees are constrained to be positive and they satisfy the constraint Σ_{j=1}^{m} u_ij = 1. It should be observed that Fuzzy C-Means does not incorporate any spatial dependences between the observations, which may degrade the overall segmentation results, because the obtained homogeneous regions are likely to be disjoint, irregular and noisy. However, it is possible to penalise the objective function (15) in order to restrict the membership functions in FCM to be spatially smooth. This penalty is used to discourage spatially undesirable configurations of membership values, i.e. high membership values surrounded
by low membership values of the same cluster, or adjacent high membership values of different clusters. Examples of such penalised objective functions were proposed in [76]. The Fuzzy C-Means method has been applied in a variety of image segmentation problems, such as medical imaging [77] or remote sensing [78].

3.5 Neuro-Fuzzy Systems

The integration of fuzzy logic and neural networks boasts a consolidated presence in the scientific literature [79–83]. The motivations behind the success of this kind of combination can be easily assessed by referring to the issues introduced in the previous section. In fact, by means of fuzzy logic it is possible to facilitate the understanding of decision processes and to provide a natural way for the interpretation of linguistic rules. On the other hand, rules in fuzzy systems cannot be acquired automatically. The design of rules and membership functions is always human-driven and reveals to be difficult, especially in the case of complex systems. Additionally, the tuning of the fuzzy membership functions representing linguistic labels is a very time consuming process, but it is essential if accuracy is a matter of concern [84]. Neural networks are characterised by somewhat opposite properties. They have the ability to generalise and to learn from data, obtaining knowledge to deal with previously unseen patterns. The learning process is relatively slow for large sets of training data, and additional information about the problem cannot be integrated into the learning procedure in order to simplify it and speed up the computation. A trained neural network can classify patterns accurately, but the decision process is obscure to the user. In fact, information is encoded in the connections between the neurons; therefore the extraction of structural knowledge from the neural network is very difficult. Neuro-fuzzy systems allow the extraction of fuzzy rules from data during the knowledge discovery process.
Moreover, the membership functions inside each rule can be easily tuned, based on the information embedded in the data. In order to perform both tasks, the expert intervention can be avoided by resorting to neural learning, for which a training set T of t samples is required. In particular, the i-th sample in the training set is a pair of input/output vectors (x_i, y_i), therefore T = {(x_1, y_1), ..., (x_t, y_t)}. In the case of classification problems, the input vector x_i is an m-dimensional vector containing the m measurements of the input features, while the output vector y_i is an n-dimensional binary vector, codifying the membership of x_i for each of the n classes (i.e., y_i is one of the linearly independent basis vectors spanning the R^n space). In the following, we introduce the peculiar scheme of a neuro-fuzzy model, whose application to text localisation problems will be detailed in the next section.
A Peculiar Scheme for a Neuro-Fuzzy System

The fuzzy component of the neuro-fuzzy system is represented by a particular fuzzy inference mechanism whose general scheme is comparable to the Takagi-Sugeno-Kang (TSK) fuzzy inference method [70]. The fuzzy rule base is composed of K fuzzy rules, where the k-th rule is expressed in the form:

R_k: If x_1 is A_1^(k) and ... and x_m is A_m^(k) then y_1 is b_1^(k) and ... and y_n is b_n^(k),   (17)
where x = (x_1, ..., x_m) is the input vector, y = (y_1, ..., y_n) is the output vector, (A_1^(k), ..., A_m^(k)) are fuzzy sets defined over the elements of the input vector x, and (b_1^(k), ..., b_n^(k)) are fuzzy singletons defined over the elements of the output vector y. Each of the fuzzy sets A_i^(k) is defined in terms of a Gaussian membership function µ_i^(k):

µ_i^(k)(x_i) = exp( −(x_i − c_i^(k))² / (2 (σ_i^(k))²) ),   (18)
where c_i^(k) is the centre and σ_i^(k) is the width of the Gaussian function. The rule fulfilment degree of the k-th rule is evaluated using the formula:

µ^(k)(x) = Π_{i=1}^{m} µ_i^(k)(x_i),   (19)
where the product function is employed to interpret the AND connective. The final output of the fuzzy model can be expressed as:

y_j = ( Σ_{k=1}^{K} µ^(k)(x) b_j^(k) ) / ( Σ_{k=1}^{K} µ^(k)(x) ),   j = 1, ..., n.   (20)
In classification tasks, the elements of the output vector y express in the range [0, 1] the membership degrees of the input pattern for each of the classes. In order to obtain a binary output vector ȳ = {ȳ_j}_{j=1}^{n}, the defuzzification of the output vector y is performed as follows:

ȳ_j = 1 if y_j = max(y), ȳ_j = 0 otherwise.   (21)

By means of (21), the input pattern is classified in accordance with the highest membership degree. The neural component of the neuro-fuzzy system is represented by a particular neural network which reflects in its topology the structure of the previously presented fuzzy inference system. The network is composed of four layers with the following characteristics:
Layer 1 provides the crisp input vector x = (x_1, ..., x_m) to the network. This layer does not perform any calculation and the input vector values are simply passed to the second layer.
Layer 2 realises the fuzzification of the input variables. Units in this layer are organised into K distinctive groups. Each group is associated with one of the fuzzy rules, and it is composed of m units, corresponding to the m fuzzy sets in the fuzzy rule. The i-th unit in the k-th group, connected with the i-th neuron in layer 1, evaluates the Gaussian membership degree of the fuzzy set A_i^(k), according to (18).
Layer 3 is composed of K units. Each of them performs the precondition matching of one of the rules and reports its fulfilment degree, in accordance with (19). The i-th unit in this layer is connected with all units in the i-th group of layer 2.
Layer 4 supplies the final output vector y and is composed of n units. The i-th unit in this layer evaluates the element y_i, according to (20). In particular, the fulfilment degrees of the rules are weighted by the fuzzy singletons, which are encoded as the values of the connection weights between layer 3 and layer 4.
Figure 6 depicts the structure of the above described neuro-fuzzy network, with reference to a neuro-fuzzy system with two inputs, three rules and two outputs.
Fig. 6. Structure of the neuro-fuzzy network coupled with a neuro-fuzzy system exhibiting two inputs, three rules and two outputs (m = 2, K = 3, n = 2)
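The inference defined by (18)–(21) amounts to a short forward pass through the four layers. A sketch under our own parameter naming (per-rule centres and widths, singletons gathered in a K × n matrix):

```python
import numpy as np

def nf_output(x, centres, widths, B):
    """Forward pass of the described model: Gaussian memberships (18),
    product rule fulfilment (19), weighted-average output (20) and
    winner-take-all defuzzification (21).
    centres, widths: (K, m) antecedent parameters; B: (K, n) singletons."""
    mu = np.exp(-((x - centres) ** 2) / (2.0 * widths ** 2))  # (K, m), Eq. (18)
    w = mu.prod(axis=1)                                       # (K,),  Eq. (19)
    y = (w @ B) / w.sum()                                     # (n,),  Eq. (20)
    y_bin = (y == y.max()).astype(int)                        # Eq. (21)
    return y, y_bin
```

With one-hot singleton rows, as in the test below, the components of (20) sum to one, and the winner-take-all step (21) simply picks the class of the dominant rules.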
Concerning the learning procedure of the neuro-fuzzy network, two distinctive steps are involved. The first one is devoted to discovering the initial structure of the neuro-fuzzy network; successively, the parameters of the fuzzy rules are refined, so that the overall classification accuracy is improved. During the first step, a clustering of the input data is performed by an unsupervised learning process of the neuro-fuzzy network: each cluster corresponds to one of the nodes in the rule layer of the network. The clustering process is able to derive the proper number of clusters. In fact, a rival penalised mechanism is employed to adaptively determine the suitable structure of the network, and therefore the number of fuzzy rules (starting from a guessed number). In this way, an initial knowledge is extracted from the data and expressed in the form of a rule base. The obtained knowledge is successively refined during the second step, where a supervised learning process of the neuro-fuzzy network is accomplished (based on a gradient descent technique), in order to attune the parameters of the fuzzy rule base to the numerical data. For the sake of conciseness, we omit further mathematical details concerning the learning algorithms, addressing the reader to [85].
4 Text Localisation: Illustrative Applications As previously stated, the different techniques for image segmentation present some drawbacks. Classical top-down approaches, based on run-length encoding and projection profiles, are sensitive to skewed text and perform well only with highly structured page layouts. On the contrary, bottom-up approaches are sensitive to font size, scanning resolution, interline and inter-character spacing. To overcome these problems, the employment of Computational Intelligence methods would be beneficial. Here we detail some of our experiments with the employment of fuzzy and neuro-fuzzy techniques. With reference to the classification directions proposed in this chapter, the first approach we are going to introduce can be classified as a region-based approach, which stands as a preliminary naive formulation of our research activity [86]. The involved image regions are classified as text or graphic regions, on the basis of their appearance (regularity) and shape. The classification process is realised by employing the peculiar neuro-fuzzy model described in Sect. 3.5. The second approach proposed is somewhat more involved and it is related to a multi-resolution segmentation scheme, belonging to the category of edge-based bottom-up approaches [87]. Here pixels are classified as text, graphics, or background, in accordance with their grey-level intensity and edge strength values, extracted from different resolution levels. In order to improve the segmentation results obtained from the initial pixel level classification phase, a region level analysis phase is performed. Both steps, namely pixel level analysis and region level analysis, are realised by the employment of the already mentioned neuro-fuzzy methodology.
The third approach, representing an example of a texture-based bottom-up approach, relies on a more sophisticated tool for multi-resolution analysis, the Discrete Wavelet Packet Transform [88]. To discriminate between text and non-text regions, the image is transformed into a Wavelet packet analysis tree. Successively, the feature image, exploited for the segmentation of text and non-text regions, is obtained from some of the nodes selected from the quadtree. The most discriminative nodes are derived using an optimality criterion and a genetic algorithm. Finally, the obtained feature image is segmented by means of Fuzzy C-Means clustering. All the proposed segmentation approaches have been evaluated using the Document Image Database available from the University of Oulu [89]. This database includes 233 images of articles, scanned from magazines, newspapers, books and manuals. The images vary both in quality and contents: some of them contain text paragraphs only (with Latin and Cyrillic fonts of different sizes), while others contain mixtures of text, pictures, photos, graphs and charts. Moreover, not all the documents are characterised by a regular (Manhattan) page layout.

4.1 Text Region Classification by a Neuro-Fuzzy Approach

The idea at this stage is to exploit a neuro-fuzzy classifier to label the different regions composing a document image. The work assumes that a database of segmented images is available, from which it is possible to extract a set of numerical features. The first step is a feature extraction process and consists in detecting the skew angle φ of each region as the dominant orientation of the straight lines passing through that region. Inside the text regions, being composed of characters and words, the direction of the text lines is highly regular. This regularity can be captured by means of the Hough transform [22, 90–92].
Particularly, the skew angle is detected as the angle for which the Hough transform of a specific region has the maximum value. The retrieved skew angle φ is used to obtain the projection profile of the document region. The profile is calculated by accumulating the pixel values in the region along its skew angle, so that the one-dimensional projection vector v_p is obtained. The elements of v_p codify the information about the spatial structure of the analysed region. For a text region, v_p should have a regular, high-frequency, sinusoidal-like shape, with peaks and valleys corresponding to the text lines and the interline spacings, respectively. In contrast, such regularities cannot be observed when a graphics region is considered. To measure the regularity of the v_p vector, a Power Spectral Density (PSD) [22] analysis is performed. Actually, for large paragraphs of text, the PSD coefficients show a significant peak around the frequency value corresponding approximately to the number of text lines in the region. For graphic regions, instead, the
Fig. 7. A region of a document image (a), its projection profile calculated for skew angle of 90 degrees (b) and PSD spectrum of the profile (c)
spectrum presents only a few peaks (one or two) around the lowest frequency values. A vector v_psd of PSD coefficients is calculated as follows:

v_psd = |FT(v_p)|²,   (22)
where FT(·) denotes the Fourier Transform [93]. An illustrative projection profile and its PSD spectrum for a sample text region are presented in Fig. 7. Generally, the number of components of the PSD spectrum vector v_psd is too large for it to be used directly as a feature vector for classification tasks. In order to reduce the dimensionality of v_psd, it can be divided into a number of intervals. In particular, we considered intervals of different lengths, corresponding to the scaled Fibonacci sequence with multiplying factor equal to two (i.e., 2, 4, 6, 10, 16, 26, 42, ...). In this way, we are able to preserve and to exploit most of the information accumulated in the first part of the PSD spectrum. For each interval, the maximum value of v_psd is derived, and the obtained maxima (normalised with respect to the highest one) represent the first seven components of the feature vector v_f, which will be employed in the successive region classification stage. To increase the classification accuracy, statistical information concerning the connectivity of the analysed region is extracted, thus extending the feature vector. At the end of the overall feature extraction process, every region of the segmented document image is represented as a feature vector v_f with ten elements, which are used for classification purposes. The final step is the classification of the regions described in terms of the feature vector v_f. Such a classification process has been performed by means of the neuro-fuzzy system introduced in Sect. 3.5. In the course of the experimental session concerning image region classification, the input vector x, involved in the fuzzy inference model, corresponds to the ten-dimensional
feature vector v_f, derived during the feature extraction process. The output vector y is related to the classes of the classification task (i.e., textual and graphical regions). The overall algorithm can be summarised as follows.

For each region:
1. Calculate the skew angle φ by means of the Hough transform
2. Obtain the projection profile v_p of the region along φ
3. Calculate v_psd from v_p
4. Obtain v_f by dividing v_psd into intervals
5. Classify the region as text or graphics on the basis of v_f by means of the neuro-fuzzy inference
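The feature extraction steps can be sketched as follows for an already deskewed binary region, so that the projection of step 2 is taken simply along the image rows; the interval bookkeeping and the function name are our own illustrative choices:

```python
import numpy as np

def region_features(region, n_feats=7):
    vp = region.sum(axis=1).astype(float)      # projection profile along the rows
    vpsd = np.abs(np.fft.fft(vp)) ** 2         # Eq. (22): PSD of the profile
    vpsd = vpsd[1:len(vpsd) // 2]              # drop the DC term and the mirrored half
    # interval lengths: scaled Fibonacci sequence 2, 4, 6, 10, 16, 26, ...
    lengths = [2, 4]
    while sum(lengths) < len(vpsd):
        lengths.append(lengths[-1] + lengths[-2])
    bounds = np.cumsum([0] + lengths)
    feats = np.array([vpsd[a:b].max()
                      for a, b in zip(bounds, bounds[1:]) if a < len(vpsd)])
    feats = feats[:n_feats]
    return feats / (feats.max() + 1e-12)       # normalise to the highest maximum
```

For a striped, text-like region the dominant PSD peak falls in one of the first intervals, while flat or irregular regions spread small maxima across all of them.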
4.2 Text Localisation by a Neuro-Fuzzy Segmentation

The idea at this stage consists in exploiting a neuro-fuzzy classifier for achieving both the segmentation of a document image and the final labelling of the derived regions. The described work is related to an edge-based approach for document segmentation, aiming at the identification of text, graphic and background regions. The overall methodology is based on the execution of two successive steps, working at different levels and configuring a bottom-up approach. In particular, an edge-based pre-processing step concerns a pixel level analysis, devoted to a preliminary classification of each image pixel into one of the previously described general classes. From the results of this phase, coherent regions are obtained by a merging procedure. To refine the obtained segmentation, an additional post-processing step is performed at region level, on the basis of shape regularity and skew angle analysis. This post-processing phase is beneficial for obtaining a final accurate segmentation of the document image. The peculiarity of the proposed approach relies on the employment of the neuro-fuzzy system both in the pre-processing pixel level analysis and in the post-processing region level refinement.

Low-Level Pixel Analysis

The aim of the low-level pixel analysis is to classify each pixel of a document image f(x, y) into the text, background or graphic category, according to its grey level and edge strength values. When extracting features from image data, the type of information that can be obtained may be strongly dependent on the scales at which the feature detectors are applied [94]. This can be perceptually verified with ease: when an image is viewed from near to far, the edge strength of a pixel generally decreases, but the relative decreasing rates for contour, regular and texture points are different. Starting from this kind of observation, we followed a multi-scale analysis of the image: assuming that
an image f(x, y) is given, let R be the number of scale representations considered for our analysis. In this way, a set of images {f^(1)(x, y), ..., f^(R)(x, y)} is involved and an edge map e(x, y) can be obtained from each image by means of the Sobel operator [22]. Since the information extracted from image data is strongly dependent on the image scale at which the feature detectors are applied, we have represented the images f(x, y) and e(x, y) as Gaussian pyramids with R different resolution levels. In the pyramid, the image at level r + 1 is generated from the image at level r by means of down-sampling by a factor of 2. Therefore, a set of edge maps {e^(1)(x, y), ..., e^(R)(x, y)} is generated during the creation of the pyramids and associated with the set of multi-scaled images. By examining the luminosity and edge strength information of the image at different resolution levels, it is possible to formulate a set of rules that enables the pixel classification. In this way, a pixel (x, y) is characterised by a feature vector of length 2R, containing information about intensity and edge strength at different resolution levels. Such a feature vector v_xy can be formalised as:

v_xy = (f^(1)(x, y), f^(2)(x/2, y/2), ..., f^(R)(x/2^{R−1}, y/2^{R−1}), e^(1)(x, y), e^(2)(x/2, y/2), ..., e^(R)(x/2^{R−1}, y/2^{R−1})).   (23)
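A sketch of the feature extraction of (23), where, for brevity, the pyramid down-sampling is approximated by plain 2 × 2 block averaging (a proper Gaussian pyramid would low-pass filter with a Gaussian kernel before down-sampling):

```python
import numpy as np

def sobel_mag(f):
    # Gradient magnitude via the Sobel operator, with naive edge padding
    p = np.pad(f.astype(float), 1, mode="edge")
    gx = (p[:-2, 2:] + 2 * p[1:-1, 2:] + p[2:, 2:]
          - p[:-2, :-2] - 2 * p[1:-1, :-2] - p[2:, :-2])
    gy = (p[2:, :-2] + 2 * p[2:, 1:-1] + p[2:, 2:]
          - p[:-2, :-2] - 2 * p[:-2, 1:-1] - p[:-2, 2:])
    return np.hypot(gx, gy)

def pixel_features(f, R=3):
    """Returns a function v(x, y) giving the 2R-dimensional vector of Eq. (23):
    intensities, then edge strengths, over the R pyramid levels."""
    feats, level = [], f.astype(float)
    for _ in range(R):
        feats.append((level, sobel_mag(level)))
        h, w = level.shape
        level = (level[:h - h % 2, :w - w % 2]
                 .reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3)))
    def v(x, y):
        return np.array([feats[r][0][y >> r, x >> r] for r in range(R)] +
                        [feats[r][1][y >> r, x >> r] for r in range(R)])
    return v
```
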
In order to derive a set of applicable rules encoding accurate information, we exploited the neuro-fuzzy system introduced in Sect. 3.5, which automatically derives a fuzzy rule base from a training set of manually labelled pixels. In this case, the neuro-fuzzy network consists of 2R inputs (corresponding to the elements of the vector v_xy), while the three output classes correspond to the recognised pixel categories (text, background, graphic). The obtained fuzzy rule base is applied to perform the pixel classification process, which ultimately produces three binary images: b_tex(x, y), b_gra(x, y) and b_bac(x, y). These images are composed of the pixel candidates of text, graphic and background regions, respectively. In order to obtain more coherent regions, a merging procedure is applied to each of the binary images, on the basis of a set of predefined morphological operations (including well-known image processing techniques, such as erosion, dilation and hole filling [95]).

High-Level Region Analysis

The high-level region analysis aims at providing a refinement of the text information extraction process. In other words, this step detects and corrects misclassified text regions identified during the previous analysis. To do that, the shape properties of every text region are analysed as follows. By examining the image b_tex, containing the text regions, we can firstly extract a number of connected components {E_t}_{t=1}^{T} representing the text regions to be analysed. Particularly, we are interested in processing the images composed of the pixels representing the perimeter of each region E_t. Each of them is mapped by the Hough transform from the spatial coordinates of E_t(x, y) to the polar coordinates of H_t(d, θ), where d denotes the distance of a line from the origin,
and θ ∈ [0, π) is the angle between this line and the x axis. The one-dimensional function

h(θ) = max_d H_t(d, θ),   (24)
(evaluated for each value of θ) contains information about the angles of the most dominant lines in the region E_t. In general, for a rectangular region with a skew angle of α degrees, the plot of h(θ) has two significant maximum values located at:

θ_1 = α degrees,   θ_2 = α + 90 degrees,   (25)

corresponding to the principal axes of the region. The presence or absence of such maxima is exploited to classify each text region as rectangular or non-rectangular, respectively. To obtain a set of linguistic rules suitable for this novel classification task, the neuro-fuzzy model adopted for classifying the image pixels is employed once again. In this case, the input vector x can be defined in terms of 20 elements, which synthetically describe the information content of h(θ). Particularly, the normalised values of h(θ) have been divided into 20 intervals of equal length, and the elements of x represent the mean values of h(θ) in each interval. The number of intervals has been empirically selected as a compromise between the length of the input vector (thus, the complexity of the neuro-fuzzy network structure) and the amount of information required for the following classification task (the accuracy of the classification). Moreover, h(θ) has been normalised, as the amplitude of the function carries information about the size of the region, which is irrelevant in this particular case and would hamper the classification process. The region E_t under analysis can be ultimately classified into one of two possible output classes: non-rectangular shape (in this case E_t is definitively labelled as a graphic region) and rectangular shape. The latter case opens the way for an analysis performed over the skew angle value. In particular, the skew angle α_t of a region E_t is chosen as the minimum angle value θ_1^t (see (25)), while the overall skew angle φ of the document is chosen as the most often occurring skew angle among all rectangular regions.
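The function h(θ) of (24) can be computed with a plain Hough accumulator over the perimeter pixels; for an axis-aligned rectangle the two maxima of (25) then appear at 0 and 90 degrees. A sketch (accumulator resolution and function name are our own choices):

```python
import numpy as np

def h_theta(mask):
    """h(theta) = max_d H_t(d, theta), Eq. (24), for the 'on' pixels of a
    binary mask, using 180 one-degree steps and integer distance bins."""
    ys, xs = np.nonzero(mask)
    thetas = np.deg2rad(np.arange(180))
    # signed distance of each point's line from the origin, for every theta
    d = np.rint(xs[:, None] * np.cos(thetas)
                + ys[:, None] * np.sin(thetas)).astype(int)
    h = np.empty(180)
    for t in range(180):
        _, counts = np.unique(d[:, t], return_counts=True)
        h[t] = counts.max()  # most populated line at this angle
    return h
```
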
Successively, simple thresholding is applied: if |αt − φ| is greater than some small angle β, then the rectangular region Et is re-classified as a graphic region; otherwise, Et retains its original text classification. Finally, graphic regions are recursively enlarged by the bounding boxes surrounding them, which are aligned according to φ. The overall proposed algorithm can be summarised as follows. For an input document image f(x, y):

1. Create a Gaussian pyramid {f(1)(x, y), . . . , f(R)(x, y)}.
2. For each level f(i)(x, y) of the pyramid, apply the Sobel operator to calculate its edge image e(i)(x, y).
3. Classify each pixel of the image as text, graphics or background according to the values of luminosity and edge strength in the pyramid. Create three binary images btex(x, y), bgra(x, y) and bbac(x, y) according to the classification results.
4. Process btex(x, y) and bgra(x, y): apply a median filter, apply dilation, remove small holes from the regions, apply erosion.
5. For each connected component Et in btex, obtain its perimeter (by removing interior pixels) and calculate its skew angle αt. Additionally, classify Et as rectangular or non-rectangular.
6. Calculate a histogram containing the skew angles of the connected components classified as rectangular. The most frequently occurring value is chosen as the overall skew angle φ.
7. For each connected component Et: if it is non-rectangular or it is not aligned with the overall skew angle, then reclassify it as a graphics region: btex(x, y) = btex(x, y) ∧ ¬Et(x, y), bgra(x, y) = bgra(x, y) ∨ Et(x, y).
8. Enlarge the graphics regions in bgra with bounding boxes aligned to φ.
9. Set the binary image of the background as bbac(x, y) = ¬(btex(x, y) ∨ bgra(x, y)).
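The morphological cleanup of step 4 can be sketched with plain NumPy (the median-filter part is omitted for brevity; the 3×3 structuring element and function names are assumptions, not prescribed by the text):

```python
import numpy as np

def dilate3x3(b):
    """Binary dilation with a 3x3 structuring element, via shifted copies."""
    p = np.pad(b, 1)
    out = np.zeros_like(b)
    for dy in (0, 1, 2):
        for dx in (0, 1, 2):
            out |= p[dy:dy + b.shape[0], dx:dx + b.shape[1]]
    return out

def erode3x3(b):
    """Erosion by duality: complement, dilate, complement."""
    return ~dilate3x3(~b)

def clean_regions(b):
    """Sketch of step 4: dilation closes small gaps and holes, the following
    erosion restores the region size (a morphological closing)."""
    return erode3x3(dilate3x3(b))
```

Closing a region in this way fills one-pixel holes while leaving the outer boundary essentially unchanged, which is why the dilation/erosion pair is applied to btex and bgra before connected-component analysis.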
4.3 Text Localisation by Wavelet Packet Segmentation

In this section we propose our methodology for document page segmentation into text and non-text regions, based on Discrete Wavelet Packet Transforms. This approach represents an extension of the work presented in Sect. 4.2, which is based on Gaussian image pyramids. In fact, two-dimensional Wavelet analysis is a more sophisticated tool for multi-resolution analysis than image pyramids. The main concern of the methodology is the automatic selection of the Wavelet packet coefficients describing text or background regions. Wavelet packet decomposition acts as a set of band-pass filters, allowing frequencies in the image to be localised much better than with standard Wavelet decomposition. The goal of the proposed feature extraction process is to obtain a basis for the Wavelet sub-bands that exhibits the highest discrimination power between text and non-text regions. This stage is realised by analysing the quadtree obtained by applying the Wavelet packet transform to a given image. In particular, the most discriminative nodes are selected among all the nodes {c1, . . . , c|τ|} in the quadtree τ, where |τ| = Σ_{j=0}^{d−1} 2^{2j} is the total number of nodes in a quadtree of depth d. This process is based on ground truth segmentation data.

Coefficient Extraction

Given an image f(x, y), the initial step consists in decomposing it using the Wavelet packet transform, so that the quadtree τ of Wavelet coefficients is obtained. An example of the decomposition is depicted in Fig. 8, where the
Fig. 8. DWPT decomposition of the image (a) at levels 1–2 (b–c). Each sub-image in (b–c) is a different node of the DWPT tree
coefficients of the nodes at each decomposition level are displayed as subimages. By visually analysing the figure, it can be observed that some of the sub-images appear to be more discriminating between text and non-text areas. To quantitatively evaluate the effectiveness of the node ci ∈ τ (associated with the matrix of Wavelet coefficients) in discriminating between text and non-text, the following procedure is performed. At first, the Wavelet coefficients ci are represented in terms of absolute values |ci |, because discrimination power does not depend on the coefficient signs. Then, the coefficients are divided into the sets Ti (text coefficients) and Ni (non-text coefficients), on the basis of the known ground truth segmentation of the image f (x, y).
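The quadtree structure of the packet decomposition can be illustrated with a self-contained sketch. The chapter uses Daubechies db2 filters (via a wavelet library); here a Haar analysis step is substituted so the example needs only NumPy, and the function names are illustrative:

```python
import numpy as np

def haar_step(a):
    """One 2-D Haar analysis step: split an even-sized array into the four
    sub-bands LL, LH, HL, HH at half resolution."""
    a = a.astype(float)
    p, q = a[0::2, 0::2], a[0::2, 1::2]
    r, s = a[1::2, 0::2], a[1::2, 1::2]
    ll = (p + q + r + s) / 2
    lh = (p - q + r - s) / 2
    hl = (p + q - r - s) / 2
    hh = (p - q - r + s) / 2
    return ll, lh, hl, hh

def wavelet_packet_quadtree(img, depth):
    """Full packet decomposition: unlike the standard DWT, *every* node is
    split again, so level j holds 4**j nodes and the quadtree contains
    sum(4**j for j in range(depth)) nodes in total."""
    levels = [[img.astype(float)]]
    for _ in range(depth - 1):
        nxt = []
        for node in levels[-1]:
            nxt.extend(haar_step(node))
        levels.append(nxt)
    return [n for level in levels for n in level]
```

With depth d = 4 this yields 1 + 4 + 16 + 64 = 85 nodes, matching the node count reported for the experiments in Sect. 5.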
For each set Ti and Ni, the mean and variance values are calculated, denoted as µi^T and σi^T for text and µi^N and σi^N for non-text, respectively. After that, the discrimination power Fi of the node ci is evaluated using the following optimality criterion, based on Fisher's criterion [96]:

Fi = (µi^T − µi^N)² / (σi^T + σi^N).   (26)
To a certain extent, Fi measures the signal-to-noise ratio between the text and non-text classes. The nodes with maximum inter-class distance and minimum intra-class variance have the highest discrimination power. The simplest approach to obtaining the best set of nodes, denoted as υ ⊂ τ, is to select a small number of nodes with the highest discrimination power. Then, a feature image f(x, y) can be obtained from the selected nodes υ. In particular, the Wavelet coefficients of the set υ are rescaled to the size of the image f(x, y) and then added together:

f(x, y) = Σ_{ci∈υ} ci(x, y),   (27)
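This simplest selection strategy can be sketched as follows, assuming the node coefficient maps have already been rescaled to the image size (the function names are illustrative):

```python
import numpy as np

def fisher_power(node, text_mask):
    """Discrimination power F_i of (26) for one rescaled coefficient map;
    absolute values are used because signs carry no discrimination info."""
    a = np.abs(node)
    t, n = a[text_mask], a[~text_mask]
    return (t.mean() - n.mean()) ** 2 / (t.var() + n.var() + 1e-12)

def top_nodes_feature_image(nodes, text_mask, k):
    """Keep the k nodes with the highest F_i and sum their |coefficients|
    into the feature image of (27)."""
    scores = [fisher_power(c, text_mask) for c in nodes]
    best = np.argsort(scores)[::-1][:k]
    return sum(np.abs(nodes[i]) for i in best), best
```

As noted below, ranking nodes individually ignores how they combine, and k must be chosen by hand; this is what motivates the genetic-algorithm formulation.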
where ci(x, y) denotes the |ci| values rescaled to match the size of the original image f(x, y). Even if this approach for obtaining υ is fast and simple, it is not an optimal technique for maximising the signal-to-noise ratio between the text and non-text classes. Moreover, the optimal number of nodes to be chosen for υ is unknown and must be selected manually. The problem of selecting the best nodes from all the available nodes is a combinatorial one, producing an exponential explosion of possible solutions. We propose to solve this problem by employing a genetic algorithm [97, 98]. In particular, each node ci ∈ τ is associated with a binary weight wi ∈ {0, 1}, so the tree τ is associated with a vector of weights W = [w1, . . . , wi, . . . , w|τ|]. Consequently, the subset of the best nodes is defined as υ = {ci ∈ τ : wi = 1}. Given a weight vector W of the nodes, the feature image f is calculated as follows:

f(x, y, W) = Σ_{i=1}^{|τ|} wi ci(x, y).   (28)
The discrimination power F of the subset υ can be computed by extending (26), evaluating the mean values µ^T, µ^N and the deviation values σ^T, σ^N of the values in the feature image f corresponding to text regions (T superscript) and non-text regions (N superscript):

F = (µ^T − µ^N)² / (σ^T + σ^N).   (29)
To find the optimal subset υ by means of (28), a genetic algorithm is applied in order to maximise the cost function F. Initially, a random population of K
weight vectors {Wi : i = 1, . . . , K}, represented as binary strings, is created. Successively, for each weight vector the feature image is calculated and its cost function is evaluated using (29). The best individuals are subject to crossover and mutation operators in order to produce the next generation of weight vectors. Finally, the optimal subset υ is determined by the best individual in the evolved population, and the feature image f(x, y) is obtained by merging the set of coefficients in the nodes of υ, as described in (27) or (28).
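The search over binary weight vectors can be sketched as a small elitist genetic algorithm. The operator details (one-point crossover, per-bit mutation, tournament-free truncation selection) are assumptions chosen for brevity, not specified by the text:

```python
import numpy as np

def fisher(feature_img, text_mask):
    """Extended Fisher criterion (29) of a combined feature image."""
    t, n = feature_img[text_mask], feature_img[~text_mask]
    return (t.mean() - n.mean()) ** 2 / (t.var() + n.var() + 1e-12)

def ga_select(nodes, text_mask, pop_size=20, generations=30,
              p_cross=0.8, p_mut=0.2, seed=0):
    """GA over binary node weights W: fitness is (29) applied to the
    feature image of (28). Returns the best weight vector found."""
    rng = np.random.default_rng(seed)
    stack = np.stack([np.abs(c) for c in nodes])        # (n_nodes, H, W)
    n = len(nodes)
    pop = rng.integers(0, 2, size=(pop_size, n))
    def fitness(w):
        return fisher((w[:, None, None] * stack).sum(axis=0), text_mask)
    for _ in range(generations):
        scores = np.array([fitness(w) for w in pop])
        pop = pop[np.argsort(scores)[::-1]]             # best first
        children = [pop[0], pop[1]]                     # keep two elites
        while len(children) < pop_size:
            a, b = pop[rng.integers(0, pop_size // 2, size=2)]
            child = a.copy()
            if rng.random() < p_cross:                  # one-point crossover
                cut = rng.integers(1, n) if n > 1 else 1
                child = np.concatenate([a[:cut], b[cut:]])
            flip = rng.random(n) < p_mut                # bit-flip mutation
            children.append(np.where(flip, 1 - child, child))
        pop = np.array(children)
    scores = np.array([fitness(w) for w in pop])
    return pop[np.argmax(scores)]
```

Keeping the two best individuals unmutated guarantees that the best F value never decreases between generations, which matters with the fairly aggressive mutation rate used here.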
5 Experimental Results and Discussion

To test the effectiveness of the presented methodology, we have employed a publicly available document image database [89]. In particular, the preliminary region-based approach presented first has been tested on 306 graphic regions and 894 text regions, which were extracted from the database and automatically labelled. The extracted feature vectors were divided into a training set composed of 900 samples and a testing set composed of the remaining 300 observations. The proportions between text and graphics regions were preserved in both datasets. A set of 12 fuzzy rules has been extracted from the training set by means of the unsupervised neuro-fuzzy learning procedure previously detailed. Successively, the rules have been refined using the gradient-descent technique of back-propagation. Table 1 reports the classification accuracy over the training and testing sets produced by the neuro-fuzzy system, both for the initial and the refined rule base.

Table 1. Overall classification accuracy of the document regions

                          Number of rules   Training set (%)   Test set (%)
Initial fuzzy rule base   12                95.71              93.53
Refined fuzzy rule base   12                95.80              93.60

The classification results are satisfactory in terms of accuracy. However, the most common error is the misclassification of short text regions (one or two lines of text), as can be observed also in Fig. 9. The main reason for this is the insufficient regularity in the projection profiles of such regions. Nevertheless, the strong points of the proposed method are the ability to process skewed documents and the invariance to font shape and font size.

The second approach proposed has been tested using 40 images related to magazines and newspapers, drawn from the Oulu document image database. For the purpose of pixel classification, a three-level Gaussian pyramid was built from the original image. From the knowledge extraction process performed by
Fig. 9. Classification results obtained for two sample images. Dark regions have been classified as text, while light regions have been classified as graphics

Table 2. Pixel level classification accuracy

Data set   Text (%)   Graphics (%)   Background (%)
Training   91.54      85.42          93.33
Testing    91.54      86.05          95.66
the neuro-fuzzy system over a pre-compiled training set, a fuzzy rule base comprising 12 rules has been obtained. Table 2 reports the accuracy of the pixel classification process (considering both a training and a testing set); the classification results for an illustrative image from the database are presented in Fig. 10. The further application of the neuro-fuzzy system, during the high-level analysis, was performed over a pre-compiled training set including the feature vector information related to 150 regions. The obtained rule base comprises 10 fuzzy rules and its classification accuracy is reported in Table 3, considering both training and testing sets. The final segmentation results for the previously considered sample image are presented in Fig. 11. The accuracy of the method can be quantitatively measured using ground truth knowledge deriving from the correct segmentation of the 40 images employed. The effectiveness of the overall process is expressed by a measure of segmentation accuracy Pa, defined as:

Pa = (number of correctly segmented pixels / number of all pixels) ∗ 100%.   (30)
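The measure (30) reduces to a one-line comparison of label images; the function and argument names below are illustrative:

```python
import numpy as np

def segmentation_accuracy(predicted, ground_truth):
    """Pa of (30): percentage of pixels whose predicted label
    (e.g. text / graphics / background) matches the ground truth."""
    return 100.0 * np.mean(predicted == ground_truth)
```

Note that Pa weights every pixel equally, so large background areas dominate the score; this explains why the pixel-classification stage alone can look poor on this measure even when the text pixels themselves are found.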
Fig. 10. Classification of the pixels of an input image (a) into text (b), graphics (c) and background (d) classes

Table 3. Region level classification accuracy

Data set   Rectangular (%)   Non-rectangular (%)
Training   97.43             92.85
Testing    94.11             93.93

Table 4 reports the mean values of segmentation accuracy obtained over the entire set of analysed images, distinguishing among the different methodology
steps. The apparently poor results obtained at the end of the pixel classification step are due to the improper identification of text regions (only the pixels corresponding to the words are classified as text). The effectiveness of the initial stage of pixel classification is demonstrated by the rapid increase of the accuracy values achieved in the subsequent merging process.

Fig. 11. Final segmentation of a sample image (a) into text (b), graphics (c) and background (d) regions

Table 4. Overall segmentation accuracy expressed in terms of Pa. "PC" and "MO" stand for Pixel Classification and Morphological Operation, respectively

           Text (%)   Graphics (%)   Background (%)   Image (%)
PC         59.92      88.32          52.93            50.59
PC + MO    96.65      90.63          93.26            90.27
Final      98.19      96.36          97.99            97.51

The quantitative measure of segmentation accuracy allows for comparison with other existing techniques. As an example, we can compare the results illustrated in Table 4 with those reported in [17], where a polynomial spline Wavelet approach has been proposed and the same kind of measure has been employed to quantify the overall accuracy. In particular, the best results in [17] achieved an accuracy of 98.29%. Although our methodology produced slightly lower accuracy results, it should be observed that we analysed a total
number of 40 images, instead of the 6 images considered in [17]. Finally, it can be noted that our approach may be extended to colour documents using the HSV system [22]. In this case, the Gaussian pyramid could be evaluated for the H and S components and the edge information for the V component.

The texture-based approach presented last has been tested on 40 images extracted from the Oulu database: in order to obtain the feature images, each image has been decomposed by Daubechies db2 Wavelet functions [59] into three levels of coefficients. One of these document images has been manually segmented to create ground truth segmentation data. The best nodes have been selected by means of a basic genetic algorithm [97, 98] with an initial population of
20 weight vectors. New generations of the vector population have been produced by crossover (80%) and mutation (20%) operators. After 50 generations, the best subset of nodes has been obtained, containing 39 out of all 85 nodes. Additionally, it should be noted that more than one image can be combined into one larger image for the purpose of node selection. Using the selected nodes, the feature images f(x, y) have been evaluated for each considered image. Then, we applied the Fuzzy C-Means algorithm [74] to each image f(x, y), in order to group its pixels into two clusters, corresponding to text and non-text regions. The final segmented image has been obtained by replacing each pixel of f(x, y) with its cluster label. As the clustering is performed in the feature space rather than in the image space, additional post-processing is necessary to refine the segmentation. In particular, a median filter is applied to remove small noisy regions, while preserving the edges of larger regions. Successively, a morphological closing is applied to the filtered image, in order to merge nearby text regions (i.e. letters, words, text lines) into larger ones (i.e. paragraphs, columns). Figure 12 shows an example of a feature image, obtained from a document page, and its final segmentation. The segmentation accuracy has been evaluated by the measure Pa previously described. For this purpose, the ground truth segmentation of each image has been obtained automatically, according to the additional information in the database. Moreover, to test the robustness of the method against page skew, some of the images have been randomly rotated. The obtained segmentation accuracy has an average value of 92.63%, with a highest value of 97.18% and a lowest value of 84.37%. Some results are shown in Fig. 13. The results are comparable with other state-of-the-art document image segmentation techniques.
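The Fuzzy C-Means step can be sketched in a few lines for the one-dimensional case (clustering the flattened feature-image values into two groups); this is a generic textbook formulation of FCM [74], not the chapter's exact implementation, and the names are illustrative:

```python
import numpy as np

def fuzzy_c_means(x, c=2, m=2.0, iters=100, tol=1e-6, seed=0):
    """Sketch of Fuzzy C-Means on a 1-D array of feature values: returns
    the membership matrix U (rows sum to 1) and the cluster centres."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(x), c))
    U /= U.sum(axis=1, keepdims=True)           # memberships sum to 1
    for _ in range(iters):
        Um = U ** m
        centres = (Um.T @ x) / Um.sum(axis=0)   # fuzzy-weighted means
        d = np.abs(x[:, None] - centres[None, :]) + 1e-12
        inv = d ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        if np.abs(U_new - U).max() < tol:
            return U_new, centres
        U = U_new
    return U, centres
```

Each pixel then receives the label argmax over its memberships, after which the median filtering and morphological closing described above turn the per-pixel labels into coherent text and non-text regions.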
Once again, we report as an example that the best result obtained in [17] is 98.29% (over only 6 images considered). Nevertheless, the approach proves to be robust against page skew and provides good results when dealing with images presenting different font sizes and styles.
Fig. 12. Document image (a), its corresponding feature image (b) and segmentation result (c)
Fig. 13. Segmentation results: segmentation of the document image (a), invariance to page skew (b) and invariance to font changes (c)
6 Conclusions

Text information represents a very important component among the contents of a digital image. The importance of achieving text information by means of image analysis is straightforward. In fact, text can be variously used to describe the content of a document image, and it can be converted into electronic format (for memorisation and archiving purposes). In particular, different steps can be isolated, corresponding to the sequential sub-problems which characterise the overall text information extraction task. In this chapter, we addressed the specific problem of text localisation. The peculiarity of the present work consists in discussing text localisation methods based on the employment of fuzzy techniques. When dealing with text localisation, we are particularly involved with the problem of digital image segmentation, and the adoption of the fuzzy paradigm is desirable in such a research field. This is due to the uncertainty and imprecision present in images, deriving from noise, image sampling, lighting variations and so on. Fuzzy theory provides a mathematical tool to deal with this imprecision and ambiguity in an elegant and efficient way. Fuzzy techniques can be applied to different phases of the segmentation process; additionally, fuzzy logic makes it possible to represent the knowledge about the given problem in terms of linguistic rules with meaningful variables, which is the most natural way to express and interpret information.

After reviewing a number of classical image segmentation methods, we provided a presentation of fuzzy techniques which commonly find application in the context of digital image processing. In particular, we showed the benefits coming from the fruitful integration of fuzzy logic and neural computation, and we introduced a particular model for a neuro-fuzzy system. By doing so, we indicated a way to combine Computational Intelligence methods and document image analysis.
In addition, a number of our research works have been
illustrated as examples of applications of fuzzy and neuro-fuzzy techniques for text localisation in images. The presentation of the research works is intended to focus the interest of the reader on the possibilities of these innovative methods, which are by no means exhausted with the hints provided in this chapter. In fact, a number of future research lines can be addressed, ranging from the analysis of different image features (such as colour), to the direct application of Computational Intelligence mechanisms to deal with the large amount of web image contents.
References

1. Colombo C, Del Bimbo A, Pala P (1999) IEEE Multimedia 6(3):38–53
2. Long F, Zhang H, Feng D (2003) Fundamentals of content-based image retrieval, in: Feng D, Siu WC, Zhang H (eds.) Multimedia information retrieval and management – technological fundamentals and applications. Springer, Berlin Heidelberg New York
3. Yang M, Kriegman D, Ahuja N (2002) IEEE Trans Pattern Anal Mach Intell 24(1):34–58
4. Dingli A, Ciravegna F, Wilks Y (2003) Automatic semantic annotation using unsupervised information extraction and integration, in: Proceedings of SemAnnot workshop
5. Djioua B, Flores JG, Blais A, Desclés JP, Guibert G, Jackiewicz A, Priol FL, Nait-Baha L, Sauzay B (2006) EXCOM: An automatic annotation engine for semantic information, in: Proceedings of FLAIRS conference, pp. 285–290
6. Orasan C (2005) Automatic annotation of corpora for text summarisation: A comparative study, in: Computational linguistics and intelligent text processing, volume 3406/2005, Springer, Berlin Heidelberg New York
7. Karatzas D, Antonacopoulos A (2003) Two approaches for text segmentation in web images, in: Proceedings of the 7th International Conference on Document Analysis and Recognition (ICDAR 2003), IEEE Computer Society Press, Cambridge, UK, pp. 131–136
8. Jung K, Kim K, Jain A (2004) Pattern Recognit 37:977–997
9. Chen D, Odobez J, Bourlard H (2002) Text segmentation and recognition in complex background based on Markov random field, in: Proceedings of International Conference on Pattern Recognition, pp. 227–230
10. Li H, Doermann D, Kia O (2000) IEEE Trans Image Process 9(1):147–156
11. Li H, Doermann D (2000) Superresolution-based enhancement of text in digital video, in: Proceedings of International Conference on Pattern Recognition, pp. 847–850
12. Li H, Kia O, Doermann D (1999) Text enhancement in digital video, in: Proceedings of SPIE, Document Recognition IV, pp. 1–8
13. Sato T, Kanade T, Hughes E, Smith M (1998) Video OCR for digital news archive, in: Proceedings of IEEE Workshop on Content-based Access of Image and Video Databases, pp. 52–60
14. Zhou J, Lopresti D, Lei Z (1997) OCR for world wide web images, in: Proceedings of SPIE on Document Recognition IV, pp. 58–66
15. Zhou J, Lopresti D, Tasdizen T (1998) Finding text in color images, in: Proceedings of SPIE on Document Recognition V, pp. 130–140
16. Ching-Yu Y, Tsai WH (2000) Signal Process: Image Commun 15(9):781–797
17. Deng S, Latifi S, Regentova E (2001) Document segmentation using polynomial spline wavelets, Pattern Recognit 34:2533–2545
18. Lu Y, Shridhar M (1996) Character segmentation in handwritten words, Pattern Recognit 29(1):77–96
19. Mital D, Leng GW (1995) J Microcomput Appl 18(4):375–392
20. Rossant F (2002) Pattern Recognit Lett 23(10):1129–1141
21. Xiao Y, Yan H (2003) Text extraction in document images based on Delaunay triangulation, Pattern Recognit 36(3):799–809
22. Pratt W (2001) Digital image processing, 3rd edition. Wiley, New York, NY
23. Haralick R (1979) Proc IEEE 67:786–804
24. Haralick R, Shanmugam K, Dinstein I (1973) Textural features for image classification, IEEE Trans Syst Man Cybern 3:610–621
25. Baird H, Jones S, Fortune S (1990) Image segmentation by shape-directed covers, in: Proceedings of International Conference on Pattern Recognition, pp. 820–825
26. Nagy G, Seth S, Viswanathan M (1992) Method of searching and extracting text information from drawings, Computer 25:10–22
27. O'Gorman L (1993) IEEE Trans Pattern Anal Mach Intell 15:1162–1173
28. Kose K, Sato A, Iwata M (1998) Comput Vis Image Underst 70:370–382
29. Wahl F, Wong K, Casey R (1982) Graph Models Image Process 20:375–390
30. Jain A, Yu B (1998) IEEE Trans Pattern Anal Mach Intell 20:294–308
31. Pavlidis T, Zhou J (1992) Graph Models Image Process 54:484–496
32. Hadjar K, Hitz O, Ingold R (2001) Newspaper page decomposition using a split and merge approach, in: Proceedings of Sixth International Conference on Document Analysis and Recognition
33. Jiming L, Tang Y, Suen C (1997) Pattern Recognit 30(8):1265–1278
34. Rosenfeld A, la Torre PD (1983) IEEE Trans Syst Man Cybern SMC-13:231–235
35. Sahasrabudhe S, Gupta K (1992) Comput Vis Image Underst 56:55–65
36. Sezan M (1985) Graph Models Image Process 29:47–59
37. Yanni M, Horne E (1994) A new approach to dynamic thresholding, in: Proceedings of EUSIPCO'94: 9th European Conference on Signal Processing, vol 1, pp. 34–44
38. Sezgin M, Sankur B (2004) J Electron Imaging 13(1):146–165
39. Kamel M, Zhao A (1993) Graph Models Image Process 55(3):203–217
40. Solihin Y, Leedham C (1999) Integral ratio: A new class of global thresholding techniques for handwriting images, IEEE Trans Pattern Anal Mach Intell 21:761–768
41. Trier O, Jain A (1995) Goal-directed evaluation of binarization methods, IEEE Trans Pattern Anal Mach Intell 17:1191–1201
42. Bow ST (2002) Pattern recognition and image preprocessing, 2nd edition. Dekker, New York, NY
43. Jung K, Han J (2004) Pattern Recognit Lett 25(6):679–699
44. Ohya J, Shio A, Akamatsu S (1994) IEEE Trans Pattern Anal Mach Intell 16(2):214–224
45. Wu S, Amin A (2003) in: Proceedings of Seventh International Conference on Document Analysis and Recognition, volume 1, pp. 493–497
46. Canny J (1986) IEEE Trans Pattern Anal Mach Intell 8(6):679–698
47. Chen D, Shearer K, Bourlard H (2001) Text enhancement with asymmetric filter for video OCR, in: Proceedings of International Conference on Image Analysis and Processing, pp. 192–197
48. Hasan Y, Karam L (2000) IEEE Trans Image Process 9(11):1978–1983
49. Lee SW, Lee DJ, Park HS (1996) IEEE Trans Pattern Anal Mach Intell 18(10):1045–1050
50. Grigorescu SE, Petkov N, Kruizinga P (2002) IEEE Trans Image Process 11(10):1160–1167
51. Livens S, Scheunders P, van de Wouwer G, Van Dyck D (1997) Wavelets for texture analysis, an overview, in: Proceedings of the Sixth International Conference on Image Processing and Its Applications, pp. 581–585
52. Tuceryan M, Jain AK (1998) Texture analysis, in: Chen CH, Pau LF, Wang PSP (eds.) The Handbook of Pattern Recognition and Computer Vision, 2nd edition, World Scientific Publishing, River Edge, NJ, pp. 207–248
53. Jain A, Bhattacharjee S (1992) Mach Vision Appl 5:169–184
54. Acharyya M, Kundu M (2002) IEEE Trans Circ Syst Video Technol 12(12):1117–1127
55. Etemad K, Doermann D, Chellappa R (1997) IEEE Trans Pattern Anal Mach Intell 19(1):92–96
56. Mao W, Chung F, Lam K, Siu W (2002) Hybrid Chinese/English text detection in images and video frames, in: Proceedings of International Conference on Pattern Recognition, volume 3, pp. 1015–1018
57. Coifman R, Wickerhauser V (1992) IEEE Trans Inf Theory 38(2):713–718
58. Coifman RR (1990) Wavelet analysis and signal processing, in: Auslander L, Kailath T, Mitter SK (eds.) Signal Processing, Part I: Signal Processing Theory, Springer, Berlin Heidelberg New York, pp. 59–68. URL citeseer.ist.psu.edu/coifman92wavelet.html
59. Daubechies I (1992) Ten Lectures on Wavelets (CBMS-NSF Regional Conference Series in Applied Mathematics), Society for Industrial and Applied Mathematics
60. Bruce A, Gao HY (1996) Applied Wavelet Analysis with S-Plus, Springer, Berlin Heidelberg New York
61. Mallat SG (1989) IEEE Trans Pattern Anal Mach Intell 11(7):674–693
62. Engelbrecht A (2003) Computational Intelligence: An Introduction, Wiley, New York, NY
63. Sincak P, Vascak J (eds.) (2000) Quo vadis computational intelligence?, Physica-Verlag
64. Zadeh L (1965) Inform Control 8:338–353
65. Klir G, Yuan B (eds.) (1996) Fuzzy sets, fuzzy logic, and fuzzy systems: selected papers by Lotfi A. Zadeh, World Scientific Publishing, River Edge, NJ
66. Pham T, Chen G (eds.) (2000) Introduction to Fuzzy Sets, Fuzzy Logic, and Fuzzy Control Systems, CRC Press, Boca Raton, FL
67. Jawahar C, Ray A (1996) IEEE Signal Process Lett 3(8):225–227
68. Jin Y (2003) Advanced Fuzzy Systems Design and Applications, Physica/Springer, Heidelberg
69. Mamdani E, Assilian S (1975) Int J Man-Mach Studies 7(1):1–13
70. Sugeno M, Kang G (1988) Structure identification of fuzzy model, Fuzzy Sets Syst 28:15–33
71. Dubois D, Prade H (1996) Fuzzy Sets Syst 84:169–185
72. Leekwijck W, Kerre E (1999) Fuzzy Sets Syst 108(2):159–178
73. Dunn J (1974) J Cybern 3:32–57
74. Bezdek J (1981) Pattern Recognition with Fuzzy Objective Function Algorithms (Advanced Applications in Pattern Recognition), Springer, Berlin Heidelberg New York
75. MacQueen J (1967) Some methods of classification and analysis of multivariate observations, in: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297
76. Pham D (2001) Comput Vis Image Underst 84:285–297
77. Bezdek J, Hall L, Clarke L (1993) Med Phys 20:1033–1048
78. Rignot E, Chellappa R, Dubois P (1992) IEEE Trans Geosci Remote Sensing 30(4):697–705
79. Jang JS, Sun C (1995) Proc IEEE 83:378–406
80. Kosko B (1991) Neural networks and fuzzy systems: a dynamical systems approach to machine intelligence, Prentice Hall, Englewood Cliffs, NJ
81. Lin C, Lee C (1996) Neural fuzzy systems: a neural fuzzy synergism to intelligent systems, Prentice-Hall, Englewood Cliffs, NJ
82. Mitra S, Hayashi Y (2000) IEEE Trans Neural Netw 11(3):748–768
83. Nauck D (1997) Neuro-fuzzy systems: review and prospects, in: Proceedings of Fifth European Congress on Intelligent Techniques and Soft Computing (EUFIT'97), pp. 1044–1053
84. Fuller R (2000) Introduction to Neuro-Fuzzy Systems, Springer, Berlin Heidelberg New York
85. Castellano G, Castiello C, Fanelli A, Mencar C (2005) Fuzzy Sets Syst 149(1):187–207
86. Castiello C, Gorecki P, Caponetti L (2005) Neuro-fuzzy analysis of document images by the KERNEL system, Lecture Notes in Artificial Intelligence 3849:369–374
87. Caponetti L, Castiello C, Gorecki P (2007) Document page segmentation using neuro-fuzzy approach, to appear in Applied Soft Computing Journal
88. Gorecki P, Caponetti L, Castiello C (2006) Multiscale page segmentation using wavelet packet analysis, in: Abstracts of VII Congress Italian Society for Applied and Industrial Mathematics (SIMAI 2006), p. 210
89. University of Oulu, Finland, Document Image Database, http://www.ee.oulu.fi/research/imag/document/
90. Hinds S, Fisher J, D'Amato D (1990) A document skew detection method using run-length encoding and Hough transform, in: Proceedings of the 10th International Conference on Pattern Recognition (ICPR), pp. 464–468
91. Hough P (1959) Machine analysis of bubble chamber pictures, in: International Conference on High Energy Accelerators and Instrumentation, CERN
92. Srihari S, Govindaraju V (1989) Mach Vision Appl 2:141–153
93. Gonzalez R, Woods R (2007) Digital Image Processing, 3rd edition, Prentice Hall
94. Lindeberg T (1994) Scale-space theory in computer vision, Kluwer, Boston
95. Watt A, Policarpo F (1998) The Computer Image, ACM Press/Addison-Wesley
96. Sammon J (1970) IEEE Trans Comput C-19:826–829
97. Holland J (1992) Adaptation in Natural and Artificial Systems, reprint edition, MIT Press, Cambridge, MA
98. Mitchell M (1996) An Introduction to Genetic Algorithms, MIT Press, ISBN 0-262-13316-4