Text/Graphics Separation using Agent-based Pyramid Operations

Chew Lim Tan, Bo Yuan, Weihua Huang, Qian Wang, Zheng Zhang
Department of Computer Science, School of Computing
National University of Singapore
Kent Ridge, Singapore 119260

Abstract

This paper describes a document image analysis system in which multiple agents work on a pyramid structure to separate text from graphics in an image. Text strings appear as different groupings of connected components at different image resolutions. The pyramid structure, being a multi-resolution image representation, therefore provides a natural means of identifying and grouping character strings in the document at different levels of resolution. The pyramid structure is also amenable to parallel processing: multiple agents in the system can individually and concurrently look for groups of connected components at appropriate levels. Unlike other existing works, the agent-based pyramid operations do not require expensive feature analysis among different connected components to detect text strings.

1. Introduction

Computer vision research today still faces greater difficulties in understanding graphical objects than in understanding character strings, which can be handled by OCR with reasonable success. Any words that the computer can find in a document image will therefore assist in understanding the image. One major technique reported in the literature for text/graphics separation relies on connected component analysis [1-3] through examination of the relationships among neighboring components. Fletcher and Kasturi [1] use the Hough transform to detect sets of connected components that lie along straight lines and then analyze the spatial relationships among characters and words. He, Abe and Tan [2], on the other hand, require a pre-processing step that builds a data structure for computing a Euclidean radius for each connected component, which is later used to cluster characters and words. Hase et al. [3] use a multi-stage relaxation technique that iterates through a labeling process to fix the final labels of all connected components based on their mutual relationships with neighboring components.

Existing methods suffer from the need to assume some linear arrangement of characters and from the computational cost of dealing with individual characters at full resolution to find inter-character relationships. The present method uses a pyramid structure to facilitate connected component analysis without the expensive process of analyzing inter-component relationships. The work reported in this paper continues our earlier attempts to use a pyramid [4,5], adopting different multi-agent operations and a different clustering process.

2. The pyramid

Mathematically, a pyramid is defined as a set P of cells together with a function F that assigns a value v to each cell. Thus

$$P = \{\, (v, k, i, j) \mid 0 \le k \le L;\ 0 \le i, j \le 2^{k} - 1;\ v = F(k, i, j) \,\} \qquad (1)$$

is a pyramid of L + 1 levels, whose pixel (v, k, i, j) at level k contains the value v = F(k, i, j). The function F determines how a pyramid level is formed from the level immediately below. We use a so-called binary pyramid or bit-pyramid [6], in which the value in each cell is either 0 or 1. The function F is such that a cell at level k has the value 1 if any of its four descendant cells at level k+1 has the value 1; otherwise, the cell has the value 0.

The pyramid allows characters and words to be grouped at appropriate resolution levels so that they can be extracted as a whole without the expensive computation of inter-character and inter-word relationships. Furthermore, multiple agents in the system can individually and concurrently analyze text regions of various sizes in the image.
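As a minimal sketch of the bit-pyramid rule of Section 2 (a cell at level k is 1 if any of its four descendants at level k+1 is 1), the following plain-Python code builds all levels from a square binary base image. The names `reduce_level` and `build_bit_pyramid` and the list-of-lists image representation are assumptions made for illustration; they do not come from the paper, which builds the pyramid on connected components rather than on a raw image.

```python
from typing import List

Grid = List[List[int]]  # square binary image, values 0 or 1 (illustrative representation)


def reduce_level(child: Grid) -> Grid:
    """Form level k from level k+1: a cell is 1 if any of its four descendants is 1."""
    half = len(child) // 2
    return [
        [
            child[2 * i][2 * j] | child[2 * i][2 * j + 1]
            | child[2 * i + 1][2 * j] | child[2 * i + 1][2 * j + 1]
            for j in range(half)
        ]
        for i in range(half)
    ]


def build_bit_pyramid(base: Grid) -> List[Grid]:
    """Return levels[0..L], where levels[L] is the base and levels[0] is 1 x 1."""
    side = len(base)
    if side == 0 or side & (side - 1) or any(len(row) != side for row in base):
        raise ValueError("base must be a square grid whose side is a power of two")
    levels = [base]
    while len(levels[0]) > 1:
        levels.insert(0, reduce_level(levels[0]))  # each new level goes on top
    return levels


# A 4 x 4 base (level 2) with two isolated black pixels.
base = [
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
pyramid = build_bit_pyramid(base)
```

With this indexing, a 512 × 512 base yields ten levels, numbered 0 (top) to 9 (base), which matches the level numbering used in the paper.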

3. Agent-based pyramid operations

3.1. Preprocessing

Preprocessing entails the construction of 8-connected components and the removal of big components, which are deemed to be graphics objects. Such removal is necessary because big and obvious graphics objects would eventually touch and join textual objects during pyramid construction, making text extraction impossible. The removal is done through an area versus frequency-of-occurrence histogram [1]: because text characters are of roughly similar size, objects whose sizes lie outside the permissible range are assumed to be graphics objects and are removed. With the graphics objects removed, the system builds the pyramid structure only on the remaining connected components rather than on the entire image. The system also computes the average height, width and area of the connected components, which later serve as filter values for text extraction.

As each higher level is created, the connected components in that level are formed at the same time. In addition, an inter-level linking operation is performed for each connected component to find the connected components that it splits into at the next lower level. This operation is repeated top-down until the level just above the base level, with pointers set so that every connected component in the pyramid knows which connected components it splits into at the next lower level.

We define a density measure to facilitate the clustering process later. Basically, a higher density of a number of components means a greater possibility that these components will be included in logical groups at the next lower level. A connected component is considered dense when the size of its group is greater than the average size over all groups and the size of the component is greater than about twice the average group size. The exact decision is as follows:

(1) If group size < average group size of the level, then every component c in the group is considered sparse. Otherwise, compute the density of the group and, for each component c in the group:
(2) if size of c > 2.3 * average group size of the level, then c is considered dense;
(3) else if size of c >= 2 * density of the group, then c is considered dense;
otherwise c is considered sparse.

A 10-level pyramid is constructed in the above process, from the base, i.e. level 9 (512 × 512), to the top, i.e. level 0 (1 × 1). The single pixel at the top represents the connected component formed through compression of all objects in the levels below. At each level, every connected component holds information about the components it splits into at the next lower level. Furthermore, each dense component is recognized and marked.
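The dense/sparse decision of Section 3.1 can be sketched as a small function. The thresholds 2.3 and 2 are taken from decision rules (2) and (3); everything else is an assumption for illustration. In particular, the paper does not spell out how group size and group density are measured, so they are passed in as plain numbers, and the function name `classify_group` is hypothetical.

```python
from typing import List


def classify_group(
    component_sizes: List[float],
    group_size: float,
    group_density: float,
    avg_group_size: float,
) -> List[str]:
    """Label each component of one group as 'dense' or 'sparse'."""
    # Rule (1): a group smaller than the level average has only sparse components.
    if group_size < avg_group_size:
        return ["sparse"] * len(component_sizes)

    labels = []
    for size in component_sizes:
        if size > 2.3 * avg_group_size:      # rule (2)
            labels.append("dense")
        elif size >= 2 * group_density:      # rule (3)
            labels.append("dense")
        else:
            labels.append("sparse")
    return labels


# Illustrative numbers only; they are not taken from the paper's experiments.
large_group = classify_group([50, 12, 3], group_size=65, group_density=5.0, avg_group_size=20.0)
small_group = classify_group([9, 6], group_size=15, group_density=1.0, avg_group_size=20.0)
```

In the first call, the component of size 50 is dense by rule (2), the one of size 12 is dense by rule (3), and the one of size 3 is sparse. In the second call the whole group falls under rule (1).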

3.2. The agent process

At the beginning, several agents start working at the start level and proceed to descend to lower levels. At each descent, an agent examines whether a connected component from the previous level has become disjoint at the finer resolution. If so, a clustering process, based on a method adopted from the Newton formula in physics, groups components that are in close proximity into clusters. Some components may be deemed dense; instead of being added to clusters, they are subjected to further agent processing at the level below to detect any logical text groups at the next finer resolution. For the remaining components, each cluster is deemed a logical grouping of possible text elements. The agent stops its descent, and the components in such clusters are individually subjected to a further test to ascertain that they are text. If the test succeeds, the system displays the component in question; any parts deemed to be graphics objects are thus suppressed.

Instead of starting at level 0 as in the earlier system, the present system starts at level 5, since this level generally contains enough important detail for manipulation while not having too many connected components. Figure 1 shows the pseudo-code of the agent-based pyramid operations on component n at pyramid level k. The agents initiated within the parallel-do can run in parallel; in the present implementation, this is simulated in a multi-threaded environment.
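To make the descent concrete, here is a hedged, sequential Python sketch that follows the structure of the Figure 1 pseudo-code. It is an illustration, not the authors' implementation: the `Component` class, the `is_text` and `cluster` callables and the `out` list are hypothetical stand-ins for the paper's component records, text test, clustering step and display step. The dense sub-agents run one after another here, whereas the paper runs them in parallel threads.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Component:
    """Hypothetical stand-in for a connected component and its inter-level links."""
    name: str
    dense: bool = False
    children: List["Component"] = field(default_factory=list)  # pieces at the next lower level


def agent(level: int, comp: Component, base_level: int,
          is_text: Callable[[int, Component], bool],
          cluster: Callable[[List[Component]], List[List[Component]]],
          out: List[str]) -> None:
    """Descend from comp at the given level and append accepted text components to out."""
    if level == base_level:
        if is_text(level, comp):
            out.append(comp.name)
        return
    parts = comp.children
    if len(parts) == 1:                      # component stays whole at level k+1
        whole = parts[0]
        if whole.dense:
            agent(level + 1, whole, base_level, is_text, cluster, out)
        elif is_text(level + 1, whole):
            out.append(whole.name)
        return
    for part in parts:                       # dense pieces get their own agents
        if part.dense:
            agent(level + 1, part, base_level, is_text, cluster, out)
    sparse = [p for p in parts if not p.dense]
    for group in cluster(sparse):            # remaining pieces are clustered, then tested
        for member in group:
            if is_text(level + 1, member):
                out.append(member.name)


# Toy pyramid fragment: the root splits into one dense and two sparse components.
a = Component("a", dense=True, children=[Component("a1")])
root = Component("root", children=[a, Component("b"), Component("c")])
found: List[str] = []
agent(5, root, base_level=7,
      is_text=lambda level, c: c.name != "c",          # pretend "c" is graphics
      cluster=lambda comps: [comps] if comps else [],  # trivial one-cluster stand-in
      out=found)
```

In this toy run the dense component "a" is followed one level further, where its single piece "a1" passes the test; "b" is accepted from the cluster and "c" is suppressed.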

3.3. The clustering process

The Newton formula $m_1 m_2 / r^2$ is used as the metric to judge whether a connected component should be treated as an element of a cluster. Here, $m_1$ and $m_2$ are the numbers of black pixels in the two components, and $r$ is the distance between their centres. The Newton Metric is a threshold against which values calculated from the Newton formula are compared. After the components have been classified into two categories, it is possible to apply a local Newton Metric, which can be more precise. According to the property of graphics, we take an initial Newton value and relate the Newton value applied at each level to the density of that level. The formula is as below:

$$\text{Initial Newton value} = 0.5 \times \frac{\text{Density of start level}}{\text{Density of top level}} \qquad (2)$$
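A minimal sketch of how the Newton formula, the thresholds of equations (2) and (3), and Steps 1 and 2 of this section might fit together is given below. Several details are assumptions made for illustration: a component is represented as a tuple of black-pixel count and centre coordinates, level densities are passed in as plain numbers because the paper does not say how they are computed, and a component with no sufficiently close neighbour ends up in a cluster of its own.

```python
from math import hypot
from typing import List, Tuple

Comp = Tuple[int, float, float]  # (black-pixel count, centre x, centre y), illustrative


def initial_newton_value(density_start: float, density_top: float) -> float:
    """Equation (2)."""
    return 0.5 * density_start / density_top


def level_newton_value(upper_value: float, density_current: float, density_upper: float) -> float:
    """Equation (3): scale the upper level's value by the density ratio."""
    return upper_value * density_current / density_upper


def newton_value(a: Comp, b: Comp) -> float:
    """m1 * m2 / r^2; coincident centres are treated as infinitely close."""
    r = hypot(a[1] - b[1], a[2] - b[2])
    return float("inf") if r == 0 else a[0] * b[0] / (r * r)


def cluster(comps: List[Comp], newton_metric: float) -> List[List[int]]:
    """Group component indices; only unmarked components are compared with the current cluster."""
    unmarked = list(range(len(comps)))
    clusters = []
    while unmarked:
        current = [unmarked.pop(0)]          # seed a new cluster (Step 1)
        grew = True
        while grew:                          # Step 2: repeat until nothing is added
            grew = False
            for idx in unmarked[:]:
                if any(newton_value(comps[idx], comps[m]) > newton_metric for m in current):
                    current.append(idx)      # mark by moving it into the cluster
                    unmarked.remove(idx)
                    grew = True
        clusters.append(current)
    return clusters


# Two tight pairs and one isolated component, all with 10 black pixels.
comps = [(10, 0, 0), (10, 1, 0), (10, 10, 0), (10, 11, 0), (10, 50, 50)]
groups = cluster(comps, newton_metric=5.0)
```

Because marked components are removed from `unmarked`, each later pass compares only the remaining unmarked components with members of the cluster under construction, which mirrors the saving described in the text.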

$$\text{Level Newton value} = \text{Newton value of upper level} \times \frac{\text{Density of current level}}{\text{Density of upper level}} \qquad (3)$$

Clustering proceeds as follows:

Step 1: For the first unmarked connected component and each other unmarked connected component, the Newton formula is applied pairwise to calculate the value for each pair. If this value is greater than the Newton Metric, which means that the two components are close enough, the two components are put in the same cluster and marked with value 1. A new cluster is thus formed. The cluster is not yet complete; new components may be added in Step 2.

Step 2: For each unmarked connected component and each component in the current cluster, the Newton formula is applied pairwise to calculate the value for each pair. Once the value is greater than the Newton Metric, the unmarked connected component is pushed into the current cluster and marked with value 1. This step is repeated until no more connected components are added.

The system repeats the clustering process until all the connected components are marked. The algorithm guarantees that, once a cluster is complete, no remaining unmarked connected component belongs to it. Also, marked components need not be checked again. Therefore, when the process is repeated, the Newton formula is applied only between unmarked connected components and elements of the current incomplete cluster.

    Agent(level k, comp n);
    begin
      if k = L then
        if Test(level k, comp n) then print comp n at base level;
      else
      begin
        descend to level k+1;
        if comp remains as one whole comp at level k+1 then
        begin
          if comp is dense then Agent(level k+1, comp n)
          else if Test(level k+1, comp n) then print comp n at base level
        end
        else
        begin
          for all dense comps n parallel-do Agent(level k+1, comp n);
          cluster the remaining comps into m clusters;
          for i = 1 to m do
            for each comp j in cluster i
              if Test(level k+1, comp j) then print comp j at base level;
        end;
      end;
    end.

Figure 1. Pseudo-code for agent-based pyramid operations.

4. Experimental results

Figure 2 shows a full-resolution image. After preprocessing to remove big graphics objects, a pyramid structure is formed as in Figure 3. Only five levels from the base are shown, as higher levels are too small to visualize. Agent-based pyramid operations then work on the pyramid images and extract text strings as shown in Figure 4, where all graphical objects are successfully removed, although the two words “NCInet” and “Coten” are wrongly separated.

Altogether eight images were tested. Table 1 shows the recall and precision of the system. Here we take a strict measurement: logical groups that contain other logical groups are not considered correct, and separated logical groups are also considered incorrect. In the table, “Actual” is the number of actual logical groups manually found in the images, “Detected” is the number of logical groups detected by the system, and “Correct” is the number of correct logical groups found by the system. Average recall and precision over the eight images are 80.7% and 78.5% respectively.
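The paper does not state the formulas for recall and precision explicitly, but every row of Table 1 agrees with the usual definitions recall = Correct / Actual and precision = Correct / Detected. The short check below recomputes the averages from the counts in Table 1; the variable names are illustrative, and the averages are taken as plain means of the per-image percentages, which is an assumption that reproduces the reported figures.

```python
# (actual, detected, correct) per test image, copied from Table 1.
counts = [
    (15, 16, 14),
    (14, 15, 13),
    (11, 8, 1),
    (6, 6, 6),
    (14, 12, 10),
    (8, 8, 8),
    (11, 12, 10),
    (17, 20, 15),
]

# Per-image percentages under the assumed definitions.
recalls = [100.0 * correct / actual for actual, detected, correct in counts]
precisions = [100.0 * correct / detected for actual, detected, correct in counts]

# Plain means over the eight images.
avg_recall = sum(recalls) / len(recalls)
avg_precision = sum(precisions) / len(precisions)

print(f"average recall    {avg_recall:.1f}%")
print(f"average precision {avg_precision:.1f}%")
```

Image 3, with only 1 correct group out of 11 actual and 8 detected, pulls both averages down noticeably.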

Figure 2. Input image (512 × 512)


Figure 3. Pyramid for input image 1 after removal of large graphics objects

Figure 4. Text extracted from input image

Image     Actual   Detected   Correct   Recall %   Precision %
1         15       16         14        93.3       87.5
2         14       15         13        92.9       86.7
3         11       8          1         9.1        12.5
4         6        6          6         100        100
5         14       12         10        71.4       83.3
6         8        8          8         100        100
7         11       12         10        90.9       83.3
8         17       20         15        88.2       75.0
Average                                 80.7       78.5

Table 1. Test results on eight images

The test result for image 3 is rather unfavorable. This is due to the uneven distribution of logical groups as well as the wide separation of components within the same group. It was found that starting at the next lower level instead of the pre-determined level 5 improved the correctness of the groupings. Future work will explore the possibility of self-determination of the starting level.

5. Conclusion

A system based on the concept of concurrent multiple agents working on a pyramid structure for text extraction is proposed. The system does not require the expensive computation of inter-component relationships found in existing works, which are based on a single process and a single resolution. The groupings of text strings need not be in linear sequences, as required in some previous works. The agents are autonomous, as they navigate the pyramid levels on their own to extract text. The present agent-based pyramid approach has the following strengths: (1) it works on a reduced-resolution image and hence on fewer connected components, and (2) the agents adopt a divide-and-conquer strategy to distribute the workload. The only extra cost is the pyramid construction, but this is a one-time cost incurred as part of image preprocessing.

References

[1] Fletcher LA and Kasturi R. A robust algorithm for text string separation from mixed text/graphics images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1988; 10(6): 910-918.
[2] He S, Abe N and Tan CL. A clustering-based approach to the separation of text strings from mixed text/graphics documents. Thirteenth International Conference on Pattern Recognition, Austria, 25-29 August 1996, pp 706-710.
[3] Hase H, Shinokawa T, Yoneda M, Sakai M and Maruyama H. Character string extraction by multi-stage relaxation. Fourth International Conference on Document Analysis and Recognition, 18-20 August 1997, pp 298-302.
[4] Tan CL and Ng PO. Text extraction using pyramid. Pattern Recognition, 1998; 31(1): 63-72.
[5] Tan CL, Yuan B and Ang CH. Agent-based text extraction from pyramid images. International Conference on Advances in Pattern Recognition, 23-25 November 1998, Plymouth, UK, pp 344-352.
[6] Tanimoto SL. Pictorial feature distortion in a pyramid. Computer Graphics and Image Processing, 1976; 5: 333-352.