A Relaxation Algorithm That Uses Word Collocation to Improve Text Recognition Performance

Tao Hong and Jonathan J. Hull*

Center of Excellence for Document Analysis and Recognition
Department of Computer Science
State University of New York at Buffalo
Buffalo, New York 14260

* RICOH California Research Center
2882 Sand Hill Road, Suite 115
Menlo Park, California 94025

[email protected]

415-496-5700 (voice) 415-854-8740 (fax)

Abstract

A method is presented for postprocessing the output of a word recognition algorithm for visual text recognition. A word recognition algorithm typically outputs a list of alternatives for the identity of each word in a running text. Each alternative is ranked by the probability that it is correct. A relaxation algorithm is proposed in this paper that re-ranks the alternatives for each word using word collocation statistics. These are measurements of the probability that two words occur nearby each other in a running text. The improvement in performance is measured by the increase in the percentage of correct alternatives in the first position. Experimental results are presented that show the proposed algorithm can improve the performance of a word recognition algorithm from 56% to over 83% correct in the top position of the candidate lists.

Keywords: text recognition, word recognition, postprocessing, OCR, word collocation, probabilistic relaxation, results propagation.

Revised version of "Degraded Text Recognition Using Word Collocation," 1994 SPIE Conference on Document Recognition, San Jose, CA.


Summary

A method is presented for postprocessing the output of a word recognition algorithm for visual text recognition. A word recognition algorithm typically outputs a list of alternatives for the identity of each word in a running text and the probability that each word is correct. Those alternatives are visually similar to the input word images. However, most current algorithms have no way to use the results from one word to influence the results for another word. For example, the ranked list of alternatives for the word image "river" might be { river 0.85, rover 0.10, raven 0.05 } and the adjacent word "boat" might have been mis-recognized as { boot 0.41, boat 0.39, hat 0.20 }. A probabilistic relaxation algorithm is proposed in this paper that uses word collocation statistics to re-rank the entries in the lists of alternatives. Word collocation statistics express the likelihood that two words occur nearby each other in a running text and have been shown to capture both syntactic and semantic constraints. In the above example, the collocations might be C(river, boat) = 0.50, C(river, boot) = 0.001, C(river, hat) = 0.02, C(rover, boot) = 0.001, C(rover, boat) = 0.001, C(rover, hat) = 0.02, C(raven, boot) = 0.001, C(raven, boat) = 0.002, C(raven, hat) = 0.002. That is, aside from the correct choices (river and boat), almost all of the other collocations have negligible values. The relaxation algorithm uses this data to assign new probabilities to the words in the lists of alternatives. This algorithm runs iteratively so that it can propagate the influence of strong collocations to other parts of a text. In the above example, after the first iteration of relaxation the lists of alternatives may have been re-ranked as { river 0.93, rover 0.05, raven 0.02 } and { boat 0.80, boot 0.10, hat 0.01 }. Further iterations would increase the probabilities assigned to the correct choices. An experimental implementation is constructed in this paper in which several pages of text are corrupted with a noise model that simulates fax images or poor photocopies. A word recognition algorithm is applied to those images to generate input data for the relaxation algorithm. It is shown that in cases where the word recognition algorithm is only 56% correct, the relaxation algorithm can improve the correct rate to over 80%. When perfect collocation data is provided, the correct rate is improved to nearly 90%. This demonstrates the robustness of the proposed method in the presence of noise and its usefulness as a general purpose preprocessor for further more detailed linguistic analysis.


1 Introduction

The recognition of images of text is typically solved by assigning a unique ASCII interpretation to each isolated character image. An improvement on this approach, which uses more contextual information, is to model it as a word recognition problem [10]. The objective then becomes the assignment of the correct ASCII interpretation to each word image. While the use of contextual information above the level of individual characters can improve performance (i.e., the percentage of ASCII interpretations that are correct), text recognition is still a difficult problem, especially when the images are degraded by noise such as that introduced by photocopying or facsimile transmission.

Methods for improving the performance of word recognition algorithms have used knowledge about the language in which a document is written [11, 12]. These techniques often post-process the lists of alternatives for the identity of each word image, ranked by a confidence value, that are output by a word recognition algorithm. The objective is to choose the correct decision for each word from the list of alternatives. This is an important step since it has been observed that a word recognition algorithm can provide lists of alternatives that 98% of the time contain the correct decision in their first ten alternatives even though the first alternative for each word is only 80% correct [7].

An example of postprocessing the output of a word recognition algorithm is given by the sentence "The river boat race was held yesterday." The list of alternatives for each word image

might contain the correct decision in the first position except for the word "river" that might have a list of alternatives and confidence values { rover 0.5, river 0.5 }. That is, the word recognition algorithm could not determine whether the correct decision was rover or river.

Word collocation data is one source of information that has been investigated in computational linguistics and that has been proposed as a useful tool to post-process word recognition results [1, 2, 16]. Word collocation refers to the likelihood that two words co-occur within a fixed distance of one another. For example, it is highly likely that if the word "boat" occurs, the word "river" will also occur somewhere in ten words on either side of "boat." Thus, if "river" had been misrecognized with the list of alternatives "rover, river" (i.e., rover is the first choice, river the second choice), the presence of "boat" nearby would allow for the recognition error to be corrected. A numerical measure of the collocation strength of word pair (x, y) is given by the mutual information of (x, y):

    MI(x, y) = \log_2 \frac{P(x, y)}{P(x) \, P(y)}

where P(x) and P(y) are the probabilities of observing words x and y in a corpus of text and P(x, y) is the joint probability of observing x and y within a fixed distance of one another. The strength of word collocation can also be measured by the frequency of a word pair, F(x, y), in a corpus of fixed size.
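To make the collocation measure concrete, the sketch below estimates MI for adjacent word pairs from raw counts in a training corpus. It is a minimal illustration, not the authors' implementation: the tokenization, the ordered-pair orientation, and the min_pair_count threshold are assumptions introduced here.

import math
from collections import Counter
from itertools import islice

def collocation_mi(tokens, min_pair_count=2):
    """Estimate MI(x, y) = log2( P(x, y) / (P(x) * P(y)) ) for adjacent word pairs.

    `tokens` is a list of words from a training corpus; only adjacent pairs
    (distance 1) are counted, which matches the adjacent-word statistics used
    by the relaxation algorithm described later.
    """
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, islice(tokens, 1, None)))

    n_words = len(tokens)
    n_pairs = max(n_words - 1, 1)

    mi = {}
    for (x, y), count in pair_counts.items():
        if count < min_pair_count:
            continue  # rare pairs give unreliable estimates (assumed threshold)
        p_xy = count / n_pairs
        p_x = word_counts[x] / n_words
        p_y = word_counts[y] / n_words
        mi[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return mi

With this kind of table, MI(("river", "boat")) would come out much higher than MI(("rover", "boat")) whenever "river boat" co-occurs often in the training text, which is exactly the contrast the postprocessor exploits.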

Previous work in using word collocation data to post-process word recognition results has shown the usefulness of this data [17]. This technique used local collocation data about words that co-occur next to each other to improve recognition performance. A disadvantage of this approach was that it did not allow for successful results on one word to propagate over a passage of text and influence the results on other words.

This paper proposes a relaxation-based algorithm that propagates the results of word collocation post-processing within sentences. The promotion of correct choices at various locations within a sentence influences the promotion of word decisions elsewhere. This effectively improves on a strictly local analysis by allowing for strong collocations to reinforce weak (but related) collocations. Relaxation has been widely used to solve problems where the search for a globally optimal solution can be broken down into a series of local problems, each of which contributes to the overall result [18].

Most previous work in using word collocation statistics has assumed that there is a uniform distribution for the likelihood of co-occurrence as a function of the distance between the two words. This has the effect of equally weighting collocations for nearby words and collocations for words in widely separated positions in an input text. One way to improve this would be to use a Gaussian distribution to model the distance weighting function. We incorporate this characteristic in our work by using collocation statistics for adjacent words and a relaxation algorithm that propagates their effect across an input text. This effectively incorporates a decay over distance in much the same way that a Gaussian distribution would.

The rest of the paper discusses the proposed algorithm in more detail. An experimental

analysis is discussed in which the algorithm is applied to improving text recognition results that are less than 60% correct. The correct rate is effectively improved to 83% or better in all cases.


2 Algorithm Background and Description

Word recognition postprocessing using word collocation statistics can be viewed as a classic constraint satisfaction problem. That is, there is a set of variables (the words in a passage of text), each of which is assigned a subset from a given domain (a list of word decision alternatives chosen from a dictionary) as a potential solution. Each member of this subset has an associated confidence value that expresses its suitability as a value for the variable. There is also a set of local constraints (word collocation statistics) that are used to update the confidence values assigned to members of adjacent subsets.

Solutions for constraint satisfaction problems have taken many forms including backtracking and relaxation [15]. The common objective of these methods is to update the confidence values so that the member of each subset with the maximum confidence value is the best assignment for the variable, given the constraints. Probabilistic relaxation is one solution method that assigns confidences so that compatible values, as measured by the local constraints, are reinforced and incompatible values are suppressed [3]. Also, this effect is propagated beyond the immediate range of the local constraints so that strong local constraints affect the confidences of related values that are not immediately adjacent to a given word. This occurs because a probabilistic relaxation algorithm is applied in parallel in a number

of iterations. In each iteration, the local constraints are used to update the confidence assigned to the value for each variable. This is done with a function that measures the compatibility of an assignment of values to adjacent variables and an updating rule that adjusts the confidence values based on their current assignments and the confidences assigned to adjacent variables. An updating rule that has these characteristics and casts word recognition postprocessing as a probabilistic relaxation algorithm is given below.

Let w_{ij} be word candidate number j for word position i in an input text. The rule that updates the confidence p_{w_{ij}} assigned to w_{ij} from iteration k to iteration k+1 is:

    p^{k+1}_{w_{ij}} = \frac{ p^{k}_{w_{ij}} \left[ r_{w_{ij}, w_{i+1,1}} \, p^{k}_{w_{i+1,1}} + r_{w_{ij}, w_{i-1,1}} \, p^{k}_{w_{i-1,1}} \right] }{ \sum_{l=1}^{n} p^{k}_{w_{il}} \left[ r_{w_{il}, w_{i+1,1}} \, p^{k}_{w_{i+1,1}} + r_{w_{il}, w_{i-1,1}} \, p^{k}_{w_{i-1,1}} \right] }

The initial score p^{0}_{w_{ij}} is the confidence (between 0 and 1) provided by a word recognizer for word candidate w_{ij}. The compatibility functions r_{w_{ij}, w_{i+1,1}} and r_{w_{ij}, w_{i-1,1}} are defined as:

    r_{w_{ij}, w_{i+1,1}} = \frac{ |MI(w_{ij}, w_{i+1,1})| }{ \sum_{k=1}^{n} |MI(w_{ik}, w_{i+1,1})| }
    \qquad
    r_{w_{ij}, w_{i-1,1}} = \frac{ |MI(w_{i-1,1}, w_{ij})| }{ \sum_{k=1}^{n} |MI(w_{i-1,1}, w_{ik})| }

where MI(x, y) is the mutual information of the word pair (x, y). This method of calculating the score for w_{ij} at time k+1 is an improvement over a previous approach that did not incorporate recognition confidence [9].
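The updating rule can be read directly as code. The following sketch is one possible rendering under several assumptions: each sentence position holds a list of (word, confidence) candidates with the current top choice first, and mi_table maps ordered adjacent word pairs to precomputed mutual information values (as in the earlier MI sketch). The default_mi smoothing value for unseen pairs is illustrative and not taken from the paper.

def update_confidences(candidates, mi_table, default_mi=0.001):
    """One parallel iteration of the relaxation update.

    `candidates` is a list over word positions; each entry is a list of
    (word, confidence) pairs with the top choice first.  Each candidate's
    confidence is re-weighted by its compatibility with the current top
    choices of the left and right neighbors, then renormalized.
    """
    def compat(words, neighbor_top, neighbor_on_right):
        # r = |MI| between each candidate and the neighbor's top choice,
        # normalized over the candidates at this position
        def mi(w):
            pair = (w, neighbor_top) if neighbor_on_right else (neighbor_top, w)
            return abs(mi_table.get(pair, default_mi))
        scores = [mi(w) for w in words]
        total = sum(scores) or 1.0
        return [s / total for s in scores]

    new_candidates = []
    for i, cands in enumerate(candidates):
        words = [w for w, _ in cands]
        probs = [p for _, p in cands]

        support = [0.0] * len(cands)
        for offset in (-1, +1):               # left and right neighbors
            j = i + offset
            if 0 <= j < len(candidates):
                top_word, top_prob = candidates[j][0]
                r = compat(words, top_word, neighbor_on_right=(offset == +1))
                support = [s + r_w * top_prob for s, r_w in zip(support, r)]

        weighted = [p * s for p, s in zip(probs, support)]
        total = sum(weighted) or 1.0
        updated = sorted(zip(words, [w / total for w in weighted]),
                         key=lambda pair: -pair[1])
        new_candidates.append(updated)
    return new_candidates

Note that the update for every position reads only the previous iteration's top choices, so all positions can be updated in parallel before the new rankings take effect, as the synchronous formulation above requires.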

This updating rule uses the top-ranked choice of adjacent words to adjust the ranking in each list of alternatives. Repeated applications of this updating rule effectively propagate results across a sentence. In fact, it may require several iterations for the algorithm to converge on a stable state from which further iterations cause no significant changes in the rankings.

Figure 1 shows an example, using actual data, of how the algorithm operates. A degraded image of the sentence "Please show me where Hong Kong is!" is input to a word recognizer. The top three recognition choices are shown in each position. Only two of the seven words are correct in the first choice. After one iteration, six of the seven are correct and after two iterations all seven word decisions are correct.


Initial Word Alternatives
position   1        2       3     4       5      6      7     8
top1       Places   snow    me    whale   Kong   Kong   it    !
top2       Please   slow    we    where   Hong   Hong   is
top3       Pieces   show    mo    chore   Hung   Kung   Is

Word Alternatives After Iteration 1
position   1        2       3     4       5      6      7     8
top1       Places   show    me    where   Hong   Kong   is    !
top2       Please   slow    we    whale   Kong   Hong   it
top3       Pieces   snow    mo    chore   Hung   Kung   Is

Word Alternatives After Iteration 2
position   1        2       3     4       5      6      7     8
top1       Please   show    me    where   Hong   Kong   is    !
top2       Places   slow    we    whale   Kong   Hong   it
top3       Pieces   snow    mo    chore   Hung   Kung   Is

Figure 1: An example of the relaxation process (the sentence to be recognized is "Please show me where Hong Kong is!")


3 Experiments and Analysis

Experiments were conducted to determine the performance of the proposed algorithm. The data used in the experiments were generated from the Brown Corpus and Penn Treebank databases. These corpora together contain over four million words of running text. The Brown Corpus is divided into 500 samples of approximately 2000 words each [14]. The part of the Penn Treebank used here is the collection of articles from the Wall Street Journal that contain about three million words.

Five articles were randomly selected from the Brown Corpus as test samples. They are A06, G02, J42, N01 and R07. A06 is a collection of six short articles from the Newark Evening News. G02 is from an article "Toward a Concept of National Responsibility" that appeared in The Yale Review. J42 is from the book "The Political Foundation of International Law." N01

is a chapter from the adventure fiction "The Killer Marshal." R07 is from the humor article "Take It Off" in The Arizona Quarterly. Each text has about 2000 words. There are 10,280 words in total in the testing samples.

The training data from which the collocation statistics were calculated was composed of the approximately 1.2 million distinct word pairs in the combination of the Brown Corpus and Penn Treebank minus the five test samples. Examples of word pairs and their frequencies are shown below.


the doctor      64
a doctor        27
doctor and       8
doctor was       8
doctor who       7
his doctor       6
doctor bills     4
ward doctor      1

An estimate for the upper bound, or best performance that could be expected with a technique that uses word collocation data, was derived from the test articles. This was done by calculating the percentage of words in the test articles that also appear in the training data. This is relevant since if a word does not occur in the training data there will be no collocations stored for it and the algorithm may not select it. The results in Table 1 show that somewhere between 97% and 98% of the isolated words in each of the test articles occur in the training data. Thus, only 2% to 3% of the words in a typical document are not found.

test      no. of    no. in
article   words     training data
A06       2213      2137 (97%)
G02       2267      2201 (97%)
J42       2269      2208 (97%)
N01       2313      2271 (98%)
R07       2340      2262 (97%)

Table 1: Isolated word occurrence in the training data

The potential upper bound in performance is also illustrated by the results shown in Table 2. These give the number of times each word in the test articles appears adjacent to the same word or words in the training data. Statistics are provided that show how often words in the test articles are collocated with both the word before and the word after somewhere in the training data. Also, the number of times a word is collocated only with either the word before or the word after is given. These results show that about 60% of the words occur adjacent to the same words in both the test data and the training data. About 32% of the words are adjacent only to either the word before or the word after and 4% to 8% of the words have no collocations in the training database. Thus, an algorithm that uses collocation to choose the candidate for a word could achieve a correct rate of between 92% and 96%. However, this would require perfect knowledge about which collocation is correct. Actual performance will be less than this because of other words in the list of alternatives that have collocations in the training data and the interactions of those words and their recognition confidence values during iterations of the relaxation algorithm.


test      no. of    both before    only before    neither before
article   words     and after      or after       nor after
A06       2213      1342 (61%)     694 (31%)      177 (8%)
G02       2267      1355 (60%)     750 (33%)      162 (7%)
J42       2269      1349 (59%)     762 (34%)      158 (7%)
N01       2313      1609 (70%)     614 (27%)       90 (4%)
R07       2340      1499 (64%)     659 (28%)      182 (8%)

Table 2: Word collocation occurrence in the training data

3.1 Generation of Lists of Alternatives

Lists of alternatives were generated for each of the 70,000 unique words in the combined corpus of testing and training data by the following procedure. Digital images of the words were generated from their ASCII equivalents by first converting them to an 11 pt. Times Roman font in PostScript with the Unix command ditroff. The PostScript files were then converted into raster images with the ghostscript system. Feature vectors were then calculated from each word image with a representation known as the stroke direction feature vector [6]. The list of alternatives for each dictionary word was calculated by computing the Euclidean distance between its feature vector and the feature vectors of all the other dictionary words and sorting the result. The ten words with the smallest distance values were stored with each dictionary word as its list of alternatives.
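A simplified sketch of this candidate-generation step is shown below. The stroke direction features themselves are not reproduced here, so extract_features is a hypothetical placeholder for the word-shape representation of [6]; any fixed-length numeric feature vector computed from a rendered word image could be substituted.

import numpy as np

def build_candidate_lists(dictionary_words, extract_features, top_n=10):
    """For every dictionary word, find the `top_n` visually closest words.

    `extract_features` stands in for the stroke direction feature vector
    computed from a rendered word image; it must return a fixed-length
    numeric vector for each word.
    """
    words = list(dictionary_words)
    feats = np.stack([extract_features(w) for w in words])   # shape (V, d)

    candidates = {}
    for i, w in enumerate(words):
        # Euclidean distance from word i to every dictionary word (including itself)
        dists = np.linalg.norm(feats - feats[i], axis=1)
        order = np.argsort(dists)[:top_n]
        candidates[w] = [words[j] for j in order]
    return candidates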

To mimic the performance of a word recognition technique in the presence of noise, the lists of alternatives were corrupted. This step was necessary since the original neighborhoods were 100% correct at the top choice. Also, we wanted to test the performance of the relaxation algorithm with a range of correct rates and a noise model was the best way to do that. A different technique would have been to corrupt the images of the words in the test data before sending them to the word recognizer. However, this would have required a simulator for the image corruption process that is verified to be correct. That is, the results produced by the simulator should be similar to those produced by a real fax machine or photocopier. While some simulators exist, their results have not been verified.

The corruption process used in this paper changed the positions of the entries in the lists of alternatives with a statistical model. This retained the same ten words that were in the original list of alternatives but shuffled their order. Thus, the words in the lists are guaranteed to be visually similar to each other. However, in some percentage of the cases, the correct choice is not in the first position. This method for corrupting word recognition results was chosen based on experience which shows that a recognizer based on the stroke direction feature vector has a high correct rate in the first position when it is given clean images. When degraded images are input, performance at the top choice typically drops off but the correct word is usually retained in the list of alternatives. Also, the other entries in the list besides the correct one are usually very similar to those that were present when the corresponding clean image was tested. This informal

verification of the correspondence between the noise model and real performance was accepted as proof of its validity. A formally verified simulator for image noise would have been a better alternative but such a technique was not available.

The corruption algorithm was implemented by specifying a correct rate in each position in the list of alternatives. For example, the top choice might be 80% correct, the second choice 10% correct, and so on. The noise model was applied to the text by calling a uniform random number generator for each word in the passage and scaling the result between zero and one. The correct rate distribution was then used to select the position in the list of alternatives into which the correct word was moved. Thus, in the above example, 80% of the time the correct word would remain in the top position, 10% of the time it would be moved into the second position, and so on. The eight noise models shown in Table 3 were used to generate alternatives for the words in the running text of the five test articles.

                         position in alternatives
model    1     2     3     4     5     6     7     8     9     10
1        .55   .15   .08   .05   .04   .03   .03   .03   .02   .02
2        .65   .15   .06   .04   .02   .02   .01   .01   .02   .02
3        .70   .17   .05   .02   .02   .01   .01   .01   .005  .005
4        .75   .10   .05   .03   .03   .01   .01   .01   .005  .005
5        .80   .07   .05   .02   .02   .01   .01   .01   .005  .005
6        .85   .06   .03   .01   .01   .01   .01   .01   .005  .005
7        .90   .03   .02   .01   .01   .01   .005  .005  .005  .005
8        .95   .02   .01   .005  .005  .005  .002  .001  .001  .001

Table 3: Performance models used to generate corrupted word recognition results
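The corruption procedure amounts to sampling, for each word, the list position that will hold the correct choice from one row of Table 3. A minimal sketch, assuming ten-entry candidate lists that initially have the correct word in the first position:

import random

def corrupt_candidate_list(candidates, position_rates, rng=random):
    """Move the correct (first) word to a position sampled from `position_rates`.

    `candidates` is a 10-word list with the correct word first;
    `position_rates` is one row of Table 3, e.g. model 5:
    [.80, .07, .05, .02, .02, .01, .01, .01, .005, .005].
    """
    u = rng.random()                     # uniform value in [0, 1)
    cumulative, target = 0.0, len(position_rates) - 1
    for pos, rate in enumerate(position_rates):
        cumulative += rate
        if u < cumulative:
            target = pos
            break

    corrupted = list(candidates)
    correct = corrupted.pop(0)           # remove the correct word from the top
    corrupted.insert(target, correct)    # re-insert it at the sampled position
    return corrupted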


3.2 Experimental Results

Before testing the algorithm, a baseline for performance comparison was determined. The accuracy rate in each position in the top10 candidate list was calculated by re-ranking using word frequency, i.e., the a-priori probability of the word (see Table 4). The result shows that the correct rate of the first position in the list of alternatives is around 75% by using word frequency data alone. It should be noted that these results were calculated from the entire training database, including the test articles. Since, as shown earlier, about 2% to 3% of the words in the test data do not occur in the training data, the performance obtained by re-ranking using frequency could be up to 2% to 3% lower.
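This baseline amounts to sorting each candidate list by unigram frequency. A minimal sketch, assuming word_freq holds counts gathered from the training corpus:

def rerank_by_frequency(candidates, word_freq):
    """Baseline: order a candidate list by a-priori word frequency.

    `word_freq` maps each word to its count in the training corpus;
    unseen words fall to the bottom of the list.
    """
    return sorted(candidates, key=lambda w: word_freq.get(w, 0), reverse=True)

For example, with hypothetical counts {"river": 800, "rover": 12}, rerank_by_frequency(["rover", "river"], ...) returns ["river", "rover"].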

                      Article
             A06      G02      J42      N01      R07      Average
# of words   2040     2075     2078     2021     2066     2056
top1         73.6%    74.7%    75.1%    76.6%    76.1%    75.2%
top2         89.0%    88.7%    91.8%    89.8%    90.1%    89.9%
top3         94.6%    94.6%    95.7%    95.5%    95.8%    95.2%
top4         96.9%    97.1%    97.8%    97.7%    97.5%    97.4%
top5         98.0%    98.0%    98.7%    98.8%    98.5%    98.4%
top6         98.4%    98.9%    99.0%    99.4%    99.2%    99.0%
top7         99.2%    99.3%    99.3%    99.6%    99.4%    99.4%
top8         100.0%   99.9%    99.9%    100.0%   100.0%   100.0%
top9         100.0%   100.0%   100.0%   100.0%   100.0%   100.0%
top10        100.0%   100.0%   100.0%   100.0%   100.0%   100.0%

Table 4: Baseline for comparison (re-ranking the top10 list by word frequency)

The result of applying the relaxation algorithm to the lists of alternatives is shown in Table 5 and Figure 2. The average correct rate at the top choice across all the passages tested

is given for ten iterations of relaxation. Table 6 and Figure 3 show the results obtained when the test passages are included in the training data. The differences between the two results show the effect that "perfect" collocation data can have on performance. The top choice correct rate when the training data does not include the test passages is between 83% and 86% correct. The best performance possible, as shown in Table 6, raises the correct rate at the top choice to between 91% and 93%. This is interesting when it is considered that the top choice correct rate of the input was initially as low as 56%.

In most cases the majority of the improvement in performance is obtained after three iterations. A drop in performance can be observed after the first iteration in cases where the initial correct rate is high. For example, when the first word has a 94% correct rate, at the first iteration it is improved to 95% and then it drops to 86% after ten iterations. This same effect does not occur when the initial correct rate is lower than 80%. Thus, in practice, the number of iterations should be controlled by an estimate of the overall error rate in word recognition. If low confidences have been assigned to many words in a text passage, then this should indicate that the relaxation should be iterated until the order of the decisions in the lists of alternatives ceases to change. If many words have been assigned high confidences, then the relaxation should be terminated after one or two iterations.

Also, many of the cases that were correct on the first iteration but that were placed in a lower position in the list of alternatives on a subsequent iteration were short words such as "one" that were confused with function words such as "the". The higher collocation strength

of \the" caused it to be chosen incorrectly. This suggests that the relaxation algorithm should be used as part of a larger text recognition system and that this system should include a preprocessing step, such as [13], that detects function words and recognizes them separately. The relaxation algorithm would then be applied to the other words in the text. Since these will be the information-bearing words, the relaxation approach will be more likely to perform correctly.


Initial   Iter. 1   Iter. 2   Iter. 3   Iter. 4   Iter. 5   Iter. 6   Iter. 7   Iter. 8   Iter. 9   Iter. 10
56.4%     74.6%     81.2%     82.5%     82.9%     83.1%     83.2%     83.3%     83.3%     83.2%     83.2%
65.2%     80.3%     83.8%     84.2%     84.0%     84.1%     84.1%     84.1%     84.1%     84.1%     84.1%
70.9%     84.1%     85.6%     85.0%     84.9%     84.8%     84.7%     84.7%     84.8%     84.7%     84.7%
75.2%     85.8%     86.4%     85.4%     85.1%     85.0%     85.0%     84.9%     84.9%     84.9%     84.9%
79.9%     88.4%     87.4%     86.1%     85.5%     85.3%     85.2%     85.1%     85.1%     85.1%     85.1%
84.5%     90.7%     88.3%     86.7%     86.0%     85.9%     85.7%     85.6%     85.6%     85.6%     85.6%
89.7%     93.4%     89.2%     87.2%     86.4%     86.1%     86.0%     85.9%     86.0%     85.9%     85.9%
94.1%     95.4%     90.0%     87.6%     86.8%     86.5%     86.4%     86.3%     86.3%     86.2%     86.2%

Table 5: Correct rate at each iteration of relaxation

Figure 2: Correct rate at each iteration of relaxation (y-axis: top1 correct rate; x-axis: iteration, 0 through 10)

Initial   Iter. 1   Iter. 2   Iter. 3   Iter. 4   Iter. 5   Iter. 6   Iter. 7   Iter. 8   Iter. 9   Iter. 10
56.4%     80.7%     88.5%     90.2%     90.7%     91.0%     91.2%     91.2%     91.3%     91.3%     91.3%
65.2%     85.8%     90.9%     91.6%     91.8%     91.8%     91.8%     91.9%     91.9%     91.9%     91.9%
70.9%     88.9%     92.2%     92.1%     92.0%     92.0%     91.9%     91.9%     91.9%     91.9%     91.9%
75.2%     90.5%     93.0%     92.5%     92.3%     92.3%     92.1%     92.1%     92.1%     92.1%     92.1%
79.9%     92.7%     93.7%     92.9%     92.6%     92.5%     92.3%     92.2%     92.2%     92.2%     92.2%
84.5%     94.5%     94.4%     93.3%     92.8%     92.7%     92.5%     92.5%     92.5%     92.4%     92.4%
89.7%     96.4%     95.0%     93.7%     93.0%     92.8%     92.6%     92.5%     92.5%     92.5%     92.4%
94.1%     97.5%     95.5%     93.8%     93.1%     92.9%     92.7%     92.7%     92.7%     92.6%     92.7%

Table 6: Correct rate in relaxation when collocation includes test samples

Figure 3: Correct rate when collocation includes test samples (y-axis: top1 correct rate; x-axis: iteration, 0 through 10)

4 Discussion and Future Directions

In this paper a relaxation algorithm was described that used word collocation information to improve text recognition results. The experimental results showed that the correct rate at the top choice of a word recognition algorithm could be improved from 56% to 83% correct. This is a substantial improvement compared to the best that could be obtained by re-ranking the lists of alternatives using a-priori probability alone.

The performance gap between the results achieved when the collocation data includes the test data and when it does not suggests that word collocation data should be collected from larger and more balanced English corpora. Analysis of the remaining errors showed that many of them could be corrected by using a larger window size and special strategies for processing proper nouns. Modifications of the ranking function will also be considered.

The importance of the technique presented in this paper is primarily in its use as a language-level postprocessor in a recognition system for degraded text images. These are the noisy images that are characteristic of facsimile transmissions and poor photocopies. Such images typically exhibit uniformly poor contrast and degradation patterns. Word recognition techniques that are applied to images like these will perform poorly because they use only local characteristics of word images to make their decisions. That is, they do not take advantage of the statistical constraints that exist between words in a normal passage of text. The algorithm proposed in this paper overcomes this characteristic by its use of word

collocation statistics in a relaxation algorithm. These statistics measure how likely it is that two words co-occur nearby each other in a running text. This is a powerful source of information that is used to reinforce the decisions of nearby words. It has been shown that word collocation data captures a number of linguistic constraints including syntactic and semantic information. For example, an article like "the" is very likely to precede an adjective like "brown". Also, a word like "boat" is very likely to occur nearby a semantically related word like "river." This is an important consideration since many inter-word constraints can be modeled with these simple statistics.

The relaxation algorithm allows the effect of word collocation statistics for adjacent words to be propagated across the lists of alternatives for words in a running text. This is an important characteristic of relaxation algorithms that has been utilized in many other domains such as edge detection and image understanding. It is adapted here to a word recognition problem and allows for results such as those discussed above to be computed. That is, the decision for "boat" can reinforce the decision for "river" even though they may occur in significantly different positions in the text.

The experimental results indicate that the relaxation algorithm is most relevant to use when the word recognizer has relatively poor performance. This most often occurs when the input text is degraded significantly. The results show that the correct rate can be improved to nearly 90% in most cases but not much further. Thus, applying the proposed technique to word recognition results that are already 90% correct or better will yield little if any improvement

in performance.

Inspection of the results showed that many of the errors are caused by proper nouns that occur only once in a specific test document and do not occur elsewhere in the data used to calculate the word collocation statistics. This could be fixed by incorporating preprocessing that detects proper nouns [4]. The relaxation algorithm could then be applied to all the other words in the input text.

Increasing the training set is another way to improve the performance of the algorithm. This would require large amounts of English text. This has recently become easier to do with the availability of online news services that provide an essentially unlimited supply of ASCII text. Such a data stream could be segmented into semantic classes (for example, stories about water travel) and the collocation data calculated from the passages within those classes. Recent results have indicated that such a technique can improve performance since the collocation data is "focused" on the same topic as that of the input article [5].

The performance of this algorithm was compared to results that were achieved by re-ranking the alternative lists using the a-priori probabilities of the words they contain. These probabilities were calculated from the same training set that was used to calculate the word collocation statistics. The results showed that the relaxation algorithm performed significantly better than the probabilistic re-ranking. This was expected since the word collocation data captures information that cannot be expressed in simple a-priori probabilities.

Further experimental comparisons to other algorithms are possible to a limited extent since there are only a small number of methods that have been applied to language-level postprocessing for text recognition. The most similar technique is a hidden Markov model (HMM) that uses syntactic class transition probabilities to improve the performance of a word recognition algorithm [12]. This technique used probabilities of adjacent part-of-speech tags (e.g., the probability of an adjective following an article) to re-rank word candidate lists. The relaxation method proposed in this paper uses a richer source of statistical information and thus should produce better overall performance than the HMM.

The relaxation algorithm was described in enough detail so that the reader should be able to implement and test the method. Since the technique was shown to be economical in both storage and runtime, it should be suitable for a real-time commercial application.

The relaxation algorithm currently works as one part of a degraded text recognition system. There are two types of linguistic constraints used in the system. Besides word collocation data, global structural constraints from an English grammar are employed [8]. Visual global contextual information available inside a text page is also being considered for integration with the linguistic knowledge sources to further improve the performance of degraded text recognition.


References

[1] H. S. Baird, private communication about the use of word collocation to improve OCR results, February 1989.

[2] K. W. Church and P. Hanks, "Word Association Norms, Mutual Information, and Lexicography," Computational Linguistics, Vol. 16, No. 1, pp. 22-29, 1990.

[3] L. S. Davis and A. Rosenfeld, "Cooperating processes for low-level vision: A survey," Artificial Intelligence, Vol. 17, pp. 245-263, 1981.

[4] G. DeSilva and J. J. Hull, "Proper noun detection in document images," Pattern Recognition, pp. 311-320, February 1994.

[5] P. Filipski and J. J. Hull, "Keyword selection from word recognition results using definitional overlap," Third Symposium on Document Analysis and Information Retrieval, Las Vegas, Nevada, April 11-13, 1994.

[6] T. K. Ho, J. J. Hull and S. N. Srihari, "A Word Shape Analysis Approach to Lexicon Based Word Recognition," Pattern Recognition Letters, Vol. 13, pp. 821-826, 1992.

[7] T. K. Ho, J. J. Hull and S. N. Srihari, "A computational model for recognition of multifont word images," Machine Vision and Applications, special issue on Document Image Analysis, Vol. 5, No. 3, pp. 157-168, summer 1992.

[8] T. Hong and J. J. Hull, "Text Recognition Enhancement with a Probabilistic Lattice Chart Parser," Proceedings of the Second International Conference on Document Analysis and Recognition (ICDAR-93), Tsukuba, Japan, pp. 222-225, October 20-22, 1993.

[9] T. Hong and J. J. Hull, "Degraded text recognition using word collocation," SPIE Conference on Document Recognition, San Jose, California, pp. 334-341, February 11-13, 1994.

[10] J. J. Hull, "Hypothesis generation in a computational model for visual word recognition," IEEE Expert, Vol. 1, No. 3, pp. 63-70, Fall 1986.

[11] J. J. Hull, S. Khoubyari and T. K. Ho, "Word Image Matching as a Technique for Degraded Text Recognition," Proceedings of the 11th IAPR International Conference on Pattern Recognition, The Hague, The Netherlands, pp. 665-668, 1992.

[12] J. J. Hull, "A Hidden Markov Model for Language Syntax in Text Recognition," Proceedings of the 11th IAPR International Conference on Pattern Recognition, The Hague, The Netherlands, pp. 124-127, 1992.

[13] S. Khoubyari and J. J. Hull, "Font and Function Word Identification in Document Recognition," Computer Vision, Graphics, and Image Processing: Image Understanding, to appear, 1995.

[14] H. Kucera and W. N. Francis, Computational Analysis of Present-Day American English, Brown University Press, 1967.

[15] A. K. Mackworth, "Constraint Satisfaction," in Encyclopedia of Artificial Intelligence, S. C. Shapiro (ed.), Wiley, New York, pp. 285-293, 1992.

[16] T. G. Rose, R. J. Whitrow and L. J. Evett, "The Use of Semantic Information as an Aid to Handwriting Recognition," Proceedings of the 1st International Conference on Document Analysis and Recognition, Saint-Malo, France, pp. 629-637, 1991.

[17] T. G. Rose and L. J. Evett, "Text Recognition Using Collocations and Domain Codes," Proceedings of the Workshop on Very Large Corpora: Academic and Industrial Perspectives, Columbus, Ohio, pp. 65-73, 1993.

[18] A. Rosenfeld, R. A. Hummel and S. W. Zucker, "Scene Labeling by Relaxation Operations," IEEE Transactions on Systems, Man and Cybernetics, SMC-6(6), pp. 420-433, 1976.

