Automatic Reading of Cursive Scripts Using Human Knowledge

M. Côté 1,3, M. Cheriet 2, E. Lecolinet 1, and C.Y. Suen 3

1 ENST Paris, Dept. Signal, 75013 Paris, France
2 E.T.S., Dept. G.P.A., Université du Québec, Montréal, Que., H2T 2C8, Canada
3 CENPARMI, Concordia University, Montréal, Que., H3G 1M8, Canada

Abstract
This paper presents a model for reading cursive script whose architecture is inspired by a human reading model and which is based on perceptual concepts. We limit the scope of our study to the off-line recognition of isolated cursive words. First, we justify our choice of McClelland & Rumelhart's reading model as the inspiration for our system. A brief summary of the method's behavior is presented, and the main original features of our model are underlined. We then focus on the new updates added to the original system: a new baseline extraction module, a new feature extraction module, and a new generation, validation and hypothesis insertion process. After implementation of our method, new results have been obtained on real images, from a training set of 184 images and a testing set of 100 images, and are discussed. We are now concentrating on validating the model on a larger database.
1 Introduction
The advent of computers has deeply modified our way of interacting with our environment. Even though this machine can carry out complex calculations, often beyond human capacity, it remains limited. In fact, communicating with a computer through a keyboard is not very natural and requires much discipline. Hence, in our wildest dreams, we imagine ourselves chatting with our computer using speech or an electronic pen, or letting it read our handwritten notes. Unfortunately, even after 30 years of intensive research in this domain, the general problem of the automatic reading of cursive script has not been completely solved. It seems, however, that handwriting recognition may play a role in future reading systems [1], and consequently that it is still a worthy area of investigation. This paper presents a new method based on a human reading model, influenced by the work of McClelland & Rumelhart in perception. We limit the scope of our study to the off-line recognition of isolated cursive words. Our model, already described in [2], shares the following characteristics with the model of McClelland & Rumelhart [3]: a network with local representation, parallel processing, and an activation mechanism (top-down and bottom-up processes). We present here new updates of the model, and discuss new results obtained from tests on real images. The organization of the paper is as follows: section 2 explains why we chose McClelland & Rumelhart's reading model as the inspiration for our system; section 3 briefly presents our method; results from simulations are shown and discussed in section 4. The last section concludes the presentation.
2 Previous work
One of the trends in cursive script recognition is to draw inspiration from reading models [4, 5]. It has been suggested that reading handwriting in context requires no more features than the first letter and the word shape [4]. Obviously, reading is facilitated by the presence of additional information. Nevertheless, the underlying idea is to derive benefit from studies of reading in order to build efficient automatic reading systems. The question is, which reading model is best suited for the task? Jacobs & Grainger [6] have published an overview of word reading models, in which the models are compared and evaluated according to their ability to reproduce behaviors observed in humans. Three models emerged from the others: the Dual Route model of Coltheart & Rastle [7], the Interactive Activation Model of McClelland & Rumelhart [3], and the Verification Model of Paap et al. [8]. The first and the last are representatives of a classical school, while the second is an adaptation of "brain-style" simulation models. Since the perceptual approach is our first motivation, we have chosen the McClelland & Rumelhart model, as explained in [5]. Moreover, it is possible to adapt an implicit segmentation procedure to this hierarchical architecture [9]. In fact, this model has been especially designed to mimic the Word Superiority Effect (the superiority of letter recognition within a context over the recognition of isolated letters). Thus, it gives us the opportunity to improve the segmentation by taking advantage of contextual information. This Interactive Activation Model has been modified many times subsequently. Its local representation has been changed to a distributed one [10]. (In a network with local representation, each cell or neuron corresponds to a specific concept: for example, there is a cell for the word "fifteen" and one for the letter "f". In a network with a distributed representation, by contrast, the concept "fifteen" is represented by several cells.) These ideas have given birth to the current generation of neural networks, which are able to learn new paradigms. However, when comparing both of McClelland's models (the classical one and the distributed one), Forster [11] remains skeptical about the pretension of parallel distributed models to explain cognitive functions in terms of neurons. To him, a simulation is not an explanation: the demonstration that a network works does not give a theoretical explanation of the cognitive processes involved. He concludes:
"I suggest that the type of connections needed are effectively equivalent to local representations... I also protest the trend to substitute simulations for theoretical explanations" ([11], p. 1303).
We are facing the "black box" problem here. Jacobs supports Forster's affirmation:
"evidence from behavioral and brain imaging studies supports a word recognition model closer in spirit to the interactive-activation model (McClelland & Rumelhart) than to more recent (distributed) models (Seidenberg & McClelland, 1989)" ([6], p. 1326).
Because we chose a local representation instead of a distributed one for our model, we are not in the mainstream of current recognition methods, but we still value our own slant on the matter: human perception. Hence, in our network there is no learning phase based on connection weight modification. The main advantage of this knowledge representation is that we do not need an extensive database to train the system.
3 Method
A thorough description of our recognition method can be found in [5]. Nevertheless, some modifications have been added to the original system. We will briefly summarize the behavior of our method and enumerate the main original features of our model. After this, we will focus mainly on the new updates. The different modules of the system are represented in Fig. 1 and will be described in this section.
Figure 1: System overview. The modules are pre-processing, baseline extraction, feature extraction (key-letters, face-up and face-down valleys, word length) and recognition. In this example, the key-letters are 'h', 'd', 'e' and 'd'.

3.1 Behavior of the recognition model

Our system is based on three levels of detectors, hierarchically organized at feature, letter, and word levels, as shown in Fig. 2.
Figure 2: Three-level system.

We assume that there is a detector for each word in the considered lexicon, as well as for each letter and for each feature associated with a given location in the image. Since the letters in the image are only in a relative order, we model the letter position by a fuzzy area which does not necessarily contain exactly one letter. We call this area a "zone", which may be envisaged as a window superimposed on the image. Connections between adjacent levels are excitatory and bi-directional, except between the feature and letter levels, where the connection is bottom-up only. There are no connections within the same level. Each detector or unit has a momentary activation A_i(t). The unit's activation varies between 0 and 1 in accordance with the following equation:

A_i(t + Δt) = (1 − θ) A_i(t) + E_i(t)    (1)

where A_i(t + Δt) is the new value of the activation of unit i at time (t + Δt), θ is a constant for the unit's decay, set to 0.07, and E_i(t) is the effect on unit i at time t due to inputs from its neighbors. This effect is represented by:
E_i(t) = n_i(t) (M − A_i(t))    (2)

where M is the maximum activation level of the unit, set to 1, and n_i(t) is the excitatory influence of the neighbors on unit i at time t, defined as:

n_i(t) = Σ_j α_ij a_j(t)    (3)

where a_j(t) is the activation of an active excitatory neighbor j of unit i, and α_ij is the associated weight constant.
The factor (M − A_i(t)) modulates the contribution n_i(t) of the neighbors, to keep the input from driving the unit beyond its maximum. As can be seen, when the activation of the unit has reached its maximum value (one), the effect of the input is reduced to zero. Thus, the activation is always bounded. The weights of the connections are not learned; they adapt during the process according to the following formulas:

α_fl = 1/N_F,    α_lw = F(·)_lw / N_Z,    α_wl = 1/N_W

where N_F is the number of features found in the signal for letter L, F(·)_lw is the position coefficient for letter L in word W (its definition follows), N_Z is the number of zones found in the signal at time t, and N_W is the number of words in the lexicon containing letter L. During the bottom-up process, we first try to match the activated letter with the related word (letter-word lexicon), at the right position within the word. To do so, a labeling technique has been developed which deals with the relative order of letters and the letter's fuzzy position in a word. Letter labeling starts from the ends of the word and moves inwards, since this has been shown to be the way humans proceed during the early stages of reading [12]. Letters described by meaningful features and with high activation are labelled first. The estimation of the relative letter positions is a very important parameter for the successful letter labeling of each word in the lexicon. During the labeling process, we always compare the position of the tested letter in the word with the corresponding position in the image. Since the letter position in the image is not known precisely, we introduce a position coefficient (a real number between 0 and 1). When its value is 1, we consider that the letter in the word corresponds exactly to the letter in the image. On the contrary, a value of 0 means that the letter in the word does not correspond at all to the letter in the image. Between these two limits, the coefficient varies as a fuzzy function.
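To make these dynamics concrete, the following is a minimal sketch, in Python, of the update rule of equations (1)-(3) and of the adaptive weights. Only the constants (θ = 0.07, M = 1) and the formulas come from the text above; the function names and the representation of a unit's neighborhood as (weight, activation) pairs are our own illustrative choices.

```python
# Minimal sketch of the interactive-activation update, eqs. (1)-(3).
# The neighbourhood representation is an illustrative assumption;
# the constants and formulas come from the text above.

THETA = 0.07  # decay constant of a unit (paper value)
M = 1.0       # maximum activation level (paper value)

def update_activation(a_i, neighbours):
    """One update step for unit i.

    a_i        -- current activation A_i(t), in [0, 1]
    neighbours -- iterable of (alpha_ij, a_j) pairs for the active
                  excitatory neighbours of unit i
    """
    n_i = sum(alpha_ij * a_j for alpha_ij, a_j in neighbours)  # eq. (3)
    e_i = n_i * (M - a_i)                                      # eq. (2)
    return (1.0 - THETA) * a_i + e_i                           # eq. (1)

# Adaptive (non-learned) connection weights, recomputed each cycle:
def alpha_feature_letter(n_features):
    return 1.0 / n_features            # alpha_fl = 1 / N_F

def alpha_letter_word(position_coeff, n_zones):
    return position_coeff / n_zones    # alpha_lw = F(.)_lw / N_Z

def alpha_word_letter(n_words_with_letter):
    return 1.0 / n_words_with_letter   # alpha_wl = 1 / N_W
```

Because the effect term is scaled by (M − A_i(t)), repeated application of update_activation keeps the activation below M, which is exactly the bounding behavior described above.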
3.2 Main contributions
The main original features of our model are:

1. an architecture based on a reading model: hierarchical, parallel, with local representation and an interactive activation mechanism;
2. significant perceptual features in word recognition, such as ascenders and descenders;
3. a fuzzy position concept dealing with the location uncertainty of features and letters;
4. an outside-in labeling process which mimics the human recognition process while reading;
5. adaptability of the model to words of different lengths and from different languages.
3.3 New updates
In the following sections, we describe the recent updates added to the model: new baseline and feature extraction modules, and a new generation, validation and hypothesis insertion process.
3.3.1 Baseline extraction
We have developed a new method for baseline extraction, described in [13]. In this method the slant of the word is not corrected, and the baseline follows the word as it is written. We basically compute several histograms for different y projections and calculate the entropy associated with each of them. The histogram having the lowest entropy corresponds to the writing direction. Following a simple heuristic, we then find the thresholds delimiting the body of the word. Examples of this baseline extraction technique are shown in Fig. 3.
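As an illustration of the entropy criterion, here is a minimal sketch assuming a binary word image stored as a 2-D NumPy array (1 = ink). The angle range, the step size, and the body-threshold heuristic are our own assumptions; the exact procedure is given in [13].

```python
# Sketch of the entropy criterion for the writing direction, assuming
# a binary word image as a 2-D NumPy array (1 = ink). The angle range
# and the body heuristic are illustrative assumptions; see [13].
import numpy as np
from scipy.ndimage import rotate

def projection_entropy(image, angle_deg):
    """Entropy of the y-projection histogram of the rotated image."""
    rotated = rotate(image, angle_deg, reshape=True, order=0)
    hist = rotated.sum(axis=1).astype(float)  # ink count per row
    p = hist / hist.sum()                     # normalise to a distribution
    p = p[p > 0]                              # ignore empty rows
    return float(-(p * np.log2(p)).sum())

def writing_direction(image, angles=range(-45, 50, 5)):
    """Projection angle whose histogram has the lowest entropy."""
    return min(angles, key=lambda a: projection_entropy(image, a))

def body_limits(image, frac=0.5):
    """Illustrative heuristic: rows where the y-projection stays above
    a fraction of its peak delimit the body of the word."""
    hist = image.sum(axis=1).astype(float)
    rows = np.where(hist >= frac * hist.max())[0]
    return int(rows[0]), int(rows[-1])
```

In practice, body_limits would be applied to the image rotated to the detected writing direction, so that the thresholds follow the word as it is written.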
Figure 3: Baseline extraction for the short word "Two", for a word with descenders such as "Tappan", and for the slanted word "Treadwell".

3.3.2 Feature extraction
Three types of features are extracted in this method, as shown in Fig. 1: key-letters, face-up and face-down valleys, and a word length estimation. The key-letters [14] are the letters (or parts of letters) described by an ascender, a descender or a loop within the body of the word. These features are robust and serve as the anchor points of the recognition process. The face-up and face-down valleys are the peaks and valleys extracted from the lower and upper contours of the word; here, the background of the image is taken into account. These features are less stable, but will be used in the generation, validation and hypothesis insertion process to find other clues leading to the identity of the target word. The word length is defined as the number of letters in the word. This estimation is quite difficult to perform precisely in the cursive script context. We still have to build a module to extract this feature; at the moment, we give this parameter to the system.

3.3.3 Contextual analysis

The use of contextual information, i.e. the lexicon in this case, proceeds through three successive mechanisms: generation, validation and hypothesis insertion. They give our method the necessary flexibility to perform an implicit segmentation of the word, always guided by feedback on the input image. Fig. 4 shows how these three steps work. At the beginning, some anchor zones have been found, while other parts of the word are still unknown.
Figure 4: Generation, validation and hypothesis insertion.

Additional hints are then necessary to give us a clue as to what should be found in these unknown areas. This is where hypothesis generation enters into play. First of all, the mean width of the anchor zones (related to the key-letters) is found. Then, the system builds a topological representation of the input word image based on information such as the mean width of a zone and the beginning and end of a zone. It also takes into account the estimated number of letters between the anchor zones. Once this topology has been established, the system computes the distance between this expected topology and each of the labelled words in the lexicon. The words closest to this target topology are retained as word candidates for hypothesis validation. At the validation step, for each word considered as a possibility, we try to validate the retained hypotheses against the input image. A letter in a word is validated if its features can be found in the image. When a letter is validated, the score associated with the corresponding word is increased. The words with the highest scores participate in the insertion process. At the insertion step, for each candidate word and for each validated letter within this word, a zone is created and inserted at the appropriate location in the image. In each of these new zones, the feature detectors corresponding to the features found in the zone are activated. In the next cycle, these features also contribute to the propagation of activation among the three levels of the system. Each time a new zone is created, a box with the same dimensions is drawn on the image. Hence, it is quite easy to visualize the contribution of hypothesis insertion to the recognition process, at each cycle.
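The following toy sketch illustrates the flow of the three steps. The topology encoding (one character per estimated letter slot, with '?' for unknown slots), the similarity measure (difflib's SequenceMatcher), and the feature-test callback are simplified stand-ins for the topological distance and feature detectors described above, not the paper's actual implementation.

```python
# Toy sketch of generation / validation / insertion. The topology
# encoding, the similarity measure and the feature test are
# simplified stand-ins for the mechanisms described above.
from difflib import SequenceMatcher

def generate(expected_topology, lexicon, n_candidates=3):
    """Generation: keep the lexicon words closest to the expected
    topology built from the anchor zones, e.g. 'f????e?'."""
    def similarity(word):
        return SequenceMatcher(None, expected_topology, word).ratio()
    return sorted(lexicon, key=similarity, reverse=True)[:n_candidates]

def validate(candidates, letter_found_in_image):
    """Validation: score each candidate by the number of its letters
    whose features are found in the image at the expected position;
    letter_found_in_image(letter, position) -> bool."""
    return {word: sum(letter_found_in_image(l, pos)
                      for pos, l in enumerate(word))
            for word in candidates}

def insert(scores, min_score, letter_found_in_image):
    """Insertion: for the best-scoring words, return the (word,
    position, letter) triples for which a new zone would be created
    and its feature detectors activated in the next cycle."""
    best = [w for w, s in scores.items() if s >= min_score]
    return [(w, pos, l) for w in best for pos, l in enumerate(w)
            if letter_found_in_image(l, pos)]
```

For example, generate("f????e?", ["fifteen", "sixteen", "two"]) would rank "fifteen" ahead of "two"; each triple returned by insert then corresponds to a zone inserted at the matching location in the image, after which the activation propagation of Section 3.1 takes over.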
4 Simulations and results
After implementation of our method, some results have been obtained on real images. The value associated with each word in the output list is an activation value: the higher the value, the higher the likelihood that this word corresponds to the target word. With the activation curves shown in Fig. 5, we can follow the behavior of the most activated units (not all units are represented here) involved in the recognition process of the word "fifteen".
Figure 5: (a) Word activation curves; (b) letter activation curves, for "fifteen"; zone 1 corresponds to letter 'f'.

In Fig. 5(a), the candidate word "fifteen" is more activated because it has more matched letters than its competitors. One should notice that "fifteen", "sixteen", "eighteen" and "fourteen" belong to the same perception family: they share the same global shape. In Fig. 5(b), the letters 'f' and 'g' of zone 1, at the beginning of the word "fifteen", are in competition because they share a common feature, namely a descender. Since few words in the lexicon begin with the letter 'g', the letter 'f' receives more activation. The letters 'f' and 'g' are in competition, but the feedback of the word "fifteen" on the letter 'f' enables this letter to win the competition. This illustrates very well the "Word Superiority Effect" described by our activation equation: when a cell is activated, it receives more stimuli from its neighbors of adjacent levels, and therefore its activation increases.
4.1 Database
At the CENPARMI lab, where this research was partly conducted, a large database has been built for the needs of an important project aiming at the recognition of the legal amount on cheques [15]. To be more precise, this database has been created from 2,500 handwritten cheques written in English and 1,900 handwritten cheques written in French. The number of writers is estimated to be close to 800 for the English cheques and close to 600 for the French cheques. In this database, 7,837 English words and 7,135 French words are available. We chose this database first because it was already available and can serve as a common basis for the performance evaluation of different methods, and second because of compatibility advantages if this method were to be integrated into a larger framework combining different experts.
4.2 Testing conditions and parameters
The word recognizer has been trained on a small set of 184 images and tested on a set of 100 images. None of these images contain capital letters: for now, our feature-letter lexicon does not contain any capital letter definitions. It will be completed in the near future. The lexicon used for the tests includes 32 English cursive words of 3 to 9 letters. Because our training set is small, the rates are given as a first approximation and need to be confirmed on a larger set.
Table 1 gives the results obtained for the recognition of isolated cursive words without hypothesis generation, validation or insertion. The module which estimates the number of letters in the word to be recognized has not been implemented so far; we therefore must give this information to the system.

Table 1: CWR results (% correct in top N choices).

N (%)          1      2      5      10
Training set   88.04  91.85  94.57  96.74
Testing set    87     89     95     96
4.3 Discussion
As might be expected, words described by a larger number of anchor features are more likely to be recognized easily. This facilitates the detection of long words and increases the confidence associated with their recognition. Conversely, short words are more difficult to recognize. This is even worse for words such as "one" and "nine", which contain no ascenders or descenders. This is why, in these cases, we plan to include additional feature extraction to improve the recognition rates. The system has shown a sensitivity to the word length parameter when we varied its value. Since not much attention has been given to this problem yet, we think it is still possible to improve the robustness of our method with respect to this parameter.
5 Summary and conclusions
After a brief description of our method, in which we underlined the major contributions of our system and justified the choice of our architecture, we have tested the model on a training set of 184 images and a testing set of 100 images of English cursive words. New results have been presented as a first approximation, and need to be validated on a larger set of real images. Although this method is not yet completely automated, we have shown that it is operational and that it has the expected behavior (cf. Fig. 5), because it conforms to the perceptual concepts studied.
Acknowledgments
We thank the IRIS National Networks of Centres of Excellence and NSERC of Canada for their financial support.
References
[1] N. Bartneck. The role of handwriting recognition in future reading systems. In Fifth International Workshop on Frontiers in Handwriting Recognition, pages 147-176, University of Essex, England, 1996.
[2] M. Côté, E. Lecolinet, M. Cheriet, and C.Y. Suen. Building a perception based model for reading cursive script. In Proc. of the Third ICDAR '95, Vol. II, pages 898-901, Montreal, Canada, August 14-16, 1995.
[3] J. L. McClelland and D. E. Rumelhart. An interactive activation model of context effects in letter perception. Psychological Review, 88:375-407, 1981.
[4] C. Higgins and P. Bramall. An on-line cursive script recognition system. In M.L. Simner, C.G. Leedham, and A.J.W.M. Thomassen, editors, Handwriting and Drawing Research: Basic and Applied Issues, pages 285-298. IOS Press, Amsterdam, 1996.
[5] M. Côté, E. Lecolinet, M. Cheriet, and C.Y. Suen. Using reading models for cursive script recognition. In M.L. Simner, C.G. Leedham, and A.J.W.M. Thomassen, editors, Handwriting and Drawing Research: Basic and Applied Issues, pages 299-313. IOS Press, Amsterdam, 1996.
[6] A. M. Jacobs and J. Grainger. Models of visual word recognition: sampling the state of the art. Journal of Experimental Psychology: Human Perception and Performance (Special Section: Modeling Visual Word Recognition), 20(6):1311-1334, 1994.
[7] M. Coltheart and K. Rastle. Serial processing in reading aloud: Evidence for dual-route models of reading. Journal of Experimental Psychology: Human Perception and Performance, 20:1197-1211, 1994.
[8] K. Paap, S. L. Newsome, J. E. McDonald, and R. W. Schvaneveldt. An activation-verification model for letter and word recognition: The word superiority effect. Psychological Review, 89:573-594, 1982.
[9] R. G. Casey and E. Lecolinet. A survey of methods and strategies in character segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(7):690-706, 1996.
[10] J. L. McClelland. Putting knowledge in its place: a scheme for programming parallel processing structures on the fly. Cognitive Science, 9:113-146, 1985.
[11] K. I. Forster. Computational modeling and elementary process analysis in visual word recognition. Journal of Experimental Psychology: Human Perception and Performance (Special Section: Modeling Visual Word Recognition), 20(6):1292-1310, 1994.
[12] I. Taylor and M. M. Taylor. The Psychology of Reading, chapter 9: Letter and Word Recognition, pages 201-203. Academic Press, New York, 1983.
[13] M. Côté, M. Cheriet, E. Lecolinet, and C.Y. Suen. Détection des lignes de base de mots cursifs à l'aide de l'entropie (Baseline extraction of cursive words using entropy). In Actes du congrès de l'Association canadienne-française pour l'avancement des sciences, Montréal, Canada, May 13-17, 1996.
[14] M. Cheriet and C. Y. Suen. Extraction of key letters for cursive script recognition. Pattern Recognition Letters, 11:1009-1017, 1993.
[15] L. Lam, C. Y. Suen, D. Guillevic, N. W. Strathy, M. Cheriet, K. Liu, and J. N. Said. Automatic processing of information on cheques. In International Conference on Systems, Man and Cybernetics, pages 2353-2358, Vancouver, Canada, October 1995.