Handwritten Email Address Recognition with Syntax and Lexicons

D.N. Zheng, J. Sun, S. Naoi

Y. Hotta, A. Minagawa, M. Suwa, K. Fujimoto

Fujitsu R&D Center Co., Ltd. Beijing, China

Fujitsu Laboratories Ltd. Kawasaki, Japan

{zhengdanian, sunjun, naoi}@cn.fujitsu.com

{y.hotta, minagawa.a, suwa, fujimoto.kat}@jp.fujitsu.com

Abstract. The email address has become an important item of personal information for daily communication, and the demand for recognizing handwritten email addresses is increasing. This paper presents a recognition system for handwritten email addresses. It comprises six main steps: deskew the handwritten line, erect the slanted characters, separate the string into words, segment touching characters, recognize words with syntax and lexicons, and postprocess the string. The system makes comprehensive use of the email address structure, segmentation results, recognition results and lexicon knowledge. Experimental results show that on good quality strings (91 samples) the system reaches a recognition rate of 94.6% for single characters and 59.3% for whole strings.


Keywords: handwritten email address recognition, skew detection, slant correction, bichain elastic matching, segmentation hypothesis graph.

1. Introduction

The email address has become an important item of personal information for daily communication. A valid email address [1], e.g. "jzhang2001@student.dlut.edu.cn", consists of three parts: a user name, e.g. "jzhang2001", the middle "@", and a domain name, e.g. "student.dlut.edu.cn". The domain name follows a hierarchical system and can be divided into a top-level domain name and a custom domain name. The top-level domain is made up of common words such as "edu" and "cn". The custom domain name represents the organization or person that applies for it, e.g. "student" and "dlut". Dots '.' are used as separators between the words of the user name and the domain name. A credit-card reader with a built-in digital camera for capturing a template signature and an email address on a credit card is presented in [2]. That reader can recognize a label stripe carrying a printed email address with an optical character recognition program, but so far there is no literature on the recognition of handwritten email addresses. This paper describes a recognition system for handwritten email addresses. Unlike general document recognition, handwritten email address recognition can exploit several special facts, e.g. the format of a valid email address, the hierarchical system of the domain name, a set of legal words in the domain name, and personal information in the user name. To make use of these facts, the proposed recognition system proceeds in the following steps: I. deskew the handwritten line; II. erect the slanted characters of the string; III. separate the string into words by '@' and '.'; IV. segment candidates from touching characters; V. recognize words with syntax and lexicons; VI. postprocess the string. The system fuses the email address structure, segmentation results, recognition results and lexicon information, so good recognition rates can be achieved for words and whole strings.
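As a concrete illustration of the address structure described above (not part of the paper's method), the small sketch below splits an address into its user name and domain labels; the function name is ours.

```python
# Minimal illustration of the email-address structure: user name, '@', domain labels.
def split_address(addr: str):
    """Split a valid address into its user name and the list of domain words."""
    user, domain = addr.split("@", 1)   # a valid address contains exactly one '@'
    labels = domain.split(".")          # dots separate the words of the domain name
    return user, labels

user, labels = split_address("jzhang2001@student.dlut.edu.cn")
print(user)    # 'jzhang2001'
print(labels)  # ['student', 'dlut', 'edu', 'cn'] -- read right to left: cn -> edu -> dlut -> student
```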

Figure 1. The system recognizes the handwritten email addresses with syntax and lexicons.

2. System Overview

Our recognition system for handwritten email addresses consists of six main steps, as shown in Fig. 1.

I. Deskew the Handwritten Line. People sometimes write email addresses not on a straight line but on a skewed or curved one. A recursive linear or polynomial ridge regression algorithm is proposed to fit the handwritten line, and the regression result is used to correct the positions of all stroke pixels in the string image.

II. Erect the Slant Characters. Handwritten alphabetic and numeric characters are often slanted at a certain angle. An entropy minimization algorithm is proposed to estimate the italic angle of the handwritten characters in the string. The estimated angle is used to correct the characters from italic to upright.

III. Separate the String into Words. A valid email address can be separated into multiple words by one '@' and some dots '.'. First the string image is divided into a number of isolated components by connectivity analysis. Noise components are identified by their pixel counts and positions and deleted from the image. Some adjacent components are combined into one; for example, if a component is the dot of an 'i' or 'j', its adjacent body component is found, and the two components are combined and connected by a shortest path. Then the unique '@' is recognized and all dots '.' are detected, and the string is separated into a number of words by these separators.

IV. Segment Candidates from Touching Characters. A long component may consist of multiple characters. It is broken into a number of small pieces so that every 1~3 pieces make up one single character. A bichain elastic matching algorithm is proposed to search for all potential segmentation points of long components, which may be several connected characters. One component may thus be subdivided into multiple pieces, each representing a single character or only a local stroke of a character. Each word can then be reconstructed from a set of non-overlapping components and pieces.

V. Recognize Words with Syntax and Lexicons. The words in the domain name follow a hierarchical system, so they are recognized from root to node (from right to left). First we meet the words in the top-level domain name, e.g. "cn", "jp", "ac/edu", "co/com", "go/gov", "net", "org", which are collected in the "top lexicon". If the top-level domain name contains "ac" or "edu", we next meet the words in the custom domain name for education, e.g. "pku", "dlut", "tsinghua" in the China area, which are collected in the "edu lexicon"; otherwise we meet the words in the custom domain name for the public, e.g. "263", "msn", "eyou", "sina" in the China area, which are collected in the "pub lexicon". The words in the user name are more flexible and complex: family names, years and other information may appear there. Some trigrams, e.g. "ang", "cao", "che" and "200" for the China area, are collected in the "trigram lexicon". A segmentation hypothesis graph, which makes comprehensive use of segmentation results, recognition results and lexicon knowledge, is built for each word. There are many possible paths from the start node to the end node, and on each path there are many possible words combined from the recognition candidates of its nodes. The optimal word on the optimal path is taken as the recognition result.

VI. String Postprocessing. Similar numerals and letters in the user name, e.g. '0' and 'o', '1' and 'l', '2' and 'z', need to be distinguished by empirical rules. The whole domain name, e.g. "sina.com.cn", can be validated against a database.
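To make the data flow of these six steps concrete, here is a minimal pipeline skeleton. The function names are placeholders of ours, not APIs from the paper; it assumes the string image is a binary 2-D array and that each stage is supplied as a callable.

```python
# Hypothetical pipeline skeleton mirroring steps I-VI above; all names are placeholders.
import numpy as np

def recognize_email_address(image: np.ndarray,
                            deskew, erect, separate, oversegment,
                            recognize_word, postprocess) -> str:
    """image: binary string image (H x W); the six callables implement steps I-VI."""
    image = deskew(image)                        # I.   fit the handwritten line, correct skew
    image = erect(image)                         # II.  estimate the italic angle, correct slant
    words = separate(image)                      # III. split components into words by '@' and '.'
    words = [oversegment(w) for w in words]      # IV.  break long components into pieces
    texts = [recognize_word(w) for w in words]   # V.   lexicon-driven word recognition
    return postprocess(texts)                    # VI.  resolve similar characters, validate domain
```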

3. Deskew the Handwritten Line

A recursive linear or polynomial ridge regression algorithm is proposed to fit the string line. First the input string image is converted into a binary image. Then a training point (x, y), where x = (1, x, ..., x^p)^T with x = j/n and y = i/m, is constructed from every stroke pixel P_ij. Here m and n are the height and width of the string image, and p is the order of the polynomial kernel. A large training set is obtained from all stroke pixels, and this set is used to train a hyperplane in the polynomial feature space:

f(x) = w^T x = (w_0, w_1, ..., w_p)(1, x, ..., x^p)^T.    (1)

In the input space, if p = 1 this is a linear fitting problem; if p ≥ 2 it is a polynomial fitting problem. All input vectors can be arranged into a matrix X = [x_1, x_2, ..., x_l]^T of size l×(p+1), and all outputs into a vector y = (y_1, y_2, ..., y_l)^T. Then the solution of the polynomial ridge regression [3] can be written as

w = (X^T X + λ l I)^{−1} X^T y,    (2)

where I denotes the identity matrix and λ a small constant, e.g. λ = 0.001. To obtain a more accurate approximation of the string line, a recursive training procedure is used. At the t = 0th training, the whole training set is used as training data, and a standard deviation σ of the approximation errors y_i − f(x_i) is estimated. At the following t = 1st, 2nd, ... trainings, a new subset containing only those samples lying within the zone

f_{t−1}(x_i) − σ ≤ y_i ≤ f_{t−1}(x_i) + σ    (3)

is selected from the whole training set. These recursive trainings remove outliers from the original training set and make the estimate converge to the center line of the string. Finally, the regression result is used to correct the line skew. For each stroke pixel P_ij, its new row coordinate is calculated as i' = (0.5 + i/m − f(x | x = j/n)) · m. The pixel values of the result image can be calculated by nearest neighbor or linear interpolation. An example image of a handwritten email address is shown in Fig. 2(a). The solid line denotes the final regression result y = f(x; p = 5) after the 3rd training, and the two dotted lines denote the zone y = f(x; p = 5) ± σ within which most of the stroke pixels are bounded. The image after skew correction is given in Fig. 2(b).
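A minimal NumPy sketch of the recursive polynomial ridge regression of Eqs. (1)-(3) is given below. It assumes a binary image with foreground pixels equal to 1 and uses nearest-neighbor resampling; variable names are ours.

```python
import numpy as np

def deskew(img: np.ndarray, p: int = 5, lam: float = 0.001, iters: int = 3) -> np.ndarray:
    """Fit the handwritten line by recursive polynomial ridge regression and
    shift each stroke pixel so the fitted line moves to the image center row."""
    m, n = img.shape                          # height and width of the string image
    rows, cols = np.nonzero(img)              # stroke pixels P_ij
    x = cols / n                              # x = j / n
    y = rows / m                              # y = i / m
    X = np.vander(x, p + 1, increasing=True)  # rows (1, x, ..., x^p), shape l x (p+1)

    keep = np.ones(len(x), dtype=bool)
    for _ in range(iters + 1):
        Xk, yk = X[keep], y[keep]
        l = len(yk)
        # w = (X^T X + lambda * l * I)^{-1} X^T y   -- Eq. (2)
        w = np.linalg.solve(Xk.T @ Xk + lam * l * np.eye(p + 1), Xk.T @ yk)
        f = X @ w                             # fitted center line f(x_i) for every pixel
        sigma = np.std(y[keep] - f[keep])     # spread of the approximation errors
        keep = np.abs(y - f) <= sigma         # Eq. (3): keep pixels inside the +/- sigma zone

    out = np.zeros_like(img)
    for i, j in zip(rows, cols):
        # new row coordinate i' = (0.5 + i/m - f(j/n)) * m
        i2 = int(round((0.5 + i / m - np.polyval(w[::-1], j / n)) * m))
        if 0 <= i2 < m:
            out[i2, j] = 1                    # nearest-neighbor resampling
    return out
```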

Figure 2. (a) Handwritten line fitted by the recursive polynomial ridge regression; (b) after skew correction.

4. Erect the Slant Characters

An entropy minimization algorithm is applied to estimate the italic angle of the handwritten characters. The Shannon entropy from information theory [4],

E = −∑_j p_j log p_j,    (4)

is used to evaluate the degree of slant of the handwritten characters, where the p_j are the probabilities of the column projections of all stroke pixels. The smaller the entropy, the smaller the slant effect. Our target is to select one angle from a set of angles {−N∆, ..., −∆, 0, ∆, ..., N∆}, e.g. ∆ = 3° and N = 10, such that the entropy of the result image after slant correction reaches a minimum. If the italic angle is α_k = k∆, then the new column coordinate of each stroke pixel is given by j' = j + (i − 0.5m) tan(α_k). After the correction, the vertical projections of the stroke pixels of the result image are calculated and normalized. The column distribution probabilities p_j, j = 1, ..., n, are thus obtained and used to calculate the Shannon entropy E(α_k). The optimal italic angle is

α_k* = arg min_{α_k} E(α_k),    (5)

which is used to correct the characters from italic to upright.

Figure 3(a) describes how the entropy changes with the italic angle. At the optimal italic angle α_k* = 9° = 0.05π, the entropy reaches its minimum. Figure 3(b) shows the result image after slant correction.

Figure 3. (a) Entropy changes with the italic angle; (b) after slant correction.
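A minimal sketch of the entropy-minimization slant estimation of Eqs. (4)-(5), assuming a binary image; the per-pixel shear follows the text, and variable names are ours.

```python
import numpy as np

def estimate_slant(img: np.ndarray, delta_deg: float = 3.0, N: int = 10) -> float:
    """Return the italic angle (radians) that minimizes the column-projection entropy."""
    m, n = img.shape
    rows, cols = np.nonzero(img)
    best_angle, best_entropy = 0.0, np.inf
    for k in range(-N, N + 1):
        alpha = np.deg2rad(k * delta_deg)
        # shear every stroke pixel: j' = j + (i - 0.5 m) * tan(alpha)
        j2 = np.round(cols + (rows - 0.5 * m) * np.tan(alpha)).astype(int)
        j2 = np.clip(j2, 0, n - 1)
        proj = np.bincount(j2, minlength=n).astype(float)  # column projection
        p = proj[proj > 0] / proj.sum()                    # normalized probabilities
        entropy = -(p * np.log(p)).sum()                   # Eq. (4)
        if entropy < best_entropy:
            best_entropy, best_angle = entropy, alpha      # Eq. (5): arg min
    return best_angle
```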

5. Separate String into Words

After 8-neighbor connected component analysis, the string image is divided into a number of isolated components. Some of them are noise and some are only broken strokes of characters. Noise components are identified by their pixel counts (smaller than a dot) or positions (far from the top or bottom rows of the string) and are deleted from the image. Some adjacent components are combined into one. If the previous component completely bounds the current component, the two components are combined and connected by a shortest path; for example, '@' may be divided into an inner and an outer component, and the two need to be combined. If a component looks like the dot of an 'i' or 'j', i.e. it is small and lies above the center row, its adjacent body component is found, and the two components are combined and connected by a shortest path. There is one and only one '@' in an email address. In most cases '@' appears as a single component; otherwise an '@' touching other characters can be recovered by splitting it from them. The component with the largest '@' probability, obtained from an MQDF or SVM classifier, is taken as '@'. Some components are recognized as separators '.' according to their sizes and positions. Then the whole string is separated into several words.

Figure 4. The string image is divided into a number of components, and components are separated into a few words by one ‘@’ and some ‘.’.

An example is given in Fig. 4. All components of the string image are marked by their bounding boxes. Noise components are removed and broken stroke components are combined, e.g. for 'j'. These components are then separated into 5 words by one '@' and three dots '.'.
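A sketch of the word-separation logic, assuming connected components have already been extracted and each carries a bounding box and a classifier probability for '@'. The data layout, dot heuristic and thresholds are ours, not the paper's implementation.

```python
# Hypothetical component record: bounding box plus a classifier probability for '@'.
from dataclasses import dataclass
from typing import List

@dataclass
class Component:
    box: tuple          # (left, top, right, bottom) in image coordinates
    at_prob: float      # classifier probability that this component is '@'

def is_dot(c: Component, x_height: float, baseline: float) -> bool:
    """Small component sitting near the baseline is taken as a separator '.'."""
    l, t, r, b = c.box
    small = (r - l) < 0.5 * x_height and (b - t) < 0.5 * x_height
    return small and b > baseline - 0.5 * x_height

def split_into_words(comps: List[Component], x_height: float,
                     baseline: float) -> List[List[Component]]:
    comps = sorted(comps, key=lambda c: c.box[0])                        # left-to-right order
    at_index = max(range(len(comps)), key=lambda i: comps[i].at_prob)    # the unique '@'
    words, current = [], []
    for i, c in enumerate(comps):
        if i == at_index or is_dot(c, x_height, baseline):
            if current:
                words.append(current)          # close the word in front of the separator
            current = []
        else:
            current.append(c)
    if current:
        words.append(current)
    return words
```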

6. Oversegment Touching Characters

A long component may consist of multiple connected characters. It is segmented into a number of small pieces so that every 1~3 pieces make up one single character [5]. A bichain elastic matching algorithm is proposed to search for all potential segmentation points of long components. The algorithm proceeds as follows:

a. Search the contour of the long component with the 8-directional inner boundary tracking algorithm;
b. Separate the contour ring into two chains at the leftmost and the rightmost points: from left to right, the points of the up chain are denoted P_1, ..., P_m and those of the down chain are denoted Q_1, ..., Q_n;
c. Match point pairs between the up chain and the down chain: at each column k, find the lowest point P_i = (x_i^P, y_i^P) on the up chain and the highest point Q_j = (x_j^Q, y_j^Q) on the down chain; if their row coordinates satisfy 0 ≤ y_i^P − y_j^Q ≤ T, where T is a threshold with respect to the stroke width, then take them as a matched pair;
d. Calculate the index interval between every two adjacent matched pairs (P_{i_k}, Q_{j_k}) and (P_{i_{k+1}}, Q_{j_{k+1}}) as d_{k,k+1} = (i_{k+1} − i_k) + (j_{k+1} − j_k);
e. Select segmentation pairs from the matched pairs: if d_{k,k+1} ≥ D, where D is a threshold with respect to the x-height of the string, then select (P_{i_k}, Q_{j_k}) or a nearby matched pair as a segmentation pair;
f. Segment the long component into multiple pieces at the segmentation pairs.

The segmentation procedure is illustrated in Fig. 5. Figure 5(a) shows a long component containing the four connected characters "hang", with its two chains drawn on the right side. Every segmentation line must begin at a point on the up chain and end at a point on the down chain. The top of Fig. 5(b) shows the result of the elastic matching between the up chain and the down chain; the up chain is a little longer than the down chain. Most of the points cannot be matched; the matched pairs are connected by short black lines, and the segmentation pairs will be chosen from them. The bottom of Fig. 5(b) shows the index intervals between adjacent matched pairs as peaks, with the threshold marked by a horizontal purple line. If a peak rises above the threshold line, there is one segmentation pair near that peak, e.g. the left matched pair of the peak; the leftmost matched pair can never be a segmentation pair. All selected segmentation pairs are marked with large purple dots. In Fig. 5(c), the segmentation pairs are shown on the original long component. The component is subdivided into 6 pieces by 5 segmentation pairs. Every 1~2 pieces can form one single character, except for 'm' and 'w', which may be subdivided into 3 pieces.
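A sketch of steps c-e of the bichain elastic matching, assuming the upper and lower contour chains are given as left-to-right lists of (column, row) points with rows growing downward (our sign convention); the thresholds T and D are passed in.

```python
from typing import List, Tuple

Point = Tuple[int, int]   # (column x, row y); rows grow downward in image coordinates

def match_and_segment(up: List[Point], down: List[Point],
                      T: int, D: int) -> List[Tuple[Point, Point]]:
    """Return the selected segmentation pairs (P_i, Q_j) for a long component."""
    # step c: per column, the lowest point of the up chain and the highest point of the down chain
    low_up = {}
    for i, (x, y) in enumerate(up):
        if x not in low_up or y > up[low_up[x]][1]:
            low_up[x] = i
    high_down = {}
    for j, (x, y) in enumerate(down):
        if x not in high_down or y < down[high_down[x]][1]:
            high_down[x] = j
    matched = []   # (i, j) index pairs into the two chains
    for x in sorted(set(low_up) & set(high_down)):
        i, j = low_up[x], high_down[x]
        if 0 <= down[j][1] - up[i][1] <= T:   # vertical gap within the stroke-width threshold
            matched.append((i, j))

    # steps d-e: index intervals between adjacent matched pairs, then pick segmentation pairs
    seg_pairs = []
    for k in range(len(matched) - 1):
        i0, j0 = matched[k]
        i1, j1 = matched[k + 1]
        d = (i1 - i0) + (j1 - j0)
        if d >= D and k > 0:                  # wide interval; the leftmost pair is never a cut
            seg_pairs.append((up[i0], down[j0]))
    return seg_pairs
```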

Figure 5. Long component segmentation: (a) a long component and the two chains obtained by tracking its inner contour, (b) bichain elastic matching and segmentation point selection, (c) segmentation points on component.

7. Recognize Words with Lexicons

To make word recognition more reliable, a segmentation hypothesis graph (a similar idea to [6-7]), which makes comprehensive use of segmentation results, recognition results and lexicon knowledge, is established for each word. In the graph, the start node and the end node represent the left side and the right side of the word. Every middle node represents one component or a combination of 1~3 pieces. MQDF or SVM classifiers are used to recognize each node "character", and the recognition results, 1~3 recognition candidates with posterior class probabilities (candidates with low confidence may be rejected), are recorded in the node. If a node comprises 3 pieces but neither 'm' nor 'w' appears among the recognition candidates, then this node "character" is rejected, i.e. an unknown '?' with probability 0 is recorded. The words are recognized from the last one to the first one. First we meet the words in the top-level domain name. The top-level domain includes the country or area abbreviations, e.g. "au", "cn", "jp", "uk", and the function abbreviations, e.g. "ac"/"edu", "co"/"com", "go"/"gov", "net", "org".

For these words, a lexicon named "top lexicon" is made. Next we meet the words in the custom domain name. They can be divided into two categories: one belongs to education ("ac"/"edu") and the other belongs to the public. Two lexicons named "edu lexicon" and "pub lexicon" are made for them respectively. For the China area, words like "bnu", "pku", "dlut", "scut", "sjtu" and "tsinghua" are collected in the "edu lexicon"; words like "263", "msn" and "eyou" are collected in the "pub lexicon". The recognition result in the top-level domain name determines which lexicon is used for the custom domain name. Finally we meet the words in the user name. Family names, years and other information may appear in the user name. To cope with the complex structure of the user name, a lexicon named "trigram lexicon" is made. For China, trigrams like "199", "200", "ang", "cao", "che", "dan" and "eng" are collected in the "trigram lexicon".

There exist many possible paths from the start node to the end node. On each path, many possible words of the same length can be combined from the recognition candidates at its nodes. Our target is to select the optimal word on the optimal path as the recognition result. Words and paths are evaluated by their scores. For a word in the domain name, its score is defined as

S(c_1, ..., c_n; p_1, ..., p_n) =
    (p_1 + ... + p_n)/n + 1,  if c_1 ... c_n ∈ word lexicon,
    (p_1 + ... + p_n)/n,      otherwise,                                (6)

where c_1, ..., c_n are the selected candidates making up the word and p_1, ..., p_n are their posterior class probabilities or confidence values. For a word in the user name, its score is defined as

S(c_1, ..., c_n; p_1, ..., p_n) = (q_1 + ... + q_n)/n,                  (7)

where

q_i = p_i + 1,  if c_{i−2} c_{i−1} c_i ∈ trigram lexicon,
      p_i + 1,  if c_{i−1} c_i c_{i+1} ∈ trigram lexicon,
      p_i + 1,  if c_i c_{i+1} c_{i+2} ∈ trigram lexicon,
      p_i,      otherwise.

For a path in the graph, its score is equal to the highest among all word scores on the path. The path with the highest score is called the optimal path of the graph. On the optimal path, the word with the highest score is the final recognition result.

Figure 6. Segmentation hypothesis graph for a word in the domain name, the optimal segmentation path and the optimal recognition result on this path.

In Fig. 6, the segmentation hypothesis graph for the domain name word "dlut" and its optimal segmentation path are given. The two big dots denote the start and end nodes of the graph, the arrows denote the directed edges, and the middle nodes represent the "characters", i.e. one component or a combination of 1~3 pieces. Under each "character", 1~3 recognition candidates with their class probabilities are provided. If a "character" consists of 3 pieces but neither 'm' nor 'w' appears among the candidates, then this "character" is rejected, e.g. the node "ut '?' 0.000". Paths with a rejected node are rejected too, and their edges are shown as dotted arrows. The number of all possible paths is 2×3 = 6. The optimal path is shown with bold arrows. Because one of the words on the optimal path, "dlut", is found in the "edu lexicon", the path score is calculated as score = (0.945+0.167+0.753+0.886)/4 + 1.000 = 1.688.

Figure 7. Segmentation hypothesis graph for a word in the user name, the optimal segmentation path and the optimal recognition result on this path.

In Fig. 7, a segmentation hypothesis graph for the user name word “jzhang2001” and its optimal segmentation path are given. If a “character” consists of 1 piece but it is too small, then reject it, e.g. the right piece of ‘h’ and the left piece of ‘n’. After some paths are rejected, the total number of possible paths is 6×2 = 12. The optimal path is shown in bold arrows. Since trigrams “zha”, “han”, “ang”, “200” and “001” on the optimal path are found in the “trigram lexicon”, the path score is calculated as “score = (0.882+1.315+1.682+1.696+1.862+1.522+ 1.755+1.119+1.509+1.754)/10 = 1.410”.
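A minimal sketch of the scoring rules in Eqs. (6) and (7), assuming the candidates along one graph path are given as per-node lists of (character, probability) pairs and that the lexicons are plain Python sets; the enumeration of candidate combinations mirrors the description above but is not the paper's implementation.

```python
from itertools import product
from typing import List, Tuple

def domain_word_score(chars: str, probs: List[float], lexicon: set) -> float:
    """Eq. (6): mean candidate probability, plus 1 if the word is in the lexicon."""
    score = sum(probs) / len(probs)
    return score + 1.0 if chars in lexicon else score

def user_word_score(chars: str, probs: List[float], trigrams: set) -> float:
    """Eq. (7): each character earns a bonus of 1 if it lies inside a known trigram."""
    q = []
    for i, p in enumerate(probs):
        in_tri = any(chars[s:s + 3] in trigrams
                     for s in (i - 2, i - 1, i) if 0 <= s and s + 3 <= len(chars))
        q.append(p + 1.0 if in_tri else p)
    return sum(q) / len(q)

def best_word_on_path(candidates: List[List[Tuple[str, float]]],
                      score_fn) -> Tuple[str, float]:
    """Enumerate all candidate combinations along one path and keep the best word."""
    best = ("", float("-inf"))
    for combo in product(*candidates):
        chars = "".join(c for c, _ in combo)
        probs = [p for _, p in combo]
        s = score_fn(chars, probs)
        if s > best[1]:
            best = (chars, s)
    return best

# Example with the "dlut" word; the probabilities are those quoted in the Fig. 6 score.
edu_lexicon = {"dlut", "pku", "tsinghua"}
cands = [[("d", 0.945)], [("l", 0.167)], [("u", 0.753)], [("t", 0.886)]]
word, score = best_word_on_path(cands, lambda c, p: domain_word_score(c, p, edu_lexicon))
print(word, round(score, 3))   # dlut 1.688
```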

8. Experimental Results

After postprocessing, the segmented characters with their recognition candidates and the final recognized email address string are shown to the user. Figure 8 shows the final output of the system for the previous sample, where all selected candidates are marked by ellipses.

Figure 8. Characters on the optimal segmentation path, recognition candidates and the final recognition result for the email address.

A set of email addresses was collected from an international conference on the internet. Then 179 string images of handwritten email addresses written by 10 individuals were used in our experiments. The string images are divided into two sets: a good quality set and a bad quality set. In the bad quality set, some handwritten strings are highly cursive and some have broken strokes. The result of the word separation step is crucial for the following steps. Table 1 gives the detection rates for the separators '@' and '.'. In the good quality set, only one '.' touching another character is missed; in the bad quality set, some dots '.' that are too large or touch other characters are missed.

Table 1. Detection rates of '@' and '.' on 179 samples.
            images   '@'              '.'
  good set  91       91/91 = 100%     177/178 = 99.4%
  bad set   88       88/88 = 100%     166/174 = 95.4%

Recognition results on the two sets are reported in Table 2. On the good quality set, the recognition rate for single characters reaches 94.6% and that for whole strings reaches 59.3%. One third of the errors are recognition errors, and about one fifth of the errors are due to words or trigrams absent from the lexicons. On the bad quality set, where handwriting samples are seriously cursive or have broken strokes, the recognition rates are much lower: 82.8% for single characters and 11.4% for whole strings.

Table 2. Recognition rates of single characters and whole strings on 179 samples.
            images   characters            strings
  good set  91       1602/1694 = 94.6%     54/91 = 59.3%
  bad set   88       1411/1704 = 82.8%     10/88 = 11.4%

Recognition errors arise for several reasons. Some categories need more training samples in cursive writing styles; when the training samples are insufficient, the first recognition candidate may not be the character's true class, or even the first three candidates may not include it. These errors account for 34.3%. Some connected characters are wrongly separated; these errors account for 22.2%. If the words containing the connected characters are not covered by the lexicons, the connected characters are separated by a dynamic programming search [8], which relies on the classifiers always outputting large confidence values for the true classes and low confidence values for the false ones; classifiers do not always satisfy this requirement. If the provided lexicons were complete, such errors would happen only in recognizing the user name. Some characters are broken due to the user's writing style; these errors account for 8.3% and could be reduced by detecting broken strokes and connecting them with proper paths. Some pairs of connected characters in the user name, whose adjacent characters are not in the trigram lexicon, are wrongly taken as a single character; these errors account for 8.3%. The remaining errors are miscellaneous.

9. Discussion

In this paper we have presented a recognition system for handwritten email address images. Some methods used in this system can also be applied to other tasks, e.g. line skew correction, font slant correction and connected character segmentation. The segmentation hypothesis graph makes comprehensive use of segmentation results, recognition results and lexicon knowledge, and it greatly improves the word recognition rate. Experimental results on two small datasets demonstrate the effectiveness of the system. Future work will investigate reading the domain name words with holistic recognition methods.

References

[1] H. Tschabitscher, "The Elements of an Email Address", http://email.about.com/cs/standards/a/email addresses.htm
[2] Y.T. Kuo and S. Kuo, "System of central signature verifications and electronic receipt transmissions", United States Patent 6873715, issued March 29, 2005.
[3] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge, UK: Cambridge University Press, 2000.
[4] Entropy, http://members.aol.com/jmtsgibbs/entropy.
[5] R.G. Casey and E. Lecolinet, "A Survey of Methods and Strategies in Character Segmentation", IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 18, pp. 690-706, 1996.
[6] H. Murase, "Online Recognition of Free-Format Japanese Handwritings", Int'l Conf. on Pattern Recognition, Brisbane, Australia, 1998, pp. 1143-1147.
[7] X.D. Zhou, J.L. Yu, C.L. Liu, T. Nagasaki and K. Marukawa, "Online Handwritten Japanese Character String Recognition Incorporating Geometric Context", Int'l Conf. on Document Analysis and Recognition, Curitiba, Brazil, 2007, pp. 48-52.
[8] H. Fujisawa, "A View on the Past and Future of Character and Document Recognition", Int'l Conf. on Document Analysis and Recognition, Curitiba, Brazil, 2007, pp. 3-7.