Grouping Text Lines in Online Handwritten Japanese Documents by Combining Temporal and Spatial Information

Xiang-Dong Zhou, Da-Han Wang, Cheng-Lin Liu
National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
95 Zhongguancun East Road, Beijing 100190, P.R. China
E-mail: {xdzhou,dhwang,liucl}@nlpr.ia.ac.cn

Abstract

We present an effective approach for grouping text lines in online handwritten Japanese documents by combining temporal and spatial information. Initially, strokes are grouped into text line strings according to off-stroke distances. Each text line string is segmented into text lines by dynamic programming (DP) optimizing a cost function trained by the minimum classification error (MCE) method. Over-segmented text lines are then merged with a support vector machine (SVM) classifier making merge/non-merge decisions, and finally, a spatial merge module corrects segmentation errors caused by delayed strokes. In experiments on the TUAT Kondate database, the proposed approach achieves an Entity Detection Metric (EDM) rate of 0.8816 and an Edit-Distance Rate (EDR) of 0.1234, demonstrating its effectiveness.

1. Introduction

Freeform online handwritten notes are increasingly generated by Tablet PCs, digital whiteboards, and digital pens on paper. Such handwritten ink documents bring new challenges to automatic analysis and recognition. Text lines are the most salient structures, and their reliable extraction is the precondition for further processing tasks such as character recognition, text editing, and retrieval. This paper addresses the problem of grouping text lines in freeform online handwritten Japanese documents. Earlier ink document analysis systems [1, 2] assumed horizontal text lines and simply separated the text lines according to the projection profile. However, the assumption of text line regularity is often violated in freeform handwritten documents. For Japanese documents, Nakagawa et al. [3] describe a system in which the stroke sequence is segmented into text lines

according to off-stroke distances. Due to the variability of character size and spacing, however, off-stroke distance based segmentation does not perform reliably. Optimization-based methods are considered more robust than heuristic ones. To find the extrema of the cost function, the dynamic programming (DP) algorithm is often used [4-6], but without a training procedure, the parameters of the cost functions have to be tuned manually. The system described in [7], an enhanced version of [6], proposed a high-confidence-first method for grouping text lines and writing regions, with an AdaBoost classifier making merge/non-merge decisions.

To better utilize the temporal and spatial information in online handwriting, we propose a text line grouping approach with very few manually set parameters. For documents mixing text and drawings, our approach can also correct text/non-text stroke classification errors. The text line grouping process performs in several stages. Initially, strokes are grouped into text line strings according to off-stroke distances. A discriminant function is trained under the string-level minimum classification error (MCE) criterion [8, 9] to separate over-merged text lines with the DP algorithm. For merging temporally adjacent text lines, an SVM classifier is trained to make merge/non-merge decisions. Finally, the errors caused by delayed strokes are amended by a spatial merge step. Through the temporal merge module, text/non-text stroke classification errors can be corrected.

As for performance evaluation of text line grouping algorithms, there has been no unified criterion. In [6], a per-page recall metric is used to measure the accuracy of the system. In [4] and [7], edit-distance metrics are employed to evaluate system performance. The system proposed by Liwicki et al. [5] adopts the stroke classification rate and the document classification rate as evaluation metrics.
To evaluate our approach systematically, we use performance metrics inspired by the evaluation methods for graphics recognition systems [10], which are also used in the ICDAR page segmentation and handwriting segmentation competitions [11]. To demonstrate the effectiveness of the proposed algorithm, we have experimented on the TUAT Kondate database [3] with two settings: one with perfect text/non-text separation (stroke type labels given) and the other with a stroke classification module. The results show that the proposed text line grouping method is robust in both cases.

2. System Overview

Text line grouping is one of the key parts of our online handwritten document analysis system. After text/non-text separation by stroke classification [12], the text strokes are grouped into text lines. Misclassified strokes can be corrected in the process of text line grouping. Finally, each text line is recognized using a character string recognition algorithm [13]. The text line grouping process comprises five stages from temporal to spatial (Fig. 1). First, to reduce the computation cost, consecutive strokes with small off-stroke distances are merged into blocks. Then the pre-segmentation module performs coarse segmentation of the block sequence according to off-stroke distance at the risk of merge errors, and the over-merged text lines are split at the temporal segmentation stage. Due to misclassification of text strokes as non-text ones, a genuine text line may be over-segmented into multiple lines. At the temporal merge stage, these over-segmented text lines are connected and, meanwhile, the misclassified strokes are corrected. Finally, the spatial merge module merges the delayed strokes and the collinear text lines separated by the temporal grouping process.

3.1. Block Grouping

The strokes in the document are initially grouped into small blocks according to off-stroke distances. The subsequent grouping based on blocks rather than strokes reduces the computation cost. An off-stroke is defined as the vector from the ending point of the preceding stroke to the starting point of the succeeding stroke. If the off-stroke distance between two successive strokes is smaller than a threshold, and the two strokes are not separated spatially by long non-text strokes, they are marked as belonging to the same block. The threshold should be small enough to guarantee that strokes belonging to different text lines are assigned to different blocks; we empirically set it to 0.3 times the average character size in our experiments. Strokes of different types (text or non-text) cannot be merged into one block. After block grouping, we obtain a sequence of stroke blocks.
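The block grouping step above can be sketched as follows. The stroke data layout and helper names are illustrative assumptions, and the check against long intervening non-text strokes is omitted for brevity.

```python
# Sketch of block grouping by off-stroke distance (Section 3.1).
# A stroke is a dict with a point list and a text/non-text flag;
# this layout is an assumption for illustration.
import math

def off_stroke_distance(prev_stroke, next_stroke):
    """Length of the vector from the end of one stroke to the start of the next."""
    (x1, y1), (x2, y2) = prev_stroke["points"][-1], next_stroke["points"][0]
    return math.hypot(x2 - x1, y2 - y1)

def group_blocks(strokes, avg_char_size, ratio=0.3):
    """Merge consecutive same-type strokes whose off-stroke is short."""
    blocks = []
    for stroke in strokes:
        if (blocks
                and blocks[-1][-1]["is_text"] == stroke["is_text"]
                and off_stroke_distance(blocks[-1][-1], stroke) < ratio * avg_char_size):
            blocks[-1].append(stroke)   # extend the current block
        else:
            blocks.append([stroke])     # start a new block
    return blocks
```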

3.2. Pre-segmentation

For pre-segmentation, we first define the off-stroke between two consecutive blocks as the off-stroke between the last stroke of the preceding block and the first stroke of the succeeding block. Off-strokes within a text line are usually shorter than those between text lines. If the off-stroke between two consecutive text blocks is longer than a threshold, and the two blocks are not separated by a long non-text stroke, this off-stroke is regarded as a segmentation position. The threshold is empirically set to 5 times the average character size in our experiments, which guarantees that within-line blocks are merged, at the risk of merging multiple text lines (over-merge); the over-merged text lines are usually short ones, so the off-strokes between them are also short. After splitting the sequence of blocks at the segmentation positions, each sub-sequence is further split in the following temporal segmentation step, which considers the linearity of stroke blocks.
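A minimal sketch of the pre-segmentation step, under the assumption that the blocks and their between-block off-stroke distances have already been computed:

```python
# Sketch of pre-segmentation (Section 3.2): split the block sequence
# wherever the between-block off-stroke exceeds 5x the average
# character size. The non-text separator check is omitted for brevity.
def pre_segment(blocks, block_gap, avg_char_size, ratio=5.0):
    """blocks: sequence of stroke blocks; block_gap[i]: off-stroke
    distance between blocks i and i+1. Returns text line strings."""
    strings, current = [], [blocks[0]]
    for i in range(1, len(blocks)):
        if block_gap[i - 1] > ratio * avg_char_size:
            strings.append(current)   # long off-stroke: segmentation position
            current = []
        current.append(blocks[i])
    strings.append(current)
    return strings
```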

Fig. 1. Flowchart of the text line grouping process.

3. Text Line Grouping

Before the grouping process, the average character size, which is used in the following steps, is estimated from the text strokes in the ink page [3].

3.3. Temporal Segmentation

After pre-segmentation, each sub-sequence of stroke blocks (also called a text line string) can be split into multiple text lines at internal off-strokes (candidate separation points). The combinations of the candidate separation points form a candidate lattice, where each edge represents a candidate text line and a path from the start to the end represents a partitioning into text lines. The paths are evaluated using a trained discriminant function, and the optimal path is searched for by dynamic programming (DP).

3.3.1. Segmentation Candidate Lattice. Inspired by the integrated segmentation and recognition approach to character string recognition [13, 14], we build a lattice for a sequence of stroke blocks (a text line string), in which candidate text lines composed of consecutive blocks correspond to candidate character patterns in string recognition (Fig. 2). Denoting by n the number of stroke blocks in the text line string, the number of blocks in a candidate text line ranges from 1 to n, and the total number of candidate text lines is n(n+1)/2. Each path from the start node to the terminal node corresponds to a candidate segmentation (partitioning) into text lines, and the total number of partitioning possibilities is 2^(n-1).
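The two counts quoted above can be checked by direct enumeration:

```python
# Quick check of the lattice sizes: for a string of n blocks there are
# n(n+1)/2 candidate text lines (one per contiguous block span) and
# 2^(n-1) paths (each of the n-1 internal off-strokes is either a
# separation point or not).
from itertools import product

def lattice_sizes(n):
    # Candidate text lines are the contiguous block spans [i, j).
    candidate_lines = sum(1 for i in range(n) for j in range(i + 1, n + 1))
    # Enumerate every separate/keep choice at the n-1 internal off-strokes.
    paths = sum(1 for _ in product([0, 1], repeat=n - 1))
    return candidate_lines, paths
```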

Fig. 2. Segmentation candidate lattice of a text line string with 5 blocks (each denoted by a straight edge). Each internal node corresponds to an off-stroke between stroke blocks.

3.3.2. Path Scoring and Search. To distinguish the correct segmentation from the incorrect ones in the candidate lattice, we design a linear discriminant function to evaluate the goodness of the paths such that the correct path has the highest score. The discriminant function is defined as

    g(X, S) = Σ_{k=1}^{M} λ_k f_k(X, S) = Λ^T F,    (1)

where X denotes the text line string, S is a segmentation of X into text lines, Λ = [λ_1, λ_2, ..., λ_M]^T is a set of weighting parameters, F = [f_1, f_2, ..., f_M]^T is the feature vector characterizing the segmentation S, and M is the number of features. With the discriminant function (1), segmenting a text line string amounts to finding the optimal path with maximum score:

    S* = arg max_S g(X, S).    (2)

The optimal path is found by a DP search, which requires that the discriminant function be a summation of terms along the path; all of the path features below have this summation form.

3.3.3. Path Features. The first feature is the summation of the linear regression errors e(l_i) of the candidate text lines l_i along the segmentation path, which reflects the overall linearity of the composed text lines [6]. The second feature is the summation of the squared temporal line distances between consecutive text lines. The temporal line distance d(l_{i-1}, l_i) is the minimum distance between the terminal end of l_{i-1} and the start end of l_i (Fig. 3(a)), with d(l_0, l_1) = 0. The third feature is the summation of the intersection areas A(l_{i-1}, l_i) between the preceding text line (extended along its writing direction) and the succeeding one (Fig. 3(b)), with A(l_0, l_1) = 0. This feature is effective for separating consecutive text lines that are not collinear. The number of text lines, used to characterize model complexity in [6], is selected as the last feature; it is important for preventing over-segmentation.
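As an illustration of the first feature, one plausible implementation of the regression error e(l) is the total-least-squares residual of the stroke points; the paper does not fix the exact fitting formula, so this is an assumption:

```python
# One plausible form of the linearity feature e(l): the sum of squared
# perpendicular distances of the points to their best-fit line, obtained
# from the smallest eigenvalue of the 2x2 point scatter matrix (a
# total-least-squares fit, valid for lines in any direction).
import numpy as np

def regression_error(points):
    pts = np.asarray(points, dtype=float)
    centered = pts - pts.mean(axis=0)
    # eigvalsh returns eigenvalues in ascending order; the smallest one
    # equals the summed squared residuals of the orthogonal fit.
    eigvals = np.linalg.eigvalsh(centered.T @ centered)
    return float(eigvals[0])
```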

Fig. 3. Definitions of temporal line distance (a) and intersection area (b).

The first three features are normalized by the square of the average character size estimated for each text line string. With the above four features, the path score function (1) can be written as

    g(X, S) = Σ_{i=1}^{n} [λ_1 e(l_i) + λ_2 d²(l_{i-1}, l_i) + λ_3 A(l_{i-1}, l_i) + λ_4].    (3)
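The DP search over the candidate lattice can be sketched as follows. Because the pairwise terms d² and A in Eq. (3) couple consecutive lines, the DP state must include the span of the last line; the function names, and the sign convention that larger scores are better, are illustrative assumptions:

```python
# Sketch of the DP path search maximizing a score of the form of
# Eq. (3). A candidate line is a block span [i, j); line_score gives
# the weighted per-line terms (lam1*e + lam4), pair_score the weighted
# between-line terms (lam2*d^2 + lam3*A).
from functools import lru_cache

def best_segmentation(n, line_score, pair_score):
    """n: number of blocks. Returns (best score, tuple of line spans)."""

    @lru_cache(maxsize=None)
    def best(i, j):
        # Highest-scoring path whose final text line covers blocks [i, j).
        own = line_score(i, j)
        if i == 0:
            return own, ((0, j),)
        # Extend the best path ending with some previous line (k, i).
        return max(
            (best(k, i)[0] + pair_score((k, i), (i, j)) + own,
             best(k, i)[1] + ((i, j),))
            for k in range(i)
        )

    return max(best(i, n) for i in range(n))
```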

Eq. (3) is similar to the cost function in [6], whose parameters are hand-tuned. In our system, the parameters in Eq. (3) are estimated by string-level MCE training.

3.3.4. String-level MCE Training. In string-level training, the weighting parameters are estimated on a dataset of string samples D_X = {(X_n, S_n) | n = 1, ..., N_X}, where S_n denotes the correct segmentation of the sample X_n. Because the number of string classes is huge, the misclassification measure considers only a finite number of classes (an N-best list), and each class S has a discriminant function g(X, S, Λ), where Λ is the parameter set. In the string-level MCE training described in [8], the misclassification measure for the correct string class S_c is approximated by

    d(X, Λ) = −g(X, S_c, Λ) + g(X, S_r, Λ),    (4)

where g(X, S_r, Λ) is the score of the closest rival class:

    g(X, S_r, Λ) = max_{k≠c} g(X, S_k, Λ).    (5)

The misclassification measure is transformed to a loss function by

    l(X, Λ) = 1 / (1 + e^{−ξ d(X, Λ)}),    (6)

where ξ is a parameter that controls the hardness of the sigmoid nonlinearity. On the training sample set, the empirical loss is computed by

    L(Λ) = (1/N_X) Σ_{n=1}^{N_X} l(X_n, Λ).    (7)

By stochastic gradient descent, the parameters are updated on each training sample by

    Λ(t+1) = Λ(t) − ε(t) U ∇l(X, Λ)|_{Λ=Λ(t)},    (8)

where ε(t) is the learning step and U is usually approximated by a diagonal matrix. In MCE training for text line string segmentation, each segmentation, the correct one or the rival, has a candidate text line sequence from which the features are extracted. Given the genuine and rival segmentations S_c and S_r with corresponding feature vectors F_c and F_r, the misclassification measure is defined as

    d(X, Λ) = −g(X, S_c, Λ) + g(X, S_r, Λ) = Λ^T (F_r − F_c).    (9)

By stochastic gradient descent, the parameters of the discriminant function are updated on the training sample X by

    Λ(t+1) = Λ(t) − ε(t) ∂l(X, Λ)/∂Λ |_{Λ=Λ(t)}
           = Λ(t) − ε(t) ξ l (1 − l) (F_r − F_c).    (10)
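A single MCE update step (Eqs. 6, 9, and 10) can be sketched as follows, with U taken as the identity for simplicity; in practice the rival segmentation would come from an N-best DP search:

```python
# Sketch of one string-level MCE update: given the feature vectors of
# the genuine segmentation and its closest rival, the sigmoid loss
# gradient moves the weights to widen the score margin.
import math

def mce_update(weights, f_correct, f_rival, lr=0.1, xi=1.0):
    diff = [fr - fc for fr, fc in zip(f_rival, f_correct)]      # Fr - Fc
    d = sum(w * x for w, x in zip(weights, diff))               # Eq. (9)
    loss = 1.0 / (1.0 + math.exp(-xi * d))                      # Eq. (6)
    grad = xi * loss * (1.0 - loss)                             # dl/dd
    return [w - lr * grad * x for w, x in zip(weights, diff)]   # Eq. (10)
```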

3.4. Temporal Merge

After stroke classification, some text strokes are misclassified as non-text ones, which splits a text line into multiple ones. The temporal merge module is designed to correct such stroke classification errors and merge the over-segmented text lines. For this stage, an SVM classifier is trained to make the merge/non-merge decision for each hypothesis. The temporal merge algorithm is inspired by the high-confidence-first method [7], which was proposed for grouping text lines and writing regions based on a neighborhood graph and an AdaBoost classifier.

3.4.1. Merge Algorithm. The temporal merge algorithm proceeds iteratively. Each time, we start from the longest text line that has not yet been checked. The selected text line has a preceding neighbor and a succeeding neighbor, each of which is either a text line or a non-text block. We extract features from the pair of the selected line with its predecessor and from the pair with its successor (two merge hypotheses). The features are fed to the SVM classifier for the merge/non-merge decision. If one or both pairs are decided "merge", we accept the hypothesis with the higher confidence and merge the pair. The merged text line is then recursively paired with its predecessor and successor until it is decided "non-merge", at which point it is marked "checked". The next longest unchecked line is then selected for merge hypotheses, until all the text lines are marked.

3.4.2. Training Sample Collection. To train the SVM classifier, we collect training samples (hypotheses) beforehand, each of which is a pair of successive text lines or a pair of a text line and a non-text block (which may be a text block misclassified as non-text), labeled as "merge" or "non-merge". The samples are collected from the temporal segmentation results. Whenever a pair of neighboring text lines, or of a text line and a non-text block, is hypothesized, we label the pair as a positive (merge) sample if it belongs to a single ground-truthed text line, and otherwise as a negative (non-merge) sample.

3.4.3. Feature Extraction. Before feature extraction, each sample (a pair of text lines, or a line and a non-text block) is tentatively merged and fitted by linear regression. Eight features are extracted from each sample. The first three are the average regression error (the regression error divided by the number of convex hull vertices), the squared maximum inner distance along the fitting direction, and the squared maximum stroke length; these three features are normalized by the squared average character size. Features 4-7 are the sine and cosine of the fitted line direction, and the sine and cosine of the angle difference between the fitted line direction and the direction of the longer text line in the pair. The last feature is the difference between the outputs of the discriminant function (3) for the merged line and for the two separate ones; a positive difference implies that the pair is likely to be merged.
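The merge loop of Section 3.4.1 can be sketched as follows, with lines represented as lists of strokes and the SVM replaced by a generic `classify` callable returning a merge confidence (an illustrative stand-in):

```python
# Sketch of the high-confidence-first temporal merge loop. classify(pair)
# stands in for the SVM: a positive return value means "merge", and its
# magnitude is the confidence.
def temporal_merge(lines, classify, length):
    """lines: temporally ordered list of text lines / non-text blocks."""
    checked = set()
    while True:
        unchecked = [l for l in lines if id(l) not in checked]
        if not unchecked:
            return lines
        line = max(unchecked, key=length)       # longest unchecked line
        while True:
            i = next(k for k, l in enumerate(lines) if l is line)
            hyps = []
            if i > 0:
                hyps.append((classify((lines[i - 1], line)), i - 1))
            if i + 1 < len(lines):
                hyps.append((classify((line, lines[i + 1])), i))
            hyps = [h for h in hyps if h[0] > 0]   # keep "merge" decisions
            if not hyps:
                checked.add(id(line))              # mark "checked"
                break
            _, j = max(hyps)                       # highest confidence first
            merged = lines[j] + lines[j + 1]       # merge the pair
            lines[j:j + 2] = [merged]
            line = merged                          # re-pair recursively
```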

3.5. Spatial Merge

When writing a sentence, delayed strokes may be added to an earlier character after a later character has been written. In temporal grouping of text lines, such delayed strokes cause two types of segmentation errors: Case A and Case B (Fig. 4). In Case A, the block (or short text line) of delayed strokes is enclosed by a long text line. In Case B, the ends of two collinear text lines are close to each other but are temporally separated by delayed strokes. The spatial merge module is designed to merge such over-segmented text lines. At the spatial merge stage, we process Case A before Case B, because a short text line enclosed in a longer one is more likely to be merged. To perform spatial merge, we build a neighborhood graph to generate and update the hypotheses and train a classifier to produce the confidence for each hypothesis [7].

Fig. 4. Two cases that need spatial merge.

3.5.1. Discriminant Functions. Considering that there are not many instances requiring spatial merge, and consequently we have only a small number of training samples, two simple linear discriminant functions are trained for the two cases on training samples collected from the temporal segmentation results. For Case A, we train a discriminant function with the regression error and line number features only, i.e.,

    g(X, S) = Σ_{i=1}^{n} [λ_1 e(l_i) + λ_2].    (11)

For Case B, the discriminant function of Eq. (3) is trained with the temporal line distance replaced by the spatial line distance, which is the minimum of the two distances between the start end of one line and the terminal end of the other.

3.5.2. Neighborhood Graph. For Case B, we build the neighborhood graph according to the spatial line distance. For Case A, we build the neighborhood graph according to the intersection ratio between two text lines, which is the area of the common part of the two text line bounding boxes divided by the smaller of the two bounding box areas. The spatial merge process is relatively conservative: two text lines are regarded as neighbors only if the intersection ratio is larger than 0.7 for Case A, or the spatial line distance is smaller than 0.5 times the average character size for Case B.

3.5.3. Merge Algorithm. The spatial merge algorithm is an iterative procedure. For each pair of neighboring text lines, we compute the features for both the merged and the separated hypotheses and calculate the difference between the corresponding discriminant function outputs. This difference can be viewed as the confidence of the merge/non-merge decision: a positive difference indicates merge and a negative one indicates non-merge. If a merge decision is made, the hypothesis is added to a queue together with its confidence score. After all the hypotheses in the graph are scored, the hypothesis with the highest confidence is accepted, and the neighborhood graph and hypothesis queue are updated accordingly. This process iterates until the queue becomes empty.

4. Performance Evaluation

For performance evaluation, we first define the matches and errors between the text lines detected by the algorithm (result lines) and those in the ground-truth (ground-truthed lines), based on which a set of evaluation metrics is calculated. Some of the metrics were originally proposed to evaluate graphics recognition systems [10].
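The confidence-ranked spatial merge decision can be sketched as follows. This simplified version scores all hypotheses once rather than re-scoring after each accepted merge, and `score` stands in for the trained discriminant function:

```python
# Sketch of the spatial merge decision (Section 3.5.3): the confidence
# of a merge hypothesis is the discriminant score of the merged line
# minus the summed scores of the two separate lines; positive means
# merge. Lines are tuples of stroke ids.
import heapq

def spatial_merge(pairs, score):
    """pairs: list of (line_a, line_b) neighbor hypotheses.
    Greedily accepts the most confident merges."""
    heap = []
    for a, b in pairs:
        conf = score(a + b) - (score(a) + score(b))
        if conf > 0:                       # merge decision
            heapq.heappush(heap, (-conf, (a, b)))
    accepted, merged = [], set()
    while heap:
        _, (a, b) = heapq.heappop(heap)    # highest confidence first
        if a in merged or b in merged:
            continue                       # each line merges at most once
        accepted.append(a + b)
        merged.update((a, b))
    return accepted
```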

4.1. Matches and Errors

If a result line and a ground-truthed line contain identical strokes, the match between them is called a one-to-one match. A g_one-to-many match is a match between a ground-truthed line and two or more result lines, each containing integral characters, such that the union of the result lines equals the ground-truthed line. A g_one-to-many match corresponds to two or more d_many-to-one matches. Misses are those text lines in the ground-truth that do not match any result lines. False-alarms are those result lines that do not match any ground-truthed lines. Some counts measuring partial matches are as follows. For a result text line, a d_segmentation error is produced if there exists at least one ground-truthed line such that the intersection of the strokes of the pair is a nonempty proper subset of the ground-truthed line. A d_merge error is produced if there exists at least one ground-truthed line such that some strokes of the result line belong to that ground-truthed line and some do not.
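A minimal sketch of part of this taxonomy, treating each line as a set of stroke ids (only one-to-one matches, misses, and false-alarms are shown; the helper name is illustrative):

```python
# Sketch of match counting (Section 4.1) over stroke-id sets.
def classify_matches(truth_lines, result_lines):
    truth = [frozenset(l) for l in truth_lines]
    result = [frozenset(l) for l in result_lines]
    # one-to-one: identical stroke sets on both sides.
    one2one = sum(1 for t in truth if t in result)
    # miss: a ground-truthed line sharing no stroke with any result line.
    misses = sum(1 for t in truth if all(not (t & r) for r in result))
    # false-alarm: a result line sharing no stroke with any truth line.
    false_alarms = sum(1 for r in result if all(not (t & r) for t in truth))
    return one2one, misses, false_alarms
```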

4.2. Performance Metrics

Let one2one, g_one2many, d_many2one, misses, false_alarms, d_segmentation, and d_merge denote the counts of the corresponding matches and errors, let N be the number of ground-truthed text lines, and let M be the number of detected text lines. The performance metrics are defined below.

Detection Rate (DR):

    DR = ω_1 · one2one/N + ω_2 · g_one2many/N.    (12)

Missed Detection Rate (MDR):

    MDR = misses/N.    (13)

Recognition Accuracy (RA):

    RA = ω_3 · one2one/M + ω_4 · d_many2one/M.    (14)

False-alarm Rate (FAR):

    FAR = false_alarms/M.    (15)

Entity Detection Metric (EDM):

    EDM = 2 · DR · RA / (DR + RA).    (16)

Edit-Distance Rate (EDR) [4, 7]:

    EDR = (d_segmentation + d_merge)/N.    (17)

DR, RA, and EDM measure accuracy, while MDR, FAR, and EDR measure errors. Among the metrics, EDM and EDR are taken as overall performance measures. In our experiments, the weights in DR and RA are all set to 1, because one-to-one, g_one-to-many, and d_many-to-one matches are all suitable for character string recognition.
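With the weights set to 1, the metrics reduce to the following straightforward computation:

```python
# Sketch of the evaluation metrics in Eqs. (12)-(17), with all
# weights set to 1 as in the experiments.
def evaluation_metrics(one2one, g_one2many, d_many2one, misses,
                       false_alarms, d_segmentation, d_merge, N, M):
    DR = (one2one + g_one2many) / N           # Eq. (12)
    MDR = misses / N                          # Eq. (13)
    RA = (one2one + d_many2one) / M           # Eq. (14)
    FAR = false_alarms / M                    # Eq. (15)
    EDM = 2 * DR * RA / (DR + RA)             # Eq. (16), harmonic mean
    EDR = (d_segmentation + d_merge) / N      # Eq. (17)
    return {"DR": DR, "MDR": MDR, "RA": RA,
            "FAR": FAR, "EDM": EDM, "EDR": EDR}
```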

5. Experimental Results

To evaluate the performance of the proposed text line grouping approach, we have experimented on the TUAT HANDS-Kondate_t_bf-2001-11 (Kondate, for short) database of online freeform handwritten Japanese documents, written by 100 people without any writing constraints [3]. The database contains text lines in arbitrary directions and font sizes, often mixed with non-text structures such as drawings, table forms, and flowcharts. The stroke types cover text, formula, figure, ruled line, and editing mark. The formulas, which contain both characters and non-characters, are excluded from our experiments; thus the non-text strokes comprise figure, ruled line, and editing mark strokes. We used the first 60 files (corresponding to 60 writers) for training and the remaining 40 files for testing. The documents contain both text and non-text strokes. Some documents are dominated by text strokes: some contain only text strokes, others only a few editing marks. All the text lines in these pages have been manually labeled for training and evaluation. The details of the training and test data are listed in Table 1.

Table 1. Specification of the training and test sets.

Data set    #Pages    #Text-dominant    #Mixed    #Text lines
Training     2,043     1,680             363       15,679
Test         1,560     1,120             440       11,868

After pre-segmentation of the training pages, 12,816 text line strings, each with no more than three text lines, were selected for string-level MCE training of the linear discriminant function. The parameter ξ in the loss function (6) was set proportional to the reciprocal of the scatter estimated on the samples with genuine segmentation. The string samples were processed iteratively for five cycles of stochastic gradient descent, and after the last cycle the accuracy on the training samples was 0.989. The SVM classifier for temporal merge was trained on 13,734 samples (10,023 merge and 3,711 non-merge) with a fourth-order polynomial kernel. For the classifier training in spatial merge, the samples used and the training settings are identical to those of the string-level MCE training in temporal segmentation.

The system was implemented in Microsoft Visual C++ 6.0, and all the experimental results were produced on a PC with an AMD Dual Core 3800+ CPU and 1 GB RAM. Since text line grouping takes the stroke classification results as inputs in our ink document analysis system (Section 2), we first evaluate its performance with a stroke classification preprocessor [12]. Then, for comparison, the system is evaluated with the genuine stroke labels given. These two settings are called crude writing/drawing (CrudeWD) input and perfect writing/drawing (PerfectWD) input, respectively, following [6]. The grouping results are listed in Table 2 and Table 3 for CrudeWD and PerfectWD, respectively, and some examples of the grouping process are illustrated in Fig. 5 and Fig. 6. In Table 2 and Table 3, each row gives the results after the corresponding operation. Temporal merge is applied for CrudeWD only, since PerfectWD has no stroke classification errors. Each column from the second to the seventh corresponds to a metric defined in Section 4, and the last column (T) is the time cost (seconds per page) of each operation.
From the results in Table 2 and Table 3 we can see that:

1. The system performance improves step by step for both CrudeWD and PerfectWD, and the last step achieves the highest EDM rate and the lowest EDR rate.

2. Although the MRF-based stroke classification reaches an accuracy of 0.9779 on the test set, the system performs much better for PerfectWD, which confirms that stroke classification errors affect the text line grouping accuracy significantly.

3. Pre-segmentation brings in more merge errors than segmentation errors. Note that the high segmentation error rate after pre-segmentation in Table 2 is caused by stroke classification.

4. Temporal segmentation effectively corrects the merge errors while introducing only minor segmentation errors.

5. Temporal merge significantly reduces the segmentation errors caused by stroke classification. After temporal merge, the stroke classification accuracy is improved from 0.9779 to 0.9935.

6. The number of spatial merges needed to amend the segmentation errors is relatively small, because there are not many instances requiring spatial merge in Japanese documents.

7. Compared to the missed detection rates (MDRs), the false-alarm rates (FARs) are much larger in Table 2, because non-text strokes are more likely to be misclassified than text strokes by our stroke classification algorithm. The nonzero FARs in Table 3 owe much to label noise, e.g., wild points or strokes labeled as non-text in the ground-truth.

8. Compared to the other operations, the temporal segmentation step is relatively time consuming, but the total processing time is less than one second per page for both CrudeWD and PerfectWD.

Fig. 6. Grouping process for PerfectWD.

Our experimental results cannot be directly compared with those reported in the literature because there is no common database for evaluation.

6. Conclusion and Future Work

We presented a text line grouping approach for analyzing freeform online handwritten Japanese documents. After a relatively coarse pre-segmentation, a discriminant function trained under the string-level MCE criterion separates over-merged text lines. The high-confidence-first method is then employed for both temporal and spatial merge. To evaluate performance, we defined a set of metrics covering multiple aspects. The experiments on the TUAT Kondate database demonstrate the effectiveness of our approach. Future work includes string-level training with more features, improved stroke classification, and interaction between text line grouping and character string recognition.

Acknowledgements

This work was supported by the Central Research Laboratory of Hitachi Ltd., Tokyo, Japan. The authors thank the Nakagawa Laboratory of Tokyo University of Agriculture and Technology (TUAT) for providing the Kondate database.

Fig. 5. Grouping process for CrudeWD.

References

[1] A. K. Jain, A. M. Namboodiri, and J. Subrahmonia, "Structure in On-line Documents," Proc. Sixth Int'l Conf. Document Analysis and Recognition, Seattle, WA, pp. 844-848, 2001.

[2] E. H. Ratzlaff, "Inter-line Distance Estimation and Text Line Extraction for Unconstrained Online Handwriting," Proc. Seventh Int'l Workshop on Frontiers in Handwriting Recognition, Nijmegen, Netherlands, pp. 33-42, 2000.

[3] M. Nakagawa and M. Onuma, "On-line Handwritten Japanese Text Recognition Free from Constrains on Line Direction and Character Orientation," Proc. Seventh Int'l Conf. Document Analysis and Recognition, Edinburgh, Scotland, pp. 519-523, 2003.

[4] M. Shilman, Z. Wei, S. Raghupathy, P. Simard, and D. Jones, "Discerning Structure from Freeform Handwritten Notes," Proc. Seventh Int'l Conf. Document Analysis and Recognition, Edinburgh, Scotland, pp. 60-65, 2003.

[5] M. Liwicki, E. Indermuhle, and H. Bunke, "On-Line Handwritten Text Line Detection Using Dynamic Programming," Proc. Ninth Int'l Conf. Document Analysis and Recognition, Curitiba, Brazil, pp. 447-451, 2007.

[6] M. Ye, H. Sutanto, S. Raghupathy, C. Y. Li, and M. Shilman, "Grouping Text Lines in Freeform Handwritten Notes," Proc. Eighth Int'l Conf. Document Analysis and Recognition, Seoul, Korea, pp. 367-371, 2005.

[7] M. Ye, P. Viola, S. Raghupathy, H. Sutanto, and C. Li, "Learning to Group Text Lines and Regions in Freeform Handwritten Notes," Proc. Ninth Int'l Conf. Document Analysis and Recognition, Curitiba, Brazil, pp. 28-32, 2007.

[8] B.-H. Juang, W. Chou, and C.-H. Lee, "Minimum Classification Error Rate Methods for Speech Recognition," IEEE Trans. Speech and Audio Processing, vol. 5, no. 3, pp. 257-265, 1997.

[9] W. Chou, "Discriminant-Function-Based Minimum Recognition Error Pattern Recognition Approach to Speech Recognition," Proc. IEEE, vol. 88, no. 8, pp. 1201-1223, 2000.

[10] I. Phillips and A. Chhabra, "Empirical Performance Evaluation of Graphics Recognition Systems," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 21, no. 9, pp. 849-870, Sep. 1999.

[11] A. Antonacopoulos, B. Gatos, and D. Bridson, "ICDAR2007 Page Segmentation Competition," Proc. Ninth Int'l Conf. Document Analysis and Recognition, Curitiba, Brazil, pp. 1279-1283, 2007.

[12] X.-D. Zhou and C.-L. Liu, "Text/Non-text Ink Stroke Classification in Japanese Handwriting Based on Markov Random Fields," Proc. Ninth Int'l Conf. Document Analysis and Recognition, Curitiba, Brazil, pp. 377-381, 2007.

[13] X.-D. Zhou, J.-L. Yu, C.-L. Liu, T. Nagasaki, and K. Marukawa, "Online Handwritten Japanese Character String Recognition Incorporating Geometric Context," Proc. Ninth Int'l Conf. Document Analysis and Recognition, Curitiba, Brazil, pp. 48-52, 2007.

[14] C.-L. Liu, H. Sako, and H. Fujisawa, "Effects of Classifier Structures and Training Regimes on Integrated Segmentation and Recognition of Handwritten Numeral Strings," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 26, no. 11, pp. 1395-1407, Nov. 2004.

[15] C.-L. Liu and K. Marukawa, "Handwritten Numeral String Recognition: Character-Level vs. String-Level Classifier Training," Proc. 17th Int'l Conf. Pattern Recognition, Cambridge, UK, vol. 1, pp. 405-408, 2004.

Table 2. Text line grouping results for CrudeWD.

Operation        DR      MDR     RA      FAR     EDM     EDR     T (s/page)
Pre-seg.         0.6586  0.0050  0.5742  0.0467  0.6135  0.4720  0.020
Temporal Seg.    0.7707  0.0050  0.6221  0.0470  0.6885  0.4498  0.645
Temporal Merge   0.8995  0.0048  0.8524  0.0528  0.8753  0.1419  0.008
Spatial Merge    0.8984  0.0048  0.8654  0.0477  0.8816  0.1234  0.021

Table 3. Text line grouping results for PerfectWD.

Operation        DR      MDR     RA      FAR     EDM     EDR     T (s/page)
Pre-seg.         0.8004  0.0000  0.8882  0.0012  0.8420  0.1222  0.021
Temporal Seg.    0.9447  0.0000  0.9435  0.0011  0.9441  0.0912  0.698
Spatial Merge    0.9468  0.0000  0.9552  0.0010  0.9510  0.0707  0.014