A Model-based Ruling Line Detection Algorithm for Noisy Handwritten Documents

Jin Chen
Dept. of Computer Science & Engineering, Lehigh University, Bethlehem, PA, United States
[email protected]
Abstract—Ruling lines are commonly printed on paper to help people write neatly. In document analysis, however, they often hinder tasks such as handwriting recognition and writer identification. In this paper, we model ruling lines as a multi-line linear regression problem and derive a globally optimal solution under the Least Square Error criterion. We demonstrate the efficacy of our algorithm by evaluating it on both synthesized and real datasets. In addition, our algorithm outperforms a previously published method on the public Germana dataset.

Keywords—handwritten documents; ruling line detection;
I. INTRODUCTION

Line processing is ubiquitous in document analysis applications, e.g., form/invoice processing [1], engineering drawing processing [2], music score analysis [3], and paper pads [4]. Many techniques work well on relatively clean images of good quality; however, if lines are severely broken due to low image resolution, or are overlapped by handwriting, performance may deteriorate significantly [5]. Zheng, et al., propose a stochastic model-based parallel line detection algorithm that incorporates context to detect ruling lines systematically [4]. Rather than treating the peaks of the projection profile as the line positions, they model such lines with a Hidden Markov Model so that the constraints among the lines can be incorporated.

In this work, we propose a model-based ruling line detection algorithm that takes advantage of the model attributes, e.g., constant spacing, skew angle, thickness, and length. First, we introduce the framework of multi-line linear regression and derive a globally optimal solution under the Least Square Error (LSE) criterion. Then we describe a Hough transform variant for extracting line segments and the "Basic Sequential Algorithmic Scheme" (BSAS) clustering for grouping line segments. The next step makes use of the ruling line model to detect missing lines based on the layout of the line clusters. We evaluate detection errors on the basis of individual model attributes, rather than attempting to define a single metric that combines all attributes as in [6].

The rest of this paper is organized as follows: in Section II, we introduce the multi-line linear regression model. We then describe the ruling line detection algorithm in Section III and the experimental setup in Section IV.
Daniel Lopresti Dept. of Computer Science & Engineering Lehigh University Bethlehem, PA, United States
[email protected]
Finally, we show experimental results in Section V and conclude in Section VI.

II. MULTI-LINE LINEAR REGRESSION

The multi-line linear regression model assumes: i) ruling lines are straight lines in documents; ii) they preserve constant spacing and skew; iii) the point-to-line association is available; iv) the number of lines k is available; v) within the line list, there are no missing lines between the first and the last line. It is important to clarify that iii), iv), and v) are guaranteed by our algorithm, rather than fundamental assumptions of the ruling line detection problem.

Since we assume the number of lines k and the point-to-line association, the point set \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\} can be further formulated as \{(x_{i,j}, y_{i,j}) \mid i = 0, \ldots, k-1;\ j = 1, \ldots, n_i;\ \sum_{i=0}^{k-1} n_i = n\}. In addition, all the ruling lines are parallel and have the same spacing, so we rewrite the normal form of the line equation as:

y_{i,j} = \beta_0 + i\beta_1 + \beta_2 x_{i,j} + \epsilon_{i,j}    (1)

where i = 0, \ldots, k-1, j = 1, \ldots, n_i, and \sum_{i=0}^{k-1} n_i = n. Here \epsilon_{i,j} is a random error component, \beta_0 is the y-intercept of the 0-th line, \beta_1 is the spacing between lines, and \beta_2 is the skew angle. Next, we write the residual function as:

f \stackrel{\mathrm{def}}{=} \|b - A\hat{x}\|^2 = \sum_{i=0}^{k-1} \sum_{j=1}^{n_i} (y_{i,j} - \hat{\beta}_0 - i\hat{\beta}_1 - \hat{\beta}_2 x_{i,j})^2    (2)

To minimize the residual, we take the partial derivatives of f with respect to \hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, respectively (denote \Sigma as \sum_{i=0}^{k-1} \sum_{j=1}^{n_i}):

\partial f / \partial \hat{\beta}_0 = -\Sigma\, 2 (y_{i,j} - \hat{\beta}_0 - i\hat{\beta}_1 - \hat{\beta}_2 x_{i,j}) = 0
\partial f / \partial \hat{\beta}_1 = -\Sigma\, 2 i (y_{i,j} - \hat{\beta}_0 - i\hat{\beta}_1 - \hat{\beta}_2 x_{i,j}) = 0    (3)
\partial f / \partial \hat{\beta}_2 = -\Sigma\, 2 x_{i,j} (y_{i,j} - \hat{\beta}_0 - i\hat{\beta}_1 - \hat{\beta}_2 x_{i,j}) = 0
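To make the solution concrete, the joint fit can be posed as an ordinary least-squares problem over a stacked design matrix A whose rows are [1, i, x_{i,j}]. Below is a minimal sketch in Python/NumPy; the function name and data layout are ours, not from the paper:

```python
import numpy as np

def fit_ruling_lines(points):
    """Jointly fit k parallel, evenly spaced lines by least squares.

    points: iterable of (i, x, y) triples, where i is the line index
    of each sample (the point-to-line association of assumption iii)).
    Returns (beta0, beta1, beta2): intercept of line 0, spacing, skew.
    """
    # Each sample contributes one row [1, i, x] of A, so that
    # y ~ beta0 + i*beta1 + beta2*x, matching Eq. (1).
    A = np.array([[1.0, i, x] for i, x, _ in points])
    b = np.array([y for _, _, y in points])
    # lstsq solves the normal equations (Eq. (4)) numerically.
    beta, *_ = np.linalg.lstsq(A, b, rcond=None)
    return beta

# Noise-free check: 3 lines with intercept 10, spacing 50, skew 0.02.
pts = [(i, x, 10 + 50 * i + 0.02 * x)
       for i in range(3) for x in range(0, 200, 20)]
beta0, beta1, beta2 = fit_ruling_lines(pts)
```

With noise-free samples the estimate recovers (10, 50, 0.02) up to floating-point error.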
Solving these equations, we have:

\begin{pmatrix} \Sigma 1 & \Sigma i & \Sigma x_{i,j} \\ \Sigma i & \Sigma i^2 & \Sigma i\,x_{i,j} \\ \Sigma x_{i,j} & \Sigma i\,x_{i,j} & \Sigma x_{i,j}^2 \end{pmatrix} \begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \hat{\beta}_2 \end{pmatrix} = \begin{pmatrix} \Sigma y_{i,j} \\ \Sigma i\,y_{i,j} \\ \Sigma x_{i,j} y_{i,j} \end{pmatrix}    (4)

The 3-by-3 design matrix is a symmetric matrix with positive entries. From physical considerations, where the number of lines and the point-to-line association are available, we can expect a unique solution \hat{x} = (A^T A)^{-1} A^T b to Ax = b.

III. MODEL-BASED RULING LINE DETECTION

Our model-based ruling line detection algorithm is based on the following assumptions: (1) documents preserve salient, although not necessarily continuous, ruling line segments (Figure 1); (2) ruling lines have the attributes of parallelism, constant spacing, thickness, and length.

Figure 1: A sample from the Germana dataset.

A. A Variant of the Hough Transform

The classical Hough transform projects each point onto a sinusoidal curve in the (\rho, \theta) plane (the Hough space):

\rho = x_i \cos\theta + y_i \sin\theta    (5)

where \theta \in [0, 2\pi). We adopt an effective variant in our work [7]. In each iteration, we select a point randomly from the remaining point set, compute its sinusoidal curve in the Hough space, and update the accumulation matrix. If the current maximum vote count is larger than the threshold, we traverse in each direction from the current position searching for the end points of the line segment. Considering that ruling lines might be degraded, short gaps (five pixels) are tolerated during the search. Once the search stops, we record the coordinates of the end points of the line segment, exclude all its points from the accumulation matrix, and proceed.

B. Sequential Clustering

After the Hough transform, we acquire a set of line segments specified by their end points. Denote T as the threshold of dissimilarity between clusters, and Q as the maximum number of clusters. In our experiments, the dissimilarity measure is the \rho-value distance between line segments, and we set T = 10, Q = 32 empirically. We depict the "Basic Sequential Algorithmic Scheme" (BSAS) in Algorithm 1.
Algorithm 1: Basic Sequential Algorithmic Scheme [8]
Data: G = {x_i : i = 1, ..., N}: a set of line segments.
Result: m: number of clusters acquired.
begin
    m = 1
    C_m = {x_1}
    for i = 2 to N do
        Find C_j : d(x_i, C_j) = min_{1<=p<=m} d(x_i, C_p)
        if d(x_i, C_j) > T and (m < Q) then
            m = m + 1
            C_m = {x_i}
        else
            C_j = C_j ∪ {x_i}
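A compact Python sketch of Algorithm 1; the distance function below (gap to the cluster's mean ρ) is our illustrative choice, and any cluster dissimilarity would do:

```python
def bsas(items, dissimilarity, T=10.0, Q=32):
    """Basic Sequential Algorithmic Scheme: a single pass over the data.

    items: objects to cluster (here, line segments or their rho values).
    dissimilarity(item, cluster): distance from an item to a cluster.
    T: dissimilarity threshold; Q: maximum number of clusters.
    Returns the list of clusters (each a list of items).
    """
    clusters = [[items[0]]]
    for item in items[1:]:
        # Find the nearest existing cluster.
        dists = [dissimilarity(item, c) for c in clusters]
        j = min(range(len(clusters)), key=dists.__getitem__)
        if dists[j] > T and len(clusters) < Q:
            clusters.append([item])    # too far from everything: new cluster
        else:
            clusters[j].append(item)   # absorb into the nearest cluster
    return clusters

# Cluster segments by rho value; distance is the gap to the cluster mean.
rhos = [100, 102, 99, 250, 252, 400]
gap = lambda r, c: abs(r - sum(c) / len(c))
clusters = bsas(rhos, gap, T=10, Q=32)
print(len(clusters))  # 3
```

Being a single-pass scheme, BSAS is sensitive to the presentation order of the segments, which is acceptable here since segments arrive roughly sorted by ρ.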
After the BSAS clustering, we first estimate the ruling line spacing by building a histogram of the spacing values between consecutive clusters. Since ruling lines in the Germana dataset are usually broken, we select the minimum spacing between clusters. This estimate is temporary: at later stages we update it by computing the spacing globally and thus more precisely.

C. Single Line Fitting

After combining close clusters, we have a good estimate of which line segments belong to which cluster. In this stage, we employ linear regression on each cluster and then compute a temporary skew angle by averaging all \beta_2 values. Again, this skew angle is tentative and will be fine-tuned by the multi-line linear regression at a later stage. In addition, the thickness of the ruling lines H is computed by first building a histogram of vertical run-lengths along each line; we then select the most frequent bucket of the histogram as the thickness H.

D. Reasoning about Missing Lines

For degraded documents, it is common to observe line segments that are missed by the Hough transform. Thus,
we traverse the clusters checking whether two consecutive clusters have a much larger spacing than the temporary spacing. If so, we consider that missing ruling lines exist in between. We then compute the positions of the missing lines from the spacing and the skew angle. Finally, we scan along the missing lines to collect sample points for further regression. This procedure is depicted in Algorithm 2.

The subroutine ScanZones specifies either "North" or "South" as the scanning direction, and it employs either a "strict" or a "relaxed" criterion for collecting evidence. Under the strict criterion, we only consider a ruling line to exist if we collect more than T_1 sample points from the scanning area. Under the relaxed criterion, we set no such threshold. We apply the strict criterion only at the topmost and bottommost lines, and the relaxed criterion for intermediate lines. The rationale is that finding missing lines before the first or after the last line of the current candidate list requires strong evidence, whereas intermediate lines can be derived from the spacing between two consecutive lines. In this way, we iteratively scan for missing lines until the number of sample points falls below the threshold or the probe steps out of the image boundary. In our experiments, T_1 = 0.2 × page width.

Algorithm 2: Find Missing Lines.
Data: A list of lines: lines_old[m]. The temporary spacing between lines: s.
Result: An updated list of lines: lines_new[m'].
begin
    lines_new = ScanZones(lines_old[0], s, "North," "strict")
    for i = 0 to m - 1 do
        space = lines_old[i + 1].ρ - lines_old[i].ρ
        count = space / s
        local_s = space / count
        if abs(space) > 1.5 s then
            lines_new = ScanZones(lines_old[i], local_s, "South," "relaxed")
    lines_new = ScanZones(lines_old[m], s, "South," "strict")
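The gap test at the heart of Algorithm 2 can be sketched as follows. This only infers the ρ positions of the missing lines; the evidence-collecting ScanZones step is omitted, and the function name is ours:

```python
def infer_missing_rhos(rhos, s):
    """Infer rho positions of ruling lines missed between detected ones.

    rhos: sorted rho values of detected lines; s: temporary spacing.
    A gap much larger than s implies round(gap / s) - 1 missing lines,
    placed evenly using the refined local spacing gap / count.
    """
    missing = []
    for a, b in zip(rhos, rhos[1:]):
        gap = b - a
        count = round(gap / s)        # how many spacings fit in the gap
        if gap > 1.5 * s and count >= 2:
            local_s = gap / count     # local spacing, as in Algorithm 2
            missing.extend(a + m * local_s for m in range(1, count))
    return missing

# One gap of four spacings between rho = 150 and rho = 350:
print(infer_missing_rhos([100, 150, 350, 400], s=50))  # [200.0, 250.0, 300.0]
```

The inferred positions would then be scanned for sample points before the lines are accepted.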
Algorithm 2 relies on a local spacing to find missing lines, so it might not be precise. Hence, we run another round of missing-line scanning from the first and the last ruling line using the global spacing. Another difference is that here T_2 = 0.01 × page width for the degraded Germana dataset. If extra ruling lines are detected at this stage, we update the corresponding model parameters: the number of ruling lines K and the starting position of the first line P(x_p, y_p).

E. Computing Model Parameters

Until this stage, we have satisfied all prerequisites for Eq. 4 in Section II. Solving the equation, we obtain the estimated parameter vector \hat{\beta}. Next, we update the linear equation for each cluster. Then, for each cluster, we scan the areas that extend from the leftmost and the rightmost point. Any newly discovered sample points become the new starting and ending points for that line. We then take the maximum line length as the ruling line length L. At the same time, we can decide the starting position of the first ruling line P(x_p, y_p).

To sum up, the model parameters are: the starting point of the first line P(x_p, y_p), the length L, the thickness H, the skew angle \beta_2, the number of lines K, and the spacing \beta_1. We denote them as \Theta = (P(x_p, y_p), L, H, \beta_2, K, \beta_1).

IV. EXPERIMENTAL SETUP

A. Data Preparation

We evaluate our algorithm on both synthesized and real datasets. First, we synthesize a dataset that contains only ruling lines, drawn according to a set of model parameters defined beforehand. One subset consists of 11 pages, each with 20 ruling lines of the same length, thickness, and spacing, but with a different skew angle per page, ranging over [-1.0°, 1.0°] with a step of 0.2°. The other subset consists of 10 pages, each with the same length, thickness, and skew angle, but with a different number of lines in [10, 20), and thus a different spacing. The ground-truth for this dataset is generated directly from the pre-defined model parameters.

Second, we use the real dataset Germana [9] for performance evaluation. Its ground-truth is generated by an annotation tool (Badcat) developed at Lehigh. Making use of the properties of ruling lines, Badcat enables users to quickly label all horizontal ruling lines with only three handles: the first controls the position of the topmost line, the second the skew angle, and the third the number of lines and the spacing.

From the Germana dataset, we randomly select 20 pages as the testing dataset, and another 20 pages as the training dataset required by Zheng, et al.'s HMM training. The Hidden Markov Model in their algorithm is trained using their tool, which enables a user to label ruling lines one at a time. A breakdown of the datasets is listed in Table I.

Table I: A breakdown of datasets.

Dataset        pages   lines/page   page size
rotation-vary  11      20           816w × 1056h
spacing-vary   10      [10, 20)     816w × 1056h
Germana        20      24           1420w × 2120h
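The synthesized pages described above can be reproduced with a short renderer parameterized by the model Θ. The sketch below is our own helper, not the authors' tool; it draws k lines of a given spacing, skew, and thickness into a binary page:

```python
import numpy as np

def render_ruling_page(width, height, y0, spacing, slope, thickness, k):
    """Render k parallel ruling lines: line i follows
    y = y0 + i*spacing + slope*x, drawn `thickness` pixels tall.
    Returns a binary image (1 = line pixel)."""
    page = np.zeros((height, width), dtype=np.uint8)
    xs = np.arange(width)
    for i in range(k):
        centers = y0 + i * spacing + slope * xs
        for t in range(thickness):
            ys = np.round(centers).astype(int) + t
            inside = (ys >= 0) & (ys < height)   # clip at page borders
            page[ys[inside], xs[inside]] = 1
    return page

# An 816 x 1056 page with 20 level lines, as in the synthesized subsets.
page = render_ruling_page(816, 1056, y0=40, spacing=50, slope=0.0,
                          thickness=2, k=20)
```

The specific values of y0, spacing, and thickness here are illustrative; the paper does not state the exact parameter settings used per page.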
B. Performance Evaluation Metric

The performance of line detection algorithms is in general measured by two kinds of protocols: pixel-level metrics and object-level metrics [6]. Pixel-level metrics, including precision, recall, and F-score, are intuitive measurements. However, the
ground-truth for pixel-level accuracy is difficult to acquire because it is subjective, and the situation becomes more severe when the lines are degraded. On the other hand, researchers have proposed several object-level metrics. However, these compound metrics are less intuitive in showing how significantly the performance differs among algorithms (e.g., Qv(c) in Liu and Dori's work [6]). Instead of composing a compound metric that incorporates all parameters, we choose to measure directly the discrepancy between the computed parameters and the ground-truth, where \Theta = (P(x_p, y_p), L, H, \beta_2, K, \beta_1).

C. Post-processing Zheng, et al.'s Algorithm [4]

We treat Zheng, et al.'s algorithm as a black box: no modifications are made, so that we can make the output of their code (a list of detected line segments) comparable to ours. However, we perform some minor post-processing¹. The number of lines K is simply the size of the list. The spacing \beta_1 is taken from the largest bucket in the spacing histogram. The skew angle \beta_2 is the average over all line segments. The thickness H is acquired exactly as described in Section III-C. Finally, P(x_p, y_p) and L are computed exactly as described in Section III-E.

V. EXPERIMENTAL RESULTS

We plot some intermediate results from the pipeline of our model-based ruling line detection algorithm in Figure 2. Images from Germana have low quality, so most ruling lines are broken and difficult to see even for human beings. As we can see from Figure 2a, line segments tend to be incomplete and sparse. Running Algorithm 2, we obtain several missing lines in the line list, as shown in Figure 2b. At this stage, all lines shown in Figure 2c are used for the multi-line linear regression. Finally, using the skew angle and the spacing, we successfully detect one missing line at the top of the page, as shown in Figure 2d.
To compare the algorithm performance, we computed the errors of our algorithmic outputs against the ground-truth:

D[i] = M_algorithm[i] - M_ground-truth[i]    (6)
where D[·] is the error vector and M(·)[·] contains the algorithmic output or the ground-truth. We then computed the mean and the standard deviation of the errors for each dataset. The results are organized in Table II. Note that there are two different synthesized subsets in the experiments, but we summarize their performance with one entry in the table. The synthesized dataset is easy: the mean and the standard deviation of the errors are much lower compared to those on the Germana dataset. We observed relatively large errors in terms of σ-values on the Germana dataset. One reason is that this dataset has low resolution, and the ruling lines are thin and usually broken. Thus, in many cases it is ambiguous to decide ruling line positions precisely.

We also ran the control algorithm on Germana, where we randomly selected 20 pages for the HMM training. Table II shows the error statistics of the two algorithms. Looking at the mean and the standard deviation of the errors, we observe that our algorithm performed better on Germana. Finally, Figure 3 shows a sample result generated by Zheng, et al.'s algorithm, with our result displayed for comparison. It is clear that for degraded images like those in Germana, our algorithm manages to detect light and broken ruling lines by deriving them from the other, salient ones. We consider this an advantage of our ruling line modeling.

¹We observed some memory issues when executing Zheng, et al.'s code, so the execution was carefully supervised and the algorithmic results were generated one page at a time.

VI. CONCLUSION

In this work, we introduce a model-based ruling line detection algorithm that requires no supervised learning. We first formulate the ruling line model in the framework of multi-line regression and derive a globally optimal solution under the Least Square Error. Next, we introduce the algorithms for extracting line segments and detecting missing lines. Finally, we demonstrate the efficacy of our algorithm by comparing it with prior work in the literature on two datasets, using statistical metrics to show the performance improvement.

ACKNOWLEDGMENT

We acknowledge Dr. Yefeng Zheng for providing the parallel line detection package for comparison. This work is supported by a DARPA IPTO grant administered by Raytheon BBN Technologies.
REFERENCES
[1] J. Liu and A. Jain, "Image-based form document retrieval," Pattern Recognition, vol. 33, no. 3, pp. 503–513, 2000.
[2] D. Dori and W. Liu, "Sparse pixel vectorization: an algorithm and its performance evaluation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 3, pp. 202–215, 1999.
[3] D. Blostein and H. Baird, "A critical survey of music image analysis," in Structured Document Image Analysis, H. Baird, Ed., 1992, pp. 405–434.
[4] Y. Zheng, H. Li, and D. Doermann, "A parallel-line detection algorithm based on HMM decoding," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 5, pp. 777–792, 2005.
[5] H. Cao and V. Govindaraju, "Handwritten carbon form preprocessing based on Markov random field," in Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 2007.
[6] W. Liu and D. Dori, "A protocol for performance evaluation of line detection algorithms," Machine Vision and Applications, vol. 9, pp. 240–250, 1997.
[7] J. Matas, C. Galambos, and J. Kittler, "Robust detection of lines using the progressive probabilistic Hough transform," Computer Vision and Image Understanding, vol. 78, pp. 119–137, 2000.
[8] S. Theodoridis and K. Koutroumbas, Pattern Recognition. Academic Press, 2009.
[9] D. Pérez, L. Tarazón, N. Serrano, F. Castro, O. R. Terrades, and A. Juan, "The GERMANA database," in Proc. of the International Conference on Document Analysis and Recognition, 2009, pp. 301–305.
(a) Detected clusters.
(b) Intermediate lines.
(c) Clusters of line segments.
(d) Final results.
Figure 2: Snapshots of the intermediate results generated by our model-based ruling line detection algorithm on a Germana sample. Note that in (d), there is a missing line detected at the top.

Table II: Performance of our model-based ruling line detection algorithm and its comparison with one algorithm in the literature.

Error Statistics         X-position P.x  Y-position P.y  Length L  # of lines K  Spacing β1  Skew β2  Thickness H
Synthesis Dataset
  Mean (µ)                    -4.16          0.55           6.9        0            0         -0.01       0
  Standard Deviation (σ)       5.60          2.18           5.11       0            0.01       0.05       1.41
Germana Dataset
  Mean (µ)                   -12.30          3.80          12.85       0.25        -0.49      -0.09       0.25
  Standard Deviation (σ)      39.01         20.70          34.03       1.25         2.24       0.26       0.22
Germana Dataset (Zheng, et al. [4])
  Mean (µ)                   -49.40         57.60          70.40      -1.20        -2.40       0.02       0
  Standard Deviation (σ)      62.64        113.30          62.12       5.68         8.82       1.09       0

(a) Our results on a Germana sample.
(b) Zheng, et al.’s results.
Figure 3: Comparison with Zheng, et al.’s algorithm [4].