MINIMUM DESCRIPTION LENGTH APPROXIMATION OF DIGITAL CURVES

Alexander Kolesnikov
Department of Computer Science and Statistics, University of Joensuu, Joensuu, FINLAND

ABSTRACT

In this paper we examine the problem of piecewise approximation of digital curves with a set of models. Each segment of the input curve is approximated by a function selected from a given set of functions (line segments, circular arcs, polynomials, splines, etc.). Following the Minimum Description Length principle, we introduce a fast near-optimal algorithm for multi-model error-bounded approximation of digital curves. The algorithm was tested on a large test data set and demonstrated a good tradeoff between time performance and efficiency of solutions. The processing time for the large test data set is less than 1 s.

Index Terms: multi-model piecewise approximation, minimal description length, circular arc, polygonal approximation.

1. INTRODUCTION

Polygonal approximation of digital curves is widely used in image processing and analysis, shape analysis and encoding, and digital cartography to reduce the description of digital curves. However, with more suitable approximation functions (circular and elliptical arcs, polynomials, etc.), a digital curve can be fitted with fewer segments than with a polygonal representation for the same error bound. On the other hand, more data must be stored or transmitted to describe more sophisticated models (functions).

The description length of an approximation model is defined as the data size required to describe the model [1,12]. For example, it can be the number of model parameters, or the bitrate needed to encode the parameters with fixed- or variable-length codes.

In [2,3], the problem of representing a digital curve with line segments and circular arcs was solved in the length-angle parametric space for a given number of segments for the curve.
Dynamic Programming algorithms for multi-model approximation given in [4-7] are based on the minimization of heuristic cost functions with the L2 error measure. In [4], the cost function (perceptional error) is the product of the polygonal approximation error and the description length of the model. The cost function in [5] is the sum of the approximation error and the heuristic description length of the model. In [6], the process of approximation is controlled by a set of thresholds. In [7], the cost function includes a penalty for the angle between two adjacent segments and for the segment length. In [13], the problem of approximation with minimal Integral Square Error (ISE) for a given total number of line segments and circular arcs is considered.

However, the algorithms above do not give an optimal solution for the multi-model approximation in terms of the description length (output data size). Moreover, the algorithms have one additional problem: in [3,5,13] the solutions are constructed for a given total number of curve segments, and in [2,4,6,7] the solution depends on heuristic coefficients. A fast algorithm for multi-model approximation with Minimum Description Length (MDL) for a given error bound would be useful in practical applications.

In Section 2, we formulate the MDL problem and present a fast near-optimal algorithm for multi-model error-bounded approximation of digital curves with the L2-norm. In Section 3, we provide the results of our experiments for two-model approximation of curves with line segments and circular arcs. In Section 4, the conclusions are drawn.

978-1-4244-5654-3/09/$26.00 ©2009 IEEE
Fig. 1. Example of MDL approximation for error bound σ0=1.5: a) with 16 line segments, description length Λ=32 (left); and b) with 6 line segments and 4 circular arcs, description length Λ=24 (right).
ICIP 2009

2. MULTI-MODEL APPROXIMATION

2.1. Problem formulation

Let us consider an N-vertex open planar curve $P=\{p(1), \dots, p(N)\}$ and a set of K approximation models (functions) $\Phi=\{\varphi_1, \dots, \varphi_K\}$. The input curve P is divided into M segments. The partition is defined by an ordered set of vertex indices $i_1, i_2, \dots, i_{M+1}$, where $i_1=1$ and $i_{M+1}=N$. Each segment $\{p(i_m), \dots, p(i_{m+1})\}$ of P is approximated by a function $\varphi_k$ from the set $\Phi$. The description length $\lambda_k$ of the approximation model $\varphi_k$ is defined as the data size required to describe the model. In addition, we have to spend $\log_2 K$ bits per segment to encode the model type. The total description length $\Lambda$ for the curve P with model set $\Phi$ is the sum of the description lengths for all segments of P:

\[ \Lambda = \sum_{m=1}^{M} \lambda_m. \tag{1} \]
The approximation error of a curve segment with the $L_2$-norm is defined as the sum of squared distances $d_k(n)$ from the curve vertices $p(n)$ to the approximation function $\varphi_k$:

\[ e_k^2(i_m, i_{m+1}) = \sum_{n=i_m}^{i_{m+1}} d_k^2(n). \tag{2} \]
Then, the total approximation error $E(P, \Lambda)$ is defined as the sum of distortions for all approximation segments:

\[ E(P, \Lambda) = \sum_{m=1}^{M} e_k^2(i_m, i_{m+1}). \tag{3} \]
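The definitions (1)-(3) can be illustrated with a minimal Python sketch for the single-model polygonal case. It assumes that each line segment passes through the two end vertices of its curve segment and that every segment costs $\lambda = 2$ parameters; the function names `line_segment_error` and `total_error_and_length` and the toy curve are hypothetical and chosen only for illustration.

```python
def line_segment_error(points, j, n):
    """e^2(j, n) of eq. (2): sum of squared distances from vertices j..n
    to the line through points[j] and points[n] (assumed fitting rule)."""
    (x1, y1), (x2, y2) = points[j], points[n]
    dx, dy = x2 - x1, y2 - y1
    length_sq = dx * dx + dy * dy
    err = 0.0
    for x, y in points[j:n + 1]:
        if length_sq == 0:
            # Degenerate segment: fall back to the distance to the end point.
            err += (x - x1) ** 2 + (y - y1) ** 2
        else:
            cross = dx * (y - y1) - dy * (x - x1)
            err += cross * cross / length_sq
    return err


def total_error_and_length(points, partition, lam=2):
    """Return (E, Lambda) of eqs. (3) and (1) for a partition given as
    ordered 0-based vertex indices i_1 < ... < i_{M+1}."""
    pairs = list(zip(partition, partition[1:]))
    total_error = sum(line_segment_error(points, a, b) for a, b in pairs)
    total_length = lam * len(pairs)
    return total_error, total_length


# Toy 5-vertex curve shaped like a roof.
curve = [(0, 0), (1, 1), (2, 2), (3, 1), (4, 0)]
print(total_error_and_length(curve, [0, 4]))     # one long segment
print(total_error_and_length(curve, [0, 2, 4]))  # split at the apex
```

On this toy curve, splitting at the apex doubles the description length but removes the approximation error, which is the tradeoff the problem formulation balances.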
To define a bound for the approximation error with the $L_2$-norm, we introduce a threshold $\sigma_0$ for the Root-Mean-Square Error (RMSE) of the approximation [8]. The corresponding Integral Square Error (ISE) bound $T$ for the N-vertex curve P is then defined as $T = N\sigma_0^2$.

The MDL problem can be formulated as follows: given an open N-vertex polygonal curve P, a set of approximation models $\Phi$ and an RMSE bound $\sigma_0$, find the piecewise approximation of the curve P with the models from $\Phi$ so that the total approximation error $E(P, \Lambda^*)$ does not exceed the ISE bound and the description length $\Lambda^*$ is minimal:

\[ \Lambda^* = \min_{\{i_m, \varphi_m\}} \left\{ \sum_{m=1}^{M} \lambda_m \right\} \quad \text{subject to: } E(P, \Lambda^*) \le N\sigma_0^2. \tag{4} \]

Our task is to find the total number M of segments, the partition $i_1, i_2, \dots, i_{M+1}$ of P into segments, and a model $\varphi_k$ for each segment, such that the total approximation error $E(P, \Lambda)$ for the N-vertex curve P satisfies the error bound T and the total description length $\Lambda$ is minimal.

2.2. Fast near-optimal algorithm for MDL problem

The constrained optimization problem (4) can be solved with the Lagrange multiplier method [9]. The time complexity of the method is $O(N^3)$ at the cost of quadratic space complexity, and $O(N^3 \log N)$ if the space complexity is linear. The method has one drawback: only solutions on the convex hull of the rate-distortion curve can be found with it.

Our objective is to develop a faster near-optimal algorithm for the multi-model MDL problem. To solve this problem, we generalize the approach presented recently in [8] to the case of multi-model approximation.

The main idea of the proposed algorithm for the multi-model MDL problem is the construction of the shortest path in the weighted directed acyclic multigraph G created on the vertices of the input curve P. The maximum number of edges between two graph nodes in G is K, the number of models. An edge $g_k(j, n)$ in G corresponds to the curve segment $\{p(j), \dots, p(n)\}$ approximated with a model $\varphi_k$. The weight $w_k(j, n)$ of the edge is defined by the description length $\lambda_k$ of the model $\varphi_k$ for the segment. The minimal description length of the piecewise approximation is given by the length of the shortest path in G.

The shortest path in the directed acyclic multigraph can be found using one-dimensional functions defined for $n = 1, \dots, N$. First, we define the cost function $L(n)$ as the sum of description lengths for the n-vertex sub-curve $P(n)=\{p(1), \dots, p(n)\}$. The recursive equation for the minimal description length $L(n)$ is then:

\[ L(n) = \min_{1 \le j < n,\; 1 \le k \le K} \{ L(j) + w_k(j, n) \}; \tag{5a} \]

\[ A(n) = \arg\min_{1 \le j < n,\; 1 \le k \le K} \{ L(j) + w_k(j, n) \}. \tag{5b} \]

Here $L(1)=0$. The function $A(n)$ stores the indices of the partition of the curve P into segments and the model type for each segment. The minimal description length for P is given by the cost function value $L(N)$.

Then, we introduce an error function $F(n) = E(P(n), L(n))$ as the approximation error for the sub-curve $P(n)$ with the description length $L(n)$ and the corresponding number of segments M:

\[ F(n) = \sum_{m=1}^{M} e_k^2(i_m, i_{m+1}), \tag{6} \]
where $i_1=1$ and $i_{M+1}=n$.

We now have to define the feasibility of a model $\varphi_k$ for a curve segment $\{p(j), \dots, p(n)\}$ and the weight $w_k(j, n)$ of the corresponding edge $g_k(j, n)$. For this purpose, we extend the RMSE bound given for the curve P to the sub-curves $P(n)$ and define an error bound $T(n)$ for $P(n)$ as $T(n) = n\sigma_0^2$. We call the model $\varphi_k$ feasible for the curve segment $\{p(j), \dots, p(n)\}$ if the approximation error for the sub-curve $P(n)$ calculated with this model satisfies the error bound $T(n)$:

\[ F(j) + e_k^2(j, n) \le n\sigma_0^2. \tag{7} \]

Now we can define the weight $w_k(j, n)$ as follows:

\[ w_k(j, n) = \begin{cases} \lambda_k, & F(j) + e_k^2(j, n) \le n\sigma_0^2, \\ \infty, & \text{otherwise,} \end{cases} \tag{8} \]

and find the MDL path in the multigraph G with the recurrent equation (5) for the cost function $L(n)$. The number of segments, the partition of P into segments, and the model type for each segment are obtained by backtracking with the function $A(n)$.

With the tighter constraint (8) on the approximation error, the global minimum of the description length for the whole curve P cannot be guaranteed. However, with this assumption we can evaluate the feasibility of the candidate models on the fly and construct a near-optimal solution in one pass over the data.

The complexity of the basic algorithm for shortest path construction in a directed acyclic graph is $O(N^2)$. If the complexity of the approximation error calculation is $O(N)$, the total complexity of the algorithm is $O(N^3)$. We can speed up the process by stopping the search (5) as soon as even the minimal approximation error for the current sub-curve becomes much bigger than the ISE bound for the whole curve:

\[ \min_{1 \le k \le K} \{ F(j) + e_k^2(j, n) \} > \alpha N \sigma_0^2. \tag{9} \]
With the coefficient $\alpha=3$, the processing time is substantially reduced, while the constraint on the depth of search has no considerable effect on the efficiency of the solution found. The average number of vertices in a curve segment is $O(N/M)$, so the output-dependent complexity of the algorithm with reduced search depth is sub-cubic, $O(N^3/M^2)$. The space complexity of the algorithm is linear. In the case of many input curves, each curve can be processed individually with the proposed algorithm for a given RMSE bound $\sigma_0$.
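The recursion (5), the feasibility test (7)-(8) and the stopping rule (9) can be put together in a short Python sketch. This is an illustration under stated assumptions, not the author's implementation: the line is assumed to pass through the end vertices, the arc is fitted through the two end vertices and the middle vertex (a hypothetical rule, not the fit of [5]), ties in description length are broken by smaller error, and all names (`line_error`, `arc_error`, `mdl_fast`) are invented here. Indices are 0-based, so the sub-curve ending at index n has n+1 vertices.

```python
import math


def line_error(pts, j, n):
    """e^2 for a line through pts[j] and pts[n] (assumed fitting rule)."""
    (x1, y1), (x2, y2) = pts[j], pts[n]
    dx, dy = x2 - x1, y2 - y1
    len_sq = dx * dx + dy * dy
    if len_sq == 0:
        return float(sum((x - x1) ** 2 + (y - y1) ** 2 for x, y in pts[j:n + 1]))
    return sum((dx * (y - y1) - dy * (x - x1)) ** 2 / len_sq for x, y in pts[j:n + 1])


def arc_error(pts, j, n):
    """e^2 for a circle through pts[j], the middle vertex and pts[n]
    (hypothetical rule; distances are measured to the full circle)."""
    if n - j < 2:
        return math.inf
    (ax, ay), (bx, by), (cx, cy) = pts[j], pts[(j + n) // 2], pts[n]
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:  # collinear vertices: no finite circle
        return math.inf
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    r = math.hypot(ax - ux, ay - uy)
    return sum((math.hypot(x - ux, y - uy) - r) ** 2 for x, y in pts[j:n + 1])


def mdl_fast(pts, models, sigma0, alpha=3.0):
    """models: list of (lambda_k, error_fn). Returns (L(N), [(j, n, k), ...])."""
    N = len(pts)
    L = [math.inf] * N   # cost function L(n), eq. (5a)
    F = [math.inf] * N   # error function F(n), eq. (6)
    A = [None] * N       # back-pointers (j, k), eq. (5b)
    L[0] = F[0] = 0.0
    stop = alpha * N * sigma0 ** 2            # right-hand side of eq. (9)
    for n in range(1, N):
        bound = (n + 1) * sigma0 ** 2         # T(n) for the (n+1)-vertex sub-curve
        for j in range(n - 1, -1, -1):        # grow the last segment backwards
            if math.isinf(L[j]):
                continue
            errs = [F[j] + err(pts, j, n) for _, err in models]
            if min(errs) > stop:              # stopping rule, eq. (9)
                break
            for k, (lam, _) in enumerate(models):
                cand = L[j] + lam
                # Feasibility (7); ties are broken by smaller error (assumption).
                if errs[k] <= bound and (cand, errs[k]) < (L[n], F[n]):
                    L[n], F[n], A[n] = cand, errs[k], (j, k)
    if math.isinf(L[-1]):
        raise ValueError("no feasible approximation for this error bound")
    segments, n = [], N - 1
    while n > 0:                              # backtracking with A(n)
        j, k = A[n]
        segments.append((j, n, k))
        n = j
    return L[-1], segments[::-1]


# Toy L-shaped curve: a horizontal run followed by a vertical run.
curve = [(i, 0) for i in range(6)] + [(5, i) for i in range(1, 6)]
models = [(2, line_error), (3, arc_error)]    # lambda_1 = 2, lambda_2 = 3
cost, segs = mdl_fast(curve, models, sigma0=0.1)
print(cost, segs)
```

Each returned triple holds the start index, end index and model index of one segment. Recomputing every segment error from scratch keeps the sketch short; the error calculation is the $O(N)$ step mentioned above.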
Fig. 2. Test data set: 1) 1062-vertex digitized curve Maple Leaf; and 2) vector map Elevation Lines, 769 digital curves with 31,388 vertices.
3. EXPERIMENTS AND DISCUSSIONS

The test data sets, a digitized curve and a vector map, are shown in Fig. 2. The algorithms were tested on a 2.3 GHz Pentium 4 PC.

For the sake of simplicity, we tested the proposed algorithm for the MDL problem with a two-model approximation that includes line segments and circular arcs (see Fig. 3). Line segments and circular arcs are often used jointly for shape representation [2-5] and in spatial databases. The line segment model $\varphi_1$ for a curve segment is defined by two parameters, so its description length is $\lambda_1=2$. The circular arc model $\varphi_2$ for a curve segment is defined by three parameters, so $\lambda_2=3$. The radius R and center $(X_0, Y_0)$ of the circle for an arc with end points at vertices of the curve are given in [5].
Fig. 3. Distance d from vertex p(n) to the approximation line (left) and to the circular arc (right).
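As a small illustration of the two models, the Python sketch below computes the vertex distances for a line and for a circle and the description length for a mix of lines and arcs with $\lambda_1=2$ and $\lambda_2=3$. The function names are hypothetical, the circle center and radius are taken as given (the fit of [5] is not reproduced), and the distance is measured to the full circle rather than to the arc between its end points.

```python
import math


def dist_to_line(p, a, b):
    """Perpendicular distance from p to the line through a and b (a != b)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    cross = (bx - ax) * (py - ay) - (by - ay) * (px - ax)
    return abs(cross) / math.hypot(bx - ax, by - ay)


def dist_to_circle(p, center, radius):
    """Radial distance from p to a circle with the given center and radius."""
    return abs(math.hypot(p[0] - center[0], p[1] - center[1]) - radius)


def description_length(n_lines, n_arcs, lam_line=2, lam_arc=3):
    """Total description length, eq. (1), for the two-model case."""
    return lam_line * n_lines + lam_arc * n_arcs


print(dist_to_line((1, 1), (0, 0), (2, 0)))
print(dist_to_circle((3, 4), (0, 0), 4.0))
# Segment counts taken from the caption of Fig. 1.
print(description_length(16, 0), description_length(6, 4))
```

The last line reproduces the description lengths quoted in Fig. 1: 32 for 16 line segments and 24 for 6 line segments with 4 circular arcs.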
We have compared the following approaches:
1) PA: Optimal polygonal approximation [10];
2) CAA: Optimal circular arc approximation [11];
3) MDL-Opt: Two-model MDL approximation with the optimal Lagrange multiplier method;
4) MDL-Fast: Two-model MDL approximation with the proposed fast near-optimal algorithm.

The rate-distortion curves for the test set are shown in Figs. 4 and 5. Examples of MDL approximation with line segments and circular arcs are shown in Figs. 1 and 6.

First, we evaluated the benefit of using two models in comparison with the single-model methods. Of the two tested single-model methods, approximation with circular arcs outperforms polygonal approximation in most cases for the test data, whereas two-model MDL approximation outperforms both single-model methods in all cases.
Fig. 4. Rate-distortion curves for the test data #1.
We also compared the efficiency and time performance of the proposed fast near-optimal algorithm with those for the optimal method. Following [12], we define efficiency as the ratio of description lengths of the optimal and evaluated algorithms, respectively. Our experiments have shown that the efficiency of the proposed fast algorithm for the test data is in the range of 90-95%.
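The efficiency measure used here is a simple ratio, sketched below in Python. The function name `efficiency` and the two description lengths in the example are hypothetical values chosen only to show the arithmetic; they are not results from the experiments.

```python
def efficiency(optimal_length, evaluated_length):
    """Efficiency in percent: optimal description length divided by the
    description length produced by the evaluated algorithm."""
    return 100.0 * optimal_length / evaluated_length


# Hypothetical description lengths, for illustration only.
print(efficiency(36, 40))
```

A value of 100% means the evaluated algorithm reached the optimal description length; lower values mean a longer description for the same error bound.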
The processing time of the proposed algorithm for the 1062-vertex digital curve #1 is approximately 0.5 s, as opposed to 8 s and 75 s for the optimal methods of quadratic and linear space complexity, respectively. Processing the test data set #2 takes less than 1 s, against 10 s and 95 s for the optimal method of quadratic and linear space complexity, respectively.
Fig. 5. Rate-distortion curves for the test data #2.

Fig. 6. Example of MDL approximation of the test data #2 for σ0=1.5 m (fragment).

4. CONCLUSIONS

Digital curve approximation with a set of models was considered. Following the Minimum Description Length principle, we introduced a fast near-optimal algorithm for the multi-model piecewise approximation of digital curves for a given RMSE bound. The algorithm was tested on two models: line segments and circular arcs. The time complexity of the algorithm is sub-cubic, with linear space complexity. The efficiency of the proposed algorithm for the two-model MDL problem is about 90-95%.

The algorithm was tested on a large test data set. The processing time for the 31,000-vertex vector data set is about 1 s, so the algorithm can be used in real-time applications for processing large vector data in shape analysis, vector data compression, vectorization, and digital cartography.

The proposed fast algorithm gives a near-optimal solution for the MDL problem. This solution can be improved up to the optimal one using a corresponding modification of the Reduced-Search Dynamic Programming algorithm [14]. This is a topic for future research.

5. REFERENCES

[1] J. Rissanen, “Modeling by shortest data description”, Automatica, vol. 14(5), pp. 465-471, 1978.
[2] E. Bodansky and A. Gribov, “Approximation of a polyline with a sequence of geometric primitives”, Int. Conf. Image Analysis and Recognition, LNCS, vol. 4142, pp. 468-478, 2006.
[3] F. Tortorella, R. Patraccone, and M. Molinara, “A Dynamic Programming approach for segmenting digital planar curves into line segments and circular arcs”, Proc. Int. Conf. Pattern Recognition-ICPR’08, 2008.
[4] K. Mori, K. Wada, and K. Toraichi, “Function approximated shape representation using Dynamic Programming with multiresolution analysis”, Proc. Int. Conf. Signal Process. and Applications & Technology-ICSPAT'99, 1999.
[5] J.-H. Horng and J.T. Li, “A Dynamic Programming approach for fitting digital planar curves with line segments and circular arcs”, Pattern Recognition Letters, vol. 22(2), pp. 183-197, 2001.
[6] R. Mann, A.D. Jepson, and T. El-Maraghi, “Trajectory segmentation using Dynamic Programming”, Proc. Int. Conf. Pattern Recognition, vol. 1, pp. 331-334, 2002.
[7] L. Yin, Y. Yajie, and L. Wenyin, “Online segmentation of freehand stroke by Dynamic Programming”, Proc. Int. Conf. Document Analysis and Recognition-ICDAR’05, vol. 1, pp. 197-201, 2005.
[8] A. Kolesnikov, “Fast algorithm for ISE-bounded polygonal approximation”, Proc. Int. Conf. Image Processing-ICIP’08, pp. 1013-1015, 2008.
[9] G.M. Schuster and A.K. Katsaggelos, “An optimal polygonal boundary encoding scheme in the rate-distortion sense”, IEEE Trans. Image Proc., vol. 7, pp. 13-26, 1998.
[10] J.C. Perez and E. Vidal, “Optimum polygonal approximation of digitized curves”, Pattern Recognition Letters, vol. 15, pp. 743-750, 1994.
[11] S.-C. Pei and J.-H. Horng, “Optimum approximation of digital planar curves using circular arcs”, Pattern Recognition, vol. 29(3), pp. 383-388, 1996.
[12] P.L. Rosin, “Techniques for assessing polygonal approximations of curves”, IEEE Trans. Pattern Anal. Mach. Intell., vol. 14(6), pp. 659-666, 1997.
[13] B. Sarkar, L.K. Singh, and D. Sarkar, “Approximation of digital curves with line segments and circular arcs using genetic algorithms”, Pattern Recognition Letters, vol. 24, pp. 2585-2595, 2003.
[14] A. Kolesnikov and P. Fränti, “Reduced-search Dynamic Programming for approximation of polygonal curves”, Pattern Recognition Letters, vol. 24, pp. 2243-2254, 2003.