The use of the l1 and l∞ norms in fitting parametric curves and surfaces to data

I. Al-Subaihi and G. A. Watson∗
Department of Mathematics, University of Dundee, Dundee DD1 4HN, Scotland.
Abstract

Given a family of curves or surfaces in R^s, an important problem is that of finding a member of the family which gives a “best” fit to m given data points. There are many application areas, for example metrology, computer graphics and pattern recognition, and the most commonly used criterion is the least squares norm. However, there may be wild points in the data, and a more robust estimator such as the l1 norm may be more appropriate. On the other hand, the object of modelling the data may be to assess the quality of a manufactured part, so that accept/reject decisions may be required, and this suggests the use of the Chebyshev norm. We consider here the use of the l1 and l∞ norms in the context of fitting to data curves and surfaces defined parametrically. There are different ways to formulate the problems, and we review here formulations, theory and methods which generalize in a natural way those available for least squares. As well as considering methods which apply in general, some attention is given to a fundamental fitting problem, that of lines in three dimensions.

∗ Invited talk, International Conference on Numerical Analysis and Applied Mathematics, Chalkis, Greece, 10-14 September 2004
1  Introduction
Let data points x_i ∈ R^s, i = 1, . . . , m, be generated, and let a parametric curve or surface in R^s be defined by a ∈ R^n, with a point on the surface given by x(a, t), where t ∈ R^d. It is required to determine a value of a which gives a “best” fit to the data, and we consider different criteria. Two of the criteria are based on the idea of orthogonal distances: for every x_i associate a point x(a, t_i(a)) on the surface defined by a ∈ R^n, where

    t_i(a) = \arg\min_t \|x_i - x(a, t)\|, \quad i = 1, \dots, m,    (1)

and where unadorned norms are l2 norms. For any a ∈ R^n, define the orthogonal distance vectors

    w_i(a) = x_i - x(a, t_i(a)), \quad i = 1, \dots, m,    (2)

and also

    r(a) = (\|w_1(a)\|, \dots, \|w_m(a)\|)^T ∈ R^m,
    s(a) = (w_1(a)^T, \dots, w_m(a)^T)^T ∈ R^{ms}.

Let 1 ≤ p ≤ ∞. Then two basic fitting problems which arise from this are to find

    \min_{a ∈ R^n} \|r(a)\|_p,    (3)

which we refer to as lp orthogonal distance regression, and

    \min_{a ∈ R^n} \|s(a)\|_p,    (4)

which we refer to as lp componentwise orthogonal distance regression. For any a ∈ R^n, t_i ∈ R^d, i = 1, . . . , m, define

    h(a, t_1, \dots, t_m) = ((x_1 - x(a, t_1))^T, \dots, (x_m - x(a, t_m))^T)^T ∈ R^{ms}.

Then a third problem is that of finding

    \min_{a, t_1, \dots, t_m} \|h(a, t_1, \dots, t_m)\|_p,    (5)

which we refer to as lp distance regression.
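For concreteness, the quantities entering (1)–(5) are easy to evaluate numerically once the foot-point problems (1) are solved. The following is a minimal sketch (not part of the paper) assuming Python with NumPy and SciPy; the circle model and the function names are illustrative assumptions only.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def circle(a, t):
    """Point on a circle with centre (a[0], a[1]) and radius a[2], at parameter t."""
    return np.array([a[0] + a[2] * np.cos(t), a[1] + a[2] * np.sin(t)])

def foot_parameters(a, X):
    """Solve (1): nearest-point parameters t_i(a) in the l2 norm for each data point."""
    ts = []
    for xi in X:
        res = minimize_scalar(lambda t: np.linalg.norm(xi - circle(a, t)),
                              bounds=(0.0, 2 * np.pi), method='bounded')
        ts.append(res.x)
    return np.array(ts)

def odr_vectors(a, X):
    """Return r(a) of (3) and s(a) of (4), built from the distance vectors (2)."""
    ts = foot_parameters(a, X)
    W = np.array([xi - circle(a, ti) for xi, ti in zip(X, ts)])  # rows are w_i(a)
    r = np.linalg.norm(W, axis=1)      # r(a): orthogonal distances
    s = W.reshape(-1)                  # s(a): stacked components of the w_i(a)
    return r, s

# ‖r(a)‖_1 and ‖s(a)‖_1 for a trial circle and a few data points
X = np.array([[1.1, 0.1], [0.0, 0.9], [-1.2, 0.0], [0.1, -1.0]])
r, s = odr_vectors(np.array([0.0, 0.0, 1.0]), X)
print(np.sum(np.abs(r)), np.sum(np.abs(s)))
```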
All these problems are identical if p = 2, when the problem is known as orthogonal distance regression, and methods of Gauss-Newton type have proved popular. Although the problems are identical, the direct treatment of the problems in the forms given here leads to different algorithms. For the Gauss-Newton method applied to (3), see for example [10, 13, 14, 20], and applied to (4), see for example [4, 5, 6]. For a unified treatment of these methods, and some comparisons, see [2]. The direct solution of (5) trades the subproblems for t_i(a) for an increase in the number of variables, and generates orthogonal distances only in the limit: methods which exploit the structure of the Jacobian in Gauss-Newton (or Levenberg-Marquardt) methods are given in [7, 12]. Which approach should be used depends mainly on whether or not t_i(a) is easy to calculate.

Although attention has focussed on the case p = 2, the nature of the errors may be such that least squares is not particularly suitable. For example, there may be wild points in the data, and a more robust estimator such as the l1 norm may be more appropriate. Another situation which can arise in the modelling of manufactured parts is that accept/reject decisions may be required. Then least squares may not give precise enough information, and it may be necessary to use the Chebyshev norm. Therefore we are concerned here with the special cases p = 1 and p = ∞, and we will examine some numerical methods which extend in a natural way those mentioned in the previous paragraph. (For a treatment of other values of p, see [3].) It is appropriate to deal with the cases p = 1 and p = ∞ together because the problems share certain characteristics, such as the piecewise linear nature of the linear subproblems, to which linear programming methods can be applied.

In some situations, it may be desirable to work with rotated data, in order to simplify the model. However, for ease of presentation, we will not consider that here: straightforward changes would be required. It is also possible to work with implicit models, rather than parametric ones, but again we will not consider this. We will use ∇_1 (∇_2) to denote the operation of taking partial derivatives of a function of two sets of variables with respect to the first (second) set.
2  The l1 norm
When p = 1, the three problems posed in the last section are all different. Most attention has been given to (3), so first consider that problem. Then we have to solve

    minimize \ \|r(a)\|_1.    (6)

For given a ∈ R^n, the Gauss-Newton step d for minimizing ‖r‖_1 is given by finding

    \min_d \|r + Jd\|_1,

where the ith row of J consists of, for any r_i ≠ 0,

    ∇_a r_i = - \frac{w_i(a)^T}{\|w_i(a)\|} (∇_1 x(a, t_i(a)) + ∇_2 x(a, t_i(a)) ∇_a t_i(a))
            = - \frac{w_i(a)^T}{\|w_i(a)\|} ∇_1 x(a, t_i(a)),    (7)

by definition of t_i(a). Using the expression on the right hand side of (7), the correct vector of partial derivatives of r_i is easily calculated. There appears to be a potential difficulty in using this method when any r_i becomes zero, and we would expect this to be the case at a limit point of the iteration, by the nature of the l1 norm. However, it is shown in [18] that under very mild conditions the theory is just as for smooth problems. Because d is a descent direction, convergence can be forced by step length control, for example a line search. The performance depends primarily on the number of zero components of r at a limit point: if this number is n then a second order rate of convergence is expected, otherwise convergence is likely to be slow. A trust region method is given in [11]. Consider now a fundamental fitting problem.
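Since the subproblem min_d ‖r + Jd‖_1 is piecewise linear, it can be posed as a linear program. The sketch below is an illustration (not the authors' implementation): it assumes SciPy's linprog is available and that a user-supplied callback resid_and_jac returns r(a) and the Jacobian J whose rows are given by (7); the halving line search mirrors the step length control mentioned above.

```python
import numpy as np
from scipy.optimize import linprog

def l1_gauss_newton_step(r, J):
    """Solve min_d ‖r + J d‖_1 as the LP: min e^T(u+v) s.t. J d - u + v = -r, u, v ≥ 0."""
    m, n = J.shape
    c = np.concatenate([np.zeros(n), np.ones(2 * m)])
    A_eq = np.hstack([J, -np.eye(m), np.eye(m)])
    bounds = [(None, None)] * n + [(0, None)] * (2 * m)
    res = linprog(c, A_eq=A_eq, b_eq=-r, bounds=bounds, method='highs')
    return res.x[:n]

def gauss_newton_l1(a, resid_and_jac, max_iter=50, tol=1e-10):
    """Damped Gauss-Newton iteration for (6); step length chosen by simple halving."""
    for _ in range(max_iter):
        r, J = resid_and_jac(a)          # r(a) and its Jacobian, rows as in (7)
        d = l1_gauss_newton_step(r, J)
        if np.max(np.abs(d)) < tol:
            break
        f0, gamma = np.sum(np.abs(r)), 1.0
        while gamma > 1e-12:             # halving line search forcing descent
            r_new, _ = resid_and_jac(a + gamma * d)
            if np.sum(np.abs(r_new)) < f0:
                a = a + gamma * d
                break
            gamma *= 0.5
        else:
            break                        # no step gave descent; stop
    return a
```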
2.1  Fitting lines in 3 dimensions by l1 ODR
Let a line in 3 dimensions (s = 3, d = 1) be parameterized by

    x(a, t) = \begin{pmatrix} a_1 t + a_2 \\ a_3 t + a_4 \\ t \end{pmatrix},

where t is a parameter. This includes all lines except those which lie in a plane parallel to z = 0.

Example 1 Consider the approximation by a line in 3 dimensions, parameterized as above, using (3) with p = 1. Using the data set with m = 10 given in Table 1 of [18], then starting from a = (0.0, 2.0, −1.0, 6.0)^T, which gives ‖r‖_1 = 12.7036, the Gauss-Newton method (with line search based on halving) reaches ‖r‖_1 = 11.9572 after 50 iterations, with a step length of 8 × 10^{-3} and ‖d‖_∞ = 4.99. After 400 iterations, the value of ‖r‖_1 is 11.9240, the step length is 2 × 10^{-3} and ‖d‖_∞ = 2.04. The optimal value of the objective function is 11.828075, and there is one interpolation point at the solution, with r_2 = 0.

This example is typical and shows that Gauss-Newton is a poor method to use in this case. The reason is a simple one: n = 4, but there will generally be only 2 interpolation points. Because this problem seems simple as well as fundamental, let us look at it a little more closely. For fixed a, each term of the sum can be minimized with respect to t_i, which easily leads to the minimizing values

    t_i(a) = \frac{(a_1, a_3, 1) x_i - a_1 a_2 - a_3 a_4}{a_1^2 + a_3^2 + 1}, \quad i = 1, \dots, m.

These values can then be used to give explicitly a problem in a alone, which can be expressed as the minimization of

    f(a) = \sum_{i=1}^m \|w_i(a)\|,    (8)

where (dropping explicit dependence of t_i on a)

    w_i(a) = x_i - \begin{pmatrix} t_i & 1 & 0 & 0 \\ 0 & 0 & t_i & 1 \\ 0 & 0 & 0 & 0 \end{pmatrix} a - \begin{pmatrix} 0 \\ 0 \\ t_i \end{pmatrix}, \quad i = 1, \dots, m.

By definition of t_i, for every i,

    ∇_a w_i(a) = - \begin{pmatrix} t_i & 1 & 0 & 0 \\ 0 & 0 & t_i & 1 \\ 0 & 0 & 0 & 0 \end{pmatrix}.
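For the line model these quantities are cheap to evaluate. The following sketch (an illustration of the formulas above, with hypothetical function names) assumes NumPy and stores the data points as rows of an array X.

```python
import numpy as np

def line_point(a, t):
    """Point on the line x(a, t) = (a1*t + a2, a3*t + a4, t)."""
    return np.array([a[0] * t + a[1], a[2] * t + a[3], t])

def foot_parameter(a, xi):
    """Closed-form t_i(a) = ((a1, a3, 1)·x_i - a1*a2 - a3*a4) / (a1^2 + a3^2 + 1)."""
    return (np.dot([a[0], a[2], 1.0], xi) - a[0] * a[1] - a[2] * a[3]) / (a[0] ** 2 + a[2] ** 2 + 1.0)

def distance_vectors(a, X):
    """Orthogonal distance vectors w_i(a) for all data points (rows of X)."""
    return np.array([xi - line_point(a, foot_parameter(a, xi)) for xi in X])

def f_l1_odr(a, X):
    """Objective (8): sum of Euclidean lengths of the w_i(a)."""
    return np.sum(np.linalg.norm(distance_vectors(a, X), axis=1))
```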
Let a be such that w_i(a) = 0, i ∈ Z, and let Z^c denote the complement of Z. Then the terms of the sum in (8) split into differentiable ones associated with Z^c and nondifferentiable ones associated with Z. Standard theory [15] tells us that a necessary condition for a to minimize f is that there exist vectors v_i ∈ R^3 such that ‖v_i‖ ≤ 1, i ∈ Z, with

    \sum_{i ∈ Z^c} \frac{w_i(a)^T}{\|w_i(a)\|} ∇_a w_i(a) + \sum_{i ∈ Z} v_i^T ∇_a w_i(a) = 0.    (9)
This defines a to be a stationary point. With an assumption of nondegeneracy, |Z| ≤ 2, and clearly finding a line which interpolates 2 of the data points is straightforward. Indeed we can easily find the one which gives the smallest value of (8). Further, we can then easily establish whether or not this gives a stationary point.

Theorem 1 At a, let |Z| = 2, with Z = {j, k}, and let

    h(a)^T = \sum_{i ≠ j,k} \frac{w_i(a)^T}{\|w_i(a)\|} ∇_a w_i(a).

Then a is a stationary point if

    \max_{i=j,k} \{(h_1 - t_i h_2)^2 + (h_3 - t_i h_4)^2\} ≤ (t_j - t_k)^2.

Proof Let

    v_j = \frac{1}{t_j - t_k} (h_1 - t_k h_2, h_3 - t_k h_4, 0)^T, \quad
    v_k = \frac{1}{t_j - t_k} (h_2 t_j - h_1, h_4 t_j - h_3, 0)^T.    (10)

Then it is easily verified that (10) implies ‖v_j‖ ≤ 1 and ‖v_k‖ ≤ 1, and also

    \sum_{i ≠ j,k} \frac{w_i(a)^T}{\|w_i(a)\|} ∇_a w_i(a) + v_j^T ∇_a w_j(a) + v_k^T ∇_a w_k(a) = 0.    (11)

The result follows.
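The test of Theorem 1 is easy to apply once h(a) has been accumulated. A small sketch (reusing the hypothetical line_point and foot_parameter helpers from the earlier sketch) might look as follows.

```python
import numpy as np

def grad_w(ti):
    """∇_a w_i(a) for the line model (a 3 x 4 matrix)."""
    return -np.array([[ti, 1.0, 0.0, 0.0],
                      [0.0, 0.0, ti, 1.0],
                      [0.0, 0.0, 0.0, 0.0]])

def theorem1_check(a, X, j, k):
    """Evaluate the stationarity test of Theorem 1 at a line interpolating points j and k."""
    ts = np.array([foot_parameter(a, xi) for xi in X])
    h = np.zeros(4)
    for i, xi in enumerate(X):
        if i in (j, k):
            continue
        wi = xi - line_point(a, ts[i])
        h += (wi / np.linalg.norm(wi)) @ grad_w(ts[i])
    lhs = max((h[0] - ts[i] * h[1]) ** 2 + (h[2] - ts[i] * h[3]) ** 2 for i in (j, k))
    return lhs <= (ts[j] - ts[k]) ** 2
```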
If a defined in this way is not a stationary point, then it is straightforward to define a descent direction at a and so we can make progress from the current interpolation point. Indeed this can be done in such a way that one of the components of r remains at zero.

Theorem 2 Let a be such that Z = {j, k} and (11) is satisfied with ‖v_j‖ > 1. Define d ∈ R^4 by

    d_1 = θ \frac{(v_j)_1}{\|v_j\|}, \quad d_3 = θ \frac{(v_j)_2}{\|v_j\|},

where θ = sign(t_k − t_j) and d_2 = −d_1 t_k, d_4 = −d_3 t_k. Then d is a descent direction for f at a in the subspace defined by r_k = 0.

Proof First notice that by the definition of d, w_k(a + γd) = 0 for any γ, and also ∇_a w_k(a) d = 0. Then if γ is a positive number,

    f(a + γd) = \sum_{i=1}^m \|w_i(a + γd)\|
              = \sum_{i ≠ j,k} \|w_i(a + γd)\| + \|w_j(a + γd)\|
              = \sum_{i ≠ j,k} [\|w_i(a)\| + γ ∇_a \|w_i(a)\| d] + γ \|∇_a w_j(a) d\| + O(γ^2)
              = f(a) + γ [\|∇_a w_j(a) d\| - v_j^T ∇_a w_j(a) d] + O(γ^2),

using the definition of d and the conditions (11). Now ‖∇_a w_j(a) d‖ = |t_j − t_k|, and v_j^T ∇_a w_j(a) d = |t_j − t_k| ‖v_j‖. Thus if ‖v_j‖ > 1, the coefficient of γ is negative and the result follows.
Suppose we wish to minimize f subject to r_k = 0. Then the constraints corresponding to r_k = 0 are

    a_2 = (x_k)_1 - a_1 (x_k)_3, \quad a_4 = (x_k)_2 - a_3 (x_k)_3,

and so these can be used to eliminate a_2 and a_4, giving an unconstrained problem in the remaining 2 components of a. If a starting point which gives a value lower than the current value of f is used, then the problem is obviously a differentiable one.

Theorem 3 Let a minimize f in the subspace defined by r_k = 0. Then there exists v_k satisfying (9) with Z = {k}. If ‖v_k‖ ≤ 1 then a is a stationary point.

Proof By definition, a solves the problem

    minimize \ \sum_{i ≠ k} \|w_i(a)\| \quad subject to \quad
    \begin{pmatrix} t_k & 1 & 0 & 0 \\ 0 & 0 & t_k & 1 \end{pmatrix} a =
    \begin{pmatrix} (x_k)_1 \\ (x_k)_2 \end{pmatrix},

where we have used the fact that t_k = (x_k)_3. Therefore at a minimum there exist Lagrange multipliers λ_1 and λ_2 such that

    \sum_{i ≠ k} ∇_a \|w_i(a)\| - (λ_1 t_k, λ_1, λ_2 t_k, λ_2) = 0.

This is just (9) with v_k = (λ_1, λ_2, d)^T, where d is arbitrary. Setting d = 0, the result is proved.

Based on these results, a method can be developed which has the potential to produce an efficient solution to the line fitting problem. This work is in progress, but we illustrate by a simple example.

Example 2 Consider the problem of Example 1. Solving the \binom{10}{2} = 45 interpolation problems gives the lowest value of f as 12.0578 when a = (−0.28, 2.836, −0.44, 4.628)^T, with j = 9, k = 8. Checking the characterization conditions at this point gives ‖v_8‖ = 0.917460, ‖v_9‖ = 1.131133, so
we fix r_8 = 0. Solving the resulting differentiable problem gives convergence to f = 12.037792 with a_1 = −0.252782, a_3 = −0.489627 (this problem is obviously differentiable because any zero r_i other than with i = 8 gives a higher function value). Checking the characterization conditions at this point gives ‖v_8‖ = 1.118502, so this is not a solution. Consider now the second lowest value obtained by interpolation, which has f = 12.0733, with a = (0.1, 2.06, −0.8, 5.62)^T and j = 3, k = 2. Checking the characterization conditions at this point gives ‖v_2‖ = 0.369039 and ‖v_3‖ = 1.62471, so we keep r_2 at zero. Solving the resulting differentiable problem takes us to f = 11.828075, and the minimum is verified to be a stationary point since ‖v_2‖ = 0.521857.

Consider now the solution of (4) with p = 1. Once more the Gauss-Newton method is an option, and again for fast local convergence we require n zero components of s. But this is a much more realistic prospect: the difficulty with (3) is that for the Gauss-Newton method to work well, n zeros of r, corresponding to ns zeros of s, are needed.

Example 3 Consider (4) with p = 1 and with the same data and starting point as in Example 1, which gives ‖s‖_1 = 20.8. After 5 iterations, with full (unit) steps, we have ‖d‖_∞ = 2.7 × 10^{-7}, ‖s‖_1 = 19.025148, and zero components of s corresponding to interpolation at components 1, 4, 5, 6, 9 of s. There is interpolation at just one data point, with r_2 = 0 once again.

To summarise, when p = 1 there is a good prospect of the Gauss-Newton method being effective in fitting a line in 3 dimensions using (4); by contrast, a good algorithm for (3) remains to be fully developed.
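The enumeration step used in Example 2, finding the interpolating line through a pair of data points that gives the smallest value of (8), is straightforward to program. A minimal sketch (illustrative only, reusing the hypothetical f_l1_odr helper from the earlier sketch) follows.

```python
from itertools import combinations
import numpy as np

def interpolating_line(xj, xk):
    """Parameters a of the line through data points x_j and x_k (requires (x_j)_3 ≠ (x_k)_3)."""
    a1 = (xj[0] - xk[0]) / (xj[2] - xk[2])
    a3 = (xj[1] - xk[1]) / (xj[2] - xk[2])
    return np.array([a1, xj[0] - a1 * xj[2], a3, xj[1] - a3 * xj[2]])

def best_interpolating_pair(X):
    """Enumerate all pairs, as in Example 2, and return the pair giving the smallest value of (8)."""
    best = None
    for j, k in combinations(range(len(X)), 2):
        a = interpolating_line(X[j], X[k])
        val = f_l1_odr(a, X)             # objective (8) from the earlier sketch
        if best is None or val < best[0]:
            best = (val, a, j, k)
    return best
```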
2.2  Fitting by l1 distance regression
Let us turn now to (5) with p = 1. Then we have to solve:

    minimize \ \sum_{i=1}^m \|x_i - x(a, t_i)\|_1,    (12)

with respect to a ∈ R^n, t_i ∈ R^d, i = 1, . . . , m, so that we have n + md parameters. This problem can be reformulated as a smooth nonlinear programming problem and can be solved by many methods. We consider here the use of a method of Gauss-Newton type which exploits the structure. Let

    A_i = ∇_1 x(a, t_i), \quad R_i = ∇_2 x(a, t_i), \quad i = 1, \dots, m,

defined at approximations a, t_i, i = 1, . . . , m, to the solution. Then the Gauss-Newton step δa, δt_i, i = 1, . . . , m, solves

    minimize \ \sum_{i=1}^m \|x_i - x(a, t_i) - A_i δa - R_i δt_i\|_1.    (13)
Introducing variables u_i, v_i, i = 1, . . . , m, this sub-problem can be reformulated as the linear programming problem:

    minimize \ \sum_{i=1}^m (e^T u_i + e^T v_i) \quad subject to
    u_i - v_i - x_i + x(a, t_i) + A_i δa + R_i δt_i = 0, \quad i = 1, \dots, m,    (14)
    u_i ≥ 0, \ v_i ≥ 0, \quad i = 1, \dots, m.

Consider the case of a curve in 2 dimensions, when s = 2, d = 1. Then t_i is a scalar and we will write x′(a, t_i) for R_i = ∇_2 x(a, t_i). Further, each pair of (scalar) equations of (14) for each i can be reduced to one, if δt_i is eliminated from one and substituted into the other; this is always possible unless x′(a, t_i) = 0, a form of degeneracy which we exclude. Assume with no loss of generality that the second component of x′(a, t_i) is non-zero, and define

    c_i^T = e_1^T - \frac{e_1^T x′(a, t_i)}{e_2^T x′(a, t_i)} e_2^T = \left(1, \ -\frac{e_1^T x′(a, t_i)}{e_2^T x′(a, t_i)}\right), \quad i = 1, \dots, m.

Then the linear equality constraints reduce to

    [D_1 : -D_1 : G] \begin{pmatrix} u \\ v \\ δa \end{pmatrix} = g,    (15)

where

    u = [u_1^T, \dots, u_m^T]^T, \quad v = [v_1^T, \dots, v_m^T]^T,

D_1 ∈ R^{m×2m} is the block diagonal matrix D_1 = diag{c_1^T, . . . , c_m^T}, G ∈ R^{m×n} has ith row c_i^T A_i, i = 1, . . . , m, and g ∈ R^m with g_i = c_i^T (x_i − x(a, t_i)), i = 1, . . . , m. It is clear from (15) that there exists a permutation matrix P such that [D_1 : −D_1] P = [I : −I : D : −D]. Thus the linear equality constraints can be expressed in the form

    [I : -I : D : -D : G] \begin{pmatrix} P \begin{pmatrix} u \\ v \end{pmatrix} \\ δa \end{pmatrix} = g,    (16)

where

    D = diag\left\{ -\frac{e_1^T x′(a, t_i)}{e_2^T x′(a, t_i)}, \ i = 1, \dots, m \right\},

    P \begin{pmatrix} u \\ v \end{pmatrix} = ((u_1)_1, (u_2)_1, \dots, (u_m)_1, (v_1)_1, \dots, (v_m)_1, (u_1)_2, (u_2)_2, \dots, (u_m)_2, (v_1)_2, \dots, (v_m)_2)^T.

Now consider the linear programming problem:

    minimize \ e^T (u + v) \quad subject to
    [I : -I : D : -D : G] \begin{pmatrix} P \begin{pmatrix} u \\ v \end{pmatrix} \\ δa \end{pmatrix} = g,    (17)
    -τ_1 e ≤ δa ≤ τ_1 e,
    -τ_2 e ≤ e_2^T (u_i - v_i) - e_2^T (x_i - x(a, t_i)) ≤ τ_2 e, \quad i = 1, \dots, m,
    u ≥ 0, \ v ≥ 0.

Then we can note the following.

1. If τ_1 → 0, τ_2 → 0, then δa → 0 and also, for all i, i = 1, . . . , m,

       δt_i = e_2^T (x_i - x(a, t_i) - A_i δa - u_i + v_i) / e_2^T x′(a, t_i) → 0.

   Therefore this problem can be interpreted as a full Levenberg-Marquardt variant of the Gauss-Newton subproblem (14), with complete stepsize control based on standard ways of adaptive adjustment of τ_1 and τ_2.
2. The problem (17) has exactly the same structure as one identified in [17] and [19] for related problems and can therefore be solved by the simplex based method developed in [19]. In particular (i) all the inequalities can be treated as bounds on variables, and so cause no increase in complexity, (ii) an initial basic feasible solution is immediately available, (iii) columns of both I and D and their negatives can be swapped into and out of the basis matrix so as to maximize the length of descent steps and so make small the number of simplex steps.

The incorporation of step size restrictions can be important in cases when the starting approximation is poor or when second order convergence is not possible. Of course such bounds can be incorporated directly into the problem (14). Further, for very large structured linear programming problems, interior point methods are an alternative way to proceed, as most work is in the Newton step and the structure can be exploited in the linear algebra. However, the problems considered here could have large m but are perhaps unlikely to be in the first rank of large problems, and indeed may be relatively small. Therefore it may be preferable to use vertex methods rather than interior point methods.

Example 4 Consider fitting a circle to 100 points in 2-space. Table 1 shows progress of a method based on solving (17) using the method from [19]. The iteration number (increased each time (17) is solved) is given by k, the objective function value by F, and the number of simplex iterations to solve the subproblem by ns. Also shown is the Chebyshev norm ‖δa‖_∞ of the increment vector solving (17). The initial approximations to t_i correspond to least squares nearest points for the initial circle. Clearly we have second order convergence here.

Example 5 Now consider fitting an ellipse in general position (n = 5) to the same set of 100 points as in Example 4. The outcome is shown in Table 2. Again the initial approximations to t_i correspond to least squares nearest points for the initial approximation (a circle). Again we have second order convergence.

The cases s > 2 can be dealt with in an analogous way, again removing δt_i from the constraints (14). However, the resulting constraints will not be expressible in the form (16), and more effort will be required to fully exploit their structure. (A small illustrative sketch of how the subproblem (14) can be set up for a general purpose LP solver is given after Tables 1 and 2.)
    k    F          ns   ‖δa‖_∞
    1    19.4760    11   0.164820
    2    7.58729     4   0.043514
    3    6.92667     3   0.028868
    4    6.83103     3   0.005364
    5    6.83051     3   0.000029

Table 1: Example 4: Circle fit
    k    F           ns   ‖δa‖_∞
    1    74.1574     15   0.625536
    2    51.4779     16   0.226420
    3    14.1855      8   0.014637
    4    10.6297      5   0.005917
    5    10.56526     5   0.000058
    6    10.56249     5   0.0000001

Table 2: Example 5: Ellipse fit
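As indicated above, here is a minimal sketch of how one Gauss-Newton step for l1 distance regression can be obtained from (14) when d = 1. It is an illustration only (not the structured simplex implementation of [19]), and it assumes SciPy's general purpose linprog solver.

```python
import numpy as np
from scipy.optimize import linprog

def l1_dr_step(b, A, R):
    """One Gauss-Newton step for (12): solve the LP form (14) of subproblem (13), d = 1.

    b is an (m, s) array of residuals x_i - x(a, t_i); A is an (m, s, n) array of the
    blocks A_i = grad_1 x(a, t_i); R is an (m, s) array of the columns R_i = x'(a, t_i).
    Returns (delta_a, delta_t)."""
    m, s, n = A.shape
    n_var = n + m + 2 * s * m                        # [delta_a, delta_t, u, v]
    c = np.concatenate([np.zeros(n + m), np.ones(2 * s * m)])
    A_eq = np.zeros((s * m, n_var))
    b_eq = b.reshape(-1)
    for i in range(m):
        rows = slice(s * i, s * i + s)
        A_eq[rows, :n] = A[i]                                                 # A_i delta_a
        A_eq[rows, n + i] = R[i]                                              # R_i delta_t_i
        A_eq[rows, n + m + s * i:n + m + s * i + s] = np.eye(s)               # + u_i
        A_eq[rows, n + m + s * (m + i):n + m + s * (m + i) + s] = -np.eye(s)  # - v_i
    bounds = [(None, None)] * (n + m) + [(0, None)] * (2 * s * m)
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method='highs')
    return res.x[:n], res.x[n:n + m]
```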
An alternative to solving (13) is to go to the dual problem, which may be written

    max \ \sum_{j=1}^m b_j^T v_j \quad subject to
    \begin{pmatrix} A_1^T & A_2^T & \cdots & A_m^T \\ R_1^T & & & \\ & R_2^T & & \\ & & \ddots & \\ & & & R_m^T \end{pmatrix} v = 0,
    -e ≤ v ≤ e,

where b_i = x_i − x(a, t_i), i = 1, . . . , m, and e is a vector of ones. This problem can be solved by a variant of Dantzig-Wolfe decomposition [9]. This requires the solution of a number of subproblems, having the form

    \min_α \|R_i α - c_i\|_1, \quad i = 1, \dots, m,

where c_i is a known vector. At an appropriate point in the calculation, we have a solution to the primal given by

    δt_i = \arg\min_α \|R_i α - c_i\|_1, \quad i = 1, \dots, m.
The vector δa is also easily obtained. We do not give further details here, but if it is sufficient to solve (13) and not to include bounds, this may be a more efficient way to proceed, and one which is valid for any s.
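For a curve (d = 1) the subproblems min_α ‖R_i α − c_i‖_1 are one-dimensional, and a standard fact is that the minimizer is a weighted median of the ratios (c_i)_j/(R_i)_j with weights |(R_i)_j|. A minimal sketch of that observation (an assumption of this note, not a detail given in the paper) is:

```python
import numpy as np

def l1_scalar_fit(R, c):
    """min over scalar alpha of ‖R*alpha - c‖_1; the minimizer is a weighted median of c_j/R_j."""
    mask = R != 0
    pts, wts = c[mask] / R[mask], np.abs(R[mask])
    order = np.argsort(pts)
    pts, wts = pts[order], wts[order]
    cum = np.cumsum(wts)
    return pts[np.searchsorted(cum, 0.5 * cum[-1])]
```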
3  The Chebyshev norm
As in the last section we can identify three separate problems, (3), (4) and (5), when p = ∞. For the first two of these, involving orthogonal distances, the Gauss-Newton method or variants can be applied in a straightforward manner. For (3), zero distances are a potential problem, but such occurrences are unlikely (in contrast to the l1 case), and may be interpreted as a form of degeneracy. Performance of the methods therefore depends mainly on whether or not ‖r‖_∞ or ‖s‖_∞ is attained at (n + 1) indices at a limit point of the iteration: if this is the case, we expect a second order convergence rate; if there are fewer than (n + 1), convergence can be slow. This does not represent as strict a condition on r as in the analogous l1 situation, and so the advantage identified in that case for using (4) rather than (3) no longer applies. Trust region methods for (3) are given in [11]. See also [1, 8]. A special case is the fitting of lines in 3 dimensions by l∞ ODR. Note that for l2 ODR, this problem can be solved through the eigenvalue decomposition of a 3 × 3 matrix, so possible starting values are easily obtained (see, for example, [16]).
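For (3) with p = ∞ the Gauss-Newton subproblem min_d ‖r + Jd‖_∞ is again piecewise linear and can be posed as a linear program. A minimal sketch (an illustration only, assuming SciPy's linprog and a Jacobian J whose rows are as in (7)):

```python
import numpy as np
from scipy.optimize import linprog

def linf_gauss_newton_step(r, J):
    """Solve min_d ‖r + J d‖_inf as the LP: min h s.t. -h ≤ r_i + (J d)_i ≤ h for all i."""
    m, n = J.shape
    c = np.concatenate([np.zeros(n), [1.0]])
    ones = np.ones((m, 1))
    A_ub = np.vstack([np.hstack([J, -ones]),      #  J d - h e ≤ -r
                      np.hstack([-J, -ones])])    # -J d - h e ≤  r
    b_ub = np.concatenate([-r, r])
    bounds = [(None, None)] * n + [(0, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method='highs')
    return res.x[:n], res.x[n]       # step d and predicted value h of ‖r + J d‖_inf
```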
3.1  Fitting lines in 3 dimensions by l∞ ODR
This problem is equivalent to fitting to data an enclosing cylinder of smallest radius, and (as in the l1 case) turns out to be surprisingly difficult to solve. One reason for this is that the number of extrema at a solution is frequently less than 5 (the number required for second order convergence of methods like the Gauss-Newton method which are based on linearization). Let a line in 3 dimensions be parameterized as in Section 2.1. The problem can be stated as the minimization of

    \max_{1 ≤ i ≤ m} \|w_i(a)\|,

where we recall that w_i(a) denotes the ith orthogonal distance vector. For given a ∈ R^n, let

    I(a) = \{i : \|w_i(a)\| = \max_j \|w_j(a)\|\}.

Then, by standard theory (for example [15]), necessary conditions for a to give a minimum are the existence of λ_i ≥ 0, i ∈ I(a), not all zero, such that

    \sum_{i ∈ I(a)} λ_i ∇_a \|w_i(a)\| = 0^T.

Now for any i, as before,

    ∇_a \|w_i(a)\| = - \frac{w_i(a)^T}{\|w_i(a)\|} ∇_1 x(a, t_i(a)).

Thus since ‖w_i(a)‖, i ∈ I(a), is constant, we can write the above condition as

    \sum_{i ∈ I(a)} λ_i w_i(a)^T \begin{pmatrix} t_i & 1 & 0 & 0 \\ 0 & 0 & t_i & 1 \\ 0 & 0 & 0 & 0 \end{pmatrix} = 0^T, \quad λ_i ≥ 0, \ i ∈ I(a).

Now (dropping the explicit dependence of w and I on a) let

    w_i = \begin{pmatrix} v_i \\ α_i \end{pmatrix}, \quad i ∈ I,

where v_i ∈ R^2, i ∈ I. Then the conditions may be written 0 ∈ conv{d_i, i ∈ I}, where ‘conv’ denotes the convex hull and

    d_i = \begin{pmatrix} v_i \\ t_i v_i \end{pmatrix} = \begin{pmatrix} v_i \\ (z_i)_3 v_i \end{pmatrix},

where z_i is the nearest point on the line to x_i, i ∈ I. Now by orthogonality v_i^T c + α_i = 0, i ∈ I, where c = (a_1, a_3)^T. Let

    \sum_{i ∈ I} λ_i v_i = 0.

Then

    \sum_{i ∈ I} λ_i α_i = 0,

and so

    \sum_{i ∈ I} λ_i w_i = 0.

Similarly

    \sum_{i ∈ I} λ_i (z_i)_3 w_i = 0.

Therefore the conditions are equivalent to

    0 ∈ conv\left\{ \begin{pmatrix} w_i \\ (z_i)_3 w_i \end{pmatrix}, \ i ∈ I \right\}.

Now there are alternative parameterizations of the line, and so the presence of the third component of z is purely a consequence of the present one. Indeed it is clear that we can effectively interchange the parametric representations of the different variables. Therefore we must have

    0 ∈ conv\left\{ \begin{pmatrix} w_i \\ w_i z_i^T e_j \end{pmatrix}, \ i ∈ I \right\}, \quad j = 1, 2, 3,    (18)

provided that the line does not lie in a plane parallel to the x = 0, y = 0 or z = 0 planes.

Lemma 1 If |I| = 3, the vectors w_i must be of the form w, w, −w.

Proof From

    \sum_{i ∈ I} λ_i w_i = 0, \quad \sum_{i ∈ I} γ_i λ_i w_i = 0,

where γ_i is the jth component of z_i, it follows that for some number k, w_3 = k w_2 (γ_3 ≠ γ_2 by assumption). Similarly, for some number l, w_1 = l w_2. Thus the vectors w_i, i ∈ I, are parallel. But since they all have the same length and satisfy

    \sum_{i ∈ I} λ_i w_i = 0,
where λ ≥ 0, the result follows.

The extent to which the above results can be used as the basis for an algorithm is not clear, although some possibilities have been considered by Zwick [21]. Numerical results have been obtained for some problems based on using the Gauss-Newton method, either on its own if it worked well, or otherwise as a first phase to identify I. In the latter case, and assuming correct identification, locally the problem can be stated as

    \min \|w_j(a)\| \quad subject to \quad \|w_i(a)\| - \|w_j(a)\| = 0, \ i ∈ I - \{j\},

for any j ∈ I. Such problems were solved using the NAG Fortran subroutine E04UCF. A check is necessary at the solution to ensure I is still correct. It is easy to produce examples with |I| having different values. With 100 random points, the case |I| = 3 was obtained, and the above relationship between w_1, w_2 and w_3 confirmed. On the other hand, examples with |I| = 4 and |I| = 5 are also easy to generate, in one case with both providing local solutions to the same problem. For the case when |I| = 5, a second order rate of convergence of the Gauss-Newton method is obtained. Interestingly, in all examples solved so far, the components of λ (which give the convex combination as zero) in (18) are identical for all j. In fact numerical evidence (although somewhat limited) suggests that (18) can be written

    0 ∈ conv\{[w_i : w_i z_i^T], \ i ∈ I\},

in other words the same multipliers can be used for each of the 3 “zero in the convex hull” results. We do not know if this is in fact always the case.
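Whether a condition such as (18) holds at a candidate line can be checked numerically: testing 0 ∈ conv{·} is a small linear programming feasibility problem. A minimal sketch (an illustrative assumption of this note, not part of the paper's method), again using SciPy:

```python
import numpy as np
from scipy.optimize import linprog

def zero_in_convex_hull(D, tol=1e-9):
    """True if 0 lies in the convex hull of the columns of D: seek lam >= 0 with
    sum(lam) = 1 and D @ lam = 0 (a pure feasibility LP with zero objective)."""
    q, k = D.shape
    A_eq = np.vstack([D, np.ones((1, k))])
    b_eq = np.concatenate([np.zeros(q), [1.0]])
    res = linprog(np.zeros(k), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * k, method='highs')
    return res.success and np.linalg.norm(A_eq @ res.x - b_eq) < tol

def check_18(W, Z, j):
    """Condition (18) for one choice of j in {0, 1, 2}: columns of W are the w_i,
    columns of Z are the nearest points z_i; stack (w_i, w_i * (z_i)_j)."""
    D = np.vstack([W, W * Z[j, :]])
    return zero_in_convex_hull(D)
```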
3.2  Fitting by l∞ distance regression
Finally, consider (5) with p = ∞. This problem may be stated as

    minimize \ \max_{1 ≤ i ≤ m} \|x_i - x(a, t_i)\|_∞.    (19)

Let a^*, t_i^*, i = 1, . . . , m, be a solution with

    M = \max_{1 ≤ i ≤ m} \|x_i - x(a^*, t_i^*)\|_∞.

Then obviously

    \|x_i - x(a^*, t_i^*)\|_∞ ≤ M, \quad i = 1, \dots, m.

It is clear that t_i^* is not unique if for any i there is an infinite number of vectors t satisfying ‖x_i − x(a^*, t)‖_∞ ≤ M. This is the case if

    \min_t \|x_i - x(a^*, t)\|_∞ < M.

Thus for uniqueness it is necessary that

    \min_t \|x_i - x(a^*, t)\|_∞ = M, \quad i = 1, \dots, m,

and since even for modest m this is unlikely to hold, it is clear that, even if a^* is unique, in general there is not a unique solution to (19). To remove this freedom in t_i^*, we can choose the solution which is such that

    t_i^* = \arg\min_t \|x_i - x(a^*, t)\|_∞, \quad i = 1, \dots, m,    (20)

provided of course that these problems have unique solutions, and also that a^* is unique. Define

    U = \{i : \min_t \|x_i - x(a^*, t)\|_∞ = M\},

with Ū the complement of U in {1, . . . , m}. Then a counting argument shows that the number of conditions required to hold at a solution to (19) satisfying (20) matches the number of available parameters if |U| = n + 1.

Now consider applying the Gauss-Newton method to (19) at particular approximations a, t_i, i = 1, . . . , m. Then we have to minimize over δ

    \max_{1 ≤ i ≤ m} \|x_i - x(a, t_i) + ∇_{a,t}(x_i - x(a, t_i)) δ\|_∞.

In other words, if δ^T = (δa^T, δt_1^T, \dots, δt_m^T), we have to find δa, δt_i, i = 1, . . . , m, and h to solve:

    \min h \quad subject to
    \begin{pmatrix} A_1 & R_1 & & & e \\ \vdots & & \ddots & & \vdots \\ A_m & & & R_m & e \\ -A_1 & -R_1 & & & e \\ \vdots & & \ddots & & \vdots \\ -A_m & & & -R_m & e \end{pmatrix}
    \begin{pmatrix} δa \\ δt_1 \\ \vdots \\ δt_m \\ h \end{pmatrix} ≥
    \begin{pmatrix} b_1 \\ \vdots \\ b_m \\ -b_1 \\ \vdots \\ -b_m \end{pmatrix},    (21)
with A_i, R_i, b_i, i = 1, . . . , m, as before. The usual way to solve (21) is through the dual problem:

    maximize \ \sum_{i=1}^m b_i^T y_i - \sum_{i=1}^m b_i^T z_i \quad subject to
    \begin{pmatrix} A_1^T & \cdots & A_m^T & -A_1^T & \cdots & -A_m^T \\
                    R_1^T & & & -R_1^T & & \\
                    & \ddots & & & \ddots & \\
                    & & R_m^T & & & -R_m^T \\
                    e^T & \cdots & e^T & e^T & \cdots & e^T \end{pmatrix}
    \begin{pmatrix} y_1 \\ \vdots \\ y_m \\ z_1 \\ \vdots \\ z_m \end{pmatrix} = e_{n+md+1},    (22)
    y_i ≥ 0, \ z_i ≥ 0, \quad i = 1, \dots, m.

Reordering the rows and columns of the matrix of this problem means that we can exploit the structure by using the Dantzig-Wolfe decomposition method (see, for example, [9]). This requires the solution of a number of subproblems, and these turn out to have the form

    \min_α \|R_i α - c_i\|_∞,

where c_i is a known vector: these are simple to solve. At an appropriate point in the calculation, we can set

    δt_i = \arg\min_α \|R_i α - c_i\|_∞, \quad i = 1, \dots, m,

giving a primal solution. These also (normally) remove freedom from the solution of the Gauss-Newton subproblem in a way that is consistent with the removal of freedom identified for the solution of the original problem. The solution δa is also available at this point, and the Gauss-Newton method can proceed as usual. The outcome of the process is a method which (despite the inherent non-uniqueness in both the nonlinear problem and the linear subproblems) can actually have a second order convergence rate. The details will be given elsewhere, but we illustrate by an example. We assume that a^{(k)}, t_i^{(k)}, i = 1, . . . , m, are approximations to the solution at the kth iteration,

    M^{(k)} = \max_{1 ≤ i ≤ m} \|x_i - x(a^{(k)}, t_i^{(k)})\|_∞,

h^{(k)} is the corresponding value for the linear subproblem, and γ_k is the step length.

Example 6 Consider fitting an ellipse in general position (n = 5) to 100 points. Typical performance from a particular starting ellipse is given in Table 3.
    k    M^(k)      h^(k)      ‖δ_k‖_∞      γ_k
    0    0.229721   0.072348   0.1298       1.0
    1    0.074859   0.072366   0.02284      1.0
    2    0.072370   0.072365   0.00035      1.0
    3    0.072365   0.072365   9 × 10^{-8}  -

Table 3: Example 6: Ellipse fit
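For small problems, (21) can also be solved directly with a general purpose LP code. The following sketch is an illustration only (it is not the Dantzig-Wolfe approach described above) and assumes SciPy.

```python
import numpy as np
from scipy.optimize import linprog

def linf_dr_step(b, A, R):
    """Gauss-Newton subproblem (21) for l_inf distance regression, solved as an LP.

    b: (m, s) residuals x_i - x(a, t_i);  A: (m, s, n) blocks A_i;  R: (m, s, d) blocks R_i.
    Returns (delta_a, delta_t, h)."""
    m, s, n = A.shape
    d = R.shape[2]
    n_var = n + m * d + 1                            # [delta_a, delta_t_1..delta_t_m, h]
    c = np.zeros(n_var); c[-1] = 1.0                 # minimize h
    A_ub = np.zeros((2 * s * m, n_var))
    b_ub = np.concatenate([b.reshape(-1), -b.reshape(-1)])
    for i in range(m):
        rows = slice(s * i, s * i + s)
        A_ub[rows, :n] = A[i]                        #  A_i delta_a + R_i delta_t_i - h e ≤  b_i
        A_ub[rows, n + d * i:n + d * i + d] = R[i]
        A_ub[rows, -1] = -1.0
    A_ub[s * m:, :] = -A_ub[:s * m, :]               # -A_i delta_a - R_i delta_t_i - h e ≤ -b_i
    A_ub[s * m:, -1] = -1.0
    bounds = [(None, None)] * (n + m * d) + [(0, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method='highs')
    return res.x[:n], res.x[n:n + m * d].reshape(m, d), res.x[-1]
```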
In fact it is possible to modify this approach to incorporate control on the size of the step length. Essentially we can replace the objective function by

    \max\left\{ \max_{1 ≤ i ≤ m} \|x_i - x(a, t_i) + ∇_{a,t}(x_i - x(a, t_i)) δ\|_∞, \ γ \|δ\|_∞ \right\},

where γ is chosen to control the size of δ.
4  Concluding remarks
We have considered methods for fitting curves and surfaces to data using the l1 and l∞ norms. The first of these is of value in the presence of wild points in the data, and the second in the context of accept/reject decisions for manufactured parts. Based on these norms, there are different fitting criteria which can be used, some based on orthogonal distances and some not. In all cases, however, there is considerable structure, and this can be exploited in different ways. While for some problems good fitting methods are now available, there are also some very basic problems which turn out to be surprisingly difficult and for which more work is needed to develop practical algorithms.
References

[1] G. T. Anthony, H. M. Anthony, B. Bittner (and nine others), Reference software for finding Chebyshev best-fit elements, Precision Engineering 19, pp. 28–36 (1996).
[2] A. Atieg and G. A. Watson, A class of methods for fitting a curve or surface to data by minimizing the sum of squares of orthogonal distances, J. Comp. Appl. Math. 158, pp. 277–296 (2003).

[3] A. Atieg and G. A. Watson, Use of lp norms in fitting curves and surfaces to data, Australian and New Zealand Industrial and Applied Mathematics Journal 45 (E), C187–C200 (2004).

[4] S. J. Ahn, W. Rauh and H.-J. Warnecke, Best-fit of implicit surfaces and plane curves, in Mathematical Methods for Curves and Surfaces: Oslo 2000, T. Lyche and L. L. Schumaker (eds), Vanderbilt University Press (2001).

[5] S. J. Ahn, W. Rauh and H.-J. Warnecke, Least-squares orthogonal distances fitting of circle, sphere, ellipse, hyperbola, and parabola, Pattern Recognition 34, pp. 2283–2303 (2001).

[6] S. J. Ahn, E. Westkämper and W. Rauh, Orthogonal distance fitting of parametric curves and surfaces, in Algorithms for Approximation IV, J. Levesley, I. Anderson and J. C. Mason (eds), University of Huddersfield, pp. 122–129 (2002).

[7] P. T. Boggs, R. H. Byrd and R. B. Schnabel, A stable and efficient algorithm for nonlinear orthogonal distance regression, SIAM J. Sci. Stat. Comp. 8, pp. 1052–1078 (1987).

[8] R. Drieschner, Chebyshev approximation to data by geometric elements, Numer. Alg. 5, pp. 509–522 (1993).

[9] G. Hadley, Linear Programming, Addison-Wesley, Reading, Mass. (1962).

[10] H.-P. Helfrich and D. S. Zwick, A trust region algorithm for parametric curve and surface fitting, J. Comp. Appl. Math. 73, pp. 119–134 (1996).

[11] H.-P. Helfrich and D. S. Zwick, l1 and l∞ fitting of geometric elements, in Algorithms for Approximation IV, J. Levesley, I. Anderson and J. C. Mason (eds), University of Huddersfield, pp. 162–168 (2002).
[12] R. Strebel, D. Sourlier and W. Gander, A comparison of orthogonal least squares fitting in coordinate metrology, in Recent Advances in Total Least Squares Techniques and Errors-in-Variables Modeling, S. Van Huffel (ed), SIAM, Philadelphia, pp. 249–258 (1997).

[13] D. A. Turner, The Approximation of Cartesian Co-ordinate Data by Parametric Orthogonal Distance Regression, PhD Thesis, University of Huddersfield (1999).

[14] D. A. Turner, I. J. Anderson, J. C. Mason, M. G. Cox and A. B. Forbes, An efficient separation-of-variables approach to parametric orthogonal distance regression, in Advanced Mathematical and Computational Tools in Metrology IV, P. Ciarlini, A. B. Forbes, F. Pavese and D. Richter (eds), Series on Advances in Mathematics for Applied Sciences, Volume 53, World Scientific, Singapore, pp. 246–255 (2000).

[15] G. A. Watson, Approximation Theory and Numerical Methods, Wiley (1980).

[16] G. A. Watson, The solution of generalized least squares problems, in Multivariate Approximation Theory 3, W. Schempp and K. Zeller (eds), ISNM 75, Birkhauser Verlag, Basel, pp. 388–400 (1985).

[17] G. A. Watson, The use of the L1 norm in nonlinear errors-in-variables problems, in Recent Advances in Total Least Squares Techniques and Errors-in-Variables Modeling, S. Van Huffel (ed), SIAM, Philadelphia, pp. 183–192 (1997).

[18] G. A. Watson, On the Gauss-Newton method for l1 orthogonal distance regression, IMA J. of Num. Anal. 22, pp. 345–357 (2002).

[19] G. A. Watson and C. Yiu, On the solution of the errors in variables problem using the l1 norm, BIT 31, pp. 697–710 (1991).

[20] D. S. Zwick, Applications of orthogonal distance regression in metrology, in Recent Advances in Total Least Squares Techniques and Errors-in-Variables Modeling, S. Van Huffel (ed), SIAM, Philadelphia, pp. 265–272 (1997).

[21] D. S. Zwick, Private communication.