A Robust Estimator Based on Density and Scale Optimization and its Application to Clustering

Olfa NASRAOUI and Raghu KRISHNAPURAM
Department of Computer Engineering and Computer Science
University of Missouri-Columbia, Columbia, MO 65211, USA
olfa or [email protected]

December 2, 1999

Abstract

In this paper, we propose a new robust algorithm that estimates the prototype parameters of a given structure from a possibly noisy data set. The new algorithm has several attractive features. It does not make any assumptions about the proportion of noise in the data set. Instead, it dynamically estimates a scale parameter and the weights/memberships associated with each data point, and softly rejects outliers based on these weights. The algorithm essentially optimizes a density criterion, since it tries to minimize the size of the structure while maximizing its cardinality. Moreover, the proposed algorithm is computationally simple, and can be extended to perform parameter estimation when the data set consists of multiple clusters.
1 Introduction

It is well known that classical statistical estimators such as Least Squares (LS) are inadequate for most real applications, where data can be corrupted by arbitrary noise. To overcome this problem, several robust estimators have been proposed. Examples of such estimators include M-, W-, and L-estimators [3]. These estimators offer an advantage in terms of easy computation and robustness. However, they have relatively low breakdown points. Most of them can withstand outliers in the dependent variable, but break down very early in the presence of leverage points (outliers in the explanatory variables). More recently, the Least Median of Squares (LMedS), the Least Trimmed Squares (LTS) [5], and the Reweighted Least Squares (RLS) [2] have been proposed. These estimators can reach a 50% breakdown point. However, they have nonlinear or discontinuous objective functions that are not amenable to mathematical optimization. This means that a quasi-exhaustive search over all possible parameter values must be performed to find the global minimum. As a variation, random sampling/searching of some kind has been suggested to find the best fit. In any case, these estimators are limited to estimating the parameters of a single homogeneous structure in
a data set. Other limitations of these estimators include their strong dependence on a good initialization, and their reliance on a known or assumed amount of noise in the data set (the contamination rate), or equivalently on an estimated scale value or inlier bound. When faced with more noise than assumed, all these algorithms may fail; and when the amount of noise is less than the assumed level, the parameter estimates suffer in terms of accuracy, since not all of the good data points are taken into account. In this paper, we present a new algorithm for the robust estimation of the parameters of a given component without any presuppositions about the noise proportion. The Maximal Density Estimator (MDE) algorithm yields a robust estimate of the parameters by minimizing an objective function that incorporates both the overall error and the contamination rate via a set of robust weights and a scale factor. Unlike most algorithms, the MDE is computationally attractive and practically insensitive to initialization. A modified version of the algorithm can perform clustering or parameter estimation when multiple components are present in the data set. By selecting an appropriate distance measure in the energy function to be minimized, the algorithm can estimate the prototype parameters of structures of different shapes in the data set. For instance, the Euclidean distance is used when the centers of spherical components are sought, while other distance measures should be used when estimating the parameters of ellipsoidal, linear, or quadratic clusters.
2 Background

In estimating the parameters $\Theta$, instead of minimizing the sum of squared residuals, Rousseeuw [5] proposed minimizing their median, i.e.,

$$\min_{\Theta} \; \operatorname*{med}_{j} \; d_j^2, \qquad (1)$$
where $d_j$ are the residuals or distances from the data points $\mathbf{x}_j$ to the prototype being estimated. This estimator basically trims the $\lfloor n/2 \rfloor$ observations having the largest residuals. Hence it assumes that the noise proportion is 50%. A major drawback of the LMedS is its low efficiency, since it uses only the middle residual value. The LTS [5] offers a more efficient way to find robust estimates by minimizing the objective function

$$\min_{\Theta} \; \sum_{j=1}^{h} d_{j:n}^2, \qquad (2)$$

where $d_{j:n}^2$ is the $j$th smallest residual or distance when the residuals are ordered in ascending order, i.e.,

$$d_{1:n}^2 \le d_{2:n}^2 \le \cdots \le d_{n:n}^2.$$
Since $h$ is the number of data points whose residuals are included in the sum, this estimator basically finds a robust estimate by identifying the $(n-h)$ points having the largest residuals as outliers, and discarding (trimming) them from the data set. The resulting estimates are essentially LS estimates of the trimmed data set. It can be seen
that h should be as close as possible to the number of good points in the data set, because the higher the number of good points used in the estimates, the more accurate the estimates are. In this case, the LTS will yield the best possible estimate. One problem with the LTS is that its objective function does not lend itself to mathematical optimization. Besides, the estimation of h itself is difficult in practice. In addition, the LTS objective function is based on hard rejection. That is, a given data point is either totally included in the estimation process or totally excluded from it. This may lead to instabilities when optimizing the objective function with respect to the parameters. Instead of the noise proportion, some algorithms use weights that distinguish between inliers and outliers. However, these weights usually depend on a scale measure which is also difficult to estimate. For example, the RLS [2] tries to minimize
$$\min_{\Theta} \; \sum_{j=1}^{n} w_j d_j^2, \qquad (3)$$
where $d_j$ are robust residuals resulting from an approximate LMedS or LTS procedure. Here the weights $w_j$ essentially trim outliers from the data used in the LS minimization, and can be computed after a preliminary approximate phase of the LMedS or the LTS. The weight function $w_j$ is usually continuous, has its maximum at zero, and is monotonically non-increasing with distance. In addition, $w_j$ depends on an error scale, which is usually heuristically estimated from the results of the LMedS or the LTS. The RLS was intended to refine the estimates resulting from other robust but less efficient algorithms. Hence it requires a good initialization. Several algorithms [4, 6] address the dilemma of having to know the noise proportion (or equivalently the scale parameter related to the inlier bound) beforehand. Most of them perform a robust estimation process repetitively, with different fixed contamination rates (or equivalently, inlier bounds). They finally choose the estimate that optimizes a goodness-of-fit measure. This procedure can be lengthy and computationally expensive, since it performs an exhaustive search over a discretized contamination-rate or scale interval.
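To make the contrast between these criteria concrete, the following minimal numpy sketch scores candidate location estimates under the LMedS criterion (1) and the LTS criterion (2) on a toy contaminated one-dimensional data set; the function names and the toy data are ours, for illustration only.

```python
import numpy as np

def lmeds_objective(residuals):
    """LMedS criterion (1): the median of the squared residuals."""
    return np.median(residuals ** 2)

def lts_objective(residuals, h):
    """LTS criterion (2): the sum of the h smallest squared residuals."""
    return np.sort(residuals ** 2)[:h].sum()

# Toy 1-D location problem: 80 inliers around 0, 20 gross outliers.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0.0, 1.0, 80), rng.uniform(10.0, 20.0, 20)])

# Both robust criteria prefer the true center over the outlier-pulled mean.
for center in (0.0, data.mean()):
    r = data - center
    print(f"center={center:6.2f}  LMedS={lmeds_objective(r):7.2f}  "
          f"LTS(h=80)={lts_objective(r, 80):8.2f}")
```

Note that both criteria require either an implicit (50%) or an explicit ($h$) guess of the contamination, which is precisely the limitation addressed in the next section.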
3 The maximal density estimator algorithm To confront the problem of estimating h in the LTS, we may want to allow its value to be variable, and optimize a modified objective function which minimizes the trimmed sum of errors while trying to include as many good points as possible in the estimation process. To reflect these multiple objectives, we can formulate the following compound energy function to be minimized
$$\min_{\Theta, h} \; \sum_{j=1}^{h} d_{j:n}^2 \; - \; \alpha h, \qquad (4)$$

where $\alpha$ is a constant that reflects the relative importance of the two objectives. Unfortunately, the main drawback of the LTS is still present, since mathematical optimization with respect to $h$ is still not possible. To get around this problem, we consider
a close relative of the LTS, the RLS. The weights in the RLS determine its success, and the key to this success usually lies in the estimation of a scale measure, $\sigma$, that reflects the variance of the set of good points. In other words, it is related to the inlier bound, which determines the maximal residual of the good data points. Unfortunately, mathematical optimization with respect to a scale parameter usually leads to the scale shrinking to zero. This is because a zero value for the scale corresponds to the case when all weights are zero, and this situation results in a global minimum of the objective function. Since the scale parameter is closely related to the proportion of noise, we can reformulate the above objective function so that the scale and the weights, rather than the number of good points $h$, are the parameters. This formulation is given by
$$J = \min_{\Theta, \sigma} \; \sum_{j=1}^{n} w_j \frac{d_j^2}{\sigma^2} \; - \; \sum_{j=1}^{n} w_j, \qquad (5)$$
where the $w_j$ are a set of positive decreasing weights. The weight $w_j$ can also be considered as the membership of data point $\mathbf{x}_j$ in the set of good points. The first term of this objective function tries to minimize the volume enclosed by the residual vectors of the good points. The second term tries to use as many good points (inliers) as possible in the estimation process, via their high weights. Thus the combined effect is to optimize the density, i.e., the ratio of the total number of good points to the volume. In the first term, the distances are normalized by the scale measure $\sigma^2$ for several reasons. First, this normalization counteracts the tendency of the scale to shrink towards zero. Second, unlike the absolute distance $d_j^2$, the ratio $d_j^2 / \sigma^2$ is a relative measure that indicates how close a data point is to the center relative to the inlier bound. Therefore, using this normalized measure is a more sensible way to penalize the inclusion of outliers in the estimation process, in a way that is independent of the cluster size. Finally, this normalization makes the two terms of the energy function comparable in magnitude. This relieves us from the problem of estimating a value for $\alpha$, which would otherwise depend on the data set's contamination rate and cluster sizes. Hence, the value of $\alpha$ is fixed as follows:

$$\alpha = 1.$$
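For illustration, the density criterion (5) can be evaluated directly once a weight function is chosen; the small sketch below anticipates the Gaussian weights introduced in Eq. (7) and uses $\alpha = 1$ (the function name is ours).

```python
import numpy as np

def mde_objective(d2, sigma2):
    """Density criterion of Eq. (5): scale-normalized volume term minus
    cardinality term, using the Gaussian weights of Eq. (7)."""
    w = np.exp(-d2 / (2.0 * sigma2))  # soft inlier memberships
    return (w * d2 / sigma2).sum() - w.sum()
```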
Finally, we should note that $d$ should be a suitable distance measure, tailored to detect the desired shapes, such as the Euclidean distance for spherical clusters, or the Gustafson-Kessel (GK) distance [1] for ellipsoidal clusters characterized by a covariance matrix, etc. Since the objective function depends on several variables, we can use the alternating optimization technique, where in each iteration one variable is optimized while fixing all the others. If the weights are fixed, then the optimal prototype parameters are found by setting

$$\frac{\partial J}{\partial \Theta} = \frac{1}{\sigma^2} \sum_{j=1}^{n} w_j \frac{\partial d_j^2}{\partial \Theta} = 0.$$

For instance, if $d_j^2$ is the squared Euclidean distance $d_j^2 = \|\mathbf{x}_j - \mathbf{c}\|^2$, then the center $\mathbf{c}$ is given by

$$\mathbf{c} = \frac{\sum_{j=1}^{n} w_j \mathbf{x}_j}{\sum_{j=1}^{n} w_j}. \qquad (6)$$
To derive the optimal scale regardless of the distance measure being used, we fix the prototype parameters and set

$$\frac{\partial J}{\partial \sigma} = 0.$$
Further simplification of this equation depends on the definition of the weight function. We choose to use the Gaussian weight function
$$w_j = \exp\left(\frac{-d_j^2}{2\sigma^2}\right) \qquad (7)$$
to obtain the following update equation for the scale parameter:

$$\sigma^2 = \frac{1}{3} \, \frac{\sum_{j=1}^{n} w_j d_j^4}{\sum_{j=1}^{n} w_j d_j^2}. \qquad (8)$$
Therefore, the algorithm consists of alternating updates of the prototype parameters, the scale parameter, and the weights in an iterative fashion, until convergence or for a fixed maximum number of iterations.
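A minimal numpy sketch of this alternating scheme for the spherical (Euclidean) case is given below; the function name, the convergence test, and the iteration cap are our own illustrative choices, not prescribed above.

```python
import numpy as np

def mde(X, c, sigma2, max_iter=100, tol=1e-6):
    """Maximal Density Estimator sketch for a single spherical structure.

    Alternates the weight update (7), the center update (6), and the
    scale update (8) until the center stops moving.
    """
    for _ in range(max_iter):
        d2 = np.sum((X - c) ** 2, axis=1)                        # squared residuals
        w = np.exp(-d2 / (2.0 * sigma2))                         # Eq. (7)
        c_new = (w[:, None] * X).sum(axis=0) / w.sum()           # Eq. (6)
        sigma2 = (w * d2 ** 2).sum() / (3.0 * (w * d2).sum())    # Eq. (8), d^4 = (d^2)^2
        if np.linalg.norm(c_new - c) < tol:
            return c_new, sigma2, w
        c = c_new
    return c, sigma2, w
```

Outliers never need to be labeled explicitly here: as the scale adapts to the inliers, the weights of distant points simply decay toward zero.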
4 Generalization of the maximal density estimator algorithm to clustering

The objective function in (5) can be extended to allow the estimation of $C$ prototype parameters simultaneously, $\Theta = (\Theta_1, \ldots, \Theta_C)$, and to allow for different scales $\sigma_i$, as follows:
$$J = \min_{\Theta_i, \sigma_i} \; \sum_{i=1}^{C} \sum_{\mathbf{x}_j \in C_i} w_{ij} \frac{d_{ij}^2}{\sigma_i^2} \; - \; \sum_{i=1}^{C} \sum_{\mathbf{x}_j \in C_i} w_{ij}, \qquad (9)$$

where

$$w_{ij} = \exp\left(\frac{-d_{ij}^2}{2\sigma_i^2}\right), \qquad (10)$$
and $d_{ij}^2$ is the distance from data point $\mathbf{x}_j$ to the prototype of cluster $C_i$. Here $w_{ij}$ can also be considered as the membership of data point $\mathbf{x}_j$ in cluster $C_i$. The partition of the data space is done in a minimum-distance-classifier sense. That is,
$$C_i = \left\{ \mathbf{x}_j \in X \;\middle|\; d_{ij}^2 = \min_{k=1,\ldots,C} d_{kj}^2 \right\}. \qquad (11)$$
Since each cluster is independent of the rest, it is easy to show that the optimal update equations are similar to the ones obtained for estimating the parameters of one cluster. The scale parameter of the $i$th cluster is given by

$$\sigma_i^2 = \frac{1}{3} \, \frac{\sum_{\mathbf{x}_j \in C_i} w_{ij} d_{ij}^4}{\sum_{\mathbf{x}_j \in C_i} w_{ij} d_{ij}^2}. \qquad (12)$$
To find the optimal prototype parameters $\Theta_i$ of the $i$th cluster, we set

$$\frac{\partial J}{\partial \Theta_i} = \frac{1}{\sigma_i^2} \sum_{\mathbf{x}_j \in C_i} w_{ij} \frac{\partial d_{ij}^2}{\partial \Theta_i} = 0.$$

For instance, if $d_{ij}^2$ is the squared Euclidean distance $d_{ij}^2 = \|\mathbf{x}_j - \mathbf{c}_i\|^2$, then the center $\mathbf{c}_i$ is given by

$$\mathbf{c}_i = \frac{\sum_{\mathbf{x}_j \in C_i} w_{ij} \mathbf{x}_j}{\sum_{\mathbf{x}_j \in C_i} w_{ij}}. \qquad (13)$$
We call the resulting algorithm the C-MDE algorithm, since it estimates the prototype parameters for $C$ clusters in a data set.
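A sketch of the resulting procedure for spherical clusters, again in numpy and with the same caveats as before (illustrative names, Euclidean distance only), might look as follows.

```python
import numpy as np

def c_mde(X, centers, sigma2, max_iter=30):
    """C-MDE sketch: minimum-distance partition (11), then per-cluster
    weight (10), center (13), and scale (12) updates."""
    C = len(centers)
    for _ in range(max_iter):
        # Squared distances to every prototype, then nearest-prototype labels.
        d2 = np.stack([np.sum((X - centers[i]) ** 2, axis=1) for i in range(C)])
        labels = d2.argmin(axis=0)                                    # Eq. (11)
        for i in range(C):
            in_i = labels == i
            if not in_i.any():
                continue                                              # skip an empty cluster
            d2_i = d2[i, in_i]
            w = np.exp(-d2_i / (2.0 * sigma2[i]))                     # Eq. (10)
            centers[i] = (w[:, None] * X[in_i]).sum(axis=0) / w.sum() # Eq. (13)
            sigma2[i] = (w * d2_i ** 2).sum() / (3.0 * (w * d2_i).sum())  # Eq. (12)
    return centers, sigma2, labels
```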
5 Experimental Results

We use three data sets with a variable amount of noise to illustrate the performance of the C-MDE algorithm when it is used for estimating the prototype parameters from a data set containing several components. In all the experiments, the initial prototype parameters are computed by performing 5 iterations of the K-Means algorithm with the Euclidean distance measure, followed by 5 iterations using the distance measure selected to detect the particular shape structure. The scale parameters are all initialized to 100. Fig. 1(a) shows the first data set, which contains two synthetic spherical Gaussian clusters, for which it is appropriate to use the Euclidean distance. The centers found by the C-MDE are marked by crosses in Fig. 1(b). The contours correspond to the distance at which the weight is 0.005; they delimit the data points that can be described as inliers. When random noise is added to the original data set, the results of the C-MDE remain almost identical, as shown in Fig. 1(c). Finally, more noise is added, and the results are almost unchanged, as shown in Fig. 1(d). For the second experiment, we use the data set consisting of three synthetic multivariate Gaussian clusters shown in Fig. 2(a). In this case, one of the clusters is ellipsoidal in shape; hence we use the GK distance measure to estimate the centers and the covariance matrices. The results are shown in Fig. 2(b), where the elliptical contours delimiting the inliers correspond to a Mahalanobis distance of $(3.5)^2$ from the center. These results remain almost unchanged when random noise is added, as displayed in Fig. 2(c), and when even more noise is added, it does not seem to affect the performance of the C-MDE, as seen in Fig. 2(d). Finally, we illustrate the performance of the C-MDE on the real data set shown in Fig. 3(a). This feature space is typical of what is used in an immunophenotyping experiment using flow cytometry. Several measurements are made with a cytometer on a sample of bone marrow extract in order to classify the cell types for the purpose of detecting leukemia. The two features used are the forward and side scatter of light intensities, respectively. As shown in Fig. 3(b), the C-MDE succeeds in identifying the centers and the inlier bounds of the two clusters that would be identified by a human expert. It is important to note that the C-MDE algorithm showed little sensitivity to the initial prototype estimates or scale values, as long as the initial partition is valid.
In this context, a reasonable partition means that all clusters are distinct and no cluster is empty. As far as the initial scale values are concerned, any reasonable values that reflect the average cluster size are acceptable (between 50 and 1000). We finally note that convergence is relatively fast (less than 30 iterations).
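A rough sketch of this initialization protocol (a few K-Means iterations to seed the prototypes, then all scales set to 100 before running C-MDE) is shown below; kmeans_init is our own helper name, and the shape-specific (e.g., GK) refinement iterations are omitted.

```python
import numpy as np

def kmeans_init(X, C, iters=5, seed=0):
    """Crude Euclidean K-Means used only to seed the C-MDE prototypes."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=C, replace=False)].astype(float)
    for _ in range(iters):
        d2 = np.stack([np.sum((X - c) ** 2, axis=1) for c in centers])
        labels = d2.argmin(axis=0)
        for i in range(C):
            if (labels == i).any():
                centers[i] = X[labels == i].mean(axis=0)
    return centers

# Seed with 5 K-Means iterations, initialize every scale to 100, then run C-MDE.
# centers = kmeans_init(X, C=2, iters=5)
# sigma2 = np.full(2, 100.0)
# centers, sigma2, labels = c_mde(X, centers, sigma2)
```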
6 Conclusion

We presented a new, efficient algorithm, the Maximal Density Estimator (MDE), that computes robust estimates of the prototype parameters for a given structure in a data set. Independence from any assumptions about the proportion of noise is achieved through the mathematical optimization of a scale parameter that is related to the inlier bound. The MDE optimizes the density by choosing a scale value such that the volume is as small as possible (first term) while including as many points as possible in the estimation (second term). The algorithm was extended to seek the robust parameter estimates of $C$ clusters simultaneously, resulting in the C-MDE. Through the use of a meaningful objective function that seeks the parameters of the densest region in the data set, the MDE algorithm does not rely on any thresholds, and from a series of experiments with different data sets, we have observed its low sensitivity to initialization. The MDE algorithm is able to delineate clusters of various shapes, as long as the distance measure used is appropriate for the desired structure. When the number of clusters in a data set is unknown, the parameters of all the clusters can be estimated by extracting one cluster at a time from the data set. Work on this application, as well as on variations and applications of the MDE algorithm, is currently being undertaken.
Acknowledgments

This work was partially supported by a grant from the Office of Naval Research (N00014-96-1-0439).
References

[1] E. E. Gustafson and W. C. Kessel. Fuzzy clustering with a fuzzy covariance matrix. In IEEE CDC, pages 761-766, San Diego, California, 1979.
[2] P. W. Holland and R. E. Welsch. Robust regression using iteratively reweighted least-squares. Commun. Statist.-Theor. Meth., A6(9):813-827, 1977.
[3] P. J. Huber. Robust Statistics. John Wiley & Sons, New York, 1981.
[4] J. M. Jolion, P. Meer, and S. Bataouche. Robust clustering with applications in computer vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(8):791-802, Aug. 1991.
[5] P. J. Rousseeuw and A. M. Leroy. Robust Regression and Outlier Detection. John Wiley & Sons, New York, 1987.
[6] C. V. Stewart. MINPRAN: A new robust estimator for computer vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(10):925-938, Oct. 1995.
Figure 1: Two Gaussian clusters. (a) original data set, (b) results of the C-MDE, (c) results of the C-MDE when the data set in (a) is contaminated with random noise, (d) results of the C-MDE when the data set in (a) is contaminated with more noise.
Figure 2: Three Gaussian clusters. (a) original data set, (b) results of the C-MDE, (c) results of the C-MDE when the data set in (a) is contaminated with random noise, (d) results of the C-MDE when the data set in (a) is contaminated with more noise.
Figure 3: Cytometry feature space. (a) original data set, (b) results of the C-MDE.