Controlling Mixture Component Overlap for Clustering Algorithms

MATHEMATICAL THEORY OF PATTERN RECOGNITION

Controlling Mixture Component Overlap for Clustering Algorithms Evaluation¹

E. Aitnouri*, F. Dubeau**, S. Wang**, and D. Ziou**

*Semiconductor Insights, Inc., 300 Solandt Road, Kanata, Ontario, K2K 2X2, Canada
e-mail: [email protected]
**Department of Mathematics and Computer Science, University of Sherbrooke, Sherbrooke, Québec, J1K 2R1, Canada
e-mail: {dubeau, wang, ziou}@dmi.usherb.ca

Abstract—This paper presents two algorithms for generating artificial mixtures of univariate normal densities with controlled component overlap. Such mixtures are used mainly as test benchmarks for the evaluation of clustering algorithms. Although the algorithms are different, they are based on the same definition. Both use formal methods to guarantee the generation of mixtures that are not totally overlapped, in contrast to algorithms in the literature, which generally control the overlap by ad hoc methods. For this purpose, the overlap is assumed to affect only two adjacent components of the mixture. Three configurations are defined: the "maximum overlap", beyond which two components are considered totally merged; the "minimum overlap", beyond which two components are considered nonoverlapped; and the "rate of overlap", denoted by λ (0 < λ ≤ 1), which specifies a given overlap between the maximum and the minimum. The two algorithms presented in this paper are designed to generate mixtures that realize these overlap configurations. The first algorithm uses the widths of the components to control the overlap, while the second uses their means. The first allows better control over the boundaries of the generated mixture but involves complex relations; it has been used to generate unidimensional mixtures that approximate image histograms. The second involves less complex relations and allows one to control the rate of overlap. Controlling the rate of overlap makes it possible to generate artificial mixtures with a given degree of difficulty: the greater the rate of overlap, the more complex the mixture. Both algorithms allow the generation of mixtures with any number of components.

1. INTRODUCTION

Received November 2, 2001

¹ This paper was submitted by the authors in English.

Due to the ad hoc nature of virtually all clustering methods, it is difficult to obtain analytical proofs of the properties of these procedures. Some researchers have approached the problem within a Monte Carlo framework. An important advantage of this approach is that it is possible to use artificial data with known structures. Indeed, most of the work done in the domain of cluster analysis has advocated the use of constructed (simulated) test data to evaluate clustering algorithms [1–3]. The use of simulated data makes it possible to evaluate a clustering solution using similarity–dissimilarity measures such as the Rand statistic [4] or the Kappa statistic [5], since the true structure of the clusters is available. Thus, an important part of a validation study comes down to how to generate simulated data, and more precisely how to generate simulated data possessing the intrinsic structures needed for evaluating the clustering method. Several definitions of cluster structures have been proposed. One approach uses the concept of an ultrametric space as a basis for the generation of cluster

structures [6, 7]. The authors argue that such structures are essential for evaluating hierarchical clustering routines, since these usually invoke the ultrametric inequality; it is to be expected that such routines should be able to recover this kind of structure. Bromly [8] uses Thurstone's box problem data set. Everitt [3] provides several two-dimensional plots for selected bivariate normal mixtures which possess strong visual appeal. The most popular method is based on the use of mixture models, especially mixtures of normal densities, as generic models for simulated data [9]. These models are suitable since many works adopt a mixture approach as a basis for the design of clustering algorithms [10, 11]. The mixture model satisfies some intuitive definitions of cluster structures. One such definition, offered by Cormack [2], basically involves two criteria: a cluster should exhibit the properties of internal cohesion and external isolation. These properties were defined by Milligan [7] as follows: "External isolation requires that entities in one cluster should be separated from entities in another cluster by fairly empty areas of space. Internal cohesion requires that entities within a same cluster should be similar to each other, at least within the local metric". Internal cohesion is an inherent characteristic of mixture models, since, for a given cluster, the data are generated from the same distribution. External isolation concerns the rate of overlap of components of the mixture. Indeed, the greater the component overlap, the smaller the external isolation. However, Everitt [3] argues that clusters with total isolation are natural clusters and, therefore, not useful in evaluating the ability of clustering algorithms to separate clusters. A number of different schemes for generating artificial mixtures have been proposed. Blashfield [12] and Edelbrock [13] used unconstrained multivariate normal mixtures with fairly complex covariance structures. Milligan and Isaac proved in [7] that the generation processes used in [12] and [13] lead to data with overlapping clusters. Other researchers, such as Kuiper and Fisher [14] and Bayne et al. [15], used multivariate normal clusters with simple covariance structure and directly manipulated variables such as the separation between clusters. In 1985, Milligan [16] developed a nine-step algorithm for generating artificial mixtures, whose main assumption is that the generated data do not overlap in the first dimensional subspace. Verification of this assumption was mainly performed using nonautomatic techniques such as visual inspection. In this paper, we propose two different automatic algorithms for controlling mixture component overlap in the first dimensional subspace. In [17], we introduced the idea of controlling the overlap of two adjacent components of a mixture of univariate normal densities. First, we gave a definition of the concept of "overlap" and then derived an algorithm that automatically controls it. The algorithm, named OLR, was used to generate artificial univariate mixtures to evaluate and compare some clustering methods in image histogram PDF estimation [18]. A short paper on the OLR algorithm has been published [19]. Control of overlap in the OLR algorithm was achieved using the widths (standard deviations) of the components. The idea of using widths is linked to the definition of overlap, which originates in the edge detection domain [20].

Pattern Recognition and Image Analysis, Vol. 12, No. 4, 2002, pp. 331–346. Original Text Copyright © 2002 by Pattern Recognition and Image Analysis. English Translation Copyright © 2002 by MAIK "Nauka /Interperiodica" (Russia).
In this paper, we make use of the basic idea of the OLR algorithm to propose extended definitions of the overlap. Indeed, we will define three different configurations related to component overlap. For this, we suppose that the overlap affects only two adjacent components of the mixture. First, we define a “maximum overlap” beyond which two components are stated to be “totally overlapped”. Second, we define the “minimum overlap” beyond which two components are considered “nonoverlapped”. Finally, we will define the “rate of overlap”, which is an overlap configuration between the maximum and the minimum. After introducing these definitions, we describe two algorithms that control them. The first algorithm uses the widths of components and the second algorithm uses the means of components. The two algorithms are structurally identical. The rest of the paper is organized as follows. The use of mixture models for generating cluster data and the definitions of the concepts of overlap are presented

in Section 2. Section 3 describes the use of widths to control overlap, while Section 4 deals with the use of means to control overlap. Finally, some results and discussion are presented in Section 5.

2. MIXTURE MODELS AS GENERIC MODELS FOR CLUSTER TESTING

Mixture models are used by many researchers as a generic model for clustering test data because they form the basis of many cluster analysis algorithms [4, 12–14, 18]. Here, we focus on mixtures of univariate normal (Gaussian) densities, which are widely used in the literature. A mixture of M univariate normal densities is given by

    p(x, γ) = Σ_{j=1}^{M} κ_j G(x, θ_j),    (1)

with the restrictions κ_j > 0 for j = 1, …, M and Σ_{j=1}^{M} κ_j = 1. Here θ_j = (µ_j, σ_j) denotes the parameters of the jth distribution G_j, respectively the mean µ_j and the standard deviation (or width) σ_j, and G_j is the jth univariate normal density of the mixture, given by

    G(x, θ) = (1/(√(2π) σ)) exp(−(x − µ)²/(2σ²)).

The set γ = (θ_j, κ_j)_{j=1}^{M} totally describes p(x, γ). We will use the notation Γ_i(x) = κ_i G_i(x, θ_i) for the ith component of the mixture, and

    S_σi = (κ_i/(√(2π) σ_i)) e^{−1/2}

for the height of the ith Gaussian where it deviates from its center by σ_i, where σ_i is the width of Γ_i(x).
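These definitions translate directly into code. The following sketch (ours, not the authors'; the parameter triples are ordered (µ, σ, κ)) evaluates the mixture density p(x, γ) of Eq. (1) and the width height S_σ:

```python
import math

SQRT_2PI = math.sqrt(2 * math.pi)

def gaussian(x, mu, sigma):
    """Univariate normal density G(x, θ) with θ = (µ, σ)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (SQRT_2PI * sigma)

def mixture_pdf(x, gamma):
    """p(x, γ) = Σ_j κ_j G(x, θ_j); gamma = [(µ_j, σ_j, κ_j), ...]."""
    return sum(kappa * gaussian(x, mu, sigma) for mu, sigma, kappa in gamma)

def width_height(mu, sigma, kappa):
    """S_σ = κ e^{-1/2} / (√(2π) σ): the component's height one width from its center."""
    return kappa * gaussian(mu + sigma, mu, sigma)
```

Since the κ_j sum to one, p(x, γ) integrates to one, which gives a cheap sanity check on any generated mixture.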

2.1. Related Work

As mentioned above, many researchers are interested in how to generate artificial test data, especially nonoverlapped cluster data. One of the most appealing methods was proposed by Milligan [16]; it provides a clear formalization of the problem. In his nine-step algorithm for generating multivariate mixtures, the second step describes how to generate nonoverlapped univariate normal densities. The method is described by the author as follows. After the number of clusters is set, a random ordering of the clusters is determined. The boundary length of each cluster is selected randomly from a uniform distribution of 10 to 40 units. The selected length is used to specify the standard deviation of the cluster: the length is defined to be three standard deviations for the cluster on the dimension. The midpoint of the length represents the mean. The cluster boundaries are then separated by random quantities f(S_i + S_j),



Fig. 1. An artificial three-component mixture of univariate Gaussians: (a) γ1 = (60, 31, 0.3), γ2 = (140, 32, 0.3), γ3 = (210, 25, 0.4); (b) γ1 = (60, 31, 0.3), γ2 = (140, 22, 0.3), γ3 = (210, 25, 0.4); (c) γ1 = (60, 9, 0.3), γ2 = (140, 7, 0.3), γ3 = (210, 8, 0.4).


Fig. 2. Tabbone’s condition for the presence of false edges: (a) two edges with κ1 = κ2 and 2σ = µd (γ1 = (100, 25, 0.5) and γ2 = (150, 25, 0.5)); (b) two edges with κ1 = κ2 and 2σ < µd (γ1 = (100, 20, 0.5) and γ2 = (150, 20, 0.5)); (c) two edges with κ1 = κ2 and 2σ > µd (γ1 = (100, 31, 0.5) and γ2 = (150, 31, 0.5)).

where S_i is the standard deviation of cluster i and f is a uniform random variable with a range of 0.25 to 0.75. Although the algorithm is simple, it generates only data sets of 2, 3, 4, or 5 distinct nonoverlapped clusters. These are natural clusters with simple structures, and all algorithms evaluated with such test data are likely to recover the true structure of the data without difficulty.

2.2. Why Control the Overlap of Mixture Components?

The problem we face when we want to generate a large set of test mixtures is to ensure that the components of a mixture are not totally overlapped. As an example of what we mean by overlap, suppose we generate a three-component mixture. Due to component overlap, only two components may be visible in the mixture. The example in Fig. 1 illustrates this phenomenon. The mixture in Fig. 1a is actually composed of three components, despite the fact that only two components are visible. The components of the mixture in Fig. 1a are totally overlapped; the components of the mixture in Fig. 1b are partially overlapped; and the components of the mixture in Fig. 1c are nonoverlapped.
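The visibility phenomenon of Fig. 1 is easy to reproduce numerically. The sketch below (our illustration, not part of the paper; we read each caption triple of Fig. 1 as (µ, σ, κ)) counts the local maxima of a mixture on a grid, i.e., the number of components that remain visible:

```python
import math

def mixture_pdf(x, components):
    """Mixture of univariate Gaussians; components = [(mu, sigma, kappa), ...]."""
    return sum(
        kappa * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
        / (math.sqrt(2 * math.pi) * sigma)
        for mu, sigma, kappa in components
    )

def count_modes(components, lo=0.0, hi=300.0, step=0.5):
    """Count strict local maxima of the mixture density on [lo, hi]."""
    n = int((hi - lo) / step)
    ys = [mixture_pdf(lo + i * step, components) for i in range(n + 1)]
    return sum(1 for i in range(1, n) if ys[i - 1] < ys[i] >= ys[i + 1])

# Parameter sets of Figs. 1a and 1c:
fig1a = [(60, 31, 0.3), (140, 32, 0.3), (210, 25, 0.4)]   # heavily overlapped
fig1c = [(60, 9, 0.3), (140, 7, 0.3), (210, 8, 0.4)]      # nonoverlapped
```

For fig1c the mode count recovers all three components near the three means; the much flatter mixture of fig1a is the ambiguous case discussed above.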

It is, however, unfair and biased to evaluate an algorithm using mixtures such as the one in Fig. 1a. For this reason, controlling the overlap of mixture components is important and necessary.

2.3. The Overlap Principle

To control the overlap process, a formal definition of this concept is needed. In this paper, we propose a definition which is, to the best of our knowledge, new. The definition we propose is visually appealing and originates in the edge-detection domain. Tabbone [20] studied the presence of a false edge between the steps of a staircase edge. The problem in [20] is treated as a mixture of two Gaussians with equal widths σ, since an edge is supposed to behave as a one-dimensional Gaussian [21]. For the case where κ1 = κ2, the Gaussians have the same height, Γ(µ1) = Γ(µ2), and it has been proven in [20] that a false edge appears if and only if 2σ < µd, where µd = µ2 − µ1 (µ1 < µ2) is the distance between the means of the two Gaussians (edge positions). Figure 2 shows three configurations involving the presence or absence of false edges. In Fig. 2a, Tabbone's condition is verified (2σ = µd) and there is no minimum (false



Fig. 3. Illustration of overlap: (a) total overlap, (b) maximum overlap, and (c) partial overlap.

edge) between the two components. In Fig. 2b, a false edge exists, since we have 2σ < µd. Figure 2c shows the case where 2σ > µd, leading to the merging of the two edges into one. There exists a close relation between the problem of false edges and the problem of controlling the overlap of two Gaussian components. In order to utilize Tabbone's results, let us introduce a new definition of component overlap. We will suppose that the overlap affects only two adjacent components of a mixture, denoted by Γ1(x) and Γ2(x), with the restrictions σ1 = σ2 = σ and κ1 = κ2. The components have a maximum overlap if and only if σ1 + σ2 = µd; this means that the intersection point x_int of the two components is such that Γ1(x_int) = Γ2(x_int) = S_σ1 = S_σ2. The components are nonoverlapped if and only if σ1 + σ2 < µd; this means that Γ1(x_int) = Γ2(x_int) < S_σ1 = S_σ2. The components are totally overlapped if and only if σ1 + σ2 > µd; this means that Γ1(x_int) = Γ2(x_int) > S_σ1 = S_σ2. These configurations are represented by Figs. 2a, 2b, and 2c, respectively.

For the general case where the parameters of the components are not equal, σ1 ≠ σ2 and κ1 ≠ κ2, we propose a new definition of the overlap of two adjacent components of a mixture as a generalization of Tabbone's condition. The general condition is given by

    Γ1(x_int) ≤ min(S_σ1, S_σ2),    (2)

where x_int is the intersection point of the two components and min(a, b) denotes the minimum of a and b. Figure 3 shows three different configurations related to the condition of Eq. (2). Figure 3a illustrates the case where Γ1(x_int) > min(S_σ1, S_σ2); in this case, the two components are totally merged. This case is not considered here. Figure 3b shows the case where Γ1(x_int) = min(S_σ1, S_σ2), while Fig. 3c illustrates the case where Γ1(x_int) < min(S_σ1, S_σ2). Thus, controlling the overlap amounts to generating component parameters that satisfy the condition of Eq. (2).

2.4. Maximum Overlap

It follows from Eq. (2) and from Tabbone's condition that the maximum overlap is defined as follows:

Definition 1. Two adjacent Gaussian components of a mixture, Γ1(x) and Γ2(x), have a maximum overlap if the condition Γ1(x_int) = min(S_σ1, S_σ2) is satisfied, where x_int is the intersection point of the two components Γ1(x) and Γ2(x).

Figure 4 illustrates the maximum overlap for the two cases, S_σ1 > S_σ2 (referred to as case 1) and S_σ1 < S_σ2 (referred to as case 2). This condition defines a maximum beyond which two components are totally overlapped. The algorithms developed here are designed to generate mixtures with component overlap less than this maximum.

Fig. 4. Illustration of the maximum overlap for the two cases: (a) case 1, S_σ1 > S_σ2; (b) case 2, S_σ1 < S_σ2.

2.5. Minimum Overlap

As was done above for the maximum overlap, in this section we define the minimum overlap of two adjacent components of a mixture. The problem is quite different from that of the maximum, since the separation of two components is possible for a large number of configurations, as shown in Fig. 5. It is, therefore, necessary to define a unique minimum overlap configuration. In this paper, we define the minimum overlap as the intersection of the two mixture components at four standard deviations from the respective means. Thus, the minimum overlap is defined as follows.

Fig. 5. Examples illustrating two possible minimum overlaps.

Definition 2. Two adjacent Gaussian components of a mixture, Γ1(x) and Γ2(x), have a minimum overlap if their intersection point x_int is symmetrically distant by four standard deviations from the centers of both components.

Formally, for µ1 < µ2, the condition for minimum overlap can be summarized as

    x_int = µ1 + 4σ1  and  x_int = µ2 − 4σ2.    (3)
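Condition (2) and Definition 1 can be verified numerically. In the sketch below (ours, not from the paper), the intersection point x_int is located by bisection between the two means, and the pair is classified against min(S_σ1, S_σ2); the test parameters are those of the Fig. 2 captions:

```python
import math

SQRT_2PI = math.sqrt(2 * math.pi)

def comp(x, mu, sigma, kappa):
    """Γ(x) = κ G(x; µ, σ)."""
    return kappa * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (SQRT_2PI * sigma)

def s_height(sigma, kappa):
    """S_σ = κ e^{-1/2} / (√(2π) σ)."""
    return kappa * math.exp(-0.5) / (SQRT_2PI * sigma)

def x_intersection(c1, c2, tol=1e-10):
    """Bisection on [µ1, µ2] for Γ1(x) = Γ2(x); assumes a single sign change there."""
    f = lambda x: comp(x, *c1) - comp(x, *c2)
    lo, hi = c1[0], c2[0]
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if f(lo) * f(mid) <= 0.0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0

def classify_overlap(c1, c2):
    """'total', 'maximum', or 'partial' overlap according to Eq. (2)."""
    h = comp(x_intersection(c1, c2), *c1)
    bound = min(s_height(c1[1], c1[2]), s_height(c2[1], c2[2]))
    if math.isclose(h, bound, rel_tol=1e-6):
        return "maximum"
    return "total" if h > bound else "partial"
```

In the equal-width, equal-weight case this reduces exactly to Tabbone's 2σ versus µd test; for unequal parameters it implements the general condition of Eq. (2).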

2.6. Rate of Overlap

In order to perform quantitative evaluations and comparisons of clustering algorithms, one should evaluate the capacity of these algorithms to separate overlapped clusters. For this purpose, we need to quantify cluster overlap versus the complexity of a cluster. In this paper, we define a "rate of overlap" based on the definition of maximum overlap introduced above. The rate of overlap proposed in this paper is visually very appealing, as shown in Figs. 10 and 14.

Definition 3. We define the rate of overlap, denoted by λ, of two adjacent Gaussian components of a mixture, Γ1(x) and Γ2(x), to be the ratio of the height of the intersection point x_int of the two components to the height of the intersection point of the two components with maximum overlap; i.e.,

    λ = Γ1(x_int)/S_σmax,    (4)

where Γ1(x_int) is the height at which the intersection point of components with overlap λ is located and S_σmax = min(S_σ1, S_σ2) is the height of the intersection point of components with maximum overlap. It follows from Eq. (4) that 0 < λ ≤ 1.

Now, using these definitions, we propose two different algorithms for controlling component overlap. The algorithms are structurally similar, since both begin by randomly generating the first component and then control the parameters of the second component in order to avoid total overlap. However, they differ in the component parameter used for control: the first algorithm uses the width of the component, while the second uses its mean. The following section describes the use of the width to control component overlap.
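Definition 3 can be evaluated directly once x_int is known. A minimal sketch (ours, not the paper's) locates x_int by bisection and returns λ = Γ1(x_int)/S_σmax:

```python
import math

SQRT_2PI = math.sqrt(2 * math.pi)

def comp(x, mu, sigma, kappa):
    """Γ(x) = κ G(x; µ, σ)."""
    return kappa * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (SQRT_2PI * sigma)

def rate_of_overlap(c1, c2, tol=1e-10):
    """λ of Eq. (4) for two adjacent components c = (µ, σ, κ) with µ1 < µ2."""
    f = lambda x: comp(x, *c1) - comp(x, *c2)
    lo, hi = c1[0], c2[0]
    while hi - lo > tol:                  # bisection for Γ1(x_int) = Γ2(x_int)
        mid = (lo + hi) / 2.0
        if f(lo) * f(mid) <= 0.0:
            hi = mid
        else:
            lo = mid
    x_int = (lo + hi) / 2.0
    s_max = min(c[2] * math.exp(-0.5) / (SQRT_2PI * c[1]) for c in (c1, c2))
    return comp(x_int, *c1) / s_max
```

The greater λ, the harder the pair is to separate; λ = 1 is exactly the maximum-overlap configuration of Definition 1.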



Fig. 6. Apparent width: (a) when Γ1(µ1) > Γ2(µ2), (b) when Γ1(µ1) < Γ2(µ2).

3. CONTROLLING OVERLAP USING THE WIDTHS OF COMPONENTS

As mentioned above, we first randomly generate the parameters of the first component: the mean µ1, the width σ1, and the mixing parameter κ1. Then, we randomly generate the mean and the mixing parameter of the second component, µ2 and κ2. Finally, we compute the width of the second component, σ2. We first compute the width that gives the maximum overlap of the components, denoted by σmax, and then move to the computation of the width σmin that gives the minimum overlap. Finally, we introduce the rate of overlap.

3.1. Maximum Overlap

Consider the overlap of two adjacent components Γ1(x) and Γ2(x) with σ1 ≠ σ2 and κ1 ≠ κ2. Let us denote by S_σi = Γi(µi − σi) (i = 1, 2) the height at which the width σi is located for a given component Γi(x), and by x_int the intersection point of the two components. Definition 1 states that, in order to have less than maximum overlap, the intersection point of the two components must be below both width heights S_σ1 and S_σ2. In other words, by Definition 1,

    Γ1(x_int) = Γ2(x_int) ≤ min(S_σ1, S_σ2).    (5)

Definition 4. For a pair of adjacent components Γl(x, µl, σl) and Γh(x, µh, σh), where Γh(µh) > Γl(µl) (S_σl < S_σh), we define the notion of apparent width, denoted by σ̂, as the distance |µh − x|, where x is given so that Γh(x) = Γl(µl − σl).

Since the general case refers to components with unequal parameters, we distinguish two cases, namely S_σ1 > S_σ2 (case 1) and S_σ1 < S_σ2 (case 2). Figure 6 illustrates the principle of the apparent width for both cases. Using Definition 4, the condition of Definition 1 can be described as follows:

    σ̂h + σl < µd,    (6)

where σ̂h is the apparent width of the higher component, σl is the width of the lower component, and µd is the distance between the two means. It is self-evident that, if the components have the same widths and the same heights (σ1 = σ2 and κ1 = κ2), Eq. (6) collapses into the condition of Tabbone [20], since σ̂h = σ = σl. However, proving the general condition is a very complex problem. For each of the above cases, we can compute σ̂h as a function of γ1 and γ2. Then, by introducing the expression for σ̂h into the condition of Definition 1, we can solve the resulting relation in order to obtain the bound σmax. Here, we give the development for case 1, whereas case 2 is developed in Appendix A. We proceed by equating the two components at the height S_σ2 as follows:

    Γ1(µ1 + σ̂1) = Γ2(µ2 + σmax).    (7)

Substituting the resulting expression for σ̂1 into Eq. (6) gives

    A √(ln(B σmax)) + σmax < µd,    (8)

where A and B are given by A = √2 σ1 and B = (κ1/(κ2 σ1)) e^{1/2}. By solving Eq. (8) for σmax, we can obtain a condition on the upper bound of σ2 as a function of the remaining known parameters. It is difficult to obtain an analytical solution for σmax from Eq. (8) because of the nonlinearity of the Gaussian expression. Thus, a simplification of Eq. (8) has been obtained through an approximation of the Gaussian expression.
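When the closed-form route through Eq. (8) is inconvenient, σmax can also be obtained purely numerically. The sketch below (ours, not the paper's OLR implementation) wraps the intersection search in an outer bisection on σ2, assuming the overlap grows monotonically with σ2 on ]0, µd[:

```python
import math

SQRT_2PI = math.sqrt(2 * math.pi)

def comp(x, mu, sigma, kappa):
    """Γ(x) = κ G(x; µ, σ)."""
    return kappa * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (SQRT_2PI * sigma)

def s_height(sigma, kappa):
    """S_σ = κ e^{-1/2} / (√(2π) σ)."""
    return kappa * math.exp(-0.5) / (SQRT_2PI * sigma)

def x_intersection(c1, c2, tol=1e-10):
    """Bisection between the means for Γ1(x) = Γ2(x)."""
    f = lambda x: comp(x, *c1) - comp(x, *c2)
    lo, hi = c1[0], c2[0]
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if f(lo) * f(mid) <= 0.0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0

def sigma_max_numeric(c1, mu2, kappa2, tol=1e-6):
    """Largest σ2 keeping Γ1(x_int) <= min(S_σ1, S_σ2) (Definition 1)."""
    mu1, s1, k1 = c1

    def excess(s2):                    # > 0 once the pair is totally overlapped
        x = x_intersection(c1, (mu2, s2, kappa2))
        return comp(x, *c1) - min(s_height(s1, k1), s_height(s2, kappa2))

    lo, hi = tol, mu2 - mu1            # σ2 searched in ]0, µd[
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if excess(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0
```

For equal weights and equal widths, Tabbone's condition σ1 + σ2 = µd gives the exact answer, which the numerical search reproduces.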


3.1.1. Approximation of the Gaussian

Consider a family of lines ∆l = {∆l1, ∆l2, …, ∆lp}, where ∆li (i = 1, …, p) is the tangent line to the Gaussian at the point x_iσ = µ − iσ. If we approximate the Gaussian by the series of lines ∆li, we obtain a piecewise-linear approximation, denoted by g̃(x) and given by

    g̃(x) = (i/(√(2π) σ²)) e^{−i²/2} (x − µ + ((i² + 1)/i) σ),
    x ∈ [µ − iσ, µ − (i − 1)σ], i ≥ 1.    (9)

The development of this approximation for the right-hand part of the Gaussian is given in Appendix B. A better approximation can be obtained if we change the bounds of each line segment. The limit points of the bounds are the intersections of successive segments: ∆li and ∆l(i+1) intersect at a point x_f(i) = µ − f(i)σ, with f(i) (i < f(i) < i + 1) given by

    f(i) = [(i² + 1) e^{−i²/2} − ((i + 1)² + 1) e^{−(i+1)²/2}] / [i e^{−i²/2} − (i + 1) e^{−(i+1)²/2}].

In this case, the new bounds are given by x ∈ [µ − f(i)σ, µ − f(i − 1)σ], |i| ≥ 1. Figure 7 shows the approximation for p = 3. This approximation is very robust, since it does not depend on any parameter of the approximated Gaussian. The approximation error is given by Eq. (10), and its development is presented in Appendix C:

    S_l(σ, p) = (1/√(2π)) Σ_{i=1}^{p} e^{−i²/2} [f(i − 1) − f(i)] [(f(i) + f(i − 1))/2 − (i² + 1)/i].    (10)

Fig. 7. Approximation of a Gaussian with p = 3 and MSE = 7.48742e–06.

3.1.2. Simplified General Condition

With this new approximation, the apparent width σ̂ can be computed on the tangent lines ∆li (i = 1, …, p), which gives (for case 1)

    σ̂1 = [(i² + 1)/i − (κ2 σ1 e^{(i² − 1)/2})/(i κ1 σmax)] σ1.    (11)

Here, the approximation of the apparent width depends on i. Introducing Eq. (11) into the general condition of Eq. (6), after simplification we obtain

    i σmax² + ((i² + 1) σ1 − i µd) σmax − (κ2/κ1) e^{(i² − 1)/2} σ1² < 0.    (12)

The discriminant δ1 of the quadratic form of Eq. (12) is positive for all i > 0. Thus, a general solution in this case is given by

    σmax = [−((i² + 1) σ1 − i µd) + √δ1] / (2i).    (13)

The worst case with respect to i arises when the intersection point of the two components is close to S_σmax, i.e., when x_int is taken to be on ∆l1. Thus, σmax is the solution of Eq. (13) for i = 1.

Fig. 8. An example of a four-component mixture generated with maximum overlap: γ1 = (92, 22.89, 0.11), γ2 = (125, 9.13, 0.05), γ3 = (158, 20.63, 0.45), γ4 = (190, 9.42, 0.39).

By the same reasoning, we find the general condition for case 2:

    σmax = [(κ2/κ1)(i² + 1) e^{−(i² − 1)/2} σ1 − √δ2] / 2,    (14)

where σ̂2 = [(i² + 1)/i − (κ1 σmax e^{(i² − 1)/2})/(i κ2 σ1)] σmax and δ2 is the discriminant of the quadratic form given by

    σmax² − (κ2/κ1)(i² + 1) e^{−(i² − 1)/2} σ1 σmax + i (µd − σ1)(κ2/κ1) e^{−(i² − 1)/2} σ1 > 0.

Note that δ2 is positive if σ1 > R(i) µd, with R(i) = 4iκ1 / [(i² + 1)² κ2 e^{−(i² − 1)/2} + 4iκ1]. For this case, too, we take x_int to be on ∆l1, and thus σmax is the solution of Eq. (14) for i = 1. The solutions of Eqs. (13) and (14) provide a limit value for σ2. Figure 8 shows a four-component mixture generated with maximum overlap.

3.2. Minimum Overlap

According to Definition 2, the minimum overlap involves the condition

    µ2 − µ1 = 4(σ1 + σ2).    (15)

σmin can be deduced from Eq. (15) and is given by

    σmin = (µd − 4σ1)/4.    (16)

Note that there is a condition on σ1 in Eq. (16): µd should be greater than 4σ1. For a mixture of two components, this condition is satisfied, since σ1 can be randomly generated within the interval ]0, µd/4[. However, for mixtures of more than two components, the condition may not be satisfied. Figure 9 shows an example of a mixture of four components generated with the condition of Eq. (16).

Fig. 9. An example of a mixture of four components generated using minimum overlap.

Let us now describe how to control the rate of overlap.

3.3. Rate of Overlap

σmax and σmin, as presented in the previous sections, can be considered as bounds on the width σ2 of the second component. Actually, σ2 can be picked randomly within this interval in order to ensure that the components are not totally overlapped. However, it is also interesting to control the rate of overlap between the two components, which we examine in this section. Definition 3 gives the rate of overlap of two components of a mixture: given the rate of overlap λ, with 0 < λ ≤ 1, Definition 3 states that

    λ = Γ1(x_int)/S_σmax.

Fig. 10. Rate of overlap principle with the use of width for control.

Figure 10 illustrates the principle of the rate of overlap, using the component width to control it. From the condition of Definition 3, we need to find the width of the second component, σ2, that results in the rate of overlap λ. For this purpose, we first compute the intersection point x_int from the first component and then substitute it into the expression for the second component. Replacing Γ1(x_int) in the condition of Definition 3 by its expression,

    Γ1(x_int) = (κ1/(√(2π) σ1)) e^{−(x_int − µ1)²/(2σ1²)},

after some transformations we find the intersection point x_int, given by

    x_int = µ1 + η1 σ1,    (17)

with η_i = √(−2 ln(λ S_σmax S_σi^{−1} e^{−1/2})) (i = 1, 2), where S_σmax and S_σi are known and given by S_σi = (κ_i/(√(2π) σ_i)) e^{−1/2}. The condition of Definition 3 can also be established

for the second component as follows:

    λ = Γ2(x_int)/S_σmax.

Substituting the expression for x_int given by Eq. (17) into the expression for the second component in the condition of Definition 3, after simplification, we obtain

    η1 σ1 + η2 σ2 = µd.    (18)

It is difficult to obtain an analytic solution for σ2 from Eq. (18) due to the nonlinearity of ηi (i = 1, 2). We tried to reduce the complexity of Eq. (18) by using the approximation of the Gaussian introduced previously. However, we were then faced with another problem, concerning the choice of the segment indices in the approximation. Unlike the computation of σmax, where the intersection is located in the first segment, here we need to compute the segment indices, which is a very difficult task. The complexity of the rate-of-overlap solution, as well as of the minimum overlap, led us to use a parameter other than the width to control the overlap of two Gaussian components; in the next section, we describe the use of the component mean. Nevertheless, generating mixtures with less than the maximum component overlap is an efficient method, which has already been used [18]. The method described in this section, namely, using the width to control component overlap, has the merit of generating bounded mixtures with different numbers of components. Indeed, the algorithm starts by randomly generating the means and mixing parameters of all components of the mixture. This allows us to establish, at the beginning, a given interval within which the means are generated. This algorithm has been successfully used to generate artificial image histograms for the comparison of clustering methods in image histogram PDF estimation [18]; image histograms are generally within the interval [0, 255]. In [18], only the maximum overlap was considered, and mixtures were generated with less than the maximum component overlap. Figure 11 shows an example of a mixture of four components generated using the width to control the maximum overlap. We can see from Fig. 11 that there are different rates of overlap between the adjacent components of the mixture. The algorithm used to generate the mixture in Fig. 11 is given below.

3.4. Algorithm

In this section, we summarize the algorithm for generating mixtures of univariate normal densities with component overlap controlled by the widths of the components.

Step 1. Generate a random value of M (the number of components).

Step 2. For j = 1, …, M, randomly generate µj and κj, with µj < µj+1, κj > 0, and Σ_{j=1}^{M} κj = 1.


PATTERN RECOGNITION AND IMAGE ANALYSIS Vol. 12 No. 4 2002
Fig. 11. A histogram generated using the width to control the maximum overlap.

Step 3. For j = 1, randomly generate σ1 such that σ1 < (µ2 – µ1).

Step 4. For j = 2, …, M, generate the width of the jth component as follows: if σ1 < R(1)µd, compute σj^max according to Eq. (13); else, compute σj^max according to Eq. (14).

Step 5. Randomly generate σj from the interval ]0, σj^max].

The use of the width to control component overlap results in a concise methodology for generating univariate mixtures, even if neither the minimum nor the rate of overlap can be controlled. Indeed, due to the complex expressions involved in the method, only the maximum overlap could be controlled, owing to the approximation of the Gaussian distribution. The resulting algorithm was used in [18] to generate artificial histograms. Using the width should not be abandoned; however, controlling the rate of overlap requires the use of another parameter of the Gaussian. Let us now describe the control of overlap using the component means.

4. CONTROLLING THE OVERLAP USING COMPONENT MEANS

In this section, we describe the control of component overlap using the means. We proceed in the same manner as for control using widths. We suppose that the overlap affects only two adjacent components of the mixture. We randomly generate the widths and the mixing parameters of both components, as well as the mean of the first component. Then, we compute the mean of the second component for a given overlap. We compute the mean resulting in the maximum overlap, denoted by µmax, described by Definition 1; the mean resulting in the minimum overlap, denoted by µmin, as described by


Definition 2; and the mean resulting in the rate of overlap λ, as described by Definition 3.

4.1. Maximum Overlap

The mean µmax resulting in the maximum overlap can be directly computed from the condition of Definition 1. Due to symmetry, we will give the development for case 1, while the solutions for case 2 will be directly deduced at the end of the present section. For case 1, the condition of Definition 1 becomes

Γ1(xint) = Sσ2    (19)

First, we look for the intersection point of the two adjacent components. Replacing Γ1(xint) by its equivalent in Eq. (19) gives

[κ1/(√(2π)σ1)] exp[–(xint – µ1)²/(2σ1²)] = Sσ2    (20)

After simplification, we obtain

xint = µ1 + η1σ1    (21)

where η1 > 1 is given by η1 = [2 log((κ1σ2/(κ2σ1)) e^{1/2})]^{1/2}. The square root and the logarithm exist, since we are developing case 1; thus κ1σ2/(κ2σ1) > 1.

Now, we replace the value of the intersection point in the second component using the relation

Γ2(xint) = Sσ2    (22)

After replacing both sides of Eq. (22) by their equivalents, the condition becomes

[κ2/(√(2π)σ2)] exp[–(xint – µ2)²/(2σ2²)] = [κ2/(√(2π)σ2)] e^{–1/2}    (23)

After some transformations, we obtain

µmax = xint + σ2    (24)

which is a logical result, and xint is given by Eq. (21). Figure 12 shows an example using the conditions of Eqs. (21) and (24). Indeed, Fig. 12 illustrates a four-component mixture, with all adjacent components involving maximum overlap.

Fig. 12. Example illustrating maximum overlap of a four-component mixture with parameters γ1 = (29, 8, 0.325), γ2 = (43.33, 5, 0.275), γ3 = (76.88, 23, 0.225), and γ4 = (122.32, 20, 0.175).

Now, let us define the minimum overlap.

4.2. Minimum Overlap

For the minimum overlap, µmin can easily be obtained from the condition of Definition 2. From Eq. (15), µmin can be obtained as

µmin = µ1 + 4(σ1 + σ2)    (25)

There is no condition on the computation of the minimum mean µmin in Eq. (25), unlike the computation of µmin when we use widths to control the overlap. Figure 13 shows an example of a four-component mixture generated by the condition of Eq. (25).

Fig. 13. Example illustrating minimum overlap of a four-component mixture, using the mean as control.

Now, since the minimum and maximum overlaps are defined, let us define the rate of overlap of two adjacent components of a mixture.

4.3. Rate of Overlap

µmax and µmin, as presented in the previous sections, can be considered as the boundaries for the mean of the second component µ2. Actually, µ2 can be picked randomly within this interval in order to ensure that the components are not totally overlapped. Definition 3 presents the rate of overlap of two components of a mixture. Given the rate of overlap λ such that 0 < λ ≤ 1, Definition 3 states that

Γ1(xint)/Sσmax = λ
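The closed-form boundary solutions translate directly into code. A minimal sketch for case 1 (κ1/σ1 > κ2/σ2), following Eqs. (21), (24), and (25); the function names are ours:

```python
import math

def mu_max(mu1, sigma1, sigma2, kappa1, kappa2):
    """Mean of the second component giving maximum overlap (case 1)."""
    # Eq. (21): intersection point x_int = mu1 + eta1 * sigma1,
    # with eta1 = [2 log((k1*s2/(k2*s1)) * sqrt(e))]^{1/2} = sqrt(2 log(ratio) + 1).
    eta1 = math.sqrt(2.0 * math.log(kappa1 * sigma2 / (kappa2 * sigma1)) + 1.0)
    x_int = mu1 + eta1 * sigma1
    return x_int + sigma2              # Eq. (24)

def mu_min(mu1, sigma1, sigma2):
    """Mean of the second component giving minimum overlap, Eq. (25)."""
    return mu1 + 4.0 * (sigma1 + sigma2)
```

Any µ2 drawn between these two bounds then yields an adjacent pair that is neither totally merged nor non-overlapped.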


Figure 14 illustrates the principle of the rate of overlap, using the component mean to control it. The light dashed line in Fig. 14 represents the second component for maximum overlap. Thus, the intersection point is at height Sσmax. To yield a rate of overlap of value λ, the second component must intersect the first one at a point xint whose height is λSσmax.

Fig. 14. Rate of overlap principle.

Now, to find the mean of the second component, we use the same reasoning as for the maximum overlap. Thus, first, we look for the intersection point xint of the two components, satisfying the condition of Definition 3; i.e.,

Γ1(xint) = λ[κ2/(√(2π)σ2)] e^{–1/2}    (26)

where the right-hand side of Eq. (26) is the height at which the intersection point is located. After simplification, we obtain

xint = µ1 + η1σ1    (27)

where η1 > 1 is given by η1 = [2 log((κ1σ2/(λκ2σ1)) e^{1/2})]^{1/2}. The square root and the logarithm exist, since we are developing case 1; thus κ1σ2/(κ2σ1) > 1.

Now, we replace the value of the intersection point in the second component. The condition becomes

[κ2/(√(2π)σ2)] exp[–(xint – µ2)²/(2σ2²)] = λ[κ2/(√(2π)σ2)] e^{–1/2}    (28)

After some transformation, we obtain

µ2 = xint + η2σ2    (29)

where η2 is given by η2 = (1 – 2 log λ)^{1/2}, and xint is given by Eq. (27). If the rate of overlap λ = 1, this implies that η2 = 1, which indicates the presence of maximum overlap, and Eq. (29) collapses into Eq. (24). However, λ must not vanish in Eq. (29). Thus, when we need a minimum overlap, we use Eq. (25) instead. Equation (25) offers a lower limit for the means in Eq. (29). Due to symmetry, µ2 for case 2 can be easily obtained and is given by

xint = µ1 + η1σ1,    µ2 = xint + η2σ2    (30)

where η1 = (1 – 2 log λ)^{1/2} ≥ 1 and η2 = [2 log((κ2σ1/(λκ1σ2)) e^{1/2})]^{1/2} ≥ 1.


4.4. Algorithm

In this section, we summarize the algorithm for generating mixtures of univariate normal densities with component overlap controlled by using the means of components.

Step 1. Set the algorithm parameters: the number of components M of the mixture and the rate of overlap λ.

Step 2. Randomly generate the standard deviations and the mixing parameters of the components of the mixture. In order to make the generated mixtures bounded, one should define minimum and maximum values of the standard deviations, σMIN and σMAX, respectively, as well as a minimum value for the mixing parameters, κmin. The mixing parameters sum to unity.

Step 3. Randomly generate the mean of the first component µ1; here, µ1 is generated within the interval ]0, 3σ1].

Step 4. For i = 2, …, M, generate the mean of the ith component µi as follows: if λ = 0, compute µi from Eq. (25); else, compute µi from Eq. (29). Here, one should first test which of case 1 and case 2 holds by testing the quantity κ1σ2/(κ2σ1).

In our experiments, we used σMIN = 1, σMAX = Q/(6M), and κmin = 0.01, where Q was set around 100.
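As a sketch, the steps above can be put together as follows. The function name and sampling details (uniform draws for widths and mixing parameters) are our assumptions; the placement of each mean follows Eqs. (25), (27), and (29), with the case-1/case-2 test on κ1σ2/(κ2σ1).

```python
import numpy as np

def generate_mixture(M, lam, Q=100.0, sigma_min=1.0, kappa_min=0.01, rng=None):
    """Sketch of the mean-based generator (Sec. 4.4). Returns (mu, sigma, kappa)."""
    rng = np.random.default_rng(rng)
    sigma_max = Q / (6.0 * M)                      # bound suggested in the paper
    # Step 2: widths and mixing proportions (proportions normalized to sum to one).
    sigma = rng.uniform(sigma_min, sigma_max, size=M)
    kappa = rng.uniform(kappa_min, 1.0, size=M)
    kappa /= kappa.sum()
    # Step 3: mean of the first component in ]0, 3*sigma_1].
    mu = np.empty(M)
    mu[0] = rng.uniform(0.0, 3.0 * sigma[0])
    # Step 4: each subsequent mean is placed for the requested overlap rate.
    for i in range(1, M):
        k1, s1, m1 = kappa[i - 1], sigma[i - 1], mu[i - 1]
        k2, s2 = kappa[i], sigma[i]
        if lam == 0.0:                             # minimum overlap, Eq. (25)
            mu[i] = m1 + 4.0 * (s1 + s2)
            continue
        if k1 * s2 >= k2 * s1:                     # case 1: previous peak is higher
            eta1 = np.sqrt(2.0 * np.log(k1 * s2 / (lam * k2 * s1)) + 1.0)  # Eq. (27)
            eta2 = np.sqrt(1.0 - 2.0 * np.log(lam))                        # Eq. (29)
        else:                                      # case 2: roles of the etas swap
            eta1 = np.sqrt(1.0 - 2.0 * np.log(lam))
            eta2 = np.sqrt(2.0 * np.log(k2 * s1 / (lam * k1 * s2)) + 1.0)
        x_int = m1 + eta1 * s1
        mu[i] = x_int + eta2 * s2
    return mu, sigma, kappa
```

Since η1, η2 > 0 for 0 < λ ≤ 1, the generated means are automatically increasing, as required by the construction.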

4.5. Results and Application to the Case of Multidimensional Mixtures

In this section, we present some results arising from the use of component means to generate univariate mixtures with controlled component overlap.

Fig. 15. Four-component mixture with rate λ = 1: (a) one-dimensional mixture, (b) two-dimensional mixture.

Fig. 16. Four-component mixture with rate λ = 0.75: (a) one-dimensional mixture, (b) two-dimensional mixture.

Fig. 17. Four-component mixture with rate λ = 0.5: (a) one-dimensional mixture, (b) two-dimensional mixture.

In addition to the logical applicability to the case of univariate mixtures, we show in this section that the algorithm described in the previous section can also be used in the case of multivariate mixtures. Indeed, it can actually be used as the second step of Milligan's algorithm for

generating multivariate clusters by using mixtures [16]. The advantage of our algorithm lies in the possibility of controlling the rate of overlap, which is not possible with Milligan's algorithm. Figures 15–19 show different plots of mixtures generated by our algorithm.
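The multivariate use can be sketched as follows: the first coordinate of each point follows a controlled 1-D mixture, while the second coordinate is drawn from a normal whose parameters are shared by all components, as in the equal-second-dimension examples of Figs. 15–19. The function name and the default second-dimension parameters are illustrative assumptions, not values from the paper.

```python
import numpy as np

def sample_2d_clusters(mu, sigma, kappa, n=200, mu2=50.0, sigma2=8.0, rng=None):
    """Draw n points: coordinate 1 follows the controlled 1-D mixture
    (mu, sigma, kappa); coordinate 2 is an independent normal shared by
    all components.  Returns the points and their component labels."""
    rng = np.random.default_rng(rng)
    comp = rng.choice(len(mu), size=n, p=kappa)        # component labels
    x = rng.normal(np.asarray(mu)[comp], np.asarray(sigma)[comp])
    y = rng.normal(mu2, sigma2, size=n)
    return np.column_stack([x, y]), comp
```

Because only the first dimension carries the controlled means, the visual separation of the resulting 2-D clusters reflects the chosen rate of overlap directly.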


Fig. 18. Four-component mixture with rate λ = 0.25: (a) one-dimensional mixture, (b) two-dimensional mixture.

Fig. 19. Four-component mixture with rate λ = 0: (a) one-dimensional mixture, (b) two-dimensional mixture.

Figures 15a, 16a, 17a, 18a, and 19a show four-component mixtures with rates of overlap of λ = 1, λ = 0.75, λ = 0.5, λ = 0.25, and λ = 0, respectively. We see from these figures that the rate of overlap is constant, in contrast to the case of mixtures generated by using widths. Figures 15b, 16b, 17b, 18b, and 19b show plots of two-dimensional clusters of 200 points. The one-dimensional mixtures were taken from the examples of Figs. 15a, 16a, 17a, 18a, and 19a. In order to make the effect of the rate of overlap more apparent, we have set the parameters of the second dimension, especially the means, to be equal. It is also possible to generate mixtures with defined rates of overlap between each pair of components. For an M-component mixture, we can set M – 1 different rates of overlap, so that the ith one defines the rate of overlap of the ith and (i + 1)th components. This differs from the examples with a constant rate of overlap between all pairs of components. A variable rate of overlap can increase our control of the generated mixtures. Figure 20 shows two examples with different rates of overlap.
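The per-pair variant reuses the same placement rule with a different λ for each adjacent pair. A self-contained sketch (function names are ours; the placement follows Eqs. (25)–(30), with the case test on κ1σ2/(κ2σ1)):

```python
import math

def next_mean(mu1, s1, s2, k1, k2, lam):
    """Place the next component's mean for overlap rate lam, Eqs. (25)-(30)."""
    if lam == 0.0:
        return mu1 + 4.0 * (s1 + s2)               # minimum overlap, Eq. (25)
    if k1 * s2 >= k2 * s1:                         # case 1
        eta1 = math.sqrt(2.0 * math.log(k1 * s2 / (lam * k2 * s1)) + 1.0)
        eta2 = math.sqrt(1.0 - 2.0 * math.log(lam))
    else:                                          # case 2 (roles swapped)
        eta1 = math.sqrt(1.0 - 2.0 * math.log(lam))
        eta2 = math.sqrt(2.0 * math.log(k2 * s1 / (lam * k1 * s2)) + 1.0)
    return mu1 + eta1 * s1 + eta2 * s2

def means_with_rates(mu1, sigmas, kappas, rates):
    """M - 1 rates for an M-component mixture: rate i controls the overlap
    between components i and i + 1."""
    mus = [mu1]
    for i, lam in enumerate(rates):
        mus.append(next_mean(mus[-1], sigmas[i], sigmas[i + 1],
                             kappas[i], kappas[i + 1], lam))
    return mus
```

With equal widths and mixing parameters, a smaller λ produces a larger gap between the corresponding pair of means, which is the behavior the variable-rate examples illustrate.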

Fig. 20. Examples with different rates of overlap: (a) λ1 = 0.9, λ2 = 0.5, λ3 = 0.1, λ4 = 0.65; (b) λ1 = 0.07, λ2 = 0.87, λ3 = 0.3, λ4 = 0.7.

5. CONCLUSION

The algorithms presented in this paper propose a formal methodology for generating mixtures of univariate normal densities with controlled component overlap, which, to our knowledge, constitutes an innovative contribution. The general structure of these algorithms allows the generation of mixtures with practically any number of components. The range of applicability of these algorithms is very large. Indeed, along with being applicable to the generation of mixtures to approximate image histograms, the algorithm using the component width to control overlap can also be applied to any problem related to the generation of mixtures of univariate normal densities as simulated data. The advantage of the algorithm using the width is the possibility of a strict control over the bounds of the generated mixture. This helps to generate simulated data as close to real data as possible. However, controlling the rate of overlap with the use of the width is very complex. When means rather than widths are used, the problem formalizations are much simpler. The reason is that the mean of a normal component is in a linear relation with the x axis. This greatly simplifies all of the solutions for the different concepts of overlap defined in this paper, namely, the maximum, the minimum, and the rate of overlap. The figures presented clearly illustrate the effectiveness of this algorithm. Despite the fact that the use of means to control the overlap makes it somewhat complex to control the boundaries of the generated mixtures, this method has much broader applicability than the use of the width, especially given the possibility of controlling the rate of overlap.

APPENDIX A

Let us take the general condition of Eq. (6). Here, we develop the general solution of σ2 for case 2, namely, when Γ1(µ1) < Γ2(µ2). Thus, the apparent width is defined for Γ2(x), and the general condition of Eq. (6) becomes

σ1 + σ̂2 < µd

By equalizing the two components at height Sσ1, we have

Γ1(µ1 + σ1) = Γ2(µ2 + σ̂2)

that is,

[κ1/(√(2π)σ1)] e^{–1/2} = [κ2/(√(2π)σ2)] exp[–σ̂2²/(2σ2²)]

which gives ln(Aσ2) = –σ̂2²/(2σ2²), with A = [κ1/(κ2σ1)] e^{–1/2}. After some simplification, we find the apparent width for σ2 as

σ̂2 = [2σ2² ln(1/(Aσ2))]^{1/2}

Introducing the expression of the apparent width into the general condition of Eq. (6), we obtain

[2σ2² ln(1/(Aσ2))]^{1/2} + σ1 < µd

which constitutes the condition on σ2.

APPENDIX B

Consider a family of lines ∆r = {∆r1, ∆r2, …, ∆rp}, where ∆ri (i = 1, …, p) is a tangent line to the point (xiσ = µ + iσ, G(xiσ)). Let us now approximate the right-hand part of the Gaussian by the series of lines ∆ri (i = 1, …, p) in order to obtain a piecewise linear approximation denoted by g̃r(x). For this purpose, we approximate the part of the Gaussian in the interval x ∈ [µ + iσ, µ + (i + 1)σ] with ∆ri, giving

g̃r(x) = Ax + B

where A and B depend on the parameters of the Gaussian. A simple solution is to begin by using the derivative of the Gaussian to determine A. Then, we use the expression of A to determine B. The derivative of the Gaussian at the tangent point x = µ + iσ is equal to A, namely, dG(x)/dx|_{x = µ + iσ} = A.

After deriving G(x), we have

dG(x)/dx = –[κ/(√(2π)σ³)](x – µ) exp[–(x – µ)²/(2σ²)]

After replacing x by its value at the tangent point, we have

A = –i [κ e^{–i²/2}/(√(2π)σ²)]

Now, by substituting the expression of A in the linear form of G(x) at the tangent point x = µ + iσ, we have

G(µ + iσ) = κe^{–i²/2}/(√(2π)σ) = –i [κe^{–i²/2}/(√(2π)σ²)](µ + iσ) + B

After some manipulations, we obtain

B = i [κe^{–i²/2}/(√(2π)σ²)](µ + ((i² + 1)/i)σ)

Thus, the linear form of the Gaussian on the right-hand side will be given by

g̃r(x) = –i [κe^{–i²/2}/(√(2π)σ²)][x – µ – ((i² + 1)/i)σ],    x ∈ [µ + iσ, µ + (i + 1)σ], i ≥ 1
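The final piecewise form can be evaluated directly in code. In this sketch we take κ = 1; selecting the segment index by ⌊(x − µ)/σ⌋ is our assumption for x ≥ µ + σ, and the function name is ours.

```python
import math

def g_tilde_r(x, mu, sigma, kappa=1.0):
    """Piecewise linear approximation of the Gaussian's right tail (Appendix B),
    valid for x >= mu + sigma; segment i covers [mu + i*sigma, mu + (i+1)*sigma]."""
    i = max(1, int((x - mu) // sigma))             # segment index (our assumption)
    c = i * kappa * math.exp(-0.5 * i * i) / (math.sqrt(2.0 * math.pi) * sigma ** 2)
    return -c * (x - mu - ((i * i + 1.0) / i) * sigma)
```

At each tangent point x = µ + iσ the formula reproduces the Gaussian exactly, and between tangent points it stays close to the true density, which is what makes the approximation useful for the overlap computations.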

APPENDIX C

Approximation Error

It is interesting to compute the approximation error in relation with the width σ and the number of segments p used to approximate the Gaussian. For a normalized Gaussian, the error, denoted by E(σ, p), is given by

E(σ, p) = 1 – ST(σ, p)    (31)

where ST is the surface under the approximation curve. Since the Gaussian is symmetric and the approximation deals with the part of the Gaussian below the width height Sσ, E(σ, p) is given by

E(σ, p) = 1 – 2C – 2Sl(σ, p)    (32)

where Sl(σ, p) is the surface of one of the symmetric parts of the approximation (say, the left-hand part), included in the interval [0, µ – σ], and C is a constant given by C = ∫ from µ – σ to µ of G(x)dx, which can be evaluated using the "erf" function. Let us compute Sl(σ, p), which is given (for the left-hand part of the approximation) by

Sl(σ, p) = ∫ from 0 to µ – σ of g̃(x, p, σ)dx    (33)

with g̃(x, p, σ) given by Eq. (9). The surface Sl(σ, p) depends on the number of segments p. Thus, for a given p, Sl(σ, p) is given by

Sl(σ, p) = Σ_{i=1}^{p} ∫ from µ – f(i)σ to µ – f(i – 1)σ of g̃(x, i, σ)dx    (34)

where f(i) = (i² + 1)/i and f(0) = 1. By integrating g̃(x, p, σ) and replacing the limits of the integral, we have

Sl(σ, p) = (1/√(2π)) Σ_{i=1}^{p} i e^{–i²/2} [f(i – 1) – f(i)] [(f(i) + f(i – 1))/2 – (i² + 1)/i]    (35)

Sl(σ, p) does not depend on the width σ. Thus, the error E(σ, p) in Eq. (32) does not depend on σ, which gives a robust approximation.

Approximation errors in relation with the number of segments p

p    Error
1    4.03 × 10⁻³
2    1.13 × 10⁻³
3    7.48 × 10⁻⁶
4    7.01 × 10⁻⁶
5    6.99 × 10⁻⁶

Note that E(σ, p) decreases as p increases, ceasing to decrease when p ≥ 3. The table
gives the approximation errors in relation with p. The best choice of the number of segments in terms of accuracy of approximation and complexity is p = 3.

REFERENCES

1. Anderberg, M.R., Cluster Analysis for Applications, New York: Academic, 1973.
2. Cormack, R.M., A Review of Classification, J. R. Stat. Soc., Ser. A, 1971, vol. 134, no. 3, pp. 321–367.
3. Everitt, B.S., Cluster Analysis, London: Halstead Press, 1974.
4. Rand, W.M., Objective Criteria for Evaluation of Clustering Methods, J. Am. Stat. Assoc., 1971, vol. 66, no. 336, pp. 846–850.
5. Cohen, J., A Coefficient of Agreement for Nominal Scales, Educat. Psychol. Measur., 1960, vol. XX, no. 1, pp. 37–46.
6. Cunningham, K.M. and Ogilvie, J.C., Evaluation of Hierarchical Grouping Techniques: A Preliminary Study, Computer J., 1972, vol. 15, pp. 209–213.
7. Milligan, G.W., An Examination of the Effect of Six Types of Error Perturbation on Fifteen Clustering Algorithms, Psychometrika, 1980, vol. 45, no. 3, pp. 325–342.
8. Bromley, D.B., Rank Order Cluster Analysis, Brit. J. Math. Stat. Psychol., 1966, vol. 19, pp. 105–123.
9. Windham, M.P. and Cutler, A., Information Ratios for Validating Mixture Analyses, J. Am. Stat. Assoc., 1992, vol. 87, no. 420, pp. 1188–1192.
10. Everitt, B.S. and Hand, D.J., Finite Mixture Distributions, London: Chapman and Hall, 1981.
11. Aitnouri, E.M., Wang, S., Ziou, D., Vaillancourt, J., and Gagnon, L., Estimation of a Multimodal Histogram's pdf Using a Mixture Model, Neur. Parallel Sci. Comput., 1999, vol. 7, no. 1, pp. 103–118.
12. Blashfield, R.K., Mixture Model Test of Cluster Analysis: Accuracy of Four Agglomerative Hierarchical Methods, Psychol. Bull., 1976, vol. 83, no. 3, pp. 377–388.
13. Edelbrock, C., Comparing the Accuracy of Hierarchical Grouping Techniques: The Problem of Classifying Everybody, Multivariate Behav. Res., 1979, vol. 14, no. 4, pp. 367–384.
14. Kuiper, F.K. and Fisher, L., A Monte Carlo Comparison of Six Clustering Procedures, Biometrics, 1975, vol. 31, no. 1, pp. 86–101.
15. Bayne, C.K., Beauchamp, J.J., Begovich, C.L., and Kane, V.E., Monte Carlo Comparisons of Selected Clustering Procedures, Pattern Recognit., 1980, vol. 12, no. 2, pp. 51–62.
16. Milligan, G.W., An Algorithm for Generating Artificial Test Clusters, Psychometrika, 1985, vol. 50, no. 1, pp. 123–127.
17. Aitnouri, E.M., Dubeau, F., Wang, S., and Ziou, D., Generating Test Vectors for Clustering Algorithms, Tech. Rep. no. 236, DMI, Sherbrooke: Univ. of Sherbrooke, 1999.
18. Aitnouri, E.M., Wang, S., and Ziou, D., On Comparison of Clustering Techniques for Histogram PDF Estimation, Pattern Recognit. Image Anal., 2000, vol. 10, no. 2, pp. 206–217.
19. Aitnouri, E.M., Dubeau, F., Wang, S., and Ziou, D., Generating Test Vectors for Clustering Algorithms, Proc. IASTED Int. Conf. on Artificial Intelligence and Soft Computing (CD-ROM), Cancun, Mexico, 2001.
20. Tabbone, S., Détection multi-échelle de contours subpixel et de conjonctions, PhD Thesis, Institut National Polytechnique de Lorraine, France, 1994.
21. Ziou, D., Optimal Thresholding for Image Segmentation, Proc. 2nd African Conf. on Research in Computer Science, Burkina Faso, 1993.

Elmehdi M. Aitnouri. Received his degree in Electrical Engineering from the Algiers University of Technology in 1992 and his PhD degree in Computer Science from the University of Sherbrooke in 2002. In 1992, he joined the CDTA Research Center of Algiers, where he worked until 1997 on the design of DSP ASIC architectures. Since 2002, he has been a postdoctoral research fellow at Semiconductor Insights, Inc., working on scanning electron microscopy imaging. His research interests include image processing and pattern recognition.

François Dubeau. Professor at the Department of Mathematics and Computer Science of the University of Sherbrooke. He taught at the Collège Militaire Royal de St. Jean from 1982 to 1992. He received his PhD degree in Mathematics from the University of Montreal in 1981, his MSci degree in Industrial Engineering in 1973, and his BSci degree in Engineering Physics in 1971 from the École Polytechnique de Montréal. His research interests include applied mathematics, numerical analysis, and digital image processing.

Shengrui Wang. Full professor at the Department of Mathematics and Computer Science of the University of Sherbrooke. Received his PhD degree in Computer Science from the Institut National Polytechnique de Grenoble in 1989. His research interests include neural networks, image processing and understanding, image databases, remote sensing, and expert systems.

Djemel Ziou. Received the BEng degree in Computer Science from the University of Annaba (Algeria) in 1984 and the PhD degree in Computer Science from the Institut National Polytechnique de Lorraine (INPL), France, in 1991. From 1987 to 1993, he served as a lecturer at several universities in France. During the same period, he was a researcher at the Centre de Recherche en Informatique de Nancy (CRIN) and the Institut National de Recherche en Informatique et en Automatique (INRIA) in France. Presently, he is an Associate Professor at the University of Sherbrooke in Canada. His research interests include image processing, computer vision, and pattern recognition.
