An Unsupervised Collaborative Learning Method to Refine Classification Hierarchies

Cédric Wemmert, Pierre Gançarski, Jerzy Korczak
Groupe de Recherche en Intelligence Artificielle
Laboratoire des Sciences de l'Image, de l'Informatique et de la Télédétection
Pôle API, 67400 Illkirch - FRANCE
{wemmert, gancars, [email protected]

Abstract

This article deals with the design of a hybrid learning system. This system integrates different kinds of unsupervised learning methods and gives a set of class hierarchies as the result. The classes in these hierarchies are very similar. The method occurrences compare their results and automatically refine them to try to make them converge towards a unique hierarchy that unifies all the results. Thus, the system decreases the importance of the initial choices made when initializing an unsupervised learning (the choice of the method and its parameters) and solves some of the limitations of the methods used, such as an imposed number of classes, a non-hierarchical result, or the size of the hierarchy.

1. Introduction

In many problems, like remote sensing image classification (which is our interest), you do not have any background knowledge about the data and thus you have to extract knowledge from them automatically. There exists a great variety of unsupervised knowledge extraction methods: neural networks, Bayesian methods, conceptual clustering, statistical methods, etc. Previous works in our laboratory on neural networks [9], statistical algorithms [12], or conceptual clustering [11] have shown the efficiency of these methods in general problems and particularly in unsupervised classification. But each has shown some limitations, such as an imposed number of classes, a too deep hierarchical result or a subjective initialization of the clusters, and for one given data set, each method produces a different result. Moreover, the user has many choices to make when launching an unsupervised learning:

- the choice of the family of classification methods (statistical, neural or conceptual methods),
- the choice of the classification method itself,
- and finally the choice of the parameters to initialize the method (number of classes, acuity, ...).

A solution to these problems is to build a meta-classifier with multiple classification methods to remove indeterminacy and to improve their performance. Actually, there exist many techniques to combine different classifiers. In [1], Alpaydin divides combination approaches into two kinds, depending on their architecture:

- multiexpert methods like boosting [16] [4] [15], bagging [3], or stacked generalization [18]: the different classification methods work in parallel and each gives its classification; then the final classification is computed by a separate combiner;
- multistage methods like cascading [2] [8]: the methods work in a serial way; each method is trained or consulted only for the patterns rejected by the previous classifier(s).

But these techniques cannot easily be applied to unsupervised classification methods. That is why we suggest a third kind of combination methodology. The method that we propose [17] integrates several unsupervised classification methods and makes them collaborate during phases of automatic refinement of their results. Thus, we can not only decrease the importance of the initial choices presented above but also solve some of their limitations, by using several kinds of classification methods with different parameters. In this article we first present the structure and the general behavior of this system. Then we show how the results are evaluated and we describe how the methods collaborate. We finally detail preliminary results and give some conclusions and perspectives of our work.

2. System overview

2.1. Structure of the system

The proposed system is hybrid and includes several unsupervised classification methods such as:

- k-means, iterative statistical clustering [6] (already integrated),
- Cobweb, conceptual clustering [7] (already integrated),
- Autoclass, Bayesian method [5],
- Som, neural network classification [10],
- Snob, minimal length search algorithm [14].

Generally, any method composed of the four following operations could be integrated in this system:

- classification of a data set,
- hierarchical splitting of a class,
- fusion of several classes,
- reclassification of the objects of a class.

The first operation corresponds to the original classification technique (without any modification) and the three following ones have to be added if they do not exist. To carry out a learning process, the user must choose among the available methods, knowing that the same method can be used several times, with different initial parameters.

2.2. Learning process

The goal of our system is to propose a single hierarchy which unifies the results generated by the various methods. In a "typical" classifier, the result is refined only according to the knowledge of the learning method used. This system allows the refining to be done according to the knowledge acquired by all the methods (section 4.2). The entire learning process is presented in Figure 1.

[Figure 1. Learning process: Step 1 - data acquisition, Step 2 - initial learning, Step 3 - results evaluation, Step 4 - conflicts resolution, Step 5 - results unification, Step 6 - results output.]

- Step 1: the data are loaded and an initial corpus is created;
- Step 2: each method occurrence carries out an initial learning on the data;
- Step 3: all results are evaluated. If the system estimates that the global result is not acceptable, the specific results from all methods are compared and a list of the existing conflicts between the pairs of results is created; if not, the learning process continues into Step 5;
- Step 4: resolution phase of the conflicts. Several pairs of results attempt to resolve their most significant local conflict according to the heuristic defined in section 4.2; then return to Step 3;
- Step 5: unification of the results (vote, synthesis, etc.);
- Step 6: output of the results.

We can notice that the initial learnings in Step 2 can be done at the same time. Indeed, each method can compute its classification independently of the others, since only the training data are shared. The same holds for the conflict resolutions of Step 4: a method can be involved in only one conflict resolution at a time, so these confrontations can be distributed.

Incoherent result. During each evaluation, the convergence coefficient is calculated for each result (section 3, D7). This coefficient expresses to what extent the result is in agreement with the others. If this coefficient tends towards zero, it means that this result diverges completely from the other results. In this case, the result is removed and the method is reinitiated with new parameters (estimated according to the results of the other method occurrences).

Non-representative data. Due to the great amount of data to process, the system works incrementally. The processing is initially carried out with a limited number of objects. This procedure can lower the quality of the global result if the selected objects are not representative enough of the set of objects to classify. If the system is not able to converge using these objects, new objects are automatically added to the studied corpus (section 4.3).

Local maximum. The system can converge towards a local maximum since it is constantly trying to improve the quality of the result. This problem is resolved when the reports of the resolutions have been taken into account in Step 2 (section 4.3).
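To make the control flow of Figure 1 and the remarks above more concrete, here is a minimal Python sketch of the main loop (Steps 2 to 6). All names, the callable parameters and the stopping rule are illustrative assumptions rather than the authors' implementation; the actual evaluation and conflict-resolution criteria are those defined in sections 3 and 4.

```python
def collaborative_refinement(method_occurrences, corpus,
                             evaluate, resolve_conflicts, unify,
                             max_rounds=100):
    """Sketch of the learning process of Figure 1 (Steps 2 to 6).

    method_occurrences: objects exposing a learn(corpus) method (Step 2).
    evaluate(results) -> (acceptable, conflicts): Step 3 (see section 3).
    resolve_conflicts(results, conflicts) -> new results: Step 4 (see section 4).
    unify(results) -> single hierarchy: Step 5 (vote or synthesis).
    """
    # Step 2: independent initial learnings; only the corpus is shared,
    # so these could run in parallel.
    results = [m.learn(corpus) for m in method_occurrences]

    for _ in range(max_rounds):
        # Step 3: evaluate the global result and list the pairwise conflicts.
        acceptable, conflicts = evaluate(results)
        if acceptable or not conflicts:
            break
        # Step 4: several pairs of results resolve their most significant
        # local conflict, then everything is re-evaluated.
        results = resolve_conflicts(results, conflicts)

    # Steps 5 and 6: unify the refined hierarchies and output everything.
    return unify(results), results
```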

3. Evaluation of the results

The goal of our system is to unify all the results in only one hierarchy. There are however two obvious solutions which are not acceptable:

- a hierarchy restricted to only one class containing all the objects;
- a hierarchy where all the classes are reduced to a singleton.

For this reason we introduce a criterion which estimates the quality (compactness, maximum size, ...) of the result required by the user. Nevertheless, this unification can only be realized in very few cases. Therefore the system produces homogeneous results which respect this quality criterion. The unification is then done with a voting algorithm or a hierarchical synthesis.

3.1. Notations

Let $n$ be the number of objects to classify. Each one is represented by $p$ attributes. Let $\{M^i\}_{i=1 \dots m}$ be the set of learning method occurrences chosen by the user, and $H^i$ be the result of the method $M^i$. Each $H^i$ is a hierarchy of classes; its leaves define a set of $n_i$ classes $\{C^i_k\}_{k=1 \dots n_i}$. Each class $C^i_k$ is a set of $n^i_k$ objects:
$$C^i_k = \{x^i_{k,1}, x^i_{k,2}, \dots, x^i_{k,n^i_k}\}, \quad x^i_{k,l} \in \{x_q,\ 1 \leq q \leq n\}.$$

3.2. Class characterization

Before being able to evaluate the global quality and similarity of all the results, one has to characterize the classes of each result. This can be done according to three different points of view. Thus a class is characterized:

- by comparing it with its hierarchy (internal characterization);
- by comparing it with another hierarchy (local external characterization);
- by comparing it with all the other hierarchies (global external characterization).

3.2.1 Internal characterization

Definition D1 (Quality coefficient). The quality coefficient $\delta^i_k$ of a class $C^i_k$ expresses how class $C^i_k$ is in accordance with the quality criterion chosen by the user, with $0 < \delta^i_k \leq 1$.

Property P1. The closer $\delta^i_k$ is to 1, the more $C^i_k$ is in accordance with the quality criterion chosen by the user.

3.2.2 Local external characterization

The local external characterization of a class of a result compared with the classes of another result is estimated with a distribution coefficient and a local similarity coefficient.

Definition D2 (Confusion vector). The confusion vector $\alpha^{i,j}_k$ of class $C^i_k$ compared to result $H^j$ shows how the objects of class $C^i_k$ are distributed in $H^j$:
$$\alpha^{i,j}_k = \left(\alpha^{i,j}_{k,l}\right)_{l=1,\dots,n_j}, \quad \text{where } \alpha^{i,j}_{k,l} = \frac{|C^i_k \cap C^j_l|}{|C^i_k|}.$$

Definition D3 (Distribution coefficient). The distribution coefficient $\rho^{i,j}_k$ of class $C^i_k$ from result $H^i$ compared to result $H^j$ characterizes the distribution expressed by the confusion vector of class $C^i_k$ compared to result $H^j$:
$$\rho^{i,j}_k = \sum_{l=1}^{n_j} \left(\alpha^{i,j}_{k,l}\right)^2,$$
where $\alpha^{i,j}_k$ is the confusion vector between class $C^i_k$ and the classes from $H^j$.

Property P2. According to D2, $\sum_{l=1}^{n_j} \alpha^{i,j}_{k,l} = 1 \Rightarrow 0 < \rho^{i,j}_k \leq 1$. The closer the distribution coefficient is to 1, the more $C^i_k$ is included in one single class from $H^j$. On the other hand, the closer $\rho^{i,j}_k$ is to 0, the more the objects of $C^i_k$ are scattered among the classes from $H^j$.

Definition D4 (Local similarity coefficient). The local similarity coefficient $\omega^{i,j}_k$ of class $C^i_k$ from result $H^i$ characterizes the similarity between class $C^i_k$ and a class $C^j_{k_m}$ from a result $H^j$:
$$\omega^{i,j}_k = \alpha^{i,j}_{k,k_m} \cdot \alpha^{j,i}_{k_m,k}, \quad \text{where } \alpha^{i,j}_{k,k_m} = \max\{\alpha^{i,j}_{k,l}\}_{l=1 \dots n_j}.$$

Property P3. According to D2 and P2, the closer $\omega^{i,j}_k$ is to 1, the more class $C^i_k$ is similar to one class, called $C^j_{k_m}$, from result $H^j$.

3.2.3 Global external characterization

The global external characterization is evaluated by the global similarity coefficient. In order to calculate this coefficient for a class of a specific result, one needs to compare it to all the other results.

Definition D5 (Global similarity coefficient). The global similarity coefficient $\Gamma^i_k$ estimates if class $C^i_k$ is found by the other method occurrences:
$$\Gamma^i_k = \frac{1}{m-1} \sum_{j=1,\, j \neq i}^{m} \omega^{i,j}_k.$$

Property P4. According to D5 and P3, the closer $\Gamma^i_k$ is to 1, the more it means that $C^i_k$ is a class that was found by all the other method occurrences.

3.3. General agreement coefficient definition

To define the general agreement coefficient, two new criteria are introduced: the rate of local agreement and the convergence coefficient.

Definition D6 (Rate of local agreement). The rate of local agreement $\Gamma^{i,j}$ estimates the similarity between two results $H^i$ and $H^j$, taking into account the internal quality of the classes of these results:
$$\Gamma^{i,j} = \frac{1}{2} \left( \frac{1}{n_i} \sum_{k=1}^{n_i} \left(p_s\,\omega^{i,j}_k + p_q\,\delta^i_k\right) + \frac{1}{n_j} \sum_{k=1}^{n_j} \left(p_s\,\omega^{j,i}_k + p_q\,\delta^j_k\right) \right),$$
where $p_s + p_q = 1$. $p_s$ is the similarity weight and $p_q$ the quality weight. These parameters enable the user to define to what extent he desires the final results to be identical (local similarity coefficient, D4) versus respecting the internal quality criterion (internal characterization, D1).

Property P5. If $p_s$ is increased, the system will tend towards a solution in which the classes of the different results are very similar, taking less account of the internal quality. On the other hand, if $p_q$ is increased, the system will take greater account of the internal quality of the classes, and therefore less account of their similarity.

Definition D7 (Convergence coefficient). The convergence coefficient $\gamma^i$ of result $H^i$ estimates whether or not $H^i$ is evolving in the common direction:
$$\gamma^i = \frac{1}{m-1} \sum_{j=1,\, j \neq i}^{m} \Gamma^{i,j}.$$

Property P6. According to D7 and D6, $0 < \gamma^i \leq 1$. The closer $\gamma^i$ tends towards 1, the more the result $H^i$ is similar to the other ones. On the other hand, if it tends towards 0, it means that $H^i$ is divergent from the other results.

Definition D8 (General agreement coefficient).
$$\Gamma = \frac{1}{m} \sum_{i=1}^{m} \gamma^i.$$

Property P7. According to D8 and P6, $0 < \Gamma \leq 1$. The closer $\Gamma$ is to 1, the more similar the results are; the closer it is to 0, the more they differ.
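As an illustration of definitions D2 to D8, the following Python sketch computes these coefficients for results represented simply by their leaf partitions (each result is a list of sets of object identifiers). The per-class quality coefficients delta (D1) are passed in as user-supplied values, since the quality criterion itself is left to the user; the function names and this flat representation are assumptions made for the example.

```python
def confusion_vector(Ck, Hj):
    """D2: how the objects of class Ck are distributed over the classes of Hj."""
    return [len(Ck & Cl) / len(Ck) for Cl in Hj]

def distribution_coefficient(Ck, Hj):
    """D3: rho, sum of squared confusion components (1 = Ck fits one class of Hj)."""
    return sum(a * a for a in confusion_vector(Ck, Hj))

def local_similarity(Ck, Hj):
    """D4: omega = alpha[k, km] * alpha[km, k], km being the class of Hj
    that receives the largest share of Ck's objects."""
    alpha = confusion_vector(Ck, Hj)
    km = max(range(len(Hj)), key=lambda l: alpha[l])
    alpha_back = len(Hj[km] & Ck) / len(Hj[km])
    return alpha[km] * alpha_back

def global_similarity(k, i, results):
    """D5: mean local similarity of class k of result i against the other results."""
    others = [Hj for j, Hj in enumerate(results) if j != i]
    return sum(local_similarity(results[i][k], Hj) for Hj in others) / len(others)

def local_agreement(Hi, Hj, delta_i, delta_j, ps=0.5, pq=0.5):
    """D6: similarity of two results weighted by class quality (ps + pq = 1)."""
    term_i = sum(ps * local_similarity(C, Hj) + pq * d
                 for C, d in zip(Hi, delta_i)) / len(Hi)
    term_j = sum(ps * local_similarity(C, Hi) + pq * d
                 for C, d in zip(Hj, delta_j)) / len(Hj)
    return (term_i + term_j) / 2

def general_agreement(results, deltas, ps=0.5, pq=0.5):
    """D7 and D8: convergence coefficient of each result and their mean."""
    m = len(results)
    gammas = [
        sum(local_agreement(results[i], results[j], deltas[i], deltas[j], ps, pq)
            for j in range(m) if j != i) / (m - 1)
        for i in range(m)
    ]
    return gammas, sum(gammas) / m

# Hypothetical example with two results over six objects:
H1 = [{0, 1, 2}, {3, 4, 5}]
H2 = [{0, 1}, {2, 3}, {4, 5}]
gammas, Gamma = general_agreement([H1, H2], deltas=[[1.0, 1.0], [1.0, 1.0, 1.0]])
```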

4. Results refinement

The results refinement is divided into three phases:

- the choice of local conflicts to resolve;
- the resolution of chosen local conflicts;
- the taking into account, at the global level, of the local modifications that have been carried out.

4.1. Choice of local conflicts to resolve

Definition D9 (Conflict). There is a conflict between $H^i$ and $H^j$ about class $C^i_k$ if the objects of $C^i_k$ are not in only one class from $H^j$, i.e. if $\rho^{i,j}_k \neq 1$.

Property P8. According to P2, the closer $\rho^{i,j}_k$ is to 0, the more the objects of $C^i_k$ are scattered in the classes of $H^j$. Thus, the closer $\rho^{i,j}_k$ is to 0, the greater the conflict is.

At one moment $t$, the system is in state $E_t$ and compares the results in pairs. It indicates the conflicts that exist between each pair of results and estimates their importance. Resolutions are initiated in decreasing order of the importance of the conflicts, as long as at least two method occurrences are not yet involved in a resolution. Thus $\lfloor m/2 \rfloor$ resolutions are initiated in parallel.
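A possible reading of this scheduling step is sketched below in Python; the dictionary-based interface and the function name are assumptions for the example, not the authors' code. Conflicts are ranked by increasing distribution coefficient (the smaller rho, the more important the conflict, P8), and each result takes part in at most one resolution.

```python
def schedule_resolutions(rho, m):
    """Select the local conflicts to resolve in parallel (section 4.1, sketch).

    rho: dict mapping (i, j, k) -> distribution coefficient of class k of
    result i compared to result j (D3).  A conflict exists when the value
    differs from 1 (D9).  At most floor(m / 2) resolutions are scheduled.
    """
    # Most important conflicts first: ascending rho (P8).
    conflicts = sorted((value, i, j, k) for (i, j, k), value in rho.items()
                       if value < 1.0)
    busy, scheduled = set(), []
    for value, i, j, k in conflicts:
        if i in busy or j in busy:
            continue  # each result may be involved in only one resolution
        scheduled.append((i, j, k))
        busy.update((i, j))
        if len(scheduled) == m // 2:
            break
    return scheduled
```

For instance, with three method occurrences, at most one resolution is started per confrontation step.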

4.2. Resolution of chosen local conflicts

The resolution of a local conflict is similar to a negotiation in which two results, $H^i$ and $H^j$, try to find an agreement on a group of objects that they have classified differently. For this, three operators can be used:

- split: a class is split into several subclasses;
- merge: the objects of several classes are merged into a new class;
- reclassify: a class is eliminated and its objects are reclassified among the remaining classes.

The operator which is used and its parameters (classes to be merged, number of subclasses for the splitting, ...) are chosen according to the representative classes (D10) of the treated class. This choice is done according to the heuristic given by algorithm A1.

Definition D10 (Representative classes). The representative classes of class $C^i_k$ from result $H^i$ compared to result $H^j$ are the set of classes from $H^j$ which have more than $p_{cr}\%$ of their objects included in $C^i_k$ ($p_{cr}$ is given by the user):
$$CR(C^i_k, H^j, p_{cr}) = \left\{ C^j_l : \alpha^{j,i}_{l,k} > p_{cr},\ \forall l \in [1, n_j] \right\}.$$

Algorithm A1 (Choice of operators to apply)
  let $n = |CR(C^i_k, H^j, p_{cr})|$
  if $n > 1$
    $H^{i'} = H^i \setminus \{C^i_k\} \cup \{\mathrm{split}(C^i_k, n)\}$
    $H^{j'} = H^j \setminus CR(C^i_k, H^j, p_{cr}) \cup \{\mathrm{merge}(CR(C^i_k, H^j, p_{cr}))\}$
  else
    $H^{i'} = \mathrm{reclassify}(H^i, C^i_k)$
  endif

The resolution process of a local conflict is given by algorithm A2:

Algorithm A2 (Local conflict resolution)
  while the rate of local agreement increases (D6)
    compute the representative classes of $C^i_k$ compared to $H^j$ (D10)
    choose and execute the operators (A1)
    $(H^I, H^J) = (H^{i_0}, H^{j_0})$ with $\Gamma^{i_0,j_0} = \max\{\Gamma^{i,j}, \Gamma^{i',j}, \Gamma^{i,j'}, \Gamma^{i',j'}\}$
  endwhile
  report writing, containing $(H^I, H^J)$

4.3. Local modifications management

A report is generated for each resolution $d$. This report is composed of the new global state, called $E^d_t$, in which are inserted the two new results $H^I, H^J$ that propose to resolve the conflict. At the completion of all the resolutions, the system analyzes all the reports. Several strategies are examined for the merging of the local modifications.

Either the system seeks among the $(E^d_t)_{d=1,\dots,\lfloor m/2 \rfloor}$ the combination of results which maximizes $\Gamma$. In this case, we have $\Gamma_{t+1} > \Gamma_t$. Nevertheless, this solution has two major problems:

- the system can converge towards a local maximum;
- the number of solutions tested is equal to $2^{(m/2)}$, which quickly becomes too great when the number of method occurrences increases.

Or it combines all the new results and only considers the new general agreement coefficient $\Gamma_{t+1}$, without testing whether an improvement has been achieved or not. In this case, the problem of the local maximum does not arise any longer since no test on $\Gamma_{t+1}$ is carried out. However there is no more convergence control of the global solution.

Or all the new results are combined and $\Gamma_{t+1}$ is calculated. If this new general agreement coefficient is better than the previous one, then the system starts a resolution phase again. If not:

- either new data is added to the corpus;
- or one of the method occurrences is re-initialized if it is considered to be too divergent;
- or the new solution suggested is accepted (although it is worse) but a copy of the best solution, called the best temporary solution, is preserved. In this last case, solutions are accepted as long as they do not exceed a loss of quality of more than $p_g\%$ from the best temporary solution ($p_g$ is a parameter given by the user).

This third strategy, presented by algorithm A3, has been implemented in the system:

Algorithm A3 (New results combination)
  % $p_g$ is given by the user (near 0)
  $E_{t+1}$ = combination of all new results
  $\Gamma_{t+1} = \Gamma(E_{t+1})$
  if $\Gamma_{t+1} > \Gamma_{max}$
    $E_{max} = E_{t+1}$
    $\Gamma_{max} = \Gamma_{t+1}$
  else if unused data exist in the data set
    add data to the studied corpus
  else if $\exists M^i : \gamma^i_{t+1} < p_\gamma$
    re-initialization of $M^i$
  else if $\Gamma_{t+1} < \Gamma_{max} - p_g$
    end of learning: final solution $E_{max}$
  endif
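The representative-classes test (D10) and the operator choice of algorithm A1 can be sketched in Python as follows; the flat set-based representation and the return convention are assumptions made for the example, and the real operators act on the hierarchies themselves.

```python
def representative_classes(Ck, Hj, p_cr):
    """D10: indices of the classes of Hj having more than a fraction p_cr of
    their objects included in Ck (Ck is a set of object ids, Hj a list of sets)."""
    return [l for l, Cl in enumerate(Hj) if len(Cl & Ck) / len(Cl) > p_cr]

def choose_operators(Ck, Hj, p_cr):
    """A1 (sketch): if Ck covers several classes of Hj, split Ck into that many
    subclasses and merge the representative classes of Hj; otherwise propose
    to remove Ck and reclassify its objects, leaving Hj unchanged."""
    reps = representative_classes(Ck, Hj, p_cr)
    if len(reps) > 1:
        return ("split", len(reps)), ("merge", reps)
    return ("reclassify", None), ("keep", None)
```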

5. Results

The presented results have been obtained by a prototype using two dissimilar learning methods: the k-means algorithm and the conceptual clustering algorithm Cobweb.

5.1. Iris Plants Database

The data used in this first experiment are extracted from the machine learning database Iris Plants Database (R. A. Fisher, July 1988). We used only two attributes (width and height of the petals), to better display the results produced by the system. The system was run using three method occurrences:

- $M^1$ - k-means with 3 initial nodes,
- $M^2$ - k-means with 8 initial nodes,
- $M^3$ - Cobweb with 0.28 as acuity factor value.

These parameters were selected according to the data to be classified; the acuity for Cobweb was intentionally high so that the number of found concepts was not too great. The initial results are:

- $H^1$ - 3 classes, $\gamma^1 = 0.54$,
- $H^2$ - 6 classes, $\gamma^2 = 0.60$,
- $H^3$ - 7 classes, $\gamma^3 = 0.58$;

with $\Gamma = 0.57$. After 6 resolution steps, the system finished with the following results:

- $H^1$ - 3 classes, $\gamma^1 = 0.92$,
- $H^2$ - 3 classes, $\gamma^2 = 0.90$,
- $H^3$ - 3 classes, $\gamma^3 = 0.92$;

with $\Gamma = 0.91$. The evolution graph of the general agreement coefficient is given by Figure 3. Figure 4 presents the objects on which there was a conflict before and after the processing. These objects are represented by the black dots on the figure. One may notice that initially 31% of the objects are the subject of a conflict, i.e. the three method occurrences did not classify them in the same manner. After processing, only 5% of the objects remain the subject of a conflict. The system identifies three almost identical classes for the three method occurrences. The local similarity coefficient (D4) highlights the correspondence between the classes of the various results (for example, class C11 from result $H^1$ corresponds to class C12 from result $H^2$). Moreover, with the confusion vector coefficients (D2), the "mixed" objects are immediately detected (Figure 4).

[Figure 2. Initial and final classification of the three methods (petal width versus petal height for $H^1$, $H^2$ and $H^3$).]

[Figure 3. Evolution of the general agreement coefficient over the 6 confrontation steps.]

[Figure 4. Conflicts before and after processing (objects in conflict, petal width versus petal height).]

5.2. Remote sensing images classification

We also obtained some results in remote sensing image classification. We carried out a hybrid classification of a Spot image of an area of Strasbourg. These results were obtained using three method occurrences:

- $M^1$ - k-means with 6 initial nodes (randomly chosen);
- $M^2$ - k-means with 14 initial nodes (randomly chosen);
- $M^3$ - Cobweb with an acuity of 7.5.

After 12 refining phases, the system stopped because $\Gamma$ stopped increasing. The results were the following:

Initial results:
  Method   # classes   $\gamma^i$
  $M^1$    6           0.55
  $M^2$    14          0.53
  $M^3$    23          0.50

Final results:
  Method   # classes   $\gamma^i$
  $M^1$    9           0.62
  $M^2$    9           0.62
  $M^3$    14          0.61

We obtained an improvement of the similarity between the various results, since $\Gamma$ went from 0.53 to 0.62. Moreover, each method occurrence made its result converge towards the common result, since each $\gamma^i$ increased by approximately 10 points. Moreover, the system built a single result representing all the classifications, using a voting method, together with a representation of the mixed pixels (in black) on a map, as shown in Figure 5.
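As a small illustration of the voting unification and of the detection of mixed pixels, here is a hedged Python sketch; it assumes the class labels of the different results have already been matched to a common label space (for instance through the local similarity coefficient, D4), which is a simplification of the actual procedure.

```python
from collections import Counter

def vote_unification(labelings, min_votes=2):
    """Majority vote over the labels assigned to each object by the different
    method occurrences; objects without a majority are flagged as mixed."""
    unified = []
    for labels in zip(*labelings):
        label, count = Counter(labels).most_common(1)[0]
        unified.append(label if count >= min_votes else "mixed")
    return unified

# Hypothetical example with three results on five objects:
h1 = ["A", "A", "B", "B", "C"]
h2 = ["A", "B", "B", "B", "A"]
h3 = ["A", "A", "B", "C", "B"]
print(vote_unification([h1, h2, h3]))  # ['A', 'A', 'B', 'B', 'mixed']
```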

[Figure 5. Unified result and mixed pixels.]

6. Conclusion

The designed system associates many occurrences of learning methods. It allows each of them to produce a hierarchical result that has been refined according to all the method occurrences. Moreover, sets of very similar classes have been obtained. This can be very useful, particularly if one wants to carry out a synthesis of the hierarchies [13] or to combine all the results by a voting method [1]. One can easily integrate new methods, because the methods are integrated without any modification in the system. Moreover, the same method can be used several times in the same processing, by using different initial parameters. The structure of the hierarchies could be modified by using additional knowledge resulting from the structure of the other hierarchies. Current research shows that the addition of supervised knowledge is needed in order to carry out classifications of a higher level (groups of objects). This system should be able to easily integrate this kind of knowledge and to apply it in its reasoning.

References

[1] E. Alpaydin. Techniques for combining multiple learners. In E. Alpaydin, editor, Proceedings of Engineering of Intelligent Systems'98 Conference, volume 2, pages 6-12, 1998.
[2] E. Alpaydin and C. Kaynak. Cascading classifiers. Kybernetika, 34(4):369-374, 1998.
[3] L. Breiman. Bagging predictors. Machine Learning, 24:123-140, 1996.
[4] L. Breiman. Arcing classifiers. The Annals of Statistics, 26(3):801-849, 1998.
[5] P. Cheeseman and J. Stutz. Bayesian classification (AutoClass): theory and results. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, chapter 6, pages 61-83. AAAI Press, 1995.
[6] E. Diday, J. Lemaire, J. Pouget, and F. Testu. Elements d'analyse de donnees. Dunod, 1982.
[7] D. H. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-172, 1987.
[8] J. Gama. Combining classifiers with constructive induction. In C. Nedellec and C. Rouveirol, editors, Proceedings of the 10th European Conference on Machine Learning (ECML-98), volume 1398 of LNAI, pages 178-189, Berlin, Apr. 21-23 1998. Springer.
[9] F. Hammadi-Mesmoudi and J. J. Korczak. An unsupervised neural network classifier and its application in remote sensing. In Proc. International Conference on Image Processing, 1995.
[10] T. Kohonen, K. Makisara, O. Simula, and J. Kangas. Artificial neural networks. In International Conference on Artificial Neural Networks, 1991.
[11] J. Korczak, D. Blamont, and A. Ketterlin. Thematic image segmentation by a concept formation algorithm. In Proc. of the European Symposium on Satellite Remote Sensing, Rome, 1994.
[12] J. Korczak and M. Rymarczyk. Application of classical clustering methods to digital image analysis. Technical report, CRI, 1993.
[13] N. Louis and J. J. Korczak. Synthesis of conceptual hierarchies applied to remote sensing images. In European Symposium on Remote Sensing, Barcelona, Spain, 1998.
[14] J. D. Patrick. Snob: A program for discriminating between classes. Technical report, Dept. of Computer Science, Monash University, Clayton, Victoria, Australia, 1991.
[15] J. R. Quinlan. Boosting first-order learning. In S. Arikawa and A. K. Sharma, editors, Proceedings of the 7th International Workshop on Algorithmic Learning Theory, volume 1160 of LNAI, pages 143-155, Berlin, Oct. 23-25 1996. Springer.
[16] R. E. Schapire. Theoretical views of boosting. In P. Fischer and H. U. Simon, editors, Proceedings of the 4th European Conference on Computational Learning Theory (COLT-99), volume 1572 of LNAI, pages 1-10, Berlin, Mar. 29-31 1999. Springer.
[17] C. Wemmert, P. Gancarski, and J. Korczak. Un système de raffinement non-supervisé d'un ensemble de hiérarchies de classes. In M. Sebag, editor, Actes de la Première Conférence d'Apprentissage, CAP'99, Palaiseau, France, June 1999.
[18] D. H. Wolpert. Stacked generalization. Technical Report LA-UR-90-3460, Complex Systems Group, Theoretical Division, and Center for Non-linear Studies, Los Alamos, NM, 1990.
