Tracking Concept Drifting with Gaussian Mixture Model

Jun Wu†*, Xian-Sheng Hua‡, Bo Zhang†

† State Key Laboratory of Intelligent Technology and System, Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, P.R. China
‡ Microsoft Research Asia, 3F Sigma Center, 49 Zhichun Road, Beijing, 100080, P.R. China

ABSTRACT

This paper mainly addresses the issue of semantic concept drifting in temporal sequences, such as video streams, over an extended period of time. A Gaussian Mixture Model (GMM) is applied to model the distribution of the data under investigation, which are supposed to arrive or be generated in batches over time. The up-to-date classifier, which tracks the drifting concept, is directly built on the outdated models trained from the old labeled data. A set of properties, such as Maximum Lifecycle, Dominant Component, Component Drifting Speed, System Stability, and Updating Speed, is defined to track concept drifting in the learning system and is applied to determine the corresponding parameters for model updating in order to obtain an optimal up-to-date classifier. Experiments on simulated data and real-world data demonstrate that our proposed GMM-based batch learning framework is effective and efficient for dealing with concept drifting.

Keywords: concept drifting, Gaussian Mixture Model, batch learning

1. INTRODUCTION

Over a long period of time in temporal sequences, the underlying data distribution, or the concept that we are trying to learn from the data sequence, is constantly evolving. Often these changes make the models built on old data inconsistent with the new data, so instant updating of the models is required [1]. This problem, known as concept drifting [2], complicates the task of learning concepts from data. An effective learner should be able to track such changes and quickly adapt to them [1]. Modeling concept drifting in data sequences has therefore become an important and challenging task. Klinkenberg et al. [3, 4] propose a new method to recognize and handle concept changes with support vector machines, which maintains an automatically adjusted window on the training data so that the estimated generalization error is minimized. Fan [5] points out that additional old data does not always help produce a more accurate hypothesis than using the most recent data only; it increases the accuracy only in some "lucky" situations. In [6], Fan also demonstrates a random decision-tree ensemble based engine, named StreamMiner, to mine concept drifts in data streams, in which old and new data chunks are systematically selected to compute the optimal model that best fits the changing data stream. Wang et al. [7] propose a general framework for mining drifting concepts in data streams using weighted ensemble classifiers based on their expected classification accuracy on the test data under the time-evolving environment. Though many methods have been proposed to deal with concept drifting, few researchers have so far covered the issue of how to "track" concept drifting from a systematic point of view. This paper addresses this issue in detail. We propose a general framework based on generative models, such as GMM, to track concept drifting in learning problems, in which the evolving process of the drifting concept, as well as a set of properties of the concept drifting system, are investigated. In the system, data are supposed to arrive or be generated in batches. The up-to-date classifier is directly built on the outdated models, incorporating several systematic properties during the model updating procedure.

* Supported by the National Natural Science Foundation of China (No. 60135010), the National Natural Science Foundation of China (No. 60321002), and the Chinese National Key Foundation Research Development Plan (2004CB318108). Part of this work was carried out in cooperation with Microsoft Research Asia.



Our proposed batch learning framework, which tracks the drifting concept in an evolving system, is able to avoid a costly model updating procedure, as learners are only updated incrementally based on a set of systematic properties. Furthermore, it can also be applied to online learning systems. The remainder of this paper is organized as follows. In Section 2, a general batch learning framework is presented. A GMM-based model updating procedure is then detailed in Section 3, and how to track concept drifting is discussed in Section 4. Experiments are presented in Section 5, followed by conclusions and future work in Section 6.

2. BATCH LEARNING FRAMEWORK

As aforementioned, we aim to investigate the problem of concept drifting in temporal sequences as a learning problem. Suppose the feature vector (random variable) for identifying a certain set of specific concepts is denoted by Y = [Y(1), Y(2), …, Y(d)]^T, where d is the dimension of the feature vector, with y = [y(1), y(2), …, y(d)]^T representing one particular outcome of Y. Data are supposed to arrive or be generated over time in batches, as illustrated in Figure 1. The set of training samples in the t-th batch for semantic concept c (1 ≤ c ≤ C) is denoted by y_c^t = {y_{1,c}^t, y_{2,c}^t, …, y_{n(c),c}^t} (n(c) is the number of labeled samples for concept c). Therefore, the whole data set can be formulated as D = {y_c^1, …, y_c^{t-2}, y_c^{t-1}, y_c^t, …}. For simplicity, if there is no confusion, we denote by y_c any batch in D (i.e., the superscript t is omitted).

2.1 Finite Gaussian Mixture Model

Suppose a certain semantic concept c can be described by a finite mixture distribution in the feature space Y, represented by

    f_Y(y_c | Θ_{k,c}) = ∑_{m=1}^{k} α_{m,c} N(μ_{m,c}, Σ_{m,c}),    (1)

where k is the number of mixture components, N(μ_{m,c}, Σ_{m,c}) is a Gaussian component, θ_{m,c} = (μ_{m,c}, Σ_{m,c}) are its mean and covariance matrix, and α_{m,c} are the mixing probabilities (∑_{m=1}^{k} α_{m,c} = 1). Let Θ_{k,c} = {θ_{1,c}, θ_{2,c}, …, θ_{k,c}, α_{1,c}, α_{2,c}, …, α_{k-1,c}} be the parameter set defining a given mixture. A typical EM algorithm iteratively maximizes the likelihood (Maximum Likelihood, ML) to estimate Θ_{k,c} from the training samples y_c as

    Θ̂_{k,c} = arg max_{Θ_{k,c}} L(Θ_{k,c}, y_c).    (2)

To estimate the best k = k(c) for y_c, the Minimum Description Length (MDL) criterion is typically applied. In this paper, a modified EM algorithm, AEM, which is based on the Mixture MDL (MMDL) criterion [8], is adopted to estimate Θ_{k,c} and k for the GMM models from the labeled samples. Denote by k(c) the best k obtained from AEM; the optimal GMM, represented by the optimal parameters, is then Θ_{k(c),c} (for simplicity, the hat "ˆ" on the optimal parameters is omitted).

2.2 A Batch Learning Framework

As mentioned above, the scenario we are investigating is a batch learning problem. When the concept drifts, the corresponding model will change. Therefore, the key issue is how to update the models instantly, which is preliminarily introduced below. The flowchart of our proposed framework is shown in Figure 1. The yellow batches (on the left of the dotted vertical line) indicate the previously labeled data, and the green one (on the right of the dotted vertical line) indicates the incoming under-test data. The up-to-date learner L(c,t) is determined by the two previously trained GMM models and a parameter set P (to be explained later). At time t, only two previously trained GMM models, i.e., M_c^{t-1} and M_c^{t-2}, are available (M_c^{t-1} is trained from y_c^{t-1}, and M_c^{t-2} from y_c^{t-2}).
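For concreteness, each per-batch GMM could be fitted as in the following sketch. This is only an illustration under an assumed toolchain (scikit-learn's GaussianMixture with plain EM and BIC-based model selection); the paper itself uses the AEM/MMDL procedure [8], which is not implemented here.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_batch_gmm(y_batch, k_candidates=range(1, 8), seed=0):
    """Fit a per-batch GMM (e.g., M_c^{t-1} or M_c^{t-2}) to labeled samples.

    Plain EM with BIC-based selection of k, used here as a stand-in for
    the AEM/MMDL procedure [8] adopted in the paper.
    """
    best_gmm, best_bic = None, np.inf
    for k in k_candidates:
        gmm = GaussianMixture(n_components=k, covariance_type='full',
                              random_state=seed).fit(y_batch)
        bic = gmm.bic(y_batch)
        if bic < best_bic:
            best_gmm, best_bic = gmm, bic
    # gmm.weights_, gmm.means_, gmm.covariances_ play the roles of
    # (alpha_{m,c}, mu_{m,c}, Sigma_{m,c}) in Equation (1)
    return best_gmm
```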

To predict concept c in the t-th batch (unlabeled), the learner can be formulated as


    L(c,t) = Ψ[ M_c^{t-2}, M_c^{t-1}, L(c,t-1), P ],    (3)

where Ψ denotes a function family, and P is a set of properties for tracking concept drifting, which will be discussed in detail in Section 4.

Figure 1. The flowchart of our proposed batch learning framework based on GMM: the labeled batches y_c^{t-2} and y_c^{t-1} train the GMMs M_c^{t-2} and M_c^{t-1}, which, together with the previous learner L(c,t-1), determine the up-to-date learner L(c,t) for the unlabeled incoming batch y_c^t.
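The flow of Figure 1 can be summarized by the following schematic loop. It is only a sketch; fit_gmm, psi, and classify are placeholders for the procedures of Section 2.1, Section 3, and Equation (12), respectively, and are not the paper's exact code.

```python
def run_batch_framework(labeled_batches, incoming_batch, fit_gmm, psi, classify):
    """Schematic of Figure 1 / Equation (3); a sketch under assumed helper functions.

    labeled_batches: list over t of dicts {c: y_c^t} of labeled samples.
    fit_gmm(y)                      -> per-batch GMM (Section 2.1).
    psi(M_tm2, M_tm1, L_prev, P_c)  -> up-to-date learner L(c,t) (Section 3).
    classify(y, learners)           -> predicted concept (Equation (12)).
    """
    models, learners, P = {}, {}, {}
    for batch in labeled_batches:
        for c, y_ct in batch.items():
            models.setdefault(c, []).append(fit_gmm(y_ct))
            if len(models[c]) >= 2:
                # L(c,t) = Psi[ M_c^{t-2}, M_c^{t-1}, L(c,t-1), P ]
                learners[c] = psi(models[c][-2], models[c][-1],
                                  learners.get(c), P.setdefault(c, {}))
    # predict concepts for the unlabeled incoming batch
    return [classify(y, learners) for y in incoming_batch]
```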

3. UPDATING GAUSSIAN MIXTURE MODEL

Since concept drifting corresponds to changes in the parameters of generative models, in this section we investigate how to predict the new learner by updating the parameters of the GMM models. It is observed that there are three basic cases of GMM parameter change. The first case is adding, in which some "new" components are added into the newly updated model. In the second case, deleting, some "old" components disappear in the new model. The third, drifting, is that some components tend to drift to a new position in the parametric space. As a result, we decompose the GMM updating procedure into these three cases. As mentioned above, the key issue of our scheme for updating the learners L(c,t) (as in Equation (3)) is to combine the previously trained models M_c^{t-1} and M_c^{t-2}. According to Equation (1), we can formulate these two models as

    M_c^{t-2} = f(y_c^{t-2} | Θ_{k^{t-2}(c),c}),    (4)
    M_c^{t-1} = f(y_c^{t-1} | Θ_{k^{t-1}(c),c}).    (5)

To update the previous predictor L(c,t-1) to L(c,t), we combine the components in M_c^{t-2} with the closest ones in M_c^{t-1}, delete those that no longer appear in M_c^{t-1}, or add one or more new components into M_c^{t-2}, as follows.

Step 1: For each Gaussian component N(μ_{j,c}^{t-1}, Σ_{j,c}^{t-1}) in the GMM Θ_{k^{t-1}(c),c}, find the closest component N(μ_{i,c}^{t-2}, Σ_{i,c}^{t-2}) in Θ_{k^{t-2}(c),c} in terms of the Kullback-Leibler divergence (D_KL) [9], which is defined as

    i = arg min_{1 ≤ m ≤ k^{t-2}(c)} D_KL( N(μ_{j,c}^{t-1}, Σ_{j,c}^{t-1}), N(μ_{m,c}^{t-2}, Σ_{m,c}^{t-2}) ).    (6)

If the minimum D_KL in (6) is larger than a predefined threshold, go to Step 3 (adding new components); otherwise, go to Step 2 (drifting existing components).
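A minimal sketch of Step 1 follows, using the standard closed-form KL divergence between two multivariate Gaussians; the threshold test that routes to Step 2 or Step 3 is left to the caller.

```python
import numpy as np

def gaussian_kl(mu0, cov0, mu1, cov1):
    """Closed-form D_KL( N(mu0, cov0) || N(mu1, cov1) ) for d-dimensional Gaussians."""
    d = mu0.shape[0]
    cov1_inv = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(cov1_inv @ cov0) + diff @ cov1_inv @ diff - d
                  + np.log(np.linalg.det(cov1) / np.linalg.det(cov0)))

def closest_component(mu_j, cov_j, old_means, old_covs):
    """Equation (6): index of the component of Theta_{k^{t-2}(c),c} closest to N(mu_j, cov_j)."""
    kls = [gaussian_kl(mu_j, cov_j, m, c) for m, c in zip(old_means, old_covs)]
    return int(np.argmin(kls)), float(np.min(kls))
```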

Step 2: The Gaussian component N(μ_{i,c}^{t-2}, Σ_{i,c}^{t-2}) in the GMM Θ_{k^{t-2}(c),c} is replaced by N(μ*_{i,c}, Σ*_{i,c}), which is defined by

    (μ*_{i,c}, Σ*_{i,c}) = arg min_{(μ,Σ)} D_KL( (1−α) N(μ_{i,c}^{t-2}, Σ_{i,c}^{t-2}) + α N(μ_{j,c}^{t-1}, Σ_{j,c}^{t-1}), N(μ, Σ) ),    (7)

where α is a parameter reflecting the updating speed. According to [10], (μ*_{i,c}, Σ*_{i,c}) has a closed form as

    μ*_{i,c} = (1−α) μ_{i,c}^{t-2} + α μ_{j,c}^{t-1},    (8)

    Σ*_{i,c} = (1−α) [ Σ_{i,c}^{t-2} + μ_{i,c}^{t-2} (μ_{i,c}^{t-2})^T ] + α [ Σ_{j,c}^{t-1} + μ_{j,c}^{t-1} (μ_{j,c}^{t-1})^T ] − μ*_{i,c} (μ*_{i,c})^T.    (9)

If a component N(μ_{i,c}^{t-2}, Σ_{i,c}^{t-2}) in Θ_{k^{t-2}(c),c} drifts to a new position, add its label i into a label set J^{t-2}(c), which is initialized as an empty set.
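A sketch of the Step 2 update: Equations (8)-(9) are the moment-matched Gaussian for the two-component mixture in Equation (7), so the drifted component can be computed directly.

```python
import numpy as np

def drift_component(mu_old, cov_old, mu_new, cov_new, alpha):
    """Equations (8)-(9): replace N(mu_old, cov_old) by the moment-matched
    Gaussian of (1 - alpha) N(mu_old, cov_old) + alpha N(mu_new, cov_new)."""
    mu_star = (1 - alpha) * mu_old + alpha * mu_new
    second_moment = ((1 - alpha) * (cov_old + np.outer(mu_old, mu_old))
                     + alpha * (cov_new + np.outer(mu_new, mu_new)))
    cov_star = second_moment - np.outer(mu_star, mu_star)
    return mu_star, cov_star
```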

Step 3 (adding new components): N(μ_{j,c}^{t-1}, Σ_{j,c}^{t-1}) is added into Θ_{k^{t-2}(c),c} as a new component, and the updated GMM Θ'_{k^{t-1}(c),c} has the form

    f(y_c^t | Θ'_{k^{t-1}(c),c}) = ∑_{m=1}^{k^{t-2}(c)} (1−β) α_{m,c}^{t-2} N(μ_{m,c}^{t-2}, Σ_{m,c}^{t-2}) + β N(μ_{j,c}^{t-1}, Σ_{j,c}^{t-1}),    (10)

where β is also a parameter controlling the updating speed, and k^{t-1}(c) = k^{t-2}(c) + 1. Finally, add a k^{t-2}(c)-based index number into the label set J^{t-2}(c).
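A sketch of the Step 3 update, Equation (10): the old weights are scaled by (1 − β) and the new component enters with weight β, so the weights still sum to one.

```python
import numpy as np

def add_component(weights, means, covs, mu_new, cov_new, beta):
    """Equation (10): down-weight existing components by (1 - beta) and
    append N(mu_new, cov_new) with weight beta (k grows by one)."""
    new_weights = np.append((1 - beta) * np.asarray(weights, dtype=float), beta)
    return new_weights, list(means) + [mu_new], list(covs) + [cov_new]
```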

Step 4 (deleting outdated components): For 1 ≤ i ≤ k^{t-2}(c), if i ∉ J^{t-2}(c), delete the i-th component N(μ_{i,c}^{t-2}, Σ_{i,c}^{t-2}) in Θ_{k^{t-2}(c),c} by setting the corresponding weight α_{i,c} = 0. Finally, the updated GMM Θ^{new}_{k_t(c),c} has the form

    f(y_c^t | Θ^{new}_{k_t(c),c}) = ∑_{m=1}^{k^{t-2}(c)} α_{m,c}^{old} N(μ_{m,c}^{t-2}, Σ_{m,c}^{t-2}) + ∑_{p ∈ J^{t-2}(c)} α_p^{new} N(μ_{j,c}^{t-1}, Σ_{j,c}^{t-1}),    (11)

where α_{m,c}^{old} are the weights of the remaining components in Θ_{k^{t-2}(c),c}, and α_p^{new} are the weights of the new components added from Θ_{k^{t-1}(c),c}. Therefore, for a sample y_i in the incoming batch, the classification result is determined by

    c(y_i) = arg max_{1 ≤ s ≤ C} f(y_i | Θ^{new}_{k_t(s),s}).    (12)

That is, the sample y_i is classified as concept c(y_i). In this section, we have introduced the detailed procedure for updating the GMM in our batch learning framework. In the next section, we will investigate how to track concept drifting in the batch learning system.
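A sketch of Equations (11)-(12): components deleted in Step 4 carry weight zero and therefore drop out of the mixture density, and a test sample is assigned to the concept whose updated GMM gives the largest density. SciPy is assumed here for the Gaussian pdf.

```python
from scipy.stats import multivariate_normal

def gmm_density(y, weights, means, covs):
    """Mixture density f(y | Theta^{new}) as in Equation (11); deleted
    components (weight 0, Step 4) contribute nothing."""
    return sum(w * multivariate_normal.pdf(y, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs) if w > 0)

def classify(y, concept_models):
    """Equation (12): concept_models maps c -> (weights, means, covs)."""
    return max(concept_models, key=lambda c: gmm_density(y, *concept_models[c]))
```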

4. TRACKING CONCEPT DRIFTING

In a concept drifting system, tracking the evolving process of the concept is beneficial, as it allows us to understand the evolving process of certain concepts in depth, analyze the rules of concept drifting, and investigate how concept drifting influences the learning process. Furthermore, we are able to predict the tendency of concept drifting based on such tracking. When a concept is represented by a generative model, concept drifting can be mapped to the change of parameters in the parametric space. For easier analysis of concept drifting, in this section, based on the three typical updating procedures for GMM mentioned in Section 3, we define some systematic properties of the evolving process of the concept, such as Maximum Lifecycle, Dominant Component, Component Drifting Speed, System Stability, and Updating Speed, as follows.


4.1 Lifecycle and Maximum Lifecycle

As aforementioned, a certain component in a GMM may disappear at some time. As a result, how long it stays in the model is an important property. For component i, the Lifecycle is defined as the period from its first appearance to its "death", i.e.,

    LC_i = T_delete − T_added.    (13)

The "Maximum Lifecycle" (MLC) is the one that has the longest Lifecycle, which demonstrates the importance of a component in the current GMM-based learning system. Figure 2 gives an example of a GMM's evolving process. It illustrates a GMM with four components (labeled C-1, …, C-4 in the figure) at time t = 1. At t = 4, C-4 is deleted and C-5 is added. At time t = 7, C-3 is also deleted, and at this time there are only three components. Note that each column represents a snapshot of the GMM at a certain time, and each block represents one component. Furthermore, the height of each block equals its weight in the GMM. This figure clearly shows that, for this GMM, C-1 and C-2 both have the Maximum Lifecycle.

Figure 2. An example of a GMM's evolving process (GMM component Lifecycle and weight distribution for components C-1 through C-5 over t = 1, …, 8).
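Assuming the update procedure logs when each component is added and deleted (hypothetical bookkeeping, not specified in the paper), Lifecycle and MLC follow directly from Equation (13), as in this sketch.

```python
def lifecycles(added_at, deleted_at, now):
    """Equation (13): LC_i = T_delete - T_added; components still alive are
    measured up to the current batch index `now`.
    added_at / deleted_at: dicts {component_id: batch index}."""
    lc = {i: deleted_at.get(i, now) - t_add for i, t_add in added_at.items()}
    mlc_component = max(lc, key=lc.get)   # the component with Maximum Lifecycle
    return lc, mlc_component
```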

4.2 Dominant Component

Under the formulation of GMM, we can define the significance of a certain component according to its weight. The Dominant Component (DC) is the one that has the largest weight among all, that is,

    l_DC = arg max_{1 ≤ p ≤ k(c)} (α_p).    (14)

As illustrated in Figure 2, when t = 1, component 1 (labeled C-1 in the figure) is the DC, and when t = 7, component 5 is the DC.

4.3 Component Drifting Speed

As mentioned above, a GMM component may gradually drift along the timeline. The Component Drifting Speed (CDS) is the KL divergence [9] between the corresponding components of M_c^{t-1} and M_c^t, that is,

    CDS_i^{t,c} = D_KL[ N(μ_{i,c}^{t-1}, Σ_{i,c}^{t-1}), N(μ_{i,c}^{t}, Σ_{i,c}^{t}) ].    (15)

CDS reflects not only the drifting magnitude of individual components but also the drifting speed of the whole system. Figure 3 illustrates the CDS values of the example used in Sections 4.1 and 4.2.


Figure 3. Component Drifting Speed (CDS) distribution along the timeline (t = 1, …, 8) for components C-1 through C-5 of the example in Figure 2.

4.4 System Stability

The numbers of adding and deleting operations, as well as the magnitudes of the CDSs, reflect the stability of the concept learning system. Suppose the GMM model has been updated H times. We define the System Stability of a concept drifting system as

    S = ∑_{u=1}^{H} S_u,    (16)

where

    S_u = ∑_{j=1}^{k(c)} α_j CDS_j^{u,c} δ(u,j) + γ [ N_u(delete) + N_u(add) ],    (17)

and

    δ(u,j) = 1 if component j is drifting at update u, and δ(u,j) = 0 otherwise.    (18)

N_u(delete) (resp. N_u(add)) is the total number of deleted (resp. newly added) components, and γ is a predefined parameter that balances the two terms. We can also graphically observe the evolving process by plotting the curve S_u, in which the area of the region under the curve represents S.
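A sketch of Equations (16)-(17), assuming the per-update quantities (component weights, CDS values from Equation (15), a drift indicator, and the add/delete counts) are already logged by the update procedure.

```python
def stability_term(weights, cds, drifted, n_deleted, n_added, gamma):
    """Equation (17): S_u = sum_j alpha_j * CDS_j^{u,c} * delta(u,j)
    + gamma * (N_u(delete) + N_u(add))."""
    drift_part = sum(a * s for a, s, d in zip(weights, cds, drifted) if d)
    return drift_part + gamma * (n_deleted + n_added)

def system_stability(s_u_values):
    """Equation (16): S is the sum of S_u over all H updates
    (the area under the S_u curve)."""
    return sum(s_u_values)
```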

4.5 Updating Speed

As mentioned in Section 3, two parameters, α and β, control the model updating speed. Parameter α controls the component drifting speed, while β controls the component adding speed. Though it is difficult to determine the optimal values explicitly, several known factors clearly affect these two parameters. Specifically, DC, CDS, and System Stability influence α (refer to Equation (7)), while the batch sizes, Lifecycle, MLC, and System Stability affect β (refer to Equation (10)). We tune these two parameters based on experiments. Consequently, we are able to define

    P_c = {LC_i, MLC, l_DC, S, S_u, CDS, α, β},    (19)

as a property set describing the evolving process of a certain concept c in the learning system, and P = {P_c | 1 ≤ c ≤ C} represents the whole concept drifting system.
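The property set of Equation (19) is then just bookkeeping over the quantities defined above; a trivial sketch follows. The rule for choosing α and β from these properties is tuned experimentally in the paper and is not encoded here.

```python
def property_set(lc, mlc, l_dc, S, S_u, cds, alpha, beta):
    """Equation (19): P_c = {LC_i, MLC, l_DC, S, S_u, CDS, alpha, beta}."""
    return {'LC': lc, 'MLC': mlc, 'l_DC': l_dc, 'S': S,
            'S_u': S_u, 'CDS': cds, 'alpha': alpha, 'beta': beta}
```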


5. EXPERIMENTS

To evaluate our proposed batch learning framework, we compare it with several related schemes on both simulated data and real-world video data. In total, five schemes (A to E) are compared:

A. Our proposed batch learning framework.
B. Only the first batch is used to train a global model, which is then tested on all other new batches. This is the simplest method.
C. Only the (t-1)-th batch is used as the training data of the predictor for the t-th batch.
D. Compared with C, only the (t-2)-th batch is used.
E. Compared with C, the t-th batch itself is used for training. This is an ideal case, which is expected to achieve the best results.

5.1 Testing on Simulated Data

The simulated data consist of 10 data batches denoted by D(t), 1 ≤ t ≤ 10, and each batch has two classes (concepts), with 6000 samples for each class/concept. The two classes are drawn from two 3-component GMMs, obtained by gradually shifting the centers of the Gaussian components along time t. More precisely, the GMM models are defined by

    α_j = 1/3,  μ_j = [X_{g,t}^j, Y_{g,t}^j]^T,  Σ_j = diag(0.2, 0.2),

where 1 ≤ j ≤ 3 is the component number, g = 1, 2 is the class number, t is the batch index, and

    X_{g,t}^j = 7 + K_g cos(π − tπ/10) + 0.7 cos(jπ/3 − π/6) + δ(j),
    Y_{g,t}^j = K_g sin(π − tπ/10) + 0.7 sin(jπ/3 − π/6) + δ(j),

where K_1 = 4, K_2 = 7, and

    δ(j) = rand(0, 0.05) for j = 0, 1, and δ(j) = rand(0.1, 0.2) for j = 2.

Note that rand(a, b) denotes a randomly generated value between a and b (a < b).
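For reference, the simulated drifting data could be generated as in the sketch below. Note one assumption: the text defines δ(j) for j = 0, 1 and j = 2 while the components are indexed 1 ≤ j ≤ 3, so the mapping of the two δ ranges onto the three components here is an interpretation rather than a certainty.

```python
import numpy as np

def simulated_class_batch(t, g, n=6000, rng=None):
    """Samples of class g (1 or 2) for batch D(t), 1 <= t <= 10 (Section 5.1)."""
    rng = rng or np.random.default_rng()
    K = {1: 4.0, 2: 7.0}[g]
    cov = np.diag([0.2, 0.2])
    means = []
    for j in (1, 2, 3):
        # delta(j): the source lists rand(0, 0.05) for "j = 0, 1" and rand(0.1, 0.2)
        # for "j = 2"; taken literally against components 1..3 (an assumption).
        delta = rng.uniform(0.1, 0.2) if j == 2 else rng.uniform(0.0, 0.05)
        x = 7 + K * np.cos(np.pi - t * np.pi / 10) + 0.7 * np.cos(j * np.pi / 3 - np.pi / 6) + delta
        y = K * np.sin(np.pi - t * np.pi / 10) + 0.7 * np.sin(j * np.pi / 3 - np.pi / 6) + delta
        means.append(np.array([x, y]))
    # alpha_j = 1/3: pick a component uniformly, then sample from it
    labels = rng.integers(0, 3, size=n)
    return np.stack([rng.multivariate_normal(means[k], cov) for k in labels])
```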