Online Adaptive Hierarchical Clustering in a Decision Tree Framework

Journal of Pattern Recognition Research 2 (2011) 201-229 Received March 8, 2011. Accepted May 17, 2011.

Jayanta Basak

basak@netapp.com

NetApp India Private Limited, Advanced Technology Group, Bangalore, India

www.jprr.org

Abstract

Online clustering is an important issue in processing large amounts of data, and a coarse-to-fine grained analysis of the data is also desirable. We address both of these problems in a single framework by designing an online adaptive hierarchical clustering algorithm in a decision tree framework. Our model consists of an online adaptive binary tree and a code formation layer. The adaptive tree hierarchically partitions the data, and the finest-level clusters are represented by the leaf nodes. The code formation layer stores the representative codes of the clusters corresponding to the leaf nodes, and the tree adapts the separating hyperplanes between the clusters at every layer in an online adaptive mode. The membership of a sample in a cluster is decided by the tree, and the tree parameters are guided by the stored codes. As opposed to existing hierarchical clustering techniques, where a local objective function is optimized at every level, we adapt the tree in an online adaptive mode by minimizing a global objective functional. We use the same global objective functional as the fuzzy c-means algorithm (FCM); however, we observe that the effect of the control parameter is different from that in the FCM. In our model the control parameter regulates the size and the number of clusters, whereas in the FCM the number of clusters is always constant (c). We never freeze the adaptation process: for every input sample, the tree allocates it to a certain leaf node and at the same time adapts the tree parameters simultaneously with the adaptation of the stored codes. We validate the effectiveness of our model on several real-life datasets and also show that the model is able to perform unsupervised classification on certain datasets.

Keywords: Decision tree, online adaptive learning, fuzzy c-means

Pattern clustering [1, 2] is a well-studied topic in the pattern recognition literature, where samples are grouped into different clusters based on some self-similarity measure. Online clustering is an important issue when processing large amounts of data, and a coarse-to-fine grained analysis is also desirable. We address both of these problems in a single framework by designing an online adaptive hierarchical clustering algorithm in a decision tree framework. Depending on different criteria such as the data representation, the similarity/dissimilarity measure, interpretation, domain knowledge, and modality (e.g., incremental/batch mode), different clustering algorithms have been developed; a comprehensive discussion can be found in [2]. Methods of clustering can be generically partitioned into two groups, namely partitional and hierarchical. In partitional or flat clustering, the data samples are grouped into several disjoint groups depending on a certain criterion function, and there exists no hierarchy between the clusters. Among the different partitional (flat) clustering algorithms, K-means and fuzzy c-means are widely studied in the literature. The quality of partitional clustering algorithms often depends on the objective functional that is minimized to produce the clustering. Fuzzy c-means or soft c-means [3] (Appendix I) and its variations in general employ a certain global objective functional, and the clusters are described in terms of cluster centers and the membership values of the samples in the different clusters.

© 2011 JPRR. All rights reserved. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or to republish, requires a fee and/or special permission from JPRR.


Hierarchical clustering [1], on the other hand, generates a nested hierarchical grouping of the data samples where the top layers of the hierarchy contain coarse-level groupings with fewer clusters and the lower layers represent finer-level groupings with a larger number of clusters. Hierarchical clustering is usually performed in two different ways, namely divisive and agglomerative. In agglomerative clustering, smaller clusters are grouped into larger clusters in a bottom-up manner [4, 5]. In divisive hierarchical clustering, the set of data samples is iteratively partitioned in a top-down manner to construct finer-level clusterings [6, 7, 8, 9, 10, 11]. Hierarchical self-organizing maps have also been used for divisive clustering, where at each level a growing self-organizing map is generated depending on the data distribution at that layer [11, 9, 8]. Decision trees have been used for divisive hierarchical clustering [12, 13, 14, 15]. As an example, in [12], the construction of an unsupervised decision tree was mapped onto that of a supervised decision tree where samples are artificially injected such that empty regions contain sparse injected samples. The decision tree is then constructed by discriminating between the original samples and the artificially injected samples. At every level of the tree, the number of injected samples is controlled depending on the number of actual data samples available at that node. In [14], the top-down induction mechanism of decision trees is used for clustering; the mechanism is specifically used to build a first-order logical decision tree representation of an inductive logic programming system. In predictive clustering [13], a decision tree structure is induced based on the attribute values such that the separation between the clusters and the homogeneity within clusters in terms of the attribute values are maximized. A value of the target variable is then associated with each cluster represented by a leaf of the tree. In [15], the dataset is iteratively partitioned at each level of the tree by measuring the entropy from the histogram of the data distribution. However, in all these algorithms [12, 13, 14, 15], a decision tree is constructed based only on local splitting criteria at each node, as in the supervised counterpart of decision trees [16, 17, 18]. In such top-down divisive partitioning, if an error is committed at a top layer then it becomes very hard to correct it in the bottom layers. To overcome this situation, efforts have been made to define a global objective function on a decision tree in the context of clustering, and the objective function is optimized to generate the clustering. For example, with a given tree topology, a stochastic approximation of the data distribution through deterministic annealing is performed in [19] to obtain groups of data samples at the leaf nodes of the tree. Essentially, in this approach the cross-entropy measure between the probabilistic cluster assignments in the leaf nodes and those in the parent nodes is minimized. The formulation is then extended to an online learning set-up to handle non-stationary data.

In this paper, we present a method for performing online adaptive hierarchical clustering in a decision tree framework (a preliminary version of this work can be found in [20]). Our model consists of two components, namely an online adaptive decision tree and a code formation layer.
The adaptive tree component represents the clusters in a hierarchical manner and the leaf nodes represent the clusters at the finest granularity. The code formation layer stores the representative codes of the clusters corresponding to the leaf nodes. For the adaptive tree, we use a complete binary tree of a specified depth. We use a global objective function considering the activation of the leaf nodes and the stored codes together to represent the compactness of the clusters. We use the same objective function as used in the fuzzy c-means algorithm (Appendix I) where we replace the cluster centers by the stored codes.



In our model, we do not explicitly compute the cluster centers: the clusters are not represented by their centers but are instead determined by the separating hyperplanes of the adaptive tree. We adapt the tree parameters and the stored codes simultaneously in an online adaptive manner by minimizing the objective functional for each sample separately, considering the samples to be independently selected. We do not perform any estimation of the data distribution; rather, we perform deterministic gradient descent on the objective functional. In this adaptation process, we adapt the parameters of the entire tree for every sample. Therefore, unlike the usual divisive clustering, we do not recursively partition the dataset in a top-down manner. We never freeze the learning, i.e., whenever a new data sample is available, the leaf nodes of the tree get activated and we find the maximally activated leaf node to represent the corresponding cluster for the input sample; at the same time we incrementally adapt the parameters of the tree and the stored codes. We do not decrease the learning rate explicitly depending on the number of iterations; rather, we compute the optimal learning rate for every sample to minimize the objective functional. We also observe that even though we consider a fixed tree topology, the tree structure adapts depending on the dataset and the order of appearance of the input samples. In this framework, we attain an online adaptive hierarchical clustering algorithm in a tree framework which exhibits stability as well as plasticity. Since we do not compute the cluster centers directly from the data (rather, they are determined by the tree parameters), we observe stability in the behavior of the tree. At the same time, the incremental learning rules enable plasticity of the tree. The use of a global objective function relieves the tree of local partitioning rules: the hyperplanes at the intermediate nodes are adjusted based on the global behavior. This enables incremental learning in the tree and also reduces the risk of committing early errors due to local partitioning. The intuition behind reducing the error due to local partitioning is similar to that in the supervised OADT [21], where we performed classification of the Gaussian parity problem, which is not possible in a classical decision tree framework with local partitioning.

1. Description of the Model

The model (Figure 1) consists of two components, namely an online adaptive tree [21, 22] and a code formation layer. We have used the online adaptive tree before for pattern classification and function approximation tasks in the supervised mode [21, 22]. Here we employ the online adaptive tree in the unsupervised mode for adaptive hierarchical clustering. The adaptive tree is a complete binary tree of a certain specified depth l. Each non-terminal node i has a local decision hyperplane characterized by a vector (w_i, θ_i) with ||w_i|| = 1, such that (w_i, θ_i) together has n free variables for an n-dimensional input space. Each non-terminal node of the adaptive tree receives the same input pattern x = [x_1, x_2, ..., x_n] and the activation from its parent node. The root node receives only the input pattern x. Each non-terminal node produces two outputs, one for the left child and the other for the right child. We number the nodes in breadth-first order such that the left child of node i is indexed as 2i and the right one as 2i + 1. The activations of the two child nodes of a non-terminal node i are given as

$$u_{2i} = u_i \, f(w_i^T x + \theta_i), \qquad u_{2i+1} = u_i \, f(-(w_i^T x + \theta_i)) \tag{1}$$

where u_i is the activation of node i, and f(·) represents the local soft partitioning function governed by the parameter vector (w_i, θ_i). We choose f(·) as a sigmoidal function.


Fig. 1: Structure of the model. It has two components, namely an adaptive tree and a code formation layer. Every intermediate node and the root node of the adaptive tree receive the input pattern x. The code formation layer also receives the input pattern x as its input.


The activation of the root node, indexed by 1, is always set to unity. The activation of a leaf node p can therefore be derived as

$$u_p = \prod_{i \in P_p} f\big(lr(i, p)(w_i \cdot x + \theta_i)\big) \tag{2}$$

where P_p is the path from the root node to the leaf node p, and lr(·) is an indicator function indicating whether the leaf node p is on the left or the right path of node i, such that

$$lr(i, p) = \begin{cases} 1 & \text{if } p \text{ is on the left path of } i \\ -1 & \text{if } p \text{ is on the right path of } i \\ 0 & \text{otherwise} \end{cases} \tag{3}$$

Considering the property of the sigmoidal function f(·) that f(−v) = 1 − f(v), we observe that the sum of the activations of two siblings is always equal to that of their parent, irrespective of the level of the nodes (since every intermediate node has a sigmoidal function to produce the activations of its left and right child nodes). Since the sum of two sibling leaf nodes is equal to their parent, the sum of the activations of two sibling parent nodes is equal to the parent at the next higher layer, and so on. We extend this reasoning to obtain (Appendix II)

$$\sum_{j \in \Omega} u_j = u_1 \tag{4}$$

where Ω represents the set of leaf nodes, and u_1 is the activation of the root node. As mentioned before, we always set u_1 = 1, and therefore

$$\sum_{j \in \Omega} u_j = 1 \tag{5}$$

We have chosen f(·) as

$$f(v) = \frac{1}{1 + \exp(-mv)} \tag{6}$$

where m determines the steepness of the function f(·). In the appendix (Appendix III), we show that

$$m \ge \frac{1}{\delta} \log(l/\epsilon) \tag{7}$$

where δ is a margin between a point (pattern) and the closest decision hyperplane, and ε is a constant such that the minimum activation of a maximally activated node should be no less than 1 − ε. For example, if we choose ε = 0.1 and δ = 0.1 then m ≥ 10 log(10l).

The code formation layer stores the codes representative of the clusters corresponding to the leaf nodes. The codes are equivalent to the cluster centers used in the fuzzy c-means algorithm; however, we do not compute the codes by taking the mean of the membership values. Corresponding to each leaf node j, there is a code β_j = {β_{j1}, β_{j2}, ..., β_{jn}} stored in the code formation layer. Thus for N leaf nodes, corresponding to at most N clusters, we have N code vectors β_1, β_2, ..., β_N, where β_N = {β_{N1}, β_{N2}, ..., β_{Nn}}, as shown in Figure 1. The code formation layer receives the same input x and measures the deviation of the input x from the stored code β_j for an activated leaf node j, i.e., ||x − β_j||². If the leaf node activation is high and the deviation is also large then the error (the value of the objective function) is large, and if either the deviation or the leaf node activation is small then the error should be small. Therefore, at each leaf node j, we have an error ||x − β_j||² u_j^{1/α}, where u_j is the activation of leaf node j and α is a control parameter. We iterate to minimize the value of an objective function which measures the total error over all leaf nodes, by adapting the stored codes β and the stored parameter values (w, θ). In this model, the maximum number of possible clusters is N = 2^l, where l denotes the depth of the adaptive tree.
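To make the activation propagation concrete, the following is a minimal sketch of Equations (1)-(6); it is not the author's implementation (the experiments were run in MATLAB), and the function and variable names are illustrative. It builds a complete binary tree with breadth-first node numbering and verifies that the leaf activations sum to one (Equation (5)).

```python
import numpy as np

def sigmoid(v, m):
    """Soft partitioning function of Equation (6) with steepness m."""
    return 1.0 / (1.0 + np.exp(-m * v))

def leaf_activations(x, W, theta, m):
    """Leaf activations u_j of Equations (1)-(2).

    W[i], theta[i] hold the hyperplane (w_i, theta_i) of internal node i,
    with nodes numbered 1..n_internal in breadth-first order (row/entry 0 is
    unused); the children of node i are 2i and 2i + 1, and the leaves are
    nodes n_internal + 1 .. 2*n_internal + 1.
    """
    n_internal = W.shape[0] - 1
    u = np.zeros(2 * n_internal + 2)          # 1-based node activations
    u[1] = 1.0                                # root activation is always unity
    for i in range(1, n_internal + 1):
        v = W[i] @ x + theta[i]
        u[2 * i] = u[i] * sigmoid(v, m)       # left child,  Eq. (1)
        u[2 * i + 1] = u[i] * sigmoid(-v, m)  # right child, Eq. (1)
    return u[n_internal + 1:]                 # leaf activations

# Example: a depth-2 tree (3 internal nodes, 4 leaves) in a 2-D input space.
rng = np.random.default_rng(0)
depth, n = 2, 2
n_internal = 2 ** depth - 1
W = rng.standard_normal((n_internal + 1, n))
W[1:] /= np.linalg.norm(W[1:], axis=1, keepdims=True)   # enforce ||w_i|| = 1
theta = np.zeros(n_internal + 1)
u = leaf_activations(np.array([0.3, -0.7]), W, theta, m=5 * np.log(10 * depth))
print(u, u.sum())    # the leaf activations sum to 1, Equation (5)
```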

2. Learning Rules

We define a global objective functional E by considering the discrepancy between the input patterns and the existing codes in β, together with the activation of the leaf nodes:

$$E = \sum_x E(x) = \frac{1}{2} \sum_x \sum_{j \in \Omega} \|x - \beta_j\|^2 \, u_j^{1/\alpha}(x) \tag{8}$$

where α is a constant which acts as a control parameter. The objective functional is identical to that in the fuzzy c-means [3]. In fuzzy c-means (Appendix I), u represents the membership value of a sample in a specific cluster, and β represents the cluster centers. The parameter α controls the fuzzification, or the spread of the membership values. In the limiting condition α = 1, fuzzy c-means becomes hard c-means (Appendix I). The number of clusters in fuzzy c-means is constant, as defined by the parameter c. The membership values and the cluster centers are updated in an iterative two-step EM-type algorithm, where the membership values are computed based on the cluster centers in the first step and the cluster centers are updated using the membership values in the second step of each iteration. The stored codes β in our model act as the reference points of the clusters, analogous to the cluster centers in the fuzzy c-means [3]. However, in our model the clusters are not explicitly defined by the stored codes; rather, we obtain the clusters from the separating hyperplanes of the adaptive tree, and we learn both β and the separating hyperplanes together, unlike the fuzzy c-means [3]. Moreover, the parameter α in our model controls the number and size of the clusters, unlike the EM-based fuzzy c-means algorithm where the number of clusters is constant. In the online mode, we compute the objective functional for each individual sample x as

$$E(x) = \frac{1}{2} \sum_{j \in \Omega} \|x - \beta_j\|^2 \, u_j^{1/\alpha} \tag{9}$$

considering that the samples are independently selected. We then adapt the parameters of the model to minimize E(x) for each individual sample in the online mode. For a given α and an observed sample x, we minimize E(x) by steepest gradient descent such that

$$\Delta\beta_j = -\eta_{opt} \frac{\partial E}{\partial \beta_j}, \qquad \Delta w_i = -\eta_{opt} \frac{\partial E}{\partial w_i}, \qquad \Delta\theta_i = -\eta_{opt} \frac{\partial E}{\partial \theta_i} \tag{10}$$

where η_opt is a certain optimal learning rate parameter decided in the online mode. Evaluating the partials in Equation (10), we obtain

$$\Delta\beta_j = \eta_{opt}\, u_j^{1/\alpha} (x - \beta_j), \qquad \Delta w_i = -\eta_{opt}\, q_i x, \qquad \Delta\theta_i = -\eta_{opt}\, \lambda q_i \tag{11}$$

where q_i is expressed as

$$q_i = \frac{m}{2\alpha} \sum_{j \in \Omega} \|x - \beta_j\|^2 \, u_j^{1/\alpha} (1 - v_{ij})\, lr(i, j) \tag{12}$$

Each nonterminal node i locally computes v_{ij}, which is given as

$$v_{ij} = f\big((w_i^T x + \theta_i)\, lr(i, j)\big) \tag{13}$$

Since (w, θ) at each nonterminal node together has n free variables (n is the input dimensionality) subject to the condition that ||w|| = 1, the update of w in Equation (11) is restricted by the normalization condition. The parameter w is updated in such a way that ∆w lies in the hyperplane normal to w, which ensures that w · ∆w = 0, i.e., the constraint ||w|| = 1 is satisfied for small changes after updating. Therefore, we take the projection of q_i x onto the normal hyperplane of w_i and obtain ∆w_i as

$$\Delta w_i = -\eta_{opt}\, q_i (I - w_i w_i^T) x \tag{14}$$

where I is the identity matrix. In Equation (11), the parameter λ represents the scaling of the learning rate with respect to θ. We normalize the input variables such that x is bounded in [−1, +1]^n, and we select λ = 1. Note that if the input is not normalized then a different value of λ can be selected. Since we never freeze the adaptation process, we decide the learning rate adaptively based on the discrepancy measure E(x) for a sample x in the online mode. For every input sample x at every iteration, we select η_opt using the line search method (also used by Friedman in [23]) such that the change of E to E + ∆E is linearly extrapolated to zero. Equating the first-order approximation of ∆E (line search) to −E, we obtain

$$\eta_{opt} = \frac{E}{\sum_{j \in \Omega} u_j^{2/\alpha} + \sum_{i \in N} q_i^2 \big(\lambda + \|x\|^2 - (w_i^T x)^2\big)} \tag{15}$$

In Equation (15), we observe that for very compact or point clusters, η can become very large near the optima. In order to stabilize the behavior of the adaptive tree, we modify the learning rate as

$$\eta_{opt} = \frac{E}{\gamma + \sum_{j \in \Omega} u_j^{2/\alpha} + \sum_{i \in N} q_i^2 \big(\lambda + \|x\|^2 - (w_i^T x)^2\big)} \tag{16}$$

with a constant γ. The parameter γ acts as a regularizer in obtaining the learning rate. In our model, we chose γ = 1.

From the learning rules, we observe that the codes represented by β are updated such that they approximate the input pattern subject to the activation of the leaf nodes. The maximally activated leaf node attracts the corresponding β most strongly towards the input pattern. If there is a close match between the input pattern x and the stored code β corresponding to the maximally activated leaf node, then the decision hyperplanes are not perturbed much, since ||x − β|| becomes very small. On the other hand, if there is a mismatch then the movement of the decision hyperplane is decided by the stored function lr(i, j). If the maximally activated leaf node j is on the left path of the nonterminal node i then lr(i, j) = 1, i.e., w_i is moved in such a way that ∆w_i has the opposite sign of x, subject to lying in the normal hyperplane of w_i. Since the activated leaf node is on the left path, i.e., x is on the left half of the decision hyperplane, the decision hyperplane effectively moves away from x to increase the activation of the activated leaf node. The magnitude of the shift depends on the activation value of the leaf and the discrepancy between the corresponding stored code and the input pattern. Similar reasoning is valid if the maximally activated leaf node j is on the right path of the nonterminal node i. We also observe that as we move towards the root node, the decision hyperplanes are perturbed by the cumulative effect of the entire subtree under the nonterminal node (if j is not under the subtree of i then lr(i, j) = 0). We also notice that the updating rules for the decision hyperplanes are independent of each other. For example, a leaf node j can be on the left path of a nonterminal node i with lr(i, j) = 1, yet on the right path of i's left child. Therefore, the leaf nodes affect different nonterminal nodes differently at different levels depending on the stored function lr(i, j). In the adaptive process, if the input patterns are drawn from a stable distribution then E decreases (due to the decrease in the discrepancy between the stored codes and the input patterns), and therefore η_opt decreases with the number of iterations; the updating rules then reveal that the corresponding codes in β are also less affected by the incoming input patterns. In our model, we do not explicitly decrease η through any learning schedule; rather, η_opt adjusts itself depending on the input distribution.

The overall algorithm for adapting the tree in the online mode is given as follows.

Step 1: Initialize the tree parameters (w) and the code vectors (β).
Repeat the next steps whenever a new pattern x appears (in the online mode we never freeze the learning):
Step 2: For the given x, compute the leaf node activations.
Step 3: For the given x, compute q for each intermediate node of the tree (Equation (12)).
Step 4: For the given x, compute η_opt (Equation (16)).
Step 5: For the given x, compute ∆w for all intermediate nodes (Equation (14)) and ∆β for all leaf nodes (Equation (11)).
Step 6: For the given x, update w as w = w + ∆w for all intermediate nodes, and update β as β = β + ∆β for all leaf nodes.
Continue Steps 2-6 whenever a new pattern is presented to the model.
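As an illustration only (a sketch derived from the stated equations, not the author's MATLAB implementation), the following shows one online update (Steps 2-6) for a single sample, following Equations (11)-(14) and the regularized learning rate of Equation (16). It reuses the hypothetical `leaf_activations` and `sigmoid` helpers from the sketch in Section 1; `lr_sign` encodes the indicator lr(i, j) of Equation (3), and the explicit re-normalization of w_i is added here only as a numerical safeguard.

```python
import numpy as np

def lr_sign(i, j):
    """lr(i, j) of Eq. (3): +1 if leaf j lies in the left subtree of internal
    node i, -1 if it lies in the right subtree, 0 if i is not on the path to j."""
    node = j
    while node > 1:
        parent = node // 2
        if parent == i:
            return 1.0 if node == 2 * i else -1.0
        node = parent
    return 0.0

def online_update(x, W, theta, beta, alpha, m, lam=1.0, gamma=1.0):
    """One online step; W, theta, beta are updated in place. Returns E(x)."""
    n_internal = W.shape[0] - 1
    leaves = range(n_internal + 1, 2 * n_internal + 2)
    u = leaf_activations(x, W, theta, m)               # Step 2
    d2 = np.sum((x - beta) ** 2, axis=1)               # ||x - beta_j||^2 per leaf
    ua = u ** (1.0 / alpha)
    E = 0.5 * np.sum(d2 * ua)                          # Eq. (9)

    q = np.zeros(n_internal + 1)                       # Step 3, Eq. (12)
    for i in range(1, n_internal + 1):
        for leaf, j in enumerate(leaves):
            s = lr_sign(i, j)
            if s != 0.0:
                v_ij = sigmoid((W[i] @ x + theta[i]) * s, m)   # Eq. (13)
                q[i] += (m / (2 * alpha)) * d2[leaf] * ua[leaf] * (1 - v_ij) * s

    denom = gamma + np.sum(ua ** 2)                    # Step 4, Eq. (16)
    for i in range(1, n_internal + 1):
        denom += q[i] ** 2 * (lam + x @ x - (W[i] @ x) ** 2)
    eta = E / denom

    beta += eta * ua[:, None] * (x - beta)             # Steps 5-6, Eq. (11)
    for i in range(1, n_internal + 1):
        W[i] -= eta * q[i] * (x - (W[i] @ x) * W[i])   # projected update, Eq. (14)
        W[i] /= np.linalg.norm(W[i])                   # safeguard: keep ||w_i|| = 1
        theta[i] -= eta * lam * q[i]
    return E

# Usage sketch: present the (shuffled) batch repeatedly in the online mode,
# with W, theta and beta initialized as described in Section 4.1, e.g.
#   for epoch in range(50):
#       for x in X[np.random.permutation(len(X))]:
#           online_update(x, W, theta, beta, alpha=1.2, m=5 * np.log(10 * depth))
```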

3. Choice of α

The objective functional in Equation (8) is exactly the same as that in fuzzy c-means [24, 3] (Appendix I), where β corresponds to the cluster centers, u corresponds to the membership values, and h = 1/α. However, h in soft c-means is chosen in [1, ∞) (i.e., 1/α ∈ [1, ∞)) and usually h > 1, which means α < 1. In the limit α = 1, the fuzzy c-means algorithm becomes the hard c-means algorithm (Appendix I). In fuzzy c-means, α is chosen less than unity (i.e., h > 1) in order to make the membership function wider and obtain smoother cluster boundaries. However, α in fuzzy c-means has no role in deciding the number of clusters, the number of clusters being a user-defined parameter. In our model, on the other hand, the leaf node activations do not directly depend on the computed codes in the code formation layer. Rather, the decision hyperplanes stored in the intermediate nodes of the adaptive tree determine the leaf node activations. The code formation layer, parameterized by the β matrix, stores the respective codes corresponding to the groups of samples identified by the adaptive tree, each leaf node representing one group of samples. The decision hyperplanes in the adaptive tree and the codes in the code formation layer are adapted together to minimize the objective functional (Equation (8)). The number of clusters in our model is guided by the choice of α. In the limiting condition α → 0, E → 0, and the model does not adapt at all and remains in the initial configuration. On the other hand, for α → ∞, the individual leaf node activations play no role in E, and the parameters of the adaptive tree are adapted in such a way that all samples are merged into one group. We can also observe the same behavior from the combination of the adaptation rules and the learning rate (Equations (12) and (15)), where with an increase in the value of α, the magnitude of the changes in parameter values increases, and this enables the model to become more adaptive. In other words, even though there are 2^l leaf nodes corresponding to a maximum of 2^l clusters, the effective number of clusters depends on the parameter α. The effective number of clusters depends on the number of leaf nodes which are maximally activated for a set of input samples. Thus if there exists a leaf node which is never maximally activated for any input sample, then that leaf node does not represent any cluster. In this model the effective number of clusters depends on the control parameter α.

We explain the effect of α with a simple example. Let there be only two different samples x1 and x2, and let the model have only two leaf nodes. Without loss of generality, let us represent the activations of the leaf nodes by u1 and u2 respectively. Let the distance between the two samples be d = ||x1 − x2||. Ideally, there are two possible cases.

Case I: The sample x1 forms one cluster and x2 forms the second one, such that the codes formed are β1 = x1 and β2 = x2. For the minimum value of E, a separating hyperplane exactly normal to the vector x1 − x2 and equidistant from the samples is constructed. In this case, when x1 is presented to the model, u1(x1) = 1/(1 + exp(−md/2)) and u2(x1) = 1/(1 + exp(md/2)). In such a scenario,

$$E(x_1) = \|x_1 - \beta_1\|^2 u_1^{1/\alpha} + \|x_1 - \beta_2\|^2 u_2^{1/\alpha} \tag{17}$$

i.e.,

$$E(x_1) = \frac{d^2}{(1 + \exp(md/2))^{1/\alpha}} \tag{18}$$

We get exactly the same value for E(x2) due to symmetry, and therefore the total loss is

$$E_1 = E(x_1) + E(x_2) = \frac{2d^2}{(1 + \exp(md/2))^{1/\alpha}} \tag{19}$$

Case II: Both samples x1 and x2 form one cluster and the code β is placed exactly in the middle, such that β = (x1 + x2)/2. Since both samples are in the same cluster, there is no separating hyperplane between them (ideally it is at an infinite distance from x1 and x2), and only one leaf node is activated, such that u1 = 1 and u2 = 0 for both x1 and x2. In this case,

$$E(x_1) = \|x_1 - \beta\|^2 u_1^{1/\alpha} + \|x_1 - \beta\|^2 u_2^{1/\alpha} \tag{20}$$

i.e.,

$$E(x_1) = d^2/4 \tag{21}$$

Since we obtain exactly the same value for E(x2), the total loss is

$$E_2 = E(x_1) + E(x_2) = \frac{d^2}{2} \tag{22}$$

From Equations (19) and (22), we obtain

$$\frac{E_1}{E_2} = \left( \frac{4^{\alpha}}{1 + \exp(md/2)} \right)^{1/\alpha} \tag{23}$$

From Equation (23), we observe that there exists an

$$\alpha_0 = \frac{1}{2}\log_2\big(1 + \exp(md/2)\big) \tag{24}$$

such that for α ≥ α0, E1 ≥ E2. In other words, since we minimize the objective functional, the scenario in Case II is preferred over that in Case I if we choose α > α0, i.e., both samples are clustered in the same group. On the other hand, for a choice of α < α0, the samples are preferred to represent two different groups (since in that case E1 < E2). As discussed in the appendix (Appendix III),

$$\frac{1}{(1 + \exp(-m\delta))^l} \ge 1 - \epsilon \tag{25}$$

where l is the depth of the tree, δ is the minimum distance of a sample from the nearest hyperplane, and (1 − ε) is the minimum activation of the corresponding cluster (the activation of the maximally activated leaf node). In this example, l = 1 since there are only two leaf nodes. In the present example, if we consider δ = d/2 then

$$\frac{1}{1 + \exp(-md/2)} \ge 1 - \epsilon \tag{26}$$

After simplification,

$$1 + \exp(md/2) \ge 1/\epsilon \tag{27}$$

Therefore, from Equations (24) and (27), we have

$$\alpha_0 \ge \frac{1}{2} \log_2(1/\epsilon) \tag{28}$$

Therefore, if we want to prevent fragmentation, i.e., assign both x1 and x2 to the same cluster, then we should have

$$\alpha \ge \alpha_0 \ge \frac{1}{2} \log_2(1/\epsilon) \tag{29}$$

In the first case, in order to assign both samples to a single cluster, if we choose ε = 0.5 then we should have α ≥ 0.5. Similarly, if we consider ε = 0.25, we should have α ≥ 1. The same reasoning is valid for more than two samples with multiple levels in the adaptive tree. In the case of multiple levels, we assume that the two closest points are separated only at the last level of the tree. Therefore, for a small α, the adaptive tree can partition single small clusters into multiple ones. In short, for a choice of large α, the number of groups formed by the adaptive tree decreases and large chunks of data are grouped together. On the other hand, if α is decreased then the number of groups increases. In other words, we can vary the value of α in the range [α0, ∞) to obtain finer to coarser clustering. In soft c-means, on the contrary, the number of clusters is always constant and equal to c. The parameter α in soft c-means is used to control the radius of each soft cluster during the computation of the centroid of each cluster, and the membership values of the samples to the different clusters depend on the computed cluster centers. In our model, although the maximum number of possible clusters is equal to the number of leaf nodes, the actual number of clusters depends on the choice of α: with an increase in α, the number of clusters decreases. In other words, although the objective functional in Equation (8) is exactly the same as that in the soft c-means, the effect of α is different in the two algorithms.
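As a purely numerical illustration of the two-sample analysis above (the values of m and d below are arbitrary), one can verify that the two-cluster loss E1 of Equation (19) falls below the one-cluster loss E2 of Equation (22) exactly when α drops below the α0 of Equation (24):

```python
import numpy as np

m, d = 5.0, 0.4
alpha0 = 0.5 * np.log2(1.0 + np.exp(m * d / 2.0))              # Eq. (24)
for alpha in (0.5 * alpha0, alpha0, 2.0 * alpha0):
    E1 = 2 * d ** 2 / (1 + np.exp(m * d / 2)) ** (1 / alpha)   # two clusters, Eq. (19)
    E2 = d ** 2 / 2                                            # one cluster,  Eq. (22)
    print(f"alpha = {alpha:.3f}  E1 = {E1:.4f}  E2 = {E2:.4f}  merge preferred: {E1 >= E2}")
```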

4. Experimental Results

We implemented the model in the MATLAB environment and evaluated its performance on both synthetic and real-life data sets.

4.1 Protocol

The entire batch of samples is repeatedly presented in the online mode, and we call the number of times the batch of samples is presented the number of epochs. If the data density is high or the dataset has a set of compact clusters, then we observe that the model takes fewer epochs to converge; for relatively low data density, or in the presence of sparse clusters, the model takes a larger number of epochs. On average, we observed that the model converges near its local optimal solution within 50 epochs and sometimes even fewer, although the required number of epochs increases with the depth of the tree. We normalize all input patterns such that each component of x lies in the range [−1, +1]. We normalize each component x_i of x as

$$\hat{x}_i = \frac{2x_i - (x_i^{max} + x_i^{min})}{x_i^{max} - x_i^{min}} \tag{30}$$

where x_i^{max} and x_i^{min} are the maximum and minimum values of x_i over all observations, respectively. In this normalization, we do not treat the outliers separately; data outliers can badly influence the variables in such a normalization. Instead of linear scaling, the use of a non-linear scaling or a more robust scaling, such as one based on the inter-quartile range, may further improve the performance of our model. However, in this paper we preserve the exact distribution of the samples and test the capability of the model in extracting the groups of samples in the online adaptive mode. The performance of the model also depends on the slope (the parameter m) of the sigmoidal function (Equation (6)). In general, we observed that for a given control parameter α, the number of decision hyperplanes increases as the slope decreases (this is also validated in Equation (24), where α0 decreases with a decrease in m). This is due to the fact that if we decrease the slope then more than one leaf node receives significant activation to attract the codes (β) in the code formation layer, and thereby the number of clusters increases. On the other hand, with an increase in the slope, only a few (in most cases only one) leaf nodes are significantly activated to influence the decision hyperplanes, and therefore the number of clusters decreases. In general, for a small m, a group of leaf nodes together represents a compact cluster. As stated in Equation (51), for a given depth l of the adaptive tree, we can fix the slope as m ≥ (1/δ) log(l/ε), where δ is a margin between a point (pattern) and the closest decision hyperplane, and ε is a constant such that the minimum activation of a maximally activated node should be no less than 1 − ε. Experimentally, we fixed δ = 0.2 for ε = 0.1; in other words, we fix the slope as m = 5 log(10l) for a given depth l of the tree. Note that the value of δ that we have chosen is valid for the normalized input space that we use; if the data is not normalized and the input range is larger, then a larger value of δ can also be selected.
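For concreteness, here is a minimal sketch (illustrative names, not the author's code) of the preprocessing and default slope just described: min-max scaling of every input variable to [−1, +1] as in Equation (30), and m = 5 log(10 l) for a tree of depth l.

```python
import numpy as np

def normalize(X):
    """Scale each column of X (rows are samples) into [-1, +1], Equation (30)."""
    xmin, xmax = X.min(axis=0), X.max(axis=0)
    span = np.where(xmax > xmin, xmax - xmin, 1.0)   # guard against constant columns (assumption)
    return (2.0 * X - (xmax + xmin)) / span

depth = 5
m = 5.0 * np.log(10.0 * depth)   # slope corresponding to delta = 0.2 and eps = 0.1
```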


We also observed the behavior of the model with respect to the control parameter α: as we increase α, the number of clusters extracted by the model decreases, and vice versa. In the adaptation rule (Equation (11)), as stated before, we always select λ as unity since we operate in the normalized input space. We also select the constant γ = 1 in computing the learning rate, irrespective of any other parameter. We initialize β to zero. We also initialize θ at each nonterminal node to zero. We initialize the weight vectors w such that every component of each w is the same; in other words, for an n-dimensional input space we initialize w such that every component of w is equal to 1/√n, which makes ||w|| = 1. Therefore, the performance of the model solely depends on the order in which the samples are presented and is independent of the initial condition (since the initial condition is the same in all experiments). We also experimented with initializing each weight vector with a randomly generated Gaussian distribution. We observed that initialization with randomly generated, normally distributed weight vectors can sometimes lead to better results than the fixed initialization (all weights being equal). In the next section, we illustrate the results for the Iris and Crabs datasets with an initialization using normally distributed weight vectors. We then perform class discovery with the fixed initialization (all weights being equal) and report the results on a face dataset using the same fixed initialization. We experimented with different depths of the tree, and naturally the number of clusters extracted by the model increases with the depth of the adaptive tree.

4.2 Examples

4.2.1 Synthetic Data

We used synthetically generated data to test the performance of the model; a mixture of Gaussians was used to generate the dataset. Figure 2 illustrates the clusters extracted by the model for different values of α and different depths. The objective of this experiment is to show that the model behaves in the manner discussed above, i.e., with an increase in α the number of clusters decreases and vice versa. Similarly, with an increase in depth the number of clusters increases. To illustrate the effect of α on the grouping of the data, Figure 3 shows the results for α = 1, 1.2, 1.4, and 1.6, respectively, for an adaptive tree of depth l = 5 with m0 = 5 for 200 epochs. We observe that for α = 1.0, a relatively large number of regions are identified by the tree. This is due to the fact that initially the regions are created and, due to the smaller value of α, they are not merged with other regions. As α is increased to 1.6, we observe that different groups of data are merged into one segment by the tree, and seven different groups are identified.

4.2.2 Crabs Data

In the crabs dataset [25] there are in total 200 samples, each sample having five attributes. The crabs are from two different species, and each species has male and female categories. Thus there are in total four different classes, and the data set contains 50 samples from each class. The samples from different classes highly overlap when the data is projected onto the two-dimensional space spanned by the first and second principal components. Interestingly, the four classes are nicely observable when the data is displayed in the space spanned by the second and third principal components.
We therefore obtain the second and third principal components and project the dataset onto these two principal components to obtain the two-dimensional projected data. We then normalize such that each projected sample is bounded in [-1, 1].


Fig. 2: The decision hyperplanes constructed by the model for different depths and different control parameters with a constant m0 = 5. The constructed hyperplanes are shown for (a) depth equal to 3 and α = 1.2, (b) depth equal to 3 and α = 1.5, (c) depth equal to 3 and α = 3.0, (d) depth equal to 4 and α = 1.2, (e) depth equal to 4 and α = 3.0, (f) depth equal to 4 and α = 4.0, (g) depth equal to 5 and α = 1.2, (h) depth equal to 5 and α = 4.0.


Fig. 3: The decision hyperplanes constructed by the tree with a constant m0 = 5 and depth equal to 5. The control parameter α is varied: (a), (b), (c), and (d) illustrate the behavior of the tree for α = 1, α = 1.2, α = 1.4, and α = 1.6, respectively.


Figure 4(a)-(d) illustrates the behavior of the model for depths two to five with a constant α = 1.2, and Figure 4(e)-(f) illustrates the behavior for depths four and five with α = 2.0. In Figure 4(a), we observe that the model is not able to completely separate the data points from the different classes with only four regions. However, if we consider only the species information, then the two different species are well separated with a depth of 2. In Figure 4(c), we observe that the model generates a region where samples from two different classes (same species with different sex) are mixed up. In all other situations, the model is able to separate the four different classes by allocating more than one leaf node to each class. As a comparison, we illustrate the results obtained with the fuzzy c-means algorithm (Appendix I) for different numbers of clusters. We experimented with different values of the exponent h. We observed that for a relatively small number of clusters (such as 4 or 5), the results are the same over a large range of exponent values. As the number of clusters increases, the nature of the grouping depends on the exponent value. We report the results of FCM for different numbers of clusters with a fixed exponent h = 2.0. We observe that the proposed clustering algorithm is able to separate the different class structures as well as the fuzzy c-means algorithm, although the proposed algorithm is hierarchical in nature.

4.2.3 Iris Data

We illustrate the behavior of the model on the 'iris' dataset (the original dataset was reported in [26], and we obtained the data from [27]). In the 'iris' dataset, there are three different types of iris flowers, each sample having four features, namely sepal length, sepal width, petal length, and petal width. There are 50 samples from each class, resulting in a total of 150 samples in the four-dimensional space. The three different classes can be nicely identified when the dataset is displayed in a two-dimensional space spanned by two derived features, namely the sepal area (sepal length × sepal width) and the petal area (petal length × petal width). We transform the data into these two dimensions and then normalize. Figure 6 illustrates the behavior of the model for this dataset with these two normalized derived features for depths two to four with a constant α = 1.2. We observe that a model with depth equal to 2 is able to separate one class completely. However, it creates one region (corresponding to one leaf node) where samples from two different classes are mixed up. The purity of the regions improves considerably when we use a depth of 3. For a depth of 4, we observe that the model is able to perfectly separate the three different classes. As a comparison, we illustrate the results obtained with the fuzzy c-means algorithm (Appendix I) for different numbers of clusters. We experimented with different values of the exponent h. We observed that for a relatively small number of clusters (such as 4 or 5), the results are the same over a large range of exponent values. As the number of clusters increases, the nature of the grouping depends on the exponent value. We report the results of FCM for different numbers of clusters with a fixed exponent h = 2.0, and we report only those results where the performance of FCM was visually most appealing. We observe that even for 3 clusters, FCM is able to separate one class completely, just as the proposed model does with a depth of 2. As the number of clusters increases, the FCM algorithm is able to partially separate the two other classes. However, we observe that the class lying in the middle of the other two is better separated by the proposed model compared to FCM. In our model the class structure is captured by two approximately parallel lines, which is not so apparent in FCM, although FCM separates the class to a certain extent from the other two classes.


Fig. 4: The decision hyperplanes constructed by the proposed model on the “crabs” dataset projected onto the two-dimensional plane spanned by the second and third principal components. The four different classes in the dataset are marked with different symbols. The figure illustrates the hyperplanes with a constant m0 = 5 for (a) depth equal to 2 and α = 1.2, (b) depth equal to 3 and α = 1.2, (c) depth equal to 4 and α = 1.2, (d) depth equal to 5 and α = 1.2, (e) depth equal to 4 and α = 2.0, (f) depth equal to 5 and α = 2.0.


Fig. 5: The decision hyperplanes constructed by the FCM algorithm on the “crabs” dataset projected onto the two-dimensional plane spanned by the second and third principal components. The four different classes in the dataset are marked with different symbols. The figure illustrates the hyperplanes with a constant h = 2.0 for the number of clusters equal to (a) 4, (b) 6, (c) 8, (d) 10, (e) 12, and (f) 14.


Fig. 6: The decision hyperplanes constructed by the proposed model on the “iris” dataset projected onto the two-dimensional plane spanned by the sepal area (product of the first and second attribute values) and the petal area (product of the third and fourth attribute values). The figure illustrates the hyperplanes with a constant m0 = 5 and α = 1.2 for (a) depth equal to 2, (b) depth equal to 3, and (c) depth equal to 4.


Fig. 7: The decision hyperplanes constructed by the FCM algorithm on the “iris” dataset projected onto the two-dimensional plane spanned by the sepal area and the petal area. The three different classes in the dataset are marked with different symbols. The figure illustrates the hyperplanes with a constant h = 2 for the number of clusters equal to (a) 3, (b) 4, (c) 6, (d) 7, (e) 9, and (f) 10.


4.2.4 Unsupervised Classification on Real-life Data

Here we demonstrate the effectiveness of the model in performing class discovery on several real-life data sets (from the UCI repository [27]) by means of unsupervised classification. In order to obtain the classification performance, we consider the samples allocated to each leaf node. For each leaf node, we obtain the class label of the majority of the samples allocated to that leaf node and assign the corresponding class label to that leaf node. Thus we assign a specific class label to each leaf node of the tree depending on the group of samples allocated to that leaf node (an illustrative sketch of this labeling step is given after Table 1). In the crabs dataset, we use two class labels (species only) instead of the four labels. Apart from the iris and crabs datasets, we also use five other data sets, namely pima-indians-diabetes (originally reported in [28], here called 'pima'), Wisconsin Diagnostic Breast Cancer (originally reported in [29], in short 'WDBC'), Wisconsin Prognostic Breast Cancer (originally reported in [30], in short 'WPBC'), the E-Coli bacteria dataset, and the BUPA liver disease dataset. We obtained these data sets from [27]. We modified the 'Ecoli' data originally used in [31] and later in [32] for predicting protein localization sites. The original dataset consists of eight different classes, out of which three classes, namely outer membrane lipoprotein, inner membrane lipoprotein, and inner membrane cleavable signal sequence, have only five, two, and two instances respectively. We omitted the samples from these three classes and report the results for the remaining five classes. Table 1 summarizes the data set description.

Table 1: Description of the pattern classification datasets.

Data Set                           No. of Instances   No. of Features   No. of Classes
Indian diabetes (Pima)             768                8                 2
Diagnostic Breast Cancer (Wdbc)    569                30                2
Prognostic Breast Cancer (Wpbc)    198                32                2
Liver Disease (Bupa)               345                6                 2
Flower (Iris)                      150                4                 3
Bacteria (Ecoli)                   326                7                 5
Crabs                              200                5                 2
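The leaf-labeling step described above can be sketched as follows (illustrative only; `leaf_of` is assumed to hold, for every sample, the index of its maximally activated leaf node):

```python
import numpy as np
from collections import Counter

def unsupervised_classification_accuracy(leaf_of, labels):
    """Label every leaf with the majority class of its samples and score
    the resulting predictions against the true labels."""
    leaf_of, labels = np.asarray(leaf_of), np.asarray(labels)
    leaf_label = {}
    for leaf in np.unique(leaf_of):
        members = labels[leaf_of == leaf]
        leaf_label[leaf] = Counter(members).most_common(1)[0][0]   # majority class
    predicted = np.array([leaf_label[l] for l in leaf_of])
    return np.mean(predicted == labels)

# Example with hypothetical assignments: three leaves, two classes.
print(unsupervised_classification_accuracy(
    leaf_of=[4, 4, 5, 5, 5, 7, 7], labels=[0, 0, 0, 1, 1, 1, 1]))   # 6/7 correct
```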

In performing the unsupervised classification on real-life datasets, we compare the results of our model with those obtained with the fuzzy c-means clustering algorithm. As a further comparison, we also provide the cross-validated results obtained by a supervised decision tree, C4.5 [18, 33], on these data sets. In our model, randomness can enter in two different ways: in the initialization process, and in the order of presentation of the samples in the online mode. We eliminate the randomness in the initialization process by using a fixed initialization: we initialize all components of all the weight vectors (w) to be equal, initialize the bias component (θ) to zero for all intermediate nodes, and initialize the code vectors (β) to zero. We experiment with different random orders of presentation of the samples in the online mode and report the average performance over 10 different trials for each dataset and each tree. In the case of fuzzy c-means, the second source of randomness is not present since it is a batch-mode algorithm; however, the performance of the FCM algorithm depends on the initialization, and we report the average performance of the FCM algorithm over 10 different trials. In performing the unsupervised classification, we experimented with different values of α and different depths of the tree. We observed that the performance depends on the value of α: if α is very low then the classification accuracy degrades due to fragmentation; the performance improves with an increase in α; and the classification accuracy again degrades if the value of α becomes large, since many samples are grouped together. Figure 8 illustrates the behavior of the model on three different datasets, namely PIMA, WDBC, and ECOLI.

Fig. 8: Dependence of the classification accuracy on the parameter α of the adaptive tree (for depths 3 to 7) on (a) the PIMA dataset, (b) the WDBC dataset, and (c) the ECOLI dataset.

In Table 2, we report the best results obtained by the proposed model and by the FCM. As a comparison, we provide the cross-validated results obtained by the supervised C4.5 decision tree. For the fuzzy c-means algorithm, we experimented with different combinations of the number of clusters and the exponent (h); we increased the number of clusters up to 15 and the exponent (h) up to 5.0. For each dataset we report the best result obtained by the FCM algorithm, together with the corresponding number of clusters and exponent. In the case of our model, we likewise tested different combinations of the depth of the adaptive tree and the parameter α; we increased the depth up to 7 and α up to 3.0. We report the best results and the corresponding value of α for each depth of the adaptive tree. Since the performance of the adaptive tree depends on the sequence in which the samples appear, we report the standard deviation of the scores over 10 trials with different input sequences. Similarly, the performance of the FCM depends on the initialization, and we report the standard deviation over 10 different trials. We observe that the adaptive tree is able to produce better classification accuracy than FCM for the BUPA, PIMA, and IRIS datasets. On the other hand, FCM is much better for the CRABS dataset, and also performs better for the WDBC and ECOLI datasets. The performances are comparable for the WPBC dataset. In other words, we observe that although we perform hierarchical clustering, the adaptive tree is able to produce results comparable with the partitional clustering algorithm on certain datasets. We also compare the results with those produced by agglomerative hierarchical clustering (AHC) using group average (dendrogram). We have constrained the AHC to produce at most 20 clusters for each dataset, and we report the results for 20 clusters. We observe that in the case of AHC the results consistently improve as we increase the number of clusters; in order to make the comparison with the fuzzy c-means fair, we have constrained it to 20 clusters. We observe that the adaptive tree is able to produce comparable and sometimes better results than AHC when AHC is constrained to 20 clusters.

Table 2: Classification accuracies obtained by the unsupervised online adaptive tree ('OADT' stands for online adaptive decision tree). As a comparison, we provide the results produced by the fuzzy c-means algorithm, the dendrogram, and the supervised C4.5 decision tree. For the OADT rows, each entry shows the accuracy (± standard deviation) with the corresponding α in parentheses; for the FCM row, the accuracy (± standard deviation) with the number of clusters in brackets and the exponent h = 1/α in parentheses; for the C4.5 row, the accuracy with the tree depth in parentheses.

Method          BUPA                   PIMA                   WDBC                   WPBC                   IRIS                    ECOLI                  CRABS
OADT, depth 3   61.59 (±1.269) (0.5)   70.59 (±1.263) (0.6)   92.64 (±1.607) (0.8)   76.57 (±0.543) (0.8)   96.27 (±0.344) (0.7)    82.18 (±1.809) (0.6)   69.45 (±0.438) (1.2)
OADT, depth 4   61.36 (±1.701) (0.8)   72.76 (±1.246) (0.8)   92.51 (±1.877) (0.8)   77.93 (±0.892) (0.8)   95.80 (±1.335) (0.8)    86.04 (±1.498) (0.7)   71.15 (±0.474) (0.6)
OADT, depth 5   63.04 (±1.541) (0.7)   74.06 (±1.528) (0.8)   93.16 (±1.395) (0.8)   78.08 (±0.797) (0.8)   96.07 (±0.0913) (0.5)   86.69 (±0.808) (0.6)   71.20 (±0.537) (0.5)
OADT, depth 6   64.26 (±1.874) (0.7)   74.71 (±0.619) (0.8)   93.57 (±1.893) (0.8)   77.73 (±1.247) (0.8)   96.60 (±1.153) (0.5)    86.53 (±1.594) (0.7)   71.25 (±0.755) (0.5)
OADT, depth 7   65.04 (±1.319) (0.5)   76.17 (±0.601) (0.7)   93.11 (±1.860) (0.9)   77.98 (±0.832) (0.8)   97.73 (±0.344) (0.6)    87.27 (±1.094) (0.7)   71.50 (±0.527) (0.5)
FCM             63.88 (±0.775) [15] (1.2)   72.51 (±0.801) [7] (1.4)   95.17 (±0.569) [15] (1.2)   77.93 (±1.07) [15] (1.2)   96.67 (±0.629) [13] (1.8)   88.34 (±0.560) [13] (1.2)   76.80 (±2.80) [15] (3.8)
Dendrogram      63.19                  71.61                  94.38                  77.78                  94.67                   86.50                  72.50
C4.5            66.38 (7)              74.09 (9)              92.44 (6)              76.77 (6)              95.33 (4)               84.66 (5)              90.7 (7)

We validate the clustering produced by the adaptive tree by computing the F-measure validation index [5]. In Table 3, we report the F-measure indices for fixed values of α and different depths of the tree on all the datasets of Table 1.
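For reference, the sketch below computes a commonly used cluster-to-class F-measure, in which every class is paired with its best-matching cluster and the scores are weighted by class size; the exact index defined in [5] may differ in detail, so this is illustrative only.

```python
import numpy as np

def clustering_f_measure(labels, clusters):
    """Class-size-weighted best-match F-measure between true labels and clusters."""
    labels, clusters = np.asarray(labels), np.asarray(clusters)
    n = len(labels)
    total = 0.0
    for c in np.unique(labels):                      # each true class
        n_c = np.sum(labels == c)
        best = 0.0
        for k in np.unique(clusters):                # each discovered cluster
            n_k = np.sum(clusters == k)
            n_ck = np.sum((labels == c) & (clusters == k))
            if n_ck == 0:
                continue
            p, r = n_ck / n_k, n_ck / n_c            # precision and recall of (c, k)
            best = max(best, 2 * p * r / (p + r))
        total += (n_c / n) * best
    return total
```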


Table 3: F-measures for the adaptive tree clustering for different values of α and depths.

α      depth   BUPA    PIMA    WDBC    WPBC    IRIS    ECOLI   CRABS
0.6    3       0.444   0.432   0.738   0.564   0.678   0.646   0.361
0.6    4       0.449   0.327   0.727   0.539   0.690   0.602   0.361
0.6    5       0.451   0.339   0.766   0.598   0.719   0.514   0.378
0.6    6       0.452   0.330   0.792   0.667   0.707   0.487   0.391
0.8    3       0.484   0.453   0.745   0.506   0.696   0.708   0.340
0.8    4       0.493   0.417   0.826   0.541   0.701   0.702   0.361
0.8    5       0.485   0.445   0.807   0.551   0.675   0.706   0.334
0.8    6       0.474   0.438   0.829   0.562   0.697   0.715   0.352
1.0    3       0.579   0.555   0.884   0.423   0.795   0.636   0.393
1.0    4       0.583   0.620   0.901   0.425   0.756   0.635   0.387
1.0    5       0.591   0.620   0.894   0.436   0.747   0.635   0.388
1.0    6       0.600   0.580   0.900   0.434   0.755   0.635   0.387

sample, gene expression was monitored using Affymetrix expression arrays. Golub et al. used a self-organizing map to study the gene expression data and performed unsupervised classification of the cancer samples. The same data has been used in [35], where an appropriate model order has been obtained based on stability-based validation of the clustering solutions. The original data consists of 72 different samples (47 ALL and 25 AML samples), each expressed by 6817 genes. We preprocess the data using the same technique as reported in [35]. The final dataset consists of 72 samples, each expressed by 100 genes. We further normalize each sample such that each variable lies in the range [−1, +1]. When we use an adaptive tree of depth 3, we achieve 80.56% accuracy, whereas for depths of 4 and 5 we consistently obtain an accuracy of 86.11% (62 samples correctly classified). This is exactly the accuracy that was achieved in [35] using the ALL-AML labeling. With a model of depth 6 we obtain 84.7% accuracy (61 samples labeled correctly). The reduction in accuracy happens because a larger depth creates too many fragmented regions, so that some regions contain very few samples. This reduces the classification accuracy since we decide the class labels based on the majority of labels in each leaf.

Finally, we show the decision trees generated for a relatively large dataset, namely the 20Newsgroup dataset (available at the webpage maintained by S. Roweis: http://www.cs.toronto.edu/~roweis/data.html). It consists of 16242 documents (a subset of the original newsgroup data as available in [27]) and a word dictionary containing 100 words. Each document is represented by a vector of length 100, with each entry being 0 or 1 to indicate the absence or presence of the corresponding word in the document. The documents are annotated with the highest-level newsgroup domain, and there are four classes, namely ‘comp.*’, ‘rec.*’, ‘sci.*’, and ‘talk.*’. We constructed adaptive trees with this dataset for different depths. Once the adaptive tree is constructed, we check the leaf nodes; if we find that a leaf node was never activated during the adaptation process, we purge that leaf node. Since the adaptive tree is a binary tree, if any node is purged, the cluster corresponding to the sibling of the purged node can be represented by their parent. Once the adaptive tree is constructed, we determine the actual structure of the tree by this iterative purging process. We observed that the adaptive tree is able to capture three classes, namely ‘comp.*’, ‘rec.*’, and ‘talk.*’; however, it fails to capture the class ‘sci.*’ for depths of 3 and 4. The adaptive tree is able to capture all the classes perfectly for depths greater than or equal to 5. Figure 9 illustrates the different structures of the adaptive tree obtained after 20 iterations for depths of 5, 6, and 7, respectively. In the adaptation process we used α = 1 and fixed initialization (as mentioned before) with a random sequencing of the input patterns. We observe that although the class ‘sci.*’ is captured, the number of leaf nodes allocated to this class is only one for depths 5 and 6 and two for depth 7. The other classes appear prominently in the leaf nodes. This shows that the samples from class ‘sci.*’ have a high degree of overlap with the samples from the other classes (as represented by the vocabulary of 100 words).
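For illustration, a minimal recursive sketch of the bottom-up purging step described above is given below, using the node numbering convention of the appendices (root 1, children of node p at 2p and 2p + 1). The function and its arguments are hypothetical and are not taken from the paper's implementation.

```python
def purge(p, depth, activated_leaves, level=0):
    """Nodes representing clusters in the subtree rooted at node p.

    Leaves never activated during adaptation are dropped; when a leaf (or a
    whole subtree) is dropped, a lone surviving sibling cluster is
    represented by the parent node instead.
    """
    if level == depth:                                  # physical leaf node
        return [p] if p in activated_leaves else []
    left = purge(2 * p, depth, activated_leaves, level + 1)
    right = purge(2 * p + 1, depth, activated_leaves, level + 1)
    if left and right:                                  # both subtrees survive
        return left + right
    survivors = left or right
    if len(survivors) == 1:                             # lone sibling merges upward
        return [p]
    return survivors                                    # nothing or several survive

# depth-3 tree in which leaf 9 was never activated: leaf 8 is absorbed into node 4
print(purge(1, 3, {8, 10, 11, 12, 13, 14, 15}))   # -> [4, 10, 11, 12, 13, 14, 15]
```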

Fig. 9: Unsupervised decision tree structures derived by iterative purging of the adaptive tree on the 20Newsgroup dataset for (a) depth = 5, (b) depth = 6, and (c) depth = 7. (Leaf nodes in the figure are marked with the classes ‘comp.*’, ‘rec.*’, ‘sci.*’, and ‘talk.*’.)

5. Discussion and Conclusions

In this paper we have designed a hierarchical clustering algorithm in a decision tree framework in the online adaptive mode. Our model consists of two components, namely an adaptive tree and a code formation layer. The adaptive tree hierarchically partitions the dataset and each node contains a group of samples. The leaf nodes of the adaptive tree represent the clusters at the finest level, and the code formation layer stores a reference vector (code) for each cluster at a leaf node. We do not perform iterative partitioning of the data as is usually done in divisive clustering and decision trees. Instead we use a global objective functional to represent the quality of the clustering obtained by the leaf nodes. We use the same objective functional that is used in the fuzzy c-means, with the exception that the cluster centers of the fuzzy c-means are replaced by the stored codes. We iteratively minimize the objective functional for each sample in the online adaptive mode to fine-tune the separating hyperplanes of the adaptive tree and the stored codes of the code formation layer together. The partitioning process at any intermediate node is guided by the overall global objective functional and also depends on the partitions obtained by the child nodes. This is in contrast to the usual top-down divisive clustering, where a dataset is locally partitioned at every node recursively.

We also never freeze the adaptation process: whenever there is an input sample, the adaptive tree allocates that sample to a leaf node and at the same time adapts the parameters of the tree, simultaneously with the adaptation of the codes in the code formation layer, in the online mode. In the adaptation process, not only are the parameters of the tree and the code formation layer adapted, but the structure of the tree is also softly modified. For example, at any point of time we can check the leaf nodes and retrieve the structure of the tree formed. If we find that a leaf node has not been activated for any input sample over a period of time (in a batch), then we purge that leaf node. We then follow an iterative bottom-up purging process: the cluster represented by the sibling of a purged node is taken over by the parent of those nodes, so that the sibling can also be purged and the surviving subtree is connected directly to the parent node. The online adaptive tree construction in the unsupervised manner can also be viewed as a self-organization process where the adaptive tree adjusts its parameter values with the incoming data without any supervision.

We experimented with real-life datasets and observed that for certain datasets the adaptive tree is able to extract the class structures in the unsupervised mode. We also compared the performance with that of the fuzzy c-means algorithm and observed that for certain datasets fuzzy c-means performs better than the adaptive tree in terms of unsupervised classification, whereas for certain other datasets the adaptive tree performs better. We have also shown that the exponent in the objective functional has a different role in the adaptive tree as compared to that in the fuzzy c-means, although the two objective functionals are the same. In the adaptive tree, the exponent controls the size and number of the clusters, whereas in the fuzzy c-means the exponent has no role in determining the number of clusters (the number of clusters is user-specified). Note that it is possible to determine the optimal number of clusters for the K-means algorithm in a principal component analysis framework, as provided in [36], where the performance of K-means does not depend on the initialization; however, we do not consider that version of K-means in our paper. We used steepest gradient descent with a line-search technique to automatically decide the learning rate for each new sample. Further investigations are possible for improving the learning performance of the adaptive tree.
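To make the last point concrete, the sketch below shows a generic backtracking (Armijo) line search for choosing a per-sample step size in gradient descent. This is a standard construction given for illustration only; it is not necessarily the exact line-search rule used for the adaptive tree, and the objective, constants, and toy example are arbitrary.

```python
import numpy as np

def backtracking_step(objective, grad, params, shrink=0.5, c=1e-4, eta0=1.0):
    """One gradient-descent step with a backtracking line search.

    `objective(params)` returns the (per-sample) objective value and `grad`
    is its gradient at `params`.  The step size is shrunk until the Armijo
    sufficient-decrease condition holds.
    """
    eta = eta0
    f0 = objective(params)
    g2 = np.dot(grad, grad)
    while objective(params - eta * grad) > f0 - c * eta * g2:
        eta *= shrink
        if eta < 1e-12:            # give up: no sufficient decrease found
            return params
    return params - eta * grad

# toy usage: minimise f(w) = ||w||^2 starting from w = (2, -3)
w = np.array([2.0, -3.0])
for _ in range(5):
    w = backtracking_step(lambda v: np.dot(v, v), 2 * w, w)
print(w)
```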

Appendix A. Appendix I: Fuzzy C-Means Algorithm

In fuzzy c-means, the global objective function is given as

    J = \sum_{x} \sum_{j=1}^{c} u_j^h(x) \, \|x - \mu_j\|^2                              (31)

where 1 < h < ∞ is a real-valued exponent controlling the membership function u_j(x) of a sample x belonging to a cluster j, µ_j is the center of the cluster j, and c is the number of clusters. The sum of the membership values is normalized to unity, i.e.,

    \sum_{j} u_j(x) = 1                                                                  (32)

for all x. The objective functional J is iteratively minimized in a two-step (EM) procedure. First, given a set of membership values of the samples, the centers are computed; then, with the given centers, the optimal membership values that minimize J are computed. These two steps are repeated until convergence (till the maximum change in the membership values goes below a threshold). The updating rules for the fuzzy c-means are given as

    u_j(x) = \frac{1}{\sum_{k} \left( \frac{\|x - \mu_j\|}{\|x - \mu_k\|} \right)^{\frac{2}{h-1}}}   (33)

and

    \mu_j = \frac{\sum_{x} u_j^h(x) \, x}{\sum_{x} u_j^h(x)}                             (34)

We observe that in the limiting condition h → 1, the membership values u_j(x) become 0 or 1 depending on the distance of the sample x from the center µ_j. In other words, in the limiting condition h → 1 the fuzzy c-means becomes the hard c-means algorithm.
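For concreteness, a minimal sketch of the alternating updates (33)-(34) is given below. The random initialization, tolerance, and iteration cap are arbitrary choices for illustration and are not taken from the paper.

```python
import numpy as np

def fuzzy_c_means(X, c, h=2.0, tol=1e-5, max_iter=300, seed=0):
    """Standard fuzzy c-means via the alternating updates (33)-(34).

    X: (n, d) data matrix, c: number of clusters, h > 1: fuzzifier.
    Returns (memberships u of shape (n, c), centers mu of shape (c, d)).
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    u = rng.random((n, c))
    u /= u.sum(axis=1, keepdims=True)                # enforce (32): rows sum to one
    for _ in range(max_iter):
        uh = u ** h
        mu = (uh.T @ X) / uh.sum(axis=0)[:, None]               # update (34)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)    # squared distances
        d2 = np.maximum(d2, 1e-12)                              # avoid division by zero
        inv = d2 ** (-1.0 / (h - 1.0))
        u_new = inv / inv.sum(axis=1, keepdims=True)            # update (33)
        if np.abs(u_new - u).max() < tol:
            u = u_new
            break
        u = u_new
    return u, mu

# toy usage: two well-separated Gaussian blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)), rng.normal(3.0, 0.3, (50, 2))])
u, mu = fuzzy_c_means(X, c=2)
print(mu)   # two centers, one near (0, 0) and one near (3, 3)
```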

Appendix B. Appendix II: Sum of Activation of the Leaf Nodes

Claim 1: The total activation of the leaf nodes is constant if the activation function of each intermediate node is sigmoidal in nature.

Proof: Let us assume one form of the sigmoid function as

    f(v; m) = \frac{1}{1 + \exp(-mv)}                                                    (35)

Then

    f(v; m) + f(-v; m) = 1                                                               (36)

This property in general holds true for any kind of symmetric sigmoid function (including the S-function used in the literature of fuzzy set theory). For any child node p, the activation is given as

    u_p = u_{\lfloor p/2 \rfloor} \, f\!\left( (-1)^{\mathrm{mod}(p, \lfloor p/2 \rfloor)} \left( w_{\lfloor p/2 \rfloor} \cdot x + \theta_{\lfloor p/2 \rfloor} \right) \right)     (37)

Thus for any two leaf nodes p and p + 1 having a common parent node p/2 (p is even, since leaf node indices always start from an even number according to our node numbering convention), we have

    u_p + u_{p+1} = u_{p/2}                                                              (38)

Thus the total activation of the leaf nodes is

    \sum_{p=2^{l}}^{2^{l+1}-1} u_p = \sum_{q=2^{l-1}}^{2^{l}-1} u_q                      (39)

By similar reasoning,

    \sum_{q=2^{l-1}}^{2^{l}-1} u_q = \sum_{r=2^{l-2}}^{2^{l-1}-1} u_r                    (40)

As we go up the tree till we reach the root node, the sum of the activations becomes equal to u_1. Since we consider that u_1 = 1 by default, the total activation of the leaf nodes is

    \sum_{p} u_p = 1                                                                     (41)

Therefore, decreasing the output of one leaf node will automatically enforce an increase in the activation of the other nodes, given that the activation of the root node is constant.
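As a quick numerical illustration of Claim 1, the sketch below propagates activations down a small complete tree with arbitrary random hyperplanes, following Equations (36)-(37), and checks that the leaf activations sum to u_1 = 1. The weights, dimensionality, and slope m are test values only.

```python
import numpy as np

def sigmoid(v, m=5.0):
    return 1.0 / (1.0 + np.exp(-m * v))

def leaf_activations(x, w, theta, depth):
    """Leaf activations of a complete binary tree (root node 1, u_1 = 1).

    Node p splits with hyperplane (w[p], theta[p]); the left child 2p gets
    f(w.x + theta) and the right child 2p + 1 gets f(-(w.x + theta)).
    """
    u = {1: 1.0}
    for p in range(1, 2 ** depth):                  # all intermediate nodes
        v = np.dot(w[p], x) + theta[p]
        u[2 * p] = u[p] * sigmoid(v)
        u[2 * p + 1] = u[p] * sigmoid(-v)
    return [u[p] for p in range(2 ** depth, 2 ** (depth + 1))]

rng = np.random.default_rng(0)
depth, d = 3, 4
w = {p: rng.normal(size=d) for p in range(1, 2 ** depth)}
theta = {p: rng.normal() for p in range(1, 2 ** depth)}
x = rng.normal(size=d)
print(sum(leaf_activations(x, w, theta, depth)))    # = 1.0, as Equation (41) asserts
```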

Appendix C. Appendix III: Lower Bound on the Slope

The activation of a node p at depth l is given as

    u_p(x) = \prod_{n=1}^{l} f\!\left( (-1)^{\mathrm{mod}\left(\lfloor p/2^{n-1} \rfloor, \lfloor p/2^{n} \rfloor\right)} \left( w_{\lfloor p/2^{n} \rfloor} \cdot x + \theta_{\lfloor p/2^{n} \rfloor} \right) \right)     (42)

and the activation function f(·) is

    f(v) = \frac{1}{1 + \exp(-mv)}                                                       (43)

The value of m is to be selected in such a way that the leaf node representing the cluster of x is able to produce an activation value greater than a certain threshold, i.e.,

    u_p(x) \geq 1 - \epsilon                                                             (44)

where p is the activated leaf node representing the cluster. (If the condition in Equation (44) holds, then the total activation of the rest of the leaf nodes is less than ε.) Let us further assume that

    \min_{i} |w_i' x + \theta_i| = \delta                                                (45)

i.e., δ is the margin between the sample x and the closest hyperplane. In that case, the minimum activation of the maximally activated leaf node representing the cluster is

    u_p(x) \geq \left( \frac{1}{1 + \exp(-m\delta)} \right)^{l}                          (46)

considering that the pattern x satisfies all the conditions governed by the hyperplanes w_i on the path from the root to the leaf node p,

    (w_i' x + \theta_i) \, lr(i, p) > 0                                                  (47)

In that case, from Equation (44),

    \frac{1}{\left( 1 + \exp(-m\delta) \right)^{l}} \geq 1 - \epsilon                    (48)

such that

    l \log\left( 1 + \exp(-m\delta) \right) \leq -\log(1 - \epsilon)                     (49)

Assuming exp(−mδ) and ε to be small, we have

    l \exp(-m\delta) \leq \epsilon                                                       (50)

i.e.,

    m \geq \frac{1}{\delta} \log(l/\epsilon)                                             (51)

References

[1] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data. Upper Saddle River, NJ: Prentice-Hall Inc., 1988.
[2] A. K. Jain, M. N. Murty, and P. J. Flynn, “Data clustering: A review,” ACM Computing Surveys, vol. 31, pp. 264–323, 1999.
[3] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms. New York: Plenum Press, 1981.
[4] Y. Zhao and G. Karypis, “Hierarchical clustering algorithms for document datasets,” Data Mining and Knowledge Discovery, vol. 10, pp. 141–168, 2005.
[5] J. Wu, H. Xiong, and J. Chen, “Towards understanding hierarchical clustering: A data distribution perspective,” Neurocomputing, vol. 73, pp. 2319–2330, 2009.
[6] M. H. Chehreghani, H. Abolhassani, and M. H. Chehreghani, “Density link-based methods for clustering web pages,” Decision Support Systems, vol. 47, pp. 374–382, 2009.
[7] R. C. Veras, S. R. L. Meira, A. L. I. Oliveira, and B. J. M. Melo, “Comparative study of clustering techniques for the organization of software repositories,” in Proc. 19th International Conf. on Tools with Artificial Intelligence (ICTAI), pp. 371–377, 2007.
[8] E. Pampalk, G. Widmer, and A. Chan, “A new approach to hierarchical clustering and structuring of data with self-organizing maps,” Intelligent Data Analysis, vol. 8, pp. 131–149, 2004.
[9] M. Dittenbach, A. Rauber, and D. Merkl, “Uncovering hierarchical structure in data using the growing hierarchical self-organizing map,” Neurocomputing, vol. 48, pp. 199–216, 2002.
[10] G. Tsekouras, H. Sarimveis, E. Kavakli, and G. Bafas, “A hierarchical fuzzy-clustering approach to fuzzy modeling,” Fuzzy Sets and Systems, vol. 150, pp. 245–266, 2005.
[11] J. P. Herbert and J. Yao, “Growing hierarchical self-organizing maps for web mining,” in Proc. IEEE/WIC/ACM International Conf. on Web Intelligence (WI ’07), pp. 299–302, 2007.
[12] B. Liu, Y. Xia, and P. Yu, “Clustering through decision tree construction,” in Proceedings of the Ninth International Conference on Information and Knowledge Management, (McLean, VA, USA), pp. 20–29, ACM, 2000.
[13] J. Struyf and S. Dzeroski, “Clustering trees with instance level constraints,” in Proc. European Conf. Machine Learning (ECML 2007), pp. 359–370, 2007.
[14] H. Blockeel, L. De Raedt, and J. Ramon, “Top-down induction of clustering trees,” in Proc. Int. Conf. Machine Learning (ICML 1998), pp. 55–63, 1998.
[15] J. Basak and R. Krishnapuram, “Interpretable hierarchical clustering by constructing an unsupervised decision tree,” IEEE Transactions on Knowledge and Data Engineering, vol. 17, pp. 121–132, 2005.
[16] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification (2nd Edition). New York: Wiley, 2001.
[17] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees. New York: Chapman & Hall, 1983.
[18] J. R. Quinlan, C4.5: Programs for Machine Learning. San Francisco, CA, USA: Morgan Kaufmann, 1993.
[19] M. Held and J. M. Buhmann, “Unsupervised on-line learning of decision trees for hierarchical data analysis,” in Neural Information Processing Systems (NIPS 1997) (M. Jordan, M. Kearns, and S. Solla, eds.), vol. 10, USA: MIT Press, 1998.
[20] J. Basak, “Online adaptive clustering in a decision tree framework,” in Proc. IAPR Intl. Conf. Pattern Recognition (ICPR 2008), Tampa, Florida, USA, 2008.
[21] J. Basak, “Online adaptive decision trees,” Neural Computation, vol. 16, pp. 1959–1981, 2004.
[22] J. Basak, “Online adaptive decision trees: Pattern classification and function approximation,” Neural Computation, vol. 18, pp. 2062–2101, 2006.
[23] J. H. Friedman, “Greedy function approximation: A gradient boosting machine,” Annals of Statistics, vol. 29, pp. 1189–1232, 2001.
[24] J. C. Bezdek and P. Castelaz, “Prototype classification and feature selection with fuzzy sets,” IEEE Trans. on Systems, Man and Cybernetics, vol. 7, pp. 87–92, 1977.
[25] B. D. Ripley, Pattern Recognition and Neural Networks. Cambridge, UK: Cambridge University Press, 1996.
[26] R. A. Fisher, “The use of multiple measurements in taxonomic problems,” Annals of Eugenics, vol. 7, pp. 179–188, 1936.
[27] C. Blake and C. Merz, “UCI repository of machine learning databases,” http://www.ics.uci.edu/~mlearn/MLRepository.html, 1998.
[28] J. W. Smith, J. E. Everhart, W. C. Dickson, W. C. Knowler, and R. S. Johannes, “Using the ADAP learning algorithm to forecast the onset of diabetes mellitus,” in Proceedings of the Twelfth Annual Symposium on Computer Application in Medical Care, (Washington, D.C., USA), pp. 261–265, IEEE Computer Society Press, 1988.
[29] W. N. Street, W. H. Wolberg, and O. L. Mangasarian, “Nuclear feature extraction for breast tumor diagnosis,” in Proceedings IS&T/SPIE International Symposium on Electronic Imaging: Science and Technology, vol. 1905, (San Jose, CA), pp. 861–870, 1993.
[30] W. N. Street, O. L. Mangasarian, and W. H. Wolberg, “An inductive learning approach to prognostic prediction,” in Proceedings of the Twelfth International Conference on Machine Learning (A. Prieditis and S. Russell, eds.), pp. 522–530, San Francisco: Morgan Kaufmann, 1995.
[31] K. Nakai and M. Kanehisa, “Expert system for predicting protein localization sites in gram-negative bacteria,” PROTEINS: Structure, Function, and Genetics, vol. 11, pp. 95–110, 1991.
[32] P. Horton and K. Nakai, “A probabilistic classification system for predicting the cellular localization sites of proteins,” in Intelligent Systems in Molecular Biology, (St. Louis, USA), pp. 109–115, 1996.
[33] J. R. Quinlan, “Improved use of continuous attributes in C4.5,” Journal of Artificial Intelligence Research, vol. 4, pp. 77–90, 1996.
[34] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander, “Molecular classification of cancer: class discovery and class prediction by gene expression monitoring,” Science, vol. 286, pp. 531–537, 1999.
[35] T. Lange, V. Roth, M. L. Braun, and J. M. Buhmann, “Stability-based validation of clustering solutions,” Neural Computation, vol. 16, pp. 1299–1323, 2004.
[36] C. Ding and X. He, “K-means clustering via principal component analysis,” in Proc. Intl. Conf. Machine Learning (ICML 2004), pp. 225–232, 2004.
