Weighting Features for Partition around Medoids Using the Minkowski Metric

Renato Cordeiro de Amorim and Trevor Fenner
Department of Computer Science and Information Systems
Birkbeck University of London, Malet Street WC1E 7HX, UK
{renato,trevor}@dcs.bbk.ac.uk
Abstract. In this paper we introduce the Minkowski weighted partition around medoids algorithm (MW-PAM). This extends the popular partition around medoids algorithm (PAM) by automatically assigning K weights to each feature in a dataset, where K is the number of clusters. Our approach utilizes the within-cluster variance of features to calculate the weights and uses the Minkowski metric. We show through many experiments that MW-PAM, particularly when initialized with the Build algorithm (also using the Minkowski metric), is superior to other medoid-based algorithms in terms of both accuracy and identification of irrelevant features.

Keywords: PAM, medoids, Minkowski metric, feature weighting, Build, Lp space.
1 Introduction
Clustering algorithms follow a data-driven approach to partition a dataset into K homogeneous groups S = {S1, S2, ..., SK}. These algorithms can be very useful when creating taxonomies and have been used in many different scenarios [1–5]. The heavy research effort in this field has generated a number of clustering algorithms, K-Means [6, 7] arguably being the most popular of them. K-Means iteratively partitions a dataset I around K synthetic centroids c1, c2, ..., cK, each being the centre of gravity of its corresponding cluster.

Although popular, there are scenarios in which K-Means is not the best option. For instance, the clustering of malware [8], countries or physical objects would require the use of realistic centroids representing each cluster, while K-Means centroids are synthetic. With this in mind, Kaufman and Rousseeuw [4] introduced the partition around medoids algorithm (PAM). PAM can provide realistic representations of clusters because it uses medoids rather than centroids. A medoid mk is the entity with the lowest sum of dissimilarities from all other entities in the same cluster Sk; so it does represent an actual entity in the cluster.

Because of its widespread use, the weaknesses of PAM are also well known, some shared with K-Means. For instance, PAM treats all features equally, irrespective of their actual relevance, while it is intuitive that different features may have different degrees of relevance in the clustering. Another weakness is
that the outcome of PAM depends heavily on the initial medoids selected. These weaknesses have been addressed in relation to K-Means by Huang et al. [9–11], who introduced the weighted K-Means algorithm (WK-Means), which automatically assigns lower weights to less relevant features, and by Mirkin [12] with his intelligent K-Means (iK-Means), a heuristic algorithm used to find the number of clusters in a dataset, as well as the initial centroids. We have extended both of these by introducing the intelligent Minkowski weighted K-Means (iMWK-Means), which we have shown to be superior to both WK-Means and iK-Means [13].

In this paper we propose a new algorithm, called Minkowski weighted partition around medoids (MW-PAM), which produces realistic cluster representations by using medoids. This also avoids possible difficulties in finding the centre of gravity of a cluster under the Minkowski metric, as finding this centroid is no longer required. We initialize MW-PAM with medoids obtained with Build [4], a popular algorithm used to initialize PAM, and present versions using the Minkowski and Euclidean metrics for comparison. We experimentally demonstrate that our proposed methods are able to discern between relevant and irrelevant features, as well as produce better accuracy than other medoid-based algorithms.
2 Related Work
For a dataset I to be partitioned into K clusters S = {S1, S2, ..., SK}, the partition around medoids algorithm attempts to minimise the K-Means criterion:

W(S, C) = \sum_{k=1}^{K} \sum_{i \in S_k} d(y_i, c_k),        (1)
where d(yi, ck) is a dissimilarity measure, normally the square of the Euclidean distance, between entity yi and ck, the centroid of cluster Sk. K-Means defines a centroid ck ∈ C as the centre of gravity of the cluster Sk. This is a synthetic value and not normally a member of Sk.

The partition around medoids algorithm (PAM) takes a different approach by using medoids rather than centroids. The medoid is defined to be the entity mk ∈ Sk that has the lowest sum of dissimilarities from all other entities yi ∈ Sk. PAM minimises (1), with ck replaced by mk, partitioning the dataset I into K clusters, S = {S1, S2, ..., SK}, as follows:

1. Select K entities at random as initial medoids m1, m2, ..., mK; S ← ∅.
2. Assign each of the entities yi ∈ I to the cluster Sk of the closest medoid mk, creating a new partition S' = {S'1, S'2, ..., S'K}.
3. If S' = S, terminate with the clustering S = {S1, S2, ..., SK} and the corresponding medoids {m1, m2, ..., mK}. Otherwise, S ← S'.
4. For each of the clusters in S = {S1, S2, ..., SK}, set its medoid mk to be the entity yj ∈ Sk with the lowest sum of dissimilarities from all other entities yi ∈ Sk. Return to step 2.
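For concreteness, the following is a minimal sketch of these steps (ours, in Python with numpy); the function name, the squared-Euclidean default dissimilarity and the iteration cap are our own choices and not part of the original algorithm specification.

```python
import numpy as np

def pam(X, K, dissim=lambda A, b: np.sum((A - b) ** 2, axis=-1),
        max_iter=100, seed=0):
    """Minimal PAM sketch: random initial medoids, then alternating
    assignment (step 2) and medoid update (step 4) until the partition repeats."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    medoids = rng.choice(n, size=K, replace=False)        # step 1
    labels = None
    for _ in range(max_iter):
        # step 2: assign every entity to the cluster of its closest medoid
        d = np.stack([dissim(X, X[m]) for m in medoids], axis=1)
        new_labels = d.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                         # step 3: partition unchanged
        labels = new_labels
        # step 4: medoid = entity with the lowest sum of dissimilarities in its cluster
        for k in range(K):
            members = np.where(labels == k)[0]
            if members.size == 0:
                continue                                  # empty cluster: keep old medoid
            within = np.array([dissim(X[members], X[j]).sum() for j in members])
            medoids[k] = members[within.argmin()]
    return labels, medoids
```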
The PAM algorithm inherits some of the weaknesses of K-Means, including the necessity of knowing K beforehand, the high dependence on the initial medoids, and its uniform treatment of all features of the entities in the dataset I. In this paper we assume K to be known, but deal with the other weaknesses. We intuitively believe that PAM should benefit from feature weighting, so that less relevant features have a lower impact on the final set of clusters.

In order to apply feature weights in K-Means and generalize to the Minkowski metric, we introduced a weighted dissimilarity measure [13]. Equation (2) shows the weighted dissimilarity between the V-dimensional entities yi and mk:

d_p(y_i, m_k) = \sum_{v=1}^{V} w_{kv}^p |y_{iv} - m_{kv}|^p,        (2)
where p is the exponent of the Minkowski metric and wkv is a cluster- and feature-specific weight, accommodating the fact that a feature v may have different degrees of relevance for different clusters Sk. Our dissimilarity measure in (2) allowed us to generalise the work of Huang et al. [9–11], which used the Euclidean metric (i.e. p = 2), and use instead the following Minkowski-metric weighted K-Means criterion:

W_p(S, C, w) = \sum_{k=1}^{K} \sum_{i \in S_k} \sum_{v=1}^{V} w_{kv}^p |y_{iv} - c_{kv}|^p,        (3)
subject to \sum_{v=1}^{V} w_{kv} = 1 for each cluster. We use crisp clustering, in which an entity belongs solely to one cluster.

We followed [9–11] and [13] in the calculation of the weights wkv. This takes into account the within-cluster variance of features: features with a smaller within-cluster variance are given a higher weight, according to the following equation:

w_{kv} = \frac{1}{\sum_{u \in V} [D_{kv}/D_{ku}]^{1/(p-1)}},        (4)
where D_{kv} = \sum_{i \in S_k} |y_{iv} - c_{kv}|^p.

There are indeed other possibilities, even when using the Euclidean dissimilarity d(y, m) = (y - m)^T (y - m) = \sum_{v=1}^{V} (y_v - m_v)^2 (where T denotes the transpose vector). Alternatively, one could use one of the following.

(i) Linear combinations of the original features, extending the inner product to include a weight matrix W, so that d(y, m) = (y - m)^T W (y - m) = \sum_{ij} w_{ij} (y_i - m_i)(y_j - m_j). There have been different approaches to determining suitable weights in W, for instance: by making major changes in the K-Means criterion (1), dividing it by an index representing the 'between-cluster dispersion' [14]; by adjusting the feature weights using a heuristic procedure [15]; or by using extra information regarding pairwise constraints [16, 17].

(ii) Non-linear weights, setting the dissimilarity as below:

d(y, m) = (y - m)^T W^\beta (y - m) = \sum_{v=1}^{V} w_v^\beta (y_v - m_v)^2,        (5)
where W is now a diagonal feature-weight matrix. This approach seems to have been introduced by Makarenkov and Legendre [18], and later extended by Frigui and Nasraoui [19] to allow cluster-specific weights, in both cases just for β = 2 (we note that this is the same as (3) for p = 2). The possibility of β being a user-defined parameter was introduced by Huang et al. [9–11], still using the Euclidean metric. It is worth noting that only for β = 2 are the weights feature-rescaling factors, allowing their use in the data-preprocessing stage regardless of the clustering criterion in use. For a more comprehensive review, see [13, 11].
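Returning to the weighting scheme used in this paper, the short sketch below illustrates (2) and (4) (ours, assuming numpy). The function names and the small eps safeguard are not from the paper, and p = 1 would require the limiting case of (4), which is not handled here.

```python
import numpy as np

def minkowski_weighted_dissim(Y, m, w, p):
    """Eq. (2): sum_v w_v^p * |y_v - m_v|^p; Y may be a single entity or a matrix of entities."""
    return np.sum((w ** p) * np.abs(Y - m) ** p, axis=-1)

def update_weights(X, labels, centres, p, eps=1e-12):
    """Eq. (4): cluster-specific feature weights from the within-cluster dispersions
    D_kv = sum_{i in S_k} |y_iv - c_kv|^p; the weights of each cluster sum to 1."""
    K, V = centres.shape
    D = np.empty((K, V))
    for k in range(K):
        D[k] = (np.abs(X[labels == k] - centres[k]) ** p).sum(axis=0) + eps
    w = np.empty((K, V))
    for k in range(K):
        # w_kv = 1 / sum_u (D_kv / D_ku)^(1/(p-1)), assuming p != 1
        w[k] = 1.0 / ((D[k][:, None] / D[k][None, :]) ** (1.0 / (p - 1.0))).sum(axis=1)
    return w
```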
3 Minkowski Weighted PAM
The MWK-Means algorithm partitions entities around synthetic centroids in Lp space. This presents us with two problems. Firstly, MWK-Means requires the calculation of the centre of gravity of each cluster. This centre is easy to calculate only in the cases p = 1 and p = 2, when it is equivalent to the median or mean, respectively. For other values of p, one can use a steepest-descent method [13], but this only gives an approximation to the real centre. The second problem we deal with here is that synthetic centroids cannot always be used: a clustering problem may require the centroids to have realistic values or even to represent real entities in the dataset.

In order to deal with these two problems, we have designed the Minkowski weighted PAM algorithm (MW-PAM). This partitions entities around medoids using the weighted dissimilarity defined in (2), adding the constraint of using medoids rather than centroids in the minimisation of the criterion shown in (3). The steps of MW-PAM in partitioning the data into K clusters are as follows:

1. Select K entities ∈ I at random for the initial set of medoids M = {m1, m2, ..., mK}; set a value for p; S ← ∅.
2. Assign each of the entities yi ∈ I to the cluster Sk of its closest medoid mk, using the dissimilarity in (2), generating the partition S' = {S'1, S'2, ..., S'K}.
3. If S' = S, terminate with the clustering S = {S1, S2, ..., SK} and medoids {m1, m2, ..., mK}. Otherwise, S ← S'.
4. For each of the clusters in S = {S1, S2, ..., SK}, set its medoid mk to be the entity yi ∈ Sk with the smallest weighted sum of the Minkowski dissimilarities from all other entities in the same cluster Sk. Return to step 2.

Because of the non-deterministic nature of MW-PAM, we ran it 50 times in each of our experiments (discussed in Section 4). Of course, if we were to provide MW-PAM with good initial medoids rather than use random entities, we could make MW-PAM deterministic. In an attempt to solve the same problem for PAM, Kaufman and Rousseeuw [4] presented the Build algorithm:

1. Pick the entity m1 ∈ I with the lowest sum of dissimilarities from all other entities in I as the first medoid. Set M = {m1}.
2. The next medoid is the farthest entity from M: select the entity y ∈ I − M for which E_y = \sum_{m \in M} d(y, m) is maximal, and add it to M.
3. If the cardinality of M equals K, stop; otherwise go back to step 2.

It is intuitively consistent to use the Minkowski dissimilarities in Build with the same exponent p as we use in MW-PAM.
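A sketch of how the two pieces fit together is given below (ours, in Python with numpy, reusing minkowski_weighted_dissim and update_weights from the sketch in Section 2). The paper does not state exactly where the weights are recomputed; here they are updated after each reassignment, using the medoids in place of centroids in (4). The O(N^2) pairwise matrix in the Build step is only suitable for modest dataset sizes.

```python
import numpy as np

def minkowski_build(X, K, p):
    """Build initialization as described above, with Minkowski-p dissimilarities."""
    D = (np.abs(X[:, None, :] - X[None, :, :]) ** p).sum(axis=2)  # pairwise dissimilarities
    medoids = [int(D.sum(axis=1).argmin())]          # step 1: the most central entity
    while len(medoids) < K:                          # steps 2-3: add the farthest entity
        E = D[:, medoids].sum(axis=1)
        E[medoids] = -np.inf                         # exclude current medoids
        medoids.append(int(E.argmax()))
    return np.array(medoids)

def mw_pam(X, K, p, medoids=None, max_iter=100, seed=0):
    """MW-PAM sketch: PAM with the weighted dissimilarity (2) and weight update (4)."""
    rng = np.random.default_rng(seed)
    n, V = X.shape
    medoids = (rng.choice(n, size=K, replace=False) if medoids is None
               else np.asarray(medoids).copy())      # step 1: random or Build medoids
    w = np.full((K, V), 1.0 / V)                     # start from uniform feature weights
    labels = None
    for _ in range(max_iter):
        # step 2: assign entities to the closest medoid under dissimilarity (2)
        d = np.stack([minkowski_weighted_dissim(X, X[m], w[k], p)
                      for k, m in enumerate(medoids)], axis=1)
        new_labels = d.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                    # step 3: partition unchanged
        labels = new_labels
        w = update_weights(X, labels, X[medoids], p)
        for k in range(K):                           # step 4: weighted medoid update
            members = np.where(labels == k)[0]
            if members.size == 0:
                continue
            within = [minkowski_weighted_dissim(X[members], X[j], w[k], p).sum()
                      for j in members]
            medoids[k] = members[int(np.argmin(within))]
    return labels, medoids, w

# "M Build + MW-PAM": a deterministic run from Minkowski Build medoids, e.g.
# labels, medoids, w = mw_pam(X, K=3, p=1.2, medoids=minkowski_build(X, K=3, p=1.2))
```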
4 Setting of the Experiments and Results
Our experiments aim to investigate whether MW-PAM is able to discern between relevant and irrelevant features, as well as produce better accuracy than other popular algorithms. It is very difficult to quantify the relevance of features in real-world datasets, so we have added extra features consisting of uniformly distributed random noise to each dataset we use. These features will clearly be irrelevant for clustering purposes.

We have performed experiments on 6 real-world-based datasets and 20 Gaussian mixture models (GMs). The 6 real-world-based datasets were derived from two popular real-world datasets downloaded from the UCI repository [20]:

(i) Iris dataset. This dataset contains 150 flower specimens described over four features, partitioned into three clusters. From this dataset, we generated two others by adding two and four extra noise features.
(ii) Hepatitis dataset. This dataset comprises 155 entities representing subjects with hepatitis, described over 19 features and partitioned into two clusters. From this standardized dataset, we generated two others by adding 10 and 20 noise features.

The GMs were generated with the help of Netlab [21]. Each mixture contains K hyperspherical clusters of variance 0.1. Their centres were independently generated from a Gaussian distribution with zero mean and unit variance. We generated 20 GMs, five in each of the following four groups:

(i) 500 entities over 6 features, partitioned into 5 clusters, to which we added two noise features, making a total of 8 features (500x6-5 +2);
(ii) 500 entities over 15 features, partitioned into 5 clusters, to which we added 10 noise features, totalling 25 features (500x15-5 +10);
(iii) 1000 entities over 25 features, partitioned into 12 clusters, to which we added 25 noise features, totalling 50 features (1000x25-12 +25);
(iv) 1000 entities over 50 features, partitioned into 12 clusters, to which we added 25 noise features, totalling 75 features (1000x50-12 +25).

After generating all datasets, we standardized them. The hepatitis dataset, unlike all the others, contains features with categorical data. We transformed these features into numerical features by creating a binary dummy feature for each of the existing categories. A dummy feature would be set to one if the original categorical feature was in the category the dummy feature represents,
and zero otherwise. At the end, the original categorical features were removed and each dummy feature had its values reduced by their average. All other features were standardized by Equation (6):

z_{iv} = \frac{y_{iv} - \bar{I}_v}{0.5 \cdot \mathrm{range}(I_v)},        (6)
where \bar{I}_v represents the average of feature v over the whole dataset I. This process is described in more detail in [12]. We chose it because of our previous success in utilizing it for K-Means [8, 22, 13].

Regarding the accuracy, we acquired all labels for each dataset and computed a confusion matrix after each experiment. With this matrix we mapped the clustering generated by the algorithms to the given clustering, and used the Rand index as the proportion of pairs of entities in I for which the two clusterings agree on whether or not the pairs are in the same cluster. Formally:

R = \frac{a + b}{a + b + c},        (7)
where a is the number of pairs of entities belonging to the same cluster in both clusterings, b is the number of pairs belonging to different clusters in both clusterings, and c is the number of remaining pairs of entities.

Here we compare a total of 8 algorithms:

(i) PAM: partition around medoids;
(ii) WPAM: a version of PAM using the feature-weighting method of Huang et al. [9–11] in which, in (2), p is used as an exponent of the weights, but the distance is Euclidean;
(iii) MPAM: PAM using the Minkowski metric with an arbitrary p;
(iv) MW-PAM: our Minkowski weighted PAM;
(v) Build + PAM: PAM initialized with the medoids generated by Build;
(vi) Build + WPAM: WPAM also initialized with Build;
(vii) M Build + MPAM: MPAM initialized using the Minkowski dissimilarity-based Build;
(viii) M Build + MW-PAM: MW-PAM also initialized using the Minkowski dissimilarity-based Build.

We discarded results for any of the algorithms in two cases: (i) if the number of clusters generated by the algorithm was smaller than the specified number K; (ii) if the algorithm failed to converge within 1,000 iterations. We ran each of the non-deterministic algorithms, PAM, WPAM, MPAM and MW-PAM, 50 times.
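Before turning to the results, the following is a small sketch (ours, assuming numpy and integer-coded labels) of the range standardization (6) and the pair-counting agreement (7); the confusion-matrix mapping between clusterings is not shown.

```python
import numpy as np

def standardize(Y):
    """Eq. (6): z_iv = (y_iv - mean_v) / (0.5 * range_v); assumes no constant features."""
    return (Y - Y.mean(axis=0)) / (0.5 * (Y.max(axis=0) - Y.min(axis=0)))

def rand_index(u, v):
    """Eq. (7): fraction of entity pairs on which two clusterings agree, i.e. pairs
    put in the same cluster by both (a) or in different clusters by both (b)."""
    u, v = np.asarray(u), np.asarray(v)
    iu = np.triu_indices(len(u), k=1)             # every pair counted once
    same_u = (u[:, None] == u[None, :])[iu]
    same_v = (v[:, None] == v[None, :])[iu]
    return np.mean(same_u == same_v)              # (a + b) / (a + b + c)
```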
4.1 Results Given a Good p
In this first set of experiments we would like to determine the best possible accuracy for each algorithm. With this in mind, we provide each algorithm that needs a value for p with the best value we could find. We found this by testing all values of p between 1 and 6 in steps of 0.1, and chose the one with the highest accuracy, or highest average accuracy for the non-deterministic algorithms.

Table 1 shows the results for each of the three iris-based and three hepatitis-based datasets. Regarding the former, these are the original dataset and the ones with two and four extra noise features. We observe that the highest accuracy of 97.3% was achieved by Build + WPAM and M Build + MW-PAM for the original
dataset and the version with four extra noise features. We are not aware of any unsupervised algorithm having attained such high accuracy for this dataset. The set of medoids obtained from the Build algorithm was not the same in the two cases. Although the accuracy of both Build + WPAM and M Build + MW-PAM was the same for all three datasets, Table 2 shows that M Build + MW-PAM generated lower weights than Build + WPAM for the irrelevant features in the version of iris with four extra noise features.

Our experiments with the hepatitis dataset show that, this time, M Build + MW-PAM had higher accuracy than Build + WPAM for two out of the three datasets and the same accuracy for the third. This was again higher than that of the other deterministic algorithms. Even though the same or slightly better maximum accuracy was obtained by the non-deterministic algorithms, their average accuracy was always less than that of M Build + MW-PAM.

The results of the experiments with the GMs in Table 3 were more variable. They show that PAM had the lowest average performance of 32.1% over the 20 GMs, while M Build + MW-PAM had the best with 65.4% accuracy, closely followed by Build + WPAM with 64.7%.

Our tables show that increasing the number of noise features resulted in slightly better results for a few datasets. As the increase appeared in only some experiments, this may not be statistically significant. We intend to address this in future research.

The processing times presented in all tables relate to a single run. The non-deterministic algorithms tended to have lower processing times; however, these were run 50 times.

Table 1. Experiments with the iris and hepatitis datasets. The results are shown per row for the original iris dataset, with +2 and +4 noise features, and the original hepatitis dataset, with +10 and +20 noise features. The algorithms marked with a * use the value of p solely as the exponent of the weights; the distance is Euclidean.
Iris:
Algorithm           Dataset     avg    sd     Max    p     Time (s)
PAM                 original    78.8   16.7   90.7   -     0.05±0.01
PAM                 +2 noise    70.0   10.3   90.0   -     0.04±0.01
PAM                 +4 noise    66.6   8.5    77.3   -     0.04±0.01
WPAM*               original    91.0   10.4   96.0   2.1   0.10±0.04
WPAM*               +2 noise    81.8   13.8   96.0   3.1   0.10±0.04
WPAM*               +4 noise    82.6   10.9   96.0   4.8   0.10±0.05
MPAM                original    85.4   12.2   92.0   1.2   0.09±0.03
MPAM                +2 noise    75.4   11.8   93.3   1.2   0.08±0.03
MPAM                +4 noise    73.0   10.6   88.0   1.3   0.08±0.02
MW-PAM              original    90.6   9.9    96.0   4.2   0.12±0.05
MW-PAM              +2 noise    83.0   13.8   96.0   2.7   0.20±0.08
MW-PAM              +4 noise    81.8   15.3   97.3   3.0   0.25±0.08
Build + PAM         original    -      -      90.0   -     0.56
Build + PAM         +2 noise    -      -      78.0   -     0.51
Build + PAM         +4 noise    -      -      77.3   -     0.51
Build + WPAM*       original    -      -      97.3   1.1   0.56
Build + WPAM*       +2 noise    -      -      96.7   1.2   0.56
Build + WPAM*       +4 noise    -      -      97.3   1.7   0.56
M Build + MPAM      original    -      -      92.0   1.0   0.52
M Build + MPAM      +2 noise    -      -      90.7   1.1   0.54
M Build + MPAM      +4 noise    -      -      82.0   1.0   0.51
M Build + MW-PAM    original    -      -      97.3   1.1   0.60
M Build + MW-PAM    +2 noise    -      -      96.7   1.2   0.65
M Build + MW-PAM    +4 noise    -      -      97.3   1.1   0.64

Hepatitis:
Algorithm           Dataset     avg    sd     Max    p     Time (s)
PAM                 original    68.5   5.0    79.3   -     0.08±0.02
PAM                 +10 noise   66.1   8.0    74.8   -     0.06±0.01
PAM                 +20 noise   62.7   9.2    81.3   -     0.06±0.01
WPAM*               original    78.8   0.9    80.0   1.0   0.06±0.002
WPAM*               +10 noise   78.7   0.7    80.0   1.0   0.07±0.002
WPAM*               +20 noise   78.4   0.8    80.0   1.0   0.07±0.003
MPAM                original    70.2   4.9    81.3   1.1   0.14±0.05
MPAM                +10 noise   69.0   6.9    76.1   1.1   0.18±0.04
MPAM                +20 noise   67.6   9.5    81.3   1.4   0.24±0.04
MW-PAM              original    78.7   0.8    80.0   1.0   0.08±0.0043
MW-PAM              +10 noise   78.8   0.7    80.0   1.0   0.08±0.004
MW-PAM              +20 noise   78.8   0.9    80.0   1.0   0.08±0.005
Build + PAM         original    -      -      69.7   -     0.45
Build + PAM         +10 noise   -      -      72.9   -     0.37
Build + PAM         +20 noise   -      -      52.9   -     0.38
Build + WPAM*       original    -      -      78.7   1.0   0.41
Build + WPAM*       +10 noise   -      -      80.0   1.0   0.42
Build + WPAM*       +20 noise   -      -      76.8   1.0   0.42
M Build + MPAM      original    -      -      70.3   1.0   0.41
M Build + MPAM      +10 noise   -      -      76.8   1.0   0.39
M Build + MPAM      +20 noise   -      -      75.5   1.5   0.56
M Build + MW-PAM    original    -      -      80.0   1.0   0.42
M Build + MW-PAM    +10 noise   -      -      80.0   1.0   0.42
M Build + MW-PAM    +20 noise   -      -      80.0   1.0   0.46
Table 2. Final weights given by the experiments with the iris dataset. It shows one row per cluster for each dataset and algorithm. NF stands for noise feature. Sepal length 0.000 0.000 0.000 M Build+MWPAM 0.000 0.000 0.000 +2 Noise features Build+WPAM 0.000 0.000 0.002 M Build+MWPAM 0.004 0.001 0.013 +4 Noise features Build+WPAM 0.044 0.028 0.099 M Build+MWPAM 0.000 0.000 0.000 Original iris Build+WPAM
4.2
Sepal width 0.000 0.000 0.001 0.001 0.000 0.045
Petal length 0.544 1.000 0.995 0.525 0.967 0.922
Features Petal width NF 1 0.456 0.000 0.004 0.474 0.033 0.032 -
NF 2 -
NF 3 -
NF 4 -
0.002 0.000 0.018 0.011 0.000 0.164
0.832 0.990 0.932 0.499 0.884 0.690
0.166 0.010 0.049 0.486 0.116 0.133
0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000
-
-
0.071 0.007 0.093 0.001 0.000 0.045
0.435 0.759 0.468 0.525 0.967 0.922
0.425 0.204 0.250 0.474 0.033 0.032
0.008 0.000 0.025 0.000 0.000 0.000
0.009 0.000 0.027 0.000 0.000 0.000
0.007 0.001 0.022 0.000 0.000 0.000
0.001 0.001 0.016 0.000 0.000 0.000
4.2 Learning p
We have shown in the previous section that, given a good p, the M Build + MW-PAM algorithm provides accuracy and detection of irrelevant features that are competitive with, or superior to, those of the other algorithms. Of course, this raises the question of how one can obtain a good value for p.

Table 3. Results of the experiments with the Gaussian mixture datasets. The results are shown per row for the 500x6-5 (+2), 500x15-5 (+10), 1000x25-12 (+25) and 1000x50-12 (+25) configurations, respectively. The algorithms marked with a * use the value of p solely as the exponent of the weights; the distance is Euclidean.

Algorithm           GM config        avg    sd    Max    p           Time (s)
PAM                 500x6-5 +2       38.8   6.3   62.6   -           0.14±0.04
PAM                 500x15-5 +10     35.7   5.5   51.4   -           0.15±0.02
PAM                 1000x25-12 +25   19.6   1.9   26.3   -           0.37±0.08
PAM                 1000x50-12 +25   34.5   5.0   47.5   -           0.46±0.09
WPAM*               500x6-5 +2       50.3   5.0   58.1   4.96±0.79   12.05±0.98
WPAM*               500x15-5 +10     62.0   5.0   67.4   4.86±0.83   17.08±2.96
WPAM*               1000x25-12 +25   57.3   3.1   61.4   3.58±0.13   122.70±7.71
WPAM*               1000x50-12 +25   76.7   1.1   77.9   3.30±0.43   114.60±4.17
MPAM                500x6-5 +2       43.3   3.8   50.8   1.10±0.15   11.37±1.06
MPAM                500x15-5 +10     40.8   2.7   44.6   1.06±0.13   11.65±2.0822
MPAM                1000x25-12 +25   25.8   1.0   27.0   1.00±0.00   103.06±1.04
MPAM                1000x50-12 +25   45.0   1.9   47.1   1.08±0.08   113.06±10.7
MW-PAM              500x6-5 +2       44.0   4.2   51.5   4.71±0.81   11.93±1.28
MW-PAM              500x15-5 +10     52.6   4.8   58.3   3.86±0.25   14.43±1.96
MW-PAM              1000x25-12 +25   50.8   3.3   55.3   3.98±0.19   118.70±3.11
MW-PAM              1000x50-12 +25   74.1   1.7   76.0   3.14±0.30   110.92±3.84
Build + PAM         500x6-5 +2       39.0   8.6   54.4   -           9.90±0.15
Build + PAM         500x15-5 +10     37.5   3.2   42.4   -           9.86±0.02
Build + PAM         1000x25-12 +25   20.3   1.2   22.1   -           99.21±0.32
Build + PAM         1000x50-12 +25   47.1   3.8   51.9   -           99.61±0.47
Build + WPAM*       500x6-5 +2       43.5   8.2   61.4   4.48±0.97   12.19±0.88
Build + WPAM*       500x15-5 +10     65.8   5.3   71.2   4.46±1.09   15.16±1.40
Build + WPAM*       1000x25-12 +25   54.9   6.0   64.0   3.80±0.69   118.92±7.95
Build + WPAM*       1000x50-12 +25   94.6   3.4   97.5   2.60±0.43   112.42±6.41
M Build + MPAM      500x6-5 +2       48.5   7.8   62.6   1.12±0.25   11.07±0.76
M Build + MPAM      500x15-5 +10     48.5   7.7   62.0   1.00±0.0    10.86±0.39
M Build + MPAM      1000x25-12 +25   26.2   2.7   30.9   1.02±0.04   105.78±6.02
M Build + MPAM      1000x50-12 +25   68.4   6.5   75.9   1.08±0.13   114.80±14.36
M Build + MW-PAM    500x6-5 +2       43.3   7.7   60.2   2.89±1.36   12.20±0.71
M Build + MW-PAM    500x15-5 +10     67.0   8.5   76.2   4.06±1.07   15.42±2.18
M Build + MW-PAM    1000x25-12 +25   55.1   5.4   63.0   3.88±0.63   118.70±3.69
M Build + MW-PAM    1000x50-12 +25   96.4   1.8   98.2   1.84±0.48   121.07±15.46
From our experiments, it is clear that the best value of p depends on the dataset. We have solved the same issue for MWK-Means using a semi-supervised approach [13], and we suggest here a similar solution for the medoid-based algorithms. We repeated the following procedure 50 times:

1. Select 20% of the entities in the dataset I at random, keeping a record of their cluster indices.
2. Run M Build + MW-PAM on the whole dataset I for each candidate value of p.
3. Choose as the optimal p the one with the highest accuracy among the selected entities.
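A sketch of this procedure follows (ours, reusing minkowski_build and mw_pam from the sketch in Section 3). The majority-vote mapping below is a simplification of the confusion-matrix mapping used in the paper, and p = 1 is omitted from the grid because (4) would need its limiting case there.

```python
import numpy as np

def majority_accuracy(cluster_labels, class_labels):
    """Map each cluster to its majority class among the labelled entities and
    return the proportion labelled correctly (classes coded as 0..C-1)."""
    correct = sum(np.bincount(class_labels[cluster_labels == k]).max()
                  for k in np.unique(cluster_labels))
    return correct / len(class_labels)

def select_p(X, K, true_labels, p_grid=np.arange(1.1, 6.01, 0.1), frac=0.2, seed=0):
    """Steps 1-3 above: labels are assumed known for a random 20% of the entities only."""
    rng = np.random.default_rng(seed)
    true_labels = np.asarray(true_labels)
    known = rng.choice(X.shape[0], size=int(frac * X.shape[0]), replace=False)  # step 1
    best_p, best_acc = None, -1.0
    for p in p_grid:                                       # step 2: run on the whole dataset
        labels, _, _ = mw_pam(X, K, p, medoids=minkowski_build(X, K, p))
        acc = majority_accuracy(labels[known], true_labels[known])
        if acc > best_acc:                                 # step 3: best p on the known entities
            best_p, best_acc = p, acc
    return best_p, best_acc
```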
Table 4 shows the results we obtained using the above procedure, as well as the optimal results from Tables 1 and 3. These clearly show that it is possible to closely approximate the optimal accuracy of M Build + MW-PAM when p is unknown by using semi-supervised learning.

Table 4. Results of the experiments using semi-supervised learning to find p for the M Build + MW-PAM algorithm

                      Learnt                                 Optimal
Dataset               p           avg          Max          p           avg        Max
Iris                  1.16±0.34   96.67±0.40   97.33        1.1         -          97.33
Iris +2 NF            1.25±0.18   95.25±1.84   96.67        1.2         -          96.67
Iris +4 NF            1.05±0.05   96.33±1.01   97.33        1.1         -          97.33
Hepatitis             1.44±0.85   77.57±2.90   80.00        1.0         -          80.00
Hepatitis +10 NF      1.09±0.46   79.08±2.89   80.00        1.0         -          80.00
Hepatitis +20 NF      1.10±0.37   79.12±2.04   80.00        1.0         -          80.00
500x6-5 +2            2.81±1.22   41.19±2.81   60.20        2.89±1.36   43.3±7.7   60.20
500x15-5 +10          3.78±0.81   65.26±8.37   76.20        4.06±1.07   67.0±8.5   76.20
1000x25-12 +25        4.00±0.60   54.09±5.10   63.00        3.88±0.63   55.1±5.4   63.00
1000x50-12 +25        1.84±0.45   96.18±1.77   98.20        1.84±0.48   96.4±1.8   98.20
5 Conclusion
In this paper we have presented a new medoid-based clustering algorithm that introduces feature weighting and the Minkowski metric into the partition around medoids framework. We have performed extensive experiments using both real-world and synthetic datasets, which showed that the Minkowski weighted partition around medoids algorithm (MW-PAM) was generally superior or competitive, in terms of both accuracy and detection of irrelevant features, to the 7 other PAM-based algorithms considered.

In terms of future research, we intend to replace the semi-supervised algorithm with a method for deriving a suitable value of p directly from the data, as well as to perform experiments on datasets with many more noise features.
References

1. Brohee, S., Van Helden, J.: Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinformatics 7(1), 488–501 (2006)
2. Hartigan, J.A.: Clustering algorithms. John Wiley & Sons (1975)
3. Jain, A.K.: Data clustering: 50 years beyond K-means. Pattern Recognition Letters 31(8), 651–666 (2010)
4. Kaufman, L., Rousseeuw, P.J.: Finding groups in data: an introduction to cluster analysis. Wiley Online Library (1990)
5. Mirkin, B.: Core concepts in data analysis: summarization, correlation and visualization. Springer, New York (2011)
6. Ball, G.H., Hall, D.J.: A clustering technique for summarizing multivariate data. Behavioral Science 12(2), 153–155 (1967)
7. MacQueen, J., et al.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, California, USA, pp. 281–297 (1967)
8. de Amorim, R.C., Komisarczuk, P.: On partitional clustering of malware. In: CyberPatterns, pp. 47–51. Abingdon, Oxfordshire (2012)
9. Chan, E.Y., Ching, W.K., Ng, M.K., Huang, J.Z.: An optimization algorithm for clustering using weighted dissimilarity measures. Pattern Recognition 37(5), 943–952 (2004)
10. Huang, J.Z., Ng, M.K., Rong, H., Li, Z.: Automated variable weighting in k-means type clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(5), 657–668 (2005)
11. Huang, J.Z., Xu, J., Ng, M., Ye, Y.: Weighting Method for Feature Selection in K-Means. In: Computational Methods of Feature Selection, pp. 193–209. Chapman and Hall (2008)
12. Mirkin, B.G.: Clustering for data mining: a data recovery approach. CRC Press (2005)
13. de Amorim, R.C., Mirkin, B.: Minkowski Metric, Feature Weighting and Anomalous Cluster Initializing in K-Means Clustering. Pattern Recognition 45(3), 1061–1075 (2011)
14. Modha, D.S., Spangler, W.S.: Feature weighting in k-means clustering. Machine Learning 52(3), 217–237 (2003)
15. Tsai, C.Y., Chiu, C.C.: Developing a feature weight self-adjustment mechanism for a K-means clustering algorithm. Computational Statistics & Data Analysis 52(10), 4658–4672 (2008)
16. Bilenko, M., Basu, S., Mooney, R.J.: Integrating Constraints and Metric Learning in Semi-Supervised Clustering. In: Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, pp. 81–88 (2004)
17. Xing, E.P., Ng, A.Y., Jordan, M.I., Russell, S.: Distance metric learning, with application to clustering with side-information. In: Advances in Neural Information Processing Systems 16, pp. 521–528 (2002)
18. Makarenkov, V., Legendre, P.: Optimal variable weighting for ultrametric and additive trees and K-means partitioning: Methods and software. Journal of Classification 18(2), 245–271 (2001)
19. Frigui, H., Nasraoui, O.: Unsupervised learning of prototypes and attribute weights. Pattern Recognition 37(3), 567–581 (2004)
20. Irvine UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/
21. Nabney, I., Bishop, C.: Netlab neural network software. Matlab Toolbox
22. de Amorim, R.C.: Constrained Intelligent K-Means: Improving Results with Limited Previous Knowledge. In: ADVCOMP, pp. 176–180 (2008)