K-nearest neighbors clustering algorithm

Dariusz Gauza*, Anna Żukowska* and Robert Nowak
Faculty of Electronics and Information Technology, Warsaw University of Technology, Warsaw, Poland
ABSTRACT
Cluster analysis, understood as an unsupervised method of assigning objects to groups solely on the basis of their measured characteristics, is a common method for analyzing DNA microarray data. Our proposal is to classify the results of the one-nearest-neighbour algorithm (1NN). The presented method copes well with complex, multidimensional data, and the number of groups is properly identified. Numerical experiments on benchmark microarray data show that the presented algorithm gives better results than k-means clustering.
Keywords: cluster analysis, nearest neighbour, k-means, microarray data, multidimensional data
1. INTRODUCTION
DNA microarrays are an important research tool in the field of modern molecular biology. The activity of a large number of genes is investigated in a single experiment, which makes the technique valuable for medical diagnosis. A DNA microarray is a glass tile a few centimetres across with DNA molecules, called probes, attached at specific points of the surface, called spots. Every spot contains 10^{-12} of DNA; the probes differ between spots. The DNA microarray is fabricated using photolithography or electrochemistry on microelectrode arrays. The microarray is used to test gene activity by binding labelled fluorescent DNA molecules to the probes; the fluorescent markers are excited by laser light of a specified wavelength [1]. An image with pixels of differing intensity is obtained and then converted into the final matrix of numerical values through a number of low-level data-processing techniques. The elements of this matrix correspond to the gene expression intensities [1].
Clustering is an unsupervised technique belonging to exploratory data analysis [2]; it is based on assigning objects to clusters exclusively according to their features. In many cases the number of groups or clusters in the data is a priori unknown, and that is the main difference between clustering and classification. The clustering problem belongs to the class NP-hard, therefore the optimal solution cannot be computed in practice, even for small instances. Heuristic methods are common in clustering, because they offer approximate results at acceptable computational complexity [3]. Cluster analysis uses model-based algorithms (e.g. Expectation-Maximization [5]), partitioning algorithms (e.g. k-means [4]), hierarchical grouping algorithms, density-based algorithms (DBSCAN [6], OPTICS [7]) [8], neural networks and Bayesian networks [9, 10].
In this paper a modification of the k-nearest neighbors classification algorithm is proposed for cluster analysis. The new algorithm, called the K-Nearest Neighbor Clustering Algorithm, or KNN-clustering, is a modification of the 1NN clustering algorithm. Our approach is density-based and uses a form of pseudo-classification, which distinguishes it from popular algorithms such as DBSCAN and OPTICS. It works well for multidimensional microarray data even without high-level preprocessing: without feature extraction, without standardization and without feature selection according to the value of the coefficient of variation.
*These authors contributed equally to this work.
Further author information: send correspondence to Robert Nowak, e-mail: [email protected]
Photonics Applications in Astronomy, Communications, Industry, and High-Energy Physics Experiments 2014, edited by Ryszard S. Romaniuk, Proc. of SPIE Vol. 9290, 92901I, © 2014 SPIE, doi: 10.1117/12.2074124
2. RESULTS
The KNN-clustering algorithm is depicted in Fig. 1. It requires only three parameters: the number of neighbours (k), the discrimination value (g) and a distance measure (e.g. Euclidean). The algorithm consists of three main phases. The first and the second phase iterate over all objects in the input collection. The first phase is 1NN grouping: if the similarity between the considered example and every existing cluster is less than the discrimination value, a new cluster is created. The second phase uses these clusters and the k-nearest-neighbour algorithm: majority voting decides which of the clusters created in phase 1 is assigned to each object. The last phase removes the unnecessary clusters.
Figure 1. KNN-clustering algorithm
kNNClustering(X = {x_0, x_1, ..., x_{n-1}}, k, g)
  # X : collection of objects
  # C : collection of clusters
  # x_i ∈ X : i-th input object
  # c_j ∈ C : j-th cluster
  # k : number of nearest neighbours
  # g : discrimination value
  # d : distance function

  PHASE 1: perform grouping by 1NN
    a. C = { }
    b. for all x ∈ X:
         for all c ∈ C:
           if d(x, c) < g: assign x to cluster c
         if object x is not assigned to any cluster:
           C = C ∪ {x}    # create a new cluster

  PHASE 2: if k > 1, perform classification using the KNN algorithm; otherwise skip this phase
    a. calculate the dissimilarity matrix
    b. for all x ∈ X:
    c.   search for the k nearest neighbours of x using the dissimilarity matrix (step a)
    d.   on the basis of the labels obtained in step c, perform simple-majority voting;
         the modal value decides to which group x should belong;
         in the case of a tie, assign x to the nearest of the tied groups
    e.   if the label of x does not agree with the result of the voting, change its label

  PHASE 3: remove trivial clusters (one cluster = one object)
    assign each single object-cluster to a larger group on the basis of the vote
    of its k nearest neighbours
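The pseudocode leaves some details open, e.g. how a cluster is represented when computing d(x, c) and in what order phase 2 updates labels. The following Python sketch fixes those choices one plausible way: clusters are represented by their seed objects, the distance is Euclidean, and labels are updated in place. The function name knn_clustering and all internal names are ours for illustration; this is a sketch under these assumptions, not the authors' 78-line implementation.

import numpy as np
from scipy.spatial.distance import pdist, squareform

def knn_clustering(X, k, g):
    X = np.asarray(X, dtype=float)
    n = len(X)
    D = squareform(pdist(X))          # Euclidean dissimilarity matrix
    np.fill_diagonal(D, np.inf)       # an object is not its own neighbour

    # PHASE 1: 1NN grouping; each cluster is represented by its seed object.
    labels = np.empty(n, dtype=int)
    seeds = []
    for i in range(n):
        if seeds:
            dist = [np.linalg.norm(X[i] - X[s]) for s in seeds]
            c = int(np.argmin(dist))
            if dist[c] < g:
                labels[i] = c
                continue
        seeds.append(i)               # no cluster close enough: found a new one
        labels[i] = len(seeds) - 1

    # PHASE 2: for k > 1, relabel each object by simple-majority voting among
    # its k nearest neighbours; a tie goes to the nearest tied group.
    if k > 1:
        for i in range(n):
            nbrs = np.argsort(D[i])[:k]              # sorted by distance
            counts = np.bincount(labels[nbrs])
            tied = np.flatnonzero(counts == counts.max())
            if len(tied) == 1:
                labels[i] = tied[0]
            else:
                labels[i] = next(labels[j] for j in nbrs if labels[j] in tied)

    # PHASE 3: dissolve trivial one-object clusters by the same kind of vote.
    sizes = np.bincount(labels)
    for i in range(n):
        if sizes[labels[i]] == 1:
            nbrs = np.argsort(D[i])[:k]
            others = [labels[j] for j in nbrs if labels[j] != labels[i]]
            if others:
                labels[i] = np.bincount(others).argmax()
    return labels

Note one design consequence of the in-place update in phase 2: the result can depend on the order in which objects are visited, a point the paper does not discuss.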
We compare KNN-clustering with the results given by the very popular and widely used k-means algorithm [4]. During the numerical experiments described below we always set the correct number of clusters for k-means, therefore k-means operates under optimal conditions.

We use the Fowlkes and Mallows index of similarity (FM index) [11] to compare clustering algorithms. It is a modification of the Rand index: a measure normalized to the range from 0 to 1, where a bigger value means a stronger similarity. The FM index is calculated using Eq. (1).
$$
FM(Cl, Cl') = \frac{\sum_{k=1}^{K} \sum_{l=1}^{L} M_{kl}^{2} - N}{\sqrt{\left( \sum_{k=1}^{K} |C_k|^{2} - N \right) \left( \sum_{l=1}^{L} |C'_l|^{2} - N \right)}} = \frac{N_{11}}{\sqrt{(N_{11} + N_{10})(N_{11} + N_{01})}} \qquad (1)
$$
where:
  Cl      — results of clustering A,
  Cl'     — results of clustering B,
  M_{kl}  — confusion matrix,
  N_{11}  — number of pairs which are in the same cluster in both Cl and Cl',
  N_{00}  — number of pairs which are in different clusters in both Cl and Cl',
  N_{10}  — number of pairs which are in the same cluster in Cl but in different clusters in Cl',
  N_{01}  — number of pairs which are in different clusters in Cl but in the same cluster in Cl',
  N       — number of clustered objects, with N_{11} + N_{00} + N_{10} + N_{01} = \binom{N}{2}.
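A direct transcription of Eq. (1) into Python might look as follows, assuming both clusterings are given as vectors of non-negative integer labels; fowlkes_mallows is our illustrative name, since the paper's own implementation is not shown.

import numpy as np

def fowlkes_mallows(labels_a, labels_b):
    # Confusion matrix M: M[k, l] counts objects placed in cluster k of
    # clustering A and cluster l of clustering B.
    labels_a = np.asarray(labels_a)
    labels_b = np.asarray(labels_b)
    M = np.zeros((labels_a.max() + 1, labels_b.max() + 1))
    for a, b in zip(labels_a, labels_b):
        M[a, b] += 1
    n = len(labels_a)
    numerator = (M ** 2).sum() - n                # equals 2 * N11
    denom_a = (M.sum(axis=1) ** 2).sum() - n      # sum of |C_k|^2  minus N
    denom_b = (M.sum(axis=0) ** 2).sum() - n      # sum of |C'_l|^2 minus N
    return numerator / np.sqrt(denom_a * denom_b)

For example, fowlkes_mallows([0, 0, 1, 1], [1, 1, 0, 0]) returns 1.0, since the two clusterings agree up to a relabelling.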
[Figure 2: four 2D PCA projections ("data visibility: 100.00 %"); panels (a) KNN and (b) K-means for the Gaussian-mixture data, (c) KNN and (d) K-means for the irregularly shaped groups; horizontal axis: first principal component.]
Figure 2. Results of clustering 2-dimensional synthetic data: a mixture of random variables with Gaussian distribution (a, b) and groups with irregular shapes (c, d), by KNN-clustering and k-means.

In Fig. 2 we present the results achieved on synthetic 2-dimensional data, shown in the two most significant variables according to principal component analysis (PCA [12]). The shapes represent different clusters. The results obtained for the mixture of random variables with Gaussian probability distribution (Fig. 2a and Fig. 2b) are similar and good. The clustering results for the groups of irregular shapes, depicted in Fig. 2c and Fig. 2d, show differences between
our method and the k-means algorithm. KNN-clustering achieves about 100% accuracy, whereas k-means does not work at all for the group with an irregular shape (the central group in the picture). Numerical experiments with real microarray data are depicted in Fig. 3 and Fig. 4.
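A hypothetical reconstruction of the Fig. 2a/2b setting, using the knn_clustering sketch given after Fig. 1; the cluster centres, spread and threshold g below are illustrative values, not taken from the paper.

import numpy as np

rng = np.random.default_rng(1)
centres = [(0.0, 0.0), (1.0, 0.0), (0.5, 1.0)]
# Mixture of three 2-D Gaussians, 50 points each.
X = np.vstack([rng.normal(loc=c, scale=0.15, size=(50, 2)) for c in centres])

labels = knn_clustering(X, k=9, g=0.4)      # g chosen by eye for this scale
print("clusters found:", len(set(labels)))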
We use popular benchmark microarray data-sets: Leukemia [13] and LungCancer [14]. We have run the KNN-clustering algorithm with discrimination value g = 36.0, 9 nearest neighbours (k = 9) and Euclidean distance. The discrimination parameter g was estimated from the dissimilarity matrix: we assume that this value is equal to half of the median of the values in this matrix.

The Leukemia [13] data-set contains 500-dimensional data for 72 objects. The objects belong to one of three groups: acute myeloid leukemia (AML), acute lymphoblastic leukemia B (ALLB) or acute lymphoblastic leukemia T (ALLT). Fig. 3 shows the clustering results depicted in the coordinates of the two most significant variables according to PCA. This data is quite easy to cluster, because there are three rather clearly separated groups. Both algorithms achieve similar results.
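The stated heuristic for g, half of the median of the dissimilarity-matrix values, is a one-liner; estimate_g is our name for it, and Euclidean distance is assumed.

import numpy as np
from scipy.spatial.distance import pdist

def estimate_g(X):
    # pdist returns the condensed upper triangle of the pairwise-distance
    # matrix, so the zero diagonal does not bias the median.
    return 0.5 * np.median(pdist(X))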
[Figure 3: two 2D PCA projections ("data visibility: 34.67 %") of the Leukemia data-set; panels (a) KNN and (b) K-means; horizontal axis: first principal component.]
Figure 3. Results of clustering the Leukemia data-set by KNN-clustering and k-means.
The LungCancer [14] data-set is made of five groups: adenocarcinoma (AD), normal lung (NL), small cell lung cancer (SMCL), squamous cell carcinoma (SQ) and pulmonary carcinoid (COID). This microarray data-set counts 6794 features for each of 203 objects. Because one of the groups contains only 6 objects, it is almost impossible to correctly identify this cluster by unsupervised grouping techniques. Our algorithm achieved slightly better results than the properly initialized k-means (number of groups set to 5). What is really important, the KNN-clustering algorithm almost correctly recognizes the number of clusters.

Tab. 1 and Tab. 2 show the comparison of clusterings according to the FM index (Eq. 1). We compare the clustering results with the proper division of the objects obtained from additional data (denoted as Diagnosis in Tab. 1 and Tab. 2). For Leukemia we achieve 51% for both algorithms, but for LungCancer our algorithm reaches 31% where the properly initialized k-means reaches 29%.
Table 1. Comparison of results obtained by the algorithms for Leukemia according to the value of the FM index (Eq. 1)

              Diagnosis   K-means   KNN
  Diagnosis     1.00       0.51     1.00
  K-means       0.51       1.00     0.51
  KNN           1.00       0.51     1.00
The results of clustering analysis for complex data such as LungCancer differ quite considerably, not only between the diagnosis (expert-based clustering) and the results of the algorithms, but also between the algorithms themselves.
[Figure 4: two 2D PCA projections ("data visibility: 33.78 %") of the LungCancer data-set; panels (a) KNN and (b) K-means; horizontal axis: first principal component.]
Figure 4. Results of clustering the LungCancer data-set by KNN-clustering and k-means.
It seems, however, that this discrepancy should not be treated as an error. From this perspective, the KNN-clustering algorithm can be thought of as just another research tool designed for this type of task. The time complexity of both algorithms is comparable and equal to O(k n² m), where k is the number of groups, n is the number of objects and m is the number of attributes of each object.
Table 2. Comparison of results obtained by the algorithms for LungCancer according to the value of the FM index (Eq. 1)

              Diagnosis   K-means   KNN
  Diagnosis     1.00       0.29     0.31
  K-means       0.29       1.00     0.36
  KNN           0.31       0.36     1.00
The presented clustering algorithm has been implemented in Python; the code of the algorithm takes only 78 lines. We use methods from the scipy-0.13.0 [15] Python package. The results obtained by the KNN algorithm were compared with the k-means implementation available in scipy-0.13.0 [15]. The Fowlkes-Mallows index computation is our own implementation. The computer program, with a graphical user interface and functions to load data, save the clustering results and visualize them in the space of the two first principal components, will be sent on request. This software has 1727 lines of code and depends on the following Python packages: wxPython-2.8, matplotlib-1.2.1, scipy-0.13.0 and numpy-MKL-1.8.0. The software was tested under the Linux and Windows operating systems.
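For reference, the k-means comparison can be run through the scipy.cluster.vq interface of that SciPy version; the invocation below is a sketch with synthetic stand-in data, since the paper does not show its exact call.

import numpy as np
from scipy.cluster.vq import kmeans2

np.random.seed(0)                  # k-means depends on its random initialization
X = np.random.randn(72, 500)       # stand-in for a 72-object, 500-feature matrix
centroids, labels = kmeans2(X, 3, minit='points')   # 3 = known number of groups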
3. DISCUSSION AND CONCLUSION
The presented article shows that the KNN-clustering algorithm has the following properties:

• it generally aptly recognizes the number of clusters in the collection (for the LungCancer data collection, identifying the sensible group of 6 objects is practically impossible without knowledge of the initial number of clusters); however, it is not possible to determine the resulting number of clusters in advance;

• for nontrivial data it has a relatively high resistance to the initial setup variables;

• it recognizes clusters of irregular shape in few- and multidimensional data;

• the algorithm is deterministic;
• sometimes it is hard to estimate the correct number of nearest neighbours k; a value too small or too large leads to worse results. We have no rule for how to set this variable; we noticed that it strongly depends on the type of data.
The proposed algorithm dealt quite well with grouping data from DNA microarray samples. It is worth noting that the presented results were obtained without standardization and without feature extraction (e.g. based on principal component analysis), which usually improves the results of grouping multidimensional data. Our algorithm is still being developed; we plan to develop a computationally efficient version (in C++ with Python bindings) and to compare it with a greater number of clustering algorithms.
Acknowledgments
This work was supported by the statutory research of the Institute of Electronic Systems of the Warsaw University of Technology.
REFERENCES
[1] Stekel, Dov, [Microarray Bioinformatics], Cambridge University Press (2003).
[2] Tukey, John W., "We need both exploratory and confirmatory," The American Statistician 34(1), 23-25 (1980).
[3] Xu, Rui and Wunsch, Don, [Clustering], vol. 10, John Wiley & Sons (2008).
[4] Krishna, K. and Murty, M. Narasimha, "Genetic k-means algorithm," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 29(3), 433-439 (1999).
[5] Figueiredo, Mario A. T. and Jain, Anil K., "Unsupervised learning of finite mixture models," IEEE Transactions on Pattern Analysis and Machine Intelligence 24(3), 381-396 (2002).
[6] Ester, Martin and Kriegel, Hans-Peter and Sander, Jörg and Xu, Xiaowei, "A density-based algorithm for discovering clusters in large spatial databases with noise," in [KDD] 96, 226-231 (1996).
[7] Ankerst, Mihael and Breunig, Markus M. and Kriegel, Hans-Peter and Sander, Jörg, "OPTICS: Ordering points to identify the clustering structure," in [ACM SIGMOD Record] 28(2), 49-60, ACM (1999).
[8] Raczynski, Lech and Wozniak, Krzysztof and Rubel, Tymon and Zaremba, Krzysztof, "Application of density based clustering to microarray data analysis," International Journal of Electronics and Telecommunications 56(3), 281-286 (2010).
[9] Friedman, Nir and Linial, Michal and Nachman, Iftach and Pe'er, Dana, "Using Bayesian networks to analyze expression data," Journal of Computational Biology 7(3-4), 601-620 (2000).
[10] Szlendak, Paweł and Nowak, Robert M., "Statistical relationship discovery in SNP data using Bayesian networks," in [Proc. SPIE Photonics Applications in Astronomy, Communications, Industry, and High-Energy Physics Experiments 2009], 75022J, 1-9 (2009). doi:10.1117/12.837602.
[11] Fowlkes, Edward B. and Mallows, Colin L., "A method for comparing two hierarchical clusterings," Journal of the American Statistical Association 78(383), 553-569 (1983).
[12] Jolliffe, Ian, [Principal Component Analysis], Wiley Online Library (2005).
[13] Golub, Todd R. and Slonim, Donna K. and Tamayo, Pablo and Huard, Christine and Gaasenbeek, Michelle and Mesirov, Jill P. and Coller, Hilary and Loh, Mignon L. and Downing, James R. and Caligiuri, Mark A. and others, "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring," Science 286(5439), 531-537 (1999).
[14] Bhattacharjee, Arindam and Richards, William G. and Staunton, Jane and Li, Cheng and Monti, Stefano and Vasa, Priya and Ladd, Christine and Beheshti, Javad and Bueno, Raphael and Gillette, Michael and others, "Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses," Proceedings of the National Academy of Sciences 98(24), 13790-13795 (2001).
[15] Jones, Eric and Oliphant, Travis and Peterson, Pearu, "SciPy: Open source scientific tools for Python," http://www.scipy.org/ (2001).