K-nearest neighbors clustering algorithm

Dariusz Gauza*, Anna Żukowska* and Robert Nowak
Faculty of Electronics and Information Technology, Warsaw University of Technology, Warsaw, Poland
ABSTRACT
Cluster analysis, understood as an unsupervised method of assigning objects to groups solely on the basis of their measured characteristics, is a common method for analyzing DNA microarray data. Our proposal is to classify the results of the one-nearest-neighbour algorithm (1NN). The presented method copes well with complex, multidimensional data, and the number of groups is properly identified. Numerical experiments on benchmark microarray data show that the presented algorithm gives better results than k-means clustering.
Keywords: cluster analysis, nearest neighbour, k-means, microarray data, multidimensional data
1. INTRODUCTION
DNA microarrays are an important research tool in the field of modern molecular biology. The activity of a large number of genes is investigated in a single experiment, which makes the technique valuable for medical diagnosis. A DNA microarray is a glass tile a few centimetres across with DNA molecules, called probes, attached at specific points of the surface, called spots. Every spot contains 10^{-12} of DNA; the probes differ between spots. The DNA microarray is fabricated using photolithography or electrochemistry on microelectrode arrays. The microarray is used to test gene activity by binding labelled fluorescent DNA molecules to the probes; the fluorescent markers are excited by laser light of a specified wavelength [1]. An image with pixels of differing intensity is obtained and then converted into the final matrix of numerical values through a number of low-level data-processing techniques. The elements of this matrix correspond to the gene expression intensities [1].
Clustering is an unsupervised technique belonging to exploratory data analysis [2]; it is based on assigning objects to clusters exclusively according to their features. In many cases the number of groups or clusters in the data is a priori unknown, and that is the main difference between clustering and classification. The clustering problem belongs to the class NP-hard, therefore the optimal solution cannot be computed in practice, even for small instances. Heuristic methods are common in clustering, because they offer approximate results at acceptable computational complexity [3]. Cluster analysis uses model-based algorithms (e.g. Expectation-Maximization [5]), partitioning algorithms (e.g. k-means [4]), hierarchical grouping algorithms, density-based algorithms (DBSCAN [6], OPTICS [7]) [8], neural networks and Bayesian networks [9, 10].
In this paper a modification of the k-nearest neighbors classification algorithm is proposed for cluster analysis. The new algorithm, called the K-Nearest Neighbor Clustering Algorithm, or KNN-clustering, is a modification of the 1NN clustering algorithm. Our approach is density-based and uses a form of pseudo-classification, which distinguishes it from popular algorithms such as DBSCAN and OPTICS. It works well for multidimensional microarray data even without high-level preprocessing: without feature extraction, without standardization and without feature selection according to the value of the coefficient of variation.
*These authors contributed equally to this work.
Further author information: send correspondence to Robert Nowak, e-mail: [email protected]
Photonics Applications in Astronomy, Communications, Industry, and High-Energy Physics Experiments 2014, edited by Ryszard S. Romaniuk, Proc. of SPIE Vol. 9290, 92901I, © 2014 SPIE, doi: 10.1117/12.2074124
2. RESULTS
The KNN-clustering algorithm is depicted in Fig. 1. It requires only three parameters: the number of neighbours (k), the discrimination value (g) and a distance measure (e.g. Euclidean). The algorithm consists of three main phases. The first and the second phase iterate over all objects in the input collection. The first phase is 1NN grouping: if the similarity between the considered example and every existing cluster is less than the discrimination value, a new cluster is created. The second phase uses these clusters and the k-nearest-neighbour algorithm: majority voting decides which of the clusters created in phase 1 is assigned to each object. The last phase removes the unnecessary clusters.
Figure 1. KNN-clustering algorithm
kNNClustering(X = {x_0, x_1, ..., x_{n-1}}, k, g)
  # X : collection of objects
  # C : collection of clusters
  # x_i ∈ X : i-th input object
  # c_j ∈ C : j-th cluster
  # k : number of nearest neighbours
  # g : discrimination value
  # d : distance function

  PHASE 1: perform grouping by 1NN
    a. C = { }
    b. for all x ∈ X:
         for all c ∈ C:
           if d(x, c) < g: assign x to cluster c
         if object x is not assigned to any cluster:
           C = C ∪ {x}    # create a new cluster

  PHASE 2: if k > 1, perform classification using the KNN algorithm; otherwise skip this phase
    a. calculate the dissimilarity matrix
    b. for all x ∈ X:
    c.   search for the k nearest neighbours of x using the dissimilarity matrix (step a)
    d.   on the basis of the labels obtained in step c, perform simple-majority voting;
         the modal value decides to which group x should belong;
         in the case of a tie, assign x to the nearest of the tied groups
    e.   if the label of x does not agree with the result of the voting, change its label

  PHASE 3: remove trivial clusters (one cluster = one object)
    assign each single object-cluster to a larger group on the basis of the vote
    of its k nearest neighbours
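The pseudocode leaves some details open, e.g. how a cluster is represented when computing d(x, c) and in what order phase 2 updates labels. The following Python sketch fixes those choices one plausible way: clusters are represented by their seed objects, the distance is Euclidean, and labels are updated in place. The function name knn_clustering and all internal names are ours for illustration; this is a sketch under these assumptions, not the authors' 78-line implementation.

import numpy as np
from scipy.spatial.distance import pdist, squareform

def knn_clustering(X, k, g):
    X = np.asarray(X, dtype=float)
    n = len(X)
    D = squareform(pdist(X))          # Euclidean dissimilarity matrix
    np.fill_diagonal(D, np.inf)       # an object is not its own neighbour

    # PHASE 1: 1NN grouping; each cluster is represented by its seed object.
    labels = np.empty(n, dtype=int)
    seeds = []
    for i in range(n):
        if seeds:
            dist = [np.linalg.norm(X[i] - X[s]) for s in seeds]
            c = int(np.argmin(dist))
            if dist[c] < g:
                labels[i] = c
                continue
        seeds.append(i)               # no cluster close enough: found a new one
        labels[i] = len(seeds) - 1

    # PHASE 2: for k > 1, relabel each object by simple-majority voting among
    # its k nearest neighbours; a tie goes to the nearest tied group.
    if k > 1:
        for i in range(n):
            nbrs = np.argsort(D[i])[:k]              # sorted by distance
            counts = np.bincount(labels[nbrs])
            tied = np.flatnonzero(counts == counts.max())
            if len(tied) == 1:
                labels[i] = tied[0]
            else:
                labels[i] = next(labels[j] for j in nbrs if labels[j] in tied)

    # PHASE 3: dissolve trivial one-object clusters by the same kind of vote.
    sizes = np.bincount(labels)
    for i in range(n):
        if sizes[labels[i]] == 1:
            nbrs = np.argsort(D[i])[:k]
            others = [labels[j] for j in nbrs if labels[j] != labels[i]]
            if others:
                labels[i] = np.bincount(others).argmax()
    return labels

Note one design consequence of the in-place update in phase 2: the result can depend on the order in which objects are visited, a point the paper does not discuss.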
We compare KNN-clustering with the results given by the very popular and widely used k-means algorithm [4]. During the numerical experiments described below we always set the correct number of clusters for k-means, therefore k-means operates under optimal conditions.

We use the Fowlkes and Mallows index of similarity (FM index) [11] to compare clustering algorithms. It is a modification of the Rand index: a measure normalized to the range from 0 to 1, where a bigger value means a stronger similarity. The FM index is calculated using Eq. (1).
$$
FM(Cl, Cl') = \frac{\sum_{k=1}^{K} \sum_{l=1}^{L} M_{kl}^{2} - N}{\sqrt{\left( \sum_{k=1}^{K} |C_k|^{2} - N \right) \left( \sum_{l=1}^{L} |C'_l|^{2} - N \right)}} = \frac{N_{11}}{\sqrt{(N_{11} + N_{10})(N_{11} + N_{01})}} \qquad (1)
$$
where:
  Cl      — results of clustering A,
  Cl'     — results of clustering B,
  M_{kl}  — confusion matrix,
  N_{11}  — number of pairs which are in the same cluster in both Cl and Cl',
  N_{00}  — number of pairs which are in different clusters in both Cl and Cl',
  N_{10}  — number of pairs which are in the same cluster in Cl but in different clusters in Cl',
  N_{01}  — number of pairs which are in different clusters in Cl but in the same cluster in Cl',
  N       — number of clustered objects, with N_{11} + N_{00} + N_{10} + N_{01} = \binom{N}{2}.
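A direct transcription of Eq. (1) into Python might look as follows, assuming both clusterings are given as vectors of non-negative integer labels; fowlkes_mallows is our illustrative name, since the paper's own implementation is not shown.

import numpy as np

def fowlkes_mallows(labels_a, labels_b):
    # Confusion matrix M: M[k, l] counts objects placed in cluster k of
    # clustering A and cluster l of clustering B.
    labels_a = np.asarray(labels_a)
    labels_b = np.asarray(labels_b)
    M = np.zeros((labels_a.max() + 1, labels_b.max() + 1))
    for a, b in zip(labels_a, labels_b):
        M[a, b] += 1
    n = len(labels_a)
    numerator = (M ** 2).sum() - n                # equals 2 * N11
    denom_a = (M.sum(axis=1) ** 2).sum() - n      # sum of |C_k|^2  minus N
    denom_b = (M.sum(axis=0) ** 2).sum() - n      # sum of |C'_l|^2 minus N
    return numerator / np.sqrt(denom_a * denom_b)

For example, fowlkes_mallows([0, 0, 1, 1], [1, 1, 0, 0]) returns 1.0, since the two clusterings agree up to a relabelling.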
[Figure 2: four 2D PCA projections ("data visibility: 100.00 %"); panels (a) KNN and (b) K-means for the Gaussian-mixture data, (c) KNN and (d) K-means for the irregularly shaped groups; horizontal axis: first principal component.]
Figure 2. Results of clustering 2-dimensional synthetic data: a mixture of random variables with Gaussian distribution (a, b) and groups with irregular shapes (c, d), by KNN-clustering and k-means.

In Fig. 2 we present the results achieved on synthetic 2-dimensional data, shown in the two most significant variables according to principal component analysis (PCA [12]). The shapes represent different clusters. The results obtained for the mixture of random variables with Gaussian probability distribution (Fig. 2a and Fig. 2b) are similar and good. The clustering results for the groups of irregular shapes, depicted in Fig. 2c and Fig. 2d, show differences between
our method and the k-means algorithm. KNN-clustering achieves about 100% accuracy, whereas k-means does not work at all for the group with an irregular shape (the central group in the picture). Numerical experiments with real microarray data are depicted in Fig. 3 and Fig. 4.
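A hypothetical reconstruction of the Fig. 2a/2b setting, using the knn_clustering sketch given after Fig. 1; the cluster centres, spread and threshold g below are illustrative values, not taken from the paper.

import numpy as np

rng = np.random.default_rng(1)
centres = [(0.0, 0.0), (1.0, 0.0), (0.5, 1.0)]
# Mixture of three 2-D Gaussians, 50 points each.
X = np.vstack([rng.normal(loc=c, scale=0.15, size=(50, 2)) for c in centres])

labels = knn_clustering(X, k=9, g=0.4)      # g chosen by eye for this scale
print("clusters found:", len(set(labels)))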
We use popular benchmark microarray data-sets: Leukemia [13] and LungCancer [14]. We have run the KNN-clustering algorithm with discrimination value g = 36.0, 9 nearest neighbours (k = 9) and Euclidean distance. The discrimination parameter g was estimated from the dissimilarity matrix: we assume that this value is equal to half of the median of the values in this matrix.

The Leukemia [13] data-set contains 500-dimensional data for 72 objects. The objects belong to one of three groups: acute myeloid leukemia (AML), acute lymphoblastic leukemia B (ALLB) or acute lymphoblastic leukemia T (ALLT). Fig. 3 shows the clustering results depicted in the coordinates of the two most significant variables according to PCA. This data is quite easy to cluster, because there are three rather clearly separated groups. Both algorithms achieve similar results.
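The stated heuristic for g, half of the median of the dissimilarity-matrix values, is a one-liner; estimate_g is our name for it, and Euclidean distance is assumed.

import numpy as np
from scipy.spatial.distance import pdist

def estimate_g(X):
    # pdist returns the condensed upper triangle of the pairwise-distance
    # matrix, so the zero diagonal does not bias the median.
    return 0.5 * np.median(pdist(X))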
[Figure 3: two 2D PCA projections ("data visibility: 34.67 %") of the Leukemia data-set; panels (a) KNN and (b) K-means; horizontal axis: first principal component.]
Figure 3. Results of clustering the Leukemia data-set by KNN-clustering and k-means.
The LungCancer [14] data-set is made of five groups: adenocarcinoma (AD), normal lung (NL), small cell lung cancer (SMCL), squamous cell carcinoma (SQ) and pulmonary carcinoid (COID). This microarray data-set counts 6794 features for each of 203 objects. Because one of the groups contains only 6 objects, it is almost impossible to correctly identify this cluster by unsupervised grouping techniques. Our algorithm achieved slightly better results than the properly initialized k-means (number of groups set to 5). What is really important, the KNN-clustering algorithm almost correctly recognizes the number of clusters.

Tab. 1 and Tab. 2 show the comparison of clusterings according to the FM index (Eq. 1). We compare the clustering results with the proper division of the objects obtained from additional data (denoted as Diagnosis in Tab. 1 and Tab. 2). For Leukemia we achieve 51% for both algorithms, but for LungCancer our algorithm reaches 31% where the properly initialized k-means reaches 29%.
Table 1. Comparison of results obtained by the algorithms for Leukemia according to the value of the FM index (Eq. 1)

              Diagnosis   K-means   KNN
  Diagnosis     1.00       0.51     1.00
  K-means       0.51       1.00     0.51
  KNN           1.00       0.51     1.00
The results of clustering analysis for complex data such as LungCancer differ quite considerably, not only between the diagnosis (expert-based clustering) and the results of the algorithms, but also between the algorithms themselves.
[Figure 4: two 2D PCA projections ("data visibility: 33.78 %") of the LungCancer data-set; panels (a) KNN and (b) K-means; horizontal axis: first principal component.]
Figure 4. Results of clustering the LungCancer data-set by KNN-clustering and k-means.
It seems, however, that this discrepancy should not be treated as an error. From this perspective, the KNN-clustering algorithm can be thought of as just another research tool designed for this type of task. The time complexity of both algorithms is comparable and equal to O(k n² m), where k is the number of groups, n is the number of objects and m is the number of attributes of each object.
Table 2. Comparison of results obtained by the algorithms for LungCancer according to the value of the FM index (Eq. 1)

              Diagnosis   K-means   KNN
  Diagnosis     1.00       0.29     0.31
  K-means       0.29       1.00     0.36
  KNN           0.31       0.36     1.00
The presented clustering algorithm has been implemented in Python; the code of the algorithm takes only 78 lines. We use methods from the scipy-0.13.0 [15] Python package. The results obtained by the KNN algorithm were compared with the k-means implementation available in scipy-0.13.0 [15]. The Fowlkes-Mallows index computation is our own implementation. The computer program, with a graphical user interface and functions to load data, save the clustering results and visualize them in the space of the two first principal components, will be sent on request. This software has 1727 lines of code and depends on the following Python packages: wxPython-2.8, matplotlib-1.2.1, scipy-0.13.0 and numpy-MKL-1.8.0. The software was tested under the Linux and Windows operating systems.
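For reference, the k-means comparison can be run through the scipy.cluster.vq interface of that SciPy version; the invocation below is a sketch with synthetic stand-in data, since the paper does not show its exact call.

import numpy as np
from scipy.cluster.vq import kmeans2

np.random.seed(0)                  # k-means depends on its random initialization
X = np.random.randn(72, 500)       # stand-in for a 72-object, 500-feature matrix
centroids, labels = kmeans2(X, 3, minit='points')   # 3 = known number of groups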
3. DISCUSSION AND CONCLUSION
The presented article shows that the KNN-clustering algorithm has the following properties:

• it generally aptly recognizes the number of clusters in the collection (for the LungCancer data collection, identifying the sensible group of 6 objects is practically impossible without knowledge of the initial number of clusters); however, it is not possible to determine the resulting number of clusters in advance;

• for nontrivial data it has a relatively high resistance to the initial setup variables;

• it recognizes clusters of irregular shape in few- and multidimensional data;

• the algorithm is deterministic;
• sometimes it is hard to estimate the correct number of nearest neighbours k; a value too small or too large leads to worse results. We have no rule for how to set this variable; we noticed that it strongly depends on the type of data.
The proposed algorithm dealt quite well with grouping data from DNA microarray samples. It is worth noting that the presented results were obtained without standardization and without feature extraction (e.g. based on principal component analysis), which usually improves the results of grouping multidimensional data. Our algorithm is still being developed; we plan to develop a computationally efficient version (in C++ with Python bindings) and to compare it with a greater number of clustering algorithms.
Acknowledgments
This work was supported by the statutory research of the Institute of Electronic Systems of the Warsaw University of Technology.
REFERENCES
[1] Stekel, Dov, [Microarray Bioinformatics], Cambridge University Press (2003).
[2] Tukey, John W., "We need both exploratory and confirmatory," The American Statistician 34(1), 23-25 (1980).
[3] Xu, Rui and Wunsch, Don, [Clustering], vol. 10, John Wiley & Sons (2008).
[4] Krishna, K. and Murty, M. Narasimha, "Genetic k-means algorithm," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 29(3), 433-439 (1999).
[5] Figueiredo, Mario A. T. and Jain, Anil K., "Unsupervised learning of finite mixture models," IEEE Transactions on Pattern Analysis and Machine Intelligence 24(3), 381-396 (2002).
[6] Ester, Martin and Kriegel, Hans-Peter and Sander, Jörg and Xu, Xiaowei, "A density-based algorithm for discovering clusters in large spatial databases with noise," in [KDD] 96, 226-231 (1996).
[7] Ankerst, Mihael and Breunig, Markus M. and Kriegel, Hans-Peter and Sander, Jörg, "OPTICS: Ordering points to identify the clustering structure," in [ACM SIGMOD Record] 28(2), 49-60, ACM (1999).
[8] Raczynski, Lech and Wozniak, Krzysztof and Rubel, Tymon and Zaremba, Krzysztof, "Application of density based clustering to microarray data analysis," International Journal of Electronics and Telecommunications 56(3), 281-286 (2010).
[9] Friedman, Nir and Linial, Michal and Nachman, Iftach and Pe'er, Dana, "Using Bayesian networks to analyze expression data," Journal of Computational Biology 7(3-4), 601-620 (2000).
[10] Szlendak, Paweł and Nowak, Robert M., "Statistical relationship discovery in SNP data using Bayesian networks," in [Proc. SPIE Photonics Applications in Astronomy, Communications, Industry, and High-Energy Physics Experiments 2009], 75022J, 1-9 (2009). doi:10.1117/12.837602.
[11] Fowlkes, Edward B. and Mallows, Colin L., "A method for comparing two hierarchical clusterings," Journal of the American Statistical Association 78(383), 553-569 (1983).
[12] Jolliffe, Ian, [Principal Component Analysis], Wiley Online Library (2005).
[13] Golub, Todd R. and Slonim, Donna K. and Tamayo, Pablo and Huard, Christine and Gaasenbeek, Michelle and Mesirov, Jill P. and Coller, Hilary and Loh, Mignon L. and Downing, James R. and Caligiuri, Mark A. and others, "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring," Science 286(5439), 531-537 (1999).
[14] Bhattacharjee, Arindam and Richards, William G. and Staunton, Jane and Li, Cheng and Monti, Stefano and Vasa, Priya and Ladd, Christine and Beheshti, Javad and Bueno, Raphael and Gillette, Michael and others, "Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses," Proceedings of the National Academy of Sciences 98(24), 13790-13795 (2001).
[15] Jones, Eric and Oliphant, Travis and Peterson, Pearu, "SciPy: Open source scientific tools for Python," http://www.scipy.org/ (2001).