A Novel Quad Tree Based Data Clustering Technique - IEEE Xplore

(ICRCICN)

A Novel Quad Tree Based Data Clustering Technique

Debjyoti Basu

Subhasree Sengupta

Department of Information Technology Future Institute of Engineering & Management Sonarpur, Kolkata, India [email protected]

Department of Computer Science Future Institute of Technology Kolkata, India. [email protected]

Abstract— Data clustering is a branch of computation in which data is divided into groups of similar objects. Each group is formed from data that are similar in some parametric values and dissimilar to the data of other groups. Active research on data clustering is in progress in several fields such as statistics, pattern recognition and machine learning. This paper proposes a data clustering technique based on the quad tree. The authors have established an algorithm inspired by the construction and behavior of the quad tree data structure. As an initial step, the algorithm has been tested thoroughly on synthetic data. The results are found to be very encouraging for future investigation. The algorithm also shows a good execution time bound.

Keywords— Data clustering, quad tree, execution time.

I. INTRODUCTION

Data mining brings to clustering the complications of big data sets with a very large number of attributes of different types. Data clustering is a primary task of exploratory data mining. It is a common method for statistical data analysis used in domains such as machine learning, pattern recognition, image analysis, information retrieval and bioinformatics. It is an important unsupervised learning technique [1]. Data clustering has been studied extensively over the last two decades. The main goals of a data clustering algorithm are determining good clusters and computing them efficiently. This paper presents a quad tree inspired quad cluster method. The algorithm has been examined on several types of synthetic data. A very brief literature review and the problem description are given in Sections 2 and 3. Section 4 elaborates the proposed algorithm. Section 5 gives the critical and experimental result analysis. Finally, the conclusion up to this stage is given along with the future scope of investigation.

II. LITERATURE REVIEW

In spite of the large number of studies done in past and recent times, a very large amount of experimentation is still going on. This clearly indicates the difficulty of data clustering. The large number of data clustering algorithms and their success in different application domains are the stimulating factor for current researchers. The basic challenges associated with data clustering techniques were highlighted in [2]. Data clustering algorithms studied by current researchers can be classified into three types as follows:

A. Partition clustering: Partition clustering creates a simple partition of the collection of items into clusters. In this method instances are relocated by moving them from one cluster to another, starting from an initial partitioning. Here, the number of clusters has to be preset by the user. K-means is one of the simplest unsupervised learning algorithms that solve the clustering problem. In this method, a classification of the given data set into a certain number of clusters is made first. Then K initial centroids are randomly selected, one for each cluster. Thereafter, iterations are carried out to minimize an objective function, which in this case is usually the squared error function. The objective function is based on a chosen distance measure between a data point and a cluster center [3].

B. Hierarchical clustering: Hierarchical clustering aims to obtain a hierarchy of clusters. Divisive clustering starts with one cluster containing all data points and recursively splits the most appropriate cluster. The process continues until a stopping criterion is met [4]. The algorithm that computes the distance between two clusters as the longest distance from any member of one cluster to any member of the other cluster is studied in [5]. Average-linkage clustering is studied in [6, 7]. It may cause elongated clusters to split and parts of neighboring clusters to merge [8].

C. Density based clustering: Here a cluster is defined as a connected dense component, which grows in any direction that density leads. It can therefore capture clusters of arbitrary shape. This also gives a natural protection against outliers.

III. PROBLEM DEFINITION

In the proposed method, because actual data was unavailable, synthetic data was generated, mainly from a random distribution. The data was then plotted as a scatter plot (Figure 1.a).
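A minimal sketch of how such synthetic data might be produced is given here. The paper does not state which distribution or value ranges were used, so the uniform distribution, the ranges, the seed and the helper name `make_synthetic_points` are all assumptions made for illustration.

```python
import random


def make_synthetic_points(n, x_range=(0.0, 100.0), y_range=(0.0, 100.0), seed=0):
    """Return n random 2D points (hypothetical helper, not from the paper).

    Assumption: points are drawn uniformly inside the given ranges.
    """
    rng = random.Random(seed)  # fixed seed so the data set is reproducible
    return [(rng.uniform(*x_range), rng.uniform(*y_range)) for _ in range(n)]


points = make_synthetic_points(10)
```

The resulting list of (x, y) pairs can be passed to any plotting tool to obtain a scatter plot like the one referred to in Figure 1.a.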


Figure 1.a): Scatter plot of a data set.

Figure 1.b): Identification of max-min value of the data set.

Figure 1.d): Quad tree based on quad decomposition.

The maximum and minimum values of the data are identified (Figure 1.b). These maximum and minimum values form the boundary of the data-domain, or data space. The proposed work is mainly based on the quad tree decomposition method. The entire data space is divided into four equal quads. Each quad is subdivided into four sub quads at the next level. This continues until the specified level is reached.
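The boundary identification and the first split described above can be sketched as follows. The function names and the (x_min, y_min, x_max, y_max) box layout are illustrative choices, not taken from the paper.

```python
def bounding_box(points):
    """Return (x_min, y_min, x_max, y_max) of a list of (x, y) points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


def split_into_quads(box):
    """Split a box into four equal quads, listed clockwise from NW."""
    x_min, y_min, x_max, y_max = box
    x_mid = (x_min + x_max) / 2.0
    y_mid = (y_min + y_max) / 2.0
    return {
        "NW": (x_min, y_mid, x_mid, y_max),
        "NE": (x_mid, y_mid, x_max, y_max),
        "SE": (x_mid, y_min, x_max, y_mid),
        "SW": (x_min, y_min, x_mid, y_mid),
    }


box = bounding_box([(0, 0), (4, 8), (2, 2)])
quads = split_into_quads(box)
```

Applying `split_into_quads` again to each returned box gives the sub quads of the next level.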

Based on this decomposition a quad tree can be created. To illustrate this, consider Figures 1.c) and 1.d). For simplicity, assume there are ten data points in the data space. Here, the root node (marked by ‘/’) represents the entire data-domain. At the first level it is subdivided into four quads. Going clockwise from the NW quad to the SW quad, the first-level NW quad contains four data points (1, 2, 3, 4), the NE quad contains one data point, the SE quad contains five data points and the SW quad contains no data points. The quad tree is constructed on this basis. Here ‘N’ represents a null quad, that is, a quad having no data points. For each level, the number of data points present in each small quad is counted. After that, the area of each quad and then the data point density of each small quad are calculated. If the area of the entire data domain is A, then the area of each quad is A/4 at the first level, A/16 at the second level, A/64 at the third level, and so on. The data point density of a quad equals the number of data points present in the quad divided by the area of that quad (Figure 1.e and Figure 1.f).
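As a worked illustration of the area and density rules, the sketch below uses the first-level point counts from the ten-point example (4, 1, 5 and 0). The total area A = 16.0 is an arbitrary value assumed for the arithmetic, and the function names are hypothetical.

```python
def quad_area(total_area, level):
    """Area of one quad at the given level: A/4, A/16, A/64, ..."""
    return total_area / (4 ** level)


def point_density(num_points, area):
    """Data point density = number of points in the quad / area of the quad."""
    return num_points / area


# First-level counts from the ten-point example in the text.
counts = {"NW": 4, "NE": 1, "SE": 5, "SW": 0}
A = 16.0  # assumed total area, chosen only to keep the numbers simple
level_one_area = quad_area(A, 1)
densities = {name: point_density(c, level_one_area) for name, c in counts.items()}
```

With these assumed values each first-level quad has area 4.0, so the SE quad has the highest density and the null SW quad has density zero.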

Figure 1.e): Area of each quad.

Figure 1.c): Quad decomposition technique.


Figure 1.f): Data point density of each quad.

A quad having no data points is considered a null quad. A null quad is not subdivided into sub quads at the next level. Proceeding in this way up to a specified level, all data related to quad boundaries and data density are stored in a 2D vector.
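The recursive decomposition with null quads left unsplit might look like the sketch below. It is an illustration under assumptions rather than the authors' implementation: the row layout of the 2D list, the rule that points on a dividing line go to the east or north quad, and the sample coordinates are all invented here. The coordinates are chosen so that the first-level counts match the ten-point example (4, 1, 5, 0).

```python
def decompose(points, box, level, max_level, rows):
    """Record one quad, then recurse into its four sub quads.

    Each row is [level, x_min, y_min, x_max, y_max, count, density]
    (an assumed layout for the 2D vector mentioned in the text).
    """
    x_min, y_min, x_max, y_max = box
    area = (x_max - x_min) * (y_max - y_min)
    count = len(points)
    density = count / area if area > 0 else 0.0
    rows.append([level, x_min, y_min, x_max, y_max, count, density])

    # A null quad (no points) is not subdivided; recursion also stops
    # once the desired level has been reached.
    if count == 0 or level == max_level:
        return

    x_mid = (x_min + x_max) / 2.0
    y_mid = (y_min + y_max) / 2.0
    buckets = {"NW": [], "NE": [], "SE": [], "SW": []}
    for x, y in points:
        # Assumption: points on a dividing line belong to the east/north quad.
        name = ("N" if y >= y_mid else "S") + ("E" if x >= x_mid else "W")
        buckets[name].append((x, y))

    children = {
        "NW": (x_min, y_mid, x_mid, y_max),
        "NE": (x_mid, y_mid, x_max, y_max),
        "SE": (x_mid, y_min, x_max, y_mid),
        "SW": (x_min, y_min, x_mid, y_mid),
    }
    for name in ("NW", "NE", "SE", "SW"):  # clockwise, starting at NW
        decompose(buckets[name], children[name], level + 1, max_level, rows)


# Ten illustrative points in an 8 x 8 data space.
sample = [(1, 7), (1, 5), (3, 7), (3, 5), (6, 6),
          (5, 1), (6, 1), (7, 1), (5, 3), (7, 3)]
rows = []
decompose(sample, (0, 0, 8, 8), 0, 1, rows)
```

Each entry of `rows` holds the boundary, point count and density of one quad, so the densest quads at a given level can be found by filtering on the first column.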

Figure 1.i): Maximum possible eight neighbor quads (marked with N) of a marked quad.

The centre point of this quad has been calculated as a cluster centre. Proceeding in this way all the cluster centers are identified. In this algorithm the neighbor quads of a marked quad are not considered because they are considered as a part of the same cluster. This has been implemented using tree recursion.

Algorithm: Quad Decomposition
Calculate area;
Calculate number of data point present in the area;
Calculate data point density;
level:=level+1;
IF level=desired level of quad decomposition THEN
    Store information related to quad detail into a vector (lvldtl);
    Return;
END IF
IF level0 ] WHILE lvldtl(i,1) 0 AND c
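The eight-neighbor relation of Figure 1.i can be illustrated with a small sketch. It assumes that the quads of one level are addressed as (row, column) cells of a square grid, which follows from the equal four-way split (2^k by 2^k cells at level k) but is not an indexing scheme given in the paper. The function name is hypothetical.

```python
def neighbour_cells(row, col, n_rows, n_cols):
    """Return the grid cells adjacent to (row, col), at most eight."""
    cells = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue  # skip the marked quad itself
            r, c = row + dr, col + dc
            if 0 <= r < n_rows and 0 <= c < n_cols:
                cells.append((r, c))
    return cells


interior = neighbour_cells(1, 1, 4, 4)  # a quad away from the boundary
corner = neighbour_cells(0, 0, 4, 4)    # a quad in the corner of the data space
```

An interior quad has the full eight neighbors, while quads on the boundary of the data space have fewer, which is why the caption speaks of the maximum possible number.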
