Palestine Polytechnic University
The Second Students Innovative Conference (SIC2013), June 12, 2013, Hebron, State of Palestine
Clustering of Digital Images Based on Color Histogram
Anas A. Amro Master of Informatics Palestine Polytechnic University Hebron, Palestine
[email protected]
Ibrahim N. Nassar Master of Informatics Palestine Polytechnic University Hebron, Palestine
[email protected]
Abstract: Clustering is the process of partitioning or grouping a given set of patterns into disjoint clusters, such that patterns in the same cluster are alike and patterns belonging to different clusters are different. This paper presents a new approach to image clustering that applies the k-means algorithm to color histograms. When millions of mixed images must be grouped into sets of similar images, clustering with this method can be beneficial because of its efficiency.
Keywords: Clustering; K-means algorithm; Color histogram.
I. INTRODUCTION
Clustering of data is a method by which a large set of data is grouped into clusters of smaller sets of similar data. Computer-assisted analysis must partition objects into groups and must provide an explanation for this partitioning [1]. Many clustering methods exist to partition a data set by some natural measure of similarity [2]. This similarity measure places similar objects close to one another so that they form a group, and in this way several clusters of related objects are formed. An ideal clustering algorithm classifies data such that samples belonging to the same cluster are close to each other, while samples from different clusters are farther apart.
Many clustering algorithms are available. A popular one is k-means: given a number of clusters, the algorithm iterates to find the best clusters for the objects. This paper discusses a method for clustering images using the k-means algorithm and color histograms. K-means clustering is an effective algorithm for extracting a given number of clusters of patterns from a training set.
Clustering many images proceeds in several phases. First, the images are read from a specified folder. The color histogram of each image is then calculated, and the distance between each image and all the other images is computed to measure their similarity. Finally, the images are grouped according to their color similarity.
The rest of this paper is organized as follows: Section II discusses the k-means algorithm and the color histogram. Section III presents the methodology. Finally, Section IV presents the conclusion and results.
Hashim Tamimi College of IT and Computer Eng. Palestine Polytechnic University Hebron, Palestine
[email protected]
II. BACKGROUND
A. K-means Algorithm
The k-means algorithm is a method for clustering objects into k partitions based on their attributes. It assumes that the k clusters exhibit Gaussian distributions and that the object attributes form a vector space. Its objective is to minimize the total intra-cluster variance. The points are clustered around centroids, which are obtained by minimizing the objective

$$\sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2 \qquad (1)$$

where there are k clusters $S_i$, $i = 1, 2, \ldots, k$, and $\mu_i$ is the centroid, or mean point, of all the points $x_j \in S_i$.
As part of this project, an iterative version of the algorithm was implemented. The steps of the algorithm are as follows:
1. Compute the intensity distribution (also called the histogram) of the intensities.
2. Initialize the centroids with k random intensities.
3. Repeat steps 4 and 5 until the cluster labels of the image no longer change.
4. Cluster the points based on the distance of their intensities from the centroid intensities:

$$c^{(i)} = \arg\min_{j} \lVert x^{(i)} - \mu_j \rVert^2 \qquad (2)$$

5. Compute the new centroid for each of the clusters:

$$\mu_j = \frac{\sum_{i} 1\{c^{(i)} = j\}\, x^{(i)}}{\sum_{i} 1\{c^{(i)} = j\}} \qquad (3)$$

where k is a parameter of the algorithm (the number of clusters to be found), i iterates over all the intensities, j iterates over all the centroids, $\mu_j$ are the centroid intensities, and $1\{\cdot\}$ equals 1 when its condition holds and 0 otherwise [3].
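The five steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function name `kmeans_intensities`, the `seed` and `max_iter` parameters, and the rule that an empty cluster keeps its previous centroid are all choices made for the example.

```python
import random
from collections import Counter


def kmeans_intensities(intensities, k, seed=0, max_iter=100):
    """Cluster scalar intensities into k groups; returns (centroids, labels).

    `labels` maps each distinct intensity value to the index of its centroid.
    """
    # Step 1: histogram of the intensities (value -> number of occurrences).
    hist = Counter(intensities)
    values = sorted(hist)
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and the number of distinct intensities")

    # Step 2: initialise the centroids with k random (distinct) intensities.
    rng = random.Random(seed)
    centroids = [float(v) for v in rng.sample(values, k)]

    labels = None
    for _ in range(max_iter):  # assumption: safety cap on the iteration count
        # Step 4, Eq. (2): assign each intensity to its nearest centroid.
        new_labels = {
            v: min(range(k), key=lambda j: (v - centroids[j]) ** 2)
            for v in values
        }
        # Step 3: stop once the labels no longer change.
        if new_labels == labels:
            break
        labels = new_labels

        # Step 5, Eq. (3): each centroid becomes the mean of its members,
        # weighted by how often each intensity occurs.
        for j in range(k):
            members = [v for v in values if labels[v] == j]
            weight = sum(hist[v] for v in members)
            if weight:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = sum(hist[v] * v for v in members) / weight
    return centroids, labels


pixels = [10, 11, 12, 11, 198, 200, 202, 200]
centroids, labels = kmeans_intensities(pixels, k=2)
print(sorted(centroids))  # [11.0, 200.0]
```

Working on the histogram rather than on individual pixels means each distinct intensity is assigned only once per iteration, which keeps the cost per iteration independent of the image size.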
B. Color Histogram
A color histogram is a collection of counts of data organized into a set of predefined bins. The data are not restricted to intensity values: they can be any feature that is useful for describing the image. The data contained in a digital image can also be displayed as a histogram, that is, a plot of the pixel values against the number of pixels that have each value.

III. METHODOLOGY
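The histogram step of the method relies on the color histogram of Section II-B. As a minimal sketch of that step, the code below bins each color channel separately and compares two histograms with a Euclidean distance. The per-channel layout, the normalization by pixel count, the Euclidean distance, and the names `color_histogram` and `histogram_distance` are assumptions made for illustration; the paper does not specify them.

```python
import math


def color_histogram(pixels, bins=8):
    """Return a normalised per-channel color histogram as a flat list.

    `pixels` is an iterable of (r, g, b) tuples with channel values in 0..255.
    The result has 3 * bins entries: bins for red, then green, then blue.
    """
    hist = [0.0] * (3 * bins)
    count = 0
    for pixel in pixels:
        for channel, value in enumerate(pixel):
            # Map a channel value in 0..255 to a bin index in 0..bins-1.
            index = min(value * bins // 256, bins - 1)
            hist[channel * bins + index] += 1
        count += 1
    if count == 0:
        raise ValueError("image has no pixels")
    # Normalise so that images of different sizes are comparable;
    # the bins of each channel then sum to 1.
    return [h / count for h in hist]


def histogram_distance(h1, h2):
    """Euclidean distance between two histograms of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(h1, h2)))


# Tiny synthetic "images" (flat pixel lists) standing in for real files.
red_a = [(250, 10, 10)] * 4
red_b = [(240, 20, 5)] * 9
blue = [(10, 10, 250)] * 4

features = [color_histogram(img, bins=4) for img in (red_a, red_b, blue)]
print(histogram_distance(features[0], features[1]))  # 0.0 (same bins)
print(histogram_distance(features[0], features[2]))  # 2.0
```

With few bins, similar shades fall into the same bin, so the two reddish images above are indistinguishable; more bins separate finer color differences at the cost of longer feature vectors.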
[Flowchart: Start → Input name of images' folder, number of clusters, name of output folder → Histogram (# of bins) → K-means (# of k) → Image new folders based on # of k → End]
Figure 1: Image clustering based on k-means algorithm and histogram

Figure 1 shows the steps used to group the images into k clusters. First, we enter the name of the images folder, which contains many types of images. Next, we input the number of clusters k. We also enter a name for the new folder, which will contain the files grouped according to k.

IV. CONCLUSION AND RESULTS

We have successfully implemented the k-means clustering algorithm. After the program reads the images in the folder, it computes a histogram for each image, compares each image with the others, and finally classifies all of them into a new folder. As k increases, the time needed to classify the images increases; the relationship between them is shown in Figure 2. Likewise, as the number of bins increases, the time increases; the result is shown in Figure 3.

Figure 2: The relationship between k and time

Figure 3: The relationship between bins and clusters
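The final step, placing each image in a folder for its cluster, can be sketched as follows. This is only an illustration: the function name `copy_into_cluster_folders` and the `cluster_<label>` folder naming are hypothetical, and the labels are assumed to come from a clustering step that has already run.

```python
import shutil
from pathlib import Path


def copy_into_cluster_folders(image_paths, labels, output_dir):
    """Copy each image into output_dir/cluster_<label>; return the new paths."""
    if len(image_paths) != len(labels):
        raise ValueError("need exactly one label per image")
    output_dir = Path(output_dir)
    copied = []
    for path, label in zip(image_paths, labels):
        # Hypothetical naming scheme: one subfolder per cluster index.
        target_dir = output_dir / f"cluster_{label}"
        target_dir.mkdir(parents=True, exist_ok=True)
        # copy2 leaves the original file in place and preserves its metadata.
        copied.append(Path(shutil.copy2(path, target_dir)))
    return copied


# Usage with hypothetical paths and labels:
# copy_into_cluster_folders(["in/a.jpg", "in/b.jpg"], [0, 1], "out")
```

Copying rather than moving leaves the input folder untouched, so the clustering can be rerun with a different k. Two images with the same file name in the same cluster would overwrite each other in this sketch.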
REFERENCES
[1] M. J. A. Berry, G. Linoff, Data Mining Techniques: For Marketing, Sales and Customer Support. John Wiley & Sons, NY, USA, 1997.
[2] M. S. Aldenderfer, R. K. Blashfield, Cluster Analysis. Sage Publications, Beverly Hills, USA, 1984.
[3] S. Clerk Tatiraju, Avi Mehta, Image Segmentation using k-means clustering, EM and Normalized Cuts.
[4] Team, O. D. (2011). OpenCV 2.4.5.0 documentation (Clustering, Histogram). Retrieved 5 1, 2013, from OpenCV: http://docs.opencv.org/modules/core/doc/