2015 11th International Conference on Semantics, Knowledge and Grids

An Improved MapReduce Design of Kmeans with Iteration Reducing for Clustering Stock Exchange Very Large Datasets

Oussama Lachiheb, Mohamed Salah Gouider, Lamjed Ben Said
Laboratoire SOIE, University of Tunis
Abstract—This paper targets the problem of clustering very large datasets, one of the most challenging tasks in data mining and processing. We propose an improved MapReduce design of the Kmeans algorithm with an iteration-reducing method. Experiments show that this method reduces the number of iterations and the execution time of the Kmeans algorithm while keeping 80% of the clustering accuracy. The combination of the MapReduce programming paradigm and iteration-reducing techniques makes it possible to process the huge volume of data generated by daily stock exchange transactions, which enables better decision making by analysts.

Keywords—MapReduce, Clustering, Big Data, Stock Exchange, Risk

978-1-4673-9808-4/15 $31.00 © 2015 IEEE. DOI 10.1109/SKG.2015.24

I. INTRODUCTION

Over the past 20 years, the main challenge has shifted from collecting data to processing very large datasets; dealing with this increasing amount requires powerful tools for knowledge discovery. Stock exchanges around the world offer a complete range of exchange-related services, including trading, clearing, settlement, and depository services, traded on a day-to-day basis. Storing all related data generates a huge volume of information that is a valuable source for financial and data analysts.

Clustering [1] is an important technique in the rapidly growing field of exploratory data analysis and is applied in a variety of engineering and scientific fields. It organizes data by abstracting the underlying structure, either as a grouping of individuals or as a hierarchy of groups. Clustering is one of the most widely used tasks in stock exchange data analysis, applied to portfolio management, price clustering, and the identification of risky investments.

Clustering very large datasets has been the target of many researchers, especially in stock markets, because of the accuracy obtained when processing huge volumes of stock data. In the literature, many approaches have been proposed to deal with the increasing-volume problem, such as representative samples, parallelization, and better initial center selection [3,4,5], but they cannot preserve clustering accuracy because they discard an important part of the input dataset. Kmeans [1] has gained the interest of many researchers because of its simplicity and accuracy, and it has been revised in many works to handle large dataset clustering. One of the most important improvements is the MapReduce [6] implementation of Kmeans, which makes the processing faster but still requires a significant number of iterations and a long execution time, especially on the very large datasets we focus on.

This paper proposes an improved MapReduce design of Kmeans with iteration reducing, in the context of clustering very large stock exchange datasets in order to identify risky investments based on stock variations. This approach is called SMRKmeans. The contents of this paper are organized as follows: related works are presented in Section 2; Section 3 describes the proposed approach for reducing the Kmeans iterations and its MapReduce design; experimental studies are shown in Section 4; conclusions and future work are discussed in the last section.
II. LITERATURE REVIEW
As mentioned above, clustering in the data mining field is useful for discovering groups and identifying interesting distributions in the source data. However, in the Big Data era, applying clustering techniques to large datasets is a difficult task due to their high computational cost. Facing these challenges has been the target of many researchers, and many methods have been proposed to deal with large dataset clustering. BIRCH uses a summary data structure called a CF (clustering feature) tree, which reduces the clustering runtime, as explained in [9]. Guha and Rastogi proposed the CURE method [7], a clustering technique using representatives: it maintains a fixed number of representative points, generated by selecting well-scattered points from the cluster and then shrinking them toward the center by a specified fraction. CURE is an efficient clustering algorithm for large databases and can identify clusters with non-spherical shapes. Kmeans, known as one of the simplest clustering algorithms, has also been the subject of many improvements aimed at reducing the runtime on large datasets. The authors of [8] presented a method that summarizes the whole input dataset into a subset of points before running the Kmeans algorithm, which makes the processing faster. All the mentioned methods handle large datasets by discarding a part of the input, which impacts the clustering accuracy. These issues have led to a new trend in large dataset clustering based on parallel processing. As mentioned in the previous section, MapReduce is a programming paradigm that enables the parallel
processing of large databases. Adopting this paradigm has taken research on Kmeans to a new level. The authors of [10] proposed a MapReduce implementation of Kmeans; the treatment can be summarized in two major steps: a map function that assigns each record to the nearest center, and a reduce function that updates the center values after each iteration. The evaluation of this method with respect to speed-up, scale-up, and size-up shows encouraging results. Other implementations are discussed in works such as [11, 12]. However, even this model needs to be improved to face the challenging problems of data-intensive applications. To improve Kmeans based on MapReduce, Xialu and Zhu illustrated a method that reduces the runtime but affects the clustering accuracy by using sampling and discarding; this algorithm is explained in [15]. Recently, Van Hieu and Meesad [13] proposed a cutting-off method to reduce the number of iterations of MapReduce Kmeans. This method assumes that the last iterations contribute least to the percentage of correctly clustered records. It checks the difference between the cluster centers of the two last iterations and stops the treatment if it reaches a fault tolerance ε.

In this work, we focus on stock exchange trading as one of the most data-intensive applications, and on the clustering task because of its importance in making data useful for analysts and decision makers. We propose an improvement to the cutting-off method that can be applied to very large stock datasets, together with a parallelization using the MapReduce programming paradigm. This new approach, called SMRKmeans, is presented in the next sections.

III. SMRKMEANS FOR CLUSTERING STOCK EXCHANGE VERY LARGE DATASETS

Stock exchanges provide services for stock brokers and traders to buy or sell stocks, bonds, and other securities. The change in trading behavior and the evolution of hardware and software in the last decade have led to a massive amount of data that needs to be analyzed in near real time. Clustering historical stocks has always been an efficient and widely used tool in stock markets, employed in problems such as portfolio optimization and price clustering. The Kmeans algorithm has been employed in many stock market research works, such as [14] and [18], but the problem of the huge amount of data has not been addressed in these works despite its importance and its effect on the clustering accuracy. Our stock data clustering approach addresses two main challenges: the huge volume of stock data, which is resolved by adopting the MapReduce paradigm, and the high computational cost, which is resolved by applying our improved cutting-off method.

A. An improved method for reducing the number of iterations

The high volume of the input dataset increases the complexity of the Kmeans algorithm. To reduce the computational cost while keeping a good clustering accuracy, we discard objects that do not contribute to the percentage of correctly clustered objects. Our preliminary experiments and our survey show that the last iterations have the least contribution to the percentage of correctly clustered objects. For that reason, we use the method introduced in [13] to stop the iterations when Δ reaches a predefined value. Let:

• Ci(j) be the center of cluster Ci at iteration j;
• Ci(j − 1) be the center of cluster Ci at iteration j − 1.

Δ = |Ci(j) − Ci(j − 1)|
Stop condition: Δ < ε

This cutting-off method is improved by discarding stocks that show no variation in their prices, due to their low impact on both clustering accuracy and stock analysis, as identified in our experimental studies. The employment of our cutting-off method is demonstrated in Algorithm 1.

Algorithm 1 Kmeans
Input: dataset X, number of clusters K, ε
Output: index of the cluster that each object belongs to
  Randomly initialize K centers
  for each object in X do
    if variation ≠ nothing then
      Calculate distances
      Assign the object to the nearest cluster
      Update the new centers
      Calculate Δ = |Ci(j) − Ci(j − 1)|
      if Δ < ε then stop processing
    end if
  end for

B. Parallelization using MapReduce

The next step in our work is to parallelize the algorithm described in Algorithm 1 using the MapReduce programming paradigm. As mentioned in the previous section, we have to specify a map function and a reduce function. In the case of SMRKmeans, the map function assigns each sample to the nearest center and checks the stop condition and the discarding condition that reduces the number of iterations; the reduce function performs the procedure of updating the new centers. Algorithm 2 illustrates the map function, which is executed in parallel by N mappers (machines or tasks). The map procedure is followed by a combine function designed for local centroid calculations; this function is demonstrated in Algorithm 3.
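To make the discarding condition and the ε-based cutting-off concrete, the following is a minimal single-machine sketch in Java of the procedure described in Algorithm 1. The class name, the max-minus-min variation measure, the seeding of centers from the first records, and the use of -1 to mark discarded records are our own illustrative assumptions, not the paper's implementation.

```java
import java.util.Arrays;

// Illustrative single-machine sketch of the iteration-reducing Kmeans
// (Algorithm 1). Hypothetical names and conventions; not the authors' code.
public class SMRKmeansSketch {

    // Returns the cluster index of each record, or -1 for discarded
    // records (stocks whose prices show no variation).
    public static int[] cluster(double[][] data, int k, double eps, int maxIter) {
        int dim = data[0].length;
        double[][] centers = new double[k][];
        // Simple deterministic seeding (assumption; the paper initializes randomly).
        for (int c = 0; c < k; c++) centers[c] = data[c % data.length].clone();
        int[] assign = new int[data.length];
        Arrays.fill(assign, -1);
        for (int it = 0; it < maxIter; it++) {
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < data.length; i++) {
                if (variation(data[i]) == 0.0) continue; // discarding condition
                int best = nearest(data[i], centers);
                assign[i] = best;
                counts[best]++;
                for (int d = 0; d < dim; d++) sums[best][d] += data[i][d];
            }
            double delta = 0.0; // Δ = max center movement between iterations
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue; // empty cluster keeps its center
                for (int d = 0; d < dim; d++) {
                    double updated = sums[c][d] / counts[c];
                    delta = Math.max(delta, Math.abs(updated - centers[c][d]));
                    centers[c][d] = updated;
                }
            }
            if (delta < eps) break; // cutting-off stop condition: Δ < ε
        }
        return assign;
    }

    // Price variation of a record across its attributes (assumed max - min).
    static double variation(double[] prices) {
        double min = prices[0], max = prices[0];
        for (double p : prices) { min = Math.min(min, p); max = Math.max(max, p); }
        return max - min;
    }

    // Index of the nearest center under squared Euclidean distance.
    static int nearest(double[] x, double[][] centers) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centers.length; c++) {
            double dist = 0.0;
            for (int d = 0; d < x.length; d++) {
                double diff = x[d] - centers[c][d];
                dist += diff * diff;
            }
            if (dist < bestDist) { bestDist = dist; best = c; }
        }
        return best;
    }
}
```

Stopping on Δ < ε rather than on a fixed iteration count is what trades a small amount of accuracy for fewer iterations, as the experiments in Section IV quantify.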
Algorithm 2 SMRKmeans Map Function
Input: a list of (key1, value1) pairs (key1 is the index of the object and value1 is its content), and a list of K global centers.
Output: a list of (key2, value2) pairs (key2 is a concatenation of key1 and the index of the cluster that the object belongs to, and value2 is the content of the object)
  Initialize a list of (key2, value2) pairs
  for each pair (key1, value1) do
    if value1(variation) ≠ nothing then
      Calculate the distances to the global centers
      Update key2
    end if
    Check the stop condition
  end for

Algorithm 3 SMRKmeans Combine Function
Input: a list of (key2, value2) pairs
Output: a list of (key3, value3) pairs (key3 is the index of a cluster, and value3 is a local center associated with the number of objects belonging to that cluster)
  Initialize a list of (key3, value3) pairs
  for each pair (key2, value2) do
    Calculate value3
    Update (key3, value3)
  end for

Algorithm 4 obtains the inputs from all combine functions and produces the global centers, as detailed in the reduce function.

Algorithm 4 SMRKmeans Reduce Function
Input: a list of (key3, value3) pairs
Output: a list of (key4, value4) pairs (key4 is the index of a cluster, and value4 is the global center of that cluster)
  Initialize a list of (key4, value4) pairs
  for each pair (key3, value3) do
    Calculate value4
    Update (key4, value4)
  end for

IV. EXPERIMENTAL RESULTS

This section explains our experimental studies; we evaluate the results in terms of number of iterations, running time, and clustering accuracy (based on the method described in [13] and [17]). The SMRKmeans algorithm was implemented in Java with the Hadoop framework [16] to execute the MapReduce tasks. We used two machines, each with an Intel i3 processor and 4 GB of RAM. Information about our experiment datasets is listed in Table I.

TABLE I: Experiment datasets

Dataset    Number of records   Number of attributes   Size
Dataset1   1,020,000           6                      1.1 GB
Dataset2   1,800,000           6                      1.6 GB

Dataset1 is a real dataset collected from the Tunisian stock exchange daily trading in fiscal years 2012, 2013, and 2014; Dataset2 was generated automatically with random values. Table II describes the obtained results in terms of reduced iterations and clustering accuracy. We varied ε several times; Table II shows that the lower the predefined ε, the higher the accuracy obtained.

TABLE II: Experiment results

Dataset    Stop condition ε   Clusters number   Reduced iterations   Accuracy
Dataset1   0.01               5                 41%                  83.12%
Dataset1   0.001              5                 29%                  96.55%
Dataset2   0.01               5                 57%                  79.34%
Dataset2   0.001              5                 31%                  95.06%

Thanks to the cutting-off method and the discarding of unimportant records, the number of iterations does not increase exponentially when the number of data samples increases; the obtained results are illustrated in Figure 1. The number of reduced iterations has a direct impact on the execution time of the clustering algorithm. In the case of traditional Kmeans, it takes more than one hour to cluster one million records, while the parallel methods proposed in [10] and [13] provide a considerable improvement in execution time. The experiments illustrated in Figure 2 show that our approach has the best execution time compared to the previous methods (PKmeans, FastKmeans, traditional Kmeans) while preserving a high clustering accuracy.

Fig. 1: Number of iterations
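The map/combine/reduce split of Algorithms 2-4 can be simulated in plain Java without a Hadoop cluster. The sketch below runs one SMRKmeans iteration: each mapper assigns its split's records to the nearest global center (skipping zero-variation records), a combiner accumulates local per-cluster sums and counts, and the reducer merges all partials into new global centers. The `Partial` class and all names are our own assumptions; the driver-side ε check from Algorithm 1 is omitted here for brevity.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation of one SMRKmeans iteration, mirroring the
// map / combine / reduce split of Algorithms 2-4. An illustrative
// sketch, not the authors' Hadoop implementation.
public class MapReduceIterationSketch {

    // Local (per-mapper) partial result: sum of assigned points and their count.
    static class Partial {
        double[] sum;
        long count;
        Partial(int dim) { sum = new double[dim]; }
    }

    // Map (Algorithm 2): assign each record of one input split to the
    // nearest global center, skipping records with no variation.
    // Combine (Algorithm 3): accumulate local sums/counts per cluster.
    static Map<Integer, Partial> mapAndCombine(double[][] split, double[][] centers) {
        Map<Integer, Partial> local = new HashMap<>();
        for (double[] x : split) {
            if (variation(x) == 0.0) continue; // discarding condition
            int c = nearest(x, centers);
            Partial p = local.computeIfAbsent(c, k -> new Partial(x.length));
            for (int d = 0; d < x.length; d++) p.sum[d] += x[d];
            p.count++;
        }
        return local;
    }

    // Reduce (Algorithm 4): merge the partials from all mappers into
    // new global centers; an empty cluster keeps its old center.
    static double[][] reduce(List<Map<Integer, Partial>> partials, double[][] oldCenters) {
        int k = oldCenters.length, dim = oldCenters[0].length;
        double[][] sums = new double[k][dim];
        long[] counts = new long[k];
        for (Map<Integer, Partial> m : partials)
            for (Map.Entry<Integer, Partial> e : m.entrySet()) {
                counts[e.getKey()] += e.getValue().count;
                for (int d = 0; d < dim; d++)
                    sums[e.getKey()][d] += e.getValue().sum[d];
            }
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            centers[c] = oldCenters[c].clone();
            if (counts[c] > 0)
                for (int d = 0; d < dim; d++)
                    centers[c][d] = sums[c][d] / counts[c];
        }
        return centers;
    }

    // Price variation of a record across its attributes (assumed max - min).
    static double variation(double[] x) {
        double min = x[0], max = x[0];
        for (double v : x) { min = Math.min(min, v); max = Math.max(max, v); }
        return max - min;
    }

    // Index of the nearest center under squared Euclidean distance.
    static int nearest(double[] x, double[][] centers) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centers.length; c++) {
            double dist = 0.0;
            for (int d = 0; d < x.length; d++) {
                double diff = x[d] - centers[c][d];
                dist += diff * diff;
            }
            if (dist < bestDist) { bestDist = dist; best = c; }
        }
        return best;
    }
}
```

Emitting local (sum, count) pairs from the combiner rather than raw points is what keeps the shuffle traffic to the reducer proportional to K rather than to the dataset size.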
Fig. 2: Execution time compared to previous methods
V. CONCLUSION

The experimental results, obtained on two very large stock exchange datasets, show that our approach can reduce up to 40% of the iterations while keeping more than 80% of the clustering accuracy. SMRKmeans performs the clustering of stocks in a reasonable time compared to previous approaches, which makes clustering tasks easier and faster. This approach can be employed in future work to build a real-time portfolio management system that improves investment decision making in stock markets.

REFERENCES

[1] Jain, A. K., Dubes, R. C.: Algorithms for Clustering Data. Prentice Hall Advanced Reference Series (1988)
[2] Shirkhorshidi, A. S., et al.: Big Data Clustering: A Review. In: Computational Science and Its Applications - ICCSA 2014, Springer International Publishing, pp. 707-720 (2014)
[3] Barioni, M. C. N., Razente, H., Marcelino, A. M. R., Traina, A. J. M., Traina, C.: Open issues for partitioning clustering methods: an overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 4, pp. 161-177 (2014)
[4] Hadian, A., Shahrivari, S.: High performance parallel k-means clustering for disk-resident datasets on multi-core CPUs. The Journal of Supercomputing, pp. 1-19 (2014)
[5] Bharill, N., Tiwari, A.: Handling Big Data with Fuzzy Based Classification Approach. In: Advance Trends in Soft Computing, STUDFUZZ, vol. 312, pp. 219-227 (2014)
[6] Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Communications of the ACM 51, pp. 107-113 (2008)
[7] Guha, S., Rastogi, R., Shim, K.: CURE: an efficient clustering algorithm for large databases. SIGMOD Record 27, pp. 73-84 (1998)
[8] Har-Peled, S., Mazumdar, S.: On coresets for k-means and k-median clustering. In: Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, Chicago, IL (2004)
[9] Zhang, T., Ramakrishnan, R., Livny, M.: BIRCH: an efficient data clustering method for very large databases. In: SIGMOD Conference, pp. 103-114 (1996)
[10] Zhao, W., Ma, H., He, Q.: Parallel k-means clustering based on MapReduce. In: Cloud Computing, Springer Berlin Heidelberg, pp. 674-679 (2009)
[11] Cui, X., et al.: Optimized big data K-means clustering using MapReduce. The Journal of Supercomputing 70.3, pp. 1249-1259 (2014)
[12] Anchalia, P. P., Koundinya, A. K., Srinath, N. K.: MapReduce Design of K-Means Clustering Algorithm. In: Information Science and Applications (ICISA), 2013 International Conference on, IEEE (2013)
[13] Hieu, V. D., Meesad, P.: Fast K-Means Clustering for Very Large Datasets Based on MapReduce Combined with a New Cutting Method. In: Knowledge and Systems Engineering, Springer International Publishing, pp. 287-298 (2015)
[14] Law, H., et al.: Processing of Kuala Lumpur Stock Exchange Resident on Hadoop MapReduce (2011)
[15] Cui, X., et al.: Optimized big data K-means clustering using MapReduce. The Journal of Supercomputing 70.3, pp. 1249-1259 (2014)
[16] Borthakur, D.: HDFS Architecture Guide. Hadoop Apache Project, p. 58 (2008)
[17] Wagner, S., Wagner, D.: Comparing clusterings: an overview. Universität Karlsruhe, Fakultät für Informatik (2007)
[18] Nanda, S. R., Mahanty, B., Tiwari, M. K.: Clustering Indian stock market data for portfolio management. Expert Systems with Applications 37.12, pp. 8793-8798 (2010)