Data Clustering on Taiwan Crop Sales under Hadoop Platform

Chao-Lung Yang1, Mohammad Riza Nurtam2

1 Department of Industrial Management, National Taiwan University of Science and Technology, No.43, Sec. 4, Keelung Rd., Da'an Dist., Taipei 10607, Taiwan (R.O.C.) [email protected]

2 Department of Industrial Management, National Taiwan University of Science and Technology, No.43, Sec. 4, Keelung Rd., Da'an Dist., Taipei 10607, Taiwan (R.O.C.) [email protected]

Abstract. Hadoop is one of the most promising cloud computing platforms for Big Data analytics, the process of discovering hidden patterns, unknown correlations, and other valuable information in extremely large distributed datasets. In this paper, data clustering was implemented on a Hadoop platform to study a large crop sales dataset collected at distributed locations in Taiwan. A Hadoop infrastructure was built to give access to the distributed data centers. An online clustering algorithm from Mahout, a scalable machine learning library, was used to analyze crop price and yield data from the distributed datasets. Such a clustering analysis is usually exhausting and time consuming if a single machine is in charge of the whole process. Therefore, in this research, the clustering jobs are handled in an experimental distributed Hadoop environment. The results can support crop planning decisions by forecasting or detecting demand changes in the market as early as possible.

Keywords: Big Data Analytics, Hadoop, Mahout, Clustering, Distributed Computing

1 Introduction

Nowadays, more and more data is collected and stored every day (Gopalkrishnan, Steier et al. 2012), and the growth in data size has come to resemble Moore’s Law (Fisher, DeLine et al. 2012); that is, the volume of collected data almost doubles every year. How to analyze such huge datasets and create value from them has attracted a lot of attention. Big data is a term coined by data scientists to name these huge datasets. However, definitions of big data vary. A simple definition is a dataset that is too large to fit on a single drive, so that it has to be stored in distributed storage (Fisher, DeLine et al. 2012). Moreover, IBM characterizes big data by three Vs (Volume, Variety, and Velocity): the data is very large, comes in structured or unstructured form, and grows quickly over time (IBM, Zikopoulos et al. 2011). To extract valuable information from big data, a special tool set is needed that can handle a relatively large data repository using fast computation resources. Big data analytics is an emerging research area concerned with examining large amounts of data of a variety of types to uncover hidden patterns, unknown correlations, and other useful information (Rouse 2012).

In Taiwan, crop sales data, including vegetable and fruit prices and sales volumes, is collected daily at geographically distributed crop sales markets (AFA 2013). In each market, a huge number of crop sales transactions are processed manually. Given the variety of crops, understanding crop sales patterns and predicting crop prices in the coming year or season are important for farmers planning crop cultivation. In this research, we use the public database to study the crop sales data. A Hadoop platform was built, and a clustering algorithm in Mahout was used to perform clustering analysis on Taiwan crop sales data in order to discover hidden patterns in the data.

The remainder of this paper is organized as follows. Section 2 introduces the Hadoop platform and the basic operation of MapReduce. Section 3 describes how Mahout under the Hadoop platform was used to perform a data mining analysis on Taiwan crop data and presents the experimental results. Finally, a summary and managerial implications are given in Section 4.

2 Literature Review

2.1 Hadoop

Hadoop is a distributed computing platform developed by the open source community to work with large datasets. Hadoop enables data scientists to store and analyze data by using multiple computing machines as a cluster. The development of Hadoop was initially inspired by a paper on MapReduce presented by Google in 2004 (Dean and Ghemawat 2004, Dean and Ghemawat 2008). That influential research inspired the open source community to implement the map-reduce paradigm, which originates in functional programming (Hughes 1989), as an open source project, and Hadoop is the result (Harris 2013).


Hadoop consists of several core components: Hadoop Common, Hadoop YARN, the Hadoop Distributed File System (HDFS), and Hadoop MapReduce (Hadoop 2013). Hadoop Common is a set of utilities for working with the Hadoop platform, while YARN is a framework for computational job scheduling and cluster management. HDFS is a data management service that handles distributed storage, and Hadoop MapReduce is a divide-and-conquer system for large-scale data computation. The typical architecture of a Hadoop cluster is shown in Fig. 1.

[Diagram: a Hadoop Client and a Hadoop Cluster; the cluster contains a Name Node and Cluster Machines, each running a MapReduce Agent and an HDFS Node]

Fig. 1 Typical Hadoop cluster

A Hadoop cluster consists of several machines and services. At least one node of the cluster should run the HDFS service and the MapReduce service. In HDFS, the name node is the service that manages the file system metadata and the assignment of data blocks to data nodes. Usually, a secondary name node is also set up; it keeps periodic checkpoints of the name node's metadata, which supports recovery if the primary name node fails to work properly. Likewise, when a worker node fails, another node is able to take over its work. For reliability, data copied onto HDFS is replicated to multiple data nodes. This replication also makes it possible to retrieve data from the nearest node (Shvachko, Hairong et al. 2010).

Hadoop uses the map-reduce paradigm to provide distributed and parallel processing of large datasets. This programming model consists of a Map function, which performs filtering and sorting on the dataset, and a Reduce function, which performs summary routines to aggregate the data from the distributed data nodes. The output of the Map function is a set of intermediate key-value pairs, and these pairs are the input of the Reduce function. An example of the MapReduce operation is shown in Fig. 2, which computes the average demand for vegetables (cabbage, broccoli, and carrot) from an HDFS data source (Dean and Ghemawat 2008, Leu, Yee et al. 2010, Espinosa, Hernandez et al. 2012). The data stored in HDFS is split into several partitions that are assigned to mapping workers. The mapping worker nodes process the input data and map it into intermediate pairs. Then, the pairs are shuffled and stored in local files, where each file holds the data for one specific key. After the mapping process has finished, the reducing workers retrieve the mapping files from the remote machines, run the reducing process, and finally store the result in output files.

[Diagram: input records (Vegetables, Demand) are divided into splits, processed by map workers, and then aggregated by reduce workers]

Input records (Vegetables, Demand): (Cabbage, 10), (Broccoli, 9), (Carrot, 15), (Carrot, 10), (Broccoli, 19), (Cabbage, 7), (Carrot, 15), (Broccoli, 10), (Cabbage, 7), (Carrot, 9), (Cabbage, 15), (Broccoli, 21), (Cabbage, 10), (Broccoli, 12), (Carrot, 10), (Cabbage, 20), (Broccoli, 4), (Carrot, 12)

Stages: Input files (on HDFS) -> Map phase -> Intermediate files (on local disks) -> Reduce phase -> Output files (on HDFS)

Fig. 2 MapReduce operation
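The average-demand example of Fig. 2 can be imitated on a single machine to make the map, shuffle, and reduce steps concrete. The following is only a minimal Python sketch, not Hadoop code: the function names and the number of splits are illustrative assumptions, and the records assume that the vegetable and demand columns of the figure pair up row by row.

```python
from collections import defaultdict

# Records as read from Fig. 2, assuming the two columns pair up row by row.
RECORDS = [
    ("Cabbage", 10), ("Broccoli", 9), ("Carrot", 15), ("Carrot", 10),
    ("Broccoli", 19), ("Cabbage", 7), ("Carrot", 15), ("Broccoli", 10),
    ("Cabbage", 7), ("Carrot", 9), ("Cabbage", 15), ("Broccoli", 21),
    ("Cabbage", 10), ("Broccoli", 12), ("Carrot", 10), ("Cabbage", 20),
    ("Broccoli", 4), ("Carrot", 12),
]


def split_records(records, n_splits):
    """Divide the input into contiguous partitions, one per map worker."""
    size = max(1, -(-len(records) // n_splits))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def map_phase(partition):
    """Emit intermediate pairs: key = vegetable, value = (demand, count)."""
    return [(vegetable, (demand, 1)) for vegetable, demand in partition]


def shuffle(mapped_partitions):
    """Group intermediate values by key, as the shuffle step does."""
    groups = defaultdict(list)
    for pairs in mapped_partitions:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(values):
    """Aggregate (demand, count) values into an average demand."""
    total = sum(demand for demand, _ in values)
    count = sum(n for _, n in values)
    return total / count


def average_demand(records, n_splits=3):
    mapped = [map_phase(part) for part in split_records(records, n_splits)]
    return {key: reduce_phase(values) for key, values in shuffle(mapped).items()}


if __name__ == "__main__":
    for vegetable, average in sorted(average_demand(RECORDS).items()):
        print(vegetable, round(average, 2))
```

Emitting a (demand, count) pair rather than a bare demand keeps the reduce step correct even if partial sums were combined before the shuffle, which is why this sketch uses that form.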

2.2 Mahout

Mahout is a scalable data mining and machine learning library that can be run on the Hadoop distributed platform or on a local system (Esteves and Chunming 2011, Esteves, Pais et al. 2011). Mahout can be used to process large datasets with many data mining algorithms that are already implemented in Java. Mahout currently supports four different data mining tasks: classification, clustering, recommendation mining, and frequent itemset mining. Mahout is developed based on the multicore MapReduce algorithm (Chu, Kim et al. 2006). The input and result files of the data mining process are saved in sequence file format. To read the results, the result file has to be converted to a readable file with the utility programs clusterdump and seqdumper, which are provided with Mahout.

3 Experiment and Result

Taiwan crop sales data are provided by the Taiwan Agriculture and Food Agency at http://amis.afa.gov.tw. This web-based database has multiple query pages, and the data has several attributes such as city, crop category, weather, sales price, and sales volume. The data are collected from major Taiwan markets every day, starting from 1 January 1996 (85-01-01 in Minguo calendar format), and cover 2767 crop commodities in three categories: vegetables, fruits, and flowers. Using an automatic data retrieval program, we retrieved the data from the public database and stored it in a MariaDB database. From this database, a text file in vector format can be created with an SQL query and sent to HDFS for analysis. The collected data was used for the data analysis and Mahout experiments described in the following sections.

3.1 Preliminary data analysis

To gain a better understanding of the crop sales data, a preliminary data analysis was conducted. For this analysis, we selected a particular crop, persimmon tomato, with crop ID ‘FJ1’. Persimmon tomato is a large variety of tomato with some lines on the body of the fruit. In Fig. 3, two datasets are plotted together. The red curve is the time series of the sales price of persimmon tomato in 1999; the blue dashed curve is the time series of its sales volume in the same year. As can be seen, the sales price of this tomato is relatively low from February to July. The price tends to increase in the middle of summer and reaches its peak around October (at the beginning of winter in Taiwan). The sales volume, on the other hand, follows a different trend and is not as steep as the price. Sales tend to increase at the beginning of the year, but with very large day-to-day fluctuation. The volume then decreases slightly through the year until October, after which it starts to increase again. The seasonal effect on the sales price and demand of persimmon tomato is clear. A careful look at the data shows that the trend of the sales price is the opposite of that of the sales volume: demand decreases when the price is high and increases when the price is low. To summarize, persimmon tomatoes are available in large stock and traded in large volume at the beginning of spring, and the price is lower than in summer.

Fig. 3 Sample data of Persimmon Tomato sales
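The observation that price and volume move in opposite directions can be quantified with a correlation coefficient. The snippet below is a minimal Python sketch of the Pearson correlation; the monthly price and volume values are hypothetical placeholders chosen for illustration, not figures from the Taiwan crop sales database, and this step is not part of the original analysis.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / sqrt(spread_x * spread_y)


if __name__ == "__main__":
    # Hypothetical monthly averages, for illustration only.
    monthly_price = [22.0, 15.0, 14.0, 16.0, 18.0, 21.0, 27.0, 33.0, 38.0, 41.0, 30.0, 24.0]
    monthly_volume = [60.0, 85.0, 90.0, 82.0, 75.0, 66.0, 52.0, 44.0, 38.0, 35.0, 50.0, 58.0]
    print(round(pearson(monthly_price, monthly_volume), 3))
```

A value close to -1 would support the reading that a higher price goes together with a lower traded volume; a value near 0 would suggest that the visual impression is not a linear relationship.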


3.2 Data clustering on crop sales data

The seasonal pattern was discussed in the previous section, but that analysis relies heavily on a domain expert’s judgment, and it is sometimes very difficult to separate the data, especially in a large-scale dataset such as the Taiwan crop sales data. Using the Mahout k-means algorithm under Hadoop (3 nodes in this case), the clustering analysis was performed 5 times, and the average Sum of Squared Error (SSE) over the runs was computed for each number of clusters, K. The results are shown in Fig. 5. As the plot shows, the SSE decreases from K = 2 to K = 10, but the drop at K = 3 appears to be the largest. To visualize the clustering, the scatter plot for K = 3 is shown in Fig. 6.

Fig. 4 Bimonthly sales data

Fig. 5 SSE for various values of K in Mahout k-means clustering
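The procedure behind Fig. 5, running k-means several times for each K and averaging the SSE, can be sketched as follows. This is a plain Python illustration of Lloyd's k-means on a single machine, not the Mahout implementation used in the experiment; the synthetic (volume, price) points, the cluster centers, and all function names are assumptions made for the example.

```python
import random


def squared_distance(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
        for p in points
    ]


def kmeans_sse(points, k, rng, max_iterations=100):
    """Run k-means from a random start and return the final SSE."""
    centroids = rng.sample(points, k)
    for _ in range(max_iterations):
        labels = assign(points, centroids)
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                updated.append((
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                ))
            else:
                updated.append(centroids[c])  # keep the centroid of an empty cluster
        if updated == centroids:
            break
        centroids = updated
    labels = assign(points, centroids)
    return sum(squared_distance(p, centroids[label]) for p, label in zip(points, labels))


def average_sse(points, k, runs=5, seed=0):
    """Average the SSE over several randomly initialized runs."""
    rng = random.Random(seed)
    return sum(kmeans_sse(points, k, rng) for _ in range(runs)) / runs


def make_demo_points(seed=1):
    """Synthetic (volume, price) points around three hypothetical centers."""
    rng = random.Random(seed)
    centers = [(80.0, 10.0), (50.0, 25.0), (15.0, 30.0)]
    return [(rng.gauss(cx, 5.0), rng.gauss(cy, 3.0)) for cx, cy in centers for _ in range(40)]


if __name__ == "__main__":
    demo_points = make_demo_points()
    for k in range(2, 11):
        print(k, round(average_sse(demo_points, k), 1))
```

Plotting the printed values against K gives a curve of the same kind as Fig. 5; the K at which the curve stops dropping sharply is a common heuristic choice for the number of clusters. Because the start is random, individual runs can end in different local optima, which is one reason for averaging over several runs.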

In Fig. 6, different colors and symbols indicate the 3 clusters. As can be seen, three clusters are revealed by the price-volume combination. Cluster 1 is high volume and low price; cluster 2 is middle volume and middle price. Interestingly, cluster 3 is a low-volume group across the whole price range. This grouping of price-volume data provides another view of tomato sales beyond the time series data. The clustering results can readily indicate sales patterns that might be correlated with other factors such as the weather or the market. Further data analysis is needed to find the influential factors that cause these clustering results.

Fig. 6 1999 crops sales data in 3 clusters

Fig. 7 Hadoop performance


The k-means method was run with different numbers of computational nodes, as shown in Fig. 7. The more computational nodes are used, the faster the algorithm runs. The value of K for the k-means algorithm does not seem to influence the running time. Although intuitive, this scalability is in fact the advantage of the Hadoop platform: when more computational effort is needed, more nodes can be assigned to the job to shorten the processing time. In this research, we focus on applying the Hadoop platform to demonstrate data mining on Taiwan crop sales data. Mahout, which uses the MapReduce framework, can perform data mining work across different computational nodes. When the data is very large, the Hadoop platform makes it possible to run the analysis in parallel, as this experiment demonstrates.

4 Conclusions

Nowadays, Hadoop is a prominent platform for data scientists who work with big data. Hadoop implements the map-reduce programming paradigm and provides distributed and parallel processing of big data. To analyze the data, an analysis program has to be provided and run under the MapReduce framework. Mahout is one of the promising tools for machine learning and data mining on big data. By running on the Hadoop framework, Mahout can use distributed and parallel processing to enhance analysis performance. In this research, we used Taiwan crop sales data as a sample of big data, which can be difficult to analyze on a single computer because of its size. Mahout under Hadoop was applied to analyze one experimental example, persimmon tomato data, from the Taiwan crop sales dataset, which was collected from a public agriculture database. K-means clustering was used to perform clustering analysis on sales price and volume. The results show that Taiwan crop sales data have a seasonal effect in sales price and volume, and that the sales can be grouped into multiple clusters in which different patterns are revealed.

5 Acknowledgements

This study was conducted under the "Project Digital Convergence Service Open Platform" of the Institute for Information Industry, which is subsidized by the Ministry of Economic Affairs of the Republic of China.

6 References

AFA (2013). "Agriculture Market Information System." Retrieved July 2013, from http://amis.afa.gov.tw.

Chu, C., S. Kim, Y. Lin, Y. Yu, G. Bradski, A. Ng and K. Olukotun (2006). Map-Reduce for Machine Learning on Multicore. NIPS, MIT Press.

Dean, J. and S. Ghemawat (2004). MapReduce: simplified data processing on large clusters. Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6. San Francisco, CA, USENIX Association: 10-10.

Dean, J. and S. Ghemawat (2008). "MapReduce: simplified data processing on large clusters." Commun. ACM 51(1): 107-113.

Espinosa, A., P. Hernandez, J. C. Moure, J. Protasio and A. Ripoll (2012). "Analysis and improvement of map-reduce data distribution in read mapping applications." Journal of Supercomputing 62(3): 1305-1317.

Esteves, R. M. and R. Chunming (2011). Using Mahout for Clustering Wikipedia's Latest Articles: A Comparison between K-means and Fuzzy C-means in the Cloud. 2011 IEEE Third International Conference on Cloud Computing Technology and Science (CloudCom).

Esteves, R. M., R. Pais and R. Chunming (2011). K-means Clustering in the Cloud -- A Mahout Test. 2011 IEEE Workshops of International Conference on Advanced Information Networking and Applications (WAINA).

Fisher, D., R. DeLine, M. Czerwinski and S. Drucker (2012). "Interactions with big data analytics." interactions 19(3): 50-59.

Gopalkrishnan, V., D. Steier, H. Lewis and J. Guszcza (2012). Big data, big business: Bridging the gap. 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, BigMine-12, held in conjunction with SIGKDD Conference, August 12, 2012, Beijing, China, Association for Computing Machinery.

Hadoop (2013, 05/15/2013). "Welcome to Apache™ Hadoop®!" Retrieved 2013-05-15, from http://hadoop.apache.org/.

Harris, D. (2013, 04/04/2013). "The history of Hadoop: From 4 nodes to the future of data." Retrieved 2013-05-15, from http://gigaom.com/2013/03/04/the-history-of-hadoop-from4-nodes-to-the-future-of-data/.

Hughes, J. (1989). "Why functional programming matters." The computer journal 32(2): 98-107.

IBM, P. Zikopoulos, C. Eaton, T. Deutsch and G. Lapis (2011). Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data, McGraw-Hill Education.

Leu, J.-S., Y.-S. Yee and W.-L. Chen (2010). Comparison of Map-Reduce and SQL on Large-Scale Data Processing. 2010 International Symposium on Parallel and Distributed Processing with Applications (ISPA).

Rouse, M. (2012). "Definition; Big Data Analytics." from http://searchbusinessanalytics.techtarget.com/definition/big-data-analytics.

Shvachko, K., K. Hairong, S. Radia and R. Chansler (2010). The Hadoop Distributed File System. 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST).