Web Data Mining based on Cloud Computing

Liangfei Xue1

Dongfeng Yuan2

Mingyan Jiang3

Abstract With the recent success of cloud computing, data mining is becoming more accessible thanks to easier access to less expensive computational resources. In this paper, we use virtualization technology, the key technology of cloud computing, to build a Web data mining cloud model. The model consists of a storage cloud and a calculation cloud, established mainly through parallel storage technology and parallel computing technology. Finally, the paper describes a specific instance of Web data mining combined with cloud computing. This instance shows that the proposed method can satisfy the information demands of a mass of users concurrently, quickly, efficiently and in real time.

Keywords Web data mining, Cloud Computing, Cloud Model, Storage Cloud, Calculation Cloud.

1 Introduction

The wide adoption of the Internet has fundamentally altered the ways in which we communicate, gather information, conduct business and make purchases. How to extract useful information accurately from such a wide variety of data is now one of the most important problems, and it affects the development of society. This makes Web data mining interesting and challenging. Cloud computing is a new

Foundation: the National Natural Science Foundation of Shandong Province of China under Grant No. ZR2010FM040; Special Funding Project for Independent Innovation Achievements Transform of Shandong Province under Grant No. 2009ZHZX1A0108 and No. 2010ZHZX1A1001.

1 Liangfei Xue, School of Information Science and Engineering, Shandong University, Jinan 250100, China, e-mail: [email protected]

2 Dongfeng Yuan, School of Information Science and Engineering, Shandong University, Jinan 250100, China, e-mail: [email protected]

3 Mingyan Jiang, School of Information Science and Engineering, Shandong University, Jinan 250100, China, e-mail: [email protected]


concept that emerged from the development of parallel computing. With cloud computing, many customers can now mine mass data at low cost, which is of important scientific and commercial value. The potential value of cloud computing has attracted attention from Google, IBM and other foreign firms, and from domestic companies such as Inspur and Baidu [1,2]. Recent research on distributed data mining has focused on grid computing and has obtained some achievements. Literature [3] proposed a distributed data mining model based on the OGSI.net framework and gave a software deployment scheme for it, but the model was not applied to experimental projects. Literature [4] analyzes a data mining system for a grid environment, covering the personal computers that make it up, the data mining process and the work each process should complete. Foreign studies have shown that data mining based on cloud computing has the characteristic of low power consumption. This paper combines Web data mining and cloud computing through a study of the theory of both. As a result, we build a Web data mining cloud model that provides a rapid, real-time method for mining mass data on the Web.

2 Data Mining Using Cloud Computing Model

2.1 Web Data Mining

Web data mining is the data processing technology that meets this need: it applies the traditional ideas and methods of data mining to a large number of crawled Web pages in order to dig out useful information. Web data mining tasks can be divided into three main types: Web structure mining, Web content mining and Web usage mining [5]. Accordingly, the purpose of Web data mining is to extract useful information from Web links, Web content and structure, and user logs. As the Internet spreads and develops, the quantity of information keeps increasing and the information also changes over time, so data collection is a difficult task, especially in Web structure mining and Web content mining, which require crawling a great many Web pages. The rise of cloud computing has brought a great many applications, and the development of parallel computing makes cloud computing a widely useful approach to mass data mining. In a society where the information on the World Wide Web is exploding, the need for precise and effective Web data collection has promoted cloud computing. According to the current understanding, cloud computing is a Web-based, mass-participation parallel


computation mode. Cloud computing resources include computing power, expandable storage capacity and virtualization, and can provide related services to a mass of cloud users to handle complicated, large-scale task requests.

2.2 Distributed Storage Technology

In cloud computing, data storage is implemented with distributed data storage technology. This ensures highly reliable and highly available file storage and guarantees efficient use of resources. Google's GFS (Google File System) and Hadoop's open-source HDFS (Hadoop Distributed File System) are the most popular distributed data storage technologies in cloud computing. Redundant storage is used to ensure high reliability: during storage the information is divided into several data blocks, and backups of each block are produced on different physical nodes. In this way software reliability makes up for the deficiencies of the hardware, which addresses the pressing problem of how to do Web data mining cheaply and quickly. In addition, distributed data storage in cloud computing has a high throughput rate and a high transfer rate. The nodes are flexible and easy to manage, so a cloud computing system can satisfy user demand, serve a mass of users and adapt to a changing user group. Here we take the open-source HDFS as the example when discussing the principle and technological advantages of distributed storage in cloud computing.
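The redundant storage idea described above (divide the data into blocks, keep backups of each block on different physical nodes) can be illustrated with a minimal sketch. It is not the HDFS placement policy: the round-robin rule, the block size, the replication factor and the node names are all assumptions chosen for illustration.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut the payload into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list[str], replication: int) -> dict[int, list[str]]:
    """Assign every block to `replication` distinct nodes (simple round-robin, an assumption)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def readable_blocks(placement: dict[int, list[str]], failed_nodes: set[str]) -> list[int]:
    """Return the blocks that still have at least one replica on a healthy node."""
    return [
        block_id
        for block_id, holders in placement.items()
        if any(node not in failed_nodes for node in holders)
    ]


# Hypothetical cluster and payload, used only to exercise the functions.
data = b"x" * 1000
blocks = split_into_blocks(data, block_size=256)
nodes = ["node-a", "node-b", "node-c", "node-d"]
placement = place_replicas(len(blocks), nodes, replication=3)
survivors = readable_blocks(placement, failed_nodes={"node-a", "node-b"})

print(placement)
print(survivors)
```

With three replicas per block, every block in this toy cluster remains readable after two of the four nodes fail, which is the sense in which software redundancy compensates for unreliable hardware.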

2.3 Distributed Computing Technology

In distributed storage, computation should be moved to where the data is stored. Because the data is distributed across different DataNodes, distributed computing technology is inevitably involved. Map/Reduce was introduced by Google. Its basic requirement is that the data set to be processed can be broken down into many small data sets, each of which can be processed completely in parallel. Map/Reduce amounts to "task decomposition and results aggregation": it divides every operation on the data into two steps, the Map stage and the Reduce stage. Distributed computing technology thus turns a task into many finer-grained sub-tasks, which can be scheduled whenever spare nodes are available. In this way the task is processed faster.
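The "task decomposition and results aggregation" pattern can be sketched in a few lines of Python. This is a single-machine illustration, not Hadoop code: threads stand in for cluster nodes, and the log format (user identifier followed by the requested URL) is a hypothetical example of a Web usage mining input.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Map: emit a (url, 1) pair for every log line in one small data set."""
    return [(line.split()[1], 1) for line in chunk if line.strip()]


def shuffle(mapped_chunks):
    """Group the intermediate values by key so each key is reduced once."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: aggregate all values emitted for one key."""
    return key, sum(values)


def map_reduce(chunks, workers=2):
    """Run the map step on the chunks in parallel, then aggregate the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    groups = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in groups.items())


# Hypothetical access-log lines, already split into two small data sets.
log_chunks = [
    ["u1 /cotton", "u2 /wheat"],
    ["u3 /cotton", "u1 /cotton"],
]
hits = map_reduce(log_chunks)
print(hits)
```

Each chunk is mapped independently, so the map step can be scheduled on whichever worker is free; only the reduce step needs to see all values for a key.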


2.4 Virtualization Technology

The virtualization of resources is the storage cloud, a virtualized resource pool. The storage cloud provides data storage services and management operations for the calculation cloud. It is not a file system itself, but relies on the local file system to provide its services. For the safety of the data files, the storage cloud can monitor the number of files at any time. Besides being easy to use, the storage cloud has strong fault-tolerance and disaster-recovery mechanisms, ensuring the consistency of the whole file system. The virtualization of services is the calculation cloud, a virtualized service pool. The calculation cloud works on the mass data stored in the storage cloud according to the user's program request, and obtains and presents the final results through Map/Reduce parallel computing over that data. It is because of the calculation cloud and the storage cloud that the resources in the cloud can be dynamically expanded and configured, so that cloud computing can finally appear as a single logical whole. This makes it more convenient to complete data mining tasks. As a result, virtualization is the most critical core technology driving Web data mining.

3 Cloud Model of Web Data Mining

3.1 Cloud Model

Although Web data mining can satisfy users' information service requests, the process is tedious and consumes a lot of energy. Cloud computing offers efficient information processing and low energy consumption, so applying it to Web data mining is a very good solution to this problem. The data mining software adopts a parallel data mining mode based on cloud computing: the same algorithm can be distributed over multiple nodes, multiple algorithms are executed in parallel, and resources are allocated according to need. The distributed computing model is the same as the cloud computing model using virtualization, and data processing also uses the distributed file system of cloud computing. The data set required by the cloud model is already pretreated; it comes from our storage cloud and is ready for data mining. Thus, we can define our own functions in the calculation cloud for our data mining process. The output of the calculation cloud is again stored in distributed form in our resource pool. Here, we define the pretreatment of data, the storing of data in the storage cloud, the computing on data in the calculation cloud and the storing of results in the storage cloud as the Web data mining cloud model. The model is shown in figure 1.

[Figure: labels shown are Client, Cloud server, request, command, pretreat, Original data, Data sets from web mass information, Data sets, Computing result, and Return the result to the client. Storage cloud stores the data sets and the output produced by the calculation cloud. Calculation cloud receives the command from the cloud server.]

Fig. 1 Web data mining cloud model

In the cloud model, the storage cloud and the calculation cloud are the same computer cluster. When the data collection is ready, we regard the cluster's resources as a virtual storage cloud. When a user order arrives, the cluster runs the data mining algorithm, and at that point the virtual storage cloud becomes the calculation cloud. After the results are stored back in the data blocks, the cluster is called the storage cloud again. The cloud server returns the final combined results to the user through the global control algorithm. The Web data mining model based on cloud computing has three main modules.

• The global algorithm module. This module runs the overall algorithm, coordinates the distributed data mining process and finally synthesizes the mining results.

• The local algorithm module. This module runs the local algorithm. Each data block produces local data mining results through the local data mining algorithm.

• The data management module. This module manages the data blocks and the local data mining results.

The local algorithm module is shown in figure 2.


[Figure: labels shown are Data block, Local data mining algorithm, and Local data mining result.]

Fig. 2 Local algorithm module

3.2 Contrast

In building the cloud model we must establish the communication system and the data exchange mechanism between the component parts. We combine the functions of the components and finally establish a Web data mining system based on cloud computing. The differences between traditional Web data mining and the cloud model are shown in table 1.

Table 1 Web data mining comparison

Contrast entries           Traditional   Cloud model
Data storage management    Tedious       Convenient
Data storage speed         Slow          Fast
Data mining speed          Slow          Fast
Real-time of the results   Poor          Good

4 Application Examples

Web data mining has a wide range of applications in real life, since we need to get valuable information from mass data every day. Here we put forward a low-cost, highly reliable and interactive rural information-based construction system. We use this application example to illustrate the great advantage of cloud computing in Web


data mining. The system framework consists of terminals, a network and a server. The system architecture is shown in figure 3.

[Figure: labels shown are terminal (six instances), Internet, and Server cluster.]

Fig. 3 Rural information-based system architecture

In this application, the data in the storage cloud comes from existing professional agriculture websites related to agricultural production. These data sets are pretreated collections of information. The calculation cloud responds with the algorithm the user requires. For example, a terminal user submits a query on cotton cultivation technology. The cloud control server gives commands to the calculation cloud, which uses a specific data mining method to crawl the data in the storage cloud for the relevant information. The information retrieved is returned to the user. Traditional Web data mining takes a long time and consumes a lot of energy. Figure 4 shows the crawl statistics for the traditional approach.

Fig. 4 Traditional web data mining statistics

In the cloud model trial there are many virtual nodes. When the task runs on the distributed nodes, the time consumed by a node for data mining and the crawl statistics are shown in figure 5. The statistics show that Web data mining with the cloud model completes the task faster and closer to real time. The model can also reduce energy consumption.


Fig. 5 Web data mining cloud model mining statistics

Through mathematical analysis, we obtain the comparison in table 2. The original data set is the same in each case.

Table 2 Cloud model comparison

Contrast entries        2 nodes     5 nodes      8 nodes
Average time consumed   5 hours     20 minutes   17 minutes
Data storage speed      10 KB/sec   35 KB/sec    43 KB/sec

We can see from table 2 that multiple nodes can significantly improve the data processing speed. Combining Web data mining and cloud computing organically makes full use of the advantages of cloud computing in mass data processing. The backbone of the cloud model is the storage cloud and the calculation cloud. They are the essence of the whole model and provide new ideas for future Web data mining. With the storage cloud, mass data storage is no longer the bottleneck of the system, and data management and the speed and accuracy of data use are greatly improved. The calculation cloud uses distributed computing technology to mine mass data rapidly, which improves the quality of the data mining and shortens the response time of the service. The cloud model also greatly reduces the energy consumed in data storage and computation, which fits green economy theory.
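The improvement implied by table 2 can be made explicit with a short calculation. The figures below are copied from the table; taking the 2-node run as the baseline is an assumption, since the table reports no single-node measurement.

```python
# Values from table 2; time is converted to minutes.
minutes = {2: 5 * 60, 5: 20, 8: 17}
storage_kb_per_s = {2: 10, 5: 35, 8: 43}

baseline = 2  # assumed reference configuration

# Ratio of baseline time to each configuration's time.
time_speedup = {n: minutes[baseline] / t for n, t in minutes.items()}
# Ratio of each configuration's storage speed to the baseline's.
storage_gain = {n: s / storage_kb_per_s[baseline] for n, s in storage_kb_per_s.items()}

for n in sorted(minutes):
    print(f"{n} nodes: {time_speedup[n]:.1f}x faster, {storage_gain[n]:.1f}x storage speed")
```

Relative to the 2-node run, the 5-node run is 15 times faster and the 8-node run about 17.6 times faster, while the storage speed rises by factors of 3.5 and 4.3. Most of the gain in the table therefore appears between 2 and 5 nodes.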

5 Conclusions

This paper addresses the problem of how to mine useful information from the vast amount of information on the Internet. We brought the idea of cloud computing to Web data mining and established the Web data mining cloud model. This model can meet the needs of most users and give a reliable information service according to each request. With the development of cloud computing, this solution is a natural outcome for Web data mining. Web data mining algorithms are varied. Future research will focus on finding algorithms that better and more effectively match parallel


computing (Map/Reduce) features for data mining in the cloud model, so as to better solve real-life problems. Cloud computing virtualization technology provides information services to a large number of users through distributed storage, which can store mass information quickly. Through the calculation cloud we can develop efficient Web data mining to satisfy people's growing information needs.

7 References

1. Liu B. Web Data Mining [M]. New York: Springer-Verlag, 2007.
2. H. Roth, J. Schiefer, H. Obweger, and S. Rozsnyai. Event Data Warehousing for Complex Event Processing. In 4th International Conference on Research Challenges in Information Science, 2010.
3. Szabolcs Rozsnyai, Aleksander Slominski, Yurdaer Doganata. Large-Scale Distributed Storage System for Business Provenance [C]. Cloud Computing, 2011: 516-524.
4. Wu, K. L., Yu, P. S., Ballman, A. "A Web usage mining and analysis tool", IBM Systems Journal, 2010.
5. S. Rozsnyai, R. Vecera, J. Schiefer, and A. Schatten. Event Cloud: Searching for Correlated Business Events. In Proceedings of the 9th IEEE International Conference on E-Commerce Technology and the 4th IEEE International Conference on Enterprise Computing, E-Commerce and E-Services (CEC-EEE 2007), pages 409-420. IEEE Computer Society, 2007.
6. B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing (SoCC'10). ACM, New York, NY, USA, 2010: 143-154.
7. Borthakur D. The Hadoop Distributed File System: Architecture and Design [EB/OL]. (2011-01-20). http://hadoop.apache.org/common/docs/r0.18.0/hdfs_design.pdf.
8. Amazon, Inc. Amazon Simple Store Service (Amazon S3) [EB/OL]. (2011-01-20). http://www.amazon.com/S3.
9. Chang F, Dean J, Ghemawat S. Bigtable: A distributed storage system for structured data [J]. ACM Transactions on Computer Systems, 2008, 26(2): 1-26.
10. John Shafer, Rakesh Agrawal, Manish Mehta. SPRINT: A Scalable Parallel Classifier for Data Mining [C]. U.S.: IBM Almaden Research Center, 1996: 544-555.
11. Wang JZ, Wan JG, Liu Z, Wang P. Data Mining of Mass Storage based on Cloud Computing [C]. Grid and Cooperative Computing (GCC), 2010: 426-431.
12. T. R. Gopalakrishnan Nair, K. Lakshmi Madhuri. Data mining using hierarchical virtual k-means approach integrating data fragments in cloud computing environment [C]. Cloud Computing and Intelligence System (CCIS), 2011: 230-234.
13. Pieter Noordhuis, Michiel Heijkoop, Alexander Lazovik. Mining Twitter in the Cloud: A Case Study [C]. Cloud Computing, 2010: 107-114.
14. Raymond Kosala, Hendrik Blockeel. "Web Mining Research: A Survey", In ACM SIGKDD, July 2000.
15. Chen, M. S., Han, J. and Yu, P. S. "Data Mining: An overview from a database perspective", IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, pp. 866-883, 1996.