A Mining Approach for Parallel Systems using Hadoop Techniques

International Journal of Applied Engineering Research, ISSN 0973-4562, Volume 9, Number 19 (2014), pp. 6055-6061. © Research India Publications, http://www.ripublication.com

Sachin Katarki and Srinivisa Perumal R
SITE, VIT University, Vellore
[email protected], [email protected]

ABSTRACT: The World Wide Web is one of the most powerful platforms for assimilating, disseminating and retrieving information. Web data is massive, dynamic and complex in nature. Extraction of potential value from the Web is carried out through data mining, but traditional data mining faces a storage and computing bottleneck when the data is too complex. With the growing number of providers, cloud computing offers various web services that help to overcome this bottleneck. Meanwhile, user behavior and demands are changing sharply, and a new approach is needed to maintain balance and maximize revenue. Today most service providers use a static method that ignores changes in user behavior. To solve this problem we use cloud computing technology, designing a system in which massive web log data can be analyzed on a cloud-based platform, the Hadoop framework. In addition, a parallel algorithm for web log mining is needed to improve the efficiency of the existing mining methods. The proposed algorithm can be applied to parallel systems in which the data is stored on the cloud, and it helps to identify users using cloud computing methods and MapReduce techniques.

Keywords: cloud computing; Map/Reduce; web log mining; Data Node; association rule mining.

INTRODUCTION: The term web mining generally refers to the application of data mining techniques to extract useful information from the World Wide Web. Relatively little work has been carried out in this area, which makes its implementation complex. Web mining


incorporates areas such as databases, information retrieval and artificial intelligence. The term web mining was first mentioned by Oren Etzioni in 1996 [1], who claimed that mining techniques can be used to extract information from the WWW and related services. With the fast growth of Internet technology, the content on the Internet is growing exponentially, so extracting useful information from the Web has become a hot research topic. Web data mining is a branch of data mining in which mining techniques are applied to massive web data. Research in this area mainly focuses on improving the mining algorithms to increase their processing ability [2].

In the face of massive data, single node computing creates a bottleneck. An efficient solution is to exploit the distributed processing and virtualization advantages of cloud computing, which distributes complex structures across multiple nodes through the network. Over the past few decades, data mining algorithms have largely been designed to execute on a single computer. Nowadays cloud-based computing is blossoming on a large scale, creating great interest in both industry and research. MapReduce has simplified distributed computing by hiding its complexity, and it is applied in many fields beyond the cloud as well.

In the semantic web era it is becoming very easy to extract information from the Web, provided that the data we store no longer resides on a single machine. Data must therefore be parallelized across multiple machines in order to increase both storage capacity and performance, and an efficient algorithm is needed that can extract and process information from multiple machines using the Map/Reduce programming paradigm. This paper combines the features of web log files with mining methods on the Hadoop cluster framework to extract information from parallel systems by improving the Apriori algorithm, and analyzes its efficiency against existing methods.

DATA MINING IN CLOUD COMPUTING TECHNOLOGY:

A. The MapReduce programming model
MapReduce is a programming model used for parallel computing over large-scale data sets, usually larger than 1 TB. It processes large data sets on clusters mainly through two steps, Map and then Reduce. The first step, Map, processes the input key/value pairs to generate intermediate results, which are themselves produced as key/value pairs; once these pairs are obtained they are collected and stored in the system [3]. The second step, Reduce, merges the processed results according to their keys, generating the final result.

B. The concept of Web log mining
A web log is a file containing information related to the web server, recorded while users browse through web pages and perform interactive operations. This log file is very useful for analyzing and discovering user information, such as the patterns of users who browse the Web, preferred access paths or user access interests.
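To make the two phases concrete, the following minimal Python sketch simulates the Map, shuffle and Reduce steps over a few sample web log lines; the "client_ip page" log format and the page-hit counting task are assumptions made for this illustration, and in a real Hadoop deployment the framework itself performs the shuffle and schedules the Map and Reduce tasks across the cluster.

    from itertools import groupby
    from operator import itemgetter

    # Toy web server log lines: "client_ip requested_page"
    # (this flat format is an assumption made for the sketch).
    log_lines = [
        "192.168.10.100 /index.html",
        "192.168.10.101 /products.html",
        "192.168.10.100 /index.html",
    ]

    def map_fn(line):
        # Map phase: emit an intermediate (page, 1) key/value pair per request.
        ip, page = line.split()
        yield page, 1

    def reduce_fn(page, counts):
        # Reduce phase: merge all intermediate values sharing the same key.
        yield page, sum(counts)

    # Shuffle phase (done by the framework in Hadoop): group pairs by key.
    intermediate = sorted(pair for line in log_lines for pair in map_fn(line))
    for page, group in groupby(intermediate, key=itemgetter(0)):
        for key, total in reduce_fn(page, (count for _, count in group)):
            print(key, total)   # -> /index.html 2, /products.html 1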


C. The process of Web log mining
Web log mining is generally divided into the following phases: data preprocessing, pattern recognition and pattern analysis.

1. Preprocessing: Web logs are usually semi-structured or unstructured data, which means they cannot be fed directly to a data mining algorithm. The data therefore needs to be cleaned; without preprocessing it is hard to generate useful information [3] from the raw data. Well-formed transactions can be built after preprocessing, where a long reference path is treated as one transaction, i.e. a meaningful session access path. For example, a user transaction may be identified by an IP address such as 192.168.10.100:

USER ID    USER TRANSACTION NUMBER
U1         A1, A3, A7
U2         A1, A4
U3         A1, A4, A10
U4         A1, A5, A6

Fig. 1: Table of web log entries and user transaction numbers
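As an illustration of this preprocessing step, the short Python sketch below groups cleaned log records into per-user transactions of the kind shown in Fig. 1; the flat (user_ip, accessed_page) record format is an assumption standing in for the real semi-structured log.

    from collections import defaultdict

    # Cleaned log records as (user_ip, accessed_page) pairs; this flat
    # format is an assumption standing in for the semi-structured log.
    records = [
        ("192.168.10.100", "A1"), ("192.168.10.100", "A3"),
        ("192.168.10.100", "A7"), ("192.168.10.101", "A1"),
        ("192.168.10.101", "A4"),
    ]

    # Group page accesses by user to form transactions like those in Fig. 1.
    transactions = defaultdict(list)
    for ip, page in records:
        transactions[ip].append(page)

    for i, (ip, pages) in enumerate(sorted(transactions.items()), start=1):
        print(f"U{i}\t{','.join(pages)}")   # -> U1  A1,A3,A7 / U2  A1,A4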

2. Pattern recognition: This phase selects an appropriate data mining technique and algorithm that can be efficiently applied to the preprocessed data to discover the hidden patterns that reflect user behavior, sessions and brief user statistics.

3. Personalization: There are two methods for personalization:
Classification (supervised learning): classes are defined in advance, and users are identified and placed into the best-suited classes. Predefining the classes is essential and must be done carefully.
Clustering (unsupervised learning): visitors are grouped by means of their common characteristics. Once a customer/visitor profile is available, we can specify how clusters are generated, and how many, so that a visitor is uniquely identified within a group. The clusters formed are then applied to the remaining profiles, as sketched below.
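The following sketch illustrates the clustering idea on the transactions of Fig. 1, grouping visitors whose sets of accessed pages overlap strongly; the Jaccard similarity measure, the 0.4 threshold and the single-pass grouping rule are illustrative assumptions, not the method prescribed by this paper.

    # Visitor profiles taken from Fig. 1: user id -> set of accessed pages.
    users = {"U1": {"A1", "A3", "A7"}, "U2": {"A1", "A4"},
             "U3": {"A1", "A4", "A10"}, "U4": {"A1", "A5", "A6"}}

    def jaccard(a, b):
        # Similarity of two page sets: shared pages over all pages seen.
        return len(a & b) / len(a | b)

    clusters = []
    for uid, pages in users.items():
        for cluster in clusters:
            # Join the first cluster whose representative is similar enough
            # (the 0.4 threshold is an illustrative assumption).
            if jaccard(pages, users[cluster[0]]) >= 0.4:
                cluster.append(uid)
                break
        else:
            clusters.append([uid])  # no similar cluster: start a new one

    print(clusters)   # -> [['U1'], ['U2', 'U3'], ['U4']]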

DESIGN OF WEB LOG MINING BASED ON CLOUD COMPUTING

A. Single node data mining system
Traditional centralized data mining systems used a single-server technique to carry out all tasks, including data collection, preprocessing and data storage. Single node systems are efficient when the mining task is of low complexity and only a small amount of data has to be processed. However, as Internet log data grows, the complexity of mining tasks increases, and centralized data mining is no longer efficient enough to meet the demands of massive log processing.


B. The mass cloud-based Web log mining system
From the above discussion it is clear that traditional data mining systems have many bottlenecks when processing huge amounts of data, so a possible solution is the use of cloud computing technology. On a cloud computing platform, a Master Node and Name Node can help in processing this huge data. Coordination and scheduling of the work of the computing nodes is carried out by the Master Node, which also handles node failures, load balancing and other issues. The Master Node can additionally act as the Name Node, which stores metadata information. To reduce complexity, each Data Node is also a computing node, so that massive data can be distributed across different storage and complex data mining tasks can be assigned for parallel processing, which overcomes the workload problem of a single node. This design can be realized using the Hadoop cluster architecture of cloud computing.

1. Distributed data storage layer
This is the actual physical storage hardware of the system, comprising computers and network facilities. These form the storage space for the Data Nodes in the cloud computing platform. The nodes connected through the networking medium not only provide data storage but also perform data calculation functions. Raw web log data that is collected is transformed into a semi-structured XML form after preprocessing and then stored in the Data Nodes of the system. For better reliability a copy is stored on different storage nodes, which is useful for recovering the data after a system failure.

2. Functional layer
The core functionality of the system is provided by the functional layer; the data mining algorithms and the components used for data mining are deployed in this layer. Data preprocessing components usually include components to recognize parts and attributes. In this layer the Master Node is responsible for web log mining and can allocate complex data mining tasks to multiple systems, so that the mining task is carried out simultaneously to achieve parallel data mining in a distributed system. In each cycle the Master Node sends a signal to all Data Nodes to make sure they are working as required, and each Data Node replies with a confirmation message to the Master Node. Whenever a task is assigned to the Master Node at the request of a client, it distributes the task among the Data Nodes, where the computing tasks are completed by handling file data blocks, and the results are later returned to the Master Node.

3. Visual presentation layer
This layer provides user interaction and presents the results through a visual display function. The visual display first receives the user's data mining request, which is given as input to the functional layer for processing; the mining results are then displayed [4][5].


METHODOLOGY
In the conventional method, web log mining was carried out by maximum forward sequence-reference methods and later by association rules. An association rule is a knowledge model that describes the occurrence of an event in a transaction. Through detailed analysis, association rules can also be used to analyze the habits and hobbies of users who access web pages, which helps in building personalized information about each user. Finding association rules involves two steps: first, find all frequent item-sets; second, identify the association property for those frequent sets [6]. One classical association rule algorithm is the Apriori algorithm, which uses a layered, iterative search to generate frequent item-sets. The steps followed in this method, implemented in the sketch below, are:
• First, scan the database to identify the items that meet the minimum support threshold, giving the frequent 1-item-sets L1.
• Then generate the candidate 2-item-sets (named C2), and identify L2 from C2.
• Repeat the above steps to generate L3, and continue the loop until no new frequent item-sets appear.
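The following minimal Python sketch runs the layered Apriori iteration described above on the transactions of Fig. 1; the minimum support threshold of 2 is an illustrative assumption.

    # Transactions from Fig. 1; min_support = 2 is an illustrative assumption.
    transactions = [{"A1", "A3", "A7"}, {"A1", "A4"},
                    {"A1", "A4", "A10"}, {"A1", "A5", "A6"}]
    min_support = 2

    def frequent(candidates):
        # Prune: keep only candidates meeting the minimum support threshold.
        return {c for c in candidates
                if sum(c <= t for t in transactions) >= min_support}

    # L1: frequent 1-item-sets found in the first scan of the database.
    items = {item for t in transactions for item in t}
    level = frequent(frozenset([i]) for i in items)
    all_frequent = set(level)

    # Lk -> Ck+1 -> Lk+1: join then prune, looping until no new sets appear.
    k = 2
    while level:
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        level = frequent(candidates)
        all_frequent |= level
        k += 1

    print(sorted(tuple(sorted(s)) for s in all_frequent))
    # -> [('A1',), ('A1', 'A4'), ('A4',)]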

PROPOSED METHODOLOGY: From the analysis of the association rule algorithm it is clear that a large number of candidate sets are generated while scanning a database as large as the Web, which leads to low efficiency and low accuracy of mining. As a solution to this problem we propose a parallel algorithm that uses the cloud computing platform and Data Nodes to perform parallel processing [7].

DataNode: A Data Node is a storage cell in a Hadoop system. Data is spread and replicated across several Data Nodes, which are the functional units of the file system. Once the Name Node has provided the location of the data, the client application can communicate directly with the Data Node. MapReduce methods allow a Data Node to access and communicate with other Data Nodes for file sharing. A Task Tracker instance, which keeps track of the tasks on the Data Nodes, can be deployed on the same server that hosts a Data Node, forming a closed data operation that enhances performance. Data Node instances can communicate with each other for data replication. This allows RAID storage to be avoided, because Data Nodes are designed for self-replication and storage across multiple servers rather than on the same storage structure. Another benefit of using Data Nodes is that NFS is not needed for data storage in a production system.

PROPOSED ALGORITHM: Before presenting the algorithm, here are the methods and terminology implemented in it.


Split Function: This function is used to eliminate unwanted and unrelated entries from the user transaction tables. Before going through the steps, one assumption is made: a set analyzed by the split function should have more than one element. The steps followed here are:
1. Generate the power set, excluding the null set and the set itself.
2. If all of the generated sets are subsets of a transaction table set, go to step 3; otherwise go to step 4.
3. Accept the record as a possible frequent pattern.
4. Reject the record.

Confidence Analysis: Confidence analysis rejects a possible frequent pattern if the pattern's support count is less than the minimum confidence, which is derived from the minimum count over a set of records. Confidence analysis accepts a record into the frequent patterns if the following condition is satisfied: for every non-empty subset S of an item-set l, the rule generated for the record is "S => (l - S)"; if Support_count(l) / Support_count(S) >= Min_confi, the relation is accepted, otherwise it is rejected.

ALGORITHM:
Step 1. Extract the Data Node from MapReduce for the parallel computing system.
Step 2. Calculate the minimum count (Min_con) for the available range of sets.
Step 3. Eliminate all entries from the Data Node whose count is less than Min_con.
Step 4. Generate the power set for all possible entries in the Data Node over all unique values.
Step 5. Create a table of user transaction sets and the number of times each is repeated in the Data Node, and eliminate all records whose repetitions are less than Min_con.
Step 6. Analyze with the Split Function.
Step 7. Perform Confidence Analysis.
Step 8. Generate the user's behavior pattern.

The output of the Confidence Analysis is the set of possible frequent patterns that can be used for analyzing user behavior and web usage patterns.
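The sketch below implements the Split Function and the Confidence Analysis on a single node, using the transactions of Fig. 1 and an illustrative minimum confidence of 0.5; in the proposed system these steps would run in parallel across the Data Nodes via MapReduce.

    from itertools import combinations

    transactions = [{"A1", "A3", "A7"}, {"A1", "A4"},
                    {"A1", "A4", "A10"}, {"A1", "A5", "A6"}]
    min_confi = 0.5   # illustrative assumption

    def support_count(itemset):
        # Number of transactions that contain the whole item-set.
        return sum(itemset <= t for t in transactions)

    def proper_subsets(itemset):
        # Power set of the item-set, excluding the null set and itself.
        s = list(itemset)
        return [frozenset(c) for r in range(1, len(s))
                for c in combinations(s, r)]

    def split_accept(itemset):
        # Split Function: accept the record as a possible frequent pattern
        # only if every proper subset occurs in the transaction table.
        return all(support_count(sub) > 0 for sub in proper_subsets(itemset))

    def confident_rules(itemset):
        # Confidence Analysis: keep each rule S => (l - S) whose confidence
        # Support_count(l) / Support_count(S) meets the minimum confidence.
        return [(set(s), set(itemset - s)) for s in proper_subsets(itemset)
                if support_count(itemset) / support_count(s) >= min_confi]

    candidate = frozenset({"A1", "A4"})
    if split_accept(candidate):
        print(confident_rules(candidate))
        # -> rules A1 => A4 and A4 => A1 (order may vary)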

CONCLUSION: This paper applies data mining concepts on a cloud computing platform to address the problems of traditional data mining systems based on single node computing techniques. Considering these problems, the paper proposes an efficient algorithm for web log data mining that uses cloud computing techniques


along with the Hadoop cluster framework; the proposed method uses MapReduce to accomplish parallelism. The combination of these techniques not only improves the efficiency of data processing and analysis but also overcomes the bottleneck of the conventional system.

REFERENCES:
[1] Oren Etzioni. The World-Wide Web: quagmire or gold mine? Communications of the ACM, 1996.
[2] JiaWei Han, Bo Kan. Data mining concepts and technology. Beijing: Machinery, 2007.
[3] Peng Liu. Cloud computing. Beijing: Electronic Industry Press, 2010.
[4] E Zhang, PeiFeng Zheng, GengZhong Feng. Research on data preprocessing methods of Web log data mining. Computer Application, 2004.
[5] Miao Cheng. Cloud computing technology in web log mining. China University of Science and Technology, 2011.
[6] LingJuan Li, Min Zhang. Research on association rule mining algorithm under cloud computing environment, 2011.
[7] WeiHua Feng. Web log data mining research and realization. University of Electronic Science and Technology of China, 2010.
[8] ZhenQi Wang. Research of massive Web log data mining based on cloud computing. International Conference on Computational and Information Sciences, 2013.
