ISSN:2229-6093 Ravindra Gupta et al,Int.J.Computer Techology & Applications,Vol 3 (1),160-162
Application specific web log pre-processing
Ravindra Gupta, Prateek Gupta
Abstract
Web Usage Mining discovers interesting patterns in accesses to various Web pages within the Web space associated with a particular server. The Web Usage Mining architecture divides the process into two main parts: the first part includes the pre-processing, transaction identification, and data integration components; the second part includes the largely domain-independent application of generic data mining and pattern matching. Nearly 80% of the mining effort is often spent on improving the quality of the data. All applications remove error log entries, CSS file entries, etc., but pre-processing also depends on the application. This paper presents customized web log pre-processing, which reduces the size of the logs to be mined.

1. Introduction
Web Mining [1] is the application of data mining techniques to automatically discover and extract information from the web. Web usage mining is the task of discovering the activities of users while they browse and navigate through the Web. The aim of understanding the navigation preferences of visitors is to enhance the quality of electronic commerce (e-commerce) services, to personalize Web portals [2], or to improve the Web structure and Web server performance. Web usage mining [3], from the data mining aspect, applies data mining techniques to discover usage patterns from Web data. Examples of applications of such knowledge include improving the design of web sites, analyzing system performance as well as network communications, understanding user reaction and motivation, and building adaptive Web sites. The process of Web usage mining consists of three main steps: (i) pre-processing, (ii) pattern discovery and (iii) pattern analysis. Web log pre-processing is used to clean raw log data. The purpose of data pre-processing is to extract useful data from the raw web log and then transform these data into the form necessary for pattern discovery. Because of the large amount of irrelevant information in the web log, the original log cannot be used directly in the web log mining procedure; hence, in the data pre-processing phase, raw Web logs need to be cleaned, analyzed and converted for the next step.

2. Web Logs
On the World Wide Web (WWW), logs of HTTP traffic are recorded continuously as a function of most origin web servers as well as intermediate proxies [5]. Web server logs are plain text (ASCII) files that are independent of the server platform. Data used for Web usage mining can be collected at one of three levels: server level, client level, or proxy level. The most popular log file format is the Common Log Format (CLF).
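As an illustration of what a raw entry contains, the sketch below parses one Common Log Format (CLF) line of the kind shown in Section 5 into its fields (host, identity, user, timestamp, request, status, size). The names `CLF_PATTERN`, `LogEntry` and `parse_clf_line` are hypothetical and are not taken from the paper's C++ implementation; this is only a minimal sketch that assumes well-formed CLF lines and returns nothing for incomplete ones.

```python
import re
from typing import NamedTuple, Optional

# CLF layout: host ident authuser [timestamp] "request" status size
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)


class LogEntry(NamedTuple):
    """Hypothetical container for the fields used in later steps."""
    host: str
    time: str
    method: str
    resource: str
    status: int
    size: int


def parse_clf_line(line: str) -> Optional[LogEntry]:
    """Return the parsed entry, or None if the line is not complete CLF."""
    match = CLF_PATTERN.match(line.strip())
    if match is None:
        return None
    parts = match.group("request").split()
    if len(parts) < 2:
        # The request must contain at least a method and a resource.
        return None
    size = match.group("size")
    return LogEntry(
        host=match.group("host"),
        time=match.group("time"),
        method=parts[0],
        resource=parts[1],
        status=int(match.group("status")),
        size=0 if size == "-" else int(size),  # "-" means no body was sent
    )
```

Returning `None` for malformed or truncated lines lets a later cleaning step simply skip them instead of failing.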
3. Existing Method
IJCTA | JAN-FEB 2012 Available [email protected]
Web log pre-processing [9] contains four sub-steps: Data Cleaning, User Identification, Session Identification and Formatting.

Fig 1: Pre-processing steps (Raw Web Log, Data Cleaning, User Identification, Session Identification, Database of Cleaned Log)

Data cleaning is the first step performed in the Web usage mining process [4]. Some low-level data integration tasks may also be performed at this stage, such as combining multiple logs, incorporating referrer logs, etc. Techniques to clean a server log to eliminate irrelevant items are of importance for any type of Web log analysis, not just data mining.

Unique users must be identified [4]. This task is greatly complicated by the existence of local caches, corporate firewalls, and proxy servers. The Web Usage Mining methods that rely on user cooperation are the easiest ways to deal with this problem.

The goal of session identification is to divide the page accesses of each user into individual sessions. Another problem in reliably identifying unique user sessions is determining if there are important accesses that are not recorded in the access log [6]. This problem is referred to as path completion.
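The paper does not say how sessions are delimited. One common heuristic, assumed here purely for illustration, is an inactivity timeout: a new session starts whenever the gap between two consecutive requests of the same user exceeds a threshold. The 30-minute value, the function name `split_sessions` and the input layout are all assumptions, not the authors' method.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Timestamp layout used inside the brackets of a CLF entry.
CLF_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def split_sessions(accesses, timeout=timedelta(minutes=30)):
    """Group (user, clf_timestamp, resource) tuples into per-user sessions.

    Returns {user: [[resource, ...], ...]}; a new inner list is started
    when two consecutive requests are more than `timeout` apart.
    """
    by_user = defaultdict(list)
    for user, stamp, resource in accesses:
        when = datetime.strptime(stamp, CLF_TIME_FORMAT)
        by_user[user].append((when, resource))

    sessions = {}
    for user, hits in by_user.items():
        hits.sort(key=lambda hit: hit[0])  # order each user's requests by time
        user_sessions = [[hits[0][1]]]
        for (prev_time, _), (time, resource) in zip(hits, hits[1:]):
            if time - prev_time > timeout:
                user_sessions.append([])  # gap too long: open a new session
            user_sessions[-1].append(resource)
        sessions[user] = user_sessions
    return sessions
```

A timeout-based split cannot recover accesses served from caches, so it does not address the path completion problem described above.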
4. Proposed Work
The existing pre-processing method produces output suited to some particular application, but different web applications require different pre-processing of logs. A multimedia application requires the log entries for multimedia requests, such as those for jpg, mpg and gif resources. An e-commerce application requires different user requests. Traditionally, web log pre-processing removes HTTP error entries, CSS access entries, etc. We introduce one more step, Customization, before data cleaning in the traditional pre-processing sequence. In this step we clean the log on the basis of the user's requirement for the application: the user selects the type of application for which usage mining is to be performed, and our pre-processing approach works accordingly. The following figure shows the method.

Fig 2: Customized web log pre-processing (Raw Web Log, Customization, Data Cleaning, Session Identification, User Identification, Database of Cleaned Log)

Steps for customized web log pre-processing are as follows:
Step 1. Input the raw web access log file.
Step 2. Take the user's choice of normal, multimedia, graphics or e-commerce application.
Step 3. Read the raw web log file and remove entries according to the user's selection to make an intermediate file.
Step 4. Identify users and resources uniquely and assign a unique id to each of them.
Step 5. Identify the resources accessed by each user according to id.
Step 6. Create the preprocessed file by mapping each user id to the ids of the resources accessed by that user.

5. Example

5.1. Raw Web Log
199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245
unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/product/ HTTP/1.0" 200 3985
199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] "GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0" 200 4085
burger.letters.com - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/countdown/liftoff.html HTTP/1.0" 304 0
199.120.110.21 - - [01/Jul/1995:00:00:11 -0400] "GET

5.2. Intermediate Web Log after customized selection
199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245
unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/product/ HTTP/1.0" 200 3985
199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] "GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0" 200 4085
burger.letters.com - - [01/Jul/1995:00:00:11 -0400] "GET

5.3. User Identification
1 199.72.81.55
2 unicomp6.unicomp.net
3 199.120.110.21
4 burger.letters.com
5 205.212.115.106
6 d104.aa.net

5.4. Web Resource Identification
1 /history/apollo/
2 /shuttle/product/
3 /shuttle/missions/sts-73/mission-sts-73.html
4 /shuttle/countdown/liftoff.html
5 /shuttle/missions/sts-73/sts-73-patch-small.bmp
6 /shuttle/countdown/countdown.html

5.5. User & Resource Combination
0 199.72.81.55 /history/apollo/
1 unicomp6.unicomp.net /shuttle/product/
2 199.120.110.21 /shuttle/missions/sts-73/mission-sts-73.html
3 burger.letters.com /shuttle/countdown/liftoff.html
4 199.120.110.21 /shuttle/missions/sts-73/sts-73-patch-small.bmp
5 205.212.115.106 /shuttle/countdown/countdown.html
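Steps 3 to 6 of the customized pre-processing procedure (filtering by the user's selection, assigning unique ids, and mapping user ids to resource ids) can be sketched as follows. The paper does not list the exact filtering rule for each application type, so the rules in `keep_entry` are assumptions: error responses and CSS files are always dropped, and the multimedia choice keeps only the jpg, mpg and gif resources named in Section 4. All function names are hypothetical and this is not the authors' C++ code.

```python
# Extensions treated as multimedia; taken from the examples in Section 4.
MULTIMEDIA_EXTENSIONS = (".jpg", ".mpg", ".gif")


def keep_entry(resource, status, application):
    """Step 3 (assumed rules): decide whether a log entry is retained."""
    lower = resource.lower()
    if status >= 400 or lower.endswith(".css"):
        return False  # traditional cleaning: HTTP errors and CSS requests
    if application == "multimedia":
        return lower.endswith(MULTIMEDIA_EXTENSIONS)
    return True  # "normal": keep everything that survived cleaning


def assign_ids(values):
    """Step 4: give each distinct value an id, starting at 1, in order of
    first appearance."""
    ids = {}
    for value in values:
        ids.setdefault(value, len(ids) + 1)
    return ids


def preprocess(entries, application="normal"):
    """Steps 3-6 for an iterable of (host, resource, status) tuples.

    Returns (user_ids, resource_ids, accessed), where accessed maps each
    user id to the list of resource ids that user requested.
    """
    kept = [(host, resource) for host, resource, status in entries
            if keep_entry(resource, status, application)]
    user_ids = assign_ids(host for host, _ in kept)
    resource_ids = assign_ids(resource for _, resource in kept)

    accessed = {}
    for host, resource in kept:  # Steps 5 and 6: build the numeric mapping
        accessed.setdefault(user_ids[host], []).append(resource_ids[resource])
    return user_ids, resource_ids, accessed
```

Numbering ids by first appearance reproduces the ordering used in Sections 5.3 and 5.4, where the first host and the first resource in the log both receive id 1.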
5.6. Numeric Web Access log after Preprocessing
1 9 21 24 25 26 30 2 4 15 356 7 37 5 8 12 13 9 14
This output is forwarded to the next phase, pattern discovery.

6. Implementation
We have implemented all internal steps of customized web log pre-processing in C++. The user interface is being developed in Java to provide a user-friendly environment. Fig 3 shows the output screen of our system.

Fig 3: Output Screen

7. Result
Customized web log pre-processing reduces the file size for pattern analysis according to the application, which reduces the time needed to find frequent patterns or user behaviour. The chart below shows the reduction in file size for different application requirements. The multimedia selection gives the largest reduction, from 38100 to 17569 (about 54%), while the graphics and e-commerce selections reduce the file size by about 6% and 2% respectively.

Customized preprocessing    File Size
Raw Web Log                 38100
Graphics                    35770
Multimedia                  17569
E-Commerce                  37299

Fig 4: Result

8. Conclusion
This research work covers the pre-processing phase of web usage mining, which can be utilized in industry and in application-oriented systems. We use customized web log pre-processing rather than the traditional approach, which may reduce the size of the raw web log file. In future research, work will be carried out on developing a web usage mining tool with customized web log pre-processing and combined pattern analysis approaches according to different applications.

9. References
[1] R. Cooley, B. Mobasher and J. Srivastava, "Web Mining: Information and Pattern Discovery on the World Wide Web", IEEE, August 1997, pp. 558-566.
[2] Charu C. Aggarwal and Philip S. Yu, "An Automated System for Web Portal Personalization", Proceedings of the 28th VLDB Conference, Hong Kong, China, 2002.
[3] Renáta Iváncsy and István Vajk, "Frequent Pattern Mining in Web Log Data", Acta Polytechnica Hungarica, Vol. 3, No. 1, 2006.
[4] J. Han, J. Pei and Y. Yin, "Mining frequent patterns without candidate generation", IEEE, Sept. 1998, pp. 365-378.
[5] K. R. Suneetha and R. Krishnamoorthi, "Identifying User Behavior by Analyzing Web Server Access Log File", IJCSNS International Journal of Computer Science and Network Security, Vol. 9, No. 4, April 2009, pp. 327-332.
[6] M. Eirinaki and M. Vazirgiannis, "Web Mining for Web Personalization", ACM Transactions on Internet Technology, 2002.