International Journal of Emerging Technology and Advanced Engineering Website: www.ijetae.com (ISSN 2250-2459, ISO 9001:2008 Certified Journal, Volume 3, Issue 10, October 2013)

A Complete Pre Processing Method for Web Usage Mining

Ankit R Kharwar 1, Chandni A Naik 2, Niyanta K Desai 3

1 Assistant Professor, Department of Computer, Chhotubhai Gopalbhai Patel Institute of Technology, Bardoli
2,3 Student of M.Tech Computer Engineering, Chhotubhai Gopalbhai Patel Institute of Technology, Bardoli

Abstract— Web usage mining applies data mining techniques to Web usage data in order to discover usage patterns, and thereby to understand and better serve the needs of Web-based applications. Several preprocessing tasks must be performed on the data collected in server logs before data mining algorithms can be applied. Data preprocessing transforms the raw log data into the abstractions that those algorithms require. This paper presents several data preparation techniques that can be used during preprocessing to identify unique users and user sessions and to improve performance.

Keywords— Web usage mining, Data preprocessing, Data mining, Server logs, Users and user sessions.

I. INTRODUCTION
Web usage mining is the type of Web mining activity that involves the automatic discovery of user access patterns from one or more Web servers. As more organizations rely on the Internet and the World Wide Web to conduct business, traditional market analysis techniques and strategies must be revisited in this context. Organizations generate and collect large volumes of data in their daily operations. Most of this information is generated automatically and collected by Web servers in server access logs. Other sources of user information include referrer logs, which record the referring page for each request, and user registration or survey data gathered through CGI scripts. Analyzing such data can help these organizations determine the lifetime value of customers, cross-marketing strategies across products, and the effectiveness of promotional campaigns, among other things. Analyzing server access logs and user registration data can also provide valuable insight into how to structure a Web site better in order to create a more effective organizational presence. In organizations that use intranet technologies, such analysis can shed light on workgroup communication and the effective management of organizational structure. Finally, analyzing user access patterns helps organizations that sell advertising on the World Wide Web to target advertisements at specific groups of users. [1]

II. DATA PREPROCESSING IN WEB USAGE MINING
Ideally, the input to the Web usage mining process is a user session file that gives an accurate account of who visited the Web site, what pages were requested and in what order, and how long each page was viewed. A user session is the set of pages visited by a user during one visit to a Web site. However, for reasons discussed below, the information contained in a raw Web server log does not reliably yield a user session file without preprocessing. In general, data preprocessing consists of data cleaning, user identification, session identification, and path completion, as shown in Figure 1. [2]

Fig. 1. Phases of Data Preprocessing in Web Usage Mining. [2]
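Viewed end to end, the phases of Fig. 1 form a sequential pipeline over the raw log. The following sketch (a hypothetical outline in Python; each helper name is a placeholder that is filled in by the sketches in Sections III and IV) shows this flow:

```python
# Hypothetical outline of the preprocessing pipeline in Fig. 1.
# Each helper is sketched in Sections III and IV below.

def preprocess(raw_log_lines):
    entries = [parse_entry(l) for l in raw_log_lines]  # parse raw log lines
    entries = clean(entries)              # data cleaning (Section III.A)
    entries = remove_robots(entries)      # Web robot removal (Section III.B)
    users = identify_users(entries)       # user identification (Section IV.A)
    sessions = identify_sessions(users)   # session identification (Section IV.B)
    return [complete_path(s) for s in sessions]  # path completion (Section IV.C)
```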

III. DATA CLEANING
The first step of data preprocessing is to remove useless requests from the log files. Typically, this means deleting the requests for non-analyzed resources such as images, multimedia files, page style files, and JavaScript files, as well as the requests issued by Web robots.


By filtering out useless data, we reduce the size of the log files, which lowers storage requirements and facilitates the subsequent steps. For example, by filtering out image requests alone, the Web server log files can be cut down to about 50% of their original size.
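The cleaning steps below operate on parsed log entries. A minimal parsing sketch, assuming the Apache combined log format (the regular expression and field names are our assumption; the paper does not fix a log format):

```python
import re

# Assumed Apache combined log format; adapt the pattern to the actual server.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" '
    r'(?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_entry(line):
    """Parse one raw log line into a dict of fields, or None if malformed."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None
```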

A. Removing Requests for Non-analyzed Resources
Nowadays, most Web pages contain images, whether they are present for design purposes (such as lines and coloured buttons) or to convey information (such as graphics and maps). The decision to keep these image requests in the Web usage log files or to remove them depends on the purpose of the mining. For Web caching or prefetching applications, the log analyst should not remove the entries referring to images and multimedia files: predicting requests for such files is usually more important for a Web cache than predicting requests for (text) files, because images are generally larger than HTML documents. Conversely, if the analyst wants to find flaws in the structure of a Web site or to provide visitors with personalized dynamic links, these requests should clearly be removed, since they do not represent the users' actual actions. Besides image files, the pages may embed requests for other file types, such as page style files, script (JavaScript) files, applet (Java object code) files, and so on. Except for the resources that need explicit requests (like some applet files), the requests for these files should be removed, as they will not bring any new knowledge to the pattern discovery phase. A sketch of this filtering step follows.
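A minimal sketch of extension- and status-based filtering (the extension list and the choice to keep only successful requests are our assumptions; both should be tuned to the goal of the analysis, as discussed above):

```python
# File extensions treated as non-analyzed resources (an assumed list).
IGNORED_EXTENSIONS = ('.gif', '.jpg', '.jpeg', '.png', '.ico',
                      '.css', '.js', '.swf', '.mp3', '.avi')

def is_useful(entry):
    """Keep successful requests for actual page resources only."""
    if entry is None:
        return False
    url = entry['url'].split('?')[0].lower()   # ignore query strings
    if url.endswith(IGNORED_EXTENSIONS):
        return False
    return entry['status'].startswith('2')     # drop 3xx/4xx/5xx entries

def clean(entries):
    return [e for e in entries if is_useful(e)]
```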

B. Removing Web Robots' Requests
A Web robot (WR), also called a spider or bot, is a software tool that periodically scans a Web site to extract its content. WRs automatically follow all the links on a Web site. Search engines such as Google regularly use WRs to gather all the pages from a Web site in order to update their search indexes. The number of requests from one WR may equal the number of URIs of the Web site, so if the site does not attract many visitors, the requests coming from all the WRs that visited it can outnumber the human-generated requests. Removing WR-generated log entries not only simplifies the mining task that follows, but also removes uninteresting sessions from the log file. Usually, a WR follows a breadth-first (or depth-first) search strategy and follows all the links from a Web page; therefore, a WR generates a huge number of requests on a Web site. Moreover, the requests of a WR are outside the scope of the analysis, as the analyst is interested in discovering knowledge about users' behaviour. Most Web robots identify themselves in the user agent field of the log file, and several reference databases of known robots are maintained. However, these databases are not exhaustive, and every day new WRs appear or take a new name, making the WR identification task more difficult. To identify WR hosts, we currently use three heuristics:
1. Robots.txt (RT): We look for all the hosts that have requested the page /robots.txt. This file contains browsing rules for the WRs that index the Web site, such as the names of the folders not to be indexed.
2. Known User Agent (UA): We use a list of user agents known to be robots. The list is created using data from various sources such as [4], [5].
3. Robot IP (RI): We use a list of IP addresses known to belong to robots. The list is created using data from various sources such as [6], [7].
Once all the Web robots are identified, we can remove the requests they generated. This procedure is straightforward: we remove all the requests issued by (Host, User Agent) pairs identified as belonging to a Web robot. A sketch of these heuristics is given after this list.
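A minimal sketch of the three heuristics and the removal step (the list contents are placeholders; in practice they would be populated from sources such as [4]-[7]):

```python
# Placeholder lists; populate from robot databases such as [4]-[7].
KNOWN_ROBOT_AGENTS = {'googlebot', 'bingbot', 'slurp'}
KNOWN_ROBOT_IPS = {'66.249.66.1'}

def robot_pairs(entries):
    """Return the set of (host, agent) pairs identified as Web robots."""
    robots = set()
    for e in entries:
        pair = (e['host'], e['agent'])
        agent = e['agent'].lower()
        if e['url'] == '/robots.txt':                            # RT heuristic
            robots.add(pair)
        elif any(name in agent for name in KNOWN_ROBOT_AGENTS):  # UA heuristic
            robots.add(pair)
        elif e['host'] in KNOWN_ROBOT_IPS:                       # RI heuristic
            robots.add(pair)
    return robots

def remove_robots(entries):
    robots = robot_pairs(entries)
    return [e for e in entries if (e['host'], e['agent']) not in robots]
```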

IV. DATA STRUCTURATION
This step groups the unstructured requests of a log file by user, user session, and completed path. At the end of this step, the log file is a set of transactions, where by transaction we mean a user session with its completed path.


A. User Identification
A user is identified as the principal who uses a client to interactively retrieve and render resources or access forms. Web usage mining techniques that rely on the cooperation of users are the easiest way to deal with this problem, but such cooperation cannot generally be assumed. In our experiment, we use the following heuristics to identify users: 1) each IP address represents one user; 2) if further log entries have the same IP address but the agent log shows a change in the User Agent (with version) or in the operating system (together with client attributes such as screen resolution, where available), the IP address is taken to represent a different user. A sketch of this heuristic follows.
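A minimal sketch of this heuristic, keying users on the (IP address, user agent) combination (an assumed encoding of rules 1 and 2; a change of agent on the same IP yields a new key and hence a new user):

```python
from collections import defaultdict

def identify_users(entries):
    """Group entries by user: same IP and same user agent = same user."""
    users = defaultdict(list)
    for e in entries:
        users[(e['host'], e['agent'])].append(e)  # rules 1 and 2 combined
    return users
```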


B. User Session Identification
A user may open one or more sessions on the Web server. The goal of session identification is to divide the click stream of each user into delimited sets of individual page accesses. The usual method for identifying user sessions is a timeout mechanism. We use the following rules in our experiment: 1) if there is a new user, there is a new session; 2) if the time between page requests exceeds a certain limit (we use 25.5 minutes, a variant of the common 30-minute threshold), it is assumed that the user is starting a new session. A sketch of this rule follows.
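A minimal sketch of the timeout rule (it assumes the time field has already been parsed to a datetime; the 25.5-minute threshold matches the experiment in Section V):

```python
from datetime import timedelta

SESSION_TIMEOUT = timedelta(minutes=25.5)  # threshold used in Section V

def identify_sessions(users):
    """Split each user's time-ordered clicks into timeout-delimited sessions."""
    sessions = []
    for user, clicks in users.items():
        clicks.sort(key=lambda e: e['time'])   # assumes parsed datetimes
        current = [clicks[0]]                  # rule 1: new user -> new session
        for prev, cur in zip(clicks, clicks[1:]):
            if cur['time'] - prev['time'] > SESSION_TIMEOUT:
                sessions.append(current)       # rule 2: timeout -> new session
                current = []
            current.append(cur)
        sessions.append(current)
    return sessions
```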

C. Path Completion
Another critical step in data preprocessing is path completion. There are several reasons why the recorded paths are incomplete: for example, local caching, proxy caching, the "POST" technique, and the browser's "back" button can all result in important accesses not being entered in the log file, so the number of Uniform Resource Locators (URLs) recorded in the log can be smaller than the number actually visited. The use of local caches and proxy servers makes path completion difficult, because users can revisit cached pages without leaving any record in the server logs. As a result, the user access paths preserved in the Web log are incomplete, and before mining the users' travel patterns, the missing pages should be appended to the access paths. The task of path completion is to complete these paths. Better data preprocessing improves the quality of the mined patterns and saves time when running the mining algorithms. This is especially important for Web log files because, unlike data in a database or data warehouse, they are not structured and are incomplete for various reasons. So this particular preprocessing of Web log files is necessary for Web usage mining: through it, Web logs can be transformed into a structure that is easy to mine. [3] A sketch of a simple completion heuristic follows.
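A minimal sketch of a referrer-based path completion heuristic (this is our illustrative interpretation; the paper does not specify an algorithm): when a request's referrer is an earlier page of the session rather than the previous one, the intermediate pages are assumed to have been revisited via the "back" button and served from the cache, so they are re-inserted into the path.

```python
def complete_path(session):
    """Re-insert pages revisited via the browser's back button,
    inferred from the referrer field (illustrative heuristic)."""
    completed, stack = [], []
    for entry in session:
        ref = entry.get('referrer', '')
        # Pop back to the referring page; each pop is an inferred cached visit.
        while stack and stack[-1] != ref and ref in stack:
            stack.pop()
            completed.append(stack[-1])  # back-button revisit, served from cache
        completed.append(entry['url'])
        stack.append(entry['url'])
    return completed
```

For example, for the click stream A -> B -> C followed by a request for D with referrer B, the sketch infers the cached revisit of B and yields the completed path A, B, C, B, D.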

V. EXPERIMENT
To validate the effectiveness and efficiency of the methodology described above, we applied it to real Web server logs. The initial data source for our experiment covers 12 November 2005 to 25 November 2005 and is about 36 MB in size. As shown in Table I, after data cleaning the number of requests declined from 92168 to 26584. Figure 2 shows the detailed changes during data cleaning.

TABLE I
The Processes and Results of Data Preprocessing in Web Usage Mining

Total entries in Web log: 92168
Entries after data cleaning: 26584
Images and other data removed (with status code): 59981
robots.txt requests identified and removed: 1527
Robot user agents identified and removed: 4670
Spider/crawler IPs identified and removed: 608
robots.txt & robot user agent on the same record, removed: 879
robots.txt & spider/crawler IP on the same record, removed: 71
Robot user agent & spider/crawler IP on the same record, removed: 283
All three robot indicators on the same record, removed: 31

In Figure 2, bar 1 represents the initial requests in the raw Web log, bar 2 the data remaining after the preprocessing method, bar 3 the image and status-code entries removed, and bar 4 the robot entries removed.

Fig.2. Processes of Data Cleaning

As Table I shows, the results obtained with the three methods employed for Web robot detection overlap, as shown in Figure 3.


Fig. 3. Number of Web Robot Hosts Identified Using Each Method
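The per-method totals and the "same record" rows of Table I are set cardinalities and intersections; a minimal sketch (hosts_rt, hosts_ua, and hosts_ri are assumed to be the sets of (host, agent) pairs flagged by each heuristic of Section III.B):

```python
def overlap_counts(hosts_rt, hosts_ua, hosts_ri):
    """Per-method totals and intersections, as reported in Table I."""
    return {
        'robots.txt (RT)': len(hosts_rt),
        'user agent (UA)': len(hosts_ua),
        'robot IP (RI)':   len(hosts_ri),
        'RT & UA':         len(hosts_rt & hosts_ua),
        'RT & RI':         len(hosts_rt & hosts_ri),
        'UA & RI':         len(hosts_ua & hosts_ri),
        'all three':       len(hosts_rt & hosts_ua & hosts_ri),
    }
```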


In general, preprocessing can take up to 60-80% of the time spent analyzing the data, and an incomplete preprocessing task can easily result in invalid patterns and wrong conclusions. The size of the original log file before preprocessing is 37765942 bytes (36.01 MB); after preprocessing it is 4251968 bytes (4.06 MB), a reduction of about 88.74%. Finally, on the basis of the user identification results we identified 3546 unique users, and with a timeout threshold of 25.5 minutes and path completion we identified 4319 sessions.
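The reduction figure follows directly from the two file sizes; a one-line check:

```python
before, after = 37765942, 4251968           # bytes, from Section V
reduction = 100 * (before - after) / before
print(f"{reduction:.2f}% reduction")        # prints 88.74% reduction
```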


VI. CONCLUSION
This paper presents the details of the data preprocessing that is required before data mining and knowledge discovery techniques can be applied to WWW server access logs in Web usage mining. We give rules for each stage of data preprocessing so that the stages can be designed and applied easily. Our experiments estimate the effectiveness of our data preprocessing practices: preprocessing not only reduces the log file size, but also increases the quality of the available data. However, many problems remain, such as data collection, the application of heuristics in some phases of data preprocessing, the accuracy of user identification and session identification, and applying the results of data preprocessing to pattern discovery. We will focus on solving these issues in the future.


REFERENCES
[1] Bamshad Mobasher, "A Web Usage Mining", http://maya.cs.depaul.edu/~mobasher/webminer/survey/node6.html, 1997.
[2] Li Chaofeng, "Research and Development of Data Preprocessing in Web Usage Mining".
[3] Rajni Pamnani, Pramila Chawan, "Web Usage Mining: A Research Area in Web Mining".
[4] Andrew Shen, "Http User Agent List", http://www.httpuseragent.org/list/
[5] Andreas Staeding, "User-Agents (Spiders, Robots, Crawler, Browser)", http://www.user-agents.org/
[6] "Robots Ip Address", http://chceme.info/ips/
[7] "Volatile Graphix, Inc.", http://www.iplists.com/nw/