Aarti Singh et al. / International Journal of Engineering Science and Technology (IJEST)
EXPLORING WEB USAGE MINING WITH SCOPE OF AGENT TECHNOLOGY

AARTI SINGH
Associate Professor, MMICT & BM, Maharishi Markendeshwar University, Mullana, Ambala, Haryana, India
[email protected]

RAVI DUTT MISHRA
Lecturer, Deptt. of CSE, Swami Vivekanand Institute of Technology, HSBTE, Udana-Karnal, Haryana, India
[email protected]

Abstract: Web Mining is currently used to extract knowledge about users' requirements while they visit the Internet. This paper elaborates the contribution of web usage mining to the Web Mining field. Web usage mining contributes towards the development of a user-friendly web and optimized search results. However, some problems still prevail in this area, such as the large volume of data to be mined, the need to develop adaptive websites and issues related to the privacy of users. This work elaborates the web usage mining process and highlights the research challenges. It further describes agent technology and explores the applicability of agents in web usage mining.

Keywords: Web Mining, Web Usage Mining, User Click-Streams, Agent, MAS

1. Web Mining

Today the World Wide Web (WWW) has become a popular medium for searching information. Whenever some information is sought on the web, a huge amount of material is returned, and sorting out the relevant part is left to the user. This situation is termed information overload [10; 11], and it makes relevant information difficult to find. Web Mining is used as a tool to address the problem of information overload while searching over the WWW. Various organizations also employ web mining to make their websites more user-friendly and to improve the web surfing experience of users. The Internet is a pool of diverse information sources such as text, images, hyperlinks, audio and video. Depending upon the type of data being mined, web mining [9; 10] is categorized as follows.

1.1 Web content mining: When mining techniques are applied to the contents of web pages, such as text, images, video etc., it is termed Web Content Mining.

1.2 Web structure mining: It is the application of mining techniques to the hyperlink structure of the web. It is useful for measuring the ranking of a web page and, using the information about page rankings, it reveals the relationship and similarity between different websites.

1.3 Web usage mining: This type of mining is used for analyzing the user's interaction with the web. It generates secondary data, such as user interests and behavior, as a result of these interactions.
ISSN : 0975-5462
Vol. 4 No.10 October 2012
4283
Figure 1 given below provides classification of web mining on the basis of mined data.

WEB MINING
    WEB CONTENT MINING: Text, Images, Video, Audio
    WEB STRUCTURE MINING: Hyperlinks
    WEB USAGE MINING: Browsing Log, Transaction Log, User's Profile, User's Queries

Figure 1 Classification of Web Usage Mining
This work explores Web Usage Mining in detail and throws light on its various aspects.

2. Overview of Web Usage Mining

Web Usage Mining [12] is the analysis of the user's interaction with the web server. Web servers maintain logs of the data accessed by users while browsing the Internet. Usage data is also collected from the user's profile, the transactions performed and the queries given by the user. Web Usage Mining is basically used to study users' browsing behavior in order to find their areas of interest and the nature of the content in which they are more interested. It analyzes data such as the websites or links most visited by the user and the user's topics of interest. The results of such analysis are helpful in many ways, such as optimizing future searches [5] and restructuring websites [5] to provide a better experience to users. These results are also helpful in making page recommendations for users. Special techniques are used to study browsing paths or click streams when multiple users operate from the same IP address. These click streams [5] are then mined using clustering and association-finding techniques. Data collected in web server logs is not directly suitable for web mining techniques; it has to be preprocessed first. The overall Web Usage Mining process involves three phases [15; 17], namely data preprocessing, pattern discovery and pattern analysis. Figure 2 illustrates the Web Usage Mining process. The following subsections elaborate these phases.

2.1 Data Preprocessing: The main objective of preprocessing is to convert the available data into the abstract form necessary for pattern discovery. Preprocessing the data [15] includes data cleaning, user & session identification and path completion.

2.1.1 Data Cleaning: Data collected from the databases of web servers, proxy servers etc. contains noise, so data cleaning [10; 15; 20] methods are applied to it. In this process irrelevant web access logs are eliminated. Since the main concern here is the traversal pattern of the user, irrelevant records are not needed.

2.1.2 User & Session Identification: A client-side tracking mechanism is used to record only the IP addresses and server-side click streams for the purpose of identifying users and sessions. Different researchers have proposed different methods for this purpose. For example, Dong (2009) in [3] highlighted that user identification involves identifying which websites and
further web pages are surfed by that particular user. Users are recognized by the user IP address and the agent field. Revankar et al. in [15] and Dong (2009) in [3] stated the following rules for recognizing users:
Different IP values in the IP address field represent different users.
Even if the IP values are the same, different agent field values represent different users.
If the IP values as well as the agent field values are the same, then the data surfed by the user is taken into account. If there is a link between the page last requested and the earlier requested pages, they represent the same user; otherwise they represent different users.
If there is no link between the data accessed, or there is a large interval between the access times, it is identified as a new user session.
The same user may visit the web more than once at different points of time, so a time heuristic [15] is used to divide those visits into different user sessions. After user identification, the sessions are created. A session is the series of web pages that a user browses in a single access. The following rules have been defined to identify user sessions [3; 15]:
A threshold value of access time is set up (say 30 minutes).
When the access time of the same user exceeds 30 minutes, it is considered a different user session.
Various sessions are then labeled with session IDs and values are assigned to them.
To calculate the time difference, four fields, i.e. day, timehh, timemm and timess, indicating days, hours, minutes and seconds, are added.

2.1.3 Path completion: Because of factors such as local caching and proxy servers, the complete web access record is not reflected in the access log files [15] and not all URLs are available. As a result, user access paths are not completely recorded. Missing pages have to be recovered to improve the quality of the mined patterns, since gaps affect the study of users' browsing behavior. Path completion is thus necessary because web log files are neither structured nor complete. Problems caused by local caching can be addressed by inferring cache hits from the way pages and activities are linked together. Traversal paths can also be completed by using modified browsers, which use a client-side program called an activity recorder to record local activities. Such a complete data structure is easy to mine.
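
The user and session rules of Section 2.1.2 can be illustrated with a minimal sketch. It assumes that log records have already been cleaned and parsed into objects with ip, agent, time and url fields (the Hit class and all names below are hypothetical), and it reads the 30-minute rule as a limit on the gap between two consecutive requests of the same user, which is one possible interpretation. The link-based rule for telling users apart is left out for brevity.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from itertools import groupby

SESSION_TIMEOUT = timedelta(minutes=30)  # threshold value suggested in the text


@dataclass(frozen=True)
class Hit:
    """One cleaned log record; the field names are illustrative."""
    ip: str
    agent: str
    time: datetime
    url: str


def user_key(hit):
    # Same IP and same agent field => treated as the same user.
    return (hit.ip, hit.agent)


def sessionize(hits, timeout=SESSION_TIMEOUT):
    """Return a list of (session_id, user, hits) tuples."""
    ordered = sorted(hits, key=lambda h: (user_key(h), h.time))
    sessions = []
    for user, user_hits in groupby(ordered, key=user_key):
        current = []
        for hit in user_hits:
            # A gap longer than the timeout closes the running session.
            if current and hit.time - current[-1].time > timeout:
                sessions.append((len(sessions) + 1, user, current))
                current = []
            current.append(hit)
        sessions.append((len(sessions) + 1, user, current))
    return sessions
```

Each returned tuple carries a running session ID, the (IP, agent) pair that identifies the user, and the ordered requests of that session, which is the form the later pattern discovery steps would consume.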
[Figure 2: Data Collection feeds three phases. DATA PREPROCESSING: Data Cleaning, User & Session Identification, Path Completion. PATTERN DISCOVERY: Clustering, Classification, Association Rules. PATTERN ANALYSIS: Uninteresting Rules or Patterns.]

Figure 2 Process of Web Usage Mining
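
As an illustration of the path completion step listed in Figure 2 and described in Section 2.1.3, the following sketch infers cache hits from the site's link structure. It assumes the link structure is available as a dictionary mapping each page to the set of pages it links to; the function name and the backtracking heuristic are illustrative choices, not a method prescribed by the cited works.

```python
def complete_path(session, links):
    """Insert pages the user probably revisited from the local cache.

    session: list of page URLs in the order the server logged them.
    links:   dict mapping a page to the set of pages it links to.
    """
    completed = []
    for page in session:
        if completed and page not in links.get(completed[-1], set()):
            # The last page has no link to `page`, so assume the user went
            # back through cached pages until one that links to it.
            backtrack = []
            for earlier in reversed(completed[:-1]):
                backtrack.append(earlier)
                if page in links.get(earlier, set()):
                    completed.extend(backtrack)
                    break
        completed.append(page)
    return completed
```

If no earlier page links to the requested one, the path is left unchanged, since the request may equally have come from a bookmark or a typed URL.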
2.2 Pattern Discovery: Once users and sessions are identified, various data mining [13; 17; 19] techniques such as clustering, classification and association rules are applied to find sets of matching patterns.

2.2.1 Classification: Classification involves grouping the data into several predefined classes. Each item is assigned to the class or category whose described features or properties it matches best.
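
A toy sketch of classification into predefined classes is given here. The two class profiles, their page sets and the use of Jaccard similarity as the matching criterion are all assumptions made for illustration; a real system would typically learn its classifier from labeled sessions.

```python
# Hypothetical predefined classes, each described by characteristic pages.
CLASS_PROFILES = {
    "researcher": {"/journals", "/conferences", "/call-for-papers"},
    "student": {"/courses", "/tutorials", "/exams"},
}


def jaccard(a, b):
    """Share of common elements between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify(session_pages, profiles=CLASS_PROFILES):
    """Return the best-matching class name, or None if nothing matches."""
    pages = set(session_pages)
    best = max(profiles, key=lambda name: jaccard(pages, profiles[name]))
    return best if jaccard(pages, profiles[best]) > 0 else None
```

For example, a session that visited /journals, /conferences and /courses overlaps more with the researcher profile than with the student profile and is labeled accordingly.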
2.2.2 Clustering: Clustering means grouping similar patterns together. Clusters can be usage clusters or page clusters. Clustering of users establishes sets of users with similar browsing patterns. Unlike classification, the classes in clustering are not predefined; they emerge as a result of clustering. Clustering of pages groups pages with similar or related content.

2.2.3 Association Rules: Association rules discover pages which are associated or related with each other in some way. In particular, pages which are accessed together are discovered using association rules. For example, a person searching for journals in a field may also be interested in conferences, seminars and workshops in the same domain.

2.3 Pattern Analysis: The aim of this phase [15; 17] is to separate the interesting patterns from the rest and to visualize and interpret them. The exact methodology for pattern analysis is decided by the application performing the mining. This phase works by first deleting the less significant patterns and then exploiting OLAP [3; 15; 17] technology. A knowledge query mechanism like SQL can also be used for this purpose.

3. Applications and Scope of Web Usage Mining

Web Usage Mining provides a vast amount of valuable information which, if used properly, can serve many applications of commercial importance. It can also improve web interaction considerably. It has three main goals [3] for any application:
To optimize user’s web surfing experience.
To improve the performance of a website.
To improve the design of a website.
Various areas of application of Web Usage Mining are:

3.1 Development of user friendly web: Web Usage Mining analyzes the user's browsing pattern, which gives information regarding the user's preferences. This information may be useful in developing websites which are easy and interesting to use. Naïve users feel uncomfortable while surfing the Internet because of the complex structure of websites. Application of Web Usage Mining can change this scenario. Web Usage Mining also allows restructuring of existing websites, leading to their better management.

3.2 Security: Pattern analysis [4] builds a pattern of a user's preferences, way of surfing and areas of interest. A routine user generally follows the same pattern on every visit. If someone other than the routine user visits, the change in browsing pattern can be detected. Thus, this technique may be used for detecting intrusion or unauthorized access.

3.3 Search Engine Optimization: Browsing pattern analysis can help in developing optimized search engines which produce relevant, filtered information by understanding what exactly the user wants to search. Such a technique would save the user from the information overload problem.

Web Usage Mining is all about how a user uses a website and for what purpose. Results of Web Usage Mining [18] may be very helpful for future research in the fields of website customization, web intelligence and computer security. Although a lot of research has been done in this domain, many challenges still prevail. The next section explores these challenges.

4. Research Challenges

Web Usage Mining has provided a better way of utilizing the web, but some areas still need attention and more research.

4.1 Development of adaptive websites: In the field of web personalization, adaptive websites [9] can improve the search scenario. Adaptive websites are websites which can change their organization and presentation according to the preferences of the users.
Use of agents can make this simpler and more user friendly. Agents are employed to provide recommendations for the users. Although work in this direction has started, it is at an initial stage and requires more research effort.

4.2 Data Sampling: In the case of on-line social networking sites, data sampling [19] to reduce the size of the data is the main concern. Here data can be email, image, text, video etc. or a combination of these. Reducing the size of the data in Web Usage Mining is thus a big challenge.

4.3 Privacy Concern: In order to personalize the web and improve the browsing experience, the site administrator needs to identify users [17] uniquely. But most users do not want to be monitored every time they visit the web. Some rules and regulations need to be applied so that the site administrator can analyze the data without compromising the user's identity. That is, there should be strict rules that the user's identity will not be disclosed and usage data will not be misused.

The next section explores agent technology, which has become a prime solution for problems involving distributed, dynamic and heterogeneous environments such as the WWW. Since web usage mining possesses these attributes, we explore the applicability of agent technology in this domain.

5. Agent Technology

Agents are software entities which perform a specific task on behalf of their user. They mirror agents in the real world, who do something on a user's behalf, and any technology that makes use of agents to perform a function is termed agent technology [2; 6]. Once deployed, an agent resides in an environment, senses its input as and when it occurs and acts on it autonomously to achieve its predefined goal. Agents can be classified into the following categories based on the nature of their task performance [8].

5.1 Gopher agents: An agent which performs its task and gives the result to the user when a specified condition is met. For example, inform the user when an email from a particular id is received.

5.2 Service Performing Agents: An agent which performs a task only when the user requests it to do so. For example, book an air ticket to the U.K. next month.

5.3 Predictive Agents: An agent which voluntarily provides information and services to the user without being asked is known as a predictive agent. For example, an agent may inform a user that a heavy discount is being offered by a particular brand of the user's interest.

Agents can also be categorized depending upon their characteristics [7]. These are:
Personal Agents: Agents that interact directly with the user and monitor the user's activities and preferences are personal agents. All three classes of agents discussed earlier are kinds of personal agents.
Mobile Agents: Agents that visit remote sites to collect information and perform tasks are known as mobile agents. These agents return to their source and provide results to the user.
Collaborative Agents: Agents which collaborate with other agents to collect information or work in groups to perform some task are known as collaborative agents.
6. Multi Agent System

A Multi Agent System (MAS) [2; 6] is a system with more than one agent. Collaborative agents participate in the formation of a MAS. Since a single agent can perform only its predefined task, agents collaborate with each other and form a MAS in order to provide complex services. Agents in a MAS communicate with each other using a structured language called Agent Communication Language (ACL). When a task is assigned to an agent, the agent searches for the related information in its own database and, if it is not found there, passes the query to other agents. After finding the related information, the agent returns the result to the client. The agent also stores the result in its own database, so that when the same information is requested again it can give the result immediately. This concept is shown in figure 3 given below.
Figure 3 MAS Relational Model (Throne, 2002) [15]
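
The query-handling behavior described for agents in a MAS (look in the local database, otherwise ask other agents, then store the answer) can be sketched as follows. This is a simplified illustration: the Agent class and its attribute names are hypothetical, and plain method calls stand in for ACL messages.

```python
class Agent:
    """Toy agent that answers from its own database or asks its peers."""

    def __init__(self, name, knowledge=None):
        self.name = name
        self.database = dict(knowledge or {})
        self.peers = []

    def ask(self, query, _visited=None):
        # Track agents already consulted so cyclic peer links terminate.
        visited = _visited if _visited is not None else set()
        visited.add(self)
        if query in self.database:
            return self.database[query]
        for peer in self.peers:
            if peer in visited:
                continue
            result = peer.ask(query, visited)
            if result is not None:
                # Store the answer locally so a repeated query is immediate.
                self.database[query] = result
                return result
        return None
```

After the first successful lookup, the answer sits in the asking agent's own database, so the same query no longer needs to be forwarded to other agents.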
6.1 Features of Agents: Every agent possesses the following features [7; 8]. An agent should be:
Autonomous: Every agent should be able to perform a task on its own. It should have overall control over its actions and various states.
Adaptive: The environment in which agents reside may change over time, which can affect the proper functioning of the agents. Agents must therefore be able to adapt to the environment.
Collaborative: Agents have to communicate with other agents so as to collaborate with them in their activities whenever required.
Competent: Agents should be competent, i.e. able to perform the task successfully and to manipulate the environment accordingly.
Cooperative: Agents should be cooperative in nature so as to work collaboratively on complex tasks.
Reactive: Agents should be able to respond in a timely manner to changes that occur in the environment.
Proactive: Beyond responding to their environment, agents should exhibit opportunistic, initiative-taking behavior.

7. Scope of Agents in Web Usage Mining

Web mining is an upcoming area of research which is gaining attention from the research fraternity. Information overload is making commercial and personal web usage cumbersome for users. Web mining offers a solution for filtering web contents and for selecting and providing appropriate contents to end users. Considering the large volume of data already available on the web and the rate at which new information is uploaded, there is a dire need for a technology which can automate the processes involved in web mining, specifically in usage mining. Agent technology seems to be a promising solution in this case. Agents have already been employed in many web-based research domains, such as the semantic web, wireless sensor networks and web services, and have proved beneficial. A review of the literature shows that researchers have not paid much attention to employing agent technology in web usage mining in the past; however, some efforts have recently been made in this direction [10; 16]. Kosala et al. in [10] indicated that the two basic approaches used by agents in the mining process are:
Content Based Approach: an analysis of the content explored by any user is performed.
Collaborative Approach: users having similar searching behavior are found and then recommendations are made based on their interests.
Here the collaborative approach corresponds to web usage mining. Nowadays, agents are an integral part of web search engines in the form of web crawlers. Once employed in web usage mining, they can help in facing the existing challenges in this domain.
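
A minimal sketch of the collaborative approach is given here: users with similar browsing behavior are found, and the pages they visited are recommended to the target user. The input format (a dictionary from user to the set of visited pages), the Jaccard similarity measure and the neighbourhood size k are assumptions chosen for illustration.

```python
def jaccard(a, b):
    """Share of common elements between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target, histories, k=2):
    """Recommend unseen pages taken from the k most similar users.

    histories: dict mapping a user to the set of pages that user visited.
    """
    seen = histories[target]
    neighbours = sorted(
        (u for u in histories if u != target),
        key=lambda u: jaccard(seen, histories[u]),
        reverse=True,
    )[:k]
    scores = {}
    for user in neighbours:
        weight = jaccard(seen, histories[user])
        for page in histories[user] - seen:
            scores[page] = scores.get(page, 0.0) + weight
    # Highest-scoring pages first; pages from dissimilar users are dropped.
    ranked = sorted(scores, key=lambda p: (-scores[p], p))
    return [p for p in ranked if scores[p] > 0]
```

Weighting each candidate page by the similarity of the user who visited it means that pages favoured by close neighbours rank above pages seen only by loosely related users.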
8. Conclusions

This work explored the area of Web Usage Mining, emphasizing its application areas and research challenges. The role and applicability of software agents in the context of web usage mining have also been explored. Agent technology has a lot of scope for employment in this field. When employed, agents can help overcome the challenges existing in this domain, and intelligent mechanisms may be designed to address the problems of web usage mining.

References:

[1] Abedin, B.; Sohrabi, B. (2005): A Web Designer Agent, Based on Usage Mining Online Behavior of Visitors, World Academy of Science, Engineering and Technology 6 2005, pp. 41-45.
[2] Cleary, P., et al: Investigating the use of Software Agents to Reduce The Risk of Undetected Errors in Strategic Spreadsheet Applications, http://arxiv.org/ftp/arxiv/papers/0806/0806.0189.pdf
[3] Dong, D. (2009): Exploration on Web Usage Mining and its Application, IEEE.
[4] Fu, Y.; Shih, M. (2002): A Framework for Personal Web Usage Mining, In International Conference on Internet Computing (IC'2002), pp. 595-600, Las Vegas, NV.
[5] Fürnkranz, J. (2005): Web Mining, In O. Maimon and L. Rokach (Eds.), The Data Mining and Knowledge Discovery Handbook, pp. 899-920. Berlin: Springer.
[6] Genesereth, M. R.; Ketchpel, S. P. (1994): Software Agents, Communications of the ACM 37 (7), pp. 48-53.
[7] Griss, M. (2001): Software agents as next generation software components, In Component-Based Software Engineering: Putting the Pieces together, Edited by G.T. Heineman and W.T. Councill (Addison-Wesley, Boston, 2001), pp. 641-657.
[8] Jennings, N. R.; Wooldridge, M. (1996): Software Agents, IEE Review, pp. 17-20.
[9] Kolari, P.; Joshi, A. (2004): Web Mining: Research and Practice, Computing in Science and Engineering, 6(4), pp. 49-53.
[10] Kosala, R.; Blockeel, H. (2000): Web Mining Research: A Survey, SIGKDD Explorations, 2(1), pp. 1-15.
[11] Li, Y.; Zhong, N. (2004): Web Mining Model and Its Applications on Information Gathering, Knowledge-Based Systems, vol. 17, pp. 207-217.
[12] Liu, L.; Chen, J.; Song, H. (2002): The Research of Web Mining, Proceedings of the 4th World Congress on Intelligent Control and Automation, June 10-14, Shanghai, China.
[13] Nina, S. P., et al (2009): Pattern Discovery of Web Usage Mining, IEEE.
[14] Reddy, K. S.; Varma, G. P. S.; Reddy, S. S. S. (2012): Understanding the Scope of Web Usage Mining & Applications of Web Data Usage Patterns, International Conference on Computing, Communication and Applications, ICCCA-2012, February 22-24, 2012.
[15] Revankar, P.; Dahiwele, J. (2011): Web Usage Mining, 5th National conference; INDIACom-2011, Computing For Nation Development, March 10-11, 2011. http://www.bvicam.ac.in/news/INDIACom%202011/214.pdf
[16] Singh, A. (2012): Agent Based Framework for Semantic Web Content Mining, International Journal of Advancements in Technology, Vol. 3, No. 2, April 2012, pp. 108-113.
[17] Srivastava, J., et al (2000): Web usage mining: Discovery and applications of usage patterns from web data, SIGKDD Explorations, 1(2).
[18] Srivastava, J.; Desikan, P.; Kumar, V. (2002): Web Mining: Accomplishments and Future Directions, Proceedings of the US National Science Foundation Workshop on Next-Generation Data Mining (NGDM), National Science Foundation.
[19] Ting, H. I. (2008): Web Mining Techniques for On-line Social Networks Analysis, In Proceedings of the 5th International Conference on Service Systems and Service Management, Melbourne, Australia, 30 June-2 July 2008, pp. 696-700.
[20] Zaïane, O. R. (2001): Web usage mining for a better web-based learning environment, In Proceedings of Conference on Advanced Technology for Education, pp. 60-64, Banff, AB.
[21] Zhan, L.; Zhijing, L. (2003): Web Mining Based On Multi-Agents, COMPUTER SOCIETY, IEEE.