
MATHEMATICAL METHODS, COMPUTATIONAL TECHNIQUES, NON-LINEAR SYSTEMS, INTELLIGENT SYSTEMS

Advanced AI Techniques for Web Mining

IOAN DZITAC
IT Department
Agora University
Piata Tineretului, 8, 410526 Oradea, ROMANIA

IOANA MOISIL
“Hermann Oberth” Faculty of Engineering - Computer Science and Automatic Control
“Lucian Blaga” University of Sibiu
10, Blvd. Victoriei, 550024 Sibiu

ISSN: 1790-2769
ISBN: 978-960-474-012-3

Abstract: The World Wide Web has evolved in less than two decades into the major source of data and information for all domains. The Web has become not only an accessible and searchable information source but also one of the most important communication channels, almost a virtual society. Web mining is a challenging activity that aims to discover new, relevant and reliable information and knowledge by investigating the web structure, its content and its usage. Although the web mining process is similar to data mining, the techniques, algorithms and methodologies used to mine the web include, but are not limited to, those specific to data mining, mainly because the web holds a great amount of unstructured data and changes frequently and rapidly. This paper is structured into two sections. The first briefly discusses the different web mining tasks, and the second focuses on advanced artificial intelligence (AI) methods for information retrieval and web search, link analysis, opinion mining and web usage mining.

Key-Words: Web Mining, Multi-Agent System, Swarm Intelligence, Ant Colony Optimizer, Classification Rule Mining

1 Introduction
The World Wide Web has evolved in less than two decades into the major source of data and information for all domains. The Web has become not only an accessible and searchable information source but also one of the most important communication channels, almost a virtual society. Web mining is a challenging activity that aims to discover new, relevant and reliable information and knowledge by investigating the web structure, its content and its usage. Although the web mining process is similar to data mining, the techniques, algorithms and methodologies used to mine the web include, but are not limited to, those specific to data mining, mainly because the web holds a great amount of unstructured data and changes frequently and rapidly [1,4]. Moreover, Web links are an important source of information. The Web is also not only a huge repository of data and information but a provider of services of all kinds. All of this makes the web a virtual society, where people, organizations and systems interact. Web mining is the process of discovering useful information or knowledge from hyperlink structure, page content and usage data. There are three main Web mining tasks: web structure mining, web content mining and web usage mining.

Web structure mining is about discovering knowledge from hyperlinks. Important web pages can be identified, as can users with common interests, i.e. users who visit the same clusters of linked pages. Until 1996, pages were retrieved based on content similarity. Starting from 1997 the most used hyperlink search algorithms were PageRank and HITS (Hyperlink-Induced Topic Search). These algorithms are rooted in social network analysis (measures of the degree of prominence of an actor in a social network). Pages are ranked according to their prestige or authority. The Web is considered a virtual social network, with pages as the social actors and hyperlinks as the relationships. In this way models and techniques from social network analysis can be transferred to web structure mining.

Web content mining aims to extract useful information or knowledge from the content of web pages. Pages can be clustered and classified by topic; patterns in users’ opinions on different products, or in forum postings, can also be found in unstructured user-generated text. Opinion mining uses not only data mining techniques but also techniques of natural language processing.

Web usage mining aims to automatically discover and analyse patterns in click streams and associated data collected or generated as a result of user interactions with web resources, on one or more web sites. Behavioural patterns and profiles of users interacting with a web site are captured, modelled and analysed in order to improve services. Almost all web mining tasks use artificial intelligence techniques and algorithms in order to perform efficiently. In the following we briefly describe some of the most valuable AI techniques: multi-agent technology and swarm intelligence algorithms.

the world where agents live. Agents’ actions consist of following links and visiting pages. They receive signals from the environment, namely the text and link characteristics of the pages, and they learn from these signals. Each action has an energy cost that can be, for example, the size of the fetched page. Energy is gained from visiting new pages that are relevant to the topic of the query [26]. Other applications of multi-agent systems are wrappers, programs designed to extract structured data from the web.
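The energy bookkeeping of the crawling agents described above (each fetch costs energy in proportion to page size, and each new relevant page returns energy) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy web `TOY_WEB`, the keyword-overlap `relevance` score, and the parameters `cost_per_char` and `reward` are hypothetical, and the learning and reproduction steps of the actual crawler in [26] are omitted.

```python
# Toy web (hypothetical): page name -> (page text, outgoing links).
TOY_WEB = {
    "home":   ("welcome to the portal", ["ants", "sports"]),
    "ants":   ("ant colony foraging and pheromone trails", ["aco", "home"]),
    "aco":    ("ant colony optimization for web mining", ["ants"]),
    "sports": ("football results and league tables", ["home"]),
}


def relevance(text, query_terms):
    """Fraction of the query terms that occur in the page text."""
    words = set(text.split())
    return sum(term in words for term in query_terms) / len(query_terms)


def crawl(start, query_terms, energy=1.0, cost_per_char=0.005, reward=1.0):
    """Follow links until the agent has no energy or no unvisited link left."""
    visited, page = [], start
    while energy > 0 and page is not None:
        text, links = TOY_WEB[page]
        energy -= cost_per_char * len(text)              # cost: size of the fetched page
        energy += reward * relevance(text, query_terms)  # gain: the page is new and may be relevant
        visited.append(page)
        # The agent only moves to unvisited pages, so every fetched page is new.
        candidates = [link for link in links if link not in visited]
        page = max(
            candidates,
            key=lambda link: relevance(TOY_WEB[link][0], query_terms),
            default=None,
        )
    return visited, energy


if __name__ == "__main__":
    print(crawl("home", ["ant", "colony"]))
```

With the query terms `ant` and `colony`, the agent moves from `home` to the two relevant pages and never spends energy on `sports`; with a query that matches nothing and a small initial energy, it stops after the first fetch.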

3 Swarm Intelligence
Swarm intelligence (SI) is a term introduced in 1989 in the context of cellular robotic systems. It denotes the collective behaviour of decentralized, self-organized artificial systems, by analogy with the natural world, where the collective behaviour of a swarm can lead to the emergence of apparently intelligent behaviour [9, 15, 22]. Collective intelligence is defined as the ability of a group to solve more problems than its individual members [25]. SI systems consist of simple agents that are interrelated, able to communicate with one another and to interact with their environment. The community of agents carries out distributed problem solving: the agents follow simple rules and there is no centralized control. Natural examples of SI include ant colonies, bird flocking, animal herding, bacterial growth, and fish schooling. In the following we present some applications of swarm intelligence algorithms to web mining. One of the best-known SI algorithms is ACO (Ant Colony Optimization) [10], introduced in 1992 by M. Dorigo in his Ph.D. thesis. The algorithm was inspired by the behaviour of ants finding paths from the nest to food, in the course of which they create a network of pheromone trails. Since then several improvements and variants of the algorithm have been developed for different application fields, including data mining [11, 12, 13, 14, 15, 16, 17]. For example, the process of navigating the web is similar to the foraging of ant colonies [19, 23, 24]. Like ants, which have no global view of the environment, a web user navigates the web without information about the routes followed by other users with the same objective. In order to apply an ACO algorithm to find the shortest path to a certain document or cluster of documents, information about target pages and routes can be kept on a special server.
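A minimal sketch of this idea, under stated assumptions, is given here. The link graph `LINKS`, the parameter names (`n_ants`, `n_iter`, `rho`) and the deposit rule (one over the number of clicks) are hypothetical choices for illustration, not the formulation of any cited paper. The `pheromone` dictionary plays the role of the route information kept on a server.

```python
import random

# Hypothetical link graph: page -> pages it links to.
# It is built so that every walk from "home" eventually reaches "target".
LINKS = {
    "home": ["a", "b"],
    "a": ["target"],
    "b": ["c"],
    "c": ["target"],
    "target": [],
}


def walk(start, target, pheromone, rng):
    """One ant follows links, choosing each link with probability proportional to its pheromone."""
    path = [start]
    while path[-1] != target:
        current = path[-1]
        options = [page for page in LINKS[current] if page not in path]
        if not options:
            return None  # dead end: this ant is discarded
        weights = [pheromone[(current, page)] for page in options]
        path.append(rng.choices(options, weights=weights)[0])
    return path


def aco_shortest_path(start, target, n_ants=10, n_iter=20, rho=0.5, seed=0):
    """Return the shortest path found and the final pheromone table."""
    rng = random.Random(seed)
    pheromone = {(u, v): 1.0 for u, outs in LINKS.items() for v in outs}
    best = None
    for _ in range(n_iter):
        walks = (walk(start, target, pheromone, rng) for _ in range(n_ants))
        paths = [p for p in walks if p is not None]
        for edge in pheromone:  # evaporation on every link
            pheromone[edge] *= 1.0 - rho
        for p in paths:  # deposit: shorter paths leave more pheromone per link
            for edge in zip(p, p[1:]):
                pheromone[edge] += 1.0 / (len(p) - 1)
            if best is None or len(p) < len(best):
                best = p
    return best, pheromone


if __name__ == "__main__":
    print(aco_shortest_path("home", "target")[0])
```

No single ant sees the whole graph; the preference for the shorter route emerges only from the shared pheromone table, which is the point of the analogy with web users.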
Heylighen has noted that the ant colony behavioral model can enable us to define “some basic mechanisms of collective mental map (CMM) development: averaging of individual preferences, amplification of weak links by positive feedback, and integration of specialised sub-

2 Multi-Agent Technology
Agent and multi-agent systems have become an important field within artificial intelligence research. They have found a number of applications, including web mining. An agent is a computer system that is capable of independent action on behalf of its user or owner in order to satisfy design objectives. An intelligent software agent has to be autonomous, reactive, proactive, and social (capable of interacting with other agents, communicating, and negotiating). Intelligent agents can learn and adapt to new situations. Some other characteristics are also valuable: mobility across an electronic network, veracity, benevolence and rationality. A multi-agent system consists of a number of agents that interact with one another, cooperate to carry out different tasks, and are able to negotiate and resolve conflicts. Multi-agent systems or simple agents are used in almost all content mining tasks. A web crawler is a program that automatically downloads web pages. A crawler can collect information to be analyzed and mined online or offline. Crawlers can be universal, topical or focused. Adaptive topical crawlers are the most sophisticated; they are designed using different machine learning techniques, in particular classifiers that guide them through the web. Intelligent crawlers adapt to the web content and hyperlink structure. They can use a statistical model to learn to assign priorities to URLs in the considered neighborhood, based on Bayesian interest factors derived from features (tokens extracted from candidate URLs, source page content and links, etc.). InfoSpiders is an adaptive crawling algorithm that uses reinforcement learning while crawling online, without any supervised learning. This crawler is inspired by artificial life models, in which a population of agents lives, learns, evolves, reproduces and dies. The agents learn from experience. They can be rewarded or punished for their actions. Using this model, the Web is


classifiers at the document level are used for sentiment classification. There are many classification rule mining algorithms (decision trees, Naive Bayesian classifiers, quadratic classifiers, neural networks, k-nearest neighbor classifiers, Bayesian networks, Support Vector Machines, boosting, hidden Markov models, ensemble methods). Here we mention algorithms based on ant colony optimization (ACO). Peng Jin et al. have implemented improvements of the ACOMiner algorithm to enhance classification predictive accuracy and simplicity of rules [11]. Their algorithm, SIMiner, has a multi-population parallel strategy in which a cost-based discretization method is adopted and the algorithm’s parameters are adjusted step by step. SIMiner was tested on six data sets taken from the UCI Machine Learning Repository, with results showing better predictive accuracy and simpler rules. Another algorithm for mining classification rules, the Threshold Ant Colony Optimization Miner (TACOMiner), was proposed by K. Thangavel et al. [12].
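To make the train/test workflow and document-level sentiment classification concrete, here is a minimal sketch using a Naive Bayesian classifier, one of the algorithm families listed above. It does not implement the ACO-based rule miners of [11] or [12]. The tiny training and test sets, the labels `pos`/`neg` and the function names are invented for illustration; add-one (Laplace) smoothing is an assumed design choice.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """docs: list of (text, label). Returns class counts, per-class word counts, vocabulary."""
    class_counts, word_counts, vocab = Counter(), defaultdict(Counter), set()
    for text, label in docs:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log-posterior under add-one smoothing."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


def accuracy(model, docs):
    """Share of held-out documents whose predicted label matches the true one."""
    return sum(predict(model, text) == label for text, label in docs) / len(docs)


# Invented training set (already classified cases) and unseen test set.
TRAINING_SET = [
    ("great phone and great battery", "pos"),
    ("excellent camera", "pos"),
    ("poor battery and bad screen", "neg"),
    ("bad support", "neg"),
]
TEST_SET = [("great camera", "pos"), ("bad battery", "neg")]

if __name__ == "__main__":
    model = train(TRAINING_SET)
    print(accuracy(model, TEST_SET))
```

The model is fitted on the training set only and evaluated on the separate test set, mirroring the distinction between learning data and unseen data drawn in the text.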

networks through division of labour” [25]. If we assign weights to the Web links, we can treat the Web as a CMM. Heylighen has studied two kinds of algorithms. In the first, the co-occurrence of links in web pages (user selections) was used to compute a matrix of link strengths. The second type of algorithm extracted information from a user’s sequential path through the web, applying learning rules in order to change link strengths and create new links. Heylighen’s conclusion was that “the resulting weighted web can be used to facilitate problem-solving by suggesting related links to the user, or, more powerfully, by supporting a software agent that discovers relevant documents through spreading activation”.
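The first kind of algorithm, computing link strengths from co-occurrence in user selections, can be sketched as follows. This is a simplified illustration and not Heylighen's actual rules: the session data, the normalisation by the largest count, and the helper names `link_strengths` and `suggest` are assumptions.

```python
from collections import Counter
from itertools import combinations


def link_strengths(sessions):
    """Count how often two pages are selected in the same session, scaled to [0, 1]."""
    counts = Counter()
    for pages in sessions:
        for a, b in combinations(sorted(set(pages)), 2):
            counts[(a, b)] += 1
    top = max(counts.values(), default=1)
    return {pair: n / top for pair, n in counts.items()}


def suggest(page, strengths, k=2):
    """Return up to k pages most strongly linked to `page`, strongest first."""
    related = []
    for (a, b), weight in strengths.items():
        if page == a:
            related.append((weight, b))
        elif page == b:
            related.append((weight, a))
    related.sort(key=lambda item: (-item[0], item[1]))
    return [other for _, other in related[:k]]


if __name__ == "__main__":
    # Hypothetical user selections, one list of pages per session.
    sessions = [["p1", "p2", "p3"], ["p1", "p2"], ["p2", "p4"]]
    strengths = link_strengths(sessions)
    print(suggest("p1", strengths))
```

The resulting dictionary is a sparse form of the link-strength matrix, and `suggest` corresponds to the simplest use mentioned in the quotation, namely proposing related links to the user.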

3.1 Web Content Mining
AntWeb, an application of ants’ foraging behavior to the design of an adaptive web server, was adapted by Weigang et al. to the Brazilian legislation web site [6]. Users were considered as artificial ants, and the ants’ foraging model was used to guide users’ activity on the web site. Web log files are preprocessed in order to extract information on users’ behavior. In AntWeb the amount of pheromone associated with a link represents the learned desirability of choosing that path. The quantity of pheromone was set proportional to the quality of the solution (shortest path), and visit information was collected in a database. The page was adapted at link level.
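A minimal sketch of such log-driven pheromone bookkeeping is shown below. It is an illustration under assumptions, not the AntWeb implementation of [6]: the evaporation rate `rho`, the constant `q`, the deposit of `q` divided by the number of clicks, and the page names are all hypothetical.

```python
def update_pheromone(pheromone, path, rho=0.1, q=1.0):
    """Evaporate all trails, then deposit q / (number of clicks) on each link of the path."""
    if len(path) < 2:
        return pheromone  # a single page view contains no link to reinforce
    for link in pheromone:
        pheromone[link] *= 1.0 - rho
    deposit = q / (len(path) - 1)  # shorter paths to the goal deposit more per link
    for link in zip(path, path[1:]):
        pheromone[link] = pheromone.get(link, 0.0) + deposit
    return pheromone


def ordered_links(page, links, pheromone):
    """Link-level adaptation: list a page's links by decreasing pheromone."""
    return sorted(links, key=lambda target: -pheromone.get((page, target), 0.0))


if __name__ == "__main__":
    trails = {}
    # Two hypothetical sessions extracted from a preprocessed web log.
    update_pheromone(trails, ["home", "laws", "law42"])
    update_pheromone(trails, ["home", "search", "results", "law42"])
    print(ordered_links("home", ["search", "laws"], trails))
```

Here the two-click route through `laws` ends up with more pheromone than the three-click route through `search`, so the `laws` link is listed first on the adapted page.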

4 Conclusion
The World Wide Web is today the major source of data and information for all domains. It is not only an accessible and searchable information source but also one of the most important communication channels, almost a virtual society. Web mining is an important and challenging activity that aims to discover new, relevant and reliable information and knowledge by investigating the web structure, its content and its usage. In our paper we have presented only two main AI techniques: the multi-agent systems and swarm intelligence, with some of their applications in web mining. The mining tasks are so complex that they cannot be efficiently performed without the support of appropriate advanced AI techniques.

3.2 Classification Rule Mining
Classification, also called supervised or inductive learning, is one of the most important data mining tasks. In data mining we create classification models (predictive models, or classifiers) by examining already classified data and inductively finding a pattern that can be used both to understand the existing data and to predict how new instances will behave. We usually have access to data that have already been classified (cases), and we want to build classification models based on these data and to obtain a predictive pattern. Data are structured in entities (data tuples) characterized by sets of attributes. The instances of an entity are called records, examples, vectors or cases. Data that are used for learning constitute the training set. The model obtained is evaluated using a different set of data, the test data set (unseen data). In Web usage mining we are interested in identifying users’ profiles belonging to specific classes. Features that best describe the properties of a specific class have to be extracted and selected. Collaborative filtering for recommender systems is also an application of classification and prediction. In Web opinion mining,


References:
[1] Kantardzic, M., Data Mining: Concepts, Models, Methods, and Algorithms, John Wiley & Sons, ISBN 0471228524, 2003.
[2] Zhai, C., Statistical Language Model for Information Retrieval, Tutorial Notes.
[3] Kleinberg, J.M., Authoritative sources in a hyperlinked environment, Journal of the ACM (JACM), v.46 n.5, p.604-632, Sept. 1999.
[4] Liu, B., Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data, Springer, 2007.
[5] Fouss, F., Pirotte, A., Renders, J-M., Saerens, M., Random-Walk Computation of Similarities between Nodes of a Graph with Application to Collaborative Recommendation, IEEE Transactions on Knowledge and Data Engineering, v.19 n.3, p.355-369, March 2007.
[6] L. Weigang, M. V. P. Dib, W. M. Teles, V. M. de Andrade, A. C. M. Alves de Melo, J. T. Cariolano, Using ants’ behavior based simulation model AntWeb to improve website organization, in Proc. SPIE's Aerospace/Defense Sensing and Controls Symposium: Data Mining, Vol. 4730, pp. 229-240, Orlando, USA, April 2002.
[7] R. Srikant and Y. Yang, Mining Web Logs to Improve Website Organization, in Proc. of the Tenth International World Wide Web Conference, Hong Kong, May 2001.
[8] Hercules Antonio do Prado & Edilson Ferneda, Emerging Technologies of Text Mining. Techniques and Applications, Chapter X, Idea Group Inc (IGI), 2007.
[9] J. Kennedy, R. C. Eberhart, and Y. Shi, Swarm Intelligence, Morgan Kaufmann, San Francisco, CA, 2001.
[10] M. Dorigo and T. Stützle, Ant Colony Optimization, MIT Press, Cambridge, MA, 2004.
[11] Peng Jin, Yunlong Zhu, Kunyuan Hu, Sufen Li, Classification Rule Mining Based on Ant Colony Optimization Algorithm, Intelligent Control and Automation, Springer, LNCIS, volume 344/2006, pages 654-663.
[12] K. Thangavel, P. Jaganathan, Rule Mining Algorithm with a New Ant Colony Optimization Algorithm, Proc. of the International Conference on Computational Intelligence and Multimedia Applications, 2007, Volume 2, 13-15 Dec. 2007, pp. 135-140.
[13] R. S. Parpinelli, H. S. Lopes, and A. A. Freitas, Data Mining with an Ant Colony Optimization Algorithm, IEEE Trans. on Evolutionary Computation, special issue on Ant Colony Algorithms, 2002, 6(4): 321-332.
[14] L. M. Gambardella, M. Dorigo, Solving Symmetric and Asymmetric TSPs by ant colonies, International Conference on Evolutionary Computation, Nagoya, Japan, 1996, 622-627.
[15] E. Bonabeau, M. Dorigo, and G. Theraulaz, Swarm Intelligence: From Natural to Artificial Systems, Oxford University Press, New York, 1999.
[16] M. Dorigo, V. Maniezzo, and A. Colorni, The ant system: Optimization by a colony of cooperating agents, IEEE Transactions on Systems, Man, and Cybernetics, Part B, 1996, 26(1): 29-41.
[17] M. Dorigo, L. M. Gambardella, Ant colony system: a cooperative learning approach to the traveling salesman problem, IEEE Trans. Evolutionary Computation, 1997, 1(1): 53-66.
[18] G. Di Caro and M. Dorigo, AntNet: Distributed stigmergetic control for communications networks, Journal of Artificial Intelligence Research, 9:317-365, 1998.
[19] R. Schoonderwoerd, O. Holland, J. Bruten and L. Rothkrantz, Ant-based Load Balancing in Telecommunications Networks, Adaptive Behavior, 5(2):169-207, 1996.
[20] M. Dorigo, T. Stützle, Ant Colony Optimization, MIT Press, 2004.
[21] M. Dorigo, M. Birattari, T. Stützle, Ant Colony Optimization: Artificial Ants as a Computational Intelligence Technique, IEEE Computational Intelligence Magazine, 2006.
[22] G. Beni, J. Wang, Swarm Intelligence in Cellular Robotic Systems, Proceed. NATO Advanced Workshop on Robots and Biological Systems, Tuscany, Italy, June 26-30, 1989.
[23] R. Beckers, S. Goss, J.L. Deneubourg & J.M. Pasteels, Colony size, communication and ant foraging strategy, Psyche, 96, 239-256, 1989.
[24] J.L. Deneubourg, S. Goss, R. Beckers & G. Sandini, Collectively self-solving problems, in: Self-Organization, Emerging Properties and Learning, Ed. A. Babloyantz, B260, Plenum Press, New York, 267-278, 1991.
[25] F. Heylighen, Collective Intelligence and its Implementation on the Web: Algorithms to Develop a Collective Mental Map, Computational & Mathematical Organization Theory, Springer, Business and Economics, Volume 5, Number 3, October 1999, pages 253-280.
[26] F. Menczer, G. Pant, P. Srinivasan, Topical Web Crawlers: Evaluating Adaptive Algorithms, ACM Transactions on Internet Technology 4(4), pp. 378-419, 2004.
