Postgraduate Annual Research Seminar 2007 (3-4 July 2007)
Multi-Agent Crawling System (MACS) Architecture for Effective Web Retrieval
Siti Nurkhadijah Aishah Ibrahim and Ali Selamat
Faculty of Computer Science and Information System, Universiti Teknologi Malaysia, 81310 Johor Bahru, Malaysia
Tel: +607-5532099; Fax: +607-5532210
Email: [email protected], [email protected]
Abstract
Many web search engines, such as Google, Yahoo and AltaVista, are now used for information gathering on the World Wide Web (WWW). A web crawler is a program or automated script that browses the WWW in a methodical, automated manner, mainly to create a copy of all visited pages for later processing by a search engine, which indexes the downloaded pages to provide fast searches. From our study, we found that crawling can slow down the crawled server. As a result, site owners may refuse to allow crawlers to explore their web pages or, even worse, block the crawler’s IP address when it tries to access them. In order to achieve a higher accuracy rate, we propose a multi-agent architecture for web crawling known as the Multi-Agent Crawling System (MACS). Since the Java Agent Development Framework (JADE) is one of the most widely used and promising agent development frameworks, MACS will be modelled in Java based on the JADE architecture. We expect this model to enhance the network interaction between the web agents and servers.
Keywords: Agent, Multi-Agent Crawling System (MACS), Java Agent Development Framework (JADE), Web crawling

1. Introduction
Web crawling is one of the main components of information gathering [1,2,3]. Web search engines use web crawlers to visit web pages automatically by recursively following links until a stopping criterion is met [4]. Figure 1 shows the flow of a basic sequential crawler.

Currently, crawling systems have difficulty retrieving accurate results for a user’s query, because the World Wide Web (WWW) is growing rapidly and the volume of web documents on the internet is too vast to gather in full [2,4,5]. In this paper, we focus on topical crawlers, which are among the most popular tools used for web crawling today [5]. A topical crawler aims to search for and gather web pages from the WWW that are related to a specific topic [6,7].
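The basic sequential loop described above can be sketched in Java as follows. This is only a minimal illustration, not the MACS implementation: the class name `SequentialCrawler`, the `fetchLinks` function (a stand-in for downloading a page and extracting its links) and the `maxPages` stopping criterion are all assumptions introduced for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical sketch of a basic sequential crawling loop. */
final class SequentialCrawler {

    /**
     * Visits pages breadth-first, starting from the seed URLs.
     *
     * @param seeds      initial URLs placed in the frontier
     * @param fetchLinks assumed stand-in for "fetch the page and extract its links"
     * @param maxPages   stopping criterion: maximum number of pages to visit
     * @return the URLs in the order they were visited
     */
    static List<String> crawl(List<String> seeds,
                              Function<String, List<String>> fetchLinks,
                              int maxPages) {
        Deque<String> frontier = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        List<String> visited = new ArrayList<>();

        // Initialise the frontier with the seeds, skipping duplicates.
        for (String seed : seeds) {
            if (seen.add(seed)) {
                frontier.addLast(seed);
            }
        }

        // Crawling loop: take a URL, visit it, enqueue its unseen links.
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.removeFirst();
            visited.add(url);
            for (String link : fetchLinks.apply(url)) {
                if (seen.add(link)) {
                    frontier.addLast(link);
                }
            }
        }
        return visited;
    }
}
```

With a stubbed `fetchLinks` that maps each URL to a fixed list of out-links, the loop visits every reachable URL once, in breadth-first order, and stops early when `maxPages` is reached.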
Figure 1: Flow of a basic sequential crawler [11]

As mentioned above, a crawler runs into a problem when it crawls web pages on the internet: the crawled server detects that the crawling process is slowing it down. In order to minimize this problem, we propose the Multi-Agent Crawling System (MACS) architecture, in which agent technology forms the basic structure of the web crawler. The MACS architecture will be implemented using the Java Agent Development Framework (JADE), since JADE is a middleware [10,12,13] written in Java that simplifies the implementation of multi-agent systems by providing a set of graphical tools that support the debugging and deployment phases.

The rest of the paper is organized as follows. Section 2 discusses related work by previous researchers. Section 3 discusses the importance of multi-agent systems. Section 4 explains JADE as a multi-agent platform. Section 5 illustrates the MACS architecture. Section 6 discusses future work for this research, and Section 7 concludes the paper.

This paper has not been revised and corrected according to reviewers comments Copyright PARS’07
2. Related Works
Several papers related to this research describe multi-agent systems for cooperative information gathering. The authors of [2] present a parallel web spider model based on a multi-agent system, which aims to avoid the page redundancy caused by parallelization and to minimize the efficiency cost. MAGE is the multi-agent platform used to develop the parallel spider prototype.

The Sydney Strategy presented in [4] grew out of a basic idea: whenever a node with a high out-degree is found, some of its out-links are sampled and visited, and the high-degree node is stored in a secondary queue so that the rest of its neighbors can be visited later. In this way, the queue size is reduced while coverage, the quality of the retrieved pages and politeness toward Web servers are preserved.

AutoCrawler [6] is an integrated system for automatic topical crawling. It consists of a topic specification module, a classifier learning module, a URL ordering module and an analysis module, and it uses a mechanism that combines term suggestion, query modification and document ranking. In addition, AutoCrawler has been extended to be grid-enabled in order to overcome the network bandwidth problem.

3. Agent and Multi-agent
There are many definitions of what an agent is. The term agent describes a software abstraction, an idea, or a concept, similar to Object Oriented Programming (OOP) terms such as methods, functions, and objects. The concept of an agent provides a convenient and powerful way to describe a complex software entity that is capable of acting with a certain degree of autonomy in order to accomplish tasks on behalf of its user. But unlike objects, which are defined in terms of methods and attributes, an agent is defined in terms of its behavior. Agents have several characteristics that make researchers interested in exploring agent technology. The characteristics are as follows [12]:
1. Autonomous; taking the initiative as appropriate.
2. Goal-directed; maintaining an agenda of goals which it pursues until accomplished or believed impossible.
3. Taskable; one agent can delegate rights/actions to another.
4. Situated in an environment (computational and/or physical) which it is aware of and reacts to.
5. Cooperative with other agents (software or human) to accomplish its tasks.
6. Communicative with other agents (human or software).
7. Adaptive, modifying beliefs and behavior based on experience.

A multi-agent system (MAS) is a system composed of several agents, collectively capable of reaching goals that are difficult to achieve by an individual agent or monolithic system [8]. The agents in a multi-agent system have the following properties:
1. Heterogeneous; agents are experts in different areas.
2. Self-motivated.
3. Act to fulfill internal goals.
4. Share tasks with others.
5. Communicate and collaborate.
6. No global or centralized control mechanism.

In addition, an advantage of using agent technology in web crawling is that it can improve the performance of the search engine and produce more comprehensive searches of the WWW.
4. Java Agent Development Framework (JADE) as Multi Agent Platform
In our research, we will implement the agent-based web crawling system in Java, since JADE is a recent platform for multi-agent systems and one of the most widely used and promising agent development frameworks. JADE supports the development of multi-agent systems through a predefined, programmable and extensible agent model and a set of management and testing tools. In addition, JADE allows each agent to dynamically discover other agents and to communicate with them according to the peer-to-peer paradigm [12]. A JADE platform is composed of agent containers that can be distributed over the network.

Communication is a main part of the agent architecture [12]: without it, agents cannot interact with other agents. On the JADE platform, agents use a special language, called an agent communication language (ACL), that is based on speech act theory [12]. The agent communication language currently used in JADE is FIPA ACL. The main features of FIPA ACL are the possibility of using different content languages and the management of conversations through predefined interaction protocols.
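To make the idea of an ACL-style conversation concrete, the following plain-Java sketch models a request and its reply. It deliberately does not use JADE classes: `AclSketch`, `Message` and the in-memory mailboxes are hypothetical stand-ins, and only a few FIPA ACL message parameters (performative, sender, receiver, content, conversation id) are represented.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Plain-Java stand-in for an ACL-style exchange; these are NOT JADE classes. */
final class AclSketch {

    /** A small subset of communicative acts, chosen for the example. */
    enum Performative { REQUEST, INFORM, REFUSE }

    /** Immutable message carrying a few ACL-style parameters. */
    static final class Message {
        final Performative performative;
        final String sender;
        final String receiver;
        final String content;
        final String conversationId;

        Message(Performative performative, String sender, String receiver,
                String content, String conversationId) {
            this.performative = performative;
            this.sender = sender;
            this.receiver = receiver;
            this.content = content;
            this.conversationId = conversationId;
        }

        /** Builds a reply: sender and receiver are swapped, the conversation id is kept. */
        Message reply(Performative replyPerformative, String replyContent) {
            return new Message(replyPerformative, receiver, sender,
                               replyContent, conversationId);
        }
    }

    /** In-memory mailboxes keyed by agent name (hypothetical transport). */
    private final Map<String, Deque<Message>> mailboxes = new HashMap<>();

    /** Delivers a message to the receiver's mailbox. */
    void send(Message message) {
        mailboxes.computeIfAbsent(message.receiver, k -> new ArrayDeque<>())
                 .addLast(message);
    }

    /** Returns the next message for the agent, or null if its mailbox is empty. */
    Message receive(String agentName) {
        Deque<Message> box = mailboxes.get(agentName);
        return (box == null) ? null : box.pollFirst();
    }
}
```

In this sketch a crawling agent would send a `REQUEST` to an agent on another host, and the receiving agent would answer with `reply(Performative.INFORM, ...)` or `reply(Performative.REFUSE, ...)`; keeping the conversation id lets the requester match the answer to its request.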
5. Our Approach: Multi Agent Crawling System (MACS) Architecture
Running a web crawler can slow down the web servers it visits. This happens especially if the frequency of accesses to a given server is too high [9]. In this research, a system based on agent technology will be developed in order to minimize this problem. The Multi Agent Crawling System (MACS) is an agent-based crawling system that consists of multiple agents for web crawling, intended to ease the crawling process and minimize the slowdown of the crawled server. The general view of the MACS architecture, shown in Figure 2, consists of several agents located on the local computer and on the web server. The main purpose of this figure is to show where the agents are located in the MACS architecture.

Figure 2: MACS architecture

In the MACS architecture, the crawling agent needs to communicate with other agents on different hosts.
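One simple way to keep the access frequency to a given server low is to enforce a minimum delay between successive requests to the same host. The sketch below illustrates this idea only; the class `PolitenessScheduler`, its `reserve` method and the delay value are assumptions for the example and are not part of the MACS design described here.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-host rate limiter for a polite crawler. */
final class PolitenessScheduler {

    private final long minDelayMillis;
    // Earliest time (in milliseconds) at which each host may be contacted again.
    private final Map<String, Long> nextFree = new HashMap<>();

    PolitenessScheduler(long minDelayMillis) {
        this.minDelayMillis = minDelayMillis;
    }

    /**
     * Reserves the next request slot for a host.
     *
     * @param host      host name of the server to be contacted
     * @param nowMillis current time in milliseconds
     * @return the time at which the request may be sent (never before nowMillis)
     */
    long reserve(String host, long nowMillis) {
        long start = Math.max(nowMillis, nextFree.getOrDefault(host, 0L));
        nextFree.put(host, start + minDelayMillis);
        return start;
    }
}
```

A crawling agent would call `reserve` before each download and wait until the returned time. Requests to different hosts do not delay one another, so the crawler stays busy while each individual server sees at most one request per delay interval.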
6. Discussion and Future Work
In this paper, we have discussed how agent technology can affect the web crawling process. The main advantage of using a multi-agent system to address the problem discussed above is that it lets the crawler work across the network by communicating with other agents on different web servers. We have also studied other techniques for the processing that follows crawling. On the other hand, we have only discussed how we will use agent technology on the JADE platform. More work is needed to make sure that the crawler runs smoothly and solves the problem.
7. Conclusion
In conclusion, crawling is the most important part of web retrieval. It provides information from all areas of the World Wide Web, but problems arise when the crawler burdens the web server. We therefore propose the Multi Agent Crawling System (MACS) as one solution.
8. References
[1] M. Shokouhi, P. Chubak and Z. Raeesy (2005). Enhancing Focused Crawling with Genetic Algorithm. Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC’05).
[2] J. Luo, Z. Shi, M. Wang and W. Wang (2005). Parallel Web Spiders for Cooperative Information Gathering. H. Zhuge and G.C. Fox (Eds.): GCC 2005, LNCS 3795, pp. 1192-1197.
[3] Y. X. Ding, X. L. Wang, L. B. Lin, Q. Zhang and Y. H. Wu (2006). The Design and Implementation of the Crawler-INAR. Proceedings of the Fifth International Conference on Machine Learning and Cybernetics, Dalian, 13-16 August 2006.
[4] C. Castillo, A. Nelli and A. Panconesi (2006). A Memory-Efficient Strategy for Exploring the Web. Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI'06).
[5] R. R. Trujillo (2006). Simulation tool to study focused web crawling strategies. Master’s Thesis, Department of Information Technology, Lund University.
[6] J. J. Tsay, C. Y. Shih and B. L. Wu (2005). AuToCrawler: An Integrated System for Automatic Topical Crawler. Proceedings of the Fourth Annual ACIS International Conference on Computer and Information Science (ICIS’05).
[7] M. Jamali, H. Sayyadi, B. B. Hariri and H. Abolhassani (2006). A Method for Focused Crawling Using Combination. Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI'06).
[8] Software Agent. http://en.wikipedia.org/wiki/Software_agent. Reviewed on 20th May 2007.
[9] How to avoid overloading websites. http://www.devbistro.com/articles/Misc/Implementing-Effective-Web-Crawler. Reviewed on 20th May 2007.
[10] M. Nikraz, G. Caire and P. A. Bahri (2006). A Methodology for the Analysis and Design of Multi-Agent Systems using JADE. International Journal of Computer Systems Science & Engineering, special issue on “Software Engineering for Multi-Agent Systems”.
[11] M. Wooldridge and N. R. Jennings (1995). Intelligent Agents: Theory and Practice. The Knowledge Engineering Review, 10(2): 115-152.
[12] F. Bellifemine, G. Caire, A. Poggi and D. Greenwood (2007). Developing Multi-Agent Systems. Wiley Series in Agent Technology.