Multi-Agent System for Search Engine based Web Server: A Conceptual Framework

Anirban Kundu1,3, Sutirtha Kr. Guha2, Tanmoy Chakraborty1, Subhadip Chakraborty1, Snehashish Pal1, and Debajyoti Mukhopadhyay3,4

1 Netaji Subhash Engineering College, West Bengal University of Technology, Calcutta 700152, India
2 Seacom Engineering College, West Bengal University of Technology, Howrah, West Bengal 711302, India
3 Web Intelligence & Distributed Computing Research Lab (WIDiCoReL), Green Tower C-9/1, Golf Green, Calcutta 700095, India
4 Calcutta Business School, Diamond Harbour Road, Bishnupur, West Bengal 743503, India

{anik76in, sutirthaguha, turn2tanmoy, 01subha, snehashishpal, debajyoti.mukhopadhyay}@gmail.com

doi: 10.4156/jdcta.vol3.issue4.4

Abstract

Existing Web servers that support a standard Search Engine match every possible combination of the search keywords entered by the user. As a result, a huge number of Web-pages are shown in the Web browser, and it is hard for the user to tell which documents are actually needed; going through all of them takes a lot of time. The user therefore needs a more specific search. This paper proposes a system for user-specific data mining over the World Wide Web (WWW). A learning and testing methodology is applied to the system to manage the characteristic behavior of the user. The proposed solution comprises several agents that are capable of working separately and intelligently to achieve their individual goals. The following modules and agents are covered in this paper: User Module, Data Transfer Module, Group Agent, Analyzer Agent, Search Agent and Retrieval Module. Among these, the “Group Agent” is the most important part, since the classification of groups cannot be done using conventional numerical analysis; the main reason is the frequent change in the profile of the user. All the information required to describe a user account is acquired through a registration form submission. The “Analyzer Agent” is used to develop the analyzing strategy. Finally, the “Search Agent” searches the Web to find the desired result for the specific type of user.

Keywords

Multi-agent, Web Server, Search Engine, Group Agent, Analyzer Agent, Search Agent.

1. Introduction
On the Internet, millions of users access Search Engines all the time as per their needs. Typically, every user has distinct characteristics, so user profiles differ [4-6]. Most existing systems search all possible pages to detect the important or relevant pages [1-3]. The detection of important information is based on collaboration among the agents. Most existing systems do not have dynamic grouping. This work reports an efficient scheme for designing a Web Server that acts like a Search Engine for specific users by assigning a “Login ID” to each of them. The major contribution of this paper is the presentation of algorithm-based design approaches for the following components: User Module, Data Transfer Module, Group Agent, Analyzer Agent, Search Agent and Retrieval Module. We present the proposed approach with a case study in Section 2. In Section 3, experimental results are shown. We conclude our work in Section 4.

International Journal of Digital Content Technology and its Applications, Volume 3, Number 4, December 2009

2. Proposed Approach
The proposed system is based on three modules and three different agents. Each type of agent works separately and intelligently to achieve its individual goal. A unit is one of the modules of the overall system; when a unit acts as a decision-making part, that unit is represented by an agent. Figure 1 shows the overall architecture of our system. Each sub-system takes the output of the previous module or agent as input and works on it to generate output for the next module or agent. All the sub-systems are described in the following sub-sections.

Figure 1. Overview of System Framework of our Search based Web Server

In each agent-based module, the ‘Decision index’ is a value that takes part in deciding the result. Fuzzy logic is introduced in this paper for agent-based activity. In fuzzy logic, inputs are taken as crisp values; these values are then plotted and evaluated, and a result is finally produced through well-known calculation processes. Thus, the Decision index is calculated to find the output value of the system or sub-system under consideration.

2.1. User Module
The User Module is a client-side module that provides the user interface to the rest of the server-based system. It works within a client machine through the Web browser. It collects the initial information about the user and stores it in a temporary file, which is the user-specific database. The characteristic behavior of the user can be estimated from the information provided by the user when the profile is created or updated. When this module is created, the user should provide it with some initial information (conviction) and the goals to be achieved, which are globally known as initial requests. Among the requests, the top-most (highest-priority) goal becomes the attitude (current goal) after the module goes through the desired filtering process (filter()). Selected plans are used to generate a query depending on the current goal. Let C, R and A represent the ‘Conviction’, ‘Request’ and ‘Attitude’ of an agent respectively, while C0 and R0 represent the ‘Initial Conviction’ and ‘Initial Request’ provided by the user.

Figure 2. Detailed View of User Module

The attitude is chosen by a filtering function (filter()) from the agent's initial convictions and requests. This filter() is used to choose among competing requests. Based on A, the agent selects plans from the “Sample Plans”. Then, a loop is executed on the “Selected Plans” (SP) to produce the ‘Query’ until SP is empty. In each iteration, the first plan is popped from SP to become the current plan, and that plan is executed. If the current plan fails, the agent needs to select an alternate plan from the set of sample plans to achieve the goal. In this way, all the selected plans are executed for output generation, as shown in Figure 2. To summarize, some initial information is submitted to the system; that information, once processed, is called the ‘Conviction’. The global objectives or goals of the system are called ‘Initial Requests’; among these objectives, the one with the highest priority is called the ‘Attitude’.
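The attitude selection and plan loop of the User Module can be sketched in Python as follows. This is only an illustrative reading of the description, not the authors' implementation: the `Request` and `Plan` classes, the numeric priority convention, the example data and the rule that a failed plan is replaced by the next untried sample plan for the same goal are all assumptions made for the sketch.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Request:
    goal: str
    priority: int  # assumption: a larger number means a higher priority


@dataclass
class Plan:
    goal: str
    # Returns a query fragment, or None when the plan fails (assumed convention).
    run: Callable[[dict], Optional[str]]


def filter_requests(convictions: dict, requests: list) -> Request:
    """filter(): choose the attitude A, i.e. the highest-priority request.

    The convictions are accepted for completeness but not used in this sketch.
    """
    return max(requests, key=lambda r: r.priority)


def generate_query(convictions: dict, requests: list, sample_plans: list):
    attitude = filter_requests(convictions, requests)
    # Sample plans that match the attitude, in their stored order.
    candidates = deque(p for p in sample_plans if p.goal == attitude.goal)
    # Selected Plans (SP) starts with the first matching sample plan.
    selected = deque([candidates.popleft()]) if candidates else deque()
    fragments = []
    while selected:                      # loop until SP is empty
        current = selected.popleft()     # pop the first plan and execute it
        result = current.run(convictions)
        if result is not None:
            fragments.append(result)
        elif candidates:                 # plan failed: take an alternate plan
            selected.append(candidates.popleft())
    return attitude.goal, " ".join(fragments)


# Hypothetical example data.
convictions = {"favourite_sport": "cricket"}
requests = [Request("films", 1), Request("sports", 3)]
sample_plans = [
    Plan("sports", lambda c: None),  # a plan that fails
    Plan("sports", lambda c: f"{c['favourite_sport']} scores"),
    Plan("films", lambda c: "new releases"),
]
goal, query = generate_query(convictions, requests, sample_plans)
print(goal, "->", query)
```

Here the request with priority 3 becomes the attitude, the first matching plan fails, and the alternate plan produces the query.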

Algorithm 1: User Module Process
Input: User interest
Output: Initial query generation
Step 1: Select conviction and request of user
Step 2: Use filter() to choose attitude from user conviction and request
Step 3: Compare attitude with sample plans from database to select specific plan
Step 4: Generate query from selected plans
Step 5: Stop

2.2. Data Transfer Module
The stored file of the ‘User Module’, containing the user information at the client side, is transferred to the server-end via the Data Transfer Module. It uses the network (LAN/WAN/WWW) connection to transmit the data file. The well-known ‘GET’ and/or ‘POST’ methods [11] are used to transmit client data to the server through the Web browser. Figure 3 shows the Data Transfer Module in brief.

Figure 3. Detailed View of Data Transfer Module

Algorithm 2: Data Transfer Module Process
Input: User information/query generation from Algorithm 1
Output: Transmitted data to server-end
Step 1: Select the type of connection (UDP/TCP) between client & server
Step 2: HTTP protocol is selected
Step 3: User data from the Web-page (browser) is compiled using a scripting language
Step 4: Select ‘GET’ or ‘POST’ method for data transmission to the server
Step 5: Transmission of data/query
Step 6: Stop

2.3. Group Agent
The Group Agent is responsible for grouping/classifying the users. After client/user data is received at the server-end, it is saved in the server database using a classification technique [7-8] based on user information. After classification, a login-id is generated for each user. This login-id is sent to the user through the Data Transfer Module and should be used for future reference. Every user account is protected by a password. A user's group may change in future depending on the search interest of the user. The Group Agent groups the users using a predefined database, and the decision of how to classify the users is controlled by this agent. Since this module performs an intelligent decision-making task, it is considered an agent. The output of Algorithm 1 is transferred to the Web server via the Data Transfer Module using Algorithm 2. Further, the output of Algorithm 2 is taken as the input of Algorithm 3.
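The transfer step of Algorithm 2 (Steps 3 to 5), which hands the user data over to the server, can be sketched with Python's standard `urllib` as follows. The sketch only builds the request object and does not send it; the server URL and the form field names are hypothetical, and the paper itself performs this step from the Web browser with a scripting language.

```python
from urllib.parse import urlencode
from urllib.request import Request

SERVER_URL = "http://example.org/register"  # hypothetical endpoint


def build_request(user_data: dict, method: str = "POST") -> Request:
    """Encode the user data and wrap it in a GET or POST request."""
    payload = urlencode(user_data)  # Step 3: compile the form data
    if method == "GET":
        # GET carries the data in the query string of the URL.
        return Request(f"{SERVER_URL}?{payload}", method="GET")
    if method == "POST":
        # POST carries the data in the request body.
        return Request(
            SERVER_URL,
            data=payload.encode("ascii"),
            headers={"Content-Type": "application/x-www-form-urlencoded"},
            method="POST",
        )
    raise ValueError("method must be 'GET' or 'POST'")


# Hypothetical registration data.
user_data = {"name": "A B", "first": "sports"}
get_request = build_request(user_data, "GET")
post_request = build_request(user_data, "POST")
print(get_request.full_url)
print(post_request.get_method(), post_request.data)
```

Passing either request object to `urllib.request.urlopen` would perform Step 5, the actual transmission.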


Figure 4 represents the Group Agent activity. The classified user profiles are stored in the server database for ready reference. The classification is done using non-linear null boundary Single Cycle Multiple Attractor Cellular Automata (SMACA) [12-13].

Algorithm 3: Group Agent Process
Input: Output of Algorithm 2
Output: Modified search query
Step 1: Store data within database
Step 2: Find category using user data
Step 3: If not found, create new category
Step 4: Else, check for group data modification
Step 5: If required, do modification
Step 6: Modified query is generated
Step 7: Stop

Figure 4. Detailed View of Group Agent Activity

First, the input data is stored in the server database. Then, the user data is checked to find the category of the user. Every category is predefined with some default data in the learning phase. After that, in the testing phase, these data are modified depending on the context of the search. Groups are therefore dynamic in nature, since the context data varies from time to time, and users may be shifted from one group to another. The output of the Group Agent depends on the user's dynamic profile.

In our proposed multi-agent system, the Group Agent clusters users into several groups on the basis of their preferences. As a case study, it is assumed that most searches are made in three fields: sports, films and education. The groups are formed on this basis, as shown in Table 1.

Table 1. Group vs. User Preference
GROUP    1ST PREFERENCE  2ND PREFERENCE  3RD PREFERENCE
Group1   Sports          Films           Education
Group2   Sports          Education       Films
Group3   Education       Sports          Films
Group4   Education       Films           Sports
Group5   Films           Sports          Education
Group6   Films           Education       Sports
Group7   Others          Others          Others

In our agent-based system, fuzzy logic is used for the calculation part of each agent. The crisp input ranges for the different inputs are given in Table 2 to Table 5.
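The grouping of Table 1 together with the store, find and regroup steps of Algorithm 3 can be sketched in Python as follows. Note that the paper classifies users with SMACA [12-13]; this sketch swaps in a plain lookup table purely for illustration, and the `GroupAgent` class, its attribute names and the user identifiers are hypothetical.

```python
# Preference order (1st, 2nd, 3rd) -> group, transcribed from Table 1.
GROUP_TABLE = {
    ("sports", "films", "education"): "Group1",
    ("sports", "education", "films"): "Group2",
    ("education", "sports", "films"): "Group3",
    ("education", "films", "sports"): "Group4",
    ("films", "sports", "education"): "Group5",
    ("films", "education", "sports"): "Group6",
}
DEFAULT_GROUP = "Group7"  # the "Others" row of Table 1


class GroupAgent:
    def __init__(self):
        self.database = {}  # user id -> stored preference order
        self.members = {}   # group name -> set of user ids

    def process(self, user_id, preferences):
        key = tuple(p.strip().lower() for p in preferences)
        self.database[user_id] = key                    # store the user data
        group = GROUP_TABLE.get(key, DEFAULT_GROUP)     # find the category
        for ids in self.members.values():               # drop stale membership
            ids.discard(user_id)
        # Create the group on first use, then register the user in it.
        self.members.setdefault(group, set()).add(user_id)
        return group


agent = GroupAgent()
first = agent.process("u1", ["Sports", "Films", "Education"])
moved = agent.process("u1", ["Films", "Education", "Sports"])  # profile changed
other = agent.process("u2", ["Music", "Art", "Travel"])
print(first, moved, other)
```

Calling `process` again for the same user with a changed preference order moves that user to another group, which mirrors the dynamic grouping described in the text.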

Table 2. Input ranges based on different fields
INPUT RANGE  FIELD NAME
0-40         Education
30-60        Films
50-100       Sports

Table 3. Input ranges of Education
INPUT RANGE  FIELD NAME
0-30         Research Work
25-40        Conventional Study

Table 4. Input ranges of Films
INPUT RANGE  FIELD NAME
30-45        Hindi Film
40-55        English Film
50-60        Bengali Film

Table 5. Input ranges of Sports
INPUT RANGE  FIELD NAME
50-70        Cricket
60-90        Football
80-100       Tennis
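To show how overlapping input ranges such as those in Tables 2, 4 and 5 give a crisp value a degree of membership in more than one field, the following Python sketch fuzzifies a value against those ranges. The shape of the membership functions is an assumption: the paper's figures are not reproduced here, so a symmetric triangle peaking at the middle of each range is used, and the resulting degrees are therefore illustrative only and will not match the paper's values.

```python
# Ranges transcribed from Tables 2, 4 and 5.
FIELD_RANGES = {"Education": (0, 40), "Films": (30, 60), "Sports": (50, 100)}
FILM_RANGES = {"Hindi Film": (30, 45), "English Film": (40, 55), "Bengali Film": (50, 60)}
SPORT_RANGES = {"Cricket": (50, 70), "Football": (60, 90), "Tennis": (80, 100)}


def triangular(x, low, high):
    """Assumed shape: 0 at both ends of the range, 1 at its midpoint."""
    if x <= low or x >= high:
        return 0.0
    centre = (low + high) / 2
    return 1.0 - abs(x - centre) / (centre - low)


def fuzzify(x, ranges):
    """Return the non-zero membership degrees of x, rounded to 3 decimals."""
    degrees = {name: triangular(x, lo, hi) for name, (lo, hi) in ranges.items()}
    return {name: round(mu, 3) for name, mu in degrees.items() if mu > 0}


fields = fuzzify(53, FIELD_RANGES)
films = fuzzify(53, FILM_RANGES)
sports = fuzzify(53, SPORT_RANGES)
print(fields)
print(films)
print(sports)
```

For the crisp value 53, only the fields whose ranges contain 53 receive a non-zero degree, whatever shape is assumed inside each range.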

The membership functions of the input types are depicted as follows (refer Figure 5 to Figure 8):


Figure 5. Membership Function of Inputs

Figure 6. Membership Function of Education

Figure 7. Membership Function of Films


Figure 8. Membership Function of Sports

The corresponding rulebases are as follows (refer Table 6 to Table 8):

Table 6. Rulebase 1: Rulebase between Films and Sports
               CRICKET        FOOTBALL       TENNIS
HINDI FILMS    Cricket        Hindi Films    Hindi Films
ENGLISH FILMS  English Films  English Films  Tennis
BENGALI FILMS  Cricket        Football       Tennis

Table 7. Rulebase 2: Rulebase between Sports and Education
          RESEARCH WORK  CONVENTIONAL STUDY
CRICKET   Cricket        Cricket
FOOTBALL  Football       Football
TENNIS    Research Work  Conventional Study

Table 8. Rulebase 3: Rulebase between Films and Education
               RESEARCH WORK  CONVENTIONAL STUDY
HINDI FILMS    Hindi Films    Hindi Films
ENGLISH FILMS  English Films  English Films
BENGALI FILMS  Research Work  Bengali Films

2.4. Analyzer Agent
Figure 9 shows the Analyzer Agent in more detail. It detects the system behavior based on the particular user's interest. Here, behavior means the type of jobs related to the user's search. Each job has some symptoms and, depending on those symptoms, each job can be thought of as a set of tasks. To complete a job, all its tasks should be executed. A sensor module is attached to sense the tasks for monitoring. The output of the Analyzer Agent is the initial choice (updated query) of crawler for searching the WWW intelligently.

Figure 9. Flow diagram of Analyzer Agent

The output of Algorithm 3 is analyzed by a soft analyzer module. This module first checks the type of jobs based on the symptoms present in the input. After that, jobs are partitioned into one or more tasks. Finally, choices are selected by monitoring the whole process.

Algorithm 4: Analyzer Agent Process
Input: Output of Algorithm 3
Output: Choice of crawler type
Step 1: Select type of jobs
Step 2: Partition job into a set of tasks
Step 3: Task(s) identification
Step 4: Choice completed
Step 5: Stop

The fuzzification part of the agent is calculated from the crisp input within the Analyzer Agent. Suppose that, at some time instance, the given crisp input is 53. The corresponding membership values can be found by plotting the crisp input on the design graph (refer Figure 10).

From the membership function (refer Figure 10), it is obtained that the given input value resides in between ‘Films’ and ‘Sports’. Hence, the corresponding membership functions are calculated accordingly (refer Figure 11 & Figure 12).

Figure 10. Membership Function of Inputs with Given value

Figure 11. Membership Function of Films with Given value

Figure 12. Membership Function of Sports with Given value

The corresponding field values are substituted into ‘Rulebase 1’ of Table 6, and Table 9 is generated as a result. The ‘MIN’ operation is applied as per fuzzy logic.

Table 9. Substitution values on Rulebase 1 of Table 6
                      CRICKET (0.6)        FOOTBALL       TENNIS
HINDI FILMS           Cricket              Hindi Films    Hindi Films
ENGLISH FILMS (0.85)  English Films (0.6)  English Films  Tennis
BENGALI FILMS (0.45)  Cricket (0.45)       Football       Tennis

According to the ‘MAX’ method, the result is ‘English Films’, as it has the maximum value. The centroid method is also introduced for better accuracy. The ‘fuzzified decision’ and ‘scaled fuzzified decision’ graphs are shown in Figure 13 & Figure 14 respectively.
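The ‘MIN’ substitution on Rulebase 1 followed by the ‘MAX’ selection can be sketched in Python as follows. The rulebase is transcribed from Table 6. The membership values are those given for Table 9, under the assumption that 0.85 and 0.45 belong to English Films and Bengali Films and 0.6 to Cricket; this reading is consistent with the input ranges of Table 4 and Table 5 for the crisp input 53 and with the stated ‘MAX’ result, but it is not spelled out in the text. The function name `infer` is hypothetical.

```python
# (film, sport) -> outcome, transcribed from Rulebase 1 (Table 6).
RULEBASE_1 = {
    ("Hindi Films", "Cricket"): "Cricket",
    ("Hindi Films", "Football"): "Hindi Films",
    ("Hindi Films", "Tennis"): "Hindi Films",
    ("English Films", "Cricket"): "English Films",
    ("English Films", "Football"): "English Films",
    ("English Films", "Tennis"): "Tennis",
    ("Bengali Films", "Cricket"): "Cricket",
    ("Bengali Films", "Football"): "Football",
    ("Bengali Films", "Tennis"): "Tennis",
}


def infer(film_memberships, sport_memberships, rulebase):
    """Fire every rule with MIN, then pick the outcome with the MAX strength."""
    fired = {}
    for film, mu_film in film_memberships.items():
        for sport, mu_sport in sport_memberships.items():
            strength = min(mu_film, mu_sport)        # 'MIN' operation
            outcome = rulebase[(film, sport)]
            fired[outcome] = max(fired.get(outcome, 0.0), strength)
    winner = max(fired, key=fired.get)               # 'MAX' method
    return winner, fired


# Assumed attribution of the membership values (see the note above).
film_memberships = {"English Films": 0.85, "Bengali Films": 0.45}
sport_memberships = {"Cricket": 0.6}
winner, fired = infer(film_memberships, sport_memberships, RULEBASE_1)
print(winner, fired)
```

Under these assumptions the rule for English Films and Cricket fires with strength min(0.85, 0.6) = 0.6, the rule for Bengali Films and Cricket fires with min(0.45, 0.6) = 0.45, and the larger of the two selects ‘English Films’.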

Figure 13. Fuzzified Decision


Figure 14. Scaled Fuzzified Decision

As per the centroid method, the calculated Final Decision Index (FDI) is 0.6378. The classical defuzzification technique is not preferred because it is less accurate. In some special cases, however, classical techniques are used instead of the centroid method, where the centroid method may give inconsistent results.
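For readers unfamiliar with centroid defuzzification, the following Python sketch computes the centroid of an aggregated fuzzy set on a sampled axis, using the discrete form (sum of x times membership) divided by (sum of membership). The output sets used here are hypothetical clipped triangles chosen only to make the example runnable; the actual output sets are those of the fuzzified decision graphs, so this sketch is not expected to reproduce the value 0.6378.

```python
def centroid(xs, mus):
    """Discrete centroid: sum(x * mu) / sum(mu) over the sampled axis."""
    area = sum(mus)
    if area == 0:
        raise ValueError("the fuzzy set is empty")
    return sum(x * mu for x, mu in zip(xs, mus)) / area


def clipped_triangle(x, low, high, height):
    """Assumed output set: a triangle on [low, high] clipped at `height`."""
    if x <= low or x >= high:
        return 0.0
    centre = (low + high) / 2
    return min(height, 1.0 - abs(x - centre) / (centre - low))


xs = [i / 100 for i in range(101)]  # decision axis sampled on [0, 1]

# Two hypothetical output sets, clipped at rule strengths 0.45 and 0.6,
# aggregated point by point with MAX.
aggregated = [
    max(clipped_triangle(x, 0.0, 0.6, 0.45), clipped_triangle(x, 0.4, 1.0, 0.6))
    for x in xs
]
fdi = centroid(xs, aggregated)

# Sanity check: a single symmetric set has its centroid at its centre.
symmetric = centroid(xs, [clipped_triangle(x, 0.0, 1.0, 0.6) for x in xs])
print(round(fdi, 4), round(symmetric, 4))
```

Because the set on the right is clipped at the higher strength, the centroid of the aggregated shape is pulled above the middle of the axis.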

Figure 15. Fuzzy Decision Index

The final decision obtained is a search result that consists of 80% data related to ‘Hindi Films’ and the remaining 20% data related to ‘Cricket’ (refer Figure 15).

2.5. Search Agent
Figure 16 shows the Search Agent with two options, ‘static’ and ‘dynamic’. Here, static means that the search criteria are selected before the search starts, whereas dynamic means that the search criteria can be selected at runtime. In dynamic searching, the number of crawlers may change depending on the circumstances. It depends on the searching criteria, which are derived from the ‘choice’ module of the Analyzer Agent (refer Algorithm 4). Two types of crawlers are used, for predicted and unpredicted numbers of Web-pages to be downloaded respectively. Single and parallel crawling methodologies are used for static searching, whereas hierarchical crawling is used for dynamic searching. For an illustrative view of single, parallel and hierarchical crawling, see [9, 14]. Finally, these crawlers download Web-pages from the WWW.

Figure 16. Illustrative view of Search Agent operation

Algorithm 5: Search Agent Process
Input: Output of Algorithm 3 & Algorithm 4
Output: Downloaded Web-pages
Step 1: Select a crawler type based on choice
Step 2: Feed specific URL(s) to the crawler(s) based on modified search query
Step 3: Searching and further downloading of concerned Web-pages are done as described in [9, 14]
Step 4: Save the Web-pages to the server
Step 5: Stop

It is assumed in this paper that most searches are made for events, information or tutorials. Different types of crawlers are required depending on the type of field, as shown in Table 10 & Figure 17. Table 11 shows the input ranges, which are actually the output of the Analyzer Agent. In this case, ‘Cricket’ (Sports) & ‘Hindi’ (Films) are considered. Figure 18 shows the corresponding membership function.

Table 10. Input ranges of different type of searching
INPUT RANGE  FIELD NAME
0-30         Event
20-70        Information
60-100       Tutorial

Table 11. Input ranges based on the captured data from the Analyzer Agent
INPUT RANGE  FIELD NAME
0-30         Cricket (Sports)
20-100       Hindi (Films)


Figure 17. Membership Function for Search Agent

Figure 18. Membership Function with Information for Search Agent

The rulebase constructed for this agent is shown in Table 12.

Table 12. Rulebase 4: Rulebase between the output of Analyzer Agent & crawler selection based on searching criteria
            EVENT           INFORMATION       TUTORIAL
CRICKET     Single Crawler  Parallel Crawler  Hierarchical Crawler
HINDI FILM  Single Crawler  Parallel Crawler  Hierarchical Crawler

The Final Decision Index (FDI) obtained from the Analyzer Agent is taken as input for further fuzzification at this stage. The given crisp input is therefore 63.78: it is the fuzzy decision index calculated at the previous agent (0.6378), multiplied by 100 to scale it up as required by the Search Agent. The corresponding membership values can be found by plotting the crisp input on the design graphs (refer Figure 19 & Figure 20).


Figure 19. Membership Function for Search Agent with taken Value

Figure 20. Membership Function with Information for Search Agent with taken Value

Substituting the corresponding field values into ‘Rulebase 4’ of Table 12 and then performing the ‘MIN’ operation gives Table 13.

Table 13. Substitution values on Rulebase 4 of Table 12
                EVENT           INFORMATION (0.7)       TUTORIAL (0.2)
CRICKET         Single Crawler  Parallel Crawler        Hierarchical Crawler
HINDI FILM (1)  Single Crawler  Parallel Crawler (0.7)  Hierarchical Crawler (0.2)

The fuzzified decision and scaled fuzzified decision graphs for this agent are shown in Figure 21 & Figure 22 respectively.
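The crawler selection of the Search Agent can be sketched in Python as follows: the FDI of the Analyzer Agent is scaled by 100, the ranges of Table 10 and Table 11 determine which fields are active, and Rulebase 4 is evaluated with ‘MIN’ and ‘MAX’. The membership values 1, 0.7 and 0.2 are those given for Table 13; attributing them to Hindi Film, Information and Tutorial respectively is an assumption that is consistent with those ranges for the input 63.78. The function names are hypothetical.

```python
# Ranges transcribed from Table 10 and Table 11.
SEARCH_RANGES = {"Event": (0, 30), "Information": (20, 70), "Tutorial": (60, 100)}
TOPIC_RANGES = {"Cricket": (0, 30), "Hindi Film": (20, 100)}

# Rulebase 4 (Table 12): both topic rows map each search type to the same crawler.
CRAWLER_FOR = {
    "Event": "Single Crawler",
    "Information": "Parallel Crawler",
    "Tutorial": "Hierarchical Crawler",
}


def active(x, ranges):
    """Names of the fields whose input range contains x."""
    return [name for name, (lo, hi) in ranges.items() if lo <= x <= hi]


def select_crawler(topic_memberships, search_memberships):
    """Fire Rulebase 4 with MIN and choose the crawler with the MAX strength."""
    strengths = {}
    for mu_topic in topic_memberships.values():
        for kind, mu_kind in search_memberships.items():
            crawler = CRAWLER_FOR[kind]
            strengths[crawler] = max(strengths.get(crawler, 0.0), min(mu_topic, mu_kind))
    return max(strengths, key=strengths.get), strengths


crisp = round(0.6378 * 100, 2)  # FDI of the Analyzer Agent scaled up by 100
active_kinds = active(crisp, SEARCH_RANGES)
active_topics = active(crisp, TOPIC_RANGES)

# Assumed attribution of the Table 13 membership values (see the note above).
crawler, strengths = select_crawler(
    {"Hindi Film": 1.0}, {"Information": 0.7, "Tutorial": 0.2}
)
print(crisp, active_kinds, active_topics, crawler)
```

With these values the Parallel Crawler rule fires with the greater strength, which agrees with the final decision stated for this agent.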


Figure 21. Fuzzified Decision for Search Agent

Figure 22. Scaled Fuzzified Decision for Search Agent

Again, the centroid method is used to calculate the Final Decision Index (FDI), which is 0.467.

Figure 23. Fuzzy Decision Index for Search Agent

Thus, the final result is 10% information for Hierarchical Crawler and 90% for Parallel Crawler. The final decision should therefore be Parallel Crawler.

2.6. Retrieval Module
The working of the Retrieval Module is formalized in Algorithm 6. After the related documents (as per the user's interest) have been downloaded from the Internet at the server-end, they should be delivered to the client-end. First of all, every downloaded Web-page has to be verified against the concerned user profile. After successful verification, the Web-pages are ranked using link analysis [10]. Then, these ranked Web-pages are handed over to the Data Transfer Module, which in turn sends the acknowledged documents to the user at the client-end.

Algorithm 6: Retrieval Process
Input: Downloaded Web-pages at server-end by Algorithm 5
Output: Web-pages ready to be transmitted to client-end
Step 1: Verify each Web-page against the user profile
Step 2: If the verified Web-page is suitable for concerned user search, then goto Step 3; else goto Step 5
Step 3: Calculate rank of each Web-page as described in [10]
Step 4: Web-pages are ready for transmission
Step 5: Stop

3. Experimental Results
In this section, the experimental result of data search using different types of crawlers is shown in Figure 24. The experiment shows that a better result is achieved using hierarchical crawling when searching through the Search Agent in the case of unpredicted searching [9, 14].

Figure 24. Timing Diagram of Single, Parallel & Hierarchical Crawling Procedure

Figure 25 & Figure 26 show a sample search and its corresponding results. In this particular case, using an existing Search Engine, 11,200 results are returned to the ‘html’ browser. Finding the specific documents among such a huge number of search results is a very difficult task, whereas with our methodology it is possible to find only the related documents, in a confined range. So, the number of search results is smaller in our case.


Figure 25. Sample Search Result using existing Search Engine

Figure 26. Sample Search Result using our approach

4. Conclusion
In this paper, we have proposed a multi-agent based system for performing searches on the Internet through a Search Engine, which should help the user to narrow down the desired search results. The system consists of six different types of agents or modules, which are inter-related by specific logic. With this approach, a user has to sign in to the Web server before searching, in order to obtain the desired results according to the user-specific interests recorded in the profile.

5. References
[1] Sergey Brin, Lawrence Page, “The Anatomy of a Large-Scale Hypertextual Web Search Engine,” Proceedings of the Seventh International World Wide Web Conference, Brisbane, Australia, April 1998
[2] Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, Sriram Raghavan, “Searching the Web,” ACM Transactions on Internet Technology, Volume 1, Issue 1, August 2001
[3] Gary William Flake, Steve Lawrence, C. Lee Giles, Frans M. Coetzee, “Self Organization and Identification of Web Communities,” IEEE Computer, 35(3), 66-71, 2000
[4] Eric J. Glover, Kostas Tsioutsiouliklis, Steve Lawrence, David M. Pennock, Gary W. Flake, “Using Web Structure for Classifying and Describing Web Pages,” WWW2002, Honolulu, Hawaii, USA, 7-11 May 2002
[5] Soumen Chakrabarti, Byron E. Dom, Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins, David Gibson, Jon Kleinberg, “Mining the Web's Link Structure,” IEEE Computer, (32)8: August 1999, pp. 60-67
[6] B. D. Davison, “Topical Locality in the Web,” Proceedings of the 23rd Annual International Conference on Research and Development in Information Retrieval (SIGIR 2000), ACM, Athens, Greece, July 2000, pp. 272-279
[7] Debajyoti Mukhopadhyay, Sanasam Ranbir Singh, “An Algorithm for Automatic Web-Page Clustering using Link Structures,” Proceedings of the IEEE INDICON 2004 Conference, India, 20-22 December 2004, pp. 472-477
[8] J. Furnkranz, “Exploiting Structure Information for text Classification on the WWW,” Intelligent Data Analysis, 1999, pp. 487-498
[9] Anirban Kundu, Ruma Dutta, Debajyoti Mukhopadhyay, Young-Chon Kim, “A Hierarchical Web Page Crawler for Crawling the Internet Faster,” International Conference on Electronics & Information Technology Convergence, EITC 2006 Proceedings, Yang Dong Publication, ISSN 1975-809X, Republic of Korea, December 8, 2006, pp. 61-67
[10] Anirban Kundu, Ruma Dutta, Debajyoti Mukhopadhyay, “An Alternate Way to Rank Hyper-linked Web Pages,” 9th International Conference on Information Technology, ICIT 2006 Proceedings, Bhubaneswar, India, IEEE Computer Society Press, New York, USA, ISBN 0-7695-2635-7, December 18-21, 2006, pp. 297-298
[11] http://www.w3.org/TR/html4/interact/forms.html#h17.13.1
[12] Anirban Kundu, Ruma Dutta, Debajyoti Mukhopadhyay, “Generation of SMACA and its Application in Web Services,” 9th International Conference on Parallel Computing Technologies, PaCT 2007 Proceedings, Pereslavl-Zalessky, Russia, September 3-7, 2007, pp. 140-152
[13] Anirban Kundu, Ruma Dutta, Debajyoti Mukhopadhyay, “Design of SMACA: Synthesis & its Analysis through Rule Vector Graph for Web based Application,” International Journal of Intelligent Information and Database Systems; Inderscience Publication, Europe; Vol. 2, No. 4, 2008
[14] Anirban Kundu, Ruma Dutta, Rana Dattagupta, Debajyoti Mukhopadhyay, “Mining the Web with Hierarchical Crawlers – A Resource Sharing based Crawling Approach,” International Journal of Intelligent Information and Database Systems; Inderscience Publication, Europe; Vol. 3, No. 1, 2009
