An Intelligent Personal Spider (Agent) for Dynamic Internet/Intranet Searching

Hsinchun Chen, Yi-Ming Chung (1), Marshall Ramsey, and Christopher C. Yang (2)

Hsinchun Chen: Associate Professor, MIS Department, Karl Eller Graduate School of Management, University of Arizona, McClelland Hall 430Z, Tucson, Arizona 85721; Visiting Senior Research Scientist, NCSA. [email protected], (520) 621-4153.

Yi-Ming Chung: MIS Department, Karl Eller Graduate School of Management, University of Arizona, Tucson, Arizona 85721. [email protected].

Marshall Ramsey: MIS Department, Karl Eller Graduate School of Management, University of Arizona, Tucson, Arizona 85721. [email protected].

Christopher C. Yang: ECE Department, University of Arizona, Tucson, Arizona 85721. [email protected].

(1) Currently a research programmer at the CANIS-Community Systems Laboratory, University of Illinois at Urbana-Champaign.
(2) Currently an assistant professor at the University of Hong Kong.
(3) This project was supported mainly by the following grants:
- NSF/ARPA/NASA Digital Library Initiative, IRI-9411318, 1994-1998 (B. Schatz, H. Chen, et al., "Building the Interspace: Digital Library Infrastructure for a University Engineering Community"),
- NSF CISE, IRI-9525790, 1995-1998 (H. Chen, "Concept-based Categorization and Search on Internet: A Machine Learning, Parallel Computing Approach"),
- AT&T Foundation Special Purpose Grants in Science and Engineering, 1994-1995 (H. Chen), and
- National Center for Supercomputing Applications (NCSA), High-performance Computing Resources Grants, 1994-1996 (H. Chen).

We would like to thank University of Arizona Artificial Intelligence Group members for their participation in our experiment. We also thank Prof. Jerome Yen of Hong Kong University and Prof. Pai-chun Ma of Hong Kong University of Science and Technology for their comments and involvement during system development and testing.


Abstract

As Internet services based on the World-Wide Web become more popular, information overload has become a pressing research problem. Difficulties with search on the Internet will worsen as the amount of online information increases. A scalable approach to Internet search is critical to the success of Internet services and other current and future National Information Infrastructure (NII) applications. As part of the ongoing Illinois Digital Library Initiative project, this research proposes an intelligent personal spider (agent) approach to Internet searching. The approach, which is grounded on automatic textual analysis and general-purpose search algorithms, is expected to be an improvement over the current static and inefficient Internet searches. In this experiment, we implemented Internet personal spiders based on best first search and genetic algorithm techniques. These personal spiders can dynamically take a user's selected starting homepages and search for the most closely related homepages on the Web, based on the links and keyword indexing. A plain, static CGI/HTML-based interface was developed earlier, followed by a recent enhancement of a graphical, dynamic Java-based interface. Preliminary evaluation results and two working prototypes (available for Web access) are presented. Although the examples and evaluations presented are mainly based on Internet applications, the applicability of the proposed techniques to the potentially more rewarding Intranet applications should be obvious. In particular, we believe the proposed agent design can be used to locate organization-wide information, to gather new, time-critical organizational information, and to support team-building and communication in Intranets.


Keywords: Agents, machine learning, spider, evolutionary programming, information retrieval, semantic retrieval, Java, World-Wide Web, Internet, intranet


Biography

Hsinchun Chen is an Associate Professor of Management Information Systems at the University of Arizona and head of the UA/MIS Artificial Intelligence Group. He is also a Visiting Senior Research Scientist at the National Center for Supercomputing Applications (NCSA). He received an NSF Research Initiation Award in 1992, the Hawaii International Conference on System Sciences (HICSS) Best Paper Award, and an AT&T Foundation Award in Science and Engineering in 1994 and 1995. He received the Ph.D. degree in Information Systems from New York University in 1989. Chen has published more than 30 articles covering semantic retrieval, search algorithms, knowledge discovery, and collaborative computing. He is a PI of the Illinois Digital Library Initiative project, funded by NSF/ARPA/NASA, 1994-1998, and has received several grants from NSF, DARPA, NASA, NIH, and NCSA. He is the guest editor of the IEEE Computer special issue on "Building Large-Scale Digital Libraries" and the Journal of the American Society for Information Science special issue on "Artificial Intelligence Techniques for Emerging Information Systems Applications." His recent work has been featured in Science ("Computation Cracks `Semantic Barriers' Between Databases," June 7, 1996), NCSA Access Magazine, HPCWire, and Business Week.

Yi-Ming Chung received her M.S. degree in Management Information Systems from the University of Arizona in 1996. Her thesis explored the use of genetic algorithms and neural networks for Internet search and vocabulary switching. She continues her research as a research programmer at the CANIS-Community Systems Laboratory, University of Illinois at Urbana-Champaign. Currently, she is working on automatic subject indexing, concept mapping, and vocabulary switching across community repositories on the Internet. Her research interests include digital libraries, neural networks, intelligent agents, object-oriented design, and design patterns.


Marshall Ramsey is a Ph.D. student at the University of Arizona's Department of Management Information Systems and a member of the UA/MIS Artificial Intelligence Group. He received his B.S. degree (MIS) in 1993 and M.S. degree (MIS) in 1997 from the University of Arizona. He was awarded a Research Fellowship from the National Library of Medicine (1996 to 1997) for work in semantic retrieval for large collections of medical documents. His research interests are cross-media and translingual semantic retrieval, data visualization, and digital libraries.

Christopher C. Yang is an assistant professor in the Department of Computer Science and Information Systems at the University of Hong Kong. He was born in Hong Kong. He received his B.S., M.S., and Ph.D. in Electrical Engineering from the University of Arizona, Tucson, AZ, in 1990, 1992, and 1997, respectively. From 1995 to 1997, he was a research scientist in the UA/MIS Artificial Intelligence Group in the Department of Management Information Systems at the University of Arizona. From 1992 to 1997, he was a research associate in the Intelligent Systems Laboratory in the Department of Electrical and Computer Engineering. His current research interests are digital libraries, Internet agents, visualization, color image processing, constraint networks, and computer integrated manufacturing and inspection. He was a member of the program committee for the 1997 IEEE International Conference on Systems, Man, and Cybernetics, a member of the organizing committee for the 1998 3rd Asian Conference on Computer Vision, and a member of the program committee for the 1998 1st Asian Digital Library Workshop.


1 Introduction

Although network protocols and software such as HTTP and Netscape/Mosaic significantly ease the importation and fetching of online information sources, their use is accompanied by the disadvantage of users not being able to explore and find what they want in an enormous information space [1, 2, 20]. While Internet services are popular and appealing to many online users, difficulties with search are expected to worsen as the amount of online information increases. This is mainly due to the problems of information overload and vocabulary differences [11, 4]. Many researchers consider that devising a scalable approach to Internet search is critical to the success of Internet and Intranet services and other current and future National Information Infrastructure (NII) applications [21, 6].

The main information retrieval mechanisms provided by the prevailing Internet WWW-based software are based on either keyword search (e.g., Lycos, Alta Vista, and Yahoo servers) or hypertext browsing (e.g., NCSA Mosaic, Netscape Navigator, and Microsoft Internet Explorer). Keyword search often results in low precision, poor recall, and slow response time due to the limitations of indexing and communication methods (bandwidth), controlled-language-based interfaces (the vocabulary problem), and the inability of searchers themselves to fully articulate their needs. Furthermore, browsing allows users to explore only a very small portion of the large Internet information space. An extensive information space accessed through hypertext-like browsing can also potentially confuse and disorient its user, the "embedded digression problem," and it can cause the user to spend a great deal of time while learning nothing specific, the "art museum phenomenon" [3].

Our proposed approach, which is grounded on automatic textual analysis of Internet documents and general-purpose search algorithms, aims to address the Internet search problem by creating dynamic and "intelligent" personal spiders (agents) that take users' requests and perform real-time, customized searches. In particular, best first search was adopted for a
local search personal spider and a genetic algorithm was used to develop a global, stochastic personal spider. These personal spiders (agents) could dynamically take users' selected starting homepages and search for the most closely related homepages on the Web, based on the links and keyword indexing. Extensive algorithmic revisions and interface development based on CGI/HTML and Java have been performed. This paper summarizes our current research effort. We believe the applicability of the proposed techniques to the potentially more rewarding Intranet applications is promising, and we specifically include adoption of the proposed agent design in the following Intranet-related areas:

- Locating organization-wide information: By restricting the agent to search on servers within the organizational Intranet boundary (i.e., selected URLs or domain names), the proposed agent can be used to help users find only related Intranet sites instead of wandering through the Internet. For large corporations and government agencies, the need to explore and search effectively within their own loosely-connected Intranet sites is real and pressing. By restricting the search space of the proposed agent, the tool could be used to locate organization-wide Intranet information.

- Gathering new, time-critical organizational information: The value of the agent-based spider relies strongly on its ability to perform exhaustive and real-time Internet or Intranet searches. Obsolete and dead Web sites can be avoided: the proposed agent imposes a time-out component to avoid connecting to dead sites (a minimal fetch-with-timeout sketch appears after this list). Such a design is believed to enable the agent to gather new, time-critical organizational information.

- Team-building and communication: In our discussions with several Webmasters, it has been suggested that the tool could be used in Intranets for team-building and communication. By launching their own spiders within Intranets, different teams and groups, previously unknown to each other in large corporations, would be able to
find Cyberspace collaborators and colleagues; the spider could thereby serve effectively as an organizational communication tool.
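As a concrete illustration of the time-out behavior mentioned in the second item above, the following is a minimal Python sketch. The paper's spiders actually used the Lynx browser for fetching, so Python's urllib here is an illustrative substitute; the function name and default timeout are our own assumptions.

```python
import urllib.request

def fetch_page_text(url, timeout_seconds=10):
    """Fetch a homepage's raw text, skipping dead or slow sites.

    Returns None on timeout or network error so the calling spider
    can simply move on to the next candidate homepage.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            return response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):  # covers URLError and socket timeouts
        return None
```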

2 Literature Review: Internet Spiders

At its inception as the ARPANET, the Internet was conceived primarily as a means of remote login and experimentation with telecommunication [2]. However, the predominant usage quickly became e-mail communication. This trend continues into the present form of the Internet, but with increasingly diverse support for collaborative data sharing and distributed, multimedia information access, especially using the World-Wide Web (WWW). Many people consider the Internet and the WWW the backbone of the information superhighway and the window to cyberspace. The WWW was developed initially to support physicists and engineers at CERN, the European Particle Physics Laboratory in Geneva, Switzerland [1]. In 1993, when several browser programs (most noticeably the NCSA Mosaic) became available for distributed, multimedia, hypertext-like information fetching, the Internet became the preview of a rich and colorful information cyberspace [22].

However, as Internet services based on the WWW have become more popular, information overload has become a pressing research problem [2]. The user interaction paradigm on the Internet has shifted from simple hypertext-like browsing (a human-guided activity exploring the organization and contents of an information space) to content-based searching (a process in which the user describes a query and a system locates information that matches the description). Many researchers and practitioners have considered Internet/Intranet searching to be one of the more pressing and rewarding areas of research for future NII applications. Internet searching has been the hottest topic at recent World-Wide Web Conferences.

Two major approaches have been developed and experimented with: one is the client-based search spider (agent) and the other is online database indexing and searching. However,
some systems contain components of both approaches.

2.1 Client-Based Search Spiders (Agents)

Broadly defined, an "agent" is a program that can operate autonomously and accomplish unique tasks without direct human supervision (similar to human counterparts such as real estate agents, travel agents, etc.). The basic idea of agent research is to develop software systems which engage and help all types of end users [19]. Such agents might act as "spiders" on the Internet and look for relevant information [10], analyze meeting output on behalf of executives [5], or filter newsgroup articles based on "induced" (or learned) user profiles [14]. Many researchers have focused on developing scripting and interfacing languages for designers and users such that they can create mobile agents of their own [24]. Some researchers attempt to address the question: "How should agents interact with each other to form digital teamwork?" Other researchers are more concerned with designing agents which are "intelligent" [19] [5].

Several software programs based on the concept of spiders, agents, or softbots (software robots) have been developed. TueMosaic and the WebCrawler are two prominent early examples. Both of them use variations of conventional best first (local) search strategies [17]. DeBra and Post [9] reported tueMosaic v2.42, modified at the Eindhoven University of Technology (TUE) using the "fish search" algorithm, at the First WWW Conference in Geneva. Using tueMosaic, users can enter keywords, specify the depth and width of search for links contained in the current homepages displayed, and request the spider agent to fetch homepages connected to the current homepage. The fish search algorithm is a modified best first search method. However, potentially relevant homepages that do not connect with the currently active homepages cannot be retrieved and, when the depth and breadth of search become large (an exponential search), the search space becomes enormous. The inefficiency and local search characteristics of BFS/DFS-based spiders and the communication
bandwidth bottleneck on the Internet severely constrained the usefulness of such a local search approach. At the Second WWW Conference, Pinkerton reported a more efficient spider (crawler). The WebCrawler extends tueMosaic's concept by initiating the search using its index and following links in an intelligent order. The WebCrawler evaluates the relevance of a link based on the similarity of its anchor text to the user's query. However, problems with local search and the communication bottleneck persist.

Due to the proliferation of WWW sites, many newer spiders with different functionalities have recently been developed. The TkWWW robot was developed by Spetka and funded by the Air Force Rome Laboratory [23]. TkWWW robots are dispatched from the TkWWW browser and are designed to search Web neighborhoods to find logically related homepages and return a list of "hot" links. However, their search process is limited to one or two local links from the original homepages. TkWWW robots can also be run in the background to build HTML indexes, compile WWW statistics, collect a portfolio of pictures, or perform any other functions that can be described by TkWWW Tcl extensions. WebAnts, developed by Leavitt at Carnegie Mellon University, investigates the distribution of information collection tasks to a number of cooperating agents (ants). The goal of WebAnts is to create cooperating agents that share searching results and the indexing load without repeating each other's effort. The RBSE (Repository Based Software Engineering) spider was developed by Eichmann and funded by NASA. The RBSE spider was the first spider to index documents by content. It uses the Mite program to fetch documents and uses four local search mechanisms: (1) breadth first search from a given URL, (2) limited depth first search from a given URL, (3) breadth first search from unvisited URLs in the database, and (4) limited depth first search from unvisited URLs in the database. (For a complete review of other similar Internet spiders/agents, readers are referred to [8].)


2.2 Online Database Indexing and Searching

An alternative approach to Internet resource discovery is based on the database concept of indexing and keyword searching. Such systems collect complete or partial Web documents and store them on the host server. These documents are then keyword indexed on the host server to provide a searchable interface. Most popular Internet databases such as Lycos, Alta Vista, and Yahoo are based on such a design.

Lycos, developed at CMU [15], uses a combination of spider fetching and simple owner-registration. Internet servers can access the Lycos server and complete registration in a few simple steps. In addition, Lycos uses spiders based on the connections to the registered homepages to identify other unregistered homepages. With this suite of techniques, Lycos has acquired an impressive list of URLs on the Internet. Lycos adopted a heuristics-based indexing approach for these homepages that indexes them based on title, headings and subheadings, the 100 most important words, the first 20 lines, size in bytes, and number of words. However, Lycos's success also illustrates the vulnerability of the approach and the daunting task of creating "intelligent" and efficient Internet search engines. Its popularity has caused a severe degradation of information access performance, due to the communication bottleneck and the task of finding selected documents in an all-in-one database of Internet homepages.

Alta Vista, developed at Digital's Research Laboratories in Palo Alto, combines a fast Web crawler with scalable indexing software to build a large index of the Web. It was made public on December 15, 1995 and has quickly become one of the most comprehensive searchable databases on the Internet. It also provides a full-text index, updated in real time, for over 13,000 newsgroups. Although based on similar local search spider algorithms, the Alta Vista server has been successful due to its superior hardware platforms and high-end communication bandwidth.

Instead of taking the all-in-one database approach adopted by Lycos and Alta Vista, the
Yahoo server represents an attempt to partition the Internet information space to provide meaningful subject categories (e.g., science, entertainment, engineering, etc.). However, its manually-created subject categories are limited in their granularity, and the process of creating such categories is cumbersome and time-consuming. The demand to create up-to-date and fine-grained subject categories and the requirement that an owner place a homepage under a proper subject category have significantly hampered Yahoo's success and popularity.

3 Research Design and Algorithms

Funded by the ongoing Illinois Digital Library Initiative project [21] [7], this research aims to create "intelligent" Internet personal search agents that can be deployed on the Web for efficient, timely, and optimal searches. We planned to answer the following two general research questions:

- Can an Internet spider be designed to take individual users' requests and perform a global, optimal search on the Internet (i.e., not restricted by the links connected to the starting/anchor homepages)?

- Can a dynamic, agent-based interface be designed to allow users to present requests, evaluate intermediate results, and perform analysis during personal spider search sessions?

After a careful evaluation of many general-purpose search algorithms, a genetic algorithm (GA), which featured a global, stochastic search process, was examined in detail. The complex and dynamic nature of the Internet/WWW appears suited to the application of such an algorithm. A comparison of a GA-based personal spider and a conventional best first search (BFS) spider was performed in our experiment.

An earlier version of a CGI/HTML interface for the GA/BFS spiders revealed the inadequacy of the stateless, static HTML interface. A prototype interface based on Java was therefore
designed to allow dynamic interactions between users and the personal agents. In this section, we describe the search algorithms implemented. Our CGI/HTML and Java interface will be illustrated in the next section.

3.1 Determining Score/Fitness: The Jaccard's Function

In order to determine the "goodness" (or fitness, in GA terminology) of a given new homepage, a Jaccard's similarity function was adopted [18]. Each homepage was represented as a weighted vector of keywords, which had been automatically indexed by our system, together with its connecting links. A new homepage fetched by the system was compared with the anchor/starting homepages to determine whether or not it was promising. A new homepage which was more similar to the starting homepages was considered more promising and thus was explored first. The Jaccard's functions adopted were based on the combined (equal) weights of the Jaccard's score from links and the Jaccard's score from keywords.

- Jaccard's Scores from Links: Given two homepages, A and B, and their connected links/URLs, $X = (x_1, x_2, \ldots, x_m)$ and $Y = (y_1, y_2, \ldots, y_n)$, the Jaccard's score between A and B based on links was computed as follows:

$$J_{link}(A, B) = \frac{\#(X \cap Y)}{\#(X \cup Y)} \qquad (1)$$

where #(S) indicates the cardinality of set S.

- Jaccard's Scores from Keywords: For a given homepage, terms were identified based on an automatic indexing procedure developed in our previous research [7]. Term frequency (tf) and inverse document frequency (idf), term weighting heuristics also adopted in such popular searchable databases as Lycos, were then computed. Term frequency, $tf_{ij}$, represents the number of occurrences of term j in document (homepage) i. Homepage frequency, $df_j$, represents
the number of homepages, in a collection of N homepages, in which term j occurs. The combined weight of term j in homepage i, $d_{ij}$, was computed as follows:

$$d_{ij} = tf_{ij} \times \log\left(\frac{N}{df_j} \times w_j\right) \qquad (2)$$

where $w_j$ represents the number of words in term j, and N represents the total number of homepages connected to the starting homepages. Representing each homepage as a weighted vector of keywords, the Jaccard's score between homepages A and B based on keywords was computed as follows:

$$J_{keyword}(A, B) = \frac{\sum_{j=1}^{L} d_{Aj} d_{Bj}}{\sum_{j=1}^{L} d_{Aj}^2 + \sum_{j=1}^{L} d_{Bj}^2 - \sum_{j=1}^{L} d_{Aj} d_{Bj}} \qquad (3)$$

where L is the total number of terms.

The combined Jaccard's score between any two homepages, A and B, was a weighted summation of the above two Jaccard's scores, i.e.,

$$J(A, B) = 0.5 \times J_{link}(A, B) + 0.5 \times J_{keyword}(A, B) \qquad (4)$$
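To make equations (1)-(4) concrete, the following is a minimal Python sketch of the scoring functions. The dict-based homepage representation and the function names are our own illustrative assumptions, not the paper's implementation.

```python
import math

def jaccard_links(links_a, links_b):
    """Equation (1): link overlap between two homepages,
    each given as a set of connected URLs."""
    union = links_a | links_b
    if not union:
        return 0.0
    return len(links_a & links_b) / len(union)

def term_weight(tf, df, words_in_term, n_homepages):
    """Equation (2): combined tf-idf weight d_ij of term j in homepage i."""
    return tf * math.log((n_homepages / df) * words_in_term)

def jaccard_keywords(vec_a, vec_b):
    """Equation (3): Jaccard's score over weighted keyword vectors,
    each given as a dict mapping term -> tf-idf weight."""
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    norm_a = sum(w * w for w in vec_a.values())
    norm_b = sum(w * w for w in vec_b.values())
    denominator = norm_a + norm_b - dot
    return dot / denominator if denominator else 0.0

def combined_jaccard(page_a, page_b):
    """Equation (4): equal-weight combination of link and keyword scores."""
    return (0.5 * jaccard_links(page_a["links"], page_b["links"])
            + 0.5 * jaccard_keywords(page_a["keywords"], page_b["keywords"]))
```

For scoring a new homepage against several anchor homepages at once, one natural convention (an assumption on our part, as the paper does not spell it out) is to average combined_jaccard over all anchors.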

3.2 Best First Search Algorithm

Two search algorithms, best first search and a genetic algorithm, were investigated in detail. The best first search algorithm was developed to simulate the various client-based spiders developed in earlier studies and was used as a benchmark for comparison. The genetic algorithm was adopted to enhance the global, optimal search capability of existing Internet spiders. Best first search is a serial state space traversal method [17]. In our implementation, the algorithm explored the best homepage (based on the Jaccard's score of the new homepage vs. the anchor homepages) at each iteration and terminated when the system had identified the desired number of homepages requested by a user. A sketch of the best first search algorithm adopted in our personal agent is presented below:
1. Input anchor homepages and initialize: Initialize an iteration counter k to 0. Obtain a desired number of homepages from the user and a set of input anchor homepages, $(input_1, input_2, \ldots, input_m)$. These input homepages represent the user's preferred starting points for Internet search and his or her interests. Texts of homepages are fetched over the network in real time via the Lynx HTTP communication software, and homepages (URLs) connected from these input homepages are extracted and saved in the unexplored homepage queue, $H = (h_1, h_2, \ldots, h_n)$.

2. Determine the best homepage: Based on the Jaccard's function described earlier, determine the best homepage, p, in H, which has the highest Jaccard's score among all the homepages in H, and save it as $output_p$. This homepage is considered most similar to the anchor homepages in both keywords and links, and thus should be explored first.

3. Explore the best homepage: Fetch the best homepage using Lynx and add its connected homepages to the unexplored homepage queue, H. Increment the iteration counter k by 1.

4. Iterate until a desired number of homepages is obtained: Repeat the above steps until k equals the total number of homepages requested by the user.
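The four steps above map naturally onto a priority-queue loop. The following is a schematic Python rendering under stated assumptions: fetch_page(url) stands in for the Lynx-based fetcher (returning a dict with "links" and "keywords", or None for a dead site), and score(page, anchors) stands in for the combined Jaccard's function of equation (4).

```python
import heapq
import itertools

def best_first_spider(anchor_pages, num_wanted, fetch_page, score):
    """Best first search spider (steps 1-4), as a sketch."""
    frontier = []                  # max-heap via negated scores
    seen = set()
    tiebreak = itertools.count()   # avoids comparing page dicts on score ties

    def enqueue(url):
        if url in seen:
            return
        seen.add(url)
        page = fetch_page(url)             # real-time fetch; None if dead
        if page is not None:
            fitness = score(page, anchor_pages)
            heapq.heappush(frontier, (-fitness, next(tiebreak), url, page))

    # Step 1: seed the unexplored queue H with links from the anchor homepages.
    for anchor in anchor_pages:
        for url in anchor["links"]:
            enqueue(url)

    output = []
    # Steps 2-4: repeatedly explore the most promising homepage.
    while frontier and len(output) < num_wanted:
        neg_fitness, _, url, page = heapq.heappop(frontier)
        output.append((url, -neg_fitness))
        for link in page["links"]:         # Step 3: expand its links
            enqueue(link)
    return output
```

Note that this sketch scores each homepage once, when it is queued, rather than re-scoring the whole queue at every iteration; since the Jaccard's score of a page against the fixed anchors never changes, the exploration order is the same.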

3.3 Genetic Algorithm

Genetic algorithms (GAs) [12] [16] [13] are problem solving systems based on principles of evolution and heredity. Genetic algorithms perform a stochastic evolution process toward global optimization through the use of crossover and mutation operators. The search space
of the problem is represented as a collection of individuals, which are referred to as chromosomes. The quality of a chromosome is measured by a fitness function (the Jaccard's score in our implementation). After initialization, each generation produces new children based on the genetic crossover and mutation operators. The process terminates when two consecutive generations do not produce a noticeable improvement in population fitness (i.e., the improvement falls below a small threshold value, indicating convergence).

A sketch of the genetic algorithm adopted for Internet client-based searching is presented below:

1. Initialize the search space: The GA spider attempts to find the other most relevant homepages in the entire Internet search space using the user-supplied starting homepages. Initially, the system saves all the input homepages in a set called the Current Generation, $CG = (cg_1, cg_2, \ldots, cg_m)$.

2. Crossover: A heuristics-based crossover operation is then used. New homepages connected to the starting homepages in the CG set are extracted. Homepages that are connected to multiple starting homepages (i.e., multiple parents) are considered Crossover Homepages and saved in a new set, $C = \{c_1, c_2, \ldots\}$.

3. Mutation: In order to avoid becoming trapped in a local minimum, which might result from adopting a simple crossover operator, we have added a heuristics-based mutation procedure to add diversity to the homepage population. A Yahoo spider created in our previous research is used to traverse Yahoo's 14 high-level subject categories (e.g., science, business, entertainment, etc.) and collect several thousand "mutation seed" homepages in each category. These homepages are indexed using the Web indexing freeware SWISH (Simple Web Indexing System for Humans). When the GA search algorithm requests a
mutated homepage, the system retrieves the top-ranked homepage from the homepages in the user-specified category, based on the keywords present in the anchor homepages. This process is similar to performing a search on the Yahoo database in order to suggest new, promising homepages for further exploration. New mutated homepages are saved in the set of Mutation Homepages, $M = \{m_1, m_2, \ldots\}$.

The probabilities of mutation and crossover can vary depending on user needs. Higher crossover probabilities generally support exploitation of local linkages, while higher mutation probabilities support exploration of the global landscape. Exploitation and exploration are two powerful features of genetic programming [16]. Our default settings for the crossover and mutation probabilities are both 50%.

4. Stochastic selection scheme based on Jaccard's fitness: Each new crossover and mutation homepage is evaluated based on the same Jaccard's function. Based on an "elite selection" procedure [16], homepages which obtain higher fitness values are selected stochastically. A random number generator controlled by a homepage's fitness value is used to select "fitter" homepages for the new generation. Homepages that "survive" the (natural) selection procedure become the new population for the new generation.

5. Convergence: Repeat the above steps until the improvement in total fitness between two generations is less than a small threshold value (empirically determined). The final converged set of homepages is then presented to users as the output homepages.
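The following is a compact Python sketch of this GA loop, under stated assumptions: fetch_page and score are the same assumed helpers as in the best first search sketch, mutation_db.top_ranked(anchors) is a hypothetical stand-in for the lookup against the SWISH-indexed Yahoo seed database, and treating the crossover/mutation probabilities as per-generation gates is a simplification of our own.

```python
import random

def ga_spider(anchor_pages, fetch_page, score, mutation_db,
              pop_size=10, p_crossover=0.5, p_mutation=0.5, threshold=0.01):
    """Genetic algorithm spider (steps 1-5), as a sketch."""
    generation = list(anchor_pages)      # Step 1: initial population CG
    previous_fitness = 0.0
    while True:
        candidates = []
        # Step 2: crossover - keep homepages linked from multiple parents.
        if random.random() < p_crossover:
            link_counts = {}
            for parent in generation:
                for url in parent["links"]:
                    link_counts[url] = link_counts.get(url, 0) + 1
            for url, count in link_counts.items():
                if count >= 2:           # multiple parents
                    page = fetch_page(url)
                    if page is not None:
                        candidates.append(page)
        # Step 3: mutation - inject seed homepages from the indexed database.
        if random.random() < p_mutation:
            candidates.extend(mutation_db.top_ranked(anchor_pages))
        if not candidates:
            return generation
        # Step 4: fitness-proportional (roulette-wheel) stochastic selection.
        fitnesses = [score(p, anchor_pages) for p in candidates]
        weights = [f + 1e-9 for f in fitnesses]   # guard against all-zero
        generation = random.choices(candidates, weights=weights, k=pop_size)
        # Step 5: converge when the fitness gain between generations is small.
        total_fitness = sum(fitnesses)
        if abs(total_fitness - previous_fitness) < threshold:
            return generation
        previous_fitness = total_fitness
```

The fitness-weighted random.choices call plays the role of the paper's fitness-controlled random number generator: fitter homepages are more likely to survive into the next generation, but survival remains stochastic.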

4 Benchmarking Experiment

In an attempt to examine the quality of the results obtained by best first search and the genetic algorithm, we performed a set of benchmarking experiments, which compared the performance
and efficiency of the best first search and genetic algorithm based personal spiders. Using a test set of 40 search scenarios, each composed of 1-3 homepages in different subject areas, we examined the final Jaccard's scores of the BFS/GA-suggested (top 10) homepages and their corresponding CPU times and wall clock times. Higher Jaccard's scores of new homepages would suggest a closer match to a user's stated query interests (i.e., the anchor/starting homepages). Detailed benchmarking results are presented in Table 1. Figures 1, 2, and 3 show statistical analyses of the final fitness score, CPU time, and wall clock time for the 40 test cases.

Table 1 about here.

- Complementary searches through exploitation and exploration: The results show that the output homepages obtained by the genetic algorithm had a slightly higher fitness score than those obtained by best first search, but the difference is not significant. The averages of the 40 Jaccard's scores for the genetic algorithm and the best first search were 0.08705 and 0.08519, respectively. Although the Jaccard's scores showed no significant difference between the performances of the genetic algorithm and best first search, we noticed that about 50% of the homepages obtained from the genetic algorithm were the result of the mutation operation (crossover and mutation probabilities were set to 50% and 50%, respectively). We found that these homepages, although promising, had never been linked to the starting homepages, and thus could not have been obtained by any local search spider (including our best first search spider). This suggests the potential usefulness of the genetic algorithm spider as a supplement to the local best first search spider, i.e., by permitting combination of the results in both sets. During our experimentation, we also found that the genetic algorithm performed very similarly to best first search when the mutation probabilities were set low (say 5%)
and the crossover probabilities were high (say 95%). With limited mutation operations, the crossover operation in the genetic algorithm accomplished a local exploitation process similar to that of a local best first search. The mutation process appears to have been instrumental in allowing our personal spider to escape the local search minimum.

- Sparse link constraint: In addition, we also noticed that the starting homepages played an important role in determining system output. If the starting homepages contained very few and sparsely connected links, best first search spiders tended to get trapped quickly in the Internet search space because of the lack of traversal paths. The genetic algorithm, however, was not restricted by such a sparse link constraint because of its mutation operator. On the other hand, for starting homepages that contained rich and dense connections, best first search often resulted in a fruitful final search set. The genetic algorithm only added limited diversity in such a scenario.

Figure 1 about here.

- Exploration and communication are time consuming: The genetic algorithm based spider was significantly more time consuming than the best first search spider, as shown in Figures 2 and 3. The average CPU times for the genetic algorithm and best first search were 3 minutes and 11 seconds and 1 minute and 51 seconds, respectively. Moreover, due to the communication bottleneck, the average wall clock times for the genetic algorithm and best first search were 45 minutes and 41 seconds and 23 minutes and 45 seconds, respectively. The SWISH keyword search procedure implemented in the genetic algorithm spider caused a significant CPU time requirement. The elite selection procedure and multiple generations also consumed significant CPU cycles. However, the communication bandwidth (i.e., the actual time to fetch a remote homepage) seemed to be the most significant bottleneck of the entire process for both the
best first search spider and the genetic algorithm spider. This deficiency can only be resolved when the current Internet backbones are upgraded.

Figures 2 and 3 about here.

5 Dynamic, Agent-based Interface


Currently, we have developed two interfaces for our spiders. One is based on CGI/HTML and the other is based on Java. The CGI/HTML implementation enables image maps and fill-out forms to interact with the HTTP server. However, it is static and does not support dynamic display and interaction during the search process. On the other hand, Java is an object-oriented, platform-independent, multi-threaded, dynamic, graphical, general-purpose programming environment for the Internet, Intranet, and any other complex, distributed network. The Java interface allows us to display lively intermediate spider search results and accept changes of input parameters (e.g., crossover and mutation probabilities) dynamically. These dynamic, interactive features of Java are crucial to the design of customizable and "intelligent" agents [5]. The two prototype interfaces are summarized below. Readers are encouraged to connect to the University of Arizona Artificial Intelligence Group homepage (HTTP://ai.bpa.arizona.edu/) for actual demonstrations.

5.1 CGI Based User Interface

The CGI based user interface provides fill-in forms that let users submit input to the spiders. Users may request local or global search, as shown in Figure 4. Invoking the local search spider, as shown in Figure 5, users are requested to provide up to 5 starting URLs. Users also need to indicate the desired number of searched homepages and their preferred types of servers. Similarly, invoking the global search spider results in a fill-in form as shown in Figure 6. The evolution-based option activates the genetic algorithm spider. The probability-based
option activates a simulated annealing based spider developed earlier in our research. In addition to providing starting URLs and the desired number of homepages, users also need to indicate their preferred "mutation seed" database from which to draw new mutation homepages. After submitting the search request, the system displays an output homepage (Figure 7) listing the system-suggested relevant homepages, each with a title and keyword summary. Each homepage can then be clicked on for closer examination. Due to the real-time nature of our spiders, all homepages retrieved are "live," unlike the many "dead" homepages often found in other all-in-one searchable homepage databases.

5.2 Java Based User Interface

Despite these "intelligent" and customized search processes, the CGI/HTML interface is severely hampered by its lack of dynamic display and interaction. A Java-based user interface was designed to alleviate these problems. The Java interface homepage is shown in Figures 8 and 9. When "global evolution based search" is clicked on, the system displays a dialog box similar to that designed for the CGI/HTML genetic algorithm spider interface. In addition, users can set their preferred crossover and mutation probabilities. A timeout mechanism was also introduced to keep the system from making time-consuming, unfruitful connections.

Figure 10 shows the window which displays the result of the entire search process dynamically and graphically. The control panel is displayed at the top of the window. All input parameters can be changed during an ongoing search process, producing different search results. The fetched URLs are displayed during each generation (instead of only the final results at the last generation) and can be clicked on for real-time evaluation. The system also graphically displays the Jaccard's link score, keyword score, and fetch time score for each homepage (in three different colors, not shown in the attached screen dump). A spider-chasing-fly animation is displayed dynamically when our "spider" is out chasing a new
homepage (fly).

Since we placed our Java-based spider on our server in Summer 1996, the response from our initial test subjects has been overwhelming. Users found the Java-based interface to be more interactive, lively, and friendly than our earlier CGI/HTML interface. They have reported that our spiders act as dynamic, intelligent personal agents, instead of as a static, non-customizable Internet database search engine.

Figures 4, 5, 6, 7, 8, 9, 10 about here.

The results from our current experimentation with Internet personal spiders are encouraging. In response to Research Question 1 (designing a global optimal search spider), although the genetic algorithm spider did not outperform the best first search spider, we found the two sets of results to be comparable and complementary. The mutation process introduced in the genetic algorithm allows users to find other potentially relevant homepages that cannot be explored via a conventional local search process. Regarding Research Question 2, we found the Java-based interface to be a necessary component for designing an interactive and dynamic Internet agent. The CGI/HTML interface is simply too restrictive for such a task.

6 Conclusion and Discussion

Although the examples and evaluations presented are mainly based on Internet applications, the applicability of the proposed techniques to the potentially more rewarding Intranet applications should be obvious. In particular, we believe the proposed agent design can be used to locate organization-wide information, to gather new, time-critical organizational information, and to support team-building and communication in Intranets. In our ongoing effort in the Illinois Digital Library Initiative project, we are in the process of exploring other general-purpose search and classification algorithms for Internet resource categorization and search. Several neural network-based algorithms have been explored, including the Hopfield network and the Kohonen self-organizing map, both of which are under development in Java.


References

[1] T. Berners-Lee, R. Cailliau, A. Luotonen, H. F. Nielsen, and A. Secret. The World-Wide Web. Communications of the ACM, 37(8):76-82, August 1994.

[2] C. M. Bowman, P. B. Danzig, U. Manber, and F. Schwartz. Scalable Internet resource discovery: research problems and approaches. Communications of the ACM, 37(8):98-107, August 1994.

[3] E. Carmel, S. Crawford, and H. Chen. Browsing in hypertext: A cognitive study. IEEE Transactions on Systems, Man and Cybernetics, 22(5):865-884, September/October 1992.

[4] H. Chen. Collaborative systems: solving the vocabulary problem. IEEE COMPUTER, 27(5):58-66, Special Issue on Computer-Supported Cooperative Work (CSCW), May 1994.

[5] H. Chen, A. Houston, J. Yen, and J. F. Nunamaker. Toward intelligent meeting agents. IEEE COMPUTER, 29(8):62-70, August 1996.

[6] H. Chen and B. R. Schatz. Semantic retrieval for the NCSA Mosaic. In Proceedings of the Second International World Wide Web Conference '94, Chicago, IL, October 17-20, 1994.

[7] H. Chen, B. R. Schatz, T. D. Ng, J. P. Martinez, A. J. Kirchhoff, and C. Lin. A parallel computing approach to creating engineering concept spaces for semantic retrieval: The Illinois Digital Library Initiative Project. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(8):771-782, August 1996.

[8] F. Cheong. Internet Agents. New Riders Publishing, Indianapolis, Indiana, 1996.


[9] P. DeBra and R. Post. Information retrieval in the World-Wide Web: making client-based searching feasible. In Proceedings of the First International World Wide Web Conference '94, Geneva, Switzerland, 1994.

[10] O. Etzioni and D. Weld. A softbot-based interface to the Internet. Communications of the ACM, 37(7):72-79, July 1994.

[11] G. W. Furnas, T. K. Landauer, L. M. Gomez, and S. T. Dumais. The vocabulary problem in human-system communication. Communications of the ACM, 30(11):964-971, November 1987.

[12] D. E. Goldberg. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, Reading, MA, 1989.

[13] J. R. Koza. Genetic Programming: On the Programming of Computers by Means of Natural Selection. The MIT Press, Cambridge, MA, 1992.

[14] P. Maes. Agents that reduce work and information overload. Communications of the ACM, 37(7):30-40, July 1994.

[15] Mauldin and Leavitt. Web-agent related research at the CMT. In Proceedings of the ACM Special Interest Group on Networked Information Discovery and Retrieval (SIGNIDR-94), August 1994.

[16] Z. Michalewicz. Genetic Algorithms + Data Structures = Evolution Programs. Springer-Verlag, Berlin Heidelberg, 1992.

[17] J. Pearl. Heuristics: Intelligent Search Strategies for Computer Problem Solving. Addison-Wesley Publishing Company, Reading, MA, 1984.


[18] E. Rasmussen. Clustering algorithms. In Information Retrieval: Data Structures and Algorithms, W. B. Frakes and R. Baeza-Yates, Editors, Prentice Hall, Englewood Cliffs, NJ, 1992.

[19] D. Riecken. Intelligent agents. Communications of the ACM, 37(7):18-21, July 1994.

[20] B. R. Schatz, A. Bishop, W. Mischo, and J. Hardin. Digital library infrastructure for a university engineering community. In Proceedings of Digital Libraries '94, pages 21-24, June 1994.

[21] B. R. Schatz and H. Chen. Building large-scale digital libraries. IEEE COMPUTER, 29(5):22-27, May 1996.

[22] B. R. Schatz and J. B. Hardin. NCSA Mosaic and the World Wide Web: global hypermedia protocols for the Internet. Science, 265:895-901, 12 August 1994.

[23] S. Spetka. The TkWWW robot: Beyond browsing. In Proceedings of the Second World Wide Web Conference, October 17-20, 1994.

[24] M. M. Waldrop. Software agents prepare to sift the riches of cyberspace. Science, 265:882-883, 12 August 1994.


       Final Jaccard's Score    CPU Time (sec.)    Wall Clock Time (sec.)
       GA         BFS           GA       BFS       GA        BFS
       0.064111   0.171631      355      449       3677      3361
       0.037873   0.033949      390      428       13053     4371
       0.116297   0.030918      121      31        722       772
       0.084534   0.086722      181      67        1890      535
       0.078234   0.078808      203      48        4840      551
       0.172200   0.169866      111      375       2334      2861
       0.055612   0.067690      267      14        3226      175
       0.038149   0.049555      209      59        4636      1539
       0.139013   0.121848      260      15        1646      448
       0.142467   0.140435      239      167       13684     1800
       0.084445   0.081548      198      95        1868      641
       0.039268   0.041037      149      35        2865      497
       0.073365   0.047864      201      110       1474      526
       0.105819   0.084379      132      62        502       1081
       0.124926   0.116992      294      111       2659      1970
       0.223007   0.211883      246      41        3663      698
       0.060740   0.061900      140      160       540       2384
       0.067829   0.055259      263      195       1470      1678
       0.077254   0.037858      160      105       1134      679
       0.089374   0.052490      139      148       1095      6215
       0.076198   0.089744      116      4         2181      1219
       0.069978   0.094988      164      20        2265      201
       0.075281   0.084414      198      28        2314      536
       0.146929   0.198999      212      22        1862      111
       0.156446   0.170072      139      206       1505      1260
       0.096210   0.130226      156      16        2125      139
       0.059283   0.055598      114      137       2498      1684
       0.065573   0.050638      110      22        2016      193
       0.045675   0.058230      173      44        1851      372
       0.072970   0.069212      129      60        1716      6033
       0.075478   0.055161      314      190       6452      1890
       0.072598   0.079657      197      29        2898      513
       0.096236   0.130226      158      17        2249      140
       0.033593   0.024276      106      43        811       788
       0.060900   0.038381      127      200       1148      1861
       0.030297   0.019969      144      143       896       1107
       0.083009   0.049055      111      19        1001      167
       0.149195   0.133777      407      396       2848      4944
       0.065770   0.105248      245      22        1636      533
       0.075966   0.026925      98       121       2384      508
Mean   0.087053   0.085186      191.9    111.3     2741      1425

Table 1: Detailed statistics of the benchmarking results for 40 test cases.


Figure 1: Statistics of the average Jaccard's scores obtained from 40 test cases by best first search and genetic algorithm. One-way ANOVA: F = 0.03, p = 0.857. BFS: N = 40, mean = 0.08519, SD = 0.04996; GA: N = 40, mean = 0.08705, SD = 0.04176; pooled SD = 0.04604.

Figure 2: Statistics of the CPU time for 40 test cases by best first search and genetic algorithm. One-way ANOVA: F = 12.81, p = 0.001. BFS: N = 40, mean = 111.3 sec, SD = 118.4; GA: N = 40, mean = 191.9 sec, SD = 79.0; pooled SD = 100.7.

Figure 3: Statistics of the wall clock time for 40 test cases by best first search and genetic algorithm. One-way ANOVA: F = 6.92, p = 0.010. BFS: N = 40, mean = 1425 sec, SD = 1565; GA: N = 40, mean = 2741 sec, SD = 2752; pooled SD = 2238.


Figure 4: The CGI/HTML spider homepage.

Figure 5: The input homepage of the local best first search spider.

Figure 6: The input homepage of the global genetic algorithm search spider.

Figure 7: The output homepage of the global genetic algorithm search spider.

Figure 8: The Java spider homepage.

Figure 9: The control panel for initiating a Java-based genetic algorithm spider.

Figure 10: The display window shows the result of the search process dynamically. An animation is displayed in the upper right-hand corner. The control panel, which allows the user to change parameters during the process, is located at the upper portion. Search results are summarized at the center of the window.
