Evolving Intelligent Text-based Agents

Edmund S. Yu, Syracuse University & MNIS-TextWise, 401 S. Salina St., Syracuse, NY 13202, (315) 426-9311 x227, [email protected]

Ping C. Koo, MNIS-TextWise, 401 S. Salina St., Syracuse, NY 13202, (315) 426-9311 x233, [email protected]

Elizabeth D. Liddy, Syracuse University, 2-212 CST, Syracuse, NY 13244, (315) 443-4456, [email protected]
ABSTRACT

In this paper we describe our neuro-genetic approach to developing a multi-agent system (MAS) which forages as well as meta-searches for multi-media information in online information sources on the ever-changing World Wide Web. We present EVA, an intelligent agent system that supports 1) multiple Web agents working together concurrently and collaboratively to achieve their common goal, 2) the evolution of these Web agents and the user profiles to achieve better filtering, classification, and categorization performance, and 3) longer-term adaptation by using our unique neuro-genetic algorithm. Individual Web agents use neural networks for local searching and learning. Genetic algorithms are used to facilitate the evolution of agents on a global scale. NLP technology allows users to write sophisticated queries, and allows the system to extract important information from the user queries and the retrieved documents. The new text categorization technology used by EVA, which is also based on the neuro-genetic algorithm, can learn to automatically categorize and classify Web pages with high accuracy, using as few terms as possible. Additionally, we have developed a technique for integrating meta-searching and Web-crawling to produce intelligent agents that can retrieve documents more efficiently, and a self-feedback or automatic relevance feedback mechanism to automatically train the Web agents, without human intervention. This mechanism, together with the neuro-genetic algorithm, has greatly enhanced the autonomy of the Web agents.
Keywords

Information agents, evolution of agents, learning and adaptation, multi-agent teams
1. INTRODUCTION

Productive use of online resources is hampered by the very thing that makes them attractive: the huge glut of information. An excessive amount of time is required to locate useful data, and the dynamic and transient nature of many online data repositories means that much information is lost, overlooked, or quickly outdated. In one study, researchers found that well over 50% of online users spent more time searching for information than actually using it [6].

Information seekers require an individualized, autonomous desktop system that can learn about a user's specific interest in a particular topic, then efficiently scour diverse resources unattended, looking for relevant information to return to the user for inspection. The system must be intelligent enough to continually monitor user behavior and feedback, and thus able to adapt the search and retrieval process over time.

Existing Web-based intelligent agent technology has promise, but often requires constant supervision to operate. It has limited intelligence, and it requires queries to be stated in simplistic, abbreviated form. Most agent technologies used for text-based tasks are unable to initiate actions or operate autonomously: they do not change or evolve without direct orders from the user, and their ability to be trained by user feedback or other knowledge inputs is highly circumscribed. Some existing Web agent systems can learn (e.g. [4][12][26][32][33]), but usually the learning process relies on only one learning algorithm, and research has shown that no single machine learning mechanism performs well across all domains or across diverse requests or queries [20][28]. Some existing Web agent systems can deploy multiple agents for the same core query (e.g. MetaBot [31]), but there is usually no inter-agent learning; multiple Web agents are used only as a means of speeding the recovery of data, not as a means of improving the retrieval performance of the system.

To address these limitations, we have employed a neuro-genetic approach to developing a multi-agent system (MAS) that both forages and meta-searches for multi-media information in online information sources on the ever-changing World Wide Web. We present EVA, an intelligent agent system that supports 1) multiple Web agents working together concurrently and collaboratively to achieve their common goal, 2) the evolution of these Web agents and the user profiles to achieve better filtering, classification, and categorization performance, and 3) longer-term adaptation through our unique neuro-genetic algorithm. Individual Web agents use neural networks for local searching and learning, while genetic algorithms facilitate the evolution of agents on a global scale. NLP technology allows users to write sophisticated queries, and allows the system to extract important information from the user queries and the retrieved documents. EVA's new text categorization technology, also based on the neuro-genetic algorithm, can learn to automatically categorize and classify Web pages with high accuracy, using as few terms as possible. Additionally, we have developed a technique for integrating meta-searching and Web-crawling to produce intelligent agents that can retrieve documents more efficiently, and a self-feedback (automatic relevance feedback) mechanism to train the Web agents without human intervention. This mechanism, together with the neuro-genetic algorithm, has greatly enhanced the autonomy of the Web agents.
2. THE NEURO-GENETIC APPROACH
By approaching the agent learning task from two different levels, the local level of individual agents and the global level of inter-agent operations, we ensure that each agent can be optimized from local knowledge, while the genetic algorithm acts as a 'driving force' to evolve the agents collectively, based on the globally pooled knowledge. The end goal is to produce a new generation of agents that benefit from the learning experiences of individual 'parent' agents and from the collective learning experiences of all previous generations.
For agents to be truly intelligent, we believe that adding machine learning capabilities is the key [18][25]. Since it would be impractical to assume that we could predict all possible events in the external environment and encode all the knowledge about those events in advance, agents need learning capabilities: how they react to new circumstances can be programmed, but what they learn cannot. It has been reported in the literature that different machine learning techniques make different assumptions about the underlying model, leading to different strengths and weaknesses [20], and a combined hybrid system usually outperforms each individual technique. In this project we use neural networks, more specifically multi-layer feedforward neural networks with the back-propagation learning algorithm [2], as the baseline learning algorithm for agents, because of their wide applicability and their good performance in general. It is well known that neural networks are capable of performing non-linear mappings between real-valued inputs and outputs. In fact, mathematical theorems have proven that a three-layer feedforward neural network with sigmoid units in the hidden layer can approximate a given real-valued, continuous multivariate function to any desired degree of accuracy [5][7][11]. Furthermore, the consistency property of three-layer feedforward networks has also been established [36], which in turn implies that this class of neural networks possesses nonparametric regression capability. They can therefore be used as a classifier to determine whether a document is relevant to a topic, based on the input pattern of terms. At the training stage, each training pattern consists of an input pattern and a single output node, which indicates the pre-judged relevancy of the input sample. The input pattern is derived from the input text itself, which may be a piece of natural language text of any length. At the recall (or testing) stage, we use the same method to derive the input pattern from an unseen document and feed it to the neural network. The output value then indicates the relevancy of this new document, as judged by the system.
The output of the neural net is a real value from 0.0 to 1.0, which can be interpreted as the probability of the Web page being relevant to the query. Computer simulations have shown that the Bayes a posteriori probability can be estimated by feedforward neural networks. In a recent paper by the well-known neural network researcher Ken-ichi Funahashi, Bayes decision theory is combined with the approximation theory of three-layer neural networks to study the 2-category, n-dimensional Gaussian classification problem. He proved theoretically that three-layer neural networks with at least 2n hidden units can approximate the a posteriori probability in the two-category classification problem with arbitrary accuracy, and that the input-output function of such networks tends to the a posteriori probability as back-propagation learning proceeds ideally [8]. Since our relevant/non-relevant assignment problem is a 2-category, n-dimensional classification problem, and since it is quite natural to assume that the n-dimensional probability density $p(x \mid \omega_i)$, where $x$ is the input pattern and $\omega_i$, $i = 1$ or $2$, are the two categories, is Gaussian, these results provide a theoretical basis for our current neural net parameters: we use 2n hidden units in the neural nets embedded in our agents.

Intelligent agents based solely on neural networks can only learn locally; that is, their learning experiences are restricted to the documents they have scanned or the Web sites they have traveled through. To expand the learning horizon and to create more intelligent agents, a learning algorithm is needed that can operate at a higher level and view things from an inter-agent perspective. Genetic algorithms have been shown to be excellent candidates [21]. Descriptions of some prototypical genetic algorithms and their successful applications can be found in [21] and [22]. In our case, we decided to use the genetic algorithm as a mechanism to select the best (or optimal) feature set for the neural networks, as opposed to the traditional approach of using genetic algorithms to optimize the connection weights or the topology of neural nets. This decision was based on the observation that the quality of the input feature (or, in this case, word) set strongly impacts each neural network's classification and prediction capability, perhaps more strongly than the machine learning algorithm itself, as can be seen in [24]. Selecting the best set of features (words, phrases, concepts, etc.) is always important for retrieval effectiveness and efficiency. In the future, however, we may integrate feature set evolutions with neural-net structure evolutions. Additional reasons for using genetic algorithms for feature selection can be found in [37].

Here is a description of how our neuro-genetic algorithm works. First, encode the features (words, in our case) used by the neural network as a binary string, usually called a chromosome in the literature. Each 'gene' in a chromosome refers to a feature (currently a word) available to the neural network: if the gene is set to 1, the neural network uses that word as input; if it is set to 0, the neural network ignores that word.
Next, evaluate each chromosome's fitness by using it as a blueprint to dynamically construct a corresponding neural net and measuring that net's classification performance. Its fitness is

$$\text{fitness} = \frac{1}{\text{total cost}} = \frac{1}{\alpha \cdot cost_c + (1-\alpha) \cdot cost_r} = \frac{1}{\alpha \cdot \frac{1}{accuracy(x)} + (1-\alpha) \cdot \frac{1}{parsimony(x)}} = \frac{(1+\beta^2) \cdot accuracy(x) \cdot parsimony(x)}{\beta^2 \cdot accuracy(x) + parsimony(x)}, \quad \text{where } \beta^2 = \frac{1-\alpha}{\alpha},$$

where $cost_c$ is the cost incurred by classification errors, $cost_r$ is the cost incurred by using irrelevant and redundant features, $\alpha$ is a linear combination coefficient between 0 and 1, $x$ is a feature (word) subset, and $\beta > 0$ determines the relative importance of accuracy versus parsimony. Please note that this equation, which is also van Rijsbergen's F measure in information retrieval [34], can easily be extended to combine more than two objectives or criteria in the GA. Accuracy is currently defined as 1 minus the mean-squared error. Parsimony is defined as (number of available features - number of used features) / (number of available features). Currently $\beta$ is set to a very small number to indicate that classification accuracy is more important to us at the moment. Feature subsets with higher fitness are more likely to 'survive' the selection procedure used by the genetic algorithm.
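As a concrete illustration, the fitness computation can be rendered in a few lines of Python. This is a minimal sketch of the formula above, not EVA's actual (Java) implementation; the default value of beta is an assumption standing in for the paper's "very small number."

```python
def fitness(mse, n_used, n_available, beta=0.01):
    """Fitness of a feature subset x, per the F-measure-style formula above.

    mse         -- mean-squared error of the trained net on the validation set
    n_used      -- number of features the chromosome switches on
    n_available -- total number of candidate features
    beta        -- small beta favors accuracy over parsimony, as in the paper
    """
    accuracy = 1.0 - mse                                # accuracy = 1 - MSE
    parsimony = (n_available - n_used) / n_available    # fraction of features saved
    b2 = beta ** 2
    denom = b2 * accuracy + parsimony
    return (1 + b2) * accuracy * parsimony / denom if denom else 0.0
```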
We use 50 chromosomes in each generation, which correspond to 50 neural networks. Each neural network is trained on the training set using only the part of the input pattern that corresponds to the 1's in the chromosome. One tenth of the training set is reserved as a cross-validation set. After training is complete, the fitness of a chromosome is computed on the cross-validation set, according to the fitness function specified earlier. We rank the chromosomes by their fitness and select parents for reproducing the next generation using a rank-based approach. Any chromosome may be selected as a parent, but the likelihood of a chromosome being selected is governed by the formula $p(1-p)^{n-1}$, where $p$ is the probability that the best chromosome will be selected (currently 0.6) and $n$ is the chromosome's rank. Please note that

$$\sum_{n=1}^{\infty} p(1-p)^{n-1} = 1.$$

We have also applied 'elitism' here. Elitism is an optional characteristic of a genetic algorithm; its role is to make sure that the fittest chromosome(s) of a population is (are) passed on to the next generation unchanged. We keep the top two chromosomes as the elite. We use uniform crossover, which has been shown to be more effective in combining schemata than either 1-point or 2-point crossover [30], with one addition. As for mutation, we use a typical small value, 0.001, as the probability of mutation, $p_m$. The algorithm terminates when 1) the entire population consists of only one type of chromosome, 2) the fitness of the fittest chromosome reaches the maximum value, which is 1.0 in our case, or 3) 20 generations have passed.

Although our neuro-genetic approach can be applied to the entire feature space, doing so is neither practical nor necessary. For EVA, we usually start with 50-200 features (terms). We use the same kind of simple measure as in [24], the correlation coefficient, to select the desired number of features as the starting point for our neuro-genetic algorithm; it is the 'one-sided' $\chi^2$-test. The correlation coefficient of a word $w$ with respect to a category or topic is
$$C = \frac{\left(N_{r+} N_{n-} - N_{r-} N_{n+}\right)\sqrt{N}}{\sqrt{(N_{r+} + N_{r-})(N_{n+} + N_{n-})(N_{r+} + N_{n+})(N_{r-} + N_{n-})}}$$

where $N_{r+}$ ($N_{n+}$) is the number of relevant (non-relevant) documents in which $w$ occurs, $N_{r-}$ ($N_{n-}$) is the number of relevant (non-relevant) documents in which $w$ does not occur, and $N$ is the total number of documents.
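The selection, crossover, and mutation machinery just described, together with the correlation-coefficient pre-filter, can be sketched as follows. This is an illustrative Python rendering under the stated parameter values (p = 0.6, top-2 elitism, p_m = 0.001); `evaluate_fitness` is a stand-in for the train-and-cross-validate step described above, and none of these names come from EVA itself.

```python
import math
import random

P_BEST = 0.6    # probability that the rank-1 chromosome is picked as a parent
P_MUT = 0.001   # per-gene mutation probability
N_ELITE = 2     # fittest chromosomes passed to the next generation unchanged

def masked_inputs(chromosome, feature_values):
    """Apply a chromosome as an input mask: gene 1 keeps the word, 0 drops it."""
    return [v for gene, v in zip(chromosome, feature_values) if gene == 1]

def pick_parent(ranked):
    """Rank-based selection: the rank-n chromosome wins with prob p(1-p)^(n-1)."""
    for chromosome in ranked:
        if random.random() < P_BEST:
            return chromosome
    return ranked[-1]  # tail of the geometric series: fall back to the last rank

def uniform_crossover(mom, dad):
    """Uniform crossover: each gene comes from either parent with equal chance."""
    return [m if random.random() < 0.5 else d for m, d in zip(mom, dad)]

def mutate(chromosome):
    """Flip each bit independently with probability P_MUT."""
    return [1 - g if random.random() < P_MUT else g for g in chromosome]

def next_generation(population, evaluate_fitness):
    """One generational step: rank by fitness, keep the elite, breed the rest."""
    ranked = sorted(population, key=evaluate_fitness, reverse=True)
    new_pop = ranked[:N_ELITE]          # elitism: copy the top two unchanged
    while len(new_pop) < len(population):
        child = uniform_crossover(pick_parent(ranked), pick_parent(ranked))
        new_pop.append(mutate(child))
    return new_pop

def correlation_coefficient(n_rp, n_rm, n_np, n_nm):
    """Correlation coefficient of a word w.r.t. a topic (formula above)."""
    n = n_rp + n_rm + n_np + n_nm       # total number of documents
    num = (n_rp * n_nm - n_rm * n_np) * math.sqrt(n)
    den = math.sqrt((n_rp + n_rm) * (n_np + n_nm) * (n_rp + n_np) * (n_rm + n_nm))
    return num / den if den else 0.0
```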
3. EVA WEB AGENTS

A user navigates through the Web space by following hyperlinks. However, this kind of browsing allows users to explore only a very small portion of the large Web information space. The depth-first surfing inherently encouraged by Web browsers usually confuses and disorients most users and causes them to get lost in cyberspace (the embedded digression problem). Web browsing can also cause most users to spend a great deal of time while learning nothing specific (the art museum phenomenon) [1][3]. Search engines usually employ Web robots to traverse the Web. The Web robots, starting from a specific Web site, follow the hyperlinks and retrieve documents along the way. When a user wants to find information of interest, she submits a (usually keyword-based) query to a search engine. The search engine then matches the query against its indexes and returns a list of URLs matching the user's query. Thus, users can rely on the search engines to quickly find a better starting point of exploration, mitigating the effects of the Web browser problems mentioned above. However, the indexes used by the search engines are not complete, and hardly ever up to date. Keyword search (e.g. AltaVista, Yahoo, Lycos, etc.) often results in low precision. Hierarchical subject category search (e.g. Yahoo) often results in low recall. Users of search engines are usually not able to fully articulate their information needs. And the most critical problem is the 'vocabulary problem.'

The Web robots used by the Web search engines can be personalized to address the Web browser problem as well. The personalized Web robots can take a user's request and perform real-time, customized searches. They promise to find all the information that is relevant to your needs at a given moment. Using this approach, you won't have to sift through terabytes of constantly changing data to find exactly what you want, and you are assured of seeing even the most recently published information. Users do not even have to be online while the agents are working or searching for them; they can log on at a later time to check on the progress of their agents and download any results. An agent system that supports personal Web robots (or agents) may spawn one or more agents in response to a user request, to fully exploit parallel searches in a distributed environment such as the Internet. These kinds of agent systems usually keep an online user profile, which reflects the user's interests, and the agents they spawn use this profile to provide query results with higher precision. An intelligent agent system with learning capabilities can fine-tune the user profile to capture the user's not fully articulated information needs, and can adapt the profile to changing user interests and the changing Web information space.

The only drawback of the 'personal Web agent' solution is that users usually do not get instant responses to their requests, because it takes time for the personal agents to crawl (to traverse the Web hyperlinks) and to perform document/text processing. To remedy that, we added a meta-search capability, which enables EVA to submit user requests to the Web search engines directly, at the same time as it spawns personal Web agents. The URLs returned by the search engines are merged with the query results from the personal Web agents, and later are also 'crawled' by the personal Web agents (a sketch of this integration follows below). As mentioned in the previous paragraph, the major advantage of the
'search engine' solution is that users can rely on the search engines to quickly find a better starting point of exploration, and mitigate the effects of the Web browser problem. By integrating this meta-search capability into EVA, the agent system can now automate this process and take full advantage of the 'search engine' solution. EVA now has the best of both worlds.
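As a rough sketch of this integration, under assumed helper names (`query_engine` and the crawlers' `add_seed` method are hypothetical stand-ins, not EVA's actual API), meta-search results are merged, de-duplicated, and then handed to the crawler agents as seeds:

```python
def integrated_search(query, engines, crawlers, query_engine):
    """Meta-search the engines and seed the crawler agents with the hits.

    query_engine(engine, query) -> list of result URLs; query_engine,
    engines, and crawler.add_seed are hypothetical stand-ins, not EVA's API.
    """
    seen, merged = set(), []
    for engine in engines:                         # submit the request directly
        for url in query_engine(engine, query):
            if url not in seen:                    # merge across engines
                seen.add(url)
                merged.append(url)
    for i, url in enumerate(merged):               # the personal Web agents
        crawlers[i % len(crawlers)].add_seed(url)  # later 'crawl' these URLs too
    return merged                                  # merged with agent results
```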
Currently, we are also working on using our neuro-genetic algorithm to automatically classify Web pages based on Yahoo categories. This will remedy Yahoo's low-recall problem. Hence, some of our EVA Web agents are text categorization agents; see Section 4 for more about this important aspect of EVA.

4. TEXT CATEGORIZATION AGENTS

Text categorization is the problem of automatically assigning predefined categories to natural language texts, based on their contents. It is a very complex problem, but it is also a task of increasing importance. Mainly due to the proliferation of electronic documents, usually stored in data warehouses, a new area called text mining has been gaining importance. One important goal in text mining is to classify or categorize these electronic documents automatically [35], which is exactly a text categorization task. Furthermore, text mining on the Web arguably will become the most important application of text categorization in the current Web-based information age, because the Web is the "mother of all data warehouses" [9]. When combined with agent technology for the Internet [23], text categorization techniques can be used to automatically classify and filter new or existing Web pages, Web sites, and other documents available on the Web. EVA is such a system, here applied to Yahoo categories. In [38] the author distinguishes three types of text categorization systems:

Naive word matching, which matches categories to text based on shared words between the text and the names of categories. This is the weakest method: it cannot capture any categories that are conceptually related to a text but happen not to share any words with it.

Thesaurus-based matching, which uses lexical resources to relate a given text to the names or descriptive phrases of categories. This method usually suffers from the weakness that the lexical resources are typically static, not sensitive to the context in which a term (word or phrase) is used.

Empirical learning of term-category associations from a set of training text documents and their human-assigned categories. It differs from the other two types mainly in that it makes use of human relevance judgments to statistically and context-sensitively capture the semantic associations between terms and categories. The research reported here is an example of this type of text categorization technology. We can characterize this statistical classification problem as follows [19][22]:

Given an instance space X consisting of all possible text documents (or objects), and a training set of labeled text documents for the target classification function f(x), which can take on any value from a finite set V (e.g. interesting, not_interesting), learn from this training set to predict the target value for subsequent text documents. The text documents are usually represented in the form (x, c), where x is a vector of feature values, such as the number of occurrences or the presence of the words, phrases, or other semantic entities in the documents, and c is the class label. A major difficulty of this problem stems from the high dimensionality of its feature space, i.e. the length of the feature vector x in the above definition, which normally consists of tens or even hundreds of thousands of unique words, phrases, and perhaps other semantic entities occurring in the natural language texts to be categorized. It would be a formidable task to train learning algorithms, such as neural networks, with so many inputs. Hence, reducing the dimensionality, or selecting a good subset of features without sacrificing accuracy, is of great importance if learning algorithms such as neural networks are to be successfully applied to text categorization. In our research we have developed a methodology that integrates genetic algorithms, neural networks, and the Baldwin effect in order to evolve the best feature subset for text categorization.

There are two main approaches to feature selection in machine learning: the filter approach and the wrapper approach. The filter approach selects the feature subset independently of the learning method, before the learning process. The wrapper approach uses the learning algorithm itself as a black box: it generates a candidate feature subset, runs the learning algorithm on that subset, and uses the resulting accuracy to evaluate the candidate. Most recent research in machine learning relies on the wrapper approach rather than the filter approach, because the learning algorithm that will use the feature subset should provide a better estimate of accuracy than a separate measure that may have an entirely different induction bias [13][14]. In text categorization, however, researchers still rely on the filter approach. In [39] the authors conducted a comparative study of five aggressive feature selection methods: document frequency, information gain, mutual information, the χ²-test, and term strength. All of them use corpus statistics to determine the 'goodness' of a term and then apply a threshold to filter out low-scoring terms (a minimal sketch of this filter style appears at the end of this section). Other feature selection methods commonly used in the literature also fall into the filter approach.

Our neuro-genetic approach, on the other hand, is a wrapper approach, but it can also be used in conjunction with a filter, as mentioned in Section 2. Experimental results show that our neuro-genetic algorithm performs as well as, if not better than, the best neural network results to date, while using fewer input features. Additional information about our neuro-genetic approach to text categorization can be found in [40][41].

Our now-commercialized document retrieval system DR-LINK also has a text categorization component, called the Subject Field Coder (SFC) [15][16]. The new approach to text categorization described here can be considered a learning version of the SFC.
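As promised above, here is a minimal sketch of the filter style of feature selection, using document frequency as the scoring statistic; the threshold value is an assumption, and any of the five criteria surveyed in [39] could replace the scorer.

```python
from collections import Counter

def filter_features(docs, min_df=3):
    """Filter approach: score terms by a corpus statistic, then threshold.

    docs   -- list of token lists, one per training document
    min_df -- assumed threshold: keep terms occurring in >= min_df documents
    """
    df = Counter()
    for doc in docs:
        df.update(set(doc))        # document frequency: count once per document
    return sorted(term for term, n in df.items() if n >= min_df)

def vectorize(doc, vocabulary):
    """Represent a document as the feature-value vector x (term presence)."""
    tokens = set(doc)
    return [1 if term in tokens else 0 for term in vocabulary]
```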
5. THE EVA SYSTEM ARCHITECTURE
This system is fully implemented in Java. Figure 1 shows the current system architecture. The user interface agent is embedded within our Java-based graphical user interface (GUI). Its purpose is to assist the user with daily information retrieval, evaluation, filtering, and organization tasks. The 'perceived' user styles, interests, etc. are represented as user profiles [27], which are sent to the Agent Server for execution and to the Database Server for storage and later retrieval. These stored user profiles can be
modified by the users manually or 'evolved' by the genetic algorithms automatically. The user interface agent is also responsible for presenting the results and obtaining user feedback. The obtained user feedback and the associated results are sent to the Agent Server to expand the agent training materials. Each user profile is identified by a name, which also identifies the associated Agent Leader. This information includes the natural language query, the search type, and any starting URLs. The starting URLs represent the addresses at which different crawler agents will start searching the WWW. However, the user is not required to enter any starting URLs, since EVA can autonomously and intelligently find a good starting point for the user.

A Process Query button directs the agent server to send the query to the natural language processor. The user may review the query processing results prior to starting the search. To assist the user in selecting starting URLs, the Database Server may store a table of records, organized by subject category, listing recommended starting URLs for each category. We are currently applying our NGA (Neuro-Genetic Algorithm) to the geospatial categories in the Yahoo hierarchy, using the URLs under Yahoo categories such as 'Geography' (which includes GIS, Geographical Information Systems) and 'Geology & Geophysics' as training data for EVA agents. When complete, our system will be able to automatically determine whether a new (or old) page belongs to these categories, and we can use that information to automatically and dynamically rank Web pages and Web sites in terms of their relevancy to those categories. In other words, we will be able to build smart portals automatically, starting with the Yahoo categories. We also call them resource maps, for obvious reasons. Yahoo categories were selected because of the high-quality work performed by their human indexers in manually determining which HTML page (URL) should reside in which category. Preliminary results are shown in the section on Experimental Results.

With the user profile information entered and processed, the agent server creates an agent leader for the user profile and initializes the neural network using the processed query from the natural language processor. Next, the agent leader generates a team of agents, each embedded with a trained neural network. Two types of agents are generated: crawler agents and meta-search agents. The agent server displays, via the GUI, each retrieved document whose matching score (RSV, or retrieval status value) is above a certain adjustable threshold. The agent leader in the agent server ranks the documents by their RSVs and continuously updates the ranking as new documents are retrieved by the agents. All results are also stored in the database, in terms of the URL, its RSV, relevancy, the results from the natural language processor, etc. Due to the large number of documents that may be retrieved, the database server saves only the highest-ranked 100 documents (sketched below).
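A minimal sketch of this continuous ranking and top-100 persistence follows; the class name and the default threshold are illustrative assumptions, not EVA's Java classes.

```python
import heapq

class ResultRanker:
    """Keeps retrieved documents ranked by RSV, capped at the top 100."""

    def __init__(self, threshold=0.5, cap=100):
        self.threshold = threshold    # adjustable display threshold for RSVs
        self.cap = cap                # database keeps only the best 100 results
        self.heap = []                # min-heap of (rsv, url): weakest on top

    def add(self, url, rsv):
        if rsv < self.threshold:      # below threshold: not shown, not stored
            return
        heapq.heappush(self.heap, (rsv, url))
        if len(self.heap) > self.cap:
            heapq.heappop(self.heap)  # drop the weakest result beyond the cap

    def ranked(self):
        """Current results, best RSV first, for continuous GUI updates."""
        return sorted(self.heap, reverse=True)
```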
[Figure 1. EVA system architecture. The GUI and user interface agent connect to the Agent Server (user profiles, agent performance records, genetic algorithms/evolutionary process, agent training center), the Database Server (user profiles, ontologies, queries/results/Web sites), the Query/Document Processor (NLP), a Web browser, and the World Wide Web.]
We have designed and implemented a module to facilitate user relevance feedback. The module now has an associated button in the GUI called 'Train.' User feedback is being used to supply relevance judgments for documents, images, and Web sites retrieved by the Web agents. These relevance judgments are being used to train the neural networks embedded in the agents.
Users may provide relevance judgments and select one or more retrieved documents containing the relevant images to be used as additional training materials (or patterns). We then need to determine the common content, or the essence, of those documents, since we usually assume they have common or similar content. It is this common content, the essence, that makes the user deem all these documents, and hence the contained images, relevant. The essence is usually represented as a list or a vector of concepts. For the sake of generalizability and efficiency, we usually limit the dimensionality of the content vectors. In [17] we determined this number empirically, and we initially followed the same approach here. Now, however, the genetic algorithm is used to find the optimal or sub-optimal number of concepts to use, so this number changes dynamically (a sketch of this essence computation follows below).
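One plausible reading of this essence computation, sketched with a fixed concept budget k; in EVA, as just noted, the genetic algorithm adjusts this number dynamically, and the real system works over NLP-extracted concepts rather than raw tokens.

```python
from collections import Counter

def essence(relevant_docs, k=50):
    """Common content of the user-selected documents as a bounded concept vector.

    relevant_docs -- list of concept/token lists, one per relevant document
    k             -- dimensionality cap (fixed here; evolved by the GA in EVA)
    """
    shared = Counter()
    for doc in relevant_docs:
        shared.update(set(doc))   # favor concepts that recur across documents
    return [concept for concept, _ in shared.most_common(k)]
```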
If the user has not selected any relevant documents, the agent server automatically performs relevance feedback by treating the top X documents with the highest RSVs as relevant and including them in the training set; X is currently 10. The most frequent subject categories and terms from this training set are selected and added to the original natural language query. The agent server, via the database server, also adds those selected subject categories and terms to the stored agent records. Please see Figure 2 and Figure 3 for the terms autonomously and automatically selected by the system, based on the original query "intelligent agents" (a sketch of this self-feedback step follows below). After the relevance feedback process is complete, the agent server checks whether it is time to evolve the agents. If so, the agent server invokes the NGA manager to do the work. The NGA manager applies the neuro-genetic algorithm as described earlier to evolve the features (or terms) for the neural networks embedded in the agents, using the URLs (HTML documents) and their associated user/manual or automatic relevance feedback information as stored in the database. The evolution time may be
a clock time set by the user via the GUI, or may be on a periodic interval.
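The self-feedback step can be sketched as follows, with X = 10 as stated; `extract_terms` and the cap on newly added terms are hypothetical stand-ins, and this illustrates the mechanism rather than the authors' exact term-selection rule.

```python
from collections import Counter

def automatic_feedback(ranked_results, query_terms, extract_terms,
                       top_x=10, n_new=10):
    """Self-feedback: treat the top-X documents by RSV as relevant.

    ranked_results -- documents sorted by RSV, best first
    extract_terms  -- hypothetical NLP helper: document -> list of terms
    top_x          -- X, currently 10 in EVA; n_new is an assumed cap
    Returns the augmented training set and the expanded query.
    """
    assumed_relevant = list(ranked_results[:top_x])   # join the training set
    counts = Counter()
    for doc in assumed_relevant:
        counts.update(extract_terms(doc))
    # add the most frequent new terms to the original natural language query
    fresh = [t for t, _ in counts.most_common() if t not in set(query_terms)]
    return assumed_relevant, list(query_terms) + fresh[:n_new]
```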
[Figure 4. The EVA system flowchart.]

After a new neural network has evolved and has been embedded in the present agents, the agents continue to search the WWW under an agent leader, and the agent server continues to re-train and evolve the agents as described above, until the user stops the search. Figure 4 is a flowchart showing the operations of the complete EVA system.
[Figure 2. Terms before automatic relevance (self) feedback.]

[Figure 3. Terms after one run of automatic relevance (self) feedback.]

EVA performs multiple runs of automatic relevance feedback. Figure 2 shows the initial query, which has only two words, "intelligent" and "agents". Figure 3 shows that the EVA agents (crawlers, meta-search agents, etc.) automatically, without any guidance from the user, gather additional terms and information for the user while they are searching the Web. Those terms are further 'evolved' day after day. This is our idea of autonomous agents.

Java has built-in support for multi-threading, which provides a much easier way to build EVA as a multi-agent system. EVA is a multi-agent system on three levels. First, users can use the GUI to create, remove, and specify multiple personal Web agents. Each personal Web agent represents a different user interest, and multiple agents can run concurrently, satisfying the user's different information needs. The second level of multiple agents occurs automatically behind the scenes: when a user finishes specifying her agent and hits the 'Start Agent' button, multiple agents are spawned, each of which searches a different information source, such as a specific search engine or a Web site, to satisfy a common information need. Again, they do so concurrently. The retrieval results from each agent, together with their associated scores (more formally, Retrieval Status Values, or RSVs), are then merged and re-ranked. The final results are shown in the 'Show Results' window. Please note that due to the dynamic nature of the agents, the 'Show Results' window is constantly updated. All diagnostic messages are displayed in the 'Show Log' window. Thirdly, each EVA agent itself is also multi-threaded. The multi-threading model we use is the generic producer/consumer model. The producer thread of an EVA agent checks whether there are more hyperlinks emanating from the current Web page; if there are, their URLs are put in a queue (or list) for future analysis, and the producer thread moves on to the next hyperlink, currently in a breadth-first fashion. The consumer thread performs the detailed Web page (HTML document) analysis and uses its 'brain,' the embedded neural networks, to determine whether the Web page is relevant to the user's interest (a sketch of this model follows below).
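The producer/consumer model can be sketched as follows. EVA itself is written in Java; this Python analogue uses hypothetical helpers (`fetch_page`, `extract_links`, `net.is_relevant`, `report`) to show the division of labor between the two threads.

```python
import queue
import threading

def run_agent(start_url, fetch_page, extract_links, net, report):
    """One EVA-style Web agent with producer and consumer threads.

    fetch_page, extract_links, net.is_relevant, and report are hypothetical
    stand-ins for EVA's Java components.
    """
    frontier = queue.Queue()   # URLs awaiting expansion, FIFO = breadth-first
    pages = queue.Queue()      # fetched pages awaiting detailed analysis
    seen = {start_url}
    frontier.put(start_url)

    def producer():
        # Follow the hyperlinks emanating from each page, queueing new URLs
        # breadth-first and handing the fetched page to the consumer.
        while True:
            url = frontier.get()
            page = fetch_page(url)
            pages.put((url, page))
            for link in extract_links(page):
                if link not in seen:
                    seen.add(link)
                    frontier.put(link)

    def consumer():
        # Detailed HTML analysis: the embedded neural network 'brain'
        # decides whether the page is relevant to the user's interest.
        while True:
            url, page = pages.get()
            if net.is_relevant(page):
                report(url, page)

    threading.Thread(target=producer, daemon=True).start()
    threading.Thread(target=consumer, daemon=True).start()
```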
6. EXPERIMENTAL RESULTS
Measuring the performance of an agent system such as EVA is hard, because the agents' longer-term learning and adaptation cannot be evaluated right away while the system is still under construction. Therefore, the following numbers should be used only as an indicator of the system's retrieval performance at the moment, and of its usefulness. Table I shows EVA's retrieval effectiveness as measured by standard performance metrics, precision at 10 and precision at 20, commonly used in information retrieval on the Web.

                                        Precision at 10    Precision at 20
Before Automatic Relevance Feedback          0.753              0.656
After Automatic Relevance Feedback           0.830              0.748

Table I. Results based on 10 geospatial queries such as accelerometers, gyrocompasses, thermal mappers, etc.
For the text categorization aspect, in addition to the results we have already reported in [40] based on the Reuters news story collection, the de facto standard benchmark data set, we have two new sets of results to report, based on Yahoo categories. Table II shows the results of using just the neural net part of our neuro-genetic algorithm, while Table III shows the results of the full neuro-genetic algorithm. The last rows in those two tables are the results based on ensemble averaging [10]; they are even better than the individual results. Ensemble averaging is a linear combination of the outputs from 10 neural and neuro-genetic classifiers, respectively.

We use the standard performance measures in this area to evaluate our system's performance: precision, recall, and the break-even point. Precision is the ratio between the number of correct category assignments and the total number of category assignments the system has made. Recall is the ratio between the number of correct category assignments and the number of category assignments that should have been made. Neural networks output real-valued probabilities, so by adjusting the threshold, or decision boundary, one can trade precision against recall. Usually recall and precision are inversely related, and the break-even point is where they are equal.
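For concreteness, here is a small sketch of how these measures and the break-even point can be computed from a classifier's real-valued outputs; the threshold sweep granularity is an assumption.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall of thresholded classifier outputs (1 = relevant)."""
    assigned = [s >= threshold for s in scores]
    correct = sum(a and l for a, l in zip(assigned, labels))
    made = sum(assigned)                  # assignments the system made
    should = sum(labels)                  # assignments that should have been made
    precision = correct / made if made else 1.0
    recall = correct / should if should else 1.0
    return precision, recall

def break_even(scores, labels, steps=100):
    """Sweep the decision threshold to find where precision and recall meet."""
    best_t, best_gap = 0.5, 1.0
    for i in range(steps + 1):
        t = i / steps
        p, r = precision_recall(scores, labels, t)
        if abs(p - r) < best_gap:
            best_t, best_gap = t, abs(p - r)
    p, r = precision_recall(scores, labels, best_t)
    return (p + r) / 2                    # break-even value of precision/recall
```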
[Table II. Performance of automatic Web page categorization using neural networks and Yahoo categories: per-category results for ten runs (RUN1-RUN10), their average, and an ensemble-averaged final row.]

[Table III. Performance of automatic Web page categorization using a neuro-genetic algorithm, which is better than Table II: the same ten runs, their average, and an ensemble-averaged final row.]

Please note that the category GIS is treated as a special case, because its training data at Yahoo is too diversified. It is listed here because of its relevance to our current emphasis on the geospatial information sources in the project. The other 9 categories (the 9Cat column in Tables II and III) provide a more reasonable indicator of our system's categorization performance.

7. CONCLUSION

In this paper, we have described our EVA agent system, which uses our neuro-genetic algorithm and text categorization technology to build search agents. EVA search agents can autonomously access, classify, filter, and fuse multi-media information from diverse online multi-media information sources on the ever-changing World Wide Web. The neuro-genetic algorithm can learn from examples and evolve the best feature set for the classification problems at hand, using as few input features as possible. Our new text categorization technology also uses the neuro-genetic algorithm, which can learn to automatically categorize and classify textual documents with high accuracy, using as few terms as possible. Additionally, we have developed a technique for integrating meta-searching and Web-crawling to produce intelligent agents that can retrieve documents more efficiently, and a self-feedback or automatic relevance feedback algorithm that automatically trains and improves the performance of the Web search agents, without human intervention. This algorithm, together with the neuro-genetic technology, has greatly enhanced the autonomy of the search agents.

Business intelligence portals have recently emerged to provide users with an invaluable tool for successfully navigating their business information. They provide a single access point to an organization's critical business information repositories, such as analytic applications, production reports, and query/analysis tools. Most enterprises and organizations recognize that both timely access to information and efficient management of it are essential to their viability. According to industry analysts, portals are primed to be the new trend in the business intelligence arena. We have developed our EVA system so that it can help users construct business intelligence portals automatically. Furthermore, due to the proliferation of electronic documents on the Web and other information sources, the importance of text mining has increased dramatically. One important goal in text mining is automatic classification of electronic documents. Our neuro-genetic algorithm and EVA agent technology were designed for that purpose first and foremost.
8. ACKNOWLEDGEMENTS
The ongoing research reported here is supported in part by NIMA (National Imagery and Mapping Agency) University Research Initiative Grant NMA202-97-1-1025.

9. REFERENCES

[1] E. Carmel, S. Crawford & H. Chen, 1992, Browsing in hypertext: a cognitive study. IEEE Trans. on Systems, Man and Cybernetics, 22(5), 865-884.

[2] Y. Chauvin & D.E. Rumelhart (eds.), 1995, Backpropagation: Theory, Architectures, and Applications, Lawrence Erlbaum.

[3] H. Chen, Y. Chung & M. Ramsey, 1998, A smart itsy bitsy spider for the Web, Journal of the American Society for Information Science, 49(7):604-618.

[4] L. Chen & K. Sycara, 1998, WebMate: a personal agent for browsing and searching. Proc. AA '98, pp. 132-138.

[5] G. Cybenko, 1989, Approximation by superposition of a sigmoidal function. Mathematics of Control, Signals & Systems, v.2, pp.303-314.

[6] S. Deerwester, August 26, 1996, Next generation of search engines find their feet. Financial Times, p.14.

[7] K. Funahashi, 1989, On the approximate realization of continuous mappings by neural networks. Neural Networks, vol.2, pp.183-192.

[8] K. Funahashi, 1998, Multilayer neural networks and Bayes decision theory. Neural Networks 11, pp.209-213.

[9] R.D. Hackathorn, 1999, Web Farming for the Data Warehouse, Morgan Kaufmann.

[10] S. Haykin, 1999, Neural Networks: A Comprehensive Foundation, 2nd edition, Prentice Hall.

[11] K. Hornik, 1989, Multilayer feedforward networks are universal approximators, Neural Networks, v.2, pp.359-366.

[12] T. Joachims, D. Freitag & T. Mitchell, 1997, WebWatcher: a tour guide for the World Wide Web, Proceedings of IJCAI 97.

[13] G.H. John, R. Kohavi & K. Pfleger, 1994, Irrelevant features and the subset selection problem, Proceedings of the 11th International Conference on Machine Learning, pp.121-129.

[14] P. Langley, 1994, Selection of relevant features in machine learning, Proceedings of the AAAI Fall Symposium on Relevance, pp.1-5.

[15] E.D. Liddy, W. Paik & E.S. Yu, 1994, Text categorization for multiple users based on semantic information from a MRD, ACM Trans. on Information Systems, v.12, no.3, pp.278-295.

[16] E.D. Liddy, W. Paik, E.S. Yu & M.E. McKenna, 1994, Document retrieval using linguistic knowledge. Proc. RIAO '94.

[17] E.D. Liddy, W. Paik, E.S. Yu & M. McKenna, 1995, A natural language text retrieval system with relevance feedback. Proceedings of the 16th National Online Meeting.

[18] P. Maes, 1994, Agents that reduce work and information overload, Communications of the ACM, vol.37, no.7, pp.31-40.

[19] C.D. Manning & H. Schutze, 1999, Foundations of Statistical Natural Language Processing, MIT Press.

[20] D. Michie, D.J. Spiegelhalter & C.C. Taylor, 1994, Machine Learning, Neural and Statistical Classification, Ellis Horwood.

[21] M. Mitchell, 1996, An Introduction to Genetic Algorithms, MIT Press.

[22] T.M. Mitchell, 1997, Machine Learning, McGraw Hill.

[23] D. Mladenic, 1999, Text-learning and related intelligent agents: a survey, IEEE Intelligent Systems, vol.14, no.4, pp.44-54.

[24] H.T. Ng, W.B. Goh & K.L. Low, 1997, Feature selection, perceptron learning, and a usability case study for text categorization, Proceedings of ACM SIGIR '97, pp.67-73.

[25] H.S. Nwana & D.T. Ndumu, 1997, An introduction to agent technology, Software Agents and Soft Computing, pp.3-26.

[26] M. Pazzani, J. Muramatsu & D. Billsus, 1996, Syskill & Webert: identifying interesting Web sites. Proc. of the AAAI conference.

[27] M. Pazzani & D. Billsus, 1997, Learning and revising user profiles: the identification of interesting Web sites. Machine Learning, v.27, pp.313-331.

[28] H. Schutze, D.A. Hull & J.O. Pedersen, 1995, A comparison of classifiers and document representations for the routing problem. Proceedings of the ACM SIGIR '95 conference.

[30] G. Syswerda, 1989, Uniform crossover in genetic algorithms, Proc. of the 3rd International Conference on Genetic Algorithms, pp.2-9.

[31] URL: http://metabot.kinetoscope.com/docs/docs.html

[32] URL: http://www.agentware.com/

[33] URL: http://www.wisewire.com/

[34] C.J. van Rijsbergen, 1979, Information Retrieval, 2nd edition, Butterworths.

[35] S.M. Weiss et al., 1999, Maximizing text-mining performance, IEEE Intelligent Systems, vol.14, no.4, pp.63-69.

[36] H. White, 1990, Connectionist nonparametric regression: multilayer feedforward networks can learn arbitrary mappings, Neural Networks, vol.3, pp.535-549.

[37] J. Yang & V. Honavar, 1998, Feature subset selection using a genetic algorithm, IEEE Intelligent Systems, v.13, no.2, pp.44-49.

[38] Y. Yang, 1996, Sampling strategies and learning efficiency in text categorization, Proceedings of the AAAI Spring Symposium on Machine Learning in Information Access.

[39] Y. Yang & J.O. Pedersen, 1997, A comparative study on feature selection in text categorization, Proceedings of the 14th International Conference on Machine Learning, pp.317-332.

[40] E.S. Yu & E.D. Liddy, 1999, Feature selection in text categorization using the Baldwin effect, Proceedings of IJCNN '99, IEEE Press.

[41] E.S. Yu & E.D. Liddy, Feature selection in text categorization using the Baldwin effect, in Cognitive and Neural Models for Character Recognition and Document Processing (to appear).