
Suggy: An Automatic Query Refinement Agent

F. Abbattista, F. Esposito, N. Fanizzi, S. Ferilli, G. Semeraro
Dipartimento di Informatica, Università di Bari, Via Orabona 4, 70126 Bari, Italy
Tel. +39 80 5443264, Fax +39 80 5443196, {fabio, esposito, fanizzi, ferilli, semeraro}@di.uniba.it

Abstract. We present the agent Suggy, whose aim is the refinement of queries in an Information Retrieval environment on the Internet. It integrates standard Term Frequency and Document Frequency techniques with heuristics that help in choosing the most characterizing keywords. Experiments were carried out to confirm that the ideas underlying Suggy can indeed improve both user satisfaction with search results and the efficiency of the refinement process.

1. Introduction

This paper deals with the problem of Information Retrieval (IR) on the Web (Baeza-Yates & Ribeiro-Neto, 1999), i.e. how to cope with searching a huge amount of unstructured information. Ideally, an IR system should impose no constraints on the type and structure of the information it stores and retrieves but, in practice, most IR systems analyze and process textual information during retrieval.

Suggy is a software agent (Bradshaw, 1997) that implements new query specialization techniques for repositories containing unstructured data. Documents returned by an initial query are examined by the user, who can select interesting pieces of text to be stored in a personal repository. The agent tackles the large amount of data by scanning the personal repository and suggesting new queries, in order to increase the number of hits that are interesting to the user. In particular, we address the problem of assisting users during information extraction from the Web (Cheong, 1996). The first query must be issued by the user, since the best query meeting his search needs cannot be known a priori. The query is then refined, step by step, towards an optimal one on the basis of the documents returned by the previous search.

Suggy has been developed in collaboration with IBM Semea Sud and has been integrated into CDL (Corporate Digital Library), a larger ongoing project at the Computer Science Department of the University of Bari [EMS+98; CESF99]. The idea behind the integration is to help the user collect several types of documents: Suggy makes suggestions during the querying process, allowing the user to extract from CDL only those documents which satisfy his needs. These documents are stored in his private library and later used to refine new queries.

The rest of the paper is organized as follows. Section 2 reports related work in this field; Section 3 presents the system Suggy, while Section 4 shows experimental results. Lastly, Section 5 draws some conclusions.

2. Related Work

Information extractors are a major component of systems that interact with human users, since being able to provide users with correct information (i.e., information they asked for) is a critical factor.

In this context, the Internet represents a challenging field because of the huge amount of on-line data and the great number of data repositories. In recent years, information extraction technology has grown from simple search engines (Altavista, Yahoo!, etc.) to automated data mining and information extraction agents and, currently, to intelligent autonomous agents (personal assistants) [Sho93; LM97]. In the following, we analyze some information extraction agents for the Web.

Virtual Agent (VA) (Virtual Resources Corporation) is an automated job agency that relates job seekers with job suppliers. Based on the user's personal data (name, surname, phone and fax number, e-mail address) and professional profile (preferences, skills and education levels, past experiences and future expectations), VA generates a query made up of sets of concepts extracted from the user profile: Occupation concepts (the kind of job and company), Location concepts (where the user would like to work), Benefit concepts (benefits required: insurance, pension, etc.), Skill concepts (skills and experiences), Education concepts (information about education). Users can refine the query by prioritizing these concepts and classifying them as required, desired, prohibited (this class allows documents to be excluded from the search) or discarded. In the search phase, VA sorts the retrieved documents on the basis of their correspondence to the user query. It exploits a statistical indexing technique (more effective than simple index or keyword algorithms) and is able to overcome some typing errors or inconsistencies in the user profile.

Firefly (FF) (LA Agents) is an agent that creates and manages a virtual community of users based on common interests. To join communities, users must register, entering their personal data and preferences. Subsequently, FF allows them to contact each other and to communicate with friends (users sharing the same interests). FF is composed of three modules: Firefly Passport, where the user profile is stored; Firefly Passport Office, which manages passports and allows communication among friends; Firefly Catalog Navigator, which customizes relations between user interests and company products.

WBI (IBM) is a personal agent that remembers all previously visited web sites and searched information. WBI provides several services to help users trace their web browsing: Personal history (to maintain the history of web accesses and information about pages containing a specific keyword), Watch (to monitor site updates), Shortcuts (to directly reach frequently used pages), Path (to store the path towards an already visited web page), Traffic light (to measure the access speed of a link), Offline browsing (to copy visited sites to the local disk).

FinanceWise (FW) (Wise) is a search agent specialized in the banking and financial domain. FW provides three different search procedures. Keyword Search performs a simple search, based on a boolean combination of keywords, over all Internet financial sites. Search results are sorted by their correspondence to the keywords and then graphically displayed by means of colored circles (whose number is proportional to the correspondence level of the document).
By means of the Smart Search module, users choose a particular kind of information from a pre-defined set, such as: Name of the company (to search for its address), Organization type (to search for companies by category), Country and Region (to search for financial companies in a geographical region), Main Language and English language (to search for financial sites on the basis of the language used), Site access (to search for financial sites based on the kind of access: free, registration, charge, etc.). Lastly, the Search by Sector module allows users to select, from a pre-defined list, one or more financial categories and, possibly, subcategories to which the sites should belong.

3. The System Suggy

In a document retrieval system, the main goal is to acquire the greatest number of documents that fulfill the user's search criteria. This can be achieved in two steps. The former involves a query suggester that can formulate refined queries on the basis of the documents retrieved by an initial query, storing the results in a local repository. In the latter, a classifier is used to cluster the documents saved along previous retrieval sessions, in order to generate a hierarchy of classes that the user can browse when searching for documents of interest.

Suggy is a system for the intelligent refinement of queries. It can be used in the former step to extract the most relevant keywords from textual data. For instance, a typical e-commerce web site presents a number of items on sale (books, CDs, etc.), each associated with a textual comment. Once users have chosen some items, hence showing their interest in them, Suggy stores the associated comments (as text-units) in a repository, whose analysis allows relevant phrases to be processed rather than whole documents. This considerably reduces the cost of the keyword extraction process.

Each text-unit saved in the repository is maintained as a list of words, which is further reduced by removing all the terms contained in a stop-list. Porter's suffix elimination algorithm (Porter, 1980) is then applied to the resulting list to remove all suffixes and collect all terms under the same stem. It treats complex suffixes as compositions of simple ones and deletes them in a finite number of steps. Hence, the suggested keywords are very general, since they consist of mere stems.

The repository can be regarded as a matrix A of size m×n, where m is the number of distinct terms and n is the total number of text-units in the repository [SM83]. Each entry Aij of A represents the weight of the i-th term in the j-th text-unit according to the TFDF (Term Frequency per Document Frequency) method (Sparck Jones & Willett, 1997):

Aij = tfij · (dfi / n)

where tfij is the term frequency (i.e., the number of occurrences of the i-th word in the j-th text-unit) and dfi is the document frequency (i.e., the number of text-units in the repository that contain the i-th word).

The choice of the TFDF method for assigning weights to the words in the text-units is motivated by various considerations. An ideal IR system should produce high recall values (i.e., it should find all the interesting items) and high precision values (by eliminating the non-interesting ones). Terms that occur with a high frequency in single documents increase recall; this means that such terms should have high values of both Term Frequency and Document Frequency. However, with the TFDF method, recall is increased at the cost of low precision whenever the high-frequency terms are not concentrated in a few documents but are spread over the entire collection. Hence, a further factor must be taken into account, one that favors the extraction of terms concentrated in few documents. In TFIDF (Term Frequency per Inverse Document Frequency) (Salton & Buckley, 1988), such a factor is the Inverse Document Frequency, which varies inversely with the number of documents containing a given term. Thus, the most discriminative terms in a document should have a high Term Frequency and a low Document Frequency.
This is captured by the formula:

Aij = tfij · log(n / dfi)
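For concreteness, the following Python sketch shows one way the weighting could be computed for a small repository of text-units. The stop-list, the helper names, and the use of NLTK's PorterStemmer as a stand-in for a hand-written Porter stemmer are illustrative assumptions, not part of the original system.

import math
from collections import Counter
from nltk.stem import PorterStemmer   # stands in for Porter's (1980) suffix-stripping algorithm

STOP_LIST = {"the", "a", "an", "of", "and", "to", "in", "for", "is", "on"}  # illustrative stop-list
stemmer = PorterStemmer()

def preprocess(text_unit):
    """Tokenize a text-unit, drop stop-list terms and reduce the rest to stems."""
    tokens = [t.lower() for t in text_unit.split() if t.isalpha()]
    return [stemmer.stem(t) for t in tokens if t not in STOP_LIST]

def build_matrices(text_units):
    """Return (terms, A_tfdf, A_tfidf); rows are terms, columns are text-units."""
    docs = [Counter(preprocess(u)) for u in text_units]            # term frequencies per text-unit
    n = len(docs)
    terms = sorted({t for d in docs for t in d})
    df = {t: sum(1 for d in docs if t in d) for t in terms}        # document frequency of each term
    a_tfdf  = [[d[t] * (df[t] / n)         for d in docs] for t in terms]   # tf_ij * df_i / n
    a_tfidf = [[d[t] * math.log(n / df[t]) for d in docs] for t in terms]   # tf_ij * log(n / df_i)
    return terms, a_tfdf, a_tfidf

Note that, with this weighting, a term occurring in every text-unit gets a TFIDF weight of zero, which is exactly the non-discriminative behaviour discussed above.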

The keywords to be exploited by the methods above are extracted by means of heuristics. The heuristic technique used by Suggy is structured in two phases. In the former, the set of the most important keywords is drawn from the repository. In the latter, for each keyword extracted, other related terms are selected to be combined with the starting keyword. Five heuristics have been developed to cope with the extraction of characteristic terms from the repository.

The first heuristic aims at extracting the most discriminating terms from the pair of most similar text-units. Such a similarity is computed by means of distance measures, since documents are represented as vectors in a vector space. In particular, cosine similarity, which measures the cosine of the angle between two vectors in the space, was used. Given two vectors X = (x1, …, xn) and Y = (y1, …, yn), the formula is:

sim(X, Y) = Σ_{i=1..n} xi·yi / ( √(Σ_{i=1..n} xi²) · √(Σ_{i=1..n} yi²) )
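A possible rendering of this measure in Python, where the vectors are the columns of A (one term-weight vector per text-unit); the function names are illustrative:

import math

def cosine_similarity(x, y):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    nx = math.sqrt(sum(xi * xi for xi in x))
    ny = math.sqrt(sum(yi * yi for yi in y))
    return dot / (nx * ny) if nx and ny else 0.0

def most_similar_pair(unit_vectors):
    """Indices of the two text-unit vectors with the highest cosine similarity."""
    pairs = [(i, j) for i in range(len(unit_vectors)) for j in range(i + 1, len(unit_vectors))]
    return max(pairs, key=lambda p: cosine_similarity(unit_vectors[p[0]], unit_vectors[p[1]]))

The pair returned here is the one from which the first heuristic draws its terms, as described below.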

After selecting the pair of most similar text-units, the vectors that represent them are sorted and, for each of them, the two highest TFIDF values are stored; only the terms whose weight equals these computed maxima are extracted.

The second heuristic extracts the most discriminative terms from the document whose corresponding vector has the greatest norm. The norm used for a vector V = (v1, …, vn) is the following:

||V|| = Σ_{i=1..n} vi

Since the elements of the vector are positive (indeed, they are weights), this norm corresponds to their sum. At this point, the vector representing the text-unit with the greatest norm is sorted in order to extract the four greatest TFIDF values, and only the terms whose weight equals these computed maxima are taken into account.

The third heuristic differs from the second one only in the norm used, that is:

||V|| = Σ_{i=1..n} vi²
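The second and third heuristics differ only in the norm they maximize; a compact sketch, reusing the terms/matrix layout of the earlier sketch and with illustrative names, could be:

def terms_from_strongest_unit(terms, a_tfidf, norm, k=4):
    """Pick the text-unit whose vector has the greatest norm and return the terms
    carrying its k highest TFIDF weights (ties included)."""
    n_units = len(a_tfidf[0])
    columns = [[row[j] for row in a_tfidf] for j in range(n_units)]   # one vector per text-unit
    best = max(columns, key=norm)                                     # text-unit with greatest norm
    maxima = set(sorted(best, reverse=True)[:k])                      # the k highest weights
    return {terms[i] for i, w in enumerate(best) if w in maxima and w > 0}

l1_norm = lambda v: sum(v)                  # second heuristic: sum of the (positive) weights
sq_norm = lambda v: sum(w * w for w in v)   # third heuristic: sum of the squared weights

Calling terms_from_strongest_unit(terms, a_tfidf, l1_norm) or terms_from_strongest_unit(terms, a_tfidf, sq_norm) then yields the candidate keywords for the second and third heuristics, respectively.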

The fourth heuristic can be regarded as the query refinement heuristic, since it extracts a set ST of specialization terms and a set GT of generalization terms. In case the results of previous search sessions are too generic (resp., too specific), the user can combine the initial query with the terms in ST (resp., GT) through the boolean operator AND (resp., OR). The repository is represented in two different ways: the matrix A defined above, holding TFIDF weights, and a matrix B of the same dimensions, whose entry Bij is the TFDF value of the i-th term in the j-th text-unit. The heuristic extracts terms from the whole repository rather than from single documents: in a former phase it singles out the most discriminative terms by comparing the TFIDF values in matrix A, which yields ST; in the latter phase, it selects the terms occurring in most of the documents by comparing the TFDF values in matrix B, which form GT. Hence, these sets can be exploited to tune precision and recall, respectively.

The fifth heuristic, used by Suggy to formulate new queries, operates in two phases: keyword extraction and query composition. During the former, the four most discriminating keywords are extracted from the repository; during the latter, two related terms are derived for each of them, which, combined with the initial keyword, make up the new query.
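The fourth and fifth heuristics can be sketched along the same lines. The text does not specify how the two related terms are chosen for each keyword, so the sketch below simply takes the highest-weighted terms co-occurring with it; all names, as before, are illustrative assumptions.

def specialization_terms(terms, a_tfidf, k=4):
    """ST: the k terms with the highest TFIDF weight anywhere in the repository."""
    ranked = sorted(range(len(terms)), key=lambda i: max(a_tfidf[i]), reverse=True)
    return [terms[i] for i in ranked[:k]]

def generalization_terms(terms, a_tfdf, k=4):
    """GT: the k terms spread over most text-units (highest overall TFDF weight)."""
    ranked = sorted(range(len(terms)), key=lambda i: sum(a_tfdf[i]), reverse=True)
    return [terms[i] for i in ranked[:k]]

def suggest_queries(terms, a_tfidf, k=4, extra=2):
    """Fifth heuristic: k discriminative keywords, each expanded with `extra` related terms."""
    queries = []
    for kw in specialization_terms(terms, a_tfidf, k):
        kw_row = a_tfidf[terms.index(kw)]
        cols = [j for j, w in enumerate(kw_row) if w > 0]              # text-units mentioning kw
        related = sorted(
            (t for i, t in enumerate(terms)
             if t != kw and any(a_tfidf[i][j] > 0 for j in cols)),     # co-occurring terms
            key=lambda t: max(a_tfidf[terms.index(t)]), reverse=True)
        queries.append(" ".join([kw] + related[:extra]))
    return queries

A too-generic initial query q can then be specialized as "q AND t" for a term t in ST, or broadened as "q OR t" for a term t in GT, exactly as described above.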

4. Experimental Results

The main goals of the experiments were to verify the effectiveness of the query extraction method employed by Suggy, in terms of the amount of retrieved documents matching user interests as well as the number of refinement steps required.

The experimental laboratory. All of the experiments were performed in a simulated personal Web, dealing with seven different subjects of interest: Jazz Music, Cars, Motorcycles, Astronomy, Wine, Artificial Intelligence and Foreign Commerce. For each subject, 50 pages were extracted from the Web using the search agent Copernic98 (Copernic Technologies Inc.). Our simulated user was particularly interested in search agents and wine.

The experimental protocol. The simulated user starts his search with a generic initial query and, if he is not satisfied by the results, he extracts relevant text-units from the retrieved documents and stores them in a personal repository. Suggy exploits the personal repository to define new, more refined queries. In all of the experiments, we used 8 personal repositories of increasing size (1, 2, 3, 4, 5, 10, 15 and 20), each included in the larger ones. Precision and recall have been used to evaluate the fitness of each query.

Queries on search agents. In these experiments the simulated user starts with the generic initial query (Q0) "retrieval systems". From each of the 8 personal repositories, Suggy extracts 4 new queries; for example, some of the most frequently suggested queries were "agent comput learn" and "inform agent comput". Table 1 reports the fitness of the suggested queries for each repository; the fitness of the initial query was Precision (P) = 0.22 and Recall (R) = 0.80. The results show that increasing the personal repository size has a strong impact on the effectiveness of the suggested queries. The precision is always better than that achieved by the initial query, and it grows as the repository gets larger. The recall is always much better than that of the initial query, and it reaches its maximum value even with small repositories.

Table 1. The fitness values of the suggested queries for the subject "search agent".

            Q1               Q2               Q3               Q4
Rep. 1      P=0.40 R=0.94    P=0.30 R=0.94    P=0.28 R=0.94    P=0.38 R=0.94
Rep. 2      P=0.34 R=1.00    P=0.34 R=1.00    P=0.34 R=1.00    P=0.27 R=1.00
Rep. 3      P=0.34 R=1.00    P=0.30 R=1.00    P=0.34 R=1.00    P=0.28 R=0.94
Rep. 4      P=0.34 R=1.00    P=0.30 R=1.00    P=0.28 R=1.00    P=0.34 R=1.00
Rep. 5      P=0.40 R=1.00    P=0.32 R=1.00    P=0.30 R=1.00    P=0.40 R=1.00
Rep. 10     P=0.50 R=1.00    P=0.43 R=1.00    P=0.34 R=1.00    P=0.35 R=1.00
Rep. 15     P=0.50 R=1.00    P=0.50 R=1.00    P=0.57 R=1.00    P=0.42 R=1.00
Rep. 20     P=0.68 R=1.00    P=0.57 R=1.00    P=0.68 R=1.00    P=0.68 R=1.00

Table 2. The fitness values of the suggested queries for the subject "wine".

            Q1               Q2               Q3               Q4
Rep. 1      P=0.17 R=0.48    P=0.16 R=0.50    P=0.14 R=0.48    P=0.15 R=0.50
Rep. 2      P=0.36 R=0.98    P=0.25 R=0.98    P=0.35 R=0.98    P=0.30 R=0.98
Rep. 3      P=0.41 R=0.98    P=0.25 R=0.98    P=0.40 R=0.98    P=0.35 R=0.98
Rep. 4      P=0.36 R=0.98    P=0.36 R=0.98    P=0.36 R=0.98    P=0.25 R=0.98
Rep. 5      P=0.36 R=0.98    P=0.36 R=0.98    P=0.41 R=0.98    P=0.25 R=0.98
Rep. 10     P=0.41 R=0.98    P=0.40 R=0.98    P=0.40 R=0.98    P=0.40 R=0.98
Rep. 15     P=0.43 R=0.98    P=0.40 R=0.98    P=0.40 R=0.98    P=0.40 R=0.98
Rep. 20     P=0.43 R=0.98    P=0.65 R=0.98    P=0.65 R=0.98    P=0.40 R=0.98

Queries on wine. In these experiments the simulated user starts with the generic initial query (Q0) "grape producer". From each of the 8 personal repositories, Suggy extracts 4 new queries; for example, some of the most frequently suggested queries were "wine shop club" and "spirit wine produc". Table 2 reports, for each repository, the fitness of the suggested queries; the fitness of the initial query was Precision (P) = 0.12 and Recall (R) = 0.40. Also in this second experiment, increasing the size of the personal repository has a strong impact on the effectiveness of the suggested queries. The resulting precision is always better than that achieved by the initial query, but only for the larger repositories does its value rise significantly with respect to the smaller ones. The resulting recall is always much better than that of the initial query, and it nearly reaches its optimum value even with small repositories.

Discussion. The results reported in Tables 1 and 2 provide convincing evidence for the hypothesis that the performance of Suggy is strongly influenced by the repository size. However, another critical factor is the user's knowledge of the subject. In fact, this knowledge drives the text-unit extraction process and, as a consequence, determines the quality of the information stored in the repository. Expert users (as in the experiments on "search agents") are able to extract few but very significant text-units from the retrieved documents and to build repositories containing little but expressive information. Novice users (as in the second experiment, about "wine") lack the skill to extract relevant text-units and, in this case, increasing the repository size does not necessarily yield better results.

5. Conclusions and Future Work

We have presented the agent Suggy, whose goal is the refinement of queries in an Information Retrieval environment on the Web. It integrates standard Term Frequency and Document Frequency techniques with heuristics that help in choosing the most characterizing keywords. Experiments have verified that the ideas underlying Suggy can indeed improve both user satisfaction with search results and the efficiency of the refinement process. Future work will concern the integration, within the Personal Digital Library system, of this agent with a machine learning tool for clustering the documents resulting from several query sessions into a hierarchy that the user can browse when searching for documents of interest.

References

1. Bradshaw, J. M. (ed.): Software Agents. AAAI/MIT Press, 1997.
2. Baeza-Yates, R. & Ribeiro-Neto, B.: Modern Information Retrieval. Addison Wesley Longman, 1999.
3. Costabile, M. F., Esposito, F., Semeraro, G. & Fanizzi, N.: An Adaptive Visual Environment for Digital Libraries. International Journal on Digital Libraries, 2:124-143, Springer, 1999.
4. Cheong, F.: Internet Agents: Spiders, Wanderers, Brokers, and Bots. New Riders, Indianapolis, IN, 1996.
5. Copernic Technologies Inc.: Copernic98 (http://www.copernic.com).
6. Esposito, F., Malerba, D., Semeraro, G., Fanizzi, N. & Ferilli, S.: Adding Machine Learning and Knowledge Intensive Techniques to a Digital Library Service. International Journal on Digital Libraries, 2(1):3-19, Springer, 1998.
7. IBM: WBI (http://www.almaden.ibm.com/cs/wbi).
8. LA Agents: Firefly (http://www.firefly.net).
9. Lesnick, L. L. & Moore, R. E.: Creating Cool Intelligent Agents for the Net. IDG, Foster City, CA, 1997.
10. Porter, M. F.: An Algorithm for Suffix Stripping. Program, 14(3):130-137, 1980.
11. Salton, G. & Buckley, C.: Term-Weighting Approaches in Automatic Text Retrieval. Information Processing and Management, 24(5):513-523, 1988.
12. Shoham, Y.: Agent-Oriented Programming. Artificial Intelligence, 60(1):51-92, 1993.
13. Sparck Jones, K. & Willett, P. (eds.): Readings in Information Retrieval. Morgan Kaufmann, San Mateo, CA, 1997.
14. Virtual Resources Corporation: Virtual Agent (http://www.careersite.com).
15. Wise Ltd.: FinanceWise (http://www.financewise.com).
