QUERY EXPANSION USING AN INTRANET-BASED SEMANTIC NET
Dick Stenmark
Dept. of Informatics, Göteborg University
P.O. Box 620, SE-40530 Göteborg, Sweden
[email protected]

Abstract
Many of today's web search engine users submit single-word queries, resulting in imprecise and overwhelming result sets. This paper describes the implementation of a query expansion application prototype, i.e., a tool that augments the original query with additional, and hopefully relevant, terms. The prototype uses a semantic net to represent relationships between words and concepts used on a corporate intranet. The aim has been to increase precision and to prevent users from receiving too many documents. The results, compared to those of an unmodified search, show a gain in precision at the cost of lower recall. Another important lesson from the experiment is that the knowledge acquisition cost of manually building the semantic net may be too high. The paper closes with a discussion of the pros and cons of semantic net-based query expansion.
Keywords: query expansion, semantic net, intranet, information retrieval

1. INTRODUCTION
Search engines are often promoted as the answer to the problem of information overload caused by the wealth of information made available through the World Wide Web. However, search engines have a difficult time sorting out the relevant information from what is useless given a certain query. This is not only due to technical shortcomings of the tools but also due to human behaviour. It has been noted that the average Internet search engine query consists of 1.5 keywords and that a vast majority of users submit single-keyword queries only (Pinkerton, 1994). The "vocabulary problem" (Furnas et al., 1987), i.e., the ambiguity of natural languages, makes efficient retrieval even more difficult. As the size of the indices continues to grow, the single-keyword query will match, and hence also retrieve, more and more documents, of which a growing portion will be irrelevant. Even if the "right" keyword is chosen, the most relevant documents may be hidden behind loads of irrelevant material. We see the same thing happening on corporate intranets: these internal webs have grown so large that what used to be a global Internet problem is now occurring locally as well. A solution would be to provide more context by supplying not one but a number of keywords, but people are lazy: they do not understand the syntax of web search and they do not know which keywords to combine. If the search engine could assist the users by recommending terms that better describe their area of interest, the situation might improve. This is the premise behind query expansion (QE), which has been used since the early days of information retrieval. QE is "a process of adding new terms to a given query in an attempt to provide better contextualization (and hopefully retrieve documents which are more useful to the user)" (Baeza-Yates & Ribeiro-Neto, 1999, p. 449).
Since the search engine has processed the contents of the net, should it not be possible to use that internal knowledge to expand the initial query by putting the query terms in context? We think so, especially when looking at a smaller subset of the web such as a corporate intranet. Next, we review some previous work in this field and thereafter describe the semantic net used in our experiment. Section 4 starts with a description of the problems we had trying to understand the relationships that existed and manually load them into the net. We then describe how searching was done and what scoring algorithm was used to compare the returned documents, and present the results of the evaluation. The paper finishes with a concluding discussion and some thoughts on future work.

2. RELATED WORK
Quite a lot of research has been carried out in the area of query expansion. Basically, three modes of QE have been identified: manual, automatic, and interactive. Irrespective of the mode, the QE may be based on (previously) retrieved search results or on some a priori knowledge structure, which in turn may or may not be collection dependent (see figure 1). Due to the unwillingness of most non-professional information seekers to invest the additional effort required to manually produce more relevant search queries, much of the QE research has focused on automatic methods (cf. Mitra et al., 1998; Mano & Ogawa, 2001; Sahlgren et al., 2002). We too adhere to this approach, but we differ from the mainstream by basing our prototype on a semantic net. The only previous reference to semantic net-based QE found in the literature is McGuinness et al. (1997), who exploited built-in features of the commercially available search engine Verity's topic set tool. McGuinness et al. (1997) identified three areas where query expansion using semantic nets may prove useful: i) to limit the scope of the search, ii) to exploit knowledge of the structure of the domain, and iii) to exploit

understanding of the contents of the site. Our work touches upon all of these areas as we try to find out whether it is possible to capture the structure of a corporate intranet and to understand the relationships that exist between the topics; this was the main reason for choosing the semantic net approach. Not leaning on a commercial product, however, we are looking for a more general and vendor-independent way of representing knowledge.

[Figure: a tree rooted in "Query Expansion", branching into "Manual QE", "Automatic QE", and "Interactive QE". "Automatic QE" branches into "Based on Search Results" and "Based on Knowledge Structures", and the latter into "Collection Dependent" and "Collection Independent".]

Figure 1. Different possible strategies for query expansion (Efthimiadis, 1996)

Our QE strategy has been to use automatic QE techniques and to base the expansion on knowledge structures that are specific to the context, i.e. collection dependent.

3. IMPLEMENTING A SEMANTIC NET
A semantic net is a graphical depiction of relationships between concepts or topics of a specific area. Nodes represent objects and links indicate relationships, as illustrated in figure 2. The nodes are ordered hierarchically: a class may be a subclass of another class, while individual objects are instances of classes. A class has all the attributes of its super-class and may also have additional attributes. An instance should have values for all the attributes of the class of which it is an instance; since it is also an instance of the super-class, it should have values for those attributes as well. Instances inherit values, and only when wanting to override a default does one need to store a value at the instance level.

[Figure: "Pingu" is an Instance_of "Bird", which is a Subclass_of "Animal". Animal has the attribute birth = egg; Bird has covering = feather and locomotion = fly; Pingu overrides locomotion = walk.]

Figure 2. A typical semantic net with instances and attributes.
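The inheritance behaviour described above — an instance taking attribute values from its class and super-class unless it stores an override locally — can be sketched in a few lines of Python. This is a minimal illustration of the general mechanism, not the authors' implementation; all names are hypothetical and mirror figure 2.

```python
# Minimal sketch of attribute inheritance in a semantic net.
# A node looks up an attribute locally first, then walks up the
# subclass_of / instance_of chain until a value is found.

class Node:
    def __init__(self, name, parent=None, **attributes):
        self.name = name
        self.parent = parent          # super-class (or class, for instances)
        self.attributes = attributes  # values stored at this level

    def get(self, attribute):
        """Return the attribute value, inheriting from ancestors if needed."""
        node = self
        while node is not None:
            if attribute in node.attributes:
                return node.attributes[attribute]
            node = node.parent
        raise KeyError(attribute)

# Mirror of figure 2: Bird inherits from Animal; Pingu overrides locomotion.
animal = Node("Animal", birth="egg")
bird = Node("Bird", parent=animal, covering="feather", locomotion="fly")
pingu = Node("Pingu", parent=bird, locomotion="walk")  # instance-level override

print(pingu.get("birth"))       # inherited from Animal: egg
print(pingu.get("covering"))    # inherited from Bird: feather
print(pingu.get("locomotion"))  # overridden locally: walk
```

Note that only the override is stored at the instance; everything else is found by walking up the chain, which is exactly the default-value behaviour the text describes.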

3.1 Relationships
In our semantic net we chose to deviate from the norm and implemented a specialisation of a semantic net in which there is no inheritance hierarchy. The class/sub-class concept has been completely abandoned and the only attribute used is synonym, which cannot be inherited. This deviation was not initially planned but something we realised during the work. The implemented semantic net has only two relationships – generalise and specialise – and one attribute – synonym (see figure 3). The purpose of generalise is to move up one level in the relationship tree, whereas specialise moves down the tree.

[Figure: "car" generalises to "vehicle"; "vehicle" specialises to "car"; "car" has the synonym "automobile".]

Figure 3. Nodes and links representing objects and relationships in a semantic net.

3.2 Synonyms
We introduced the synonym relationship since we wanted to be able to allow alternative words and alternative languages. The idea was to link not only to strict synonyms but also to words that are more vaguely connected. Synonym could also be used to link to words with alternative spellings or words in other languages; the last point is far from unimportant in an international corporate group. Originally, synonym was thought of as a transitive relation. Each object would then only have needed a single link to one of its synonyms, that object would have linked on to the next, and so forth. However, if you let bike be a synonym for bicycle and motor cycle also be a synonym for bike, transitivity would make bicycle and motor cycle synonyms, which may not be true. Due to this, synonym was instead defined as an attribute, and we now have to hold a separate synonym list for each object, which makes the database bigger. However, we found that the synonym concept was not as useful as we initially thought and we ended up not using it at all. Our tests showed that synonyms increased the portion of irrelevant documents and thus lowered precision.

3.3 The programs
The idea was to let the semantic net represent knowledge about the content of a corporate intranet. For that we needed two things: a tool to build a semantic net and feed it with words and their relationships, and an interface to the semantic net so that, given a word, the program would return relevant related words. We first wrote the administrator's tool to create and maintain the semantic net, and it included the following features:

• Add terms. Allows the administrator to add words or concepts, their synonyms, and their relationships to other words or concepts.
• Update terms. Lets the administrator change or remove relationships between words or concepts.



• Insert terms. With the insert option the administrator may insert a word or a concept between two words already linked to each other.
• Delete terms. Provides the administrator with the possibility to delete a whole word with all its relationships and synonyms, or to delete only one or more of the relationships or synonyms.
• Print the net. Lists all defined words or concepts.
• Search the net. Lets the administrator search for a given word. If it exists, all its relationships and synonyms are displayed.
The administrator also has the option of loading the semantic net from, or saving it to, a file. However, the tool does not contain any graphics that would present the semantic net in a more conceptual way; the print option merely lists all the defined terms, and the relationships must be examined per term using the search option.
The second part, i.e., the function that, given a term, returns other related terms, was used by the search engine to assist the user while searching. The program reuses most of the code from the administrator's tool but has basically only one feature: a modified version of the search option. There is also no selection panel; instead the program accepts three parameters: 1) the name of the file that contains the semantic net, 2) the search word from the query, and 3) whether to generalise or specialise. The semantic net program runs as a separate application, and we modified the search engine to send its queries to the semantic net application before retrieving the results.

3.4 Internal representation
The programs are internally based on a hash table containing linked lists of objects sorted alphabetically. The main reason for this is that the search time becomes more or less independent of the database size. By making the structure rather wide (i.e. many entries in the hash table), the linked lists ended up quite small and response times could be kept short.

[Figure: a hash table whose buckets hold linked lists of term objects; "java" has generalise links to "coffee" and "island", while "island" has a specialise link to "malta".]

Figure 3. The hash table with its linked lists of objects and their relationships.
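In modern terms, a dictionary plays the role of such a hash table. Below is a minimal sketch of the lookup interface in Python (the paper does not state the implementation language); the naming is our own, and the example data is reconstructed from the figure's java/coffee/island/malta terms.

```python
# Sketch of the internal representation: a hash table keyed on the term,
# where each entry holds generalise / specialise / synonym lists.
# Python's dict provides the hash table; the small per-term lists play
# the role of the linked lists described in the text.

net = {
    "java":   {"generalise": ["coffee", "island"], "specialise": [], "synonym": []},
    "coffee": {"generalise": [], "specialise": ["java"], "synonym": []},
    "island": {"generalise": [], "specialise": ["java", "malta"], "synonym": []},
    "malta":  {"generalise": ["island"], "specialise": [], "synonym": []},
}

def related_terms(term, direction):
    """Given a query term, return related terms, or nothing if unknown."""
    entry = net.get(term.lower())
    return entry[direction] if entry else []

# Specialising a query: AND the original term with each more specific term.
for narrower in related_terms("island", "specialise"):
    print(f"island AND {narrower}")
```

The lookup cost is dominated by the hash access, which is why, as the text notes, response time stays more or less independent of the database size.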

4. POPULATING THE SEMANTIC NET
We focused on single-keyword queries because i) they are the most common, both in our particular case and on the Internet in general (Pinkerton, 1994), and ii) they have worse precision than multi-keyword queries due to the vocabulary problem (Furnas et al., 1987). The subsection of the intranet we examined contained approximately 140,000 different words, but having applied linguistic methods such as stemming, we were able to reduce the number to around 30,000. Initially, the idea was to base the decision whether to generalise or specialise on the result of the initial query. Our first attempts turned out to be completely wrong since we tended to forget a basic prerequisite – to exploit an understanding of the intranet content. We thought that we understood the domain, but experience proved us wrong, and much more time had to be spent examining the intranet and the relationships between the terms before populating the semantic net. We finally engaged a reference group, consisting of ten domain experts from different business units and with different backgrounds and job descriptions, to test and evaluate our ideas. Even though our view of the contents of the net changed many times, we were able to keep the program and its original data structures unmodified. The knowledge was very specific to the domain (and had to be, in order to be useful), but we did not fully understand that until we had made a number of unsuccessful attempts to load the knowledge base. In parallel to our work, another project was trying to automatically cluster the intranet using latent semantic indexing (LSI). Only when combining their work with the input from the domain experts did we begin to understand the relationships. Using the search engine log files, 20 words or terms from previously submitted queries were randomly selected. Using LSI and the domain experts, we derived the most related terms, ordered these according to the semantic net logic, and fed them into the semantic net tool. In all, a little over 200 words were thus entered. Although the initial 20 words were taken from the search engine query log, we did not initially know whether or not they existed on the intranet. The fact that LSI managed to produce related words indicated that they did; this was, however, not necessarily true for the words added by the domain experts.

5. EVALUATION
How to expand the queries was not a trivial question. It has long been known in the information retrieval literature that a larger result set also means a larger irrelevant result set (Salton and McGill, 1983). Initially, we shared the hypothesis of McGuinness et al. (1997) and Burke et al. (1996) that in a constrained domain such as a corporate intranet it should be possible to have the relevant document set grow faster than the irrelevant one. We were wrong, and this insight led us to abandon the generalise part of the application. When testing we found that, regardless of the query, we always received more than ten documents; this experience was fully shared by the members of the reference group. We therefore decided to concentrate our efforts on specialising, since receiving too many documents was considered a bigger problem than receiving too few. Further tests verified that it was all right to always apply specialise. One drawback of the initial approach was that an unmodified query had to be submitted first, then the result had to be analysed, and only then could the expanded query be submitted, which took time and effort. The benefit of specialising directly was that we had to neither wait for the original query nor examine its output. It simplified and sped up the process, and the results were good.

When specialising, the original word is logically AND:ed with a related, more specific word from the semantic net and submitted. Extending the semantic net to also handle multiple-keyword queries will probably require a more sophisticated algorithm, or recall will suffer too much; parsing must also be done more carefully for multi-keyword queries. We wanted to produce a better result than the naïve single query, and one thing we had to do was define what "better" would mean. Initially we wanted to improve both recall and precision, but our focus soon shifted towards precision only. In general, users in a corporate environment are better served by a tool that quickly gives them one correct answer than by a tool that does an exhaustive retrieval. If the answer you are looking for comes out on top, you need not bother about the rest of the list (Nielsen, 1999). Hence, a relevant document at the top of the list is good regardless of the total number of retrieved documents. This intuitive interpretation was shared by the reference group. Given how most search engines display their results, it is also good to have many relevant documents amongst the top ten: since the top ten documents are often shown on the first page, users need not go to a second page. However, if the top documents are irrelevant you have to start going through the list, and studies have shown that users are reluctant to do so (Nielsen, 1999). Even if they do bother to look, a long list is bad news since it takes time to look at each entry. Therefore we only consider the 20 highest-ranked documents. We also thought that having 7 relevant documents in a sample of 8 retrieved documents should be better (and thus score higher) than having 7 in a sample of 20; we therefore took the ratio of relevant documents into account (relative precision). This reasoning led to the following scoring algorithm:
1. IF the top-ranked document is relevant THEN 25 points ELSE 0 points.
2. Two points per relevant document amongst the top 10.
3. One point per relevant document amongst documents 11-20.
4. (total ratio of relevant documents amongst the top 20) * 5 points.
5. The total score is the sum of the above.
This gives a maximum of 25 + 20 + 10 + 5 = 60 points. We now had a way to evaluate the result of a query, and we could do a search without the semantic net and compare it with a search with the semantic net.
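The scoring rules translate directly into a short function. The sketch below uses our own naming (`relevant` is a list of booleans over the ranked result list, in rank order); rule 4's ratio is taken over the documents actually retrieved (up to 20), which is what makes 7 relevant of 8 retrieved score higher than 7 of 20.

```python
def score(relevant):
    """Score a ranked result list according to the five rules above.

    `relevant` holds one boolean per retrieved document, in rank order.
    Only the top 20 documents are considered. Maximum: 25+20+10+5 = 60.
    """
    top20 = relevant[:20]
    s = 25 if top20 and top20[0] else 0        # rule 1: top-ranked document
    s += 2 * sum(top20[:10])                   # rule 2: two points each, ranks 1-10
    s += 1 * sum(top20[10:20])                 # rule 3: one point each, ranks 11-20
    if top20:
        s += (sum(top20) / len(top20)) * 5     # rule 4: relative precision * 5
    return s                                   # rule 5: the sum of the above

print(score([True] * 20))   # a perfect top-20 list: 60.0
```

For example, a single relevant document at rank 1 in a 20-document result list yields 25 + 2 + (1/20)*5 = 27.25 points.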

6. RESULTS
The twenty words previously selected were fed to the company-internal search engine and the output was recorded according to the scoring algorithm described above. In total, 15,762 documents were retrieved, an average of 788 documents per query. In 3 of the cases the top document was indeed what we were looking for. The ratio of relevant documents amongst the top 10 was .125, and the scoring algorithm gave a total of 160.36 points. Having received and analysed the output from the original query, the query was modified to be more specific by AND:ing expansion terms. The result from the semantic net was evaluated in the same way as the results from the original query, and the two were compared. The expanded queries retrieved a total of 1,726 documents, a reduction of 89%. Twelve times a relevant document turned up on top. The ratio of relevant documents amongst the top 10 was .42 and the total score was 549.34, i.e., approximately 3.4 times as high as for the original query. The average score increased from 8.0 to 27.5, and only one query received zero points (i.e., had no relevant document amongst the top 20) after being expanded. See table 1 below for details.

Table 1. The search words and their results. The words have been translated from Swedish to English, which is why some single Swedish words appear as two English words.

                       Original query           Expanded query
Search term            # Retrieved    Score     # Retrieved    Score
naming standard                 11     7.36               1    32.00
objective                      400     0               141      2.25
ecb                             43    12.75               2    34.00
dan                            500     0                 6     42.00
package                        900     0                14     39.75
news                          3000     0               600      2.50
nedcar                         102     8.75             29     10.00
haku                            88    10.00             12     52.00
salaried employees             164     0                 9      9.33
sales                         2000     0               400     33.75
johannesson                    113    49.50             70     60.00
quicktime                       51     2.50             12     29.83
odbc                           200     2.00             51     34.00
magnesium                       32    30.00              3     32.33
obsolete                       400    35.25             12     16.92
law                            500     0                98      0
radio                         2000     0                34      4.75
job                            200     2.25             25     55.00
ibm                           5000     0               200     29.50
waterloo                        58     0                 7     29.43
sum                          15762   160.36           1726    549.34

As illustrated in figure 4 below, recall decreased when the total number of retrieved documents went down, although only by 12%. At the same time, precision increased eightfold, from under 3% to more than 22% – a big improvement in overall precision. Even the specific precision amongst the top 10 and for the number one ranked document increased three and four times, respectively. The reference team all agreed that a 12% loss in recall is a small price for a precision gain of many hundred per cent. Reducing the result set by some 14,000 documents, of which less than .5% were relevant, means that the user can much more easily focus on the remaining relevant documents.

[Figure: the whole document corpus divided into relevant and irrelevant documents, with two overlaid result sets; the result set from the expanded query is much smaller than that from the original query but contains a larger proportion of relevant documents.]

Figure 4. Precision vs. Recall for the original and expanded queries.

We re-ran the test some three weeks later, having installed a different search engine. Even though the contents had changed somewhat during these weeks, the new spider did not work in exactly the same way and had not been crawling for as long, the query language was different, and the ranking algorithms used in the search engine were not the same, the second test produced almost identical results (see table 2). Running the original queries, we retrieved 17,961 documents. Five times a relevant document turned up on top, and the ratio amongst the top 10 was .205, giving a total score of 166.06. The expanded queries returned a total of 1,858 documents. A relevant document was found at the top 15 times, and the top-10 ratio was .355, making the total score 604.83. The document reduction was 89% also this time. These figures confirm the results from the first test, as seen in table 2.

Table 2. Comparing the results from test #1 and test #2.

                         Test #1    Test #2
Document reduction        89.05%     89.50%
Top-10 precision gain       236%      73.2%
Score improvement           146%       164%
Recall loss                11.8%      10.9%
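Comparison figures of this kind fall straight out of the raw counts. As a quick sketch, using test #1's numbers as reported in the text (the variable names are our own):

```python
# Derived comparison metrics, reproduced from test #1's raw figures.
orig_docs, exp_docs = 15762, 1726    # total retrieved: original vs expanded
orig_p10, exp_p10 = 0.125, 0.42      # top-10 precision: original vs expanded

reduction = (orig_docs - exp_docs) / orig_docs * 100
p10_gain = (exp_p10 - orig_p10) / orig_p10 * 100

print(f"document reduction:    {reduction:.2f}%")  # 89.05%
print(f"top-10 precision gain: {p10_gain:.0f}%")   # 236%
```

Both derived values match the test #1 column of table 2, which shows that the precision gain there is the relative improvement over the original queries.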

As previously seen in table 1, test one resulted in one query scoring less after being expanded (the query word was "obsolete") and another that did not improve its score (the query word was "law"). The second time, two other queries scored less after being expanded (quicktime and waterloo), while the two that did not do so well in the first test improved with this engine, as shown in table 3 below. More work is needed to further analyse why these queries performed so unpredictably.

Table 3. The search words that differed in the second test.

                  Original query           Expanded query
Search term       # Retrieved    Score     # Retrieved    Score
quicktime                  39     6.75              10     2.50
waterloo                   14    39.14               6    33.50
obsolete                  555     3.75              14    42.86
law                       829     2.25              13    32.15

7. CONCLUDING DISCUSSION
Despite being a prototype, our implementation shows that if the correct relationships between intranet terms can be captured, a semantic net can successfully be used to expand search engine queries. The study has shown that precision can benefit greatly from having the knowledge contained in the semantic net suggest alternative keywords and better combinations of keywords for the users' otherwise single-keyword queries. However, knowledge acquisition plays a big role in most Artificial Intelligence-related projects, and this study is no exception. The implementation of the semantic net itself, i.e. the coding of the administrator's tool and the data structure, was done relatively quickly. What took time was working out the best heuristics for how to augment the query and the related activity of understanding the relationships between intranet terms. During that process we had to reload the semantic net over and over again as our understanding of the relationships between the topics slowly increased. This makes us hesitant as to whether this approach is scalable. The linguistic tools and techniques (i.e., LSI and stemming) that we used proved very helpful, suggesting relationships that we otherwise probably would not have foreseen. We examined 20 words of a total of 30,000 and found that it was a lot of work; even though the tools were helpful to a human, they were far from being able to create a semantic net automatically. This suggests that if this approach is to be scaled up, the labour of feeding the semantic net should be distributed over several individuals, possibly every single user of the system. Such an approach requires a way of visualising the relationships already defined, and we have not examined that field. When evaluating an information retrieval system, the measures typically used are precision and recall, which go back to the fifties (cf. Kent et al., 1955).
There are, however, a number of drawbacks with these two measures, to which a number of commentators have pointed (cf. Raghavan et al., 1989; Chu and Rosenthal, 1996; Nielsen, 1999; Karlgren, 2000). The exact number of relevant documents for a given query must be known; though this may be arranged in a laboratory setting, it is impossible to obtain in a field experiment. It must also be agreed upon exactly what is relevant and what is not; again, this may be possible in a very controlled environment but not generally so. The fact that documents in reality often are only partly relevant is also ignored. Still, precision – in particular amongst the top-X documents – is an important quality factor and something ordinary end-users care about and can relate to (Nielsen, 1999). This is mainly why we experimented with different scoring algorithms but continued to

use precision as one variable. It can obviously be questioned whether our evaluation mechanism was the optimal choice, and the results presented must be confirmed in other tests using other measures. It can be argued that a system that only recommends keywords and gives the user an opportunity to edit the expanded query manually before submitting may be advantageous, since it leaves the user in control instead of presenting a "black box". The user is also likely, over time, to gain a better understanding of the structure of the intranet and of how to formulate queries. However, interactive query expansion takes more time since it engages the user in a dialogue, and it is not certain that users are willing to invest the extra time. To summarise the advantages of our approach, we point to the fact that the linguistic relationships between words on an intranet are not necessarily the same as those on the Internet. Unexpected and context-dependent relationships are likely to be found on a corporate internal web. Once worked out, knowledge about the structure of the intranet provides great help when searching by suggesting keywords otherwise not thought of. The drawback of this approach is that knowledge acquisition is very resource intensive: a lot of manual labour has to be spent on understanding the true relationships between the words. Further, since the relationships on one intranet are not necessarily the same as those on another, the experiences from one intranet may not carry over to another domain, and all the work has to be done over again.

8. REFERENCES
Baeza-Yates, R. and Ribeiro-Neto, B. Modern Information Retrieval, Addison-Wesley, 1999.
Burke, R., Hammond, K. and Young, B. "Knowledge-Based Navigation of Complex Information Spaces", In Proceedings of the 13th National Conference on Artificial Intelligence, Portland, OR, USA, August 1996.
Chu, H. and Rosenthal, M. "Search Engines for the World Wide Web: A Comparative Study and Evaluation Methodology", In Proceedings of the 1996 ASIS Annual Meeting, October 19-24, 1996.
Efthimiadis, E. "Query Expansion", In M. E. Williams (ed.) Annual Review of Information Science and Technology, vol. 31, Medford, 1996, pp. 121-187.
Furnas, G., Landauer, T., Gomez, L. and Dumais, S. "The Vocabulary Problem in Human-System Communication", Communications of the ACM, 30 (11), 1987, pp. 964-971.
Karlgren, J. The Basics of Information Retrieval: Statistics and Linguistics. Stockholm: Swedish Institute of Computer Science, 2000.
Kent, A., Berry, M., Luehrs, F.U. and Perry, J.W. "Machine literature searching VIII. Operational criteria for designing information retrieval systems", American Documentation, Vol. 6, No. 2, 1955, pp. 93-101.
Mano, H. and Ogawa, Y. "Selecting Expansion Terms in Automatic Query Expansion", In Proceedings of SIGIR '01, ACM Press: New Orleans, LA, USA, 2001, pp. 390-391.
McGuinness, D. L., Manning, H. and Beattie, T. W. "Knowledge Augmented Intranet Search", In Proceedings of the 6th International World Wide Web Conference, Santa Clara, CA, USA, April 7-11, 1997.
Mitra, M., Singhal, A. and Buckley, C. "Improving Automatic Query Expansion", In Proceedings of SIGIR '98, ACM Press: Melbourne, Australia, 1998, pp. 206-214.
Nielsen, J. "User Interface Directions for the Web", Communications of the ACM, Vol. 42, No. 1, 1999, pp. 65-72.
Pinkerton, B. "Finding What People Want: Experiences with the WebCrawler", In Proceedings of the Second International World Wide Web Conference, Chicago, IL, USA, 1994.
Raghavan, V., Bollmann, P. and Jung, G. S. "A critical investigation of recall and precision as measures of retrieval system performance", ACM Transactions on Information Systems, Vol. 7, No. 3, 1989, pp. 205-229.
Sahlgren, M., Karlgren, J., Cöster, R. and Järvinen, T. "Automatic Query Expansion using Random Indexing", In Proceedings of CLEF 2002, Rome, Italy, 2002.
Salton, G. and McGill, M.J. Introduction to Modern Information Retrieval, McGraw-Hill, New York, 1983.
