Lightweight Document Matching for Help Desk Applications

Sholom M. Weiss, Brian F. White, Chidanand V. Apte, and Fredrick J. Damerau
T.J. Watson Research Center, IBM Research Division, Yorktown Heights, NY 10598

In Workshop on Text Mining: Foundations, Techniques, and Applications, Sixteenth International Joint Conference on Artificial Intelligence (IJCAI-99). Revised version, August 17, 1999.

Abstract

We describe a fast document matcher that matches new documents to those stored in a database. The matcher lists in order those stored documents that are most similar to the new document. The new documents are typically detailed problem descriptions or free-form textual queries of unlimited length, and the stored documents are potential answers such as frequently asked questions or service tips. The method uses minimal data structures and lightweight scoring algorithms to compute efficiently even in restricted environments, such as mobile or small desktop computers. Evaluations on benchmark document collections demonstrate that predictive performance for multiple document matches is competitive with more computationally expensive procedures.

Keywords

Text mining, Information Retrieval, Text Categorization, Case-Based Reasoning


1 Introduction

In an age of distributed and pervasive computing, many future programs will run on smaller-capacity machines requiring "lightweight" algorithms and data representations. We describe a Java-based document matcher that can accept as input a textual structure of unlimited length. A fast matching algorithm is employed to produce a ranked list of relevant documents. Our approach employs minimal processing and storage and is therefore suitable for installation directly in restricted environments, such as Java-compatible mobile or small desktop computers. This does not eliminate the possibility of running on a large server, but the approach taken here is effective even when resources are relatively scarce.

The problem of retrieving relevant documents from a corpus has been extensively addressed by the information retrieval community. Most solutions are essentially text-search systems that accept as input a query or limited-length textual phrase, and produce as output a list of potentially relevant documents [Salton & McGill, 1997]. Many text-search systems require substantial storage and computing resources. They can employ complicated document representations and document-matching algorithms. Search engines are one well-known embodiment of these systems. Typically, most words in all stored documents are indexed, and a small number of words in a query are matched to the stored words. In distinction to the pure search engine, a document matcher takes a much larger new document as input, so the query can consist of hundreds of words. A search algorithm that looks to identically match many input words will likely find no documents for an exact match. One obvious approach to document matching is to use words as features and then use information retrieval techniques, such as nearest neighbors, to score the stored documents by similarity [Stanfill & Waltz, 1986].
With thousands of documents and dictionaries of tens of thousands of words, such an approach can be complex. An alternative approach is case-based reasoning [Chang et al., 1996]. Each document is described in terms of decision rules and scoring criteria. Typically, the description is provided by humans. However, only a small set of key characteristics is needed for each document. A new document is then processed to produce a special format that can be matched to the stored representations. The idea of typing a problem description to a computer and getting a

response was an early goal of artificial intelligence systems. Tied to help desks, the classic problem emerges in modern form. One example is a user who types a complete problem description and expects a program to provide help, often in the form of finding the relevant documents in a stored database. In this paper, we describe a method that is completely automated, can match lengthy input documents, and gives output like a search engine, yet has the simplicity of case-based reasoning. We offer some empirical results showing that despite its lightweight algorithms, it is effective in fulfilling its goals for predictive performance. The major characteristics of the document matcher are the following:

- Given a set of documents, their titles, and possibly keywords, an automatic off-line pre-process constructs (a) local dictionaries of relevant words for each document, and (b) a global dictionary of unique keywords.
- The on-line process uses this information to score the relevance of stored documents to an input query document.
- The scoring algorithm uses the count of matched words as a base score and then assigns bonuses to words that have high predictive value. It optionally assigns an extra bonus for words matched from salient sub-structures in a document, such as a title, and from domain-specific document tags (for example, a specific product or release identification associated with a document).

2 Methods and Procedures

The objective of a document matcher is to match a new document to old documents and to rank the retrieved documents by assigning a score of relevance. The methodology is capable of taking a problem description, which may be just a few words or a long document, and finding relevant documents that may provide a solution. An example of an application of this technology is a self-help system for customers or a help-desk tool for customer service representatives. Figure 1 illustrates one scenario for customer use of a lightweight help desk that employs a document matcher. The lightweight document matching methods use only some of the available elements of the document for searching, specifically the title of the document, the keywords assigned to the document, if any, and additional special tags as available, such as product names. Documents in a repository are thus indexed using words only from the following sub-structures:

1. Document title

[Figure 1 flowchart: the customer enters a problem description in a browser; the lightweight self-help desk finds relevant documents; if the problem is solved, no call is made; otherwise a queue is assigned and the description is e-mailed.]

Figure 1: Help Desk Scenario

2. Other extracted keywords
   (a) manually assigned document tags
   (b) the k most frequent words in the document body, with stopwords removed and k typically set to a low value such as 8

A word is a set of contiguous alphanumeric characters, separated by delimiters such as whitespace or punctuation. Figures 2 and 3 illustrate the overall process. An off-line back-end program processes the documents, using an XML-style markup language to delineate the parts of the documents relevant to text retrieval and presentation to the end user. An example of the markup for a single document is given below.

Twinax Tools: DUMP Task


[Figure 2 flowchart, back-end process: convert documents to a standard representation (e.g., SGML); extract a local dictionary for each document (1: human-assigned keywords, 2: high-frequency words, 3: remove stopwords); create a global dictionary with unique identifiers for all words (e.g., singular/plural); create extract files for matching and display; compute a table of word weights based upon frequency in documents.]

Figure 2: Algorithm Overview - Back-end Process

twinax tools dmpjob appc 5716XA100 5763XA100
Twinax Tools: DUMP Task
This document will contain the required steps to run a Dump Task for the twinax T1 component.
From the STRSST menu, select 1 to Start a service tool, then 4 for Display/Alter/Dump.
Select 2 if you want to Dump to printer
Select 4 Tasks/Processes
Select 1 for Task
Select 5 to Display list of tasks
From the list of tasks, you can choose those starting


[Figure 3 flowchart, front-end process: load the word weight table and extract file; load the global dictionary into Table 1; create Table 2, which maps words to documents; read and tokenize the input document; for every token that maps to an identifier in Table 1, use Table 2 to find the set of associated documents; update Table 3 of matched document ids and words; score every document in Table 3 (1: every matched word contributes 1; 2: add a bonus for a matched word from the word weight table; 3: add/subtract a bonus/penalty for a match/mismatch in special sections); sort documents by score in descending order; output the results.]

Figure 3: Algorithm Overview - Front-end Process

with T1- or choose another task defined by the developer.
10324314
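The markup of this example can be sketched in XML style as follows; the tag names here are our own illustrative guesses, not the paper's actual schema:

```xml
<document>
  <keywords>twinax tools dmpjob appc</keywords>
  <products>5716XA100 5763XA100</products>
  <title>Twinax Tools: DUMP Task</title>
  <body>
    This document will contain the required steps to run a Dump Task
    for the twinax T1 component. From the STRSST menu, select 1 to
    Start a service tool, then 4 for Display/Alter/Dump. ...
  </body>
  <docid>10324314</docid>
</document>
```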

Two data structures are derived from the resulting file:

- a set of local dictionaries that contains the words that are relevant to specific documents. Typically, 8 to 10 keywords are assigned for each document. The words are not unique to documents; the same word may appear in many documents.
- a pooled, global dictionary containing a list of all words that are relevant to any document. This is a unique collection of words.

The XML document contains information relevant to document retrieval that is not contained in these two data structures, such as document titles, and possibly application-specific attributes such as component identifiers.
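The local-dictionary construction (tokenize, drop stopwords, keep the k most frequent body words) can be sketched as follows. This is a minimal illustration; the class and method names are our own, not the paper's implementation:

```java
import java.util.*;
import java.util.stream.Collectors;

public class LocalDictionaryBuilder {
    // A word is a contiguous run of alphanumeric characters; everything
    // else (whitespace, punctuation) acts as a delimiter.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Local dictionary: the k most frequent body words, stopwords removed.
    static List<String> topKWords(String body, Set<String> stopwords, int k) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : tokenize(body)) {
            if (!stopwords.contains(t)) counts.merge(t, 1, Integer::sum);
        }
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

Human-assigned keywords and special tags would simply be appended to the list produced here before it is written to the extract file.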

A final XML-style extract document incorporates the contents of the local dictionaries with these additional attributes, as illustrated below. The words in the local dictionaries are represented by the internal identifiers used by the object embodying the global dictionary.

Twinax Tools: DUMP Task
573 226 2987 2944 969 2887 915 320 2601
5716XA100 5763XA100

These two data structures, a global dictionary and an extract file representing a set of local dictionaries and additional attributes, are sufficient for the fast document matcher to score new documents. Unless revised, the dictionaries are created once. They can then be read by an application program that repeatedly matches documents, and the program can be distributed to multiple users. Given this document representation, a special scoring function is employed to compare a request, entered as key words or as a natural-language document, to the stored document representations. The output is a ranked list of documents which are relevant to the problem entered. Words in the new document are matched to words in the global dictionary. Words must match exactly (plurals are mapped to singular) so that a hash table can be employed for almost immediate lookup in the table. The words in the global dictionary point to the local dictionaries of the stored documents. The base score of each stored document is the number of its local keywords found in the new document. A bonus, usually 1, is then added to the base score for every matched word that also appears in a title or special tags, e.g., a product or release identifier. Furthermore, a bonus equal to the predictive value of a matched word is also given. The predictive value of a word is 1/n, where n is the number of stored documents that contain that word. An example of a graphical user interface is shown in Figure 4. The html display produced in response to the query is shown in Figure 5. Not

Figure 4: Query Panel

all of these fields are required for all applications. If the text areas are separated into a single one-line summary versus a detailed description, then a slight bonus can optionally be assigned to words that appear in the one-line summary, giving extra weight to a "title" effect.
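The scoring pass described above can be sketched as follows. This is a minimal, illustrative implementation; the class and field names are our own, and the exact bonus bookkeeping is an assumption based on the text:

```java
import java.util.*;

public class LightweightMatcher {
    // Global dictionary: exact word -> internal id (hash lookup, no stemming).
    final Map<String, Integer> globalDict = new HashMap<>();
    // Inverted index: word id -> ids of documents whose local dictionary has it.
    final Map<Integer, List<Integer>> wordToDocs = new HashMap<>();
    // Predictive value of word id = 1/n, n = number of documents containing it.
    final Map<Integer, Double> predictiveValue = new HashMap<>();
    // Word ids that appear in a document's title or special tags.
    final Map<Integer, Set<Integer>> salientWords = new HashMap<>();

    void index(int docId, List<String> localDict, Set<String> titleWords) {
        for (String w : localDict) {
            int id = globalDict.computeIfAbsent(w, k -> globalDict.size());
            wordToDocs.computeIfAbsent(id, k -> new ArrayList<>()).add(docId);
            if (titleWords.contains(w))
                salientWords.computeIfAbsent(docId, k -> new HashSet<>()).add(id);
        }
        // Recompute 1/n for every word after each document is added.
        for (Map.Entry<Integer, List<Integer>> e : wordToDocs.entrySet())
            predictiveValue.put(e.getKey(), 1.0 / e.getValue().size());
    }

    // Score: 1 per matched local keyword, +1/n predictive bonus,
    // +1 if the word also occurs in the title or special tags.
    Map<Integer, Double> score(List<String> queryTokens) {
        Map<Integer, Double> scores = new HashMap<>();
        for (String t : new HashSet<>(queryTokens)) {   // presence, not frequency
            Integer id = globalDict.get(t);
            if (id == null) continue;
            for (int doc : wordToDocs.get(id)) {
                double s = 1.0 + predictiveValue.get(id);
                if (salientWords.getOrDefault(doc, Set.of()).contains(id)) s += 1.0;
                scores.merge(doc, s, Double::sum);
            }
        }
        return scores;   // sort descending to obtain the ranked list
    }
}
```

Because query tokens are deduplicated before lookup, a long input document costs one hash lookup per distinct word, which is what keeps the front-end pass fast.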

3 A Comparison to Classical Information Retrieval Methods

Although they may score many words and documents, document matchers can be considered a variation of the classical information retrieval techniques used in search engines. The central theme of the lightweight document matcher is to reduce the dimensions of large problems, resulting in efficient processing even in restricted environments. Let's look at the key characteristics of our approach and contrast them with classical information retrieval:

1. Simplified additive scoring of positive words

Figure 5: Response Panel

2. Exact word match with synonyms and no stemming except for plurals
3. Reduced indexing with no document frequencies

Equations 1 and 2 summarize the classical "cosine" approach to scoring [Salton & Buckley, 1997]. The data are weighted by "tf", the term frequency, i.e., the frequency of each word in a document, and "idf", the inverse document frequency, based on n_k, the number of documents in a collection of N documents that contain the k-th word. This computation is more complicated than simple additive scoring and requires storage of frequency information. Scoring is a variation of finding the nearest neighbors, as measured by Equation 1, where Q is a new document that is matched to stored documents D. Our additive count is analogous to the cosine formula: our bonus is a form of idf. The denominator of Equation 1 normalizes the computation, while our measures are relatively normalized, with a count of 1 for each positive word match and a maximum of 1 for the bonus. The main difference is the term frequency, where we just check for the presence or absence of key or high-frequency words. Term frequency helps with "recall" of documents that might otherwise be of lesser importance. The simple positive score has the great advantage of transparency of scoring, leaving the user with a clear explanation, in terms of identified key words, for the retrieval of a matched document.

similarity(Q, D) = ( Σ_{k=1}^{t} w_{qk} · w_{dk} ) / sqrt( Σ_{k=1}^{t} (w_{qk})² · Σ_{k=1}^{t} (w_{dk})² )   (1)

w_{dk} = tf_{dk} · idf_k = tf_{dk} · log(N / n_k)   (2)

Our second attribute is greater reliance on exact matching of individual words without stemming, i.e., substring and common-root matching. This is justified only when the feature space is reduced from full indexing.
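Equations 1 and 2 can be written out directly. The following sketch (our own code, using dense weight vectors over a t-word vocabulary) is a straightforward transcription:

```java
public class CosineScorer {
    // w_dk = tf_dk * log(N / n_k), Equation 2: tf is the term frequency in
    // this document, docFreq[k] the number of documents containing word k.
    static double[] weights(int[] tf, int[] docFreq, int nDocs) {
        double[] w = new double[tf.length];
        for (int k = 0; k < tf.length; k++)
            w[k] = docFreq[k] == 0 ? 0.0
                 : tf[k] * Math.log((double) nDocs / docFreq[k]);
        return w;
    }

    // similarity(Q, D), Equation 1: dot product of the weight vectors,
    // normalized by the product of their Euclidean norms.
    static double similarity(double[] wq, double[] wd) {
        double dot = 0, nq = 0, nd = 0;
        for (int k = 0; k < wq.length; k++) {
            dot += wq[k] * wd[k];
            nq  += wq[k] * wq[k];
            nd  += wd[k] * wd[k];
        }
        return (nq == 0 || nd == 0) ? 0.0 : dot / Math.sqrt(nq * nd);
    }
}
```

The contrast with the lightweight scorer is visible in the inputs: this computation needs per-document term frequencies and global document frequencies, exactly the information the reduced index avoids storing.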



                   Help Desk               Reuters
Full indexing      27.6 words/doc          47.2 words/doc
Full indexing      117,249 unique words    27,975 unique words
Partial indexing   7.5 words/doc           8.7 words/doc
Partial indexing   19,428 unique words     10,132 unique words

Table 1: Indexing and Document Feature Lists

By far the greatest and most important technique for reducing complexity and application dimensions is item 3, reduced indexing. The number of documents that are referenced, the computations needed for matching many words, and the storage requirements all grow dramatically with full indexing. Table 1 compares partial with full indexing for two applications. The dramatic effects on storage of reduced indexing have been demonstrated in the GLIMPSE system [Manber & Wu, 1993]. There, disk storage is greatly reduced at the cost of some additional computation for finding substring matches in text. In common with our lightweight document matcher is the greatly reduced inverted index. In our case, we use a reduced inverted index as the basis of all matching, without any additional computation. In the next section, we consider the effects of these dimension-reduction techniques on predictive performance.

4 Results

Our goals are to build a document matcher that is (a) lightweight and (b) predictive. First, let's look at the goal of being lightweight. Table 2 describes various measures of program performance for one of our help-desk applications. With a previous-generation laptop, a capacity of nearly 20,000 documents was achieved, and new documents were readily matched in real time.

Computer: 166MHz ThinkPad 560
Memory: 48MB
Java code: 60KB
Documents stored: 17,415
Dictionary words: 19,428
Response time per new document: less than 1 second

Table 2: Capacity and Performance for Help Desk Application

Excellent capacity and runtime performance do not imply good predictive performance. We consider two forms of evaluation of predictive capability. First, the customer for the help-desk application did their own small, yet fully independent, evaluation. They created 10 scenarios with brief problem statements. The human experts pre-selected the stored document that they considered to contain the correct solution. They concluded that for 7 of the 10 problems, the lightweight document matcher found the correct answer in one of its top ten document matches. A more carefully controlled evaluation of predictive capability compares performance on text categorization. These tasks allow for training on labeled documents, and new documents are assigned to one of a set of predefined topics. The literature on this task is extensive; many different methods have been developed and their predictive performances measured [Apte, Damerau, & Weiss, 1994], [Weiss et al., 1999]. The lightweight document matcher's scoring procedures were formally evaluated on the well-known Reuters-21578 benchmark for text categorization [Lewis, 1995]. We used the Mod-Apte variation of the benchmark, with 9603 training documents and 3299 test documents. These are labeled documents; each document is assigned one of 93 topics. Documents are scored by simply counting the number of dictionary words that appear in both a training and a test document, plus the bonus for the predictive value of those matched words.
The standard evaluation criterion for the Reuters benchmark is the breakeven point, the point at which precision equals recall on independent test documents. Given a binary classification problem of topic vs. not-topic, recall is the ratio correct-topic-cases/total-topic-cases. Precision is

[Figure 6 plot: breakeven accuracy (y-axis, 0-100%) vs. top-k matches accepted (x-axis: k = 1, 2, 5, 10) for the 543-word and 2133-word dictionaries; accuracy rises from roughly 62-65% at k=1, through 84-86% at k=2, to roughly 94-97% at k=10.]

Figure 6: Breakeven Accuracy for Top-k Matches

correct-topic-cases/total-predicted-topic-cases. If the three individual components of these measures are summed over all topics, we can compute an overall measure of performance. To obtain the breakeven point, the decision threshold of a classifier is artificially adjusted. For our purposes, we need only a rough estimate of performance, and we average precision and recall. The best reported result is nearly 88% [Weiss et al., 1999], and the reported result for nearest-neighbor methods with cosine distance is about 82%, with 30 nearest neighbors for each category and a 10,000-word dictionary [Joachims, 1997]. Figure 6 summarizes the results for different dictionary sizes and numbers of matches accepted as correct. The 2133-word dictionary is identical to that used to achieve the benchmark 88% result of [Weiss et al., 1999]. The 543-word dictionary is formed using the same procedures as the 2133-word dictionary. Both use the most frequent words found for each topic: the larger dictionary uses the 150 most frequent words; the smaller, the 50 most frequent words. Stopwords, i.e., known weakly predictive words like pronouns, are removed prior to completing the final dictionary. All documents were indexed with all words. When any of the top-k documents is the correct class, the answer is marked correct. Figure 6 displays the increase in accuracy as the number of allowed top matches increases. To solve the Reuters categorization problem, clearly only one answer can be given. In Figure 6, the top-k matching documents were found, and a test document's label was marked correct if it appeared in any of the top-k matches. To provide a unique answer, the strongest approach is to accept the label that has the highest frequency among the top-k matched documents, not just the label of the top-ranked document. This is k-nearest-neighbor scoring with our additive score as a distance measure, which computes distance only for words occurring in both the test document and its matched documents. To avoid top-10 ties for dichotomous classification, we found the top-11 documents for each test document. Using the same 2133-word dictionary, the breakeven predictive performance was 79%. The result for the cosine distance measure, which also computes over positive word matches, was 82%. This result matches the previously published result, but uses a far smaller dictionary and unstemmed words.
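The top-k voting and breakeven estimate described above can be sketched as follows (illustrative code; the actual evaluation sums the precision/recall components over all 93 topics):

```java
import java.util.*;

public class TopKEval {
    // k-nearest-neighbor labeling: the predicted topic is the most frequent
    // label among the k highest-scoring matched documents.
    static String vote(List<String> topKLabels) {
        Map<String, Integer> counts = new HashMap<>();
        for (String label : topKLabels) counts.merge(label, 1, Integer::sum);
        return Collections.max(counts.entrySet(),
                Map.Entry.comparingByValue()).getKey();
    }

    // Rough breakeven estimate: average of precision and recall, where
    // precision = correct/predicted and recall = correct/actual.
    static double breakeven(int correct, int predicted, int actual) {
        double precision = (double) correct / predicted;
        double recall = (double) correct / actual;
        return (precision + recall) / 2.0;
    }
}
```

Retrieving an odd number of neighbors (top-11 rather than top-10, as in the text) is what keeps the vote from tying on a two-way topic/not-topic split.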

5 Discussion

We have described lightweight document-matching algorithms and representations that have been implemented and tested in a Java program. Indexing is restricted to a relatively small number of keywords per document. The presence or absence of a word in a document is determined by exact word match, which is readily implemented by efficient hash-table lookup with no stemming, no lexical analysis, and no frequency counts. A simple additive scoring method is employed that counts positive word matches. The idea of processing text for classification, or of processing queries by retrieving relevant documents, has been a fundamental task of many research communities for decades. We have shown that by combining relatively lightweight approaches, we produced a solution that computes efficiently even in restricted environments, such as mobile or small desktop computers. This approach can mimic a search engine, yet has the capability to match a new document of many thousands of words. In our recent applications, the typical number of documents in the repositories was in the 20,000 range. For one help-desk application, the entire system easily resides on a laptop, requiring only about 50MB of disk space to store and search documents. In terms of predictive performance, we deemed the results surprisingly

good. Matched against systems that train on labeled data, the lightweight algorithm gives reasonable predictive performance for classification problems that require a unique answer [Yang, 1998]. Performance for a simple additive score of positive word matches is only slightly weaker than nearest-neighbor methods that compute the more complex cosine distance over fully indexed documents. However, in our expected application environment for document matching, multiple answers are satisfactory, and the user may intervene and discriminate among the top document choices, filtering the strong from the weak responses. Experimental results support the idea that for any reasonable scoring method, some good answers are likely to appear in a top-10 set of potential document matches. The fast document matcher seems to be an ideal solution component for building help-desk and retrieval systems for mobile computing and the expected proliferation of distributed computers and network appliances.


References

[Apte, Damerau, & Weiss, 1994] Apte, C.; Damerau, F.; and Weiss, S. 1994. Automated Learning of Decision Rules for Text Categorization. ACM Transactions on Information Systems 12(3):233-251.

[Chang et al., 1996] Chang, K.; Raman, P.; Carlisle, W.; and Cross, J. 1996. A Self-Improving Helpdesk Service Using Case-Based Reasoning Techniques. Computers in Industry 113-125.

[Joachims, 1997] Joachims, T. 1997. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. Technical report, University of Dortmund.

[Lewis, 1995] Lewis, D. 1995. README File for the Reuters Text Distribution. Technical report, AT&T. See http://www.research.att.com/lewis.

[Manber & Wu, 1993] Manber, U., and Wu, S. 1993. Glimpse: A Tool to Search Through Entire File Systems. Technical Report TR 93-34, University of Arizona. See http://glimpse.cs.arizona.edu/webglimpse/index.html.

[Salton & Buckley, 1997] Salton, G., and Buckley, C. 1997. Term-Weighting Approaches in Automatic Text Retrieval. In Sparck Jones, K., and Willett, P., eds., Readings in Information Retrieval. Morgan Kaufmann. 323-328.

[Salton & McGill, 1997] Salton, G., and McGill, M. 1997. The SMART and SIRE Experimental Retrieval Systems. In Sparck Jones, K., and Willett, P., eds., Readings in Information Retrieval. Morgan Kaufmann. 381-399.

[Schapire & Singer, 1999] Schapire, R., and Singer, Y. 1999. BoosTexter: A Boosting-Based System for Text Categorization. Machine Learning, in press.

[Stanfill & Waltz, 1986] Stanfill, C., and Waltz, D. 1986. Toward Memory-Based Reasoning. Communications of the ACM 1213-1228.

[Weiss et al., 1999] Weiss, S.; Apte, C.; Damerau, F.; et al. 1999. Maximizing Text-Mining Performance. IEEE Intelligent Systems, in press.

[Yang, 1998] Yang, Y. 1998. An Evaluation of Statistical Approaches to Text Categorization. Journal of Information Retrieval, to appear.
