Incremental Context Mining for Adaptive Document Classification

Rey-Long Liu
Dept. of Information Management, Chung-Hua University, HsinChu, Taiwan, R.O.C.
Tel: 886-3-5186530
Email: [email protected]

Yun-Ling Lu
Dept. of Information Management, Chung-Hua University, HsinChu, Taiwan, R.O.C.
Tel: 886-3-5186519
Email: [email protected]
ABSTRACT
Automatic document classification (DC) is essential for the management of information and knowledge. This paper explores two practical issues in DC: (1) each document has its context of discussion, and (2) both the content and the vocabulary of the document database are intrinsically evolving. These issues call for adaptive document classification (ADC), which adapts a DC system to the evolving contextual requirement of each document category, so that input documents may be classified based on their contexts of discussion. We present an incremental context mining technique to tackle the challenges of ADC. Theoretical analyses and empirical results show that, given a text hierarchy, the mining technique is efficient in incrementally maintaining the evolving contextual requirement of each category. Based on the contextual requirements mined, higher-precision DC may be achieved with better efficiency.
Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing – indexing methods.

General Terms
Algorithms

Keywords
Context mining, incremental mining, adaptive document classification
1. INTRODUCTION
Automatic document classification (DC) aims to map documents to suitable categories. It facilitates the storing, dissemination, elicitation, and sharing of information and knowledge, which are represented in document form. As the spaces of information and knowledge are often ever-changing in the real world [13], there
could be lots of documents to be classified and stored into the document database at any time. Automatic DC is thus a must for cost-effective management of the documents. However, as the document database evolves, both its content and its vocabulary evolve as well. This phenomenon calls for adaptive document classification (ADC), which aims to promote the efficiency and precision of DC by incremental and efficient adaptation to the evolution. ADC brings several challenges to text mining.

The first challenge is context mining. Previous studies have identified two essential forms of contexts: neighbor terms surrounding the keywords [8] and content-bearing terms indicating the context of usage/discussion of the document [12]. Previous DC techniques mainly focused on the first form (e.g. word strings [2] and linguistic phrases [14]). The mining of the content-bearing and usage-indicative terms for each category deserves exploration. These terms may serve as the contextual requirement (CR) of the category. Unlike neighbor terms, which are extracted based on their locality in a single document, CR terms should be mined by analyzing multiple documents from multiple categories.

The second challenge is incremental mining, since both the content and the vocabulary of the database may evolve when new documents are added. As the database evolves, the CR of each category evolves as well. Re-triggering the whole mining process for each new document is computationally impractical, as it requires a great deal of I/O and computation. Typical previous DC techniques included symbolic rule induction [1], regression [19], Rocchio's linear classifiers [15], the k-Nearest Neighbor (kNN) method [6], the Bayesian independence classifier [7], the support vector machine method [3], and the Perceptron-based method [11]. There were also studies relying on a given text hierarchy to cluster documents [4, 16] and classify documents [3, 5, 9, 10]. They often preset a feature set (vocabulary) on which their classifiers were built (a feature often corresponded to a term or a phrase). Obviously, since the vocabulary may evolve in ADC, no feature set may be presumed. Even if the feature set "evolves" by covering all features currently seen in the documents (e.g. [2]), inappropriate features may introduce inefficiency [18] and errors (over-fitting) [10] in DC. Better performance (in terms of efficiency and precision of DC) is often achieved by semi-automatic and/or trial-and-error feature selection [11]. The number of features selected was thus often treated as an experimental issue (e.g. [9, 10, 18]). The construction of an "optimum" feature set (if any) thus consists of a series of tuning processes, which may be re-triggered by the addition of a new document.

The third challenge is efficient DC, which should be supported by the result of mining. Classification is triggered much more frequently than mining, so its efficiency is essential to the management of information and knowledge. That is, by incremental and efficient mining of the CR of each category, the precision of DC should be promoted with good efficiency.

Simultaneously tackling these challenges is essential to text mining. It also provides practical contributions to document classification in the ever-changing world. In this paper, an incremental mining technique, ACclassifier (Adaptive Context-based Classifier), is developed to tackle the challenges of ADC. To empirically evaluate ACclassifier, experiments on a real-world document database were conducted (ref. section 3). Empirical results show that ACclassifier achieves efficient and incremental mining of the contextual requirement of each document category, which may serve as the basis for efficient and high-precision DC.
2. INCREMENTAL CONTEXT MINING FOR ADAPTIVE DC
ACclassifier consists of two components: an incremental context miner and a document classifier. Both components work on a given text hierarchy in which each node corresponds to a document category.
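To make the setting concrete, the sketch below shows one plausible way to represent such a hierarchy; the class and field names (Category, cr, and so on) are our own illustrative choices, not identifiers from the paper.

```python
class Category:
    """A node in the text hierarchy; leaves are the document classes."""
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.cr = {}  # contextual requirement: word -> strength
        if parent is not None:
            parent.children.append(self)

    def siblings(self):
        """Categories sharing this node's father (empty for the root)."""
        if self.parent is None:
            return []
        return [c for c in self.parent.children if c is not self]

# Example hierarchy from section 2.1.1: CBIS with children DSS and MIS.
root = Category("root")
cbis = Category("CBIS", root)
dss, mis = Category("DSS", cbis), Category("MIS", cbis)
```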
2.1 The incremental context miner
Given a training document together with its category label (a leaf node in the tree), the incremental context miner updates the CR of each related category.

Table 1. The incremental context miner
Input: (I) A text hierarchy T, and (II) a training document d and the leaf (most-specific) category c that d belongs to.
Effect: Update the contextual requirements (CR) of related categories of c in T.
Begin
(1) W ← {w | w is a word in d with its number of occurrences ≥ δ};
(2) While (c is not the root of T) do
  (2.1) For each word w in W, do
    (2.1.1) If w is not a CR term for c, add (w, s) to CR of c (strength s is unknown);
  (2.2) For each pair (w, s) in CR of c, do
    (2.2.1) Update s;
    (2.2.2) For each sibling category b of c, do
      (2.2.2.1) If w is a CR term for b, update w's strength in CR of b;
  (2.3) c ← father of c;
End.
2.1.1 The algorithm
The algorithm of the miner is defined in Table 1. The CR of a category c is a set, initially empty. Each element in the CR of c is a pair (w, s), where w is a word (or feature) and s is the strength of w serving as a context word for the documents under c (i.e. the leaf categories that are descendants of c). More specifically, a word w has a higher strength in c if it can significantly distinguish c from the sibling categories of c. For example, suppose category "computer-based information system" (CBIS) has two children categories: "decision support systems" (DSS) and "management information systems" (MIS). The word "computer" should have a lower strength in categories MIS and DSS (since it occurs frequently in both categories), but a higher strength in category CBIS (if the sibling categories of CBIS are not about computer systems). That is, "computer" may be a good CR term for the documents under CBIS, and the pair of "computer" and its strength should thus be included in the CR of CBIS.

To determine whether a word may be a CR term, the miner employs a threshold δ (Step 1). A word and its associated strength qualify for inclusion in a CR only if the word occurs at least δ (e.g. five) times in the training document. Thus δ acts as a simple filter for irrelevant words; similar thresholds were often employed in previous studies as well [5, 9, 10, 11].

To estimate the strength s of a word w in distinguishing category c from the siblings of c (Steps 2.2.1 and 2.2.2.1), the miner employs a modified TF×IDF (term frequency × inverse document frequency) technique: Strength(w, c) = P(w|c) × (Bc / Σi P(w|ci)), where Bc is the number of siblings of c plus one (i.e. including c). P(w|c) is estimated as the probability of w occurring in the documents under c (i.e. the descendant leaf categories under c), and the summation of P(w|ci) runs over c and its siblings. Thus w gets a higher strength if it occurs frequently in c (i.e. P(w|c) is high) but infrequently, on average, in the siblings of c (i.e. Bc / Σi P(w|ci) is high). The maximum strength is Bc.
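As a minimal illustration of the formula, the sketch below computes Strength(w, c) from a table of P(w|ci) estimates; how those probabilities are maintained is left abstract here, and all names are our own.

```python
def strength(w, c, siblings, p):
    """Strength(w, c) = P(w|c) * (Bc / sum_i P(w|ci)), where the sum runs
    over c and its siblings, and Bc is the number of siblings plus one."""
    group = [c] + list(siblings)
    bc = len(group)
    denom = sum(p.get((w, ci), 0.0) for ci in group)
    return 0.0 if denom == 0.0 else p.get((w, c), 0.0) * (bc / denom)

# "computer" occurs equally often under DSS and MIS, so its strength in
# each is exactly the average value 1.0; had MIS never mentioned it, the
# strength in DSS would reach the maximum Bc = 2.
p = {("computer", "DSS"): 0.04, ("computer", "MIS"): 0.04}
print(strength("computer", "DSS", ["MIS"], p))  # 1.0
```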
2.1.2 Behavior analysis
Given a training document, the update of CR is conducted on related categories only, in a bottom-up manner. The related categories include the category c, its sibling categories, its antecedent categories, and the siblings of its antecedent categories (ref. Steps 2.2.1, 2.2.2, and 2.3). To update the strength of w in the related categories (ref. Steps 2.2.1 and 2.2.2.1), the context miner only needs to keep track of the related data for computing P(w|ci).

The time complexity of the mining process may be measured by estimating the normal range of the number of features updated. When a document is added into a category c, P(w|c) changes for each feature w in c. Therefore, the strengths of all features in the CR of c, both existing ones and new ones, need to be updated. The update of P(w|c) also changes the IDF component (i.e. Bc / Σi P(w|ci)) of the strength of w, so the strengths of these features in each sibling of c need to be updated as well. Similarly, when the update process proceeds to the father of c (ref. Step 2.3), the strengths of all features in the father category (and the father's siblings) need to be updated. That is, only a small subset of the CRs is updated for each training document. No training documents are reprocessed, no feature set is predefined, and no time-consuming trial-and-error feature-set tuning is conducted.

Moreover, the miner achieves hierarchical context annotation and weighting for each category in the text hierarchy, which are essential for high-precision DC. A non-context-indicative term (feature) gets a low strength in lower-level categories (it is either rare, or so common that it may even be a stop word). A context-indicative term, on the other hand, gets a high strength in higher-level categories (it is a context word for those categories). That is, instead of selecting features to build a universal feature space for DC, the context miner allows terms to percolate up to their suitable categories. A term that is competent in serving as a context word for a high-level category will be assigned a higher strength in that category. The mining result may help to recognize the context of the document to be classified, and hence promote both the efficiency and the precision of DC.
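A compact rendering of Table 1 follows, reusing the Category sketch above. The per-category word counts (counts, totals) used to estimate P(w|c) are our own bookkeeping assumption; the paper only requires that the miner keep track of the data needed to compute P(w|ci).

```python
from collections import Counter, defaultdict

counts = defaultdict(Counter)  # category -> word occurrence counts under it
totals = defaultdict(int)      # category -> total word occurrences under it

def p(w, cat):
    """P(w|c), estimated here as the relative frequency of w under c."""
    return counts[cat][w] / totals[cat] if totals[cat] else 0.0

def train(doc_words, c, delta=5):
    """Incremental context miner (Table 1): walk from the leaf c up to the
    root, refreshing the CR strengths of c and its siblings at each level."""
    freq = Counter(doc_words)
    W = {w for w, n in freq.items() if n >= delta}       # step (1)
    while c.parent is not None:                          # step (2)
        for w in W:                                      # record new counts
            counts[c][w] += freq[w]
            totals[c] += freq[w]
        group = [c] + c.siblings()
        bc = len(group)
        for cat in group:                                # steps (2.1)-(2.2.2)
            # For brevity this recomputes every strength in the group; the
            # paper updates only the terms whose P(w|ci) actually changed.
            for w in counts[cat]:
                denom = sum(p(w, ci) for ci in group)
                cat.cr[w] = 0.0 if denom == 0.0 else p(w, cat) * (bc / denom)
        c = c.parent                                     # step (2.3)
```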
2.2 The document classifier
Given a document to be classified, the document classifier consolidates the CRs of the categories to identify a leaf (most-specific) category.

Table 2. The document classifier
Input: (I) A text hierarchy T, and (II) a document d to be classified.
Output: A leaf (most-specific) category for d.
Begin
(1) S ← the hierarchy T, with DOAc of each category c set to 0;
(2) For each distinct word w in d, do
  (2.1) For each category c in S, do
    (2.1.1) If w is a good CR term for c, DOAc ← DOAc + Strength(w, c) × TF(w, d);
(3) For each category c in S, do
  (3.1) DOAc ← DOAc / Σw Strength(w, c), ∀ w that is a good CR term of c;
(4) m ← -1;
(5) For each leaf category l in S, do
  (5.1) n ← Average{DOAi}, where i may be l or an antecedent of l;
  (5.2) If m < n,
    (5.2.1) g ← l;
    (5.2.2) m ← n;
(6) Return g;
End.
2.2.1 The algorithm
The algorithm of the classifier is defined in Table 2. Given a document d to be classified, the basic idea is to compute the degree of acceptance (DOA) of d for each category c. The DOA is computed from the strengths of d's distinct words in c (ref. Step 2). A word w may join the computation of DOA for c only if w is a good CR term for c (ref. Step 2.1.1). A term w is a good CR term for a category c if it satisfies the following two constraints:

(C1) Support(w, c), i.e. P(w|c), ≥ minSupport, and
(C2) Strength(w, c) ≥ Bc/2.

That is, w must be representative enough in c (minSupport is a given parameter), and its strength must be at least the average strength over c and the siblings of c (recall that Strength(w, c) is bounded by Bc, the number of c's siblings plus one). Note that the two constraints may be applied only in classification, not in mining: a term that is currently a good CR term may cease to be one (and vice versa) in the course of incremental mining. ACclassifier therefore keeps track of all terms in mining, and extracts good CR terms only for classification. Also note that, although minSupport is a parameter for ACclassifier, it may be estimated from a simple observation of the probability of a good CR term occurring in a category. More importantly, its setting does not need to change as more documents enter the database, since it governs the support of all features. This is particularly important for ADC, in which, as noted above, previous techniques often had difficulties in setting parameters (e.g. the feature set size) that needed to evolve as the document database evolves.

When computing the DOA value contributed by w to c, Strength(w, c) is multiplied by TF(w, d) (i.e. w's number of occurrences in d), and the result is added to the DOA of c. That is, if w is a strong context word in c and occurs many times in d, c is more likely to "accept" d. The DOA value of each category is then normalized by the sum of the strengths of all good CR terms in the category (ref. Step 3). The normalized value indicates the extent to which the category accepts the document.

The final output category is determined as follows. For each leaf category l, the average DOA value of l and its antecedents is computed (ref. Step 5.1). The output category is simply the leaf category g with the largest average DOA value (ref. Steps 5.2 and 6). The reason for averaging the DOA values of the antecedents is that a topic may often be discussed from different views, which should be taken into account in DC. For example, DSS may be discussed from the viewpoints of CBIS and of financial management (FM). From the CBIS viewpoint, the architecture of DSS as a computer system may be the focus; from the FM viewpoint, the usage of DSS in supporting FM may be the focus. A document talking about DSS therefore cannot be properly classified until the context of the discussion is recognized, and the DOA values from the antecedent categories may serve as the basis for this context recognition. A well-written document should specify its view (context), which may be recognized by ACclassifier when classifying the document.
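The following sketch renders Table 2 and the two constraints in the same style as the sketches above; the support function (an estimate of P(w|c)) is passed in, since the paper leaves its bookkeeping to the miner, and the treatment of the root (which carries no CR of its own) is our assumption.

```python
from collections import Counter

def classify(doc_words, root, support, min_support=0.001):
    """Document classifier (Table 2): score every category by a normalized
    degree of acceptance (DOA), then return the leaf whose pedigree
    (the leaf plus its antecedents) has the highest average DOA."""
    tf = Counter(doc_words)

    def good_cr(w, c):                       # constraints (C1) and (C2)
        bc = len(c.siblings()) + 1
        return support(w, c) >= min_support and c.cr.get(w, 0.0) >= bc / 2.0

    def doa(c):                              # steps (2) and (3)
        good = [w for w in c.cr if good_cr(w, c)]
        total = sum(c.cr[w] for w in good)
        raw = sum(c.cr[w] * tf[w] for w in good)
        return raw / total if total else 0.0

    best, best_avg = None, -1.0              # steps (4) to (6)
    def visit(c, pedigree):
        nonlocal best, best_avg
        pedigree = pedigree + [doa(c)]
        if not c.children:                   # a leaf: average its pedigree
            avg = sum(pedigree) / len(pedigree)
            if avg > best_avg:
                best, best_avg = c, avg
        for child in c.children:
            visit(child, pedigree)
    for top in root.children:                # skip the root itself
        visit(top, [])
    return best
```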
2.2.2 Behavior analysis
The classifier consists of two phases: (1) the estimation of the DOA of each category (Steps 2 and 3 in Table 2), and (2) the identification of the winner category (Step 5 in Table 2). The time complexity of the first phase is dominated by Step 2, which requires O((H-1)q) computations, where H is the height of the tree and q is the number of distinct words in the input document. This is because, in the worst case, all terms selected from the input document are CR terms, so O(q) strength additions are required at each level (from level 1 to level H). The second phase requires O(p+m+m) computations, where p is the number of categories in the tree and m is the number of leaf categories: the phase computes the total DOA value of each pedigree in a top-down manner (i.e. O(p)), averages the DOA value for each leaf category (i.e. O(m)), and outputs the largest one (i.e. O(m)). The classifier is thus efficient, since its time complexity is dominated by the first phase (i.e. O((H-1)q) >> O(p+m+m)), and in practice H is small for most text hierarchies. In contrast, typical previous techniques classified each document over a universal feature set whose size could be tens of thousands (>> q).

Classification based on context mining also achieves fault tolerance in hierarchical DC. Fault tolerance is a major weakness of strictly top-down hierarchical DC (e.g. [5]), in which any single misclassification at an internal node (i.e. a non-leaf category) finally leads to a significant error. ACclassifier achieves fault tolerance by consolidating the DOA values from different nodes, without invoking an exponential number of complicated classification tasks.
3. EXPERIMENTS
To further evaluate ACclassifier, we conducted experiments on a real-world document database.
3.1 Environments
Experimental data was extracted from the "science", "computers and Internet", and "society and culture" categories of Yahoo! (http://www.yahoo.com). In total, 83 categories were extracted, among which 25 were non-leaf categories (and hence 58 were leaf categories). The collection contained 1997 documents with an average length of about 6 KB. Among the 1997 documents, 1838 served as training documents and 159 as testing documents. The testing documents were extracted from each category, with the number extracted from a category proportional to the total number of documents in that category. The testing documents could thus comprehensively represent the contents of the information space, and their distribution could reflect the common distribution of the documents entered for classification.

To simulate the common evolution of most document databases, we separated the training documents into two sets: 1100 documents for initial training and 738 documents for evolutionary training. The documents for initial training were comprehensively sampled from each category as well. They were used to train all systems (including ACclassifier and the baseline systems described below). After initial training, the documents for evolutionary training were entered one by one, so that the contributions of adaptation could be evaluated. We measured both the efficiency and the precision of all the systems.
3.2 Systems evaluated
As noted above, ACclassifier has two parameters: δ (for filtering out non-important words in training) and minSupport (for filtering out non-representative features in testing). In the experiment, δ was set to 5. As to minSupport, since there were often thousands of words in a document and δ was 5, minSupport was set to 0.001 (5/5000). As noted above, this setting does not need to be changed as the document database evolves (unlike the feature set size, which depends on the current collection of documents).

As to the baseline systems, we did not find any previous DC technique dedicated to all the challenges of ADC. To facilitate performance comparison between ACclassifier and most previous techniques, we set up two baselines using the k-Nearest Neighbor technique (kNN) and the Naive Bayes technique (NB), which are popular techniques in DC. Both have been employed and evaluated with respect to various techniques, including non-hierarchical DC [6, 7, 17] and hierarchical DC [5, 9]. They treated all leaf categories as candidate categories. Under a public document database (i.e. the database from Yahoo!), the performance comparisons between ACclassifier and the baselines may facilitate cross-evaluation for measuring the contributions of incremental context mining.

Given a document d to be classified, kNN estimated the similarity between d and each training document. The most similar k training documents (i.e. neighbors) were allowed to use their degrees of similarity to "vote"; the final output category was simply the one that received the highest accumulated degree of similarity. The similarity estimation was based on the well-known vector space model (VSM), in which each document is represented as a vector (using the predefined features as the dimensions of the vector space), and the similarity between two documents is measured by the cosine of the angle between them (cosine similarity). We tested two versions of kNN by setting k (the number of neighbors) to 5 (kNN-5) and 10 (kNN-10), respectively. NB, on the other hand, pre-estimated the conditional probability P(wi|Cj) for every selected feature wi and category Cj (with standard Laplace smoothing to avoid zero probabilities). The "similarity" between a category and an input document d was based on the product of the conditional probabilities of the features (in the category) that occurred in d, and the NB classifier simply output the most similar category (for more details, the reader is referred to [6, 7, 9, 17]). Another important reason for choosing kNN and NB as the baselines was therefore that they represent two typical branches of techniques: kNN can "adapt," while NB cannot. kNN is basically a memory-based reasoning method; it classifies the input document by finding the most similar training documents (neighbors), so as more training documents were entered, the capability of kNN evolved. NB, in contrast, employs a fixed training set to build a table of conditional probabilities, so its capability could not evolve as new documents were entered.

Both kNN and NB required a fixed feature set, which was built from the 1100 documents for initial training. Feature selection was based on the strength of each feature, estimated by the χ² weighting technique, which has been shown to be more promising than others [18]. As noted above, there is no perfect way to determine the size of the feature set. Therefore, for each baseline, we tried three different sizes: 3500 features (kNN-3500 and NB-3500), 5000 features (kNN-5000 and NB-5000), and all features (kNN-All and NB-All). When all features were selected, there were 6715 features. These sizes were chosen because they differ by a similar amount (5000-3500 = 1500 ≅ 6715-5000). We aimed to measure the performances of the baseline systems under different feature set sizes; the result facilitated the investigation of the contributions of context mining in reducing the errors incurred by inappropriate features.
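For reference, the kNN baseline's voting scheme can be sketched as below. Document vectors are plain term-frequency dictionaries over the predefined feature set, cosine similarity is the standard formulation, and all names are our own.

```python
import math
from collections import Counter, defaultdict

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(n * v.get(w, 0) for w, n in u.items())
    norm = math.sqrt(sum(n * n for n in u.values())) * \
           math.sqrt(sum(n * n for n in v.values()))
    return dot / norm if norm else 0.0

def knn_classify(doc_words, training, k=5):
    """training: list of (term-frequency Counter, category) pairs. The k
    most similar training documents vote with their similarity scores."""
    d = Counter(doc_words)
    scored = sorted(((cosine(d, vec), cat) for vec, cat in training),
                    key=lambda t: t[0], reverse=True)
    votes = defaultdict(float)
    for sim, cat in scored[:k]:
        votes[cat] += sim
    return max(votes, key=votes.get) if votes else None
```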
3.3 Result and analysis
Figure 1 shows the performances of all baselines under different feature set sizes. NB outperformed kNN (with k=5) in classification precision, although kNN could improve its performance through evolution. Moreover, when considering the final performances of the baselines, kNN-5000 outperformed kNN-3500 (63 vs. 57 correct classifications), while NB-5000 could not outperform NB-3500 (97 vs. 97). NB-All did not outperform NB-5000 and NB-3500 either. This confirmed that using more features does not necessarily lead to better performance; setting the feature set is a trial-and-error process that depends on the technique employed and the documents collected. Based on this result, the baseline systems were allowed to use 5000 features in their feature sets. For kNN, we tested different settings of the parameter k (i.e. kNN-5 and kNN-10). The results are shown in Figure 2. ACclassifier achieved the best classification precision with adaptation capability. When
compared with the best version of kNN (i.e. kNN-5), ACclassifier achieved a higher precision (123 vs. 63, a 95.2% improvement). When compared with the best version of NB (i.e. NB-5000), ACclassifier achieved a higher precision as well (123 vs. 97, a 26.8% improvement). A previous hierarchical DC technique did not show significant improvement over NB on a similar database from Yahoo! [9]. The contributions of ACclassifier to the evolution and precision of DC were thus promising. A detailed analysis showed that the improvement came from context recognition (in DC) supported by incremental context mining (in training): features and their strengths of being CR terms were incrementally learned, so the classifier could adapt and promote the precision of DC.
[Figure 1. Baselines under different feature set sizes: correct classifications vs. number of training documents (1200 to 1838) for kNN-3500, kNN-5000, kNN-All, NB-3500, NB-5000, and NB-All.]
[Figure 2. Classification of the testing documents: correct classifications vs. number of training documents (1200 to 1838) for ACclassifier, kNN-5, kNN-10, and NB.]
[Figure 3. Comparison with "well-trained" baselines: correct classifications vs. number of training documents (1200 to 1838) for ACclassifier, kNN-5, and NB.]
To qualitatively analyze the contributions of context mining in DC, consider the test document entitled "Setting up Email in DOS with today's ISP using a dialup PPP TCP/IP connection". It was extracted from the category Computers & Internet → Communication and Networking → Email. NB misclassified the document into Computers & Internet → Software → Operating Systems → Windows, while kNN misclassified it into Computers & Internet → Software → Operating Systems → Unix → Linux. The test document mentioned several terms (e.g. "Software", "Windows", and "Operating Systems") that were "preferred" by the categories output by the baseline systems. Most of these terms were not context-indicative and could not represent the main context of discussion of the document. ACclassifier recognized several context-indicative terms, such as "TCP/IP," "connection," "computer networking," and "userID"; the effects of the inappropriate terms (e.g. "Windows" and "Operating Systems") were thus alleviated.
To focus on the "pure" contributions of context mining (i.e. removing the contributions of learning new features), we allowed the baselines to use all training documents to build their feature sets and classifiers. The baselines were thus "well-trained" in the sense that they were built from more complete sets of training documents and features; in total, there were 8999 features in the feature set. The results are shown in Figure 3. kNN finally demonstrated slightly better precision (67 correct classifications, vs. 63 in Figure 2), while NB demonstrated slightly poorer performance (95 correct classifications, vs. 97 in Figure 2). This indicated that the classifiers could not significantly benefit from a more complete set of training documents and a more complete set of features (there were techniques that selected all features to build their classifiers [2]). This is a major bottleneck of employing the previous techniques to build classifiers by mining. Context mining provides a promising way to tackle the challenge: it facilitates context recognition, and hence alleviates the effects of inappropriate features in DC. The results confirm the analysis of the contributions of the context miner (ref. section 2.1.2). When compared with the average performance of the baselines, ACclassifier contributed a 51.9% improvement (123 vs. 81).
[Figure 4. Efficiency of training and testing: cumulative training and testing time (sec.) vs. number of training documents (1200 to 1838) for ACclassifier and kNN-5.]
In addition to the precision of DC, we were also concerned with the efficiency of context mining and document classification. As noted above, efficiency is essential to the realization of online adaptation and DC. Figure 4 shows the cumulative run time of ACclassifier and kNN-5. Since kNN-10 ran more slowly than kNN-5, and NB could not evolve, they were excluded from the figure. The efficiency of kNN-5 degraded much faster than that of ACclassifier. This was because almost all of the time spent by kNN-5 went to comparing the input document with each training document (i.e. testing): as more training documents accumulated in the database, each testing document needed to be compared with more training documents, increasing the system's load. The time spent by ACclassifier, on the other hand, grew more slowly once about 1400 training documents had been entered, because the number of new features learned grew more slowly beyond that point, reducing the growth of ACclassifier's load in updating the strengths of features.
[Figure 5. Efficiency of testing (classification): cumulative testing time (sec.) vs. number of training documents (1200 to 1838) for ACclassifier and NB.]
It is also interesting to compare the systems' efficiency in classifying documents (i.e. excluding training time). Figure 5 shows the cumulative classification time spent by ACclassifier and NB, which was much more efficient than kNN. The result showed that ACclassifier was 3.3 times faster than NB (160 seconds vs. 528 seconds) in DC, confirming the time-complexity analysis in section 2.2.2. As classification is often invoked more frequently than training, the high efficiency contributed by ACclassifier is of particular significance to DC in practice. Together with the above results on precision, the results show that incremental context mining may be efficient enough to support adaptive and higher-precision DC with better efficiency.
4. CONCLUSIONS AND EXTENSIONS
This paper presents an efficient incremental context mining technique, ACclassifier, and demonstrates its contributions to adaptive DC. Its contributions lie in (1) efficient mining of the contextual requirements for high-precision DC, (2) incremental mining without reprocessing previous documents, (3) evolutionary maintenance of the feature set, and (4) efficient and fault-tolerant hierarchical DC. The contributions are essential for applications that base their operation on the efficient classification of information and knowledge documents whose content and vocabulary may evolve over time (e.g. management of information and knowledge in the ever-changing world). We conjecture that the misclassifications made by ACclassifier could mainly be attributed to the imprecision of the contexts of the documents, although there is no perfect way to judge whether the context of a document has been precisely specified. Some of the errors may even be due to incorrect classifications of training documents (also treated as the "noise" of real-world training data [9]). We are thus further analyzing the errors and seeking ways to refine the technique. Typical refinements include active sampling of training documents, detection of errors in training documents, tolerance of the errors, and incremental recovery from them.
5. ACKNOWLEDGEMENT
This research was supported in part by the National Science Council of the Republic of China under grants NSC 88-2213-E-216-003 and NSC 89-2218-E-216-008.
6. REFERENCES
[1] C. Apte, F. Damerau, and S. M. Weiss (1994), Automated Learning of Decision Rules for Text Categorization, ACM Transactions on Information Systems, Vol. 12, No. 3.
[2] W. W. Cohen and Y. Singer (1996), Context-Sensitive Learning Methods for Text Categorization, Proc. of ACM SIGIR'96.
[3] S. Dumais and H. Chen (2000), Hierarchical Classification of Web Content, Proc. of ACM SIGIR 2000.
[4] M. Iwayama and T. Tokunaga (1995), Cluster-Based Text Categorization: A Comparison of Category Search Strategies, Proc. of ACM SIGIR'95.
[5] D. Koller and M. Sahami (1997), Hierarchically Classifying Documents Using Very Few Words, Proc. of ICML'97.
[6] W. Lam and C. Y. Ho (1998), Using a Generalized Instance Set for Automatic Text Categorization, Proc. of ACM SIGIR'98.
[7] L. S. Larkey and W. B. Croft (1996), Combining Classifiers in Text Categorization, Proc. of ACM SIGIR'96.
[8] S. Lawrence and C. L. Giles (1998), Context and Page Analysis for Improved Web Search, IEEE Internet Computing, Vol. 2, No. 4, pp. 38-46.
[9] A. McCallum, R. Rosenfeld, T. Mitchell, and A. Y. Ng (1998), Improving Text Classification by Shrinkage in a Hierarchy of Classes, Proc. of ICML'98.
[10] D. Mladenic and M. Grobelnik (1998), Feature Selection for Classification Based on Text Hierarchy, Proc. of the Conference on Automated Learning and Discovery.
[11] H. T. Ng, W. B. Goh, and K. L. Low (1997), Feature Selection, Perceptron Learning, and a Usability Case Study for Text Categorization, Proc. of ACM SIGIR'97.
[12] H. M. Nicholas and P. J. Clarkson (2000), Web-Based Knowledge Management for Distributed Design, IEEE Intelligent Systems, pp. 40-47.
[13] I. Nonaka (1994), A Dynamic Theory of Organizational Knowledge Creation, Organization Science, Vol. 5, No. 1.
[14] E. Riloff and W. Lehnert (1994), Information Extraction as a Basis for High-Precision Text Classification, ACM Transactions on Information Systems, Vol. 12, No. 3.
[15] R. E. Schapire, Y. Singer, and A. Singhal (1998), Boosting and Rocchio Applied to Text Filtering, Proc. of ACM SIGIR'98.
[16] P. Willett (1988), Recent Trends in Hierarchical Document Clustering: A Critical Review, Information Processing & Management, Vol. 24, No. 5.
[17] Y. Yang and X. Liu (1999), A Re-examination of Text Categorization Methods, Proc. of ACM SIGIR'99.
[18] Y. Yang and J. O. Pedersen (1997), A Comparative Study on Feature Selection in Text Categorization, Proc. of ICML'97.
[19] Y. Yang and C. G. Chute (1994), An Example-Based Mapping Method for Text Categorization and Retrieval, ACM Transactions on Information Systems, Vol. 12, No. 3.