A Study of Approaches to Semi-structured Document Classification

Andrej Bratko and Bogdan Filipič
{andrej.bratko,bogdan.filipic}@ijs.si

Department of Intelligent Systems
Jožef Stefan Institute
Jamova 39, SI-1000 Ljubljana, Slovenia
Technical Report IJS-DP 9015 November 23, 2004
Abstract

This report examines several different approaches to exploiting structural information in semi-structured document classification. The methods range from trivial modifications of text modeling and classification algorithms to more elaborate classification schemes, specifically tailored to structured documents. We compare the performance of these methods on five datasets in combination with three different text classification algorithms. The best results were obtained with stacking, an approach in which predictions based on different structural components are combined by a master classifier. We also discuss drawbacks of modeling structure, particularly the problem of data sparseness, and introduce effective smoothing methods to overcome these difficulties for probabilistic classifiers. One interesting overall result is that the best performance on every single dataset was achieved by structured representations, making a strong case for incorporating structural information in document classification.
1 Introduction
The task of supervised text classification, or categorization, is to learn to classify text documents into predefined categories based on a limited number of labeled training examples. Research in text classification has received much attention over recent years from both the information retrieval and machine learning communities, and many methods have been proposed for this task. Very often, the documents being classified belong to the broad category of semi-structured documents. With the increasing adoption of standards for structured document representation, such as XML, text collections will become even more structured. However,
most existing text categorization research is focused on flat text categorization, ignoring potentially useful information that can be drawn from document structure.

This report examines several different approaches to exploiting structural information in semi-structured document classification. The methods range from trivial modifications of text modeling and classification algorithms to more elaborate classification schemes, specifically tailored to structured document classification. Many of these approaches are novel or have not been evaluated thoroughly by previous research, especially on non-hypertext document collections. This is also the first work to give a comparative evaluation of the identified methods on standard data collections. We expect these results to aid system designers in their choice of document representation.

We first investigate the use of a simple preprocessing scheme called tagging, which has the advantage of being independent of the classification algorithm that is used. In the splitting approach, a separate model is built to classify each distinct document part. The models are combined in a way that is natural to the underlying classification algorithm. Finally, we consider stacking, in which a master classifier makes the final prediction based on the results of different models obtained by splitting.

We evaluate these methods empirically in combination with three different classification algorithms. Experiments were conducted on five standard datasets containing different types of semi-structured documents, such as web pages, e-mail messages and news stories. Experimental results indicate that classification performance can often benefit from structural information. Of the different approaches that were considered, the stacking approach seems particularly promising.

We also discuss drawbacks of the methods presented, particularly the problem of data sparseness, a common problem in text classification, which increases as the complexity of the model is increased in order to incorporate structural information. We introduce effective methods to overcome these difficulties using probability smoothing methods developed for natural language modeling. These methods are generally applicable to probabilistic classifiers, such as Naive Bayes.

The remainder of the report is structured as follows. We first introduce semi-structured documents in Section 2 and review previous work on semi-structured document classification in Section 3. In Section 4 we briefly introduce the Naive Bayes, Support Vector Machine and Fuzzy Set Membership algorithms for text classification. Section 5 presents different approaches to structured document classification. In Section 6 we introduce selected smoothing methods used in natural language modeling and data compression and show how these methods can be used for semi-structured document classification with probabilistic classifiers, such as Naive Bayes. This is followed by a review of our experimental setup and a report on the results of our evaluation in Section 7. The final section summarizes our findings and points out some promising directions for future research.
2 Semi-structured Documents
In this report, we focus on classification of semi-structured documents. Such documents may be broken down into components or fields, each of which contains either structured (numeric or symbolic) data or non-structured textual data. In some cases, document structure is hierarchical, so that components may also include other components. An example is shown in Figure 1.
[Figure 1 content not reproduced: an example e-mail message with From, To, Subject and Body fields, annotated with structural components S1–S4.]
Figure 1: A semi-structured document with components S1–S4

The structure of a document is often identified by markup languages. A good example is the popular XML format, which is quickly becoming a standard for data transfer and storage on the web. Another, related example is HTML. It should be noted that even older, legacy data formats often contain structure, which is usually less flexible and often implicit, or in the form of meta-data or headers. Electronic mail and scientific articles are such examples. Indeed, most electronic document formats contain structural information in some form or another, albeit to a varying degree.

Structure can be very helpful to humans for determining the source, topic, or other characteristics of a given document. We expect that some document components are much more useful than others for certain classification tasks. In some cases, documents are explicitly organized by the contents of a certain structural component; for example, many people organize their email based exclusively on the sender field. It is the purpose of this work to investigate how such structural information may be exploited for automated text categorization.
3 Related Work
Although most text categorization research is limited to flat text representations, there exists a limited body of literature dealing with semi-structured documents. Most of this research is focused on particular types of documents, such as hypertext or e-mail.

Fürnkranz [11] shows an improvement in classification accuracy on hypertext documents by using the textual context of links to a web page as features for classification, rather than the text of the web page itself. A similar approach is adopted by Glover et al. [13], who also consider combining such a model with a model trained on the local text. Chakrabarti, Dom and Indyk [3] also exploit the hyperlink structure of web pages and combine this with the prediction based on the local text. These approaches typically use document structure to determine the linked documents and to extract the relevant portions of text from linked documents. However, all these approaches are limited to HTML pages or other types of linked documents. In this work, we are more concerned with the use of internal document structure.

A study by Yang, Slattery and Ghani [33] investigates a number of approaches to
classification of hypertext documents. Although they also use information gained from the HTML hyperlink structure, they also consider internal document structure in some approaches. They show that using HTML meta tags and title words alone sometimes outperforms flat-text models. A previous study by the same authors [12] shows that combining these features with the flat text content of web pages usually outperforms either representation alone.

Yi and Sundaresan [34] propose a tree structure for modeling HTML documents where nodes correspond to document components, such as headings, links, etc. They consider tagging words with the identifier of the component in which they occur and using a Naive Bayes classifier on the resulting feature vector. They also propose splitting the feature vector into separate vectors for individual components and training a separate model for each such component. This splitting approach is augmented with transition probabilities for the nodes in order to incorporate a model of the structure itself into the prediction. Experiments on two HTML datasets indicate significant performance gains with tagging and further gains with their splitting approach. Denoyer and Gallinari [6] use a very similar approach, which is further refined by using an SVM classifier with a Fisher kernel on top of the original generative model. They report encouraging results on a number of datasets with different types of structured documents. Diligenti et al. [7] also use a tree structure for modeling HTML documents. They train a Hidden Tree-Markov Model on the training documents, so that nodes in the structure correspond to the hidden states. Since the hidden states are in fact observable, this approach is again similar to [34] and [6]. Their tests on the WebKb dataset indicate minor gains of their method with respect to a flat-text Naive Bayes classifier.

Recently, Klimt and Yang [18] investigated a stacked approach to classifying email messages using ridge regression to combine SVM models trained on different message components. This method substantially improves on the flat-text model for their email corpus, although the details of their methods are unclear.

In dealing with increased data sparseness arising from structured document representations, we make use of smoothing methods originally designed for natural language modeling and data compression. Recently, similar smoothing techniques have been applied to Naive Bayes flat-text categorization. Vilar et al. [28] investigate smoothing per-class probabilities with global word distributions from the entire training set. Peng, Schuurmans and Shaojun [24] augment the Naive Bayes text classifier using n-gram language models so that word probabilities are estimated depending on preceding words in the text in addition to the class variable.
4 Classification Algorithms
We evaluate different approaches to structured document classification in combination with three different text classifiers. We used two popular text categorization algorithms, namely Naive Bayes and the Support Vector Machine, and a third algorithm, named Fuzzy Set Membership, which was recently proposed by Wingate and Seppi [29]. A brief description of each of these three algorithms is provided in the following subsections.
4.1 Naive Bayes
The Naive Bayes (NB) classifier is commonly used in text categorization [22, 19] due to its relatively good performance, favorable size and speed complexity and its ability to learn incrementally. Many variants of Naive Bayes are found in the literature, depending on the underlying probability model that is used. The most common are Multivariate Bernoulli and Multinomial models, of which the Multinomial model usually performs best [21, 9].

To classify a previously unobserved document d, the Multinomial Naive Bayes classifier selects the class c that is most probable with respect to the document text. An estimate for the conditional class probability p(c|d) is obtained using Bayes' rule:

$$p(c|d) = \frac{p(c)\,p(d|c)}{p(d)} \qquad (1)$$
The prior probability p(c) is estimated from the training data as the relative frequency of training documents belonging to class c. The conditional probability of document d, given the class label c, is calculated from the probabilities of individual word occurrences over all words found in d:

$$p(d|c) = \frac{|d|!}{\prod_{w \in d} f(w,d)!} \prod_{w \in d} p(w|c)^{f(w,d)} \qquad (2)$$
In the above equation, f(w,d) denotes the number of occurrences of word w in document d and |d| is the sum of the frequencies of all words in the document, i.e. the length of the document. The multinomial coefficient $|d|!/\prod_{w \in d} f(w,d)!$ denotes the number of all possible orderings of the words. Equation (2) is the probability of the observed word distribution under a multinomial model. It is based on the assumption that each word occurrence is the result of an independent multinomial trial. Note that p(d) in Equation (1) and the multinomial coefficient in Equation (2) are independent of the class, and can therefore be omitted in the final class selection rule. The individual word probabilities are derived from the training data by simply counting the number of word occurrences (word frequencies):

$$p(w|c) = \frac{1 + f(w,c)}{|V| + \sum_{w' \in V} f(w',c)} \qquad (3)$$
where f(w,c) denotes the number of occurrences of word w in class c. Laplace smoothing is commonly used to avoid zero probabilities. The particular form of smoothing used here corresponds to adding one 'virtual' occurrence of each word in the vocabulary V to the training data of each class.
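The following minimal sketch (illustrative, not the implementation used in the report) shows multinomial Naive Bayes training and classification with the Laplace estimate of Equation (3), computed in log space to avoid numerical underflow:

```python
import math
from collections import Counter, defaultdict

class MultinomialNB:
    """Minimal multinomial Naive Bayes with Laplace ('add-one') smoothing."""

    def fit(self, docs, labels):
        # docs: list of token lists; labels: list of class labels
        self.vocab = {w for doc in docs for w in doc}
        self.class_counts = Counter(labels)                    # for p(c)
        self.word_counts = defaultdict(Counter)                # f(w, c)
        for doc, c in zip(docs, labels):
            self.word_counts[c].update(doc)
        self.total_words = {c: sum(self.word_counts[c].values())
                            for c in self.class_counts}
        return self

    def log_p_word(self, w, c):
        # Equation (3): (1 + f(w,c)) / (|V| + sum_w' f(w',c))
        return math.log((1 + self.word_counts[c][w]) /
                        (len(self.vocab) + self.total_words[c]))

    def predict(self, doc):
        n = sum(self.class_counts.values())
        scores = {}
        for c in self.class_counts:
            score = math.log(self.class_counts[c] / n)          # log p(c)
            for w in doc:
                if w in self.vocab:                             # skip unseen words
                    score += self.log_p_word(w, c)
            scores[c] = score
        return max(scores, key=scores.get)
```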
4.2 Support Vector Machine
The Support Vector Machine (SVM) [27] is a kernel-based classification method suited to binary classification problems. The SVM has favorable theoretical properties for text categorization and is also known to perform well on such tasks empirically [17, 16, 32]. We outline the basic ideas and motivation behind the SVM algorithm here. For details, see for example [1].

The SVM seeks to find a hyperplane that separates positive and negative training examples with the widest margin, as depicted in Figure 2.

[Figure 2 not reproduced: Support vectors and the maximum margin hyperplane]

The motivation for this principle is that maximizing the separation margin minimizes an upper bound on the Vapnik-Chervonenkis dimension of the resulting classifier, which in turn serves as an indication of better generalization ability of such a model. A dual optimization problem to the problem of finding the maximum margin hyperplane exists, which may be solved efficiently using quadratic programming. It turns out that the maximum margin hyperplane is a linear combination of the examples which lie closest to the decision boundary. These examples are called support vectors. The SVM classifies a test example d into the positive class if it occurs on the positive side of this hyperplane, and vice versa. In the linear case, this means that the positive class is selected if

$$\mathbf{a} \cdot \mathbf{d} + b > 0 \qquad (4)$$
where a and b are the model parameters learned by the SVM. The SVM may be generalized to non-linear class boundaries by replacing the dot product with a non-linear kernel (e.g. polynomial kernels, radial basis functions, etc.). However, linear kernels are usually sufficient for text classification problems [32].

Different methods exist for extending the SVM to multi-class problems. We use the most straightforward and most often used one-vs-rest approach, which builds one binary classifier for each class. All training documents not in the class are used as negative examples. The class with the highest score is predicted in case of unilabeled classification. For multilabeled classification, all classes that satisfy the decision rule in Equation 4 are selected.

Although a number of different document representations have been studied for text categorization with SVMs, it is generally believed that the SVM is fairly robust to the choice of representation [5]. In our experiments, documents were represented using Term Frequency/Inverse Document Frequency (TFIDF) vectors, a representation which is commonly used in combination with the SVM. TFIDF weights for each word w in document d are computed by the following formula:
$$TFIDF(w,d) = f(w,d) \cdot \log \frac{|D|}{|\{d'\,;\; d' \in D \wedge f(w,d') > 0\}|} \qquad (5)$$
where f(w,d) is the number of occurrences of word w in document d and $|\{d'\,;\; d' \in D \wedge f(w,d') > 0\}|$ is the number of documents in the training data that contain w. These weights are used as components of the document vector d. All document vectors were normalized with their Euclidean length before training the SVM.
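For illustration, a small sketch of the TFIDF weighting of Equation (5) with subsequent Euclidean normalization (illustrative code, not the preprocessing actually used with SVMlight in the experiments):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns a list of {word: weight} dicts,
    each normalized to unit Euclidean length (Equation 5)."""
    n_docs = len(docs)
    df = Counter()                       # document frequency of each word
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {w: f * math.log(n_docs / df[w]) for w, f in tf.items()}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vectors.append({w: v / norm for w, v in vec.items()})
    return vectors
```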
4.3 Fuzzy Set Membership
The Fuzzy Set Membership (FSM) [29] algorithm is a simple ad hoc linear classifier. It was designed as an alternative to Naive Bayes, primarily to overcome the need for smoothing and other parameter estimation problems of NB, without sacrificing the advantages of NB, such as its speed and its ability to learn incrementally. The original FSM algorithm, as proposed by Wingate and Seppi [29], selects the class c(d) according to the following class selection rule:

$$c(d) = \arg\max_c \left[ \sum_{w \in d} f(w,d)\, \frac{f(w,c)}{f(w,\bullet)} \right] \qquad (6)$$
The sum in Equation 6 is over all words w that appear in document d. f(w,d) denotes the number of occurrences of word w in document d, f(w,c) is the number of occurrences of the word in all training documents belonging to class c and f(w,•) is the total number of occurrences of word w in the entire training dataset. The fraction f(w,c)/f(w,•) can be interpreted as the probability of class c, given that word w appears in the document. The overall score for class c is thus the average probability of class c over all word occurrences.

Initial results that we obtained with the original version of the FSM classifier were not too impressive (see Section 7.5). In particular, the original method suffers from a high bias in favor of larger classes. To illustrate this issue, consider a binary classification task in which one class contains three times as much training data as the other. A quick look at the FSM class selection rule in Equation 6 reveals that the FSM classifier will favor the larger class on a particular word, even if the word is twice as common in the smaller class. In order to improve performance on datasets with imbalanced training data, we modified the original class selection rule to use relative word frequencies:

$$c(d) = \arg\max_c \left[ \sum_{w \in d} f(w,d)\, \frac{f^{*}(w,c)}{\sum_{c'} f^{*}(w,c')} \right] \qquad (7)$$
where f*(w,c) denotes the relative frequency of word w in class c:

$$f^{*}(w,c) = \frac{f(w,c)}{f(\bullet,c)} \qquad (8)$$

In the above equation, f(•,c) is the number of occurrences of all words in the training data for class c.
Indeed, this modified algorithm, which we labeled FSM*, achieves performance that is competitive with Naive Bayes on most datasets and sometimes substantially better (see Section 7.5).
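A minimal sketch of the FSM* training statistics and selection rule of Equations (7) and (8) (illustrative names, not the authors' implementation):

```python
from collections import Counter, defaultdict

def train_fsm_star(docs, labels):
    """Return relative word frequencies f*(w,c) = f(w,c) / f(.,c) per class."""
    counts = defaultdict(Counter)
    for doc, c in zip(docs, labels):
        counts[c].update(doc)
    return {c: {w: f / sum(wc.values()) for w, f in wc.items()}
            for c, wc in counts.items()}

def classify_fsm_star(doc, rel_freq):
    """Equation (7): score each class by sum_w f(w,d) * f*(w,c) / sum_c' f*(w,c')."""
    scores = dict.fromkeys(rel_freq, 0.0)
    for w, f_wd in Counter(doc).items():
        total = sum(rf.get(w, 0.0) for rf in rel_freq.values())
        if total == 0.0:                     # word unseen in the training data
            continue
        for c, rf in rel_freq.items():
            scores[c] += f_wd * rf.get(w, 0.0) / total
    return max(scores, key=scores.get)
```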
5 Exploiting Document Structure
In this section, we attempt to present different approaches to structured document classification in a unified framework. To our knowledge, most existing work on semi-structured document classification can be viewed as a special case or a combination of the general approaches described here. Note however that we limit ourselves to using only internal document structure and do not explore any structural relations among multiple documents, such as web structure.
5.1 Tagging
A common approach is to treat word occurrences in different document components as different features. We refer to this as tagging, since the most straightforward implementation is to tag words with the name of the component in which they appear, thus producing different features for the same word in different contexts. It should be noted that component tagging is often used by authors in comparative evaluations of text categorization algorithms, such as the one by Dumais et al. [8], but no comparison to the flat-text model is usually provided. Experiments by Yi and Sundaresan on two HTML datasets indicate that component tagging can significantly boost accuracy [34]; however, we are not aware of any thorough comparison of this approach and the flat bag-of-words (BOW) representation.
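The tagging transformation itself is a one-line preprocessing step; a sketch with illustrative component names (the Flat+Tag variant evaluated later in Section 7.7 simply adds these tagged tokens to the original untagged ones):

```python
def tag_document(components):
    """components: dict mapping component name to a list of word tokens,
    e.g. {"subject": ["stacking", "results"], "body": [...]}.
    Returns a single flat bag of words with component-tagged features."""
    return [f"{name}:{word}" for name, words in components.items() for word in words]

# Example: the same word in different components becomes a different feature.
msg = {"subject": ["hockey", "tickets"], "body": ["the", "hockey", "game"]}
print(tag_document(msg))   # ['subject:hockey', 'subject:tickets', 'body:the', ...]
```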
5.2 Splitting
Another method for incorporating document structure is to train as many instances of the base classifier as there are structural components. A separate model is trained for each document component and the final classification is a combination of the predictions of all such models. We call this approach splitting, since the flat-text BOW vector is split into a set of BOW vectors, one for each component. Each document is thus represented by one such set of vectors. The predictions of the individual models are combined in a manner that is "natural" to the classification algorithm at hand. In particular, the combination of individual models is not weighted, so that words occurring in any particular structural component are not given preference per se.

If we assume that words are distributed differently in different document components, then this approach may model these distributions better, since it considers them separately. Note that if words are not distributed differently in different document components, then the best we can do is to use the flat-text model.
5.2.1 Splitting + Naive Bayes
Staying with the independence assumption, components are considered independent, so that the total probability of the observed document is merely a product of the individual component probabilities:

$$p(d|c) = \prod_{s \in d} p_s(d_s|c) \qquad (9)$$

$$p_s(d_s|c) = \prod_{w \in s} p_s(w|c)^{f_s(w,d)} \qquad (10)$$
The first product is over all structural components s that are present in the document d, so that p_s(d_s|c) is the probability of each document component. The multinomial coefficient has been omitted for brevity. A separate model p_s is trained for each document component and word frequencies f_s(w,d) are maintained on a per-component basis. This method of combining per-component models is consistent with the work of other authors [34, 6, 7].
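Under Equations (9) and (10), combining the per-component Naive Bayes models amounts to summing their log probabilities; a sketch that reuses the MultinomialNB class from the Section 4.1 sketch (illustrative code, with class priors passed in separately):

```python
import math

def train_split_nb(docs, labels):
    """docs: list of dicts mapping component name -> token list.
    Trains one MultinomialNB (see the Section 4.1 sketch) per component."""
    components = {s for doc in docs for s in doc}
    return {s: MultinomialNB().fit([doc.get(s, []) for doc in docs], labels)
            for s in components}

def classify_split_nb(doc, models, class_priors):
    """Equation (9): log p(c) + sum over components of log p_s(d_s | c)."""
    scores = {}
    for c, prior in class_priors.items():
        score = math.log(prior)
        for s, tokens in doc.items():
            model = models.get(s)
            if model is None:
                continue
            score += sum(model.log_p_word(w, c) for w in tokens if w in model.vocab)
        scores[c] = score
    return max(scores, key=scores.get)
```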
5.2.2 Splitting + SVM
The output of the SVM classifier is the distance from the test example to the separating hyperplane. The most straightforward method of combining these predictions is a sum over all per-component models. However, this approach overvalues words occurring in shorter components. The reason for this is that the SVM is relatively stable w.r.t. the dimensionality of the input vector; thus we expect all components to contribute a similar amount to the overall sum on average. In other words, the information about how many features contributed to the prediction of a particular component model is lost. To give an example, consider an email message with a single word in the subject and 200 words of body text. If both component models yield a similar margin, the subject word will contribute as much to the overall classification as all 200 words in the body, although some body words may actually separate the classes better.

Platt reports empirical evidence that the output of a linear SVM classifier is often proportional to the log odds of the positive class [25]. In view of this, a weighted sum of the scores produced by the individual component models seems like a good way of combining the models. One plausible approach is to multiply the component scores with the number of words that contributed to the score, i.e. the number of non-zero values in the feature vector for the document component. The original binary decision rule for SVM in Equation 4 thus becomes:
$$\sum_{s \in d} |\{w;\, w \in s\}| \, (\mathbf{a}_s \cdot \mathbf{d}_s + b_s) > 0 \qquad (11)$$
The sum is over all structural components s, where |{w; w ∈ s}| is the number of distinct words in the component and d_s is the corresponding feature vector of the example document. A separate model ⟨a_s, b_s⟩ is learned by the SVM for each component. If we assume that the output of the SVM is independent of the dimension of the input vector, then this approach ensures that all words are treated as equally important.

Note that in reality, the resulting classifier may still overvalue shorter document components. To illustrate this, consider a hypothetical case where training examples are distributed evenly in the instance space and the separating hyperplane cuts through the center of this space. Recall that all examples lie on a ball of radius 1 and all components of the vector are non-negative. As the dimensionality increases, an increasingly large portion of the instance space is "close" to the separating hyperplane.
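A sketch of the weighted decision rule of Equation (11), assuming a trained linear model (a_s, b_s) is available for each component as a weight dictionary and a bias (illustrative code; the experiments used SVMlight models rather than this implementation):

```python
def svm_split_decision(doc_vectors, models):
    """doc_vectors: dict mapping component name -> {word: tfidf weight}.
    models: dict mapping component name -> (weights dict a_s, bias b_s).
    Implements Equation (11): sum_s |{w; w in s}| * (a_s . d_s + b_s) > 0."""
    total = 0.0
    for s, d_s in doc_vectors.items():
        if s not in models or not d_s:
            continue
        a_s, b_s = models[s]
        margin = sum(a_s.get(w, 0.0) * x for w, x in d_s.items()) + b_s
        total += len(d_s) * margin          # weight by number of distinct words
    return total > 0                        # True means the positive class
```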
5.2.3 Splitting + FSM*
The FSM* algorithm assigns a score to each class which is a sum over all features in the document vector (Equation 7). After splitting the vector according to the document components, this sum is over all features of all the resulting vectors. Equivalently, the class score is computed as a sum of the scores produced by the individual document component models:

$$c(d) = \arg\max_c \left[ \sum_{s \in d} \sum_{w \in s} f_s(w,d)\, \frac{f_s^{*}(w,c)}{\sum_{c'} f_s^{*}(w,c')} \right] \qquad (12)$$

Again, the first sum is over all structural components s that are present in the document d. Relative word frequencies f_s*(w,c) are maintained for each document component separately.
5.3 Stacking
The splitting approach discussed in the previous section has the advantage of modeling word distributions in each document component separately. However, it treats all document components as equally important to the classification task at hand. In reality, we may expect that some document components are more relevant than others. For example, it seems intuitive to put greater weight on words occurring in an email subject than words in the body text, since the subject was composed by a human with the intent to capture the essence of the entire message.

A reasonable solution is to use classifier stacking [31] to combine the predictions of individual per-component models. In the stacking approach, the predictions of a number of level-0 classifiers are used as input for a level-1 classifier, or meta classifier, which in turn makes the final prediction. To achieve this, a number of design decisions must be made.
5.3.1 Meta Features
Firstly, we must choose a suitable representation for the output of the level-0 models. The features of this intermediate representation are also called meta-features, since they serve as input to the meta classifier. Ting and Witten [26] suggest that per-class probability estimates predicted by each of the underlying models are a suitable intermediate representation. When classifying an example document d, the input for the meta classifier is a fixed-length vector e_d of the following form:

$$e_d = \langle\, p_{cs}\,;\; s \in d \,\rangle \qquad (13)$$

Here, p_{cs} is the probability of class c, as predicted by the model corresponding to the structural component s. We implemented stacking using per-component models of NB, SVM and FSM* as level-0 classifiers. The meta-features p_{cs} are computed differently for each base classifier, as indicated below. All equations for p_{cs} contain a normalization constant Z, which is chosen so that $\sum_c p_{cs} = 1$.
NB: Since NB is a probabilistic classifier, its predictions are directly applicable as meta-features after normalization:

$$p_{cs} = \frac{1}{Z}\; p(c) \prod_{w \in s} p_s(w|c)^{f_s(w,d)} \qquad (14)$$
SVM: In order to obtain probability estimates, we transform the output of the level-0 SVM classifier for component s with the sigmoid function and re-normalize (see footnote 1 at the end of Section 5.3.3):

$$p_{cs} = \frac{1}{Z} \cdot \frac{1}{1 + e^{\,\mathbf{a}_{sc} \cdot \mathbf{d}_s + b_{sc}}} \qquad (15)$$
In the above equation, ⟨a_{sc}, b_{sc}⟩ are the parameters of the maximum margin hyperplane on the binary classification problem for class c. As in Section 5.2.2, d_s denotes the feature vector of document d pertaining to component s.

FSM*: The FSM* classifier is a probabilistic classifier. To obtain meta-features, we simply re-normalize the output of the per-component FSM* classifiers:

$$p_{cs} = \frac{1}{Z} \sum_{w \in s} f_s(w,d)\, \frac{f_s^{*}(w,c)}{\sum_{c'} f_s^{*}(w,c')} \qquad (16)$$
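As an illustration, the meta-feature vector of Equation (13) might be assembled as follows, with a fixed ordering of components and classes so that every document yields a vector of the same length; the class_scores method is an assumed stand-in for the per-classifier computations of Equations (14)-(16):

```python
def meta_features(doc, component_models, components, classes):
    """Build the level-1 feature vector e_d = <p_cs ; s in d> with a fixed layout.
    component_models[s].class_scores(tokens) is assumed to return non-negative
    per-class scores, which are normalized here to sum to one (the constant Z).
    Components missing from the document contribute zeros."""
    vector = []
    for s in components:                        # fixed component ordering
        scores = {c: 0.0 for c in classes}
        if s in doc and doc[s]:
            scores.update(component_models[s].class_scores(doc[s]))
        z = sum(scores.values()) or 1.0         # normalization constant Z
        vector.extend(scores[c] / z for c in classes)
    return vector
```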
5.3.2 The Meta Classifier
We expect that a suitable meta classifier will be able to learn which document components are more discriminative and weight the predictions of the corresponding per-component models accordingly. The work of Ting and Witten [26] suggests that a linear combination of the predictions made by the level-0 classifiers may work well in practice. We chose the SVM with a linear kernel as our meta classifier. Besides being a linear model geared directly toward classification, we expect it to work well with the chosen intermediate representation e_d, with all meta-features in [0, 1].
5.3.3 Training the Stacked Classifier
Finally, a strategy must be devised for training the meta classifier. In particular, unbiased training examples containing predictions made by the level-0 classifiers must be provided. To this end, we create a 10-fold split of the training data. For each fold, 1/10 of the training data is held out and the level-0 classifiers are trained on the remaining 9/10 of the data. The predictions of these models on the held-out portion of the data are used to create training examples for the meta classifier. After the meta-classifier is trained, the level-0 classifiers are re-trained on the entire training set. At classification time, an example document is first classified by all level-0 (per-component) classifiers. The output of these models is combined by the meta-classifier to form the final prediction.

Footnote 1: Using a sigmoid to transform the output of an SVM into probability estimates was suggested by Platt [25]. It should be noted that our approach is very simplistic in this respect. A more sophisticated approach would be to introduce additional parameters in order to calibrate the sigmoid. However, this would be computationally expensive, since it involves a nested cross-validation run to produce the data required to fit the sigmoid for each binary level-0 classifier. On the other hand, we expect that any gains would be negligible, since the output of the level-0 SVM classifiers is fairly regular and the meta classifier may adapt to the uncalibrated probability estimates easily.
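A sketch of this fold-based generation of level-1 training data (illustrative; train_level0 and meta_features are assumed helpers along the lines of the earlier sketches):

```python
def build_meta_training_set(docs, labels, components, classes, n_folds=10):
    """Generate unbiased level-1 training examples: each document is scored by
    level-0 models trained WITHOUT it (the 10-fold held-out scheme above)."""
    meta_X, meta_y = [], []
    for k in range(n_folds):
        train_idx = [i for i in range(len(docs)) if i % n_folds != k]
        held_out = [i for i in range(len(docs)) if i % n_folds == k]
        level0 = train_level0([docs[i] for i in train_idx],
                              [labels[i] for i in train_idx])
        for i in held_out:
            meta_X.append(meta_features(docs[i], level0, components, classes))
            meta_y.append(labels[i])
    return meta_X, meta_y

# The meta classifier (a linear SVM in this report) is then fitted on
# (meta_X, meta_y), and the level-0 models are re-trained on the full training set.
```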
5.3.4 Computational Considerations
The use of stacking implies a considerable computational overhead. Since the dimensionality of the intermediate representation is limited, training the meta classifier is fairly inexpensive. The brunt of the computations lies in generating the training data on which to build the meta classifier. In our implementation, the level-0 classifiers must be trained 10 times on large subsets of the training data to create examples for the meta-classifier. This overhead is negligible for incremental algorithms with linear time complexity, such as Naive Bayes and FSM*. The SVM with its quadratic complexity is slightly more problematic.
6 Probability Estimation in Structured Models
In this section, we identify some problems that are inherent to all methods for modeling structured documents discussed so far. These issues are especially daunting in combination with probabilistic classifiers, such as Naive Bayes. We also discuss methods to overcome these difficulties.
6.1 Data Sparseness and OOV Words
Word frequencies in natural language are modeled well by the heavy-tailed Zipfian (power-law) distribution. This means that most words in a typical corpus are extremely infrequent [23], making probability estimates used in probabilistic classifiers very unreliable. All approaches to structured document classification are based on the reasonable assumption that word distributions vary among structural components. However, this implies that one set of parameters must be learned for each structural component, greatly increasing the problem of data sparseness.

A second problem is an increase in the number of words that are novel in a certain context, so called out-of-vocabulary (OOV) words. By splitting the training data, we will often discard meaningful words in a test document because they are novel in the context of a particular document component, even if the word is otherwise a good indicator of the document class. It is intuitive to expect that a word which is common in one structural component is likely to occur in other components. For example, if the word "hockey" appears often in the body of email messages of a certain class, it is likely also to appear in the subject, even if this has not been observed so far. Moreover, words generally have the same meaning irrespective of the structural context.
6.2 Smoothing Probability Estimates
Our proposed solution to the issues of data sparseness and OOV words is to blend specific per-component word probability distributions with more general distributions, which contain consolidated statistics from a collection of document components. This approach is an attempt to combine the expressive power of per-component modeling with more
reliable parameter estimation of the flat BOW model. To achieve this, we employ techniques designed for smoothing probability estimates in natural language modeling and data compression.

All such smoothing methods work in a similar fashion: they reserve some probability mass from the observed events in the higher-order (more specific) model and distribute this probability among the unseen events, according to their distribution in a lower-order (more general) model, which may itself be blended. In our case, events correspond to words. The higher-order model is the per-component word probability distribution and the lower-order model is the consolidated distribution over multiple components. The consolidated distribution therefore provides a prior for the per-component distribution.

Figure 3 shows the document structure models used for the representation of email messages, web pages and newswire stories in our experiments. Consolidated components contain the concatenated text of all of their child components. These "virtual" components define the lower-order models used for smoothing per-component word probability estimates.

[Figure 3 (tree diagrams, not reproduced): Document structure models used for smoothing probability estimates. Leaf nodes (white) correspond to actual document components, such as SENDER, RECIPIENTS, SUBJECT, BODY, TITLE, HEADING, DATELINE, CAPTION, LINK and IMAGE URL. Internal nodes (shaded) correspond to consolidated components, such as TEXT, EMAIL ADDR. and FLAT TEXT.]

We experimented with three widely used smoothing techniques, namely Absolute discounting [23], Linear interpolation [14] and Witten-Bell smoothing [30]. In the following subsections, we briefly describe each of these methods. A good in-depth review of smoothing techniques used in natural language modeling is given in [4]; such a review is beyond the scope of this report.
6.2.1 Absolute Discounting
Absolute discounting [23] discounts the frequency of observed words in the higher-order distribution by a discount factor D_h < 1. The probability mass gained from the total discount over all observed words is distributed according to the lower-order distribution p_l:

$$p_{\mathrm{abs}}(w) = \frac{\max\left[c_h(w) - D_h,\; 0\right]}{\sum_{w'} c_h(w')} + \frac{N_h D_h}{\sum_{w'} c_h(w')}\; p_l(w) \qquad (17)$$
In the above equation, p_abs(w) denotes the blended probability, c_h(w) is the number of occurrences of word w in the higher-order context and N_h is the number of distinct words that occur at least once in the higher-order context. An estimate for D_h is n1_h/(n1_h + 2 n2_h), where n1_h is the number of distinct words occurring exactly once and n2_h the number of distinct words occurring exactly twice in the higher-order context.
6.2.2 Linear Interpolation
Linear interpolation computes a weighted sum of the higher-order (maximum likelihood) probability and the lower-order probability:

$$p_{\mathrm{lin}}(w) = \lambda_h\, \frac{c_h(w)}{\sum_{w'} c_h(w')} + (1 - \lambda_h)\, p_l(w) \qquad (18)$$
An estimate for the weights λ_h is n1_h/T_h, where n1_h is the number of words occurring exactly once in the higher-order context and T_h is the number of all word occurrences in the higher-order context [23].
6.2.3 Witten-Bell Smoothing
Witten-Bell smoothing [30] is similar to the m-estimate [2] with a prior that equals the lower-order model and an equivalent sample size m that equals the number of different words N_h observed in the higher-order distribution:

$$p_{\mathrm{wb}}(w) = \frac{c_h(w) + N_h\, p_l(w)}{N_h + \sum_{w'} c_h(w')} \qquad (19)$$
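The three estimators can be implemented in a few lines; the following sketch assumes the higher-order counts are held in a Counter and the lower-order probability is available as a function (illustrative code, not the authors' implementation):

```python
from collections import Counter

def smoothed_prob(w, c_h, p_l, method="witten_bell"):
    """Blend higher-order counts c_h (a Counter of word frequencies in one
    component) with a lower-order distribution p_l (word -> probability),
    following Equations (17)-(19)."""
    total = sum(c_h.values())                  # sum over c_h(w')
    n_types = len(c_h)                         # N_h: distinct observed words
    if total == 0:
        return p_l(w)                          # nothing observed: back off entirely
    if method == "absolute":                   # Equation (17)
        n1 = sum(1 for f in c_h.values() if f == 1)
        n2 = sum(1 for f in c_h.values() if f == 2)
        d = n1 / (n1 + 2 * n2) if (n1 + 2 * n2) > 0 else 0.5
        return max(c_h[w] - d, 0) / total + n_types * d / total * p_l(w)
    if method == "interpolation":              # Equation (18), lambda_h = n1_h / T_h
        lam = sum(1 for f in c_h.values() if f == 1) / total
        return lam * c_h[w] / total + (1 - lam) * p_l(w)
    return (c_h[w] + n_types * p_l(w)) / (n_types + total)   # Witten-Bell, Eq. (19)

# Example: subject-field counts backed off to a stand-in lower-order probability.
subject_counts = Counter({"hockey": 3, "tickets": 1})
print(smoothed_prob("game", subject_counts, lambda w: 0.01))
```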
7 Experiments
We conducted an extensive experimental evaluation of the different approaches to semi-structured document categorization that were discussed in previous sections. This section contains a description of our experimental setup and a summary of the results of these experiments.
7.1 Experimental Setup
In all experiments, texts were first split into word tokens consisting of one or more consecutive alphabetical characters delimited by whitespace or punctuation. All characters were converted to lower case. If a single "-" or "'" character appeared in between two alphabetical characters, it was considered part of the word. We call these characters connectors, since they effectively connect two alphabetical substrings into a longer word. All words in the resulting vocabulary were used for classification. No stemming, stopword removal or feature selection was used.

We used the SVMlight package of Joachims [15] for the SVM classifier. All parameter settings were left at their default values; in particular, the default value for C was used (the trade-off between training error and margin). We used the linear kernel in all experiments. We used our own implementations of Naive Bayes and FSM.
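A sketch of this tokenization; treating both the hyphen and the apostrophe as connector characters is an assumption here:

```python
import re

# One or more alphabetical characters, optionally joined by single '-' or "'"
# connectors appearing between two alphabetical characters.
TOKEN_RE = re.compile(r"[a-z]+(?:[-'][a-z]+)*")

def tokenize(text):
    """Lower-case the text and return word tokens as described in Section 7.1."""
    return TOKEN_RE.findall(text.lower())

print(tokenize("E-mail isn't pre-processed, e.g. O'Neil!"))
# ['e-mail', "isn't", 'pre-processed', 'e', 'g', "o'neil"]
```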
7.2 Datasets
Experiments were conducted on five publicly available datasets containing different types of semi-structured documents.

The 20 Newsgroups dataset (http://people.csail.mit.edu/~jrennie/20Newsgroups/) consists of 18941 email messages posted to 20 different public newsgroups. The messages are distributed roughly evenly across all classes. We used the "bydate" version of the dataset, which has a predefined train/test split.

The Ling-Spam dataset (http://www.aueb.gr/users/ion/) consists of 2893 spam and legitimate email messages posted to a linguistics newsgroup. The distribution of examples is approximately five to one in favor of legitimate messages. We used the "bare" version and the predefined 10-fold cross validation split.

The 7 Sectors dataset (http://www-2.cs.cmu.edu/~webkb/) contains 4339 corporate web pages, organized into a hierarchy of classes corresponding to different industry sectors. There are seven sectors in the top level of the hierarchy. We flattened the class hierarchy into 48 non-empty classes containing from 32 to 105 examples each. Testing was done using a random 4-way cross validation split.

The WebKb dataset (http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/) consists of HTML web pages from 4 different universities. We used only 4 categories (project, faculty, student and course), which contain a total of 4199 web pages. The smallest category (project) contains 504 documents while the largest category (student) contains 1641 documents. Testing was done using a random 4-way cross validation split.

Finally, the Reuters-21578 dataset (http://www.daviddlewis.com/resources/testcollections/reuters21578/) contains newswire articles indexed by the Reuters news agency. We report experiments using the "ModApte" train/test split on the 10 largest categories with a total of 9035 documents. The articles are distributed unevenly across the 10 categories, with the largest category containing 3964 documents and the smallest only 286. Reuters-21578 is a multilabeled dataset, so we train a binary one-vs-rest classifier for each category.
7.3 Document Modeling
Semi-structured documents were modeled as follows. Email messages were represented with sender, recipients, subject and body components. (The document structure for email is over-engineered for the 20 Newsgroups and Ling-Spam datasets, since neither contains recipient headers and the sender header is available only in 20 Newsgroups; however, this model proved successful in other experiments with email that are not reported here.) Web pages were split into title, heading and body components, as well as link URLs and image source URLs. Newswire articles contained title, body and dateline components. Each document is represented by the textual value of these components. We ignore any hierarchical relations between the components (e.g. we ignore the fact that heading components are usually included within the body component) and the cardinality of components (e.g. we concatenate the contents of all components found in a document).

Figure 3 shows the consolidated components used for smoothing probability estimates for email messages, web pages and newswire stories. Note that certain components are not smoothed. Such components typically contain non-textual or strongly structured content that is specific to the particular component and does not overlap with the words seen in other document parts.
7.4 Performance Measures
We measured classifier accuracy on unilabeled datasets, i.e. the proportion of test documents that were classified correctly by the classifier. For the multilabeled Reuters-21578 dataset, we report on micro-averaged recall, precision, break-even-point (BEP) and F-measure with β = 1 (F1) [20]. Recall and precision are standard information retrieval measures and are defined as follows. For each document/class combination, the classifier may either assign the document to the class, in which case we say it made a positive classification, or not, resulting in a negative classification. If the decision made by the classifier was correct, we label it as a true positive (TP) or true negative (TN) respectively. If the decision was incorrect, we label it as a false positive (FP) or false negative (FN). Using this notation, we may define recall R and precision P as:
$$R = \frac{TP}{TP + FN} \qquad (20)$$

$$P = \frac{TP}{TP + FP} \qquad (21)$$
Recall is thus the proportion of all document/class assignments in the test data that were correctly predicted by the classifier. Precision is the proportion of document/class assignments predicted by the classifier that are correct, i.e. they also appear in the test data. It is trivial to maximize each of these measures separately. However, a good classifier must balance both. The BEP and F1 score combine precision and recall in a single performance measure.

The BEP statistic finds the point where precision and recall are equal. Since this is hard to achieve in practice, a common approach is to use the average of recall and precision as an approximation:

$$BEP = \frac{P + R}{2} \qquad (22)$$

The F-measure attempts to capture the relative importance of precision and recall with a special parameter β. It is defined as follows:

$$F_\beta = \frac{(1 + \beta^2)\, P\, R}{\beta^2 P + R} \qquad (23)$$
When β = 1, the F-measure is reduced to the harmonic mean of precision and recall. We report on micro-averaged recall, precision, BEP and F1 scores. TP, FP, TN and FN counts are first summed over all per-class contingency tables. Precision, recall, BEP and F1 measures are then computed from these accumulated statistics.
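For reference, a sketch of the micro-averaging computation: the TP, FP and FN counts are pooled over all document/class decisions before applying Equations (20)-(23) with β = 1 (illustrative code):

```python
def micro_average(decisions):
    """decisions: iterable of (predicted: bool, actual: bool) pairs, one per
    document/class combination. Returns micro-averaged P, R, BEP and F1."""
    tp = fp = fn = 0
    for predicted, actual in decisions:
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif actual:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    bep = (precision + recall) / 2                     # Equation (22)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)              # Equation (23), beta = 1
    return precision, recall, bep, f1
```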
7.5 Flat-Text Categorization Results
Table 1 shows classification accuracy for the flat-text classifiers on the four unilabeled datasets and Table 2 shows results on the multilabeled Reuters dataset. The SVM outperforms the other learners on all datasets, which is consistent with our expectations and studies by other authors [32, 16]. The FSM* algorithm also gives a good performance. It improves on the original FSM algorithm on all accounts. It outperforms Naive Bayes by a wide margin on two of the five datasets and is also competitive on the remaining three. In general, the performance of the FSM* algorithm seems quite stable across all datasets. Considering the simplicity of the FSM* algorithm, this is by itself a remarkable result and warrants further investigation.

Dataset      NB       SVM      FSM      FSM*
20 News.     0.7639   0.8540   0.7459   0.8320
Ling-Spam    0.9907   0.9917   0.8400   0.9810
7 Sectors    0.6720   0.9359   0.8868   0.9145
WebKb        0.8273   0.8654   0.6949   0.8057

Table 1: Classification accuracy for flat-text classifiers on unilabeled datasets

Measure      NB       SVM      FSM      FSM*
Recall       0.9623   0.9214   0.3527   0.9627
Precision    0.8280   0.9658   0.9990   0.7703
BEP          0.8952   0.9436   0.6758   0.8665
F1           0.8901   0.9431   0.5213   0.8558

Table 2: Micro-averaged recall, precision, break-even point (BEP) and F-measure (F1) for flat-text classification on the Reuters-21578 dataset
7.6 Evaluation of Individual Component Models
To evaluate the usefulness of individual document components for our particular text categorization tasks, we conducted experiments using only the textual contents of individual components. The results of these tests are depicted in Figure 4.

[Figure 4 (bar charts, not reproduced): Evaluation of per-component models, with one panel per dataset (Accuracy on 20 Newsgroups, Accuracy on Ling-Spam, Accuracy on 7 Sectors, Accuracy on WebKb, F1 Score on Reuters-21578), comparing NB, SVM and FSM* on the flat-text representation and on individual components such as Sender, Subject, Body, Title, Heading, Dateline, Link URL and Image URL. Note the different scale of some of the graphs.]

The flat-text representation is superior to any individual structural component, with a single exception: one of the components of the WebKb dataset in combination with the SVM classifier. As a rule of thumb, components that contain a larger amount of text are more informative for classification. The graphs suggest that all classifiers make good use of the abundance of features provided by such components. One notable exception is the title component of Reuters news stories. An interesting observation is that Naive Bayes does almost as well as the SVM on the title component and better than the SVM on the body of Reuters news stories. However, the SVM does a better job at combining these two. A later inspection into the bad performance of the heading component in the 7 Sectors dataset revealed that very few documents actually contain any headings in this dataset. Consequently, most documents are simply classified into the majority class by this model.

Another observation that can be made from the graphs in Figure 4 is that while the SVM clearly dominates in the flat-text representation and in components with abundant training data, its performance is inferior to some of the simpler models for certain components with fewer features, such as the subject field in the Ling-Spam dataset or one of the WebKb components. This is consistent with the findings of Forman and Cohen [10], who show that Naive Bayes may be superior to SVMs when little training data is available.
7.7 Structured Document Models
The performance of Naive Bayes in combination with different methods for structured document classification is reported in Tables 3 and 4. The Flat+Tag representation generates a tagged and a non-tagged feature for each word occurrence; the document is represented by a single feature vector. By far the most successful approach overall is the stacked classifier. Its performance is substantially superior to other models for three out of five datasets. Compared to the flat-text model, the Flat+Tag representation is beneficial on the 7 Sectors dataset and the splitting approach does well on the WebKb dataset.

Performance of the SVM classifier in combination with different methods for structured document classification is reported in Tables 5 and 6. For SVM, the performance of the flat-text model is improved with stacking for every dataset. This improvement is most notable on the 20 Newsgroups, 7 Sectors and WebKb datasets. Note that stacking in combination with Naive Bayes was most beneficial for the same three datasets. Again, splitting improved performance on the WebKb dataset and the Flat+Tag representation was helpful on the 7 Sectors dataset. We expected that the SVM classifier would work better with the Flat+Tag document representation, since the classifier usually copes well with a large number of features. However, this was not the case in our experiments.

Results for the FSM* classifier are reported in Tables 7 and 8. Here also the stacked classifier achieves the best performance on most datasets. An exception is the 20 Newsgroups dataset, in which the stacked classifier does substantially worse than all other models. This is quite surprising since the FSM* classifier does rather well on each individual structural component for this dataset, as seen in Figure 4. We have so far not found a reasonable explanation for this anomaly.
7.8 Effect of Smoothing with Naive Bayes
Table 9 shows the effect of smoothing probability estimates for Naive Bayes on the four unilabeled datasets. Better smoothing has a substantial, positive impact on classifier performance on all of the datasets. The effect of smoothing is especially strong with the splitting method. Although the smoothed Naive Bayes classifier still falls short of the SVM, it does come close on many datasets. It should be noted that the smoothing methods employed have virtually no computational overhead.

Table 10 shows results on the multilabeled Reuters-21578 dataset. All smoothing methods consistently perform badly on this dataset. We have so far been unable to find a reasonable explanation for this phenomenon. One obvious possibility is that text in different structural components is less correlated in this dataset. It is also possible that this is an effect of the binary one-vs-rest classification scheme used for multiclass classification. Another issue is the extremely imbalanced training data in this collection, which is made worse by the binary one-vs-rest classifier. Since the vocabulary is fixed to the words encountered during training, hardly any words will be novel with respect to the distribution derived from the large "rest" class, while very many words will be novel with respect to the tiny "one" class. The performance of smoothing may be questionable in such situations.

Overall, the different smoothing methods perform similarly well in all tests in comparison to the Laplace estimator.
8 Conclusion

We have presented a number of methods for exploiting internal document structure for text categorization and evaluated their performance on five datasets with three different types of semi-structured documents. Of all the methods that were considered, stacking is the most refined and is a clear winner in terms of classification performance.
Dataset      Flat text  Tagging   Flat+Tag  Splitting  Stacking
20 News.     0.7639     0.7556    0.7775    0.7600     0.8286
Ling-Spam    0.9907     0.9896    0.9903    0.9896     0.9889
7 Sectors    0.6720     0.6444    0.7022    0.6635     0.8569
WebKb        0.8273     0.8342    0.8316    0.8538     0.8993

Table 3: Classification accuracy for NB on unilabeled datasets
Measure      Flat text  Tagging   Flat+Tag  Splitting  Stacking
Recall       0.9623     0.9519    0.9677    0.9548     0.9189
Precision    0.8280     0.8444    0.8034    0.8485     0.8827
BEP          0.8952     0.8981    0.8856    0.9017     0.9008
F1           0.8901     0.8949    0.8779    0.8985     0.9004

Table 4: Micro-averaged recall, precision, break-even point (BEP) and F-measure (F1) for NB on the Reuters-21578 dataset
Dataset      Flat text  Tagging   Flat+Tag  Splitting  Stacking
20 News.     0.8540     0.8544    0.8549    0.8394     0.8621
Ling-Spam    0.9917     0.9924    0.9917    0.9924     0.9938
7 Sectors    0.9359     0.9239    0.9355    0.9283     0.9375
WebKb        0.8654     0.8893    0.8693    0.9133     0.9493

Table 5: Classification accuracy for SVM on unilabeled datasets
Measure      Flat text  Tagging   Flat+Tag  Splitting  Stacking
Recall       0.9214     0.9150    0.9196    0.9660     0.9372
Precision    0.9658     0.9637    0.9672    0.9178     0.9628
BEP          0.9436     0.9393    0.9434    0.9419     0.9500
F1           0.9431     0.9387    0.9428    0.9413     0.9498

Table 6: Micro-averaged recall, precision, break-even point (BEP) and F-measure (F1) for SVM on the Reuters-21578 dataset
Dataset      Flat text  Tagging   Flat+Tag  Splitting  Stacking
20 News.     0.8320     0.8403    0.8351    0.8421     0.8096
Ling-Spam    0.9810     0.9813    0.9810    0.9813     0.9893
7 Sectors    0.9145     0.9152    0.9161    0.9189     0.9203
WebKb        0.8057     0.8238    0.8123    0.8340     0.8885

Table 7: Classification accuracy for FSM* on unilabeled datasets
Measure      Flat text  Tagging   Flat+Tag  Splitting  Stacking
Recall       0.9627     0.9587    0.9634    0.9577     0.9085
Precision    0.7703     0.7943    0.7776    0.7960     0.8654
BEP          0.8665     0.8765    0.8705    0.8768     0.8870
F1           0.8558     0.8688    0.8606    0.8694     0.8864

Table 8: Micro-averaged recall, precision, break-even point (BEP) and F-measure (F1) for FSM* on the Reuters-21578 dataset
             Splitting                          Stacking
Dataset      Lap.    Abs.    Lin.    W.B.       Lap.    Abs.    Lin.    W.B.
20 News.     0.760   0.821   0.818   0.814      0.829   0.839   0.836   0.840
Lng-Spm      0.990   0.992   0.992   0.992      0.989   0.993   0.993   0.993
7 Sectors    0.664   0.808   0.807   0.811      0.857   0.893   0.894   0.894
WebKb        0.854   0.862   0.861   0.859      0.899   0.895   0.891   0.893

Table 9: Classification accuracy for NB with smoothing on unilabeled datasets
             Splitting                          Stacking
Measure      Lap.    Abs.    Lin.    W.B.       Lap.    Abs.    Lin.    W.B.
Recall       0.955   0.962   0.967   0.960      0.883   0.883   0.883   0.879
Precision    0.849   0.835   0.823   0.832      0.919   0.873   0.865   0.867
BEP          0.902   0.898   0.895   0.896      0.901   0.878   0.874   0.873
F1           0.899   0.894   0.890   0.892      0.900   0.878   0.874   0.873

Table 10: Micro-averaged recall, precision, break-even point (BEP) and F-measure (F1) for NB with smoothing on the Reuters-21578 dataset
We believe that stacking is the most viable approach when accuracy is the most important factor in the design of a text categorization system. A limitation of the stacking method is its considerable computational overhead. Also, we expect stacking to be rather unstable with respect to the choice of meta-features and the meta classifier. Indeed, Wolpert described these design decisions as "black art" in his original paper [31]. Our results suggest that our choice of representation does work in practice; however, it remains to be seen if our design decisions were optimal.

The potential benefits of tagging and splitting are less clear and vary among datasets. The reasons for the varying performance of tagging and splitting should be investigated further in order to determine the characteristics of text collections that bear the greatest effect on the performance of these methods.

Some pitfalls of modeling document structure have been identified, such as data sparseness and OOV words. We propose the use of smoothing methods designed for natural language modeling and data compression to tackle these deficiencies. These methods are only applicable to probabilistic classifiers. The evaluation of these methods in combination with Naive Bayes shows promising results, with considerable improvements on most datasets.

Of the three text categorization algorithms studied, the FSM algorithm is a newcomer in the field of text categorization. A modified version of this algorithm, FSM*, exhibits a combination of strong classification performance and favorable computational properties. Although not the primary goal of this report, our results do warrant further research into the FSM method.

One interesting overall result is that the best results were achieved by structured representations on every single dataset, making a strong case for incorporating structural information in document classification.
Acknowledgment

The work presented in this report was supported by the Ministry of Education, Science and Sport of the Republic of Slovenia.
References

[1] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167, 1998.

[2] B. Cestnik. Estimating probabilities: A crucial task in machine learning. In Proceedings of ECAI-90, 9th European Conference on Artificial Intelligence, pages 147–149, 1990.

[3] S. Chakrabarti, B. E. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks. In Proceedings of SIGMOD-98, ACM International Conference on Management of Data, pages 307–318. ACM Press, New York, US, 1998.

[4] S. F. Chen and J. Goodman. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Center for Research in Computing Technology, Harvard University, Cambridge, USA, 1998.
[5] F. Debole and F. Sebastiani. Supervised term weighting for automated text categorization. In Proceedings of the 2003 ACM Symposium on Applied Computing, pages 784–788. ACM Press, 2003.

[6] L. Denoyer and P. Gallinari. Bayesian network model for semi-structured document classification. Information Processing and Management, 40(5):807–827, 2004.

[7] M. Diligenti, M. Gori, M. Maggini, and F. Scarselli. Classification of HTML documents by hidden tree-Markov models. In Proceedings of ICDAR '01, 6th International Conference on Document Analysis and Recognition, pages 849–853.

[8] S. Dumais, J. Platt, D. Heckerman, and M. Sahami. Inductive learning algorithms and representations for text categorization. In Proceedings of the 7th International Conference on Information and Knowledge Management, pages 148–155. ACM Press, 1998.

[9] S. Eyheramendy, D. D. Lewis, and D. Madigan. On the Naive Bayes model for text categorization. In Proceedings of AISTATS 2003, 9th International Workshop on Artificial Intelligence and Statistics, 2003.

[10] G. Forman and I. Cohen. Learning from little: Comparison of classifiers given little training. In Proceedings of PKDD-04, 8th European Conference on Principles and Practice of Knowledge Discovery in Databases, pages 161–172.

[11] J. Fürnkranz. Exploiting structural information for text classification on the WWW. In Proceedings of AIDA-99, 3rd International Symposium on Advances in Intelligent Data Analysis, pages 487–498, 1999.

[12] R. Ghani, S. Slattery, and Y. Yang. Hypertext categorization using hyperlink patterns and meta data. In Proceedings of ICML-01, 18th International Conference on Machine Learning, pages 178–185. Morgan Kaufmann Publishers, San Francisco, US, 2001.

[13] E. Glover, K. Tsioutsiouliklis, S. Lawrence, D. Pennock, and G. Flake. Using web structure for classifying and describing web pages. In Proceedings of WWW2002, 11th International World Wide Web Conference, Honolulu, Hawaii, May 7–11, 2002.

[14] F. Jelinek and R. L. Mercer. Interpolated estimation of Markov source parameters from sparse data. In Pattern Recognition in Practice, pages 381–397. North-Holland, 1980.

[15] T. Joachims. Making large-scale support vector machine learning practical. In Advances in Kernel Methods: Support Vector Machines, pages 169–184. MIT Press, Cambridge, MA, 1998.

[16] T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of ECML-98, 10th European Conference on Machine Learning, number 1398, pages 137–142. Springer Verlag, 1998.

[17] T. Joachims. Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms. Kluwer Academic Publishers, 2002.
[18] B. Klimt and Y. Yang. The Enron corpus: A new dataset for email classification research. In Proceedings of ECML-04, 15th European Conference on Machine Learning, pages 217–226.

[19] D. Lewis. Naive Bayes at forty: The independence assumption in information retrieval. In Proceedings of ECML-98, 10th European Conference on Machine Learning, pages 4–15, 1998.

[20] D. D. Lewis. Evaluating and optimizing autonomous text classification systems. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 246–254. ACM Press, 1995.

[21] A. McCallum and K. Nigam. A comparison of event models for Naive Bayes text classification, 1998.

[22] T. Mitchell. Machine Learning. McGraw-Hill, 1997.

[23] H. Ney, U. Essen, and R. Kneser. On structuring probabilistic dependencies in stochastic language modelling. Computer Speech and Language, 8(1):1–38, 1994.

[24] F. Peng, D. Schuurmans, and W. Shaojun. Augmenting Naive Bayes classifiers with statistical language models. Information Retrieval, 7(3-4):317–345, 2004.

[25] J. Platt. Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. In Advances in Large Margin Classifiers, pages 61–74. MIT Press, 1999.

[26] K. M. Ting and I. H. Witten. Stacked generalization: When does it work? In Proceedings of IJCAI-97, 15th International Joint Conference on Artificial Intelligence, pages 866–873, 1997.

[27] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag New York, Inc., 1995.

[28] D. Vilar, H. Ney, A. Juan, and E. Vidal. Effect of feature smoothing methods in text classification tasks. In Proceedings of PRIS 2004, 4th International Workshop on Pattern Recognition in Information Systems, 2004.

[29] D. Wingate and K. Seppi. Linking naive Bayes, vector quantization, and fuzzy set membership for text classification, 2004.

[30] I. H. Witten and T. C. Bell. The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory, 37(4), July 1991.

[31] D. H. Wolpert. Stacked generalization. Neural Networks, 5:241–259, 1992.

[32] Y. Yang and X. Liu. A re-examination of text categorization methods. In Proceedings of SIGIR-99, 22nd ACM International Conference on Research and Development in Information Retrieval, pages 42–49. ACM Press, New York, US, 1999.
[33] Y. Yang, S. Slattery, and R. Ghani. A study of approaches to hypertext categorization. Journal of Intelligent Information Systems, 18(2-3):219–241, 2002.

[34] J. Yi and N. Sundaresan. A classifier for semi-structured documents. In Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 340–344. ACM Press, 2000.