Single Pass Text Classification by Direct Feature Weighting

Hassan H. Malik (1), Dmitriy Fradkin (2) and Fabian Moerchen (2)
(1) Thomson Reuters, 195 Broadway, New York, NY 10007, USA
(2) Integrated Data Systems, Siemens Corporate Research, 755 College Rd. East, Princeton, NJ 08540, USA
Keywords: Text Classification, Feature Weighting, Linear Classifiers, Information Gain, Scalable Learning
Abstract. The Feature Weighting Classifier (FWC) is an efficient multi-class classification algorithm for text data that uses Information Gain to directly estimate per-class feature weights in the classifier. This classifier requires only a single pass over the dataset to compute the feature frequencies per class, is easy to implement, and has memory usage that is linear in the number of features. Results of experiments performed on 128 binary and multi-class text and web datasets show that FWC's performance is at least comparable to, and often better than that of Naive Bayes, TWCNB, Winnow, Balanced Winnow and linear SVM. On a large-scale web dataset with 12,294 classes and 135,973 training instances, FWC trained in 13 seconds and yielded comparable classification performance to a state-of-the-art multi-class SVM implementation, which took over 15 minutes to train.

Received February 04, 2010; Revised May 03, 2010; Accepted May 22, 2010
1. Introduction

Supervised classification of documents into predefined categories is a very common task. Text classification applications include web page categorization, email spam filtering, internet advertising, topic identification, document indexing, and word sense disambiguation. Supervised text classifiers are typically constructed by an inductive process (Sebastiani 2002), i.e., by automatically learning a model from a set of previously labeled documents and then applying this model to obtain labels for previously unseen documents. Some of the popular inductive classifiers that have successfully been applied to text classification include probabilistic classifiers such as Naive Bayes (McCallum &
Nigam 1998), decision tree classifiers such as ID3 (Quinlan 1986) and C4.5 (Quinlan 1993), rule-based classifiers such as FOIL (Quinlan & Cameron-Jones 1993), RIPPER (Cohen 1995) and CPAR (Yin & Han 2003), and maximum margin classifiers such as linear SVMs (Joachims 2002).

With the amount of information available online increasing at an unprecedented rate (Lyman & Varian 2003), popular web catalogues such as the Open Directory (http://www.dmoz.org) have grown to hundreds of thousands of categories. Consequently, modern text classification systems sometimes encounter training datasets that contain hundreds of thousands to millions of documents (e.g., the datasets used in the recent Pascal challenge on large-scale hierarchical text classification, http://lshtc.iit.demokritos.gr). In such situations training speed becomes a concern, in addition to the more traditional concern over the quality of classification. Therefore recent research (Malik & Kender 2008, Fan et al. 2008, Joachims 2006, Anagnostopoulos et al. 2008) has focused on improving the runtime performance of text classification algorithms while maintaining or improving quality. Many of these efficient approaches rely on linear classifiers, which tend to be faster to train and to apply, while maintaining a high level of quality on text data.

Given a set of labeled training instances D, with F features and C classes, the models produced by linear classifiers, such as Naive Bayes and linear SVM, can be represented by a |F| × |C| matrix, where the entry in the i-th row and j-th column is the learned weight for the feature-class pair (f_i, c_j). We explore whether expensive training procedures are necessary for linear classifiers, or whether feature weights can be directly estimated using information-theoretic measures to construct a high-quality classifier.

We propose the Feature Weighting Classifier (FWC), a simple classification algorithm that uses Information Gain (IG) to directly compute per-class feature weights. FWC obtains per-class feature frequencies in one pass over the dataset, and then uses these frequencies to compute feature IG and feature support in each class. The final weights for each feature-class pair are derived by combining feature IG with feature class support using a user-specified parameter α. Test instances are classified by computing a score for each class and selecting the class with the highest score. Per-class scores are computed as the product of the vector of weights for a class with the term frequency vector of the document. Experiments performed on diverse binary and multi-class text and web datasets indicate that FWC's predictive performance is at least comparable to that of SVM and noticeably better than that of Naive Bayes and Winnow.

A number of recent papers explored the use of probabilistic and information-theoretic measures as feature weights. These approaches assign a weight w_fc to each feature-class pair f ∈ F and c ∈ C, based on the discriminative information that f provides for distinguishing class c from the remaining classes. Feature weighting was used as a pre-processing step by (Forman 2008) to improve the quality of existing text classification algorithms, whereas (Junejo & Karim 2008) used feature weighting to map documents into a two-dimensional space (for each class), where linear discriminant functions for classes were learned.
Unlike these approaches, where feature weighting is used to change the document representation as a pre-processing step to an induction algorithm, whether by transformation in the same feature space (Forman 2008) or by mapping into a new one (Junejo & Karim 2008), FWC directly computes the feature weights for the classifier from the document collection, requiring no further learning.
This distinguishes FWC from typical feature weighting schemes and makes it similar to classification schemes such as Naive Bayes.

In a similar spirit to our work, a recently published approach (Madani et al. 2009) involves direct learning of feature-class weights on a feature index. Each feature contributes only to its top classes, with the number of such connections per feature limited by a parameter. The feature-class weights are directly updated in an online fashion as training examples become available. This approach was shown to be competitive in terms of accuracy but more efficient than state-of-the-art classifiers.

The rest of the paper is organized as follows. We describe and analyze the FWC algorithm in detail in Section 2. The empirical evaluation is described in Section 3. Section 4 describes related work. Discussion and future work are presented in Section 5. Conclusions are presented in Section 6.
2. FWC Algorithm

In this section we describe the FWC algorithm. We first discuss the steps involved in constructing the FWC classification model. Next, we describe applying this model to classify test instances and discuss the computational complexity of the FWC algorithm. Finally, we provide motivation for the FWC weights and compare them with the weights produced by Naive Bayes.
2.1. Training the FWC Classifier

Let a_ij be the number of documents of class c_j where feature f_i occurs, let b_j be the number of documents in class c_j, let κ_i be the number of classes where f_i occurs, and let r_i be the number of documents where feature f_i occurs. The support of feature f_i in class c_j is defined as:

p_{ij} = \frac{a_{ij}}{b_j}    (1)

This is the empirical conditional probability P(f_i | c_j). The Information Gain (IG) of feature f_i, which in this case is the same as the Mutual Information of f_i and the class label distribution, is defined as:

IG_i = \sum_{c \in C} \sum_{x_i \in \{0,1\}} P(c, x_i) \log_2 \frac{P(c, x_i)}{P(c) P(x_i)}    (2)
where x_i takes values 0 or 1 depending on the absence or presence of feature f_i in a document. P(c_j, f_i) can be computed as a_ij/|D| and P(f_i) = r_i/|D| (McCallum & Nigam 1998).

The construction of the FWC model (Algorithm 1) begins by making one pass over the training data to determine a_ij, b_j and κ_i (Line 1). Knowing these quantities enables us to compute the weight of each feature for each class in a single pass over all features (Lines 2-7). The weight of each feature with respect to a class is calculated in Line 5 using three components: a global significance measure, a class significance measure, and a discriminative penalty. Here we use Information Gain as the global significance measure (computed in Line 3 according to Equation 2), representing the discriminative information of a feature across all classes. The class significance measure is the class support p_ij of the feature, and κ_i is the penalty factor. A user-defined parameter α is used
to control the tradeoff between the global feature significance and the feature significance in individual classes. Note that the features used here are essentially binary and FWC does not utilize feature frequencies within individual training documents. In contrast, feature frequencies in test documents are utilized in computing class scores for test instances, as we discuss in Section 2.2.

Algorithm 1 Training the FWC model
Require: A set D of sparse labeled instances, with a set of features F and a set of classes C; a parameter α ≥ 0.
1: Read the data, keeping track of a_ij, b_j, κ_i, and computing p_ij (Eq. 1)
2: for i = 1, . . . , |F| do
3:   Compute IG_i (Eq. 2)
4:   for j = 1, . . . , |C| do
5:     w_ij = (IG_i / κ_i) · (p_ij)^α
6:   end for
7: end for
8: W_j = {w_ij} // weight vector for class c_j
9: return M = {W_j}, j = 1, . . . , |C|.
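To make the training procedure concrete, here is a minimal Python sketch of Algorithm 1. It is our own illustration, not the authors' (Java) implementation; the dict-of-dicts sparse representation and the name train_fwc are assumptions made for the example:

```python
import math
from collections import defaultdict

def train_fwc(docs, labels, alpha=0.1):
    """Train an FWC model (Algorithm 1).

    docs   : list of sparse documents, each a dict {feature: term frequency}
    labels : class label of each document
    alpha  : trade-off between global and per-class feature significance
    Returns W with W[c][f] = weight of feature f for class c.
    """
    n_docs = len(docs)
    a = defaultdict(lambda: defaultdict(int))  # a[f][c]: docs of class c containing f
    b = defaultdict(int)                       # b[c]: number of docs in class c
    for doc, c in zip(docs, labels):           # Line 1: the single pass over the data
        b[c] += 1
        for f in doc:
            a[f][c] += 1

    W = {c: {} for c in b}
    for f, per_class in a.items():
        r = sum(per_class.values())            # r_i: number of docs containing f
        kappa = len(per_class)                 # kappa_i: classes in which f occurs
        ig = 0.0                               # IG of f (Eq. 2), from the counts
        for c, b_c in b.items():
            a_fc = per_class.get(c, 0)
            for joint, p_x in ((a_fc / n_docs, r / n_docs),
                               ((b_c - a_fc) / n_docs, 1 - r / n_docs)):
                if joint > 0 and p_x > 0:
                    ig += joint * math.log2(joint / ((b_c / n_docs) * p_x))
        for c, a_fc in per_class.items():      # weights are zero where f is absent
            p = a_fc / b[c]                    # class support p_ij (Eq. 1)
            W[c][f] = (ig / kappa) * (p ** alpha)  # Line 5
    return W
```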
2.2. Using the FWC Classifier

Test instances are classified by computing a score for each class and selecting the class with the highest score (Algorithm 2). Each feature in the test instance that has a non-zero weight in the model for a class contributes towards the class score. In addition, feature frequencies in the test instance are used to scale feature weights, which allows locally-frequent features to have a higher contribution towards class scores. Per-class scores are therefore computed as the product of the vector of weights for a class with the term frequency vector of the document.

Algorithm 2 Applying the FWC model
Require: A sparse instance d, FWC model M = {W_j}
1: for j = 1, . . . , |C| do
2:   s_j = Σ_{f_i ∈ d ∩ W_j} d_i · w_ij
3: end for
4: return c_m, where m = argmax_j s_j.
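Applying the model is equally direct. The sketch below, matching the representation assumed in the training sketch above, scores one test document against each class vector:

```python
def classify_fwc(doc, W):
    """Score a sparse test document against each class and return the best (Algorithm 2)."""
    scores = {}
    for c, weights in W.items():
        # Only features present in both the document and the class's weight vector
        # contribute; the term frequency in the document scales the weight.
        scores[c] = sum(tf * weights[f] for f, tf in doc.items() if f in weights)
    return max(scores, key=scores.get)

# Example (hypothetical data): classify_fwc({'goal': 2, 'vote': 1}, train_fwc(docs, labels))
```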
2.3. Algorithm Analysis

In this section we discuss the computational complexity of FWC. The first line in Algorithm 1 involves a single pass over the data, and requires Θ(|D||F|) time when the dataset is dense, or Θ(|E|) time, where E is the set of non-zero entries, when the dataset is sparse. During this pass, the values a_ij, b_j and κ_i are stored and the p_ij values are computed, requiring Θ(|F||C|) storage. The double loop in Lines 2-7 takes Θ(|F||C|) time, since the computation of IG for a feature (Line 3) is linear in the number of classes, as follows from Equation 2, and so is
the inner loop (Lines 4-7), since the computation of w_ij takes constant time. The amount of memory required is Θ(|F|) for the IG values and Θ(|F||C|) for the weights. Therefore, training FWC takes Θ(|E| + |F||C|) time and Θ(|F||C|) space on sparse data. Effectively, training FWC requires a single pass over the data, followed by a single pass over the coefficients. The time complexity of testing, i.e., assignment of new instances, is linear (a single pass) in the number of features in the new document times the number of classes in the model. These characteristics are exactly the same as for the Naive Bayes classifier. In Section 3, we show that FWC tends to be more accurate than Naive Bayes. A linear time algorithm for training SVM has recently been developed (Joachims 2006). However, this algorithm is complicated and involves multiple passes over the data, making it much slower than Naive Bayes or FWC. We show below (Section 3) that FWC's accuracy is comparable to that of SVM, which is considered "state of the art".
2.4. Motivation for the FWC Weights

Several probabilistic and information-theoretic feature weighting and selection schemes have been proposed in the literature (Forman 2003). Some of the commonly used schemes include Accuracy, Bi-Normal Separation (BNS), Chi-Square, Document Frequency (i.e., Global Support), F1-Measure, Information Gain, Odds Ratio, Odds Ratio Numerator, Power and Probability Ratio. Among these schemes, only Chi-Square, Information Gain and Document Frequency generalize to multi-class problems (Forman 2003). Since FWC focuses on direct multi-class classification, we limit ourselves to these three schemes. Furthermore, our experiments (Section 3.8) indicate that Information Gain is more stable and results in better classification performance than Chi-Square and Global Support.

Information Gain considers both the presence and absence of a word to measure its significance. This means that Information Gain expects a word to provide information even when it does not occur, whereas in a typical text classification scenario words occur sparsely and only provide information when they occur (Rennie 2001). Because of this property, words that occur just a few times in the dataset receive low IG scores even though these words may be highly predictive of some classes. Because of this undesirable side effect, Information Gain, while extremely useful as a global feature significance measure, may not be suitable as the sole feature weighting technique (see Section 3.7 for experimental results).

On the other hand, using class significance as the sole feature weighting technique may not be suitable either, as this approach would assign a very high weight to features found in a large fraction of a rare class. For example, in the extreme case, a class with only one instance that contains all features in the dataset would receive the highest weights for most features, resulting in assigning almost all test instances to this class. Therefore, we attempt to balance global significance with class significance by adjusting the Information Gain of a feature with its class support (Line 5) using the parameter α. A suitable value for α can be estimated with cross-validation on the training set. Alternatively, since wide ranges of values for α result in similar performance, as demonstrated in Section 3.9, reasonable defaults can be selected. Intuitively, features that are shared across many classes are less useful for classification than features that are observed in only a few classes. We therefore use κ_i as a penalty factor (the multi-class penalty) in Line 5.
Section 3.7 shows that penalizing the weights of features that are shared across classes almost always improves the classification accuracy.
2.5. Comparison of FWC Weights with Naive Bayes

FWC is a heuristic approach where model weights are not derived as a solution to optimizing some objective function. We have discussed the motivation for the particular choice of the formula for the weights (Section 2.4), and our experimental results in Section 3 show that FWC performs well on a variety of text classification problems. In this section we explore in some detail the relationship between FWC and a similar method, Naive Bayes. Multivariate Bernoulli Naive Bayes (McCallum & Nigam 1998) can be seen as associating with each class c_j a set of weights w_ij^NB for each feature f_i:

w_{ij}^{NB} = \log \frac{1 + a_{ij}}{2 + b_j}    (3)
These weights are combined linearly with document features, and a prediction for a new instance d is made by selecting the class c_j with the highest score:

\sum_{f_i \in d \cap W_j} d_i \, w_{ij}^{NB}    (4)
FWC weights are described by the formula in Line 5 of Algorithm 1. Examining this formula, we observe that each weight is a product of two parts. The first part is independent of classes, i.e., it is the same regardless of the class of the weight vector:

\frac{IG_i}{\kappa_i}    (5)

and can therefore be seen as a form of feature weighting similar to IDF (Salton & Buckley 1988), but taking advantage of label information. The second part depends on both the class and the feature:

(p_{ij})^{\alpha} = \left( \frac{a_{ij}}{b_j} \right)^{\alpha}    (6)

and can be seen as a counterpart to the Naive Bayes weights (Equation 3). Whereas NB uses the logarithm of an expression, FWC uses an exponential (0 < α < 1) of a very similar expression. The use of an exponential as opposed to a logarithm limits the effect of features that are rare in a class: their contribution to the class score is limited to a very small positive value, rather than to a large negative value. This means that in FWC, features that are rare in all classes will not significantly affect scores for any classes. In contrast, features that are frequent in some classes but infrequent in other classes will boost the scores of classes where the feature is frequent. In Multinomial NB (McCallum & Nigam 1998), used in our experiments, the weights are:

w_{ij}^{MNNB} = \log \frac{1 + n_{ij}}{|F| + n_j}    (7)

where n_ij is the number of occurrences of feature f_i in documents of class c_j, and n_j is the total number of occurrences of all features in documents of class c_j. Again, the main difference with the class-specific part of the FWC weight (Equation 6) is in the use of a logarithm rather than an exponential, and the same reasoning as when comparing FWC and Multivariate Bernoulli NB applies.
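A small numeric illustration (the values and α = 0.1 are our own choices) makes the contrast concrete. For a feature appearing in 1 versus 400 of a class's 500 documents, the NB logarithm of Equation 3 produces a large negative weight for the rare case, while the FWC exponent of Equation 6 keeps its contribution a small positive value:

```python
import math

for a, b in ((1, 500), (400, 500)):
    nb = math.log((1 + a) / (2 + b))   # Bernoulli NB weight (Eq. 3)
    fwc = (a / b) ** 0.1               # class-specific FWC factor (Eq. 6), alpha = 0.1
    print(f"a={a:3d}, b={b}: NB {nb:+.2f}, FWC {fwc:.2f}")
# a=  1, b=500: NB -5.52, FWC 0.54
# a=400, b=500: NB -0.22, FWC 0.98
```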
dataset      |C|    |D|     |F|     avg cls   min cls   max cls
cacmcisi     2      4663    14409   2331.5    1460      3203
cranmed      2      2431    31720   1215.5    1033      1398
fbis         17     2463    2000    144.882   38        506
hitech       6      2301    22498   383.5     116       603
k1a          20     2340    21839   117       9         494
k1b          6      2340    21839   390       60        1389
la1          6      3204    29714   534       273       943
la2          6      3075    19692   512.5     248       905
mm           2      2521    29973   1260.5    1133      1388
new3         44     9558    70822   217.227   104       696
ohscal       10     11162   11465   1116.2    709       1621
re0          13     1504    2886    115.692   11        608
re1          25     1657    3758    66.28     10        371
reviews      5      4069    36746   813.8     137       1388
sports       7      8580    27673   1225.71   122       3412
tr11         9      414     6429    46        6         132
tr12         8      313     5804    39.125    9         93
tr23         6      204     5832    34        6         91
tr31         7      927     10128   132.429   2         352
tr41         10     878     7454    87.8      9         243
tr45         10     690     8261    69        14        160
wap          20     1560    8460    78        5         341

Table 1. Summary of Cluto datasets
Thus, FWC can also be seen as a combination of a specific term weighting with a weight function that aims to compensate for some potentially damaging behaviors of the Naive Bayes weights.
3. Experimental Evaluation

3.1. Datasets

The main evaluation was performed on the TechTC-100 (Davidov et al. 2004) and Cluto (Karypis 2003) collections of text datasets. Both are publicly available and frequently used to evaluate classification algorithms. Additional datasets were used to validate the initial findings (Section 3.4) and to evaluate the runtime performance (Section 3.6). Table 1 summarizes the properties of the datasets in the Cluto collection. These datasets represent text classification problems collected from various sources, such as Reuters news articles, TREC tasks, and OHSUMED (Medline records). The TechTC-100 collection consists of 100 binary classification problems with 100-200 documents each. The problems were generated using real web sites classified by human editors as part of the Open Directory Project (http://www.dmoz.org). They are designed to have varying difficulty for classification algorithms (Davidov et al. 2004). The TechTC-100 datasets were preprocessed using standard stemming and stop-word elimination techniques and converted to the Cluto file format.
3.2. Methods

The FWC classification algorithm was evaluated against Naive Bayes (McCallum & Nigam 1998) and linear SVM. Both methods are known to perform well on textual data. FWC was also evaluated against Winnow (Littlestone 1988) and Balanced Winnow (Littlestone 1989) on binary datasets (Section 3.5) and against TWCNB (Rennie 2001) on large-scale unbalanced datasets (Section 3.6). The LibLinear (Fan et al. 2008) implementation was used for linear SVM training in linear time (see also (Joachims 2006)). All methods were evaluated with repeated randomized cross-validation with the same splits, obtained by using the same random seed. On the Cluto datasets, 10 times 10-fold cross-validation was used, and on the TechTC-100 datasets, 50 times 2-fold cross-validation was used, resulting in 100 evaluation runs each. The TechTC-100 datasets have relatively few documents each; fewer folds result in less training data but avoid quantization effects in the quality evaluation. The evaluation measures used were accuracy, i.e., the fraction of correctly classified documents, and the macro-averaged F1 measure, i.e., the average of the per-class harmonic means of precision and recall, without considering class sizes.

FWC: FWC training is described in Section 2. The exponent of the weighting function (i.e., the α parameter) was automatically selected from the set {10^{-3}} ∪ {k · 10^{-2}, k · 10^{-1} | k = 1, ..., 9} using 5-fold cross-validation on the given training data. The model with the best accuracy was used to predict class labels for the test instances.

Naive Bayes: The multinomial NB implementation closely followed the description of (McCallum & Nigam 1998).

SVM: SVM was trained on TFIDF vectors using the following commonly used weighting (Lewis et al. 2004):

TFIDF = \log(TF + 1) \cdot \log \frac{N}{DF}    (8)

where TF is the term frequency (how many times the term appeared in the document), N is the number of documents processed, and DF is the document frequency (in how many documents the term appeared). The regularization parameter C was selected from the set {10^k | k = -4, ..., 1} using 5-fold cross-validation on the given training data. The model with the best accuracy was used to predict class labels for the test instances. For multi-class problems, the default LibLinear setting was used, which uses a one-vs-rest strategy.

The methods are compared without additional feature selection. While feature selection is known to improve the performance of classification methods in many cases (Gabrilovich & Markovitch 2004), it would need to be tuned separately for each method and dataset, obscuring the direct comparison of the approaches that is the focus of this evaluation. We plan to pursue this direction in future work.
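For reference, a minimal sketch of the TFIDF weighting of Equation 8 (the function name and dict-based document representation are our own; LibLinear consumes the resulting vectors in its own sparse input format):

```python
import math

def tfidf_vectors(docs):
    """Apply the TFIDF weighting of Eq. 8 to a list of {term: raw frequency} dicts."""
    n = len(docs)
    df = {}                                    # document frequency of each term
    for doc in docs:
        for term in doc:
            df[term] = df.get(term, 0) + 1
    return [{term: math.log(tf + 1) * math.log(n / df[term])
             for term, tf in doc.items()} for doc in docs]
```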
3.3. Comparison with NB and SVM

Figures 1 and 2 present the performance of the three classifiers on the Cluto and TechTC-100 collections for each dataset individually, as well as accuracy and F1 score summaries for each collection. The summaries show the mean and one standard deviation of the difference in accuracy or macro-averaged F1 score for each dataset collection. FWC is found to be comparable to NB and SVM on the Cluto datasets and consistently better than both competitors on the TechTC-100 datasets. NB is comparable to FWC and to SVM on the Cluto datasets but is consistently worse than both FWC and SVM on the TechTC-100.
                   accuracy                  macro-averaged F1
dataset            SVM      NB      FWC      SVM      NB      FWC
bbc                0.982    0.932   0.955    0.982    0.930   0.954
bbc sport          0.991    0.957   0.994    0.992    0.959   0.995
review polarity    0.809    0.659   0.811    0.809    0.657   0.809

Table 2. Summary of additional datasets and classifier performance on these datasets
Note that the baseline SVM results on TechTC-100 presented in (Gabrilovich & Markovitch 2004) differ from ours (but are still significantly worse than FWC's) because of differences in the experimental setup.
3.4. Results on Additional Datasets

We used three additional "hold-out" datasets to validate our findings. The BBC dataset (Greene & Cunningham 2006) contains 2225 stories with 9636 features from the British Broadcasting Corporation website that correspond to five topical classes. The BBC sports dataset (Greene & Cunningham 2006) contains 737 news articles with 4613 features that correspond to five sports-related categories. The review polarity 2.0 dataset (Pang & Lee 2004) is commonly used for sentiment analysis and contains 1000 positive and 1000 negative movie reviews from the Internet Movie Database (http://www.imdb.com), with 26187 features. Table 2 presents the ten-fold cross-validated accuracies and macro-averaged F1 scores achieved by the three classifiers. We observe that SVM is somewhat better than FWC on the BBC dataset, and slightly worse on the BBC sports and review datasets. Naive Bayes is ranked third on all datasets, and in the case of the review dataset, by a very large margin. This is consistent with our previous analysis.
3.5. Comparison with Winnow

In this section we compare FWC with Winnow (Littlestone 1988) and Balanced Winnow (Littlestone 1989). Unlike SVM, Winnow is a completely online learning algorithm, i.e., it can be trained in a single pass over the data, one example at a time. Winnow is a linear classifier that associates with each feature f_i a weight w_i. During training, a prediction is made for an example. If the prediction is correct, no change to the model is made. If the prediction is wrong, then the weights of the features present in the example are increased (promotion step) or decreased (demotion step) by some factor α (Littlestone 1988). Since the updates involve only the features that occurred in the example, Winnow is very efficient. Note that FWC models can also be updated one example at a time, but this requires re-computing all IG values and therefore all the weights, making FWC less efficient than Winnow in the online mode. For the evaluation, we used the Weka (Hall et al. 2009) implementations of both Winnow methods. Since these support only two-class problems, we used the cacmcisi, cranmed, mm and review polarity datasets, as well as all 100 TechTC problems (essentially, all two-class datasets used in our experiments).
[Figure 1: panels (a) Accuracies on Cluto, (b) Macro-averaged F1 on Cluto, (c) Cluto accuracy summary, (d) Cluto F1 summary.]

Fig. 1. Accuracies and F1 scores of FWC, NB and SVM on Cluto datasets. 1(a) and 1(b) show results per dataset ordered by increasing performance of FWC. 1(c) and 1(d) show the mean ± one standard deviation of the differences in performance between FWC and its competitors. Values below the x-axis indicate worse performance than that of FWC.
[Figure 2: panels (a) Accuracies on TechTC-100, (b) Macro-averaged F1 on TechTC-100, (c) TechTC-100 accuracy summary, (d) TechTC-100 F1 summary.]

Fig. 2. Accuracies and F1 scores of FWC, NB and SVM on TechTC-100 datasets. 2(a) and 2(b) show results per dataset ordered by increasing performance of FWC. 2(c) and 2(d) show the mean ± one standard deviation of the differences in performance between FWC and its competitors. Values below the x-axis indicate worse performance than that of FWC.
The results, obtained with 4-fold cross-validation on TechTC and 10-fold cross-validation on the other datasets, are presented in Table 3. They show that Winnow lags behind FWC in terms of accuracy and F1 on all datasets except cacmcisi, which is the one dataset where FWC does worst. Of the 100 TechTC problems, FWC outperforms both Winnow methods in terms of accuracy on 89 problems, and in terms of F1 on 90 problems. The results on TechTC are shown in Figure 3.
3.6. Runtime Comparisons

While we have argued that FWC is comparable to Naive Bayes and much more efficient than SVM, it is interesting to compare them on actual large-scale text classification problems. Our evaluation is done on three large datasets. The SRAA dataset (http://www.cs.umass.edu/~mccallum/code-data.html) contains 73,218 documents in four classes. The methods were evaluated using 10-fold cross-validation on this dataset. Two multi-class classification problems were derived from the recent Pascal challenge on large-scale hierarchical text classification (the LSHTC challenge, http://lshtc.iit.demokritos.gr).
                   accuracy                   macro-averaged F1
dataset            FWC      W       BW        FWC      W       BW
cacmcisi           0.673    0.983   0.983     0.673    0.980   0.980
cranmed            0.999    0.922   0.948     0.999    0.920   0.946
mm                 0.980    0.878   0.956     0.980    0.878   0.955
review polarity    0.811    0.637   0.726     0.809    0.636   0.723
TechTC-100*        0.891    0.748   0.792     0.890    0.736   0.785

Table 3. Results of the comparison between FWC, Winnow (W) and Balanced Winnow (BW). The results for TechTC-100 are averages over all 100 problems.
[Figure 3: panels (a) Accuracies on TechTC-100, (b) F1 scores on TechTC-100.]

Fig. 3. Accuracies and F1 scores of FWC and the two Winnow variants on TechTC-100.
The LSHTC challenge uses a five-level-deep classification hierarchy, and each document is assigned to exactly one leaf node. Since hierarchical classification is not the main focus of this research, we flattened the hierarchy by considering each leaf node as a direct class in a multi-class classification problem. The first classification problem contains 128,710 web page description vectors in 12,294 classes. The classification methods were evaluated using 10-fold cross-validation on this problem. The second classification problem uses 135,973 description vectors for training (i.e., all web page description vectors from the first problem plus additional category description vectors) and 34,880 content vectors for testing. This problem corresponds to the "cheap" task in the LSHTC challenge and simulates a scenario where the training and test sets follow different word distributions. (We did not use the content vectors from other LSHTC tasks for training classifiers because they contain about 400,000 unique features, making it resource-prohibitive to train an in-memory flat classifier on our test machine: the memory required to allocate the |F| × |C| matrices exceeds the available physical memory.) Since the public datasets do not include actual labels for these test instances, we used the evaluation "oracle" provided on the LSHTC challenge web site to measure the accuracies and Macro-F1 scores. Both of the LSHTC classification problems are highly unbalanced. The smallest classes contain as few as two instances and the largest classes contain as many as 3,000
instances, whereas the average number of instances per class is 10.5. Considering that basic Naive Bayes is known to perform poorly on unbalanced data (Rennie et al. 2003), we also included the TWCNB algorithm (Rennie 2001) in our comparisons.

All algorithm implementations were in Java and the same runtime environment was used for all experiments. The runtime environment consisted of a 64-bit Java Virtual Machine deployed on a dedicated 64-bit system with two 2.67GHz Intel quad-core processors. The Weka toolkit (Hall et al. 2009) was used for training multinomial Naive Bayes and TWCNB, and the Java version of the LibLinear (Fan et al. 2008) library was used for linear SVM training. The SVM results in all other sections of this paper used the default LibLinear setting for training multi-class SVMs, which uses a one-vs-rest strategy. However, the default setting turned out to be very inefficient on the LSHTC datasets because of the large number of classes. Therefore, for these experiments we switched to the LibLinear implementation of the multi-class SVM by Crammer and Singer (Crammer & Singer 2002), which uses an improved and more efficient formulation (Keerthi et al. 2008). This resulted in over 80% reduction in SVM training times on the LSHTC datasets. The results reported here do not include any I/O times or times needed to prepare implementation-specific dataset representations (such as preparing the in-memory Weka dataset for training). We also do not report classification times because, for all methods used in our experiments, classification time is linear in the size of the test samples and any differences are likely implementation-related. Finally, the training times reported here used the optimal parameter for each method on each dataset.

Table 4 presents the average 10-fold cross-validation accuracies and macro-F1 scores, as well as the total training times for all folds, on the SRAA dataset. It is interesting to note that, while SRAA is clearly an easy classification problem, SVM and FWC are comparable to each other and better than both Naive Bayes methods. Also, TWCNB (Rennie 2001) did not perform better than regular NB on this dataset. This finding is consistent with recent results (Kibriya et al. 2004), which also show that TWCNB is not always better than regular NB. In terms of training times, SVM was about 10 times slower than FWC.

Tables 5 and 6 present the classification and runtime performance of the four classification methods on the LSHTC datasets. On both datasets, TWCNB outperformed regular NB on predictive measures, which indicates that TWCNB indeed performs better than NB on unbalanced data. However, FWC outperformed both Naive Bayes methods by a very wide margin. While SVM achieved the highest classification accuracies, its Macro-F1 scores are comparable to or lower than FWC's, and it was also substantially more expensive to train. We also noticed that SVM training times are highly sensitive to the regularization parameter values. For example, with C = 10^{-3}, SVM training on the first LSHTC dataset took almost twice as much time as with C = 10^{-5}. This observation is consistent with (Joachims 2006). In contrast, FWC training times do not vary with α. This could greatly simplify workload balancing for concurrent selection of FWC parameters on multi-core and parallel architectures. Note that the Naive Bayes methods show somewhat longer training times than FWC on all three datasets.
Since the computational complexity of Naive Bayes is equivalent to that of FWC, these longer times are at least partially due to Weka overhead, which makes extensive use of Java objects. A more efficient Naive Bayes implementation is expected to yield training times similar to FWC's. Finally, we note that the FWC results reported here used α values that maximize the classification accuracy on each dataset. For situations where classification performance on small classes is considered more important, α may be selected to maximize
method    accuracy    macro-averaged F1    training time (seconds)
SVM       1.000       1.000                59.86
NB        0.989       0.981                22.85
TWCNB     0.953       0.929                13.23
FWC       0.999       0.999                5.53

Table 4. Classification and runtime performance of SVM, NB, TWCNB and FWC on the SRAA dataset.
method    accuracy    macro-averaged F1    training time (seconds)
SVM       0.456       0.294                8556.3
NB        0.115       0.014                355.7
TWCNB     0.304       0.157                478.1
FWC       0.417       0.297                111.3

Table 5. Classification and runtime performance of SVM, NB, TWCNB and FWC on the first LSHTC classification problem.
Macro-F1 instead of accuracy. On the first LSHTC dataset, this choice improved the Macro-F1 score to 0.313, and on the second dataset it improved the Macro-F1 to 0.28, while slightly reducing the accuracies.
3.7. Comparing Variants of FWC

As discussed in Section 2.1, the feature weight for each class combines Information Gain, a class support factor and a multi-class penalty factor. Here we evaluate the effect on performance of removing either the class support or the penalty for features shared by multiple classes. The results are presented in Figure 4. It is clear from these results that without class support FWC performs extremely poorly. This behavior is observed both on Cluto and on TechTC. Removing the penalty for features shared by multiple classes has little effect on TechTC but a somewhat larger effect on Cluto. This is because TechTC problems are binary and most features occur in both classes, so the class penalty affects few weights. On Cluto, where many of the problems are multi-class, removing the penalty factor leads to a loss of accuracy.
method    accuracy    macro-averaged F1    training time (seconds)
SVM       0.370       0.255                1083.6
NB        0.026       0.0004               34.8
TWCNB     0.236       0.126                59.9
FWC       0.352       0.266                13.6

Table 6. Classification and runtime performance of SVM, NB, TWCNB and FWC on the second LSHTC classification problem.
[Figure 4: panels (a) Accuracies on Cluto, (b) Accuracies on TechTC-100.]

Fig. 4. Accuracy of FWC with default settings, FWC without class support, and FWC without the multi-class penalty on Cluto and TechTC-100 datasets. The problems are ordered by increasing accuracy of FWC with default settings.
Hence, these experiments provide an empirical justification for using class support to balance Information Gain and using the multi-class penalty factor.
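In terms of Line 5 of Algorithm 1, the three settings compared above correspond to the following weight formulas (this is our reading of the ablation; the paper does not spell the variants out):

w_ij = (IG_i / κ_i) · (p_ij)^α    (default)
w_ij = IG_i / κ_i                 (no class support)
w_ij = IG_i · (p_ij)^α            (no multi-class penalty)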
3.8. Alternative Feature Weighting Schemes

In this section we compare the classification performance of Information Gain with Chi-Square and Global Support (i.e., Document Frequency) when used as the global feature significance measure in FWC (Algorithm 1). As discussed in Section 2, we do not consider other feature weighting and selection schemes such as BNS (Forman 2003) and Odds Ratio because they do not have a multi-class form. Figure 5 presents the classification accuracies and macro-averaged F1 scores of FWC on the Cluto and TechTC-100 dataset collections using Information Gain, Chi-Square and Global Support as global significance measures. Information Gain is clearly the most stable measure and consistently outperformed the alternatives on both dataset collections. Global Support is also stable but resulted in poor classification performance on small classes: on the Cluto datasets, the gap between Information Gain and Global Support in F1 scores is larger than the gap in accuracies. This behavior is not observed on the TechTC-100 datasets because they are highly balanced. Finally, Chi-Square outperformed Information Gain in some cases but is highly unstable in general. This is not surprising because Chi-Square is known to behave erratically for very small expected counts, which are common in text classification, as others have also noted (Forman 2003).
3.9. FWC Parameter Sensitivity

FWC has only one tunable parameter, α, which trades off the information gain and the class support of a feature. We have shown in Section 3.7 that all components of the weight are necessary for high performance of FWC. However, it is reasonable to ask how FWC performance changes with different values of α. To address this question, we varied α over the range [0.001, 0.7]. Figure 6 shows results on the TechTC datasets (qualitatively similar results on Cluto are not included).
[Figure 5: panels (a) Classification accuracies on Cluto, (b) F1-Measure on Cluto, (c) Classification accuracies on TechTC-100, (d) F1-Measure on TechTC-100.]

Fig. 5. Classification accuracies and macro-averaged F1 scores of FWC using Information Gain, Chi-Square and Global Support on Cluto and TechTC-100 datasets.
[Figure 6: accuracy (left) and macro-averaged F1 (right) improvement over the α = 0.001 baseline as α varies from 0.01 to 0.70.]

Fig. 6. Comparison of classification accuracies (left) and macro-averaged F1 scores (right) of FWC on TechTC datasets when α is varied. The plots show the min, max, mean and standard deviation of the differences. The baseline used α = 0.001. The y-axis is the improvement over α = 0.001.
These plots show the minimum, maximum and mean differences in FWC accuracy and F1 for different values of α when compared against α = 0.001. The average improvement over α = 0.001 increases slowly. The standard deviations and the ranges of the differences do increase as the α values are taken further apart. However, note that the difference in accuracy or F1 between α = 0.001 and α = 0.7 on any topic is never greater than 2.5%, suggesting that FWC performance is rather stable over a large
range of values of α. Note that this does not imply that higher values of α lead to improvement on all topics. Also, consider that the average tuned FWC accuracy and macro-F1 on TechTC are 0.8717 and 0.8679, respectively. The same measures for α = 0.7 are 0.8684 and 0.8632, and they decrease by several points, to 0.8336 and 0.8277, for a ten-fold decrease to α = 0.07. For comparison, the SVM results are still below these (at 0.8170 and 0.8142 for accuracy and macro-F1). Therefore, one can simply pick a value of α in this range without performing any tuning and still obtain high-quality results.
4. Related Work

Feature selection is a well-known problem, and many papers have addressed it by suggesting a score indicative of a feature's usefulness (Yang & Pedersen 1997, Forman 2003). Traditionally, the top-k features with the highest scores, or all features with scores above a certain threshold, are retained and the rest are discarded. Another way of using these scores, however, is as weights for the corresponding features. Feature weighting, or scaling, is the task of assigning each feature (a word) in an instance (a document) a weight that corresponds to that feature in a vector representation of the document. The basic idea is to assign higher weights to more discriminative features in order to simplify the task of learning predictive models. A set of such weighted vectors is usually provided to a learning algorithm, together with a set of relevance labels, in order to construct predictive models. The TFIDF scheme (Salton & Buckley 1988) and related variants are the most common vector representations used in text analysis. In addition, various alternative weighting schemes have been proposed in the literature. Recently, Forman (Forman 2008) suggested using BNS, a feature selection measure proposed in (Forman 2003), for term scaling in the context of binary text classification. He argued that using BNS leads to better performance than many other representations, and also obviates the need for feature selection.

In the recently proposed Democratic Classifier (Malik & Kender 2008), a weight is directly assigned to each feature in the model without the intermediate steps of creating a new document representation or explicitly training a classifier. However, the Democratic Classifier requires many additional steps, such as ensuring instance coverage by some minimum number of features, constructing new features from pairs of words, and computing additional weights. The FWC approach presented here is significantly simpler than the Democratic Classifier, with faster training and comparable accuracy. For example, FWC was as much as ten times faster to train on the sports dataset, while achieving an accuracy of 0.983, which is about 1% better than the Democratic Classifier.

The Discriminative Term Weighting Classifier (DTWC) (Junejo & Karim 2008) explores similar ideas. Odds, log-odds and information gain (KL divergence) were evaluated as discriminative term weighting schemes. Positive and negative scores were computed for each document in the training set, and a linear classifier was trained in the two-dimensional space of such scores. The evaluation was performed on subsets of three text classification problems: Spam (http://www.ecmlpkdd2006.org/challenge.html), Movie Reviews (http://www.cs.cornell.edu/People/pabo/movie-review-data/) and SRAA (http://www.cs.umass.edu/~mccallum/code-data.html). It indicates that DTWC, when tuned with the best-performing weighting scheme on each dataset, achieves accuracies that are comparable to (and in some cases better than) those of SVM and Naive Bayes.
However, none of the three term weighting schemes consistently outperformed existing classifiers. Unlike DTWC, FWC constructs only one model that contains scores for all classes and does not require training a separate linear classifier, resulting in a simpler and more efficient training procedure.

Winnow (Littlestone 1988) is an incremental linear-threshold algorithm that attempts to reduce the classification error with each incoming example. Winnow responds to each training example according to the current hypothesis, and then updates the hypothesis based on the correct classification, if necessary. Winnow is especially useful when the majority of the attributes are irrelevant. Balanced Winnow (Littlestone 1989) is a variant of Winnow that maintains separate positive and negative weights for each feature, thus allowing for negative coefficients. These negative coefficients make Balanced Winnow more robust for documents with varying lengths.
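To make the promotion/demotion mechanics concrete, here is a minimal sketch of one mistake-driven Winnow update for binary features (the threshold θ, the promotion factor of 2, and the default initial weight of 1 are our illustrative choices, not Weka's exact settings):

```python
def winnow_update(w, x, y_true, theta=1.0, promote=2.0):
    """Update Winnow weights on one training example.

    w      : dict mapping feature id -> current (positive) weight
    x      : set of feature ids present in the example
    y_true : True if the example belongs to the positive class
    """
    y_pred = sum(w.get(f, 1.0) for f in x) >= theta    # current hypothesis
    if y_pred != y_true:                               # update only on a mistake
        factor = promote if y_true else 1.0 / promote  # promotion or demotion
        for f in x:                                    # touch only active features
            w[f] = w.get(f, 1.0) * factor
    return w
```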
5. Discussion and Future Work

Recent advances in automatic training data expansion (Wang et al. 2009) and the unprecedented ongoing growth in the sizes of real-life text databases such as the Open Directory, PubMed (http://www.ncbi.nlm.nih.gov/PubMed/), blogs, and web content in general are making it difficult to apply traditional learning methods to modern text classification problems. This requires researchers to investigate new methods that are capable of handling large-scale training datasets with tens of thousands to hundreds of thousands of categories and millions of documents. With FWC we attempt to address this problem by directly constructing a classifier from feature frequency counts obtained in a single pass over the training data. In terms of simplicity, FWC resembles Naive Bayes but goes beyond utilizing conditional probabilities. FWC draws inspiration from (Malik & Kender 2008) and (Junejo & Karim 2008), which combined several significance measures for short-pattern-based classification, and from (Forman 2008), which evaluated many feature weighting functions. While there may not be a direct theoretical justification for FWC (which is also true for many other methods cited in the previous section), FWC is based on a well-studied measure from information theory (Information Gain), and our extensive experimental study suggests that it works well on a variety of text classification problems. Information Gain has also been found useful in many other applications, such as online recommender systems (Zhang & Tran 2010). In the future we plan to apply FWC to data streams with an online formulation of the learning algorithm.
6. Conclusions

We proposed FWC, a novel single-pass text classifier that is constructed directly from feature frequency counts, following the spirit of Naive Bayes. For each class, FWC assigns each feature a weight obtained by combining global significance, class significance, and multi-class presence. The time complexity is linear in the number of entries in the training dataset and the space complexity is linear in the number of features. Experiments performed on 128 binary and multi-class text and web datasets from various domains show that FWC's performance is at least comparable to, and often
better than that of linear SVM, while being much easier to train. FWC also performs better than Naive Bayes, and has comparable training complexity. FWC is an efficient text classifier that is very easy to implement and rivals more complex linear SVMs in terms of classification performance on a variety of datasets. We recommend including it as one of the choices to evaluate on any text classification problem, in particular if scalability is an issue.
7. Acknowledgments

We would like to thank the editors and the anonymous reviewers for their constructive and detailed comments that greatly helped us improve this paper.
References

Anagnostopoulos, A., Broder, A. & Punera, K. (2008), 'Effective and efficient classification on a search-engine model', Knowledge and Information Systems 16(2), 129-154.

Cohen, W. (1995), Fast effective rule induction, in 'Proceedings of the International Conference on Machine Learning (ICML)', pp. 115-123.

Crammer, K. & Singer, Y. (2002), 'On the learnability and design of output codes for multiclass problems', Machine Learning 47.

Davidov, D., Gabrilovich, E. & Markovitch, S. (2004), Parameterized generation of labeled datasets for text categorization based on a hierarchical directory, in 'The 27th Annual International ACM SIGIR Conference', pp. 250-257.

Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R. & Lin, C.-J. (2008), 'LIBLINEAR: a library for large linear classification', Journal of Machine Learning Research 9, 1871-1874.

Forman, G. (2003), 'An extensive empirical study of feature selection metrics for text classification', Journal of Machine Learning Research (JMLR) 3, 1289-1305.

Forman, G. (2008), BNS feature scaling: An improved representation over TF-IDF for SVM text classification, in 'Proceedings of the 17th ACM Conference on Information and Knowledge Management (CIKM)', pp. 263-270.

Gabrilovich, E. & Markovitch, S. (2004), Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5, in 'The 21st International Conference on Machine Learning (ICML)', pp. 321-328.

Greene, D. & Cunningham, P. (2006), Practical solutions to the problem of diagonal dominance in kernel document clustering, in 'Proceedings of the 23rd International Conference on Machine Learning (ICML)', pp. 377-384.

Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P. & Witten, I. H. (2009), 'The WEKA data mining software: An update', SIGKDD Explorations 11.

Joachims, T. (2002), Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms, Springer.

Joachims, T. (2006), Training linear SVMs in linear time, in 'Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD)', pp. 217-226.

Junejo, K. N. & Karim, A. (2008), A robust discriminative term weighting based linear discriminant method for text classification, in 'Proceedings of the IEEE International Conference on Data Mining (ICDM)', pp. 323-332.
Karypis, G. (2003), 'CLUTO: A software package for clustering high dimensional datasets', http://www-users.cs.umn.edu/~karypis/cluto/.

Keerthi, S. S., Sundararajan, S., Chang, K.-W., Hsieh, C.-J. & Lin, C.-J. (2008), A sequential dual method for large scale multi-class linear SVMs, in 'Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining'.

Kibriya, A. M., Frank, E., Pfahringer, B. & Holmes, G. (2004), Multinomial Naive Bayes for text categorization revisited, in G. Webb & X. Yu, eds, 'AI 2004, LNAI 3339', Springer-Verlag, pp. 488-499.

Lewis, D. D., Yang, Y., Rose, T. & Li, F. (2004), 'RCV1: a new benchmark collection for text categorization', Journal of Machine Learning Research 5, 361-397.

Littlestone, N. (1988), 'Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm', Machine Learning 2, 285-318.

Littlestone, N. (1989), Mistake bounds and logarithmic linear-threshold learning algorithms, Technical report UCSC-CRL-89-11, University of California, Santa Cruz.

Lyman, P. & Varian, H. R. (2003), 'How much information?', http://www2.sims.berkeley.edu/research/projects/how-much-info-2003.

Madani, O., Connor, M. & Greiner, W. (2009), 'Learning when concepts abound', Journal of Machine Learning Research 10, 2571-2613.

Malik, H. H. & Kender, J. R. (2008), Classifying high-dimensional text and web data using very short patterns, in 'Proceedings of the IEEE International Conference on Data Mining (ICDM)', pp. 923-928.

McCallum, A. & Nigam, K. (1998), A comparison of event models for Naive Bayes text classification, in 'Proceedings of the AAAI-98 Workshop on Learning for Text Categorization', pp. 41-48.

Pang, B. & Lee, L. (2004), A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts, in 'Proceedings of the ACL'.

Quinlan, J. R. (1986), 'Induction of decision trees', Machine Learning 1, 81-106.

Quinlan, J. R. (1993), C4.5: Programs for Machine Learning, Morgan Kaufmann.

Quinlan, J. R. & Cameron-Jones, R. M. (1993), FOIL: A midterm report, in 'Proceedings of the European Conference on Machine Learning (ECML)', pp. 3-20.

Rennie, J. D. (2001), Improving multi-class text classification with Naive Bayes, AI Technical Report 2001-04, Massachusetts Institute of Technology.

Rennie, J. D., Shih, L., Teevan, J. & Karger, D. (2003), Tackling the poor assumptions of Naive Bayes text classifiers, in 'Proceedings of the 20th International Conference on Machine Learning (ICML)'.

Salton, G. & Buckley, C. (1988), 'Term-weighting approaches in automatic text retrieval', Information Processing and Management 24(5), 513-523.

Sebastiani, F. (2002), 'Machine learning in automated text categorization', ACM Computing Surveys 34, 1-47.

Wang, P., Hu, J., Zeng, H.-J. & Chen, Z. (2009), 'Using Wikipedia knowledge to improve text classification', Knowledge and Information Systems 19(3).

Yang, Y. & Pedersen, J. O. (1997), A comparative study on feature selection in text categorization, in 'Proceedings of ICML-97, 14th International Conference on Machine Learning', pp. 412-420.

Yin, X. & Han, J. (2003), CPAR: Classification based on predictive association rules, in 'Proceedings of the SIAM International Conference on Data Mining (SDM)', pp. 331-335.
Zhang, R. & Tran, T. (2010), 'An information gain-based approach for recommending useful product reviews', Knowledge and Information Systems.
Author Biographies

Hassan Malik is currently a Senior Technical Specialist at Thomson Reuters, where he conducts data mining and machine learning research for large-scale text processing applications. Prior to joining Thomson Reuters, he was a Research Scientist at Siemens Corporate Research. Concurrently with his undergraduate and graduate studies, he held several senior technical and management positions at companies in New Jersey, Silicon Valley, North Carolina and Karachi, Pakistan. Dr. Malik obtained his Ph.D. in Computer Science from Columbia University in the City of New York in May 2008. His doctoral research focused on investigating efficient algorithms for mining unstructured data. He also holds a Master of Engineering degree in Computer Science from North Carolina State University in Raleigh, NC (2003) and undergraduate degrees in Computer Science from SZABIST and the University of Karachi (1999).

Dmitriy Fradkin received his B.A. in Mathematics and Computer Science from Brandeis University, Waltham, MA in 1999 and his Ph.D. from Rutgers, The State University of New Jersey in 2006. He then worked for 1.5 years at Ask.com, and since 2007 has been at Siemens Corporate Research in Princeton, NJ. His research interests include pattern mining, information retrieval, classification and cluster analysis. Dr. Fradkin is a member of the ACM and the ACM SIGKDD.
Fabian Moerchen graduated with a Ph.D. in February 2006 from the University of Marburg, Germany, after just over 3 years, with summa cum laude. In his thesis he proposed a radically different approach to temporal interval patterns that uses itemset and sequential pattern mining paradigms. Since 2006 he has been working at Siemens Corporate Research, a division of Siemens Corporation, leading data mining projects with applications in predictive maintenance, text mining, healthcare, and sustainable energy. He has continued the study of temporal data mining in the context of industrial and scientific problems and has served the community as a reviewer, organizer of workshops, and presenter of tutorials.
Correspondence and offprint requests to: Hassan H. Malik, Thomson Reuters, 195 Broadway, New York NY 10007, USA. Email:
[email protected]