Automatic Web Pages Categorization with ReliefF and Hidden Naive Bayes

Xin Jin, Rongyan Li, Xian Shen, Rongfang Bie*
College of Information Science and Technology, Beijing Normal University, Beijing 100875, China
Tel: 8610-58800068
[email protected]
* Corresponding author: [email protected]
ABSTRACT
A great challenge of web mining arises from the increasingly large number of web pages and the high dimensionality associated with natural language. Since classifying web pages of an interesting class is often the first step of mining the web, web page categorization/classification is one of the essential techniques for web mining. One of the main challenges of web page classification is the high dimensional text vocabulary space. In this research, we propose a Hidden Naive Bayes based method for web page classification. We also propose to use the ReliefF feature selection method to select relevant words and thereby improve classification performance. Comparisons with traditional techniques are provided. Results on a benchmark dataset show that the proposed methods are promising for accurate web page classification.

Categories and Subject Descriptors
H.2.8 [Data Mining]

General Terms
Algorithms, Documentation, Performance, Experimentation.

Keywords
Web mining, ReliefF feature selection, Hidden Naive Bayes.

1. INTRODUCTION
Classification of web pages has been studied extensively since the Internet has become a huge repository of information, in terms of both volume and variance. Given that web pages are based on loosely structured text, various statistical text learning algorithms have been applied to web page classification [8, 18-23]. Among them, Naive Bayes has shown great success. However, the major problem of Naive Bayes is that it ignores attribute dependencies. On the other hand, although a Bayesian network can represent arbitrary attribute dependencies, learning an optimal one from data is intractable [25]. In this paper we present a Hidden Naive Bayes (HNB) [17] based method for web page classification. HNB avoids the intractable computational complexity of learning an optimal Bayesian network while still taking the influences of all attributes into account [17, 25].

In the field of data mining, many have argued that maximum performance is often achieved not by using all available features, but by using only a "good" subset of them. This is called feature selection. For web page classification, it means finding a subset of words that helps to discriminate between different kinds of web pages. In this paper we introduce a ReliefF [1, 2, 5, 7] based method to find relevant words for improving web page classification performance. ReliefF can efficiently estimate the quality of attributes with strong interdependencies, such as those found in parity problems. The key idea of ReliefF is to estimate attributes according to how well their values distinguish among instances that are near to each other.

The remainder of this paper is organized as follows. Section 2 presents the web page representation and preprocessing method. Section 3 describes the ReliefF based word selection method. Section 4 presents HNB based web page classification. Section 5 presents the performance measures and the experimental results. Finally, conclusions are drawn in Section 6.

2. WEB PAGE REPRESENTATION AND PREPROCESSING
We represent each web page as a bag of words/features. A feature vector V is composed of the various words from a dictionary formed by analyzing the web pages; there is one feature vector per web page. The i-th component w_i of the feature vector is the IDF transform of the word frequency:

w_i = F_i * log(number of web pages / number of web pages containing word i)
where F_i is the frequency of word i in the web page. Word tokens in web pages are formed only from contiguous alphabetic sequences. In addition, since web pages are in HTML format, HTML tags are removed before classification. During tokenization we perform stemming, stop-word removal and Document Frequency Thresholding (DFT) [24].
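As a rough illustration of this preprocessing pipeline, the following Python sketch strips HTML tags, keeps contiguous alphabetic tokens, and computes the frequency-times-IDF weights described above. The stop-word list, the placeholder stemmer, and the min_df cut-off are stand-ins for whatever was actually used (the paper cites [24] for stemming and DFT), so treat this as an assumption-laden sketch rather than the exact procedure.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for"}  # toy list, not the real one

def stem(token: str) -> str:
    """Placeholder stemmer; a real system would use e.g. a Porter-style stemmer [24]."""
    return token[:-1] if token.endswith("s") else token

def tokenize(html: str) -> list[str]:
    """Strip HTML tags, keep contiguous alphabetic sequences, drop stop words, stem."""
    text = re.sub(r"<[^>]+>", " ", html)            # remove HTML tags
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]

def build_vectors(pages: list[str], min_df: int = 2) -> tuple[list[dict], list[str]]:
    """One weight vector per page, using weight = F_i * log(N / df_i)."""
    docs = [Counter(tokenize(p)) for p in pages]
    n = len(docs)
    df = Counter(w for d in docs for w in d)            # document frequency of each word
    vocab = [w for w, c in df.items() if c >= min_df]   # Document Frequency Thresholding
    vectors = [{w: d[w] * math.log(n / df[w]) for w in vocab if w in d} for d in docs]
    return vectors, vocab
```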
3. RELIEFF WORD SELECTION
ReliefF can be seen as an extension of Relief [1, 2]. The key idea of Relief is to estimate attributes according to how well their values distinguish among instances that are near to each other [6, 7]. For that purpose, given an instance, Relief searches for its two nearest neighbors: one from the same class (called the nearest hit, H) and one from a different class (called the nearest miss, M). The original Relief algorithm randomly selects m training instances R_i, i = 1, ..., m, where m is a user-defined parameter, and the weight of attribute A is updated as:

W[A] := W[A] - \frac{1}{m} \sum_i diff(A, R_i, H) + \frac{1}{m} \sum_i diff(A, R_i, M)    (1)

The function diff(A, I1, I2) calculates the difference between the values of attribute A for two instances I1 and I2. The same function is also used to calculate the distance between instances when searching for the nearest neighbors; in that case the distance is the sum of the differences over all attributes.

ReliefF, an extension of Relief, improves the original algorithm by estimating probabilities more reliably (it is more robust and can deal with noisy data) and extends it to handle incomplete and multiclass data sets [3, 4]. Figure 1 shows the pseudo code of the ReliefF algorithm.

Algorithm ReliefF
Input: for each training instance, a vector of attribute values and the class value
Output: the vector W of estimates of the qualities of the attributes
1. set all weights W[A] := 0.0;
2. for i := 1 to m do begin
3.   randomly select an instance R_i;
4.   find k nearest hits H_j;
5.   for each class C ≠ class(R_i) do
6.     from class C find k nearest misses M_j(C);
7.   for A := 1 to a do
8.     W[A] := W[A] - \sum_{j=1}^{k} diff(A, R_i, H_j) / (m \cdot k) + \sum_{C \neq class(R_i)} \frac{P(C)}{1 - P(class(R_i))} \sum_{j=1}^{k} diff(A, R_i, M_j(C)) / (m \cdot k);
9. end;

Figure 1. Pseudo code of the ReliefF algorithm.

3.1 Reliable Estimation
The parameter m in the Relief algorithm is the number of instances used to approximate the probabilities in W[A]. A larger m implies a more reliable approximation, but m cannot exceed the number of available training instances. The obvious choice, adopted in ReliefF, is to set m to this upper bound and run the outer loop of the algorithm over all available training instances.

The selection of the nearest neighbors is of crucial importance in Relief. To increase the reliability of the probability approximation, ReliefF searches for k nearest hits/misses instead of only one near hit/miss and averages the contributions of all k nearest hits/misses. It was shown that this extension significantly improves the reliability of the estimates of the attributes' qualities [2, 4].

3.2 Multiclass Solution
Instead of finding one near miss M from a different class, ReliefF finds one near miss M(C) for each different class C and averages their contributions when updating the estimate W[A]. The average is weighted with the prior probability of each class. The idea is that the algorithm should estimate the ability of attributes to separate each pair of classes regardless of which two classes are closest to each other.

We use the ReliefF algorithm to calculate a weight RF for each word in the web pages; word selection is then achieved by selecting the words with the highest weights.

The performance of ReliefF is compared with the following three feature selection methods: Information Gain (IG), which is based on a feature's impact on decreasing entropy [10]; Gain Ratio (GR), which compensates for the number of features by normalizing by the information encoded in the split itself [11]; and Chi Squared (CS), which compares the class frequencies observed after a split with the a priori class frequencies.

4. HNB WEB PAGE CLASSIFICATION
The basic idea of HNB for web page classification is to create a hidden parent for each word/attribute, which combines the influences from all other words/attributes.

Figure 2 gives the structure of an HNB, which was originally proposed by Zhang et al. [17]. In an HNB, attribute dependencies are represented by hidden parents of the attributes. C is the class node and is also the parent of all attribute nodes. Each attribute A_i has a hidden parent A_{hp_i}, i = 1, 2, ..., n, represented by a dashed circle. The arc from the hidden parent A_{hp_i} to A_i is also drawn as a dashed directed line to distinguish it from regular arcs.

Figure 2. The structure of HNB.

The joint distribution represented by an HNB is defined as follows:
P(A_1, \ldots, A_n, C) = P(C) \prod_{i=1}^{n} P(A_i | A_{hp_i}, C)    (5)

where

P(A_i | A_{hp_i}, C) = \sum_{j=1, j \neq i}^{n} W_{ij} \cdot P(A_i | A_j, C)    (6)

The hidden parent A_{hp_i} for A_i is essentially a mixture of the weighted influences from all other attributes.

The approach to determining the weights W_{ij}, i, j = 1, ..., n and i ≠ j, is crucial. HNB computes them from the data, using the conditional mutual information between two attributes A_i and A_j as the weight of P(A_i | A_j, C).

Learning an HNB is mainly a matter of estimating the parameters of the HNB from the training data. HNB based web page classification is depicted in Figure 3.

Algorithm HNB(T)
Input: a set T of training web pages
For each value c of C
  Compute P(c) from T
For each pair of words/attributes A_i and A_j
  For each assignment a_i, a_j and c to A_i, A_j and C
    Compute P(a_i, a_j | c) from T
For each pair of attributes A_i and A_j
  Compute I_P(A_i; A_j | C)
For each attribute A_i
  Compute W_i = \sum_{j=1, j \neq i}^{n} I_P(A_i; A_j | C)
  For each attribute A_j with j ≠ i
    Compute W_{ij} = I_P(A_i; A_j | C) / W_i
Output: HNB models for T

Figure 3. HNB algorithm for web page classification.

The classifier corresponding to an HNB on an example E = (a_1, ..., a_n) is defined as follows:

c(E) = \arg\max_{c \in C} P(c) \prod_{i=1}^{n} P(a_i | a_{hp_i}, c)    (7)

5. EXPERIMENTS
CMU Industry Sector [12] is a collection of web pages belonging to companies from various economic sectors. We use 10-fold cross-validation on this benchmark dataset to estimate classification performance. Comparison is done with three traditional methods: Naive Bayes (NB) [16], Support Vector Machine (SVM) [13, 14, 15] and Decision Tree (DT) [9, 11].

We use a subset of the original data, which forms a two-level hierarchy. There are 527 web pages partitioned into 7 classes: materials, energy, financial, healthcare, technology, transportation and utilities. Each class has about 80 web pages. There are 20,257 words after stemming and stop-word removal, and 1,258 words after DFT feature reduction.

5.1 Performance Measures
We use the following classification performance measures:

Error Rate (ER): the ratio of the number of incorrect predictions to the number of all predictions (both correct and incorrect): ER = N_ip / N_p, where N_ip is the number of incorrect predictions and N_p is the number of all predictions (i.e., the number of test samples). ER ranges from 0% to 100%; the lower the ER the better, with 0% being ideal.

F1: it is normal practice to combine recall and precision into the F1 measure so that classifiers can be compared in terms of a single rating: F1 = 2RP / (R + P). Recall (R) is the percentage of the web pages of a given category that are classified correctly. Precision (P) is the percentage of the web pages predicted for a given category that are classified correctly. F1 ranges from 0 to 1; the higher the F1 the better.
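To make the two measures concrete, the small Python sketch below computes ER and per-class precision, recall and F1 from predicted and true labels. Averaging the per-class F1 values (macro-averaging) is one plausible way to obtain a single F1 figure over the seven classes, but that aggregation is an assumption; the paper does not state how the per-class scores are combined.

```python
def error_rate(y_true: list[str], y_pred: list[str]) -> float:
    """ER = (number of incorrect predictions) / (number of all predictions)."""
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return wrong / len(y_true)

def f1_per_class(y_true: list[str], y_pred: list[str]) -> dict[str, float]:
    """F1 = 2RP / (R + P) for each class, from per-class recall and precision."""
    scores = {}
    for c in set(y_true):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        scores[c] = 2 * recall * precision / (recall + precision) if recall + precision else 0.0
    return scores

# Example (assumed macro-averaging over the 7 sector classes):
# f1 = sum(f1_per_class(y_true, y_pred).values()) / len(set(y_true))
```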
5.2 Results
Figure 4 shows the word selection results. The feature ranking scores (RF, CS, IG and GR) of the words are normalized to have a maximum of 1. The results show that the top 395 ranked words are informative.
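As a concrete illustration of this selection step, the sketch below normalizes a vector of ranking scores to a maximum of 1 and keeps the top-ranked words; the default cut-off of 395 mirrors the value used here, while the scores themselves are assumed to come from whichever ranker (RF, CS, GR or IG) is being evaluated.

```python
def select_top_words(scores: dict[str, float], k: int = 395) -> list[str]:
    """Normalize ranking scores to a maximum of 1 and return the k highest-ranked words."""
    top = max(scores.values())
    normalized = {w: (s / top if top else 0.0) for w, s in scores.items()}
    return sorted(normalized, key=normalized.get, reverse=True)[:k]
```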
Figure 4. Feature ranking scores of the words (RF, CS, GR and IG). The X-axis is the sorted ranking index of the features (for example, 113 is the feature ranked 113th); the Y-axis is the normalized ranking score.
Figure 5 shows the Error Rate (ER) of HNB and the three traditional methods (NB, DT, SVM) with and without feature selection. We can observe from the results that the performance of HNB is better than that of the traditional methods both with and without feature selection. HNB achieves the lowest ER of 5.7% with feature selection and 8.8% without feature selection. Feature selection improves the performance of all the classifiers. RF's performance is comparable with that of the traditional feature selection methods (CS, GR and IG); for the NB and DT classifiers, RF is even the best.

Figure 5. Error Rate (ER) of HNB and traditional methods with feature selection (RF, CS, GR and IG) and without feature selection (Original). RF = ReliefF, CS = Chi Squared, GR = Gain Ratio, IG = Information Gain; "Original" means without feature selection. The X-axis denotes the learners (NB, DT, SVM, HNB); the Y-axis denotes the ER (%).

Figure 6 shows Error Rate (ER) curves of HNB and the other classifiers with RF feature selection. The number of selected words varies from 100 to 395; "all" denotes no feature selection, that is, all 1,258 words. When more than 200 words are selected, HNB is better than the traditional classification methods. HNB achieves its best performance, reaching the lowest ER, at the top 350 RF-selected words.

Figure 6. Error Rate (ER) curves of HNB and the three traditional methods (NB, DT and SVM) with ReliefF (RF) feature selection. NB = Naive Bayes, DT = Decision Tree, SVM = Support Vector Machine. The X-axis denotes the number of selected words; the Y-axis denotes the ER (%). "all" denotes without feature selection, that is, all 1,258 words.

Figure 7 shows the F1 of HNB and the three traditional methods with and without feature selection. We can see that the performance of HNB is better than that of the traditional methods both with and without feature selection. HNB achieves the highest F1 of 0.96 with feature selection and 0.94 without feature selection. Feature selection improves the performance of all the classifiers. ReliefF's performance is comparable to or better than that of the traditional feature selection methods; RF is the best for NB, DT and HNB.

Figure 7. F1 of HNB and traditional methods with feature selection and without feature selection (Original).

Figure 8 shows F1 curves of HNB and the traditional classifiers (NB, DT and SVM) with RF feature selection. When more than 150 words are selected, HNB is better than the traditional classification methods. HNB achieves its best performance, reaching the highest F1, at the top 395 RF-selected words.

Figure 8. F1 curves of HNB and the three traditional methods with ReliefF (RF) feature selection.

6. CONCLUSIONS
In this paper we propose a ReliefF (RF) feature selection based method for selecting relevant words in web pages. We also introduce a Hidden Naive Bayes (HNB) based method for classifying web pages. Comparison is done with traditional techniques.

Results on the benchmark web page dataset CMU Industry Sector indicate that ReliefF based feature selection is a promising technique for web page classification. With ReliefF feature selection, all the classifiers achieve better performance than with the original data. The performance of RF is comparable with that of the
traditional feature selection methods, and in some cases it is even better. The results also show that the HNB based method is better than the traditional methods for web page classification.

7. ACKNOWLEDGMENTS
This work was supported by the National Science Foundation of China under Grants No. 10001006 and No. 60273015.

8. REFERENCES
[1] M. Robnik-Sikonja and I. Kononenko: Theoretical and Empirical Analysis of ReliefF and RReliefF. Machine Learning 53(1-2): 23-69 (2003)
[2] I. Kononenko and E. Simec: Induction of Decision Trees using ReliefF. In: G. Della Riccia, R. Kruse, and R. Viertl (eds.): Mathematical and Statistical Methods in Artificial Intelligence, CISM Courses and Lectures No. 363. Springer Verlag (1995)
[3] I. Kononenko: Estimating Attributes: Analysis and Extensions of Relief. In Proceedings of ECML'94, pages 171-182. Springer-Verlag New York, Inc. (1994)
[4] I. Kononenko, E. Simec, and M. Robnik-Sikonja: Overcoming the Myopia of Inductive Learning Algorithms with ReliefF. Applied Intelligence 7, 39-55 (1997)
[5] Yuhang Wang and Fillia Makedon: Application of Relief-F Feature Filtering Algorithm to Selecting Informative Genes for Cancer Classification using Microarray Data (poster paper). In Proceedings of the 2004 IEEE Computational Systems Bioinformatics Conference, pages 497-498, Stanford, California (2004)
[6] K. Kira and L. A. Rendell: The Feature Selection Problem: Traditional Methods and New Algorithm. In: Proceedings of AAAI'92 (1992)
[7] K. Kira and L. A. Rendell: A Practical Approach to Feature Selection. In: D. Sleeman and P. Edwards (eds.): Machine Learning: Proceedings of International Conference (ICML'92), pp. 249-256, Morgan Kaufmann (1992)
[8] H. Mase: Experiments on Automatic Web Page Categorization for IR System. Technical Report, Stanford Univ., Stanford, Calif. (1998)
[9] I. Witten and E. Frank: Data Mining - Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann (2000)
[10] J. Ross Quinlan: Induction of Decision Trees. Machine Learning, 1:81-106 (1986)
[11] Ross Quinlan: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA (1993)
[12] Industry Sector Dataset: http://www.cs.cmu.edu/~TextLearning/datasets.html (2005)
[13] Corinna Cortes and Vladimir Vapnik: Support-Vector Networks. Machine Learning, 20(3):273-297 (1995)
[14] J. Platt: Fast Training of Support Vector Machines using Sequential Minimal Optimization. In: Advances in Kernel Methods - Support Vector Learning, B. Schoelkopf, C. Burges, and A. Smola, eds., MIT Press (1998)
[15] S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy: Improvements to Platt's SMO Algorithm for SVM Classifier Design. Neural Computation, 13(3), pp. 637-649 (2001)
[16] Karl-Michael Schneider: A Comparison of Event Models for Naive Bayes Anti-Spam E-Mail Filtering. In Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics, Budapest, Hungary, pp. 307-314, April (2003)
[17] H. Zhang, L. Jiang, and J. Su: Hidden Naive Bayes. In Proceedings of the Twentieth National Conference on Artificial Intelligence (AAAI-05), pp. 919-924, AAAI Press (2005)
[18] Hwanjo Yu, Jiawei Han, and Kevin Chen-Chuan Chang: PEBL: Web Page Classification without Negative Examples. IEEE Trans. Knowl. Data Eng. 16(1): 70-81 (2004)
[19] S. Dumais and H. Chen: Hierarchical Classification of Web Content. In Proc. 23rd ACM Int'l Conf. Research and Development in Information Retrieval (SIGIR '00), pp. 256-263 (2000)
[20] W. Wong and A. W. Fu: Finding Structure and Characteristics of Web Documents for Classification. In Proc. 2000 ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD '00), pp. 96-105 (2000)
[21] J. Yi and N. Sundaresan: A Classifier for Semi-Structured Documents. In Proc. Sixth Int'l Conf. Knowledge Discovery and Data Mining (KDD '00), pp. 340-344 (2000)
[22] H. Oh, S. Myaeng, and M. Lee: A Practical Hypertext Categorization Method Using Links and Incrementally Available Class Information. In Proc. 23rd ACM Int'l Conf. Research and Development in Information Retrieval (SIGIR '00), pp. 264-271 (2000)
[23] L. K. Shih and David R. Karger: Using URLs and Table Layout for Web Classification Tasks. In WWW 2004, pp. 193-202 (2004)
[24] Stemming: http://www.comp.lancs.ac.uk/computing/research/stemming/general/ (accessed 2006)
[25] D. M. Chickering: Learning Bayesian Networks is NP-Complete. In Fisher, D., and Lenz, H., eds., Learning from Data: Artificial Intelligence and Statistics V, Springer-Verlag, pp. 121-130 (1996)