Web Page Classification Using Firefly Optimization

Esra Saraç

Selma Ayşe Özel

Çukurova University, Faculty of Engineering and Architecture, Department of Computer Engineering Balcalı, Sarıçam, Adana, 01330 Turkey

Çukurova University, Faculty of Engineering and Architecture, Department of Computer Engineering Balcalı, Sarıçam, Adana, 01330 Turkey

Abstract— The growth in the amount of information on the Web has created a need for accurate automated classifiers for Web pages, both to maintain Web directories and to increase search engines’ performance. Since every (HTML/XML) tag and every term on each Web page can be considered a feature, efficient methods are needed to select the best features and reduce the feature space of the Web page classification problem. In this study, our aim is to apply a recent optimization technique, namely the firefly algorithm (FA), to select the best features for the Web page classification problem. The FA is a metaheuristic algorithm inspired by the flashing behavior of fireflies. We use the FA to select a subset of features, and we evaluate the fitness of the selected features with the J48 classifier of the Weka data mining tool. The WebKB and Conference datasets were used to evaluate the effectiveness of the proposed feature selection system. We observed that when a subset of features is selected by using the FA, the WebKB and Conference datasets are classified without loss of accuracy; moreover, the time needed to classify new Web pages drops sharply as the number of features decreases.

Keywords- Firefly Algorithm; Web page classification; Classification; Feature selection

I. INTRODUCTION

As the popularity of the Web increases, the amount of information on the Web also increases. This growth has created a need for accurate and fast classification of Web pages to increase search engines’ performance. Automatic Web page classification is a supervised learning problem in which a set of labeled Web documents is used for training a classifier, and the classifier is then employed to assign one or more predefined category labels to future Web pages [1]. Automatic Web page classification is not only used for improving search engines’ performance; it is also essential to the development of Web directories, to topic-specific Web link analysis, to contextual advertising, to analysis of the topical structure of the Web, and to improving the quality of Web search [1]. Several classification methods such as decision trees, Bayesian classifiers, support vector machines, and k-nearest neighbors have been developed [2]. Among these methods, decision trees and support vector machines are suitable for classification problems in which the number of features is small [2]. The Web page classification problem, on the other hand, is a high dimensional problem, since each term in each HTML or XML tag of each Web page can be taken as a feature.

978-1-4799-0661-1 ©2013 IEEE

In this study, we propose a firefly algorithm (FA) based wrapper technique that finds the best features for Web pages in order to make classification fast and accurate. The FA is a recent search and optimization technique, first introduced by Xin-She Yang in 2008 [3]. The primary purpose of a firefly's flash is to act as a signal system to attract other fireflies. The main rules of the algorithm are as follows [3]:



• All fireflies are unisexual.
• Attractiveness is proportional to brightness; for any two fireflies, the less bright one is attracted by the brighter one. However, the brightness can decrease as the distance between them increases.
• If there are no fireflies brighter than a given firefly, it moves randomly.

In this study, our aim is to choose the best n features among the hundreds of features that were extracted from Web pages, in order to reduce the feature space of the Web page classification problem [4, 5] and thereby reduce the time needed to classify new (unseen) Web pages. To our knowledge, no previous study uses the firefly algorithm for feature selection from Web pages. This paper is organized as follows: in the next section, we give more detail about Web page classification and summarize related work on FA applications. The third section describes our FA-based feature selection system. The datasets used in this study and the experimental results are presented in the fourth section. Finally, the fifth section concludes the study.

II. RELATED WORK

The Web page classification problem is defined as the problem of assigning a Web page to one or more predefined category labels [1]. In this study, our aim is to determine the “role” of a Web page, that is, to decide whether the Web page is a “student home page”, a “course page”, or a “department home page”. While doing that, we give a single class label (e.g., “course page”) to each Web page, and we perform binary classification in which we categorize instances into exactly one of two classes (e.g., “course page” or “not course page”). This kind of classification problem arises especially in the focused crawling systems of vertical search engines. It is also possible to extend the solution technique developed in this study to other binary classification problems.

In [6], the FA is used for clustering on benchmark problems, and the performance of the FA is compared with two other nature-inspired techniques, Artificial Bee Colony (ABC) and Particle Swarm Optimization (PSO). According to that study, the average classification error rates are 11.36%, 13.13%, and 15.99% for the FA, ABC, and PSO, respectively. In [7], a new feature selection approach that combines Rough Set Theory (RST) with the nature-inspired firefly algorithm is presented. The algorithm simulates the attraction system of real fireflies to guide the feature selection procedure. The experimental results show that the proposed algorithm outperforms other feature selection methods in terms of time and optimality. For the Cleveland Heart dataset, the number of features was reduced from 13 to 6-7 with PSO and to 3 with the FA.

III. FEATURE SELECTION USING THE FIREFLY ALGORITHM

The Firefly Algorithm (FA) is an optimization technique developed recently by Xin-She Yang at Cambridge University [3]. It is inspired by the social behavior of fireflies and the phenomenon of bioluminescent communication. Fireflies generate light inside their bodies through a chemical reaction. It is thought that light in adult fireflies was originally used for warning purposes, but evolved for use in mate selection, with a variety of ways of communicating with mates during courtship. Although fireflies have many mechanisms, the interesting questions are how they communicate to find food, to protect themselves from predators, and to reproduce successfully. In general, the pattern of flashes is unique to a particular species of firefly. The flashing light is generated by a chemical process of bioluminescence. Two fundamental functions of such flashes are i) to attract mating partners or to communicate, and ii) to attract potential prey. Flashing may also be used as a protective warning mechanism.

The light intensity at a particular distance from the light source follows the inverse square law: as the distance increases, the light intensity decreases. Furthermore, the air absorbs light, which becomes weaker and weaker as the distance increases. These two combined factors make most fireflies visible only up to a limited distance, which is usually good enough for fireflies to communicate with each other. The flashing light can be formulated in such a way that it is associated with the objective function to be optimized. This makes it possible to formulate new metaheuristic algorithms. The main steps of the FA are described in Figure 1.

1. Generate initial population of fireflies (xi)
2. Determine light intensity Ii for each firefly xi by f(xi), where f(xi) is the objective function value for xi
3. Define light absorption coefficient λ
4. While (t < MaxGeneration)
   4.1 for each firefly i
         for each firefly j
           if (Ij > Ii) then
             Move firefly i towards j in d-dimension
           endif
           Update attractiveness with distance r via exp(−λr)
           Evaluate new solution and light intensity
         end for
       end for
   4.2 Rank the fireflies and find the current best one
   4.3 Increment t
   end while
5. Display the best firefly

Figure 1. Firefly Algorithm
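The steps of Figure 1 can be illustrated with a minimal Python sketch for a continuous maximization problem. This is not the implementation used in this study: the function name firefly_maximize, the base attractiveness beta0, the random step size alpha, the search bounds, and the seed are all assumptions introduced for illustration; only λ (lam) and MaxGeneration come from the figure.

```python
import math
import random


def firefly_maximize(f, dim, n_fireflies=30, max_generation=30,
                     lam=1.0, beta0=1.0, alpha=0.2, bounds=(-5.0, 5.0), seed=0):
    """Illustrative firefly search; beta0, alpha, bounds and seed are assumed parameters."""
    rng = random.Random(seed)
    lo, hi = bounds

    def clip(v):
        # Keep every coordinate inside the assumed search bounds.
        return min(hi, max(lo, v))

    # Step 1: generate the initial population of fireflies.
    xs = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_fireflies)]
    # Step 2: light intensity of each firefly is its objective value.
    intensity = [f(x) for x in xs]
    best_x, best_val = None, -math.inf

    # Step 4: main loop over generations.
    for _ in range(max_generation):
        for i in range(n_fireflies):
            for j in range(n_fireflies):
                if intensity[j] > intensity[i]:
                    # Attractiveness decays with distance r via exp(-lam * r).
                    r = math.dist(xs[i], xs[j])
                    beta = beta0 * math.exp(-lam * r)
                    # Move firefly i towards the brighter firefly j, plus a small random step.
                    xs[i] = [clip(a + beta * (b - a) + alpha * (rng.random() - 0.5))
                             for a, b in zip(xs[i], xs[j])]
                    intensity[i] = f(xs[i])
        # Step 4.2: rank the fireflies and remember the best one seen at ranking time.
        top = max(range(n_fireflies), key=intensity.__getitem__)
        if intensity[top] > best_val:
            best_x, best_val = list(xs[top]), intensity[top]
        # A firefly with no brighter neighbour moves randomly.
        xs[top] = [clip(a + alpha * (rng.random() - 0.5)) for a in xs[top]]
        intensity[top] = f(xs[top])

    # Step 5: report the best firefly.
    return best_x, best_val


if __name__ == "__main__":
    sphere = lambda x: -sum(v * v for v in x)  # maximum at the origin
    print(firefly_maximize(sphere, dim=2))
```

The sketch keeps a separate record of the best solution because the random move of the brightest firefly may lower its intensity.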

In this study we developed an FA-based [3] algorithm to select the best features for Web pages in order to provide accurate and fast classification. For this purpose, we first extracted features, which consist of all stemmed words that are not stopwords, from each of the positive documents in the training dataset. Feature extraction is performed only once for each dataset as a pre-processing step. After extracting features, document vectors for the Web pages are created by counting the occurrences of each feature in each Web page. Then, the document vectors are normalized. After that, the FA is used for feature selection. In the proposed feature selection method, each feature represents a node, and all nodes are independent. Nodes (i.e., features) are selected according to their selection probability Pk(i), which is the document frequency (df) value of each term. After the probability evaluation, a roulette wheel selection algorithm is used for selecting the next feature [8]. The F-measure values of the selected subsets are used as the objective function f(xi). The main steps of our FA-based feature selection algorithm are described in Figure 2.

First, we generate an initial population with a pre-determined number of fireflies. The initial light intensities of the features are then defined by their df values. We chose df values as the light intensities because df is an important metric for classification accuracy and for a feature’s attractiveness. In the first step, each firefly randomly chooses three unique features. When all fireflies complete their subset selection process, two arff files (i.e., train and test files) are generated for each firefly. A decision tree classifier is learned from the train dataset by using the J48 classifier of the Weka data mining tool. After that, the test dataset is classified, and the F-measure value of the classification is computed. These steps are repeated for all fireflies; then the best of these k fireflies is found, and the other fireflies are forced to resemble the best firefly. The best firefly is updated as the most attractive one, and the light intensities of this firefly’s features are updated by using the F-measure value of the best firefly’s solution. The light intensity update is computed as follows:

$df(i) = df(i) \cdot \exp(-\lambda \cdot F)$

where λ = −1, i ∈ the best firefly’s subset, and F is the F-measure value of the best firefly. Since our purpose is to increase the F-measure value, it is used in the light intensity update: with λ = −1, the update multiplies df(i) by exp(F) ≥ 1, so the features of the best firefly become brighter.

The proposed algorithm includes two feature inclusion functions for the second step. If a firefly is the local best firefly, it randomly chooses a new term from the unselected term list. Otherwise, the firefly randomly chooses a term from the best firefly’s selected term list. For each firefly xi, two arff files (train and test) are again generated. The F-measure values of all fireflies are computed, and the local best is found. The above process is repeated until t equals the MaxGeneration value. The algorithm then terminates, and each firefly has n features.

1. Generate initial population of fireflies (xi)
2. Determine light intensity Ii for each feature by their df values
3. While (t < MaxGeneration)
   3.1 for each firefly i
         for each firefly j
           if (Ij > Ii) then
             Move firefly i towards j in d-dimension
           endif
           Update attractiveness with respect to F-measure
           Evaluate new solution and light intensity
         end for
       end for
   3.2 Rank the fireflies and find the current best one
   3.3 Increment t
   end while
4. Display the best firefly

Figure 2. Firefly Feature Selection Algorithm
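The feature selection loop described above can be sketched in Python as follows. This is only an illustration under several assumptions, not the Java/Weka implementation used in this study: the names roulette_pick and fa_select_features are hypothetical, the fitness argument stands in for training and testing J48 and returning the F-measure, every draw is made by roulette wheel selection over the current light intensities, a firefly that already holds all of the best firefly's terms falls back to the unselected terms, and the loop stops once the subsets reach n features.

```python
import math
import random


def roulette_pick(candidates, weight, rng):
    """Roulette wheel selection: probability proportional to each candidate's weight."""
    return rng.choices(candidates, weights=[weight[c] for c in candidates], k=1)[0]


def fa_select_features(df, fitness, n_features, n_fireflies=30, lam=-1.0, seed=0):
    """Sketch of FA-based feature selection; df maps each term to a positive df value."""
    rng = random.Random(seed)
    intensity = dict(df)          # light intensity of each feature starts at its df value
    all_terms = list(df)

    # First step: every firefly picks three unique features.
    subsets = []
    for _ in range(n_fireflies):
        subset = []
        while len(subset) < min(3, n_features):
            pool = [t for t in all_terms if t not in subset]
            subset.append(roulette_pick(pool, intensity, rng))
        subsets.append(subset)

    while True:
        # Evaluate every firefly (stand-in for J48 training/testing) and find the best.
        scores = [fitness(s) for s in subsets]
        best = max(range(n_fireflies), key=scores.__getitem__)
        # Light intensity update: df(i) = df(i) * exp(-lam * F) for the best subset.
        for t in subsets[best]:
            intensity[t] *= math.exp(-lam * scores[best])
        if len(subsets[best]) >= n_features:
            return list(subsets[best]), scores[best]

        best_terms = list(subsets[best])
        for k, subset in enumerate(subsets):
            # The best firefly explores unselected terms; the others copy from the best.
            pool = [] if k == best else [t for t in best_terms if t not in subset]
            if not pool:  # assumed fallback when nothing new can be copied
                pool = [t for t in all_terms if t not in subset]
            subset.append(roulette_pick(pool, intensity, rng))


if __name__ == "__main__":
    # Toy data: made-up df values and a made-up fitness, for illustration only.
    toy_df = {"course": 5, "syllabus": 4, "exam": 3, "home": 6,
              "page": 6, "index": 2, "lecture": 4, "html": 7}
    useful = {"course", "syllabus", "exam", "lecture"}
    toy_fitness = lambda s: len(useful & set(s)) / len(s)
    print(fa_select_features(toy_df, toy_fitness, n_features=4, n_fireflies=5))
```

In a real wrapper the fitness call dominates the cost, since each evaluation trains and tests a classifier.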

IV. EXPERIMENTAL EVALUATION AND RESULTS

All implementations for the experiments were written in the Java programming language under the Eclipse environment [9]. The proposed method was tested under the Microsoft Windows XP SP3 operating system. The hardware used in the experiments had 1 GB of RAM and an Intel Core2Duo 1.60 GHz processor.

A. FA Parameters
In the proposed FA-based feature selection method, there are several parameters, such as the number of fireflies, the number of features, and so on. The number of fireflies is set to 30, since in our experiments on the ACO and IWD algorithms 30 was the optimum number of ants and intelligent water drops.

In order to compare this study with our previous studies, we used 30 fireflies [4, 5, 10]. The parameter λ is defined as -1. We set the MaxGeneration number to 30 experimentally, because we observed that the fireflies can find the best feature subset for classification after the first ten steps.

B. Datasets
Two datasets, namely the Conference and the WebKB [11] datasets, were used in the experiments. The Conference dataset consists of Computer Science related conference homepages that were obtained from the DBLP web site [12]. We labeled the conference Web pages as positive documents in the dataset. To complete the dataset, the short names of the conferences were queried using the Google search engine, and the irrelevant pages in the result set were taken as negative documents. Then, all the positive and negative documents were randomly distributed among the train and test datasets. The Conference dataset contains 2369 Web pages in total, and 824 of them are conference homepages. We used a 75% / 25% split for training and testing. The numbers of pages in the train and test datasets are presented in Table I. For this dataset our aim was to determine whether a Web page is a conference homepage or not.

The WebKB dataset is a well-known dataset that is obtained from the WebKB project [13]. It contains course, department, faculty, project, staff, and student Web pages gathered from the Computer Science departments of the Cornell, Texas, Washington, and Wisconsin universities, as well as some irrelevant pages from those four universities. We used the course, faculty, project, and student classes in our experiments, since these classes have more instances than the others. The dataset contains 7648 Web pages in total: 883 course, 1028 faculty, 493 project, and 1480 student homepages, and 3764 negative Web pages (i.e., pages that belong to other classes). The train and test datasets were constructed as described on the WebKB project Web site [13].
For this study we used pages from the Cornell, Texas, and Washington universities in the training phase, and pages from Wisconsin university in the test phase. We used the WebKB dataset as a binary classification dataset. For example, the Course dataset contains 883 course and 3764 negative pages, the Faculty dataset has 1028 faculty and 3764 negative pages, and so on (Table I).

TABLE I. TRAIN/TEST DISTRIBUTION OF WEBKB AND CONFERENCE DATASETS FOR BINARY CLASS CLASSIFICATION

Class         Train (Relevant / Non-relevant)   Test (Relevant / Non-relevant)
Course        846 / 2822                        86 / 942
Project       840 / 2822                        26 / 942
Student       1485 / 2822                       43 / 942
Faculty       1084 / 2822                       42 / 942
Conference    618 / 1159                        206 / 386
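The test sets in Table I are strongly imbalanced, which is one reason an F-measure is more informative than plain accuracy here. The short Python sketch below illustrates this with the Project test set (26 relevant, 942 non-relevant pages). It assumes the balanced F1 form of the F-measure, and the all-negative classifier is a hypothetical baseline introduced only for illustration; it is not a result from this study.

```python
def f_measure(tp, fp, fn):
    """Balanced F-measure (harmonic mean of precision and recall) for the relevant class."""
    if tp == 0:
        return 0.0  # no relevant page was found, so precision/recall give no credit
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(tp, fp, fn, tn):
    """Fraction of all pages that were labeled correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


# Project test set from Table I: 26 relevant and 942 non-relevant pages.
relevant, non_relevant = 26, 942

# Hypothetical baseline that labels every page as non-relevant.
acc = accuracy(tp=0, fp=0, fn=relevant, tn=non_relevant)
f1 = f_measure(tp=0, fp=0, fn=relevant)
print(f"accuracy={acc:.3f}, F-measure={f1:.3f}")  # prints accuracy=0.973, F-measure=0.000
```

Such a baseline looks accurate while finding no relevant page at all, whereas its F-measure for the relevant class is zero.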

C. Feature Extraction and Selection
For each dataset, the features were extracted by taking the stemmed terms that are not stopwords from the <title> tags and URLs of the positive (i.e., relevant) Web pages in the training set. As an example, the Course dataset has 305 features extracted from the <title> tag, as shown in Table II. We used the features extracted from the <title> tags and URLs of the Web pages because in our previous studies [4, 5, 10] we observed that the titles and URL addresses of Web pages contain important features.
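The extraction step can be sketched in Python as below. The sketch is illustrative only: the stopword list is a tiny made-up sample, crude_stem is a placeholder and not the stemmer used in this study, L2 normalization is an assumption (the normalization method is not specified here), and the function names are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list; a real list would be much longer.
STOPWORDS = {"the", "of", "and", "a", "to", "in", "for", "www", "http", "https"}


def crude_stem(word):
    """Placeholder stemmer that strips a few suffixes (not the stemmer used in the paper)."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def terms(text):
    """Lowercase, split into alphabetic tokens, drop stopwords, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]


def build_vectors(pages, positive_flags):
    """Return (vocabulary, normalized document vectors, document frequencies)."""
    # Features come only from the positive (relevant) pages.
    vocab = sorted({t for page, pos in zip(pages, positive_flags) if pos
                    for t in terms(page)})
    vectors, df = [], Counter()
    for page in pages:
        counts = Counter(terms(page))
        vec = [float(counts[t]) for t in vocab]      # occurrences of each feature
        norm = math.sqrt(sum(v * v for v in vec))    # assumed L2 normalization
        vectors.append([v / norm for v in vec] if norm else vec)
        df.update(t for t in vocab if counts[t])     # pages containing the feature
    return vocab, vectors, dict(df)


if __name__ == "__main__":
    titles = ["CS 101 Course Syllabus", "Courses and Exams", "Faculty of Engineering"]
    vocab, vectors, df = build_vectors(titles, [True, True, False])
    print(vocab, df)
```

The df values returned here are what the selection step would use as initial light intensities.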

TABLE IV. PERFORMANCE OF THE PROPOSED FA BASED FEATURE SELECTION ALGORITHM FOR FEATURES EXTRACTED FROM TITLE TAGS

TABLE II. NUMBER OF FEATURES EXTRACTED FOR ALL CLASSES

Class         <title> tag   URL
Course        305           479
Project       596           686
Student       1987          1557
Faculty       1502          1208
Conference    890           1115

After the feature extraction step, the FA-based feature selection process was performed for each dataset. After selecting the best n features by using our FA-based algorithm, test (unseen) Web pages were classified with respect to the selected n features by using the J48 classifier of Weka [14]. The proposed algorithm was run with different n values, chosen with respect to our previous studies [4, 5, 10]. The selected n values are 10, 30, 50, 60, and 100.

D. Results
In this study, each firefly chooses a predefined number of features. The performance of the proposed FA-based method with respect to F-measure and run time can be seen in Table III and Table IV.

TABLE III. PERFORMANCE OF THE PROPOSED FA BASED FEATURE SELECTION ALGORITHM FOR FEATURES EXTRACTED FROM URL
URL TAGS Average F-Measure Values and Run Times
# of Features: 10, 30, 50, 60, 100, All features

Title TAGS Average F-Measure Values and Run Times
Course      Project     Faculty     Student     Conference
0.942       0.995       0.855       0.585       0.978