Feature Selection Algorithms Based on HTML Tags Importance

Amany M. Sarhan
Computer & Automatic Control Dept., Faculty of Engineering, Tanta University, Egypt
[email protected]

Ghada M. Hamissa
Electrical Engineering Dept., Faculty of Engineering, KFS University, Egypt
[email protected]

Heba E. Elbehiry
Electrical Engineering Dept., Faculty of Engineering, KFS University, Egypt
[email protected]

Abstract—Traditionally in Web crawling, the required features are extracted from the whole contents of HTML pages. However, the position in which a word is located inside the HTML tags indicates its importance in the web page. This research proposes two ideas concerning the Feature Selection stage in HTML web pages. The first idea reduces the features by extracting them only from the important tags in an HTML page in order to achieve faster classification. The second idea gives a weight to each of the important tags. Two algorithms are presented in this paper based on these ideas: (i) the Important HTML Tags Only algorithm, and (ii) the Weighted Important HTML Tags algorithm. The selected features are classified using two well-known classifiers from the literature: the Support Vector Machine (SVM) and the Naïve Bayes Classifier (NBC). The accuracy of each algorithm is computed, and the accuracies of the traditional feature selection method, which uses the whole contents of the HTML page, and of the proposed algorithms are compared. A complete evaluation is performed, which indicates the effectiveness of the proposed technique. The experimental results show improved precision and recall with the proposed algorithms with respect to keyword-based search. The algorithms are implemented in Java and its extended packages.

Keywords—Web Crawling, Web search engine, Feature selection, Support Vector Machine, Naive Bayes Classifier.

I. INTRODUCTION

Networking technologies have led to an increase in the amount of information on the Web, which in turn leads to spending a lot of time searching for a specific subject. Vertical search engines were introduced to overcome such search problems [4]. Search engines, such as Yahoo and Google, search within billions of web pages. With this large number of documents, faster and more accurate search engines should be introduced. Web crawling is the technique that search engines use to collect pages from the Web [14, 20]. The web crawler serves as the first stage in most types of search engines in a


crawl-index-search design methodology. The efficiency of a search engine relies on the performance of the web crawler, as the other two steps build upon its outputs [14]. A web crawler is a program that, given one or more seed URLs, downloads the web pages associated with these URLs, extracts any hyperlinks contained in them, and recursively continues to download the web pages identified by these hyperlinks [14, 20], as shown in Figure 1. Two main tasks are performed in web crawling: (i) feature selection [2], and (ii) web page classification. A feature selection algorithm chooses features that discriminate between samples belonging to different categories. Different feature selection techniques have been proposed for this purpose, such as Document Frequency (DF), Information Gain (IG), and Term Frequency (TF) [10]. The accuracy of web crawling depends heavily on the performance of the feature selection step and the quality of the features selected [14].
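The paper states the implementation is in Java; as a rough illustration of the crawl loop just described (not the authors' code), the sketch below uses a URL frontier and the jsoup library to download pages and extract hyperlinks. The seed URL, class name, and page limit are our own illustrative choices.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.util.*;

public class SimpleCrawler {
    public static void main(String[] args) throws Exception {
        Deque<String> frontier = new ArrayDeque<>(List.of("https://example.org/"));  // seed URLs
        Set<String> visited = new HashSet<>();
        int maxPages = 100;                                  // illustrative crawl limit

        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            if (!visited.add(url)) continue;                 // skip already-downloaded pages
            Document page = Jsoup.connect(url).get();        // download the page
            // hand the page to feature selection / classification here
            for (Element link : page.select("a[href]")) {
                String next = link.absUrl("href");           // resolve relative links
                if (!next.isEmpty() && !visited.contains(next)) frontier.add(next);
            }
        }
    }
}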

Fig. 1: Web crawling architecture

In the web page classification process, two well-known techniques are commonly used: (i) the Support Vector Machine (SVM) [2, 7, 8, 15, 19], and (ii) the Naive Bayes Classifier (NBC) [9, 21].


This is to obtain highly relevant content as fast as possible while avoiding low-quality, irrelevant, and redundant content. However, the crawler must make a tradeoff between scale and selectiveness. We do not want to discard pages that could contain useful content by using a poor feature selection algorithm. The crawler needs to prioritize and focus only on the useful parts of the Web by predicting the probability that a link to a particular page is relevant to a given topic before actually downloading the page. Early focused crawlers did not contain classifiers until the work of Chakrabarti [5, 18], which seeks pages on a specific set of topics that represent a relatively small portion of the Web. The classifier determines the relevance/importance of the crawled pages to decide on link expansion and computes a measure of the centrality of crawled pages to determine their download priorities. The Support Vector Machine (SVM) [7, 8, 19] and the Naive Bayes Classifier (NBC) [9, 21] have been used as classifiers in focused web crawlers due to their popularity in the text categorization field. Text categorization methods were first proposed in the 1950s, when word frequency was used to categorize documents automatically [2, 5, 15]. Machine learning techniques help to reduce the manual effort required for analysis, and the accuracy of such systems is also improved through their use [1, 2].

The first classifier is SVM, a very popular method for binary classification [7, 8, 19]. It takes a set of input data and predicts, for each input, which of two possible categories it belongs to, which makes SVM a non-probabilistic binary linear classifier [7, 19]. More specifically, an SVM constructs a hyperplane or a set of hyperplanes in a high- or infinite-dimensional space, which makes it suitable for tasks such as classification and regression [14, 19]. A good separation is achieved by the hyperplane that has the largest distance to the nearest training data points of any class (the so-called functional margin), since in general the larger the margin, the lower the generalization error of the classifier. Whereas the original problem may be stated in a finite-dimensional space, it often happens that the sets to discriminate are not linearly separable in that space [8]. SVM has two types: (i) the linear Support Vector Machine (linSVM), and (ii) the nonlinear Support Vector Machine (nlinSVM). linSVM is restricted to the dot (linear) kernel, but outputs a high-performance model that only contains the linear coefficients, allowing faster model application. nlinSVM, in contrast, searches for the decision surface that best separates the positive from the negative data in the n-dimensional space, the so-called hyperplane [14, 19].

NBC is the second classifier; it is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions [9, 21]. A more descriptive term for the underlying probability model would be "independent feature model". In simple terms, an NBC assumes that the presence (or absence) of a particular feature of a class (i.e., an attribute) is unrelated to the presence (or absence) of any other feature [9, 21].

HTML has been used for building web pages since the very beginning of the Web. HTML consists of HTML elements, which are denoted by angled brackets and are called tags. The markup tags do not show up on a web page like normal text; instead, they direct the web browser to display the page and its content in a specific manner.
An HTML document normally consists of three main parts [6]: (1) the head, (2) the title, and (3) the body. A heading element briefly describes the topic of the section it introduces.

The major problems in the classification stage are the accuracy on positive web pages and the training time. New procedures are needed to overcome such problems due to the rapid increase in the number of web pages over time. Our research focuses on the HTML elements as a method of choosing the features of the pages. HTML (Hyper Text Markup Language) is the main language used in the design of web pages [6]. Common crawling methods apply the crawling process to all the elements of the HTML web page, handling the web page text as plain text and ignoring the information that HTML tags can provide about the words. One of the most promising approaches that considers the importance of the location of a word in the web document is presented in [20]. This algorithm calculates the word's mixed weight using the information of the HTML tag features, and then mines classification rules based on the mixed weight to classify the web pages. Their results showed an enhancement in performance over traditional associative classification methods.

In this paper, we propose to consider only the parts of the web page that are enclosed by a list of important tags of the HTML web page when applying the crawling process. This is meant to reduce the crawling time, reduce the memory needed for performing the crawling process, and generate web pages that are more relevant to the search topic. Most of us have experienced the overwhelming number of results produced when searching for any topic on the Internet. Many of the retrieved pages may be irrelevant to the search topic; they just contain a large number of words that make them seem relevant. Thus, we believe that if we concentrate on the words enclosed by the important HTML tags, we will achieve the three advantages stated above all at once. We also elaborate this idea by performing the crawling on only the important tags of the HTML web page while giving a weight to each tag according to its importance in the page. We will show, through experimental results, that we can reduce both the time and the memory required without harming the accuracy and recall measures. Two web page classifiers are used with these algorithms: (i) the linear support vector machine (linSVM) and the nonlinear support vector machine (nlinSVM) [16, 19], and (ii) the Naive Bayes classifier (NBC) [21]. The results of the classifiers are compared to determine the performance of the system.

The outline of this paper is as follows. Section II gives the necessary background and related work on HTML, SVM, and NBC. Sections III and IV describe the proposed feature selection algorithms. Section V describes how the classifiers are used with the proposed algorithms. Section VI presents the experimental results and comparisons with other algorithms. The conclusion and further research are presented in Section VII.

II. BACKGROUND AND RELATED WORK

Any successful search engine must be provided with an efficient crawler component for collecting web resources in order to discover the content of the Web. Although the task of web crawling seems very simple, it faces many challenges, such as the large scale of web documents it has to navigate, especially as the content of each page may change within a small time span, and the proper selection of content. To overcome the first problem, search engines should be run on high-speed computers or even on parallel machines. The second problem is handled by performing crawling selectively on a subset of the features rather than all the features.

Important tags are defined in this paper as the tags that carry more information about the web page content. Three tags are considered important in this paper, namely: Title, Bold, and Header fonts (h1, h2, and h3). Commonly, words enclosed by these tags are meant to express their importance in the web page. The following algorithm describes the steps of extracting the important features from the HTML page.

Heading information may be used by user agents, for example, to construct a table of contents for a document automatically. There are six levels of headings in HTML, with H1 as the most important and H6 as the least important [1]. Most previous research treats the web page as plain text, as in ordinary documents, ignoring the possibility of gaining information about the importance of a word through its location in the HTML tags inside the web page. However, in [20], the researchers divided the HTML tags into three classes and calculated the word's mixed weight using the information of the HTML tag features. They mined classification rules based on the mixed weight to classify the web pages. Their experiments showed that the performance of their approach is better than that of traditional associative classification methods.

III. FEATURE SELECTION ALGORITHM BASED ON IMPORTANT HTML TAGS

Algorithm: Feature Extraction of Imp-Tag Algorithm
Input: web pages in the training dataset (P)
Output: keyword lists Title = {}, Header = {}, Bold = {}
For each web page in the training dataset do
    For each word w in P do
        If w is not a stop word then
            If w belongs to a <title> tag then
                Title = Title ∪ stem(w)
            else if w belongs to an <h1>, <h2> or <h3> tag then
                Header = Header ∪ stem(w)
            else if w belongs to a <b> tag then
                Bold = Bold ∪ stem(w)
            end if
        end if
    end for
end for
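As a concrete illustration, the fragment below shows one way the Imp-Tag extraction above could be realized in Java with the jsoup HTML parser; the sample HTML string, class names, and the simple tokenizer are ours, and stop-word removal and stemming are omitted.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.util.*;

public class ImpTagExtractor {
    public static void main(String[] args) {
        String html = "<html><head><title>Football scores</title></head>"
                    + "<body><h1>League results</h1><p>Plain text with a <b>goal</b>.</p></body></html>";
        Document doc = Jsoup.parse(html);

        // words enclosed by the important tags only
        Set<String> title  = words(doc.select("title").text());
        Set<String> header = words(doc.select("h1, h2, h3").text());
        Set<String> bold   = words(doc.select("b").text());

        System.out.println(title);   // [football, scores]
        System.out.println(header);  // [league, results]
        System.out.println(bold);    // [goal]
    }

    private static Set<String> words(String text) {
        Set<String> out = new LinkedHashSet<>();
        for (String w : text.toLowerCase().split("\\W+"))
            if (!w.isEmpty()) out.add(w);    // stop-word filtering and stem(w) would go here
        return out;
    }
}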

The Traditional Tag (Trd-Tag) method scans all the words of the web page during the feature selection step. The feature selection algorithm is applied to all the elements of the HTML page, which takes much time. The following algorithm (Algorithm 1) presents the steps of Traditional Tag feature selection. Although HTML tags are used to build and format web pages, they can also be used to give information about the importance of the words inside a page. Words found in a header or in large fonts in a page can indicate greater relevance to the major topic contained in the page.

IV. FEATURE SELECTION ALGORITHM BASED ON WEIGHTED HTML TAGS

The second idea proposed in this paper is to give a weight to each of the HTML tags according to its importance in the page. Then, a word (w) found in a web page will have different weights according to its location in the HTML tags. We divide the HTML tags into three levels of weights:

Level 1: Tags (Title, H1:H6, Bold), weight factor = 4
Level 2: Tags (anchor), weight factor = 2
Level 3: Tags (body), weight factor = 1

Thus, the weight of a word in the web page not only reflects the degree to which the word identifies the subject of the page, but also represents its capacity for distinguishing between different pages. As a characteristic of web documents, three factors determine the word weight in the web page: (1) the word frequency in the web page; (2) the location feature of the word in the page (i.e., the HTML tag it is included in); and (3) the number of web pages containing the word, which is related to the inverse document frequency. The weighted term frequency is computed as:

Algorithm 1: Feature Extraction of Trd-Tag Algorithm
Input: web pages in the training dataset (P)
Output: keyword lists Title = {}, Header = {}, Bold = {}, Body = {}
For each web page in the training dataset do
    For each word w in P do
        If w is not a stop word then
            If w belongs to a <title> tag then
                Title = Title ∪ stem(w)
            else if w belongs to an <h1>, <h2> or <h3> tag then
                Header = Header ∪ stem(w)
            else if w belongs to a <b> tag then
                Bold = Bold ∪ stem(w)
            else if w belongs to the <body> then
                Body = Body ∪ stem(w)
            end if
        end if
    end for
end for
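Both listings rely on a stop-word test and a stem(w) function. The paper does not specify its stop-word list or stemmer, so the fragment below is only a crude illustrative stand-in; a real system would use a full stop-word list and a Porter-style stemmer.

import java.util.Set;

public class TextNormalizer {
    // tiny illustrative stop-word list; the authors' actual list is not given in the paper
    private static final Set<String> STOP_WORDS =
            Set.of("the", "a", "an", "and", "or", "of", "in", "to");

    public static boolean isStopWord(String w) {
        return STOP_WORDS.contains(w.toLowerCase());
    }

    // crude suffix stripping, standing in for a real stemmer such as Porter's
    public static String stem(String w) {
        String s = w.toLowerCase();
        for (String suffix : new String[] {"ing", "ies", "es", "ed", "s"})
            if (s.length() > suffix.length() + 2 && s.endsWith(suffix))
                return s.substring(0, s.length() - suffix.length());
        return s;
    }
}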

$W(\text{term}) = \text{weighting\_factor}(\text{HTML tag}) \times TF(\text{term})$   (1)

As the number of web pages needed to be processed in the crawling step grows tremendously, it may be more beneficial and quicker to focus on the parts of the web pages that contain indicative information and ignore the other parts. For example, in a web page about sports, we normally find words relevant to sports in the title, in large fonts, or written in bold inside the web page. Therefore, we can focus on these parts of the page instead of going through the whole page. It will certainly take less time; however, this should not harm the accuracy of the crawling process. Our goal in this work is to reduce the size of the datasets and the time of searching by concentrating only on the important features of the web page. Based on [20], the idea of this research came to light. We call this algorithm Important Tags only (Imp-Tag). The Imp-Tag algorithm divides the HTML web page into several categories and treats each category separately. In this work, we propose that instead of processing all the content of the web page in the feature selection phase, which consumes much time, we only consider the words found in the locations of the important tags.


where weighting_factor(HTML tag) is the weight factor of the HTML tag surrounding the word (one of the three weights assigned above) and TF(term) is the frequency of the term in the given document within that HTML tag. For the above algorithms (Trd-Tag, Imp-Tag, and Wght-Tag), we use SVM (linear and nonlinear) and NBC as classification techniques. These two classifiers are widely used because of their flexibility and effective results [7, 9, 19, 21].
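A small Java sketch of the weighted term frequency of Eq. (1), using the three weight levels of Section IV; the tag-to-weight map and method names are our own illustration rather than the authors' code.

import java.util.Map;

public class WeightedTermFrequency {
    // Level 1 = 4 (title, h1..h6, b), Level 2 = 2 (anchor), Level 3 = 1 (body and everything else)
    private static final Map<String, Integer> TAG_WEIGHT = Map.of(
            "title", 4, "h1", 4, "h2", 4, "h3", 4, "h4", 4, "h5", 4, "h6", 4, "b", 4,
            "a", 2);

    /** Weighted frequency of a term: weighting_factor(tag) * TF(term inside that tag). */
    public static int weightedTf(String tag, int tfInsideTag) {
        int weight = TAG_WEIGHT.getOrDefault(tag.toLowerCase(), 1);  // body-level weight by default
        return weight * tfInsideTag;
    }

    public static void main(String[] args) {
        // e.g. the word "energy" appears twice in the title and three times in the body
        int w = weightedTf("title", 2) + weightedTf("body", 3);
        System.out.println(w);   // 4*2 + 1*3 = 11
    }
}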

V. USING SVM AND NAÏVE BAYES CLASSIFIERS WITH THE PROPOSED ALGORITHMS

The Feature Extraction algorithm is the first step in the Feature Selection technique. It extracts the contents of the HTML web page according to the requirements. The Feature Extraction algorithm is modified for each of the three methods (Trd-Tag, Imp-Tag, and Wght-Tag); each of them has its own modified Feature Extraction. The stemming stage is the second step in the Feature Selection

process. In this stage, the source (stem) words of the extracted keywords are obtained to minimize the database. After stemming, the extracted tags are processed to obtain the frequencies of the resulting keywords. The term frequency–inverse document frequency (TF-IDF) technique is the most common weighting method used to describe documents in the Vector Space Model [16]. The TF-IDF function weights each vector component (each relating to a word of the vocabulary) of each document on the following basis [17]. First, it incorporates the word frequency in the document: the more a word appears in a document, the more significant it is estimated to be in that document. In addition, IDF measures how infrequent a word is in the collection; this value is estimated using the whole training text collection at hand [5, 17]. Accordingly, if a word is very frequent in the text collection, it is not considered to be particularly representative of a document. In contrast, if the word is infrequent in the text collection, it is believed to be very relevant to the document. The TF-IDF function is [17]:

$TF\text{-}IDF(\text{term}) = TF(\text{term}) \times \log\!\left(1 + \frac{N}{DF(\text{term})}\right)$   (2)
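A small Java sketch of the TF-IDF weighting of Eq. (2); the method and variable names are ours.

public class TfIdf {
    /**
     * TF-IDF of a term: TF(term) * log(1 + N / DF(term)),
     * where N is the number of documents and DF(term) the number of documents containing the term.
     */
    public static double tfIdf(double tf, int totalDocs, int docsWithTerm) {
        if (docsWithTerm == 0) return 0.0;               // term never occurs in the collection
        return tf * Math.log(1.0 + (double) totalDocs / docsWithTerm);
    }

    public static void main(String[] args) {
        // a term occurring 3 times in a page, appearing in 20 of 1000 collected pages
        System.out.println(tfIdf(3, 1000, 20));          // 3 * log(51) ≈ 11.8
    }
}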

Since the various components (attributes) of the data x are independently distributed, Eq. 5 can be written as [21]:

$\Pr(\text{class} = +1 \mid x) = \dfrac{\prod_{j} p(x_j \mid \text{class} = +1)\; P(\text{class} = +1)}{p(x)}$   (6)

where $x_j$ is the j-th attribute of the vector x, and

$p(x) = p(x \mid \text{class} = +1)\,P(\text{class} = +1) + p(x \mid \text{class} = -1)\,P(\text{class} = -1)$   (7)
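The product in Eq. (6) is typically evaluated in log space to avoid numerical underflow. The sketch below illustrates this; the probability tables and smoothing value are placeholders rather than the paper's trained model.

import java.util.List;
import java.util.Map;

public class NaiveBayesScorer {
    /**
     * Log of the numerator of Eq. (6): log P(class) + sum_j log p(x_j | class).
     * Comparing this value for class = +1 and class = -1 gives the predicted class.
     */
    public static double logScore(double classPrior, Map<String, Double> wordLikelihood,
                                  Iterable<String> pageWords, double unseenWordProb) {
        double score = Math.log(classPrior);
        for (String w : pageWords)
            score += Math.log(wordLikelihood.getOrDefault(w, unseenWordProb));  // smoothed p(x_j | class)
        return score;
    }

    public static void main(String[] args) {
        Map<String, Double> sports = Map.of("goal", 0.05, "league", 0.03);   // placeholder likelihoods
        Map<String, Double> other  = Map.of("goal", 0.001, "league", 0.002);
        List<String> page = List.of("goal", "league", "goal");

        double pos = logScore(0.5, sports, page, 1e-6);
        double neg = logScore(0.5, other,  page, 1e-6);
        System.out.println(pos > neg ? "+1 (relevant)" : "-1 (not relevant)");
    }
}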

VI. EXPERIMENTAL RESULTS

In this paper, we used standard datasets that are commonly used in several web crawling studies [10]. Each dataset consists of four categories: basic materials, course, energy, and technology. A comparison of the performance of the Trd-Tag, Imp-Tag, and Wght-Tag feature extraction algorithms, when using SVM (both linSVM and nlinSVM) and NBC as classifiers, is presented. We have chosen these two classifiers as they are widely used in learning applications because of their flexibility and effective results [9, 21]. In the comparison, we use accuracy as the comparison basis. Accuracy is the overall correctness of the model and is calculated as the sum of correct classifications divided by the total number of classifications. The accuracy is computed as [13]:


where TF(term) is the frequency of the term in the given document (or the frequency of a weighted term in the given document), N is the total number of documents in the collection, and DF(term) is the number of documents that contain the term.

After the Feature Selection stage, the extracted datasets of the Trd-Tag, Imp-Tag, and Wght-Tag methods are classified using linSVM, nlinSVM, and NBC. The linSVM classifier operates in two stages: the training stage and the testing stage [16]. Suppose we have N training data points {(x1, y1), (x2, y2), ..., (xN, yN)}, where $x_i \in \mathbb{R}^d$ and $y_i \in \{+1, -1\}$. In our work, each training observation for linSVM is denoted as a pair {xi, yi}, where the xi are the feature vectors produced by the proposed algorithm and the yi are the corresponding class labels. The maximal-margin separating hyperplane cannot be calculated directly; however, it can be obtained by solving the following optimization problem [11]:

$\min_{w,b}\ \tfrac{1}{2}\lVert w \rVert^{2} \quad \text{subject to} \quad y_i\,(w \cdot x_i - b) \ge 1$   (3)
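For completeness, once w and b are obtained from Eq. (3), a new page vector x is assigned to the positive or negative class by the standard SVM decision rule:

$f(x) = \operatorname{sign}(w \cdot x - b)$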

$\text{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$   (8)

where TP = True Positive, FN = False Negative, FP = False Positive, and TN = True Negative. We also compute the time required to perform the feature extraction algorithms to show that they decrease the time of the whole classification process.

Experiment 1: Comparing the time of the three feature extraction algorithms

In this experiment, the time taken to perform the classification with the two proposed feature extraction algorithms, the weighted tag (Wght-Tag) feature selection algorithm and the important tags only (Imp-Tag) feature selection algorithm, is compared to the time taken with the traditional feature selection algorithm (Trd-Tag). SVM (both linSVM and nlinSVM) and NBC are used as classifiers after the three algorithms. From the results in Table 1, the time taken to perform the classification is reduced greatly by the important tags only (Imp-Tag) feature selection algorithm, as we only consider some of the tags rather than the whole page. The same extraction is performed by the weighted tag (Wght-Tag) feature selection algorithm, but multiplying each term by a weight eliminates this benefit and makes its time equal to the time taken by the traditional feature extraction algorithm, which scans all the tags.
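A small Java sketch of Eq. (8), together with the precision and recall mentioned in the abstract, computed from confusion-matrix counts; the counts in main are illustrative only.

public class ClassificationMetrics {
    public static double accuracy(int tp, int tn, int fp, int fn) {
        return (double) (tp + tn) / (tp + tn + fp + fn);      // Eq. (8)
    }
    public static double precision(int tp, int fp) { return (double) tp / (tp + fp); }
    public static double recall(int tp, int fn)    { return (double) tp / (tp + fn); }

    public static void main(String[] args) {
        // illustrative counts for one category
        int tp = 90, tn = 80, fp = 10, fn = 20;
        System.out.printf("accuracy=%.2f precision=%.2f recall=%.2f%n",
                accuracy(tp, tn, fp, fn), precision(tp, fp), recall(tp, fn));
    }
}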


where b is the bias and w is the weight vector of the model. The previous calculations lead to the output of the linSVM classifier, which is the set of required positive pages. In the nonlinear Support Vector Machine (nlinSVM) classifier, the points (xi, yi) in the input space are not linearly separable, so they are transformed into a high-dimensional space [12]. The maximal-margin separating hyperplane in the high-dimensional space is found by solving the following equation:



$\max_{\alpha}\ \mathcal{L}_D \equiv \sum_{i} \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, \phi(x_i)\cdot\phi(x_j)$   (4)

subject to: $0 \le \alpha_i \le C, \qquad \sum_{i} \alpha_i y_i = 0$

where $\mathcal{L}_D$ is the dual Lagrangian for the training dataset, $\phi(\cdot)$ is the transformation to the high-dimensional space implied by the kernel function, C is the box constraint that penalizes training errors, and the $\alpha_i$ are the Lagrange multipliers. The kernel functions are given in [18].

In NBC, the data x is assumed to be produced from a probability distribution p(x | class = +1) if the class is positive. Using the Bayes formula, the posterior probability is written as follows [21]:

$\Pr(\text{class} = +1 \mid x) = \dfrac{p(x \mid \text{class} = +1)\, P(\text{class} = +1)}{p(x)}$   (5)
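The kernel functions referred to above replace the inner product $\phi(x_i)\cdot\phi(x_j)$ in Eq. (4); two standard examples (listed here for reference rather than quoted from [18]) are the polynomial and RBF kernels:

$K(x_i, x_j) = (x_i \cdot x_j + 1)^d \quad \text{(polynomial)}, \qquad K(x_i, x_j) = \exp\!\left(-\gamma \lVert x_i - x_j \rVert^{2}\right) \quad \text{(RBF)}$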

TABLE 1: TIME TAKEN (IN SECONDS) FOR TRD-TAG, WGHT-TAG, AND IMP-TAG

                          linSVM    nlinSVM    NBC
Trd-Tag and Wght-Tag        19         39       11
Imp-Tag                     10         15        8

Experiment 2: Using linear SVM with the feature selection algorithms

In this experiment, each of the datasets extracted using the traditional feature selection algorithm (Trd-Tag), the proposed important tags only feature selection algorithm (Imp-Tag), and the proposed weighted tags feature selection algorithm Wght-


Finally, in this experiment, each of the datasets extracted using the traditional feature selection algorithm (Trd-Tag), the proposed important tags only feature selection algorithm (Imp-Tag), and the proposed weighted tags feature selection algorithm (Wght-Tag) is classified using NBC. Fig. 4 presents the accuracy of the NBC classifier for the extracted datasets when using the Trd-Tag, proposed Imp-Tag, and proposed Wght-Tag algorithms on the four datasets.

Tag algorithms are classified using the linear Support Vector Machine (linSVM) classifier. To evaluate the performance of the proposed feature selection algorithms (the proposed Important Tags only and the proposed Weighted Tags), we conducted experiments to compare their performance with that of the traditional feature selection algorithm. Fig. 2 presents the accuracy of the linSVM classifier for the extracted datasets when using the Trd-Tag, proposed Imp-Tag, and proposed Wght-Tag algorithms on the four datasets: basic materials, course, energy, and technology. According to Fig. 2, the proposed weighted tag (Wght-Tag) feature selection algorithm achieves the best accuracy among the three algorithms. The proposed important tags only (Imp-Tag) feature selection algorithm is better than the traditional feature selection algorithm but has lower accuracy than the proposed weighted tag (Wght-Tag) feature selection algorithm. Hence, more accurate results are generated by the proposed algorithms.

Fig. 4: Accuracy of Trd-Tag, proposed Imp-Tag, and proposed Wght-Tag algorithms with NBC

Experiment 3: Using nonlinear SVM with the feature selection algorithms

In this experiment, each of the datasets extracted using the traditional feature selection algorithm (Trd-Tag), the proposed important tags only feature selection algorithm (Imp-Tag), and the proposed weighted tags feature selection algorithm (Wght-Tag) is classified using the nonlinear Support Vector Machine (nlinSVM) classifier. Fig. 3 presents the accuracy of the nlinSVM classifier for the extracted datasets when using the Trd-Tag, proposed Imp-Tag, and proposed Wght-Tag algorithms on the four datasets.

From Fig. 4, it is noticed that the accuracy percentages of all the algorithms (Trd-Tag, Imp-Tag, and Wght-Tag) are close to each other. The overall results of this simulation show that:

i) The linear support vector machine (linSVM) achieves higher accuracy with the proposed weighted tag (Wght-Tag) feature selection algorithm than with the proposed important tags only (Imp-Tag) feature selection algorithm.
ii) The nonlinear support vector machine (nlinSVM) gives more accurate performance than the linear support vector machine (linSVM) by approximately 6%.
iii) The nonlinear support vector machine (nlinSVM) is more accurate than the Naïve Bayes Classifier (NBC) by approximately 4%.
iv) The time taken by any of the classifiers with the important tags only (Imp-Tag) feature selection algorithm is less than with the traditional or the weighted tag (Wght-Tag) feature selection algorithm.

Fig. 2: Accuracy of Trd-Tag, proposed Imp-Tag, and proposed Wght-Tag algorithms with linSVM

VII. CONCLUSION

In this paper, we proposed two algorithms for crawling HTML pages (through the Feature Selection process): (a) the Imp-Tag algorithm, and (b) the Wght-Tag algorithm. The proposed Imp-Tag algorithm has two targets. The first target is reducing the extracted keywords of the HTML pages, which leads to saving memory space. The second target is saving the time of searching for and extracting the keywords. The proposed Wght-Tag algorithm, on the other hand, aims to obtain more accurate classification of web pages. We used three classifiers, linSVM, nlinSVM, and NBC, with the proposed algorithms in order to choose the best of them. From the experimental results, it is concluded that the nlinSVM classifier gives the best accuracy; accuracy is increased by approximately 5% with the Imp-Tag algorithm and by approximately 7% with the Wght-Tag algorithm. This means that the keywords extracted from the HTML web pages using the Wght-Tag algorithm are fewer and more accurate than those of the Trd-Tag and Imp-Tag algorithms.

According to Fig. 3, the proposed weighted tag (Wght-Tag) feature selection algorithm has better accuracy than the proposed important tags only (Imp-Tag) feature selection algorithm in two categories (course and technology) and equal accuracy in the other categories. Both of them are better than the traditional feature selection algorithm in all categories.

Fig. 3: Accuracy of proposed Imp-Tag and proposed Wght-Tag algorithms with nlinSVM

VIII. REFERENCES

[1] C. Aggarwal and C. Zhai, "A Survey of Text Classification Algorithms," in Mining Text Data, Springer-Verlag, ch. 6, pp. 163-222, Jan. 2012.

Experiment 4: Using Naive Bayes with the feature selection algorithms

[2] H. Alshalabi, S. Tiun, N. Omar, and M. Albared, "Experiments on the Use of Feature Selection and Machine Learning Methods in Automatic Malay Text Categorization," Proceedings of the 4th International Conference on Electrical Engineering and Informatics, Malaysia, pp. 748-754, 24-25 June 2013.
[3] CMU World Wide Knowledge Base (Web->KB) project, http://www.cs.cmu.edu/webkb.
[4] K. Curran and J. Mc-Glinchey, "Vertical Search Engines," ITB Journal, Issue 16, Dec. 2007.
[5] A. Elyasir and K. Anbananthen, "Focused Web Crawler," Proceedings of the International Conference on Information and Knowledge Management (ICIKM), Kuala Lumpur, Malaysia, vol. 45, p. 149, 24-26 July 2012.
[6] HTML Basics, http://www.austincc.edu/hr/profdev/eworkshops/docs/HTML_Basics.pdf
[7] P. Hoffman, "Support Vector Machines," June 2010.
[8] T. Hocking, "Implementing linear SVM using quadratic programming," e-book, Nov. 2012.
[9] E. Karabulut and S. Özel, "A Comparative Study on the Effect of Feature Selection on Classification Accuracy," Proceedings of the First World Conference on Innovation and Computer Sciences (INSODE 2011), Procedia Technology, vol. 1, Istanbul, Turkey, pp. 323-327, 2-4 Sept. 2011.
[10] A. Khan, B. Baharudin, L. Lee, and K. Khan, "A Review of Machine Learning Algorithms for Text-Documents Classification," Journal of Advances in Information Technology, vol. 1, no. 1, pp. 4-20, Feb. 2010.
[11] V. Korde and C. Mahender, "Survey of Text Classification and Classifiers," International Journal of Artificial Intelligence & Applications (IJAIA), vol. 3, no. 2, 2012.
[12] S. Mayor and B. Pant, "Document Classification Using Support Vector Machine," International Journal of Engineering Science and Technology (IJEST), vol. 4, no. 4, 2012.
[13] M. Najork, "Web Crawler Architecture," in Encyclopedia of Database Systems, Springer-Verlag, pp. 3462-3465, Sept. 2009.
[14] G. Pant and P. Srinivasan, "Learning to Crawl: Comparing Classification Schemes," ACM Transactions on Information Systems, vol. 23, no. 4, pp. 430-462, 2005.
[15] I. Pilászy, "Text Categorization and Support Vector Machines," 2005, http://conf.uni-obuda.hu/mtn2005/Pilaszy.pdf.
[16] Predictive Analytics Reimagined, http://www.rapidminer.com.
[17] J. Ramos, "Using TF-IDF to Determine Word Relevance in Document Queries," Department of Computer Science, Rutgers University, Piscataway, NJ.
[18] T. Verma, "Automatic Text Classification and Focused Crawling," International Journal of Emerging Trends & Technology in Computer Science (IJETTCS), vol. 2, no. 1, pp. 88-92, Jan.-Feb. 2013.
[19] M. Welling, "Support Vector Machines," Max Welling's Classnotes in Machine Learning, http://www.ics.uci.edu/~welling/classnotes/classnotes.html.
[20] Li Xingyi, Jun Lan, and Huaji Shi, "Associative Web Document Classification Based on Word Mixed Weight," Proceedings of the IEEE 3rd International Conference on Computer Science and Information Technology (ICCSIT), vol. 3, Chengdu, China.
[21] W. Zhang and F. Gao, "An Improvement to Naive Bayes for Text Classification," Advances in Control Engineering and Information Science, vol. 15, pp. 2160-2164, 2011.

