European Journal of Scientific Research ISSN 1450-216X Vol.27 No.4 (2009), pp.620-627 © EuroJournals Publishing, Inc. 2009 http://www.eurojournals.com/ejsr.htm

Link Processing for Fuzzy Web Pages Clustering and Classification

Amir Masoud Rahmani
Islamic Azad University, Science and Research Branch, Tehran, Iran

Zahra Hossaini
Islamic Azad University, Science and Research Branch, Tehran, Iran

Saeed Setayeshi
Islamic Azad University, Science and Research Branch, Tehran, Iran

Abstract
Clustering and classification are two ways of arranging objects into related groups according to their similarities, so that different groups have different characteristics. The large volume of web pages on the World Wide Web needs to be arranged in a way that is convenient for users, and clustering is an efficient way of grouping them. Fuzzy logic, as a flexible method, can be used to find similarities by means of membership functions. For web page clustering and classification, fixed-size vectors of words and weights are usually extracted from the HTML form of web pages. Such vectors are ordinarily long and take a long time to process. To avoid this, words are gathered here from the <a> tag, which gives shorter vectors that retain most of the useful information obtainable from other parts of the page. These vectors have variable size. The method gives acceptable clusters and almost precise classes, with a precision rate of 89.76%, and also reduces processing time.

Keywords: Clustering, Classification, Fuzzy, <a> tag

1. Introduction

Clustering is an unsupervised method that finds hidden relations between data and arranges them in internally related groups, whereas classification is a supervised method that groups data so that more similar elements fall in the same group [1], [2]. The excessive number of web documents needs to be ordered into related groups to make them easier to use. Clustering and classification have long been used, in different approaches, to order uncategorized data. K-means is a popular clustering method, but it needs to know the number of clusters a priori [8], [10]. Most clustering approaches use fixed-size vectors in their classifier [12], [13], which costs a long time for processing unnecessary elements that have zero weight in the representative vector. To avoid this problem, variable-sized vectors are used in this application, so the processing time of useless elements is bypassed.


A web page is usually represented by a vector structure extracted from the HTML form of the document [13]. The vector consists of a series of words, each with a coefficient that is the frequency of the word in that document. The coefficient could be 1, or take a weight according to the tags in which the word appears [11], [13], or even be the size of the document, and so on. This is a good method, but the final vectors are still too long, so in order to have shorter but still effective vectors, the <a> tag is used. Since links most often point to web pages whose concepts are close to those of the current page, they can give useful vectors that are shorter but do not decrease discriminating ability. After all, a fast, precise clustering method is needed, and fuzzy logic is a good match. Fuzzy methods are used for clustering in [5], [6], but those works used fixed-size vectors; in [3] variable-size vectors are used, but the cluster centers are too long and the place where a word appears is not taken into account. In this paper, variable-size vectors extracted from the <a> tag are used in a fuzzy clustering method, which gives acceptable clusters and a low percentage of misclassified pages. A data dictionary is used to avoid unrelated words, such as those that come in advertisement links. The following sections explain the algorithm in detail. Section 2 explains related concepts, Section 3 describes the algorithm and the steps of the method, Section 4 presents experimental results, and Section 5 concludes the paper.

2. Related Concepts

2.1. K-means and Clustering

In pattern recognition, a group of related data is called a cluster. This kind of relationship is usually specified by a distance function. A data point belongs to a group when it has the minimum distance to the center of that group. The K-means algorithm is a simple and effective method of clustering, and most other methods are inspired by it. The algorithm starts with a random choice of a data point as the first cluster center, then finds the distances of the other data points from this center and places them inside or outside the cluster according to their distance from the center. The outer data points are considered as new centers, and the algorithm continues with those new centers until some condition is met, such as a time limit or a limit on the number of iterations.

2.2. Data Dictionary

A data dictionary is a pool of words that prevents words unrelated to a class from being used. It is useful for eliminating irrelevant words, such as those that appear in advertisements and in parts of the page that contain other links. In this process a sub-dictionary is provided for each class. Each sub-dictionary starts with some words related to its category, and during the learning process high-weight words are added to it.

2.3. Fuzzy Logic

A fuzzy system is a rule-based system with a knowledge base inside. A set of if/else rules makes this knowledge base effective for different applications, so defining those rules is an important step. The rules use linguistic terms, so their values have to be defined by membership functions that map real-world values onto them. Fuzzy logic can be used for clustering because the data to be clustered are usually not well distributed, so the structures cannot be defined exactly. In this way, a data point can also be placed in several groups with different membership values.
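The mapping from a real-world value to linguistic terms described in Section 2.3 can be illustrated with a minimal sketch. The triangular shape, the term names, and the breakpoints in `TERMS` are assumptions chosen for illustration; the paper does not specify them at this point.

```python
def triangular(x, left, peak, right):
    """Degree (0..1) to which x belongs to a triangular fuzzy set."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


# Hypothetical linguistic terms for a quantity already scaled to [0, 1].
# Each tuple is (left foot, peak, right foot) of the triangle.
TERMS = {
    "low": (-0.5, 0.0, 0.5),
    "medium": (0.0, 0.5, 1.0),
    "high": (0.5, 1.0, 1.5),
}


def fuzzify(x):
    """Return the membership degree of x in every linguistic term."""
    return {name: triangular(x, *shape) for name, shape in TERMS.items()}


if __name__ == "__main__":
    # A value of 0.25 is partly "low" and partly "medium" at the same time.
    print(fuzzify(0.25))
```

A single value can belong to more than one term with different degrees, which is the property that lets a data point be placed in several clusters at once.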


3. Proposed Method

The clustering method consists of two main parts: the first is the link processor and the second is the fuzzy cluster maker.

Figure 1: Diagram of clustering and classification steps
[Diagram blocks: Web Doc Conversion to Vector form; DATA DICTIONARY; fuzzifying data instance for inference engine; Fuzzy clustering and classification]

Figure 1 shows a simple diagram of the method; each part is explained in this section.

3.1. Link Processor

First of all, a web document has to be converted to a processable structure. A vector is a common and simple form that is used in many applications. Such vectors consist of words and their related weights, both extracted from the HTML form of the document. The weight is usually the number of occurrences of the word in the page (the term frequency, TF), or TF divided by the number of all words in the document, or some other value. It can also be multiplied by a value that reflects the importance of the tag in which the word appears, since words in some tags are more important than words in others. A vector built this way can be too long and take a lot of time to process, so it seems sensible to use one or more tags that give a shorter vector that still contains most of the necessary words.

Most of the time, <a> tags make links to other pages that have the same content as the current page or give extra information about some important concepts, and their structure carries valuable information about the subject of the link. By extracting this kind of information, a shorter but still useful vector can be constructed without a perceptible decrease in precision.

A link can appear in several ways: as an inner anchor reached through a text, as a link to another page through a text, or as a link through a picture.

1. This text is a link to a part on this page.
2. This text is a link to a page on the World Wide Web.
3. A picture used as a link, with an alt="…" attribute.

In the first two forms, the linked text shows the subject and reason of the link and contains some useful keywords. In the third form, alt="…" holds extra information about the subject of the link. In addition, the <a> tag can have explanatory properties such as name, title or id, which can give many keywords.
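The keyword sources listed above (anchor text, the name, title, id and rel properties, and the alt text of a linked picture) can be collected with Python's standard `html.parser` module. This is only a sketch: the class name `LinkTextExtractor`, the `(source, text)` output format, and the sample markup are assumptions made for illustration, not the authors' implementation.

```python
from html.parser import HTMLParser


class LinkTextExtractor(HTMLParser):
    """Collect keyword sources from <a> tags: attributes, anchor text, image alt."""

    ATTRS = ("title", "name", "id", "rel")

    def __init__(self):
        super().__init__()
        self.depth = 0    # greater than 0 while inside an <a> element
        self.found = []   # (source, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self.depth += 1
            for key in self.ATTRS:
                if attrs.get(key):
                    self.found.append((key, attrs[key]))
        elif tag == "img" and self.depth and attrs.get("alt"):
            self.found.append(("alt", attrs["alt"]))

    def handle_endtag(self, tag):
        if tag == "a" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Only text between <a> and </a> is kept; the rest of the page is skipped.
        if self.depth and data.strip():
            self.found.append(("text", data.strip()))


# Hypothetical page fragment showing the three link forms.
SAMPLE = (
    '<a href="#part" name="intro">Fuzzy clustering</a> '
    '<a href="http://example.org/" title="Web mining">more pages</a> '
    '<a href="http://example.org/pic"><img src="p.png" alt="cluster diagram"></a>'
    '<p>text outside links is ignored</p>'
)

parser = LinkTextExtractor()
parser.feed(SAMPLE)
links = parser.found

if __name__ == "__main__":
    for source, text in links:
        print(source, "->", text)
```

Keeping the source of each piece of text alongside it makes it possible to weight words later according to the part of the tag they came from.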


After the keywords are found, the TF of each word is calculated and weighted. The weight of word/term i (t_i) in document j is defined as follows:

\[
W_{ij} =
\begin{cases}
\alpha & \text{if } t_i \text{ is in the title or name part of the \texttt{<a>} tag} \\
\beta & \text{if } t_i \text{ is in the id part of the \texttt{<a>} tag} \\
\gamma & \text{if } t_i \text{ is in the rel part of the \texttt{<a>} tag} \\
1 & \text{if } t_i \text{ is in the descriptive part (>…<) of the \texttt{<a>} tag}
\end{cases}
\tag{1}
\]

For example, suppose the word "fuzzy" appears three times in the title or name part and 5 times in the descriptive part; its weight then becomes 3*α + 5.

Before the TFs are found, a data dictionary can be used to eliminate unrelated words. The data dictionary is constructed from the documents in a training set and is completed during the extraction process.

After this, all weighted TFs are normalized. Suppose p different words have been extracted from a document, r_i is the number of appearances of word i in that document (the word redundancy, TF), and R is the vector of the r_i. Then Norm(r_i) is defined as:

\[
\mathrm{Norm}(r_i) = \frac{r_i - \min(R)}{\max(R) - \min(R)}, \qquad i = 1, \dots, p
\tag{2}
\]

The final vector is then ready to enter the fuzzy part to be clustered. The vector structure is shown in Figure 2.

Figure 2: Document vector structure after analyzing

  t_1j   t_2j   …   t_pj
  W_1j   W_2j   …   W_pj
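Equations 1 and 2 can be traced with a short sketch. The numeric values in `PART_WEIGHTS` are assumptions, since the paper does not state α, β or γ here; applying equation 2 to the weighted TFs, and returning 1.0 for every word when all weights are equal, are also assumptions of this sketch.

```python
from collections import Counter

# Assumed values for alpha (title/name), beta (id) and gamma (rel).
# The descriptive part ("text") has weight 1 as in equation 1.
PART_WEIGHTS = {"title": 3.0, "name": 3.0, "id": 2.5, "rel": 2.0, "text": 1.0}


def weighted_tf(occurrences, dictionary=None):
    """occurrences: iterable of (part, word) pairs; returns {word: weighted TF}."""
    weights = Counter()
    for part, word in occurrences:
        word = word.lower()
        if dictionary is not None and word not in dictionary:
            continue  # the data dictionary drops unrelated words
        weights[word] += PART_WEIGHTS[part]
    return dict(weights)


def min_max_normalise(weights):
    """Equation 2 applied to the weighted TFs of one document."""
    if not weights:
        return {}
    lo, hi = min(weights.values()), max(weights.values())
    if hi == lo:
        return {w: 1.0 for w in weights}  # assumption: avoids division by zero
    return {w: (v - lo) / (hi - lo) for w, v in weights.items()}


# "fuzzy" three times in the title part and five times in the descriptive part.
occurrences = (
    [("title", "fuzzy")] * 3
    + [("text", "fuzzy")] * 5
    + [("id", "cluster")]
    + [("text", "web")] * 2
    + [("text", "sale")]  # e.g. a word from an advertisement link
)
raw = weighted_tf(occurrences, dictionary={"fuzzy", "cluster", "web"})
norm = min_max_normalise(raw)

if __name__ == "__main__":
    print(raw)
    print(norm)
```

With α = 3.0 assumed, "fuzzy" gets the raw weight 3*α + 5 = 14. Note that min-max scaling maps the lowest-weighted word of a document to 0.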

The length of each vector can vary, so words that are related to a category but do not occur in the document need not be represented.

3.2. Clustering and Classification by Fuzzy Logic (CCFL)

K-means, as a method of clustering, chooses one or more random data points as its first cluster centers and then finds the similarity between the other data points and these centers. A data point is placed in the cluster whose center it is most similar to (or has the minimum distance from). If the maximum similarity value that the point achieves is smaller than a threshold, the point becomes a center in the next run. The similarity value can be treated as a fuzzy membership value, because it can be regarded as an average function with values between zero and one, and it can be written in the form of fuzzy if/else rules.

Suppose there are n web pages in the data set, d_i is the i-th document in vector form, which has m_i terms, and c_j is the j-th center, which has l_j terms. The distance between these two vectors is then given by equation 3 [11]:

\[
\mathrm{dist}(d_i, c_j) = 1 - \left[ \frac{n_c \left( \sum_{k=1}^{m_i} x(t_k)\,\mu(t_k) \right)^{r}}{l_j} \right]
\tag{3}
\]

where n_c is the number of common words between document i and center j, x(t_k) is the importance degree of the k-th term in document i, μ(t_k) is the frequency of word t_k in cluster c_j, and r > 0. Equation 3 gives an S-shaped membership function for the output values, so the rules can be written in this way:

• If n_c/l_j = high and x(t_k) = high and μ(t_k) = high then distance is very low.
• If n_c/l_j = high and x(t_k) = high and μ(t_k) = medium then distance is low.
…

• If n_c/l_j = low and x(t_k) = low and μ(t_k) = low then distance is very high.

The question now is what x(t) and μ(t) are. Suppose that the m-th term in document i is t and has frequency f_t. Then the importance degree of t for cluster j is:

\[
x(t) =
\begin{cases}
\left( f_t / f_{avg,j} \right)^{p}, & \text{if } f_t < f_{avg,j} \\
\dots \\
\left( f_{avg,j} / f_t \right)^{p}, & \text{if } f_t \ge f_{avg,j},\ f_{max,j} = f_{avg,j}
\end{cases}
\tag{4}
\]

This is a triangular-shaped membership function where p>0, 0
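The distance of equation 3 and the threshold rule of Section 3.2 can be sketched as follows. The sketch makes several assumptions: x(t) and μ(t) are passed in as precomputed numbers rather than derived from equation 4, terms missing from a center contribute nothing to the sum, centers are non-empty, and the names `distance` and `assign` and the sample values are hypothetical.

```python
def distance(doc, center, r=1.0):
    """Equation 3: 1 - n_c * (sum_k x(t_k) * mu(t_k)) ** r / l_j.

    doc:    {term: x(t)}   importance degrees of the document's terms
    center: {term: mu(t)}  term values of the cluster center (non-empty)
    """
    common = doc.keys() & center.keys()
    n_c = len(common)   # number of words shared by document and center
    l_j = len(center)   # number of terms in the center
    # Terms absent from the center add nothing to the sum (assumed mu = 0).
    total = sum(doc[t] * center[t] for t in common)
    return 1.0 - n_c * total ** r / l_j


def assign(doc, centers, threshold, r=1.0):
    """Index of the most similar center, or None if the best similarity is
    below the threshold, in which case the document would seed a new center."""
    best, best_sim = None, float("-inf")
    for j, center in enumerate(centers):
        sim = 1.0 - distance(doc, center, r)
        if sim > best_sim:
            best, best_sim = j, sim
    return best if best_sim >= threshold else None


# Hypothetical document and centers.
doc = {"fuzzy": 1.0, "cluster": 0.5, "web": 0.2}
centers = [
    {"fuzzy": 0.4, "cluster": 0.2, "logic": 0.1, "set": 0.1},
    {"football": 0.5, "league": 0.5},
]

if __name__ == "__main__":
    print([distance(doc, c) for c in centers])
    print(assign(doc, centers, threshold=0.2))
    print(assign(doc, centers, threshold=0.5))
```

Because documents and centers are stored as dictionaries keyed by term, only the shared terms are visited, which reflects the motivation for variable-size vectors given in the introduction.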
