tag. With this approach, the resulting vector can become very long and expensive to process, so it is sensible to restrict extraction to the tag(s) that yield a shorter vector while still retaining most of the necessary words. Tags most often create links to other pages that either cover the same content as the current page or provide extra information about important concepts, and the structure of these link tags carries valuable information about the subject of the link. By extracting this kind of information, a shorter but still useful vector can be constructed without a perceptible loss in the precision of the process. A link can appear in several forms: an inner anchor to a part of the same page, a text link to another page, or a link attached to a picture. For example: 1. a text that links to a part of the current page; 2. a text that links to another page on the World Wide Web; 3. a picture that links to another page.
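A rough sketch (not the authors' implementation) of this extraction step is given below; the decision to read the title, name, id, and rel attributes plus the anchor text follows the tag parts weighted in equation (1) below, while the class and variable names are illustrative.

```python
# Sketch only: pull words out of <a> tags, recording which part of the tag
# they came from, since equation (1) weights words by tag part.
# Uses only the Python standard library.
from html.parser import HTMLParser

class LinkWordExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self._in_anchor = False
        self.words = []                     # list of (word, part) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            for name, value in attrs:
                if name in ("title", "name", "id", "rel") and value:
                    for w in value.split():
                        self.words.append((w.lower(), name))

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False

    def handle_data(self, data):
        if self._in_anchor:                 # descriptive part between > and <
            for w in data.split():
                self.words.append((w.lower(), "descriptive"))

# Toy usage
ex = LinkWordExtractor()
ex.feed('<a href="p2.html" title="fuzzy clustering" rel="related">fuzzy web pages</a>')
print(ex.words)
# [('fuzzy', 'title'), ('clustering', 'title'), ('related', 'rel'),
#  ('fuzzy', 'descriptive'), ('web', 'descriptive'), ('pages', 'descriptive')]
```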
So after the keywords are found, the TF of each word is calculated and weighted. The weight of word/term i (ti) in document j is defined as follows:

Wij = α,  if ti is in the title or name part of the tag
      β,  if ti is in the id part of the tag
      γ,  if ti is in the rel part of the tag
      1,  if ti is in the descriptive part (>…<) of the tag        (1)

For example, suppose the word "fuzzy" appears three times in the title/name part and five times in the descriptive part; its weight then becomes 3*α + 5. Before the TFs are computed, a data dictionary can be used to eliminate unrelated words. The data dictionary is constructed from the documents of a training set and is completed during the extraction process. After this step, all weighted TFs are normalized. Suppose p different words have been extracted from a document, ri is the number of appearances of word i in that document (the word redundancy, or TF), and R is the vector of all ri. Then norm(ri) is defined as:

Norm(ri) = (ri - min(R)) / (max(R) - min(R)),   i = 1..p        (2)

The final vector is then ready to enter the fuzzy part and be clustered; its structure is shown in Figure 2.

Figure 2: Document vector structure after analysis

t1j   W1j
t2j   W2j
…     …
tpj   Wpj
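The weighting and normalization steps above can be sketched as follows; the part weights standing in for α, β, γ and the word/part pairs in the example are illustrative assumptions, not values from the paper.

```python
# Sketch of equations (1) and (2): weight each extracted word by the tag
# part it came from, accumulate weighted TFs, then min-max normalize.
# The part weights below are arbitrary example values.
PART_WEIGHT = {"title": 2.0, "name": 2.0,   # alpha
               "id": 1.5,                    # beta
               "rel": 1.2,                   # gamma
               "descriptive": 1.0}

def weighted_tf(word_part_pairs):
    """word_part_pairs: iterable of (word, part) as produced in the
    extraction step; returns {word: weighted term frequency}."""
    tf = {}
    for word, part in word_part_pairs:
        tf[word] = tf.get(word, 0.0) + PART_WEIGHT.get(part, 1.0)
    return tf

def normalize(tf):
    """Equation (2): norm(ri) = (ri - min(R)) / (max(R) - min(R))."""
    lo, hi = min(tf.values()), max(tf.values())
    if hi == lo:                      # degenerate case: all weights equal
        return {w: 1.0 for w in tf}
    return {w: (r - lo) / (hi - lo) for w, r in tf.items()}

# Toy usage: builds the (term, weight) vector of Figure 2
pairs = [("fuzzy", "title"), ("fuzzy", "descriptive"),
         ("clustering", "descriptive"), ("web", "rel")]
vector = normalize(weighted_tf(pairs))
print({w: round(v, 2) for w, v in vector.items()})
# {'fuzzy': 1.0, 'clustering': 0.0, 'web': 0.1}
```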
The length of each vector can vary, so words that are related to a category but do not appear in the document need not be included.

3.2. Clustering and Classification by Fuzzy Logic (CCFL)

K-means, as a clustering method, chooses one or more random data points as its initial cluster centers and then computes the similarity between every other data point and these centers; each data point is placed in the cluster whose center is most similar to it (i.e. at minimum distance). If the maximum similarity a point achieves is smaller than a threshold, that point becomes a new center in the next run. The similarity value can be treated as a fuzzy membership value, because it behaves as a function with values between zero and one and can be written in the form of fuzzy if/else rules. Suppose there are n web pages in the data set, di is the ith document in vector form with mi terms, and cj is the jth center with lj terms. The distance between these two vectors is then given by equation (3) [11]:

dist(di, cj) = 1 - [nc * (Σk=1..mi x(tk) * μ(tk))^r] / lj        (3)

where nc is the number of words common to document i and center j, x(tk) is the importance degree of the kth term in cluster j, μ(tk) is the frequency of word tk in cluster cj, and r > 0. Equation (3) yields an S-shaped membership function for the output values (a code sketch of this distance computation is given after equation (4) below), so rules can be written in the following way:
• If nc/li = high and x(tk) = high and μ(tk) = high then distance is very low.
• If nc/li = high and x(tk) = high and μ(tk) = medium then distance is low.
…
• If nc/li = low and x(tk) = low and μ(tk) = low then distance is very high.

The remaining question is what x(t) and μ(t) are. Suppose the mth term in document i is t and that it has frequency ft; the importance degree of t for cluster j is then:

x(t) = (ft / favg,j)^p ,   if ft < favg,j
       (favg,j / ft)^p ,   if ft ≥ favg,j        (4)

where favg,j and fmax,j denote the average and maximum term frequencies in cluster j. This is a triangular-shaped membership function, where p > 0 and 0 < x(t) ≤ 1.
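As a rough illustration, the sketch below implements the importance degree of equation (4) as reconstructed above; the exponent p and the frequency values used in the example are arbitrary choices, not values from the paper.

```python
# Sketch of equation (4): importance degree x(t) of a term for cluster j,
# peaking at 1 when the term's frequency equals the cluster average.
def importance_degree(f_t, f_avg_j, p=2.0):
    """f_t: frequency of term t; f_avg_j: average term frequency in
    cluster j; p > 0 controls the steepness of the triangular shape."""
    if f_avg_j <= 0 or f_t <= 0:
        return 0.0                      # guard for empty / unseen terms
    if f_t < f_avg_j:
        return (f_t / f_avg_j) ** p
    return (f_avg_j / f_t) ** p

# Toy usage: the closer f_t is to the cluster average, the higher x(t)
for f in (1, 3, 5, 10):
    print(f, round(importance_degree(f, f_avg_j=5.0, p=2.0), 3))
# 1 0.04   3 0.36   5 1.0   10 0.25
```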
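Building on that, the next sketch computes the distance of equation (3) and performs the assign-or-create-center step described for the K-means-style procedure; the function names and the similarity threshold are illustrative assumptions, not the authors' implementation.

```python
# Sketch of equation (3): fuzzy distance between a document vector and a
# cluster center, followed by a simple "assign to nearest center or open
# a new one" step in the spirit of the K-means variant described above.

def fuzzy_distance(doc_terms, centre_terms, x, mu, r=1.0):
    """doc_terms:    {term: weight} for document i (m_i terms)
    centre_terms: {term: weight} for center j (l_j terms)
    x:  {term: importance degree in cluster j}   (equation 4)
    mu: {term: frequency of the term in cluster j}
    Returns dist(d_i, c_j) = 1 - [n_c * (sum_k x(t_k)*mu(t_k))^r] / l_j."""
    common = set(doc_terms) & set(centre_terms)
    n_c = len(common)
    if n_c == 0:
        return 1.0                       # no common words: maximal distance
    s = sum(x.get(t, 0.0) * mu.get(t, 0.0) for t in doc_terms)
    return 1.0 - (n_c * (s ** r)) / len(centre_terms)

def assign(document, centres, x_per_cluster, mu_per_cluster, threshold=0.8):
    """Place the document in the closest cluster; if even the best
    similarity (1 - distance) falls below the threshold, signal that the
    document should seed a new center in the next run."""
    best_j, best_d = None, 1.0
    for j, centre in enumerate(centres):
        d = fuzzy_distance(document, centre, x_per_cluster[j], mu_per_cluster[j])
        if d < best_d:
            best_j, best_d = j, d
    if 1.0 - best_d < threshold:
        return None, best_d              # caller promotes the document to a new center
    return best_j, best_d
```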