DEWS2003 2-B-03

On Some Methods for Improving Feature Vectors for Web Pages and their Retrieval Accuracy

Kazunari SUGIYAMA, Kenji HATANO, Masatoshi YOSHIKAWA, and Shunsuke UEMURA

Nara Institute of Science and Technology, 8916–5 Takayama, Ikoma, Nara 630–0192, Japan
Nagoya University, Furo, Chikusa, Nagoya, Aichi 464–8601, Japan

Abstract: In IR (information retrieval) systems based on the vector space model, the tf-idf scheme is widely used to characterize documents. However, in the case of documents with hyperlink structures such as Web pages, it is necessary to develop a technique for representing the contents of Web pages more accurately by exploiting the contents of their hyperlinked neighboring pages. In this paper, we first propose some methods for improving the tf-idf scheme for a target Web page by using the contents of its hyperlinked neighboring pages, and then compare the retrieval accuracy of our proposed methods. Experimental results show that more accurate feature vectors of a target Web page can be generated when the contents of its hyperlinked neighboring pages at levels up to second in the backward direction from the target page are utilized.

Key words: WWW, Information Retrieval, TF-IDF scheme, Hyperlink

1. Introduction
The WWW (World Wide Web) is a useful resource for users to obtain a great variety of information. The coverage of search engines gives a lower bound of three billion Web pages [1], and it is obvious that the number of Web pages continues to grow. Therefore, it is becoming more and more difficult for users to find relevant information on the WWW. Under these circumstances, search engines are one of the most popular means of finding valuable information effectively, and they can be classified into two generations based on their indexing techniques [2]. In the first-generation search engines, developed in the early stages of the Web, only the terms contained in Web pages were utilized as indexes. Therefore, traditional document retrieval techniques were merely applied to Web page search. However, Web pages have peculiar features such as hyperlink structures, and exist in numbers too great to search effectively. Consequently, users are not satisfied with the ease of use and retrieval accuracy of first-generation search engines, because such features of Web pages are not exploited. To deal with these problems, the second-generation search engines take the hyperlink structures of Web pages into account. For example, the approaches called PageRank [3] and HITS (Hypertext Induced Topic Search) [4] are applied to Google(1) [5] and the search engine of the CLEVER project [6], respectively. In these algorithms, weighting Web pages using hyperlink structures achieves higher retrieval accuracy. However, these algorithms have shortcomings in that (1) only a weight for each Web page is defined; and (2) the relatedness of contents among hyperlinked Web pages is not considered. Taking these points into account, Davison [7] concentrated on textual content and showed that Web pages are significantly more likely to be related topically to the pages to which they are linked.
Based on this finding, his research group has released the search engine "Teoma(2)," which performs context-sensitive HITS along the lines of the CLEVER project. This search engine uses the concept of "Subject-Specific Popularity [8]," which ranks a site based on the number of same-subject pages that reference it, not just general popularity, to determine a site's level of authority. However, the problem that Web pages irrelevant to the user's query are often ranked highly still remains. Hence, in order to provide users with relevant Web pages, it is necessary to develop a technique for representing the contents of Web pages more accurately. To achieve this purpose, we have previously proposed some methods for improving the feature vector of a target Web page [9]. Our proposed methods, however, have a problem in that only the Web pages out-linked from a target Web page are used to generate its feature vector. Since Web pages usually have both in- and out-linked pages, it is necessary to use both of them in order to generate more accurate feature vectors. Therefore, in this paper, we first propose three methods for improving the tf-idf scheme [10]

(1) http://www.google.com/ (2) http://www.teoma.com/

for a target Web page using both its in- and out-linked pages, in order to represent the contents of the target Web page more accurately. Then, we compare the retrieval accuracy of our proposed methods using the improved feature vectors. Our method is novel in improving the tf-idf based feature vector of a target Web page by reflecting the contents of its hyperlinked neighboring Web pages.

2. Related Work
2.1 IR Systems based on the concept of "Optimal Document Granularity"
With respect to this research area, we refer to the following works. Tajima et al. [11] presented a technique which uses "cuts" (results of Web structure analysis) as retrieval units for the Web. Moreover, they extended it to rank search results involving multiple keywords by (1) finding minimal subgraphs of links and Web pages including all keywords; and (2) computing the score of each subgraph based on the locality of the keywords within it [12]. Following these works, Li et al. [13] introduced the concept of the "information unit," which can be regarded as a logical document consisting of multiple physical Web pages treated as one atomic retrieval unit, and proposed a novel framework for document retrieval by information units. However, these approaches require considerable processing time to analyze hyperlink structures and to discover the semantics of Web pages. In addition, while these systems can find retrieval units exactly, it often happens that they find retrieval units irrelevant to the user-specified query terms. As for these works, we do not believe that users can understand the search results intuitively, because the multiple query keywords are dispersed over several hyperlinked Web pages.
2.2 HITS Algorithm
Kleinberg [4] originally developed the HITS algorithm applied to the search engine of the CLEVER project [6]. This algorithm depends on the query and considers the set S of pages that point to or are pointed to by pages in the answer. Web pages that have many links pointing to them in S are called authorities, while Web pages that have many outgoing links are called hubs. That is to say, better authorities come from incoming edges from good hubs, and better hubs come from outgoing edges to good authorities. Let h(p) and a(p) be the hub and authority scores of page p. These scores are defined such that the following equations are satisfied for all pages p:

h(p) = \sum_{q : p \to q} a(q),    a(p) = \sum_{q : q \to p} h(q),

where h(p) and a(p) are normalized over all Web pages. These values can be determined through an iterative algorithm, and they converge to the principal eigenvectors of the link matrix of S. Several researchers have extended the above original HITS algorithm. For instance, the ARC algorithm of Chakrabarti et al. [14] enhanced the HITS algorithm with textual analysis. ARC computes a distance-2 neighborhood graph and weights edges. The weight of each edge is based on the match between the query terms and the text surrounding the hyperlink in the source document. In [15], Bharat et al. introduced additional

heuristics to the HITS algorithm by giving a document an authority weight of 1/k if the document is in a group of k documents on a first host which link to a single document on a second host, and a hub weight of 1/m if there are m links from the document on a first host to a set of documents on a second host. However, the major problem of this technique is that the Web pages the root document points to get the largest authority scores, because the hub score of the root page dominates all the others when a Web page has few in-links but a large number of out-links, most of which are not very relevant to the query. In order to solve this problem, Li et al. [16] proposed a new weighted HITS-based method that assigns appropriate weights to the in-links of root Web pages and combines content analysis with the HITS-based algorithm. In addition, Chakrabarti et al. [17], [18] considered not merely the text of a Web page but also the Document Object Model (DOM) within the HTML. This improves on intermediate work in the CLEVER project [6] that broke hubs into pieces at logical HTML boundaries.
2.3 PageRank Algorithm
PageRank simulates a user navigating randomly in the Web who jumps to a random page with probability \varepsilon, or follows a random hyperlink with probability 1 - \varepsilon. It is further assumed that this user never returns to a previously visited page by following an already traversed hyperlink backwards. This process can be modeled with a Markov chain, from which the stationary probability of being in each Web page can be computed. This value is then used as a part of the ranking mechanism used by Google [5]. Let C(p) be the number of outgoing links from Web page p, and suppose that Web pages p_1 to p_n point to the Web page p. Then, the PageRank PR(p) of p is defined as:

PR(p) = \varepsilon + (1 - \varepsilon) \sum_{i=1}^{n} \frac{PR(p_i)}{C(p_i)},

where the value of \varepsilon is empirically set to about 0.15-0.2 by the system. The weights of the other Web pages are normalized by the number of links in each page. PageRank can be computed using an iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the Web. The main problems of this algorithm are that (1) the contents of Web pages are not analyzed, so the "importance" of a given Web page is independent of the query; and (2) certain famous Web sites tend to be ranked more highly. Rafiei et al. [19] proposed using the set of Web pages that contain some term as a bias set for influencing the PageRank computation, with the goal of returning terms for which a given page has a high reputation. Furthermore, Haveliwala [20] proposed computing a set of PageRank vectors to capture more accurately the notion of importance with respect to a particular topic.
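As a concrete illustration, the iterative computations behind HITS and PageRank can be sketched as follows. This is a minimal sketch in Python; the graph representation, function names, and iteration counts are ours and illustrative, not the implementations used by CLEVER or Google.

```python
import math

def hits(out_links, iterations=50):
    """out_links: dict mapping each page to the list of pages it links to."""
    pages = set(out_links) | {q for qs in out_links.values() for q in qs}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # a(p): sum of hub scores of pages linking to p
        auth = {p: 0.0 for p in pages}
        for p, qs in out_links.items():
            for q in qs:
                auth[q] += hub[p]
        # h(p): sum of authority scores of pages that p links to
        hub = {p: sum(auth[q] for q in out_links.get(p, [])) for p in pages}
        # normalize both score vectors so the iteration converges
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

def pagerank(out_links, eps=0.15, iterations=50):
    """PR(p) = eps + (1 - eps) * sum over in-neighbors p_i of PR(p_i)/C(p_i)."""
    pages = set(out_links) | {q for qs in out_links.values() for q in qs}
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        pr = {p: eps + (1 - eps) * sum(pr[u] / len(qs)
                                       for u, qs in out_links.items()
                                       if qs and p in qs)
              for p in pages}
    return pr
```

Note how the two differ: HITS is query-dependent and run on a small neighborhood graph at query time, whereas PageRank is computed once, offline, over the whole graph.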

3. Proposed Method
As we described in Section 2.1, the IR systems based on the concept of "optimal document granularity" have a problem in that the search results are incomprehensible for users. Moreover, HITS [4] and PageRank [5] also have problems: (1) only a weight for each Web page is defined; and (2) the relatedness of contents among hyperlinked Web pages is not considered. In view of these problems, in order to represent the contents of Web pages more accurately, the feature vector of a Web page should be generated by using the contents of its hyperlinked neighboring pages. We therefore propose improving the tf-idf scheme for a target Web page by using the contents of its hyperlinked neighboring pages. Unlike the studies described in the previous section, our method is novel in improving the tf-idf based feature vector of a target Web page by reflecting the contents of its hyperlinked neighboring Web pages. Our approach is query-independent, and the link-based computations are performed offline, as in the PageRank algorithm. At query time, we only compute the similarity between the improved feature vector and the user-specified query. Therefore, the query-time costs are not much greater than those of the HITS algorithm, whose query-time costs depend on the query.

In the following discussion, let the target page be p_tgt. Then, we define i as the length of the shortest directed path from p_tgt, and assume that there are N_i Web pages (p_{i1}, p_{i2}, ..., p_{iN_i}) at level i from p_tgt. Moreover, we denote the feature vector \mathbf{P}_{tgt} of p_tgt as follows:

\mathbf{P}_{tgt} = (w_{t_1}, w_{t_2}, \ldots, w_{t_m}),   (1)

where m is the number of unique terms in the Web page collection, and t_k (k = 1, 2, ..., m) denotes each term. Using the tf-idf scheme, we also define each element w_{t_k} of \mathbf{P}_{tgt} as follows:

w_{t_k} = tf(t_k, p_{tgt}) \cdot \log \frac{N_{all}}{df(t_k)},   (2)

where tf(t_k, p_tgt) is the frequency of term t_k in the target page p_tgt, N_all is the total number of Web pages in the collection, and df(t_k) is the number of Web pages in which term t_k appears. Below, we refer to \mathbf{P}_{tgt} as the "initial feature vector." Subsequently, we denote the improved feature vector \tilde{\mathbf{P}}_{tgt} as follows:

\tilde{\mathbf{P}}_{tgt} = (\tilde{w}_{t_1}, \tilde{w}_{t_2}, \ldots, \tilde{w}_{t_m}),   (3)

and refer to this \tilde{\mathbf{P}}_{tgt} as the "improved feature vector." In this paper, we propose three methods for improving the "initial feature vector" defined by Equation (2), as follows:
- the method of reflecting the contents of all Web pages at levels up to L^{max}_{(in)} in the backward direction and levels up to L^{max}_{(out)} in the forward direction from the target page p_tgt (Method I),
- the method of reflecting the centroid vectors of clusters generated from Web page groups created at each level up to L^{max}_{(in)} in the backward direction and each level up to L^{max}_{(out)} in the forward direction from the target page p_tgt (Method II),
- the method of reflecting the centroid vectors of clusters generated from Web page groups created at levels up to L^{max}_{(in)} in the backward direction and levels up to L^{max}_{(out)} in the forward direction from the target page p_tgt (Method III).

Method I
In this method, we reflect the contents of all Web pages at levels up to L^{max}_{(in)} in the backward direction and levels up to L^{max}_{(out)} in the forward direction from the target page p_tgt. This is based on the ideas that (1) there are Web pages whose contents are similar to those of p_tgt in the neighborhood of p_tgt; and (2) while such Web pages may lie right near p_tgt in the vector space, they might also lie far removed from it. We therefore reflect the distance between \mathbf{P}_{tgt} and the feature vector of each in- and out-linked page in the vector space on each element of the initial feature vector \mathbf{P}_{tgt}. For example, Figure 1(a) shows that \tilde{\mathbf{P}}_{tgt} is generated by reflecting the contents of all Web pages at levels up to second in the backward and forward directions from p_tgt on \mathbf{P}_{tgt}. In Figure 1(a), p_{ij(in)} and p_{ij(out)} correspond to the j-th page at level i in the backward and forward directions from p_tgt, respectively. Additionally, Figure 1(b) shows that the improved feature vector \tilde{\mathbf{P}}_{tgt} is generated by reflecting each feature vector of the in- and out-linked pages of p_tgt on the initial feature vector \mathbf{P}_{tgt}. In this method, each element \tilde{w}_{t_k} of \tilde{\mathbf{P}}_{tgt} is defined as follows:

\tilde{w}_{t_k} = w_{t_k} + \sum_{i=1}^{L^{max}_{(in)}} \sum_{j=1}^{N_i} \frac{w_{t_k}^{ij(in)}}{dist(\mathbf{P}_{tgt}, \mathbf{P}_{ij(in)})} + \sum_{i=1}^{L^{max}_{(out)}} \sum_{j=1}^{N_i} \frac{w_{t_k}^{ij(out)}}{dist(\mathbf{P}_{tgt}, \mathbf{P}_{ij(out)})}.   (4)

Equation (4) shows that the product of w_{t_k}^{ij(in)} (the weight of term t_k in Web page p_{ij(in)}) and the reciprocal of dist(\mathbf{P}_{tgt}, \mathbf{P}_{ij(in)}) (the distance between \mathbf{P}_{tgt} and \mathbf{P}_{ij(in)} in the vector space), and similarly the product of w_{t_k}^{ij(out)} (the weight of term t_k in Web page p_{ij(out)}) and the reciprocal of dist(\mathbf{P}_{tgt}, \mathbf{P}_{ij(out)}), is added to w_{t_k} (the weight of term t_k in p_tgt computed by Equation (2)) with respect to all Web pages at levels up to L^{max}_{(in)} in the backward direction and levels up to L^{max}_{(out)} in the forward direction from p_tgt. If the distance between \mathbf{P}_{tgt} and a neighboring feature vector in the vector space is very small, the values of the second and third terms of Equation (4) can dominate the first term w_{t_k}. In order to prevent this phenomenon, we also define dim, which denotes the number of unique terms in the Web page collection. dist(\mathbf{P}_{tgt}, \mathbf{P}_{ij(in)}) and dist(\mathbf{P}_{tgt}, \mathbf{P}_{ij(out)}) are defined by the following equations, respectively:

dist(\mathbf{P}_{tgt}, \mathbf{P}_{ij(in)}) = \sqrt{\sum_{k=1}^{dim} \left( w_{t_k} - w_{t_k}^{ij(in)} \right)^2},   (5)

dist(\mathbf{P}_{tgt}, \mathbf{P}_{ij(out)}) = \sqrt{\sum_{k=1}^{dim} \left( w_{t_k} - w_{t_k}^{ij(out)} \right)^2}.   (6)
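The initial tf-idf feature vector of Equations (1)-(2) can be sketched as follows. This is a minimal Python sketch over a toy collection; the tokenization and corpus are illustrative, not the WT10g setup described later.

```python
import math

def tfidf_vector(page_terms, collection):
    """page_terms: list of terms of the target page.
    collection: list of term lists, one per Web page in the collection."""
    n_all = len(collection)  # N_all in Equation (2)
    vocab = sorted({t for doc in collection for t in doc})
    # df(t_k): number of pages in which term t_k appears
    df = {t: sum(1 for doc in collection if t in doc) for t in vocab}
    # w_tk = tf(t_k, p_tgt) * log(N_all / df(t_k))
    return {t: page_terms.count(t) * math.log(n_all / df[t]) for t in vocab}
```

Terms that occur in every page get idf log(1) = 0, and terms absent from the target page get tf = 0, so only distinctive terms of the page receive positive weight.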

Figure 1: The improvement of a feature vector as performed by Method I [(a) in the Web space, (b) in the vector space].
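Method I's update of Equation (4) can be sketched as follows. This is a minimal Python sketch in which vectors are dicts over the vocabulary; the helper names are ours, not the authors'.

```python
import math

def dist(v1, v2, vocab):
    """Euclidean distance of Equations (5)-(6) over the full vocabulary."""
    return math.sqrt(sum((v1.get(t, 0.0) - v2.get(t, 0.0)) ** 2 for t in vocab))

def method1_improve(target_vec, neighbor_vecs, vocab):
    """neighbor_vecs: tf-idf vectors of all in- and out-linked pages up to
    the chosen levels (both directions use the same arithmetic)."""
    improved = dict(target_vec)
    for nv in neighbor_vecs:
        d = dist(target_vec, nv, vocab)
        if d == 0.0:
            continue  # identical vector: skip to avoid division by zero
        for t in vocab:
            # add the neighbor's term weight scaled by 1/dist (Equation (4))
            improved[t] = improved.get(t, 0.0) + nv.get(t, 0.0) / d
    return improved
```

Because the distance is computed over the whole vocabulary, far-away neighbors contribute little, which is the damping role the paper assigns to dim.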

Figure 2: The improvement of a feature vector as performed by Method II [(a) in the Web space, (b) in the vector space].

Method II
In this method, we first construct Web page groups G_{i(in)} at each level up to L^{max}_{(in)} in the backward direction, and G_{i(out)} at each level up to L^{max}_{(out)} in the forward direction, from the target page p_tgt. Then, we generate \tilde{\mathbf{P}}_{tgt} by reflecting the centroid vectors of clusters generated from G_{i(in)} and G_{i(out)} on the initial feature vector \mathbf{P}_{tgt}. This method is based on the idea that the Web pages at each level in the backward and forward directions from p_tgt can be classified into several topics at each level. In addition, we reflect the distance between \mathbf{P}_{tgt} and the centroid vectors of the clusters in the vector space on each element of the initial feature vector \mathbf{P}_{tgt}. In other words, we first create Web page groups G_{i(in)} and G_{i(out)}, defined as follows:

G_{i(in)} = \{ p_{i1(in)}, p_{i2(in)}, \ldots, p_{iN_i(in)} \},   (7)

G_{i(out)} = \{ p_{i1(out)}, p_{i2(out)}, \ldots, p_{iN_i(out)} \},   (8)

and then produce c clusters in each Web page group G_{i(in)} and G_{i(out)} by means of the k-means algorithm [21] (with k = c). The centroid vectors \mathbf{C}_j^{i(in)} and \mathbf{C}_j^{i(out)} (j = 1, 2, ..., c) are produced in G_{i(in)} and G_{i(out)}, respectively. We generate an improved feature vector \tilde{\mathbf{P}}_{tgt} by reflecting the distance between each centroid vector \mathbf{C}_j^{i(in)}, \mathbf{C}_j^{i(out)} and the initial feature vector \mathbf{P}_{tgt} on \mathbf{P}_{tgt}. For instance, Figure 2(a) shows that we construct Web page groups G_{1(in)}, G_{2(in)}, G_{1(out)}, and G_{2(out)} at each level up to second in the backward and forward directions from p_tgt, and generate an improved feature vector \tilde{\mathbf{P}}_{tgt} by reflecting the centroid vectors of each cluster produced in these groups on \mathbf{P}_{tgt}. Moreover, Figure 2(b) shows that the improved feature vector \tilde{\mathbf{P}}_{tgt} is generated by reflecting the centroid vectors of each cluster on \mathbf{P}_{tgt}. In this method, we define each element \tilde{w}_{t_k} of \tilde{\mathbf{P}}_{tgt} as follows:

\tilde{w}_{t_k} = w_{t_k} + \sum_{i=1}^{L^{max}_{(in)}} \sum_{j=1}^{c} \frac{w_{t_k}^{C_j^{i(in)}}}{dist(\mathbf{P}_{tgt}, \mathbf{C}_j^{i(in)})} + \sum_{i=1}^{L^{max}_{(out)}} \sum_{j=1}^{c} \frac{w_{t_k}^{C_j^{i(out)}}}{dist(\mathbf{P}_{tgt}, \mathbf{C}_j^{i(out)})}.   (9)

Equation (9) shows that the product of w_{t_k}^{C_j^{i(in)}} (the weight of term t_k in the centroid vector \mathbf{C}_j^{i(in)} of cluster j constructed from G_{i(in)}) and the reciprocal of dist(\mathbf{P}_{tgt}, \mathbf{C}_j^{i(in)}) (the distance between \mathbf{P}_{tgt} and \mathbf{C}_j^{i(in)} in the vector space), and similarly the product of w_{t_k}^{C_j^{i(out)}} (the weight of term t_k in the centroid vector \mathbf{C}_j^{i(out)} of cluster j constructed from G_{i(out)}) and the reciprocal of dist(\mathbf{P}_{tgt}, \mathbf{C}_j^{i(out)}), is added to w_{t_k} (the weight of term t_k in p_tgt computed by Equation (2)) with respect to all centroid vectors constructed at each level up to L^{max}_{(in)} in the backward direction and each level up to L^{max}_{(out)} in the forward direction from p_tgt. In addition, we introduce dim for the purpose of preventing the values of the second and third terms from dominating the first term in Equation (9). dist(\mathbf{P}_{tgt}, \mathbf{C}_j^{i(in)}) and dist(\mathbf{P}_{tgt}, \mathbf{C}_j^{i(out)}) are defined as follows:

dist(\mathbf{P}_{tgt}, \mathbf{C}_j^{i(in)}) = \sqrt{\sum_{k=1}^{dim} \left( w_{t_k} - w_{t_k}^{C_j^{i(in)}} \right)^2},   (10)

dist(\mathbf{P}_{tgt}, \mathbf{C}_j^{i(out)}) = \sqrt{\sum_{k=1}^{dim} \left( w_{t_k} - w_{t_k}^{C_j^{i(out)}} \right)^2}.   (11)
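Equation (9) can be sketched as follows. This is a minimal Python sketch; the tiny k-means below stands in for the algorithm cited as [21], and all function and parameter names are ours.

```python
import math
import random

def kmeans_centroids(vectors, vocab, c, iterations=10, seed=0):
    """Cluster the vectors into up to c clusters and return the centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, min(c, len(vectors)))
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in vectors:
            # assign each vector to its nearest centroid (squared distance)
            nearest = min(range(len(centroids)),
                          key=lambda j: sum((v.get(t, 0.0) - centroids[j].get(t, 0.0)) ** 2
                                            for t in vocab))
            clusters[nearest].append(v)
        # recompute each centroid as the mean of its cluster
        centroids = [{t: sum(v.get(t, 0.0) for v in cl) / len(cl) for t in vocab}
                     for cl in clusters if cl]
    return centroids

def method2_improve(target_vec, level_groups, vocab, c=3):
    """level_groups: one list of page vectors per level group G_i
    (in- and out-directions alike)."""
    improved = dict(target_vec)
    for group in level_groups:
        for cent in kmeans_centroids(group, vocab, c):
            d = math.sqrt(sum((target_vec.get(t, 0.0) - cent.get(t, 0.0)) ** 2
                              for t in vocab))
            if d == 0.0:
                continue
            for t in vocab:
                # add the centroid's term weight scaled by 1/dist (Equation (9))
                improved[t] = improved.get(t, 0.0) + cent.get(t, 0.0) / d
    return improved
```

Method III differs only in pooling all levels of each direction into a single group before clustering, so the same helper applies with one pooled group per direction.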

Method III
This method is based on the idea that the set of Web pages at levels up to L^{max}_{(in)} in the backward direction and levels up to L^{max}_{(out)} in the forward direction from the target page p_tgt is composed of several topics. According to this idea, we cluster the set of all Web pages at levels up to L^{max}_{(in)} in the backward direction and levels up to L^{max}_{(out)} in the forward direction from p_tgt, and generate \tilde{\mathbf{P}}_{tgt} by reflecting the centroid vectors of the clusters on the initial feature vector \mathbf{P}_{tgt}. Furthermore, we reflect the distance between \mathbf{P}_{tgt} and the centroid vector of each cluster in the vector space on each element of \mathbf{P}_{tgt}. In other words, we create Web page groups G_{(in)} and G_{(out)} as defined by Equations (12) and (13), respectively:

G_{(in)} = \{ p_{11(in)}, p_{12(in)}, \ldots, p_{1N_1(in)}, p_{21(in)}, p_{22(in)}, \ldots, p_{2N_2(in)}, \ldots, p_{i1(in)}, \ldots, p_{iN_i(in)} \},   (12)

G_{(out)} = \{ p_{11(out)}, p_{12(out)}, \ldots, p_{1N_1(out)}, p_{21(out)}, p_{22(out)}, \ldots, p_{2N_2(out)}, \ldots, p_{i1(out)}, \ldots, p_{iN_i(out)} \},   (13)

and produce c clusters in G_{(in)} and G_{(out)} by means of the k-means algorithm. The centroid vectors \mathbf{C}_j^{(in)} and \mathbf{C}_j^{(out)} (j = 1, 2, ..., c) are produced in G_{(in)} and G_{(out)}, respectively. Then, we generate the improved feature vector \tilde{\mathbf{P}}_{tgt} by reflecting the distance between each centroid vector \mathbf{C}_j^{(in)}, \mathbf{C}_j^{(out)} (j = 1, 2, ..., c) and the initial feature vector \mathbf{P}_{tgt} on \mathbf{P}_{tgt}. For instance, Figure 3(a) shows that we construct Web page groups G_{(in)} and G_{(out)} from the pages at levels up to second in the backward and forward directions from p_tgt, and generate the improved feature vector \tilde{\mathbf{P}}_{tgt} by reflecting the centroid vectors of the clusters produced in G_{(in)} and G_{(out)} on the initial feature vector \mathbf{P}_{tgt}. Furthermore, Figure 3(b) shows that the improved feature vector \tilde{\mathbf{P}}_{tgt} is generated by reflecting the centroid vectors of each cluster on the initial feature vector \mathbf{P}_{tgt}. In this method, each element \tilde{w}_{t_k} of \tilde{\mathbf{P}}_{tgt} is defined as follows:

\tilde{w}_{t_k} = w_{t_k} + \sum_{j=1}^{c} \frac{w_{t_k}^{C_j^{(in)}}}{dist(\mathbf{P}_{tgt}, \mathbf{C}_j^{(in)})} + \sum_{j=1}^{c} \frac{w_{t_k}^{C_j^{(out)}}}{dist(\mathbf{P}_{tgt}, \mathbf{C}_j^{(out)})}.   (14)

Equation (14) shows that the product of w_{t_k}^{C_j^{(in)}} (the weight of term t_k in the centroid vector \mathbf{C}_j^{(in)} of cluster j generated from G_{(in)}) and the reciprocal of dist(\mathbf{P}_{tgt}, \mathbf{C}_j^{(in)}) (the distance between \mathbf{P}_{tgt} and \mathbf{C}_j^{(in)} in the vector space), and similarly the product of w_{t_k}^{C_j^{(out)}} (the weight of term t_k in the centroid vector \mathbf{C}_j^{(out)} of cluster j generated from G_{(out)}) and the reciprocal of dist(\mathbf{P}_{tgt}, \mathbf{C}_j^{(out)}), is added to w_{t_k} (the weight of term t_k in p_tgt computed by Equation (2)), with respect to the number of clusters c. As mentioned for Methods I and II, in order to prevent the values of the second and third terms of Equation (14) from dominating the original term weight w_{t_k}, we introduce dim, which denotes the number of unique terms in the Web page collection. We also define dist(\mathbf{P}_{tgt}, \mathbf{C}_j^{(in)}) and dist(\mathbf{P}_{tgt}, \mathbf{C}_j^{(out)}) as follows:

dist(\mathbf{P}_{tgt}, \mathbf{C}_j^{(in)}) = \sqrt{\sum_{k=1}^{dim} \left( w_{t_k} - w_{t_k}^{C_j^{(in)}} \right)^2},   (15)

dist(\mathbf{P}_{tgt}, \mathbf{C}_j^{(out)}) = \sqrt{\sum_{k=1}^{dim} \left( w_{t_k} - w_{t_k}^{C_j^{(out)}} \right)^2}.   (16)

4. Experiment
4.1 Experimental Setup
Our proposed methods described in Section 3 were implemented using Perl on a workstation (CPU: UltraSparc-II 480 MHz x 4, Memory: 2 GBytes, OS: Solaris 8), and the experiments to verify the retrieval accuracy were conducted using the TREC WT10g test collection [22], which contains about 1.6 million Web pages. Stop words were eliminated from all the Web pages in the collection based on the stopword list(3), and stemming was performed using the Porter stemmer(4) [23]. We formed the query vector \mathbf{q} using the terms contained in the "title" field of Topics 451 to 500 of the TREC WT10g test collection. This query vector \mathbf{q} is denoted as follows:

\mathbf{q} = (w_{q_1}, w_{q_2}, \ldots, w_{q_m}),   (17)

where m is the number of unique terms in the Web page collection, and t_k (k = 1, 2, ..., m) denotes each term. Each element w_{q_k} of \mathbf{q} is defined as follows:

w_{q_k} = \left( 0.5 + \frac{0.5 \, tf(t_k, q)}{\max_{l} tf(t_l, q)} \right) \cdot \log \frac{N_{all}}{df(t_k)},   (18)

where tf(t_k, q) is the frequency of term t_k in the query, N_all is the total number of Web pages in the test collection, and df(t_k) is the number of Web pages in which the term t_k appears. As reported in [24], Equation (18) is the element of a query vector that brings the best search result. We then compute the similarity sim(\tilde{\mathbf{P}}_{tgt}, \mathbf{q}) between the improved feature vector \tilde{\mathbf{P}}_{tgt} and the query vector \mathbf{q}. The sim(\tilde{\mathbf{P}}_{tgt}, \mathbf{q}) is defined as follows:

sim(\tilde{\mathbf{P}}_{tgt}, \mathbf{q}) = \frac{\tilde{\mathbf{P}}_{tgt} \cdot \mathbf{q}}{|\tilde{\mathbf{P}}_{tgt}| \, |\mathbf{q}|}.   (19)

Based on the value of sim(\tilde{\mathbf{P}}_{tgt}, \mathbf{q}), we evaluate retrieval accuracy using "precision at 11 standard recall levels" as described in [25] and [26].
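The query weighting of Equation (18) (the augmented tf x idf weighting reported as best in [24]) and the cosine similarity of Equation (19) can be sketched as follows. This is a minimal Python sketch with our own variable names.

```python
import math

def query_vector(query_terms, n_all, df):
    """Augmented tf x idf query weighting of Equation (18).
    n_all: total number of pages; df: term -> document frequency."""
    tf = {t: query_terms.count(t) for t in set(query_terms)}
    max_tf = max(tf.values())
    return {t: (0.5 + 0.5 * tf[t] / max_tf) * math.log(n_all / df[t])
            for t in tf if t in df and df[t] > 0}

def cosine_sim(v1, v2):
    """Cosine similarity of Equation (19)."""
    dot = sum(v1[t] * v2.get(t, 0.0) for t in v1)
    n1 = math.sqrt(sum(x * x for x in v1.values()))
    n2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```

Rarer query terms receive higher weights through the idf factor, and the cosine makes the ranking insensitive to document length.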

4.2 Experimental Results
Method I
We generated the improved feature vector \tilde{\mathbf{P}}_{tgt} for the initial feature vector \mathbf{P}_{tgt} of the target page p_tgt with respect to the following three cases:
(MI-a) where the contents of all Web pages at levels up to L^{max}_{(in)} in the backward direction from p_tgt are reflected on the initial feature vector \mathbf{P}_{tgt},
(MI-b) where the contents of all Web pages at levels up to L^{max}_{(out)} in the forward direction from p_tgt are reflected on the initial feature vector \mathbf{P}_{tgt},
(MI-c) where the contents of all Web pages at levels up to L^{max}_{(in)} in the backward direction and levels up to L^{max}_{(out)} in the forward direction from p_tgt are reflected on the initial feature vector \mathbf{P}_{tgt}.
Using these improved feature vectors, we verified the retrieval accuracy. Figure 4 shows the experimental results of (MI-a), (MI-b), and (MI-c). In these figures, the values of L^{max}_{(in)} and L^{max}_{(out)} represent the maximum value of i, the length of the shortest directed path from p_tgt, as Figure 1(a) illustrates. As shown in (MI-a) and (MI-b), we found that we cannot obtain much improvement in retrieval accuracy over the tf-idf scheme when using only the in-linked pages or only the out-linked pages of the target page p_tgt. In Figure 4 (MI-a) and (MI-b), the average improvement ratio is +0.02 compared with the tf-idf scheme. On the other hand, in the case of reflecting the contents of all Web pages at levels up to first in the backward and forward directions from p_tgt on \mathbf{P}_{tgt}, as Figure 4 (MI-c) shows, we obtain only a slight improvement in retrieval accuracy (+0.007). However, Figure 4 (MI-c) also illustrates that, when reflecting the contents of all Web pages at levels up to second or third in the backward and forward directions from p_tgt on \mathbf{P}_{tgt}, we obtain much better retrieval accuracy than the tf-idf scheme (+0.028 and +0.029 improvement, respectively).
Therefore, in general, we found that we can generate a more accurate feature vector of a Web page by reflecting the contents of all Web pages at levels up to at least second in the backward and forward directions from the target page p_tgt.
(3) ftp://ftp.cs.cornell.edu/pub/smart/english.stop
(4) http://www.tartarus.org/%7Emartin/PorterStemmer/
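The "precision at 11 standard recall levels" measure used in the evaluation can be sketched as follows. This is a minimal Python sketch for a single ranked result list; interpolation follows the usual convention of taking the maximum precision at any recall at or above each of the 11 points.

```python
def eleven_point_precision(ranked, relevant):
    """ranked: doc ids in rank order; relevant: set of relevant doc ids."""
    if not relevant:
        return [0.0] * 11
    points = []  # (recall, precision) at each relevant document retrieved
    hits = 0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    # interpolated precision at recall r: max precision at any recall >= r
    return [max((p for r, p in points if r >= level / 10), default=0.0)
            for level in range(11)]
```

Averaging these 11-point curves over a set of topics gives the recall-precision figures reported in the experiments.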

Figure 3: The improvement of a feature vector as performed by Method III [(a) in the Web space, (b) in the vector space].
Method II
We generated the improved feature vector \tilde{\mathbf{P}}_{tgt} for the initial feature vector \mathbf{P}_{tgt} of the target page p_tgt with respect to the following three cases:
(MII-a) where the centroid vectors of clusters generated from the group of Web pages at each level up to L^{max}_{(in)} in the backward direction from p_tgt are reflected on the initial feature vector \mathbf{P}_{tgt},
(MII-b) where the centroid vectors of clusters generated from the group of Web pages at each level up to L^{max}_{(out)} in the forward direction from p_tgt are reflected on the initial feature vector \mathbf{P}_{tgt},
(MII-c) where the centroid vectors of clusters generated from the group of Web pages at each level up to L^{max}_{(in)} in the backward direction and each level up to L^{max}_{(out)} in the forward direction from p_tgt are reflected on the initial feature vector \mathbf{P}_{tgt}.
We then conducted experiments to verify the retrieval accuracy using these improved feature vectors. Figures 5, 6, and 7 show the experimental results of (MII-a), (MII-b), and (MII-c), respectively. In these figures, the value of c denotes the number of clusters produced in the Web page groups defined by Equations (7) and (8).
In regard to (MII-a), we obtained the best retrieval accuracy when we generated an improved feature vector that considered the contents of Web pages at each level up to first in the backward direction from p_tgt, compared with the tf-idf scheme (Figure 5, L^{max}_{(in)} = 1, +0.022 improvement). However, we could not obtain notable retrieval accuracy when we generated an improved feature vector that considered the contents of Web pages at each level up to second (Figure 5, L^{max}_{(in)} = 2) or third (Figure 5, L^{max}_{(in)} = 3) in the backward direction from p_tgt, compared with the tf-idf scheme (at best, +0.011 and +0.018 improvement, respectively).
Regarding (MII-b), as with (MII-a), we obtained the best retrieval accuracy when we generated an improved feature vector that considered the contents of Web pages at each level up to first in the forward direction from p_tgt, compared with the tf-idf scheme (Figure 6, L^{max}_{(out)} = 1, +0.027 improvement). However, like (MII-a), we could not obtain notable retrieval accuracy when we generated an improved feature vector that considered the contents of Web pages at each level up to second (Figure 6, L^{max}_{(out)} = 2) or third (Figure 6, L^{max}_{(out)} = 3) in the forward direction from p_tgt (at best, +0.020 and +0.016 improvement, respectively).
Furthermore, regarding (MII-c), just as with (MII-a) and (MII-b), we obtained the best retrieval accuracy when we generated an improved feature vector that considered the contents of Web pages at each level up to first in the backward and forward directions from p_tgt, compared with the tf-idf scheme (Figure 7, L^{max}_{(in)} = 1, L^{max}_{(out)} = 1, +0.024 improvement). However, neither could we obtain remarkable retrieval accuracy when we generated an improved feature vector that considered the contents of Web pages at each level up to second (Figure 7, L^{max}_{(in)} = 2, L^{max}_{(out)} = 2, at best +0.012 improvement) or third (Figure 7, L^{max}_{(in)} = 3, L^{max}_{(out)} = 3, at best +0.012 improvement) in the backward and forward directions from p_tgt.
Based on these findings, we found that, with regard to the contents of Web pages, there is a strong similarity between the feature vector of the target page p_tgt and the centroid vectors generated from the Web page groups at each level up to first in the backward and forward directions from p_tgt. However, we also found that the similarity between the feature vector of p_tgt and the centroid vectors generated from the group of Web pages at each level from p_tgt decreases as the value of i, the length of the shortest directed path from p_tgt, increases.

Method III
We generated an improved feature vector for the target page p_tgt with respect to the following three cases:
(MIII-a) where the centroid vectors of clusters generated from the group of all Web pages at levels up to L^{max}_{(in)} in the backward direction from p_tgt are reflected on the initial feature vector \mathbf{P}_{tgt},
(MIII-b) where the centroid vectors of clusters generated from the group of all Web pages at levels up to L^{max}_{(out)} in the forward direction from p_tgt are reflected on the initial feature vector \mathbf{P}_{tgt},
(MIII-c) where the centroid vectors of clusters generated from the group of all Web pages at levels up to L^{max}_{(in)} in the backward direction and levels up to L^{max}_{(out)} in the forward direction from p_tgt are reflected on the initial feature vector \mathbf{P}_{tgt}.
We then conducted experiments to verify the retrieval accuracy using these improved feature vectors. Figures 8, 9, and 10 show the experimental results of (MIII-a), (MIII-b), and (MIII-c), respectively. In these figures, the value of c represents the number of clusters generated from the Web page groups defined by Equations (12) and (13).
Regarding (MIII-a), in comparison with the tf-idf scheme, we obtained the best retrieval accuracy when we generated an improved feature vector that considered the contents of all Web pages at levels up to second in the backward direction from p_tgt (Figure 8, L^{max}_{(in)} = 2, +0.032 improvement). Furthermore, when the Web page groups were produced from all Web pages at levels up to second (Figure 8, L^{max}_{(in)} = 2) or third (Figure 8, L^{max}_{(in)} = 3) in the backward direction from p_tgt, we obtained the best retrieval accuracy when the value of c equaled three. We can infer from this fact that the Web pages that point to the target page p_tgt are generally composed of three topics.
In regard to (MIII-b), we obtained the best retrieval accuracy when we generated an improved feature vector that considered the contents of all Web pages at levels up to second in the forward direction from p_tgt, compared with the tf-idf scheme (Figure 9, L^{max}_{(out)} = 2, +0.028 improvement). In addition, in the case of (MIII-b), we obtained the best retrieval accuracy when the value of c equaled three in every case where L^{max}_{(out)} = 1, 2, 3. We can infer from this fact that the Web pages that we can follow from p_tgt are often composed of three topics.
Finally, in regard to (MIII-c), the best retrieval accuracy was obtained when we generated an improved feature vector that considered the contents of all Web pages at levels up to second in the backward and forward directions from p_tgt, compared with the tf-idf scheme (Figure 10, L^{max}_{(in)} = 2, L^{max}_{(out)} = 2, +0.026 improvement). Furthermore, in every case where [L^{max}_{(in)} = 1, L^{max}_{(out)} = 1], [L^{max}_{(in)} = 2, L^{max}_{(out)} = 2], or [L^{max}_{(in)} = 3, L^{max}_{(out)} = 3], we obtained the best retrieval accuracy when the value of c equaled three. Therefore, the hyperlinked Web pages in the neighborhood of a target page tend to have about three topics.
As described above, the best retrieval accuracy is obtained when we generate an improved feature vector considering the contents of all Web pages at levels up to second in the backward direction from p_tgt in (MIII-a), and considering the contents of all Web pages at levels up to second in the forward direction from p_tgt in (MIII-b). Therefore, we can infer that the result in (MIII-c) is the effect obtained by merging the best results of (MIII-a) and (MIII-b).

4.3 Summary of Experimental Results
In summary, Figure 11 illustrates the comparison of the best retrieval accuracy obtained using Methods I, II, and III described in the previ-

ous section. According to these results, we found that we could obtain the best retrieval accuracy in comparison with tf-idf scheme, in the case of generating improved feature vector by creating a group of all Web pages at levels up to second in the backward direction from  @ A@ and producing three clusters from the group in Method III. In addition, Figure 11 shows that, in any case of Method I, II, and III, the best retrieval accuracy is obtained using the contents of in-linked pages of a target page. Therefore, it is assumed that more accurate feature vectors of Web pages can be generated by assigning higher weight to in-linked pages rather than out-linked pages of a target page. 0.7 0.6
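The weighting suggested by this summary can be sketched as follows. This is a minimal illustration only, not the paper's formulation: the mixing weights `w_in` and `w_out` and the helper `improve_feature_vector` are hypothetical, and the paper's reflection formulas (Equations (7), (8), (12), and (13)) are not reproduced here. Feature vectors are represented as sparse term-weight dictionaries.

```python
from collections import Counter

def improve_feature_vector(target_tf, in_link_tfs, out_link_tfs,
                           w_in=0.6, w_out=0.3):
    """Mix the centroid (average) term weights of a page's hyperlinked
    neighbors into its own term-weight vector, giving in-linked pages a
    higher weight than out-linked ones (w_in > w_out is the assumption
    drawn from the experimental summary)."""
    def centroid(vectors):
        total = Counter()
        for v in vectors:
            total.update(v)          # Counter.update adds weights
        n = max(len(vectors), 1)
        return {term: w / n for term, w in total.items()}

    improved = Counter(target_tf)
    for weight, group in ((w_in, in_link_tfs), (w_out, out_link_tfs)):
        for term, w in centroid(group).items():
            improved[term] += weight * w
    return dict(improved)

# Example: one in-linked and one out-linked neighbor.
improved = improve_feature_vector(
    {"web": 1.0},
    in_link_tfs=[{"web": 2.0, "search": 1.0}],
    out_link_tfs=[{"link": 1.0}],
)
```

With the default weights, the target term "web" becomes 1.0 + 0.6 * 2.0 = 2.2, while terms seen only in neighbors enter with the corresponding reduced weight.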

Figure 11 Comparison of best search accuracy obtained using each Method I, II, and III.
[Recall-precision plot omitted. Average precision (improvement over the tf-idf baseline):
'tfidf': 0.313; '(MI-a), L(in)=3': 0.342 (+0.029); '(MII-a), L(in)=3, K=3': 0.342 (+0.029); '(MIII-a), L(in)=2, K=3': 0.345 (+0.032).]
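The comparisons above are made in the vector space model, where similarity between feature vectors is conventionally measured by the cosine measure [10]. The paper does not restate the measure in this section, so the following is a standard-practice sketch (the function name and the sparse-dictionary representation are illustrative assumptions):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors,
    represented as {term: weight} dictionaries."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    # Define similarity as 0 when either vector is empty.
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Identical vectors score 1.0 and vectors with no shared terms score 0.0, which is the sense in which the feature vector of a target page can be said to be "similar" to the centroid vector of a neighboring page group.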

5. Conclusion
In this paper, in order to represent the contents of Web pages more accurately, we proposed three methods for improving the tf-idf scheme for Web pages using their hyperlinked neighboring pages. Our approach is novel in improving the tf-idf based feature vector of a target Web page by reflecting the contents of its hyperlinked neighboring pages. We conducted experiments with regard to the following three methods:
• the method for reflecting the contents of all Web pages at levels up to L(in) in the backward direction and levels up to L(out) in the forward direction from the target page,
• the method for reflecting the centroid vectors of clusters generated from the Web page groups created at each level up to L(in) in the backward direction and each level up to L(out) in the forward direction from the target page,
• the method for reflecting the centroid vectors of clusters generated from the Web page groups created at levels up to L(in) in the backward direction and levels up to L(out) in the forward direction from the target page,
and evaluated the retrieval accuracy of the improved feature vectors obtained from each method using recall-precision curves.
Regarding Method I, we can generate a more accurate feature vector of a Web page by utilizing the contents of all Web pages at levels up to at least second in the backward and forward directions from the target page. In the case of Method II, we found that there is strong similarity between the feature vector of the target page and the centroid vectors generated by the groups of Web pages at each level up to first in the backward and forward directions from it. On the other hand, this similarity decreases as the length of the shortest directed path from the target page increases.
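The clustering step in Methods II and III is the K-means algorithm [21]. A minimal pure-Python sketch follows; vectors are dense lists of floats and the helper name is illustrative. Note that the number of clusters k must be supplied before clustering begins, which is precisely why a fixed k sits uneasily with pages whose link environments vary.

```python
import random

def kmeans(vectors, k, iters=20, seed=0):
    """Minimal K-means: k centroids must be fixed in advance."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iters):
        # Assign each vector to its nearest centroid (squared Euclidean).
        clusters = [[] for _ in range(k)]
        for v in vectors:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(v, centroids[c])))
            clusters[i].append(v)
        # Recompute centroids of non-empty clusters.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = [sum(col) / len(members)
                                for col in zip(*members)]
    return centroids, clusters

# Two well-separated one-dimensional groups converge to their means.
centroids, clusters = kmeans([[0.0], [0.1], [10.0], [10.2]], k=2)
```

For the example above the two centroids converge to 0.05 and 10.1; but had the data contained three natural groups and k been left at 2, the algorithm would still force a two-way split, illustrating the limitation discussed below.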
With regard to Method III, a more accurate feature vector of a Web page is generated when we use the contents of Web pages at levels up to second in the backward direction from the target page. Furthermore, comparing the respective best retrieval accuracies obtained using these three methods, we found that the in-linked pages of a target page contribute most to generating a feature vector that represents the contents of the target page accurately. Consequently, it is assumed that more accurate feature vectors of Web pages can be generated by assigning a higher weight to the in-linked pages than to the out-linked pages of a target page. We plan to verify this assumption in future work.
In this paper, we used the K-means algorithm in order to classify the features of the in- and out-linked pages of a target page. However, since the number of clusters must be set initially in the K-means algorithm, we consider this algorithm inappropriate for classifying the features of Web pages that have various link environments. Therefore, in future work, we plan to devise clustering methods appropriate for the various link environments of Web pages. Moreover, in this paper, we focused on the hyperlink structures of the Web, aiming at generating more accurate feature vectors of Web pages. However, in order to satisfy users' actual information needs, it is more important to find relevant Web pages in the enormous Web space. Therefore, we plan to address techniques that provide users with personalized information.

References
[1] R. Baeza-Yates, F. Saint-Jean, and C. Castillo. Web Structure, Dynamics and Page Quality. In Proc. of the 9th International Symposium on String Processing and Information Retrieval (SPIRE 2002), pp. 117–130, 2002.
[2] A. Broder and P. Raghavan. Combining Text- and Link-Based Information Retrieval on the Web. SIGIR '01 Pre-Conference Tutorials, Sep. 2001.
[3] L. Page. The PageRank Citation Ranking: Bringing Order to the Web. http://google.stanford.edu/%7Ebackrub/pageranksub.ps, 1998.
[4] J. M. Kleinberg. Authoritative Sources in a Hyperlinked Environment. In Proc. of the 9th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 1998), pp. 668–677, 1998.
[5] S. Brin and L. Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. In Proc. of the 7th International World Wide Web Conference (WWW7), pp. 107–117, 1998.
[6] IBM Almaden Research Center. Clever Searching. http://www.almaden.ibm.com/cs/k53/clever.html.
[7] B. D. Davison. Topical Locality in the Web. In Proc. of the 23rd Annual International ACM SIGIR Conference (SIGIR 2000), pp. 272–279, 2000.
[8] Ask Jeeves, Inc. Search with Authority: The Teoma Difference. http://sp.teoma.com/docs/teoma/about/searchwithauthority.html.
[9] K. Sugiyama, K. Hatano, M. Yoshikawa, and S. Uemura. A Method of Improving Feature Vector for Web Pages Reflecting the Contents of their Out-linked Pages. In Proc. of the 13th International Conference on Database and Expert Systems Applications (DEXA 2002), pp. 891–901, 2002.
[10] G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
[11] K. Tajima, Y. Mizuuchi, M. Kitagawa, and K. Tanaka. Cut as a Querying Unit for WWW, Netnews, E-mail. In Proc. of the 9th ACM Conference on Hypertext and Hypermedia (HYPERTEXT '98), pp. 235–244, 1998.
[12] K. Tajima, K. Hatano, T. Matsukura, R. Sano, and K. Tanaka. Discovery and Retrieval of Logical Information Units in Web. In Proc. of the 1999 ACM Digital Libraries Workshop on Organizing Web Space (WOWS '99), pp. 13–23, 1999.
[13] W.-S. Li, K. Selçuk Candan, Q. Vu, and D. Agrawal. Retrieving and Organizing Web Pages by "Information Unit". In Proc. of the 10th International World Wide Web Conference (WWW10), pp. 230–244, 2001.
[14] S. Chakrabarti, B. Dom, P. Raghavan, S. Rajagopalan, D. Gibson, and J. Kleinberg. Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text. In Proc. of the 7th International World Wide Web Conference (WWW7), pp. 65–74, 1998.
[15] K. Bharat and M. R. Henzinger. Improved Algorithms for Topic Distillation in a Hyperlinked Environment. In Proc. of the 21st Annual International ACM SIGIR Conference (SIGIR '98), pp. 104–111, 1998.
[16] L. Li, Y. Shang, and W. Zhang. Improvement of HITS-based Algorithms on Web Documents. In Proc. of the 11th International World Wide Web Conference (WWW2002), pp. 527–535, 2002.
[17] S. Chakrabarti. Integrating the Document Object Model with Hyperlinks for Enhanced Topic Distillation and Information Extraction. In Proc. of the 10th International World Wide Web Conference (WWW10), pp. 211–220, 2001.
[18] S. Chakrabarti, M. Joshi, and V. Tawde. Enhanced Topic Distillation using Text, Markup Tags, and Hyperlinks. In Proc. of the 24th Annual International ACM SIGIR Conference (SIGIR 2001), pp. 208–216, 2001.
[19] D. Rafiei and A. O. Mendelzon. What is this Page Known for? Computing Web Page Reputations. In Proc. of the 9th International World Wide Web Conference (WWW9), pp. 823–835, 2000.
[20] T. H. Haveliwala. Topic-Sensitive PageRank. In Proc. of the 11th International World Wide Web Conference (WWW2002), pp. 517–526, 2002.
[21] J. MacQueen. Some Methods for Classification and Analysis of Multivariate Observations. In Proc. of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297, 1967.
[22] D. Hawking. Overview of the TREC-9 Web Track. NIST Special Publication 500-249: The Ninth Text REtrieval Conference (TREC-9), pp. 87–102, 2001.
[23] M. F. Porter. An Algorithm for Suffix Stripping. Program, Vol. 14(3), pp. 130–137, 1980.
[24] G. Salton and C. Buckley. Term-weighting Approaches in Automatic Text Retrieval. Information Processing & Management, Vol. 24(5), pp. 513–523, 1988.
[25] I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes. Van Nostrand Reinhold, pp. 149–150, 1994.
[26] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. ACM Press, 1999.

Figure 4 Comparison of search accuracy obtained using Method I [(MI-a): in-link pages only, (MI-b): out-link pages only, (MI-c): both in-link and out-link pages].
[Recall-precision plots omitted. Average precision (improvement over the tf-idf baseline, 0.313):
(MI-a) L(in)=1: 0.324 (+0.011); L(in)=2: 0.335 (+0.022); L(in)=3: 0.342 (+0.029).
(MI-b) L(out)=1: 0.332 (+0.019); L(out)=2: 0.331 (+0.018); L(out)=3: 0.337 (+0.024).
(MI-c) L(in)=L(out)=1: 0.320 (+0.007); L(in)=L(out)=2: 0.341 (+0.028); L(in)=L(out)=3: 0.342 (+0.029).]

Figure 5 Comparison of search accuracy obtained using Method II [(MII-a): in-link pages only].
[Recall-precision plots omitted. Average precision per number of clusters K (improvement over the tf-idf baseline, 0.313):
L(in)=1: K=1: 0.319 (+0.006); K=2: 0.335 (+0.022); K=3: 0.328 (+0.015).
L(in)=2: K=1: 0.315 (+0.002); K=2: 0.324 (+0.011); K=3: 0.320 (+0.007).
L(in)=3: K=1: 0.331 (+0.018); K=2: 0.324 (+0.011); K=3: 0.320 (+0.007).]

Figure 6 Comparison of search accuracy obtained using Method II [(MII-b): out-link pages only].
[Recall-precision plots omitted. Average precision per number of clusters K (improvement over the tf-idf baseline, 0.313):
L(out)=1: K=1: 0.325 (+0.012); K=2: 0.335 (+0.022); K=3: 0.340 (+0.027).
L(out)=2: K=1: 0.319 (+0.006); K=2: 0.333 (+0.020); K=3: 0.328 (+0.015).
L(out)=3: K=1: 0.324 (+0.011); K=2: 0.324 (+0.011); K=3: 0.329 (+0.016).]

Figure 7 Comparison of search accuracy obtained using Method II [(MII-c): both in-link and out-link pages].
[Recall-precision plots omitted. Average precision per number of clusters K (improvement over the tf-idf baseline, 0.313):
L(in)=L(out)=1: K=1: 0.328 (+0.015); K=2: 0.335 (+0.022); K=3: 0.337 (+0.024).
L(in)=L(out)=2: K=1: 0.325 (+0.012); K=2: 0.320 (+0.007); K=3: 0.317 (+0.004).
L(in)=L(out)=3: K=1: 0.325 (+0.012); K=2: 0.323 (+0.010); K=3: 0.322 (+0.009).]

Figure 8 Comparison of search accuracy obtained using Method III [(MIII-a): in-link pages only].
[Recall-precision plots omitted. Average precision per number of clusters K (improvement over the tf-idf baseline, 0.313):
L(in)=1: K=1: 0.315 (+0.002); K=2: 0.335 (+0.022); K=3: 0.328 (+0.015).
L(in)=2: K=1: 0.320 (+0.007); K=2: 0.335 (+0.022); K=3: 0.345 (+0.032).
L(in)=3: K=1: 0.319 (+0.006); K=2: 0.330 (+0.017); K=3: 0.342 (+0.029).]

Figure 9 Comparison of search accuracy obtained using Method III [(MIII-b): out-link pages only].
[Recall-precision plots omitted. Average precision per number of clusters K (improvement over the tf-idf baseline, 0.313):
L(out)=1: K=1: 0.320 (+0.007); K=2: 0.331 (+0.018); K=3: 0.338 (+0.025).
L(out)=2: K=1: 0.323 (+0.010); K=2: 0.331 (+0.018); K=3: 0.341 (+0.028).
L(out)=3: K=1: 0.319 (+0.006); K=2: 0.331 (+0.018); K=3: 0.337 (+0.024).]

Figure 10 Comparison of search accuracy obtained using Method III [(MIII-c): both in-link and out-link pages].
[Recall-precision plots omitted. Average precision per number of clusters K (improvement over the tf-idf baseline, 0.313):
L(in)=L(out)=1: K=1: 0.325 (+0.012); K=2: 0.331 (+0.018); K=3: 0.333 (+0.020).
L(in)=L(out)=2: K=1: 0.325 (+0.012); K=2: 0.335 (+0.022); K=3: 0.339 (+0.026).
L(in)=L(out)=3: K=1: 0.321 (+0.008); K=2: 0.332 (+0.019); K=3: 0.334 (+0.021).]