Page Segmentation by Web Content Clustering

8 downloads 2303 Views 4MB Size Report
May 26, 2011 - Web Intrapage Informative Structure Mining. Based on DOM (Term entropy based on heuristics). 4 / 19. Basic Idea. ▻ Start with complete Page ...
Page Segmentation by Web Content Clustering Sadet Alcic Heinrich-Heine-University of Duesseldorf Department of Computer Science Institute for Databases and Information Systems

May 26, 2011

WIMS ’11

1 / 19

Outline 1

Introduction Motivation Related Work

2

Web Page Segmentation by Clustering General Idea Distance functions for web contents Clustering methods

3

Evaluation Studies Distance functions Clustering

4

Conclusion and Future work WIMS ’11

2 / 19

Introduction

Motivation

Motivation

WIMS ’11

3 / 19

Introduction

Motivation

Motivation

Web Page is cluttered with different contents I Different news articles I

Link lists

I

Commercials

I

Template elements

I

Functional elements

WIMS ’11

3 / 19

Introduction

Motivation

Motivation

Web Page Segmentation I Separation of web contents into structural and semantic cohesive blocks

WIMS ’11

3 / 19

Introduction

Motivation

Motivation

WIMS ’11

3 / 19

Introduction

Motivation

Motivation

Applications I Web Content Search I

Web Page Categorization

I

Web Page Adaptation for Mobile Devices

I

Web Image Indexing

I

...

WIMS ’11

3 / 19

Introduction

Related Work

Overview of Related Work to Web Page Segmentation I

TOP-DOWN page segmentation: I

I

I

I

KDD’02: Lin and Ho. Discovering Informative Content Blocks from Web Documents (Table properties) APWeb’03: Cai et al. Extracting content structure for web pages based on visual representation (Heuristic rules on visual and DOM properties) TKDE’05: Kao et al. Web Intrapage Informative Structure Mining Based on DOM (Term entropy based on heuristics)

BOTTOM-UP page segmentation: I

I

I

CIKM’02: Li et al. Using Micro Information Units for Internet Search (Heuristic rules) WWW’08: Chakrabarti et al. A graph-theoretic approach to webpage segmentation (Graph partitioning) CIKM’08: Kohlschuetter and Nejdl. A densitometric approach to web page segmentation (Partitioning of a histogram of text density) WIMS ’11

4 / 19

Introduction

Related Work

Overview of Related Work to Web Page Segmentation I

TOP-DOWN page segmentation: I

I

I

KDD’02: Lin and Ho. Discovering Informative Content Blocks from Web Documents (Table properties) APWeb’03: Cai et al. Extracting content structure for web pages based on visual representation (Heuristic rules on visual and DOM properties) TKDE’05: Kao et al. Web Intrapage Informative Structure Mining Based on DOM (Term entropy based on heuristics)

Basic Idea I Start with complete Page as initial block I Decide for each block: I

I

should the block be separated? if yes, where to separate?

! Based on heuristics

WIMS ’11

4 / 19

Introduction

Related Work

Overview of Related Work to Web Page Segmentation Basic Idea I Start with smallest content units (e.g., DOM leafs)

I

I

group them to blocks of coherent content

I

How?

BOTTOM-UP page segmentation: I

I

I

CIKM’02: Li et al. Using Micro Information Units for Internet Search (Heuristic rules) WWW’08: Chakrabarti et al. A graph-theoretic approach to webpage segmentation (Graph partitioning) CIKM’08: Kohlschuetter and Nejdl. A densitometric approach to web page segmentation (Partitioning of a histogram of text density) WIMS ’11

4 / 19

Introduction

Related Work

Overview of Related Work to Web Page Segmentation Our Approach I belongs to BOTTOM-UP methods

I

I

DOM leafs are used as basic web objects

I

Idea!: group web objects to blocks by clustering

BOTTOM-UP page segmentation: I

I

I

CIKM’02: Li et al. Using Micro Information Units for Internet Search (Heuristic rules) WWW’08: Chakrabarti et al. A graph-theoretic approach to webpage segmentation (Graph partitioning) CIKM’08: Kohlschuetter and Nejdl. A densitometric approach to web page segmentation (Partitioning of a histogram of text density) WIMS ’11

4 / 19

Web Page Segmentation by Clustering

Web Page Segmentation by Clustering

WIMS ’11

5 / 19

Web Page Segmentation by Clustering

General Idea

Page Segmentation by Clustering General Definition: Clustering I Clustering is the process of organizing objects into groups whose members are similar in some way I

A cluster is therefore a collection of objects which are similar between them and are dissimilar to the objects belonging to other clusters

Open questions addressed in this work I

How can the similarity (or dissimilarity) of web objects be estimated?

I

Which representation is best suitable to represent web objects?

I

Which clustering method should be applied for clustering?

WIMS ’11

6 / 19

Web Page Segmentation by Clustering

Distance functions for web contents

Different Representations of Web objects I

Geometric Representation web browser puts every object of a web page in a 2-dim plane extract the bounding rectangle for each object

I I

I

Semantic Representation elements in DOM contain some textual contents extract keywords from the corresponding text

I I

I

DOM-based Representation each object is a node in the DOM tree of the page use the position of the object in DOM tree to characterize it

I I

A

B

⇒ Different distance measures are possible WIMS ’11

7 / 19

Web Page Segmentation by Clustering

Distance functions for web contents

Geometric Distance I

Let R = [(rx , ry ), (rx , , ry , )] and S = [(sx , sy ), (sx , , ry , )] be two bounding rectangles

I

The geometric distance of R to S is given  by , 1  ri − s i P  2 2 s − ri , , with ti = dist(R, S) = i∈x,y ti  i 0 Visually:

I

(0, 0)

if if if

ri > si , ri , < si otherwise.

x (sx, sy)

S (rx, ry)

min

R,S

dist(

)

(sx', sy')

R y

(rx', ry')

WIMS ’11

8 / 19

Web Page Segmentation by Clustering

Distance functions for web contents

Semantic Distance I I

Given T1 = (dog, run, street), T2 = (puppy, walk, road) Cosine Similarity Measure (Information Retrieval) I

Lexical word-to-word matching → sim(T1 , T2 ) = 0

I

to strict: e.g. synonym and hyponym relationships are not considered

I

Instead: text similarity measure based on WordNet [Corley 05] I

Words are mapped to concepts in WordNet (concept-to-concept matching)

P sim(T1 , T2 ) =

wi ∈T1

maxSim(wi , T2 ) · idf (wi ) P wi ∈T1 idf (wi )

WIMS ’11

9 / 19

Web Page Segmentation by Clustering

Distance functions for web contents

DOM-based Distance 1

4 level = 0 level degree = 2

A 5

2

3

6

B 1

2

level = 1 level degree = 3

C 3

4

5

6

level = 2 level degree = 0

WIMS ’11

10 / 19

Web Page Segmentation by Clustering

Distance functions for web contents

DOM-based Distance 1

4 level = 0 level degree = 2

A 5

2

3

6

B 1

2

level = 1 level degree = 3

C 3

4

5

6

level = 2 level degree = 0

Requirements I Nodes under same parent are closer than nodes under different parent I

Nodes on higher tree level are closer that nodes on lower level

WIMS ’11

10 / 19

Web Page Segmentation by Clustering

Distance functions for web contents

DOM-based Distance 1

4 level = 0 level degree = 2

A 5

2

3

I

6

B 1

2

level = 1 level degree = 3

C 3

4

5

6

level = 2 level degree = 0

Traverse DOM-tree in preorder traversing: ) , 6k , Ck , 4k , 5k , 3k , Bk , 1k , 2k P = ( Ak

WIMS ’11

10 / 19

Web Page Segmentation by Clustering

Distance functions for web contents

DOM-based Distance 1

4 level = 0 level degree = 2

A 5

2

3

I

6

B 1

2

level = 1 level degree = 3

C 3

4

5

6

level = 2 level degree = 0

Traverse DOM-tree in preorder traversing: ) , 6k , Ck , 4k , 5k , 3k , Bk , 1k , 2k P = ( Ak

I

For each element in P define a weight wpi that expresses the costs needed to reach pi from its predecessor in P

WIMS ’11

10 / 19

Web Page Segmentation by Clustering

Distance functions for web contents

DOM-based Distance 1

4 level = 0 level degree = 2

A 5

2

3

I

6

B 1

2

level = 1 level degree = 3

C 3

4

5

6

level = 2 level degree = 0

The distance between pa , pb ∈ P, (wlog. a < b) is defined as: b X

d(pa , pb ) =

wpi

i=a+1

Example: d( 2k , 4k ) = w3 + wC + w4 WIMS ’11

10 / 19

Web Page Segmentation by Clustering

Distance functions for web contents

DOM-based Distance 1

4 level = 0 level degree = 2

A 5

2

3

I

6

B 1

2

level = 1 level degree = 3

C 3

4

5

6

level = 2 level degree = 0

The weight wi of a node pi ∈ P depends on the level l and the level degree dl of pi :  c : dl = 0 w (l) = (1) dl · w (l + 1) : dl > 0, e.g., w (2) = c, w (1) = 3 ∗ w (2) = 3c, w (0) = 2 ∗ w (1) = 6c, WIMS ’11

10 / 19

Web Page Segmentation by Clustering

Clustering methods

Clustering Methods

I

Partitioning Clustering I

I

Agglomerative Hierarchical Clustering I

I

k-medoid (similar as k-means, but cluster representatives are real objects) single link method applied to compute distance between sets of objects

Density-based Clustering I

DBSCAN variant (able to find clusters of different density levels)

WIMS ’11

11 / 19

Evaluation Studies

Evaluation Studies

WIMS ’11

12 / 19

Evaluation Studies

Distance functions

Distance-Matrix Visualization I

A distance matrix contains all pairwise distances of the objects to be clustered, e.g. a b c a 0 1.9 1.1 b 1.9 0 2.3 c 1.1 2.3 0

WIMS ’11

13 / 19

Evaluation Studies

Distance functions

57-65

44-54

66-68

29-43

1-19

20-28

55-56

WIMS ’11

13 / 19

Evaluation Studies

Distance functions

57-65

44-54

66-68

29-43

Numbers I indicate the position of particular web 55-56 in the source code objects

1-19

20-28

WIMS ’11

13 / 19

Evaluation Studies

Distance functions

Distance-Matrix Visualization 1

1

1

10

10

10

20

20

20

30

30

30

40

40

40

50

50

50

60

60

60

10

1

20

30

40

a) DOM-Distance

50

60

1

10

20

30

40

50

60

b) Geometric-Distance

1

10

20

30

40

50

60

c) Semantic-Distance

I

left and bottom axe represent the web objects ordered by their appearance on the web page

I

each pixel represents a distance value

I

white means lowest distance, black means highest distance

I

bright squares in the diagonal indicate possible page blocks WIMS ’11

13 / 19

Evaluation Studies

Distance functions

57-65

44-54

66-68

29-43

1-19

20-28

55-56

WIMS ’11

13 / 19

Evaluation Studies

Distance functions

Distance-Matrix Visualization 1

1

1

10

10

10

20

20

20

30

30

30

40

40

40

50

50

50

60

60

60

1

10

20

30

40

a) DOM-Distance

50

60

1

10

20

30

40

50

b) Geometric-Distance

60

1

10

20

30

40

50

60

c) Semantic-Distance

WIMS ’11

13 / 19

Evaluation Studies

Distance functions

Distance-Matrix Visualization 1

1

1

10

10

10

20

20

20

30

30

30

40

40

40

50

50

50

60

60

60

10

1

20

30

40

a) DOM-Distance

50

60

1

10

20

30

40

50

60

b) Geometric-Distance

1

10

20

30

40

50

60

c) Semantic-Distance

Results: I

DOM-distance has good correspondence

WIMS ’11

13 / 19

Evaluation Studies

Distance functions

Distance-Matrix Visualization 1

1

1

10

10

10

20

20

20

30

30

30

40

40

40

50

50

50

60

60

60

10

1

20

30

40

a) DOM-Distance

50

60

1

10

20

30

40

50

60

b) Geometric-Distance

1

10

20

30

40

50

60

c) Semantic-Distance

Results: I

DOM-distance has good correspondence

I

Geometric distance has some correspondence, but there are other bright rectangles

WIMS ’11

13 / 19

Evaluation Studies

Distance functions

57-65

44-54

66-68

29-43

1-19

20-28

55-56

WIMS ’11

13 / 19

Evaluation Studies

Distance functions

Distance-Matrix Visualization 1

1

1

10

10

10

20

20

20

30

30

30

40

40

40

50

50

50

60

60

60

10

1

20

30

40

a) DOM-Distance

50

60

1

10

20

30

40

50

60

b) Geometric-Distance

1

10

20

30

40

50

60

c) Semantic-Distance

Results: I

DOM-distance has good correspondence

I

Geometric distance has some correspondence, but there are other bright rectangles

I

Semantic distance has almost no correspondence WIMS ’11

13 / 19

Evaluation Studies

Clustering

Evaluation - Clustering Performance

Dataset I

78 web documents from 8 different categories from Yahoo! directory

I

23,819 web contents (in average 305 per web page)

I

Web contents clustered manually by 3 volunteers

I

Ground truth is combination of all three proposals

Method I For each combination of clustering method & distance function I

I

compute clustering of the web contents into page blocks

Ground Truth Clustering (GT), Computed Clustering (C)

WIMS ’11

14 / 19

Evaluation Studies

Clustering

Evaluation - Performance Measure Based on Contingency Table for pairs of objects: Same cluster in GT Different cluster in GT

Same cluster in C f11 f01

Different cluster in C f10 f00

Performance Measure Rand statistic =

f00 + f11 f00 + f01 + f10 + f11

WIMS ’11

15 / 19

Evaluation Studies

Clustering

Average Rand Statistic Results

partitioning hierarchical density-based

DOM-based 0.45 0.52 0.61

geometric 0.47 0.41 0.43

semantic 0.25 0.24 0.27

Results: I Rand statistic values are similar to the results of Distance Matrix Visualization I DOM distance reaches highest values with DB clustering I

I

the distances between together belonging objects are varying in the metrical space derived by DOM distance DB clustering is able to find clusters of different densities

WIMS ’11

16 / 19

Conclusion and Future work

Conclusion and Future Work

WIMS ’11

17 / 19

Conclusion and Future work

Conclusion I Web Page Segmentation by Clustering was presented I

three different distance function for web objects based on geometric, semantic and DOM properties

I

three clustering methods from different categories: partitioning, hierarchical and density-based clustering

I

best clustering accordance to ground truth with DOM-Distance and DB clustering

Future Work I combination of distance measure (linear, multiplicative, ?) I

comparison to other Web Page Segmentation methods from literature

I

application of Web Page Segmentation to Web Image Context Extraction (paper accepted, to be published) WIMS ’11

18 / 19

Conclusion and Future work

Thank You!

WIMS ’11

19 / 19

Suggest Documents