Page Segmentation by Web Content Clustering

Page Segmentation by Web Content Clustering Sadet Alcic Heinrich-Heine-University of Duesseldorf Department of Computer Science Institute for Databases and Information Systems

May 26, 2011

WIMS ’11

1 / 19

Outline 1

Introduction Motivation Related Work

2

Web Page Segmentation by Clustering General Idea Distance functions for web contents Clustering methods

3

Evaluation Studies Distance functions Clustering

4

Conclusion and Future work WIMS ’11

2 / 19

Introduction

Motivation

Motivation

WIMS ’11

3 / 19

Introduction

Motivation

Motivation

Web Page is cluttered with different contents I Different news articles I

Link lists

I

Commercials

I

Template elements

I

Functional elements

WIMS ’11

3 / 19

Introduction

Motivation

Motivation

Web Page Segmentation I Separation of web contents into structural and semantic cohesive blocks

WIMS ’11

3 / 19

Introduction

Motivation

Motivation

WIMS ’11

3 / 19

Introduction

Motivation

Motivation

Applications I Web Content Search I

Web Page Categorization

I

Web Page Adaptation for Mobile Devices

I

Web Image Indexing

I

...

WIMS ’11

3 / 19

Introduction

Related Work

Overview of Related Work to Web Page Segmentation I

TOP-DOWN page segmentation: I

I

I

I

KDD’02: Lin and Ho. Discovering Informative Content Blocks from Web Documents (Table properties) APWeb’03: Cai et al. Extracting content structure for web pages based on visual representation (Heuristic rules on visual and DOM properties) TKDE’05: Kao et al. Web Intrapage Informative Structure Mining Based on DOM (Term entropy based on heuristics)

BOTTOM-UP page segmentation: I

I

I

CIKM’02: Li et al. Using Micro Information Units for Internet Search (Heuristic rules) WWW’08: Chakrabarti et al. A graph-theoretic approach to webpage segmentation (Graph partitioning) CIKM’08: Kohlschuetter and Nejdl. A densitometric approach to web page segmentation (Partitioning of a histogram of text density) WIMS ’11

4 / 19

Introduction

Related Work

Overview of Related Work to Web Page Segmentation I

TOP-DOWN page segmentation: I

I

I

KDD’02: Lin and Ho. Discovering Informative Content Blocks from Web Documents (Table properties) APWeb’03: Cai et al. Extracting content structure for web pages based on visual representation (Heuristic rules on visual and DOM properties) TKDE’05: Kao et al. Web Intrapage Informative Structure Mining Based on DOM (Term entropy based on heuristics)

Basic Idea I Start with complete Page as initial block I Decide for each block: I

I

should the block be separated? if yes, where to separate?

! Based on heuristics

WIMS ’11

4 / 19

Introduction

Related Work

Overview of Related Work to Web Page Segmentation Basic Idea I Start with smallest content units (e.g., DOM leafs)

I

I

group them to blocks of coherent content

I

How?


I

I


4 / 19

Introduction

Related Work

Overview of Related Work to Web Page Segmentation Our Approach I belongs to BOTTOM-UP methods

I

I

DOM leafs are used as basic web objects

I

Idea!: group web objects to blocks by clustering


I

I


4 / 19

Web Page Segmentation by Clustering


WIMS ’11

5 / 19


General Idea

Page Segmentation by Clustering General Definition: Clustering I Clustering is the process of organizing objects into groups whose members are similar in some way I

A cluster is therefore a collection of objects which are similar between them and are dissimilar to the objects belonging to other clusters

Open questions addressed in this work I

How can the similarity (or dissimilarity) of web objects be estimated?

I

Which representation is best suitable to represent web objects?

I

Which clustering method should be applied for clustering?

WIMS ’11

6 / 19


Distance functions for web contents

Different Representations of Web objects I

Geometric Representation web browser puts every object of a web page in a 2-dim plane extract the bounding rectangle for each object

I I

I

Semantic Representation elements in DOM contain some textual contents extract keywords from the corresponding text

I I

I

DOM-based Representation each object is a node in the DOM tree of the page use the position of the object in DOM tree to characterize it

I I

A

B

⇒ Different distance measures are possible WIMS ’11

7 / 19



Geometric Distance I

Let R = [(rx , ry ), (rx , , ry , )] and S = [(sx , sy ), (sx , , ry , )] be two bounding rectangles

I

The geometric distance of R to S is given  by , 1  ri − s i P 2 2 s − ri , , with ti = dist(R, S) = i∈x,y ti  i 0 Visually:

I

(0, 0)

if if if

ri > si , ri , < si otherwise.

x (sx, sy)

S (rx, ry)

min

R,S

dist(

)

(sx', sy')

R y

(rx', ry')

WIMS ’11

8 / 19



Semantic Distance I I

Given T1 = (dog, run, street), T2 = (puppy, walk, road) Cosine Similarity Measure (Information Retrieval) I

Lexical word-to-word matching → sim(T1 , T2 ) = 0

I

to strict: e.g. synonym and hyponym relationships are not considered

I

Instead: text similarity measure based on WordNet [Corley 05] I

Words are mapped to concepts in WordNet (concept-to-concept matching)

P sim(T1 , T2 ) =

wi ∈T1

maxSim(wi , T2 ) · idf (wi ) P wi ∈T1 idf (wi )

WIMS ’11

9 / 19



DOM-based Distance 1

4 level = 0 level degree = 2

A 5

2

3

6

B 1

2

level = 1 level degree = 3

C 3

4

5

6


WIMS ’11

10 / 19





A 5

2

3

6

B 1

2


C 3

4

5

6


Requirements I Nodes under same parent are closer than nodes under different parent I

Nodes on higher tree level are closer that nodes on lower level

WIMS ’11

10 / 19





A 5

2

3

I

6

B 1

2


C 3

4

5

6


Traverse DOM-tree in preorder traversing: ) , 6k , Ck , 4k , 5k , 3k , Bk , 1k , 2k P = ( Ak

WIMS ’11

10 / 19





A 5

2

3

I

6

B 1

2


C 3

4

5

6


Traverse DOM-tree in preorder traversing: ) , 6k , Ck , 4k , 5k , 3k , Bk , 1k , 2k P = ( Ak

I

For each element in P define a weight wpi that expresses the costs needed to reach pi from its predecessor in P

WIMS ’11

10 / 19





A 5

2

3

I

6

B 1

2


C 3

4

5

6


The distance between pa , pb ∈ P, (wlog. a < b) is defined as: b X

d(pa , pb ) =

wpi

i=a+1

Example: d( 2k , 4k ) = w3 + wC + w4 WIMS ’11

10 / 19





A 5

2

3

I

6

B 1

2


C 3

4

5

6


The weight wi of a node pi ∈ P depends on the level l and the level degree dl of pi : c : dl = 0 w (l) = (1) dl · w (l + 1) : dl > 0, e.g., w (2) = c, w (1) = 3 ∗ w (2) = 3c, w (0) = 2 ∗ w (1) = 6c, WIMS ’11

10 / 19


Clustering methods

Clustering Methods

I

Partitioning Clustering I

I

Agglomerative Hierarchical Clustering I

I

k-medoid (similar as k-means, but cluster representatives are real objects) single link method applied to compute distance between sets of objects

Density-based Clustering I

DBSCAN variant (able to find clusters of different density levels)

WIMS ’11

11 / 19

Evaluation Studies

Evaluation Studies

WIMS ’11

12 / 19

Evaluation Studies

Distance functions

Distance-Matrix Visualization I

A distance matrix contains all pairwise distances of the objects to be clustered, e.g. a b c a 0 1.9 1.1 b 1.9 0 2.3 c 1.1 2.3 0

WIMS ’11

13 / 19

Evaluation Studies

Distance functions

57-65

44-54

66-68

29-43

1-19

20-28

55-56

WIMS ’11

13 / 19

Evaluation Studies

Distance functions

57-65

44-54

66-68

29-43

Numbers I indicate the position of particular web 55-56 in the source code objects

1-19

20-28

WIMS ’11

13 / 19

Evaluation Studies

Distance functions

Distance-Matrix Visualization 1

1

1

10

10

10

20

20

20

30

30

30

40

40

40

50

50

50

60

60

60

10

1

20

30

40

a) DOM-Distance

50

60

1

10

20

30

40

50

60

b) Geometric-Distance

1

10

20

30

40

50

60

c) Semantic-Distance

I

left and bottom axe represent the web objects ordered by their appearance on the web page

I

each pixel represents a distance value

I

white means lowest distance, black means highest distance

I

bright squares in the diagonal indicate possible page blocks WIMS ’11

13 / 19

Evaluation Studies

Distance functions

57-65

44-54

66-68

29-43

1-19

20-28

55-56

WIMS ’11

13 / 19

Evaluation Studies

Distance functions


1

1

10

10

10

20

20

20

30

30

30

40

40

40

50

50

50

60

60

60

1

10

20

30

40

a) DOM-Distance

50

60

1

10

20

30

40

50


60

1

10

20

30

40

50

60


WIMS ’11

13 / 19

Evaluation Studies

Distance functions


1

1

10

10

10

20

20

20

30

30

30

40

40

40

50

50

50

60

60

60

10

1

20

30

40

a) DOM-Distance

50

60

1

10

20

30

40

50

60


1

10

20

30

40

50

60


Results: I

DOM-distance has good correspondence

WIMS ’11

13 / 19

Evaluation Studies

Distance functions


1

1

10

10

10

20

20

20

30

30

30

40

40

40

50

50

50

60

60

60

10

1

20

30

40

a) DOM-Distance

50

60

1

10

20

30

40

50

60


1

10

20

30

40

50

60


Results: I


I

Geometric distance has some correspondence, but there are other bright rectangles

WIMS ’11

13 / 19

Evaluation Studies

Distance functions

57-65

44-54

66-68

29-43

1-19

20-28

55-56

WIMS ’11

13 / 19

Evaluation Studies

Distance functions


1

1

10

10

10

20

20

20

30

30

30

40

40

40

50

50

50

60

60

60

10

1

20

30

40

a) DOM-Distance

50

60

1

10

20

30

40

50

60


1

10

20

30

40

50

60


Results: I


I

Geometric distance has some correspondence, but there are other bright rectangles

I

Semantic distance has almost no correspondence WIMS ’11

13 / 19

Evaluation Studies

Clustering

Evaluation - Clustering Performance

Dataset I

78 web documents from 8 different categories from Yahoo! directory

I

23,819 web contents (in average 305 per web page)

I

Web contents clustered manually by 3 volunteers

I

Ground truth is combination of all three proposals

Method I For each combination of clustering method & distance function I

I

compute clustering of the web contents into page blocks

Ground Truth Clustering (GT), Computed Clustering (C)

WIMS ’11

14 / 19

Evaluation Studies

Clustering

Evaluation - Performance Measure Based on Contingency Table for pairs of objects: Same cluster in GT Different cluster in GT

Same cluster in C f11 f01

Different cluster in C f10 f00

Performance Measure Rand statistic =

f00 + f11 f00 + f01 + f10 + f11

WIMS ’11

15 / 19

Evaluation Studies

Clustering

Average Rand Statistic Results

partitioning hierarchical density-based

DOM-based 0.45 0.52 0.61

geometric 0.47 0.41 0.43

semantic 0.25 0.24 0.27

Results: I Rand statistic values are similar to the results of Distance Matrix Visualization I DOM distance reaches highest values with DB clustering I

I

the distances between together belonging objects are varying in the metrical space derived by DOM distance DB clustering is able to find clusters of different densities

WIMS ’11

16 / 19

Conclusion and Future work

Conclusion and Future Work

WIMS ’11

17 / 19


Conclusion I Web Page Segmentation by Clustering was presented I

three different distance function for web objects based on geometric, semantic and DOM properties

I

three clustering methods from different categories: partitioning, hierarchical and density-based clustering

I

best clustering accordance to ground truth with DOM-Distance and DB clustering

Future Work I combination of distance measure (linear, multiplicative, ?) I

comparison to other Web Page Segmentation methods from literature

I

application of Web Page Segmentation to Web Image Context Extraction (paper accepted, to be published) WIMS ’11

18 / 19


Thank You!

WIMS ’11

19 / 19

Page Segmentation by Web Content Clustering

Page Segmentation by Web Content Clustering

Suggest Documents

Web Page Segmentation - eMine - METU

Web Communities Defined by Web Page Content - Ajith Abraham

Web Page Segmentation & Structure Analysis for Eliminating

Web Page Segmentation Evaluation - ACM Digital Library

Automatic Web Page Segmentation and Noise

Web page clustering using Query Directed Clustering ... - Google Sites

Web Page Segmentation & Structure Analysis for Eliminating ...

Segmentation Based, Personalized Web Page Summarization Model

Segmentation and Clustering

Clustering-Aided Page Object Generation for Web

SEGMENTATION USING CLUSTERING METHODS

Hierarchical Image Segmentation by Structural Content - CiteSeerX

Clustering Visually Similar Web Page Elements for Structured Web

Web Page Clustering Using Heuristic Search in the Web Graph

Hierarchical Web-page Clustering via In-page and Cross ... - CiteSeerX

segmentation using clustering methods

Image segmentation through contextual clustering

Discriminative Clustering for Market Segmentation

Manifold clustering for motion segmentation

Skin Lesion Segmentation Using Clustering

An Approach for Content Retrieval from Web Pages Using Clustering ...

Clustering web content for efficient replication - Semantic Scholar

VisuAl ContEnt BAsEd ClustEring of NEAr DuplicAtE WEB SEArcH

Sky Segmentation by Fusing Clustering with Neural Networks ...