Page Segmentation by Web Content Clustering
Sadet Alcic
Heinrich-Heine-University of Duesseldorf, Department of Computer Science, Institute for Databases and Information Systems
May 26, 2011
WIMS ’11
Outline
1. Introduction
   - Motivation
   - Related Work
2. Web Page Segmentation by Clustering
   - General Idea
   - Distance functions for web contents
   - Clustering methods
3. Evaluation Studies
   - Distance functions
   - Clustering
4. Conclusion and Future work
Introduction
Motivation
A web page is cluttered with different kinds of content:
- Different news articles
- Link lists
- Commercials
- Template elements
- Functional elements
Web Page Segmentation
- Separation of web contents into structurally and semantically cohesive blocks
Applications:
- Web Content Search
- Web Page Categorization
- Web Page Adaptation for Mobile Devices
- Web Image Indexing
- ...
Related Work
Overview of Related Work on Web Page Segmentation
TOP-DOWN page segmentation:
- KDD’02: Lin and Ho. Discovering Informative Content Blocks from Web Documents (Table properties)
- APWeb’03: Cai et al. Extracting content structure for web pages based on visual representation (Heuristic rules on visual and DOM properties)
- TKDE’05: Kao et al. Web Intrapage Informative Structure Mining Based on DOM (Term entropy based on heuristics)
BOTTOM-UP page segmentation:
- CIKM’02: Li et al. Using Micro Information Units for Internet Search (Heuristic rules)
- WWW’08: Chakrabarti et al. A graph-theoretic approach to webpage segmentation (Graph partitioning)
- CIKM’08: Kohlschuetter and Nejdl. A densitometric approach to web page segmentation (Partitioning of a histogram of text density)
Basic idea of TOP-DOWN segmentation:
- Start with the complete page as the initial block
- Decide for each block:
  - should the block be separated?
  - if yes, where to separate?
- The decisions are based on heuristics
Basic idea of BOTTOM-UP segmentation:
- Start with the smallest content units (e.g., DOM leaves)
- Group them into blocks of coherent content
  - How?
Our Approach
- belongs to the BOTTOM-UP methods
- DOM leaves are used as the basic web objects
- Idea: group web objects into blocks by clustering
Web Page Segmentation by Clustering
General Idea
Page Segmentation by Clustering
General definition of clustering:
- Clustering is the process of organizing objects into groups whose members are similar in some way
- A cluster is therefore a collection of objects that are similar to each other and dissimilar to the objects belonging to other clusters
Open questions addressed in this work:
- How can the similarity (or dissimilarity) of web objects be estimated?
- Which representation is best suited to represent web objects?
- Which clustering method should be applied?
Distance functions for web contents
Different Representations of Web Objects
- Geometric representation
  - the web browser places every object of a web page in a 2-dimensional plane
  - extract the bounding rectangle of each object
- Semantic representation
  - elements in the DOM contain textual content
  - extract keywords from the corresponding text
- DOM-based representation
  - each object is a node in the DOM tree of the page
  - use the position of the object in the DOM tree to characterize it
⇒ Different distance measures are possible
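Geometric Distance
- Let R = [(r_x, r_y), (r_x', r_y')] and S = [(s_x, s_y), (s_x', s_y')] be two bounding rectangles
- The geometric distance of R to S is given by

\[
\mathrm{dist}(R, S) = \Big( \sum_{i \in \{x, y\}} t_i^2 \Big)^{1/2},
\quad \text{with} \quad
t_i = \begin{cases}
r_i - s_i' & \text{if } r_i > s_i' \\
s_i - r_i' & \text{if } r_i' < s_i \\
0 & \text{otherwise.}
\end{cases}
\]

- Visually: [Figure: rectangles R and S in a coordinate system with origin (0, 0) and axes x and y; dist(R, S) is the minimum distance between the two rectangles.]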
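The geometric distance can be sketched in a few lines of Python. This is an illustration only, not the author's implementation: the `Rect` type, its field names and the helper `axis_gap` are hypothetical, and the rectangles are assumed to be axis-aligned with (x, y) as the top-left and (x2, y2) as the bottom-right corner.

```python
import math
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned bounding rectangle (hypothetical helper type)."""
    x: float   # left edge,   r_x
    y: float   # top edge,    r_y
    x2: float  # right edge,  r_x'
    y2: float  # bottom edge, r_y'


def axis_gap(r_lo: float, r_hi: float, s_lo: float, s_hi: float) -> float:
    """Gap t_i between the intervals [r_lo, r_hi] and [s_lo, s_hi] on one axis."""
    if r_lo > s_hi:   # R starts after S ends
        return r_lo - s_hi
    if r_hi < s_lo:   # R ends before S starts
        return s_lo - r_hi
    return 0.0        # the intervals overlap on this axis


def geometric_distance(r: Rect, s: Rect) -> float:
    """Euclidean length of the per-axis gaps (t_x, t_y)."""
    tx = axis_gap(r.x, r.x2, s.x, s.x2)
    ty = axis_gap(r.y, r.y2, s.y, s.y2)
    return math.hypot(tx, ty)
```

Two overlapping rectangles get distance 0, and rectangles that are offset diagonally get the distance between their nearest corners.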
Semantic Distance
- Given T1 = (dog, run, street), T2 = (puppy, walk, road)
- Cosine similarity measure (Information Retrieval)
  - lexical word-to-word matching → sim(T1, T2) = 0
  - too strict: e.g., synonym and hyponym relationships are not considered
- Instead: text similarity measure based on WordNet [Corley 05]
  - words are mapped to concepts in WordNet (concept-to-concept matching)

\[
\mathrm{sim}(T_1, T_2) = \frac{\sum_{w_i \in T_1} \mathrm{maxSim}(w_i, T_2) \cdot \mathrm{idf}(w_i)}{\sum_{w_i \in T_1} \mathrm{idf}(w_i)}
\]
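The formula can be illustrated with a small Python sketch. It implements only the one-directional sum shown on the slide. The word-to-word similarity is passed in as a function; the lookup table `TOY_SIM` and its scores are invented stand-ins for a WordNet concept similarity, and a missing idf value is assumed to default to 1.0.

```python
from typing import Callable, Dict, Sequence


def directional_similarity(
    t1: Sequence[str],
    t2: Sequence[str],
    word_sim: Callable[[str, str], float],
    idf: Dict[str, float],
) -> float:
    """idf-weighted average of the best match in t2 for every word of t1."""
    weighted = 0.0
    total = 0.0
    for w in t1:
        weight = idf.get(w, 1.0)  # assumption: unknown words get idf 1.0
        # maxSim(w, T2): best similarity of w to any word of t2
        best = max((word_sim(w, v) for v in t2), default=0.0)
        weighted += best * weight
        total += weight
    return weighted / total if total else 0.0


# Invented scores standing in for a WordNet-based concept similarity.
TOY_SIM = {
    frozenset({"dog", "puppy"}): 0.9,
    frozenset({"run", "walk"}): 0.6,
    frozenset({"street", "road"}): 0.8,
}


def toy_word_sim(a: str, b: str) -> float:
    return 1.0 if a == b else TOY_SIM.get(frozenset({a, b}), 0.0)


def exact_match(a: str, b: str) -> float:
    """Purely lexical matching, as in the cosine-similarity case."""
    return 1.0 if a == b else 0.0
```

With `exact_match` the two example texts have similarity 0, which mirrors the lexical case on the slide; with `toy_word_sim` the related word pairs contribute to the score.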
DOM-based Distance
[Figure: example DOM tree with the inner nodes A, B, C and the leaf nodes 1 to 6. Level 0 has level degree 2, level 1 has level degree 3, level 2 has level degree 0.]
Requirements:
- Nodes under the same parent are closer than nodes under different parents
- Nodes on a higher tree level are closer than nodes on a lower level
- Traverse the DOM tree in preorder: P = (A, B, 1, 2, 3, C, 4, 5, 6)
- For each element p_i in P, define a weight w_{p_i} that expresses the cost of reaching p_i from its predecessor in P
- The distance between p_a, p_b ∈ P (w.l.o.g. a < b) is defined as:

\[
d(p_a, p_b) = \sum_{i=a+1}^{b} w_{p_i}
\]

- Example: d(2, 4) = w_3 + w_C + w_4
- The weight w_i of a node p_i ∈ P depends on the level l and the level degree d_l of p_i:

\[
w(l) = \begin{cases}
c & : d_l = 0 \\
d_l \cdot w(l + 1) & : d_l > 0
\end{cases}
\qquad (1)
\]

e.g., w(2) = c, w(1) = 3 · w(2) = 3c, w(0) = 2 · w(1) = 6c
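The DOM-based distance can be traced with a short Python sketch. Several points are assumptions made for illustration: the tree is stored as a hypothetical dictionary from node name to ordered children, the example tree has root A with children B (leaves 1, 2, 3) and C (leaves 4, 5, 6), and the level degree d_l is taken to be the maximum number of children of any node on level l.

```python
from typing import Dict, List, Tuple

Tree = Dict[str, List[str]]  # node -> ordered list of children

# Assumed shape of the example tree from the slides.
EXAMPLE_TREE: Tree = {"A": ["B", "C"], "B": ["1", "2", "3"], "C": ["4", "5", "6"]}


def preorder(tree: Tree, root: str) -> List[Tuple[str, int]]:
    """Return (node, level) pairs in preorder."""
    out: List[Tuple[str, int]] = []
    stack = [(root, 0)]
    while stack:
        node, level = stack.pop()
        out.append((node, level))
        # push children reversed so the leftmost child is visited first
        for child in reversed(tree.get(node, [])):
            stack.append((child, level + 1))
    return out


def level_weights(tree: Tree, root: str, c: float = 1.0) -> List[float]:
    """w(l) = c if d_l == 0, else d_l * w(l + 1), computed bottom-up."""
    order = preorder(tree, root)
    max_level = max(level for _, level in order)
    degree = [0] * (max_level + 1)
    for node, level in order:
        # assumption: level degree = largest child count on that level
        degree[level] = max(degree[level], len(tree.get(node, [])))
    weights = [0.0] * (max_level + 1)
    for level in range(max_level, -1, -1):
        if degree[level] == 0:
            weights[level] = c
        else:
            weights[level] = degree[level] * weights[level + 1]
    return weights


def dom_distance(tree: Tree, root: str, a: str, b: str, c: float = 1.0) -> float:
    """Sum of the weights of all nodes after the earlier and up to the later node in P."""
    order = preorder(tree, root)
    weights = level_weights(tree, root, c)
    index = {node: i for i, (node, _) in enumerate(order)}
    lo, hi = sorted((index[a], index[b]))
    return sum(weights[level] for _, level in order[lo + 1 : hi + 1])
```

For the assumed tree and c = 1, the distance between the leaves 2 and 4 is w_3 + w_C + w_4 = 1 + 3 + 1 = 5, while two neighbouring siblings such as 1 and 2 are at distance 1.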
Clustering methods
- Partitioning clustering
  - k-medoid (similar to k-means, but cluster representatives are real objects)
- Agglomerative hierarchical clustering
  - single-link method applied to compute the distance between sets of objects
- Density-based clustering
  - DBSCAN variant (able to find clusters of different density levels)
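As an illustration of the hierarchical variant, the following Python sketch runs single-link agglomerative clustering on a precomputed distance matrix. The stopping rule (a fixed distance `threshold`) and the toy matrix are assumptions; the slides do not state how the hierarchy is cut.

```python
from typing import List, Sequence


def single_link_clusters(
    dist: Sequence[Sequence[float]], threshold: float
) -> List[List[int]]:
    """Merge the two closest clusters until their distance exceeds threshold."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        best = None  # (distance, index of first cluster, index of second cluster)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single link: distance of the closest pair of members
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:
            break
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Toy distance matrix for three objects (values chosen for illustration).
TOY_MATRIX = [
    [0.0, 1.9, 1.1],
    [1.9, 0.0, 2.3],
    [1.1, 2.3, 0.0],
]
```

With a threshold of 1.5 the first and third object are merged and the second stays alone; with a threshold of 2.0 all three end up in one cluster. The function accepts any of the three distance functions, as long as the pairwise distances are computed beforehand.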
Evaluation Studies
Distance functions
Distance-Matrix Visualization
- A distance matrix contains all pairwise distances of the objects to be clustered, e.g.

        a     b     c
  a     0     1.9   1.1
  b     1.9   0     2.3
  c     1.1   2.3   0
[Figure: example web page with its blocks labeled by object ranges 1-19, 20-28, 29-43, 44-54, 55-56, 57-65 and 66-68.]
- Numbers indicate the position of the particular web objects in the source code
[Figure: distance-matrix visualizations of the example page for a) DOM-Distance, b) Geometric-Distance, c) Semantic-Distance; both axes list the web objects (ticks from 1 to 60).]
- the left and bottom axes represent the web objects, ordered by their appearance on the web page
- each pixel represents a distance value
- white means lowest distance, black means highest distance
- bright squares along the diagonal indicate possible page blocks
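The pixel encoding described above can be sketched as follows. The linear min-max scaling to 8-bit gray values is an assumption; the slides only state that white marks the lowest and black the highest distance.

```python
from typing import List, Sequence


def to_grayscale(dist: Sequence[Sequence[float]]) -> List[List[int]]:
    """Map distances to gray values: 255 (white) = lowest, 0 (black) = highest."""
    flat = [d for row in dist for d in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # all distances are equal, so draw everything white
        return [[255 for _ in row] for row in dist]
    return [[round(255 * (1 - (d - lo) / span)) for d in row] for row in dist]
```

Applied to a distance matrix whose rows and columns follow the order of the objects on the page, objects that are close to their neighbours show up as bright squares along the diagonal.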
Results:
- DOM distance corresponds well to the page blocks
- Geometric distance shows some correspondence, but there are additional bright rectangles
- Semantic distance shows almost no correspondence
Clustering
Evaluation - Clustering Performance
Dataset:
- 78 web documents from 8 different categories of the Yahoo! directory
- 23,819 web contents (on average 305 per web page)
- Web contents were clustered manually by 3 volunteers
- The ground truth is a combination of all three proposals
Method:
- For each combination of clustering method and distance function:
  - compute a clustering of the web contents into page blocks
  - compare the Ground Truth Clustering (GT) with the Computed Clustering (C)
Evaluation - Performance Measure
Based on a contingency table for pairs of objects:

                            Same cluster in GT   Different cluster in GT
  Same cluster in C         f11                  f01
  Different cluster in C    f10                  f00

Performance measure:

\[
\text{Rand statistic} = \frac{f_{00} + f_{11}}{f_{00} + f_{01} + f_{10} + f_{11}}
\]
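The Rand statistic can be computed directly from two label lists, as in this Python sketch. The representation of a clustering as one cluster label per object is an assumption made for illustration.

```python
from itertools import combinations
from typing import Hashable, Sequence


def rand_statistic(gt: Sequence[Hashable], c: Sequence[Hashable]) -> float:
    """Share of object pairs on which ground truth (gt) and computed clustering (c) agree."""
    if len(gt) != len(c):
        raise ValueError("both clusterings must label the same objects")
    if len(gt) < 2:
        raise ValueError("at least two objects are needed to form a pair")
    f11 = f10 = f01 = f00 = 0
    for i, j in combinations(range(len(gt)), 2):
        same_gt = gt[i] == gt[j]
        same_c = c[i] == c[j]
        if same_gt and same_c:
            f11 += 1  # same cluster in GT and in C
        elif same_gt:
            f10 += 1  # same cluster in GT, different cluster in C
        elif same_c:
            f01 += 1  # different cluster in GT, same cluster in C
        else:
            f00 += 1  # different cluster in both
    return (f00 + f11) / (f00 + f01 + f10 + f11)
```

For `gt = [0, 0, 1, 1]` and `c = [0, 0, 1, 2]`, five of the six pairs are treated alike by both clusterings, so the Rand statistic is 5/6. The actual label values do not matter, only which objects share a label.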
Average Rand Statistic Results

               partitioning   hierarchical   density-based
  DOM-based    0.45           0.52           0.61
  geometric    0.47           0.41           0.43
  semantic     0.25           0.24           0.27
Results:
- Rand statistic values agree with the findings of the distance-matrix visualization
- DOM distance reaches the highest values with density-based (DB) clustering
  - the distances between objects that belong together vary in the metric space derived from the DOM distance
  - DB clustering is able to find clusters of different densities
Conclusion and Future work
Conclusion
- Web page segmentation by clustering was presented
  - three different distance functions for web objects based on geometric, semantic and DOM properties
  - three clustering methods from different categories: partitioning, hierarchical and density-based clustering
  - best agreement with the ground truth was reached with DOM-Distance and DB clustering
Future Work
- combination of distance measures (linear, multiplicative, ?)
- comparison with other web page segmentation methods from the literature
- application of web page segmentation to Web Image Context Extraction (paper accepted, to be published)
Thank You!