Web Page Segmentation Evaluation∗

Andrés Sanoja
Université Pierre et Marie Curie
4 place Jussieu, Paris, France
[email protected]

Stéphane Gançarski
Université Pierre et Marie Curie
4 place Jussieu, Paris, France
[email protected]
ABSTRACT
In this paper, we present a framework for evaluating segmentation algorithms for Web pages. Web page segmentation consists in dividing a Web page into coherent fragments, called blocks. Each block represents one distinct information element in the page. We define an evaluation model that includes different metrics to evaluate the quality of a segmentation obtained with a given algorithm. Those metrics compute the distance between the obtained segmentation and a manually built segmentation that serves as a ground truth. We apply our framework to four state-of-the-art segmentation algorithms (BoM, BlockFusion, VIPS and jVIPS) on several categories (types) of Web pages. Results show that the tested algorithms usually perform rather well for text extraction, but may have serious problems for the extraction of geometry. They also show that the relative quality of a segmentation algorithm depends on the category of the segmented page.

1. INTRODUCTION
Web pages are becoming more complex than ever, as they are usually not designed manually but generated by Content Management Systems (CMS). Thus, automatically identifying different elements of Web pages, such as main content, menus, user comments, or advertising, becomes difficult. A solution to this issue is given by Web page segmentation. Web page segmentation refers to the process of dividing a Web page into visually and semantically coherent segments called blocks. Detecting these blocks is a crucial step for many applications, such as mobile applications [20], information retrieval [5], and Web archiving [14], among others. In the context of Web archiving, segmentation can be used to extract the interesting parts of a page to be stored. By giving relative weights to blocks according to their importance, it also allows the detection of relevant changes (changes in important blocks) between distinct versions of a page [12]. This is useful for crawl optimization, as it allows tuning crawlers so that they revisit pages with important changes more often [15]. It also helps control curation actions, by comparing the page version before and after the action.
Several Web page segmentation methods have been proposed over the last decades, as detailed in Section 2, most of them coming along with an ad hoc evaluation method. When studying the literature on Web page segmentation, we noticed that there is no full comparative evaluation of Web page segmentation algorithms. The different approaches cannot be directly compared due to a wide diversity of goals, the lack of a common dataset, and the lack of meaningful quantitative evaluation schemes. Hence, there is a need to define common measures to better evaluate Web page segmentation algorithms. To this end, we also investigate, in Section 2, evaluation methods for the related and widely studied area of document processing systems.
In this paper, we propose a new Web page segmentation evaluation method based on a ground truth. Whichever algorithm is used to segment a Web page, the output can be expressed in terms of block geometry, block content, and page layout. In our framework, these are all taken into account to evaluate the segmentation. We define a model and a method for the evaluation of Web page segmentation, adapted from [17], which was designed for scanned page segmentation. Their work allows measuring the quality of a segmentation using a block correspondence graph between two segmentations (the computed one and the ground truth). Blocks are represented as rectangles, associated with the quantity of elements that these regions cover. Four representative Web page segmentation algorithms were used to perform the evaluation. A dataset was built as a ground truth, with pages being manually segmented and assigned to one page type (blog, forum, . . .). We also present a method for ground truth construction that eases the annotation (manual segmentation) of Web pages. Each segmentation algorithm was evaluated with respect to the ground truth. Computed segmentations and ground truth segmentations are compared according to the six metrics of our evaluation method. Results show that the ranking among algorithms may vary according to the Web page type.
This paper is organized as follows. In the next section, we study the related work. Section 3 presents the segmentation algorithms that we evaluated. Section 4 describes the evaluation model. In Section 5, we describe the experimental setup and the collection used. Section 6 presents the experiments and results. Section 7 concludes the paper.

∗This work was partially supported by the SCAPE Project. The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137).
2. RELATED WORK
In this section, we first present the different approaches for Web page segmentation, then we study the state of the art for the evaluation of segmentation correctness.

2.1 Web page segmentation
One of the first efforts in structuring the information of Web pages was the creation of wrappers for information extraction [1]. The authors intended to use those wrappers to apply existing database models and query languages to semi-structured documents (e.g. Web pages). The concept of fragment is introduced by [2] as "a portion of a web page which has a distinct theme or functionality and is distinguishable from the other parts of the page". Moreover, the authors introduce the notion of "interesting fragments", those shared with other pages served from the same web site. Different terms have been used in the literature as synonyms of fragment: web element, logical unit, block, sub-page, segment, component, coherent region, pagelet, among others.
Since the early 2000s, Web page segmentation has been a very active research area. Several works were published, showing that the visual aspect of the page is a key for segmenting Web pages. For instance, [3] divides a Web page into nine segments using a decision tree that relies on an information gain measure and geometric features. [5] describes the Vision-based Page Segmentation (VIPS) algorithm, which computes, for each candidate block, a Degree of Coherence (DoC) using heuristic rules based on the DOM representation of a page as well as visual features. Blocks are generated when their DoC meets a predefined value (which is a parameter of the algorithm). [13] presents an adaptation of the VIPS algorithm in Java. It follows the same heuristics as the original algorithm; however, the results are not exactly equal because there are differences in the detection of vertical separators. [8] proposes a Web page analysis method, based on support vector machine rules, to detect parts of the page that belong to high-level content blocks (header, body, footer or sidebars) and, for each of those parts, applies explicit and implicit separation detection heuristics to refine blocks.
Later on, the need for formalizing the problem to go beyond heuristic methods inspired other approaches. [7] faced the page segmentation issue from a graph-theoretic point of view. DOM nodes are organized into a complete weighted graph and the edge weights estimate the costs needed to gather the connected nodes into one block. [10] represents atomic content units of a Web page by a quantitative linguistic measure of text density and reduces the segmentation problem to solving a 1D-partitioning task. They propose an iterative block fusion algorithm (BlockFusion), applying methods adapted from computer vision. [16] presents the Block-o-Matic (BoM) algorithm. It is a hybrid approach combining the vision-based Web page segmentation approach and the document processing model from the computer vision domain. The segmentation is presented in three structures: DOM, content and logic structure. Each one represents a different perspective of the segmentation, the logic structure being the final segmentation.
To sum up, several approaches have been developed for page segmentation, most of them described in [21]. However, a question remains: how well do these approaches correctly identify segments in modern Web pages? This question raises the issue of evaluating page segmentation methods.

2.2 Segmentation correctness evaluation
Different interpretations of correctness can be used with respect to segmentation. As defined in the literature, the correctness of an algorithm is asserted when it complies with a given specification. The problem here is that such a specification cannot be established a priori, without human judgement. Thus, we focus on evaluation approaches based on a ground truth. We also investigated the correctness issue in the related domain of scanned page segmentation, since the issue of evaluating such systems is quite similar to our problem.
Segmentation issues have been addressed for almost thirty years in the optical character recognition (OCR) domain [6]. The automatic evaluation of (scanned) page segmentation algorithms is a very well studied topic. Several authors have obtained good results in performance and accuracy evaluation, as well as in measuring quality assurance [9, 22, 4]. There are common problems in the evaluation of Web page and scanned page segmentation algorithms: the lack of a common dataset, a wide diversity of goals/applications, a lack of meaningful quantitative evaluation, and inconsistencies in the use of document models. This observation led us to closely study how segmentation is evaluated for scanned pages. Although Web pages and scanned pages are different (pixels/colors vs. elements/text), the way they are analysed and the result of their segmentation are similar. In both cases, blocks can be organized as a hierarchy or a set of non-overlapping rectangles (Manhattan layout [19]). Given the nature of Web content, almost all the algorithms represent the final segmentation of a Web page as a Manhattan layout. It can be hierarchical [5] or non-hierarchical [7, 10]. The latter can be obtained from the former by only considering the leaves.
There is a wide range of work on automatic evaluation based on a predefined ground truth in the literature. We highlight the work of [17], which measures the quality of a page segmentation by analysing the errors in the text recognized by OCR. [17] presents a vectorial score that identifies the common classes of segmentation errors using a ground truth of annotated scanned pages. Our work is inspired by this paper.
None of the mentioned approaches for Web pages provides an evaluation method that allows direct comparison with others. Authors usually do a qualitative evaluation of their algorithms, asking human assessors to validate the segmentation. Others use analytical approaches, measuring the performance of their algorithms, for example, with cluster correlation metrics such as the Adjusted Rand Index (AdjRand) or Normalized Mutual Information (NMI). Those metrics are well suited for checking whether the segmentation preserves the textual content of pages. The problem is that they do not take the geometric properties into account. Another important issue is that the datasets are not publicly available (or do not exist any more). Some authors provide access to their tools, but it is not the general case. An interesting experience is the work of [11]. They present a method for the quantitative comparison of semantic Web page segmentation algorithms. They also provide two datasets of annotated Web pages that are publicly available (https://github.com/rkrzr/dataset-popular and https://github.com/rkrzr/dataset-random). This approach mainly uses text content comparison in order to perform the match between ground truth and segmentation blocks, which is not enough, since the geometry of blocks plays a key role in the segmentation.
3. SEGMENTATION ALGORITHMS
In this section, we give a short description of the four segmentation algorithms we evaluated.

3.1 BoM (Block-o-Matic)
BoM [16] uses the geometric aspects of a page (W) and a categorization of DOM elements to perform the segmentation. The categorization is the one specified by the HTML5 content categories: DOM elements are evaluated according to these categories instead of their tag names or attributes. First, the elements are filtered, excluding some categories, which produces a segmentation into small blocks. This fine-grained segmentation is used to find composite blocks, which form the layout of the page. Small blocks covered by a composite block are associated with it. In a second round, the small blocks are merged following heuristic rules, producing blocks with a normalized rectangle area greater than or equal to the granularity parameter rD. The outcome is a segmentation tree whose root node is a special composite block that represents the whole page. Non-terminal nodes represent the composite blocks, and terminal nodes the blocks.

3.2 BF (BlockFusion)
The BlockFusion algorithm [10] uses text density as the main heuristic to segment documents. The text density of a block is calculated by taking the number of words in its text and dividing it by the number of lines, where a line is capped at 80 characters. An HTML document is first preprocessed into a list of atomic text blocks and the density is computed for each atomic block. Iteratively, two adjacent blocks are merged if their text densities are below a certain threshold ϑmax. The value of this threshold represents the granularity of the segmentation. The authors report that its optimal value is ϑmax ≈ 0.38, and we take it as is. A simplified sketch of this density-based fusion is given below.
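For illustration, the density heuristic can be sketched in Python as follows. The sketch follows the simplified merge rule stated above (fuse adjacent blocks whose densities are both below ϑmax) rather than the exact fusion criterion of [10], and the function names are ours, not the BoilerPipe API.

# Illustrative only: text density and a naive fusion pass over plain strings.
def text_density(text: str, line_width: int = 80) -> float:
    """Number of words divided by the number of (capped) lines."""
    words = len(text.split())
    lines = max(1, -(-len(text) // line_width))  # ceiling division
    return words / lines

def fuse(blocks: list, theta_max: float = 0.38) -> list:
    """Iteratively merge adjacent low-density blocks."""
    changed = True
    while changed:
        changed, result, i = False, [], 0
        while i < len(blocks):
            if (i + 1 < len(blocks)
                    and text_density(blocks[i]) < theta_max
                    and text_density(blocks[i + 1]) < theta_max):
                result.append(blocks[i] + " " + blocks[i + 1])
                i += 2
                changed = True
            else:
                result.append(blocks[i])
                i += 1
        blocks = result
    return blocks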
3.3 VIPS (Vision-based Web Page Segmentation)
The VIPS algorithm [5] was designed to segment Web pages as a human would do, and thus analyses the rendered version of a Web page. It first builds a vision-based content structure, which analyses the page using the visual cues present in the rendered page instead of the HTML source code. This structure is a 3-tuple consisting of a set of visual blocks, a set of separators, and a function that describes the relationship (shared separators) between each pair of blocks of a page. Separators are, for example, vertical and horizontal lines, images similar to lines, headers and white space. The structure is built by going top-down through the DOM tree and taking both the DOM structure and the visual information (position, color, font size) into account. Separators are detected visually by splitting the page around the visual blocks so that no separator intersects a block. Weights are then assigned to the separators according to predefined heuristic rules. From the visual blocks and the separators, the vision-based content structure of the page is assembled, using the Degree of Coherence (DoC) of each block as the granularity.

3.4 jVIPS (Java VIPS)
jVIPS [13] is another implementation of the VIPS model proposed by Cai [5]. Hence, the granularity parameter is the same as for VIPS: the DoC. jVIPS is implemented in Java using the CSSBox rendering engine. The difference between VIPS and jVIPS resides in two of the heuristic rules, jVIPS prohibiting the splitting of some blocks that VIPS would split. This implies that jVIPS often generates blocks as wide as the Web page. This algorithm has been referenced and used in several projects, so it is worth including it in our evaluation.

4. EVALUATION MODEL
The goal of the evaluation model is to compare an automated segmentation of a Web page W with the corresponding ground truth, in order to determine its quality. Both segmentations are organized as a non-hierarchical Manhattan layout (cf. Section 2.2); in other words, they are flat segmentations. The model is an adaptation to Web pages of the model presented by [17] for scanned page segmentation evaluation. The ground truth is manually designed (we explain in Section 5.3 how it was built for the evaluated collection). The comparison focuses on block geometry and content. Each block B is associated with its bounding rectangle (B.rect) and two values: the number of HTML elements it covers (B.htmlcover) and the text it covers (B.textcover) in the original page W. Note that in this section, all rectangles are modelled as quadruples (x, y, h, w), where x and y are the coordinates of the origin point and h and w are the height and the width of the rectangle. A segmentation W′ of W is defined as follows:

W′ = (Page, granularity)

where Page is a special block that represents the whole page and granularity is a parameter that affects the size of the rectangles in the segmentation. Page is defined as follows:

Page = (rect, htmlcover, textcover, {Block})

where {Block} is the set of blocks that form the segmentation, such that ∀ b ∈ {Block}, b.rect ⊂ Page.rect.
The quality of a segmentation can be measured in two complementary ways:
• Block correspondence: measures how well the blocks of the computed segmentation match those of the ground truth.
• Text covering: measures to which extent the global content (here expressed as the number of words) of the blocks is the same as the content of the page.
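To make the notation concrete, the model just defined (W′, Page, Block and the (x, y, h, w) rectangles) can be transcribed directly into Python. The class and field names simply mirror the notation above; this is an illustration, not code from the evaluation framework.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Rect:
    x: float  # origin
    y: float
    h: float  # height
    w: float  # width

@dataclass
class Block:
    rect: Rect
    htmlcover: int  # number of HTML elements covered
    textcover: int  # number of words covered

@dataclass
class Page(Block):
    blocks: List[Block] = field(default_factory=list)  # {Block}

@dataclass
class Segmentation:  # W' = (Page, granularity)
    page: Page
    granularity: float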
4.1 Measuring block correspondence
The block correspondence indicates whether the block rectangles of a segmentation match those of the ground truth. Consider two segmentations for a page W: a computed one W′P (denoted P in the rest of the section), and the ground truth W′G (denoted G). Figures 1(a) and (b) give respectively an example of G and P. To compute the block correspondence, we build a weighted bipartite graph called the block correspondence graph (BCG) as follows. As seen in Figure 1(c), the nodes of the BCG are the blocks of P and of G. An edge is added between each pair of nodes ni and nj such that the weight w(ni, nj) of the edge is equal to the number of underlying HTML elements and words in the intersection of the regions covered by the rectangles of the two corresponding blocks. If the block rectangles do not overlap in P and G, no edge is added. The procedure that builds the BCG is the following:

Data: nodes ni ∈ G, nj ∈ P
Result: edge (ni, nj) and its weight (if applicable)
if ni.rect is contained in nj.rect then
    create edge (ni, nj);
    w(ni, nj) = ni.htmlcover + ni.textcover;
else if ni.rect contains nj.rect then
    create edge (nj, ni);
    w(ni, nj) = nj.htmlcover + nj.textcover;
else
    /* no edge is created */
    w(ni, nj) = 0;
end

Figure 1: (a) Ground-truth segmentation. (b) Proposed segmentation. (c) BCG.

If the computed segmentation P fits perfectly with the ground-truth segmentation G, then the BCG is a perfect matching, that is, each node in the two parts of the graph has exactly one incident edge. If there are differences between the two segmentations, nodes of P or G may have multiple edges. If more than one edge is incident to a node n in P (resp. in G), n is considered oversegmented (resp. undersegmented). Using these definitions, we can introduce several measures for evaluating the correspondence obtained by a Web page segmentation algorithm. Intuitively, if all blocks in G are in P, the algorithm performs well. If a set of blocks of G is grouped into one block of P, or if one block of G is divided into several blocks of P, then there is an issue with respect to the granularity, but no error. We consider it a segmentation error if a block of the ground truth is not found in the computed segmentation, or if blocks were "invented" by the algorithm. The measures for block correspondence are defined as follows:
1. Total correct segmentation (Tc). The total number of one-to-one matches between P and G. A one-to-one match is defined by a pair of nodes (ni, nj), with ni ∈ G and nj ∈ P, such that w(ni, nj) ≥ tr, where tr is a threshold that defines how well a detected block must match to be considered correct. For instance, in Fig. 1, there is an edge between node 2 and node B and another one between node 2 and node C. However, as the weight w(2, C) is less than tr and the weight w(2, B) is greater than tr, B is considered a correct block. The metric value for the example is Tc = 2. Tc is the main metric for measuring the quality of a segmentation.
2. Oversegmented blocks (Co). The number of G nodes having more than one edge. This metric measures how much a segmentation produced too-small blocks; those small blocks nevertheless fit inside a block of the ground truth. In the example of Fig. 1, node 6 of the ground truth is oversegmented in the proposed segmentation. The metric value is Co = 2 because nodes 6 and 2 are both oversegmented.
3. Undersegmented blocks (Cu). The number of P nodes having more than one edge. The same as above, but for too-big blocks, in which blocks of the ground truth fit. For instance, in Fig. 1, node D of the proposed segmentation is undersegmented with respect to the ground truth, and the value of the metric is Cu = 1.
4. Missed blocks (Cm). The number of G nodes that match no node in P. This metric measures how many blocks of the ground truth are not detected by the segmentation. One example is node 3 in Fig. 1, and the value of the metric is Cm = 1.
5. False alarms (Cf). The number of P nodes that match no node in G. This metric measures how many blocks are "invented" by the segmentation. For instance, in Fig. 1, node I has no correspondent in the ground truth, making the metric value Cf = 1.
Tc is a positive measure; Cm and Cf are negative measures. Co and Cu are in between, as they count "not too serious" errors: the found blocks could match the ground truth if they were aggregated or split. Note that the defined measures cover all the possible cases when considering the matching between G and P. To evaluate the quality of the segmentation, we define a score Cq as the total number of acceptable blocks discovered, i.e. Cq = Tc + Co + Cu.
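A compact Python sketch of the BCG construction and of the five correspondence measures may help fix the definitions. It reuses the Rect/Block classes sketched in Section 4, follows the containment-based procedure above, and assumes that the threshold tr is expressed on the same scale as the edge weights; the handling of below-threshold edges in Tc follows our reading of the example above. This is an illustration, not the framework's implementation.

from collections import defaultdict

def contains(outer, inner):
    """True if rectangle `inner` lies entirely inside rectangle `outer`."""
    return (outer.x <= inner.x and outer.y <= inner.y
            and inner.x + inner.w <= outer.x + outer.w
            and inner.y + inner.h <= outer.y + outer.h)

def build_bcg(G, P):
    """Weighted edges between ground-truth blocks (G) and computed blocks (P)."""
    edges = {}
    for i, g in enumerate(G):
        for j, p in enumerate(P):
            if contains(p.rect, g.rect):
                edges[(i, j)] = g.htmlcover + g.textcover
            elif contains(g.rect, p.rect):
                edges[(i, j)] = p.htmlcover + p.textcover
    return edges

def correspondence(G, P, edges, tr):
    deg_g, deg_p = defaultdict(int), defaultdict(int)
    for (i, j) in edges:
        deg_g[i] += 1
        deg_p[j] += 1
    # One-to-one matches among edges whose weight reaches the threshold tr.
    strong = [e for e, w in edges.items() if w >= tr]
    sg, sp = defaultdict(int), defaultdict(int)
    for (i, j) in strong:
        sg[i] += 1
        sp[j] += 1
    Tc = sum(1 for (i, j) in strong if sg[i] == 1 and sp[j] == 1)
    Co = sum(1 for i in deg_g if deg_g[i] > 1)            # oversegmented
    Cu = sum(1 for j in deg_p if deg_p[j] > 1)            # undersegmented
    Cm = sum(1 for i in range(len(G)) if deg_g[i] == 0)   # missed
    Cf = sum(1 for j in range(len(P)) if deg_p[j] == 0)   # false alarms
    return {"Tc": Tc, "Co": Co, "Cu": Cu, "Cm": Cm, "Cf": Cf, "Cq": Tc + Co + Cu}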
4.2 Measuring text coverage
The intuitive idea behind evaluating the covering is to know whether some content of the original page is not taken into account by the segmentation. The covering of a segmentation W′ is given by the Textcover function, which returns the proportion of the words of W that appear in the blocks of W′:

Textcover(W′) = ( Σ_{b ∈ {Block}} b.textcover ) / Page.textcover
More complex functions can be used to measure the text coverage, but this is left for future work.
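As a sanity check, the formula is a one-liner over the data model sketched in Section 4 (illustrative names, not the framework's API):

def textcover(seg) -> float:
    """Proportion of the page's words that fall inside the segmentation's blocks."""
    covered = sum(b.textcover for b in seg.page.blocks)
    return covered / seg.page.textcover if seg.page.textcover else 0.0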
4.3 Normalization
In order to compare two segmentations, we need to normalize the rectangles and the granularities. Given a segmentation W′, its normalized version fits in an ND × ND square, where ND is a fixed value; in our experiments, we fixed this value to 100. Thus W′ has the following new property Nrect (normalized rectangle):

W′.Page.Nrect = (0, 0, ND, ND)

Each block rectangle is then normalized according to the stretch ratio of the page, i.e.

∀ b ∈ W′.Blocks,  b.Nrect.x = (ND × b.rect.x) / W′.Page.rect.w

The other values of the block rectangle (y, w and h) are normalized in the same way (x and w with respect to the page width, y and h with respect to the page height). This allows for defining the normalized area of a block as

b.Narea = b.Nrect.h × b.Nrect.w = (b.rect.h × b.rect.w) × ND² / page_area

where

page_area = W′.Page.rect.w × W′.Page.rect.h
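A small Python helper restating this normalization, using the Rect class from the sketch in Section 4 and assuming, as above, that x and w are scaled by the page width and y and h by the page height (attribute names are illustrative):

ND = 100  # side of the normalized square used in our experiments

def normalize(seg):
    """Attach normalized rectangles and areas to the blocks of a segmentation."""
    pw, ph = seg.page.rect.w, seg.page.rect.h
    seg.page.nrect = Rect(x=0, y=0, h=ND, w=ND)
    for b in seg.page.blocks:
        b.nrect = Rect(x=ND * b.rect.x / pw,
                       y=ND * b.rect.y / ph,
                       h=ND * b.rect.h / ph,
                       w=ND * b.rect.w / pw)
        # Equals (b.rect.h * b.rect.w) * ND^2 / (pw * ph)
        b.narea = b.nrect.h * b.nrect.w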
5. SETUP
Our evaluation framework allows running different Web page segmentation algorithms on a collection of Web pages and measuring their correctness, as defined in Section 4. Four algorithms are tested, adapted in such a way that the page, the block geometries and the word counts can be extracted. At a glance, the framework takes a URL (the page to be segmented) and a granularity, and produces one score for the covering and five for the correspondence, using the ground truth, as described in Section 4.

5.1 Adaptation of Algorithms
We adapted the implementation of the tested algorithms in order to get the information needed for the comparison: the rectangles of the blocks (Block.rect), the number of HTML elements (Block.htmlcover), and the word count of the whole page (Page.textcover) and of each block (Block.textcover). For the algorithms whose source code was available (BoM, BlockFusion and jVIPS), the adaptation was made on the source code. For VIPS, the adaptation was made on the output.

5.1.1 BoM
As BoM is implemented in Javascript (https://github.com/asanoja/web-segmentation-evaluation/tree/master/chrome-extensions/BOM), it was straightforward to adapt it for evaluation by adding a custom Javascript function. The information is extracted from the terminal blocks of the segmentation tree: rectangles are taken from these blocks and the values from the DOM elements associated with them.

5.1.2 BF
There is an implementation of this algorithm included in the BoilerPipe application (https://code.google.com/p/boilerpipe/). As BF is text-oriented, we had to modify the original BoilerPipe source code to get the rectangles and their content values. The strategy used was to pre-process the input page: the page is traversed using a browser and, for each of its elements, a geometry attribute is added. This attribute contains the rectangle dimensions and its (recursive) word count. The outcome is a set of rectangles represented by the TextBlocks produced by the algorithm using the ARTICLE extractor of BF. A sketch of this pre-processing step is given below.
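The following is a minimal sketch of such a pre-processing pass, assuming Selenium WebDriver with Chrome (which we also use for rendering, cf. Section 5.2.2). The attribute name geometry matches the description above, but the script itself is illustrative, not the code actually patched into BoilerPipe.

from selenium import webdriver

ANNOTATE_JS = """
document.querySelectorAll('*').forEach(function (el) {
  var r = el.getBoundingClientRect();
  var words = (el.innerText || '').trim().split(/\\s+/).filter(Boolean).length;
  el.setAttribute('geometry',
      [r.left, r.top, r.height, r.width, words].join(','));
});
"""

def annotate_page(url: str) -> str:
    """Render a page, tag every element with its rectangle and word count."""
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        driver.execute_script(ANNOTATE_JS)
        return driver.page_source  # annotated HTML handed to the extractor
    finally:
        driver.quit()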
5.1.3 VIPS
There is an implementation of this algorithm in the form of a Dynamic Link Library (DLL) (http://www.cad.zju.edu.cn/home/dengcai/VIPS/VIPS.html). As VIPS is distributed as a DLL, we chose the Microsoft Visual Basic.NET development environment to build a wrapper in order to obtain the information needed.

5.1.4 jVIPS
With jVIPS, obtaining the required data for evaluation was straightforward because the source code is publicly available (https://github.com/tpopela/vips_java). When the visual content structure is completed (the predefined granularity is met), the rectangles, HTML elements and word counts are obtained from the corresponding blocks.
5.2 Collections, Crawling and Rendering
A segmentation repository holds the offline versions of Web pages, together with their segmentations (including the ground truth), organized in collections. Within a collection, each page is rendered with different rendering engines and at different granularity values. To each quadruple (page, rendering engine, algorithm, granularity) corresponds a segmentation performed on that page, rendered by that engine, using that algorithm at that granularity. For the work presented in this paper, we built the GOSH (GOogle SearcH) collection (http://www-poleia.lip6.fr/~sanojaa/BOM/inventory/) described below.

5.2.1 GOSH Collection
Web pages in this collection are selected by their "functional" type, or category. This selection is based on the categorization made by Brian Solis [18], "The Conversation Prism", which depicts the social media landscape from an ethnographic point of view. In this work, we considered the five most common of these categories, namely Blog, Forum, Picture, Enterprise, and Wiki. For each category, a set of 25 sites has been selected using Google search to find the pages with the highest PageRank. Within each of those sites, one page is crawled (https://github.com/asanoja/web-segmentation-evaluation/tree/master/dataset).

5.2.2 Rendering
Several rendering engines are used. They are encapsulated using Selenium WebDriver (http://docs.seleniumhq.org/projects/webdriver/). Although Selenium can handle several browser engines, only Chrome and Internet Explorer are used in the present work.

5.2.3 Collection post-processing
A Web page rendered with different engines may show differences in the display. The most common case is the white space between the window borders and the content. We must ensure that all renderings of the same Web page have the same dimensions. For that reason, we check for the above-mentioned white space and remove it.

5.3 Ground truth construction
The human assessor selects a set of elements that compose a block. We must then deduce the bounding rectangle of the block and compute the word count. This is a time-consuming and error-prone task. To speed up the process, we have developed the tool MoB (Manual design of Blocks, https://github.com/asanoja/web-segmentation-evaluation/tree/master/chrome-extensions/MOB). It assists human assessors in selecting the elements that form a block and automatically extracts all the information needed. The ground truth was produced by human assessors in our laboratory. However, it is planned to crawl a bigger set of pages and to include other assessors in this task.

6. EXPERIMENTS AND EVALUATION
In this section, we present the results of evaluating the four segmentation algorithms described in Section 3. The algorithms were evaluated on the GOSH collection using the measures defined in Section 4. These measures evaluate different aspects of a segmentation algorithm for a given quadruple (page, rendering engine, algorithm, granularity).

6.1 Setting the granularity
The accuracy of the measures directly depends on the way the ground truth is built. If the human assessors chose a given granularity for the ground truth, the granularity parameter of each algorithm needs to be adjusted accordingly. In the present work, our goal is to detect blocks of medium size. We focus neither on detecting only large blocks, such as the header, menu, footer and content, nor on detecting blocks at too fine a level of detail (sentences, links or single images). Instead, we focus on detecting parts of the page that represent significant pieces of information, such as a blog post, a table of contents, an image and its caption, a set of images, a forum response, and so forth. This is a more challenging task for segmentation algorithms. Thus, in the following experiments, the granularity was set so that each algorithm produces medium-size blocks.

6.2 Setting the thresholds
Setting the relative threshold tr is not so obvious, as the notion of "good block" is quite subjective. In this paper, we fixed tr to 0.1, as we observed on a significant number of examples that it corresponds to our notion of a good block. In the future, we plan to perform supervised machine learning with a large number of users to determine the right value: each user will annotate a segmentation block with the corresponding block in the ground truth if (s)he thinks that the blocks sufficiently match. Because rendering engines may produce small differences in their rendering, we also introduce a geometric tolerance tt to help in the comparison of the rectangles. The value of this parameter is fixed based on experience in working on the collection and is category-dependent. In general, block rectangles do not differ by more than ±5 pixels; for the whole collection, the best value was found to be 2 pixels.
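To illustrate how tt can enter the geometric comparison, a containment test relaxed by the tolerance might look as follows. This is an illustrative helper; whether the tolerance is applied before or after normalization is an implementation detail of the framework.

def contains_with_tolerance(outer, inner, tt: float = 2.0) -> bool:
    """Like the strict containment test of Section 4.1, relaxed by tt pixels."""
    return (outer.x - tt <= inner.x and outer.y - tt <= inner.y
            and inner.x + inner.w <= outer.x + outer.w + tt
            and inner.y + inner.h <= outer.y + outer.h + tt)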
6.3 Computing block correspondence
We computed the different metrics for block correspondence, as defined in Section 4.1. Table 1 shows the scores (averages of the metrics over all the documents of the collection) obtained by the different algorithms on the global collection.

Algorithm   Tc     Co     Cu     Cm     Cf
BF          0.98   0.40   0.70   4.07   0.52
BoM         2.99   1.41   0.79   0.99   0.86
jVIPS       1.29   1.42   0.97   1.75   5.46
VIPS        1.36   0.96   0.89   1.88   1.73

Table 1: Correspondence metrics for the global collection, by algorithm.

Several observations can be made:
• BoM obtains the best overall result for Tc, as it is more accurate. It produces very few serious errors (Cm, Cf) with respect to the other algorithms, but could be improved in terms of granularity, as indicated by its high values for Co and Cu.
• BF obtains the worst result for Tc, but with a low level of false alarms. In other words, BF does not detect all the correct blocks (mainly, it misses the blocks that are not located in the center of the page) but detects good blocks, with a rather good granularity. This is mainly due to the fact that BF uses the text density for determining blocks: as the blocks on the sides of the pages have a low text density, it is hard for BF to detect them.
• VIPS and jVIPS have comparable results in terms of correct blocks and missed blocks. However, jVIPS generates a lot of false alarms. This is due to a specific heuristic rule used in jVIPS that tends to detect blocks as wide as the page. This works well for blocks like headers or footers, but not for the content located in the center of the page.

In order to study the adequacy between segmentation algorithms and Web page categories, Figure 2 shows the quality of the segmentation, represented by the average values of the metric Cq described in Section 4.1, for the five Web page categories mentioned above. Each algorithm is represented by a colored bar, and the dashed lines are averages over the whole collection: the AVG line represents the average number of correct blocks, while TAVG represents the average number of expected blocks in the ground truth.

Figure 2: Total correct blocks by category.

We make the following observations:
• The best results are obtained for the Picture collection. The reason is probably that picture pages have a regular and simple structure. This observation also holds, though attenuated, for the Enterprise category.
• The worst results are obtained for the Forum category. The reason for this is probably that forum pages consist of several question/answer blocks, each of them having a complex structure (including avatars, email addresses, and so on) which is not easy to detect.
• BF does perform well for Forum. As those pages contain many text (question/response) blocks, the text density is sufficient to detect most of them, but not those surrounding the main content.
• jVIPS has problems with the Picture collection. Those pages do not have headers and footers, i.e. blocks that occupy the whole width of the page. Instead, they have many small blocks that jVIPS cannot detect.

6.4 Computing text coverage
We computed the text coverage as defined in Section 4.2. Table 2 gives the (rounded) values for the whole collection and for each category of pages.

Algorithm   all   forum   blog   wiki   picture   enterprise
BF          56    42      85     91     14        37
BoM         69    75      87     91     61        71
jVIPS       86    100     80     98     87        92
VIPS        95    96      95     94     95        95

Table 2: Coverage values for each algorithm and page category.

The first observation is that the coverage obtained by all the algorithms is quite high. This means that each of them is able to perform the basic task of text extraction. It appears that BF does not perform well for Picture and Enterprise; however, this is mainly due to the fact that, for those categories, BF misses a lot of blocks (as seen above), and thus misses their content.

7. CONCLUSION
In this paper, we presented a framework for evaluating and comparing Web page segmentation algorithms. To the best of our knowledge, this is the first work that focuses on the intrinsic properties of a segmentation, namely document layout, content and block geometry. Existing approaches do include an evaluation, but they are driven by specific applications and thus are not generic enough to compare all the segmentation algorithms. Our approach is based on a ground truth, built thanks to a tool (MoB) that we developed and that substantially eases the manual design of a segmentation. Our dataset contains 125 pages, covering five categories (25 pages per category). We presented an evaluation model that defines several useful metrics for the evaluation. One metric is devoted to the text extraction task; the other ones measure how well the blocks detected by a given algorithm match those of the ground truth. We used this model to evaluate and compare four segmentation algorithms, adapted in order to fit into our framework. The results show that the algorithms perform reasonably well at extracting text from pages. With respect to geometric block detection, results depend on the category of the pages considered. For instance, VIPS performs well for the Forum and Wiki categories, while it is much weaker for the three other ones. BoM seems to give the best overall results: it outperforms the other algorithms for the Blog, Wiki and, most importantly, Forum categories.
There are many directions for future work. First, we plan to use machine learning (ML) techniques for learning the threshold parameter tr. We also plan to use ML for discovering new relevant score functions, based on the feedback of users giving manual scores to segmentations from a training set. Second, we will continue to experiment with segmentation algorithms on more pages and more page categories. Our aim is to develop a complete evaluation framework in order to help users choose the best segmentation algorithm depending on their application and on the category of pages they manipulate. Of course, as the results show that some algorithms have problems with some categories, the framework can also be used to help improve the efficiency of segmentation algorithms for those categories. Third, we plan to evaluate the segmentation algorithms with respect to the type of task that uses the segmentation. Task types include Web entity extraction, layout detection, boilerplate detection, visualization on small-screen devices, and, in the context of digital libraries, optimization of Web archive crawling and change detection between Web page versions, among others. This implies defining scripts that perform the task (including calls to segmentation) and defining new ad hoc metrics for each task. Finally, we will work on enhancing the model. We would like to include block importance in the evaluation model, so that algorithms that detect important blocks get a better score. We would also like to define a generic model for Web page segmentation that can express all the existing approaches; this would allow for an analytic evaluation of segmentation algorithms.
8. REFERENCES
[1] Abiteboul, S.: Querying semi-structured data. In: Afrati, F.N., Kolaitis, P.G. (eds.) Database Theory - ICDT '97, 6th International Conference, Delphi, Greece, January 8-10, 1997, Proceedings. Lecture Notes in Computer Science, vol. 1186, pp. 1-18. Springer (1997)
[2] Asakawa, C., Takagi, H.: Annotation-based transcoding for nonvisual web access. In: Proceedings of the Fourth International ACM Conference on Assistive Technologies (Assets '00), pp. 172-179. ACM, New York, NY, USA (2000), http://doi.acm.org/10.1145/354324.354588
[3] Baluja, S.: Browsing on small screens: recasting web-page segmentation into an efficient machine learning framework. In: Proceedings of the 15th International Conference on World Wide Web, pp. 33-42. ACM (2006)
[4] Breuel, T.M.: Representations and metrics for off-line handwriting segmentation. In: Proceedings of the Eighth International Workshop on Frontiers in Handwriting Recognition, pp. 428-433. IEEE (2002)
[5] Cai, D., Yu, S., Wen, J.R., Ma, W.Y.: Extracting content structure for web pages based on visual representation. In: APWeb 2003. LNCS, vol. 2642, pp. 406-417. Springer (2003)
[6] Cattoni, R., Coianiz, T., Messelodi, S., Modena, C.: Geometric layout analysis techniques for document image understanding: a review. ITC-irst Technical Report 9703(09) (1998)
[7] Chakrabarti, D., Kumar, R., Punera, K.: A graph-theoretic approach to webpage segmentation. In: Proceedings of the 17th International Conference on World Wide Web, pp. 377-386. ACM (2008)
[8] Chen, Y., Xie, X., Ma, W.Y., Zhang, H.J.: Adapting web pages for small-screen devices. IEEE Internet Computing 9(1), 50-56 (2005)
[9] Hu, J., Kashi, R., Wilfong, G.: Document image layout comparison and classification. In: Proceedings of the Fifth International Conference on Document Analysis and Recognition (ICDAR '99), pp. 285-288 (Sep 1999)
[10] Kohlschütter, C., Nejdl, W.: A densitometric approach to web page segmentation. In: Proceedings of the 17th ACM Conference on Information and Knowledge Management, pp. 1173-1182. ACM (2008)
[11] Kreuzer, R.: A Quantitative Comparison of Semantic Web Page Segmentation Algorithms. Master's thesis, Universiteit Utrecht (2013)
[12] Pehlivan, Z., Saad, M.B., Gançarski, S.: Vi-diff: Understanding web pages changes. In: DEXA (1), pp. 1-15 (2010)
[13] Popela, T.: Implementace algoritmu pro vizualni segmentaci WWW stranek (Implementation of an algorithm for the visual segmentation of web pages). Master's thesis, Brno University of Technology (2012)
[14] Saad, M.B., Gançarski, S.: Using visual pages analysis for optimizing web archiving. In: Proceedings of the 2010 EDBT/ICDT Workshops, p. 43. ACM (2010)
[15] Saad, M.B., Gançarski, S.: Archiving the web using page changes patterns: a case study. Int. J. on Digital Libraries 13(1), 33-49 (2012)
[16] Sanoja, A., Gançarski, S.: Block-o-Matic: A web page segmentation framework. In: International Conference on Multimedia Computing and Systems (ICMCS'14), Marrakech, Morocco (2014)
[17] Shafait, F., Keysers, D., Breuel, T.: Performance evaluation and benchmarking of six page segmentation algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(6), 941-954 (2008)
[18] Solis, B.: The conversation prism (2014), https://conversationprism.com/
[19] Tang, Y.Y., Suen, C.Y.: Document structures: a survey. International Journal of Pattern Recognition and Artificial Intelligence 8(05), 1081-1111 (1994)
[20] Xiao, Y., Tao, Y., Li, Q.: Web page adaptation for mobile device. In: Proceedings of the 4th International Conference on Wireless Communications, Networking and Mobile Computing (WiCOM '08), pp. 1-5 (2008)
[21] Yesilada, Y.: Web page segmentation: A review. Tech. rep., University of Manchester and Middle East Technical University Northern Cyprus Campus (2011)
[22] Zhang, Y., Gerbrands, J.: Objective and quantitative segmentation evaluation and comparison. Signal Processing 39(1), 43-54 (1994)