Block-o-Matic: A Web Page Segmentation Framework - IEEE Xplore

Block-o-Matic: A Web Page Segmentation Framework

Andrés Sanoja (*)
LIP6-UPMC, Paris, France
Email: [email protected]

Stéphane Gançarski
LIP6-UPMC, Paris, France
Email: [email protected]

Abstract—In this paper we describe Block-o-Matic, a web page segmentation framework. It is a hybrid approach inspired by automated document processing methods and visual-based content segmentation techniques. A web page is associated with three structures: the DOM tree, the content structure and the logical structure. The DOM tree represents the HTML elements of a page, the content structure organizes page objects according to content categories and geometry, and the logical structure is the result of mapping the content structure on the basis of the human-perceptible meaning that shapes the blocks. The logical structure represents the final segmentation. The segmentation process is divided into three phases: analysis, understanding and reconstruction of a web page. An evaluation method, based on a ground truth of 400 pages classified into 16 categories, is proposed for assessing web page segmentations. Block-o-Matic gives promising results.

automated document processing. Our proposed segmentation consists of three phases: analysis, understanding and reconstruction. Three corresponding structures are involved: the DOM tree, the content structure and the logical structure. The outcome is a tree, which is the consolidation of the three structures taking flow order into account. Through experimentation we show how BoM performs better than VIPS, the most popular state-of-the-art implementation.

Keywords: web pages, page segmentation, correctness

There are different approaches to web page segmentation, as detailed in a survey [18]: structure-based [4], layout-based [5], hybrid [8], image-based [3], fixed-length [6] and text-based [7]. The structure-based approach refers to the use of the HTML tags, the DOM elements and their hierarchical relationships to detect blocks. The layout-based approach focuses on the repetitive elements found in web sites to guide page partitioning. The hybrid approach uses both the structure and the layout of the page to create a hierarchy of blocks through block extraction and recursive refinement. The image-based approach takes an image of a web page (also called a snapshot) and applies image processing techniques to detect blocks. The fixed-length approach removes all the semantic information (tags) from the page and then uses fixed-length algorithms to segment the web page. The text-based approach retrieves segments from web pages based on the properties of text, such as paragraph similarity and clustering, among others.

I. INTRODUCTION

Web pages are getting more complex than ever. Thus, identifying different elements of web pages, such as main content, menus, user comments and advertising, among others, becomes difficult. Web page segmentation refers to the process of dividing a web page into visually and semantically coherent segments called blocks. Detecting these different blocks is a crucial step for many applications, such as mobile devices [16], information retrieval [2], archiving [12], web accessibility [10] and evaluating visual quality (aesthetics) [15], among others. Until now, the segmentation problem has mainly been addressed by analysing the DOM (Document Object Model) structure of an HTML page, either by rendering and visual analysis or by interpreting/learning the meaning and importance of tag structures. Segmentation effectively keeps related content together as blocks. However, to fully understand a web page's content, blocks themselves are not enough; it is necessary to take into account the users' understanding of the content, such as block categorization and flow order (better known as reading order). Indeed, web pages are designed in such a way that the content organization may have more to do with optimizing the space and its aesthetics than with reflecting the logical order of the information contained. Whereas block detection and classification have already been studied, work on flow order is still missing. In this paper we propose a new web page segmentation framework: Block-o-Matic (BoM). It combines two popular approaches, one from web page analysis and the other from

This paper is organized as follows. In the next section, we study the related works. Section III describes the web page segmentation model and the segmentation algorithms. Section IV presents the experiments and results. Section V concludes.

II. RELATED WORKS

The hybrid approach guarantees some semantic coherence of blocks. It takes into account the spatial location, the visual properties of the page (e.g. by using CSS) and the relationships among blocks. However, those aspects are not enough to have a good description of the page. Blocks should be related to a role in the layout (for example their flow in the page document). Relating blocks to the layout provides more information for the applications based on web page segmentation. To accomplish this, the logical structure of the web page should be taken into account, along with the content structure. We observed that there is a clear relationship between page segmentation and the field of computer vision. Segmenting and understanding scanned document images is a very well studied subject in the Document Processing domain [9], [13],

978-1-4799-3824-7/14/$31.00 ©2014 IEEE

[14]. In Document Processing systems the concepts of objects, zones or blocks are applied to define the geometric and logical structure of a document, where each block is associated with a category, its geometric information and a label. Processing a document comprises the document analysis and understanding phases. Document images are analysed to detect these blocks based on pixel information. Blocks are categorized, their geometric information is extracted and, as a function of both features, a label is assigned. Moreover, by understanding what blocks contain (label) and where they are located (geometry), it is possible to cluster them and give them a reading order or flow (for example, a title block should come first, followed by the corresponding text blocks).

There are algorithms for web page segmentation that follow the Document Processing approach. One of the best known, and among the first to be used, is the Vision-based Page Segmentation algorithm (VIPS) [2]. Its segmentation model is based on the recursive geometric model proposed by Tang [14] for recognizing regions in a document image. VIPS itself focuses mainly on the content and geometric structure of a web page. Although the authors do not explicitly include a logical structure, they understand the document by extracting the blocks and by grouping them based on separators. Kohlschütter [7] relates the pixel concept to text elements in the web page domain. They transfer the concept of pixel to HTML text as character data (the atomic text portion); an image region is translated to a sequence of atomic text portions (blocks). They measure the density of each block and merge those which are below a threshold tolerance using the BlockFusion algorithm. [11] uses segmentation, structure labelling and text segmentation and labelling to define a random field that guides the extraction. They define the understanding of web pages as a problem and expose its importance for performing a better segmentation.
Although it is the closest work to our approach, the flow order is not mentioned in their analysis.

III. A FRAMEWORK FOR WEB PAGE SEGMENTATION

We first define a segmentation model based on three structures and their relationships. Then, we present the segmentation phases based on the model. Web page segmentation is a key component of automated content analysis [1] in many areas of research and application. It is formally defined as [11]:

Definition 1: Given a web page, segmentation is the task of partitioning the page at the semantic level and constructing the semantic tree of the page. Each node in this tree corresponds to a block of coherent content in the original page (see Figure 1 for an example).

In other words, for a web page W the output of its segmentation is the semantic tree of the web page, W′. Each node represents a data region in the web page, which is called a block. The root block represents the whole page. Each inner block is the aggregation of all its children blocks. All leaf blocks are atomic units and form a flat segmentation of the web page. Each block is identified by a block-id value
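Definition 1 can be made concrete with a minimal sketch of such a semantic tree; the class and block names below are illustrative, not part of the framework:

```python
class Block:
    """A node of the semantic tree W': the root is the whole page, inner
    blocks aggregate their children, leaves form the flat segmentation."""
    def __init__(self, block_id, children=None):
        self.block_id = block_id
        self.children = children or []

    def leaves(self):
        """Leaf blocks, i.e. the flat segmentation of the page."""
        if not self.children:
            return [self]
        return [leaf for c in self.children for leaf in c.leaves()]

# Hypothetical page: block names are made up for the example.
page = Block("page", [
    Block("header"),
    Block("content", [Block("article"), Block("sidebar")]),
    Block("footer"),
])
flat = [b.block_id for b in page.leaves()]  # flat segmentation
```

Traversing the leaves in order yields the flat segmentation mentioned in the definition.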

Fig. 1. Web page segmentation example

A. Web Page Segmentation Model

In the context of web pages, the content structure describes how elements are organized, to which content model¹ they belong and their geometric properties. The geometric structure is then contained in the content structure of the page, which is not the case for image segmentation. By using the information given by a web browser we can build the content and logical structures, and process them as described by Tang (see Figure 2). The content structure of a web page is derived from the DOM elements in conjunction with the geometric properties of elements given by the computedStyle interface provided by the browser. The logical structure describes the connection between blocks and their flow in the document. Elements are classified according to the HTML5 standard into the following types: root, meta-data, scripting, sections, grouping, text-level, edits, embedded, tabular, forms, interactive and links elements. For example, tags such as <ul> and <li> belong to the grouping category, while tags such as <section> and <nav> belong to the sections category. As shown in Figure 2, the segmentation process of a page W is divided into three phases: page analysis, page understanding and page reconstruction. The DOM tree is obtained from the rendering of a web browser. The result of the analysis phase is the content structure (Wcont), built from the DOM tree with the d2c algorithm. Mapping the content structure into a logical structure (Wlog) is called document understanding. This mapping is performed by the c2l algorithm with a granularity parameter pG. Web page reconstruction gathers the three structures (Rec function): W′ = Rec(DOM, d2c(DOM), c2l(d2c(DOM), pG)). It represents the segmented page, which can be used by applications as mentioned in Section I. The d2c algorithm constructs the content structure from the DOM tree by using only valid elements. It also organizes elements into the content categories enumerated above.
The c2l algorithm constructs the logical structure from the content

¹ http://www.w3.org/TR/html5/

Fig. 2. Web page segmentation model

structure based on the parameter pG. A group of blocks can be formed based on their category, their placement in the document and their separation. An order and a flow are given to each group. Figure 3 shows how all structures are extracted from HTML source code, and their relationships. Finally, the segmented web page W′ is created with the Rec function, consolidating the three structures in a linear form, respecting the flow order (see arrows in Figure 3c). Web page segmentation is a process that needs to go beyond the rigid hierarchical relationships between DOM elements: it needs to operate over a versatile structure. Taking the DOM tree as the first and basic structure, we define the content structure and the logical structure, which allow us to have different perspectives of a web page. In the following sections, we describe the three structures involved.

B. Content Structure

The content structure describes how objects are placed in the web page and how they are classified. The geometry of the objects is extracted by using the computedStyle interface provided by the browser. It gives the actual style property values that allow us to obtain the exact geometry. We represent access to this interface by the function css(property): css('offsetLeft') returns the value of the offsetLeft property from the computedStyle DOM interface.

Definition 2: A Content Object (Co) is a member of the content structure of the web page. A content object is defined as: Co = (elements, attributeSet, functionSet)

Attributes in attributeSet are: (1) Geometry, described by the tuple (x, y, w, h) representing the absolute coordinates of the rectangle that covers all elements. x and y correspond to the minimum values of the css('offsetLeft') and css('offsetTop') element properties, w and h to the maximum values of css('width') and css('height').
(2) Category, an integer taking one of the values: root, meta-data, scripting, sections, grouping, text-level, edits, embedded, tabular, forms, interactive and links.
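As an illustration, the Geometry attribute above can be sketched in plain Python; the dictionaries stand in for the values read through css(property), an assumption made here to keep the example self-contained rather than a real browser binding:

```python
def content_object_geometry(elements):
    """Geometry of a content object: x and y are the minimum
    offsetLeft/offsetTop over the covered elements, w and h the maximum
    width/height, following the definition above literally. Each element
    is a dict standing in for computedStyle values."""
    x = min(e["offsetLeft"] for e in elements)
    y = min(e["offsetTop"] for e in elements)
    w = max(e["width"] for e in elements)
    h = max(e["height"] for e in elements)
    return (x, y, w, h)
```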

Fig. 3. Web page structures example

Functions in functionSet are: (1) Diagonal(), which returns the length of the diagonal of the geometry of the object; (2) rDiagonal(), the diagonal of the object divided by that of its parent; (3) Valid(), which indicates whether an element is valid, i.e. it is visible and its area and content are representative.

Definition 3: The Content Structure (Wcont) of a web page W is an unranked, labeled, finite tree of content objects. It is composed of: (1) one node with a content object Co1 as a root; (2) a node with the content object Co_j^i and a set of trees (possibly empty) Wcont_1^ij, Wcont_2^ij, ..., Wcont_k^ij whose roots Co_1^ij, Co_2^ij, ..., Co_k^ij are children of Co_j^i.

C. Logic Structure

The logic structure describes the connection between web page blocks, simulating the human perception.

Definition 4: A Logic Object (Lo) is a member of the logic structure of the web page. A logic object is defined as:

Lo = (contentObjects, attributeSet, functionSet)

Attributes in attributeSet are: (1) Geometry, a set of points describing a polygon that forms the boundaries of the object; (2) Label, which can be header, navigation, title, content, table, image, logo, footer or none; (3) Follows, the preceding logical object in the flow, whose value is either independent or a reference to a logical object.

The HTML content classification proposes that each element belongs to an element category. Some exceptions apply, depending on its attributes. For example, tags such as <div> or <p> are considered in the grouping category (respectively, <a> in the links category).

E. Page Understanding

Functions in functionSet are: (1) Diagonal(), representing the length of the longest diagonal present in the geometry of the object; note that the geometry of logic objects is represented as a polygon and its diagonal is computed differently from that of content objects. (2) rDiagonal(), the diagonal of the object divided by that of its parent. (3) Distance(otherLo), the minimum distance between the two objects based on their geometry attribute. (4) VisualCuesPresent(), which indicates whether the content elements have distinctive visual properties, such as background color or font size. (5) Accept(label), which declares the object as a leaf block and assigns it a label; this means that all children blocks are removed and the present object becomes a leaf, keeping all the recursive relationships with the underlying content objects. (6) MergeWith(otherLo), a function allowing two objects to be merged; the resulting geometry is computed by searching the orthogonal hull of their union.
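A minimal sketch of three of these functions, under the simplifying assumption that logic object geometries are axis-aligned rectangles (x, y, w, h) rather than general polygons; the function names are ours, not the framework's API:

```python
import math

def diagonal(rect):
    """Diagonal() reduced to a rectangle (x, y, w, h)."""
    _x, _y, w, h = rect
    return math.hypot(w, h)

def rect_distance(a, b):
    """Distance(otherLo): minimum distance between two geometries,
    here axis-aligned rectangles; 0 when they touch or overlap."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    dx = max(bx - (ax + aw), ax - (bx + bw), 0)
    dy = max(by - (ay + ah), ay - (by + bh), 0)
    return math.hypot(dx, dy)

def merge_rects(a, b):
    """MergeWith(otherLo): for rectangles, the orthogonal hull of the
    union is simply the bounding rectangle of both."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    x, y = min(ax, bx), min(ay, by)
    return (x, y, max(ax + aw, bx + bw) - x, max(ay + ah, by + bh) - y)
```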

As the web page analysis extracts the content structure of a web page by categorizing the DOM elements, the understanding process maps the content structure into the logical structure. It takes into account the logic objects' category, their relative diagonal, and their location and distance with respect to other objects. We try to merge objects whose relative diagonals are less than the granularity parameter pG. Finally we group the objects by their distance and labels, and determine the flow within each group. This grouping strategy emulates human perception based on the four Gestalt laws described in [17]. Proximity: items tend to be grouped together according to their nearness. Similarity: similar items tend to be grouped together. Closure: items are grouped together if they tend to complete some structure. Simplicity: items tend to be organized into simple structures according to symmetry, regularity, and smoothness.
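The proximity law alone can be sketched as a greedy clustering over block rectangles; the center-distance test and the eps value are our simplifications for illustration, not the framework's actual grouping rules (which also apply similarity, closure and simplicity):

```python
import math

def centers_close(a, b, eps):
    """Proximity test on block rectangles (x, y, w, h) via their centers."""
    (ax, ay, aw, ah), (bx, by, bw, bh) = a, b
    return math.hypot((ax + aw / 2) - (bx + bw / 2),
                      (ay + ah / 2) - (by + bh / 2)) < eps

def group_by_proximity(blocks, eps=60):
    """Greedily cluster blocks whose centers are closer than eps pixels."""
    groups = []
    for b in blocks:
        for g in groups:
            if any(centers_close(b, m, eps) for m in g):
                g.append(b)
                break
        else:
            groups.append([b])
    return groups
```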

Definition 5: The Logic Structure (Wlog) of a web page W is an unranked, labeled, finite tree of logic objects. It is composed of: (1) one node with one logic object Lo1 representing the root of the page; (2) a node with the logic object Lo_j^i and a set of trees (possibly empty) Wlog_1^ij, Wlog_2^ij, ..., Wlog_n^ij whose roots Lo_1^ij, Lo_2^ij, ..., Lo_n^ij are children of Lo_j^i.

As a precondition to the page understanding, there should exist a logic structure having a one-to-one relationship with the content structure. Each logic object initially has the none label. Then, each logic object is evaluated recursively, beginning from the root object (Algorithm 2).

D. Page Analysis

The goal of the web page analysis is to construct the content structure from the DOM tree. It thus takes as input the DOM elements and builds the content structure Wcont. The process starts with the body element. Then, each element is evaluated to determine whether it belongs to one of the categories described in section III-B. If so, a new content object is created, setting the category and geometry attributes. The outcome of this process is a tree representing the content structure. Algorithm 1 describes the d2c algorithm:

Data: element e
Result: content object Co
if e has only one child then
    d2c(child);
else if e.valid() then
    a new content object Co is created;
    Co.elements ← e;
    Co.geometry ← (e.css('offsetLeft'), e.css('offsetTop'), e.css('width'), e.css('height'));
    Co.category ← getCategory(e);
    for each e.children as child do
        Co.children ← d2c(child);
    end
end

Algorithm 1: d2c algorithm
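A runnable sketch of Algorithm 1 over a toy DOM is given below; the Element class, the style dictionaries and the reduced get_category rule are stand-ins for the browser environment, not the framework's code:

```python
class Element:
    """Toy stand-in for a DOM element; css() mimics the computedStyle
    lookup (the real framework reads these values from a browser)."""
    def __init__(self, tag, styles, children=(), visible=True):
        self.tag = tag
        self.styles = styles
        self.children = list(children)
        self.visible = visible

    def css(self, prop):
        return self.styles[prop]

    def valid(self):
        # the paper also checks that area and content are representative
        return self.visible

class ContentObject:
    def __init__(self):
        self.elements = []
        self.children = []
        self.geometry = None
        self.category = None

def get_category(e):
    """Drastically simplified HTML5 categorization (an assumption here,
    not the framework's full rule set)."""
    return "grouping" if e.tag in {"div", "p", "ul", "li"} else "text-level"

def d2c(e):
    """Sketch of Algorithm 1: build the content structure from a DOM tree."""
    if len(e.children) == 1:       # skip single-child wrappers
        return d2c(e.children[0])
    if not e.valid():
        return None
    co = ContentObject()
    co.elements.append(e)
    co.geometry = (e.css("offsetLeft"), e.css("offsetTop"),
                   e.css("width"), e.css("height"))
    co.category = get_category(e)
    for child in e.children:
        sub = d2c(child)
        if sub is not None:
            co.children.append(sub)
    return co
```

Note how the single-child branch recurses directly into the child, so wrapper elements do not produce content objects of their own.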

Data: logic object Lo
Result: logic object Lo
if Lo.rDiagonal() ≥ pG then
    Lo.label ← getLabelFromContent(Lo);
    for each Lo.children as child do
        c2l(child);
    end
else
    if Lo.visualCuesPresent() then
        Lo.accept();
        Lo.label ← getLabelFromContent(Lo);
    else
        for each sibling object of Lo as Los do
            if Lo.distance(Los) < ε and Aligned(Lo, Los) then
                Lo.mergeWith(Los);
                Lo.label ← getLabelFromContent(Lo);
            end
        end
        if Lo.rDiagonal() < pG and Los.rDiagonal() < pG for every sibling Los then
            Lo.accept();
        end
    end
end

Algorithm 2: c2l algorithm
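A simplified, runnable transcription of Algorithm 2 is sketched below; sibling merging and label assignment are omitted, and the LogicObject class and the pG value are illustrative assumptions, not the framework's code:

```python
class LogicObject:
    """Minimal logic object for the sketch; geometry is reduced to an
    axis-aligned rectangle (x, y, w, h)."""
    def __init__(self, name, geometry, parent=None, visual_cues=False):
        self.name = name
        self.geometry = geometry
        self.visual_cues = visual_cues
        self.parent = parent
        self.children = []
        self.label = "none"
        if parent is not None:
            parent.children.append(self)

    def diagonal(self):
        _x, _y, w, h = self.geometry
        return (w * w + h * h) ** 0.5

    def r_diagonal(self):
        return self.diagonal() / self.parent.diagonal() if self.parent else 1.0

    def accept(self):
        # becomes a leaf block; underlying content objects stay attached
        self.children = []

def c2l(lo, pG=0.35):
    """Sketch of Algorithm 2: large objects are subdivided further,
    small ones become leaf blocks (merging and labelling omitted)."""
    if lo.r_diagonal() >= pG:
        for child in list(lo.children):
            c2l(child, pG)
    elif lo.visual_cues or all(s.r_diagonal() < pG for s in lo.parent.children):
        lo.accept()
```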


Using heuristics we are able to assign a label to objects, based on their geometry and their location. As content objects get their type from DOM elements, the logical objects get their label based on how the related content objects are categorized. Let us consider the following example, shown in Figure 4: the <ul> and <li> elements are mapped to the grouping category in the content structure. Each content object C_i^1 has a similar relative diagonal, they are horizontally aligned and the sum of their diagonals is similar to that of their parent C^1. Elements associated with the content objects C_i^1 are interactive

IV. EXPERIMENTS AND EVALUATION

The evaluation of a segmentation algorithm is an important challenge. In order to compare one algorithm with another, it is crucial to have a ground truth, validated by human assessors, to check algorithm correctness. Thus, we evaluate the performance of our algorithm against a ground truth. We also apply the same method to the VIPS algorithm and consolidate the results for comparison. Evaluation against other well-known segmentation algorithms is left as future work.

Fig. 4. Heuristic rule example for content categorization

Fig. 5. Example of object grouping. Each color represents a different group

(hyperlinks, text and events). In this case the suggested label for the logic object Lo1 is navigation. Due to limited space, we do not describe all the categorization rules. Besides the hierarchical relationship of the logic structure, logic objects have a precedence relationship between them. It allows describing the flow of the blocks in the page. We group objects that are close to one another (proximity), whose labels are not too distant (similarity) or follow the label list order (closure), and which are aligned (simplicity). To sum up, a group is formed by objects whose relative distance is less than a predefined threshold ε, whose edges are horizontally or vertically aligned and whose labels follow the list order presented in section III-C. Figure 5 shows an example of grouping applied to the si.com homepage. Each color represents a group. Solid arrows represent intra-group relationships while dotted arrows represent inter-group ones.

F. Web Page Reconstruction

Following the precedence order, each logical object is placed as the parent of the related content objects, and likewise for the associated DOM elements. This process is done recursively for all objects and produces a new document tree W′. W′ is an XML document which gathers the original HTML content with the content and logical objects expressed in XML syntax. This approach is practical since it puts the content and the segmentation in the same structure for further use.

To compare the different approaches, we built a custom test collection of 400 pages crawled from the dmoz.org Open Directory. We manually assessed these pages to define a comparable segmentation. A set of 25 pages from each of 16 categories has been selected. This directory consists of several levels and sub-levels of categories and results. Each sub-level is associated with a number which represents the total results. The selected pages are from the categories with the highest total results. The set of 400 pages was segmented using the MoB tool and their geometric information registered in a database. All the pages have been automatically segmented with both algorithms, registering the same information as for the ground truth. In this experiment the block type and flow order are not evaluated. Given the ground truth (G) and an automatic segmentation (S), a block s ∈ S is said to be correctly segmented if its geometry and location are equal to those of exactly one g ∈ G. A block g with no match in S is called missed. An element of S with no matching block is called a false alarm. In practice, rendering engines produce different renderings of the same page. For that reason we define a tolerance threshold tt: block geometry is matched within tt pixels, where 0 ≤ tt ≤ 50. When a web page is rendered, a default stylesheet is applied and, in cascade, the custom user style. It is observable that the values for the margin property in most top browsers are between 10px and 40px. That is why a range between 0 and 50px can cover almost all cases. Figure 6a shows the results for correct blocks (i.e. the number of blocks correctly segmented) for both algorithms (BOM_Tc, VIPS_Tc) compared to the number of blocks in G (GT_blocks), for all categories. The BoM algorithm performs better than VIPS in the number of correct blocks found over the whole collection.
Both algorithms have problems when the tolerance is very low, which is normal because the geometry of blocks is not entirely exact. However, with 10px of tolerance BoM presents the best performance, which means that the block geometry is very similar. On the other hand, VIPS requires a high tolerance to show better performance. If we detail the results by category (Figures 6b and 6c), we see that the BoM algorithm outperforms VIPS in the regional category, while in the arts category both have low performance because of the occurrence of missed blocks. This can be explained by the type of pages in this category: pages with multimedia and highly interactive elements (i.e. video, Flash, embedded media) that both algorithms do not handle correctly, due to limitations on the content visibility of these kinds of elements.
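The block-matching procedure described above can be sketched as follows; the function name and the exact tie-breaking when several ground-truth blocks fall within tolerance are our assumptions:

```python
def match_blocks(ground_truth, segmentation, tt=10):
    """A block s is counted correct if its rectangle (x, y, w, h) matches
    exactly one yet-unmatched g in G, every coordinate within the
    tolerance tt pixels (0 <= tt <= 50). Returns a tuple
    (correct, missed, false_alarms)."""
    def close(g, s):
        return all(abs(gv - sv) <= tt for gv, sv in zip(g, s))

    matched = set()
    correct = 0
    for s in segmentation:
        hits = [i for i, g in enumerate(ground_truth)
                if i not in matched and close(g, s)]
        if len(hits) == 1:          # "equal to only one g in G"
            matched.add(hits[0])
            correct += 1
    missed = len(ground_truth) - len(matched)
    false_alarms = len(segmentation) - correct
    return correct, missed, false_alarms
```

Raising tt makes the matching more forgiving, which is consistent with both algorithms scoring higher at larger tolerances.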

V. CONCLUSION

    We proposed a general purpose effective Web page segmentation framework. We combine structural, visual and logi-

Further work should also include a more general performance evaluation, including different algorithms.

ACKNOWLEDGEMENT

This work was partially supported by the SCAPE Project. The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137). We would like to express our gratitude to Dr. Zeynep Pehlivan for providing good advice and reviewing this paper.

REFERENCES


    Fig. 6. Results for correct blocks. a) total correct blocks for the whole collection. b) total correct blocks for Regional category. c) total correct blocks for Arts category.


cal features of web pages to analyse and understand their content. The flow/reading order is included in the understanding phase. This helps to have a better description of the page and to know how blocks are related. Using the geometric information, we evaluate the quality of the segmentation and compare it to the well-known algorithm VIPS, achieving 7% more accuracy on average. Under this approach the resulting W′ document is very useful in different applications. For example, by applying a new CSS style one can arrange blocks in a new rendering, or use blocks as CSS3 Regions and define a new web page layout. In the context of web archives, format migration is also a direct application, for example migrating pages from an HTML4 table layout to an HTML5 layout. While there is still some room for improvement, a solid foundation for further research has been laid down by the work in this paper. Our future work will focus on web archives, particularly on studying the applications that can benefit from our segmentation results in comparison to other algorithms.


[1] Apostolos Antonacopoulos and Jianying Hu. Web document analysis. In Bidyut B. Chaudhuri, editor, Digital Document Processing, Advances in Pattern Recognition, pages 407–419. Springer London, 2007.
[2] Deng Cai, Shipeng Yu, Ji-Rong Wen, and Wei-Ying Ma. Extracting content structure for web pages based on visual representation. In APWeb 2003, volume 2642 of LNCS, pages 406–417. Springer, 2003.
[3] Jiuxin Cao, Bo Mao, and Junzhou Luo. A segmentation method for web page analysis using shrinking and dividing. International Journal of Parallel, Emergent and Distributed Systems, 25(2):93–104, Apr 2010.
[4] Deepayan Chakrabarti, Ravi Kumar, and Kunal Punera. A graph-theoretic approach to webpage segmentation. In Proceedings of the 17th International Conference on World Wide Web, pages 377–386, Apr 2008.
[5] V. Kalaivani and K. Rajkumar. Reappearance layout based web page segmentation for small screen devices. International Journal of Computer Applications, 49(20):1–8, 2012.
[6] Marcin Kaszkiel and Justin Zobel. Effective ranking with arbitrary passages. J. Am. Soc. Inf. Sci. Technol., 52(4):344–364, February 2001.
[7] Christian Kohlschütter and Wolfgang Nejdl. A densitometric approach to web page segmentation. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, pages 1173–1182, Oct 2008.
[8] K. S. Kuppusamy. Multidimensional web page evaluation model using segmentation and annotations. International Journal on Cybernetics & Informatics, 1(4):1–12, Aug 2012.
[9] J. Liang, I. Phillips, and R. Haralick. Performance evaluation of document structure extraction algorithms. Computer Vision and Image Understanding, 84(1):144–159, 2001.
[10] Jalal U. Mahmud, Yevgen Borodin, and I. V. Ramakrishnan. CSurf: A context-driven non-visual web-browser. In Proceedings of the 16th International Conference on World Wide Web, WWW '07, pages 31–40, New York, NY, USA, 2007. ACM.
[11] Zaiqing Nie, Ji-Rong Wen, and Wei-Ying Ma. Webpage understanding: beyond page-level search. ACM SIGMOD Record, 37(4):48–54, 2009.
[12] Myriam Ben Saad and Stéphane Gançarski. Using visual pages analysis for optimizing web archiving. In Proceedings of the 2010 EDBT/ICDT Workshops, page 43. ACM, 2010.
[13] Y. Tang, M. Cheriet, J. Liu, J. Said, and C. Suen. Handbook of Pattern Recognition and Computer Vision. World Scientific Publishing Company, 1999.
[14] Yuan Y. Tang and Ching Y. Suen. Document structures: A survey. International Journal of Pattern Recognition and Artificial Intelligence, 8(5):1081–1111, 1994.
[15] Ou Wu, Yunfei Chen, Bing Li, and Weiming Hu. Evaluating the visual quality of web pages using a computational aesthetic approach. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pages 337–346. ACM, 2011.
[16] Yunpeng Xiao, Yang Tao, and Qian Li. Web page adaptation for mobile device. In Wireless Communications, Networking and Mobile Computing, 2008, pages 1–5. IEEE.
[17] Xin Yang and Yuanchun Shi. Enhanced Gestalt theory guided web page segmentation for mobile browsing. In Web Intelligence and Intelligent Agent Technologies, 2009, IEEE/WIC/ACM, volume 3, pages 46–49.
[18] Yeliz Yesilada. Web page segmentation: A review. Technical report, Middle East Technical University Northern Cyprus Campus, 2011.
