contains unnecessary texts. We automatically determine these blocks as shown in Fig. 1b. However, this is a difficult process because of the heterogeneous and semi-structured nature of web pages, as the detailed examples in Section 4.1 show. The first step of our approach prepares rules in XML format from an HTML document using ML methods. This step (creating the DOM, preparing features and applying an ML method) increases complexity. Aware of this time complexity, Uzun et al. (2011b) show that¹

¹ The Web Content Extractor application implements the hybrid approach. This application is an open-source project and is available via the web page http://bilgmuh.nku.edu.tr/webce/. Moreover, the datasets that we used in the learning stage of our approach and our experimental results are also available on this web page.
E. Uzun et al. / Information Processing and Management 49 (2013) 928–944
simple string manipulation techniques can efficiently extract informative content without using the DOM. Therefore, the second step utilizes a simple string manipulation technique using the rules obtained in the first step. The experimental methodology of the first step is, from an algorithmic viewpoint, the same as that of other approaches. However, the main focus of this paper is not only building a more accurate model with additional features but also producing effective rules for each domain. The next section presents related work and our own focus. The third section introduces the similarities and differences of the hybrid approach as well as the workflow of the extraction process. The fourth section is dedicated to rule induction and ML methods and introduces the details of the problem, feature selection, the dataset, learning algorithms, metrics and experiments. The fifth section covers the efficiency of the extraction algorithm and makes a comparison. The last section provides our conclusions.

2. Related work

Related work can be grouped into two categories: automatic extraction techniques and hand-crafted rules. The main focus of automatic extraction techniques is inference through features extracted from HTML. Hand-crafted rules are mostly used to extract information from HTML through string manipulation functions. Though preparing hand-crafted rules is difficult and cumbersome, their efficiency is high with proper adjustments. The advantage of automatic extraction techniques over hand-crafted rules is their automaticity and easy applicability. This study combines the two methods, inheriting the efficiency of hand-crafted rules and the automaticity of automatic extraction. Several researchers have investigated automatic extraction techniques on web pages through web page segmentation.
These studies have mostly focused on DOM-based segmentation (Baluja, 2006), location-based segmentation (Kovacevic, Diligenti, Gori, & Milutinovic, 2002) and vision-based segmentation (Cai et al., 2003a, 2003b; Kovacevic et al., 2002; Yu, Cai, Wen, & Ma, 2003). DOM-based studies use DOM-level features with trained classifiers to extract useful content from web document templates (Bar-Yossef & Rajagopalan, 2002; Chakrabarti, Kumar, & Punera, 2008; Chen, Zhou, Shi, Zhang, & Qiu, 2001; Debnath, Mitra, Pal, & Giles, 2005). Most of these studies use heuristics that depend on site-level hyperlink information (Chakrabarti, Kumar, & Punera, 2007), the distribution of segment-level text density ratios (Kohlschütter, 2009; Kohlschütter & Nejdl, 2008), n-grams (Baroni, Chantree, Kilgarriff, & Sharoff, 2008) and shallow text features (Kohlschütter, Fankhauser, & Nejdl, 2010). The approaches in these studies are not suitable for generating rules that can be used for string manipulation. Our approach provides a method to generate rules that can be utilized for efficient extraction. In DOM-level approaches (Gibson, Wellner, & Lubar, 2007; Hofmann & Weerkamp, 2007; Spousta, Marek, & Pecina, 2008; Yi & Liu, 2003a, 2003b), the common idea is that dissimilar content within the same structure, or similar repeating patterns in templates and styles across several pages, indicates noisy blocks. Based on that idea, Bar-Yossef and Rajagopalan (2002) report that eliminating templates increases the precision of a search engine called Clever at all recall levels. Similarly, Lin and Ho (2002) designed InfoDiscover, a tool that extracts informative content from web pages. In their study, they use the TABLE tag to partition the web page into blocks. Several other tools use sequential approaches, including n-grams and conditional random fields, to clean noisy text from web pages (Evert, 2008; Spousta et al., 2008).
However, none of these studies ranks the importance of blocks, except that conducted by Song, Liu, Wen, and Ma (2004). That study emphasizes that a scheme weighting blocks by their importance is useful for both search engines and data mining applications. Our approach detects uninformative blocks and three types of informative blocks (main, headline and article information). Other studies focus either on the layout HTML tags (DIV and TD) or on all HTML tags; our study, by contrast, takes into account the appropriate HTML tags for each block. In main block detection, the layout HTML tags can be used to determine the most comprehensively informative texts of a web page, whereas the detection of headline and article information blocks considers all HTML tags. Location-based segmentation relies on the position features of the areas of interest. These areas are determined by their location and are mostly labeled as left menu, right menu, footer content, etc. This approach depends on the assumption that the location, width and area of certain tags are valuable information for extracting useful content, and that they should be combined with the label features of these tags. In vision-based segmentation, by contrast, the features used for segmentation are visual features, including lines, colors, blanks, images, different font sizes and different colors. Some vision-based segmentation approaches rely heavily on the DOM structure, which diminishes segmentation efficiency, while others use both visual cues and the DOM structure. Other approaches, similar to vision-based segmentation, attempt to identify the most interesting and informative portions of web content (Baluja, 2006; Chen, Ma, & Zhang, 2003; Xue et al., 2007; Yang, Xiang, & Shi, 2009). These studies cluster style and content positions across different pages and distinguish the resulting clusters as uninformative template regions.
Yi and Liu (2003a, 2003b) utilize a compressed tree structure and a site style tree, respectively, to identify uninformative DOM nodes across pages. Two studies focus on TABLE tags: Ma, Goharian, Chowdhury, and Chung (2003) look for repeated blocks to mark as uninformative TD sections, whereas Lin (2002) uses entropy over a set of word features to remove redundant blocks from web pages. All of these studies focus on TD tags; however, nowadays web designers prefer DIV tags over TD tags, so the DIV tag is also included in our model. Additionally, there are web scrapers that skip DOM structure creation and use rules instead, including regular expressions written in languages such as Java and Perl. These tools consider efficiency and accuracy as their judging criteria (Adelberg, 1998; Liu, Pu, & Han, 2000; Vieira et al., 2006). They are efficient, but they are inappropriate or labor-intensive for extracting information from web templates that change over time. Hand-crafted rules also tend to be
impractical for more than a couple of sources. The approach presented in this study is built on an appropriate combination of hand-crafted rules and automatic extraction techniques.

3. The hybrid approach

The approach developed in this study involves automatic rule creation instead of manual hand-crafted rule insertion. These rules are used to infer informative content from simple HTML pages. Similar to other studies, our approach first extracts DOM-based features and utilizes these features to extract informative content. What distinguishes this study from earlier ones is that our approach infers rules that can be used like hand-crafted rules. A model is designed for this task. Our model is based on two block tags, DIV and TD, selected as the most suitable markers for determining the boundaries of informative content. Because the system is constructed on DIV and TD tags, we can automatically determine the most comprehensive rule sets and maintain efficiency in informative content extraction. Fig. 2 shows the workflow of this approach: the learning process, the extraction process, rule selection and the creation of a well-formed document based on the appropriateness of the rule for the web pages. This workflow consists of two main steps:

1. Rule induction with an ML method
2. Efficient informative content extraction using the rules

In the first step, rule induction is performed via ML methods; in the second, the extracted rules are used to determine the informative content of web pages and to construct a well-formed document that contains only this informative content. The procedure is as follows. For a given web page, the database is first checked for stored rules.
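The lookup-then-induce flow described above can be sketched as follows. This is a minimal illustration under our own naming (`RULE_DB`, `induce_rules`, `extract_informative` are hypothetical, not the authors' implementation), with a trivial regular-expression rule standing in for the induced rules:

```python
# Hypothetical sketch of the two-step hybrid workflow: look up a stored
# rule for the page's domain; if none exists (or the rule no longer
# matches), fall back to the ML-based rule-induction step and cache the
# result. Names here are illustrative, not the authors' implementation.
import re
from urllib.parse import urlparse

RULE_DB = {}  # domain -> rule (a regular expression over the raw HTML)

def induce_rules(html):
    # Placeholder for step 1: build the DOM, extract features and
    # apply an ML method to produce a string-matching rule.
    return r'<div[^>]*id="content"[^>]*>(.*?)</div>'

def extract_informative(url, html):
    domain = urlparse(url).netloc
    rule = RULE_DB.get(domain)
    if rule is None or not re.search(rule, html, re.S):
        rule = induce_rules(html)      # slow path: ML rule induction
        RULE_DB[domain] = rule         # cache for later pages
    match = re.search(rule, html, re.S)
    return match.group(1) if match else None

html = '<html><body><div id="content">Informative text.</div></body></html>'
print(extract_informative("http://example.com/a", html))
```

Once a domain's rule is cached, every subsequent page from that domain takes only the fast string-matching path, which is the source of the approach's efficiency.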
Fig. 2. The workflow of the hybrid approach.

Fig. 3. A simple web page.
Fig. 4. Blocks separated by DIV tags in an example web page.
If rules for the page are in the database, they are tested for appropriateness; otherwise, an ML method is applied to induce the rules and create the well-formed document. A rule is considered appropriate when it produces a single result. In the rule induction phase, marked as step one, the DOM is created and features are extracted from this DOM tree. An ML method is then applied. For this step, we compare several different machine learning methods in the following sections and choose decision tree learning with the sub-tree raising method as the most effective and accurate method for the dataset.

4. Rule induction with ML

In the rule induction step, the learning stage produces rules for efficient web content extraction. Preparing the learning stage requires a dataset and appropriate features derived from this dataset. The dataset should be well defined; it should not contain too much noise, and it need only include a modest number of samples. In this section, we describe the HTML and DOM structures prior to the dataset and feature selection. We then introduce the feature selection used in the creation of the dataset, and the dataset itself. We next present the tested ML methods and metrics, describe the results of the ML methods, and explain how rules are obtained from those results.

4.1. HTML and DOM

HTML is a simple and effective markup language used to develop web sites. HTML contains several tag sets for visualizing content. A web browser interprets these tags and creates a web page that a human can easily understand. Developers who want to present their visual content with richer features, including Javascript, use a hierarchy called the DOM. Fig. 3 shows the content of a simple web page in three different views and illustrates that the informative content is written between HTML tags.
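The per-block statistics that feed the later ML step can be gathered directly while parsing. Below is a minimal sketch (our illustration, not the authors' code) using Python's `html.parser` to count words and links inside each DIV/TD block:

```python
# Minimal sketch: walk the HTML and record, for every DIV/TD block,
# how many words and A HREF links it contains. Nested blocks each
# accumulate the statistics of their descendants.
from html.parser import HTMLParser

class BlockStats(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack, self.blocks = [], []

    def handle_starttag(self, tag, attrs):
        if tag in ("div", "td"):
            self.stack.append({"tag": tag, "attrs": dict(attrs),
                               "words": 0, "links": 0})
        if tag == "a":
            for b in self.stack:        # a link counts for all open blocks
                b["links"] += 1

    def handle_data(self, data):
        n = len(data.split())
        for b in self.stack:            # words count for all open blocks
            b["words"] += n

    def handle_endtag(self, tag):
        if tag in ("div", "td") and self.stack:
            self.blocks.append(self.stack.pop())

p = BlockStats()
p.feed('<div id="menu"><a href="/x">Home</a></div>'
       '<div id="story">A long informative article text.</div>')
for b in p.blocks:
    print(b["attrs"].get("id"), b["words"], b["links"])
```

The sketch assumes well-nested DIV/TD tags; real pages would need the more robust DOM construction the paper describes.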
An HTML tag is generally formed from the HEAD, which contains necessary information about the web site, and the BODY, which is used to visualize the content. DIV and TD tags, also called block tags, are used to separate the web site into several practical blocks, and they are generally referred to as block markers. An A HREF tag is used to give links to different web pages. H1 and P tags are used to format the text. For the example given in Fig. 3, a simple regular expression
<H1>(.*?)</H1>
can be used to extract the title of the web page. However, H1 and P are not the only tags used to hold the title; there are others, including H2–H6, FONT, SPAN, EM, UL, and LI. Using these tags to extract informative content is not suitable for rule induction; the block tags can instead be used in the extraction process. In a previous study (Yerlikaya & Uzun, 2010), an intelligent browser was developed to extract only informative content via manual DIV/TD selection and proper adjustments in the DOM hierarchy. Block tags have descriptive parts (including ID and CLASS) that can be used for producing rules. Fig. 4 gives a simple example of the use of block tags (here, DIV tags) in web pages. Fig. 4 contains only DIV tags as block tags, but TD tags can also be used in this visualization. DIV and TD tags are generally used in web design to separate different blocks. The TD tag is the most specific tag for TABLE formatting in HTML. The DIV tag, in turn, has become one of the most frequently used tags in HTML together with Cascading Style Sheets (CSS) styling, which introduced flexibility and ease of design to HTML. Both TD and DIV formatting can be represented in a nested structure. However, other tags (i.e., H1–H6, FONT, SPAN, EM, UL, and LI) are used
in flat form. For instance, the regular expression <A HREF="(.*?)"> can easily be used to extract all links in a web page without any closing-tag ambiguity. Block detection directly depends on using proper features in extraction. For instance, when Fig. 4 is examined using the link distribution as a selection criterion, we can identify the informative blocks as those with the fewest links and the uninformative blocks as those with many links. Certainly, the link and word distributions of each block are two important selection criteria, but some blocks may contain both informative and uninformative sub-blocks. When such blocks are encountered, the most specific block under the parent block should be tested to determine whether it is informative. This test also changes the word and link distributions of informative and uninformative blocks. In Fig. 4, the block
contains both informative and uninformative blocks. To simplify the block extraction problem, we categorize blocks into four different cases. The corresponding blocks given in Fig. 4 are listed below.

Main block: the largest informative block, a DIV or TD tag with a high number of terms.
Uninformative blocks: the blocks mostly used for advertisements, links and menus, separated by DIV, TD and UL tags.
Headline: the tags that contain the web page title.
Article information: the tags that contain the text summary, author name, date information, image titles, and user comments.
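As a toy illustration of the link-distribution criterion discussed above, a block can be flagged by the ratio of link words to all of its words. The 0.5 threshold below is our assumption for illustration only; in the paper such boundaries are learned by the ML step rather than fixed by hand:

```python
# Toy illustration of the link-distribution criterion: blocks whose
# text is dominated by link words are treated as uninformative.
# The 0.5 threshold is an assumed value for illustration only.
def classify_block(word_count, link_word_count):
    if word_count == 0:
        return "uninformative"          # empty blocks carry no content
    ratio = link_word_count / word_count
    return "uninformative" if ratio > 0.5 else "informative"

print(classify_block(120, 4))   # article-like block: few link words
print(classify_block(30, 28))   # menu-like block: nearly all link words
```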
To annotate the classes given above, a double-blind annotation technique is used among eight annotators (academic staff, graduate students, and undergraduate students taking the Information Retrieval course). In this technique, two annotators independently mark the samples, and a third annotator resolves the differences between them to define a Gold Standard. After this annotation, the statistical results (Table 1) of the text collections are obtained.
In Table 1, each tag is given in the columns and each class in the rows. Based on the counts given in Table 1, the usage of the DIV tag increased while the usage of TD decreased between 2008 and 2011. This supports the finding of Uzun et al. (2011a) that the TD tag has become an outdated informative block marker, while the DIV tag is the current trend due to the evolution of the web. Datasets used in ML should contain enough samples for the learning process. However, there are extreme differences between the numbers of some tags in the text collections. Therefore, a training dataset and a test dataset are created randomly from these text collections to balance the ML data (Table 2).

Training Dataset (Dataset-1): 736 web pages from 573 different domains.
Test Dataset (Dataset-2): 434 web pages from 324 different domains.

The statistics of the datasets in Table 2 make them suitable for use with ML methods. The DIV block marker is the most common tag used to separate the main and irrelevant blocks. Article information content, on the other hand, is spread across most of the tags; we expect this block to have lower prediction accuracy due to its uniformly distributed nature.

4.3. Feature selection

As in many other ML approaches, the methods applied in our experiments require that the data be represented as vectors of feature-value pairs. In our experiments, we not only adopt general shallow text features as feature vectors but also expand the common approach with combinations of classical features and after-extraction features. These features are mostly referred to as shallow text features and are used to find the informative content of the tags from which they are extracted (Gibson et al., 2007; Kohlschütter, 2009; Kohlschütter & Nejdl, 2008; Kohlschütter et al., 2010). We use classical features such as (1)-(3) in Table 3, and we derive new features (such as (4)-(6) in Table 3) that are combinations of the classical features.
Common approaches that use shallow text features also consider non-nested tags, including H1–H6, P, FONT, A HREF, SPAN, EM, UL, and LI. Our learning model, however, is also based on the nested tags DIV and TD, for which the common approaches create a problem: a parent DIV tag may contain several uninformative and informative DIV children. This creates noise in the statistics derived from the parent DIV tag, so the features extracted
Table 1
Block information for annotated text collections.

                      DIV    TD    UL   H1   H2   H3   H4   H5  FONT  SPAN    P   Total
Until 2008
Main                  477   149     –    –    –    –    –    –     –     –    –     626
Irrelevant           3991  1099   969    –    –    –    –    –     –     –    –    6059
Headline                –     –     –  325  164  134   26   11    35    27    4     726
Article information   278   113    12    1    8   10   15   14    28    93    3     575
Total                4746  1361   981  326  172  144   41   25    63   120    7    7986

Until 2011
Main                  516    23     –    –    –    –    –    –     –     –    –     539
Irrelevant           8129   554   755    –    –    –    –    –     –     –    –    9438
Headline                –     –     –  384   89   18   40    –    10    13   14     568
Article information  1565    65     3    –    –    –    –    –     5    12    –    1650
Total              10,210   642   758  384   89   18   40    –    15    25   14  12,195
Table 2
Tag information about the annotated datasets.

                      DIV    TD    UL   H1   H2   H3   H4   H5  FONT  SPAN    P   Total
Dataset-1
Main                  608   117     –    –    –    –    –    –     –     –    –     725
Irrelevant           7507  1133  1029    –    –    –    –    –     –     –    –    9669
Headline                –     –     –  418  179   92   29    7    32    26   10     793
Article information  1136   106    12    1    4    9    4    7    15    57    1    1352
Total                9251  1356  1041  419  183  101   33   14    47    83   11  12,539

Dataset-2
Main                  385    55     –    –    –    –    –    –     –     –    –     440
Irrelevant           4613   520   695    –    –    –    –    –     –     –    –    5828
Headline                –     –     –  291   74   60   37    4    13    14    8     501
Article information   707    72     3    –    4    1   11    7    18    48    2     873
Total                5705   647   698  291   78   61   48   11    31    62   10    7642
Table 3
Shallow text features.

(1) Word Frequency (WF): The number of terms inside the tag.
(2) Density in HTML (D-HTML): The ratio of the number of terms inside the tag to the number of all terms inside the HTML document.
(3) Link Frequency (LF): The count of A HREF links inside the tag.
(4) Word Frequency in Links (WF-L): The count of terms inside A HREF links placed inside the tag.
(5) Average Word Frequency in Links (A-WF-L): The ratio of the number of terms inside A HREF links placed inside the tag to the number of links.
(6) Ratio of Word Frequency in Links to All Words (R-WF-L-AW): The ratio of the number of terms inside A HREF links placed inside the tag to the number of all terms inside the tag.
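To make the feature definitions in Table 3 concrete, the sketch below computes features (1)-(6) for a single block with plain string operations. It is illustrative only: the function name and the tag-stripping shortcuts are ours, not the paper's implementation, and a production version would need a more careful HTML tokenizer.

```python
import re

def shallow_features(block_html: str, page_text: str) -> dict:
    """Compute shallow text features (1)-(6) of Table 3 for one block.

    block_html: inner HTML of the block tag (e.g., a DIV)
    page_text:  visible text of the whole HTML document
    """
    # Visible text of the block: drop all tags, then split on whitespace.
    words = re.sub(r"<[^>]+>", " ", block_html).split()

    # Terms appearing inside <a href=...>...</a> links within the block.
    link_bodies = re.findall(r"<a\s[^>]*href[^>]*>(.*?)</a>",
                             block_html, flags=re.I | re.S)
    link_words = [w for body in link_bodies
                  for w in re.sub(r"<[^>]+>", " ", body).split()]

    wf = len(words)                                       # (1) Word Frequency
    lf = len(link_bodies)                                 # (3) Link Frequency
    wf_l = len(link_words)                                # (4) Word Frequency in Links
    return {
        "WF": wf,
        "D-HTML": wf / max(len(page_text.split()), 1),    # (2) block terms / all terms
        "LF": lf,
        "WF-L": wf_l,
        "A-WF-L": wf_l / max(lf, 1),                      # (5) terms per link
        "R-WF-L-AW": wf_l / max(wf, 1),                   # (6) link terms / all block terms
    }
```

For example, a block whose only link is a two-word anchor inside a four-word text yields R-WF-L-AW = 0.5, which a classifier can read as a half-navigational block.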
may not contain appropriate link and word frequency ratios. To handle this problem, we extract the child DIV/TD tags to form a flat (non-nested) structure from the parent DIV/TD tags. After extraction (AE) of these child DIV/TD tags, we re-evaluate the values of the same feature set. This extraction process provides new feature sets for the ML step; Table 4 gives these new features. Differing from traditional approaches, our study adds AE features to the learning problem because of the nested nature of the tags used (DIV and TD). The effect of these new features on the learning task is investigated in the experiment section. The non-nested tags, including H1–H6, P, FONT, and A HREF, do not require any AE features. Along with the features given in Tables 3 and 4, Table 5 introduces a tag-name feature and a CSS styling feature. The tag name is useful for distinguishing which of the four classes introduced above a tag may belong to. Regarding CSS styling, which is used for the visualization of elements in a web page, some tags may contain ID and/or CLASS attributes. We believe that whether a DIV has an ID or CLASS attribute may have a positive effect on the efficiency of the content extraction step. Fig. 4 supports this idea, and this feature is also investigated in detail below (Section 5).

4.4. Machine learning methods applied in this study

To discover an appropriate learning method, four common ML methods were applied to our dataset: a Naïve Bayes algorithm, a Bayesian Network algorithm, an instance-based algorithm (k-Nearest Neighbor), and a Decision Tree algorithm. Descriptions of each learning algorithm are given below. The experiments are conducted using the Weka library with a tenfold cross-validation test method (Witten & Frank, 2005).

4.4.1. Naïve Bayes classification

Naïve Bayesian classification (Rish, 2001) relies on the assumption that attributes are conditionally independent of each other given the class of examples.
Though this hypothesis is often inappropriate for real-world problems where attributes strongly depend on each other, this classification approach helps reduce the dimensionality effect by simplifying the problem. Given an example X with feature vector (x1, ..., xn), the Naïve Bayes classifier looks for a class label C that maximizes the following likelihood:

P(X|C) = P(x1, ..., xn | C)

Below are short descriptions of the specific settings employed in our Naïve Bayes classification experiments (Witten & Frank, 2005):
Table 4
Additional shallow text features: the Table 3 features recomputed after extraction (AE) of child DIV/TD blocks.

(7) Word Frequency-AE (WF-AE)
(8) Density in HTML-AE (D-HTML-AE)
(9) Link Frequency-AE (LF-AE)
(10) Word Frequency in Links-AE (WF-L-AE)
(11) Average Word Frequency in Links-AE (AWF-L-AE)
(12) Ratio of Word Frequency in Links to All Words-AE (R-WF-L-AW-AE)
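The after-extraction (AE) idea of Table 4 can be sketched as: remove every nested child block from the parent's inner HTML, then recompute the same six features on what remains. The helper below is a simplified illustration of that flattening step (DIV only, regex-based), not the authors' code:

```python
import re

# Matches an innermost DIV element, i.e., one whose body contains no DIV tag.
_INNER_DIV = re.compile(r"<div\b[^>]*>(?:(?!</?div\b).)*</div>",
                        flags=re.I | re.S)

def after_extraction(parent_inner_html: str) -> str:
    """Return the parent's inner HTML with all child DIV blocks removed.

    Innermost DIV elements are deleted repeatedly until none remain, so the
    parent's own text survives while nested children (menus, ads, sub-blocks)
    are stripped before the Table 3 features are recomputed as AE features.
    """
    html = parent_inner_html
    while True:
        html, n = _INNER_DIV.subn(" ", html)
        if n == 0:
            return html
```

For instance, `after_extraction('main text <div id="ad">buy now <div>inner</div></div> more')` keeps only the parent's own words, so the nested ad's link and word counts no longer pollute the parent's statistics.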
Table 5
Other features.

(13) Tag Name (TN): One of the tag names (TD, DIV, H1–H6, P, FONT, A HREF, SPAN, EM, UL, and LI).
(14) Contains Tag ID or CLASS (C-TC): Whether the tag has an ID or CLASS attribute.
Normal Distribution: the standard Naïve Bayes classifier algorithm for numeric attributes.
Kernel Estimator: kernel estimation for numeric attributes rather than a Normal Distribution.
Supervised Discretization: used to convert numeric attributes to nominal ones.

4.4.2. Bayesian Network classification

Bayesian Networks (Friedman, Geiger, & Goldszmidt, 1997), also called belief networks or probabilistic networks, are directed acyclic graphs (DAGs). In a Bayesian Network, each node corresponds to a random variable, P(X), and each arc between nodes, P(Y|X), represents the probabilistic dependency between variables. The nodes and arcs define the structure of the network, whereas the conditional probabilities are the parameters of this structure. In Bayesian Networks, structure learning and inference are the two learning tasks. After obtaining the structure, classification can be conducted through inference. The network structure can also be given manually instead of being learned from the features. Our Weka experiments use two structure-learning search algorithms, namely K2 (Cooper & Herskovits, 1992) and TAN (Cheng & Greiner, 1999; Rish, 2001). Both algorithms perform local searches for an appropriate structure in Bayesian Networks.

4.4.3. k-Nearest Neighbor classification

k-Nearest Neighbor classification (Bremner et al., 2005) is a nonparametric approach used to estimate the class-conditional densities, namely P(X|Ci). Given the discriminant function below:
gi(x) = P(x|Ci) P(Ci)

we have P(x|Ci) = ki / (Ni Vk(x)), where ki is the number of the k nearest neighbors that belong to Ci, Ni is the number of examples in class Ci, and Vk(x) is the volume of the n-dimensional hypersphere centered at x with radius r = ||x - xk||, where xk is the k-th nearest observation to x (among all neighbors of all classes of x). Selecting the number of neighbors for comparison is an important property of the learning; k should thus be selected appropriately.

4.4.4. Decision Tree classification

In Decision Tree Learning (Breiman, Friedman, Olshen, & Stone, 1984), trees are composed of decision nodes and terminal leaves. Given a new instance to be classified, test functions are applied to the instance recursively in decision nodes until a leaf node that assigns it a discrete output is reached. An instance feature is tested in every node for branching. The information gain of selecting an attribute to form a tree must be calculated, and a predefined number of the most informative attributes must be selected to minimize the depth of the tree. In cases where more than one hypothesis is extracted from the training set, ensemble learning methods are used to increase classifier efficiency by selecting and combining a set of hypotheses from the hypothesis space. These hypotheses are combined into a single classifier that makes predictions by taking a vote of its constituents. One common ensemble method is boosting, in which the model is induced sequentially from the training examples and the example weights are adjusted at each iteration. The Weka library provides an implementation of the C4.5 Decision Tree algorithm (Quinlan, 1993) in the J48 class. Some settings employed for Decision Tree classification in our experiments are briefly explained below:
Default Setting: the standard Decision Tree classifier algorithm.
Reduced Error Pruning: uses an independent test set to estimate the error at each node.
No Sub-tree Raising: used to disable sub-tree raising of the most popular branch.
Unpruned: used to disable prepruning strategies.
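The experiments above are run in Weka. Purely as an illustration of the comparison protocol, an analogous setup in Python with scikit-learn might look as follows; GaussianNB, KNeighborsClassifier and DecisionTreeClassifier are rough counterparts of Weka's NaiveBayes, IBk and J48 (not the same implementations), the Bayesian Network search algorithms have no direct scikit-learn equivalent, and the synthetic data merely stands in for the real feature vectors:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Stand-in for the 14-feature, 4-class block dataset described above.
X, y = make_classification(n_samples=500, n_features=14, n_informative=8,
                           n_classes=4, random_state=0)

models = {
    "Naive Bayes (normal distribution)": GaussianNB(),
    "k-NN (k=1)": KNeighborsClassifier(n_neighbors=1),
    "k-NN (k=2)": KNeighborsClassifier(n_neighbors=2),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)  # tenfold cross-validation
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```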
4.5. Classification metrics

There are several metrics to assess the performance of ML methods. One evaluation protocol is N-Fold Cross Validation: N tests are made on the dataset, and the observed metrics, including accuracy, precision, recall and f-Measure, are averaged. In each test, the model is trained on N - 1 portions of the data and tested on the remaining portion. For 10-Fold Cross Validation, the data are first split into 10 sets, and a test is made for each set while training on the remaining sets. Considering ML as a binary classification task, each sample falls into one of two cases (classes), positive or negative (i.e., belongs to the required class or not). Based on these cases, Table 6 gives the definitions necessary to calculate the accuracy, precision, recall and f-Measure. This definition table is called a Confusion Matrix: the actual values are in rows, and the predicted ones are in columns. The diagonal cells show the number of correct predictions for both the positive and negative cases; the off-diagonal cells show the misclassifications. The accuracy metric measures the percentage of correct predictions over the overall data and accounts for both positive and negative instances. According to the definitions given in Table 6, the following equations define accuracy, precision and recall, respectively.
accuracy = (TP + TN) / (TP + TN + FP + FN)
Table 6
Confusion matrix.

             Predicted
Known        Positive   Negative
Positive     TP         FN
Negative     FP         TN

where True Positive (TP) is the number of correctly classified positive examples; False Positive (FP) is the number of negative examples incorrectly classified as positive; True Negative (TN) is the number of correctly classified negative examples; and False Negative (FN) is the number of positive examples incorrectly classified as negative.
precision = TP / (TP + FP)

recall = TP / (TP + FN)
In the special case where beta (β) equals 1, the f-Measure combines precision and recall by calculating their harmonic mean and is called the f1-measure.

Fβ = (1 + β²) · precision · recall / (β² · precision + recall)
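These definitions translate directly into code. A minimal sketch (our own helper names; beta = 1 gives the f1-measure):

```python
def classification_metrics(tp, fp, fn, tn, beta=1.0):
    """Accuracy, precision, recall and f-measure from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_beta = ((1 + beta ** 2) * precision * recall
              / (beta ** 2 * precision + recall))
    return accuracy, precision, recall, f_beta

# Example: 90 TP, 10 FP, 10 FN, 890 TN gives accuracy 0.98 and
# precision = recall = f1 = 0.90.
acc, p, r, f1 = classification_metrics(90, 10, 10, 890)
```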
Moreover, the last metric is the kappa statistic, which measures the degree of agreement between two sets of classifications. The kappa statistic is an alternative to the accuracy measure for evaluating methods. It was first introduced as a metric for measuring the degree of agreement between two observers (Cohen, 1960) and has since been used in several disciplines. In ML, it assesses the improvement of a method's accuracy over a predictor that guesses by chance. This measure is defined as:

k = (Po - Pc) / (1 - Pc)
where Po is the accuracy of the method, and Pc is the expected accuracy that a randomly guessing method would achieve on the same dataset. The kappa statistic ranges between -1 and 1, where -1 is total disagreement (i.e., total misclassification) and 1 is perfect agreement (i.e., 100% accurate classification). Kappa fundamentally assesses how much better a learning method is than chance: random classifiers that follow the majority or the class distribution score zero kappa. Landis and Koch (1977) suggest that a kappa score over 0.4 indicates reasonable agreement beyond chance.

4.6. ML results and error analysis

ML methods can classify the informative and uninformative content within an error margin, and each learning method may have a different error rate on the same dataset. To find the most accurate learning method, we used two datasets, one for training and one for testing. Because the training dataset is crucial to classification performance, tenfold cross-validation is applied to evaluate and compare the different ML methods with several configurations. Additionally, the test dataset is used to assess how well the results obtained from the training process generalize. In the performance evaluation, we measured the accuracy, precision, recall, f-Measure and kappa statistic. Table 7 gives the training (cross-validated) and testing results. The Naïve Bayes algorithm is one of the simplest algorithms used in the learning task. Though it gives good results with the Supervised Discretization setting, it does not perform as well as the other algorithms, as its low kappa results show. For Bayesian Networks, the TAN and K2 search methods are tested; these search methods form a proper Bayesian Network structure and boost the accuracy. The results of the other two methods, k-Nearest Neighbors and Decision Tree, are very close. In k-Nearest Neighbors, the appropriate selection of the number of neighbors (k) has a significant effect on the accuracy.
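As a worked sketch of the kappa definition (our own helper; rows are actual classes, columns are predicted), Po is the diagonal mass of the confusion matrix and Pc is the chance agreement computed from its marginals:

```python
def kappa(confusion):
    """Cohen's kappa, (Po - Pc) / (1 - Pc), for a square confusion matrix."""
    n = sum(sum(row) for row in confusion)
    classes = range(len(confusion))
    po = sum(confusion[i][i] for i in classes) / n       # observed accuracy
    pc = sum((sum(confusion[i]) / n) *                   # row marginal
             (sum(row[i] for row in confusion) / n)      # column marginal
             for i in classes)                           # chance accuracy
    return (po - pc) / (1 - pc)

# A perfect classifier scores 1; one indistinguishable from chance scores 0.
```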
The best performance is obtained with Decision Tree Learning and its Sub-tree Raising setting, with 95.76% accuracy on the training dataset and 94.88% accuracy on the testing dataset. To expand the details of the Decision Tree Learning, Table 8 gives the confusion matrix, which compares the actual results with the predicted ones in terms of classes. The number of uninformative instances is higher than that of the other classes, which is why they are more likely to be confused in the inference. In Table 8, the uninformative class is misclassified and confused with the article information class in 272 instances of the training dataset and 201 instances of the testing dataset. We believe that this is expected, as both classes contain similar link structures and word counts. Manual observations generally show that article information contains shorter links. To overcome this problem, we could set a threshold value for the length of the links; the system might then, however, wrongly accept short links such as ''send'', ''print'' and ''Facebook''. The most common approach for short and uninformative links is to group the common words for the same
Table 7
Weighted average results of predictions of the main blocks, uninformative blocks, headline and article information content on two different datasets.

Dataset-1 (Train)
  Classification algorithm                        Accuracy (%)  Precision  Recall  f-Measure  Kappa
  Naïve Bayes         Normal Distribution         76.33         0.86       0.76    0.79       0.52
  Naïve Bayes         Kernel Estimator            84.14         0.89       0.84    0.86       0.65
  Naïve Bayes         Supervised Discretization   90.55         0.92       0.91    0.91       0.77
  Bayesian Network    Search Algorithm: K2        90.72         0.92       0.91    0.91       0.77
  Bayesian Network    Search Algorithm: TAN       93.23         0.94       0.93    0.94       0.83
  k-Nearest Neighbor  k = 1                       95.69         0.96       0.96    0.96       0.89
  k-Nearest Neighbor  k = 2                       95.33         0.95       0.95    0.95       0.87
  Decision Tree       Sub-tree Raising            95.76         0.96       0.96    0.96       0.89
  Decision Tree       Unpruned                    95.65         0.96       0.96    0.96       0.89
  Decision Tree       Reduced Error Pruning       95.49         0.95       0.96    0.95       0.88

Dataset-2 (Test)
  Classification algorithm                        Accuracy (%)  Precision  Recall  f-Measure  Kappa
  Naïve Bayes         Normal Distribution         76.77         0.86       0.77    0.80       0.53
  Naïve Bayes         Kernel Estimator            84.81         0.89       0.85    0.86       0.67
  Naïve Bayes         Supervised Discretization   90.30         0.91       0.90    0.91       0.77
  Bayesian Network    Search Algorithm: K2        90.40         0.91       0.90    0.91       0.77
  Bayesian Network    Search Algorithm: TAN       92.85         0.93       0.93    0.93       0.83
  k-Nearest Neighbor  k = 1                       94.96         0.95       0.95    0.95       0.87
  k-Nearest Neighbor  k = 2                       94.59         0.94       0.95    0.94       0.86
  Decision Tree       Sub-tree Raising            94.88         0.95       0.95    0.95       0.87
  Decision Tree       Unpruned                    94.84         0.95       0.95    0.95       0.87
  Decision Tree       Reduced Error Pruning       94.76         0.95       0.95    0.95       0.87
Table 8
Confusion matrix and prediction of four classes via the Sub-tree Raising method of Decision Tree Learning on two different datasets.

Dataset-1: Training dataset
                              Predicted
Known                         (a)    (b)   (c)   (d)    Precision  Recall  f-Measure
Uninformative blocks (a)      9481   26    0     162    0.98       0.98    0.98
Main blocks (b)               29     631   0     15     0.94       0.94    0.94
Headline (c)                  1      0     780   26     0.95       0.97    0.96
Article information (d)       208    17    47    1092   0.84       0.80    0.82
Weighted avg.                                           0.96       0.96    0.96

Dataset-2: Test dataset
                              Predicted
Known                         (a)    (b)   (c)   (d)    Precision  Recall  f-Measure
Uninformative blocks (a)      5668   12    0     148    0.97       0.97    0.97
Main blocks (b)               17     390   0     10     0.93       0.94    0.93
Headline (c)                  4      2     501   6      0.92       0.98    0.95
Article information (d)       143    15    43    686    0.80       0.78    0.79
Weighted avg.                                           0.95       0.95    0.95
web domain and remove them directly. This would make the approach language dependent, so we did not apply this adjustment, in order to preserve language independence. The result of the Decision Tree Learning algorithm is a tree that provides a better understanding of the features and their relations. The decision tree in our model consists of 269 decision nodes and 153 leaves. Fig. 5 shows the portion of the actual tree used to predict the main and article information blocks. When we analyze Fig. 5, we see that 614 of 675 Main Blocks are classified correctly using only D-HTML-AE (Density in HTML - After Extraction), R-WF-L-AW-AE (Ratio of Word Frequency in Links to All Words - After Extraction) and WF-L-AE (Word Frequency in Links - After Extraction). Only nine errors occur in the TN = DIV branch of the decision tree. As a result, the after-extraction features, and new features such as R-WF-L-AW-AE and WF-L-AE derived in our approach, have positive effects on Main prediction. On the other hand, features without AE counterparts are also effective in article information prediction: 449 of 1352 article information blocks are classified correctly using six features, with only 22 errors in this prediction. These analyses of several portions of the actual tree indicate that the AE features are crucial for the prediction of the Main Block. Information gain, a statistical property, can be used to examine the effects of all features on prediction: it measures how effective features are in different combinations. Fig. 6 shows different feature sets and their information gains for the whole learning process.
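Information gain as used here can be sketched as the entropy of the class labels minus the weighted entropy remaining after splitting the examples on one (discrete) feature; the helpers below are illustrative, not the authors' implementation:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    total = len(labels)
    return -sum((c / total) * log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, labels, feature_index):
    """Entropy reduction obtained by splitting on the feature at feature_index."""
    by_value = {}
    for row, label in zip(rows, labels):
        by_value.setdefault(row[feature_index], []).append(label)
    remainder = sum(len(subset) / len(labels) * entropy(subset)
                    for subset in by_value.values())
    return entropy(labels) - remainder
```

In a toy sample where the tag name alone separates Main from uninformative blocks, its information gain equals the full label entropy; a feature that cleanly separates classes gets a gain close to that maximum.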
[Fig. 5. A portion of the decision tree. The Main prediction branch tests D-HTML-AE (e.g., D-HTML-AE > 0.099698) and R-WF-L-AW-AE before testing TN, e.g., TN = DIV: Main Block (518/9) and TN = TD: Main Block (96/0); the Article Information prediction branch starts from R-WF-L-AW.]
The block rules may vary among web sites. However, when rules are extracted for a single web page, they can be used for structurally similar web pages of the same web domain. Some web pages may also contain errors, as in Fig. 4, where a block marker contains unrelated information in the tag. These errors might also happen for other blocks, including the main content block. In this study, we try to minimize such user-oriented noise on block tags by including web pages from various web domains in our datasets.

In Fig. 4, a more complex DIV structure in an HTML page is demonstrated. This figure is a good example of how blocks can be nested to form a hierarchy, which creates a problem when extracting the main DIV block with a regular expression. As the HTML structure of Fig. 4 contains many nested DIV tags, a simple regular expression (<div ...>(.*?)</div>) cannot find the proper closing of the main block. There are two types of matching mechanisms in regular expressions for this task: one is matching the last </div> tag, and the other is matching the first </div> tag. To prevent this closing-tag ambiguity, a DOM structure can be used to extract the target match. However, creating a DOM structure increases the complexity of basic string matching problems. In this study, we train a learning model with the necessary features derived from block tags to directly extract informative content blocks from HTML without using DOM. We use DOM-based features in the rule induction phase, but the model does not re-create the DOM in the efficient extraction.

4.2. Datasets used in this study

Almost all ML methods need a problem-specific dataset to adjust the weights of their algorithms. Similarly, studies involving ML methods for informative content extraction from web pages use hand-annotated samples from various web domains. An older text collection, annotated by Kohlschutter et al. (2010), contains 621 web pages from 408 different domains selected from Google News until 2008. Moreover, a newer text collection, containing 550 web pages from 275 different domains selected from Google News until 2011, is generated. Finally, we use a double-blind annotation technique to re-annotate the samples from these text collections. Using this annotation, we consider the following four classes:

Main block: the largest informative block, which contains a DIV or TD tag with a high number of terms.
Uninformative blocks: the blocks mostly used for advertisements, links and menus, separated by DIV, TD and UL tags.
Headline: the tags that contain the web page title (e.g., the H1 tag).
Article information: the tags that contain the text summary, author name, date information, image titles, and user comments.

To annotate the classes given above, a double-blind annotation technique is used among eight annotators (academic staff, graduate students, and undergraduate students taking the Information Retrieval course). In this technique, two annotators independently mark the samples, and a third annotator adjudicates the differences between the two to define a Gold Standard. After this annotation, the statistical results (Table 1) of the text collections are obtained.
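The closing-tag ambiguity described above is easy to reproduce. In this toy example (our own HTML), the non-greedy pattern stops at the first </div> and truncates the nested main block, while the greedy pattern runs to the last </div> and swallows a sibling block; neither finds the structurally matching close:

```python
import re

html = ('<div id="main">intro <div>nested child</div> outro</div> '
        '<div>footer</div>')

# Non-greedy: stops at the FIRST closing tag, truncating the main block.
non_greedy = re.search(r'<div id="main">(.*?)</div>', html).group(1)

# Greedy: runs to the LAST closing tag, swallowing the footer block too.
greedy = re.search(r'<div id="main">(.*)</div>', html).group(1)

print(non_greedy)  # intro <div>nested child
print(greedy)      # intro <div>nested child</div> outro</div> <div>footer
```

A DOM parser resolves the nesting correctly, which is exactly the trade-off weighed above: the correctness of DOM matching versus the speed of flat string matching.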
In Table 1, each tag is given in the columns, and each class is given in the rows. Based on the counts given in Table 1, while the usage of DIV tag has increased, the usage of TD has decreased between 2008 and 2011. This finding supports Uzun et al. (2011a) that indicate the TD tag has become an old informative block marker, while the DIV tag is a current trend due to evolution of the web. Datasets used in ML should contain enough samples for the learning process. However, there are extreme differences between the numbers of some tags in text collections. Therefore, two datasets as training and test dataset are created randomly from these text collections for balancing ML data (Table 2). Training Dataset (Dataset-1): 736 web pages from 573 different domains. Test Dataset (Dataset-2): 434 web pages from 324 different domains. The statistical results of the datasets in Table 2 are suitable for using these datasets in ML methods. The DIV block marker is the most common tag used to separate the main and irrelevant blocks. On the other hand, article information content highly shares most of the tags. We expect that this block has a lower prediction accuracy due to its uniformly distributed nature. 4.3. Feature selection As in many other ML approaches, the methods applied in our experiments require that the data be represented as vectors of feature-value pairs. In our experiments, we not only adopt general shallow text features as feature vectors, but we also expand the common approach with combinations of classical features and after extraction features. These features are mostly referred to as shallow text features and are used to find the informative content of the tags from which they are extracted (Gibson et al., 2007; Kohlschutter, 2009; Kohlschutter & Nejdl, 2008; Kohlschutter et al., 2010). In a similar way, we use classical features like (1-2-3) in Table 3. Moreover, we derive a new features (like (4-5-6) in Table 3) that are combinations of classical features. 
Common approaches that use shallow text features also consider non-nested tags, including H1–H6, P, FONT, A HREF, SPAN, EM, UL, and LI. However, our learning model construction also is based on the nested tags (i.e., DIV and TD). Common approaches thus create a problem in our learning model. For example, a parent DIV tag may contain several uninformative and informative DIV children. This creates noise in the statistics derived from the parent DIV tag, so the features extracted
Table 1 Block information for annotated text collections. DIV
TD
UL
H1
H2
H3
H4
H5
FONT
SPAN
P
Total
Until 2008 Main Irrelevant Headline Article information Total
477 3991 – 278 4746
149 1099 – 113 1361
– 969 – 12 981
– – 325 1 326
– – 164 8 172
– – 134 10 144
– – 26 15 41
– – 11 14 25
– – 35 28 63
– – 27 93 120
– – 4 3 7
626 6059 726 575 7986
Until 2011 Main Irrelevant Headline Article information Total
516 8129 – 1565 10,210
23 554 – 65 642
– 755 – 3 758
– – 384 – 384
– – 89 – 89
– – 18 – 18
– – 40 – 40
– – – – –
– – 10 5 15
– – 13 12 25
– – 14 – 14
539 9438 568 1650 12,195
Table 2 Tag information about the annotated datasets. DIV
TD
UL
H1
H2
H3
H4
H5
FONT
SPAN
P
Total
Dataset-1 Main Irrelevant Headline Article information Total
608 7507 – 1136 9251
117 1133 – 106 1356
– 1029 – 12 1041
– – 418 1 419
– – 179 4 183
– – 92 9 101
– – 29 4 33
– – 7 7 14
– – 32 15 47
– – 26 57 83
– – 10 1 11
725 9669 793 1352 12,539
Dataset-2 Main Irrelevant Headline Article information Total
385 4613 – 707 5705
55 520 – 72 647
– 695 – 3 698
– – 291 – 291
– – 74 4 78
– – 60 1 61
– – 37 11 48
– – 4 7 11
– – 13 18 31
– – 14 48 62
– – 8 2 10
440 5828 501 873 7642
E. Uzun et al. / Information Processing and Management 49 (2013) 928–944
935
Table 3 Shallow text features. Feature name
Description
(1) Word Frequency (WF) (2) Density in HTML (D-HTML) (3) Link Frequency (LF) (4) Word Frequency in Links (WF-L) (5) Average Word Frequency in Links (A-WF-L) (6) Ratio of Word Frequency in Links to All Words (R-WF-L-AW)
The number of terms inside tags The ratio of the number of terms inside tags to the number of all terms inside the HTML document The count of A HREF links inside tags The count of terms inside A HREF links placed tags The ratio of the number of terms inside A HREF links placed inside tags to the number of links The ratio of the number of terms inside A HREF links placed inside tags to all of the number of terms inside tags.
may not contain appropriate link and word frequency ratios. To handle this problem, we extract most child DIV/TD tags to form a flat (non-nested) structure from the parent DIV/TD tags. After extraction (AE) of these child DIV/TD tags, we reevaluate the values of the same feature sets. This extraction process provides new feature sets to the ML step. Table 4 gives these new features. Differing from traditional approaches, our study proposes AE features to the learning problem because of the nested nature of the used tags (DIV and TD). The effect of these new features for the learning task is investigated in the experiment section. The non-nested tags, including H1–H6, P, FONT, and A HREF, do not require any AE features. Along with the features given in Tables 3–5 introduces two tag labels and CSS styling features. Tag name as a feature is useful for distinguishing whether a tag may belong to one of the four different classes introduced above. According to CSS styling, which is used for visualization of elements in a web page, some tags may contain ID and/or CLASS attributes. We believe that whether a DIV has an ID or CLASS attribute may have a positive effect on the efficiency of the content extraction step. Fig. 4 supports this idea, but this feature is also investigated in detail below (Section 5). 4.4. Machine learning methods applied in this study Four different common ML methods were applied to our dataset: a naïve Bayes algorithm, a Bayesian Network Algorithm, an instance-based clustering algorithm (k-Nearest Neighbor), and a Decision Tree Algorithm to discover an appropriate learning method. Descriptions of each learning algorithm are given below. The experiments are conducted using the Weka library with a tenfold cross-validation test method (Witten & Frank, 2005). 4.4.1. Naïve Bayes classification Naïve Bayesian classification (Rish, 2001) relies on the assumption that attributes are conditionally independent of each other given the class of examples. 
Though this hypothesis is often inappropriate for real-world problems where attributes strongly depend on each other, this classification approach helps reduce the dimensionality effect by simplifying the problem. Given example X with a feature vector ðx1 ; . . . ; xn Þ, the Naïve Bayes classifier looks for a class label C that maximizes the following likelihood:
PðXjCÞ ¼ Pðx1 ; . . . ; xn jCÞ Below are short descriptions of specific settings employed in our Naïve Bayes classification experiments (Witten & Frank, 2005):
Table 4 Additional shallow text features. Feature name (7) Word Frequency-AE (WF-AE) (8) Density in HTML-AE (D-HTML-AE) (9) Link Frequency-AE (LF-AE) (10) Word Frequency in Links-AE (WF-L-AE) (11) Average Word Frequency in Links-AE (AWF-L-AE) (12) Ratio of Word Frequency in Links to All Number of Words-AE (R-WF-L-AW-AE)
Table 5 Other features. Feature name
Description
(13) Tag Name (TN) (14) Contains Tag ID or CLASS (C-TC)
One of the tag name (TD, DIV, H1–H6, P, FONT, A HREF, SPAN, EM, UL, and LI) Whether the tag has an attribute of ID or CLASS
936
E. Uzun et al. / Information Processing and Management 49 (2013) 928–944
Normal Distribution: Standard Naïve Bayes classifier algorithm for numeric attributes. Kernel Estimator: Kernel estimation for numeric attributes rather than Normal Distribution. Supervised Discretization: Used to convert numeric attributes to nominal ones. 4.4.2. Bayesian Network classification Bayesian Networks (Friedman, Geiger, & Goldszmidt, 1997), more commonly called belief networks or probabilistic networks, are directed a-cycling graphs (DAGs) that contain no cycles. In a Bayesian Network, each node corresponds to a random variable, P(X), and each arc between nodes, P(Y|X), represents the probabilistic dependency between variables. The nodes and arcs define the structure of the network, whereas the conditional probabilities are the parameters for this structure. In Bayesian Networks, inference and structure learning are two learning process tasks. After obtaining the structure, classification can be conducted through inference. Network structure can be given manually instead of learning it from features. Our Weka experiments use settings of two structure-learning search algorithms, namely K2 (Cooper & Herskovits, 1992) and TAN (Cheng & Greiner, 1999; Rish, 2001). Both algorithms are used in local searches for the appropriate structure in Bayesian Networks. 4.4.3. k-Nearest Neighbor classification k-Nearest Neighbor classification (Bremner et al., 2005) is a nonparametric approach used to estimate the class-conditional densities, namely P(X|Ci). Given the discriminant function as below:
g i ðxÞ ¼ PðxjC i ÞPðC i Þ we have P(x|Ci) = ki/(NiVk(x)), where ki is the number of neighbors of the k-nearest that belong to Ci, and Vk(x) is the volume of the n-dimensional hypersphere centered at x, with radius r = ||x xk||, where xk is the k-nearest observation to x (among all neighbors of all classes of x). Selecting the number of neighbors for comparison is an important property in learning; k should thus be selected appropriately. 4.4.4. Decision Tree Classification In Decision Tree Learning (Breiman, Friedman, Olshen, & Stone, 1984), trees are composed of decision nodes and terminal leaves. Given a new instance to be classified, test functions are applied to an instance recursively in decision nodes until hitting a leaf node that assigns a discrete output to it. An instance feature is tested in every node for branching. The information gain of selecting an attribute to form a tree must be calculated, and a predefined number of the most informative attributes must be selected to minimize the depth of the tree. In cases where more than one hypothesis is extracted from the training set, the ensemble learning methods are used to increase classifier efficiency by selecting and combining a set of hypotheses from the hypotheses’ space. These hypotheses are combined into a single classifier that makes predictions by taking a vote of its constituents. One common method in ensemble learning is boosting. The boosting model is sequentially induced from the training examples where the example weights are adjusted at each iteration. The Weka library provides an implementation of the C4.5 Decision Tree Algorithm (Quinlan, 1993) in the J48 class. Some settings employed for Decision Tree Classification in our experiments are briefly explained below:
Default Setting: Standard Decision Tree classifier algorithm. Reduced Error Pruning: An independent test set to estimate the error at each node. No Sub-tree Raising: Used to disable Sub-Tree Raising of the most popular branch. Unpruned: Used to disable prepruning strategies.
4.5. Classification metrics There are several metrics to assess the performance of the ML Methods. One of them is N-Fold Cross Validation. In N-Fold Cross Validation, N tests are made on the dataset, and the observed metrics, including accuracy, precision, recall and f-Measure, are averaged. In each test, the N 1 training sets are trained, and the Nth portion is tested. For 10-Fold Cross Validation, the data are first split into 10 sets, and a test is made for each set to train the remaining sets. Considering that the ML is a binary classification task, each sample is separated into two cases (classes), positive and negative (i.e., has required, has not required). Based on these cases, Table 6 gives the necessary definitions to calculate the accuracy, precision, recall and f-Measure. This definition table is called a Confusion Matrix: the actual values are in rows, and the predicted ones are in columns. The diagonal cells show the number of correct predictions for both positive and negative cases. Other cells show both misclassifications and actual class classifications. The accuracy metric allows for measuring the percentage of correct predictions for the overall data. This metric accounts for both positive and negative instances. According to the definitions given in Table 6, the following equations define the accuracy, precision and recall, respectively.
accuracy = (TP + TN) / (TP + TN + FP + FN)
Table 6
Confusion matrix.

                          Predicted
Known                     Positive    Negative
Positive                  TP          FN
Negative                  FP          TN
where True Positive (TP) is the number of positive examples correctly classified as positive, False Positive (FP) is the number of negative examples incorrectly classified as positive, True Negative (TN) is the number of negative examples correctly classified as negative, and False Negative (FN) is the number of positive examples incorrectly classified as negative.
precision = TP / (TP + FP)

recall = TP / (TP + FN)
In the special case where beta (b) equals 1, the f-Measure combines precision and recall by taking their harmonic mean and is called the f1-measure.
F_b = (1 + b^2) · (precision · recall) / ((b^2 · precision) + recall)
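The metrics above follow directly from the confusion-matrix counts; a minimal sketch in Python (the TP/TN/FP/FN counts below are illustrative, not taken from the paper's data):

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    """Fraction of predicted positives that are actually positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of actual positives that are predicted positive."""
    return tp / (tp + fn)

def f_measure(tp, fp, fn, beta=1.0):
    """General F-beta score; beta = 1 gives the f1-measure (harmonic mean)."""
    p, r = precision(tp, fp), recall(tp, fn)
    return (1 + beta**2) * p * r / (beta**2 * p + r)

# Illustrative counts
tp, tn, fp, fn = 80, 90, 10, 20
print(accuracy(tp, tn, fp, fn))  # 0.85
print(recall(tp, fn))            # 0.8
```

With beta = 1 the formula reduces to 2pr / (p + r), the harmonic mean of precision and recall.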
Moreover, the last metric is the kappa statistic, which measures the degree of agreement between two classifiers (or between a classifier and the ground truth). The kappa statistic is an alternative to the accuracy measure for evaluating methods. It was first introduced as a metric to measure the degree of agreement between two observers (Cohen, 1960) and has since been used in several disciplines. In ML, it assesses the improvement of a method's accuracy over a predictor employing chance as its guide. This measure is defined as:
k = (P_o - P_c) / (1 - P_c)
where P_o is the accuracy of the method, and P_c is the expected accuracy that a randomly guessing method could achieve on the same dataset. The kappa statistic has a range between -1 and 1, where -1 is total disagreement (i.e., total misclassification), and 1 is perfect agreement (i.e., 100% accurate classification). Kappa fundamentally assesses how much better a learning method is compared to the majority baseline, and class distribution-based random classifiers score zero kappa. Landis and Koch (1977) suggest that a kappa score over 0.4 indicates a reasonable agreement beyond chance.

4.6. ML results and error analysis

ML methods can classify the informative and uninformative content within an error margin, and each learning method may have a different error rate for the same dataset. To find the most accurate learning method, we used two datasets, one for training and one for testing. Because the training dataset is crucial to classification performance, tenfold cross-validation is applied in order to evaluate and compare different ML methods with several configurations. Additionally, the test dataset is used to assess how well the results obtained from the training process generalize. In the performance evaluation, we measured the accuracy, precision, recall, f-Measure and kappa statistics. Table 7 gives the training (cross-validated) and testing results. The Naive Bayes algorithm is one of the simplest algorithms used in the learning task. Though it gives good results with the Supervised Discretization method, it does not perform better than the other algorithms due to its low kappa scores. In Bayesian Networks, the TAN and K2 search methods are tested. These search methods form a proper Bayesian network structure and boost the accuracy. The results of the other two methods, k-Nearest Neighbors and Decision Tree, are very close. In k-Nearest Neighbors, the appropriate selection of the number of neighbors (k) has a significant effect on the accuracy.
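The kappa computation can be sketched as follows; the P_o and P_c values below are illustrative, not taken from Table 7:

```python
def kappa(p_o, p_c):
    """Cohen's kappa: improvement of observed accuracy p_o over chance accuracy p_c."""
    return (p_o - p_c) / (1 - p_c)

# Illustrative values: 95% observed accuracy against 70% chance-level accuracy
print(kappa(0.95, 0.70))  # ~0.833, well above the 0.4 threshold of Landis and Koch
```

Note that a classifier whose accuracy merely matches chance (p_o = p_c) scores exactly zero, which is why kappa complements raw accuracy on imbalanced class distributions like ours.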
The best performance is obtained with Decision Tree Learning and its Sub-tree Raising method, with 95.76% accuracy on the training dataset and 94.88% accuracy on the testing dataset. To expand the details of the Decision Tree Learning, Table 8 gives the confusion matrix, which compares the actual results with the predicted ones in terms of classes. The number of uninformative instances is higher than that of the other classes, which is why they are more likely to be confused in the inference. In Table 8, uninformative blocks are misclassified as the article information class in 272 instances of the training dataset and 201 instances of the testing dataset. We believe that this is expected, as both classes contain similar link structures and word counts. Manual observations generally show that article information contains shorter links. To overcome this problem, we could set a threshold value for the length of the links; the system may, however, then accept links such as ''send'', ''print'' and ''Facebook''. The most common approach for short and uninformative links is to group the common words for the same
Table 7
Weighted average results of predictions of the main blocks, uninformative blocks, headline and article information content on two different datasets.

Classification algorithms                                Accuracy (%)  Precision  Recall  f-Measure  Kappa

Dataset-1 (Train)
Naïve Bayes         Normal Distribution                  76.33         0.86       0.76    0.79       0.52
                    Kernel Estimator                     84.14         0.89       0.84    0.86       0.65
                    Supervised Discretization            90.55         0.92       0.91    0.91       0.77
Bayesian Network    Search Algorithm: K2                 90.72         0.92       0.91    0.91       0.77
                    Search Algorithm: TAN                93.23         0.94       0.93    0.94       0.83
k-Nearest Neighbor  k = 1                                95.69         0.96       0.96    0.96       0.89
                    k = 2                                95.33         0.95       0.95    0.95       0.87
Decision Tree       Sub-tree Raising                     95.76         0.96       0.96    0.96       0.89
                    Unpruned                             95.65         0.96       0.96    0.96       0.89
                    Reduced Error Pruning                95.49         0.95       0.96    0.95       0.88

Dataset-2 (Test)
Naïve Bayes         Normal Distribution                  76.77         0.86       0.77    0.80       0.53
                    Kernel Estimator                     84.81         0.89       0.85    0.86       0.67
                    Supervised Discretization            90.30         0.91       0.90    0.91       0.77
Bayesian Network    Search Algorithm: K2                 90.40         0.91       0.90    0.91       0.77
                    Search Algorithm: TAN                92.85         0.93       0.93    0.93       0.83
k-Nearest Neighbor  k = 1                                94.96         0.95       0.95    0.95       0.87
                    k = 2                                94.59         0.94       0.95    0.94       0.86
Decision Tree       Sub-tree Raising                     94.88         0.95       0.95    0.95       0.87
                    Unpruned                             94.84         0.95       0.95    0.95       0.87
                    Reduced Error Pruning                94.76         0.95       0.95    0.95       0.87
Table 8
Confusion matrix and prediction of four classes via the Sub-tree Raising method of Decision Tree Learning on two different datasets.

Dataset-1: Training dataset
                              Predicted
Known                         a      b      c      d       Precision  Recall  f-Measure
Uninformative blocks (a)      9481   26     0      162     0.98       0.98    0.98
Main blocks (b)               29     631    0      15      0.94       0.94    0.94
Headline (c)                  1      0      780    26      0.95       0.97    0.96
Article information (d)       208    17     47     1092    0.84       0.80    0.82
Weighted avg.                                              0.96       0.96    0.96

Dataset-2: Testing dataset
                              Predicted
Known                         a      b      c      d       Precision  Recall  f-Measure
Uninformative blocks (a)      5668   12     0      148     0.97       0.97    0.97
Main blocks (b)               17     390    0      10      0.93       0.94    0.93
Headline (c)                  4      2      501    6       0.92       0.98    0.95
Article information (d)       143    15     43     686     0.80       0.78    0.79
Weighted avg.                                              0.95       0.95    0.95
web domain and remove them directly. This makes the approach language dependent, so we did not make this adjustment in order to preserve language independence. The result of the Decision Tree Learning algorithm is a binary tree that provides a better understanding of the features and their relations. The decision tree in our model consists of 269 decision nodes and 153 leaves. Fig. 5 shows the portion of the actual tree used to predict the main blocks and article information. When we analyze Fig. 5, we see that 614 of 675 Main Blocks are classified correctly using only D-HTML-AE (Density in HTML – After Extraction), R-WF-L-AW-AE (Ratio of Word Frequency in Links to All Words – After Extraction) and WF-L-AE (the count of terms inside A HREF link tags – After Extraction). Only nine errors occur in the TN = DIV classification in the decision tree. As a result, after-extraction (AE) features and the new features like R-WF-L-AW-AE and WF-L-AE derived in our approach have positive effects on Main Block prediction. On the other hand, features without the AE suffix are also effective in article information prediction: 449 of 1352 article information blocks are classified correctly using six features, with only 22 errors in this prediction. These analyses of several portions of the actual tree indicate that AE features are crucial for the prediction of Main Blocks. Information gain, a statistical property, can be used to examine the effects of all features on prediction; it measures how effective features are in different combinations. Fig. 6 shows different feature sets and their information gains for the whole learning process.
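The Main Block branch of the tree can be sketched as nested conditionals. The D-HTML-AE threshold (0.099698) and the TN = DIV / TN = TD tests are legible in Fig. 5, but the comparison on R-WF-L-AW-AE is only partially legible, so the threshold below is a hypothetical placeholder, not the paper's value:

```python
# Sketch of the Fig. 5 rule portion for Main Block prediction (illustrative only).
R_WF_L_AW_AE_THRESHOLD = 76  # hypothetical placeholder; the figure's exact test is unclear

def predict_main_block(d_html_ae, r_wf_l_aw_ae, tag_name):
    """Return True when a block is predicted to be a Main Block by this rule portion."""
    if d_html_ae > 0.099698:              # density in HTML after extraction
        if r_wf_l_aw_ae <= R_WF_L_AW_AE_THRESHOLD:
            if tag_name in ("DIV", "TD"):  # TN = DIV (518/9), TN = TD (96/0) in Fig. 5
                return True
    return False

print(predict_main_block(0.2, 10, "DIV"))   # True
print(predict_main_block(0.05, 10, "DIV"))  # False: density below threshold
```

Once such a path is extracted from the learned tree, it becomes the kind of simple rule that the extraction step can evaluate with string manipulation alone, without rebuilding the DOM.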
[Fig. 5. Portion of the actual decision tree. The Main Block prediction branch tests D-HTML-AE > 0.099698, R-WF-L-AW-AE and the tag name, e.g. TN = DIV: Main Block (518/9) and TN = TD: Main Block (96/0); the Article Information prediction branch starts from R-WF-L-AW.]