contains unnecessary texts. We automatically determine these blocks as shown in Fig. 1b. However, this is a difficult process because of the heterogeneous and semi-structured nature of web pages, as the detailed examples in Section 4.1 show. The first step of our approach prepares rules in XML format from an HTML document using ML methods. This step (creating the DOM, preparing features and applying an ML method) increases complexity. Aware of this time complexity, Uzun et al. (2011b) show that¹

¹ The Web Content Extractor application implements the hybrid approach. This application is an open-source project and is available via the web page http://bilgmuh.nku.edu.tr/webce/. Moreover, the datasets that we used in the learning stage of our approach and our experimental results are also available on this web page.
E. Uzun et al. / Information Processing and Management 49 (2013) 928–944
simple string manipulation techniques can efficiently extract informative content without using the DOM. Therefore, the second step utilizes a simple string manipulation technique using the rules obtained in the first step. The experimental methodology of the first step is, from an algorithmic viewpoint, the same as that of other approaches. However, the main focus of this paper is not only building a more accurate model with additional features but also producing effective rules for each domain. The next section presents related work and our own focus. The third section introduces the similarities and differences of the hybrid approach as well as the workflow of the extraction process. The fourth section is dedicated to rule induction and ML methods and introduces the details of the problem, feature selection, the dataset, learning algorithms, metrics and experiments. The fifth section covers the efficiency of the extraction algorithm and makes a comparison. The last section provides our conclusions.

2. Related work

Related work can be grouped into two categories: automatic extraction techniques and hand-crafted rules. The main focus of automatic extraction techniques is inference through features extracted from HTML. Hand-crafted rules are mostly used to extract information from HTML through string manipulation functions. Though preparing hand-crafted rules is difficult and cumbersome, their efficiency is high with proper adjustments. The advantage of automatic extraction techniques over hand-crafted rules is their automaticity and easy applicability. This study combines the two methods, inheriting the efficiency of hand-crafted rules and the automaticity of automatic extraction. Several researchers have investigated automatic extraction techniques on web pages through web page segmentation.
These studies have mostly focused on DOM-based segmentation (Baluja, 2006), location-based segmentation (Kovacevic, Diligenti, Gori, & Milutinovic, 2002) and vision-based segmentation (Cai et al., 2003a, 2003b; Kovacevic et al., 2002; Yu, Cai, Wen, & Ma, 2003). DOM-based studies use DOM-level features with trained classifiers to extract useful content from web document templates (Bar-Yossef & Rajagopalan, 2002; Chakrabarti, Kumar, & Punera, 2008; Chen, Zhou, Shi, Zhang, & Qiu, 2001; Debnath, Mitra, Pal, & Giles, 2005). Most of these studies use heuristics that depend on site-level hyperlink information (Chakrabarti, Kumar, & Punera, 2007), the distribution of segment-level text density ratios (Kohlschütter, 2009; Kohlschütter & Nejdl, 2008), n-grams (Baroni, Chantree, Kilgarriff, & Sharoff, 2008) and shallow text features (Kohlschütter, Fankhauser, & Nejdl, 2010). The approaches in these studies are not suitable for generating rules that can be used for string manipulation. Our approach provides a method to generate rules that can be utilized for efficient extraction. In DOM-level approaches (Gibson, Wellner, & Lubar, 2007; Hofmann & Weerkamp, 2007; Spousta, Marek, & Pecina, 2008; Yi & Liu, 2003a, 2003b), the common idea is that dissimilar content within the same structure, or similar repeating patterns in templates and styles across several pages, indicates noisy blocks. Based on that idea, Bar-Yossef and Rajagopalan (2002) report that eliminating templates increases the precision of a search engine called Clever at all recall levels. Similarly, Lin and Ho (2002) designed InfoDiscover, a tool that extracts informative content from web pages. In their study, they use the TABLE tag to partition the web page into blocks. Several other tools use sequential approaches, including n-grams and conditional random fields, to clean noisy text from web pages (Evert, 2008; Spousta et al., 2008).
However, none of these studies ranks the importance of blocks, except that conducted by Song, Liu, Wen, and Ma (2004). That study emphasizes that a scheme weighting blocks by their importance is useful for both search engines and data mining applications. Our approach detects uninformative blocks and three types of informative blocks (main, headline and article information). Other studies focus either on the layout HTML tags (DIV and TD) or on all HTML tags; our study, by contrast, takes into account the appropriate HTML tags for each block. In main block detection, the layout HTML tags can be used to determine the most comprehensively informative texts of a web page, whereas the detection of headline and article information blocks considers all HTML tags. Location-based segmentation relies on the position features of the areas of interest. These areas are determined by their location and are mostly labeled as left menu, right menu, footer content, etc. This approach depends on the assumption that the location, width and area of certain tags are valuable information for extracting useful content, and that they should be combined with the label features of these tags. In vision-based segmentation, by contrast, the features used for segmentation are visual features, including lines, colors, blanks, images, different font sizes and different colors. Some vision-based segmentation approaches rely heavily on the DOM structure, which diminishes segmentation efficiency, while others use both visual cues and the DOM structure. Other approaches, similar to vision-based segmentation, attempt to identify the most interesting and informative portions of web content (Baluja, 2006; Chen, Ma, & Zhang, 2003; Xue et al., 2007; Yang, Xiang, & Shi, 2009). These studies cluster style and content positions across different pages and distinguish the resulting clusters as uninformative template regions.
Yi and Liu (2003a, 2003b) utilize a compressed tree structure and a site style tree, respectively, to identify uninformative DOM nodes across pages. Two studies focus on TABLE tags: Ma, Goharian, Chowdhury, and Chung (2003) look for repeated blocks to mark as uninformative TD sections, whereas Lin (2002) uses entropy over a set of word features to remove redundant blocks from web pages. All of these studies focus on TD tags; however, nowadays web designers prefer DIV tags over TD tags, so the DIV tag is also included in our model. Additionally, there are web scrapers that skip DOM structure creation and use rules instead, including regular expressions written in languages such as Java and Perl. These tools consider efficiency and accuracy as their judging criteria (Adelberg, 1998; Liu, Pu, & Han, 2000; Vieira et al., 2006). They are efficient, but they are inappropriate or labor-intensive for extracting information from web templates that change over time. Hand-crafted rules also tend to be
impractical for more than a couple of sources. The approach presented in this study is built on an appropriate combination of hand-crafted rules and automatic extraction techniques.

3. The hybrid approach

The approach developed in this study involves automatic rule creation instead of manual hand-crafted rule insertion. These rules are used to infer informative content from simple HTML pages. Similar to other studies, our approach first extracts DOM-based features and utilizes these features to extract informative content. What distinguishes this study from earlier ones is that our approach infers rules that can be used like hand-crafted rules. A model is designed for this task. Our model is based on two block tags, DIV and TD, selected as the most suitable markers for determining the boundaries of informative content. Because the system is constructed on DIV and TD tags, we can automatically determine the most comprehensive rule sets and maintain efficiency in informative content extraction. Fig. 2 shows the workflow of this approach: the learning process, the extraction process, rule selection and the creation of a well-formed document based on the appropriateness of the rule for the web pages. This workflow consists of two main steps:

1. Rule induction with an ML method
2. Efficient informative content extraction using the rules

In the first step, rule induction is performed via ML methods; in the second, the extracted rules are used to determine the informative content of web pages and to construct a well-formed document that contains only this informative content. The procedure is as follows. For a given web page, the database is first checked for stored rules.
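The lookup-then-induce flow described above can be sketched as follows. This is a minimal illustration under our own naming (`RULE_DB`, `induce_rules`, `extract_informative` are hypothetical, not the authors' implementation), with a trivial regular-expression rule standing in for the induced rules:

```python
# Hypothetical sketch of the two-step hybrid workflow: look up a stored
# rule for the page's domain; if none exists (or the rule no longer
# matches), fall back to the ML-based rule-induction step and cache the
# result. Names here are illustrative, not the authors' implementation.
import re
from urllib.parse import urlparse

RULE_DB = {}  # domain -> rule (a regular expression over the raw HTML)

def induce_rules(html):
    # Placeholder for step 1: build the DOM, extract features and
    # apply an ML method to produce a string-matching rule.
    return r'<div[^>]*id="content"[^>]*>(.*?)</div>'

def extract_informative(url, html):
    domain = urlparse(url).netloc
    rule = RULE_DB.get(domain)
    if rule is None or not re.search(rule, html, re.S):
        rule = induce_rules(html)      # slow path: ML rule induction
        RULE_DB[domain] = rule         # cache for later pages
    match = re.search(rule, html, re.S)
    return match.group(1) if match else None

html = '<html><body><div id="content">Informative text.</div></body></html>'
print(extract_informative("http://example.com/a", html))
```

Once a domain's rule is cached, every subsequent page from that domain takes only the fast string-matching path, which is the source of the approach's efficiency.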
Fig. 2. The workflow of the hybrid approach.

Fig. 3. A simple web page.
Fig. 4. Blocks separated by DIV tags in an example web page.
If rules for the page are in the database, they are tested for appropriateness; otherwise, an ML method is applied to induce the rules and create the well-formed document. A rule is considered appropriate when it produces a single result. In the rule induction phase, marked as step one, the DOM is created and features are extracted from this DOM tree. An ML method is then applied. For this step, we compare several different machine learning methods in the following sections and choose decision tree learning with the sub-tree raising method as the most effective and accurate method for the dataset.

4. Rule induction with ML

In the rule induction step, the learning stage produces rules for efficient web content extraction. Preparing the learning stage requires a dataset and appropriate features derived from this dataset. The dataset should be well defined; it should not contain too much noise, and it need only include a modest number of samples. In this section, we describe the HTML and DOM structures prior to the dataset and feature selection. We then introduce the feature selection used in the creation of the dataset, and the dataset itself. We next present the tested ML methods and metrics, describe the results of the ML methods, and explain how rules are obtained from those results.

4.1. HTML and DOM

HTML is a simple and effective markup language used to develop web sites. HTML contains several tag sets for visualizing content. A web browser interprets these tags and creates a web page that a human can easily understand. Developers who want to present their visual content with richer features, including Javascript, use a hierarchy called the DOM. Fig. 3 shows the content of a simple web page in three different views and illustrates that the informative content is written between HTML tags.
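The per-block statistics that feed the later ML step can be gathered directly while parsing. Below is a minimal sketch (our illustration, not the authors' code) using Python's `html.parser` to count words and links inside each DIV/TD block:

```python
# Minimal sketch: walk the HTML and record, for every DIV/TD block,
# how many words and A HREF links it contains. Nested blocks each
# accumulate the statistics of their descendants.
from html.parser import HTMLParser

class BlockStats(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack, self.blocks = [], []

    def handle_starttag(self, tag, attrs):
        if tag in ("div", "td"):
            self.stack.append({"tag": tag, "attrs": dict(attrs),
                               "words": 0, "links": 0})
        if tag == "a":
            for b in self.stack:        # a link counts for all open blocks
                b["links"] += 1

    def handle_data(self, data):
        n = len(data.split())
        for b in self.stack:            # words count for all open blocks
            b["words"] += n

    def handle_endtag(self, tag):
        if tag in ("div", "td") and self.stack:
            self.blocks.append(self.stack.pop())

p = BlockStats()
p.feed('<div id="menu"><a href="/x">Home</a></div>'
       '<div id="story">A long informative article text.</div>')
for b in p.blocks:
    print(b["attrs"].get("id"), b["words"], b["links"])
```

The sketch assumes well-nested DIV/TD tags; real pages would need the more robust DOM construction the paper describes.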
An HTML tag is generally formed from the HEAD, which contains necessary information about the web site, and the BODY, which is used to visualize the content. DIV and TD tags, also called block tags, are used to separate the web site into several practical blocks, and they are generally referred to as block markers. An A HREF tag is used to give links to different web pages. H1 and P tags are used to format the text. For the example given in Fig. 3, a simple regular expression
<H1>(.*?)</H1>
can be used to extract the title of the web page. However, H1 and P are not the only tags used to hold the title; there are others, including H2–H6, FONT, SPAN, EM, UL, and LI. Using these tags to extract informative content is not suitable for rule induction; the block tags can instead be used in the extraction process. In a previous study (Yerlikaya & Uzun, 2010), an intelligent browser was developed to extract only informative content via manual DIV/TD selection and proper adjustments in the DOM hierarchy. Block tags have descriptive parts (including ID and CLASS) that can be used for producing rules. Fig. 4 gives a simple example of the use of block tags (here, DIV tags) in web pages. Fig. 4 contains only DIV tags as block tags, but TD tags can also be used in this visualization. DIV and TD tags are generally used in web design to separate different blocks. The TD tag is the most specific tag for TABLE formatting in HTML. The DIV tag, in turn, has become one of the most frequently used tags in HTML together with Cascading Style Sheets (CSS) styling, which introduced flexibility and ease of design to HTML. Both TD and DIV formatting can be represented in a nested structure. However, other tags (i.e., H1–H6, FONT, SPAN, EM, UL, and LI) are used
in flat form. For instance, the regular expression <A HREF="(.*?)"> can easily be used to extract all links in a web page without any closing-tag ambiguity. Block detection directly depends on using proper features in extraction. For instance, when Fig. 4 is examined using the link distribution as a selection criterion, we can identify the informative blocks as those with the fewest links and the uninformative blocks as those with many links. Certainly, the link and word distributions of each block are two important selection criteria, but some blocks may contain both informative and uninformative sub-blocks. When such blocks are encountered, the most specific block under the parent block should be tested to determine whether it is informative. This test also changes the word and link distributions of informative and uninformative blocks. In Fig. 4, the block
contains both informative and uninformative blocks. To simplify the block extraction problem, we categorize blocks into four different cases. The corresponding blocks given in Fig. 4 are listed below.

Main block: the largest informative block, a DIV or TD tag with a high number of terms.
Uninformative blocks: the blocks mostly used for advertisements, links and menus, separated by DIV, TD and UL tags.
Headline: the tags that contain the web page title.
Article information: the tags that contain the text summary, author name, date information, image titles, and user comments.
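As a toy illustration of the link-distribution criterion discussed above, a block can be flagged by the ratio of link words to all of its words. The 0.5 threshold below is our assumption for illustration only; in the paper such boundaries are learned by the ML step rather than fixed by hand:

```python
# Toy illustration of the link-distribution criterion: blocks whose
# text is dominated by link words are treated as uninformative.
# The 0.5 threshold is an assumed value for illustration only.
def classify_block(word_count, link_word_count):
    if word_count == 0:
        return "uninformative"          # empty blocks carry no content
    ratio = link_word_count / word_count
    return "uninformative" if ratio > 0.5 else "informative"

print(classify_block(120, 4))   # article-like block: few link words
print(classify_block(30, 28))   # menu-like block: nearly all link words
```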
To annotate the classes given above, a double-blind annotation technique is used among eight annotators (academic staff, graduate students, and undergraduate students taking the Information Retrieval course). In this technique, two annotators independently mark the samples, and a third annotator resolves the differences between them to define a Gold Standard. After this annotation, the statistical results (Table 1) of the text collections are obtained.
In Table 1, each tag is given in the columns and each class in the rows. Based on the counts given in Table 1, the usage of the DIV tag increased while the usage of TD decreased between 2008 and 2011. This supports the finding of Uzun et al. (2011a) that the TD tag has become an outdated informative block marker, while the DIV tag is the current trend due to the evolution of the web. Datasets used in ML should contain enough samples for the learning process. However, there are extreme differences between the numbers of some tags in the text collections. Therefore, a training dataset and a test dataset are created randomly from these text collections to balance the ML data (Table 2).

Training Dataset (Dataset-1): 736 web pages from 573 different domains.
Test Dataset (Dataset-2): 434 web pages from 324 different domains.

The statistics of the datasets in Table 2 make them suitable for use with ML methods. The DIV block marker is the most common tag used to separate the main and irrelevant blocks. Article information content, on the other hand, is spread across most of the tags; we expect this block to have lower prediction accuracy due to its uniformly distributed nature.

4.3. Feature selection

As in many other ML approaches, the methods applied in our experiments require that the data be represented as vectors of feature-value pairs. In our experiments, we not only adopt general shallow text features as feature vectors but also expand the common approach with combinations of classical features and after-extraction features. These features are mostly referred to as shallow text features and are used to find the informative content of the tags from which they are extracted (Gibson et al., 2007; Kohlschütter, 2009; Kohlschütter & Nejdl, 2008; Kohlschütter et al., 2010). We use classical features such as (1)-(3) in Table 3, and we derive new features (such as (4)-(6) in Table 3) that are combinations of the classical features.
Common approaches that use shallow text features also consider non-nested tags, including H1–H6, P, FONT, A HREF, SPAN, EM, UL, and LI. Our learning model, however, is also based on the nested tags DIV and TD, for which the common approaches create a problem: a parent DIV tag may contain several uninformative and informative DIV children. This creates noise in the statistics derived from the parent DIV tag, so the features extracted
Table 1
Block information for annotated text collections.

                      DIV    TD    UL   H1   H2   H3   H4   H5  FONT  SPAN    P   Total
Until 2008
Main                  477   149     –    –    –    –    –    –     –     –    –     626
Irrelevant           3991  1099   969    –    –    –    –    –     –     –    –    6059
Headline                –     –     –  325  164  134   26   11    35    27    4     726
Article information   278   113    12    1    8   10   15   14    28    93    3     575
Total                4746  1361   981  326  172  144   41   25    63   120    7    7986

Until 2011
Main                  516    23     –    –    –    –    –    –     –     –    –     539
Irrelevant           8129   554   755    –    –    –    –    –     –     –    –    9438
Headline                –     –     –  384   89   18   40    –    10    13   14     568
Article information  1565    65     3    –    –    –    –    –     5    12    –    1650
Total              10,210   642   758  384   89   18   40    –    15    25   14  12,195
Table 2
Tag information about the annotated datasets.

                      DIV    TD    UL   H1   H2   H3   H4   H5  FONT  SPAN    P   Total
Dataset-1
Main                  608   117     –    –    –    –    –    –     –     –    –     725
Irrelevant           7507  1133  1029    –    –    –    –    –     –     –    –    9669
Headline                –     –     –  418  179   92   29    7    32    26   10     793
Article information  1136   106    12    1    4    9    4    7    15    57    1    1352
Total                9251  1356  1041  419  183  101   33   14    47    83   11  12,539

Dataset-2
Main                  385    55     –    –    –    –    –    –     –     –    –     440
Irrelevant           4613   520   695    –    –    –    –    –     –     –    –    5828
Headline                –     –     –  291   74   60   37    4    13    14    8     501
Article information   707    72     3    –    4    1   11    7    18    48    2     873
Total                5705   647   698  291   78   61   48   11    31    62   10    7642
Table 3
Shallow text features.

(1) Word Frequency (WF): The number of terms inside the tag.
(2) Density in HTML (D-HTML): The ratio of the number of terms inside the tag to the number of all terms inside the HTML document.
(3) Link Frequency (LF): The count of A HREF links inside the tag.
(4) Word Frequency in Links (WF-L): The count of terms inside A HREF links placed inside the tag.
(5) Average Word Frequency in Links (A-WF-L): The ratio of the number of terms inside A HREF links placed inside the tag to the number of links.
(6) Ratio of Word Frequency in Links to All Words (R-WF-L-AW): The ratio of the number of terms inside A HREF links placed inside the tag to the number of all terms inside the tag.
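To make the feature definitions in Table 3 concrete, the sketch below computes features (1)-(6) for a single block with plain string operations. It is illustrative only: the function name and the tag-stripping shortcuts are ours, not the paper's implementation, and a production version would need a more careful HTML tokenizer.

```python
import re

def shallow_features(block_html: str, page_text: str) -> dict:
    """Compute shallow text features (1)-(6) of Table 3 for one block.

    block_html: inner HTML of the block tag (e.g., a DIV)
    page_text:  visible text of the whole HTML document
    """
    # Visible text of the block: drop all tags, then split on whitespace.
    words = re.sub(r"<[^>]+>", " ", block_html).split()

    # Terms appearing inside <a href=...>...</a> links within the block.
    link_bodies = re.findall(r"<a\s[^>]*href[^>]*>(.*?)</a>",
                             block_html, flags=re.I | re.S)
    link_words = [w for body in link_bodies
                  for w in re.sub(r"<[^>]+>", " ", body).split()]

    wf = len(words)                                       # (1) Word Frequency
    lf = len(link_bodies)                                 # (3) Link Frequency
    wf_l = len(link_words)                                # (4) Word Frequency in Links
    return {
        "WF": wf,
        "D-HTML": wf / max(len(page_text.split()), 1),    # (2) block terms / all terms
        "LF": lf,
        "WF-L": wf_l,
        "A-WF-L": wf_l / max(lf, 1),                      # (5) terms per link
        "R-WF-L-AW": wf_l / max(wf, 1),                   # (6) link terms / all block terms
    }
```

For example, a block whose only link is a two-word anchor inside a four-word text yields R-WF-L-AW = 0.5, which a classifier can read as a half-navigational block.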
may not contain appropriate link and word frequency ratios. To handle this problem, we extract the child DIV/TD tags to form a flat (non-nested) structure from the parent DIV/TD tags. After extraction (AE) of these child DIV/TD tags, we re-evaluate the values of the same feature set. This extraction process provides new feature sets for the ML step; Table 4 gives these new features. Differing from traditional approaches, our study adds AE features to the learning problem because of the nested nature of the tags used (DIV and TD). The effect of these new features on the learning task is investigated in the experiment section. The non-nested tags, including H1–H6, P, FONT, and A HREF, do not require any AE features. Along with the features given in Tables 3 and 4, Table 5 introduces a tag-name feature and a CSS styling feature. The tag name is useful for distinguishing which of the four classes introduced above a tag may belong to. Regarding CSS styling, which is used for the visualization of elements in a web page, some tags may contain ID and/or CLASS attributes. We believe that whether a DIV has an ID or CLASS attribute may have a positive effect on the efficiency of the content extraction step. Fig. 4 supports this idea, and this feature is also investigated in detail below (Section 5).

4.4. Machine learning methods applied in this study

To discover an appropriate learning method, four common ML methods were applied to our dataset: a Naïve Bayes algorithm, a Bayesian Network algorithm, an instance-based algorithm (k-Nearest Neighbor), and a Decision Tree algorithm. Descriptions of each learning algorithm are given below. The experiments are conducted using the Weka library with a tenfold cross-validation test method (Witten & Frank, 2005).

4.4.1. Naïve Bayes classification

Naïve Bayesian classification (Rish, 2001) relies on the assumption that attributes are conditionally independent of each other given the class of examples.
Though this hypothesis is often inappropriate for real-world problems where attributes strongly depend on each other, this classification approach helps reduce the dimensionality effect by simplifying the problem. Given an example X with feature vector (x1, ..., xn), the Naïve Bayes classifier looks for a class label C that maximizes the following likelihood:

P(X|C) = P(x1, ..., xn | C)

Below are short descriptions of the specific settings employed in our Naïve Bayes classification experiments (Witten & Frank, 2005):
Table 4
Additional shallow text features: the Table 3 features recomputed after extraction (AE) of child DIV/TD blocks.

(7) Word Frequency-AE (WF-AE)
(8) Density in HTML-AE (D-HTML-AE)
(9) Link Frequency-AE (LF-AE)
(10) Word Frequency in Links-AE (WF-L-AE)
(11) Average Word Frequency in Links-AE (AWF-L-AE)
(12) Ratio of Word Frequency in Links to All Words-AE (R-WF-L-AW-AE)
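The after-extraction (AE) idea of Table 4 can be sketched as: remove every nested child block from the parent's inner HTML, then recompute the same six features on what remains. The helper below is a simplified illustration of that flattening step (DIV only, regex-based), not the authors' code:

```python
import re

# Matches an innermost DIV element, i.e., one whose body contains no DIV tag.
_INNER_DIV = re.compile(r"<div\b[^>]*>(?:(?!</?div\b).)*</div>",
                        flags=re.I | re.S)

def after_extraction(parent_inner_html: str) -> str:
    """Return the parent's inner HTML with all child DIV blocks removed.

    Innermost DIV elements are deleted repeatedly until none remain, so the
    parent's own text survives while nested children (menus, ads, sub-blocks)
    are stripped before the Table 3 features are recomputed as AE features.
    """
    html = parent_inner_html
    while True:
        html, n = _INNER_DIV.subn(" ", html)
        if n == 0:
            return html
```

For instance, `after_extraction('main text <div id="ad">buy now <div>inner</div></div> more')` keeps only the parent's own words, so the nested ad's link and word counts no longer pollute the parent's statistics.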
Table 5
Other features.

(13) Tag Name (TN): One of the tag names (TD, DIV, H1–H6, P, FONT, A HREF, SPAN, EM, UL, and LI).
(14) Contains Tag ID or CLASS (C-TC): Whether the tag has an ID or CLASS attribute.
Normal Distribution: the standard Naïve Bayes classifier algorithm for numeric attributes.
Kernel Estimator: kernel estimation for numeric attributes rather than a Normal Distribution.
Supervised Discretization: used to convert numeric attributes to nominal ones.

4.4.2. Bayesian Network classification

Bayesian Networks (Friedman, Geiger, & Goldszmidt, 1997), also called belief networks or probabilistic networks, are directed acyclic graphs (DAGs). In a Bayesian Network, each node corresponds to a random variable, P(X), and each arc between nodes, P(Y|X), represents the probabilistic dependency between variables. The nodes and arcs define the structure of the network, whereas the conditional probabilities are the parameters of this structure. In Bayesian Networks, structure learning and inference are the two learning tasks. After obtaining the structure, classification can be conducted through inference. The network structure can also be given manually instead of being learned from the features. Our Weka experiments use two structure-learning search algorithms, namely K2 (Cooper & Herskovits, 1992) and TAN (Cheng & Greiner, 1999; Rish, 2001). Both algorithms perform local searches for an appropriate structure in Bayesian Networks.

4.4.3. k-Nearest Neighbor classification

k-Nearest Neighbor classification (Bremner et al., 2005) is a nonparametric approach used to estimate the class-conditional densities, namely P(X|Ci). Given the discriminant function below:
gi(x) = P(x|Ci) P(Ci)

we have P(x|Ci) = ki / (Ni Vk(x)), where ki is the number of the k nearest neighbors that belong to Ci, Ni is the number of examples in class Ci, and Vk(x) is the volume of the n-dimensional hypersphere centered at x with radius r = ||x - xk||, where xk is the k-th nearest observation to x (among all neighbors of all classes of x). Selecting the number of neighbors for comparison is an important property of the learning; k should thus be selected appropriately.

4.4.4. Decision Tree classification

In Decision Tree Learning (Breiman, Friedman, Olshen, & Stone, 1984), trees are composed of decision nodes and terminal leaves. Given a new instance to be classified, test functions are applied to the instance recursively in decision nodes until a leaf node that assigns it a discrete output is reached. An instance feature is tested in every node for branching. The information gain of selecting an attribute to form a tree must be calculated, and a predefined number of the most informative attributes must be selected to minimize the depth of the tree. In cases where more than one hypothesis is extracted from the training set, ensemble learning methods are used to increase classifier efficiency by selecting and combining a set of hypotheses from the hypothesis space. These hypotheses are combined into a single classifier that makes predictions by taking a vote of its constituents. One common ensemble method is boosting, in which the model is induced sequentially from the training examples and the example weights are adjusted at each iteration. The Weka library provides an implementation of the C4.5 Decision Tree algorithm (Quinlan, 1993) in the J48 class. Some settings employed for Decision Tree classification in our experiments are briefly explained below:
Default Setting: the standard Decision Tree classifier algorithm.
Reduced Error Pruning: uses an independent test set to estimate the error at each node.
No Sub-tree Raising: used to disable sub-tree raising of the most popular branch.
Unpruned: used to disable prepruning strategies.
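The experiments above are run in Weka. Purely as an illustration of the comparison protocol, an analogous setup in Python with scikit-learn might look as follows; GaussianNB, KNeighborsClassifier and DecisionTreeClassifier are rough counterparts of Weka's NaiveBayes, IBk and J48 (not the same implementations), the Bayesian Network search algorithms have no direct scikit-learn equivalent, and the synthetic data merely stands in for the real feature vectors:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Stand-in for the 14-feature, 4-class block dataset described above.
X, y = make_classification(n_samples=500, n_features=14, n_informative=8,
                           n_classes=4, random_state=0)

models = {
    "Naive Bayes (normal distribution)": GaussianNB(),
    "k-NN (k=1)": KNeighborsClassifier(n_neighbors=1),
    "k-NN (k=2)": KNeighborsClassifier(n_neighbors=2),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)  # tenfold cross-validation
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```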
4.5. Classification metrics

There are several metrics to assess the performance of ML methods. One evaluation protocol is N-Fold Cross Validation: N tests are made on the dataset, and the observed metrics, including accuracy, precision, recall and f-Measure, are averaged. In each test, the model is trained on N - 1 portions of the data and tested on the remaining portion. For 10-Fold Cross Validation, the data are first split into 10 sets, and a test is made for each set while training on the remaining sets. Considering ML as a binary classification task, each sample falls into one of two cases (classes), positive or negative (i.e., belongs to the required class or not). Based on these cases, Table 6 gives the definitions necessary to calculate the accuracy, precision, recall and f-Measure. This definition table is called a Confusion Matrix: the actual values are in rows, and the predicted ones are in columns. The diagonal cells show the number of correct predictions for both the positive and negative cases; the off-diagonal cells show the misclassifications. The accuracy metric measures the percentage of correct predictions over the overall data and accounts for both positive and negative instances. According to the definitions given in Table 6, the following equations define accuracy, precision and recall, respectively.
accuracy = (TP + TN) / (TP + TN + FP + FN)
Table 6
Confusion matrix.

             Predicted
Known        Positive   Negative
Positive     TP         FN
Negative     FP         TN

where True Positive (TP) is the number of correctly classified positive examples; False Positive (FP) is the number of negative examples incorrectly classified as positive; True Negative (TN) is the number of correctly classified negative examples; and False Negative (FN) is the number of positive examples incorrectly classified as negative.
precision = TP / (TP + FP)

recall = TP / (TP + FN)
In the special case where beta (β) equals 1, the f-Measure combines precision and recall by calculating their harmonic mean and is called the f1-measure.

Fβ = (1 + β²) · precision · recall / (β² · precision + recall)
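These definitions translate directly into code. A minimal sketch (our own helper names; beta = 1 gives the f1-measure):

```python
def classification_metrics(tp, fp, fn, tn, beta=1.0):
    """Accuracy, precision, recall and f-measure from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_beta = ((1 + beta ** 2) * precision * recall
              / (beta ** 2 * precision + recall))
    return accuracy, precision, recall, f_beta

# Example: 90 TP, 10 FP, 10 FN, 890 TN gives accuracy 0.98 and
# precision = recall = f1 = 0.90.
acc, p, r, f1 = classification_metrics(90, 10, 10, 890)
```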
Moreover, the last metric is the kappa statistic, which measures the degree of agreement between two sets of classifications. The kappa statistic is an alternative to the accuracy measure for evaluating methods. It was first introduced as a metric for measuring the degree of agreement between two observers (Cohen, 1960) and has since been used in several disciplines. In ML, it assesses the improvement of a method's accuracy over a predictor that guesses by chance. This measure is defined as:

k = (Po - Pc) / (1 - Pc)
where Po is the accuracy of the method, and Pc is the expected accuracy that a randomly guessing method would achieve on the same dataset. The kappa statistic ranges between -1 and 1, where -1 is total disagreement (i.e., total misclassification) and 1 is perfect agreement (i.e., 100% accurate classification). Kappa fundamentally assesses how much better a learning method is than chance: random classifiers that follow the majority or the class distribution score zero kappa. Landis and Koch (1977) suggest that a kappa score over 0.4 indicates reasonable agreement beyond chance.

4.6. ML results and error analysis

ML methods can classify the informative and uninformative content within an error margin, and each learning method may have a different error rate on the same dataset. To find the most accurate learning method, we used two datasets, one for training and one for testing. Because the training dataset is crucial to classification performance, tenfold cross-validation is applied to evaluate and compare the different ML methods with several configurations. Additionally, the test dataset is used to assess how well the results obtained from the training process generalize. In the performance evaluation, we measured the accuracy, precision, recall, f-Measure and kappa statistic. Table 7 gives the training (cross-validated) and testing results. The Naïve Bayes algorithm is one of the simplest algorithms used in the learning task. Though it gives good results with the Supervised Discretization setting, it does not perform as well as the other algorithms, as its low kappa results show. For Bayesian Networks, the TAN and K2 search methods are tested; these search methods form a proper Bayesian Network structure and boost the accuracy. The results of the other two methods, k-Nearest Neighbors and Decision Tree, are very close. In k-Nearest Neighbors, the appropriate selection of the number of neighbors (k) has a significant effect on the accuracy.
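As a worked sketch of the kappa definition (our own helper; rows are actual classes, columns are predicted), Po is the diagonal mass of the confusion matrix and Pc is the chance agreement computed from its marginals:

```python
def kappa(confusion):
    """Cohen's kappa, (Po - Pc) / (1 - Pc), for a square confusion matrix."""
    n = sum(sum(row) for row in confusion)
    classes = range(len(confusion))
    po = sum(confusion[i][i] for i in classes) / n       # observed accuracy
    pc = sum((sum(confusion[i]) / n) *                   # row marginal
             (sum(row[i] for row in confusion) / n)      # column marginal
             for i in classes)                           # chance accuracy
    return (po - pc) / (1 - pc)

# A perfect classifier scores 1; one indistinguishable from chance scores 0.
```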
The best performance is obtained with Decision Tree Learning and its Sub-tree Raising setting, with 95.76% accuracy on the training dataset and 94.88% accuracy on the testing dataset. To expand the details of the Decision Tree Learning, Table 8 gives the confusion matrix, which compares the actual results with the predicted ones in terms of classes. The number of uninformative instances is higher than that of the other classes, which is why they are more likely to be confused in the inference. In Table 8, the uninformative class is misclassified and confused with the article information class in 272 instances of the training dataset and 201 instances of the testing dataset. We believe that this is expected, as both classes contain similar link structures and word counts. Manual observations generally show that article information contains shorter links. To overcome this problem, we could set a threshold value for the length of the links; the system might then, however, wrongly accept short links such as ''send'', ''print'' and ''Facebook''. The most common approach for short and uninformative links is to group the common words for the same
Table 7
Weighted average results of predictions of the main blocks, uninformative blocks, headline and article information content on two different datasets.

Dataset-1 (Train)
  Classification algorithm                        Accuracy (%)  Precision  Recall  f-Measure  Kappa
  Naïve Bayes         Normal Distribution         76.33         0.86       0.76    0.79       0.52
  Naïve Bayes         Kernel Estimator            84.14         0.89       0.84    0.86       0.65
  Naïve Bayes         Supervised Discretization   90.55         0.92       0.91    0.91       0.77
  Bayesian Network    Search Algorithm: K2        90.72         0.92       0.91    0.91       0.77
  Bayesian Network    Search Algorithm: TAN       93.23         0.94       0.93    0.94       0.83
  k-Nearest Neighbor  k = 1                       95.69         0.96       0.96    0.96       0.89
  k-Nearest Neighbor  k = 2                       95.33         0.95       0.95    0.95       0.87
  Decision Tree       Sub-tree Raising            95.76         0.96       0.96    0.96       0.89
  Decision Tree       Unpruned                    95.65         0.96       0.96    0.96       0.89
  Decision Tree       Reduced Error Pruning       95.49         0.95       0.96    0.95       0.88

Dataset-2 (Test)
  Classification algorithm                        Accuracy (%)  Precision  Recall  f-Measure  Kappa
  Naïve Bayes         Normal Distribution         76.77         0.86       0.77    0.80       0.53
  Naïve Bayes         Kernel Estimator            84.81         0.89       0.85    0.86       0.67
  Naïve Bayes         Supervised Discretization   90.30         0.91       0.90    0.91       0.77
  Bayesian Network    Search Algorithm: K2        90.40         0.91       0.90    0.91       0.77
  Bayesian Network    Search Algorithm: TAN       92.85         0.93       0.93    0.93       0.83
  k-Nearest Neighbor  k = 1                       94.96         0.95       0.95    0.95       0.87
  k-Nearest Neighbor  k = 2                       94.59         0.94       0.95    0.94       0.86
  Decision Tree       Sub-tree Raising            94.88         0.95       0.95    0.95       0.87
  Decision Tree       Unpruned                    94.84         0.95       0.95    0.95       0.87
  Decision Tree       Reduced Error Pruning       94.76         0.95       0.95    0.95       0.87
Table 8
Confusion matrix and prediction of four classes via the Sub-tree Raising method of Decision Tree Learning on two different datasets.

Dataset-1: Training dataset
                              Predicted
Known                         (a)    (b)   (c)   (d)    Precision  Recall  f-Measure
Uninformative blocks (a)      9481   26    0     162    0.98       0.98    0.98
Main blocks (b)               29     631   0     15     0.94       0.94    0.94
Headline (c)                  1      0     780   26     0.95       0.97    0.96
Article information (d)       208    17    47    1092   0.84       0.80    0.82
Weighted avg.                                           0.96       0.96    0.96

Dataset-2: Test dataset
                              Predicted
Known                         (a)    (b)   (c)   (d)    Precision  Recall  f-Measure
Uninformative blocks (a)      5668   12    0     148    0.97       0.97    0.97
Main blocks (b)               17     390   0     10     0.93       0.94    0.93
Headline (c)                  4      2     501   6      0.92       0.98    0.95
Article information (d)       143    15    43    686    0.80       0.78    0.79
Weighted avg.                                           0.95       0.95    0.95
web domain and remove them directly. This would make the approach language dependent, so we did not apply this adjustment, in order to preserve language independence. The result of the Decision Tree Learning algorithm is a tree that provides a better understanding of the features and their relations. The decision tree in our model consists of 269 decision nodes and 153 leaves. Fig. 5 shows the portion of the actual tree used to predict the main and article information blocks. When we analyze Fig. 5, we see that 614 of 675 Main Blocks are classified correctly using only D-HTML-AE (Density in HTML - After Extraction), R-WF-L-AW-AE (Ratio of Word Frequency in Links to All Words - After Extraction) and WF-L-AE (Word Frequency in Links - After Extraction). Only nine errors occur in the TN = DIV branch of the decision tree. As a result, the after-extraction features, and new features such as R-WF-L-AW-AE and WF-L-AE derived in our approach, have positive effects on Main prediction. On the other hand, features without AE counterparts are also effective in article information prediction: 449 of 1352 article information blocks are classified correctly using six features, with only 22 errors in this prediction. These analyses of several portions of the actual tree indicate that the AE features are crucial for the prediction of the Main Block. Information gain, a statistical property, can be used to examine the effects of all features on prediction: it measures how effective features are in different combinations. Fig. 6 shows different feature sets and their information gains for the whole learning process.
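Information gain as used here can be sketched as the entropy of the class labels minus the weighted entropy remaining after splitting the examples on one (discrete) feature; the helpers below are illustrative, not the authors' implementation:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    total = len(labels)
    return -sum((c / total) * log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, labels, feature_index):
    """Entropy reduction obtained by splitting on the feature at feature_index."""
    by_value = {}
    for row, label in zip(rows, labels):
        by_value.setdefault(row[feature_index], []).append(label)
    remainder = sum(len(subset) / len(labels) * entropy(subset)
                    for subset in by_value.values())
    return entropy(labels) - remainder
```

In a toy sample where the tag name alone separates Main from uninformative blocks, its information gain equals the full label entropy; a feature that cleanly separates classes gets a gain close to that maximum.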
[Fig. 5. A portion of the decision tree. The Main prediction branch tests D-HTML-AE (e.g., D-HTML-AE > 0.099698) and R-WF-L-AW-AE before testing TN, e.g., TN = DIV: Main Block (518/9) and TN = TD: Main Block (96/0); the Article Information prediction branch starts from R-WF-L-AW.]
The block rules may vary among web sites. However, when rules are extracted for a single web page, they can be used for structurally similar web pages of the same web domain. Some web pages may also contain errors, as in Fig. 4, where a block marker contains unrelated information in the tag. These errors might also happen for other blocks, including the main content block. In this study, we try to minimize such user-oriented noise on block tags by including web pages from various web domains in our datasets.

In Fig. 4, a more complex DIV structure in an HTML page is demonstrated. This figure is a good example of how blocks can be nested to form a hierarchy, which creates a problem when extracting the main DIV block with a regular expression. As the HTML structure of Fig. 4 contains many nested DIV tags, a simple regular expression (<div ...>(.*?)</div>) cannot find the proper closing of the main block. There are two types of matching mechanisms in regular expressions for this task: one is matching the last </div> tag, and the other is matching the first </div> tag. To prevent this closing-tag ambiguity, a DOM structure can be used to extract the target match. However, creating a DOM structure increases the complexity of basic string matching problems. In this study, we train a learning model with the necessary features derived from block tags to directly extract informative content blocks from HTML without using DOM. We use DOM-based features in the rule induction phase, but the model does not re-create the DOM in the efficient extraction.

4.2. Datasets used in this study

Almost all ML methods need a problem-specific dataset to adjust the weights of their algorithms. Similarly, studies involving ML methods for informative content extraction from web pages use hand-annotated samples from various web domains. An older text collection, annotated by Kohlschutter et al. (2010), contains 621 web pages from 408 different domains selected from Google News until 2008. Moreover, a newer text collection, containing 550 web pages from 275 different domains selected from Google News until 2011, is generated. Finally, we use a double-blind annotation technique to re-annotate the samples from these text collections. Using this annotation, we consider the following four classes:

Main block: the largest informative block, which contains a DIV or TD tag with a high number of terms.
Uninformative blocks: the blocks mostly used for advertisements, links and menus, separated by DIV, TD and UL tags.
Headline: the tags that contain the web page title (e.g., the H1 tag).
Article information: the tags that contain the text summary, author name, date information, image titles, and user comments.

To annotate the classes given above, a double-blind annotation technique is used among eight annotators (academic staff, graduate students, and undergraduate students taking the Information Retrieval course). In this technique, two annotators independently mark the samples, and a third annotator adjudicates the differences between the two to define a Gold Standard. After this annotation, the statistical results (Table 1) of the text collections are obtained.
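The closing-tag ambiguity described above is easy to reproduce. In this toy example (our own HTML), the non-greedy pattern stops at the first </div> and truncates the nested main block, while the greedy pattern runs to the last </div> and swallows a sibling block; neither finds the structurally matching close:

```python
import re

html = ('<div id="main">intro <div>nested child</div> outro</div> '
        '<div>footer</div>')

# Non-greedy: stops at the FIRST closing tag, truncating the main block.
non_greedy = re.search(r'<div id="main">(.*?)</div>', html).group(1)

# Greedy: runs to the LAST closing tag, swallowing the footer block too.
greedy = re.search(r'<div id="main">(.*)</div>', html).group(1)

print(non_greedy)  # intro <div>nested child
print(greedy)      # intro <div>nested child</div> outro</div> <div>footer
```

A DOM parser resolves the nesting correctly, which is exactly the trade-off weighed above: the correctness of DOM matching versus the speed of flat string matching.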
In Table 1, each tag is given in the columns, and each class is given in the rows. Based on the counts given in Table 1, while the usage of DIV tag has increased, the usage of TD has decreased between 2008 and 2011. This finding supports Uzun et al. (2011a) that indicate the TD tag has become an old informative block marker, while the DIV tag is a current trend due to evolution of the web. Datasets used in ML should contain enough samples for the learning process. However, there are extreme differences between the numbers of some tags in text collections. Therefore, two datasets as training and test dataset are created randomly from these text collections for balancing ML data (Table 2). Training Dataset (Dataset-1): 736 web pages from 573 different domains. Test Dataset (Dataset-2): 434 web pages from 324 different domains. The statistical results of the datasets in Table 2 are suitable for using these datasets in ML methods. The DIV block marker is the most common tag used to separate the main and irrelevant blocks. On the other hand, article information content highly shares most of the tags. We expect that this block has a lower prediction accuracy due to its uniformly distributed nature. 4.3. Feature selection As in many other ML approaches, the methods applied in our experiments require that the data be represented as vectors of feature-value pairs. In our experiments, we not only adopt general shallow text features as feature vectors, but we also expand the common approach with combinations of classical features and after extraction features. These features are mostly referred to as shallow text features and are used to find the informative content of the tags from which they are extracted (Gibson et al., 2007; Kohlschutter, 2009; Kohlschutter & Nejdl, 2008; Kohlschutter et al., 2010). In a similar way, we use classical features like (1-2-3) in Table 3. Moreover, we derive a new features (like (4-5-6) in Table 3) that are combinations of classical features. 
Common approaches that use shallow text features also consider non-nested tags, including H1–H6, P, FONT, A HREF, SPAN, EM, UL, and LI. However, our learning model construction also is based on the nested tags (i.e., DIV and TD). Common approaches thus create a problem in our learning model. For example, a parent DIV tag may contain several uninformative and informative DIV children. This creates noise in the statistics derived from the parent DIV tag, so the features extracted
Table 1 Block information for annotated text collections. DIV
TD
UL
H1
H2
H3
H4
H5
FONT
SPAN
P
Total
Until 2008 Main Irrelevant Headline Article information Total
477 3991 – 278 4746
149 1099 – 113 1361
– 969 – 12 981
– – 325 1 326
– – 164 8 172
– – 134 10 144
– – 26 15 41
– – 11 14 25
– – 35 28 63
– – 27 93 120
– – 4 3 7
626 6059 726 575 7986
Until 2011 Main Irrelevant Headline Article information Total
516 8129 – 1565 10,210
23 554 – 65 642
– 755 – 3 758
– – 384 – 384
– – 89 – 89
– – 18 – 18
– – 40 – 40
– – – – –
– – 10 5 15
– – 13 12 25
– – 14 – 14
539 9438 568 1650 12,195
Table 2 Tag information about the annotated datasets. DIV
TD
UL
H1
H2
H3
H4
H5
FONT
SPAN
P
Total
Dataset-1 Main Irrelevant Headline Article information Total
608 7507 – 1136 9251
117 1133 – 106 1356
– 1029 – 12 1041
– – 418 1 419
– – 179 4 183
– – 92 9 101
– – 29 4 33
– – 7 7 14
– – 32 15 47
– – 26 57 83
– – 10 1 11
725 9669 793 1352 12,539
Dataset-2 Main Irrelevant Headline Article information Total
385 4613 – 707 5705
55 520 – 72 647
– 695 – 3 698
– – 291 – 291
– – 74 4 78
– – 60 1 61
– – 37 11 48
– – 4 7 11
– – 13 18 31
– – 14 48 62
– – 8 2 10
440 5828 501 873 7642
E. Uzun et al. / Information Processing and Management 49 (2013) 928–944
935
Table 3 Shallow text features. Feature name
Description
(1) Word Frequency (WF) (2) Density in HTML (D-HTML) (3) Link Frequency (LF) (4) Word Frequency in Links (WF-L) (5) Average Word Frequency in Links (A-WF-L) (6) Ratio of Word Frequency in Links to All Words (R-WF-L-AW)
The number of terms inside tags The ratio of the number of terms inside tags to the number of all terms inside the HTML document The count of A HREF links inside tags The count of terms inside A HREF links placed tags The ratio of the number of terms inside A HREF links placed inside tags to the number of links The ratio of the number of terms inside A HREF links placed inside tags to all of the number of terms inside tags.
may not contain appropriate link and word frequency ratios. To handle this problem, we extract most child DIV/TD tags to form a flat (non-nested) structure from the parent DIV/TD tags. After extraction (AE) of these child DIV/TD tags, we reevaluate the values of the same feature sets. This extraction process provides new feature sets to the ML step. Table 4 gives these new features. Differing from traditional approaches, our study proposes AE features to the learning problem because of the nested nature of the used tags (DIV and TD). The effect of these new features for the learning task is investigated in the experiment section. The non-nested tags, including H1–H6, P, FONT, and A HREF, do not require any AE features. Along with the features given in Tables 3–5 introduces two tag labels and CSS styling features. Tag name as a feature is useful for distinguishing whether a tag may belong to one of the four different classes introduced above. According to CSS styling, which is used for visualization of elements in a web page, some tags may contain ID and/or CLASS attributes. We believe that whether a DIV has an ID or CLASS attribute may have a positive effect on the efficiency of the content extraction step. Fig. 4 supports this idea, but this feature is also investigated in detail below (Section 5). 4.4. Machine learning methods applied in this study Four different common ML methods were applied to our dataset: a naïve Bayes algorithm, a Bayesian Network Algorithm, an instance-based clustering algorithm (k-Nearest Neighbor), and a Decision Tree Algorithm to discover an appropriate learning method. Descriptions of each learning algorithm are given below. The experiments are conducted using the Weka library with a tenfold cross-validation test method (Witten & Frank, 2005). 4.4.1. Naïve Bayes classification Naïve Bayesian classification (Rish, 2001) relies on the assumption that attributes are conditionally independent of each other given the class of examples. 
Though this hypothesis is often inappropriate for real-world problems where attributes strongly depend on each other, this classification approach helps reduce the dimensionality effect by simplifying the problem. Given example X with a feature vector ðx1 ; . . . ; xn Þ, the Naïve Bayes classifier looks for a class label C that maximizes the following likelihood:
PðXjCÞ ¼ Pðx1 ; . . . ; xn jCÞ Below are short descriptions of specific settings employed in our Naïve Bayes classification experiments (Witten & Frank, 2005):
Table 4 Additional shallow text features. Feature name (7) Word Frequency-AE (WF-AE) (8) Density in HTML-AE (D-HTML-AE) (9) Link Frequency-AE (LF-AE) (10) Word Frequency in Links-AE (WF-L-AE) (11) Average Word Frequency in Links-AE (AWF-L-AE) (12) Ratio of Word Frequency in Links to All Number of Words-AE (R-WF-L-AW-AE)
Table 5 Other features. Feature name
Description
(13) Tag Name (TN) (14) Contains Tag ID or CLASS (C-TC)
One of the tag name (TD, DIV, H1–H6, P, FONT, A HREF, SPAN, EM, UL, and LI) Whether the tag has an attribute of ID or CLASS
936
E. Uzun et al. / Information Processing and Management 49 (2013) 928–944
Normal Distribution: Standard Naïve Bayes classifier algorithm for numeric attributes. Kernel Estimator: Kernel estimation for numeric attributes rather than Normal Distribution. Supervised Discretization: Used to convert numeric attributes to nominal ones. 4.4.2. Bayesian Network classification Bayesian Networks (Friedman, Geiger, & Goldszmidt, 1997), more commonly called belief networks or probabilistic networks, are directed a-cycling graphs (DAGs) that contain no cycles. In a Bayesian Network, each node corresponds to a random variable, P(X), and each arc between nodes, P(Y|X), represents the probabilistic dependency between variables. The nodes and arcs define the structure of the network, whereas the conditional probabilities are the parameters for this structure. In Bayesian Networks, inference and structure learning are two learning process tasks. After obtaining the structure, classification can be conducted through inference. Network structure can be given manually instead of learning it from features. Our Weka experiments use settings of two structure-learning search algorithms, namely K2 (Cooper & Herskovits, 1992) and TAN (Cheng & Greiner, 1999; Rish, 2001). Both algorithms are used in local searches for the appropriate structure in Bayesian Networks. 4.4.3. k-Nearest Neighbor classification k-Nearest Neighbor classification (Bremner et al., 2005) is a nonparametric approach used to estimate the class-conditional densities, namely P(X|Ci). Given the discriminant function as below:
g i ðxÞ ¼ PðxjC i ÞPðC i Þ we have P(x|Ci) = ki/(NiVk(x)), where ki is the number of neighbors of the k-nearest that belong to Ci, and Vk(x) is the volume of the n-dimensional hypersphere centered at x, with radius r = ||x xk||, where xk is the k-nearest observation to x (among all neighbors of all classes of x). Selecting the number of neighbors for comparison is an important property in learning; k should thus be selected appropriately. 4.4.4. Decision Tree Classification In Decision Tree Learning (Breiman, Friedman, Olshen, & Stone, 1984), trees are composed of decision nodes and terminal leaves. Given a new instance to be classified, test functions are applied to an instance recursively in decision nodes until hitting a leaf node that assigns a discrete output to it. An instance feature is tested in every node for branching. The information gain of selecting an attribute to form a tree must be calculated, and a predefined number of the most informative attributes must be selected to minimize the depth of the tree. In cases where more than one hypothesis is extracted from the training set, the ensemble learning methods are used to increase classifier efficiency by selecting and combining a set of hypotheses from the hypotheses’ space. These hypotheses are combined into a single classifier that makes predictions by taking a vote of its constituents. One common method in ensemble learning is boosting. The boosting model is sequentially induced from the training examples where the example weights are adjusted at each iteration. The Weka library provides an implementation of the C4.5 Decision Tree Algorithm (Quinlan, 1993) in the J48 class. Some settings employed for Decision Tree Classification in our experiments are briefly explained below:
Default Setting: Standard Decision Tree classifier algorithm. Reduced Error Pruning: An independent test set to estimate the error at each node. No Sub-tree Raising: Used to disable Sub-Tree Raising of the most popular branch. Unpruned: Used to disable prepruning strategies.
4.5. Classification metrics There are several metrics to assess the performance of the ML Methods. One of them is N-Fold Cross Validation. In N-Fold Cross Validation, N tests are made on the dataset, and the observed metrics, including accuracy, precision, recall and f-Measure, are averaged. In each test, the N 1 training sets are trained, and the Nth portion is tested. For 10-Fold Cross Validation, the data are first split into 10 sets, and a test is made for each set to train the remaining sets. Considering that the ML is a binary classification task, each sample is separated into two cases (classes), positive and negative (i.e., has required, has not required). Based on these cases, Table 6 gives the necessary definitions to calculate the accuracy, precision, recall and f-Measure. This definition table is called a Confusion Matrix: the actual values are in rows, and the predicted ones are in columns. The diagonal cells show the number of correct predictions for both positive and negative cases. Other cells show both misclassifications and actual class classifications. The accuracy metric allows for measuring the percentage of correct predictions for the overall data. This metric accounts for both positive and negative instances. According to the definitions given in Table 6, the following equations define the accuracy, precision and recall, respectively.
accuracy = (TP + TN) / (TP + TN + FP + FN)
Table 6
Confusion matrix.

                          Predicted
Known                     Positive    Negative
Positive                  TP          FN
Negative                  FP          TN
where True Positive (TP) is the number of positive examples correctly classified as positive, False Positive (FP) is the number of negative examples incorrectly classified as positive, True Negative (TN) is the number of negative examples correctly classified as negative, and False Negative (FN) is the number of positive examples incorrectly classified as negative.
precision = TP / (TP + FP)

recall = TP / (TP + FN)
In the special case where beta (b) equals 1, the f-Measure combines precision and recall by taking their harmonic mean and is called the f1-measure.
F_b = (1 + b^2) · (precision · recall) / ((b^2 · precision) + recall)
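The metrics above follow directly from the confusion-matrix counts; a minimal sketch in Python (the TP/TN/FP/FN counts below are illustrative, not taken from the paper's data):

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    """Fraction of predicted positives that are actually positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of actual positives that are predicted positive."""
    return tp / (tp + fn)

def f_measure(tp, fp, fn, beta=1.0):
    """General F-beta score; beta = 1 gives the f1-measure (harmonic mean)."""
    p, r = precision(tp, fp), recall(tp, fn)
    return (1 + beta**2) * p * r / (beta**2 * p + r)

# Illustrative counts
tp, tn, fp, fn = 80, 90, 10, 20
print(accuracy(tp, tn, fp, fn))  # 0.85
print(recall(tp, fn))            # 0.8
```

With beta = 1 the formula reduces to 2pr / (p + r), the harmonic mean of precision and recall.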
Moreover, the last metric is the kappa statistic, which measures the degree of agreement between two classifiers (or between a classifier and the ground truth). The kappa statistic is an alternative to the accuracy measure for evaluating methods. It was first introduced as a metric to measure the degree of agreement between two observers (Cohen, 1960) and has since been used in several disciplines. In ML, it assesses the improvement of a method's accuracy over a predictor employing chance as its guide. This measure is defined as:
k = (P_o - P_c) / (1 - P_c)
where P_o is the accuracy of the method, and P_c is the expected accuracy that a randomly guessing method could achieve on the same dataset. The kappa statistic has a range between -1 and 1, where -1 is total disagreement (i.e., total misclassification), and 1 is perfect agreement (i.e., 100% accurate classification). Kappa fundamentally assesses how much better a learning method is compared to the majority baseline, and class distribution-based random classifiers score zero kappa. Landis and Koch (1977) suggest that a kappa score over 0.4 indicates a reasonable agreement beyond chance.

4.6. ML results and error analysis

ML methods can classify the informative and uninformative content within an error margin, and each learning method may have a different error rate for the same dataset. To find the most accurate learning method, we used two datasets, one for training and one for testing. Because the training dataset is crucial to classification performance, tenfold cross-validation is applied in order to evaluate and compare different ML methods with several configurations. Additionally, the test dataset is used to assess how well the results obtained from the training process generalize. In the performance evaluation, we measured the accuracy, precision, recall, f-Measure and kappa statistics. Table 7 gives the training (cross-validated) and testing results. The Naive Bayes algorithm is one of the simplest algorithms used in the learning task. Though it gives good results with the Supervised Discretization method, it does not perform better than the other algorithms due to its low kappa scores. In Bayesian Networks, the TAN and K2 search methods are tested. These search methods form a proper Bayesian network structure and boost the accuracy. The results of the other two methods, k-Nearest Neighbors and Decision Tree, are very close. In k-Nearest Neighbors, the appropriate selection of the number of neighbors (k) has a significant effect on the accuracy.
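The kappa computation can be sketched as follows; the P_o and P_c values below are illustrative, not taken from Table 7:

```python
def kappa(p_o, p_c):
    """Cohen's kappa: improvement of observed accuracy p_o over chance accuracy p_c."""
    return (p_o - p_c) / (1 - p_c)

# Illustrative values: 95% observed accuracy against 70% chance-level accuracy
print(kappa(0.95, 0.70))  # ~0.833, well above the 0.4 threshold of Landis and Koch
```

Note that a classifier whose accuracy merely matches chance (p_o = p_c) scores exactly zero, which is why kappa complements raw accuracy on imbalanced class distributions like ours.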
The best performance is obtained with Decision Tree Learning and its Sub-tree Raising method, with 95.76% accuracy on the training dataset and 94.88% accuracy on the testing dataset. To expand the details of the Decision Tree Learning, Table 8 gives the confusion matrix, which compares the actual results with the predicted ones in terms of classes. The number of uninformative instances is higher than that of the other classes, which is why they are more likely to be confused in the inference. In Table 8, uninformative blocks are misclassified as the article information class in 272 instances of the training dataset and 201 instances of the testing dataset. We believe that this is expected, as both classes contain similar link structures and word counts. Manual observations generally show that article information contains shorter links. To overcome this problem, we could set a threshold value for the length of the links; the system may, however, then accept links such as ''send'', ''print'' and ''Facebook''. The most common approach for short and uninformative links is to group the common words for the same
Table 7
Weighted average results of predictions of the main blocks, uninformative blocks, headline and article information content on two different datasets.

Classification algorithms                                Accuracy (%)  Precision  Recall  f-Measure  Kappa

Dataset-1 (Train)
Naïve Bayes         Normal Distribution                  76.33         0.86       0.76    0.79       0.52
                    Kernel Estimator                     84.14         0.89       0.84    0.86       0.65
                    Supervised Discretization            90.55         0.92       0.91    0.91       0.77
Bayesian Network    Search Algorithm: K2                 90.72         0.92       0.91    0.91       0.77
                    Search Algorithm: TAN                93.23         0.94       0.93    0.94       0.83
k-Nearest Neighbor  k = 1                                95.69         0.96       0.96    0.96       0.89
                    k = 2                                95.33         0.95       0.95    0.95       0.87
Decision Tree       Sub-tree Raising                     95.76         0.96       0.96    0.96       0.89
                    Unpruned                             95.65         0.96       0.96    0.96       0.89
                    Reduced Error Pruning                95.49         0.95       0.96    0.95       0.88

Dataset-2 (Test)
Naïve Bayes         Normal Distribution                  76.77         0.86       0.77    0.80       0.53
                    Kernel Estimator                     84.81         0.89       0.85    0.86       0.67
                    Supervised Discretization            90.30         0.91       0.90    0.91       0.77
Bayesian Network    Search Algorithm: K2                 90.40         0.91       0.90    0.91       0.77
                    Search Algorithm: TAN                92.85         0.93       0.93    0.93       0.83
k-Nearest Neighbor  k = 1                                94.96         0.95       0.95    0.95       0.87
                    k = 2                                94.59         0.94       0.95    0.94       0.86
Decision Tree       Sub-tree Raising                     94.88         0.95       0.95    0.95       0.87
                    Unpruned                             94.84         0.95       0.95    0.95       0.87
                    Reduced Error Pruning                94.76         0.95       0.95    0.95       0.87
Table 8
Confusion matrix and prediction of four classes via the Sub-tree Raising method of Decision Tree Learning on two different datasets.

Dataset-1: Training dataset
                              Predicted
Known                         a      b      c      d       Precision  Recall  f-Measure
Uninformative blocks (a)      9481   26     0      162     0.98       0.98    0.98
Main blocks (b)               29     631    0      15      0.94       0.94    0.94
Headline (c)                  1      0      780    26      0.95       0.97    0.96
Article information (d)       208    17     47     1092    0.84       0.80    0.82
Weighted avg.                                              0.96       0.96    0.96

Dataset-2: Testing dataset
                              Predicted
Known                         a      b      c      d       Precision  Recall  f-Measure
Uninformative blocks (a)      5668   12     0      148     0.97       0.97    0.97
Main blocks (b)               17     390    0      10      0.93       0.94    0.93
Headline (c)                  4      2      501    6       0.92       0.98    0.95
Article information (d)       143    15     43     686     0.80       0.78    0.79
Weighted avg.                                              0.95       0.95    0.95
web domain and remove them directly. This makes the approach language dependent, so we did not make this adjustment in order to preserve language independence. The result of the Decision Tree Learning algorithm is a binary tree that provides a better understanding of the features and their relations. The decision tree in our model consists of 269 decision nodes and 153 leaves. Fig. 5 shows the portion of the actual tree used to predict the main blocks and article information. When we analyze Fig. 5, we see that 614 of 675 Main Blocks are classified correctly using only D-HTML-AE (Density in HTML – After Extraction), R-WF-L-AW-AE (Ratio of Word Frequency in Links to All Words – After Extraction) and WF-L-AE (the count of terms inside A HREF link tags – After Extraction). Only nine errors occur in the TN = DIV classification in the decision tree. As a result, after-extraction (AE) features and the new features like R-WF-L-AW-AE and WF-L-AE derived in our approach have positive effects on Main Block prediction. On the other hand, features without the AE suffix are also effective in article information prediction: 449 of 1352 article information blocks are classified correctly using six features, with only 22 errors in this prediction. These analyses of several portions of the actual tree indicate that AE features are crucial for the prediction of Main Blocks. Information gain, a statistical property, can be used to examine the effects of all features on prediction; it measures how effective features are in different combinations. Fig. 6 shows different feature sets and their information gains for the whole learning process.
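The Main Block branch of the tree can be sketched as nested conditionals. The D-HTML-AE threshold (0.099698) and the TN = DIV / TN = TD tests are legible in Fig. 5, but the comparison on R-WF-L-AW-AE is only partially legible, so the threshold below is a hypothetical placeholder, not the paper's value:

```python
# Sketch of the Fig. 5 rule portion for Main Block prediction (illustrative only).
R_WF_L_AW_AE_THRESHOLD = 76  # hypothetical placeholder; the figure's exact test is unclear

def predict_main_block(d_html_ae, r_wf_l_aw_ae, tag_name):
    """Return True when a block is predicted to be a Main Block by this rule portion."""
    if d_html_ae > 0.099698:              # density in HTML after extraction
        if r_wf_l_aw_ae <= R_WF_L_AW_AE_THRESHOLD:
            if tag_name in ("DIV", "TD"):  # TN = DIV (518/9), TN = TD (96/0) in Fig. 5
                return True
    return False

print(predict_main_block(0.2, 10, "DIV"))   # True
print(predict_main_block(0.05, 10, "DIV"))  # False: density below threshold
```

Once such a path is extracted from the learned tree, it becomes the kind of simple rule that the extraction step can evaluate with string manipulation alone, without rebuilding the DOM.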
[Fig. 5. Portion of the actual decision tree. The Main Block prediction branch tests D-HTML-AE > 0.099698, R-WF-L-AW-AE and the tag name, e.g. TN = DIV: Main Block (518/9) and TN = TD: Main Block (96/0); the Article Information prediction branch starts from R-WF-L-AW.]