Information Processing and Management 49 (2013) 928–944


A hybrid approach for extracting informative content from web pages

Erdinç Uzun a,*, Hayri Volkan Agun b, Tarık Yerlikaya b

a Namik Kemal University, Corlu Engineering Faculty, Computer Engineering Department, Çorlu, Tekirdağ, Turkey
b Trakya University, Engineering and Architecture Faculty, Computer Engineering Department, Edirne, Turkey

Article info

Article history:
Received 23 January 2012
Received in revised form 14 February 2013
Accepted 21 February 2013
Available online 26 March 2013

Keywords: Web content extraction; Template detection; Web cleaning; Web learning modeling

Abstract

Eliminating noisy information and extracting informative content have become important issues for web mining, search, and accessibility. This extraction process can employ automatic techniques and hand-crafted rules. Automatic extraction techniques focus on various machine learning methods, but implementing these techniques increases the time complexity of the extraction process. Conversely, extraction through hand-crafted rules is an efficient technique that uses string manipulation functions, but preparing these rules is difficult and cumbersome for users. In this paper, we present a hybrid approach consisting of two steps that can invoke each other. The first step discovers informative content using Decision Tree Learning as an appropriate machine learning method and creates rules from the results of this learning method. The second step extracts informative content using the rules obtained from the first step. However, if the second step does not return an extraction result, the first step is invoked. In our experiments, the first step achieves a high accuracy of 95.76% in extracting the informative content. Moreover, 71.92% of the rules can be used in the extraction process, and this rule-based extraction is approximately 240 times faster than the first step.

© 2013 Elsevier Ltd. All rights reserved.
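The two-step interplay described in the abstract can be sketched as a control flow in which the fast rule-based step is tried first and the slower learning step serves as a fallback that refreshes the rule cache. This is a minimal illustrative sketch; the function names, the cache structure, and the trivial "learned" rule are our own assumptions, not the authors' implementation.

```python
# Hedged sketch of the two-step hybrid flow: try a cached extraction rule
# first (step 2); fall back to the learning step (step 1) when no rule
# exists or the rule fails, and cache the newly learned rule.

def learn_rule(page):
    # Stand-in for the decision-tree learning step (step 1).
    # Here we merely "learn" a trivial rule for demonstration.
    return lambda p: p.get("main")

def hybrid_extract(page, rule_cache, domain):
    rule = rule_cache.get(domain)
    if rule is not None:
        result = rule(page)
        if result is not None:          # step 2 succeeded
            return result
    rule = learn_rule(page)             # step 1: learn a new rule
    rule_cache[domain] = rule           # reuse it for later pages
    return rule(page)

cache = {}
page = {"main": "Article body", "menu": "Home | Sports"}
print(hybrid_extract(page, cache, "example.com"))
```

Because most pages of a domain share a template, the cheap cached-rule path handles the bulk of the traffic, which is what makes the reported 240-fold speedup over the learning step plausible.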

1. Introduction

The Internet has become a major information source distributed over web pages. However, web pages contain noisy content, including advertisements, banners, menus, and unnecessary links, which can adversely affect the performance of text-based processing systems such as search engines, web summarization, question answering, and text understanding. In this context, informative content (i.e., text content, headline, date, or author name) can be used to enhance the results of these techniques. However, automatically extracting informative content is difficult, as a web page contains both noisy and informative content in the same file. This file consists of Hypertext Markup Language (HTML) tags and the content between these tags, which allows web browsers to display the page. In this study, we introduce a hybrid approach for obtaining informative content from different web pages.

The Web is an invaluable source of data for studies, especially those that do not have enough natural language texts. In particular, researchers choose online newspapers as an alternative test collection (Can et al., 2008; Carlberger, Dalianis, Hassel, & Knutsson, 2001; Savoy, 2007, 2008; Uzun, 2011) to improve the ranking of search engine results for their natural language searches. However, as mentioned, such a test collection contains noisy text in addition to the relevant content. For example, Uzun et al. (2011a) developed a crawler to obtain news published between 1998 and 2008 from the Turkish newspaper Milliyet (http://www.milliyet.com.tr). They searched for a regular expression pattern that could be used with string manipulation to eliminate noise in web pages, and found the following pattern for web pages between 2003 and 2007.

* Corresponding author. Tel.: +90 2822502325. E-mail address: [email protected] (E. Uzun).
http://dx.doi.org/10.1016/j.ipm.2013.02.005


[Fig. 1: (a) an example web page annotated with its headline, article information, main, and uninformative blocks (e.g., DIV id=menu containing H1 and H3 elements); (b) the corresponding extraction rules in XML format.]

Fig. 1. Determining blocks in a web page and producing rules.

<!--print:start-->(.*?)<!--print:finish-->

This pattern is built on HTML comment tags, which begin with <!-- and end with -->. A browser does not display comments; they only carry information about a web page. The pattern matches the opening and closing pair of these comment tags and captures the content between them. However, this pattern is not a reliable extraction method across different web domains. Moreover, web designers may change their HTML tag naming and hierarchy over time; for example, the designers of Milliyet have used a different HTML structure since 2007. Due to this variation in HTML tag naming and hierarchy, preparing regular expression patterns for extracting the informative content becomes a challenge, as described in Section 4.1. Preparing these patterns manually is also cumbersome. To automate pattern extraction while maintaining efficiency, we present a hybrid approach.1 In this approach, patterns are first obtained as rules by our learning model, which utilizes an appropriate machine learning (ML) technique. Second, these rules are used to extract informative content from web pages without invoking ML inference.

The key issue in content extraction based on ML techniques is how to represent tag information as features obtained from web data. In this study, features such as tag name, word frequency, and link frequency are derived from the Document Object Model (DOM), a language-independent convention and tree-based hierarchy for representing a web page. The DOM represents not only all types of tags in a web page but also the structure of these tags. Current state-of-the-art techniques for extracting informative content use DOM-based features related to the attributes and the content of an HTML tag. However, the complexity of the DOM structure makes it less suitable for effectively extracting informative content in a direct manner than an appropriate regular expression pattern. The main approach of our study is to learn such a pattern from DOM-based features.
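As a hedged illustration, a comment-tag pattern of this kind can be applied with a standard regular-expression library. The sample HTML below is invented for demonstration and is not taken from the actual Milliyet pages; a non-greedy group with the DOTALL flag captures the informative content between the two comment tags, even across line breaks.

```python
import re

# Non-greedy capture between the opening and closing comment tags;
# re.DOTALL lets "." match newlines inside multi-line articles.
PRINT_BLOCK = re.compile(r"<!--print:start-->(.*?)<!--print:finish-->",
                         re.DOTALL)

# Illustrative page: menus and footers are noise, the marked block is content.
html = """
<div id="menu">Home | Sports | Economy</div>
<!--print:start--><h1>Headline</h1><p>Article body text.</p><!--print:finish-->
<div id="footer">Advertisements</div>
"""

match = PRINT_BLOCK.search(html)
informative = match.group(1) if match else None
print(informative)
```

As the paper notes, such a pattern works only while the site keeps emitting these exact comment markers; once the template changes, the pattern silently stops matching, which is precisely the maintenance problem the hybrid approach addresses.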
This pattern is then used for efficiently extracting informative content.

Many studies focus on extracting only the informative content of a web page. In this study, we divide informative content into three blocks, namely main, headline, and article information, as shown in Figs. 1 and 4. These blocks can be used to maximize the storage efficiency of text-based processing systems. Appropriate patterns can be produced for extracting these blocks from a web page. In this study, we produce these patterns automatically by using a suitable ML method. First, we analyze our datasets based on the ratio of informative blocks to tags in order to select the most important tags. Two layout HTML tags, TD and DIV, are determined to be important tags for accessing these blocks. Therefore, our pattern model contains at least one of these tags along with other HTML tags (i.e., H1–H6, FONT, SPAN, EM, UL, and LI) for block extraction.

HTML tags are also widely used for styling and visualizing content. For this reason, HTML-based information on the Web is semi-structured and does not conform to a formal structure. The World Wide Web Consortium (W3C, http://www.w3.org/) has presented a fully structured tag system based on XML (Extensible Markup Language), a well-formed structure that simply marks up pages with descriptive tags. Fig. 1 illustrates these structures for blocks in HTML and our rule model in XML format. In Fig. 1, the H1 and H3 tags do not give clear information for determining patterns, because H1 and H3 in the tag of
present different informative content to users when compared with H1 and H3 in the tag of
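The DOM-based features discussed above (tag name, word frequency, link frequency) can be sketched with a small tree walk. The XHTML snippet, the feature names, and the link-ratio definition below are illustrative assumptions rather than the authors' exact feature set; real pages would need an HTML-tolerant parser rather than the strict XML parser used here for brevity.

```python
from xml.etree import ElementTree

# Illustrative well-formed snippet: a menu block is link-dense with few
# words, while the main block is word-dense with no links.
page = """
<div id="content">
  <h1>Headline text here</h1>
  <div id="menu"><a href="/a">Home</a> <a href="/b">Sports</a></div>
  <div id="main">Some long informative article body with many words.</div>
</div>
"""

def features(elem):
    # Collect all text inside the element, including descendants.
    text = "".join(elem.itertext())
    words = len(text.split())
    # Count anchor descendants; count the element itself if it is a link.
    links = len(elem.findall(".//a")) + (1 if elem.tag == "a" else 0)
    return {
        "tag": elem.tag.upper(),
        "id": elem.get("id"),
        "words": words,
        "link_ratio": links / words if words else 0.0,
    }

root = ElementTree.fromstring(page)
for elem in root.iter():
    print(features(elem))
```

Feature vectors of this kind, one per candidate block, are what a decision-tree learner can consume to separate link-heavy navigation blocks from word-heavy informative blocks.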