Title identification of web article pages using HTML and visual features

Jian Fan*a, Ping Luob, Parag Joshia
a Hewlett-Packard Laboratories, 1501 Page Mill Road, Palo Alto, CA, USA 94304; b Hewlett-Packard Laboratories, No.1 Zhong Guan Cun East Road, Beijing, China 100084

ABSTRACT

Extracting informative content from Web article pages has many applications, such as printing and content reuse. The title is a significant and unique component of an article. However, identifying the true title is not an easy problem, even for human readers. In this paper, we present a title identification method that takes into account several features, including the title field of the HTML page and the HTML tag of a DOM node, as well as font size and horizontal alignment. We tested our method on a ground truth data set consisting of 1993 pages from 98 web sites and achieved 97.5% accuracy, about 20 percentage points above a baseline method based only on font size.

Keywords: data extraction, web article extraction, title identification

1. INTRODUCTION

The World Wide Web has become a valuable source of information for internet users around the world. There are many types of web pages, including news, blogs, shopping, maps/directions, financial information, photos, videos, email, etc. However, most web pages are designed for on-screen viewing only and may not be suitable for other purposes such as printing. Repurposing Web content generally requires re-layout of the content [1]. This means that not only the content but also the structural roles of the various content components should be recognized. In this paper, we focus on identifying the title of web article pages, that is, pages that contain a significant amount of consecutive text in a rectangular region.

Even though the title is an important component of an article and is often given prominent visual characteristics, its robust identification is not an easy problem. In the context of web article extraction, Wang et al. reported an SVM (Support Vector Machine) based machine learning approach that utilizes both spatial features (the position and size of the DOM subtree) and content features (font size, number of words, and a period at the end) [2]. They reported 97-100% accuracy, but on only 12 major news web sites. We argue that their content-based features, number of words and period at the end, are not robust discriminators. In our web article collection we found web article titles that end with an abbreviation such as “U.S.” (http://news.nationalgeographic.com/news/2009/10/091014-giant-snakes-invasion-us.html) or with “…” (http://www.zdnet.com/blog/government/can-hackers-have-a-disease-and-be-absolved-from-a-crime-maybe/6340), which may not be easily distinguishable from a period. We also found that the number of words in a title may range from one (http://bnreview.barnesandnoble.com/t5/Reviews-Essays/Open/ba-p/1804) to twenty (http://www.guardian.com/AboutGuardian/Newsroom/News/gi_013305). Moreover, the word count is language dependent. In English, there is a space between words, but in some Asian languages such as Chinese, each character is a word and there is no space between them.

In an earlier study targeting the broader context of general web pages [3], Xue et al. used both CRF (Conditional Random Field) and SVM models with 245 features extracted from the HTML file and DOM tree, in addition to position features derived from the rendered page. However, their best result is around 80% in terms of the F1 measure.

Although the HTML file format does reserve a “title” tag for the title of a web page, in reality this field alone is not a reliable indicator of the true title [3]. In fact, the methods of both research groups did not utilize the title field for title detection.

In this paper, we present a title identification method for web article pages that utilizes the title field, header tags, font size and spatial alignment. Our work differs from the above-mentioned papers in two respects. First, we target only individual web article pages instead of general web pages or multiple pages from the same web site. We further assume that the article text body is already identified and its location is known [1]. Second, we develop an explicit numeric formula that incorporates the four HTML and spatial features to evaluate each title candidate and select the winner.

Imaging and Printing in a Web 2.0 World II, edited by Qian Lin, Jan P. Allebach, Zhigang Fan, Proc. of SPIE-IS&T Electronic Imaging, SPIE Vol. 7879, 78790K · © 2011 SPIE-IS&T · CCC code: 0277-786X/11/$18 · doi: 10.1117/12.876708

Our evaluation results show that the proposed method is accurate and robust for a very diverse collection of web article pages. In particular, incorporating the title field significantly improves the accuracy of title identification for web pages that contain a category/column title. The remainder of the paper is organized as follows. Section 2 explains our method in detail and gives the formula. Section 3 describes the ground truth data set and the results of the performance evaluation. Section 4 concludes the paper.

2. OUR METHOD

Our title identification method is built on top of article body detection, which renders the web page and identifies all text elements/paragraphs (each of which may consist of multiple DOM nodes), as well as the text elements that are part of the article [1]. Title identification utilizes both the HTML and visual information of all the text elements and the information about the main article text. It takes two steps. First, candidate text elements are selected. Second, the candidates are scored numerically and the one with the highest score is selected as the title.

The title candidates are selected according to the following criteria:

1. The horizontal starting position must not exceed the horizontal center position of the main text region, and the top position must not be below the top quarter of the main text region. Figure 1 shows an example in which the detected main article region is colored green and the maximum left and top position is marked by a red cross.

2. The font size must not be smaller than the font size of the main article text. Text elements with a font size equal to that of the main article text are eligible only if they are tagged with “H1” to “H6” or are entirely bold.
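The two selection criteria can be sketched as a simple filter. This is a minimal illustration, not the implementation used in the paper: the `TextElement` and `Region` types and all field names are hypothetical, coordinates are assumed to be in pixels with the y axis pointing down, and "not below the top quarter" is read here as "the top edge lies no lower than one quarter of the way down the main text region".

```python
from dataclasses import dataclass


@dataclass
class TextElement:
    """A rendered text element; all field names are illustrative."""
    text: str
    left: float        # x of the left edge, in pixels
    top: float         # y of the top edge, in pixels (y grows downward)
    font_size: float   # font size in points
    tag: str = "div"   # tag of the enclosing DOM node
    all_bold: bool = False


@dataclass
class Region:
    """Bounding box and body font size of the detected main article text."""
    left: float
    top: float
    width: float
    height: float
    font_size: float


HEADER_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


def is_title_candidate(el: TextElement, main: Region) -> bool:
    # Criterion 1: the element starts no further right than the horizontal
    # center of the main text, and no lower than the assumed top-quarter line.
    max_left = main.left + main.width / 2
    max_top = main.top + main.height / 4
    if el.left > max_left or el.top > max_top:
        return False
    # Criterion 2: the font is at least as large as the body font; text of
    # equal size qualifies only if it is a header or entirely bold.
    if el.font_size < main.font_size:
        return False
    if el.font_size == main.font_size:
        return el.tag.lower() in HEADER_TAGS or el.all_bold
    return True
```

The point (`max_left`, `max_top`) plays the role of the red cross in Figure 1: any element whose top-left corner lies to the right of it or below it is discarded before scoring.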

Figure 1 shows an example in which the ten selected title candidates are colored cyan and the main article text region is colored green.

In the second step, we compute a real-valued score for each candidate and choose the one with the highest score as the title. The score makes use of the following HTML elements and visual features:

1. Title field. This is the text string delimited by the ‘<title>’ and ‘</title>’ tags in an HTML file.

2. Header tags “H1” to “H6”. We give higher weight to text elements under a DOM node with a header tag. “H1” and “H6” have the highest and the lowest weight, respectively.

3. Font size. A title usually has a large font size that makes it visually significant.

4. Horizontal alignment with the main text body. Our observation is that most titles are either left or center aligned with the main article text.

For each of the features, a sub-score is computed. The match between a title candidate s and the title field t is measured based on the Levenshtein distance:

$m_{s,t} = 1 - \min(d_{s,t}, L_t) / L_t$,

where $d_{s,t}$ is the Levenshtein distance between strings s and t, and $L_t$ is the string length of the title field t. The sub-score $m_{s,t}$ takes values in the range 0 to 1, with 1 representing a perfect match.

The spatial alignment between a title candidate and the main article text region is measured by:

$\alpha = 1 - \min(d_l, d_c) / w_{main}$,

where $d_l$ and $d_c$ are the distances between the left edges and between the horizontal centers of the candidate element and the main article text, respectively, and $w_{main}$ is the width of the main text region. Figure 2 illustrates the distances.

Figure 1. An example of a web article (http://ngm.nationalgeographic.com/2006/08/tom-abercrombie/belt-text.html) with the detected main article text region (green) and the selected title candidates (cyan). The red cross represents the maximum position of title candidates.
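The two sub-scores can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: the function names are ours, the edit distance is the textbook dynamic-programming version with unit costs, distances are taken as absolute pixel differences, and returning 0 for an empty title field is our own choice, since that case is not covered by the definition.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def match_score(s: str, t: str) -> float:
    """m_{s,t} = 1 - min(d_{s,t}, L_t) / L_t, a value in [0, 1]."""
    if not t:
        return 0.0  # assumption: an empty title field gives no evidence
    return 1.0 - min(levenshtein(s, t), len(t)) / len(t)


def alignment_score(cand_left: float, cand_width: float,
                    main_left: float, main_width: float) -> float:
    """alpha = 1 - min(d_l, d_c) / w_main."""
    d_l = abs(cand_left - main_left)
    d_c = abs((cand_left + cand_width / 2) - (main_left + main_width / 2))
    return 1.0 - min(d_l, d_c) / main_width
```

Clamping the distance to $L_t$ keeps the match score from going negative when the candidate is much longer than the title field, and taking the smaller of $d_l$ and $d_c$ means a candidate gets full alignment credit if it is either left aligned or centered on the main text.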

The overall score for a title candidate s is computed using the following formula:

$Q_s = (m_{s,t} + \lambda) \cdot \beta^{(7-h)} \cdot (0.1 + \alpha) \cdot f_s^{\omega}$    (1)

where $m_{s,t}$ is the matching score between the title candidate s and the title field t, $\alpha$ is the spatial alignment score, $f_s$ is the font size in points, and h is the header level, with h = 1 to 6 corresponding to the H1 to H6 tags and h = 7 for strings with non-header tags. The parameters λ = 0.5, β = 1.15 and ω = 2 are fixed and were determined empirically.

Figure 2. Distances for the alignment measure.
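As a worked illustration of formula (1), the sketch below reads the formula with (7 − h) as the exponent of β and ω as the exponent of the font size, which is our interpretation of the typeset equation. The helper names and the two candidates are invented for the example; only the parameter values λ = 0.5, β = 1.15 and ω = 2 come from the text.

```python
LAMBDA, BETA, OMEGA = 0.5, 1.15, 2.0  # fixed parameter values given in the text


def header_level(tag: str) -> int:
    """Map 'h1'..'h6' to 1..6 and any other tag to 7."""
    tag = tag.lower()
    if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
        return int(tag[1])
    return 7


def title_score(m: float, alpha: float, font_size: float, tag: str) -> float:
    """Q_s = (m + lambda) * beta**(7 - h) * (0.1 + alpha) * font_size**omega."""
    h = header_level(tag)
    return (m + LAMBDA) * BETA ** (7 - h) * (0.1 + alpha) * font_size ** OMEGA


# Hypothetical candidates: (text, m, alpha, font size in points, tag).
candidates = [
    ("Travel", 0.0, 1.0, 30.0, "div"),                # large category title
    ("Ten days in Patagonia", 0.9, 1.0, 22.0, "h1"),  # article title
]
best = max(candidates, key=lambda c: title_score(*c[1:]))
```

In this made-up case the category title has the larger font, yet the article title scores higher because it matches the title field and sits under an H1 tag. The constants λ and 0.1 keep a candidate with m = 0 or α = 0 from being zeroed out entirely, so no single feature can veto a candidate on its own.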

3. PERFORMANCE EVALUATIONS

As an important part of our web content analysis projects, we have collected 1993 web pages from 98 distinct top news and blog web sites, such that each site is represented by about 20 web pages [4]. All the web pages were saved locally using a Firefox plug-in so that all the content embedded in each web page is saved properly. In this way, these pages can be accessed in the future even when some of them are no longer online or have been modified. For all the saved 1997 web pages, human experts extracted and labeled the informative content of [Title], [Main Body], [Image URL], [Image Caption], [Print Link] and [Next Page Link], and saved the results as text files. These saved text files serve as the ground truth data against which the generated extraction results are compared.

Each test web page is rendered with a WebKit-based engine so that a DOM tree with complete HTML and visual information is available. After the main text is identified, title candidates are selected and the one with the highest score is identified as the title. For title identification, the result for each page is either a success or a failure. This is decided by comparing the detected title string with the true title string: if the two strings are identical, the detected title is counted as correct.

To verify the efficacy of the proposed method, we performed the following evaluations using various sets of features:

1. Font size only. In this case, $Q_s \equiv f_s$, which means that we simply select the text with the largest font size as the title.

2. No title field. In this case, we set $m_{s,t} \equiv 1$ so that the title field does not affect the outcome.

3. All four features.
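The three evaluation settings and the exact-match accuracy measure can be sketched as follows. This is only an illustration under assumptions: the `Candidate` type, the variant names and the sample page are invented, and the score uses the parameter values λ = 0.5, β = 1.15 and ω = 2 with (7 − h) and ω read as exponents.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    text: str
    m: float          # match with the title field, in [0, 1]
    alpha: float      # alignment score
    font_size: float  # font size in points
    h: int            # 1..6 for H1..H6, 7 for non-header tags


def score(c: Candidate, variant: str) -> float:
    if variant == "font_only":          # setting 1: Q_s is the font size
        return c.font_size
    # Setting 2 fixes m to 1 so that the title field has no influence.
    m = 1.0 if variant == "no_title_field" else c.m
    return (m + 0.5) * 1.15 ** (7 - c.h) * (0.1 + c.alpha) * c.font_size ** 2


def pick_title(cands, variant):
    return max(cands, key=lambda c: score(c, variant)).text


def accuracy(pages, variant):
    """pages: list of (candidates, true_title); only identical strings count."""
    hits = sum(pick_title(cands, variant) == truth for cands, truth in pages)
    return hits / len(pages)


# Invented page with a large category title above the article title.
page = ([Candidate("Travel", 0.1, 1.0, 30.0, 7),
         Candidate("Ten days in Patagonia", 0.9, 1.0, 22.0, 7)],
        "Ten days in Patagonia")
```

On this invented page, `font_only` and `no_title_field` both pick the larger category title, while the full score picks the article title. This is the kind of page on which the title field is expected to matter most.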

The numeric results are listed in Table 1. The result using only the font size shows complete or near-complete failure for 19 web sites. This is due to the existence of category/column titles, such as the examples shown in Figure 3. The average accuracy in this case is 78%. The result for the second set of features is very similar, but drastically better for one web site: lonelyplanet. In contrast, the proposed method with all four features achieved an average accuracy of 97.5%, almost 20 percentage points above the baseline font-size-only method. However, the accuracy of the proposed method dropped to a very low level for two web sites: blogspot and www.nih.gov. The reason is that these pages contain both category and article titles, and their title field also contains both titles. To improve the accuracy for these web sites, more features should be added.

Figure 3. Web pages containing a category title: (top left), nj (top right), lonelyplanet (bottom left) and bnreview.barnesandnoble.

Table 1. Results of title identification with various features. N: number of valid test pages; Nf: detected by font size only; Np: detected without the title field ($m_{s,t} \equiv 1$); NQ: detected with all the features.

No.  Site                      N    Nf   Np   NQ
1    cnn                       20   14   14   20
2    news.aol                  20   20   20   20
3    news.yahoo                20   20   20   20
4    foxnews                   20   20   20   20
5    usnews                    20   2    2    19
6    newsweek                  20   20   20   20
7    nytimes                   20   20   20   20
8    time                      20   18   18   20
9    bbc                       19   19   19   19
10   slate                     20   20   20   20
11   azcentral                 20   0    0    20
12   theatlantic               20   19   19   19
13   nypost                    20   18   18   18
14   sina                      20   20   20   20
15   lifehacker                20   20   20   20
16   washingtonpost            19   19   19   19
17   weather                   20   17   17   18
18   msnbc                     40   40   40   40
19   huffingtonpost            20   19   19   19
20   reuters                   20   20   20   20
21   guardian                  20   20   20   18
22   wsj                       20   20   20   20
23   indiatimes                20   20   20   20
24   latimes                   20   19   19   20
25   timesonline               20   20   20   20
26   bloomberg                 19   19   19   19
27   associatedcontent         20   20   20   20
28   forbes                    20   20   20   20
29   usatoday                  20   20   20   20
30   examiner                  20   0    0    20
31   abcnews                   20   17   17   20
32   businessweek              20   20   20   20
33   smh                       20   20   20   20
34   nationalgeographic        40   38   38   38
35   ft                        19   13   13   19
36   chicagotribune            20   3    3    20
37   theage                    20   20   20   20
38   theonion                  20   20   20   20
39   helium                    20   20   20   20
40   mercurynews               20   20   20   20
41   ajc                       20   20   20   20
42   suntimes                  20   20   20   20
43   economist                 20   15   15   15
44   nj                        20   0    0    20
45   chron                     20   0    0    20
46   newsmax                   20   3    3    20
47   startribune               20   20   20   20
48   thestar                   20   20   20   20
49   wnd                       20   20   20   19
50   denverpost                20   20   20   20
51   blogspot                  19   7    7    9
52   wordpress                 20   19   19   19
53   xanga                     20   3    3    18
54   fool                      20   20   20   20
55   zdnet                     20   0    0    20
56   lonelyplanet              20   2    20   20
57   frommers                  20   20   20   20
58   tripadvisor               20   17   17   20
59   ehow                      20   20   20   20
60   boingboing                20   19   19   19
61   wikihow                   20   20   20   20
62   webmd                     20   4    4    18
63   howtodothings             20   20   20   20
64   britannica                20   0    0    20
65   wikinews                  20   20   20   20
66   scientificamerican        20   20   20   20
67   gamespot                  20   20   20   20
68   kotaku                    20   20   20   20
69   bnreview.barnesandnoble   20   0    0    20
70   dpreview                  20   20   20   16
71   techcrunch                20   20   20   20
72   about                     20   0    0    20
73   livejournal               18   13   13   17
74   sciencemag                20   20   20   20
75   freep                     20   0    0    20
76   sky                       20   20   20   20
77   news.cnet                 20   20   20   20
78   vancouversun              20   20   20   20
79   thedailybeast             20   20   20   20
80   espn                      20   20   20   20
81   npr                       20   5    5    20
82   cricinfo                  20   20   20   20
83   nba                       20   8    8    20
84   nfl                       20   20   19   19
85   sportsillustrated.cnn     20   20   20   20
86   nhl                       20   20   20   20
87   skysports                 20   20   20   20
88   answers                   20   17   17   18
89   typepad                   20   20   20   20
90   www.nih.gov               20   2    2    12
91   howstuffworks             20   16   16   19
92   fifa                      20   18   18   20
93   stackoverflow             20   19   19   20
94   allexperts                20   0    0    20
95   sciencedaily              20   0    0    20
96   nature                    20   20   20   20
97   tmz                       20   20   20   20
98   engadget                  20   20   20   20

4. CONCLUSION

In this paper, we presented a title identification method based on four HTML and spatial features. We evaluated the proposed method extensively on 1993 web pages from 98 web sites. The results show that the title field is a very effective feature for many web pages containing both category and article titles. On average, the proposed method achieved 97.5% accuracy, about 20 percentage points above a baseline method using only the font size. To further improve accuracy and robustness, we plan to add more features and investigate a machine learning based approach.

REFERENCES

[1] P. Luo, Jian Fan, Sam Liu, Fen Lin, Yuhong Xiong, and Jerry Liu, “Web Article Extraction for Web Printing: a DOM+Visual based Approach,” Proc. ACM DocEng, 66-69 (2009).
[2] Junfeng Wang, Chun Chen, Can Wang, Jian Pei, Jiajun Bu, Ziyu Guan, and Wei Vivian Zhang, “Can we learn a template-independent wrapper for news article extraction from a single training site?”, Proc. ACM SIGKDD, 1345-1353 (2009).
[3] Yewei Xue, Yunhua Hu, Guomao Xin, Ruihua Song, Shuming Shi, Yunbo Cao, Chin-Yew Lin, and Hang Li, “Web page title extraction and its application,” SID Information Processing and Management Papers 43, 1332–1347 (2007).
[4] http://www.hpl.hp.com/research/multimedia_understanding/downloads/web_article_dataset

