Web Page Title Extraction and Its Application



Yewei Xue 1, Yunhua Hu 1, Guomao Xin 2, Ruihua Song 2, Shuming Shi 2, Yunbo Cao 2, Chin-Yew Lin 2, and Hang Li 2
1 Xi'an Jiaotong University, Xi'an, China
2 Microsoft Research Asia, Beijing, China

Abstract

This paper is concerned with the automatic extraction of titles from the bodies of HTML documents (web pages). The titles of HTML documents should be correctly defined in the title fields by their authors; in reality, however, they are often bogus. It is therefore advantageous if we can automatically extract titles from HTML documents. In this paper, we take a supervised machine learning approach to the problem. We first propose a specification of HTML titles, that is, a 'definition' of HTML titles. Next, we employ two learning methods to perform the task. In one method, we utilize features extracted from the DOM (Document Object Model) Tree; in the other, we utilize features based on vision. We also combine the two methods to further enhance extraction accuracy. Our title extraction methods significantly outperform the baseline method of using the lines in the largest font size as the title (22.6%-37.4% improvements in terms of F1 score). As an application, we consider web page retrieval. We use the TREC Web Track data for evaluation and propose a new method for HTML document retrieval using extracted titles. Experimental results indicate that the use of both extracted titles and title fields is almost always better than the use of title fields alone; the use of extracted titles is particularly helpful in the task of named page finding (25.1%-30.3% improvements).

Keywords: Information Retrieval, HTML Document, Metadata Extraction

1. Introduction

In this paper we address the issue of automatically extracting titles from the bodies of HTML documents (web pages). Titles are the 'names' of documents and thus are very useful information for web information processing. In HTML documents, authors can explicitly specify titles by placing them in the title fields marked by the <title> and </title> tags. However, people usually do not do this carefully. We have evaluated the title fields of the HTML documents in the TREC Web Track data set, which contains 1 053 111 HTML documents, and have found that about 33.5% of the title fields are somewhat bogus (see Section 6 for details). In fact, titles also often exist in the bodies of HTML documents. They tend to be more reliable than the content of the title fields, because they are more noticeable to readers and thus are usually created more carefully by the authors. One question arises here: can we extract titles from the bodies and use them in web information processing? This is exactly the problem we address in this paper. To the best of our knowledge, no previous work had been done on this problem before ours. Title extraction from the bodies of HTML documents is not as easy as it appears. Web pages vary greatly in design, and it was not clear whether it would be possible to conduct the extraction and whether the extracted titles would be helpful for web page processing. In this paper, we take a machine learning approach to the problem. The key issues are to define a specification of HTML titles and to identify useful information (features) for the extraction task. We first define what we mean by an HTML title, in what we call the HTML title specification, or SPEC (see Section 3 for details). We then annotate a sample document corpus according to the SPEC and take it as the training set, on which we

A previous version of the paper appeared in Proceedings of SIGIR 2005. The work was conducted when the first two authors were visiting Microsoft Research Asia.

train statistical models and evaluate them. We consider two methods for processing an HTML document and extracting features. In one method, we create a DOM Tree for the given HTML document and extract features from the DOM Tree. In the other, we render the given HTML document and extract features from the rendered result (i.e., based on vision). In both the DOM Tree based and the vision based methods, we utilize formatting information such as font size, font weight, and alignment as features. We consider two types of statistical models for title extraction: Support Vector Machines (SVM) and Conditional Random Fields (CRF). SVM is a local classification model, while CRF is a global tagging model. We also propose a new method for web page retrieval. The BM25-based method combines the uses of the body, the title, and the extracted title. It normalizes the score calculated from each type of data and combines the normalized scores linearly. Experimental results indicate that for HTML title extraction our methods (both DOM Tree based and vision based) significantly outperform the baseline, which always uses the lines in the largest font size as titles (22.6%-37.4% improvement in F1 score). By combining the extraction results of the two methods, the accuracy of extraction can be further improved. Experimental results on the TREC Web Track data also indicate that the use of both extracted titles and title fields is almost always better than the use of title fields alone. Moreover, the use of extracted titles is particularly helpful in the task of named page finding in TREC (25.1%-30.3% improvements). The rest of the paper is organized as follows. In Section 2, we introduce related work, and in Section 3, we give our specification of HTML titles. In Section 4 we describe our two methods of title extraction, one based on the DOM Tree and the other based on vision. In Section 5, we explain our method of web page retrieval using extracted titles.
Section 6 gives our experimental results. We make concluding remarks in Section 7.

2. Related work

To the best of our knowledge, there has been no previous work on title extraction from HTML bodies. Nor has it been clarified whether and how extracted titles can be used to enhance web page retrieval. Title extraction from HTML documents is a specific task within web information extraction, which has become a popular research area and for which many methods have been proposed (Eikvil, 1999). Automatic extraction of different types of web information has been studied. For instance, Liu et al. (Liu, Grossman, & Zhai, 2003) have proposed a method for extracting data records from web pages. Reis et al. (Reis, Golgher, Silva, & Laender, 2004) have investigated the extraction of news articles. Craven (2003) has proposed a method for extracting summaries from web pages. Web information extraction has also been conducted at different levels of data structure. For instance, Breuel (2003) has proposed parsing web pages as trees of HTML tags (called DOM Trees) and pulling out information from the trees (see also Kosala, Bruynooghe, Bussche, & Blockeel, 2003; Reis, Golgher, Silva, & Laender, 2004). Song et al. (Song, Liu, Wen, & Ma, 2004) have proposed rendering web pages, dividing them into a number of blocks, and conducting information extraction from the blocks. In general, there are two approaches to web information extraction: the rule based approach and the machine learning based approach.
The machine learning based approach is more widely employed (Chidlovskii, Ragetli, & Rijke, 2000; Craven, 2003; Crescenzi, Mecca, & Merialdo, 2001; Crescenzi, Mecca, & Merialdo, 2002; Eikvil, 1999; Evans, Klavans, & McKeown, 2004; Freitag, 2000; Freitag & McCallum, 1999; Han, Giles, Manavoglu, Zha, Zhang, & Fox, 2003; Li, Zaragoza, Herbrich, Shawe-Taylor, & Kandola, 2002; Liu, Grossman, & Zhai, 2003; Muslea, Minton, & Knoblock, 1999; Reis, Golgher, Silva, & Laender, 2004; Song, Liu, Wen, & Ma, 2004; Zhang, Song, Lin, Ma, Jiang, Jin, Liu, Zhao, & Ma, 2002). There is also related work in the field of OCR. For example, in (Chang & Lui, 2001; Chaudhuri & Garain, 1999; Giuffrida, Shek, & Yang, 2000), formatting features are used for title recognition in OCR. Furthermore, related work can be found in other information extraction applications. For example, in (Belaïd, 2001; Hurst, 2002; Pinto, McCallum, Wei, & Croft, 2003; Wang, 1996), vision based information is used for table extraction from documents. In information retrieval, previous work has shown that the use of title fields, anchor texts, and URLs of web pages (HTML documents) can enhance web page retrieval. Cutler et al. (Cutler, Shih, & Meng, 1997) have proposed using the structures in HTML documents to improve HTML document retrieval. Specifically, they linearly combine term

frequencies in several fields extracted from an HTML document. In TREC-2002, several participants (Amitay, Carmel, Darlow, Lempel, & Soffer, 2002; Collins-Thompson, Ogilvie, Zhang, & Callan, 2002; Zhang, Song, Lin, Ma, Jiang, Jin, Liu, Zhao, & Ma, 2002) reported utilizing different fields of HTML files for web page retrieval. Zhang et al. (Zhang, Song, Lin, Ma, Jiang, Jin, Liu, Zhao, & Ma, 2002; Zhang, Song, & Ma, 2003) have explored the roles of different HTML fields, such as the title field and bold text, in document retrieval. Amitay et al. (Amitay, Carmel, Darlow, Lempel, & Soffer, 2002) have proposed eliminating documents whose title fields do not contain any query word. In TREC-2003, more than half of the participants considered the use of richer representations based on document structure (Craswell & Hawking, 2003). For instance, Ogilvie and Callan (2003a; 2003b) have tried to combine different document representations from different sources using language models (see also Yau & Hawker, 2004).

3. What is an HTML document title?

We approach the HTML title extraction problem by giving a 'specification' of titles. With the specification, humans can annotate the titles of HTML documents in a relatively objective way. Using the labeled data, we can train (and also evaluate) a model for title extraction. The specification defines titles mainly from the viewpoint of document formatting. Intuitively, the title of an HTML document is its 'most conspicuous' description. The details of the specification are as follows.

1. Number
   An HTML document can have two titles, one title, or no title.
2. Position
   a) Titles must be in the 'top block' or the 'main block' (the definitions of top block, main block, and side block are given below);
   b) Titles must be in the top region of a main block (i.e., within the top 1/3 of the block);
   c) Titles cannot be in the 'side block';
   d) If there are two titles, the two titles are usually in the top block and the main block respectively.
3. Form
   a) The font sizes of titles are usually the largest and second largest on the page;
   b) Titles are conspicuous in terms of font family, font weight, font color, font style, alignment, and background color;
   c) The formats of titles usually differ from those of the surrounding text.
4. Span
   a) Titles can consist of several consecutive lines but must be in the same format (i.e., subtitles in smaller font sizes are ignored);
   b) Titles cannot be part of bullets or numbering, and cannot be the names of chapters or sections.
5. Content
   a) Titles cannot be a link, time expression, address, etc.;
   b) Titles cannot be expressions like 'under construction', 'last updated', etc.;
   c) Titles can be the expressions immediately after 'Title:' and 'Subject:';
   d) Titles cannot be too long.
6. Other
   a) Titles in images are not considered.

We categorize the layouts of web pages (HTML documents) into seven classes. Fig. 1 shows six of them. It also gives statistics on the distribution of the classes, obtained from 1 200 randomly selected pages in the TREC data set. We define the largest center block of a page as its 'main block', the block in the top region as the 'top block', and a block on one of the two sides as a 'side block'. The sixth class includes three kinds of page layout. The seventh class is 'others' and contains layouts that do not fall into any of the previous six classes. This class does not appear in Fig. 1, because it is difficult to represent clearly with figures similar to the others.
We use all seven types of layout information in our title extraction methods.
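To make the 'Content' rules of the specification concrete, here is a minimal sketch of how such checks might be coded. The function name, the phrase list, the date pattern, and the length cap are our illustrative assumptions, not part of the specification itself.

```python
import re

# Hypothetical phrase list for rule 5b of the specification.
NEGATIVE_PHRASES = {"under construction", "last updated"}

def passes_content_rules(text, is_link=False, max_len=120):
    """Return False for candidates that rule 5 ('Content') excludes as titles."""
    t = text.strip().lower()
    if is_link or not t or len(t) > max_len:    # 5a (link) and 5d (too long)
        return False
    if any(p in t for p in NEGATIVE_PHRASES):   # 5b (boilerplate phrases)
        return False
    if re.fullmatch(r"\d{1,2}[/-]\d{1,2}[/-]\d{2,4}", t):  # 5a (bare date)
        return False
    return True
```

In practice such rules are used as hard filters on candidate units before (or after) the learned model scores them.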

Fig. 1. Types of web page layout.

The page in Fig. 2 belongs to class 4 and has two titles: "National Weather Service Oxnard" is one title and "Los Angeles Marine Weather Statement" is the other. The former is in the top block and the latter in the main block. The page in Fig. 3 belongs to class 5 and also has two titles: "NINDS Website Privacy Statement" is one title and "National Institute of Neurological Disorders and Stroke" is the other. The former is in the main block while the latter is in the top block.

Fig. 2. Example of web page title.

Fig. 3. Example of web page title.

4. Title extraction methods

In this paper, we take a machine learning approach to the HTML document title extraction problem. The first method is based on the DOM Tree; the second is based on vision. Both methods consist of two phases: training and extraction. Before training and extraction there is a preprocessing step; after extraction there is a postprocessing step.

4.1 Model

We describe the model for title extraction in a general framework. The training phase takes as input a sequence of units x_1, ..., x_n aligned with a sequence of labels y_1, ..., y_n, which represent the extraction targets, i.e., titles. Suppose that X_1, ..., X_n are random variables denoting a sequence of instances, and Y_1, ..., Y_n are random variables denoting a sequence of labels. The learning process aims at constructing the conditional probability model

P(Y1 Yn | X 1  X n ) An example of the model is CRF (Conditional Random Fields, Lafferty, McCallum, & Pereira, 2001). If the Ys are assumed to be independent from each other, then we have

P(Y1 Yn | X1  X n )  P(Y1 | X1  X n )P(Yk | X1  X n ) Each conditional probability model turns out to be a classifier. In this paper, as classifier we employ SVM (Support Vector Machines, Joachims, 2001). 4.2 DOM Tree based method 4.2.1 Overview The first machine learning method for title extraction is a DOM Tree based method. The input of the preprocessing is an HTML document. In the preprocessing, the HTML document is parsed and the DOM (Document Object Model)

Tree of the document is constructed. In this paper, we use MSHTML (for details see the MSHTML Reference) as the tool for document parsing. Fig. 4 shows an example of a DOM Tree (for the definition of the DOM Tree see Breuel, 2003; W3C DOM Technical Committee, 2003). Next, 'units' are collected from the DOM Tree. Specifically, all the leaf nodes in the DOM Tree are defined as units. A unit generally corresponds to a line of text in the HTML document and contains not only content information (linguistic information) but also formatting information. Nearly all HTML documents can be correctly parsed and the corresponding units correctly collected. Fig. 5 shows the units obtained from an HTML document. The output of preprocessing is a sequence of units (instances).
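The paper uses MSHTML for parsing; as a language-neutral illustration of the idea of collecting leaf text nodes as units, here is a sketch with Python's standard `html.parser`. The class and field names are ours, and the parser is deliberately simplified (e.g., void elements such as `<br>` are not handled).

```python
from html.parser import HTMLParser

class UnitCollector(HTMLParser):
    """Collect text 'units' (leaf text nodes) together with the enclosing
    tag path, a rough analogue of DOM Tree leaf collection."""
    def __init__(self):
        super().__init__()
        self.stack, self.units = [], []
    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
    def handle_data(self, data):
        text = data.strip()
        if text:  # keep only non-empty leaf text as a unit
            self.units.append({"text": text, "path": tuple(self.stack)})

parser = UnitCollector()
parser.feed("<html><body><h1>My Title</h1><p>Some text</p></body></html>")
```

Each collected unit carries its tag path, from which formatting features (e.g., being inside an H1) can be derived.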

Fig. 4. The DOM Tree of the document in Fig. 2.

Fig. 5. Units in HTML documents.

In learning, the input is sequences of units and the corresponding sequences of labels. A label denotes whether a unit is part of a title. One sequence of units and labels is obtained from each document. We use the labeled training data to construct a model for title extraction, i.e., for identifying which units form a title. In extraction, the input is the sequence of units obtained from one document. We employ the trained model to decide, with a confidence score, whether each unit is part of a title. In the post-processing of extraction, we extract titles using the trained model and heuristics, and output the extracted titles of the document. Specifically, we choose the consecutive units with the highest scores given by the model as the first title, and then choose the consecutive units with the second highest scores as the second title, provided that the scores are larger than zero.
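The post-processing heuristic just described can be sketched as follows. This is a simplified reading of the heuristic, with illustrative types: units are strings and scores are floats.

```python
def extract_titles(units, scores):
    """Group consecutive units with positive scores into runs, then return
    up to two runs as titles, best scoring run first."""
    runs, current = [], []
    for unit, score in zip(units, scores):
        if score > 0:
            current.append((unit, score))
        elif current:
            runs.append(current)
            current = []
    if current:
        runs.append(current)
    # Rank candidate runs by their best unit score.
    runs.sort(key=lambda run: max(s for _, s in run), reverse=True)
    return [" ".join(u for u, _ in run) for run in runs[:2]]
```

For example, with scores [0.9, 0.8, -1.0, 0.5] over four units, the first two units form the first title and the last unit the second.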

4.2.2 Features

In both the CRF model and the SVM model we utilize formatting information as features. We list here the major types of features; in total, 245 features are defined from a DOM Tree.

1. Rich format information
   a) Font size: 1~7 levels;
   b) Font weight: bold face or not;
   c) Font family: Times New Roman, Arial, etc.;
   d) Font style: normal or italic;
   e) Font color: #000000, #FF0000, etc.;
   f) Background color: #FFFFFF, #FF0000, etc.;
   g) Alignment: center, left, right, and justify.
2. Tag information
   a) H1, H2, ..., H6: levels as headers;
   b) LI: a list item;
   c) DIR: a directory list;
   d) A: a link or anchor;
   e) U: an underline;
   f) BR: a line break;
   g) HR: a horizontal rule;
   h) IMG: an image;
   i) Class name: 'sectionheader', 'title', 'titling', 'header', etc.
3. DOM Tree information
   a) Number of sibling nodes in the DOM Tree;
   b) Relations with the root node, parent node, and sibling nodes in terms of font size change, etc.;
   c) Relations with the previous leaf node and next leaf node in terms of font size change, etc. (note that these nodes might not be siblings).
4. Linguistic information
   a) Length of text: number of characters;
   b) Length of real text: number of alphabetic letters;
   c) Negative words: 'by', 'date', 'phone', 'fax', 'email', 'author', etc.;
   d) Positive words: 'abstract', 'introduction', 'summary', 'overview', 'subject', 'title', etc.
5. Format change information
   a) Font change with respect to neighbors;
   b) Alignment and color change with respect to neighbors.

4.3 Vision based method

4.3.1 Overview

The second machine learning method for title extraction is a vision based method. The input of preprocessing is an HTML document, and its output is a sequence of units (instances) extracted from the document. The units are obtained not from a DOM Tree but from a rendered result of the page. Specifically, we employ a tool like that in (Cai, Yu, Wen, & Ma, 2003; Song, Liu, Wen, & Ma, 2004) to parse the web page and obtain the blocks of the page.
We also identify the layout class of the page. Most HTML documents can be correctly processed with this method. Fig. 6 shows the blocks of the page in Fig. 2. We next generate units and collect the content information (linguistic information) and formatting information of the units. The block and page layout information is acquired heuristically on the basis of vision, as follows. We define seven regions of a page, as shown in Fig. 7 (these regions adjust automatically to the size of the web page). Given a document, we parse it, obtain a structure of blocks, and map all the blocks into the regions. Next, we identify which block is the main block, the top block, etc. by looking at the overlaps between the blocks and the regions. Finally, we identify which layout class the page falls into (cf. Fig. 1). For example, in Fig. 6, the first block is located at the top of the page and overlaps with regions 1, 2, and 3, and

thus we can identify it as the 'top block'. The second block overlaps with regions 4, 5, and 6, and we judge that it is the 'main block'. We can further categorize this page into class 4. Note that there is a one-to-one mapping between the DOM Tree based units and the vision based units of an HTML document, because all the units are generated from the leaf nodes of the DOM Tree.
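The block-to-region overlap test behind this heuristic can be sketched as follows. The coordinates, the toy region layout, and the labeling rule are our illustrative assumptions; the paper's actual regions are those of Fig. 7.

```python
def overlapping_regions(block, regions):
    """Return 1-based indices of the regions a block's (x, y, w, h) box
    overlaps, using a standard axis-aligned rectangle intersection test."""
    bx, by, bw, bh = block
    hits = []
    for i, (rx, ry, rw, rh) in enumerate(regions, start=1):
        if bx < rx + rw and rx < bx + bw and by < ry + rh and ry < by + bh:
            hits.append(i)
    return hits

def block_type(hits):
    """Label a block from its overlapped regions, in the spirit of 4.3.1."""
    if set(hits) >= {1, 2, 3}:
        return "top"
    if set(hits) >= {4, 5, 6}:
        return "main"
    return "other"

# Toy layout: regions 1-3 span the top strip, regions 4-6 the area below it.
regions = [(0, 0, 100, 50), (100, 0, 100, 50), (200, 0, 100, 50),
           (0, 50, 100, 150), (100, 50, 100, 150), (200, 50, 100, 150)]
```

A wide block near the top of the page then overlaps regions 1-3 and is labeled the top block, mirroring the Fig. 6 example.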

Fig. 6. The blocks of the document in Fig. 2.

Fig. 7. Seven regions of a page.

In learning, the input is sequences of units and the corresponding sequences of labels. We train a learning model such as SVM or CRF. In extraction, the input is the sequence of units obtained from one document. We employ the trained model to decide, with a confidence score, whether each unit is part of a title. In the post-processing of extraction, we extract titles using the trained models.

4.3.2 Features

In addition to the rich format features described in Section 4.2.2, we also utilize the following vision based features:

1. Page layout information
   a) Page layout class: one of the seven classes (shown in Fig. 1);
   b) Height and width of the page;
   c) Position of the top left corner of the page.
2. Block information
   a) Block type: top, main, side, or other;
   b) Height and width of the block;

   c) Position of the top left corner of the block.
3. Unit position information
   a) Position of the unit from the top of the page;
   b) Position of the unit from the left side of the page;
   c) Position of the unit from the top of the block;
   d) Position of the unit from the left side of the block;
   e) Height and width of the unit.

Note that when we generate features from the position information, we only use relative distances. For example, if the distance between the unit and the top of the page is smaller than half of the height of the page, we set the corresponding feature to 1, and otherwise to 0.

5. Document retrieval method

We propose a linear combination method for using extracted titles in document retrieval. Our method takes BM25 (Okapi) as the basic function and is unique in its way of normalizing the BM25 scores. Given an HTML document, we extract information from it and store the result in several fields: body, title, and extracted title. The extracted title field contains the title extracted by one of our methods. We also create an additional field in which we combine the extracted title field and the title field; we denote it 'CombTitle'. We consider four methods for document retrieval with different uses of the fields.

BasicField
In this method, a document is represented by all the text in the title and body. Given a query, we employ BM25 to calculate the score of each document with respect to the query:

    S = sum_{i in q} [(k1 + 1) * tf_i / (k1 * ((1 - b) + b * dl / avdl) + tf_i)] * log((N - df_i + 0.5) / (df_i + 0.5))    (1)

Here, i denotes a word in the query q; tf_i and df_i are the term frequency and document frequency of i respectively; dl is the document length, and avdl is the average document length; k1 and b are parameters. We set k1 = 1.1, b = 0.7.

BasicField+CombTitle
We calculate the BM25 score of the combined field CombTitle (i.e., we view it as a document), with k1 = 0.4, b = 0.95. We also calculate the BM25 score of BasicField as before. We next normalize both the BM25 score of the combined field and that of the baseline method:

    S' = { sum_{i in q} [(k1 + 1) * tf_i / (k1 * ((1 - b) + b * dl / avdl) + tf_i)] * log((N - df_i + 0.5) / (df_i + 0.5)) }
         / { sum_{i in q} (k1 + 1) * log((N - df_i + 0.5) / (df_i + 0.5)) }    (2)
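Equations (1) and (2) can be sketched in code as follows. This is a minimal sketch: the function names are ours, term statistics are passed in as plain dictionaries, and the parameter defaults follow the values given in the text.

```python
import math

def bm25(query, tf, df, N, dl, avdl, k1=1.1, b=0.7):
    """BM25 score of one document for a query, as in eq. (1)."""
    score = 0.0
    for term in query:
        if term not in tf or term not in df:
            continue
        idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5))
        denom = k1 * ((1 - b) + b * dl / avdl) + tf[term]
        score += (k1 + 1) * tf[term] / denom * idf
    return score

def bm25_normalized(query, tf, df, N, dl, avdl, k1=0.4, b=0.95):
    """Eq. (2): the BM25 score divided by its maximum attainable value,
    reached as the tf saturation term approaches (k1 + 1)."""
    upper = sum((k1 + 1) * math.log((N - df[t] + 0.5) / (df[t] + 0.5))
                for t in query if t in df)
    if upper == 0:
        return 0.0
    return bm25(query, tf, df, N, dl, avdl, k1, b) / upper
```

The normalization maps scores into a common range so that scores from fields of very different lengths (body vs. title) can be combined meaningfully.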

We next linearly combine the two normalized scores, S'_BasicField and S'_CombTitle:

    S = alpha * S'_BasicField + (1 - alpha) * S'_CombTitle    (3)

Here alpha is a coefficient ranging from 0 to 1.

BasicField+ExtTitle
We employ a method similar to BasicField+CombTitle, but use the extracted title field instead of the combined title field.
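Equation (3) amounts to a one-line interpolation; a sketch (the function name is ours, and in Section 6.5 alpha is tuned per task):

```python
def combined_score(s_basic, s_comb, alpha):
    """Eq. (3): interpolate the two normalized scores; alpha is in [0, 1].
    alpha = 1 recovers the BasicField baseline."""
    assert 0.0 <= alpha <= 1.0
    return alpha * s_basic + (1.0 - alpha) * s_comb
```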

BasicField+Title
This is a method similar to BasicField+CombTitle, but we use the title field instead of the combined title field.

6. Experimental results

6.1 Data sets

As the data set, we used the .GOV data of the TREC Web Track; we call it TREC hereafter. There are 1 053 111 web pages in TREC. We randomly selected 4 258 HTML documents from TREC and manually annotated their titles, based on the specification in Section 3. Three annotators took part: two of them annotated the documents separately, and the third adjudicated when the first two did not agree. There were 3 332 HTML documents with annotated titles (recall that an HTML document can have no title).

6.2 Evaluation measures for extraction

We used precision, recall, and F1-score in the evaluation of title extraction results. In the evaluation, if an extracted title approximately matches the annotated title, we view it as a correct extraction. We define the approximate match between two titles t1 and t2 as

    d(t1, t2) <= 0.3 * max(l1, l2)

where d(t1, t2) is the edit distance between t1 and t2, and l1 and l2 are the lengths of t1 and t2 respectively.

6.3 Evaluation of title fields

We examined how many title fields in the HTML documents are correct, and found that 33.5% of the HTML documents in the TREC data set have bogus titles. There are three cases:

1. Empty title field. There are 60 524 pages (5.8%) which have nothing in the title fields, i.e., nothing between <title> and </title>.
2. 'Untitled' title field. There are 4 964 pages (0.8%) which have 'untitled' or 'untitled document' in their title fields.
3. Duplicated title field. 282 826 pages (26.9%) fall into this type. Many web sites contain web pages sharing the same title field but having different contents. In our investigation, a title field is considered duplicated if it is repeated more than N times within a web site. N is determined heuristically on the basis of M, the total number of pages in the web site, by five rules: if M > 1000 then N = M / 40; if 600 < M <= 1000 then N = 20; if 300 < M <= 600 then N = 15; if 60 < M <= 300 then N = 10; if M <= 60 then N = 5.
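The approximate-match criterion and the duplicate-title threshold above can be sketched directly (function names are ours; the edit distance is the standard Levenshtein dynamic program):

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def approx_match(t1, t2):
    """Titles match if the edit distance is at most 30% of the longer one."""
    return edit_distance(t1, t2) <= 0.3 * max(len(t1), len(t2))

def dup_threshold(m):
    """N as a function of site size M (Section 6.3, rule 3)."""
    if m > 1000:
        return m / 40
    if m > 600:
        return 20
    if m > 300:
        return 15
    if m > 60:
        return 10
    return 5
```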

6.4 Title extraction experiment

We conducted title extraction experiments on the TREC data set. We used two methods as baselines: one always extracts the units in the largest font size, and the other always extracts the first unit. We also evaluated the titles in the title fields, denoted 'title-field'. For our title extraction methods, we considered several options. As models we employed both SVM and CRF. As document processing methods, we used both the DOM Tree based method (denoted DOM Tree) and the vision based method (denoted Vision). We further combined DOM Tree and Vision by combining all the features

from the two models (note that the units in the two models can be mapped one to one). For SVM, we used several kernel functions: linear, polynomial, and rbf. For CRF, we used features from different contexts: surrounding units (positions -1 and +1), previous units (position -1), and the current units only (position 0). In the experiments, we conducted 5-fold cross validation, and thus all the results are averaged over 5 trials. Tables 1, 2, and 3 show the results. The results indicate that our methods significantly outperform the baseline methods and title-field. It seems that the use of only one type of information is not enough for extraction; our learning based methods can make effective use of various types of information. For comparison, we also add the results on the same data set of our previous method based on the Perceptron model (Li, Zaragoza, Herbrich, Shawe-Taylor, & Kandola, 2002), denoted Hu et al. (Hu, Xin, Song, Hu, Shi, Cao, & Li, 2005). The methods using the combined features proposed in this paper also significantly outperform our previous method.

Table 1 Performances of baseline methods for title extraction on TREC

Approach                  Precision        Recall           F1-Score
Largest font (Baseline)   0.528            0.643            0.580
First unit                0.327 (-38.1%)   0.402 (-37.5%)   0.360 (-37.8%)
Title-field               0.270 (-48.8%)   0.324 (-49.6%)   0.295 (-49.1%)
Hu et al. (2005)          0.698 (+32.2%)   0.703 (+9.3%)    0.701 (+20.9%)

Table 2 Performances of SVM on TREC

Model type              Feature     Precision        Recall           F1-measure
SVM linear              DOM Tree    0.708 (+34.1%)   0.714 (+11.0%)   0.711 (+22.6%)
                        Vision      0.716 (+35.6%)   0.724 (+12.6%)   0.720 (+24.1%)
                        Combine     0.751 (+42.2%)   0.757 (+17.7%)   0.754 (+30.0%)
SVM polynomial kernel   DOM Tree    0.755 (+43.0%)   0.767 (+19.3%)   0.761 (+31.2%)
                        Vision      0.747 (+41.5%)   0.756 (+17.6%)   0.751 (+29.5%)
                        Combine     0.782 (+48.1%)   0.797 (+30.0%)   0.789 (+36.0%)
SVM rbf kernel          DOM Tree    0.767 (+45.3%)   0.774 (+20.4%)   0.770 (+32.8%)
                        Vision      0.752 (+42.4%)   0.763 (+18.7%)   0.757 (+30.5%)
                        Combine     0.789 (+49.4%)   0.803 (+24.9%)   0.796 (+37.2%)

Table 3 Performances of CRF on TREC

Model type              Feature     Precision        Recall           F1-measure
CRF position 0          DOM Tree    0.772 (+46.2%)   0.728 (+13.2%)   0.749 (+29.1%)
                        Vision      0.765 (+44.9%)   0.736 (+14.5%)   0.750 (+29.3%)
                        Combine     0.802 (+51.9%)   0.775 (+20.5%)   0.789 (+36.0%)
CRF position -1         DOM Tree    0.783 (+48.3%)   0.756 (+17.6%)   0.769 (+32.6%)
                        Vision      0.783 (+48.3%)   0.743 (+15.6%)   0.763 (+31.6%)
                        Combine     0.810 (+53.4%)   0.785 (+22.1%)   0.797 (+37.4%)
CRF position -1 and 1   DOM Tree    0.788 (+49.2%)   0.753 (+17.1%)   0.771 (+32.9%)
                        Vision      0.783 (+48.3%)   0.746 (+16.0%)   0.764 (+32.7%)
                        Combine     0.805 (+52.5%)   0.777 (+20.8%)   0.790 (+36.2%)

From the results above, we see that Vision performs almost as well as DOM Tree, and Combine is always better than both DOM Tree and Vision. It seems that Vision and DOM Tree conduct title extraction from different perspectives, so their combination (Combine) can further enhance the accuracy.

We also see that CRF is a better model than SVM for the task. This is probably because title extraction is a tagging problem, not a classification problem. Not surprisingly, SVMs with nonlinear kernels (polynomial and rbf) perform better than SVM with a linear kernel. Furthermore, CRF using features from the surrounding units (positions -1 and +1) performs slightly better than CRF using only features from the previous units (position -1) (except in the Combine method), and much better than CRF using only features from the current unit (position 0). We further investigated the relation between the titles in title fields and the titles extracted from the bodies. Table 4 shows the results. We see that our method can still achieve relatively high performance when the title-fields are incorrect; in this case, the extracted titles are particularly useful. We also see that when the title-fields are correct, we obtain better performance, but still cannot conduct the extraction completely correctly. The results indicate that title extraction is a challenging problem.

Table 4 Accuracies of title extraction with respect to different types of title-fields

Data set   Title-field is incorrect   Title-field is correct
TREC       0.672                      0.737

6.5 Web retrieval experiment

We conducted web page retrieval experiments on the TREC data using extracted titles. As the title extraction method, we used the best performing method, CRF+Combine. In the experiments, we used the queries and relevance judgments of the Web Track in TREC-2002, TREC-2003, and TREC-2004. The queries are classified into three types: named-page finding (NP), homepage finding (HP), and topic distillation (TD) (see Craswell & Hawking, 2003). The number of queries of each type for each year is listed in Table 5. The topic distillation queries of TREC-2002 were not used because their specification differs from those of TREC-2003 and TREC-2004.

Table 5 Distribution of queries

Year   Task           Number of queries
2002   NP             150
2002   TD             50
2003   NP + HP        150 + 150
2004   TD + NP + HP   75 + 75 + 75

We first performed the experiment using the queries of TREC-2003. We applied the three methods BasicField+Title, BasicField+CombTitle, and BasicField+ExtTitle to the TREC data and evaluated the results in terms of Mean Average Precision (MAP). Figs. 8, 9, and 10 show how the performances of the three methods change as the coefficient alpha changes, in the three tasks NP, HP, and TD. The baseline method is BasicField, whose performance is obtained when alpha equals 1. The results indicate that the best BasicField+CombTitle outperforms the baseline method in all three tasks. This is also true for BasicField+Title. However, BasicField+ExtTitle beats the baseline only for NP and TD, not for HP. Furthermore, BasicField+CombTitle is always better than BasicField+Title. The results indicate that the extracted titles are useful for web page retrieval, especially for NP, and that it is best to employ BasicField+CombTitle.

[Figure: Mean Average Precision (MAP) vs. Alpha * 100, for BasicField+Title, BasicField+ExtTitle, and BasicField+CombTitle]

Fig. 8. Web retrieval results with TREC-2003 NP.

[Figure: Mean Average Precision (MAP) vs. Alpha * 100, for BasicField+Title, BasicField+ExtTitle, and BasicField+CombTitle]

Fig. 9. Web retrieval results with TREC-2003 HP.

[Figure: Mean Average Precision (MAP) vs. Alpha * 100, for BasicField+Title, BasicField+ExtTitle, and BasicField+CombTitle]

Fig. 10. Web retrieval results with TREC-2003 TD.

We also note that when alpha equals 0, all three methods reduce to the same method without BasicField. For all three tasks, the performance of CombTitle alone is close to or better than that of the baseline. This implies that CombTitle, the combination of the title field and the extracted title, can serve as a good summarization of an HTML document.

Table 6 shows the best result for each method in each task. Results marked ">>" are significantly better than the BasicField baseline according to a t-test. BasicField+CombTitle performs better than BasicField and BasicField+Title for all three tasks. For HP, the improvements of both BasicField+CombTitle and BasicField+Title over BasicField are statistically significant; for NP, the improvement of BasicField+CombTitle is statistically significant.

Table 6
Best retrieval results (mean average precision) on TREC-2003

          BasicField   +Title                 +CombTitle
2003.TD   0.0955       0.1273 (+33.3%)        0.1401 (+46.7%)
2003.HP   0.3016       0.3974 (+31.8%) (>>)   0.4292 (+42.3%) (>>)
2003.NP   0.5280       0.6036 (+14.3%)        0.6605 (+25.1%) (>>)
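Mean Average Precision, the measure reported in Tables 6 and 7, can be computed as follows. This is a standard definition, sketched with toy ranked lists and relevance judgments rather than the TREC runs.

```python
# Minimal sketch of Mean Average Precision (MAP): for each query, average
# the precision@k values at the ranks of the relevant documents, then
# average over queries. Toy data, not the TREC runs.

def average_precision(ranked, relevant):
    """AP for one query: mean of precision@k over the relevant documents."""
    hits, precisions = 0, []
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_docs, relevant_docs) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

runs = [
    (["d1", "d2", "d3"], {"d1"}),   # relevant doc ranked first: AP = 1
    (["d4", "d5", "d6"], {"d6"}),   # relevant doc ranked third: AP = 1/3
]
print(mean_average_precision(runs))  # (1 + 1/3) / 2
```

For named-page and homepage finding there is typically a single relevant answer per query, so AP coincides with the reciprocal rank of that answer.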

We next conducted web retrieval experiments on the data of TREC-2002 and TREC-2004, using the optimal values of alpha obtained from TREC-2003 (when BasicField+CombTitle performs best, alpha is 0.62 for NP, 0.46 for HP, and 0.68 for TD). Table 7 shows the results. Again, BasicField+CombTitle outperforms both BasicField and BasicField+Title for most of the tasks. For NP, BasicField+CombTitle significantly outperforms BasicField. For HP, both BasicField+CombTitle and BasicField+Title perform significantly better than BasicField.

Table 7
Retrieval results (mean average precision) on TREC-2002 and TREC-2004

          BasicField   +Title                 +CombTitle
2002.NP   0.5932       0.5880 (-0.9%)         0.6373 (+8.4%) (>>)
2004.HP   0.2720       0.4590 (+68.8%) (>>)   0.4032 (+48.2%) (>>)
2004.NP   0.4877       0.5614 (+15.1%)        0.6355 (+30.3%) (>>)
2004.TD   0.0980       0.1021 (+4.2%)         0.1101 (+12.3%)
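The ">>" markers come from a t-test against the baseline. The exact test variant is not specified here, so the sketch below assumes a paired t-test over per-query AP scores, which is the usual choice in TREC-style evaluation; the score lists are toy data.

```python
# Hedged sketch of a paired t-test over per-query AP scores (assumed
# variant; the paper only says "t-test"). Compare the resulting t
# statistic against the critical value for n-1 degrees of freedom.
import math

def paired_t_statistic(scores_a, scores_b):
    """t statistic of the paired differences scores_a - scores_b."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# Toy per-query AP scores for a baseline run and an improved run.
baseline = [0.50, 0.30, 0.60, 0.20, 0.40]
improved = [0.65, 0.45, 0.70, 0.35, 0.55]
t = paired_t_statistic(improved, baseline)
print(t)  # compare to the two-sided critical value 2.776 (df = 4, p = 0.05)
```

Pairing by query matters: it removes per-query difficulty variance, so consistent small gains can still reach significance.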

We conclude that extracted titles are useful for web page retrieval.

7. Conclusion

In this paper, we have investigated the problem of automatically extracting titles from the bodies of HTML documents, and how the extracted titles can help improve web page retrieval. We have proposed a specification of HTML titles, and a machine learning approach to the extraction problem using either SVM or CRF. In the approach, we employ two methods for processing an HTML document, DOM Tree based and vision based, and we use formatting information, linguistic information, etc. as features in the machine learning models. Our experimental findings are as follows. (1) Our methods work significantly better than the baseline method for title extraction. (2) For title extraction, the CRF model is better than SVM; the DOM Tree based method and the vision based method work equally well, and it is better to combine them. (3) Using extracted titles can indeed improve web page retrieval, particularly named page finding in TREC.

Acknowledgement

We thank Dmitriy Meyerzon, Ming Zhou, and Wei-Ying Ma for their encouragement and support. We thank Hugo Zaragoza, Nick Craswell, and the anonymous reviewers for their comments on this paper.

References

Amitay, E., Carmel, D., Darlow, A., Lempel, R., & Soffer, A. (2002). Topic Distillation with Knowledge Agents. In Proceedings of the Eleventh Text REtrieval Conference.
Belaïd, A. (2001). Recognition of Table of Contents for Electronic Library Consulting. International Journal on Document Analysis and Recognition, 4(1), 35-45.
Breuel, T. M. (2003). Information Extraction from HTML Documents by Structural Matching. In Proceedings of the Second International Workshop on Web Document Analysis.
Cai, D., Yu, S., Wen, J.-R., & Ma, W.-Y. (2003). VIPS: a Vision-based Page Segmentation Algorithm. Microsoft Technical Report (MSR-TR-2003-79).
Chang, C. H., & Lui, S. C. (2001). IEPAD: Information Extraction Based on Pattern Discovery. In Proceedings of the Tenth International Conference on World Wide Web (pp. 681-688).
Chaudhuri, B. B., & Garain, U. (1999). Extraction of Type Style-based Meta-information from Imaged Documents. In Proceedings of the Fifth International Conference on Document Analysis and Recognition (pp. 138-149).
Chidlovskii, B., Ragetli, J., & de Rijke, M. (2000). Wrapper Generation via Grammar Induction. In Proceedings of the Eleventh European Conference on Machine Learning (pp. 96-108).
Collins-Thompson, K., Ogilvie, P., Zhang, Y., & Callan, J. (2002). Information Filtering, Novelty Detection, and Named-Page Finding. In Proceedings of the Eleventh Text REtrieval Conference.
Craswell, N., & Hawking, D. (2003). Overview of the TREC 2003 Web Track. In Proceedings of the Twelfth Text REtrieval Conference (pp. 78-93).
Craven, T. C. (2003). HTML Tags as Extraction Cues for Web Page Description Construction. Informing Science Journal, 6, 1-12.
Crescenzi, V., Mecca, G., & Merialdo, P. (2001). RoadRunner: Towards Automatic Data Extraction from Large Web Sites. In Proceedings of the Twenty-seventh International Conference on Very Large Data Bases (pp. 109-118).
Crescenzi, V., Mecca, G., & Merialdo, P. (2002). Wrapping-Oriented Classification of Web Pages. In Proceedings of the 2002 ACM Symposium on Applied Computing (pp. 1108-1112).
Cutler, M., Shih, T., & Meng, Y. (1997). Using the Structure of HTML Documents to Improve Retrieval. In Proceedings of the USENIX Symposium on Internet Technologies and Systems (pp. 241-251).
Eikvil, L. (1999). Information Extraction from World Wide Web - A Survey. Technical Report 945, Norwegian Computing Center, Oslo, Norway.
Evans, D. K., Klavans, J. L., & McKeown, K. R. (2004). Columbia Newsblaster: Multilingual News Summarization on the Web. In Proceedings of the Human Language Technology Conference / North American Chapter of the Association for Computational Linguistics Annual Meeting (pp. 1-4).
Freitag, D. (2000). Machine Learning for Information Extraction in Informal Domains. Machine Learning, 39(2/3), 169-202.
Freitag, D., & McCallum, A. (1999). Information Extraction with HMMs and Shrinkage. In Proceedings of the AAAI'99 Workshop on Machine Learning for Information Extraction (pp. 31-36). AAAI Technical Report WS-99-11.
Giuffrida, G., Shek, E. C., & Yang, J. (2000). Knowledge-based Metadata Extraction from PostScript Files. In Proceedings of the Fifth ACM Conference on Digital Libraries (pp. 77-84).
Han, H., Giles, C. L., Manavoglu, E., Zha, H., Zhang, Z., & Fox, E. A. (2003). Automatic Document Metadata Extraction Using Support Vector Machines. In Proceedings of the Third ACM/IEEE-CS Joint Conference on Digital Libraries (pp. 37-48).
Hurst, M. (2002). Classifying TABLE Elements in HTML. WhizBang! Labs. http://www2002.org/CDROM/poster/115/.
Hu, Y., Xin, G., Song, R., Hu, G., Shi, S., Cao, Y., & Li, H. (2005). Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval. In Proceedings of the Twenty-eighth Annual International ACM SIGIR Conference (pp. 250-257).
Joachims, T. (2001). A Statistical Learning Model of Text Classification with Support Vector Machines. In Proceedings of the Twenty-fourth ACM International Conference on Research and Development in Information Retrieval (pp. 128-136).
Kosala, R., Bruynooghe, M., Bussche, J. V., & Blockeel, H. (2003). Information Extraction from Web Documents Based on Local Unranked Tree Automaton Inference. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence (pp. 403-408).
Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the Eighteenth International Conference on Machine Learning (pp. 282-289).
Li, Y., Zaragoza, H., Herbrich, R., Shawe-Taylor, J., & Kandola, J. S. (2002). The Perceptron Algorithm with Uneven Margins. In Proceedings of the Nineteenth International Conference on Machine Learning (pp. 379-386).
Liu, B., Grossman, R., & Zhai, Y. (2003). Mining Data Records in Web Pages. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 601-606).
Muslea, I., Minton, S., & Knoblock, C. (1999). A Hierarchical Approach to Wrapper Induction. In Proceedings of the Third International Conference on Autonomous Agents (pp. 190-197).
Ogilvie, P., & Callan, J. (2003a). Combining Structural Information and the Use of Priors in Mixed Named-Page and Homepage Finding. In Proceedings of the Twelfth Text REtrieval Conference (pp. 177-184).
Ogilvie, P., & Callan, J. (2003b). Combining Document Representations for Known-Item Search. In Proceedings of the Twenty-sixth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 143-150).
Pinto, D., McCallum, A., Wei, X., & Croft, W. B. (2003). Table Extraction Using Conditional Random Fields. In Proceedings of the Twenty-sixth Annual International ACM SIGIR Conference (pp. 235-242).
Reis, D., Golgher, P., Silva, A., & Laender, A. (2004). Automatic Web News Extraction Using Tree Edit Distance. In Proceedings of the Thirteenth International World Wide Web Conference (pp. 502-511).
Robertson, S., Zaragoza, H., & Taylor, M. (2004). Simple BM25 Extension to Multiple Weighted Fields. In Proceedings of the Thirteenth ACM Conference on Information and Knowledge Management (pp. 42-49).
Song, R., Liu, H., Wen, J.-R., & Ma, W.-Y. (2004). Learning Block Importance Models for Web Pages. In Proceedings of the Thirteenth International World Wide Web Conference (pp. 203-211).
Song, R., Wen, J.-R., Shi, S., Xin, G., Liu, T.-Y., Qin, T., Zheng, X., Zhang, J., Xue, G., & Ma, W.-Y. (2004). Microsoft Research Asia at Web Track and Terabyte Track of TREC 2004. In Proceedings of the Thirteenth Text REtrieval Conference.
Wang, X. (1996). Tabular Abstraction, Editing, and Formatting. PhD Thesis, University of Waterloo, Ontario, Canada.
Yau, H. S., & Hawker, J. S. (2004). SA_MetaMatch: Relevant Document Discovery Through Document Metadata and Indexing. In Proceedings of the ACM Southeast Regional Conference (pp. 385-390).
Zhang, M., Song, R., Lin, C., Ma, L., Jiang, Z., Jin, Y., Liu, Y., Zhao, L., & Ma, S. (2002). THU at TREC 2002: Novelty, Web, and Filtering. In Proceedings of the Eleventh Text REtrieval Conference.
Zhang, M., Song, R., & Ma, S. (2003). DF or IDF? On the Use of HTML Primary Feature Fields for Web IR. In Proceedings of the Twelfth International World Wide Web Conference, poster.
W3C DOM Technical Committee. (2003). Document Object Model Technical Reports. http://www.w3.org/DOM/DOMTR.
MSHTML Reference. http://msdn.microsoft.com/library/default.asp?url=/workshop/browser/mshtml/reference/reference.asp