Expert Systems with Applications 37 (2010) 8492–8498
Tag tree template for Web information and schema extraction
Xiangwen Ji, Jianping Zeng *, Shiyong Zhang, Chengrong Wu
School of Computer Science, Fudan University, Shanghai 200433, China

Keywords: Tag tree template; Web information extraction; Schema extraction; Tree similarity

Abstract
Extracting information from the Web is both interesting and challenging, and it can support Web Searching, Information Retrieval and Web Mining. Web pages on many sites are produced dynamically as structured records by filling an HTML template from a background database. To efficiently extract meaningful information, including records and data schema, from this kind of page, a new method based on a Tag tree template is proposed. Web pages from different Web sites are parsed into Tag trees, and the templates of each site are then generated from the trees by using a cost-based tree similarity measurement. The exclusive content of each page is extracted by using the templates to parse the page. Finally, the records in the pages and the schema of the records are extracted from the exclusive content by finding repeating patterns and applying some heuristic rules. Extraction experiments on 360 pages from 12 Web sites show that the proposed method is an effective way to extract meaningful information.
© 2010 Elsevier Ltd. All rights reserved.
1. Introduction
Information extraction is the process of extracting relevant information from massive data, such as text, databases, and semi-structured and multimedia documents. An information extraction system performs this task by using techniques from areas such as machine learning, data mining, pattern mining, and grammar induction. With the rapid growth of the Internet, the World Wide Web is becoming the largest information source ever available, which has prompted extensive research on Web information extraction. Because the HTML of Web pages is a kind of semi-structured data, extracting information from the Web is both interesting and challenging, and it can serve as a base process for Web Searching, Information Retrieval and Web Mining.
Many Web sites use enterprise frameworks, such as J2EE, and Active Page technologies to fill HTML templates with information from a database and show it on Web pages. Often the information shown on a page is structured in sections of the HTML template that map to records in a database table of the Web site. In this case, automatically extracting the database records while keeping the structure information of the Web pages is very useful.
We employ an HTML Tag tree to describe a Web page, in which both HTML tags and text are treated as tree nodes. Algorithms are devised to compare Tag trees for tree similarity calculation. At the same time, a tree template whose nodes occur in all the Tag trees is extracted, and the pages are classified into different groups according to their site template. By using the tree template and finding patterns in the exclusive fields outside the template, the record boundaries and the data schema of the structured Web information can be estimated. The records in the pages are then extracted with the schema.
The main contributions of the paper are as follows.
(1) A similarity function based on the costs of constructing a tree is proposed to measure the similarity between the templates of Web pages. The similarity function is multiplicative and transitive, which helps in template discovery.
(2) An approach that simultaneously extracts text content and schema attributes in tags is proposed. The approach is based on monitoring the variable fields in the Tag tree. Hence, the extracted text information is meaningful and carries rich structure information.
(3) An algorithm to find the templates of Web pages from different sites is proposed. By monitoring the MP (Minimal Template Portion of original tree) value of pages, new templates can be generated automatically.
The paper is organized as follows. Section 2 introduces related work. Section 3 describes the proposed Tag tree template and tree similarity. Section 4 presents the template extraction and Web information extraction methods. Section 5 provides the experiments and the analysis of the results. Finally, the conclusion and future work are given in the last section.

* Corresponding author. Tel.: +86 13564317273. E-mail addresses: [email protected] (X. Ji), [email protected] (J. Zeng), [email protected] (S. Zhang), [email protected] (C. Wu).
0957-4174/$ - see front matter © 2010 Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2010.05.027
2. Related works
Most of the information in Web pages is in the form of unstructured text, which makes it hard to query. The earliest extraction technologies rely on humans to encode the template in a program called a wrapper, which parses the HTML of Web sites. Hammer et al. declare template rules in a "wrapper generator" to generate wrappers (Hammer, Garcia-Molina, Cho, Crespo, & Aranha, 1997). Human-generated wrappers are also declared in STALKER (Muslea, Minton, & Knoblock, 1999) and XWRAP (Liu, Pu, & Han, 2000). Wrapper generation needs inductive learning from pages, whereas Zhai and Liu propose an instance-based learning method that extracts from only a single labeled instance, although it needs user activity to define the instance (Zhai & Liu, 2007). Embley et al. use heuristic rules (Embley, Jiang, & Ng, 1999), which are also used in our research, to discover record boundaries in Web documents. Presentation regularities and domain knowledge are used to extract Web information by Vadrevu et al. (Vadrevu, Gelgi, & Davulcu, 2007). Takama and Mitsuhashi (2005) analyse layout to calculate the visual similarity of Web pages for retrieval. By using a visual and interactive user interface, Baumgartner et al. semi-automatically generate wrappers that translate relevant pieces of HTML pages into XML (Baumgartner et al., 2001).
There is also research on automatic extraction. ROADRUNNER (Crescenzi, Mecca, & Merialdo, 2001) and IEPAD (Chang & Lui, 2001) are tools that use tags to generate HTML templates. IEPAD uses repeating patterns of closely occurring HTML tags to identify and extract the records of a Web document. Like our research, EXALG first discovers the unknown template that generated the pages and then uses the template to extract data from the pages (Arasu & Garcia-Molina, 2003). The difference between EXALG and our work is that EXALG uses equivalence classes and differentiating roles to find the tokens that generate the template, whereas we parse HTML into a Tag tree to extract the template. The template in our approach is like the trunk of the Tag tree, on which the fields of the records look like leaves.
By comparison with the templates, Web pages of different sites are clustered into different groups, because different sites have different templates. Álvarez et al. use clustering and edit distance techniques for Web data extraction (Álvarez, Pan, Raposo, Bellas, & Cacheda, 2007), but their approach is suited to pages that contain many records. Furthermore, it does not extract the common template of the pages and cannot extract record information from Tag attributes. After retrieving the Web document content outside the template, we use repeating patterns and heuristic rules to find the schema in the exclusive fields of the template, but we do not explore semantic relations. Downey et al. use a simple pattern learning algorithm (PL) to obtain patterns for extracting information from unstructured text on the Web (Downey, Etzioni, Soderland, & Weld, 2004). Chang et al. propose to extract Web information by discovering repeated patterns and aligning multiple patterns with a PAT tree data structure (Chang, Hsu, & Lui, 2003; Chang, Lui, & Wu, 2001). For semantic information extraction, Alani et al. use a predefined ontology, which represents the type and form of the knowledge to be extracted, and generate biographies from document content (Alani et al., 2003).
Structural similarity needs to be considered in our research for template generation. Buttler surveys document structure similarity algorithms (Buttler, 2004), including a simple weighted tag similarity algorithm, Fourier transforms (Flesca, Manco, Masciari, Pontieri, & Pugliese, 2002, 2007) and Tree Edit Distance algorithms (Bille, 2005). Kim et al. assign different values to different HTML tree nodes according to their weights for displaying the corresponding data objects in the browser when using TED extraction (Kim, Park, & Kim, 2007). In our research, we devise a simple similarity approach for tree template generation, namely the construction cost similarity of Tag trees.
3. Tag tree and its template
The Document Object Model (DOM) (http://www.w3.org/DOM) is a platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure and style of documents. DOM is often applied to XML, which is strictly structured, whereas HTML is semi-structured.

3.1. Tag tree
DOM parses an XML document into hierarchy trees, while the HTML of a Web page is parsed into a Tag tree in this research. The Tag tree can be thought of as the base structure for implementing DOM. Here attributes are also regarded as nodes, and a remark node includes both remark tags and script sections. The child nodes and attributes of a tag node can also be sorted and indexed. The Tag tree is constructed according to the following rules.
1. There are three types of node in the Tag tree: text node, remark node and tag node.
2. Tags and text between a pair of tags are all children nodes of the tag node.
3. All the content between a pair of tags is a text node as the only child of the node.
4. A pair of tags (an opening tag with its closing tag) is treated as a node whose closing flag is true; a tag without a closing tag is a node whose closing flag is false.
5. A tag ended with "/>" is a node whose self-closing flag is true, as in XHTML.
6. All attributes of a node are parsed in order and inserted into the attribute map.
Based on these rules, an HTML Web page can be parsed into a Tag tree. In the tree, a text node is never the parent of another node, whereas a tag node may have child nodes and an attribute set. Fig. 1 is an example of an HTML document and the corresponding Tag tree. If two nodes have the same parent, they are sibling nodes of each other. Because the sequence of tags in HTML matters for how the page is shown in a Web browser, we insert the sub nodes of a node into a vector in order.

Definition 1 (HTML tree). An HTML tree is a tree T whose root is R and whose elements are text nodes TX, tag nodes TN and remark nodes RN. Hence, it can be denoted as follows:
$$T_h = \{R, TX, TN, RN\} \qquad (1)$$

Definition 2 (HTML Tag tree). An HTML Tag tree is a sub tree of $T_h$. It can be defined as
$$T_t = \{R_t, TX_t, TN_t, RN_t\} \qquad (2)$$
where $R_t \subseteq R \cap TN \cap RN$, $TX_t \subseteq TX$, $TN_t \subseteq TN$ and $RN_t \subseteq RN$.

3.2. Template of Tag tree
For many Web sites with a database running in the background, Web pages are generated dynamically by filling content into an HTML template, and usually the inserted content forms the child nodes or the leaf nodes of the template tree. Based on this premise, we define the template $T_{tt}$ of an HTML Tag tree as follows:

Definition 3 (Template of Tag tree).
$$T_{tt} = \{R_{tt}, TX_{tt}, TN_{tt}, RN_{tt}\} \qquad (3)$$
where $TX_{tt} \subseteq TX_t$, $R_{tt} \subseteq R_t$, $TN_{tt} \subseteq TN_t$ and $RN_{tt} \subseteq RN_t$. The tree template $T_{tt}$ is a connected graph. A template looks like the trunk of a tree, and the content filled into the template forms the leaves of the tree. Fig. 2 is a template of the Tag tree shown in Fig. 1. The blank boxes in the figure are the exclusive parts of the template.
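The construction rules of Section 3.1 can be illustrated with a minimal sketch built on Python's standard `html.parser` module. The names `Node`, `TagTreeBuilder`, `parse_tag_tree` and `VOID_TAGS` are hypothetical and not part of the original method. The sketch keeps attributes in an ordered map on the node rather than as separate nodes, treats script content as a remark node, and uses an artificial `#root` node, all of which are simplifying assumptions.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import Dict, List, Optional

# Assumption: these elements never take a closing tag, so they are not kept open.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}

@dataclass
class Node:
    kind: str                                   # "tag", "text" or "remark" (rule 1)
    name: str                                   # tag name, or the text / remark content
    attrs: Dict[str, Optional[str]] = field(default_factory=dict)  # rule 6, in source order
    children: List["Node"] = field(default_factory=list)           # ordered sub nodes
    closing: bool = False                       # rule 4: a matching closing tag was seen
    self_closing: bool = False                  # rule 5: written as <tag ... />

class TagTreeBuilder(HTMLParser):
    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.root = Node("tag", "#root")        # artificial root holding the document
        self.stack = [self.root]

    def _append(self, node: Node) -> Node:
        self.stack[-1].children.append(node)
        return node

    def handle_starttag(self, tag, attrs):
        node = self._append(Node("tag", tag, dict(attrs)))
        if tag not in VOID_TAGS:
            self.stack.append(node)             # later tags and text become children (rule 2)

    def handle_startendtag(self, tag, attrs):
        self._append(Node("tag", tag, dict(attrs), self_closing=True))

    def handle_endtag(self, tag):
        # Close the nearest open node with this name; stray closing tags are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].name == tag:
                self.stack[i].closing = True
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            kind = "remark" if self.stack[-1].name == "script" else "text"
            self._append(Node(kind, data.strip()))

    def handle_comment(self, data):
        self._append(Node("remark", data.strip()))

def parse_tag_tree(html: str) -> Node:
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

Calling `parse_tag_tree` on a page returns the artificial root, whose `children` vector preserves the order of the tags in the source, as required for sibling comparison.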
Fig. 1. HTML and HTML Tag tree.
Fig. 2. Template of Tag tree.
The template of a Tag tree is extracted by analyzing the frequent elements of a set of Tag trees. To support the analysis process, a template class of Tag trees is defined as follows:

Definition 4 (Template class of Tag trees). A template class of Tag trees is a "reverse" binary tree whose leaf nodes are Tag trees and whose other nodes are templates of Tag trees. A template class of Tag trees with N templates can be denoted as follows:
$$TC = \{\{T_{tt_i},\ i = 1, 2, \ldots, N\},\ \{T_{t_j},\ j = 1, 2, \ldots, N + 1\}\} \qquad (4)$$
where the N templates of Tag trees are similar to each other, in the sense of the tree similarity proposed in the next section. A visual description of a template class of Tag trees is shown in Fig. 3. Tag trees A and B have a template C. Tag trees A, B and D have a template E. If E and C are similar to each other, they are in the same template class. The minimum template in the template class reflects the most suitable template for the Tag trees. For example, template E is the minimum template of TC = {{C, E}, {A, B, D}}, and template G is the minimum template of TC = {{C, E, G}, {A, B, D, F}}.
Fig. 3. An example of template class.
3.3. Tree similarity
There are many methods to calculate the structural similarity between trees, among which the tree edit distance (TED) is a simple and efficient algorithm (Buttler, 2004). We devise a new method, similar to TED, to compare the similarity of two trees for template extraction. First, the cost to construct each tree is calculated. The cost function accumulates the costs of all nodes of the tree, including the root node and its sub nodes. However, in template extraction a parent node is more important than its child nodes, so the costs of the sub nodes are weighted by a coefficient α less than 1. Hence, the cost function of a tree T with m nodes is defined in recursive form as follows:
$$C_T = 1 + \alpha \sum_{i=1}^{m} C_{T_i} \qquad (5)$$
where the $T_i$ are the sub trees over which the recursion runs, and the cost of a leaf node is defined as 1. The similarity between two trees is then defined based on the cost function, as in (6):
$$S(T_1, T_2) = \frac{C_{T_{tt}}\, C_{T_{tt}}}{C_{T_1}\, C_{T_2}} \qquad (6)$$
where $T_{tt}$ is the template of trees $T_1$ and $T_2$. The tree similarity is used for template extraction. Unlike TED, the cost-based similarity ensures that template extraction is a consistent hierarchical comparison. For example, trees A, B and D have the same template, as seen in Fig. 3. The TED between A and D may have little association with the TED between A and B or the TED between B and D, so processing the trees in a different sequence may give a different extraction result. With the proposed cost function, however, the results of different sequences are the same, according to the characteristics introduced in the next section.
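As an illustration of Eqs. (5) and (6), the following sketch computes the construction cost and the cost-based similarity for trees written as `(label, children)` tuples. The paper does not spell out how the common template of two trees is computed, so `common_template` below is a hypothetical stand-in that matches children greedily in document order. The tuple representation and the sample pages are assumptions made for the example only.

```python
from typing import Optional, Tuple

Tree = Tuple[str, list]  # (label, ordered list of child trees)

def cost(tree: Tree, alpha: float = 0.3) -> float:
    """Eq. (5): C_T = 1 + alpha * (sum of sub tree costs); a leaf costs 1."""
    _, children = tree
    return 1.0 + alpha * sum(cost(child, alpha) for child in children)

def common_template(t1: Tree, t2: Tree) -> Optional[Tree]:
    """ASSUMPTION: shared trunk found by greedy in-order matching of children.

    Returns None when the two roots carry different labels.
    """
    if t1[0] != t2[0]:
        return None
    shared, start = [], 0
    for child in t1[1]:
        for j in range(start, len(t2[1])):
            sub = common_template(child, t2[1][j])
            if sub is not None:
                shared.append(sub)
                start = j + 1       # keep sibling order, never match backwards
                break
    return (t1[0], shared)

def similarity(t1: Tree, t2: Tree, alpha: float = 0.3) -> float:
    """Eq. (6): S = C_Ttt * C_Ttt / (C_T1 * C_T2); 0 if no common root."""
    template = common_template(t1, t2)
    if template is None:
        return 0.0
    c = cost(template, alpha)
    return c * c / (cost(t1, alpha) * cost(t2, alpha))

# Two pages that share the trunk html > body > (div, p) but differ in their text.
page_a = ("html", [("body", [("div", [("text:Hello", [])]), ("p", [])])])
page_b = ("html", [("body", [("div", [("text:World", [])]), ("p", []), ("span", [])])])
print(similarity(page_a, page_b, alpha=0.5))
```

Because the template is contained in both pages, the similarity of the two pages factors into the product of each page's similarity to the template, which is the property the next section relies on.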
4. Template and record extraction
In the proposed approach, Web pages are first parsed into Tag trees, from which the tree templates are extracted. Then, by using the templates to parse the trees, the exclusive content of the Web pages is obtained. Furthermore, the schema and records of a Web page can be extracted from the exclusive content of the templates.

4.1. Templates extraction from sites
Since template extraction is based on the template class of Tag trees, we describe two important properties of the template class: transitivity in the template extraction process and the multiplicative property of similarities. We describe the properties for the template class in Fig. 3, since it is a typical template class.
1. Transitivity. For two trees $T_{t1}$ and $T_{t2}$, their template is $T = T_{t1} \cap T_{t2}$. Hence, we have $T_{ttC} = T_{tA} \cap T_{tB}$ and $T_{ttE} = T_{tD} \cap T_{ttC}$. Then, by substituting $T_{ttC}$ into $T_{ttE}$, $T_{ttE} = T_{tA} \cap T_{tB} \cap T_{tD}$. On the other hand, we have
$$S(T_{ttC}, T_{tA}) = \frac{C_{T_{ttC}}}{C_{T_{tA}}} \quad \text{and} \quad S(T_{ttE}, T_{ttC}) = \frac{C_{T_{ttE}}}{C_{T_{ttC}}}.$$
Then,
$$S(T_{ttE}, T_{tA}) = \frac{C_{T_{ttE}}}{C_{T_{tA}}} = S(T_{ttC}, T_{tA})\, S(T_{ttE}, T_{ttC}).$$
2. Multiplicative. For the similarity between a pair of Tag trees in the template class, according to the definition of the cost-based similarity, we have, for example,
$$S(T_{tB}, T_{tA}) = S(T_{tB}, T_{ttC})\, S(T_{ttC}, T_{tA})$$
and
$$S(T_{tD}, T_{tA}) = S(T_{tD}, T_{ttE})\, S(T_{ttE}, T_{ttC})\, S(T_{ttC}, T_{tA}).$$
Based on the two characteristics, four measurements are defined for tree template extraction: Minimal Template Portion (MP), Minimal Similarity (MS), Average Portion (AP) and Average Similarity (AS). The measurements describe the relations between templates and Tag trees during template extraction, which helps template clustering.
1. Minimal Template Portion of Tag tree (MP). MP reflects the weight of the template in a big Tag tree when the Tag tree contains a large amount of exclusive content. For each set of Tag trees in the template class, MP is computed as follows:
$$MP_A = 1, \quad MP_B = 1, \quad MP_D = 1$$
$$MP_C = \min(MP_A\, S(T_{ttC}, T_{tA}),\ MP_B\, S(T_{ttC}, T_{tB})) = \min(S(T_{ttC}, T_{tA}),\ S(T_{ttC}, T_{tB}))$$
$$MP_E = \min(MP_C\, S(T_{ttE}, T_{tA}),\ MP_D\, S(T_{ttE}, T_{tD}))$$
2. Minimal Similarity in template class (MS). MS is the minimal similarity of every two Tag trees in the template class. It shows whether the Web pages of a site diverge significantly.
$$MS_A = 1, \quad MS_B = 1, \quad MS_D = 1$$
$$MS_C = \min(MP_A\, MP_B\, S(T_{tB}, T_{tA}),\ MS_A,\ MS_B)$$
$$MS_E = \min(MP_C\, MP_D\, S(T_{ttC}, T_{tD}),\ MS_C,\ MS_D)$$
3. Average Template Portion in template class (AP).
$$AP = \frac{\sum_{i=1}^{n} MP_i}{n} \qquad (7)$$
The AP is the average of the MPs.
4. Average Similarity in template class (AS).
$$AS = \frac{\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} S(i, j)}{n(n-1)/2} \qquad (8)$$
The AS is the average of the similarities of each pair of pages. In both (7) and (8), n is the number of Tag trees in the template class.
By utilizing the properties of templates and similarity, we extract templates from the Web pages of different sites based on the template class structure. First, each page is parsed into a Tag tree, which is used to extract a new template with each template in the template classes. Then, by measuring the MP of each new template, the template with the maximum MP is chosen as the one to which the Tag tree is added. If the maximum MP is below a threshold θ, a new template class is created and the tree is added to it. In this way, Web pages are clustered into different classes while the template of each class is generated. Below is the algorithm that extracts Web templates from different sites, using the Minimal Portion as the criterion. After the process, we obtain the template classes, each of which holds a template and its corresponding Tag trees.

Algorithm (TEC): Templates Extraction based on template Class.
Input: A set of pages $PS = \{T_{h_1}, T_{h_2}, \ldots, T_{h_n}\}$ on all sites.
Output: A template class set $TCS = \{TC_1, TC_2, \ldots, TC_m\}$ for all sites.
Process:
1. Initialize the template class set TCS with each template class as NULL.
2. Select the first page $T_{h_1}$ and parse it into a Tag tree $T_{t_1}$.
3. Create a template class $TC_1 = \{T_{tt_1}, T_{t_1}\}$, in which $T_{tt_1} = T_{t_1}$ is also the template.
4. Parse each Web page $T_{h_i}$ ($i = 2, \ldots, n$) in PS into an HTML Tag tree $T_{t_i}$ and do steps (a)–(j).
(a) Take the minimum template $T_{tt}$ in each $TC_j$ to extract new Tag tree templates $T_{tt_j}$ ($j = 1, 2, \ldots, m$) for Tag tree $T_{t_i}$.
(b) Calculate $MP_j = S(T_{tt_j}, T_{t_i})$, $j = 1, 2, \ldots, m$, and then select $MP_k = \max(MP_j)$.
(c) IF $MP_k > \theta$ THEN
(d)   IF SteadyTemplate($TC_k$) = FALSE THEN
(e)     Add $T_{tt_k}$ to the template class $TC_k$ as a new template node.
(f)     Add $T_{t_i}$ to the new template node as a new leaf node.
(g)   END IF
(h) ELSE
(i)   Let m = m + 1, and then create a new template class $TC_m$, in which $T_{tt_1} = T_{t_i}$ is also the template.
(j) END IF

Function SteadyTemplate checks whether a template class is in a steady state by calculating the template similarity. For example, in Fig. 3, if G is the same as E, and E is the same as C, we say that the template is likely to be steady. In this case, the extraction task for this template can be stopped, as shown in algorithm TEC. The function is described as follows:

Function SteadyTemplate
Input: TC: a template class of Tag trees.
Output: R: whether the class is steady or not.
Process:
1. For each pair of neighbor templates $(T_{tt_i}, T_{tt_{i+1}})$ in TC, which has N templates of Tag trees, compute their template similarity as follows:
$$TS(T_{tt_i}, T_{tt_{i+1}}) = S(T_{tt_i}, T_{tt_{i+1}}), \quad i = 1, 2, \ldots, N - 1$$
2. If $\frac{\sum_{i=1}^{N-1} TS(T_{tt_i}, T_{tt_{i+1}})}{N - 1}$ approaches 1, then R = TRUE, else R = FALSE.
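The clustering loop of algorithm TEC and the SteadyTemplate check can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: trees are `(label, children)` tuples, `common_template` is a hypothetical greedy in-order intersection, the MP of step (b) is simplified to the portion of the new template in the incoming tree, and "approaches 1" is approximated by a tolerance `tol` that the paper does not specify.

```python
from typing import List, Optional, Tuple

Tree = Tuple[str, list]  # (label, ordered list of child trees)

def cost(tree: Tree, alpha: float) -> float:
    # Eq. (5): a leaf costs 1; sub tree costs are damped by alpha.
    return 1.0 + alpha * sum(cost(child, alpha) for child in tree[1])

def common_template(t1: Tree, t2: Tree) -> Optional[Tree]:
    # ASSUMPTION: greedy in-order matching stands in for the tree intersection.
    if t1[0] != t2[0]:
        return None
    shared, start = [], 0
    for child in t1[1]:
        for j in range(start, len(t2[1])):
            sub = common_template(child, t2[1][j])
            if sub is not None:
                shared.append(sub)
                start = j + 1
                break
    return (t1[0], shared)

def portion(template: Tree, tree: Tree, alpha: float) -> float:
    # S(template, tree) when the template is contained in the tree.
    return cost(template, alpha) / cost(tree, alpha)

class TemplateClass:
    def __init__(self, tree: Tree) -> None:
        self.templates = [tree]  # step 3: the first Tag tree is its own template
        self.trees = [tree]      # leaf Tag trees of the class

    def is_steady(self, alpha: float, tol: float = 1e-3) -> bool:
        # SteadyTemplate: mean similarity of neighbouring templates is close to 1.
        pairs = list(zip(self.templates, self.templates[1:]))
        if not pairs:
            return False
        mean = sum(portion(new, old, alpha) for old, new in pairs) / len(pairs)
        return mean >= 1.0 - tol

def tec(trees: List[Tree], theta: float = 0.5, alpha: float = 0.3) -> List[TemplateClass]:
    classes: List[TemplateClass] = []
    for tree in trees:
        best, best_tpl, best_mp = None, None, 0.0
        for tc in classes:                          # steps (a) and (b)
            tpl = common_template(tc.templates[-1], tree)
            if tpl is None:
                continue
            mp = portion(tpl, tree, alpha)
            if mp > best_mp:
                best, best_tpl, best_mp = tc, tpl, mp
        if best is not None and best_mp > theta:    # step (c)
            if not best.is_steady(alpha):           # steps (d) to (f)
                best.templates.append(best_tpl)
                best.trees.append(tree)
        else:                                       # steps (h) and (i)
            classes.append(TemplateClass(tree))
    return classes

# Hypothetical pages from two sites; only the text differs within a site.
def site1(text: str) -> Tree:
    return ("html", [("head", [("title", [("text:" + text, [])])]),
                     ("body", [("nav", []), ("main", [("text:" + text, [])]), ("foot", [])])])

def site2(text: str) -> Tree:
    row = ("tr", [("td", [("text:" + text, [])])])
    return ("html", [("table", [row, row])])

classes = tec([site1("a"), site2("x"), site1("b"), site2("y"), site1("c")],
              theta=0.8, alpha=0.5)
print([len(tc.trees) for tc in classes])
```

On such tiny trees the root node dominates the cost, so a higher threshold than in the experiments is used in the example; on real pages with many sub nodes the portion of an unrelated template is far smaller.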
4.2. Records in exclusive content
After a set of template classes has been learnt from Web pages, the pages that have a corresponding Tag tree in a class can be analyzed and extracted by using the template. For an unknown page $T_h$, we first need to find a suitable template $T_{tt}$ to extract its information. This can be achieved by computing similarities, as follows:
$$T_{tt} = \arg\max_{i}\ S(T_h, T_{tt}),\quad T_{tt} \text{ in each } TC_i \text{ of } TCS \qquad (9)$$
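The selection in Eq. (9) is an argmax over the templates of the learnt classes. A minimal sketch is given below; `select_template` is a hypothetical helper that accepts any similarity function, and `tag_overlap` is only a toy stand-in used to keep the example short, not the cost-based similarity of Eq. (6).

```python
from typing import Callable, Dict, Optional, Tuple, TypeVar

T = TypeVar("T")

def select_template(page: T,
                    class_templates: Dict[str, T],
                    similarity: Callable[[T, T], float]) -> Tuple[Optional[str], float]:
    """Return the name of the class whose template is most similar to the page."""
    best_name, best_score = None, float("-inf")
    for name, template in class_templates.items():
        score = similarity(page, template)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score

def tag_overlap(a: frozenset, b: frozenset) -> float:
    # Toy stand-in similarity: overlap of tag-name sets (NOT Eq. (6)).
    union = a | b
    return len(a & b) / len(union) if union else 1.0

templates = {
    "news": frozenset({"html", "body", "div", "h1"}),
    "forum": frozenset({"html", "body", "table", "tr", "td"}),
}
page = frozenset({"html", "body", "table", "tr", "td", "a"})
name, score = select_template(page, templates, tag_overlap)
print(name, score)
```

In practice the `similarity` argument would be the cost-based function of Section 3.3, applied to the page's Tag tree and the minimum template of each class.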
Then, by using the template $T_{tt}$ to parse the page, we obtain the exclusive content outside the template. The exclusive data may be child nodes of the template or attributes of tags, which the template separates from the Tag tree. Sometimes the attributes of tags are as important as the text area in the document record. For example, the links of anchor tags may show the relationships between Web documents. We call the child nodes of the template exclusive nodes (EN) and the tag attributes that are not in the template exclusive attributes (EA).
The data schema and the records can be extracted from the exclusive content, because different parts of the exclusive data can be considered the record fields of the template. There are regularities in the presentation of the fields. Some heuristic rules, such as those on font and size, are also found to be effective for detailing the schema of structured information. Unlike the research of Vadrevu et al. (2007), in which regularities and heuristic rules are used to extract the records in a document, our approach finds the document records of many pages that share the same template, by finding repeating patterns and using some heuristic rules. Four main heuristic rules are used to discover the labels of fields.
1. In the exclusive text area of a template, the repeating pattern is the field description of the changing text that follows it.
2. If there is no repeating pattern in the text area, the repeating text node prior to the text area is the field description.
3. If there is no previous repeating sibling text node, the repeating parent node is the field description.
4. For the attributes in tags, if the child text node is a repeating pattern, the text is the field description. Otherwise, the repeating text node prior to the attributes is the field description.
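Rules 1 and 2 can be illustrated with a simplified sketch. It assumes that the exclusive text found at the same template position has already been collected from several pages, and it treats the longest run of leading tokens shared by all pages as the repeating pattern. The function names and the whitespace tokenization are hypothetical choices for the example, and rules 3 and 4 are not covered.

```python
from typing import List, Tuple

def common_token_prefix(texts: List[str]) -> List[str]:
    """Leading tokens that are identical in every text (the repeating pattern)."""
    prefix = []
    for column in zip(*(text.split() for text in texts)):
        if len(set(column)) != 1:
            break
        prefix.append(column[0])
    return prefix

def split_field(texts: List[str], previous_repeating_text: str = "") -> Tuple[str, List[str]]:
    """Split the exclusive text of one template slot into (field description, values).

    Rule 1: a repeating prefix describes the changing text that follows it.
    Rule 2: without such a prefix, fall back to the repeating text node
            that precedes the slot (passed in by the caller).
    """
    prefix = common_token_prefix(texts)
    label = " ".join(prefix) if prefix else previous_repeating_text
    values = [" ".join(text.split()[len(prefix):]) for text in texts]
    return label, values

label, values = split_field(["Sender: alice", "Sender: bob", "Sender: carol smith"])
print(label, values)
```

A token-based prefix is used instead of a character-based one so that values which happen to start with the same letters are not absorbed into the field description.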
When each Web page contains only one document, it is easy to extract the content from the pages by using the template. When each page contains more than one record, there are repeating patterns in the exclusive content outside the template. By finding these repeating patterns, we can extract each record from the pages.

5. Experiments
In the experiment, we use a Web crawler to download pages from 12 different Web sites and analyze the pages to extract Web information. Document records are then extracted from the pages with the templates.

5.1. Templates extraction from different sites
Web pages of the 12 sites listed in Table 1 are downloaded. There are 30 pages from each site and 360 pages in total. We parse each page into a Tag tree and extract the template classes of the pages, using different settings of θ and α. Since α acts in the cost function of the Tag tree, it affects the similarity calculation and the four metrics. In fact, the larger α is, the larger the calculated cost and the smaller the proportion of the "trunk". On the other hand, θ is the threshold below which a new template class is created. If the maximum MP is larger than θ, the tree is inserted into the template class. Therefore, θ is important for the clustering process and α affects the comparison of Tag trees. If both θ and α are larger, lower page similarities are obtained, more delicate templates are extracted, and more template classes are clustered.
When θ = 0.5 and α = 0.3, exactly 12 template classes are extracted for the pages of the 12 sites. Table 1 shows the extraction results for each site. Every page of each site is clustered into the template class of that site. The accuracy and the recall rate are both 100%. The template classes contain different numbers of exclusive parts, i.e. EA and EN. The high values of MP, MS, AP and AS show that the pages in each template class are similar to each other because they share the same template.
As shown in Table 1, there are many exclusive parts in the templates of some Web sites, such as sina.com, sohu.com and xinhuanet.com. The templates of these sites are huge and many exclusive parts in the templates are duplicated. The main content of a page may be separated by template nodes into several exclusive parts, but the whole content of the document can still be extracted by combining the exclusive parts. Another reason for the many exclusive parts of www.tianya.cn is that there are many repeating patterns in a page, which are the records of replies to a topic. In this case, the records in a document can be extracted by finding the repeating patterns in the exclusive parts.
Table 1
Template extraction of 12 Web sites, θ = 0.5 and α = 0.3.
Site              Pages  EA  EN  MP    MS    AP    AS
qq.com            30     15  18  0.96  0.93  0.98  0.95
sina.com          30     15  26  0.98  0.96  0.99  0.97
china.com         30     12  4   0.99  0.98  0.99  0.99
tom.com           30     9   18  0.98  0.97  0.99  0.97
163.com           30     17  17  0.99  0.98  0.99  0.98
Bbs.fudan.edu.cn  30     10  3   0.97  0.94  0.97  0.95
Bbs.nju.edu.cn    30     5   2   0.96  0.92  0.96  0.93
yahoo.com         30     7   9   0.94  0.89  0.94  0.89
tianya.cn         30     79  32  0.95  0.90  0.95  0.91
Xinhuanet         30     20  26  0.99  0.99  0.99  0.99
sohu.com          30     31  37  0.96  0.92  0.97  0.94
ynet.com          30     15  9   0.94  0.88  0.94  0.89
Table 2
Template extraction of 12 Web sites, θ = 0.7 and α = 0.7.
Site              Pages  EA  EN  MP    MS    AP    AS
qq.com            30     15  18  0.84  0.71  0.00  0.00
sina.com          30     15  26  0.94  0.88  0.00  0.00
china.com         30     12  4   0.94  0.89  0.00  0.00
tom.com           30     9   18  0.93  0.87  0.00  0.00
163.com           30     17  17  0.93  0.89  0.00  0.00
bbs.fudan.edu.cn  30     10  3   0.87  0.78  0.00  0.00
bbs.nju.edu.cn    30     5   2   0.91  0.83  0.00  0.00
yahoo.com         30     7   9   0.88  0.79  0.00  0.00
tianya.cn         30     79  32  0.86  0.75  0.00  0.00
Xinhuanet         30     20  26  0.98  0.95  0.00  0.00
sohu.com          30     31  37  0.84  0.70  0.00  0.00
ynet.com          22     33  18  0.89  0.81  0.00  0.00
ynet.com          8      30  15  0.90  0.84  0.00  0.00
We increase θ and α step by step in the experiment. In most cases the template classes are the same as in the previous case of θ = 0.5 and α = 0.3, and the template of each class is also the same, except that the values of the metrics become a little smaller. When θ = 0.7 and α = 0.7, the class of ynet.com is divided into two classes, although the two templates are similar to each other. Because the two templates are more delicate, with more "branches", many exclusive parts are extracted. Yet there is no change in the other templates and their exclusive parts. Table 2 shows the extraction results in this case.
The experiments demonstrate that clustering pages by templates is accurate and efficient. The Web pages of each site can be distinguished from those of the other sites, which means that the templates are correctly extracted. Therefore, the templates can be used to extract the exclusive parts of each page in the template class.
Fig. 4. An extracted record and schema.
Table 3
Comparison of 3 extraction methods.
                             Tag tree template     HTMLParser method  Visual layout extraction
Text extraction              Text not in template  All text           Main section in layout
Style of text                Keep                  Lost               Keep
Tag attribute information    Yes                   No                 No
Extract useless information  No                    Yes                No
Learning process             Yes                   No                 Yes
Document schema generation   Yes                   No                 No
5.2. Records extraction from exclusive content
Record fields are extracted from the exclusive parts of the template. Taking the template of http://bbs.fudan.edu.cn as an example, we find 10 exclusive attributes and 3 exclusive nodes that change for each page. After finding patterns in the exclusive parts and using them as separation tokens, 19 fields of the document schema are found in total. The fields in the text area describe the sender, title, content, date and time, and channel of the document, whereas the fields in tag attributes show the relationships with other documents, such as the top document of the topic, the next document and the previous document of the board. By using the heuristic rules, we find the description, or name, of each record field, so the whole record of the schema can be extracted. Fig. 4 shows the schema found and a page record of http://bbs.fudan.edu.cn.

5.3. Extraction comparison with other methods
Other methods commonly used for extracting HTML pages are employed to perform extraction experiments in this section. HTMLParser (http://htmlparser.sourceforge.net/) is a well-known open source project for extracting information from a Web page. Its extraction method is based on Node, AbstractNode and Tag. A NodeList is filled after parsing a page, and the text in the nodes can then be obtained through the NodeList. However, HTMLParser cannot recognize the schema of the page or the record structure. Moreover, HTMLParser can only process one given Web page at a time. Hence, extra work is needed to generate the templates of the pages automatically. Another method is visual layout extraction (Baumgartner et al., 2001). Although the extracted result shows that the main content can be identified from the visual layout, the defect of visual extraction is that information dispersed in other sections cannot be extracted. Table 3 shows the comparison of the 3 methods.

6. Conclusion and future work
A new Web information extraction method based on a Tag tree template is proposed to efficiently extract meaningful information, including records and data schema. Web pages from different Web sites are parsed into Tag trees, and the templates of each site are then generated from the trees by using a cost-based tree similarity measurement. The records in the pages and the schema of the records can be extracted from the exclusive content by finding repeating patterns and using some heuristic rules. Extraction experiments on 360 pages from 12 Web sites show that the proposed method is an effective way to extract meaningful information. Comparison with HTMLParser and visual layout extraction shows that the proposed method is superior, especially in schema and tag attribute extraction. The approach is effective in finding records in tag attributes as well as in the text area. Future work will focus on optimizing the method of finding repeating patterns and on summarizing the heuristic rules for the exclusive content outside the template, in order to specify the schema of records in complex situations.

Acknowledgements
The paper is supported by the career development plan for new teachers of Fudan University. The paper is also supported by the Shanghai Leading Academic Discipline Project (Project No. B114).

References
Alani, H., Kim, S., Millard, D. E., Weal, M. J., Lewis, P. H., Hall, W., et al. (2003). Automatic extraction of knowledge from web documents. In Proceedings of the ISWC workshop (pp. 77–87).
Álvarez, M., Pan, A., Raposo, J., Bellas, F., & Cacheda, F. (2007). Using clustering and edit distance techniques for automatic web data extraction. In Proceedings of the eighth international conference on web information systems engineering (pp. 212–224).
Arasu, A., & Garcia-Molina, H. (2003). Extracting structured data from web pages. In Proceedings of the 2003 ACM SIGMOD (pp. 337–348).
Baumgartner, R., Flesca, S., & Gottlob, G. (2001). Visual web information extraction with Lixto. In Proceedings of the 27th international conference on very large data bases (pp. 119–128).
Bille, P. (2005). A survey on tree edit distance and related problems. Theoretical Computer Science, 337(1–3), 217–239.
Buttler, D. (2004). A short survey of document structure similarity algorithms. In The fifth international conference on internet computing.
Chang, C., & Lui, S. (2001). IEPAD: Information extraction based on pattern discovery. In Proceedings of 2001 international world wide web conference (pp. 681–688).
Chang, C.-H., Lui, S.-C., & Wu, Y.-C. (2001). Applying pattern mining to web information extraction. In Proceedings of the fifth Pacific-Asia conference on knowledge discovery and data mining (pp. 4–16).
Chang, C.-H., Hsu, C.-N., & Lui, S.-C. (2003). Automatic information extraction from semi-structured web pages by pattern discovery. Decision Support Systems, 35(1), 129–147.
Crescenzi, V., Mecca, G., & Merialdo, P. (2001). ROADRUNNER: Towards automatic data extraction from large web sites. In Proceedings of the 2001 international conference on very large data bases (pp. 109–118).
Downey, D., Etzioni, O., Soderland, S., & Weld, D. S. (2004). Learning text patterns for web information extraction and assessment. In AAAI-04 workshop on adaptive text extraction and mining.
Embley, D. W., Jiang, Y., & Ng, Y. K. (1999). Record-boundary discovery in web documents. In Proceedings of the SIGMOD'99.
Flesca, S., Manco, G., Masciari, E., Pontieri, L., & Pugliese, A. (2002). Detecting structural similarities between XML documents. In The fifth international workshop on the web and databases.
Flesca, S., Manco, G., Masciari, E., Pontieri, L., & Pugliese, A. (2007). Exploiting structural similarity for effective web information extraction. Data & Knowledge Engineering, 60(1), 222–234.
Hammer, J., Garcia-Molina, H., Cho, J., Crespo, A., & Aranha, R. (1997). Extracting semi structure information from the web. In Proceedings of the workshop on management of semistructured data.
Kim, Y., Park, J., & Kim, T. (2007). Web information extraction by HTML tree edit distance matching. In 2007 International conference on convergence information technology.
Liu, L., Pu, C., & Han, W. (2000). XWRAP: An XML-enabled wrapper construction system for web information sources. In Proceedings of the 2000 international conference on data engineering (pp. 611–621).
Muslea, I., Minton, S., & Knoblock, C. A. (1999). A hierarchical approach to wrapper induction. In Proceedings of third international conference on autonomous agents (pp. 190–197).
Takama, Y., & Mitsuhashi, N. (2005). Visual similarity comparison for web page retrieval. In Proceedings of the 2005 IEEE/WIC/ACM international conference on web intelligence.
Vadrevu, S., Gelgi, F., & Davulcu, H. (2007). Information extraction from web pages using presentation regularities and domain knowledge. World Wide Web, 10(2), 157–179.
Zhai, Y. H., & Liu, B. (2007). Extracting web data using instance-based learning. World Wide Web, 10(2), 113–132.