IADIS International Conference WWW/Internet 2006
EXPERIENCES REGARDING AUTOMATIC DATA EXTRACTION FROM WEB PAGES

Mirel Coşulschi, Bogdan Udrescu, Nicolae Constantinescu and Mihai Gabroveanu
Dept. of Computer Science, University of Craiova, Romania
mirelc, nikyc, [email protected], [email protected]

Adrian Giurcă
Dept. of Internet Technology, Brandenburg Technical University Cottbus, Germany
[email protected]
ABSTRACT
Existing methods of information extraction from HTML documents include the manual approach, supervised learning and automatic techniques. The manual method has high precision and recall values, but it is difficult to apply to a large number of pages. Supervised learning involves human interaction to create positive and negative samples. Automatic techniques require less human effort, but they are less reliable regarding the information retrieved. Our experiments fall into this last category of methods; for this purpose we have developed a tool for automatic data extraction from HTML pages.

KEYWORDS

data extraction, wrapper, clustering
1. INTRODUCTION

The vast amount of information on the World Wide Web cannot be fully exploited due to one of its main characteristics: web pages are designed for human readers, who interact with the systems by browsing HTML pages, rather than for consumption by computer programs. The semantic content structure of web pages is the principal element exploited by many web applications; one of the latest directions is the construction of wrappers in order to structure web data using regular languages and database techniques.

A wrapper is a program whose purpose is to extract data from various web sources. In order to accomplish this task the wrapper must identify the data of interest, put them into a suitable format, and eventually store them back into a relational database. [14] states the problem of generating a wrapper for Web data extraction as follows: given a web page S containing a set of input objects, determine a mapping W that populates a data repository R with the objects in S. The mapping W must also be capable of recognizing and extracting data from other pages similar to S.

Any possible solution to the above problem must take into consideration at least the following two contexts:

1. An HTML page may contain many types of information presented in different forms, such as text, images or Java applets. The Hyper Text Markup Language (HTML) is a language designed for data presentation; it was not intended as a means of structuring information and easing the process of structured data extraction. Another problem of HTML pages is their poor construction, with language standards frequently broken (i.e. improperly closed tags, wrongly nested tags, bad parameters and incorrect parameter values). Nevertheless, today there is a consistent effort of the World Wide Web Consortium (W3C, http://www.w3c.org) towards stricter standards: XHTML 1.0 (http://www.w3.org/TR/xhtml1/) is the W3C's first Recommendation for XHTML, following on from earlier work on HTML 4.01, HTML 4.0, HTML 3.2 and HTML 2.0. With a wealth of features, XHTML 1.0 is a reformulation of HTML 4.01 in XML, and combines the strength of HTML 4 with the power of XML.
2. The web pages of commercial web sites are usually generated dynamically using various server-side technologies, such as JSP (Java Server Pages, where a page is a text document containing static data, expressed in any text-based format such as HTML, SVG, WML or XML, together with JSP elements that construct the dynamic content) and PHP (Hypertext Preprocessor, a widely-used general-purpose scripting language especially suited for Web development that can be embedded into HTML), with data stored in a back-end DBMS. Dynamic content is also generated with client-side technologies (ECMAScript, Ajax). The visitor can easily notice the usage of a few templates throughout the same site, with slight differences between them. The page-generation process can be seen as the result of two activities: (a) the execution of some queries targeted at the underlying database; (b) the codification step, in which the source dataset is intertwined with HTML tags and other strings. Eventually, URL links, banners or images can also be inserted.

In the past years various tools for data extraction have been proposed in the literature with the goal of efficiently solving this problem. The amount of data published in the pages of very large sites and the high rate at which new sites appear have increased the demand for semi-automatic or automatic systems: the cost of a reliable application whose maintenance requires human intervention, and which provides information with a high degree of precision, increases linearly with the number of wrapped sources. For example, RoadRunner ([6]) generates a schema for the data contained in many similar pages through an iterative process in which the current schema at step k, Tk, is tested and possibly updated against a new web page S, resulting in a new schema Tk+1. The algorithm compares the HTML tag structure of two or more pages from the same class (pages generated by the same script and based on the same template). If the pages are not part of the same class, the result will be a very general schema, which can extract only data contained in common structures, while the elements of a singular structure remain unknown because they cannot be identified. Based on the schema T a grammar is constructed, capable of recognizing, among other things, nested-structured data objects with a variable number of values corresponding to their attributes.

The problem we address in this paper is the automatic extraction of database values from such template-generated web pages, without any learning examples or other similar human input.

The paper is organized as follows. In Section 1 we describe the main problems encountered in the automatic extraction of information from web pages, caused mainly by the unstructured description of information that HTML allows, and some advantages provided by the template-based automatic generation of web pages. Section 2 presents the template concept together with references to the bibliographic sources related to the topic. Section 3 describes the first step of the information extraction process, i.e. the grouping of web pages into clusters; we focus on the Minimum Description Length principle due to [8], and we present the concepts of tree edit distance and extraction pattern, which are used by our tool. Section 4 is devoted to design and implementation issues: we present the components of the tool, i.e. the crawler component and the data analyzer component, both descriptions containing UML class diagrams and screenshots; the main point is Algorithm 1, which describes the pattern matching discovery process.
The Conclusions section summarizes the main achievements of the paper together with possible future improvements.
2. TEMPLATES

Nowadays it is common for many websites to contain large sets of pages generated from a common template or layout. The values used to generate the pages (e.g. author, title, ...) are obtained from a database. The set of common characteristics in the structure of a web page is called a template. To the best of our knowledge there is no standard, formal definition of a template. [3] provides the following template definition:
Definition 2.1 A template is a collection of pages that share the same look and feel and are controlled by a single authority.

Recognizing and dealing with templates seems to be an important step for increasing the accuracy of data extraction tools. Some templates are very simple (e.g. the default HTML code generated by an HTML editor) while others are more complex, containing navigational bars, advertisement banners, links to administrative pages, or a link to the website administrator. Another possible definition of a template is:

Definition 2.2 A template is a set of common layouts and formatting rules, encountered in a set of HTML web pages, and produced by the same script that dynamically generates the content of a web page.

A template contains constants (elements whose value does not change from one web page to another) and variables. Each variable of a template is an object which can contain important information, and it is desirable that automatically generated wrappers identify these objects. The extraction process can be divided into four distinct steps:
1. Grouping the web pages into clusters. A number of publications address this step: [4,7,8,10,11].
2. The generation of an extraction pattern for each cluster. An extraction pattern is the generalization of all the pages contained in a cluster. See [4,6,15,19] for more detailed descriptions.
3. Data alignment ([19,21]).
4. Data labeling (see [2,19]).
Our paper and software application deal with the first two steps.
3. CLUSTERING

Properties related to the frequency and distribution of tags inside an HTML page were exploited in [7], [11] for classifying web pages according to their structure. [8] uses the MDL (Minimum Description Length) principle for inferring a reasonable site model. A desirable model of the site structure is one in which classes represent a useful partition of the pages, such that pages in the same class are structurally homogeneous.

In order to create clusters, our program requires as input a set of web pages. The clusters are composed of web pages sharing a common template, and each cluster is generalized to a common model that covers all of the cluster's components. It is worth mentioning that the clustering algorithm cannot rely only on URL addresses, because every minor change in a script's parameters can lead to a completely different HTML page. For clustering we have used tree distance algorithms: the distance between two HTML pages is computed as the tree edit distance between the associated DOM trees (see [4], for example).
3.1 Tree edit distance

Techniques for finding and extracting relevant information from web pages rely on document structure, visual characteristics and links between web pages, used individually or in combination. In this paper our goal is to analyze the web page structure. Since this structure is very well described by the DOM tree, a natural approach is to use the tree edit distance: it provides the metric that guides traditional clustering algorithms and allows us to compute a structural similarity value between two web pages. The tree edit distance is inspired by the string edit distance, the well-known example of dynamic programming: intuitively, the edit distance between two trees TA and TB is the cost of the minimal set of operations needed to transform TA into TB.

Trees belong to the set of basic data structures that should not be missing from a programmer's toolbox. A tree is a directed simple acyclic graph. The definitions below introduce some basic concepts.

Definition 3.1 (Ordered Tree) An ordered tree is a rooted tree in which the children of each node are ordered. If a node x has k children then these children are uniquely identified, left to right, as x1, x2, ..., xk.
Given an ordered tree T with a root r of degree k, the first-level subtrees T1, T2, ..., Tk of T are the subtrees rooted at r1, r2, ..., rk.

Definition 3.2 (Labeled Tree) A labeled tree T is a tree that associates a label λ(x) with each node x ∈ T. We let λ(T) denote the label of the root of T.

Definition 3.3 (Tree Equality) Let A and B be two ordered labeled trees, with first-level subtrees A1, A2, ..., Am and B1, B2, ..., Bn respectively. We say that A and B are equal, and write A = B, if λ(A) = λ(B), m = n, and Ai = Bi for all i with 1 ≤ i ≤ m.

Definition 3.4 (Labeled Ordered Rooted Tree) A labeled ordered rooted tree is a tree whose root vertex is fixed, in which there is a fixed relative order among the children of each vertex, and in which each vertex has a label attached.

For computing the edit distance we consider three operations: (1) vertex removal, (2) vertex insertion and (3) vertex replacement.

Definition 3.5 (Vertex Operations) RenT(lnew) is a renaming operation applied to the root of T that yields the tree T' with λ(T') = lnew and first-level subtrees T1, T2, ..., Tm. Given a node x of degree 0, InsT(x, i) is a node insertion operation applied to T at position i that yields the tree T' with λ(T') = λ(T) and first-level subtrees T1, ..., Ti, x, Ti+1, ..., Tm. If a first-level subtree Ti is a leaf node, DelT(Ti) is a node deletion operation applied to T at i that yields the tree T' with λ(T') = λ(T) and first-level subtrees T1, ..., Ti-1, Ti+1, ..., Tm.

We associate a cost with each of these operations; the sum of the costs of the edit operations defines the edit distance. The solution to the problem consists in determining the minimal-cost set of operations that transforms one tree into another. An equivalent formulation of the problem is to discover the minimum-cost mapping between the trees:

Definition 3.6 (Tree Mapping) Let Tx be a tree and let Tx[i] be the i-th vertex of tree Tx in a preorder walk of the tree. A mapping between a tree T1 of size n1 and a tree T2 of size n2 is a set M of ordered pairs (i, j), satisfying the following conditions for all (i1, j1), (i2, j2) ∈ M:
1. i1 = i2 iff j1 = j2;
2. T1[i1] is on the left of T1[i2] iff T2[j1] is on the left of T2[j2];
3. T1[i1] is an ancestor of T1[i2] iff T2[j1] is an ancestor of T2[j2].

Among the many variations of algorithms for computing the tree edit distance, we have limited ourselves to the top-down edit distance problem. In one of the first approaches, [20] presents a recursive dynamic programming algorithm with complexity O(n1·n2), where n1 is the size of T1 and n2 is the size of T2. [5] presents a non-recursive algorithm that computes the value within a single dynamic programming pass. [4] introduces a new type of mapping called restricted top-down mapping (RTDM); the proposed algorithm combines ideas from [18] and [20] and has a worst-case complexity of O(n1·n2). For our implementation we have chosen the algorithm used for XML document categorization (as in [17]) in order to cluster such documents; the tests performed with the above-mentioned algorithms led us to this choice.
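To make the top-down edit distance concrete, the sketch below computes a Selkow-style top-down distance with unit renaming costs over labeled ordered trees. It is only an illustration under our own assumptions (our tool actually uses the algorithm of [17] with its particular cost scheme); the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal labeled ordered tree node (illustrative only). */
class TreeNode {
    final String label;
    final List<TreeNode> children = new ArrayList<>();
    TreeNode(String label) { this.label = label; }
    TreeNode add(TreeNode child) { children.add(child); return this; }
    int size() {                              // number of vertices in this subtree
        int s = 1;
        for (TreeNode c : children) s += c.size();
        return s;
    }
}

/** Top-down (Selkow-style) tree edit distance with unit costs. */
final class TopDownDistance {

    static int distance(TreeNode a, TreeNode b) {
        int relabel = a.label.equals(b.label) ? 0 : 1;   // cost of renaming the root
        return relabel + forestDistance(a.children, b.children);
    }

    /** String-edit-like DP over the two child sequences; deleting or
     *  inserting a child removes or adds its whole subtree. */
    private static int forestDistance(List<TreeNode> as, List<TreeNode> bs) {
        int m = as.size(), n = bs.size();
        int[][] d = new int[m + 1][n + 1];
        for (int i = 1; i <= m; i++) d[i][0] = d[i - 1][0] + as.get(i - 1).size();
        for (int j = 1; j <= n; j++) d[0][j] = d[0][j - 1] + bs.get(j - 1).size();
        for (int i = 1; i <= m; i++)
            for (int j = 1; j <= n; j++)
                d[i][j] = Math.min(
                    d[i - 1][j - 1] + distance(as.get(i - 1), bs.get(j - 1)), // match/replace
                    Math.min(d[i - 1][j] + as.get(i - 1).size(),              // delete subtree
                             d[i][j - 1] + bs.get(j - 1).size()));            // insert subtree
        return d[m][n];
    }
}
```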
3.2 Clustering algorithm

In the first step, the labeled undirected complete graph Kn = (V, E) is constructed, where:
1. V = {P1, ..., Pn} is the set of all HTML pages;
2. E is the set of edges, where the length of the edge (PA, PB) is equal to the distance between the DOM trees associated with the pages PA and PB.
The graph is complete, so for a set of cardinality n we compute n(n-1)/2 distance values.

In the second step, the algorithm computes the minimum spanning tree (MST) using one of the well-known algorithms, Kruskal's or Prim's. From this minimum spanning tree we remove the edges whose associated cost w is greater than or equal to a threshold p (w ≥ p), where p is an upper limit computed a priori. Each connected component that remains after edge deletion represents a cluster.
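The clustering step can be sketched as follows, assuming a pairwise distance function (e.g. the tree edit distance above) is supplied. This is our own simplified rendering with hypothetical names, not the tool's code; it relies on the fact that cutting the MST edges with weight ≥ p yields the same components as running Kruskal's algorithm only on edges lighter than p.

```java
import java.util.*;
import java.util.function.BiFunction;

/** Sketch of MST-based clustering: pages whose pairwise tree edit distance
 *  is small end up in the same connected component. */
final class MstClustering {

    /** Groups items 0..n-1 into clusters, cutting MST edges with weight >= p. */
    static List<List<Integer>> cluster(int n, BiFunction<Integer, Integer, Double> dist, double p) {
        // 1. All n(n-1)/2 edges of the complete graph, sorted by weight (Kruskal).
        List<double[]> edges = new ArrayList<>();
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                edges.add(new double[]{dist.apply(i, j), i, j});
        edges.sort((x, y) -> Double.compare(x[0], y[0]));

        // 2. Kruskal's MST, never adding an edge with weight >= p: this is
        //    equivalent to building the full MST and then cutting the heavy edges.
        int[] parent = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i;
        for (double[] e : edges) {
            if (e[0] >= p) break;                 // remaining edges are all too heavy
            int a = find(parent, (int) e[1]), b = find(parent, (int) e[2]);
            if (a != b) parent[a] = b;
        }

        // 3. Each union-find component is one cluster.
        Map<Integer, List<Integer>> byRoot = new LinkedHashMap<>();
        for (int i = 0; i < n; i++)
            byRoot.computeIfAbsent(find(parent, i), k -> new ArrayList<>()).add(i);
        return new ArrayList<>(byRoot.values());
    }

    private static int find(int[] parent, int x) {
        while (parent[x] != x) { parent[x] = parent[parent[x]]; x = parent[x]; }
        return x;
    }
}
```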
3.3 Pattern generation

Next, we present some basic elements regarding extraction patterns and pattern generation; for more details the interested reader can consult [4]. An extraction pattern for trees is similar to a regular expression for strings: it is a rooted, ordered, labeled tree that can contain, as labels for vertices, special elements called wildcards. Each wildcard must be a leaf in the tree. There are four types of wildcards:
1. single (•) - a wildcard that captures one sub-tree and must be consumed;
2. plus (+) - a wildcard that captures sibling sub-trees and must be consumed;
3. option (?) - a wildcard that captures one sub-tree and may be discarded;
4. kleene (*) - a wildcard that captures sibling sub-trees and may be discarded.
The goal is that every wildcard corresponds to a data-rich object in the template: single and plus wildcards designate required objects, while option and kleene wildcards designate optional objects. An extraction pattern is said to accept a tree if there is a mapping with finite cost between the pattern tree and the data tree.

The pattern for a cluster of web pages is built by an iterative process called generalization: at each step the previously generated extraction pattern is composed with the DOM tree associated with an HTML page, resulting in a new extraction pattern. At the end, the result is a pattern tree that accepts all the pages in the cluster. The tree edit distance algorithm is also used when merging a pattern tree with a data tree: a cost function assigns particular costs for insertion, deletion and replacement to every combination of wildcard operators and data nodes. For example, a wildcard followed by a set of option wildcards should be converted into a kleene or plus wildcard, which designates variable-size objects. More generally, optional vertices of the template that the extraction pattern is trying to capture should be kept optional after the composition step, and higher quantifiers should remain in the resulting pattern. The operation of replacing data nodes from the source tree with distinct data nodes from the target tree is expressed through wildcard operations. In conclusion, the pattern tree encodes the minimum set of operations (insertion, deletion, replacement) for transforming a source tree into a target tree, i.e. the minimum cost mapping between them.
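To fix ideas, a pattern tree with wildcard leaves could be represented along the following lines. This is a sketch under our own assumptions; the actual classes of the tool are not shown in the paper.

```java
import java.util.ArrayList;
import java.util.List;

/** Wildcard kinds used as special leaf labels in an extraction pattern. */
enum Wildcard {
    SINGLE,  // captures exactly one sub-tree, required
    PLUS,    // captures one or more sibling sub-trees, required
    OPTION,  // captures at most one sub-tree, optional
    KLEENE   // captures zero or more sibling sub-trees, optional
}

/** A node of a rooted, ordered, labeled pattern tree: either a regular
 *  labeled node (wildcard == null) or a wildcard leaf. */
class PatternNode {
    final String label;          // tag name for regular nodes, null for wildcards
    final Wildcard wildcard;     // non-null only for wildcard leaves
    final List<PatternNode> children = new ArrayList<>();

    static PatternNode label(String label) { return new PatternNode(label, null); }
    static PatternNode wildcard(Wildcard w) { return new PatternNode(null, w); }

    private PatternNode(String label, Wildcard wildcard) {
        this.label = label;
        this.wildcard = wildcard;
    }

    PatternNode add(PatternNode child) { children.add(child); return this; }
    boolean isWildcard() { return wildcard != null; }
    boolean isOptional() { return wildcard == Wildcard.OPTION || wildcard == Wildcard.KLEENE; }
}
```

For instance, PatternNode.label("tr").add(PatternNode.label("td")).add(PatternNode.wildcard(Wildcard.PLUS)) would describe a hypothetical table row with a fixed first cell followed by one or more data cells.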
4. DESIGN AND IMPLEMENTATION

During the design and implementation phases we focused our attention on the first two steps presented before:
1. Grouping the pages into clusters;
2. Generating a pattern for each cluster and extracting the data.
At this moment the application is composed of two independent modules: (1) a crawler and (2) a data analyzer.
4.1 Crawler component

Web pages are crawled recursively, starting from an entry point of the current site and incrementally following its references to other pages. The structure that results after visiting a site is a tree having the starting page as root. The application creates a master index file that contains this tree structure in order to keep track of page crawling. The information is saved in XML format; the file is composed of a single type of node, called item, which has two attributes: webPath - the remote address from where the file was downloaded, and localPath - the local address where the file was saved.
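For illustration, such a master index could look like the following fragment. The URLs and local paths are hypothetical, and the nesting of item elements (mirroring the crawl tree) is our assumption; only the item element and its two attributes are described above.

```xml
<item webPath="http://www.example.com/index.html" localPath="pages/0001.html">
  <item webPath="http://www.example.com/products.php?id=1" localPath="pages/0002.html"/>
  <item webPath="http://www.example.com/products.php?id=2" localPath="pages/0003.html"/>
</item>
```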
Figure 1 presents the class diagram of the content scanning module. This module keeps track of the links from one web page to other web pages: each URL address is processed successively by a sequence of filters whose goal is to verify its correctness and, in case of malformation, to try to recover a valid link by searching through a list of possible link constructions.
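The filter pipeline can be sketched as follows. This is illustrative only; the interface and class names are hypothetical and do not correspond to the classes of Figure 1.

```java
import java.util.List;
import java.util.Optional;

/** A single URL filter: returns a (possibly corrected) URL, or empty if the
 *  address cannot be repaired at this stage. */
interface UrlFilter {
    Optional<String> apply(String url, String baseUrl);
}

/** Runs a candidate URL through the configured filters in order; the first
 *  filter that rejects it stops the chain. */
class UrlFilterChain {
    private final List<UrlFilter> filters;

    UrlFilterChain(List<UrlFilter> filters) { this.filters = filters; }

    Optional<String> process(String url, String baseUrl) {
        String current = url;
        for (UrlFilter f : filters) {
            Optional<String> result = f.apply(current, baseUrl);
            if (!result.isPresent()) return Optional.empty();  // link discarded
            current = result.get();                            // possibly rewritten link
        }
        return Optional.of(current);
    }
}
```

Typical filters in such a chain would resolve relative addresses against the base URL or drop non-HTTP links (mailto:, javascript:); these examples are ours, not taken from the tool.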
4.2 Data Analyzer Component

The analyzer receives as input the structure of the site from the master index file, together with a set of HTML pages. As we have already pointed out, the World Wide Web Consortium warmly recommends the usage of stricter markup languages, such as XHTML and XML, in order to reduce the errors that arise when parsing web pages created in disregard of the basic rules. Despite this recommendation, a huge quantity of web pages still does not respect the new standards, and the parser of a search engine must cope with them. We therefore transform each HTML page into a well-formed XHTML page. Among the applications that can be involved in this process we mention JTidy, a Java tool based on Tidy [13], an open source W3C program, and NekoHTML [16]. From the resulting XHTML file, the DOM tree representation used in the next step of our process can be created with no effort.
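As an illustration of this normalization step, a minimal JTidy-based cleanup could look as follows. This is a sketch, not the tool's actual code; the option choices and file name are our assumptions.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import org.w3c.dom.Document;
import org.w3c.tidy.Tidy;

/** Turns a (possibly malformed) HTML file into a W3C DOM tree via JTidy. */
final class HtmlCleaner {

    static Document toDom(String localPath) throws Exception {
        Tidy tidy = new Tidy();
        tidy.setXHTML(true);          // emit XHTML-conformant markup
        tidy.setQuiet(true);          // suppress progress messages
        tidy.setShowWarnings(false);  // ignore the (usually numerous) warnings
        try (InputStream in = new FileInputStream(localPath)) {
            // parseDOM repairs the markup and returns a DOM Document,
            // which is exactly the tree the distance and pattern algorithms need.
            return tidy.parseDOM(in, null);
        }
    }
}
```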
Figure 1. The Class Diagram of the Content Scanner
The input is split into a set of clusters: each cluster is composed of a set of HTML pages and their associated DOM trees, with the property that they share a similar template. We have designed this module with independence in mind, so that it can be reused in other applications. The abstract class AbstractAnalyser is the base of the analyzer concept, and it can inform StatusListener objects about changes of its internal status.

Let us denote by C = {cluster1, cluster2, ..., clustern} the set of resulting clusters. The clusters are represented as nodes in a graph G = (V, E), with V = C, and a directed edge from vertex i to vertex j exists iff there is at least one link from a page of cluster clusteri to a page belonging to cluster clusterj (see Figure 2). In the next processing step, the analyzer extracts a pattern which defines the common template. When one of the nodes of graph G is selected, the application displays the list of pages currently in that cluster, together with the extracted pattern. As presented before, the pattern is a tree that may contain special operators; this pattern is used in the step of extracting data from the web pages.
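The construction of the cluster graph G can be pictured as in the sketch below; the representation of links and clusters is our own assumption, not the tool's data model.

```java
import java.util.*;

/** Sketch of building the directed cluster graph G = (V, E): V is the set of
 *  clusters and an edge i -> j exists iff at least one page of cluster i
 *  links to a page of cluster j. */
final class ClusterGraph {

    /** links.get(p) = pages directly referenced by page p;
     *  clusterOf.get(p) = index of the cluster containing page p. */
    static Map<Integer, Set<Integer>> build(Map<String, Set<String>> links,
                                            Map<String, Integer> clusterOf) {
        Map<Integer, Set<Integer>> edges = new HashMap<>();
        for (Map.Entry<String, Set<String>> e : links.entrySet()) {
            Integer from = clusterOf.get(e.getKey());
            if (from == null) continue;                  // page was not clustered
            for (String target : e.getValue()) {
                Integer to = clusterOf.get(target);
                if (to != null)                          // self-loops kept for simplicity
                    edges.computeIfAbsent(from, k -> new HashSet<>()).add(to);
            }
        }
        return edges;
    }
}
```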
Figure 2. The clusters graph

Algorithm 1: PatternMatch
Input: pattern - the root of the pattern tree (TreeNode)
       input - the root of the data tree
Output: the list of data extracted
Local: data - a temporary list
       k - the index of the child node from where the search starts
 1: procedure ExtractData(pattern, input)
 2:   if pattern.getLabel() == input.getLabel() then
 3:     data ← ExtractDataFromList(pattern.getChilds(), input.getChilds());
 4:     k ← 1
 5:     for i ← 1, pattern.getChildrenCount() do
 6:       label ← pattern.getChildAt(i).getLabel();
 7:       if label is not wildcard then
 8:         node ← GetChild(input, label, k);
 9:         if node ≠ null then
10:           k ← input.getChildIndex(node) + 1;
11:           data.addAll(ExtractData(pattern.getChildAt(i), node));
12:         end if
13:       end if
14:     end for
15:     return data;
16:   else
17:     return ∅;
18:   end if
19: end procedure
Algorithm 1 describes the pattern matching process of data extraction from each HTML page: the ExtractData procedure determines a mapping between the pattern tree pattern and the data tree input, both trees being represented by their root nodes. The procedure returns a list of mappings. If the two labels are identical (see line 2), we try to find a matching between the sequence of children of the node pattern and the sequence of children of the node input. The procedure ExtractDataFromList receives two lists of nodes as input; its algorithm is similar to that of aligning a string containing wildcards with a string composed only of regular characters, and it returns the resulting matchings. Then, for each child of the node pattern, we try to find the first mapping that matches in label (lines 5-14). GetChild(input, label, index) returns the child of the node input whose information is label; the search for a child with this property starts from position index.
5. CONCLUSIONS

We have developed a prototype tool for automatic data extraction from HTML pages. We plan to improve the metric involved in the clustering step and to study other algorithms for computing the tree edit distance, aiming to tailor those algorithms to our specific needs. The versatile design we have chosen easily accommodates changes of those elements.
REFERENCES

[1] Arasu and Garcia-Molina, 2003, A. Arasu, H. Garcia-Molina, Extracting structured data from Web pages, in ACM SIGMOD 2003.
[2] Arlotta et al., 2003, L. Arlotta, V. Crescenzi, G. Mecca, P. Merialdo, Automatic annotation of data extraction from large Web sites, in Proceedings of the International Workshop on the Web and Databases, San Diego, USA.
[3] Bar-Yossef and Rajagopalan, 2002, Z. Bar-Yossef, S. Rajagopalan, Template Detection via Data Mining and its Applications, in Proceedings of the World Wide Web Conference (WWW02).
[4] Castro et al., 2004, D. de Castro Reis, P. Golgher, A. da Silva, A. Laender, Automatic Web news extraction using tree edit distance, in Proceedings of the World Wide Web Conference (WWW04), New York.
[5] Chawathe, 1999, S. S. Chawathe, Comparing hierarchical data in external memory, in Proceedings of the 25th International Conference on Very Large Databases, Edinburgh, Scotland.
[6] Crescenzi et al., 2002, V. Crescenzi, G. Mecca, P. Merialdo, RoadRunner: Automatic Data Extraction from Data-Intensive Web Sites, International Conference on Management of Data and Symposium on Principles of Database Systems (SIGMOD02).
[7] Crescenzi et al., 2002, V. Crescenzi, G. Mecca, P. Merialdo, Wrapping-Oriented Classification of Web Pages, in Proceedings of the fifth ACM Symposium on Applied Computing (SAC02), ACM Press.
[8] Crescenzi et al., 2003, V. Crescenzi, P. Merialdo, P. Missier, Fine-grain web site structure discovery, in Proceedings of the fifth ACM International Workshop on Web Information and Data Management, ACM Press.
[9] Chang, 2001, C.-H. Chang, IEPAD: Information extraction based on pattern discovery, in Proceedings of the tenth International Conference on World Wide Web.
[10] Dalamagas et al., 2004, T. Dalamagas, T. Cheng, K.-J. Winkel, T. Sellis, Clustering XML documents by structure, Technical Report, KBDS Laboratory, University of Athens. http://www.dbnet.ece.ntua.gr/pubs/uploads/TR-20045.pdf
[11] Flesca et al., 2002, S. Flesca, G. Manco, L. Pontieri, A. Pugliese, Detecting structural similarities between XML documents, in Proceedings of the International Workshop on the Web and Databases (WebDB 2002).
[12] Hsu and Yih, 1997, J. Y. Hsu, W. Yih, Template-Based Information Mining from HTML Documents, AAAI/IAAI.
[13] HTML Tidy, W3C, http://www.w3.org/People/Raggett/tidy.
[14] Laender et al., 2002, A. Laender, B. Ribeiro-Neto, A. da Silva, J. Teixeira, A Brief Survey of Web Data Extraction Tools, ACM SIGMOD Record, 31(2).
[15] Liu et al., 2004, Z. Liu, W. K. Ng, E.-P. Lim, An automated algorithm for extracting website skeleton, in Proceedings of DASFAA 2004.
[16] NekoHTML Parser, http://www.apache.org/~andyc/neko/doc/html/index.html
[17] Nierman and Jagadish, 2002, A. Nierman, H. V. Jagadish, Evaluating structural similarity in XML documents, in Proceedings of the 5th International Workshop on the Web and Databases (WebDB 2002), Madison, Wisconsin, USA.
[18] Valiente, 2001, G. Valiente, An efficient bottom-up distance between trees, in Proceedings of the 8th International Symposium on String Processing and Information Retrieval, Chile, IEEE Computer Science Press.
[19] Wang and Lochovsky, 2002, J. Wang, F. H. Lochovsky, Data-rich section extraction from HTML pages, in Proceedings of the 3rd International Conference on Web Information Systems Engineering (WISE 2002), Singapore, December, IEEE Computer Society.
[20] Yang, 1991, W. Yang, Identifying syntactic differences between two programs, Software - Practice and Experience, 21(7).
[21] Zhai and Liu, 2005, Y. Zhai, B. Liu, Web data extraction based on partial tree alignment, in Proceedings of the World Wide Web Conference (WWW05), Chiba, Japan.