A Method for Web Information Extraction

Man I. Lam¹, Zhiguo Gong¹, and Maybin Muyeba²

¹ Faculty of Science and Technology, University of Macau, Macao, PRC
{ma46522, fstzgg}@umac.mo
² School of Computing, Liverpool Hope University, Liverpool, L16 9JD, UK
[email protected]

Abstract. The World Wide Web has become one of the most important information repositories. However, information in web pages does not follow any presentation standard and is rarely organized in a well-defined format. Extracting appropriate and useful information from web pages is therefore a challenging task. Currently, many web extraction systems, called web wrappers, either semi-automatic or fully automatic, have been developed. In this paper, some existing techniques are investigated, and then our current work on web information extraction is presented. In our design, we classify the patterns of information into static and non-static structures and use different techniques to extract the relevant information. In our implementation, patterns are represented with XSL files, and all the extracted information is packaged into XML, a machine-readable format.

1 Introduction

During the past decade, information extraction has been extensively studied, with many research results as well as systems developed. Since the late 1980s, through the Message Understanding Conferences (MUC) sponsored by the Defense Advanced Research Projects Agency, many information extraction systems have been successfully developed and quantitatively evaluated [1]. Information sources can be classified into three main types: free text, structured text and semi-structured text. Originally, extraction systems focused on free text. Natural Language Processing (NLP) techniques were developed to extract this type of unrestricted, unregulated information, employing the syntactic and semantic characteristics of the language to generate the extraction rules. Structured information usually comes from databases, which provide rigid, well-defined formats; it is therefore easy to extract through a query language such as the Structured Query Language (SQL). The third type is semi-structured information, which falls between free text and structured information. Web pages are a typical example of semi-structured information. In this paper, we focus on extracting text information from web pages.

According to the statistics of the Miniwatts Marketing Group [URL1], the growth of web users during this decade is over 200% and there are more than 1 billion
Internet users from over 233 countries and world regions. At the same time, public information and virtual places are increasing accordingly, covering almost any kind of information need. This attracts much attention to the question of how to extract useful information from the Web. Currently, the targeted web documents can easily be obtained by entering some keywords into a web search engine. The drawback is that the system may not necessarily return relevant data-rich pages, and it is not easy for a computer to automatically extract or fully understand the information contained in them. The reason is that web pages are designed for human browsing rather than machine interpretation. Most pages are in the Hypertext Markup Language (HTML) format, which is a semi-structured language, and the data are not given in a particular format and change frequently [1]. There are several challenges in extracting information from a semi-structured web page, such as the lack of a schema, ill formatting, high update frequency and semantic heterogeneity of the information. In order to overcome these challenges, our system first transforms each page into the Extensible Hypertext Markup Language (XHTML) format [URL2]. Then, we make use of the DOM tree hierarchy of the web page and, through a human training process, extract patterns expressed as regular expressions using the Extensible Stylesheet Language (XSL) [URL3, URL4]. The relevant information is extracted and transformed into another structured format, the Extensible Markup Language (XML) [URL5].

The remainder of the paper is organized as follows: some related work is reviewed in section 2, which gives a brief overview of current web information extraction systems; the detailed techniques of our approach are addressed in section 3; experimental results are explained in section 4; finally, conclusions and future work are given in the last section.

2 Related Work

Over time, many extraction systems have been developed. In the very beginning, wrappers were constructed manually to extract a particular format of information. However, such a wrapper is not adaptive to change; it must be reconstructed for each different type of information. In addition, constructing the extraction rules used in a wrapper for a specific domain is complicated and knowledge intensive, so only experts may have the knowledge to do it. No doubt, this inflexibility and the development cost of construction are the main disadvantages of manually built wrappers.

Due to the extensive work involved in manually constructing a wrapper, many wrapper generation techniques have been developed. These techniques can be classified into several classes, including language development based, HTML tree processing based, natural language processing based, wrapper induction based, modeling based and ontology based [2].

In order to assist the user in accomplishing the extraction task, language development based systems introduce a new language. Well-known systems of this type include TSIMMIS [3] and WebOQL [4]. One of the drawbacks of such a model is that not all users are familiar with the new query language, so the performance of the system may not be as expected. Then, as most web pages are in HTML
format, another type of extraction system, the HTML tree processing based system, was proposed. By parsing the tree structure of a web page, such a system is able to locate useful pieces of information. XWRAP [5] and RoadRunner [6] are examples in this respect. In this solution, web pages need to be transformed into XHTML or XML format due to limitations of the HTML format. For pages which are mainly composed of grammatical text or paragraphs, Natural Language Processing (NLP) systems can be used. NLP is popularly used to extract free text information and makes use of filtering, part-of-speech tagging and lexical semantic tagging to build up the extraction rules. SRV [7], WHISK [8] and KnowItAll [9] are examples of this technique. However, for pages which are composed of tabular or list formats, NLP based tools may not be effective, since the internal structure of the page cannot be fully exploited. Wrapper induction based systems can induce the contextual rules for delimiting the information from a set of training samples. SoftMealy [10] and STALKER [11] are typical examples. In modeling based systems, the data are conformed to a pre-given structure according to a set of modeling primitives, for example tables or lists. The system then tries to locate the information against the given structures. NoDoSE [12] is an example of this type of system. The last type is ontology based systems. Ontology techniques can be used to decompose a domain into objects and further to describe these objects [13]. This type of system does not rely on the structures of web pages or the grammars of texts; instead, an object is constructed for a specific type of data. WebDax [13] is a typical example in this respect.

Besides being classified by the main technique used, wrappers can also be grouped into semi-automatic and fully automatic wrappers. For a semi-automatic wrapper, human involvement is necessary. Most systems belong to this type, such as TSIMMIS [3] and XWRAP [5]. For a fully automatic wrapper, no human intervention is needed; examples include Omini [14] and STAVIES [15], which make use of the tree structures or the visual structures of pages to perform the extraction task.

Our proposed system belongs to the class of semi-automatic wrappers. Through the training process, our system learns rules for extraction. We expect that, with training, the system can be adapted to different types of pages if the training samples are broad enough. In addition, for different types of information, we make use of different extraction techniques, whereas most existing systems apply only one main methodology. The benefit of combining multiple extraction methodologies is higher extraction performance.

3 System Design

The system works in two phases, the pre-processing phase and the extraction phase, as shown in figure 1. In the pre-processing phase, in order to overcome the ill representation of HTML documents, all pages are transformed into the XHTML format. Then the training process is performed, which gathers patterns and rules for the extraction phase. In the extraction phase, based on the human training results, the system chooses suitable extraction methods for the different information fields.

Fig. 1. System Architecture

3.1 Patterns and Rules of the Information

All web pages are transformed from HTML into the W3C-recommended XHTML format in the pre-processing phase. Although current web browsers can correctly present ill-formed HTML documents, it is difficult to identify the hierarchical structure of such pages. For example, tags such as <p> or <br> may be used alone in HTML, without the corresponding closing tags. Such usage is not allowed in XHTML, so the closing tags have to be added during the pre-processing phase. An open source library called Tidy [URL6], provided by the W3C organization, is used to transform the web pages; Tidy is able to fix a broad range of ill-formed HTML.

After the pages are transformed into XHTML format, the training process is performed. Users need to highlight the words to be extracted in the sample pages. Figure 2 shows the interface of the training process in our system. The target information is modeled as a schema r(f1, f2, …, fn), where fi is a field of information. Let PSet = {p1, p2, …, pm} be a set of sample training web pages, and suppose that target records can be extracted, at least partly, from each of those training pages. Each web page p in PSet is annotated with a vector (f1:l1, f2:l2, …, fn:ln), where li is the location of field fi in page p. The objective of the training process is to mine out (extract) patterns and rules for the target information.

For each field fi, a pattern set, denoted PSi, is constructed from the sample pages. A pattern pnij in PSi characterizes the context of fi. In our system, we represent pnij in the format {PWij, EWi}, where PWij is the word(s) occurring just before the instance of fi in page pj, and EWi is the formulation rule of fi's value, which is described with a regular expression in our work and given by the user. We further suppose that EWi is independent of the individual training pages.

As we know, a web page is represented by a DOM tree, and all the fields of the same record are embedded in different tag nodes of the tree. We use a directed graph to describe the organizational constraint of the fields in the web page pj. Let F = {f1, f2, …, fn}; then the constraint is defined as a directed graph CGj = (F, Ej), where Ej is the set of directed edges such that fi→fk is in Ej if and only if the tag node of fi is embedded in the tag node of fk, and fi↔fk if and only if the node of fi and the node of fk are sibling elements in the DOM tree. Then rlj = (PSj, CGj) is called an extraction rule for the target record r(f1, f2, …, fn) with respect to training page pj. With the training process, multiple rules can be derived from the training web pages. We use RS to denote the set of all possible rules for the given schema, that is, RS = {rlj}.
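To make these definitions concrete, the following is a minimal Python sketch of the training-time data structures described above; the class and attribute names (FieldPattern, ConstraintGraph, ExtractionRule) are our own illustrative choices and are not part of the original system.

from dataclasses import dataclass, field

@dataclass
class FieldPattern:
    # One pattern pnij = {PWij, EWi} for a field fi, gathered over the training pages.
    pre_words: list       # PWij: the word(s) occurring just before the field value in each page pj
    value_regex: str      # EWi: the user-given regular expression for the field's value

@dataclass
class ConstraintGraph:
    # CGj = (F, Ej): how the fields of one training page are organized in its DOM tree.
    fields: set
    embedded: set = field(default_factory=set)   # directed edges fi -> fk (node of fi nested in node of fk)
    siblings: set = field(default_factory=set)   # pairs fi <-> fk whose tag nodes are siblings

@dataclass
class ExtractionRule:
    # rlj = (PSj, CGj): one rule derived from training page pj.
    patterns: dict                # field name -> FieldPattern
    constraints: ConstraintGraph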

A Method for Web Information Extraction

387

The number of rules in the original RS can be as large as the number of training pages in PSet. However, many rules may carry redundant or contradictory information. To reduce the number of rules, a reduction algorithm is performed on the original set RS. To do so, we suppose that the patterns of fields are orthogonal to the constraints of fields: the former give the local context of the fields, while the latter describe their occurrence relationships in web pages. With this assumption, we merge the patterns and the constraints independently. For the patterns of field fi, the merged pattern is defined as s-PSi = {{tij:tf(tij)}, EWi}, where tij is a pre-word extracted from the sample pages with respect to field fi, and tf(tij) is the number of occurrences of tij for field fi. In one extreme situation, all the pre-words extracted for fi are the same, denoted t, and then tf(t) = m (the total number of training pages). In the other extreme, all the pre-words are different from each other, and then tf(tij) = 1. The weight of tij therefore indicates its degree of significance as a pre-word for field fi. To normalize the weights, we replace tf(tij) with ntf(tij) = tf(tij)/m, so that Σj ntf(tij) = 1. The overall merged pattern for schema r(f1, f2, …, fn) is then described as SP = (s-PS1:EW1, s-PS2:EW2, …, s-PSn:EWn).

As we know, each constraint of the fields gives the organizational structure of those fields in one training page. The constraint is represented as a directed graph, with fields as nodes and embedding relationships as the directed edges. We concatenate all those constraint graphs into one graph, with the fields as nodes, and weight the directed edge from fi to fk with n(i,k), where n(i,k) is the number of constraints having the edge from fi to fk. The merged constraint graph is denoted CG. Then (SP, CG) is called the extraction rule for schema r(f1, f2, …, fn).
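As an illustration of the merging step, here is a small Python sketch (not the authors' code) that builds the weighted pre-word set {tij: ntf(tij)} for one field from m training pages; the function name and the sample values are hypothetical.

from collections import Counter

def merge_field_patterns(pre_words, value_regex):
    # pre_words: the pre-word PWij collected for field fi from each of the m training pages.
    m = len(pre_words)
    tf = Counter(pre_words)                               # tf(tij): occurrences of each pre-word
    ntf = {t: count / m for t, count in tf.items()}       # normalized weights, summing to 1
    return {"pre_words": ntf, "value_regex": value_regex}  # the merged pattern s-PSi

# Example: three training pages use "E-mail:" and one uses "Mail:" as the pre-word.
rule = merge_field_patterns(["E-mail:", "E-mail:", "Mail:", "E-mail:"], r"[\w]*@[\w]*\.[\w]*")
# rule["pre_words"] == {"E-mail:": 0.75, "Mail:": 0.25}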



Fig. 2. Screenshot of the training process


3.2 Information Extraction Processing

According to the rule (SP, CG) of the target information, our system classifies each extraction field as a static field (SF) or a non-static field (NSF) based on the EWi in SP. The classification is easy if the extracted word contains special characters, i.e., characters other than alphabetical and numerical ones. If most of the EWi values for one extraction field contain rather similar special characters, the field is classified as an SF. For example, all e-mail addresses contain "@" and ".", and their structure is stable, so they can easily be represented by a regular expression. In the following, the extraction method for SFs is discussed first, followed by that for NSFs.

3.2.1 Methodology for Static Structure Information

As the extracted information has a static structure, the system makes use of this feature to generate an extraction rule in the form of a regular expression. Before going into detail, a brief introduction to regular expressions is provided first; the rule generation process is then explained in detail.

Introduction to Regular Expressions

A regular expression is a value-formulating pattern, defined using regular expression syntax, that represents an information requirement. The regular expression syntax used in our system is shown in table 1. The syntax differs slightly among languages. In our system, all extraction patterns are represented within an XSL template in order to generate output in XML format. The regular expression syntax used in XSL is specified in XML Schema [URL5], which is based on the established conventions of languages such as Perl. Some XSL-specific features, such as curly braces being doubled up to distinguish them from the attribute value notation used in XSL, require caution. We do not explain this in more detail, since the syntax used in our extraction rules is the commonly used one.

Table 1. Regular expression syntax

Character classes
\w    Any word character, composed of a-z, A-Z, 0-9 or _
\W    Any non-word character
\d    Any digit, that is 0-9
\D    Any character other than a digit
\s    Any blank space character

Repetition
*     Zero or more occurrences
+     One or more occurrences
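As a small illustration of the static/non-static classification described at the beginning of this section, the following Python sketch (our reading of the heuristic, not the authors' code) treats a field as static when most sample values share the same set of special characters; the function names, threshold and sample values are illustrative assumptions.

import re

def special_characters(value):
    # Characters other than letters, digits and whitespace, e.g. "@" and "." in an e-mail address.
    return frozenset(re.findall(r"[^A-Za-z0-9\s]", value))

def is_static_field(sample_values, threshold=0.8):
    if not sample_values:
        return False
    skeletons = [special_characters(v) for v in sample_values]
    most_common = max(set(skeletons), key=skeletons.count)
    if not most_common:          # values with no special characters are not treated as static
        return False
    return skeletons.count(most_common) / len(skeletons) >= threshold

# is_static_field(["someone@example.edu", "prof@example.org"])    -> True  (static: regular expression)
# is_static_field(["Data mining", "Web information extraction"])  -> False (non-static: DOM tree analysis)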

Rule Extraction

After the training process, the extraction rules for the target information have been obtained. As shown in the following equations, the extraction pattern for a field fi is
shown in equation (1), which is composed of several rule patterns combined with an "or" relationship. Each rule pattern psij is composed of a pre-word PWij and the regular expression of the extracted word EWi, as shown in equation (2), where the δ function transforms the content into a regular expression. In order to avoid confusion when more than one field fi has a similar structure, such as a phone number and a fax number, the pre-word s-PWij is included in the rule pattern.

s-PWi = {PWi1, PWi2, …, PWim}                                (1)

psij = s-PWij : δ(EWij)                                      (2)

For the δ function, the main syntax elements used are shown in table 1. The main idea is to transform every space character into "\s*" and every text or number token into "[\w]*". For simplicity, figure 3 shows the steps of generating the extraction rule for an email field. Assume that PWij is "E-mail:" and EWij is "profa@umac.mo". As shown in the following steps, the δ function replaces every word token with [\w]* and then replaces every space character with \s*.

rpji = PWi ∪ δ(EWi)
     = "E-mail:" ∪ δ("profa@umac.mo")
     = "E-mail:" ∪ "[\w]*@[\w]*.[\w]*"
     = E-mail:\s*([\w]*@[\w]*.[\w]*)\s*

Fig. 3. Steps of generating the extraction rule for email
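The δ function itself is simple enough to sketch. The Python version below is our reconstruction (the helper names delta and build_rule are hypothetical): it replaces runs of word characters with [\w]* and runs of spaces with \s*, and then prepends the pre-word as in figure 3.

import re

def delta(sample_value):
    # Runs of word characters become [\w]*, runs of whitespace become \s*.
    pattern = re.sub(r"[A-Za-z0-9_]+", r"[\\w]*", sample_value)
    return re.sub(r"\s+", r"\\s*", pattern)

def build_rule(pre_word, sample_value):
    # Pre-word matched literally, the value wrapped in a capturing group, as in Fig. 3.
    return re.escape(pre_word) + r"\s*(" + delta(sample_value) + r")\s*"

# delta("profa@umac.mo")                  -> "[\w]*@[\w]*.[\w]*"
# build_rule("E-mail:", "profa@umac.mo")  matches text such as "E-mail: profa@umac.mo"

Note that, mirroring figure 3, the remaining punctuation (such as the dot) is left unescaped; a stricter implementation would escape it as well.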

Target Information Extraction

After generating the extraction rule PSi for an extraction field fi, the rule is represented with an XSL template in order to generate the XML output. Figure 4 shows the XSL template used in our system for single-field information extraction. In order to take advantage of regular expressions, the newest version of XSL, v2.0, provides some new instructions for this purpose. After the regular expression is passed into the XSL template, the xsl:analyze-string instruction tests whether the regular expression "regexp" matches the content of the input string; here the selected input string is ".", that is, the whole page. Each matched part is then processed by the xsl:matching-substring child instruction; here we only use the simplest instruction, xsl:value-of, to obtain the value of the matched part. XSL [URL4] helps us find the element nodes in an XML document, and if the XSL file is set up properly, it is able to capture the most meaningful data in the transformed web page. In our system, XSL 2.0 is used; it is not yet widely supported by software systems, but it is worth using. The XSL processor used in our system is Saxon-B 8.7, a limited but free version of the XSL processor [URL7].


Fig. 4. XSL Template
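For readers unfamiliar with XSLT 2.0, the matching step that the template in figure 4 performs with xsl:analyze-string corresponds to the following Python illustration; this is not how the system is implemented (the real rule is evaluated by Saxon inside the XSL template), and the sample page text is made up.

import re

def analyze_string(page_text, rule_regex):
    # Each match corresponds to one xsl:matching-substring hit; group(1) plays the
    # role of xsl:value-of applied to the captured field value.
    return [m.group(1) for m in re.finditer(rule_regex, page_text)]

# analyze_string("Office 2.14  E-mail: someone@example.edu  Tel: ...",
#                r"E-mail:\s*([\w]*@[\w]*.[\w]*)\s*")
# -> ["someone@example.edu"]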

3.2.2 Methodology for Non-static Information Extraction

For information without a static structure, it is not suitable to use regular expression rules. Therefore, our system exploits a special feature of web pages: the organizational structure of the fields in the DOM tree. As mentioned before, all pages are transformed into XHTML format before the extraction process is performed, so the DOM tree is valid, well formatted and correct. These requirements are an essential prerequisite for our system, because the whole extraction algorithm relies on the HTML tags.

PTagSet = {tag1, tag2, …, tagn} ⊂ p
KWSeti = {KW1, KW2, …, KWl}

For each KWi ∈ KWSeti {
    If KWi ⊂ p {
        ∃ ktagi ∈ PTagSet, where ktagi = nearest open tag of KWi
        KTagSet = KTagSet + ktagi
    }
}
For each ktagk (k = 1..n) ∈ KTagSet {
    WPl = Split(p, ktagk)
    For each WPl {
        If ∃ KWi ∈ WPl {
            Output = Output + WPl
            p = p − WPl
        }
    }
}

Fig. 5. Extraction Steps


Let p be a web page, and let PTagSet be the set of tags in p, that is, PTagSet = {tag1, tag2, tag3, …, tagn}, where each tag is either an open or a close tag. From the result of the training process, keywords for the schema information are available. After analysis, a set of keywords KWSeti = {KW1, KW2, …, KWl} for the extraction field fi is formed. For each KWi in KWSeti, the location of KWi is identified in the page p, and the nearest open tag preceding KWi is put into the set KTagSet. After all KWi in KWSeti have been processed, KTagSet = {KTag1, KTag2, …, KTagn} is formed. For each KTagi in KTagSet, the page p is split into several parts using KTagi. For every part of the page, if any KWi is located in it, the whole part is added to the output result set.

The above methodology is illustrated in figures 6 and 7. As an example, if keyword Wi appears between the first pair of <li> tags, as shown in figure 6, then the KTag will be <li>, since it is the first open tag preceding the keyword Wi. The whole page is then split by the tag <li>, and for each separated part, if Wi exists in it, the part is added to the result set. In figure 7, a more complicated example is presented, in which the keyword Wi appears more than once in the page. In this example, the key tag list is {<ul>, <li>}. The page is first split by <ul>, and any part that contains Wi is output to the result set. After that, the remaining parts of the page are further split by <li> and the process continues.

            Fig. 6. DOM tree structure of web page example 1

Fig. 7. DOM tree structure of web page example 2
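A runnable sketch of the splitting procedure of figure 5 is given below. It is our reconstruction in Python, operating on the serialized XHTML text; the helper names (nearest_open_tag, extract_parts) and the sample markup are illustrative only.

import re

def nearest_open_tag(page, keyword):
    """Name of the last open tag appearing before the keyword, or None."""
    pos = page.find(keyword)
    if pos == -1:
        return None
    opens = [m.group(1) for m in re.finditer(r"<([a-zA-Z][\w]*)[^>]*>", page[:pos])]
    return opens[-1] if opens else None

def extract_parts(page, keywords):
    key_tags = []
    for kw in keywords:                          # build KTagSet
        tag = nearest_open_tag(page, kw)
        if tag and tag not in key_tags:
            key_tags.append(tag)
    output, remaining = [], page
    for tag in key_tags:                         # split p by each key tag in turn
        parts = re.split(r"(?=<" + tag + r"[\s>])", remaining)
        kept = [part for part in parts if any(kw in part for kw in keywords)]
        output.extend(kept)
        for part in kept:                        # p = p - WPl
            remaining = remaining.replace(part, "", 1)
    return output

# Example with a hypothetical list structure like the one in figures 6 and 7:
# extract_parts("<ul><li>Research Interest: Web mining</li><li>Office hours</li></ul>",
#               ["Research Interest"])
# keeps only the <li> part that contains the keyword.

A full implementation would work on the parsed DOM tree rather than on the raw markup, but the control flow mirrors figure 5.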


The methodology described above is only the basic one. In order to enhance the extraction result, instead of collecting the keywords KW only from the different training sample pages, some equivalent and synonymous keywords are added. In our system, the additional keywords added to KWSet are selected from a lexical dictionary called WordNet [17], which provides semantic relations among words. For example, "teach" in WordNet 2.1 has the synonymous words "instruct" and "learn". The performance of the basic methodology and the extended one is discussed in the evaluation section.
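As an illustration of this keyword enhancement: the paper only states that synonyms are taken from WordNet, so the use of the NLTK interface below is our assumption, not part of the described system.

# Requires the WordNet corpus (nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

def expand_keywords(keyword, pos=wn.VERB):
    # Collect synonyms of the same word type (part of speech) only, to limit noise.
    synonyms = {keyword}
    for synset in wn.synsets(keyword, pos=pos):
        for lemma in synset.lemma_names():
            synonyms.add(lemma.replace("_", " "))
    return synonyms

# expand_keywords("teach") typically contains "instruct" and "learn", as in the example above.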

4 Evaluation

In order to evaluate our system in a more scientific way, the evaluation metrics recall and precision are used. Recall measures the amount of relevant information that the system correctly extracts, and precision measures the reliability of the extracted information [18]. In our experiment, the extraction record contains four fields:

FSet = {Telephone, Fax, TeachingCourse, ResearchInterest}.

There are 450 pages in total for the training process and 2100 pages for testing. All of them come from the staff information pages of different universities, including Boston University, Columbia University, Cornell University, the University of Maryland, New York University, Purdue University, the University of Macau and Yale University. In addition, some other pages are chosen using Google with the keyword "professor". All these pages are selected randomly for this experiment, in order to evaluate the typical performance. After the training process, the telephone, fax and email fields have a rather static structure, so the regular expression method is used for them; for teaching course and research interest, the DOM tree analysis methodology is used. Table 2 shows the average results of our experiments.

Since all the web pages are selected from different sources, their structures and contents are comparatively divergent. Using only regular expressions, it is reasonable that the recall rates for telephone and fax are around 0.7, while the precision rates reach more than 0.9. For extracting the teaching course and research interest information, comparing the two methodologies, the basic and the advanced one, it is obvious that the performance shows some improvement. However, the major advantage of the advanced approach is that, after enhancing the keyword set, most of the information is captured at once; since the added words come from a lexical dictionary, only synonymous words with the same word type are considered, so they do not add much "noise" to the keyword set. In addition, with the advanced approach it is possible to omit the training process: after the user enters a keyword, a set of synonymous words can be found in the dictionary, and those words are enough to extract some information.
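The two metrics used above are computed in the standard way; a minimal Python sketch for one field (not the authors' evaluation code, with made-up sample values) is:

def recall_precision(extracted, relevant):
    # extracted: values returned by the system; relevant: manually annotated correct values.
    extracted, relevant = set(extracted), set(relevant)
    correct = len(extracted & relevant)
    recall = correct / len(relevant) if relevant else 0.0
    precision = correct / len(extracted) if extracted else 0.0
    return recall, precision

# recall_precision({"a"}, {"a", "b"}) -> (0.5, 1.0)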


Table 2. Experimental results

             Regular Expression       DOM Tree Analyzing (Basic)           DOM Tree Analyzing (Advanced)
             Telephone    Fax         Teaching Course   Research Interest  Teaching Course   Research Interest
Recall       0.721622     0.696703    0.77593           0.875255           0.82098           0.878205
Precision    0.9821       0.952381    0.831667          0.559836           0.807409          0.562824

5 Conclusion and Future Work

In this paper, we have first discussed the history and current developments of web information extraction. We have then presented a detailed analysis of our approach, which relies on a human training process. In contrast to most existing extraction systems, our system uses different methodologies to extract the information, either through regular expressions or through analysis of the DOM tree. Furthermore, we have also proposed an advanced approach for the DOM tree analysis that enhances the keyword set. After the extraction process, the result is output in XML format.

However, some limitations remain. In the current system, the extraction task is based on individual pages only; that is, all the fields of the same record are supposed to be contained in the same page. In many other situations, the fields may be located in different but related pages, such as several linked web pages. In future work, we are going to extend our system to handle such multi-page extraction.

Acknowledgement. This work was supported in part by the University Research Committee under Grant No. RG069/05-06S/07R/GZG/FST and by the Science and Technology Development Fund of the Macao Government under Grant No. 044/2006/A.

References

1. Eikvil, L.: Information Extraction from World Wide Web – A Survey. Technical Report 945, Norwegian Computing Center, Oslo, Norway (July 1999)
2. Laender, A.H.F., Ribeiro-Neto, B.A., da Silva, A.S., Teixeira, J.S.: A brief survey of Web data extraction tools. ACM SIGMOD Record 31(2), 84–93 (2002)
3. Hammer, J., McHugh, J., Garcia-Molina, H.: Semistructured Data: The TSIMMIS Experience. In: Proc. First East-European Workshop on Advances in Databases and Information Systems (ADBIS 1997), St. Petersburg, Russia (1997)
4. Arocena, G., Mendelzon, A.: WebOQL: Restructuring Documents, Databases, and Webs. In: Proc. IEEE Intl. Conf. on Data Engineering (ICDE 1998), Orlando (February 1998)
5. Liu, L., Pu, C., Han, W.: XWRAP: An XML-enabled wrapper construction system for web information sources. In: Proceedings of the International Conference on Data Engineering (ICDE), pp. 611–621 (2000)


6. Crescenzi, V., Mecca, G., Merialdo, P.: RoadRunner: Towards automatic data extraction from large web sites. In: Proc. 27th Very Large Data Bases Conference (VLDB 2001), pp. 109–118 (2001)
7. Freitag, D.: Information Extraction from HTML: Application of a General Learning Approach. In: Proceedings of the 15th National Conference on Artificial Intelligence (AAAI 1998) (1998)
8. Soderland, S.: Learning Information Extraction Rules for Semi-structured and Free Text. Machine Learning 34, 233–272 (1999)
9. Etzioni, O., Cafarella, M., Downey, D., Kok, S., Popescu, A.-M., Shaked, T., Soderland, S., Weld, D.S., Yates, A.: Web-scale information extraction in KnowItAll (preliminary results). In: Proceedings of the 13th World Wide Web Conference, pp. 100–109 (2004)
10. Hsu, C.-N., Dung, M.-T.: Generating finite-state transducers for semi-structured data extraction from the web. Information Systems 23(8), 521–538 (1998)
11. Muslea, I., Minton, S., Knoblock, C.: Hierarchical wrapper induction for semistructured information sources. Autonomous Agents and Multi-Agent Systems 4(1/2), 93–114 (2001)
12. Adelberg, B.: NoDoSE - A Tool for Semi-Automatically Extracting Structured and Semistructured Data from Text Documents. In: Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, Seattle, Washington, June 1998, pp. 283–294 (1998)
13. Snoussi, H., Magnin, L., Nie, J.-Y.: Toward an Ontology-based Web Data Extraction (2002)
14. Buttler, D., Liu, L., Pu, C.: A Fully Automated Object Extraction System for the World Wide Web. In: Proceedings of the 21st International Conference on Distributed Computing Systems, pp. 361–370 (2001)
15. Papadakis, N.K., Skoutas, D., Raftopoulos, K., Varvarigou, T.A.: STAVIES: A System for Information Extraction from Unknown Web Data Sources through Automatic Web Wrapper Generation Using Clustering Techniques. IEEE Transactions on Knowledge and Data Engineering 17(12), 1638–1652 (December 2005)
16. Xiao, L., Wissmann, D.: Information Extraction from the Web: System and Techniques. Applied Intelligence 21, 195–224 (2004)
17. Miller, G.A., Beckwith, R., Fellbaum, C., Gross, D., Miller, K.: Introduction to WordNet: An On-line Lexical Database, Revised (August 1993)
18. Cardie, C.: Empirical methods in information extraction. AI Magazine 18(4), 65–80 (1997)

URL1: http://www.internetworldstats.com/ - Internet World Stats, Miniwatts Marketing Group
URL2: http://www.w3.org/TR/xhtml1/ - XHTML 1.0, W3C Recommendation
URL3: http://www.zvon.org/xxl/XSLTutorial/Output/contents.html - XSLT Tutorial
URL4: http://www.w3.org/TR/xslt.html - XSL Transformations (XSLT), W3C Recommendation
URL5: http://www.w3.org/TR/xmlschema-0/ - XML Schema Primer, W3C Working Draft
URL6: http://tidy.sourceforge.net/ - HTML Tidy Library Project
URL7: http://www.saxonica.com/ - Saxon XSLT Processor