
IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART C: APPLICATIONS AND REVIEWS, VOL. 41, NO. 6, NOVEMBER 2011

Data Extraction for Deep Web Using WordNet

Jer Lang Hong

Abstract—Our survey shows that the techniques used in data extraction from deep webs need to be improved to achieve the efficiency and accuracy of automatic wrappers. Further investigation indicates that a lightweight ontological technique built on an existing lexical database for English (WordNet) can check the similarity of data records and detect the correct data region with higher precision using the semantic properties of these data records. The advantages of this method are that it can extract three types of data records, namely, single-section data records, multiple-section data records, and loosely structured data records, and that it also provides options for aligning iterative and disjunctive data items. Experimental results show that our technique is robust and performs better than the existing state-of-the-art wrappers. Tests also show that our wrapper is able to extract data records from multilingual web pages and that it is domain independent.

Index Terms—Automatic wrapper, deep web, ontology.

I. INTRODUCTION

With the advent of information technology, a user can obtain relevant information from the World Wide Web, which contains a huge amount of information, simply and quickly by entering search queries [1], [6], [28], [30]. In response to the queries, database servers generate the information and deliver it directly to the user. The generated information forms the hidden web (deep web or invisible web) and is usually enwrapped in HyperText Markup Language (HTML) pages as data records. Due to the dynamic nature of the data records generated from the hidden web, current search engines (general or commercial) are unable to index these HTML pages; such pages are therefore termed deep web pages.

To facilitate human browsing, advertisers usually display their information (data records) using a predefined template. In a web page, data records normally form groups known as data regions. Data records generated from web databases are useful in meta search engine applications. However, for data records to be used in a meta search engine, they need to be extracted from the search engine results page and converted to a machine-readable form. To achieve this, a specialized program (a wrapper) is needed to identify the data records and extract them accordingly. Automatic wrappers are important for automating meta search engines and for evaluating and comparing shopping lists.

Manuscript received June 1, 2010; revised September 12, 2010; accepted October 19, 2010. Date of publication December 10, 2010; date of current version October 19, 2011. This paper was recommended by Associate Editor A. M. Tjoa. J. L. Hong is with the School of Information Technology, Monash University, Bandar Sunway 46150, Selangor, Malaysia (e-mail: david.hong@infotech.monash.edu.my). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TSMCC.2010.2089678

We study the problem of extracting data records from deep webs. The extraction of data records has received much attention in recent years, and most of the work focuses on developing wrappers that extract these data records accurately. Automatic wrappers developed currently rely on the HTML Document Object Model (DOM) tree and on additional visual cues from the browser rendering engine [3]–[5], [11], [14], [15], [17], [21], [22], [24], [32]–[35]. These wrappers use properties such as the regularity of structure in data records (i.e., repetitive HTML tags) and the location and size of the relevant data region (i.e., a centrally located data region) in their design. For more information on wrapper design, see [7].

When data records are extracted, they can be further partitioned into smaller units, usually referred to as data items. Data items need to be rearranged and tabulated for further use; this process is known as data alignment. Data records may contain iterative and disjunctive data items. Iterative data items are similar and occur repetitively, whereas disjunctive data items exist in some data records but not in all of them. For example, in the web site of a pet shop with dogs for sale, the data items "labrador retriever," "german shepherd," and "siberian husky" are iterative data items, as they have similar format and structure. The shop may also sell other pets, for example, cats; the data item "persian cat" is then a disjunctive data item, as it may not appear in every record shown in the web site. Current state-of-the-art wrappers are unable to align iterative data items [22], [38] because they treat the detected data items as separate entities without checking further whether these data items have similar parent HTML tags and similar tree structures. Taking the "pet" data items as an example, "labrador retriever," "german shepherd," and "siberian husky" should be aligned under three columns with a similar column name under "pet," but current wrappers will align them under three separate columns with different names. Current wrappers are also unable to align disjunctive data items correctly because these are optional items in some data records, and thus, the position at which they should be inserted into the template cannot be determined. Aligning data items is useful in differentiating similar and dissimilar entities; hence, it allows a more accurate grouping and classification of data items. Fig. 1 uses the Lycos search engine web site to show examples of a data item, a data record, and a data region.

Fig. 1. Data item, data record, and data region shown in Lycos web site (single-section data records).

Data records can be broadly divided into four groups: single-section data records, multiple-section data records, loosely structured data records, and unstructured data records. Single-section data records are by far the most common type. They exist in most current web pages and are generated from the database server using a fixed template [4], [15], [17] (see Fig. 1). Some data records are grouped and categorized when presented in the web site, that is, relevant data records sharing similar characteristics are grouped under the same category. These groups are called sections. As sections contain data records and can occur more than once, we call them multiple-section data records [14]. Multiple-section data records are highly irregular, as each section may have a different format compared to other sections. On the other hand, some data records follow a simple but strict rule for their pattern, and their internal structure can be flexible, provided that they do not violate this strict rule. Such data records are known as loosely structured data records [39]; examples are forums and blogs. The last group of data records is highly unstructured, that is, they have no specific format and layout for their structure [10] (e.g., a plain text document).

There are two types of deep web data that can be extracted using wrappers. The first is the list page, where a list of data records is generated from a search query and displayed as search results [4], [15], [17]; an example is the product page of Amazon. The second is the detailed page, where specific information for a product is generated for the user [29]. In this paper, we focus on the extraction of data records from the list page.

We develop a wrapper for the extraction and alignment of data records using a lightweight ontological technique as the main component, namely, for checking the similarity of data records. Many ontological techniques are currently available; some common ones are CYC [40], DOLCE [41], SUMO [42], and WordNet [8]. In this study, we use WordNet as the ontological tool for the extraction and alignment of data records. Our wrapper is called the Ontological Wrapper (OW). A preliminary version of this paper appeared in [19] and [20].

There are three main components in our wrapper design, namely, parsing, extraction, and alignment of data records. OW requires the deep web page to be parsed and stored in a DOM tree. This DOM tree is then passed through a few filtering stages; each filter is based on a particular heuristic technique. We propose an adaptive search technique to detect and label the different groups of potential data records. Our extraction module includes a few filtering rules to remove irrelevant information on a step-by-step basis. Irrelevant information, such as menu bars, menus, and advertisements, is removed based on the characteristics of the DOM tree and on conclusions made by others [4], [11], [15], [24], [34], [35], [38]. The main features of our extraction module are the similarity check and the scoring function. Our similarity check uses the semantic relations of data records instead of the visual information and tree structure used in conventional methods. Each filter is designed to filter out a particular group of irrelevant data until one data region containing the relevant data records remains. To locate the relevant data region, we use visual cues to measure the size of text and images in each data region. We believe that using an ontological technique makes a wrapper more robust in extracting data records. The flexibility and advantages of our wrapper in data extraction are as follows.

1) Unlike existing ontology-based wrappers [10], our wrapper can cover a much larger domain by using the lightweight ontological technique WordNet, which is closer to a thesaurus [23], as our extraction tool. In other words, our wrapper is domain independent.

2) The ontological technique we use is able to extract three types of data records: single-section data records, irregular multiple-section data records, and loosely structured data records. As these data records are semantically related, this significantly improves the efficiency of the wrapper. Unlike existing wrappers, which extract a specific type of data record, our wrapper is tailored to handle data records with varying structure and format. As shown in the experimental tests, our wrapper obtains better results than current state-of-the-art wrappers.

We are also of the opinion that data items in a data record are related semantically. In other words, similar data items (e.g., iterative data items) are not only similar in their DOM tree structures but also in their contents, while dissimilar data items (e.g., disjunctive data items) are dissimilar in their contents compared to other data items. We use regular expression rules of data records to align data items. These regular expression rules can also align disjunctive and iterative data items based on the DOM tree structures of the data records. However, some of these data items may have identical parent HTML tags; hence, current wrappers are not able to align them accordingly. To overcome this, we align disjunctive and iterative data items using an ontological technique. In this study, we use the word similarity check of WordNet to identify similar data items and rearrange them accordingly. Words that are synonymous in each data item are grouped, classified under the same category, and treated as iterative items. For example, using the pet shop web site mentioned earlier, our wrapper is able to identify the three data items ("labrador retriever," "german shepherd," and "siberian husky") as related to "pet" and consider them as similar entities, while treating the other data item ("persian cat") as a disjunctive data item, and insert them into the correct positions accordingly.

This paper is divided into several sections. Section II describes the current studies that are related to ours, and Section III describes the role of WordNet for our wrapper design.


Section IV gives the implementation details of our wrapper, Section V provides the experimental results, and Section VI concludes our work.

II. RELATED WORK

A. Data Extraction

1) Current Wrappers: Current automatic wrappers [except the ontology-assisted data extraction (ODE) wrapper [35]] do not use the semantic properties of data records in their design. Earlier automatic wrappers extract data records by checking the patterns inherent in data records using the DOM tree structure. For example, mining data region (MDR) [4] checks for repetitive HTML tags in order to locate data records, while EXALG uses equivalence classes (a template that represents all the data records) to determine the template of data records before data extraction is carried out. Generally, these wrappers are less accurate than the recent visual-assisted wrappers (wrappers that use visual cues in addition to DOM tree structures) because they are unable to use visual cues from the underlying browser rendering engine, which can provide reliable information for data extraction.

The state-of-the-art automatic wrappers are visual assisted, and they have proved to be more reliable and robust than automatic wrappers relying on HTML tags only. ViNT [15], for example, uses content lines (a rectangular box enclosing an HTML text node) for data extraction; a group of content lines may form a content block, which constitutes a data record. ViPER [22], on the other hand, enhances the algorithm of MDR by using primitive tandem repeats, which detect the repetitive sequence of HTML tags using a matrix. VSDR [24] extracts search engine results pages by detecting the centrally located data region in a web page.

2) Ontology-Assisted Wrappers: Wrappers also utilize ontology domains as part of their operations. With the exception of the ODE wrapper [35], which is fully automatic, most of them are specific to a particular domain and are not suitable for large-scale web comparisons. Embley [10] proposed an ontology-assisted wrapper that describes the data relationships, lexical appearance, and keywords. By using a predefined domain ontology, this wrapper parses the ontology and constructs a database schema for extracting data from unstructured documents. Snoussi et al. [13] proposed a wrapper that extracts data using a three-stage operation: 1) parse and convert an HTML page into XML format; 2) use the ontology to construct a data model; and 3) provide a mapping between XML elements and the ontology. Once the data are mapped, they can be further used by others. Using and creating an ontology is a time-consuming and tedious process; to overcome this limitation, various methods are used to automate or supervise the data extraction process. DeepMiner [36] uses a domain ontology to mark up web services; it generates the domain ontology from the data collected from the query web pages. Recently, the ODE wrapper [35] used an ontological technique to extract, align, and annotate data from search engine results pages. However, ODE requires training data to generate the domain ontology, and it is only able to extract a specific type of data record (single-section data records). Thus, it is not able to extract irregular data records, such as multiple-section data records and loosely structured data records.

3) WordNet: There are many ontological techniques available currently; some common ones are CYC [40], DOLCE [41], and SUMO [42]. WordNet [8] was developed in 1998 as a lightweight ontological technique (closer to a thesaurus); it is a lexical database for English used for the semantic matching of words in information retrieval research [23]. WordNet represents nouns, adverbs, verbs, and adjectives as groups of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are linked by means of conceptual semantic and lexical relations. A browser is used to manage and navigate the individual components in WordNet, which categorizes English words into several groups, such as hypernyms, synonyms, and antonyms. There are also several variants of WordNet developed for other languages; a notable example is EuroWordnet [43], which was developed for European languages.

Recently, research was carried out on matching words that are synonymous. In general, semantic matching of words can be divided into four categories. The first category measures the similarity of two terms (concepts) as a function of the length of the path between the terms and of their positions in the taxonomy [26], [37]. In the second category, similarity is measured by checking the difference in information content of the two terms using a probabilistic function [9], [25]; for this group, WordNet is used to calculate the probabilistic function. In the third category, the similarity of two terms is measured as a function of their properties (e.g., gloss overlap) or based on their relationships with other similar terms in the taxonomy. Finally, the last category measures the similarity of words by combining the methods mentioned earlier (hybrid) [27]. The common algorithms are those of Jiang and Conrath [16], Hirst and St-Onge [12], and Leacock and Chodorow [9]. A survey [2] was carried out to evaluate the effectiveness of these algorithms; it shows that the algorithm of Jiang and Conrath performs better than the other existing algorithms for matching two words (see the sketch at the end of this section).

B. Data Alignment

Zhai and Liu [38] proposed a partial tree alignment algorithm to insert data into the tree where necessary. Before data alignment is carried out, the tree structures of the data records are matched to determine the template for all the data records. Whenever two nodes are identical, they are considered matched, and their contents are used for further matching. The algorithm assumes that a data item can find a match in a tree structure for insertion; if this location cannot be found, the algorithm searches further to match the data item against a second one for insertion. This process is repeated until the last data item is chosen for insertion. The procedure becomes impractical if there is no match for any of the possible data items chosen. Furthermore, early insertion of data affects the state of the tree for future insertions.

Simon and Lausen [22] proposed a multiple sequence alignment (MSA) algorithm to align data based on maximal unique matches (MUM). MSA can align data records in polynomial time complexity, but finding the MUM requires extensive checking of the DOM tree structure. A MUM is also not suitable for representing data that are iterative. MSA further assumes that a MUM, once created, may contain more than one text element; however, text nodes in an HTML DOM tree are atomic entities and should therefore be considered as separate entities when used for data alignment.

ViDE [34] uses a visual-based approach for aligning data records. Unlike existing automatic wrappers, data items are differentiated using their visual properties rather than the DOM tree structure. Using the size and the relative and absolute positions of data items, the ViDE wrapper is able to align data records based on their size and position in the web page. Data items that are similar in size are grouped and categorized, and priority of alignment is given to data items located on top of and to the left of the data items under consideration.
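The Jiang and Conrath measure discussed above is available in off-the-shelf WordNet toolkits. The following is a minimal sketch using NLTK's WordNet interface with the Brown-corpus information-content file; both the library choice and the helper function jcn_score are our illustrative assumptions, not details from the paper.

```python
# Sketch of Jiang-Conrath word similarity via NLTK (an assumption of this
# sketch; the paper does not name its WordNet library). Requires the
# "wordnet" and "wordnet_ic" corpora:
#   python -m nltk.downloader wordnet wordnet_ic
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')  # information content from the Brown corpus

def jcn_score(word1, word2):
    """Return the best Jiang-Conrath similarity over all noun-sense pairs."""
    best = 0.0
    for s1 in wn.synsets(word1, pos=wn.NOUN):
        for s2 in wn.synsets(word2, pos=wn.NOUN):
            best = max(best, s1.jcn_similarity(s2, brown_ic))
    return best

print(jcn_score('dog', 'cat'))    # semantically close nouns score higher
print(jcn_score('dog', 'piano'))  # unrelated nouns score near zero
```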

III. WORDNET AS AN EXTRACTION TOOL

The objective of this study is to provide options to overcome the problems mentioned in the earlier sections. Our study shows that an existing lexical database for English (WordNet) can be used to check the meaning of words using the semantic relations between them. WordNet has frequently been used for information retrieval, and we believe that this principle is equally applicable to data extraction. Thus, the main aim of this study is to examine the possibility of incorporating the semantic properties of data records into a wrapper design. The intuition behind our approach is threefold.

1) Data regions containing similar DOM tree and visual properties but with different contents: Our observation is that data regions in current web pages may contain similar DOM tree structures but different contents. For example, as shown in Fig. 1, the contents of the data regions "Narrow Your Search" and "Expand Your Search" (solid rectangles) are presented using similar DOM tree structures (e.g., a list of hyperlinks for every data record), but these data regions have dissimilar contents. These data regions also have similar visual layouts. Therefore, current automatic wrappers designed based on the DOM tree and visual cues of data records are not able to differentiate them. Our study also indicates that the contents of these data regions are not similar. If we can find a way to represent and analyze the contents of these data regions, we will be able to differentiate them correctly, which is helpful in data extraction.

2) The use of WordNet to create the domain ontology: WordNet contains a huge amount of information (150 000 words organized in over 115 000 synsets for a total of 207 000 word-sense pairs) [8] that is useful for our study. Unlike existing wrappers that utilize an ontology domain for data extraction [10], [35], [36], our wrapper is able to cover a much larger web ontology domain by using WordNet. The extensive use of the WordNet library enables us to extract data records from deep web pages reliably using the semantic properties of these data records.

Fig. 2. Main components of OW.

3) Reducing the candidate data regions for extraction using WordNet: Using the semantic characteristics of data records has the advantage of reducing the number of potential data regions considered as candidates for extraction. Our investigation shows that when current automatic wrappers are used for data extraction, they consider data records having similar structure and visual properties as the potential candidates; the candidate data regions chosen for extraction then include menu bars, advertisements, and the search results. In our wrapper, we reduce this list by considering only data records that are related semantically. Thus, data regions such as menu bars are excluded from our evaluation, as they are not related semantically. For example, when a user enters the search query "Web," the search results will display information related to "Web." Other irrelevant data regions, such as advertisements, may also have "Web"-related information; however, the menu bars and navigation bars of the search results pages will not. Our wrapper is developed to identify words that generally occur in the data records and to extract the records accordingly based on the semantic properties of those words.

IV. ONTOLOGICAL WRAPPER

A. Overview of OW

For OW to start its operation, the HTML web pages are first parsed and stored in a DOM tree (see Fig. 2). In order to simplify our wrapper, we make the assumption that the page under extraction must contain at least three repetitive patterns; pages that do not meet this criterion are rejected, as an HTML page normally contains more than three repetitive patterns. The parsing phase involves parsing the HTML page and storing its content in a DOM tree.

The second component is the data extraction phase, in which four stages of filtering rules are applied (see Fig. 3). Once an HTML page is parsed, we use a search algorithm to detect data records based on repetitive patterns of HTML tags in a particular level of the DOM tree. We also use the algorithm of Jiang and Conrath [16] to measure the similarity of words for our wrapper; our observations indicate that data records in the relevant data region are related semantically. The algorithm of Jiang and Conrath measures the similarity of two texts by measuring the distance between the locations of the two texts in the WordNet hierarchical tree structure, normalized to a probabilistic function.

Fig. 3. Stages of the extraction module in OW.

Fig. 4. Data extraction in OW, where every node that occurs more than two times is considered a potential data record.

These data records are then fed into the filtering stages to remove the data records
that are not required. At the end of the filtering process, only one data region containing data records remains; this is the relevant data region, and it is used as input for the next component, data alignment.

OW uses regular expression rules to align data records, with allowance for disjunctive and iterative data items. Some data items share similar parent HTML tags; therefore, using the conventional DOM tree method to align them is not appropriate, as this information is not sufficient to determine disjunctive and iterative data items. OW first checks the repetitive HTML tags and labels these nodes as iterative. Using the word similarity check of WordNet available at http://grid.deis.unical.it/similarity/, OW determines the synonymy of each word for the two data items. The wrapper then assigns a weighting function to differentiate disjunctive from iterative data items: data items are considered iterative if they have a number of words that are synonymous, while dissimilar data items are considered disjunctive. For multilingual web pages, we use the study in [44] to support five different languages. We hope to extend our study in the future to support more dominant languages, such as other European languages and Russian.

B. Data Extraction

1) Overview: The extraction module of OW has four filtering stages. Each filtering stage has its own function and carries out a specific task to filter out irrelevant data records. The purpose of each filtering stage is to reduce the list of data regions until only the correct data region remains. The four filtering stages are described in the following sections.

2) Adaptive Search: This stage detects and labels the different groups of potential data records. A group of data records can be defined as a set of data records having a similar parent HTML tag, containing a repetitive sequence of HTML tags, and located in the same level of the DOM tree. OW uses the adaptive search technique to determine and label potential tree nodes that represent data records; subtrees that store data records may be contained in these potential tree nodes. The nodes in the same level of a tree are checked to determine their similarity (whether they have the same contents). If none of the nodes satisfies this criterion, the search goes one level lower and is performed again on all the lower level nodes. Our method involves the detection of repetitive nodes, which may contain data records, and the rearrangement of these nodes to form groups of potential records in a list, in two steps.

1) In a particular tree level, if there are more than two nodes and a particular node occurs more than two times in this level, OW treats it as a potential data record, irrespective of the distance between the nodes.

2) The potential data records identified in this tree level are then grouped and stored in a list. The potential data records in this list are identified by the notation [A1, A2, . . . , An], where A1 denotes the position at which a node in the potential data records first appears, A2 is the position where the same node appears the second time, and so on.

Fig. 4 shows an example where nodes A, B, and C are grouped and stored in list 1.
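A compact sketch of this two-step grouping follows; it is our illustration only, and the dictionary-based node structure and breadth-first traversal are assumptions, since the paper gives no code.

```python
# Illustrative sketch (not the paper's code) of the adaptive search step:
# at each DOM level with more than two nodes, any tag occurring more than
# two times is recorded together with the positions where it appears.
from collections import defaultdict

def adaptive_search(root):
    """Breadth-first scan; returns one {tag: [positions]} dict per level."""
    groups, level = [], [root]
    while level:
        positions = defaultdict(list)
        for i, node in enumerate(level):
            positions[node['tag']].append(i)  # record position of each tag
        if len(level) > 2:  # more than two nodes at this level
            repeated = {t: p for t, p in positions.items() if len(p) > 2}
            if repeated:
                groups.append(repeated)
        level = [child for node in level for child in node.get('children', [])]
    return groups

# Toy DOM: nodes are dicts {'tag': ..., 'children': [...]}
page = {'tag': 'BODY', 'children': [{'tag': 'TR'} for _ in range(4)]}
print(adaptive_search(page))  # [{'TR': [0, 1, 2, 3]}]
```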

3) Filtering Stages

a) Overview: The authors of [4], [11], [15], [24], [34], [35], and [38] on information extraction from web pages have pointed out several unique features inherent to a data record. We have also made several observations on the constitution of a data record. Based on these observations, we formulate simple statistics of a DOM tree to correctly extract data records. The following are the observations made by several authors as presented in their papers, together with our own.

Observation 1 [11], [15], [24], [34], [35]: The size of the data records in a deep web page is usually large in relation to the size of the whole page.

Observation 2 [4], [11], [38]: Data records in a deep web page usually occur more than three times in a given web page.

Our Observation 3: Data records in a deep web page are also related semantically, that is, they contain words that are similar in meaning.

Our Observation 4: Data records usually consist of several HTML tags that make up their tree structure.

We examine these four observations carefully and find that these criteria can be formulated using visual cues, statistical frequency measures, and the semantic properties of data records. Four steps of filtering rules are proposed, each taking the aforementioned observations into consideration. These observations are considered part of the requirements for the OW wrapper to extract data records from a web page.

Our examination shows that data regions fall into one of several groups. The first set of potential data regions is menus; this group of data regions determines the layout of HTML pages and is usually large in size and highly dissimilar. The second group is advertisements; regions of this group are highly similar but with simple structures. The third group consists of menu bars; these regions are simple but nearly similar in structure. It is the last group that is relevant to our study: the deep web data records; these regions are highly similar in structure and large in size. We aim to design our wrapper so that it extracts this last group of data regions while removing the other, irrelevant ones. We use filtering stage 1 to remove advertisements, filtering stage 2 to remove menus, which determine the layout of the HTML page, and finally filtering stage 4 to remove the remaining irrelevant data records. Filtering stage 3 is designed to remove data records that occur less frequently, as observed by Liu et al. [4].

b) HTML tags filter: To extract data records from the HTML web pages, the following filtering rules are needed; OW performs the filtering process based on the observations made earlier. Once the list of groups of data records is obtained from the adaptive search, stage 1 removes, from each and every group, data records that have fewer than three HTML tags. We notice that data records usually have more than three HTML tags (including closing tags) in the DOM tree structure used to present data in search engine results pages. Based on this observation, OW filters out those potential data records having fewer than three HTML tags. The purpose of this filtering stage is to remove advertisement-related information: we observe that an advertisement usually has a simple structure (usually a list of hyperlinks as its content). Removing these data records results in a faster execution time for the OW wrapper, as the subsequent filtering phases need not consider these data records anymore.

c) Similarity filter

i) Determining the similarity of nodes: Stage 2 is designed to extract data records by considering the similarity of their data contents. OW only compares the similarity of two data records when they are in the same group (with identical parent nodes). In Fig. 4, for example, the contents of the A nodes are compared, and the same approach is applied to the B nodes and the C nodes. However, it is not possible for OW to match the contents of nodes A and B, because A and B can be nodes that are either sections or data records; for example, A can be a section node and B a data record node, or A and B can both be data record nodes found in different sections. We observe that the relevant data region usually contains data records with nearly similar semantic properties; hence, these data records will likely have all their items retained after this filtering stage. Other data regions, such as the menus that determine the layout of an HTML page, have data records with dissimilar semantic properties. We aim to reduce these data records (menus) to a minimum of two items so that the filtering rule in stage 3 is able to remove them (by the end of this filtering stage, such a data region has two data records remaining, making it a candidate for removal in the next filtering stage).

ii) Clustering of data records: An efficient way to check the semantic similarity of data records is to classify and group the text nodes of the data records, because clustering the text nodes allows us to match the data records quickly and accurately. Our approach to grouping and clustering the text nodes of data records is to determine the ancestor nodes of these text nodes: text nodes with a similar set of ancestor nodes are classified as similar and stored in one cluster. For example, in Fig. 5, the text nodes "World Wide Web Consortium," "StatCounter Free invisible Web tracker, Hit counter and Web stats," and so on, are grouped in the same cluster, while the text nodes "http://www.w3.org," "http://www.statcounter.com/," etc., are grouped to form another cluster. Once clusters are formed, we can match data records in the same data region accordingly. The following section shows the process of matching the contents of data records.

Fig. 5. Example of data records related to "Web."
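The clustering step can be pictured as keying text nodes by their ancestor tag path. The sketch below is our illustration; the tag-path strings are invented for the example.

```python
# Sketch of the clustering step: text nodes are keyed by the tag path of
# their ancestors, so nodes rendered the same way cluster together
# ("titles" vs. "URLs" in the Fig. 5 example).
from collections import defaultdict

def cluster_text_nodes(nodes):
    """nodes: list of (tag_path, text) pairs, e.g. ('TR/TD/A', 'W3C...')."""
    clusters = defaultdict(list)
    for tag_path, text in nodes:
        clusters[tag_path].append(text)  # same ancestor path -> same cluster
    return clusters

nodes = [('TR/TD/A', 'World Wide Web Consortium'),
         ('TR/TD/A', 'StatCounter Free invisible Web tracker'),
         ('TR/TD/SPAN', 'http://www.w3.org'),
         ('TR/TD/SPAN', 'http://www.statcounter.com/')]
for path, texts in cluster_text_nodes(nodes).items():
    print(path, '->', texts)
```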

iii) Determination of data regions that are semantically similar using word matching: This similarity filter is designed based on Observation 3. The rule aims to remove data records whose contents are not similar to those of the other data records. Unlike existing wrappers, our wrapper checks the similarity of data records based on their contents rather than on tree structure and visual properties. Such a measurement results in more accurate data extraction, as the possibility of obtaining the correct data region is increased. To check the similarity of data records, we examine the individual words in each of these data records and note the number of occurrences of each word. Words that occur in all the data records are analyzed further: if these words occur in most of the data records, OW takes these data records as a valid data region; otherwise, the data records in the data region are discarded.

Referring to Fig. 5 as an example, there are seven data records. The search query entered by the user is "Web," and as can be seen from Fig. 5, all the data records contain search results related to "Web." OW takes all the texts in the data records and calculates the number of occurrences of each word. Note that words such as "is," "a," and "the" are not taken into account, as these words tend to occur in large quantities and are of no importance to the wrapper. The word "Web" occurs frequently in the data records; hence, OW treats this data region as valid and retains it for future processing.

To match the contents of data records, we use the tag path (the set of ancestor nodes) of a particular text node in a data record and match it with the tag paths of text nodes in other data records. Only text nodes within the same clusters are matched. For example, in Fig. 5, the sentence "World Wide Web Consortium" is matched with the sentence "StatCounter Free invisible Web tracker, Hit counter and Web stats." However, it is not possible for our wrapper to match the sentence "World Wide Web Consortium" with the sentence "http://www.statcounter.com/," as these sentences are found in different text nodes located in different levels of the DOM tree.

Besides matching words, our algorithm also checks for singular and plural nouns and past tense, applying the Snowball stemmer (available at http://snowball.tartarus.org/) to the words our wrapper attempts to match; if a match is found after stemming, we consider the words similar. We also use the algorithm of Jiang and Conrath [16] for matching words using WordNet to check for synonymous words encountered in data records. A threshold of 0.7 is adopted for two words to be considered synonymous, as the studies in [2] indicate that the matching of two words has to be above 0.7 based on human measurements. If a data region contains similar words for most of its data records, it is considered valid and is retained for the subsequent filtering stage.
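A hedged sketch of this region-validity check follows; the stopword list and the 0.6 "most of the records" fraction are our illustrative assumptions, since the paper fixes neither.

```python
# Sketch (our reading of the filter): stem each word, drop stopwords, and
# accept a region when some frequent word appears in most of its records.
from collections import Counter
from nltk.stem.snowball import SnowballStemmer

STOPWORDS = {'is', 'a', 'the', 'in', 'of', 'and'}  # small illustrative list
stem = SnowballStemmer('english').stem

def region_is_valid(records, fraction=0.6):
    """records: list of strings, one per data record in the region."""
    stems_per_record = [{stem(w) for w in r.lower().split() if w not in STOPWORDS}
                        for r in records]
    counts = Counter(s for stems in stems_per_record for s in stems)
    # valid if the most common stem occurs in at least `fraction` of records
    return bool(counts) and counts.most_common(1)[0][1] >= fraction * len(records)

region = ['World Wide Web Consortium', 'invisible Web tracker and Web stats',
          'The Web Standards Project']
print(region_is_valid(region))  # True: the stem "web" recurs in every record
```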

iv) Enhancement using sentence matching: Some data records contain phrases as part of their contents. For example, a pet shop web site selling dogs may have the two keywords "German Shepherd" and "Poodle," where the first keyword has two words and the second only one. To match these items, OW splits the text token in a data record into several sets of sentences (using the "." character as the separation point). OW then constructs sets of words of length between two and five words from each sentence. Once this is done, OW matches the sets of keywords to determine the data region that is semantically similar.

v) Disambiguation of words: The OW wrapper is also designed to check for and resolve the ambiguity of words. For example, the word "interest" in the sentences "Interest in bank" and "Interest in subject" has two different meanings, and the similarity of the sentences can be checked using the meaning of "interest" in them. To properly disambiguate words, we use the adapted Lesk algorithm presented in [31] to check whether a word that occurs concurrently in two sentences is synonymous or identical (an exact match) across them. If the words match, the words adjacent to them are then checked; if these are related, the sentences are treated as semantically similar. Otherwise, the sentences are treated as ambiguous. For example, the aforementioned sentences are treated as ambiguous because the word "interest," which occurs concurrently in both, has different meanings.
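NLTK ships an adapted-Lesk implementation that can stand in for this disambiguation step; the sketch below uses that substitute and is not the authors' implementation of [31].

```python
# Sketch of the disambiguation check using NLTK's adapted-Lesk function
# (nltk.wsd.lesk); requires the "wordnet" corpus.
from nltk.wsd import lesk

s1 = 'Interest in bank deposits grew'.split()
s2 = 'Interest in the subject grew'.split()

sense1 = lesk(s1, 'interest')  # sense chosen from the first context
sense2 = lesk(s2, 'interest')  # sense chosen from the second context

# Different senses for the shared word -> treat the sentences as ambiguous.
print(sense1, sense2, sense1 == sense2)
```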
d) Number of nodes filter: This stage is designed based on Observation 2. OW removes data regions containing data records that occur fewer than three times, as it is assumed that deep web pages contain at least three data records in their output under normal circumstances.

e) Relevant data region filter: OW has a list of data regions at the end of the three earlier filtering stages. We assume that these data regions contain data records that are visually, structurally, and semantically similar, as they survived filtering stages 1 to 3. Menus, which represent the layout of the web page, are removed in stage 2 of the filtering rules; the data regions still available in this filtering stage are advertisements and the data records relevant to our study. These data regions are assigned a scoring function, and the correct data region is the one with the largest score value.

We carry out filtering stage 4 based on Observation 1: the correct data region is the one that has a large amount of text and images compared to the other data regions. Therefore, the sizes of the texts and images of these data regions are measured based on their visual boundaries, that is, on the bounding boxes of the HTML text and HTML tags. Once the areas of the bounding boxes are ascertained, they are summed to give a final score value for each data region. As HTML separator nodes, such as <BR>, also contribute to the space occupied by a data record, the visual boundaries of these nodes are measured and added to the final score value. The scoring function of the OW wrapper is given as follows: for a data region x,

Score(x) = a + b + c

where a is the size of the texts in the data region, b is the size of the IMG tags, and c is the size of the HTML separator tags.
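As a sketch, the scoring function translates directly into code once bounding boxes are available from a rendering engine; the toy box tuples below are assumptions standing in for that geometry.

```python
# Each candidate data region is scored by the summed bounding-box areas of
# its text, image, and separator nodes; the largest score wins.
def score(region):
    """region: list of (kind, width, height), kind in {'text','img','sep'}."""
    a = sum(w * h for k, w, h in region if k == 'text')  # size of texts
    b = sum(w * h for k, w, h in region if k == 'img')   # size of IMG tags
    c = sum(w * h for k, w, h in region if k == 'sep')   # size of separators
    return a + b + c

regions = {'menu': [('text', 80, 20)] * 5,
           'results': [('text', 400, 60)] * 10 + [('img', 90, 90)] * 10}
print(max(regions, key=lambda r: score(regions[r])))  # 'results'
```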

4) Sections and Records Boundary: For the data records of the extracted relevant data region, OW partitions them based on their boundaries. Our datasets for single-section data records and loosely structured data records contain data records occurring repetitively in a single level of their tree structures; in such a case, these data records are partitioned according to their parent nodes (i.e., nodes A in Fig. 4). However, multiple-section data records are much more difficult to partition. For this study, we use the algorithm of [18] to partition multiple-section data records into their respective boundaries; details can be found in [18].

C. Data Alignment

1) Aligning Single-Section Data Records

a) Aligning data records using DOM tree: The extracted data records in the data region can be aligned for further use. The data items in each of the data records need to be rearranged and presented in tabular form for the user. We use the data alignment algorithm of the WISH wrapper [17] to align data records; the WISH data alignment algorithm is DOM tree based, and OW enhances it by incorporating WordNet to align iterative and disjunctive data items. This section presents the data alignment algorithm of WISH. No attempt is made to align loosely structured data records, as their contents are highly irregular and have no specific format.

OW checks the patterns of data records to determine the template to be used for the data records, with consideration also given to the subtemplates of subtrees. We use the tree structure of data records to merge similar tags located next to each other in the tree. Our observations indicate that data records contain nearly similar tree structures; therefore, we find it useful to match these tree structures and create a template based on the regular expression rule inferred to have generated the data records. Further observations show that an iterative statement (e.g., for, while) in the server scripts generates iterative data, while a selective statement (e.g., if) generates disjunctive data. Based on these observations, we use repetitive HTML tags and the subtree structures of these HTML tags to check for iterative data, and string alignment to check for disjunctive and optional data.

Fig. 6. Template detection.

Fig. 6 gives a simple example of how our template detection algorithm is used in our wrapper. The aim of the algorithm is to reconstruct the original schema used to generate the tree structures of the data records. While complete reconstruction of the original schema is not possible, due to missing information in the data records, we aim to construct a template nearly similar to the original schema. As shown in Fig. 6, OW uses the first data record to create an initial template and then enhances the template by adding dissimilar elements from the subsequent data records. When OW encounters two different nodes located in the same position in two different trees, it treats these nodes as disjunctive, provided that they have the same previous and subsequent elements. The general data alignment algorithm used in the OW wrapper is shown in Fig. 7. The general rules incorporated in OW to generate a data template for the different types and groups of data items are as follows.

1) Iterative data: In response to a user's queries, a database server usually generates a set of data records and embeds them in an HTML page for the user. As the server usually uses the same code to generate these data records, we can use a generalized rule to represent them. Our study shows that data records appear contiguously, so we are able to use the symbol ∗ to represent a repetitive pattern in the regular expression of data records.


Fig. 7. Template detection algorithm.

2) Optional and disjunctive data: Data records with optional and disjunctive data items need special treatment, and OW takes the characteristics of such data items into consideration. For example, given four data records with elements ABCCCD, ACCE, ABCCE, and ACE, where element B is an optional data item and elements D and E are disjunctive data items, we can create three regular expressions AB?C∗D, AC∗E, and AB?C∗E; from these, we can further generalize to form a final regular expression AB?C∗(D|E). To generalize the regular expression rule, we use string alignment to detect data records with disjunctive and optional patterns and data merging to handle iterative patterns. An alignment of two strings is carried out by appending the HTML tags in a particular level of a tree. Referring to Fig. 6, the first level of the tree in record 1 contains HTML tags in the sequence P, DIV, B, P, DIV, B, P, I; appending these HTML tags results in the string sequence <P><DIV><B><P><DIV><B><P><I>. Likewise, appending the HTML tags in record 2 forms the string sequence <P><DIV><B><P><DIV><B><P>. From these two sequences, we can determine the optional and disjunctive patterns by examining the positions of the individual HTML tags. Taking Fig. 8 as an illustrative example, the HTML tag <I> occurs in the first string but not in the second; therefore, we mark the <I> tag with the symbol "?" as an optional attribute. The resulting regular expression after applying the string alignment algorithm is <P><DIV><B><P><DIV><B><P><I>?.

Fig. 8. String alignment.

3) Groups of HTML tags, which are iterative data: To determine iterative patterns, we generalize the regular expression rules by detecting repetitive HTML tags [see point 1) above]. However, our observations on the structure of data records indicate that not only individual HTML tags but also groups of HTML tags may occur repetitively. Referring to Fig. 6, none of the HTML tags occurs repetitively in the sequences <P><DIV><B><P><DIV><B><P><I> and <P><DIV><B><P><DIV><B><P>; however, the group of HTML tags <P><DIV><B> occurs repetitively. Based on these patterns, we are left with the choice of treating the group of HTML tags <P><DIV><B> as iterative data, or the group <DIV><B><P> as another candidate for iterative data. In the actual case, the former pattern is the right one, while the latter is incorrect. To determine the correct pattern, our template detection algorithm analyzes the tree structures of the HTML tags in the patterns to check for regularity. For our example, the grouping <P><DIV><B> has similar tree structures for all the similar HTML tags (e.g., the first <P> tag has a tree structure similar to that of the second <P> tag). However, after checking the grouping <DIV><B><P>, we find that the first <P> tag has a tree structure consisting of an HTML text node (USD), while the second <P> tag has a tree structure consisting of an HTML tag containing a subtree with an HTML text node (add to cart). Due to the difference in the tree structures of these two HTML tags, we conclude that the grouping <P><DIV><B> contains iterative data, while the grouping <DIV><B><P> does not.

TABLE I. DATA ALIGNMENT IN OW

Fig. 9. Application of data alignment using two trees.

Fig. 10. Resulting template for the two trees in Fig. 9.

Fig. 11. Web site of a pet shop.

4) Subtree of HTML tags: Once the iterative, optional, and disjunctive data are considered, we generalize our regular expression rule and apply it to the remaining tree structures of the first-level HTML tags (i.e., the rules are applied to the remaining levels of HTML tags in the data records). In OW, if an element appears more than once but the two occurrences are not located next to each other in a template, they are treated as different elements. For example, for the data records ABBCDBE and ABCFBBE, OW takes the same elements located next to each other as similar, so the regular expressions for the data records are AB∗CDBE and ABCFB∗E. When the two data records are merged, the result is a new template AB∗C(D|F)B∗E. OW treats the two similar B elements as different elements because they are located in positions not next to each other, even though they have the same identity.

Once the template has been obtained as described earlier, data alignment is carried out. The nodes of a tree are labeled using the notation [A1, A2, A3, . . . , Ab], where Ab represents the position of the node in the tree starting with the left-most node in level l, Ab−1 is the position of the parent node one level higher, Ab−2 is the label of the node one level higher than the parent node, and so on. In Fig. 9, there are two trees with two different data records and four text elements. OW uses the template detection algorithm to generate a template for the two data records, as shown in Fig. 10; nodes A–F are determined from the generated template. The results of data alignment are summarized in Table I: row 1 shows the column names for each of the text nodes of the data records, row 2 is the aligned data for data record 1, and row 3 is the aligned data for data record 2. OW aligns each data record in turn, starting from the first, referring to the template generated from all the data records to be aligned. Take, for example, Text 1 (see Fig. 10): the column name assigned to it is (1:1:2:1), as it is in the first position in level 1 (node A), the first position in level 2 (node B), the second position in level 3 (node E; there is a node D to the left of E in level 3), and the first position in level 4. The other elements are aligned using the same principles.
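A minimal sketch of the string alignment used for optional data follows, built on Python's difflib as a stand-in for the paper's own alignment routine; the paper's algorithm additionally handles disjunctions and iterations, which this simplification omits.

```python
# Tags present in one record but absent in the other become optional ("?").
from difflib import SequenceMatcher

def align_optional(seq1, seq2):
    template = []
    for op, i1, i2, j1, j2 in SequenceMatcher(a=seq1, b=seq2).get_opcodes():
        if op == 'equal':
            template += seq1[i1:i2]
        else:  # tags missing on one side become optional
            template += [t + '?' for t in seq1[i1:i2] + seq2[j1:j2]]
    return template

rec1 = ['<P>', '<DIV>', '<B>', '<P>', '<DIV>', '<B>', '<P>', '<I>']
rec2 = ['<P>', '<DIV>', '<B>', '<P>', '<DIV>', '<B>', '<P>']
print(''.join(align_optional(rec1, rec2)))
# <P><DIV><B><P><DIV><B><P><I>?
```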

b) Aligning iterative and disjunctive data items using WordNet: Iterative and disjunctive data items may have similar parent HTML tags, but the contents of iterative data items are similar while those of disjunctive data items may not be (see Fig. 12). Current automatic wrappers are unable to handle these unusual data items, as their data alignment algorithms are based on DOM tree properties. To deal with iterative data items, OW checks for repetitive HTML tags; once these tags are identified, OW determines whether each of them contains HTML text nodes in its tree structure. Iterative data items usually share a nearly similar tree structure containing text nodes, with repetitive parent nodes as the topmost nodes of the tree structure. As repetitive HTML tags may not contain iterative data items, it is appropriate to check the tree structure and the number of HTML text nodes to locate iterative data items. Once iterative data items are found, their contents are analyzed further to determine their similarity: individual words are checked using the algorithm of Jiang and Conrath [16] (the word similarity check for WordNet). OW treats two data items as iterative if they contain a number of words that are synonymous; words are considered synonymous if the value returned by Jiang and Conrath's algorithm for the pair is above 0.7, a threshold adopted because the studies in [2] indicate that the matching of two words has to be above 0.7 based on human measurements.

Fig. 12. Iterative and disjunctive data in data records.

Referring to Fig. 11 (an HTML page) and Fig. 12 (part of the DOM tree for the HTML page in Fig. 11), there are five repetitive HTML tags with individual text contents in data record 1, and four in data record 2. Each of these font tags contains a text element in its content; therefore, OW labels these nodes as iterative. OW then further examines these nodes by checking the synonymy of each word in the text components of these nodes.
A word similarity check indicates that the data items enclosed in the middle rectangular box are iterative, as the words "Golden Retriever," "German Shepherd," and "Siberian Husky" (marked by rectangular boxes) are nearly similar in meaning, while the data items enclosed in the first rectangular box are not. The third rectangular box contains a disjunctive data item, as this data item is not located in the second data record (see Fig. 12).

For the alignment of disjunctive data items, we use a modified version of the string alignment presented in Section IV-C1a: instead of checking for different HTML tag structures in a string sequence, we check for data items with different contents in their text nodes. We use the algorithm of Jiang and Conrath [16] (the word similarity check for WordNet) to measure the similarity of the contents of each data item; however, it is important to resolve any ambiguity of words using the adapted Lesk algorithm [31] before the algorithm of Jiang and Conrath is applied. A sequence of data items is then constructed by appending the lists of the contents of the data items in data record 1, and a corresponding list is constructed for data record 2. These two lists are compared by checking the synonymy of the contents of the corresponding data items using the algorithm of Jiang and Conrath. From these lists, we are able to determine that the data item appearing only in data record 1 is a disjunctive data item, as it does not exist in data record 2. The ascertained disjunctive data items are aligned accordingly, as shown in Table II; the data alignment procedure is similar to that of Section IV-C1a.

TABLE II. DATA ALIGNMENT AFTER APPLYING WORDNET

c) Ambiguities in word matching: The algorithm of Jiang and Conrath [16] is able to match two words that are synonymous, where these words are verbs, adjectives, or adverbs. However, there are a number of words that OW is not able to recognize. For example, some book web sites contain iterative data items that include the price of a book as part of their contents, and such tokens are not defined in WordNet. To match these words, OW checks the pattern of each word in the iterative data items: a price usually has the character "$" followed by digits in the form XX.xx, while an ISBN has a format such as XX-XX-XXXX. If iterative data items have similar patterns for their contents, OW treats these iterative data items as valid.
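This fallback pattern check is easy to sketch with regular expressions; the exact patterns below are our guesses at the "$...XX.xx" and "XX-XX-XXXX" formats described above, not the wrapper's actual rules.

```python
import re

PATTERNS = {
    'price': re.compile(r'^\$\d+\.\d{2}$'),      # e.g. $12.99
    'isbn':  re.compile(r'^\d{2}-\d{2}-\d{4,}$'),  # e.g. 12-34-567890
}

def pattern_class(token):
    """Return the name of the first pattern the token matches, else None."""
    for name, rx in PATTERNS.items():
        if rx.match(token):
            return name
    return None

items = ['$12.99', '$7.50', '12-34-567890']
print([pattern_class(t) for t in items])  # ['price', 'price', 'isbn']
```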

2) Aligning Multiple-Sections Data Records: OW aligns multiple-section data records on a bottom-up basis: it aligns the data records first, then groups similar data records into sections and arranges them in tabular form. As there can be several sections in a web page, OW may generate several tables at the end of the data alignment process. We use the algorithm presented in [18] to differentiate sections and data records, and OW uses the HTML tags that represent sections as separation points between sections. For the first case (see [18, Fig. 3]), OW can easily align sections in tabular form, as each section is located under a different node. For the second case (see [18, Fig. 4]), OW aligns sections using the separator tag as the separation point between sections. Finally, for the last case (see [18, Fig. 5]), OW uses tree edit distance to detect the different subtrees within each node of the tree: nodes that contain different subtrees are assumed to represent sections, while nodes that exist repetitively and have nearly similar subtrees are assumed to be data records. OW then aligns and partitions sections based on these characteristics.
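To make the list-comparison idea of Section IV-C1b concrete, the following sketch aligns two records' item lists with a pluggable similarity predicate standing in for the WordNet check; the greedy matching strategy and the toy predicate are our simplifications.

```python
# Items matched by the predicate align in the same column (iterative);
# unmatched items are kept with an empty cell (disjunctive).
def align_items(items1, items2, similar):
    aligned, j = [], 0
    for item in items1:
        if j < len(items2) and similar(item, items2[j]):
            aligned.append((item, items2[j]))  # iterative: same column
            j += 1
        else:
            aligned.append((item, None))       # disjunctive: only in record 1
    aligned += [(None, it) for it in items2[j:]]  # leftovers from record 2
    return aligned

dogs = {'golden retriever', 'german shepherd', 'siberian husky', 'poodle'}
similar = lambda a, b: a in dogs and b in dogs  # toy stand-in for WordNet
rec1 = ['golden retriever', 'german shepherd', 'siberian husky', 'on sale!']
rec2 = ['german shepherd', 'poodle', 'siberian husky']
for row in align_items(rec1, rec2, similar):
    print(row)
```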


V. EXPERIMENTAL TESTS


TABLE III RESULTS OBTAINED FROM DATASET 1, USING OW, VINT, AND DEPTA

A. Preparation of Datasets

The datasets used in this study are taken from deep web pages. Three types of datasets are used: single-section data records, multiple-sections data records, and loosely structured data records. These datasets are divided into seven groups, datasets 1–7, containing 150, 119, 50, 51, 380, 50, and 50 web pages, respectively. The distribution of data in each dataset varies, ranging from academic sites and general sites to governmental sites. Web pages in the first dataset are randomly chosen from the Internet, with preference given to complicated search engine web pages (web pages with many data regions, say more than 15). Datasets 1 and 6 are publicly available at http://hawksbill.infotech.monash.edu.my/~jlhong/OW.html. The second and third datasets are taken from the ViNT testbed, available at http://www.data.binghamton.edu:8080/vints/testbed.html. The fourth dataset is TBDW v1.02, obtained from http://daisen.cc.kyushu-u.ac.jp/TBDW/. Datasets 2–4 consist of simple web sites with only a few data regions (i.e., fewer than ten). The first dataset is restricted to search engine web sites that do not appear in the second, third, or fourth datasets. We prefer complicated web pages that contain many potential data regions, so that the performance of our wrapper can be compared against existing ones; dataset 1 contains 132 complicated and 18 simple web pages. The purpose of this dataset is to test whether OW can extract the correct data records from complicated web pages and how well it performs compared with other wrappers. We also choose web sites that contain data regions, such as menus, which determine the layout of the HTML page. These menus have the characteristics of a data record and are considered potential candidates for data extraction; they serve as indicators for evaluating how well OW removes irrelevant data records. The datasets contain distinct web pages: no web page chosen for one dataset occurs in any other. Datasets 2 and 3 were used initially to test the ViNT wrapper, which performed well on them; they therefore serve as a useful indicator of the accuracy and reliability of our wrapper. The fourth dataset was used to test ViPER [22]; its purpose here is to show how well our wrapper performs compared with ViNT [15] and DEPTA [38] on a neutral, publicly available dataset. Dataset 5 is taken from the study of the MSE wrapper [14], with 380 web pages containing multiple-sections data records (38 web sites with 10 sample pages each). Finally, dataset 6 includes 50 web pages randomly collected from forums and blogs; the chosen pages contain semistructured data records. We carry out our experimental tests on datasets containing single-section data records, multiple-sections data records, and loosely structured data records.

For the test on single-section data records, we compare our study with ViNT [15] and DEPTA [38], and for the test on multiple-sections data records, we compare our study with MSE [14] and WISH [17]. We also carry out a test to show that OW accurately extracts loosely structured data records; for these, we do not compare against other state-of-the-art wrappers, as the executable programs of [39] and [35] are not publicly available. For data alignment, we align only single-section and multiple-sections data records, as loosely structured data records are highly irregular and follow no specific rule or format. We also show that our wrapper is domain independent by testing it on multilingual deep web pages (dataset 7).

The measurement of the wrapper's efficiency is based on three quantities: the number of actual data records to be extracted, the number of data records extracted from the test cases, and the number of correct data records among those extracted. From these values, recall and precision are calculated as follows:

Recall = (Correct / Actual) × 100
Precision = (Correct / Extracted) × 100.
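The two measures are straightforward to compute; the sketch below applies them to counts for a single test page, where the numbers are invented purely for illustration.

    # Recall and precision as defined above; the counts are hypothetical.
    def recall(correct, actual):
        return correct / actual * 100

    def precision(correct, extracted):
        return correct / extracted * 100

    # e.g., 148 records on the page, 150 extracted, 145 of those correct.
    print(recall(145, 148))     # ~97.97
    print(precision(145, 150))  # ~96.67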
B. Data Extraction Results

1) Overview of Data Extraction: For data extraction, we used all of the datasets in our experiments. OW takes about 400 ms on average to generate a result for a web page. The experimental results are as follows.

a) Single-section data records:

i) Dataset 1: OW achieves high recall and precision rates (see Table III). The results of OW are compared with those of ViNT [15] and DEPTA [38]; OW performs better than both in terms of recall and precision. Dataset 1 contains mostly complicated web pages with numerous data regions whose characteristics resemble those of the correct data region. Such web pages degrade the performance of ViNT and DEPTA, which are not designed to extract data from them. Because OW uses a visual cue that measures text and image sizes, it can distinguish the correct data region from incorrect ones; this approach works well on these web pages, so OW extracts the data records of this dataset more accurately. OW also extracts data records based on their semantic properties, which reduces the number of candidate data regions considered during extraction and thereby increases the chance of extracting the relevant data region. In addition, OW solves the problem of extracting search identifiers in WISH [17]: search identifiers are not semantically related to the relevant data records, so OW treats them as irrelevant and removes them.


TABLE IV RESULTS OBTAINED FROM DATASET 2, USING OW, VINT, AND DEPTA

TABLE VI RESULTS OBTAINED FROM DATASET 4, USING OW, VINT, AND DEPTA

TABLE V RESULTS OBTAINED FROM DATASET 3, USING OW, VINT, AND DEPTA

TABLE VII DATASET 5 (MULTIPLE-SECTIONS DATA RECORDS)

ii) Dataset 2: OW also achieves excellent results when tested on dataset 2 (see Table IV), with better recall and precision rates than ViNT [15] and DEPTA [38]. This experiment shows that OW performs at least as well as ViNT and better than DEPTA on datasets prepared by the ViNT authors, which contain simple web pages. Similar to dataset 1, dataset 2 contains sample pages with search identifiers whose tree structures resemble those of the relevant data records. However, these identifiers are not semantically related to the relevant data records; hence, they are removed by the similarity filter of OW.

iii) Dataset 3: OW also performs well on dataset 3 (see Table V), again achieving better recall and precision rates than ViNT [15] and DEPTA [38]. As with dataset 2, this experiment shows that OW performs at least as well as ViNT and better than DEPTA on the ViNT authors' datasets of simple web pages. Most of the sample pages in this dataset contain highly structured data records; hence, OW extracts them accurately.

iv) Dataset 4: OW outperforms DEPTA in terms of recall on a neutral, publicly available dataset that contains simple web pages (see Table VI). Dataset 4 contains sample pages with structurally similar data regions, whose presence affects the extraction accuracy of both ViNT and DEPTA. A number of sample pages in this dataset contain irrelevant data with tree structures similar to those of the relevant data records. Our earlier wrapper WISH [17] is not able to remove these irrelevant data, as they are visually and structurally similar to the relevant ones; OW, however, removes them, as they are not semantically related to the relevant data records.

b) Multiple-sections data records: For the multiple-sections dataset (dataset 5), OW takes about 900 ms on average to generate a result. Current wrappers are designed to extract data records that are structurally and visually similar. This is a limitation, as some web pages contain irregular data records, such as multiple-sections data records. OW, however, is able to extract irregular multiple-sections data records because they are related semantically, and this helps to overcome the limitations of wrappers that rely on the DOM tree and visual cues.

TABLE VIII DATASET 6 (LOOSELY STRUCTURED DATA RECORDS)

TABLE IX DATASET 7 (MULTILINGUAL WEB PAGES)

As shown in Table VII, OW outperforms MSE [14] and WISH [17] in both recall and precision when extracting multiple-sections data records. Unlike MSE, OW uses a single page for data extraction and does not require a section boundary marker (an HTML text representing the heading or ending of a section).

c) Loosely structured data records: Our wrapper is also able to extract loosely structured data records (see Table VIII). Unlike current wrappers, which detect regularity in the structure of data records, our wrapper detects data records by measuring their semantic similarity. This is an advantage, as our wrapper can extract data records with varying structure and visual layout.

d) Multilingual dataset: The OW wrapper achieves high accuracy in extracting data records from deep web pages in the multilingual dataset (see Table IX). However, there are several cases where OW is unable to locate the correct data region. The MLSN WordNet library [44] is still new, and the Java interface [45] provided for it is a general interface developed for five languages. These five languages involve different terms and syntaxes; therefore, the mapping of the WordNet similarity code to this library is not the same as the mapping to the English WordNet library.



TABLE X RESULT OF DATA ALIGNMENT FOR WISH AND OW

Nevertheless, the OW wrapper is able to extract all three types of deep web data records accurately, which is not supported by current wrappers.

2) Data Extraction Results Summary: Experimental results show that, when our wrapper is used to extract data records from complicated web pages, it achieves higher precision and recall rates. Our wrapper uses an ontological technique to check the similarity of data records. This technique filters out irrelevant data regions, such as menus, which determine the layout of an HTML page, and it also reduces the number of candidates for data extraction, resulting in higher extraction accuracy. The OW wrapper is also able to extract irregular data records, such as multiple-sections data records and loosely structured data records. Unlike conventional wrappers, which use the DOM tree and the visual properties of data records, OW uses the semantic properties of data records to extract multiple-sections and loosely structured data records. OW is also able to extract data from multilingual deep web pages accurately. Our future work includes developing a better Java interface that maps the MLSN library more efficiently for our use.

C. Data Alignment Results

1) Single-Section Data Records: Datasets 1–4 are used to evaluate the performance of our wrapper in data alignment. We compare the data items in each column of the table to determine the recall and precision rates. If a column has its content aligned correctly, we treat that column as correctly aligned and extracted [38]; otherwise, we treat the column as containing data items that are correctly extracted but not correctly aligned. The actual value indicates the number of columns present in a web page. We evaluate the performance of the OW data alignment algorithm by comparing it with the state-of-the-art wrapper WISH [17]. As shown in Table X, OW outperforms WISH in both recall and precision when tested on datasets 1–4. This can be attributed to the ability of OW to handle iterative and disjunctive data items using WordNet. A sketch of this column-level evaluation follows.
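To make the column-level evaluation concrete, the sketch below counts a column as correct only when all of its cells match the expected contents; the data layout and the exact-match test are assumptions made for illustration.

    # Hypothetical sketch of the column-level alignment evaluation.
    def column_scores(expected_cols, aligned_cols):
        """Each argument is a list of columns; a column is a list of cell strings."""
        correct = sum(1 for col in aligned_cols if col in expected_cols)
        actual = len(expected_cols)    # columns present in the web page
        extracted = len(aligned_cols)  # columns produced by the wrapper
        return correct / actual * 100, correct / extracted * 100  # recall, precision

    r, p = column_scores(
        [['labrador', 'german shepherd'], ['$10.00', '$12.00']],
        [['labrador', 'german shepherd'], ['$10.00', '$12.00'], ['ad', 'ad']])
    print(round(r, 1), round(p, 1))  # 100.0 66.7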
2) Multiple-Sections Data Records: Several wrappers are able to align data records [22], [38], but none of them can align multiple-sections data records. Once multiple-sections data records are extracted, OW aligns them in tabular form. We evaluate the efficiency of the data alignment algorithm using two measurements, namely, recall and precision. A section is considered correctly extracted and aligned if its data records are correctly aligned in the table that represents that section.
TABLE XI RESULTS OF DATA ALIGNMENT FOR DATASET 5

Otherwise, the data are treated as correctly extracted but not correctly aligned. The actual value signifies the number of sections that should be aligned. Table XI shows that OW produces high recall and precision rates.

VI. CONCLUSION

Our proposed ontological technique extracts data records with varying structures effectively. Experimental results show that our wrapper is robust and significantly outperforms existing state-of-the-art wrappers. Our wrapper distinguishes data regions based on the semantic properties of data records rather than on the DOM tree structure and visual properties. Unlike existing wrappers, which work on a specific type of data records, our wrapper is able to distinguish and extract three types of data records, and, most important of all, it is domain independent. The ontological technique also reduces the number of potential data regions considered for data extraction, which shortens the time and increases the accuracy of identifying the correct data region to be extracted. Measuring the sizes of text and images to locate and extract the relevant data region further improves the precision of our wrapper. The use of the ontological technique for aligning data records is highly effective for disjunctive and iterative data items, which are not supported by current wrappers. Our ontology-based wrapper is tailored to extract data records with varying structures; it thus provides more flexibility and is simpler to use for the extraction of complicated data records. Tests also show that OW is able to extract data records from multilingual web pages.

REFERENCES

[1] C. Alcaraz and J. Lopez, “A security analysis for wireless sensor mesh networks in highly critical systems,” IEEE Trans. Syst., Man, Cybern., vol. 40, no. 4, pp. 419–428, Jul. 2010.
[2] A. Budanitsky and G. Hirst, “Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures,” in Proc. NAACL, 2001, pp. 29–34.



[3] A. Arasu and H. Garcia-Molina, “Extracting structured data from web pages,” presented at the ACM SIGMOD Conf., San Diego, CA, 2003.
[4] B. Liu, R. Grossman, and Y. Zhai, “Mining data records in web pages,” presented at the ACM SIGKDD Conf., Washington, DC, 2003.
[5] B. Liu and Y. Zhai, “NET—A system for extracting web data from flat and nested data records,” in Proc. WISE, 2005, pp. 487–495.
[6] S. H. Choi, Y.-S. Jeong, and M. K. Jeong, “A hybrid recommendation method with reduced data for large-scale application,” IEEE Trans. Syst., Man, Cybern., vol. 40, no. 5, pp. 557–599, Sep. 2010.
[7] C.-H. Chang, M. Kayed, M. R. Girgis, and K. Shaalan, “A survey of web information extraction systems,” IEEE Trans. Knowl. Data Eng., vol. 18, no. 10, pp. 1411–1428, Oct. 2006.
[8] C. Fellbaum, WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press, 1998.
[9] C. Leacock and M. Chodorow, Combining Local Context and WordNet Similarity for Word Sense Identification. Cambridge, MA: MIT Press, 1998.
[10] D. W. Embley, D. M. Campbell, Y. S. Jiang, S. W. Liddle, and D. W. Lonsdale, “Conceptual-model-based data extraction from multiple-record web pages,” Data Knowl. Eng., vol. 31, pp. 227–251, 1999.
[11] G. Miao, J. Tatemura, W.-P. Hsiung, A. Sawires, and L. E. Moser, “Extracting data records from the web using tag path clustering,” in Proc. ACM WWW, 2009, pp. 981–990.
[12] G. Hirst and D. St-Onge, Lexical Chains as Representations of Context for the Detection and Correction of Malapropisms. Cambridge, MA: MIT Press, 1998.
[13] H. Snoussi, L. Magnin, and J.-Y. Nie, “Heterogeneous web data extraction using ontology,” in Proc. Agent-Oriented Inf. Syst., 2001, pp. 99–110.
[14] H. Zhao, W. Meng, and C. Yu, “Automatic extraction of dynamic record sections from deep web,” presented at the ACM VLDB Conf., Seoul, Korea, 2006.
[15] H. Zhao, W. Meng, Z. Wu, V. Raghavan, and C. Yu, “Fully automatic wrapper generation for search engines,” in Proc. ACM WWW, 2005, pp. 66–75.
[16] J. J. Jiang and D. W. Conrath, “Semantic similarity based on corpus statistics and lexical taxonomy,” in Proc. Int. Conf. Res. Comput. Linguist., 1997, pp. 19–33.
[17] J. L. Hong, E. Siew, and S. Egerton, “Information extraction for search engines using fast heuristic techniques,” Data Knowl. Eng., vol. 69, pp. 169–196, 2010.
[18] J. L. Hong, E. Siew, and S. Egerton, “WMS—Extracting multiple sections data records from search engine results pages,” in Proc. ACM SAC, 2010, pp. 1696–1701.
[19] J. L. Hong, E. Siew, and S. Egerton, “Aligning data records using WordNet,” in Proc. IEEE CRD, 2010, pp. 56–60.
[20] J. L. Hong, “Deep web data extraction,” presented at the IEEE SMC Conf., Istanbul, Turkey, Oct. 2010.
[21] J. Wang and F. H. Lochovsky, “Data extraction and label assignment for web databases,” presented at the ACM WWW Conf., Budapest, Hungary, 2003.
[22] K. Simon and G. Lausen, “ViPER: Augmenting automatic information extraction with visual perceptions,” presented at the ACM CIKM Conf., Bremen, Germany, 2005.
[23] O. Lassila and D. McGuinness, “The role of frame-based representation on the semantic web,” Knowl. Syst. Lab., Stanford Univ., Stanford, CA, Tech. Rep. KSL-01-02, 2001.
[24] L. Li, Y. Liu, A. Obregon, and M. A. Weatherston, “Visual segmentation-based data record extraction from web documents,” in Proc. IEEE IRI, 2007, pp. 502–507.
[25] P. W. Lord, R. D. Stevens, A. Brass, and C. A. Goble, “Investigating semantic similarity measures across the Gene Ontology: The relationship between sequence and annotation,” Bioinformatics, vol. 19, pp. 1275–1283, 2003.
[26] R. Rada, H. Mili, E. Bicknell, and M. Blettner, “Development and application of a metric on semantic nets,” IEEE Trans. Syst., Man, Cybern., vol. 19, no. 1, pp. 17–30, Jan./Feb. 1989.
[27] M. Rodriguez and M. Egenhofer, “Determining semantic similarity among entity classes from different ontologies,” IEEE Trans. Knowl. Data Eng., vol. 15, no. 2, pp. 442–456, Mar./Apr. 2003.

[28] A. Rodriguez, W. A. Chaovalitwongse, L. Zhe, H. Singhal, and H. Pham, “Master defect record retrieval using network-based feature association,” IEEE Trans. Syst., Man, Cybern., vol. 40, no. 3, pp. 319–329, May 2010.
[29] S. Zheng, R. Song, J.-R. Wen, and C. L. Giles, “Efficient record-level wrapper induction,” presented at the ACM CIKM Conf., Hong Kong, 2009.
[30] C. Silva, U. Lotric, B. Ribeiro, and A. Dobnikar, “Distributed text classification with an ensemble kernel-based learning approach,” IEEE Trans. Syst., Man, Cybern., vol. 40, no. 3, pp. 287–297, May 2010.
[31] S. Banerjee and T. Pedersen, “Extended gloss overlaps as a measure of semantic relatedness,” in Proc. IJCAI, 2003, pp. 805–810.
[32] V. Crescenzi, G. Mecca, and P. Merialdo, “RoadRunner: Towards automatic data extraction from large web sites,” in Proc. VLDB, 2001, pp. 109–118.
[33] W. Liu, X. Meng, and W. Meng, “Vision-based web data records extraction,” in Proc. ACM WebDB, 2006, pp. 20–25.
[34] W. Liu, X. Meng, and W. Meng, “ViDE: A vision-based approach for deep web data extraction,” IEEE Trans. Knowl. Data Eng., vol. 22, no. 3, pp. 447–460, Mar. 2009.
[35] W. Su, J. Wang, and F. H. Lochovsky, “ODE: Ontology-assisted data extraction,” ACM Trans. Database Syst., vol. 34, no. 2, pp. 1–35, 2009.
[36] W. Wu, A. Doan, C. Yu, and W. Meng, “Bootstrapping domain ontology for semantic web services from source web sites,” in Proc. VLDB Workshop, 2005, pp. 11–22.
[37] Z. Wu and M. Palmer, “Verb semantics and lexical selection,” in Proc. Annu. Meet. Assoc. Comput. Linguist., 1994, pp. 133–138.
[38] Y. Zhai and B. Liu, “Web data extraction based on partial tree alignment,” in Proc. ACM WWW, 2005, pp. 76–85.
[39] Y. Wu, J. Chen, and Q. Li, “Extracting loosely structured data records through mining strict patterns,” in Proc. IEEE ICDE, 2008, pp. 1322–1324.
[40] (2010). [Online]. Available: http://www.cyc.com
[41] (2010). [Online]. Available: http://www.loa-cnr.it/DOLCE.html
[42] (2010). [Online]. Available: http://www.ontologyportal.org/
[43] (2010). [Online]. Available: http://www.illc.uva.nl/EuroWordNet/
[44] (2010). [Online]. Available: http://two.dcook.org/software/mlsn/about/download.html
[45] (2010). [Online]. Available: http://nlpwww.nict.go.jp/wn-ja/index.en.html#downloads

Jer Lang Hong received the B.Sc. degree in computer science from the University of Nottingham, Nottingham, U.K., in 2005 and the Ph.D. degree from Monash University, Melbourne, Australia, in 2010. He is currently a Lecturer with the School of Information Technology, Monash University. His research interests include information extraction and automated data extraction. He is the author and coauthor of several Association for Computing Machinery (ACM)/IEEE conference papers. Dr. Hong is a Program Committee Member/Reviewer for several IEEE conferences and journals, such as the IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS. He has also won several student travel grants for IEEE/ACM conferences during his Ph.D. study.
