. Likewise, we can append the HTML tags in record 2 to form a second string sequence. From these two sequences, we can then determine the optional and disjunctive patterns by examining the position of each individual HTML tag. Taking Fig. 8 as an illustrative example, the HTML tag occurs in the first string but does not occur in the second string; therefore, we mark that HTML tag with the symbol "?" to denote an optional attribute. The resulting regular expression after applying the string alignment algorithm is ?. The alignment step is sketched below.
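To make the optional-pattern step concrete, here is a minimal sketch that aligns two tag sequences and marks tags present in only one record with "?". Python, the use of `difflib`, and the tag names are our illustrative assumptions; the paper does not describe OW's alignment at code level, and the actual tags of Fig. 8 did not survive into the text.

```python
from difflib import SequenceMatcher

def align_tag_sequences(seq1, seq2):
    """Align two HTML-tag sequences and mark tags that appear in only one
    record as optional ("?"), yielding a regex-like template."""
    template = []
    for op, i1, i2, j1, j2 in SequenceMatcher(a=seq1, b=seq2).get_opcodes():
        if op == "equal":                       # tag present in both records
            template.extend(seq1[i1:i2])
        else:                                   # tag present in only one record
            # (a "replace" op is really a disjunction; marking both
            # sides optional is a simplification)
            for tag in seq1[i1:i2] + seq2[j1:j2]:
                template.append(tag + "?")
    return template

# Illustrative tag sequences; the actual tags of Fig. 8 are not
# recoverable from the text.
record1 = ["<tr>", "<td>", "<font>", "<br>", "</td>", "</tr>"]
record2 = ["<tr>", "<td>", "<font>", "</td>", "</tr>"]
print("".join(align_tag_sequences(record1, record2)))
# -> <tr><td><font><br>?</td></tr>
```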
3) Groups of HTML tags, which are iterative data: To determine iterative patterns, we need to generalize the regular expression rules by detecting repetitive HTML tags (see point 1) above). However, our observations on the structure of data records indicate that not only may individual HTML tags occur repetitively, but a group of HTML tags may also occur repetitively. Referring to Fig. 6, none of the HTML tags occurs repetitively in either of the two string sequences. However, we can see that a group of HTML tags occurs repetitively. Based on these patterns, we are left with a choice between two candidate groupings of the HTML tags as iterative data. In the actual case, the former pattern is the right one, while the latter is incorrect. To determine the correct pattern, our template detection algorithm analyzes the tree structures of the HTML tags in the candidate patterns and checks for regularity. For our example, examining the string sequence, we find that the first HTML tag has the tree structure of an HTML text node (USD), while the second HTML tag has the tree structure of an HTML tag containing a subtree with an HTML text node (add to cart). Because the tree structures of these two HTML tags differ, we can conclude that the string sequence does not contain iterative data. The regularity check is sketched below.
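This sketch reduces each candidate subtree to its tag skeleton and accepts a repeated group as iterative only if all occurrences share the same skeleton. The tuple-based node representation and the tag names are assumptions for illustration; OW's internal data structures are not specified in the paper.

```python
def shape(node):
    """Reduce a DOM subtree to its structural skeleton: tag names only,
    with every text node collapsed to '#text'. Nodes are assumed to be
    (tag, [children]) tuples."""
    tag, children = node
    if tag == "#text":
        return "#text"
    return (tag, tuple(shape(child) for child in children))

def is_iterative(occurrences):
    """Treat a repeated group of HTML tags as iterative data only if all
    of its occurrences share the same tree structure."""
    return len({shape(node) for node in occurrences}) == 1

# The two cells of the running example: one holds plain text ("USD"),
# the other wraps its text in a further tag ("add to cart"); the tag
# names are illustrative.
cell_price = ("td", [("#text", [])])            # USD
cell_cart = ("td", [("a", [("#text", [])])])    # add to cart
print(is_iterative([cell_price, cell_cart]))    # False -> not iterative
print(is_iterative([cell_price, cell_price]))   # True  -> iterative
```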
TABLE I DATA ALIGNMENT IN OW
Fig. 9. Application of data alignment using two trees.
Fig. 10. Resulting template for the two trees in Fig. 9.
Fig. 11. Web site of a pet shop.
4) Subtree of HTML tags: Once the iterative, optional, and disjunctive data are considered, we can generalize our regular expression rule and apply it to the remaining tree structures of the first-level HTML tags (i.e., we apply the rules to the remaining levels of HTML tags in the data records). In OW, two identical elements that are not located next to each other in a template are treated as different elements. For example, for the data records ABBCDBE and ABCFBBE, OW treats identical elements located next to each other as similar, so the regular expressions for the two data records are AB∗CDBE and ABCFB∗E. When the two data records are merged, the result is a new template AB∗C(D|F)B∗E. OW treats the two similar B elements as different elements, as they are located in positions that are not adjacent, even though they have the same identity.

Once the template has been obtained as described earlier, data alignment is carried out. The nodes of a tree are labeled using the notation [A1, A2, A3, ..., Ab], where Ab is the position of the node among the nodes of its level b, counted from the left-most node of that level; Ab−1 is the corresponding position of the parent node one level higher; Ab−2 is the position of the node one further level above the parent; and so on. In Fig. 9, there are two trees with two different data records and four text elements. OW uses the template detection algorithm to generate a template for the two data records, as shown in Fig. 10. Nodes A–F are determined from the generated template. The results of data alignment are summarized in Table I. Row 1 of Table I shows the column names for the text nodes of the data records. Row 2 is the aligned data for data record 1, and row 3 shows the aligned data for data record 2. OW aligns each data record in a set starting from the first one, referring to the template generated from all the data records to be aligned. A minimal sketch of how such positional labels can be computed is given below.
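In this sketch, the `Node` class is an assumed stand-in for OW's internal template representation, and the example tree is a guess chosen to be consistent with the label (1:1:2:1) described next.

```python
class Node:
    """Assumed stand-in for a node in OW's template tree."""
    def __init__(self, name, children=()):
        self.name, self.children = name, list(children)
        self.parent = None
        for child in self.children:
            child.parent = self

def path_label(root, target):
    """Label `target` with (A1:A2:...:Ab): for the node and each of its
    ancestors, the 1-based position among all nodes of that level,
    counted from the left-most node of the level."""
    position, frontier = {}, [root]
    while frontier:                              # breadth-first, level by level
        for i, node in enumerate(frontier, start=1):
            position[id(node)] = i
        frontier = [c for node in frontier for c in node.children]
    chain, node = [], target
    while node is not None:                      # walk up to the root
        chain.append(position[id(node)])
        node = node.parent
    return "(" + ":".join(str(p) for p in reversed(chain)) + ")"

# A guessed template shaped so that Text 1 gets the label described in
# the text: node D sits to the left of node E in level 3.
text1 = Node("Text 1")
E = Node("E", [text1])
D = Node("D")
B = Node("B", [D, E])
root = Node("A", [B, Node("C")])
print(path_label(root, text1))   # -> (1:1:2:1)
```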
Take, for example, Text 1 (see Fig. 10): the column name assigned to it is (1:1:2:1), as it is in the first position in level 1 (node A), the first position in level 2 (node B), the second position in level 3 (node E; there is a node D to the left of E in level 3), and the first position in level 4. The other elements are aligned using the same principles.

b) Aligning iterative and disjunctive data items using WordNet: Iterative and disjunctive data items may have similar parent HTML tags, but the contents of iterative data items tend to be similar, whereas those of disjunctive data items may not be (see Fig. 12). Current automatic wrappers are unable to handle these unusual data items, as their data alignment algorithms are based on DOM tree properties. To deal with iterative data items, OW checks for repetitive HTML tags. Once these tags are identified, OW determines whether each of them contains HTML text nodes in its tree structure. Iterative data items usually share nearly identical tree structures that contain text nodes, with the repetitive parent nodes as the topmost nodes of those structures. As repetitive HTML tags may not contain iterative data items, it is appropriate to check the tree structure and the number of HTML text nodes to locate iterative data items. Once iterative data items are found, their contents are analyzed further to determine their similarity. Individual words are checked and analyzed using the algorithm of Jiang and Conrath [16] (a word similarity check for WordNet). OW treats two data items as iterative if they contain a number of synonymous words. Words are considered synonymous if the value returned by Jiang and Conrath's algorithm for a pair of words is above 0.7. The threshold of 0.7 is adopted because the studies in [2] indicate that, based on human judgments, the matching score of two words has to be above 0.7. Referring to Fig. 11 (an HTML page) and Fig. 12 (part of the DOM tree for the HTML page in Fig. 11), there are five repetitive HTML tags with individual text contents in data record 1 and four in data record 2. Each of these font tags contains a text element in its content; therefore, OW labels these nodes as iterative. OW then examines these nodes further by checking the synonymy of each word in the text content of each of these nodes. A sketch of the word similarity check follows.
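The sketch below shows how the Jiang–Conrath check might be realized with NLTK's WordNet interface. The NLTK calls are real, but their use as OW's implementation is our assumption; note also that NLTK's `jcn_similarity` returns an unnormalized score (the reciprocal of the JCN distance), so the paper's 0.7 threshold should be read as illustrative here rather than as the exact quantity the author thresholded.

```python
# Requires: pip install nltk; then nltk.download("wordnet") and
# nltk.download("wordnet_ic") for the Brown information-content file.
from nltk.corpus import wordnet as wn, wordnet_ic

brown_ic = wordnet_ic.ic("ic-brown.dat")

def jcn_score(word1, word2, pos=wn.NOUN):
    """Best Jiang-Conrath similarity over all synset pairs of two words.
    NLTK's value is 1 / (JCN distance) and is NOT normalized to [0, 1]."""
    best = 0.0
    for s1 in wn.synsets(word1, pos=pos):
        for s2 in wn.synsets(word2, pos=pos):
            best = max(best, s1.jcn_similarity(s2, brown_ic))
    return best

def synonymous(word1, word2, threshold=0.7):
    """Illustrative version of the paper's 0.7 rule; the paper's
    threshold presumably applies to a normalized score."""
    return jcn_score(word1, word2) >= threshold

print(synonymous("dog", "puppy"))
```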
Fig. 12. Iterative and disjunctive data in data records.
A word similarity check indicates that the data items enclosed in the middle rectangular box are iterative, as the words "Golden Retriever," "German Shepherd," and "Siberian Husky" (marked by rectangular boxes) are nearly similar in meaning, while the data items enclosed in the first rectangular box are not. The third rectangular box contains a disjunctive data item, as this data item is not present in the second data record (see Fig. 12).

For the alignment of disjunctive data items, we use a modified version of the string alignment presented in Section IV-C1a. Instead of checking for different HTML tag structures in a string sequence, we check for data items with different contents in their text nodes. We use the algorithm of Jiang and Conrath [16] (a word similarity check for WordNet) to measure the similarity of the contents of each data item. However, it is important to disambiguate the words using the adapted Lesk algorithm [31] before the algorithm of Jiang and Conrath [16] is applied. A sequence of data items is then constructed by appending the contents of the data items in data record 1; similarly, a second list is constructed from the contents of the data items in data record 2. These two lists are compared by checking the synonymy of the contents of corresponding data items using the algorithm of Jiang and Conrath [16]. From these lists, a data item is determined to be disjunctive if it has no counterpart in data record 2. The ascertained disjunctive data items are aligned accordingly, as shown in Table II. The data alignment procedure is similar to that of Section IV-C1a, and the comparison is sketched below.
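A minimal sketch of the disjunctive-item comparison follows. The `is_synonymous` argument stands in for the WordNet check sketched earlier; `naive_match` and the example contents are placeholders, since the actual item lists of Fig. 12 did not survive into the text.

```python
def find_disjunctive(items1, items2, is_synonymous):
    """Flag items of data record 1 that have no synonymous counterpart
    anywhere in data record 2; those items are disjunctive."""
    return [item for item in items1
            if not any(is_synonymous(item, other) for other in items2)]

# Crude stand-in for the WordNet check (shared-word overlap), for
# demonstration only; OW would use the Jiang-Conrath test instead.
def naive_match(a, b):
    return bool(set(a.lower().split()) & set(b.lower().split()))

# Illustrative contents; the actual item lists of Fig. 12 are not
# recoverable from the text.
record1 = ["Golden Retriever", "USD 500", "Sale"]
record2 = ["German Shepherd", "USD 750"]
print(find_disjunctive(record1, record2, naive_match))
# -> ['Golden Retriever', 'Sale']; a WordNet-based check would match
#    the two dog breeds and leave only 'Sale' as disjunctive.
```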
TABLE II DATA ALIGNMENT AFTER APPLYING WORDNET
c) Ambiguities in word matching: The algorithm of Jiang and Conrath [16] is able to match two words that are synonymous; these words are verbs, adjectives, or adverbs. However, there are a number of words that OW is not able to recognize. For example, some book web sites contain iterative data items that include the price of a book as part of their contents, and such a value is not defined in WordNet. To match these words, OW checks the pattern of each word in the iterative data items. For example, a price usually has the character "$" followed by digits of the form XX.xx, while an ISBN has the format XX-XX-XXXX. If the iterative data items have similar patterns in their contents, OW treats them as valid. Such a pattern check is sketched below.
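Below is a rough sketch of the pattern check. The regular expressions follow the formats quoted in the text (a real system would need a richer pattern library), and the pattern names are our own.

```python
import re

# Hand-written patterns for values WordNet cannot cover; the formats
# follow the paper's examples.
PATTERNS = {
    "price": re.compile(r"^\$\d+\.\d{2}$"),        # "$" then XX.xx
    "isbn": re.compile(r"^\d{2}-\d{2}-\d{4}$"),    # XX-XX-XXXX
}

def pattern_type(text):
    """Name of the first pattern `text` matches, or None."""
    for name, regex in PATTERNS.items():
        if regex.match(text):
            return name
    return None

def valid_iterative(items):
    """Accept iterative data items only if all share one known pattern."""
    types = {pattern_type(item) for item in items}
    return len(types) == 1 and None not in types

print(valid_iterative(["$12.99", "$7.50", "$120.00"]))  # True  -> valid
print(valid_iterative(["$12.99", "12-34-5678"]))        # False -> rejected
```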
2) Aligning Multiple-Sections Data Records: OW aligns multiple-sections data records on a bottom-up basis. It aligns the data records first; then, it groups similar data records into sections and arranges them in tabular form. As there can be several sections in a web page, OW may generate several tables at the end of the data alignment process. We use the algorithm presented in [18] to differentiate sections and data records. OW uses the HTML tags that represent sections as separation points between sections. For the first case (see [18, Fig. 3]), OW can easily align sections in tabular form, as each section is located under a different node. For the second case (see [18, Fig. 4]), OW aligns sections using the tag as a separation point between sections. Finally, for the last case (see [18, Fig. 5]), OW uses tree edit distance to detect the different subtrees within each node of the tree. Nodes that contain different subtrees are assumed to represent sections, while nodes that occur repetitively and have nearly similar subtrees are assumed to be data records. OW then aligns and partitions sections based on these characteristics, as sketched below.
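As a sketch of the last case, the code below uses Yang's simple tree matching as a cheap stand-in for full tree edit distance: sibling subtrees whose normalized matching score is high are taken to be repeated data records, while dissimilar subtrees mark section boundaries. The 0.8 threshold and the tuple-based tree representation are assumptions.

```python
def simple_tree_match(t1, t2):
    """Yang's simple tree matching: size of the maximum matching between
    two trees, used here as a stand-in for tree edit distance. Trees are
    (tag, [children]) tuples, an assumed representation."""
    tag1, kids1 = t1
    tag2, kids2 = t2
    if tag1 != tag2:
        return 0
    m, n = len(kids1), len(kids2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]   # DP over child sequences
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1],
                           dp[i - 1][j - 1]
                           + simple_tree_match(kids1[i - 1], kids2[j - 1]))
    return 1 + dp[m][n]

def size(tree):
    return 1 + sum(size(child) for child in tree[1])

def nearly_similar(t1, t2, threshold=0.8):
    """Sibling subtrees above the (assumed) threshold are treated as
    repeated data records; dissimilar ones suggest section boundaries."""
    return 2 * simple_tree_match(t1, t2) / (size(t1) + size(t2)) >= threshold

record_a = ("div", [("h4", []), ("p", [])])
record_b = ("div", [("h4", []), ("p", [])])
section = ("div", [("table", [("tr", [])])])
print(nearly_similar(record_a, record_b))   # True  -> data records
print(nearly_similar(record_a, section))    # False -> section boundary
```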
V. EXPERIMENTAL TESTS
TABLE III RESULTS OBTAINED FROM DATASET 1, USING OW, VINT, AND DEPTA
A. Preparation of Datasets

The datasets used in this study are taken from deep web pages. Three types of datasets are used to test our study: single-section data records, multiple-sections data records, and loosely structured data records. These datasets are divided into seven groups: dataset 1 (150 pages), dataset 2 (119 pages), dataset 3 (50 pages), dataset 4 (51 pages), dataset 5 (380 pages), dataset 6 (50 pages), and dataset 7 (50 pages). The distribution of data in each dataset varies, ranging from academic and general sites to governmental sites. The web pages of the first dataset are randomly chosen from the Internet, with preference given to complicated search engine web pages (web pages with many data regions, say more than 15). Datasets 1 and 6 are publicly available at http://hawksbill.infotech.monash.edu.my/~jlhong/OW.html. The second and third datasets are taken from the ViNT testbed, available at http://www.data.binghamton.edu:8080/vints/testbed.html. The fourth dataset is TBDW v1.02, obtained from http://daisen.cc.kyushu-u.ac.jp/TBDW/. Datasets 2–4 consist of simple web sites with only a few data regions (i.e., fewer than ten). The first dataset is restricted to search engine web sites that do not appear in the second, third, and fourth datasets. We prefer to choose complicated web pages that contain many potential data regions in order to evaluate the performance of our wrapper against existing ones. There are 132 complicated and 18 simple web pages in dataset 1. The purpose of this dataset is to test whether OW is able to extract the correct data records from complicated web pages and how well it performs compared with other wrappers. We also choose web sites that contain data regions, such as menus, which determine the layout of the HTML page. These menus have the characteristics of a data record and are therefore potential candidates for data extraction; they are the indicators used to evaluate the performance of OW in removing irrelevant data records. The datasets contain different web pages: no web page chosen for one dataset occurs in any other dataset.

Datasets 2 and 3 were used initially to test the performance of the ViNT wrapper, and ViNT performed well on them; testing our wrapper on these datasets is therefore a useful indicator of its accuracy and reliability. The fourth dataset was used to test ViPER [22]; its purpose here is to see how well our wrapper performs compared with ViNT [15] and DEPTA [38] on a neutral, publicly available dataset. Dataset 5 is taken from the study of the MSE wrapper [14], with 380 web pages containing multiple-sections data records (38 web sites with 10 sample pages each). Finally, dataset 6 includes 50 web pages randomly collected from forums and blogs; these pages contain semistructured data records.

We carry out our experimental tests on datasets containing single-section data records, multiple-sections data records, and loosely structured data records. For the test on single-section data records,
we compare our study with ViNT [15] and DEPTA [38], and for the test on multiple-sections data records, we compare our study with MSE [14] and WISH [17]. We also carry out a test to show that OW is able to accurately extract loosely structured data records. For loosely structured data records, we do not compare our study with other state-of-the-art wrappers, as the executable programs of [39] and [35] are not publicly available. For data alignment, we only align single-section and multiple-sections data records, as loosely structured data records are highly irregular and do not follow a specific rule or format. We also show that our wrapper is domain independent by testing it on multilingual deep web pages (dataset 7).

The measurement of a wrapper's efficiency is based on three factors: the number of actual data records to be extracted, the number of data records extracted from the test cases, and the number of correct data records extracted from the test cases. Based on these three values, recall and precision are calculated according to the following formulas:

Recall = (Correct/Actual) × 100
Precision = (Correct/Extracted) × 100.

B. Data Extraction Results

1) Overview of Data Extraction: For data extraction, we used all six datasets in our experiments. OW takes about 400 ms on average to generate a result for a web page. The experimental test results are as follows.

a) Single-section data records:

i) Dataset 1: OW achieves high recall and precision rates (see Table III). The results from OW are compared with those of ViNT [15] and DEPTA [38]. OW performs better than ViNT and DEPTA in terms of both recall and precision rates. Dataset 1 contains mostly complicated web pages with numerous data regions that have characteristics similar to the correct data region. These web pages affect the performance of ViNT and DEPTA, as those wrappers are not designed to extract data from such pages. As OW uses a visual cue that measures text and image sizes, it is able to distinguish the correct data region from incorrect ones. This approach works well on these web pages; hence, OW is able to extract the data records of this dataset more accurately. OW extracts data records based on their semantic properties; hence, it also reduces the number of candidate data regions considered in data extraction, which increases the chance of extracting the relevant data region. OW also solves the problem of extracting search identifiers in WISH [17]: search identifiers are not semantically related to the relevant data records, so OW treats them as irrelevant and removes them accordingly.
TABLE IV RESULTS OBTAINED FROM DATASET 2, USING OW, VINT, AND DEPTA
TABLE V RESULTS OBTAINED FROM DATASET 3, USING OW, VINT, AND DEPTA
TABLE VI RESULTS OBTAINED FROM DATASET 4, USING OW, VINT, AND DEPTA
TABLE VII DATASET 5 (MULTIPLE-SECTIONS DATA RECORDS)
ii) Dataset 2: OW also achieves excellent results when tested on dataset 2 (see Table IV), with better recall and precision rates than ViNT [15] and DEPTA [38]. This experiment shows that OW is able to perform as well as ViNT and better than DEPTA on datasets prepared by the ViNT authors, which contain simple web pages. Similar to dataset 1, dataset 2 contains sample pages with search identifiers whose tree structures are similar to those of the relevant data records. However, these identifiers are not semantically related to the relevant data records; hence, they are removed accordingly by OW's similarity filter.

iii) Dataset 3: The result of OW on dataset 3 is good (see Table V). OW again achieves better recall and precision rates than ViNT [15] and DEPTA [38]. As with dataset 2, this experiment shows that OW is able to perform as well as ViNT and better than DEPTA on datasets prepared by the ViNT authors, which contain simple web pages. Most of the sample pages in this dataset contain highly structured data records; hence, OW is able to extract them accurately.

iv) Dataset 4: OW outperforms DEPTA in terms of recall rates on a neutral, publicly available dataset that contains simple web pages (see Table VI). Dataset 4 contains sample pages with structurally similar data regions, whose presence affects the extraction accuracy of both ViNT and DEPTA. A number of sample pages in this dataset contain irrelevant data with tree structures similar to those of the relevant data records. Our earlier wrapper WISH [17] is not able to remove these irrelevant data, as they are similar visually and structurally; however, OW is able to remove them, as they are not semantically related to the relevant data records.

b) Multiple-sections data records: For dataset 5, OW takes about 900 ms on average to generate a result. Current wrappers are designed to extract data records that are similar structurally and visually. This is a limitation, as some web pages may contain irregular data records, such as multiple-sections data records. However, OW is able to extract irregular multiple-sections data records, as they are related semantically, and this helps to overcome the limitations of wrappers using
TABLE VIII DATASET 6 (LOOSELY STRUCTURED DATA RECORDS)
TABLE IX DATASET 7 (MULTILINGUAL WEB PAGES)
DOM trees and visual cues. As shown in Table VII, OW outperforms MSE [14] and WISH [17] in terms of both recall and precision rates when extracting multiple-sections data records. Unlike MSE, OW uses a single page for data extraction and does not require a section boundary marker (an HTML text representing the heading or ending of a section).

c) Loosely structured data records: Our wrapper is also able to extract loosely structured data records (see Table VIII). Unlike current wrappers, which detect the regularity in the structure of data records, our wrapper detects data records by measuring their semantic similarity. This is an advantage, as our wrapper is able to extract data records with varying structure and visual layout.

d) Multilingual dataset: The OW wrapper achieves high accuracy in extracting data records from deep web pages in the multilingual dataset (see Table IX). However, there are several cases where OW is unable to locate the correct data region in the deep web pages. The MLSN WordNet library [44] is still new, and the Java interface [45] provided to port to this library is a general interface developed for five languages. The five languages supported by the MLSN library have different terms and syntaxes; therefore, the mapping of the WordNet similarity code to this library is not the same as the mapping to the
TABLE X RESULT OF DATA ALIGNMENT FOR WISH AND OW
English WordNet library. However, OW is able to accurately extract three different types of deep web pages, which current wrappers do not support.

2) Data Extraction Results Summary: Experimental results show that when our wrapper is used for extracting data records from complicated web pages, it achieves higher precision and recall rates. Our wrapper also uses an ontological technique to check the similarity of data records. This technique filters out irrelevant data regions, such as menus, which determine the layout of an HTML page, and also reduces the number of candidates for data extraction, resulting in higher accuracy in data extraction. The OW wrapper is also able to extract irregular data records, such as multiple-sections data records and loosely structured data records. Unlike conventional wrappers, which use the DOM tree and visual properties of data records, OW uses the semantic properties of data records to extract multiple-sections data records and loosely structured data records. The OW wrapper is also able to extract data from multilingual deep web pages accurately. Our future studies include developing a better Java interface that maps the MLSN library more efficiently for our use.

C. Data Alignment Results

1) Single-Section Data Records: Datasets 1–4 are used to evaluate the performance of our wrapper in data alignment. We compare the data items in each column of the table to determine the recall and precision rates. If a column has its content aligned correctly, we treat that column as correctly aligned and extracted [38]; otherwise, we treat the column as containing data items that are correctly extracted but not correctly aligned. The Actual value indicates the number of columns present in a web page. We evaluate the performance of OW's data alignment algorithm by comparing its efficiency with that of the state-of-the-art wrapper WISH [17]. As shown in Table X, OW outperforms WISH in terms of recall and precision rates when tested on datasets 1–4. This can be attributed to the ability of OW to handle iterative and disjunctive data items using WordNet.

2) Multiple-Sections Data Records: There are several wrappers that are able to align data records [22], [38]; however, none of them can align multiple-sections data records. Once multiple-sections data records are extracted, OW aligns these data records in tabular form. We evaluate the efficiency of the data alignment algorithm using two measurements, namely, recall and precision. A section is considered correctly extracted and aligned if it has its data records correctly aligned in a table that
TABLE XI RESULTS OF DATA ALIGNMENT FOR DATASET 5
represents that section; otherwise, the data are treated as correctly extracted but not correctly aligned. The Actual value signifies the number of sections that should be aligned. Table XI shows that OW is able to produce high recall and precision rates.

VI. CONCLUSION

Our proposed ontological technique extracts data records with varying structures effectively. Experimental results showed that our wrapper is robust in its performance and significantly outperforms existing state-of-the-art wrappers. Our wrapper distinguishes data regions based on the semantic properties of data records rather than on DOM tree structure and visual properties. Unlike existing wrappers, which work on specific types of data records, our wrapper is able to distinguish and extract three types of data records, and, most important of all, it is domain independent. The ontological technique also reduces the number of potential data regions for data extraction, which shortens the time and increases the accuracy of identifying the correct data region to be extracted. Measuring the sizes of text and images to locate and extract the relevant data region further improves the precision of our wrapper. The use of the ontological technique for aligning data records is highly effective for disjunctive and iterative data items, which are not supported by current wrappers. Our ontology-based wrapper is tailored to extract data records with varying structures; it thus provides more flexibility and is simpler to use in the extraction of complicated data records. Tests also show that OW is able to extract data records from multilingual web pages.

REFERENCES

[1] C. Alcaraz and J. Lopez, "A security analysis for wireless sensor mesh networks in highly critical systems," IEEE Trans. Syst., Man, Cybern., vol. 40, no. 4, pp. 419–428, Jul. 2010.
[2] A. Budanitsky and G. Hirst, "Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures," in Proc. NAACL, 2001, pp. 29–34.
[3] A. Arasu and H. Garcia-Molina, "Extracting structured data from web pages," presented at the ACM SIGMOD Conf., San Diego, CA, 2003.
[4] B. Liu, R. Grossman, and Y. Zhai, "Mining data records in web pages," presented at the ACM SIGKDD Conf., Washington, DC, 2003.
[5] B. Liu and Y. Zhai, "NET—A system for extracting web data from flat and nested data records," in Proc. WISE, 2005, pp. 487–495.
[6] S. H. Choi, Y.-S. Jeong, and M. K. Jeong, "A hybrid recommendation method with reduced data for large-scale application," IEEE Trans. Syst., Man, Cybern., vol. 40, no. 5, pp. 557–599, Sep. 2010.
[7] C.-H. Chang, M. Kayed, M. R. Girgis, and K. Shaalan, "A survey of web information extraction systems," IEEE Trans. Knowl. Data Eng., vol. 18, no. 10, pp. 1411–1428, Oct. 2006.
[8] C. Fellbaum, WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press, 1998.
[9] C. Leacock and M. Chodorow, Combining Local Context and WordNet Similarity for Word Sense Identification. Cambridge, MA: MIT Press, 1998.
[10] D. W. Embley, D. M. Campbell, Y. S. Jiang, S. W. Liddle, and D. W. Lonsdale, "Conceptual-model-based data extraction from multiple-record web pages," Data Knowl. Eng., vol. 31, pp. 227–251, 1999.
[11] G. Miao, J. Tatemura, W.-P. Hsiung, A. Sawires, and L. E. Moser, "Extracting data records from the web using tag path clustering," in Proc. ACM WWW, 2009, pp. 981–990.
[12] G. Hirst and D. St-Onge, Lexical Chains as Representations of Context for the Detection and Correction of Malapropisms. Cambridge, MA: MIT Press, 1998.
[13] H. Snoussi, L. Magnin, and J.-Y. Nie, "Heterogeneous web data extraction using ontology," in Proc. Agent-Oriented Inf. Syst., 2001, pp. 99–110.
[14] H. Zhao, W. Meng, and C. Yu, "Automatic extraction of dynamic record sections from deep web," presented at the ACM VLDB Conf., Seoul, Korea, 2006.
[15] H. Zhao, W. Meng, Z. Wu, V. Raghavan, and C. Yu, "Fully automatic wrapper generation for search engines," in Proc. ACM WWW, 2005, pp. 66–75.
[16] J. J. Jiang and D. W. Conrath, "Semantic similarity based on corpus statistics and lexical taxonomy," in Proc. Int. Conf. Res. Comput. Linguist., 1997, pp. 19–33.
[17] J. L. Hong, E. Siew, and S. Egerton, "Information extraction for search engines using fast heuristic techniques," Data Knowl. Eng., vol. 69, pp. 169–196, 2010.
[18] J. L. Hong, E. Siew, and S. Egerton, "WMS—Extracting multiple sections data records from search engine results pages," in Proc. ACM SAC, 2010, pp. 1696–1701.
[19] J. L. Hong, E. Siew, and S. Egerton, "Aligning data records using WordNet," in Proc. IEEE CRD, 2010, pp. 56–60.
[20] J. L. Hong, "Deep web data extraction," presented at the IEEE SMC Conf., Istanbul, Turkey, Oct. 2010.
[21] J. Wang and F. H. Lochovsky, "Data extraction and label assignment for web databases," presented at the ACM WWW Conf., Budapest, Hungary, 2003.
[22] K. Simon and G. Lausen, "ViPER: Augmenting automatic information extraction with visual perceptions," presented at the ACM CIKM Conf., Bremen, Germany, 2005.
[23] O. Lassila and D. McGuinness, "The role of frame-based representation on the semantic web," Knowl. Syst. Lab., Stanford Univ., Stanford, CA, Tech. Rep. KSL-01-02, 2001.
[24] L. Li, Y. Liu, A. Obregon, and M. A. Weatherston, "Visual segmentation-based data record extraction from web documents," in Proc. IEEE IRI, 2007, pp. 502–507.
[25] P. W. Lord, R. D. Stevens, A. Brass, and C. A. Goble, "Investigating semantic similarity measures across the gene ontology: The relationship between sequence and annotation," Bioinformatics, vol. 19, pp. 1275–1283, 2003.
[26] R. Rada, H. Mili, E. Bicknell, and M. Blettner, "Development and application of a metric on semantic nets," IEEE Trans. Syst., Man, Cybern., vol. 19, no. 1, pp. 17–30, Jan./Feb. 1989.
[27] M. Rodriguez and M. Egenhofer, "Determining semantic similarity among entity classes from different ontologies," IEEE Trans. Knowl. Data Eng., vol. 15, no. 2, pp. 442–456, Mar./Apr. 2003.
[28] A. Rodriguez, W. A. Chaovalitwongse, L. Zhe, H. Singhal, and H. Pham, "Master defect record retrieval using network-based feature association," IEEE Trans. Syst., Man, Cybern., vol. 40, no. 3, pp. 319–329, May 2010.
[29] S. Zheng, R. Song, J.-R. Wen, and C. L. Giles, "Efficient record-level wrapper induction," presented at the ACM CIKM Conf., Hong Kong, 2009.
[30] C. Silva, U. Lotric, B. Ribeiro, and A. Dobnikar, "Distributed text classification with an ensemble kernel-based learning approach," IEEE Trans. Syst., Man, Cybern., vol. 40, no. 3, pp. 287–297, May 2010.
[31] S. Banerjee and T. Pedersen, "Extended gloss overlaps as a measure of semantic relatedness," in Proc. IJCAI, 2003, pp. 805–810.
[32] V. Crescenzi, G. Mecca, and P. Merialdo, "RoadRunner: Towards automatic data extraction from large web sites," in Proc. ACM VLDB, 2001, pp. 109–118.
[33] W. Liu, X. Meng, and W. Meng, "Vision-based web data records extraction," in Proc. ACM WebDB, 2006, pp. 20–25.
[34] W. Liu, X. Meng, and W. Meng, "ViDE: A vision-based approach for deep web data extraction," IEEE Trans. Knowl. Data Eng., vol. 22, no. 3, pp. 447–460, Mar. 2009.
[35] W. Su, J. Wang, and F. H. Lochovsky, "ODE: Ontology-assisted data extraction," ACM Trans. Database Syst., vol. 34, no. 2, pp. 1–35, 2009.
[36] W. Wu, A. Doan, C. Yu, and W. Meng, "Bootstrapping domain ontology for semantic web services from source web sites," in Proc. VLDB Workshop, 2005, pp. 11–22.
[37] Z. Wu and M. Palmer, "Verb semantics and lexical selection," in Proc. Annu. Meet. Assoc. Comput. Linguist., 1994, pp. 133–138.
[38] Y. Zhai and B. Liu, "Web data extraction based on partial tree alignment," in Proc. ACM WWW, 2005, pp. 76–85.
[39] Y. Wu, J. Chen, and Q. Li, "Extracting loosely structured data records through mining strict patterns," in Proc. IEEE ICDE, 2008, pp. 1322–1324.
[40] (2010). [Online]. Available: http://www.cyc.com
[41] (2010). [Online]. Available: http://www.loa-cnr.it/DOLCE.html
[42] (2010). [Online]. Available: http://www.ontologyportal.org/
[43] (2010). [Online]. Available: http://www.illc.uva.nl/EuroWordNet/
[44] (2010). [Online]. Available: http://two.dcook.org/software/mlsn/about/download.html
[45] (2010). [Online]. Available: http://nlpwww.nict.go.jp/wn-ja/index.en.html#downloads
Jer Lang Hong received the B.Sc. degree in computer science from the University of Nottingham, Nottingham, U.K., in 2005 and the Ph.D. degree from Monash University, Melbourne, Australia, in 2010. He is currently a Lecturer with the School of Information Technology, Monash University. His research interests include information extraction and automated data extraction. He is the author and coauthor of several Association for Computing Machinery (ACM)/IEEE conference papers. Dr. Hong is a Program Committee Member/Reviewer for several IEEE conferences/journals, such as the IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS. He has also won several student travel grants for IEEE/ACM conferences during his Ph.D. study.