The International Conference on Networking, Information & Communications-ICINIC-2014, 10-12 July 2014

Web Annotation: Fact Finding Survey to Extract Query Result Records from Deep Web Database

Sonali T. Kadam, Sanchika Bajpai

Department of Computer Engineering, Bhivarabai Sawant Institute of Technology and Research, University of Pune, India

[email protected], [email protected]

Abstract: Web data extraction and web annotation have become popular research areas in recent years. Although search engine technology has advanced considerably, many kinds of searches still return unsatisfactory and incomplete results. Such cases can be handled more successfully by applying data extraction techniques such as SHOE, EXALG, CREAM, conceptual-model-based data extraction, and multi-strategy approaches to the result records returned by deep web databases. In this paper we focus on searching for result records, extracting information from deep web databases, and using ontologies for a structured and effective query-based search process. The main role of these approaches is to extract useful knowledge from the deep web for the common user. A result page retrieved from a web database (WDB) contains many search result records (SRRs). Each SRR consists of data units that can be organized into groups, where data units in the same group share the same semantics. These groups are then assigned meaningful labels, which helps to predict the annotations in the deep web database. We show how this fact-finding survey supports web annotation using the search result records of the deep web.

Keywords: Search Engine, Query Records, Deep Web, Annotation

I. INTRODUCTION

Databases available on the web, called web databases, compose what is referred to as the deep web. Unlike pages in the surface web, which are stored for subsequent querying after they are generated, deep web pages are usually not stored but are generated dynamically from web databases in reply to a user query submitted through a query interface. A survey in July 2000 estimated that there were 43,000-96,000 deep web sites and that the deep web content was 500 times larger than that of the surface web [1]. A subsequent survey in April 2004 estimated that there were 307,000 deep web sites [1]. In less than four years, the number of deep web sites had thus expanded three to seven times. Web databases provide a large amount of useful information, which is usually in an unstructured format; unstructured information can be converted into a structured database using various techniques. The unstructured nature of web pages makes it difficult to pose sophisticated queries over the information they contain. At the same time, many web sites contain large collections of structured web pages that encode data from an underlying structured source, such as a relational database. The representation and organization of the information items should provide users with easy access to the information of their own interest.

From a human point of view, information retrieval deals mainly with studying the behaviour of users, understanding their requirements, and determining how the organization and operation of the information retrieval system affect them. Modern IR traces back to research efforts conducted by pioneers such as Hans Peter Luhn, Eugene Garfield, Philip Bagley, and Calvin Mooers, the last of whom allegedly coined the term information retrieval [2]. Extracting structured data from web databases is very useful, since it allows users to pose complex queries over the data. Structured data extraction is essential in information integration systems [3, 4], which integrate data from different web sites. The result pages returned by many search engines come from structured or relational databases; such search engines will be referred to as web databases. The deep web hides its contents behind HTML forms, and a large share of the structured information on the web is represented there. Accessing data from the deep web is a great challenge for the database community [5]. Data integration and surfacing are the two common methods for accessing deep-web content. Metasearch, comparison-shopping, and deep web crawling applications need to retrieve the query result records embedded in the result pages returned by web databases in response to user queries. The query result records from a given search engine are formatted according to a template, and recognizing this template can significantly help to extract the data records and to annotate the data units within every record correctly. In this paper we discuss search result records. Several methods exist for extracting them, such as a graph model that represents the record template together with a field-independent statistical technique that automatically mines the data record template of a search engine from sample search result records [6]. This approach identifies both template HTML tags and non-tag texts, and it explicitly addresses the disparities between the tag (markup) structures and the data structures of search records. A typical result page of a web database consists of several search result records (SRRs), each of which corresponds to an entity [7]. For example, Figure 1 shows three SRRs on a result page from a book WDB, each containing information about one book. Each SRR represents one book with several data units; e.g., the first book record has data units "Talking Back to the Machine: Computers and Human Aspiration," "Peter J. Denning," etc.


Usually, each SRR consists of multiple data units such as the book title, ID, author, publisher, price, and year, as in Figure 1. Frequently, not all data units are encoded with meaningful labels. For example, the first line of the first SRR in Figure 1 is not labelled with "title" even though people can recognize it easily. [7] addresses how to automatically annotate the data records in the SRRs returned by web databases.

[Figure 1. Example of search results from Bookpool.com: (a) the original search result page, showing SRRs such as "Talking Back to the Machine: Computers and Human Aspiration" / Peter J. Denning / Springer-Verlag / 1999 / Our Price $17.50, You Save $9.50 (35% Off) / Out-Of-Stock; (b) the HTML source code for the first SRR.]
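To make the idea of labelled data units concrete, the following minimal Python sketch pairs the data units of the first SRR in Figure 1 with the kind of labels an annotator would assign. The label names (Title, Author, etc.) are our illustrative choice, not the output of the annotation system in [7].

```python
# Data units of the first SRR in Figure 1, paired with hypothetical
# labels of the kind an annotator would assign (our naming, not [7]'s).
srr = [
    ("Title",     "Talking Back to the Machine: Computers and Human Aspiration"),
    ("Author",    "Peter J. Denning"),
    ("Publisher", "Springer-Verlag"),
    ("Year",      "1999"),
    ("ISBN",      "0387984135"),
    ("Price",     "$17.50"),
    ("Saving",    "$9.50 (35% Off)"),
    ("Stock",     "Out-Of-Stock"),
]

for label, data_unit in srr:
    print(f"{label:>9}: {data_unit}")
```

Once every SRR on a page is represented this way, data units in the same group share the same label, which is exactly the alignment that the methods surveyed below try to recover automatically.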

In this paper we discuss related work, the annotation concept, the deep web database, and the search for query result records. The paper then concludes and outlines future work on extracting the exact search result records from the deep web for a user's query.

II. RELATED WORK

Deep annotation [9] is an innovative framework for providing semantic annotation of large sets of data. Deep annotation leaves semantic data where it is managed best: in the database systems behind the web site. As a result, deep annotation provides means for mapping and re-using dynamic information in the semantic web with tools that are comparatively simple and intuitive to use. To achieve this objective, a deep annotation architecture has been developed [9]: server-side markup allows the user to define semantic mappings by using OntoMat-Annotizer [10], and an ontology and mapping editor together with an inference engine are then used to investigate and exploit the resulting descriptions. Thus, a complete framework and a prototype implementation for deep annotation are provided.

Query result extraction from web databases has received increasing attention from the web database and information extraction research communities in recent years because of the high quality of deep web data [11, 12, 14]. As the data returned for a query are embedded in HTML pages, research has focused on data extraction methods. Earlier work focused on wrapper induction methods [7], which require a human to construct the wrapper. More recently, data extraction methods have been proposed that automatically extract the query records from the query result pages. ViNTs has several drawbacks. First, if the data records are distributed over multiple data regions, only the major data region is reported. Second, it requires users to collect training pages from the website, including the no-result page, which may not exist for many web databases because they respond with records close to the query when no record matches it exactly. Third, the pre-learned wrapper usually fails when the format of the query result page changes; hence ViNTs must monitor format changes to the query result pages, which is a difficult problem. In contrast, CTVS requires neither training pages nor a pre-learned wrapper for a website. However, unlike ViNTs, CTVS cannot handle no-result pages, since it assumes there are at least two QRRs in the page to be extracted. All the preceding works use only the information in the query result pages to perform the data extraction. Other works use additional information, specifically ontologies, to assist in the data extraction. While these approaches can overcome some of the limitations of CTVS (e.g., the requirement that a query result page contain at least two QRRs) and can achieve high accuracy, they require additional resources to construct an ontology, as well as the additional step of actually constructing it. Embley et al. [15] use ontologies together with a number of heuristics to automatically extract data from multi-record documents, but the ontologies for different domains must be constructed by hand. Mukherjee et al. [16] exploit the presentation styles and the spatial locality of semantically related items, but their learning process for annotation is domain dependent, and a seed set of instances of semantic concepts in a set of HTML documents must be hand-labelled. These methods are therefore not fully automatic. Deep annotation is important for a large and rapidly growing number of web sites with different goals, as mentioned in [9]: scientific databases, which are frequently built to foster cooperation among researchers (Medline and Swiss-Prot are examples found on the web, and in the bioinformatics community more than 500 large web databases are freely accessible); syndication, since besides direct access to HTML pages of news, investigation reports, etc., commercial content providers frequently offer syndication facilities; and community web portals, which serve the information requests of a community on the deep web, with possibilities for community members to contribute and access data. Many communities have thus contributed great efforts towards the success of deep annotation.


The following sections study query extraction for information integration, the deep web database, and annotation.

III. QUERY RESULT RECORDS

Web databases generate query result pages based on a user's query. Automatically extracting the information from these query result pages is important for applications, such as data integration, that need to cooperate with multiple deep web databases. CTVS [8] is a novel data extraction and alignment technique that combines both tag and value similarity. CTVS automatically extracts information from query result pages by first identifying and segmenting the query result records (QRRs) in the pages and then aligning the segmented QRRs into a table in which the data values of the same attribute are placed in the same column. In particular, new techniques are proposed to handle the case when the QRRs are not contiguous, which may be due to the presence of auxiliary information such as a comment, recommendation, or advertisement, and to handle the nested structures that can occur in the QRRs. Figure 2 shows the QRR extraction process [8]. Given a query result page, the tag tree construction module first builds a tag tree for the page, rooted in the <html> tag. Every node represents a tag in the HTML page, and its children are the tags enclosed inside it. Each internal node of the tag tree has a tag string, which includes the tags of the node and of its descendants, and a tag path, which includes the tags from the root to the node. Next, the data region identification module identifies all possible data regions, which usually contain dynamically generated data, top-down starting from the root node. The record segmentation module then segments the identified data regions into data records according to the tag patterns in the data regions. Given the segmented data records, the data region merge module merges the data regions containing similar records. Finally, the query result section identification module selects one of the merged data regions as the one that contains the QRRs. The query extraction process thus consists of steps such as data region identification, record segmentation, and data region merging. For data region identification, as mentioned in [11] and [12], some child subtrees of the same parent node form similar data records, which assemble into a data region. Simon and Lausen [11] and Zhai and Liu [12] assume that similar data records are represented contiguously in a page. In many query result pages, however, some auxiliary item that explains the data records, such as a recommendation or comment, separates similar data records. Hence, [8] describes a new method to handle non-contiguous data regions so that the technique can be applied to more web databases.

Figure 2. QRR extraction process.
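As an illustration of the tag tree construction step, here is a minimal Python sketch under our own simplifying assumptions (well-formed HTML, no void elements), using the standard html.parser module; it is not the CTVS implementation.

```python
from html.parser import HTMLParser

class TagNode:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        # Tag path: the tags from the root down to this node.
        self.path = (parent.path + [tag]) if parent else [tag]

    def tag_string(self):
        # Tag string: this node's tag followed by those of all descendants.
        return self.tag + "".join(c.tag_string() for c in self.children)

class TagTreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = None
        self.current = None

    def handle_starttag(self, tag, attrs):
        node = TagNode(tag, self.current)
        if self.current is None:
            self.root = node
        else:
            self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if self.current is not None and self.current.tag == tag:
            self.current = self.current.parent

builder = TagTreeBuilder()
builder.feed("<html><body><div><ul><li>A</li><li>B</li></ul></div></body></html>")
print(builder.root.tag_string())  # prints "htmlbodydivullili"
```

Sibling subtrees whose tag strings are highly similar are candidates for a common data region, and their tag paths provide the evidence used later during alignment.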

CTVS employs two steps for this task. The first step identifies and segments the QRRs; existing techniques are improved by allowing the QRRs in a data region to be non-contiguous. The second step aligns the data values among the QRRs. If a query result page has more than one data region containing result records and the records in the different data regions are not similar to each other, then CTVS will select only one of the data regions and discard the others. Under the assumption that there are at least two QRRs in a query result page, the data region identification algorithm discovers data regions in a top-down manner: starting from the root of the query result page's tag tree, the algorithm is applied to a node and then recursively to its children, and it is applied to the children of a node only if the node does not have any similar siblings. In record segmentation, if there is auxiliary information within a data region, corresponding to nodes between record instances, the tandem repeat that stops at the auxiliary information is taken as the correct tandem repeat, since auxiliary information is usually not inserted into the middle of a record. As observed in [12], the visual gap between two records in a data region is usually larger than any visual gap within a record; hence the tandem repeat that satisfies this constraint is selected. If neither of these two heuristics can be used, the tandem repeat that starts the data region is selected. For query result section identification, even after the data region merge step there may still be multiple data regions in a query result page, and three heuristics are used to identify the data region, called the query result section, that contains the QRRs. QRR alignment is performed by a novel three-step data alignment technique that combines tag and value similarity. The first step, pairwise QRR alignment, aligns the data values in a pair of QRRs to provide evidence for aligning the data values among all QRRs. The second step, holistic alignment, aligns the data values in all the QRRs.


The third step, nested structure processing, identifies the nested structures present in the QRRs.
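To illustrate the intuition behind combining tag and value similarity in pairwise alignment, the sketch below scores two data values by mixing tag-path equality with a generic string similarity. The weight and threshold are our assumptions for illustration; they are not the measures defined in [8].

```python
from difflib import SequenceMatcher

def data_value_similarity(value_a, path_a, value_b, path_b, w_tag=0.5):
    """Combine tag (presentation) similarity and value (content) similarity.

    value_*: the text of a data unit; path_*: its tag path in the tag tree.
    The weight w_tag is an illustrative assumption, not a value from [8].
    """
    tag_sim = 1.0 if path_a == path_b else 0.0
    value_sim = SequenceMatcher(None, value_a, value_b).ratio()
    return w_tag * tag_sim + (1.0 - w_tag) * value_sim

# Two price fields from different QRRs: same tag path, similar text.
score = data_value_similarity("$17.50", ["html", "body", "td", "b"],
                              "$23.95", ["html", "body", "td", "b"])
print(score > 0.6)  # aligned into the same column if above a chosen threshold
```

Data value pairs scoring above the threshold end up in the same table column, which is the evidence that holistic alignment then propagates across all QRRs.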

IV. DEEP WEB DATABASE

In the deep web, contents are hidden behind HTML forms, and this is recognized as an important gap in search engine coverage. Since it represents a great share of the structured data on the web, accessing deep-web data has been a long-standing challenge for the database community. This section describes surfacing deep-web content: pre-computing submissions for each HTML form and adding the resulting HTML pages into a search engine index. Deep web surfacing faces several challenges, as mentioned in [1]. The first is to index the content behind many millions of HTML forms that span many languages and hundreds of domains; this necessitates an approach that is completely automatic, highly scalable, and very efficient. Second, a large number of forms have text inputs and require valid input values to be submitted. Finally, striving for maximum coverage of each individual web site can decrease the diversity of deep-web coverage overall. In order to achieve maximum coverage for a web site, earlier works had to rely on customized scripts for each website that extracted and interpreted individual results on the surfaced pages to compute a running estimate of the achieved coverage; this is possible only when a small number of selected sites are indexed.
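A minimal sketch of the surfacing idea follows: enumerate candidate values for a form's text input, fetch the resulting pages, and hand them to an indexer. The URL, parameter name, and candidate values are hypothetical; a real surfacing system such as the one in [1] selects candidate values automatically.

```python
import urllib.parse
import urllib.request

# Hypothetical search form and input values, for illustration only.
FORM_URL = "http://books.example.com/search"   # hypothetical endpoint
CANDIDATE_VALUES = ["databases", "networks", "algorithms"]

def surface(form_url, field, values):
    """Yield (submission URL, result page) pairs for a GET form."""
    for value in values:
        query = urllib.parse.urlencode({field: value})
        url = f"{form_url}?{query}"
        with urllib.request.urlopen(url) as response:
            yield url, response.read().decode("utf-8", errors="replace")

for url, page in surface(FORM_URL, "q", CANDIDATE_VALUES):
    # In a real system, each surfaced page would be added to the search index.
    print(url, len(page))
```

The surfaced pages then enter the normal indexing pipeline, so deep-web records become reachable by ordinary keyword search.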

V. THE PROCESS OF DEEP ANNOTATION

The process of deep annotation [9] consists of the following four steps, as shown in Figure 3.

Input: a web site backed by an underlying relational database.
Step 1: The database owner produces server-side web page markup according to the information structures of the database, as described in [9]. Result: a web site with server-side markup.
Step 2: The annotator produces client-side annotations conforming to the client ontology and the server-side markup. Result: mapping rules between the database and the client ontology.
Step 3: The annotator publishes the client ontology and the mapping rules derived from the annotations. Result: the annotator's ontology and mapping rules are available on the web.
Step 4: The querying party loads the second party's ontology and mapping rules and uses them to query the database via the web service API. Result: results retrieved from the database by the querying party.

[Figure 3. The process of deep annotation [9].]

Clearly, in this process a single person may be the database owner and/or the annotator and/or the querying party. To align this with our running example of the community web portal, the annotator might annotate an organization entry from OntoWeb; he may then use the ontology and mapping rules to instantiate his own syndication services by regularly querying for all recent entries whose titles match his list of topics.
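As a toy illustration of the output of Step 2, the mapping below relates database columns to ontology concepts. The table, column, and concept names are invented for this sketch, and real deep annotation expresses such mappings in ontology languages rather than Python.

```python
# Hypothetical mapping rules between a relational schema and a client
# ontology, as produced by annotating server-side markup (cf. Step 2).
mapping_rules = {
    # (table, column)        -> (ontology class, property)
    ("organization", "name"): ("onto:Organization", "onto:hasName"),
    ("organization", "city"): ("onto:Organization", "onto:locatedIn"),
    ("project", "title"):     ("onto:Project", "onto:hasTitle"),
}

def translate(table, column, value):
    """Rewrite one database cell as an ontology statement (subject elided)."""
    cls, prop = mapping_rules[(table, column)]
    return f"instance of {cls}: {prop} = {value!r}"

print(translate("organization", "name", "OntoWeb"))
```

With such rules published (Step 3), a querying party never touches the relational schema directly; it queries in ontology terms and the rules translate to and from the database.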

VI. ANNOTATION

Web annotation has been a recurring research issue since the advent of hypertext and related technologies such as HTML, XML, and wikis. In this paper, web annotations are defined as: "online annotations associated with web resources such as web pages, with which users can insert, update, or remove data about a web page without modifying the original page itself." Comments, notes, images, and highlights can be added to a web page. Annotation can also help in information retrieval: the highlighted texts can be used to augment the document representation, as shown in [13], where several experiments tested how annotations can improve document access and document clustering. That work identifies three ways of using annotations in information retrieval: the highlighted texts can be used to build personalized document summaries, thus improving document access and retrieval; automatic document clustering can use them to generate user-directed document clusters; and automatic document classifiers can take advantage of the highlighted text to extract significant words from the documents without using the usual term frequency and inverse document frequency measures. Finally, we need to consider annotation proper as part of deep annotation. There, we "inherit" the principal annotation mechanism for creating relational metadata, as elaborated in [9]. The interested reader will find an elaborate comparison of annotation techniques there, as well as in a forthcoming book on annotation [10].
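The following sketch shows one way highlighted annotation text could augment a document's term vector for retrieval. The boost factor is our assumption, not a value from [13], which explores several uses of highlighted text rather than this exact weighting.

```python
from collections import Counter

def term_vector(doc_text, highlights, boost=2):
    """Count terms, counting each term of a user highlight `boost` extra times.

    The boost value is an illustrative assumption; [13] studies several
    ways of exploiting highlighted text, not this exact weighting.
    """
    counts = Counter(doc_text.lower().split())
    for span in highlights:
        for term in span.lower().split():
            counts[term] += boost
    return counts

doc = "Deep web databases return result records generated from a query"
vec = term_vector(doc, highlights=["result records"])
print(vec["result"], vec["records"], vec["deep"])  # 3 3 1
```

Weighting highlighted terms more heavily biases retrieval, clustering, and summarization toward the passages the user actually judged important.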



VII. CONCLUSION

We presented a survey of the query extraction process used to automatically extract QRRs from a query result page. Since the data returned for a query are embedded in HTML pages, research has focused on data extraction methods. We have also described the deep annotation process, which provides a means for mapping and re-using dynamic data in the Semantic Web with tools that are comparatively simple and intuitive to use. There are a number of interesting directions for future work. The first is to develop methods for crawling, indexing, and querying the "structured" pages in the deep web; obviously, much of the information in these pages is lost when naive keyword indexing and searching are used. While conducting this survey, we identified two specific problems: first, how to automatically locate collections of pages that are structured; and second, whether it is practicable to generate a large "web database" from these pages.


ACKNOWLEDGMENT

It is with a deep sense of gratitude that the authors acknowledge the sincere help of the people concerned in providing very constructive suggestions to improve this work, with special thanks to the professors of BSIOTR.

REFERENCES

[1] J. Madhavan, D. Ko, L. Kot, V. Ganapathy, A. Rasmussen, and A.Y. Halevy, "Google's Deep Web Crawl," Proc. VLDB Endowment, vol. 1, no. 2, pp. 1241-1252, 2008.
[2] G. Salton and M. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, 1983.
[3] L.M. Haas, D. Kossmann, E.L. Wimmers, and J. Yang, "Optimizing Queries across Diverse Data Sources," Proc. Int'l Conf. Very Large Data Bases (VLDB), pp. 276-285, 1997.
[4] J.D. Ullman, "Information Integration Using Logical Views," Proc. Int'l Conf. Database Theory (ICDT), pp. 19-40, 1997.
[5] W. Wu, C. Yu, A. Doan, and W. Meng, "An Interactive Clustering-Based Approach to Integrating Source Query Interfaces on the Deep Web," Proc. ACM SIGMOD, 2004.
[6] H. Zhao, W. Meng, and C. Yu, "Mining Templates from Search Result Records of Search Engines," Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, 2007.
[7] Y. Lu, H. He, H. Zhao, W. Meng, and C. Yu, "Annotating Search Results from Web Databases," IEEE Trans. Knowledge and Data Eng., vol. 25, no. 3, pp. 514-527, Mar. 2013.
[8] W. Su, J. Wang, and F.H. Lochovsky, "Combining Tag and Value Similarity for Data Extraction and Alignment," IEEE Trans. Knowledge and Data Eng., vol. 24, no. 7, pp. 1186-1200, July 2012.
[9] S. Handschuh, S. Staab, and R. Volz, "On Deep Annotation," Proc. 12th Int'l Conf. World Wide Web (WWW), 2003.
[10] http://annotation.semanticweb.org/iswc/documents.html
[11] K. Simon and G. Lausen, "ViPER: Augmenting Automatic Information Extraction with Visual Perceptions," Proc. 14th ACM Int'l Conf. Information and Knowledge Management, pp. 381-388, 2005.
[12] Y. Zhai and B. Liu, "Structured Data Extraction from the Web Based on Partial Tree Alignment," IEEE Trans. Knowledge and Data Eng., vol. 18, no. 12, pp. 1614-1628, Dec. 2006.
[13] L. Denoue and L. Vignollet, "An Annotation Tool for Web Browsers and Its Applications to Information Retrieval," Proc. RIAO 2000, Apr. 2000.
[14] D. Buttler, L. Liu, and C. Pu, "A Fully Automated Object Extraction System for the World Wide Web," Proc. 21st Int'l Conf. Distributed Computing Systems, pp. 361-370, 2001.
[15] D. Embley, D. Campbell, Y. Jiang, S. Liddle, D. Lonsdale, Y. Ng, and R. Smith, "Conceptual-Model-Based Data Extraction from Multiple-Record Web Pages," Data and Knowledge Eng., vol. 31, no. 3, pp. 227-251, 1999.
[16] S. Mukherjee, I.V. Ramakrishnan, and A. Singh, "Bootstrapping Semantic Annotation for Content-Rich HTML Documents," Proc. IEEE Int'l Conf. Data Eng. (ICDE), 2005.
