A Novel Architecture for Deep Web Crawler

Dilip Kumar Sharma, Shobhit University, Meerut, India
A. K. Sharma, YMCA University of Science and Technology, Faridabad, India

International Journal of Information Technology and Web Engineering, 6(1), 25-48, January-March 2011
DOI: 10.4018/jitwe.2011010103

ABSTRACT

A traditional crawler picks up a URL, retrieves the corresponding page, extracts the various links it contains, and adds them to the queue. A deep Web crawler, after adding links to the queue, also checks for forms; if forms are present, it processes them and retrieves the required information. Various techniques have been proposed for crawling deep Web information, but much of it remains undiscovered. In this paper, the authors analyze and compare important deep Web information crawling techniques to find their relative limitations and advantages. To minimize the limitations of existing deep Web crawlers, a novel architecture is proposed based on QIIIEP specifications (Sharma & Sharma, 2009). The proposed architecture is cost effective and provides both privatized and general search for deep Web data hidden behind HTML forms.

Keywords: Authenticate Crawling, Deep Web, Hidden Web, Invisible Web, QIIIEP, Web Crawlers

1. INTRODUCTION

Traditional Web crawling techniques have been used to search the contents of the Web reachable through hyperlinks, but they ignore deep Web content, which remains hidden because no links are available to reference it. Content accessible through hyperlinks is termed the surface Web, while content hidden behind HTML forms is termed the deep Web. Deep Web sources store their contents in searchable databases that produce results dynamically only in response to a direct request (Bergman, 2001). The deep Web is not completely hidden from crawling: major traditional search engines can search approximately one-third of its data (He, Patel, Zhang, & Chang, 2007). To utilize the full potential of the Web, however, deep Web content deserves attention, since it can provide a large amount of useful information. Hence, efficient deep Web crawlers are needed. Deep Web pages cannot be searched efficiently by a traditional Web crawler; they can only be extracted dynamically, as the result of a specific search, by a dedicated deep Web crawler (Peisu, Ke, & Qinzhen, 2008; Sharma & Sharma, 2010). This paper identifies the advantages and limitations of current deep Web crawlers in searching deep Web content. For this purpose, an exhaustive analysis of existing deep Web crawler mechanisms is carried out.
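To make the distinction concrete, the following minimal sketch (not the authors' implementation) shows a crawl loop that follows hyperlinks as a traditional crawler does and additionally checks each retrieved page for forms, the step that characterizes a deep Web crawler. The requests and BeautifulSoup libraries and the process_form stub are assumptions for illustration; politeness policies and error handling are omitted.

# Minimal sketch of the crawl loop described above, not the authors'
# implementation. Requires the third-party 'requests' and 'beautifulsoup4'
# packages.
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed_url, max_pages=100):
    queue = deque([seed_url])
    seen = {seed_url}
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        fetched += 1
        page = requests.get(url, timeout=10)
        soup = BeautifulSoup(page.text, "html.parser")

        # Surface-Web step: follow hyperlinks, as a traditional crawler does.
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            if link not in seen:
                seen.add(link)
                queue.append(link)

        # Deep-Web step: additionally check for forms, which gate the
        # content stored in searchable databases.
        for form in soup.find_all("form"):
            process_form(url, form)

def process_form(url, form):
    # Placeholder: analyzing, filling, and submitting forms is precisely
    # the problem this paper addresses.
    pass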

In particular, it concentrates on the development of a novel architecture for a deep Web crawler that extracts content from the portion of the Web hidden behind HTML search interfaces in large searchable databases. The contributions are as follows:

•	Analysis of different existing deep Web crawler algorithms, with their advantages and limitations for large-scale crawling of the deep Web.
•	Based on this analysis of the existing deep Web crawling process, a novel deep Web crawling architecture built on the QIIIEP (query intensive interface information extraction protocol) specification is proposed (Figure 1). A minimal sketch of the protocol's query-assignment idea follows this list.
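As a rough illustration of the QIIIEP idea, in which a server-side component supplies candidate query words for the fields of a site's search interface, consider the sketch below. The actual protocol messages are defined in the QIIIEP specification (Sharma & Sharma, 2009); the endpoint path, parameter names, and response shape used here are illustrative assumptions only.

# Illustrative-only sketch of the query-word assignment idea behind QIIIEP.
# The real protocol messages are defined in Sharma & Sharma (2009); the
# endpoint path, parameters, and JSON response shape here are assumptions.
import requests

def fetch_query_words(qiiiep_server, form_domain, field_labels):
    """Ask a (hypothetical) QIIIEP server for candidate query words
    for each label found on a site's search form."""
    response = requests.get(
        f"http://{qiiiep_server}/qiiiep",  # assumed endpoint
        params={"domain": form_domain, "labels": ",".join(field_labels)},
        timeout=10,
    )
    # Assumed response shape: {"title": ["deep web", "crawler"], ...}
    return response.json()

# A crawler would then pair each form field with the returned candidate
# values and submit the form once per combination it wants to probe.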

This paper is organized as follows: Section 2 discusses related work. Section 3 summarizes the architectures of various deep Web crawlers, and Section 4 compares them. The architecture of the proposed deep Web crawler is presented in Section 5. Experimental results are discussed in Section 6, and, finally, a conclusion is presented in Section 7.

2. RELATED WORK

The deep Web stores its data behind HTML forms. Traditional Web crawlers can efficiently crawl the surface Web, but they cannot efficiently crawl the deep Web. Various specialized deep Web crawlers have been proposed in the literature, but they have limited capabilities, and a large volume of deep Web data remains undiscovered because of these limitations. In this section, existing deep Web crawlers are analyzed to find their advantages and limitations, with particular reference to their capability to crawl deep Web content efficiently.

Application/Task-Specific Human-Assisted Approach

Various crawlers have been proposed in the literature to crawl the deep Web. One deep Web crawler architecture is that of Raghavan and Garcia-Molina (2001), which takes a task-specific, human-assisted approach to crawling the hidden Web. Two basic challenges are associated with deep Web search: the volume of the hidden Web is very large, and a user-friendly crawler is needed that can handle search interfaces efficiently. The paper designs a model of a task-specific, human-assisted Web crawler and realizes it in HiWE (Hidden Web Exposer), a prototype built at Stanford that crawls dynamic pages. HiWE is designed to automatically process, analyze, and submit forms, using an internal model of forms and form submissions, and it applies a layout-based information extraction (LITE) technique to process pages and extract useful information. The advantages of the HiWE architecture are that its application/task-specific approach allows the crawler to concentrate on relevant pages only, and that automatic form filling can be done with the human-assisted approach. Limitations of this …
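The following sketch illustrates the form-filling step of such a human-assisted approach: each form field's label is matched against a task-specific value table supplied with human assistance. The normalization rule and the table contents are simplified assumptions; HiWE's actual LITE technique additionally uses page layout to locate the label nearest each form element.

# Simplified sketch of HiWE-style form filling: match each form field's
# label against a human-supplied, task-specific value table. The
# normalization and matching rules are simplified assumptions, not
# HiWE's actual code.
def normalize(label):
    return "".join(ch for ch in label.lower() if ch.isalnum())

# Task-specific table supplied with human assistance (illustrative values).
VALUE_TABLE = {
    "company": ["IBM", "Microsoft"],
    "state": ["California", "Texas"],
}

def assign_values(field_labels):
    """Return candidate values for each recognized field; unmatched
    fields are left for human review, as in the human-assisted approach."""
    assignments, unmatched = {}, []
    for label in field_labels:
        key = normalize(label)
        if key in VALUE_TABLE:
            assignments[label] = VALUE_TABLE[key]
        else:
            unmatched.append(label)
    return assignments, unmatched

# Example: assign_values(["Company", "State", "Ticker"]) matches the first
# two labels and flags "Ticker" for the human assistant.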

Figure 1. Mechanism of the QIIIEP-based deep Web crawler


