Crawling Web Pages with Support for Client-Side Dynamism*

Manuel Álvarez1, Alberto Pan1,**, Juan Raposo1, and Justo Hidalgo2

1 Department of Information and Communications Technologies, University of A Coruña, 15071 A Coruña, Spain
{mad, apan, jrs}@udc.es
2 Denodo Technologies Inc., 28039 Madrid, Spain
[email protected]

* This research was partially supported by the Spanish Ministry of Education and Science under project TSI2005-07730.
** Alberto Pan's work was partially supported by the "Ramón y Cajal" programme of the Spanish Ministry of Education and Science.

Abstract. There is a great amount of information on the web that cannot be accessed by conventional crawler engines. This portion of the web is usually known as the Hidden Web. To deal with this problem, it is necessary to solve two tasks: crawling the client-side and crawling the server-side hidden web. In this paper we present an architecture and a set of related techniques for accessing the information placed in web pages with client-side dynamism, dealing with aspects such as JavaScript technology, non-standard session maintenance mechanisms, client redirections, pop-up menus, etc. Our approach leverages current browser APIs and implements novel crawling models and algorithms.

1 Introduction

The "Hidden Web" or "Deep Web" [1] is usually defined as the portion of WWW documents that is dynamically generated. The problem of crawling the "hidden web" can be divided into two tasks: crawling the client-side and crawling the server-side hidden web. Client-side hidden web techniques are concerned with accessing content dynamically generated in the client web browser, while server-side techniques focus on accessing the valuable content hidden behind web search forms [3] [6]. This paper proposes novel techniques and algorithms for dealing with the first of these problems.

1.1 The Case for Client-Side Hidden Web

Today's complex web pages make intensive use of scripting languages (mainly JavaScript), session maintenance mechanisms, complex redirections, etc. Developers use these client technologies to add interactivity to web pages and to improve site navigation. This is done through interface elements such as pop-up menus, or by disposing content in layers that are either shown or hidden depending on the user's actions.

In addition, many sources use scripting languages such as JavaScript [10] for a variety of internal purposes, including dynamically building HTTP requests for submitting forms, managing HTML layers and/or performing complex redirections. The situation is aggravated because most of the tools used for visually building web sites generate pages that use scripting code for content generation and/or for improving navigation.

1.2 The Problem with Conventional Crawlers

Several problems make it difficult for traditional web crawling engines to obtain data from client-side hidden web pages. The most important ones are described in the following sub-sections.

1.2.1 Client-Side Scripting Languages

Many HTML pages make intensive use of JavaScript and other client-side scripting languages (such as JScript or VBScript) for purposes such as:

• Generating content at runtime (e.g. document.write calls in JavaScript).
• Dynamically generating navigations. Scripting code may appear, for instance, in the href attribute of an anchor, or may be executed when some event of the page is fired (e.g. 'onclick' or 'onmouseover' for unfolding a pop-up menu when the user clicks or moves the mouse over a menu option). It is also possible for the scripting code to rewrite a URL, to open a new window, or to generate several navigations (more than one URL from which to continue the crawling process).
• Automatically filling in a form in a page and then submitting it.

Successfully dealing with scripting languages requires that HTTP clients implement all the mechanisms that make it possible for a browser to render a page and to generate new navigations. It also involves following anchors and executing all the actions associated with the events they fire. Using a specific interpreter (e.g. Mozilla Rhino for JavaScript [7]) does not solve these problems, since real-world scripts assume a set of browser-provided objects to be available in their execution environment. Besides, in some situations, such as multi-frame pages, it is not always easy to locate and extract the scripting code to be interpreted. That is why most crawlers built to date, including the ones used in the most popular web search engines, do not provide support for this kind of page.

Providing a convenient environment for executing scripts is not the only problem associated with client-side dynamism. When conventional crawlers reach a new page, they scan it for new anchors to traverse and add them to a master list of URLs to access. Scripting code complicates this situation because it may be used to dynamically generate or remove anchors in response to events. For instance, many web pages use anchors to represent menus of options. When an anchor representing an option is clicked, some scripting code dynamically generates a list of new anchors representing the sub-options. If the anchor is clicked again, the script code may "fold" the menu again, removing the anchors corresponding to the sub-options. A crawler dealing with the client-side deep web should be able to detect these situations and obtain all the "hidden" URLs, adding them to the master URL list.
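To make the last point concrete, the following is a minimal, hypothetical sketch (not the system described in this paper, which builds on the browser APIs of its time) of how a crawler can use an automated browser, here Selenium driving headless Chrome, to execute page scripts, fire click events, and harvest anchors that only exist in the DOM after those events. All function, variable and parameter names are illustrative.

```python
# Illustrative sketch only: discover URLs on a page by rendering it in a real
# browser (so scripts run) and by firing 'click' on elements that carry
# scripting code, then re-scanning the DOM for newly generated anchors.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import WebDriverException


def collect_urls_with_events(url, max_clicks=20):
    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    found = set()
    try:
        driver.get(url)            # scripts, document.write and onload handlers run here
        base = driver.current_url  # the page may already have redirected on load

        def harvest():
            for a in driver.find_elements(By.TAG_NAME, "a"):
                href = a.get_attribute("href")
                if href and not href.startswith("javascript:"):
                    found.add(href)

        harvest()  # anchors present after the initial rendering

        # Fire 'click' on elements carrying scripting code so that anchors that
        # are generated dynamically (e.g. by unfolding a pop-up menu) appear.
        clickable = driver.find_elements(
            By.CSS_SELECTOR, "[onclick], a[href^='javascript:']")
        for element in clickable[:max_clicks]:
            try:
                element.click()
            except WebDriverException:
                continue  # the node may have been removed ("folded") or replaced
            if driver.current_url != base:     # the click triggered a navigation
                found.add(driver.current_url)  # a single anchor can yield new URLs
                driver.back()
            harvest()  # pick up any newly generated anchors
    finally:
        driver.quit()
    return found
```

The key design choice illustrated here is that URL discovery happens on the rendered DOM after events have fired, rather than on the raw HTML returned by the HTTP response.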

1.2.2 Session Maintenance Mechanisms

Many websites use session maintenance mechanisms based on client resources, such as cookies or scripting code that adds session parameters to URLs before sending them to the server. This raises a number of challenges:

• While most crawlers are able to deal with cookies, we have already stated that this is not the case with scripting languages.
• Another problem arises in distributed crawling. Conventional crawling architectures are based on a shared "master list" of URLs from which crawling processes (possibly running on different machines) pick URLs and access them independently, in parallel. With session-based sites, however, we need to ensure that each crawling process has all the session information it requires (such as cookies or the context for executing the scripting code); otherwise, any attempt to access the page will fail. Conventional crawlers do not deal with these situations (a sketch of one possible approach is given below).
• Accessing the documents at a later time. Most web search engines work by indexing the pages retrieved by a web crawler. The crawled pages are usually not stored locally; they are indexed together with their URLs. When a user later obtains a page as the result of a query against the index, he or she accesses the page through its URL. In a context where session maintenance issues exist, however, the URL may not work when used at a later time. For instance, it may include a session number that expires a few minutes after being created.

1.2.3 Redirections

Many websites use complex redirections that are not managed by conventional crawlers. For instance, some pages include JavaScript redirections executed after the page's onload event (the client redirects after the page has been completely loaded);
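Regarding the distributed-crawling challenge of Sect. 1.2.2, the following is a minimal, hypothetical sketch (not the architecture proposed in this paper): each entry added to the shared master list carries the session cookies captured by the browser that discovered it, so that any other crawling process can restore that session before accessing the URL. It again uses Selenium with headless Chrome; all names are illustrative, and non-cookie session context (e.g. scripting state) is ignored for brevity.

```python
# Illustrative sketch only: propagate session state together with URLs in a
# shared crawling frontier, so any worker can restore the session before fetching.
import json
from dataclasses import dataclass, field
from selenium import webdriver


@dataclass
class CrawlTask:
    url: str
    cookies: list = field(default_factory=list)  # session context captured at discovery time

    def to_json(self):
        # Serializable form suitable for a shared "master list" of URLs.
        return json.dumps({"url": self.url, "cookies": self.cookies})


def discover(url: str) -> CrawlTask:
    """Visit a page and record the session state needed to revisit it later."""
    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return CrawlTask(url=driver.current_url, cookies=driver.get_cookies())
    finally:
        driver.quit()


def fetch_with_session(task: CrawlTask) -> str:
    """Open the task URL in a fresh browser after restoring its session cookies."""
    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        # Cookies can only be set for the domain currently loaded, so load the
        # URL once, inject the stored cookies, and then reload the target page.
        driver.get(task.url)
        for cookie in task.cookies:
            driver.add_cookie(cookie)
        driver.get(task.url)
        return driver.page_source
    finally:
        driver.quit()
```

A complete solution would also need to serialize the scripting context and to refresh expired sessions, which is precisely what makes conventional, URL-only frontiers insufficient for session-based sites.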