Creating and Exploring Web Form Repositories

Luciano Barbosa, Hoa Nguyen, Thanh Nguyen, Ramesh Pinnamaneni, Juliana Freire
School of Computing, University of Utah
[email protected]
ABSTRACT

We present DeepPeep (http://www.deeppeep.org), a new system for discovering, organizing, and analyzing Web forms. DeepPeep allows users to explore the entry points to hidden-Web sites whose contents are out of reach for traditional search engines. Besides demonstrating important features of DeepPeep and describing the infrastructure we used to build the system, we will show how this infrastructure can be used to create form collections and form search engines for different domains. We also present the analysis component of DeepPeep, which allows users to explore and visualize information in form repositories, helping them not only to better search and understand forms in different domains, but also to refine the form-gathering process.
1. INTRODUCTION
There has been explosive growth in the volume of structured information on the Web. This information often resides in the deep (hidden) Web, stored in databases and exposed only through queries over Web forms. A recent study by Google estimates that there are several million form interfaces [8]. However, the high-quality information in online databases can be hard to find, since it is out of reach for traditional search engines, whose indexes include only content in the surface Web. Consider, for example, a biologist who needs to find databases related to molecular biology. If she searches on Google for the keywords "molecular biology database", over 4.6 million documents are returned. Among these, she will find pages that contain databases, but the results also include a very large number of pages that do not (e.g., pages from journals, scientific articles, etc.).

A number of emerging applications attempt to expose and integrate hidden-Web information, including hidden-Web crawlers [1, 9], large-scale information integration systems [5, 8, 12], and online database directories [6]. An important requirement shared by all of these applications is the ability to locate the entry points to the hidden Web.

In this demo, we present DeepPeep, a search engine specialized in Web forms. The system was designed to cater to the needs of
casual Web users in search of online databases, expert users whose goal is to build applications that access hidden-Web information, and information providers who want to build form collections for different domains. We also describe the infrastructure underlying DeepPeep, which automates to a great extent the process of locating and organizing Web forms, as well as providing end users with the ability to explore and analyze form collections.
2. DEEPPEEP: SYSTEM OVERVIEW
DeepPeep was built using a set of components that we have developed to locate and organize hidden-Web sites. DeepPeep provides a search interface and an analysis sub-system that allow users to explore the contents of its form repository. The high-level architecture of the system is shown in Figure 1; we briefly describe the different components below.
2.1 Locating and Organizing Web Forms
The ACHE Focused Crawler. To locate forms on the Web, we use the ACHE crawler [2]. ACHE uses the contents of pages to focus the crawl on a given topic, and it also prioritizes links that are more likely to lead to pages that contain searchable forms. ACHE automatically improves its focus strategy during a crawl by applying online learning. Not only does this adaptive learning strategy significantly improve the harvest rate as the crawl progresses, but it also allows crawlers for different domains to be configured with little effort.

Form Classification. An automated crawling process retrieves a diverse set of forms. Even a focused crawler may gather pages that contain searchable forms from different database domains. For example, while crawling to find Airfare search interfaces, a crawler is likely to retrieve a large number of forms in related domains, such as Rental Cars and Hotels, since these are often co-located with Airfare search interfaces on travel sites. The set of retrieved forms also includes many non-searchable forms that do not represent database queries, such as forms for login, mailing-list subscription, or quote requests. To filter out irrelevant forms, we use the HIerarchical Form Identification (HIFI) framework [3]. HIFI classifies forms with respect to a domain, and it has been shown to be both scalable and accurate. Unlike previous approaches to form classification, which rely on the ability to extract form attribute labels (a task that is hard to automate), HIFI uses only form features that can be automatically extracted. It takes into account both the structural characteristics and the textual content of forms for classification purposes. But instead of applying a single classifier, it combines two classifiers in a hierarchical fashion: one is used to determine whether a form is searchable based on structural features; and the other is
[Figure 1: DeepPeep Architecture. A focused crawler and form classifier feed the Form Repository; pre-processing (title extraction, form clustering, PageRank/backlink extraction, thumbshot generation), LabelEx, and PruSM populate a Label Database, which is indexed for the DeepPeep search interface and form analysis.]

trained to identify, among searchable forms, those whose content is indicative of the target database domain.

Form Clustering. Form classification is effective when one has a good idea of the form domain of interest. When this information is not known, or when a large heterogeneous form collection is available, it is important to discover the different domains. To group (cluster) forms that belong to the same domain, we use Context-Aware Form Clustering (CAFC), a framework for clustering Web forms [4]. CAFC models Web forms as a set of hyperlinked objects and considers the visible information in the form context, both within and in the neighborhood of forms, as the basis for similarity comparison. In contrast to previous approaches, which require complex label extraction and manual pre-processing of forms, CAFC performs clustering over automatically extracted features. In addition, because it uses a rich set of metadata, CAFC is able to handle a wide range of forms, including content-rich forms that contain multiple attributes, as well as simple keyword-based search interfaces.
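The clustering step can be sketched roughly as follows. This is a simplified illustration under our own assumptions (term-frequency vectors over a form's visible text, a greedy cosine-similarity threshold), not CAFC itself, which additionally models forms as hyperlinked objects and uses a richer set of metadata:

```python
# Illustrative sketch of similarity-based form clustering (not CAFC's actual
# algorithm): each form is represented by a term-frequency vector of its
# visible text, and forms are greedily assigned to the first cluster whose
# seed form is sufficiently similar.
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster(form_texts, threshold=0.3):
    clusters = []  # each cluster is a list of (text, vector) pairs
    for text in form_texts:
        vec = Counter(text.lower().split())
        for c in clusters:
            if cosine(vec, c[0][1]) >= threshold:  # compare to cluster seed
                c.append((text, vec))
                break
        else:
            clusters.append([(text, vec)])
    return [[text for text, _ in c] for c in clusters]

forms = [
    "search used cars by make model year",
    "find cars by make and price",
    "search flights departure city arrival city",
]
print(cluster(forms))  # the two car-search forms group together
```

The threshold trades precision for recall: a higher value yields many small, pure clusters, while a lower one merges related domains.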
2.2 Extracting Meta-Data
After the forms are gathered and organized in the form repository, they are processed to extract additional meta-data that is useful for querying. First, the textual labels associated with form elements are identified; then, schema matching is applied to identify sets of synonymous labels. This meta-data has many uses, including better result ranking (e.g., forms that contain a search term as a label can be ranked higher) and automatic query expansion (e.g., the query make=Toyota can be expanded into make=Toyota OR brand=Toyota if make and brand are found to be synonyms).

LabelEx and Label Extraction. Although the HTML standard provides a label attribute to associate descriptive information with individual form elements, it is not widely used. Instead, the common practice is to intersperse text representing attribute names with the HTML markup. Given the wide variation in form layout and label placement, automatically extracting these labels is a challenging problem. But having these labels is a critical requirement for applications that need to understand form interfaces in order to retrieve and integrate the information hidden behind them. In DeepPeep, we use LabelEx, a learning-based approach to automatically parsing forms and extracting the labels of form elements [10]. Whereas previous approaches to this problem have relied on heuristics and manually specified extraction rules, LabelEx uses an ensemble of learning classifiers to identify element-label mappings, and it applies a reconciliation step that leverages the classifier-derived mappings to boost extraction accuracy.

PruSM and Schema Matching. We use PruSM, a prudent schema matching algorithm [11], to identify the correspondences among different form attributes (and their labels). Unlike previous approaches to form-schema matching, PruSM does not require any manual pre-processing of forms, and it effectively handles both noisy data and rare labels.

Domain    Description                 Number of forms
Airfare   airfare search              2195
Auto      used cars                   8806
Book      books search                1878
Hotel     hotel availability          18962
Job       job search                  11611
Rental    car rental availability     2361
Biology   molecular biology search    661

Table 1: Database domains used in the demo.
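To make the query-expansion use of synonym sets concrete, the following sketch shows how matched labels could rewrite a structured query. The synonym sets and rewrite function are hypothetical illustrations, not PruSM's actual output or DeepPeep's query processor:

```python
# Hypothetical sketch of query expansion driven by synonym sets produced by
# schema matching (the make=Toyota -> make=Toyota OR brand=Toyota example
# from the text). The sets below are assumed for illustration.

SYNONYM_SETS = [
    {"make", "brand", "manufacturer"},
    {"state", "province"},
]

def expand(attribute, value):
    """Rewrite attribute=value into a disjunction over synonymous labels."""
    for syn_set in SYNONYM_SETS:
        if attribute in syn_set:
            return " OR ".join(f"{a}={value}" for a in sorted(syn_set))
    return f"{attribute}={value}"  # no known synonyms: leave the query as-is

print(expand("make", "Toyota"))
# prints: brand=Toyota OR make=Toyota OR manufacturer=Toyota
```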
2.3 The DeepPeep Site
The current version of DeepPeep was built using a series of crawlers and form classifiers customized for seven different domains (see Table 1). DeepPeep uses Lucene [7] to index the contents of the forms and of the pages where they are located, as well as the form labels extracted by LabelEx. DeepPeep's search interface allows users to explore the form repository using the index produced by Lucene. It includes a simple, keyword-based interface as well as an advanced query interface that provides additional functionality, including the ability to pose simple structured queries involving attribute-value comparisons (e.g., state=Utah) and meta-data queries (e.g., retrieve all forms with a label state). There is also an interface for expert users, which supports more complex queries over the form repository (e.g., show the top-k labels in a domain, or the top-k values for given attributes) and lets users interact with the form content.

An important feature of the DeepPeep search interface is how it ranks query results. The current implementation combines three features: term content, number of backlinks, and PageRank. PageRank and backlink information is obtained from external sources, including the Google and Yahoo! Search APIs. For the term content, Lucene provides a composite tf-idf score for each page in the result set, based on the user query and the page's content.
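As a rough sketch of how such a ranking might combine the three signals, the weighted sum below can be used; the weights and the normalizations are our own assumptions, not DeepPeep's actual formula:

```python
# Illustrative ranking sketch: combine a tf-idf term score (as produced by a
# Lucene-style index), a backlink count, and a PageRank value into one score.
# Weights and normalizations are assumed for illustration only.
import math

def rank_score(term_score, backlinks, pagerank,
               w_term=0.5, w_back=0.25, w_pr=0.25):
    # Log-normalize backlink counts, which span several orders of magnitude;
    # the scale caps at roughly one million backlinks.
    back_norm = min(math.log10(1 + backlinks) / 6.0, 1.0)
    pr_norm = pagerank / 10.0  # assume PageRank reported on a 0-10 scale
    return w_term * term_score + w_back * back_norm + w_pr * pr_norm

# Hypothetical result set: (url, tf-idf score, backlinks, PageRank).
results = [
    ("big-travel.example", 0.82, 120000, 7),
    ("tiny-travel.example", 0.91, 40, 2),
]
ranked = sorted(results, key=lambda r: rank_score(*r[1:]), reverse=True)
print([r[0] for r in ranked])
# the well-linked site outranks the slightly better term match
```

The design point this illustrates is that popularity signals (backlinks, PageRank) can compensate for small differences in textual relevance.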
2.4 Analyzing Forms
To help users explore the form collection, we have created a series of analyses and visualizations. Figure 2 illustrates one possible interactive analysis session. The figure shows a hierarchical clustering of a set of 30 forms in the Auto domain. By hovering over a cluster (or subcluster), the user can see a tag cloud associated with the forms in that cluster. Based on the terms in the tag cloud, a user can quickly determine whether the forms in a cluster are relevant, and drill down to identify a specific form (or forms) of her interest.

[Figure 2: Users can interactively explore the form collection using different visualizations and analyses.]

Besides helping to identify relevant forms, the tag clouds can also help in identifying potential errors in classification and clustering. For example, although these forms were classified as being in the Auto domain, some of them seem to contain information about furniture: the tag cloud in Figure 2(c) contains terms such as cabinet and chair. The user can further drill down and explore other statistical information about the data. In Figure 2(a), a histogram displays the most common values for the make attribute; it shows that "Toyota" and "Ford" are the most popular car manufacturers, as they are listed on most sites. By hovering over an individual form, the user sees a preview of the form (see Figure 2(d)); this way, she can inspect a form without having to navigate to the site where it is located.
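The tag clouds described above can be built from simple term statistics. The sketch below is an assumption about the mechanics (count label frequencies across a cluster's forms and keep the top-k), not DeepPeep's actual code:

```python
# Sketch of tag-cloud construction for a form cluster: count how often each
# extracted label appears across the cluster's forms and keep the top-k terms.
from collections import Counter

def tag_cloud(form_labels, k=5):
    """form_labels: one list of extracted labels per form in the cluster."""
    counts = Counter(label for labels in form_labels for label in labels)
    return counts.most_common(k)

# A small Auto cluster; the last form is a misclassified furniture form,
# whose labels would stand out in the cloud (as in the cabinet/chair example).
auto_cluster = [
    ["make", "model", "year", "price"],
    ["make", "model", "mileage"],
    ["cabinet", "chair"],
]
print(tag_cloud(auto_cluster, k=3))
# the two most frequent labels are make and model
```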
3. SYSTEM DEMONSTRATION

We will demonstrate three key aspects of the DeepPeep system:

• Search interface: we will show a series of queries and search strategies that users can apply to effectively locate forms in the DeepPeep repository;

• Building a form collection: we will demonstrate how the components of DeepPeep can be customized to construct form repositories for different domains. We will also demonstrate how the analysis interfaces can be used to refine and further customize this process to improve the quality of the form collection;

• Form repository exploration: we will present the different interactive visualizations and analyses we have designed to help users explore and better understand the contents of the form repository.
Acknowledgements. This work has been partially supported by the National Science Foundation (under grants IIS-0713637, IIS-0746500, CNS-0751152) and a University of Utah Seed Grant. We also thank Sumit Tandon for his contributions to an early prototype of DeepPeep.
4. REFERENCES
[1] L. Barbosa and J. Freire. Siphoning hidden-web data through keyword-based interfaces. In Proceedings of the Brazilian Symposium on Databases (SBBD), pages 309–321, 2004.
[2] L. Barbosa and J. Freire. An adaptive crawler for locating hidden-web entry points. In WWW, pages 441–450, 2007.
[3] L. Barbosa and J. Freire. Combining classifiers to identify online databases. In WWW, pages 431–440, 2007.
[4] L. Barbosa, J. Freire, and A. Silva. Organizing hidden-web databases by clustering visible web documents. In ICDE, pages 621–633, 2007.
[5] K. C.-C. Chang, B. He, and Z. Zhang. Toward large-scale integration: Building a MetaQuerier over databases on the web. In CIDR, pages 44–55, 2005.
[6] M. Galperin. The molecular biology database collection: 2007 update. Nucleic Acids Research, 35, 2007.
[7] Lucene. lucene.apache.org.
[8] J. Madhavan, S. Cohen, X. L. Dong, A. Y. Halevy, S. R. Jeffery, D. Ko, and C. Yu. Web-scale data integration: You can afford to pay as you go. In CIDR, pages 342–350, 2007.
[9] J. Madhavan, D. Ko, L. Kot, V. Ganapathy, A. Rasmussen, and A. Y. Halevy. Google's deep web crawl. PVLDB, 1(2):1241–1252, 2008.
[10] H. Nguyen, T. Nguyen, and J. Freire. Learning to extract form labels. In VLDB, pages 684–694, 2008.
[11] T. Nguyen. Prudent schema matching for web forms. Technical report, University of Utah, 2008.
[12] W. Wu, C. T. Yu, A. Doan, and W. Meng. An interactive clustering-based approach to integrating source query interfaces on the deep web. In ACM SIGMOD, pages 95–106, 2004.