The Web-DL Environment for Building Digital Libraries from the Web
Pável P. Calado, Marcos A. Gonçalves, Edward A. Fox, Berthier Ribeiro-Neto, Alberto H. F. Laender, Altigran S. da Silva, Davi C. Reis, Pablo A. Roberto, Monique V. Vieira, Juliano P. Lage
Federal University of Minas Gerais, Dep. of Computer Science, 31270-901, Belo Horizonte, MG, Brazil
{pavel, alti, berthier, laender, palmieri, davi, pabloa, monique}@dcc.ufmg.br
Virginia Tech, Dep. of Computer Science, Blacksburg, VA 24061, USA
{mgoncalv, fox}@vt.edu
Federal University of Amazonas Dep. of Computer Science 69077-000, Manaus, AM, Brazil
[email protected]
Abstract

The Web contains a huge volume of unstructured data, which is difficult to manage. In digital libraries, on the other hand, information is explicitly organized, described, and managed, and community-oriented services are built to meet specific information needs and tasks. In this paper, we describe an environment, Web-DL, that allows the construction of digital libraries from the Web. The Web-DL environment allows us to collect data from the Web, standardize it, and publish it through a digital library system. It supports the services and organizational structure normally available in digital libraries, while benefiting from the breadth of the Web contents. We experimented with applying the Web-DL environment to the Networked Digital Library of Theses and Dissertations (NDLTD), thus demonstrating that the rapid construction of DLs from the Web is possible. Web-DL also provides a large-scale solution for interoperability between independent digital libraries.
© ACM, 2003. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in Proceedings of the ACM/IEEE Joint Conference on Digital Libraries, JCDL 2003 (May 2003), http://doi.acm.org/10.1145/827201
1. Introduction

The Web contains a huge volume of information. Almost all of it is stored in the form of unstructured data and is, therefore, difficult to manage. Access to the information is provided through browsing and searching, which normally involve no assumptions about the users' tasks or their specific information needs. On the other hand, we have databases, where data has a rigid structure and services are provided for specialized users. Digital libraries (DLs) stand in the middle. We can say that DL users have broader interests than database users, but more specific interests than regular Web users. Also, within DLs information is explicitly organized, described, and managed, targeted at communities of users with specific information needs and tasks, but without the rigidity of database systems.

In this paper we present Web-DL, an environment that allows the construction of digital libraries from the Web. Web-DL allows us to collect data from Web pages, normalize it to a standard format, and store it for use with digital library systems. By using standard protocols and archival technologies, Web-DL enables open, organized, and structured access to several heterogeneous and distributed digital libraries, as well as the easy incorporation of powerful digital library and data extraction tools. The overall environment thus supports the services and organization available in digital libraries, while benefiting from the breadth of the Web contents.

By moving from the Web to a DL we provide quality services for communities of users interested in specific domain information. Services like searching over several different
DLs, browsing, and recommending are made available with high quality, since we reduce the search space, restricting it to the data related to the users' interests, and structure and integrate such data through canonical metadata standards.

We demonstrate the feasibility of our approach by implementing the proposed environment for a digital library of electronic theses and dissertations (ETDs), in the context of the Networked Digital Library of Theses and Dissertations (NDLTD). The NDLTD currently has over 160 members among universities and research institutions, providing support for the implementation of DL services using standard protocols, but is deficient in dealing with members that publish their ETDs only through the Web. Fortunately, our approach matches the growing tendency among sites that publish ETDs to create a Web page for each ETD, containing all the relevant data (or metadata). Using our proposal, we are able to add such ETDs to the NDLTD collection with little user effort.

The Web-DL environment builds upon tools and techniques for collecting Web pages, described in [10], extracting semi-structured data, described in [6, 14], and managing digital libraries, described in [12]. In this paper we show how these tools are seamlessly integrated under Web-DL and extended to provide solutions for data normalization problems usually found when extracting data from the Web. Experiments performed in the context of the NDLTD confirm the quality of the results reported in [4], now obtained with a more general solution and less user effort, since the data extraction process has been further automated.

The rest of this paper is organized as follows. Section 2 discusses related work. Section 3 presents an overview of the architecture proposed for the Web-DL environment. Sections 4, 5, and 6 describe the main components of Web-DL: the ASByE, DEByE, and MARIAN tools, respectively. Section 7 presents our approach to the Web data normalization problem. Section 8 shows an example digital library built using Web-DL. Finally, in Section 9 we discuss some of the problems found and present our conclusions.
2. Context and related work

Digital libraries involve rich collections of digital objects and community-oriented specialized services such as searching, browsing, and recommending. Many DLs are built as federations of autonomous, possibly heterogeneous DL systems, distributed across the Internet [8, 17]. The objective of such federations is to provide users with a transparent, integrated view of their collections and information services. Challenges faced by federated DLs include interoperability among different digital library systems/protocols, resource discovery (e.g., selection of the
best sites to be searched), issues in data fusion (merging of results into a unique ranked list), and aspects of quality of data and services.

One such federated digital library is the Networked Digital Library of Theses and Dissertations (NDLTD) [7], an international federation of universities, libraries, and other supporting institutions focused on efforts related to electronic theses and dissertations (ETDs). Although providing many of the advantages of a federated DL, NDLTD has particular characteristics that complicate interoperability and transparent resource discovery across its members. For instance, institutions are autonomous, each managing most services independently and not being required to report collection updates or changes to central coordinators. Also, not all NDLTD members (yet) support the same standards or protocols. The diversity in terms of natural language, metadata, protocols, repository technologies, character coding, nature of the data (structured, semi-structured, unstructured, multimedia), as well as user characteristics and preferences, makes them quite heterogeneous. Finally, NDLTD already has many members and eventually aims to support all institutions that produce ETDs. New members are constantly added and there is a continuing flow of new data, as theses and dissertations are submitted.

In DL cases like NDLTD, there are basically three approaches to interoperability and transparent resource discovery. They differ in the amount of standardization or effort required by the DL [19], as follows:

Federated services: In this approach to interoperability, a group of organizations decide that their services will be built according to a number of agreed-upon specifications, normally selected from formal standards. The work of forming a federation is the effort required by each organization to implement and keep current with all the agreements. This normally does not provide a feasible solution in a dynamic environment such as the NDLTD.
Harvesting: A difficulty in creating large federations is motivating organizations to participate. So, some recent efforts aim at creating looser groupings of digital libraries. The underlying concept is that the participants make some small effort to enable basic shared services, without specifying a complete set of agreements. The best example is the Open Archives Initiative (OAI) [16], which promotes the use of Dublin Core as a standard metadata format and defines a simple standard metadata harvesting protocol. Metadata from DLs implementing the protocol can be harvested to central repositories upon which DL services can be built. Particularly in the case of OAI, there is an initial barrier to its implementation by some archives, since it involves small amounts of coding and the building of middleware layers, especially for local repositories that sometimes do not match the OAI infrastructure very well, such as, for example, repositories based on the Z39.50 protocol. Further, very small archives may lack the staff resources to install and maintain a server. Moreover, some archives will not take any active steps to open their contents at all, making gathering, the next approach, the only available option.
Gathering: If the various organizations are not prepared to cooperate in any formal manner, a base level of interoperability is still possible by gathering openly accessible information. The best example of gathering is via Web search engines. Because there is minimal staff cost, gathering can provide services that embrace large numbers of digital libraries, but the services are of poorer quality than those that can be achieved by partners who cooperate more fully. This is mainly due to the quality of the data that can be gathered, including its lack of structure and the absence of provenance information.
For NDLTD, a combination of federated search (for the small number of members with Z39.50 support), harvesting (from institutions that agree to use a set of standard protocols), and gathering (from institutions that cannot, or do not want to, use such protocols) is the best solution.

Although the problem of quality with Web data is well known, many have collected data from the Web in order to develop collections of suitable size for various DL-like systems. The Harvest system, one of the first systems to apply focused gathering, had simple HTML-aware extraction tools [3]. PhysNet [20], a project to collect Physics information from the Web, still uses Harvest. The New Zealand Digital Library (http://www.nzdl.org) has been developing collections since 1995 based on content distributed over the Internet. Recent enhancements to the Greenstone system provide additional support, but require the manual construction and programming of wrappers, called plugins and classifiers [21]. Taking a different approach, the CiteSeer system [18] collects scientific publications from the Web and automatically extracts citation information. Its data extraction process, however, is specific to identifying author, title, citations, and other fields common to scientific papers. Similarly, Bergmark [2] proposes the use of clustering techniques to collect pages on scientific topics from the Web, but does not address the issue of how to extract relevant data from such pages. Nevertheless, these works show that, with sufficient manual intervention, useful services can be built with data from the Web.

In the following, we present the architecture of the Web-DL environment, which (1) combines harvesting and gathering to broaden the scope of interoperability in federated digital libraries, and (2) provides a framework to integrate a number of technologies, such as focused crawling, data extraction, and digital library toolkits. Ultimately, Web-DL provides an infrastructure for building high-quality digital libraries from Web contents. We illustrate the usefulness of our approach by using the Web-DL environment to integrate data from OAI-compliant and non-OAI-compliant members of NDLTD.
3. The Web-DL environment architecture

To build an archive from the Web, data must be collected from Web sites and integrated into a DL system. This operation has three main steps: (1) crawl the Web sites to collect the pages containing the data, (2) parse the collected pages to extract the relevant data, and (3) make the data available through a standard protocol. Figure 1 shows the Web-DL environment and architecture for the integration and building of a digital library from the Web.

Collecting Web pages with the target information is done by using the ASByE tool, described in detail in Section 4. After the user provides ASByE with a simple navigation example, a Web crawler is created for the site. This crawler collects all the relevant pages, leaving them available for data extraction.

Collected pages must then be parsed to extract the relevant data. This is accomplished by the DEByE tool, described in detail in Section 5. Given one or more example pages, DEByE is able to create a wrapper for the site to be collected. The site pages are then parsed by DEByE-generated wrappers and the data is extracted and stored locally in a relational database.

In order to be used by most digital library systems (in our case, the MARIAN system [12]), data must be stored in a structured way (e.g., MARC or XML), usually using community-oriented semantic standards (e.g., Dublin Core, or FGDC for geospatial data). In the work reported in this paper, we use ETD-MS, a metadata standard for electronic theses and dissertations [1], which builds upon Dublin Core. Nonetheless, since data in Web sites is frequently in non-standard, unstructured formats, we need a normalization procedure. Our approach to normalizing the extracted data is described in Section 7. This approach presents a more general solution than the one proposed in [4], allowing Web-DL to be easily used in different domains.

Once the data is stored in ETD-MS format, an OAI server set up on top of the local database makes it available to anyone using the OAI Protocol for Metadata Harvesting (OAI-PMH); in our particular case, to the MARIAN system. The MARIAN system, described in Section 6, uses an OAI harvester to collect the metadata extracted from the Web pages by DEByE. This data is stored in a union archive, using MARIAN's indexing modules. Regular DL services are then made available to users through the union archive created by MARIAN.
Figure 1. Proposed architecture for the Web-DL environment.
The following sections describe in detail all the mechanisms used to build the architecture proposed here.
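To make the data flow concrete, the sketch below (in Python, which is not the implementation language of the original tools) shows how one extracted record might be serialized into an ETD-MS-style XML fragment before being stored and exposed through the OAI server. The container element, tag names, and the example OAI-PMH request URL are illustrative assumptions, not the exact formats used by Web-DL.

```python
# Hypothetical sketch: serialize one extracted ETD record as an
# ETD-MS-style XML fragment (Dublin Core elements plus thesis.degree),
# ready to be stored and served by an OAI-PMH repository.
import xml.etree.ElementTree as ET

def to_etdms_xml(record: dict) -> str:
    root = ET.Element("thesis")                        # container element (assumed)
    for field in ("dc.title", "dc.creator", "dc.subject",
                  "dc.date", "dc.type", "dc.identifier"):
        elem = ET.SubElement(root, field.replace(".", ":"))
        elem.text = record.get(field, "none")          # default value for missing fields
    degree = ET.SubElement(root, "thesis:degree")
    for part in ("name", "level", "discipline", "grantor"):
        ET.SubElement(degree, part).text = record.get("thesis.degree." + part, "none")
    return ET.tostring(root, encoding="unicode")

# A harvester such as MARIAN's would then issue standard OAI-PMH requests
# against the Web-DL OAI server, e.g. (base URL is hypothetical):
#   http://webdl.example.org/oai?verb=ListRecords&metadataPrefix=oai_dc
print(to_etdms_xml({"dc.title": "A Thesis Title", "dc.creator": "J. Doe"}))
```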
4. Obtaining pages from the ETD sites: the ASByE tool

In this section we describe how we use the ASByE tool for generating the agents that automatically collect pages containing data of interest from the Web. These agents can be seen as specialized crawlers that automatically traverse the publishing sites, exploring hyperlinks, filling forms, and following threads of pages until they find the target pages, that is, the pages that contain data of interest. Each target page found is retrieved and can have its data extracted by a wrapper.

ASByE (Agent Specification By Example) is a user-driven tool that generates agents for automatically collecting sets of dynamic or static Web pages. The ASByE tool features a visual metaphor for specifying navigation examples, automatic identification of collections of related links, automatic identification of threads of answer pages generated from queries, and dynamic filling of forms using parameters provided to the agents by the user. In a typical interaction with the tool, the user provides examples of (1) how to reach the target pages, filling any forms if needed, and (2) how to group together related pages. The output of the tool is a parameterized agent that fetches the selected
pages. The ASByE tool is fully described in [10].

The graphical interface of the ASByE tool uses a graph-like structure in which nodes displayed in a workspace represent pages (or page sets) and directed arcs represent hyperlinks. The user navigates from node to node, exploring the hyperlinks according to her interests. The source nodes in the graph (i.e., the ones not pointed to by any other node) are called Web entry points and are directly selected by the user by entering the URL of the page used to start the exploration. The tool then fetches the page and builds a node corresponding to it. From this point onward, the user can select, for each node, an operation to perform. The set of operations available depends on the type of node reached. The most common and simplest operation allows the user to access a document to explore by selecting one of its hyperlinks.

In Figure 2, we illustrate other features of the ASByE tool, showing how to generate an agent for retrieving pages from the Virginia Tech ETD Collection. The user begins by selecting the URL http://scholar.lib.vt.edu/theses/browse/by_author/all.htm as an entry point. The page at this URL contains a list of hyperlinks to each one of the target pages containing the documents available in the Virginia Tech ETD Collection. Using a number of heuristics based on criteria such as hyperlink distribution, hyperlink placement, similarity among URLs, and similarity among hyperlink labels, the tool identifies the list of links to the target pages, i.e., the pages to be collected.
Figure 2. Snapshot of an agent specification session with the ASByE tool.

The user can then select the agent generation operation. The agent resulting from this specification session will first retrieve the entry point URL, extract from it all URLs currently belonging to the link collection, and then retrieve each target page corresponding to these URLs, returning them as its output.

In some sites, there is no way to browse the whole document collection. The only way of reaching the target pages is by filling an HTML form, submitting it, and then navigating through the answer pages. Although ASByE is capable of generating agents to perform such operations, this feature was not used for the problem presented in this paper. A detailed description of the feature can be found in [10].
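As a rough illustration of what such a generated agent does (this is not ASByE's actual output, which is described in [10]), the Python sketch below fetches an entry page, keeps the hyperlinks whose URLs resemble a user-supplied example link, standing in crudely for the URL-similarity heuristic mentioned above, and downloads the corresponding target pages. The URLs and the similarity test are assumptions for illustration.

```python
# Illustrative sketch of a collection agent: fetch an entry page, select
# links similar to an example target link, and download the target pages.
import re
import urllib.request
from urllib.parse import urljoin

def collect_target_pages(entry_url: str, example_link: str) -> dict:
    html = urllib.request.urlopen(entry_url).read().decode("utf-8", "replace")
    links = re.findall(r'href="([^"]+)"', html)            # all hyperlinks on the entry page
    prefix = example_link.rsplit("/", 1)[0]                # crude URL-similarity heuristic
    targets = [urljoin(entry_url, l) for l in links
               if urljoin(entry_url, l).startswith(prefix)]
    pages = {}
    for url in targets:
        pages[url] = urllib.request.urlopen(url).read()    # target page, ready for extraction
    return pages

# Hypothetical usage, mirroring the Virginia Tech example above:
# pages = collect_target_pages(
#     "http://scholar.lib.vt.edu/theses/browse/by_author/all.htm",
#     "http://scholar.lib.vt.edu/theses/available/some-etd-page/")
```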
5. Wrapping publishing sites: the DEByE tool

We now describe the use of the DEByE tool for generating wrappers that extract data from pages in the collected sites. For a full discussion of the DEByE tool and the DEByE approach, we refer the interested reader to [14].

DEByE (Data Extraction By Example) is a tool that generates wrappers for extracting data from Web pages. It is fully based on a visual paradigm that allows the user to specify a set of examples of the objects to be extracted. These example objects are taken from a sample page of the same Web source from which other objects (data) will be extracted. By examining the structure of the Web page and the HTML text surrounding the example data, the tool derives an Object Extraction Pattern (OEP), a set of regular expressions that includes information on the structure of the objects to be extracted and also on the textual context in which the data appears in the Web pages. The OEP is then passed to a general purpose wrapper that uses it to extract data from new pages in the same Web source, provided that they have structure and content similar to the sample page, by applying the regular expressions and some structuring
operations.

DEByE is currently implemented as a system that functions as a Web service, to be used by any application that wishes to provide data extraction functionality to end users. This allows us to implement any type of interface on top of the DEByE core routines. For instance, for general data extraction solutions, we use a DEByE interface based on the paradigm of nested tables [5], which is simple, intuitive, and yet powerful enough to describe the hierarchical structures very common in data available on the Web. For the Web-DL environment, we have built an ETD-MS-specific interface, with which the user can extract examples and assign them directly to ETD-MS fields. The DEByE/Web-DL interface was fully implemented in JavaScript and can be used via any Web browser that supports the language.

In Figure 3 we show a snapshot of a user's session for specifying an example object on one or more sample pages. The sample pages are displayed in the upper window, also called the Source window. In the lower window, also called the Fields window, all the ETD-MS fields, such as Identifier, Title, etc., are available. The user can select pieces of data of interest from the Source window and "paste" them into the respective cells of the Fields window. After giving an example attribute, the user can select the "Test Attribute" button to verify whether DEByE is able to collect the selected attributes from the sample pages. Finally, after specifying all the example objects, the user can click the "Generate Wrapper" button to generate the corresponding OEP, which encompasses structural and textual information on the objects present in the sample pages. Once generated, this OEP is used by an Extractor module that, when receiving a page similar to the sample page, performs the actual data extraction of new objects and outputs them using an XML-based representation.
Figure 3. Snapshot of an example specification session with the DEByE/Web-DL interface.

Since we are using ETD-MS, all the extracted objects are plain, i.e., they do not have a hierarchical or nested structure. In practice, the ETD-MS field thesis.degree contains four nested fields: name, level, discipline, and grantor. However, to simplify the interface, we chose to represent them as independent fields. It is interesting to note that DEByE is also capable of dealing with more complex objects, by using a so-called bottom-up assembly strategy, explained in [14].
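To give a simplified picture of what an OEP-driven wrapper does, the sketch below associates each ETD-MS field with a regular expression that captures its value from the surrounding HTML context and applies these patterns to a page. DEByE's actual OEP representation and matching machinery are richer (see [14]); the patterns, field layout, and sample snippet here are assumptions for illustration only.

```python
# Simplified wrapper: a dictionary of regular expressions standing in for an
# Object Extraction Pattern, applied to a page to produce ETD-MS fields.
import re

OEP = {  # field -> pattern capturing the value from its textual context (assumed HTML layout)
    "dc.title":   re.compile(r"<h1>(.*?)</h1>", re.S),
    "dc.creator": re.compile(r"Author:\s*</b>\s*(.*?)<", re.S),
    "dc.date":    re.compile(r"Date:\s*</b>\s*(.*?)<", re.S),
}

def extract(page_html: str) -> dict:
    record = {}
    for field, pattern in OEP.items():
        match = pattern.search(page_html)
        if match:
            record[field] = match.group(1).strip()
    return record

sample = "<h1>A Thesis Title</h1><b>Author:</b> J. Doe<br><b>Date:</b> May 21, 2002<br>"
print(extract(sample))
# {'dc.title': 'A Thesis Title', 'dc.creator': 'J. Doe', 'dc.date': 'May 21, 2002'}
```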
6. Providing DL services: the MARIAN system

MARIAN is a digital library system designed and built to store, search over, retrieve, and browse large numbers of diverse objects in a network of relationships [12] (see also Java MARIAN at http://www.dlib.vt.edu/projects/MarianJava/index.html). MARIAN is built upon four basic principles: unified representation based on semantic networks, weighting schemes, a class system and class managers, and extensive use of lazy evaluation.

In MARIAN, semantic networks, which are labeled directed graphs, are promoted to first-class objects and used to represent any kind of digital library structure, including internal structures of digital objects and metadata, and different types of relationships among objects and concepts (e.g., as in thesauri and classification hierarchies). In order to support information retrieval services, nodes and links in MARIAN's semantic networks can be weighted.
The fundamental concept is that of the weighted object set: a set of objects whose relationship to some external proposition is encoded in their decreasing weight within the set. Nodes and links are further organized in hierarchies of object-oriented classes. Each class in a particular digital library collection is the responsibility of a class manager. Among its other functions, each MARIAN class manager implements one or more search methods. All MARIAN searchers are designed to operate "lazily". During result presentation, only a small subset of results is presented until the user explicitly requests the remaining answers. The number of instances requested, and thus the transmission costs across the network, are severely limited relative to the size of the sets they manage.

In the context of the Web-DL environment, MARIAN provides searching and browsing services for the DL built from the Web. Data from OAI providers and from non-OAI-compliant members coming from the Web-DL environment are integrated into a Union Catalog. MARIAN is equipped with OAI harvesters able to periodically collect data from the Union Catalog.

MARIAN is completely reconfigurable for different DL collections; it uses digital library generators and a special DL declarative language called 5SL [11] for this purpose. Using these, specific loaders for different metadata formats (e.g., ETD-MS) can be generated. Once a new sub-collection is harvested, the loading process is applied.
For every OAI record in the new sub-collection, a new part of the semantic network for the metadata record is created, representing its internal structure according to a metadata standard and the connections among text terms and text parts. The new part of the semantic network for the record is then integrated into the MARIAN knowledge base. At the end of the loading process, weights for the resulting collection network are recomputed to take global statistics into account.

Structured searches are supported by processing classes, class managers, and specific user interfaces also created during the DL generation process. Results of structured queries are displayed as ranked lists for browsing, with entries and links created by specific XSL stylesheets. Presentations of full documents, also generated with special stylesheets, contain links that allow navigation to the originally collected Web page.
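As a rough analogy for the weighted object set and lazy result presentation described above (not MARIAN's actual Java class hierarchy), the toy Python sketch below keeps search results ordered by decreasing weight and hands them out only in small batches, as the user asks for more.

```python
# Toy weighted object set: results are held as (weight, object) pairs in
# decreasing weight order and handed out a few at a time.
import heapq
from typing import Any, Iterable, Tuple

class WeightedObjectSet:
    def __init__(self, scored_objects: Iterable[Tuple[float, Any]]):
        # Store negated weights so the heap pops the highest weight first.
        self._heap = [(-w, obj) for w, obj in scored_objects]
        heapq.heapify(self._heap)

    def next_batch(self, size: int = 10):
        """Return up to `size` further results; nothing else is presented or sent."""
        batch = []
        while self._heap and len(batch) < size:
            w, obj = heapq.heappop(self._heap)
            batch.append((-w, obj))
        return batch

# Hypothetical usage: present two hits, fetch more only on explicit request.
results = WeightedObjectSet([(0.92, "etd-101"), (0.85, "etd-042"), (0.40, "etd-007")])
print(results.next_batch(2))   # [(0.92, 'etd-101'), (0.85, 'etd-042')]
print(results.next_batch(2))   # [(0.4, 'etd-007')]
```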
7. Converting the extracted data

For our particular problem, we chose to store the data extracted by the DEByE wrappers in the ETD-MS format, to comply with the OAI-PMH. Web sites, however, are far from containing standardized data, and some normalizing operations need to be performed. Four main problems were found when converting data to the standard format: (1) mandatory data is not present in the page; (2) data is present, but only implicitly; (3) data is not in a required format; and (4) the extracted data is not in the appropriate encoding.

Regarding the first problem, when data is not present in the page, some replacement must be found. The solution for most mandatory fields is to use a default value, like "none". For other fields, like "identifier", a unique value must be generated, for instance by using sequential values or timestamps. The second problem happens when some piece of information is known, but the data is not explicitly represented in the page. For instance, for the dc.publisher field, we may know we are collecting from the Virginia Tech site, but this information appears nowhere in the page. The third problem occurred mainly for the dc.date field. As required by ETD-MS, the date should be in ISO 8601 format. Therefore, dates collected from the Web pages must be converted before being stored. Finally, in many ETD pages, formatting HTML tags and HTML entities are found within the extracted text fields. Also, non-English sites use many different character encodings to represent foreign characters. Some cleaning routines are needed to eliminate spurious tags and to convert between character encoding systems.

A general solution to this data cleaning and conversion problem is very hard to find. In Web-DL, we chose an intermediate solution between fully automating the process and manual user intervention.
A set of predefined modules for processing the data is available, and the user can select which ones to apply to the data being extracted. This process is fully implemented in the DEByE/Web-DL interface, providing seamless integration with the Web-DL environment. For instance, as shown in Figure 4, for the date field the user can apply a filter that converts the collected date to ISO 8601 format. A filter to insert a default value can also be applied to all fields. Filters to convert the character encoding and to strip HTML tags can be selected using the checkboxes at the bottom of the window, since these are applied to all objects collected, independent of their value or type. When extracting the data from a Web page, the DEByE-generated parser applies the selected modules to the objects. As a result, all data will be in the desired standard format and can be stored using ETD-MS.

The data cleaning and conversion modules are simply string processing routines. They take a string as input, process it, and return the resulting string as output. This provides great flexibility for the construction of such modules. Thus, users can implement data cleaning modules according to their own specific needs, using any available programming language. More complex modules can be built using an API provided by DEByE, which allows, for instance, the passing of parameters other than the string to be processed. Of course, a set of predefined modules is already included in DEByE, to provide users with no programming experience with as much data cleaning functionality as possible. These are fully reusable and appropriate for any project. This approach solves the problems found in our preliminary experiments with Web-DL [4], while maintaining the modularity of the environment and minimizing user intervention in the process of building a digital library from the Web.

Once all the normalizing problems are solved, data can be stored in a relational database, later to be rendered using ETD-MS. The database is then made accessible through an OAI server. Using the OAI-PMH, the data extracted from the Web can be shared with any DL acting as an OAI service provider. In our environment, the extracted data is harvested and integrated with data harvested from other NDLTD members within MARIAN.
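Since the cleaning and conversion modules are plain string-to-string routines, they are easy to sketch. The Python functions below illustrate, under our own assumptions about the input formats, the kinds of filters discussed above (default value, ISO 8601 date conversion, HTML tag stripping, character encoding normalization) and how they could be chained over an extracted field; the actual DEByE modules are not listed in this paper.

```python
# Illustrative string-in/string-out cleaning filters and a simple pipeline.
import re
from datetime import datetime

def default_value(value: str, default: str = "none") -> str:
    return value.strip() or default

def to_iso8601(value: str) -> str:
    # Try a few common date layouts found in ETD pages (assumed examples).
    for fmt in ("%B %d, %Y", "%d/%m/%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return value  # leave unrecognized dates untouched

def strip_html(value: str) -> str:
    return re.sub(r"<[^>]+>", "", value)

def normalize_encoding(raw: bytes, source_encoding: str = "latin-1") -> str:
    return raw.decode(source_encoding, errors="replace")

def apply_filters(value: str, filters) -> str:
    for f in filters:          # each filter takes a string and returns a string
        value = f(value)
    return value

print(apply_filters("<b>May 21, 2002</b>", [strip_html, to_iso8601]))  # 2002-05-21
```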
8. An example Web ETD digital library

For this work, we collected pages containing ETDs from the sites of 21 different institutions selected from the list of NDLTD members, available at http://www.theses.org. These experiments were performed in the same context as reported in [4], but using the new integrated data cleaning and conversion modules. The ETD sites contained a total of 9595 ETDs. It was not possible to collect information from the sites of 7 institutions, since these were off-line or available only through a search interface.
Figure 4. Data cleaning and conversion in the DEByE/Web-DL interface.

Of the 6 mandatory ETD-MS fields, an average of 29.5% were missing in the collected pages, and were therefore filled with a default value. This value was inserted by the DL builder through the "default value filter" of the DEByE/Web-DL interface, thus requiring only one simple operation per field. The default value filter also allowed for the creation of unique identifiers by appending a serial number to the dc.identifier field. This was one of the major problems found in our previous experiments [4], which had required the manual implementation of data insertion routines. Here, it was solved by simply selecting options from the user interface. Table 1 shows the number of ETDs in which mandatory fields were missing.

Field name      ETDs missing
dc.title        43 (0.4%)
dc.creator      23 (0.2%)
dc.subject      2349 (24%)
dc.date         283 (3%)
dc.type         703 (7%)
dc.identifier   4800 (50%)

Table 1. Mandatory fields missing from the collected ETDs.

Table 2 shows the numbers for each site collected. It can be seen that, although not all, most of the information was collected and extracted. It is interesting to note that fields like dc.publisher or dc.type, which are often implicit in the collected site entry pages, but not available as extractable examples, could be easily inserted as a default value for the whole site. This means that the user needed only to type one value for each site, whereas in our previous experiments each site required the implementation of a separate routine.
The work required to include a site in the digital library consisted of providing sets of examples to the ASByE and DEByE tools. For each collected site only one example was needed to create the crawling agents. To generate parsers for data extraction, an average of 2–3 examples per field was required. This represented an average of 9 minutes of work per site by a specialized user, much less than previously reported in [4]. The reduction in time was largely due to the new automated process of converting data to a standard format. An interesting example is that of the dc.date field, which previously required that the user extract each part of the date (day, month, year) individually or implement a conversion routine for the ISO 8601 format. For the 21 institutions in our example, the total effort of the user amounted to approximately 3 hours and 15 minutes. Notice that most of this is due to processing time, which can be improved by further optimizing the system code or using faster hardware. Since we do not expect Web sites to be massively submitted to the system, this is a reasonable human effort to collect the data of interest. In the future, we expect to further automate this process, to reduce the time required as more sites are harvested.

To illustrate, Figure 5 shows an ETD published by Uppsala University. Once collected and extracted, all the metadata is stored and made available by the MARIAN system. Figure 6 shows the results of a query over the ETDs collected from the Web, using the MARIAN system. By using Web-DL, not only searching, but any number of DL services, such as browsing and filtering, among others, can be performed over the data extracted from the Web.
ETD Site             Number of ETDs   Fields per ETD   Mandatory fields missing   Optional fields inserted
Adelaide U.                19               4                    3                          5
Australia N.U.             39               5                    3                          4
Concordia U.                3               9                    0                          2
Curtin U.T.                57              10                    0                          2
Griffith U.                40               5                    3                          4
H-U. Berlin               439               7                    1                          2
N.S.Y.U. Taiwan          1786               9                    1                          3
OhioLINK                  932               6                    2                          4
Queensland U.T.            53               5                    3                          4
Rhodes U.                 134               5                    3                          5
U. Kentucky                30               9                    1                          2
U. New South Wales         89               5                    3                          4
U. Tennessee               10               8                    1                          3
U. Virginia               619               8                    0                          2
U. Waterloo               105               5                    3                          5
U. Wollongong               6               5                    3                          4
U.P. Valencia             264               6                    1                          3
Uppsala U.               1567               3                    3                          5
Victoria U.T.               3               5                    3                          4
Virginia Tech            3278               9                    0                          2
Worcester P.I.            122              10                    0                          2

Table 2. Statistics for the data collected from the ETD sites.
Figure 5. Metadata for an ETD, available at the Uppsala University Web site.
Figure 6. Search results for query “fusion medical images” over the ETDs collected from the Web.
9. Summary and conclusions

We proposed the Web-DL environment for the construction of digital libraries from the Web. Our demonstration environment integrates standard protocols, data extraction, and digital library tools to build a digital library of electronic theses and dissertations. The proposed environment provides an important first step towards the rapid construction of large DLs from the Web, as well as a large-scale solution for interoperability between independent digital libraries.

In this paper, Web-DL was applied to the Networked Digital Library of Theses and Dissertations, where we were able to collect data from more than 9000 electronic theses and dissertations. Due to the flexibility of the tools that compose Web-DL, we expect it to be easily applicable to any other domain, requiring, at most, changes in the user interface. Different interfaces are easily implementable for specific areas. Alternatively, a general interface like nested tables can be used for the majority of data available on the Web.
9.1. Lessons learned

Moving from the Web to a digital library is not a trivial task. Besides collecting pages, we are faced with the difficult problem of transforming semi-structured data into structured data. Since there may not be a general solution for this problem, it is important to summarize the problems found and the solutions applied when building the digital library of ETDs from the Web.

One of the main problems found was that some of the ETD sites to be collected provide access to their data only through search interfaces, resulting in the hidden Web problem [13]. Although we did not address this problem in our experiments, it can be partially solved by the use of the ASByE tool, which allows filling forms and submitting queries
to reach the hidden pages. Thus, although it is impossible to guarantee that all data will be collected, the Web-DL environment is able to minimize the hidden Web problem, allowing us to obtain information otherwise unavailable to common Web crawlers.

Although there are many approaches to data extraction, as discussed in [15], cases will always be found where wrappers must be built manually. For instance, Web pages within a site can be very different from each other, making it very hard to build a generic wrapper for the whole site. In our experiments, the use of the DEByE tool avoided all such problems and all wrappers were built with minimal effort. This may be due to the fact that most ETD sites were quite regular, but other experimental results [14] have shown that our approach to Web data extraction might be equally effective in more general and complex environments.

Finally, we face the problem of making the unstructured Web data fit a standard pattern. In Web-DL, we adopted a compromise solution, where a set of predefined data cleaning and conversion modules is available and can be selected by the user collecting data. To keep the solution as general as possible, we allow users to implement their own extra modules, according to their specific needs. This solution still requires some user intervention, but it is very general, and user effort is reduced to a minimum.

In sum, each of the tasks for extracting information from the Web into a DL environment presents its own set of problems. A general solution for building digital libraries from the Web depends on general solutions for each of these tasks and on an efficient integration of such solutions. The Web-DL environment provides such an integration and, through experiments, has shown itself to be a fast and efficient DL collection building tool. Further, using Web-DL to achieve interoperability between independent digital libraries requires as little effort as a gathering solution but provides the quality of data and services usually obtained only by harvesting or federated solutions.
9.2. Future work
The MARIAN system allows for harvesting data from NDLTD member sites using a variety of standard protocols. Therefore, an immediate first step is to integrate the data extracted from the Web with data collected from other member sites. A need resulting from this integration is that of deduping, e.g., recognizing two instances of the same object coming from different sources, or combining search results coming from internal repositories and external sources. Approaches to these problems are currently being studied and will be implemented in the future. MARIAN also allows for the use of probability estimates for the quality of the extracted data and their utilization in retrieval operations [12]. We are currently studying a coherent way of computing these probabilities directly from the DEByE tool.

In the current stage of our work, the generation of wrappers for each Web source was accomplished with the DEByE tool, by selecting example objects (i.e., bibliographic entries) from sample pages of each of the sources. As we expect the number of sources to increase rapidly, we intend to deploy the automatic example generation method described in [9]. Such a method uses data available in a pre-existing repository (e.g., titles, author names, keywords, subject areas, etc.) to automatically identify similar data in sample pages of new sources and to assemble example objects. By using it, we expect to automate the generation of wrappers, at least for a considerable number of cases.

We will also extend the current Web-DL environment to consider classification of data extracted from the Web using a number of classification schemes, such as the ACM or the Library of Congress classification schemes, and domain-specific ontologies. Finally, the current work on the Web-DL environment is largely concentrated on improving the quality of data. In the near future we will extend and incorporate new kinds of networks (e.g., belief networks) into MARIAN to improve the quality of current and future DL services.
10. Acknowledgments

Thanks are given for the support of NSF through its grants IIS-0086227 and DUE-0121679. The first author is supported by MCT/FCT scholarship SFRH/BD/4662/2001. The second author is supported by AOL and by CAPES, 1702-980. Work on MARIAN also has been supported by the National Library of Medicine. Work at UFMG has been supported by CNPq project I3DL, process 680154/01-9.

References

[1] A. Atkins, E. A. Fox, R. K. France, and H. Suleman. ETD-MS: an interoperability metadata standard for electronic theses and dissertations. http://www.ndltd.org/standards/metadata/, 2001.
[2] D. Bergmark. Collection synthesis. In Proceedings of the 2nd ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL'02, pages 46–56, Portland, Oregon, USA, June 2002.
[3] C. M. Bowman, P. B. Danzig, D. R. Hardy, U. Manber, and M. F. Schwartz. The Harvest information discovery and access system. Computer Networks and ISDN Systems, 28(1-2):119–125, December 1995.
[4] P. Calado, A. S. da Silva, B. A. Ribeiro-Neto, A. H. F. Laender, J. P. Lage, D. de Castro Reis, P. A. Roberto, M. V. Vieira, M. A. Gonçalves, and E. A. Fox. Web-DL: an experience in building digital libraries from the Web. In Proceedings of the 2002 ACM CIKM International Conference on Information and Knowledge Management, pages 675–677, McLean, Virginia, USA, November 2002. Poster session.
[5] A. S. da Silva, I. M. R. E. Filha, A. H. F. Laender, and D. W. Embley. Representing and querying semistructured Web data using nested tables with structural variants. In Proceedings of the 21st International Conference on Conceptual Modeling, ER 2002, pages 135–151, October 2002.
[6] D. de Castro Reis, R. B. Araújo, A. S. da Silva, and B. Ribeiro-Neto. A framework for generating attribute extractors for Web data sources. In Proceedings of the 9th Symposium on String Processing and Information Retrieval (SPIRE'02), pages 210–226, Lisboa, Portugal, September 2002.
[7] E. A. Fox, M. A. Gonçalves, G. McMillan, J. Eaton, A. Atkins, and N. Kipp. The Networked Digital Library of Theses and Dissertations: Changes in the university community. Journal of Computing in Higher Education, 13(2):102–124, Spring 2002.
[8] N. Fuhr. Networked information retrieval. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, page 344, August 1996.
[9] P. B. Golgher, A. S. da Silva, A. H. F. Laender, and B. A. Ribeiro-Neto. Bootstrapping for example-based data extraction. In Proceedings of the 2001 ACM CIKM International Conference on Information and Knowledge Management, pages 371–378, Atlanta, Georgia, USA, November 2001.
[10] P. B. Golgher, A. H. F. Laender, A. S. da Silva, and B. Ribeiro-Neto. An example-based environment for wrapper generation. In Proceedings of the 2nd International Workshop on The World Wide Web and Conceptual Modeling, pages 152–164, October 2000.
[11] M. A. Gonçalves and E. A. Fox. 5SL: A language for declarative generation of digital libraries. In Proceedings of the 2nd ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL'02, pages 263–272, Portland, Oregon, USA, June 2002.
[12] M. A. Gonçalves, P. Mather, J. Wang, Y. Zhou, M. Luo, R. Richardson, R. Shen, L. Xu, and E. A. Fox. Java MARIAN: From an OPAC to a modern digital library system. Lecture Notes in Computer Science, Springer, 2476:194–209, September 2002.
[13] P. G. Ipeirotis, L. Gravano, and M. Sahami. Probe, count, and classify: categorizing hidden Web databases. SIGMOD Record, 30(2):67–78, June 2001.
[14] A. H. F. Laender, B. Ribeiro-Neto, and A. S. da Silva. DEByE – data extraction by example. Data and Knowledge Engineering, 40(2):121–154, February 2002.
[15] A. H. F. Laender, B. Ribeiro-Neto, A. S. da Silva, and J. S. Teixeira. A brief survey of Web data extraction tools. SIGMOD Record, 31(2):84–93, June 2002.
[16] C. Lagoze and H. Van de Sompel. The Open Archives Initiative: Building a low-barrier interoperability framework. In Proceedings of the 1st ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL'01, pages 54–62, June 2001.
[17] C. Lagoze, D. Fielding, and S. Payette. Making global digital libraries work: Collection services, connectivity regions, and collection views. In Proceedings of the 3rd ACM International Conference on Digital Libraries, DL'98, pages 134–143, Pittsburgh, Pennsylvania, USA, June 1998.
[18] S. Lawrence, C. L. Giles, and K. Bollacker. Digital libraries and Autonomous Citation Indexing. IEEE Computer, 32(6):67–71, June 1999.
[19] K. Maly, M. Zubair, and X. Liu. Kepler - an OAI data/service provider for the individual. D-Lib Magazine, 7(4), April 2001.
[20] PhysNet. http://physnet.uni-oldenburg.de/PhysNet/, 2002.
[21] I. H. Witten, S. J. Boddie, D. Bainbridge, and R. J. McNab. Greenstone: A comprehensive open-source digital library software system. In Proceedings of the 5th ACM International Conference on Digital Libraries, pages 113–121, San Antonio, Texas, USA, June 2000.