File System Support for Search¹

Mic Bowman, Mirjana Spasojevic, Alfred Spector

Abstract

This paper describes the project goals for Transarc's ARPA research. Our overriding goal is the development of a new file system interface that supports enhanced file location, organization, and manipulation. The interface must handle problems of scale in a wide-area environment.

The Gulf Tower, 707 Grant Street, Pittsburgh, PA 15219

¹ This research was funded by the Advanced Research Projects Agency, under contract number F1962893-C-1076. The views and conclusions expressed in this paper are those of the authors and do not represent the official position of ARPA or Transarc Corporation.

1 Overview

Wide area information systems provide access to widely distributed information through many different application interfaces like hierarchical directories [14], hypertext [1], menus [11], and search [6]. Each interface has strengths and weaknesses. Circumstances and requirements determine the most effective interface for a particular user.

Improved network connectivity and user interfaces increase the demand for wide-area information sharing. Wide area information systems exhibit exponential growth patterns in information volume, number of users, and diversity of use [4]. More users publish and access publicly available information. Increased network bandwidth facilitates the sharing of many new, complex data types such as audio and video. Formerly, the user base consisted primarily of computer professionals. Currently, the user base is more diverse, including elementary school students, business professionals, and other non-technical users. Each user and group of users wants to use information in a unique way.

Exponential growth causes problems of increasing scale that complicate organization and location of information, overload servers and networks, and diversify access privileges. The volume of information makes it difficult for users to find interesting information even in search-based information systems. Existing systems lack a mechanism for distributing load, so popular servers quickly become performance bottlenecks. To accommodate diverse user requirements, it is necessary to classify rights and privileges. Existing security mechanisms cannot handle the number and diversity of users that access wide-area information.

Several decades of distributed systems research already provide solutions to the problems of scale [13]. Mechanisms for caching, replication, location transparency, and authentication facilitate improved performance, reliability, and security. Past research and commercial use demonstrate that the distributed file system paradigm, in particular, is sustainable at a very large scale. Our previous study demonstrates that a wide-area distributed file system effectively satisfies the requirements of individual organizations and, at the same time, offers a vehicle for wide-area information sharing and collaboration [14].

In contrast to popular "tool-based" approaches to information sharing, we propose a new file system interface to information, based on existing distributed systems research, that addresses issues of scale in wide-area information systems. We encapsulate a file within an object called a synopsis. A synopsis maintains tag/value attributes that summarize the contents of the underlying file. The interface to a synopsis defines methods to update attributes, generate new attributes, and resolve references to related synopses. A special type of synopsis, a digest, exports methods to resolve queries on attributes of a collection of synopses. This representation supports a superset of the interfaces defined by existing information systems. The interface is constructed on top of established platforms for wide-area file access and remote procedure call. We believe that the scalability of existing technology will enable our interface to scale several orders of magnitude beyond existing information systems.


2 Synopses

A wide-area information system contains many kinds of information:

- documents and annotations
- data files
- images, sound, and video
- indexes and directories
- services

Heterogeneity makes it unreasonable to force a common format for all information. However, to manipulate effectively the large volume of information in a wide-area information system, there must be a common access method. We propose a uniform, logical interface to information based on typed, structured synopses. As shown in Figure 1, a synopsis acts as a uniform interface to interesting properties of a file that is stored in a file system. A synopsis is a logical interface in the sense that it is not a file, but rather information about a file. A special type of synopsis, a digest, encapsulates a collection of synopses. Digests organize related information and provide efficient searching mechanisms.

[Figure 1 diagram. Layers: Logical, Uniform Interface to Synopses; Heterogeneous File System Interface.]

Figure 1: A Logical Interface to Heterogeneous Information.

2.1 Attributes

A synopsis consists of a set of attributes that represent properties of the file, as shown in Figure 2. An attribute is a tag/value pair such as (type C-source). The set of attributes is dynamic, but generally consists of the file type, a name, a list of aliases, and a unique identifier. Some synopses have type-specific attributes; e.g., a file that contains C source code might have attributes for included files, functions and variable definitions, development project, and modification history. The content attribute stores a pointer to the actual location of the file as a Uniform Resource Locator (URL) [2].

    ( (uid         !ab09)
      (name        hello.c)
      (file-path   HOME/source/C)
      (type        C-source)
      (description The canonical first C program that prints
                   "hello world" on standard output.)
      (functions   main)
      (includes    stdio.h)
      (content     afs://transarc.com/usr/mic/src/hello.c))

Figure 2: A Synopsis.

The type of a synopsis specifies the domain of acceptable attributes. In particular, it constrains the set of applicable attribute tags and enforces a structure on the corresponding attribute values. For example, files of type C-source have attributes that list the functions that are defined and the header files that are included. The type also constrains the content attribute to point to a C code fragment.
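As an illustration only, the idea that a synopsis is a typed set of tag/value attributes can be sketched in a few lines of Python. The class name `Synopsis`, the tag tables `COMMON_TAGS` and `TYPE_TAGS`, and the accessor names are assumptions made for this sketch; the paper does not prescribe a concrete representation.

```python
from dataclasses import dataclass, field

# Hypothetical schema: tags shared by every synopsis, plus per-type tags.
COMMON_TAGS = {"uid", "name", "aliases", "type", "description", "file-path", "content"}
TYPE_TAGS = {
    "C-source": {"functions", "includes"},
    "mail": {"from", "to", "date", "subject", "summary"},
}


@dataclass
class Synopsis:
    type: str
    attributes: dict = field(default_factory=dict)

    def allowed_tags(self):
        """The synopsis type constrains which attribute tags are applicable."""
        return COMMON_TAGS | TYPE_TAGS.get(self.type, set())

    def put(self, tag, value):
        if tag not in self.allowed_tags():
            raise KeyError(f"tag {tag!r} is not valid for type {self.type!r}")
        self.attributes[tag] = value

    def get(self, tag, default=None):
        return self.attributes.get(tag, default)


# The synopsis of Figure 2, rebuilt attribute by attribute.
hello = Synopsis("C-source")
hello.put("uid", "!ab09")
hello.put("name", "hello.c")
hello.put("functions", ["main"])
hello.put("includes", ["stdio.h"])
hello.put("content", "afs://transarc.com/usr/mic/src/hello.c")
```

In this sketch a C-source synopsis accepts `functions` and `includes` but rejects a mail-only tag such as `from`, which is one simple way to read "the type constrains the set of applicable attribute tags".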

2.2 Methods

A user can add, remove, or retrieve attribute values using the get and put methods. A synopsis defines at least four additional methods: derive, display, invoke, and resolve.

The derive method generates and maintains attributes in the synopsis. For example, the derive method for a synopsis that represents a mail message summarizes the contents of the message. It might collect from the message file attributes such as "From", "To", and "Subject". The derive method for a synopsis that corresponds to a mailbox summarizes the contents of the mailbox by indexing the synopses created for each message in the mailbox. When the contents of the mailbox change, the derive method recomputes the index.

The display method formats information in the synopsis for presentation to a user. That is, the display method translates a synopsis into a document using a uniform markup language that all clients understand. HTML, the hypertext markup language, is the common markup language for documents in the World-Wide Web [1]. Gopher menus, FTP and file system directories, and WAIS result sets are translated into HTML documents for display. A textual document display format is insufficient for files that contain video or interactive media. However, the display method is intended to provide a uniform way of looking at a synopsis. Text serves as a least common denominator since all files have some textual representation.

The invoke method creates a channel for interacting with a synopsis. The protocol for interacting with the channel is not defined, however. A synopsis that represents a database creates a channel to an application that understands the format of the database. The protocol for manipulating the channel is defined by the database application. A video stream creates a channel for displaying images. The protocol for interacting with the video channel includes operations to move forwards or backwards in the stream. The invoke method overcomes limitations in the display method to provide more flexible interaction with synopses. However, it defines a uniform channel creation interface, not a uniform operational interface.

The resolve method translates a query into a synopsis. For example, a directory defines a resolve method that translates file names into the corresponding files. A hypertext document maps an anchor (a position in the document) to a target synopsis. The query format depends on the type of a synopsis.

Figure 3 presents an example of a mail message synopsis and a digest for a bboard where the message is posted.

    ( (uid     !ab459)
      (type    mail)
      (from    [email protected])
      (to      transarc.announce)
      (date    7/2/94)
      (subject Picnic)
      (summary Invitation to the annual picnic)
      (content afs://transarc.com/bb/announce/kgcV6i7))

    (a) a synopsis for a mail message

    ( (uid         !ab622)
      (type        digest)
      (name        Announce bboard)
      (description Digest of announcements)
      (members     (!ab459, !3we4, !79erg)))

    (b) a digest for a bboard

Figure 3: A Mail Message Synopsis and a Digest.
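A minimal sketch of the derive and display methods for a mail message follows, assuming Python's standard `email` and `html` modules. The function names `derive_mail` and `display`, the sample sender address, and the choice of an HTML definition list are illustrative assumptions, not part of the proposed interface.

```python
from email import message_from_string
from html import escape


def derive_mail(raw_message, content_url):
    """Collect synopsis attributes from a message's structured header."""
    msg = message_from_string(raw_message)
    return {
        "type": "mail",
        "from": msg["From"],
        "to": msg["To"],
        "subject": msg["Subject"],
        "content": content_url,
    }


def display(synopsis):
    """Translate a synopsis into a small HTML definition list."""
    rows = "".join(
        f"<dt>{escape(tag)}</dt><dd>{escape(str(value))}</dd>"
        for tag, value in synopsis.items()
    )
    return f"<dl>{rows}</dl>"


# Hypothetical message; the sender address is made up for the example.
raw = (
    "From: alice@example.org\n"
    "To: transarc.announce\n"
    "Subject: Picnic\n"
    "\n"
    "Invitation to the annual picnic.\n"
)
synopsis = derive_mail(raw, "afs://transarc.com/bb/announce/kgcV6i7")
page = display(synopsis)
```

Re-running `derive_mail` whenever the message file changes would keep the attributes current, which is the maintenance role the text assigns to derive.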

Applying the resolve method to this digest with the query given in Figure 4(a) results in a new digest that contains all synopses from the original digest that match the query (Figure 4(b)). Searching can be further refined by applying the resolve method to the new digest.

    date = 7/2/94 & to = transarc.announce & summary : picnic

    (a) a query

    ( (uid           !ab784)
      (type          digest)
      (query         date = 7/2/94 & ... )
      (parent-digest !ab622)
      (members       (!ab459, !79erg)))

    (b) the resulting digest

Figure 4: New Digest.
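The following sketch shows one way resolve could turn a digest plus a query into a new digest. The query semantics are an assumption: `=` is read as exact match and `:` as case-insensitive keyword containment, since the paper shows the syntax in Figure 4(a) without defining it. The attribute values for `!3we4` and `!79erg`, and the uid `!ab785`, are invented for the example.

```python
import re

# One query term: tag, operator, value.
TERM = re.compile(r"\s*([\w-]+)\s*([=:])\s*(.*?)\s*$")


def parse_query(text):
    """Split 'a = x & b : y' into (tag, operator, value) terms."""
    terms = []
    for part in text.split("&"):
        match = TERM.match(part)
        if match is None:
            raise ValueError(f"cannot parse query term {part!r}")
        terms.append(match.groups())
    return terms


def matches(synopsis, terms):
    for tag, op, value in terms:
        actual = str(synopsis.get(tag, ""))
        if op == "=" and actual != value:
            return False
        if op == ":" and value.lower() not in actual.lower():
            return False
    return True


def resolve(digest, query, store, new_uid):
    """Build a new digest holding the members of `digest` that match `query`."""
    terms = parse_query(query)
    members = [uid for uid in digest["members"] if matches(store[uid], terms)]
    return {"uid": new_uid, "type": "digest", "query": query,
            "parent-digest": digest["uid"], "members": members}


store = {
    "!ab459": {"to": "transarc.announce", "date": "7/2/94",
               "subject": "Picnic", "summary": "Invitation to the annual picnic"},
    "!3we4": {"to": "transarc.announce", "date": "6/30/94",
              "subject": "Parking", "summary": "Garage closed Friday"},
    "!79erg": {"to": "transarc.announce", "date": "7/2/94",
               "subject": "Picnic directions", "summary": "Map to the picnic site"},
}
bboard = {"uid": "!ab622", "type": "digest", "members": ["!ab459", "!3we4", "!79erg"]}

picnic = resolve(bboard, "date = 7/2/94 & to = transarc.announce & summary : picnic",
                 store, "!ab784")
refined = resolve(picnic, "subject = Picnic", store, "!ab785")
```

Because the result is itself a digest, the second call narrows the first result rather than searching the whole bboard again.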

3 Functional Components

We propose a functional architecture that consists of a publisher, a synopsis server, an agent, a cache, an interface, and a wide-area file system (Figure 5).

The publisher maintains the relationship between files and the corresponding synopses. When a file is created, the publisher creates a synopsis in a synopsis server and invokes the derive method to generate indexable attributes. The publisher selects an appropriate synopsis server based on properties of the file. For example, it may send a mail message to a synopsis server that stores other mail messages, or it may send a document about Norway to a synopsis server that maintains travel information. When a file changes, the publisher re-invokes the corresponding derive method to refresh the attributes in the file object.

The synopsis server collects and stores synopses. It indexes attributes of synopses and resolves user queries. When a user query is sent to the synopsis server, it is applied to the index and a digest is created with the results.

The client agent serves as a mediator between the client and a collection of synopsis servers. The client interface requests that an operation be performed on a synopsis. The agent retrieves the synopsis from a synopsis server, evaluates the appropriate method, and displays the results. To improve performance, the agent uses a local synopsis cache. The cache is essentially a small, local synopsis server.

The client interface displays information in a synopsis and performs browse and search operations through the agent. The client interface assumes a uniform interface to synopses. It implements the common markup language shared by all display methods.

The wide-area file system supports efficient, robust, authenticated file transfer and storage.

[Figure 5 diagram. Components: Interface, Agent, Synopsis Server, Cache, Publisher, File System. Edge labels: Process Event, Evaluate Query, Hold Synopsis, Update Synopsis, Get/Save Synopsis.]

Figure 5: An architectural overview of the functional components.
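The interaction between publisher, synopsis server, and agent can be sketched as follows. Every class, method, and attribute name here is hypothetical, the derive step is a toy that only extracts a title and a word count, and the agent's cache is a plain dictionary with no invalidation; a real system would need a consistency policy that this sketch leaves out.

```python
def derive(file_text):
    """Toy derive: the first line is the title; also record a word count."""
    lines = file_text.splitlines()
    return {"title": lines[0] if lines else "", "words": str(len(file_text.split()))}


class SynopsisServer:
    def __init__(self):
        self.synopses = {}   # uid -> attributes
        self.index = {}      # (tag, value) -> set of uids

    def store(self, uid, attributes):
        # Remove stale index entries before re-indexing an updated synopsis.
        for tag, value in self.synopses.get(uid, {}).items():
            self.index[(tag, value)].discard(uid)
        self.synopses[uid] = dict(attributes)
        for tag, value in attributes.items():
            self.index.setdefault((tag, value), set()).add(uid)

    def resolve(self, tag, value):
        return sorted(self.index.get((tag, value), set()))

    def fetch(self, uid):
        return self.synopses[uid]


class Publisher:
    def __init__(self, servers):
        self.servers = servers   # topic -> SynopsisServer

    def file_changed(self, uid, topic, file_text):
        """Re-derive attributes and send them to the server chosen by topic."""
        self.servers[topic].store(uid, derive(file_text))


class Agent:
    def __init__(self, server):
        self.server = server
        self.cache = {}          # small local synopsis cache

    def get(self, uid):
        if uid not in self.cache:
            self.cache[uid] = self.server.fetch(uid)
        return self.cache[uid]


servers = {"mail": SynopsisServer(), "travel": SynopsisServer()}
publisher = Publisher(servers)
publisher.file_changed("!n1", "travel", "Norway\nFjords and ferries")

hits = servers["travel"].resolve("title", "Norway")
agent = Agent(servers["travel"])
first = agent.get(hits[0])

# The file changes: the publisher re-derives and the server re-indexes.
publisher.file_changed("!n1", "travel", "Sweden\nLakes and forests")
after_update = servers["travel"].resolve("title", "Norway")
```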

4 Examples

This section describes some common synopses. A file is the simplest type of synopsis. It contains some information in an unknown format. An HTML document contains format information and hypertext links to other documents. A mail message has a structured header with pointers to referenced messages, sender, and receiver. The body of a message is plain text. A database is a collection of records in an application-specific format. The synopsis points to a program that interacts with the database. Similarly, a spreadsheet is a special database. A synopsis server collects and indexes synopses created by publishers. A digest is the collection of synopses that was returned by a query sent to a synopsis server. It defines a virtual index where the resolve method forwards a query to the synopsis server that maintains synopses in the digest.

Synopsis Type | Attributes | Derive
File | Owner, Size, Last-Modified, Content | Summarize file from directory entry.
HTML Document | Title, Headings, Summary, Address, Content | Summarize document from content and extract hypertext links.
Mail Message | From, To, Subject, Content | Summarize message from structured header and keywords in the body.
Database | Data-Model, Query-Language, Schema-Library, Description, Content | Summarize information in database description.
Spreadsheet | Description, Title, Headings, Content | Summarize information from document description.
Synopsis Server | Description, Location, Index, Members | Collect synopses from publishers, store and index them.
Digest | Description, Synopsis Server, Query, Members | Invoke resolve method of synopsis server with Query.

Figure 6: Deriving Attributes for Some Synopsis Types.

4.1 Data Manipulation

Figure 6 shows examples of several synopsis types. For simple types, attributes summarize key information extracted from data objects. For more complex types (e.g., synopsis server, digest), the attributes can point to an index that can be used for query resolution or to the query that was used to form the synopsis. The steps involved in applying the derive method that generates these attributes are outlined in the Derive column.

4.2 User Presentation

Figure 7 describes the effect of invoking the resolve, display, and invoke methods on the synopsis types outlined in Figure 6. For simple types, the resolve method consists of choosing a pointer to other objects or invoking an application-specific query. For complex synopsis types, the resolve method provides a query resolution mechanism. The display method is usually a step between two resolution methods. The invoke method is applicable only to synopsis types that correspond to particular data applications.
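One possible reading of Figure 7 is a per-type dispatch table in which some methods are simply absent. The sketch below assumes that reading; the table `METHODS`, the helper names, and the example URLs are all hypothetical, and only two synopsis types are filled in.

```python
def display_contents(synopsis):
    return f"contents of {synopsis['content']}"


def display_description(synopsis):
    return synopsis["description"]


def open_session(synopsis):
    return f"session with {synopsis['content']}"


# Hypothetical dispatch table; None marks a method the type does not provide.
METHODS = {
    "file": {"display": display_contents, "invoke": None},
    "database": {"display": display_description, "invoke": open_session},
}


def supports(synopsis, method):
    return METHODS[synopsis["type"]].get(method) is not None


def call(synopsis, method):
    if not supports(synopsis, method):
        raise NotImplementedError(f"{synopsis['type']} has no {method} method")
    return METHODS[synopsis["type"]][method](synopsis)


notes = {"type": "file", "content": "afs://example.org/notes.txt"}
orders = {"type": "database", "description": "Order records",
          "content": "afs://example.org/orders.db"}
```

A client can then treat every synopsis uniformly, checking `supports` before offering an interactive session for types such as databases.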

5 Information Volume

The proposed architecture offers advanced file location tools to reduce complexity caused by exponential growth of information volume.

Synopsis Type | Resolve | Display | Invoke
File | None. | Display contents. | None.
HTML Document | Map an anchor to the corresponding synopsis. | Retrieve the content, highlight all anchors. | None.
Mail Message | Resolve links to related message threads. | Convert message body to a hypertext document with links to related messages. | None.
Database | Pass a query to the database application, create a synopsis with the results. | Display the database description. | Invoke the database application as a separate session.
Spreadsheet | Start the application, pass the query string, create a synopsis with the displayed results. | Display the spreadsheet description. | Invoke the spreadsheet application in a separate session.
Synopsis Server | Resolve a query, create a digest with the results. | Display the index description. | None.
Digest | Forward the query to the synopsis server and filter the results with the digest query. | Present content as a hypertext document with links to each synopsis in the digest. | None.

Figure 7: Methods

5.1 Searching

Search is the computer-driven process of locating a desired piece of information. Search is most effective when a user can formulate a precise query from attributes that are indexed for efficient lookup.

To locate a synopsis within a synopsis server, a user invokes the resolve method of the server's digest to evaluate a query. The server digest contains all synopses that are indexed and stored by the server. The server searches its database and constructs a digest that contains the corresponding synopses. The user refines the search by invoking the digest's resolve method.

To locate a synopsis in a collection of synopsis servers, a user sends a query to an appropriate subset of the synopsis servers, i.e., those that are most likely to contain a response to the query. To locate synopsis servers, the synopsis for each is stored in a server registry. The server registry is just a synopsis server that contains synopses that correspond to other synopsis servers. Initially, the user sends a query to the server registry. The user selects the best candidates from the resulting digest and sends the query to these servers. The server registry thus provides a server location facility. The resulting hierarchical index significantly improves the performance of wide-area file location.

A digest implements a lightweight, virtual index that selectively forwards queries to several other synopsis servers. Digests can be used to construct topical indexes. For example, a Travel digest might be created by constructing a digest of synopses about travel from other synopsis servers.
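The two-step lookup through a server registry can be sketched as below. The sketch assumes a deliberately simplified query form (a dictionary of tag to required value) and hypothetical names throughout; it only illustrates that the registry is an ordinary synopsis server whose members describe other servers.

```python
def matches(synopsis, query):
    """Assumed query form: every listed tag must equal the given value."""
    return all(synopsis.get(tag) == value for tag, value in query.items())


class SynopsisServer:
    def __init__(self, name, topic, synopses):
        self.name = name
        self.topic = topic
        self.synopses = synopses

    def resolve(self, query):
        return [s for s in self.synopses if matches(s, query)]

    def describe(self):
        """The synopsis that represents this server inside a registry."""
        return {"type": "synopsis-server", "name": self.name, "topic": self.topic}


def registry_search(registry, servers, server_query, query):
    # Step 1: ask the registry which servers are likely to hold an answer.
    candidates = registry.resolve(server_query)
    # Step 2: forward the real query only to those servers.
    results = []
    for entry in candidates:
        results.extend(servers[entry["name"]].resolve(query))
    return results


servers = {
    "travel": SynopsisServer("travel", "travel", [
        {"uid": "!t1", "country": "Norway"},
        {"uid": "!t2", "country": "Peru"},
    ]),
    "mail": SynopsisServer("mail", "mail", [
        {"uid": "!m1", "country": "Norway"},
    ]),
}
# The registry is just another synopsis server holding server synopses.
registry = SynopsisServer("registry", "servers",
                          [s.describe() for s in servers.values()])

found = registry_search(registry, servers, {"topic": "travel"}, {"country": "Norway"})
```

Only the travel server receives the second query here, so the mail server does no work even though it holds a matching synopsis; that trade of recall for load is the point of selecting candidates first.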

5.2 Browsing

Browse is the human-guided process of locating a particular piece of information. Browse is most effective when a user cannot specify a precise query. Effective browsing requires prior organization of an information space. Instead of specifying a query, the user manipulates the organization of documents.

Classification is a common technique for organizing information. A taxonomy or classification scheme is an orderly arrangement of terms or classes [12]. The application of such a scheme to a set of synopses results in the ordering or arranging of the synopses into groups or classes with similar properties.

A digest defines a class of synopses that match the query used to build the digest. The synopses in the digest share the properties used to construct the query. A digest that selects synopses from another digest (recall that a digest implements a virtual index) defines a more restrictive class of synopses. The synopses in the new class share more properties than the synopses in the old class. A classification scheme is constructed by organizing related digests.
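As a rough sketch of how nested digests yield a classification scheme, the function below groups synopses by one tag at a time, so each level of the result corresponds to a digest that is more restrictive than its parent. The function name `classify`, the `unclassified` label, and the sample data are assumptions made for illustration.

```python
def classify(synopses, tags):
    """Recursively group synopses by each tag in turn, yielding a taxonomy."""
    if not tags:
        return [s["uid"] for s in synopses]
    first, rest = tags[0], tags[1:]
    groups = {}
    for s in synopses:
        groups.setdefault(s.get(first, "unclassified"), []).append(s)
    # Each group plays the role of a digest built from its parent's members.
    return {value: classify(members, rest) for value, members in groups.items()}


library = [
    {"uid": "!a1", "topic": "travel", "country": "Norway"},
    {"uid": "!a2", "topic": "travel", "country": "Norway"},
    {"uid": "!a3", "topic": "travel", "country": "Peru"},
    {"uid": "!a4", "topic": "mail"},
]
taxonomy = classify(library, ["topic", "country"])
```

A user browsing this structure moves from `travel` to `travel / Norway` without writing a query; the members at the deeper level share both properties.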

6 Diversity of Use

A wide-area information system encapsulates data from a pool of information providers implementing diverse security policies. A wide-area information system has to provide secure information sharing with flexible protection and billing policies. The key aspects of the secure information system are:

- Authentication: verification that the requester of information is the legitimate client.
- Authorization: verification that the client is allowed to access a particular piece of information.
- Accounting: verification that the right entity will be billed for a particular service.
- Auditing: verification that a particular information access transaction was executed.

The security model for the wide-area information system is based on the following axiom:

Security Axiom: In a wide-area information system, requests for information are satisfied only with respect to the portion of the system accessible by the requester. That is, no references to an object that satisfies the query will be made, unless the object is accessible by the user through the underlying data access system (file system, database, etc.).

The key difference between a wide-area information system and a traditional file system is that the former may contain new pieces of information (pointers, extracted keywords, etc.) that were not originally part of the underlying data system. In our model, the values of synopsis attributes represent the new data. Our model for access control inheritance translates the access control model of the underlying data systems into the information system. The values of synopsis attributes inherit the access control rights from the original files. The model takes into account potential inconsistencies between the moment a file is updated and the moment when the publisher propagates updates to the synopsis servers.
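The security axiom and access control inheritance can be sketched as a filter applied before query evaluation. The access control list format (a mapping from file URL to a set of user names), the function names, and the example URLs are assumptions for this sketch; real file system ACLs are richer, and the sketch ignores the update-propagation window mentioned above.

```python
def accessible(synopsis, user, file_acl):
    """A synopsis inherits the access rights of the file its content names."""
    return user in file_acl.get(synopsis["content"], set())


def secure_resolve(synopses, query, user, file_acl):
    # Filter by access first, so inaccessible objects are never referenced.
    visible = [s for s in synopses if accessible(s, user, file_acl)]
    return [s["uid"] for s in visible
            if all(s.get(tag) == value for tag, value in query.items())]


# Hypothetical ACLs of the underlying files.
file_acl = {
    "afs://example.org/public/plan.txt": {"alice", "bob"},
    "afs://example.org/private/salary.txt": {"alice"},
}
synopses = [
    {"uid": "!p1", "type": "document", "content": "afs://example.org/public/plan.txt"},
    {"uid": "!p2", "type": "document", "content": "afs://example.org/private/salary.txt"},
]

for_bob = secure_resolve(synopses, {"type": "document"}, "bob", file_acl)
for_alice = secure_resolve(synopses, {"type": "document"}, "alice", file_acl)
```

The same query returns different digests for different users, and a file with no ACL entry is treated as inaccessible to everyone, which errs on the side of the axiom.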

7 Number of Users

The proposed architecture enables performance enhancements that accommodate continued growth in the number of users. The chief obstacles to high performance in a wide-area information system are server and network loads and network latency. Traditional techniques of caching and replication address performance issues associated with the exponential growth in users, servers, and information [10]. Caching decreases latency, server load, and network load because the request to retrieve a document can be handled by the local client machine. Replication transparently distributes server and network load. In addition, both caching and replication improve robustness. Without caching and replication, the document server is a single point of failure for the entire system.
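The combined effect of client caching and server replication can be illustrated with a small simulation. The LRU eviction policy, the round-robin replica choice, and all names below are assumptions chosen to keep the sketch short; the paper does not commit to particular policies.

```python
from collections import OrderedDict
from itertools import cycle


class ReplicatedServer:
    """Spreads fetches over replicas in round-robin order and counts load."""

    def __init__(self, documents, replicas):
        self.documents = documents
        self.load = [0] * replicas
        self._next = cycle(range(replicas))

    def fetch(self, key):
        self.load[next(self._next)] += 1
        return self.documents[key]


class CachingClient:
    """Keeps the most recently used documents locally (LRU eviction)."""

    def __init__(self, fetch, capacity):
        self.fetch = fetch
        self.capacity = capacity
        self.cache = OrderedDict()

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)      # mark as recently used
            return self.cache[key]
        value = self.fetch(key)              # cache miss: go to a replica
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict the least recently used
        return value


server = ReplicatedServer({"a": "A", "b": "B", "c": "C"}, replicas=2)
client = CachingClient(server.fetch, capacity=2)
for key in ["a", "b", "a", "a", "c", "b"]:
    client.get(key)
```

In this run the two repeated requests for `a` are served locally, so only four of the six requests reach the server, and those four are split evenly between the two replicas.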

8 Related Work

Wide-area file systems such as AFS [15], DFS [9], and NFS [7] serve as a foundation for this work. Existing wide-area file systems demonstrate the feasibility of file sharing in systems with several terabytes of information. Indexing file systems like the Semantic File System [8] and the Nebula File System [5] extend the traditional naming interface to support descriptive names. Several systems index World-Wide Web information. Notable among these is the Harvest system, which collects information from Web servers, summarizes it, and indexes it for fast access [3].

9 Conclusion

The primary role of a wide-area information system is to facilitate browsing and searching of widely distributed information. The presented model for a wide-area information system addresses issues of complexity and scale. It proposes a logical interface to the file system and a uniform data representation format, which help organize and search heterogeneous data. The use of existing technology for wide-area file access and remote procedure calls will allow the system to scale well in a wide-area environment.

References

[1] T. Berners-Lee, R. Cailliau, J-F. Groff, and B. Pollermann. World-Wide Web: The information universe. Electronic Networking: Research, Applications and Policy, 1(2):52-58, Spring 1992.
[2] T. Berners-Lee, L. Masinter, and M. McCahill. Uniform resource locators. Technical report, Internet Engineering Task Force (IETF Draft), August 1994.
[3] C. Mic Bowman, Peter B. Danzig, Darren R. Hardy, Udi Manber, and Michael F. Schwartz. The Harvest information discovery and access system. In Proceedings of the Second International World-Wide Web Conference, pages 763-771, Chicago, Illinois, October 1994.
[4] C. Mic Bowman, Peter B. Danzig, Udi Manber, and Michael F. Schwartz. Scalable internet resource discovery: Research problems and approaches. Communications of the ACM, 37(8):98-107, August 1994.
[5] Mic Bowman, Chanda Dharap, Mrinal Baruah, Bill Camargo, and Sunil Potti. A file system for information management. In Proceedings of the Conference on Intelligent Information Management Systems, Washington, DC, June 1994.
[6] Alan Emtage and Peter Deutsch. archie - an electronic directory service for the internet. In Proceedings of the Winter 1992 Usenix Conference, pages 93-110, San Francisco, California, January 1992. Usenix Association.
[7] R. Sandberg et al. Design and implementation of the Sun Network File System. In Proceedings of the Summer 1985 Usenix Conference. Usenix Association, Summer 1985.
[8] David K. Gifford, Pierre Jouvelot, Mark A. Sheldon, and James W. O'Toole Jr. Semantic file systems. In Proceedings of the Thirteenth ACM Symposium on Operating System Principles, pages 16-25. ACM SIGOPS, October 1991.
[9] M. Kazar, B. Leverett, O. Anderson, V. Apostolides, B. Bottos, S. Chutani, C. Everhart, W. Mason, S. Tu, and E. Zayas. DEcorum File System architectural overview. In Proceedings of the Summer 1990 USENIX Conference, Anaheim, California, June 1990.
[10] M. L. Kazar. Synchronization and caching issues in the Andrew File System. In Proceedings of the Winter 1988 Usenix Conference. Usenix Association, January 1988.
[11] M. McCahill. The internet gopher: A distributed server information system. ConneXions - The Interoperability Report, 6(7):10-14, July 1992.
[12] E. Jennifer Rowley. Organizing Knowledge. Gower Publishing Company, Hampshire, England, December 1987.
[13] M. Satyanarayanan. Scalable, secure, and highly available distributed file access. IEEE Computer, 23(5), May 1990.
[14] M. Spasojevic and M. Satyanarayanan. A usage profile and evaluation of a wide-area distributed file system. In Proceedings of the Winter 1994 Usenix Conference. Usenix Association, January 1994.
[15] A. Z. Spector and M. L. Kazar. Wide area file service and the AFS experimental system. Unix Review, 7(3), March 1989.

A Schedule

1. Demonstrate the usefulness of merging existing operating system research and wide-area information technology.
   (a) CODE: implement a general purpose URL translation facility for Mosaic.
   (b) CODE: implement a proactive caching repository for an AFS-based wide area file system.
   (c) CODE: implement support for DCE RPC-based dynamic document construction.
   (d) CODE: implement support for local program execution and file translation facilities.

2. Characterize client access patterns for information systems.
   (a) CODE: instrument Mosaic.
   (b) CODE: trace collection and processing facility.
   (c) TASK: distribute instrumented Mosaic to community and collect 6 months' worth of traces; repeat yearly.
   (d) ANALYSIS: identify patterns to characterize, relate data to access patterns.
   (e) ANALYSIS: identify changes in access patterns to characterize growth of information access.

3. Design multi-level, multi-policy caching and replication scheme to decrease network and server load.
   (a) TASK: complete characterization of access patterns.
   (b) DESIGN: design a cache/replication scheme that takes advantage of known client access patterns.
   (c) ANALYSIS: simulate the performance of this scheme relative to traditional schemes.
   (d) CODE: implement prototype cache/replication scheme within a wide-area file system.

4. Design wide area security scheme to protect information published through public and private indexes.
   (a) ANALYSIS: identify the security requirements of information exported through an index.
   (b) DESIGN: design a facility that supports secure search of wide-area indexes.
   (c) CODE: implement a prototype of the secure search system.

5. Design an indexing file server that supports explicit publishing, object organization, and object classification.
   (a) DESIGN: design a file system API that supports implicit and explicit file indexing, file classification, and relationships between files.
   (b) DESIGN: design a publishing tool that exports information to a wide-area file system through an indexing API.

6. Design an indexing file client that supports personal and group annotations, object and index organization, and user profiles.
   (a) DESIGN: design a client agent that supports file annotations, personal file classification and organization.
   (b) DESIGN: design a client profile that actively collects and organizes information on behalf of a client.

7. Integrate client, server, security, and caching into a cohesive, integrated proposal for the construction of a wide area file system.
