The Design of Distributed Hypermedia Information Systems
Ruben Gonzalez†, H. W. Peter Beadle†, Hugh Bradlow* and Rei Safavi-Naini†
† Telecommunication Software Research Centre, The Institute for Telecommunication Research, University of Wollongong, Australia
* Telstra Research Laboratories, Telstra, Australia
ABSTRACT: Many issues central to the success of distributed hypermedia systems exist aside from the actual content provided. One is the ability of information providers to easily present their services to their potential customers. Another is the ability of consumers to efficiently locate desired information. The wide acceptance of hypermedia information systems by users will ultimately depend on the effective resolution of these issues. As the size of distributed hypermedia systems continues to grow, the limitations of the underlying technologies are becoming increasingly apparent. This paper will explore the problems and possible methods to overcome current limitations in distributed hypermedia information systems.
1. INTRODUCTION
Recently there has been a lot of public interest in the provision of networked hypermedia services to the home. These large scale, distributed hypermedia systems are expected to support a wide range of diverse transactions and services. A large amount of attention has been given to the role of broadband telecommunications in this context. Somewhat less attention has been paid to the information technology issues required to support the anticipated services. At a system level there is a need for an appropriate service architecture which can meet the requirements of such a distributed hypermedia system. In this paper we propose such an architecture and discuss several key issues in providing these services commercially. These issues include authentication, security, billing and directory services. Hypermedia information systems, especially in relation to the Internet, have been undergoing rapid expansion. This rapid growth has meant that the evolution of the underlying technologies has been unable to keep pace, exposing various information technology related weaknesses. This is quite apart from the increased capacity requirements compared to traditional telecommunication services.
The various information technology problems are less immediately apparent. They have become manifest partly because of the basic network architecture, which is a distributed structure of independent nodes, and partly because of the vigorous growth and incessant change of the information on the system, together with the need to support new services requiring the collaboration of diverse technologies.
2. EXISTING WORK
2.1. Networked Hypermedia Systems
A networked hypermedia information system is the conglomeration of different technologies including telecommunications, multimedia and hypertext. Hypermedia systems, which extend the concepts of hypertext to multimedia, are composed of a link and node structure. The links define relationships between nodes, which may contain references to many other nodes. A hypermedia information system usually consists of three identifiable levels according to the DEXTER model [1]. The lowest level is the node database, which contains the data nodes; next is the link repository layer, containing the hyperlinks which define the relationships between the various nodes. The final component is the user interface or presentation layer. The best known distributed hypermedia information system is the World Wide Web (WWW) on the Internet. Instead of the DEXTER model, the WWW uses a simpler model which can only readily support enhanced hypertext services. This model merges both of the lower layers into one by embedding the link data into each node or document. While this affords a much simpler information system, it does so at the expense of flexibility and manageability. Due to the lack of any link repository in the WWW model, link management becomes difficult.
Little or no attempt has been made towards ensuring link consistency. Dangling or unresolvable links are common and difficult to correct. In contrast to the WWW, the Hyper-G system [2] seeks to solve these problems by using the DEXTER model and automating the administration of the link database. In spite of the popularity of the Internet, its basic narrowband architecture hinders it from becoming the basis for a public access commercial service. Internationally, various pilot studies have been performed to evaluate possible broadband service architectures [3]. Most of these are heavily biased towards video on demand services and interactive television. In focussing on the network architecture in this way they avoid many of the information technology issues.
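As a rough illustration of the separation the DEXTER model calls for (and the WWW model collapses), the sketch below keeps the hyperlinks in a repository distinct from the node database, which is what makes dangling links easy to detect centrally. This is a minimal sketch; the class and method names are invented for illustration and do not correspond to any existing system.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """Entry in the node database (lowest DEXTER layer)."""
    node_id: str
    content: str            # document body; no embedded link data

@dataclass
class Link:
    """Entry in the link repository (middle DEXTER layer)."""
    source: str             # node_id containing the anchor
    target: str             # node_id the link resolves to

@dataclass
class LinkRepository:
    nodes: dict = field(default_factory=dict)   # node_id -> Node
    links: list = field(default_factory=list)   # all Link objects

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def add_link(self, link: Link) -> None:
        self.links.append(link)

    def dangling_links(self) -> list:
        """Links whose target no longer exists -- easy to find here, but
        invisible when link data is embedded in each document."""
        return [l for l in self.links if l.target not in self.nodes]

repo = LinkRepository()
repo.add_node(Node("home", "Welcome page"))
repo.add_link(Link("home", "missing-article"))
print(repo.dangling_links())      # reports the unresolvable link
```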
2.2. Information Technology
One of the largest information technology problems is that of resource discovery, or locating information, in a distributed environment composed of autonomous information servers. Since there is no centralised index or link repository in this model, finding relevant documents is difficult.
2.2.1. Nontextual Media
One current limitation in existing hypermedia systems is the inability to retrieve information in nontextual media based on its content. While the use of the term hypermedia implies hypertext-like information retrieval across all types of media, this is simply not the case. Instead, existing hypermedia systems are only multimedia-enhanced hypertext. Research is attempting to rectify this problem [4, 5]. While the methods of placing and identifying links and nodes in hypertext are well established, this is not the case for other media. Little work has been performed on the creation of full hypermedia documents in which non-textual data (specifically continuous, time-based data such as video or audio) can contain both nodes and anchors. For example, it should be possible for the author to identify a particular object in a video (for example a person) once, and thereafter for that object to maintain its anchor as the video evolves with time.
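A rough sketch of what such a time-based anchor could look like: a hypothetical structure that attaches an anchor to an object through a set of timed regions, so that once the author has identified the object the anchor can be resolved at any point during its lifetime. All names and fields here are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TimedRegion:
    start: float        # seconds into the video
    end: float
    bbox: tuple         # (x, y, width, height) of the object in this interval

@dataclass
class VideoAnchor:
    """An anchor on a moving object, valid over several time intervals."""
    object_label: str
    target_node: str             # node the anchor links to
    regions: List[TimedRegion]

    def region_at(self, t: float) -> Optional[TimedRegion]:
        """Resolve the anchor at playback time t, if the object is visible."""
        for r in self.regions:
            if r.start <= t <= r.end:
                return r
        return None

anchor = VideoAnchor(
    object_label="presenter",
    target_node="biography-node",
    regions=[TimedRegion(0.0, 4.0, (120, 80, 60, 120)),
             TimedRegion(6.5, 12.0, (200, 90, 55, 115))])
print(anchor.region_at(7.0))    # the anchor follows the object over time
```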
2.2.2. Global Indexes
Hypermedia supports two techniques for locating information: searching and browsing. As a tool for locating information, searching is useful when the user can clearly define the appropriate keywords to search for. Ideally the scope of the search should be adjustable as a complexity control mechanism. Searching as a resource discovery technique is most efficient when a centralised index exists. Examples of existing Internet systems which locally maintain centralised, global search indexes include Archie, Veronica, WWWWorm [6] and Netfind. These systems are plagued with scalability problems. One problem is the rapid index growth rate, which compounds existing consistency problems. Another is the computational load of having thousands of users simultaneously querying the database on a single search engine; some systems regularly receive over one million requests per day. As the size of the Internet has grown there has been a progression away from centralised, global search indexes. Without such an index, performing exhaustive searching becomes difficult in a distributed environment. One problem is the potentially large number of nodes which must be individually searched. Another is that the search engine must first know where to find the nodes which are to be searched [7].
2.2.3. Domain Based Structures
Domain based schemes, such as WAIS, address this problem by using a collection of disjoint homogeneous directories with a centralised directory-of-directories. This method simplifies the task of administration. While each of the distributed directories maintains a flat index, the centralised directory itself is hierarchical. Searching is then done by selecting a search domain from the centralised directory-of-directories. Netfind similarly provides a centralised index of domains. Searching is performed by distributing the query to servers in selected domains according to a heuristic. Netfind can therefore locate all of the domains, but may fail to locate the required data since not all domains are searched.
2.2.4. Hierarchical Structures
Alternatively, the X.500 system [8] provides an inverse structure where the list of domains is distributed and each domain maintains a flat index of the services within itself. Since there is no global index in this structure, an X.500 system cannot be searched exhaustively except within the local domain. Given the correct domain, an X.500 system will locate the required information, but if the domain is unknown then location of the data becomes uncertain.
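The contrast can be made concrete with a small sketch of a directory-of-directories (hypothetical data structures, not the real WAIS or X.500 interfaces): the query is distributed only to the selected domains, so data held in an unselected or unknown domain is simply never found.

```python
from typing import Dict, List

# Each domain keeps a flat index: keyword -> list of document identifiers.
domains: Dict[str, Dict[str, List[str]]] = {
    "universities.au": {"hypermedia": ["uow:paper-12"], "atm": ["uow:tr-3"]},
    "telcos.au":       {"billing": ["telstra:spec-7"]},
}

def search(selected_domains: List[str], keyword: str) -> List[str]:
    """Distribute the query to the selected domains only."""
    hits: List[str] = []
    for name in selected_domains:
        index = domains.get(name, {})
        hits.extend(index.get(keyword, []))
    return hits

# Searching the right domain succeeds...
print(search(["universities.au"], "hypermedia"))    # ['uow:paper-12']
# ...but the same keyword is invisible if the wrong domain is selected.
print(search(["telcos.au"], "hypermedia"))           # []
```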
2.2.5. Directory Service Maintenance
The earliest directory systems, like Archie, depended on manual maintenance. Information had to be collected by each site administrator and sent to the administrator of the Archie database, who then manually updated the index. This system is inefficient and prone to human error. A more recent method requires the directory service operator to scour the network, extracting relevant information from the documents that it finds. Software agents such as “Robots”, “Spiders” or “Worms” can be used to automatically update their local databases with information obtained by recursively enumerating hyperlinks and searching for URLs and anchors. This scheme requires minimal administration effort and is attractive because of its unconstrained flexibility. However, these agents place heavy loads on information servers and network links, bringing servers down and causing severe congestion. This is mainly due to the inefficient object retrieval protocols used in current hypermedia systems and the “rapid fire” characteristics of their operation [9]. An alternative approach, which eliminates these congestion and performance problems, is for each content provider to supply an index of its information to the directory service operator. For pragmatic reasons these indexes should be generated in conformance with certain guidelines. The WHOIS++ system, which takes this approach, is based on collecting manually edited templates containing specific information from the content providers [10]. To reduce this administrative burden the directory service provider may supply indexing tools for this task, in effect distributing the worm to create a “worm farm”. Accordingly, the Harvest system [11] permits a “Gatherer” program to be run at the provider site, collecting the information it wants. This performs automatic indexing for a range of different document types. The index information is then periodically sent to the directory service operator.
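The provider-side gathering idea can be sketched as follows, in the spirit of Harvest but without reproducing any real Harvest interfaces: the provider indexes its own documents locally and periodically ships only the compact index to the directory operator, instead of having a remote robot fetch every document. The function and field names are assumptions for illustration.

```python
import json
import re
from typing import Dict, List

def gather_index(documents: Dict[str, str]) -> Dict[str, List[str]]:
    """Build a keyword -> [document URL] index at the provider site."""
    index: Dict[str, List[str]] = {}
    for url, text in documents.items():
        for word in set(re.findall(r"[a-z]+", text.lower())):
            index.setdefault(word, []).append(url)
    return index

def export_for_directory(index: Dict[str, List[str]], provider: str) -> str:
    """Serialise the index for periodic delivery to the directory operator."""
    return json.dumps({"provider": provider, "index": index})

local_docs = {
    "http://provider.example/catalogue": "broadband hypermedia services",
    "http://provider.example/pricing":   "billing and subscription services",
}
payload = export_for_directory(gather_index(local_docs), "provider.example")
print(payload)   # only this summary crosses the network, not the documents
```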
2.2.6. Browsing for Information Discovery
Browsing depends on the organisation of the information in a system. Most systems which support browsing do so by permitting the users to manually organise the information according to their personal preferences. Examples of these systems are WWW and Prospero. By making each user responsible for setting up their own browsing structure, the extent of browsing is limited to the information which each
user is already aware of. This constraint stifles the main benefit of browsing as an information discovery tool, because the information to be browsed must be known a priori. Alternatively, the Internet Gopher organises the information into predefined categories which are presented to the user through a simple menu system. These categories are specific to each node. Permitting a centralised authority to organise the information for browsing ensures that all of the information is located in the browsing tree, even that which users may not be aware of. However, users then have no control over the organisation of the browsing tree, which may not conform to their personal preferences. Index based services like WAIS are very easy to search but difficult to browse, because the index does not contain any relationships between the entries. Other schemes which do not rely on indexes, such as the hierarchical X.500, are very easy to browse but difficult to search, because the user must know the location in the tree where the desired information resides.
2.2.7. Disorientation and Complexity Control
One common problem when browsing hypermedia systems is that of becoming disoriented or "lost in hyperspace". Disorientation occurs when users lose track of their relative position in the link structure. This problem becomes more acute as either the depth or lateral extent of the link structure increases. It is further compounded when each node in a distributed network is autonomous, since this leads to deficient internodal linking. These deficiencies arise because the hyperlinks tend to be largely confined to local information. For example, the user may wish to jump to a certain topic, but no anchor has been placed by the author at that keyword. The user is left with the problem of where and how to proceed with the navigation.

3. NETWORK SERVICE ARCHITECTURE

A constraining factor in the design of a public access hypermedia information system is the established public telecommunications network architecture. The evolution from a network designed solely for basic telephony use to a hypermedia information network will probably be staged.
3.1. Hypermedia Network Topology
A likely topology for a near-term public hypermedia information service may consist of a broadband (ATM) core network in concert with cable television (CATV) sub-networks. These subnets consist of head ends interfacing to the ATM network on one side and to the customer premises equipment (CPE) on the other. The CPE is generally a set top box for the television providing some form of low bandwidth back channel. For example, @home in the U.S. has announced a modem providing a 100kb/s back channel and 100Mb/s ethernet down link. Figure 1 illustrates the basic topology we propose as the most likely long term evolution of current networks. We are implementing a lab prototype of this architecture as part of our research infrastructure.
Figure 1. Basic Network Topology (content providers and directory, authentication and billing service providers attached to a broadband core network; head ends connecting users' CPE over CATV, LAN or POTS subnetworks; mobile terminals attached via GSM)
Aside from the content providers, various special services may be required, such as directory, billing and authentication services, which will facilitate the interaction between users. It may be appropriate for these services to fall under the jurisdiction of the network operator.

3.2. Content Providers

Content providers have databases which are repositories of hypermedia documents. They may operate the database themselves, on their own or a third party's equipment, or they may choose to allow a third party to operate their database.

Content providers receive and generate requests for documents, generate hypermedia documents and receive hypermedia documents. Optionally they may also forward requests and hypermedia documents (i.e. receive a request and pass it on intact (or modified) to another service provider, or receive a hypermedia document and then pass it on intact or modified to either a CPE or another service provider).

3.3. Customer Premises Equipment

As a hypermedia terminal the CPE generates requests and receives hypermedia documents in response to those requests. It also has capabilities for decoding and displaying multimedia data, and handling user input. Initially the CPE may be restricted to a television, a set top unit (STU) and some input device. The function of the STU is to interface between these elements, decoding inbound information for television display and encoding input data to send up the back channel.

The CPE is considered to have some (potentially limited) general data processing capacity, which is necessary for authentication and for interacting with the hypermedia information. The CPE may be able to receive program fragments from the network and execute them to reconfigure the CPE itself or to interpret other data from the network.

3.4. Authentication Service

The authentication service provides a mechanism for a reliable operator to identify the user, the CPE and the service provider. A fundamental tenet is that all parties agree to believe the authentication service about the particular details of a user.

Both users and service providers need to authenticate themselves to participate in the system. Neither of these parties should be able to delegate their authentication to another.

It is likely that authenticating the user for each link access request would generate an unacceptable load on the authentication service and network (due to the nature of hypermedia systems, many links need to be traversed to access any information). In order to reduce the authentication server cost and network traffic, the number of authentication requests should be kept to a minimum. Thus authentication should be performed at the system level rather than the service level. Ideally users would require authentication only once per session. For secure authentication to take place, each CPE should have a unique identification number which should be registered with each head end. Spoofing may be prevented by ensuring that the CPE ID and network address or termination ID match. Also, for any particular session the user ID and the CPE ID will remain fixed, and both can be encrypted into a single token; this further prevents spoofing. The user authentication process begins with the authentication of the CPE to the head end to receive a CPE token. Next, if a smart card is used, it may be authenticated by the authentication server. Following this the user can be authenticated, receiving an unforgeable session ticket which contains the user plus CPE identification data.
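The session set-up described above might proceed roughly as in the sketch below. This is an illustration of the sequence only, with hypothetical names, and it substitutes a keyed hash for the authentication server's signature; a real system would use the server's private key and proper credential checks.

```python
import hashlib
import hmac
import time

AUTH_SERVER_KEY = b"demo-only-secret"   # stands in for the server's private key

def sign(payload: str) -> str:
    """Stand-in for the authentication server's signature."""
    return hmac.new(AUTH_SERVER_KEY, payload.encode(), hashlib.sha256).hexdigest()

def authenticate_cpe(cpe_id: str, network_address: str, registered: dict) -> str:
    """Step 1: the CPE authenticates to the head end and receives a CPE token.
    Spoofing is resisted by checking that the CPE ID and network address match."""
    if registered.get(cpe_id) != network_address:
        raise PermissionError("CPE ID does not match its registered address")
    payload = f"cpe:{cpe_id}:{network_address}"
    return payload + ":" + sign(payload)

def authenticate_user(user_id: str, cpe_token: str, valid_for: int = 3600) -> str:
    """Step 2: the user is authenticated once per session and receives a
    session ticket binding the user ID to the CPE identification data."""
    expiry = int(time.time()) + valid_for
    payload = f"user:{user_id}:{cpe_token}:{expiry}"
    return payload + ":" + sign(payload)

registered_cpes = {"cpe-042": "10.1.2.3"}
token = authenticate_cpe("cpe-042", "10.1.2.3", registered_cpes)
ticket = authenticate_user("alice", token)
print(ticket)    # one ticket covers the whole session; no per-link authentication
```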
3.5. Identification and Security
The entire network environment is assumed to be hostile. Authentication is achieved using a private key which is assigned to users at subscription (it may be assigned via a smart card) when they register with an authentication server. The authentication server will also operate a directory system for users' public keys, which other users can retrieve. Since this public key is provided by a certified trusted authority, its integrity is assumed. Identification is achieved by the authentication server assigning each user an unforgeable token or ticket upon authentication. The ticket has a limited lifetime and is not encrypted. The ticket is signed by the authentication server with its private key in order to prevent it being forged. The ticket itself is composed of the user's identification, network address, security clearance, the valid period for the ticket and some unique number. Access to services offered by a content provider is obtained by issuing a request to the content provider. Requests are composed of the user's session ticket together with the identifying label of the requested service. The ticket is used when making requests as an identification token. Authorisation, which is the granting or rejecting of the access, is performed by the content providers. Content providers validate users by verifying the session ticket with the authentication server's public key. The provider then validates the source of the request by matching the destination CPE network address against the address the request originated from. In this way, the information supplied by any content provider is only delivered to the network access and to the user specified by the ticket. The use of this ticket leads to the creation of an audit trail and can be used as the basis for the billing service. Higher security can be obtained at the expense of greater complexity. In this scheme, to issue a request, users must obtain the public key of the content provider. This can be done by sending a request to a public key server and receiving a certified copy of the content provider's public key. Each request is then encrypted by the user using this key. In addition to the user's identification ticket, the request also contains the document to be accessed, the public key of the content provider and the type of access desired. Both systems could operate simultaneously, with users specifying the level of security desired. Thus we have a trade-off between security and efficiency.
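A minimal sketch of the ticket and its verification as described above. The field names are illustrative, and a keyed hash stands in for signing with the authentication server's private key and verifying with its public key.

```python
import hashlib
import hmac
import json
import secrets
import time

AUTH_SERVER_KEY = b"demo-only-secret"   # a real ticket would be signed with the
                                        # authentication server's private key and
                                        # verified with its public key

def make_ticket(user_id: str, address: str, clearance: int, lifetime: int = 3600) -> dict:
    """Ticket fields as described in the text: user identification, network
    address, security clearance, validity period and a unique number."""
    body = {"user": user_id, "address": address, "clearance": clearance,
            "valid_until": int(time.time()) + lifetime, "nonce": secrets.token_hex(8)}
    body["signature"] = hmac.new(AUTH_SERVER_KEY,
                                 json.dumps({k: body[k] for k in sorted(body)}).encode(),
                                 hashlib.sha256).hexdigest()
    return body

def authorise(ticket: dict, request_source: str) -> bool:
    """Content-provider side check: verify the signature, the validity period,
    and that the request really originates from the address named in the ticket."""
    claimed = dict(ticket)
    signature = claimed.pop("signature")
    expected = hmac.new(AUTH_SERVER_KEY,
                        json.dumps({k: claimed[k] for k in sorted(claimed)}).encode(),
                        hashlib.sha256).hexdigest()
    return (hmac.compare_digest(signature, expected)
            and claimed["valid_until"] > time.time()
            and claimed["address"] == request_source)

t = make_ticket("alice", "10.1.2.3", clearance=2)
print(authorise(t, "10.1.2.3"))   # True: valid ticket from the expected address
print(authorise(t, "10.9.9.9"))   # False: possible spoofing attempt
```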
3.6. Billing
All parties in a billing event must be authenticated for the billing event to be valid (i.e. the information user, the user's network access, the user's smart card number and the service provider). A user request to a service provider must thus identify the user, the network access and the desired document. Billing is performed by content providers periodically forwarding logs of service requests, together with the user tickets encrypted with the service provider's public key, to the billing service provider, which then generates and sends a single invoice to the user. The biller decrypts this message to authenticate the service provider and decrypts the user information to authenticate the user. Bills are then sent to the customer. Similar techniques are currently used in GSM billing, through service providers rather than directly by carriers.
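The billing flow might be illustrated as follows (hypothetical structures, and omitting the encryption of the forwarded logs): each content provider accumulates a log of serviced requests and periodically forwards it to the billing service, which merges the logs into a single invoice per user.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Each log entry: (user_id, document, price) -- in practice the entry would carry
# the user's session ticket and be protected before leaving the content provider.
RequestLog = List[Tuple[str, str, float]]

def invoice(logs_from_providers: Dict[str, RequestLog]) -> Dict[str, float]:
    """Billing service side: merge the periodic logs from every content
    provider into one total per user, so each user receives a single invoice."""
    totals: Dict[str, float] = defaultdict(float)
    for provider, log in logs_from_providers.items():
        for user_id, document, price in log:
            totals[user_id] += price
    return dict(totals)

logs = {
    "news-provider":  [("alice", "front-page", 0.10), ("bob", "sport", 0.10)],
    "video-provider": [("alice", "movie-42", 3.50)],
}
print(invoice(logs))   # {'alice': 3.6, 'bob': 0.1}
```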
3.7. Access Control
Each service provider is an autonomous organisation responsible for administering its own access control. As the service providers either own the information, or act on behalf of the owners, they are responsible for implementing appropriate legal requirements for access. We believe it is the service providers who make the information public, and they must hence abide by the laws pertaining to current print, TV and audio publication. Access control is implemented by the service providers using the identification information about the user and the security clearance supplied by the authentication service on the session ticket. A generic access control must be possible based on demographic details of the user stored by the authentication service (such as age). A finer degree of access control should also be possible based on the user's unique identification number and an access control list for the document (or group of documents). This will cater for situations such as closed user groups, where someone wishes to provide information to a specific group of users but not to others (for example, a supplier offering documentation to its customers but not to its competitors). Unix-like access control should be possible with a group mechanism, so that documents can be owned by a group and users who belong to the group gain access to those documents.
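A small sketch of how these layered checks could combine, with all names hypothetical: a generic demographic rule (here a minimum age taken from the ticket), an optional per-document access control list, and a Unix-like group check.

```python
from dataclasses import dataclass, field
from typing import Optional, Set

@dataclass
class Document:
    doc_id: str
    min_age: int = 0                          # generic, demographic-based control
    acl: Optional[Set[str]] = None            # explicit per-user access control list
    group: Optional[str] = None               # Unix-like owning group

@dataclass
class User:
    user_id: str
    age: int                                  # demographic detail from the ticket
    groups: Set[str] = field(default_factory=set)

def may_access(user: User, doc: Document) -> bool:
    """Apply the generic control first, then the finer-grained controls."""
    if user.age < doc.min_age:
        return False
    if doc.acl is not None and user.user_id not in doc.acl:
        return False
    if doc.group is not None and doc.group not in user.groups:
        return False
    return True

manual = Document("maintenance-manual", acl={"alice"}, group="customers")
print(may_access(User("alice", 34, {"customers"}), manual))    # True
print(may_access(User("eve", 29, {"competitors"}), manual))    # False
```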
3.8. Hypermedia Directory Services
A possible solution to the resource discovery problem is through the use of directory services. Appropriately designed, such a service may be used for both exhaustive searching and global browsing as well as hypermedia link resolution. Various difficulties must be resolved in the current services before these goals can be achieved. These resource discovery problems can largely be overcome through the use of an appropriate hypermedia directory service. The hypermedia directory is in effect a link repository containing links to either hypermedia nodes or other link databases. To be effective, however, it must be scalable and must permit global and exhaustive searching while simultaneously allowing the user to browse the global data space with very low latency. These performance requirements are important considering the magnitude of the problem. As an example, the WWWWorm server index contains around three million entries. An average of two million requests are made per month on the server. The directory service should be integrated seamlessly into the hypermedia service network so that passing from the domain of the directory service to the domain of the other content providers is transparent. The directory service can be thought of as being analogous to a shopping mall where, although each store has a different character, the overall mall ambience is pervasive and each store essentially feels like an extension of the mall itself rather than a separate entity. In such a mall, shoppers can either browse or they can go to the information desk to directly locate (search for) the desired shop. One significant contribution that an appropriate directory service can make is that of adding value to third party services by supplying secondary (automatic, search generated) links to other external sources. This overcomes the problem of wishing to pursue a topic suggested by a keyword which a hypertext document author has not linked to another document. An application could use the designated keyword as an attribute for a search of the directory service. The search can be suitably qualified by context based information provided by the viewing application (for example, users may designate that when they click on a word, they want a dictionary definition of the word). Unlike traditional directory services, which to some extent can be considered relatively static, a hypermedia directory would be constantly in flux. Content providers would be constantly updating their information. It is likely that content providers who reference the information from outside sources will wish to protect themselves from changes in the target information sources. To minimise the impact on their systems, content providers may choose to indirectly reference the required information via the hypermedia directory service. In that case the directory should be appropriately maintained, possibly with contracts to define object persistence. The maintenance scheme should preferably be decentralised.
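The keyword-to-secondary-link idea might look something like the following sketch, with the directory contents and function names invented for illustration: the viewing application passes the selected keyword plus a context hint, and the directory returns candidate external links that the document author never placed.

```python
from typing import Dict, List, Tuple

# Directory entries: keyword -> list of (context, target URL) pairs.
directory: Dict[str, List[Tuple[str, str]]] = {
    "hypermedia": [("definition", "http://dictionary.example/hypermedia"),
                   ("research",   "http://library.example/hypermedia-papers")],
}

def secondary_links(keyword: str, context: str) -> List[str]:
    """Resolve a keyword the author did not anchor, qualified by the
    context supplied by the viewing application."""
    return [url for ctx, url in directory.get(keyword.lower(), []) if ctx == context]

# A user clicks on the word "hypermedia" and has asked for dictionary definitions.
print(secondary_links("hypermedia", "definition"))
```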
3.8.1. Searching and Browsing
While searching is most efficient in flat indexes, browsing requires hierarchical relationships between the information entities to be defined. Ideally, browsing requires a global and hierarchical view of the complete information space. Therefore there must be some way of making the global data set available to the user. Another aspect of browsing is
that users may wish to browse according to different views of the information space. For example, users may wish to browse according to geography or by topic. This requires multiple browsing structures.
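To illustrate multiple browsing structures over the same entries, the sketch below (hypothetical data only) organises one flat set of directory entries into two different hierarchies, one geographic and one topical.

```python
from collections import defaultdict
from typing import Dict, List

# One flat set of directory entries, each tagged with several attributes.
entries = [
    {"name": "Wollongong surf report", "region": "NSW",      "topic": "weather"},
    {"name": "Melbourne film guide",   "region": "Victoria", "topic": "entertainment"},
    {"name": "Sydney opera listings",  "region": "NSW",      "topic": "entertainment"},
]

def browsing_tree(key: str) -> Dict[str, List[str]]:
    """Build one browsing view by grouping the same entries on a chosen attribute."""
    tree: Dict[str, List[str]] = defaultdict(list)
    for entry in entries:
        tree[entry[key]].append(entry["name"])
    return dict(tree)

print(browsing_tree("region"))   # browse geographically
print(browsing_tree("topic"))    # browse the same entries by topic
```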
Further work is required in the areas of the logical directory architecture and user interface issues in order to develop an appropriate directory architecture to achieve our specified objectives.
Given a suitable hypermedia directory architecture which supports both exhaustive searching and global browsing, the problem of complexity control remains. As the number of services and the amount of information continue to increase, accurate retrieval of information becomes increasingly difficult.
With large information sources an accurate search can return huge quantities of information, making it difficult for the user to process the results. The recovered information thus requires some measure of its relevance to the user, so that it may be ranked in order of likely importance. Ranking the relevance of the information as determined from the search criteria may, however, not significantly improve the situation. It is thus desirable to use filtering mechanisms that will use the context of the search to further refine the search criteria. For example, the user profile may be used to significantly affect both the search results and the way in which the search results are returned, since the type of information desired by preschool children is different from that required by advanced researchers. For example, the presentation of search results to a child may be simpler and cartoon based, while that for an adult may be more technical and more detailed. Complexity control is also a problem when browsing a hierarchy containing a large number of entries, since this can produce disorientation through the sensation of "drowning" in a sea of information. This phenomenon arises when large volumes of data become daunting due to their monotony. Context based adaptation may also be applied to constrain the lateral view of the browsing structure. Context based adaptation of the browsing or searching parameters in this manner can be part of an effective complexity control solution. However, this does not address the disorientation experienced by users due to the tendency to lose track of their relative position in the link structure during browsing. In this case a possible solution is to use multiple navigation paradigms based on the user's current depth in the browsing structure. For example, the user interface may progress from 3D to 2D as the depth increases.
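A toy sketch of the ranking-plus-filtering idea follows; the scoring scheme and profile fields are illustrative assumptions. Results are ranked by a simple relevance score and then filtered and re-weighted by the user profile, so the same query yields different result sets for a child and for a researcher.

```python
from typing import Dict, List

# Candidate results with a crude keyword-match score and an audience tag.
results = [
    {"title": "Dinosaurs for kids",        "score": 0.70, "audience": "child"},
    {"title": "Dinosaur taxonomy survey",  "score": 0.90, "audience": "researcher"},
    {"title": "Dinosaur colouring pages",  "score": 0.40, "audience": "child"},
]

def rank_for_user(candidates: List[Dict], profile: Dict) -> List[Dict]:
    """Rank by relevance, then boost items matching the user's profile and
    drop those aimed at a clearly different audience."""
    filtered = [r for r in candidates
                if r["audience"] == profile["audience"] or r["score"] >= 0.85]
    return sorted(filtered,
                  key=lambda r: r["score"] + (0.2 if r["audience"] == profile["audience"] else 0.0),
                  reverse=True)

print(rank_for_user(results, {"audience": "child"}))        # simpler material ranked first
print(rank_for_user(results, {"audience": "researcher"}))   # technical survey only
```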
4. CONCLUSIONS AND FUTURE WORK
This paper has analysed the problems involved in the design of a distributed hypermedia information system and presented a proposed network service architecture. This architecture will serve as the basis of our continuing research, which is focusing on the integration of connectionless Internet-like services into the emerging STU/television-based delivery architecture.
5. REFERENCES
[1] F. Halasz, M. Schwartz, "The DEXTER Hypertext Reference Model", Communications of the ACM, Vol. 37, No. 2, pp. 30-39, February 1994.
[2] K. Andrews, F. Kappe, H. Maurer, "Hyper-G and Harmony: Towards the Next Generation of Networked Information Technology", Proc. CHI'95, Denver, May 1995.
[3] The Economist (Australian edition), March 7, 1995.
[4] M. H. O'Docherty, C. N. Daskalakis, "Multimedia Information Systems - The Management and Semantic Retrieval of all Electronic Data Types", The Computer Journal, Vol. 34, No. 3, pp. 225-238, 1991.
[5] M. Flickner, H. Sawhney, et al., "Query by Image and Video Content: The QBIC System", IEEE Computer, special issue on content based picture retrieval, 1995.
[6] O. A. McBryan, "GENVL and WWWW: Tools for Taming the Web", Proceedings of the First International World Wide Web Conference, CERN, Geneva, May 1994.
[7] M. F. Schwartz, A. Emtage, B. Kahle, B. C. Neuman, "A Comparison of Internet Resource Discovery Approaches", Computing Systems, Vol. 5, No. 4, pp. 461-493, Fall 1992.
[8] ISO/IEC 9594-1, "Information Technology - Open Systems Interconnection - The Directory: Overview of Concepts, Models, and Services", Recommendation X.500.
[9] M. Koster, "Robots in the Web: threat or treat?", NEXOR, http://web.nexor.co.uk/mak/doc/robots/threat-or-treat.html.
[10] C. M. Bowman, P. B. Danzig, U. Manber, M. F. Schwartz, "Scalable Internet Resource Discovery: Research Problems and Approaches", Proceedings of the 2nd International WWW Conference, pp. 763-771, October 1994.
[11] D. R. Hardy, M. F. Schwartz, "Customised Information Extraction as a Basis for Resource Extraction", Technical Report CU-CS-707-94, March 1994. To appear, ACM Transactions on Computer Systems.