Untangling the World-Wide Web

Liam Relihan University of Limerick [email protected]

Tony Cahill University of Limerick [email protected]

Michael G. Hinchey University of Cambridge [email protected]

While, for years, the Internet has been used to make information resources available, until relatively recently its users have been forced to interact with it through a set of difficult-to-use protocols such as FTP (File Transfer Protocol) [PR85]. Furthermore, extensive knowledge of obscure command-line interfaces, addressing schemes and file types was often required. Recent years, however, have seen the emergence of more sophisticated and user-friendly Internet information systems. These include the WAIS (Wide Area Information Service) document server protocol, the menu-based Gopher protocol and the hypertext-based World-Wide Web [Ber92]. It was the intention of the designers of the World-Wide Web (or WWW or W3) to provide access to the information resources of the Internet through easy-to-use software which operated in a consistent manner [BlEtAl92]. As the basis for information retrieval, the designers settled on the hypertext paradigm - a paradigm which supported the use of simple “point-and-click” interfaces. W3 began as a set of simple protocols and formats. As time passed, W3 began to be used as a testbed for various sophisticated hypermedia and information retrieval concepts. Unfortunately, many of these proofs of concept were quickly adopted by the general Web community. This means that experimental extensions of dubious use are now established parts of the Web. To make matters worse, many of those extensions were inadequately documented, if documented at all.

The Internet is a collection of small interconnected networks whose operation is considerably aided by the adherence of its users to an informal social code. However, in years to come, Internet services will be provided to an ever-increasing number of people. Unlike in the past, these new users will not have time to adapt to and learn the social and other mores of the “net”. Therefore, the mechanisms that provide information will need to become more robust. In this paper we shall examine some of the problems facing the World-Wide Web and approaches that may be useful in solving them. In particular, we shall examine problems that relate to the distribution of the Web's information resources. Finally, we shall provide a short evaluation of the Web from the point of view of information providers.

What is the Web?

The World-Wide Web is an Internet-wide distributed hypertext system which operates on a client/server basis. In response to user actions, W3 clients (which exist for a range of platforms) request representations of document objects from W3 servers, each of which manages a set of objects. The W3 object space consists of nodes, and parts of nodes called fragments, linked together by hyperlinks. Nodes usually exist "as-is" at some location on the Internet. However, certain nodes called index nodes may be searched. The result of such a search is a special kind of node generated "on-the-fly" called a virtual node. Document objects are identified in W3 by the Uniform Resource Locator (URL) addressing scheme [Ber93]. URLs are usually used to specify the destinations of hyperlinks and typically consist of a set of parts, delimited by colon (:), slash (/) and hash (#) symbols, as follows:

http://www.ul.ie/docs/home.html#intro

In this example, http refers to the scheme of the URL. A scheme is essentially a client/server dialogue format which indicates how a W3 client should "talk" to the server which manages the object. These formats can be complex and are not within the scope of this document; suffice it to say that http is the most important, as it is dedicated solely to the transfer of W3 documents. www.ul.ie refers to the location of the server which manages the object; this part is typically a standard Internet domain name. /docs/home.html specifies a node that resides within the domain of the server; how this part is interpreted depends upon the scheme being used. intro specifies a fragment of the node; commonly this optional part refers to an anchor within a node which is the end-point of a hyperlink. There is one further optional part in URLs, which specifies search terms with which an index node is searched. In the following URL, http://mgp.ie/db is the URL of an index node, while john is a search term to be used in searching the node:

http://mgp.ie/db?john
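
By way of illustration, the following sketch splits the example URLs above into the parts just described. It uses present-day Python's standard urlsplit function and is purely illustrative; it is not part of W3 itself.

from urllib.parse import urlsplit

# Purely illustrative: decompose the example URLs into the parts
# described above, using present-day Python's standard library.
for url in ("http://www.ul.ie/docs/home.html#intro", "http://mgp.ie/db?john"):
    parts = urlsplit(url)
    print(parts.scheme,    # client/server dialogue format, e.g. "http"
          parts.netloc,    # location of the server (Internet domain name)
          parts.path,      # node within the server's domain
          parts.query,     # optional search terms for an index node
          parts.fragment)  # optional fragment (anchor) within the node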

Most W3 documents are encoded according to an SGML (Standard Generalized Markup Language) format, HTML (HyperText Markup Language). HTML allows the identification of the logical components of a document, such as headings, paragraphs, lists and emphasised text. HTML also allows the tagging of both source and destination hyperlink anchors (anchors are the end-points of hyperlinks).

HTTP (HyperText Transfer Protocol) [Bl93a] is a simple information transfer protocol that operates according to a set of methods (or commands) and whose main purpose is to support the transfer of World-Wide Web hypermedia information between W3 clients and servers. The most commonly used method is GET, a basic retrieval method whereby a representation (usually in HTML) of a node is transferred to a client. More recent implementations of GET allow format negotiation, in that browsers may specify the data formats they can handle when they make a request; servers then have the option of adjusting the data format of the node representations they return to suit the capabilities of the browser. While hyperlinks may link objects managed by different servers, HTTP does not support any communication between servers. As will be seen later, this has implications for the maintenance of data integrity. W3 clients can typically handle far more schemes and transfer protocols than HTTP: clients such as NCSA's Mosaic can also provide an interface to FTP, Gopher, WAIS and Network NEWS.
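
The GET exchange and the format negotiation described above can be sketched as follows. The sketch uses present-day Python's http.client and the standard HTTP Accept header; the host and path are taken from the earlier example URL, and the exchange is illustrative rather than a definition of the protocol.

import http.client

# A sketch of a GET request with format negotiation: the Accept header
# lists the data formats this (hypothetical) browser can handle, and the
# server may use it when choosing the representation it returns.
conn = http.client.HTTPConnection("www.ul.ie")
conn.request("GET", "/docs/home.html",
             headers={"Accept": "text/html, text/plain"})
response = conn.getresponse()
print(response.status, response.getheader("Content-Type"))
body = response.read()   # the returned representation of the node, usually HTML
conn.close()
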
The Origins of the Web

The World-Wide Web was first conceived by Tim Berners-Lee in 1989 [BLC90]. Berners-Lee was then working at CERN, where it was recognized that researchers in high energy physics (HEP) were having difficulty sharing information due to a range of different network information retrieval protocols and a range of workstation types with widely varying display capabilities. The World-Wide Web was proposed as “a simple scheme to incorporate several different servers of machine-stored information already available at CERN”. This “scheme” was to use hypertext to provide “a single user-interface to many large classes of stored information such as reports, notes, data-bases, computer documentation and on-line systems help”. The original objectives of the World-Wide Web project included:

· the provision of a simple protocol for requesting human readable information stored in remote systems accessible using networks
· the provision of a protocol by which information could automatically be exchanged in a format common to the information supplier and the information consumer
· the provision of some method of reading text (and possibly graphics) using a large proportion of the display technology in use at CERN at that time
· the provision and maintenance of collections of documents, into which users could place documents of their own
· to allow documents or collections of documents managed by individuals to be linked by hyperlinks to other documents or collections of documents
· the provision of a search option, to allow information to be automatically searched for by keywords, in addition to being navigated to by the following of hyperlinks
· to use public domain software wherever possible and to interface to existing proprietary systems
· to provide the necessary software free of charge

Interestingly, it was not the aim of the project developers to use sophisticated network authorization systems - it was intended that data would either be readable by all parties, or would be readable only on one file system. All these aims pointed to a system for sharing information over networks that was:

· unrestrictive on data formats
· unsophisticated on security issues
· simple

Since W3 document information was encoded in the structurally-oriented HTML markup language, it was possible to develop browsers for platforms with very diverse display capabilities. As browsers became available for the more popular platforms such as X, MS/Windows and the Apple Macintosh, the Web began to move from the High Energy Physics community into many new user communities.
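
To illustrate why structural markup made very simple browsers possible, the following sketch renders a small, invented HTML fragment as plain text, ignoring all questions of appearance. It uses present-day Python's html.parser and is not a description of any actual W3 browser.

from html.parser import HTMLParser

# An invented HTML fragment: logical structure only, no presentation.
SAMPLE = '<h1>Welcome</h1><p>See the <a href="home.html#intro">introduction</a>.</p>'

class TextOnlyBrowser(HTMLParser):
    """A trivial 'browser' that renders logical markup as plain text."""
    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            print()          # blank line before a heading
    def handle_data(self, data):
        print(data, end="")  # emit the text itself, unformatted
    def handle_endtag(self, tag):
        if tag in ("h1", "p"):
            print()          # end the heading or paragraph with a newline

TextOnlyBrowser().feed(SAMPLE)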

Problems with the Web

In this section, we investigate some of the problems that the World-Wide Web is currently facing and will likely face in the future. We tend to concentrate on problems that are due to the increasing and changing user base of the Web.

User Numbers

Since its beginning, the size of the Web in terms of users has vastly increased. According to figures provided by the MERIT Network Information Center [Mer94], at the time of writing (February-June 1994) the World-Wide Web was approximately doubling in usage every 3 months in terms of absolute quantities of bytes transferred. In parallel with the growth in traffic, the range of different communities that use the Web is also growing. There follows a list of some of the communities now using the Web, each accompanied by the URL of a typical Web resource:

· software engineers http://rbse.jsc.nasa.gov/asv4_default.html
· aeronautical engineers http://techreports.larc.nasa.gov/ltrs/ltrs.html
· researchers in postmodern studies http://jefferson.village.virginia.edu/pmc/contents.all.html
· science fiction fans http://itdsrv1.ul.ie/Entertainment/Prisoner/theprisoner.html
· advertisers http://www.cs.cmu.edu:8001/Web/booksellers.html
· civil services and government bodies http://sunny.stat-usa.gov/CommerceHomePage.html

Not surprisingly, the computer-literate communities that were responsible for the early development of the Web are still at the forefront of its use and development. This is probably due to a number of reasons:

· it still requires a certain amount of technical knowledge to set up and maintain a connection to the Internet
· the main W3 data format, HTML, is a reasonably complex SGML-based format. Since there are currently few sophisticated HTML editors available, Web authors are obliged to deal with markup themselves
· Web servers must be carefully configured and maintained if they are to be reliable and not present a security risk

However, if current trends continue, new Internet users will appear even more rapidly than before. Indeed, the number of users connecting to the Internet is growing almost exponentially. Evidently, this is at least partly due to the recent publicity regarding the Internet in the popular media. In the experience of the authors, many of these new users are inexperienced when it comes to the use of multi-user systems. Unfortunately, many Internet systems (such as NEWS and W3) depend upon users observing a fairly strict etiquette and being fairly technically adept if they are to operate smoothly. Until recently, this has not been a serious problem. In recent times, however, new users have joined the Net and have begun to use Internet systems without sufficient instruction in etiquette or in their operation. In the case of NEWS, some recent breaches of “net etiquette” have resulted in near-chaos in certain newsgroups.

Previously, the developers of Internet systems could rely upon the vast majority of users to be technically competent and “forgiving” when it came to the use of their systems. However, we believe that systems designed solely with such highly competent and caring users in mind will find it more difficult to cope with users whose experience is often limited to simple single-user computer systems.

User Demands

The early W3 user community was probably undemanding by today’s standards. To a large extent it was content that it now possessed a unified interface to many of the old Internet services and had gained a new service in W3’s HTTP. HTTP (in conjunction with HTML) permitted users to have nicely presented documents available to them at the click of a mouse button. Moreover, the hypertext paradigm meant that it was simple to navigate from document to document. However, in recent times there have been pressures to expand the Web in several directions [www91-94]:

· Some users want more sophisticated data formats. They find the logical markup of HTML too restricting and want complete control of how their documents appear on screen. This is often an issue for users wishing to advertise on the Web.
· There are pressures from the commercial communities to integrate extensions (such as encryption) that will facilitate the selling of services and products through the Web.
· Some want support for media other than the standard text and bitmapped graphics. Even virtual reality extensions have been proposed [VRML94].
· Many users ask for more sophisticated authoring tools. W3-specific authoring tools are non-existent on some platforms. In addition, those authoring tools that do exist are often not very sophisticated and largely ignore such issues as collaborative authoring.

Such pressures usually result from users not being satisfied with the support the Web provides for their particular user community. However, a closer examination reveals that the Web is deficient in certain more fundamental areas.

Authoring and Collaboration

While the browsing of W3 documents is generally straightforward, the authoring of W3 documents is less so. Although some of the issues to do with collaborative authoring appear to have been considered in early memos and correspondence between the developers of W3, no sophisticated support for authoring has ever been part of the Web. If someone wishes to provide information on the Web, he/she encodes it himself/herself in HTML (usually using an ordinary text editor) and places it in a position from where it is visible to the local W3 server, and thus to the rest of the world. This kind of isolated document creation would probably be adequate on a single-user system where documents do not reference or include other people’s documents or resources. However, in a system such as the World-Wide Web, where thousands of readers read the widely distributed, interlinked work of thousands of authors, such authoring facilities are completely insufficient.

To a large extent, the issue of authoring is being avoided in the W3 development community. The development of systems that allow humans to collaborate on non-trivial projects is difficult and complex. What makes such systems complex is the need to maintain integrity within information that is fragmented and being used by many different people concurrently. The concurrent manipulation of information traditionally leads to a range of problems such as lost updates and false cross-references. However, these problems are not new - for years the designers of database management systems (DBMSs) have been forced to deal with them. A typical DBMS environment provides concurrency control facilities, which involve restricting concurrent access to shared data so that problems such as “lost updates” are avoided. In contrast, the World-Wide Web does not restrict access to shared resources in any respect. Currently this is justified because the HTTP protocol is largely limited to providing support to users for the reading of Web resources and not their creation, deletion or modification.
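
The “lost update” problem mentioned above can be made concrete with a small sketch. The scenario and names below are invented; the lock is a minimal stand-in for the kind of concurrency control a DBMS would provide, not a facility offered by HTTP or any W3 server.

import threading

# The shared resource: a single document that two authors wish to extend.
document = ["Original text."]
lock = threading.Lock()

def edit_without_lock(addition):
    copy = document[0]                    # read
    # ... time passes; another author may read and write here ...
    document[0] = copy + " " + addition   # write back, possibly discarding
                                          # the other author's change

def edit_with_lock(addition):
    with lock:                            # one read-modify-write at a time
        document[0] = document[0] + " " + addition

edit_with_lock("A note from author one.")
edit_with_lock("A note from author two.")
print(document[0])                        # both notes survive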

Hyperlinking and Integrity

To some extent, it might be fair to say that some authors treat the World-Wide Web like the older Internet information retrieval protocols, such as anonymous FTP, where there is little cross-referencing between resources and thus relatively little need for careful placement, management and naming. For instance, authors are sometimes inclined to set up a Web server and make some documents available without giving much thought to such issues as the naming of servers or the naming of documents. Later, as the authors and Web server administrators begin to consider their work, they often find it necessary to change names and reorganize their resources. This would not be too serious if other authors did not set links to the documents. In practice, it means that users browsing Web resources are often faced with links to non-existent documents.

Such problems are largely due to authors and administrators neglecting the fact that they are now placing documents into what is effectively a massive file system. To those reared on single-user, personal computer systems, it is quite acceptable to organize and manage one’s resources in virtually any way one wishes. This includes creating arbitrary objects in arbitrary locations with arbitrary names. In addition, on single-user systems there are no serious difficulties associated with deleting or modifying the objects under one’s control. However, with the advent of the world-wide file system that is W3, authors and server administrators have found it necessary to consider how their new resources interact with thousands of other people’s resources. Among the restrictions placed upon Web authors are the following:

· it is not acceptable to create objects without giving serious thought to naming and placement. This is so because W3's URL addressing scheme is dependent upon physical rather than logical location naming; as has already been mentioned, a URL is a tuple which contains an actual Internet domain name
· it is not acceptable to delete objects at will, because other objects may have links to these objects. In this case, the main information transfer protocol, HTTP, is at fault
· it is not acceptable to make arbitrary modifications to the internal content of nodes, since it is possible to link to named fragments within a node using the URL addressing scheme

Consider the following situation: a site has made available a set of large documents through the World-Wide Web. These documents have been linked to by many external documents. However, the site is desperately in need of storage space and must reclaim space quickly by deleting information not essential to the primary purpose of the machine. The system administrator decides that the W3 documents must be deleted. In this case there is currently no mechanism by which the documents may be transparently moved elsewhere (i.e. without requiring the authors of the external pages to reset their hyperlinks to a new location). Moreover, due to the simplicity of the HTTP protocols, it is difficult (if not impossible) for Web administrators to find out who is linking to their documents, and thus it is impossible to warn others of changes. Unfortunately, all this means that if someone places a document on the Web, he/she must effectively make a commitment to maintain it, or have it maintained, at its original location for life. Given that many HTTP servers are originally set up for semi-experimental reasons and eventually become fully-fledged servers of important documents (this is the case with the University of Limerick’s server itdsrv1.ul.ie), chaos can result when circumstances dictate that changes must take place.

Since there are no mechanisms to automatically enforce these restrictions, the Web relies upon its users to enforce them voluntarily. A system that depends upon its users to maintain cross-referential integrity is not automatically doomed to failure, of course. Indeed, there are many instances of protocols where users fairly successfully maintain integrity by following a set of guidelines. However, in such cases users need to possess enough technical skill to understand these guidelines and to be capable of implementing them. Although the World-Wide Web is currently fortunate enough to possess authors and administrators that are technically competent, we believe that it cannot rely upon such competence in the vast majority of new users.

The Web relies upon more than technical competence for the maintenance of information integrity - it also relies upon goodwill. To a large extent, a World-Wide Web server administrator relies upon the users of his server to keep him informed of problems. This is nothing new or uncommon in the Internet world - it is normal practice for an Internet user to inform an administrator of problems encountered when using the administrator's systems. As user numbers increase, it is possible that the Internet community will become less mutually supportive, thus sidelining such practices. Once again, the W3 server administrator’s job will become more difficult, as he will be forced to increase manual checking if he wishes to maintain the integrity of his resources.
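
In practice, the manual checking referred to above amounts to something like the following sketch: a script that probes each outgoing hyperlink and reports those that no longer resolve. The URLs are taken from the earlier examples and are purely illustrative; the sketch uses present-day Python and is not part of any W3 software.

import http.client
from urllib.parse import urlsplit

# Outgoing hyperlinks to check; these URLs are purely illustrative.
outgoing_links = [
    "http://www.ul.ie/docs/home.html",
    "http://mgp.ie/db",
]

for url in outgoing_links:
    parts = urlsplit(url)
    try:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
        conn.request("HEAD", parts.path or "/")
        status = conn.getresponse().status
        conn.close()
    except OSError:
        status = None                     # server unreachable
    if status is None or status >= 400:
        print("broken link:", url, status)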

Solving the Problems - a Beginning

The difficulty in controlling referential integrity in the World-Wide Web largely stems from the fact that the agents in the Web (i.e. HTTP servers and their human administrators) are not privy to the global state of the World-Wide Web system, or even to the state of that part of the system that directly affects them. When a change is made to an object under the control of a particular server, the change is not immediately apparent to the servers in the rest of the Web - it is up to the administrators of other servers to learn of the change by regularly browsing their outgoing hyperlinks, by reading about the change in some forum (e.g. the dedicated W3 newsgroups, comp.infosystems.www.*), or by being advised of the change by some other party (e.g. someone who has detected a link to a deleted document). This is a serious problem: as resources grow older and fall out of the maintenance schedules of administrators, the integrity of large sections of the Web will begin to deteriorate.

Thus we believe that the Achilles heel of the World-Wide Web is its almost complete reliance upon the dedication of server administrators and the goodwill of others for the control of resources and the maintenance of referential integrity. As in operating systems, the distribution of resources does not necessarily imply distribution of control [Mil92]. To control a set of distributed resources there are two options: centralized control and distributed control.

Centralized Control of Resources

With the centralized control of resources, an “arbiter” that manages a set of resources is nominated - in the case of the World-Wide Web, an arbiter would manage all or some of the W3 servers, which would each manage their own resources. All requests to perform an operation on any member of that set are sent to the arbiter for authorization. The arbiter is able to make the necessary decisions because it maintains complete knowledge of the state of the resources under its control. Centralized control is a straightforward way to control distributed resources. However, in the case of W3 it presents the following problems:

· the use of centralized control introduces a single point of failure (although strategies for alleviating this problem exist)
· centralized control could mean network traffic congestion, due to all requests for changes to the global state being sent to a single node. In addition, storage problems are likely if the complete system state is to be maintained at one location
· traditionally, the Internet community has favored protocols that require little or no centralized control (for instance, USENET [HA87] is completely decentralized). This is partly due to the reluctance of individuals to accept the responsibility and expense that come with maintaining a resource upon which others depend completely. Moreover, many users/administrators are reluctant to relinquish any substantial control of their resources to others, often for security reasons

Distributed Control of Resources

Distributed control can eliminate most of the problems associated with centralized control and may be based upon a range of distributed control algorithms. Milenkovic in [Mil92] defines distributed computer systems as collections of “autonomous computer systems capable of communication and co-operation via their hardware and software interconnections”. They are “characterized by their absence of shared memory, by unpredictable internode communications delays, and by practically no global system state observable by component machines”. To ensure referential integrity, a node in a distributed system would only need to maintain a view of that part of the system state which is directly relevant to its own operation. Distributed resource control has the following advantages:

· there is no single point of failure, since if a server breaks down, the ability of the other servers to make decisions about the available resources is largely unaffected
· there is less likelihood of traffic congestion, since control-related traffic does not all converge on a single point. Moreover, it is not necessary for each node to maintain the complete system state
· since the owners of resources are not required to cede control to a central node, there is less likelihood of the system being rejected at security-conscious sites

Unfortunately, distributed resource control brings its own complex problems, such as the timing and ordering of events and ensuring mutual exclusion when granting concurrent access to resources. There are many approaches to solving these problems, but they are outside the scope of this document. Suffice it to say that the developers of W3 would probably benefit from research carried out into various other kinds of distributed systems, such as distributed operating systems, distributed database systems and managed networks [Sta93].
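
Although a full treatment is beyond the scope of this paper, the event-ordering problem mentioned above has well-known partial solutions in the distributed systems literature. The following sketch of Lamport's logical clocks is one such standard technique; it is offered purely as an illustration and is not part of any W3 protocol.

class LamportClock:
    """Lamport's logical clock: a counter kept by each server."""
    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1
        return self.time

    def send(self):
        self.time += 1
        return self.time                  # timestamp attached to the message

    def receive(self, message_time):
        # Jump ahead of the sender's timestamp, preserving causal order.
        self.time = max(self.time, message_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t = a.send()        # server A notifies server B of a change
b.receive(t)        # B's clock now exceeds A's timestamp for that event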

A World-Wide Web Layered Architecture

The complexities of transparently developing a hypertext system upon a distributed resource control system suggest that it might be a good idea for the World-Wide Web to adopt some kind of layered model. Layering is a powerful technique for dealing with complexity in systems. The purpose of each layer in a layered system is to offer certain services to the higher layers, thus shielding those layers from the details of how the offered services are actually implemented [Tan89].

Layering is a technique that has already been adopted in the field of communications networks. For example, the International Standards Organization (ISO) Open Systems Interconnection (OSI) standard defines a seven-layer architecture which aims to make computer networks as far as possible universally compatible. Actual communication takes place at the lowest (physical) layer, with each higher layer providing increased functionality, such as retransmitting lost packets, discarding duplicate packets, decomposing and reforming messages at the transmitting and receiving sites (respectively), and providing a transparent service to the top (application) layer. A virtual communication is made between component sites at each level in the model, with complex protocols being defined for communication at each level.

Layered models have also been developed for hypertext systems [CG89, LØS92]. Such models usually consist of three layers (or levels), which may be termed the user-interface layer, the data model layer and the storage layer. The user-interface layer presents hypertext information to the user using the services provided by the data model layer. The data model layer may be viewed as a core hypertext system supporting the basic notions of hypertext such as nodes, links, etc. There have been various formal approaches to this layer [Lan90], some of which could prove interesting to W3 developers seeking a formal basis for their work. The storage layer tends to deal with traditional database issues that are independent of hypertext, such as storage, remote access, retrieval, multiple access, etc. This layer maintains a view of the data it manages whereby objects such as nodes and links are just atomic units served to the data model layer when required. The identification of a storage layer in W3 is particularly important, since it is this layer that should support the notions of information distribution, information sharing and information integrity. This layer should also deal with the efficient manipulation of huge multimedia objects, querying and versioning [LØS92].

As is to be expected, hypertext implementations in the "real world" are not always carefully based upon such ideal layered architectures. For instance, the simplicity of W3 has been compromised to some extent by the networks and communications protocols upon which it has been built.

Although the core W3 protocol, HTTP, is arguably a cross between a communications protocol and an information retrieval protocol for hypertext, it is almost completely devoid of layering. However, the adoption of some kind of layered model by the W3 developers would probably facilitate interoperability with other hypertext/hypermedia systems and increase extensibility in general. Moreover, the use of well-defined layers, and of well-defined interfaces between them, promotes the use of existing highly developed technologies and practices. For instance, the technologies encompassed by the storage layer are essentially technologies that have been formalized and refined over many years by the database and distributed systems communities.
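
One way to make the three-layer view concrete is sketched below. The class and method names are invented for illustration and correspond to no actual W3 software; the point is simply that each layer exposes a narrow interface to the layer above it.

from abc import ABC, abstractmethod

class StorageLayer(ABC):
    """Database-like concerns: storage, remote access, concurrent access."""
    @abstractmethod
    def fetch(self, object_id: str) -> bytes: ...
    @abstractmethod
    def store(self, object_id: str, data: bytes) -> None: ...

class DataModelLayer(ABC):
    """Core hypertext notions: nodes, links and anchors."""
    def __init__(self, storage: StorageLayer):
        self.storage = storage            # uses only the storage interface
    @abstractmethod
    def get_node(self, object_id: str): ...
    @abstractmethod
    def links_from(self, object_id: str): ...

class UserInterfaceLayer(ABC):
    """Presentation: renders nodes supplied by the data model layer."""
    def __init__(self, model: DataModelLayer):
        self.model = model
    @abstractmethod
    def display(self, object_id: str) -> None: ...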

Formal Methods

A network protocol refers to the rules or conventions through which two network sites may communicate with each other. Unfortunately, the geographical dispersion of sites and the need to communicate over heterogeneous networks can complicate matters considerably. The OSI reference model proposed a solution to this problem. Although it initially attracted much attention, and is still often considered to represent the state of the art in data communications, the OSI model has fallen considerably out of favor. The strict hierarchy of the model has proven difficult to implement and, more importantly, it has been difficult to demonstrate the compliance of implementations with the model.

The Internet system, over which the World-Wide Web operates, does not in fact follow the guidelines of the OSI model. Nevertheless, it successfully supports communication between clients that are widely dispersed geographically, in a reasonably efficient manner, across heterogeneous computer systems and networks. For example, electronic mail sent from a site in London can quite reasonably be expected to be delivered to a system in New York in a matter of minutes, or even seconds. While the transmission is completely transparent to the common-or-garden user, in fact many different formatting changes are occurring, and many different protocols must be employed between various sites en route. These protocols involve complex interactions between component sites, involving negotiation (or handshaking), the translation of packets from one format to another, and the retransmission of packets that have been lost or corrupted, or which have arrived in the wrong order. Internet protocols are surprisingly reliable, considering their simplicity and the large demands that are made of them. W3, however, builds on top of the Internet protocols, introducing its own extended protocols, its own structure, and an often ambiguous nomenclature.

However, all network protocols are complex reactive systems, involving interactions between a number of distributed components, and are often required to operate within strict timing constraints. With this in mind, a number of specification languages have been proposed in an attempt to overcome the complexity necessarily inherent in such protocols. The ITU (formerly CCITT) has introduced (O)SDL (Object-Oriented Specification and Description Language), a language based on extended finite-state machines, specifically to support protocol specification and design. Similarly, ISO's Extended State Transition Language (ESTL or Estelle) is based on Pascal with extensions for the definition of finite-state machines, and is extensively used in protocol specification and verification, as is the specification language LOTOS (Language Of Temporal Ordering Specification), which is based on the CCS process algebra and the algebraic specification language ACT ONE.

Just like any other reactive system, network protocols must cope with many actions happening simultaneously (possibly at different sites) and many components interacting, and yet must perform certain actions within strict time limits. And, just like any other concurrent system, network protocols can benefit from a more formal approach. Formal methods have been proposed for use in the specification and verification of network protocols. Specifications using model-based specification languages such as Z [Dukeetal] have proven successful; the use of languages more suited to concurrent systems, such as CSP or CCS, has proven even more practical [HJ93]. In fact, a formal analysis using CCS of the shutdown protocol of the Sizewell-B Nuclear Power Plant highlighted problems in the feedback loop [AB95]. Indeed, a more theoretical examination of Chang and Maxemchuk's Reliable Broadcast Protocol, which operates on top of the Internet system, highlighted flaws in the reformation part of the protocol [Jar92, HJ], even though the protocol has operated successfully for many years. The latter has particular ramifications for users of W3, as it indicates that protocols built on top of the Internet cannot simply be assumed to be correct.

W3 has never been subjected to the rigors of formal development. This has had the advantage of fostering the growth of the Web itself, allowing new functionality to be added in a very unrestricted manner, so that by facilitating the needs of many user communities the Web has now reached huge proportions. Unfortunately, such rapid development has meant that in recent times certain deficiencies have been detected in protocols and corrected in later software releases in an ad hoc manner. This has meant that incompatibilities have sometimes been introduced between browsers and servers, and hence "instant" software upgrades have been forced upon unwary users. We believe that many of these deficiencies are at least partly due to ambiguities in the relationships between some of the various components of W3, and that they have gone undetected for so long due to the lack of formality in its development. This has not caused major problems thus far, but as the Web continues to grow, such ambiguities and incompatibilities are likely to prove problematic.

Moreover, it has been observed by Maioli, Sola and Vitali in [MSV93] that a major source of complexity in distributed hypertext systems arises from the interaction between different hypertext databases. This is not yet a problem in W3, since databases almost invariably do not interact. However, with growing network traffic, backlogs and the provision of bandwidth-intensive multimedia information at certain sites, there is an anticipated need for load balancing; hence more interaction between data storage sites will be necessary. The protocols necessary for such interaction will almost certainly be complex, and formal techniques will be essential if they are to be specified precisely. Formal Description Techniques (FDTs) have already proven promising in the specification of document structures [HC92, CHR93]. Consequently, we believe that it would be beneficial to go at least some way towards formally specifying some of the components of W3.
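
In the spirit of the state-machine based languages mentioned above, even a toy model can make a protocol's legal behavior explicit and mechanically checkable. The following sketch models a simple request/response exchange; its states and transitions are invented for illustration and are not a specification of HTTP itself.

# Legal transitions of a toy request/response protocol:
# (current state, event) -> next state.
TRANSITIONS = {
    ("idle", "open"): "connected",
    ("connected", "request"): "awaiting_reply",
    ("awaiting_reply", "reply"): "connected",
    ("connected", "close"): "idle",
}

def run(events, state="idle"):
    """Replay a sequence of events, rejecting any illegal transition."""
    for event in events:
        try:
            state = TRANSITIONS[(state, event)]
        except KeyError:
            raise ValueError(f"protocol error: {event!r} not allowed in state {state!r}")
    return state

run(["open", "request", "reply", "close"])   # a legal exchange
# run(["open", "reply"]) would raise: a reply with no outstanding request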

How useful is W3 to the Document Community?

The importance of W3 is undoubted. For the first time we can see how a holistic interface may be provided to much of the information on the Internet. The Web can currently provide the following advantages to the document community:

· W3 integrates many Internet services. Where previously Internet users were required to be familiar with a range of protocols and interface software, W3 browsers such as NCSA's Mosaic now provide access to many Internet services through a single interface.
· W3 information is simple to browse. Since users are becoming increasingly familiar with the hypertext paradigm (due to its use in many online help systems), very little time is wasted in learning a new way of navigating through information.
· W3 provides a testbed for some interesting new document technologies and serves to provide us with a preview of what could be the information technology tools of the future.
· Most of all, the Web can be a good source of information. Everything from genome databases to stock exchange prices is available on the Web.

The popularizing of the Internet follows a long period during which single-user personal computing reigned supreme. While the concept of “one person, one CPU” was commendable in some respects, it probably restrained the development of collaborative computing. However, as is readily noticeable in the steady stream of LAN-based workgroup systems now being released upon the market, the personal computer community is now attempting to rediscover collaborative computing. In parallel with the growth of LAN-based workgroup computing, there is growing interest in the use of the Internet as a means of facilitating WAN-based workgroup computing. Quite possibly, the Internet presents opportunities for humans to collaborate on a scale rarely, if ever, seen before.

The authors believe that the development of more ambitious World-Wide Web protocols can facilitate the collaborative efforts of authors to create a massive, stable, cross-referenced document space that is readily browsable by all users. However, if future World-Wide Web development simply consists of ill-considered “bolt-on” extensions to the existing architecture, then problems are likely to result. Already, many older documents in the Web are littered with bad links to non-existent or relocated documents. If nothing changes, this situation is likely to worsen as the numbers of users, servers and documents increase.

The authors of this paper do not discourage the authoring of documents for the Web; rather, they recommend caution. Since the Web is one of the first really successful attempts at providing a document-oriented Internet information system, it cannot simply be ignored. Indeed, the Web has much to offer technical writers, such as the SGML-based HTML format. Since HTML is a markup language based upon such principles as platform independence, the preservation of important logical information and flexibility of final appearance, the encoding of large quantities of documents in HTML is relatively easy to justify.

It is often claimed that commercial concerns require completely robust systems whose behavior must be predictable if not amazing. This could be an unjust and unhelpful assumption. While the World-Wide Web is relatively unstable, possesses many deficiencies, and sometimes suffers from a lack of vision, it can still provide many benefits to the corporate world and to other concerns that need to provide information to the Internet community.

Conclusion

The World-Wide Web is a success. It is an example of a system that had humble beginnings and has grown to huge proportions. It has fairly successfully survived many extensions. It is the user-friendly face of the Internet for many, and has been at least partly responsible for the recent explosive growth in new Internet users. All this in a mere five years. However, the Web is facing some threats to its continued well-being, only a few of which have been dealt with in this paper. To counter these threats, the developers of the Web must consider being far more ambitious in the future development of the Web protocols. Moreover, they must refrain from making “quick and dirty” additions to the existing features simply to satisfy the immediate, narrowly focused needs of particular user communities.

Bibliography

[AB95] S. Anderson and G. Bruns, “Formalization and Analysis of a Communications Protocol”, in M.G. Hinchey and J.P. Bowen (editors), “Applications of Formal Methods”, Prentice Hall International Series in Computer Science, Hemel Hempstead and Englewood Cliffs, to appear 1995.

[Ber92] Tim Berners-Lee and Robert Cailliau, “World-Wide Web”, invited paper, Computing in High Energy Physics, Annecy, France, 23-27 September 1992.

[BlEtAl92] Tim Berners-Lee, Robert Cailliau, J.F. Groff and B. Pollermann, “World-Wide Web: An Information Infrastructure for High-Energy Physics”, in Software Engineering, Artificial Intelligence and Expert Systems for High Energy and Nuclear Physics, January 1992.

[Ber93] Tim Berners-Lee, “Uniform Resource Locators”, IETF URL Working Group, 1993.

[Bl93a] Tim Berners-Lee, “Hypertext Transfer Protocol”, Internet Draft (Work in Progress), CERN, 1993.

[BLC90] T. Berners-Lee and R. Cailliau, “WorldWideWeb: Proposal for a HyperText Project”, CERN memo, http://info.cern.ch/hypertext/WWW/Proposal.html, November 1990.

[CHR93] T. Cahill, M.G. Hinchey and L. Relihan, “Documents are Programs”, in Proceedings ACM SIGDOC’93, Waterloo, Ontario, 5-8 October 1993, ACM Press.

[HA87] M. Horton and R. Adams, “RFC 1036: Standard for Interchange of USENET Messages”, Network Working Group, December 1987.

[HC92] M.G. Hinchey and T. Cahill, “Towards a Canonical Specification of Document Structures”, in Proceedings ACM SIGDOC’92, Ottawa, Ontario, 13-16 October 1992, ACM Press.

[HJ] M.G. Hinchey and S.A. Jarvis, “The CSP Reference Book”, McGraw-Hill International Series in Software Engineering, London and New York, in press.

[HJ93] M.G. Hinchey and S.A. Jarvis, “An Incremental Approach to Network Specification and Verification”, in Proceedings SVNC’93, 4th Silicon Valley Networking Conference, Santa Clara, CA, 12-14 April 1993, pp. 123-131, Maple Press.

[Jar92] S.A. Jarvis, “Specification and Verification of a Reliable Network Protocol”, Oxford University Computing Laboratory, Programming Research Group, September 1992.

[Lan90] Danny B. Lange, “A Formal Model of Hypertext”, in Proceedings of the Hypertext Standardization Workshop (January 16-18, 1990), National Institute of Standards and Technology, 1990.

[LØS92] Danny B. Lange, Kasper Østerbye and Helge Schütt, “Hypermedia Storage”, Technical Report (R-922009), June 1992.

[Mer94] MERIT Network Information Center, “NSFNET Statistics”, 1994.

[Metcalfe] R.M. Metcalfe and D.R. Boggs, “Ethernet: Distributed Packet Switching for Local Computer Networks”, Communications of the ACM, July 1976.

[Mil92] Milan Milenkovic, “Operating Systems: Concepts and Design”, McGraw-Hill, 1992.

[MSV93] Cesare Maioli, Stefano Sola and Fabio Vitali, “Wide-Area Distribution Issues in Hypertext Systems”, in Proceedings ACM SIGDOC’93, Waterloo, Ontario, 5-8 October 1993, ACM Press.

[PR85] J.B. Postel and J.K. Reynolds, “RFC 959: File Transfer Protocol”, Network Working Group, 1985.

[WD93] Chris Weider and Peter Deutsch, “Uniform Resource Names”, Internet Draft (Work in Progress), IETF URI Working Group, 1993.

[SM94] K. Sollins and L. Masinter, “Requirements of Uniform Resource Names”, Internet Draft, expires September 26, 1994.

[Sta93] William Stallings, “SNMP, SNMPv2 and CMIP: The Practical Guide to Network Management Standards”, Addison-Wesley, Reading, Massachusetts, 1993.

[Tan89] Andrew S. Tanenbaum, “Computer Networks”, Prentice-Hall International Editions, 1989.

[VRML94] Mark Pesce and Brian Behlendorf, “Virtual Reality Markup Language”, 1994.

[www91-94] “WWW Talk Mail Archives”, 1991-1994.