COHSE: Dynamic Linking of Web Resources

Yeliz Yesilada, Sean Bechhofer, and Bernard Horan

SMLI TR-2007-167

August 2007

Abstract: This document presents a description of the COHSE collaborative research project between Sun Microsystems Laboratories and the School of Computer Science at the University of Manchester, UK. The purpose of this document is to summarise the project in terms of the work completed and the results achieved. The focus of the project was an application to enable the dynamic creation of hypertext links between documents on the Web; the intended audience for this document therefore comprises those members of academic and industrial research groups whose focus includes the Web in general and the Semantic Web and Hypertext in particular.

Sun Labs, 16 Network Circle, Menlo Park, CA 94025

email addresses: [email protected] [email protected] [email protected]

© 2007 Sun Microsystems, Inc. All rights reserved. The SML Technical Report Series is published by Sun Microsystems Laboratories, of Sun Microsystems, Inc. Printed in U.S.A. Unlimited copying without fee is permitted provided that the copies are not made nor distributed for direct commercial advantage, and credit to the source is given. Otherwise, no part of this work covered by copyright hereon may be reproduced in any form or by any means graphic, electronic, or mechanical, including photocopying, recording, taping, or storage in an information retrieval system, without the prior written permission of the copyright owner. TRADEMARKS Sun, Sun Microsystems, the Sun logo, Sun Fire, Java, Java SE, JDK, Java Card, Java Community Process, and Solaris are trademarks or registered trademarks of Sun Microsystems, Inc. in the U.S. and other countries. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. in the U.S. and other countries. Products bearing SPARC trademarks are based upon an architecture developed by Sun Microsystems, Inc. UNIX is a registered trademark in the United States and other countries, exclusively licensed through X/Open Company, Ltd. For information regarding the SML Technical Report Series, contact Jeanie Treichel, Editor-in-Chief. All technical reports are available online on our website, http://research.sun.com/techrep/.

Yeliz Yesilada, Sean Bechhofer
School of Computer Science, University of Manchester, Manchester, UK
[email protected]

Bernard Horan
Sun Microsystems Laboratories, Guillemont Park, Blackwater, Camberley, UK
[email protected]

1 Introduction

Between 2003 and 2007 Sun Microsystems Laboratories engaged in three consecutive collaborative research projects with the Information Management Group (IMG) of the School of Computer Science at the University of Manchester, UK. Prior to 2003, in association with the University of Southampton, members of the IMG had developed the Conceptual Open Hypermedia Service (COHSE). COHSE provided a framework that integrated a knowledge service and an open hypermedia link service to form a conceptual hypermedia system, enabling documents to be linked via knowledge resources (such as ontologies). The original papers describing this initial development also include an evaluation of its use and effectiveness, both of which were rated highly. The objective of the collaborative research projects between Sun Labs and the IMG was to undertake a series of mini-projects to explore, experiment with and investigate ways in which the architecture, implementation and use of COHSE could be improved. This report presents a summary of those mini-projects, which are described in detail in several appendices. The report begins with a brief summary of the architecture and implementation of COHSE at the outset of the project; subsequent sections present summaries of the individual mini-projects.

The rest of this report is organised as follows: Section 3 introduces the main motivating scenarios. Section 4 presents our experiments with automating the annotation process and using GATE for text processing. Section 5 presents our work on customisation and personalisation, and explains the overall architecture that is used to deploy COHSE as a portlet. Section 6 presents our experiments with using XSL and related standard languages to transcode documents. Section 7 explains how we have used AJAX to dynamically create link boxes. Section 8 details our visualisation experiments and Section 9 summarises our attempts to clarify what is meant by the term "The Hidden Web" and its impact on applications such as COHSE. Section 10 details our work on formalising COHSE's navigational model. Section 11 surveys a number of studies that aim to understand users' tasks on the Web and discusses how COHSE can support these tasks. Section 12 explains how the Knowledge Service is extended to serve knowledge resources with different formalisms. Section 13 investigates some related work. Finally, Section 14 presents major outcomes and provides some conclusions, and Section 15 presents future directions for COHSE.

2 Background

The theoretical foundation of the World Wide Web (hereafter 'Web') is the concept of links between resources [65]. Typically, links on the Web are embedded within web documents. Although this approach is simple and scalable, it has several limitations: a link can only be unidirectional; ownership is required to place an anchor within a document; documents must be in a particular format to support the addition of links (such as HTML or PDF); and there is a weight of maintenance in creating and updating links between web documents [49]. Open Hypermedia Systems seek to overcome some of these limitations [75]. Rather than embedding links in web documents, an Open Hypermedia System (OHS) considers links to be first class objects. They are managed independently and can thus be stored, transported, shared and searched separately from web documents. The Distributed Link Service (DLS), developed by the University of Southampton, is an example of an OHS, and provides dynamic linking between web documents [39].

COHSE extends the DLS with Semantic Web (SW) technologies [38]. COHSE provides different hypertext views of Web documents according to the choice of domain ontologies or vocabularies. For instance, Bechhofer et al [31] illustrate how COHSE is used in bioinformatics to link biology web documents using the Gene Ontology [5], and Carr et al [40] present how Sun's Java tutorial pages can be augmented using an ontology that describes Java and object-oriented programming.

The original architecture for COHSE is described elsewhere [40]—we provide a brief overview here. The architecture includes a COHSE DLS plus two supporting services: a Knowledge Service and a Resource Service (see Figure 1).

Figure 1: COHSE Architecture.

The Knowledge Service (KS), also known as the Ontology Service (OS), provides a simple HTTP service interface to access knowledge resources such as OWL [22] ontologies. It provides services such as mapping between concepts and lexical labels, information about specialisation and generalisation of concepts, and descriptions of concepts. (See Section 12 for a comparison of the use of ontologies and taxonomies to support navigation within COHSE.)

The Resource Service (RS), also known as the Annotation Service (AS) or Resource Manager (RM), manages the relationships between Uniform Resource Identifiers (URIs) and concepts. It provides two fundamental services:

1. given a concept, it will provide a set of URIs that can be used as hypertext link targets for that concept;
2. given the URL for an HTML document, it will provide a set of concepts identified in that document (as well as XPointer expressions that describe the location of the label of each concept in the document).

One of the ways in which the RS provides these services is by using a repository of annotations. An annotation represents a bi-directional mapping between a concept and a URI. The URI can take the form of a URL or that of a resource fragment such as an XPointer that describes a paragraph within a web document. The annotation may be produced by parties other than the document owner or users of COHSE [56]. The annotations are used by the RS as follows:

1. a concept-to-URI mapping provides a URI that is a candidate target for a hypertext link;
2. a URI-to-concept mapping provides a candidate source anchor for hypertext links.

In the original COHSE project, annotations were generated manually using the COHSE annotation tool [40]. Alternatively, the RS is able to provide its services by calling on the functionality of other providers. For example, the RS is able to fulfil the first of its fundamental services by using Web Services from providers such as Google—the RS provides the Google Web Service API with a concept's label and retrieves an ordered list of Web documents containing that label. (In December 2006, Google announced the end of life of their Web Service.)

The COHSE DLS is responsible for embellishing HTML documents with new links. It performs this function by calling on the functionality provided by the Resource and Knowledge Services. For each document it processes, it inserts link source anchors and link targets. In brief, the COHSE DLS works as follows:

1. the DLS retrieves the requested document;
2. the DLS identifies link source anchors in the document;
3. the DLS queries the RS to determine appropriate targets for each source;
4. the DLS rewrites the document and emits it to the original client requester.

The original COHSE architecture provided two mechanisms to identify source anchors (i.e., step 2 above). The first of these, known as annotation linking, used the functionality provided by the RS described above. The DLS sent a request to the RS containing the URL of the requested document, and the RS responded with a set of concepts identified in that document, with an XPointer expression describing the label of each. The second mechanism, known as generic linking, used a regular expression matcher to identify labels in the requested document. The pattern for the matcher was created by requesting from the KS all the lexical labels in the ontology.
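To make the generic linking step concrete, here is a minimal Java sketch (class and method names are ours, not COHSE's) that compiles a single alternation pattern from the lexical labels obtained from the KS and scans a document's text for candidate source anchors. Pattern.quote guards against metacharacters such as "[" or "]" appearing in labels, a brittleness discussed further in Section 4.

```java
import java.util.*;
import java.util.regex.*;

// Sketch of "generic linking": one alternation pattern built from the
// lexical labels served by the Knowledge Service is used to scan a
// document's text for candidate link-source anchors.
public class GenericLinker {
    private final Pattern labelPattern;

    public GenericLinker(Collection<String> conceptLabels) {
        // Pattern.quote escapes metacharacters such as "[" or "]"
        // that may appear in concept labels.
        StringJoiner alternation = new StringJoiner("|");
        for (String label : conceptLabels) {
            alternation.add(Pattern.quote(label));
        }
        labelPattern = Pattern.compile("\\b(" + alternation + ")\\b",
                                       Pattern.CASE_INSENSITIVE);
    }

    /** Returns the [start, end) character offsets of label occurrences. */
    public List<int[]> findAnchors(String documentText) {
        List<int[]> anchors = new ArrayList<>();
        Matcher m = labelPattern.matcher(documentText);
        while (m.find()) {
            anchors.add(new int[] { m.start(), m.end() });
        }
        return anchors;
    }
}
```

A single large alternation like this is also what drives the memory cost noted in Section 4: the compiled pattern grows with the number of terms in the knowledge resource.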

We described above how the RS provides a set of potential link targets when given a concept. The DLS uses this service in step 3 above. It also uses the KS to determine broader and more specific concepts for the given concept, and uses the RS to provide link targets for those concepts as well. The original COHSE implementation distinguished between link anchors and sources generated in advance or through some batch process, and those generated dynamically. This is summarised in Table 1.

                 Batch                          Dynamic
Source anchor    manual/automated annotation    Regular expression matching
                 (Annotation linking)           (Generic linking)
Target anchor    manual/automated annotation    Web Services (e.g., Google)

Table 1: Link Types

The key distinction is that in the batch approach, annotations are generated and stored in a repository, whereas in the dynamic approach COHSE processes the document on request. Both approaches are valid for discovering link sources and target anchors. The table shows the techniques that we have used for dynamic and batch processing; however, there is no reason why one cannot move these techniques from dynamic to batch or vice versa.

The COHSE DLS has, in the past, been implemented using a number of different architectures:

• Browser Plugin: The browser plugin (based on Mozilla) manifests itself as a sidebar. The COHSE DLS adds extra links to the Document Object Model (DOM) [20] within the browser. (The DOM is a platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure and style of XML-based documents.)

• COHSE Proxy: The COHSE proxy sits between a Web server and a client browser. As requests are made by the browser, they are intercepted by the proxy. The proxy forwards them to the destination Web server and waits for a response. The proxy then intercepts the document response from the Web server and adds extra links to the document where appropriate. The modified document is then returned to the client browser that originated the request. The COHSE proxy was implemented using the Muffin [9] proxy classes.

• Rewriting Proxy: The rewriting proxy makes use of the COHSE proxy, but is deployed as a servlet. The rewriting proxy takes requests in the form http://proxy_url/http://document_url, where proxy_url is the Uniform Resource Locator (URL) of the rewriting proxy and document_url is the required document. This allows users to make use of the COHSE proxy without adjusting browser settings.

For each of the above implementations, the interaction between the DLS and the other components (i.e., the KS and RS) differs. In each implementation the KS and RS are deployed at different locations in the architecture of the Web, each with its advantages and disadvantages (see Section 5). Furthermore, although all implementations embody common functionality to transcode documents and add links, they all use ad-hoc, customised techniques to do so. In the sections that follow we describe some experiments to overcome the limitations of the original COHSE approaches.

3 Concrete Uses of COHSE

Two scenarios under investigation are the linking of documentation created by Sun Microsystems, and the linking of resources for bioinformaticians in the Sealife project [15].

Sun Microsystems: Sun Microsystems provides several Web sites describing its products and services. The sites are targeted at different user communities, including:

• Sun Store (http://store.sun.com) describes products, aimed at potential customers;
• Sun Solve (http://sunsolve.sun.com) hosts a knowledge base, patches, updates and support for Sun products;
• Sun Docs (http://docs.sun.com) provides detailed documentation about Sun products.

These sites share subject content, in that they all contain information relevant to the products and services that Sun manufactures and sells. However, there is no explicit cross-linking between these sites. For example, users may benefit if HTML pages describing Sun's software products provided links to relevant documentation. Managing links between these sites poses a challenge, as it crosses Sun's organisational boundaries. As part of its swoRDFish programme [18], Sun has created a Unified Product Taxonomy (UPT). The UPT describes Sun's components, systems, software and services, as well as the relationships between them. Additionally, the UPT provides lexical information such as the names (and synonyms) of the products, and can thus be viewed as a thesaurus. By using the UPT as a knowledge source, COHSE can provide links from pages describing products on the Sun Store site to FAQs, support information on the Sun Solve site, or additional resources.

Sealife: The Sealife project [15] presents a similar scenario, but in a different domain. Here, the user community comprises life scientists browsing Web documents. For example, the National Electronic Library of Infection (NELI) [10] is a digital library of materials related to infectious diseases that is composed of a collection of Web sites (such as Bugs and Drugs, and Training in Infection). Although these sites present information relevant to the community, there are very few existing cross links between them and to other relevant sites on the Web. The Sun scenario illustrates the problem of linking within the same organisation; in the Sealife scenario we see the problem of linking across organisations—how do we link sites that belong to different organisations? To address this problem, we propose to use the COHSE infrastructure and enable cross-site linking by employing existing vocabularies such as the Gene Ontology [5]. Additionally, different users require different behaviour. For example, a specialist biologist may be interested in more detailed information than a casual browser. This requirement introduces the need for application-specific annotations to be associated with vocabularies. This could be achieved by using OWL annotations—OWL provides annotation properties, which allow additional information to be associated with classes in ontologies without affecting the definition or interpretation of the classes.

4 Information Extraction and Automated Annotation

In Section 2 we described the two mechanisms that enable linking: "annotation linking" and "generic linking". Users employed the COHSE Annotation Tool to create annotations that were then used by the annotation linking mechanism. Although this approach produced accurate annotations, it was time-consuming and is unlikely to scale when annotating a large corpus of documents. The original "generic linking" mechanism used by COHSE relies on regular expression matching to identify occurrences of terms from a selected knowledge resource (such as an ontology). Although this is a simple approach, the regular expression matching is performed purely at the character level—there is no attempt to make use of any structure that may exist in the text. The implementation of regular expression matching is also rather brittle—several workarounds were required to ensure that special characters appearing in concept terms (such as "[" or "]") do not cause problems in the regular expression. There are also questions of performance—the Java implementation of the regular expression matcher requires increasing amounts of stack space and memory as the number of terms in the knowledge resource increases.

In order to address the limitations of the pattern matching approach, we conducted an experiment with GATE, an open source platform for text processing tasks [47]. In this experiment (described in Appendix A), we aimed to achieve better text processing with GATE than we could using the existing pattern matching approach. Furthermore, we wanted to explore the benefits of annotating a group of documents as one "batch"—we believed that this would offer us the opportunity to improve the resulting links between documents within that batch. GATE contains a complete set of information extraction components known as ANNIE (A Nearly New Information Extraction System) that are designed to produce annotations from a corpus of source text. When we conducted this experiment, ontology-based information extraction had not been incorporated into GATE, so we extended the ANNIE pipeline to use OWL ontologies. Furthermore, when GATE processes a corpus it does not provide any support for storing annotations to be used by other applications, so we also extended GATE's pipeline to store annotations externally, where the existing COHSE DLS could use them. In summary, the following components were added to the GATE pipeline:

• Ontology loader: loads the ontology to be used by GATE;
• Gazetteer: includes both ANNIE and OWL gazetteers to provide access to a collection of known tokens (such as terms in an ontology) or commonly occurring tokens;
• XPointer maker: runs after all the above components and creates pointers to generated annotations in XPointer format. (XPointer is based on XPath but provides finer granularity, and can be used to point to sections of an XML document such as paragraphs.);
• Annotation service loader: stores identified annotations persistently.

Although GATE had the potential to address the limitations of our original regular expression matching mechanism, the process of generating annotations still required human intervention—it was necessary to load a corpus and configure GATE's user interface manually. We took the experiment one stage further and automated this process by creating a wrapper called the "Extraction Service" (ES). The ES provides a Web spider along with a pre-configured version of a GATE pipeline. The ES crawls a site from a given URL and passes all traversed documents to the pipeline for indexing and annotation.

Our experiment with GATE and the ES raised several issues, particularly regarding the "usefulness" of the annotations generated by GATE:

HTML Parsing: The GATE pipeline parses HTML documents differently from the COHSE DLS implementations. Different DLS implementations use different HTML parsers: for example, the Mozilla sidebar implementation uses Mozilla's HTML parser whereas the proxy implementation uses Tidy [21]. (Tidy is an open source program and library for checking and generating clean XHTML/HTML and a DOM representation of documents.) However, all the COHSE DLS implementations parse HTML documents into a DOM object and then use this object to resolve XPointer expressions. GATE, on the other hand, uses the HTML parser in the Java Development Kit and removes HTML tags to extract text-only content. Our experiments showed that differences in the way in which HTML is parsed result in different XPointer expressions and potentially misaligned annotations. Furthermore, Chen et al [41] show that only about 5% of Web pages are "valid" with respect to the HTML standard. Our experiments confirm these findings and show that different parsers handle invalid pages differently. This means that invalid pages can result in inconsistent creation and resolution of XPointers.

XPointers: When GATE parses a document, it attempts to pretty-print the document by removing all the HTML tags and replacing them with white space. Although this seems like a minor detail, the extra white space causes problems in determining accurate XPointers, particularly in computing the correct offsets. Furthermore, XPointers are fragile: when an author changes the structure of a document, the generated XPointers generally become invalid.

HTML to XHTML Conversion: Although we use Tidy to ensure well-formedness and to convert HTML documents to XHTML, such conversion is not a trivial task. Tidy fixes major problems, but we believe it needs more work to be useful in general-purpose applications.

Over-Linking: The pipeline approach gives us a collection of annotations which can be used as potential link targets as well as link sources. The ES produced a large number of COHSE annotations and thus a large number of potential targets. This may itself cause problems. For example, many of the Sun product pages mention Software or Hardware, but these pages are not always the best targets for links—particularly when we are using the concept taxonomy to bring in broader resources.

Identifying Useful Annotations: Similarly, index pages or pages that list Sun's products may show up as potential targets for many concepts. These again are unlikely to be useful as link targets. But with a simple crawler algorithm such as the one we were using, it is not easy to identify "useful" resources.
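For concreteness, the following is a minimal sketch of driving an embedded GATE pipeline from Java, in the spirit of the ES wrapper. We show only GATE's stock tokeniser and gazetteer; the COHSE-specific components (ontology loader, XPointer maker, annotation service loader) are omitted, the URL is illustrative, and the API calls reflect the GATE releases of that era.

```java
import gate.*;
import gate.creole.SerialAnalyserController;
import java.net.URL;

// Illustrative sketch of running a GATE pipeline programmatically, as the
// Extraction Service does after its Web spider has traversed a site.
public class ExtractionSketch {
    public static void main(String[] args) throws Exception {
        Gate.init();  // initialise the embedded GATE framework

        // A controller holds an ordered list of processing resources.
        SerialAnalyserController pipeline = (SerialAnalyserController)
            Factory.createResource("gate.creole.SerialAnalyserController");
        pipeline.add((ProcessingResource)
            Factory.createResource("gate.creole.tokeniser.DefaultTokeniser"));
        pipeline.add((ProcessingResource)
            Factory.createResource("gate.creole.gazetteer.DefaultGazetteer"));

        // Documents traversed by the crawler are collected into a corpus.
        Corpus corpus = Factory.newCorpus("crawled");
        corpus.add(Factory.newDocument(new URL("http://docs.sun.com/")));

        pipeline.setCorpus(corpus);
        pipeline.execute();  // annotations are now attached to the documents
    }
}
```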

4.1 Summary and Outcomes

The lessons learnt from this experiment can be summarised as follows (see Appendix A for a fuller description):

• In order to make this automated annotation pipeline work successfully, we need to ensure that annotation consumers and producers parse documents in the same way;
• In order to use XML technologies such as XPointer with pre-XML technologies, we need efficient tools to convert from HTML to XHTML;
• Using XPointer is challenging, particularly with documents that are not well-formed. We need a standard pointing formalism that supports the right granularity with the right consistency;
• If HTML documents are created dynamically and their structure changes regularly, then this batch processing approach cannot succeed with existing pointing formalisms such as XPointer—when a document's structure changes, annotations anchored to that document via XPointers can easily become invalid. An alternative approach would be to embed GATE into the COHSE DLS and rely entirely on dynamic annotation, thus eliminating the need for external annotation storage. However, with this approach one needs to consider the time taken by the DLS to process the text, which directly affects the user's experience;
• Finally, we also need to develop techniques to identify "useful annotations" when they are batch produced.
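The fragility point is easy to demonstrate with the standard Java XML APIs. In the invented example below, an annotation anchored at the first paragraph of a document silently selects different content once the author prepends a paragraph; the character offsets used by XPointer ranges suffer in the same way.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// A structural pointer keeps "working" after an edit but silently points
// at the wrong text. The XHTML snippets and expression are invented.
public class FragilePointer {
    public static void main(String[] args) throws Exception {
        String before = "<body><p>one</p><p>two</p></body>";
        String after  = "<body><p>NEW</p><p>one</p><p>two</p></body>";
        XPath xp = XPathFactory.newInstance().newXPath();
        for (String html : new String[] { before, after }) {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(html)));
            // An annotation anchored at the first paragraph...
            System.out.println(xp.evaluate("/body/p[1]/text()", doc));
        }
        // Prints "one", then "NEW": the stored pointer now selects
        // different content after the author's edit.
    }
}
```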

5 Personalising COHSE

One of the constraints described in Section 3 is that the scenarios' Web sites provide content targeted at particular user groups. The original COHSE architecture paid little attention to the user's role, yet we can see from these scenarios that if we are to repurpose the existing contents of these Web sites effectively, it is important to provide links appropriate for a particular user context [82, 32]. For example, the NELI sites are used by experienced medical doctors as well as members of the general public. Similarly, the Sun Web sites are used by different user groups, such as potential purchasers, system administrators and software developers. Knowing more about these user groups and providing customised and personalised links can be of great benefit: we can repurpose the content based on the user's context to provide personalised custom links. (Appendix B discusses issues of customisation and personalisation in depth.)

Based on the client-server architecture of the Web [65], various Open Hypermedia Systems have employed different techniques to augment Web documents with extra links [35]. Extra links are either displayed alongside the document or inserted into it [27]. COHSE uses the latter approach as this is familiar to most users [27]. However, this approach requires documents to be modified, or transcoded (transcoding is the act of converting documents from one form to another). Transcoding can be performed at three different deployment locations in the Web architecture. Each location has its advantages and disadvantages, particularly in relation to the level of customisation it can support:

(1) Server side: An application modifies a document by adding links to it. For example, one might batch process a corpus of documents to add extra link anchors. This could be achieved via an extension to a Content Management System (CMS) or as a standalone application. Because this approach does not require a client to modify documents dynamically, no end-user installation of specialist browsing software is required. Additionally, on the server side, more than one document can be processed (i.e., batch processing), whereas the client-side approach can only access one document at a time. Although dynamically-generated sites can be better handled by a server-side application, each site requires its own processing machinery. Moreover, one needs ownership of the documents to modify them. Thus, this approach implies that there is a clear distinction between author and reader; one of the criteria for describing a hypermedia system as open is that it has no such artificial distinction [65]. Although documents can be linked dynamically with such a server-side deployment, unless the user plays a role in this it will not be a truly open system. Server-side processing also means that COHSE links (i.e., multi-headed extra links) will only be available for that particular site. The user will not have a continuous browsing experience; when the user traverses to other sites, COHSE links will no longer be available.

(2) Intermediary: A proxy server acts as an intermediary between a client and server and adds links to documents. This approach potentially opens up the use of COHSE to alternative platforms such as WAP and mobile phones, as appropriate transformations can be used for different clients. As with the server-side deployment above, no end-user installation of specialist browsing software is required. However, the use of a proxy may introduce processing delays and the need to rewrite URLs to ensure that users continue to browse via the proxy. Moreover, users may need to modify their browser's configuration—this can be inconvenient and may be prohibited in some organisations. As we described in Section 2, the original COHSE implementation overcame this problem by employing a rewriting proxy. Although this solution avoids the problem of users being required to configure their browsers, it becomes difficult for users to read URLs displayed in the browser's status bar: when the cursor hovers over a link, the target URL displayed in the status bar will be long and potentially confusing. Furthermore, although customisation can be easier using a proxy, storing users' preferences persistently can be troublesome. To address this problem, one of the original COHSE implementations supported customisation per IP address. If more than one user accesses a proxy from the same IP address—which may be the case if a firewall is in use—one user's configuration may be changed by another user. And if a user accesses this proxy from different IP addresses, then the user's settings will no longer be available.

(3) Client side: A plug-in or browser extension adds links to documents. This approach is straightforward to prototype because the underlying development framework of the browser (such as Mozilla) can be used. However, the implementation is then tightly coupled to a specific browser, and most likely to a specific version of that browser. For example, although Firefox is a continuation of Mozilla, a Mozilla extension cannot be installed into Firefox, leading to problems with maintenance. Similarly, Mozilla extensions cannot be installed into Internet Explorer, requiring implementations for different browsers. A client-side extension also requires local installation by the user, which may be a problem if the user has insufficient permission to change the settings of the browser. However, the benefit of using a browser extension is that the browser's internal engine can be used to parse HTML documents—helping to overcome issues with badly-formed documents. Additionally, it is easier to store user preferences persistently using a client-side implementation, as the underlying preferences structure of the browser can be used. However, the maintenance and communication of these preferences to external components generally has to be performed in an ad-hoc manner. The use of a plug-in may also make it more difficult to share information between users, and thus more difficult to provide personalisation techniques such as collaborative filtering.

As explained in Section 2, COHSE has, in the past, been deployed as a proxy and as a suite of plug-ins to Mozilla [40]. However, these client-side and proxy implementations had limitations, in particular with respect to customisation and personalisation. Furthermore, for large organisations such as Sun these implementations proved to be inappropriate: an extension installation can be difficult, and using a proxy is not easy due to the internal firewalls used. In summary, then, the constraints and limitations of the original COHSE approaches are as follows (for a fuller discussion see Appendix C):

• The plugin solution offers reasonable support for customisation, but requires specialist software and customised coding for particular platforms.
• The COHSE proxy offers some customisation on a per-user basis, but requires changes to the user's browser settings.
• The rewriting proxy provides flexibility and ease of use, but does not support per-user customisation of settings.

It is possible to adjust the settings of the KS and RS in all three approaches, but in general not on a per-user basis, only system-wide. Most importantly, none of the original COHSE DLS implementations was able to persistently store user preferences.

To overcome the constraints and limitations described above, we adopted a portal-based architecture. A portal provides a framework to aggregate content from different sources, and supports storage of user profiles, customisation and personalisation [52]. A portal is defined as "a Web application that commonly provides personalisation, single sign-on and content aggregation from different sources" [8]. A portal employs pluggable user interface components called portlets to provide a presentation layer and enable aggregation of content from different sources. JSR168 [8] standardises how portlets are to be developed, addressing preferences (storing customisation preferences persistently), user information, security, and so on. It allows portlets adhering to JSR168 to be deployed to any portal container that conforms to the specification. We re-implemented the COHSE DLS as a portlet to exploit the advantages of portals; in the following section we summarise the underlying architecture.

5.1 Architecture

The COHSE portal architecture retains some components from the architecture described in Section 2. In the paragraphs below we describe the extra components and how they relate to the existing components. We have divided the description into two parts: first the architecture for creating annotations, and second the architecture of the portal implementation.

Figure 2: The COHSE architecture—annotation producers.

Items 1–4 of Figure 2 illustrate the components that provide the functionality to create annotations. This part is very similar to the original architecture, with the following modifications.

(1) Tidy: Tidy creates a DOM representation for each requested input document. Tidy is necessary because few documents on the Web are well-formed or in XHTML. (See Section 4 for a fuller discussion of this issue.)

(2) Annotation: We have experimented with two annotation techniques: manual annotation [40] and automated annotation using GATE (as described in Section 4). Both techniques rely on the DOM representation produced by Tidy, and both result in the addition of annotations to the resource repository.

(3) Resource Repository: As we described in Section 2, an annotation is a mapping from a concept to a resource. The resource repository stores each mapping as a row in a database table. Given that a concept is taken from an ontology or taxonomy, we represent a concept as a URI (or URN). The resource is represented as a combination of a URL (identifying a Web page) and two XPointer expressions that describe a span of text within that Web page.

(4) Knowledge Service: The Knowledge Service is used by both annotation techniques to identify the concept that is to be associated with a page or fragment of a page.

Figure 3: The COHSE architecture—annotation consumers.

Items 4–8 of Figure 3 illustrate the components that comprise the portlet. Again, this architecture bears some similarity to that described in Section 2.

(8) Portlet: From a user's perspective, a portlet is a pane in a portal that provides a specific service (such as a calendar). From an application perspective, a portlet is a software component written in Java (similar to a servlet), managed by a portlet container, which handles user requests and generates dynamic content. A portlet often provides users with various configuration options, which it manages itself. Note that, just like the annotation architecture, the COHSE portlet uses Tidy to create a DOM representation for each requested input document.

(6) DLS: The portlet implementation of the DLS is similar to those described earlier: it performs generic and annotation linking. To achieve generic linking it relies on a regular expression matcher compiled with terms from the KS. To achieve annotation linking it relies on the RS to provide it with link anchors and link sources for the requested document. We indicated in Section 2 that the RS may use annotations stored in the resource repository to fulfil requests. As we indicated in Section 4, it is essential that the XPointer expressions stored in the repository have been created from the same DOM representation as that used by the DLS. Unfortunately, this approach is only suitable for those pages that change infrequently.

(7) URL Rewriting Proxy: The functionality of this component is identical to the one described in Section 2.
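A minimal JSR168 sketch of the mechanism the portlet relies on for persistent per-user customisation, PortletPreferences, follows; the preference key and default value are invented for the example.

```java
import java.io.IOException;
import java.io.PrintWriter;
import javax.portlet.*;

// Sketch of per-user, container-managed preference storage in a JSR168
// portlet. The preference keys are illustrative, not COHSE's actual ones.
public class CohsePortlet extends GenericPortlet {
    @Override
    protected void doView(RenderRequest request, RenderResponse response)
            throws PortletException, IOException {
        PortletPreferences prefs = request.getPreferences();
        String ksUrl = prefs.getValue("knowledgeServiceUrl",
                                      "http://example.org/ks");
        response.setContentType("text/html");
        PrintWriter out = response.getWriter();
        out.println("<p>Linking against: " + ksUrl + "</p>");
    }

    @Override
    public void processAction(ActionRequest request, ActionResponse response)
            throws PortletException, IOException {
        // Store the user's choice of Knowledge Service persistently; the
        // portal container keeps this per user, across sessions.
        PortletPreferences prefs = request.getPreferences();
        prefs.setValue("knowledgeServiceUrl",
                       request.getParameter("knowledgeServiceUrl"));
        prefs.store();
    }
}
```

This container-managed storage is exactly what the plugin and proxy deployments lacked: no browser-local state, no per-IP workaround.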

5.2 Example

Figures 4 and 5 show the portlet in action: (a) shows the original page; (b) COHSE identifies and highlights concepts (such as "midrange servers"); (c) when the user clicks on a highlighted term such as UltraSPARC IV, COHSE creates a box that includes a set of links (such as links to the relevant documentation) along with a short description; and (d) when the user selects a link from the link box, COHSE retrieves and decorates that page with extra links.

The COHSE DLS portlet conforms to JSR168, so it can easily be deployed to any JSR168-compliant portal (such as the Sun Portal Server [17]). The portlet stores preferences persistently and allows users to use COHSE without the need for local installation. It supports three major groups of customisation options: (i) users can specify the location of the KS and can choose which knowledge resource to use; (ii) users can specify the location of the RS and can choose which services to use for link targets; (iii) users can customise the visualisation of COHSE links (for example, extra links can be added as an icon, or terms can be highlighted) and can customise the visualisation of the created link boxes.

With emerging WS standards such as SOAP, more services are becoming available for use by the RS. In addition to Google, the RS can also call on Amazon (aws.amazon.com), the Sun search engine (search.sun.com), Yahoo (developer.yahoo.com), Flickr (www.flickr.com), BBC News (backstage.bbc.co.uk), MeshPubMed (www.meshpubmed.org) and GoPubMed (www.gopubmed.org) to identify target resources. In contrast to manual annotation and GATE, resources identified by these external services are not stored in our central repository. Wrapper components for these services are invoked when extra COHSE link targets are requested from the RS. We have also created specialised wrappers: for example, our NELI service uses Google to search the NELI Web site. New WSs (such as the eBay WS) can easily be included in our set of services. Such a flexible architecture allows easy extension, and ensures that our system meets one of the important criteria for being a truly open system [65]. The combination of a variety of resource service components (i.e., WSs and advanced processing techniques) and knowledge resources provides us with a rich set of domain-specific resources. Again, being able to use different knowledge resources contributes to an open system that does not impose a particular domain model.

5.3 Personalisation

Although we have not yet fully exploited the advantages of this portal-based approach, the portlet implementation described above provides the basis to support a more sophisticated use of user profiles and history. For instance, we could use the portlet to record user behaviour and use the data to enable personalisation, whereby the portlet can make decisions about how to improve user interaction [52]. The goal of using this portlet-based approach is to enable us to experiment with various adaptation techniques. We believe that the portal will enable us to explore the benefits of the following five groups of techniques (see Appendices E and F for more details):

(1) Content: Our aim here is to improve the effectiveness of the content in a link box. A link box is composed of the following five components: (1) the label of the concept, (2) the description of the concept, (3) a list of resources (Web documents) related directly to the concept, (4) a list of broader resources for the concept and (5) a list of more specific resources for the concept. By default a user is presented with a link box containing all five components. However, by employing a model of the user, the portlet could provide different combinations of these components. For example, in the NELI scenario a medical doctor is presented with a link box that omits the description of the concept. (A sketch of this idea follows at the end of this subsection.)

(2) Navigation: Brusilovsky et al [36] identify five types of adaptive navigation support: (1) direct guidance, (2) adaptive ordering, (3) hiding, (4) adaptive annotation and (5) map adaptation. The portlet could provide different adaptation techniques for each of these types. Zellweger et al [84] indicate that the annotation of links (augmented with extra information) can reduce the number of visited nodes in a hypertext, thus further reducing a user's learning time. Hence, the portlet could use different techniques to provide extra information for the links in a link box (such as providing summaries, or using different colours). For example, in the NELI scenario, members of the general public are presented with a link box in which a very short description of the concept is provided as a tool-tip.

(3) Presentation: Our aim here is to improve the presentation of links and their resulting link boxes. According to Dalal et al [48], an unstructured long list of links can cause cognitive overhead and disorientation for the user. The portlet can adapt the following presentation characteristics of the link box: (i) structure (such as list, tabular, menu); (ii) ordering of links in a link box (such as historical or alphabetical); and (iii) interactivity (a variety of interaction techniques can be used). For example, if the user's current task has been identified as "browsing", then the traditional Web interaction provided by the links in a link box (i.e., the user clicks a link and the page at that link replaces the current page) can be efficient. However, if the user has a very different task, such as "searching", then this interaction model may not be effective, in which case a link box may be presented as a tool-tip.

(4) Browsing History: The user's browsing history can be recorded and used to adapt both presentation and navigation aspects. For example, a conventional browser renders a link that the user has previously visited in a different style to those that the user has not visited. The portlet could also keep the user's history based on the concepts that the user has "visited" and present them in a different style to those that the user has not yet visited. This approach may aid users' orientation, enabling them to keep track of their own history. For example, in the NELI scenario, if the user clicks on the COHSE link "Antimicrobial", a link box will be presented containing links relevant to the concept "Antibiotic" (due to the rdfs:subClassOf relationship between the concepts within the ontology). If the user subsequently visits the "Antibiotic" concept, the resulting link box will also contain resources for the concept "Antimicrobial", but with rendering hints (such as the use of icons) indicating that the user has already seen the concept.

(5) Collaborative Filtering: Collaborative filtering is the idea of filtering information for a user based on a collection of user profiles (for example, those of a group of which the user is a member) [80]. One common application of this approach is to make recommendations, an application popularised by Amazon in the form "users who bought X also bought Y". The portlet could enable collaborative filtering by exploiting the "groups" feature provided by a portal container. For example, the portlet could provide collaborative filtering to help users decide which of the links in a link box are most relevant for them to read and follow. We believe that this would be particularly appropriate when there are a large number of concepts within the knowledge model.

5.4 Summary and Outcomes

We re-engineered the COHSE DLS as a portlet in response to the requirement to provide personalisation features. As we described earlier, the existing implementations each suffered from various limitations. The portlet approach gained us the benefits of the proxy server deployment while avoiding many of its disadvantages. For example, user preferences were stored by the portal, and users were able to use the portal from any client platform. The major outcomes of re-engineering the COHSE DLS as a portlet are summarised below:

• The DLS portlet conforms to JSR168, and thus can be deployed to any JSR168-compliant portal (such as the Sun Portal Server [17] or the Apache reference implementation known as Pluto [12]);
• In the portlet DLS, we focused on "dynamic linking" (see Table 1) because of the problems we experienced with XPointers (see Section 4);
• We also re-engineered the RS (i) to make the integration of new components straightforward (a wrapper component can easily be created for a Web service) and (ii) to allow the components to be specified in the service request. The latter change is important for customisation—different users can now make use of different components;
• The Knowledge Service was also re-engineered to serve extra knowledge resources. It now supports Simple Knowledge Organisation System (SKOS) "concept schemes" as well as OWL ontologies;
• The portlet currently only supports customisation, but we have identified the following five areas where further experiments can be performed on personalisation: content, navigation, presentation, browsing history and collaborative filtering;
• Finally, although the COHSE DLS is implemented as a portlet, it does not fit the true spirit of a portal approach—instead of fulfilling the conventional function of a portlet by returning a fragment of a document for aggregation into the portal, the COHSE portlet returns a complete document. In our implementation we have used the portal infrastructure to provide single sign-on, persistent storage of user profiles, and authentication.

6 Experiments with XSL

All the COHSE DLS implementations described thus far embody common functionality to transcode documents and add extra COHSE links. However, they each use their own customised, ad-hoc techniques to achieve this. In this section we describe an experiment to re-engineer the overall architecture to use standard technologies such as XHTML, DOM, XSL and XPath. The experiment was a response to the following requirements:

• to have a single implementation of the core functionality of the COHSE DLS;
• to be able to reuse the same implementation in alternative developments of the DLS (see Section 5);
• to be able to use off-the-shelf implementations of these standard languages;
• finally, to have better interoperability support, particularly for interaction between the COHSE DLS and its supporting services (i.e., the KS and RS).

In spite of these obvious advantages, our experiment failed—here we summarise the details of the experiment and the reasons for the failure. For a more detailed description, see Appendices D and E.

The experimental architecture included changes to the COHSE DLS and the addition of two components to the architecture. As in earlier architectures, the experimental COHSE DLS first uses Tidy to ensure that the incoming document is well-formed and to construct a DOM representation. It then requests an XSL stylesheet from a new component, named the XSLT Service (XS, see below). The DLS uses an XSL engine to apply this stylesheet to the DOM representation, resulting in a transcoded document containing extra COHSE links. The two additional components were:

An XSLT Service (XS): The XS dynamically creates an XSLT stylesheet that adds links via a transformation of a document. (An XSLT transformation is defined in terms of templates, each specifying a node pattern and some action.) The XS communicates with the RM and KS to find the appropriate annotations for a specified document and then returns an appropriate stylesheet. Because the RM stores annotations in XPointer format and XSLT processes XPath expressions, the XS translates XPointers to XPath and uses character counting in template actions to handle character ranges. The XS also produces and adds the necessary Cascading Style Sheets (CSS [3]) declarations to the generated stylesheet. These declarations serve to embellish identified source anchors and to present the link targets.

An XSLT Engine: Having the required stylesheet generated by the XS gives us the flexibility to use any XSLT engine (such as Saxon [14]) to apply the stylesheet to the DOM representation of the document. This is one of the advantages of using standards: it promotes reusability. In addition, it opens up the architecture and increases flexibility—the generated stylesheets can be applied in several ways.
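The engine step itself needs nothing beyond the standard javax.xml.transform API, which is part of what made the design attractive. A sketch, assuming the stylesheet has already been obtained from the XS:

```java
import javax.xml.transform.*;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;

// Sketch of the "XSLT Engine" component: apply the stylesheet returned by
// the XSLT Service to the Tidy-produced DOM, emitting the transcoded
// document with extra links. Method and class names are ours.
public class TranscodingSketch {
    static void addLinks(Document tidiedDom, StreamSource stylesheetFromXs,
                         StreamResult transcodedOutput)
            throws TransformerException {
        Transformer t = TransformerFactory.newInstance()
                                          .newTransformer(stylesheetFromXs);
        t.transform(new DOMSource(tidiedDom), transcodedOutput);
    }
}
```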

Although the approach above appeared to present an ideal standards-based solution for COHSE, in practice we encountered several difficulties:

• XML resources are required. XSLT [25] is the only standardised transformation language for the Web. However, it can only be used with XML-based languages. XHTML was introduced by the W3C to bridge the gap between XML and HTML and can thus be used with XSLT, but unfortunately few documents on the Web are in XHTML format;
• XPath has limitations. XSLT uses XPath to navigate through the structure of an XML document. Although XPath is widely adopted, it does not provide the fine granularity required in applications such as ours. For instance, one cannot refer to arbitrary portions of a document, regardless of element boundaries;
• XPointer is not a standard. XPointer [24] provides solutions to some of the above limitations, but it is not a standard. Furthermore, it cannot be used with XSLT to process documents and is not supported in a consistent way;
• Different referencing techniques cause mismatch problems. For our application (and indeed for many other Semantic Web applications), a variety of techniques may be used to annotate documents. But different techniques typically mean different ways of referencing document fragments. In order to use generated annotations consistently, one needs to match fragment references—this can be an error-prone and effort-demanding process;
• Resources may change. XPointers are not tolerant of modifications to a document. Even small modifications can easily invalidate an XPointer;
• Using standards does not always mean easy integration. Although standards such as the DOM provide, in principle, a standard way of representing, accessing and manipulating a document, the reality is that different implementations use the DOM in different ways. Combining components based on the DOM often means tackling nitty-gritty technical problems;
• Is embedded markup harmful? With XSLT, XPath and XHTML, only certain kinds of annotations can be represented and applied to documents, because of the hierarchical representation of data in XML documents [72].

6.1 Summary and Outcomes

In summary, our experiments with this XSL-based architecture for the COHSE DLS show that, because of the difficulties discussed above and the current status of the Web [41], this approach is impractical. In order to continue using XSL transformations on the Web, one first needs to address the issues discussed above (see Appendices C and D).

7 Experiments with AJAX

In the original COHSE architecture, the COHSE DLS identifies all link anchors and relevant link targets before rendering the document for the user. The resulting document thus contains all the link boxes, with all the link targets, hidden at the bottom of the document. This approach has many disadvantages: (i) it requires extra processing load and time when the user requests a document; (ii) the user may never visit some of the link boxes; (iii) although the added link boxes are hidden in the document, the document content is visually modified; (iv) depending on the number of link boxes, the modified document may appear quite long, with an empty region at the bottom. In order to address these issues, we investigated techniques to create link boxes on demand, i.e., to create a link box when the user asks to see it by clicking on a COHSE icon (see Appendix F).

We chose to use AJAX—a name given to a combination of readily-available Web technologies [2]. Javascript can be used from within an HTML page to make an asynchronous call to a server and request an XML document. The Javascript can then use the resulting XML document to augment the DOM of the originating HTML document. The main component of AJAX is Javascript's XMLHttpRequest object, which is now supported by most mainstream browsers across all platforms. The main reasons we chose to use AJAX were:

• AJAX is based on open standards and supported by many browsers and platforms, so AJAX provides a cross-browser and platform-neutral solution;
• With popular sites using AJAX (such as Google and Amazon), users are becoming familiar with the AJAX interaction model;
• Users are not required to install any software or plugins;
• By using AJAX we can open up COHSE to other Web 2.0 applications. ("Web 2.0" is a phrase coined by O'Reilly Media, referring to a supposed second generation of Internet-based services; see http://www.oreillynet.com/pub/a/oreilly/tim/news/2005/09/30/what-is-web-20.html.)

We created a service called the Link Service. When given a concept from an ontology or taxonomy, the Link Service responds with an XML document containing a list of links for that concept. When the COHSE DLS processes a Web document, the DLS augments that document with the appropriate AJAX Javascript libraries. It also adds COHSE anchors that will make XMLHttpRequest calls to the Link Service. When the user clicks on a COHSE anchor, the Javascript code calls the Link Service and creates a link box from the resulting XML, which it then displays to the user. This approach thus makes a clear separation between link source identification and link target discovery (i.e., link box creation).

The AJAX approach is not widely used to access third-party Web services because most browsers impose a security restriction on the use of XMLHttpRequest: an HTML document is not allowed to make an XMLHttpRequest to any server except the server from which the document originated. (For detailed information, see http://www.xml.com/lpt/a/1627.) For example, if the AJAX application is contained in the document http://www.example.com/ajax.html, then the XMLHttpRequests that originate from that document can only make calls to services available on http://www.example.com. The COHSE portlet implementation overcomes this restriction by employing the rewriting proxy. This means that all the URLs contained in a Web document reference the host of the portlet, and are thus not subject to the security restriction described above. Therefore, from the browser's point of view, both the document and the service are served from the same domain.

7.1 Summary and Outcomes

The major outcomes of this experiment and the lessons learnt can be summarised as follows:

• AJAX allows us to create link boxes on demand;
• This approach solves the problem of embedding all link boxes within a transcoded Web document;
• With this approach, better and more detailed information can be captured about the user's interaction with COHSE links and link boxes—this is important for personalisation. For example, we can now capture information about which link boxes the user has visited;
• Browsers' security restrictions are overcome by using a rewriting proxy;
• Under certain circumstances there is a significant delay between the user clicking a COHSE icon and a link box being displayed. This problem can be partially resolved by the use of caching; however, caching does not help when the link box is first displayed. We are currently exploring techniques that gradually reveal the contents of a link box to the user, thus reducing the perceived delay.
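To make the division of labour concrete, here is a hypothetical sketch of a Link Service endpoint written as a servlet: given a concept identifier, it returns the XML from which the injected Javascript builds a link box. The parameter name and XML shape are illustrative; a real implementation would obtain the targets from the RS rather than hard-coding them.

```java
import java.io.IOException;
import javax.servlet.http.*;

// Sketch of a Link Service endpoint: accept a concept URI and reply with
// an XML list of link targets for the client-side link box.
public class LinkServiceServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        String concept = req.getParameter("concept");
        resp.setContentType("text/xml");
        resp.getWriter().printf(
            "<linkbox concept=\"%s\">%n" +
            "  <link href=\"http://docs.sun.com/example\">Documentation</link>%n" +
            "</linkbox>%n", concept);
        // In the deployed system the targets would come from the Resource
        // Service rather than being hard-coded as above.
    }
}
```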



8 The Visualisation Experiment

As different user studies demonstrate [34, 73], a single presentation technique cannot meet the requirements of all users. As part of our goal to improve the customisation features of COHSE, we undertook an experiment to explore different techniques for presenting links to users. We adopted Shneiderman's view that "user understanding is central and user control is vital for users to be successful" [77]. However, we also wanted to identify the "most efficient, effective and satisfactory" [53] technique for visualising COHSE links and link boxes, so that this technique could be used as the default presentation style for all users (see Appendix G). The experiment focused on two topics: the visualisation of the extra links added to a Web document by COHSE (i.e., link embellishment), and the visualisation of the list of link targets suggested by COHSE (i.e., link box visualisation).

Link Embellishment. We aimed to achieve the following features:

• Represent and differentiate COHSE links from hard-coded, embedded links, and differentiate COHSE links from the normal text in the Web page. In particular, COHSE links should be sufficiently distinguishable from normal text to indicate that they provide navigation.

• Avoid distracting the user, and reduce cognitive overhead and disorientation [44, 78, 60]. Conklin [44] identifies cognitive overhead and disorientation as the two most important problems that users experience when trying to navigate within hypertext systems. Cognitive overhead is defined as the "additional effort and concentration necessary to maintain several tasks or trails at one time" [44]. It occurs when a user is given many choices and has to decide which one to follow. Disorientation and cognitive overhead are associated with the number of links in a hypermedia system: as the number of links increases, reading and learning performance decreases [79]. To address these kinds of issues, different techniques need to be explored.

• Avoid breaking the continuity of a document, by modifying it as little as possible.

• Use a style that achieves the best perceived affordance11 of clickability12 or navigability.

• Avoid cluttering the page layout.

• Represent concepts from different ontologies.

To achieve these features, we identified several techniques for presenting extra COHSE links and for supporting better interaction with them (see Appendix G).

11 Perceived affordance, http://www.jnd.org/dn.mss/affordances-and-design.html
12 Guidelines for visualising links, http://www.useit.com/alertbox/20040510.html



Link Box Visualisation. According to Dalal et al. [48], an unstructured long list of links can cause cognitive overhead and disorientation. This suggests that structuring and organising a link box in different ways can overcome these kinds of problems. Furthermore, structuring and organising link boxes can assist navigation [54], browsing [42] and comprehension [48]. A link box is composed of five components, and the inclusion of each component can be optional, as one of the customisation options available to a user. The components serve different purposes; for example, the concept description promotes reading whereas the list of links promotes scanning and navigation. We could therefore either adopt an integrated representation of a link box, in which we present all the information about an identified concept, or an incremental approach, in which we present different degrees of information about the identified concept through different means; for example, the description of the concept could be presented as a tool-tip and the list of links in a pop-up window. In summary, we have identified various techniques based on the following six aspects:

1. Structure: how should we structure a link box? Different techniques can be used to organise (group, categorise and cluster) the list of links in a link box to support better and faster comprehension of the suggested links [48]. These techniques include lists, tables and different menu styles (such as hierarchical and drop-down menus).

2. Order: how should we order the links in a link box? Two types of ordering can be supported: customised and personalised. With customised ordering, the user selects how to order the list of links, whereas with personalised ordering the system decides. For personalised ordering, different factors can be used to determine the appropriate order, such as the user's background (profile), search goal (a set of keywords or an ontology) or current interest [36]. Identified ordering techniques include alphabetical, temporal and hierarchical ordering, as well as ordering based on document structure.

3. Preview and context: what kind of information about the link's target document should be included in a link label to help users decide whether or not to follow that link? The aim here is to support the user's decision-making process. Generally, users cannot view the document at the link target until they have activated the link, which leads to several problems: (i) deciding which links to follow and which to ignore introduces cognitive overload; (ii) it can be burdensome for users to follow a link only to discover that it is irrelevant; and (iii) contextual changes can also be a problem when users need to return to the page where they started [63]. To overcome these difficulties, we developed techniques, including summaries and keywords, to augment link labels.

4. Interactivity: how should we make link boxes interactive? Possible techniques include providing expand and contract views, adding tool-tips and using scrollbars in link boxes.

5. Amount of information: how much information should be included in a link box? Several techniques can be used to reduce the size of link boxes, particularly when many links are suggested as relevant targets. These include hierarchies, fisheye views, grouping and filtering of links.

6. Location: where should we locate the link box? The link box could be displayed at different locations and in different forms in the browser. For example, a link box can be displayed as a pop-up box, as a fluid interline13, in the margin or as a footnote.

Appendix G presents our visualisation experiments in detail; the techniques summarised above are explained there with concrete examples.

13 The fluid interline technique displays the gloss directly below the anchor.
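As a concrete illustration of the structuring aspect, the hypothetical sketch below renders a link box as grouped HTML lists; the grouping keys, class names and example URLs are our assumptions for illustration, not part of the COHSE implementation.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical link-box rendering: target links are grouped (here by the
 *  ontology relationship through which each target was found), one of the
 *  structuring techniques discussed above. */
public class LinkBoxRenderer {
    static String render(String description,
                         Map<String, List<String>> targetsByGroup) {
        StringBuilder html = new StringBuilder("<div class=\"cohse-linkbox\">");
        // The concept description promotes reading...
        html.append("<p>").append(description).append("</p>");
        // ...while the grouped link lists promote scanning and navigation.
        targetsByGroup.forEach((group, urls) -> {
            html.append("<h4>").append(group).append("</h4><ul>");
            urls.forEach(url -> html.append("<li><a href=\"").append(url)
                    .append("\">").append(url).append("</a></li>"));
            html.append("</ul>");
        });
        return html.append("</div>").toString();
    }

    public static void main(String[] args) {
        System.out.println(render("A mid-range UltraSPARC server.",
                Map.of("Broader", List.of("http://example.org/servers"),
                       "Related", List.of("http://example.org/v490"))));
    }
}
```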



9 The Hidden Web

At the outset of the collaborative project between Sun Labs and the University of Manchester, we set out to clarify what is meant by the term "The Hidden Web". We believed this had particular significance for COHSE, since the term encompasses those resources on the Web that are potentially unusable by Web services such as annotation services, and also unreachable by search engines. Our initial thoughts on the resources described by the term centred on resources that are not static, for example web pages created by servlets or JavaServer Pages technology. Not only are these web pages not static, the URLs that identify them also tend to be neither unique nor static. (For example, many eCommerce web sites use frames to obscure the "real" URL of the content of a web page.) We observed that there are several reasons why a resource may be considered part of the hidden web:

Access denial. Examples include content behind an organisation's firewall that is not visible to the World Wide Web, and content that is deliberately obscured by a content provider by using robots.txt14.

Difficulty in interpretation. This arises when information is presented using unconventional protocols or formats, or when a deliberate attempt is made to obfuscate content. For example, some resources may be available on publicly-connected web sites but are not retrievable by conventional Web protocols (examples include CVS servers and content management systems). Conversely, some resources that are retrievable using standard Web protocols may be in formats that are not amenable to searching or annotation, such as CAD drawings and old word-processing documents. Alternatively, some retrievable resources are copies of an original in a different format, where information content has been lost in the format translation. Finally, some content providers deliberately obfuscate parts of resources, typically to prevent automated systems from gaining access to the content.

Difficulty in discovery. Many resources are retrievable by URL, but there may be no way of navigating to the URL from a fixed set of roots. Search-based sites, such as manufacturers' support sites, may have little or no support for browsing (and thus annotation), and may provide few hyperlinks to content. In addition, many resources retrievable by URL change over time (for example, news sites). Some content cannot be represented solely by a URI; further context, such as the name or location of the user, is needed to identify the content. Some resources are only available after submitting a form via an HTTP POST mechanism, such as a search form; such a resource may have no URI of its own, being generated only in response to the form submission. Many sites serve different content depending on the user agent. For example, Microsoft Internet Explorer and Netscape Navigator have at times had sufficiently different font models that text sized for one may display many times larger or smaller in the other. Some sites also serve different content depending on the user agent's accepted language.

14 The robots.txt protocol is a convention to prevent co-operating web spiders and other web robots from accessing all or part of a web site [26].

A fuller description of the Web’s hiding places is given in Appendix H. The appendix also contains a set of recommended solutions that would enable Semantic Web applications such as COHSE to navigate resources in the hidden web.


10 Modelling COHSE with FOHM

In this section we summarise our investigation into formalising the COHSE navigation model using the Fundamental Open Hypermedia Model (FOHM). FOHM provides a systematic vocabulary for expressing structures and operations to represent different hypertext domains: the spatial domain, the navigational domain and the taxonomic domain [69]. The goal of our investigation was to create a FOHM representation (including a vocabulary and its meaning) that described the features of COHSE, thereby enabling interchange and interoperability between COHSE and other Open Hypermedia Systems. A further goal was to produce a formalism describing the relationship between the knowledge model used in COHSE and its navigational model.

The structures and operations that FOHM defines are intended to become first-class objects in an Open Hypermedia System. FOHM defines four basic structures: Data, Association, Reference Object and Binding. It also defines two types of metadata that may be attached to any structure within a model at any point (for example, to an Association). During our investigation we determined that the fundamental structures of FOHM were insufficient to represent the features of COHSE. Appendix I describes the additional structures required by COHSE in detail and proposes a way of extending FOHM to model COHSE. The main outcomes of this investigation are summarised below:

• In general, FOHM is a graph model for contextual browsing in open hypermedia. The notion of context can be used to modify COHSE so that it acts as an adaptive system (see Section 5). The context object within FOHM can be assigned to any of the four basic structures; this would give us the flexibility to adapt the COHSE navigational model in different ways. Although FOHM does not provide an explicit specification of its two main modifier objects, context and behaviour, we believe that we can easily extend FOHM to provide a better and more explicit formalisation of both.

• As highlighted before, one of the major benefits of using FOHM is interoperability with other Open Hypermedia Systems. However, few Open Hypermedia Systems are available in FOHM. It may therefore be more useful to explore formalising the navigational model of COHSE using existing Semantic Web metadata languages.

• Gibbins et al. [55] discuss the relationship between FOHM and RDF, and in particular investigate whether FOHM structures can be represented in RDF. Although they conclude that it is possible to represent FOHM higher-level relations using the RDF vocabulary, they also state that doing so loses the FOHM semantics, while introducing the constructs needed to represent FOHM semantics breaks the RDF model theory. They therefore propose extending the RDF semantics with higher-level FOHM semantics by defining an RDF schema (in RDFS or OWL). This schema could then be used to apply FOHM to COHSE.

• Several linkbases have been created using FOHM. All of them share a common feature: they are not able to provide "dynamic" links, as all their links are stored persistently [68]. In contrast, the COHSE DLS relies on links that are created dynamically when a user requests further information about a concept. It therefore seems unlikely that existing linkbases would support COHSE's navigational model.

• We can use FOHM to represent the navigational support provided by COHSE. However, we still need to investigate other techniques to better represent the relationship between its navigational model and its knowledge model.

• The FOHM formalisation itself is imprecise.
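To make the four basic structures concrete, the sketch below gives one possible rendering of them as plain Java types; this is our illustrative reading of FOHM [69], not code from COHSE or from any FOHM implementation.

```java
import java.util.List;

/** A hypothetical rendering of FOHM's four basic structures as plain Java
 *  types; names follow the FOHM paper [69]. */
public class FohmSketch {
    /** Wraps a piece of content (e.g., a document). */
    record Data(String uri) {}

    /** Points at a Data object, or at a location within it. */
    record ReferenceObject(Data data, String locationSpec) {}

    /** Attaches a ReferenceObject to an Association in a given direction. */
    record Binding(ReferenceObject ref, String direction) {}

    /** Relates a set of Bindings; a navigational link is one example. */
    record Association(String relationType, List<Binding> bindings) {}

    public static void main(String[] args) {
        // A COHSE-style link from one source anchor to two suggested targets.
        Binding src = new Binding(new ReferenceObject(
                new Data("http://example.org/page.html"), "xpointer(...)"), "SRC");
        Binding dst1 = new Binding(new ReferenceObject(
                new Data("http://example.org/target1.html"), null), "DEST");
        Binding dst2 = new Binding(new ReferenceObject(
                new Data("http://example.org/target2.html"), null), "DEST");
        System.out.println(new Association("navigational", List.of(src, dst1, dst2)));
    }
}
```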


11 User Tasks on the Web

We have investigated a number of studies that aim to understand what kinds of tasks users perform on the Web, and the behaviours or tactics used to achieve those tasks. Our aim was to understand how COHSE could potentially support these tasks, and additionally to investigate how task-based adaptation and personalisation could be supported by COHSE (see Section 5). In the literature, the two fundamental paradigms discussed for finding information on the Web are browsing and searching [60, 66, 74]. The basic distinction between the two lies in the object of search (the information need) and the systematicity of the tactics used: when searching, the user is looking for something specific, whereas when browsing, the user does not have a well-defined task. On these simple definitions, we can broadly say that COHSE supports browsing. Studies of the browsing paradigm take this view one step further and propose three different kinds of browsing, based on the user's goal or the specificity of the object searched for: it can be very well defined, not very well defined, or not defined at all [28, 45, 59, 64, 66]. On this basis, we can say that COHSE supports the latter two.

There are also relevant information retrieval paradigms that describe the kinds of tasks COHSE supports, including exploratory search [81], berrypicking [29] and information foraging theory [74]. These paradigms are related in the sense that they all propose that the user's queries continually shift: accessing and learning new information may yield new queries. According to information foraging theory, users typically forage for information on the Web by moving from document to document along hyperlinks; by adding extra hyperlinks, COHSE therefore improves the user's potential exploration of the information space. According to Marchionini [67], learning and investigation activities are parts of exploratory search. Looking at the tasks involved in these two activities, and the way in which COHSE adds extra links and link boxes, we can say that COHSE supports these tasks. Marchionini [67] highlights that exploratory search is more concerned with recall (maximising the number of possibly relevant objects retrieved) than precision (minimising the number of possibly irrelevant objects retrieved). This provides a clear distinction between the functionality provided by COHSE and that provided by search engines: COHSE aims to improve recall, whereas search engines aim to improve precision. When a Web page is accessed, COHSE aims to add and return as many relevant links as possible by using the knowledge encoded in a domain ontology (such as using broader and narrower relationships).

Byrne et al., 99 [37]: locate on page; use information; go to page; provide information; configure browser; react to environment
Morrison et al., 01 [70]: find; collect; monitor; explore
Choo et al., 00 [43]: formal search; informal search; conditioned viewing; undirected viewing
Sellen et al., 02 [76]: finding; information gathering; browsing; transacting; communication; housekeeping
Heath et al., 05 [58]: locating; exploring; monitoring; grazing; sharing; notifying, asserting, discussing, evaluating, arranging, transacting
Kellar et al., 06 [62]: fact finding; information gathering; just browsing; transaction; other

Table 2: User Tasks (one study per row) - highlighted tasks are potentially supported by COHSE.

If the user's task is very well defined (such as searching), COHSE does not support it very well. However, the COHSE approach could be used to better support searching tasks. For example, when users are looking for something specific they can easily use a search engine to search for a specific query; but when they visit the pages returned by the search engine, they move to a new task context and lose their original search context. Google's toolbar attempts to maintain the user's context by highlighting query terms, but does not provide links to other pages relevant to the user's search query. The COHSE approach could add this feature by dynamically providing links on the resulting page according to the original query terms, thus enabling users to maintain the context of their original search.

There are also several user studies that seek to understand what tasks users perform on the Web. The results of these studies are summarised in Table 2, with the tasks that COHSE could support highlighted. Looking at the tasks summarised in Table 2 and the other information-seeking paradigms, most of the tasks discussed are based on the specificity of the search goal. Appendix J presents details of these studies and of the research we undertook.


12 Ontologies or Vocabularies for Supporting Navigation?

Languages like OWL [22] provide representations that support detailed, formalised conceptual models [33]. They come with a well-defined formal semantics that tells us precisely how composite concept descriptions should be interpreted. These formal semantics can help to drive the machine processing that will underpin the Semantic Web, and are vital if we are to support machine-to-machine communication. However, such an approach comes with an associated cost: OWL ontologies can be hard to build, hard to maintain and hard to use. In domains such as medicine or bioinformatics, formal ontologies help to ensure interoperability between applications, and the use of logic-based classifiers also helps in building and maintaining consistent vocabularies. Ontologies have also proved useful in supporting accessibility, as shown in the DANTE [83] and SADIe [57] projects, where annotations provide extra information about the role played by objects in a Web page. This information can then be used to transcode pages, producing versions that are more accessible to users employing assistive technologies.

A key point is that in the above applications, the knowledge models are used by some process or program in order to perform a task for the user; the content must therefore be unambiguously and explicitly represented (hence the need for languages such as OWL). In some situations, this "high-pain, high-gain" approach of detailed and formal semantics may not be necessary. For example, supporting navigation (as in COHSE) does not necessarily require that the relationships used for navigation are strict super- or subclass relationships. Our original premise for COHSE was that the structure in an ontology can support navigation through a collection of documents. Although this is true to a certain extent, our experience suggests that formal ontologies (such as those represented in OWL) are not always best suited to this task. Strict sub/superclass relationships are not necessarily appropriate for navigation; rather, the looser notions of broader and narrower, as found in vocabularies or thesauri, provide the user with more appropriate linking.

COHSE was designed with an open architecture, the intention being that ontologies could be taken "off the shelf", plugged into the services and used. Again, this works to a certain extent. The adoption of OWL as a W3C recommendation helps this process greatly, and has resulted in a large number of ontologies being published in OWL. Unfortunately, however, many of these ontologies are unsuitable for use within COHSE because they contain very little lexical information, such as human-readable labels. Instead, concepts often have names rendered using "camelback" notation, for example AlphaBetaProteinComplex. Although we can apply heuristics to such names to provide human-readable labels for rendering and for searching in text, this provides passable results at best. In addition, these models often lack information about lexical variants or synonyms; such information can be particularly useful when trying to determine the occurrence of potential link targets.

At the other end of the problem, there exist a large number of vocabularies which are not intended to be formal ontologies, but rather are controlled vocabularies to be used for annotation, information retrieval or the organisation of information resources. These sources often contain the kinds of information that we believe are useful when presenting information to human users (rather than machines), for example synonyms, lexical variants and "scope notes" or definitions providing valuable context for a human reader. These sources are, however, often not provided as OWL, and in order to use them within COHSE we originally had to provide OWL-based versions. This often required a "bending" of the semantics. This was particularly the case when trying to make use of Sun's UPT within COHSE. The UPT contains a classification taxonomy which is not strictly a super/sub classification, but rather a broader/narrower hierarchy in the thesaurus sense. An early solution involved translating this to an OWL representation. Although this produced a working system, it required significant "massaging" of the vocabulary; it was clear that the super/subclass relations represented were not actually respecting the OWL semantics. Another possible solution would be to introduce new relationships into the ontology representing the broader/narrower relationships, but this introduces a requirement that the application machinery be aware of them, resulting in a non-generic solution.

We thus came to the conclusion that basing COHSE's knowledge sources on a thesaurus model is likely to provide a better "fit" with the navigation models that we wish to support. SKOS provides standardised relationships (broader/narrower/related) which can be used by our application. Although these relationships may not have the precise semantics that come with OWL's relationships, in this context the looser interpretation is more appropriate to the task in hand. This also fitted well with the UPT. This is not to say that we have rejected the use of formal ontologies or OWL within our system, far from it. Rather, we have introduced an additional level of abstraction into the knowledge model that allows us to implement the underlying knowledge structure using whichever formalism we require (OWL, SKOS, etc.) while presenting a unified interface to the COHSE DLS. An OWL super/subclass relationship can still be considered as a broader/narrower relationship, with the additional abstraction layer performing the mediation between the representations. The vocabulary or ontology still provides a shared model supporting navigation.




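The sketch below illustrates both points under stated assumptions: a hypothetical interface of the kind the abstraction layer could present to the DLS (the method names are ours, not the actual Knowledge Service API), together with the kind of de-camelcasing heuristic mentioned above.

```java
import java.util.List;

/** A hypothetical unified view over knowledge resources: the DLS sees only
 *  broader/narrower/related and label lookups, while implementations map
 *  these onto OWL subclass relations or SKOS properties as appropriate. */
interface KnowledgeSource {
    List<String> broader(String conceptUri);
    List<String> narrower(String conceptUri);
    List<String> related(String conceptUri);
    List<String> labels(String conceptUri);   // synonyms, lexical variants
}

/** Heuristic label recovery for ontologies lacking lexical information:
 *  "AlphaBetaProteinComplex" becomes "Alpha Beta Protein Complex".
 *  As noted above, such heuristics give passable results at best. */
class LabelHeuristics {
    static String deCamelCase(String name) {
        // Insert a space at every lower-to-upper case transition.
        return name.replaceAll("(?<=[a-z])(?=[A-Z])", " ");
    }

    public static void main(String[] args) {
        System.out.println(deCamelCase("AlphaBetaProteinComplex"));
    }
}
```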


13 Related Work

The Web can be considered a closed hypermedia system, since its links are embedded in the pages [65], whereas an open hypermedia system is simply one in which the reader is given the same access as the author [50]. Open Hypermedia Systems are well researched in the hypermedia community, where several systems have been developed, including MicroCosm, Chimera and Devise Hypermedia [75]. Several systems have also been introduced to provide open hypermedia on the Web, including DLS, DHM/WWW, Webvise and the Arakne Environment [35]. Using these systems, readers can create links and other hypermedia structures on top of arbitrary Web pages, and can share these links with others through external linkbases.

Conceptual Hypermedia Systems specify hypertext structure and behaviour in terms of a well-defined conceptual scheme. Since constructing hypertext links manually is inconsistent and error-prone, these systems bring conceptual models to bear in order to support linking. Although conceptual hypermedia systems address the problem of hard-coded and hand-crafted links, systems such as MacWeb [71] and SHA [46] are closed systems rather than supporting linking in the wide (and wilder) world of the Web. Some examples of conceptual hypermedia can also be found on the Web, for example Semantic Wikis (such as the Platypus Wiki [11]), which introduce explicit models of knowledge in support of content management. This kind of approach combines hypertext and metadata, but is again primarily concerned with organising and managing resources under a single point of control (albeit control by a collection of individuals).

Compared to the systems above, COHSE marries open and conceptual hypermedia by using the existing Web infrastructure; that is, COHSE aims to combine the advantages of both approaches by using the Web. To do so, COHSE uses Semantic Web technologies, in particular semantic annotation and knowledge resources such as controlled vocabularies and ontologies. Semantic annotation is one of the core technologies of the Semantic Web: it is the process of semantically enriching Web pages by adding metadata so that they are understandable not only by humans but also by machines [56]. Manual annotation approaches, such as our COHSE annotator [30], allow users to define annotations by hand. Automatic annotation approaches focus on extracting annotations automatically using techniques such as machine learning or natural language processing, as in our GATE crawler15. Some tools are also available to annotate pages as they are designed and created (for example, SMORE [16]), which is important for bringing the annotation process into the early stages of Web page design. OntoMat-Annotizer is a tool that can be used with existing pages, but adopts an approach where annotations are injected into the page. Another project is Annotea [1], a W3C project allowing annotation of Web pages using RDF (Resource Description Framework) [13]; the Annotea project uses the RDF framework to create statements about parts of a page [61]. Although this project is important in showing how RDF can be used to store annotations, it does not use ontologies explicitly in its annotation process (which is crucial for the COHSE approach).

15 A list of annotation tools is available at http://annotation.semanticweb.org/tools



Semantic annotations also play an important role in bioinformatics research. For example, Entrez [4] provides access to the PubMed literature, where articles are marked up with terms from a controlled medical vocabulary called MeSH. The Gene Ontology has been used in an extension to PubMed called GOPubMed [6] to provide searches based on Gene Ontology terms. While GOPubMed uses an ontology to organise documents, it currently performs no linking of documents. Another Semantic Web application similar to COHSE is MAGPIE [51]. It is implemented as a client-side plug-in to a browser, so as a system it has the limitations discussed in Section 5; for example, it requires a local installation. Systems like Google's AutoLink also provide dynamic linking functionality, but tied to a limited number of data types. Vibrant Media's IntelliTXT [7] provides a mechanism by which advertising messages are linked to keywords, but this is a server-side process, whereas our infrastructure can enhance third-party resources. Emerging technologies such as XLink [23] have also been introduced to overcome the limitations of embedded links on the Web. Although XLink is a powerful language for representing links and offers many improvements, it is not widely used or supported by existing browsers.


14 Conclusions

This report has presented a series of experiments performed to enhance the original COHSE implementations and architectures and to overcome a number of their limitations. The major outcomes of each experiment are as follows:

Automated Annotation. We have presented our Extraction Service, which crawls a site from a given URL, annotates pages against a given ontology using GATE, and stores the annotations externally in a repository. Such a batch-processing approach has a number of advantages, for example saving time and effort. However, our experiments demonstrate that storing annotations externally and using them in a different application is not possible with the current state of the Web and its existing referencing mechanisms (such as XPointer).

Deployment. The COHSE DLS can be deployed at different locations in the architecture of the Web: as a client, server or proxy. Each deployment location has its advantages and disadvantages. Although we have experimented with the different deployment options, we cannot conclude that one approach is better than the others; rather, these advantages and disadvantages need to be weighed to make the best decision for a specific application and its requirements.

Open Architecture. Designing a loosely coupled "open" architecture has enabled us to easily integrate heterogeneous third-party knowledge resources with third-party services to dynamically link third-party documents.

Web Services. We have experimented with a number of Web Services to retrieve link targets dynamically. These Web Services have proved easy to use and an efficient way of obtaining "free" link targets. They are also good for obtaining a variety of link targets and creating different hypertext views over Web pages. However, as seen in the withdrawal of Google's Web Service, it is unwise to rely on them completely.

Portlet Implementation. We have implemented the COHSE DLS as a portlet conforming to JSR 168. Although the COHSE DLS portlet does not fit the true spirit of a portal approach (rather than returning a fragment of a document, COHSE returns a complete document), our portal approach has provided the required infrastructure of single sign-on, persistent storage of user profiles, and authentication to create customised dynamic links on Web pages.

Personalisation. The DLS portlet currently supports only customisation, but we have identified techniques in five areas where the portlet could be modified to act as a personalised system: content, navigation, presentation, browsing history and collaborative filtering.

XSL for Dynamic Linking. Our experiments with re-engineering COHSE's architecture to use standard technologies for dynamic linking have shown that, with the current state of the Web, it is not possible to use standard technologies such as XSL to dynamically link Web pages.

AJAX. Our experiments have demonstrated that AJAX can be used to create on-demand dynamic link boxes. This approach has several advantages: it solves the problem of hard-coding link boxes; it gives us the flexibility to design and visualise link boxes differently; and it is a good approach for capturing information about the user's interaction with COHSE links and link boxes. Our experiments have also shown that a rewriting proxy can be used to address the security restrictions imposed by most browsers.

FOHM and COHSE. Although we have extended FOHM to formalise COHSE's navigational model, the FOHM formalisation itself is very loose. Furthermore, as few open hypermedia linkbases are available in the FOHM format, there is little benefit in providing COHSE linkbases in FOHM.

User tasks on the Web. According to our survey of the literature on users' tasks on the Web, COHSE supports browsing where the specificity of the object searched for is not well defined or not defined at all. However, COHSE is not very good at supporting tasks where the object searched for is very well defined (i.e., searching). This survey has helped us understand COHSE's strengths and weaknesses.

Knowledge resources. In order to use knowledge resources in different formalisms (for example, OWL or SKOS), we have introduced an additional level of abstraction into the Knowledge Service. This abstraction allows us to easily use knowledge resources in different formalisms.


15 Future Directions

We have experimented with GATE and with regular expression matching for identifying source anchors in a page (see Table 1). However, we have not compared these two techniques for the quality and quantity of the terms they identify. Regular expression matching provides basic, simple text processing but suffers from limitations. For example, Sun Fire V480 server is a term from the UPT, but if a page includes a slight variation of that term, such as the plural Sun Fire V480 servers, standard character matching will not identify it (see the sketch at the end of this section). Compared to regular expression matching, GATE provides more sophisticated Natural Language Processing (NLP), but without a formal technical evaluation we are not able to say which is better. Other NLP tools could also be incorporated into the COHSE DLS; for instance, TerMine [19] could be used instead of GATE or regular expression matching. As highlighted in Section 2 (see Table 1), new techniques can easily be introduced for either batch processing or dynamic processing; given the difficulty of providing a robust pointing mechanism, the latter appears to be the better approach.

We have also experimented with a number of Web services for obtaining link targets. With our component-based RS, a wrapper component can easily be created for a Web service. We have only experimented with calling one service at a time, directly, but one could explore combinations or "mashups"16 of these services to create better, unique link targets.

From the user's perspective, COHSE enhances Web pages with more navigation opportunities in the information space. However, we have only anecdotal evidence that COHSE enhances the user's overall experience. Further investigations need to be conducted to compare the browsing and navigation support of the original page against the page augmented by COHSE. We have also proposed different link embellishment and visualisation techniques, but we do not know which of them are useful; a user evaluation or survey needs to be performed to establish the efficiency and effectiveness of these techniques. COHSE's adoption within the Sealife project [15] offers an opportunity to observe user behaviour and conduct evaluations.

While there is no question about the increased efficiency of the automated annotation approach, there are still issues concerning the "usefulness" of the generated annotations. Further research needs to be conducted to develop techniques to monitor the quality and quantity of the generated annotations. This could also address the problem of "over-linking" and "under-linking" of resources.

If the user's task is very well defined (i.e., searching), COHSE does not support it very well, but the COHSE approach could be used to better support searching tasks. When users are looking for something specific, they can easily use a search engine to search for their specific query; however, when they visit the pages that the search engine returns, they move to a new context and lose their search context. Although Google's toolbar allows users to highlight searched terms in the visited page, they do not get links from that page to other pages relevant to their search query. Using COHSE's approach, links could be dynamically added to the visited page based on the keywords searched, allowing users to continue to browse within the context of their original search.

Although COHSE supports serendipitous browsing, it still imposes on the document a particular view, which comes from the knowledge resource used. It could, however, be implemented as a back-end "hidden" service: COHSE would not highlight terms in the page, but every term in the page would be turned into a "link-able" object, so that a user can choose a term and request links for it. Such a scenario obviously raises questions such as "how do we resolve a selected term?", "which knowledge resource do we choose?" and "do we provide suggestions to the user to resolve the ambiguity of the term?". Further research could be conducted around this scenario and these questions.

We have extended COHSE's KS to use different kinds of knowledge resources with different levels of formalism. Although we have anecdotal evidence that basing COHSE's knowledge resources on a thesaurus model is likely to provide a better "fit" with the navigation models we wish to support, both technical and user evaluations are needed to compare different types of knowledge resources. Again, Sealife may provide a scenario within which to explore these issues.

Finally, we have identified a number of techniques, such as collaborative filtering, for personalising COHSE. In order to implement these, however, a number of questions need to be answered, for example: "how much information do we need to store?", "how do we store it?", "where do we store it?" and "how do we reason over the stored information?".

16 A mashup combines content or services from more than one source into an integrated experience; for examples, see http://www.programmableweb.com/
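As an illustration of the term-variant problem raised at the start of this section, the sketch below contrasts an exact, word-bounded match for the UPT term with a pattern that allows an optional plural suffix; richer morphological variation is exactly what NLP machinery such as GATE's is for.

```java
import java.util.regex.Pattern;

/** The term-variant problem: an exact match for the UPT term misses the
 *  plural form, while a slightly smarter pattern catches it. */
public class TermMatching {
    public static void main(String[] args) {
        String text = "We installed two Sun Fire V480 servers last week.";

        // Exact, word-bounded match: fails on the plural form, because
        // "server" is immediately followed by "s" (no word boundary).
        Pattern exact = Pattern.compile("\\bSun Fire V480 server\\b");
        System.out.println(exact.matcher(text).find());    // false

        // Optional plural suffix: matches both singular and plural.
        Pattern variant = Pattern.compile("\\bSun Fire V480 servers?\\b");
        System.out.println(variant.matcher(text).find());  // true
    }
}
```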


16 Acknowledgements

We acknowledge Robert Stevens, Simon Jupp and the Sealife team for providing the NELI scenario and the related materials. We also acknowledge Prof. Carole Goble’s valuable support and contributions. We thank Peter Crowther for his work on automated annotation and Hidden Web. We also thank Carlos Aguirre for his assistance with the Sun UPT. Finally, we thank Stephen Green and Paul Lamere for their constructive reviews of this document.


17 Trademarks

Sun, Sun Fire, Sun Microsystems, Java, Java SE, JDK, Java Card, and Java Community Process are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries.

Glossary

AJAX (Asynchronous Javascript And XML) AJAX is a family of technologies based on Javascript and XML that enable a Web page to retrieve small amounts of data from a server without reloading the entire page. It allows Web pages to be more interactive and to behave like local applications, also known as "rich client" applications.

Annotation An annotation represents a mapping between a concept and a resource fragment (such as a paragraph); it may be produced by parties other than the document owners or the users of COHSE.

COHSE Annotation A COHSE annotation associates a URI with a particular document fragment. Document fragments are referred to via XPointer expressions.

COHSE DLS The COHSE DLS is responsible for embellishing HTML documents with new links. It performs this function by calling on services provided by the Resource and Knowledge Services.

Distributed Link Service (DLS) The DLS adds links to documents as they are delivered via a proxy from the originating Web server to the destination client browser.

Document Object Model (DOM) The DOM is a platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure and style of XML-based documents.

Extraction Service The Extraction Service crawls Web sites and uses GATE to extract annotations and store them in a given database.

Fundamental Open Hypermedia Model (FOHM) FOHM provides a systematic vocabulary for expressing structures and operations that can be used to represent different hypertext domains: the spatial domain, the navigational domain and the taxonomic domain.

GATE (General Architecture for Text Engineering) GATE is a development environment that includes a set of tools for performing Natural Language Processing (NLP), such as Information Extraction.

JSR 168 JSR 168 is a Java Community Process (JCP) Java Specification Request (JSR). This specification defines how portlets are created for any Java Web portal.

Knowledge Service A Knowledge Service supports interaction with knowledge resources such as OWL ontologies and SKOS vocabularies.

Mashup A mashup combines content or services from more than one source into an integrated experience; for examples, see http://www.programmableweb.com/.

NELI (National Electronic Library of Infection) NELI is a digital library of materials related to infectious diseases, consisting of a collection of sites (such as Bugs and Drugs, and Training in Infection).

Open Hypermedia System (OHS) In an Open Hypermedia System (OHS), links are first-class objects, stored and managed separately from multimedia data; like documents, they can be stored, transported, cached and searched, and their use can be instrumented.

OWL Annotations OWL provides Annotation Properties; these allow additional information to be associated with classes in ontologies in a way that does not impact the definition or interpretation of the classes.

Portal A portal provides a framework to aggregate content from different sources, and supports storage of user profiles, customisation and personalisation.

Resource Service (RS) A Resource Service manages the relationships between URIs and concepts.

Semantic Annotation Semantic annotation is one of the core technologies of the Semantic Web. It is the process of semantically enriching Web pages by adding metadata so that they are understandable not only by humans but also by machines.

Simple Knowledge Organisation Systems (SKOS) SKOS is an area of work developing specifications and standards to support the use of knowledge organisation systems (KOS), such as thesauri, classification schemes, subject heading systems and taxonomies, within the framework of the Semantic Web.

Tidy Tidy is an open source program and library for checking and generating clean XHTML/HTML and DOMs of documents.

Uniform Resource Identifier (URI) A Uniform Resource Identifier (URI) is a compact sequence of characters that identifies an abstract or physical resource.

Uniform Resource Locator (URL) The term "Uniform Resource Locator" refers to the subset of URIs that, in addition to identifying a resource, provide a means of locating the resource by describing its primary access mechanism (e.g., its network "location").

UPT (Unified Product Taxonomy) The UPT describes Sun's components, systems, software and services, as well as the relationships between them.

Web 2.0 A phrase coined by O'Reilly Media (http://www.oreillynet.com/pub/a/oreilly/tim/news/2005/09/30/what-is-web-20.html), referring to a supposed second generation of Internet-based services.

Web Ontology Language (OWL) OWL is a W3C language for publishing and sharing data using ontologies on the Web.

Web Service A Web service is a software system designed to support interoperable machine-to-machine interaction over a network.

XPointer XPointer is a language for locating data within an Extensible Markup Language (XML) document based on properties such as location within the document, character content, and attribute values.

XSL (Extensible Stylesheet Language) XSL defines the presentation of an XML document and is used to translate XML documents into other formats (such as XHTML).

References

[1] Annotea. http://www.w3.org/2001/Annotea/.
[2] Asynchronous JavaScript and XML (AJAX). http://en.wikipedia.org/wiki/AJAX.
[3] Cascading Style Sheets (CSS). http://www.w3.org/Style/CSS/.
[4] Entrez. http://www.ncbi.nlm.nih.gov/entrez/.
[5] Gene Ontology (GO). http://www.geneontology.org/.
[6] GOPubMed. http://www.gopubmed.org/.
[7] IntelliTXT. http://www.intellitxt.com/.
[8] JSR 168: Portlet Specification. http://www.jcp.org/en/jsr/detail?id=168.
[9] Muffin. http://muffin.doit.org.
[10] National Electronic Library of Infection (NELI). http://www.neli.org.uk/.
[11] Platypus Wiki. http://platypuswiki.sourceforge.net/.
[12] Pluto. http://portals.apache.org/pluto/.
[13] Resource Description Framework (RDF). http://www.w3.org/RDF/.
[14] Saxon. http://www.saxonica.com/.
[15] Sealife Project. http://www.biotec.tu-dresden.de/sealife/.
[16] SMORE. http://owl.mindswap.org.
[17] Sun Portal Server. http://www.sun.com/software/.
[18] Sun Microsystems' swoRDFish Programme. http://www.w3.org/2001/sw/sweo/public/UseCases/Sun/.
[19] TerMine. http://www.nactem.ac.uk/software/termine/.
[20] The Document Object Model (DOM). http://www.w3.org/DOM/.
[21] Tidy. http://tidy.sourceforge.net/.
[22] Web Ontology Language (OWL). http://www.w3.org/2004/OWL/.
[23] XLink. http://www.w3.org/TR/xlink/.
[24] XPointer. http://www.w3.org/TR/xptr-xpointer/.
[25] XSLT. http://www.w3.org/TR/xslt.
[26] robots.txt. http://en.wikipedia.org/wiki/Robots.txt.
[27] K. Anderson. Integrating open hypermedia systems with the World Wide Web. In Proceedings of the Eighth ACM Conference on Hypertext, pages 157–166. ACM, 1997.
[28] S. Apted. General purposive browsing. Library Association Record, 73(12):228–230, 1971.
[29] M. Bates. The design of browsing and berrypicking techniques for the online search interface. Online Review, 13(5):407–424, 1989.
[30] S. Bechhofer and C. Goble. Towards annotation using DAML+OIL. In Proceedings of the K-CAP Workshop on Knowledge Markup and Semantic Annotation, 2001.
[31] S. Bechhofer, R. Stevens, and P. Lord. Ontology driven dynamic linking of biology resources. Journal of Web Semantics (JWS). Accepted for publication.
[32] S. Bechhofer, Y. Yesilada, and B. Horan. COHSE: Knowledge-driven hyperlinks. In Semantic Web Challenge at ISWC2006, 2006.
[33] S. Bechhofer, Y. Yesilada, B. Horan, and C. Goble. Knowledge-driven hyperlinks: Linking in the wild. In Adaptive Hypermedia and Adaptive Web-Based Systems: 4th International Conference, AH 2006, Dublin, Ireland, June 21–23, 2006, volume 4018 of Lecture Notes in Computer Science, pages 1–10, 2006.
[34] B. Bederson. Fisheye menus. In Proceedings of the Thirteenth Annual ACM Symposium on User Interface Software and Technology, pages 217–225. ACM Press, 2000.
[35] N. O. Bouvin. Unifying strategies for web augmentation. In Proceedings of the 10th International Conference on Hypertext, pages 91–100, Germany, 1999. ACM.
[36] P. Brusilovsky. Methods and techniques of adaptive hypermedia. User Modeling and User-Adapted Interaction, 6(2-3):87–129, 1996.
[37] M. D. Byrne, B. E. John, N. S. Wehrle, and D. C. Crow. The tangled web we wove: a taskonomy of WWW use. In CHI '99: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 544–551, 1999.
[38] L. Carr, S. Bechhofer, C. Goble, and W. Hall. Conceptual linking: Ontology-based open hypermedia. In Proceedings of the 10th International World Wide Web Conference, Hong Kong, May 2001.
[39] L. Carr, D. DeRoure, W. Hall, and G. Hill. The distributed link service: A tool for publishers, authors and readers. World Wide Web Journal, 1(1):647–656, 1995.
[40] L. Carr, S. Kampa, W. Hall, S. Bechhofer, and C. Goble. Handbook on Ontologies, chapter COHSE: Conceptual Open Hypermedia Service, pages 193–209. Springer, 2004.
[41] S. Chen, D. Hong, and V. Y. Shen. An experimental study on validation problems with existing HTML webpages. In Proceedings of the International Conference on Internet Computing (ICOMP'05), pages 373–379, 2005.
[42] R. Chimera and B. Shneiderman. An exploratory evaluation of three interfaces for browsing large hierarchical tables of contents. ACM Transactions on Information Systems, 12(4):383–406, 1994.
[43] C. Choo, B. Detlor, and D. Turnbull. Information seeking on the web: An integrated model of browsing and searching. First Monday (online), 5(2), 2000.
[44] J. Conklin. Hypertext: an introduction and survey. Computer, 20(9):17–41, 1987.
[45] J. Cove and B. Walsh. Online text retrieval via browsing. Information Processing & Management, 24(1):31–37, 1998.
[46] D. Cunliffe, C. Taylor, and D. Tudhope. Query-based navigation in semantically indexed hypermedia. In Proceedings of Hypertext'97, pages 87–95, 1997.
[47] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan. GATE: A framework and graphical development environment for robust NLP tools and applications. In Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics, 2002.
[48] N. Dalal, Z. Quible, and K. Wyatt. Cognitive design of home pages: an experimental study of comprehension on the World Wide Web. Information Processing and Management: an International Journal, 36(4):607–621, 2000.
[49] H. Davis. To embed or not to embed. Communications of the ACM, 38(8):108–109, 1995.
[50] H. C. Davis, W. Hall, I. Heath, G. J. Hill, and R. J. Wilkins. Towards an integrated information environment with open hypermedia systems. In Proceedings of the Fourth ACM Conference on Hypertext, pages 181–190. ACM, 1992.
[51] M. Dzbor, E. Motta, and J. Domingue. Opening up Magpie via semantic web services. In Proceedings of the Third International Semantic Web Conference, volume 3298 of Lecture Notes in Computer Science. Springer, 2004.
[52] N. Ferguson, S. Schmoller, and N. Smith. Personalisation in presentation services. Technical report, The Joint Information Systems Committee, 2004.
[53] E. Frokjaer, M. Herzum, and K. Hornbaek. Measuring usability: are effectiveness, efficiency, and satisfaction really correlated? In CHI '00: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 345–352. ACM Press, 2000.
[54] G. Furnas. Effective view navigation. In CHI '97: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 367–374. ACM Press, 1997.
[55] N. Gibbins, S. Harris, D. Michaelides, D. Miller, and M. Veal. Exploring the relationship between FOHM and RDF. In Proceedings of the 1st International Workshop on Hypermedia and the Semantic Web, Nottingham, UK, 2003.
[56] S. Handschuh and S. Staab. Annotation for the Semantic Web, volume 96 of Frontiers in Artificial Intelligence and Applications. IOS Press, 2003.
[57] S. Harper and S. Bechhofer. Semantic triage for increased accessibility. IBM Systems Journal, 44(3), 2005.
[58] T. Heath, M. Dzbor, and E. Motta. Supporting user tasks and context: Challenges for semantic web research. In ESWC2005 Workshop on End-user Aspects of the Semantic Web (UserSWeb), Heraklion, Crete, 2005.
[59] S. Herner. Browsing. Encyclopedia of Library and Information Science, 3:408–415, 1970.
[60] S. Jul and G. Furnas. Navigation in electronic worlds: a CHI 97 workshop. ACM SIGCHI Bulletin, 29(4):44–49, 1997.
[61] J. Kahan and M. Koivunen. Annotea: an open RDF infrastructure for shared web annotations. In Proceedings of the 10th International World Wide Web Conference, pages 623–632, 2001.
[62] M. Kellar, C. Watters, and M. Shepherd. The impact of task on the usage of web browser navigation mechanisms. In Proceedings of the 2006 Conference on Graphics Interface, pages 235–242, Toronto, Ontario, Canada, 2006. Canadian Information Processing Society.
[63] T. Kopetzky and M. Muhlhauser. Visual preview for link traversal on the World Wide Web. In Proceedings of the Eighth International Conference on World Wide Web, pages 1525–1532, New York, NY, USA, 1999. Elsevier North-Holland, Inc.
[64] M. Levine. An essay on browsing. Reference Quarterly, 93:35–36, 1969.
[65] D. Lowe and W. Hall. Hypermedia and the Web: An Engineering Approach. John Wiley and Sons Ltd, 1998.
[66] G. Marchionini. Information Seeking in Electronic Environments. Cambridge Series on Human-Computer Interaction. Cambridge University Press, 1995.
[67] G. Marchionini. Exploratory search: from finding to understanding. Communications of the ACM, 49(4):41–46, 2006.
[68] D. Michaelides, D. Millard, M. Weal, and D. De Roure. Auld Leaky: A contextual open hypermedia link server. In Revised Papers from the International Workshops OHS-7, SC-3, and AH-3 on Hypermedia: Openness, Structural Awareness, and Adaptivity, pages 59–70, London, UK, 2002. Springer-Verlag.
[69] D. Millard, L. Moreau, H. Davis, and S. Reich. FOHM: A fundamental open hypertext model for investigating interoperability between hypertext domains. In Proceedings of the Eleventh International ACM Conference on Hypertext and Hypermedia, USA, 2000.
[70] J. Morrison, P. Pirolli, and S. Card. A taxonomic analysis of what World Wide Web activities significantly impact people's decisions and actions. In CHI '01 Extended Abstracts on Human Factors in Computing Systems, pages 163–164, 2001.
[71] J. Nanard and M. Nanard. Using structured types to incorporate knowledge into hypertext. In Proceedings of Hypertext'91, 1991.
[72] T. Nelson. Embedded markup considered harmful. World Wide Web Journal, 2(4):129–134, 1997. http://www.xml.com/pub/a/w3j/s3.nelson.html.
[73] H. Obendorf and H. Weinreich. Comparing link marker visualization techniques: changes in reading behavior. In Proceedings of the 12th International Conference on World Wide Web, pages 736–745. ACM Press, 2003.
[74] C. Olston and E. Chi. ScentTrails: Integrating browsing and searching on the web. ACM Transactions on Computer-Human Interaction, 10(3):177–197, 2003.
[75] K. Osterbye and U. K. Wiil. The flag taxonomy of open hypermedia systems. In Proceedings of the Seventh ACM Conference on Hypertext, pages 129–139, Washington, 1996.
[76] A. J. Sellen, R. Murphy, and K. L. Shaw. How knowledge workers use the web. In CHI '02: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 227–234, 2002.
[77] B. Shneiderman and P. Maes. Direct manipulation vs. interface agents. interactions, 4(6):42–61, 1997.
[78] M. Thuering, J. Hannemann, and J. Haake. Hypermedia and cognition: designing for comprehension. Communications of the ACM, 38(8):57–66, 1995.
[79] T. Tsandilas and m. c. schraefel. User-controlled link adaptation. In Proceedings of the Fourteenth ACM Conference on Hypertext and Hypermedia, pages 152–160. ACM Press, 2003.
[80] J. Wang, A. de Vries, and M. Reinders. Unifying user-based and item-based collaborative filtering approaches by similarity fusion. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR06), 2006.
[81] R. White, B. Kules, S. Drucker, and m. c. schraefel. Introduction - exploratory search. Communications of the ACM, 49(4):36–39, 2006.
[82] Y. Yesilada, S. Bechhofer, and B. Horan. Personalised dynamic links on the web. In First International Workshop on Semantic Media Adaptation and Personalization (SMAP'06), pages 7–12, 2006.
[83] Y. Yesilada, S. Harper, C. Goble, and R. Stevens. Screen readers cannot see (ontology based semantic annotation for visually impaired web travellers). In Proceedings of the International Conference on Web Engineering (ICWE), pages 445–458. Springer, 2004.
[84] P. Zellweger, S. Regli, J. Mackinlay, and B. Chang. The impact of fluid documents on reading and browsing: an observational study. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 249–256. ACM, 2000.

Figure 4: COHSE in action: (a) a screenshot of the original page; (b) COHSE identifies and highlights terms with a small icon.

Figure 5: COHSE in action: (c) when the user clicks on a highlighted term, a link box is displayed; (d) when the user selects a link from the link box, the target page is also retrieved, decorated with COHSE links.

About the Authors

Yeliz Yesilada is a Postdoctoral Research Associate in the School of Computer Science at the University of Manchester. She has been involved in research projects investigating the accessibility and usability of user interfaces and hypermedia, particularly the Web. She has worked on the COHSE project funded by Sun Microsystems and is currently working on the EPSRC-funded project "Reciprocal Interoperability between the Accessible and Mobile Webs".

Sean Bechhofer was the researcher primarily responsible for the initial development of COHSE. He has been involved in a number of research projects developing tools and infrastructure to support Semantic Web applications, and was a participant in the W3C's WebOnt Working Group, which standardised OWL. He is now a lecturer in the Information Management Group in the School of Computer Science, University of Manchester.

Bernard Horan is a senior staff engineer at Sun Microsystems Laboratories, where he is currently a member of the Advanced Search Technologies team. Bernard was a participant in the W3C's WebOnt Working Group and now represents Sun in the W3C Semantic Web Deployment Group.


Appendix A

Annotation using GATE

Short technical report1, March 2004. This appendix contains a discussion of the use of GATE to produce annotations.

A.1 Introduction

COHSE uses two mechanisms to determine link sources in documents:

• Generic Linking, where string matching is used to find the occurrence of terms or phrases in documents. The terms or phrases we look for are those that are associated with concepts in an ontology.

• Annotation Linking, where explicit annotations on a document provide the locations in the document where concepts are declared to occur. The location identified by the annotation provides the source anchor for the link.

Generic Linking can effectively be performed at "runtime" by the DLS searching for the terms or phrases (the Distributed Link Service, or DLS, is the component in the COHSE system that provides the core COHSE functionality in terms of adding links). This is the current solution employed in the COHSE proxy implementation: the proxy uses a regular expression mechanism to match terms in documents it parses. Annotation Linking requires us to have a source of annotations. In our earlier experiments with the Sun Java Tutorial, annotations were provided by hand. Although this approach supplies relevant, high quality annotations, it is unlikely to scale. It is clear that automated (or, more likely, semi-automated) approaches to annotation will be required. Here we describe an experiment to provide automated annotation using GATE.

A.1.1 Terminology

This document makes reference to the notion of annotation in a number of different contexts. In order to reduce confusion, we set out here the different meanings:

GATE annotation GATE associates "features" with tokens. A GATE annotation could, for example, label a particular token or set of tokens as a noun phrase.

COHSE Annotation A COHSE annotation associates an OWL class with a particular document fragment.

OWL Annotation OWL provides Annotation Properties. An Annotation Property allows additional information to be associated with a class in an ontology in such a way that it does not have an impact on the definition or interpretation of the class.
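To illustrate the Generic Linking mechanism mentioned above, the following is a minimal sketch, not the actual COHSE implementation, of matching ontology terms in a document with regular expressions. The terms and concept names shown are invented for the example:

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal sketch of generic linking: find occurrences of ontology terms in
// a document using regular expressions. The term list is hypothetical.
public class GenericLinker {
    public static void main(String[] args) {
        // Terms associated with concepts in the ontology (invented examples)
        Map<String, String> termsToConcepts = Map.of(
                "file server", "FileServer",
                "power supply", "PSU");

        String document = "The ACME XYZ is a file server with a spare power supply.";

        for (Map.Entry<String, String> entry : termsToConcepts.entrySet()) {
            // Pattern.quote guards against special characters in terms
            // (the report notes that characters such as "[" caused problems)
            Pattern p = Pattern.compile("\\b" + Pattern.quote(entry.getKey()) + "\\b",
                    Pattern.CASE_INSENSITIVE);
            Matcher m = p.matcher(document);
            while (m.find()) {
                // Each match is a potential link source for the concept
                System.out.printf("Found '%s' (concept %s) at %d-%d%n",
                        m.group(), entry.getValue(), m.start(), m.end());
            }
        }
    }
}
```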

A.2 GATE

GATE (General Architecture for Text Engineering) has been developed at the University of Sheffield as an open source platform for text and document processing tasks. In particular, it contains a comparatively complete set of information extraction components known as ANNIE (A Nearly New Information Extraction system). These components (called Processing Resources or PRs in GATE) are designed to produce GATE annotations from a corpus of source text.

An annotation in GATE is a set of features associated with one or more spans within the source text. Processing Resources are combined into pipelines, with each PR able to read the document and make use of the GATE annotations produced by previous PRs in the pipeline. GATE reads source documents into a standard internal form, possibly processing them as it does (for example, HTML tags are typically stripped and replaced with corresponding GATE annotations). This allows the PRs to work with the widest possible range of source material. GATE’s open nature means that third parties can provide additional resources, including PRs that can then be appended to a pipeline.
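For orientation, the following is a rough sketch of how a pipeline of PRs of this kind is typically assembled programmatically with the GATE API. This is illustrative only, not the COHSE code, and it assumes the ANNIE components are registered and available on the classpath:

```java
import java.net.URL;
import gate.Corpus;
import gate.Factory;
import gate.Gate;
import gate.ProcessingResource;
import gate.creole.SerialAnalyserController;

// Rough sketch of programmatic pipeline assembly in GATE; illustrative only.
public class PipelineSketch {
    public static void main(String[] args) throws Exception {
        Gate.init(); // initialise the GATE infrastructure

        // A controller runs Processing Resources (PRs) in sequence over a corpus
        SerialAnalyserController pipeline = (SerialAnalyserController)
                Factory.createResource("gate.creole.SerialAnalyserController");

        // Each PR can use the GATE annotations produced by the PRs before it
        pipeline.add((ProcessingResource)
                Factory.createResource("gate.creole.tokeniser.DefaultTokeniser"));
        pipeline.add((ProcessingResource)
                Factory.createResource("gate.creole.splitter.SentenceSplitter"));
        pipeline.add((ProcessingResource)
                Factory.createResource("gate.creole.POSTagger"));

        // Documents are read into GATE's standard internal form
        Corpus corpus = Factory.newCorpus("corpus");
        corpus.add(Factory.newDocument(new URL("http://example.org/page.html")));

        pipeline.setCorpus(corpus);
        pipeline.execute(); // run all PRs over each document in the corpus
    }
}
```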

A.3 The Annotation Pipeline

The COHSE Annotation Pipeline builds on the basic ANNIE pipeline supplied by GATE, but with some additional components. Changes have also been made to some of the existing GATE components. The current prototype uses GATE version 2.1.

A.3.1 Changes to GATE

The following changes were made to the basic GATE infrastructure. Note that these changes have not, as yet, been contributed back into the GATE community.

• A refactoring of GATE's ontology interfaces. These were originally designed by Ontotext for RDF and were in some cases unsuitable for richer OWL ontologies, for example where concept names aren't simply extracted from the fragment ID. GATE's ontology interfaces have changed in 3.0. A port of the existing implementation to GATE 3.0 would thus require some reworking, but other improvements may make it worthwhile.

• Amendments to the HTML document reader to add XPointerFragment features to the GATE annotations as it generates them.

In addition, some changes were needed to the build process in order to excise javax.servlet.* from gate.jar before dropping it into a webapp. It is unclear exactly why these classes are there at all. GATE 3.x takes the more conventional (and much saner) approach of delivering gate.jar with just the GATE components and assuming the script running it can put together a classpath.

A.3.2 Pipeline

The pipeline contains the following components:

Load Load the ontology to be used. The ontology is preparsed and converted into a GATE ontology. In theory, any ontology could be used; GATE already provides components that wrap an RDF ontology, and OWL and COHSE ontology wrappers have been added. The COHSE wrapper takes the URL of an Ontology Service that has been suitably modified to allow batch requests for names. This can then supply information from the COHSE Ontology Service in a format expected by GATE. Similarly, the OWL wrapper wraps an OWL ontology.

Tokenize The first pipeline component is a tokeniser. This splits the characters from the input document(s) into words. The reason for this component is that different languages have different conventions for word breaking. At present, the English tokeniser from GATE's normal ANNIE pipeline is used.

Gazetteer The next components are the ANNIE and OWL gazetteers. A gazetteer provides access to a collection of known or commonly occurring tokens. When used in the pipeline, the gazetteer is checked to see whether tokens occurring in the document are known to the gazetteer. If so, appropriate GATE annotations are added. The ANNIE gazetteer is part of the original GATE pipeline, and is included because it knows about organizations, honorifics (Mr, Mrs) and similar, and these can help the sentence splitter and part-of-speech tagger. The OWL gazetteer provides a wrapper around an OWL ontology, and looks up tokens against concept names in the loaded ontology. The tokens are provided through OWL annotations on the concepts in the ontology, using OWL annotation properties such as rdf:label and a cohse:alternative property. The OWL gazetteer uses the WonderWeb OWL API (http://owl.man.ac.uk/api.shtml) to parse and represent the ontologies. There is no reason why there could not be multiple ontologies loaded, with one OWL gazetteer for each ontology at this step; the current code only loads one.

Sentence split The sentence splitter tries to find breaks that correspond to sentences, which is essential for the part-of-speech tagger. Again, the ANNIE sentence splitter is used. This is needed because English uses the same character (.) to denote an abbreviation as it does to denote the end of a sentence.

POS tagging The part-of-speech tagger is, again, the standard ANNIE tagger. It has been trained on a general American document corpus known as the Brown corpus, so it is a reasonable general-purpose tagger but misses some cases. Its job is to tag each token with a part of speech. At the end of this phase, all tokens are tagged with the most likely candidate part of speech.

NE recognition The Named Entity transducer (another ANNIE component) tries to identify multi-token sequences that, together, are likely to form proper names. For example, if it sees "Mr. Brown", an honorific followed by another word that could be a name, it tags the whole phrase as an entity.

Ortho Matcher The Ortho matcher, also from ANNIE, tries a few extra tricks to combine tokens in a broadly similar way, such as looking for sequences of capitalised words where one of them has a known type and extending the type across the other words. For example, if it sees Some Capitalised Words Limited and the gazetteer types Limited as being an Organization, then the orthographic matching will extend the type to all of the tokens in Some Capitalised Words Limited.

XPointer The XPointer Maker, a new component supplied as part of COHSE, runs after all of the preceding components so that it has the best opportunity of identifying merged GATE annotations. It examines starts and ends of GATE annotations, looking for the XPointer fragment features that were added to the HTML tags by the HTML document reader as the document was parsed. The positions of these are used, along with the positions of the starts and ends of GATE token annotations, to create a full XPointer feature for each token that has been identified from the ontology (and thus is a candidate for becoming an annotation). The positions of tokens are provided as offsets. These are character counts that provide the position of the token in the original source document. Unfortunately, the offsets may differ from the correct ones, because of issues with the HTML document reader collapsing spaces (see Section A.5.2).

AS loader The Annotation Service Loader handles the production of COHSE annotations. The Annotation Service Loader is given the URL of a COHSE Annotation Service: a service that stores COHSE Annotations. The AS loader then looks for tokens that have associated Ontology, OWL Class and XPointer features. For each such token, a COHSE Annotation creation request is passed to the Annotation Service, with the URL of the current document, the XPointer indicating the position in the document, the Ontology and the Class. The COHSE annotations are then available to a COHSE DLS.

A.3.3 Creating and Running the Pipeline

The pipeline can be run either via the GATE UI, or as a standalone service provided through a servlet–the Extraction Service. More details of the Extraction Service are given below in Section A.4, but the following description of pipeline components applies in both cases. Using the pipeline via the GATE UI requires the user to set up and initialise a number of the components and then “feed” documents into the pipeline. The Extraction Service uses a web spider to gather documents for COHSE annotation. Figure A.1 shows the basic pipeline. The cloud in the diagram is intended to represent resources on the web (which may come from third party sources).

A.4 Extraction Service

The Extraction Service (ES) provides a web spider along with a "canned" version of the pipeline. Rather than requiring the user to construct the pipeline and feed it documents, the spider will crawl a site from a given URL and pass any documents found to the pipeline for indexing/annotation. The Extraction Service is deployed as a servlet hosted within a servlet engine (e.g., Tomcat). The ES is given the following arguments at startup (via a web form):

• The URL of an OWL Ontology: the ontology that will provide the terms to be used in the COHSE annotations.

• The URL of a running COHSE Annotation Service: the service that will be used to store any COHSE annotations created (as described in Section A.3.2).

• A URL from which to start the spider.

Once started, the ES will crawl from the starting URL, adding COHSE annotations to the AS as appropriate. The basic architecture is shown in Figure A.2. The cloud again shows resources on the web, with the shaded pages contained in the dotted box representing pages on a particular site which is being crawled. The result of running the Extraction Service over a site will be a populated Annotation Service. This can then be used with the COHSE DLS to provide both link sources and targets. A sketch of such a servlet entry point follows this list.
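For illustration, here is a minimal skeleton of how a service of this shape might accept its startup arguments as a servlet. The parameter names are invented for the example and are not those of the actual Extraction Service:

```java
import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Illustrative servlet skeleton for an Extraction-Service-like component;
// the parameter names are invented, not those of the real service.
public class ExtractionServlet extends HttpServlet {
    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        String ontologyUrl = req.getParameter("ontology");  // OWL ontology supplying terms
        String annotationService = req.getParameter("as");  // COHSE Annotation Service URL
        String startUrl = req.getParameter("start");        // where the spider begins

        // In the real service, a web spider would crawl from startUrl and
        // feed each fetched document through the annotation pipeline here.
        resp.setContentType("text/plain");
        resp.getWriter().printf("Crawling %s with ontology %s; storing to %s%n",
                startUrl, ontologyUrl, annotationService);
    }
}
```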


Figure A.1: Pipeline Architecture

Figure A.2: Pipeline Architecture

A.5 Issues

The current configuration and installation process is nontrivial. For example, we have experienced issues regarding the use of XML parsers with GATE, and servlet engines such as Tomcat need to be carefully configured to work with the Extraction Service servlet. It appears that version 3.0 of GATE is better in this respect and can run in servlets alongside other code. It is possible that a port to GATE 3.0 would be more productive than trying to fix the GATE 2.1 installation.

A.5.1 HTML Parsing

HTML parsing has been causing problems with the ES. The parsing of the HTML leads to the construction of a Document Object which is then used in the production of the XPointer expressions. Any differences in the way in which different clients parse HTML may result in different XPointer expressions, and potentially misaligned COHSE annotations. The COHSE proxy uses Tidy in order to ensure that HTML pages are well formed before processing. However, GATE uses the JDK's HTML parser. The two parsers produce different structures in the Document Object when end tags are omitted or tables are malformed. This then results in inconsistent XPointers. A possible solution would be to replace GATE's HTML parser. An alternative is to ensure that pages are passed through Tidy before being given to the Extraction Service or pipeline, ensuring that the Extraction Service is "seeing" exactly what the proxy "sees". This also provides a potential solution to collapsing white space: the tidying process could also remove unnecessary white space (although care would need to be taken with PRE tags, as there the white space and formatting is deemed to be important).

A.5.2 XPointers

We have experienced problems with XPointer generation due to the fact that the document reader collapses multiple white space characters. This results in the offsets produced in the pipeline being an inaccurate mechanism for referring to the original source document. For example, say the original document contains the following, where we show a space as "_":

    The_ACME_XYZ_is_a___file_server_available

If "file server" is a term in our ontology, then the document offsets produced are with respect to the following string, where the multiple spaces have been collapsed:

    The_ACME_XYZ_is_a_file_server_available

The offsets produced for the term are 19–29, whereas its position in the original document is actually given by 21–31. The resulting COHSE annotations are then rendered inaccurately by the DLS. It is likely that, if necessary, this problem can be dealt with via suitable modifications to the reader.
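One way such a modification could compensate, sketched below under the simplifying assumption that only runs of spaces are collapsed, is to build a mapping from collapsed offsets back to original offsets:

```java
// Sketch of mapping offsets in whitespace-collapsed text back to the
// original document, assuming only runs of spaces are collapsed.
public class OffsetMapper {
    // mapping[i] = offset in the original text of character i of the collapsed text
    static int[] buildMapping(String original) {
        int[] mapping = new int[original.length()];
        int collapsed = 0;
        boolean lastWasSpace = false;
        for (int i = 0; i < original.length(); i++) {
            boolean isSpace = (original.charAt(i) == ' ');
            if (!(isSpace && lastWasSpace)) {
                mapping[collapsed++] = i;
            }
            lastWasSpace = isSpace;
        }
        return java.util.Arrays.copyOf(mapping, collapsed);
    }

    public static void main(String[] args) {
        String original = "The ACME XYZ is a   file server available";
        int[] map = buildMapping(original);
        // "file server" occupies collapsed offsets 18..28 (zero-based);
        // these map back to 20..30 in the original text
        System.out.println("start: " + map[18] + ", end: " + map[28]);
    }
}
```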

A.6 Discussion

The Annotation Pipeline (and associated Extraction Service) provides us with a mechanism for building up a collection of COHSE annotations on documents.

A.6.1 Link Sources

As it currently stands, the COHSE pipeline identifies the occurrence of words and terms that correspond to concepts in an ontology. Using the GATE machinery provides us with a more flexible and potentially extensible solution than the regular expression matching employed in the original implementation. The regular expression matching employed in the current proxy is done purely at the level of characters: there is no attempt to make use of any structure that may exist in the text. The implementation of the regular expression mechanism is also rather brittle: a number of hacks, fixes and changes have had to be made to ensure that special characters appearing in the concept terms (such as "[" or "]") do not cause problems in the regular expression. There are also questions of performance: the regular expression mechanism requires increasing amounts of stack space and memory as the number of terms in an ontology becomes large. The use of the GATE infrastructure means that we can delegate much of the basic text processing to GATE.

Reasonables

A possibility that we have begun to explore is to use the co-occurrence of concepts in sentences to try to produce composite concept descriptions. This draws on original work in the GRAIL language (http://www.opengalen.org/) known as sanctioning. A sanction was an assertion in the knowledge model that it was "sensible" to try to form a composite of two concepts. For example, it is sensible to talk about people driving cars. This is not to say that everyone has to drive a car, or that all cars must have drivers, simply that in general it is a sensible notion. GRAIL then disallowed the user from forming composites unless they were sanctioned (the details are slightly more complicated than this, but this is a suitable generalisation). Sanctions do not form part of the definition of a concept, but instead tell us how the concept might be used. GRAIL was a rather idiosyncratic language, and has now been superseded by languages based on well-defined Description Logics (such as OWL). A sanctioning-like mechanism layered on top of a Description Logic has been proposed, known as reasonables. The original intention for reasonables was to guide the construction of user interfaces; however, we can see reasonables as a general mechanism for constraining concept construction.

Our assumption is that taxonomies will become ontologies and ontologies will become increasingly granular, with more and more compound terms such as 300W PSU being represented as PSU and hasPowerOutput 300W. This gives advantages (it is possible to recognise smaller terms independently and combine them, giving a wider range of input texts that can be recognised as mapping to the concept) and disadvantages (the recognition is hard). If we also assume that user queries, whether explicit via a search engine or implicit via a link enrichment system such as COHSE, should give high precision (low false positive document matches) and high recall (low false negative matches), then the terms in any user query should also be mapped to the ontology where possible to produce similar composite concepts, which can then be placed within the ontology. The search can then retrieve more general or more specific terms as required, even where the text of the terms does not match the input term.

Given these assumptions, it is necessary to try to extract composite concepts from free text, both query text and document text. The approach taken here is to use existing components for linguistic analysis (the ANNIE pipeline), and to combine these with algorithms for constructing "reasonable" composite concepts from the fragments identified by the linguistic analysis. The intention is that if the text mentions ". . . an overheating PSU in the SunBlade . . ." then ideally we would create (at least a subset of) the following COHSE annotations:

• SunBlade
• PSU
• PSU and (exists hasFault Overheating)
• PSU and (exists hasFault Overheating) and (exists hasLocation SunBlade)

GATE components already exist that are able to tag each word in a document with a (likely) part of speech (POS), and to "chunk" those POSs into noun phrases. These noun phrases are the most interesting parts of the document for COHSE, as they are most likely to contain composite concepts. Combined with the OWL gazetteers, we can identify areas of the document (noun phrases) each containing one or more identified concepts. The reasonable concept combiner is then used to produce potential combinations. Each generated concept is then checked against the ontology using an OWL reasoner to verify that it is consistent with the ontology; inconsistent concepts are discarded.

Note that there is little to prevent this system from forming "nonsense" composite concepts, such as a PSU with an input power of 300W. However, a few features combine to ease this problem:

• All reasonable concepts are formed and stored, so if the system constructs "nonsense" concepts, at least it constructs the sensible concepts at the same time.

• Any restrictions in the ontology have to be obeyed. As ontologies are developed further for other purposes, their restrictions will tend to become more clearly defined, leading to fewer "nonsense" concepts. So the system should be able to take advantage of other developments in the ontologies.

• The "reasonableness" entries can be tailored for the content where necessary; under some circumstances, it may be necessary to tighten up some entries where a particularly troublesome piece of nonsense is generated.

A prototype implementation adds two resources to the pipeline (this is still very much a prototype and has not been integrated into the Extraction Service). The resources are placed after the ortho matcher and before the XPointer maker, to try to produce composite concepts from the available information.

Noun Phrase Chunker Mark Greenwood's noun phrase chunker was produced by a student at Sheffield and is maintained independently of GATE; it has been ported to GATE 3.0. It uses the part-of-speech information on the tokens to try to identify noun phrases, such as "the noisy fan", and produces additional GATE annotations giving the extent of these noun phrases. This can be customised via a configuration file, so new combinations of parts of speech can be added if necessary.

Reasonables The reasonable concept combiner uses the token and noun phrase GATE annotations to find sets of tokens that are part of the same noun phrase, all of which are tagged with an ontology and a class. The classes of all of the tokens in the noun phrase are placed in a set and passed to the Reasonable layer, which tries to synthesise a composite class from the inputs by repeatedly looking for configured rules that combine classes in the input set. Rules are of the form (X AND Y), for classes X and Y, or (X AND EXISTS R Y), where X and Y are classes and R is a relation (the expression (X AND EXISTS R Y) describes all those instances of X that are related to some instance of Y via the relation R). Rules match when subclasses of X and Y appear in the input set. The resulting composite description is added to the input set, and the process is repeated until all pairs have been examined and no new classes have been synthesised since the last addition. At that point, the set is traversed one last time and classes are removed if a subclass is present in the set. This leaves the set of most specific classes; these are attached as GATE annotations to the noun phrase as a whole, which is then available to be added to the AS by the Annotation Service loader.
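For concreteness, here is a minimal sketch of the combination loop just described, using hypothetical ClassExpr, Rule and Ontology types in place of the OWL API interfaces used by the prototype:

```java
import java.util.HashSet;
import java.util.Set;

// Minimal sketch of the reasonable concept combiner loop described above.
// ClassExpr, Rule and Ontology are hypothetical stand-ins for the real types.
class ReasonableCombiner {
    interface ClassExpr {}
    interface Ontology {
        boolean isConsistent(ClassExpr c);
        boolean isSubClassOf(ClassExpr sub, ClassExpr sup);
    }
    interface Rule {
        boolean matches(ClassExpr x, ClassExpr y, Ontology o); // subclasses of the rule's X and Y?
        ClassExpr compose(ClassExpr x, ClassExpr y);           // e.g. (X AND EXISTS R Y)
    }

    // Repeatedly apply rules until no new composite can be synthesised,
    // then keep only the most specific classes.
    static Set<ClassExpr> combine(Set<ClassExpr> input, Set<Rule> rules, Ontology ontology) {
        Set<ClassExpr> result = new HashSet<>(input);
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Rule rule : rules)
                for (ClassExpr x : new HashSet<>(result))
                    for (ClassExpr y : new HashSet<>(result))
                        if (rule.matches(x, y, ontology)) {
                            ClassExpr composite = rule.compose(x, y);
                            // Discard composites inconsistent with the ontology
                            if (ontology.isConsistent(composite) && result.add(composite))
                                changed = true;
                        }
        }
        // Final pass: drop any class with a more specific subclass in the set
        Set<ClassExpr> toRemove = new HashSet<>();
        for (ClassExpr c : result)
            for (ClassExpr d : result)
                if (!d.equals(c) && ontology.isSubClassOf(d, c)) { toRemove.add(c); break; }
        result.removeAll(toRemove);
        return result;
    }
}
```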

The reasonable mechanism has been bench-tested and is capable of producing composite concepts. We have so far not thoroughly investigated the possibilities of this approach. A key issue here is how to store and transmit these composite concepts to the Annotation Service. The current Annotation Service is only capable of storing COHSE annotations using primitive concepts, and provides only simple queries. Extensions to the interface are needed to support arbitrary expressions. It is likely that an approach making use of the Instance Store (http://instancestore.man.ac.uk/) would be appropriate here, as well as a refactoring of the Annotation Store interface.

As another example, an extension to the ontology and gazetteer would pre-tag words with a set of possible parts of speech; this would require a little extra information, either in the ontology or in a separate file, specifying that subclasses of class X were likely to be part-of-speech Y. We could then use this to improve our identification of ontology terms. Similarly, knowing that a number of concepts are referred to in a particular sentence may provide us with options for pruning the COHSE annotations. One problem that occurs with the use of the UPT ontology is the occurrence of many links concerning terms like Software or Hardware. Being able to cull these in a principled way during the annotation step would help prevent overlinking.

A.6.2 Link Targets

The Annotation Pipeline approach gives us a collection of COHSE annotations which can then be used as potential link targets as well as source anchors. Again, though, this is functionality that could be provided via other means: for example, the use of Google constrained to a particular site, or a site-specific search engine. The Extraction Service provides a large number of COHSE annotations and thus a large number of potential targets. This may itself cause problems. For example, in the Sun product pages, many pages mention Software or Hardware, but these pages do not always provide sensible targets for links, particularly when we are using the concept taxonomy to bring in broader resources. Similarly, index pages or pages that list products may show up as potential targets for many concepts. These again are unlikely to be useful as link targets. Improvements to the Annotation Service to avoid overlinking (as discussed above) should bring an additional benefit in reducing the number of "less interesting" link targets returned when using the Annotation Service for link targets. The structure in the ontology should help us here. For example, OWL annotations placed on ontology concepts can help identify those concepts that are high-level or abstract (e.g., Hardware), and which are unlikely to provide much information as targets when used directly in COHSE annotations.

Appendix B

Adaptation and Customisation
Technical report, May 2006

B.1 Introduction

In the existing COHSE approach, the same page content and the same set of links with the same style are presented to all users. Although we provide some customisation options, we do not take into account the user's goal or task at hand, the user's background or interests, etc. But users come from different backgrounds and have different knowledge and different interests, and this "one size fits all" approach might not be appropriate [19]. In this first part of the COHSE-IV project, we investigate how the system can be adapted to better match a user model. The purpose of such tailoring is (theoretically) to better meet the needs of the user: to provide information and functionality that better suit a specific type of user. It can be considered a way of assisting Web users in their information-finding and navigation activities.

[22] indicates that the development of an adaptation application involves several sub-components, which are necessary for supporting the complex representation and inference underlying adaptive behaviour. They suggest these sub-components include the following two main distinct high-level processes:

User modelling The main goal is to reach high-level conclusions about the users that are considered significant for a particular application (adaptive hypermedia/hypertext applications usually take into account the user's goal, knowledge, background, experience, and preferences [19]);

Adaptation decision making Specific adaptation techniques are selected, based on the results of the user modelling phase, to improve selected aspects of interaction.

Figure B.1 illustrates the adaptation process suggested by [22], modified for a possible adaptation process in COHSE. The four main components of this process can be explained as follows:

1. A user model can be created in two ways: implicitly or explicitly [11] (a system is adaptable if the user model is created explicitly, and adaptive if the model is created implicitly [15]). The first component in Figure B.1 shows the explicit approach, where the user tells the system directly about his/her preferences;

2. The second component shows the implicit approach, where the system performs some low-level monitoring of user behaviour and then makes inferences about the user from her/his interaction;

3. Some conclusions are drawn from the created user model;

4. In this last phase, some adaptation techniques are chosen, aiming to "improve" the selected aspects of the model.

Based on these components, in the following section we first discuss how to create a user model, which aims to answer the question "adapt to what?". We then discuss the adaptation decision-making phase, which aims to answer the question "what can be adapted?".

Figure B.1: Adaptation Components

B.2 Adapt to what?

[20] suggests that the following three kinds of information are widely used to adapt systems accordingly:

1. user model: various characteristics of the users (e.g., goal/task, knowledge, background, experience, preferences, interests, etc.);

2. usage data: data about user interaction that cannot be resolved to user characteristics;

3. environment data: all aspects of the user environment that are not related to the users (e.g., platform (hardware, software, network bandwidth), user location, direction).

For example, a user may interact differently with a shopping site than with an education site. Among these, user modelling is the most widely used. An explicit user model that represents user knowledge, goals, interests, and other features or "aspects" that enable the system to distinguish between users is a distinctive feature of an adaptive system [23, 21]. Depending on the system, different granularity can be used, but the model can broadly include the following information:

goals or tasks This aims to answer the question "why is the user using the system and what does the user actually want to achieve?". The main problem with capturing such information is that it changes quite often; it changes from session to session and can often change several times within a single work session. Rather than the system trying to infer the user's task, it can be made controllable: the user is asked to make a choice between a number of tasks. For example, [61] presents an adaptive system where users are allowed to choose between a small set of different explanation or task stereotypes such as "project planning", "reverse engineering", etc. However, they also state some drawbacks of this approach: (i) the set of task stereotypes must be rather small in order for the user to select among them and learn to understand their effects, and a small set might mean that the stereotypes do not represent the user's task at hand; (ii) users very often move between tasks in a session; for example, they might try to look for something, but then they come across a term and start to browse. On the other hand, having these preferences means that the user can easily switch between task stereotypes.

knowledge The user's knowledge of the subject represented in the area. In some systems, user stereotypes are used, such as 'novice, beginner, intermediate and expert'. Although this is a simple approach, some adaptation techniques require a more fine-grained model.

experience How familiar the user is with the structure of the system and how easily the user can navigate in it. For instance, will an experienced COHSE user behave differently than a novice user?

background The user's background might affect how they browse (e.g., biologist, computer scientist).

preferences The user has to inform the system of his choices. The existing portal approach already supports this; for example, the user can customise the Ontology Service and the Resource Manager.

interests The user's interests, which can be obtained either implicitly or explicitly. [13] presents a system that tries to guess the user's interests implicitly. For example, their system suggests a list of car adverts to mobile web users; when the user selects an advert from the list, they consider this a moderate indication that the user is interested in that item, and if the user bookmarks or stores the link, this is considered a strong indication of interest. [36] demonstrates that time spent on a page, the amount of scrolling on a page, and the combination of time and scrolling have a strong correlation with explicit interest.

B.3 What can be adapted?

[23] indicates that there are three different kinds of adaptation techniques:

1. adaptive content selection (giving priority to some content over other content);

2. adaptive navigation support (manipulating links: hide, sort, annotate);

3. adaptive presentation (presenting the content in different ways).

In the following sections, we will discuss some adaptation techniques that can be achieved in COHSE with respect to the user model discussed in the previous section.

B.3.1 Content adaptation

The following list summarises a number of possible content adaptation techniques:

• knowledge - If we introduce different levels of knowledge as a customisation option (say beginner, intermediate and expert), then we can provide different levels of term definitions. For example, if the user is a beginner then we can provide a very short description of the term; if the user is an expert then we either do not provide a description or we provide more knowledge about that term from the ontology (e.g., asserted conditions, disjoint, same-as), since experienced users can benefit from additional details;

• knowledge - If the user is a beginner, then perhaps we do not display the hierarchy of the resources (broader, specific) but instead display the suggested resources as a simple list. As with the previous technique, we can introduce another customisation option for domain knowledge (e.g., beginner, intermediate, expert, etc.);

• task - If the user is learning, then the description of the term can be important. This requires knowledge about the current task of the user. It can be challenging to automatically detect the user's task. However, again as a customisation option, we can provide a number of stereotype tasks, for instance learning, surfing, searching, etc. Then, depending on the selected stereotype task, we can provide different content in a linkbox. For example, if the task is "learning", we could include the description of the term, but if the task is "searching" then we could just provide an "I am feeling lucky" kind of link;

• If we know who creates a particular annotation and the creator's role, then if another user with a similar role requests that annotation we can assign higher priority to that link. For this we need to modify our Resource Manager to store more information about the creator of annotations (for example, their role, interests, etc.). However, the main problem here is that if annotations are created automatically then we might not be able to capture such information;

• When a list of links is suggested for a resource, we can order these links based on the date on which the relevant annotations were created;

• If multiple occurrences of a concept are identified in a page (for example, with the UPT ontology "System" is one of the common terms identified on Sun Microsystems' pages), we can use different heuristics to better present these links. For example, we can show only the first one, etc.;

• We can use the structure of the document; for example, concepts will be added once per HTML para or div element, or once per section, etc.

B.3.2 Navigation

[21] suggests four different kinds of navigation adaptation (mainly concerning links):

• direct guidance - suggests the "next best" node for the user to visit according to the user model (interest, preferences, goals, knowledge, etc.). The main problem with this technique is that if it is used on its own then there is no support for users who do not wish to follow the system's suggestion. In educational hypermedia, this has proved to be useful for guiding students, as they usually have problems making their own choice and rely heavily on the suggestion of the system [21]. The main question that we would need to ask here is "how do we decide the next best node?". Different strategies can be employed; for example, collaborative filtering techniques can be used: "user A in group B for concept C has visited link D, so for another user in the same group we can suggest link D as the next best link". This requires storing information about the user's group and browsing history (which pages and concepts have been visited? which links were followed in a link box? etc.);

• adaptive sorting - the ordering of the links can be changed, again depending on the user model. The main problem with this is that the order of links can become non-stable;

• hiding or removing links - if the links are not relevant to the user's current goal (task or interest), then do not display them (relevancy can be judged against the attributes that will be introduced in the user model). This protects users from the complexity of the system, but users may become unhappy when previously-available links become invisible or disabled [20]. This requires more information about annotations; we need to check if these annotations are relevant to the user's task or interest (or the user's model in general);

• annotation - adding extra information to the link labels; for example, different colours can be used: green for known information, red for ready to learn, light blue for not ready (in an education system).

And some more navigation adaptations:

• task-based - If the user is searching or has a specific or well-defined goal, then we can provide a kind of "I am feeling lucky" link;

• experience - If the user is an experienced COHSE user, then we can provide longer lists of resources, but if the user is new to the system, then we can have a threshold for the number of resources (links) that will be suggested. This requires information about the user's knowledge of the COHSE system (it could be domain knowledge as well). We can have stereotype user experiences and provide these as a customisation option;

• interest or goal - can we filter resources and present the ones that are more relevant to the user's interests or goals? For this we probably need to improve the information stored about annotations (see discussion above).

B.3.3 Presentation

The presentation of COHSE links and linkboxes could also be adapted based on the following characteristics/features:

• task-based - If the user is searching or has a specific or well-defined goal, then we can present the COHSE link boxes as tooltips. Here we can again use stereotype tasks and provide them as a customisation option;

• knowledge-based - Depending on the user's knowledge, we can organise the links in a linkbox in different ways. For experienced users, we use the hierarchy in the ontology and group resources by terms, and for beginners we organise lists by target page titles;

• Links can be visually annotated based on different criteria to make a distinction between links and allow the user to easily choose relevant links (e.g., different icons, colours, font sizes, etc.). For example, if we embellish COHSE links with a particular colour, when the user clicks on a particular COHSE link we can change the colour of that concept; with this technique we make a distinction between visited and unvisited COHSE links (this will require information about the user's browsing history of COHSE links). Similarly, we can add a small icon next to a COHSE link to indicate that only the description of that particular term is available in the link box (for example, Wikipedia (http://en.wikipedia.org) uses a small icon to differentiate links that point to articles outside Wikipedia);

• experience - Depending on the user's experience, we can provide different levels of interactivity in link boxes; for example, we can have plain lists or add "expand/contract" buttons.

B.3.4 Adaptation of services - Resource Manager and Ontology Service

• Depending on the user model, can we suggest ontologies that can be used for a particular session? One needs to compare ontologies with different attributes of the created user model;

• If the user is an administrator, then allow the user to add or remove ontologies to/from the Ontology Service and also add or remove components to/from the Resource Manager. For this we can group users into two: administrator and general end-user, and administrators will be able to customise these services;

• Present specific Resource Manager components to specific user groups. Depending on the role, allow the user to choose from different Resource Manager components. For example, if the user has been identified as a biologist, then we could restrict the ontologies that can be presented to the user, such as the Gene Ontology Database;

• Suggest ontologies based on the page content - [46] uses TF-IDF (a statistical technique used to evaluate how important a word is to a document; see http://en.wikipedia.org/wiki/Tfidf) in conjunction with cosine similarity to create groups of documents that are in the same context. Can we use TF-IDF and cosine similarity to automatically choose an ontology that can be used to browse a document? Vector representations can be calculated for a document, and then cosine similarity can be computed between the created vector and the ontology. When we compare a number of ontologies against the document, we can suggest the one that has the highest similarity. A sketch of this idea is given after this list.
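The following is a minimal sketch of that last idea, under simplifying assumptions: pages and ontology labels are reduced to bags of words, plain term-frequency vectors are used for brevity (a full implementation would weight by IDF), and all names are invented for the example:

```java
import java.util.*;

// Minimal sketch: suggest the ontology whose term labels are most similar
// (by cosine similarity over word-frequency vectors) to the words of a page.
public class OntologySuggester {
    // Frequency of each word in a bag of words
    static Map<String, Double> tf(List<String> words) {
        Map<String, Double> tf = new HashMap<>();
        for (String w : words) tf.merge(w, 1.0, Double::sum);
        return tf;
    }

    // Cosine similarity between two sparse vectors
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            na += e.getValue() * e.getValue();
        }
        for (double v : b.values()) nb += v * v;
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        List<String> page = Arrays.asList("gene", "protein", "membrane", "cell");
        Map<String, List<String>> ontologies = Map.of(
                "GeneOntology", Arrays.asList("gene", "protein", "cell", "membrane"),
                "DivingOntology", Arrays.asList("gas", "regulator", "depth"));

        String best = null;
        double bestScore = -1;
        for (Map.Entry<String, List<String>> o : ontologies.entrySet()) {
            double score = cosine(tf(page), tf(o.getValue()));
            if (score > bestScore) { bestScore = score; best = o.getKey(); }
        }
        System.out.println("Suggested ontology: " + best);
    }
}
```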

B.3.5 History

Users' browsing history can be used in the following ways to support adaptation:

• Visited/unvisited links - Conventional browsers make a distinction between visited and unvisited links. By using the user's history we can keep track of the concepts that have been visited and highlight them differently. This can also be useful for orientation, as users will be able to find out where they have been. For example, if the user clicks on the COHSE link "membrane" she will get the list of resources suggested for the concept "innerMembrane" (because of the subsumption relationship), and when she visits the "innerMembrane" concept she will get the resources for the concept "membrane", but this time we can, for example, put a small icon next to the "membrane" concept to indicate that she has already seen it. This requires information about the user's browsing history of concepts;

• Tooltips vs link boxes - If the user accesses a page with COHSE links and does not click on COHSE links, then perhaps next time we can present COHSE links as tooltips. If we store information about the visited pages and concepts, then we can infer such information;

• Focus is on description - If the user visits pages with COHSE links and clicks on the suggested link boxes but does not follow any of the suggested links, can we infer that the user is actually interested in the description of the concept more than in the suggested links? For this we need to store information about the concepts that the user clicks on and the links that she follows from the relevant linkbox, and we also need information about the sequences of these events (or possibly time-stamps);

• Importance of order - If the user always selects the first link from the list of resources, does that mean the user is not interested in having a long list of resources and likes to see only a few links in a link box? For this we need to have information about the order of the links and the links that are followed from a linkbox;

• Siblings - When the user visits a concept, can we assume that the user might also be interested in the siblings of that concept? For example, if the user is using the diving ontology and visited the concept "MixedGases", can we assume that she might also be interested in "InertGases", which is a sibling?

• Domain - If the user has been traversing pages from a particular site (or collection of sites), can we suggest only links for that domain, or give priority to links from that domain? (E.g., if the user has been browsing pages from docs.sun.com, does this indicate that the user is looking for a piece of documentation?) This requires information about the ontologies used and also the pages traversed;

• Breadcrumb - We can also create history breadcrumb trails by using the concepts that have already been visited;

• Suggesting ontologies - Depending on the user's browsing history, we can also suggest different ontologies (e.g., "you have been browsing COHSE with ontology X but you might also find ontology Y interesting"). This requires similarity/relevancy analysis of ontologies;

• Hot-list - If we observe the following pattern (several users from the same group visiting the same concept on the same page):

    group-A     group-A     group-A
    user-X      user-Y      user-Z
    page-B      page-B      page-B
    concept-C   concept-C   concept-C

then we can place concept-C in our hot-list for that page and ontology, and when other users from the same group visit that particular page, we can put special emphasis on concepts from that hot-list.

B.3.6 Collaborative filtering

If a user in a group exhibits certain browsing behaviour, can we deduce that the other users in the same group might want to do the same thing? (See, for example, the discussion of collaborative filtering at http://www.the4cs.com/~corin/research/aws/.)

• user groups - We can use different heuristics to identify popular links among a user group by performing collaborative filtering (see the last item in the previous section);

• In link boxes, we can add a section to include resources that other users from the same group traversed for that particular term;

• We can highlight popular links among a particular group, or we can give priority to popular links, or we can use such information for the ordering of the terms in the list;

• Identify the concepts that are most visited and then assign higher priority to these concepts compared to the others.

B.4 Limitations and Difficulties

Some issues that need to be addressed in an adaptive system are listed below:

• Consistency - adaptive systems usually appear to go against this maxim (if we decide to adapt based on many criteria, then the system can get complicated and very easily inconsistent);

• Feasibility problem - can the system actually incorporate enough suitable knowledge about an individual user in order to make adaptive responses to that person?

• Customisation - having a lot of preferences can also be a problem; the user has to make lots of choices and selections before actually starting to use the system. This can be addressed by having good sets of defaults;

• Automatic model creation - (i) automatic model creation is not reliable; (ii) some user characteristics, such as preferences and background, cannot be detected automatically; (iii) watching the user's behaviour provides insufficient information about the user.

B.5 Requirements and Architecture

In order to support the adaptation techniques discussed in Section B.3, in summary we need to capture the following information:

• Visited concepts - We need to store information about the concepts that the user visits (clicks on to get the link box). With our previous architecture we could not access such information; however, we have modified our architecture to support this. We now use AJAX technology and create link boxes dynamically; when the user clicks on a link, the appropriate link box is created (this is actually a call to the portlet). This enables us to store information about the concepts that the user visits;

• Domain knowledge - Some techniques depend on the level of the domain knowledge of the user. By observing the user's browsing behaviour it can be extremely difficult to obtain such information. But we can introduce a customisation option and simply have the following options: beginner, intermediate, and expert;

• Task knowledge - Some techniques require knowledge about the current task of the user (for example, searching, browsing, surfing, learning, etc.). This can again be very difficult to obtain automatically, but we can simply introduce a new customisation option for this purpose (a number of stereotype tasks). However, this might not be very easy to customise; the user can easily move from one task to another, and when he starts using the portlet he does not necessarily know his task explicitly;

• Groups and interests - Some techniques also require information about the user's groups and interests. When the user gets an account from the portal, different user groups and interests can be introduced so that the user selects a particular group and specifies his interests at the registration stage;

• More information about annotations - To support some of the adaptation techniques discussed, we also need to have more information about the annotations, for example the creator of the annotations, their role and interests, etc.;

• Experience of using COHSE - This can be a customisation option; we can again introduce different levels of experience (stereotypes) from which the user will choose;

• Access rights - We can also introduce different user groups (e.g., administrator, end-user) in order to control different levels of access to COHSE services;

• Page structure - Now, with the improved COHSE system (based on AJAX), we can store the concepts for which the user chooses to get link boxes; however, when the user clicks on a COHSE link we cannot access the positioning of that concept in the page. When the user clicks on a link, only the id of that link box is submitted to the portlet. However, we can modify this so that some extra information is also submitted along with the id of that link box;

• Pages/sites visited - In order to identify the sites that the user visits, we need to store information about the pages that the user traverses using COHSE;

• Ontologies - For some adaptation techniques, we also need to store information about the ontologies that the user uses to traverse pages with COHSE. With our existing architecture, information about ontologies can easily be stored, as we provide selection of ontologies as a customisation option;

• Links followed in a link box - In our current architecture (even with the AJAX additions), we cannot access such information. The proxy does not make any distinction between URLs that are requested from the page itself (from a tags) and those requested from the link box. We might need to modify the links that are added to the linkbox to include extra parameters (e.g., to differentiate them from other links in the page, the positioning of the link, and some contextual information such as the concept that the link box was created for).

B.5.1 For how long do we store the user's browsing history?

Some of the requirements specified above are based on data stored about the user's browsing history. Regarding this, we also need to consider the duration of storage of such data: for how long do we store information about the user's browsing history? We could possibly take the following approaches:

1. We can store and analyse data per session, but then we might not have enough information to draw conclusions about the user's behaviour;

2. We can store data for a specific period of time, but then the duration of that period becomes very important. If it is short, that period might not represent the user's browsing behaviour;

3. Finally, we can store data and never delete it. But then it is highly likely that we will have a huge amount of data that is very difficult to analyse and draw conclusions from.

B.5.2 Adaptation Level

Some of the techniques presented in Section B.3 are driven by the user and some of them are driven by the system (see also Section B.1, adaptable vs. adaptive, or customisation vs. personalisation). In Section B.4 we also discussed a number of limitations and difficulties with adaptive systems or with personalisation. In order to overcome these limitations, we can provide a customisation option for the level of personalisation. Then the user will decide at which level the system is to make any judgements or decisions.

B.6 Aims and Objectives and Evaluation

We possibly need to ask the following questions regarding this adaptivity work:

• [23] reports that adaptive navigation increases the speed of navigation and learning, and that adaptive presentation improves content understanding. So if we use similar adaptation techniques, will we get similar results?

• What would be the implications of using user profiles and histories on the browsing and navigation support provided by extra COHSE links?

• How can we make suggested COHSE links more efficient and effective? Will adapting COHSE links based on a user model make them more efficient?

• Although adding links increases navigational choices for users, it can also increase the cognitive effort required to decide where to go next. So will adapting these suggested COHSE links help us to address the cognitive effort problem?

• Added COHSE links might not always match what the user is trying to do, so can we use the user model to adapt links to better match the user's goal?

• How do we guide users through the list of proposed links? How do we help users to find their way around the links that we are suggesting?

We definitely need to have a good understanding of what is being evaluated:

• One of the questions that needs to be asked is: "does adaptivity in COHSE mean effectiveness and efficiency?";

• [22] proposes a two-level evaluation approach:

1. Evaluation of user modelling, which is the evaluation of the validity of the information maintained in the user model.

2. Evaluation of adaptation decision making, which is the evaluation of the success of the adaptation decisions.

• [22] discusses different ways of evaluating adaptive systems, which may include qualitative and quantitative measures such as: task completion time, number of nodes visited, accuracy of tasks, how well the user remembers the structure of the information space, the user's indication of utility, ease of use, naturalness, number of navigation steps, number of repetitions of previously studied concepts, and number of transitions from one concept to another, or from one index to a concept;

• A comparative evaluation can be conducted by comparing the original version, a version with COHSE links, and a version with adaptive COHSE links. We can compare them against certain metrics, which can be the number of navigation steps, the number of tasks repeated, the number of unforced repetitions of previously studied concepts, etc. [21];

• Performance can be measured using different metrics by constraining the usage of our system to a number of tasks;

• [13] presents an evaluation where some users got original documents and some users got personalised versions; their study shows that the users who got the personalised pages returned to the site more often.
Appendix C

COHSE Architecture
Technical report, January 2004

This document contains a discussion and analysis of the COHSE architecture. It describes the existing architectures, requirements for customization, and identifies issues and shortcomings. An alternative approach is then discussed.

C.1 Introduction

COHSE provides dynamic linking of documents based on terms and concepts found in an ontology. The system has been described elsewhere; see [53, 30, 54] for details. The system makes use of three components with the responsibilities outlined below:

• An Ontology Service provides information relating to the concepts and their relationships in (one or more) ontologies.

• A Resource Manager provides functionality that maps concepts to resources (URLs). (In recent documentation, we have used the term Annotation Service. This is slightly confusing, however, as the service is not simply concerned with annotation, but is rather a general service concerned with mapping between resources and ontology terms. For the remainder of this document we will refer to it as the Resource Manager.)

• The DLS is responsible for adding links to documents, depending on information provided by the Ontology Service and the Resource Manager.

The basic architecture of the system is shown in Figure C.1.

Figure C.1: COHSE Architecture

COHSE works roughly as follows:

• The DLS requests the Ontology Service to provide the terms from a specified ontology.

• The DLS requests a document.

• The DLS looks for occurrences of terms in the document. Each term found acts as a potential link source. Each link source is associated with a concept corresponding to the term that identifies the link source.

• The DLS also requests the Resource Manager to provide explicit annotations on the given document (if any exist). The concepts associated with the annotations also form potential link sources, and are then constructed as above.

• The DLS requests targets from the Resource Manager for each link source.

• If insufficient targets are found, the DLS will use the taxonomy provided by the Ontology Service to broaden the search, again querying the Resource Manager with, for example, more specific terms.

• Once the right number of targets has been identified, links are added to the source document.

A sketch of this target-gathering step is given below.
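The sketch below illustrates the broadening step of this flow. OntologyService, ResourceManager, Concept and Target are hypothetical interfaces standing in for the real COHSE components, and the threshold is arbitrary:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the DLS target-gathering loop described above;
// the interfaces are hypothetical stand-ins for the real COHSE components.
class LinkingSketch {
    interface Concept {}
    interface Target {}
    interface OntologyService { List<Concept> subConcepts(Concept c); }
    interface ResourceManager { List<Target> targetsFor(Concept c); }

    static final int WANTED = 5; // desired number of link targets (arbitrary)

    static List<Target> targetsFor(Concept concept,
                                   OntologyService ontology,
                                   ResourceManager resources) {
        List<Target> targets = new ArrayList<>(resources.targetsFor(concept));
        // If too few targets are found, broaden the search using the taxonomy,
        // e.g. by querying the Resource Manager with more specific concepts
        if (targets.size() < WANTED) {
            for (Concept narrower : ontology.subConcepts(concept)) {
                targets.addAll(resources.targetsFor(narrower));
                if (targets.size() >= WANTED) break;
            }
        }
        return targets;
    }
}
```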

C.2 Customization and Personalization

The main purpose of revisiting the architecture is to support customization. By customization we mean allowing the user greater control over the way in which the system functions, for example allowing better control of the ontologies used or the sources used to determine link targets. We are not directly interested in supporting what might be called personalization, where the system makes decisions based on some observation of user behaviour (for example, selecting ontologies automatically based on frequently used keywords). Any architectural choices, however, should not preclude us from being able to investigate or support personalization.

In the current COHSE implementation there are a number of preferences or options that can be selected in order to influence the behaviour of the system. These include:

Ontology Service: Selecting the Ontology Service that will deliver the ontological information to be used.

Ontology: Selecting which of the available ontologies should be used. Ontologies can also be loaded into or unloaded from the service.

Resource Manager: Selecting the Resource Manager that will be used to supply link targets.

Resource Manager Component: Selecting the components of the Resource Manager that will be used to supply link targets. For example, the current implementation includes components that use a COHSE annotation database or searches of external resources such as Google or Amazon.

DLS Agent Behaviour: There are a number of options that fall under this general heading. They include whether to use generic or annotation linking, the use of link suppression or promotion, and options such as whether to link to the same page, same domain, etc.

Clearly these choices are not independent. For example, the choice of Ontology Service will have an impact on the choice of ontologies available. There are also a number of potential extensions to the behaviour of the system that we can envisage. These would require additional options to be selectable.

• The way in which the ontology is used could be configurable. For example, the user may wish to identify particular annotation properties that they would like to contribute to the operation of the system. As an example of this, consider the Sun UPT taxonomy used in an earlier demo. This contains properties such as trademarkName. A particular user might want the lexical values associated with trademarkName to be used when supplying lexical renderings. This is currently handled through the use of rdfs:label to denote the default name of a concept, and a special annotation property cohse:annotation which is used to provide alternative labels (see the sketch at the end of this section). The use of this special property requires both a change to the source ontology and the URI of the property to be hardwired into the Ontology Service implementation.
• The current Ontology Service simply makes use of taxonomic relationships (super/sub) in order to broaden link target retrieval. We may wish to allow the user to specify alternative relationships that the DLS can use to navigate through the ontology. Again, this is likely to be on a per-ontology basis, so needs flexibility.
• We plan to investigate visualisation and navigation of the ontology in the later stages of the project. This is likely to bring additional options.

The question of access control is also relevant to a number of the issues above. For example, we may not wish to allow all users to load or unload ontologies. Access to the basic management functionality of the services (starting/stopping, etc.) is also something which requires consideration.
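To make the labelling scheme concrete, the following is a minimal sketch of how a concept might carry a default label, an alternative COHSE label and a domain-specific property such as trademarkName. The namespace URIs for the cohse and upt prefixes, and the class itself, are assumptions for illustration; the report does not specify them.

    <owl:Class rdf:about="http://example.org/upt#JavaCard"
        xmlns:owl="http://www.w3.org/2002/07/owl#"
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
        xmlns:cohse="http://example.org/cohse#"
        xmlns:upt="http://example.org/upt#">
      <!-- Default rendering used by the Ontology Service -->
      <rdfs:label>Java Card</rdfs:label>
      <!-- Alternative label supplied via the special COHSE property -->
      <cohse:annotation>Java Card technology</cohse:annotation>
      <!-- A user might prefer this property to drive lexical rendering -->
      <upt:trademarkName>Java Card</upt:trademarkName>
    </owl:Class>

Allowing the user to nominate trademarkName, rather than the hardwired cohse:annotation, as a source of labels is exactly the kind of per-ontology configuration discussed above.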

C.3 Problems

The problems with the existing architecture and implementation can be split broadly into two, although it is clear that the two are related:
• Interface granularity
• Missing components
For example, the current interface on the Resource Manager only permits the selection of Resource Management Components at a global level. This is not ideal, as changes made by one user may affect the results of queries from other users. An example of a missing component is the lack of a component with responsibility for recording or storing user preferences–currently user preferences are stored in a number of different places (see the discussions below for details).

C.4 Current Support for Customization

There are currently three approaches to COHSE implementation (although two are closely related).

Browser Plugin: The Browser plugin implementation (based on Mozilla) provides an additional browser sidebar to control user preferences. The DLS manipulates the DOM object within the browser, adding links to the document.

COHSE Proxy: The COHSE proxy sits between a web server and client browser. When an HTTP request is made from the browser, the proxy intercepts the response and rewrites the document (adding links to the document where appropriate) before returning it to the browser. The COHSE Proxy is implemented using the Muffin proxy classes (http://muffin.doit.org).

Rewriting Proxy: The rewriting proxy makes use of the COHSE proxy, but is deployed as a servlet. This allows clients to make use of the COHSE proxy without adjusting browser settings.

For each of the implementation architectures, the interaction between components is slightly different. We discuss below how the various customization options are currently supported (if at all), and highlight potential problems and issues.

C.4.1 General Ontology Service Issues
The Ontology Service acts as a repository for many ontologies. The ontologies are all exposed to any users of the Service. Ontologies can be loaded or unloaded, and there is no restriction on who can load or unload ontologies (as far as the basic Ontology Service is concerned). Calls to the Ontology Service (for example a request for the default rendering of a concept in an ontology) in general contain an argument specifying the ontology which is the context of the call.

C.4.2 General Resource Manager Issues
The Resource Manager has a number of components that support resource retrieval. These can be selected/deselected when starting the service (see below for further discussion of this). Note though that the components must be specified at Service startup–there is no facility for dynamically adding components to the Service. If component initialisation fails at start time (e.g. the database the component uses is not available), the component may not be subsequently available for use. The component selection is global. All requests to the Resource Manager will be resolved using the current settings. There is no ability to specify the components that should be used by a particular user or when a call is made.

C.4.3 Browser Plugin
The Browser Plugin is provided as a plugin for Mozilla, and offers customization through two mechanisms: a general Preferences pane (accessible via Mozilla's preferences) and controls on the sidebar itself. The plugin is implemented using Javascript.

Ontology Service: The user provides the URL of the ontology service via the Preferences pane. This can be changed at any time–if the DLS sidebar is then selected, the new Ontology Service will be used.

Ontology: The user selects ontologies via the sidebar. On initialisation the sidebar queries the Ontology Service for available ontologies. These are then presented to the user. Note that deselecting then reselecting the DLS sidebar results in the list of available ontologies being re-requested. If the collection of ontologies held by the Ontology Service has changed, this will be reflected in the list.

Ontology load/unload: Set via the Ontology Service front panel. Note that changes made to the Ontology Service will affect all users using that Ontology Service. As the DLS plugin simply accesses the Ontology Service via an HTTP connection, we cannot easily restrict access to the Ontology Service to control access to load/unload functionality without compromising basic access to the Ontology Service by the browser plugin. This may cause potential problems if third parties change the settings of the Ontology Service.

Resource Manager: The user provides the location of the Resource Manager service via the Preferences pane. This can be changed at any time–if the DLS sidebar is then selected, the new Resource Manager will be used.

Resource Manager Components: The user selects the components via the front panel of the Resource Manager. Again, changes to the Resource Manager will affect all users using that Service, and as DLS access is via vanilla HTTP, we cannot easily restrict access to the Resource Manager to control access to component selection.


General Behaviours: The user specifies general behaviour using a combination of the sidebar and the general browser Preferences. For example, use of generic or annotation linking is set via menu checkboxes. Link presentation choices and link quantities are also set via the sidebar, while options such as the use of annotations as resources are set via the Preferences. There is an issue with the persistence of user settings with the Mozilla plugin. Those preferences set via the sidebar are not kept–whenever the user activates the sidebar, s/he has to specify the settings again. Preferences set using the Mozilla preferences mechanism are kept across sessions.

C.4.4 COHSE Proxy
The vanilla COHSE Proxy provides a proxy through which requests are made. The proxy allows some customization of options through settings. Settings are associated with a particular IP address. When the proxy is in use, the user is offered a link at the bottom of each page that brings up an HTML form allowing adjustment of the settings. A Proxy configuration file (in XML) specifies a number of startup options. The Proxy is implemented in Java, and runs as a standalone application, providing a proxy service at a particular port. The user uses the proxy by adjusting the browser's proxy settings.

Access Control: Providing a proxy has associated risks. For example, running a proxy on a machine in the cs.man.ac.uk domain may potentially allow external users using that proxy access to resources that are intended for local use only, or to paid services such as Springer's online proceedings. Access control may therefore be advisable. For the basic COHSE proxy implementation, this can be achieved via restriction to particular IP addresses (specified in the proxy's configuration file). This is probably too fine-grained, however, as it means that all IP addresses that will request access must be known at startup time.

Ontology Service: The URL of the ontology service is specified in the configuration file. This is read at startup time, and cannot subsequently be changed while the proxy is running.

Ontology: Ontologies are selected via the settings form. When initialised, the proxy queries the Ontology Service for available ontologies. These are then presented in the settings. The proxy queries for ontologies once at startup–if ontologies are loaded or unloaded the changes will not be seen by the Proxy.

Ontology load/unload: Set via the Ontology Service front panel. Note that changes made to the Ontology Service will affect all users using that Ontology Service. If the proxy and Ontology Service are under central control, we can restrict access to the Ontology Service (for example through setting IP access control in the Ontology Service's servlet container and allowing only access from the proxy host and trusted parties). This would limit potential problems with third parties changing Ontology Service settings.

Resource Manager: The URL of the Resource Manager is specified in the configuration file. This is read at startup time, and cannot subsequently be changed while the proxy is running.

Resource Manager Component: Components are selected via the front panel of the Resource Manager. Again, changes to the Resource Manager will affect all users using that Service. As with the Ontology Service (see C.4.4), however, access to the Resource Manager could be restricted, alleviating problems of third parties altering settings.

General Behaviours: The user may select general behaviours using the HTML Settings form. Individual user settings are not held in any persistent way. If the proxy is shut down and restarted, settings for a particular user (e.g. IP address) will return to the defaults.

C.4.5 Rewriting COHSE Proxy
Using the vanilla COHSE proxy requires the user to adjust the browser's proxy settings. An alternative is to use a rewriting proxy. The rewriting proxy takes requests of the form http://proxy/http://whatever, where http://proxy is the URL of the rewriting proxy and http://whatever is the intended document. The rewriting proxy retrieves the document and processes it in some way–in this case by making the request via the COHSE proxy. Any URLs found in the resulting document are rewritten by the rewriting proxy to ensure that all subsequent requests are retrieved via the rewriting proxy. The location of the COHSE proxy is given to the rewriting proxy at startup, and cannot be changed whilst they are running. Adjustment of the COHSE proxy settings can be achieved as described above. However, when the rewriting proxy is in use, all requests to the COHSE proxy appear to come from the same IP address (the IP of the machine hosting the rewriting proxy). This means that any change in settings will be shared by all users of the rewriting proxy. The rewriting proxy is implemented as a servlet.

Access Control: Again, access control may be advisable. In the case of the rewriting proxy, we can use HTTP authentication (via the host's servlet container). This provides an easier mechanism to support access to COHSE from third parties. As the requests to the COHSE proxy will now come from a known host (that which hosts the servlet container), we can restrict access to the COHSE proxy using explicit IP addresses, providing a more secure situation than having the COHSE proxy open to the world.
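The startup options described above (service locations, component selection, IP restrictions) are all fixed in the proxy's XML configuration file. As an illustration, a hypothetical sketch of such a file follows; the element names are invented here, as the report does not show the actual schema.

    <cohse-proxy-config>
      <!-- Read once at startup; cannot be changed while the proxy runs -->
      <ontology-service url="http://example.org/cohse/ontologyservice"/>
      <resource-manager url="http://example.org/cohse/resourcemanager"/>
      <!-- Access control: every permitted client must be known at startup -->
      <allowed-clients>
        <ip>130.88.192.10</ip>
        <ip>130.88.192.11</ip>
      </allowed-clients>
    </cohse-proxy-config>

The access-control list makes the problem discussed above concrete: each IP address that may use the proxy has to be enumerated before the proxy starts.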
C.4.6 Summary
A rough summary of the above is as follows. The Plugin solution offers reasonable support for customization, but requires client installation, specialist software and bespoke coding for particular platforms. The COHSE proxy offers some customization on a per-user basis, but requires changes to the user's browser settings. The rewriting proxy provides flexibility and ease of use, but does not support per-user customization of settings. Adjustment of Ontology Service and Resource Manager settings is possible in all scenarios, but is in general global rather than per-user. Access to the Services can be restricted to prevent problems with third parties altering settings, but this is not a solution to the situation where users may wish to introduce new components (such as resource retrieval components) or ontologies into the system.

All solutions suffer from a lack of persistence in the storing of user preferences.

C.5 COHSE Portal
As discussed above, COHSE currently offers three slightly different architectures with corresponding limitations concerning personalisation. In order to overcome these difficulties, we are proposing a move to a portal-based architecture, with the COHSE functionality being delivered as a portlet. Portals support notions of users, and provide mechanisms for storing user preferences.

C.5.1 Customisation Support of COHSE Portal
User groups could be used to support different roles when accessing the system and setting options. For example, we can have an administrator group whose members will be allowed to customise the ontology service (add or remove ontologies). For a possible end user group, we can provide the following preferences:

Resource Manager: Allow the user to choose from different retrieval components: Amazon, Google, AnnotationDB, GODB, GATE, etc.

Ontology Service: Allow the user to choose an ontology from the list of ontologies provided by the ontology service. We can allow users to choose multiple ontologies (then we will need to consider how to differentiate these ontologies in the presentation, etc.).

Link Presentation: Options might include: separate windows, inline, pop-up menu.

Annotation Linking: Options might include: add all annotations, apply annotations from the active ontology, use annotations as resources, show annotations about an entire page.

Link Visualisation: In the current implementation, small 'l' and 'a' icons are used to indicate the added links. Different visualisation techniques could be offered.

Resources: Only propose resources in the same domain as the current one, ignore resources that point to the current page, prevent duplicate resource URLs.

Link Quantities: Maximum number of generic links to display; maximum number of resource links to display for a concept.

Behaviours: Link suppression, link promotion.

Alternative navigation: In the current COHSE approaches, only the is-a relationship is used. We can experiment with the properties provided in the ontology.

C.5.2 Existing Architecture
We considered re-implementing the existing COHSE proxy, or using some other proxy implementation instead of Muffin, such as IBM WBI (http://www.almaden.ibm.com/cs/wbi/) or RabbIT (http://www.khelekore.org/rabbit/). As described in section C.3, problems occur at different levels of granularity. The move to a portlet-based system solves the issue of where to represent and store customization information, and to a certain extent solves some of the issues of functionality/access. However, in order to support the customisation options mentioned in the previous section, we also need to consider some issues with the implementation of the existing ontology service and resource manager:

Resource Manager: The current Resource Manager is inflexible in its use of individual retrieval components. Currently available retrieval components include Google, Amazon, COHSEDB and GoAnnotations. A user cannot choose any of these components during his/her current browsing session; instead, s/he is required to modify the configuration of the resource manager in order to use the preferred components. So the questions are: how can we allow users to choose these components in their browsing sessions, or how can we use these components without actually modifying the resource manager itself? Will users be allowed to use more than one component, or will they be able to use only one of them? A move to a more Web Service-like interface would alleviate some of these problems, with the context of the call (e.g. which components to use) being passed in as arguments. We also have to consider how to implement this so that a new component can easily be added to the existing components.

Ontology Service: If we decide to support alternative navigation to the existing is-a relationship, then the ontology service needs to be extended. For example, the ontology service needs to be able to return all data properties, all object properties and all properties defined for a particular concept.

C.5.3 Portal Technologies
Some of the available portal technologies that we can consider for our implementation are as follows:

Portal Frameworks: These mainly support aggregation (combining a number of portlets), common context, personalisation, authentication and authorisation (single sign-on) and page management (in some cases only partially):
• Apache Jetspeed: version 1 is not JSR168 compatible but version 2 is.
• Pluto (from Apache): JSR168 compatible; it is just a portlet container and does not provide much beyond a reference implementation. In fact, Jetspeed 2 is based on this (http://portals.apache.org/jetspeed-1/).
• uPortal: JSR168 compatible (http://www.uportal.org/).
• GridSphere: JSR168 compatible.

Portal Toolkits: Specialised to application areas, providing a number of customisable portlets:
• CHEF: not JSR168 compatible.
• SAKAI (uPortal + CHEF): JSR168 compatible (http://collab.sakaiproject.org/portal).
• Sun Java System Portal Server.


Appendix D

Dynamic Linking through XSL
Technical report, May 2005
A discussion of the use of XSL for dynamic linking in COHSE

D.1 Introduction

COHSE enhances documents through the addition of links. These extra links may be based on the occurrence of terms or phrases in the documents (generic linking) or on explicit annotations on the documents (annotation linking). The current implementations rely on the explicit manipulation of the documents.

D.1.1 Mozilla Client
Javascript code is used to manipulate the DOM object in the browser. When a linking opportunity is identified by the plugin, a new node is added to the DOM object. Linking opportunities arise through:
• identification of text matching a term from the ontology; or
• an explicit annotation.
In the latter case, the Javascript code needs to identify the point in the document to augment with an annotation. This is currently achieved via XPointer expressions that indicate the beginning and end of the particular document section being annotated (see the sketch at the end of this introduction). This mechanism is known to be somewhat brittle, as different parser implementations deal with badly formed HTML in different ways. A potential solution to this is to ensure that all documents are passed through HTMLTidy before annotating or linking–this will normalise the document structure. However, this approach still requires us to handle and process the locations within the document.

D.1.2 Proxy
The proxy is in a similar situation, although the document manipulation is performed via Java code manipulating a DOM object instead of Javascript. Again, the code for the DOM manipulation is largely bespoke.
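As an illustration of the pointing mechanism described in D.1.1, an annotation on part of a paragraph might be recorded as a pair of XPointer expressions of roughly the following shape. The paths and offsets are invented for illustration; string-range(node, "", offset, length) is the XPointer idiom for addressing characters within a text node.

    start: xpointer(string-range(/html/body/div[2]/p[1]/text()[1], "", 5, 0))
    end:   xpointer(string-range(/html/body/div[2]/p[1]/text()[1], "", 15, 0))

A pointer of this kind is precise, but it is also why the mechanism is brittle: if two parsers build even slightly different trees from the same badly formed HTML, the path component no longer identifies the same node.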

D.2 XSLT

XSL Transformations (XSLT) provide a mechanism for specifying manipulations and translations on XML documents. XSLT translations are defined in terms of templates, which specify a node pattern and some action. For each node in the document that matches the pattern, the action is performed. XSLT can be used to produce, for example, HTML renderings of XML documents. Standalone XSL processors such as Xalan or Saxon take an XML document and an XSL stylesheet and apply the transformation, producing some result. Alternatively, stylesheet processing instructions can be added to XML documents, in which case XML/XSL capable browsers apply the given stylesheet to produce a DOM object which is then used for rendering.

D.2.1 Possible Architectures
There are at least two possible architectures we could employ here. Both assume that we have some mechanism that, given a URL, will produce an XSLT stylesheet that contains templates describing how to add links to a document. The question is then where the transformation occurs.

D.2.2 Manipulation in Proxy
In this scenario, we provide a proxy (possibly accessed via a portal–see other appendices for discussion of this) that intercepts HTTP/HTTPS requests for documents and then manipulates the response, rewriting the document and adding links. The addition of links and the rewriting of the document are performed via an XSLT engine within the proxy. Figure D.1 shows the basic flow in this approach. It can be summarised as follows:
1. The browser makes an HTTP/HTTPS request for document docURL. The proxy passes the request to a web server and retrieves an HTML document (if it doesn't get an HTML document then it simply passes the document on to the browser).
2. The document is passed through Tidy to produce an XHTML document with sensible tags/structure.
3. The document is passed to the XSL engine for transformation within the proxy.
4. A request is made to the DLS for an appropriate stylesheet to transform the document. This could be directly to the DLS, or via some servlet interface.
5. The DLS calculates the stylesheet.
6. The stylesheet is returned to the XSL Engine and the document is transformed.
7. The amended document is then passed to the browser.
8. The browser renders the document for the user.

D.2.3 Manipulation in Client
In this approach, the document passes through the proxy largely unchanged except for the addition of a processing instruction containing a URL referring to an XSL stylesheet, containing the template rules for link addition. The contents of this URL could be dynamically generated, based on the URL being browsed. One issue here is that in order to apply the XSL translation, the browser needs to know that the document content type is appropriate. For example, if Firefox is given an XHTML document with a stylesheet processing instruction, but is not told that the document type is text/xml, it will not apply the stylesheet. Figure D.2 shows the basic flow in this scenario. It can be summarised as follows:
1. The browser makes an HTTP/HTTPS request for document docURL. The proxy passes the request to a web server and retrieves an HTML document (if it doesn't get an HTML document then it simply passes the document on to the browser).
2. The document is passed through Tidy to produce an XHTML document with sensible tags/structure.
3. The proxy adds a single processing instruction to the XHTML document that points to an XSL stylesheet given by a URL something like http://dls-server?url=docURL, where http://dls-server is the location of a servlet wrapping the DLS.
4. The amended document is then passed to the browser.

Figure D.1: Proxy Manipulation

Figure D.2: Client Side Manipulation

The browser will see the processing instruction, indicating that it has to retrieve a stylesheet.
5. The browser calls http://dls-server?url=docURL.
6. The servlet wrapper passes docURL to the DLS. The DLS then produces an XSL stylesheet that encapsulates the annotation links that need to be added to the document.
7. The stylesheet is returned to the browser.
8. The browser applies the stylesheet to the document, rendering it for the user.
The use of XSL gives us a relatively flexible solution and a clean story–linking can be applied at the client or proxy (or possibly server side). The infrastructure components providing the information specifying the links (the XSL template rules) are, however, the same in all situations.
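The processing instruction added in step 3 of the client-side flow would take the standard xml-stylesheet form, roughly as follows (the host name is a placeholder; the report only specifies the shape of the URL):

    <?xml-stylesheet type="text/xsl" href="http://dls-server?url=docURL"?>

On seeing this instruction, an XML/XSL capable browser fetches the stylesheet from the DLS servlet and applies it locally, which is what moves the transformation cost from the proxy to the client.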

D.3 Limitations

There are a number of limitations on the kinds of annotation that we can represent using this approach. This is primarily due to the fact that the structure of the DOM object is dictated by the tags appearing in the XML. See [80] for a discussion of some background and related issues concerning XML markup. The format of the XSL template rules means that our annotations can only refer to:
• complete nodes in the XHTML document;
• fragments of text within a single text node in the document.
The following examples show spans that can and cannot be simply represented using XSL templates and rules. For illustration, assume the sentence is marked up as <p>The <em>quick brown fox</em> jumps over the lazy dog.</p>; the span of interest differs in each case.

1. The quick brown fox jumps over the lazy dog. — a span covering exactly the contents of the em element ("quick brown fox"). This can be easily represented as the em node.

2. The quick brown fox jumps over the lazy dog. — a span such as "The quick", which straddles the boundary between nodes. This span cannot be captured. In this situation, even if we could describe the text that we wish to replace, it is not clear how a new node would be inserted into the document.

3. The quick brown fox jumps over the lazy dog. — a span such as "The quick brown fox", covering some text together with a complete node. Again, there is no easy way to represent this document fragment using XPath expressions, as it includes some text and a node. It would be possible in this case to insert a new node into the document without disrupting the structure, though.

4. The quick brown fox jumps over the lazy dog. — a span such as "lazy", lying wholly within a single text node. This can be represented using offsets within the text node.

5. The quick brown fox jumps over the lazy dog. — the whole sentence. Here we have the entire p node.

D.4 Issues

Our two types of linking include annotation links, where a specific document fragment is identified as the source of the link. As discussed above, there are potential restrictions on the kinds of document fragment that we can specify using this mechanism. However, the template rules can be expressed purely in terms of XSLT. For example, Figure D.3 shows a stylesheet with two templates. The first simply copies any node; this provides us with a null or identity transformation. The second rule identifies a particular node in the document (the first p of the third div of the document's body). We can use the substring operator to pull out and manipulate sections of the document. This example will take the text between characters 5 and 14 and replace it with a link (anchored on the same text). XSL uses a number of rules to determine which template should be applied when multiple templates match. In this case, the second is the more specific, so it will be applied to the node when it is encountered. (A sketch along these lines follows.)
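Figure D.3 itself is not reproduced here, so the following is a reconstruction in the spirit of its description: an identity template plus a more specific template that splits a text run at characters 5 and 14 and wraps the middle in an anchor. The target URL, and the assumption that the p contains a single text node, are illustrative.

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <!-- Identity transformation: copy everything by default -->
      <xsl:template match="@*|node()">
        <xsl:copy>
          <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
      </xsl:template>
      <!-- More specific rule: first p of the third div of the body -->
      <xsl:template match="/html/body/div[3]/p[1]">
        <p>
          <xsl:value-of select="substring(., 1, 4)"/>
          <a href="http://example.org/target">
            <xsl:value-of select="substring(., 5, 10)"/>
          </a>
          <xsl:value-of select="substring(., 15)"/>
        </p>
      </xsl:template>
    </xsl:stylesheet>

Note the point made in D.4.1 below: only one template fires per node, so several annotations on the same text node must be folded into a single template of this shape.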


D.4.1 Multiple Template Matching
One potential issue here is that XSL will apply a single template to a node–thus, if we want to add multiple links to a particular text node, we will need to be careful about the template production. Several annotation links may need to be represented using a single template.

Figure D.3: Stylesheet Example

D.4.2 Extensions
For generic linking, it is less clear how much can be done within the XSLT templates. Generic linking basically relies on regular expression matching. When we identify the occurrence of a term mapping to a concept, a new node is inserted into the document. Support for text manipulation in XSLT is minimal. There are built-in functions such as matches() that allow us to match regular expressions, but there is no general mechanism that allows the replacement of text matching a pattern with a node (which is what we need/want to do with generic links). An alternative here is to use extensions. XSLT allows us to call out to external code, for example Java classes. We can then supply functions that, given a node containing text, will manipulate that node in the ways we require (see the sketch at the end of this section). An issue here is that such XSLT manipulations cannot easily be performed on the browser side (as discussed above), as it would require us to pass the Java code to the browser. Thus this approach to generic linking seems suited only to proxy or server side implementations.

D.4.3 Namespaces
Due to the way that browsers and XSL processors handle transformations, we need to be careful and explicit about namespaces in our documents. If an XHTML document contains an external DTD, and the document is processed by, for example, Saxon, the external DTD will be read and the HTML tags will appear in the given HTML namespace. This means that our stylesheets must make sure that the template rules refer to tags in the appropriate namespace. However, Mozilla and Firefox ignore external DTDs. Thus in an HTML document processed by Firefox, the HTML tags will appear by default in no namespace, meaning that a stylesheet using explicit namespaces to refer to HTML tags will not result in matches. A way of handling this is to insert an explicit namespace declaration into the document.
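Two small sketches to make the last two subsections concrete. The first uses the Xalan convention for binding a stylesheet prefix to a Java class so that a template can hand a text node to external code; the class name and method are hypothetical stand-ins for the kind of generic-linking helper described in D.4.2. The second line is the explicit XHTML namespace declaration that D.4.3 suggests inserting into documents.

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:dls="xalan://org.example.cohse.GenericLinker"
        exclude-result-prefixes="dls">
      <xsl:template match="text()">
        <!-- Java code receives the text and returns a fragment with links added -->
        <xsl:copy-of select="dls:addLinks(.)"/>
      </xsl:template>
    </xsl:stylesheet>

    <html xmlns="http://www.w3.org/1999/xhtml"> ... </html>

As noted above, the first sketch only works where the XSLT engine runs (proxy or server side); a browser cannot be handed the Java class behind the dls prefix.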

Appendix E

Experiences of Dynamic Linking via XSL
A slightly modified version of this appendix was submitted as a paper to WWW2006, January 2006.

This paper describes an experiment to reengineer the architecture of COHSE–a system for dynamically adding hypertext links to Web resources. COHSE uses domain ontologies (along with annotations using those ontologies) to identify link sources and suggest possible link targets. In the existing system architecture, links are added to the document using ad hoc, bespoke techniques. Our current goal is to enhance the system to support customisation through a portal. In order to support this, the architecture was reengineered to use standard technologies such as XHTML, DOM, XSL, XPath and JSR168. However, there are a number of difficulties in achieving our goal with these standards. In the paper, we first introduce our new architecture and the portlet based on these standards. We then discuss lessons learnt and experiences gained–these are challenges that need to be addressed in order to scale semantic annotation techniques to real-world applications.

E.1 Introduction

In this paper, we present a reengineered architecture of the Conceptual Open Hypermedia System (COHSE) which shows how Web standards such as XHTML, XSL, XPath, DOM and CSS can be used to link Web pages dynamically. We then explain how this architecture is deployed in a portal to support customisation, and discuss some experiences and insights of using these standards.

The theoretical foundation of the World Wide Web (Web) is the concept of linking resources [12]. Typically, links on the Web are embedded within documents. Although embedding links is a simple and scalable approach, there are a number of limitations: links can only be uni-directional, ownership is required to place an anchor in a document, etc. [41, 42]. With the advances in the Semantic Web (http://www.w3.org/2001/sw/), the embedded link structure of the Web can be improved and extended; additional Semantic Web components can be used to add links dynamically between resources by using the existing infrastructure of the Web. A key driver for Semantic Web advances is to improve machine processing on the Web; however, including semantics for machine processing can also be used to improve the linking and navigation of Web pages intended for human end-users.

COHSE provides a framework that integrates an ontology service and an open hypermedia link service to form a conceptual hypermedia system enabling documents to be linked via ontologies [30]. Through the use of different domain ontologies, COHSE can provide different hypertext views of Web resources. For instance, [9] presents how the COHSE system is used in Bioinformatics to link biology resources by using the Gene Ontology (http://www.geneontology.org/), and [29] presents how Sun's Java tutorial (http://java.sun.com/) pages are augmented using a Java and programming ontology.

The COHSE architecture is composed of a COHSE Distributed Links Service (DLS) and two supporting services: an Ontology Service (OS) and a Resource Manager (RM). The Ontology Service supplies operations relating to the content of a conceptual model [7]. The Resource Manager associates concepts with resources that are identified with different techniques (e.g., lexical matching, using GATE (http://gate.ac.uk/), etc.) and provides mechanisms for querying those associations. The COHSE DLS uses these services to decorate pages with extra links. The system has, in the past, been deployed as a browser extension (based on Mozilla, http://www.mozilla.org/) or as a proxy through which HTTP requests are routed. Although these systems provide customisation, the options are limited. Supporting customisation has a number of requirements, including identifying users, being able to store preferences persistently, etc. To meet these requirements, we have moved to a portal-based implementation. A portal provides a framework to aggregate content from different sources, supporting storage of user profiles, customisation and personalisation [47].

Although the supporting services (OS and RM) provide a loosely coupled architecture, both browser and proxy implementations have their own bespoke, ad-hoc techniques to modify (i.e., transform the underlying structure of) documents and encode links. To overcome such limitations and support a light-weight DLS agent, the architecture has been reengineered to use Web standards such as XSL (http://www.w3.org/TR/xslt), XPath (http://www.w3.org/TR/xpath), DOM (http://www.w3.org/DOM/), etc. Although using standards has obvious advantages such as interoperability and being able to use off-the-shelf implementations, we experienced a number of difficulties. We highlight some of these below, and provide further discussion later in the paper:

• XML resources are required. XSLT is the only standardised transformation language for Web languages. It can, however, only be used with XML-based languages. XHTML has been introduced to bridge the gap between XML and HTML and can be used with XSLT, but unfortunately few pages on the Web are in XHTML format.
• XPath has limitations. XSLT uses XPath to navigate through the structure of XML-based documents. Although XPath is widely adopted, it does not provide the fine granularity required in applications such as ours. For example, one cannot refer to arbitrary portions of the document, regardless of element boundaries.
• XPointer is not a standard. XPointer (http://www.w3.org/TR/xptr-xpointer/) provides solutions to some of the above limitations, but it is not a standard, cannot be used with XSLT to process documents and is not supported in a consistent way.
• Different referencing techniques cause a mismatch problem. For our application (and indeed for many other Semantic Web applications), a variety of techniques may be used to annotate documents. But different techniques typically mean different ways of referencing document fragments. In order to use generated annotations consistently, one needs to match these references–an error-prone and effort-demanding process.
• Resources may change. An XPointer is not tolerant of modifications in a document. Even small modifications can easily invalidate an XPointer.
• Using standards does not always mean easy integration. Although standards such as the DOM provide, in principle, a standard way of representing, accessing and manipulating documents, the reality is that different implementations use the DOM in different ways. Combining components based on the DOM often means tackling nitty-gritty technical problems.
• Is embedded markup harmful? With XSLT, XPath and XHTML, only certain kinds of annotations can be represented and applied to documents, because of the hierarchical representation of data in XML-based documents [80].

The rest of this paper is organised as follows: Section E.2 presents the COHSE architecture in detail. Section E.3 explains how Web standards are used in the new architecture and how COHSE is deployed in a portal. Section E.4 discusses some lessons learnt and experiences gained. Section E.5 presents some related work and Section E.6 suggests some future steps to be taken.

E.2 The COHSE Architecture

Figure E.1: The COHSE architecture.

Figure E.1 illustrates this architecture, which is based on the following three components:
1. An Ontology Service (OS): supports interaction with ontologies by providing services such as mapping between concepts and lexical labels (synonyms), providing information about specialisation (sub-classes) and generalisation (super-classes) of concepts, description of a term, etc. [7]. This service has a simple HTTP interface which can host third-party OWL (http://www.w3.org/TR/owl-ref/) ontologies. Reasoning about ontologies (e.g., the construction of classification hierarchies or the classification of arbitrary conceptual descriptions) can be performed through the use of an OWL Reasoner (for example, FaCT, RACER or Pellet).
2. A Resource Manager (RM): stores and maintains external annotations (i.e., semantic annotation [54]) and maintains mappings between resources which are used in two ways: the concept-to-resource mapping provides candidate targets for links, and the resource-to-concept mapping provides candidate source anchors. Therefore, the RM is used for both source anchor identification and discovery of possible target anchors. Different techniques are used to enrich our resource repository; currently, we use lexical matching algorithms, manual annotation with our annotation tool [8], external third-party services (such as the Google Web Services (http://www.google.com/apis/) or Amazon Web Services (http://www.amazon.com)) and the GATE (A General Architecture for Text Engineering) framework with extended ontological services [39]. These give us flexibility to discover different sets of resources.
3. A DLS: responsible for modifying pages and adding links depending on the information provided by the OS and RM. This agent mainly processes documents and applies identified link sources into the document, along with the possible link targets. Currently, the DLS is implemented as a browser extension and a proxy, where both have their own bespoke techniques to process and augment documents.

By using these components COHSE provides two different link types, based on the way the source anchors are discovered. In both, destination anchors are identified by the RM.
1. Generic: a set of conditions are specified that must be met for a resource to be included [101]. For example, source anchors are identified by analysing a document with a lexical matching algorithm using the terms from an ontology, which are provided by the OS.
2. Specific: resources themselves are stored via a pointing mechanism [71]. Source anchors are identified by using an RM that maintains external annotations populated by different techniques.

Based on the client-server architecture of the Web [71], different Open Hypermedia Systems (OHS) utilised different techniques to augment Web pages with extra links [14]. There are two basic approaches: extra links are displayed alongside the page, or are inserted into the page [2]. COHSE uses the latter approach, as this is the closest to how users interact with embedded links on the Web. However, this means documents need to be modified or transcoded (transcoding is the act of converting documents from one form to another). This transcoding can occur at one of three different places (see Figure E.2); these are discussed below with their pros and cons:

Figure E.2: COHSE DLS Agent in use.

1. Server side deployment: links can be added by a specialised server before delivery. The end user need not perform any installation; however, only one Web site is processed at one time. Moreover, one possibly needs to have rights to modify the documents, or to know how the documents are generated, in order to process them properly. This could, however, solve the problem of dynamically-generated pages, where storing annotations referencing these documents is very difficult. Limited support can be provided for customisation, as the server needs to keep track of users and store their preferences persistently. Moreover, this can be a problem because of privacy issues.
2. Intermediary deployment: links can be added by a proxy server that acts as an intermediary between a client and server, as discussed in [5]. A proxy can be used to rewrite documents and add links where appropriate. Although a proxy implementation does not need specialised browsers, it can cause processing delays and a need to possibly re-write URLs to ensure that users continue to browse through the proxy. Although this opens up the use of COHSE to alternative platforms such as WAP and mobile phones, users may need to modify the browser's configuration, which can be unwieldy if the user does not wish to continually use the tool relying on the proxy. To overcome this limitation, COHSE employed a rewriting proxy which allows clients to make use of the proxy without adjusting their browser settings. The rewriting proxy takes requests in the form http://proxy_url/http://page_url, where http://proxy_url is the URL of the rewriting proxy and http://page_url is the required page. Although customisation can be easier with a proxy, storing users' preferences persistently is troublesome.
3. Client side deployment: links can be added by a plugin or an extension to the browser that manipulates the DOM object within the browser. This is easy to prototype, since the underlying development framework of the browsers (e.g., Mozilla (http://books.mozdev.org/), Internet Explorer (http://www.microsoft.com/)) can be used. However, this has the disadvantage of being tightly coupled to a specific browser and requiring local installation, which can sometimes be difficult within intranets. Although storing user preferences persistently is easier with client side implementations compared to the others, as the underlying preferences structure of the browser can be used, maintenance and communication of these preferences with external components have to be performed in an ad-hoc manner. The use of a plug-in may also make it more difficult to share information between users. This may be a problem if we wish to apply some kind of collaborative filtering techniques in order to customise the behaviour of the DLS.

COHSE has, in the past, been deployed as a proxy built on top of Muffin (http://muffin.doit.org/) and as a suite of plug-ins to Mozilla (including an annotator that can be used to create new metadata [8]). Even though the supporting services (OS and RM) provide a loosely coupled, flexible architecture for deployment of COHSE, each DLS implementation has its own way of processing documents, and of identifying and adding source anchors. For example, the Mozilla plugin is implemented using Javascript, while the proxy is in Java. They both use a similar approach in general, but there is little sharing of the application codebase. This is a potential problem when we wish to deploy the COHSE DLS in another context (e.g., in a portal), as this may mean another bespoke, ad-hoc technique to process documents. In the following section we discuss how a move to XSL and other related standardised technologies can help us to overcome this problem.

E.3 Dynamic Linking via XSL

Figure E.3: The new COHSE architecture – annotation producers.

Figure E.3 and Figure E.4 illustrate the new COHSE architecture, which is based on XHTML, DOM, XSL, XPath (XPointer) and CSS. This new architecture distinguishes between annotation producers and consumers. Figure E.3 highlights the components that contribute to the identification of resources (i.e., annotation producers) and Figure E.4 highlights the components that use the resources identified by the producers (i.e., annotation consumers). In order to enable the communication between these two parties, the architecture includes a number of mediating components, including the OS, the RM and a new service called the XSLT Service. We explain the different components and discuss how producers and consumers are orchestrated to dynamically link Web pages (see Figure E.3 for 1–3 and Figure E.4 for 4–7):

1. HTML vs. XHTML. XSL Transformations (XSLT) provide a mechanism for specifying manipulations and translations on XML documents. Typically, XSLT is used to produce HTML renderings of XML documents. XSLT is the only Web standard that can be used to transcode Web pages–however, it can only be used if those documents are XML based. Although XHTML can also be used with XSLT, few pages on the Web are in XHTML format [101]. To address this problem, we use Tidy (http://tidy.sourceforge.net/) to translate pages from HTML to XHTML format. Tidy is used at both ends of our architecture: at the beginning of the annotation process and at the beginning of the transcoding process (encoded in the DLS). Tidy also helps us to fix malformed HTML pages, because it is crucial to have well-formed pages with proper syntax (see http://www.w3.org/TR/REC-xml/ for the requirements for a well-formed XML document) for using XPath and XPointer.

2. Different annotation techniques. XHTML pages are used by the manual annotation tool and the GATE component. GATE's ANNIE (http://gate.ac.uk/ie/annie.html) pipeline (an Information Extraction component) is extended with additional components to process documents using OWL ontologies. A crawler is used along with the extended GATE components to annotate web sites. With a central repository, introducing a new technique for annotation is easy, as long as the DLS and this technique agree on a referencing mechanism.

3. XPointer. In our central repository, identified resources are referenced and stored using XPointers. XPointer is based on XPath, but provides the fine granularity needed to point to sections of an XML document such as paragraph fragments. Using Tidy to transform HTML pages into XHTML allows us to use XPointer as a referencing language with HTML pages. Using XPointer gives us a common representation for storing annotations, and in principle should allow the use of multiple annotation techniques and tools.

4. Web Services (WS). With emerging WS standards such as SOAP (http://www.w3.org/TR/soap/), more and more services are becoming available. In COHSE, we are currently using the Google and Amazon WSs to identify target resources. In contrast to manual annotation and GATE, with these WSs the identified resources are not stored in our central repository. The services are invoked when extra COHSE links are requested from the RM. New WSs can easily be added to our collection of services (e.g., the eBay WS, http://developer.ebay.com/). The combination of a variety of WSs and ontologies provides us with a rich set of domain-specific resources.

5. XSLT Service (XS). XSLT translations are defined in terms of templates, which specify a node pattern and some action. The XS is an additional supporting service that has been introduced into the architecture. It aims to dynamically create an XSLT stylesheet that adds links via a transformation. The service works as follows: when links are requested for a page, the XS is invoked by the DLS. The XS communicates with the RM and OS to find the appropriate resources for that page and then returns an appropriate stylesheet. Since the RM stores resources in XPointer format and XSLT processes XPath expressions, the XS translates XPointers to XPath and uses character counting in template actions to handle character ranges (see the sketch below). The XS also produces and adds the necessary CSS (Cascading Style Sheets, http://www.w3.org/Style/CSS/) declarations to the generated stylesheet to embellish identified source anchors and to render the set of target anchors.

6. XSLT Engine. Having the required stylesheet generated by the XS gives us the flexibility to use any XSLT engine (e.g., Saxon, http://www.saxonica.com/) to apply the stylesheet to the DOM representation of the page. This is one of the advantages of using standards: it promotes reusability. In addition, it opens up the architecture and increases flexibility–the generated stylesheets can be applied in a number of different ways.

The last component (marked as no. 7) in Figure E.4 is discussed below. This shows how a DLS is deployed in a portal to support customisation.
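To illustrate the translation performed by the XS in step 5, suppose the RM has stored an annotation as an XPointer of the following (illustrative) form; the XS would split it into an XPath pattern plus a character offset and length, which the template action then handles by character counting:

    stored XPointer: xpointer(string-range(/html/body/div[3]/p[1]/text()[1], "", 5, 10))
    derived pattern: /html/body/div[3]/p[1]/text()[1]
    derived offsets: start character 5, length 10

This decomposition is what lets a purely XPath-based XSLT engine consume annotations that were recorded with the finer-grained XPointer scheme.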

Figure E.4: The new COHSE architecture – annotation consumers.

Figure E.5: The COHSE DLS Portlet deployed in the Liferay portal along with other portlets (15.10.2005).

Figure E.6: The COHSE DLS Portlet in action (15.10.2005). (The COHSE links are represented by small icons; when users click on these icons, a link box is displayed.)

E.3.1 Portal - Customisation

A portal is defined as "a Web application that commonly provides personalisation, single sign-on, content aggregation from different sources and hosts the presentation layer of information systems" [1]. A portal uses a number of pluggable user interface components called portlets that provide a presentation layer and enable aggregation of content from different sources. From a user's perspective, a portlet is a window in a portal that provides a specific service (e.g., a calendar). From an application development perspective, a portlet is a software component written in Java, managed by a portlet container, which handles user requests and generates dynamic content. JSR168 [1] standardises how portlets are to be developed by addressing preferences (storing customisation preferences persistently), user information, security, etc. Most important of all, it allows portlets adhering to JSR168 to be deployed to any portal that conforms to the specification, enabling interoperability between portals and portlets.

With the current COHSE DLS implementations (i.e., the Mozilla browser extension and the proxy), customisation has not been undertaken in a well-defined framework. For example, although some preferences are provided, they do not use a consistent mechanism for user profiling. Therefore, we have now implemented the COHSE DLS as a portlet, to exploit the advantages of portals as discussed below.

• Customisation: With a portlet, different options can be declared and stored persistently to allow customisation. JSR168 specifies a method for handling persistent preferences, and most portals that conform to JSR168 provide an implementation of it (a sketch of how such preferences might be declared follows at the end of this section). Therefore, using an existing portal provides us with the advantage of using an existing infrastructure to support customisation and user profiling.

• Single sign-on: This feature enables single sign-on across a range of applications. A user is required to enter their details (i.e., user name, password) once, and that information will be used to authenticate the user

to use other resources that will be accessed through the portal. Since the COHSE DLS relies on external services including the RM, OS and XS, this can be an advantage in passing user authentication information to these services.

• Aggregation: This is the process of integrating content from different sources within a portal. A portal COHSE DLS can be used along with diverse types of information from a variety of sources (e.g., search engines). For example, with Grid community portals (for some examples, see http://www.grids.ac.uk/), which provide a means to access different grid services, the DLS can be provided as another service.

• Authentication: Portals provide a mechanism to deal with authentication issues, which means an existing development can be reused without implementing a new one.

• User Profiles / Authorisation: Most portals accommodate different roles and user groups. This feature gives us flexibility to support customisation in different ways. For example, different user groups can have different customisation options depending on their role; for instance, 'administrators' would be able to customise supporting services (e.g., the OS) but 'end-users' would not. Moreover, having user groups allows us to provide personalised links [97]; certain types of links would be visible to a specific group of users (rather than to every user), or different user groups would get different types of information about extra COHSE links. For example, novice users in a particular domain would get definitions of the identified concepts, but experienced users would not.

Figure E.4 illustrates the architecture of the DLS portlet (see the part labelled 7). This portlet first uses Tidy to convert the page from HTML to XHTML, then it contacts the XS to get the XSLT stylesheet and uses an XSL engine to apply this stylesheet to the XHTML page. Finally, links are rewritten to ensure that users will continue to browse through the portlet. The DLS portlet conforms to JSR168, so it can be easily deployed to any JSR168-compliant portal (e.g., Liferay (http://www.liferay.com/) or the Sun Portal Server). Figure E.5 shows our portlet deployed in the Liferay portal. Currently, this portlet provides options to specify the locations of external services and some visualisation options. It accepts the URL of a page and then returns the transcoded page in a new window. For example, Figure E.6 shows how a page is augmented with this portlet; links added by the DLS are indicated by small icons, and when a user clicks on them, a link box is displayed with a list of possible target anchors. Although with our current implementation we have not fully exploited the advantages of portals as discussed above, this portlet provides the basis to be extended to support more sophisticated usage of user profiles and history. For instance, personalisation can be supported where the system makes decisions based on some observation of user behaviour [47], for example selecting ontologies automatically based on frequently used keywords.
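As an indication of how the portlet's options could be declared persistently under JSR168, the deployment descriptor (portlet.xml) allows default preferences to be declared per portlet. The preference names below are invented for illustration, but the surrounding elements are standard JSR168; a complete descriptor would also include elements such as portlet-class.

    <portlet-app xmlns="http://java.sun.com/xml/ns/portlet/portlet-app_1_0.xsd"
                 version="1.0">
      <portlet>
        <portlet-name>COHSE-DLS</portlet-name>
        <portlet-preferences>
          <!-- Hypothetical COHSE options stored via the JSR168 preference API -->
          <preference>
            <name>ontologyServiceURL</name>
            <value>http://example.org/cohse/ontologyservice</value>
          </preference>
          <preference>
            <name>maxGenericLinks</name>
            <value>10</value>
          </preference>
        </portlet-preferences>
      </portlet>
    </portlet-app>

At runtime, the portlet reads and updates these values through the javax.portlet.PortletPreferences interface, and the portal takes care of storing them per user.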

E.4 Experiences - Discussion

The DLS portlet demonstrates how the reengineered architecture (see Figure E.4) can be used to deploy the COHSE DLS in a different context. Although we have successfully managed to do this by using Web standards, it has not been a straightforward process. We have tackled a number of issues that we believe provide insights for using these standards. Particularly, these issues are important for scalability of annotation processing in Semantic Web applications. Here we revisit the issues that are already enumerated in the introduction. XML resources are required With COHSE, adding or overlaying links to a page means modifying the underlying structure of the document before it is delivered to a user. When we look at the existing Web standards, the only standard that can be used to do this is XSLT. However, XSLT can only be used with documents that belong to the XML family (which also includes XHTML). The vast majority of the pages on the Web are still in HTML format [101]. To address this problem, we have used Tidy to transform documents from HTML to XHTML. We could have introduced a restriction in our system to only accept XML based documents as input. However, this would not be a feasible solution as COHSE is an open hypermedia system that promotes an open system [71]. Using Tidy provides a solution to this problem, but means that one always needs to process the document before actually doing any processing. This introduces extra overhead into the process. On the other hand, using Tidy enables us to make sure that only wellformed documents are annotated and processed by our system which is an important issue for referencing documents. We believe that the best solution to this problem would be to promote widespread usage of XML family-based languages through better browser support for these languages (as stated in [101]). XPath has limitations Although XPath is the the dominant pattern language proposed to be used with XSLT, XPath does not provide fine granularity for referencing arbitrary fragments in a Web page. Broadly speaking, with XPath, patterns can be created that point to a particular node or a set of nodes but not to a string which is part of a node. As an example, the fol26 Liferay,

lowing sentence from Figure E.6 can be used to illustrate this:
  • an outer membrane that encloses the entire structure


  • An XPath expression can be created that points to “outer membrane” as it means pointing to the element , but not for the text “membrane” which spans part of the node. Clearly if we process documents dynamically (also known as generic linking [42]), we do not need to use any referencing mechanism. However, processing documents in advance has a number of advantages. For example, batch processing can be performed which we have already discussed with our GATE crawler; web sites are crawled and annotated which eliminates the problem of processing one document at a time. Moreover, annotations which are stored persistently (rather than being temporary) become “first-class” citizens and promote sharing of these annotations which is one of the main goals of Open Hypermedia Systems [86]. Therefore it is important to have a consistent, robust mechanism that allows fine-grained annotation of fragments of documents. XPointer is not a standard The XPointer xpointer() schema extends XPath by adding the ability to address strings, points, and ranges. With these extra functionalities, XPointer provides enough granularity to be used as a reference language in our application (and in other similar Semantic Web applications). Even though there are some approaches to using XPointer with XSLT, XPointer is lagging behind in the standardisation process, and these approaches are not widely adopted [98]. In order to overcome the limitations of XPath, in our application we use XPointer to store annotations. The XS then analyses these XPointers and breaks them into XPath patterns and character offsets. When the XS creates a template, it uses an XPath expression as the node pattern and the character offset to identify the appropriate text in the template action. Alternative techniques to XPointer would be character counting as used in GATE [39], fixed byte offsets as used in Microcosm [43], geometric positions as used in Adobe’s PDF27 [87], etc. But these approaches are mainly used for documents that have little or no inherent structure [16]. Moreover, as Davis stresses [42], although these are generic methods as they can be used with any document type, they are most likely to break if the document is edited without informing the external references to the document. Furthermore, XML-based documents have a well-formed hierarchical structure that we believe is an advantage over other document types and important to be used in any referencing technique. Referencing mismatch problem As we have already discussed, using different techniques to identify resources and create annotations in COHSE provides flexibility to identify resources in different ways. Unfortunately, different techniques can mean different ways of representing and processing documents, and referencing identified resources. For example, when GATE ANNIE processes an (X)HTML document, it removes all the embedded markup from the document and then stores identified resources by counting characters from the top of the document. For many applications using GATE it may not be that important to point back to the processed document. In 27 Adobe

Referencing mismatch problem As we have already discussed, using different techniques to identify resources and create annotations in COHSE provides the flexibility to identify resources in different ways. Unfortunately, different techniques can mean different ways of representing and processing documents, and of referencing the identified resources. For example, when GATE ANNIE processes an (X)HTML document, it removes all the embedded markup from the document and then stores identified resources by counting characters from the top of the document. For many applications using GATE it may not be that important to point back to the processed document.

In our application, however, it is crucial. In this case, we had to implement an algorithm to map the resources identified by GATE to XPointer expressions. Although we have successfully managed to do this, it meant tackling low-level technical problems such as handling multiple white space characters in the document. We believe this problem arises chiefly from the lack of a standard providing fine enough granularity to be used as a referencing mechanism for XML-based documents. This is an important issue for applications such as COHSE that aim to exploit third-party annotation techniques, or indeed any application that intends to use annotation of Web documents.

Resources may change The problem of keeping references pointing to the right positions when the document in question is modified [97] has been widely discussed in the literature. This is not only a challenge for the Web or the Semantic Web; it is also a challenge for other applications (e.g., Microsoft Word28 [18]). In the literature, this problem has been referred to as the "orphaning of annotations" [18], the "content reference problem" [42], the "editing problem" [43], the "link dangling problem" [97] or the problem of "robust intra-document locations" [87]. In order to overcome this limitation, a number of solutions have been proposed: 1. Don't bother [42]; 2. Have an off-line batch link verifier [42]; 3. Have ownership of the document, so as to have control over changes [42]; 4. Use heuristics to resolve these points [42, 18]; 5. Use versions of the documents annotated [42, 18, 81]; 6. Store the surrounding context as well [18]; 7. Store references in multiple ways and use algorithms to resolve them [18]; 8. Use best practice to generate robust XPointers [101]. Some of these solutions are not valid in our context, as we would like to annotate third-party pages. For example, we cannot do versioning, nor is it feasible to annotate only documents that we own. When our proxy and DLS cannot resolve annotations, they place them at the bottom of the page. In our portlet DLS, we choose the first solution above and do not do anything; we simply ignore these annotations. This may result in a loss of annotations. There are guidelines for generating robust XPointers, but these usually require a lot of knowledge about the underlying structure, which we usually do not have. For instance, using IDs in XPointers is a solution, but there is still no guarantee that the ID will always be the same and that it will always identify the element that we want to reference. The strengths of XPointer can also be its weaknesses; for example, its precision and sophistication make it extremely fragile [97]. In order to address this, we plan to exploit some of the techniques suggested by [18], where annotations are stored in multiple ways, in particular using contextual information.

Using standards does not always mean easy integration Using standards is important for scalability, reusability and interoperability. For example, since we use XSLT in our new architecture, we can make use of existing XSLT engines to process documents using the DOM and the generated stylesheet. Similarly, using JSR168 for developing our portlet enabled us to deploy it to any JSR168-compliant portal. However, from an application perspective, using standards does not always mean easy integration. Even though we use the DOM at different stages of our system, different bodies interpret this standard in different ways and provide slightly different extensions. For example, different browsers have their own way of representing the DOM. Therefore, manual annotations performed within one browser based on the DOM are not guaranteed to be resolved properly in other browsers or DOM processors. For example, suppose a page has a table node as follows:
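(The report's example markup was lost from this copy; the following reconstruction is illustrative.)

    <table>
      <tr>
        <td>...</td>
      </tr>
    </table>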


    Then Mozilla adds an extra ”tbody” element so the node above is represented as follows:
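(Again, an illustrative reconstruction of the lost markup.)

    <table>
      <tbody>
        <tr>
          <td>...</td>
        </tr>
      </tbody>
    </table>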


Even though this seems like a minor detail, in the example above different pointers will be generated for the "tr" and "td" tags, because the hierarchy from the root of the document differs. Therefore, there is no guarantee that a pointer generated using the Mozilla DOM tree can be properly resolved by another DOM processor. [90] provides a detailed discussion of these kinds of issues. In this case, using Tidy provided a solution to this problem as well: having Tidy at the beginning of both the annotation and transcoding processes ensures that they both process the same DOM.

Is embedded markup harmful? COHSE overlays annotations on documents at runtime by using an XSLT engine and the stylesheet generated by the XS. However, because of the hierarchical markup structure of the document, certain types of annotation cannot be applied, as they risk breaking the well-formedness of the document. To illustrate this we use another example sentence from Figure E.6, in which the annotated text "membrane proteins" straddles an element boundary: ...of integral membrane proteins that form channels...

Although generating an XPointer for the highlighted text above is not a problem, it is not clear how extra markup or a node can be inserted to explicitly specify this kind of annotation, which straddles the boundary between nodes. Even though it is technically possible to add markup to surround the highlighted part "membrane proteins", the modified document would not be well-formed. This is not only a problem for COHSE because it tries to enhance links in a Web page by adding extra markup; it is a generic problem with the hierarchical, nested representation of any data, which is strongly criticised by Nelson [80]. The issue here is that representations such as XPath are concerned with making reference to document fragments with respect to the document structure. In some cases, however, we wish to make annotations with respect to the document content, and the structure and content do not always coincide or overlay nicely. The XS currently eliminates annotations that might cause this problem. This is not really a solution, however, as it effectively ignores the problem. Therefore, with our work here we are not proposing a solution; rather, we would like to re-pose the question "is embedded markup harmful?" for further discussion. We feel this is a particularly interesting question in the context of semantic annotation.
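To make the overlap concrete, consider the following hypothetical markup (the actual markup of Figure E.6 is not reproduced here):

    <!-- Suppose an existing element ends inside the annotated phrase: -->
    <p>... of <em>integral membrane</em> proteins that form channels ...</p>

    <!-- Wrapping "membrane proteins" in a new element would overlap: -->
    <p>... of <em>integral <span>membrane</em> proteins</span> ...</p>
    <!-- Not well-formed: <span> opens inside <em> but closes outside it. -->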


E.5 Related Work


The Web can be considered a closed hypermedia system, since the links are embedded in the pages [71], whereas an open hypermedia system is one where the reader is given the same access as the author [43]. Open Hypermedia Systems (OHSs) are well researched by the hypermedia community, and a number of systems have been developed, including MicroCosm, Chimera and Devise Hypermedia [86]. Several systems have also been introduced to provide an open hypermedia system on the Web, including DLS, DHM/WWW, Webvise and the Arakne Environment [14]. Using these systems, readers can create links and other hypermedia structures on top of arbitrary Web pages, and can share these links with others through the use of external linkbases. Compared to these traditional OHSs on the Web, COHSE introduces a combination of an open hypermedia system with ontological knowledge, which provides a framework for ontological linking. In fact, COHSE brings together the work on OHSs and the Semantic Web by using the existing structure of the Web. Semantic annotation, one of the core technologies for the Semantic Web, is the process of semantically enriching Web pages by adding metadata so that they are understandable not only by humans but also by machines [54]. Manual annotation approaches allow users to define annotations by hand, such as our COHSE annotator [8]. Automatic annotation approaches focus on extracting annotations automatically using techniques such as machine learning or natural language processing, such as our GATE crawler29. Some tools are also available to annotate pages as they are designed and created (e.g., SMORE30), which is important for bringing the annotation process into the early stages of design. OntoMat-Annotizer is a tool that can be used with existing pages, but it adopts an approach where annotations are injected into the page. Another project is Annotea31, a W3C project allowing annotation of Web pages using RDF (Resource Description Framework)32. The Annotea project uses the RDF framework to create statements about parts of a page [64]. Although this project is important in showing how RDF can be used to store annotations, it does not use ontologies explicitly in its annotation process, which is crucial for the COHSE approach. Another Semantic Web application similar to COHSE is MAGPIE; [45] states that MAGPIE was inspired by COHSE. MAGPIE is implemented as an extension to a browser; therefore, as a system, it has the limitations that we discussed for client-side deployment in Section E.2, for example requiring a local installation. There are also some emerging technologies introduced to overcome the limitations of embedded links on the Web. These include XLink33, which is proposed as the linking language for XML. Although XLink is a powerful language for representing links, offering many improvements over simple links on the Web, it is not widely supported by existing browsers. Even though [35] shows that XLink is a viable and flexible language, it is used there only as an intermediary format to transfer links represented in a particular open hypermedia format to HTML.

E.6 Discussion and Future Work

In this paper, we have presented a reengineered architecture for COHSE, based on Web standards including XSLT, XPath and the DOM. We have also explained how a new COHSE DLS implementation was developed and deployed in a portal using this architecture. Using standards brings advantages such as reusability and interoperability; however, we have faced a number of difficulties, which have also been discussed. Although the COHSE DLS portlet provides a number of customisation options and decorates pages with extra links accordingly, this portlet and its portal deployment can still be extended and improved by better employing the user groups, roles and personalisation features of portals. This could also include extending and modifying our supporting services (OS, RM and XS) to support better communication and information exchange with portals. In all our COHSE DLS implementations, when a link source anchor is discovered in a page, a link box is added that includes the term title, a description of the term and a list of related resources. Currently, we are working on using CSS to introduce different techniques to visualise these link boxes and to provide these techniques as customisation options. We also envisage conducting a user evaluation to demonstrate that different customisation options and visualisation techniques enhance users' browsing experience on the Web. In this paper, we have also presented the different techniques that COHSE uses to annotate documents, including GATE, manual annotation and lexical matching algorithms. Based on this reengineered architecture, we are planning to experiment with new techniques, in particular by further investigating the use of Natural Language Processing techniques. In conclusion, the reengineered architecture of COHSE presented here provides a loosely coupled, lightweight, exemplar architecture based on Web standards that enables dynamic linking of Web pages. Although using standards enables easy development and deployment of this new architecture in a different context (i.e., in a portal), a number of difficulties had to be solved. This paper has discussed our solutions, but what COHSE, and indeed many other Semantic Web applications, need is better long-term solutions. This paper aims to increase awareness of these issues, which are grouped into four below, along with suggestions for long-term solutions:
• XML resources are required We believe that the best solution to this problem would be to promote widespread usage of XML family-based languages, through better browser support for these languages or by encouraging designers and authoring tools to create XHTML documents rather than HTML.
• A robust, fine-grained referencing technique A consistent, robust mechanism needs to be standardised that allows fine-grained annotation of fragments of documents.
• Using standards does not always mean easy integration Standards need to be introduced in a timely fashion, before different bodies start to introduce their own bespoke, ad hoc techniques to address a particular need [90].
• Is embedded markup harmful? "Nothing is perfect".

26 Liferay, http://www.liferay.com/.
27 Adobe PDF, http://www.adobe.com/.
28 Microsoft Word, http://office.microsoft.com.
29 A list of annotation tools, http://annotation.semanticweb.org/tools.
30 SMORE, http://owl.mindswap.org.
31 Annotea, http://www.w3.org/2001/Annotea/.
32 RDF, http://www.w3.org/RDF/.
33 XLink, http://www.w3.org/TR/xlink/.


    Appendix F

AJAX Experimentation in COHSE
Short technical report, November 2005
Some issues are discussed with respect to AJAX technology and how it can or cannot be used for dynamically generating COHSE link boxes.

F.1 What is the problem?

In our current architecture, the list of links suggested by COHSE is created when the page is processed to identify the COHSE link anchors. However, this means that all of the content of the link boxes is inserted into each processed page. This approach has a number of disadvantages: 1. it requires extra processing load and time when the page is processed by a COHSE agent, and there is a possibility that some of these link boxes might never be visited; 2. although the added link boxes are hidden in the page, the page content is visually modified; the modified page looks quite long even though the bottom of the page appears empty. Therefore, in order to address these issues, we are investigating different techniques to create link boxes on demand (i.e., create a link box when the user requests to see it).

Figure F.1: A possible interaction diagram for COHSE with AJAX.

F.2 What is AJAX?
AJAX stands for Asynchronous JavaScript And XML, which is actually a name given to a set of already available Web technologies1. By using Javascript, an HTML page can asynchronously make calls to a server and fetch XML documents. These XML documents can then be used to augment the Document Object Model (DOM) of the HTML page. The main component of AJAX is the XMLHttpRequest object of Javascript, which is now supported by most mainstream browsers across all platforms. On the server side, the request arrives just like any other HttpServletRequest. Some advantages of AJAX technology can be summarised as follows: 1. an HTML page can pull in small pieces of data as needed rather than reloading the whole page every time a change needs to be displayed; 2. it can reduce processing demand on the server side; 3. well-designed code that is truly asynchronous also gives the server more time to respond to queries.

F.3 How can AJAX be used for dynamic linking in COHSE?
A service could be created to serve only the list of links that will be suggested by the COHSE DLS. This would make a clear separation between source anchor identification and link box creation (or target anchor identification). Pages would then be augmented with the appropriate AJAX Javascript code, and COHSE anchors would be modified to make an XMLHttpRequest to this service to get the list of links as an XML document. By processing this XML document, a link box would be created. Figure F.1 illustrates this new interaction model, which is as follows: 1. when a page is loaded, the XSLT service is invoked to get the stylesheet needed to identify the COHSE link anchors; this stylesheet also has the templates needed to add the required AJAX Javascript code to the page, and templates to add COHSE source anchors with the necessary XMLHttpRequest calls (this new interaction model introduces a new service that serves the list of links as an XML document, which makes a clear separation between the identification of link source anchors and the creation of link boxes); 2. COHSE links are added by applying the returned stylesheet with an XSLT engine; 3. when a user clicks on a new COHSE link, the external link service is invoked with the required parameters to create a link box; 4. when the link service returns the XML document that includes the list of links, the Javascript code (added using the stylesheet returned by the XSLT service) is used to process the returned XML document and create the link box dynamically.

1 For more information, see: http://java.sun.com/developer/technicalArticles/J2EE/AJAX/, http://www.ajaxmatters.com/ and http://en.wikipedia.org/wiki/AJAX.
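As a sketch of such a link service on the server side (the class name, request parameter and XML vocabulary below are hypothetical; a real service would consult the Ontology Service and Resource Manager):

    // Hypothetical sketch of the link service: a servlet returning the
    // list of links for a concept as XML, to be fetched via XMLHttpRequest.
    import java.io.IOException;
    import java.io.PrintWriter;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class LinkService extends HttpServlet {
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws IOException {
            String concept = req.getParameter("concept");
            resp.setContentType("text/xml");
            PrintWriter out = resp.getWriter();
            out.println("<?xml version=\"1.0\"?>");
            out.println("<linkbox concept=\"" + concept + "\">");
            // A real implementation would look these targets up in the RM.
            out.println("  <target href=\"http://example.org/\" title=\"...\"/>");
            out.println("</linkbox>");
        }
    }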

F.4 Why can we not use AJAX in COHSE?

XMLHttpRequest is not widely used to access third-party Web services. This is mainly because all the newest browsers impose a security restriction on the use of XMLHttpRequest: a client is not allowed to make an XMLHttpRequest to any server except the server from which the page containing the XMLHttpRequest was originally retrieved2. For example, if the AJAX application is in the page http://www.xxx.com/ajax.html, then XMLHttpRequests that come from that page can only make requests to services running in the same domain. Although in COHSE we would have control over the new link service, we process third-party pages, which means that the service and the pages would be on different domains. This is a problem because of the security restriction imposed by many browsers.

2 For detailed information see http://www.xml.com/pub/a/2005/11/09/fixing-ajax-xmlhttprequest-considered-harmful.html.


    Appendix G

The Visualisation Experiment in COHSE
Technical report, November 2005
The visualisation experiment in COHSE focuses on two topics: the visualisation of the extra links added to Web pages by the COHSE system (i.e., link embellishment) and the visualisation of the set of link targets suggested by the COHSE system, called the "link box" (i.e., link box visualisation). This document discusses the main goals of this experiment, introduces different techniques that can be used and discusses their pros and cons. The identification of different techniques and issues is important to support better customisation. As different user studies demonstrate [10, 84], a single standard technique cannot meet the requirements of all users. Therefore, one way to overcome this problem is to identify a good set of techniques and provide them as customisation options for the user. Moreover, as Shneiderman stresses, "user understanding is central and user control is vital for users to be successful" [92]. But we can still benefit from addressing the question "what is the most efficient, effective and satisfactory [48] way to visualise COHSE links and link boxes?", where the answer can be used as the default presentation style. The rest of this document is organised into two sections: the first discusses link embellishment techniques in detail and the second discusses how a link box can be visualised based on different aspects (e.g., structure, order, etc.).

G.1 Link Embellishment

The following list summarises what we want to achieve from the visualisation of extra COHSE links:
• Represent and differentiate COHSE links from hardcoded, embedded links, and differentiate COHSE links from the normal text in the page. In particular, COHSE links need to be sufficiently distinguishable from the normal text to indicate that they provide navigation.
• Avoid distracting the user and reduce their cognitive overhead and disorientation [37, 95]1. Conklin [37] indicates that cognitive overhead and disorientation are the two main problems that users experience when trying to navigate within hypertext systems. Cognitive overhead is defined as "additional effort and concentration necessary to maintain several tasks or trails at one time" [37]. This can occur when users are given a large number of choices and are required to make decisions about which one to select. Disorientation and/or cognitive overhead are associated with the number of links in a hypermedia system; as the number of links increases, reading and learning performance decreases [96]. To address these kinds of issues, different techniques need to be explored.
• Avoid breaking the continuity of a document by modifying it as little as possible.
• A style to achieve the best perceived affordance2 of clickability3 or navigability.
• Avoid cluttering the page layout.
• Being able to represent concepts identified from different ontologies.
The following sections first present different link visualisation styles and then discuss different ways to enrich interactivity.

1 Navigation in Electronic Worlds: A CHI 97 Workshop, http://www.acm.org/sigchi/bulletin/1997.4/jul.html.
2 Perceived affordance, http://www.jnd.org/dn.mss/affordances-and-design.html and http://www.interaction-design.org/encyclopedia/affordances.html.
3 Guidelines for visualising links, http://www.useit.com/alertbox/20040510.html.
4 Intermedia hypertext system, http://www.eastgate.com/HypertextNow/archives/ShowingLinks.html.
5 Adaptive Hypermedia (AH) systems are defined as "all hypertext and hypermedia systems which reflect some features of the user in the user model and apply this model to various visible aspects of the system to the user" [19].

G.1.1 Visualisation styles
[84] indicates that small variances in the appearance of link markers can have important effects on the readability of the text and on usability. It is therefore important how COHSE links are visualised, as they can change the environment with which users interact. [84] also highlights that "a single standard link marker cannot meet the requirements of all users", so the identification of different techniques is important to enable users to customise the visualisation of COHSE links. The following list discusses some styles that have been identified so far:
Icon An icon can be associated with a link anchor by placing the icon next to the anchor or above the anchor. The latter approach can require a large amount of space between lines and can thus distract users4. Using icons can be quite efficient where the density of text is low [82]. Icons have also been successfully used in other Adaptive Hypermedia (AH)5 systems, such as WebWatcher [62], where icons are inserted to highlight links relevant to the user's task.
Colour Changing the colour of the text or the background of the text (called overlay [84]). [84] indicates that the overlay technique is better than traditional underlined links; overlay links provide better readability of Web pages. The simplest form of this approach is supported by most Web browsers, which can distinguish between two states of a link (visited/not visited), and it appears to be quite useful. This approach has also been successfully used in different systems for conveying different kinds of information [95]; for example, [96] differentiates relevant and irrelevant links by using colour (or uses a different colour for links that point to non-HTML pages). But a number of issues need to be considered when using colour [82]:
• Colour blindness Some people are not able to recognise specific colours, so a potential solution would be to allow users to customise the colours used to render links.
• Combination It is necessary to know the background colour of a page in order to select a contrasting text colour.


• Consistency Traditional browsers render links as blue and underlined, so adding extra COHSE links in a different colour might be a problem6.
• Limitation Statistically, the user is unable to remember the associated information for more than three to seven colours [82].
Typography Different typographic styles can be used:
Font family One needs to be careful, since not all fonts are available on every platform.
Font size [96] uses different font sizes to differentiate relevant and irrelevant links. However, one problem is that the size of the characters in the possible fonts can vary considerably, resulting in changes in interline spacing which can be annoying for the user.
Font style Different kinds of underlining could be used (e.g., simple, double, dotted) [82, 102]. Source anchor text could also be rendered in bold or italic, but these styles might give unnecessary emphasis to COHSE links and distract the user (e.g., the Guide hypertext system7).
Box This style has been used in some hypertext systems, such as Hypergate7, in which boxes enclose the link text. This approach obviously avoids the problem of font emphasis, but the use of boxes can be so distracting and uncommon that users do not understand how to interact with them7.
Button Source anchor text could be presented as a button. Although the use of buttons would be a good way to present the perceived affordance of clickability, the presence of many buttons in the text may be harmful to its continuity and thus to reading. Alternatively, buttons can be added to a browser toolbar (similar to MAGPIE [44]).
Cursor Changing the appearance of the mouse cursor to differentiate COHSE links [82].
Sound This could be quite useful for visually impaired users8.
Status bar When the cursor moves over a COHSE link, a description could be displayed in the status bar.
No Change Web pages could be modified to present COHSE link targets only on demand. For example, when a user selects text, we could present extra options on the right-click menu to display the link box. Alternatively, the presentation style of the source anchor text could be modified when the cursor hovers over the text.
Blue-underlined In this approach COHSE links are not visually different from traditional blue, underlined links. However, the links are differentiated by other techniques; for example, the cursor shape may be different, or a tooltip could be used to indicate that it is a COHSE link. Furthermore, COHSE links could provide different behaviour to non-COHSE links: for example, the link's on-click behaviour could display the link box rather than moving to a new page. However, users may find this approach frustrating, as all links no longer behave consistently.

6 Top ten mistakes in Web design, http://www.useit.com/alertbox/9605.html.
7 http://www.eastgate.com/HypertextNow/archives/ShowingLinks.html.
8 Earcons, http://www.dcs.gla.ac.uk/~stephen/generalearcons/generalearcons1.shtml.
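As an illustrative sketch of how such styles could be offered as customisation options, an added anchor can carry a distinguishing class attribute so that each embellishment (icon, colour, underline style, and so on) becomes a user-selectable CSS rule. The class and method names below are illustrative, not the actual COHSE code:

    // Hypothetical sketch: wrap an identified text node in an anchor that
    // carries a class for CSS-based embellishment and a title for tooltips.
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.Text;

    public class Embellisher {
        static void wrap(Document doc, Text anchorText, String concept) {
            Element a = doc.createElement("a");
            a.setAttribute("class", "cohse-link"); // styled by user-chosen CSS
            a.setAttribute("title", concept);      // e.g., concept as tooltip
            anchorText.getParentNode().replaceChild(a, anchorText);
            a.appendChild(anchorText);
        }
    }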

G.1.2 Techniques to support interaction with COHSE links
Techniques that can be provided to support better interaction with the additional COHSE links include the following:
• Show All or links-on-demand [84] A shortcut key could be used to show all COHSE links. For example, HyperCard and Storyspace rendered links as boxes when the reader pressed a key, making links evident on request but keeping the text pristine the rest of the time9. [84] concludes that links-on-demand is not a paradigm that users are familiar with on the Web; however, their evaluation results show that links inhibit reading the text on the Web page. Therefore, this is a good technique to promote online reading as well as scanning (since links themselves support scanning10). Another possible problem with this technique is that it requires extra "effort" from the user to press a key; this may be solved by displaying all COHSE links by default and then providing the user with an option to toggle them.
• Highlight links on mouse over This gives users control over when to highlight links and does not clutter the page layout; for example, the Symbolics Document Examiner rendered link boxes when the cursor passed over the link11.
• Highlight related When a concept is highlighted in a page, all other suggested links for the same concept could be highlighted.
• Addition of COHSE links for a specific concept Rather than adding COHSE links for all concepts in the ontology, the user could be offered the option to choose a set of concepts from which links are derived.
• Move between concepts A shortcut key could be provided that allows users to move from one identified concept to the next.
• Filtering Some studies suggest that the number of links [96], or the links themselves [84], can reduce the readability of a document, despite the observation that links also promote scanning. Therefore, different filtering techniques may be a good solution to reduce the number of links added to a document. For example, a threshold value can be used to specify the number of links to be added to a document for a certain concept; or links identified from the upper levels of the selected ontology can be omitted; or the document structure can be used to filter links (e.g., the BBC news12 site does not have links within its stories but instead provides links in the sidebars). Filters can also address the cognitive overload problem, which is mainly due to limited short-term memory and the limited capacity of human information processing [40].

G.2 Link Box Visualisation

According to [40], an unstructured long list of links can cause problems of cognitive overhead and disorientation. This observation suggests that structuring and organising a link box in different ways can overcome these kinds of problems.

9 http://www.eastgate.com/HypertextNow/archives/ShowingLinks.html.
10 How Users Read on the Web, http://www.useit.com/alertbox/9710a.html.
11 http://www.eastgate.com/HypertextNow/archives/ShowingLinks.html.
12 BBC news, http://news.bbc.co.uk/.


Furthermore, structuring and organising link boxes can assist navigation [49], browsing [33] and comprehension [40]. A link box is composed of five components: 1. the concept title, 2. the concept description, 3. a list of related resources, 4. a list of broader resources, and 5. a list of specific resources. The choice of components contained in a link box could be optional and provided as part of the customisation options offered to the user. Note that these components serve different purposes: for example, the concept description promotes reading, whereas the list of links promotes scanning and navigation. Therefore we could either have an integrated representation of a link box, in which we present all the information about an identified concept, or we could adopt an incremental approach, where we present different degrees of information about the identified concept through different means; for example, the description of the concept could be presented as a tooltip and the list of links could be presented in a pop-up window.

Although our focus is on customisation (i.e., providing a degree of freedom to the user) of link boxes, [19] identifies several main types of adaptive navigation support that we can still explore. These are: 1. direct guidance (indicate the best node for the user to visit next, or highlight the most relevant link to follow), e.g., WebWatcher [62]; 2. adaptive ordering (links are ordered automatically to support users' tasks); 3. hiding (restrict the navigation space by hiding links to "not relevant" pages); 4. adaptive annotation (adding additional information about links automatically to support users' tasks). However, the main focus in all of the approaches above is to capture as much information as possible about the user and the current task (i.e., to create a user model), and then automatically adapt the environment to support the model in hand.

The following sections discuss different techniques to visualise COHSE link boxes based on the following aspects: 1. Structure–how can we structure a link box? 2. Order–how can we order links in a link box? 3. Preview and context–what kind of information can be included in a link label about the target page to help users decide whether to follow that link? 4. Interactivity–how can we make link boxes interactive? 5. Amount of information–how much information can/should be included in a link box? 6. Location–where can/should we locate the link box?

G.2.1 Structure
Here different techniques are discussed to organise (group, categorise and cluster) the list of links in a link box to support better and faster comprehension of the suggested links [40]. Figure G.1 compares a structured and an unstructured list of links.
List Separate lists can be used for broader and specific (narrower) resources (see Figure G.1). Lists can also be organised horizontally (as an inline list, e.g., item1, item2, item3) or vertically. Nested lists can be used to present the hierarchy of the concepts (via indentation to support equivalences between levels of hierarchies [40]) or the hierarchy of web sites (similar to the way in which Google indents search results to indicate the relationship between pages). Lists can be visualised with or without bullets, etc.
Tabular Multi-columned or multi-rowed tables can be used in different ways. For example: (1) we could have a three-columned table in which the lists of resources, broader resources and specific resources are each located in a column; or (2) resources can be presented in a list and a two-columned table can be used for broader and specific resources; or (3) two different tables can be created for broader and specific resources, where the suggested list of links for a concept is presented in a specific column (see Figure G.2).
Menu Different menu styles can be used; a list of links may be converted into several menus. The following techniques can be used:
Hierarchical menu This menu organises the menu items into groups, with each group represented by a menu item. When the user selects a menu item corresponding to a group, the members of the group are displayed in a second menu to the side of the original menu (see Figure G.3). Although this kind of menu structure solves the problem of displaying a long list of items, it requires the user to have a good understanding of the groups, and thus of the hierarchy in the ontology. Furthermore, it is difficult for the user to gain an overview of all the suggested links. However, this approach can be quite efficient when the groupings are stable, as in most general-purpose applications [10].
Scrolling This kind of menu includes a small arrow at the top and at the bottom of the list of menu items. When the user clicks on those arrows, the list is scrolled up and down [10]. This kind of menu is a good way to display a long list of links for specific or broader resources (see Figure G.4).
Drop-down menu This is similar to the hierarchical menu style but does not have a nested structure. The menu only displays a list of items (see Figure G.5).
Scrollbar A scrollbar is used that controls the portion of the menu that is visible (as commonly used in browsers). Although this offers a solution to the problem of presenting a long list of items, COHSE link boxes are usually quite small and may not have enough space to accommodate the list (see Figure G.6).
Fish-eye view In a fisheye menu all items are always displayed in a single window that is completely visible, but items near the cursor are displayed at full size, and items further away are displayed at a smaller size (see Figure G.7) [10].

Figure G.1: List of links: (a) an unstructured list of links; (b) a structured list of links.
Figure G.2: Tabular structure: (a) a multi-columned table; (b) multiple tables.
Figure G.7: Fisheye menus: (a) zoom in or out; (b) hide certain items.


Grouping We generally have two sets of resources: broader and specific resources. Within these sets we can support different groupings: group by concept (see Figure G.1), group by page, group by annotation date, etc. Groups could be used as higher-order information units and may help users preserve the context of links [40].
Multi-pane A multi-pane structure usually consists of a predetermined number of separate panes, in which each pane contains the contents for a separate level of the hierarchy. [33] shows that using a multi-pane structure is better than having a stable, fully-expanded table of contents. This approach is similar to the "column" view provided by the Mac OS X Finder (see the example illustrations in Figure G.8).
Small Web page We could also design link boxes as small Web pages. For example, Figure G.9 presents how a link box can be designed to have a title, with a menu either at the top or on the side. Such a design can help us to reduce the size of the link boxes.

Figure G.3: A hierarchical menu.
Figure G.4: A menu with scrolling arrows at the top and at the bottom of the list, as used in Microsoft Word.
Figure G.5: A drop-down menu style.
Figure G.6: Scrollbar and combo box.

G.2.2 Order
Two different types of ordering can be supported: customised and personalised. With customised ordering, the user selects how to order the list of links, whereas with personalised ordering the system decides how to order the links. For personalised ordering, different factors can be used, including the user's background (profile), the user's search goal (a set of keywords or an ontology), current interest, etc. [19]. Here we outline the various options for ordering links:
Alphabetical Either by concept title or by page title (see Figure G.10).
Temporal By the date that annotations were created or added to the Resource Manager (e.g., latest resources first or last).
Hierarchical and site structure Several studies have shown that hierarchical representation is an effective approach to support browsing [33]. Link boxes can be presented hierarchically, either by concept or by the structure of sites (home pages and sub-pages, Google style).
Importance A possible ordering would be according to the PageRank13 algorithm, or we can determine the number of times a concept is identified in a page and use that number as a level of relevancy. For example, if pageX has 20 occurrences of conceptY and pageZ has 10 occurrences of conceptY, then we could determine that pageX is more "relevant" than pageZ and thus provide a link to pageX at the top of the link box (the closer to the top, the more relevant the link is [19]). The main drawback with this technique is that it makes the order of links non-stable, and [65] shows that a stable order of options in menus is important for novices.


13 PageRank algorithm, http://www.google.com/technology/.

Figure G.8: Multi-pane: (a) vertical pane; (b) horizontal pane.
Figure G.9: A mini Web page: (a) menu on the side; (b) menu at the top.
Figure G.10: Alphabetical ordering of links.

Use page structure granularity By making a distinction between links pointing to a complete page and links pointing to parts of a page.
Provenance By using the available provenance data, different ordering schemes can be created.
Unordered List Present links without conforming to any ordering scheme.

G.2.3 Context and Preview–Summaries
Providing information about links that can hint at the content of the destination has been proposed in hypertext [68, 49] and particularly for the Web [93, 67, 82, 102, 51, 56], and recently Web browsers have begun to provide facilities such as displaying popup tooltips to show a link's "title" attribute14. Additional information can eliminate the need to follow a link and give users the opportunity to read the additional information in its related context [102, 82]. Moreover, users may be able to get the required information within the context of the link, which is important for comprehension [95, 40]. Finally, [24] indicates that the annotation of links (links augmented with extra information) can reduce the number of visited nodes and thus further reduce the learning time. Similarly, [56] concludes that enhancing link descriptions with context and preview is both useful and important. The aim here is to support the decision-making process of users in determining whether or not to follow a link. Usually the user does not get to see the link target until s/he has activated the link, which can lead to problems: (i) the decision about which links to follow and which to ignore causes cognitive overload; (ii) it can be burdensome to follow a link just to find out whether it is relevant; and (iii) contextual changes can also be a problem when a user needs to turn back to the page where s/he started [67]. The following list presents some techniques to augment link labels in a link box:
Google summaries Figure G.11 illustrates how COHSE links could be augmented with summaries from Google.
Keywords A number of keywords or concepts can be identified to describe a page when it is annotated. For example, [26] presents how different algorithms can be used to extract keywords from Web pages for later use as link summaries for small-screen devices.
Page Title Using the title of the target page.
Provenance Information about who created an annotation, when, and under which conditions. If Web pages include Dublin Core metadata15, then that information can be stored for annotated Web pages.
Heuristics Use a number of heuristics to summarise target pages [57] (e.g., the most important sentence, the first paragraph, the list of links in that page, the first heading, or the list of headings).

Figure G.11: List of resources augmented with Google summaries.

14 Using link titles to help users predict where they are going, http://www.useit.com/alertbox/980111.html.


15 Dublin Core, http://dublincore.org/.

    Figure G.14: A short summary displayed in a tooltip.

Figure G.13: Power Browser's style of displaying summaries, revealed progressively in three stages.

Structural information about that page The number of words, number of links, number of headings, section names, distribution of media types, size of the page, etc. [82, 57].
Granularity Specifies whether the destination is an entire document or a part of a document [82].
MIME Type Specifies the MIME type in which the destination data is presented (e.g., text, sound, image) [82].
Target Type Specifies the type of the target document (e.g., PDF, HTML), or makes a distinction between HTML and non-HTML documents [96].
Location Indicates whether the destination is in the same page, the same site or a different site [82].
Interaction with these summaries could be supported in different ways:
Show Always All summary information will always be visible.
Turn on and off An icon can be used to indicate that a summary is available; when users click on it, the summary is displayed. The icon is then changed to indicate that the summary can be contracted (see Figure G.12).
Three levels Three levels of summaries can be revealed gradually. For example, in the first level a list of keywords about the target page could be displayed; in the second level a sentence explaining the content of the target page could be displayed; and in the third level a paragraph about the target page could be displayed (for more information about how this technique is applied to summarise pages for small-screen devices, see [26]; Figure G.13 shows how Power Browser displays summaries progressively in three stages [25]).
Tooltip Very short summaries can be presented as a tooltip on links, as in conventional browsers (see Figure G.14).

Figure G.15: Expand/contract for broader and specific resources.

G.2.4 Interactivity
By using different techniques, we can make link boxes more interactive.
Expand/Contract [33] demonstrates that expand/contract mechanisms reduce browsing times compared to a fully expanded, stable view for large hierarchies. High-level information could be displayed contiguously, and users could be given the choice of viewing a specific item and its sub-levels on demand. Link boxes can be long, and even if a scrollbar is added it can mean that the user is required to perform a considerable amount of scrolling and may get lost in a large list, thus requiring more reading to locate the items of interest. The expand/contract approach would allow us to limit the number of possibilities in order to avoid cognitive overhead for the user: if there are too many links, the user may have difficulties deciding which one to follow [82]. Moreover, if the list is long, users can quickly forget what part of the list is being viewed. This approach can be used to shorten the list and preserve its logical structure and context. However, it might require the user to constantly interact with the link box to identify the information they are interested in. The expand/contract technique can protect the user from complex link boxes and thus reduce cognitive load. For example, we could display the top five relevant links and hide the remainder (see Figure G.16). The top five could be determined using different metrics, for example relevancy to the current concept; alternatively, we could present links to pages that the resource manager knows most about. Some example systems that use different hiding techniques for link adaptation are summarised in [19]. This feature can be used in different ways, some of which are described below:
• Minimize resource lists We can have a threshold value, and if the number of resources is greater than the threshold, we display only the items up to that threshold value; the rest are displayed on demand.
• Group by If we group resources by concept, then we can contract the list for that particular concept (see Figure G.17).

Figure G.12: Google summaries that can be (a) expanded and (b) contracted.

Figure G.17: A way of using expand and contract: (a) expanded; (b) contracted.

• Broader/Specific We can present only the list of resources, hiding the broader and specific resources and displaying them on demand (see Figure G.15).
• Summaries Summaries can be hidden and displayed on demand (see Figure G.12).
Tooltip By using tooltips we can provide three different kinds of information: 1. the concept title and the ontology to which the concept belongs (this can be useful when multiple ontologies are in use); 2. the concept description; and 3. a typical link box (however, the interaction for this kind of tooltip may get quite complicated). Depending on the user's experience, providing the description of the concept as a tooltip can be useful. For example, if the user is a novice in a specific domain, then the description can be useful and help the user understand the field better. However, if the user is an expert then the description can be an obstacle and irritating.
Scrollbar We could organise link boxes as small Web pages and add scrollbars to support movement within these pages.

G.2.5 Amount of Information

The following list summarises some techniques that can be used to change the amount of information in a link box:
Hierarchy level In the current implementation, only resources one level up or down from the identified concept are displayed, but with the different visualisation and structuring techniques explained in the previous sections we can provide resources for various levels of the hierarchy in the ontology (this can again be a customisation option).
Fish-eye views With this technique some items can be hidden so that more resources can be included in the list (see Figure G.7).
Number of Resources per Group As discussed before, a threshold value can be used for the number of links that will be suggested for a particular group, etc.

Figure G.16: Only the top five links are displayed and the rest are hidden.

Filtering Sometimes the links in a link box point to the same target page. We could introduce a filter that removes duplicate link targets. Other filtering techniques are discussed in Section G.1.2.

G.2.6 Location
[102] presents different techniques to add additional information (termed a gloss) at a link anchor, and their user evaluation study indicates that subjects required less time to use nearby glosses. This provides a hint that it would be better if COHSE link boxes were displayed close to the identified concept. The following list summarises possible places where a link box could be displayed:
• Extra COHSE links can be displayed as a pop-up box next to the relevant concept or at the top of the page.
• As in fluid interlines16, the link box can be displayed directly below the identified link anchor [102].
• A toolbar can be created to hold the extra links.
• The link box can be displayed in a footnote area at the bottom of the page. This might not be easy, as the notion of a footnote area is not common on the Web. Furthermore, [102] concludes that users tend to ignore footnotes or text popping up at the bottom of the screen.
• A margin can be created and COHSE links added into the margin (e.g., fluid margin [102]).
• Overlay is similar to interline, except that the link box could be overlaid on top of the text using a different colour (e.g., fluid overlay [102]).

    16 The fluid interline technique displays the gloss directly below the anchor.


    Appendix H

Shedding Light on the Hidden Web
Technical report1, December 2003

H.1 Abstract

    The terms Hidden Web, Deep Web and Invisible Web describe those resources on the Web that are in some way unreachable by search engines, and are potentially unusable to other Web systems such as annotation services. These hidden resources make up a significant part of the current Web. We provide firm definitions of the ways in which information can be “hidden”, and discuss the challenges that face those working with annotation in the Hidden Web. We do not attempt to provide solutions for these challenges, but a clarification of the terms involved is certainly a step in the right direction.

H.2 Introduction

The idea of the Hidden or Invisible Web–information resources which are in some way inaccessible–has been with us since the early days of the World Wide Web [50]. There are several definitions of these terms, but intuitively, the notion seems to encapsulate those resources on the web that cannot be indexed by search engines. In addition, many pages and resources to be found on today's Web are not static HTML authored by humans, but are generated by content management systems, or scripts on top of databases. Such information content has been characterised as the Deep Web [55]. As ever, of course, where intuitive notions are in use, we have conflicting interpretations of what the definitions of those terms are–what is it that makes up the Hidden/Deep/Invisible Web? In this paper we identify and classify the properties of those resources that are in some way hidden, and provide a more precise characterisation of the terms introduced above. We also provide examples of the problems that hidden web resources bring to Semantic Web applications. From an application perspective, we concentrate on annotation systems. We do not intend to supply detailed solutions to the problems of searching or annotation in the context of the Hidden Web, but instead provide a more precise definition of what it means to be hidden. This will then, hopefully, inform us in activities relating to the Hidden Web. We first deal with the Hidden Web, provide a survey of terms coined, and clarify our particular interpretation of these terms. We follow this with a detailed discussion of the ways in which resources can be hidden and discuss, where appropriate, mechanisms for "exposing" the hidden web. The latter part of the paper deals with using resources in the Hidden Web, in the particular context of annotation. After defining what it means to annotate a resource, we introduce the different actors in an annotation system and consider the impact of hidden Web resources on those actors.

H.3 Defining the Hidden Web

Here we provide historical definitions of the relevant terms along with our own, more precise, characterisations.

1 Peter Crowther was one of the authors of this technical report.

    H.3.1 Historical Definitions Three terms used in this context are almost synonymous: Invisible Web, Hidden Web and Deep Web. Invisible Web The Invisible Web consists of Information content that is invisible to conventional search engines. The term is credited to Dr. Jill Ellsworth and was coined in early 1994; it was used in an interview with the Canadian Computer Paper magazine in January 1996 [50]. Ellsworth’s definition includes content that happens not to be indexed by search engines at the time of searching because the resources can not be reached from the current set of roots, as well as content that cannot be indexed for some other reason. Deep Web A supposed synonym of ’invisible web’ coined by Brightplanet [74] to describe searchable database content on the Web. They “[...] avoided the term “invisible Web” because it is inaccurate. The only thing “invisible” about searchable databases is that they are not indexable nor able to be queried by conventional search engines.” (p.1) Hidden Web Synonym of ’invisible web’, and the term that now seems to be in most common use. H.3.2 Proposed Definitions The authors propose the following, more precise, definitions: Visible Web That content which is indexed by the union of the non-topic-specific public search engines. More precisely, information content that is at a particular time reached by the engines’ normal mechanisms from the set of starting URLs submitted to the non-topicspecific public search engines and indexed in a way that humans can search, This makes a number of things explicit: that we only consider public search engines rather than intranet engines; that the ability to interpret content is important (many sound files, for example, are reachable, but no engine currently indexes this content); that the Visible Web is ever-changing as new starting URLs are submitted and better content indexing becomes available. It also clarifies that a novel search engine that is non-topic-specific and publically available may have an impact on the Visible Web by, for example, using a new URL acquisition technique. This means that a definition of Search Engine becomes essential. Potentially Visible Web That content which could be indexed by the union of the non-topic-specific public search engines if everyone submitted the URLs of all their root pages to the engines. More precisely, information content that could at a particular time be reached by the engines’ normal mechanisms from a set of starting nodes submitted to the non-topic-specific public search engines and indexed in a way that humans can search, Hidden Web The complement of the Potentially Visible Web, i.e. information content that cannot at a particular time be reached by the engines’ normal mech-

    H-1

This split definition motivates the following discussion on "What is a Search Engine?".

H.3.3 What is a Search Engine?
Search Engine: a program that searches documents for specified keywords and returns a list of the documents in which those keywords were found. Although a search engine is a general class of program, the term is often used to specifically describe systems like AltaVista and Excite that enable users to search for documents on the World Wide Web and USENET newsgroups2. When Ellsworth coined the term Invisible Web, it is likely that she was referring to conventional search engines. However, there are at least two meanings of search engine, depending on whether one is referring to the software or the site:
• The collection of general-purpose search engine software systems, containing as members "Google's software" and "AltaVista's software". By this definition, a search appliance that indexes a corporate intranet but is not publicly available would count as a conventional search engine. For example, "Sun's search engine" would qualify here: it is a piece of software but not a web site. Software engineers often use this definition of search engine.
• The collection of general-purpose search engine software installed on publicly-accessible servers, along with their current state, containing as members "the collection of computers accessible via http://www.google.com/" and "the collection of computers accessible via http://www.altavista.com/". By this definition, a search appliance that indexes a corporate intranet but is not publicly available would not count as a conventional search engine. Librarians, educators and other Internet search engine users often use this definition of search engine.

H.3.4 Hiding Places in the Hidden Web
We present a number of reasons why a resource could be considered part of the Hidden Web, as defined above. For each, we describe the reason and discuss the challenges it presents to the search community. The reasons can be grouped into access denial, difficulty in interpretation, difficulty in discovery, time dependence, and lack of a simple identifier for the resource.

Access denial In general, access denial results from a desire of the content provider to limit access to resources. Several mechanisms may be used to achieve this.

Firewalled content A resource may not be retrievable by the user's browser or by the search engine because it is behind a firewall. This prevents the search engine from indexing the resource. A typical example would be content on another company's intranet. Clearly this is a state of affairs that must be allowed, and few researchers would consider this content part of the Hidden Web. Whether resources held behind a firewall in one's own organisation should be considered part of the Hidden Web depends on the definition of a search engine, as described above, and on whether that search engine sits inside or outside the firewall.

2 http://www.intermark.org/im-glossary.html

For example, a resource may be retrievable by the user's browser but not by a "standard" search engine such as Google, because the content is behind a firewall. In this case, the user (and content) sit within the firewall, but the search engine sits outside. The challenge is for the content provider to provide an appropriate solution for searching internal content as well as external, ideally combined into a single search interface.

Denied by robots.txt: Some content providers deliberately prevent indexing of resources by search engines for a variety of reasons. Some are technical: preventing browsers from entering except through designated pages so that frame sets can be constructed appropriately; monitoring searches performed by their users; or an inability to handle the sudden load from a crawler visiting the site. Others are commercial: a desire to prevent copyright content from being stripped from the site by a crawler and published elsewhere may lead to a technical solution to a legal problem. For example, until late 2003, Hewlett Packard's support site (support.hp.com) prevented crawler access by use of robots.txt. The challenges are to construct sites that are amenable to rapid direct access in the technical cases, and to find a solution acceptable to the content provider in the commercial cases. Some of the fixes to the technical cases may cause further problems. For example, some sites check to see whether a content page is being requested from within a frameset and serve the frameset if not; this behaviour can prevent a crawler from retrieving, and hence indexing, the content.
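A well-behaved crawler tests each candidate URL against the site's robots.txt rules before fetching it. The sketch below uses Python's standard urllib.robotparser module; the host and crawler names are hypothetical, chosen only for illustration:

    from urllib import robotparser

    # Parse the site's robots.txt (hypothetical host used for illustration).
    rp = robotparser.RobotFileParser()
    rp.set_url("http://www.example.com/robots.txt")
    rp.read()

    # Check each candidate URL against the rules before fetching it.
    for url in ["http://www.example.com/",
                "http://www.example.com/private/page.html"]:
        if rp.can_fetch("ExampleCrawler", url):
            print("may fetch:", url)
        else:
            print("denied by robots.txt:", url)

Content denied in this way never reaches the index, which is exactly how a resource ends up in the Hidden Web for crawler-based engines.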


Difficulty in interpretation

Difficulty in interpretation arises when information is presented using unusual protocols and formats, or when a deliberate attempt is made to obfuscate content.

Unusual retrieval protocol: Some resources are available on publicly-connected systems but are not retrievable by conventional Web protocols. Examples include CVS servers, content management systems and some proprietary document stores. This is a self-correcting problem, as HTTP is the "gorilla" protocol (the phrase "ten thousand pound gorilla" being used to describe a technology that dominates a marketplace) and is taking over as the mechanism for making content available. Fewer and fewer systems are unable to serve content over HTTP. The (rather weak) challenge is to make the crawlers cope with as many protocols as possible while the transition is still taking place.

Unusual resource format: Many resources that are retrievable via known protocols cannot be indexed by the Surface Web search engines because the resource is stored in a format of which the search engine has no knowledge. Examples include CAD drawings and old word-processing documents. Again, this is a self-correcting problem; XML is the "gorilla" format. Already, increasing numbers of systems can export in XML. Over time, XML may well replace proprietary formats for many purposes, allowing simple text extraction. In the meantime, the challenge for the search engines is to create tools to extract the content from the obscure document formats.

Low-fidelity content: Some retrievable resources are copies of an original in a different format, where some information content has been lost as a result of format translation. One example is a PostScript printout of a LaTeX document made available to CiteSeer [69], where the text can be difficult to reconstruct from the page description language. Another example is when a web designer uses an image instead of text on a page to preserve formatting and other graphical niceties, without providing alternate text. The challenge for the community is to encourage publication of content in its original format, or to develop more sophisticated algorithms for interpreting these formats. There is a tension between encouraging publication in a lossless form and encouraging publication in a standard form as described above.

Deliberate obfuscation: Some content providers deliberately obfuscate some parts of resources, typically in order to prevent automated systems from gaining access to the content. Examples include a UK insurance company that serves quotes from its online quote system as images rather than text, presumably to prevent brokers from scraping the site for quotes rather than going through its broker systems. The challenge is to change the mind-set of the content provider, or to develop more sophisticated techniques (such as OCR of the image, in this case) to "peer through" the obfuscation.

Difficulty in discovery

Many resources are retrievable by URL, but there may be no way of navigating to the URL from a fixed set of roots. An example would be an online dictionary (http://dictionary.reference.com/), which provides search facilities but no browsing. Search-based sites, such as manufacturers' support sites, may have no or limited support for browsing, and may provide few hyperlinks to content. The content is often volatile, with parts of the site being re-organised regularly, so it is very difficult for a search engine to index the content even given a large subset of the documents. The problem can sometimes be worked around by submitting a directory URL as a root to a search engine; in some cases, the resources are stored on filestore and the sites allow directory access over HTTP. This can provide a route in for crawler-based search engines. However, as more sites move to databases or content management systems for content storage, the scope for directory access is likely to reduce.

Time dependence

Many resources retrievable by URL change over time. This may cause the search engines to be out of date or to return false positives. Examples of such sites include the Top 10 selling hit singles for this week, films at the local cinema, news tickers and blogs. At least in the last two cases, there may well be a time-invariant URL on the same site for the content. For example, the BBC news site (http://news.bbc.co.uk) provides a News front page that refers to the current top stories and is constantly updated. Each individual story, however, has a time-invariant URL. Some search engines, notably Google, work around this problem by supplying a cached copy of the resource as it was when last indexed. The challenge for the search community is to find or mint a persistent URL for the time-varying data.

Lack of simple identity

Some content cannot be represented solely by a URI; further context is needed. Such resources challenge not just the search engines, but also bookmarking systems and any other system that assumes that a URI is sufficient to identify content.

Post-only forms: Some resources are only available after submitting a form via a POST mechanism, such as a search form. The resource may have no URI of its own, and may only be generated in response to the form submission. By contrast, form submissions via GET do provide a URI that is sufficient to identify the resource. Forms for advanced searches of databases are the most likely to fall into this category, simply because the maximum amount of form data to be encoded may exceed any reasonable (or supported) URI length. For example, Microsoft SharePoint Portal Server 2001 (http://www.microsoft.com/sharepoint/) has an advanced search page that posts back to the server; an HTTP GET operation is ignored. This may be because Microsoft Internet Explorer has a 2047-character limit on the length of a URL. Frequently, all that would be needed to make the results of these search pages indexable, and hence visible, is for the content provider to accept a GET operation with similar parameters. Typical queries to identify a resource tend to be short, and will easily fit within typically-implemented URL limits. The challenges are:

• finding a way to refer to the page at all;
• finding a way to make that reference operational, so that the content can be retrieved on demand.
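The distinction is easy to see in code: a GET submission's parameters can be folded into a self-contained, bookmarkable URL, whereas a POST submission's parameters travel in the request body and leave no URI that identifies the result. A minimal sketch using Python's standard library (the endpoint and parameter names are hypothetical):

    from urllib.parse import urlencode
    from urllib.request import Request

    params = {"author": "yesilada", "year": "2007"}

    # GET: the query is part of the URL, so the result has a citable identifier.
    get_url = "http://search.example.com/find?" + urlencode(params)
    print(get_url)  # http://search.example.com/find?author=yesilada&year=2007

    # POST: the same parameters travel in the request body; the URL alone
    # no longer identifies the resource that comes back.
    post_req = Request("http://search.example.com/find",
                       data=urlencode(params).encode("ascii"))
    print(post_req.get_method(), post_req.full_url)  # POST http://search.example.com/find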




User agent sniffing: Many sites serve different content depending on the user agent. For example, Microsoft Internet Explorer and Netscape Navigator have at times had sufficiently different font models that text sized for one may display many times larger or smaller in the other; site authors may alter the content served in order to correct for this difference. This is sometimes considered appropriate; for example, a Web page that is delivered to a PDA or to a WAP phone may contain cut-down navigation elements or more limited content. However, the variation can lead to false positive search results, where a crawler was presented with content that was removed before presentation to a limited device such as a PDA. More insidiously, some sites choose to present one piece of content when viewed by a user agent regarded as "human", such as IE or Navigator, but different content designed to enhance the site's ranking when the resource is retrieved by a suspected crawler. Search engines discourage this practice, as it reduces the accuracy of the index, but it is still prevalent. As it is effectively impossible, and arguably undesirable, to eliminate browser variation, the challenge is to limit the content providers' wish to tailor the content served to the crawler. This can only be done socially, by applying appropriate penalties to sites that are caught, such as removal from the index. Unfortunately, this decreases the coverage of the Visible Web, so web users lose out either way.

Authentication: Some sites prevent any content from being viewed unless the user agent provides appropriate authentication; these sites are often extranet sites or extensions of intranets, and probably should not be considered part of the Visible Web. Other sites serve different content depending on authentication, which may be made by cookie, username/password login, certificates or other means. For example, a registered user accessing Experts Exchange (http://www.experts-exchange.com) may obtain the full text of any solution on the site, whereas a guest visiting the same URL will retrieve a page with a teaser and an invitation to register. This illustrates a dilemma for those content providers wishing to make money from users subscribing to their content. Ideally, providers would want search engines to index the full article, giving the greatest possible chance of a user visiting the page. However, as soon as a crawler takes a copy of the content, it may be cached and exposed for all users to view free of charge, removing any revenue opportunity for the content provider. In the long term, the challenge is to form a partnership between content provider and search engine, with the provider allowing the engine to crawl the complete content in return for the engine not serving cached copies. However, this requires search engines to be able to authenticate to the content provider's servers.

Language dependence: Some sites serve different content depending on the user agent's accepted language. An example is MSN support (http://www.msn.co.uk/msnhelp/home/), where support articles cannot be retrieved unless there is a translation in an accepted language. As a result, a support technician whose browser is set to US English and a customer whose browser is set to UK English may be unable to agree on the content of a URL, as each sees a different version (and one may be unable to retrieve any resource at all). The challenge for the search engine community is to determine where different content may be served depending on the accepted language, and to find a way to retrieve all possible versions of that content, preferably associated with the language.
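Both user agent sniffing and language dependence hinge on request headers that the client controls. The fragment below, a sketch using Python's standard library against a hypothetical URL, shows how a crawler (or an annotation tool replaying a stored browsing context, as discussed in Section H.4.3) can pin down exactly which variant of a resource it receives:

    from urllib.request import Request, urlopen

    # Fetch the same URL twice with different client identities; a sniffing
    # server may return two different documents (hypothetical URL).
    url = "http://www.example.com/support/article42"
    for agent, lang in [("Mozilla/4.0 (compatible; MSIE 6.0)", "en-US"),
                        ("ExampleCrawler/1.0", "en-GB")]:
        req = Request(url, headers={"User-Agent": agent,
                                    "Accept-Language": lang})
        with urlopen(req) as resp:
            body = resp.read()
        print(agent, lang, len(body))  # differing lengths hint at tailored content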

H.4 Annotation and the Hidden Web

H.4.1 What is annotation?

For the purposes of this discussion, we consider the following rather strict definition of annotation: annotation is the process of a commentator non-destructively augmenting existing content with new content about the existing content. It is worth noting the following points:

• If new content is not about the existing content, then we do not consider the operation to be annotation. For example, consider adding the URL of a new Web site about annotation to a Wiki document that lists Web sites about annotation. This is not annotation of the Wiki document, although it could be considered annotation of the new Web site, depending on how the Wiki document was used.

• Annotation must be non-destructive. If an annotation amends the original, this is considered to be an edit, not an annotation. This definition excludes systems such as SHOE [59] and many RDF-based content augmentation systems that require the augmented content to be added to the original document, typically in the HTML header.

• Annotations may be made by the author of particular content, but only if the original doesn't change; otherwise (as above), this is an edit and a new version rather than an annotation.

• Metadata is not the same as annotation, and the presence of metadata does not imply annotation. For example, the byte count, owner and permissions on a file are metadata. Adding a textual summary to a copy of a document placed in a content management system may be considered annotation or metadata, depending upon the circumstances.

H.4.2 Actors involved in annotation

There are four actors involved in annotation: annotation producers (annotators), annotation consumers (users), annotation service providers and content providers. While the first three of these may seem obvious, content providers are also involved in annotation, as they have considerable control over how their content is made available to the other actors. In addition, we identify a fifth actor: the content consumer. The same person or agent may act in several, or indeed all, of these roles. For example, the maintainers of Annotea (http://www.w3.org/2001/Annotea/) run an annotation service, produce and consume annotations, and maintain the Annotea project Web pages.

Annotation producer: An agent that creates one or more annotations and adds them to the databases of one or more annotation service providers. Annotation producers are almost always content consumers, as they have to be aware of the content that is being annotated. An example would be a person using Amaya (http://www.w3.org/Amaya/) or COHSE [54] to add an annotation about a Web page to an Annotea server.

Annotation consumer: An agent that retrieves annotations from one or more annotation service providers. Typically, an annotation consumer is also a content consumer, and combines annotation and content for presentation or further processing. An example would be a person using Amaya or COHSE to display an annotated Web page.

Annotation server: Something that keeps a set of annotations and provides them to one or more annotation consumers, for example the W3C Annotea server. The annotation server may allow external annotation producers to modify the set of annotations (as is the case with Annotea), or may provide a service that is read-only to external agents (for example, a dictionary service). Annotations are themselves content, and all annotation servers are therefore also content providers.

Content provider: Something that keeps a set of content and serves it to one or more content consumers. An example would be a Web server.

Content consumer: An agent that retrieves content from one or more content providers. An example would be a Web browser.

H.4.3 Annotating the Hidden Web

Hidden resources affect the different actors involved in annotation in different ways. In this section, we illustrate some of those ways and consider technical and social factors that lead systems to behave in particular ways. The comments are grouped, as before, into access denial, difficulty in interpretation, difficulty in discovery, time dependence and lack of a simple identifier for the resource.

Access denial

Firewalled content: The biggest concerns here are for annotation consumers. When retrieving annotations from an annotation server, it may be unclear whether all parts of an annotation and its referenced content are available.






This problem is particularly acute when offering potential annotation targets to a human user, as there may be an expectation that an offered target is available. Social behaviour within the Semantic Web may well reduce the impact of this problem, however: annotations for private content are more likely to be sent to a private annotation server than to a publicly-accessible one. For example, annotations for Sun's corporate intranet are stored on Sun's intranet annotation server.

Denied by robots.txt: Refusing entry to robots will limit the scope for automated annotation producers to enter the site and perform annotation. As with the current situation with search engines and keyword indexing, this leads to annotation consumers being given a blinkered view of the available content. Socially, it may be more acceptable for annotation producers to crawl content than for the general search engines to do so. Annotation agents will, in general, construct summaries and cross-links rather than copying the entire content; this distinction may be sufficient for the content providers to allow annotation producers onto a site that is denied to the search engines.

Difficulty in interpretation

Unusual retrieval protocol: This is a problem for content consumers (recall that these typically include annotation producers). It is difficult for the content consumer to retrieve the resource in order to process or display it, as the consumer may not understand the appropriate protocol. As more content consumers build upon a few standard resource retrieval libraries and proxy servers, this problem will reduce, as support only has to be added to those retrieval libraries and proxies. The annotation service provider may also have a problem, depending on the annotation protocol in use. Several use URIs to refer to the resource; if there is no URI scheme defined for a particular protocol, this approach breaks down. A likely social solution is that a "slang" URI scheme is defined for the protocol until a more formal scheme can be put in place. This may also be a problem for the content provider, if they wish content to be available; the typical solution appears to be to make the content available via HTTP (for human-readable content) or via servlets or Web services (for machine-processable content).

Unusual resource format: This is a problem for all the actors. Annotation producers, consumers and service providers may have difficulty identifying the portion of the resource to which an annotation applies, especially if there is no standard scheme for identifying portions of the resource in question. Content consumers may have no facility for decoding the resource for processing or display, leaving content providers with the problem that only part of their intended audience can process the resource.

Low-fidelity content and deliberate obfuscation: This is typically a problem for content consumers that need to extract the maximum possible information from the content, such as annotation producers. The producer may be unable to extract the desired information from the content, reducing the quality of any annotation or even preventing it altogether. This has a knock-on effect on annotation consumers; as with the conventional search engines, they may get a limited picture of the available content.

Where the low-fidelity content could be replaced by an alternative that is more amenable to processing, the content provider could consider the benefits of doing so. In cases where this is not acceptable, or where the content provider chooses to obfuscate content, the only approach for automated annotation producers is to improve any extraction systems to maximise the information that can be extracted from the available resource. Again, this task is becoming easier as more processing power becomes available and more sophisticated extraction techniques are devised.

Difficulty in discovery

The concern here is typically for annotation producers, and potentially content providers. As with the search engines, an automated annotation producer may be unable to locate a specific page in order to produce annotations. This can only be addressed by the content providers implementing appropriate navigation and resource location aids for agents wishing to process the available resources. Handschuh et al. suggest a Web service-based facility to access a site's database as an aid to this process for data-driven pages in [55]. Once the resource has been located and any annotations have been produced, these are presented as normal to annotation servers and are then available as normal to annotation consumers, and hence content consumers.

Time dependence

This affects annotation consumers and service providers, but content providers and annotation producers would have to be involved in any potential solution. The essential problem for annotation is that there is no guarantee that any stored annotation is still relevant. The problem can range from the obvious (the page is no longer available) to the subtle (the annotation referring to the record at Number 1 in last week's Top Ten doesn't make sense for this week). The simplest solution is for the content provider to provide some form of versioning or time-stamping of the content that is served. The annotation producer could then include the same timestamp in any annotation generated, which could be stored by the annotation server. An annotation consumer could then check the annotation's version against the retrieved content, and could discard or warn about any annotations that were out of date. More complex solutions would involve the content provider in minting persistent URIs for the time-varying resources. A concern here is how to distinguish an annotation producer's wish to produce an annotation about the time-varying or the time-independent concept; for example, "the number 1 position in the UK Top Ten" versus "the record that is this week's Number 1 at the time the annotation is made".
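A minimal sketch of the version check just described, assuming the content provider exposes a Last-Modified header and the annotation stores the timestamp it was made against (the report does not prescribe a concrete protocol, so both names here are illustrative):

    from urllib.request import urlopen

    def annotation_is_current(resource_url: str, annotated_timestamp: str) -> bool:
        """Compare the timestamp stored with an annotation against the
        resource's current Last-Modified header; an absent header means
        the check cannot be made."""
        with urlopen(resource_url) as resp:
            last_modified = resp.headers.get("Last-Modified")
        if last_modified is None:
            return False  # cannot verify; treat the annotation as suspect
        return last_modified == annotated_timestamp

    # A consumer would warn about (or discard) any annotation for which
    # annotation_is_current(...) returns False.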


Lack of simple identity

As stated previously, these resources challenge not just the search engines and annotation systems, but also bookmarking systems and any other system that assumes that a URI is sufficient to identify content. In general, they affect annotation producers, consumers and service providers, and content consumers. Apart from the problem of retrieving resources that may require complex operations, which affects content consumers and hence annotation producers, there is the very real difficulty of describing the operations required to retrieve the resource. Assuming such a "retrieval description language" can be devised, there is then the task of constructing annotation servers and protocols that can communicate fragments of such a language, and resource retrieval libraries that can interpret it. The paragraphs below describe some specific problems, and steps that can be taken in the absence of this general-purpose language.

Post-only forms: Without a retrieval language that can express that form elements are to be submitted via an HTTP POST as well as via GET, little can be done. For this case, the "language" could be as simple as agreeing on "slang" URI schemes of http-post: and https-post: (a sketch of this idea appears at the end of this section).

User agent sniffing: When content providers serve different content depending on the user agent, annotation producers may produce annotations that are specific to a given user agent. For example, an XPointer (http://www.w3.org/TR/WD-xptr) that is valid for a Web page served to Amaya and annotated using Amaya and Annotea may be invalid, or worse, point to different content on the page, when the page is viewed using Mozilla and COHSE. The ultimate solution is to eliminate the unfortunate practice of user agent sniffing. However, as this is very unlikely to happen in the near future, an alternative would be to keep a note of the user agent that made the annotation, and to request the resource using the same user agent when processing the resource with its annotations. Of course, this creates the potential for a page that is poorly formatted for the actual user agent to be served. Another approach would be simply to prune annotations produced under different user agent settings, but this would prune many annotations, given the plethora of user agent versions.

Authentication: The situation is similar to that for robots.txt above. It may be that well-known annotation producers can be given privileged access to content in order to produce limited annotations about otherwise restricted resources. However, annotation consumers that retrieve the unauthenticated version of a resource but use annotations produced from an authenticated version may be working with incorrect information.

Language dependence: The problems here are identical to those produced by user agent sniffing, above, and have a similar solution. As there is no prospect of eliminating differences in human language, despite the apparently inexorable march of US English across the globe, an alternative is to keep a note of the language settings under which the annotation was made, and to request the resource using the same settings when processing the resource with its annotations. Of course, this creates the potential for a page that is in the wrong language for the user to be served. Another approach would be simply to prune annotations produced under different language settings.
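As a sketch of the "slang scheme" idea for post-only forms, an annotation system could serialise the POST target and parameters into an http-post: URI and replay the request later. The scheme and helper names below are purely illustrative, not an agreed standard:

    from urllib.parse import urlencode, parse_qsl
    from urllib.request import Request

    def to_slang_uri(action_url: str, fields: dict) -> str:
        """Fold a POST form submission into a single 'http-post:' identifier."""
        return "http-post:" + action_url[len("http:"):] + "?" + urlencode(fields)

    def to_request(slang_uri: str) -> Request:
        """Replay: turn the slang URI back into an HTTP POST request."""
        rest = slang_uri[len("http-post:"):]
        base, _, query = rest.partition("?")
        body = urlencode(parse_qsl(query)).encode("ascii")
        return Request("http:" + base, data=body)  # a data payload implies POST

    uri = to_slang_uri("http://forms.example.com/search", {"q": "hidden web"})
    req = to_request(uri)  # req.get_method() == "POST"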

H.5 Conclusions

The Hidden Web presents challenges for those wishing to use it, not least the identification of what actually makes up the Hidden Web. In this paper we have put forward a number of definitions that characterise aspects of the Hidden Web, and have described ways in which resources can be "hidden". These have also been considered in the context of annotation. We do not claim to answer all (or indeed many) of the questions surrounding the Hidden Web, but a clarification of the terms involved is certainly a step in the right direction.



    Appendix I

The Fundamental Open Hypermedia Model (FOHM) and COHSE
Technical report, July 2006

I.1 Introduction

This report explains how the COHSE navigation model can be formalised using the Fundamental Open Hypermedia Model (FOHM). FOHM provides a systematic vocabulary for expressing structures and operations that can be used to represent different hypertext domains: the spatial, navigational and taxonomic domains [77]. The aim of the formal representation of the COHSE navigational model is to define a common vocabulary and its meaning, in order to gain a better understanding of what COHSE does and to enable interchange and interoperability between COHSE and other Open Hypermedia Systems. The formalisation will also help us to understand how a knowledge model used in COHSE maps to its navigational model. The rest of this document first discusses the overall structure of the current link boxes and introduces FOHM, then explains how FOHM can be applied to COHSE.

I.2 Structure of COHSE links and link boxes

Basically, COHSE extends existing Web pages by adding multi-headed links. It identifies a term and then adds a list of links (multi-targets) to resources that are identified by a variety of different techniques. These target links are encoded in a box that is composed of the following information (a cardinality of "*" denotes zero or more, as in a DTD):

• Label: An informative label for the link box (cardinality: 1);
• Description: A short description of the term (cardinality: *);
• Resources: A number of resources that are somehow associated with this concept (cardinality: *);
• Broader Resources: A number of resources that are relevant to the broader concepts of the identified term (cardinality for broader concepts: *, and cardinality for resources of a given broader concept: *);
• Specific Resources: A number of resources that are relevant to the specific concepts of the identified term (cardinality for specific concepts: *, and cardinality for resources of a given specific concept: *).

Lowe and Hall [71] introduce three types of links based on the type of information relationship being represented: structural (used to organise the information space), associative (a semantic relationship between information elements) and referential (a link between an item of information and an elaboration or explanation of that item). Looking at these definitions, a link box provides two kinds of linking: associative and referential.

I.3 Fundamental Open Hypermedia Model (FOHM)

FOHM aims to define a common data model, and a set of related operations, that are applicable to different hypertext domains (specifically the spatial, navigational and taxonomic domains). Although there are similarities between these domains (e.g., they all have the notion of data and association), there are also features that are unique to each of them (e.g., circularity is allowed in navigational hypertext but not in taxonomic or spatial hypertext). These differences cause interoperability issues and hinder the linking of different hypertext domains [77]. The Open Hypermedia community therefore tackled the issue of interoperability between different OH systems and introduced FOHM to address these kinds of issues. FOHM provides a systematic vocabulary to define structures, and operations that can be applied to those structures, with the idea that the structures become first-class objects in an OH system. The original FOHM model defines four basic constructs, which are illustrated in Figure I.1:

Data: This construct is a wrapper for resources that are external to the model. The construct normally represents a document, but it may represent any file, stream or other item, and could be either a URL or actual content. There is no formal description of data in FOHM.

Associations: An Association is a structure that represents a relationship between Data objects. Data objects are not directly placed in an Association; instead, Reference objects are used, and a Data object is attached to the Association via a Binding. Each Association also has a structure type and a feature space.

Reference objects: A Reference object either points at a Data object in its entirety or at part of that Data object (e.g., the second paragraph of a text document, or the second scene of a film).

Bindings: A Binding is used to attach a Reference object to an Association object. Each Binding states its position in the Association's feature space, stating how it is bound to the Association's structure.

Figure I.1: A FOHM structure.
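As a concrete (and deliberately simplified) reading of these constructs, the sketch below renders them as Python data classes; the field names are our own shorthand and are not part of the FOHM specification:

    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class Data:
        """Wrapper for a resource external to the model: a URL or literal content."""
        content: str

    @dataclass
    class Reference:
        """Points at a Data object (or, in full FOHM, another structure),
        either in its entirety or at a part of it."""
        target: object
        part: Optional[str] = None  # None means the whole target

    @dataclass
    class Binding:
        """Attaches a Reference to an Association, stating its position
        in the Association's feature space."""
        reference: Reference
        features: dict  # e.g., {"direction": "source"}

    @dataclass
    class Association:
        """A typed relationship over Data objects, held together by Bindings."""
        structure_type: str    # e.g., "link" or "set"
        feature_space: list    # e.g., ["direction"]
        bindings: list = field(default_factory=list)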

FOHM also defines two types of metadata that may be attached to the model at any point (to any of the four basic constructs: association, data, reference or binding). The original FOHM model does not explicitly describe these constructs; they are expected to be defined at implementation time.

Behaviour: An example would be "follow link": when the user clicks on the link, the application decides to open it in a new window.

Context: Metadata that describes in which context this part of the model can be seen.

Figure I.2 shows an example FOHM association: a navigational link with one source (a piece of text) and two destinations (URLs). There are two context objects: one describes its target concisely (as not a technical document), and the other describes its target as having more technical detail. The diagram also shows that the Navigational Link association has a single feature in its feature space, direction, and that direction has a vector space comprising source, destination and bidirectional.

Figure I.2: A simple FOHM navigational link.
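Expressed with the data classes sketched above, the Figure I.2 link might look as follows (the text fragment and URLs are placeholders, not the figure's actual content):

    # A navigational link with one source (a piece of text) and two
    # destinations (URLs), in the style of Figure I.2.
    src = Binding(Reference(Data("COHSE"), part="word 12"),
                  {"direction": "source"})
    dst1 = Binding(Reference(Data("http://cohse.example/overview")),
                   {"direction": "destination"})
    dst2 = Binding(Reference(Data("http://cohse.example/technical")),
                   {"direction": "destination"})
    link = Association(structure_type="link",
                       feature_space=["direction"],
                       bindings=[src, dst1, dst2])
    # Context metadata ("technical" vs. "non-technical") could be attached
    # to dst1 and dst2 to reproduce the figure's two context objects.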

There are also link bases that have been created to store and serve structures represented in FOHM. Auld Linky is a contextual lightweight link server which uses context to filter query results [75]; it was created as part of the Equator project (http://www.equator.ecs.soton.ac.uk/technology/linky/index.shtml). Auld Linky uses an XML linkbase format to read and write its structures, and its FOHM specifications are represented using an XML Schema or DTD (http://www.equator.ecs.soton.ac.uk/technology/linky/definitions/linkyhistory.shtml). Auld Linky has been used in several applications: for example, [78] explains how this linkbase is used across both physical and digital spaces, and [4] explains how the context mechanism of Auld Linky has proved useful as a way of implementing Adaptive Hypermedia.

I.3.1 Why use FOHM?

Based on the description of FOHM above, applying FOHM to COHSE would allow us:

• To provide a common vocabulary and express its meaning;
• To have a better understanding of the COHSE navigational model, which would help us to use knowledge resources other than OWL ontologies (e.g., folksonomies, thesauri, classification schemes, taxonomies, terminologies, etc.);
• To support better presentation of links/link boxes;
• To adapt links/link boxes by using the context and behaviour constructs;
• To be able to use other link bases that serve their links in FOHM structures.

I.3.2 How to apply FOHM to COHSE?

Figure I.3 shows the fundamental data model of FOHM [77].

Figure I.3: Original FOHM data structure definition.

In order to apply FOHM to COHSE, we would need to extend this fundamental data model. A possible way of doing so would be as follows. Three kinds of FOHM associations can be introduced to represent the COHSE navigational model, detailed below.

Composite Association: A Composite Association Ac is introduced to express a link box. A link box creates a kind of association between a term, its definition, and a set of broader and specific resources. In order to represent this, Ac is defined to have a three-dimensional feature space comprising direction, description and component. Description is used to create an association between the identified term and its definition. Similarly, Component is used to create associations between the identified term and the sets of resources grouped for broader and specific terms.

Association type       Ac = Composite Association
Structural type        S = Set
Feature space          N = {direction, description, component}
Direction features     Direction = {source, destination, bidirectional}
Description features   Description = {has-definition}
Component features     Component = {has-part}

Broader Resources Association: This represents a set of links that are related to the identified term via the "broader concept" relationship.


Association type   Ab = Broader Resources Association
Structural type    S = Set
Feature space      N = {members}
Member features    Members = {one, two, three, ...}

Specific Resources Association: This represents a set of links that are related to the identified concept via the "specific concept" relationship.

Association type   As = Specific Resources Association
Structural type    S = Set
Feature space      N = {members}
Member features    Members = {one, two, three, ...}

Figure I.4: An example COHSE linkbox.

Link Association: This represents a navigational link.

Association type     Al = Link Association
Structural type      S = Link
Feature space        N = {direction}
Direction features   Direction = {source, destination, bidirectional}
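Using the same hypothetical data classes as above, the three association types can be combined to assemble a link box such as the one in Figure I.4 (the URLs and the definition text are placeholders):

    # Assemble the Figure I.4 link box as a Composite Association.
    term = Binding(Reference(Data("SUN Fire V480 Server")),
                   {"direction": "source"})
    definition = Binding(Reference(Data("A definition of the term ...")),
                         {"description": "has-definition"})
    targets = [Binding(Reference(Data(url)), {"direction": "destination"})
               for url in ("http://docs.example/v480-a",
                           "http://docs.example/v480-b")]

    # Broader resources form a Set association of their own, bound in as a part.
    broader = Association(structure_type="set", feature_space=["members"])
    composite = Association(
        structure_type="set",
        feature_space=["direction", "description", "component"],
        bindings=[term, definition, *targets,
                  Binding(Reference(broader), {"component": "has-part"})])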

Figure I.5 shows how the link box of Figure I.4 can be represented in FOHM using the associations described above. It shows that a composite association is created between the term "SUN Fire V480 Server" (direction: src), the definition of this term (description: has-definition), two resources/links (direction: dst) and a Broader Resources association. The Broader Resources association has two link members (member: one, two); one is a link between the term "System" (direction: src) and two resources/URLs (direction: dst), and the other is a link between the term "Entry-level server" (direction: src) and two resources/URLs (direction: dst).

I.3.3 How does the COHSE navigational model map to the underlying knowledge model?

Although these FOHM extensions allow us to represent COHSE in FOHM, the relationship between terms, which is what creates the association in the first place, becomes implicit. For example, in Figure I.4 there is an explicit relationship between the terms "SUN Fire V480 Server" and "Entry-level server" based on the knowledge model used (see Figure I.6), but this is not explicitly represented in the proposed extension of FOHM. In fact, this example shows the importance of the relationship between the knowledge model and the navigational model. Currently, the COHSE navigation model is driven by an OWL ontology, based particularly on the broader and narrower relationships. However, different knowledge resources could be used: for example, thesauri, classification schemes, taxonomies, subject headings and other types of controlled vocabulary.

I.4 Discussion and Conclusions

This report has introduced FOHM and discussed how FOHM associations can be extended to express the navigational model supported by COHSE. We conclude with a discussion of a number of issues:

• In general, FOHM is a graph model for contextual browsing in open hypermedia. The notion of context could be quite useful in turning COHSE into an adaptive system. The context object can be assigned to any of the four basic constructs, which would give us the flexibility to adapt the COHSE navigational model in different ways. Although the specification of context and behaviour, the two main modifier classes in FOHM, is not explicit, we can easily extend FOHM to provide a better and more explicit formalisation of both. For example, in Auld Linky [75] a context object is implemented as a key-value pair: a context key might be "age" with possible values "adult" or "child", and the contextual linking supported can then constrain the kind of documents that a child can browse.

Figure I.5: Figure I.4 expressed in extended FOHM.

• As highlighted in the introduction, one of the major benefits of using FOHM is interoperability with other open hypermedia systems. Unfortunately, however, not many OH systems are available in FOHM. It may therefore be better to explore the formalisation of the COHSE navigational model using existing Semantic Web metadata languages.

• Gibbins et al. discuss the relationship between FOHM and RDF in [52], and in particular investigate whether FOHM structures can be represented in RDF.

Although they conclude that it is possible to represent the FOHM higher-level relations in the RDF vocabulary, they also state that doing so loses the FOHM semantics; and if the constructs needed to represent the FOHM semantics are introduced, this breaks the RDF model theory. They therefore propose that it is necessary to extend the RDF semantics with the higher-level FOHM semantics by defining an RDF schema (in RDFS or OWL). Such a schema could be used to apply FOHM to COHSE.

• Although a number of linkbases have been created using FOHM, in these linkbases the associations are stored persistently, and querying is used as a mechanism to filter them; that is to say, associations are not created on the fly [75]. In COHSE, on the other hand, associations are created dynamically: they do not exist persistently, but are created when the user requests further information about a term. This raises the question of how dynamic creation affects the navigational model supported by COHSE.

• As discussed in the previous section, although FOHM lets us model the navigational support provided by COHSE, we still need to investigate other techniques to better represent the relationship between the navigational model and the underlying knowledge model.

• Unfortunately, the FOHM formalisation is very loose. Depending on the application, this can be a problem; furthermore, the representation can easily become more complex than necessary. Further discussion of such issues, identified when FOHM was used to formalise e-science provenance data, can be found in [103].

Figure I.6: The knowledge model underlying the example in Figure I.4.


    Appendix J

User Tasks on the Web - Taskonomy
Technical report, September 2006

This document presents a number of studies that aim to understand what kinds of tasks users perform on the Web and the behaviours or tactics used to achieve those tasks, and discusses how these tasks relate to COHSE.

In the literature, the two fundamental paradigms discussed for finding information on the Web are browsing and searching [63, 72, 85]. The basic distinction between the two is the object of the search (the information need) and the systematicity of the tactics used. With searching, the user is looking for something specific, whereas with browsing the user does not have a well-defined task. By these simple definitions, we can broadly say that COHSE supports browsing. Studies of the browsing paradigm take this view one step further and propose three different kinds of browsing, again based on the user's goal or the specificity of the object searched for: it can be very well defined, not very well defined, or not defined at all [3, 38, 60, 70, 72]. On this basis, we can say that COHSE supports the latter two kinds of specificity.

There are also relevant information retrieval paradigms that can be used to describe the kinds of tasks COHSE supports. These paradigms include exploratory search [100], berrypicking [6] and information foraging theory [88]. They are related in the sense that they all propose that users' queries continually shift: accessing and learning new information may yield new queries. According to information foraging theory, users typically forage for information on the Web by moving from page to page along hyperlinks. Therefore, by adding extra hyperlinks, COHSE improves the potential exploration of the environment: it provides extra paths, derived from terms in an ontology, that enable users to better explore their environment. According to Marchionini [73], learning and investigation activities are parts of exploratory search. Looking at the tasks involved in these two activities (see Table J.2) and the way COHSE adds extra links and link boxes, we can say that COHSE supports these tasks.

There are also a number of user studies that aim to understand what users do on the Web in general. Table J.3 summarises the findings of these studies and highlights the tasks that COHSE could potentially support.

The rest of this document is organised as follows: Section J.1 compares the browsing and searching paradigms. Section J.2 presents a number of studies on browsing. Section J.3 introduces the notion of exploratory search, and Section J.4 introduces the notions of berrypicking and information foraging theory. Section J.5 details studies that seek to understand what users do on the Web. Finally, Section J.6 presents our conclusions.

J.1 Browsing vs. Searching

The two main paradigms for finding information on the Web are browsing and searching.

Searching [85], or search by query [63], or analytical search [72], can be described as "the process of entering a search query into a search engine, which produces a ranked list of links to pages that match the query" [85], or "the task of looking for a known target", or a "carefully planned series of queries posed with precise syntax". Marchionini [72] lists its characteristics as "planned, discrete, goal-driven, deterministic, formal".

Browsing [85], or search by navigation [63], or browsing search [72], can be described as "the process of viewing pages one at a time and navigating between them sequentially using hyperlinks" [85], or "the task of looking to see what is available in the world". Marchionini [72] lists its characteristics as "opportunistic, data-driven, heuristic, informal, continuous".

J.2 Browsing

The notion of browsing is well researched, and different disciplines look at it from different perspectives [32]. It is a very common exploration strategy that hypertext users engage in [28]. Cove and Walsh describe it as a process of "picking out bits and pieces ... selecting worthwhile information need or interest" [38]. Many studies have addressed different types of browsing. Chang and Rice cite different typologies from different investigators in their review of browsing [32]. As Marchionini states, "although there are differences in how the types are named, there seems to be an agreement on three general types of browsing that are differentiated by the object of search (the information needed) and systematicity of tactics used" [72]. Table J.1 summarises these three types of browsing from different investigators, together with their characteristics [3, 38, 60, 70, 72]. In a single information-seeking episode, the user can move from one type of browsing to another, or can engage in multiple types of browsing at the same time.

Although the browsing categories in Table J.1 are based on users' goals, Chang and Rice [32] analyse the browsing phenomenon along the following dimensions:

• Context: includes organisation of the data, interface, feedback and economic factors;
• Behavioural: includes scanning and movement (continuous movement is important);
• Motivation: includes purpose and goal;
• Cognitive: includes planning (browsing can be planned or unplanned) and knowledge/experience;
• Resource: includes form (object or representation) and focus (content or path).

J.3 Exploratory Search

White et al. [100] indicate that users generally combine querying and browsing strategies to foster learning and investigation. Users usually engage in this kind of activity when they want to find something in a domain where they have a general interest but not specific knowledge (i.e., searches where the target may be undefined). The technique used to support such queries is called "exploratory search". White et al. [100] highlight that "the exploratory search strategies are used continually to allow people to discover new associations and kinds of knowledge; they are often motivated by a complex information problem, and a poor understanding of terminology and information space structure".

Marchionini [73] indicates that exploratory search is more concerned with recall (maximising the number of possibly relevant objects that are retrieved) than precision (minimising the number of possibly irrelevant objects that are retrieved; search engines tend to worry about precision). Marchionini [73] also groups search activities into three broad categories, lookup, learn and investigate, with the latter two constituting "exploratory search" (see Table J.2).

    lookup               learn                        investigate
    fact retrieval       knowledge acquisition        analysis
    known item search    comparison/interpretation    negation
    navigation           comprehension                synthesis
    transaction          aggregation/integration      evaluation
    verification         socialise                    discovery
    question answering                                planning
                                                      transformation

Table J.2: Marchionini's broad categories of search activity.

In the "exploratory search interfaces" workshop report (Exploratory Search Interfaces: Categorisation, Clustering and Beyond; White, R., Kules, B., and Bederson, B.; report on the XSI 2005 Workshop at the Human-Computer Interaction Laboratory, University of Maryland, 2005; http://hcil.cs.umd.edu/trs/2006-18/2006-18.htm), the authors indicate that "Exploratory searches are typified by uncertainty about the space being searched and the nature of the problem that motivates the search. People engage in exploratory searching both intentionally and incidentally to other activities. It includes situations where: (i) the target of the search is unknown or partially unknown, (ii) the search begins with some certainty about what is known but changes into one unknown and unfamiliar on exposure to new information and (iii) users recognise useful information objects by scanning through information resources, evaluating their usefulness and determining the content/structure of a set of information objects. Relevance feedback (RF), and in particular implicit relevance feedback, is a technique that complements users' exploration of the search space and the reduction in uncertainty about the target of the search that can result from this activity."

J.4 Berrypicking and Information Foraging Theory

The notion of berrypicking is introduced by Bates [6]. Standard information retrieval models assume that the user's information need remains the same throughout the search session, and the model's goal is to produce a perfect set of relevant documents. With the berrypicking model, on the other hand, the query continually shifts and users may move through a variety of sources. New information may yield new ideas and new directions, and the value of the search lies in the bits and pieces picked up along the way.

A theory related to berrypicking is information foraging theory, which shows how users interleave directed searching with opportunistic and unstructured behaviour. This theory aims to understand how users search for information. Its fundamental idea is that users typically forage for information on the Web by navigating from page to page along hyperlinks. The most important concept in information foraging theory is "information scent": the imperfect, subjective perception of the value, cost, or access path of information sources from browsing cues [88].

J.5 Studies of User Tasks on the Web

This section discusses a number of studies that seek to describe what users do on the Web (see Table J.3 and Sections J.5.1-J.5.7). In summary:

• These studies show that it is not enough to study only the user's click-stream to draw conclusions about the user's goals and motivations [91]. Most of the studies summarised in Table J.3 combine several techniques (e.g., interviews, click-stream analysis, observations) to understand what users are trying to achieve and how.

• Although some of these studies analyse tasks based on different variables (e.g., purpose, task, content, or motivation and behaviour), the distinction and relationship(s) between these variables are not clear. It is not always easy to say that user behaviour is motivated by a specific task.

• If we look closely at the tasks summarised in Table J.3, most are based on the specificity of the object searched for, and bear similarity to the three basic browsing categories.

• Some of these studies discuss composite activities: users perform activities that involve the combination and coordination of multiple tasks.

• Most of these studies also confirm that link-following is still the most common task on the Web. In 1995, Catledge and Pitkow [31] reported that the most commonly used method of Web navigation is hyperlinks. In 1997, Tauscher and Greenberg [94] indicated that hyperlinks accounted for half of all navigation. Similarly, Milic-Frayling et al. [76] report that 43% of all Web navigation is a result of hyperlinks. More recent work in 2006 by Weinreich et al. [99] confirms that hyperlink following is still the most common activity on the Web. Kellar et al. [66] also highlight that following hyperlinks is the most common mechanism used across different kinds of tasks (i.e., fact finding, information gathering, just browsing, transactions).

J.5.1 The tangled Web we wove: a taskonomy of WWW use [27]

In this study, Byrne et al. [27] observe how eight volunteers browse the Web for a day. The data is captured by videotaping users' activities. The participants are asked to browse the Web as they normally do, and their browsing behaviours are recorded. A taskonomy is created based on the observational data, which includes the following six broad categories:

Use information: includes any activity in which the user attempts to use a piece of information (e.g., read/view/listen, save to disk (download), display for others, duplicate, print).

Locate on page: includes activities such as keyword search, and relates to activities to locate an item on a page (e.g., locate an image, related concept, tagged information, specific string or something "interesting").

Go to page: is the activity where the browser is used to locate a URL


(e.g., type a URL, use the back and forward buttons, the history list, or a bookmark). According to their statistics, link following is the most common way of traversing between pages [27].

Provide information: is the process where the user enters information rather than retrieving it (e.g., entering a shipping address, a survey response, or a search key).

Configure browser: involves activities that the user engages in to configure his or her browser (e.g., add a bookmark, change cache settings, window size, etc.).

React to environment: involves activities that the user engages in that are initiated by the browser (e.g., respond to a display change, respond to a dialogue, reload).

The authors also highlight that users usually try to achieve a higher-level task, and actually perform a number of the tasks summarised above to complete it. Norman [83] calls these higher-level tasks activities, and highlights that activities are not the same as tasks: at the highest level are activities, which are comprised of tasks, which are themselves comprised of actions, which are made up of operations [83]. For example, in this study one of the participants tries to download a paper written by a colleague. In order to achieve this, the user performs several tasks: use info, go to a bookmark, locate, follow a link and locate.

J.5.2 A taxonomic analysis of what world wide web activities significantly impact people's decisions and actions [79]

Based on a survey, Morrison et al. [79] create taxonomic classifications based on three variables:

Purpose (why users search the Web): These are grouped into three: find (use of the Web to download information, get a fact, get a document, find out about a product); compare/choose (evaluate products); and understand (includes locating facts or documents).

Method (how users search the Web): These are grouped into four: explore (general searching for information, without a particular goal); monitor (repeated visits to specific sites); find (searching for a particular fact, document or piece of information); and collect (searching for multiple pieces of information). Some of these methods might not be triggered by a particular purpose, and can therefore be considered specific tasks in their own right.

Content (what information users search for): The following classification of content is identified: business, education, finance, job search, medical, miscellaneous, news, people, product info and purchase (computer, vehicles, download), and travel.

J.5.3 Information seeking on the Web: An integrated model of browsing and searching [34]

This study involved 34 participants. Using interviews, questionnaires and data logging over a two-week period, four modes of information seeking are identified. (A Web tracker logged the user's browser interaction: browser menu selection, open URL or file, reload, back, forward, add to bookmarks, go to bookmark, print, stop. The data logs were pre-analysed to prepare for the personal interviews, so the logs were not used on their own to identify the information-seeking episodes.)

Undirected viewing: the user does not have any specific need in mind. The goal here is to use a large number of different sources and different types of sources [starting, chaining].

Conditioned viewing: the user explores selected topics or certain types of information [browsing, differentiating, monitoring].

    Informal search: the user aims to deepen the knowledge and understanding of a specific issue [differentiating, monitoring, extracting]. Formal search: the user looks for a specific information or types of information about a particular issue [monitoring, extracting]. In this study, the authors draw relationships between these motivations and the following moves or behaviours: Starting Identifying sources of interest that could serve as a starting point (e.g., jump sites that contained links of interest, go to portal sites or intranet entry pages of an organisation). Chaining Following up on the new leads from the starting point is the activity of chaining (e.g., following links from starting pages, using back and forward). Browsing Semi-directed searching in the areas of potential search (e.g., examining list of links or headings). Differentiating Filtering and selecting from among the sources scanned by noticing the differences between the nature and quality of information offered (e.g., bookmark a page, print). Monitoring Keeping abreast of developments in an area by regularly following particular sources (e.g., revisiting favourite sites). Extracting Systematically working through a particular source or sources in order to identify material of interest (e.g., systematically working through a site). J.5.4

J.5.4 How knowledge workers use the web [91]

The aim of this study is to understand what different kinds of activities people use the Web for, and the characteristics of those activities. In this study, 24 participants are interviewed at the end of each day over two consecutive days. The study concludes that people perform common tasks, which are grouped into six categories:

Finding: using the Web to find something specific (goal-oriented and very well defined, e.g., what is the reference for this publication?).

Information gathering: less specific than finding; using the Web to research a specific topic for various reasons (e.g., looking for a job: what products do they offer? what job opportunities are out there?). Information gathering activities can be time consuming and involve comparing and contrasting information across sites or organisations.

Browsing: visiting sites with no specific goal in mind (e.g., entertainment, reading newspapers). Browsing usually involves scanning a pageful of text and graphics.

Transacting: using the Web to execute a transaction (e.g., a bank transfer).

Communicating: using the Web to participate in chat rooms, etc.

Housekeeping: maintaining Web resources (e.g., checking whether the links in a maintained site are working properly).

The study also observes that users perform activities that consist of elements of two categories (e.g., searching for an item (finding) and then buying it (transacting)). This is closely related to Norman's [83] hierarchy of activities. According to the survey, information gathering is the most frequent activity and also has the second longest duration; browsing and finding are frequent activities but tend to have a much shorter average duration.


J.5.5 Taxonomy of Web Search [17]

Broder [17] proposes that there are three different kinds of searching: informational (acquire information assumed to be present on some site), navigational (reach a particular site that the user has in mind), and transactional (find sites where a certain transaction can be performed). In this study two methods are used to classify queries: a user survey and manual classification of log entries.

A similar study is performed by Rose et al. [89]. The main goal of this study is to understand the space of user goals. The authors analyse the queries entered by users and create a classification of Web search goals, grouped into three broad categories:

Navigational: the main goal is to reach a specific Web site.

Informational: the goal is to learn something by reading or viewing Web pages. This is further classified into five subcategories: directed (closed/open) (want to learn something in particular), undirected (want to learn anything/everything), advice (want to get advice or suggestions), locate, and list (a list of suggested Web sites).

Resource: the main goal is to obtain a resource (other than information). This is further classified into download, entertainment, interact, and obtain.

J.5.6 Supporting User Tasks and Context: Challenges for Semantic Web Research [58]

Whereas the studies above propose task classifications derived from users' browsing histories or from interviews, this paper presents a classification of tasks based on the previous studies and the authors' experiences. The authors propose the following list of tasks:

Locating: looking for an object or chunk of information.
Exploring: gathering information about a specific topic.
Monitoring: checking known sources that are expected to change.
Grazing: moving speculatively between sources with no specific goal in mind.
Sharing: making an object or a chunk of information available to others.
Notifying: informing others of an event.
Asserting: making a statement of fact or opinion.
Discussing: exchanging knowledge.
Evaluating: assessing a particular piece of information.
Arranging: coordinating with third parties to ensure that something will take place.
Transacting: transferring money or credit, etc.

It is not clear how searching, browsing, or directed and undirected search relate to these tasks. Furthermore, there is no evidence that some of these tasks are common among users; it is unclear whether the authors introduced them simply to explain the scenario presented in the article.

J.5.7 The impact of task on the usage of web browser navigation mechanisms [66]

The results of this study include information about the participants' Web usage, task information and questionnaire data. Participants are asked to classify their tasks according to the following activities:

Fact finding: looking for specific facts, files or pieces of information.
Information gathering: collecting information, often from various sources, in order to make a decision.
Just browsing: viewing Web pages with no specific goal in mind, often just for entertainment.
Transactions: an online action, such as email or banking.

Other: tasks that do not fit into any of the above categories.

Fact finding and information gathering are primarily search-based tasks, while browsing and transaction tasks often consist of repeated visits to Web sites. This study confirms that following hyperlinks is the most common method of navigation; other mechanisms, such as bookmarks, opening new windows and typed-in URLs, are used relatively little by comparison.
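Several of the surveyed studies (the Web tracker described in Section J.5.3, the field study of Kellar et al. [66]) rest on logging browser events and pairing them with user-reported task labels. The sketch below is purely illustrative and not drawn from any of these studies: all type and field names are invented. It shows one way such a log could be represented, and makes visible that the task label must come from the user, which anticipates the conclusion below that the click-stream alone is not enough.

    import java.time.Instant;
    import java.util.List;

    public class BrowserLog {

        // Task categories from Kellar et al. [66].
        enum Task { FACT_FINDING, INFORMATION_GATHERING, JUST_BROWSING, TRANSACTION, OTHER }

        // Navigation events of the kind the trackers above record.
        enum Event { FOLLOW_LINK, TYPE_URL, BACK, FORWARD, RELOAD,
                     ADD_BOOKMARK, GO_TO_BOOKMARK, PRINT, STOP }

        // One logged interaction; the task label is supplied by the user,
        // not inferred from the log itself.
        record Entry(Instant when, Event event, String url, Task userReportedTask) {}

        public static void main(String[] args) {
            List<Entry> session = List.of(
                new Entry(Instant.now(), Event.TYPE_URL,
                          "http://example.org/", Task.FACT_FINDING),
                new Entry(Instant.now(), Event.FOLLOW_LINK,
                          "http://example.org/paper.pdf", Task.FACT_FINDING));
            // The events and URLs alone do not reveal the task; the studies
            // above combine such logs with interviews or questionnaires.
            session.forEach(e ->
                System.out.println(e.event() + " " + e.url() + " -> " + e.userReportedTask()));
        }
    }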

J.6 Conclusions

This document presents a survey of the existing work that aims to understand what users do on the Web and what kinds of tactics they use. The main conclusions with respect to COHSE can be summarised as follows:
• The two fundamental paradigms for information seeking on the Web are searching and browsing. Of these two, COHSE supports browsing (see Section J.1).
• A number of investigations of the browsing paradigm in the literature agree on three different types of browsing. Of these browsing types, COHSE supports the ones where the goal is not well defined or not defined at all (see Section J.2).
• As can be seen in Sections J.3 and J.4, berrypicking, information foraging theory and exploratory search are related paradigms that can also be used to explain what kinds of tasks COHSE supports. Bates [6] indicates that the user's query continually shifts as he/she moves through a variety of sources, and that the value of the search lies in the bits and pieces picked up along the way. COHSE adds links that do not exist in the original document; it thus creates paths in the information space that allow users to explore the space better and pick up bits and pieces along the way. COHSE also highlights terms in a page by using an ontology, so these highlighted terms could help users to formulate their query better, or to formulate a new query within the context provided by the ontology.
• As Marchionini [73] highlights, exploratory search is more concerned with recall (maximising the number of possibly relevant objects that are retrieved) than precision (minimising the number of possibly irrelevant objects that are retrieved); the standard definitions are recalled after this list. This is a good distinction between what COHSE does and what search engines do: COHSE aims to improve recall, whereas search engines aim to improve precision. When a Web page is accessed, COHSE aims to add and return as many relevant links as possible by using the knowledge encoded in a domain ontology (e.g., using broader and narrower relationships); a sketch of this idea follows the list.
• If the user's task is very well defined (i.e., searching) then COHSE does not support it very well. However, the COHSE approach could be used to support searching tasks better. When users are looking for something specific they can easily use a search engine; but when they visit the pages the search engine returns, they move to a new context and lose their search context. Although Google's toolbar allows users to highlight searched terms in a visited page after the search, it does not provide links from that page to other pages relevant to the search query. By using the COHSE approach, links could be dynamically added to the visited page based on the keywords searched, allowing users to continue to browse within the context of their original search.
• In Section J.5, we discuss a number of user studies that seek to understand what users do on the Web. The results of these studies are summarised in Table J.3, where the tasks that COHSE could support are highlighted.
• Looking at the tasks summarised in Table J.3 and the other information-seeking paradigms, most of the tasks discussed are distinguished by the specificity of the search goal. Figure J.1 illustrates the relationships between these tasks based on the specificity of the search goal. COHSE mainly supports the middle part of this diagram, with gradually less support towards its left. This is because COHSE uses an ontology to add links, and the domain knowledge in that ontology provides a broad context for search and browsing; if the user has no idea at all about his/her task, then extra links added on the basis of an ontology might not be that useful. However, as discussed above, the idea of dynamic linking using ontologies could be exploited in different ways to support other tasks.

Figure J.1: User tasks on the Web and COHSE.

• Finally, this literature survey also confirms that it is not enough to study only the user's click-stream to draw conclusions about the user's goals and motivation. We can therefore say that fully automatic task-based personalisation will be almost impossible without user intervention.
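For reference, the standard information-retrieval definitions behind the recall/precision distinction above are:

\[
\mathrm{recall} = \frac{|\,\mathrm{relevant} \cap \mathrm{retrieved}\,|}{|\,\mathrm{relevant}\,|},
\qquad
\mathrm{precision} = \frac{|\,\mathrm{relevant} \cap \mathrm{retrieved}\,|}{|\,\mathrm{retrieved}\,|}
\]

A recall-oriented linker such as COHSE errs towards retrieving more of the relevant set at the cost of some irrelevant links; a precision-oriented search engine errs the other way.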
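To make the dynamic-linking idea in the bullets above concrete, the following is a minimal sketch in Java rather than the actual COHSE implementation: the class names, the toy ontology and the example URLs are all invented for illustration. Given a term recognised in a page, it gathers link targets for the matching concept and for its broader and narrower neighbours, favouring recall in the sense defined above.

    import java.util.*;

    public class DynamicLinker {
        // Toy ontology: broader and narrower relationships between concept
        // labels. These maps and the target URLs are invented for illustration.
        static final Map<String, List<String>> BROADER = Map.of(
                "jaguar", List.of("big cat"),
                "big cat", List.of("mammal"));
        static final Map<String, List<String>> NARROWER = Map.of(
                "big cat", List.of("jaguar", "lion"),
                "mammal", List.of("big cat"));
        // Candidate link targets per concept, as a link service might supply.
        static final Map<String, List<String>> TARGETS = Map.of(
                "jaguar", List.of("http://example.org/jaguar"),
                "big cat", List.of("http://example.org/big-cats"),
                "lion", List.of("http://example.org/lion"));

        // Favour recall: collect targets for the concept itself and for its
        // broader and narrower neighbours in the ontology.
        static List<String> linkTargets(String concept) {
            Set<String> concepts = new LinkedHashSet<>();
            concepts.add(concept);
            concepts.addAll(BROADER.getOrDefault(concept, List.of()));
            concepts.addAll(NARROWER.getOrDefault(concept, List.of()));
            List<String> targets = new ArrayList<>();
            for (String c : concepts) {
                targets.addAll(TARGETS.getOrDefault(c, List.of()));
            }
            return targets;
        }

        // Naively wrap the first occurrence of each known term in an anchor.
        // A real agent would parse the document and offer all targets per term.
        static String addLinks(String html) {
            for (String term : TARGETS.keySet()) {
                List<String> targets = linkTargets(term);
                if (!targets.isEmpty()) {
                    html = html.replaceFirst(
                            java.util.regex.Pattern.quote(term),
                            "<a href=\"" + targets.get(0) + "\">" + term + "</a>");
                }
            }
            return html;
        }

        public static void main(String[] args) {
            System.out.println(addLinks(
                    "The jaguar is the largest big cat in the Americas."));
        }
    }

Running the sketch turns both "jaguar" and "big cat" into links, even though the page author created neither; this is the sense in which dynamic linking creates new paths through the information space.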


Levine, M [70]     | Herner, S [60] | Apted, S. [3]        | Cove and Walsh [38] | Marchionini, G [72] | Characteristics
semi-deterministic | directed       | specific             | search              | systematic          | specific goal or target; focused, structured and systematic
quasi-random       | semi-directed  | generally purposeful | general purpose     | opportunistic       | predictive, less definite target and less systematic
random             | undirected     | general              | serendipity         | casual              | purely random, unstructured, very little focus, no real goal, not systematic

Table J.1: Browsing types

Byrne et al, 99 [27] | Morrison et al, 01 [79] | Choo et al, 00 [34] | Sellen et al, 02 [91] | Heath et al, 05 [58] | Kellar et al, 06 [66]
locate on page       | find                    | formal search       | finding               | locating             | fact finding
use information      | explore                 | informal search     | information gathering | exploring            | information gathering
go to page           | collect                 | conditioned viewing | browsing              | monitoring           | just browsing
provide information  | monitor                 | undirected viewing  | transacting           | grazing              | transaction
configure browser    |                         |                     | communication         | sharing              | other
react to environment |                         |                     | housekeeping          | notifying            |
                     |                         |                     |                       | asserting            |
                     |                         |                     |                       | discussing           |
                     |                         |                     |                       | evaluating           |
                     |                         |                     |                       | arranging            |
                     |                         |                     |                       | transacting          |
(Section J.5.1)      | (Section J.5.2)         | (Section J.5.3)     | (Section J.5.4)       | (Section J.5.6)      | (Section J.5.7)

Table J.3: User Tasks - highlighted tasks are potentially supported by COHSE.

Bibliography

[1] Java portlet specification (JSR168), version 1.0. Java Community Process, 2003. http://www.jcp.org/en/jsr/detail?id=168.

    [16] D.F. Brailsford. Separable hyperstructure and delayed link binding. ACM Computing Surveys, 31(4es):30, 1999.

    [2] K.M. Anderson. Integrating open hypermedia systems with the world wide web. In Proceedings of the Eighth ACM Conference on Hypertext, pages 157– 166. ACM, 1997.

    [17] Andrei Broder. A taxonomy of web search. SIGIR Forum, 36(2):3–10, 2002.

    [3] S.M. Apted. General purposive browsing. Library Association Record, 73(12):228–230, 1971. [4] Christopher Bailey, Wendy Hall, David Millard, and Mark Weal. Towards open adaptive hypermedia. In Proceedings of the Second International Conference on Adaptive Hypermedia and Adaptive Web-Based Systems, pages 36–46, London, UK, 2002. SpringerVerlag. [5] R. Barrett and P.P. Maglio. Intermediaries: An approach to manipulating information streams. IBM Systems Journal, 38:629–641, 1999. [6] M. Bates. The design of browsing and berrypicking techniques for the online search interface. Online Review, 13(5):407–424, 1989.

    [18] A. J. B. Brush, D. Bargeron, A. Gupta, and J. J. Cadiz. Robust annotation positioning in digital documents. In Proceedings of the SIGCHI conference on Human factors in computing systems, pages 285– 292. ACM, 2001. [19] P. Brusilovsky. Methods and techniques of adaptive hypermedia. User Modeling and User-Adapted Interaction, 6(2-3):87–129, 1996. [20] P. Brusilovsky. Adaptive hypermedia. User Modeling and User-Adapated Interaction, 11:87–110, 2001. [21] P. Brusilovsky. Adaptive navigation support: From adaptive hypermedia to the adaptive web and beyond. Psychnology, 2(1), 2004.

    [7] S. Bechhofer and C. Goble. Delivering Terminological Services. AI*IA Notizie, Periodico dell’Associazione Italiana per l’intelligenza Artificiale., 12(1), 1999.

    [22] P. Brusilovsky, C. Karagiannidis, and D. Sampson. Layered evaluation of adaptive learning systems. Int. J. cont. Engineering Education and Lifelong learning, 14(4/5), 2004.

    [8] S. Bechhofer and C. Goble. Towards annotation using DAML+OIL. In Proceedings K-CAP Workshop on Knowledge Markup and Semantic Annotation, 2001.

    [23] P. Brusilovsky and M.T. Maybury. From adaptive hypermedia to the adaptive web. Communications of the ACM, 45(5):30–33, 2002.

[9] S. Bechhofer, R. Stevens, and P. Lord. Ontology driven dynamic linking of biology resources. Journal of Web Semantics (JWS). Accepted for publication.

[10] B.B. Bederson. Fisheye menus. In Proceedings of the Thirteenth Annual ACM Symposium on User Interface Software and Technology, pages 217–225. ACM Press, 2000.

[11] D. Benyon and D. Murray. Applying user modelling to human-computer interaction design. Artificial Intelligence Review, 7:199–225, 1993.

[12] Tim Berners-Lee. Weaving the Web. Texere, 2000.

[13] D. Billsus, C. A. Brunk, C. Evans, B. Gladish, and M. Pazzani. Adaptive interfaces for ubiquitous web access. Communications of the ACM, 45(5):34–38, 2002.

[14] N. O. Bouvin. Unifying strategies for web augmentation. In Proceedings of the 10th International Conference on Hypertext, pages 91–100, Germany, 1999. ACM.

[15] P. De Bra, L. Aroyo, and V. Chepegin. The next big thing: Adaptive web-based systems. Journal of Digital Information, 5, 2004.

[24] P. Brusilovsky and L. Pesin. Visual annotation of links in adaptive hypermedia. In CHI '95: Conference companion on Human factors in computing systems, pages 222–223. ACM Press, 1995.

[25] O. Buyukkokten, H. Garcia-Molina, and A. Paepcke. Seeing the whole in parts: Text summarization for web browsing on handheld devices. In Proceedings of the Tenth International World-Wide Web Conference, 2000.

[26] O. Buyukkokten, O. Kaljuvee, H.G. Molina, A. Paepcke, and T. Winograd. Efficient web browsing on handheld devices using page and form summarization. ACM Transactions on Information Systems, 20(1):82–115, 2002.

[27] M. D. Byrne, B. E. John, N. S. Wehrle, and D. C. Crow. The tangled web we wove: a taskonomy of www use. In CHI '99: Proceedings of the SIGCHI conference on Human factors in computing systems, pages 544–551, 1999.

[28] E. Carmel, S. Crawford, and H. Chen. Browsing in hypertext: a cognitive study. IEEE Transactions on Systems, Man, and Cybernetics, 22(5):865–883, 1992.

[29] L. Carr, S. Kampa, W. Hall, S. Bechhofer, and C. Goble. Handbook on Ontologies, chapter COHSE: Conceptual Open Hypermedia Service, pages 193–209. Springer, 2004.

[30] Leslie Carr, Sean Bechhofer, Carole Goble, and Wendy Hall. Conceptual linking: Ontology-based open hypermedia. In Proceedings of the 10th International World Wide Web Conference, Hong Kong, May 2001.

[31] Lara D. Catledge and James E. Pitkow. Characterizing browsing strategies in the world-wide web. In Proceedings of the Third International World-Wide Web Conference on Technology, Tools and Applications, pages 1065–1073, 1995.

[32] S. Chang and R.E. Rice. Browsing: a multidimensional framework. Annual Review of Information Science and Technology, 28:231–276, 1993.

[33] R. Chimera and B. Shneiderman. An exploratory evaluation of three interfaces for browsing large hierarchical tables of contents. ACM Transactions on Information Systems, 12(4):383–406, 1994.

[34] C.W. Choo, B. Detlor, and D. Turnbull. Information seeking on the web: An integrated model of browsing and searching. First Monday (online), 5(2), 2000.

[35] B.G. Christensen, F.A. Hansen, and N. O. Bouvin. Xspect: bridging open hypermedia and XLink. In Proceedings of the 12th International World Wide Web Conference, pages 490–499, Budapest, Hungary, 2003.

[36] M. Claypool, P. Le, M. Wased, and D. Brown. Implicit interest indicators. In Intelligent User Interfaces, pages 33–40, 2001.

[37] J. Conklin. Hypertext: an introduction and survey. Computer, 20(9):17–41, 1987.

[38] J.F. Cove and B.C. Walsh. Online text retrieval via browsing. Information Processing & Management, 24(1):31–37, 1988.

[39] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan. GATE: A framework and graphical development environment for robust NLP tools and applications. In Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics, 2002.

[40] N.P. Dalal, Z. Quible, and K. Wyatt. Cognitive design of home pages: an experimental study of comprehension on the world wide web. Information Processing and Management: an International Journal, 36(4):607–621, 2000.

[41] H. Davis. To embed or not to embed. Communications of the ACM, 38(8):108–109, 1995.

[42] H. Davis. Referential integrity of links in open hypermedia systems. In Proceedings of the Ninth ACM Conference on Hypertext, pages 207–216. ACM, 1998.

    [43] H. C. Davis, W. Hall, I. Heath, G. J. Hill, and R. J. Wilkins. Towards an integrated information environment with open hypermedia systems. In Proceedings of the Fourth ACM Conference on Hypertext, pages 181–190. ACM, 1992. [44] J. Domingue and M. Dzbor. Magpie: supporting browsing and navigation on the semantic web. In Proceedings of the Ninth international conference on Intelligent user interface, pages 191–197. ACM, 2004. [45] M. Dzbor, E. Motta, and J. Domingue. Opening up magpie via semantic web services. In Proceedings of the Third International Semantic Web Conference, volume 3298 of Lecture Notes in Computer Science. Springer, 2004. [46] S.R. El-Beltagy, W. Hall, D.D. Roure, and L. Carr. Linking in context. In HYPERTEXT ’01: Proceedings of the twelfth ACM conference on Hypertext and Hypermedia, pages 151–160, New York, NY, USA, 2001. ACM Press. [47] N. Ferguson, S. Schmoller, and N. Smith. Personalisation in presentation services. Technical report, The Joint Information Systems Committee, 2004. [48] E. Frokjaer, M. Herzum, and K. Hornbaek. Measuring usability: are effectiveness, efficiency, and satisfaction really correlated? In CHI ’00: Proceedings of the SIGCHI conference on Human factors in computing systems, pages 345–352. ACM Press, 2000. [49] G.W. Furnas. Effective view navigation. In CHI ’97: Proceedings of the SIGCHI conference on Human factors in computing systems, pages 367–374. ACM Press, 1997. [50] Frank Garcia. Business and marketing on the internet. Computer Paper Magazine, 1996. The original article is no longer online, but an archived copy is available at http://web.archive. org/web/19980113075450/http: //www.tcp.ca/Jan96/BusandMark.html. [51] G. Geisler. Enriched links: A framework for improving web navigation using pop-up views. Technical Report INLS TR-2000-02, University of North Carolina at Chapel Hill, 2000. [52] N. Gibbins, S. Harris, D. Michaelides, D. Miller, and M. Veal. Exploring the relationship between FOHM and RDF. In Proceedings of 1st International Workshop on Hypermedia and the Semantic Web, Nottingham, UK, 2003. [53] C. Goble, S. Bechhofer, L. Carr, D. Roure, and W. Hall. Conceptual open hypermedia = the semantic web? In The Second International Workshop on the Semantic Web, Hong Kong, May 2001. [54] S. Handschuh and S. Staab. Annotation for the Semantic Web, volume 96 of Frontiers in Artificial Intelligence and Applications. IOS Press, 2003.

    [55] Siegfried Handschuh, Steffen Staab, and Raphael Volz. On deep annotation. In Proceedings of WWW2003, Twelth International World Wide Web Conference, 2003.

    [68] G.P. Landow. Relationally encoded links and the rhetoric of hypertext. In HYPERTEXT ’87: Proceeding of the ACM conference on Hypertext, pages 331– 343. ACM Press, 1987.

    [56] S. Harper, Y. Yesilada, C. Goble, and R. Stevens. How much is too much in a hypertext link?: Investigating context and preview – a formative evaluation. In Proceedings of the fifteenth ACM conference on Hypertext & hypermedia, pages 116–125, 2004.

    [69] Steve Lawrence, C. Lee Giles, and Kurt Bollacker. Digital libraries and autonomous citation indexing. IEEE Computer, 32(6):67–71, 1999.

[57] Simon Harper and Neha Patel. Gist summaries for visually impaired surfers. In Assets '05: Proceedings of the 7th international ACM SIGACCESS conference on Computers and accessibility, pages 90–97. ACM Press, 2005.

[58] Tom Heath, Martin Dzbor, and Enrico Motta. Supporting user tasks and context: Challenges for semantic web research. In ESWC2005 Workshop on End-user Aspects of the Semantic Web (UserSWeb), Heraklion, Crete, 2005.

[59] Jeff Heflin, James Hendler, and Sean Luke. SHOE: A knowledge representation language for internet applications. Technical Report CS-TR-4078 (UMIACS TR-99-71), Department of Computer Science, University of Maryland, 1999.

[60] S. Herner. Browsing. Encyclopedia of Library and Information Science, 3:408–415, 1970.

[61] Kristina Höök, Jussi Karlgren, Annika Wærn, Nils Dahlbäck, Carl Gustaf Jansson, Klas Karlgren, and Benoît Lemaire. A glass box approach to adaptive hypermedia. User Modeling and User-Adapted Interaction, 6(2-3):157–184, 1996.

[62] T. Joachims, D. Freitag, and T. M. Mitchell. WebWatcher: A tour guide for the world wide web. In IJCAI (1), pages 770–777, 1997.

[63] S. Jul and G.W. Furnas. Navigation in electronic worlds: a CHI 97 workshop. ACM SIGCHI Bulletin, 29(4):44–49, 1997.

[64] J. Kahan and M. Koivunen. Annotea: an open RDF infrastructure for shared web annotations. In Proceedings of the 10th International World Wide Web Conference, pages 623–632, 2001.

[65] V. Kaptelinin. Item recognition in menu selection: the effect of practice. In INTERCHI '93 conference companion on Human factors in computing systems, pages 183–184. ACM Press, 1993.

[66] M. Kellar, C. Watters, and M. Shepherd. The impact of task on the usage of web browser navigation mechanisms. In Proceedings of the 2006 conference on Graphics Interface, pages 235–242, Toronto, Ontario, Canada, 2006. Canadian Information Processing Society.

[67] T. Kopetzky and M. Mühlhäuser. Visual preview for link traversal on the world wide web. In Proceedings of the Eighth International Conference on World Wide Web, pages 1525–1532, New York, NY, USA, 1999. Elsevier North-Holland, Inc.

[70] M.M. Levine. An essay on browsing. Reference Quarterly, 93:35–36, 1969.

[71] D. Lowe and W. Hall. Hypermedia and the Web: An Engineering Approach. John Wiley and Sons Ltd, 1998.

[72] G. Marchionini. Information Seeking in Electronic Environments. Cambridge Series on Human-Computer Interaction. Cambridge University Press, 1995.

[73] G. Marchionini. Exploratory search: from finding to understanding. Communications of the ACM, 49(4):41–46, 2006.

[74] Michael K. Bergman. The deep web: Surfacing hidden value. http://www.brightplanet.com/deepcontent/tutorials/DeepWeb/deepwebwhitepaper.pdf, September 2001.

[75] D. Michaelides, D. Millard, M. Weal, and D. De Roure. Auld Leaky: A contextual open hypermedia link server. In Revised Papers from the International Workshops OHS-7, SC-3, and AH-3 on Hypermedia: Openness, Structural Awareness, and Adaptivity, pages 59–70, London, UK, 2002. Springer-Verlag.

[76] Natasa Milic-Frayling, Rachel Jones, Kerry Rodden, Gavin Smyth, Alan Blackwell, and Ralph Sommerer. SmartBack: supporting users in back navigation. In Proceedings of the 13th international conference on World Wide Web, pages 63–71, 2004.

[77] D. Millard, L. Moreau, H. Davis, and S. Reich. FOHM: A fundamental open hypertext model for investigating interoperability between hypertext domains. In Proceedings of the Eleventh International ACM Conference on Hypertext and Hypermedia, USA, 2000.

[78] David Millard, David De Roure, Danius Michaelides, Mark Thompson, and Mark Weal. Navigational hypertext models for physical hypermedia environments. In Proceedings of the fifteenth ACM conference on Hypertext and hypermedia, pages 110–111. ACM Press, 2004.

[79] J.B. Morrison, P. Pirolli, and S.K. Card. A taxonomic analysis of what world wide web activities significantly impact people's decisions and actions. In CHI '01 extended abstracts on Human factors in computing systems, pages 163–164, 2001.

[80] T.H. Nelson. Embedded markup considered harmful. World Wide Web Journal, 2(4):129–134, 1997. http://www.xml.com/pub/a/w3j/s3.nelson.html.

    [81] Theodor Holm Nelson. Xanalogical structure, needed now more than ever: parallel documents, deep links to content, deep versioning, and deep reuse. ACM Computing Surveys, 31(4es):33, 1999.

    [93] D. Stanyer and R. Procter. Improving web usability with the link lens. In Proceedings of the Eighth International World Wide Web Conference, pages 1533– 1544, Toronto, Canada, 1999.

    [82] M. Noirhomme-Fraiture and V. Serpe. Visual representation of hypermedia links according to their types. In Proceedings of the Working Conference on Advanced Visual Interfaces, pages 146–155, 1998.

    [94] Linda Tauscher and Saul Greenberg. Revisitation patterns in world wide web navigation. In Proceedings of the SIGCHI conference on Human factors in computing systems, pages 399–406, New York, NY, USA, 1997. ACM Press.

[83] Donald A. Norman. Human-centered design considered harmful. interactions, 12(4):14–19, 2005.

[84] H. Obendorf and H. Weinreich. Comparing link marker visualization techniques: changes in reading behavior. In Proceedings of the 12th international conference on World Wide Web, pages 736–745. ACM Press, 2003.

    [95] M. Thuering, J. Hannemann, and J. Haake. Hypermedia and cognition: designing for comprehension. Communications of the ACM, 38(8):57–66, 1995. [96] T. Tsandilas and m. c. schraefel. User-controlled link adaptation. In Proceedings of the fourteenth ACM conference on Hypertext and hypermedia, pages 152–160. ACM Press, 2003.

    [85] C. Olston and E.H. Chi. ScentTrails: Integrating browsing and searching on the web. ACM Transactions Computer-Human Interaction, 10(3):177–197, 2003.

    [97] F. Vitali and M. Bieber. Hypermedia on the web: what will it take? ACM Computing Surveys, 31(4es):31, 1999.

    [86] K. Osterbye and U. K. Wiil. The flag taxonomy of open hypermedia systems. In Proceeding of the Seventh ACM Conference on Hypertext, pages 129–139, Washington, 1996.

    [98] F. Vitali, F. Folli, and C. Tasso. Two implementations of xpointer. In Proceedings of the Thirteenth International Conference on Hypertext, pages 145–146. ACM, 2002.

    [87] T.A. Phelps and R. Wilensky. Robust intra-document locations. In Proceedings of the Ninth international World Wide Web conference, pages 105–118, 2000.

    [99] Harald Weinreich, Hartmut Obendorf, Eelco Herder, and Matthias Mayer. Off the beaten tracks: exploring three aspects of web navigation. In Proceedings of the 15th international conference on World Wide Web, pages 133–142, 2006.

[88] P. Pirolli and S.K. Card. Information foraging. Psychological Review, 106(4), 1999.

[89] Daniel E. Rose and Danny Levinson. Understanding user goals in web search. In WWW '04: Proceedings of the 13th international conference on World Wide Web, pages 13–19, 2004.

[100] R.W. White, B. Kules, S.M. Drucker, and m.c. schraefel. Introduction - exploratory search. Communications of the ACM, 49(4):36–39, 2006.

[101] E. Wilde and D. Lowe. XPath, XLink, XPointer, and XML: A Practical Guide to Web Hyperlinking and Transclusion. Addison Wesley, 2002.

[90] Doron Rosenberg. Migrate apps from Internet Explorer to Mozilla (How to make Internet Explorer-specific Web applications work in Mozilla-based browsers). IBM developerWorks, 2005. http://www-128.ibm.com/developerworks/web/library/wa-ie2mozgd/.

[91] Abigail J. Sellen, Rachel Murphy, and Kate L. Shaw. How knowledge workers use the web. In CHI '02: Proceedings of the SIGCHI conference on Human factors in computing systems, pages 227–234, 2002.

[92] B. Shneiderman and P. Maes. Direct manipulation vs. interface agents. interactions, 4(6):42–61, 1997.

[102] P.T. Zellweger, S.H. Regli, J.D. Mackinlay, and B. Chang. The impact of fluid documents on reading and browsing: an observational study. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 249–256. ACM, 2000.

[103] Jun Zhao. A conceptual framework for e-science provenance. Technical report, Department of Computer Science, The University of Manchester, Manchester, UK, 2004.

