Web-based Information Access

Tiziana Catarci
Dipartimento di Informatica e Sistemistica
Università di Roma "La Sapienza"
Via Salaria 113, 00198 Roma, Italy
E-mail: [email protected]

Abstract

The need for friendly environments supporting effective information access is further reinforced by the growth of the global Internet, which is causing a dramatic change both in the kind of people who access the information and in the types of information itself (ranging from unstructured multimedia data to traditional record-oriented data). To cope with these new demands, the interaction techniques traditionally offered to users have to evolve and eventually be integrated in a powerful interface to the global information infrastructure. The new interaction mechanisms must be especially friendly and easy to use since, given the enormous quantity of information sources available on the Internet, most users remain "permanent novices" with respect to each of the sources they access. This tutorial offers a survey of the main approaches adopted for letting users interact effectively with the Web. Thus, it covers topics related both to extracting the information of interest spread over existing Web sites and to building new, more usable, sites. Being mainly "user-centered", the tutorial analyzes proposals coming from different areas, namely DB, AI, and HCI, which share the final goal of making the Web a huge, easy-to-access information repository.

1. Introduction

The World Wide Web continues to attract the attention of many different research communities (e.g., database, artificial intelligence, software engineering, information retrieval, human-computer interaction) unlike any new technology in recent memory. Looking at this growing interest, some questions come to mind: "What is so unique about this new medium?" "Are researchers inventing new techniques for investigating it?" "Do the diverse research communities cooperate in achieving a global goal?"

The first question is really the most difficult to answer, and it is still being debated [53]. From a technical point of view, the Web is not so different from a huge collection of interlinked documents. It is mainly a matter of scalability, so information retrieval people could tell us that they have already solved many of the existing problems (even if the scalability problem cannot, for several reasons, be underestimated). What really makes the difference is the social impact the Web is having on computer usage: computer professionals could in principle disappear. First, everybody can put data on the Web and make them part of a whole, where the user experience is dominated by a feeling of "being in the Web", so individual page characteristics are partially lost. Second, even professional designers cannot control the user interface: the user controls the navigation and can move in ways that were not intended in the design. Third, the development cycle has been accelerated; it is less structured and almost out of control. Finally, the user community is highly diverse, with different expertise, levels of motivation, interests, goals, tasks, and expectations. In summary, the Web has slipped out of the hands of researchers and computer professionals. Both groups are now trying to catch up with it, either by applying their previous know-how or by building (partially) new theories, on one side, and skills, on the other. It is worth noting that, while the Web could be a wonderful interdisciplinary test bed, very few exchanges and common projects exist among the various research communities.

This tutorial tries to summarize and compare some interesting approaches of the database (DB), artificial intelligence (AI) and human-computer interaction (HCI) communities, which, with different shades, are oriented towards the same goal, i.e., making the Web an easy-to-use, consistent, safe, integrated information repository, where the user does not get lost in the information jungle. In particular, the DB community mainly concentrates on the extraction activity (only a few proposals deal with site construction). It basically amounts to putting a structure over the Web, and/or designing languages and tools to extract the information of interest and manage it in some comfortable and hopefully efficient way. On the AI side there is a lot of research concentrating on information access, mainly exploiting knowledge representation and user modeling techniques (which are also used for building "adaptable" sites). The HCI community's efforts are mainly oriented towards the construction activity, with the final goal of designing usable sites, while visualization, clustering and other techniques are adopted for helping the user find the information of interest.

The tutorial is divided into four main parts. The first part describes and classifies the various types of approaches to systems for information access. In particular, proposals coming from both the DB and AI fields are reviewed, and an integrated approach (merging techniques from both areas) is briefly presented. In the second part, the emphasis is on the kind of data such systems support, and on the models recently proposed for such data, namely semistructured data models. The third part discusses proposals for retrieving information on the Web, i.e., Web query languages and visualization/browsing systems. Finally, the fourth part presents recent work on how to build more effective and usable sites, emphasizing the different contributions coming from HCI (e.g., Web site usability assessment and user-centered design), DB (e.g., the STRUDEL approach), and AI (e.g., adaptive web site design).

2. Global Information Management Systems

To help the user retrieve information scattered all over the Web, systems have been proposed, which we call Global Information Management Systems (GIMSs) [38, 20], whose main goal is to provide a framework for integrating different and heterogeneous information sources into a common domain model. The user interacts with the GIMS as a whole information system, so that s/he can ignore the data schemas used in the sources and access information through a query-answering mechanism. Popular keyword-based search engines can be considered first-generation GIMSs, in which documents are characterized using feature-based representations: a vector whose features are specific words describing document content is associated with each document. A particular case is that in which documents are represented by a list of keywords. Such representations make it easy to classify documents automatically, but offer very limited capabilities for retrieving the information of interest. In this class we include the well-known keyword-based search engines, such as AltaVista, Lycos, Yahoo!, and many others. They are equipped with softbots that explore the entire Web, reading documents and indexing them according to some keys, and they permit a keyword-based search of previously analyzed documents.
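To make the feature-based representation concrete, the following is a minimal sketch of how such a first-generation engine might index and rank documents. The tiny corpus, the tokenization rule, and the cosine-similarity ranking are illustrative assumptions, not the internals of any particular search engine.

```python
# A minimal sketch of feature-based document representation and keyword
# search (illustrative assumptions, not the code of a real engine).
import math
import re
from collections import Counter

corpus = {  # toy collection: url -> text
    "a.html": "cheap vietnamese restaurant in Palo Alto",
    "b.html": "gourmet restaurant guide",
    "c.html": "usability of web sites",
}

def features(text):
    """A document becomes a vector of word frequencies."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Similarity between two feature vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) \
         * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

index = {url: features(text) for url, text in corpus.items()}
query = features("vietnamese restaurant")
for url in sorted(index, key=lambda u: cosine(index[u], query), reverse=True):
    print(url, round(cosine(index[url], query), 2))
```

Ranking by similarity to the query vector is what makes automatic classification easy, while also explaining the limited retrieval capabilities noted above: the representation knows words, not meaning.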

More advanced GIMSs typically use sophisticated methods for representing information sources. Such methods can be roughly classified as being based on database or on knowledge representation techniques (e.g., [24, 6] and [3, 40, 29, 34], respectively). From the DB perspective the Web is regarded as a federation of databases, and query answering is based on the availability of ad-hoc wrappers and mediators for each specific information source. From the AI perspective the Web is essentially a semantic network, and the ability of GIMSs to answer queries relies on methods for dynamically accessing the information sources.

2.1. Database Approaches

The Web is regarded as a federation of databases, with the proviso that database federations typically rely on the presence of a schema describing the sources and on highly structured data, while Web documents are usually unstructured or semistructured. One example is Tsimmis [24], which describes the common schema with the Object Exchange Model (OEM, see next section). Tsimmis makes use of translators to translate both data objects into a common information model and queries into requests for an information source, while mediators embed the knowledge needed for processing a specific type of information, once the content of the information sources is known. Each mediator needs to know which sources it will use to retrieve information. Therefore, a model of the information sources has to be explicitly specified, but it is possible to work without a global database schema. Classifiers and extractors can be used to extract information from unstructured documents (e.g., plain text files, mail messages, etc.) and classify them in terms of the domain model. The classifier/extractor components of Tsimmis are based on the Rufus system [54]. Rufus uses an object-oriented database to store descriptive information about users' data, and a full-text retrieval database to provide access to the textual content of the data. Another proposal along these lines is the ARANEUS Project [7], whose aim is to make explicit the schema according to which the data are organized in so-called structured servers, and then use this schema to pose queries in a high-level language instead of browsing the data. Even though the ability to construct structured descriptions of the information in the Web enables the system to answer users' queries effectively, the approach has the following drawbacks, which are typical of a database perspective: 1) Araneus works only on a particular kind of Web sites and pages, which have a clearly specified structure, not on generic ones; 2) the user has to completely specify the relational schema corresponding to the site data: there is no automatic translation from the site to the database; 3) there is no support for the automatic search and conceptualization of WWW sites similar to prototypical ones indicated by the user.
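As a concrete illustration of the wrapper/translator idea, the following sketch turns a source-specific HTML fragment into objects of a common model. The HTML layout, the extraction rule, and the object format are illustrative assumptions, not the actual Tsimmis translators.

```python
# A hedged sketch of a Tsimmis-style wrapper/translator (illustrative
# assumptions only): a source-specific extraction rule translates an
# HTML fragment into objects of the common domain model.
import re

SOURCE_HTML = """<li><b>Saigon</b> - vietnamese - Palo Alto, 98765</li>
<li><b>Chef Xu</b> - gourmet - Palo Alto, 90876</li>"""

RULE = re.compile(r"<b>(?P<name>[^<]+)</b> - (?P<category>\w+) - "
                  r"(?P<city>[\w ]+), (?P<zipcode>\d+)")

def wrap(html):
    """Translate the source's private layout into common-model objects."""
    return [dict(m.groupdict(), type="restaurant") for m in RULE.finditer(html)]

for obj in wrap(SOURCE_HTML):
    print(obj)   # e.g. {'name': 'Saigon', ..., 'type': 'restaurant'}
```

A mediator would then answer queries over these objects without caring which source, and hence which wrapper, produced them.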

2.2. Knowledge-based Approaches

Knowledge-based GIMSs are systems using a knowledge representation (KR) approach for representing information sources, acquiring data, and processing queries. Many logical frameworks are used to represent information, and many KR systems are used to reason about it. The main design element for these systems is the knowledge representation language. Also relevant are automatic data acquisition techniques, useful to build and update knowledge bases, as well as query-planning techniques, adopted to answer users' queries. As for the knowledge-representation language and data acquisition aspects, let us remark that a GIMS needs to represent both the application domain and the content of the information sources. Usually a single KR language is adopted, such as Description Logic, which is suited to represent taxonomic knowledge [13]. In addition, a basic feature for a GIMS is the possibility of identifying interesting information sources unknown to the user and of automatically gathering relevant information units from them. In other words, tools to scale up with the growth of the information space are needed. The discovery of new information sources, the extraction of information units within them, and the interpretation of data coming from these sources are all problems related to information acquisition. This issue is rarely addressed: most systems force the user to hand-code information source models. The main exceptions are ShopBot and ILA [50]. ShopBot addresses the extraction problem by learning how to access an on-line catalog (via an HTML form) and how to extract information about products; it uses an unsupervised learning algorithm with a small training set. ILA (Internet Learning Agent), instead, focuses on the interpretation problem: it learns how to translate information source output into the domain model, using a set of descriptions of objects in the world. It is worth noting that, especially when dealing with the automatic discovery and integration of information sources, the vocabulary problem is one of the most critical ones. The presence of possibly different terms representing the same concept in the description of a source or an information unit is a significant example. At least three possibilities have been explored to face this problem: (i) a unique vocabulary, that is, forcing the descriptions of information sources and the domain model to share the same vocabulary; (ii) manual mapping, that is, relationships between similar concepts are hand-coded; (iii) automatic (or semi-automatic) mapping, in which the system takes advantage of existing ontologies that provide synonym, hypernym and hyponym relationships between terms. It is worth noting that using hypernym and hyponym relationships to solve ontological questions involves information loss when generalizations of terms are used. On the other hand, it is a very powerful tool for retrieving information.

As for query answering, a significant body of work on agents able to reason and make plans has been developed. In this case, the representation of the information sources is known to the system. The use of planning techniques to retrieve the information requested by a user's query has been very common in this context and is in general aimed at introducing a certain degree of flexibility in exploring the information sources and extracting information from them. For instance, in Information Manifold [43] the content of information sources is described by query expressions that are used to determine precisely which sources are needed to answer the query. The planning algorithm first computes the information sources relevant to each subgoal; next, conjunctive plans are constructed so that the soundness and completeness of information retrieval and the minimization of the number of information sources to be accessed are guaranteed. In this system, interleaving planning and execution is a useful way to obtain information for reducing the cost of the query during plan execution. SIMS [3] defines operators for query reformulation and uses them to select relevant sources and to integrate the available information to satisfy the query. Since source selection is integrated into the planning system, SIMS can use information about resource availability and access costs to minimize the overall cost of a query. A final note is on the closed world assumption adopted by all the above systems: they assume that the domain model contains all the information needed and that all unavailable information does not exist. By contrast, the Internet Softbot [29] provides a framework for reasoning with incomplete information, executing sensing actions to provide forms of local closure, i.e., to verify the actual presence of information in the source during plan execution.
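The following is a minimal sketch of the subgoal-to-source selection step just described, in the spirit of Information Manifold [43]. The capability descriptions, the source names, and the minimize-the-number-of-sources criterion are illustrative assumptions, not the published algorithm.

```python
# A hedged sketch of source selection for conjunctive plans (illustrative
# assumptions, not the actual Information Manifold algorithm).
from itertools import product

SOURCES = {   # each source advertises the query predicates it can answer
    "cs-bib":   {"paper", "author"},
    "dblp":     {"paper", "author", "conference"},
    "homepage": {"email", "author"},
}

def conjunctive_plans(subgoals):
    """First compute the sources relevant to each subgoal, then build
    every conjunctive plan by picking one source per subgoal."""
    per_subgoal = [[s for s, caps in SOURCES.items() if g in caps]
                   for g in subgoals]
    return [set(choice) for choice in product(*per_subgoal)]

def best_plan(subgoals):
    """Among complete plans, prefer one touching the fewest sources."""
    return min(conjunctive_plans(subgoals), key=len)

print(best_plan(["paper", "conference", "author"]))   # {'dblp'}
```

Touching fewer sources is one proxy for access cost; a system such as SIMS would additionally weigh resource availability and per-source costs when choosing among plans.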

2.3. An Integrated Approach: WAG

In the WAG (Web-At-a-Glance) project [16, 17], a database conceptual model (namely the Graph Model [21, 22]) and its environment for interacting with the user [18] are coupled with the CLASSIC knowledge representation system [9]. This gives rise to a system aimed at semi-automatically building conceptual views over information extracted from various Web sites and at allowing the user to query such views. WAG differs from the approaches described above in two major respects. First of all, WAG exploits the advantages of both knowledge representation and database techniques, since it follows a database approach at the external (user-oriented) level, and a knowledge representation approach at the internal level. Indeed, WAG allows the user to access the Web data by issuing a visual query on the conceptual schema of a database (thus, s/he never encounters the well-known problems of disorientation and cognitive overhead in finding the data of interest on the Web), but, in order to build such a database, the system relies on sophisticated knowledge representation techniques. Second, instead of requiring an explicit description of the sources, WAG attempts to semi-automatically classify the information gathered from various sites based on the conceptual model of the domain of interest.

Figure 1. The WAG architecture (components: an HTML-browser user interface, the WAG Querier, the WAG Engine, HTML "hot" sites on the Web, and the WEB DataBase).

Fig. 1 shows a high-level view of the WAG architecture. The user interacts with the user interface, which allows for switching among a conventional HTML browser, the WAG Querier, and the WAG Engine. Whenever the user encounters a site containing information about a relevant topic, s/he can activate the WAG Engine to analyze it. The WAG Engine reaches the site pointed out by the user and collects the HTML pages belonging to it. Once the site pages are locally available, the WAG Engine starts its analysis. In doing so, some additional information on the domain of interest is needed; it is provided either by the system knowledge base or by the user, through an interactive session. In the latter case, the information provided by the user is added to the knowledge base for further reuse. The main objective of the analysis process is to associate a conceptual data schema with the site under observation and to populate it. The results of this process, which may again involve the user in an interactive fashion, are stored in the WEB DataBase. More precisely, the WEB DataBase contains both the data and the locations in which such data are available (e.g., the page URL, the page paragraph, etc.). Once the site has been fully analyzed, the user can query the WEB DataBase through the WAG Querier, according to several query modalities provided by the system. The WAG Querier handles all the phases needed to answer a user's query. In particular, it provides the user with a multiparadigmatic visual environment (see [18]), equipped with different visualizations and interaction mechanisms, including a synchronized browser on the database schema and instances and several ad-hoc features for interacting with multimedia data.

3. Semistructured Data Models

Data that can be found on the Web are very inhomogeneous and diverse, and they typically do not seem to follow a precise structure.¹ By contrast, traditional databases are very regular and structured, and this brings several benefits, e.g., the ability to maintain integrity, to query based on structure, and to optimize query execution. Such benefits are only partially counterbalanced by the lack of flexibility in asking queries and by the need to know the database schema in advance. Considering the above and other issues, database people coined the term semistructured data and, starting from object-oriented and graph models, defined the so-called semistructured data models [11, 1]. Buneman's [11] interesting definition of a semistructured data model is "a syntax for data with no separate syntax for types". The basic idea of such models is to couple the existence of some sort of schema or structure (which typically resembles a graph, even if this is not necessary; see, e.g., [37], which is based on F-logic) with a high degree of freedom in the adherence of the data to such a structure: the schema (also called a data guide in semistructured models [2]) provides indicative information on the current type of the data, but violations are permitted, in terms of both completeness and typing [12]. Also, the schema may be given implicitly for part of the data and may need to be extracted a posteriori (especially for data integration purposes). Finally, as a consequence of the need to accommodate data heterogeneity, schemas can be very large, compared with traditional database schemata, and vary very frequently. Semistructured data models are extensively used in systems that integrate heterogeneous sources and allow the user to query them (see also the GIMSs described in Section 2 above), and as such they need to be sufficiently simple and yet powerful and flexible enough to describe the semistructured (and structured) data sources found on the Net. For instance, OEM is the semistructured data model used in the Lore project [2]. OEM uses a labeled graph as its basic modeling structure, where vertices represent both atomic and complex objects. Each object has a unique object identifier. Atomic objects contain a value from one of the basic atomic types, e.g., integer, string, html, gif, etc. Complex objects take their values from a set of object references, i.e., (label, oid) pairs; see Fig. 2.

¹ Actually, once (and if) XML becomes a standard this will no longer be true, but the existing Web content is still highly chaotic and irregular.

Figure 2. An OEM graph (labeled nodes identified by object identifiers such as &12, &19, &35 and &77, connected by restaurant, category, name, address, city and zipcode edges, with atomic leaves such as "gourmet", "Chef Xu", "vietnamese", "Saigon", "Palo Alto", "90876", "98765", "cheap" and "McDonald").

As Fig. 2 shows, the data structure is not rigid: some attributes may or may not be present, and their structure may differ from object to object. Similarly, queries do not need to be strongly typed. For instance, we may ask for the zipcode of certain restaurants without knowing whether the zipcode occurs as part of the address or as a direct sub-object of the restaurant.
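The following is a minimal sketch of how an OEM-style labeled graph can be represented and queried without knowing where a label occurs. The Python encoding, the object identifiers, and the values (echoing the flavor of Fig. 2) are illustrative assumptions, not Lore's implementation.

```python
# A hedged sketch of OEM-style data (illustrative assumptions, not Lore):
# each oid maps to either an atomic value or a list of (label, oid) pairs.
oem = {
    "&12": [("restaurant", "&19"), ("restaurant", "&35"), ("restaurant", "&77")],
    "&19": [("category", "&17"), ("name", "&13"), ("address", "&14")],
    "&17": "gourmet", "&13": "Chef Xu",
    "&14": [("city", "&66"), ("zipcode", "&44")],
    "&66": "Palo Alto", "&44": "90876",
    "&35": [("category", "&23"), ("name", "&16"), ("zipcode", "&55")],
    "&23": "vietnamese", "&16": "Saigon", "&55": "98765",
    "&77": [("category", "&29"), ("name", "&80")],
    "&29": "cheap", "&80": "McDonald",
}

def find(oid, label):
    """Yield every value reachable from oid via an edge named `label`,
    at any depth: the query need not know where `zipcode` lives."""
    value = oem[oid]
    if isinstance(value, list):
        for edge, child in value:
            if edge == label:
                yield oem[child]
            yield from find(child, label)

print(list(find("&12", "zipcode")))   # ['90876', '98765']
```

Note that one zipcode is reached under an address while the other hangs directly off its restaurant, yet the same query retrieves both, which is exactly the kind of schema tolerance discussed above.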

4. Retrieving Information on the Web

Searching on the Web, different kinds of consumers can locate several types of information, ranging from unstructured documents and pictures to structured, database-oriented information [15]. The interaction mechanisms must be friendly and easy to use since, given the enormous quantity of information sources available on the Internet, most users remain permanent novices with respect to each of the sources they access.² From the HCI perspective, the idea is to provide the user with mainly visual tools that help her/him browse the Web and organize the information of interest: the end user is the active subject, and the Web information is unchanged. From the DB perspective, the idea is to allow the user to ask queries over multiple sources and get an integrated result: the system is the active subject, and the information is manipulated, chopped up and integrated.

² The concept of permanent novice was introduced by Borgman in the context of online catalogues [10].

4.1. Web Query Languages

An approach that provides the user with mechanisms for querying (rather than browsing) the Web involves the development of Web query languages, somewhat similar to database query languages (see [1, 33, 44]). Note that, at least in the first generation of such languages, the approach is only weakly related to the idea of modeling the information stored in the Web sites. The main idea is to model the topology of the Web document network (in particular, in [44] a "virtual graph" is used to represent the hypertextual documents in the Web), and to provide the user with a query language enriched with primitives for specifying query conditions on both the structure of single documents and their locality in the network. However, the user has no way to query the Web information content. Relevant examples in the literature are W3QL [41], WebLog [42], and WebSQL [44]. In particular, WebSQL proposes to model the Web as a relational database composed of two virtual relations, namely Document and Anchor. The first represents the documents in the Web (one tuple per document), while the second has one tuple for each anchor in each document in the Web. Based on this relational model of the Web, the user may ask queries in an SQL-like language. However, since the two relations are completely virtual, the basic way to materialize them is by navigating from a known URL to reach other documents. Path regular expressions, specified in the FROM clause of the SQL query, are used to describe this navigation (a toy sketch of this materialization-by-navigation appears at the end of this subsection). This first generation of Web languages still presents many of the problems one encounters when using indexes, such as information changes or the lack of a representation of document structure. However, the possibility of capturing the structure of a hypermedia network, explicitly describing links between documents, and the introduction of the "query locality" concept to measure the cost of answering a query are important elements that need to be taken into account in the development of effective and efficient systems.

The second generation of Web query languages exhibits a powerful feature with respect to the above ones, namely access to the structure of the Web objects they manipulate, modeling the internal as well as the external links of the documents and supporting some semistructured data modeling features. Also, they have the ability to define new complex structures as the query result. Examples of this category of languages are WebOQL [5] and StruQL [31]. Lorel [2] is a language that may also be included in this category, even if it was not developed specifically for the Web. Indeed, it was a language for generic semistructured data (namely the language for the OEM data model mentioned in the previous section), but it then proved to work quite well for HTML and XML documents on the Web. It follows an object approach, makes extensive use of coercion, and is equipped with ad-hoc constructs to manage complex path expressions. WebOQL uses the hypertree as its main data structure, i.e., an arc-labeled tree with two types of arcs, internal and external: internal arcs are used to represent structured objects, and external arcs represent references between objects. A web in WebOQL is a set of related hypertrees. The queries follow the select-from-where structure of SQL, and use navigation patterns (i.e., regular expressions over an alphabet of record predicates) to specify the structure of the paths that must be followed in order to compute the query result, which is either a new hypertree or a new web.

The recent growth of XML has led to the development of another class of languages (for instance, XML-QL [28]) explicitly conceived for XML documents. XML allows one to define documents with complex structure, introducing arbitrary tags and attributes and checking their conformity with respect to an optional document type definition (DTD), which represents a sort of schema for the document. XML-QL is a query language for semistructured data, based on a simple data model which expresses only the information essential to the data representation (for example, it does not represent the document layout). The basic modeling structure is a graph, in which each vertex is represented by an object identifier, edges are labeled with element tag identifiers, nodes are labeled with sets of attribute-value pairs, and leaves with values. XML-QL syntax again resembles SQL syntax. It uses element patterns to match data in XML documents, composed into regular expressions, and it allows grouping, nesting, the construction of complex query results, and the integration of data coming from multiple sources. Since XML is more structured than HTML, XML query languages may exploit such structure to express complex queries. Unfortunately, many documents on the Web are still HTML, so it is still too early to measure the adequacy of XML query languages. Finally, from the user's point of view, all these languages are too difficult to use. SQL itself is too difficult for casual users [19], so very recent proposals aim at putting visual query interfaces on top of Web query languages, e.g., XML-GL [23], which uses graphs as its visual representation, and EquiX [26], which is based on forms. Such proposals claim to offer easy-to-use alternatives to textual Web query languages. Unfortunately, no usability experiment results are presented, so there is no evidence supporting such claims. Also, one of the usability problems that users could have with such languages is the lack of a "user-oriented" interaction strategy. Indeed, even if the languages employ visual elements, the proposed interaction styles seem to be one-to-one with the traditional (textual) query formulation.
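As promised above, here is a minimal sketch of materializing WebSQL-style Document and Anchor relations by navigation from a known URL. The breadth-first strategy, the tuple formats, and the page limit are illustrative assumptions, not the actual WebSQL evaluator.

```python
# A hedged sketch of materializing WebSQL's virtual Document and Anchor
# relations by navigation (illustrative assumptions, not WebSQL itself).
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class AnchorCollector(HTMLParser):
    """Collects (href, label) pairs from the <a> tags of an HTML page."""
    def __init__(self):
        super().__init__()
        self.anchors = []        # (href, label) pairs found so far
        self._href = None
        self._label = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._label = []

    def handle_data(self, data):
        if self._href is not None:
            self._label.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._label).strip()))
            self._href = None

def materialize(seed_url, max_pages=10):
    """Breadth-first navigation filling the two 'virtual relations':
    Document(url, text) and Anchor(base, href, label)."""
    document, anchor = [], []
    frontier, seen = [seed_url], {seed_url}
    while frontier and len(document) < max_pages:
        url = frontier.pop(0)
        try:
            text = urlopen(url).read().decode("utf-8", errors="replace")
        except (OSError, ValueError):
            continue             # unreachable or non-fetchable URL
        document.append((url, text))
        parser = AnchorCollector()
        parser.feed(text)
        for href, label in parser.anchors:
            target = urljoin(url, href)
            anchor.append((url, target, label))
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return document, anchor
```

An SQL-like query would then be evaluated against these two tables, with the path regular expression in the FROM clause bounding how far the navigation proceeds from the seed URL.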

4.2. Browsing and Visualizing the Web

As we discussed in the previous sections, the idea of the database and knowledge representation communities is to help the user find the information of interest by providing her/him with different means to analyze, structure, and chop up the sources of such information (i.e., Web sites containing collections of possibly related pages). The HCI community also shares the goal of helping the user, but the means for achieving it are quite different. Here one of the basic ideas is not to extract information from data by transforming, structuring, and merging them; the key point is rather to represent the data in a form that matches the user's perceptual capabilities, so that s/he may easily grasp the information of interest. One of the fundamental mechanisms to achieve this is the correct usage of powerful visualizations (see [35] for a survey). These visualizations range from sophisticated techniques for visualizing large networks in a screen shot [35], to animated spaces where related information may be organized, analyzed and linked by means of different visual mechanisms [27], to sensemaking tools, which help users understand information by associating and combining it. Such tools re-represent retrieved information to make patterns visible, or they allow the construction of new information patterns from old ones by exploiting the power of the visual attributes of the representation, which may be quickly detected by the eye [25]. Visualization mechanisms may help even when dealing with single documents. For instance, the WebBook [14] lets users group the pages of one or more documents into a simulated physical book and fan them for rapid scanning. CyberMap [36] automatically generates overview maps for textual documents. CyberMap creates a graph over a collection of documents by clustering related documents into nodes by content and automatically generating links between semantically related nodes. The resulting graph can be viewed in multiple representations, providing quick access to information and data filtering in the Web. In addition to several visual techniques, CyberMap builds a personal user profile and tracks the user's interaction history to offer personalized views of the various documents.
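The following is a minimal sketch of the map-generation idea behind tools like CyberMap: documents whose content overlaps enough become linked nodes of an overview graph. The toy collection, the term-set representation, and the Jaccard threshold are illustrative assumptions, not the actual system.

```python
# A hedged sketch of content-based map generation (illustrative assumptions,
# not CyberMap itself): link documents whose term overlap is high enough.
import re

docs = {   # toy collection: name -> text
    "d1": "semistructured data models for the web",
    "d2": "query languages for semistructured data",
    "d3": "usability studies of web site navigation",
}

def terms(text):
    """Represent a document by its set of words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Overlap between two term sets, in [0, 1]."""
    return len(a & b) / len(a | b) if a | b else 0.0

vocab = {name: terms(text) for name, text in docs.items()}
links = [(a, b) for a in docs for b in docs
         if a < b and jaccard(vocab[a], vocab[b]) > 0.2]
print(links)   # edges of the overview map, e.g. [('d1', 'd2')]
```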

5. Web Site Construction

The problem of semi-automatically building Web sites that are easy to navigate and provide the user with the information s/he is searching for is still open and is common to the DB, AI, and HCI communities. However, each of them has proposed quite different solutions, which are briefly surveyed in the remainder of this section.

5.1. Web Site Usability

Nowadays, in designing interactive systems, the focus of attention has shifted towards the user (the term user-centered has been coined to denote this fact [48]). In order to be user-centered, the design of any interactive system should be carried out by emphasizing the following three points: 1. the identification of the users and their needs; 2. the use of this information to develop a system which, through a suitable interface, meets the users' needs; 3. the usability evaluation and validation tests of the system. Several different definitions of usability exist in the literature. In particular, in [8] a very comprehensive definition of usability is given as "the extent to which a product can be used with efficiency, effectiveness and satisfaction by specific users to achieve specific goals in specific environments". More precisely, effectiveness refers to the extent to which the intended goals of the system can be achieved; efficiency is the time, the money, and the mental effort spent to achieve these goals; satisfaction depends on how comfortable the users feel using the system. From this point of view, usability is a major component of the quality of the interaction between the user and the overall system. In the Web context, usability mainly refers to how easy it is to find, understand and use the information displayed on a Web site [39]. Unfortunately, usability was not initially taken into account in developing Web sites, and only recently have methodologies for designing usable web sites been proposed (see, e.g., [55, 47, 45, 56]). According to [46] and [55], Web usability problems fall into two categories:

- Site-level usability: information architecture, navigation, and search; linking strategy; designs for internal and external users; overall writing style; page templates; layout and site-wide design standards; graphical language and commonly used icons.

- Page-level usability: understandability of headlines, links, and explanations; intuitiveness of forms and error messages; inclusion or exclusion of specific information; individual graphics and icons.

The above problems emerged from very extensive usability studies, carried out by testing hundreds of users interacting with various Web sites (including the commercial sites of big companies). It is worth noting that in the experiments the users were not supposed to surf the Web: they were searching for specific information. Such studies showed quite uniform behaviour and reactions across the different users. Users basically have very little patience with poorly designed sites and tend to quickly abandon sites whose navigation is poorly organized. Also, users do not want to scroll, do not want to read much text, do not want overcrowded pages, and do not like animated graphics. Summarizing, the most important factors in Web site usability are the organization of the site content and of the navigation. All other aspects, such as the use of one color instead of another, are secondary with respect to the primary goal of preventing the user from getting completely stuck (which is typically caused by a wrong navigation design). Experiments showed that average users do not have the domain knowledge they would need to navigate the site and that, surprisingly enough, they do not build a mental model of the site. Users apparently do not think about the site structure at all. Instead, they continue on an exploratory path through the site until they find what they are looking for or become so frustrated that they give up. A good design methodology should concentrate on (in order of importance): 1. designing the information structure and navigation paths; 2. designing effective links, i.e., links from which the user can easily predict where the link will lead and which can be differentiated from one another; 3. designing self-explanatory within-site search mechanisms; 4. designing readable pages with a good layout (note that the layout of a Web page has characteristics which are very different from those of a book page); 5. using graphic features to improve the overall design (without overloading it).

5.2. Adaptive Web Sites

It is now well known that complex interactive systems, intended to be used by a wide spectrum of users with different goals and skills, cannot be monolithic packages; they have instead to evolve and adapt to the different users' needs and preferences [18]. This is obviously true also for Web sites, which have different kinds of visitors, and each visitor may have different needs at different times. The AI community is addressing this problem by introducing adaptive Web sites, i.e., Web sites that automatically improve their organization and presentation by learning from user access paths [52]. Two different types of adaptivity may be conceived. On one side, sites may focus on customization, which means altering the pages or the navigation paths in real time to suit an individual user's needs. On the other, the site may focus on optimization, self-modifying to make navigation easier for all [52]. An example of the first kind of adaptivity is WebWatcher [4], a system which predicts which links the user will follow on a particular page as a function of her/his interests. The system learns to predict links by example. Examples are formed by asking a sample of users what they are searching for and, at the end of their interaction, whether they found what they wanted. WebWatcher then uses the paths of satisfied visitors as examples of successful navigation and highlights the links composing those paths to other visitors with the same goal. AVANTI [32] uses a similar approach and, based on what it knows about the user, tries to predict her/his ultimate goal and customizes the site to better fit that goal. In the optimization-oriented approach the system tries to globally improve the site: instead of making changes for each user, the site learns from all users to improve its usability. This is a much more ambitious goal, since the improvement of the site should be based on some formal metric for measuring usability, but, since not all usability components can be precisely measured, defining a metric which takes all of them into account (including, for instance, user satisfaction) is still an open problem. In [51] a system is proposed which aims at improving a site's organization by applying appropriate transformations, which include rearranging and highlighting links as well as synthesizing new pages. The system learns from common patterns in the user access logs and decides how to transform the site to make it easier to navigate.
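The following is a minimal sketch of the log-mining idea behind this optimization approach: pages that frequently co-occur within user visits become candidates for a new, synthesized index page. The toy log, the co-occurrence counting, and the threshold are illustrative assumptions, not the algorithm of [51].

```python
# A hedged sketch of learning from access logs (illustrative assumptions,
# not the actual system of [51]): frequently co-visited pages suggest the
# contents of a new, synthesized index page.
from collections import Counter
from itertools import combinations

visits = [   # toy access log: one list of visited pages per user session
    ["/home", "/products", "/specs"],
    ["/home", "/specs", "/order"],
    ["/home", "/products", "/specs", "/order"],
]

pair_counts = Counter()
for session in visits:
    for pair in combinations(sorted(set(session)), 2):
        pair_counts[pair] += 1

# Pages co-visited in at least `threshold` sessions become candidate
# links for a synthesized index page.
threshold = 2
candidates = {p for pair, n in pair_counts.items() if n >= threshold
              for p in pair}
print(sorted(candidates))
```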

5.3. DB Techniques for Building Web Sites

The DB community initially concentrated on the problem of accessing existing Web sites and extracting the information of interest. However, since Web sites are mainly repositories of information, it recently appeared quite natural to apply to them techniques used for database construction and maintenance. This is actually the problem discussed in the above subsections, and the tasks to be accomplished are also the same: 1. choosing the data that will be displayed at the site; 2. designing the site and navigation structure; 3. designing the page layout. However, the approach taken to tackle the problem is quite different from those discussed above, and it is far from being really user-centered. Indeed, most of the proposed systems (see, e.g., [30, 5, 6, 49]) concentrate on providing an explicit declarative representation of the structure of a Web site, which is basically defined as a view over existing data and represented as a graph of pages connected by hypertext links. For instance, STRUDEL [30] works by separating the management of the site's data, structure, and presentation. First, the site builder defines the data that will be available at the site, which may come from the internal STRUDEL repository or reside in external sources. Then, using the STRUDEL mediator and a set of source-specific wrappers, an integrated view of the data is created. At this point, the site builder declaratively specifies the Web site structure by issuing a so-called site-definition query in STRUQL, the STRUDEL query language. The evaluation of such a query results in a site graph, which models both the site content and its structure. Finally, the site builder specifies the graphical presentation of pages in the STRUDEL HTML template language, and the HTML generator produces HTML text for every node in the site graph from a corresponding HTML template. This kind of declarative approach provides several benefits, including the possibility of easily creating multiple versions of the same site, possibly tailored to different users (thus realizing a form of adaptivity, as described in the previous subsection), support for site evolution, and the possibility of expressing and enforcing integrity constraints on the site and of incrementally updating it. However, in order to guarantee the effectiveness and usability of the produced sites, it is necessary to complement this approach with user-centered design methodologies and usability tests coming from the HCI field.
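To give a feel for the data/structure/presentation separation just described, here is a toy analogue of the pipeline; the data, the site-definition step, and the template are illustrative assumptions, and the Python code merely stands in for a StruQL site-definition query.

```python
# A hedged toy analogue of declarative site construction (illustrative
# assumptions, not Strudel itself): a site graph is computed as a view
# over the data, then each node is rendered from an HTML template.
papers = [   # the integrated data view (in Strudel it would come from
             # the internal repository or wrapped external sources)
    {"title": "Querying semi-structured data", "year": 1997},
    {"title": "The Lorel query language", "year": 1997},
    {"title": "To weave the Web", "year": 1998},
]

# "Site-definition query": an index page linking to one page per year.
site = {"index": sorted({p["year"] for p in papers})}
for year in site["index"]:
    site[year] = [p["title"] for p in papers if p["year"] == year]

TEMPLATE = "<h1>{title}</h1><ul>{items}</ul>"   # toy HTML template

def render(node):
    """Produce HTML for one node of the site graph."""
    items = "".join(f"<li>{x}</li>" for x in site[node])
    return TEMPLATE.format(title=node, items=items)

print(render("index"))   # index page: <li>1997</li><li>1998</li>
print(render(1997))      # per-year page listing two paper titles
```

Because the site graph is just a computed view, regenerating it with a different "query" (say, grouping by author instead of year) yields an alternative version of the same site, which is the kind of benefit claimed above.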

6. Conclusions

What makes the Web so unique is mainly the revolution it has caused in the traditional roles of the computer professional and the generic user. In principle, it could be seen as the revenge of millions of obscure computer users. Quite recently, computer devotees (both practitioners and researchers) realized this and decided to win back their disappearing power: first of all by demonstrating that the Web is underused (and often badly used), and then by defining theories, models, languages, and systems to greatly improve such usage. However, they (at least the researchers) are still far from succeeding, not only because the problem is very difficult to tackle, but also because the research communities involved seem to communicate very little with each other. This is a pity, since the Web is surely the largest interdisciplinary case study we have ever had at our disposal.

References

[1] S. Abiteboul. Querying semi-structured data. In Proceedings of the International Conference on Database Theory, 1997.
[2] S. Abiteboul, D. Quass, J. McHugh, J. Widom, and J. Wiener. The Lorel query language for semistructured data. Int. J. on Digital Libraries, 1(1):68–88, 1997.
[3] Y. Arens, C. A. Knoblock, and W. Shen. Query reformulation for dynamic information integration. Journal of Intelligent Information Systems, 1996.
[4] R. Armstrong, D. Freitag, T. Joachims, and T. Mitchell. WebWatcher: A learning apprentice for the World Wide Web. In Proceedings of the AAAI Spring Symposium on Information Gathering from Heterogeneous Distributed Environments, 1995.
[5] G. Arocena and A. Mendelzon. WebOQL: Restructuring documents, databases, and webs. In Proceedings of the International Conference on Data Engineering, 1998.
[6] P. Atzeni, G. Mecca, and P. Merialdo. To weave the Web. In Proceedings of the Int. Conf. on Very Large Databases, 1997.
[7] P. Atzeni, G. Mecca, and P. Merialdo. Design and maintenance of data intensive Web sites. In Proceedings of the Conf. on Extending Database Technology (EDBT), 1998.
[8] N. Bevan and M. Macleod. Usability assessment and measurement. In The Management of Software Quality (M. Kelly, ed.). Ashgate Technical/Gower Press, 1993.
[9] A. Borgida, R. J. Brachman, D. L. McGuinness, and L. Alperin Resnick. CLASSIC: A structural data model for objects. In Proceedings of the ACM SIGMOD Conf. on Management of Data, pages 59–67, 1989.

[10] C. Borgman. Why are online catalogs hard to use? Lessons learned from information-retrieval studies. Journal of the American Society for Information Science, 37(6), 1986.
[11] P. Buneman. Semistructured data. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS), 1997.
[12] D. Calvanese, G. De Giacomo, and M. Lenzerini. What can knowledge representation do for semi-structured data? In Proc. of the Fifteenth National Conference on Artificial Intelligence and Tenth Innovative Applications of Artificial Intelligence Conference (AAAI 98, IAAI 98), 1998.
[13] D. Calvanese, M. Lenzerini, and D. Nardi. Description logics for conceptual data modeling. In Logics for Databases and Information Systems. Kluwer Academic Publishers, 1998.
[14] S. Card, G. Robertson, and W. York. The WebBook and the Web Forager: An information workspace for the World Wide Web. In Proceedings of ACM Conference on Human Factors in Computing Systems (CHI), 1996.
[15] T. Catarci. Interacting with databases in the global information infrastructure. IEEE Communications Magazine, 35(5), 1997.
[16] T. Catarci, S. Chang, W. Liu, and G. Santucci. A lightweight Web-At-a-Glance system for intelligent information retrieval. Knowledge-Based Systems, 11(2):115–124, 1997.
[17] T. Catarci, S. Chang, D. Nardi, and G. Santucci. WAG: Web-At-a-Glance. Int. Jour. of Cooperative Information Systems (IJCIS), 7(2):187–214, 1998.
[18] T. Catarci, S. K. Chang, M. F. Costabile, S. Levialdi, and G. Santucci. A graph-based framework for multiparadigmatic visual access to databases. IEEE Transactions on Software Engineering, 8(3):455–475, 1996.
[19] T. Catarci, M. Costabile, S. Levialdi, and C. Batini. Visual query systems for databases: A survey. IEEE Transactions on Knowledge and Data Engineering, 8(3), 1996.
[20] T. Catarci, L. Iocchi, D. Nardi, and G. Santucci. Conceptual views over the web. In Proc. of the VLDB Workshop on Knowledge Representation Meets Databases, August 1997.
[21] T. Catarci, G. Santucci, and M. Angelaccio. Fundamental graphical primitives for visual query languages. Information Systems, 18(2):75–98, 1993.
[22] T. Catarci, G. Santucci, and J. Cardiff. Graphical interaction with heterogeneous databases. VLDB Journal, 6(2):97–120, 1997.
[23] S. Ceri, S. Comai, E. Damiani, P. Fraternali, S. Paraboschi, and L. Tanca. XML-GL: A graphical query language for querying and restructuring XML documents. http://www.w3.org/TandS/QL/QL98/pp/xml-gl.html, 1998.
[24] S. Chawathe, H. Garcia-Molina, J. Hammer, K. Ireland, Y. Papakonstantinou, J. Ullman, and J. Widom. The TSIMMIS Project: Integration of heterogeneous information sources. In Proc. of IPSJ Conference, pages 7–18, 1994.
[25] E. Chi, J. Mackinlay, P. Pirolli, R. Gossweiler, and S. Card. Visualizing the evolution of web ecologies. In Proceedings of ACM Conference on Human Factors in Computing Systems (CHI), 1998.
[26] S. Cohen, Y. Kanza, Y. Kogan, W. Nutt, Y. Sagiv, and A. Serebrenik. EquiX: Easy querying in XML databases. In Proceedings of the ACM SIGMOD Workshop on The Web and Databases (WebDB99), 1999.
[27] M. Czerwinski, S. Dumais, G. Robertson, S. Dziadosz, S. Tiernan, and M. van Dantzich. Visualizing implicit queries for information management and retrieval. In Proceedings of ACM Conference on Human Factors in Computing Systems (CHI), 1999.
[28] A. Deutsch, M. Fernandez, D. Florescu, A. Levy, and D. Suciu. XML-QL: A query language for XML. http://www.w3.org/TR/NOTE-xml-ql, 1998.
[29] O. Etzioni and D. Weld. A softbot-based interface to the Internet. CACM, 37(7), 1994.
[30] M. Fernandez, D. Florescu, J. Kang, A. Levy, and D. Suciu. Catching the boat with Strudel: Experiences with a web-site management system. In Proceedings of the ACM SIGMOD Conf. on Management of Data, 1998.
[31] M. Fernandez, D. Florescu, A. Levy, and D. Suciu. A query language for a web-site management system. SIGMOD Record, 26(3):4–11, 1997.
[32] J. Fink, A. Kobsa, and A. Nill. User-oriented adaptivity and adaptability in the AVANTI project. In Designing for the Web: Empirical Studies. Microsoft Usability Group, 1996.
[33] D. Florescu, A. Levy, and A. Mendelzon. Database techniques for the World-Wide Web: A survey. SIGMOD Record, 27(3):59–74, 1998.
[34] P. Francis, T. Kambayashi, S. Sato, and S. Shimizu. Ingrid: A self-configuring information navigation structure. In Proceedings of 4th WWW Conference, 1995.
[35] N. Gershon and J. Brown (eds.). Special report on computer graphics and visualization in the global information infrastructure. IEEE Computer Graphics, 16(2), 1996.
[36] P. Gloor and S. Dynes. CyberMap: Visually navigating the web. Journal of Visual Languages and Computing, 9, 1998.
[37] R. Himmeroder, G. Lausen, B. Ludascher, and C. Schlepphorst. On a declarative semantics for web queries. In Proceedings of DOOD'97, 1997.
[38] L. Iocchi and D. Nardi. Information access in the web. In Proc. of WEBNET-97. AACE, 1997.
[39] B. Keevil. Measuring the usability index of your web site. In Proceedings of ACM Conference on Human Factors in Computing Systems (CHI), 1998.
[40] T. Kirk, A. Y. Levy, Y. Sagiv, and D. Srivastava. The Information Manifold. In Proc. of the AAAI Spring Symposium on Information Gathering in Distributed Heterogeneous Environments, 1995.
[41] D. Konopnicki and O. Shmueli. W3QS: A query system for the World Wide Web. In Proceedings of the Twenty-first International Conference on Very Large Data Bases (VLDB-95), pages 54–65, 1995.
[42] L. Lakshmanan, F. Sadri, and I. N. Subramanian. A declarative language for querying and restructuring the Web. In Proceedings of 6th International Workshop on Research Issues in Data Engineering (RIDE-96), 1996.
[43] A. Y. Levy, A. Rajaraman, and J. J. Ordille. Querying heterogeneous information sources using source descriptions. In Proceedings of 22nd International Conference on Very Large Databases (VLDB-96), 1996.

[44] A. Mendelzon, G. A. Mihaila, and T. Milo. Querying the World Wide Web. In Proc. of PDIS'96, 1996.
[45] J. Nielsen. A home-page overhaul using other web sites. IEEE Software, 12(3), 1995.
[46] J. Nielsen. The web site of Jakob Nielsen. http://www.useit.com, 1998.
[47] J. Nielsen and D. Sano. SunWeb: User interface design for Sun Microsystems' internal web. In Proc. of the Second World Wide Web Conference, 1994.
[48] D. Norman and S. Draper. User Centered System Design. LEA, Hillsdale, N.J., 1986.
[49] P. Paolini and P. Fraternali. A conceptual model and a tool environment for developing more scalable, dynamic, customizable web applications. In Proceedings of the Conf. on Extending Database Technology (EDBT), 1998.
[50] M. Perkowitz, R. B. Doorebons, O. Etzioni, and D. S. Weld. Learning to understand information on the Internet: An example-based approach. Journal of Intelligent Information Systems, 1996.
[51] M. Perkowitz and O. Etzioni. Adaptive sites: Automatically learning from user access patterns. Technical Report UW-CSE-97-03-01, University of Washington, Department of Computer Science and Engineering, 1997.
[52] M. Perkowitz and O. Etzioni. Adaptive web sites: An AI challenge. In Proceedings of the Int. Joint Conf. on Artificial Intelligence, 1997.
[53] B. Shneiderman, J. Nielsen, S. Butler, M. Levi, and F. Conrad. Is the web really different from everything else? In Proc. of CHI 98, 1998.
[54] K. Shoens, A. Luniewski, P. Schwarz, J. Stamos, and J. Thomas. The Rufus system: Information organization for semi-structured data. In Proceedings of the Nineteenth International Conference on Very Large Data Bases (VLDB-93), 1993.
[55] J. Spool, T. Scanlon, W. Schroeder, C. Snyder, and T. DeAngelo. Web Site Usability: A Designer's Guide. Morgan Kaufmann, 1999.
[56] VV.AA. World wide web usability, special issue. International Journal of Human-Computer Studies, 47(1):1–222, 1997.