A Methodology for the Enhancement of a Hypertext Version of a Textbook by the Automatic Insertion of Links in the Subject Index

F. Crestani
Department of Computing Science, University of Glasgow, Glasgow G12 8QQ, Scotland
[email protected]

M. Melucci
Dipartimento di Elettronica e Informatica, Università di Padova, Via Gradenigo 6/A, I-35131 Padova, Italy
[email protected]

Abstract

This paper presents a methodology for the enhancement of a hypertext version of a textbook. The enhancement over the textual version of the textbook is achieved by automatically inserting links between text excerpts of the textbook and items in the subject index produced by the author of the textbook. These links give access to parts of the textbook that have not been specifically indexed by the author, but that are semantically related to items in the subject index. Such links are meant to improve the effectiveness of the use of the book in search-oriented tasks.

1 Introduction

Automatic authoring has been addressed by researchers since the early days of hypertext. The increasing availability of online collections of textual documents, whose size is too large to allow manual authoring, is the main reason why fully or partially automatic authoring techniques are currently being studied and implemented. Much research has addressed the problem of producing hyperbooks either manually (see for example [12]) or semi-automatically, by converting existing textual books into hyperbooks (e.g. [7, 14, 13]).

An important issue in this area is the production of hypertexts for learning, teaching, training, or self-referencing. Learning through the Internet will become increasingly popular in the future, with the widespread availability of networked computers in schools and universities. Distance learning, in particular, will benefit considerably from the online availability of books and textbooks used in courses since, by accessing digital libraries of course textbooks, students will have online access to most of the course material. A few researchers have already addressed the problem of producing hypertexts to be used for distance learning and referencing [9, 14, 20, 6] and have suggested a series of rules to be followed in order to produce "good" hypertext documents [12].

An important issue in providing a hypertext version of a textbook is the availability of effective methodologies for its automatic construction from a textual version. Transforming a textbook into a hyper-textbook is a different process from transforming a book into a hyperbook. The specific characteristics of textbooks, their target users, and their usage are very different from those of books. Without entering into the details of these differences, we will only point out the very important role that the subject index plays in a textbook compared to its role in a book. The subject index of a textbook is the most important tool for accessing the parts of the textbook that are relevant to a user's information need. The table of contents is another important tool. However, given that textbooks are often used for finding specific answers to specific questions, an effective tool for accessing small relevant excerpts of the large amount of text that usually makes up a textbook is certainly more important than a tool, such as the table of contents, that gives access to large portions of text. Therefore, in textbooks the subject index plays a much more important role in search-oriented tasks than the table of contents.

In [8] we presented, by means of a case study, a new methodology for the automatic construction of a hyper-textbook from a textbook. In this paper we present in more detail the part of the methodology that deals with the enhancement of a hypertext version of a textbook by automatically inserting links between text excerpts of the textbook and items in the subject index produced by the author of the textbook. These links give access to parts of the textbook that have not been specifically indexed by the author, but that are semantically related to items in the subject index. The work presented here is part of a larger project devoted to the design and implementation of a tool for the fully automatic construction of a hyper-textbook from a paper textbook. This will enable an organisation, such as a university department or a company, to make a large amount of technical information available to users or employees through the fast creation of an online library of hyper-textbooks from their textual versions.

2 Automatic Authoring of Hyper-Textbooks

The process of constructing a hypertext is often called authoring. A hypertext can be authored from scratch by its author or by a group of authors. However, nowadays the most common situation is the construction of a hypertext starting from a collection of textual, linear documents available in machine-readable form. Notice that the term "document" refers here to any piece of text that can be considered as a single unit because of its semantic content. A document can be, for example, a book, a chapter, a section, a subsection, or an appendix of a book. Building the hypertext requires three main steps:

1. Design: identification and design of the target hypertext application.
2. Authoring and Construction: transformation of the initial single large document or collection of documents into a hypertext.
3. Publishing: making the hypertext available to the potential user community.

The second step is the one we are most interested in, in particular in relation to the authoring of hyper-textbooks. For hyper-textbooks, the complexity of automatic text-to-hypertext conversion is mainly due to the size and the structure of the textbook, to the complex nature of the relationships existing between the different parts of the textbook, and particularly between these parts and the subject index. We already addressed some of these problems and proposed some suitable solutions in [8]. In the following sections we describe these solutions, concentrating particularly on the enhancement of the subject index by finding semantic relationships between subject index items and textbook excerpts.

3 The Typical Structure of a Textbook

We consider a generic textbook consisting of a preface, a table of contents, some chapters, a bibliography, and a subject index. A machine-readable version of the textbook is assumed to be available. The textbook can follow a classical hierarchical structure consisting of chapters and sections. The section-paragraph-sentence hierarchy partially corresponds to a topic-subtopic hierarchy, but we do not use this feature since we propose a methodology that is general enough for any textbook. The most recently published textbooks are very rich in links between their different parts. However, we ignore any hand-made links since the methodology also has to work with the earliest textbooks. The generality of the methodology is an important issue because a digital library contains digitised textbooks that are quite different from one another.

Bibliographic citations are often one-way links, namely there are no links from bibliographic references back to the citing pages. Pages containing the bibliography are quite different from normal text pages because bibliographic reference lists are more heterogeneous in content, owing to the number of different titles cited in the same chapter. The bibliography cannot always be used effectively to enhance hypertext functionalities because citations are often "outgoing" links, i.e. they refer to articles or books that are external to the textbook. Moreover, even if links from the bibliography to the citing pages were available and one employed citations to link different pages indirectly, each bibliographic entry would lead to a low number of pages, since bibliographic references are usually cited by one or two pages only.

In keeping with the need to take into account the earliest textbooks as well, we consider the simple case of a subject index consisting of an alphabetical list of terms. More complex cases can be considered where subject index items are more similar to thesaurus concepts, i.e. terms with different types of relationships to other terms than in the simple case considered here. A term in the subject index can be a keyword (a single word) or a phrase. It may be related to:

1. a list of page numbers the author judged to be relevant to the term,
2. a synonymous term through a "see" link,
3. a list of page numbers and a semantically associated term through a "see also" link.

A list of page numbers consists of individual page numbers or of ranges of contiguous page numbers.

Apart from the "see" and "see also" links, which often represent synonymy and "related" thesaurus relationships, there are no other semantic relationships between terms in the subject index, such as the specialisation/generalisation or aggregation relationships often used in thesauri. Therefore, the subject index is too poor to be an effective browsing tool.
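The structure of a subject index entry described above can be sketched in code. The following Python fragment is a minimal illustration only: the names `IndexEntry` and `parse_pages`, the textual format of the page list (comma-separated numbers and hyphenated ranges), and the sample entry with its page numbers are assumptions made for the example and are not taken from any particular textbook.

```python
from dataclasses import dataclass, field
from typing import List, Optional


def parse_pages(spec: str) -> List[int]:
    """Expand a page list such as '12, 45-47' into individual page numbers."""
    pages: List[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # A range of contiguous page numbers, both ends included.
            first, last = part.split("-", 1)
            pages.extend(range(int(first), int(last) + 1))
        else:
            pages.append(int(part))
    return pages


@dataclass
class IndexEntry:
    """One subject index term with the three kinds of relationship."""
    term: str
    pages: List[int] = field(default_factory=list)     # author's relevance judgements
    see: Optional[str] = None                          # synonymous term ("see")
    see_also: List[str] = field(default_factory=list)  # associated terms ("see also")


# Hypothetical entry; the page numbers and the "see also" target are invented.
entry = IndexEntry("document clustering", parse_pages("12, 45-47"),
                   see_also=["cluster profile"])
```

Keeping the page list as explicit page numbers, rather than as the printed string, makes the author's judgements directly comparable with the page-level weights introduced later.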

4 A Methodology for Converting a Textbook into a Hyper-Textbook

The objective of this work is to provide a methodology to enhance the paper-based version of the textbook by the automatic insertion of new semantic links. Our work is based on the preservation of the features of the paper version. The text-based features already available are to be enhanced, not removed, by the implementation of hypertext capabilities, since the readers of the textbook are familiar with the physical page-based organisation [10]. In particular:

1. The book pages are the unit of reference for our hyper-textbook.
2. The relevance relationships from subject index terms to pages are preserved because the subject index is the set of terms the author used to index the textbook, and therefore it is the most important structure for browsing.

To enhance the textbook we developed a methodology to (1) author the pages through the insertion of traversal semantic links to other pages or relevant data, and (2) add new semantic links from the subject index to the pages and vice versa, and between the subject index terms. The identification of an alternative textbook structure is outside the scope of this work. Text structure identification still has many open problems, such as passage retrieval, theme extraction, and text structuring. To date some local solutions have been proposed [16, 5, 11, 17], but no general techniques are available.

4.1 A Conceptual Model for the Hyper-Textbook

A conceptual modelling tool for the textbook is essential for the effectiveness of hyper-textbooks because of the importance of fully understanding the semantics of pages and subject index terms. Different hypertext models have been proposed during the last decade, the earliest dating back to the late eighties [9, 21]. Most of these models employ different types of data structure at the higher abstraction levels to describe the informative content of the document collection placed on the lowest level of abstraction. It is well accepted that one of the most effective hypertext models for IR is one based on a two-level architecture [2, 4].

Figure 1: Nodes and links of the hyper-textbook architecture. (Diagram labels: subject index term nodes T, page nodes P, and list nodes TT, TP, PT, PP.)

The two-level EXPLICIT model [1, 2] is our reference model for the design of the conceptual structure of our hyper-textbook. The main elements of the model are nodes and links. From a conceptual point of view, nodes are organised in two levels: the page level (P), corresponding to the textbook pages, and the term level (T), corresponding to the subject index. Links are implemented as list nodes inter-connecting and intra-connecting the levels. Figure 1 depicts the relation between nodes and links, on the left, and data nodes and list nodes, on the right.

Page Nodes store complete pages of the book. Just as a user of the textbook accesses a page, the hyper-textbook user accesses a node. The user can access related pages or subject index terms starting from any page by means of automatically constructed links. The page containing the table of contents is used as the entry point to the first page of each chapter, following the suggestions reported in [12]. Such a modified table of contents can be of use whenever the user gets lost during hypertext navigation.

Subject Index Term Nodes are modelled as term nodes storing their textual description and their relationships with other terms or pages. The importance of this type of node is due to the associated list nodes linking the semantically related terms and the relevant page nodes.

List Nodes are employed to implement associative links between page nodes and term nodes. Associative links between page and term nodes express semantic relationships. List nodes are built to disclose the semantics existing between pages and terms. A list node stores the anchors to the destination nodes which are semantically related to the origin node. The detection of associative links is based on an automatic indexing process.

PT and TP list nodes implement links between pages and terms. Pages are linked to terms describing the page content. Before the hyper-textbook construction, a subset of TP links, i.e. the links identified by the author in the subject index, is already available. Statistical techniques are used to determine new PT and additional TP list nodes in combination with the available TP links (see Section 4.2). The resulting list nodes are ranked according to a statistical value measuring the representation power of each link. The higher the value, the more the two nodes are estimated to be semantically close.

PP list nodes relate a page to other pages. Structural PP links, such as "next", "previous" or "go to the table of contents" links, are easily detectable, while statistical techniques help exploit the "semantic similarity" between pages estimated to be semantically close, namely pages addressing the same topics.

TT list nodes relate a term to other terms. Links between terms are meant to provide the user with the "semantic association" functionality of the EXPLICIT model, which aims at making the meaning of a term clear and explicit [3]. Statistically determined links relate semantically similar terms, that is, terms estimated to address the same topics. These links are constructed using a similarity measure computed on the basis of the distribution of terms within pages (term-term occurrence, or term co-occurrence). The higher the similarity measure, the more the two terms are assumed to be semantically close.
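One possible way to represent the nodes and list nodes described above is sketched below in Python. The class names `Node` and `ListNode`, their fields, and the sample identifiers and weights are hypothetical. The sketch only mirrors the idea that a list node stores ranked anchors to the destination nodes of one origin node, and it does not reproduce the EXPLICIT model implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ListNode:
    """Ranked anchors from one origin node to its destination nodes."""
    kind: str                                            # "PT", "TP", "PP" or "TT"
    anchors: List[Tuple[str, float]] = field(default_factory=list)

    def add(self, destination: str, weight: float) -> None:
        self.anchors.append((destination, weight))
        # Keep the anchors ranked, strongest link first.
        self.anchors.sort(key=lambda anchor: anchor[1], reverse=True)


@dataclass
class Node:
    """A page node (level "P") or a subject index term node (level "T")."""
    node_id: str
    level: str
    content: str
    lists: Dict[str, ListNode] = field(default_factory=dict)

    def link(self, other: "Node", weight: float) -> None:
        # The list node kind is derived from origin and destination levels.
        kind = self.level + other.level
        self.lists.setdefault(kind, ListNode(kind)).add(other.node_id, weight)


# Hypothetical nodes and weights, for illustration only.
page = Node("p12", "P", "text of page 12")
other_page = Node("p45", "P", "text of page 45")
term = Node("t3", "T", "document clustering")
page.link(term, 0.8)         # PT link
term.link(page, 1.0)         # TP link
term.link(other_page, 0.4)   # weaker TP link, ranked after the first one
```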

4.2 From a Textbook to a Hyper-Textbook

The steps of the automatic conversion process are:

1. automatic indexing;
2. automatic subject index expansion;
3. automatic hypertext construction.

In the following we explain all these steps, concentrating on steps 2 and 3, which are the ones involving the subject index.

Automatic Indexing

Automatic indexing is necessary to build associative links between pages and terms. Indexing takes place on the full text of pages (including the titles of chapters and sections), and on the titles and authors of bibliographic references. After indexing, pages are assigned a set of conflated and weighted terms. To do so, we used classical IR indexing techniques that have long been used by many researchers in the field (see for example [19, 22]). In particular, we employed the vector-space model as the framework for the indexing process [19]. Pages are assigned keywords as specified by an $n \times k$ occurrence matrix $C$, where $n$ and $k$ are the number of pages and of keywords, respectively. The generic element $c_{ij}$ is equal to 1 if the $i$-th page is assigned the $j$-th keyword, 0 otherwise. Keywords are usually weighted to represent their importance within pages and within the whole collection. We can then build the $n \times k$ matrix $W$ where each element $w_{ij}$ is computed using the classical $tf \times idf$ weighting scheme [15]:

$$w_{ij} = \frac{tf_{ij} \log \frac{N}{n_j}}{\sum_{j=1}^{k} tf_{ij} \log \frac{N}{n_j}}$$

where $w_{ij}$ is the weight of the $j$-th keyword within the $i$-th page, $tf_{ij}$ is the frequency of the keyword within the $i$-th page, and $\log \frac{N}{n_j}$ is the inverse document frequency ($idf$) of the $j$-th keyword, given that $N$ is the total number of documents and $n_j$ is the number of documents to which the $j$-th keyword has been assigned. The $tf \times idf$ weights are normalised according to page length, since pages are not all of the same size. In fact, there are pages in which figures or tables take up half a page, or pages containing the first or last part of the chapter bibliography. Weights are employed to rank list nodes.

Starting from $W$ and its transpose $W^T$ it is possible to build the $k \times k$ matrix $X = W^T \cdot W$ specifying co-occurrence weights between pairs of keywords, and the $n \times n$ matrix $Y = W \cdot W^T$ specifying the similarities between pairs of pages [19].

The subject index has been indexed too. Every index term is assigned a list of weighted stems extracted from the words composing the term itself; these are called subject index term keywords, or simply term keywords. For example, the subject index term "document clustering" is indexed by the pair of keywords ("document", "cluster"). According to the vector-space model, each subject index term $t$ is assigned a binary vector $(t_1, \ldots, t_k)$ such that $t_j = 1$ if the $j$-th keyword occurs within the term. The result of the indexing of the subject index is an $m \times k$ matrix $T$, where $m$ is the number of subject index terms and $k$ is the number of keywords, i.e. the length of the binary vectors describing the subject index terms.
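The weighting scheme and the two similarity matrices can be illustrated with a short Python sketch. It is not the indexing software used for the hyper-textbook: the function names are hypothetical, the raw frequencies are invented, and $N$ is assumed here to be the number of pages, since pages are the documents being indexed.

```python
import math
from typing import List

Matrix = List[List[float]]


def tfidf_weights(tf: Matrix) -> Matrix:
    """Build W from a pages-by-keywords matrix of raw frequencies tf."""
    n_pages = len(tf)
    n_keywords = len(tf[0])
    # n_j: number of pages to which the j-th keyword has been assigned.
    n_j = [sum(1 for row in tf if row[j] > 0) for j in range(n_keywords)]
    # Assumption: N in the idf formula is taken to be the number of pages.
    idf = [math.log(n_pages / n) if n else 0.0 for n in n_j]
    weights = []
    for row in tf:
        raw = [f * w for f, w in zip(row, idf)]
        total = sum(raw)  # denominator of the weighting formula
        weights.append([r / total if total else 0.0 for r in raw])
    return weights


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


# Made-up frequencies for 3 pages (rows) and 3 keywords (columns).
tf = [[2, 0, 1], [0, 3, 1], [1, 1, 0]]
W = tfidf_weights(tf)
X = matmul(transpose(W), W)   # k x k keyword co-occurrence weights
Y = matmul(W, transpose(W))   # n x n page similarities
```

Because of the normalisation in the denominator, the weights of each page that has at least one weighted keyword sum to 1, so a long page does not dominate the similarities merely because of its length.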

Automatic Subject Index Expansion

The detection of new semantic links is based on the statistical distribution of index terms within pages and subject index terms. The function representing this linking process can be expressed as $O_{xq}$, where $x$ is either a page or a subject index term, and $q$ is a subject index term. $O_{xq}$ is a value representing the weight to be given to the link between $x$ and $q$. However, it is important to note that the number of page keywords is much higher than the number of term keywords. This means that if we only considered term keywords in the computation of $O_{xq}$, the link setting would depend on a small number of weights, namely the weights of the keywords included in the intersection between the set of page keywords and the set of term keywords. The weight of such a link is given by the sum $\sum_{j=1}^{k} w_{pj} t_{qj}$, where $w_{pj}$ is the weight of the $j$-th keyword within the $p$-th page and $t_{qj} = 1$ if the $j$-th keyword indexes a given subject index term $q$. If the statistical evidence were based on one or two term keywords only, the decision to set the link to a page would depend only on those keywords, thus running the risk either of missing an important link or of setting a useless or wrong link.

The idea is to build a new keyword-based term description that is larger than the description given by the subject index. The additional keywords are the ones that significantly co-occur within the pages to be considered semantically close to each other. We then expand the set of keywords describing a subject index term to enlarge the intersection between the page and the term keyword sets. To do that we consider the keywords that are similar to those co-occurring within the page and the term keyword sets. The computation of $O_{xq}$ can then take into account the matrix $X$ specifying the similarities between keywords, as proposed in [18] to expand a query by adding index terms similar to those used by the user to express the query itself.
Differently from [18], in this work we employ the term similarity matrix $X$ to build persistent links between the hyper-textbook nodes that can be effectively and directly navigated by the user. Let $T^{*} = T \cdot X$ be the $m \times k$ matrix representing the expanded subject index terms, i.e. the terms expanded by the keywords that are similar to the original keywords describing the subject index terms. The $i$-th row vector of $T^{*}$ is $t^{*}_i = t_i \cdot X$, whose generic $j$-th element is computed as the scalar product between the $i$-th row vector of $T$ and the $j$-th column vector of $X$, i.e. $t^{*}_{ij} = \sum_{h=1}^{k} t_{ih} x_{hj}$. Thus the generic element $O_{pq}$ of $O$ corresponding to the page $p$ and to the subject index term $q$ is:

$$O_{pq} = \sum_{j=1}^{k} w_{pj} t^{*}_{qj}$$

Through the expansion of $T$ into $T^{*}$, keywords not occurring within the subject index term description still participate in the link weighting. The weight of such keywords is then an average over all the other keyword weights using similarities as coefficients. Let us consider the following example, with $n = 5$, $k = 3$, and

$$X = \begin{bmatrix} 1.0 & 0.3 & 0.0 \\ 0.3 & 1.0 & 0.2 \\ 0.0 & 0.2 & 1.0 \end{bmatrix}$$

Given the $j$-th subject index term $t_j = (0, 1, 0)$, we get $t^{*}_j = (0.3, 1, 0.2)$. One can observe that keywords $k_1$ and $k_3$ now describe $t_j$ as well, but with a lower strength than $k_2$, which already described $t_j$ before its expansion into $t^{*}_j$.
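The expansion step of the example can be checked with a few lines of Python. This is only a sketch: the function name `expand_terms` is hypothetical, while the matrix of keyword similarities and the term vector are the ones given in the example above.

```python
from typing import List


def expand_terms(T: List[List[float]], X: List[List[float]]) -> List[List[float]]:
    """Multiply T by X: one expanded row per subject index term."""
    k = len(X)
    return [[sum(t[h] * X[h][j] for h in range(k)) for j in range(k)]
            for t in T]


# Keyword similarity matrix of the example.
X = [[1.0, 0.3, 0.0],
     [0.3, 1.0, 0.2],
     [0.0, 0.2, 1.0]]
T = [[0, 1, 0]]   # the term of the example, described by keyword k2 only

T_expanded = expand_terms(T, X)
```

Since the term is described by a single keyword, its expanded description is simply the corresponding row of the similarity matrix, which is why the first and third keywords enter with the weights 0.3 and 0.2.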

Hypertext Construction

The third step of the textbook-to-hyper-textbook transformation is concerned with setting up links between nodes (see the left of Figure 1). The construction of associative links between pages and terms is the core of our proposal since this is what makes the hyper-textbook a tool for browsing. The process of determining associations between pages and subject index terms is related to the indexing and subject index expansion processes, since the keyword distribution provides a means to detect statistical relations between pages and terms. In this way it is possible to determine all the links between nodes of the conceptual model (see Figure 2).

Construction of Links between Pages and Subject Index

Some TP links already exist explicitly because they were inserted when the subject index was manually constructed, but on average only a small number of pages are assigned a subject index term. Moreover, PT links are not available unless the user scans the subject index to search for the terms that refer to the current page. Since a textbook is usually used as a reference tool, the user would like to access more pages than those pointed out by the textbook author in the subject index, if more pages relevant to a subject exist. These additional pages can be made available only if other links are set up between the subject index and the book pages. Moreover, it is likely that the user would browse from the current page to the corresponding subject index terms describing the content of the page itself and linking to other pages.

Figure 2: The process of the hyper-textbook construction.

We both add other pages to the list of relevant pages associated with each index term, and give a weight to these new TP links. We also build and weigh new automatic PT links to allow browsing from pages to the subject index. If the link weight is over a stated threshold, then the link is set up, otherwise it is not inserted. This means that additional TP and PT links other than those provided by the subject index are created. List nodes storing the PT and TP links are then ranked according to the link weight.

The decision function we employed to set a link between a term and a page takes into consideration two sources of evidence:

• the relevance judgement the author gave about the page with respect to the subject index term; this information is already available in the subject index of the textbook;
• the page and the subject index term descriptions given as lists of weighted keywords after the automatic indexing process; this source of evidence is enhanced through statistical data describing the similarities between keywords.

Let $L_{pq}$ be the decision function regarding setting up a link between page $p$ and term $q$, defined as a weighted average between two components:

$$L_{pq}(O, S) = \alpha S_{pq} + (1 - \alpha) O_{pq}$$

provided $0 \le \alpha \le 1$. The function $O$ represents what we call the "objective" source of evidence provided by the automatic indexing process, i.e. the statistical data describing the similarities between keywords, whereas the function $S$ is the "subjective" source of evidence available from the manual indexing implemented by the subject index, i.e. the relevance judgements the author gave about page $p$ with respect to the subject index term $q$. $S_{pq}$ is defined as follows: $S_{pq} = 1$ if the subject index contains a relevance judgement about page $p$ with respect to term $q$, 0 otherwise. The function $L_{pq}$ is a combination of two types of linking component: the manual and the automatic one. The use of $\alpha$ allows balancing the importance given to each component. The closer $\alpha$ is to 0, the more the automatically computed weights determine the link setting. This is depicted on the right of Figure 3, where the combination of the two components is weighted by $\alpha$ (alpha in Figure 3).
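The decision function and the threshold test can be sketched as follows. The Python names `link_weight` and `pages_for_term`, the value of the threshold, the value of alpha, and the sample weights are all assumptions chosen for illustration; the text above only states that a link is set up when its weight is over a stated threshold and that list nodes are ranked by weight.

```python
from typing import List, Tuple


def link_weight(o_pq: float, s_pq: int, alpha: float) -> float:
    """Weighted average of the subjective (S) and objective (O) evidence."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * s_pq + (1.0 - alpha) * o_pq


def pages_for_term(o_col: List[float], s_col: List[int],
                   alpha: float, threshold: float) -> List[Tuple[int, float]]:
    """Return (page, weight) pairs over the threshold, strongest first."""
    weights = [(p, link_weight(o, s, alpha))
               for p, (o, s) in enumerate(zip(o_col, s_col))]
    kept = [(p, w) for p, w in weights if w > threshold]
    return sorted(kept, key=lambda pw: pw[1], reverse=True)


# Hypothetical values for one subject index term and three pages.
o_col = [0.9, 0.2, 0.6]   # automatically computed ("objective") weights
s_col = [0, 1, 0]         # author's ("subjective") relevance judgements
links = pages_for_term(o_col, s_col, alpha=0.5, threshold=0.4)
```

With these invented numbers the page indexed by the author is ranked first and one additional page is linked on statistical evidence alone, while the third page falls below the threshold. Setting alpha to 0 would make the ranking depend on the automatic weights only.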

Figure 3: The combination of "subjective" and "objective" sources of evidence to build a link. (Diagram labels: page p, term t, alpha S_pt, (1-alpha) O_pt, L_pt = (1-alpha) O_pt, L_pt = (1-alpha) O_pt + alpha S_pt.)

Let $O$ be the $n \times m$ matrix specifying the weights to be given to the links between the $m$ subject index terms and the $n$ pages. The $n \times m$ matrix $O$ is computed as a linear transformation of the $n \times k$ page keyword weight matrix $W$, the $k \times k$ term similarity matrix $X$, and the transpose of the $m \times k$ matrix $T$ representing the subject index terms:

$$\underset{n \times m}{O} = \underset{n \times k}{W} \cdot \underset{k \times k}{X} \cdot \underset{k \times m}{T^T}$$

that can be rewritten as $O = W \cdot (T^{*})^T$, where $T^{*} = T \cdot X$ is the expanded subject index matrix. Let us consider the previous example, with $W$ a $5 \times 3$ matrix representing the weights of keywords within pages:

$$W = \begin{bmatrix} 0 & 1.0 & 1.5 \\ 2.5 & 1.5 & 0 \\ 1.0 & 0 & 3.1 \\ 0 & 0 & 1.8 \\ 2.2 & 0 & 1.8 \end{bmatrix}$$

By computing the scalar product between each row of $W$ and the vector $t^{*}_j = (0.3, 1, 0.2)$, we get a 5-element vector describing the weights to be given to the links between each page and the $j$-th subject index term, originally described by $t_j$ and then expanded to $t^{*}_j$. The $j$-th column vector of $O$ is then $o_j = (1.3, 2.25, 0.92, 0.36, 1.02)'$. If one used $t_j$ instead of $t^{*}_j$, one would have $o_j = (1, 1.5, 0, 0, 0)'$, and pages 3, 4 and 5 would not be linked to the term. The process of expansion is depicted in Figure 4 as well.

Figure 4: The process of expansion: from the original subject index term description to the expanded one through keyword similarity. The level K includes the automatically extracted keywords.
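The worked example can be reproduced numerically. The following Python sketch uses the matrices of the example; the helper functions `transpose` and `matmul` and the variable names are hypothetical and are written in plain Python only to keep the fragment self-contained.

```python
from typing import List

Matrix = List[List[float]]


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


# Page keyword weights (5 pages, 3 keywords) and keyword similarities.
W = [[0, 1.0, 1.5],
     [2.5, 1.5, 0],
     [1.0, 0, 3.1],
     [0, 0, 1.8],
     [2.2, 0, 1.8]]
X = [[1.0, 0.3, 0.0],
     [0.3, 1.0, 0.2],
     [0.0, 0.2, 1.0]]
T = [[0, 1, 0]]   # one subject index term, described by keyword k2 only

O = matmul(matmul(W, X), transpose(T))   # n x m weights with expansion
O_plain = matmul(W, transpose(T))        # same product without expansion

o_expanded = [round(row[0], 2) for row in O]
o_plain = [row[0] for row in O_plain]
# Pages (numbered from 1) that would get no link without expansion.
unlinked = [p + 1 for p, w in enumerate(o_plain) if w == 0]
```

Up to rounding, `o_expanded` holds the five link weights of the example, and `unlinked` lists the pages whose weight is zero when the unexpanded term description is used.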

Construction of Links between Terms

To make the description of his own information needs more precise, the user might use, if accessible, terms different from those initially employed to express them. Therefore, links between terms are needed. The "see" or "see also" references are the only links inter-connecting the terms of the subject index. However, these links are too few to provide a complete network in which every term can be reached from any other term. We therefore define a set of techniques for automatically building and weighting additional links between terms, to make the subject index a more complete network to be used for browsing. These techniques are based on the concept of similarity between terms, i.e. an additional link between two terms is inserted if the similarity between the two terms is higher than a given threshold. The similarity is estimated on the basis of the keywords that co-occur within the terms, and is computed as the scalar product of the vectors describing the terms. However, the similarity is inaccurately estimated for many pairs of terms since the majority of the terms are described through a low number of keywords. Therefore, the similarities between subject index terms are computed on the basis of the expanded subject index as previously described. Given the $i$-th subject index term, the similar ones are selected through the $k \times k$ matrix $X$, i.e. by computing the scalar product between the expanded row $t^{*}_i = t_i \cdot X$ and each other row of the expanded matrix $T^{*} = T \cdot X$.
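A minimal Python sketch of the term-to-term linking follows. The function names, the threshold value, and the three expanded term descriptions are hypothetical: the rows are what three single-keyword terms would become after expansion with the similarity matrix of the earlier example.

```python
from typing import Dict, List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def term_links(T_expanded: List[List[float]],
               threshold: float) -> Dict[int, List[Tuple[int, float]]]:
    """For each term, rank the other terms whose similarity exceeds the threshold."""
    links: Dict[int, List[Tuple[int, float]]] = {}
    for i, t_i in enumerate(T_expanded):
        scored = [(j, dot(t_i, t_j))
                  for j, t_j in enumerate(T_expanded) if j != i]
        kept = [(j, s) for j, s in scored if s > threshold]
        links[i] = sorted(kept, key=lambda js: js[1], reverse=True)
    return links


# Expanded descriptions of three terms that originally had one keyword each.
T_expanded = [[1.0, 0.3, 0.0],
              [0.3, 1.0, 0.2],
              [0.0, 0.2, 1.0]]
links = term_links(T_expanded, threshold=0.5)
```

Before expansion the three binary vectors share no keyword, so every scalar product would be zero and no link could be set. After expansion the first two terms become similar enough to be linked, which illustrates why the expanded subject index is used for this step.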

Construction of Links between Pages

In textbooks, it is likely that the user searches for pages addressing a specific topic that are not physically, but semantically close to each other, in order to enlarge the amount of retrieved material that is relevant to the addressed topic. The structural links between a page and the next one, or between the table of contents and the first page of each chapter, are the only links available in a textbook. They thus provide a poor network of links between pages, since the user can only browse the textbook sequentially, or return to the table of contents if he gets lost or needs to understand which main topics are addressed by the textbook. We therefore define a method to automatically build and weigh additional links between pages, to make the textbook a more complete network to be used for browsing. This method is based on the concept of similarity between pages, i.e. an additional link between two pages is inserted if the similarity between the two pages is higher than a given threshold. The similarity is estimated on the basis of the keywords that co-occur within the pages, and is computed as the scalar product of the vectors describing the pages. The similar pages are selected through the $n \times n$ page similarity matrix $Y$.
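The combination of structural and similarity-based links between pages might be organised as in the Python sketch below. The function name `page_links`, the dictionary layout, and the threshold value are assumptions made for illustration; the page weights are those of the matrix $W$ used in the earlier worked example, and pages are numbered from 0 in the code.

```python
from typing import Dict, List


def page_links(W: List[List[float]], threshold: float) -> List[Dict[str, object]]:
    """Collect structural and similarity-based PP links for every page."""
    n = len(W)
    links = []
    for i in range(n):
        # Row i of Y: scalar products of page i with every other page.
        scored = [(j, sum(a * b for a, b in zip(W[i], W[j])))
                  for j in range(n) if j != i]
        similar = sorted((js for js in scored if js[1] > threshold),
                         key=lambda js: js[1], reverse=True)
        links.append({
            "previous": i - 1 if i > 0 else None,   # structural link
            "next": i + 1 if i < n - 1 else None,   # structural link
            "similar": [j for j, _ in similar],     # ranked semantic links
        })
    return links


W = [[0, 1.0, 1.5],
     [2.5, 1.5, 0],
     [1.0, 0, 3.1],
     [0, 0, 1.8],
     [2.2, 0, 1.8]]
links = page_links(W, threshold=5.0)   # the threshold is an arbitrary choice
```

With this arbitrary threshold the third page is linked to the fifth and fourth pages, in that order, even though only the fourth is physically adjacent to it, while the first page gets no similarity-based link at all.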

5 Browsing and Searching the Hyper-Textbook

For reasons of space we cannot report here any example of how a user could employ the hyper-textbook to find information relevant to some task. Figures 5 and 6 depict four example screen-dumps of the hyper-textbook. We invite the interested reader to try the result of the automatic authoring process presented here at the following URL: http://www.dei.unipd.it/~melo/hyper-textbook

Figure 5: Home page of the hyper-textbook (left), and page of a term of the subject index (right).

At that address the reader will find a working example of a textbook, Van Rijsbergen's textbook on Information Retrieval [22]. The textbook has been automatically transformed into a hyper-textbook using the methodology presented here. A detailed account of the process of automatic construction of that hyper-textbook is reported in [8].

6 Evaluation

We are currently evaluating the effectiveness of the hyper-textbook. The evaluation will take two different points of view: that of those who use the book for teaching and that of those who use the book for learning.

Regarding the evaluation of the use of the hyper-textbook as a teaching tool, we are still designing a suitable evaluation methodology, since the literature on this topic is quite limited.

Regarding instead the evaluation of the use of the hyper-textbook as a learning tool, we can make use of the considerable experience achieved by other researchers. We gained a deeper insight into the problem through our participation in the Esprit "Mira" Working Group. Mira is a three-year project that brings together the experience of researchers from 13 different universities and research institutions in Europe with the purpose of designing and testing an evaluation framework for interactive multimedia IR applications. Our evaluation methodology has been designed in the context of this framework. Without going into the details of the evaluation methodology, for reasons of space, we will only say that we are submitting to a large population of students a number of tasks to be solved using the textbook. Most of these tasks involve searching for portions of the textbook dealing with a specific problem. Roughly half of our test population will be asked to tackle the tasks using a text (paper-based) version of the textbook, while the other half will be asked to tackle them using the hypertext version. We will cross-check the results to prevent external factors from biasing them, and we hope to be able to show the advantages of using an extended subject index in search-oriented tasks. We will present the results in the most appropriate venue as soon as possible.

7 Conclusions

In this paper we presented a methodology for the automatic construction of a hyper-textbook from a textbook. In particular, we concentrated our description of the methodology on the enhancement of a hypertext version of the textbook by the automatic insertion of links in the original subject index. A case study reporting the full details of the methodology can be found in [8]. In this paper we did not address the complex issues related to the copyright of online hyper-textbooks. We leave these issues to the experts in this field.

Figure 6: List of textbook pages relevant to the term "cluster profile" (left), and page of the textbook relevant to the term "cluster profile" (right).

Acknowledgements

We would like to thank Prof. Van Rijsbergen for letting us use the machine readable version of his textbook.

References

[1] M. Agosti, R. Colotti, and G. Gradenigo. A two-level hypertext retrieval model for legal data. In A. Bookstein, Y. Chiaramella, G. Salton, and V.V. Raghavan, editors, Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR), pages 316–325, Chicago, October 1991.
[2] M. Agosti, G. Gradenigo, and P.G. Marchetti. A hypertext environment for interacting with large textual databases. Information Processing & Management, 28(3):371–387, 1992.
[3] M. Agosti and P.G. Marchetti. User navigation in the IRS conceptual structure through a semantic association function. The Computer Journal, 35(3):194–199, 1992.
[4] P.D. Bruza and T.P. van der Weide. Stratified hypermedia structures for information disclosure. The Computer Journal, 35(3):208–220, 1992.
[5] J. Callan. Passage-level evidence in document retrieval. In Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR), pages 302–310, Dublin, Ireland, July 1994.
[6] F. Casati and B. Pernici. A methodology for the design of WWW sites and its application to distance education. In Atti del Convegno Sistemi Evoluti di Basi di Dati, pages 253–272, San Miniato, Pisa, Italy, July 1996.
[7] N. Catenazzi and F. Gibb. The publishing process: the hyper-book approach. Journal of Information Science, 21(3):161–172, 1995.
[8] F. Crestani and M. Melucci. A case study of automatic authoring: from a textbook to a hyper-textbook. Data and Knowledge Engineering, 1998. In press.
[9] M.E. Frisse. Searching for information in a medical handbook. Communications of the ACM, 31(7):880–886, 1988.
[10] M. Landoni. The visual book system: a study of the use of the visual rhetoric in the design of electronic books. PhD thesis, Department of Information Science, University of Strathclyde, Glasgow, Scotland, UK, May 1997.
[11] E. Mittendorf and P. Schäuble. Document and passage retrieval based on Hidden Markov model. In Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR), pages 318–327, Dublin, Ireland, July 1994.
[12] P. Muller. Writing hypertext books, January 1995. http://www.inf.fu-berlin.de/ tec/ Mosaic/ HTB.
[13] R. Rada. Converting a textbook to hypertext. ACM Transactions on Information Systems, 10(3):294–315, 1992.
[14] D.R. Raymond and F.W. Tompa. Hypertext and the Oxford English dictionary. Communications of the ACM, 31(7):871–879, 1988.
[15] S.E. Robertson and K. Sparck Jones. Relevance weighting of search terms. Journal of the American Society for Information Science, 27:129–146, May 1976.
[16] G. Salton, J. Allan, and C. Buckley. Approaches to passage retrieval in full text information systems. In R. Korfhage, E. Rasmussen, and P. Willett, editors, Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR), pages 49–58, Pittsburgh, PA, June 1993.
[17] G. Salton, J. Allan, and A. Singhal. Automatic text decomposition and structuring. Information Processing & Management, 32(2):127–139, 1996.
[18] G. Salton and C. Buckley. On the use of spreading activation methods in automatic Information Retrieval. In Yves Chiaramella, editor, Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR), Grenoble, France, June 1988.
[19] G. Salton and M.J. McGill. Introduction to modern Information Retrieval. McGraw-Hill, New York, 1983.
[20] A.F. Smeaton and P.J. Morrissey. Experiments on the automatic construction of hypertext from text. Technical report, Dublin City University, School of Computer Applications, Ireland, 1995. Working Paper: CA-0295.
[21] R.H. Thompson. The design and implementation of an intelligent interface for Information Retrieval. Technical report, Computer and Information Science Department, University of Massachusetts, 1989.
[22] C.J. van Rijsbergen. Information Retrieval. Butterworths, London, second edition, 1979.