An Infrastructure for Open Latent Semantic Linking

Alessandra Alaniz Macedo
Instituto de Ciências Matemáticas e de Computação
Universidade de São Paulo
São Carlos/SP - Brazil

[email protected]

José Antonio Camacho-Guerrero
Instituto de Ciências Matemáticas e de Computação
Universidade de São Paulo
São Carlos/SP - Brazil

[email protected]

Maria da Graça Campos Pimentel
Instituto de Ciências Matemáticas e de Computação
Universidade de São Paulo
São Carlos/SP - Brazil

[email protected]

ABSTRACT
The more the web grows, the harder it is for users to find the information they need. As a result, it is even more difficult to identify when documents are related: to find out that two or more documents are in fact related, users have to navigate through the documents and analyze their content. This paper presents an infrastructure allowing the use of latent semantic analysis and open hypermedia concepts in the automatic identification of relationships among web pages. Latent Semantic Analysis has been proposed by the information retrieval community as an attempt to automatically organize text objects into a semantic structure appropriate for matching. In open hypermedia systems, links are managed and stored in a special database, a linkbase, which allows the addition of hypermedia functionality to a document without changing its original structure and format. We first present two complementary link-related efforts: an extensible latent semantic indexing service and an open linkbase service. Leveraging off those efforts, we present an infrastructure that identifies latent semantic links within web repositories and makes them available in an open linkbase. To demonstrate by example the utility of our open infrastructure, we built an application presenting a directory of semantic links extracted from web sites.

Keywords
Open Hypermedia, Web, Information Retrieval, Automatic Linking, Information Integration, Semantic Structures.

1. INTRODUCTION

Users are assisted in finding the information they desire on the web by interfaces made available by a wide range of search engines; by formulating appropriate queries, users are pointed to relevant documents. Current search engines have leveraged traditional information retrieval approaches, sometimes exploiting the underlying linking structure of the documents.

In the 80s, Salton pioneered the investigation of text retrieval approaches over collections of natural-language documents. He proposed a basic process for automatic indexing, which includes techniques such as term truncation, removal of stopwords and term weighting [31] [32]. The result of this process, which we refer to as lexical-based indexing, is a set of terms associated with the documents they represent. Many approaches to retrieving textual materials depend on a lexical match between words in queries and those assigned to database objects. In most languages, the diversity of words allows, on the one hand, the use of different words to describe the same object and, on the other hand, the same word to be used to describe objects that are truly distinct. As a result, lexical-based matching methods are at the same time incomplete and imprecise. Latent Semantic Analysis [14] tries to overcome the problems common to the lexical approach by automatically organizing text objects into a semantic structure built to be more appropriate for matching.

Once a user has located a relevant document, traditional hypertext navigation usually takes place by the user following the links crafted by the author of the document. In fact, the benefits underlying this hypertext-based navigation are that (a) content creators can provide carefully-defined relationships and (b) users have a context in which to understand the information [38]. However, the very process of authoring those links becomes more difficult when the information space is too large. As a result, it is very useful to consider automatic techniques for determining relationships between pieces of information.

The general approach to creating links automatically over web repositories demands the capability of editing the documents in those repositories so as to embed the specifications of the links. Such a write permission is a definitive obstacle when the task demands the automatic creation of links within arbitrary repositories. One attractive way of supporting hypertext links without changing the original document is to use open hypermedia concepts. In open hypermedia systems, links are managed and stored in special databases called linkbases. The idea is to maintain links in a linkbase instead of wiring them within the document contents. This approach adds flexibility to the document model, since it allows the addition of hypermedia functionality to any document without changing its original format or embedding mark-up information within it. Open hypermedia also incorporates other concepts such as link maintenance and reuse, as demonstrated by research in the field [7] [13] [17] [http://www.cs.aue.auc.dk/ohswg/]. As far as interoperability is concerned, a protocol [9] has been proposed to allow the interchange of information among applications. The overall approach builds on the Dexter Reference Model [22].

This discussion demonstrates how the areas of information retrieval and hypertext are tightly related in a complementary way, offering good solutions for finding information on the web. In order to exploit the contribution of these areas over web-based repositories, especially the approaches of Latent Semantic Analysis and Open Hypermedia, we propose a flexible and extensible open infrastructure allowing the use of the latent semantic approach in the automatic generation of hypertext links among information contained in web repositories. To demonstrate the utility of our infrastructure, we built an application presenting a directory of semantic links extracted from a web site. By using our service, users are able to identify relationships among information stored in a web space without having to formulate appropriate queries or navigate through the whole information space.

The remainder of this paper is organized as follows. Section 2 presents related work both in terms of information retrieval and open hypermedia systems. Section 3 describes concepts of Latent Semantic Analysis (LSA), which is used to extract salient semantic structures between documents from web repositories; a classic example of the use of LSA in information retrieval is also presented. In Section 4, we present the Web Linkbase Service, which incorporates a linkbase where the salient semantic relationships extracted from web repositories are stored. In Section 5, we present an infrastructure allowing the use of latent semantic and open hypermedia approaches in the automatic generation of links. In Section 6 we present our final remarks.

2. FROM INFORMATION RETRIEVAL TO LINK AUTHORING

In 1993, Salton and Allan presented a study which used global text comparison methods to identify similarities between text elements, followed by local context-checking operations that resolve ambiguities and distinguish superficially similar texts from texts that actually cover identical topics [33]. This work was followed by Allan's results, which presented methods for automatically linking related documents and described a process for automatically assigning types to document relationships [3]. The automatic generation of links was also supported in the VOIR system: the user specifies a topic and gets back a collection of articles that lexically match the query; additionally, some of the query terms are added as anchors in the returned articles [15]. In a complementary effort, Price et al. proposed a new interface for reader-directed link construction aiming at bridging both reading and browsing activities [30].

Green's work on the automatic generation of hypertext is based on lexical chaining, a method for discovering sequences of semantically related words in a text [16]. Tudhope and Cunliffe presented a study which discusses, using lexical chaining associated with a semantic approach, the possible relations between words [37]. El-Beltagy et al. defined a semi-automatic method for generating context links by adapting to the information needs of users and the context of the defined query [12]. Their implementation is supported by a multi-agent framework with open hypermedia concepts and explores a kind of semantic similarity, based on the vector space model, over the keywords of the web pages visited by users [12]. The work relies on vector space theory to model information and on the user's intentionality to define links.

Besides lexical and semantic approaches to generating links, some researchers have investigated the use of information derived from cross-references to define links. Silva et al.'s work combines traditional information retrieval techniques based on document content with knowledge about the link structure to improve the results in web-based services [34]. Bjørneborn explores link structures and small-world phenomena on the web, with possible implications for knowledge discovery or mining over the web [4].

Once relationships have been identified, links should be created to allow hypertext-based navigation. One attractive way of supporting hypertext links is exploring open hypermedia concepts to provide hypermedia functionalities to applications. Systems such as the Distributed Link Service (DLS) [7] and Webvise [17] were designed to bring the open hypermedia philosophy to the web. Webvise is an open hypermedia service which, integrated with Microsoft products, supports the creation and maintenance of different types of open hypermedia structures such as contexts, links, annotations, collections and guided tours. This system includes the ability for users to manually create links from parts of web pages. The DLS open hypermedia system is based on the OHS Microcosm [13], which supports local, generic and specific unidirectional link storage in linkbases on different servers. DLS uses URL addresses to identify web documents and the HTTP protocol to support the communication between applications and its linkbase. The COHSE project combines the DLS architecture with a conceptual model to provide a Conceptual Open Hypermedia Service [8]. By using a predefined ontology based on a thesaurus, resource information composed of web pages representing concepts, and metadata stored in a linkbase, concepts referred to in web resources can be identified and matched against potential "link destinations" for navigational purposes [8]. COHSE supports semi-automatic hypertext construction: users interact with a search service to retrieve documents whose related concepts are turned into links for navigation, and it uses semantic similarities based on a thesaurus.

The infrastructure we present in this paper exploits LSA to extract salient semantic relationships from web repositories and makes those relationships available to any application by storing them in an open hypertext linkbase.

3. LATENT SEMANTIC ANALYSIS AS AN INFORMATION RETRIEVAL TECHNIQUE

Latent Semantic Analysis (LSA) is a method to organize textual information into semantic structures which can be used for information retrieval and browsing [14]. This method has demonstrated improved performance over the traditional vector space technique; it extends the vector space model by modelling term-document relationships using a reduced approximation of the column and row space computed by singular value decomposition of the term by document matrix [11]. The method is designed to overcome two problems faced by lexical matching indexing — synonymy (the variability in human word choice) and polysemy (the same word often has different meanings) — by automatically organizing documents into a semantic structure more appropriate for information retrieval.

Many researchers have applied, extended and evaluated LSA. One example is the use of LSA to organize retrieval results semantically [5] [39]. Soto used LSA to compute semantic similarity between task descriptions and menu labels in menu-based applications [35]. Papadimitriou et al.'s results indicate that, under certain conditions, LSA does succeed in capturing the underlying semantic structure of a corpus and achieves improved retrieval performance [26].

3.1 Singular Value Decomposition

Singular Value Decomposition (SVD) is a form of factor analysis; in fact, it is the mathematical generalization of which factor analysis is a special case [14]. It constructs an n-dimensional abstract semantic space in which each original term and each original document (and any new one) are represented, respectively, as rows and columns of a rectangular matrix X. In SVD, the matrix X is decomposed into the product of three other matrices T, S and D', as illustrated in Figure 1:

X = T S D'

Matrix T is an orthogonal matrix whose rows correspond to the rows of the original matrix X, but it has m new columns with variables specially derived so that there is no correlation between any two columns; in other words, each column is linearly independent of the others. Matrix D is an orthogonal matrix whose columns correspond to the original columns of matrix X, but with m rows composed of derived singular vectors; D' is the transpose of D. The third matrix, S, is an m by m diagonal matrix with non-zero entries (called singular values) only along its central diagonal. A large singular value indicates a large effect of the corresponding dimension on the sum-squared error of the approximation. The role of these singular values is to relate the scale of the factors in the other two matrices to each other so that, when the three components are matrix-multiplied, the original matrix is reconstructed.

Figure 1: Component Matrices of SVD.

Following the decomposition by SVD, the k most important dimensions (those with the highest values in S) are selected and all other factors are omitted. The reduction in the indexing space implies less use of memory and computation. The amount of dimensionality reduction, i.e., the choice of k, is critical and is an open issue in the literature. Ideally, k should be large enough to fit the real structure in the data, but small enough that noise, sampling errors and unimportant details are not modelled. The reduced dimensionality solution generates a vector of k real values to represent each document. The reduced matrix ideally represents the important and reliable patterns (latent semantic structures) underlying the data in X; it corresponds to a least-squares best approximation to the original matrix X. Because this minimization requires the simultaneous accommodation of all data, it constitutes a form of induction. SVD provides a reduced rank-k approximation for the column and row space of a term by document matrix X for any value of k; this reduced matrix is denoted X̂. It is important to observe that, as far as using LSA to organize textual information from terms contained within documents is concerned, the literature suggests the use of k = 2, corresponding to the two dimensions term and document [14].

3.2 Use of SVD in Information Retrieval

Similarity between documents and terms can be computed on the basis of the matrices X, T, S and D already reduced, i.e., X̂, Tk, Sk and Dk. Calculating the similarities for all pairs of documents of the reduced matrix is equivalent to multiplying the transpose matrix X̂' by X̂. According to the SVD decomposition, this is algebraically equivalent to:

X̂' X̂ = (Dk Sk)(Dk Sk)'     (1)

The comparison of term i and term j may be made by the inner product of rows i and j of the reduced matrix Tk Sk, which is equivalent to:

X̂ X̂' = (Tk Sk)(Tk Sk)'     (2)

Finally, the comparison of term i and document j may be made by the inner product of row i of the reduced matrix T S^(1/2) and row j of the reduced matrix D S^(1/2):

X̂ = (T S^(1/2))(D S^(1/2))'     (3)

Query. Indexes defined by information retrieval models can be explored by query engines to find documents based on their similarities. In a Latent Semantic Analysis (LSA) approach, as in standard vector term matching, the similarity of two documents is obtained by comparing (with the inner product or the cosine) the corresponding two column vectors of matrix X. A query vector Vq, represented as a pseudo-document (a column vector of term frequencies), can thus be compared against all columns of matrix X so that the best matches are found [14]. The query column vector is calculated with the values of the reduced component matrix Tk, so it is also considered reduced, V̂q:

V̂q = Tk Tk' Vq     (4)

SVD Model. According to Furnas et al., the calculations of similarities between documents, terms and queries may be given a geometric interpretation [14]. If the axes of the spaces are re-scaled by the associated diagonal values of S, the inner product between term points or document points can be used to make the algebraic comparisons of interest. To have a latent semantic structure view of an information retrieval system, the query must be given a representation within the SVD model. Its representation must yield results consistent with the procedure used in the term-matching conceptualization. The query must be a pseudo-document whose similarity is calculated by its inner product (or cosine measure) to other document points.
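To make equations (1)-(4) concrete, the following minimal sketch checks the document-document and term-term identities on an arbitrary matrix and folds a query vector into the reduced space. It assumes Python with numpy, which are not used in the paper itself; the matrix values are arbitrary stand-ins for term frequencies.

import numpy as np

# Small illustrative term-by-document matrix X (n terms x z documents).
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(8, 5)).astype(float)

# SVD: X = T S D', with the singular values returned as the vector s.
T, s, Dt = np.linalg.svd(X, full_matrices=False)

# Keep the k largest singular values (rank-k approximation).
k = 2
Tk, Sk, Dk = T[:, :k], np.diag(s[:k]), Dt.T[:, :k]
X_hat = Tk @ Sk @ Dk.T                      # reduced matrix X_hat

# (1) document-document similarities: X_hat' X_hat = (Dk Sk)(Dk Sk)'
assert np.allclose((Dk @ Sk) @ (Dk @ Sk).T, X_hat.T @ X_hat)

# (2) term-term similarities: X_hat X_hat' = (Tk Sk)(Tk Sk)'
assert np.allclose((Tk @ Sk) @ (Tk @ Sk).T, X_hat @ X_hat.T)

# (4) fold a query (pseudo-document of term frequencies) into the space.
Vq = np.zeros(X.shape[0]); Vq[[0, 3]] = 1.0
Vq_hat = Tk @ Tk.T @ Vq

# Cosine between the folded query and each document column of X_hat.
cos = (X_hat.T @ Vq_hat) / (
    np.linalg.norm(X_hat, axis=0) * np.linalg.norm(Vq_hat) + 1e-12)
print(np.round(cos, 2))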

3.3 Example

The following classic example, extracted from [14], illustrates the SVD theory and the LSA approach. The example is about a set of nine titles of selected technical memoranda, as shown in Figure 2(top).

Figure 3: Reduced Matrix of the Technical Memo Example [14].

Figure 2: Technical Memo Example. (top) Titles of some technical memos. (bottom) The entries in the term by document matrix [14].

In this example, words occurring in more than one title were selected for indexing; these words are italicized. There are two classes of titles: five about human-computer interaction (c1 - c5) and four about graph theory (m1 - m4). The entries in the term by document matrix, shown in Figure 2(bottom), are the frequencies of occurrence of each term in each document.

Techniques based on term matching would return only documents which share the exact terms of the query. For example, the query "human computer interaction" would return only documents c1, c2 and c4. However, two other documents which are also relevant — c3 and c5 — are missed by this method since they have no terms in common with the query. To apply SVD to matrix X, k = 2 should be used to reduce the dimensions of all the matrices involved; this is because the values other than S(1,1) and S(2,2) are always close to zero. The matrix resulting from matrix X is matrix X̂, presented in Figure 3.

Considering the two shaded cells for "survey" and "trees" in column m4 of Figure 2(bottom): the word "trees" did not appear in this graph theory title, even though m4 does contain "graph" and "minors". In Figure 3, the zero entry for "trees" has been replaced with 0.66. This is an estimate of how many times it would occur in each of an infinite sample of titles containing "graph" and "minors", and it can be used to retrieve the document m4 with queries containing the word "trees"; queries made over the initial matrix in Figure 2(bottom) would not retrieve this document. The value 1.00 for "survey", which appeared once in m4, has been replaced by 0.42 in Figure 3. This reflects the fact that the term is unexpected in this context and should be counted as less important. Considering the matrix in Figure 3, the query "human computer interaction" would return documents c1, c2 and c4 as well as documents c3 and c5, because the pairs of terms "human"/"user" and "computer"/"system" can be semantically related. This observation illustrates the ability of the method to capture implicit synonymy.

SVD Model. The calculations of similarities between documents, terms and queries have a geometric interpretation. If the axes of the spaces are re-scaled by the associated diagonal values of the S component of matrix X, the inner product between term points or document points can be used to make the algebraic comparisons of interest. In Figure 4(top), for example, the similarities for all pairs of documents in the example set could be defined by (Dk Sk)(Dk Sk)'. The similarities for all pairs of terms are equivalent to (Tk Sk)(Tk Sk)' and could also be plotted in Figure 4(top). The query "human computer interaction" has been plotted at the point (0.14, -0.03) in Figure 4(bottom), where the geometric interpretation of the two-factor solution shows clearly that all the human-computer papers have been nicely separated from all the math papers. The query has been treated as a pseudo-document and placed at the weighted vector sum of its component terms. The angle of its vector with that of every relevant document, whether they share terms or not, is smaller than with any of the math papers.

In the work presented in this paper, the latent semantic relationships identified by our service are based on the LSA approach.
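The example itself can be reproduced with a short script. The sketch below (Python with numpy again, for illustration only) rebuilds the term by document matrix of Figure 2(bottom) from the well-known published example in [14] — the entries are taken from that source, since Figure 2 is not reproduced here — and checks the reconstructed values discussed above:

import numpy as np

# Term-by-document matrix of the technical-memo example in [14]:
# 12 index terms, 9 titles (c1-c5 on HCI, m1-m4 on graph theory).
terms = ["human", "interface", "computer", "user", "system", "response",
         "time", "EPS", "survey", "trees", "graph", "minors"]
docs = ["c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4"]
X = np.array([
   # c1 c2 c3 c4 c5 m1 m2 m3 m4
    [1, 0, 0, 1, 0, 0, 0, 0, 0],   # human
    [1, 0, 1, 0, 0, 0, 0, 0, 0],   # interface
    [1, 1, 0, 0, 0, 0, 0, 0, 0],   # computer
    [0, 1, 1, 0, 1, 0, 0, 0, 0],   # user
    [0, 1, 1, 2, 0, 0, 0, 0, 0],   # system
    [0, 1, 0, 0, 1, 0, 0, 0, 0],   # response
    [0, 1, 0, 0, 1, 0, 0, 0, 0],   # time
    [0, 0, 1, 1, 0, 0, 0, 0, 0],   # EPS
    [0, 1, 0, 0, 0, 0, 0, 0, 1],   # survey
    [0, 0, 0, 0, 0, 1, 1, 1, 0],   # trees
    [0, 0, 0, 0, 0, 0, 1, 1, 1],   # graph
    [0, 0, 0, 0, 0, 0, 0, 1, 1],   # minors
], dtype=float)

# Rank-2 reconstruction (k = 2), i.e., matrix X_hat of Figure 3.
T, s, Dt = np.linalg.svd(X, full_matrices=False)
X_hat = T[:, :2] @ np.diag(s[:2]) @ Dt[:2, :]

print(round(X_hat[terms.index("trees"), docs.index("m4")], 2))   # approx. 0.66
print(round(X_hat[terms.index("survey"), docs.index("m4")], 2))  # approx. 0.42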

4. THE WLS LINKBASE SERVICE

The Web Linkbase Service (WLS) is an XML-based open linkbase service for the web that aims both at (a) providing hypermedia functionalities to a set of non-hypermedia XML applications and (b) allowing the integration of these applications [6]. WLS has been developed as an API (Application Programming Interface), so that application developers can reuse and combine the available operations with their own building blocks. By means of an API, WLS can be reused in different contexts, reducing the authoring effort of hypermedia-enabled applications. In the work reported in this paper, WLS is explored to store salient semantic relationships extracted from web repositories by means of LSA.

Figure 4: (top) A 2-dimensional plot of 9 documents from the example set. (bottom) A 2-dimensional plot of 9 documents from the example set and the query "human computer interaction".

Figure 5: WLS Conceptual Model.

4.1 The WLS Requirements

The requirements presented in [25] have guided the development of WLS: (a) explicit separation between the contents and the structure of information [9] [19]; (b) linking to information with the same meaning, supported independently of the documents containing the information [36]; (c) links that can refer to documents in different formats [36]; (d) support for external linkbases (public and private) [9] [19]; (e) links "from" and "to" read-only information [36]; (f) a number of different relationships between documents [9] [19], such as specific links, bi-directional links, local links and global links.

4.2 The WLS Model

Based on both the OHP linking sub-protocol messages and the navigational model presented in [18], the WLS conceptual model has been defined as a set of classes and relationships in UML (Unified Modelling Language) notation, as depicted in Figure 5. The class Context represents a collection of nodes, links and nested contexts. This class allows different sets of links over a same set of information and the reuse of previously created contexts. The OHS Webvise supports contexts as first-class hypermedia structures [17].

WLS's core functionality lies in the relationship among the classes Anchor, EndPoint and Link. The class Link identifies associations between endpoints. The class EndPoint corresponds to each extremity of a link with its respective direction (source, destination or bi-directional). The class Anchor defines an internal location within the contents of a document. The XPointer specification [10] was adopted as the anchoring model by the applications integrated with WLS; the anchor's value is stored in the expression property. Through the relationships among the Anchor, EndPoint and Link classes, the WLS conceptual model provides support for multidirectional links and for the sharing of an anchor among the endpoints of several links. Moreover, WLS supports dangling links.

The class Link includes properties that incorporate semantics, such as name, a descriptive title for the link, and keywords, auxiliary words that express the semantic role between linked documents. Another interesting feature of the WLS data model is the class Semantics: it is intended to make explicit the semantic relationships between the endpoints of a link. WLS provides some pre-defined link semantic types, including comment, explanation, example, advice and seealso, some of them influenced by Annotea's annotation types [23].
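To make the conceptual model of Figure 5 more tangible, the following sketch renders its classes as Python dataclasses. This is purely illustrative: WLS is an XML-based service, and these names and fields are a hypothetical reading of the model, not the actual WLS API.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Anchor:
    document: str            # URL of the document holding the content
    expression: str          # XPointer expression locating the content

@dataclass
class EndPoint:
    anchor: Anchor
    direction: str           # "source", "destination" or "bi-directional"

@dataclass
class Semantics:
    relation: str            # e.g. "comment", "explanation", "seealso"

@dataclass
class Link:
    name: str                # descriptive title for the link
    keywords: List[str]      # words expressing the semantic role
    endpoints: List[EndPoint]
    semantics: Optional[Semantics] = None

@dataclass
class Context:
    # Collection of links and nested contexts over a set of information.
    links: List[Link] = field(default_factory=list)
    children: List["Context"] = field(default_factory=list)

Sharing one Anchor object among the EndPoints of several Link instances mirrors the anchor sharing and multidirectionality the model supports.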

5. AN INFRASTRUCTURE FOR OPEN LATENT SEMANTIC LINKING

We first present our previous work, corresponding to the implementation of an infrastructure that allows the identification of latent semantic relationships between two web-based repositories; in this case, the titles of the pages in one repository are used to query the other repository [24]. We then discuss how we were able to exploit our original infrastructure to make the processing of the repositories more flexible; we also present the mapping that allows the relationships to be stored in an open hypermedia linkbase.

5.1 Query-based Computation of Relationships with LSA

Figure 6 illustrates the use of our Latent Semantic Linking Infrastructure for the automatic generation of links between two web-based repositories, A and B. The underlying processing is as follows:

1. Initially, Repositories A and B are indexed (1(a) and 1(b)). The indexing mechanisms should take into account the size of the repositories.

2. The index resulting from 1(a) is usually associated with the repository having the highest number of words in the index. It is also used to generate a term by document matrix X.

3. The term by document matrix X produced in step 2 is decomposed into components T, S and D using Singular Value Decomposition (SVD), where T is the matrix of terms, S is the matrix of singular values and D is the matrix of documents.

4. The index resulting from 1(b) is processed and turned into query column matrices.

5. The semantic matrix is generated by combining the reduced matrix X̂ with the query column matrices (produced in step 4), computing the cosine between them.

6. Given the semantic matrix generated in step 5, relationships between the repositories are identified by considering the cells that have the highest values of similarity, while relationships within the repositories are obtained by the cosine of its columns (Repository A) and the cosine of its rows (Repository B).

7. Potential links are identified by considering those relationships with the highest degrees of relevance generated in step 6; a sketch of these steps in code is given below.
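The sketch below assumes Python with numpy, used here only for illustration; the function and parameter names are ours, not part of the implementation described later in this section.

import numpy as np

def query_based_links(X_a, queries_b, k=2, threshold=0.99):
    # X_a: term-by-document matrix of Repository A (steps 1-2);
    # queries_b: one column of term frequencies per page of Repository B,
    # over the same term vocabulary (step 4).
    T, s, Dt = np.linalg.svd(X_a, full_matrices=False)       # step 3
    Tk = T[:, :k]
    X_hat = Tk @ np.diag(s[:k]) @ Dt[:k, :]                  # reduced matrix
    links = []
    for j, Vq in enumerate(queries_b.T):
        Vq_hat = Tk @ Tk.T @ Vq                              # fold query into LSA space
        cos = (X_hat.T @ Vq_hat) / (                         # step 5: semantic matrix
            np.linalg.norm(X_hat, axis=0) * np.linalg.norm(Vq_hat) + 1e-12)
        for i in np.flatnonzero(cos >= threshold):           # steps 6-7: threshold
            links.append((j, i, cos[i]))                     # into potential links
    return links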

Figure 6: Latent Semantic Linking Infrastructure for the automatic generation of links between repositories based on LSA.

To investigate the utility of the service, we ran an experiment to link information between two repositories storing information about traditional university courses.

• The eClass is an instrumented environment — with electronic whiteboards, large projected displays and streaming digital audio/video — that automatically captures much of the detail of a lecture experience. As a result, multimedia-enhanced web-based documents are stored in the eClass repository for users to review the lecture [1] [27] [28].

• The CoWeb is an informal and unstructured web service for collaborative authoring of web-based material [21].

The LSA approach was exploited in an experiment with a graduate seminar course taught by two instructors for 31 students. The lectures used the eClass infrastructure twice a week, totaling 18 lectures that generated 318 slides over a 10-week period. The use of the CoWeb was required — students and instructors created essay pages to hold summaries of readings, and identification pages for themselves, resulting in 303 pages created [2].

We chose the eClass repository to generate a term by document matrix X. We used the mnoGoSearch [20] general public license search engine to index the repository. We first tuned the indexing process by customizing the dictionary of stopwords to take into account the vocabulary used in the repositories. It was also necessary to modify the engine itself so as to compute over links activated by JavaScript. This process indexed 1904 terms and 616 documents, corresponding to the 18 lectures containing those terms. From the index produced for the eClass repository, a term by document matrix X and then the SVD matrices T, S and D were generated. We used the value 2 as the reduced rank-k approximation for the column and row space of matrix X.

The indexing of the CoWeb repository took into account the titles of the web pages only, since such information is meant to be relevant to the whole contents of the page. This implies that, although the CoWeb repository is much bigger than the eClass repository in terms of number of words, the use of the title only provides a natural filtering as far as the use of relevant information is concerned. This approach had already been successfully adopted in a lexical automatic linking service implemented earlier [29]. The index produced for the CoWeb was based on 305 query column matrices.

Finally, a procedure combined matrices from eClass and query matrices from CoWeb to generate a semantic matrix of 305 rows by 616 columns (grouped by the 18 lectures), by calculating the cosine between the reduced matrix X̂ and the query column matrices. We used a 99% level of similarity filtering to generate a 99% relevance semantic matrix, which was used to create hypertext links between the repositories. The overall process, run with a 99% degree of similarity and considering k = 2, identified 789 links between the eClass and CoWeb repositories for the course considered — an average of 43.8 links per lecture [24]. Such results were very positive and stimulated us to work towards generalizing the approach.

5.2 Generation and Storage of Links

The novelty of the work reported in this paper corresponds to a new infrastructure that:

• Identifies latent semantic relationships between web repositories without the need for keywords or query phrases: the whole contents of the given repositories are considered;

• Stores the identified relationships in the WLS linkbase, where they can be used by third-party applications — hypermedia or not.

The first feature demanded that each web repository be indexed separately and integrated with LSA: terms and URLs from each web repository are placed in the same database tables and SVD matrices. Our current interface supports up to five web repositories to be computed simultaneously. The latter feature required the definition of a mapping between the classes of the Web Linkbase Service (WLS) and the Latent Semantic Linking Infrastructure, as summarized in Table 1.

Table 1: Mapping of the Web Linkbase Service (WLS) onto the Latent Semantic Linking Infrastructure (LSLI).

WLS         LSLI
Anchor      Terms in the documents (URLs) of matrix X
Node        References to the documents (URLs) of matrix X
EndPoint    Term computed as an extremity of a latent semantic link
Link        Two EndPoints defining a latent semantic relationship
Context     Collection of links generated automatically
Semantics   Pairs of similar terms

The indexing process starts from a main URL given to the system. WLS Anchors and Nodes correspond, respectively, to the terms of documents (URLs) and to references to documents (URLs) that have been found to be similar to another document with a similarity value greater than 99%. Terms are taken from matrix T and URLs are extracted from matrix D, both obtained by applying SVD to matrix X. EndPoints are equivalent to terms (mapped to Anchors). A WLS Link is a set of two EndPoints identified by a latent semantic relationship defined with our LSL service. The semantic relationships between the repositories are identified by considering the cells that have the highest values of similarity, while relationships within the repositories are obtained by the cosine of their own columns and the cosine of their own rows. WLS Contexts correspond to the whole collection of links generated automatically by the service at one time. Semantics correspond to pairs of similar terms; these pairs can be automatically defined by the LSL service and stored in the WLS linkbase, independently of the identity of the documents holding the information. Because of the relationships among the Anchor, EndPoint and Link classes, WLS provides our infrastructure with support for n-ary multidirectional links and for the sharing of an anchor among the endpoints of several links. The external linkbase defined by WLS supported an effective creation of links; these functionalities are explored in the Open Latent Semantic Linking Infrastructure presented next.
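Given the mapping of Table 1, the glue between the two services is small. The sketch below is hypothetical: it reuses the illustrative dataclasses from Section 4.2 and a plain Context object standing in for the actual WLS linkbase call.

def store_semantic_link(term_i, url_i, term_j, url_j, linkbase_context):
    # Anchoring: terms become Anchors inside their documents (URLs).
    source = EndPoint(Anchor(document=url_i, expression=term_i), "source")
    dest = EndPoint(Anchor(document=url_j, expression=term_j), "destination")
    # Link: two EndPoints defining a latent semantic relationship;
    # Semantics: the pair of similar terms behind the link.
    link = Link(name=f"{term_i} ~ {term_j}",
                keywords=[term_i, term_j],
                endpoints=[source, dest],
                semantics=Semantics(relation="seealso"))
    # Storage: the Context collects all links generated in one run.
    linkbase_context.links.append(link)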

5.3 Open Latent Semantic Linking Infrastructure

Figure 7 presents our proposed infrastructure for open latent semantic linking over web-based repositories.

The structural level of the Open Latent Semantic Linking Infrastructure defines semantic relationships between web repositories without analyzing a query, so this level is composed of part of the Latent Semantic Linking Infrastructure presented in Figure 6. At the beginning of the sequence of procedures in the structural level, the web repositories are indexed to generate a term by document matrix X. The module "Compute Similarity" then creates the matrix of semantic relationships. The processing modules are presented in detail in the next section. In the structural level, the infrastructure manipulates contents and structural components such as nodes and latent semantic links.

The storage level is composed of the WLS linkbase, created to store the links identified by our approach. The communication between the structural level and the storage level is supported by procedures of the anchoring level. These three levels — structural, storage and anchoring — correspond to the homonymous levels of the Dexter Reference Model [22].

The presentation level (request interface) supports interaction functionalities so that users can (a) make requests to the Latent Semantic Linking Service and (b) visualize the results stored in the linkbase at the storage level. This level is composed of interfaces such as those presented in Figure 8.

Figure 7: Infrastructure for Open Latent Semantic Linking.

5.4 Implementation

We present an algorithm of one possible implementation of the Open Latent Semantic Linking Infrastructure, based on the processing modules from the structural level, the anchoring level and the storage level. The presentation level is not included because its implementation depends on the kind of service using the proposed infrastructure; examples of interfaces for the presentation level are presented in Section 5.5. The processing modules shown in Figure 7 are implemented based on the algorithm presented next. In the whole implementation process, we have used Ox (http://www.timberlake.co.uk), a matrix-oriented programming language developed at Oxford University, augmented with C++ code.

Procedure OpenLatentSemanticLinking
Begin
  varchar(50) URLs[];  /* URLs to be indexed */
  For (i := 1 to length(URLs); i++) do
    /* the indexing method calls the indexing engine */
    tbTerms, tbURLs := indexing(URLs[i]);
  matrix_X[N][Z] := generateTermsByDocuments(tbTerms, tbURLs);
  matrix_T[N][K], matrix_S[K][K], matrix_D[K][Z] := computeSVD(matrix_X, K);
  matrixReduced_X[N][Z] := matrix_T[N][K] * matrix_S[K][K] *
                           transpose(matrix_D[K][Z]);
  /* computeSimilarity() is equivalent to: */
  For (i := 1 to Z-1; i++) do
    matrixSimilarity[i] := cosine(matrixReduced_X[1..N][i],
                                  matrixReduced_X[1..N][i+1]);
    if (matrixSimilarity[i] >= Similarity_Threshold) then
      WLS_NewLink(matrixReduced_X[1..N][i], matrixReduced_X[1..N][i+1]);
End.

The OpenLatentSemanticLinking procedure executes as follows:

• Module "Indexing" indexes each web repository. We use the mnoGoSearch [20] general public license search engine to index the repositories. It has been necessary to modify the engine itself so that it computes over links activated by JavaScript. Web repositories are indexed separately but stored in the same database tables (Terms table and URLs table) in the engine database.

• Module "Generate Terms by Documents Matrix" generates, from the tables and indexes produced for the repositories, a term by document matrix X.

• Module "ComputeSVD", given the term by document matrix X and the value of k for the reduction, executes the Singular Value Decomposition, generating the component matrices T, S and D and reducing the dimensions of these matrices according to k.

• Module "Compute Similarity" uses a threshold level of similarity filtering to generate a relevance semantic matrix, which is used to identify relationships between the repositories.

• Module "WLS NewLink" requests WLS services to create a link and store it in the linkbase.

5.5 Proof of concept

According to [3], the evaluation of automatically-generated hypertext links in general settings is problematic: although bad hypertext links are fairly easy for a human to recognize, there are varying ideas of what a "good" link is. Perhaps partly for that reason, there is no standard test collection against which to evaluate automatically generated links. The evaluation of hypertext links can be even more difficult when these links are latent semantic links, since the relationships depend on the context where they are used.

To demonstrate the utility of our infrastructure we built the LinkDigger Service (http://mexcal.intermidia.icmc.sc.usp.br/LSL/), which (a) allows up to five web-based repositories to be indexed and semantically related to one another and (b) gives access to the results in the form of a directory referring to documents in the given web sites. The interfaces have been built with PHP and JavaScript. The interfaces for executing the service and accessing the results are shown, respectively, on the left-hand and right-hand sides of Figure 8. The interface for feeding URLs is based on forms with text boxes where users can specify the URLs to be integrated, and the interface for accessing results presents the created links as a hierarchically organized set of directories.

Figure 8: Feeding URLs to the LinkDigger Service (left) and accessing the results (right).

To run the service, a user (a) selects the number of sites to be linked, (b) fills in the URL specifications and an email address and (c) starts the execution of the linking service. In order to support the indexing of large sites, the service runs as a background process: when the execution concludes, an email is sent to the given address informing the URL where the results can be seen.

The results of several preliminary experiments with LinkDigger are positive. As an example, we have run the service to relate pages from the New York Times (NYT) and New York Post (NYP) on-line editions of January 13th, 2002. The experiment considered 25 pages referring to international news (16 from the NYT and 9 from the NYP). The service identified a total of 174 latent semantic links, an average of 7 links per page. The page with the highest number of links (14) is the home page of the NYT, an expected result because this page is composed of headlines; moreover, of those 14 links, 6 are to pages of the NYP. 10 pages from the NYT were related only to pages of the NYT itself, while 6 pages of the NYP were related to pages of the NYP only. The main page of the NYT linked to both newspapers, while the main page of the NYP linked to the NYP only. Many of these facts may be consequences of the greater quantity and variety of news in the NYT.

6. CONCLUSION

We presented a flexible and extensible open infrastructure allowing the use of the latent semantic approach in the automatic generation of hypertext links among information contained in web repositories. Our infrastructure allows the extraction of salient semantic relationships from web repositories and stores those relationships in an open hypertext linkbase. As proof of concept of the utility of our infrastructure, we presented the LinkDigger Service. The results of preliminary experiments with LinkDigger are very positive, and we are currently setting up experiments to evaluate LinkDigger qualitatively and quantitatively. Although the computational effort involved in the approach can limit its use to well-modelled cases, the service can be exploited efficiently in many situations where links cannot be generated by traditional lexical matching approaches.

Acknowledgments

Alessandra Macedo is a PhD candidate and José Antonio Camacho is an MSc candidate; both are supported by FAPESP (99/115270 – 00/141036). Maria Pimentel currently holds individual research grants from CNPq and FAPESP. Maria Pimentel has international support for the InCA-SERVE Project from CNPq in Brazil jointly with Gregory Abowd, who is supported by NSF in the U.S. We thank Gregory Abowd, Mark Guzdial and Renato Bulcão, who lead, respectively, the eClass, CoWeb and WLS projects.

7. REFERENCES

[1] G. Abowd. Classroom 2000: an experience with the instrumentation of a living educational environment. IBM Systems Journal, 38:508–530, 1999.
[2] G. D. Abowd, M. G. C. Pimentel, B. Kerimbaev, Y. Ishiguro, and M. Guzdial. Anchoring discussion in lecture: an approach to collaboratively extending classroom digital media. In Proceedings of the Computer Support for Collaborative Learning (CSCL) Conference, pages 11–19, Stanford University, 1999.
[3] J. Allan. Automatic hypertext link typing. In Proceedings of the Seventh ACM Conference on Hypertext, pages 42–52, 1996.
[4] L. Bjørneborn. Small-world linkage and co-linkage. In Proceedings of Hypertext 2001, August 2001.
[5] K. Borner. Extracting and visualizing semantic structures in retrieval results for browsing. In Proceedings of the Fifth ACM Conference on Digital Libraries, pages 234–235, 2000.
[6] R. F. Bulcão Neto. WLS: an XML-based open hypermedia service for the web. MSc thesis, Instituto de Ciências Matemáticas e de Computação da USP, São Carlos, São Paulo, 2001. In Portuguese.
[7] L. Carr, D. C. DeRoure, H. C. Davies, and W. Hall. The distributed link service: a tool for publishers, authors and readers. In Proceedings of the Fourth International World Wide Web Conference, pages 647–656. ACM Press, 1995.
[8] L. Carr, W. Hall, S. Bechhofer, and C. Goble. Conceptual linking: ontology-based open hypermedia. In Proceedings of the 10th International World Wide Web Conference, pages 334–342. ACM Press, May 2001.
[9] H. Davis, A. Lewis, and A. Rizk. OHP: a draft proposal for a standard Open Hypermedia Protocol. In Proceedings of the 2nd Workshop on Open Hypermedia Systems (Hypertext '96), pages 27–53. ACM Press, 1996.
[10] S. DeRose, E. Maler, and R. Daniel. XML Pointer Language (XPointer), Last Call Working Draft, 2001. URL: http://www.w3.org/TR/xptr.
[11] S. T. Dumais, G. W. Furnas, T. K. Landauer, S. Deerwester, and R. Harshman. Using latent semantic analysis to improve access to textual information. In Conference Proceedings on Human Factors in Computing Systems, pages 281–285, 1988.
[12] S. R. El-Beltagy, W. Hall, D. DeRoure, and L. Carr. Linking in context. In Proceedings of Hypertext 2001, pages 151–160, August 2001.
[13] M. A. Fountain, W. Hall, I. Heath, and H. C. Davis. Microcosm: an open model for hypermedia with dynamic linking. In Proceedings of ECHT '90, pages 298–311, 1990.
[14] G. W. Furnas, S. Deerwester, S. T. Dumais, T. K. Landauer, R. A. Harshman, L. A. Streeter, and K. E. Lochbaum. Information retrieval using a singular value decomposition model of latent semantic structure. In Proceedings of the Eleventh International Conference on Research & Development in Information Retrieval, pages 465–480, 1988.
[15] G. Golovchinsky. What the query told the link: the integration of hypertext and information retrieval. In Proceedings of the ACM Conference on Hypertext '97, pages 30–39, 1997.
[16] S. Green. Building hypertext links by computing semantic similarity. IEEE Transactions on Knowledge and Data Engineering, 11(5):713–730, September/October 1999.
[17] K. Grønbæk, L. Sloth, and P. Ørbæk. Webvise: browser and proxy support for open hypermedia structuring mechanisms on the WWW. In Proceedings of the Eighth International World Wide Web Conference, pages 253–267, Toronto, Canada, May 1999.
[18] K. Grønbæk and R. Trigg. Toward a Dexter-based model for open hypermedia: unifying embedded references and link objects. In Hypertext '96, Seventh ACM Conference on Hypertext, pages 149–160, Washington DC, March 1996. ACM Press.
[19] K. Grønbæk and R. Trigg. From Web to Workplace: Designing Open Hypermedia Systems (Digital Communication), volume 1. MIT Press, Boston, MA, July 1999. 386 p.
[20] mnoGoSearch Group. MnoGoSearch web search engine software, 2001. URL: http://www.mnogosearch.ru.
[21] M. Guzdial. Supporting learners as users. The Journal of Computer Documentation, 23(2):3–13, 1999.
[22] F. Halasz and M. Schwartz. The Dexter hypertext reference model. Communications of the ACM, 37(2):30–39, 1994.
[23] J. Kahan, M. Koivunen, E. Prud'Hommeaux, and R. R. Swick. Annotea: an open RDF infrastructure for shared web annotations. In Proceedings of the WWW10 International Conference, Hong Kong, May 2001.
[24] A. A. Macedo, M. G. C. Pimentel, and J. A. C. Guerrero. Latent semantic linking over homogeneous repositories. In Proceedings of the ACM Symposium on Document Engineering, pages 144–151. ACM Press, November 2001.
[25] A. M. M. Miotto and R. P. M. Fortes. Uma visão geral das características de sistemas hipermídia abertos. Technical Report 124, Instituto de Ciências Matemáticas e de Computação (ICMC-USP), November 2000. 22 p. In Portuguese.
[26] C. H. Papadimitriou, H. Tamaki, P. Raghavan, and S. Vempala. Latent semantic indexing: a probabilistic analysis. In Proceedings of the Seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 159–168, 1998.
[27] M. G. C. Pimentel, G. D. Abowd, and Y. Ishiguro. Linking by interacting: a paradigm for authoring hypertext. In Proceedings of the Eleventh ACM Conference on Hypertext and Hypermedia, pages 39–48, 2000.
[28] M. G. C. Pimentel, Y. Ishiguro, B. Kerimbaev, G. D. Abowd, and M. Guzdial. Supporting long-term educational activities through dynamic web interfaces. Interacting with Computers Journal, 13:353–374, 2001.
[29] M. G. C. Pimentel, A. A. Macedo, and G. D. Abowd. Linking homogeneous web-based repositories. In Proceedings of the International Workshop on Information Integration on the Web, pages 35–42, Rio de Janeiro, Brazil, 2001. URL: http://www.cos.ufrj.br/wiiw/schedule.html.
[30] M. N. Price, G. Golovchinsky, and B. N. Schilit. Linking by inking: trailblazing in a paper-like hypertext. In Proceedings of the ACM Conference on Hypertext '98, pages 30–39, 1998.
[31] G. Salton. A blueprint for automatic indexing. ACM SIGIR Forum, 16(2):22–38, 1981.
[32] G. Salton. Another look at automatic text-retrieval systems. Communications of the ACM, 29(7):648–656, 1986.
[33] G. Salton and J. Allan. Selective text utilization and text traversal. In Proceedings of the ACM Conference on Hypertext '93, pages 131–144, 1993.
[34] I. Silva, B. Ribeiro-Neto, P. Calado, E. Moura, and N. Ziviani. Link-based and content-based evidential information in a belief network model. In Proceedings of ACM SIGIR '00, pages 96–103, 2000.
[35] R. Soto. Learning and performing by exploration: label quality measured by latent semantic analysis. In Proceedings of the Conference on Human Factors in Computing Systems, pages 418–425, 1999.
[36] L. C. Tai. Architecture support for content-based hypermedia. In Proceedings of the 2nd Workshop on Open Hypermedia Systems (Hypertext '96), pages 1–5, Washington D.C., USA, May 1996. ACM Press.
[37] D. Tudhope and D. Cunliffe. Semantically indexed hypermedia: linking information disciplines. ACM Computing Surveys, 31(4), December 1999.
[38] R. Wilkinson and A. F. Smeaton. Automatic link generation. ACM Computing Surveys, 31(4), December 1999.
[39] S. Zelikovitz and H. Hirsh. Using LSI for text classification in the presence of background text. In Proceedings of the 10th International Conference on Information and Knowledge Management, pages 113–118, 2001.
