user and a given set of concepts with different names. It contains an ..... the section of the site regarding meat dishes, as the chosen wine is red, even if the site is ...
A User Behavior-Based Agent for Improving Web Usage Francesco Buccafurri , Gianluca Lax, Domenico Rosaci, and Domenico Ursino DIMET – Universit` a “Mediterranea” di Reggio Calabria Via Graziella, Localit` a Feo di Vito, 89060 Reggio Calabria, Italy {bucca,lax,rosaci,ursino}@ing.unirc.it
Abstract. Designing applications for supporting the user activity on accessing Web information sources is one of the most appealing challenges for researchers in the area of Artificial Intelligence. The agent paradigm represents a natural way of facing this problem, mainly because of the autonomy and reactivity properties which these applications should have. In this paper we design an agent capable of both creating and managing a personal ontology as well as of exploiting it for discovering navigation paths potentially interesting for the user. The agent, during the navigation of a site, provides the user with a set of recommendations and, at the same time, learns her/his preferences by updating the ontology. The power of the approach is based on the capability of the ontology model (called concept-graph) of representing user-behavior dependent relationships among concepts and, importantly, dealing with structural and semantic heterogeneity of Web sources.
1
Introduction
The exponential growth of accessible information sources, mainly due to the very fast expansion of the Web and its permeating information system architectures, has determined the necessity of adequately supporting the user activity for driving her/his actions in a so large and composite universe. E-commerce is certainly one of the most evident case in which such a need strongly arises, since a frequent reason of low profit is the confusion of customers in front of a large offer landscape. In this case, automatic techniques, based on the construction of personal ontologies, may allow the reduction of the search space, by identifying the portion of the offers of interest for the customer. Similar considerations can be done in many other Web application contexts. To face this problem, one immediately thinks to adopt agent-based solutions, since the above requirements lead to software systems that autonomously react to user actions and continuously learn, from her/his behavior, new knowledge which is then exploited for supporting her/his activity. But, one of the main difficulty we have to overcome concerns with the inherently heterogeneous nature of the Web. Heterogeneity in the sense of multiplicity of data formats (HTML, XML, text, images, etc.), but
Corresponding author
R. Meersman, Z. Tari (Eds.): CoopIS/DOA/ODBASE 2002, LNCS 2519, pp. 1168–1185, 2002. c Springer-Verlag Berlin Heidelberg 2002
A User Behavior-Based Agent for Improving Web Usage
1169
also semantic, as, for instance, that originated by the presence of terminological, structural and semantic differences on terms [2]. Consider, for an instance, an agent which should support the navigation, through an e-shop site, of the user John Smith, who is a sommelier. Even if the site contains a section named Ferrari, selling gadgets of the famous Italian car, it should be undesirable that the agent drives the user into this section, only because the word Ferrari appears in the profile of Smith. In fact, Ferrari is also the name of a famous Italian “spumante” wine. There are many recent proposals in the context of user modeling [18,7,23,20, 17,13,1,19,15]. However, conceptual models used in these cases take into account only lexical and syntactic (i.e., structural) information and, thus, they are not capable of capturing the semantics underlying each concept as well as semantic relationships among concepts. The problem of dealing with heterogeneity, even semantic, has been deeply investigated in the field of Cooperative Information Systems (CIS, for shortly) [3,9,10,12,8,21,14,11,22]. As a consequence, an interesting research direction is studying how the integration of CIS with agent-based solutions for user modeling can be profitably exploited for designing models and applications for supporting Web usage. The purpose of this approach is that of obtaining the capability of representing user profiles describing her/his preferences at conceptual level (that is, interest of the user for a given concept, relationships felt by the user among concepts, and so on) and, thus, necessarily, the capability of dealing with all the above mentioned forms of heterogeneity. In this paper we give a contribution in such a context. Indeed, we propose a framework for supporting Web usage, which exploits recent results in the field of Cooperative Information Systems [21]. In particular, we design an agent capable of creating and managing the user ontology and exploiting it for supporting the user activity. Construction and updating of the user ontology is done by taking into account information about the structure of the visited information sources. The agent, during the navigation of a site, carries out, at the same time, two activities: the first is providing the user with a set of recommendations for supporting her/his navigation, and the second is monitoring the user behavior for learning her/his preferences and encoding such a new knowledge into the ontology. A key issue which has to be faced concerns with the model used for representing the user ontology, that should be capable of embedding both concepts of interest for her/him and the correlations she/he perceives. For this purpose, we define a conceptual model, which we call concept-graph (c-graph, for short). Since ontologies are built by the agent on the basis of user Web navigation, we exploits results of [21] for providing the model with the capability of representing structural properties of information sources with different formats (i.e., HTML and XML documents, OEM graphs, E/R schemes) and dealing with inter-source heterogeneity. Moreover, c-graphs allow us also to give a temporary representation of the site navigation, embedding both the structure of the site and the dynamics of the user behavior expressing her/his preferences. Such a
1170
F. Buccafurri et al.
temporary representation is then used, at the end of the visit, for updating the user ontology. The paper is organized as follows. Related work are discussed in Section 2. Section 3 describes the concept-graph model, and, together with Section 4, illustrating how inter-source heterogeneity is solved, gives the basis for constructing the agent model. This is presented in Section 5, where the two tasks performed by the agent are described. For updating ontology (that is the second agent task) an integration process between two c-graph is needed: Section 6 illustrates technical details of c-graph integration. A brief section of conclusion and future work closes the paper.
2
Related Work
In the context of models dealing with heterogeneity we cite [12] describing the conceptual model associated with Lore which appears well suited for handling OEM graphs and XML documents. In [5] a graph-based conceptual model managing a series of different formats is also given. The Description Logic based model DLR, allowing a designer to represent and reason about the content of information sources, is described in [8]. A conceptual model capable of uniformly handling both relational and E/R and object oriented information sources is presented in [16] and is adopted also in [9]; it can be seen as a relational model augmented with object oriented features. In [3] a model, called ODLI 3 , is proposed for representing and handling relational and object oriented databases, as well as XML documents. [21] present a graph-based conceptual model, called SDR network; the authors provide rules for translating OEM graphs, XML documents and E/R schemes into this model. Recently, integration and cooperation tasks have been supported by some intelligent agent systems, for making them semiautomatic; as an example, in InfoSleuth [14] a network of cooperating agents is exploited for supporting information retrieval and processing in ever-changing networks of sources. Analogously, in [11], an agent-based integration approach, particularly suited for operating in an e-commerce environment, is described; in the same paper the authors provide a formal semantics for information integration able of coping with distributed, autonomous, partial and redundant information. [4] proposes an approach based on mobile agents in the context of an infrastructure for semi-automatic information integration of heterogeneous information sources. In [22], a structure of multiple shared agent ontologies for integrating heterogeneous information sources is presented. From the side of user profiles, a large number of proposals has been recently presented. In this context, the existing techniques are based on the computation of simple statistic indicators; conceptual models used in these cases are purely structural and, thus, not suited for dealing with semantic heterogeneity. Some example of these approaches are cookies [18] and Web logs [7,23]. Recently, in the agent community, a great attention has been posed on approaches for constructing the profiles of users on the basis of their past behavior on accessing various sites [20, 17,13,1]. Basically, these approaches work as follows: a user, possibly supported
A User Behavior-Based Agent for Improving Web Usage
1171
by an agent, chooses a site to visit. On the basis of the visited sites, another agent tries to guess the topics of her/his interest; this task can be carried out by exploiting various techniques such as text filtering, keyword searching and so on. Particularly interesting is the approach described in [19], where a way of modeling user interests is presented and the user model is exploited for effective information retrieval and filtering. A system based on this model “watches over the shoulders” of a user while she/he is surfing on the Web and the visited pages are characterized by some behavioral parameters such as the length of the pages and the time spent thereon. Although such parameters are very useful for representing the perception of the user w.r.t. the pages she/he has visited, they do not characterize the perception of the various concepts, i.e., it is impossible to derive the user behavior at a conceptual level. In [15] an approach that makes use of the embedded structural information of the documents frequently accessed by the user for deriving a personalized concept categorization (i.e., an ontology) and for identifying user preferences concerning document accesses, is presented. This approach allows to group documents having similar structure, and the behavior of the user in accessing documents is exploited as a feed-back for suitably ranking and sorting the various categories.
3
The Framework for Representing User Ontologies
In this section we describe the framework used for representing user ontologies. It is based on a graph of concepts with labeled arcs. Labels encode knowledge about both structure and semantics of visited sites and past user behavior. A synthetic function elaborating all components (both structural and behavioral) provides a user-behavior dependent metrics for quantifying closeness of concepts. 3.1
The Concept-Graph
In this section we present the concept-graph model utilized in the following for representing the user ontology. The atomic elements of our model are instances belonging to a given universe U. Instances are used for representing different kinds of objects, depending on the data format we refer (that is, they can be XML instances, or relational tuples, etc.). Since our model is Web oriented, we assume that instances are accessible by the user by means of Web documents. A concept is a pair c, name(c) where c is a set of instances and name(c) is a string (with prefixed maximum length). For example, an element of the DTD of a XML site represents a concept whose name is the name of the element and instances are the instances of this element. We denote by C = 2U ×Σ ∗ the set of all possible concepts named with strings in the alphabet Σ. Often, with a little abuse of notation, we denote a concept just as a set of instances, not by a pair (according to definition above). So, we could say that, a given set of instances s is a concept with name name(s).
1172
F. Buccafurri et al.
The concept-graph, which we next formally introduce, is defined for a given user and a given set of concepts with different names. It contains an explicit representation of membership of instances to such concepts, as well as the semantic relationships among them. Moreover, information about accesses of the user is included. From a physical point of view, user accesses occur only on instances since concepts do not correspond to physical objects. But we say that a user accesses a concept every time she/he accesses one of its instances. In order to discover useful information about the relationships among concepts, as well as the user preference for an instance w.r.t. the other ones of the same concept, we need to distinguish internal accesses from external ones. An access to an instance t of a concept s is internal if the user comes from another instance of the same concept s. A user may access an instance t of a concept s by an external access if she/he comes from an instance of a concept s distinct from s. We define now the concept-graph (c-graph, for short). Given a subset of concepts N ⊆ C such that no two concepts with same name occur in it and an instance may belong only to one concept., and a user u, a c-graph (for u on N ) is a rooted labeled direct graph C Graph(N, u) = N, A, where N is the set of nodes and A ⊆ N × N is the set of arcs. Informally, N represents the set of concepts of interest for the user u. Arcs encode semantic relationships among concepts. Their labels define a number of properties associated to relationships of C Graph(N, u) containing also the dependency of the model on the user u. More precisely, an arc (s, t) is provided with a label label(s, t) = dst , rst , hst , τst , where both dst and rst are real numbers ranging from 0 to 1, hst is a non negative integer, and τst is a non negative real number. The four label coefficients above encode different properties, and their definition, which we next provide, clarifies why our graph is directed. In particular: – dst is the (semantic) independence coefficient. It is inversely related to the contribution given by the concept t in characterizing the concept s. As an example, in an E/R scheme, for an attribute t of an entity s, dst will be smaller than ds t , where s is another entity related to s by a relationship. Analogously, in an XML document, a pair element, sub-element (s, t) will have an independence coefficient dst smaller than ds t , where s is another element which s refers to through an IDREF attribute. – rst is the (semantic) relevance coefficient, indicating the fraction of instances of the concept s whose complete definition requires at least one instance of the concept t. The relevance plays a role similar to the support defined for data mining association rules: it gives a measure (in terms of instances) of how much the concept t characterizes the concept s. Such a coefficient is necessary since c-graph is used for representing also semi-structured information. – hst is the hit coefficient, counting the number of hits which u carries out on t (i.e., on some instance of t) coming from s (i.e., coming from some instance of s). hst ti – τst is (no-idle) time coefficient, defined as i=1 qi , where ti the effective total time spent by u at the i-th hit for consulting the concept t (i.e., instances of
A User Behavior-Based Agent for Improving Web Usage
1173
t) coming from s (i.e., coming from some instance of s) and qi is the size of the relative accessed page.1 Similarly to the relationship between concepts, also the membership of instances to concepts is weighed by labeling. In particular, for each pair (s, t) such that s is a node of the graph and t is an instance belonging to s, we define label(s, t) = hst , τst , where hst , called hit coefficient, and τst , called (no-idle) time coefficient, are defined in obvious way, coherently with the corresponding coefficients presented above. The only difference here is that the instance which the user comes from is either another instance of the concept s or an instance of another concept. In other words, while hit and time coefficients for pairs of concepts regard only external accesses, the corresponding coefficients in case of pairs concept-instance sum both internal and external accesses to instances. For taking into account the locality principle, we assume that, periodically, hst and τst are suitably automatically reduced (if greater than 0) in order to gives less importance to old accesses. The c-graph is used for representing the user ontology. The construction and management of the user ontology is a task of our agent and will be explained in detail in Section 5. As we shall see, the agent first extracts from visited sources concepts and structural information (kept in independence and relevance coefficients). The user behavior is then monitored in order to capture user preferences: hit and time coefficients allow to keep memory of user actions. For extracting structural information, it is necessary to define how, given an information source, a static c-graph (that is, a c-graph with hit and time coefficients set to 0), can be automatically built. This is the first step which the agent performs at the beginning of each new visit in order to have a temporary c-graph on which the user behavior is marked up. At the end of the visit, such a temporary c-graph will be exploited for updating the user ontology. For facing the non-trivial problem of constructing the static c-graph of an information source (which may be represented by means of different data formats), we exploit results obtained in the field of Cooperative Information System, and, in particular, the techniques defined in [21]. In these papers, a conceptual model called SDR network is proposed which basically coincides with the notion of c-graph deprived of the instances associated to each node and the hit and time coefficients. Thus, for the construction of a static c-graph we follow an approach similar to that used for SDR networks [21]. The algorithm is rather technical and elaborated, therefore we do not report it in detail. We just note that such a computation depends on the format of the involved information sources. The reader may found in [6] the case of XML format as well as the translation rules from other kinds of information sources. We conclude this section by showing an example of a simple user ontology. Example 1. In Figure 1 a c-graph representing the ontology of a user, named John Smith, is reported. Labels of arcs are 4-tuples listing, in order, indepen1
Throughout the paper, for simplicity, we assume that all pages have size 1.
1174
F. Buccafurri et al.
Fig. 1. The ontology of John Smith
dence, relevance, hit and time coefficients. For simplicity, we do not represent instances. John Smith is a sommelier, very interested in wine as well as in optimal combination between food and wine. For simplicity, we assume that the navigation history of John Smith consists of just a number of visits to two XML sites, one regards wine, the other gastronomy. The DTD of the first site contains the elements oenology, wine and vines. The element oenology has a sub-element news, and such a sub-element has one IDREFS attribute. An instance of oenology collects general news about wine; such news may contain links to particular wines or to information about vines. An instance of wine is a particular wine, and it may contain links to external instances belonging to the second site concerning recipes – the structure of the second site will be explained in the following. The second site regards recipes, and its DTD contains the element recipes, which has the attributes ingredients, type, nationality and (preparation) time. The topology of the c-graph representing the Smith’s ontology (see Figure 1), reflects the structure of the visited documents. This is true also for structural coefficients, that is, independence and relevance coefficients. For instance, consider the arc (wine, recipes): the independence coefficient is 1 (that is, the maximum value). This is rather intuitive, since an IDREF attribute just represents a link between the two concepts but not the composition of one concept in terms of the other one, as, for instance it occurs for XML sub-elements. Conversely, for the arc (recipes, ingredients) the independence coefficient is 0, since ingredients is an attribute of recipes, and it gives an high contribution in characterizing the concept recipes. The relevance coefficient measures how much instances support relationships. For instance, for the arc (recipes, time), the relevance coefficient is 0.6, meaning that the 60% of recipes visited by Smith reports the preparation time. Behavioral coefficients (that is, hit and time coefficients), keep memory of actions performed by Smith. For instance, from the label of the arc (wine, recipes) it results that Smith has moved from an instance of wine to an instance of recipes for 100 many times, summarizing 800 seconds for consulting such instances of recipes. ✷
A User Behavior-Based Agent for Improving Web Usage
3.2
1175
Measuring Concept Closeness
The four coefficients composing labels of a c-graph need to be elaborated in order to become really interpretable. First examine independence and relevance coefficients. Consider given a pair of concepts s and t. As remarked above, they express structural semantic closeness. In particular, having (1 − dst ) and rst close to 1 means that t strongly characterizes s. rst has the role of support for such a relationship. Indeed, dealing with semi-structured information, it may happen that only a fraction of instances witnesses such a relationship. A reasonable way for merging the information arising from the two coefficients is using the relevance coefficient as a reducing factor of (1 − dst ). Thus, we define the function ψ(s, t), called s(tructural)-closeness as: ψ(s, t) = (1 − dst ) · rst for giving a synthetic measure of the structural dependence of the concept s from t. Also hit and time coefficients have to be elaborated in order to express, in a synthetic way, usable information about the user behavior. Consider given a pair (s, t) where s is a concept and t is either a concept or an instance of s. First we define the function θ, giving a measure of the “interest” the user u has for t when she/he accesses t through s. θ is defined as: θ(s, t) = hst + τqst where q is a suitably set parameter allowing us to modulate the importance of the effective access time w.r.t. the number of hits. Indeed, for high values of q the interest measures just how many times the user contacted t. On the contrary, a small q cancels the effect of contact actions. The value q could be set taking into account several variables, like the connection bit rate, the expertise of the user and so on. For instance, in a search strategy tree, an inexpert user typically stands in a fail node for a lot of time before realizing she/he is not interested in it. Thus, in this case, the effective time spent in each contact cannot give us information about the real interest of the user (i.e., q should be set to an high value). The contrary happens in the case of an expert user. Then, we define the function ρ, representing the preference the user u gives to t w.r.t. all the other concepts (in case t is a concept) or instances (in case t is an instance) reachable in just one step by s. The preference is computed as a fraction of the overall interest. More precisely, ρ is: ρ(s, t) =
θ(s,t) θ(s,t )
t ∈A(s)
where A(s) is either the set of nodes adjacent to s, in case t is a concept, or s itself, in case t is an instance. At this point we have two synthetic functions, that are ψ and ρ, the former encoding the structural closeness, the latter the user preference. In order to have a unique synthetic information summarizing both structural and behavioral components we define the function γ, measuring the “subjective” semantic closeness
1176
F. Buccafurri et al.
of two concepts, where by subjective we mean that the user preference function ρ is used for modulating structural semantic closeness. γ, called u-closeness, is defined as γ(s, t) = k · ψ(s, t) + (1 − k) · ρ(s, t) where 0 ≤ k ≤ 1 is a parameter modulating the importance of behavioral information versus the structural one (a small k reduces the importance of the structural component). Example 2. Consider the ontology of John Smith of Example 1. Again, consider the arc (wine, recipes). The independence coefficient of the arc is 1. As observed in Example 1, this correctly reflects the structure of the sites visited by Smith. As a consequence, the function ψ(wine, recipes), measuring how much the concept recipes structurally characterizes the concept wine, takes the value 0. However, under the perspective of Smith, who is interested in food and wine combination, the two concepts are strongly related each other and this is proved by the high values of behavioral coefficients of the arcs (wine, recipes). The function γ encodes this knowledge. Assuming the parameter k is set to 0.25 (thus, in a case in which the behavioral component is considered more important than the structural one), we obtain that γ(wine, recipes) = 0.75, denoting a high degree of semantic closeness. ✷ The function γ allows us to define the notion of neighborhood of a given concept s. Informally, it consists of the set of concepts that are sufficiently (i.e., up a suitable threshold) “close” (according to the function γ) to the concept s. This is done by extending the function γ to the paths of the c-graph. More formally, let s and t be a pair of nodes and π be a path in a c-graph C Graph(N, u) from s to t. The p-closeness of t w.r.t. s by π, denoted by γpath (s, t, π), is the sum of the γ values associated to the arcs of π. Let s and t be two nodes such that there exists a path in C Graph(N, u) from s to t. The closeness of t from s, denoted by Γ (s, t), is defined as min{γpath (s, t, π) | π is a path in C Graph(N, u) from s to t}. Given a c-graph C Graph(N, u), a positive integer number k and a concept t ∈ N , the k-neighborhood of t in C Graph(N, u) is the set of nodes {s ∈ N | Γ (s, t) ≤ k}. Example 3. Consider the c-graph of Example 1 and its concept wine. The 1neighborhood of wine is the set of concepts {recipes, time} since Γ (wine, recipes) = 0.5, Γ (wine, time) = 0.9. Moreover, the 2-neighborhood of wine is {recipes, time, ingredients, nationality, type} since Γ (wine, nationality) = 1.01, Γ (wine, type) = Γ (wine, ingredients) = 1.14. ✷
4
Solving Semantic Heterogeneity
In order to use the c-graph for representing the user ontology, and design an agent capable of exploiting such an ontology for supporting the user, we have to be able to deal with structural and semantic heterogeneity. Indeed, for instance,
A User Behavior-Based Agent for Improving Web Usage
1177
the agent must be able to discover that a given concept of the user ontology coincides with a concept of a visited site, even if they have a different name. In some cases, different names corresponding to two coinciding concepts are simply lexical synonymes and available dictionaries on the Web may be used for detecting them. But sometimes, synonymies can be detected only by means of techniques based on semantic approaches. Another possible case is that two concepts with the same name have actually a different meaning, that is, they are homonym. We have to discover also such cases, since, clearly, they may cause erroneous agent actions.
Fig. 2. The e-shop site
Example 4. Consider the ontology of John Smith introduced in Example 1. Suppose John Smith visits a XML e-shop site whose associated c-graph (built according to translation rules rules) is depicted in Figure 2. The site, among others, contains a category called food. On the basis of the interests of Smith, it would be desirable that the agent is able to conduct the user into the section of the site corresponding to food. But, in the Smith’s ontology, it appears neither a concept named as food nor some lexical synonym. Indeed, it appears only the concept recipes. We can argue that such a synonymy cannot be discovered only by means of dictionaries. This is a case of a semantic synonymy. ✷ The issue of detecting semantic heterogeneity has been deeply investigated in the field of Cooperative Information Systems [3,12,8]. Most of the approaches for solving synonymies and homonymies proceed by analyzing the context of concepts. Two concepts are detected as synonyms, if their contexts are sufficiently similar, while two concepts are homonymous if they have the same name but their contexts are sufficiently different. In our case, the semantics of a given concept c can be captured by analyzing concepts which are sufficiently close to c. Closeness, in this case, is only structural, and thus computed on the basis of independence and relevance coefficients (that is, assuming that c-graphs are static). For such a reason, the approach we follow for detecting synonyms and homonyms is similar to that proposed in [21]. The context of a concept is captured by adapting the notion of neighborhood defined in Section 3 to the static
1178
F. Buccafurri et al.
case: this is obtained simply by giving to the function ψ the role played by the function γ. We call this new notion structural neighborhood. Further, given two c-graphs B1 and B2 , for any pair of concepts s and t belonging to B1 and B2 , respectively, we determine a real coefficient, called similarity coefficient, ranging from 0 to 1, expressing how much concepts s and t are similar. Once two suitable thresholds for the similarity are dynamically computed , say th1 and th2 , two concepts s and t are synonymous if both (1) they have different name and (2) their similarity coefficient is greater than th1 . Conversely, s and t are homonyms if both (1) they have the same name and (2) their similar coefficient is smaller than th2 . The complete procedure is rather complex and, for the sake of presentation, we do not include it in the paper. The reader may refer to [6] for further details. Example 5. Consider the c-graph of Example 1 and the c-graph of Figure 2. Let denote by B1 be the first c-graph and by B2 the second one. Consider the concept recipes of B1 and the concept food of B2 . By applying translation rules, it can be verified that the similarity coefficient of the pair (food,recipes) is 0.74 showing that such a concepts are (semantic) synonymous (for any reasonable fixed threshold). ✷
5
The Agent Model
User activity is thought as an iterative process consisting of accesses to different sites. We design an agent for supporting such an activity based on handling a user ontology. The ontology represents the profile of the user in terms of preferences and semantic relationships learnt by monitoring the history of her/his activity. To recent history is given a more important role than the past (thanks to the automatic reduction of behavioral coefficients), in order to take into account the locality of user interests. The agent carries out two tasks: the first one is supporting the user in the navigation of each site, by providing her/him with a set of recommendations; the second one concerns with the ontology management (that is, its construction and updating). Ontology update concerns with new user actions: knowledge coming from new accesses to (possibly already exploited) information sources has to be incorporated into the ontology. This is also periodically pruned, in order to eliminate what can be considered just as noise. Pruning can be done on the basis of the function γ, by using a suitable thresholding. For simplicity, we have not described in this paper the pruning process (the reader may refer to [6] for details). The user ontology, denoted by O, is a c-graph, initially empty and updated after each new visit. Suppose now the user visits a new site, say IS. First of all, the c-graph S(IS), representing the structure of the site, is automatically built by the agent. Observe that this is a “static” c-graph in the sense that behavioral coefficients (that is, hit and time coefficients) are set to 0. From a practical point of view, the feasibility of such an initial step depends on the presence of
A User Behavior-Based Agent for Improving Web Usage
1179
some kind of cooperation between client and server sides. Cooperation could be implemented by using mobile agents, for instance, or agents working on both sides (client and server) and exchanging data. Of course, such design decisions could be adopted only after that a precise application setting (like, for instance, e-commerce) is chosen. However, we have intentionally presented the model in the most general context, with the purpose of concentrating on the description of its features. Thus, a deep analysis of the above considerations, as well as of other implementation issues is outside of the scope of this paper. Now we describe the two agent activities, that are supporting the user navigation and ontology management. 5.1
Supporting User Navigation
By exploiting S(IS) and the user ontology O, the user is supported during the visit by providing her/him with a set of recommended links to concepts (corresponding to collections of URLs) of the site. For each visited page, the user receives new recommendations. Such recommendations are obtained by identifying in S(IS) those concepts which belong also to the user ontology. Suggested links are obtained by considering relationships among such concepts occurring in the user ontology. Thus, recommendations take into account user preferences. As noted in the previous section, in order to discover concepts of S(IS) which belong also to O, semantic heterogeneity has to be solved. Indeed, a certain concept of S(IS) can appear in O as (either lexical or semantic) synonym (even not trivially lexical), and this can be detected by using the already described technique. Moreover, a concept of S(IS) can be an homonym of a concept of O, and thus could be erroneous considered belonging to O. Once synonymies and homonymies have been detected and homonymous nodes in S(IS) are renamed, the agent is able to select the concepts of the site (i.e., occurring in S(IS)) of interest for the user, by finding in S(IS) the concepts such that there exists at least a synonymous concept appearing in O (hereafter in this section, we say that two concepts are synonymous also if they have the same name). Let CI be such a set of concepts. For each concept c of CI the agent builds the collection of the URLs of all the pages containing instances of c in the site IS. At this point, the agent builds a graph, say R(IS, O), with set CI of nodes. For each pair s and t of concepts in CI an arc (s, t) occurs in R(IS, O) if there exists in O an arc (s , t ) such that s and s are synonymous and t and t are synonymous. This graph, called the recommendation graph, is exploited for supporting the user visit in the following way: Initially the agent suggests to the user the page collections associated to CI , as starting points of the visit. Then, when the user visits a page corresponding to an instance of some concept of CI , say c, the agent extracts from R(IS, O) all adjacent concepts, and, for each of them, a link to the corresponding collection is suggested to the user. Observe that, in a suggested URL collection associated to a concept c of O, the agent may identify the URLs leading to instances detected as sufficiently similar (by text matching techniques) with the k−first instances of c ordered by decreasing user preference, determined by means of the function ρ (see Section 3), where k
1180
F. Buccafurri et al.
is a given fixed parameter. Note that suggested links might not be actual links of the site, but they could only derive from the past behavior of the user, encoded into her/his ontology, and representing the semantic closeness among concepts she/he perceives. Thus, the user can be driven by the agent through paths not directly provided by the site but (probably) appearing as “natural” for the user. Example 6. Consider the user John Smith of Example 1 having the ontology O reported in Figure 1. Suppose the user visits the e-shop site, called here IS, of Example 2. Again, assume that Smith prefers red wines and, thus, he recently and many times accessed instances of wine regarding red wines. Consequently, Smith frequently consulted meat based recipes. Thus, it results that the 10-first instances of wine ordered by preference contain the word “red” and the 10-first instances of recipes ordered by preference contain the word “meat”. Denote by S(IS) the c-graph, reported in Figure 2 of the site. In this example we show how the agent supports the navigation of Smith in the site IS. First, semantic heterogeneity is solved: The agent, by computing similarity coefficients, discovers that, besides lexical synonyms, concepts recipes of O and food of S(IS) can be considered as synonyms (as shown in Example 5). Thus, the set CI of concepts of S(IS) appearing also in O is {wine, f ood}. Therefore, for the concept wine, the agent builds a collection of URLs of pages containing the instances of this concept in IS, and carries out the same task for the concept food. At this point the agent builds the recommendation graph R(O, IS), which has as nodes the concepts of CI (to which the corresponding URL collections are associated). Arcs are copied from O, by considering corresponding nodes. The graph R(O, IS) so obtained is composed of just two nodes, namely wine and food and there is only one arc directed from wine to food. As a first suggestion the agent submits to the user the list of URL collections associated to concepts wine and food. Due to the preferences of Smith, among all the URL associated to wine, the agent selects red wines as favourite (because of the occurrence of the word “red” in the favourite instance of Smith). Suppose now Smith chooses to visit the page Brunello di Montalcino (that is a famous Italian red wine) following one of the links appearing in the URL collection associated to the concept wine, suggested by the agent. On this page, the agent provides the user with a set of suggested links extracted from the recommendation graph R(O, IS) by considering all concepts adjacent to wine. In this case, there is only one concept adjacent to wine, that is food. Therefore, the agent shows to the user a link to the URL collection associated to the concept food. Moreover, exploiting the most user favourite instances of recipes (that is, the concept corresponding to food in the user ontology), the agent selects, among the above URLs, those corresponding to the instances containing the word “meat”. This way, the user, who is expert in combination between wine and food, is correctly driven to the section of the site regarding meat dishes, as the chosen wine is red, even if the site is not provided with such a direct link (see Figure 2) (actually the two sections of the site are quite far each other). We remark that such a useful recommendation can be generated only by exploiting the history of the user behavior (showing that the user sees a strong relation between concepts wine
A User Behavior-Based Agent for Improving Web Usage
1181
and recipes), embedded in his ontology, and, importantly, the capability of the model of dealing with semantic heterogeneity, allowing the agent to discover the semantic correspondence between the concept recipes of the user ontology and the concept food of the site. Thus, this example shows how the (apparently too complex) features of the model, are actually needed for effectively dealing with practical situations. ✷ 5.2
Ontology Management
The second task of the agent, transparent for the user, concerns with the ontology management. Each new visit, updates the user ontology, in such a way that the behavior history of the user is recorded in it. Clearly, for privacy reasons, the user is allowed to disable the visit monitoring. For updating the user ontology, the user dynamically builds a c-graph B(IS) during the visit and, at the end, incorporates the knowledge encoded in B(IS) into the user ontology O by integrating the two c-graphs. At the beginning of the visit, B(IS) is empty. During the visit B(IS) changes, as all concepts accessed by u, as well as their neighborhoods in S(IS), are recorded in B(IS) by inserting new nodes and new arcs. Moreover, hit and time coefficients are recomputed at each step according to their definition. Independence and relevance coefficients are taken from corresponding arcs in S(IS). At the end of the visit, B(IS) is a representation of the portion of IS, visited by u in this session, containing also information about the user behavior, thanks to hit and time coefficients. By considering also neighborhoods (in S(IS)) of visited concepts, the agent autonomously discovers potentially interesting concepts for the user and includes them in B(IS). More formally, given a concept s, we denote by nbh(s) its kneighborhood. In obvious way, we define as arcs(nbh(s)) the set of arcs induced by the k-neighborhood of a concept s. Thus, for each access a to an instance t of a concept s: – if a is the first access of the visit, then the node s is inserted into B(IS) with only the instance t belonging to it. hst and τst are updated (in this case hst is set to 1). – if s is accessed for the first time, a is not the first access, and, thus, the user comes from an instance of another concept, say s , then the node s is inserted into B(IS) with only the instance t belonging to it, and an arc (s , s) is also added. hst and hs s are set to 1, τst and τs s are set to the same measured value. Independence and relevance coefficients of the arc (s , s) are set to the corresponding values occurring in S(IS). – if a is an external access coming from a concept s but s was already accessed, then the arc (s , s) is added in B(IS) if not already present. hst and hs s are increased of 1; τst and τs s are equally updated. – if a is an internal access and it is not the first one, then hst is increased and τst is updated. – for every kind of access a, nodes of nbh(s) and arcs of arcs(nbh(s)) are inserted into B(IS), if not already occurring in it. Independence and relevance
1182
F. Buccafurri et al.
coefficients of inserted arcs are taken from the corresponding arcs of S(IS). Hit and time coefficients are set to 0. At this point, before handling the choice of a new information source, the knowledge encoded in B(IS) has to be incorporated into the ontology O, updating it. This is done by integrating the two c-graphs. Integration is described in Section 6. After the update of O, B(IS) is not useful anymore, since the memory of such a visit of IS is kept into the ontology. Now, O must be pruned, in order to eliminate all concepts and instances with low interest for the user u. O now can be exploited for supporting next user visits.
6
Integration of Two C-Graphs
In this section we describe how two c-graphs are merged into a global c-graph. Informally, this merge consists of a “union” of the two c-graphs executed after that synonymies and homonymies are eliminated: by computing the similarity coefficients between all possible pairs of nodes (a node belonging to the first cgraph, the other belonging to the second c-graph), synonyms and homonyms are first detected, and then, synomnym nodes are renamed giving the same name and homonym nodes are renamed in such a way that they assume distinct names. The union of the two “normalized” c-graphs is done by suitably averaging labels of arcs. Let B1 = N1 , A1 and B2 = N2 , A2 be two c-graphs. The union of B1 and B2 , denoted by U (B1 , B2 ), is a directed labeled graph with set of nodes: N = {s ∈ N1 | ∃t ∈ N2 s.t. name(s) = name(t)}∪ {s ∈ N2 | ∃t ∈ N1 s.t. name(s) = name(t)}∪ {x ∈ C | x = s ∪ t, s ∈ N1 ∧ t ∈ N2 ∧ name(s) = name(t) = name(x)} and set of arcs A = {(s, t) | ∃(s1 , t1 ) ∈ A1 s.t. name(s) = name(s1 ) ∧ name(t) = name(t1 )}∪ {(s, t) | ∃(s1 , t1 ) ∈ A2 s.t. name(s) = name(s1 ) ∧ name(t) = name(t1 )} Nodes are obtained by copying nodes of each c-graph with name not appearing in the other c-graph and by merging nodes with common name into a node with equal name including all the instances of the original nodes. Arcs are obtained in obvious way. Now we define how labels are determined. Let (s, t) be an arc belonging to A. Its label is the 4-tuple dst , rst , hst , τst defined as follows: (a) dst = ds1 t1 , rst = rs1 t1 , hst = hs1 t1 , τst = τs1 t1 if ∃(s1 , t1 ) ∈ Ai s.t. name(s) = name(s1 ) ∧ name(t) = name(t1 )∧ ∃s2 ∈ Nj s.t. name(s) = name(s2 ). (b) dst = ds1 t1 , rst = f (rs1 t1 , |s2 |), hst = hs1 t1 , τst = τs1 t1 if ∃(s1 , t1 ) ∈ Ai s.t. name(s) = name(s1 ) ∧ name(t) = name(t1 ) ∧ ∃s2 ∈ Nj s.t. name(s) = name(s2 )∧ ∃(s2 , t2 ) ∈ Aj s.t. name(t1 ) = name(t2 ).
A User Behavior-Based Agent for Improving Web Usage |s |·d
1183
+|s |·d
(c) dst = 1 s|s1 t11|+|s22 | s2 t2 , rst = f (rs1 t1 , |s2 |), hst = hs1 t1 + hs2 t2 , τst = τs1 t1 + τs2 t2 if ∃(s1 , t1 ) ∈ Ai s.t. name(s) = name(s1 ) ∧ name(t) = name(t1 ) ∧ ∃s2 ∈ Nj s.t. name(s) = name(s2 ) ∧ ∃(s2 , t2 ) ∈ Aj s.t. name(t1 ) = name(t2 ). where i = 1, 2, j = 1, 2, i = j, and by f (rs1 t1 , |s2 |) we denote the function for re-computing the relevance coefficient of the arc (s1 , t1 ) when the cardinality of the node s1 is increased by |s2 | (recall that the relevance coefficient depends on the number of instances of the source node). With a little abuse, we assume that U (B1 , B2 ) is a c-graph. Note that this is not necessarily true since U (B1 , B2 ) might not to be rooted. However, in this case, a dummy root can be added to make U (B1 , B2 ) a c-graph. In absence of lexical and semantic heterogeneity, the integration of two cgraphs would correspond to the union defined above. Unfortunately, the correspondence between names of concepts on which the union is based does not take into account the possibility of occurrence of synonymies and homonymies. Thus, before applying the union operation, we have to solve such a heterogeneity as shown in Section 4. Let T = {(s, t) |s ∈ N1 ∧ t ∈ N2 ∧ name(s) = name(t) ∧ sim(s, t) ≥ th1 }, where, sim(s, t) is the similarity coefficient of the pair (s, t) and th1 is the threshold computed for detecting synonymies (see Section 4). T represents the set of all pairs of candidate synonyms. Any subset T¯ of T such that both (1) (s, t) ∈ T¯ implies that there is no (s, t ) ∈ T¯ and t = t and (2) (s, t) ∈ T¯ implies that there is no (s , t) ∈ T¯ and s = s , is an admissible synonym set, that is a set of pairs of candidate synonyms such that a concept of N1 (N2 , resp.) is involved at most once. We require this last condition since we intend to solve synonyms by renaming candidates pairs of concepts assigning the same name. Thus, the above condition guarantees that, after this operation, there are no two nodes in B1 (in B2 , resp.) with the same name (according to definition of c-graph). Among all admissible synonym sets, we choose a set with maximum value of the global similarity, where, for global similarity of an admissible set T¯ we mean the sum of all the similarity coefficients of the pairs belonging to T¯. Such a set, called S, represents the set of the detected synonyms. Now, the set of homonyms is determined. It results O = {(s, t) | s ∈ N1 ∧ t ∈ N2 ∧ name(s) = name(t) ∧ sim(s, t) ≤ th2 }. Note that, a concept s of N1 (N2 , resp.) can appear in O once at most. At this point S and O can be used for solving synonymies and homonymies in B1 and B2 . In particular, for each pair (s, t) ∈ S, the concept t is renamed in such a way that name(t) = name(s), and, for each pair (s, t) ∈ O, the concept t is ¯1 and B ¯2 the renamed in such a way that name(t) = name(s). Let denote by B two c-graphs so obtained. The integration of B1 and B2 is then obtained by ¯2 ). ¯1 , B computing the c-graph U (B
7
Conclusion and Future Work
This paper deals with the problem of modeling Web users by means of personal ontologies. Our approach is based on a semantic representation of the user activ-
1184
F. Buccafurri et al.
ity, which takes into account both the structure of visited sites and the way the user navigates them. The model appears promising, mainly because of its capability of dealing with semantic inconsistency occurring in the Web. This is shown by examples in the paper, by simulating the execution of an agent-based application exploiting the user ontology for supporting the user navigation. Thus, the paper gives the formal framework as well as the design guidelines for building an agent system and providing experiments as further validation of the model. This is matter of current and future work.
References 1. M. Balabanovic, Y. Shoham. Fab: Content-based, collaborative recommendation. Communications of the ACM, 40(3), 4(3), 1997. 2. C. Batini, M. Lenzerini. A methodology for data schema integration in the entity relationship model. IEEE Trans. on Software Eng., 10(6):650–664, 1984. 3. S. Bergamaschi, S. Castano, M. Vincini, D. Beneventano. Semantic integration and query of heterogeneous information sources. Data & Knowledge Eng., 36(3):215– 249, 2001. 4. S. Bergamaschi, G.Cabri, F. Guerra, L. Leonardi, M. Vincini, and F. Zambonelli. Supporting information integration with autonomous agents. In International Workshop on Cooperative Information Agents (CIA’01), pages 88–99, Modena, Italy, 2001. Lecture Notes in Artificial Intelligence. 5. P. Buneman, S. Davidson, M. Fernandez, and D. Suciu. Adding structure to unstructured data. In Proc. of International Conference on Database Theory (ICDT’97), pages 336–350, Delphi, Greece, 1997. Lecture Notes in Computer Science, Springer-Verlag. 6. F. Buccafurri, G. Lax, D. Rosaci, D. Ursino. The c-graph model. DIMET Techn. Report 2002, available from the authors. 7. A.G. Buchner and M.D. Mulvenna. Discovering internet marketing intelligence through online analytical web usage mining. SIGMOD Record, 27(4):54–61, 1998. 8. D. Calvanese, G. De Giacomo, M. Lenzerini, D. Nardi, R. Rosati. Description logic framework for information integration. In Proc. of International Conference on Principles of Knowledge Representation and Reasoning (KR’98), pages 2–13, Trento, Italy, 1998. Morgan Kaufman. 9. S. Castano, V. De Antonellis, and S. De Capitani di Vimercati. Global viewing of heterogeneous data sources. Transactions on Data and Knowledge Engineering, 13(2), 2001. 10. A. Doan, P. Domingos, and A. Halevy. Reconciling schemas of disparate data sources: A machine-learning approach. In Proc. of International Conference on Management of Data (SIGMOD 2001), Santa Barbara, California, USA, 2001. ACM Press. 11. C. Ghidini, L. Serafini. Information integration for electronic commerce. In Proc. of International Workshop on Agent Mediated Electronic Trading 1998 (AMET’98), pages 189–206, Minneapolis, USA, 1998. Lecture Notes in Artificial Intelligence, Springer Verlag. 12. R. Goldman, J. McHugh, J. Widom. From semistructured data to XML: Migrating the lore data model and query languages. In Proc. of International Workshop on the Web and Databases (WebDB’99), pages 25–30, Philadelphia, USA, 1999.
A User Behavior-Based Agent for Improving Web Usage
1185
13. T. Joachims, D. Freitag, T.M. Mitchell. Web watcher: A tour guide for the world wide web. In Proc. of International Joint Conference on Artificial Intelligence (IJCAI 97), pages Vol. 1, 770–777, Montreal, Canada, 1997. Morgan Kaufmann. 14. R. J. Bayardo Jr., B. Bohrer, R. S. Brice, A. Cichocki, J. Fowler, A. Helal, V. Kashyap, T. Ksiezyk, G. Martin, M. H. Nodine, M. Rashid, M. Rusinkiewicz, R. Shea, C. Unnikrishnan, A. Unruh, D. Woelk. Infosleuth: Semantic integration of information in open and dynamic environments. In Proc. of International Conference on Management of Data (SIGMOD’97), 195–206, Tucson, USA, 1997. ACM Press. 15. S. Kim, W. Hall, A. Keane. Using document structures for personal ontologies and user modeling. In User Modeling 2001: 8th International Conference LNCS 2109, pages 240–242, Sonthofen, Germany, 2001. Springer-Verlag. 16. A. Levy, A. Rajaraman, and J. Ordille. Querying heterogeneous information sources using source descriptions. In Proc. of International Conference on Very Large Data Bases (VLDB’96), pages 251–262, Bombay, India, 1996. Morgan Kaufmann. 17. H. Lieberman. Letizia: An agent that assists web browsing. In Proc. of International Joint Conference on Artificial Intelligence (IJCAI 95), pages Vol. 1, 924–929, Montreal, Quebec, Canada, 1995. Morgan Kaufmann. 18. J.S. Park and R.S. Sandhu. Secure cookies on the web. IEEE Internet Computing, 4(4):36–44, 2000. 19. A. Pretschner, S. Gauch. Ontology based personalized search. In IEEE International Conference on Tools with Artificial Intelligence, pages 391–398, Chicago, 1999. 20. G. Somlo, A. E. Howe. Incremental clustering for profile maintenance in information gathering web agents. In Proc. of International Conference on Autonomous Agents 2001, pages 262–269, Montreal, Canada, 2001. ACM Press. 21. G. Terracina, D. Ursino. Deriving synonymies and homonymies of object classes in semi-structured information sources. In Proc. of International Conference on Management of Data (COMAD 2000), pages 21–32, Pune, India, 2000. McGraw Hill. 22. P.R.S. Visser, V.A.M. Tamma. An experience with ontology-based agent clustering. In IJCAI-99 Workshop on Ontologies and Problem-Solving Methods (KRR5), pages 12–1–12–13, Stockolm, Sweden, 1999. 23. O. Zaiane, M. Xin, and J. Han. Discovering web access patterns and trends by applying olap and data mining technology on web logs. In IEEE Advances in Digital Libreries Conference (ADL’98), pages 19–29, Santa Barbara, California, USA, 1998. IEEE Computer Society Press.