Contextualization and Personalization of Queries to Knowledge Bases Using Spreading Activation

Ana B. Pelegrina¹, Maria J. Martin-Bautista², and Pamela Faber¹

¹ Department of Translation and Interpreting, University of Granada
{abpelegrina,pfaber}@ugr.es
² Department of Computer Science and Artificial Intelligence, University of Granada
[email protected]
Abstract. Most taxonomies and thesauri offer their users a huge amount of structured data. However, this volume of data is often excessive and thus does not fulfill the needs of users who are trying to find specific information related to a certain concept. While there are techniques that may partially alleviate this problem (e.g. visual representation of the data), some of the effects of the information overload persist. This paper proposes a four-step mechanism for personalization and knowledge extraction, derived from the information about users’ activities stored in their profiles. More precisely, the system extracts contextualization from the users’ profiles by using a spreading activation algorithm. The preliminary results of this approach are presented in this paper.
1 Introduction
In the world today, storage, processing and networking technologies have advanced significantly. Users thus have at their disposal huge amounts of structured data regarding almost any topic or field of knowledge. Such information repositories are known as Knowledge Bases (henceforth KBs) and often take the form of thesauri, taxonomies or ontologies. Well-known examples of KBs include BabelNet [1], DBPedia [2], WordNet [3,4], and EcoLexicon [5]. However, one of the constraining factors when working with certain repositories is that users may find it difficult to retrieve the information that they are seeking from the vast amount of data available. Therefore, it is necessary to provide them with effective tools to browse and query the data.

Software personalization is a commonly used mechanism to adapt content to user needs and goals. The automated personalization of software systems is not a trivial task and requires the use of one or more techniques to capture the interests of potential users and to translate these preferences into the software. This can be achieved with the use of different techniques such as clustering [6,7,8], fuzzy logic [9,10], data mining [11,10,12,13], and spreading activation [14,15].

Spreading activation is used as a search method for associative networks. Although its roots lie in psychology, semantic processing and linguistics [16,17], it has been rapidly adopted in the computer science field as a way to simulate a ‘human-like’ search for information or documents (see Sect. 2 and Sect. 3 for a detailed description of spreading activation and its applications).

H.L. Larsen et al. (Eds.): FQAS 2013, LNAI 8132, pp. 671–682, 2013. © Springer-Verlag Berlin Heidelberg 2013

A simple way to offer users a clear understanding of the data and its structure is to provide them with a visual representation of its content. Nevertheless, given the size of such databases, this visualization often includes too much information to be processed at one glance. This could confuse the user and hinder his/her search efforts. A possible solution for this problem is to personalize the visualization and show only the entities that are relevant to the user’s preferences.

Additionally, a user who is accessing and querying the data may possess a certain degree of specialized knowledge. This can be reflected in the user’s interactions with the system (e.g. a geologist will query the system with related terms within the domain of expertise, thus reflecting a knowledge of geology). Generally speaking, the explicit elicitation of this kind of knowledge requires the effort and time of experts, which may be costly. Alternatively, this knowledge could be implicitly extracted and added to the knowledge base without interfering with the user’s activities. This would have the advantage of bypassing expert consultation and avoiding explicit knowledge elicitation.

In this paper, we propose the use of spreading activation as a means to customize the search tools and visualization of knowledge bases. Furthermore, our objective is to combine the information provided by the users through their interactions with the system and the results given by the spreading activation algorithm to extract implicit expert knowledge from the users. The result is a simple, non-intrusive, and implicit mechanism to extract contextualization from users and include it in the user profiles.
This contextualization may later be used to provide users with a recommender system [18] that guides non-experts who access the system. This concept is applied to and exemplified in the EcoLexicon knowledge base.

This paper is organized as follows: In Sect. 2, we provide background on spreading activation. Section 3 discusses related work and gives an overview of our proposal. The proposed approach and its implementation are described in Sect. 4. The preliminary results of the application of the technique over a subset of the knowledge base are shown in Sect. 5. Finally, conclusions and future work are presented in Sect. 6.
2 Spreading Activation
The spreading activation theory attempts to model how humans represent and retrieve their knowledge. This theory claims that knowledge in the human mind is represented as nodes and links between nodes in the manner of a semantic network of concepts. It also proposes that the human recall process starts with the activation of a set of concepts (or nodes), which spreads to the neighboring nodes in a series of pulses in a decreasing gradient [17].

For spreading activation, each node in the network has an activation value and each edge has a weight. These can be initialized to values that represent certain properties of the information in the network. For instance, the initial weight of the edges can be set to the strength of the relationship between the connected nodes. The search process begins with a set of source nodes (e.g. the search terms), which are updated to a new activation value. This activation spreads through the nodes that are directly connected to the initial nodes in a series of pulses (or iterations). For each activated node, the activation value is calculated by using (1) and (2) [19]. More specifically, the input value $I_j$ for the node $j$ is computed by adding the output values ($O_i$) of the nodes directly connected to it, weighted by the weights $w_{i,j}$ of the links between those nodes. The output, or activation, value $O_j$ is calculated by applying the activation function, $f$, to the input value $I_j$. The process finishes once one of the stop conditions (e.g., number of iterations) has been reached.

$$I_j = \sum_i O_i \, w_{i,j} \tag{1}$$

$$O_j = f(I_j) \tag{2}$$
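As a minimal sketch of a single pulse, the following Python fragment applies (1) and then (2) to every node that has incoming links. The function name `pulse`, the node labels, the edge weights and the threshold-style activation function are illustrative assumptions and are not taken from the system described in this paper.

```python
def pulse(outputs, weights, f):
    """One pulse: Eq. (1) to accumulate inputs, then Eq. (2) to get outputs.

    outputs: dict node -> current output value O_i
    weights: dict (i, j) -> weight w_ij of the link from i to j
    f:       activation function applied to each input value I_j
    """
    inputs = {}
    for (i, j), w in weights.items():
        # Eq. (1): I_j is the weighted sum of the outputs of connected nodes
        inputs[j] = inputs.get(j, 0.0) + outputs.get(i, 0.0) * w
    # Eq. (2): O_j = f(I_j)
    return {j: f(value) for j, value in inputs.items()}


def threshold(value, cutoff=0.3):
    """Hypothetical activation function: pass the value only above a cutoff."""
    return value if value >= cutoff else 0.0


# Toy network with made-up weights (not EcoLexicon data)
weights = {("n1", "n2"): 0.5, ("n1", "n3"): 0.2, ("n2", "n3"): 0.4}
outputs = {"n1": 1.0, "n2": 0.5}
new_outputs = pulse(outputs, weights, threshold)
print(new_outputs)
```

In this toy case, n3 receives input from both n1 and n2, which shows how the sum in (1) lets several weak links jointly push a node over the threshold.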
This technique can be constrained by a set of restrictions (briefly outlined below) that limit the spreading activation process in order to obtain results of relevance. These restrictions thus compensate for some of the disadvantages of pure spreading activation (e.g. saturation of the network produced by the activation of all the nodes in the network) [19].

– Distance: the spreading process should stop when it reaches a node whose distance from the initial nodes is greater than a predetermined value.
– Path: activation should spread following preferred paths. This can be achieved by adding weight to the relationships between nodes.
– Activation: a threshold function can be used to restrict the spreading of the activation of the nodes.
– Fan-out: the spreading process should stop when it reaches a node that is connected to a large number of nodes.

Furthermore, more restrictions can be added to the models, such as the number of activated nodes, the number of pulses or the execution time [20], in order to adapt spreading activation to particular factors that might arise when applying this technique to specific problems.
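The distance, activation and fan-out restrictions listed above can be read as a single check performed before a node is allowed to pass activation on. The sketch below is only one possible reading: the function name `may_spread` and the limit values are hypothetical, and the path restriction is not shown because it is expressed through the edge weights rather than through a stop test.

```python
def may_spread(distance, activation, degree,
               max_distance=3, threshold=0.05, max_fanout=20):
    """Return True if a node may keep spreading activation.

    max_distance, threshold and max_fanout are illustrative limits,
    not values prescribed by the constrained spreading activation model.
    """
    if distance > max_distance:   # distance constraint
        return False
    if activation < threshold:    # activation constraint
        return False
    if degree > max_fanout:       # fan-out constraint
        return False
    return True


print(may_spread(distance=1, activation=0.40, degree=4))   # True
print(may_spread(distance=5, activation=0.40, degree=4))   # False
```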
3 Related Work
Spreading activation is widely applied in a broad variety of fields in Computer Science, such as information and document retrieval [19,21], computing metrics for information retrieval [22], Web search [23,14,24], collaborative recommendation systems [25,26], data analysis [27], and data visualization [28,29]. This widespread application can be explained by the following: (i) the simplicity and customizability of the algorithm; (ii) the widespread use of the data
structure required by the technique, which makes it easy to apply spreading activation to existing network-based knowledge bases; (iii) the ‘human-like’ results of this technique, which, though not optimal, may be viewed by users as the most relevant.

Spreading activation has been proposed as a way to extract user preferences or interests in a number of research studies, such as [30,15,14,24,25]. However, none of them focuses on gaining an understanding of the knowledge itself. The goal of this work is not only to personalize the search results but also to provide a way to extract the domain knowledge of the users and apply it in order to attain a deeper understanding of the underlying knowledge structure (e.g., the domains into which it can be divided).

The application of spreading activation to data visualizations has been proposed in [28] and [29]. VisLink [28] is a method to interactively explore visualizations and the relationships between them. This method includes spreading activation as a means to add analytical power to VisLink. Kuß et al. [29] propose a method based on spreading activation to generate visualizations for a neuroanatomical atlas. Unlike our proposal, neither of these works takes advantage of the results of past spreading activation executions in order to improve the personalization of the systems (i.e., they only use spreading activation as a search algorithm with no memory of past searches).
4 Proposal
This research has two objectives. The first is to use the spreading activation technique to personalize search, browsing, and visualization tools for large amounts of structured data. The second is to combine the information provided by the users through their interactions with the system and the results provided by the spreading activation algorithm, and employ them to extract the implicit expert knowledge of the users. For example, we can use the elements searched by the user, the domains that the user is skilled in (e.g., geology) and the search result to automatically classify the concepts included in the knowledge base into different domains. This provides an implicit mechanism to extract contextualization from users.

The process is composed of four steps, as shown in Fig. 1:

1. User profile generation. Based on each user’s search queries and data browsing, his/her user profile is constructed (see Sect. 4.1).
2. Contextual information extraction. Based on the user profile information, a spreading activation algorithm is applied to the data in order to extract the contextualizations. This paper is focused on this step.
3. Integration of the contextualization in the knowledge base. The information regarding the contextualization is incorporated into the knowledge base.
4. Application of the contextualization. When searching or visualizing the data, the contextualization obtained and processed in steps 2 and 3 is employed to customize the searching, browsing and visualization. The new user activities generate new information, and the process then returns to step 1.
[Figure: Step 1. User profile generation (user profile) → Step 2. Contextualization extraction by spreading activation, SA (contexts) → Step 3. Integration (contexts, knowledge base) → Step 4. Contextualization application]
Fig. 1. The steps involved in the proposed mechanism
4.1 User Profiles
The user profiles store user preferences and provide the basis for the personalization of the search and for browsing the data and the recommender system. A variation of the simple profile presented in [31] will be used: “A group of concepts extracted from the user’s activity and deemed interesting by him/her”. These concepts are extracted from the user’s interaction with the system (e.g., searching and browsing activities) and stored in the user profiles as a list of items included in the system (e.g. [water, concrete, cement]).

When a user first starts accessing the system, the contextualization will only be provided by the constrained spreading activation algorithm and the system will not supply the new user with any customization. Once the initial information about the user has been collected, the system will provide him/her with personalization as well as contextualization.

4.2 Contextual Information Extraction
In order to extract the contextual information implicitly stored in the user profiles, a spreading activation algorithm is applied. This algorithm will activate the concepts that are strongly related to the concepts included in the user profile, defining a context based on the specialized knowledge of the user. In the future, the collected information may be put to use in a recommender system. This system will serve as a guide to new users of the system, and will suggest terms, relations and searches based on their preferences, their interaction with the system, and the extracted experience of past users.

Spreading Activation Implementation. The spreading activation technique has been implemented by using Algorithm 1. The inputs of the algorithm are
the entities included in the user profile and the output is a vector of activation values, one for each node in the network. The parameters in the algorithm and their values are given below:
Algorithm 1. Spreading Activation Algorithm

procedure SpreadingActivation(NODES, VERTEX, ORIGIN, F, D)
    toFire ← ORIGIN
    while toFire ≠ ∅ do
        for all i ∈ toFire do
            toFire.remove(i)
            for all j ∈ adjacents(i) do
                A_j, O_j ← A_j + D O_i w_{i,j}
                if O_j > F then
                    toFire.add(j)
                else
                    O_j ← 0
                end if
            end for
        end for
    end while
end procedure
– Activation threshold (F): 0.05 and decay factor (D): 0.75. Both parameters have been adjusted to the conditions of the experimental setup presented in Sect. 5, namely a few starting nodes and a small graph.
– Activation function (see (3) and (4)): First, the activation value ($A_j$) for the node $j$ is calculated using its previous activation value ($A_{j-1}$), the output value of node $i$ ($O_i$), the weight of the edge between nodes $i$ and $j$ ($w_{i,j}$), and the decay factor ($D$). Subsequently, the output value ($O_j$) to be used in the next iteration is computed by using the threshold constraint ($F$).
– Edge weight: in order to reduce the influence of highly connected nodes (see Sect. 2), the weight of an edge is calculated following (5), where $w_{i,j}$ is the weight of the edge; $\mathrm{degree}(i)$ is the number of edges that leave the node $i$; and $\alpha$ is a constant that limits the fan-out effect.

$$A_j = A_{j-1} + O_i D w_{i,j} \tag{3}$$

$$O_j = \begin{cases} 0 & \text{if } A_j < F \\ A_j & \text{otherwise} \end{cases} \tag{4}$$

$$w_{i,j} = \begin{cases} 1 & \text{if } \mathrm{degree}(i) = 1 \\ \dfrac{1}{\alpha \, \mathrm{degree}(i)} & \text{if } \mathrm{degree}(i) > 1 \end{cases} \tag{5}$$
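To make the interplay of Algorithm 1 with (3), (4) and (5) concrete, the following Python fragment gives one possible pulse-based reading on an undirected graph stored as an adjacency dictionary. It is a sketch under several assumptions that the text does not state: start nodes begin with activation 1.0, each node fires at most once (which guarantees termination on cyclic graphs), activation is capped at 1.0, and `alpha=1.0` is a placeholder because no value of α is reported. The helper names and the toy graph are hypothetical.

```python
def edge_weight(adjacency, i, alpha):
    """Eq. (5): weight applied to activation leaving node i."""
    degree = len(adjacency[i])
    return 1.0 if degree <= 1 else 1.0 / (alpha * degree)


def spreading_activation(adjacency, origin, F=0.05, D=0.75, alpha=1.0):
    """Pulse-based sketch of Algorithm 1.

    Assumptions (not from the paper): origin nodes start at 1.0, every node
    fires at most once, and activation values are capped at 1.0.
    """
    A = {node: 0.0 for node in adjacency}   # activation values A_j
    O = dict(A)                             # output values O_j
    for node in origin:
        A[node] = O[node] = 1.0
    to_fire, fired = list(origin), set(origin)
    while to_fire:
        next_pulse = []
        for i in to_fire:
            w = edge_weight(adjacency, i, alpha)
            for j in adjacency[i]:
                A[j] = min(1.0, A[j] + D * O[i] * w)   # Eq. (3), capped
                O[j] = A[j] if A[j] > F else 0.0       # threshold test of Algorithm 1
                if O[j] > F and j not in fired:
                    fired.add(j)
                    next_pulse.append(j)
        to_fire = next_pulse
    return A


# Toy path graph a - b - c (not EcoLexicon data), starting from node a
toy = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
activation = spreading_activation(toy, ["a"])
print(activation)   # {'a': 1.0, 'b': 0.9609375, 'c': 0.28125}
```

In the toy run, node b has two edges, so by (5) its outgoing activation is halved before the decay factor is applied, which is why c ends with a much lower value than b. Without the fire-once assumption, activation would keep flowing back and forth along the path until every node saturated.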
5 Experimental Example
This proposal was evaluated with the EcoLexicon knowledge base [5], a thesaurus for the specialized domain of the Environment. It is the result of hundreds of hours of work of experts in the areas of translation, terminology, and environmental sciences. This knowledge base includes more than 4,000 concepts, 10,000 terms in six languages, as well as the annotated relations between the aforementioned concepts and terms. EcoLexicon also provides a visual tool (available at: http://ecolexicon.ugr.es) that enables the browsing and searching of its content.
Fig. 2. Database subset (for readability, only some labels are displayed)
In order to address the issue of information overload in EcoLexicon, a contextualization based on the classification of the instances of relations in different domains is presented in [32]. However, these contexts are rigidly defined and require the explicit effort of various experts to define and implement them. The aim of this section is to present an alternative to these contextual domains by means of user-generated contexts using spreading activation. Based on user interactions with EcoLexicon (i.e., search queries), the system builds user profiles that are used as a starting point for the spreading activation process. The initial results of this approach are presented in this paper. First, we extract a subset of the data in the knowledge base related to the concept ‘WATER’, which generates a set of concepts and relationships between
them. Afterwards, the Networkx library [33] is used to build a graph-based representation of the subset. The resulting network contains 201 nodes (concepts) and 220 links between nodes (relations between two concepts), as shown in Fig. 2.

After building the network, four user profiles were selected. Each profile includes a set of search terms (though the user profiles can be constructed by also using other user interactions with the system). These profiles are the starting nodes for the spreading activation process. As shown below, each profile corresponds to an area of expertise:

– Profile 1: water, concrete, cement. This profile corresponds to the coastal engineering domain.
– Profile 2: water, weathering, mineral. This profile corresponds to the geology domain.
– Profile 3: water, seawater, fresh water. This profile corresponds to the hydrology domain.
– Profile 4: water, matter, cloud. This profile corresponds to the meteorology domain.

Below, we present the results obtained in the first experiments. Tables 1, 2, 3 and 4 show the activation values for the activated nodes after the execution of spreading activation for every profile. The state of the network after the spreading activation process is presented in Fig. 3 for Profile 1 (only the nodes with an activation value greater than zero are represented).

Table 1. Activated nodes for Profile 1

Node                Activation
sand                0.90
karstic region      0.26
rock                0.06
water               1.00
binder              0.45
conglomerate        0.36
region              0.23
soil component      0.26
dissolved oxygen    0.26
weathering          0.26
matter              0.51
placer mine         0.06
water action        0.26
erosion             0.26
cement              1.00
aggregate           0.81
mineral             0.09
intrusion           0.26
concrete            1.00

Table 2. Activated nodes for Profile 2

Node                Activation
sand                0.49
karstic region      0.26
rock                0.60
water               1.00
conglomerate        0.13
region              0.23
soil component      0.26
dissolved oxygen    0.26
weathering          1.00
matter              1.00
placer mine         0.60
water action        0.26
erosion             0.26
aggregate           0.13
mineral             1.00
intrusion           0.26
concrete            0.36
Table 3. Activated nodes for Profile 3

Node                Activation
sea water           1.00
salt marsh          0.54
sand                0.13
karstic region      0.26
rock                0.06
water               1.00
conglomerate        0.03
region              0.23
soil component      0.26
dissolved oxygen    0.26
fresh water         1.00
weathering          0.26
matter              0.18
placer mine         0.06
saltwater           0.90
water action        0.26
erosion             0.26
aggregate           0.03
sandripple          0.90
mineral             0.09
intrusion           0.26
salt pan            0.54
concrete            0.09

Table 4. Activated nodes for Profile 4

Node                        Activation
atmosphere                  0.36
low atmospheric pressure    0.36
sand                        0.13
karstic region              0.26
hydrometeor                 0.36
cloud                       1.00
rock                        0.06
water                       1.00
conglomerate                0.03
region                      0.23
soil component              0.26
dissolved oxygen            0.26
weathering                  0.26
glory                       0.36
matter                      1.00
placer mine                 0.06
water action                0.26
erosion                     0.26
aggregate                   0.03
mineral                     0.09
intrusion                   0.26
concrete                    0.09
Fig. 3. Activated nodes after the spreading activation process for Profile 1. The size of each node is proportional to its activation value.
We performed a comparison between the generated contextualizations and the domains proposed by the experts in [32]. The results for Profile 1 are presented in Table 5. As can be seen, the contextualization coincides with the experts’ domain (coastal engineering) for 84% of the concepts, using only a small user profile of three terms.

Table 5. Comparison between the generated contextualization and the domains proposed by the experts for Profile 1

Node                Present in domain
sand                Yes
karstic region      No
rock                Yes
water               Yes
binder              Yes
conglomerate        Yes
region              Yes
soil component      Yes
dissolved oxygen    Yes
weathering          Yes
matter              Yes
placer mine         Yes
water action        No
erosion             Yes
aggregate           Yes
cement              Yes
mineral             Yes
intrusion           No
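The agreement figure can be recomputed from the rows of Table 5 as a simple proportion. The sketch below does this in Python, assuming that agreement is the share of listed nodes marked as present in the expert domain; the exact counting rule is not stated in the text. The 18 rows listed give 15 matches, i.e. roughly 83%, which is close to the reported 84%. Note that Table 1 contains a 19th activated node (concrete) that has no row in Table 5, which may account for the small difference.

```python
# Rows of Table 5: node -> whether it is present in the expert domain
table5 = {
    "sand": True, "karstic region": False, "rock": True, "water": True,
    "binder": True, "conglomerate": True, "region": True,
    "soil component": True, "dissolved oxygen": True, "weathering": True,
    "matter": True, "placer mine": True, "water action": False,
    "erosion": True, "aggregate": True, "cement": True, "mineral": True,
    "intrusion": False,
}

matches = sum(table5.values())        # nodes marked "Yes"
agreement = matches / len(table5)     # share of listed nodes in the domain
print(f"{matches}/{len(table5)} = {agreement:.1%}")
```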
6 Conclusions and Future Work
As the quality and quantity of large data sets steadily increase, there is a growing need for more efficient means of adapting the browsing and searching of these data sets to user preferences. In response to this issue, we propose the use of spreading activation in combination with user profiles as a way of adapting and customizing the search, query, and visualization capabilities of KBs. Moreover, the proposed mechanism is capable of eliciting the meta-knowledge of the users to extract the knowledge contextualization. A first experimental example was carried out, which showed how contextual domains could be extracted from the user profiles.

Regarding future work, in the short term, we plan to improve the spreading activation process by including fuzzy logic in the data set and adding new parameters to the algorithm. In addition, regarding the recommender system, we will implement and evaluate a preliminary application, using the information extracted from the user profiles and the spreading activation processes. Finally, an evaluation of the EcoLexicon extension will be performed by experts in the environmental domain.

Acknowledgements. This research has been carried out within the framework of the project FFI2011-22397, funded by the Spanish Ministry for Science and Innovation; the project P11-TIC7460, funded by Junta de Andalucía; and the project FP7-SEC-2012-312651, funded by the European Union in the Seventh Framework Programme [FP7/2007-2013] under grant agreement No 312651.
References

1. Navigli, R., Ponzetto, S.P.: Babelnet: Building a very large multilingual semantic network. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 216–225. Association for Computational Linguistics (2010)
2. Auer, S., et al.: Dbpedia: A nucleus for a web of open data. In: Aberer, K., Choi, K.-S., Noy, N., Allemang, D., Lee, K.-I., Nixon, L.J.B., Golbeck, J., Mika, P., Maynard, D., Mizoguchi, R., Schreiber, G., Cudré-Mauroux, P. (eds.) ASWC 2007 and ISWC 2007. LNCS, vol. 4825, pp. 722–735. Springer, Heidelberg (2007)
3. Fellbaum, C.: Wordnet. In: Theory and Applications of Ontology: Computer Applications, pp. 231–243 (2010)
4. Miller, G.A., Beckwith, R., Fellbaum, C., Gross, D., Miller, K.J.: Introduction to wordnet: An on-line lexical database*. International Journal of Lexicography 3(4), 235–244 (1990)
5. Reimerink, A., Faber, P.: Ecolexicon: A frame-based knowledge base for the environment. In: Proceedings of the International Conference Towards eEnvironment, pp. 25–27 (2009)
6. Ferragina, P., Gulli, A.: A personalized search engine based on web-snippet hierarchical clustering. Software: Practice and Experience 38(2), 189–225 (2008)
7. Han, L., Chen, G.: A fuzzy clustering method of construction of ontology-based user profiles. Advances in Engineering Software 40(7), 535–540 (2009)
8. Leung, K.T., Ng, W., Lee, D.L.: Personalized concept-based clustering of search engine queries. IEEE Transactions on Knowledge and Data Engineering 20(11), 1505–1518 (2008)
9. Widyantoro, D., Yen, J.: Using fuzzy ontology for query refinement in a personalized abstract search engine. In: Joint 9th IFSA World Congress and 20th NAFIPS International Conference, vol. 1, pp. 610–615 (July 2001)
10. Kim, K.J., Cho, S.B.: Personalized mining of web documents using link structures and fuzzy concept networks. Applied Soft Computing 7(1), 398–410 (2007)
11. Duong, T.H., Uddin, M.N., Li, D., Jo, G.S.: A Collaborative Ontology-Based User Profiles System. In: Nguyen, N.T., Kowalczyk, R., Chen, S.-M. (eds.) ICCCI 2009. LNCS, vol. 5796, pp. 540–552. Springer, Heidelberg (2009)
12. Mobasher, B., Cooley, R., Srivastava, J.: Automatic personalization based on web usage mining. Communications of the ACM 43(8), 142–151 (2000)
13. Mulvenna, M.D., Anand, S.S., Büchner, A.G.: Personalization on the net using web mining: introduction. Communications of the ACM 43(8), 122–125 (2000)
14. Jiang, X., Tan, A.H.: Learning and inferencing in user ontology for personalized Semantic Web search. Information Sciences 179(16), 2794–2808 (2009)
15. Katifori, A., Vassilakis, C., Dix, A.: Ontologies and the brain: Using spreading activation through ontologies to support personal interaction. Cognitive Systems Research 11(1), 25–41 (2010)
16. Anderson, J.R.: A spreading activation theory of memory. Journal of Verbal Learning and Verbal Behavior 22(3), 261–295 (1983)
17. Collins, A.M., Loftus, E.F.: A spreading-activation theory of semantic processing. Psychological Review 82(6), 407 (1975)
18. Resnick, P., Varian, H.R.: Recommender systems. Communications of the ACM 40(3), 56–58 (1997)
19. Crestani, F.: Application of spreading activation techniques in information retrieval. Artificial Intelligence Review (1997)
20. Alvarez, J.M., Polo, L., Jimenez, W., Abella, P., Labra, J.E.: Application of the spreading activation technique for recommending concepts of well-known ontologies in medical systems. In: Proceedings of the 2nd ACM Conference on Bioinformatics, Computational Biology and Biomedicine, BCB 2011, p. 626 (2011)
21. Preece, S.E.: Spreading activation network model for information retrieval. Dissertation Abstracts International Part B: Science and Engineering (Diss. Abst. Int. Pt. B- Sci. & Eng.) 42(9) (1982)
22. Gouws, S., Rooyen, G.J.V., Engelbrecht, H.A.: Measuring conceptual similarity by spreading activation over wikipedia’s hyperlink structure, 46–54 (August 2010)
23. Crestani, F., Lee, P.L.: Searching the web by constrained spreading activation. Information Processing & Management 36(4), 585–605 (2000)
24. Sieg, A., Mobasher, B., Burke, R.: Ontological User Profiles for Representing Context in Web Search. In: 2007 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology - Workshops, pp. 91–94. IEEE (November 2007)
25. Sieg, A., Mobasher, B., Burke, R.: Improving the effectiveness of collaborative recommendation with ontology-based user profiles. In: Proceedings of the 1st International Workshop on Information Heterogeneity and Fusion in Recommender Systems, HetRec 2010, pp. 39–46. ACM Press, New York (2010)
26. Sieg, A., Mobasher, B., Burke, R.: Ontology-Based Collaborative Recommendation. In: ITWP (2010)
27. Teufl, P., Payer, U., Parycek, P.: Automated analysis of e-participation data by utilizing associative networks, spreading activation and unsupervised learning. In: Macintosh, A., Tambouris, E. (eds.) ePart 2009. LNCS, vol. 5694, pp. 139–150. Springer, Heidelberg (2009)
28. Collins, C., Carpendale, S.: Vislink: Revealing relationships amongst visualizations. IEEE Transactions on Visualization and Computer Graphics 13(6), 1192–1199 (2007)
29. Kuß, A., Prohaska, S., Meyer, B., Rybak, J., Hege, H.C.: Ontology-based visualization of hierarchical neuroanatomical structures. In: Proc. Vis. Comp. Biomed., pp. 177–184 (2008)
30. Eyharabide, V., Amandi, A.: Ontology-based user profile learning. Applied Intelligence 36(4), 857–869 (2011)
31. Martín-Bautista, M.J., Kraft, D.H., Vila, M., Chen, J., Cruz, J.: User profiles and fuzzy logic for web retrieval issues. Soft Computing-A Fusion of Foundations, Methodologies and Applications 6(5), 365–372 (2002)
32. León Araúz, P., Magaña Redondo, P.: Ecolexicon: contextualizing an environmental ontology. In: Proceedings of the Terminology and Knowledge Engineering (TKE) Conference 2010, pp. 341–355 (2010)
33. Hagberg, A.A., Schult, D.A., Swart, P.J.: Exploring network structure, dynamics, and function using networkx. In: Varoquaux, G., Vaught, T., Millman, J. (eds.) Proceedings of the 7th Python in Science Conference, Pasadena, CA, USA, pp. 11–15 (2008)