2010 Fourth International Conference on Digital Society

Improving Recall and Precision of a Personalized Semantic Search Engine for E-learning Olfa Nasraoui and Leyla Zhuhadar Knowledge Discovery and Web Mining Lab Department of Computer Engineering and Computer Science University of Louisville, KY 40292, USA [email protected] [email protected]

The main objective of this paper is to propose and evaluate an architecture that provides, manages, and collects data that permit high levels of adaptability and relevance to the user profiles. In addition, we implement this architecture on a platform called HyperManyMedia. To achieve this objective, an approach for personalized search is implemented that takes advantage of the semantic Web standards (RDF and OWL) to represent the content and the user profiles. The framework consists of the following phases: (1) building the semantic E-learning domain using the known college and course information as concepts and sub-concepts, (2) generating the semantic user profiles as ontologies, (3) clustering the documents to discover more refined sub-concepts, (4) re-ranking the user's search results based on his/her profile, and (5) providing the user with semantic recommendations. The implementation of the ontology models is separate from the design and implementation of the information retrieval system, thus providing a modular framework that is easy to adapt and port to other platforms. Finally, the experimental results show that the user context can be effectively used for improving the precision and recall in E-learning search, particularly by re-ranking the search results based on the user profiles. Index Terms—personalization; semantic web; evaluation; search engine; E-learning

I. INTRODUCTION

This paper describes the most recent developments in HyperManyMedia1, an information retrieval system that utilizes ontologies as models to provide semantic information. This approach uses two different types of ontology: a global ontology model that represents the whole E-learning domain (content-based ontology), and a learner-based ontology that represents the learner's profile. The implementation of the ontology models is separate from the design and implementation of the information retrieval system, thus providing a modular

1 HyperManyMedia: We defined this term to refer to any educational material on the web (hyper) in a format that could be a multimedia format (image, audio, video, podcast, vodcast) or a text format (webpage, PowerPoint).

978-0-7695-3953-9/10 $26.00 © 2010 IEEE DOI 10.1109/ICDS.2010.63

framework that is easy to adapt and port to other platforms. The main objective of this paper is to propose and evaluate an architecture that can provide, manage, and collect data that permit high levels of adaptability and relevance to the learner's profile. In addition, this system uses clustering techniques to divide the documents into an optimal categorization that is not influenced by the hand-made taxonomy of the colleges and course titles. To achieve this objective, an approach for personalized search is implemented that takes advantage of the semantic Web standards (RDF and OWL) to represent the content and the user profiles. The framework consists of the following phases: (1) building the semantic E-learning domain using the known college and course information as concepts and sub-concepts, (2) generating the semantic user profiles as ontologies, (3) clustering the documents to discover more refined sub-concepts (descriptive terms from each cluster) than provided by the available college and course taxonomy, (4) re-ranking the user's search results by matching the concepts to the user profiles, and (5) providing the user with dynamic semantic recommendations during the search process. The most important advantage of clustering from the personalization perspective is that the clusters are later used as automatically constructed labels for each user profile. Hence, depending on the document collection and its evolution, both the user profiles and their underlying ontology labels are allowed to change or evolve accordingly. The experimental results show that user context can be effectively used for improving the recall and precision in E-learning search, particularly by re-ranking the search results based on user profiles.
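The re-ranking step in phase (4) can be sketched as follows. This is our own illustrative formulation, not the exact HyperManyMedia scoring function: each result's original engine score is blended with the overlap between the document's concepts and the concepts weighted in the user's profile.

```python
def rerank(results, profile, alpha=0.5):
    """results: list of (doc_id, engine_score, concepts); profile: {concept: weight}.
    Blend the engine score with a profile-match score and sort descending."""
    def blended(item):
        doc_id, score, concepts = item
        match = sum(profile.get(c, 0.0) for c in concepts)
        return alpha * score + (1 - alpha) * match
    return sorted(results, key=blended, reverse=True)

# Hypothetical results for the query "networks" and a CS-oriented profile:
results = [("doc1", 0.9, {"history"}),
           ("doc2", 0.8, {"computer_science", "networks"}),
           ("doc3", 0.7, {"networks"})]
profile = {"computer_science": 1.0, "networks": 0.8}
print([d for d, _, _ in rerank(results, profile)])  # doc2 moves ahead of doc1
```

The weight alpha is a hypothetical knob trading off the engine's relevance score against profile affinity.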
The rest of this paper is divided into the following sections. Section 2 (Background and Related Work) gives an overview of ontology (knowledge representation), information retrieval, document clustering (text analysis), and the semantic Web for E-learning. Section 3 (Methodology) presents the core contribution of this paper, starting with the proposed architecture to build the semantic domain, then the methodology used to augment the platform with external resources, and ending with the document clustering process. Section 4 (Experimental Analysis) describes our evaluation methods and results. Section 5 (Discussion and Conclusion) presents the novelty of our research and future work.

II. BACKGROUND AND RELATED WORK

A. Ontology (Knowledge Representation)

Neches et al. define an ontology as "the basic terms and relations comprising the vocabulary of a topic area as well as the rules for combining terms and relations to define extensions to the vocabulary" [10]. Gruber defines an ontology as "an explicit specification of a conceptualization" [4]. Swartout et al. describe an ontology as "a hierarchically structured set of terms for describing a domain that can be used as a skeletal foundation for a knowledge base" [17]. Studer et al. explain an ontology as "a formal, explicit specification of a shared conceptualization" [16]. We can decompose all these definitions into four interrelated areas of research: (a) formalization: a machine-readable representation; (b) specification: concepts, functions, properties, relations, constraints, axioms, etc., are explicitly defined; (c) conception: the act of conceiving knowledge; (d) conceptualization: creating an abstraction of a complex model.

B. Information Retrieval (IR)

McGill and Salton define Information Retrieval (IR) as "a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information" [8]. Baeza-Yates and Ribeiro-Neto link Information Retrieval to the user's information needs, which can be expressed as a query submitted to a search engine [1]. To accommodate these needs, information has to first be analyzed and structured, then stored and organized in order to be retrieved. Korfhage [7] looked at an information retrieval system from three perspectives: design, evaluation, and usage.
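Since the vector space model underlies both the retrieval and the clustering discussed below, a minimal cosine-similarity sketch may help fix ideas. The code and the toy vocabulary are our own illustration, not part of the HyperManyMedia implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

# Hypothetical term-frequency vectors over the vocabulary
# ("semantic", "search", "history"):
query = (1, 1, 0)
doc_a = (2, 3, 0)   # about semantic search
doc_b = (0, 0, 4)   # about history
print(cosine(query, doc_a) > cosine(query, doc_b))  # True
```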
Several evaluation methods have been introduced in the literature, such as Recall, Precision, F-measure, Harmonic Mean, E-measure, user-oriented measures (coverage, novelty), expected search length, satisfaction, and frustration, but the most widely used are Top-n Recall and Top-n Precision. These measures are defined as follows:

- Top-n Recall: the number of relevant documents among the top n retrieved documents divided by the total number of relevant documents:

Top-n Recall = (number of relevant documents retrieved within top n results) / (total number of relevant documents)    (1)

- Top-n Precision: the number of relevant documents retrieved within the top n results divided by n:

Top-n Precision = (number of relevant documents retrieved within top n results) / n    (2)

C. Clustering Documents/Text Analysis

Cluster analysis is the process of grouping objects such that the similarity between objects within the same group is greater than their similarity to objects in other groups [11]. In particular, the following improvements in an Information Retrieval system can be expected using clustering:

- Improving Search Recall: Search engines retrieve documents related to a specific query term. Generally, the same concept can be expressed using different terms, so searching for one of these terms will not retrieve the others. Clustering, which is based on overall similarity between the documents, can improve recall, since the search query will match an entire cluster instead of only one or more terms.

- Improving Search Precision: Assessing the relevance of documents to a query in a large collection of documents is a difficult task. Clustering those documents into smaller collections, ordering them by relevance, and returning only the most relevant group of documents may help in finding a user's specific interest.

II.C.1 Clustering Documents: Applying clustering to a set of documents involves the following processes: (1) Data Representation: Vector Space Model (VSM), Metric Space Model (MSM), and Graph Model (GM); (2) Similarity Measures: inter-object similarity and inter-cluster similarity; (3) Clustering Algorithms: the most common families being agglomerative and partitional; (4) Evaluation and Validation.

II.C.2 Data Representation: In order to cluster documents, they first need to be represented using a model. While a number of modeling representations are discussed in the literature, the most common ones are the Vector Space Model (VSM), the Metric Space Model (MSM), and the Graph Model (GM). Among the three, the Vector Space Model is the most ubiquitous; our focus in this paper is on the VSM.

II.C.3 Similarity Measures: Two scopes of measuring similarity can be considered: (i) inter-object similarity and (ii) inter-cluster similarity. The former deals with the similarity between two individual objects, while the latter deals with the similarity between two entire groups of objects. Other approaches to measuring similarity exist; for a detailed review, see [12], [6], [5].

II.C.4 Clustering Algorithms: Without loss of generality, clustering algorithms can be divided into two broad categories: agglomerative approaches [15], [21], [18], [2] and partitional approaches [22], [15], [21]. Each criterion function uses a different methodology to produce the optimal clustering solution. Internal criteria search for the best solution based only on the documents inside each cluster, while external criteria focus on finding the optimal solution in which the clusters are very different from


each other. Graph-based models represent the documents as a graph and then find the optimal solution. Finally, hybrid models use a mix of criterion functions [21]. Each of these seven criterion functions can use a choice of different similarity measures, as shown in Table I.

1- Hierarchical Agglomerative Algorithms: Agglomerative algorithms start by assigning each document to its own cluster; the goal is to find the pairs of clusters to be merged at the next step, which can be done using classical approaches such as single-link, weighted single-link, complete-link, weighted complete-link, and UPGMA, or using different criterion functions [19]: I1, I2, E1, G1, G1', H1, H2, with each criterion measuring different aspects of intra-cluster similarity and inter-cluster dissimilarity.

2- Partitional Algorithms: The goal is to find the clusters by partitioning the set of documents into a predetermined number of disjoint sets, each related to one specific cluster, by optimizing various criterion functions [21], [19], [20]. Two methods of partitioning are very popular: (i) direct K-way clustering (similar to K-means) and (ii) repeated bisection, or Bisecting K-Means (which makes a sequence of bisections to find the best solution).

II.C.5 Evaluation and Validation: In the special case where external class labels are available for the input data, the quality of a clustering solution can be measured using the entropy [21]. The entropy is calculated as the weighted average of the entropies of the k individual clusters, each weighted in proportion to its cluster size:

Entropy = \sum_{r=1}^{k} (n_r / n) E(S_r)    (3)

For a specific cluster S_r of size n_r, the entropy of this cluster is defined [21] in Equation (4), where q is the number of classes in the dataset and n_r^i is the number of documents of the i-th class that were assigned to the r-th cluster:

E(S_r) = - (1 / \log q) \sum_{i=1}^{q} (n_r^i / n_r) \log (n_r^i / n_r)    (4)

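To make Equations (3) and (4) concrete, the following sketch computes the entropy of a clustering solution from class labels. This is illustrative code of ours, not the authors' implementation; the log base q normalization follows Equation (4).

```python
import math
from collections import Counter

def cluster_entropy(class_labels_in_cluster, q):
    """Equation (4): entropy of one cluster, normalized by log q."""
    n_r = len(class_labels_in_cluster)
    counts = Counter(class_labels_in_cluster)
    return -sum((c / n_r) * math.log(c / n_r) for c in counts.values()) / math.log(q)

def total_entropy(clusters, q):
    """Equation (3): size-weighted average of the per-cluster entropies."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * cluster_entropy(c, q) for c in clusters)

# Hypothetical assignment: two clusters over q = 2 classes ("cs", "math").
clusters = [["cs", "cs", "cs", "math"], ["math", "math"]]
print(total_entropy(clusters, q=2))  # low value = mostly pure clusters
```

A perfectly pure partition gives entropy 0, so lower is better, matching how the paper reports its best solution.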
D. Semantic Web for E-learning

Several new efforts in the area of semantic information retrieval are well presented in Sheth et al. [13] and in Gauch [3]. More recently, Nasraoui et al. [9] presented a Semantic Web usage mining methodology for mining evolving user profiles on dynamic Websites by clustering the user sessions in each period and relating the user profiles of one period with those discovered in previous periods, both to detect profile evolution and to understand what types of profile evolution have occurred. Of all related work, Sieg et al. [14] seems to be the closest to ours. However, there are major differences between our approach and theirs: mainly, our search engine provides re-ranking based not only on the user's profile, but also on cluster-based similarity metrics that capture the distribution of the documents in the domain. Also, all of the above efforts addressed the evolution of user interests; however, they did not implement their methods within a working information retrieval system.

III. METHODOLOGY

A. Semantic Domain Structure

Let R represent the root of the domain, which is represented as a tree, and let C_i represent a concept under R. In this case, R = \cup_{i=1}^{n} C_i, where n is the number of concepts in the domain. Each concept C_i consists either of sub-concepts SC_j^i, which are children of C_i (C_i = \cup_{j=1}^{m} SC_j^i) if C_i has sub-concepts, or of leaves, which are the actual lecture documents (\cup_{k=1}^{l} d_k^i).

B. Augmenting the HyperManyMedia Repository with External Resources

Previously, the HyperManyMedia dataset consisted of 28 courses and 2,812 learning objects (lectures), with each learning object represented in 7 different formats (text, PowerPoint, streamed audio, streamed video, podcast, vodcast, RSS), thus a total of ~20,000 individual learning objects. These materials were created by Western Kentucky University and located on the HyperManyMedia E-learning repository. We augmented the HyperManyMedia repository with external open source resources from MIT OpenCourseWare2. Currently, we have 64 courses: 27 are provided by faculty from Western Kentucky University and 37 are embedded in the platform from MIT OpenCourseWare. Nineteen courses are provided in Spanish and English and the rest only in English, for a total of 7,424 learning objects (lectures), with each learning object represented in 7 different formats. This amounts to a total of ~51,968 individual learning objects. Augmenting HyperManyMedia with MIT courses influenced three parts of the architecture design of the platform (Generic search, Metadata search, Semantic search). Table II presents a summary of the total resources that the current HyperManyMedia platform contains.

Table II
SUMMARY OF HyperManyMedia RESOURCES

Total # of colleges = 11
Total # of courses = 64
Total # of WKU courses = 27
Total # of MIT courses = 37
Total # of English courses = 45
Total # of Spanish courses = 19
Total # of lectures = 7,424
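The domain structure defined in Section III-A (root R, concepts/colleges, sub-concepts/courses, lecture leaves) could be represented, for illustration, as a simple nested mapping. The fragment below is hypothetical; the real ontology is authored in Protégé as OWL.

```python
# Hypothetical fragment of the HyperManyMedia domain tree:
domain = {
    "HyperManyMedia": {                            # root R
        "Science": {                               # concept C_i (college)
            "Astronomy": ["lecture1", "lecture2"], # sub-concept (course) with lecture leaves
        },
        "Business": {
            "Economics": ["lecture1"],
        },
    },
}

def leaves(tree):
    """Collect all lecture documents under the root."""
    if isinstance(tree, list):
        return list(tree)
    return [d for sub in tree.values() for d in leaves(sub)]

print(len(leaves(domain)))  # 3
```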

C. Augmented Semantic Search Engine

We used Protégé3, an open source ontology editor and knowledge-base framework that supports two ways of modeling ontologies, (1) the Protégé-Frames editor and (2) the Protégé-OWL editor, to design and build the structure of HyperManyMedia's


2 http://ocw.mit.edu/OcwWeb/web/home/home/index.htm
3 http://protege.stanford.edu

Table I
SUMMARY OF VARIOUS CLUSTERING CRITERION FUNCTIONS

CF  | Category            | Optimization Function
I1  | Internal            | maximize \sum_{r=1}^{k} n_r ( (1/n_r^2) \sum_{d_i, d_j \in S_r} \cos(d_i, d_j) ) = \sum_{r=1}^{k} \|D_r\|^2 / n_r
I2  | Internal            | maximize \sum_{r=1}^{k} \sum_{d_i \in S_r} \cos(d_i, C_r) = \sum_{r=1}^{k} \sum_{d_i \in S_r} d_i^t C_r / \|C_r\| = \sum_{r=1}^{k} \|D_r\|
E1  | External            | minimize \sum_{r=1}^{k} n_r D_r^t D / \|D_r\|
G1  | Graph-Based, Hybrid | minimize \sum_{r=1}^{k} D_r^t D / \|D_r\|^2
G1' | Graph-Based, Hybrid | minimize \sum_{r=1}^{k} n_r^2 D_r^t D / \|D_r\|^2
H1  | Hybrid              | maximize I1 / E1
H2  | Hybrid              | maximize I2 / E1

Figure 1. HyperManyMedia User Interface (UI)

ontology. Our current ontology consists of ~40,000 lines of code4.

D. Clustering/Text Analysis for the Semantic Search

Clustering the augmented documents affects two parts of the HyperManyMedia platform: (1) Semantic Search and (2) Visual Search. These interfaces are driven by the ontology, and the outcome of the cluster analysis is added to the domain ontology as additional leaves under the "SubSubSubconcept = lecture". The approach that we followed is similar to our approach in [23]. The importance of our approach is the

4 http://acadmedia.wku.edu/Zhuhadar/Evaluation%20Results/semanticnew.owl

combination of an authoritatively supplied taxonomy from the colleges with the data-driven extraction (via clustering) of a taxonomy from the documents themselves, thus making it easier to adapt to different learning platforms and to evolve with the document/lecture collection. Therefore, we clustered the documents into meaningful groups that form a finer granularity than the broader college and course categories provided by the available E-learning taxonomy. From each cluster, we extracted the descriptive terms, then modified the ontology accordingly by adding the cluster's terms as semantic terms under the "SubSubSubconcept = lecture" to which these documents


belong. The total corpus consists of approximately 7,424 documents (lectures), divided into 4,888 English documents and 2,536 Spanish documents. We experimented with partitional algorithms, direct K-way clustering (similar to K-means), and repeated bisection (Bisecting K-Means), with all criterion functions. Additionally, we experimented with graph-partitioning-based clustering algorithms. We compared different hierarchical clustering algorithms for the English corpus, which consisted of 4,888 documents. We repeated each clustering algorithm with all possible combinations of clustering criterion functions for different numbers of clusters in the range [20, ..., 50]. By considering each college as one broad class (thus 11 categories), we tried to ensure that the clusters are as pure as possible, i.e., that each cluster contains documents mainly from the same category. However, since a class may be partitioned into several clusters (as was the case here), the clusters are more refined versions of the college categories, which was our goal. We experimented with the three different clustering algorithms mentioned above. In the next section, we present our experimental analysis.
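The repeated-bisection (Bisecting K-Means) strategy used in these experiments can be sketched as follows. This is a toy, deterministically seeded illustration of the idea only; the paper's actual experiments used the CLUTO package.

```python
def two_means(docs, iters=10):
    """Split a list of vectors into two clusters with a deterministic
    2-means (seeds: first and last document)."""
    c0, c1 = docs[0], docs[-1]
    for _ in range(iters):
        a = [d for d in docs if dist(d, c0) <= dist(d, c1)]
        b = [d for d in docs if dist(d, c0) > dist(d, c1)]
        if not a or not b:
            break
        c0, c1 = mean(a), mean(b)
    return a, b

def dist(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v))

def mean(ds):
    return tuple(sum(xs) / len(ds) for xs in zip(*ds))

def bisecting_kmeans(docs, k):
    """Repeatedly bisect the largest cluster until k clusters remain."""
    clusters = [list(docs)]
    while len(clusters) < k:
        clusters.sort(key=len, reverse=True)
        a, b = two_means(clusters.pop(0))
        clusters += [a, b]
    return clusters

# Hypothetical 2-D "documents" forming three obvious groups:
docs = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0), (20, 1)]
for c in bisecting_kmeans(docs, 3):
    print(sorted(c))
```

Real document vectors are high-dimensional and cosine-normalized, and production implementations choose which cluster to bisect by a criterion function rather than by size alone.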



Tables (C1, C2, C3, ..., C50) in URL5 present all the clustering experiments that we ran on the English corpus. Table C49 shows the best clustering solution for the English corpus and summarizes its results. In addition, we generated a confusion matrix, shown in Table C50, with only 200 misclassified documents out of 4,888 (~4%). Note that we re-labeled each cluster based on the majority of assigned documents in each college.

Top-n-Recall Results (Semantic Search vs. Personalized Cluster-based Semantic Search)

In this stage, we used a dataset of 1,825 user profiles to evaluate the Top-n-Recall in the augmented system.

Figure 2. Top-n-Recall (Semantic vs. Personalized Semantic vs. Personalized Cluster-based Semantic Search)

IV. EXPERIMENTAL ANALYSIS

A. Evaluation Methodology and Results

1- Evaluation Methodology:



• Research Questions
- Will there be an improvement in Top-n-Recall when using the Personalized Cluster-based Semantic search engine compared to the Semantic search engine?
- Will there be an improvement in Top-n-Precision when using the Personalized Cluster-based Semantic search engine compared to the Semantic search engine?
• Evaluation Measures
a) Cluster Evaluation: We used the entropy measure (Equation (4)) to evaluate the quality of each clustering solution. This measure evaluates the overall quality of a cluster partition based on the distribution of the documents in the clusters. The entropy of the entire partition, consisting of k clusters, is computed as shown in Equation (3).

b) IR Evaluation: We used Top-n-Recall (Equation (1)) and Top-n-Precision (Equation (2)).
2- Evaluation Results:

• Clustering Results
The best clustering method for the English corpus, which produced the highest Purity = 0.959 with the lowest Entropy = 0.05, was the Agglomerative method with Number of Clusters = 38, using Clustering Criterion Function = I1; refer to Section II-C for more information about the cluster analysis.
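Purity, reported above alongside entropy, can be computed from the same cluster/class assignments. This is a minimal sketch of ours; the actual numbers come from the CLUTO output.

```python
from collections import Counter

def purity(clusters):
    """Fraction of documents assigned to the majority class of their cluster."""
    n = sum(len(c) for c in clusters)
    return sum(max(Counter(c).values()) for c in clusters) / n

# Hypothetical class labels per cluster:
clusters = [["cs", "cs", "cs", "math"], ["math", "math"]]
print(purity(clusters))  # (3 + 2) / 6
```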

Figure 2 shows the results of Top-n-Recall for two types of searching mechanisms: (a) Non-Personalized Semantic search and (b) Personalized Cluster-based Semantic search. We found that the recall results of the Personalized Cluster-based Semantic search outperformed the Semantic search in every interval (Top-10, Top-20, Top-30, ..., Top-100).

Top-n-Precision Results (Semantic Search vs. Personalized Cluster-based Semantic Search)

In this stage, we used the same dataset of 1,825 user profiles to evaluate the Top-n-Precision in the augmented system.
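The Top-n measures defined in Equations (1) and (2) can be computed as follows. The function and variable names are ours, not from the HyperManyMedia code.

```python
def top_n_recall(ranked_ids, relevant_ids, n):
    """Equation (1): relevant documents within the top n results,
    divided by the total number of relevant documents."""
    top_n = set(ranked_ids[:n])
    return len(top_n & relevant_ids) / len(relevant_ids)

def top_n_precision(ranked_ids, relevant_ids, n):
    """Equation (2): relevant documents within the top n results, divided by n."""
    top_n = set(ranked_ids[:n])
    return len(top_n & relevant_ids) / n

# Hypothetical ranked result list and relevance judgments:
ranked = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d3", "d8", "d9"}
print(top_n_recall(ranked, relevant, 5))    # 3 relevant retrieved / 4 relevant = 0.75
print(top_n_precision(ranked, relevant, 5)) # 3 relevant retrieved / 5 = 0.6
```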

5 using the algorithms in the clustering package Cluto: http://acadmedia.wku.edu/Zhuhadar/Evaluation%20Results/Appendix%20C%20Clustering%20English%20Documents.pdf


Figure 3. Top-n-Precision (Semantic vs. Personalized Semantic vs. Personalized Semantic with Relevance Feedback)

Figure 3 shows the results of Top-n-Precision for two types of searching mechanisms: (a) Non-Personalized Semantic search and (b) Personalized Cluster-based Semantic search. We also found that the precision results of the Personalized Cluster-based Semantic search outperformed the Semantic search in every interval (Top-10, Top-20, Top-30, ..., Top-100).

V. DISCUSSION AND CONCLUSION

Our approach is significantly different from the ones used in the development of current information retrieval systems. First, it utilizes ontologies as models to provide semantic information. Second, it uses two different types of ontologies: a global ontology model that represents the whole E-learning domain, and a learner model that represents the learner profile. Moreover, the implementation of the ontology models is separate from the design and implementation of the information retrieval system, making the approach modular and portable. Finally, this platform is an open source repository that enables an online learners' community to use and share resources. We also note that separating the ontology models from the design and implementation of an Adaptive Hypermedia system advances the state of the art by providing an architecture that enables a generic, application-independent, reusable approach, which is therefore less costly to develop. The experimental results showed that the learner's context can be effectively used for improving the precision and recall in E-learning search, particularly by re-ranking the search results based on the learner's past activities. Our current work addresses the problem of adapting to evolving users and an evolving domain; more details can be found in [24].

VI. ACKNOWLEDGMENTS

This work is partially supported by the National Science Foundation CAREER Award IIS-0133948 to Olfa Nasraoui.

REFERENCES

[1] Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, May 1999.
[2] A. El-Hamdouchi and P. Willett. Comparison of hierarchic agglomerative clustering methods for document retrieval. The Computer Journal, 32(3):220–227, 1989.
[3] S. Gauch. Ontology-based personalized search and browsing. Web Intelligence and Agent Systems, 1(3):219–234, 2003.
[4] T.R. Gruber. A translation approach to portable ontology specifications. Knowledge Acquisition, 5:199–220, 1993.
[5] A.K. Jain, M.N. Murty, and P.J. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3), 1999.
[6] L. Kaufman and P.J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, 1990.
[7] R.R. Korfhage. Information Storage and Retrieval. Wiley, New York, 1997.
[8] M.J. McGill and G. Salton. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
[9] O. Nasraoui, M. Soliman, E. Saka, A. Badia, and R. Germain. A Web usage mining framework for mining evolving user profiles in dynamic Web sites. IEEE Transactions on Knowledge and Data Engineering, pages 202–215, 2008.
[10] R. Neches, R.E. Fikes, T. Finin, T. Gruber, R. Patil, et al. Enabling technology for knowledge sharing. AI Magazine, 12(3):36, 1991.
[11] P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Pearson Addison Wesley, Boston, 2005.
[12] C. Romesburg. Cluster Analysis for Researchers. Lulu.com, 2004.
[13] A. Sheth, C. Bertram, D. Avant, B. Hammond, K. Kochut, and Y. Warke. Managing semantic content for the Web. IEEE Internet Computing, 6(4):80–87, 2002.
[14] A. Sieg, B. Mobasher, and R. Burke. Ontological user profiles for representing context in web search. In Web Intelligence and Intelligent Agent Technology Workshops, 2007 IEEE/WIC/ACM International Conferences on, pages 91–94, Nov. 2007.
[15] M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques, 2000.
[16] R. Studer, V.R. Benjamins, and D. Fensel. Knowledge engineering: principles and methods. Data & Knowledge Engineering, 25(1-2):161–197, 1998.
[17] B. Swartout, R. Patil, K. Knight, and T. Russ. Toward distributed use of large-scale ontologies. In Proc. of the Tenth Workshop on Knowledge Acquisition for Knowledge-Based Systems, 1996.
[18] O. Zamir and O. Etzioni. Web document clustering: a feasibility demonstration. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 46–54. ACM Press, New York, 1998.
[19] Y. Zhao and G. Karypis. Evaluation of hierarchical clustering algorithms for document datasets. In Proceedings of the Eleventh International Conference on Information and Knowledge Management, pages 515–524, 2002.
[20] Y. Zhao, G. Karypis, and U. Fayyad. Hierarchical clustering algorithms for document datasets. Data Mining and Knowledge Discovery, 10(2):141–168, 2005.
[21] Y. Zhao and G. Karypis. Comparison of agglomerative and partitional document clustering algorithms. Technical report, University of Minnesota, Dept. of Computer Science, 2002.
[22] Ying Zhao and George Karypis. Soft clustering criterion functions for partitional document clustering: a summary of results. In CIKM '04: Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, pages 246–247, New York, 2004. ACM.
[23] Leyla Zhuhadar and Olfa Nasraoui. Semantic information retrieval for personalized e-learning. In Tools with Artificial Intelligence, 2008. ICTAI '08. 20th IEEE International Conference on, 1:364–368, Nov. 2008.
[24] Leyla Zhuhadar, Olfa Nasraoui, and Robert Wyatt. Dual representation of the semantic user profile for personalized web search in an evolving domain. In Proceedings of the AAAI 2009 Spring Symposium on Social Semantic Web: Where Web 2.0 Meets Web 3.0, pages 84–89, 2009.

