Performance and Scalability Testing Methodology for a Semantic Personalized Search Engine Leyla Zhuhadar and Olfa Nasraoui Knowledge Discovery and Web Mining Lab Department of Computer Engineering and Computer Science University of Louisville, Louisville, KY 40292, USA. [email protected], [email protected]

Abstract—The Internet Systems Consortium1 (ISC) reports, in its latest Internet global statistics, that the number of hosts advertised in the DNS reached 732,740,444 (January 2010), while the number of Internet users is around 1,802 million, with approximately 4.2 billion mobile users and 1.3 billion PC users2 (December 2009). This proliferation of data, users, and mobile devices has pushed the Information Retrieval (IR) community to find solutions that relate, organize, and control this information more efficiently. The work presented in this paper describes an evaluation of the performance of a hybrid recommender retrieval model that can filter information based on a user's needs. We evaluate the scalability of the HyperManyMedia search engine by evaluating the crawling speed, more specifically, the time needed for the crawler to crawl the server.

I. INTRODUCTION
The work presented in this paper describes an evaluation of the performance of a hybrid recommender retrieval model that can filter information based on a user's needs. We believe that the methodology for designing an efficient recommender system, regardless of the approach used, i.e., content-based, collaborative, or hybrid, is to incorporate the following essential elements: contextual information, user interaction with the system, the flexibility of receiving recommendations in a less intrusive manner, detecting the user's change of interest and responding accordingly, supporting user feedback, the simplicity of the user interface, and finally the scalability of the recommender engine. To measure the scalability of a search engine, we need to evaluate the crawling speed, more specifically, the time needed for the crawler to crawl the local server. Our recommender search engine is semantically enriched;

1 ftp://ftp.isc.org/www/survey/reports/current/index.html
2 InternetWorldStats.com (Dec 31, 2009)

therefore, measuring scalability is based on two aspects: (1) the Web scale of the data, and (2) the human scale of dealing with this data. Polleres and Huynh discussed this issue, saying: “As the core of the Semantic Web matures, we see parallel trends such as microformats, RDFa, and Linked Data evolve with it, all of which complementing each other in what we may well call a machine-readable Web. Yet, scalable techniques for dealing with this Web of Data in its entirety, i.e. using the Web as a database, as the Semantic Web has been envisioned, still misses some important pieces of the puzzle. Scalability here does not mean solely the ability to handle amounts of data at Web scale in terms of actual data processing, but it also refers to the human scale in the form of user-friendly tools that open the Web of Data to the current Web user” [1].

II. PREVIOUS WORK
In this section, we provide an overview of the research field of “Ontologies in the E-learning Domain”. Recently, the Semantic Web has evolved as a key technology, bringing new insights and solutions to well-documented problems in E-learning. The most challenging process in E-learning has been the annotation and packaging of the learning content. The current approaches, mostly based on reusable learning objects and learning design methods, suffer from poor and underutilized semi-automated and manual techniques for entering the required formal or informal semantics and metadata. Another challenge in E-learning has been the very narrow conceptualizations and algorithms used for modeling learners' profiles. The quest for highly effective and personalized E-learning systems must be based on a detailed specification of the behavioral, learning, and systemic parameters that

jointly formulate the learners' model. Personalization must be built on top of well-defined learning contexts and then realized within an Information Retrieval framework. The research field of Semantic E-learning covers a wide range of research problems. Below, we cover some of the Semantic Web efforts in the context of E-learning. Some work focused on the Educational Semantic Web, i.e., how authoring tools can be designed to provide educational content that is shareable, reusable, and interoperable [2]. This research proposed an ontology-driven authoring-tools framework that educational platforms can benefit from. Others argued that an ontology-supported learning process enhances the activities between faculty and students in Web-based learning environments and surveyed the relationship between Artificial Intelligence in Education (AIED) and Web Intelligence (WI). This demonstrated strong evidence that the AIED field can benefit from the core components of WI techniques, such as ontologies, adaptivity, personalization, and software agents [3]. One important aspect of this benefit is that it can enhance the learner's comfort, especially from a personalization point of view, where WI techniques can be used to create a learner model that frees the learner from the burden of manually discovering relevant materials in an ontology-supported learning environment. Others have demonstrated a framework for ontology-enabled annotation and knowledge management in collaborative learning environments [4]. The proposed framework provided personalization and semantic content retrieval, with the personalization aspect expressed using ontologies that captured the relationship between learners and annotated content. These ontologies have later been used to discover similar content and collaborators, thus realizing a query-by-annotation retrieval system.
Another E-learning Web service architecture, presented in [5], provides students with the following: registration, authentication, tutoring, question-answer query services, and annotated feedback. The feasibility of this architecture was tested through different scenarios in the E-learning domain, the most interesting being the capability of the system to provide a semantic annotation of argumentation in student essays. Three different ontologies governed the system: a learning materials ontology, an annotation schema ontology, and a services ontology. A framework for personalized E-learning in the Semantic Web, where the hypertext structures are automatically composed using Semantic Web services, was proposed in [6]. Another approach was presented in [7], where semi-automatic ontology extraction is used to create

draft Topic Maps. That work encodes knowledge in Topic Maps through the design and implementation of a plug-in for the TM4L editor, exploits the concept of subject identity in learning-content authoring, and uses Wikipedia articles as a source for (1) consensual naming and (2) subject identifiers. This research is implemented in the Topic Maps for E-Learning tool [8].

III. IMPLEMENTATION
A. HyperManyMedia Repository with External Resources
The HyperManyMedia repository originally consisted of 28 courses and 2,812 learning objects (lectures), with each learning object represented in 7 different formats (text, powerpoint, streamed audio, streamed video, podcast, vodcast, RSS), thus a total of ~20,000 individual learning objects. Those materials were created by Western Kentucky University and located in the HyperManyMedia E-learning repository. Recently, we augmented the HyperManyMedia repository with external open-source resources from MIT OpenCourseWare3. Currently, we have 64 courses and 7,424 learning objects (lectures), each learning object represented in 7 different formats, for a total of ~51,968 individual learning objects. Augmenting HyperManyMedia with MIT courses influenced three parts of the architectural design of the platform: Generic search, Metadata search [9], and Semantic search [10], [11]. In addition, the search engine was enhanced to provide the following capabilities: 1) Semantic search engine recommendations: user relevance feedback and collaborative filtering; refer to our previous work [12] for more details. 2) Multi-language ontology-based search engine; refer to our previous work [13] for more details. 3) Visual ontology-based search engine; refer to our previous work [14] for more details. Each augmented college was divided into the following categories: (a) College resource (WKU/MIT), (b) Course name, (c) Lecture name, (d) Language type (English/Spanish).

B. HyperManyMedia User Interface (UI)
Figure 1 illustrates the user interface of the HyperManyMedia platform. The interface distinguishes three types of users (Guest, WKU Student, Open-Source User). 1) Guest Login: In this case, the user has full access to all the resources. The only difference between this type of user and the other

3 http://ocw.mit.edu/OcwWeb/web/home/home/index.htm
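As a quick consistency check, the learning-object totals quoted in Section III-A follow directly from the lecture and format counts given in the text (the variable names below are illustrative, not from the platform's code):

```python
# Each lecture is published in 7 formats (text, powerpoint, streamed
# audio, streamed video, podcast, vodcast, RSS).
FORMATS = 7

wku_lectures = 2_812       # original WKU-only repository
combined_lectures = 7_424  # after adding the MIT OpenCourseWare material

print(wku_lectures * FORMATS)       # 19684, i.e. the ~20,000 figure
print(combined_lectures * FORMATS)  # 51968, i.e. the ~51,968 figure
```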

Figure 1. HyperManyMedia User Interface (UI)

Figure 2. HyperManyMedia Registration Interface

two types is that the system does not register the user's activities/patterns/behaviors while using the system. Every time the user logs into the system, he/she is considered a new user. No user models are created for this type of user. The semantic domain is presented as a complete domain (ontology). Also, the user is not provided with privileges such as updating a user profile based on the history of navigation, or recommendations based on collaborative filtering. 2) WKU Student: In this case, the user uses his/her university's authentication access. Each student is recorded in the database system. After the user logs in, he/she creates a user profile by choosing the domains of interest, i.e., College and Course(s). The system creates a user profile (user model). This profile can be updated based on the user's interaction with the system, either through the user's relevance feedback or through recommendations from similar users based on collaborative filtering. 3) Open-Source User: In this case, the user logs in with his/her first/last name. He/she is provided with the same privileges as a WKU Student (for the time being) for evaluation purposes. Figure 2 illustrates how the user creates his/her own user profile by choosing the college domain and the courses. At this stage, the user profile is created and added to the database system.

Table I presents a summary of the resources that the current HyperManyMedia platform contains. In its current status (since it is a dynamic platform), we have approximately 7,424 lectures distributed over 11 colleges, for a total of 64 courses; 27 courses are provided by faculty from Western Kentucky University and 37 courses are embedded in the platform from MIT OpenCourseWare4. Nineteen courses are provided in Spanish and English, and the rest only in English.

Table I
SUMMARY OF HyperManyMedia RESOURCES
Total # of colleges = 11
Total # of courses = 64
Total # of WKU courses = 27
Total # of MIT courses = 37
Total # of English courses = 45
Total # of Spanish courses = 19
Total # of lectures = 7,424

4 http://ocw.mit.edu/OcwWeb/web/home/home/index.htm

IV. PERFORMANCE EVALUATION
We evaluated the scalability of the HyperManyMedia search engine's crawling. This evaluation is measured based on the performance of the search engine's (re-)crawling. The crawling could be against the entire Web or against a local HTTP server. The HyperManyMedia search engine falls into the latter category: we crawl the local server http://hypermanymedia.wku.edu. This evaluation measures the speed of crawling, which depends on the structure of the search engine. In the next section, we provide a brief overview of the architecture of the HyperManyMedia crawler, followed by a section about the HyperManyMedia performance results.

A. Crawler Architecture
The HyperManyMedia search engine architecture is shown in Figure 3. The HyperManyMedia search engine's crawler algorithm is provided in Algorithm 1.

Figure 3. HyperManyMedia Search Engine Architecture

Algorithm 1 HyperManyMedia Crawling Algorithm
depth = N;
create a new WebDB;
inject http://hypermanymedia.wku.edu into the WebDB (inject);
Repeat Every Week
  Generate a fetchlist from the WebDB in a new segment (generate);
  Fetch content from URLs in the fetchlist (fetch);
  Update the WebDB with links from fetched pages (updatedb);
Until depth = N

The HyperManyMedia crawler depends on four stages: 1) injecting http://hypermanymedia.wku.edu into the WebDB; 2) generating the fetchlist from the Web Database; 3) fetching the content from the URLs in the domain of http://hypermanymedia.wku.edu; and 4) updating the Web Database on a weekly basis. The following section presents the performance evaluation of those stages.

B. HyperManyMedia Performance
To measure the performance of a search engine, we need to evaluate the crawling speed, more specifically, the time needed for the crawler to crawl the local server. This is evaluated as relative performance since it relies on calculating the time for: (1) injecting http://hypermanymedia.wku.edu into the WebDB, (2) generating a fetchlist from the WebDB in a new segment, (3) fetching the content from http://hypermanymedia.wku.edu into the fetchlist, and (4) updating the WebDB with all the links in the fetched pages.
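The inject/generate/fetch/updatedb vocabulary of these four stages matches an Apache Nutch-style crawl cycle. The sketch below illustrates that cycle with a simplified in-memory WebDB; FAKE_PAGES and the stage functions are hypothetical stand-ins for illustration, not the actual HyperManyMedia implementation:

```python
# Hypothetical in-memory stand-in for the crawler's WebDB.
# A tiny fake link graph simulates pages under the crawled server.
FAKE_PAGES = {
    "http://hypermanymedia.wku.edu": ["http://hypermanymedia.wku.edu/a",
                                      "http://hypermanymedia.wku.edu/b"],
    "http://hypermanymedia.wku.edu/a": ["http://hypermanymedia.wku.edu/c"],
    "http://hypermanymedia.wku.edu/b": [],
    "http://hypermanymedia.wku.edu/c": [],
}

def inject(webdb, url):
    """Stage 1: seed the WebDB with the root URL."""
    webdb.setdefault(url, "unfetched")

def generate(webdb):
    """Stage 2: build a fetchlist (a new 'segment') of unfetched URLs."""
    return [u for u, status in webdb.items() if status == "unfetched"]

def fetch(fetchlist):
    """Stage 3: fetch each URL; here we read the fake graph instead of HTTP."""
    return {u: FAKE_PAGES.get(u, []) for u in fetchlist}

def updatedb(webdb, fetched):
    """Stage 4: mark fetched pages and add newly discovered links."""
    for url, links in fetched.items():
        webdb[url] = "fetched"
        for link in links:
            webdb.setdefault(link, "unfetched")

def crawl(seed, depth):
    """One crawl round: repeat generate/fetch/updatedb up to 'depth' times."""
    webdb = {}
    inject(webdb, seed)
    for _ in range(depth):  # "Repeat Every Week ... Until depth = N"
        fetchlist = generate(webdb)
        if not fetchlist:
            break
        updatedb(webdb, fetch(fetchlist))
    return webdb

db = crawl("http://hypermanymedia.wku.edu", depth=3)
print(len(db))  # 4 pages discovered in the fake graph
```

Each pass widens the frontier by one link level, which is why the loop bound plays the role of the crawl depth N in Algorithm 1.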

Figure 4. HyperManyMedia Search Engine Performance

Algorithm 2 Measuring the Performance of the HyperManyMedia Search Engine
foreach weekend do
  re-crawl (http://hypermanymedia.wku.edu);
  inject = time (injecting http://hypermanymedia.wku.edu into the WebDB);
  generate = time (generating a fetchlist from the WebDB in a new segment);
  fetch = time (fetching the content from http://hypermanymedia.wku.edu into the fetchlist);
  updateDB = time (updating the WebDB with the links from the fetched pages);
endfor
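Algorithm 2 amounts to timing each of the four crawl stages on every weekly re-crawl. A minimal timing harness might look like the following; the stage functions here are hypothetical no-op placeholders standing in for the real inject/generate/fetch/updatedb operations:

```python
import time

def timed(stage_fn, *args):
    """Run one crawl stage and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = stage_fn(*args)
    return result, time.perf_counter() - start

# Hypothetical placeholder stages; the real ones operate on
# http://hypermanymedia.wku.edu and its WebDB.
def inject():               return "webdb"
def generate(webdb):        return ["fetchlist"]
def fetch(fetchlist):       return {"pages": len(fetchlist)}
def updatedb(webdb, pages): return True

timings = []
for iteration in range(3):  # one entry per weekly re-crawl iteration
    webdb, t_inject = timed(inject)
    fetchlist, t_generate = timed(generate, webdb)
    pages, t_fetch = timed(fetch, fetchlist)
    _, t_update = timed(updatedb, webdb, pages)
    timings.append({"inject": t_inject, "generate": t_generate,
                    "fetch": t_fetch, "updateDB": t_update})

print(len(timings))  # 3
```

Collecting one dictionary of stage timings per iteration is what produces the per-stage curves plotted in Figure 4.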

Algorithm 2 represents the methodology that we followed to test the crawling performance of HyperManyMedia. Figure 4 shows the relative performance results for the four stages (inject, generate, fetch, updateDB) over 100 iterations of re-crawling. We conclude the following: 1) Generating the fetchlist from the Web Database is stable in every iteration. 2) The performance of generating the fetchlist (Generate in Figure 4) is good, since it only takes ~24 seconds to generate the fetchlist from the Web Database. 3) Injecting http://hypermanymedia.wku.edu is also stable in every iteration. 4) The performance of injecting (Inject in Figure 4) http://hypermanymedia.wku.edu is good; it only takes ~27 seconds to inject. 5) Fetching the content from the URLs underneath http://hypermanymedia.wku.edu is not stable. 6) The performance of fetching (Fetch in Figure 4) reaches a maximum of ~33.46 minutes and a minimum of ~14.13 minutes, which is considered reasonably fast. 7) The performance of updating the Web Database (UpdateDB in Figure 4) reaches a maximum of ~11.58 minutes and a minimum of ~3.65 minutes, which is considered good. 8) In Figure 4, we notice that the performance of the HyperManyMedia crawler improved tremendously with respect to two operational stages: (1) fetching the content from the URLs underneath http://hypermanymedia.wku.edu, and (2) updating the Web Database. This improvement started at iteration #70. Looking back at the crawler log files, we link this improvement to changes that we made to the architectural design of HyperManyMedia, related to the method of implementing collaborative filtering. We noticed that using the K-Nearest Neighbor classifier had a negative impact, manifested as an “OutOfMemoryError: Java heap space” problem in the Tomcat server. This problem is more evident when more than 500 users are logged into the system at the same time. Therefore, we enhanced our system by changing the collaborative filtering approach to rely on the User-to-User Fast

XOR bit Operation Method. The impact of this method was evident in two aspects: 1) increasing the speed of providing recommendations to the user, and 2) increasing the speed of the overall crawling performance.

V. CONCLUSION
In this paper, we described the evaluation of the scalability of a hybrid personalized retrieval model. This evaluation is measured based on the performance of the search engine's crawling. To measure the performance of a search engine, we evaluated the crawling speed. We found that generating the fetchlist from the Web Database takes only ~24 seconds; injecting http://hypermanymedia.wku.edu is good, taking only ~27 seconds; fetching the content from the URLs under http://hypermanymedia.wku.edu reaches a maximum of ~33.46 minutes and a minimum of ~14.13 minutes, which is considered reasonably fast; whereas updating the Web Database reaches a maximum of ~11.58 minutes and a minimum of ~3.65 minutes, which is considered good. We also noticed that using the K-Nearest Neighbor classifier had a negative impact, manifested as an “OutOfMemoryError: Java heap space” problem in the Tomcat server, which is more evident when more than 500 users are logged into the system at the same time. Therefore, we enhanced our system by changing the collaborative filtering approach to the User-to-User Fast XOR bit Operation Method. The impact of this method was evident in two crucial aspects: increasing the speed of providing recommendations to the user, and increasing the speed of the overall crawling performance. Overall, the crawling performance of HyperManyMedia is considered good.

REFERENCES
[1] A. Polleres and D. Huynh, “Editorial: Special issue: The Web of Data,” Web Semantics: Science, Services and Agents on the World Wide Web, vol. 7, no. 3, p. 135, 2009.
[2] L. Aroyo and D. Dicheva, “The new challenges for e-learning: The educational semantic web,” Educational Technology & Society, vol. 7, no. 4, pp. 59–69, 2004.
[3] V. Devedžić, “Web Intelligence and Artificial Intelligence in Education,” Educational Technology & Society, vol. 7, no. 4, pp. 29–39, 2004.
[4] S. Yang, I. Chen, and N. Shao, “Ontology Enabled Annotation and Knowledge Management for Collaborative Learning in Virtual Learning Community,” Educational Technology & Society, vol. 7, no. 4, pp. 70–81, 2004.
[5] E. Moreale and M. Vargas-Vera, “Semantic Services in e-Learning: an Argumentation Case Study,” Educational Technology & Society, vol. 7, no. 4, pp. 112–128, 2004.

[6] N. Henze, P. Dolog, and W. Nejdl, “Reasoning and Ontologies for Personalized E-Learning in the Semantic Web,” Educational Technology & Society, vol. 7, no. 4, pp. 82–97, 2004.
[7] S. Roberson and D. Dicheva, “Semi-automatic ontology extraction to create draft topic maps,” in Proceedings of the 45th Annual Southeast Regional Conference, pp. 100–105, ACM, New York, NY, USA, 2007.
[8] C. Dichev, D. Dicheva, and J. Fischer, “Identity: How To Name It, How To Find It,” in Workshop on I3: Identity, Identifiers, Identification, 16th Intl. Conference on World Wide Web (WWW 2007), May 8–12, 2007.
[9] L. Zhuhadar, O. Nasraoui, and R. Wyatt, “Metadata as seeds for building an ontology driven information retrieval system,” Int. J. Hybrid Intell. Syst., vol. 6, no. 3, pp. 169–186, 2009.
[10] L. Zhuhadar, O. Nasraoui, and R. Wyatt, “Dual representation of the semantic user profile for personalized web search in an evolving domain,” in Proceedings of the AAAI 2009 Spring Symposium on Social Semantic Web: Where Web 2.0 Meets Web 3.0, pp. 84–89, 2009.
[11] L. Zhuhadar and O. Nasraoui, “Semantic information retrieval for personalized e-learning,” in Tools with Artificial Intelligence, 2008. ICTAI '08. 20th IEEE International Conference on, vol. 1, pp. 364–368, Nov. 2008.
[12] L. Zhuhadar, O. Nasraoui, and R. Wyatt, “Multi-model ontology-based hybrid recommender system in e-learning domain,” in Web Intelligence and Intelligent Agent Technology, IEEE/WIC/ACM International Conference on, vol. 3, pp. 91–95, 2009.
[13] L. Zhuhadar, O. Nasraoui, and R. Wyatt, “Multi-language ontology-based search engine,” in Advances in Computer-Human Interactions, 2010. ACHI '10. Third International Conference on, 2010.
[14] L. Zhuhadar, O. Nasraoui, and R. Wyatt, “Visual Ontology-Based Information Retrieval System,” in Proceedings of the 2009 13th International Conference on Information Visualisation, pp. 419–426, IEEE Computer Society, 2009.

Leyla Zhuhadar received the Ph.D. degree in Computer Engineering and Computer Science from the University of Louisville, Louisville, in 2009. Currently, she is a Research Scientist at Western Kentucky University and an Adjunct Assistant Professor at the Department of Computer Engineering and Computer Science (CECS), at the University of Louisville. Her research interests are in Knowledge Acquisition from the Web, Information Retrieval, Ontology Engineering, Semantic Web, Metadata for Accessible Learning Objects. She designed and implemented two working research platforms in the e-learning domain, HyperManyMedia and the Semantic Repository. She is a member of IEEE, IEEE Women in Engineering, IEEE Computer Society, IEEE Education Society, the ACM, SIGKDD, SIGACCESS, SIGIR, Web Intelligence Consortium (WIC), and the AIED.

Olfa Nasraoui is the founding Director of the Knowledge Discovery and Web Mining Laboratory, at the University of Louisville, where she is also an Associate Professor of Computer Engineering and Computer Science and the Endowed Chair of e-Commerce. She received the Ph.D. degree in Computer Engineering and Computer Science from the University of Missouri, Columbia, in 1999. From 2000 to 2004, she was an Assistant Professor at the University of Memphis. Her research interests include data mining, Web mining, stream data mining, and computational intelligence. She is a member of IEEE, IEEE Women in Engineering, and in the last 10 years, has been active in the SIGKDD community, notably by organizing the WebKDD workshop on Web Mining and by serving as Vice-Chair on Data Mining conferences, including KDD 2009, ICDM 2009, and WI 2009. She is a recipient of a US National Science Foundation Faculty Early Career Development (CAREER) Award, and a Best Paper Award in the Artificial Neural Networks in Engineering Conference. She has published more than 100 publications, and acquired close to $2M in funding for research from NSF, NASA and other agencies.
