Metadata Domain-knowledge Driven Search Engine in "HyperManyMedia" E-learning Resources

Leyla Zhuhadar
University of Louisville, Dept. of Computer Engineering and Computer Science, Louisville, KY 40292, USA

Olfa Nasraoui
University of Louisville, Dept. of Computer Engineering and Computer Science, Louisville, KY 40292, USA

Robert Wyatt
Western Kentucky University, Dept. of Biology, Bowling Green, KY 42101, USA

[email protected] [email protected] [email protected]

ABSTRACT

In this paper, we exploit the synergies between Information Retrieval and E-learning by describing the design of a system that applies "Information Retrieval", in the context of the Web, to "E-learning". With the exponential growth of the web, we have noticed that the general-purpose character of web applications has started to diminish, while more domain-specific and personal aspects have started to rise, e.g., the trend toward personalized web pages, users' histories of browsing and purchasing, and topical/focused search engines. The explosion in the amount of information on the web makes it difficult for online students to find specific information in a specific media format unless a prior analysis has been made. In this paper, we present a metadata domain-driven search engine that indexes text, powerpoint, audio, video, podcast, and vodcast lectures. These lectures are stored in a prototype "HyperManyMedia" E-learning web-based platform, and each lecture in this platform has been tagged with metadata using the domain knowledge of these resources.
Figure 1: “Colliding Web Sciences” Diagram
Categories and Subject Descriptors
H.3.3 [Information Systems]: Information Storage and Retrieval - Information Search and Retrieval

General Terms
Design, Algorithms, Experimentation

Keywords
Information Retrieval, Metadata, Search Engine
1. INTRODUCTION
In the past 30 years, Information Retrieval (IR) was considered a field of interest mainly to librarians and information experts. With the invention of the World Wide
Web (WWW), IR evolved and new techniques started to develop, such as tools and applications for retrieving data in the context of the web (in this paper, we concentrate on the development of search engines in IR). Nowadays, the Web, which can be considered an aggregate of huge hyperlinked digital libraries, contains billions of dynamic resources. Searching these libraries for information, which could be a document, an image, an audio or a video file, a podcast, a vodcast, a blog, etc., can be a difficult task in IR. Therefore, new techniques emerged in IR to increase the quality of searching. These techniques focused on three areas of interest: the capability of handling new indexing formats (which allows the indexing of images, audio, video, RSS, etc.), enhancing the indexing to achieve a faster response time for queries, and improving the quality of information retrieval systems.

The work presented here falls within a new sector of the "Colliding Web Sciences" (Figure 1). This sector merges part of Information Retrieval (in the context of the web) with part of Education (E-learning). It combines E-learning with the following: indexing and searching, query languages, text and multimedia languages, metadata, modeling, user interfaces, and visualization. This sector (IR for E-learning) would not have evolved without the invention of the WWW. Only minimal work has been done on proactively merging Information Retrieval (IR) with Education, such as [1, 11], which is not directly related to E-learning domain research. Why is there a need for a specific thrust, such as "IR for E-learning"? The answer is simple: in order to respond accurately to an online student's query for a specific resource (a lecture) from a huge repository, to better understand online students' behaviors when using an IR system such as a search engine, and to analyze the effect of the design and development of a new search engine on online students, we should tap into knowledge of educational pedagogy and of how students in general, and online students in particular, navigate a search engine. This knowledge helps build a meaningful understanding of their searching patterns. The need for this synergy arises from the fact that the number of online students has been growing significantly: nearly twenty percent of all U.S. higher education students (almost 3.5 million) were taking at least one online course during the fall 2006 term, nearly a 10 percent increase over the number reported the previous year, and this 9.7 percent growth rate for online enrollments far exceeds the 1.5 percent growth of the overall higher education student population. These facts are based on a survey that represents the fifth annual report on the state of online learning in U.S. higher education (http://www.sloan-c.org/publications/survey/survey07.asp).
2. PREVIOUS WORK
Web search engines have their origins in Information Retrieval (IR) systems, starting from a keyword index over a predefined corpus and ending with ranked documents returned in response to queries. The traditional relevance-ranking technology of IR systems, which succeeded in retrieving information inside categorized libraries (using long queries), failed in web search engines (which face short queries) because of the structure of the web graph and the profusion, redundancy, misrepresentation, and non-authoritative nature of web content [7]. This realization motivated two inventions: the "PageRank" algorithm [5], a scoring algorithm implemented by Google (http://www.google.com) in 1997, and the "HITS" (hyperlink induced topic search) algorithm [18]. The basic tasks performed by a search engine fall into two phases: crawling the web and indexing the fetched pages. The crawler's job ends with creating a repository of web content. This repository can be used in different ways, for example, in building a keyword index, classifying documents into a topic directory like Yahoo! (http://www.yahoo.com), or social network analysis [8]. Search engines, despite their promising ranking algorithms, still cannot excel at retrieving the most relevant webpages, due in part to the fact that the search terms used in queries are very short compared to the size of the web, which is relatively immense. Moreover, conventional search engines have little support for adapting to the background of individual web surfers.

Recently, two notions of search engines have started to gain popularity: the metadata search engine and the focused/topical search engine. The metadata search engine is based on a metadata structure, or "machine understandable information about web resources" (http://www.w3.org/DesignIssues/Metadata). Metadata search engines have been studied intensively in [19, 28, 10, 12, 17, 20]. On the other hand, focused/topical search engines were introduced for the first time in [9], and a great deal of related research has been presented in [10, 13, 23, 25, 26, 30, 32]. Our approach differs from the previous ones inasmuch as it uses hybrid "metadata" and "domain knowledge" mechanisms. Additionally, our search engine is capable of indexing and fetching resources in many different media formats: text, powerpoint, audio, video, podcast, and vodcast. Since our search engine is built on top of the domain knowledge (of E-learning), this knowledge representation was extracted from the subject area, which combines topics and media formats of online resources. Finally, we mapped this knowledge into metadata to enhance the Educational (E-learning) search mechanism.

We have organized the paper as follows. Section 3 introduces our motivations and contributions. Section 4 presents our methodology and implementation of a metadata search engine. Section 5 presents our results and evaluation measures. Section 6 summarizes our conclusions and future work. Section 7 contains the acknowledgments, and Section 8 the references.
3. MOTIVATIONS AND CONTRIBUTIONS
Western Kentucky University (http://www.wku.edu) hosts a "HyperManyMedia" open-source repository of lectures (http://blog.wku.edu/podcasts); "HyperManyMedia" refers to any educational material on the web (hyper) in a format that could be a multimedia format (image, audio, video, podcast, vodcast) or a text format (webpage, powerpoint). Hundreds of online lectures are available in different formats: text, powerpoint, audio, video, podcast, vodcast, and RSS. This web-based platform is a main medium of communication between WKU online faculty and online students. Over the last two years, the number of lectures added to this platform has grown significantly, and the usage of its resources by online students has increased considerably. This led to several problems:

1. Searching for a specific college, course name, topic, or media format is time consuming, and the results are not always accurate.

2. Searching for combinations of results is impossible (e.g., finding all video lectures in the business college related to accounting).

Our contribution in this paper is the design and implementation of a search platform that overcomes these limitations. The search engine Nutch (http://lucene.apache.org/nutch/) [6] was embedded in the "HyperManyMedia" platform so that online students can search for specific formats of lectures more easily and more efficiently. We implemented this search engine in two phases (first as "generic", and later as "metadata").
4. METHODOLOGY AND IMPLEMENTATION
4.1 System Architecture
Figure 2: Overall System Architecture

Figure 2 illustrates our system architecture, which can be divided into the following phases:

1. Domain-knowledge Extraction
2. Parsing Learning Objects (Lectures) and Adding Metadata
3. Re-configuring a Search Engine for Multimedia Indexing and Querying
4. Encapsulating the Metadata Search Engine within the Platform
4.2 System Implementation

1. Domain-knowledge Extraction

As of November 2007, the "HyperManyMedia" server contained more than 400 lectures from 11 different colleges: "English", "Social Work", "History", "Chemistry", "Accounting", "Math", "Management", "Consumer and Family Sciences", "Architect and Manufacturing Sciences", "Engineering", and "Communication Disorders". Each lecture is delivered in six different media formats: text, powerpoint, audio, video, podcast, and vodcast. Each lecture is a learning object used in online courses taught by different professors. Information about each lecture was extracted and saved. This phase was done manually, since the knowledge of each resource in the platform can only come from a web designer, web developer, or multimedia editor. Tagging this information inside the learning objects (lectures) is described next.

2. Parsing Learning Objects (Lectures) and Adding Metadata

All webpages (lectures) located on the server were parsed using a Java program, which scans a webpage to find the specific location for the metadata (between the title and the beginning of the webpage). The metadata describes the following criteria: college name, course name, professor name, lecture name, and media format type. A minimal sketch of such a parser is shown below.
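As an illustration of this step, the following is a minimal sketch of such a parser, assuming the metadata is embedded as HTML meta tags between the title and the beginning of the page body; the tag names (college, course, professor, lecture, format), the regular-expression extraction, and the file name are our illustrative assumptions, not the exact program used in the platform.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch of a lecture-page parser that pulls domain metadata out of meta tags. */
public class LectureMetadataParser {

    // Matches e.g. <meta name="college" content="English">; tag names are illustrative.
    private static final Pattern META = Pattern.compile(
        "<meta\\s+name=\"(college|course|professor|lecture|format)\"\\s+content=\"([^\"]*)\"",
        Pattern.CASE_INSENSITIVE);

    /** Returns the metadata fields found in one lecture webpage. */
    public static Map<String, String> parse(String html) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = META.matcher(html);
        while (m.find()) {
            fields.put(m.group(1).toLowerCase(), m.group(2));
        }
        return fields;
    }

    public static void main(String[] args) throws IOException {
        // The path is hypothetical; each lecture page on the server would be parsed this way.
        String html = new String(Files.readAllBytes(Paths.get("lecture.html")));
        parse(html).forEach((name, value) -> System.out.println(name + " = " + value));
    }
}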
3. Re-configuring Nutch

- A Brief Overview of Nutch

Nutch (http://www.nutch.org) is an open-source search engine based on Apache Lucene (http://lucene.apache.org/java/docs/), a high-performance, full-featured text search engine library, written entirely in Java, that provides scalable indexing and searching capabilities. Nutch searches and indexes content with a powerful fetcher (crawler robot), which is designed to handle crawling, indexing, and searching of several billion frequently updated web pages. It has a modular architecture, which permits developers to design and embed plugins for media parsing, data retrieval, querying, and clustering. Nutch supports different types of fields, such as keywords, text, metadata, etc. We used the last of these to add our own domain-specific fields, such as college names, lecture names, and media types. Nutch can parse many different file formats, such as html, php, doc, pdf, rtf, etc. We also wrote our own plugins that add metadata as additional fields. Nutch has been used in many research applications, both as a retrieval system in digital libraries [27, 29, 19] and as a web search engine [24, 3].

- Nutch Implementation

The Nutch scoring algorithm is inherited from Apache Lucene, which is based on a combination of the Vector Space Model (VSM) and the Boolean Model. It applies the Boolean Model first to select the most relevant documents for the query; then it uses the Vector Space Model as a content-based ranking algorithm. The score of query q for document d is related to the cosine-distance similarity (1) between the document and query vectors in the VSM:

\[ \cos(x, x') = \frac{x^T \cdot x'}{|x| \cdot |x'|} = \frac{x^T \cdot x'}{\sqrt{x^T x} \cdot \sqrt{x'^T x'}} \qquad (1) \]

where x, x' \in \mathbb{R}^{|V|} are the vector-space representations of two documents, T denotes the transpose operator, and x^T \cdot x' is the dot product of the two vectors. Nutch applies several refinements to the VSM by extending the Boolean vector model and adding weights associated with terms and fields, as shown in (2). Nutch's score is the sum, over each term t of the query, of the product of the following factors: the term's tf, its idf, and its index-time boost:

\[ score(q, d) = coord(q, d) \times queryNorm(q) \times \sum_{t \in q} \left( tf(t \in d) \times idf(t)^2 \times t.getBoost() \times norm(t, d) \right) \qquad (2) \]
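To make the vector-space component of this score concrete, the following self-contained snippet computes the cosine similarity of (1) for two toy term-frequency vectors; the vocabulary and the counts are invented for the example and have nothing to do with Nutch internals.

/** Toy illustration of the cosine similarity in (1) over term-frequency vectors. */
public class CosineSimilarity {

    static double cosine(double[] x, double[] xPrime) {
        double dot = 0.0, normX = 0.0, normXPrime = 0.0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * xPrime[i];          // x^T . x'
            normX += x[i] * x[i];             // x^T . x
            normXPrime += xPrime[i] * xPrime[i];
        }
        return dot / (Math.sqrt(normX) * Math.sqrt(normXPrime));
    }

    public static void main(String[] args) {
        // Vocabulary: [english, lecture, video]; the counts are made up for the example.
        double[] doc = {3, 1, 0};
        double[] query = {1, 0, 0};
        System.out.printf("cos = %.3f%n", cosine(doc, query)); // 3 / sqrt(10), approx. 0.949
    }
}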
Figure 3: Modified Nutch Architecture
- Modified Nutch Search Engine’s Boosting Mechanism
We changed Nutch's boosting algorithm to accommodate metadata, knowing that Nutch uses (3):

\[ Boost = \alpha(url) + \beta(anchor) + \gamma(content) + \delta(title) \qquad (3) \]

The original boosting weights are shown in Table 1.

Table 1: Boosting fields
Field:  URL   Anchor   Content   Title
Boost:  4.0   2.0      1.0       1.0
We modified Nutch's boosting score as shown in (4):

\[ ModifiedBoost = \alpha(url) + \beta(anchor) + \gamma(content) + \delta(title) + \epsilon(metadata) \qquad (4) \]

Table 2 shows the modified boosting weights after adding the metadata.

Table 2: Modified boosting fields
Field:  URL   Anchor   Content   Title   Metadata
Boost:  4.0   2.0      1.0       1.0     5.0

The metadata boosting weight was therefore \epsilon(metadata) = 5 in (4). The methodology used to add this boosting weight is described in the next section; its net effect on a matching page is illustrated in the sketch below.
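The following toy sketch contrasts (3) and (4) using the weights of Tables 1 and 2; the boolean field-match flags are hypothetical inputs, since the real computation happens inside Nutch's scoring rather than in standalone code like this.

/** Toy comparison of the original boost (3) and the modified boost (4). */
public class BoostComparison {

    // Weights from Tables 1 and 2.
    static final double URL = 4.0, ANCHOR = 2.0, CONTENT = 1.0, TITLE = 1.0, METADATA = 5.0;

    /** Boost of equation (3): generic engine, no metadata field. */
    static double originalBoost(boolean inUrl, boolean inAnchor, boolean inContent, boolean inTitle) {
        return (inUrl ? URL : 0) + (inAnchor ? ANCHOR : 0)
             + (inContent ? CONTENT : 0) + (inTitle ? TITLE : 0);
    }

    /** Boost of equation (4): the metadata engine adds the metadata field. */
    static double modifiedBoost(boolean inUrl, boolean inAnchor, boolean inContent,
                                boolean inTitle, boolean inMetadata) {
        return originalBoost(inUrl, inAnchor, inContent, inTitle) + (inMetadata ? METADATA : 0);
    }

    public static void main(String[] args) {
        // A query term matching only page content vs. one that also matches a metadata field.
        System.out.println("content only:       " + originalBoost(false, false, true, false));       // 1.0
        System.out.println("content + metadata: " + modifiedBoost(false, false, true, false, true)); // 6.0
    }
}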
- Designing and Embedding the Parser, Indexer, and QueryFilter Plugins
Figure 3 illustrates the changes that we made to Nutch's architecture. Plugins were embedded inside Nutch to automatically add the additional (metadata) fields, which enhances Nutch's capability of indexing and searching for metadata-based retrieval. Four plugins were added (colleges, courses, professors, and media formats). For each plugin, three programs were embedded to handle parsing, indexing, and query filtering, respectively (e.g., CollegeParser, CollegeIndexer, and CollegeQueryFilter); a schematic view of one such plugin is sketched below.
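The following schematic shows how one such plugin trio fits together conceptually; the methods below are our simplification for illustration and are not Nutch's actual extension-point interfaces.

import java.util.HashMap;
import java.util.Map;

/**
 * Schematic view of one field plugin (here "college"); the three methods mirror
 * CollegeParser, CollegeIndexer, and CollegeQueryFilter, but the signatures are
 * our simplification, not Nutch's real extension-point API.
 */
public class CollegePluginSketch {

    static final String FIELD = "college";

    /** Parser step: pull the field value out of the parsed page metadata. */
    static String parse(Map<String, String> pageMetadata) {
        return pageMetadata.get(FIELD); // e.g., "English"
    }

    /** Indexer step: attach the value to the index document as an extra field. */
    static void index(Map<String, String> indexDocument, String value) {
        if (value != null) {
            indexDocument.put(FIELD, value);
        }
    }

    /** Query-filter step: report whether a query term matches the indexed field. */
    static boolean matches(Map<String, String> indexDocument, String term) {
        return term.equalsIgnoreCase(indexDocument.get(FIELD));
    }

    public static void main(String[] args) {
        Map<String, String> page = new HashMap<>();
        page.put(FIELD, "English");
        Map<String, String> doc = new HashMap<>();
        index(doc, parse(page));
        System.out.println(matches(doc, "english")); // true
    }
}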
4. Encapsulating the Metadata Search Engine within the "HyperManyMedia" Platform

The two interfaces of the "Generic" and "Metadata" search engines are shown in Figure 4 and Figure 5, respectively.

Figure 4: Generic Search Engine

Figure 5: Metadata Search Engine

For example, the "Generic" search engine responded to the query "English" by retrieving relevant pages based on the boosting in (3). Figure 6 shows the scoring boost for the query "English", which is equal to 1.0; in this case, the "Generic" search engine only used the content boosting weight, \gamma(content) = \gamma("English") = 1.0.

Figure 6: "Generic" Search Engine Score for the Query "English"
For the metadata field college = "English", the boosting factor equals \epsilon(metadata) = \epsilon("English") = 5.0, as shown in Figure 7.

Figure 7: "Metadata" Search Engine Score for the Query "English"

Since all lectures in this open-source platform were parsed and had metadata embedded, any search for a metadata term will bring up the related topic. This part is more closely tied to the knowledge of the E-learning domain, and more specifically to the knowledge of our online resources (domain-specific).
5. RESULTS AND EVALUATION MEASURES
5.1 Evaluation Methodology
5.1.1 Research Questions:
1. Will there be an increase in precision when using the metadata search engine compared to the generic search engine?
2. Will relevant documents be ranked higher when using the metadata search engine?
5.1.2 Selection of Queries:
The first step was selecting queries. A great deal of research on search engine queries has found that searchers rarely use Boolean operators [4]; typically, this usage is around 10% [15]. Another study [16] observed that the number of terms in queries is most commonly between 1 and 3, and that queries are primarily noun phrases. Accordingly, we ran our comparison between the two search engines (generic and metadata) based on "single-term", "two-term", and "three-term" queries without Boolean operators.

We selected specific queries from the query logs, which contain the queries submitted to our "HyperManyMedia" search engine during two semesters (the fall and winter terms of 2007-2008). These queries represent usage of the search engine by online students. First, the query log file was cleaned of irrelevant data and misspelled terms, since we did not want these queries to skew our results. Second, only the "Top_list" of most frequent queries was extracted. Finally, these terms were ordered in descending order of frequency.

Subjectivity is one of the elements that should be considered when evaluating a query. A web resource may be related not only to a query, but also to a user's nation, age, gender, career, culture, hobby, etc. [31]. A study of user satisfaction could be conducted to support the evaluation of the retrieved results. In our case, the search engine is used primarily by online students who know what kind of materials they are looking for (this knowledge is provided by their online faculty), for example, the college name, course name, faculty name, lecture title, etc. Therefore, our evaluation methodology relies on the online students' domain knowledge of the resources that they are looking for, and subjective factors have a limited influence on our method.
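The "Top_list" extraction described above amounts to frequency counting over the cleaned log. A minimal sketch, assuming one query per line in a plain-text log file (the file name and format are our assumptions):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

/** Sketch: extract the "Top_list" of most frequent queries from a cleaned query log. */
public class TopListExtractor {

    public static Map<String, Long> topList(String logFile, int limit) throws IOException {
        return Files.lines(Paths.get(logFile))
                .map(q -> q.trim().toLowerCase())
                .filter(q -> !q.isEmpty())   // removal of misspelled/irrelevant entries
                                             // is assumed to happen beforehand
                .collect(Collectors.groupingBy(q -> q, Collectors.counting()))
                .entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
                .limit(limit)                // keep only the most frequent queries
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                                          (a, b) -> a, LinkedHashMap::new));
    }

    public static void main(String[] args) throws IOException {
        // 100 matches the size of the top_list used in our experiments.
        topList("query_log.txt", 100).forEach((q, n) -> System.out.println(n + "\t" + q));
    }
}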
5.2 Precision:

Precision is the ratio of the number of relevant documents to the number of all retrieved documents, as shown in (5):

\[ precision = \frac{number\ of\ relevant\ documents}{number\ of\ retrieved\ documents} \times 100 \qquad (5) \]

where the number of retrieved documents is equal to the sum of the relevant and irrelevant retrieved documents.
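For completeness, (5) in code, with hypothetical relevance counts as input:

/** Precision per equation (5), as a percentage. */
public class PrecisionMeasure {

    static double precision(int relevantRetrieved, int totalRetrieved) {
        return 100.0 * relevantRetrieved / totalRetrieved;
    }

    public static void main(String[] args) {
        // Hypothetical judgment: 8 of 10 retrieved lectures judged relevant.
        System.out.println(precision(8, 10) + "%"); // 80.0%
    }
}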
5.3 Selection of Ranking Algorithm:
This step is related to our evaluation algorithm. Studies show that approximately 80% of web searchers never click on more than ten results [14], which indicates the importance of search engines that can retrieve the most relevant pages within the top ten. Most ranking algorithms evaluate ranking quality based on precision and recall; a detailed review of ranking algorithms can be found in [31]. One limitation of the recall measure is the difficulty of counting the number of relevant documents in the corpus. We therefore used SEREET (Search Engine Ranking Efficiency Evaluation Tool), a new ranking-efficiency algorithm recently proposed in [2]. This algorithm evaluates the performance of search engines by comparing the order of the relevant documents with the order of the retrieved documents: it starts with 100 points at the top of the rank and deducts points each time a relevant document is not found.

\[ SEREET = \frac{\sum_i W_i}{n} \times 100\% \qquad (6) \]

where

\[ W_i = \begin{cases} 100 & \text{if at the top of the rank} \\ 100 & \text{if no miss is found} \\ p.p. - \frac{\#misses}{rank\ length} \times 100\% & \text{otherwise} \end{cases} \]
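A minimal sketch of our reading of (6): each ranked list starts at 100 points, a penalty of 100 divided by the rank length is deducted for each miss, and the per-query values are averaged over the n queries. The interpretation of "p.p." as the running point value carried over from the previous position is our assumption; [2] remains the authoritative definition.

/**
 * Sketch of the SEREET ranking-efficiency measure (6), under our reading of [2]:
 * start at 100 points and deduct 100/rankLength for every miss while walking the rank.
 */
public class Sereet {

    /** misses[i] == true means the i-th rank position missed a relevant document. */
    static double sereetForOneQuery(boolean[] misses) {
        double points = 100.0;                  // top of the rank
        double penalty = 100.0 / misses.length; // one #misses step, weighted by rank length
        for (boolean miss : misses) {
            if (miss) {
                points -= penalty;              // p.p. (previous points) minus the penalty
            }
        }
        return points;
    }

    /** Average SEREET over the n query lists, as in (6). */
    static double sereet(boolean[][] perQueryMisses) {
        double sum = 0.0;
        for (boolean[] misses : perQueryMisses) {
            sum += sereetForOneQuery(misses);
        }
        return sum / perQueryMisses.length;
    }

    public static void main(String[] args) {
        // Two hypothetical queries over top-10 lists: one perfect, one with two misses.
        boolean[] perfect = new boolean[10];
        boolean[] twoMisses = new boolean[10];
        twoMisses[3] = twoMisses[7] = true;
        System.out.println(sereet(new boolean[][] { perfect, twoMisses })); // (100 + 80) / 2 = 90.0
    }
}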
We used the Precision and SEREET ranking measures to compare the two search engines. We submitted each of the three query sets, drawn from the 100 "top_list" queries, to both search engines, after removing operators and misspelled terms. Then we recorded the number of retrieved pages and the number of relevant pages. We repeated the process for all the queries, and ordered the results by query frequency in each category: "single-term", "two-term", and "three-term" queries. The reviewers were students from different colleges, and the queries assigned to them matched their domain knowledge of the resources: these reviewers worked in our "Distance Learning" division, where their job was to create these resources and add them to the platform. The reviewers recorded the order of the retrieved documents and the order of the relevant documents; once both orders were recorded, SEREET was computed for each query on both search engines.
5.4 Results:
Table 3: Overall Precision Results
Query      Single-term   Two-term   Three-term
Generic    0.619         0.717      0.851
Metadata   0.810         0.856      0.925

Figure 8: Overall precision
We present the results in terms of precision and ranking first, with the conclusion following in the next section.
5.4.1 Precision Results:
For each of the 100 top_list queries submitted to both search engines ("Generic" and "Metadata"), we recorded the retrieved and the relevant documents, in addition to the order of their ranking. We repeated the experiment three times, for "single-term", "two-term", and "three-term" queries. We considered compound phrases like "communication disorders" as a single term, since the two words are related. We then compared the two search engines to determine whether there were any significant differences in precision.

In the first experiment, we submitted "single-term" queries. There was a significant difference between the "Generic" and "Metadata" search engines in terms of precision: the "Metadata" search engine outperformed the "Generic" one, and in only 5 cases out of 100 did the "Generic" search engine perform slightly better. In the second experiment, we submitted "two-term" queries; again there was a significant difference in precision, with the "Metadata" search engine outperforming the "Generic" one even more than in the "single-term" case. There were only 3 cases out of 100 where the "Generic" engine outperformed the "Metadata" one. In the third experiment, we submitted "three-term" queries; once more, the "Metadata" search engine significantly outperformed the "Generic" one, with the "Generic" engine performing slightly better in only 4 cases out of 100.

Table 3 presents the overall precision results for the three experiments. The average precision for each experiment indicates that the "Metadata" search engine outperformed the "Generic" search engine. Given that the reviewers were aware of the "Metadata" mechanism used by the search engine to retrieve the lectures, and of how the metadata was designed, there is a possibility of bias in this analysis; however, the students who use this platform are also aware of this mechanism. Our goal was to compare and contrast the performance of both search engines on the queries most frequently used by our online students. We conclude that the metadata search engine increases precision across all numbers of query terms. This in turn answers our first research question: will there be an increase in precision when using the metadata search engine? We found that the metadata-driven search engine has a significant impact on precision, with overall precision values of 0.810 (single-term queries), 0.856 (two-term queries), and 0.925 (three-term queries), compared to 0.619, 0.717, and 0.851, respectively, for the generic search engine, as shown in Figure 8.

The generic search engine achieved better precision, by a very small margin of approximately 3-6%, on some queries for which we have no metadata. In these cases, the reviewers selected the metadata closest to the query terms, which skewed the results. However, this encouraged us in two directions: first, to consider adding new metadata to our platform; and second, to support our idea of the need for E-learning ontologies. For example, the problem of synonymous words could be reduced if we had a semantic search engine; this will be considered in our future work. Investigating user queries could also be considered as additional research that our study did not explore; nevertheless, the results gave us some ideas about how to increase the performance of our metadata search engine based on the query patterns of our online students.
5.4.2 SEREET Ranking Results:
Our second research question concerned the effect of metadata on the ranking of relevant documents. Table 4 and Figure 9 show a significant improvement in SEREET values when using the metadata search engine. For "single-term" queries, the results were based on the comparison of the ranking orders of the retrieved documents and the relevant documents, as explained in the section "Selection of Ranking Algorithm". Most of the SEREET values were equal to 1.0, which demonstrates the significance of using metadata for retrieving the relevant documents in general; there were only 6 cases out of 100 where the generic search engine outperformed the metadata one. For "two-term" queries, we again observed a significant improvement in SEREET values for the metadata search engine compared to the generic one, with only 3 cases out of 100 where the generic engine performed better. Finally, for "three-term" queries, we once more observed a significant improvement in SEREET values for the metadata search engine, with only 4 cases out of 100 where the generic engine performed better.

We conclude that the metadata search engine increases ranking performance across all the numbers of query terms that we used. This in turn answers our second research question: will relevant documents be ranked higher when using a metadata search engine? We found that the metadata-driven search engine has a significant impact on ranking performance, with overall SEREET values of 0.803 (single-term queries), 0.846 (two-term queries), and 0.914 (three-term queries), compared to 0.597, 0.684, and 0.834, respectively, for the generic search engine, as shown in Figure 9. However, the generic search engine achieved better SEREET values, by a very small margin of approximately 3-5%, on some queries. Table 4 and Figure 9 present the overall SEREET values for "single-term", "two-term", and "three-term" queries; on average, they all show significantly better performance for the metadata search engine.

Table 4: Overall SEREET Results
Query      Single-term   Two-term   Three-term
Generic    0.597         0.684      0.834
Metadata   0.803         0.846      0.914

Figure 9: Overall SEREET
6. CONCLUSION AND FUTURE WORK
In this work, we presented a metadata domain-knowledge driven search engine for "HyperManyMedia" E-learning resources. Our precision and SEREET ranking results showed a significant improvement in retrieving resources relevant to the submitted queries when we used the metadata search engine. The evaluation of these results was based on our domain knowledge of these resources.

Our current research focuses on a hybrid metadata and semantically enriched search engine built on top of the domain knowledge (of E-learning), where the knowledge representation is extracted from the subject area. This work is divided into five phases: (1) building the semantic e-learning domain using the known college and course information as concepts and sub-concepts in a lecture ontology, (2) generating the semantic learner's profile as an ontology from navigation logs that record which lectures have been accessed, (3) clustering the documents to discover more refined sub-concepts (top terms in each cluster) than provided by the available college and course taxonomy, (4) re-ranking the learner's search results based on the matching concepts in the learning content and the user profile, and (5) providing the learner with semantic recommendations during the search process, in the form of terms from the closest matching clusters of their profile. A more detailed description of this work is presented in [21, 22]. Our recent experimental results show that the learner's context can be effectively used to improve precision and recall in e-learning search, particularly by re-ranking the search results based on the learner's past activities. Our future work will focus on visualizing online student communities with their associated learning objects and relationships.
7. ACKNOWLEDGMENTS
This work is partially supported by National Science Foundation CAREER Award IIS-0133948 to Olfa Nasraoui.
8. REFERENCES
[1] EMME '07: Proceedings of the International Workshop on Educational Multimedia and Multimedia Education. ACM, New York, NY, USA, 2007.
[2] Wadee S. Alhalabi, Miroslav Kubat, and Moiez Tapia. Search engine ranking efficiency evaluation tool. SIGCSE Bulletin, 39(2):97-101, 2007.
[3] Klaus Berberich, Srikanta Bedathur, Thomas Neumann, and Gerhard Weikum. A time machine for text search. In SIGIR '07: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 519-526. ACM, New York, NY, USA, 2007.
[4] Christine L. Borgman. Social aspects of digital libraries (working session). In DL '96: Proceedings of the First ACM International Conference on Digital Libraries, page 170. ACM, New York, NY, USA, 1996.
[5] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30:107-117, 1998.
[6] Mike Cafarella and Doug Cutting. Building Nutch: Open source search. ACM Queue, 2(2):54-61, 2004.
[7] Soumen Chakrabarti. Data mining for hypertext: A tutorial survey. SIGKDD Explorations Newsletter, 1(2):1-11, 2000.
[8] Soumen Chakrabarti. Mining the Web: Analysis of Hypertext and Semi Structured Data. Morgan Kaufmann, August 2002.
[9] Soumen Chakrabarti, Martin van den Berg, and Byron Dom. Focused crawling: A new approach to topic-specific web resource discovery. Computer Networks, 31(11-16):1623-1640, 1999.
[10] Yuxin Chen. A Novel Hybrid Focused Crawling Algorithm to Build Domain-Specific Collections. PhD thesis, Virginia, United States, 2007.
[11] Edward Clarkson and James D. Foley. Browsing affordance designs for the human-centered computing education digital library. In JCDL '06: Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries, page 361. ACM, New York, NY, USA, 2006.
[12] David J. Hand, Heikki Mannila, and Padhraic Smyth. Principles of Data Mining (Adaptive Computation and Machine Learning). The MIT Press, August 2001.
[13] Bin He. A Holistic Paradigm for Large Scale Schema Matching. PhD thesis, Illinois, United States, 2006.
[14] Christoph Hölscher and Gerhard Strube. Web search behavior of internet experts and newbies. In Proceedings of the 9th International World Wide Web Conference, pages 337-346. North-Holland Publishing Co., Amsterdam, The Netherlands, 2000.
[15] Bernard J. Jansen and Amanda Spink. An analysis of web searching by European AlltheWeb.com users. Information Processing & Management, 41(2):361-381, 2005.
[16] Steve Jones, Sally Jo Cunningham, and Rodger McNab. Usage analysis of a digital library. In DL '98: Proceedings of the Third ACM Conference on Digital Libraries, pages 293-294. ACM, New York, NY, USA, 1998.
[17] Seikyung Jung. Designing and Understanding Information Retrieval Systems Using Collaborative Filtering in an Academic Library Environment. PhD thesis, Oregon, United States, 2007.
[18] Jon M. Kleinberg. Hubs, authorities, and communities. ACM Computing Surveys, 31(4es), 1999.
[19] Carl Lagoze, Dean Krafft, Tim Cornwell, Naomi Dushay, Dean Eckstrom, and John Saylor. Metadata aggregation and "automated digital libraries": A retrospective on the NSDL experience. In JCDL '06: Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries, pages 230-239. ACM, New York, NY, USA, 2006.
[20] Hyun Chul Lee. Extending Link Analysis for Novel Web Mining Applications. PhD thesis, Canada, 2007.
[21] Leyla Zhuhadar and Olfa Nasraoui. Personalized cluster-based semantically enriched web search for e-learning. In ONISW 2008: The 2nd International Workshop on Ontologies and Information Systems for the Semantic Web, at CIKM '08: The 17th ACM Conference on Information and Knowledge Management, Napa Valley, California, October 26-30, 2008.
[22] Leyla Zhuhadar and Olfa Nasraoui. Semantic information retrieval for personalized e-learning. In ICTAI 2008: The 20th IEEE International Conference on Tools with Artificial Intelligence, Dayton, Ohio, USA, November 3-5, 2008.
[23] Jinghu Liu. Resource-Bounded Online Search for Dense Neighbourhood on the Web. PhD thesis, Canada, 2002.
[24] José E. Moreira, Maged M. Michael, Dilma Da Silva, Doron Shiloach, Parijat Dube, and Li Zhang. Scalability of the Nutch search engine. In ICS '07: Proceedings of the 21st Annual International Conference on Supercomputing, pages 3-12. ACM, New York, NY, USA, 2007.
[25] Adam Stuart Nickerson. Connecting Link Structure and Content on the Web for Effective Focused Crawling. PhD thesis, Canada, 2003.
[26] Gautam Pant. Learning to Crawl: Classifier-Guided Topical Crawlers. PhD thesis, Iowa, United States, 2004.
[27] Gautam Pant, Kostas Tsioutsiouliklis, Judy Johnson, and C. Lee Giles. Panorama: Extending digital libraries with topical crawlers. In JCDL '04: Proceedings of the 4th ACM/IEEE-CS Joint Conference on Digital Libraries, pages 142-150. ACM, New York, NY, USA, 2004.
[28] Jialun Qin, Yilu Zhou, and Michael Chau. Building domain-specific web collections for scientific digital libraries: A meta-search enhanced focused crawling method. In JCDL '04: Proceedings of the 4th ACM/IEEE-CS Joint Conference on Digital Libraries, pages 135-141. ACM, New York, NY, USA, 2004.
[29] Herbert Schorr and Salvatore J. Stolfo. Towards the digital government of the 21st century. In dg.o '02: Proceedings of the 2002 Annual National Conference on Digital Government Research, pages 1-40. Digital Government Research Center, 2002.
[30] Yilei Shao. Exploring Social Networks in Computer Systems. PhD thesis, New Jersey, United States, 2007.
[31] Dell Zhang and Yisheng Dong. An efficient algorithm to rank web resources. In Proceedings of the 9th International World Wide Web Conference, pages 449-455. North-Holland Publishing Co., Amsterdam, The Netherlands, 2000.
[32] Cong Zhou. CNDROBOT: A Robot for the CINDI Digital Library System. PhD thesis, Canada, 2006.