Incorporating Quality Metrics in Centralized/Distributed Information Retrieval on the World Wide Web 1

Xiaolan Zhu2 and Susan Gauch
Department of Electrical Engineering and Computer Science, University of Kansas

Contact Information:
Dr. Susan Gauch
415 Snow Hall, Dept. of EECS, University of Kansas, Lawrence, KS 66049
785-864-7755 (phone), 785-864-7789 (fax)
[email protected]

Abstract

Most information retrieval systems on the Internet rely primarily on similarity ranking algorithms based solely on term frequency statistics; information quality is usually ignored. This leads to the problem that documents are retrieved without regard to their quality. We present an approach that combines similarity-based ranking with quality ranking in centralized and distributed search environments. Six quality metrics were investigated: currency, availability, information-to-noise ratio, authority, popularity, and cohesiveness. Search effectiveness was significantly improved when the currency, availability, information-to-noise ratio, and cohesiveness metrics were incorporated in centralized search. The improvement seen when the availability, information-to-noise ratio, popularity, and cohesiveness metrics were incorporated in site selection was also significant. Finally, incorporating the popularity metric in information fusion resulted in a significant improvement. In summary, the results show that incorporating quality metrics can generally improve search effectiveness in both centralized and distributed search environments.

Keywords: information brokers, distributed collections, merging search results/information synthesis.

1 This research was partially supported by NSF CAREER Award 97-03307.

2 Note: The primary author is a full-time student and will present the paper if it is accepted. Thus, this paper is eligible for the Best Student Paper award.


1 Introduction

Web search engines have their roots in information retrieval systems that were originally developed to manage information in libraries. Since the information sources for these systems were carefully selected, they tended to be of comparable quality. Thus, the retrieval algorithms were developed without regard to the quality of the information source. In contrast, since there is no quality control on how Web pages get created and maintained, information quality varies widely on the Internet. Thus, when information retrieval systems are applied in Web search engines, they should take into account the quality of a Web page, not just its contents. This paper explores a variety of automatically calculated quality metrics and evaluates their ability to improve search results.

1.1 Related Work

Information retrieval systems can be classified into two broad categories based on their architectures: 1) centralized; and 2) distributed. Centralized information retrieval systems typically require that all the documents reside locally at a single site and all queries are also handled by that site. In contrast, distributed information retrieval systems allow users to simultaneously access document collections distributed across multiple remote sites. In a centralized collection, the quality of individual pages must be compared and used in the retrieval process. For distributed systems, quality metrics must also be calculated for the collection as a whole and used to augment query brokering and results fusion.

1.1.1 Centralized Search

A typical centralized search system maintains an index, and searches and fetches documents from a single site. Document rankings are traditionally determined by term distribution statistics in the collection, with no regard for document quality. A typical formula used to calculate the weight of a query term in a particular document is tf_ij * idf_i, where tf_ij is the frequency of term i in document j, and idf_i is calculated as log2(N/n_i), where N is the number of documents in the collection and n_i is the total number of documents in the collection that contain term i. Some centralized search systems do take information quality into account, usually in the form of document popularity. One approach to calculating a popularity metric is to count the number of links pointing to the page (Li & Rafsky, 1997). A similar approach is used by Google (Google, 1999) to calculate the PageRank for each Web page: a Web page gets a high PageRank when other pages with high PageRanks link to it (Brin & Page, 1999). Another approach is to calculate popularity by observing user behavior. The Web search engine Direct Hit (Direct Hit, 2000a; 2000b) tracks the result links clicked by the user and how long the user stays at each site, and uses this information to increase the popularity metric for the examined pages.
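As a concrete illustration of this weighting scheme, the sketch below computes tf_ij * idf_i weights for a toy collection in Python. It is a minimal, hypothetical example of the standard formula, not code from any of the systems cited above.

```python
import math
from collections import Counter

def tfidf_weights(documents):
    """Compute tf_ij * idf_i for every term in every document.

    documents: list of token lists, one list per document.
    Returns one {term: weight} dict per document, where
    idf_i = log2(N / n_i) as in the formula above.
    """
    N = len(documents)
    doc_freq = Counter()              # n_i: number of documents containing term i
    for doc in documents:
        doc_freq.update(set(doc))

    weights = []
    for doc in documents:
        tf = Counter(doc)             # tf_ij: frequency of term i in document j
        weights.append({t: tf[t] * math.log2(N / doc_freq[t]) for t in tf})
    return weights

# Toy collection of three "documents"
docs = [["web", "search", "engine"],
        ["web", "page", "quality"],
        ["search", "quality", "quality"]]
print(tfidf_weights(docs))
```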

1.1.2 Distributed Search

Distributed information retrieval systems need to perform three tasks for a query: 1) choose the sites that contain promising information; 2) query the selected sites; and 3) merge the search results from those sites. Thus, site meta-information creation, site selection, and information fusion are three important issues for distributed search environments.


The goal of site selection is to direct queries to the sites that potentially have the answers. This process is usually guided by meta-information that summarizes the contents of a site so that the sites with the best overlap with the query can be identified. Meta-information may contain information about the ranking algorithm, tokenizer, stop words, and content summary of a site (Gravano et al., 1997). Meta-information may be represented as an n-gram centroid, which is the mean of the documents' n-gram profiles on a site (Crowder & Nicholas, 1996a; 1996b; Pearce & Nicholas, 1996), a list of the tokens representing the contents of a site (Kretser et al., 1998), a list of term-frequency pairs for a site (Selberg, 1996), or similarity matrices produced by corpus linguistics analysis (Gauch et al., 1998).

In a distributed environment, the rankings assigned to documents from one collection are usually not comparable with rankings from another collection because of differences in collection sizes and in the ranking algorithms employed. Overall retrieval effectiveness can be severely degraded if the responses from different collections are simply ordered by their rankings (Voorhees, 1997). Therefore, appropriate techniques to merge the responses, i.e., information fusion, must be developed when constructing distributed information retrieval systems. One approach to information fusion is to find the appropriate number of documents that should be selected from each collection to maximize the total number of relevant documents (Towell et al., 1995; Voorhees, 1997). Another approach is to use weighted scores based on collection ranking, in which each collection is assigned a score based on its quality. The final score for a document is determined by the score returned from the collection and the weight of the collection (Callan et al., 1995; Fan & Gauch, 1999).

1.1.3 Quality Metrics

The critical issue in evaluating the quality of a Web page is selecting the quality criteria. The criteria investigated in this research were selected based on the criteria used by Web sites that provide rating services, e.g., Internet Scout (Scout, 1999a; 1999b), Lycos Top 5% (Lycos, 1999b; 1999c), Argus Clearinghouse (Clearinghouse, 1999), WWW Virtual Library (Ciolek, 1998), McKinley Group's Magellan site (Magellan, 1999; 1998), and the Internet Public Library (IPL) (IPL, 1999). We reviewed their rating systems and identified the following 16 metrics used by their human reviewers: subject, breadth, depth, cohesiveness, accuracy, timeliness, source, maintenance, currency, availability, authority, presentation, availability, information-to-noise ratio, quality of writing, and popularity. From this list, we selected the following 6 as being widely used and amenable to automatic analysis:

Currency: how recently a Web page has been updated.
Availability: the number of broken links contained by the Web page.
Information-to-Noise Ratio: the proportion of useful information contained in a Web page of a given size.
Authority: the reputation of the organization that produced the Web page.
Popularity: how many other Web pages have cited this particular Web page.
Cohesiveness: the degree to which the content of the page is focused on one topic.


2 Research Approach

This section describes the objective ways in which we obtained the quality metrics described in Section 1.1.3. First, we describe how metrics for an individual page are calculated, then how those metrics are accumulated to summarize the quality of an entire Web site. The quality of a Web site is calculated on a per-topic basis, and this information is later used as meta-information for topic-based query routing and information fusion.

2.1 Measuring the Quality of a Web Page

The quality metrics we chose are measured as follows:

Currency is measured as the time stamp of the last modification of the document.
Availability is calculated as the number of broken links on a page divided by the total number of links it contains.
Information-to-Noise Ratio is computed as the total length of the tokens after preprocessing divided by the size of the document.
Authority is based on the Yahoo Internet Life (YIL) reviews (ZDNet, 1999), which assign a score ranging from 2 to 4 to a reviewed site. If a Web page came from a site that was not reviewed by YIL, its authority was assumed to be 0.
Popularity is measured as the number of links pointing to the Web page. The number of links pointing to a particular Web page was obtained from the AltaVista (1999) site.
Cohesiveness is determined by how closely related the major topics in the Web page are (see Section 2.1.1).

2.1.1 Measuring Cohesiveness

Our cohesiveness metric is based on categorizing each Web page into the topics in a reference ontology using a vector space classifier (Zhu et al., 1999). For the reference ontology, we use an eleven-level ontology of 4,385 nodes downloaded from the Magellan site (Gauch, 1997b). Each topic in the ontology was associated with up to 20 Web pages attached to the topic, which were used as training documents for the concepts. The training documents associated with a given concept are concatenated, and a vector space indexer is used to create a vector for each concept in the ontology. We calculate the cosine similarity between the vector representing the Web page to be classified and the vectors representing the reference ontology concepts, returning the top 10 matching topics and the associated weight of each match. The cohesiveness of the Web page was then determined by the distribution of the top 20 matching topics in the reference ontology. The more closely related the top 20 topics are in the reference ontology, the higher the cohesiveness. The following formula was used to calculate the cohesiveness of a Web page in our first set of experiments:


\[
\mathrm{Cohesiveness} = \frac{\dfrac{N(N-1)}{2} - \dfrac{M(M-1)}{2} + \sum_{i<j} P_{ij}}{\dfrac{N(N-1)}{2}},
\qquad i, j = 0, \ldots, N-1,\ i < j
\]

where N is the maximum number of top matching topics requested (i.e., 20 in this case), M is the minimum of N and the number of matching topics returned, and P_ij is the length of the shared path between topics i and j divided by the height of the ontology.

The above formula relies only on the topological distribution of the matching topics, ignoring the weight of the matches. We reran the experiments in Section 3.2.2 using the following revised formula, which includes the match values, for better results:

\[
\mathrm{Cohesiveness} = \frac{\dfrac{N(N-1)}{2} - \dfrac{M(M-1)}{2} + \sum_{i<j} W_{ij} \, P_{ij}}{\dfrac{N(N-1)}{2}},
\qquad i, j = 0, \ldots, N-1,\ i < j
\]

where W_ij is the sum of the weights of topics i and j.
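To make the first (topology-only) formula concrete, the Python sketch below evaluates it for a list of matching topics. The shared_path_len helper and the topic identifiers are hypothetical placeholders for the ontology machinery described above; this is an illustration of the formula, not the authors' implementation.

```python
from itertools import combinations

def cohesiveness(matching_topics, shared_path_len, ontology_height, n_requested=20):
    """First cohesiveness formula (topology only).

    matching_topics: topic identifiers returned by the classifier.
    shared_path_len(a, b): hypothetical helper giving the length of the
        shared path between two topics in the reference ontology.
    ontology_height: height of the ontology (eleven levels here).
    n_requested: N, the maximum number of top matching topics requested.
    """
    N = n_requested
    M = min(N, len(matching_topics))
    # Sum of P_ij over all pairs i < j of returned topics, where P_ij is the
    # shared path length normalized by the ontology height.
    pair_sum = sum(shared_path_len(a, b) / ontology_height
                   for a, b in combinations(matching_topics[:M], 2))
    half_n = N * (N - 1) / 2
    return (half_n - M * (M - 1) / 2 + pair_sum) / half_n

# Toy usage: pretend every topic pair shares a path of length 3 in an 11-level ontology
print(cohesiveness(["arts/music", "arts/dance", "arts/film"],
                   lambda a, b: 3, ontology_height=11))
```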

2.2 Constructing the Meta-Information of a Site

The metrics described in the previous section are relevant to an individual Web page. To evaluate the effectiveness of the quality metrics in distributed search, this information needs to be summarized in meta-information for the site. The meta-information was represented as an augmented 4,385-concept reference ontology, where each concept has an associated quantity value and 5 associated quality values. In addition, we calculated a cohesiveness metric for the site overall.

To calculate the quantity metric for each concept, each Web page on the site to be summarized was classified into the top 10 relevant topics in the ontology using the approach (Zhu et al., 1999) described in the previous section. The match weights of all documents classified into a particular concept were then accumulated to determine the quantity of information related to that concept in the site being studied. For each concept, the information quantity measure is augmented with five information quality metrics (all except cohesiveness). As shown in Table 1, the meta-information for site i includes the information quantity related to each topic j (Wij), currency (Tij), availability (Aij), information-to-noise ratio (Iij), authority (Rij), and popularity (Pij).


All quality metrics, except cohesiveness, associated with a topic were produced by summing the quality measures of the Web pages that were categorized into that topic. Cohesiveness (Ci for site i) was not associated with individual topics but rather with the distribution of the quantity weights throughout the ontology. The cohesiveness of the site was determined by how tightly distributed the top 100 matching topics were, calculated using the first cohesiveness formula given in Section 2.1.1.

Topic ID   Quantity   Currency   Availability   Info/Noise   Authority   Popularity
0          Wi0        Ti0        Ai0            Ii0          Ri0         Pi0
1          Wi1        Ti1        Ai1            Ii1          Ri1         Pi1
2          Wi2        Ti2        Ai2            Ii2          Ri2         Pi2
...        ...        ...        ...            ...          ...         ...

Table 1. Meta-Information for Site i (Site Cohesiveness = Ci).
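As a sketch of how the per-site meta-information in Table 1 might be accumulated, the following Python fragment sums each page's quality metrics into the rows of the topics the page is classified into. The field names and the classify() interface are assumptions made for illustration, not the authors' actual data structures.

```python
from collections import defaultdict

def build_site_meta(pages, classify):
    """Accumulate per-page metrics into per-topic meta-information (Table 1).

    pages: iterable of dicts with keys 'currency', 'availability',
           'info_noise', 'authority', and 'popularity' (normalized values).
    classify(page): returns the page's top 10 (topic_id, match_weight) pairs.
    """
    # One row per ontology concept: W (quantity), T, A, I, R, P
    meta = defaultdict(lambda: {"W": 0.0, "T": 0.0, "A": 0.0,
                                "I": 0.0, "R": 0.0, "P": 0.0})
    for page in pages:
        for topic_id, weight in classify(page):
            row = meta[topic_id]
            row["W"] += weight                   # information quantity W_ij
            row["T"] += page["currency"]         # currency T_ij
            row["A"] += page["availability"]     # availability A_ij
            row["I"] += page["info_noise"]       # information-to-noise I_ij
            row["R"] += page["authority"]        # authority R_ij
            row["P"] += page["popularity"]       # popularity P_ij
    return meta
```

Site cohesiveness (Ci) would then be computed separately from the distribution of the accumulated quantity weights, as described above.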

2.3 Centralized Information Search

Our first goal was to develop a search algorithm that incorporated both quantity and quality metrics in the document/query match process. Initially, the query matching process returned similarity rankings using tf * idf as described in Section 1.1.1. Each quality metric was evaluated separately, then the metrics were combined to get the final score of each Web page:

s_d = s_r * (a_p * t_d + b_p * a_d + c_p * i_d + d_p * r_d + e_p * p_d + f_p * c_d)    [1]

where s_d is the final score of document d, s_r is the normalized similarity-based score returned by the vector-based search algorithm, t_d, a_d, i_d, r_d, p_d, and c_d are the normalized quality metrics for the document (currency, availability, information-to-noise ratio, authority, popularity, and cohesiveness, respectively), and a_p, b_p, c_p, d_p, e_p, and f_p are the weights representing the importance of the corresponding quality metrics in centralized search. The weight of a quality metric was computed (see Section 3.2.1) as the improvement made by incorporating that quality metric alone divided by the sum of the improvements made by incorporating each of the quality metrics alone in centralized search.
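A minimal sketch of Equation 1 and the improvement-based weighting scheme just described, assuming the per-metric improvements are already known (the illustrative numbers below are the single-metric improvements later reported in Table 2; the document quality values are made up):

```python
def metric_weights(improvements):
    """Weight of each metric = its improvement alone / sum of all improvements.
    Metrics left out of a run would simply get weight 0."""
    total = sum(improvements.values())
    return {metric: imp / total for metric, imp in improvements.items()}

def document_score(sim, quality, weights):
    """Equation 1: s_d = s_r * (a_p*t_d + b_p*a_d + ... + f_p*c_d).

    sim: normalized similarity score s_r from the vector-space match.
    quality: normalized quality metric values for the document.
    weights: output of metric_weights()."""
    return sim * sum(weights[m] * quality[m] for m in weights)

# Single-metric improvements from Table 2 (currency, availability,
# info/noise, cohesiveness), used here only for illustration.
w = metric_weights({"currency": 0.090, "availability": 0.095,
                    "info_noise": 0.147, "cohesiveness": 0.106})
doc_quality = {"currency": 0.5, "availability": 0.9,       # hypothetical values
               "info_noise": 0.7, "cohesiveness": 0.6}
print(document_score(0.8, doc_quality, w))
```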

2.4 Site Selection

Another goal of this research was to use our quantity- and quality-based meta-information to route queries to the appropriate sites for processing. Information quality was incorporated in the following equation to calculate the score for each site, and then queries were routed to the top-scoring sites:


S_i = W_i * (a'_s * T_i + b'_s * A_i + c'_s * I_i + d'_s * R_i + e'_s * P_i + f'_s * C_i)    [2]

where W_i, T_i, A_i, I_i, R_i, and P_i are the means of the information quantity, currency, availability, information-to-noise ratio, authority, and popularity of site i across the top 10 topics relevant to the query, C_i is the cohesiveness of site i, and a'_s, b'_s, c'_s, d'_s, e'_s, and f'_s are the weights representing the importance of each quality metric for site selection. The weight of a quality metric was computed (see Section 3.2.2) as the improvement made by incorporating that quality metric alone divided by the sum of the improvements made by incorporating each of the quality metrics alone in site selection.
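The site-selection step can be sketched in the same style: score each candidate site with Equation 2 and route the query to the top K sites (K = 3 in our experiments). The dictionary layout below is an assumption made for illustration.

```python
def site_score(means, cohesiveness, weights):
    """Equation 2: S_i = W_i * (a'_s*T_i + b'_s*A_i + c'_s*I_i
                                + d'_s*R_i + e'_s*P_i + f'_s*C_i).

    means: mean quantity/quality values of the site over the top 10
           query-relevant topics, e.g. {"W": .., "T": .., "A": ..,
           "I": .., "R": .., "P": ..}.
    cohesiveness: C_i for the site as a whole.
    weights: the a'_s ... f'_s weights keyed by metric name."""
    quality = (weights["currency"] * means["T"]
               + weights["availability"] * means["A"]
               + weights["info_noise"] * means["I"]
               + weights["authority"] * means["R"]
               + weights["popularity"] * means["P"]
               + weights["cohesiveness"] * cohesiveness)
    return means["W"] * quality

def route_query(sites, weights, k=3):
    """Send the query to the top K scoring sites."""
    return sorted(sites,
                  key=lambda s: site_score(s["means"], s["C"], weights),
                  reverse=True)[:k]
```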

2.5 Information Fusion

Since simply merging the results based on the scores returned from each search site could severely degrade search effectiveness (Voorhees, 1997), the score of a Web page from a local site was adjusted by the goodness measure of the site before being presented to the user. The goodness of a site was determined by its information quantity as well as its information quality using the following equation:

G_i = W_i * (a''_s * T_i + b''_s * A_i + c''_s * I_i + d''_s * R_i + e''_s * P_i + f''_s * C_i)    [3]

where W_i, T_i, A_i, I_i, R_i, and P_i are the means of the information quantity, currency, availability, information-to-noise ratio, authority, and popularity of site i across the topics relevant to the query, C_i is the cohesiveness of site i, and a''_s, b''_s, c''_s, d''_s, e''_s, and f''_s are the weights representing the importance of each quality metric for information fusion. The weight of a quality metric was computed (see Section 3.2.3) as the improvement made by incorporating that quality metric alone divided by the sum of the improvements made by incorporating each of the quality metrics alone in information fusion. The score s_d of document d returned from site i was then adjusted by incorporating the quality metrics of the site with the following equation:

s'_d = s_d * G_i    [4]
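Equation 3 has the same weighted form as Equation 2 (with the a''_s ... f''_s weights), so a fusion sketch only needs Equation 4: scale each local score by its site's goodness and merge into one ranked list. The input shapes below are assumptions made for illustration.

```python
def fuse_results(results_by_site, goodness_by_site):
    """Equation 4: s'_d = s_d * G_i, then merge into one ranked list.

    results_by_site: {site_id: [(doc_id, local_score), ...]}
    goodness_by_site: {site_id: G_i computed with Equation 3}"""
    merged = [(doc_id, score * goodness_by_site[site_id])
              for site_id, results in results_by_site.items()
              for doc_id, score in results]
    return sorted(merged, key=lambda item: item[1], reverse=True)

# Toy usage with made-up scores
print(fuse_results({"siteA": [("d1", 0.9), ("d2", 0.4)],
                    "siteB": [("d3", 0.7)]},
                   {"siteA": 0.5, "siteB": 1.2}))
```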


3 Evaluation

We evaluated the ability of the quality metrics to improve search effectiveness for three tasks: 1) query/document matching; 2) query routing; and 3) information fusion. For each task, we established a baseline by performing the task with a traditional vector-space system, then evaluated the performance with each quality metric in turn. Finally, the quality metrics were combined (all quality metrics, and just those found to significantly improve search effectiveness).

3.1 Experimental Resources

Twenty target sites on five different topics (art, computing, fitness, music, and general issues) were selected based on the reviews of YIL (ZDNet, 1999). For each topic, we downloaded 91 – 300 pages from four sites that varied in quality, based on the YIL reviews. Forty queries were selected from the queries recorded in the ProFusion log files (January - April, 1999). Like the site resources, the queries were selected to be relevant to each of the five topics. The queries varied from 1 to 5 terms in length, with an average of 2.8 terms per query. For each query, the results retrieved by any algorithm were pooled and presented to human subjects in random order to create a testbed of queries, results, and relevance judgments. Four subjects participated in the experiments, each judging between 331 and 1213 Web pages and providing a yes or no judgment depending on whether or not the document was relevant to the query.

3.2 Experiments

In this section, we introduce three sets of experiments and their results. These were designed to test the effects of each quality metric and their combinations in centralized search, site selection, and information fusion, respectively.

3.2.1 Quality Metrics in Centralized Search

All documents downloaded from the 20 Web sites were combined into a single collection. The baseline is the search effectiveness obtained when none of the quality metrics was considered. Search effectiveness was measured as the precision of the top 10 search results. The paired-samples t-test was used to test the difference between each pair of mean precisions. The alpha value was set at .05 in all statistical analyses. Thus, the probability of erroneously declaring the mean precisions in a pair to be different is less than 5%.

Experiment 1

Hypothesis: The search effectiveness of centralized search can be improved when retrieval is enhanced with each quality metric in turn, and more so when they are combined.

Method: When there was only one quality metric incorporated, the weight of that quality metric was set to 1 and the weights of the other quality metrics were set to zero. When more than one quality metric was incorporated, the weight of an incorporated quality metric was calculated as the improvement made by incorporating the quality metric alone divided by the sum of the improvements made by incorporating all the quality metrics individually. The weights of the quality metrics that were not incorporated were set to zero.


Since the mean variance of the weight of each Web page across the top 10 topics is small (0.17), the first cohesiveness formula, which ignores match weights, was used.

Results: Table 2 summarizes the results of the quality metric experiments for centralized search. The results in the table are sorted by the magnitude of the improvement. The Significant column indicates whether or not the improvement is statistically significant. '+' indicates that the centralized search was guided by information quantity as well as the quality metric(s). For example, the +Currency row lists the result obtained when the information quantity and currency metrics were incorporated in centralized search. The mean precisions obtained by incorporating the currency, availability, information-to-noise ratio, and cohesiveness metrics in similarity ranking are significantly higher than the baseline. The search effectiveness produced by incorporating the authority and popularity metrics appears higher than the baseline, but the differences are not statistically significant.

Condition                                        Precision   Improvement   Significant
Quantity (baseline)                              0.443       -             -
+Popularity                                      0.465       5.0%          No
+Authority                                       0.468       5.6%          No
+Currency                                        0.483       9.0%          Yes
+Availability                                    0.485       9.5%          Yes
+Cohesiveness                                    0.490       10.6%         Yes
+Info/Noise                                      0.508       14.7%         Yes
+All Metrics                                     0.533       20.3%         Yes
+Currency+Availability+Info/Noise+Cohesiveness   0.553       24.8%         Yes

Table 2. The effect of information quality metrics in centralized search.

The mean precision obtained by incorporating the four significant quality metrics (currency, availability, information-to-noise ratio, and cohesiveness) is significantly higher than that obtained by incorporating all the quality metrics. This suggests that incorporating quality metrics that do not significantly improve system effectiveness introduces noise. It is worth noting that incorporating the four significant quality metrics and incorporating all the quality metrics both significantly improved search effectiveness compared with using a single quality metric, with one exception: the improvement made by incorporating the information-to-noise ratio alone is not significantly different from that made by incorporating the four best quality metrics, indicating that the information-to-noise ratio plays an important role in improving the effectiveness of centralized search.

3.2.2 Quality Metrics in Site Selection

Each query was classified into concepts in the reference ontology using the same classifier used for Web pages (Zhu et al., 1999). Each site was then assigned a score determined by its information quantity as well as its information quality based on the top 10 topics. The top K (e.g., K = 3) sites were selected for each query. The search effectiveness obtained when site selection was guided by the information quantity alone was used as a baseline. Based on the results of Experiment 1, the document rankings in centralized search were determined by information quantity and the four significant quality metrics.


The final result list was produced by simply sorting the scores returned from the multiple sites.

Experiment 2

Hypothesis: The search effectiveness from a collection of sites can be improved when site selection is enhanced with each quality metric in turn, and more so when they are combined.

Method: Equation 2 was used to calculate the score of a site. When there was only one quality metric incorporated, the weight of that quality metric was set to 1 and the weights of the other quality metrics were set to zero. When more than one quality metric was incorporated, the weight of an incorporated quality metric was calculated as the improvement made by incorporating the quality metric alone divided by the sum of the improvements made by incorporating each of the quality metrics alone. The weights of the quality metrics that were not incorporated were set to zero.

Results: The results are summarized in Table 3. The mean precisions obtained by incorporating the availability, information-to-noise ratio, and popularity metrics are significantly higher than the baseline. Incorporating currency and authority produced precisions higher than the baseline, but the differences are not significant. The precision produced by incorporating cohesiveness without considering the weight distribution of matching topics is not different from the baseline, but when the revised cohesiveness metric that includes the weight distribution is used, the precision improvement is significant. This suggests that the weight distribution of topics plays an important role in the cohesiveness calculation when the mean variance of the weight of each Web page is large (1.07 x 10^6).

Condition                                            Precision   Improvement   Significant
Quantity (baseline)                                  0.355       -             -
+Cohesiveness (weight distribution not considered)   0.355       0.0%          No
+Authority                                           0.376       5.6%          No
+Currency                                            0.405       14.1%         No
+Info/Noise                                          0.413       16.3%         Yes
+Availability                                        0.415       16.9%         Yes
+All Metrics                                         0.415       16.9%         Yes
+Availability+Info/Noise+Popularity                  0.415       16.9%         Yes
+Popularity                                          0.418       17.7%         Yes
+Cohesiveness (weight distribution considered)       0.445       25.4%         Yes

Table 3. The effect of information quality metrics in site selection.

The mean precision obtained by incorporating the three significant quality metrics (availability, information-to-noise ratio, and popularity) in site selection is not different from that obtained by incorporating all the quality metrics. This suggests that leaving out the non-significant metrics in site selection does not hurt search effectiveness. Note that these combinations of quality metrics both significantly improve search effectiveness. The mean precision obtained by incorporating the three significant quality metrics in site selection is also not different from those obtained by incorporating them individually. This suggests that incorporating any one of the significant quality metrics produces results as good as those produced by incorporating the combination of them.


3.2.3 Quality Metrics in Information Fusion

Based on the results of the first two experiments, in Experiment 3 centralized search was guided by information quantity and the combination of the four quality metrics found significant for centralized search, and site selection was guided by information quantity and the availability, information-to-noise ratio, and popularity metrics found significant for site selection. The baseline is the precision obtained when information fusion was guided by information quantity alone.

Experiment 3

Hypothesis: The search effectiveness of information fusion can be improved when retrieval is enhanced with each quality metric in turn, and more so when they are used in combination.

Method: Equation 3 was used to calculate the goodness of a site. When there was only one quality metric incorporated, the weight of that quality metric was set to 1 and the weights of the other quality metrics were set to zero. When more than one quality metric was incorporated, the weight of an incorporated quality metric was calculated as the improvement made by incorporating the quality metric alone divided by the sum of the improvements made by incorporating each of the quality metrics alone. The weights of the quality metrics that were not incorporated were set to zero. The original cohesiveness equation was used.

Results: The results are summarized in Table 4. Popularity was the only quality metric that significantly improved search effectiveness in information fusion, although the information-to-noise ratio improvement was marginally significant (p = .06). The mean precision obtained by incorporating the popularity and information-to-noise ratio metrics together was not significantly different from that obtained by incorporating all the quality metrics. This suggests that leaving out the non-significant metrics in information fusion does not hurt search effectiveness.

Condition                                                           Precision   Improvement   Significant
Quantity (baseline)                                                 0.545       -             -
+Cohesiveness                                                       0.490       -10.1%        No
+Currency                                                           0.553       1.5%          No
+Availability                                                       0.558       2.4%          No
+Availability+Info/Noise+Popularity (weights from site selection)   0.560       2.8%          No
+Authority                                                          0.563       3.3%          No
+Info/Noise+Popularity                                              0.565       3.7%          No
+Info/Noise                                                         0.568       4.2%          No*
+All Metrics                                                        0.570       4.6%          No
+Popularity                                                         0.573       5.1%          Yes

Table 4. The effects of information quality metrics in information fusion. (* marginally significant, p = .06)

The mean precision obtained by incorporating the popularity and information-to-noise ratio quality metrics in information fusion is not significantly different from those obtained by incorporating the information-to-noise ratio or popularity metric individually.


This suggests that incorporating any one of the significant quality metrics produces results as good as those produced by incorporating them in combination.

4 Discussion and Conclusions

Each of the quality metrics was able to improve search effectiveness when incorporated in a centralized search engine, although the effects produced by the authority and popularity metrics were not significant. The best single metric was the information-to-noise ratio, which improved the results by 14.7%. When the significant quality metrics were combined (currency, availability, information-to-noise ratio, and cohesiveness), an improvement of 24.8% over the baseline was achieved. This suggests that quality information is extremely important to centralized search.

When the quality metrics were incorporated in site selection, the currency, availability, information-to-noise ratio, authority, popularity, and cohesiveness metrics all improved search effectiveness, although the effects produced by the currency and authority metrics were not significant. The best single metric was the revised cohesiveness metric, which improved precision by 25.4%. Combining multiple quality metrics did not appear to improve the results beyond what was achieved by the individual metrics.

The data collected from the information fusion experiments show that only the improvement made by the popularity metric is significant, and that improvement is fairly minor (5.1%). This may indicate that the site selection process worked and there was little quality variation among the selected sites, so incorporating quality metrics further during information fusion may not be necessary.

The authority metric did not have a significant impact on centralized search, site selection, or information fusion. It may be that the authority metric is not related to search effectiveness at all, or that the authority ratings we used are not accurate enough. The information-to-noise ratio is the only quality metric that improves search effectiveness when incorporated in centralized search, site selection, and information fusion. Therefore, it can be considered the most useful quality metric in both centralized and distributed search.

5 Future Work

The information quality metrics tested in the experiments may be only a subset of the criteria that are important to information quality. Spelling, grammar, and HTML tagging errors may also give clues as to the quality of a Web page. Log files kept on servers also provide a rich resource of information that could be exploited. For example, the number of times a page is served could be used to estimate the popularity of a site rather than, or in addition to, the link counts.

6 References

AltaVista. 1999. http://www.altavista.com

Brin, S., & Page, L. 1999. "The Anatomy of a Large-Scale Hypertextual Web Search Engine." http://google.stanford.edu/long321.htm

Callan, J. P., Croft, W. B., & Harding, S. M. 1995. "The INQUERY Retrieval System." In Proceedings of the 3rd International Conference on Database and Expert System Applications, Valencia, Spain, September.

Ciolek, T. M. 1998. http://www.ciolek.com/WWWVLPages/QltyPages/QltyDefinitions.html

Clearinghouse. 1999. Argus Clearinghouse Ratings System. http://www.clearinghouse.net/ratings.html

Crowder, G., & Nicholas, C. 1996a. "Using Statistical Properties of Text to Create Metadata." In First IEEE Metadata Conference, April 1996.

Crowder, G., & Nicholas, C. 1996b. "Resource Selection in CAFÉ: An Architecture for Network Information Retrieval." In ACM-SIGIR96 Workshop on Networked Information Retrieval, 22 August 1996.

Direct Hit. 2000a. http://www.directhit.com

Direct Hit. 2000b. http://www.directhit.com/about/press/articles/cnet_shootout.html

Fan, Y., & Gauch, S. 1999. "Adaptive Agents for Information Gathering from Multiple, Distributed Information Sources." In 1999 AAAI Symposium on Intelligent Agents in Cyberspace, Stanford University, March 1999.

Gauch, S. 1997b. "Cooperative Agents for Conceptual Search and Browsing of World Wide Web Resources." Technical proposal funded by the National Science Foundation, CAREER/EPSCoR Award number 97-03307.

Gauch, S., Wang, J., & Rachakonda, S. M. 1998. "A Corpus Analysis Approach for Automatic Query Expansion and its Extension to Multiple Databases." ACM Transactions on Information Systems (to appear).

Gravano, L., Chang, K., Garcia-Molina, H., Lagoze, C., & Paepcke, A. 1997. Stanford Protocol Proposal for Internet Retrieval and Search. http://www-db.stanford.edu/~gravano/standards

IPL. 1999. http://www.ipl.org

Kretser, O., Moffat, A., Shimmin, T., & Zobel, J. 1998. "Methodologies for Distributed Information Retrieval." In 18th International Conference on Distributed Computing Systems, Amsterdam, May (to appear).

Li, Y., & Rafsky, L. 1997. "Beyond Relevance Ranking: Hyperlink Vector Voting." In ACM-SIGIR97 Workshop on Networked Information Retrieval, Philadelphia, USA, 31 July 1997.

Lycos. 1999b. http://point.lycos.com/categories/index.html

Lycos. 1999c. http://www.lycos.com/help/top5-help2.html

Magellan. 1999. http://magellan.mckinley.com

Magellan. 1998. http://www.lib.ua.edu/maghelp.htm

Pearce, C., & Nicholas, C. 1996. "TELLTALE: Experiments in a Dynamic Hypertext Environment for Degraded and Multilingual Data." Journal of the American Society for Information Science (JASIS), April 1996.

Scout. 1999a. Internet Scout Project. http://scout.cs.wisc.edu/scout/index.html

Scout. 1999b. Scout Report Selection Criteria. http://scout.cs.wisc.edu/scout/report/criteria.html

Selberg, E. 1996. "DISW '96 Query Routing and Searching Breakout." In Report of the Distributed Indexing/Searching Workshop, Cambridge, Massachusetts, May 1996. http://www.w3.org/Search/9605-Indexing-Workshop/ReportOutcomes/S6Group1.html

Towell, G., Voorhees, E. M., Gupta, N. K., & Johnson-Laird, B. 1995. "Learning Collection Fusion Strategies for Information Retrieval." In Proceedings of the Twelfth Annual Machine Learning Conference, Lake Tahoe, July 1995.

Voorhees, E. M. 1997. "Database Merging Strategies for Searching Public and Private Collections." In ACM-SIGIR97 Workshop on Networked Information Retrieval, Philadelphia, USA, July 1997.

ZDNet. 1999. http://www.zdnet.com/yil

Zhu, X., Gauch, S., Gerhard, L., Kral, N., & Pretschner, A. 1999. "Ontology-Based Web Site Mapping for Information Exploration." In 8th ACM Conference on Information and Knowledge Management.

