Proceedings of the 36th Hawaii International Conference on System Sciences - 2003
A General Method for Statistical Performance Evaluation∗

Longzhuang Li
Dept. of Computing and Mathematical Sci., Texas A&M Univ., Corpus Christi, TX 78412
[email protected]

Yi Shang
Dept. of Computer Engr. & Computer Sci., Univ. of Missouri, Columbia, MO 65211
[email protected]

Wei Zhang
Dept. of Computer Engr. & Computer Sci., Univ. of Missouri, Columbia, MO 65211
[email protected]

Hongchi Shi
Dept. of Computer Engr. & Computer Sci., Univ. of Missouri, Columbia, MO 65211
[email protected]
Abstract

In this paper, we propose a general method for statistical performance evaluation. The method incorporates various statistical metrics and automatically selects an appropriate metric according to the problem parameters. Empirically, we compare the performance of five representative statistical metrics under different conditions through simulation: expected loss, the Friedman statistic, interval-based selection, probability of win, and probably approximately correct. In the experiments, expected loss is the best for small means, such as 1 or 2, and probably approximately correct is the best in all the other cases. We also apply the general method to compare the performance of HITS-based algorithms that combine four relevance scoring methods, VSM, Okapi, TLS, and CDR, using a set of broad-topic queries. Among the four relevance scoring methods, CDR is statistically the best when combined with a HITS-based algorithm.
1. Introduction

Performance evaluation has many real-world applications. For example, when a customer wants to buy a computer, he needs to compare prices, CPU speed, memory, pre-installed software, etc., among multiple choices before deciding which one to buy. In information retrieval on the Web, we may wonder which search engine will return the most relevant information for a given set of queries [13]. In performance evaluation, hypotheses are selected or ranked based on a comparison of their performance on sample data.

∗ Research supported in part by the National Science Foundation under grants DUE-9980375 and EIA-0086230.
Among the real-world applications of statistical performance evaluation, many solutions or hypotheses often exist, and the ones that perform best in terms of predetermined measurements are sought. For example, in image compression, it is critical to design and choose the best filter banks for the quality of the reconstructed images [20]. In evolutionary algorithms, the individuals to be propagated to future generations are often selected with a likelihood proportional to their rank in the current generation [7]. The performance measurements of hypotheses are numerical values that must be obtained from sample data and may contain noise. In addition, due to the time and resource constraints of real applications, it is often impractical or even impossible to evaluate all hypotheses. Thus, statistical metrics are used to evaluate the performance of hypotheses efficiently using a limited amount of sample data. Many statistical metrics are available, and their results depend on many factors, such as the size of the sample data and the distribution of the performance measurements of the hypotheses. Selecting the most appropriate statistical metric is a challenging task. In this paper, a general, effective method to evaluate the performance of hypotheses is developed. The method incorporates various statistical metrics and automatically selects an appropriate one based on the parameters of the application. We consider the following important parameters: the number of hypotheses, the size of the sample data for each hypothesis, the distribution of performance measurements, and the distribution of noise in the performance measurements. We also apply the general method to evaluate the performance of the combination of HITS-based algorithms [12, 2] with one of four relevance scoring methods: cover density ranking (CDR) [6], Okapi similarity measurement (Okapi) [9], the vector space model (VSM) [16], and the three-level scoring method (TLS) [14], using a set of broad-topic queries.
In the experiments, we study the performance of five representative statistical metrics using sample data with four different types of distributions. The five statistical metrics are expected loss (EL) [4], the Friedman statistic (Frie) [17], interval-based selection (Int) [4], probability of win (Pwin) [11], and probably approximately correct (PAC) [19]. The four distributions of the sample data are the chi-square, exponential, normal, and Poisson distributions. This paper is organized as follows. In Section 2, we briefly review the statistical metrics for performance evaluation. In Section 3, we propose a general method for statistical performance evaluation and apply it to evaluate the performance of HITS-based algorithms. In Section 4, we describe the criteria used to compare the performance of different statistical metrics. In Section 5, we show our experimental results. In Section 6, we summarize the paper.
2. Statistical Metrics for Performance Evaluation

In this section, we briefly review statistical metrics for evaluating the performance of different hypotheses. Performance evaluation consists of two kinds of problems: hypothesis selection problems and hypothesis ranking problems [1, 4, 5]. A hypothesis selection problem arises when we select the best one from a set of hypotheses, given their performance over some sample data. In hypothesis ranking problems, a set of hypotheses is ranked by expected performance. Hypothesis ranking problems are an extension of hypothesis selection problems [5]. Generally, statistical metrics for hypothesis selection problems can also be applied to hypothesis ranking problems. The distinction between the two is that hypothesis selection returns a single best hypothesis, whereas hypothesis ranking returns an ordering of all the hypotheses. Many metrics have been developed to solve hypothesis selection problems (see Figure 1). They can be classified into two groups: one for problems with a small number of hypotheses, and the other for problems with a large number of hypotheses. The statistical metrics for a small number of hypotheses include interval-based selection [4], the COMPOSER system [8], the Turnbull and Weiss algorithm [18], the probably approximately correct (PAC) model [19], the expected loss (EL) approach [4], the Friedman statistic [17], and the probability of win [11]. For statistical selection among a large or infinite number of hypotheses, generate-and-test search strategies are usually adopted to find the best hypothesis. As defined by Mitchell [15], the strategies can broadly
be classified as data-driven and knowledge-driven. The difference lies in the amount of testing performed: data-driven metrics do not rely on domain knowledge and often require extensive tests on the hypotheses under consideration, whereas knowledge-driven metrics depend on domain knowledge and only one or a few tests to deduce new hypotheses. In this paper, we focus on the statistical metrics for a small number of hypotheses. To find the best one among a small set of hypotheses, we compare the hypotheses pairwise.

Figure 1. Statistical hypothesis selection metrics. [Figure: a taxonomy of statistical selection metrics. Metrics for a small number of hypotheses: interval-based, COMPOSER, PAC, EL, the Turnbull and Weiss algorithm, the Friedman statistic, and probability of win. Metrics for a large or infinite number of hypotheses: data-driven and knowledge-driven approaches (depth-first, breadth-first, version-space, decision-theoretic, and explanation-based).]
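To illustrate the pairwise strategy, the sketch below selects the best of a small set of hypotheses by counting pairwise wins. The win-counting aggregation and the larger-sample-mean comparison rule are illustrative stand-ins for the statistical metrics reviewed above, not the paper's own procedure.

import statistics
from itertools import combinations

def compare(samples_a, samples_b):
    # Stand-in comparison: the hypothesis with the larger sample mean "wins".
    return statistics.mean(samples_a) > statistics.mean(samples_b)

def select_best(samples_by_hypothesis):
    # Compare every pair of hypotheses and return the one with the most wins.
    wins = {h: 0 for h in samples_by_hypothesis}
    for a, b in combinations(samples_by_hypothesis, 2):
        if compare(samples_by_hypothesis[a], samples_by_hypothesis[b]):
            wins[a] += 1
        else:
            wins[b] += 1
    return max(wins, key=wins.get)

# Example: select_best({"h1": [0.4, 0.5], "h2": [0.7, 0.6], "h3": [0.1, 0.2]}) -> "h2"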
3. A General Method for Statistical Performance Evaluation

Because a statistical metric may only be suitable for certain situations in certain applications, in this section we first propose a general method for statistical performance evaluation and then apply the general method to a real-world application.
3.1. A General Method

The general method consists of the following major steps (see Figure 2):

1. Select a set of sample data. At this step, we must choose sample data carefully when the hypotheses are too expensive to be tested extensively and the amount of available data is large, possibly infinite. On the other hand, when the size of the sample data is limited and the cost of information is high, it is very important to minimize the cost of acquiring additional samples while achieving the desired evaluation quality.

2. Test the performance of each hypothesis on the sample data. Sometimes, the performance of hypotheses
depends on the measures and algorithms we use on the sample data.

3. Select an appropriate statistical metric according to the problem parameters. These parameters may include the number of hypotheses, the size of the sample data for each hypothesis, the distribution of performance measurements, and the distribution of noise in the performance measurements.

4. Select or rank the hypotheses based on the performance measurements using the chosen statistical metric. The chosen hypothesis is the one with the best statistical value. Also, we expect the selected hypothesis to be generalizable; that is, it must perform well not only on the sample data but also on data not seen during evaluation.

The general method incorporates various statistical metrics, automatically selects an appropriate one based on the parameters of the application, and can be adapted to different applications under time and resource constraints.

Figure 2. A general method for statistical performance evaluation. [Figure: sample data are measured to produce measurement results; an appropriate statistical metric is selected according to the problem parameters; a statistical comparison then yields the final results.]
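To make the four steps concrete, the skeleton below evaluates a dictionary of hypotheses on sample data. All names are hypothetical, the metric is passed in as a function (step 3's choice, e.g. following the guidelines later derived in Section 5.3), and ranking by a single summary value stands in for the pairwise statistical comparison.

import statistics

def evaluate(hypotheses, sample_data, metric):
    # hypotheses: dict mapping a name to a callable; metric: a function that
    # turns a list of measurements into a single statistical value.
    # Step 2: test the performance of each hypothesis on the sample data.
    measurements = {name: [h(x) for x in sample_data] for name, h in hypotheses.items()}
    # Step 4: rank the hypotheses by the chosen metric's value, best first.
    return sorted(measurements, key=lambda name: metric(measurements[name]), reverse=True)

# Example usage, with the sample mean standing in for a real statistical metric:
# ranking = evaluate({"h1": abs, "h2": lambda x: x * x}, [1, -2, 3], statistics.mean)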
3.2. An Application: Performance Evaluation of HITS-based Algorithms

Kleinberg's hypertext-induced topic selection (HITS) algorithm [12] is a popular and effective algorithm for ranking documents based on the link information among a set of documents. The algorithm presumes that a good hub is a document that points to many others, and a good authority is a document that many documents point to. Hubs and authorities exhibit a mutually reinforcing relationship: a better hub points to many good authorities, and a better authority is pointed to by many good hubs. In the context of Web search, a HITS-based algorithm first collects a base document set for each query and then recursively calculates the hub and authority values for each document. To gather the base document set I, a root set R that matches the query is first fetched from a search engine; then, for each document r ∈ R, the set of documents that point to r and the set of documents that r points to are added to I as r's neighborhood. For a document i ∈ I, let ai and hi be its authority and hub values, respectively. To begin the algorithm, ai and hi are initialized to 1. While the values have not converged, the algorithm iterates as follows:

1. For each document i ∈ I, set ai to the sum of hj over all documents j ∈ I that point to i.

2. For each document i ∈ I, set hi to the sum of aj over all documents j ∈ I that i points to.

3. Normalize the ai and hi values so that Σi ai = Σi hi = 1.
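The sketch below is one minimal rendering of this iteration in Python; the adjacency-list representation, the iteration cap, and the convergence tolerance are our own choices rather than details from the paper.

def hits(out_links, max_iter=100, tol=1e-8):
    # out_links: dict mapping each document id to the list of ids it points to.
    docs = list(out_links)
    auth = {d: 1.0 for d in docs}
    hub = {d: 1.0 for d in docs}
    for _ in range(max_iter):
        # Step 1: the authority of i is the sum of hub values of documents pointing to i.
        new_auth = {d: 0.0 for d in docs}
        for d in docs:
            for t in out_links[d]:
                if t in new_auth:
                    new_auth[t] += hub[d]
        # Step 2: the hub value of i is the sum of authority values of documents i points to.
        new_hub = {d: sum(new_auth[t] for t in out_links[d] if t in new_auth) for d in docs}
        # Step 3: normalize so that the authority values and the hub values each sum to 1.
        a_sum = sum(new_auth.values()) or 1.0
        h_sum = sum(new_hub.values()) or 1.0
        new_auth = {d: v / a_sum for d, v in new_auth.items()}
        new_hub = {d: v / h_sum for d, v in new_hub.items()}
        # Stop once the values stop changing noticeably.
        change = max(abs(new_auth[d] - auth[d]) + abs(new_hub[d] - hub[d]) for d in docs)
        auth, hub = new_auth, new_hub
        if change < tol:
            break
    return auth, hub

# Example: a, h = hits({"p1": ["p2"], "p2": ["p1", "p3"], "p3": []})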
Kleinberg showed that the algorithm will eventually converge, but the bound on the number of iterations is unknown. In practice, the algorithm converges quickly. Because the HITS algorithm ranks documents depending only on the in-degree and out-degree of links, it causes problems in some cases. For example, Bharat [2] identified two problems: mutually reinforcing relationships between hosts, and topic drift. Both problems can be solved or alleviated by adding weights to documents. In Bharat's improved HITS algorithm (BHITS), to solve the first problem, a document is given an authority weight of 1/k if the document is in a group of k documents on a first host which link to a single document on a second host, and a hub weight of 1/l if there are l links from the document on a first host to a set of documents on a second host [2]. The second problem can be alleviated by adding weights to edges based on the text in the documents or their anchors [2, 3]. BHITS achieved remarkably better results than Kleinberg's HITS algorithm through a simple modification addressing the first problem, while further precision was obtained by adding content analysis for the second problem. Disregarding the time it may take, combining connectivity and content analysis has proved useful in improving precision. However, the similarity measure currently used is the vector space model [2] or simply the occurrence frequency of the query terms in the text around the anchors [3], which may not be the best way to evaluate the relevance of Web documents, because most queries submitted to search engines are short, consisting of three terms or fewer [6]. Although we can expand short queries by adding related words, the expansion itself can cause topic drift. In this paper, we statistically compare the performance of four relevance scoring methods when they are combined with Bharat's improved HITS algorithm. Three of them are variations of methods widely used in the traditional information retrieval field: cover density ranking (CDR) [6], Okapi similarity measurement (Okapi) [9], and the vector space model (VSM) [16]. The fourth is the three-level scoring method (TLS) [14], which mimics commonly used manual similarity measuring approaches.
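Returning to the host-based weighting above, the sketch below assigns the 1/k authority weights and 1/l hub weights to inter-host links; the (source URL, target URL) edge representation and the host() helper are our assumptions, and in a BHITS-style iteration each link's contribution to the authority and hub sums would be multiplied by these weights.

from collections import defaultdict
from urllib.parse import urlparse

def host(url):
    return urlparse(url).netloc

def bhits_edge_weights(edges):
    # edges: iterable of (source_url, target_url) links between documents on different hosts.
    by_target = defaultdict(list)   # links grouped by (source host, target document)
    by_source = defaultdict(list)   # links grouped by (source document, target host)
    for s, t in edges:
        by_target[(host(s), t)].append((s, t))
        by_source[(s, host(t))].append((s, t))
    auth_w, hub_w = {}, {}
    # If k documents on one host all link to the same document, each link gets authority weight 1/k.
    for group in by_target.values():
        for e in group:
            auth_w[e] = 1.0 / len(group)
    # If one document has l links to documents on the same host, each link gets hub weight 1/l.
    for group in by_source.values():
        for e in group:
            hub_w[e] = 1.0 / len(group)
    return auth_w, hub_w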
4. Performance Comparison of Statistical Metrics

In this paper, we compare the performance of five representative statistical metrics under different conditions through simulation: expected loss (EL), the Friedman statistic (Frie), interval-based selection (Int), probability of win (Pwin), and probably approximately correct (PAC). The four distributions used in our experiments are the chi-square, exponential, normal, and Poisson distributions.
When we compare the performance of different statistical metrics, we need criteria to judge which metrics are better at identifying that one distribution X is better (or worse) than another distribution Y. There are two ways to do this. One is to fix the size of the sample data from X and Y and find the correct ratios of the statistical metrics. This can be done through simulation: we randomly generate a fixed number of samples, such as 20, from each distribution, apply each statistical metric to the sample data, and check whether the result is correct. This test can be repeated many times, such as 100 times, to obtain an average correctness, which is more accurate than a single run. The other way is to find the smallest sample size that each statistical metric needs in order to identify the correct result, i.e., that X is better (or worse) than Y, with high confidence, such as 99%. This can also be done through experiments. In our experiments, we take the first approach.

In our controlled experiments, we generate the sample data from a known distribution, such as the chi-square distribution, so we know ahead of time which data set is better. We can then check how the statistical metrics perform by comparing their answers to the correct ones. We take a simple approach by assuming that the data set with the larger mean is better, although this may be subjective in some real situations. Although most of the statistical metrics we test assume that the sample data are normally distributed, we do not change their formulas when testing them on sample data with other distributions. In other words, we are testing the robustness of these metrics in situations where the sample data are not normally distributed.

Of the five statistical metrics, the Friedman statistic has two control parameters: a sample size n and a significance level α. The interval-based techniques have three control parameters: a sample size n, a desired confidence of correct ranking γ∗, and an indifference setting ε. The EL techniques have two control parameters: a sample size n and a loss threshold l∗. The PAC techniques have three control parameters: a sample size n, a desired confidence of correct ranking γ∗, and an indifference setting ε. The probability of win has two control parameters: a sample size n and a
desired confidence of correct ranking p∗. In our experiments, we set γ∗ and p∗ to 0.95, α and l∗ to 0.05, and ε to 10% of the standard deviation of the sample data.
5. Experiments

In the experiments, we first analyze the correct ratios of the five statistical metrics through simulation on data without and with noise, and derive some rules for selecting statistical metrics under different conditions. We then apply the general method to compare the performance of HITS-based algorithms using the chosen statistical metrics.
5.1. Simulation on Data Without Noise

In this section, without considering noise, we run two groups of experiments on data with different distributions and different sample sizes. In the first group, for each run we generate two sample sets using Matlab and use all the data in both sample sets to test the different statistical metrics. In the second group, we first generate two data sets of size 1000 and then randomly pick a subset from each data set for each run. In both groups, the two compared data sets have means 1 and 2, 1 and 5, 4 and 5, and 11 and 12, respectively. If the difference between the two means is less than or equal to 1, we consider the difference small; if the difference is greater than or equal to 4, we consider it big. For the normal distribution, we set the variance of each data set equal to its mean.

In the first group of experiments, to compare two random variables X and Y of the same distribution (e.g., chi-square) with different means and variances:

for i = 1 to 100 (the number of random runs) {
    1. randomly generate a sample set of size k (e.g., 30) for X using Matlab and a sample set of size k for Y.
    2. compare these two sample sets using each statistical metric.
    3. check the results against the correct answer.
}
compute the correct ratio of each statistical metric for sample size k based on the 100 random runs.
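A runnable rendering of this loop, together with the second group's subsampling variant, is sketched below in Python rather than Matlab. The larger-sample-mean decision rule is only a stand-in for applying one of the five statistical metrics, and the way each distribution is parameterized to hit a target mean is our own choice.

import numpy as np

rng = np.random.default_rng(0)

def draw(dist, mean, size):
    # Draw samples whose population mean is `mean`; for the normal case the
    # variance is set equal to the mean, as in the paper's setup.
    if dist == "chisquare":
        return rng.chisquare(df=mean, size=size)
    if dist == "exponential":
        return rng.exponential(scale=mean, size=size)
    if dist == "poisson":
        return rng.poisson(lam=mean, size=size)
    return rng.normal(loc=mean, scale=np.sqrt(mean), size=size)

def correct_ratio(dist, mu_x, mu_y, k, runs=100):
    # First group: fresh sample sets of size k are generated on every run.
    correct = 0
    for _ in range(runs):
        x = draw(dist, mu_x, k)               # step 1: sample sets for X and Y
        y = draw(dist, mu_y, k)
        decision = y.mean() > x.mean()        # step 2: stand-in comparison metric
        correct += decision == (mu_y > mu_x)  # step 3: check against the known answer
    return correct / runs

def subsample_correct_ratio(dist, mu_x, mu_y, k, runs=100):
    # Second group: subsamples of size k are drawn from two fixed data sets of size 1000.
    d1, d2 = draw(dist, mu_x, 1000), draw(dist, mu_y, 1000)
    correct = 0
    for _ in range(runs):
        x = rng.choice(d1, size=k, replace=False)
        y = rng.choice(d2, size=k, replace=False)
        correct += (y.mean() > x.mean()) == (mu_y > mu_x)
    return correct / runs

print(correct_ratio("chisquare", 1, 2, k=30))
print(np.mean([subsample_correct_ratio("chisquare", 1, 2, k=30) for _ in range(20)]))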
In the second group of experiments, to compare two random variables X and Y of the same distribution (e.g., chi-square) with different means and variances:

for i = 1 to 20 (the number of random runs for a sample set of size k, e.g., 30) {
    generate a data set D1 for X and a data set D2 for Y, each of size 1000.
    for j = 1 to 100 (the number of test runs for the correct ratio) {
        1. randomly pick a sample set of size k for X from D1 and a sample set of size k for Y from D2.
        2. compare these two sample sets using each statistical metric.
        3. check the results against the correct answer.
    }
    compute the correct ratio over the 100 test runs for sample size k.
}
calculate the average correct ratio over the 20 random runs for sample size k.

The results of the first group of experiments are shown in Figures 3 and 4. The results of the second group are not shown because they are similar to those of the first group. Figures 3 and 4 show that as the size of the sample data increases, the correct ratios of all five statistical metrics also increase. There are some exceptions, however, where the correct ratio becomes smaller when the sample set is enlarged. For example, in the top-left plot labeled chisquare_1_2 in Figure 3, the performance of PAC drops from 0.7 to 0.62 as the size of the sample set increases from 15 to 20. The reason may be that those sample sets do not represent the underlying distribution well enough. Comparing all the plots, we can easily see that EL and PAC are the two best metrics under all conditions. In particular, EL is the best for small means, like 1 or 2, and PAC is the best in all the other cases. If the difference m between two sample means µx and µy is big, e.g., m = 4 with µx = 1 and µy = 5 (see the bottom half of Figure 3), a sample set of size 10 can almost guarantee correct ratios above 90% for Int, EL, and PAC, and a sample set of size 50 is enough for all five metrics to reach correct ratios of almost 100%. If m is small, e.g., m = 1 with µx = 1 and µy = 2 (see the top half of Figure 3), a sample set of size 30 is large enough to assure correct ratios above 80% for EL and PAC; with a sample size of 40, correct ratios above 90% can be achieved by EL and PAC; and a sample set of size 200 yields 100% correct ratios for all five metrics. As µx and
µy become larger, more samples are needed to achieve high correct ratios (see Figure 4). Both groups of experiments tell us that the size of the data set does not matter much as long as it is large enough to represent the underlying distribution sufficiently. The experiments show that a data size of 1000 is good enough.

Next, we study how the performance of the statistical metrics degrades as the problem becomes harder. There are two ways to make the problem harder: the first is to make the difference between the means of the two compared data sets smaller and smaller; the second is to fix the difference between the two means and make the means of both data sets larger. For distributions like chi-square, exponential, and Poisson, the variance grows with the mean. In our experiments, we choose the second way to make the problem more difficult. We ran experiments with chi-square, exponential, Poisson, and normal data sets, respectively. The results can be found in Figure 5, which shows the correct ratios of the five statistical metrics for different mean pairs for each of the four distributions. In Figure 5, each data point of a curve is the average performance over 12 sub-sample data sets drawn from a data set of size 1000 generated by Matlab; the sizes of the 12 sub-sample data sets are 5, 10, 15, 20, 30, 40, 50, 100, 200, 500, 800, and 1000, respectively. On the x-axis, x = 1 represents "1-2", which denotes that the two compared data sets have means 1 and 2; x = 2 represents "4-5"; x = 3 represents "11-12"; x = 4 represents "21-22"; x = 5 represents "31-32"; and so on up to x = 20 for "181-182", the means increasing in steps of 10. For each distribution, the variance of each data set is set equal to its mean. Figure 5 shows that the correct ratios of the five statistical metrics drop quickly when the means increase from 1 (x = 1) to 12 (x = 3), but after that the correct ratios vary only within a small range. When the mean is small, e.g., x = 1, EL is the best, but PAC is the best when the mean is larger than 11 (x = 3). Figure 5 also shows that the correct ratios of the statistical metrics degrade at almost the same rate when the means increase from 1 to 12; after that, the degradation rates vary little.
5.2. Simulation on Noisy Data

We use white noise in our experiments. White noise is randomly (uniformly) distributed in a certain range. We determine the noise range as a percentage of the standard deviation of the base distribution; for example, 1% is small and 10% is large. In our experiments, we test the performance of the five statistical metrics on data sets contaminated by small or large white noise.
Figure 3. The correct ratios of five statistical metrics for different numbers of sample data when the distribution of the data set is chi-square, exponential, normal, and Poisson, respectively. The title normal_1_2 denotes that the data set is of normal distribution and the two compared data sets have means 1 and 2, respectively; the others are analogously defined. [Figure: eight plots of correct ratio versus number of samples (5 to 1000) for Pwin, Frie, Int, EL, and PAC, covering the mean pairs 1-2 and 1-5 under each of the four distributions.]
Figure 4. The correct ratios of five statistical metrics for different numbers of sample data when the distribution of the data set is chi-square, exponential, normal, and Poisson, respectively. The title normal_4_5 denotes that the data set is of normal distribution and the two compared data sets have means 4 and 5, respectively; the others are analogously defined. [Figure: eight plots of correct ratio versus number of samples (5 to 1000) for Pwin, Frie, Int, EL, and PAC, covering the mean pairs 4-5 and 11-12 under each of the four distributions.]
Figure 5. The correct ratios of five statistical metrics for different mean pairs when the distribution of the data set is chi-square, exponential, normal, and Poisson, respectively. The title normal denotes that the data set is of normal distribution; the others are analogously defined. [Figure: four plots of correct ratio versus the compared mean pairs (x = 1 to 20) for Pwin, Frie, Int, EL, and PAC.]
We compare four pairs of data sets for each distribution. These data sets have means 1 and 2, 4 and 5, 11 and 12, and 1 and 5 for the chi-square, exponential, and Poisson distributions, and means and variances (0,1) and (1,1), (4,1) and (5,1), (0,3) and (1,3), and (4,3) and (5,3) for the normal distribution. The data sets also have different sizes: 5, 10, 15, 20, 30, 40, 50, and 500, respectively. From the experimental results, we find that small or large noise has little or no effect on the correct ratios of the statistical metrics. We also find that for the normal distribution, if the data set pairs have the same means but larger variances, the correct ratios decrease. For example, the correct ratios for sample data pairs with means and variances (4,3) and (5,3) are lower than those for pairs with means and variances (4,1) and (5,1).
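A small sketch of the contamination step is given below. Whether the stated percentage is the half-width or the full width of the uniform noise range is not specified in the paper, so the half-width interpretation here is our assumption.

import numpy as np

def add_white_noise(data, percent, rng=None):
    # Add uniform ("white") noise whose range is the given percentage of the
    # standard deviation of the base data, e.g. percent=1 (small) or 10 (large).
    rng = rng or np.random.default_rng()
    half_range = (percent / 100.0) * np.std(data)
    return data + rng.uniform(-half_range, half_range, size=len(data))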
5.3. Guidelines for Choosing Statistical Metrics In this section, we summarize the results from all the previous experiments as follows,
• The size of the data set does not matter much as long as it is large enough to represent the underlying distribution sufficiently. The experiments show that 1000 points (data sets of size 1000) is good enough.

• EL is the best for small means, like 1 or 2, and PAC is the best in all the other cases (see Figure 5).

• (a) If the difference m between two means µx and µy is big, e.g., m = 4 with µx = 1 and µy = 5, a sample set of size 10 can almost guarantee correct ratios above 90% for Int, EL, and PAC, and a sample set of size 50 is enough for all five metrics to reach correct ratios of almost 100% (see the bottom half of Figure 3). As µx and µy become larger, more samples are needed to achieve high correct ratios. (b) If m is small, e.g., m = 1 with µx = 1 and µy = 2, a sample set of size 40 is large enough to assure correct ratios above 90% for EL and PAC, and a sample set of size 200 yields 100% correct ratios for all five metrics (see the top half of Figure 3). As µx and µy become larger, more samples are needed to achieve high correct ratios (see Figure 4).

• For the normal distribution, when the variance is large, more sample data are needed to achieve a high correct ratio.

• White noise whose range is about 10% of the standard deviation has little effect on the correct ratios.
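These guidelines can be read as a simple selection rule. The sketch below is one possible encoding; the cut-off of 2 between "small" and larger means and the sample-size suggestions are our reading of the experiments rather than values fixed by the paper.

def choose_metric(mean_estimate):
    # EL for small means (around 1 or 2), PAC otherwise.
    return "EL" if mean_estimate <= 2 else "PAC"

def suggest_sample_size(mean_difference):
    # Roughly: size 10 suffices for a big mean difference (>= 4), size 40 for a
    # small one (about 1), with more samples needed as the means themselves grow.
    return 10 if mean_difference >= 4 else 40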
Table 1. The four HITS-based algorithms

Algorithm   Description
CB          Combination of BHITS and CDR
OB          Combination of BHITS and Okapi
TB          Combination of BHITS and TLS
VB          Combination of BHITS and VSM
5.4. Experiments on the Performance of HITS-based Algorithms

In our experiments, 28 broad-topic queries and five search engines are used. The queries are exactly the same as those used in [2, 3]; one example is vintage car. For each query, to build a base set, we start five threads simultaneously to collect the top 20 hits and their neighborhoods from five search engines: AltaVista, Fast, Google, HotBot, and NorthernLight. The combination of these top hits and their neighborhoods forms the base set. For a document in the root set, we limit it to at most 50 in-links and collect all of its out-links. The default search mode of the five search engines and lower-case queries were used. The way we construct the base set is different from previous work [12, 2], which usually builds the base set from only one search engine, e.g., AltaVista. Combining the top 20 hits and their neighborhoods from five search engines gives us a more relevant base set. Running on a Sun Ultra-10 workstation with a 300 MHz UltraSPARC-IIi processor connected to the Internet by 100 Mbps fast Ethernet, the Java program took about three to five minutes to gather the base set for each query. After removing duplicate links, intra-domain links, broken links, and irrelevant links, the numbers of distinct links in the 28 base sets range from 615 to 2768. Table 1 lists the four HITS-based algorithms used in the experiments; each is the combination of BHITS with one of the four relevance scoring methods. To compare the performance of the four algorithms, we first use the pooling method [10] to build, for each query, a pool formed by the top 10 authority links and the top 10 hub links generated by each of the four algorithms. We then recruit three graduate students to personally visit all documents in each query pool and manually score them on a scale from 0 to 10, with 0 representing the most irrelevant and 10 the most relevant. A Web page receives a high score if it contains both useful and comprehensive information about the query. A page may also be given a high score if it has many links that lead to relevant content, because we encouraged the evaluators to follow outgoing links and browse a page's neighborhood.
Table 2. Average improvement (%) of relevance scores between two algorithms. Each number in the table is the improvement of the method in the first column over the method in the first row.

       CB    OB    TB    VB
CB     -     0.7   0.7   2.9
OB           -     0.0   2.2
TB                 -     2.2
We did not score pages written in a language we do not understand, and we did not tell the evaluators which algorithm a set of links was derived from. We take the average score of the three evaluators for each link, and the average score of the top 20 links (the top 10 authority links and the top 10 hub links) as the final score for a query.

Table 2 presents the average improvement (%) of relevance scores between pairs of algorithms. It shows that the combination of the BHITS algorithm with any of the four scoring methods gives comparable performance, with CDR the best and VSM the worst. The best algorithm, CB, improves over OB, TB, and VB by only 0.7%, 0.7%, and 2.9%, respectively. According to the guidelines for selecting statistical metrics, EL and PAC are the two best metrics under the various testing conditions, so we choose both of them to compare the performance of the four algorithms. We set the confidence level to 0.95 for PAC and the loss threshold to 0.05 for EL. The results are shown in Table 3. Both metrics give consistent results: CB is the best and VB is the worst among the four algorithms, and OB is better than TB, although two PAC values are slightly below the confidence level.

Table 3. Statistical performance comparison of different algorithms. CBvOB means the statistical comparison of CB over OB; the others are similarly defined.

Statistical   Statistical Comparison of a Pair of HITS-Based Algorithms
Method        CBvOB   CBvTB   CBvVB   OBvTB   OBvVB   TBvVB
EL            0.007   0.005   0.001   0.017   0.001   0.001
PAC           0.941   0.964   0.995   0.878   0.993   0.989
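As a concrete reading of the per-query scoring described in this section (the average of the three evaluators' 0-10 scores per link, then the average over the top 10 authority and top 10 hub links), a minimal sketch might look as follows; the data layout is our assumption.

import statistics

def query_score(evaluator_scores_by_link, top_authority_links, top_hub_links):
    # Average the three evaluators' scores for each link.
    link_score = {link: statistics.mean(scores)
                  for link, scores in evaluator_scores_by_link.items()}
    # The final score for a query is the average over its top 20 links.
    top_links = top_authority_links + top_hub_links
    return statistics.mean(link_score[link] for link in top_links)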
6. Conclusion

In this paper, we propose a general method for statistical performance evaluation. The method incorporates various statistical metrics and automatically selects the appropriate one based on the parameters of a specific application. By combining relevance scoring methods with a HITS-based algorithm, we statistically compare the performance of four relevance scoring methods, VSM, Okapi, TLS, and CDR, using a set of broad-topic queries; CDR is the best and VSM the worst, although the performance differences among the four relevance scoring methods are not significant.
References

[1] R. E. Bechhofer. A single-sample multiple decision procedure for ranking means of normal populations with known variances. Annals of Mathematical Statistics, 25(1):16–39, 1954.

[2] K. Bharat and M. R. Henzinger. Improved algorithms for topic distillation in a hyperlinked environment. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 104–111, 1998.

[3] S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, P. Raghavan, and S. Rajagopalan. Automatic resource compilation by analyzing hyperlink structure and associated text. Computer Networks and ISDN Systems, 30:65–74, 1998.

[4] S. Chien, J. Gratch, and M. Burl. On the efficient allocation of resources for hypothesis evaluation: A statistical approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(7):652–665, July 1995.

[5] S. Chien, A. Stechert, and D. Mutz. Efficient heuristic hypothesis ranking. Journal of Artificial Intelligence Research, 10:375–397, 1999.

[6] C. L. A. Clarke, G. V. Cormack, and E. A. Tudhope. Relevance ranking for one to three term queries. Information Processing & Management, 36:291–311, 2000.
[11] A. Ieumwananonthachai and B. W. Wah. Statistical generalization of performance-related heuristics for knowledge-lean applications. Int'l J. of Artificial Intelligence Tools, 5(1–2):61–79, June 1996.

[12] J. Kleinberg. Authoritative sources in a hyperlinked environment. In Proc. Ninth Ann. ACM-SIAM Symp. Discrete Algorithms, pages 668–677. ACM Press, New York, 1998.

[13] H. V. Leighton and J. Srivastava. First 20 precision among world wide web search services (search engines). J. of the American Society for Information Science, 50(10):870–881, 1999.

[14] L. Li and Y. Shang. A new statistical method for evaluating search engines. In Proc. IEEE 12th Int'l Conf. on Tools with Artificial Intelligence, 2000.

[15] T. M. Mitchell. Generalization as search. Artificial Intelligence, 18:203–226, 1982.

[16] G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley Series in Computer Science. Addison-Wesley Longman Publ. Co., Inc., 1989.

[17] S. Siegel and N. J. Castellan, Jr. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, 1988.
[7] D. E. Goldberg. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley Pub. Co., 1989.
[18] B. W. Turnbull and L. I. Weiss. A class of sequential procedures for k-sample problems concerning normal means with unknown equal variances. In T. J. Santner and A. C. Tamhane, editors, Design of Experiments: Ranking and Selection, pages 225–240. Marcel Dekker, 1984.
[8] J. Gratch and G. DeJong. Composer: A probabilistic solution to the utility problem in speedup learning. In Proceedings of the 10th National Conference on Artificial Intelligence, pages 235–240, 1992.
[19] D. Vanderbilt and S. G. Louie. A Monte Carlo simulated annealing approach to optimization over continuous variables. Journal of Computational Physics, 56:259–271, 1984.
[9] D. Hawking, P. Bailey, and N. Craswell. ACSys TREC-8 experiments. In Proceedings of TREC-8, 1999.

[10] D. Hawking, N. Craswell, and P. Thistlewaite. Overview of the TREC-7 very large collection track. In Proceedings of TREC-7, 1998.
[20] J. D. Villasenor, B. Belzer, and J. Liao. Wavelet filter evaluation for image compression. IEEE Trans. on Image Processing, 2:1053–1060, 1995.