Technical Report No. 9800020

Using Confidence Interval to Summarize the Evaluating Results of DSM Systems

Weisong Shi, Weiwu Hu, Zhimin Tang
Center of High Performance Computing, Institute of Computing Technology, Chinese Academy of Sciences
September 16, 1998

Using Confidence Interval to Summarize the Evaluating Results of DSM Systems
Weisong Shi and Zhimin Tang
Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080, P.R. China
E-mail: {wsshi, [email protected]}

Abstract
Distributed Shared Memory (DSM) systems have gained popular acceptance by combining the scalability and low cost of distributed systems with the ease of use of a single address space. Many new hardware DSM and software DSM systems have been proposed in recent years. In general, benchmarking is widely used to demonstrate the performance advantages of new systems. However, the common method used to summarize the measured results is the arithmetic mean of ratios, which is incorrect in some cases. Furthermore, many published papers only list a large amount of data and do not summarize it effectively, which greatly confuses users. In fact, many users want a single number as the conclusion, which the old summarizing techniques do not provide. Therefore, a new data summarizing technique based on confidence intervals is proposed in this paper. The new technique includes two data summarizing methods: (1) the paired confidence interval method, and (2) the unpaired confidence interval method. With this new technique, we can conclude at some confidence level that one system is better than another. Four examples are shown to demonstrate the advantages of the new technique. In addition, with the help of confidence levels, we propose to standardize the benchmarks used for evaluating DSM systems so that convincing results can be obtained. Furthermore, the new summarizing technique is suitable not only for evaluating DSM systems but also for evaluating other systems, such as memory systems and communication systems.

Keywords: Data Summarizing Techniques, Performance Evaluation, DSM Systems, Confidence Intervals, Benchmarking.
1 Introduction

Distributed Shared Memory (DSM) systems have gained popular acceptance by combining the scalability and low cost of distributed systems with the ease of use of a single address space. Generally, there are two methods to implement them: hardware and software; the corresponding systems are called hardware DSM and software DSM systems, respectively. To date, many commercial and research DSM systems have been implemented by corporations and research institutions. Stanford DASH [1], Stanford FLASH [2], MIT Alewife [3], UIUC I-ACOMA [4], and the SGI Origin series are representative hardware DSM systems, while Rice Munin [5], Rice TreadMarks [6], Yale IVY [7], CMU Midway [8], Utah Quarks [9], Maryland CVM [10], DIKU CarlOS [11], and ICT-CAS JIAJIA [12] are representative software DSM systems. (The work of this paper is supported by the CLIMBING Program of China and the National Science Foundation of China.)

Naturally, when a new DSM system is proposed, it is compared with other systems to show its advantages. Although analytical modeling, simulation, and measurement are the three general performance evaluation methods adopted by researchers [13], the latter two are more commonly used for evaluating DSM systems. Whether simulation or measurement is used, however, researchers choose several benchmarks that are appropriate for their system and show the ratio between the new system and other systems [10][14]. Many ratio games can be played during the data summarizing phase, and they can lead to contrary conclusions from the same data. In fact, owing to the variability of applications, no single computer system can theoretically do better than all other systems for all applications. Furthermore, since there is no standard data summarizing method for drawing conclusions, some researchers list only those results that favor their conclusions, while other, franker authors list all results, good and bad, and only tell us for what percentage of the benchmarks their system is better than the alternatives.

The main contributions of this paper are: (1) correcting the common mistakes made in evaluating DSM systems; (2) making the conclusions more intuitive than before. The remainder of this paper is organized as follows. Section 2 gives the background of performance evaluation and summarizing techniques and lists the general mistakes made by some researchers. The concept of confidence interval is introduced in Section 3. Using confidence intervals, two new data summarizing methods are presented in Section 4. Four examples excerpted from evaluations of real systems are described in Section 5. Finally, concluding remarks are presented in Section 6.
2 Background of Performance Evaluation and Summarizing Techniques

2.1 Classical Methods for Performance Evaluation
Analytical modeling, simulation, and measurement are the three widely used performance evaluation techniques. Jain gives a detailed description of the considerations that help choose an appropriate performance evaluation technique in his famous book [13]. Analytical modeling often fails to capture the complexity of the interactions among the components of parallel computer architectures; therefore, simulation and measurement are used more frequently by computer architects. Among the varieties of simulation described in the literature, those of interest to computer architects are Monte Carlo simulation, trace-driven simulation, and execution-driven simulation. Since Monte Carlo simulation is used to model probabilistic phenomena that do not change their characteristics with time, it is not suitable for evaluating DSM systems. Although trace-driven simulation was widely used in the past, it has several disadvantages [15], which has led to the prevalence of execution-driven simulation, such as MINT for MIPS processors [15], Augmint for Intel x86 processors [16], PAint for HP PA-RISC processors [17], etc. Execution-driven simulation is widely used for evaluating hardware DSM systems [15, 2, 18]. The input of an execution-driven simulator is a benchmark (in source code or binary code). Measurement means measuring the real performance on a real system, which comes in two types: hardware prototyping and software prototyping. For hardware DSM systems, hardware prototyping is inflexible and expensive [18]; therefore the general method for evaluating them is simulation. For software DSM systems, a software prototype can be implemented easily, as in TreadMarks, Munin, Midway, etc. The general method used for evaluating this kind of system is measuring the execution results of some input benchmarks.
2.2 Summarizing Techniques
In both the simulation and measurement cases, how to summarize the results collected from different benchmarks is an art. Given the same measurement results, two analysts may draw two different conclusions. For example, suppose we compare the response times of two systems A and B on two workloads. The measurement results are shown in Figure 1(a).

(a) Measured results:
    System   W1   W2
    A        20   10
    B        10   20

(b) Averaging the measured results:
    System   W1   W2   Average
    A        20   10   15
    B        10   20   15

(c) Ratios with system B as the base:
    System   W1    W2    Average
    A        2     0.5   1.25
    B        1     1     1
Figure 1: An example of different summarizing results.

There are three ways to compare the performance of the two systems. The first way is to take the average of the performance on the two workloads, which leads to the result shown in Figure 1(b); the conclusion in this case is that the two systems are equally good. The second way is to consider the performance ratio with system B as the base, as shown in Figure 1(c); the conclusion in this case is that system A is better than B. The third way is to consider the performance ratio with system A as the base, which is similar to Figure 1(c) with A and B exchanged; the conclusion in this case is that system B is better than A. This example shows that the ratio game often played by computer architects is unsound. Therefore, in the following sections, we describe a new method to summarize results.

In general, there are three data summarizing techniques: the arithmetic mean, the harmonic mean, and the geometric mean [19]. Assume $x_i$ represents the i-th result of an evaluation ($x_i$ can be computation time, speedup, or another performance metric). The definitions of these three means are as follows.
Arithmetic Mean:
    $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

Geometric Mean:
    $\dot{x} = \sqrt[n]{\prod_{i=1}^{n} x_i}$

Harmonic Mean:
    $\ddot{x} = \frac{n}{1/x_1 + 1/x_2 + \cdots + 1/x_n}$
Furthermore, there are improved versions based on these basic definitions, such as the weighted arithmetic mean and the weighted harmonic mean [19]. The corresponding definitions are:

Weighted Arithmetic Mean:
    $\bar{x} = \sum_{i=1}^{n} w_i x_i$

Weighted Harmonic Mean:
    $\ddot{x} = \frac{1}{w_1/x_1 + w_2/x_2 + \cdots + w_n/x_n}$

where $w_1 + w_2 + \cdots + w_n = 1$ and $w_i$ is the frequency of the i-th benchmark in the applications.

The final objective of performance evaluation is choosing the best alternative. Generally, we use the arithmetic mean of the differences among the systems. However, for evaluating DSM systems, the arithmetic mean and harmonic mean of these differences have no physical meaning. On the other hand, many researchers like to normalize the performance metric to a reference system and take the average of the normalized metric. In this case, the arithmetic mean should not be used as the average normalized metric [19], or we may obtain a wrong conclusion, as demonstrated in the example above. Instead, the geometric mean of the normalized results should be used. The objective of the comparison is to estimate a ratio c that indicates the advantage of the new design over the original design. If we find that all the ratios are approximately a constant, then we assume the measurements before and after the change follow the multiplicative model [13]:

    $a_i = c\, b_i$                                                        (1)

The best estimate of the effect c in this case is obtained by taking the logarithm of the model:

    $\log a_i = \log c + \log b_i \;\Longrightarrow\; \log c = \log a_i - \log b_i$        (2)

and estimating $\log c$ as the arithmetic mean of $\log a_i - \log b_i$, which is equivalent to estimating c as the geometric mean of $a_i/b_i$. The geometric mean can be used if and only if the assumption that the data follow the multiplicative model is correct. If the variance of the ratios $r_i = a_i/b_i$ is very large, the absolute deviations and confidence intervals must be used to summarize the results.
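As a concrete illustration, the following Python sketch (our code, not part of the original report; the timing numbers are made up for illustration) computes the three basic means and estimates the ratio c of the multiplicative model as the geometric mean of the per-benchmark ratios:

    import math

    def arithmetic_mean(xs):
        return sum(xs) / len(xs)

    def geometric_mean(xs):
        # n-th root of the product, computed via logarithms for numerical stability
        return math.exp(sum(math.log(x) for x in xs) / len(xs))

    def harmonic_mean(xs):
        return len(xs) / sum(1.0 / x for x in xs)

    # Hypothetical execution times of an original (b_i) and a new (a_i) system
    # on the same benchmarks.
    original = [20.0, 10.0, 15.0, 30.0]
    new      = [10.0, 9.0, 12.0, 24.0]

    # Under the multiplicative model a_i = c * b_i, the best estimate of c is
    # the geometric mean of the per-benchmark ratios a_i / b_i.
    ratios = [a / b for a, b in zip(new, original)]
    print("arithmetic mean of ratios:", arithmetic_mean(ratios))
    print("geometric mean of ratios (estimate of c):", geometric_mean(ratios))
    print("harmonic mean of ratios:", harmonic_mean(ratios))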
Furthermore, for DSM systems, whether implemented in hardware or software, the evaluation method used by many researchers is simplified to measuring two systems on just 5 or 10 workloads and then declaring one system definitely better than the others. From the viewpoint of methodology, however, this approach is questionable, even though it is widely used. When one workload is run on a system, it is just a sample, and a sample is only an example; one example is often not enough to prove a conclusion. This problem is addressed in the following sections.
3 Concept of Confidence Interval

The basic idea behind confidence intervals is that a definite statement cannot be made about the characteristics of all systems, but a probabilistic statement can be made about the range in which the characteristics of most systems would fall. This idea is very useful when we compare different DSM systems: the variety of applications means that it is not possible for one system to be better than the others in all cases [20].

In order to get a clear understanding of confidence intervals, we need some background on sample means and population means. Suppose there are several billion random numbers with a given property, for instance, population mean $\mu$ and standard deviation $\sigma$. We now put these numbers in an urn and draw a sample of n observations. Suppose the sample $x_1, x_2, \ldots, x_n$ has a sample mean $\bar{x}$, which is likely to be different from the population mean $\mu$. Each sample mean is an estimate of the population mean. Given k samples, we have k different estimates, and the general problem is to get a single estimate of the population mean from these k estimates. That is similar to concluding which alternative is the best by comparing several benchmarks. In fact, it is impossible to get a perfect estimate of the population mean from any finite number of finite-size samples. The best we can do is to get probabilistic bounds. Thus, we may be able to get two bounds, for instance $c_1$ and $c_2$, such that there is a high probability, $1-\alpha$, that the population mean is in the interval $(c_1, c_2)$:

    Probability$\{c_1 \le \mu \le c_2\} = 1-\alpha$.

The interval $(c_1, c_2)$ is called the confidence interval for the population mean, $\alpha$ is called the significance level, and $100(1-\alpha)$ is called the confidence level. A confidence interval can be two-sided or one-sided.

From the above definition, in order to estimate the population mean we would need to gather k samples, which entails a large effort. Fortunately, it is not necessary to gather too many samples, because the central limit theorem allows us to determine the distribution of the sample mean: the sum of a large number of independent observations from any distribution tends to have a normal distribution. With the central limit theorem, a two-sided $100(1-\alpha)\%$ confidence interval for the population mean is given by

    $(\bar{x} - z_{1-\alpha/2}\, s/\sqrt{n},\ \bar{x} + z_{1-\alpha/2}\, s/\sqrt{n})$,

where $\bar{x}$ is the sample mean, s is the sample standard deviation, n is the sample size, and $z_{1-\alpha/2}$ is the $(1-\alpha/2)$-quantile of a unit normal variate. This formula applies only when the number of samples is larger than 30. When the number of samples is less than 30, confidence intervals can be constructed only if the observations come from a normally distributed population. For such samples, the $100(1-\alpha)\%$ confidence interval is given by

    $(\bar{x} - t_{[1-\alpha/2;\, n-1]}\, s/\sqrt{n},\ \bar{x} + t_{[1-\alpha/2;\, n-1]}\, s/\sqrt{n})$,

where $t_{[1-\alpha/2;\, n-1]}$ is the $(1-\alpha/2)$-quantile of a t-variate with $n-1$ degrees of freedom. The interval is based on the fact that for samples from a normal population $N(\mu, \sigma)$, $(\bar{x}-\mu)/(\sigma/\sqrt{n})$ has a $N(0, 1)$ distribution and $(n-1)s^2/\sigma^2$ has a chi-square distribution with $n-1$ degrees of freedom; therefore, $(\bar{x}-\mu)/\sqrt{s^2/n}$ has a t distribution with $n-1$ degrees of freedom. Figure 2 shows a sample t density function: the probability of the random variable being less than $-t_{[1-\alpha/2;\, n-1]}$ is $\alpha/2$; similarly, the probability of the random variable being more than $t_{[1-\alpha/2;\, n-1]}$ is $\alpha/2$; and the probability that the variable lies between $-t_{[1-\alpha/2;\, n-1]}$ and $t_{[1-\alpha/2;\, n-1]}$ is $1-\alpha$.

[Figure 2 plots the $t(n-1)$ density function over $(\bar{x}-\mu)/s$: the area between $-t_{[1-\alpha/2;\, n-1]}$ and $+t_{[1-\alpha/2;\, n-1]}$ is $1-\alpha$, and each tail has area $\alpha/2$.]

Figure 2: $(\bar{x}-\mu)/\sqrt{s^2/n}$ follows a $t(n-1)$ distribution.
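As a concrete illustration of this formula, the sketch below (our code, not part of the original report; it assumes SciPy is available for the t quantile) computes a two-sided t-based confidence interval for the mean of a small sample:

    import math
    import statistics

    from scipy import stats

    def t_confidence_interval(sample, confidence=0.90):
        """Two-sided 100*confidence% confidence interval for the population mean,
        assuming the (fewer than 30) observations come from a roughly normal population."""
        n = len(sample)
        mean = statistics.mean(sample)
        s = statistics.stdev(sample)                 # sample standard deviation (n-1 denominator)
        alpha = 1.0 - confidence
        t = stats.t.ppf(1.0 - alpha / 2.0, n - 1)    # (1-alpha/2)-quantile of t with n-1 dof
        half_width = t * s / math.sqrt(n)
        return mean - half_width, mean + half_width

The same helper is reused in the examples of the following sections.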
4 New Data Summarizing Technique

As discussed in Section 2, when we evaluate a new idea in a hardware DSM system, we design a simulator for that system and choose several benchmarks to evaluate it, and for software DSM systems the general evaluation method is benchmarking as well. To date, SPLASH [21] and SPLASH-2 [22] are the most widely used benchmark suites for DSM or shared memory systems. There are 7 applications in SPLASH and 12 in SPLASH-2. SPLASH-2 is an expanded and modified version of the SPLASH suite; it includes 4 kernels and 8 applications. The SPLASH-2 suite contains programs that (1) represent a wide range of computations in scientific, engineering, and graphics domains; (2) use better algorithms and implementations; and (3) are more architecture aware.

We assume that n benchmarks are used for performance evaluation. The performance metric is the computation time (in other cases, for example when memory hierarchy performance is evaluated, cache hit ratio or other performance metrics may be used). The results of system A are represented by $a_1, \ldots, a_n$, the results of system B are represented by $b_1, \ldots, b_n$, and the ratios between the two systems are represented by $r_1, \ldots, r_n$. We assume these results have already been measured; the next problem is how to summarize them. In the following discussion we compare two systems only, but the methods can easily be extended to more systems. We propose two methods to summarize these results.
4.1 Paired Con dence Interval Method
In the above discussion, if all n experiments were conducted on two systems such that there is a one-to-one correspondence between the i-th test on system A and the i-th test on system B, the observations are called paired. If there is no correspondence between the two samples, the observations are called unpaired; that case is considered in the next subsection. This subsection considers the paired case only.

When some of the absolute differences between two systems are greater than zero and others are less than zero, it is difficult to summarize them by a single ratio, similar to the case in which the ratios between two systems fluctuate around one. In such cases, we may still be able to conclude that one system is better than the other at a $100(1-\alpha)\%$ confidence level. For example, suppose six similar workloads were used on two systems and the observations are {(5.4, 19.1), (16.6, 3.5), (0.6, 3.4), (1.4, 2.5), (0.6, 3.6), (7.3, 1.7)}. From these data, the absolute performance differences constitute a sample of six observations: {-13.7, 13.1, -2.8, -1.1, -3.0, 5.6}. The sample mean is -0.32, so some researchers would conclude that system B is better than system A. However, at a 90% confidence level there is no difference between the two systems. The derivation is as follows:

    Sample mean = -0.32
    Sample variance = 81.62
    Sample standard deviation = 9.03
    Confidence interval for the mean = $-0.32 \pm t\,\sqrt{81.62/6} = -0.32 \pm 3.69\, t$

The 0.95-quantile of a t-variate with five degrees of freedom is 2.015, so the 90% confidence interval is (-7.75, 7.11). Since this confidence interval includes zero, the two systems are not significantly different. If we apply the confidence interval method to ratios and find that one falls within the corresponding confidence interval, we likewise conclude that there is no difference between the two systems, as shown in example three of the next section. In general, if the confidence interval lies below zero, we say the former system is better than the latter; if the confidence interval lies above zero, we say the latter system is better than the former; and when zero is included in the confidence interval, we say the two systems are not different at this confidence level.
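The paired procedure can be reproduced with a few lines of code. The sketch below is our illustration (it reuses the t_confidence_interval helper defined after Section 3) and applies the method to the six differences above:

    # Paired comparison: work on the per-workload differences a_i - b_i.
    system_a = [5.4, 16.6, 0.6, 1.4, 0.6, 7.3]
    system_b = [19.1, 3.5, 3.4, 2.5, 3.6, 1.7]
    differences = [a - b for a, b in zip(system_a, system_b)]   # [-13.7, 13.1, -2.8, -1.1, -3.0, 5.6]

    low, high = t_confidence_interval(differences, confidence=0.90)
    print((low, high))   # roughly (-7.8, 7.1): the interval straddles zero, so the
                         # two systems are not significantly different at 90% confidence.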
4.2 Unpaired Con dence Interval Method
Sometimes we need to compare two systems using unpaired benchmarks. For example, when we want to compare the difference between PVM and TreadMarks, the applications used as benchmarks for the two systems may be different [23], which results in unpaired observations. Making a comparison in this case is a bit more complicated than for paired observations. Suppose we have two samples of size $n_a$ and $n_b$ for alternatives A and B, respectively. The observations are unpaired in the sense that there is no one-to-one correspondence between the observations of the two systems. Determining the confidence interval for the difference in mean performance then requires estimating the variance and the effective number of degrees of freedom. The procedure is as follows:

1. Compute the sample means:
    $\bar{x}_a = \frac{1}{n_a}\sum_{i=1}^{n_a} x_{ia}$,   $\bar{x}_b = \frac{1}{n_b}\sum_{i=1}^{n_b} x_{ib}$,
   where $x_{ia}$ and $x_{ib}$ are the i-th observations on systems A and B, respectively.

2. Compute the sample standard deviations:
    $s_a = \sqrt{\frac{(\sum_{i=1}^{n_a} x_{ia}^2) - n_a \bar{x}_a^2}{n_a - 1}}$,   $s_b = \sqrt{\frac{(\sum_{i=1}^{n_b} x_{ib}^2) - n_b \bar{x}_b^2}{n_b - 1}}$.

3. Compute the mean difference: $\bar{x}_a - \bar{x}_b$.

4. Compute the standard deviation of the mean difference:
    $s = \sqrt{\frac{s_a^2}{n_a} + \frac{s_b^2}{n_b}}$.

5. Compute the effective number of degrees of freedom:
    $\nu = \frac{\left(s_a^2/n_a + s_b^2/n_b\right)^2}{\frac{1}{n_a+1}\left(s_a^2/n_a\right)^2 + \frac{1}{n_b+1}\left(s_b^2/n_b\right)^2} - 2$.

6. Compute the confidence interval for the mean difference: $(\bar{x}_a - \bar{x}_b) \pm t_{[1-\alpha/2;\,\nu]}\, s$, where $t_{[1-\alpha/2;\,\nu]}$ is the $(1-\alpha/2)$-quantile of a t-variate with $\nu$ degrees of freedom.

7. If the confidence interval includes zero, the difference is not significant at the $100(1-\alpha)\%$ confidence level. If the confidence interval does not include zero, the sign of the mean difference indicates which system is better.

This procedure is known as a t test. In the next section, we present four examples excerpted from the real system evaluations described in [23], [14], [24], and [12], respectively.
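Before turning to those examples, we note that the unpaired procedure translates directly into code. The sketch below is our illustration of steps 1-7 above (the degrees-of-freedom approximation is the one given in the text, following [13]); it assumes SciPy is available for the t quantile:

    import math

    from scipy import stats

    def unpaired_confidence_interval(xa, xb, confidence=0.90):
        """Confidence interval for the difference of means of two unpaired samples,
        following steps 1-7 of Section 4.2."""
        na, nb = len(xa), len(xb)
        mean_a, mean_b = sum(xa) / na, sum(xb) / nb
        # Step 2: sample variances (n-1 in the denominator).
        var_a = (sum(x * x for x in xa) - na * mean_a ** 2) / (na - 1)
        var_b = (sum(x * x for x in xb) - nb * mean_b ** 2) / (nb - 1)
        # Steps 3-4: mean difference and its standard deviation.
        diff = mean_a - mean_b
        s = math.sqrt(var_a / na + var_b / nb)
        # Step 5: effective number of degrees of freedom (approximation from the text).
        num = (var_a / na + var_b / nb) ** 2
        den = (var_a / na) ** 2 / (na + 1) + (var_b / nb) ** 2 / (nb + 1)
        dof = num / den - 2
        # Step 6: confidence interval for the mean difference.
        alpha = 1.0 - confidence
        t = stats.t.ppf(1.0 - alpha / 2.0, dof)
        return diff - t * s, diff + t * s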
5 Example Analysis

5.1 Example One
In [23], Lu et al. use eight benchmarks to quantify the performance difference between PVM and TreadMarks. They ported eight parallel programs to both systems: Water and Barnes-Hut from the SPLASH benchmark suite; 3-D FFT, IS, and EP from the NAS benchmarks; and ILINK, SOR, and TSP. However, PVM and TreadMarks use different programming models: PVM uses a message passing programming model, while TreadMarks supplies a shared memory programming interface. Therefore, the benchmark programs used for evaluation are not identical. In order to compare the speedups of the two systems, the authors use the sequential execution time as the base time and measure the speedups as follows. For the eight benchmarks on PVM, the speedups are {8.00, 5.71, 7.56, 8.07, 6.88, 6.19, 5.20, 4.58}, while the speedups on TreadMarks are {8.00, 5.23, 6.24, 7.65, 7.27, 5.87, 3.73, 2.42}. This case belongs to unpaired performance evaluation, so our new summarizing technique is useful for deciding which system has the better speedup. The summarizing procedure is as follows, where $x_a$ and $x_b$ represent the PVM and TreadMarks systems, respectively.

For PVM:
    Mean $\bar{x}_a$ = 6.524
    Variance $s_a^2$ = 1.725
    $n_a$ = 8
For TreadMarks:
    Mean $\bar{x}_b$ = 5.801
    Variance $s_b^2$ = 3.332
    $n_b$ = 8
Then:
    Mean difference $\bar{x}_a - \bar{x}_b$ = 0.7225
    Standard deviation of the mean difference = 0.795
    Effective number of degrees of freedom = 23.821
    0.95-quantile of a t-variate with 23 degrees of freedom = 1.714
    90% confidence interval for the difference = (-0.537, 2.082)

Since the confidence interval includes zero, we conclude that, from the viewpoint of speedup, there is no significant difference between PVM and TreadMarks at the 90% confidence level. However, in [23] the authors conclude that PVM is better than TreadMarks because the speedup of PVM is larger than that of TreadMarks for 75% of the benchmarks. In fact, at a 60% confidence level, the confidence interval is (0.040, 1.405); therefore, their conclusion holds with only 60% confidence.
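For reference, feeding the speedups above into the unpaired sketch from Section 4.2 gives the same qualitative answer; the exact bounds may differ somewhat from the numbers quoted above because of rounding, but the 90% interval still straddles zero:

    pvm        = [8.00, 5.71, 7.56, 8.07, 6.88, 6.19, 5.20, 4.58]
    treadmarks = [8.00, 5.23, 6.24, 7.65, 7.27, 5.87, 3.73, 2.42]

    low, high = unpaired_confidence_interval(pvm, treadmarks, confidence=0.90)
    print((low, high))   # the interval contains zero: no significant difference at 90%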
5.2 Example Two
L. Yang et al. propose a new processor architecture for hardware DSM systems in [24] that effectively exploits the advantages of integrating processor and memory. In [24], they compare their new architecture, named AGG, with traditional COMA and NUMA architectures. They use four benchmarks to evaluate them, where FFT and Radix come from the SPLASH-2 suite and 102.swim and 101.tomcatv come from the SPECfp95 suite, and they use simulation to evaluate the new architecture. Although [24] compares AGG, NUMA, and COMA at both 15% and 75% memory pressure, we analyze only part of the results here, namely the measured execution times of the four applications on COMA15 and D_AGG15. We use $X_a$ to represent the AGG system with direct-mapped P-node memories (i.e., D_AGG15) and $X_b$ to represent the COMA. Then $X_a$ = {108, 54, 19, 51} and $X_b$ = {70, 53, 37, 61}. This is a paired comparison, so we use the paired confidence interval method proposed in Section 4.1. The differences between D_AGG15 and COMA15 are {38, 1, -18, -10}.

    Sample size n = 4
    Mean = 11/4 = 2.75
    Sample variance = $\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ = 612.92
    Sample standard deviation = $\sqrt{612.92}$ = 24.76
    Confidence interval = $2.75 \pm t \cdot 24.76/\sqrt{4} = 2.75 \pm 12.38\, t$
    $100(1-\alpha) = 90$, $\alpha = 0.1$, $1-\alpha/2 = 0.95$
    0.95-quantile of a t-variate with three degrees of freedom = 2.353
    90% confidence interval for the difference = (-26.38, 31.88)

Since the confidence interval includes zero, we conclude that at the 90% confidence level there is no significant difference between D_AGG and COMA at 15% memory pressure. At 75% memory pressure we can draw a similar conclusion, so that computation is not included in this paper. This conclusion shows that if the new AGG architecture uses a direct-mapped memory organization, its performance advantages cannot be exploited at all. However, from the direct measurement results we could not draw this useful conclusion so easily, since half of the four benchmarks favor one alternative and half favor the other.
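For reference, the same paired computation can be reproduced with the t_confidence_interval sketch from Section 3 (our code, not from [24]):

    d_agg15 = [108.0, 54.0, 19.0, 51.0]   # D_AGG15 execution times
    coma15  = [70.0, 53.0, 37.0, 61.0]    # COMA15 execution times
    diffs = [a - b for a, b in zip(d_agg15, coma15)]   # [38, 1, -18, -10]

    low, high = t_confidence_interval(diffs, confidence=0.90)
    print((low, high))   # the 90% interval straddles zero, so D_AGG15 and COMA15
                         # are not significantly different at 15% memory pressure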
5.3 Example Three
In [14], Thitikamol and Keleher evaluate the use of multithreading to hide remote memory and synchronization latencies in the software DSM system CVM. Benchmarking is used for the evaluation. Their application suite consists of seven applications: Barnes, FFT, Ocean, Water-Sp, and Water-Nsq from the SPLASH-2 suite, and SOR and SWM750 from the SPEC92 benchmark suite. They evaluate these applications for 4 and 8 processors, and from one to four threads per node, normalized to single-thread execution times. In [14], the authors only enumerate the evaluation results; for example, the normalized execution times with three threads are {0.95, 1.12, 0.8, 1.2, 0.95, 0.95, 0.8}, respectively. They do not state a conclusion about whether multithreading is better than a single thread or not. With our new data summarizing method, we draw the following conclusions.

First, since the ratios are given by the original paper and they are approximately a constant, the geometric mean is adopted for these ratios. The average ratio is 0.967. Since 0.967 is less than 1, we conclude that when three threads are executed per node, the overall performance increases a little. However, using only seven benchmarks we cannot draw a definite conclusion about whether multithreading is valuable or not; readers want to know the probabilistic bounds of the comparison. We therefore apply the confidence interval method introduced in this paper, assuming a 90% confidence level in the following computation.

    Sample size n = 7
    Mean = 6.77/7 = 0.967
    Sample variance = $\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ = 0.022
    Sample standard deviation = $\sqrt{0.022}$ = 0.148
    Confidence interval = $0.967 \pm t \cdot 0.148/\sqrt{7} = 0.967 \pm 0.056\, t$
    $100(1-\alpha) = 90$, $\alpha = 0.1$, $1-\alpha/2 = 0.95$
    90% confidence interval for the ratio = (0.859, 1.075)
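For reference, this interval can be reproduced with the t_confidence_interval sketch from Section 3 applied directly to the normalized execution times (our code):

    ratios = [0.95, 1.12, 0.8, 1.2, 0.95, 0.95, 0.8]   # three threads, normalized to one thread

    low, high = t_confidence_interval(ratios, confidence=0.90)
    print((low, high))   # roughly (0.86, 1.08): the interval contains 1.0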
The confidence interval includes one; therefore, we cannot say with 90% confidence that multithreading is better than a single thread. However, with 70% confidence, the confidence interval is (0.936, 0.998). One does not lie in this interval, so we can say only with 70% confidence that multithreading is better than a single thread. This conclusion shows that running three threads per node is not convincing enough to be adopted by software DSM systems, even though the geometric mean suggests that three threads are better than a single thread.

From example three we find that FFT and Water-Sp are not suitable for the multithreaded programming model. However, they represent two important application characteristics. In order to demonstrate the robustness of a new system, all benchmarks must be used. In the past, without good data summarizing techniques, some researchers did not use the applications that did not favor their conclusions. With confidence intervals, all these benchmarks, both favorable and unfavorable, can be used, so the results are more convincing than before. Therefore, we suggest using SPLASH-2 as the standard benchmark suite for shared memory systems (including DSM systems).
5.4 Example Four
In [12], W. Hu et al. use seven benchmarks to quantify the performance difference between JIAJIA and CVM. They port seven parallel programs to both systems: Water and Barnes-Hut from the SPLASH benchmark suite, LU from the SPLASH-2 suite, IS and EP from the NAS benchmarks, and SOR and TSP from the TreadMarks distribution. The speedups of the two systems are as follows. For JIAJIA, the speedups are {6.72, 6.38, 3.39, 7.89, 6.22, 5.98, 5.27}, while the speedups of CVM are {4.47, 6.26, 2.41, 7.89, 6.56, 4.49, 2.30}. The differences between them are {2.25, 0.12, 0.98, 0.00, -0.34, 1.49, 2.97}. For all applications but one, IS, JIAJIA is superior to CVM, so it seems that JIAJIA is better than CVM. However, using our new summarizing technique, we find that the conclusion that JIAJIA is better than CVM has only an 80% confidence level. The summarizing procedure is as follows, where the differences are represented by $x_i$.

    Sample size n = 7
    Mean = 1.07
    Sample variance = $\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ = 2.34
    Sample standard deviation = $\sqrt{2.34}$ = 1.53
    Confidence interval = $1.07 \pm t \cdot 1.53/\sqrt{7} = 1.07 \pm 0.58\, t$
    $100(1-\alpha) = 90$, $\alpha = 0.1$, $1-\alpha/2 = 0.95$
    90% confidence interval for the difference = (-0.04, 2.20)

Since the confidence interval includes zero, we conclude that, from the viewpoint of speedup, there is no significant difference between CVM and JIAJIA at the 90% confidence level. However, when the confidence level is set to 80%, the confidence interval becomes (0.24, 1.9). In other words, the conclusion that JIAJIA is better than CVM has only an 80% confidence level.
6 Conclusions

In this paper, we first review the methods used for performance evaluation and data summarizing, and we list the common mistakes made by many researchers. We point out that using the arithmetic mean of normalized execution times is sometimes incorrect; when the ratios are approximately a constant, the geometric mean is recommended instead. Furthermore, in many cases users want a probabilistic statement about the range in which the characteristics of most systems would fall. Therefore, the concept of confidence interval is applied in this paper to compare systems in paired or unpaired mode. The main contributions of this paper are:

1. Correcting the common mistakes made in the performance evaluation of DSM systems.

2. Making the conclusions more intuitive than before.

3. Making it possible to use more representative benchmarks than before. In the past, in order to demonstrate their new system or new idea, many authors chose only the benchmarks that favored their systems, with the result that only 3-5 benchmarks were used for evaluation. Benchmarks that did not favor the new system, yet represent important application characteristics, were omitted; therefore, such performance evaluations are not persuasive enough. With confidence intervals, all benchmarks can be used even if some results are unfavorable.

4. Providing a new data summarizing technique for evaluating other systems, such as memory systems, communication systems, etc.
With confidence intervals, we propose three suggestions for future performance evaluation: (1) for any new shared memory system, at least 10 benchmarks from SPLASH-2 should be used for performance evaluation; (2) conclusions are convincing if and only if the confidence level is equal to or larger than 90%; and (3) more representative benchmarks should be included in the SPLASH-2 suite for future evaluation, such as OLTP and debit-credit applications.
References
[1] D. E. Lenoski, J. Laudon, K. Gharachorloo, W.-D. Weber, A. Gupta, J. L. Hennessy, M. Horowitz, and M. S. Lam. The Stanford DASH multiprocessor. IEEE Computer, 25(3):63-79, March 1992.
[2] J. Kuskin, D. Ofelt, M. Heinrich, J. Heinlein, R. Simoni, K. Gharachorloo, J. Chapin, D. Nakahira, J. Baxter, M. Horowitz, A. Gupta, M. Rosenblum, and J. L. Hennessy. The Stanford FLASH multiprocessor. In Proc. of the 21st Annual Int'l Symp. on Computer Architecture (ISCA'94), pages 302-313, April 1994.
[3] A. Agarwal, D. Chaiken, K. Johnson, D. Kranz, J. Kubiatowicz, K. Kurihara, B.-H. Lim, G. Maa, and D. Nussbaum. The MIT Alewife machine: A large-scale distributed-memory multiprocessor. In M. Dubois and S. S. Thakkar, editors, Scalable Shared Memory Multiprocessors, pages 239-261. Kluwer Academic Publishers, 1992.
[4] J. Torrellas and D. Padua. The Illinois Aggressive COMA multiprocessor project (I-ACOMA). In Proc. of the 6th Symp. on the Frontiers of Massively Parallel Computing (Frontiers'96), October 1996.
[5] J. B. Carter, J. K. Bennett, and W. Zwaenepoel. Implementation and performance of Munin. In Proc. of the 13th ACM Symp. on Operating Systems Principles (SOSP-13), pages 152-164, October 1991.
[6] P. Keleher, S. Dwarkadas, A. L. Cox, and W. Zwaenepoel. TreadMarks: Distributed shared memory on standard workstations and operating systems. In Proc. of the Winter 1994 USENIX Conference, pages 115-131, January 1994.
[7] K. Li. IVY: A shared virtual memory system for parallel computing. In Proc. of the 1988 Int'l Conf. on Parallel Processing (ICPP'88), volume II, pages 94-101, August 1988.
[8] B. N. Bershad, M. J. Zekauskas, and W. A. Sawdon. The Midway distributed shared memory system. In Proc. of the 38th IEEE Int'l Computer Conf. (COMPCON Spring'93), pages 528-537, February 1993.
[9] D. R. Khandekar. Quarks: Distributed shared memory as a building block for complex parallel and distributed systems. Master's thesis, Department of Computer Science, The University of Utah, March 1996.
[10] P. Keleher. The relative importance of concurrent writers and weak consistency models. In Proc. of the 16th Int'l Conf. on Distributed Computing Systems (ICDCS-16), pages 91-98, May 1996.
[11] P. T. Koch, R. J. Fowler, and E. B. Jul. Message-driven relaxed consistency in a software distributed shared memory. In Proc. of the 1st Symp. on Operating Systems Design and Implementation (OSDI'94), pages 75-85, November 1994.
[12] W. Hu, W. Shi, and Z. Tang. JIAJIA: An SVM system based on a new cache coherence protocol. Technical Report TR980001, Institute of Computing Technology, Chinese Academy of Sciences, January 1998.
[13] R. Jain. The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling. John Wiley & Sons, Inc., 1991.
[14] K. Thitikamol and P. Keleher. Multi-threading and remote latency in software DSMs. In Proc. of the 17th Int'l Conf. on Distributed Computing Systems (ICDCS-17), May 1997.
[15] J. Veenstra and R. Fowler. MINT: A front end for efficient simulation of shared memory multiprocessors. In Proc. of the 2nd Int'l Workshop on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS'94), pages 201-207, February 1994.
[16] A. Nguyen, M. Michael, A. Sharma, and J. Torrellas. The Augmint multiprocessor simulation toolkit for Intel x86 architectures. In Proc. of the 1996 International Conference on Computer Design, October 1996.
[17] L. Stoller, R. Kuramkote, and M. Swanson. PAint: PA instruction set interpreter. Technical report, Department of Computer Science, University of Utah, March 1996.
[18] S. Herrod. TangoLite: A multiprocessor simulation environment. Technical report, Stanford University, November 1993.
[19] D. Patterson and J. Hennessy. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, Inc., 1996.
[20] S. V. Adve, A. L. Cox, S. Dwarkadas, R. Rajamony, and W. Zwaenepoel. A comparison of entry consistency and lazy release consistency implementations. In Proc. of the 2nd IEEE Symp. on High-Performance Computer Architecture (HPCA-2), pages 26-37, February 1996.
[21] J. P. Singh, W.-D. Weber, and A. Gupta. SPLASH: Stanford parallel applications for shared memory. ACM Computer Architecture News, 20(1):5-44, March 1992.
[22] S. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 programs: Characterization and methodological considerations. In Proc. of the 22nd Annual Int'l Symp. on Computer Architecture (ISCA'95), pages 24-36, June 1995.
[23] H. Lu, S. Dwarkadas, A. L. Cox, and W. Zwaenepoel. Quantifying the performance differences between PVM and TreadMarks. Journal of Parallel and Distributed Computing, 43(2):65-78, June 1997.
[24] L. Yang, A. Nguyen, and J. Torrellas. How processor-memory integration affects the design of DSMs. In Workshop on Mixing Logic and DRAM: Chips that Compute and Remember, ISCA'97, June 1997.