EuroBen Benchmark results for the Hitachi S3800

Aad J. van der Steen
Academic Computing Centre Utrecht
The Netherlands
[email protected]

September 22, 1993

Abstract

In this contribution we report benchmark results obtained with the EuroBen Benchmark version 3.0. Results for one CPU are presented. To add some perspective, we occasionally compare some of the results with those obtained on Cray systems (Y-MP and C90), NEC systems (SX-3/12 and /14), and a Siemens-Nixdorf S200.

1 Introduction

The EuroBen Benchmark is directed at the performance evaluation of high-performance computers as used in technical and scientific areas. The EuroBen Benchmark limits itself to these areas and attempts to provide a benchmark that addresses the performance of high-performance systems on numerically-oriented programs. Its main characteristics are that it is modular and hierarchical, i.e., the results obtained from module 1 help to explain the results from module 2, and these modules both assist in understanding the results found from module 3. The structure is as follows:

- Module 1 contains programs to measure basic properties of processor-memory systems, like speeds of basic operations, memory bank conflicts, speed and accuracy of intrinsic functions, etc. These measurements are done in five programs.
- Module 2 contains basic algorithms: linear algebra algorithms, like the solution of dense and banded linear systems, dense and sparse eigenvalue problems, FFT algorithms, and a machine-independent random number generator (sequential and parallel variants). Module 2 contains eight programs.
- Module 3 contains application kernels: PDE solvers (multigrid method, fast Poisson method, and block relaxation method), an ODE solver, linear and non-linear least squares programs, and two programs to test I/O: a very large out-of-core matrix multiplication and a very large out-of-core 2-D FFT program. Module 3 also contains eight programs.

A full description of the EuroBen Benchmark, together with the rationale for its contents and structure, can for instance be found in [8]. In the following sections we give an overview of the architecture of the Hitachi S3800, followed by a summary of the benchmark results for modules 1, 2, and 3. Finally, we make some additional remarks regarding the results and the use of the system.

2 Architecture of the S3800

The Hitachi S3800 is a system of the multiple-headed vector-processor type, comparable to the Cray Y-MP (C90) and the NEC SX-3 series. Table 1 gives an overview of the available models with their characteristics. The S3800 is the current top-end system of Hitachi's S3000 series. Five different models are offered: the 160 and the 260, in which the 260 is simply the 2-CPU version of the 160.

Table 1: The S3800/x60 and S3800/y8z models (x = 1, 2; y = 1, 2, 4; z = 0, 2).

  Model                        S3800/x60      S3800/y8z
  Clock cycle VPU              2 ns           2 ns
  Clock cycle scalar proc.     6 ns           6 ns
  Theor. peak performance      4-8 Gflop/s    8-32 Gflop/s
  No. of processors, scalar    1-2            1-4
  No. of processors, vector    1-2            1-4
  Main memory                  256-1024 MB    512-2048 MB
  Extended memory              16 GB          32 GB

Furthermore, there is a sub-series 180, 280, and 480, of which the 280 and 480 are again the 2-CPU and 4-CPU versions of the 180. In addition, there is a model 182 with 2 scalar processors and 1 vector processor, as is also offered in the Fujitsu VP2000 series and for the same reason: context-switching delays between jobs should be reduced by this scheme. The smallest model, the S3800/160, has 4 multi-functional multiply/add pipes which may deliver up to 8 results per clock cycle. This is equivalent to 4 Gflop/s. In the /180 the number of pipes is doubled to 8, with a corresponding peak performance of 8 Gflop/s. All models feature one or more separate divide pipes. As the multi-headed systems can work in parallel, the top model, the S3800/480, may theoretically attain a speed of 32 Gflop/s.

Unlike Cray and NEC, the S3800 does not run native Unix: the OSF/1 system is run under the MVS-like VOS3/HAP/ES. Hitachi now delivers an auto-parallelising compiler, which can be complemented by parallelising compiler directives similar to those of Cray and NEC. However, at this moment parallelisation is not yet offered under OSF/1; it is expected to be available by the end of 1993.

As can be seen from table 1, the clock cycle of the scalar processors is three times longer than that of the vector processors. Superficially, this looks like a setback compared to the predecessor, the S820, where the scalar processors were only twice as slow as the vector processors (8 vs. 4 ns clock cycle). However, in those systems a scalar instruction could be issued only every two clock cycles, while the S3800 can issue a scalar instruction every clock cycle. Furthermore, the S3800 completes its scalar instructions almost always in one cycle, where this took 2-3 cycles on the S820. In practice, the scalar speed turns out to be about 3-4 times higher than that of the S820.

3 Benchmark results

Because of the size of the benchmark (and of the resulting output) we can only present a summary of the results here. In the following we show the results from modules 1, 2, and 3 that we consider the most informative. To add perspective we also include some results from similar systems, i.e., machines from Cray and NEC. One has to keep in mind, however, that the results for the other machines are older (0.5-1 year) and that the circumstances under which they were obtained were not always optimal. Therefore, one should not draw absolute conclusions from the figures presented here.

As on many machines of this type, there is an enormous number of compiler options on the S3800 that influence the performance. Some of them are virtually mandatory for good performance, others may be beneficial, while some may prove counterproductive. Furthermore, compiler options may influence each other. This makes the search for optimal code generation something of an art. The options used for the tests discussed here are HAP(MODEL(180)), SOPT, XFUNC(XFR,DXR), and S. The first option ensures that all functional-unit pipe sets are employed, the second option turns on scalar optimisation, and the third assigns an extended set of scalar registers to the code to be executed. The function of the last option is, as yet, unclear.

Table 2: Speeds for some kernel operations from program mod1ac. Headings: C90 = Cray Y-MP C90, S3800 = Hitachi S3800, SX-3/14 = NEC SX-3/14, and S200 = Siemens-Nixdorf S200. The maximum observed performances are shown.

  Operation                    C90       S3800     SX-3/14   S200
                               Mflop/s   Mflop/s   Mflop/s   Mflop/s
  Broadcast                     430      1704      1378       229
  Copy                          430      1029      1042       210
  Multiply                      405       906       858       145
  Divide                        123       370       147        58
  Dot product                   620      2089      2235       350
  AXPY                          590      1800      1748       291
  Plain rotation                607      1912      1879       278
  2nd difference                666      2332      1992       345
  9th degree poly.              853      5687      4801       487
  1st order recur.               35        48        40        27
  2nd order recur.               36        36        19        33
  Scatter                       160       478       151        62
  Gather                         56       683       159        68
  Multiply (indirect addr.)     111       662        99        44
  Divide (indirect addr.)        88       358        60        32
  Dot product (indirect)        190      1351       170        94
  AXPY (indirect addr.)         321      1310       304       116


3.1 Results of module 1

The program mod1ac of module 1 measures 31 kernel operations. Of these, a selected set is shown in table 2. Other kernels, of which no results are displayed, check the sensitivity to memory bank conflicts and to gathering with regular stride by using strides of 3 and 4 for most of the operations shown in table 2. The evaluation of a 9th-degree polynomial is included because of its very favourable ratio of computational to load/store operations; generally, this kernel shows more or less the maximum achievable performance for systems like the S3800.

Note that for the operations with direct addressing the performances of the S3800 and the SX-3/14 are generally quite close to each other, except for the division and the evaluation of the 9th-degree polynomial. This is remarkable, as the clock cycle of the SX-3 is almost 50% longer (2.9 vs. 2.0 ns). The division is clearly better on the S3800. However, for the 9th-degree polynomial the SX-3 attains 87% of its peak speed, while the S3800 reaches 71%; for direct-addressing operations the S3800 seems to have some room for improvement. In contrast, the indirect-addressing variants of these operations, as well as the gather and scatter operations, perform relatively very well on the S3800 and less well on the SX-3. This impression is confirmed by the work of Uehara and Tsuda on the performance of vector load/store operations [10].

Although both the Cray Y-MP C90 and the Siemens S200 have a peak performance of 1 Gflop/s and roughly the same clock cycle (4.2 and 4.0 ns, respectively), the latter system is not able to use both of its arithmetic pipe sets on simple loops such as those occurring in program mod1ac. This means that for this kind of code the effective peak speed of the S200 is 500 Mflop/s, an effect that shows quite clearly in the results in table 2. For operations where the ratio of computations to memory references is unfavourable, as in single dyadic operations, the S200 is also hampered by memory bandwidth limitations. A result that could not yet be explained is the much lower performance of the gather operation with respect to the scatter operation on the C90; on the other systems the speed of the gather operation is at least as high as that of the scatter operation.
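To make the shape of these kernels concrete, the sketch below shows three of the loop types in free-form Fortran: AXPY, the 9th-degree polynomial (in Horner form), and an indirectly addressed AXPY. This is an illustrative sketch only, with array names, sizes, and data chosen here; it is not the actual mod1ac source and it omits the timing harness.

    ! Illustrative sketch of three mod1ac-style kernels (not the actual
    ! EuroBen source); names, sizes, and data are chosen for this example.
    program kernels
      implicit none
      integer, parameter :: n = 10000
      double precision :: x(n), y(n), c(0:9), p(n), s
      integer :: idx(n), i

      ! simple test data; a real benchmark would time many repetitions
      do i = 1, n
         x(i) = 1.0d0 / dble(i)
         y(i) = dble(i)
         idx(i) = n + 1 - i          ! permutation for indirect addressing
      end do
      c = 0.5d0
      s = 2.5d0

      ! AXPY: 2 flops per element, with 2 loads and 1 store
      do i = 1, n
         y(i) = y(i) + s*x(i)
      end do

      ! 9th-degree polynomial (Horner): 18 flops per element, but only
      ! 1 load and 1 store -- hence close to peak on vector machines
      do i = 1, n
         p(i) = ((((((((c(9)*x(i) + c(8))*x(i) + c(7))*x(i) + c(6))*x(i) &
                + c(5))*x(i) + c(4))*x(i) + c(3))*x(i) + c(2))*x(i)      &
                + c(1))*x(i) + c(0)
      end do

      ! AXPY with indirect addressing (gather through an index list)
      do i = 1, n
         y(i) = y(i) + s*x(idx(i))
      end do

      print *, 'checksum:', sum(y) + sum(p)
    end program kernels

The comments indicate why the polynomial kernel runs so close to peak: it performs 18 floating-point operations for every element loaded and stored, whereas the dyadic operations and the AXPY are limited by memory traffic.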

Table 3: Effect of memory bank read and write conflicts.

  Stride   Read conflict   Write conflict
           Mflop/s         Mflop/s
     1     852             832
     2     826             829
     4     845             824
     8     589             573
    16     369             358
    32     309             206
    64     205             111
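The figures in table 3 are obtained with strided assignments of the form a(i) = b(n*i) (read conflicts) and a(n*i) = b(i) (write conflicts), as described in the text below. A minimal Fortran sketch of such a driver, with illustrative array sizes and our own loop over the strides, could look as follows; it is not the EuroBen measurement code.

    ! Sketch of the strided loops behind table 3 (illustrative only).
    ! Read conflicts: the right-hand side is accessed with stride n;
    ! write conflicts: the left-hand side is accessed with stride n.
    program bankconflict
      implicit none
      integer, parameter :: nmax = 64*100000
      double precision, allocatable :: a(:), b(:)
      integer :: i, n, m

      allocate(a(nmax), b(nmax))
      b = 1.0d0
      do n = 1, 64                      ! strides 1, 2, 4, ..., 64 as in table 3
         if (iand(n, n-1) /= 0) cycle   ! keep powers of two only
         m = nmax / n
         do i = 1, m                    ! read conflict: a(i) = b(n*i)
            a(i) = b(n*i)
         end do
         do i = 1, m                    ! write conflict: a(n*i) = b(i)
            a(n*i) = b(i)
         end do
      end do
      print *, a(1)
    end program bankconflict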

Table 4: Speeds for some intrinsic functions from program mod1f. Headings: C90 = Cray Y-MP C90, S3800 = Hitachi S3800, SX-3/12 = NEC SX-3/12, and S200 = Siemens-Nixdorf S200. The r_inf values for the argument ranges with the maximum observed performance are shown.

  Function       C90       S3800     SX-3/12   S200
                 Mcall/s   Mcall/s   Mcall/s   Mcall/s
  (x^2)^1.5       5.1       36.9       8.7      5.2
  x^y             5.4       37.3       7.1      4.9
  sin x           9.2      122.9      27.1     12.7
  cos x           9.2      116.0      27.4     13.2
  sqrt(x)        33.8       84.0      24.2     13.2
  e^x            22.0      112.7      22.0     10.5
  log x          13.7       71.3      19.2      8.3
  log10 x        11.4       72.7      19.1      8.5
  tan x          12.3       87.4      13.2     10.2
  cot x          12.0        1.0       0.4      0.5
  arcsin x        9.8       48.4      10.3      5.0
  arccos x        9.7       47.1      10.2      5.0
  arctan x       12.8       55.5      17.2     11.3
  sinh x         14.1       70.7      18.1      6.9
  cosh x         14.1       95.8      11.5      9.0
  tanh x         15.8       70.8       9.6      7.7

The Hitachi S3800 behaves differently with regard to memory read and memory write conflicts, as shown in table 3. This behaviour seems peculiar to the S3800 and is most clearly observable at the higher strides; it is not observed on the Cray, Siemens, and NEC systems. The available figures have to be analysed more extensively before an explanation can be given. Read conflicts are forced by assignments of the form a(i) = b(n*i), where n = 1, 2, 4, ...; conversely, write conflicts are caused by assignments of the form a(n*i) = b(i).

Programs mod1e and mod1f measure the accuracy and the speed of the intrinsic functions. Table 4 lists the r_inf values (see [3]) of some of these functions. As different algorithms are used for different argument ranges, several speeds are in fact observed for each function, depending on the argument range; the results shown here are the highest for each function. On all machines presented here the argument ranges in which the highest speeds occurred were identical. However, this need not be the case for other types of machines, where caches, and in particular different cache sizes, can play a role.

It is clear that the performance of all intrinsic functions is significantly higher on the Hitachi S3800 than on any of the other machines, with the exception of the cotangent function.

Table 5: Speeds for matrix-vector multiplication (program mod2a). Headings: Y-MP = Cray Y-MP, S3800 = Hitachi S3800, and SX-3/12 = NEC SX-3/12. The results shown are for the largest problem size, a 500x500 matrix (order n = 500).

  Implementation                    Y-MP      S3800     SX-3/12
                                    Mflop/s   Mflop/s   Mflop/s
  (1) y = Ax + y, row-wise           155      2293       542
  (2) y = A^T x + y, row-wise        171      6446       677
  (3) as (1), 4x unrolled            165       771       320
  (4) as (2), 4x unrolled            248      5908      1548
  (5) y = Ax + y, column-wise        201      6583       728
  (6) y = A^T x + y, column-wise     182      2303       552
  (7) as (5), 4x unrolled            272      5905      1578
  (8) as (6), 4x unrolled            162       763       328

This may, however, be due to the poor implementation: if no cotangent function is supplied by the vendor (this function is not part of the standard Fortran 77 repertoire), it is implemented as 1/tan x. In fact, this is the situation for the code of mod1f as it is distributed; it may be replaced by the vendor cotangent function, but this has not been done except for the C90. Much better results were obtained for the Hitachi S3800, as well as for the SX-3/14R and the Fujitsu VP2600 (equivalent to the Siemens-Nixdorf S600), when the vendor calls were employed: in that case speeds of 51.9, 111.3, and 36.2 Mcall/s were found for the Fujitsu VP2600, the Hitachi S3800, and the NEC SX-3/14R, respectively [7].

With regard to the accuracy of the functions, as checked in program mod1e, the C90 is clearly better than the other machines: for no function is an accuracy loss of more than 0.71 decimal digits observed, notwithstanding the fact that its floating-point number system is the shakiest of all those considered here. In contrast, on the other machines most functions incur an accuracy loss of 0.7-1.4 decimal digits. The most difficult functions seem to be the sine and the cosine, which show root-mean-square accuracy losses of 3.95 and 4.53 decimal digits, respectively, on the S3800, and of 3.33 and 3.89 on the S200. These losses are 0.06 and 0.05 for the C90 and 0.95 and 0.94 for the NEC SX-3/12, respectively.

3.2 Results of module 2

In module 2 primarily linear algebra algorithms and FFTs are considered. In program mod2a various implementations of a matrix-vector product are tested: row- and column-wise accumulation of Ax and A^T x, with and without loop unrolling. In table 5 results are shown for the Hitachi S3800 together with those for the Cray Y-MP and the NEC SX-3/12; all results are obtained on one processor of the respective machines. As can be seen from table 5, there are vast differences in performance between the various implementations. This is especially apparent for the S3800 and the SX-3; although there are speed differences on the Y-MP as well, these do not amount to more than roughly a factor of 1.5. The symmetry in the performances is more or less what could be expected: implementation (1) in table 5 is in principle identical to implementation (6), implementation (2) corresponds to implementation (5), implementation (3) to implementation (8), and implementation (4) to implementation (7). Furthermore, the S3800 does not benefit from unrolling, while unrolling gives a slight advantage on the Y-MP. On the SX-3, column-wise unrolling results in a performance increase, while row-wise unrolling gives a performance decrease compared to the plain implementations. The relative constancy of the Cray results is a consequence of its CPU-memory bandwidth; furthermore, the memory structure and the number of memory banks play a role. This program shows that, to attain the highest possible speed in a code using matrix-vector multiplications, the way this algorithm is implemented matters, and one should check for the most favourable variant.
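As an illustration of what distinguishes the variants in table 5, the following hedged sketch shows the two basic loop orderings for y = Ax + y. It is not the mod2a source (the size, data, and driver are chosen here), but it makes clear that the row-wise form accumulates dot products over rows with a large memory stride, while the column-wise form performs stride-1 AXPY operations on columns.

    ! Sketch of two of the eight mod2a variants (illustrative, not the
    ! EuroBen source): row-wise (dot-product form) and column-wise (AXPY
    ! form) accumulation of y = A*x + y.
    program matvec
      implicit none
      integer, parameter :: n = 500
      double precision :: a(n,n), x(n), y(n), s
      integer :: i, j

      call random_number(a)
      call random_number(x)
      y = 0.0d0

      ! (1) row-wise: inner loop is a dot product over a row
      !     (stride n in memory for a Fortran column-major array)
      do i = 1, n
         s = y(i)
         do j = 1, n
            s = s + a(i,j)*x(j)
         end do
         y(i) = s
      end do

      ! (5) column-wise: inner loop is an AXPY over a column (stride 1)
      do j = 1, n
         do i = 1, n
            y(i) = y(i) + a(i,j)*x(j)
         end do
      end do

      print *, 'y(1) =', y(1)
    end program matvec

The unrolled variants of table 5 process four rows or columns per pass of the outer loop, which changes the balance between memory references and register reuse.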


Table 6: Speeds for the solution of a full linear system (program mod2b). Headings: Y-MP = Cray Y-MP, S3800 = Hitachi S3800, and SX-3/12 = NEC SX-3/12. The results shown are for the largest problem size (order n = 1000).

  Implementation           Y-MP      S3800     SX-3/12
                           Mflop/s   Mflop/s   Mflop/s
  (1) LINPACK               157      5520       212
  (2) LAPACK, BLAS 2        296       570       598
  (3) LAPACK, BLAS 3        300       540       473
  (4) Fortran, BLAS 2       273       465       445
  (5) Fortran, BLAS 3       296       594       451
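The difference between implementation (1) and implementations (2)-(3) in table 6 is essentially that between the classical LINPACK factor/solve pair and the LAPACK routines built on BLAS kernels. The sketch below shows the corresponding call sequences for reference only; it is not the mod2b driver, it assumes the standard public LAPACK and LINPACK interfaces, and it must be linked against those libraries.

    ! Sketch of the call sequences behind table 6 (illustrative only;
    ! not the mod2b driver).  Requires LAPACK, and LINPACK/BLAS for the
    ! commented-out variant.
    program solve
      implicit none
      integer, parameter :: n = 1000
      double precision, allocatable :: a(:,:), b(:)
      integer :: ipiv(n), info

      allocate(a(n,n), b(n))
      call random_number(a)
      call random_number(b)

      ! LAPACK: LU factorisation plus triangular solves.  DGETRF is the
      ! blocked, BLAS 3 based routine; an unblocked, BLAS 2 based
      ! factorisation is available as DGETF2.
      call dgetrf(n, n, a, n, ipiv, info)
      if (info /= 0) stop 'dgetrf failed'
      call dgetrs('N', n, 1, a, n, ipiv, b, n, info)

      ! LINPACK equivalent (column-oriented, BLAS 1 based), which the
      ! S3800 recognises and replaces by library routines:
      !   call dgefa(a, n, n, ipiv, info)
      !   call dgesl(a, n, n, ipiv, b, 0)

      print *, 'first solution component:', b(1)
    end program solve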

Table 7: Speeds for the solution of a sparse linear system (program mod2c). Heading: S3800 = Hitachi S3800. The results shown are for the largest problem size (a 100x100x100 grid).

  Implementation                             S3800
                                             Mflop/s
  (1) ICCG                                    144
  (2) ICCG, diag. ordered preconditioning     917
  (3) CG, diagonal scaling                   2497

Program mod2b solves a full linear system for sizes n = 50, ..., 1000. In table 6 we show the speeds obtained for the largest problem size, n = 1000. Five implementations are considered in mod2b: a LINPACK implementation, a LAPACK implementation with BLAS 2 and with BLAS 3 calls, and our own Fortran implementation, again using BLAS 2 and BLAS 3 calls, respectively. The S3800 recognises the LINPACK calls in the code of program mod2b and substitutes library routines, resulting in a speed of 69% of the peak performance. However, this substitution is not done for implementations (2)-(5) in table 6; in that case the performance is in the order of 5-7% of the peak, which is rather disappointing. For the Y-MP the LAPACK and the Fortran implementations using BLAS 2 and BLAS 3 are, as could be expected, clearly superior to the LINPACK implementation; they attain speeds that reach 80-90% of the peak speed. The SX-3 takes a position in between: the LINPACK implementation performs worse than the other ones, and implementations (2)-(5) perform in the 15-20% range of the theoretical peak performance.

Program mod2c measures the speed of solving a sparse symmetric hepta-diagonal system (stemming from a 3-D Laplace PDE problem). Three implementations are used: the standard ICCG method, ICCG with diagonally-ordered preconditioning, and the CG method with diagonal scaling [1]. The results for the Hitachi S3800 are given in table 7; we only give the result for the largest problem size, a 100x100x100 grid. The general pattern of performance for the three implementations agrees well with that for other machines. For instance, speeds of 60, 607, and 1124 Mflop/s are given for the SX-3/12 in [1] for implementations (1)-(3), respectively, and for the Cray Y-MP C90 speeds of 56, 444, and 737 Mflop/s were found. In terms of fractions of the peak performance, the level of the Hitachi S3800 is somewhat lower than that of the other machines: 2, 11, and 31%, respectively, for the three implementations, against 2, 22, and 41% for the NEC SX-3 and 6, 47, and 77% for the C90. The latter machine benefits here from its larger CPU-memory bandwidth.

Program mod2e measures the performance on a sparse symmetric positive definite eigenvalue problem. Problems with sizes ranging from n = 100, ..., 10000 are considered; the five smallest eigenvalues of each problem are computed using a Lanczos algorithm. Table 8 shows the results for the Cray Y-MP, the Hitachi S3800, and the NEC SX-3/12. The results are given in terms of iterations/s; the program also reports the time in seconds and the time per iteration per element, which we do not show here.

Table 8: Speeds for the computation of the five smallest eigenvalues in sparse symmetric positive definite eigenvalue problems (program mod2e). Headings: Y-MP = Cray Y-MP, S3800 = Hitachi S3800, and SX-3/12 = NEC SX-3/12. The results are shown in terms of the number of iterations/s.

  Problem size   Y-MP     S3800    SX-3/12
                 Iter/s   Iter/s   Iter/s
     100          319      675      399
     200          269      555      339
     300          290      584      353
     400          292      592      358
     500          256      590      351
     600          339      686      410
     700          314      645      377
     800          262      539      337
     900          294      624      372
    1000          304      711      415
    2000          280      600      356
    3000          271      543      323
    4000          211      459      274
    5000          191      417      250
    6000          208      467      278
    7000          190      451      268
    8000          166      375      222
    9000          193      457      271
   10000          175      423      250

The performance in terms of Mflop/s is often not satisfactory for this kind of program, and a measure of the number of problems solved per unit of time is more appropriate in such cases. As the problems in mod2e are iterative in nature, the number of iterations/s is a sensible measure (in combination with the total time per problem), see [4]. The algorithm is only partially vectorisable/parallelisable: the search algorithm proper is purely scalar and cannot be parallelised or vectorised, but all intermediate calculations are very well suited to parallelisation or vectorisation, and their influence grows with growing problem size. The vector startup time is smaller on the Cray Y-MP than on the other machines. This explains why the number of iterations/s decreases faster with growing problem size on the Y-MP than on the other two machines: on those systems the larger amount of computation is compensated by a higher speed in the vector operations once the influence of the vector startup becomes smaller. The performances are not in proportion to the peak performances of the respective machines: while these peak performances are 333, 8000, and 2750 Mflop/s, respectively, the observed speed ratios range from 2.0-2.4 for the Y-MP vs. the S3800 and from 1.2-1.4 for the Y-MP vs. the SX-3. These ratios more or less agree with the clock cycle ratios (when weighted for the proportion of scalar and vector work). Only for much larger problems would the speeds of the S3800 and the SX-3 show a larger advantage over the Y-MP.

Programs mod2f and mod2g are concerned with the accuracy and the speed of Fast Fourier Transforms, respectively. In mod2f the accuracy of complex, real, and Hermitian transforms was tested and found satisfactory. In table 9 speeds for complex 2-D and 3-D FFTs are shown for the Cray Y-MP, the Hitachi S3800, and the NEC SX-3/12. In addition to the actual performances shown in table 9, the program also estimates the speed of the most favourable implementation (the program is potentially capable of choosing the optimal one from the possible permutations). The estimates show that the problem of size 2736125 in table 9 can be implemented better (though differently) on all three systems; the speeds for this problem would then increase to 189, 7783, and 1452 Mflop/s, respectively. Although the estimated speed for the Hitachi S3800 is highly unlikely (it would attain 86% of its theoretical peak performance in this case), it would certainly show a higher speed; we did not have an opportunity to actually measure it. With regard to the theoretical peak performances of the machines, fractions of 54, 33, and 32% are attained for the highest speeds observed.

Table 9: Speeds for 2-D and 3-D FFTs (program mod2g). Headings: Y-MP = Cray Y-MP, S3800 = Hitachi S3800, and SX-3/12 = NEC SX-3/12.

  Problem size   Y-MP      S3800     SX-3/12
                 Mflop/s   Mflop/s   Mflop/s
  512x512         171      2666       852
  2736125         182      2265       873
  64x64x64        161      1494       698

Table 10: Uniformly distributed random numbers generated per second in program mod2h. The rates with and without initialisation are shown, as obtained when generating 1,000,000 random numbers. Headings: Y-MP = Cray Y-MP, S3800 = Hitachi S3800, and SX-3/12 = NEC SX-3/12.

                    Y-MP         S3800         SX-3/12
                    RN/s         RN/s          RN/s
  Not initialised   32,357,596   26,062,704     61,907,281
  Initialised       73,663,229   400,477,696   230,710,980

The program mod2h computes the number of uniformly distributed random numbers that can be generated per second. The algorithm used is a Tausworthe-like generator based on bit manipulation (exclusive-or operations), see [6]. Table 10 displays results for the Cray Y-MP, the Hitachi S3800, and the NEC SX-3/12. The initialisation is done by a linear congruential method, which is implemented in a scalar way (it could, however, be vectorised/parallelised). It will be clear that large simulations should be done with this random number generator to overcome the large initialisation costs. With initialisation the program is almost completely dominated by the scalar performance, which is compatible with table 10: the Y-MP and the S3800 are roughly equal in speed, while the SX-3 is about two times faster (the scalar clock cycles are 6, 6, and 2.9 ns, respectively). Once initialised, the speed should be proportional to the vector-processor clock cycle and the number of logical pipes available in the systems. However, integer-to-logical conversions and vice versa also have to be done, and it is not clear how many vector resources are available for this and whether such procedures are similar on the systems considered. The speed ratio for the Y-MP vs. the S3800 is 5.43, and this ratio is 3.13 for the Y-MP vs. the SX-3.
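Reference [6] describes the vectorised generator that mod2h is based on; purely as an illustration of the principle, the sketch below implements a scalar Tausworthe-type (generalised feedback shift register) recurrence x(k) = x(k-p) XOR x(k-q), preceded by a linear congruential initialisation. The lag pair (250, 103), the seeding constants, and the output normalisation are illustrative choices and do not reproduce the EuroBen generator.

    ! Minimal scalar sketch of a Tausworthe/GFSR-type generator
    ! (x_k = x_{k-p} .XOR. x_{k-q}); the vectorised generator of [6]
    ! used by mod2h is organised differently, and all constants here
    ! are illustrative.
    program gfsr
      implicit none
      integer, parameter :: p = 250, q = 103, nran = 1000000
      integer, parameter :: ik = selected_int_kind(18)   ! 64-bit for the LCG
      integer :: state(p), xnew, i, k, j
      integer(ik) :: lseed
      double precision :: s

      ! initialisation by a (scalar) linear congruential generator
      lseed = 123456789_ik
      do i = 1, p
         lseed = mod(69069_ik*lseed + 1_ik, 2147483648_ik)  ! 31-bit values
         state(i) = int(lseed)
      end do

      ! generation: x_k = x_{k-p} .XOR. x_{k-q} in a circular buffer;
      ! the benchmark vectorises this recurrence in blocks, the loop
      ! as written here is purely scalar
      s = 0.0d0
      k = 0
      do i = 1, nran
         k = k + 1
         if (k > p) k = 1
         j = 1 + mod(k + p - q - 1, p)          ! slot holding x_{k-q}
         xnew = ieor(state(k), state(j))
         state(k) = xnew
         s = s + dble(xnew)/2147483648.0d0      ! uniform deviate in [0,1)
      end do
      print *, 'mean of', nran, 'deviates:', s/dble(nran)
    end program gfsr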

3.3 Results for module 3

In module 3 we have collected programs that are in some (limited) sense exemplary for important applications or implementations of such applications. The first two programs, mod3a and mod3b, measure the performance of I/O-dominated programs, viz., the wallclock time for a 9,000x9,000 out-of-core matrix multiplication and for an out-of-core 8,192x8,192 complex FFT, respectively. Table 11 shows the wallclock times as measured on the Hitachi S3800 and the SX-3/14. From the results of program mod3a one would infer that the I/O speed is more or less the same for both systems. However, in mod3b the speed of the SX-3 is more than an order of magnitude higher. It is not clear where this difference comes from: other programs suggest that the computational speed of the S3800 is at least not lower than that of the SX-3, while the I/O speeds can be comparable, as shown in mod3a. A more detailed study of the I/O configurations might help to explain this large difference.
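mod3a performs its out-of-core traffic with Fortran unformatted I/O, both sequentially and with direct access, while mod3b uses direct-access I/O only (see the caption of table 11). As a reminder of what the direct-access pattern looks like in Fortran, here is a minimal sketch; the record length, block size, and file name are illustrative and do not correspond to the benchmark's actual blocking scheme.

    ! Minimal sketch of unformatted direct-access I/O as used for the
    ! out-of-core programs (record and block sizes are illustrative only).
    program oocio
      implicit none
      integer, parameter :: nblk = 1000            ! words per record
      integer, parameter :: nrec = 100             ! number of records
      double precision :: buf(nblk)
      integer :: irec, i, lrec

      inquire(iolength=lrec) buf                   ! portable record length
      open(unit=10, file='scratch.dat', access='direct', recl=lrec, &
           form='unformatted', status='replace')

      do irec = 1, nrec                            ! write blocks by record number
         do i = 1, nblk
            buf(i) = dble(irec*nblk + i)
         end do
         write(10, rec=irec) buf
      end do

      read(10, rec=nrec/2) buf                     ! random access to any block
      print *, 'first word of record', nrec/2, '=', buf(1)

      close(10, status='delete')
    end program oocio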

Table 11: Wallclock times measured in seconds for program mod3a (9,000x9,000 out-of-core matrix multiplication) and for program mod3b (8,192x8,192 complex FFT). The I/O in mod3a is done both as sequential, unformatted I/O and as direct-access I/O; in mod3b only direct-access I/O is used. Headings: S3800 = Hitachi S3800 and SX-3/14 = NEC SX-3/14.

  Program        S3800    SX-3/14
                 sec      sec
  mod3a, seq.    352.86   352.86
  mod3a, d.a.    355.87   347.84
  mod3b          487.00    17.36

Table 12: Execution times for some programs of module 3. In mod3d a linear least squares problem is solved via QR and SVD decompositions. Note that the timings of mod3d and mod3e are in ms instead of seconds. Headings: C90 = Cray Y-MP C90, S3800 = Hitachi S3800, and SX-3/14 = NEC SX-3/14.

  Program            C90       S3800     SX-3/14
  mod3c (sec)        0.16991   0.15887   0.11621
  mod3d (msec)
    QR alg.          54.414     9.588     1.161
    SVD alg.         289.91    46.313    29.031
  mod3e (msec)       4.3973    10.375     3.3741
  mod3f (sec)        5.6133    8.8048     6.3988
  mod3g (sec)        0.3560    0.06008    0.05187

Table 12 shows the results of programs mod3c-mod3g. Programs mod3c and mod3g both solve a discretised Laplace equation, but with different methods: the first program uses a multigrid algorithm, while the second employs a fast elliptic solver. Programs mod3d and mod3e look at the performance on linear and non-linear least squares problems, respectively. Program mod3f solves a diffusion equation by computing a time sequence of stiff ODEs. Note that the times in table 12 are not all in the same units: the times for mod3d and mod3e are reported in ms, while the other execution times are in seconds.

In program mod3c the execution times on the three machines are not very different. On this (mostly) well-vectorisable code one would expect both the S3800 and the SX-3/14 to be significantly faster than the C90: although the latter can benefit from its larger CPU-memory bandwidth and a shorter vector startup time in many parts of the code, the larger number of functional units and the shorter clock cycle should give the S3800 and the SX-3 some advantage over the C90.

In program mod3d the speed ratio that would be expected between the C90 and the S3800 is indeed observed: the S3800 is roughly 5.5-6.0 times faster than the C90. The difference in computation time between the QR algorithm and the SVD algorithm is also as can be expected for vector processors: the SVD algorithm takes about 5 times longer. In this respect the SX-3/14 behaves completely atypically: the time for the QR algorithm is extremely short, so that the QR algorithm is more than 25 times faster than the SVD algorithm. We still have no explanation for this phenomenon; another implementation, not based on the LLSQ package of Lawson and Hanson, could be of help here.

The non-linear least squares problem is rather small in terms of computational load for the machines considered here. However, the results are sufficiently clear to see that the C90 does relatively well, while the S3800 behaves rather disappointingly. The computation is almost completely dominated by the calculation of l2 norms of length n = 1000; comparing with the results from program mod1ac, one would expect this to give a better result for the S3800.

Table 13: Mflop-rates measured in program mod3h, in which a Laplace equation is solved by a generalised red-black relaxation scheme. Headings: S3800 = Hitachi S3800 and SX-3/14 = NEC SX-3/14.

  Problem size   S3800     SX-3/14
                 Mflop/s   Mflop/s
  17x17            76       n/a
  33x33           180       231
  41x41           231       n/a
  65x65           393       488
  129x129         794       820
  257x257        1496      1289
  513x513        2495      1803
  1025x1025      3310       n/a
  2049x2049      3820       n/a

The results of program mod3f show that evidently no substitution of LINPACK calls has occurred in the execution on the S3800. Program mod3f spends about 80% of its time in LINPACK routines and, as we have seen for program mod2b, this leads to very high speeds when substitution takes place. As table 12 shows, the speed of the S3800 with the supplied Fortran code is a little lower than that of the C90 and the SX-3/14.

The time for program mod3g on the C90 is 6-7 times longer than the times on the S3800 and the SX-3/14. This is roughly in line with the peak speeds of the machines considered, although the SX-3 behaves somewhat better than could be expected and the S3800 somewhat worse.

Table 13 shows the results of yet another way of solving the Laplace equation, program mod3h. A generalised red-black relaxation scheme is used here, and problem sizes ranging from 17x17 to 2049x2049 are calculated. Table 13 shows that at small problem sizes the SX-3/14 outperforms the S3800, while at larger vector lengths the S3800 is faster than the SX-3. This confirms the impression that the vector startup times of the S3800 are longer than those of the SX-3; the problem sizes should therefore be large to benefit fully from the speed of the S3800 processor.
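The structure of a red-black sweep explains this behaviour: the points of one colour depend only on points of the other colour, so the updates within a sweep are free of recurrences and vectorise well, but at small problem sizes the vectorisable loops are short and the startup time dominates. The sketch below is a plain red-black relaxation for the 2-D Laplace equation on a square grid with fixed boundary values and an arbitrary iteration count; it does not reproduce mod3h's generalised scheme or its stopping criterion.

    ! Sketch of red-black relaxation for the 2-D Laplace equation
    ! (illustrative only; grid size, boundary data, and iteration count
    ! are arbitrary and mod3h's generalised scheme is not reproduced).
    program redblack
      implicit none
      integer, parameter :: n = 257, niter = 100
      double precision :: u(n,n)
      integer :: i, j, it, colour

      u = 0.0d0
      u(:,n) = 1.0d0                       ! fixed boundary values
      do it = 1, niter
         do colour = 0, 1                  ! 0 = red points, 1 = black points
            do j = 2, n-1
               do i = 2 + mod(j+colour, 2), n-1, 2
                  ! points of one colour depend only on the other colour,
                  ! so each inner sweep vectorises without recurrences
                  u(i,j) = 0.25d0*(u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1))
               end do
            end do
         end do
      end do
      print *, 'centre value:', u((n+1)/2, (n+1)/2)
    end program redblack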

4 Concluding remarks

The first tests with the EuroBen Benchmark on the Hitachi S3800 were done by the author in the second half of July 1993 at the Computing Centre of the University of Tokyo. For these preliminary tests the guest operating system OSF/1 was used, which has been available since April 1993. During the tests several problems arose. The first one concerned the timing: although the resolution of the available clock routine was clearly sufficient (microsecond resolution), the overhead for attracting the attention of the host operating system was so high (and so variable) that the timings were highly unreliable. Another problem was the specification and availability of compiler options: it was not always clear which options were to be preferred and why, nor were all options that are available under the VOS3/AS operating system also available under OSF/1. Finally, some compilation and execution errors occurred that could not be resolved in the short (4-day) term that was available for the benchmarking. It was therefore decided that the programs would be turned over to the Hitachi Software Development Center and run again under the VOS3/AS operating system. The results presented here stem from the efforts of the Software Development Center at Hitachi Ltd.

A problem not yet fully mastered is the timing. Because of this, we judged the (very important) n_1/2 values obtained too unreliable to include them in this paper, and we are very cautious in our conclusions regarding the vector startup time. Very roughly speaking, the n_1/2 values seem to lie in the range of 400-800 for the S3800. For the other systems considered in this paper these values seem to lie in the range of 80-120 for the Cray Y-MP C90, and about the same or a little higher for the NEC SX-3/14. This means that the vector length must be larger on the S3800 to profit fully from its (indeed very large) processing power.
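For reference, r_inf and n_1/2 are related to the timing of a vector operation of length n through Hockney's standard model [3] (quoted here as a definition, not fitted to the S3800 measurements):

    % Hockney's (r_inf, n_1/2) timing model for a vector operation of length n [3]
    \[
       t(n) \;=\; \frac{n + n_{1/2}}{r_\infty},
       \qquad
       r(n) \;=\; \frac{n}{t(n)} \;=\; \frac{r_\infty\, n}{n + n_{1/2}} .
    \]

Half of the asymptotic rate r_inf is thus reached only at vector length n = n_1/2, so with n_1/2 in the range of 400-800 vector lengths of several thousand are needed before r(n) comes close to r_inf.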

On some occasions the potentially very high speed of the S3800 does show; this is especially true for some linear algebra codes and for FFTs (if properly implemented). On other occasions, especially where there is a significant portion of scalar code, the S3800 does not show its high potential. One place where a large gain in performance can undoubtedly be expected without too much effort in the near future is program mod2b: the LAPACK and BLAS 2 and BLAS 3 implementations provided there should be very well suited to the architecture of the S3800 and should therefore perform quite well after proper tuning.

5 Acknowledgements

We would like to express our gratitude to Prof. Yasumasa Kanada of the Computing Centre of the University of Tokyo, who was so kind as to make the S3800/480 at his institute available for our first tests and who acted as intermediary in transferring the codes to Hitachi Ltd. Furthermore, we thank the staff of the Hitachi Software Development Center for their efforts to port the codes to the VOS3/AS operating system and to look into the problem of running them optimally in such a short time frame. Finally, we are grateful to the Dutch National Computer Facility Foundation, which funded our stay in Tokyo and initiated the benchmark tests and the results presented here.

References

[1] J.J. Dongarra, H.A. van der Vorst, Performance of various computers using standard techniques for solving sparse linear equations, Supercomputer IX, No. 5, 1992, 17-30.
[2] O. Haan, W. Walde, Performance of Fast Fourier Transforms on vector computers, Supercomputer VII, No. 6, 1992, 42-49.
[3] R.W. Hockney, C.R. Jesshope, Parallel Computers 2, Adam Hilger/AIP, Bristol and New York, 1988.
[4] R.W. Hockney, A framework for benchmark analysis, Supercomputer IX, No. 2, 1992, 9-22.
[5] R.W. Hockney, Results from the GENESIS distributed memory benchmark, Proc. of the 3rd EuroBen Workshop, 1992, 73-83.
[6] N. Ito, Y. Kanada, Monte Carlo simulation of the Ising model and random number generation on the vector processor, Proc. of Supercomputing '90, IEEE Computer Society, 1990, 753-763.
[7] T. Nagai, Benchmarking Fortran intrinsic functions, Proc. of the Workshop on Benchmarking and Performance Evaluation in High-Performance Computing, Ed. M. Shimasaki, Tokyo, 1993, 26-32.
[8] A.J. van der Steen, The benchmark of the EuroBen Group, Parallel Computing 17 (1991) 1211-1221.
[9] A.J. van der Steen, Benchmarking and relating application performance to machine architecture, Proc. of the Workshop on Benchmarking and Performance Evaluation in High-Performance Computing, Ed. M. Shimasaki, Tokyo, 1993, 7-15.
[10] T. Uehara, T. Tsuda, Benchmarking vector indirect load/store instructions, Proc. of the Workshop on Benchmarking and Performance Evaluation in High-Performance Computing, Ed. M. Shimasaki, Tokyo, 1993, 16-25.
