High Performance Computers — A status report

Aad J. van der Steen
High Performance Computing Group, Utrecht University
P.O. Box 80195, 3508 TD Utrecht, The Netherlands
[email protected]

November 3, 2004

Abstract

Some years ago the supercomputer landscape changed dramatically: the supremacy of vector-processor-based systems had to give way to machines built from Commodity Off-The-Shelf (COTS) products that promised to deliver a better price-performance ratio. In addition, cheap Beowulf clusters emerged and are massively employed in laboratories all over the world. We discuss some reasons for this change and also address questions like: Is there still room for computer architecture research? Is there still value in large systems as built in the ASCI initiative in the USA, as opposed to clusters? We will review the present status of the supercomputer field and comment on the (rather limited) architectural range that is offered today. Furthermore, we will discuss some of the latest developments in the field and how they may impact the machines to come.

1  Introduction

High Performance Computing (HPC) is by definition dependent on the computers one is working with. Anyone who has followed the field for a number of years will have noticed how volatile it is: barely 10 years ago supercomputers were almost synonymous with vector-processor machines, with a small corner of the HPC world occupied by distributed-memory MIMD systems, shared with some processor-array machines like the Cambridge Parallel DAP Gamma and the MasPar MP-2. This outlook has completely changed. Only two vendors of vector-processor systems are still active, and processor-array machines are non-existent nowadays. Instead, RISC-based ccNUMA or SMP systems have taken over the field, not to mention the multitude of Beowulf clusters, home-grown or ready-bought. In this paper we want to trace the reasons for these changes and where the still ongoing developments are taking us. Will it still be viable to do expensive research to build ever larger “integrated” parallel systems, or do they have to make way for cheaper cluster solutions everywhere? What is the speed horizon for the coming generations of HPC computers, and can we keep up with Moore’s Law in the years to come? To answer these questions we need to look at the current status of the HPC computer field in order to understand the driving forces and the hindrances that we are confronted with. We first describe what is presently done to achieve high speed in computer systems in section 2. Then we go on to the problems that come up in the practical use of HPC systems in section 3 and present some of the solutions to performance problems that have shaped (and still shape) our present-day systems in section 4. We briefly focus on the possibilities of clusters and on their development in section 5, and we finish with some thoughts about near-future developments, both on the component level and on the architectural level, that will result in the machines of tomorrow in section 6.

2  What we do for speed

The appetite for computer power is insatiable. That is to say, it is not only about the sheer amount of resources but also about performance per se: 2,000 systems with a peak performance of 5 Gflop/s cannot simply be used instead of one 10 Tflop/s machine, if only for logistic reasons. So, many approaches are being followed at the same time to increase the speed of large-scale HPC systems. We list some of the obvious ones:

1. Speed up processors.
2. Use more processors per task.
3. Increase the communication speed between processors.
4. Look again at algorithms.

We will discuss each of them in turn.

2.1  Speeding up processors

The speeding up of processors has been going on for so many years on such a regular basis that one is almost inclined to regard the doubling of the speed every 18 months, known as Moore’s Law [8] for processors, as a law of nature. Of course we all know that this is not true; in fact, it becomes increasingly difficult to satisfy this “Law”. At the moment a feature size of 90 nm is becoming customary, which means that, for instance, an Itanium 2 processor can house about 410 million devices on its chip, and the next step, shrinking the feature size to 65 nm, is about to be taken. This presents big challenges with regard to lithography: deep-ultraviolet wavelengths have to be used in combination with phase shifting to attain the necessary feature resolution. A consequence is that only a few global players are still able to keep up with these demands, because of the huge investments needed. This is especially so because we expect to see Moore’s Law followed at about the same price level. Without this requirement other, non-standard features could be built in. In fact, vector processors are a premier example of such non-commodity processors, with the appropriate reflection in the price tag. Yet, for high-level HPC requirements one may opt for the highest performance per se instead of going for the best price/performance. It depends on the added features of the non-commodity processor whether a system based on it can win out in comparison with a system built out of commodity components. A higher density of devices and shorter path lengths on the chip, with a correspondingly higher clock frequency, do not automatically mean that the speed of the processor rises proportionally. A good example of this is the Intel Pentium III versus its Pentium 4 successor. The Pentium III housed 24 million transistors; the Pentium 4, one year later (2000), contained 42 million transistors, and the clock frequency went up from 1.1 GHz for the latest Pentium III to 1.5 GHz for the first P4s. One would therefore expect an increase in performance of some 35–40%. The actual speedup turned out to be only about 15%. In addition, the instruction pipeline, which already had 20 stages in the Pentium III, was lengthened to no less than 31 stages in the P4, making a break in the pipeline very costly. It is just an example to illustrate that the increase in complexity comes with a price. Another manifestation of this increased complexity is that it also becomes harder to maintain a central clock for the entire chip. Measured in clock ticks, the devices on the chip are getting ever further away from each other, not least because the number of latches needed to synchronise the devices grows superlinearly with the number of devices and therefore introduces additional delays. The physical feature-size limit for the now ubiquitous silicon-based CMOS technology lies at ≈ 7 nm; below this size tunneling effects would render it useless. So, even when the technological hurdles can be overcome, we will meet the end of the road (strictly following Moore’s Law) around 2014, at a feature size of about 8 nm. This is not to say that new materials and techniques could not have taken over by then; we will discuss some new developments later. As is well known to any HPC system user, the processor is not the only factor, and nowadays not even the decisive factor, in determining the practical performance.
We will come back to this in section 4, where we discuss what can be done (and partly is being done) to make it easier for the functional units of the processor to attain a larger part of their potential.
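
The 2014 estimate above can be checked with a back-of-the-envelope calculation (a sketch only, assuming that the doubling every 18 months comes purely from density scaling):

    90 nm / 8 nm ≈ 11.25 in linear dimension, i.e. a density gain of ≈ 11.25² ≈ 127 ≈ 2⁷,
    so about 7 doublings × 1.5 years ≈ 10.5 years, or roughly 2014–2015 starting from the 90 nm generation of 2004.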

2.2  Use more processors per task

The solution of using more processors per task to decrease the time-to-completion of compute-intensive problems already has a long history (relative to the total history of computing).


An excellent account of this history is given in [7], and serious attempts to build a practical parallel system started with the ILLIAC IV machine in the period 1967–1972 [14]. The ILLIAC IV addressed a shared memory with 64 processors configured in a 2-D grid and needed its own specialised programming language to express the parallelism in the task at hand. The processor grid was controlled by a control processor that issued the instructions to be executed. Although the machine was not much of a success at the time, it has been an important starting point for building parallel systems drawing on a shared memory. In fact, Hitachi’s SR8000, marketed from 1998 to 2003, retained similar ideas, with a control processor overseeing 8 instruction processors in a processing node (see [16]). Nowadays shared-memory parallelism has the advantage of being relatively simple to use without extensive restructuring of existing codes. The introduction of the de facto standard OpenMP in 1997 [12] has greatly contributed to the adoption of this type of parallelism. There are, however, drawbacks to shared-memory parallelism, like memory-bandwidth problems and an increasing time for the synchronisation of processes when the number of processors grows. Therefore, again fairly early on, ideas for distributed-memory parallelism were laid out that would at least evade the bandwidth-competition problem riddling shared-memory parallel systems. The first practical distributed-memory machine was the Cm∗ built at Carnegie-Mellon University [18] in 1977. Many other systems followed, including the Cosmic Cube in 1984 at CalTech. The processors of this machine were arranged in a 6-dimensional hypercube configuration, with the important property that the largest distance between processors only grows with the logarithm of the number of processors (see the sketch below). Textbooks like “Solving Problems on Concurrent Processors” [6] increased the popularity of this type of parallelism, notwithstanding the much more complicated way in which programs had to be built. In fact, even with the presently accepted standard MPI [10, 11], developing distributed-memory parallel programs can be a challenge. In many application areas good efficiencies can be obtained using parallel techniques, and one may expect that these efficiencies will still increase and become applicable in wider areas with the evolution of parallel tools: MPI-2, supporting dynamic process creation, parallel I/O, and one-sided communication, is becoming increasingly available. In the near future Co-Array Fortran [4], which has proved to be very effective on Cray T3E and SGI Origin2000 systems, will be incorporated in the new Fortran standard. For the moment the majority of parallel applications is routinely run on DM-MIMD machines and clusters using MPI, OpenMP, or a combination of both whenever the architecture of the machine allows it and is sufficiently efficient.
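
The logarithmic-distance property of the hypercube can be made concrete with a small sketch (a generic illustration, not taken from any of the machines above): directly connected nodes differ in exactly one bit of their node number, so the hop count between two nodes is the Hamming distance of their numbers and never exceeds log2(p).

    #include <stdio.h>

    /* Nodes of a d-dimensional hypercube are numbered 0 .. 2^d - 1; two nodes
       are directly connected when their numbers differ in exactly one bit.
       The number of hops between two nodes is therefore the Hamming distance
       of their numbers, at most d = log2(p) for p processors.                */
    static int hops(unsigned a, unsigned b)
    {
        unsigned x = a ^ b;      /* bits in which the two node numbers differ */
        int d = 0;
        while (x) {
            d += (int)(x & 1u);
            x >>= 1;
        }
        return d;
    }

    int main(void)
    {
        /* 64 processors in a 6-dimensional hypercube, as in the Cosmic Cube. */
        printf("node 0 -> node 63: %d hops\n", hops(0u, 63u));   /* prints 6 */
        printf("node 5 -> node 6 : %d hops\n", hops(5u, 6u));    /* prints 2 */
        return 0;
    }

For the 64-processor Cosmic Cube this maximum is 6 hops, while a fully connected machine of the same size would already need 2016 point-to-point links.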

2.3  Increase speed of communication between processors

When distributed-memory systems were first developed, the communication between the processors was the main problem: the access of operands within a processor was faster by several orders of magnitude than the fetching of these operands from a neighbouring processor. For instance, in the 1990–1992 Intel iPSC/860, fetching a local 8-byte operand took 0.25 µs, while getting such an operand from another processor took at least 2280 µs. The bandwidth of this system’s network was 2 MB/s (quite advanced at the time), but the latency was killing: 2260 µs. This meant that the communication/computation ratio had to be very low. The Intel i860 processor could churn out about 180,000 floating-point operations in the time of one non-local memory access, forcing application programmers to think very hard about how to organise the communication. Even on the iPSC/860 system no heed was taken anymore of the actual topology of the network connecting the processors. A study in 1988 [15] already showed for its predecessor, the iPSC machine, that explicitly taking the topology of the system into account hardly paid off and could often even be counterproductive. This is a reason why the message-passing libraries of today hardly provide any means for mapping the communication structure of applications onto a particular topology. This does not mean that the topology is not important. Much effort has been spent in designing topologies that are both cost-effective and fast. The ideal solution lies in the direct connection of every processor to all others. Such a one-stage crossbar, however, becomes prohibitively costly as the number of switches grows with the square of the number of processors. Topologies in which the number of switching stages grows only as O(log p) for p processors are a viable and often employed alternative. Representatives are the already mentioned hypercube and the so-called fat tree, a tree structure in which the bandwidth toward the root of the tree is higher than near the leaves, as depicted in Figure 1. The intrinsic speed of today’s networks has improved very much. So much so, that the bandwidth provided by the network is in the same range as that of the Front Side Bus (FSB) that connects a processor to its local memory: the FSB bandwidth in a dual-processor Intel Nocona-based computation node is 3.2 GB/s, while a Quadrics QsNetII network has a bandwidth of somewhat over 1.3 GB/s.

Figure 1: Some logarithmic network topologies: (a) hypercubes of dimension 1–4; (b) a 128-way fat tree.

Also network latencies have decreased by orders of magnitude, to the level of tens of ns for hardware latencies and a few µs for MPI latencies. This has made the applicability of message-passing-based programming much wider and has enabled the fruitful parallelisation of algorithm classes that formerly did not parallelise well, e.g., irregular search algorithms and the transpositions of matrices that occur in large parallel FFTs. The majority of integrated parallel systems have a heterogeneous network, because they have a structure in which several processors are combined in a Symmetric Multi-Processor (SMP) node, in which the processors all draw on the common local node memory via a network that is a factor of 2–4 faster than that between the SMP nodes. Furthermore, all memory within such a system may be shared logically, although it is physically distributed. Because of the physical limitations and the parameters of the network, the access of data throughout the system cannot be uniform, and one therefore speaks of NUMA systems, where NUMA stands for Non-Uniform Memory Access. As all presently existing systems of this type also keep the contents of the processors’ caches consistent, the term cache-coherent NUMA or ccNUMA system is commonly used. The speed of the inter-node network obviously has an impact on the non-uniformity of the data access. It is expressed in the NUMA factor, which states the delay factor between local and non-local memory access. The NUMA factor is, depending on the size of the system, modest but non-negligible: about 2 for small systems to about 6 for large systems. In designing algorithms for such systems one therefore often tries to localise the memory access by mapping the data in an appropriate manner. The speed of the inter-node networks for ccNUMA-like systems, or for those that simply connect SMP nodes, is higher than that of the networks used in clusters, but not by large factors anymore: the bandwidth is about 2–4 times higher and the latency is lower by a factor of 2–3.
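
The effect of the NUMA factor on a running program can be estimated with a simple averaging argument (an illustrative calculation with assumed numbers, not a measurement): if a fraction f of the memory references is non-local and a non-local reference is a factor N slower, the average access time grows by a factor 1 + f(N − 1). For example,

    f = 0.25, N = 4  ⇒  1 + 0.25 × (4 − 1) = 1.75,

i.e., a quarter of the references going off-node on a mid-sized ccNUMA system already nearly doubles the average memory-access time, which is why localising the data, as mentioned above, pays off.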

2.4  Alternative algorithms

The advent of HPC systems has had a large impact on the use and design of algorithms. For vector processors this led to the consideration of algorithms that might not be the fastest on scalar processors but could take advantage of the vector capabilities of such a machine.

An example is the construction of preconditioner approximations for the solution of large sparse linear systems. Many types of preconditioners are not very fit for vectorisation, as they involve fairly random memory access and floating-point divisions. A few terms of a von Neumann series to approximate an ICCG preconditioner, however, are perfectly vectorisable and only slightly inferior to the exact preconditioning matrix [20] with respect to the convergence behaviour. For distributed-memory systems similar examples are well known, for instance the formulation of a large 1-D FFT as a multi-dimensional one [1]. This reformulation evades frequent fine-grained long-range communication at the cost of a matrix transposition for every dimension and an extra element-wise multiplication of the matrix. Although extra floating-point operations must be done, it is by far more efficient than the original formulation based on the Cooley-Tukey algorithm as described in [6]. Genetic algorithms are also excellent candidates for reconsideration on parallel computers. On sequential machines the computation of large populations of genes can often be prohibitive, especially when the gene representation itself is also compute-intensive. Parallel systems are ideally suited to cope with these algorithms because the calculations for the genes are totally independent. Only cross-over operations need a modest amount of communication, as does the communication of the parameters for the genes in the next generation. In short, reconsidering algorithms can be highly rewarding and may lead to large speedups. The largest such speedup known to the author is a factor of 1 million for a DNA-sequence comparison algorithm as described in [13]. Although not all algorithms are amenable to such huge speedups, reconsidering algorithms is very much worth exploring and is too often neglected.
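
A sketch of the von Neumann series idea mentioned above (the general principle only; the precise variant in [20] may differ in detail): for a preconditioner whose application requires a triangular solve with I − L, where L is strictly lower triangular,

    (I − L)^(−1) = I + L + L² + L³ + …      (exact; the series terminates because L is nilpotent)
    (I − L)^(−1) ≈ I + L + L²               (truncated von Neumann approximation)

The truncated form replaces the inherently recursive forward substitution by a few sparse matrix-vector products, which consist of long independent loops that vectorise well, at the price of a slightly weaker preconditioner.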

2.5  The ideal machine

So, considering the various ways of speeding up computations, what would be the ideal system to achieve high speeds? One very important property would be a “flat” memory system. We saw already in section 2.1 that processor speeds do not scale with the clock frequency. This is for the largest part attributable to the disparate growth in speed of the processors proper and of the memories that must feed them. Where the speed of processors grows by almost 60% per year, the increase in speed of the memory is only about 6.5% per year. To mitigate the adverse effects of this growing speed gap between processor and memory, a very complex memory hierarchy has been built, nowadays comprising L1, L2, L3 and TLB caches, in order to feed the functional units of the processor with sufficient operands at least a large part of the time. When operands in the registers are not re-used, even the bandwidth from the nearest L1 cache is not sufficient and the functional units may still have to wait at least another cycle. Performance optimisation mostly has to do with restructuring algorithms in such a way that the required data is in a nearby cache most of the time. This is often quite cumbersome and sometimes not even possible, for instance in the case of sparse linear-algebra algorithms. Therefore, an ideal machine should have a flat, non-hierarchical memory and sufficient bandwidth to the processor so as not to starve the functional units for data. The earlier Cray systems like the Cray Y-MP and C90 approached this ideal, albeit only for vector data. Presently no system is able to satisfy this requirement, not even at extreme cost. Although we would like it to be otherwise, the ideal machine cannot be a single-processor system with a virtually infinite speed. So, our ideal machine will have more than one processor, possibly many processors. What we do not want is to be aware of these processors in the sense that we have to bother about how many there are or have to do a major reconstruction of our programs. Although we have made some progress towards this ideal property, we are not yet there (and may never be). On vector processors it was (and is) wise to take the length of a vector register into account: reloading a vector register for just a few elements can be costly. Also the dimensioning of multi-dimensional arrays should be considered, lest memory-bank conflicts, another manifestation of the slowness of memory, decrease the speed of operand access considerably. In the early days of distributed-memory parallelism one was bound to let the system know the number of processors one wanted to use, and possibly modify the program for another number of processors, because the communication had to be organised differently for the adjusted number of processors. As both in OpenMP and MPI the processors have been virtualised to processes, these painful procedures are no longer needed, but it is clear that restructuring of programs is still needed, often to a large extent. The semantics of programming languages, except for the most trivial cases, do not allow automatic extraction of parallelism unless one also reorganises the processor architecture to allow the processing of massive numbers of processing threads. Some progress is being made in this field, which we discuss in section 4, but the ideal situation is still not near.

Figure 2: Efficiency of the dotproduct operation, as a fraction of peak speed versus loop length, for four processors: Intel Xeon (2.6 GHz), Cray C90 (250 MHz), IBM POWER3 (375 MHz), and IBM POWER4 (1.3 GHz).

When using parallel systems we naturally expect to see the benefits of using multiple processors. In fact, for an ideal machine we expect to see linear scaling behaviour for our programs. This presupposes a very fast communication network practically without latency and, equally important, an I/O speed that scales with the number of processors. Of course, perfect scaling will only happen when the latency and bandwidth for local memory references and for non-local ones are about the same. While this property is slowly coming into sight, at least with respect to bandwidth (recall the network bandwidths mentioned in section 2.3), a similar development for I/O is lagging behind. MPI-IO as defined in MPI-2 should improve this situation, but as yet the available implementations make clear that much is left to be done in that respect. Still, for large-scale simulations, and especially in complicated computational frameworks, massive I/O may be the only way to communicate the result to another framework component or to the user. The architecture of I/O hardware is therefore also one of the areas where much work still has to be done to arrive at the ideal machine.

3  Problems in attaining high speed, . . .

As remarked in the previous section, the ideal machine is not (yet) there, and we discuss some of the problems with which we are confronted because of the less than ideal situation we have to live with at the moment.

3.1  The processor — memory speed gap

The lagging behind of the memory speed relative to the processor speed is combated by putting cache memory between the large, slow main memory and the processors. Yet these caches too cannot provide all the bandwidth that is needed. This is well illustrated by considering the efficiency of a dotproduct operation d = (x, y), where the vectors x and y have lengths in the range 10–10,000. We show these efficiencies for various processors in Figure 2. The dotproduct is a very regular operation with a high spatial locality of reference (meaning that when a data item is addressed, its direct neighbours are likely to be addressed as well). This means that for the cache-oriented systems, like the Intel Xeon and the POWER3 and POWER4 in Figure 2, the data will mostly be found in the cache. On the other hand, there is no re-use of operands in this operation that could mitigate the bandwidth need somewhat. The late Cray C90 vector processor (1994–1996) was designed to have sufficient bandwidth from memory; however, there was considerable latency in loading the vectors into the registers, which is reflected in higher efficiencies being reached only for fairly large vector lengths.

Figure 3: Performance, in Mflop/s versus matrix order N, of three parallelisation implementations (autoparallelised, OpenMP, and MPI) of a dense matrix-vector multiplication on 6 processors of an SGI Altix system.

For the cache-based processors it is evident that the bandwidth is not large enough to sustain the operation at peak speed. It is interesting to observe the lower efficiency of the POWER4 processor relative to its predecessor, the POWER3. The reason is that in the POWER4 two processor cores are put on one chip and have to share the available bandwidth to the on-chip L2 cache. There is an unfortunate trend in the processor industry of putting more processor cores on one chip: since the introduction of the IBM POWER4, HP and Sun now also have dual-core chips, and AMD and Intel (with the Itanium Montecito) are soon to follow. In all cases the bandwidth to/from the processors suffers, with a corresponding decrease in the processors’ efficiency.
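
A rough bound makes the bandwidth argument explicit (an illustrative estimate only, assuming for instance a 3.2 GB/s front-side bus and a peak of two floating-point operations per cycle for the 2.6 GHz Xeon; the parameters of the measured systems may differ): each dotproduct element costs 2 flops but requires 16 bytes of loaded data with no re-use, so

    attainable speed ≤ (2 flops / 16 bytes) × 3.2 GB/s = 0.4 Gflop/s,
    peak speed ≈ 2.6 GHz × 2 flops/cycle = 5.2 Gflop/s,
    fraction of peak ≤ 0.4 / 5.2 ≈ 0.08

once the vectors have to be fetched from main memory.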

3.2  Parallelisation models

The nearest thing to the transparent parallelism of our ideal machine is the autoparallelisation option that is provided on several systems. In practice it is very close to a limited set of the facilities present in OpenMP. Hence, only (ccNUMA) shared-memory machines have this possibility; distributed-memory systems have at present no equivalent. Having such autoparallelisation possibilities does not automatically mean that the benefits are large. Consider for instance the performance of a dense matrix-vector multiplication algorithm on an SGI Altix, shown in Figure 3. Because of the very regular structure of the algorithm (essentially a sequence of vector update operations) the autoparallelised version and the OpenMP version should show similar performance. This is not the case: the OpenMP version is 4–9 times slower than the autoparallel version. This is partly due to the extremely high synchronisation overhead in most OpenMP implementations (not only SGI’s!). But the autoparallelised version does not perform well either: the sequential version running on one processor is in fact about 10% faster than the parallel version on 6 processors. Here, too, the synchronisation overhead is not compensated by the rather small computational load in each of the parallel threads that are executed. By contrast, when a distributed parallelisation model is used with MPI, the performance is significantly better. For small matrix sizes that can be kept in the cache the speed can be 5–10 times higher than in the autoparallelised version, and two times higher when main memory is needed. This style of parallelisation is reasonably efficient but requires restructuring of the program, and it is therefore a long way from the transparent parallelism a layman user would like to see.
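
For reference, a minimal OpenMP sketch of a row-parallel matrix-vector multiply as discussed above (a generic illustration only; it is not the code used for Figure 3, and the problem size and matrix contents are invented for the example):

    #include <stdio.h>
    #include <stdlib.h>

    /* Dense matrix-vector product y = A*x, parallelised over the rows of A
       with a single OpenMP work-sharing directive.  Without OpenMP support
       the pragma is ignored and the code simply runs serially.              */
    static void matvec(int n, const double *A, const double *x, double *y)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            double s = 0.0;                     /* private to each iteration */
            for (int j = 0; j < n; j++)
                s += A[(size_t)i * n + j] * x[j];
            y[i] = s;
        }
    }

    int main(void)
    {
        int n = 1500;
        double *A = malloc((size_t)n * n * sizeof *A);
        double *x = malloc((size_t)n * sizeof *x);
        double *y = malloc((size_t)n * sizeof *y);
        for (size_t k = 0; k < (size_t)n * n; k++) A[k] = 1.0;
        for (int i = 0; i < n; i++) x[i] = 1.0;
        matvec(n, A, x, y);
        printf("y[0] = %.1f\n", y[0]);          /* prints 1500.0 */
        free(A); free(x); free(y);
        return 0;
    }

The only synchronisation is the implicit barrier at the end of the parallel loop; in an MPI formulation each process would instead own a block of rows of A, and no synchronisation beyond a final gather of y would be needed.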

3.3  Scaling behaviour

Although the bandwidth and latency of communication networks have improved spectacularly over time, this does not mean that one should not give due thought to the way the communication is organised.


Figure 4: Performance, in Mflop/s versus number of processors, of three implementations (simple, tree, and Bcast/Reduce) of a distributed dotproduct on a cluster with an Infiniband network.

This is especially true for distributed-memory systems, where one has some control over the communication primitives. A very simple but important example is the distributed dotproduct. In Figure 4 we show three different implementations of this operation. It is clear from the figure that there are large differences in performance. These differences are due only to the communication structure. In the so-called simple implementation every processor computes its partial sum and sends it to the root processor by an MPI_Send/MPI_Recv pair. The root processor computes the global sum and sends it to all other processors, again by MPI_Send/MPI_Recv. This works out rather miserably, as could be expected: not only are the Send calls issued sequentially, they all try to reach the root processor at virtually the same time, causing a hotspot in the network and overloading the root processor. In the so-called tree method each processor determines its position in a binary tree using its processor id; it also determines the left and right branch nodes beneath itself and the “up” node to which it is connected. The processors compute their partial sums and send them up the tree. Each processor receives the partial results from its left and right branches below, adds them to its own partial result and sends the sum up to its up node. At the end of traversing the tree, the root processor contains the global sum, which is then sent down again through the tree to all processors. All communication in this method is done by MPI_Send/Recv pairs. The communication load is now distributed evenly over the network, resulting in much faster communication, which in turn shows in the performance of the operation. In the third method one takes advantage of the available MPI primitives: the processors compute their partial sums, and the global sum is made known to all processors by an MPI_Reduce/Bcast pair of calls (see the sketch below). In fact, the MPI_Bcast and MPI_Reduce routines are mostly also organised via a binary-tree communication and should therefore show a performance similar to that of the tree implementation. As is evident from Figure 4 this is indeed the case, although there is a difference of about 10% for 64 processors. Note that this scaling behaviour has hardly anything to do with the actual topology of the network: the binary tree constructed in method 2 is built in software without knowledge of the relative position or actual connections of the processors. There is an interesting exception, however, that we will discuss in section 6.2. This example makes clear that for real-world machines infinite scalability is not yet there, and one still has to be careful how the communication is organised if one does not want to be confronted with poor scalability of parallel applications. This is especially true for I/O-bound problems. Presently one often prefers to communicate input and output to a master processor, which does all I/O sequentially, rather than to rely on an operating system that somehow has to keep the I/O requests of the parallel processes in order. Especially for large numbers of processes this is not very reliable and certainly not fast. Well-implemented MPI-IO may improve this situation very much in the near future, but this way of implementing I/O is far from transparent to the user.
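
A minimal sketch of the third variant (a generic illustration, not the code used for the measurements in Figure 4; the slice length and vector contents are invented for the example): every process reduces its own slice, and the global sum is formed with MPI_Reduce followed by MPI_Bcast. MPI_Allreduce combines the two calls in one.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Distributed dotproduct: every process owns a slice of x and y, computes
       a partial sum locally, and the global sum is formed with MPI_Reduce
       followed by MPI_Bcast.                                                 */
    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int nlocal = 100000;               /* slice length per process  */
        double *x = malloc(nlocal * sizeof *x);
        double *y = malloc(nlocal * sizeof *y);
        for (int i = 0; i < nlocal; i++) { x[i] = 1.0; y[i] = 2.0; }

        double partial = 0.0, global = 0.0;
        for (int i = 0; i < nlocal; i++)         /* purely local computation  */
            partial += x[i] * y[i];

        /* Collective operations; internally usually organised as a tree.     */
        MPI_Reduce(&partial, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        MPI_Bcast(&global, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        if (rank == 0) printf("dotproduct = %.1f\n", global);
        free(x); free(y);
        MPI_Finalize();
        return 0;
    }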

Figure 5: Building blocks of the Bull NovaScale (an Intel QBB with SNC node controller) and the SGI Altix (SHUB-based), each containing four Itanium 2 processors, with processor, memory, and XIO links in the range 2.4–10.2 GB/s.

4  . . . and how they are dealt with

In the preceding section we have sketched some of the problems one encounters when dealing with present-day HPC systems. What is done to deal with these problems? In Table 1 some of the solutions that are employed are listed, together with the effects they tend to have on the systems in which they are applied.

Table 1: Approaches taken to improve HPC system performance.

      Solution: increase ...        Effect
    1 Clock frequency               Widens CPU-memory gap
    2 Cache size                    Tends to be slower
    3 No. of process threads        Better latency hiding
    4 No. of processor cores        Competition for bandwidth
    5 Communication on chip         Topology becomes important

4.1  Node and processor integration

It is important to keep in mind that cost and marketability are the main driving forces, also for HPC systems. In the ASCI program [3] very strong emphasis was put on the use of commodity off-the-shelf (COTS) components. This would make HPC systems more affordable and it would benefit both the sales and the research of these components. Unfortunately, living by the ASCI guidelines has hurt much of the research, both with respect to components and to machine architectures, that was not directed at COTS-based machines. This explains much of the rather single-minded approaches that are taken to boost component, and therefore HPC system, performance. An interesting example is the recent advent of Intel Itanium 2-based HPC servers: no less than five vendors are marketing such systems: the Bull NovaScale, the HP Integrity Superdome, the NEC TX-7, the SGI Altix, and the Unisys Orion 440. All of these systems use building blocks with 4 processors, but there is some difference in the way these building blocks are connected. Figure 5 shows two such nodes. In both cases the nodes contain 4 processors, but there are some differences with respect to the connecting network within the node and to other nodes. Bull uses the standard Quad Building Blocks (QBBs) as provided by Intel; NEC and Unisys use the same QBBs, which makes the difference between these systems rather limited and also quite dependent on what Intel will deliver, at the time determined by Intel. HP and SGI are slightly more independent in this respect because they merely use the Itanium 2 processor and not the node architecture from Intel. Still, it is clear that for such systems mass-market forces for a large part determine the development, rather than high efficiency of the systems offered: in an Intel QBB the bus from the node controller has a bandwidth of 6.4 GB/s, while the bus speed of each individual Itanium 2 processor is also 6.4 GB/s. The chance of starving the processors is therefore quite high. In an SGI node the situation is slightly better, because only 2 processors have to share a 6.4 GB/s bus.


The upshot is that putting more processors in one SMP node is more cost-effective from the vendor’s point of view but generally lowers the efficiency of the processors in a node. As mentioned earlier, there is also a clear trend to integrate more processor cores on one chip. In itself this does not mean that the bandwidth to the cores must suffer, as can be seen from the example of the IBM POWER3 and POWER4 processors in Table 2.

Table 2: Efficiency in operations/cycle for the IBM POWER3 and POWER4 processors for the axpy operation.

    Length    POWER3 375 MHz    POWER4 1 GHz
              Ops/cycle         Ops/cycle
     1000     0.2438            0.3187
     2000     0.2398            0.3072
     4000     0.1524            0.2569
     6000     0.1563            0.2519
     8000     0.1543            0.2510
    10000     0.1542            0.2527

Because the operand vectors are rather small, the on-chip L2 cache clearly helps to maintain a higher efficiency in the POWER4. On the other hand, when the operands have to be shipped in from memory, the relative memory bandwidth kicks in and the efficiency of the POWER4 is lower than that of the POWER3, as was illustrated in Figure 2. There is another reason for putting more processor cores on one chip. When looking at the differences between the IBM POWER3 and POWER4 chips, one notices that the number of floating-point units did not change: in both cases there are 4 FPUs. In the POWER3, however, they were combined in one processor, whereas they are divided over two processor cores in the POWER4. An important reason for the decision to split the functional units over more processor cores is that the scheduling of instructions over more functional units becomes extremely complicated. Simulations have shown that dynamic scheduling for more than six functional units requires an analysis window that is longer than the average program unit, as well as control devices like branch units to handle this information. This obviously does not make sense, and it leads to a move towards simpler processor cores for which optimisation analysis is still possible.

4.2  How to deal with code complexity

There are various ways in which one can deal with the complexity of scheduling instructions such that optimal use is made of all functional units in a processor. One possibility is simplifying the instruction stream. This is actually done in vector processors: the vectorisable part of a code is executed using vector instructions that apply to long sequences of operands instead of to single operands, which greatly simplifies the scheduling of the instructions. Figure 6 shows a block diagram of a NEC SX-6 vector system. It reflects the clear separation between the vector units, which handle the vector instructions, and the scalar CPU, which handles the non-vectorisable code. A second way out of the complexity problem is to move the burden of optimally scheduling the instructions to the compiler instead of to control hardware in the CPU. In that case it may even be advantageous to have more functional units at one’s disposal, because this could add flexibility to the instruction schedule. In this case we use static instruction scheduling: once the schedule has been determined by the compiler it is fixed and will be executed as such. Systems that employed this technique were briefly popular in the early nineties. They were known as Very Long Instruction Word (VLIW) machines, of which the Multiflow Trace 300, with an instruction word length of 1024 bits, was the most extreme example. A diagram of this machine is given in Figure 7. It contains no less than 28 functional units (in four blocks of 7) and could deliver 8 floating-point results per cycle. With its 120 MHz clock it was able to match a regular Intel processor of ten years later with a clock of over 1 GHz. Of course this sets very high demands on the quality of the compiler and an equally clever instruction-set architecture. Although VLIW architectures as such have disappeared, many of the VLIW ideas are found again in the Intel/HP Itanium processors. The Itanium processor also schedules its instructions statically on a large number of functional units. Intel calls its realisation of the VLIW philosophy EPIC (Explicitly Parallel Instruction Computing).


Figure 6: Block diagram of the NEC SX-6 showing the separate vector units and the scalar CPU each dedicated to their specific instruction streams.


Figure 7: Block diagram of Multiflow Trace 300 VLIW system.



Figure 8: Block diagram of the Itanium 2 EPIC architecture, with its integer, floating-point, load/store, branch, and MMX units, the 128 integer and 128 floating-point registers, and the L1/L2/L3 cache hierarchy.


Figure 9: Time diagram of two process threads in a multi-threaded architecture; when the active thread stalls waiting for data or a resource, the processor switches to the other thread.

Figure 8 depicts a block diagram of an Itanium 2 processor. The instruction word is less extreme than in the Multiflow machine, but it is still 128 bits long and contains a “bundle” of three 41-bit instructions plus a 5-bit template field; in addition, instructions can be predicated, which can modify the course of action taken for a given set of instructions, so the execution of the instruction stream is not fully static. The figure shows the functional units that can be addressed through the instruction bundles. Each bundle addresses a combination of 3 functional units, e.g., a Memory unit, a Floating-point unit, and a Branch unit in a so-called MFB template. This template is matched with the units that are available in that cycle; ordering the templates well should result in a high usage rate of the functional units. The detailed structure of the Itanium 2 and the scheduling of its instructions can be found in [9].

Rather than just waiting for operands to arrive and letting the functional units be inactive during that time, one could try to use them for other purposes in the meantime, thus increasing the throughput of the CPU as a whole. In this way one could tolerate high memory latencies because other useful work is done. This presupposes that more than one thread of execution is available to be offered to the CPU at the same time. Fortunately, the average program can be decomposed into many independent process threads. So, when the hardware supports the processing of more than one thread, the average number of instructions per cycle can be improved significantly. This technique, called multi-threading, was first conceived in the early 1980s in the Denelcor HEP system. It can be depicted as in Figure 9. A present-day representative of a multi-threaded architecture that is expressly built for this purpose is the Cray MTA-2. The MTA-2 is a NUMA system but not a ccNUMA system, because it does not have caches. In this respect it approaches the ideal machine discussed in section 2.5.



Figure 10: Block diagram of the MTA-2 network.

Memory latency is hidden by simply switching to another thread; such a switch between program threads takes only one cycle. As there may be up to 128 instruction streams per processor and 8 memory references can be issued without waiting for preceding ones, a latency of 1024 cycles can be tolerated. References that are stalled are retried from a retry pool. The connection network connects a 3-D cube of p processors with sides of p^(1/3), of which alternately the x- or y-axes are connected. Therefore, all nodes connect to four out of six neighbours. In a p-processor system the worst-case latency is 4.5p^(1/3) cycles; the average latency is 2.25p^(1/3) cycles. A diagram of the MTA-2 is given in Figure 10. There is another aspect in which it is close to the ideal machine: parallelism is nearly transparent. One can help the compiler by giving directives that may help to extract more parallelism, much like vectorisation and OpenMP directives. To take full advantage of the capabilities of the machine, however, one can use the so-called full/empty bits that are associated with each variable to obtain very fast process synchronisation. In that case much more restructuring of the program is required, though. Performance results and comparisons for this interesting machine can be found in [2]. The multi-threading idea has been adopted by other manufacturers too, albeit not on such a grand scale: the latest Intel IA-32 processors, the P4 and the Xeon, employ it under the name of Hyperthreading, with the ability to support one additional thread. The switch time between process threads is, however, significantly more than the one cycle required in the MTA-2. The recent IBM POWER5 also employs multi-threading, in a variant called Simultaneous Multi-Threading (SMT). In SMT two threads are continuously active and schedule instructions whenever a functional unit is available. Because the instructions and operands are labelled, every result ends up at the proper place and no thread-switching time is incurred. The only downside is that both threads are competing for space in the caches. So, it is advised not to use it for highly regular computations, where waiting for operands is less of a problem anyway.
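
A quick illustration of these numbers (a worked example for an assumed machine size of p = 512 processors, not a benchmark result): with 128 instruction streams per processor, each allowed 8 outstanding memory references,

    128 streams × 8 references = 1024 outstanding references, i.e. up to 1024 cycles of memory latency can be covered;
    worst-case network latency: 4.5 × 512^(1/3) = 4.5 × 8 = 36 cycles, average: 2.25 × 8 = 18 cycles,

which is far inside the tolerated window, so the functional units can in principle be kept busy without any caches.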

5  Clusters

Clusters are nowadays a primary resource for HPC cycles. Having started as a more or less playful initiative at NASA Goddard, where they were called Beowulf clusters [17], clusters are now marketed, pre-assembled and rack-mounted, by literally hundreds of vendors. The cluster idea has come a long way in democratising HPC: the costs are reasonably proportional to the number of processors, which means that small clusters are quite affordable to small research groups, at a price of $ 4,000–5,000 per dual node, including power, cooling, and a fast communication network. If the user group does not need such a fast network, because the communication/computation ratio of its applications is low, the price per node even drops to ≈ $ 2,500. In the early days of clusters the network speed (usually 10 Mb/s Ethernet) limited the spectrum of applications that could be run and, of course, only the distributed-memory parallel programming model could be applied.


With the advent of fast communication networks the application field has widened, and with very large clusters like the ASCI Red and the near-future Red Storm systems at Sandia National Laboratories, USA, clusters have also made an impact in the high end of HPC. This is aptly demonstrated by the presence of 5 cluster systems in the top 10 of the TOP500 list of the world’s most powerful HPC systems [19]. The speed of networks has increased to such a level that they outpace the proprietary networks of integrated parallel systems of just one generation before, or even use such a network (this is the case with the HP AlphaServer SC, which employs Quadrics’ QsNet, a network that is also used in clusters). Table 3 shows the most important networks available for clusters today.

Table 3: Characteristics of cluster networks available today.

    Network           Bandwidth (MB/s)    Latency (µs)
    Gbit Ethernet          120                ≥60
    Infiniband             850                  7
    Myrinet II             250                 10
    QsNet                  400                  4
    QsNetII                980†                 2
    SCI                    500                  2

    † Constrained by the PCI-X bus; the network itself provides > 1200 MB/s.
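
To see what the figures in Table 3 mean in practice, one can estimate the time to transfer a message of n bytes as t(n) ≈ latency + n/bandwidth (a simple first-order model; the chosen message size is only an example):

    8 KB over Gbit Ethernet:  t ≈ 60 µs + 8192 B / 120 MB/s ≈ 60 + 68 ≈ 128 µs
    8 KB over Infiniband:     t ≈  7 µs + 8192 B / 850 MB/s ≈  7 + 10 ≈  17 µs
    8 KB over QsNetII:        t ≈  2 µs + 8192 B / 980 MB/s ≈  2 +  8 ≈  10 µs

For short messages the latency dominates completely, which is why the latency column is at least as important as the bandwidth column for fine-grained parallel applications.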

So, with these networks and COTS processors that can be replaced relatively easily, is there still a place for integrated parallel systems? We believe there is, for several reasons. The present situation is that there is renewed interest in alternative architectures because the efficiency of COTS processors is steadily decreasing (see section 4). This has led to dissatisfaction among many users of large HPC facilities: these facilities have high peak performances but low to very low sustained performance in many practical situations. The supremacy of the Earth Simulator for a couple of years has made that painfully clear. So, there are now many initiatives to improve processor efficiency, the results of which will turn up in very high-end systems within a few years. Processors and networks in these systems will be far from commodity and should deliver more in terms of application performance than their cluster counterparts. At least for large national research institutions these new machines will be attractive candidates for their HPC infrastructure, and (large) clusters will be there more as auxiliary parts of that infrastructure. An additional reason is that up till now massive I/O requirements can be fulfilled better by integrated HPC systems than by clusters. Note that we speak here about actual high performance per se, not about the price-performance ratio. In that league clusters obviously win, and that is also the reason that they are here to stay and will even continue to be a growing part of the HPC landscape for years to come.

6  Thoughts about future machines

Compute power has never been enough and that will continue to be so. As alluded to already, there is a growing concern about the decreasing efficiency of processors when the standard policies for increasing their speed (higher clock frequencies, larger caches) are the only means employed. One can try to improve the situation in more than one way: on the device level and on the architectural level, and preferably with a combination of these approaches.

6.1  What is done on the device level

So, various vendors are exploring alternative ways to improve both speed and efficiency. We list a few that will have an impact in the very near future:

1. Processing in memory (PIM).
2. Specialised processors (FPGAs).
3. Faster memory types.


Figure 11: Genetic marker matching: comparing C. burnetii and Synechocystis PCC6830.

Processing in memory (PIM), also called Computation in RAM (C-RAM), starts from the idea that shipping operands from the memory to the processor and back is an enormous and growing waste of time. So why not do the processing in the memory itself where possible? This could be done by enhancing memory cells slightly with bit-wise processors to do massive SIMD-type computing. The idea is, as so often, not new: in the 1980s machines like the ICL DAP and the Goodyear MPP actually implemented it, but it has resurfaced recently. It was part of the USA’s NSF-based HTMT project, which was abandoned a few years ago but of which key parts are still serious candidates to turn up in near-future machines. Table 4 gives an impression of what can potentially be achieved with PIM [5].

Table 4: Algorithm speedup achieved with PIM.

    Algorithm              Factor
    16M 3×3 convolution      6404
    Vector quantisation      1312
    Data mining              2724

The huge speedup factors shown here reflect the fact that within memory we deal with latencies that are of the order of individual gate delays, i.e., picoseconds instead of the nanosecond time frames we see in transporting data to the processor. Not all processing is amenable to this approach, but a significant part of scientific codes could benefit enormously from it. A second development is the upcoming rôle of Field Programmable Gate Arrays (FPGAs). FPGAs are large arrays of logic cells that can be configured as memory cells, I/O channels, or devices that perform basic compute operations, like a bit-wise add. Traditionally, the device density of FPGAs has been about 10 times higher than what can be achieved in a general-purpose processor. They have become increasingly interesting for more general computing tasks since the number of logic cells has become so high that all kinds of fundamental algorithms can be implemented on them. In addition, the configuration of the logic cells on an FPGA is not static as in the proprietary ASICs (Application Specific Integrated Circuits) used by vendors in their systems: they can be reconfigured in a few milliseconds to perform another task. Over the last two decades FPGAs have proved to be ≈ 10–30 times faster than general-purpose CPUs. They are now starting to turn up as add-ons to “normal” systems. In Figure 11 we show the visualised outcome of a genetic marker matching job that, with the help of an FPGA plug-in card in a fast (2.6 GHz Intel Xeon) standard PC, took 10 minutes instead of 14 hours. Cray now offers its XD1 system with the possibility to add an FPGA on each processor board. Programming FPGAs is still far from easy: it is mostly done with hardware description languages like VHDL or Verilog.


Figure 12: Possible node of a network-oriented computer: a crossbar switch, with links to other nodes, connects a scalar processor, a vector-intensive scatter/gather unit (GRU), an application-specific processor (FPGA), a memory-intensive PIM unit (AMU), and a memory controller with DRAM modules.

However, there are activities, both from FPGA vendors and from vendors that sell systems using them, to provide more user-friendly tools and/or routine libraries that can be called from Fortran or C programs. One can therefore expect that FPGAs will turn up in many near-future systems, at least as an optional enhancement of standard configurations. We already mentioned several times the main reason for the decreasing efficiency of new generations of processors: the growing gap in speed between processors and memory. The most direct way to reverse this trend would be to speed up the memory. A possibility to do just that can be found in Magnetic RAM (MRAM). The working of MRAM is based on the difference in resistivity between aligned and non-aligned spins of free electrons in a magnetic medium. Presently, Freescale, a Motorola daughter company, is already producing MRAM on a limited scale for evaluation. The present characteristics are already very attractive and are expected to improve significantly over time: the memory contents are permanent, so no current is required to maintain the information. This means that the power requirements are much lower than for standard DRAM, which has to be rewritten continuously. Also, the information in MRAM is immediately available, which drastically cuts the waiting time when powering up a system. The speed of MRAM is currently about as fast as that of SRAM, a few nanoseconds, with quite some room for improvement in the near future. Two issues need attention to make MRAM a universal replacement for DRAM: the density and the production costs. The density of MRAM is still significantly lower than that of DRAM (currently 4 Mbit chips are produced), and production is done at the moment by an expensive sputtering process to obtain material layers with the desired properties. It is expected that both issues can be resolved in the next few years. Together with improving switching times this could curb the trend of the growing memory gap for the first time in many years.

6.2  What is done on the architectural level

Both Cray (the Rainier project) and SGI (the Ultra Violet project) are exploring the possibility to, in a sense, reverse the rôle of the components in their machines. Consider Figure 12: roughly speaking, in this view the communication network is the computer. Such a network-oriented computer contains nodes that may vary in composition; possible components are vector processors, scalar processors, FPGAs, and PIM modules. I/O processors can be attached as well (not shown in the figure). The present status quo is such that there are no fundamental problems with respect to the hardware to realise such an architecture, although it may still take some time before PIM modules are available. It is rather the software side that will need large efforts to unify such systems. A way to do that is to make every device individually addressable by every other device. The extension of such a concept to devices outside the system at hand is then a logical step, and one could begin to think about computational grids of devices rather than of systems — a highly attractive idea for a number of application areas.


Figure 13: A full IBM BlueGene/L configuration.

A more conventional approach, except for its scale, is to be seen in the BlueGene product line of IBM. The first generation is the BlueGene/L. It is made for very massively parallel computing; the individual speed of the processor has therefore been traded in favour of very dense packaging and a low power consumption per processor. The basic processor in the system is a modified PowerPC 440 at 700 MHz. Two of these processors reside on a chip, together with 4 MB of shared L3 cache and a 2 KB L2 cache for each of the processors. The processors have two load ports and one store port from/to the L2 caches at 8 bytes/cycle. This is half of the bandwidth required by the two floating-point units (FPUs) and as such quite high. The CPUs have 32 KB of instruction cache and 32 KB of data cache on board. In favourable circumstances a CPU can deliver a peak speed of 2.8 Gflop/s, because the two FPUs can perform fused multiply-add operations. Note that the L2 cache is smaller than the L1 cache, which is quite unusual but which allows it to be fast. The packaging in the system is as follows: two chips fit on a compute card with 512 MB of memory. Sixteen of these compute cards are placed on a node board, of which in turn 32 go into one cabinet. So, one cabinet contains 1024 chips, i.e., 2048 CPUs. For a maximal configuration 64 cabinets are coupled to form one system with 65,536 chips, or 131,072 CPUs. In normal operation mode one of the CPUs on a chip is used for computation while the other takes care of communication tasks. In this mode the theoretical peak performance of the system is 183.5 Tflop/s. When the communication requirements are very low it is, however, possible to use both CPUs for computation, doubling the peak speed; the resulting figure of about 360 Tflop/s is also the speed that IBM is using in its marketing material. A rendering of a complete configuration is given in Figure 13. The BlueGene/L possesses no less than 5 networks, two of which are of interest for inter-processor communication: a 3-D torus network and a tree network. The torus network is used for most general communication patterns; the tree network is used for frequently occurring collective communication patterns like broadcasts, reduction operations, etc. The hardware bandwidth of the tree network is twice that of the torus: 350 MB/s against 175 MB/s per link. Early next year the first systems should be installed: a full configuration at Lawrence Livermore National Laboratory in the USA and a 6144-processor, 34 Tflop/s system at the Dutch ASTRON organisation, which will use it for the analysis and synthesis of radio-astronomy data. At the moment the development of both hardware and software is in full swing. The benefits of the separate tree network for reduce and gather-like operations in MPI are evident, as can be seen from the results of a distributed dotproduct on a 512-processor development model of the BlueGene/L shown in Figure 14. The system is able to recognise the MPI_Reduce and MPI_Bcast calls and to put them out on the faster tree network, resulting in a higher overall speed of the operation. This is in contrast to a system where all communication patterns use the same network (compare with Figure 4). The dips in the speed at 16, 32, . . . , 384 processors show that some work still has to be done on this experimental system.
It also shows that, with the proper support, network topology has a significant impact on the performance.
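
The packaging numbers above can be cross-checked with a little arithmetic (a sketch based only on the figures quoted in the text):

    2 chips/card × 16 cards/board × 32 boards/cabinet = 1024 chips per cabinet,
    1024 chips × 64 cabinets = 65,536 chips = 131,072 CPUs,
    65,536 chips × 2.8 Gflop/s (one compute CPU per chip) ≈ 183.5 Tflop/s,
    and ≈ 367 Tflop/s when both CPUs on a chip compute.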


Figure 14: Three distributed dotproduct implementations (Send/Recv, binary tree, and Bcast/Reduce) on the IBM BlueGene/L (700 MHz); Mflop/s versus number of processors.

7  Conclusion

In the preceding sections we have tried to give an overview of the issues with which system designers (and users) are confronted, and of the roads taken to be able to somehow guarantee a growth in speed for our HPC systems. It has become clear that some of these roads are coming to an end, as is the case for shrinking the feature size on chips, or are leading in undesirable directions, like those that lead to ever larger processor-memory speed gaps. We have also seen that there is a renewed interest in a more fundamental approach to architectural research that may lead to systems that are unconventional from our current point of view. It at least shows that there are still large resources of creativity to bring the HPC systems area forward, and we may benefit from this in the years to come.

References

[1] R.C. Agarwal, F.G. Gustavson, M. Zubair, A High Performance Parallel Algorithm for 1-D FFT, Proc. IEEE, 82 (9), Sept. 1994, 34–40.
[2] W. Anderson, P. Briggs, C.S. Hellberg, D.W. Hess, A. Khoklov, M. Lanzagorta, R. Rosenberg, Early Experience with Scientific Programs on the Cray MTA-2, Proc. SC2003, Phoenix, AZ, USA, 2003.
[3] The ASCI program: http://www.llnl.gov/asci/.
[4] Co-Array Fortran homepage: www.co-array.org/.
[5] D.G. Elliott, W.M. Snelgrove, M. Stumm, A PetaOp/s is currently feasible by computing in RAM, PetaFlops Frontier Workshop, Washington, 1995.
[6] G. Fox, M. Johnson, G. Lyzenga, S. Otto, J. Salmon, D. Walker, Solving Problems on Concurrent Processors, Prentice-Hall, Englewood Cliffs, USA, 1988.
[7] R.W. Hockney, C.R. Jesshope, Parallel Computers II, Adam Hilger, Bristol, 1987.
[8] G.E. Moore, Cramming more components onto integrated circuits, Electronics, 38 (8), April 19, 1965.
[9] C. McNairy, D. Soltis, Itanium 2 Processor Microarchitecture, IEEE Micro, 23 (2), 2003, 44–55.
[10] M. Snir, S. Otto, S. Huss-Lederman, D. Walker, J. Dongarra, MPI: The Complete Reference, Vol. 1, The MPI Core, MIT Press, Boston, 1998.
[11] W. Gropp, S. Huss-Lederman, A. Lumsdaine, E. Lusk, B. Nitzberg, W. Saphir, M. Snir, MPI: The Complete Reference, Vol. 2, The MPI Extensions, MIT Press, Boston, 1998.
[12] OpenMP Forum, Fortran Language Specification, version 1.0, www.openmp.org/, October 1997.
[13] J.W. Romein, J. Heringa, H.E. Bal, A Million-Fold Speed Improvement in Genomic Repeats Detection, Proc. SC2003, Phoenix, AZ, USA, 2003.
[14] D.L. Slotnick, Unconventional Systems, Proc. AFIPS Conf. 30, 1967, 477–481.
[15] K. Schwan, Win Bo, Topologies — computational messaging for multicomputers, Proc. 3rd Conf. on Hypercubes and Appl., ACM Press, 1988.
[16] A.J. van der Steen, Overview of recent supercomputers, version 2003, Report NCF, Oct. 2003, or http://www.euroben.nl/reports/web03.
[17] T.L. Sterling, J. Salmon, D.J. Becker, D.F. Savarese, How to Build a Beowulf, The MIT Press, Boston, 1999.
[18] R.J. Swan, S.H. Fuller, D.P. Siewiorek, Cm*: a modular multi-microprocessor, Proc. National Computer Conf., 1977, 637–644.
[19] H.W. Meuer, E. Strohmaier, J.J. Dongarra, H.D. Simon, Top500 Supercomputer Sites, 23rd Edition, June 2004; the report can be downloaded from www.top500.org/.
[20] H.A. van der Vorst, A vectorizable variant of some ICCG methods, SIAM J. Sci. Stat. Comput., 3, 1982, 350–356.
