Accessing Data on SGI Altix: An Experience with Reality

Guido Juckeland, Matthias S. Müller, Wolfgang E. Nagel, Stefan Pflüger
Technische Universität Dresden
Center for Information Services and High Performance Computing
01062 Dresden, Germany
Email: {guido.juckeland | matthias.mueller | wolfgang.nagel | stefan.pflueger}@tu-dresden.de

Abstract— The SGI Altix system architecture supports very large ccNUMA shared memory systems. Nevertheless, the system layout sets bounds on the sustained memory performance that can only be avoided by selecting the "right" data access strategies. This paper presents the results of cache and memory performance studies on an SGI Altix 350. It demonstrates limitations and benefits of the system and of the underlying Intel Itanium 2 processor.

I. INTRODUCTION

The increasing performance gap between processor and memory speed in computer systems demands sophisticated data management strategies. While vector systems use proprietary memory modules with many memory banks to allow a large number of parallel memory accesses, superscalar processors introduced caches to buffer small pieces of frequently used data. Much research has gone into optimizing the microarchitecture of the processors as well as applications and compilers for both techniques.

The memory system of a computer has two key characteristics: bandwidth (how much data the system can transfer per time unit) and latency (how long the CPU has to wait, at minimum, for the first data element). While the caches of a scalar processor offer a low-latency, high-bandwidth connection to the processing units, they are expensive and can only hold a limited amount of data. Modern microarchitectures of scalar processors therefore contain a multi-stage cache hierarchy to balance the trade-off between size and speed. That main memory access is the bottleneck in every computer system based on a scalar processor is beyond question. Knowing the exact latencies and bandwidths for all cache levels and the various memory levels in a distributed shared memory environment, however, is key to understanding and improving application performance on a specific computer system.

This paper presents research on the SGI Altix 350 to establish and verify those metrics. The first section introduces the Altix system architecture and the Itanium 2 processor architecture with a focus on the memory subsystem. Afterwards, the measurement results for single and parallel access patterns are discussed. The presented results will enable programmers to avoid major pitfalls when working on SGI Altix systems.

II. THE SGI ALTIX 350

The Intel Itanium 2 processor with its good floating point performance and the large amount of addressable shared memory in the SGI Altix series combine for a unique HPC system. The characteristics of these two components with regard to memory performance are the subject of this section.

A. The Intel Itanium 2 Processor

The Itanium 2 processor distinguishes itself from other superscalar processors in that it implements the IA-64 architecture and, therefore, an enhanced VLIW ISA (called EPIC). Since not every combination of operations within an instruction word (bundle) is permitted, this also limits the number of memory access operations that can be issued at once. The Itanium instruction set allows for a maximum of two such operations per bundle. The microarchitecture can issue two bundles per clock, which results in at most four memory transactions issued per clock cycle. It will be shown that this corresponds with the maximum transfer capabilities between the caches and the computation units (see [1]). The IA-64 ISA is a so-called load-store architecture: compute operations can only be issued on register contents, and data has to be moved into and out of the registers by separate load/store instructions.

1) Cache Hierarchy: The three-stage cache hierarchy of the Itanium 2 processor is shown in figure 1. Data has to "flow" through the three cache levels before reaching the computation units. There is, however, a unique exception to traditional access schemes, as can be seen in figure 2: the floating point unit (FPU) is not connected to the first level data cache but to the level two cache. This allows pointers to data structures to remain in the L1D cache while the floating point data is processed out of the L2 cache. The transfer rates from the caches into the computation units are shown in table I. As mentioned earlier, the microarchitecture allows at most four load/store instructions per clock cycle.

2) Translation Lookaside Buffers (TLBs): As will be shown later on, the TLB size is of significant importance to the cache and memory access latencies. Since cache and memory are accessed on the Itanium 2 with their physical addresses, the time for the address translation t_t adds to the physical access time

Fig. 1. Cache hierarchy of the Itanium 2 [1] (caches described by name, size, cache line size, access latency: L1I and L1D, 16 KB, 64 Byte, 1 clock; L2, 256 KB, 128 Byte, 5+ clocks; L3, up to 9 MB, 128 Byte, 14+ clocks; 32 Byte/clock between the cache levels, 16 Byte/clock into L1D; system bus control logic connecting to the northbridge at 6.4 GB/s)

Fig. 2. Itanium 2 block diagram [1] (L1 instruction cache with fetch/prefetch unit and ITLB, branch prediction, instruction queue of 8 bundles, B/M/I/F issue ports, branch units, integer and MM units with 128 general purpose registers, FP unit with 128 floating point registers, ALAT, scoreboard with predicates/NaTs/exceptions, register stack engine, dual-port L1 data cache with DTLB, quad-port L2 cache, L3 cache, bus controller)
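The per-clock transfer widths discussed above, multiplied by the clock rate of the 1.4 GHz Madison CPU studied later in this paper, reproduce the peak cache bandwidths quoted in table VI. A small illustrative sketch (Python used here purely for the arithmetic):

```python
# Peak-rate arithmetic for the Itanium 2 figures given in this section.
CLOCK_HZ = 1.4e9                 # Madison CPU studied in this paper

bundles_per_clock = 2            # the microarchitecture issues two bundles
mem_ops_per_bundle = 2           # at most two memory operations per bundle
mem_ops_per_clock = bundles_per_clock * mem_ops_per_bundle

l2_read_width = 16               # bytes per clock, FPU reading from L2
l3_to_l2_width = 32              # bytes per clock between L3 and L2

l2_read_peak = l2_read_width * CLOCK_HZ / 1e9    # in GB/s
l3_to_l2_peak = l3_to_l2_width * CLOCK_HZ / 1e9  # in GB/s

print(mem_ops_per_clock)   # 4 memory transactions per clock cycle
print(l2_read_peak)        # 22.4 GB/s, the L2 <-> CPU figure of table VI
print(l3_to_l2_peak)       # 44.8 GB/s, the L3 <-> L2 figure of table VI
```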

TABLE I
TRANSFER RATES BETWEEN COMPUTE UNITS AND THE ATTACHED CACHES (IN BYTE PER CLOCK CYCLE) [1]

Unit | Connected to | Read rate     | Write rate
ALU  | L1D          | 16 Byte       | 16 Byte
FPU  | L2           | 16 or 32 Byte | 16 Byte

t_p and results in the total access time t_a:

t_a = t_t + t_p    (1)

While the physical access time t_p is fixed (depending on the cache or memory level), the total access time t_a depends heavily on the translation time t_t. If the address translation has to be done by the operating system (OS), which is ultimately responsible for page allocation and translation, a penalty larger than the physical access time t_p will occur. The Itanium 2 features a two-level TLB structure which buffers a number of most recently used translations to reduce the translation time t_t. Its characteristics are summarized in table II.

TABLE II
ITANIUM 2 TLB CHARACTERISTICS [1]

Characteristic      | Instruction TLB       | Data TLB
                    | Level 1 | Level 2     | Level 1 | Level 2
Number of entries   | 32      | 128         | 32      | 128
Associativity       | Full    | Full        | Full    | Full
Penalty for L1 miss | 2 clock cycles        | 4 clock cycles

The TLBs have to deal with the flexibility of the Itanium 2 microarchitecture regarding the supported memory page sizes, which range from 4 KByte up to 4 GByte. The first level TLBs can only handle 4 KByte pages, but can work with larger pages by using 4 KByte segments of such a page. The L1 TLBs can, therefore, contain translations for 32 x 4 KByte = 128 KByte of memory space each. The L1 TLBs and the L1 caches are tightly coupled: an L1 cache line will be invalidated if a corresponding L1 TLB entry is evicted. The second level TLBs support all page sizes but suffer from OS limitations. SGI runs Linux on the Altix systems, which uses a fixed page size of 16 KByte for all processes. Hence, the address space for which translations can be placed into the L2 TLBs is limited to 128 x 16 KByte = 2 MByte. Beyond that point another Itanium 2 specific mechanism tries to avoid invoking the operating system: the hardware page walker (HPW).

3) Hardware Page Walking: When no address translation can be located in the TLBs, the hardware page walker accesses the virtual hash page tables (VHPTs), a special data structure kept in addition to the L2 and L3 caches and in a portion of main memory. These tables contain the address translations for their corresponding caches. A memory access that misses the TLBs but whose cache line is held in the L2 cache will, for example, use the L2 VHPT. The best-case penalties for this page walking are shown in table III.

TABLE III
BEST-CASE HPW PENALTIES [1]

Event                      | Penalty in clock cycles
Hit in L2 VHPT             | 25
Miss in L2, hit in L3 VHPT | 31
Miss in L2 and L3 VHPT     | 20 + main memory latency

B. The SGI Altix 350 System Architecture

The SGI Altix series comprises the small to medium range servers of the Altix 350 series and the large supercomputers of the Altix 3000 series. The system studied in this paper is an Altix 350 with four Intel Itanium 2 (Madison) CPUs with 3 MByte of L3 cache each. The system is built from so-called modules of different types: base, extension, and router modules. The base module contains the CPUs as well as the main memory; its layout is shown in figure 3. The Altix 350 at hand runs Red Hat Enterprise Linux 3 with ProPack 3 (SP4), and the Intel compilers version 9.0.24 were used for all measurements.

Fig. 3. SGI Altix 350 Base Module [2] (two Itanium 2 CPUs on a 6.4 GB/s front side bus attached to the SHUB; SHUB to main memory at 10.2 GB/s, two NUMAlink ports at 6.4 GB/s each, I/O system with PCI/PCI-X slots at 2.4 GB/s)
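To make the translation cost concrete, the following sketch combines the TLB reach implied by table II with eq. (1) and the page-walk penalties of table III. The combination of an L3 cache hit with an L2 VHPT hit is a constructed example, not a measured value:

```python
# TLB reach and an example total access time t_a = t_t + t_p, eq. (1).
KBYTE = 1024

l1_tlb_reach = 32 * 4 * KBYTE    # 32 entries of 4 KByte segments = 128 KByte
l2_tlb_reach = 128 * 16 * KBYTE  # 128 entries of 16 KByte Linux pages = 2 MByte

# Constructed example: the cache line sits in L3 (t_p >= 12 clocks), the
# translation misses both TLBs but hits the L2 VHPT (t_t = 25 clocks).
t_p = 12
t_t = 25
t_a = t_t + t_p

print(l1_tlb_reach // KBYTE)   # 128 KByte of L1 TLB reach
print(l2_tlb_reach // KBYTE)   # 2048 KByte = 2 MByte of L2 TLB reach
print(t_a)                     # 37 clock cycles at best for this case
```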

1) System Layout: The modules are combined to form larger systems using the two NUMAlink 4 connections available per base module. The connected modules form a ccNUMA shared memory environment, so that one running process can allocate all of the system's main memory. The layout of the SGI Altix 350 at hand is shown in figure 4. The system contains 8 GByte of main memory.

2) Cache Coherency: Ensuring cache coherency in such large scale shared memory systems poses quite a challenge. SGI uses the SHUBs and NUMAlinks to communicate the coherency information. When the system is booted, the SHUB reserves a portion of its local memory to hold the addresses of all cache lines and a bit mask containing one bit for every CPU in the system. These bits mark the processors sharing the cache line. The SHUB also listens to the cache snoop information transmitted by the Itanium 2 CPUs over the front side bus and manages the directory accordingly. When a CPU writes to a shared cache line, the SHUB transmits that cache line to every CPU holding a copy [3].

Fig. 4. 4 CPU SGI Altix 350 system layout (two base modules connected by their NUMAlink ports at 6.4 GB/s)

III. LATENCY

The time between the issue of a load instruction and the arrival of the requested data from one of the caches or the main memory is known as the access latency. On the Altix system it is, as shown earlier, the sum of the physical latency and the address translation time. Intel specifies the Itanium 2 latencies in [1] as shown in table IV.

TABLE IV
ITANIUM 2 ACCESS LATENCIES

Hierarchy level | Physical latency
Level 1 cache   | 1 clock cycle
Level 2 cache   | ≥ 5 clock cycles
Level 3 cache   | ≥ 12 clock cycles
Main memory     | System dependent (≥ 100 ns)

This section establishes the total access times for the three cache levels as well as the main memory latency, with one and with more than one process accessing the memory.

A. Measuring Algorithm

The exact measurement of cache and memory access times poses quite some problems for the performance analyst: one access is too fast to use a timer, and for multiple accesses one has to circumvent hardware and software optimizations that try to hide the latency. In this paper a pointer chasing algorithm was used to acquire the measurement data. Pointer chasing uses the data from one access to determine the address of the next access. With a random dispersal of the addresses within the allocated memory area, neither hardware nor software prefetching can determine the access pattern, and every access encounters the full latency. By varying the size of the memory area used, one can determine all cache latencies as well as the main memory access time.

The used memory area is initialized with the following algorithm: Treat the area as a vector of pointers. Select a random position and have the first element point to that position. Remove the element from the list of available elements and pick a new random element. Have the last element point to this new element, and so on. Finally, have the last selected element in the vector point to the first element. One receives a kind of interwoven ring of pointers that allows random jumps through the allocated memory area.

B. Access on Separate Memory Segments

A parallel version of the pointer chasing algorithm was furthermore created using the MPI standard. The involved processes all work in their own piece of main memory, so one can study their influence on each other with regard to the access latency. They do not exchange any data using MPI messages, but they do use barriers to synchronize with each other.

1) Cache Latencies: When the allocated memory area is restricted to fit within the processor cache, the cache latencies can be determined. Since all CPUs are then working out of their caches, they do not influence each other. Therefore, only the measurement results for one active process are shown in figure 5. They demonstrate the exceptionally low cache access times of the Itanium 2 processor.

Fig. 5. Cache access time for an Itanium 2 Madison 3M with 1.4 GHz (access latency in clock cycles versus amount of memory used, 1 KByte to 4 MByte)
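The initialization and measurement scheme of section III-A can be sketched as follows. This is an illustrative Python stand-in (the actual measurements were done with BenchIT), with Sattolo's algorithm assumed as one way to build the single random cycle the text describes:

```python
import random
import time

def make_pointer_ring(n, seed=0):
    """Build the 'interwoven ring of pointers': a random permutation of
    0..n-1 that forms one single cycle (Sattolo's algorithm)."""
    rng = random.Random(seed)
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)           # j < i guarantees a single cycle
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt

def chase(nxt, steps):
    """Chase pointers: the datum of one access is the address of the next,
    which defeats hardware and software prefetching."""
    cur = 0
    t0 = time.perf_counter()
    for _ in range(steps):
        cur = nxt[cur]
    elapsed = time.perf_counter() - t0
    return cur, elapsed / steps        # average latency per access

ring = make_pointer_ring(4096)
end, latency = chase(ring, len(ring))  # a single cycle returns to its
                                       # starting element after n steps
```

Varying the size of the ring over the cache and memory levels then yields the latency curves of figure 5.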

The effects of the address translation can be clearly seen in the figure. While the L1 cache latency is, as expected, at 2 clock cycles for pointer chasing, the L2 cache latency increases from 5 to about 9 clock cycles: the L1 TLB cannot hold the translations for all data, and the 4 clock cycle penalty for accessing the L2 TLB is added to the access time. The L3 cache latency is somewhat larger than the value provided by Intel, again due to L1 TLB misses. As the used memory area reaches the capacity of the L2 TLB at 2 MByte, the access time increases rapidly because hardware page walking adds to the address translation time, and the benefit of the Itanium 2's fast level three cache is gone. This is of specific importance since this result applies to all Itanium 2 CPUs: beyond the reach of the TLBs the caches lose a lot of their performance. 2) Main Memory Latencies: Using the pointer chasing algorithm to determine the main memory latency yields somewhat contradictory results. The process scheduling also

influences process and data placement and has to be taken into account. Therefore, the processes were pinned to their CPUs using the dplace tool from SGI. The results obtained with and without that tool are gathered in table V.

TABLE V
MEASURED MEMORY LATENCIES WITH 10 MBYTE OF USED MEMORY PER PROCESS
(latency for n active processes)

Without dplace:
Total # of processes | n = 1  | n = 2  | n = 3  | n = 4
1                    | 110 ns |        |        |
2                    | 110 ns | 110 ns |        |
3                    | 110 ns | 144 ns | 126 ns |
4                    | 144 ns | 144 ns | 144 ns | 126 ns

With dplace:
Total # of processes | n = 1  | n = 2  | n = 3  | n = 4
1                    | 110 ns |        |        |
2                    | 145 ns | 126 ns |        |
3                    | 145 ns | 126 ns | 126 ns |
4                    | 144 ns | 126 ns | 145 ns | 126 ns

TABLE VI
MAXIMAL BANDWIDTH WITHIN SGI ALTIX 350

Intel Itanium 2 (1.4 GHz):
L2 Cache ↔ CPU | 22.4 GB/s read + 22.4 GB/s read or write
L3 ↔ L2        | 44.8 GB/s read or write

Within one module:
Front side bus | 6.4 GB/s read or write
SHUB ↔ Memory  | 10.2 GB/s read or write
NUMAlink 4     | 3.2 GB/s read + 3.2 GB/s write

The results raise a number of questions. Why is there a decrease for 4 processes when all of them are active, compared to keeping a number of them idle? Why does the usage of dplace increase some access times while it decreases others? And why does one only see three different latency values? One possible explanation is hidden in the system architecture: the values are all about 15 ns apart, which corresponds quite well with the time for one NUMAlink hop. It seems that in the case of the 126 ns values at least one process is working with memory located in the other module (remote memory). In the case of the 144 ns values it could be that at least one process is in fact working with its local memory (located in its own module), but the data is somehow sent to the other module first before coming back to the original module. This would result in two NUMAlink hops and would add about 30 ns to the initial latency. (Footnote 1: The access time increases to about 150 ns for 1 GByte of used memory space. Since the same effects are visible, the memory space was reduced to 10 MByte to shorten the measurement runtime.)

IV. BANDWIDTH

While access latencies can be hidden by a number of hardware and software mechanisms, the available bandwidth cannot be increased by such techniques. Therefore, bandwidth is the limiting factor for most applications, especially in data intensive computing. The maximum data transmission rates for the SGI Altix are shown in table VI. This section compares these values with the obtained measurement results and discusses the scalability with more than one active process.

A. Measuring Algorithm

The STREAM benchmark has established itself as a standard for the measurement of memory transfer rates. It uses a set of different vector operations (e.g. a vector triad) on double precision floating point numbers where no piece of

data is reused. This worst-case scenario for scalar processors practically circumvents the caches, as it does not include any kind of data locality. The available bandwidth BW can be determined from the number of vectors used n_vector, the length of the vectors l_vector, the size of one element in the vector s_element, and the run time t_run as follows:

BW = (n_vector · l_vector · s_element) / t_run    (2)
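Eq. (2) in runnable form, using the two kernels introduced below as eqs. (3) and (4). This Python stand-in only illustrates the bookkeeping; the stream counts of 3 and 4 vectors for the two kernels are our assumption, not taken from the original benchmark:

```python
import time

def measured_bandwidth(n, readwrite=False):
    """Run one kernel over vectors of length n and return BW per eq. (2):
    BW = n_vector * l_vector * s_element / t_run."""
    A = [1.0] * n
    B = [2.0] * n
    C = [3.0] * n
    s_element = 8                      # bytes per double precision element
    t0 = time.perf_counter()
    if readwrite:                      # A := A + B * C, eq. (4)
        for i in range(n):
            A[i] += B[i] * C[i]
        n_vector = 4                   # A and B and C read, A written back
    else:                              # Temp := Temp + A + B * C, eq. (3)
        temp = 0.0
        for i in range(n):
            temp += A[i] + B[i] * C[i]
        n_vector = 3                   # pure read: A, B and C
    t_run = time.perf_counter() - t0
    return n_vector * n * s_element / t_run   # bytes per second
```

Sweeping n over a range of memory sizes then produces bandwidth curves like those in figures 6 to 11.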

Originally, the STREAM benchmark measures the bandwidth only for one fixed memory size. This was changed so that the bandwidth can be determined for a range of memory sizes. Furthermore, two self-selected vector operations,

Temp := Temp + A + B · C    (3)

and

A := A + B · C    (4)

have been selected to determine the available bandwidths for pure read and combined read/write operations. The measurement routines have also been implemented to benchmark the accumulated bandwidth with more than one process accessing separate and shared memory segments.

B. Memory Bandwidth with Different Strides

A first benchmark, which is focused on cache bandwidth, measures the available bandwidth when accessing data with different strides. The memory range was selected from 32 KByte up to 10 MByte, and the stride sizes were varied from every element (8 Byte) to every 128th element (1 KByte). The results of the benchmark are displayed in figures 6 and 7.

The first figure underlines an important characteristic of the second level cache. The sudden drop in performance between strides of 128 Bytes and 256 Bytes points to bank conflicts within the L2 cache. As the bank width of that cache is 256 Bytes, all requests whose stride is a multiple of that width access the same cache bank and are, therefore, serialized. This cuts the available bandwidth in half. The drop in performance as the capacity of the L2 TLB is reached at 2 MByte can be observed as well.

The results for the combined read/write access point to a weakness of the compilers. The obtainable cache bandwidth for accessing every vector element stays below the bandwidth for accessing every second and fourth element. This is due to

Fig. 6. Read-only bandwidth with different strides (GByte/s versus amount of memory used, 32 KByte to 8 MByte, for strides of 8 to 1024 Bytes)

Fig. 7. Read/write bandwidth with different strides (GByte/s versus amount of memory used, 32 KByte to 8 MByte, for strides of 8 to 1024 Bytes)

Fig. 8. Read-only bandwidth with different # of processes (accumulated GByte/s versus amount of memory used per process, for 1 to 4 active processes)

bad read/write interleaving and the resulting bank conflicts, as the compiler assumes an access stride greater than one.

C. Access on Separate Memory Segments

For multi-processor shared memory systems it is always of interest to what degree the running processes influence each other as they access the main memory. To study that behavior, the presented algorithm was adapted as a parallel MPI program. All running processes were instructed to use their own memory segments, and no communication took place to exchange vector data. The processes were, however, synchronized with barriers, so that they were all working on the same vector sizes. Since one active process is enough to fill the front side (or system) bus of a base module, this experiment shows how well the Altix system handles memory access overload situations. The results of the benchmark are shown in figures 8 and 9.

At first, one can extract from the plots that the obtainable cache bandwidth is independent of the number of processes running on the system. Secondly, the cache bandwidth for read-only accesses is slightly below that for combined

Fig. 9. Read/write bandwidth with different # of processes (accumulated GByte/s versus amount of memory used per process, for 1 to 4 active processes)

read/write accesses, which suggests that the read-only algorithm is not using the full L3 cache transfer capability. As introduced earlier, the L3 cache is capable of transmitting four double precision floating point numbers per clock cycle to the L2 cache. Those capabilities are not fully used by either version.

Another interesting discovery within the results is that the Itanium 2 CPU is not capable of using the full 6.4 GByte/s of front side bus bandwidth. The reason for this behavior can be too small memory request buffers within the CPU: the system bus cannot be saturated with memory requests by one CPU, and the performance is, to some degree, limited by the main memory latency as well. The read-only access reaches a transfer rate of 5.5 GByte/s for one active CPU, the combined read/write access 4.7 GByte/s. The lower performance for the combined accesses is caused by bus turnaround cycles on the system bus: since data travels in only one direction at a time, a few clock cycles are required to change the transmission direction.

Finally, the measurement data displays a somewhat surprising result: the accumulated bandwidth for three active processes drops below that of two active processes. This is due to load imbalances within the modules. While two processes can be placed by the scheduler so that each module contains one process (which then has the system bus all to itself), three processes require one module to use both CPUs. At that point those two CPUs have to share the system bus, thus limiting the available bandwidth per process to 3.2 GByte/s. Since the runtime is taken when all processes have finished their work, the third process, having the other module to itself, is held back by the other two. For three processes one then obtains a theoretical bandwidth of 3 x 3.2 GByte/s = 9.6 GByte/s versus 2 x 6.4 GByte/s = 12.8 GByte/s for two active processes.
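The placement argument above can be written out. The helper below is purely illustrative and assumes each module's 6.4 GByte/s bus is split evenly among its active processes, with the slowest process gating the run:

```python
FSB_GBYTE_S = 6.4    # one front side bus per base module, shared by its CPUs

def aggregate_bandwidth(procs_per_module):
    """Accumulated bandwidth for a given placement, e.g. [2, 1] means two
    processes on one module and one process on the other."""
    active = [p for p in procs_per_module if p > 0]
    per_process = min(FSB_GBYTE_S / p for p in active)  # slowest gates all
    return per_process * sum(active)

two = aggregate_bandwidth([1, 1])    # one process per module
three = aggregate_bandwidth([2, 1])  # one module must host two processes
print(round(two, 1))     # 12.8
print(round(three, 1))   # 9.6
```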


D. Access on Shared Memory Segments

Within shared memory environments one can use multiple threads to work on data within the same address space. In this case bandwidth is even more important, as the threads running on the different processors might access the same physical piece of main memory. When using OpenMP to share the work (in this case the vectors to compute), each thread works on a part of the overall data. The number of threads used in this measurement was varied from one to four; the results are plotted in figures 10 and 11.

Fig. 11. Read/write bandwidth with different # of threads (accumulated GByte/s versus total amount of memory used, 64 KByte to 32 MByte, for 1 to 4 threads)
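A minimal sketch of this work-sharing scheme, using Python threads in place of OpenMP (illustrative only; the chunking and the triad kernel mirror the description above):

```python
import threading

def triad_chunk(A, B, C, lo, hi):
    """Each thread computes A := A + B * C on its own slice of the shared
    vectors, mirroring an OpenMP static work distribution."""
    for i in range(lo, hi):
        A[i] += B[i] * C[i]

def parallel_triad(n, num_threads):
    A = [1.0] * n
    B = [2.0] * n
    C = [3.0] * n
    # split 0..n into num_threads contiguous chunks
    bounds = [n * t // num_threads for t in range(num_threads + 1)]
    threads = [
        threading.Thread(target=triad_chunk,
                         args=(A, B, C, bounds[t], bounds[t + 1]))
        for t in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return A

result = parallel_triad(1024, 4)
print(result[0])   # 1 + 2 * 3 = 7.0 in every element
```

Because every thread touches only its own slice, the per-thread working set shrinks as threads are added, which is exactly the "cache expansion" effect discussed next.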

Fig. 10. Read-only bandwidth with different # of threads (accumulated GByte/s versus total amount of memory used, 64 KByte to 32 MByte, for 1 to 4 threads)

The "cache expansion" effect of shared memory computing can be observed well in both figures. While the performance for one active thread drops as the data leaves the cache, two threads can postpone that drop as they divide the work between each other and thus virtually double the available cache size. This effect can be observed for more threads as well. Additionally, the cache bandwidth increases steadily when adding more threads.

Interestingly, the accumulated memory bandwidth exceeds 6.4 GByte/s when switching from two to three threads. This confirms the system design, which allows a transfer rate of 10.2 GByte/s between the memory and the SHUB. The data for this experiment is located in only one module; the other module has to access that data remotely. The SHUB can then feed data to the local processors and the remote processors by saturating its connection to the memory. Therefore, adding a fourth thread does not result in any further performance improvement.

E. Access with Varying Degree of Randomness

A final experiment was used to determine how the bandwidth develops as the randomness of the memory accesses increases. For that purpose a so-called gather code was produced: it accesses data indirectly over an index vector. The values of a vector A, which is accessed by using the index vector J, are summed up to generate a read-only access pattern. By arranging the elements in J sequentially and then interchanging a varying number of elements, one can generate different degrees of randomness when accessing A. The experiment was run in parallel, with each process working on its own memory segment; the results for 50 MByte of used memory space are displayed in figure 12.

Fig. 12. Bandwidth with different # of processes (accumulated GByte/s versus degree of randomness in percent, for 1 to 4 processes)
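The gather experiment can be sketched as follows (illustrative Python; the mapping from "degree of randomness" to the number of interchanged elements is our assumption):

```python
import random

def make_index_vector(n, randomness_percent, seed=0):
    """Start with a sequential index vector J and interchange a fraction of
    its elements to reach the requested degree of randomness."""
    rng = random.Random(seed)
    J = list(range(n))
    swaps = n * randomness_percent // 100 // 2   # two elements per swap
    for _ in range(swaps):
        a, b = rng.randrange(n), rng.randrange(n)
        J[a], J[b] = J[b], J[a]
    return J

def gather_sum(A, J):
    """Read-only gather: sum the values of A accessed indirectly via J."""
    total = 0.0
    for idx in J:
        total += A[idx]
    return total

A = [1.0] * 1000
J = make_index_vector(1000, randomness_percent=50)
print(gather_sum(A, J))   # J stays a permutation, so the sum is 1000.0
```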

The results are comparable to the ones obtained when accessing the memory with different strides: the higher the randomness, which corresponds to large distances between two consecutive accesses, the lower the obtained bandwidth. This behavior is not surprising, since the cache lines loaded from main memory are then only accessed once or twice before being evicted again, so that only one eighth or one fourth of the main memory bandwidth is actually used. The usage of two processes doubles the obtained bandwidth. Adding more processes only leads to a slight increase in performance, as the front side busses of the two modules are already almost saturated.

V. CONCLUSION

The research presented in this paper has shown that in some cases the obtainable results for applications correspond with the best cases offered by the hardware and software. In most cases, however, the measured performance stays behind the processor's capabilities. This was to be expected and is in no way a surprising result, since it is the case for most modern superscalar microprocessors. It could, furthermore, be shown that the SGI Altix 350 usually can live up to its capabilities when a few basic principles are followed:
• Use the system evenly; try to avoid the load imbalances which occur when allocating an odd number of threads or processes.
• Bind processes to their CPUs using dplace. This avoids them being moved away from their data.
• Initialize memory pages with the process that will be working on the data. This places the memory page (if possible) onto the module the process is running on and, therefore, minimizes remote memory accesses and network traffic.

The characteristics and limitations of a superscalar processor in general, and of the Intel Itanium 2 specifically, also suggest keeping the following in mind:
• Avoid access patterns that spill and reload cache lines.
• Avoid access patterns with a multiple of 256 Bytes between two accesses.
• Try not to use more than 2 MByte of cache, as latency rapidly increases beyond that point while bandwidth drops.

Overall, the scalability and sustainability of the obtained results encourage the usage of the Altix. The systems offer a unique amount of shared memory space and very good floating point performance at the same time. The main memory bandwidth is, however, the boundary that cannot be crossed by data intensive applications. The front side bus then becomes the bottleneck for feeding data to the two processors. Hence, more than two CPUs per module would show no performance gain, and SGI is well advised to reduce the number of Itanium 2 Montecito (dual core) CPUs per module in the next generation of Altix systems to one in order to be capable of supplying them with data.

Further research by the authors will evaluate whole applications with respect to their memory performance on the Altix. Additionally, code optimizations on those applications are planned for the large Altix system soon to be installed in Dresden.

ACKNOWLEDGMENT

All measurements presented in this paper have been done using BenchIT (www.benchit.org), a performance analysis environment developed at the Center for Information Services and High Performance Computing, Dresden.

REFERENCES

[1] Intel, Intel Itanium 2 Processor Reference Manual – For Software Development and Optimization, May 2004, Document No. 251110-003.
[2] SGI, The SGI Altix 350 Server, January 2005, Document No. J14796.
[3] D. Lenoski et al., "The Stanford Dash Multiprocessor," IEEE Computer, vol. 25, no. 3, pp. 63–79, March 1992.
