Dynamic Characteristics of Multithreaded Execution in the EM-X Multiprocessor

In Proceedings of the International Workshop on Computer Performance Measurement and Analysis (PERMEAN95), pp. 14-22, August 1995.

Hirofumi Sakane, Mitsuhisa Sato, Yuetsu Kodama, Hayato Yamana, Shuichi Sakai†, and Yoshinori Yamaguchi
Electrotechnical Laboratory, 1-1-4 Umezono, Tsukuba, Ibaraki 305 Japan
†Real World Computing Partnership, 1-6-1 Takezono, Tsukuba, Ibaraki 305 Japan
Email: [email protected]  Tel: +81-298-58-5931  Fax: +81-298-58-5882

Abstract

Multithreading is known to be effective for tolerating communication latency in distributed-memory multiprocessors. Two types of support for multithreading have been used to date: software and hardware. This paper presents the impact of multithreading on performance through empirical studies. In particular, we explicate the performance difference between software support and hardware support on the 80-processor EM-X distributed-memory multiprocessor, which we have designed and implemented. The EM-X provides three types of hardware support for fine-grain multithreading: direct remote memory access, fast thread invocation, and dedicated instructions for generating fixed-size communication packets. To demonstrate the effect of multithreading, we have performed various experiments using micro benchmark programs and MP3D, one of the SPLASH benchmarks. Three types of performance parameters have been measured: processor efficiency, remote memory latency, and network load. Experimental results indicate that the EM-X architecture is highly effective in supporting the multithreading principles of execution through dedicated hardware and software.

Keywords: Multithreading, latency hiding, fine-grain communication, direct remote memory access, shared memory benchmark, synthetic workload.

1 Introduction

Parallel computing is becoming increasingly important for meeting the computational requirements of large-scale real-world applications. Problems such as computational fluid dynamics, computational chemistry, and computational biology require tremendous amounts of computation if they are to be applied to real-world problems. Such computational demand can only be met by parallel computing on large-scale parallel machines. Among large-scale parallel machines are distributed-memory architectures, which are known to be scalable toward building massively parallel machines. The recent introduction of the IBM SP-2 and Cray T3D clearly indicates that massively parallel computing is a necessity for solving realistically-sized problems in tolerable time.

Numerous applications have been successfully solved on massively parallel distributed-memory machines.

The main factor degrading the performance of massively parallel distributed-memory machines is communication latency. Distributed-memory machines assume data to be distributed among processors in a mutually exclusive fashion. When the distribution of data does not match the distribution of workload, communication must be initiated: when a function is called on a processor where the corresponding data is not present, a mismatch is said to occur, which in turn requires communication to access the data allocated to another processor. This remote memory latency is often regarded as the main bottleneck to high performance. Various techniques have been developed to reduce, tolerate, or hide the latency, including data partitioning, runtime load balancing, multithreading, coherent caches, etc. [1, 2, 3, 4, 6, 7]. Of these techniques, multithreading is a latency tolerance approach. The idea behind multithreading is to overlap computation and communication such that the effect of communication is minimal, if not negligible. Studies have indicated that multithreading is effective for dataflow-based architectures suitable for fine-grain parallel computation [4, 7, 8, 11]. An analytical study of multithreading indicated that increasing the number of threads exposes a saturated region in which little improvement can be expected [1]. However, that result is based on trace-level simulation, ignoring physical conditions such as the actual network configuration, and it is not clear whether it applies to real machines. When evaluating parallel computation on an existing computer under real implementation conditions, it is often found that efficiency actually decreases instead of saturating, due to network contention [8]. It is precisely the purpose of this report to further investigate the effects of multithreading on performance.

We have designed and implemented the 80-processor EM-X distributed-memory multiprocessor [9, 10]. The main objective in designing and building the machine is to provide hardware support for multithreading toward general-purpose parallel computation. The hardware dataflow mechanism embedded in the EM-X provides fast synchronization and communication, which in turn directly supports the multithreading principles of execution for ubiquitous parallel computing. This paper attempts to explicate the behavior of multithreading using the EM-X. A synthetic micro benchmark is developed to accurately characterize the performance and behavior of multithreading.

Various performance parameters are defined to identify the relationships between them. The key parameters we consider in this study are the number of processors, the number of threads per processor, the thread granularity, and the amount of communication. Through experimental studies, we attempt to provide insight into the role each parameter plays in multithreading. We believe that identifying the roles and impact of these parameters on performance can help guide the design of future massively parallel distributed-memory machines.

The paper is organized as follows. Section 2 presents the EM-X architecture. Section 3 defines the simple synthetic workload used throughout this paper to examine the effect of multithreading. Section 4 presents experimental results based on two machine models and the synthetic workload, together with a discussion of the role of each parameter. Section 5 gives experimental results for a program from the SPLASH benchmark suite; the performance of the EM-X on this benchmark is explained in terms of the multithreading principle. The last section concludes our study.

[Figure 1: Block Diagram of EMC-Y. SU: Switching Unit; IBU: Input Buffer Unit; MU: Matching Unit; EXU: Execution Unit; OBU: Output Buffer Unit; MCU: Memory Control Unit; MM: Memory Module]

[Figure 2: Circular Omega Network (12 PEs). Each PE is addressed by a pair (ca, ga), where ca is the column address and ga the group address.]

2 The EM-X Architecture

The EM-X is a distributed-memory multiprocessor with a multithreaded execution mechanism based on dataflow tokens. Figure 1 shows an overview of the EM-X processing element (PE), called EMC-Y. The PEs are connected via a circular omega network; Figure 2 shows the configuration of a circular omega network of 12 PEs. The EM-X prototype consists of 80 PEs, each with its own local memory. The EMC-Y is a RISC-style processor suitable for fine-grain parallel processing. Its pipeline is designed to fuse register-based RISC execution with packet-based dataflow execution for synchronization and message-handling support. The EMC-Y consists of the Switching Unit (SU), Input Buffer Unit (IBU), Matching Unit (MU), Execution Unit (EXU), Output Buffer Unit (OBU), and Memory Control Unit (MCU).

The EXU includes a packet generation mechanism and a RISC pipeline which executes a sequential thread. The MU has mechanisms for synchronization and thread invocation. Packets are sent out through the OBU, which decouples the EXU from the network. The MCU controls access to the local memory off the EMC-Y chip.

Communication in the EM-X is done with 2-word fixed-size packets. The EMC-Y can generate and dispatch packets directly to and from the network and provides hardware support for queuing and scheduling packets. Packets coming in from other processors over the network are buffered in the packet queue. As a packet is read from the packet queue, the thread of computation specified by the address portion of the packet is invoked with one word of data. The thread runs to completion unless it encounters a remote memory read or another thread invocation; if the current thread must be suspended, any live state in the thread is saved in an activation frame associated with the thread. The completion of a thread causes the next packet to be dequeued automatically from the packet queue in the order of arrival, so thread scheduling is explicitly First-In-First-Out (FIFO). Network packets can be interpreted as dataflow tokens. A dataflow token may optionally match with another token upon arrival; unless both tokens have arrived, the token stores itself in memory, which in turn causes the next packet to be processed, i.e., a thread switch.

The EM-X circular omega network is designed to provide low latency and high throughput. The SU has three types of components: two input ports, two output ports, and a three-by-three crossbar switch. Each port can transfer a packet, consisting of one address word and one data word, every two cycles. With virtual cut-through routing, a packet reaches a PE n hops away in n + 1 cycles. Details of the implementation are discussed in [5]. The IBU has two levels of priority packet buffers for flexible thread scheduling; each buffer is an on-chip FIFO which can hold up to 8 packets. The IBU operates independently of the EXU and MU.
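The packet-driven invocation and FIFO scheduling described above can be sketched in plain C. This is a schematic model rather than the EMC-Y hardware: the address word of a packet is reduced to a C function pointer, and all names (packet_t, dispatch_loop, and so on) are invented for illustration.

```c
#include <stdio.h>

/* A fixed-size two-word packet: in the EM-X the address word selects
 * the thread to invoke and the data word carries one operand; here the
 * address word is simplified to a function pointer. */
typedef struct {
    void (*thread)(long data);   /* thread entry (address word) */
    long data;                   /* one-word operand (data word) */
} packet_t;

#define QSIZE 8                  /* the on-chip IBU FIFO holds up to 8 packets */
static packet_t queue[QSIZE];
static int head, tail, count;

static int enqueue(packet_t p) {
    if (count == QSIZE) return -1;           /* buffer full */
    queue[tail] = p;
    tail = (tail + 1) % QSIZE;
    count++;
    return 0;
}

/* FIFO scheduling: completing one thread dequeues the next packet. */
static void dispatch_loop(void) {
    while (count > 0) {
        packet_t p = queue[head];
        head = (head + 1) % QSIZE;
        count--;
        p.thread(p.data);                    /* thread runs to completion */
    }
}

static void demo_thread(long d) { printf("thread invoked with %ld\n", d); }

int main(void) {
    enqueue((packet_t){ demo_thread, 1 });
    enqueue((packet_t){ demo_thread, 2 });
    dispatch_loop();                         /* runs in arrival (FIFO) order */
    return 0;
}
```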

Packets coming in from the network can be processed immediately, without incurring any overhead, even if the EXU or MU is busy. The effect of the priority buffers is described in [10]. In the EMC-Y processor, communication never interrupts instruction execution.

For remote memory access, the packet address is used as a global address rather than as a thread specifier. A global address consists of a processor number and a local address within that processor. When a thread references a remote memory location, it generates a remote memory read packet and terminates itself. The remote read packet contains a continuation for returning the value, as well as the global address. The packet is handled on the destination PE, and a result packet containing the value of the global memory location is sent back; the result packet resumes the caller thread. A remote memory write generates a remote memory write packet containing a global address and the value to be written; the writing thread does not terminate when the remote write message is sent.

The EM-X provides two types of support for remote memory access [9, 10]. One is SYSRD, which is handled by the IBU hardware in parallel with instruction execution. The other is USRRD, which is executed in the EXU in the same way as other packets and therefore needs no special hardware: the remote memory access packet invokes a fine-grain thread which fetches the memory word and sends it to the reader's thread. In this case, the timing is influenced by other activities on the destination PE.
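The split-phase read just described can be made concrete with a small C sketch: the reading thread is split at the remote reference into two functions, its live state is parked in an activation frame, and the arrival of the result value invokes the continuation. This is a single-address-space analogy under invented names (frame_t, reader_begin, reader_resume), not EM-C code; on the EM-X the call to serve_read would be a read packet crossing the network.

```c
#include <stdio.h>

/* Activation frame: the live state that must survive the suspension
 * at the remote read (hypothetical layout). */
typedef struct {
    long partial_sum;
} frame_t;

/* Continuation: invoked when the Result Packet arrives. */
static void reader_resume(frame_t *f, long remote_value) {
    printf("sum = %ld\n", f->partial_sum + remote_value);
}

/* Remote side: in the software (USRRD-style) case this is itself a
 * fine-grain thread that loads the word and sends it back. */
static long remote_memory[16] = { [3] = 42 };
static void serve_read(int addr, void (*cont)(frame_t *, long), frame_t *f) {
    cont(f, remote_memory[addr]);   /* stands in for the result packet */
}

/* First half of the reading thread: runs up to the remote reference,
 * emits the read packet (here a direct call), and terminates. */
static void reader_begin(frame_t *f) {
    f->partial_sum = 100;            /* save live state in the frame */
    serve_read(3, reader_resume, f); /* split-phase: no spinning here */
}

int main(void) {
    frame_t f;
    reader_begin(&f);                /* prints: sum = 142 */
    return 0;
}
```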

[Figure 3: Synthetic Workload for Multithreaded Execution. A task process on each PE holds M threads; the active thread m runs for LEN clocks, sends a Read Packet to a remote PE, suspends, and is resumed by the returning Result Packet while the other threads execute.]

3 Synthetic Workload for Multithreaded Execution

3.1 Parallel Workload

A program in a multithreaded execution model is a collection of partially-ordered threads. A thread is a function call which consists of primitive instructions. Instructions may be executed either in sequence or in parallel, depending on the function's behavior. An instruction may issue remote memory reads or writes, or invoke other threads; the responses to these accesses trigger the dependent threads. The EM-X programming environment, called EM-C, supports remote memory access operations both in software and in hardware, as explained in the previous section. Software support for remote memory operations refers to the invocation of a fine-grain thread consisting of a memory load instruction and a packet generation instruction. A user-initiated remote read is, therefore, performed through execution of a fine-grain thread which simply fetches the contents of the memory location specified in the address portion of the read packet.

The EM-X programming environment assumes the Single Program Multiple Data (SPMD) parallel programming paradigm: all threads are copied to all processors. We assume in the following discussion that data references by each thread are evenly distributed among processors (in real programs, the number of remote reads and writes varies among threads and is often unknown for irregular programs). Each processor may have one to many active threads at runtime, again determined only by the program characteristics.

A remote memory read in a conventional single-threaded execution model typically results in the issuing processor spinning, i.e., the processor goes into a waiting state until the remote read returns data. A remote memory read in a multithreaded execution model, on the other hand, leads to immediate suspension of the current thread and, in turn, to a thread switch if another thread is available; the newly initiated thread proceeds to execution. This overlapping of remote reads with thread computation

is the key to tolerating the latency caused by remote memory read operations. The fundamental assumption of multithreading is that more than one thread is ready to run when the current thread suspends.

3.2 Workload parameters

Workload for multithreaded execution is typically characterized by the following parameters:

1. The number of threads (multithreading level): the number of threads per processor is the most important parameter for identifying the effect of multithreaded execution.

2. The run-length between context switches: the number of clocks between thread switching and thread resumption due to remote references.

3. The number of remote reads per thread: this determines the amount of communication.

4. Data reference locality.

5. Synchronization.

The synthetic benchmark program we developed contains a fixed number of threads on each processor, and each thread issues random remote references with a fixed run-length; Figure 3 depicts the behavior of the benchmark program. The multithreading parameters listed above can be easily adjusted to reflect different runtime behavior. The benchmark program is designed so that no synchronization between threads is necessary, since the main purpose of this investigation is to explicate the runtime behavior of multithreading.

[Figure 4: Processor for Complete Connection. Each PE's IBU holds one input queue per peer PE (PE 0 through PE N), feeding the MU, EXU, MCU, and MM.]

[Figure 5: Completely Connected Network (12 PEs)]

Barrier synchronization is performed only once, at the end of the computation. The computation model assumes a "switch always" processor, where a context switch occurs at every remote reference. Some multithreaded execution models assume a "switch on load" processor, where a context switch occurs on a cache miss. The latter model, which uses a coherent cache, exhibits run-lengths that vary at runtime with the locality of data references in the workload. This runtime characteristic is difficult to model; we therefore adopted a random remote memory reference pattern and a fixed run-length interval. We further assumed one or two remote reads per thread.
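For concreteness, these knobs can be collected into a small C structure. This is a minimal sketch with hypothetical field names, initialized with values actually swept in the experiments of Section 4 (run-lengths of 17, 38, or 234 clocks; one or two reads per thread; 32 or 64 of the 80 PEs).

```c
#include <stdio.h>
#include <stdlib.h>

/* Knobs of the synthetic workload (Section 3.2); names are illustrative. */
typedef struct {
    int threads_per_pe;   /* multithreading level: 1, 2, 4, 8, or 16 */
    int run_length;       /* clocks between remote references:       */
                          /*   fine = 17, medium = 38, coarse = 234  */
    int reads_per_thread; /* remote reads per thread: 1 or 2         */
    int num_pes;          /* processors used: 32 or 64 of the 80     */
} workload_t;

/* "Switch always" model: every remote reference targets a uniformly
 * random other PE and forces a context switch. */
static int random_target(const workload_t *w, int self) {
    int t;
    do { t = rand() % w->num_pes; } while (t == self);
    return t;
}

int main(void) {
    workload_t w = { 4, 17, 1, 64 };   /* fine grain, 4 threads/PE */
    for (int i = 0; i < 4; i++)
        printf("PE 0 reads from PE %d\n", random_target(&w, 0));
    return 0;
}
```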

3.3 Performance Measure

To measure the performance of multithreading, we have identified several key statistics: processor efficiency, remote memory latency, and network contention. Processor efficiency is defined as the number of clocks spent performing useful work divided by the total number of clocks. Recall that a remote read operation is two-way traffic: a request sent out by the initiating processor, and the data returned from the destination processor. The latency of a remote memory operation is averaged over the two legs.
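Written out as formulas, with symbols introduced here only for this restatement:

```latex
E \;=\; \frac{\text{clocks spent on useful work}}{\text{total clocks}},
\qquad
L \;=\; \frac{L_{\mathrm{request}} + L_{\mathrm{reply}}}{2}
```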

4 Experiments using Synthetic Workload

4.1 Environment

We developed an EM-X register-transfer-level (RTL) simulator to design and test our architecture. The EM-X RTL simulator is written in C. It simulates all architecture-level state transitions in the EMC-Y processors, so it reproduces exactly the same behavior as the real hardware at every clock cycle.

In addition, the simulator can be easily modified to explore design parameters such as the network buffer size. In the experiments described below, the EM-X simulator is used to build the machine models. All PEs and the interconnection network are synchronized to a single clock in the machine models.
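The lock-step structure of such a simulator can be sketched as follows; this is a schematic skeleton, not the actual simulator code, and all names are invented. Every unit of every PE advances once per iteration of a single global clock loop.

```c
#include <stdio.h>

#define NUM_PE 64

typedef struct { long busy; } pe_state;            /* per-PE state (elided) */

static void step_pe(pe_state *pe)  { pe->busy++; } /* advance SU/IBU/MU/EXU/MCU one clock */
static void step_net(pe_state *pe) { (void)pe; }   /* move in-flight packets one step     */

int main(void) {
    static pe_state pe[NUM_PE];
    for (long clk = 0; clk < 1000; clk++) {        /* single global clock */
        for (int i = 0; i < NUM_PE; i++) step_pe(&pe[i]);
        step_net(pe);                              /* network advances on the same clock */
    }
    printf("simulated %d PEs for 1000 clocks\n", NUM_PE);
    return 0;
}
```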

4.2 Effect of network conflicts

Network architectures are often central to the performance of distributed-memory machines, and this study attempts to understand their impact on multithreaded machines. To determine the impact of the network architecture on performance, we consider two configurations:

- a circular omega network with fixed-size input buffers (the EM-X);
- a completely connected network with infinite-size input buffers (Figures 4 and 5).

The first configuration is essentially the same as the one used in the EM-X multiprocessor and requires no further explanation (see also Figures 1 and 2). The second represents an ideal network architecture, in which processors communicate directly with any other processor with no network delay. Since all processors are fully connected, packets sent out by a processor are immediately inserted into the network queue of the receiving processor. Packets inserted into the queues are picked up by the receiving processor and in turn invoke the corresponding threads in the order in which they were queued. Each processor has N queues attached to its Input Buffer Unit, where N is the total number of processors, and each queue is directly connected to another processor, as shown in Figure 4.

These N queues allow N packets to be received simultaneously. When more than one queue contains packets, the queue to service is selected on a Least Recently Used (LRU) basis. This ideal network architecture further assumes the network queues and the IBU to be of infinite size. The first model, the EM-X, has various physical limitations; the second, the ideal network architecture, has none. The reason for using the two types of network architectures is to isolate the effects of multithreading: by separating the impact of multithreading from the impact of network contention, we can precisely identify the factors which contribute to the performance.

To compare the performance of the two models, we implemented and executed the synthetic benchmark program on both. Figures 6 and 7 plot some of the experimental results on 64 processors of the 80-processor system for one remote memory read per thread. Three types of threads have been defined in terms of thread granularity: fine-grain, medium-grain, and coarse-grain. A fine-grain thread takes approximately 17 clocks to complete, whereas a coarse-grain thread takes 234 clocks. The x-axis indicates the number of threads per processor, while the y-axis indicates processor utilization.

[Figure 6: Efficiency (Complete Connection, SYSRD, 1 read, 64 PE). Efficiency (%) vs. threads/PE for coarse (234 clocks), medium (38 clocks), and fine (17 clocks) threads.]

[Figure 7: Efficiency (EM-X, SYSRD, 1 read, 64 PE)]

The figures demonstrate that performance improves drastically when the number of threads is increased from one to two, regardless of the network architecture. This improvement clearly shows the advantage of using multiple threads which interleave to mask the latency. Increasing the number of threads to four still improves performance, although only on the EM-X architecture. The experimental results showed that approximately four threads per processor are sufficient to achieve the best possible performance for the given benchmark problem.

Figure 6 demonstrates that the completely connected network model quickly reaches 100% utilization, as there is no network contention. To be more precise, consider a fine-grain thread which takes approximately 17 clocks to complete, and recall that the overhead for thread switching is one clock cycle. A fine-grain thread with one remote memory read therefore gives a processor utilization of 17/(17 + 1) = 94.4%. This high utilization is the best possible performance that multithreading can achieve in an ideal environment.

Figure 7, on the other hand, shows performance quite different from Figure 6, due to network contention. As the number of threads increases to four and eight, performance decreases instead of saturating; the impact of network contention is especially severe for fine-grain threading. To further understand this adverse effect, we have plotted the remote read latency for the EM-X. Figure 8 shows the relationship between remote read latency and the number of threads. Recall that a remote read is a split-phase round trip, request and reply, and that the latency is averaged over the round trip. It is clear from the plots that the latency for fine-grain threading increases substantially as the number of threads increases. This large increase in latency is the main cause of the performance degradation shown in Figure 7.

The latency can be viewed from two different perspectives: network conflicts and packet generation rate. Figure 9 plots the number of conflicts in the network for each packet; the figure reflects essentially the same behavior as the latency plot. Figure 10 shows the rate of packet generation for each processor. The plot essentially reveals the dynamic network throughput: the network possesses a critical point which identifies its communication capacity.
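The shape of these curves can be related to the standard analytical model of multithreaded efficiency (cf. [1]); the following is a hedged restatement, with R the run-length, C the context-switch overhead (one clock here), L the round-trip latency, and N the number of threads per PE:

```latex
E(N) \approx
\begin{cases}
\dfrac{N\,R}{R + C + L}, & N < N^{*} = 1 + \dfrac{L}{R + C} \quad \text{(linear regime)}\\[2ex]
\dfrac{R}{R + C}, & N \ge N^{*} \quad \text{(saturation)}
\end{cases}
```

For the fine-grain thread this reproduces the 17/(17 + 1) = 94.4% ceiling quoted above. On the EM-X itself, however, L is not a constant: it grows with N once the packet rate passes the network's critical point, which is why Figure 7 bends downward instead of saturating.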

[Figure 8: Runtime Round Trip Latency (EM-X, SYSRD, 1 read, 64 PE). Latency (clocks) vs. threads/PE.]

[Figure 9: Network Conflicts (EM-X, SYSRD, 1 read, 64 PE). Conflicts/packet vs. threads/PE.]

[Figure 10: Packet Rate per PE (EM-X, SYSRD, 1 read, 64 PE). Packets/clock/PE vs. threads/PE.]

For fine-grain threading, the packet generation rate reaches the critical point when the number of threads becomes four; the generation rate then gradually decreases as the number of threads increases further. This critical point explains the dramatic increase in network conflicts, since the conflict rate is closely related to the packet generation rate. Results for larger grain sizes support this observation: with less communication, the network has less influence on performance.

The discussion given above is based on results obtained using 64 processors. To confirm our findings, we performed further experiments with varying numbers of processors and remote reads. Figure 11 plots processor efficiency for 32 processors, with the number of remote reads per thread fixed at one. As expected, the amount of communication becomes smaller, which in turn results in less performance degradation. Figure 12 plots experimental results for threads with two remote reads each, with the number of processors fixed at 64. It is not surprising that the performance for fine-grain threading is even worse than in Figure 7; again, the main cause of the degraded performance is the increase in communication. We have seen above how performance varies with the number of threads and the number of remote reads per thread. In what follows, we identify the impact of the two types of remote reads on performance.

[Figure 11: Efficiency (EM-X, SYSRD, 1 read, 32 PE)]

[Figure 12: Efficiency (EM-X, SYSRD, 2 reads, 64 PE)]

4.3 SYSRD vs. USRRD

The EM-X supports two types of remote memory read mechanisms, called USRRD and SYSRD. The main difference between the two is how remote read operations are serviced at the target processor. The user-defined remote read, USRRD, is designed literally for user-initiated remote read operations. When a USRRD packet arrives at the destination processor, it is inserted into a queue to wait its turn to be picked up by the EXU. On its turn, the packet invokes a simple thread consisting of a memory read instruction and a packet operation; this thread reads the memory location pointed to by the address portion of the packet. Since this read is processed in software like any other thread-invocation packet, its latency is governed by other parameters such as the number of packets in the queue, thread scheduling, the number of active threads, the run-length, etc.

The system-defined remote read, SYSRD, on the other hand, does not go through the steps a USRRD packet does. Hardware support allows the SYSRD packet to bypass all of the above steps without thread invocation. When a SYSRD packet arrives at the destination processor, it is handled directly by the IBU: the memory read is processed instantly, and the requested data is immediately sent back to the originating processor. No queuing is necessary for the packet, and hence no thread scheduling. While USRRD packets are serviced through the software thread-invocation procedure, the SYSRD packet completely bypasses the units, such as the MU and EXU, which take part in the normal course of thread invocation. The latency of a SYSRD packet is therefore typically constant; the only factor which may affect it is that SYSRD packets waiting to be serviced are processed one by one at the destination processor.

To identify the performance difference between the two types of packets, we performed experiments using USRRD. Figure 13 shows the performance using USRRD; recall Figure 7, where SYSRD is used.


Comparing Figure 13 with Figure 7, we find a substantial difference in performance: all three types of threads perform better when SYSRD is used. This clearly indicates that the bypassing remote read service mechanism is effective. We further compare the two servicing mechanisms. Figure 14 shows the average round-trip latency when USRRD is used; the latency with USRRD is over 10 times higher than with SYSRD. Figure 15 further compares the two in terms of network contention: the contention with USRRD shown in Figure 15 is approximately 1.5 times higher than that with SYSRD in Figure 9. Two factors typically affect the round-trip latency: the network conflicts observed in the case of SYSRD, and the thread-invocation conditions at the destination processor.
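The contrast between the two service paths can be summarized with a toy C latency model. This is an illustration of the mechanism only: the constants (WIRE, the 5-clock service cost) and function names are invented, and the real latencies are the measured ones in Figures 8 and 14. The point it encodes is that SYSRD cost is essentially fixed, while USRRD cost grows with the queue depth and the run-length of whatever is executing on the destination EXU.

```c
#include <stdio.h>

enum { WIRE = 10 };              /* one-way network cost in clocks (made up) */

/* SYSRD: serviced by the IBU in parallel with execution; the reply
 * cost is essentially constant. */
static int sysrd_latency(void) {
    return WIRE + 1 + WIRE;      /* request + IBU service + reply */
}

/* USRRD: the read packet waits in the queue, then runs as a fine-grain
 * thread on the EXU; latency depends on queue depth and on the
 * run-length of the threads ahead of it. */
static int usrrd_latency(int queued_pkts, int run_length) {
    int service = 5;             /* load word + send packet (illustrative) */
    return WIRE + queued_pkts * run_length + service + WIRE;
}

int main(void) {
    printf("SYSRD                : %d clocks\n", sysrd_latency());
    printf("USRRD, idle PE       : %d clocks\n", usrrd_latency(0, 17));
    printf("USRRD, 8 pkts queued : %d clocks\n", usrrd_latency(8, 17));
    return 0;
}
```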

[Figure 13: Efficiency (EM-X, USRRD, 1 read, 64 PE)]

[Figure 14: Runtime Round Trip Latency (EM-X, USRRD, 1 read, 64 PE). Latency (clocks) vs. threads/PE.]

[Figure 15: Network Conflicts (EM-X, USRRD, 1 read, 64 PE). Conflicts/packet vs. threads/PE.]


5 Real Applications and Performance Prediction

The previous sections examined the impact of various parameters on performance using the synthetic micro benchmark. To corroborate those results, this section presents experimental results based on one of the most widely used benchmark programs: we have taken the shared-memory program MP3D from the SPLASH benchmark suite and executed it on the EM-X multiprocessor.

5.1 MP3D on EM-X

MP3D is a 3-dimensional particle simulator. The overall computation of MP3D consists of evaluating the positions and velocities of particles over a sequence of time steps: the particles are picked up and moved according to their velocity vectors, and if two particles in the same cell come close to each other, they may collide under a probabilistic model. The primary data objects used in MP3D are the particles and the space cells. The EM-X implementation divides the particles equally among processors, while the space cell arrays are allocated cyclically across processors. Particles in the main loop are referenced locally within a processor, so no remote memory access takes place in the main loop; when particles collide, however, the position of the colliding particle must be referenced, resulting in remote memory accesses. Workload distribution for MP3D is centered around the main loop, where each time step is distributed to all processors, and barrier synchronization is performed at the end of each time step. MP3D is known to spend most of its execution time in the particle move phase; by parallelizing this step, each processor can effectively tolerate and reduce remote memory latency. All remote read operations are implemented using SYSRD packets in order to identify the maximum possible performance. Our implementation of MP3D is fully parameterized, so that the number of threads and the number of processors can be easily adjusted for various experiments.
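The structure of the move phase can be sketched in SPMD-style C. This is a hedged caricature of the access pattern described above, not the MP3D source: the particle layout is reduced to one dimension, the collision test is a placeholder probability, and remote_read_pos stands in for what the EM-X implementation performs as a split-phase SYSRD read that suspends the issuing thread.

```c
#include <stdio.h>
#include <stdlib.h>

typedef struct { double x, v; } particle;   /* 1-D for brevity */

#define NUM_PES 64
static int cell_owner(int cell) { return cell % NUM_PES; }   /* cyclic cells */

/* Stand-in for a SYSRD remote read of a colliding particle's position. */
static double remote_read_pos(int pe, int idx) { (void)pe; (void)idx; return 0.0; }

static void move_phase(particle *local, int n, double dt, int my_pe) {
    int remote_reads = 0;
    for (int i = 0; i < n; i++) {
        local[i].x += local[i].v * dt;       /* local references only */
        int cell = (int)local[i].x;          /* hypothetical cell mapping */
        if (rand() % 100 < 5 && cell_owner(cell) != my_pe) {
            /* collision: the partner lives on another PE; on the EM-X
             * this read suspends the thread and another thread runs   */
            remote_read_pos(cell_owner(cell), i);
            remote_reads++;
        }
    }
    printf("PE %d: %d remote reads this step\n", my_pe, remote_reads);
    /* barrier synchronization at the end of each time step goes here */
}

int main(void) {
    particle p[1000];
    for (int i = 0; i < 1000; i++) p[i] = (particle){ i * 1.0, 1.0 };
    move_phase(p, 1000, 0.1, 0);
    return 0;
}
```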

5.2 Results

Figure 16 shows the experimental results for MP3D. For small numbers of processors, the efficiency saturates at four threads. As the number of processors increases to 64, the efficiency also increases, reaching 73% at four threads and then leveling off after the peak. This behavior is similar to that observed with the synthetic workload. The experimental results clearly indicate that four threads are sufficient to improve the performance of MP3D.

[Figure 16: Efficiency (EM-X, MP3D). Efficiency (%) vs. number of threads for 2, 4, 8, 16, 32, and 64 processors.]

Recall that the synthetic program assumed threads of the same run-length.

The MP3D implementation uses threads of different run-lengths. Actual measurements indicate that fine-grain threads with run-lengths of less than 20 clocks accounted for 37% of the total execution time, while medium-grain threads with run-lengths of 70 to 80 clocks accounted for approximately 20%; the average run-length is about 27 clocks. The behavior of multithreading in MP3D is thus very similar to, or essentially the same as, that observed with the synthetic micro benchmark.

6 Conclusion

Tolerating remote memory latency is the key to the widespread use of parallel computing on distributed-memory multiprocessors. This paper has explicated the impact of multithreading on the performance of distributed-memory multiprocessors. We have attempted to identify the behavior of the key parameters and of the hardware/software support which govern the performance of multithreading. In particular, three performance parameters have been examined: remote memory latency, network contention, and processor efficiency.

Each has been compared using two machine models: a real machine and an ideal machine. The EM-X multiprocessor represents the real machine; the ideal machine assumes a fully connected network with infinite network buffers. We used two types of programs as workload: a synthetic micro kernel benchmark which we developed, and MP3D, one of the SPLASH benchmark programs. Both were implemented and executed on the two machine models.

Experimental results demonstrate that multithreading with hardware support is highly efficient at tolerating remote memory latency and at the same time can substantially increase the performance of distributed-memory multiprocessors. Specifically, we found that hardware support for remote memory operations is highly effective in reducing remote memory latency: the latency with hardware support is an order of magnitude smaller than with software support alone. This difference strongly suggests that a small amount of hardware support for multithreading can lead to high performance.

We have also found that two to four threads per processor are sufficient to tolerate the latency caused by the given benchmark programs. A large number of threads did not perform as well as has been commonly perceived; in fact, increasing the number of threads beyond four adversely affected performance, sometimes yielding worse performance than no threading at all. The main reason for this negative impact is network contention: more threads simply increase network traffic by generating more packets for remote memory operations.

The execution results for MP3D also support our premise that multithreading with hardware support is effective for latency tolerance. The efficiency of multithreaded execution reached 73% on 64 processors of the EM-X, which is considerably high for multithreaded shared-memory applications. The results also confirmed that the performance obtained with our micro benchmark is proportional to that obtained with MP3D. We believe that our synthetic workload model can be used to accurately predict the performance of multithreading. We are currently working on different types of problems to further identify the impact of multithreading on the performance of distributed-memory multiprocessors.

Acknowledgment

We wish to thank Dr. Kimihiro Ohta, Director of the Computer Science Division, and the staff of the Computer Architecture Section for fruitful discussions. We would also like to thank Dr. Andrew Sohn for improving an early version of this paper.

References

[1] Saavedra-Barrera, R.H., Culler, D.E., and von Eicken, T. Analysis of Multithreaded Architectures for Parallel Computing. Proc. 2nd Annual ACM Symp. on Parallel Algorithms and Architectures, (1990), pp. 169-178.

[2] Boothe, B. and Ranade, A. Improved Multithreading Techniques for Hiding Communication Latency in Multiprocessors. Proc. 19th Annual Int. Symp. on Computer Architecture, (1992), pp. 214-223.

[3] Weber, W. and Gupta, A. Exploring the Benefits of Multiple Hardware Contexts in a Multiprocessor Architecture: Preliminary Results. Proc. 16th Annual Int. Symp. on Computer Architecture, (1989), pp. 273-280.

[4] Nikhil, R.S., Papadopoulos, G.M., and Arvind. *T: A Multithreaded Massively Parallel Architecture. Proc. 19th Annual Int. Symp. on Computer Architecture, (1992), pp. 156-167.

[5] Sakai, S., Kodama, Y., and Yamaguchi, Y. Design and implementation of a circular omega network in the EM-4. Parallel Computing, Vol. 19, No. 2, (1993), pp. 125-142.

[6] Tullsen, D.M. and Eggers, S.J. Limitations of Cache Prefetching on a Bus-Based Multiprocessor. Proc. 20th Annual Int. Symp. on Computer Architecture, (1993), pp. 278-288.

[7] Sato, M., Kodama, Y., Sakai, S., Yamaguchi, Y., and Koumura, Y. Thread-based Programming for the EM-4 Hybrid Dataflow Machine. Proc. 19th Annual Int. Symp. on Computer Architecture, (1992), pp. 146-155.

[8] Sato, M., Kodama, Y., Sakai, S., and Yamaguchi, Y. Experience with Executing Shared Memory Programs using Fine-Grain Communication and Multithreading in EM-4. Proc. 8th Int. Parallel Processing Symp., (1994), pp. 630-636.

[9] Kodama, Y., Sakane, H., Sato, M., Sakai, S., and Yamaguchi, Y. Message-based Efficient Remote Memory Access on a Highly Parallel Computer EM-X. Proc. Int. Symp. on Parallel Architectures, Algorithms and Networks, (1994), pp. 135-142.

[10] Kodama, Y., Sakane, H., Sato, M., Yamana, H., Sakai, S., and Yamaguchi, Y. The EM-X Parallel Computer: Architecture and Basic Performance. Proc. 22nd Annual Int. Symp. on Computer Architecture, (1995), pp. 14-23.

[11] Sohn, A., Kim, C., and Sato, M. Multithreading with the EM-4 Distributed-Memory Multiprocessor. Proc. Parallel Architectures and Compilation Techniques, (1995), pp. 27-36.
