In the Proc. of Parallel Architectures and Languages Europe, Athens, Greece, July, 1994. Pages 823-826
Performance of Interconnection Network in Multithreaded Architectures*
S.S. Nemawarkar†, R. Govindarajan†, G.R. Gao‡ and V.K. Agarwal†
†Dept. of Electrical Engineering, ‡School of Computer Science, McGill University, Montreal, H3A 2A7, CANADA
{shashank,govindr,gao,
[email protected]
Abstract. In this paper, we analyze the performance of interconnection networks in a multithreaded multiprocessor using a closed queuing network model. The proposed integrated model of the multiprocessor system faithfully captures the interaction among subsystems. Our study reveals a strong relationship between workload parameters and network performance, and brings out a feedback effect of the network response on the message rate to the network.
1 Introduction

Interconnection networks (IN) play a vital role in the performance of any multiprocessor system. The average time taken by a message to traverse the network, called the network latency, can greatly affect performance. Previous studies of INs treat the network as an open system: the message arrival rate is an input to the model, and the system is studied for each message arrival rate. Since the overall message rate to the IN is not known a priori, these models cannot relate workload characteristics to IN performance. Moreover, near the saturation region, network latency is highly sensitive to the message rate.

In this paper, we advocate modeling a multiprocessor system as a closed system. This model faithfully captures the interaction among the subsystems under various workloads. In particular, both the message rate and the observed latency are treated as output variables of the integrated system. This integrated model is robust over the range of input workload parameters.

Evaluating the performance of an interconnection network in a multi-threaded multiprocessor system has attracted the attention of several researchers [1, 2, 4]. We focus this study on a shared-memory multi-threaded multiprocessor system similar to Alewife [3]. The architecture consists of memory physically distributed across multiple processing elements, which are connected through a 2-dimensional torus network. The performance of a multi-threaded architecture depends on a number of parameters related to the architecture and the workload.

The next section presents an overview of the multiprocessor system under study. In Section 3, we present the integrated analytical model for this system, using a closed queuing network. For detailed descriptions, please refer to [5]. Section 4 presents the results derived from this analytical model. In Section 5, we draw conclusions from this study.

* This work was supported by MICRONET, Network Centres of Excellence, Canada.
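The sensitivity of open-system latency near saturation, noted in the Introduction, can be illustrated with the textbook M/M/1 queue. This is only a minimal sketch: a single exponential server stands in for the whole network, and the function name and the numeric values are illustrative assumptions, not taken from the studies cited.

```python
def mm1_response_time(service_time, arrival_rate):
    """Mean response time of an open M/M/1 queue: S / (1 - rho)."""
    rho = arrival_rate * service_time  # server utilisation
    if rho >= 1.0:
        raise ValueError("open queue is unstable at utilisation >= 1")
    return service_time / (1.0 - rho)

# Hypothetical values: service time 10, arrival rates close to saturation (0.1).
latency_at_90 = mm1_response_time(10.0, 0.09)    # utilisation 0.90
latency_at_99 = mm1_response_time(10.0, 0.099)   # utilisation 0.99
```

Raising the arrival rate by a tenth (0.09 to 0.099) multiplies the mean response time by about ten, which is why an open model needs the message rate to be known very precisely near saturation.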
2 A Multi-threaded Architecture

We consider a multi-threaded architecture with processing elements (PEs) connected by a k-ary 2-cube network. Each PE holds a part of the shared memory. A multi-threaded processor masks the long latency of an operation by suspending the thread that issued it and switching to another thread. A context switch requires C time units to save the state of the suspended thread and load that of the newly scheduled thread.

Each processor executes a set of n_t parallel threads. Threads interact only through memory locations, and they cannot be scheduled for execution on any other processor. A thread can be in one of three states: suspended, ready and executing. An executing thread enters the suspended state after the processor issues a long-latency operation for it. When its long-latency operation is satisfied, the thread becomes ready. A ready thread that is scheduled for execution on the processor enters the executing state.

A thread executes for a duration called the runlength, R, before it issues a memory access. A fraction p_remote of these accesses is directed to remote memory modules, while the remaining fraction is serviced locally. The remote memory access pattern across memory modules has a geometric distribution.
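One way to picture a geometric remote access pattern on a k-ary 2-cube is sketched below. The exact form of the distribution is not given in this paper, so the sketch makes assumptions: a hypothetical locality parameter `alpha` weights a module at hop distance h by `alpha**h`, and the weights are normalised over all remote modules. The function names are illustrative.

```python
from itertools import product

def torus_distance(a, b, k):
    """Minimum hop count between nodes a and b on a k x k torus."""
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    # Wrap-around links let a message take the shorter way round each ring.
    return min(dx, k - dx) + min(dy, k - dy)

def remote_access_probs(src, k, alpha):
    """Probability of targeting each remote module from node src.

    alpha is a hypothetical locality parameter (0 < alpha <= 1); a module
    at distance h gets weight alpha**h, so nearer modules are favoured.
    """
    weights = {}
    for node in product(range(k), repeat=2):
        if node != src:  # the local module is handled by 1 - p_remote
            weights[node] = alpha ** torus_distance(src, node, k)
    total = sum(weights.values())
    return {node: w / total for node, w in weights.items()}

# Example on a 4 x 4 torus with an assumed alpha of 0.5.
probs = remote_access_probs((0, 0), k=4, alpha=0.5)
avg_hops = sum(p * torus_distance((0, 0), n, 4) for n, p in probs.items())
```

With `alpha = 1.0` the pattern degenerates to a uniform choice among the 15 remote modules; smaller values concentrate traffic on neighbouring PEs and reduce the average hop count.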
3 Analytical Model

For the multi-threaded architecture described in Section 2, we propose an analytical model using a closed queuing network, as shown in Figure 1. The queuing network model is composed of PEs with three types of nodes, namely processor, memory and switch.

Processor node: A processor node has a single server. Threads are executed according to an FCFS service discipline, with an exponentially distributed service time of mean R.

Memory module: A memory module has a single server, with an exponentially distributed access time of mean L time units.

Switch node: We consider a detailed switch node with two parts: an inbound switch and an outbound switch. A PE sends messages to its outbound switch and receives messages from its inbound switch. All messages on the network are forwarded through the inbound switch of the nearest PE towards the destination. Each part of a switch node has a single server with an exponentially distributed service time of mean S. Thus a message suffers a delay of S time units at each switch on its path.

The observed network latency for a one-way message is denoted by S_obs. The rate at which remote accesses are sent by a processor to the network is λ_net.

Solving the above queuing network exactly is computationally intensive. We therefore use Approximate Mean Value Analysis (AMVA) [6] to obtain the performance measures (S_obs and λ_net) in terms of architectural and workload parameters. Details of the validation against Petri net simulations are presented in [5].
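To give a feel for mean value analysis, the following is a minimal sketch of the basic exact recursion for a single-class closed network of FCFS single-server stations. It is a simplified stand-in, not the model solved in this paper, which has one thread class per processor and is solved approximately. The station list, the visit ratios and the function name `mva` are illustrative assumptions.

```python
def mva(visits, service, n_customers):
    """Exact single-class MVA for single-server queueing stations.

    visits[i]  : mean visits to station i per cycle (assumed values)
    service[i] : mean service time at station i
    Returns (throughput, per-visit residence times, mean queue lengths).
    """
    m = len(visits)
    queue = [0.0] * m
    resid = [0.0] * m
    throughput = 0.0
    for n in range(1, n_customers + 1):
        # An arriving customer sees the queue of the network with n - 1 customers.
        resid = [service[i] * (1.0 + queue[i]) for i in range(m)]
        cycle = sum(visits[i] * resid[i] for i in range(m))
        throughput = n / cycle
        # Little's law applied per station.
        queue = [throughput * visits[i] * resid[i] for i in range(m)]
    return throughput, resid, queue

# Hypothetical three-station cycle: processor (R = 10), memory (L = 10),
# and one lumped switch (S = 10) visited 0.8 times per cycle.
rate, resid, queue = mva([1.0, 1.0, 0.8], [10.0, 10.0, 10.0], n_customers=8)
```

Because the population is fixed, the throughput and the residence times both come out of the recursion as results, which mirrors how the closed model treats message rate and observed latency as outputs rather than inputs.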
4 Results

We present results for a 4 × 4 torus network, focusing on the workload parameters (n_t and p_remote). Other parameters are set to default values (R = 10, L = 10 and S = 10).

Figure 2 shows the effect of n_t and p_remote on λ_net and S_obs. λ_net increases linearly until p_remote reaches 0.4, beyond which it saturates (at 0.0288). Increasing n_t after λ_net has saturated only increases the waiting time of the messages. The network responds at the constant rate λ_net, since all switches are busy. This feedback leads to a steady state with λ_net as the message rate to and from the network. For saturated λ_net, S_obs increases linearly with n_t due to the increased waiting time of messages. S_obs rises sharply with p_remote only while λ_net is unsaturated. Even in the saturated region, S_obs is not overly sensitive to the workload parameters, hence its value can be determined with high accuracy.

The effect of R and L on network performance is shown in Figure 3. The performance trends due to the processor and memory subsystems are similar. With decreasing R (and L), λ_net increases. A larger number of messages contending for the network increases S_obs, hence the S_obs surface follows the λ_net surface. Below R = 20 (and L = 20), λ_net is close to its maximum value, and S_obs is nearly 5 times its unloaded value.
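The saturation of the message rate and the linear growth of latency with the number of threads are consistent with standard asymptotic bounds for closed queuing networks. The sketch below computes those bounds for a single-class network; the per-cycle service demands are hypothetical and are not the parameters behind Figures 2 and 3.

```python
def throughput_bound(demands, n):
    """Upper bound on throughput with n customers: min(n / D, 1 / D_max)."""
    return min(n / sum(demands), 1.0 / max(demands))

def cycle_time_bound(demands, n):
    """Lower bound on mean cycle time with n customers: max(D, n * D_max)."""
    return max(sum(demands), n * max(demands))

# Hypothetical per-cycle demands: processor, memory, bottleneck switch.
demands = [10.0, 10.0, 35.0]
for n in (1, 2, 4, 8, 16):
    print(n, throughput_bound(demands, n), cycle_time_bound(demands, n))
```

Once the bottleneck station is always busy, the throughput bound stays at 1 / D_max regardless of n, while the cycle time bound grows as n * D_max. In a closed system, adding threads past saturation therefore lengthens waiting times but cannot push the message rate any higher.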
5 Conclusion

In this paper, we have investigated the performance of interconnection networks in multithreaded multiprocessors as a closed system. We have related the network latency to workload parameters. The model brings out the feedback effect of the network response on the message rate to the network, which stabilizes the message rate at a steady-state value. Hence the model is very robust. Unlike open system models, it yields the network latency with high accuracy even near saturation.
References

1. V.S. Adve and M.K. Vernon. Performance analysis of mesh interconnection networks with deterministic routing. Tech. Rep. 1001b, Computer Sciences Dept., University of Wisconsin, Madison, July 1993.
2. A. Agarwal. Performance tradeoffs in multithreaded processors. IEEE Trans. on Parallel and Distributed Systems, 2(4), Sept. 1992.
3. A. Agarwal, B.H. Lim, D. Kranz, and J. Kubiatowicz. April: A processor architecture for multiprocessing. In Proc. of the 17th ISCA, pages 104-114, 1990.
4. K. Johnson. The impact of communication locality on large-scale multiprocessor performance. In Proc. of the 19th ISCA, pages 392-402. ACM, May 1992.
5. S.S. Nemawarkar, R. Govindarajan, G.R. Gao, and V.K. Agarwal. Interconnection performance for multithreaded architectures: A study based on a closed system model. ACAPS Technical Memo 84, SOCS, McGill University, Montreal, Nov. 1993.
6. M. Reiser and S. Lavenberg. Mean value analysis of closed multichain queueing networks. Journal of ACM, 27(2):313-322, April 1980.
Fig. 1. Queuing Network Model [diagram: processor node P (mean service time R), memory node M (mean service time L), inbound and outbound switch nodes Sw (mean service time S); branch probabilities p_remote and 1 - p_remote]
Fig. 2. Effect of workload parameters n_t and p_remote on (a) λ_net and (b) S_obs [surface plots: horizontal axes Number of Threads and Probability of Remote Access; vertical axes (a) Message Rate, (b) Network Latency]
Fig. 3. Effect of R and L on (a) λ_net and (b) S_obs [surface plots: horizontal axes Thread Runlength and Memory Latency; vertical axes (a) Message Rate, (b) Network Latency]