I/O Performance of X-Y Routing in 2-D Meshes under Various Disk Load Balancing Schemes

S.R. Subramanya, Dept. of Computer Science, University of Missouri-Rolla, Rolla, MO 65409, [email protected]
Rahul Simha, Dept. of Computer Science, College of William and Mary, Williamsburg, VA 23185, [email protected]
Bhagirath Narahari, Dept. of EE and CS, George Washington University, Washington, DC 20052, narahari@seas.gwu.edu

Abstract

The 2-D mesh has proved to be one of the effective and efficient topologies for high-performance massively parallel computing (MPC) systems, and wormhole routing is the switching scheme used in most MPCs. For I/O-bound applications on such systems, minimizing the I/O transfer completion time facilitates higher performance. An earlier paper [8] studied a particular I/O model and showed by simulation that balancing the load on the disks reduces the I/O completion time. This paper studies the I/O performance of a 2-D mesh-connected system using the same I/O model, under various schemes for balancing the load on the disks. The results of simulation are presented.

1 Introduction

Massively parallel computers (MPCs) with 2-D mesh topologies (exemplified by the Intel Touchstone Delta, the Paragon, and a few others) and wormhole routing [2, 7] for message switching are used in applications demanding performance, reliability, and availability, such as large-scale scientific computations, large databases, and media-on-demand servers. During I/O, large amounts of data are transferred compared to processor-to-processor communication. As a specific example, consider the use of an MPC as a media-on-demand (MOD) server holding repositories of multimedia data, such as video and audio, stored on disks. Requests for data arrive from clients at the server, which should satisfy the requests of as many clients as possible in a timely fashion. The nodes of the MPC are responsible for obtaining the required data from the disks and transferring it to the clients. MOD servers need to sustain a maximum number of streams, deliver minimum response time, meet quality-of-service requirements, and provide reliability and availability, among other requirements.


Such applications are I/O bound, and I/O performance is crucial. Traffic patterns and routing in the context of disk I/O in MPCs have not received the same level of attention as interprocessor communication. Moreover, most of the literature addressing disk I/O assumes uniform (or sometimes unit) traffic from nodes to disks; non-uniform disk I/O in meshes has not been extensively studied. Across applications, and also within a particular application, the I/O demands placed on the disks change over time. It is beneficial to have a uniform load on the disks in order to minimize the overall I/O transfer times. In this paper, we study the performance of I/O in a 2-D mesh-connected system using wormhole routing, under various schemes for balancing disk loads. We assume that data is suitably replicated on the disks, so that a node can obtain its data from more than one disk if necessary. We consider a 2-D mesh system as the server, with disks connected to one side of the mesh, although the heuristics can easily be extended to the case with disks at two opposite sides. The data transfers are assumed to be done using wormhole routing. The next section provides the background and motivation for the proposed study. Section 3 presents the disk load balancing schemes. Simulation results are given in Section 4, followed by conclusions.

2 Background

In wormhole routing [7], a packet is divided into a number of flits for transmission. The header flit governs the route of the packet and the other flits follow in a pipelined fashion. At each time-step, every header tries to advance to the next node on its route. If more than one header wants to use a channel in the same direction at the same time, only one of them is allowed to move and the rest are blocked. The decision about which one is allowed to use the channel could be arbitrary or based on some policy. When a header is blocked, all the flits of that packet remain in the flit buffers of the nodes on the route. A virtual channel is a logical link between two nodes; several virtual channels can share a physical channel (link) through multiplexing.

Requests for data resident on disks typically originate at the nodes of the mesh. The requests themselves are usually a few bytes, while the actual data transfers from the disks to the nodes are massive. For each request, the disk holding the required data is located, and a suitable route is then determined from that disk to the node. The data must be transferred from the disks to the nodes within certain deadlines in order to provide a given quality of service (QoS). A prime objective of the routing scheme is to minimize conflicts and the overall I/O transfer time.

In the X-Y routing scheme (also called row-column routing), a packet traverses the row (column) of the source up to the column (row) of the destination, and then goes straight through to the destination. X-Y routing is very simple and gives reasonable performance (an illustrative sketch of the rule appears at the end of this section). In a study of routing and scheduling of I/O traffic in a 2-D mesh [6], it was shown by simulation that X-Y routing performed quite well among the various blocking schemes considered there.

A study described in [8] evaluated the I/O performance of X-Y routing in 2-D meshes under various patterns of disk-to-node assignments, reflecting various application scenarios. These were:

Equal number of nodes per disk: An equal number of nodes is assigned to each disk. In a mesh with $n \times n$ nodes and $m$ disks, each disk is assigned $\lceil n^2/m \rceil$ nodes. The particular nodes assigned to each disk could either be random or selected in some order.

Balanced load on disks: Load balancing heuristics balance the load on all disks, given the I/O traffic from the nodes.

Random assignment: Random numbers are generated uniformly in the interval $[0, m-1]$, where $m$ is the number of disks, and successive numbers are assigned to the nodes of the mesh (either randomly or in some order). A number $k$ assigned to a node means that disk $k$ is the disk with which that node does its I/O.

Non-uniform assignment: The number of nodes assigned to a disk varies from disk to disk. Three different distributions of the number of nodes assigned to disks have been used: scheme 1 uses a Zipf distribution, scheme 2 uses a mid-weighted distribution, and scheme 3 uses a reverse mid-weighted distribution. The distributions for the three non-uniform assignments and for the scheme with an equal number of nodes per disk are shown in Figure 1.

[Figure 1: Various node-to-disk assignments. Four panels plot the number of nodes per disk for the equal-number scheme and for non-uniform schemes 1-3.]

It was found that under X-Y routing, the scheme with balanced load on the disks resulted in lower I/O completion times. This motivates the study in this paper of different disk load balancing schemes and their effect on I/O transfer times.
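Returning to the X-Y routing rule described above, the following C sketch prints the sequence of nodes a packet visits from a source to a destination under one common convention (row first, then column). It is a minimal illustration, not code from the paper; the coordinate struct and the function name are assumptions.

```c
#include <stdio.h>

/* Illustrative X-Y (row-column) routing: move along the source row until the
 * destination column is reached, then along that column to the destination. */
typedef struct { int row, col; } Node;

static void xy_route(Node src, Node dst)
{
    Node cur = src;
    printf("(%d,%d)", cur.row, cur.col);
    while (cur.col != dst.col) {                 /* X phase: stay in the source row */
        cur.col += (dst.col > cur.col) ? 1 : -1;
        printf(" -> (%d,%d)", cur.row, cur.col);
    }
    while (cur.row != dst.row) {                 /* Y phase: move along the column */
        cur.row += (dst.row > cur.row) ? 1 : -1;
        printf(" -> (%d,%d)", cur.row, cur.col);
    }
    printf("\n");
}

int main(void)
{
    Node disk_port = { 3, 0 };   /* e.g. a disk attached at column 0 of row 3 */
    Node dest_node = { 1, 5 };
    xy_route(disk_port, dest_node);
    return 0;
}
```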

3 Load Balancing Heuristics

The load balancing heuristics try to balance the load on all disks, given the I/O traffic from the nodes. We determine the cost of allocation in each of the cases, where $n$ is the size of one dimension of the mesh. In the discussion below, assigning `rows' to nodes is treated as equivalent to assigning the disks in those rows to the nodes. The weight of a node is the amount of disk I/O originating at that node, and the weight of a row is the sum of the weights of all the nodes in that row; we use the terms `weight' and `load' synonymously. The basic idea of the disk load balancing heuristic is to transfer excess weight from rows whose weights are more than the average load per disk to rows whose weights are less than the average. Disk-load balancing is thus a two-step process: (1) in the row with excess weight, selecting the node(s) whose weight(s) are to be shifted; (2) selecting the row to which the chosen weight is to be transferred.

We describe four schemes. In all of them, we first find the total load in each row and the average load to be handled by each disk. The total load in row $i$, denoted $W^R_i$, is given by $W^R_i = \sum_{j=1}^{n} w_{ij}$, where $w_{ij}$ is the weight (packet size) of the node in row $i$ and column $j$. The average load handled by each disk is denoted $W^D_{av}$ and is given by $W^D_{av} = W_{tot}/m$, where $W_{tot} = \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}$ is the total weight (packet sizes) of all nodes and $m$ is the number of disks. We then assign to each disk a set of nodes such that the sum of the weights of the nodes in any set is as nearly equal to the average $W^D_{av}$ as possible. Rows with more (less) load than the average transfer (take) the excess load to (from) other rows.
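The following sketch computes the row weights $W^R_i$, the total weight $W_{tot}$, and the average load per disk $W^D_{av}$ defined above. The mesh size, number of disks, and weight values are made-up illustrative inputs, not data from the paper.

```c
#include <stdio.h>

#define N 4   /* mesh is N x N (illustrative size)  */
#define M 4   /* number of disks (illustrative)     */

int main(void)
{
    /* w[i][j]: I/O weight (packet size) of the node in row i, column j */
    double w[N][N] = { {3,1,2,4}, {5,0,1,1}, {2,2,2,2}, {6,1,0,3} };
    double row_weight[N];                 /* W^R_i  */
    double total = 0.0;                   /* W_tot  */

    for (int i = 0; i < N; i++) {
        row_weight[i] = 0.0;
        for (int j = 0; j < N; j++)
            row_weight[i] += w[i][j];     /* W^R_i = sum_j w_ij */
        total += row_weight[i];
    }
    double avg = total / M;               /* W^D_av = W_tot / m */

    for (int i = 0; i < N; i++)
        printf("row %d: weight %.1f (%s the average %.2f)\n", i, row_weight[i],
               row_weight[i] > avg ? "above" : "at or below", avg);
    return 0;
}
```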

3.1 Choice of row with deficit weight

After choosing the node whose weight is to be transferred to another row, the question is: which row should it be transferred to? The row should have a row weight less than the average (i.e., it can take additional weight), and it should have a disk attached to it. Here again, we have several heuristics for choosing the row. (1) Row choice 1: the first available row (starting from row 0). (2) Row choice 2: the available row closest to the current row. (3) Row choice 3: the available row with the least weight. (4) Row choice 4: the available row whose weight differential is closest to the weight to be transferred. Recall that the mesh size is assumed to be $n \times n$; there are $n$ rows with $n$ nodes in each row. It is easily seen that the time complexity of the row-choosing operation is O(n), where n is the number of rows. We define the imbalance of a row to be the deviation of its weight from the average after weights from rows with excess load have been transferred and rows with deficit loads have received additional weight. The overall imbalance is defined as the sum of the imbalances of all the rows; it will generally differ from scheme to scheme.
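The four row-choice policies can be captured in a single selection routine. The sketch below is an illustrative rendering (function and variable names are assumptions, not from the paper); a row is treated as available if it has a disk attached and its weight is below the average.

```c
#include <math.h>
#include <stdio.h>

/* Pick a destination row for a weight of transfer_w leaving cur_row,
 * according to one of the four row-choice policies described above. */
int choose_row(int policy, int cur_row, double transfer_w,
               const double *row_weight, const int *has_disk,
               int n_rows, double avg)
{
    int best = -1;
    double best_key = HUGE_VAL;
    for (int r = 0; r < n_rows; r++) {
        if (r == cur_row || !has_disk[r] || row_weight[r] >= avg)
            continue;                                        /* row not available */
        double key;
        switch (policy) {
        case 1: return r;                                    /* 1: first available row */
        case 2: key = fabs((double)(r - cur_row)); break;    /* 2: closest row */
        case 3: key = row_weight[r]; break;                  /* 3: least weight */
        case 4: key = fabs((avg - row_weight[r]) - transfer_w); break;
                                   /* 4: deficit closest to the transferred weight */
        default: return -1;
        }
        if (key < best_key) { best_key = key; best = r; }
    }
    return best;                   /* -1 if no row can take the weight */
}

int main(void)
{
    double row_weight[4] = { 12.0, 3.0, 9.0, 5.0 };
    int    has_disk[4]   = { 1, 1, 1, 1 };
    /* Row 0 is overloaded (average 7.25); try each policy for a transfer of 4. */
    for (int p = 1; p <= 4; p++)
        printf("policy %d chooses row %d\n",
               p, choose_row(p, 0, 4.0, row_weight, has_disk, 4, 7.25));
    return 0;
}
```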

3.2 Choice of node in the row with excess weight

First Fit: In this scheme, we simply pick nodes from left to right (or right to left) in order and transfer their loads to rows that can take additional load, until the load in the current row equals or falls below the average. Each weight $w_{ij}$ transferred out of row $i$ is subtracted from the row weight $W^R_i$. The destination row can be chosen by any of the policies listed in Section 3.1. The complexity of the First-fit choice of node is O(n).
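A minimal illustration of the First-fit transfer, using row choice 1 (first available row) as the destination policy. The 4 × 4 weight matrix and the one-disk-per-row assumption are made up for the example and are not from the paper.

```c
#include <stdio.h>

#define N 4   /* mesh is N x N; one disk per row in this toy example */

int main(void)
{
    double w[N][N] = { {6,5,4,6}, {1,0,1,1}, {2,2,2,2}, {1,1,0,1} };
    double row_weight[N], total = 0.0;

    for (int i = 0; i < N; i++) {
        row_weight[i] = 0.0;
        for (int j = 0; j < N; j++) row_weight[i] += w[i][j];
        total += row_weight[i];
    }
    double avg = total / N;                       /* average load per disk */

    for (int i = 0; i < N; i++) {                 /* balance each overloaded row */
        for (int j = 0; j < N && row_weight[i] > avg; j++) {
            if (w[i][j] == 0.0) continue;         /* nothing to move at this node */
            int dst = -1;                         /* row choice 1: first available */
            for (int r = 0; r < N; r++)
                if (r != i && row_weight[r] < avg) { dst = r; break; }
            if (dst < 0) break;                   /* no row can take more load */
            row_weight[dst] += w[i][j];           /* reassign node (i,j) to dst's disk */
            row_weight[i]   -= w[i][j];
            w[i][j] = 0.0;
        }
    }
    for (int i = 0; i < N; i++)
        printf("row %d: weight %.1f (average %.2f)\n", i, row_weight[i], avg);
    return 0;
}
```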

Best Fit: In this scheme, the weights in the row are sorted in non-decreasing order. The weight closest to the excess weight is then found using binary search and transferred to another row, and the excess weight is decremented accordingly. This process continues until the excess weight becomes equal to (or less than) zero. Although this requires more work than First fit, it gives better results. The destination row can again be chosen by any of the policies listed in Section 3.1. The time complexity of Best fit is O(n log n).
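The sketch below illustrates the Best-fit selection for one overloaded row: the node weights are sorted, and a binary search repeatedly picks the weight closest to the remaining excess. The destination-row choice is omitted, and all names and values are illustrative assumptions.

```c
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

static int cmp_dbl(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Binary search in a sorted array for the index of the weight closest to target. */
static int closest(const double *w, int n, double target)
{
    int lo = 0, hi = n - 1, best = 0;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        if (fabs(w[mid] - target) < fabs(w[best] - target)) best = mid;
        if (w[mid] < target) lo = mid + 1; else hi = mid - 1;
    }
    return best;
}

int main(void)
{
    double w[] = { 6, 2, 9, 4, 7 };          /* node weights of the overloaded row */
    int n = 5;
    double avg = 20.0, row_weight = 28.0;    /* the row carries 8 units of excess */

    qsort(w, n, sizeof w[0], cmp_dbl);       /* non-decreasing order */
    while (row_weight > avg && n > 0) {
        int k = closest(w, n, row_weight - avg);   /* weight nearest the excess */
        printf("transfer weight %.1f\n", w[k]);    /* destination row chosen as in 3.1 */
        row_weight -= w[k];
        for (int j = k; j < n - 1; j++) w[j] = w[j + 1];   /* remove it from the row */
        n--;
    }
    printf("row weight after balancing: %.1f (average %.1f)\n", row_weight, avg);
    return 0;
}
```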

Largest First: In this scheme, we sort the weights in the row with excess weight and transfer them in decreasing order of node weight (largest first), continuing until the row weight becomes equal to or less than the average. The time complexity is O(n log n).

3.3 Zigzag scan scheme

Starting at a corner of the mesh, a zigzag scan is made through the entire mesh. During the scan, successive nodes are assigned to a disk until the sum of the packet sizes of the assigned nodes exceeds the average; when this happens, assignment proceeds with the next disk. The complexity of this scheme is O(n).

In all the above schemes, weights are transferred from a row until the total weight in the row becomes equal to or less than the average. This may not always result in a good balance. For example, assume that the average is 100, a row has an excess weight of 102, the smallest weight that can be transferred is 12, and the row to which the excess would be transferred has a weight of 96. By transferring that weight, we increase the weight of the other row to 96 + 12 = 108, which is 8 more than the average, whereas by retaining it in the original row, the original row exceeds the average by only 2. Also, in all the above schemes, weights are transferred in their entirety; there are no fractional transfers of weights. One could devise schemes in which a weight is split and only a portion is transferred, which could result in better balancing of the weights (loads) among the disks. In fact, it can be proved that if fractional weights could be transferred, the balancing could be exact, i.e., the weight on every row after balancing would equal the average.
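A sketch of the zigzag scan described at the start of this subsection, under the assumption that the scan proceeds column by column with alternating direction (matching the `vertical zigzag' entry in Table 1). The weight matrix, mesh size, and number of disks are illustrative values.

```c
#include <stdio.h>

#define N 4   /* mesh is N x N (illustrative)   */
#define M 2   /* number of disks (illustrative) */

int main(void)
{
    double w[N][N] = { {3,1,2,4}, {5,0,1,1}, {2,2,2,2}, {6,1,0,3} };
    int disk_of[N][N];                            /* disk assigned to each node */
    double total = 0.0, acc = 0.0;
    int disk = 0;

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) total += w[i][j];
    double avg = total / M;                       /* average load per disk */

    for (int j = 0; j < N; j++) {                 /* scan column by column */
        for (int k = 0; k < N; k++) {
            int i = (j % 2 == 0) ? k : N - 1 - k; /* alternate direction: zigzag */
            disk_of[i][j] = disk;
            acc += w[i][j];
            if (acc > avg && disk < M - 1) {      /* current disk has enough load */
                disk++;
                acc = 0.0;
            }
        }
    }
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) printf("%d ", disk_of[i][j]);
        printf("\n");
    }
    return 0;
}
```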

[Figure 2: I/O performance under various node-to-disk assignments. The plot shows I/O completion time (x 10^4) versus the number of disks (2, 4, 8, 16, 32) for a 32 × 32 mesh under X-Y routing, with curves for the equal-number, uniform (balanced) load, non-uniform (schemes 1-3), and random assignments.]

Table 1: I/O completion times in a 32 × 32 mesh under various disk load balancing schemes.

                                      Number of disks
  Disk load balancing scheme      2       4       8      16      32
  Row choice 1 (First fit)    26068   13663    7597    4165    2679
  Row choice 1 (Best fit)     26053   13660    7606    4187    2661
  Row choice 2                27644   14714    8475    6102    4240
  Row choice 3                26132   13788    7634    4180    2680
  Row choice 4                26092   13697    7516    4227    2688
  Vertical zigzag             26115   13757    7506    4233    2715

4 Simulation Results

The proposed heuristics were implemented in C and simulated on a SparcStation, for various mesh sizes and numbers of disks. The I/O completion times for X-Y routing in a 32 × 32 mesh under the various node-to-disk assignment schemes are shown in Figure 2. It is easily seen that the scheme which balances the load on the disks gives the best I/O performance in all cases. The I/O performance under the various disk load balancing schemes, for various numbers of disks, is tabulated in Table 1. The Best fit scheme does consistently better in almost all cases (numbers of disks), and the Row choice 2 scheme has the worst performance among the schemes considered. The First fit scheme performs close to the Best fit scheme.

5 Conclusions

High-performance massively parallel computing (MPC) systems with a 2-D mesh topology and wormhole routing for message switching are used in several applications demanding high I/O performance. For I/O-bound applications on such systems, balancing the load on the disks helps reduce the I/O completion time. In this paper, we studied the problem of data transfer from disks to the nodes of a 2-D mesh under various schemes for balancing the load on the disks. We observe that the I/O performance of even a simple disk load balancing scheme is close to that of a more sophisticated scheme.

References

[1] A. Choudhary, et al., `PASSION: Parallel and Scalable Software for Input-Output', NPAC Technical Report SCCS-636, Syracuse University, September 1994.
[2] W. J. Dally and C. L. Seitz, `The Torus Routing Chip', Journal of Distributed Computing, vol. 1, no. 3, pp. 187-196, 1986.
[3] D. Jadav and A. Choudhary, `Designing and Implementing High-Performance Media-on-Demand Servers', IEEE Parallel and Distributed Technology, Summer 1995.
[4] H. Jiang, et al., `Efficient Algorithms for Non-blocking Wormhole Routing and Circuit Switching on Linear Array Multiprocessors', Proc. of the ISCA Conference on Parallel and Distributed Computing Systems, Nevada, October 1994, pp. 614-619.
[5] W. Mao and R. Simha, `Routing and Scheduling File Transfers in Packet-Switched Networks', Journal of Computing and Information, vol. 1, no. 1, Special Issue: Proceedings of the 6th International Conference on Computing and Information, 1994, pp. 559-574.
[6] B. Narahari, et al., `Routing and Scheduling I/O Transfers on Wormhole-Routed Mesh Networks', Journal of Parallel and Distributed Computing, vol. 57, no. 1, April 1999, pp. 1-13.
[7] L. M. Ni and P. K. McKinley, `A Survey of Wormhole Routing Techniques in Direct Networks', IEEE Computer, vol. 26, no. 2, February 1993, pp. 62-76.
[8] S. R. Subramanya, R. Simha, and B. Narahari, `I/O Performance of X-Y Routing in 2-D Meshes under Various Node-to-Disk Assignments', Int'l Conf. on Computers and Their Applications, Cancun, Mexico, April 1999, pp. 302-304.
