Laboratoire de l’Informatique du Parallélisme Ecole Normale Supérieure de Lyon Unité de recherche associée au CNRS n°1398

Massively parallel machine based on T9000 and C104

Eric Fleury, Marc Picquendar

April 1994

Research Report No 94-14

Ecole Normale Supérieure de Lyon, 46 Allée d’Italie, 69364 Lyon Cedex 07, France. Telephone: (+33) 72.72.80.00. Fax: (+33) 72.72.80.80. E-mail: [email protected]−lyon.fr

Abstract

The T9000 transputer and its companion routing chip, the C104, allow the construction of very large networks (several thousand processors). The designer of such a network must select a topology taking into account both performance (i.e. small communication delays) and engineering considerations (cost and wireability). This paper presents preliminary studies of various candidate topologies for large networks of transputers. These topologies are all based on small building modules which are easy to package; various ways of interconnecting these modules are studied. For each candidate topology, we describe the connection patterns and routing functions; we also discuss the relative merits of the various topologies both from a performance viewpoint (through simulations) and an engineering viewpoint (cost and wireability).

Keywords: T9000, C104, multiprocessor, parallel architectures, interconnection networks

Résumé

The T9000 transputer and its associated router, the C104, make it possible to build massively parallel machines interconnecting more than a thousand processors. The choice of a topology must take into account the performance of the network as well as its feasibility. This report presents a first study of different topologies built from the T9000 transputer and the C104 router. All these topologies are built from basic modules; several ways of interconnecting these modules are described. For each topology, we give its advantages and/or drawbacks from the point of view of performance and feasibility.

Mots-clés: T9000, C104, parallel machines, parallel architectures, interconnection networks

Eric Fleury
LIP, CNRS URA 1398, Ecole Normale Supérieure de Lyon, 69364 Lyon Cedex 07, France
Tel: (33) 72 72 82 31
[email protected]

Marc Picquendar
Computer & Information Science, University of Massachusetts, Amherst, MA 01003, USA
Tel: (1) 549 8483
[email protected]


1 Introduction

The T9000 transputer and its companion routing chip, the C104, allow the construction of very large networks (several thousand processors). The designer of such a network must select a topology taking into account both performance (i.e. small communication delays) and engineering considerations (cost and wireability). This paper presents preliminary studies of various candidate topologies for large transputer networks. These topologies are all based on small building modules which are easy to package; we study various ways of interconnecting these modules to implement different topologies. For each candidate topology, we describe the connection patterns and routing functions; we also discuss the relative merits of the various topologies both from a performance viewpoint (through simulations) and an engineering viewpoint (cost and wireability).

This paper is organized as follows. Sections 2 and 3 describe the characteristics of the T9000 and the C104 and the organization of the basic modules. Section 4 describes the proposed topologies. Section 5 analyzes the simulations. Finally, we discuss some engineering issues (section 6.1) and propose further simulations (section 6.2).

Both authors are supported by the research programs ANM and C3.


2 T9000 and C104

The purpose of this section is not to describe all the characteristics of the T9000 and the C104 [3, 8], but to focus on the communication capabilities implemented in the T9000 and supported by the C104, or by any other chip that offers equivalent properties.

2.1 The T9000 transputer

Figure 1 shows the different modules of the T9000 architecture. It has a superscalar processor, a hardware scheduler, 16 Kbytes of on-chip cache memory and an autonomous communication processor. The T9000's processor and scheduler implement communication between processes running on the same processor. The communication system allows processes to communicate when they run either on directly connected transputers or on transputers connected by a network built from C104s.

[Figure 1: Different blocks of the T9000 architecture. Blocks shown: workspace cache, address generators 1 and 2, FPU, ALU, 16 Kbyte instruction and data cache, crossbar, virtual channel processor (VCP), links 0 to 3, events 0 to 3, CLinks 0 and 1, system services, timers, programmable memory interface (PMI).]

The communication system has four full-duplex serial communication links, each with its own pair of direct memory access (DMA) channels. Messages are passed over these links by the virtual channel processor (VCP). The VCP allows any number of virtual links to be established over a single hardware link by multiplexing messages. When a process needs to send a message, the message is handed to the VCP, which transmits it as a sequence of 32-byte packets. Each packet starts with a header, which is used to route it through the network and to identify the virtual link used by the remote process. Before sending the next packet of the same message, the VCP waits for an acknowledgment (ACK). Messages and acknowledgments of other virtual links can still be sent while the VCP waits for the ACK of a given virtual link.
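The packet-and-acknowledge discipline described above can be illustrated with a minimal Python sketch. It is not the VCP's actual implementation: the names `Packet`, `packetize` and `send_message`, the use of a single integer as header, and the assumption that the 32 bytes count data only are all illustrative choices, and the acknowledgment is simulated locally by a callback.

```python
from dataclasses import dataclass

PACKET_DATA_BYTES = 32  # assumption: the 32 bytes count the data, not the header


@dataclass
class Packet:
    header: int   # routes the packet and names the destination virtual link
    data: bytes   # at most PACKET_DATA_BYTES bytes


def packetize(header, message):
    """Split a message into consecutive packets carrying the same header."""
    return [
        Packet(header, message[i:i + PACKET_DATA_BYTES])
        for i in range(0, len(message), PACKET_DATA_BYTES)
    ]


def send_message(header, message, transmit, wait_for_ack):
    """Send packets one at a time, waiting for an ACK before the next one."""
    for packet in packetize(header, message):
        transmit(packet)
        wait_for_ack(header)  # hypothetical hook standing in for the remote ACK


# Usage: record what a 70-byte message on virtual link 3 puts on the wire.
log = []
send_message(3, bytes(70), log.append, lambda h: log.append(f"ACK {h}"))
```

In this sketch the log alternates between a packet and its acknowledgment, which mirrors the rule that the next packet of a message is not sent before the previous one has been acknowledged.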

2.2 The C104

The C104 is a packet routing chip. It connects 32 serial communication links to each other via a 32 × 32 non-blocking crossbar switch, enabling messages to be routed from any of its links to any other link. It allows communication between T9000s that are not directly connected. C104s can also be connected to each other to build larger networks.


2.2.1 The routing technique of the C104

The C104 uses wormhole routing [1, 2, 12]. The routing decision is taken as soon as the header of the packet has reached the switch (Figure 2). If the output link is free, the header is sent directly to this link and the remainder of the packet follows without being stored. If the output link is busy, the packet is buffered. This routing method can be viewed as a form of dynamic circuit switching, where the header of the packet, passing through a sequence of nodes, creates a temporary circuit. As the tail of the packet is pulled out, the link is released and the circuit vanishes.

[Figure 2: Wormhole routing. A packet crosses a chain of C104s, each of whose other links leads to a T9000 or a C104.]

A packet can pass through several C104s at the same time, and the header can be received by the destination before the whole packet has been transmitted by the source.

2.2.2 Routing algorithm of the C104

The routing mode described above needs a routing strategy to compute which output link has to be selected. This algorithm has to be reliable (every packet sent has to arrive), complete (it has to work for any network), deadlock-free, and simple enough to be implemented on a chip and to minimize latency. The C104 uses an interval routing algorithm [6, 9, 13], a popular way of building compact routing tables. The header of an arriving packet is compared with a set of ranges of values (intervals), one for each output of the switch. The packet is routed to the output whose interval contains the header (Figure 3).

[Figure 3: Interval routing. Six transputers, T9000 0 to T9000 5, are attached to two C104s. The first C104 has outputs labeled [0; 1), [1; 2) and [2; 3) towards T9000 0, 1 and 2, and [3; 6) towards the second C104. The second C104 has outputs labeled [3; 4), [4; 5) and [5; 6) towards T9000 3, 4 and 5, and [0; 3) towards the first C104.]

Suppose that T9000 number 1 wants to send a message to T9000 number 3. The header of the message is therefore 3. When the message arrives at the first C104, this header is compared with all the intervals, and "falls" in the interval [3; 6). When the header arrives at the second C104, it falls in the interval [3; 4), and the message finally reaches transputer number 3.
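The interval lookup of the Figure 3 example can be sketched in a few lines of Python. The two tables use the intervals given in the figure; the output names and the function `select_output` are illustrative assumptions and do not correspond to C104 link numbers or registers.

```python
# Each entry maps a half-open interval [low, high) of headers to an output.
# Intervals are taken from the Figure 3 example; output names are illustrative.
FIRST_C104 = [
    ((0, 1), "T9000 0"),
    ((1, 2), "T9000 1"),
    ((2, 3), "T9000 2"),
    ((3, 6), "second C104"),
]
SECOND_C104 = [
    ((0, 3), "first C104"),
    ((3, 4), "T9000 3"),
    ((4, 5), "T9000 4"),
    ((5, 6), "T9000 5"),
]


def select_output(table, header):
    """Return the output whose interval contains the packet header."""
    for (low, high), output in table:
        if low <= header < high:
            return output
    raise ValueError(f"no interval contains header {header}")


# Usage: the message from T9000 1 to T9000 3 carries the header 3.
first_hop = select_output(FIRST_C104, 3)    # leaves towards the second C104
second_hop = select_output(SECOND_C104, 3)  # delivered to T9000 3
```

Because the intervals of a switch are disjoint and together cover all labels, each header matches exactly one output, so the table needs only one pair of bounds per link.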

2.2.3 Grouped-adaptive and universal routing

Grouped-adaptive routing. The C104 supports locally adaptive routing by allowing the programmer to group several consecutive output links [4, 5, 7, 11]. Once grouped, the links are equivalent, so the question "which link has to be used" reduces to "which link of the group becomes available first". This improves performance by ensuring that no packet waits for a link while an equivalent link is free. Grouped-adaptive routing also takes advantage of all the available bandwidth.
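The selection rule for a group of equivalent links can be sketched as follows. This is only an illustration in Python: the function `pick_link`, the link numbers 16 to 19 and the representation of busy links as a set are assumptions, not a description of the C104 hardware.

```python
def pick_link(group, busy):
    """Return the first free link of a group of equivalent links, or None."""
    for link in group:
        if link not in busy:
            return link
    return None  # every link of the group is busy: the packet has to wait


# Usage: links 16 to 19 form one group (hypothetical numbering).
group = range(16, 20)
chosen = pick_link(group, busy={16, 17})           # link 18 is free
blocked = pick_link(group, busy={16, 17, 18, 19})  # no free link in the group
```

A packet is delayed only when `pick_link` finds no free link, that is, when all links of the group are in use.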

Universal routing. The interval routing algorithm provides deadlock-free communication. The transmission speed of a packet depends on the number of collisions encountered along its route. Unfortunately, some communication patterns generate collisions: a hot-spot is created when too many packets want to use the same link. To eliminate such hot-spots, the C104 provides a two-phase algorithm referred to as universal routing [14]. First, the packet is routed to a randomly chosen intermediate destination. Then it is routed to its final destination. The first phase is implemented in the C104 by adding a random header to the packet. When the packet reaches the intermediate destination, its random header is deleted, and the header thus exposed (the standard header) is used to route the packet to its final destination.
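The two phases of universal routing can be sketched in Python by modeling the headers of a packet as a list whose first element is the one currently used for routing. The function names and the list of candidate intermediate destinations are illustrative assumptions; the sketch does not model how the C104 is configured to generate or delete headers.

```python
import random


def add_random_header(final_destination, intermediates, rng=random):
    """Phase 1: prepend the header of a randomly chosen intermediate destination."""
    return [rng.choice(intermediates), final_destination]


def delete_random_header(headers):
    """At the intermediate destination: drop the leading header, exposing the next."""
    return headers[1:]


# Usage: a packet for node 5 is first sent to one of three intermediate nodes.
rng = random.Random(0)  # seeded generator so that the example is repeatable
headers = add_random_header(5, [10, 11, 12], rng)
exposed = delete_random_header(headers)  # only the standard header remains
```

During the first phase the switches see only the random header; after it is deleted, ordinary interval routing on the standard header completes the delivery.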

3 Building Blocks

We are interested in evaluating various topologies for interconnecting transputers using C104 routing chips. For the rest of the paper, we suppose that the network is composed of $N$ transputers, labeled from $0$ to $N-1$. For each proposed topology, we describe:

connections: the way the network is built by interconnecting the different routers and processors;

routing: since the C104 router uses an interval routing scheme, we have to describe first how to label the nodes from $0$ to $N-1$, and then, for each router in the network, how to assign an interval to each of its output links.

The topologies studied in this paper are all organized around basic building modules composed of T9000s and C104s. The organization in small modules provides modularity and scalability. Two kinds of modules are used:

Brick: composed of eight transputers and two switches.

Block: composed of four bricks (32 T9000s) or eight bricks (64 T9000s).

3.1 Brick

The basic brick used in the simulations is composed of eight T9000s connected to two C104s. The lower links of each C104 (0 to 15) are connected to the links of the transputers. The four highest links of each C104 (28 to 31) are reserved for local I/O, which leaves 24 links ($2 \times 12$) connecting the basic brick to the rest of the network. Since there are eight transputers per brick, there are $N/8$ bricks in the network. For the rest of the paper, we suppose that bricks are labeled $k$ with $0 \le k < N/8$. There are two switches per brick, and thus $N/4$ brick switches in the network. We label each brick switch by $k$, $0 \le k < N/4$. Thus, brick $k$ is composed of eight transputers labeled from $8k$ to $8(k+1)-1$ and of two brick switches: $2k$ and $2k+1$. The intervals associated with the links of transputer $k$ are of equal size (within 1), and span the rest of the network. Thus the interval: b

k