POINT TO POINT COMMUNICATION USING VIRTUAL CHANNEL FLOW CONTROL AND WORMHOLE ROUTING

A dissertation submitted in partial fulfilment of the requirements for the degree of Bachelor of Technology in Computer Science & Engineering

V. N. Padmanabhan Vishal Gaur Project Supervisors:

Prof. B. B. Madan & Dr. Gautam Shroff

B.Tech Major Project Report
Department of Computer Science and Engineering
Indian Institute of Technology, Delhi.
May, 1993.

CERTIFICATE

This is to certify that this project entitled "A wormhole router for the MEIKO T800 transputer system" has been completed under our guidance by V. N. Padmanabhan and Vishal Gaur, in partial fulfillment of the degree of Bachelor of Technology in the Department of Computer Science and Engineering, IIT Delhi. This work has not been submitted elsewhere for award of any other degree/diploma.

Prof. B. B. Madan, Professor

Dr. Gautam Shroff, Assistant Professor

Department of Computer Science and Engineering, Indian Institute of Technology - Delhi, New Delhi - 110016, INDIA.


ACKNOWLEDGEMENTS

We wish to thank our guides, Prof. B. B. Madan and Dr. Gautam Shroff, for their guidance, encouragement, and the many hours they spent with us on the project. We would also like to thank Dr. Sandeep Sen for explaining the ideas of random routing and shearsort to us. We are also grateful to Mr. Atul Varsheney and Mr. K. Vijay Kumar for helping us in using the transputer system.

V. N. Padmanabhan

Vishal Gaur


Contents

1 Introduction  1
    1.1 The MEIKO transputer system  1
    1.2 Motivation  2
    1.3 Aim of the project  3
    1.4 Work achieved  3
    1.5 Organization of this report  4

2 Key concepts and review of previous work  6
    2.1 Two models of communication  6
    2.2 Wormhole routing  7
    2.3 Virtual channels  8
    2.4 Reactive scheduling  10
    2.5 Review of literature  10
    2.6 The CSN router  11
    2.7 The router developed by Barua & Goyal  11

3 Findings and their implications for the design  12
    3.1 The T800 Transputer  12
    3.2 Software layers on the transputers  12
    3.3 Blocking Communication  14
    3.4 Reverse messages  14
    3.5 Shared memory  14
    3.6 Semaphores  15
    3.7 Reactive scheduling  15

4 Design of the wormhole router using virtual channels  17
    4.1 Introduction  17
    4.2 Input buffering vs Output buffering  17
    4.3 Division of work between sender and receiver  18
    4.4 Path determination  19
    4.5 Preservation of order of messages  20
    4.6 Routing protocol  21
    4.7 Minimizing buffer copying  22
    4.8 Protection against malicious user  22
    4.9 Master-Slave Arrangement  23
    4.10 Process Ids  23
    4.11 Termination  23

5 Implementation details  25
    5.1 Introduction  25
    5.2 Data structures  25
    5.3 Flit structure  29
    5.4 The Receiver process  29
    5.5 The Sender process  30

6 Experimental results  36
    6.1 Effect of virtual channels on throughput  37
    6.2 Latency comparison with CSN  39
    6.3 Effect of Virtual channels on network latency  40
    6.4 Effect of wormhole routing  40
    6.5 Application programs  42
    6.6 Reactive scheduling  42

7 Conclusions and directions for future work  44

A The POINTCOM communications library  48
    A.1 Organization of the library  48
    A.2 Installing the library  49
    A.3 Using the POINTCOM library  50
        A.3.1 Writing a user application  50
        A.3.2 Initialization and other miscellaneous appendages  51
        A.3.3 Compiling user programs  51
        A.3.4 Running user programs  52
        A.3.5 Configuring the library  53

B Manual pages  54

C Example programs  70

Chapter 1

Introduction

1.1 The MEIKO transputer system

The MEIKO T800 transputer system at IIT Delhi is a 32-node message-passing parallel machine. It is an MIMD (multiple-instruction multiple-data) type of machine: the transputers work asynchronously, so that at any point of time different transputers could be executing different instructions. Also, processes on each transputer have access only to the local memory of that transputer; there is no shared memory between different nodes of the transputer system. To communicate with each other, processes on different transputers use the physical channels that connect them. In the MEIKO system, each transputer has 4 reconfigurable links attached to it, which it can use to connect to any of the other transputers. Thus, with this system, we can set up topologies such as hypercubes of up to 3 dimensions (a 4-D hypercube cannot be set up because one link of one of the nodes in the topology is needed to connect to the host), meshes with up to 32 nodes (allowing one-way wrap-around), etc.


1.2 Motivation

Routing and flow control algorithms are often the critical components of a message-passing concurrent computer, because performance is very sensitive to network latency, throughput and the presence of deadlocks in the network. Network latency is the time taken for a message to travel from the source to the destination node, and throughput is a measure of the utilization of the network's capacity. Deadlock in an Interconnection Network (ICN) is the situation wherein no message can advance to its destination because the queues of the message system are full. It is therefore imperative that the algorithms for routing and flow control be designed and implemented carefully.

A major reason for poor throughput in an ICN is coupled resource allocation, which essentially means that each physical channel c has a fixed buffer b associated with it. So, if b is full with a blocked message, then the channel c will also idle, even though there may be another message that could have used it. Network latency is large if the whole message packet is buffered at every intermediate node before being forwarded to the next one, as in store-and-forward routing. The currently available point-to-point inter-process communication primitives (viz. the Computing Surface Network, CSN) on the T800 MEIKO transputer system use store-and-forward message routing. This implies that the buffer space required for storing a complete packet at each node is large, and so also is the network latency. It is also plagued by the problems of coupled resource allocation. This motivates our project, which is to develop a new set of fast communication routines that avoid the problems inherent in CSN, and provide an interface similar to the Cosmic Environment (described below) to the application programmer. The Cosmic Environment is a run-time environment that allows message-passing parallel programs to be run on a network of UNIX hosts. It presents an interface in which programs are written as one master (host program) and several slaves (node programs). The master reads in the input data and distributes it to the slaves. After the slave processes have finished, the master collects the processed data and outputs it. A rich collection of routines for matrix computation is available on the Cosmic Environment. It would be nice if this library could be ported to the MEIKO transputer system with

little or no change. By providing an interface very similar to that of Cosmic Environment, we hope to achieve this.

1.3 Aim of the project

1. The design and implementation of a point-to-point router for the T800 transputer system using virtual channels for flow control and wormhole routing.

2. Implementation of deadlock-free routing functions for some standard ICN topologies (hypercube, mesh, etc.), and a study of their performance.

3. Allowing several user processes to share a node of the transputer and its links. These processes should be scheduled in a reactive fashion.

4. Studying the effect of wormhole routing and virtual channel flow control on parameters such as network latency and throughput.

5. The interface provided to the application programmer should be the same as that of the Cosmic Environment. It should be possible to port applications written on the Cosmic Environment to our system with minimal effort. The master-slave architecture of the Cosmic Environment should be retained.

1.4 Work achieved

We have been able to achieve nearly all the goals we set for ourselves. Only a brief overview is provided here; a detailed discussion is presented elsewhere in this report (Chapters 6 and 7). We have designed and implemented a point-to-point router using the CS Tools channel communication routines. This router has been thoroughly tested and is entirely functional. Routing functions for the hypercube (up to 3 dimensions) and mesh (up to 32 nodes, with one-way wrap-around) topologies have been implemented; other topologies can be incorporated in a straightforward manner.

We have also conducted experimental studies to evaluate the performance of our router. Network latency and throughput have been measured under varying load conditions, and the results obtained agree well with simulation studies. Finally, our point-to-point communications library has an interface almost identical to that of the Cosmic Environment. This allows programs written for the Cosmic Environment to run using our library with hardly any change. The Cosmic Environment, and thus our point-to-point router, provide better functionality than CSN and are easier to use. While we have been able to run several user processes on each node of the transputer system, we have been unable to schedule them in a reactive fashion. This is essentially due to the lack of appropriate tools on the MEIKO system (refer section 6.6).

1.5 Organization of this report

This introductory chapter is followed by a description of the necessary theoretical concepts that form the basis of our work (Chapter 2). That chapter also includes a review of previous work in this direction, which motivates our project. In Chapter 3, we discuss the evolution of our design, keeping in mind the specific features of the hardware and software of the MEIKO system. This discussion leads us to the final design of our router, which we present in Chapter 4. This is followed by a chapter on implementation details (Chapter 5), which includes a description of the data structures and algorithms used. This chapter should be very useful to a person who wishes to make modifications to our router. Chapter 6 is devoted to a discussion of the performance of our router as measured through experiments. It also compares our results with theoretical predictions. The chapter concludes with a description of an application we have developed on our router, which exploits the parallelism provided by the non-blocking send function xsend(). Chapter 7 sums up the report with conclusions and directions for future work. A detailed description of how to install and use our system is provided in Appendix A. The manual pages for the POINTCOM library form Appendix B. Appendix C shows how

to write an application program, which illustrates the use of our library. These appendices also comprise a separate USER'S MANUAL that we have provided.


Chapter 2

Key concepts and review of previous work

2.1 Two models of communication

In a message-passing parallel computer (like the MEIKO T800 system), different processors communicate by exchanging messages with each other. Such communication can be classified into two broad categories: crystalline communication and point-to-point communication. Crystalline communication, the more restrictive of the two, refers to patterns of communication in which all nodes of the interconnection network (ICN) participate explicitly. For instance, if we do a `shift right' operation on a mesh, then each user process sends its local copy of the element to be shifted to the processor to its right, and then receives the new element from the processor to its left. This type of communication is also called expected communication, since the user process on each node is aware of the communication taking place. Other examples of such communication include exchange, combine, etc.

In contrast, point-to-point communication involves just two nodes: the sender and the destination. The sender sends the packet, specifying the destination. The packet is routed to the destination node via intermediate nodes. The user processes running on these intermediate nodes are not aware of the communication taking place. The router processes that

run on each node handle the communication. In a sense, point-to-point communication is more general than crystalline communication, since the latter can be built on top of the former. However, because routing must be performed and the communication is not expected, point-to-point communication tends to be slower than crystalline communication.

2.2 Wormhole routing

In traditional networks, the kind of routing done is packet or store-and-forward routing. The essential idea here is that the packet sent by the sender hops from one node to the next until it reaches its destination. The whole packet is buffered at each intermediate node. This has two implications:

- The network latency is high. This is because a packet is forwarded to the next node only after the whole packet has been received. Thus the latency is O(nl), where n is the number of hops and l is the latency per hop.

- The buffer requirement at each node is large, because the whole packet has to be buffered at the node.

The idea of wormhole routing avoids the drawbacks of store-and-forward routing by using the concept of a flow control digit (flit). Each packet is broken up into a number of flits, each of which is only a few bytes long. The first flit is the header flit, the only one that contains routing information such as the source and destination of the packet. This is followed by the data flits, which contain only a number identifying the packet to which they belong, apart from the data. The end of the packet is marked by the tail flit. The packet moves from the source to the destination in the following manner. The header

flit first moves to the first intermediate node. Then, while it moves on to the next node, the first data flit reaches the first node. Then the header advances to the third node, the

first data flit to the second, and the second data flit to the first node. In this fashion, the movement of different flits of the same packet is overlapped. This movement is akin to that of a worm: the head moves first and then drags the body along. Hence the name wormhole routing. The advantages of this scheme over the previous one are:



- The network latency is greatly reduced due to the overlap between the movement of different flits of the same packet. In terms of the quantities n and l defined earlier, the latency is O(n + l).

- The buffer requirement at each node is just that for a single flit, a great reduction from that in the previous scheme. However, for better performance, each node should buffer a few flits rather than a single one, because different nodes do not work synchronously.

2.3 Virtual channels

A conventional network organizes the flit buffers associated with each physical channel into a single first-in-first-out (FIFO) queue. As a result, each flit buffer can contain only flits from a single packet. If this packet gets blocked, the physical channel is idled, because no other packet is able to acquire the buffer resources needed to access the channel. The single FIFO queue arrangement is also plagued by the problem of deadlock. One possible scenario is depicted in fig. 1. Here, the buffers of each node are full of flits destined for the diagonally opposite node. This results in a deadlock, as no flit is able to advance towards its destination.


The solution is to organize the flit buffers associated with each channel into several lanes, as shown in fig. 2. The buffers in each lane can be allocated independently of those in any other lane. This added flexibility increases channel utilization and thus throughput. This is analogous to the increase in throughput of a road when it is divided into parallel lanes.

A packet is broken up into a header flit, several data flits, and a tail flit. The header flit sets up a virtual circuit from the source node to the destination node. The data flits move along this virtual circuit, and the tail flit closes it down. With the appropriate use of virtual channels, deadlocks can be avoided. The basic idea is to break cycles in the channel dependency graph. The nodes of this graph correspond to the channels in the ICN. There is an edge from node c1 to c2 if channel c1 is followed by c2 along some route in the original network. The channel dependency graph for the 4-node ring network (without virtual channels) of fig. 3(a) is shown in fig. 3(b). The cycle in the channel dependency graph can be broken by splitting each physical channel into two virtual channels, as shown in fig. 3(c). The two sets are the high virtual channels (c10, ..., c13) and the low virtual channels (c00, ..., c03). Messages at a node numbered less than their destination node are routed on the high channels, and those at a node numbered greater than their destination node are routed on the low channels. Thus we get a channel dependency graph that does not have a cycle (fig. 3(d)), and consequently the routing algorithm is deadlock-free.


2.4 Reactive scheduling

When there is more than one user process running on a node, a scheme for scheduling the processes is required. The simplest scheme is round-robin scheduling: each process is allocated a fixed-length time slot by rotation. But this simple scheme suffers from the drawback that a blocked process (such as a process that has executed a blocking receive) may also get scheduled, and just waste CPU time. In reactive scheduling, a process is scheduled only so long as it is making progress. It is de-scheduled as soon as it executes a blocking receive. In this fashion, wastage of CPU time is avoided, as a blocked process is not scheduled unless there is a new message for it. If several processes on one node are doing communication-intensive processing, the network latency can be masked by the use of reactive scheduling.

2.5 Review of literature

The idea of store-and-forward routing is quite old and is very similar to packet routing in computer networks [9]. Several deadlock-free routing algorithms are known for networks employing this kind of routing. The concept of wormhole routing originated with the Wormhole chip project at Caltech in 1985, headed by C. L. Seitz [7]. In 1987, Dally and Seitz introduced the idea of virtual channels to develop deadlock-free routing algorithms for interconnection networks (ICNs) that use wormhole routing [2]. They also designed the Torus Routing Chip, which implemented this algorithm for a particular network topology. Dally extended the idea of virtual channels for use in flow control in an ICN. In his 1992 paper titled `Virtual-channel flow control' [1], he described a theoretical model for networks using virtual channel flow control and presented the results of his simulation studies. These results indicate a dramatic increase in network throughput due to the use of virtual channels. The Cosmic Environment was developed by Seitz and his students at Caltech [4] in 1988.

The basic idea was to emulate the Cosmic Cube machine on a network of UNIX hosts. One of the members of this group, Jakov Seizovic, developed the Reactive Kernel [6] for the Cosmic Cube. This kernel was essentially a new node operating system that incorporated the idea of reactive scheduling of node processes.

2.6 The CSN router

The MEIKO T800 transputer system has a point-to-point router called CSN (Computing Surface Network), which has been provided by MEIKO. It uses store-and-forward routing; consequently, the buffer requirement at each node is large. Also, the basic communication functions provided are blocking send and receive. Blocking send is a major drawback in that it makes it impossible to write certain kinds of parallel programs (for instance, parallel triangular system solving). Another point is that CSN may use extra links (i.e., over and above the links present in a particular topology) to optimize certain paths. This leads to erratic performance. When we conducted tests on a 3-D hypercube, it turned out that at times it took less time for a message to be sent between diagonally opposite nodes (a distance of 3 hops) than between neighbouring nodes (a distance of 1 hop).

2.7 The router developed by Barua & Goyal

As part of their B.Tech. project, Rajeev Barua and Anshuman Goyal [5] developed a crystalline router for the T800 transputer system. This router gave excellent performance. They also attempted to build a point-to-point wormhole router. However, due to shortcomings in their design, the performance of this router was very poor. For instance, it took a minimum of 30000 ticks for a message to be exchanged between neighbouring nodes, about 100 times slower than the CSN router. After the upgrade of the MEIKO software in Jan. '93, the crystalline router developed some problems. However, we have rectified these and the system is fully functional.

Chapter 3

Findings and their implications for the design

3.1 The T800 Transputer

The T800 MEIKO transputer system is a 32-node machine in which each node has 4 reconfigurable, bidirectional links available. This allows us to configure topologies such as hypercubes of up to 3 dimensions (4-D ones are not possible because at least one link of one node must be free to communicate with the SUN host), meshes of up to 32 nodes, CCC of up to 3 dimensions, etc.

3.2 Software layers on the transputers

The T800 MEIKO transputer system provides software and communication tools at three levels, as shown in fig. 4.


CSN is the existing software for point-to-point communication on the transputer system, and is what we plan to replace. The OCCAM and C library functions provide low-level communication which we expected to be very fast, and so we had initially planned to use these functions to implement our router. In this design, the router would have been written in C and called from an OCCAM harness program (which configures the ICN and loads the processes on different nodes). But we encountered certain problems with the new version of the software on the transputer system:

1. The electronic wiring tool provided with the OCCAM programming system (OPS) does not work, so it is not possible to run any OCCAM program at present.

2. The C compiler (tcc) required to produce an OCCAM library file (with suffix .li8) from a C program, so that it can be called from an OCCAM harness, is not available. However, we have found a way to produce a .li8 library file using the mcc compiler. This is not documented in the manuals provided.

Due to these problems, we decided to write our router using CS Tools. The basic channel communication routines of CS Tools (cread and cwrite) are reasonably fast; for instance, sending 400 bytes between adjacent nodes takes about 500 microseconds. Moreover, the availability of shared memory between processes on a node in CS Tools (which is not allowed in OCCAM) is a great help, as explained later.


3.3 Blocking Communication

The basic channel read and write functions (cread and cwrite) are blocking. This implies that having a single receiver-cum-sender router process at each node may lead to a deadlock: for instance, if two neighbouring nodes decide to send to each other at the same time, they will both get blocked. So our router has separate receiver and sender processes at each node. Process switching on the transputers is done in hardware and is thus very fast; consequently, using two processes instead of one for the router does not slow down the system very much. Though receive is blocking, it is possible to check for incoming packets on several channels at the same time (receiveany). There is a function alti that returns the index of the channel on which a packet is received first. This allows our router to have just one receiver process for all the input channels.

3.4 Reverse messages

The fact that the cwrite function is blocking implies that it would not be wise to let a node send flits to its neighbour unless it is sure that the latter has buffer space free to receive the flits. This calls for reverse messages to be sent by each node to its neighbours, informing them of the availability of buffer space. Such control messages in our router are very short (2 bytes), so the bandwidth they consume is very small.

3.5 Shared memory

For the router to be fast, the amount of processing it does within each node must be minimized and distributed between the processes (sender, receiver and the user processes) in a balanced fashion. So shared memory, rather than soft channels (channels for communication between processes on the same node, which do not need hardware links), has been used to communicate between the various processes, such as the receiver, sender and user processes, on

a node. This has a three-fold advantage. Firstly, possible wastage of time due to blocking sends and receives on soft channels is avoided. Secondly, by keeping all buffers shared, hardly any copying is required, which saves time. Lastly, this permits sharing of state information between the receiver, sender and user processes, which helps in balancing the processing load, as explained later in the design. However, when we measured memory access times, we found that shared memory access is approximately twice as slow as local memory access.

3.6 Semaphores

CS Tools supports shared memory but does not provide any semaphores. We therefore use Dekker's algorithm [8] to provide mutually exclusive access to shared structures. Most of the structures are shared only between two processes, so the usual form of Dekker's algorithm is sufficient. For the few that are shared by more than two processes, we have used an extension of the algorithm to more than two processes.

3.7 Reactive scheduling

For implementing reactive scheduling, we had planned to use the following strategy: before doing a receive, a user process would indicate to the router its intention of doing so. It would then pause (i.e., go to sleep) until the router received the complete packet desired by it and signalled it to wake up. However, the problem is that transputer processes do not seem to have process ids like normal Unix processes, so signalling them is not possible. It is, however, possible for a node process to change its priority (using the functions setpri and ldpri [3]). There are 2 levels of priority, low and high. So, instead of pausing as described above, a process may lower its priority while it is waiting for a message. The priority is raised once the message is received. Through preliminary testing, we found that a high priority process is scheduled for about 4 times as long as a low priority process. So

we expected this scheme to help reduce the wastage of CPU time. However, the results have not been very encouraging (refer Section 6.6).


Chapter 4

Design of the wormhole router using virtual channels

4.1 Introduction

In this chapter, we describe the major design issues of the point-to-point router. The actual implementation of the design, the data structures, and the working of the two router processes (called sender and receiver) are discussed in the next chapter.

4.2 Input buffering vs Output buffering

Buffers are provided at the input of the virtual channels (fig. 5), i.e., the buffers store messages just received along the channel rather than messages waiting to be sent out. As described in the previous chapter, adjacent nodes in the ICN exchange information about the state of their buffers to prevent blocking and buffer overflows. If buffering were provided at the output of each virtual channel, then the state information of all buffers would have to be sent to all adjoining nodes. By providing buffers at the input of the virtual channels, we have reduced the number of control messages required, because in this scheme state information about a particular buffer has to be sent only to the node along the physical channel

with which this virtual channel is associated.

4.3 Division of work between sender and receiver

The natural division of work between the sender and receiver processes of the router, suggested by the findings and the decision mentioned in section 3.3, is as follows:

Receiver process: This process receives flits along the physical channels and places them

at the end of the respective virtual channel queues. To receive, it uses the function alti mentioned in the previous chapter. The virtual channel queues are shared between the sender and receiver processes on a node; while the receiver keeps adding flits to the end of the queues, the sender keeps picking up flits from the head of the queues and forwarding them towards their respective destinations. The receiver process does not have to worry about buffer overflows, because a sender process sends a flit to its adjacent node only when there is free space in the input buffers of that node. It does not need to do any allocation of virtual channels to packets either; this is done by the sender process.

Sender process: This process does routing and allocation of virtual channels. It continuously scans all input buffers (those associated with the incoming virtual channels as well as those associated with the send queues of local user processes). When it gets a flit from one of the buffers, it checks whether it is a header flit. For a header flit, it makes a switching table entry using the routing function for the specific topology. For a data flit, it consults the switching table to find where the flit has to be sent, and then either sends it along the appropriate link or puts it in the receive queue of the appropriate local user process, as the case may be.

The sender process also allocates virtual channels to packets. When it gets the header flit of a packet, it first determines the outgoing physical channel along which the packet has to be sent. Then it uses the state information about the adjoining node along that physical channel to allocate a free virtual channel to this packet. Subsequently, all flits of this packet are sent with this virtual channel number so that the receiver on the adjoining node knows where to keep them. To avoid buffer overflows on adjoining nodes, information about the number of free flit buffers on all adjoining nodes is also maintained. Flits of a packet are sent forward only when there is free space available in the input buffer of the virtual channel allocated to it. The data structures containing state information about adjoining nodes are continuously updated by the exchange of control messages between nodes. Because all input channels of a node are connected to the receiver process and all outgoing channels to the sender process, these data structures are shared between the receiver and sender processes on a node.
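The virtual channel allocation step can be sketched as a scan over the state table kept for the chosen outgoing physical link. The function names, the per-link channel count, and the busy-flag representation are all illustrative assumptions, not the project's actual code.

```c
#include <assert.h>

#define VCS_PER_LINK 2   /* assumed number of virtual channels per physical link */

/* Claim the first free virtual channel on a link, or return -1 if
 * none is free (the packet then waits in the link's packet queue). */
int alloc_vc(int busy[VCS_PER_LINK])
{
    for (int v = 0; v < VCS_PER_LINK; v++) {
        if (!busy[v]) {
            busy[v] = 1;      /* held until the packet's tail flit passes */
            return v;
        }
    }
    return -1;
}

void free_vc(int busy[VCS_PER_LINK], int v)
{
    busy[v] = 0;              /* tail flit forwarded: channel free again */
}
```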

User library functions: These functions access the send and receive queues of the user processes. The function xsend puts packets to be sent in the send queue of the user process, and the function xrecvb picks up packets from the receive queue and gives them to the respective user process. These queues are shared between the user and sender processes on a node. There is a separate send queue and a separate receive queue for each user process.

4.4 Path determination

Path determination algorithms are of two types: oblivious algorithms, in which the route between any two nodes is fixed and independent of traffic in the ICN, and adaptive algorithms, which monitor congestion in the ICN and change the paths between different pairs of nodes accordingly. We have used oblivious algorithms for path determination. This has several advantages: it is simple to implement, it does not require control messages to monitor the traffic in the ICN, and we could avoid all potential deadlocks by adopting a policy of restrictive allocation of virtual channels (to break cycles in the channel dependency graph).

The routing functions used are described below. In a hypercube, the nodes are numbered using a Gray code. This means that the binary strings representing the node ids of neighbouring nodes differ in exactly one bit. Going across an edge (channel) in a particular dimension flips the corresponding bit in the node id. Since our channels are bidirectional, the simple e-cube routing algorithm ensures deadlock-free routing. First we find the bit positions in which the node ids of the source and destination nodes differ. Then we route along channels corresponding to the bits that need to be flipped, starting from the MSB and going down to the LSB. For instance, to route a packet from node 5 (binary 101) to node 2 (binary 010) in a 3-D hypercube, we choose the path 5 → 1 → 3 → 2.
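The e-cube rule above can be sketched as a small routing function: find the highest-order bit in which the current node and the destination differ, and take the channel in that dimension. Function names are illustrative, not the project's actual code.

```c
#include <assert.h>

/* Return the dimension (bit position, MSB first) of the next channel
 * to take from `node` towards `dest`, or -1 if already there. */
int ecube_next_dim(int node, int dest, int ndims)
{
    int diff = node ^ dest;           /* bits still to be flipped */
    for (int d = ndims - 1; d >= 0; d--)
        if (diff & (1 << d))
            return d;                 /* highest differing bit first */
    return -1;
}

/* The next node on the route: current node with that bit flipped. */
int ecube_next_node(int node, int dest, int ndims)
{
    int d = ecube_next_dim(node, dest, ndims);
    return d < 0 ? node : node ^ (1 << d);
}
```

Following this hop by hop from node 5 to node 2 in a 3-D hypercube reproduces the path 5 → 1 → 3 → 2 given in the text.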

In the case of a mesh, the routing function is somewhat more involved. This is essentially because of the cycles created by the one-way wrap-around, and the consequent need for virtual channels for deadlock avoidance. To get from node (i, j) (i.e., (row #, col #)) to node (m, n), we first move along the column, in the appropriate direction, till we reach node (m, j). Once in the correct row, virtual channels are used to avoid deadlocks. This is done by splitting each physical channel along the rows into two sets of virtual channels: high and low. Routing is restricted so that a flit going from a node with a smaller node id to one with a larger node id uses one set of virtual channels, while a flit going from a larger node id to a smaller one uses the other set. In this routing function, we have made an improvement over the function described by Dally and Seitz [2]. In their routing function, all flits in a row of the mesh must move in the same direction. But in our design, flits are buffered only at the inputs of virtual channels, so the sets of buffers required by flits moving in opposite directions are disjoint. Using our router, virtual channels can therefore be used in both directions. This helps reduce the network diameter (refer to section 2.3).
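The column-then-row rule can be sketched as a routing step that also reports which virtual channel set to use in the row phase. This ignores the wrap-around channels for brevity, and the direction/class encodings are illustrative assumptions.

```c
#include <assert.h>

typedef enum { GO_UP, GO_DOWN, GO_LEFT, GO_RIGHT, ARRIVED } mesh_dir;
typedef enum { VC_NONE, VC_LOW, VC_HIGH } vc_class;

/* From node (i, j) towards node (m, n): column phase first, then the
 * row phase, where the virtual channel set depends on the direction. */
mesh_dir mesh_next(int i, int j, int m, int n, vc_class *vc)
{
    *vc = VC_NONE;
    if (i != m)                       /* column phase */
        return (m > i) ? GO_DOWN : GO_UP;
    if (j != n) {                     /* row phase: pick the VC set */
        *vc = (n > j) ? VC_HIGH : VC_LOW;
        return (n > j) ? GO_RIGHT : GO_LEFT;
    }
    return ARRIVED;
}
```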

4.5 Preservation of order of messages

The use of an oblivious algorithm for path determination makes it easy to preserve the order of messages between any pair of user processes without using sequence numbers. The following points are to be noted in this regard:

1. A receiver process puts all flits at the ends of the respective buffer queues, and a sender process picks up flits from the heads of the buffer queues. All flits of one packet go to the same virtual channel between two adjoining nodes, and flits of different packets are not interleaved in a buffer queue.

2. Associated with each physical channel is a queue of packets waiting to be allocated virtual channels along it. Thus virtual channels are allocated in first-come-first-served (FCFS) order. This not only helps in maintaining the order of packets, but also prevents starvation.

4.6 Routing protocol

In order to maintain state information about adjoining nodes, the router processes exchange control messages with each other. The simple protocol we have used for the exchange of control messages is described below. All sender processes know the size of the virtual channel buffers on each node (this is a predefined constant). When the sender process on node A sends a flit to the receiver process on node B, it decrements the count of free flit buffers on the virtual channel used. The receiver on node B puts this flit in the appropriate virtual channel buffer. After forwarding this flit towards its destination, the sender process on node B sends a control flit to the receiver process on node A saying that one more flit buffer is free. The receiver process on node A, on receiving this control flit, increments the count of free flit buffers on the corresponding virtual channel. In our design, a receiver process is always ready to receive. Therefore, in spite of channel communication being blocking, we expected the communication overhead of control messages to be negligible. Measurements of the time spent in actual send operations (the cwrite function call) confirmed that this is indeed so. Thus, we did not need to reduce the number of control messages further.
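This is essentially credit-based flow control, and the counting on node A can be sketched as follows. The buffer size and all names are illustrative assumptions.

```c
#include <assert.h>

#define VC_BUF_SLOTS 4   /* illustrative predefined buffer size per VC */

/* Credit state node A keeps for one virtual channel into node B:
 * the sender decrements it per data flit sent, the receiver
 * increments it per control flit returned by B. */
typedef struct { int free_slots; } vc_credit;

void vc_init(vc_credit *c)           { c->free_slots = VC_BUF_SLOTS; }
int  vc_can_send(const vc_credit *c) { return c->free_slots > 0; }
void vc_on_send(vc_credit *c)        { c->free_slots--; }  /* data flit sent to B */
void vc_on_credit(vc_credit *c)      { c->free_slots++; }  /* control flit from B */
```

Because the sender checks `vc_can_send` before every flit, node B's input buffer can never overflow, matching the guarantee stated in section 4.3.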


4.7 Minimizing buffer copying

In our design, all buffer queues contain not data but pointers into a large memory segment shared among all processes on a node. This minimizes the copying of data on the nodes. A receiver process receives a flit into a flit buffer obtained from the shared heap, determines which virtual channel it belongs to, and then, instead of copying it to the buffer associated with that virtual channel, simply puts a pointer to the flit buffer in the buffer queue. Explicit copying takes place only thrice in the lifetime of a packet: once at the source node (copying the packet from the user buffer into the shared area), and twice at the destination node (copying the packet from the virtual channel buffer to the shared packet heap, see section 5.2, and then into user space). To simplify memory management, the shared heap is divided into two parts: a packet heap (a large array, each element of which is a packet of some large size), and a flit heap (a large array, each element of which is a flit). These heaps are organized as linked lists of free packets (or flits). As there is a large amount of memory available on each transputer (four megabytes), we do not expect the heap size of the router (200 Kbytes) to cause a memory shortage for a simple application.
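The flit heap with its free list can be sketched as a fixed array whose free elements are chained through an embedded next index. Sizes and field names are illustrative assumptions, not the project's actual layout.

```c
#include <assert.h>

#define FLIT_HEAP_SIZE 8   /* illustrative; the real heap is much larger */
#define FLIT_BYTES     16

typedef struct {
    char data[FLIT_BYTES];
    int  next;                 /* index of next free flit, or -1 */
} flit_buf;

static flit_buf flit_heap[FLIT_HEAP_SIZE];
static int free_head;

void flit_heap_init(void)
{
    for (int i = 0; i < FLIT_HEAP_SIZE; i++)
        flit_heap[i].next = i + 1;
    flit_heap[FLIT_HEAP_SIZE - 1].next = -1;
    free_head = 0;
}

int flit_alloc(void)           /* returns an index into the heap, or -1 */
{
    int i = free_head;
    if (i != -1)
        free_head = flit_heap[i].next;
    return i;
}

void flit_free(int i)          /* push the buffer back on the free list */
{
    flit_heap[i].next = free_head;
    free_head = i;
}
```

Passing these indices (or the equivalent pointers) through the queues, rather than the flit data itself, is what keeps the copy count at three per packet.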

4.8 Protection against malicious users

We could not provide extensive protection for our data structures because we are not aware of anything similar to the user mode and kernel mode of UNIX on the transputers. The steps we have taken in this regard are:

1. We do not give the user access to any of the router's data structures; they are declared static (to limit their scope to the user library file) and are thus hidden from the user processes.

2. We do not give the user any pointer to the shared heap. The function xsend copies the user buffer to the shared data area before sending, and the function xrecvb copies the packet in the shared data area into the user data area before giving it to the user.

4.9 Master-Slave Arrangement

We have found that if a process running on a transputer attempts input/output (other than by using the function debugf; refer to [3]), CSN automatically tries to set up a link to that node. Our router uses up the hardware links to create a topology in such a way that CSN connections cannot be made to all the transputers used by the topology. Therefore, our router supports applications written in the master-slave style (as in the Cosmic Environment), in which a single master process (running on one of the transputer nodes) communicates with the host machine and does all I/O using the normal C functions, while several slave processes do the computations. Typically, the master process initially distributes data among the slave processes and, in the end, collects results from them. In our design, the master process sits on node 0 of the topology, as a CSN connection to node 0 is always made.

4.10 Process Ids

Every application process running on the router has a unique ID, which is represented as a pair (node, pid). The numbering of the nodes for the mesh and hypercube topologies, which we have provided, is shown in fig. 6. The pid of a user process is unique within a node and is an integer in the range 0
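One plausible way to pack such a (node, pid) pair into a single integer, should a flat id ever be needed, is sketched below; the field width and names are purely illustrative, and the router itself simply uses the pair.

```c
#include <assert.h>

#define PID_BITS 4   /* assumed width of the per-node pid field */

int make_id(int node, int pid) { return (node << PID_BITS) | pid; }
int id_node(int id) { return id >> PID_BITS; }
int id_pid(int id)  { return id & ((1 << PID_BITS) - 1); }
```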
