Departamento de Computación Facultad de Ciencias Exactas y Naturales
Universidad de Buenos Aires TECHNICAL REPORT
Issues about routing in real-time massively parallel computers Gabriel Wainer, Gustavo Pifarré
Report n.: 98-008
Pabellón 1 - Planta Baja - Ciudad Universitaria (1428) Buenos Aires Argentina http://www.dc.uba.ar
Title:
Issues about routing in real-time massively parallel computers
Authors:
Gabriel Wainer, Gustavo Pifarré
E-mail:
[email protected],
[email protected]
Report n.:
98-008
Key-words: Massively parallel computers, routing, real-time systems
Abstract:
At present the number of real-time applications is growing steadily. Recently, massively parallel computers have begun to be used to develop high-performance real-time systems. Their multiplicity of processors and internode routes gives them the potential for high performance and reliability. However, the problems of the communications subsystem are aggravated by the fact that interprocessor communication must respect timing requirements. In this work we survey the main problems and solutions in this field. We present several strategies that provide end-to-end guarantees, and other techniques that support runtime scheduling. Integrating both kinds of strategies allows messages to be transmitted in a timely fashion.
Issues about routing in real-time massively parallel computers Gabriel A. Wainer
[email protected]
Gustavo D. Pifarré
[email protected]
Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Departamento de Computación
1. Introduction
At present, the number of real-time systems is growing steadily. In these systems, the moment when the results are obtained is as important as the logical correctness of the computations. The growth can be seen in numerous applications, such as aerospace and defense systems, industrial automation and control, etc. Real-time systems pose problems for computer architectures, operating systems, fault tolerance and evaluation tools, as system tasks must be completed before certain deadlines. Reliability is also a crucial issue, because a failure can cause economic disasters or loss of human lives.
Recently, massively parallel computers have begun to be used in high-performance real-time systems. Their multiplicity of processors and internode routes gives them the potential for high performance and reliability. However, the problems of the communications subsystem are aggravated by the fact that interprocessor communication must respect timing requirements.
In this article we present several problems existing in real-time multiprocessors, focusing on real-time routing. The work is organized in 10 sections. Section 2 introduces some basic terminology to be used later. Section 3 describes the most widely used topologies. Section 4 explains some basic features of routing algorithms. Section 5 describes several ideas from real-time theory. Section 6 discusses the issues to be considered in real-time routing. Section 7 describes a widely used model for communication traffic. Section 8 explains link assignment and flow control in real-time routers. Finally, section 9 is devoted to some end-to-end routing strategies.
2. Basic terminology
In this section we include some basic terminology that will be used throughout the paper. A parallel computer is a network of processors, coprocessors, memory banks, switches and links [Fel91]. The network can be thought of as a set of nodes connected by point-to-point dual simple links. A node in the system can have several incoming and outgoing links connected to it, and these links can operate in parallel. Each node contains one or more switches connecting incoming links to outgoing links. Internode communication is accomplished by sending messages between the source and the destination nodes across links in the network, often passing through multiple switches on the way. The path followed is determined by a routing algorithm. To this end, switches contain routing logic (or simply, a router), which is usually integrated into the communications network.
Figure 1 - A multihop network.
Each processor can fetch its own instructions from a local memory (a MIMD machine) or execute the same instruction as the other processors, broadcast from a central control processor (a SIMD machine). Parallel computers can also support a shared memory or a message passing model for processor communication. In the first case, processors communicate by reading and writing a global memory. In message passing computers, by contrast, processors communicate by exchanging messages, which can be long and can invoke complex actions. The injection model is said to be static when processors inject their messages into the network at the same moment. Otherwise, the model is dynamic (i.e., each processor injects messages at arbitrary times).
There are many techniques to connect nodes in a massively parallel computer. They can be classified into bus-based computers, switch-based interchange networks, and interprocessor direct links. The buses can be simple or multiple. With multiple buses, each one can be connected to the same or to different node subsets. In this work, we will focus on switch-based and processor-based architectures.
A routing algorithm is said to be minimal if each hop brings the message closer to its destination. The technique is called adaptive if there are different ways to route messages between nodes (the decision is made considering faulty links, congestion, etc.). When a routing algorithm chooses a route depending only on the source and destination nodes, it is said to be oblivious. Communication between user-level entities in this system can be either connection-oriented or connectionless. In the connection-oriented case, it can be either message stream or byte stream oriented.
In packet-switching routing, messages are divided into constant-size units (called packets). Packets move forward at every hop of the path, and are stored in the nodes or the links. If the messages are of variable size, wormhole routing can be used.
Here, messages are divided into a sequence of constant-size flits (flow-control units). The first flit (the head) holds the destination address, and it is used to determine the message path. Once a link is occupied by the head, it cannot be used for other messages until the last flit has left it. In virtual cut-through routing [Ker79], packets arriving at an intermediate node are forwarded to the next node in the route without buffering if a circuit can be established to the next node. Virtual cut-through can reduce the communication delay, as messages are delivered without being buffered (unless the links and/or nodes in their routes are blocked or faulty).
When virtual channels are used, logical channels are implemented using individual flit buffers for each logical channel at each node. A logical channel will have the buffers from source to destination allocated to the flits of a message in a FIFO manner. A message can be broken into packets bounded by a maximum size, and the packets are assigned to a single virtual channel. However, as flits in different logical channels can be sent in any order, an arbiter is needed. The response time for a channel on a link is the sum of the waiting time, the transmission time, and the propagation time.
3. Topologies
In recent years, several network topologies have been successfully used for switch-based and processor-based massively parallel computers. In this section we briefly review the main aspects of the most common topologies. For a detailed discussion, see [Cyp93]. From now on, $N$ denotes the number of nodes in the network.
I. Switch-based topologies
• Omega networks: a multi-stage network connecting $N$ inputs to $N$ outputs through $\log N$ switch stages. Each stage has $N/2$ switches, each one with two inputs and two outputs.
• Benes networks: also a multi-stage network, with different sizes for the switches. If each switch has two inputs and two outputs, the network has $2(\log N) - 1$ stages, each one with $N/2$ switches.
• Butterflies: this topology has an interconnection pattern of degree 4. There are $N = k \cdot 2^k$ processors, where $k = 2^j$. Each node is a pair $(C,R)$, and the nodes form an array of $2^k$ by $k$. Each node is connected with those in $(C+1 \bmod k,\ R)$, $(C+1 \bmod k,\ R^{(C)})$, $(C-1 \bmod k,\ R)$, and $(C-1 \bmod k,\ R^{(C-1 \bmod k)})$, where $R^{(i)}$ denotes $R$ with its $i$-th bit complemented.
• Multibutterflies: here, a set of butterflies is joined and permuted. The routing can be done in $O(\log N)$, and it is fault-tolerant.
II. Processor-based topologies
• Tree: in these networks the nodes are organized hierarchically. Each node (except the root) has a unique predecessor and several children. The most common variant is the complete binary tree.
• Cycles: in this topology each node is connected with only two neighbors, in a ring fashion. These topologies are adequate for a small number of processors. A common variation is the chordal ring, where each node is also connected to two other nodes, improving fault tolerance.
• Mesh: the nodes are arranged in a $d$-dimensional cube. Each node is connected to its two neighbors in each of the $d$ dimensions. The most common meshes have two or three dimensions. The nodes on the borders can be connected to fewer nodes, or the borders can be wrapped around, forming a torus.
• Pyramids: a pyramid consists of $(1/2)\log N + 1$ levels, where the $i$-th level, $i \in [0, (\log N)/2]$, is a mesh with $N/4^i$ nodes. Each node has four connections to the level below, four to close neighbors in the same level, and one to the level above.
• Hypercubes: here, $N$ is a power of 2. Every node in the hypercube has $n = \log N$ links, connecting it with the $n$ nodes whose addresses differ from its own in exactly one bit.
• Shuffle-exchange: here, $N$ is a power of two. Let $p = \log N$. Each node $X = (X_{p-1}, \ldots, X_0)$ is connected to the nodes $X^{(0)}$ (exchange connection), $(X_{p-2}, X_{p-3}, \ldots, X_0, X_{p-1})$ (shuffle connection), and $(X_0, X_{p-1}, \ldots, X_1)$ (unshuffle connection).
• de Bruijn computers: here, $N$ is a power of two. A node is connected to the nodes $(0, X_{p-1}, \ldots, X_1)$ and $(1, X_{p-1}, \ldots, X_1)$ (forward connections), and also to the nodes $(X_{p-2}, \ldots, X_0, 0)$ and $(X_{p-2}, X_{p-3}, \ldots, X_0, 1)$ (backward connections).
• Cube-Connected Cycles (CCC): here, we have $N = k \cdot 2^k$ nodes, where $k = 2^j$. Each node is a pair $(C,R)$, and the nodes form an array of $2^k$ by $k$. Each node is connected with its left and right neighbors, and in each column $i$ we connect the processor pairs in rows that differ in the $i$-th bit; i.e., $(C,R)$ is connected with $(C+1 \bmod k,\ R)$, $(C-1 \bmod k,\ R)$ and $(C,\ R^{(C)})$.
• Hexagonal mesh: it is composed of $N = 3n(n-1)+1$ nodes, such that each node has six neighbors. The dimension $n$ is the number of nodes on its peripheral edge. The mesh can be viewed as composed of $2n-1$ rows ($d_0$ direction), $2n-1$ left-up rows ($d_1$ direction), and $2n-1$ left-down rows ($d_2$ direction). In each direction, we label the rows from the top as $R_0$ to $R_{2n-2}$. The net can be wrapped using a C-wrapping connection, in which the last node in $R_i$ is connected to the first node in $R_{[i+n+1]_{2n-1}}$ (the row index is taken modulo $2n-1$) for each $i$ in each of the three partitions [Che90].
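Several of the connection rules above are simple bit manipulations on node addresses. The following Python sketch illustrates three of them: the hypercube neighbors, the shuffle-exchange neighbors, and the node count of a hexagonal mesh. The function names are ours and purely illustrative; node addresses are assumed to be non-negative integers whose binary digits are the bits $X_{p-1}, \ldots, X_0$.

```python
def hypercube_neighbors(node: int, n: int) -> list[int]:
    """Neighbors of `node` in an n-dimensional hypercube (N = 2**n nodes).

    Each neighbor differs from `node` in exactly one address bit.
    """
    return [node ^ (1 << bit) for bit in range(n)]


def shuffle_exchange_neighbors(node: int, p: int) -> dict[str, int]:
    """Exchange, shuffle and unshuffle neighbors of a p-bit node address."""
    mask = (1 << p) - 1
    return {
        "exchange": node ^ 1,                                   # complement bit 0
        "shuffle": ((node << 1) & mask) | (node >> (p - 1)),    # rotate left
        "unshuffle": (node >> 1) | ((node & 1) << (p - 1)),     # rotate right
    }


def hex_mesh_size(n: int) -> int:
    """Number of nodes of a hexagonal mesh of dimension n: 3n(n-1)+1."""
    return 3 * n * (n - 1) + 1


if __name__ == "__main__":
    print(hypercube_neighbors(0b101, 3))
    print(shuffle_exchange_neighbors(0b110, 3))
    print(hex_mesh_size(3))
```

For instance, node 101 in a three-dimensional hypercube is adjacent to 100, 111 and 001, one neighbor per address bit.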
4. Basic features of routing algorithms
When studying routing algorithms, some important features should be considered:
• Network latency: a routing algorithm should deliver messages to their destinations with low latency;
• Deadlock freedom: the algorithm should provide some mechanism to avoid or recover from deadlocks;
• Livelock freedom: a message must not be delayed forever. It should eventually arrive at its destination;
• Starvation freedom: a node starves if it has a packet to inject and is never allowed to do so. This should be avoided;
• Fault tolerance: the routing scheme should tolerate faults in processors and switches;
• Synchronization overhead: it should be kept to a minimum;
• Easy implementation: routers should be implementable with simple combinational logic.
The various routing and switching schemes differ in terms of delivery latency, bandwidth utilization, and predictability. The hardware should be flexible, enabling the system to tailor communication policies to the requirements of the application and to network conditions. Most static routing algorithms generate minimal routes. Adaptive schemes, instead, consider non-minimal routes to avoid network congestion.
Traditional packet switching requires an arriving packet to be stored completely before it is transmitted to the subsequent node. In contrast, cut-through switching schemes allow an incoming packet to begin transmission before it has been completely received. If the packet finds a busy outgoing link, virtual cut-through switching buffers the blocked packet at the node, whereas wormhole switching stalls the packet in the network until the link becomes available. In cut-through switching, messages compete for virtual channels and bandwidth. Hence, a virtual channel assignment strategy decides which arriving flits can use the virtual channels; once a message is assigned to a virtual channel, it occupies the virtual channel until the last flit of the message leaves the node. The arbitration function of a router determines how the bandwidth is shared: the arbiter selects the next flits to be sent on the physical channels.
There are several well-known routing algorithms for massively parallel computers, including Random Routing [Val81], Fluent Machine [Ran85], Adaptive Minimal Routing [Kon90], Fully Adaptive Routing [Dal86], Chaos [Kon90b], Exchange Model [Nga89], Routing by Sorting [Nas81] and others. Further information about these and other approaches can be found in [Fel91] and [Ni93].
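To give a feeling for the latency difference between packet switching and cut-through switching described above, the following Python sketch compares both under an idealized, contention-free model. The formulas are a common first-order approximation and are not taken from this report; the function names, the header size and the bandwidth used in the example are assumptions chosen only for illustration.

```python
def store_and_forward_latency(hops: int, packet_bits: int, bandwidth_bps: float) -> float:
    """Packet switching: the whole packet is received and retransmitted at each hop."""
    return hops * packet_bits / bandwidth_bps


def cut_through_latency(hops: int, packet_bits: int, header_bits: int,
                        bandwidth_bps: float) -> float:
    """Cut-through: only the header is delayed per hop; the body is pipelined.

    Assumes no blocking, i.e. every outgoing link is free when the header arrives.
    """
    return hops * header_bits / bandwidth_bps + packet_bits / bandwidth_bps


if __name__ == "__main__":
    # Hypothetical values: 4 hops, 1024-bit packets, 32-bit header, 1 Mbit/s links.
    print(store_and_forward_latency(4, 1024, 1e6))
    print(cut_through_latency(4, 1024, 32, 1e6))
```

Under this model the store-and-forward latency grows with the product of distance and packet length, while the cut-through latency grows with their sum. When a packet blocks, virtual cut-through degrades towards the store-and-forward figure.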
Figure 2. Basic structure of routers: (a) oblivious 4x4, (b) mesh, (c) cut-through and packet-switching [Li94a].
5. Real-Time systems concepts
As previously stated, in real-time systems timing correctness is as important as logical correctness. Such control systems must meet high standards of reliability, availability and safety, as the cost of a failure can exceed the initial investment in the computer and the controlled object. To prevent such failures, system design must guarantee performance [Kop89].
One way to meet a system's timing constraints is to rely on a real-time scheduler. The scheduler should analyze system predictability, but the diversity of restrictions in these systems makes this an NP-hard problem. Several approaches are used to lower the complexity of the guarantee tests. Many scheduling algorithms rely on a well-known task model, usually called the periodic task model. It considers two different kinds of tasks to schedule: periodic and sporadic (aperiodic) ones. Periodic tasks have a continuous series of regular invocations (the time between invocations is called the period), and a worst-case execution time (WCET). The task deadline is at the beginning of the next period. Aperiodic tasks have an arbitrary arrival time and deadline, and a known WCET. When they are invoked, they execute only one instance.
At present there is a diversity of local scheduling solutions (see, for instance, [Che88], [Sta94], [Wai95]). Some of the most widely used approaches include:
• Rate Monotonic (RM) [Liu73, Leh89]: it is one of the most traditional schemes. It is used to schedule periodic independent tasks, which are assigned preemptive fixed priorities according to their activation rate (the higher the rate, the higher the priority). Liu and Layland [Liu73] showed that this algorithm is optimal among fixed-priority algorithms for periodic independent tasks. The basic algorithm assumes that the deadline of a task is equal to its period. An important result is that the worst-case utilization bound for this algorithm tends to ln 2 for a high number of tasks. In [Leh89] a new test was provided, showing that the algorithm can often successfully schedule task sets with higher utilization.
• Earliest Deadline First (EDF) [Liu73]: it selects the system tasks using variable preemptive priorities, giving higher priority to tasks with the earliest deadlines.
• Earliest Deadline (ED) [Der74]: the highest priority is assigned to the task with the earliest deadline (fixed priorities without preemption).
• Least Laxity First (LLF) [Mok78]: it is similar to Earliest Deadline, but considers the task laxity, that is, the time remaining until the deadline minus the remaining execution time. Tasks with smaller laxity are chosen to execute first.
• Least Slack First (LSF) [Mok78]: the highest priority is assigned to the task with the least laxity, determined statically at task arrival and not adjusted afterwards.
• Deadline Monotonic (DM) [Leu82]: it assigns static priorities according to the relative deadlines (the shorter the deadline, the higher the priority), and does not require deadlines to be equal to the periods: the deadline of a task can be less than its period.
In all these cases, schedulability tests are provided to determine whether the task set is schedulable, that is, whether the system tasks can meet their deadlines in a deterministic fashion.
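As an illustration of such schedulability tests, the following Python sketch implements the two classical utilization-based tests of [Liu73] for independent periodic tasks with deadlines equal to their periods: the sufficient bound $n(2^{1/n}-1)$ for Rate Monotonic, which tends to ln 2 as $n$ grows, and the bound of 1 for preemptive EDF. Tasks are represented as hypothetical (WCET, period) pairs; the function names are ours.

```python
import math


def utilization(tasks: list[tuple[float, float]]) -> float:
    """Total processor utilization of a set of (wcet, period) tasks."""
    return sum(wcet / period for wcet, period in tasks)


def rm_bound(n: int) -> float:
    """Liu and Layland worst-case utilization bound for n tasks under RM."""
    return n * (2 ** (1 / n) - 1)


def rm_schedulable(tasks: list[tuple[float, float]]) -> bool:
    """Sufficient (not necessary) RM test: a False result is inconclusive."""
    return utilization(tasks) <= rm_bound(len(tasks))


def edf_schedulable(tasks: list[tuple[float, float]]) -> bool:
    """Preemptive EDF with deadline = period: schedulable iff utilization <= 1."""
    return utilization(tasks) <= 1.0


if __name__ == "__main__":
    light = [(1, 4), (2, 8), (1, 10)]   # utilization 0.6
    heavy = [(2, 4), (2, 5)]            # utilization 0.9
    print(rm_schedulable(light), edf_schedulable(light))
    print(rm_schedulable(heavy), edf_schedulable(heavy))
    print(rm_bound(1000), math.log(2))
```

A task set that fails the RM bound is not necessarily unschedulable under RM; an exact analysis such as the one in [Leh89] is needed to decide those cases.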
6. Real-Time multiprocessors
When considering the features of real-time systems, we can see that one of the main problems is the provision of predictable responses. This is also valid for interprocessor communication. Unpredictable delays in message delivery can affect the completion time of tasks, and their deadlines can be missed. At present, some work has been done in real-time communications, but mainly focusing on LANs, distributed systems and WANs. In the first case, there are several techniques for medium access control protocols in LANs. There are techniques for CSMA/CD networks, such as the virtual-time method [Zha87] and windowed protocols [Dol89]. There are also some protocols based on token ring, such as the ones described in [Ram90], [Mal94] or [Str89]. In the case of routing for Wide Area Networks, we can mention the DASH [And88] and Tenet [Fer89] projects.
Several other projects have addressed the use of multiprocessors for real-time systems, but they do not address the problems existing in the communications networks. For instance, MARS (Maintainable Real-Time System) is a fault-tolerant multiprocessor for process control. It is a distributed memory system, designed as a network of clusters consisting of computing components. Each component consists of processors, memories and input/output channels, connected through a bus architecture [Kop89]. The bus is accessed through a TDMA (Time-Division Multiple-Access) strategy, providing deterministic, load-independent and collision-free media access. The bus is assigned to the components of the cluster, one slot per component. Predictability is provided, as the schedules for tasks and messages are computed by an off-line scheduler. The system provides fault tolerance, because when a component fails it can be replaced by another cluster. The main problem of this architecture is that predictability is ensured only inside one component; no network schedulability is provided.
Another early approach was the SIFT project [Wen78]. Its structure consists of a pool of processors that store their results in a central memory. A processor and its memory are connected by a conventional high-bandwidth connection. The I/O processors connect to the input and output units of the system. The interconnection topology is a set of buses, each one with one bus controller.
A more advanced solution was presented in the HARTS project [Shi91]. HARTS uses a hexagonal mesh of clusters, each one consisting of application processors, a system controller, a shared memory, an Ethernet processor, and a Network Processor (designed as the interface to the interconnection network). A hexagonal mesh was chosen as it meets the requirements of fixed connectivity and planar architecture (for easy VLSI and communication implementation, scalability, fault tolerance and ease of connection). To send a message, the source computes the shortest paths to the destination. The algorithm is non-oblivious, and several routes can be chosen. The topology allows efficient broadcast of a message, sending it ring by ring toward the periphery using a two-phase algorithm [Che90].
A fault-tolerant algorithm is also provided. If a message reaches a node where all the links towards the destination are faulty, a bit in the header is set, the fault-tolerant algorithm is activated, and it remains in control until the message has passed the faults. When the optimal links are faulty, the processor looks for non-faulty links, searching counterclockwise. When the distance to the destination becomes less than the distance the message had when it started fault-tolerant routing, the message leaves fault-tolerant mode. I/O devices are clustered, and a controller manages access to the devices of each cluster. Each I/O controller is placed at the center of one triangle derived from the hexagonal mesh, giving three possible ways of access to each controller.
Figure 3. Placement of I/O controllers in HARTS [Shi91].
Simple store-and-forward switching schemes are not suitable for real-time routing because real-time applications normally require short response times. Hence, HARTS supports fast switching methods, such as virtual cut-through and wormhole routing. The algorithms presented in [Shi91] do not address the problems that arise when routing messages with timing constraints. To overcome these problems, the project was later extended with new routing algorithms providing guarantees, which will be presented later in this work.
As we can see, a variety of architectures can be employed, depending on the application performance requirements. Although bus and ring networks are useful for small-scale systems, complex systems can obtain better results using the higher bandwidth available in processor-based or switch-based topologies. These networks also offer several routes, improving system fault tolerance. However, it is complicated to guarantee end-to-end performance.
Recent advances in VLSI technology have made it possible to develop new architectures suitable for real-time applications. In these cases, some important basic issues must be considered [Sta88]:
• Interconnection topology for processors and I/O: in real-time systems, extensive I/O and high-speed data processing are needed to allow the interaction between the computer and the controlled system. Hence, the topology should integrate processors and I/O. The topology should also have homogeneity (to allocate tasks to any node based solely on deadlines and availability of resources), scalability (to change computational power without redesigning any of the nodes), survivability (to have several minimal paths, increasing the survivability of the network in case of node/link failures), and experimental flexibility (to emulate various architectures by disabling some of the links in the chosen topology) [Sta88b].
• Fast, reliable and time-constrained communications: most of the existing communication models may provide fast communications, but they cannot tell how long it will take to deliver a particular message. Hence, it is difficult to guarantee all real-time messages.
• Architectural support for error handling: the hardware should support speedy error detection, reconfiguration and recovery, including self-checking circuitry, maintenance processors, system monitors, voters, etc.
• Architectural support for scheduling algorithms: architectures may need to support fast preemptability, sufficient priority resolution, efficient support for data structures such as priority queues, and sophisticated I/O and communication scheduling.
• Architectural support for real-time operating systems.
• Architectural support for real-time languages.
At present, the most crucial problems in real-time networks are related to how to provide guarantees in end-to-end routing. To make this possible, it is also necessary to provide runtime support in the network. Therefore, the router should help to meet the goal of timely message transmission. Some of the problems include, for instance, the limited buffering of messages in the network, or the FIFO characteristics of the channels.
The techniques currently used in massively parallel computers do not support real-time communications, which makes it difficult to support real-time applications. Other features, such as adaptive routing, improve average latency but complicate the effort to provide predictable communication. To be successful, the real-time communications subsystem must be able to predictably satisfy individual message-level timing requirements. Hence, the concept of time should be introduced. This includes ensuring the schedulability of synchronous and sporadic messages, as well as ensuring that the response time requirements of asynchronous messages are met.
For these reasons, in real-time networks, the connection-oriented service is considered more suitable than byte-stream models. Cut-through routers can reduce latency by avoiding packet delay at intermediate nodes (an incoming packet goes directly to the next node in its route if an outgoing link is available).
To guarantee transmission times, the system requires information about sources other than the one originating the message. One attractive feature of multihop networks is their ability to provide fault tolerance. This is important in real-time systems, which are expected to operate for long periods without maintenance.
To bound worst-case latency, links and buffers must be reserved a priori based on the application's anticipated traffic load. In this way, the network can provide end-to-end performance guarantees through link scheduling and buffer allocation policies. Mean-value analysis is inadequate for real-time applications, as worst-case communication delays often play an important role.
Another important factor is the dependence on the interconnection topology. As virtual cut-through and packet switching use physical queues in each node, priority-based scheduling can be done among the packets. In contrast, in wormhole routing, the logical queues span multiple nodes. Priority assignment of virtual channels to incoming packets improves predictability, because the arbitration policies can allocate bandwidth depending on packet deadlines or priority. The priority resolution will depend on the number of virtual channels. If packets at different priority levels share virtual channels, special care must be taken about the blocking time incurred when a lower priority packet holds resources needed by higher priority traffic.
Separate buffers for each priority level can be provided, at a significant cost for fine-grain resolution. Instead, fine-grain priorities (such as deadlines) can be used within packet queues. Another possibility is to include a single priority queue for each output link. Packet switching enables the router to schedule traffic to provide guarantees. For example, suppose a guaranteed packet enters a node before its deadline. The scheduler can hold the packet to avoid overloading the subsequent nodes.
In cut-through networks, a virtual channel assignment strategy decides which flits can use the virtual channels. The most conventional flow control scheme is the FCFS strategy: a virtual channel is assigned to the message with the earliest arrival time in the waiting queue. The arbiter polls the virtual channels in a round-robin fashion. Nevertheless, the timing constraints of a message are not considered, so this scheme cannot guarantee predictable routing. Besides, it is difficult to carry the timing properties of a message in the header and make flow control decisions based on them, as the size of the node buffers and the cost of the router increase. A set of priority bits may be associated with each virtual channel, and the arbiter could use the priority to decide which channel should transmit its flits next.
The router alone cannot satisfy the performance guarantee requirements. To do so, it is necessary to control the network access times and bandwidth allocation. The use of deadline-based scheduling can offer bounds on worst-case latency for time-constrained traffic. Even so, real-time applications also include traffic that does not have stringent performance requirements. For example, good average delay may suffice to transmit certain status and monitoring information. Therefore, real-time traffic is typically categorized in two basic classes: guaranteed messages, which require delivery before their deadlines, and best-effort messages, which do not need delivery-deadline guarantees. Most existing systems are best-effort schemes: the system tries to ensure that most messages meet their deadlines, but gives no guarantee about delivery time. On the other hand, if the system has some information about the arrival pattern of messages, it can guarantee the delivery time.
Traditional approaches use packet switching and static routing for guaranteed messages. As previously stated, it is difficult to control packet delivery time under adaptive routing and cut-through switching. However, best-effort packets could use these techniques, potentially improving average latency. Best-effort traffic should be able to use low-latency communication techniques without interfering with the performance guarantees of time-constrained packets.
7. A communication traffic model
In this section we analyze the definition of a model for communication traffic, which is especially important for real-time systems. Most routing techniques assume that a significant amount of the traffic will be periodic. This occurs because most tasks are periodic, hence most traffic requires deadlines to be met at regular periods. However, some traffic may occur in bursts, which means that it may arrive sooner than the beginning of its period.
One common abstraction used for guaranteed time-constrained communication is the real-time channel [Cru87]. Real-time channels are connection-oriented and restricted to unidirectional communication (a bidirectional connection can be composed using a pair of channels). The message generation process is specified with a linear bounded arrival model. Here, the arrival process of a source has three worst-case parameters: $S$, the message size (in bytes); the message rate $R = 1/I$, measured in messages transmitted per second (excluding bursts); and the burst size $B$. The three parameters are maximum values. $R$ defines a bound on the message generation rate (excluding bursts), while $I$ is the minimum period between messages.
In any interval of length $t$, the number of generated messages may not exceed $B + tR$, and the length of each message may not exceed $S$. $B$ puts a bound on the short-term variation in the message generation rate, and partially determines the buffer space requirement. This parameter is significant for networks that buffer messages, since it determines the number of buffers needed at intermediate nodes along a message path. It is not significant for wormhole networks, which buffer only a small fraction of a message rather than entire messages. Non-periodic message generation can be represented using an estimate of the worst-case interarrival time and the average generation rate.
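The linear bounded arrival constraint can be checked on a trace of generation times with a few lines of Python. The sketch below is illustrative only: the function name and the use of integer time units are our assumptions. It checks every closed interval that starts and ends at a message generation, which is sufficient because any other interval containing the same messages is longer.

```python
def conforms_to_arrival_bound(arrival_times: list[int], min_interval: int,
                              burst: int) -> bool:
    """Check that no interval of length t contains more than B + t*R messages.

    `min_interval` is I (so R = 1/I) and `burst` is B. Times and I must use
    the same unit; integers avoid rounding problems.
    """
    times = sorted(arrival_times)
    for i in range(len(times)):
        for j in range(i, len(times)):
            count = j - i + 1              # messages generated in [times[i], times[j]]
            span = times[j] - times[i]     # interval length t
            # count <= B + t / I, rearranged to avoid the division
            if (count - burst) * min_interval > span:
                return False
    return True


if __name__ == "__main__":
    # Hypothetical traces, times in milliseconds, I = 10 ms.
    print(conforms_to_arrival_bound([0, 10, 20, 30], 10, burst=2))
    print(conforms_to_arrival_bound([0, 0, 10, 20], 10, burst=2))
    print(conforms_to_arrival_bound([0, 0, 0], 10, burst=2))
```

In the second trace two messages are generated at the same instant, which is tolerated because $B = 2$; the third trace exceeds the burst size and is rejected.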
A message $m_i$ is physically generated at time $t_i$, but has an associated logical generation time $l(m_i)$, defined as:
\[ l(m_0) = t_0, \qquad l(m_i) = \max\{\, l(m_{i-1}) + I,\; t_i \,\} \]
If d is the end-to-end delay for the channel, the system must guarantee that any message $m_i$ is delivered to the destination by time $l(m_i) + d$. In other words, when the interarrival time between messages is at least I, the system guarantees that each message in the channel incurs a delay of at most d seconds. However, a message that arrives in a burst (where the interarrival time is less than I) may suffer a larger delay, because the guarantee is given with respect to the logical arrival time. Hence, arrivals at each node have to be regulated to prevent burst arrivals on one channel from affecting the guaranteed messages of other channels. Therefore, the network should not admit a new connection unless it can reserve sufficient buffer and bandwidth resources without violating the requirements of existing connections. A connection establishment procedure must decompose the end-to-end delay d into local delays $d_j$ for each hop such that $d_j \le I$ and
\[ \sum_{j=0}^{n} d_j \le d. \]
Respecting the local delay bounds, a message $m_i$ has a logical arrival time $l_j(m_i) = l_{j-1}(m_i) + d_{j-1}$ for $j > 0$ at node j in its route, where $j = 0$ corresponds to the source node. Link scheduling ensures that message $m_i$ arrives at node j by time $l_{j-1}(m_i) + d_{j-1}$, the local deadline at node $j-1$. However, message $m_i$ could reach node j earlier than this time, depending on the delays actually experienced at previous hops.
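The two recurrences of this section (logical generation time at the source and logical arrival time along the route) can be sketched as follows. The function names and the list-based representation are assumptions made for illustration only.

```python
def logical_times(arrivals, I):
    """Logical generation times: l(m0) = t0, l(mi) = max(l(m(i-1)) + I, ti)."""
    logical = []
    for t in arrivals:
        if not logical:
            logical.append(t)                       # first message: l(m0) = t0
        else:
            logical.append(max(logical[-1] + I, t))
    return logical


def hop_logical_arrivals(l0, local_delays):
    """Logical arrival time at each node: l_j = l_(j-1) + d_(j-1) for j > 0.

    l0 is the logical generation time at the source (node 0) and
    local_delays[j] is the local delay bound d_j of hop j.
    """
    times = [l0]
    for d in local_delays:
        times.append(times[-1] + d)
    return times
```

With I = 4, messages generated at times 0, 1, 2 and 10 get logical generation times 0, 4, 8 and 12: the two messages that arrive in a burst are spaced out logically, so their delivery guarantee is measured from times 4 and 8 rather than from 1 and 2.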
Figure 4 - Parameters of the traffic communication model [Kan94].
8. Link assignment and flow control
In this section we examine the most common strategies used to control the flow of traffic in real-time massively parallel routers, analyzing their main advantages and disadvantages. The focus is on cut-through schemes and virtual channel assignment. Most of the strategies presented can also be used in packet-switched networks.
8.1. First Come First Served (FCFS)
This scheme assigns a virtual channel to the first message that arrives at the input queue. The arbiter polls the virtual channels in a round-robin fashion, allocating the link if there is a flit in the input buffer and there are free buffers in the corresponding output link. As previously stated, this strategy cannot guarantee timely delivery of messages in the network. Oldest-packet-first arbitration improves performance over the round-robin scheme [Dal92]. However, if the generation time of each message is used for flow control, the headers incur extra overhead, the buffer size of the virtual channels must be increased, and timing hardware complicates the router design.
8.2. Tightest Deadline First (TDF)
The TDF algorithm [Li94a] assigns priorities to messages according to deadline tightness factors. The tightness of a message is computed as $T_i = \mathit{Deadline}_i / \mathit{Latency}_i$. Let $N = 2^n$ be the number of virtual channels per link. $T_i$ is mapped to a priority $p \in [1, 2^n]$ as follows:
\[ p = \begin{cases} N + 1 & \text{if } T_i < 1 \\ N + 1 - T_i & \text{if } T_i \in [1, N+1] \\ 1 & \text{if } T_i > N + 1 \end{cases} \]
A message can request any virtual channel whose number is lower than its priority. A high-priority message can therefore request a larger number of virtual channels, which increases the probability that messages with tight deadlines meet them.
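The TDF mapping can be written directly from its three cases. This is a minimal sketch with a hypothetical function name; the text does not say how fractional tightness values are rounded, so the value is returned unrounded here.

```python
def tdf_priority(deadline: float, latency: float, N: int) -> float:
    """Map the tightness T = deadline / latency to a TDF priority.

    N is the number of virtual channels per link. Rounding of fractional
    results is not specified in the text and is left to the caller.
    """
    T = deadline / latency
    if T < 1:
        return N + 1        # the deadline is shorter than the latency
    if T > N + 1:
        return 1            # very loose deadline: lowest priority
    return N + 1 - T        # tighter deadlines map to higher priorities
```

For N = 4 virtual channels, a message whose deadline is three times its latency is mapped to priority 2, while a message whose deadline is ten times its latency is mapped to priority 1.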
8.3. Least-laxity first (LLF)
In [Li94b], the laxity up to the message deadline is used as a priority. The laxity is the time remaining up to message delivery, and it can be estimated as $Lx_i = d_i - \mathit{latency}_i$. Let N be the number of virtual channels, W the estimated time a message stays blocked before its priority is incremented by one, and $lx_i = Lx_i / W$. The priority mapping function is:
\[ p = \begin{cases} N + 1 & \text{if } lx_i < 1 \\ N + 1 - lx_i & \text{if } lx_i \in [1, N+1] \\ 1 & \text{if } lx_i > N + 1 \end{cases} \]
The number of priority levels is the same as the number of virtual channels, and a message can request any virtual channel with priority less than or equal to that of the message. This approach was analyzed for large-scale cut-through networks. The results showed that LLF reduces the deadline missed ratio with respect to the conventional FCFS scheme. With this approach, increasing the number of virtual channels increases the network throughput, but does not improve the guarantee ratio. By limiting the number of virtual channels a message can request, LLF increases the chance that a message with smaller laxity meets its deadline. However, messages with low priorities can be blocked by higher-priority messages for an indefinite period.
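A sketch of the LLF mapping and of the set of virtual channels a message may request is given below. The function names are hypothetical, and the sketch assumes that the priority of a virtual channel is simply its number (1 to N), which the text does not state explicitly.

```python
def llf_priority(deadline: float, latency: float, W: float, N: int) -> float:
    """Map the normalized laxity lx = (deadline - latency) / W to a priority."""
    lx = (deadline - latency) / W
    if lx < 1:
        return N + 1
    if lx > N + 1:
        return 1
    return N + 1 - lx


def requestable_channels(priority: float, N: int) -> list:
    """Virtual channels whose number does not exceed the message priority.

    Assumption: channel c (1..N) has priority c.
    """
    return [c for c in range(1, N + 1) if c <= priority]
```

With N = 4 and W = 10, a message with laxity 2 gets priority 5 and may request all four channels, whereas a message with laxity 40 gets priority 1 and may request only channel 1. This illustrates how restricting the channels available to lax messages leaves the others free for urgent ones.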
8.4. Priority Climbing (PC)
In this scheme [Li94b], the priority of a message is adjusted according to its current state, increasing the priority of blocked messages. Hence, high-priority messages cannot block low-priority ones for an indefinite period. If each physical link has $2^n$ virtual channels, n bits are used to represent all the priority levels. Each message carries a one-byte priority value, where the n most significant bits represent the priority and the remaining bits count blocking time. The arbiter polls the virtual channels, and each time all the virtual channels have been polled once, it increments by one the count of the blocked messages. If the count exceeds its maximum value, the priority is increased by one. Hence, the priority increases automatically as the laxity of the message is reduced. The initial priority in the TDF and LLF schemes reflects the time available up to the message deadline. As a message is transmitted in the network, its timing property is adjusted by counting the number of times the message is blocked in the network. Simulation results showed that this scheme improves the system performance over the LLF scheme. It prevents starvation of low-priority messages, and increases the probability that a message meets its deadline.
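The one-byte priority field can be illustrated with a small sketch. The function name `climb` is hypothetical, and two details are assumptions not stated in the text: the count is reset to zero when the priority is incremented, and the field saturates once the highest priority is reached.

```python
def climb(byte_value: int, n: int) -> int:
    """Update the one-byte priority field of a message that stayed blocked
    during a full polling round.

    The n most significant bits hold the priority; the remaining 8 - n bits
    hold the blocking count.
    """
    count_bits = 8 - n
    count_max = (1 << count_bits) - 1
    priority = byte_value >> count_bits
    count = byte_value & count_max
    if count < count_max:
        count += 1                      # one more round spent blocked
    elif priority < (1 << n) - 1:
        priority += 1                   # count exceeded its maximum: climb
        count = 0                       # assumption: the count restarts
    # otherwise the field is already at its maximum and stays there
    return (priority << count_bits) | count
```

With n = 2 the field holds 4 priority levels and a 6-bit count, so a blocked message climbs one level after at most 64 polling rounds. Under these assumptions the update is equivalent to a saturating increment of the whole byte, which is why a single adder in the arbiter is enough.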
8.5. Enhanced Priority Climbing (EPC)
With this scheme [Li94b], a channel is assigned to the oldest message in the local queue. Each virtual channel has an associated priority register (see figure 5). When the header of a message leaves the virtual channel, the register remains unchanged and the body flits of the message inherit that priority. If there is a flit blocked in the virtual channel, the priority adjustment scheme modifies the value of the priority register.
Figure 5 - Enhanced priority climbing router [Li94b].
This scheme reduces the chance of being blocked by high priority messages. Compared with the PC scheme it also increases the chance that messages with small laxities can be successfully transmitted. Finally, the additional use of the priority of a message for virtual channel assignment and bandwidth allocation can reduce the deadline missed ratio.
8.6. Rate Monotonic (RM)
Here, the priority assigned to a channel is related to the frequency of messages on that channel [Mut94]. The arbiter allocates the bandwidth to the virtual channel with the highest priority (according to the period). Each virtual channel needs a register to store the priority, and the arbiter must support the priority scheme. Also, each node needs a static table to store the deadlines of all the message sources routed through the network. A message inherits the priority of the message source that generates it. The response time of a message varies with the arrival times of other messages at a node. Even if messages are generated with a fixed interarrival time, the timing of a message at a node cannot be constant. Also, early messages can cause low-priority messages to miss their deadlines.
Figure 6 - Early arrivals in a RM router [Kan94].
To avoid these problems, if a message arrives earlier than its period the priority is adjusted and lowered to prevent the unnecessary blocking of normal low priority traffic.
8.7. Earliest Due Date (EDD)
This strategy is a priority scheme, and it is similar to RM. Here, the criterion to schedule messages is the ascending order of their deadlines. As the priority of a task depends on the relative arrival order of the tasks, guaranteed delivery is difficult.
8.7. The One Bit approach
To support the RM algorithm, several priority bits must be added to the header, and a table should be included at each message source (to determine the priority adjustment of early messages). In [Li94a] a simplified version of the scheme is presented. Instead of carrying a multiple-bit priority, only one bit is used to indicate early arrivals. The message priority is the number of the virtual channel to which the message is assigned. The arbiter allocates the bandwidth to the highest-numbered virtual channel with a flit waiting in the buffer. Early messages adjust their priority locally at each node: when an early message requests a virtual channel, the message is assigned to a channel with a lower number than that of its source. As a result, this scheme does not need a static table in each source to store the deadlines of all the message sources routed through the node.
8.8. Message dropping methods
These methods can reduce the contention caused by late messages, improving system performance. In the case of cut-through routing, dropping a flit is not straightforward, as it can generate additional traffic. When a flit is dropped, the virtual channel is released, but a portion of the message has already been delivered. Hence some virtual channels reserved on the message path will not be released. Furthermore, the receiver of a partially dropped message cannot determine which flit is the last one. Therefore, the tail flit should not be dropped (it is responsible for releasing all the virtual channels). Hence, a reserved virtual channel cannot be used by other messages while waiting for the tail flit. A correct message dropping scheme must guarantee that the receiver of a dropped message receives only one tail flit, that all the virtual channels reserved for a dropped message are released, and that the flits following the first late flit are dropped.
A correct scheme has been proposed in [Li94a]. When a flit is late, a drop-flag (DF) bit is set. The late flit continues its transmission and is dropped at the next node whose DF is set: when a late flit enters a node, if the DF is set, the flit is dropped and the virtual channel is released; if the DF is not set, the late flit is transmitted to the next node, and the virtual channel is released when the flit leaves the node. The on-time flits are dropped at a node where the DF has been set by a previous late flit, and the virtual channel is not released in this case. When the receiver receives a late flit, it is treated as the tail flit of the message. This scheme provides an efficient means of reducing the network congestion produced by late messages. Simulation results showed a significant reduction of the deadline missed ratio with respect to the EPC scheme, especially as the networks increase in size. In addition, by dropping tardy messages, the system can provide acceptable performance at a higher load than EPC without message dropping.
8.9. Multiple traffic strategy
Best-effort and guaranteed traffic have conflicting performance goals that complicate interconnection network design, especially when both are mixed. Kandlur and Shin [Kan94] propose to partition the two kinds of packets onto separate virtual networks. Some virtual channels carry best-effort packets, and the others transport guaranteed packets. In this way best-effort packets can use adaptive routing and cut-through switching without endangering guaranteed packets.
Figure 7 - Structure of the router for multiple traffic strategy [Rex96].
As the separate virtual networks compete for access to the link, fair demand-slotted arbitration should be used: the guaranteed traffic should have the highest priority, and best-effort packets should use the remaining bandwidth. The best-effort virtual network can use adaptive routing, allowing it to circumvent a heavy load of guaranteed packets. Since wormhole switching does not consume buffers, wormhole routing can be used for best-effort packets and packet switching for guaranteed packets. Blocked best-effort packets temporarily stall in their own virtual network and do not consume resources at intermediate nodes.
The logical arrival time of a message, $l(m_i)$, is used by the scheduler, which keeps three queues for each outgoing link. The first queue contains normal real-time packets. Here, a multiclass variation of the EDD algorithm is used, giving higher priority to real-time messages that have reached their logical arrival time and transmitting the message with the smallest deadline. The second queue contains best-effort packets and is scheduled using FCFS. The third queue contains early time-constrained packets and is ordered by logical arrival time.
When a real-time packet arrives, its logical arrival time is computed from the channel to which the packet belongs and the sequence number of the message. Then the deadline for the packet is set to $l(m_i) + d_c$, where $d_c$ is the worst-case delay guaranteed for the channel. If the packet is early, it is inserted into queue 3; otherwise it is inserted into queue 1. After that, a dispatcher examines queue 3 and transfers the packets that have reached their logical arrival time to queue 1. Then it searches the three queues and, if they are not empty, starts the transmission. The basic idea of the strategy is the following: if queue 1 is empty, best-effort traffic is routed before the early time-constrained messages. In this way, best-effort traffic performance is improved without violating the delay requirements of time-constrained communication.
Queue 3 absorbs variations in delay at the previous node. As the transmission of early time-constrained traffic is postponed, the link scheduler avoids overloading the buffer space. The link can transmit early real-time traffic only if the messages are within a horizon distance of their logical arrival time. This parameter improves average latency and bandwidth utilization, at the expense of increased buffer requirements. A connection's local delay bound, coupled with the incoming link's horizon parameter, limits the required buffer space at the next node in the route.
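The three-queue discipline can be sketched in software as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical, packets are opaque objects, time is a plain number, and the real schedulers in [Kan94] are part of the router rather than a Python object.

```python
import heapq
from collections import deque


class LinkScheduler:
    """Per-link scheduler with on-time, best-effort and early queues."""

    def __init__(self, horizon=0.0):
        self.on_time = []            # queue 1: heap ordered by deadline
        self.best_effort = deque()   # queue 2: FCFS
        self.early = []              # queue 3: heap ordered by logical arrival
        self.horizon = horizon
        self._seq = 0                # tie-breaker so packets are never compared

    def add_real_time(self, packet, logical_arrival, delay_bound, now):
        deadline = logical_arrival + delay_bound     # l(m_i) + d_c
        self._seq += 1
        if logical_arrival > now:                    # early packet
            heapq.heappush(self.early,
                           (logical_arrival, self._seq, deadline, packet))
        else:
            heapq.heappush(self.on_time, (deadline, self._seq, packet))

    def add_best_effort(self, packet):
        self.best_effort.append(packet)

    def next_packet(self, now):
        # Dispatcher: move packets that reached their logical arrival time.
        while self.early and self.early[0][0] <= now:
            _, seq, deadline, packet = heapq.heappop(self.early)
            heapq.heappush(self.on_time, (deadline, seq, packet))
        if self.on_time:                             # smallest deadline first
            return heapq.heappop(self.on_time)[2]
        if self.best_effort:                         # before early traffic
            return self.best_effort.popleft()
        # Early traffic is sent only within the horizon.
        if self.early and self.early[0][0] - now <= self.horizon:
            return heapq.heappop(self.early)[3]
        return None
```

The order of the three tests in `next_packet` encodes the policy described above: on-time real-time packets first, then best-effort packets, and early real-time packets only when nothing else is pending and they fall within the horizon.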
8.10. Modified multiple traffic strategy
To ensure predictable consumption of link and buffers, time-constrained traffic employs store-and-forward packet switching. By buffering packets at each node, packet switching allows each router to schedule packet transmission to satisfy delay requirements.
In the multiple traffic strategy each outgoing port uses separate priority queues for early and on-time packets. However, implementing two queues for each link would incur significant hardware cost and would require logic to transfer packets from the early queue to the on-time one. If multiple packets reach their logical arrival times simultaneously, movement between the two queues becomes more complicated. Hence, in [Rex96], the authors employ a new strategy in which the router does not attempt to store time-constrained packets in sorted order. Instead, it employs a tree of comparators to select the packet with the smallest key. For on-time traffic the lower bits of the key represent packet laxity, whereas the key for early traffic represents the time left before the packet reaches its logical arrival time. To avoid replicating the scheduling logic, the links can share a single comparator (pipelined to improve throughput).
Since a priority-based scheme is used, the performance of best-effort packets can be degraded, because the blocking of wormhole flits increases contention in the best-effort virtual network. To avoid these problems, the arbiter varies the service rate for best-effort packets depending on the load of guaranteed traffic. To reduce contention, the best-effort network employs adaptive routing, circumventing heavy loads of guaranteed packets. Also, the router can ensure predictable access to the physical link even in the presence of guaranteed packets: it can allow up to x best-effort flits to accompany the transmission of a guaranteed packet. Since the guaranteed traffic employs packet switching, a guaranteed packet holds the physical link for a bounded time proportional to its packet length l. This dilates each guaranteed packet to a service time of at most $l + x$ cycles, while dissipating contention in the best-effort virtual network. When no guaranteed packets await service, the pending best-effort flits have free access to the outgoing link. This permits forward progress for best-effort packets while still enforcing a tight bound on the intrusion on guaranteed traffic, without restricting packet size, and preserving the delay bounds of guaranteed traffic.
9. Routing strategies
A flow control scheme such as those presented in the previous section can, on its own, only provide best-effort routing. To provide end-to-end guarantees, special strategies are needed. In this section we explain some of the existing techniques for this purpose.
9.1. Real-time channels
In [Kan94], the authors propose to use real-time channels to route real-time messages. As it is very difficult to provide guarantees for message delivery with dynamic routing, real-time channels use a static strategy. First, a source-destination route is selected, and the worst-case delay for a message is computed, trying to ensure that the new channel does not affect the guaranteed delivery times of existing channels (in this stage, buffer requirements are also computed). The sum of the worst-case delays is computed, and if it is less than the user-specified delay, the channel can be established. In this case, the delay is divided among the links on the route based on their worst-case delay for the message, and the buffer requirements are adjusted (the permissible message delay is distributed proportionally among the different links).
The worst-case delay for a message depends upon the scheduling algorithm used and the other channels that use the link. The authors propose a combination of deadline and rate-monotonic scheduling: the channel establishment scheme uses rate-monotonic, but a form of EDD is used for message scheduling. Let $C_i$ be the maximum service time for messages of a virtual channel $M_i$ (essentially the transmission time), which is proportional to the maximum message size (S.R). Let $p_i = I$ be the minimum message interarrival time, and $d_i$ the delay assigned to the channel. Let $S = \{ M_i = (C_i, p_i, d_i) \mid i \in [1, k] \}$ be the set of existing virtual channels. To establish a new channel $M_{k+1} = (C_{k+1}, p_{k+1}, d_{k+1})$, we must find a priority assignment such that the response time $r_i'$ satisfies $r_i' \le d_i$, $i \in [1, k]$.
To meet these requirements, the channels are sorted in ascending order of delay $d_i$, and the highest priority is initially assigned to the channel $M_{k+1}$. The other channels have priorities based on this order (higher priority is assigned to channels with smaller delays). The new (worst-case) response times $r_i'$ are computed for the existing channels, assigning priorities to the channels that use the link. In the priority order, the smallest position q is found such that $r_i' \le d_i$ for all channels with position greater than q (i.e., with priority lower than q). Then, priority q is assigned to the new channel and the response time $r_{k+1}'$ is computed. It was proven that if $d_i \le p_i$ for each i (that is, the worst-case delay at each link for any channel does not exceed its interarrival time), the procedure is optimal (the computed response time $r_{k+1}'$ is the minimum possible for any feasible priority assignment).
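The admission test and the proportional division of the end-to-end delay among the links of the route can be sketched as follows. The function name `split_delay` is hypothetical, and the per-link worst-case delays are taken as given inputs, since computing them depends on the link scheduling analysis. The sketch accepts a channel when the sum equals the requested delay, and it does not enforce the additional constraint that each local delay stay below the interarrival time I.

```python
def split_delay(worst_case, d):
    """Distribute the end-to-end delay d over the links of a route.

    worst_case: worst-case delay computed for the new channel at each link.
    Returns the local delay bound of each link, or None if the channel
    cannot be established.
    """
    total = sum(worst_case)
    if not worst_case or total <= 0:
        raise ValueError("the route needs at least one link with positive delay")
    if total > d:
        return None                     # the requested delay cannot be met
    # Each link receives a share of d proportional to its worst-case delay.
    return [d * w / total for w in worst_case]
```

For a three-link route with worst-case delays 2, 3 and 5 and a requested delay of 20, the links receive local bounds of 4, 6 and 10; with a requested delay of 8 the channel is rejected.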
9.2. Rate-Monotonic routing
In [Mut94] a routing strategy based on RM is presented. The algorithm is proposed for wormhole networks, and uses the RM scheme to schedule periodic messages with early arrivals. First, the message sources (periodic tasks) are allocated to the processors in the system. Then, global priorities are assigned to message sources according to their periods. The virtual channels on the path of a message source are assigned the static priority of the message source. The algorithm analyzes the feasibility of the message source allocations, and assigns higher-numbered virtual channels to higher-priority sources. The arbiter of the physical link allocates the bandwidth to the virtual channel with the highest priority that has a flit waiting in the buffer, as previously explained.
The messages generated by a source inherit its priority. If a message arrives before its period, its deadline is computed as for the normal period, and the message priority is changed to that of the source that shares the path with the message and has the shortest deadline. The global priority adjustment scheme ensures that an early message will not block messages generated by lower-priority sources. However, if the network is underutilized, the message is delayed unnecessarily. This approach also requires timer management, which degrades performance.
Messages are routed using the XY strategy [Dal86] and virtual channels. Street-sign routing is used to keep the amount of header information small. This increases router complexity and the overhead for channel establishment, but it is deterministic and provides flexibility. Suppose there are $2^n$ logical channels, and each node has a register for each channel with $p \ge n$ priority bits. Each source of real-time traffic must specify the parameters S, R, B and d, and source priorities are assigned by giving the highest priority to the virtual channel with the shortest period. Suppose that the number of sources using a link is less than or equal to $2^n$ (the number of logical channels), and that there are enough priority bits to give each message a distinct priority. Then, any source will share a portion of its path with at most $2^p - 1$ other sources. The outgoing channel at the source is assigned according to the priorities of the traffic sharing the path: the highest channel number is associated with the highest-priority source on the link.
The main disadvantage of this approach is that the assignment of the virtual channels is software dependent and must be done off-line. The priorities also add overhead to packet headers. Finally, the router must be built using complex and expensive combinational logic (including timer management).
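The off-line channel assignment on a single link can be sketched as follows. The function name, the dictionary representation of the sources and the channel numbering from 0 to 2^n - 1 are assumptions made for illustration; the feasibility analysis of the allocation is not covered.

```python
from typing import Dict


def assign_channels(periods: Dict[str, float], num_channels: int) -> Dict[str, int]:
    """Assign virtual channels on one link in rate-monotonic order.

    periods maps each message source that uses the link to its period.
    The source with the shortest period (highest priority) receives the
    highest channel number.
    """
    if len(periods) > num_channels:
        raise ValueError("more sources than logical channels on this link")
    # Shortest period first, i.e. highest rate-monotonic priority first.
    ordered = sorted(periods, key=lambda src: periods[src])
    return {src: num_channels - 1 - rank for rank, src in enumerate(ordered)}
```

With four logical channels and three sources of periods 10, 5 and 20, the source with period 5 is placed on channel 3, the one with period 10 on channel 2, and the one with period 20 on channel 1.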
9.3. Modified RM
In [Mut94], a modification of the RM strategy is proposed. The sender knows the deadlines of the message sources that share a link with it. The source node has a table that matches each priority with the corresponding deadline. For an early arrival, the sender determines the priority p with the smallest deadline larger than $d + l(m) - t$. The sender of the early arrival uses the logical channel to which it has been assigned as if the arrival were not a burst, loading its priority register with p.
A newer message from source k cannot pass an older message from the same source (which would be possible if the messages used different logical channel assignments). When the arbiter sees two ready channels with the same value in their priority registers, it gives higher priority to the larger logical channel number. Therefore, the priority used by the arbiter is, in general, the value of the priority register with the logical channel number appended. This approach allows the arbiter to remain simple, and guarantees that an early message on a higher logical channel number will not preempt messages using lower channel numbers greater than p. The messages using these channels are guaranteed to meet their deadlines. The main problem is how to guarantee the new deadline of a message from the high-priority channel when its priority is substituted. Here, if an early arrival occurs, its deadline and the end of its period are extended. The priority change of one message requires neither additional capacity nor additional blocking time due to priority inversion; the system merely has one task with a longer period and deadline. The new set of messages is also schedulable using the Deadline Monotonic algorithm, which provides predictability.
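The two mechanisms of this section, the priority chosen for an early arrival and the tie-break used by the arbiter, can be sketched as follows. The function names and the dictionary representations are hypothetical, and returning `None` when no table entry qualifies is an assumption, since the text does not cover that case.

```python
def early_priority(table, d, logical_time, now):
    """Priority to load for an early arrival.

    table maps each priority to the deadline of the corresponding source.
    The chosen priority is the one with the smallest deadline larger than
    d + l(m) - t.
    """
    remaining = d + logical_time - now
    candidates = [(deadline, prio) for prio, deadline in table.items()
                  if deadline > remaining]
    if not candidates:
        return None                      # assumption: no suitable entry
    return min(candidates)[1]


def arbitrate(ready):
    """Select the channel to serve among the ready ones.

    ready maps a logical channel number to the value of its priority
    register. Ties on the register value are resolved in favour of the
    larger channel number, as if the number were appended to the register.
    """
    return max(ready, key=lambda ch: (ready[ch], ch))
```

For instance, with a table {3: 10, 2: 20, 1: 40}, d = 10, l(m) = 25 and t = 20, the remaining time is 15 and the early message is given priority 2, whose deadline (20) is the smallest one above 15.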
9.4. Multiple traffic strategy
As previously explained, in [Kan94] a new strategy was proposed based on the use of separate virtual networks. The effective mixing of time-constrained and best-effort traffic is crucial to provide predictability, and this strategy allows both kinds of traffic to be combined predictably. To establish a real-time channel, the network reserves link bandwidth and buffer space along a fixed path between the source and destination. The route depends on the resources available. The real-time router implements dimension-ordered routing, using a shortest-path scheme (XY routing), which avoids deadlocks and facilitates efficient implementation in a mesh. In contrast, best-effort traffic does not require resource reservation along packet routes. Its performance can be improved using adaptive wormhole routing with additional virtual channels to avoid deadlock. To reduce hardware complexity, the non-real-time tasks are performed by protocol software, permitting greater flexibility in route selection and buffer allocation policies. It was proved that if the guarantees are respected and the flow control strategy presented in section 8.9 is used, then all messages arriving on real-time channels meet their delay requirements on the link.
10. Conclusion
In this work we have presented some results on real-time routing in massively parallel computers. We have identified the main problems existing in this area, and some existing solutions. The main problems are related to the provision of end-to-end guarantees and to runtime scheduling in a router.
At present, several projects have proposed mechanisms to allow predictability in wormhole networks. One simple approach, which does not need hardware support, is to use the application or the operating system to control end-to-end performance by regulating the rate of packet injection at each source. However, this approach limits the utilization of the network in order to avoid contention between packets (particularly in wormhole networks, since a stalled packet may indirectly block the advancement of other traffic that does not use the same links). The router architecture can improve predictability by favoring older packets when assigning virtual channels or arbitrating between channels on the same physical link. Although these mechanisms provide guarantees in end-to-end latency, they can fail to guarantee performance under high network loads.
To avoid these problems, a router can support multiple classes of traffic by partitioning traffic onto different virtual channels, with priority-based arbitration. Flit-level preemption of low-priority virtual channels can significantly reduce intrusion on the high-priority packets. Still, if priorities are coarse-grained, they cannot differentiate between packets with different latency requirements. With additional virtual channels, the network has greater flexibility in assigning packet priority, perhaps based on the end-to-end delay requirement, and in restricting access to virtual channels reserved for higher-priority traffic. This increases the cost of implementation: extra flit buffers, larger virtual channel identifiers, and more complex switching and arbitration logic. Instead, a router can increase priority resolution by adopting a packet-switched design.
Some studies comparing the run-time scheduling algorithms showed that increasing the number of virtual channels can reduce the delayed ratio, and that assigning priorities to messages according to their timing properties can further improve the performance. The RM scheme has the best performance and FCFS the worst. The RM scheme guarantees that all the periodic messages meet their deadlines when there are no burst arrivals. The one-bit scheme has similar performance to the RM scheme when the burst rate is small. As the burst rate increases, the performance of the scheme declines: the large amount of burst traffic saturates the lower-numbered virtual channels, and causes low-priority and burst messages to miss their deadlines. Both the TDF and LLF schemes improve on the performance of FCFS, and LLF outperforms TDF by a small margin. If the number of virtual channels is limited, both schemes reduce the delayed ratio without the substantial hardware cost of the RM scheme. The global priority adjustment of RM provides stable performance until the system load is close to the saturation point. The PC scheme increases the probability of meeting deadlines for low-priority messages and reduces the delayed ratio for TDF and LLF.
Although RM performs well, the global priority adjustment needs a static deadline table in each node to store the deadlines of all the message sources routed through the node. Since buffers are the most expensive resources, a system designer should consider the hardware cost of supporting the RM scheme. In contrast, the PC scheme only requires a block count in each message header and an adder in each arbiter to increment the block counts. Since the TDF and LLF schemes require a message header to carry a priority value, the priority and the block count can be integrated into one byte. Also, an additional adder in each arbiter will not complicate the hardware design.
As we can see, the results proposed at present are few, and additional research is needed to support real-time communication. Some open research fields include dynamic routing solutions with guaranteed timing correctness, network buffer management that supports scheduling solutions, fault-tolerant real-time communications, and network scheduling that can be combined with processor scheduling to provide system-level scheduling solutions.
11. References [And88] ANDERSON, D.; FERRARI, D. "The DASH project: an overview". Tech. Rep. 84/405. Computer Sciences Division, Dept. of Electrical Engineering and Computer Sciences. University of California, Berkeley. CA, USA, February 1988. [Che88] CHENG, S; STANKOVIC, J.; RAMAMRITHAM, K. "Scheduling Algorithms for Real-Time systems: a brief survey". En "Real-Time Systems", IEEE Press, 1993. pp. 150-173. [Che90] CHEN, M.; SHIN, K.; KANDLUR, P. "Addressing, routing and broadcasting in hexagonal mesh multiprocessors". IEEE Transactions on Computers, Vol. 39, No. 1. January 1990. pp. 10-18. [Cru87] CRUZ, R. "A calculus for network delay and a note on topologies of interconnection networks". Technical Report UILU-ENG-87-2246. University of Illinois at Urbana-Champaign. 1987. [Cyp93] CYPHER, R.; SANZ, J. "The SIMD model of parallel computation". Springer-Verlag. New York. 1993. [Der74] DERTOUZOS, M. "Control robotics: the procedural control of physical process". In Proceedings of the IFIP Congress, 1974. [Dal92] DALLY, W. "Virtual channel flow control". IEEE Transactions on Parallel and Distributed Systems, 3(2): 194-205. March 1992. [Dal86] DALLY, W.; SEITZ, C. "Deadlock-free routing in multiprocessor interconnection networks". IEEE Transactions on Computers, Vol. C-36, No. 5, May 1986.
[Dol89] DOLTER, J.; RAMANATHAN, P.; SHIN, K. "A microprogrammable VLSI routing controller for HARTS". Proceedings of the IEEE International Conference on Computer Design: VLSI in Computers, 1989. pp. 160-163.
[Dol91] DOLTER, J.; RAMANATHAN, P.; SHIN, K. "Performance analysis of virtual cut-through switching in HARTS: a hexagonal mesh multicomputer". IEEE Transactions on Computers. June 1991.
[Fel91] FELPERIN, S.; GRAVANO, L.; PIFARRE, G.; SANZ, J. "Routing Techniques for Massively Parallel Communications". Proceedings of the IEEE, Vol. 79, No. 4. April 1991.
[Fer90] FERRARI, D.; VERMA, D. "A scheme for real-time channel establishment in wide-area networks". IEEE Journal on Selected Areas in Communications. Vol. 8, pp. 368-379. April 1990.
[Kan94] KANDLUR, D.; SHIN, K.; FERRARI, D. "Real-Time Communication in Multihop Networks". IEEE Transactions on Parallel and Distributed Systems, Vol. 5, No. 10. pp. 1044-1055. October 1994.
[Ker79] KERMANI, P.; KLEINROCK, L. "Virtual cut-through: a new computer communication switching technique". Computer Networks, Vol. 3, No. 4, pp. 267-286, September 1979.
[Kon90] KONSTANTINIDOU, S. "Adaptive, minimal routing in hypercubes". 6th. MIT Conference on Advanced Research in VLSI. pp. 139-153, 1990.
[Kon90b] KONSTANTINIDOU, S.; SNYDER, L. "The chaos router: a practical application of randomization in network routing". 2nd. Annual ACM SPAA, pp. 21-30. 1990.
[Kop89] KOPETZ, H. et al. "Distributed Fault Tolerant Real-Time systems: the MARS approach". IEEE Computer. February, 1989. pp. 25-40.
[Leh89] LEHOCZKY, J.P.; SHA, L.; DING, Y. "The Rate Monotonic Scheduling algorithm - exact characterization and average case behavior". Proceedings IEEE Real-Time Systems Symposium. CS Press, Los Alamitos, Calif. pp. 166-171. 1989.
[Leu82] LEUNG, J.; WHITEHEAD, J. "On the complexity of fixed-priority scheduling of periodic real-time tasks". Performance Evaluation 2(4): 237-250. 1982.
[Li94a] LI, P.; MUTKA, M. "Real-Time virtual channel flow control". Proceedings of the 13th. IEEE Intl. Phoenix Conference on Computers and Communication, 1994.
[Li94b] LI, P.; MUTKA, M. "Priority based real-time communication for large scale wormhole networks". Proceedings of the IEEE International Parallel Symposium. 1994.
[Liu73] LIU, C.; LAYLAND, J. "Scheduling algorithms for multiprogramming in a Hard-Real-Time Environment". Journal of the ACM, Vol. 20, No. 1, pp. 46-61. 1973.
[Mal94] MALCOLM, N.; ZHAO, W. "The timed-token protocol for real-time communications". IEEE Computer. pp. 35-41. January 1994.
[Mok78] MOK, A.; DERTOUZOS, M. "Multiprocessor scheduling in a hard real-time environment". Proceedings of the Seventh Texas Conference on Computing Systems. 1978.
[Mut94] MUTKA, M. "Using Rate Monotonic Scheduling Technology for real-time communications in a wormhole network". Proceedings of the Second Workshop on Distributed and Parallel Real-Time Systems, pp. 194-199, April, 1994.
[Nas81] NASSIMI, D.; SAHNI, S. "Parallel algorithms to set up the Benes Permutation Network". IEEE Transactions on Computers, Vol. C-31, pp. 148-154. 1982.
[Nga89] NGAI, J.; SEITZ, C. "Adaptive routing in multicomputers", in Opportunities and constraints of parallel computing. J. Sanz, Ed. New York, Springer Verlag. 1989.
[Ni93] NI, L.; MCKINLEY, P. "A survey of wormhole routing techniques in direct networks". IEEE Computer, pp. 62-76. February 1993.
[Ram90] RAMANATHAN, P.; KANDLUR, D.; SHIN, K. "Hardware assisted software clock synchronization for homogeneous distributed systems". IEEE Transactions on Computers, Vol. 39, pp. 514-524. April 1990.
[Ran85] RANADE, A. "How to emulate shared memory". In Foundations of Computer Science. pp. 184-185, 1985.
[Rex94] REXFORD, J.; DOLTER, J.; SHIN, K. "Hardware support for controlled interaction of guaranteed and best-effort communication". Workshop on Parallel and Distributed Real-Time Systems, April 1994, pp. 188-193.
[Rex94b] REXFORD, J.; SHIN, K. "Support for multiple classes of traffic in multicomputer routers". Proceedings of the Parallel Computer Routing and Communication Workshop, May 1994, pp. 116-130.
[Rex96] REXFORD, J.; SHIN, K. "A router architecture for real-time point-to-point networks". Proceedings of the International Symposium on Computer Architecture, pp. 237-246. May 1996.
[Shi91] SHIN, K. "HARTS: a Distributed Real-Time Architecture". IEEE Computer, May 1991. pp. 25-35.
[Sta88] STANKOVIC, J. "Misconceptions about Real-Time computing". IEEE Computer, pp. 10-19. October 1988.
[Sta88b] STANKOVIC, J. "Real-Time computing systems: the next generation". In "Hard Real-Time Systems". Tech. Report TR-88-06, COINS Dept. University of Massachusetts. January 1988.
[Sta94] STANKOVIC, J. et al. "Implications of classical scheduling results for Real-Time systems". CMPSCI Technical Report 93-23. University of Massachusetts at Amherst. January 1994.
[Str89] STROSNIDER, J.; MARCHOK, T. "Responsive, deterministic IEEE 802.5 token ring scheduling". Journal of Real-Time Systems, Vol. 1. pp. 133-158. September 1989.
[Val81] VALIANT, L.; BREBNER, G. "Universal Schemes for Parallel Communication". ACM STOC, pp. 263-277. 1981.
[Wai95d] WAINER, G. "Some results on local real-time scheduling". (in Spanish). Proceedings of the 24 Jornadas Argentinas de Informática e Investigación Operativa, JAIIO. August 1995.
[Wen78] WENSLEY, J. et al. "SIFT: Design and Analysis of a Fault-Tolerant Computer for Aircraft Control". Proceedings of the IEEE. October 1978. pp. 1240-1255.
[Zha87] ZHAO, W.; RAMAMRITHAM, K. "Virtual time CSMA protocols for hard real-time communications". IEEE Transactions on Software Engineering. Vol. 13, pp. 938-952. August 1987.