RESYM: a high performance, low power multi-microprocessor bus

J.D. Nicoud and K. Skala
Swiss Federal Institute of Technology
Av. de Cour 37, CH-1007 Lausanne

In order to build lower cost multimicroprocessor systems, a narrow synchronous bus (15 active lines) is proposed. It multiplexes address and data on 8 bits, and arbitrates in two pipelined cycles on four lines. Due to the 20 to 40MHz bus clock and the pipelined control logic, the performance is equivalent to that of Multibus II, IEEE-P896 and similar 32-bit buses. For the implementation, cards are arranged radially around a special connector. The very short connections allow the use of fast HC-MOS drivers with only a light termination.

1. Introduction

Multiprocessors have been studied for many years, but the commercial multiple processor architectures of the future are not yet clear [RAT85] [LIN85]. The traditional architecture of several processors linked by a fast bus and sharing data in the most efficient possible way will however survive, and will be introduced in personal workstations as soon as the price is low enough. Many multiprocessor buses have been defined over the years but few are being standardized [KIRR85] [DELC86]. The recently proposed 32-bit buses such as FutureBus (IEEE-P896) [P89683], NuBus [NUB83], VME (IEEE-P1014) [VME85], M3 [CON83], Multibus II [MUL285] and BI [BI86] seem to be the only commercial choices.

The board size for all these buses is about 300x250mm, providing 5 to 10 dm² of area. The backplane has a typical length of 41 cm with 21 slots. The current drawn is 5 to 10A per board (5V), more than 1A being required for the drivers alone. A less expensive solution, using fewer bus lines and better adapted to surface mount technology, will be required in the future; it is the purpose of this paper to indicate a possible new direction.

Size reduction is essential both for improved performance (on a loaded backplane, propagation time is 1 ns per 3cm) and for power saving (at 20MHz, lines longer than 10cm must be terminated). The solution proposed here uses only 15 active lines and is best implemented in a MICRAI (Micro CRAte Interconnect) box.

2. Multiprocessor buses

Multiprocessor systems can be built at various levels:

1) At the chip level, a high degree of parallelism can be achieved, as long as the interconnections can be routed. This limits the application to systolic arrays, but future designs may see a pool of processors, more or less dedicated, implemented on the same chip, on a wafer scale set of chips, or in future 3D solutions. The low equivalent speed of electrical signals on the surface of the IC (1% of the speed of light under certain conditions) limits, however, the performance one can hope for from highly integrated solutions.
2) At the board level, multiprocessor interconnections are facilitated by multilayer printed circuit boards. However, manufacturing and packaging limit the complexity of a system one can build on a single board; many drivers and "glue" chips are required when interconnecting processor and memory devices which have been designed for general purpose applications and not optimised for a given multiprocessor architecture.
3) At the crate level, boards corresponding to a computer, a global memory, or an I/O channel can communicate through one or several backplane buses [CON83]. More random backplane connections improve the performance but limit the flexibility in configuring the system.
4) At the company level, all the centralized and distributed computers available are connected through local networks and telephone networks. This increases the accessibility of data, but is not used to improve the processing power, except in several experimental applications.

The purpose of this paper is to propose a bus solution between 2) and 3), and some original manufacturing facilities for 3).


0884-7495/86/0000/0169$01.00 © 1986 IEEE

3. Optimum bus size

A great deal of experience has been accumulated by microcomputer manufacturers with buses about 40cm long driven by 40 to 100mA drivers. Fastbus [FAST82] and the ELXSI Gigabus [OLS83], with their ECL technology, provide the best possible performance.

Microcomputer manufacturers have encountered many problems trying to use 24mA and 40mA TTL drivers with buses that are too long and too heavily loaded. The PDP-11 Unibus uses 70mA drivers, and 100mA is preferable with buses longer than 40cm. This dissipates a large amount of power with the 2-3V voltage swing of TTL signals. The P896 solution is to use special 100mA drivers with 1V voltage swings and a trapezoidal waveshape [BALA83]. This provides ECL advantages with TTL compatibility, but power consumption stays high and propagation delays are in the 15ns range.

HC-MOS drivers have been used successfully at 10MHz on short buses (10cm) with a light termination. The improved speed and driving capability of recent HC-MOS circuits make their use possible with bus termination, but the high voltage swing means increased dissipation in the terminating resistors. Leaving the bus unterminated is only possible if the lines are short (10cm).

The bus width is 32 bits for most recent buses. This adequately supports the most recent microprocessors, is compatible with low cost 64 or 96-pin connectors (one or two are used), and generates an acceptable amount of heat using the 50 to 80 necessary bus drivers.

Asynchronous buses like FutureBus (IEEE P896) and VME (IEEE P1014) have an address/data bus cycle no better than 200ns, and an arbitration time of the same order; this adequately matches current microprocessors. Synchronous buses may be faster; the Gigabus [OLS83] reaches 40MHz since it is well synchronized with the special ECL processor running at the same speed. But universal synchronous buses like NuBus and Multibus II use only a 10MHz clock; in addition, resynchronization cycles due to metastable states are frequently required and slow down the transfers.

Fast narrow (8 bits) synchronous buses have apparently never been proposed. As shown later, they provide (for shorter bus lengths supporting 20 to 40MHz) about the same speed as 32-bit asynchronous buses with a much lower dissipation; the solution proposed in this paper, called RESYM for REduced SYnchronous Multiprocessor bus, is such a narrow, low power, fast and short bus. Pure serial transmission may be a next step; speeds higher than 100MHz are, however, difficult to envisage, and arbitration between simultaneous requests is not efficient. Hence serial transfers are restricted to a secondary channel on multiprocessor buses, as is the case for FutureBus, VME, M3 and Multibus II.

4. RESYM specifications

RESYM is designed to provide a low cost, high performance bus suited to a compact technology. Its principal features are a small number of lines and a very low power consumption. Synchronous transfers, optimised protocols and short bus lengths allow high transfer rates on this bus.

4.1 System clock

The clock is centralized on the first board or on a separate small module. A single system clock in a synchronous system always carries a high risk of total system failure. Self-synchronized or distributed daisy-chained clocks are nevertheless not used, because of the small increase in theoretical reliability and the difficulty of implementing multiple high speed clocks. A high speed clock is moreover required so that all clocks of the system can be derived from the bus clock; this avoids metastability problems and double synchronization logic.

4.2 Arbitration

Distributed arbitration mechanisms using only bused lines are used in all recent buses, but they are slow due to the settling time required for collision and priority logic (300ns on FutureBus). The arbitration of RESYM is also decentralized and uses only four arbitration lines; two active clock cycles are required to arbitrate between 16 masters. This means 100ns at 20MHz, and occurs in parallel with the current data transfer.

The arbitration protocol is a time multiplexed extension of the solution used in PIBUS [ROET80]. The 4 arbitration lines (ARB0...ARB3) have increasing priority. Each board activates only one line at a time; since two successive cycles are used, arbitration between 4x4=16 devices is achieved with simple and fast logic. All boards taking part in the arbitration examine the arbitration lines at the end of each arbitration cycle. A winner of an arbitration cycle is a board having activated the highest priority line. Only the winners of the first arbitration cycle may continue the arbitration in the second cycle. The single winner of the second arbitration cycle becomes the current bus master.

In order to synchronize the beginning of arbitration, all boards waiting for the bus may begin the arbitration only after a cycle of silence on the arbitration bus. The current master keeps the arbitration lines active until 3 cycles before the end of its bus activity; hence, the bus is released at the end of the arbitration, allowing the new master to begin its bus transfers immediately. A nice feature of this arbitration protocol is its extensibility. The maximum number of modules may be increased by the use of additional arbitration lines or arbitration cycles; one can hence trade speed against complexity, and the use of a variable number of cycles provides faster access for boards which use the bus frequently.
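The two-cycle selection described in this section can be sketched in a few lines. This is only an illustration under stated assumptions: the paper does not say how a board number maps onto the line it drives in each cycle, so the split of a 4-bit board number into a high digit (first cycle) and a low digit (second cycle) is hypothetical, as are the names `digits` and `arbitrate`. The wired-OR behaviour of the four lines is modelled as a set of driven line numbers.

```python
from typing import Iterable, Optional, Tuple

LINES = 4  # ARB0..ARB3; ARB3 is taken here as the highest priority line


def digits(board_id: int) -> Tuple[int, int]:
    """Split a board number 0..15 into (first-cycle line, second-cycle line).

    Assumption: the high base-4 digit is driven in the first cycle and the
    low digit in the second one. The paper does not specify this mapping.
    """
    if not 0 <= board_id < LINES * LINES:
        raise ValueError("board_id must be in 0..15")
    return divmod(board_id, LINES)


def arbitrate(requesters: Iterable[int]) -> Optional[int]:
    """Return the board that becomes bus master, or None if nobody asks."""
    contenders = set(requesters)
    for cycle in (0, 1):
        if not contenders:
            return None
        # Every contender activates exactly one line; the bus sees their union.
        driven = {digits(board)[cycle] for board in contenders}
        highest = max(driven)
        # Only boards that drove the highest active line stay in the race.
        contenders = {b for b in contenders if digits(b)[cycle] == highest}
    (winner,) = contenders  # board numbers are unique, so one board remains
    return winner


if __name__ == "__main__":
    print(arbitrate([3, 9, 14, 12]))
```

With this mapping the winner is simply the highest board number among the requesters, reached in two cycles instead of one 16-line comparison.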

4.3 Fairness

Several applications require that no module monopolizes the bus and that each device gets fair access. RESYM implements this feature in the same way as the synchronization of the arbitration. If fairness is required, each board which wants to start a bus request must make sure that the arbitration bus has been idle for at least 2 bus cycles since its last request, instead of the usual single cycle. Hence all boards waiting for the bus while it is occupied by higher priority boards are sure to obtain it, even when the higher priority boards need it again: those boards have to wait for two idle bus cycles, which will occur only when all waiting boards have made their transfers.

4.4 Bus cycles definition

In order to minimize the number of lines, most of the bus lines are time-multiplexed. There are 4 groups of active lines (15 in total), plus power and ground:

- system clock (1)
- arbitration bus (4)
- data transfer bus (8)
- control lines (2)
- power and ground (5 or more)

A write cycle requires 7 clock cycles (figure 1):

1) the logical address of the addressed board is put on the arbitration bus (for broadcast, the logical address (0) is decoded by all boards); the first byte of the 24-bit physical address is placed on the data transfer bus; the transfer mode is placed on the two control lines (read, write, read-modify-write or block-mode);
2) the second address byte is transferred; the length of the data transfer is placed on the control lines (8, 16, 24 or 32 bits);
3) the low address byte is transferred; the slave acknowledges its physical presence (signal PHYS) and its capability to accept data (READY);
4) the first data byte is transferred; the slave may signal an address space violation (ADDRESS);
5) 6) 7) the next data bytes are transferred.

Three cycles before the end of the transfer, the master deactivates the arbitration bus, thus enabling a new arbitration. Read cycles are semi-synchronous: the master must wait for a data valid signal on the STAT1 line (figure 2). Simulation has shown that split cycles are not worth their complexity; present memories are fast, and read cycles through the backplane bus are mostly used for fetching small items of data that are immediately available.

The READ-MODIFY-WRITE cycle has the same characteristics as the read cycle. The master keeps the bus until it has received the data and retransmitted it to the slave; this guarantees the multiprocessor capabilities. The BLOCK-MODE is a normal write cycle where the data are transferred in blocks of 64 up to 160 bits. The number of transferred quadlets is indicated during the second transfer cycle on the status lines.

The BROADCAST cycle is more complex: the master waits for two acknowledge vectors on the data bus (each card activates one data line during one of these cycles). The data will be valid only if all cards have given their acknowledgement. This guarantees that all cards have a valid copy of the data at the same time.

One could be tempted to use a 16-bit data bus in order to improve the speed. But due to the need to wait for one synchronization cycle after each operation, the speed improvement is not significant.
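As a rough check of the write cycle timing, the seven clock cycles can be tabulated and converted to time for the 20 and 40MHz clocks mentioned in the paper. The sketch below is illustrative only: the names are hypothetical, the throughput figure ignores any idle or synchronization cycle between transfers and is therefore an upper bound, and reading "three cycles before the end" as "after the fourth cycle" is an interpretation.

```python
# Hypothetical tabulation of the 32-bit write cycle; one entry per clock cycle.
WRITE_CYCLE_32 = (
    "board address on arbitration bus, address byte 1, transfer mode",
    "address byte 2, transfer length",
    "address byte 3 (low), slave answers PHYS and READY",
    "data byte 1, slave may signal ADDRESS violation",
    "data byte 2",
    "data byte 3",
    "data byte 4",
)

# The master releases the arbitration bus this many cycles before the end.
ARBITRATION_LEAD = 3


def clock_period_ns(clock_mhz: float) -> float:
    """Duration of one bus clock cycle in nanoseconds."""
    return 1000.0 / clock_mhz


def write_time_ns(clock_mhz: float) -> float:
    """Duration of a complete 32-bit write cycle in nanoseconds."""
    return len(WRITE_CYCLE_32) * clock_period_ns(clock_mhz)


def write_throughput_mbytes(clock_mhz: float) -> float:
    """Upper bound in megabytes per second for back-to-back 32-bit writes."""
    return 4 / write_time_ns(clock_mhz) * 1000.0


if __name__ == "__main__":
    for mhz in (20, 40):
        print(mhz, "MHz:", write_time_ns(mhz), "ns per 32-bit write,",
              round(write_throughput_mbytes(mhz), 1), "MB/s at most")
    print("arbitration bus released after cycle",
          len(WRITE_CYCLE_32) - ARBITRATION_LEAD)
```

Seven cycles of 50ns give the 350ns write cycle used for RESYM in the performance comparison; the last three cycles leave room for one cycle of silence and the two arbitration cycles.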

Fig. 1 WRITE-Cycle (signals shown: CLOCK [1], ARB [4], AD/DAT [8], STAT1 [1], STAT0 [1])

Fig. 2 READ-Cycle (signals shown: CLOCK [1], ARB [4], AD/DAT [8], STAT1 [1], STAT0 [1])

5. Interface design

Due to its small number of lines and the limited power dissipated in the bus drivers, RESYM is highly suitable for a VLSI implementation. A universal processor/memory bus interface (32-bit multiplexed system) can be designed in a 68-pin package (figure 3).

Fig. 3 Typical bus interface (blocks shown: processor, address bus latch, bidirectional data path, memory, RESYM controller, RESYM bus with 15 active lines)

6. Performance comparison

All performance estimations are based on a linear limited Markov model [MAD84] [DRA67]. This model assumes a system with a single common bus and a constant bus occupation time for each transfer. These hypotheses are realistic for a certain number of buses; they provide a good performance approximation with a very simple model. The bus may be described by N+1 states which represent the length of the queue formed by the processors waiting for the bus. This model allows the probability of waiting and the average waiting times to be calculated. From these values we find, by adding the arbitration time, the average access time, which is a direct measure of system performance.

Computing power is the first parameter considered in evaluating the system performance. A second important parameter is the ratio of computing power to system price [MAR82], which indicates the economic aspects of our solution.

The computing power is a function of the number of processors in the system (x-axis) and of the traffic created by the processors (common bus access), which is indicated by the percentage of time devoted to instruction execution. The computing power (y-axis) is expressed as a multiple of the computing power of a single processor system. The resulting plot (figure 4) displays the limits of a single bus architecture: these systems achieve high computing power only when the traffic created by the processors remains small. A comparison shows that the degradation of the system performance is greater for RESYM than for FutureBus, owing to the smaller throughput of RESYM. But a significant difference appears only in cases where another system architecture should be considered anyway.
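The paper does not give the equations of its Markov model, so the following is only a sketch of one standard model with N+1 queue-length states: a finite-source single-server queue (often called the machine repairman model). The assumptions are mine and should be read as such: exponentially distributed compute and bus times (the paper assumes a constant bus occupation time), no arbitration overhead, and a hypothetical mean compute time between bus accesses of 5000ns. The mean bus times use the cycle figures from the comparison table and the mix of 4 read cycles for 1 write cycle stated in Note 1.

```python
from typing import List


def mean_bus_ns(read_ns: float, write_ns: float) -> float:
    """Average bus occupation for 4 read cycles per write cycle (Note 1)."""
    return (4 * read_ns + write_ns) / 5


def queue_distribution(n: int, think_ns: float, bus_ns: float) -> List[float]:
    """Stationary probabilities of k = 0..n processors waiting or served.

    Birth-death chain with arrival rate (n - k) / think_ns in state k and
    service rate 1 / bus_ns (assumed exponential, unlike the paper).
    """
    rho = bus_ns / think_ns
    weights = [1.0]
    for k in range(1, n + 1):
        weights.append(weights[-1] * (n - k + 1) * rho)
    total = sum(weights)
    return [w / total for w in weights]


def relative_power(n: int, think_ns: float, bus_ns: float) -> float:
    """Computing power of n processors as a multiple of one processor."""
    probs = queue_distribution(n, think_ns, bus_ns)
    computing = sum((n - k) * p for k, p in enumerate(probs))
    single = 1.0 / (1.0 + bus_ns / think_ns)
    return computing / single


if __name__ == "__main__":
    think = 5000.0  # hypothetical compute time between bus accesses, in ns
    buses = {"RESYM": mean_bus_ns(500, 350), "P896": mean_bus_ns(340, 220)}
    for name, bus in buses.items():
        print(name, [round(relative_power(n, think, bus), 2)
                     for n in (1, 2, 4, 8, 16)])
```

Whatever the parameters, the curve flattens once the bus saturates, which is the limit of a single bus architecture discussed in this section.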

Fig. 4 Performance comparison

Let us compare the performance of the proposed bus (RESYM) and the FutureBus (P896).

                              RESYM         P896
Bus speed                     20 MHz        Asynchronous
Write cycle (32 bit)          350 ns (2)    220 ns (1)
Read cycle (32 bit)           500 ns (2)    340 ns (1)
Arbitration speed             100 ns (2)    200 ns
Processor instruction time           1 µs
Memory response time          100 ns        100 ns

Note 1: The cycle times are calculated as follows: Bus Protocol (160ns) + Logic (60ns) + Memory Access Time (Read Cycle only) (120ns). Performance calculation is based on an average of 4 read cycles for 1 write cycle.
Note 2: These figures are divided by two for a 40MHz clock.

For the economic evaluation we used the ratio computing-power/system-price, expressed in percent of the value for a single processor system (Processor board cost = 1 unit, Computing power = 1). We used the hypothesis of a linear cost function C = I + B·p [MAR82], where I is the initial system price, B the price of each processor board and p the number of processor boards in the system. The following values provide the basis for the computations:

                              RESYM         P896
Initial cost                  2 units       5 units
Processor board cost          1.2 unit      1 unit

RESYM boards are considered more expensive due to the new surface mount technology they use and the special interface circuits they require; these costs are only partly compensated by the smaller number of drivers and lines. On the other hand, the mechanical and electrical implementation dramatically reduces the initial costs. The diagrams below show that RESYM systems clearly have a higher performance/price ratio over a major part of the application domain.
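The linear cost function C = I + B·p and the cost values quoted above are enough to work out system prices. The sketch below is a minimal illustration, not the authors' computation: the names `CostModel` and `power_per_price` are hypothetical, and the computing power is left as an input because it depends on the performance model and traffic assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostModel:
    """Linear cost function C = I + B * p, in units of one processor board."""
    initial: float    # I, initial system price
    per_board: float  # B, price of each processor board

    def price(self, boards: int) -> float:
        return self.initial + self.per_board * boards


# Values taken from the cost table of the paper.
RESYM = CostModel(initial=2.0, per_board=1.2)
P896 = CostModel(initial=5.0, per_board=1.0)


def power_per_price(model: CostModel, boards: int, computing_power: float) -> float:
    """Computing power per cost unit.

    computing_power is a multiple of a single processor's power and must be
    supplied by the caller (for instance from a bus contention model).
    """
    return computing_power / model.price(boards)


if __name__ == "__main__":
    for p in (1, 4, 8, 16):
        print(p, "boards:", round(RESYM.price(p), 1), "vs",
              round(P896.price(p), 1), "units")
```

With these values the RESYM system is cheaper for fewer than 15 processor boards and the two prices meet at 15 boards (20 units each), so over that range RESYM keeps the better ratio whenever its computing power is not much lower than that of P896.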

Fig. 5 Price/performance ratio (curves for RESYM and P896)

In the domain of global bus applications, RESYM presents a good compromise between system performance and cost. Technological advances in VLSI make possible the construction of complex bus interfaces, which allow optimised data transfer protocols resulting in a reduced number of lines (RESYM has 20 lines) with performance comparable to that of FutureBus (P896, 96 lines).

7. Implementation

The use of RESYM makes sense only if the bus length is very short and if an adequate implementation takes advantage of the new surface mount technology (SMD packages and direct bonding) and uses mostly HC-MOS circuits in order to allow a compact design with an acceptable power dissipation. Compared to present implementations, an important cost saving is realized on the backplane and the smaller power supplies. The only difficulty at present is obtaining the components: the HC-MOS family is not yet very complete, and SMD packages or chips for direct bonding are not easy to get. The packaging principle, named MICRAI for Micro CRAte Interconnect, is shown in figure 6.

Fig. 6 MICRAI box

Conducting semi-circular rings with teeth are used as edge connectors; they are simply inserted on a grooved ceramic cylinder. With a 20mm diameter cylinder, the spacing between slots is 3mm, which is enough to put the direct bonded drivers close to the connector on both sides of the board (figure 7). Standard components and modules of greater height can be placed at some distance from the connector. At a distance of 50mm, a socketed component or a hybrid module of 13mm height can be inserted. Memory 3-D modules are inserted a little further out, which respects the normal system layout. 3-D modules are an efficient way to save line length and increase the density. The wide spacing of the boards obtained at the periphery improves the cooling.

Fig. 7 Component placement on the boards (labels shown: metal ring connector, ceramic)

The reliability of the connections can be improved by soldering them to the boards: the central cylinder can easily be heated to melt the solder deposited on the connector, which allows the insertion or removal of the boards. This cylinder can be cooled to allow efficient heat dissipation for the bus drivers. A sheet of metal can be sandwiched in the printed circuit next to the connector to improve the heat distribution. The bus interface and driver will likely be implemented on a piggyback module soldered to the computer, memory and I/O boards. Two or more boards, each corresponding to a full subsystem, can share the same bus drivers.

Up to 13 boards can be connected radially, at 15° angles. The two boards at 45° relative to the symmetry axis can be 30% longer. The location of the fan and power supply allows a simple and compact package. The number of contacts is limited. With a height of 80mm (100cm² area), about 30 active lines plus the necessary power and ground lines can easily be implemented. This is more than required for RESYM, and can be increased in a more complex system using larger boards.
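The quoted dimensions can be cross-checked with simple geometry. The sketch below assumes that adjacent boards are exactly 15° apart and approximates the gap between two boards by the arc length at a given distance from the cylinder axis; board thickness is neglected, and taking the 50mm distance from the axis rather than from the connector is an assumption. The function names are hypothetical.

```python
import math

BOARD_ANGLE_DEG = 15.0       # angle between adjacent boards
CYLINDER_DIAMETER_MM = 20.0  # ceramic cylinder carrying the ring connectors


def angular_span_deg(boards: int) -> float:
    """Angle covered from the first to the last of `boards` boards."""
    return (boards - 1) * BOARD_ANGLE_DEG


def board_pitch_mm(radius_mm: float) -> float:
    """Arc length between adjacent boards at a distance from the axis."""
    return radius_mm * math.radians(BOARD_ANGLE_DEG)


if __name__ == "__main__":
    print("span of 13 boards:", angular_span_deg(13), "degrees")
    print("pitch at the cylinder surface:",
          round(board_pitch_mm(CYLINDER_DIAMETER_MM / 2), 2), "mm")
    print("pitch 50mm from the axis:", round(board_pitch_mm(50.0), 2), "mm")
```

Thirteen boards cover a half circle, consistent with the semi-circular connector rings. The pitch is about 2.6mm at the cylinder surface (close to the 3mm slot spacing quoted) and about 13mm at 50mm from the axis, in line with the 13mm module height.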


Several of the 2dm³ crates, each including a 5V, 10A power supply and fans, can be connected together using bus windows or gateways. In a workstation, a graphic multiprocessor and the main multiprocessor could typically use two crates and fit easily into an IBM-PC box, with the same amount of power dissipation but one hundred times the processing power. With a good standardization of modules and automatic manufacturing, the price could be highly competitive.

8. Conclusions

Next generation workstations which offer extended possibilities of processing power will need new solutions. RESYM and CBOX look attractive and the first tests are promising.

We would like to thank Roger Hersh and Philippe Schweitzer, who have contributed to the initial ideas, and the "Fonds National Suisse" (grant 2.722) and "VALTRONIC SA" for their financial support.

References

[BALA83] R.V. Balakrishnan, A Solution to the Bus Driving Problem, IEEE Micro, August 1984, pp. 23-27.
[BI86] Backplane interconnect, Digital Equipment (product likely to be announced in 1986).
[CON83] G. Conte, et al., Multiprocessors: M3 Bus System and TOMP Architectures, 199 p., Ufficio MUMICRO, Bologna, 1983.
[DELC86] D. del Corso, et al., Microcomputer Buses and Links, Academic Press, London, 1986.
[DRA67] A.W. Drake, Fundamentals of Applied Probability Theory, McGraw-Hill Book Co., New York, 1967.
[FAST82] Fastbus Tentative Specifications, US NIM Committee, August 1982.
[KIRR85] H. Kirrmann, MicroStandards, IEEE Micro, August 1985, pp. 82-89.
[LIN85] R. Lineback, Parallel processing: why a shakeout nears, Electronics, Oct. 28, 1985.
[MAD84] S.E. Madnick, J.J. Donovan, Operating Systems, McGraw-Hill, 1984.
[MAR82] M.A. Marsan, G. Balbo, G. Conte, Comparative Performance Analysis of Single Bus Multiprocessor Architectures, IEEE Transactions on Computers, Vol. C-31, No. 12, December 1982.
[MUL285] Multibus II Bus Architecture Specification Handbook, Intel Corp, Santa Clara, 1983.
[NUB83] NuBus Specifications, Texas Instruments, Irvine, 1983.
[OLS83] R.A. Olson, et al., Messages and Multiprocessing in the ELXSI System 6400, Compcon83, San Francisco, 1983.
[P89685] Specifications for Advanced Microcomputer Backplane Buses, IEEE-P896 Working Group, Draft 7.1, November 1985.
[RAT85] J. Rattner, Concurrent processing: A new direction in scientific computing, National Computer Conference 1985, pp. 157-166.
[ROET80] H. Röthlisberger, F.I.R.S.T., A high resolution raster scan display system for interactive layout of text and figures, Eurographics 80, C.E. Vandoni (ed), pp. 291-302.
[THUR79] K. Thurber, J. Masson, Distributed Processor Communication Architecture, Lexington Books, Lexington, 1979.
[VME85] VME Specification Manual, rev C, VITA Group, 1985.
