C. Gordon Bell*:, Robert C. Clien, Samuel H. Fuller,. John Grason, Satish Rege ..... 746-757, August. 1968. 4. Bergland, G P., "Fast Fourier Transform Hard.uare.
THE ARCHITECTURE AND APPLICATIONS OF COMPUTER MODULES: A SET OF COMPONENTS FOR DIGITAL SYSTEMS DESIGN* C. Gordon Bell*:, Robert C. Clien, Samuel H. Fuller, John Grason, Satish Rege and Daniel P. Siewiorek Departments o f Computer Science and Electrical Engineering Carnegie-Mellon University
ABSTRACT
This p a p e r discusses the design and use o f system-building modules of about minicomputer complexity. These modules (CMs), are intended t o f a c i l i t a t e t h e design o f t h e f u l l range o f digital s y s t e m s needed t o c a r r y o u t current and f u t u r e c o m p u t a t i o n a l tasks.
N a t i o n a l Semiconductor and AM1 [10j a.e prec;.s:-s c' t h e s e modules. This paper presents :-e r a t ~ c - a9 f c t h e t r e n d t o w a r d s more cotnplex rrca, es and f c - CVS in particular. The range o f a p p l i c a t , ~ ?of CVs a-: :-c implications, i.e., e f f i c i e n t cornrnunicattc.rs tnterfaces. i - e discussed. Finally, w e describe sorre a:~;~cat.c-s, a-:: discuss t h e communications needs ar:! csst/perfc-.-a-:e t r a d e o f f s r e l a t e d t o these applicatio-s.
ARCHITECTURE Module Complexity
(
Module sets f o r computer system design are b e c o m i n g increasingly complex, driven b y decreasing c o s t a n d size o f hardware and increasing computer s y s t e m performance requirements. Standardized module s e t s h a v e e v o l v e d f r o m circuit elements t o gates and f l i p - f l o p s t o 1C chips' t o register transfer level module In this paper w e introduce a new, more sets. complex, more flexible set, called Computer Modules (CMs).
A C M consists o f a processor (PC) and memory (Mp) o f a b o u t minicomputer complexity, together w i t h The s e v e r a l c a r e f u l l y designed ports: see Figure 1. I/O a n d i n t e r r u p t structures o f conventional computers make it d i f f i c u l t t o use them t o construct closely c o u p l e d n e t w o r k s . Addressing this problem, each p o r t o f a CM i s designed t o handle operations such as handshaking and buffering, executing concurrently w i t h These p o r t s allow us t o t h e processor o f t h e CM. c o n s t r u c t C M sys'tems covering a wide range o f cost a n d performance. CMs w i l l come i n t o physical existence within t h e n e x t f e w years: t h e current microprocessors o f Intel,
--------_-_-1
I I 1
I 1 Mp-
1
PC
-S
and o t h e r
I
Computer Modules
I I
CMs are another s t e p i n a csntinuing ! r ~ - s t o w a r d m o r e complex modules. Ore reason f c r :-s t r e n d i s t h a t defining more comprex modules '-2simpler modules yields distinct a:.;-tapes. C-e advantage i s t h a t t h e complex modu:es c t n be ~ 5 t : d i f f e r e n t applications without redes g l , and k2*6'0 s y s t e m s o f these modules can be constructec c v m o d i f i e d f a s t e r and more easily. A se:ond advc-:;I. arises f r o m economics o f scale: b y standardlzirg :-e m o r e complex modules, t h e y can b e mass pro:-ce~ r a t h e r t h a n made o n a custom basis.
-
T h e t r e n d t o w a r d more complex modules i s a.so d u e t o decreasing size and cost o f hardware a-c I re increasing pressures f o r complex systems. p r e s s u r e s f o r complex systems CO-P from s9.e-a sources. One i s t h e need f o r k -,n p e r k - a - : e I?: s y s t e m s such as t h e I B M 360191 (21, C X ST:' a n d ILLIAC I V [3) Another is increas r s aware-ess t h e advantages o f decentralized sjstems; t t s 1s evidenced by the appearance c! p o w e r f ~ l i/'3 processors, multiprocessor systems sucn as C.w-3 [;:. c o m p u t e r n e t w o r k s such as the ;??A pet, a x increasingly,'intell~~entterminals.
-
=+
?is lncreasine module compler~:y also disadvantages. These include (a) ir.flewlS~:.:j C ' s t r u c t u r e b e l o w t h e module level resuiting $ 3 ct:suboptimal use . o f resources t7d su=c=: - a i performance, and (b) inflexibility o f module !L-::C~ r e s u l t i n g i n suboptimal systems d e s t j r ~ In crcer t:,
*The r e s e a r c h in this paper was supported b y K a t m a l Science Foundation Grant GJ 32758X. **On leave a t Digital Equipment Corporation, hcafnarc, Mass. F i g u r e 1.
P41S Diagram o f a (31
1. Conrmun~calionlinks: sinele words and data strings ( l o w speed).
f
/
-
2. W o r d - b y - w o r d inter-computer buffers: single words, program-controlled. 3. Block t r a n s f e r among computers: data strings. Sharing memory among multiple processors: random access l o global variables. Sharing peripheral devices: block transmission (e.g. disk). 6. Broadcast of data/control to multiple, independent units. Table 1. Common Communicalion Mechanisms
r e d u c e t h e s e disadvantages, any module set consists of a r a n g e o f module types. A t some p o b t , however, t h e number o f module types begins t o n u l l i f y the advantages t h a t can b e gained b y standardizing the modules. Register transfer level modules are already b u m p i n g against this ceiling. Any set o f more complex modules w o u l d either take too many module types o r become t o o inflexible i n function or form, i f it were n o t f o r t h a t classic idea: stored programs. Since CMs a r e programmable, any one module t y p e i n the CM m o d u l e s e t can p e r f o r m any function, and it is n o t n e c e s s a r y t o have many d i f f e r e n t CM types. A t this p o i n t w e envision a CM set where d i f f e r e n t types o f C M d i f f e r o n l y i n size o f primary memory (Mp) and d e s i g n o f I / O p o r t s (e.g. p o r t s for communication b e t w e e n local o r remote CMs and p o r t s f o r interfacing w i t h conventional device controllers). Any one CM, ,-,however, can occupy only one point in the cost/perforrnance space. To obtain a variety o f levels / o f c o s t and performance, CMs are interconnected i n t o s y s t e m s o f v a r y i n g size and complexity.
I
%
T h e CM can also b e viewed as p a r t o f t h e e v o l u t i o n o f c u r r e n t centralized computer structures i n t o h i g h l y distributed, intelligent networks. This e v o l u t i o n a r y sequence has proceeded from (a) single p r o c e s s o r w i t h centralized control o f 1/0, through (b) t h e a d d i t i o n o f i n t e r r u p t s and local control o f 110, (c) 110 processors, and id) multiple central and 110 processors, finally t o (e) multiple, interconnected computers. Almos-t e v e r y current I/O device (e.g. t y p e w r i t e r , card-reader) and secondary storage device (e.g. magnetic tape, disks) could be controlled b y CMs i n f u t u r e systems. This not only permits more c o n c u r r e n c y as well as autonomy w i t h respect t o a c e n t r a l computing site, b u t also permits b e t t e r local control for higher reliability and b e t t e r failure diagnosis. Communications The wide range o f performance f o r CM systems, above, w i l l be possible only i f CMs can c o o p e r a t e e f f i c i e n t l y i n parallel systems. This requires that the modules b e able t o synchronize and communicate e f f i c i e n t l y . I n fact, this is the major CMs from current factor that differentiates \minicomputers.
b u i l t t o s a t i s f y t l i c needs o f various parallel systems. One instance is t h e highly multiplexed switch o f the Ariot her is the asynchronous, .extendable C.mmp [14]. Still r i n g o f t h e D ~ s t r i b u t e d Comput~ng System [8]. o t h e r s are t h e geographically dispersed ALGHA [I]and ARPA n e t w o r k [12] systenis, the bus systems such as t h e PDP-I I Unibus [.GI, and the synchronized, highly s t r u c t u r e d ILLIAC I V system. The more common communication mechanisms used in such systems are given I n Table 1. These range f r o m t h e conventional comrnunications link f o r pair dialogues t o t h e shared p r i r ~ i a r y memory which permits a n y processor t o access any global variable. The v a r i o u s dimensions o f t h e interconnection problem are (a) logical switching s t r u c t u r e (none: single transmittersingle receiver; broadcast: sinele transmitter- multiple receiver; Unibus-type: single pair d~aloo,ues broadcasted), (b) physical switching structure (links + c e n t r a l switch; bus; loop), (c) node separation (local; d i s t r i b u t e d ) , (d) message t y p e (1-bit events; data word; v a r i a b l e name + data word; data blocks), and (e) node addresses (none; single address; subset o f nodes). This wide v a r i e t y o f systems is representative o f the f l e x i b i l i t y required o f CM interfaces. This indicates t h a t CMs must have several communications p o r t s , each w i t h facilities f o r handshaking and o t h e r f o r m s o f synchronization. E f f i c i e n c y o f t e n requires t h a t module interfaces h a v e independent processin& power. For instancq i n many cases independently controlled b u f f e r s should b e p r o v i d e d t h a t do n o t liave t o be directly managed b y t h e c e n t r a l processor. I n t e r r u p t queueing and some simple i n t e r r u p t processing might also be performed i n d e p e n d e n t l y o f t h e central processor. CM p o r t s m u s t t h e r e f o r e have sufficiant power and flexibility t o constr'uct e f f i c i e n t parallel systems.
APPLICATIONS
In o r d e r
t o learn more about how CMs should b e designed, and also t o illustrate how they might b e u s e d t o design systems, we have investigated a set o f applications, including array processing (Fast Fourier T r a n s f o r m processing, generalized array processin-,, and r a d a r signal processing), sorting, lanetlaze processing (compilation y r d machine language interpretation), ancl p r o c e s s control. I n each case, we liave tried t o brinz o u t t h e communications reqbirements and the rarlze o f p e r f o r m a n c e t h a t can he achieved b y varyin2 the CM s y s t e m structure. I n this paper we describe i n detail only the sorting and Fast Fourier Transform applications. The other applications are discussed briefly.
discussed
\
.
well
Various physical communications structures as as s o f t w a r e protocols have been proposed or
Fast Fourier Transform The Fast Fourier Transform (FFT) lends itself t o p a r a l l e l ancl pipelined processin2 [4,5]. For an n-point t r a n s f o r m , t h e algorithm can be represented b y an a r r a y o f nodcs w i t h n r o w s and log, n colum~is. A t e a c h node, t w o values c a l c u l ~ t e db y the nodes i n the p r e v i o u s column are processetl, produc~n; a v ~ l u c 10
.
b c u s e d 111 t h c next column. The calculations of each column nr11r.t IIC c ~ m p l c l c d b c f o r c the calculations o f the next r o l u r ~ ~ n can procccd. W ~ t h i n any column, h o w e v e r , t h e calculations at each node can proceed i n d e p e n d e n t l y o f the calculations at the other nodes. F o r a n n-point transform, therefore, four simple C M implcmcntations are: (a) a f u l l parallellpipeline s t r u c t u r e using n log n CMs, one f o r each o f the nodes, ( b ) a parallei structure using n CMs, one f o r e a c h of t h e nodes i n a column, i n e f f e c t folding t h e columns back o n themselves, using an appropriate communications net, (c) a pipeline structure using 1og.n CMs, o n e f o r each column, and (d) a completely serial i m p l e m e n t a t i o n using just one CM. The speeds i n each case a r e approximately proportional t o t h e number o f CMs. Furthermore, any node can be implemented using I n this way, d i f f e r e n t s e v e r a l CMs f o r higher speed. performance requirenients can be satisfied by c o n s r u c t i n g d i f f e r e n t l y structured CM systems. The communications requirements are d i f f e r e n t for each o f t h e f o u r implementations discussed. P r i v a t e o r shared (bus or loop) connections, size o f buffers required, handshaking protocols and interconnection patterns are some of the considerations. sorting Various sorting algorithms can be implemented b y systems. Some bucket sorts are chosen f o r illustration.
CM
Suppose t h e file t o b e sorted is divided among a l l connected via a ring-type communication net. Each o f t h e n CMs are pre-assigned a range o f values. The C M examines i t s portion o f t h e f i l e and a n y r e c o r d which does n o t fall within i t s assigned r a n g e i s placed o n t h e ring. Records o n the r i n g c i r c u l a t e u n t i l picked u p b y the CM i n whose range t h e r e c o r d k e y falls. A f t e r all the records have been p i c k e d up, each CM w i l l s o r t the records it collected ( t h i s c a n b e done either internally t o t h e CM or b y reassigning ranges t o t h e CMs and repeating t h e d i s t r i b u t i o n process).
n CMs,
A n o t h e r possibility is t o use a t r e e s t r u c t u r e Here, each record arriving at a node is (Figure 2). s e n t t o t h e upper or lower successor node as a p p r o p r i a t e , u n t i l it arrives at a leaf node. Each leaf n o d e t h e n s o r t s the records i t collected. This t r e e i n d i c a t e s a paralleljpipeline structure which can b e
I
Figure 2.
Sorting Tree
J
m a p p e d i n various ways o n t o CM s . r r ? - 5 : .e cc- : u s e o n e C M f o r each node i n tke : - E . $ . c * c - e ! : r e a c h Ievel, f o r instance. One might -;: :e :-a: i s a s p e e d mismatch i n the ratio 2:1 :s:-?Ja-, :I: successive levels o f the tree; t o cc-.s:: :-.s. ;c c a use a n a r r a y s t r u c t u r e similar t o t - 2 : i;* t-• Fas: F o u r i e r Transform. It is important :t -::e t-a: w - : . e t.hcse s t r u c t u r e s are similar, the t - r ; ; : cz::ePn , S d i f f e r e n t and consequently differe-: c:-r,c.ca:.zrs s t r u c t u r e s may b e indicated.
Other
Applications
In generalized array process - 5 (s-:? as p e r f o r m e d b y t h e ILLIAC 1V [3]) aca s:z:.a':?= avra7. processing (such as radar signal c-::ess -5;. tre o u t s t a n d i n g f e a t u r e i n many cases is c :so c:-2 - a cc the component CMs. Comm~nicr::- 22-:. 9 : ~ r e q u i r e m e n t s are likely t o be high, so 1-2 CL.% - s f tt c o n f i g u r e d as an array, a vector, c - ..-;, t - sc-.e o t h e r interconnection s t r u c t u r e may t e - C D J I o . ~1-0 , p e r f e c t . s h u f f l e [13]). Each CM can r:cs:ez, sss', o n e o f a series o f computation ste;s. 2 z e , r s ;C : 2(as in t h e COC Star), or each CM m,. z s z : , 2 . : the s t e p s t o one particular data item e; a 1.-e ( a s IS t y p i c a l f o r t h e ILLIAC IV).
-
In compilation, partitioning cc-: 2: op r o c e d u r e i n t o phases allows p i p e h -; c f c:-= .cs programs, w h i l e techniques similar r: !-:-e-e-:a: compiling [7] can b e applied t o obt; z ~ - a . . es-: a: t h e subprogram level. The outstar.dir; 'aa:~-e the communications s t r u c t u r e is the COX-:^ c a y czse (symbol table, etc.) accessed b y most c: ;re 3's a t e a c h s t a g e of t h e compilation. This c c t a s:rl::de m a y b e s t o r e d i n one memory (wk :.rr5: il~:C r e q u e s t s f r o m any o f t h e CMs) or rnr: c e b - s catec a t e a c h CM. These d i f f e r e n t schemes I-aiy c.ifeeent communications structures.
-
.,
Jn machine languase interpretatic-, rr-a::e: s- is o b t a i n e d b y pipelining instruction fe!c-, ce::~? a'c! execution, and b y multiplying the n u ~ t i -c i el.:-! $7 units. The denlands rnade o n the ::--,r :o: :-5 system involves broadcastin2 of :=-:~-E-?IL!P~ instructions, b u f f e r s a t the execution :s, e.: A f i n e example o f a high-performance st--::,re as 1-e 3 6 0 1 9 1 121. B e t t e r system utilizaticr! it-: t-ere;:*e b e t t e r performance) can b e achieved ~f :-s CL.( s , s : c ~ i s u s e d t o execute simultaneously r:-P : - 2 7 :-e program, o r i n t e r p r e t parallel machine ' 2 - j - 2 j e s ;i !)
--
\
In t h e case o f process contrc! c:-:-: -5. c o m p u t a t i o n can usually be done i n p.-a 5 8 7 :ce-s A s,s:?2' o f m u l t i p l e independent control tasks. t h i s t y p e i s t y p i c a l o f (a) telephcre I - :c"; reg. c o n t r o l o f switching paths i n terms C : r c 5:: * f - : e requests), (b) discrete process control i s ;. a t.i-s'~~ machine r e q u i r i n g simultaneous s o l u t ~ o nc 53; 2 : : e q u a t i o n s o n a time-sampled bas~s), aqd ':, c SEE:- >:: t i m e - s a m p l e d c o n t r o l (e.3. simultaqe$-s 5 2 , : c' ; c v \a-,:-s m u l t i p l e independent c o n t r o l equa!iocs loops). I n each case, w h ~ l emany opera: :-s a-e c:-e in parallel, each o p e r a t ~ o nis small, and . - z r z w c e - : o f o t h c r operations. Cornmunicat~ons arc-; rce v l * 2-5 c o n t r o l tasks is p r o v ~ d e d by m u l t ~ p l e I r..s :a t.::ess i n p u t s and o u t p u t s and t o other compu:e-s (5ee F;-'e
l i n k s f rm s i n g l e variable t o a t least
permits a s i n g l e o u t p u t t o be
a g c n ~ r a l structi~re su'tabk for rr.ost thcrc applications? (Somc work has hccn done i n t h ~ s How can wc d c v l n f o r a ei,/en d i r e c t i o n . [ I I].) r e l i a b i l i t y requirement? HOW do w e specify t b e communication requirements for a e i v c ~applicat~oc,'and h o w d o w e d c s i ~ n physical cornrwr.:cat~on structures a n d i n t e r - m o d u l e protocols t o f i t tk,ese? These 2nd They will be o t h e r questions must be answered. s u b j e c t s o f f u r t h e r research. REFERENCES
-
c o n t r o l and cccxunicat i o n
F i g u r e 3.
global s t a t e and v a r i a b l e transmission Process c o n t r o l f o r a s e t of computers p e r m i t t i n g redundant c o n t r o l paths.
3). Each C M also requires communications t o all o t h e r
c
CMs i n o r d e r t o access common input variables, c o m m u n x a t e global s t a t e variables, and t o r e p o r t status. The communicatior, requirements are similar t o t h o s e encountered. when using multiple CMs f o r c o m p i l a t ~ o no r interpretation. CONCLUSION We h a v e discussed the architecture o f CMs and the o o s s i b i l i t y o f constructing computing systems w i t h them. A l t h o u g h several questions remain t o b e answered about the CM's architecture, several g r e l i m i n a r y conclusions emerged from this study: (a) a rn:croprccessor i s included within each CM (b) t h e s t r u c t u r e o f t h e I/O p o r t s is crucial, and (c) more than o n e 110 p o r t i s needed per CM. Table 2 gives some o f t h e characteristics we expect CM systems t o have, b a s e d o n t h e anplications we investigated. M a j o r questions remain. Can, CM structures s u c c e s s f u i i y compete w i t h conventional computers? (We 63 n o t rocossarily expect them t o execute machine language f a s t e r than an emulated machine, but, given a s e t o f a ~ o l ! c a t i o n s , we wonder how well a CM system w o u l d d o compared t o a conventional computer.) How w e l l w i l l a C M s t r u c t u r e f i t a set o f applications? I s 4ttribute
Values
I;o. of processors \'errors size VJord size :.o. of p a r t s .a.o f CL! t y p e s t;o. o f CVs In a system
1 1K words and over 8 t o 16, bits 2 to5 1 to? A few to several thousand
T ? ? ' r 2. Properties of CM5 and CM Svstems
1. Abramson, N., "The ALOHA System Ano:her Alternative for Computer CornrrunicationsIm FLCC Proceedings 1970, Vol. 37, pp. 221-2C5. 2. Anderson, D. W., Sparacio, F. J., Tomasulo, R.U.., "The I B M System/ 3 6 0 Model 91: Vachine Pni;osophy a n d Instruction-Handling," IBM Journ;.i o f Research and Development, Vol. 11, pp. 8-24, Jawary, 1967. 3, Barnes e t al., "The llliac I V Corr,;~ter," IEEE Trans. Computers, Vol. C-17, No. 8, pp. 1 746-757, August 1968. 4. Bergland, G P., "Fast Fourier Transform Hard.uare Implementations--An Overview", JEEE Trans. on Audio a n d Electroacoustics, Vol. AU-17, pp. 104-119, June 2, 1969. 5, Brigham, E. 0. and Morrow, 2. E., "The Fast F o u r i e r Transform," IEEE Spectrum, Vol. 4, pp. 63-70, December 1967. 6. Digital Equipment Corporation, T C P - 1 1 Interface Manual," DEC-11-HIAB-D, Digital Equipment Cor oration, Maynard, Mass., 1971. Z. Earley, J. and Caizerpues, P. 'A Methcd for I n c r e m e n t a l l y Compilinz L a n p e s w i t h h s t e d Statement Structure", Comm. o f the ACM, Vol. 15, pp. 1040-1044, December 1972. 8. Farber, 0. J. and Larson, K. C., "The System A r c h i t e c t u r e o f t h e Distributed Computer System The Communications System," Symposium o n Computer N e t w o r k s , The Polytechnic I n s t i t u t e o f Brooklyn, April, 1972. G. and Tate, 0. P., "Control Data 9. Hintz, R. STAR-100 Processor Design," IEEE C c x p u t e r Society I n t e r n a t i o n a l Conference, September 1972. 10. Lapidus, G., "N3S/LSI Launches the Low-Ccst 33-40, Processor," IEEE Spectrum, Vol. 9, pp. November 1972. R., "Dynamic Control Structures and 11. Lesser, V. t h e i r use .in Emulation", Ph.0. Thesis, Ccrnputer ScLence Dept., ~ t a k f o r dUniversity, Aueust 1972. G., Wessler, B. D., "Computer 12. Roberts, L. N e t w o r k Development t o Achieve Resource Sharing,' SJCC Proceedings 1970, Vol. 36, pp. 543-549. S., "Parallel Processing w i t h the 13. Stone, H. P e r f e c t Shuffle," IEEE Trans. Computers, Vol. C-20, pp. 153-160, February 1971. G., "C.mrnp A 14. Wulf, W. A., Bell, C. Multi-Mini-Process~r," FJCC P r o c e e d i n ~ s 1972, Vol. 41, P a r t 11, pp. 765-777.
-
-
e
-
-
-
-
-
-
-
.
:
1