The Design of an Asynchronous Memory Management Unit

Chris J. Myers, Computer Systems Laboratory, Stanford University, Stanford, CA 94305

Alain J. Martin Computer Science Department California Institute of Technology Pasadena, CA 91125

Abstract

This paper demonstrates the ease of transformation from a general description to a high-level CSP specification and, through systematic transformations and optimizations, to a circuit implementation. We present a detailed description of the design of a fully asynchronous memory management unit (MMU) to be used in conjunction with an asynchronous microprocessor. It was designed using Martin's synthesis method and CAD tools developed at Caltech. In addition to illustrating the design of the controller, the formal derivation of the asynchronous datapath is also described.

1 Introduction

Recently, there has been a resurgence of interest in the design of asynchronous circuits [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]. There is still, however, a widely held belief that asynchronous design is difficult compared to synchronous design. The purpose of this paper is to describe, through the example of a simple memory management unit (MMU), a formal systematic procedure that facilitates the specification and synthesis of asynchronous circuits. Martin's synthesis method [2] specifies asynchronous circuits using a language based on Hoare's CSP [12] and Dijkstra's guarded commands [13]. A CSP specification can be naturally derived from a general word description. Once the circuit is specified formally, optimizations can be made at a high level. A systematic procedure is then used to transform this high-level specification into a circuit implementation. The resulting circuit satisfies the quasi-delay-insensitive delay model, in which functional correctness is independent of gate and wire delays except for isochronic forks, where the differences in delay between branches are assumed to be negligible.

While asynchronous circuits are often used for controllers, many believe that they are not well suited for microprocessors and other datapath-dependent designs. Recently, Martin and his students have successfully designed many fully asynchronous chips with complex datapaths. Most notably, they obtained very promising results in the synthesis of an asynchronous microprocessor [14] with a load/store architecture. While the datapath of the MMU presented here is not very complicated (a comparator and two registers), it is sufficient to introduce two new design techniques that reduce asynchronous datapath complexity. First, we use dynamic asynchronous logic to implement a 15-bit comparator. This allows the complete comparison to be done in a single gate. Second, we use tri-state pads for completion detection when sending signals off-chip [15]. This allows us to avoid the need for dual-rail signals, and thus twice as many pads, at the output.

This paper is divided into five sections. Section 2 discusses the derivation of the high-level CSP specification of the MMU. Section 3 describes the synthesis of the control circuitry. Section 4 describes the synthesis of the datapath circuitry. Section 5 presents our results and conclusions.

The research described in this paper was sponsored by the Defense Advanced Research Projects Agency, DARPA Order number 6202, and monitored by the Office of Naval Research under contract number N00014-87-K-0745. Chris J. Myers was supported by an NSF Fellowship.


2 Specification

The MMU converts a 16-bit memory address to a 24-bit real address by concatenating 8 bits from one of two segmentation registers with the memory address. When a memory read is requested, the contents of the segmentation read (sr) register are placed on the real address (ra) bus; similarly, the contents of the segmentation write (sw) register are used for memory writes. These segmentation registers are accessed through reads and writes to memory addresses FFFF and FFFE (hexadecimal), and their contents are placed on or sampled from the lower 8 bits of the data bus. The output of this circuit is connected to a memory interface. The block diagram for the MMU is shown in Figure 1.

Figure 1: Block diagram for the MMU. Thin arrows represent communication channels, and thick arrows represent buses. The dashed arrow indicates that the memory address is not modified by the MMU.
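The address translation described above can be illustrated with a short sketch. It assumes that the 8 bits from the segmentation register form the most significant bits of the 24-bit real address, which the text does not state explicitly; the function name real_address is hypothetical.

```python
def real_address(segment: int, ma: int) -> int:
    """Concatenate an 8-bit segment with a 16-bit memory address.

    Assumption (not stated in the text): the segment supplies the
    upper 8 bits of the 24-bit real address.
    """
    if not 0 <= segment <= 0xFF:
        raise ValueError("segment must fit in 8 bits")
    if not 0 <= ma <= 0xFFFF:
        raise ValueError("memory address must fit in 16 bits")
    # The memory address itself passes through unmodified.
    return (segment << 16) | ma
```

For a memory read the segment would come from sr, and for a memory write from sw.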

2.1 From General Description to CSP

There are six possible cycles that this circuit can enter, depending on the requests made by the environment: read from or write to the segmentation read register, read from or write to the segmentation write register, and read from or write to memory. The MMU waits in an idle state until a load or store is requested by the microprocessor (i.e., until a communication is detected at the memory data load (MDl) port or the memory data store (MDs) port). When a memory request is made, a comparison is done to determine whether a segmentation register or main memory is being addressed. Depending on whether a load or store is requested and on the result of this comparison, one of six courses of action is taken. For example, a memory load consists of the following operations:

1. Wait for a communication on the memory data load port (i.e., $\overline{MDl}$, which denotes a probe of the port MDl).
2. Compare the memory address to determine that it is equal to neither FFFE nor FFFF (i.e., $b3$ is true where $b3 := \neg(ma = \mathrm{FFFE}) \wedge \neg(ma = \mathrm{FFFF})$).
3. Put the segmentation read address on the real address bus (i.e., $ra := sr$).
4. Request a memory load from the memory interface and wait for it to complete (i.e., initiate a communication on the MSl port).
5. Complete the communication on the memory data load port (i.e., MDl).

A similar set of operations is performed for a store to memory. If, however, the segmentation read register is to be written, the operations are as follows:

1. Wait for a communication on the memory data store port (i.e., $\overline{MDs}$, which denotes a probe of the port MDs).
2. Compare the memory address to determine that it is equal to FFFF (i.e., $b1$ is true where $b1 := (ma = \mathrm{FFFF})$).
3. Put the value from the data bus into the segmentation read register (i.e., $sr := data$).
4. Complete the communication on the memory data store port (i.e., MDs).

There are similar sets of operations for the other types of accesses to the segmentation registers. From this description, we derive our original CSP specification:

$$
\begin{array}{l}
*[[\,\overline{MDl} \rightarrow (b1, b2, b3 := (ma = \mathrm{FFFF}),\ (ma = \mathrm{FFFE}),\ \neg(ma = \mathrm{FFFE}) \wedge \neg(ma = \mathrm{FFFF})); \\
\qquad [\,b1 \rightarrow data := sr;\ MDl \\
\qquad [\,]\ b2 \rightarrow data := sw;\ MDl \\
\qquad [\,]\ b3 \rightarrow ra := sr;\ MSl;\ MDl \\
\qquad ] \\
\ \ [\,]\ \overline{MDs} \rightarrow (b1, b2, b3 := (ma = \mathrm{FFFF}),\ (ma = \mathrm{FFFE}),\ \neg(ma = \mathrm{FFFE}) \wedge \neg(ma = \mathrm{FFFF})); \\
\qquad [\,b1 \rightarrow sr := data;\ MDs \\
\qquad [\,]\ b2 \rightarrow sw := data;\ MDs \\
\qquad [\,]\ b3 \rightarrow ra := sw;\ MSs;\ MDs \\
\qquad ]\,]]
\end{array}
$$
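As an informal illustration of the six cycles in the specification, the following sketch models the MMU at the behavioral level only. It ignores handshaking and concurrency entirely; the class name MMUModel and the memory_requests log are hypothetical helpers introduced for this example, not part of the design.

```python
class MMUModel:
    """Behavioral sketch of the six cycles of the CSP specification."""

    def __init__(self):
        self.sr = 0  # segmentation read register (8 bits)
        self.sw = 0  # segmentation write register (8 bits)
        self.memory_requests = []  # hypothetical log of (port, ra, ma)

    def load(self, ma):
        """Model a communication on MDl; returns register data or None."""
        b1 = ma == 0xFFFF
        b2 = ma == 0xFFFE
        if b1:
            return self.sr            # data := sr
        if b2:
            return self.sw            # data := sw
        # b3: ra := sr, then a communication on MSl
        self.memory_requests.append(("MSl", self.sr, ma))
        return None

    def store(self, ma, data):
        """Model a communication on MDs."""
        b1 = ma == 0xFFFF
        b2 = ma == 0xFFFE
        if b1:
            self.sr = data & 0xFF     # sr := data (lower 8 bits)
        elif b2:
            self.sw = data & 0xFF     # sw := data (lower 8 bits)
        else:
            # b3: ra := sw, then a communication on MSs
            self.memory_requests.append(("MSs", self.sw, ma))
```

A load from an ordinary address returns None here because the data would be supplied by the memory interface, which is outside this model.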

2.2 Optimizing the CSP Specification

The above CSP specification could be implemented directly, but two small optimizations improve the area and performance of the implementation. First, one comparator rather than three can be used to generate $b1$, $b2$, and $b3$. The signal $b3$ can be determined by comparing only the most significant 15 bits of the memory address with all bits high. If all of these bits are high, then $b3$ is false; otherwise, $b3$ is true. Using $b3$ and the value of the least significant bit of the memory address ($ma_0$), $b1$ and $b2$ can be determined (i.e., $b1 = \neg b3 \wedge ma_0$ and $b2 = \neg b3 \wedge \neg ma_0$). Now, we must wait for $b3$ to be computed before $b1$ and $b2$ can be computed. The logic for generating $b1$ and $b2$ from $b3$, however, is very simple. Therefore, there is no significant performance penalty in doing this, and a significant reduction in circuit area is achieved.

The second optimization is to perform the assignment of the real address ($ra$) earlier. Which segmentation register is to be written to the real address bus is known as soon as the load or store request is detected. Therefore, the value of the segmentation register can be put onto the real address bus immediately. Putting a value on the real address bus when the segmentation registers are being addressed does not cause a problem, since the memory interface will not get a memory request (i.e., a communication on the MSl or MSs port). This optimization improves performance for the most common cases, memory loads and stores. The delay of a memory load or store is reduced from that given in equation 1 to that given in equation 2, where $\tau_{ma}$ is the delay of the memory address comparison and $\tau_{ra}$ is the delay of the real address assignment.

$$\tau_{mem} = \tau_{ma} + \tau_{ra} \qquad (1)$$

$$\tau_{mem} = \max(\tau_{ma}, \tau_{ra}) \qquad (2)$$

The delay of loads and stores of the segmentation registers, however, has potentially increased from that of equation 3 to that of equation 4. Whether it actually increases depends on the values of these delays, but this is acceptable, since memory accesses are much more likely than segmentation register accesses.

$$\tau_{reg} = \tau_{ma} \qquad (3)$$

$$\tau_{reg} = \max(\tau_{ma}, \tau_{ra}) \qquad (4)$$

Our final CSP specification is given below:

$$
\begin{array}{l}
*[[\,\overline{MDl} \rightarrow ((ra := sr) \parallel (b3 := \neg(ma = \mathrm{FFFE}) \wedge \neg(ma = \mathrm{FFFF});\ b1, b2 := \neg b3 \wedge ma_0,\ \neg b3 \wedge \neg ma_0)); \\
\qquad [\,b1 \rightarrow data := sr;\ MDl \\
\qquad [\,]\ b2 \rightarrow data := sw;\ MDl \\
\qquad [\,]\ b3 \rightarrow MSl;\ MDl \\
\qquad ] \\
\ \ [\,]\ \overline{MDs} \rightarrow ((ra := sw) \parallel (b3 := \neg(ma = \mathrm{FFFE}) \wedge \neg(ma = \mathrm{FFFF});\ b1, b2 := \neg b3 \wedge ma_0,\ \neg b3 \wedge \neg ma_0)); \\
\qquad [\,b1 \rightarrow sr := data;\ MDs \\
\qquad [\,]\ b2 \rightarrow sw := data;\ MDs \\
\qquad [\,]\ b3 \rightarrow MSs;\ MDs \\
\qquad ]\,]]
\end{array}
$$
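The claim that a single 15-bit comparison plus the least significant bit yields the same three booleans as three full comparisons can be checked exhaustively. The sketch below does so for all 65536 addresses and also restates the delay expressions of equations 1 to 4; the function names are hypothetical, and any delay values passed in are illustrative rather than measured.

```python
def three_comparators(ma):
    """b1, b2, b3 as defined in the original specification."""
    b1 = ma == 0xFFFF
    b2 = ma == 0xFFFE
    b3 = (not b1) and (not b2)
    return b1, b2, b3


def one_comparator(ma):
    """b1, b2, b3 derived from one comparison of bits 15..1 plus bit 0."""
    upper_all_high = (ma >> 1) == 0x7FFF   # bits ma15..ma1 all high
    b3 = not upper_all_high
    ma0 = bool(ma & 1)
    b1 = (not b3) and ma0
    b2 = (not b3) and (not ma0)
    return b1, b2, b3


def equivalent_for_all_addresses():
    """Exhaustively compare both formulations over every 16-bit address."""
    return all(three_comparators(a) == one_comparator(a)
               for a in range(1 << 16))


def memory_delay(t_ma, t_ra, optimized):
    """Equation 1 (sequential) or equation 2 (concurrent)."""
    return max(t_ma, t_ra) if optimized else t_ma + t_ra


def register_delay(t_ma, t_ra, optimized):
    """Equation 3 (original) or equation 4 (after the optimization)."""
    return max(t_ma, t_ra) if optimized else t_ma
```

The register delay only grows when the real address assignment is slower than the comparison, which is the trade-off accepted in the text.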

2.3 Process Decomposition

The next step of the design process is to separate the control from the datapath using a technique called process decomposition. In this technique, each operation in the CSP program is replaced with a communication on a port. As an optimization, the three memory address compare operations are combined into one communication on the port B. This communication returns exactly one of three boolean variables ($b1$, $b2$, $b3$) as true, indicating the result of the comparison. The resulting CSP program for the control part is given below:

$$
\begin{array}{l}
*[[\,\overline{MDl} \rightarrow (RA \parallel B?(b1, b2, b3)); \\
\qquad [\,b1 \rightarrow LSR;\ MDl \\
\qquad [\,]\ b2 \rightarrow LSW;\ MDl \\
\qquad [\,]\ b3 \rightarrow MSl;\ MDl \\
\qquad ] \\
\ \ [\,]\ \overline{MDs} \rightarrow (WA \parallel B?(b1, b2, b3)); \\
\qquad [\,b1 \rightarrow SSR;\ MDs \\
\qquad [\,]\ b2 \rightarrow SSW;\ MDs \\
\qquad [\,]\ b3 \rightarrow MSs;\ MDs \\
\qquad ]\,]]
\end{array}
$$

Seven new ports, which initiate seven datapath processes, have been created. The resulting block diagram with these new ports is given in Figure 2. The port B is used to begin a memory address comparison and get the results. The CSP program for this process is as follows:

$$
\begin{array}{l}
*[[\,\overline{B} \rightarrow T, F := (\wedge\, k : 1..15 : ma_k),\ (\vee\, k : 1..15 : \neg ma_k); \\
\qquad [\,T \wedge ma_0 \rightarrow B!b1 \\
\qquad [\,]\ T \wedge \neg ma_0 \rightarrow B!b2 \\
\qquad [\,]\ F \rightarrow B!b3 \\
\qquad ]\,]]
\end{array}
$$

The ports RA, LSR, and SSR are used to move or change the value of the segmentation read register. Similarly, the segmentation write register is affected by communications on the ports WA, LSW, and SSW. Each of these processes has the same form. For example, for the process initiated by communications on the port SSR, there is the following CSP program:

$$*[[\,\overline{SSR} \rightarrow sr := data;\ SSR\,]]$$
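To make the decomposed control program concrete, the sketch below lists the port communications that the controller performs for one request. It is a sequential approximation: the communications on RA (or WA) and B are concurrent in the CSP program, so their order in the returned list is arbitrary. The names ACTION and control_cycle, and the compare callback standing in for the comparator process, are hypothetical.

```python
# Datapath or memory-interface port selected for each
# (request port, comparison result) pair in the control program.
ACTION = {
    ("MDl", "b1"): "LSR", ("MDl", "b2"): "LSW", ("MDl", "b3"): "MSl",
    ("MDs", "b1"): "SSR", ("MDs", "b2"): "SSW", ("MDs", "b3"): "MSs",
}


def control_cycle(request, compare):
    """Return the port communications for one load or store request.

    request: "MDl" or "MDs" (the probed port).
    compare: callable returning "b1", "b2" or "b3"; it stands in for
             the comparator process reached through port B.
    """
    if request not in ("MDl", "MDs"):
        raise ValueError("request must be 'MDl' or 'MDs'")
    address_port = "RA" if request == "MDl" else "WA"
    # RA (or WA) and B proceed in parallel; listing order is arbitrary.
    communications = [address_port, "B"]
    result = compare()
    communications.append(ACTION[(request, result)])
    communications.append(request)  # complete the probed communication
    return communications
```

For example, a memory load performs RA and B, then MSl, and finally completes MDl.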

3 Control

The design of the controller is done in several stages. First, we derive a handshaking expansion from the CSP specification described above. This expansion is reshuffled to optimize for concurrency while minimizing the circuit area of the implementation. Finally, the implementation is derived from this reshuffled handshaking expansion.

3.1 From CSP Specification to Handshaking Expansion

Figure 2: Block diagram for the MMU with control and datapath separated.

Before the handshaking expansion is generated, a protocol must be assigned to each port. We decided to use four-phase handshaking with either a passive or an active protocol. For example, MDl is probed, so we must make it passive. Therefore, we expand the communication on the MDl port as follows:

$$[mdli];\ mdlo\uparrow;\ [\neg mdli];\ mdlo\downarrow$$

Similarly, MDs uses a passive protocol. All the other ports use an active protocol. For example, RA is expanded as shown below:

$$rao\uparrow;\ [rai];\ rao\downarrow;\ [\neg rai]$$

It is often possible to replace an active protocol with a lazy-active protocol as given below:

$$[\neg rai];\ rao\uparrow;\ [rai];\ rao\downarrow$$

With this protocol, the system is not forced to wait for the completion of the four-phase handshake of a process until it uses that process again. This technique increases throughput when one type of operation is followed by a different type. For all active ports, we used a lazy-active protocol.

We have chosen to implement the port B using four signals because data (the result of the comparison) needs to be transferred on this port. The three values are returned using a one-hot encoding as shown in the following expansion:

$$[\neg b1i \wedge \neg b2i \wedge \neg b3i];\ bo\uparrow;\ [b1i \vee b2i \vee b3i];\ bo\downarrow$$

With the communications expanded, they can be substituted into the CSP specification to produce a handshaking expansion. First, we replace each probe on a passive port $x$ with the wait $[xi]$, and each completion of a communication on a passive port $x$ with $xo\uparrow;\ [\neg xi]$. Next, we replace each communication on a lazy-active port $x$ with $[\neg xi];\ xo\uparrow;\ [xi]$. The reset of each communication, $xo\downarrow$, is initially placed at the end of each sequence. The result is the following:

$$
\begin{array}{l}
*[[\,mdli \rightarrow (([\neg rai];\ rao\uparrow) \parallel ([\neg b1i \wedge \neg b2i \wedge \neg b3i];\ bo\uparrow)); \\
\qquad [\,rai \wedge b1i \wedge \neg lsri \rightarrow lsro\uparrow;\ [lsri];\ mdlo\uparrow;\ [\neg mdli];\ rao\downarrow;\ bo\downarrow;\ lsro\downarrow;\ mdlo\downarrow \\
\qquad [\,]\ rai \wedge b2i \wedge \neg lswi \rightarrow lswo\uparrow;\ [lswi];\ mdlo\uparrow;\ [\neg mdli];\ rao\downarrow;\ bo\downarrow;\ lswo\downarrow;\ mdlo\downarrow \\
\qquad [\,]\ rai \wedge b3i \wedge \neg msli \rightarrow mslo\uparrow;\ [msli];\ mdlo\uparrow;\ [\neg mdli];\ rao\downarrow;\ bo\downarrow;\ mslo\downarrow;\ mdlo\downarrow \\
\qquad ] \\
\ \ [\,]\ mdsi \rightarrow (([\neg wai];\ wao\uparrow) \parallel ([\neg b1i \wedge \neg b2i \wedge \neg b3i];\ bo\uparrow)); \\
\qquad [\,wai \wedge b1i \wedge \neg ssri \rightarrow ssro\uparrow;\ [ssri];\ mdso\uparrow;\ [\neg mdsi];\ wao\downarrow;\ bo\downarrow;\ ssro\downarrow;\ mdso\downarrow \\
\qquad [\,]\ wai \wedge b2i \wedge \neg sswi \rightarrow sswo\uparrow;\ [sswi];\ mdso\uparrow;\ [\neg mdsi];\ wao\downarrow;\ bo\downarrow;\ sswo\downarrow;\ mdso\downarrow \\
\qquad [\,]\ wai \wedge b3i \wedge \neg mssi \rightarrow msso\uparrow;\ [mssi];\ mdso\uparrow;\ [\neg mdsi];\ wao\downarrow;\ bo\downarrow;\ msso\downarrow;\ mdso\downarrow \\
\qquad ]\,]]
\end{array}
$$
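The three four-phase protocols used above differ only in where the cycle starts and which side waits first. The sketch below writes each expansion as a list of steps, using "+" and "-" for rising and falling transitions and "~" for negation (an ASCII notation chosen for this example), and shows that the lazy-active expansion is the active expansion rotated by one step. The function names are hypothetical.

```python
def active(port):
    """Active four-phase expansion: xo+; [xi]; xo-; [~xi]."""
    return [f"{port}o+", f"[{port}i]", f"{port}o-", f"[~{port}i]"]


def lazy_active(port):
    """Lazy-active expansion: the final wait of the active protocol is
    moved to the front, giving [~xi]; xo+; [xi]; xo-."""
    steps = active(port)
    return steps[-1:] + steps[:-1]


def passive(port):
    """Passive four-phase expansion: [xi]; xo+; [~xi]; xo-."""
    return [f"[{port}i]", f"{port}o+", f"[~{port}i]", f"{port}o-"]
```

Because the wait for the acknowledge to fall is deferred to the start of the next use of the port, a different port can be used in the meantime, which is the throughput benefit described in the text.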

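3.2 Reshuffling the Handshaking Expansion

The transformation from CSP to a handshaking expansion is not unique. Other expansions can be obtained by reshuffling the original handshaking expansion. In particular, we examined various placements of the reset of the four-phase handshake (i.e., $xo\downarrow$). The various implementations were checked for the need for state variables, as well as for the number of literals needed for the implementation, using the tool PRGEN [16]. The handshaking expansion which we selected is shown below:

$$
\begin{array}{l}
*[[\,mdli \wedge \neg rai \rightarrow rao\uparrow;\ [\neg b1i \wedge \neg b2i \wedge \neg b3i];\ bo\uparrow; \\
\qquad [\,rai \wedge b1i \wedge \neg lsri \rightarrow lsro\uparrow;\ [lsri];\ mdlo\uparrow;\ rao\downarrow;\ bo\downarrow;\ [\neg mdli];\ lsro\downarrow;\ mdlo\downarrow \\
\qquad [\,]\ rai \wedge b2i \wedge \neg lswi \rightarrow lswo\uparrow;\ [lswi];\ mdlo\uparrow;\ rao\downarrow;\ bo\downarrow;\ [\neg mdli];\ lswo\downarrow;\ mdlo\downarrow \\
\qquad [\,]\ rai \wedge b3i \wedge \neg msli \rightarrow mslo\uparrow;\ [msli];\ mdlo\uparrow;\ rao\downarrow;\ bo\downarrow;\ [\neg mdli];\ mslo\downarrow;\ mdlo\downarrow \\
\qquad ] \\
\ \ [\,]\ mdsi \wedge \neg wai \rightarrow wao\uparrow;\ [\neg b1i \wedge \neg b2i \wedge \neg b3i];\ bo\uparrow; \\
\qquad [\,wai \wedge b1i \wedge \neg ssri \rightarrow ssro\uparrow;\ [ssri];\ mdso\uparrow;\ wao\downarrow;\ bo\downarrow;\ [\neg mdsi];\ ssro\downarrow;\ mdso\downarrow \\
\qquad [\,]\ wai \wedge b2i \wedge \neg sswi \rightarrow sswo\uparrow;\ [sswi];\ mdso\uparrow;\ wao\downarrow;\ bo\downarrow;\ [\neg mdsi];\ sswo\downarrow;\ mdso\downarrow \\
\qquad [\,]\ wai \wedge b3i \wedge \neg mssi \rightarrow msso\uparrow;\ [mssi];\ mdso\uparrow;\ wao\downarrow;\ bo\downarrow;\ [\neg mdsi];\ msso\downarrow;\ mdso\downarrow \\
\qquad ]\,]]
\end{array}
$$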

3.3 The Production Rules and Circuit Implementation

The handshaking expansion for one of the six possible cycles, loading the segmentation read register, is given below, where $\neg bi \equiv \neg b1i \wedge \neg b2i \wedge \neg b3i$:

$$*[[mdli \wedge \neg rai];\ rao\uparrow;\ [\neg bi];\ bo\uparrow;\ [rai \wedge b1i \wedge \neg lsri];\ lsro\uparrow;\ [lsri];\ mdlo\uparrow;\ rao\downarrow;\ bo\downarrow;\ [\neg mdli];\ lsro\downarrow;\ mdlo\downarrow]$$

For each transition on an output signal, a weak guard is created using the preceding transition in the handshaking expansion, resulting in the unstrengthened production rules that follow:

$$
\begin{array}{rcl}
mdli \wedge \neg rai & \mapsto & rao\uparrow \\
\neg bi & \mapsto & bo\uparrow \\
rai \wedge b1i \wedge \neg lsri & \mapsto & lsro\uparrow \\
lsri & \mapsto & mdlo\uparrow \\
mdlo & \mapsto & rao\downarrow \\
\neg rao & \mapsto & bo\downarrow \\
\neg mdli & \mapsto & lsro\downarrow \\
\neg lsro & \mapsto & mdlo\downarrow
\end{array}
$$

Unfortunately, this set of guards does not guarantee the order of transitions specified in the handshaking expansion. Therefore, a technique called guard strengthening [17] is applied to obtain a strengthened production rule set. The basic idea is to determine the set of states in the handshaking expansion where a guard for a transition evaluates to true and compare it to the set of states where the signal should have the opposite value. The intersection of these two sets is the conflict window. To close this window, any signal which is stable when the transition being considered is to be enabled can be added to the guard. Strengthened production rules which now implement the handshaking expansion are given below and are depicted as a complex gate implementation in Figure 3.

$$
\begin{array}{rcl}
\neg mdlo \wedge mdli \wedge \neg rai & \mapsto & rao\uparrow \\
rao \wedge \neg bi & \mapsto & bo\uparrow \\
mdli \wedge bo \wedge rai \wedge b1i \wedge \neg lsri & \mapsto & lsro\uparrow \\
lsro \wedge lsri & \mapsto & mdlo\uparrow \\
mdlo & \mapsto & rao\downarrow \\
\neg rao & \mapsto & bo\downarrow \\
\neg bo \wedge \neg mdli & \mapsto & lsro\downarrow \\
\neg lsro & \mapsto & mdlo\downarrow
\end{array}
$$

Figure 3: Part of the MMU implementation: loading the segmentation read register.

There is a similar set of production rules for the other five cycles. In the next step, the transistors in the controller were sized [18] and the inverter bubbles were shuffled using BUBBLE with arbitrary datapath delays. All the paths were then combined into one file which was input into CELLGEN to generate a GLADYS file. From this, layout was created with GLADYS (for a description of the programs BUBBLE, CELLGEN, and GLADYS, see [16]). Part of the final circuit implementation is shown in Figure 4. The circuit implementation for mdso_ and wao is the same as that for mdlo_ and rao, respectively, with the input and output signal names changed appropriately (note that x_ indicates that x is an active-low signal). The circuits for lswo, mslo, ssro, sswo, and msso are similar to that of lsro. The actual gates generating each signal are implemented in a single complex gate, as shown in Figure 5 for the signal mdlo_.
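One way to gain confidence in the strengthened production rules for the cycle that loads the segmentation read register is to execute them against a simple environment model. The sketch below is such an experiment, not the PRGEN tool: it assumes an environment in which each datapath acknowledge (rai, b1i, lsri) simply follows its request wire, the address is FFFF so that b1i is the comparison result, and the microprocessor has already raised mdli. Rules fire one at a time in a fixed priority order, which is a simplification of true concurrent execution; the function name is hypothetical.

```python
def simulate_lsr_cycle():
    """Fire production rules until none is enabled.

    Returns the controller's output transitions in firing order,
    written as 'name+' (rising) or 'name-' (falling).
    """
    names = ("mdli", "rai", "b1i", "b2i", "b3i", "lsri",
             "rao", "bo", "lsro", "mdlo")
    s = {name: False for name in names}
    s["mdli"] = True  # a load request is already pending

    def bi(s):
        return s["b1i"] or s["b2i"] or s["b3i"]

    # (signal, new value, guard): strengthened rules of the controller.
    circuit = [
        ("rao", True, lambda s: not s["mdlo"] and s["mdli"] and not s["rai"]),
        ("bo", True, lambda s: s["rao"] and not bi(s)),
        ("lsro", True, lambda s: s["mdli"] and s["bo"] and s["rai"]
                                 and s["b1i"] and not s["lsri"]),
        ("mdlo", True, lambda s: s["lsro"] and s["lsri"]),
        ("rao", False, lambda s: s["mdlo"]),
        ("bo", False, lambda s: not s["rao"]),
        ("lsro", False, lambda s: not s["bo"] and not s["mdli"]),
        ("mdlo", False, lambda s: not s["lsro"]),
    ]
    # Assumed environment: every input follows the matching output.
    environment = [
        ("mdli", False, lambda s: s["mdlo"]),
        ("rai", True, lambda s: s["rao"]),
        ("rai", False, lambda s: not s["rao"]),
        ("b1i", True, lambda s: s["bo"]),
        ("b1i", False, lambda s: not s["bo"]),
        ("lsri", True, lambda s: s["lsro"]),
        ("lsri", False, lambda s: not s["lsro"]),
    ]
    outputs = {"rao", "bo", "lsro", "mdlo"}
    trace = []
    for _ in range(100):  # safety bound; one cycle needs far fewer steps
        enabled = [(sig, val) for sig, val, guard in circuit + environment
                   if s[sig] != val and guard(s)]
        if not enabled:
            break
        sig, val = enabled[0]  # fixed priority: circuit rules first
        s[sig] = val
        if sig in outputs:
            trace.append(sig + ("+" if val else "-"))
    return trace
```

Under these assumptions, the recorded output transitions follow the order given in the reshuffled handshaking expansion for this cycle.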

4 Datapath

In this section, we describe our implementation of the datapath. It consists of two parts: a memory address comparator and the segmentation registers.

4.1 Memory Address Comparator

4.1.1 From CSP to a Handshaking Expansion

Earlier, we presented the CSP for the memory address comparator. First, the assignment of $T$ and $F$ and the decision of which boolean to transmit to the controller are separated into two processes. We also expand the communications into actual signal transitions. The result is given below:

$$
\begin{array}{l}
*[[\,bo \wedge (\wedge\, k : 1..15 : ma_k) \rightarrow T\uparrow;\ [\neg bo];\ T\downarrow \\
\ \ [\,]\ bo \wedge (\vee\, k : 1..15 : \neg ma_k) \rightarrow F\uparrow;\ [\neg bo];\ F\downarrow\,]] \\
\parallel\ *[[\,T \wedge ma_0 \rightarrow b1i\uparrow;\ [\neg T];\ b1i\downarrow \\
\ \ [\,]\ T \wedge \neg ma_0 \rightarrow b2i\uparrow;\ [\neg T];\ b2i\downarrow \\
\ \ [\,]\ F \rightarrow b3i\uparrow;\ [\neg F];\ b3i\downarrow\,]]
\end{array}
$$

Figure 4: Part of the MMU implementation.

Figure 5: Circuit for mdlo_.

Unfortunately, it is not possible to design a fifteen-bit comparator in a single gate using static CMOS technology. Therefore, we decided to implement it using dynamic logic. When the memory address is not valid, the signals $T$ and $F$ are precharged low. To determine when the memory address is not valid, we need a third state, which is obtained by adding a process for each bit of the memory address to convert it to a dual-rail signal. Another process is added to collect the acknowledgements from all of these and produce a completion signal when all bits have been converted. The process which assigns a value to $T$ and $F$ is also modified such that the comparison only uses an OR or NOR gate. Finally, the completion of the dual-rail encoding, $comp$, is added to the guard for the decision of which boolean to transmit to the controller. The handshaking expansion for this is shown below, where a trailing underscore marks an active-low signal:

$$
\begin{array}{l}
(\parallel k : 0..15 : *[[\,bo \wedge ma_k \rightarrow maT_k\uparrow;\ maT\_{}_k\downarrow;\ ack_k\uparrow;\ [\neg bo];\ maT_k\downarrow;\ maT\_{}_k\uparrow;\ ack_k\downarrow \\
\qquad [\,]\ bo \wedge \neg ma_k \rightarrow maF_k\uparrow;\ maF\_{}_k\downarrow;\ ack_k\uparrow;\ [\neg bo];\ maF_k\downarrow;\ maF\_{}_k\uparrow;\ ack_k\downarrow\,]]) \\
\parallel\ *[[(\wedge\, k : 0..15 : ack_k)];\ comp\uparrow;\ [(\wedge\, k : 0..15 : \neg ack_k)];\ comp\downarrow] \\
\parallel\ *[[\,bo \wedge \neg(\vee\, k : 1..15 : maT\_{}_k) \rightarrow T\uparrow;\ [\neg bo \wedge (\vee\, k : 1..15 : maT\_{}_k)];\ T\downarrow \\
\qquad [\,]\ bo \wedge (\vee\, k : 1..15 : maF_k) \rightarrow F\uparrow;\ [\neg bo];\ F\downarrow\,]] \\
\parallel\ *[[\,comp \wedge T \wedge maT_0 \rightarrow b1i\uparrow;\ [\neg comp \wedge \neg T \wedge \neg maT_0];\ b1i\downarrow \\
\qquad [\,]\ comp \wedge T \wedge maF_0 \rightarrow b2i\uparrow;\ [\neg comp \wedge \neg T \wedge \neg maF_0];\ b2i\downarrow \\
\qquad [\,]\ comp \wedge F \rightarrow b3i\uparrow;\ [\neg comp \wedge \neg F];\ b3i\downarrow\,]]
\end{array}
$$

4.1.2 From Handshaking Expansion to Implementation

The comparator was laid out using MAGIC. It is composed of the circuits shown in Figure 6. We need sixteen instances of the first process described above, one for each bit of the memory address. Each instance is implemented with a PADIN circuit, shown in Figure 6a. This circuit takes the signal bo and a bit of the memory address, $ma_k$, from the pads as input, converts the signal to a dual-rail encoding, and produces an acknowledgement, $ack_k$. The next process combines the acknowledgements from the sixteen PADIN circuits and produces a completion signal, comp, when all bits of the memory address have been converted to dual rail. This is implemented with the C-element shown in Figure 6b. This C-element is actually implemented with a tree of four-input C-elements.

In the third process, fifteen bits of the memory address ($ma_{15}$ to $ma_1$) are compared with all bits high using the dynamic logic circuit shown in Figure 6c. Initially, bo is low, and thus the bits maT_15 through maT_1 are all initially high. Therefore, the signals T and F are both precharged low. After bo is asserted high, the F signal could rise if any bit of the memory address is low (i.e., for any k, $maF_k$ is high), which would indicate that the comparison was false. Also, the T signal could rise if all bits of the memory address are high (i.e., for all k, maT_k is low), since the transistors are ratioed such that any of the n-channel transistors could overpower the p-channel.

The result from the dynamic comparator, the completion signal from the C-tree, and bit 0 of the memory address are used by the last process to determine which bit to send back to the controller (b1i, b2i, or b3i). The C-elements used to determine this result are shown in Figure 6d. Once all the bits of the memory address are valid (i.e., comp is high), one of the bits b1i_, b2i_, or b3i_ will be asserted low. If all the memory address bits are high (i.e., T and $maT_0$ are high), then b1i_ is asserted low. If all the memory address bits are high except bit 0 (i.e., T and $maF_0$ are high), then b2i_ is asserted low. If, however, some memory address bit other than bit 0 is low (i.e., F is high), then b3i_ is asserted low.
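The logical function of the comparator, once all inputs have settled, can be summarized in a few lines. The sketch below is a steady-state model only: it ignores precharging, transistor ratioing, and the reset phase, and it uses active-high booleans throughout even though several wires in the circuit are active low. The function names are hypothetical.

```python
def dual_rail(ma, width=16):
    """Encode each address bit as a (true rail, false rail) pair."""
    maT = [bool((ma >> k) & 1) for k in range(width)]
    maF = [not t for t in maT]
    return maT, maF


def compare(ma):
    """Return (b1i, b2i, b3i) for a valid 16-bit memory address."""
    maT, maF = dual_rail(ma)
    # Each bit acknowledges once exactly one of its rails is raised.
    ack = [t or f for t, f in zip(maT, maF)]
    comp = all(ack)                          # C-element tree output
    # T: NOR over the active-low true rails of bits 15..1,
    # i.e. true when every one of those bits is high.
    T = not any(not maT[k] for k in range(1, 16))
    # F: OR over the false rails of bits 15..1.
    F = any(maF[k] for k in range(1, 16))
    b1i = comp and T and maT[0]              # address is FFFF
    b2i = comp and T and maF[0]              # address is FFFE
    b3i = comp and F                         # any other address
    return b1i, b2i, b3i
```

Exactly one of the three results is true for every address, which matches the one-hot encoding used on port B.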

4.2 The Registers and Their Interface to the Environment

4.2.1 From CSP to a Handshaking Expansion

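As mentioned before, the six other datapath processes are all of similar form. For example, storing to the segmentation read register, SSR, is expanded as follows:

$$
\begin{array}{l}
(\parallel k : 0..7 : *[[\,ssro \wedge data_k \rightarrow sr_k\uparrow;\ ssri_k\uparrow;\ [\neg ssro];\ ssri_k\downarrow \\
\qquad [\,]\ ssro \wedge \neg data_k \rightarrow sr_k\downarrow;\ ssri_k\uparrow;\ [\neg ssro];\ ssri_k\downarrow\,]]) \\
\parallel\ *[[(\wedge\, k : 0..7 : ssri_k)];\ ssri\uparrow;\ [(\wedge\, k : 0..7 : \neg ssri_k)];\ ssri\downarrow]
\end{array}
$$

There is a similar expansion for storing to the segmentation write register, SSW. The expansion is only slightly different for the others, where the pad is used as the state-holding element. For example, loading the segmentation read register, LSR, is expanded as follows:

$$
\begin{array}{l}
(\parallel k : 0..7 : *[[\,lsro \wedge sr_k \rightarrow set\_{}_k\downarrow;\ [data_k];\ lsri_k\uparrow;\ [\neg lsro];\ set\_{}_k\uparrow;\ lsri_k\downarrow \\
\qquad [\,]\ lsro \wedge \neg sr_k \rightarrow reset_k\uparrow;\ [\neg data_k];\ lsri_k\uparrow;\ [\neg lsro];\ reset_k\downarrow;\ lsri_k\downarrow\,]]) \\
\parallel\ (\parallel k : 0..7 : *[[\,set\_{}_k \rightarrow data_k\uparrow \\
\qquad [\,]\ reset_k \rightarrow data_k\downarrow\,]]) \\
\parallel\ *[[(\wedge\, k : 0..7 : lsri_k)];\ lsri\uparrow;\ [(\wedge\, k : 0..7 : \neg lsri_k)];\ lsri\downarrow]
\end{array}
$$

There are similar expansions for LSW, RA, and WA.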
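The completion processes in these expansions wait for all per-bit acknowledgements to rise before raising the combined acknowledgement, and for all of them to fall before lowering it. This is the behavior of a C-element, sketched below as a small state-holding class. The class name CElement is hypothetical, and the model is purely logical with no timing.

```python
class CElement:
    """Logical model of a C-element.

    The output rises when all inputs are high, falls when all inputs
    are low, and otherwise holds its previous value.
    """

    def __init__(self):
        self.out = False

    def update(self, inputs):
        inputs = list(inputs)
        if all(inputs):
            self.out = True
        elif not any(inputs):
            self.out = False
        # Mixed inputs: keep the stored value.
        return self.out
```

Feeding the eight per-bit acknowledgements of a register store into one such element gives the combined acknowledgement; a hardware implementation would typically build this from a tree of smaller C-elements.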

4.2.2 From Handshaking Expansion to Implementation

The circuits for the segmentation registers were also laid out using MAGIC. The segmentation registers consist of three types of cells: a register cell (REG), a cell to interface to the pads (PADOUT), and a multiplexer (MUX) to allow either of the two registers to write to a single pad. These cells, as well as a model for the tristate pad, are shown in Figure 7.

The process described above for storing to the segmentation registers (i.e., SSR and SSW) is implemented with the register cell shown in Figure 7a. When this circuit receives the GO signal (i.e., ssro or sswo), it writes the value of the IN signal (i.e., $data_k$) into the flip-flop (FF). Once this value shows up on the OUT wire (i.e., $sr_k$ or $sw_k$), an acknowledgement, ACK (i.e., $ssri_k$ or $sswi_k$), is returned. These acknowledgements can be grouped together by a tree of C-elements to produce the acknowledgement that all bits have been stored (i.e., ssri or sswi).

The next processes to be implemented are those which put the output of the segmentation registers on the data or real address pads (i.e., LSR, LSW, RA, and WA). These are implemented with the cell PADOUT, which is shown in Figure 7b. It is similar to the REG cell except that it uses the value on the PAD as the storage device rather than a flip-flop. The advantage of this technique is that only a single phase of the signal is needed at the pads to generate a completion signal. Thus, only half as many pads are needed as would be required if dual-rail signals had to be taken off-chip [15].

The last cell, MUX, which is shown in Figure 7c, is used to enable either the segmentation read register or the segmentation write register to write to the same pads. It will assert the set_ signal low or the reset signal high if either of the segmentation registers is trying to change the value on the pad. Mutual exclusion between set1_ and set2_ is guaranteed by the controller (i.e., these signals are never both low at the same time). Similarly, the controller assures mutual exclusion between reset1 and reset2.

While the tristate pad is much more complicated, it can be modeled by the transistors shown in Figure 7d. Essentially, if set_ is low, the PAD is pulled high; if reset is high, the PAD is pulled low; otherwise, its state remains unchanged and can be driven from off-chip. It is important to note that it may be necessary to add some sort of staticizer to the pad off-chip to guarantee that the wire holds the value.

The cells are combined to form the segmentation registers as shown in Figure 8. In Figure 8a, the block diagram for the segmentation read register is shown. This register is composed of one REG cell, which is loaded from the data bus, and two PADOUT cells. The first PADOUT cell is used to write to the data bus, and the second is used to write to the real address bus. There are six signals between these cells and the controller (ssro, ssri, lsro, lsri, rao, and rai). The cell for the segmentation write register can be constructed in a similar manner, with the only difference being the control signals (sswo, sswi, lswo, lswi, wao, and wai). The two segmentation registers and their interface to the pads are constructed as shown in Figure 8b.

Figure 6: Memory address comparator: (a) PADIN converts the memory address to a dual-rail signal; (b) generates completion signal when all memory address bits have arrived; (c) determines if the bits $ma_{15}$ to $ma_1$ are all high; and (d) uses results of (a) to (c) to determine result of memory address comparison.
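The behavior of the MUX cell and of the tristate pad model can be captured in a short logical sketch. It assumes that set_ is active low and reset is active high, as described above; the function names mux and pad_next are hypothetical, and the check for simultaneous set and reset stands in for the mutual exclusion that the design obtains from the controller.

```python
def mux(set1_, set2_, reset1, reset2):
    """Combine the requests of two sources for one pad.

    set_ goes low if either source pulls its set line low;
    reset goes high if either source raises its reset line.
    """
    set_ = set1_ and set2_      # active low: low if either input is low
    reset = reset1 or reset2    # active high
    return set_, reset


def pad_next(pad, set_, reset):
    """Next value of the pad given its current value and the drive signals."""
    if not set_ and reset:
        # Excluded by construction in the real design.
        raise ValueError("set_ low and reset high at the same time")
    if not set_:
        return True             # pulled high
    if reset:
        return False            # pulled low
    return pad                  # not driven on-chip: value is held
```

When neither signal is asserted the function returns the previous value, which corresponds to the pad holding its state or being driven from off-chip.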

5 Results and Conclusions

A formal systematic procedure for the synthesis of asynchronous circuits has been illustrated using the example of a simple memory management unit. This procedure begins with an informal general description from which a formal CSP specification is derived. The CSP specification is then decomposed into control and datapath processes which are synthesized independently. In both cases, the abstract process communications are expanded into signal transitions. This transformation to a handshaking expansion is not unique, so various reshufflings are examined. From this, a strengthened production rule set is determined which fully characterizes the complex-gate implementation of the controller. For the datapath, we presented a comparator which uses a dynamic logic implementation and registers which use tri-state pads for completion detection when sending signals off-chip. At each step of the procedure, optimizations have been presented to improve the area and performance of the final circuit. In [19], it is shown that further optimizations can be applied at the end of this procedure to utilize timing constraints to trade robustness to variations in delay for reduced area and increased performance.

The core of the MMU occupies approximately $1800 \times 1800\,\lambda^2$ (where $\lambda$ equals half the feature size). SPICE simulation results for a $0.8\,\mu\mathrm{m}$ CMOS process indicate that, for nominal conditions, there is a delay of 5 ns from the time a memory request is made until it is processed and passed on to the memory interface. It takes about 8 ns to read from or write to the segmentation registers.

Acknowledgments

We wish to thank Jose Tierno, Tony Lee, Drazen Borkovic, Dr. Steve Burns, and Dr. Pieter Hazewindus, who assisted with various parts of the design and provided us with many useful comments on this manuscript. We would also like to thank James Miller and Cecilia Yu, who were responsible for the designs of the IMMU and SDC.

References

[1] Charles L. Seitz. "System Timing". In C. A. Mead and L. A. Conway, editors, Introduction to VLSI Systems. Addison-Wesley, 1980.
[2] Alain J. Martin. "Programming in VLSI: From Communicating Processes to Delay-Insensitive VLSI Circuits". In C.A.R. Hoare, editor, UT Year of Programming Institute on Concurrent Programming. Addison-Wesley, 1990.
[3] J. C. Ebergen. "Translating Programs into Delay-Insensitive Circuits". PhD thesis, Eindhoven University of Technology, 1987.
[4] F. U. Rosenberger, C. E. Molnar, T. J. Chaney, and T. P. Fang. "Q-Modules: Internally Clocked Delay-Insensitive Modules". IEEE Transactions on Computers, 37:1005-1018, 1988.

Figure 7: Segmentation Register: (a) REG is the basic register cell; (b) PADOUT is used to write to tristate pads; (c) MUX is used to allow two sources to write to the same pad; and (d) PAD is a model of the function of a tristate pad.

Figure 8: Block Diagram of the Segmentation Registers: (a) block diagram for the segmentation read register; and (b) block diagram for both segmentation registers and their interface with the pads.

[5] C.H. van Berkel and R. Saeijs. "Compilation of Communicating Processes into Delay-Insensitive Circuits". In International Conference on Computer Design, ICCD-1988. IEEE Computer Society Press, 1988.
[6] Erik Brunvand and Robert F. Sproull. "Translating Concurrent Programs into Delay-Insensitive Circuits". In International Conference on Computer-Aided Design, ICCAD-1989. IEEE Computer Society Press, 1989.
[7] Ivan E. Sutherland. "Micropipelines". Communications of the ACM, 32(6):720-738, 1989.
[8] Tam-Anh Chu. Synthesis of Self-Timed VLSI Circuits from Graph-theoretic Specifications. PhD thesis, Massachusetts Institute of Technology, 1987.
[9] Teresa H.-Y. Meng, Robert W. Brodersen, and David G. Messerschmitt. "Automatic Synthesis of Asynchronous Circuits from High-Level Specifications". IEEE Transactions on Computer-Aided Design, 8(11):1185-1205, November 1989.
[10] L. Lavagno, K. Keutzer, and A. Sangiovanni-Vincentelli. "Algorithms for Synthesis of Hazard-Free Asynchronous Circuits". In Proceedings of the 28th ACM/IEEE Design Automation Conference, 1991.
[11] Peter A. Beerel and Teresa H. Y. Meng. "Automatic Gate-Level Synthesis of Speed-Independent Circuits". In Proceedings IEEE 1992 ICCAD Digest of Papers, pages 581-586, 1992.
[12] C. A. R. Hoare. Communicating Sequential Processes. Prentice Hall International, UK, Ltd., Englewood Cliffs, New Jersey, 1985.
[13] E. Dijkstra. A Discipline of Programming. Prentice-Hall, Englewood Cliffs, New Jersey, 1976.
[14] A. J. Martin, S. M. Burns, T. K. Lee, D. Borkovic, and P. J. Hazewindus. "The Design of an Asynchronous Microprocessor". In Decennial Caltech Conference on VLSI, pages 226-234, 1989.
[15] Jose Tierno. Private Communication, 1991. Jose Tierno, a graduate student at Caltech, first proposed the idea of using the pads as a state-holding element.
[16] Alain J. Martin. "CAD Tools for the Design of Asynchronous Circuits". Technical Report CS-TR-93XX, California Institute of Technology, 1993.
[17] Alain J. Martin. "Formal Program Transformations for VLSI Circuit Synthesis". In E.W. Dijkstra, editor, UT Year of Programming Institute on Formal Developments of Programs and Proofs. Addison-Wesley, 1989.
[18] Steve Burns. Performance Analysis and Optimization of Asynchronous Circuits. PhD thesis, California Institute of Technology, 1991.
[19] Chris J. Myers and Teresa H.-Y. Meng. "Synthesis of Timed Asynchronous Circuits". In International Conference on Computer Design, ICCD-1992. IEEE Computer Society Press, 1992.
