Asynchronous Data Communication with Low Power for GALS Systems

Shengxian Zhuang, Weidong Li, Jonas Carlsson, Kent Palmkvist, and Lars Wanhammar
Electronics Systems, Department of Electrical Engineering, Linköping University, SE-581 83 Linköping, Sweden
E-mail: {zhuangsx, weidongl, jonasc, kentp, larsw}@isy.liu.se
Tel.: +46 13 284059, Fax: +46 13 139282

Abstract: In this paper, we propose an asynchronous wrapper with new handshake circuits for data communication in GALS systems. The handshake circuits include two data-ports and a local clock controller. We present two approaches to implementing the data-ports: one with pure standard cells and the other with Muller-C elements. The detailed design methodology is given, and the circuits are validated with VHDL and circuit simulation in a standard CMOS technology.

1. INTRODUCTION

Recent research shows that state-of-the-art globally synchronous design is hard to apply to gigahertz ULSI circuits, which demand both low power and high performance. Problems such as clock skew, EMI, and power consumption caused by global clock distribution become more prominent in SoCs. Asynchronous circuits, which avoid these problems by not using clocks, have advantages over globally synchronous designs in many respects. However, they also have drawbacks, such as widespread control overhead, design complexity, and a lack of synthesis CAD tools familiar to industry, which still make them an unacceptable substitute for current VLSI design methods [4]. The GALS (Globally-Asynchronous Locally-Synchronous) [1] approach to building large deep-submicron systems on chip has recently been viewed as a promising way to handle the problems of a global clock. By adding asynchronous interfaces to locally synchronous (LS) modules, the interior of each module is isolated from its interfaces, and each module can use its own clock, power supply voltage, and low-swing buses, or even switch off the power supply of its synchronous part without affecting any other part of the chip [5][7][8]. This gives GALS systems several advantages: they mitigate the clock distribution problems of large chips, including the power consumed in clock distribution and clock skew, and they simplify the reuse of modules and of the interface, since the modules do not need to use the same clock signal. However, compared with fully asynchronous design, less research has focused on GALS systems.

The main challenges in the design of GALS circuits are that the interface circuits must be easy to attach to various synchronous paradigms and must generate a reliable stretchable clock, while taking as little chip area and delay as possible. In this paper, we aim to design a low-power asynchronous wrapper with new handshake circuits and a stretchable clock controller. Section 2 introduces the basic concept of the handshake circuits and proposes data-ports built with standard cells and with Muller-C elements, together with a stretchable local clock generator. Section 3 presents the construction of GALS systems. The application of a low-swing bus driver is introduced in Section 4. Finally, simulation results and performance verification are given in Section 5.

2. THE HANDSHAKE CIRCUITS

2.1. Data-ports using standard cells

To simplify the asynchronous wrapper design without loss of generality, we assume that the handshake circuits in GALS work in the following mode:

a) A data communication request is always initiated by a data out-port, named W-port, which is attached to a master LS module. The data in-port, named R-port, which is attached to a slave LS module, is always passive and accepts the data.
b) When the W-port initiates a data communication (outputs a data item), it may stop the internal clock and must wait for the acknowledge from the corresponding R-port. Likewise, when the R-port starts reading a data item, it must hold its state until the W-port sends a request to it. This means that every activation of a port completes one effective data transmission.
c) Both the W-port and the R-port can be independently enabled by the internal signals WR or RD from their own LS modules.

These assumptions should be reasonable for many data communication situations in GALS systems. Based on the above assumptions and constraints, the regular signal transitions on the W-port and R-port, which interface with each other and with the internal LS modules, can be represented as in Fig. 1, where STRETCH1 and STRETCH2 are the requests to stretch the respective local clocks. This scheme is safe for asynchronous communications in GALS systems, but it is not time efficient because the signal transitions have no concurrency.

Figure 1. Signal transitions on W-port and R-port. (Signal transition graph over WR, RD, STRETCH1, STRETCH2, REQ and ACK, each with a rising and a falling transition; not reproduced.)

Figure 2. W-port circuits with standard cells. (Schematic not reproduced; labeled pins and signals: D, Q, LD, CLK, CLR, VDD, WR, REQ, ACK, STRETCH1.)

For the W-port, STRETCH1+ and REQ+ are generated in succession immediately after the write request WR+ is produced. If the clock must be stretched for a data communication, STRETCH1+ should be valid before the next rising edge of the local clock so that the low phase of the clock is stretched reliably. REQ+ is sent to the R-port, where it is combined with the read request RD+ to generate ACK+, which is connected to the W-port. ACK+ from the R-port leads to REQ-, which in turn resets ACK+ to ACK-. The data is latched after both REQ and ACK return to their initial states. The clock stretch is cancelled by STRETCH1-, which makes WR+ go to WR-. For the regular STG, all signal transitions can easily be generated with standard cells such as D-FFs and latches with set and reset functions, except STRETCH1-: the preceding signals REQ and ACK each pass through two states (high and low), so they cannot be used directly to generate STRETCH1-. To keep the synthesized circuits as simple as possible, we let STRETCH1- be generated soon after REQ-. Because ACK- is generated immediately by REQ-, correct handshaking is still guaranteed with this small modification of the regular STG. Thus ACK+ and REQ- can be used to produce STRETCH1-. The synthesized circuit for the W-port is shown in Fig. 2. The handshake circuits for the R-port can be synthesized in the same manner.
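As a rough illustration of the modified STG for one transfer, the following Python sketch replays the transition order described above and checks that every signal returns to its initial level. It is a behavioral sketch only, not the synthesized circuit: the function name `four_phase_transfer`, the `wr_first` parameter, and the R-port ordering of STRETCH2 and RD (taken by analogy with the W-port) are assumptions made for illustration. ACK- and STRETCH1- are both triggered by REQ-, so their relative order here is arbitrary and would depend on gate delays in a real circuit.

```python
def four_phase_transfer(wr_first=True):
    """Replay one W-port/R-port transfer and return the ordered transitions."""
    names = ("WR", "RD", "STRETCH1", "STRETCH2", "REQ", "ACK")
    level = {name: 0 for name in names}
    trace = []

    def drive(name, value):
        # Every transition must actually change the signal level.
        assert level[name] != value, f"{name} is already {value}"
        level[name] = value
        trace.append(name + ("+" if value else "-"))

    def w_port_request():
        drive("WR", 1)        # LS module fires a write
        drive("STRETCH1", 1)  # ask the local clock to stretch
        drive("REQ", 1)       # request goes out to the R-port

    def r_port_ready():
        drive("RD", 1)        # LS module fires a read
        drive("STRETCH2", 1)  # assumed by analogy with the W-port

    # The two ports are enabled independently, in either order.
    if wr_first:
        w_port_request()
        r_port_ready()
    else:
        r_port_ready()
        w_port_request()

    drive("ACK", 1)       # REQ+ combined with RD+ produces ACK+
    drive("REQ", 0)       # ACK+ leads to REQ-
    drive("ACK", 0)       # REQ- resets ACK
    drive("STRETCH1", 0)  # generated soon after REQ- (modified STG)
    drive("WR", 0)        # STRETCH1- lets WR return low
    drive("STRETCH2", 0)  # assumed R-port release
    drive("RD", 0)

    assert all(v == 0 for v in level.values())
    return trace


if __name__ == "__main__":
    print(" ".join(four_phase_transfer()))
```

Calling `four_phase_transfer(wr_first=False)` models the case in which the R-port is enabled first and waits for the request.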

The interface circuits are easily matched with LS modules and are robust to their environments. A moderate data communication speed can be achieved because the signal transitions are obtained directly from the outputs or resets of D-FFs and latches. In addition, the handshake circuits can easily be captured in high-level hardware description languages and synthesized with current CAD tools (special models are still required for the layout of the circuits to guarantee the timing delays), except for the Muller-C element in the local clock controller, which is discussed in Section 2.3.

2.2. Data-ports using C-elements

Although the data communication ports above are highly reliable and robust for 4-phase bundled-data asynchronous communication, their speed is greatly limited because standard cells such as D-FFs and latches have a large delay from input to output. The proposed interface circuit of the W-port is shown in Fig. 3; it has a structure similar to the handshake circuit paradigm in [2]. When the LS module sends out data by firing WR+, the signals alternate in the following sequence: WR+, STRETCH1+, REQ+, (RD+) ACK+, REQ-, STRETCH1-. There is a slight difference in the requirement on WR (and likewise RD). Because their low-to-high transitions generate pulses at the outputs of AND gates, they do not need to be held high until STRETCH1 goes low. This may further ease interfacing with LS modules. For the R-port, if the transition time from REQ- to ACK- needs to be as short as possible, a reset input can be added to the Muller-C element that outputs ACK and connected to REQ.

Figure 3. W-port using Muller-C elements. (Schematic not reproduced; it contains Muller-C elements, one with a reset input, and a delay element; labeled signals: WR, REQ, ACK, STRETCH1.)

2.3. Stretchable clock controller

A stretchable clock controller is a key component of GALS systems. By stretching the low phase of the clock and adjusting the number of inverters that make up a ring oscillator, we can provide the LS module with a highly flexible clock and avoid synchronization failure. The pausible clock controller (PCC) was the first scheme applied to GALS systems [3]. Its main drawback is that metastability can occur when the request and the rising edge of the clock arrive simultaneously, because it effectively “tosses a coin” to determine which one passes the ME (mutual exclusion) circuit. Another stretchable clock [6] has a simple structure using only two basic gates, but it can be unreliable because the output cannot be held for some input states. Our stretchable clock controller, shown in Fig. 4, has an architecture similar to [6]. The generated local clock signal is fed back to both input terminals of a Muller C-element, but with inversion and with different delay times. According to the properties of the C-element, if STRETCH is not asserted (low), the output and inputs of the C-element follow the signal transitions in Fig. 5. If STRETCH is asserted (high), the input Xa is set low; the output of the C-element may be either low or high at that moment, but it will eventually be held at the low level. Thus the next rising edge is postponed by STRETCH+. Multiple clock stretch requests can be handled by connecting all the signals STRETCH1 to STRETCHi to an OR gate: whenever any stretch request is valid, STRETCH is output to the clock controller.
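The behavior described for the stretchable clock can be explored with a small discrete-time model. The sketch below is an assumption-laden illustration, not the circuit of Fig. 4: the function names `c_element` and `simulate_lclk`, the unit time step, and the delay values `delay_a` and `delay_b` are hypothetical. It models a Muller C-element whose output is fed back, inverted and delayed, to its two inputs; STRETCH is the OR of all stretch requests and forces input Xa low.

```python
from collections import deque


def c_element(a, b, previous):
    """Muller C-element: follow the inputs when they agree, else hold."""
    return a if a == b else previous


def simulate_lclk(stretch_requests, delay_a=2, delay_b=3):
    """Return the lclk level at each time step.

    stretch_requests: one tuple of request bits per time step; the
    requests are OR-ed into a single STRETCH signal.
    delay_a, delay_b: feedback delays in time steps (illustrative values).
    """
    xout = 0
    # Delay lines holding past output values; index 0 is the oldest sample.
    line_a = deque([0] * delay_a, maxlen=delay_a)
    line_b = deque([0] * delay_b, maxlen=delay_b)
    trace = []
    for requests in stretch_requests:
        stretch = any(requests)                  # OR gate over all requests
        xa = 0 if stretch else 1 - line_a[0]     # STRETCH forces Xa low
        xb = 1 - line_b[0]                       # inverted, delayed feedback
        xout = c_element(xa, xb, xout)
        line_a.append(xout)
        line_b.append(xout)
        trace.append(xout)
    return trace


if __name__ == "__main__":
    free_running = [(0, 0)] * 12
    stretched = [(0, 0)] * 4 + [(1, 0)] * 3 + [(0, 1)] * 3 + [(0, 0)] * 6
    print(simulate_lclk(free_running))
    print(simulate_lclk(stretched))
```

In this model the output cannot rise while any request is asserted, because Xa stays low and the C-element needs both inputs high to switch to high; a high phase that is already in progress completes normally before the clock is held low.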

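Figure 4. Stretchable clock generation. (Schematic not reproduced; a Muller C-element with inputs Xa and Xb and output Xout (lclk), with the stretch requests STRETCH1 to STRETCHi combined into STRETCH.)

Figure 5. STG of Stretch clock controller. (Signal transition graph over Xa, Xb, Xout and STRETCH; not reproduced.)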

3. Configuration of GALS systems with the asynchronous wrapper

3.1. Point-to-point communication

Modern DSP systems integrated on a chip may be very complicated. In terms of data-ports, the LS modules can be simply categorized into three classes: source LS modules with only a data out-port, sink LS modules with only a data in-port, and intermediate modules that have both a data in-port and a data out-port. With each LS module encapsulated in the asynchronous wrapper we have presented, a typical GALS system with point-to-point data communication is configured as in Fig. 6. If the source LS wants to send data to the intermediate LS, it puts the data on the data bus and issues a request with WR+. The intermediate LS receives the data by acknowledging with RD+. Both WR+ and RD+ trigger STRETCH+, which stretches the internal clocks if the data communication is not finished before the next lclk+. The data communications between the intermediate LS modules and the sink LS modules proceed in the same way. The handshake circuits in the W-port and R-port can be interfaced directly with an asynchronous FIFO. If the LS modules have different clock speeds, a FIFO can be added between the W-port and R-port to improve the speed of data transmission.

Figure 6. Basic structure of point-to-point GALS system. (Block diagram not reproduced; LS modules with local clocks lclk are connected through W-ports and R-ports using the signals Wr, Rd, Req, Ack, Stretch, Stretch1 and Stretch2.)

3.2. Multiple data-ports communication

In terms of the interfacing methods most frequently used in GALS systems, there are two possible forms of multi-port communication: either one data out-port drives multiple data in-ports, or multiple data out-ports drive one data in-port. In other words, either a multi-output LS module (M-LS) interfaces with several single-input LS modules (S-LS), or several single-output LS modules (S-LS) interface with a multiple-input LS module (M-LS). Fig. 7 shows the basic structure of such multi-port data communications. This configuration of the M-LS is completely compatible with the interface circuits for the W-port and R-port. Only an AND gate is required to synchronize the data communications with all the sending LS modules or all the receiving LS modules. The main problem is that it has no flexibility to support independent data communications between the M-LS module and a single S-LS module. If two data communications must be processed independently, an arbiter is needed to let only one request through to the receiving LS module. However, the contention cannot be resolved if two requests enter the arbiter at the same time. Additionally, the receiving LS module cannot tell which sending LS module a request comes from if the order of the requests is not arranged in advance. It is therefore very difficult for the M-LS module to give an independent acknowledgement to each sending S-LS module.

Figure 7. Multi-port data communication. (Block diagram not reproduced; four S-LS modules with request/acknowledge pairs Ri1/Ack1 to Ri4/Ack4 connect to an M-LS module with ports Ri/Acki and Ro/Acko.)
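The AND-gate synchronization used for multi-port communication in Section 3.2 can be sketched in a few lines of Python. This is a minimal behavioral sketch under stated assumptions: the function names `joined_request` and `join_time` and the arrival times in the usage example are hypothetical and do not come from the design itself.

```python
def joined_request(requests):
    """AND gate over the request lines of all sending S-LS modules.

    The M-LS module sees a request only when every sender has raised
    its own request, so all transfers complete together.
    """
    return int(all(requests))


def join_time(arrival_times_ns):
    """Time at which the joined request rises: the slowest sender decides."""
    return max(arrival_times_ns)


if __name__ == "__main__":
    print(joined_request([1, 1, 1, 1]))      # all four senders ready
    print(joined_request([1, 0, 1, 1]))      # one sender still missing
    print(join_time([3.0, 7.5, 1.2, 4.0]))   # illustrative arrival times
```

The second call illustrates the lack of flexibility noted in the text: a single S-LS module cannot complete a transfer on its own, because the joined request stays low until every sender is ready.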

4. Low swing bus drivers

In GALS systems, several buses exist in both the data path and the control path. They consume a significant part of the power because of their high switching activity and large loads. The increasing level of system integration on chip requires more buses, and higher system performance also demands higher bus speed, which further increases the power consumption of the buses. Low-power buses are therefore important for such systems. One way to reduce the power consumption is to use a low-swing bus. We propose a single-rail low-swing bus. The bus driver, shown in Fig. 8, has bootstrapped PMOS and NMOS transistors, which reduce the swing of the bus. On the receiver side, a Schmitt trigger is used. This low-swing bus operates with one supply voltage and can easily be modified for different supply voltages. The transistor sizes of the Schmitt trigger can be adjusted to reduce the duty-cycle distortion, which is common for single-rail low-swing buses.

Figure 8. Low swing bus driver and receiver: (a) bus driver, (b) bus receiver. (Schematics not reproduced; each has an In and an Out terminal.)
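To give a feel for why a reduced swing saves power, the sketch below uses the common first-order estimate for a bus line charged from a single supply, P = alpha * C * V_dd * V_swing * f. This formula and all numerical values are assumptions for illustration; they are not measurements of the proposed driver, and the function name `bus_dynamic_power` is hypothetical.

```python
def bus_dynamic_power(alpha, c_load, v_dd, v_swing, f_clk):
    """First-order dynamic power of one bus line in watts.

    alpha:   switching activity (transitions per clock cycle)
    c_load:  bus line capacitance in farads
    v_dd:    supply voltage in volts
    v_swing: voltage swing on the bus line in volts
    f_clk:   clock frequency in hertz
    Assumes the charge C * v_swing is drawn from the single supply v_dd.
    """
    return alpha * c_load * v_dd * v_swing * f_clk


if __name__ == "__main__":
    # Illustrative values only: 1 pF line, 3.3 V supply, 100 MHz clock.
    full = bus_dynamic_power(0.5, 1e-12, 3.3, 3.3, 100e6)
    low = bus_dynamic_power(0.5, 1e-12, 3.3, 1.0, 100e6)
    print(f"full swing: {full * 1e6:.1f} uW, low swing: {low * 1e6:.1f} uW")
```

Under this model the power scales linearly with the swing, so the ratio between the low-swing and full-swing cases is simply V_swing / V_dd.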

5. Simulation and evaluation

A GALS LS module is configured by connecting the handshake circuits and the stretchable clock controller to a simple synchronous computation block that could serve as a typical element in DSP and digital filters. Since the aim is only to simulate the performance of our asynchronous wrapper, a 4-bit adder is used to add the input and the delayed output. The module is assumed to launch one computation and one data exchange per clock cycle. Because the times at which the output data is accepted and the input data becomes ready are unknown, we put a constraint on the wrapper to avoid overwriting the output data: the R-port must not send the acknowledge that latches the input data before the output data has been accepted by the next module. We assign the tasks to different clock phases in sequence, so that the computation is done during clock+ and the data communication is done during clock-. This is not efficient for the system, since computations take only the high phase of the clock. If every computation takes multiple clock cycles, the efficiency can be improved. With this simple module, we performed both high-level behavioral VHDL simulation and circuit simulation in a standard 0.35 µm CMOS technology with Spectre in Cadence. Fig. 9 shows the clock signals produced by the local ring oscillator. A 100 MHz clock is generated with inverter chains in the ring oscillator. In our target application, the computation time is less than 4 ns, so the duration of clock+ is long enough to complete a computation. Fig. 10 shows the control signals completing a data output and input, from which we can see that correct operation is guaranteed by postponing the data-read acknowledge signal. In comparison, the interface circuits with Muller-C elements can work at a higher frequency than those with standard cells.

Figure 9. Stretch control and clock signals.

Figure 10. Simulation results of the handshake circuits.

6. CONCLUSION

In this paper, we have presented an asynchronous wrapper with novel handshake circuits, including two data-ports and a stretchable clock controller. Its performance has been simulated and verified with a basic GALS module. We hope it can be applied to low-power communication DSP systems on chip.

References

[1] D. M. Chapiro, Globally-Asynchronous Locally-Synchronous Circuits, Ph.D. dissertation, Stanford University, U.S.A., Oct. 1984.
[2] T. H.-Y. Meng, R. W. Brodersen and D. G. Messerschmitt, “Automatic Synthesis of Asynchronous Circuits from High-Level Specifications,” IEEE Trans. Computer-Aided Design, Vol. 8, No. 11, pp. 1185-1205, Nov. 1989.
[3] K. Y. Yun and R. P. Donohue, “Pausible clocking: a first step toward heterogeneous systems,” in Proc. of Int. Conf. Computer Design (ICCD), pp. 118-123, 1996.
[4] A. M. G. Peeters, Single-Rail Handshake Circuits, Ph.D. dissertation, Eindhoven Univ. of Technology, Eindhoven, The Netherlands, Jun. 1996.
[5] J. Muttersbach, T. Villiger and W. Fichtner, “Practical Design of Globally-Asynchronous Locally-Synchronous Systems,” in Proc. of Int. Symp. on Advanced Research in Asynchronous Circuits and Systems (ASYNC), pp. 52-59, Mar. 2000.
[6] D. S. Bormann and P. Y. K. Cheung, “Asynchronous Wrapper for Heterogeneous Systems,” in Proc. of IEEE Inter. Conf. on Computer Design (ICCD), pp. 307-314, 1997.
[7] H. Zhang and J. Rabaey, “Low-Swing Interconnect Interface Circuits,” in Proc. of Int. Symp. on Low Power Electronics and Design, pp. 161-166, Aug. 1998.
[8] T. Njølstad, O. Tjore, K. Svarstad, et al., “Towards a Universal Socket Interface for Globally-Asynchronous Locally-Synchronous Systems using Multiple Supply Voltages for Rate-Adaptive Energy Saving,” in Proc. of 14th IEEE Inter. ASIC/SOC Conf., pp. 110-116, 2001.