FPGA based High Performance Asynchronous ALU based on Modified 4 Phase Handshaking Protocol with Tapered Buffers
Nikhil Bhandari, Shubhajit Roy Chowdhury
Centre for VLSI and Embedded Systems Technology
International Institute of Information Technology-Hyderabad
Hyderabad-500032, India
Abstract—This paper presents a high-performance implementation of an asynchronous ALU on an FPGA that minimizes the time taken to execute instructions. The execution time has been reduced using a modified version of the 4-phase handshaking protocol. The asynchronous design methodology helps to achieve higher flexibility and performance of the arithmetic logic unit. The ALU has been simulated with the Xilinx ISE simulator and implemented on a Spartan 3E family FPGA board. With the help of the modified 4-phase handshaking protocol, the computation speed has been increased to 1.56 times that of the previously available asynchronous versions.
Index Terms—Reconfigurable ALU, dual rail protocol, 4-phase protocol, handshaking, Muller C-element, power dissipation, performance
I. INTRODUCTION
The simultaneous minimization of delay and power consumption is becoming one of the major challenges in designing high-performance processors. The demand for faster and more complex computations in a minimum amount of time is a challenging task for designers across industries. The processor is now at the heart of every electronic device, from big systems to small timer clocks. The major block of the processor is the Arithmetic Logic Unit (ALU), so for the processor to work at higher speed the ALU must perform computations fast enough to match the current trend towards heavy computation. A high-performance ALU is required to provide the speed and accuracy needed by forthcoming devices. Researchers have tried to increase the speed and accuracy of ALU computation using various techniques in the field of synchronous, asynchronous, and globally asynchronous locally synchronous (GALS) architectures. The notion of gated clocks to reduce switching activity in synchronous designs was proposed by Benini and De Micheli in [1] and by Pandey and Pattanaik in [2], but it did not work efficiently at higher frequencies. Hemani et al. used buffer and device sizing for low-power operation in [3] but could not reach the desired computational performance at that time.
Researchers then started focusing on the GALS architecture to eliminate the global clock from the system by building smaller sub-blocks using synchronous design and connecting these sub-blocks asynchronously [3]. The GALS architecture got rid of the global system clock and clock skew and enhanced the speed of the design, but it also increased the total chip area and the power consumption of the circuit, which are the major drawbacks of this approach [4].
Fig. 1. Conventional GALS Architecture
An alternative idea was to resort to a completely asynchronous implementation of the ALU using an asynchronous pipeline architecture [5]. In this implementation, only simple handshaking protocols and a Muller pipeline were used, but a large amount of power is consumed due to the settling and waiting time before the next instruction starts executing, which decreases the overall efficiency of this implementation. Another team of researchers presented a delay-insensitive GALS encoding scheme in which each gate is locally synchronized within the complete architecture, which eases data forwarding and provides higher speed than its predecessor [6].
The current work focuses on the design of a completely asynchronous implementation of a 16-bit ALU with the help of a modified 4-phase handshaking protocol, which attains higher speed than the earlier design [8]. The usual 4-phase handshaking protocol takes two round trips to exchange the req and ack signals for a particular instruction, but the same work can be done in one round trip by dividing the time cycle into two parts. This approach provides better speed than the standard 4-phase handshaking protocol.
The remainder of the paper is organized as follows. Section II presents the background of asynchronous architecture with a discussion of its key elements. Section III presents the modified 4-phase protocol for asynchronous ALU design, which increases the performance of previously implemented architectures. Section IV presents the design procedure for circuit implementation using the modified approach. Section V presents the results obtained with the proposed approach and comparisons with previously existing approaches.
II. BACKGROUND
The asynchronous architecture mainly comprises bundled data rail protocols and handshaking protocols between the various components to achieve synchronization and the necessary communication between the blocks. The best implementation possible is the Muller implementation [7], which uses the bundled data protocol and the 2-rail protocol to synchronize the delay between the latches and registers so that the channel inputs and outputs are correctly synchronized. The 4-phase handshaking protocol is preferred over the 2-rail protocol because of its higher synchronization efficiency and lower power consumption.
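As a point of reference for the rest of the paper, the standard 4-phase (return-to-zero) handshake can be sketched in a few lines of Python. This is only a behavioural illustration, not the authors' implementation; the function name `four_phase_transfer` and the trace format are assumptions made for this sketch. It shows the two round trips mentioned above: req and ack rise, then both fall before the next transfer can begin.

```python
def four_phase_transfer(data):
    """Return (latched, trace) for one 4-phase (return-to-zero) handshake.

    trace holds one (description, req, ack) tuple per signal edge.
    All names here are hypothetical and used for illustration only.
    """
    req, ack = 0, 0
    trace = []

    req = 1                       # phase 1: sender asserts req, data is valid
    trace.append(("req rises", req, ack))

    latched = data                # receiver samples the data while req is high
    ack = 1                       # phase 2: receiver asserts ack
    trace.append(("ack rises", req, ack))

    req = 0                       # phase 3: sender withdraws req
    trace.append(("req falls", req, ack))

    ack = 0                       # phase 4: receiver withdraws ack, channel idle
    trace.append(("ack falls", req, ack))

    return latched, trace


latched, trace = four_phase_transfer(0x1234)
for step in trace:
    print(step)
```

Each transfer costs four signal edges, and the last two carry no data; that return-to-zero half is the overhead a faster protocol has to reduce.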
A. 4-phase dual rail protocol- The 4-phase bundled protocol [6] helps in the reliable transfer of information from the source to the destination regardless of the delay introduced by the combinational blocks between the source end and the destination end. The 4-phase data protocol requires two wires per bit, i.e. one for logic high (logic 1) and one for logic low (logic 0). The 4-phase handshaking protocol uses a level-triggered approach, and it takes more time because it needs to retrace the path backwards to pull down the ack (acknowledgement) and req (request) signals. The 4-phase dual rail protocol is used in the implementation of the circuit for better reliability and performance.
Fig. 2. Bundled Data Protocol
Fig. 3. Standard 4-phase protocol
B. Muller C-element- The Muller C-element [7] is one of the basic building blocks used in asynchronous design. The Muller C-element outputs logic high (logic 1) when all the driving inputs are high and, similarly, outputs logic low (logic 0) when all the driving inputs are low. Extensions of the Muller C-element are created by adding transistors in series or in parallel while preserving the above characteristics. There is a limit on the number of transistors that can be cascaded, because the body effect of the transistors may cause a logic high to be interpreted as a logic low in the circuit.
C. Muller Pipeline- The Muller pipeline [8] implementation is used in asynchronous design to ease the timing constraints on the ALU design. Generally, the pipeline used is 1 bit wide with a depth of 3, with one ack signal per stage for the entire pipelining process. All the ack (acknowledgement) signals are initialised to zero at the beginning of the pipelining stage. The value of C-element k is updated if the previous value, i.e. C[k-1], is 1 and the successor value, i.e. C[k+1], is 0.
Fig. 4. 2-input Muller Element
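The behaviour of the Muller C-element described above (output goes high when all inputs are high, goes low when all inputs are low) can be modelled with a short behavioural sketch. Holding the previous output when the inputs disagree is the standard C-element behaviour; the class name `CElement` and its interface are assumptions made for this illustration and say nothing about the transistor-level circuit.

```python
class CElement:
    """Behavioural model of a Muller C-element (hypothetical helper class)."""

    def __init__(self, n_inputs=2, init=0):
        self.n_inputs = n_inputs
        self.out = init           # the element is stateful: it remembers its output

    def update(self, *inputs):
        if len(inputs) != self.n_inputs:
            raise ValueError("wrong number of inputs")
        if all(v == 1 for v in inputs):
            self.out = 1          # all inputs high -> output high
        elif all(v == 0 for v in inputs):
            self.out = 0          # all inputs low -> output low
        # otherwise the inputs disagree and the previous output is held
        return self.out


c = CElement()
outputs = [c.update(a, b) for a, b in [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]]
print(outputs)  # [0, 0, 1, 1, 0]
```

The held value in the mixed-input cases is what lets the element act as a rendezvous point between a request and an acknowledgement.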
Fig. 7.
Fig. 5. 3-stage Muller Pipeline
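The update rule of the Muller pipeline can be illustrated with a small behavioural model. The set condition (C[k-1] = 1 and C[k+1] = 0) is the one given in the text; the complementary reset condition (C[k-1] = 0 and C[k+1] = 1) is assumed here from the usual textbook description of the Muller pipeline. The function name `settle` and the boundary signals `req_in` and `ack_in` are hypothetical names for this sketch.

```python
def settle(state, req_in, ack_in):
    """Apply the C-element rule to every stage until nothing changes.

    state  : list of C-element outputs, left to right
    req_in : request from the left-hand environment
    ack_in : acknowledgement from the right-hand environment
    """
    state = list(state)
    changed = True
    while changed:
        changed = False
        for k in range(len(state)):
            pred = req_in if k == 0 else state[k - 1]
            succ = ack_in if k == len(state) - 1 else state[k + 1]
            if pred == 1 and succ == 0:
                new = 1               # set: predecessor full, successor empty
            elif pred == 0 and succ == 1:
                new = 0               # reset: the complementary condition
            else:
                new = state[k]        # hold
            if new != state[k]:
                state[k] = new
                changed = True
    return state


stages = [0, 0, 0]                                # all ack signals start at zero
stages = settle(stages, req_in=1, ack_in=0)
print(stages)                                     # [1, 1, 1]
stages = settle(stages, req_in=0, ack_in=0)
print(stages)                                     # [0, 0, 1]
stages = settle(stages, req_in=0, ack_in=1)
print(stages)                                     # [0, 0, 0]
```

In the second step the last stage keeps its value because the right-hand environment has not acknowledged yet, which is how the pipeline holds a token until the receiver is ready.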
III. MODIFIED 4 PHASE PROTOCOL FOR ASYNCHRONOUS ALU DESIGN
The current work modifies the standard 4-phase handshaking protocol to speed up computation by decreasing the time taken to execute instructions. The ack (acknowledgement) and req (request) signals are exchanged in one round trip, compared with the two round trips taken in the earlier 4-phase handshaking protocols. The waiting window before the next instruction executes has been shrunk to increase the speed of the overall ALU block. The technique has been implemented by dividing the routing time into two equal halves, but this imposes a major constraint on the flow of instructions and leaves a very narrow time window. The approach provides better computation speed than the standard 4-phase handshaking protocol.
Fig. 6. Implementation Diagram
Modified 4-phase protocol
Due to the very narrow time window between the request and acknowledgement signals, erroneous data were generated in some cases because the values were not updated in the register before the instruction was executed. The speed was 1.8 times [10] that of the earlier standard 4-phase handshaking protocol, but the outputs generated were not always correct.
This error has been eliminated from the implementation by using an additional set of buffers after the request and acknowledgement signal generation (tapered buffers). This technique removes the errors, but it slows down the computation. The speed of the actual hardware implementation drops to 1.56 times due to the addition of tapered buffers. The reduction in speed is caused by the delay that the tapered buffers introduce before the instruction is executed. The approach still gives better computation speed than the previously available asynchronous counterpart, i.e. the standard 4-phase handshaking protocol. In the modified 4-phase handshaking protocol with tapered buffers, the buffers are added before the req and after the acknowledgement so that the updated values are stored in the register before the computation actually starts. The buffers add a delay after the request and acknowledge signals, but the updated value is then used in executing the instruction, which removes the errors caused by the narrow time window in the approach without tapered buffers.
Fig. 8. Block Diagram of Tapered Buffer
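The effect of the narrow window and of the added buffer delay can be illustrated with a toy timing model. This is a sketch under stated assumptions, not a model of the actual circuit: the function `operand_used` and all timing values below are hypothetical and use an arbitrary time unit. It only shows why starting execution before the register update completes yields a stale operand, and why delaying the start avoids it.

```python
def operand_used(start_time, update_time, old_value, new_value):
    """Value the ALU reads if execution starts at start_time while the
    register finishes updating at update_time (same arbitrary time unit)."""
    return new_value if start_time >= update_time else old_value


# Hypothetical timings, chosen only to illustrate the effect.
REGISTER_UPDATE_TIME = 5   # when the register holds the updated value
NARROW_WINDOW = 3          # start of execution without tapered buffers
BUFFER_DELAY = 4           # extra delay contributed by the buffers

without_buffers = operand_used(NARROW_WINDOW, REGISTER_UPDATE_TIME,
                               old_value=7, new_value=9)
with_buffers = operand_used(NARROW_WINDOW + BUFFER_DELAY, REGISTER_UPDATE_TIME,
                            old_value=7, new_value=9)
print(without_buffers, with_buffers)  # 7 9
```

The same delay that guarantees the fresh operand also lengthens every instruction, which is consistent with the drop from 1.8 times to 1.56 times reported in the text.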
IV. CIRCUIT IMPLEMENTATION
Using the principles discussed above, the ALU circuit is designed using the modified 4-phase handshaking protocol and a Muller pipeline implementation. Fig. 8 represents the complete block diagram of the asynchronous architecture, with the final ALU output fed back to register B for further computations.
Fig. 9. Block Diagram
The simulation results show that 1592 operations can be executed in the time required by the clocked ALU to perform 700 operations. The data were obtained by simulating a random set of operations, i.e. ADD, LOAD, STR, SUB, XOR, OR etc. These results indicate that the given approach is more than 2.17 times faster than its synchronous counterpart. The major reason is the zero delay in assigning the next instruction. A greater speed can be achieved by modifying the instruction set architecture. The previously available implementation provides a speed of 1.56 times that of its asynchronous counterpart. This design has increased the efficiency by more than 15 percent over the previously known implementations.
The ALU implementation is based on the DIMS adder [10], which was implemented with Muller C-elements and basic Muller pipelining. In the case of a single-bit adder, there can be 8 possible sets of outputs, and this adder is instantiated for the 16-bit module. 10 outputs have been used, including gen and kill operations when both the inputs are aH and aL, bH and bL, where every bit is represented by two states.
V. RESULTS AND DISCUSSION
The proposed ALU was implemented on a Xilinx Spartan 3E family FPGA. The results obtained with the modified 4-phase protocol in the asynchronous design provide better optimization in terms of speed and power dissipation than the previously used standard 4-phase protocol and the synchronous counterpart simulations. The performance achieved with the simulation tool is more than 1.8 times [10] that of the asynchronous counterpart (standard 4-phase handshaking protocol). The speed of the actual hardware implementation drops to 1.56 times the previous speed because of the delay introduced by external environmental factors. The simulated output for 4 operations, i.e. Add, Left shift, AND and OR, is shown below. The performance is checked by giving 100 random instructions (i.e. add, left shift, right shift with random operand values) to the three versions, i.e. the synchronous, standard 4-phase and modified 4-phase versions.
Fig. 10. Simulation Waveform
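The dual-rail DIMS adder on which the ALU is based can be sketched behaviourally for a single bit. This is a minimal illustration of the general DIMS technique, in which one C-element per input minterm (8 for a full adder) drives the output rails; it does not reproduce the authors' 10-output variant with gen and kill signals. The names `CElement`, `encode`, `NULL` and `DimsFullAdder` are assumptions made for this sketch.

```python
from itertools import product


class CElement:
    """Muller C-element: the output follows the inputs only when all agree."""

    def __init__(self):
        self.out = 0

    def update(self, *inputs):
        if all(v == 1 for v in inputs):
            self.out = 1
        elif all(v == 0 for v in inputs):
            self.out = 0
        return self.out


NULL = (0, 0)  # spacer: neither rail asserted


def encode(bit):
    """Dual-rail code word (H, L) for one bit."""
    return (1, 0) if bit else (0, 1)


class DimsFullAdder:
    def __init__(self):
        # One 3-input C-element per minterm of (a, b, cin): 8 in total.
        self.minterms = {abc: CElement() for abc in product((0, 1), repeat=3)}

    def step(self, a, b, cin):
        """a, b, cin are dual-rail pairs; returns (sum, cout) as dual-rail pairs."""
        sum_h = sum_l = cout_h = cout_l = 0
        for (x, y, z), cell in self.minterms.items():
            # Pick the H rail when the minterm expects 1, the L rail otherwise.
            rails = (a[0] if x else a[1],
                     b[0] if y else b[1],
                     cin[0] if z else cin[1])
            if cell.update(*rails):
                if x ^ y ^ z:
                    sum_h = 1
                else:
                    sum_l = 1
                if (x & y) | (y & z) | (x & z):
                    cout_h = 1
                else:
                    cout_l = 1
        return (sum_h, sum_l), (cout_h, cout_l)


adder = DimsFullAdder()
print(adder.step(encode(1), encode(1), encode(0)))  # ((0, 1), (1, 0))
print(adder.step(NULL, NULL, NULL))                 # ((0, 0), (0, 0))
```

The spacer between two valid code words matters: without it, a C-element that fired for the previous operands could hold its output and corrupt the next result, which is why every valid word in the usage above is followed by `NULL`.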
The performance comparison between the synchronous and asynchronous designs is also verified on an Altera Cyclone II FPGA based ALU. The FPGA implementation has been evaluated by measuring the time taken to run a fixed set of instructions. The final output image is shown for the random test cases used to check the correctness of the design. The technology schematic comprises the elements used in building the arithmetic logic unit: Muller elements, adders, shifters, multiplexers, inverters and other basic blocks. The final technology schematic helps at chip fabrication time by giving the exact placement and routing of the various components.
Fig. 11. Hardware Implementation Waveform
Fig. 12. Technology Schematic
The time taken represents the time needed to successfully run the fixed set of instructions in each of the three implementations. The power dissipated, in the last column, is given in mW. The chip area of the ALU is calculated using the Xilinx ISE tool, and the results show that the area used by the asynchronous design is larger than that of the synchronous design because of the additional buffer elements required to compensate for delay and timing relative to the synchronous design. The modified 4-phase protocol with tapered buffers adds buffers in the acknowledgement and request cycle to produce error-free outputs by allowing time for the proper acknowledgement to be received. The time reported includes the signalling time for the asynchronous protocols and the delay incurred in the synchronous protocol. The results for the modified 4-phase protocol with tapered buffers, the standard 4-phase protocol and the synchronous counterpart are given in Table I.
TABLE I
COMPARISON OF DIFFERENT IMPLEMENTATIONS
Design                       Time taken (ms)   Area (mm x mm)   Power (mW)
Synchronous                  49                31.84            41.73
Standard 4-phase protocol    42                42.84            19.62
Modified 4-phase protocol    30                41.48            19.84
The power requirement of the synchronous version is higher than that of the modified 4-phase handshaking protocol because of the many buses for VDD and Clk, which need to be distributed to every part of the chip area. The power estimated for the asynchronous version comes out to be near 16 mW. The asynchronous version has the drawback of a larger chip area than its synchronous counterpart because of the many buffer elements used to compensate for the delay in each part of the circuit.
VI. CONCLUSION
The current work proposes an asynchronous implementation of the ALU that attempts to minimize the basic issues with the synchronous implementation of the ALU, viz. clock skew, power consumption and delay. Using the asynchronous ALU, a computation speed of 1.5 times that of the previously available standard 4-phase protocol has been achieved. The asynchronous version works efficiently at higher frequencies, but its major limitation is the larger chip area used in implementing the architecture. Researchers are trying to minimize the chip area, which in turn would increase the performance of asynchronous ALU designs over synchronous designs. The lower power dissipation is also an advantage of the asynchronous design. Further work is in progress.
REFERENCES
[1] Luca Benini and Giovanni De Micheli, "Automatic Synthesis of Low-Power Gated-Clock Finite-State Machines," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 15, no. 6, June 1996.
[2] A. Hemani, T. Meincke, S. Kumar, A. Postula, P. Nilsson, J. Oberg, P. Ellervee, and D. Lundqvist, Lowering power consumption in clock by using globally asynchronous locally
[3] Hala A. Farouk and Mahmoud T. El-Hadidi, "Implementing Globally Asynchronous Locally Synchronous Processor Pipeline on Commercial Synchronous FPGAs," 2010 17th International Conference on Telecommunications.
[4] T. Y. Tang, C. S. Choy, J. Butas, and C. F. Chan, "An ALU design using a novel asynchronous pipeline architecture," in Proc. IEEE ISCAS, May 2000, vol. 5, pp. 361-364.
[5] P. Amrutha and G. Hanumantha Reddy, "Implementation of ALU Using Asynchronous Design," IOSR Journal of Electronics and Communication Engineering (IOSR-JECE).
[6] Jens Sparso and Steve Furber, "Principles of Asynchronous Circuit Design" (textbook on asynchronous design).
[7] Paul Metzgen, "A High Performance 32-bit ALU for Programmable Logic," FPGA '04: Proceedings of the 2004 ACM/SIGDA 12th International Symposium on Field Programmable Gate Arrays.
[8] S. Hauck, "Asynchronous Design Methodologies: An Overview," Proc. of the IEEE, 83(1):69-93, January 1995.
[9] Matheus T. Moreira, Carlos H. M. Oliveira, Ricardo C. Porto, and Ney L. V. Calazans, "Design of NCL Gates with the ASCEnD Flow: A Standard Cell Library for Semi-Custom Asynchronous Design," 2012 13th International Symposium on Quality Electronic Design (ISQED), pp. 84-90.
[10] John Teifel and Rajit Manohar, "An Asynchronous Dataflow FPGA Architecture," IEEE Transactions on Computers, vol. 53, no. 11, p. 1376, November 2004.