IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, VOL. 55, NO. 1, JANUARY 2008


A Timing-Driven Approach to Synthesize Fast Barrel Shifters

Sabyasachi Das and Sunil P. Khatri

Abstract—In modern digital signal processing and graphics applications, the shifter is an important module, consuming a significant amount of delay. This brief presents an architectural optimization approach to synthesize a faster barrel shifter block, which can be useful to reduce the delay of the design without significantly increasing the area. We have divided the problem of generating the shifter into two steps: i) timing-driven selection of multiple stages for merging, and ii) the design of the merged stage. In our proposed method, we define the notion of dual merged stage, where two stages are merged and the triple merged stage, where three stages are merged into a single composite stage. These merged stages are identified by using a timing-driven algorithm and are used in conjunction with some single stages of the traditional barrel shifter. The use of these merged stages helps reduce the depth of the proposed barrel shifter architecture, thereby improving the delay. The timing-driven nature of our algorithm helps produce a faster implementation for the overall shifter block. We have evaluated the performance of our design by using a number of technology libraries, timing constraints and shifter bit-widths. Our experimental data shows that the shifter block generated by our algorithm is significantly faster (10.19% on average) than the shifter block generated by a commercially available datapath synthesis tool. These improvements were verified on placed-and-routed designs as well.

I. INTRODUCTION

AS WE MIGRATE toward ultra deep sub-micron feature sizes, digital designs are becoming increasingly complex, with very aggressive performance goals. Arithmetic components are typically highly computation-intensive, and are widely used in modern integrated circuits (ICs). The shifter is an integral part of many digital designs. A barrel shifter is a combinational logic block that can shift data by any given number of bits in a single operation. There are many applications that require shift operations, including CPUs, floating-point operations (such as normalization), variable-length coding, word packing/unpacking, bit indexing, address generation, and field extraction. Shifters are essential in the digital signal processing field. The barrel shifter is a commonly used shifter architecture. One important reason for the widespread use of this architecture is that it can perform multi-bit shifts in a single operation (within one clock cycle). In addition, the area of the barrel shifter is reasonably small, which helps keep the area of the design under control.

Manuscript received May 22, 2007, revised July 11, 2007. This paper was recommended by Associate Editor L. Lavagno. S. Das is with Synplicity Inc, Sunnyvale, CA 94087 USA (e-mail: [email protected]). S. P. Khatri is with the Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 30332 USA (e-mail: [email protected]). Digital Object Identifier 10.1109/TCSII.2007.908951

Several techniques have been proposed to design efficient barrel shifters in different contexts. The basic architecture of a barrel shifter was introduced in [1]. High-speed pipelined architectures using TSPC were discussed in [2] and [3]. A high-performance and area-efficient CMOS 32-bit barrel switch and its physical design were presented in [4]. In [5], the number of stages in a shifter was reduced, resulting in significantly faster speed. A multilevel barrel shifter structure in the context of the CORDIC design was introduced in [6]. In [7], different design tradeoffs in the context of barrel shifters were analyzed. Timing-driven layout techniques for cyclic shifters were proposed in [8] and [9]. In [10], data-driven dynamic logic is used to generate a faster and more power-efficient barrel shifter than a domino-logic-based design. A 4-bit barrel shifter in the QCA computing paradigm was introduced in [11]. A mixed-signal 32-bit rotator/shifter circuit design with short latency was discussed in [12]. Several low-power architectures for barrel shifters have been presented in [13]–[16]. Energy-delay evaluation of a low-power barrel switch is discussed in [17].

In this brief, we propose a timing-driven technique to synthesize a faster barrel shifter block. In our approach, we merge two (or three) stages of the shift operation into a single stage, leading to a reduction in the total number of stages. These stages are referred to as dual merged and triple merged stages. The decision to merge stages is made in a timing-driven fashion, so that the overall delay of the shifter is minimized. The optimizations involved in our approach are orthogonal to the ideas previously presented in this section.

We have organized the rest of the brief as follows. In Section II, we present some background information about the barrel shifter architecture. In Section III, we discuss our proposed approach in detail. Section IV presents the experimental results.
Conclusions are drawn in Section V.

II. PRELIMINARIES

In this section, we briefly explain the concept of a barrel shifter and discuss how it is typically synthesized [18]. In a barrel shifter, if the data input signal is $n$ bits wide, then the shift signal is typically $\lceil \log_2 n \rceil$ bits wide. The width of the output signal of the shifter is typically the same as the input width ($n$). The shifter is divided into $\lceil \log_2 n \rceil$ stages, where each stage ($i$) performs a single shift of 0 or $2^i$ bits, depending on the value of the $i$th bit of the shift signal. Each bit of the shift signal controls exactly one barrel shifter stage. The input data is shifted (or not shifted) by each of the stages in sequence. To implement this, multiplexers (or an equivalent logic circuit constituted using technology library cells) are used in each stage.

Fig. 1 shows the block-level diagram of a 3-stage barrel shifter (left shifter), where each row represents a stage. In this figure, the logic-0 input signal is denoted by 1'b0 (Verilog notation). In this diagram, the data input signal is 8 bits wide and the output signal is also 8 bits wide. The shift signal has 3 bits and hence the shifter consists of 3 stages. Similarly, to implement a shifter with a 128-bit input data signal, the shifter would require 7 stages.

Fig. 1. Traditional left barrel shifter with 3 stages.

1549-7747/$25.00 © 2007 IEEE Authorized licensed use limited to: Texas A M University. Downloaded on May 20, 2009 at 03:44 from IEEE Xplore. Restrictions apply.

III. OUR APPROACH

Throughout the rest of the brief, we will assume the data input to the barrel shifter is $n$ bits wide and the shift input signal is $\lceil \log_2 n \rceil$ bits wide. The output is also $n$ bits wide. We denote $\lceil \log_2 n \rceil$ by $m$. Let $t_{s_i}$ be the arrival time of the $i$th bit of the shift signal $s$. In the traditional barrel shifter architecture, this block has $m$ stages and each stage consists of $n$ 2-to-1 multiplexers (MUXes). The timing-critical path of the shifter traverses through $m$ 2-to-1 MUXes.

To estimate the delay of a traditional barrel shifter stage, we identify the fastest 2-to-1 MUX cell from the provided technology library. The functionality of a 2-to-1 MUX cell with data inputs $a$ and $b$, select input $c$, and output $y$ can equivalently be implemented by one of the following two logic expressions:

$$y = (\bar{c} \cdot a) + (c \cdot b) \qquad (1)$$

$$y = \overline{\overline{(\bar{c} \cdot a)} \cdot \overline{(c \cdot b)}} \qquad (2)$$

In some technology libraries, the delay of the built-in 2-to-1 MUX cell is larger than that of a MUX built from basic gates by using the functionality presented in (1) (AND-OR operation) or (2) (NAND-NAND operation). We consider the smallest of these three delays as the delay of a single stage of the traditional barrel shifter. We denote this delay as $Del_1$.

In this brief, we introduce a technique to implement a faster barrel shifter. The key idea is to merge multiple (two or three) stages of the barrel shifter into one stage. We define mergeable stages as those which can be merged to create a hybrid stage (leading to faster performance of the shifter block). To identify mergeable stages, we design a timing-driven algorithm, so that the overall delay of the shifter block is minimized. In the following two subsections, we discuss each of the two steps (the design and identification of merged stages) in detail.

A. Design of the Merged Stages

To facilitate the explanation, we discuss the left shifter only.
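As a baseline for the merged stages, the traditional left barrel shifter of Section II and the two MUX forms (AND-OR and NAND-NAND) can be modeled behaviorally. The Python sketch below is illustrative only: the function names (`mux2_and_or`, `mux2_nand_nand`, `barrel_shift_left`, `to_bits`, `from_bits`) and the LSB-first list representation are assumptions made for this example and do not come from the paper.

```python
def mux2_and_or(sel, d0, d1):
    """2-to-1 MUX in AND-OR form: d0 when sel is 0, d1 when sel is 1."""
    return ((1 - sel) & d0) | (sel & d1)


def mux2_nand_nand(sel, d0, d1):
    """Same MUX function built only from 2-input NAND gates and an inverter."""
    def nand(a, b):
        return 1 - (a & b)
    return nand(nand(1 - sel, d0), nand(sel, d1))


def barrel_shift_left(data_bits, shift_bits, mux=mux2_and_or):
    """Traditional left barrel shifter (hypothetical behavioral model).

    data_bits  : list of n bits, index 0 is the LSB
    shift_bits : list of m bits, index 0 is the LSB
    Stage i shifts by 0 or 2**i bits depending on shift_bits[i].
    """
    n = len(data_bits)
    current = list(data_bits)
    for i, s_i in enumerate(shift_bits):
        step = 1 << i
        # One 2-to-1 MUX per bit-slice; vacated LSB positions take logic-0.
        current = [
            mux(s_i, current[b], current[b - step] if b >= step else 0)
            for b in range(n)
        ]
    return current


def to_bits(value, width):
    """Integer to LSB-first bit list."""
    return [(value >> b) & 1 for b in range(width)]


def from_bits(bits):
    """LSB-first bit list to integer."""
    return sum(bit << b for b, bit in enumerate(bits))
```

For the 8-bit, 3-stage configuration of Fig. 1, `from_bits(barrel_shift_left(to_bits(x, 8), to_bits(s, 3)))` models the shifter output for data value `x` and shift amount `s`. Passing `mux=mux2_nand_nand` exercises the NAND-NAND form instead of the AND-OR form.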
Note that the same concept applies to the right shifter as well. In our approach, we attempt to merge two or three stages of the shifter into a single stage. If two stages are merged, we call the newly created stage a dual merged stage. If three stages are merged, we call the new stage a triple merged stage. Note that the stages to be merged are not necessarily consecutive.

In the case of dual merged stages, let us assume that we merge the stages corresponding to the $i$th bit and the $j$th bit of the shift signal $s$, where $i$ and $j$ are distinct indices between 0 and $m-1$. Note that $s_i$ and $s_j$ need not be two consecutive bits of the shift signal. Our newly created dual merged stage will perform one of the following four operations:

1) no shifting operation (if $s_i = 0$ and $s_j = 0$);
2) shift by $2^i$ bits (if $s_i = 1$ and $s_j = 0$);
3) shift by $2^j$ bits (if $s_i = 0$ and $s_j = 1$);
4) shift by ($2^i + 2^j$) bits (if $s_i = 1$ and $s_j = 1$).

The functionality of each bit-slice of our dual merged stage for a left shifter is as follows:

$$y_b = \overline{\overline{(\bar{s}_i \cdot \bar{s}_j \cdot p_0)} \cdot \overline{(s_i \cdot \bar{s}_j \cdot p_1)} \cdot \overline{(\bar{s}_i \cdot s_j \cdot p_2)} \cdot \overline{(s_i \cdot s_j \cdot p_3)}}$$

for each bit position $b$ of the $n$-bit data ($0 \le b \le n-1$), where $p_0 = x_b$, $p_1 = x_{b-2^i}$, $p_2 = x_{b-2^j}$, and $p_3 = x_{b-2^i-2^j}$. Here $x_b$ and $y_b$ denote bit $b$ of the input and the output of the stage, respectively, and any $x_t$ with $t < 0$ is logic-0.

Even if no merging is performed, for the left shifter, the functionality of a few bit-slices near the least significant bits (LSB) of the shifter gets simplified, because some of the values (in the above expression) become logic-0 (1'b0). For example, in the middle stage of Fig. 1, two bit-slices near the LSB have simplified functionality. In case merging is performed, this simplification can be exploited more aggressively.

The above expressions indicate that the timing-critical path of each dual merged stage consists of a single inverter, a single 3-input NAND gate and a single 4-input NAND gate. We denote this delay as $Del_2$. The functionality of the dual merged stage can also be implemented by two individual stages of the barrel shifter placed one after the other. In all the technology libraries that we have explored, the delay of the dual merged stage ($Del_2$) is less than the delay of two cascaded stages of the traditional barrel shifter ($2 \times Del_1$).

In a similar manner, we can formulate the output equations of each bit-slice of a triple merged stage. Let us assume that we merge the stages corresponding to the $i$th bit, the $j$th bit and the $k$th bit of the shift signal $s$, where $i$, $j$ and $k$ are distinct indices between 0 and $m-1$. Note that $s_i$, $s_j$ and $s_k$ need not be three consecutive bits of the shift signal. Our newly created triple merged stage will perform one of the following eight operations:

1) no shifting operation (if $s_i = 0$, $s_j = 0$ and $s_k = 0$);
2) shift by $2^i$ bits (if $s_i = 1$, $s_j = 0$ and $s_k = 0$);
3) shift by $2^j$ bits (if $s_i = 0$, $s_j = 1$ and $s_k = 0$);
4) shift by $2^k$ bits (if $s_i = 0$, $s_j = 0$ and $s_k = 1$);
5) shift by ($2^i + 2^j$) bits (if $s_i = 1$, $s_j = 1$ and $s_k = 0$);
6) shift by ($2^i + 2^k$) bits (if $s_i = 1$, $s_j = 0$ and $s_k = 1$);
7) shift by ($2^j + 2^k$) bits (if $s_i = 0$, $s_j = 1$ and $s_k = 1$);
8) shift by ($2^i + 2^j + 2^k$) bits (if $s_i = 1$, $s_j = 1$ and $s_k = 1$).

The functionality of each bit-slice of our triple merged stage for a left shifter is as follows:

$$y_b = \bar{s}_i \bar{s}_j \bar{s}_k\, q_0 + s_i \bar{s}_j \bar{s}_k\, q_1 + \bar{s}_i s_j \bar{s}_k\, q_2 + \bar{s}_i \bar{s}_j s_k\, q_3 + s_i s_j \bar{s}_k\, q_4 + s_i \bar{s}_j s_k\, q_5 + \bar{s}_i s_j s_k\, q_6 + s_i s_j s_k\, q_7$$

for $0 \le b \le n-1$, where $q_0 = x_b$, $q_1 = x_{b-2^i}$, $q_2 = x_{b-2^j}$, $q_3 = x_{b-2^k}$, $q_4 = x_{b-2^i-2^j}$, $q_5 = x_{b-2^i-2^k}$, $q_6 = x_{b-2^j-2^k}$, and $q_7 = x_{b-2^i-2^j-2^k}$.

Similar to the dual merged stage, for the triple merged stage, the functionality of a few bit-slices near the LSB (for a left shifter) or MSB (for a right shifter) gets simplified, because some of the values in the above expressions become logic-0. This fact is aggressively exploited while merging stages. By decomposing the functionality of each bit-slice, we find that the timing-critical path of each triple merged stage consists of a single inverter, a single 4-input NAND gate, a single 3-input NAND gate and a single 3-input OR gate. Based on the available cells in a technology library, there may be other, more efficient ways of implementing the functionality of each bit-slice as well. A general-purpose technology mapper is able to identify the most efficient implementation of the triple merged stage of a shifter. We denote the best possible delay of the triple merged stage as $Del_3$.
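The claim that a merged stage computes the same function as the cascaded single stages it replaces can be checked with a small behavioral model. The Python sketch below is a minimal illustration under stated assumptions: the dual merged stage is written in the NAND-NAND form suggested by the critical path described above (inverter, 3-input NAND, 4-input NAND), the triple merged stage is written as a plain sum of eight product terms, and all function names are hypothetical rather than taken from the paper.

```python
from itertools import product


def nand(*inputs):
    """NAND of any number of single-bit inputs (each 0 or 1)."""
    result = 1
    for value in inputs:
        result &= value
    return 1 - result


def bit(x, t):
    """Bit t of the stage input x (LSB first); positions below 0 read as logic-0."""
    return x[t] if t >= 0 else 0


def single_stage(x, sel, i):
    """One traditional left-shift stage: shift by 2**i when sel is 1."""
    return [bit(x, b - (1 << i)) if sel else x[b] for b in range(len(x))]


def dual_merged_stage(x, s_i, s_j, i, j):
    """Dual merged stage: four 3-input NANDs feeding one 4-input NAND per bit-slice."""
    not_i, not_j = 1 - s_i, 1 - s_j  # the single inverter level
    return [
        nand(
            nand(not_i, not_j, bit(x, b)),
            nand(s_i, not_j, bit(x, b - (1 << i))),
            nand(not_i, s_j, bit(x, b - (1 << j))),
            nand(s_i, s_j, bit(x, b - (1 << i) - (1 << j))),
        )
        for b in range(len(x))
    ]


def triple_merged_stage(x, sels, idxs):
    """Triple merged stage as a sum of eight product terms (one per select pattern)."""
    out = []
    for b in range(len(x)):
        acc = 0
        for combo in product((0, 1), repeat=3):
            # The product term is active only when the select bits match this pattern.
            selected = int(all(s == c for s, c in zip(sels, combo)))
            offset = sum(c << idx for c, idx in zip(combo, idxs))
            acc |= selected & bit(x, b - offset)
        out.append(acc)
    return out
```

For any input `x` and select values, `dual_merged_stage(x, s_i, s_j, i, j)` should equal `single_stage(single_stage(x, s_i, i), s_j, j)`, and the triple merged stage should equal three cascaded single stages. The model says nothing about delay, which depends on the technology library.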

B. Identification of the Mergeable Stages

In addition to the design of the merged stages, the technique to identify the mergeable stages plays a key role in determining the performance of our proposed shifter architecture.

Algorithm 1: Identification of the Mergeable Stages

    MergeableStageList = NULL
    SelPriorityQueue = Store s_0, s_1, ..., s_{m-1} in ascending order of arrival time
    while SelPriorityQueue is not empty do
        (i, j, k) = Select first (earliest-arriving) three elements of the shift signal from SelPriorityQueue
        // If 3 stages are not remaining, then the algorithm takes a simpler route
        Tsingle1 = t_si + Del1
        Tdual = t_sj + Del2
        Tsingle2 = Max((t_si + Del1), t_sj) + Del1
        Ttriple = t_sk + Del3
        Tsingle3 = Max(Tsingle2, t_sk) + Del1
        if (Ttriple < Tsingle3) and (Ttriple
        ...
            else
                Create a new node (singlestage) with only one element i // Not suitable for any merging
                singlestage.element0 = i
                singlestage.element1 = -1
                singlestage.element2 = -1
                Add singlestage into MergeableStageList
                Remove (Dequeue) i from SelPriorityQueue
            end if
        end if
    end while
    return MergeableStageList // The list of all stages

The algorithm to identify the mergeable stages is presented in Algorithm 1. A detailed explanation is provided below. Our algorithm uses the following timing-driven analysis to find two or three stages for merging: we store all the bits of the shift signal in ascending order of arrival time. To perform this operation efficiently, we use a priority queue data structure. Let us assume that the six earliest-arriving signals, in ascending order of arrival time, begin with $s_i$, $s_j$ and $s_k$. For the signals $s_i$ and $s_j$, if we construct a dual merged stage, then the output of the dual merged stage will be available at time

$$T_{dual} = t_{s_j} + Del_2$$

On the other hand, if we construct two individual stages, then the output of the second stage will be available at time

$$T_{single2} = \max(t_{s_i} + Del_1,\; t_{s_j}) + Del_1$$

Similarly, for the signals $s_i$, $s_j$ and $s_k$, if we construct a triple merged stage, then the output of the triple merged stage will be available at time

$$T_{triple} = t_{s_k} + Del_3$$
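The arrival-time estimates used by Algorithm 1 can be evaluated numerically. The Python sketch below is a hedged illustration: it computes only the five timing quantities listed in the algorithm for the three earliest-arriving shift bits, and it deliberately does not implement the merge decision, because the full condition is not available in this text. The function names and the example delay values are assumptions chosen for the example.

```python
import heapq


def earliest_three(arrival_times):
    """Return up to three (arrival_time, bit_index) pairs, earliest first.

    arrival_times[i] is the arrival time of shift bit s_i. A binary heap
    stands in for the priority queue mentioned in the text.
    """
    queue = [(t, idx) for idx, t in enumerate(arrival_times)]
    heapq.heapify(queue)
    return [heapq.heappop(queue) for _ in range(min(3, len(queue)))]


def stage_timing(t_si, t_sj, t_sk, del1, del2, del3):
    """Output-available times for single, dual merged and triple merged options.

    Assumes t_si <= t_sj <= t_sk, as guaranteed by the ascending ordering.
    """
    t_single1 = t_si + del1
    t_dual = t_sj + del2
    t_single2 = max(t_si + del1, t_sj) + del1
    t_triple = t_sk + del3
    t_single3 = max(t_single2, t_sk) + del1
    return {
        "Tsingle1": t_single1,
        "Tdual": t_dual,
        "Tsingle2": t_single2,
        "Ttriple": t_triple,
        "Tsingle3": t_single3,
    }


# Hypothetical numbers (arbitrary time units), not taken from the paper.
arrivals = [50, 0, 10, 400]          # arrival times of s_0 .. s_3
(t_i, i), (t_j, j), (t_k, k) = earliest_three(arrivals)
timing = stage_timing(t_i, t_j, t_k, del1=100, del2=150, del3=220)
```

With these hypothetical values, the dual merged stage for the two earliest bits finishes before two cascaded single stages would (`Tdual` is smaller than `Tsingle2`), which is the kind of comparison the timing-driven selection relies on.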
