REDUCING THE LATENCY OF FLOATING-POINT ARITHMETIC OPERATIONS

A dissertation submitted to the Department of Electrical Engineering and the Committee on Graduate Studies of Stanford University in partial fulfillment of the requirements for the degree of Doctor of Philosophy

By Nhon T. Quach December 1993

© Copyright 1993 by Nhon T. Quach. All Rights Reserved.


I certify that I have read this dissertation and that in my opinion it is fully adequate, in scope and in quality, as a dissertation for the degree of Doctor of Philosophy. Michael J. Flynn (Principal Advisor)

I certify that I have read this dissertation and that in my opinion it is fully adequate, in scope and in quality, as a dissertation for the degree of Doctor of Philosophy. Mark A. Horowitz

I certify that I have read this dissertation and that in my opinion it is fully adequate, in scope and in quality, as a dissertation for the degree of Doctor of Philosophy. Robert Dutton

Approved for the University Committee on Graduate Studies:


Abstract

Floating-point (FP) numbers are used in general-purpose scientific computation and increasingly in digital signal processing and graphics as well. The manipulation of these numbers, however, is more complex and incurs much higher latencies than that of their integer counterparts. This research attempts to reduce the latencies of the commonly used FP operations: add, multiply, divide, and square root.

The latency of an FP addition can be improved by an estimated 20% at no extra hardware expense. Through a detailed analysis of the existing two-path implementation, this work shows that the rounding step in both paths can be combined with the mantissa addition step, saving both hardware and time. The algorithm has been demonstrated through extensive computer simulation and silicon implementation. The test chip, implemented in a standard 1 µm CMOS technology, has a simulated nominal delay of 17 ns.

Like existing adders, the implemented adder uses a leading one prediction (LOP) circuit. This work examines LOP in a general framework and applies it to sticky bit computation in high-speed multipliers and dividers and to condition code prediction in high-speed processors.

SRT is a popular iterative division scheme used in many high-end processors today because of its low hardware cost and simplicity. This research shows that for 30% more hardware, the latency of a simple radix-4 divider can be reduced roughly by half, and that staging three such simple dividers (or recycling one with a 3X clock) allows a radix-64 divider to be built. The same method can be applied to the square root operation.

The FP number system does not have a closure property, and rounding is needed to ensure representability and reproducibility of a result. The rounding method proposed in this dissertation reduces the number of adders required for proper rounding in an FPU. Using this method, this research shows that, for multiplication and addition under the IEEE standard, the round to infinity modes are harder to implement than the round to nearest mode, with the round to zero mode being the easiest. The complexity of implementing a rounding mode is defined here as the minimum number of results a functional unit has to pre-compute in anticipation of later rounding events.

Acknowledgements

First and foremost, I would like to thank my advisor Michael Flynn for his support and guidance throughout my years at Stanford, and for believing in me more than I dare to believe in myself. Second, I would like to thank my associate advisor Mark Horowitz for his advice, some of which I ignored, only to find that he has been right all along. Prof. Dutton served on my reading committee. Back in the years when I was struggling with device physics (and I still am), he was a source of inspiration, and it is a real honor having him as the third reader. Prof. Dill was kind enough to sit on my oral committee. Special thanks go to Martin Morf for his help, guidance, and the many fruitful discussions on generalization. Dennis Brzezinski, Le Quach, and Tin-Fook Ngai have also helped in shaping this dissertation. I would like to thank Dr. Nishi Yosio, Ruby Lee, Dennis Brzezinski, John Kelly, David Liu, and Ed Lin of Hewlett-Packard for the fabrication of the chip. Ruby Lee, Dennis Brzezinski, and Erdem Hokenek deserve special thanks for their support and enthusiasm throughout this work. Brian Flachs has been great in helping me through more computer problems and crashes than I care to remember. This work has been funded by the National Science Foundation and in part by an IBM fellowship, with Erdem Hokenek serving as a mentor.

My parents Hue Lam and Tu Quach, my in-laws Anh Quach and Anh Hua, my sisters and brothers, and my wife Lilian have supported me in one way or another. My parents and sisters worked hard through many economic and political hard times to ensure that I received a "proper" education. My in-laws are always ready to extend a helping hand. Lilian has spoiled me enough to attempt this pursuit. She has shared my joy and frustration, shouldered the financial burden, and even brought our first child, Alex. They are all wonderful and I can never thank them enough.

Contents

Abstract                                                              iv
Acknowledgements                                                      vi

1 Introduction                                                         1
  1.1 Motivations                                                      1
  1.2 Research Focus                                                   2
  1.3 Contributions                                                    3
  1.4 Organization                                                     4

2 High-Speed Integer Addition                                          6
  2.1 Introduction                                                     6
  2.2 Background                                                       6
  2.3 A 32-Bit Modified Ling Adder                                     8
  2.4 Comparison with Other Adders                                    14
  2.5 Implementation                                                  17
  2.6 Summary                                                         17

3 Leading One Prediction                                              18
  3.1 Introduction                                                    18
  3.2 Theory                                                          20
  3.3 Implementations                                                 22
  3.4 Generalization and Applications                                 30
  3.5 Summary                                                         32

4 High-Speed FP Addition                                              33
  4.1 Introduction                                                    33
  4.2 A Review of FP Addition Algorithm                               34
  4.3 The Proposed Algorithm                                          36
  4.4 Implementation Issues                                           40
  4.5 Results                                                         45
  4.6 Summary                                                         47

5 A Radix-64 FP Divider                                               50
  5.1 Introduction                                                    50
  5.2 Background                                                      51
  5.3 The Proposed Implementation                                     52
  5.4 Summary                                                         58

6 Fast IEEE Rounding                                                  62
  6.1 Introduction                                                    62
  6.2 IEEE Rounding Modes                                             63
  6.3 Rounding for Binary Parallel Multipliers                        66
  6.4 Summary                                                         78

7 Conclusion and Future Work                                          79
  7.1 Conclusion                                                      79
  7.2 Future Work                                                     80

Bibliography                                                          83

List of Tables

2.1 Comparison of Gate Delay and Serial Transistors in CMOS Adders    15
3.1 Three-Bit Patterns That Need to Be Detected                       24
3.2 Logic Equation for cfine_i at the ith Bit Position                26
3.3 Two-Bit Patterns That Have to Be Examined                         31
4.1 Steps in Conventional Algorithms                                  35
4.2 Steps in the Proposed Algorithm                                   39
4.3 Possible Adjustments for the Exponent                             40
4.4 Breakdown of Area Consumption                                     44
4.5 Simulated Delay at 5 Volts and Room Temperature                   47
5.1 New Quotient Digits Encoding Scheme                               54
5.2 Comparison of Quotient Digits Equations                           56
5.3 Next Quotient Logic                                               58
6.1 Implementation of IEEE Rounding Modes                             64
6.2 Comparison of Round to Nearest and Round Up                       65
6.3 Rounding Table for Round Up in Binary Multipliers                 67
6.4 Predicting the Regions with the msb's of Cl and Sl                70
6.5 Prediction Table for Simple Model                                 72
6.6 Possible Digit Selections for Round Up in the Simple Model        73
6.7 Possible Prediction Schemes for Round Up for the Improved Model   75
6.8 Rounding Table for Round to Infinity in Binary Multipliers        76
6.9 Prediction Table for the Round to Infinity Mode                   77
6.10 Rounding Table for Round to Zero in Binary Multipliers           78

List of Figures

2.1 Structure and Number Convention of the 32-Bit Modified Ling Adder  9
2.2 Implementation of the Group Generate                              11
2.3 Implementation of Group 2                                         14
3.1 (a) Leading One Detection and (b) Leading One Prediction          19
3.2 A Possible Precharged Implementation of Leading One Prediction    28
4.1 Rounding Scheme                                                   38
4.2 Logic Diagram of the Adder                                        43
4.3 A Plot of the Floating-Point Adder                                49
5.1 PD-Plot                                                           53
5.2 Quotient Digits Allowable in the SRT Regions                      55
5.3 Comparison of the Proposed and the Simple SRT Division Scheme     60
5.4 Radix-64 Divider                                                  61
6.1 Explanation of Notation. K represents S, C, or R                  64
6.2 A Hardware Model for IEEE Rounding                                71
6.3 An Improved Hardware Model for IEEE Rounding                      74

Chapter 1

Introduction

1.1 Motivations

Floating-point (FP) numbers are used in general-purpose computation and increasingly in the areas of digital signal processing and graphics as well [1, 2]. The use of FP numbers, however, has not been without a price. FP operations are more complex and incur much higher latencies than their integer counterparts. The complexity arises partly from the way FP numbers are represented and partly from the fact that the result of an FP operation requires rounding.

Two other independent developments in the 80s have also provided impetus to this work: the increasing silicon budget and the introduction of an IEEE standard for FP binary arithmetic [3]. Reduction in the minimum definable feature sizes, coupled with the steadily increasing die sizes and number of interconnects, has permitted complex digital systems to be integrated on a single chip. System implementors are now willing to invest more silicon area for higher performance. As a result, FP algorithm selections based primarily on area need to be re-examined.

An FP number consists of a sign, a biased exponent, and a mantissa. Normally, the exponent determines its range and the mantissa its precision. In the past, system implementors freely allocated widths for the exponent and the mantissa. Consequently, FP numbers in different systems often differed, causing a loss of precision or range, or both, and a need for format conversion during data transfer. Rounding and arithmetic


exception handling are other areas of little uniformity among computer systems. The IEEE standard is an attempt to remedy this situation; it dictates not only the formats but also the rounding modes and their e ects. Conformance to this standard generally resulted in a higher hardware expenditure or latency, or both. Under these premises, this research attempts to develop algorithms for high-speed implementation of the commonly used FP operations and to explore methods to perform IEEE rounding eciently.

1.2 Research Focus

The FP performance of a computer system is determined by its execution rate and by the rate at which data can be fed into the FP unit. In the past, the latter had been the FP performance bottleneck. But features found in recent processors, such as large cache sizes, advanced caching operations, wider cache-CPU interfaces, support for load and store double, multiple instruction issue, and aggressive clock rates, have helped greatly in this respect, with an accompanying dramatic increase in FP performance. The increase in execution rate over the same period, on the other hand, has been far less impressive. Consequently, further increases in FP performance are expected to require balanced attention to both areas [4].

The execution rate of an FPU is determined by its degree of pipelining and the latency of its longest pipeline stage¹. Pipelining involves placing synchronizing elements, or latches, at regular intervals and is a relatively well understood task in the design of an FPU. The problem of reducing the latency, in principle, can be addressed at the technology, circuit, and algorithm levels. This research focuses on the algorithmic aspect of the problem, with emphasis placed on suitability for high-speed implementation in a CMOS VLSI medium. The findings, however, can be extended to other technologies as well.

¹Synchronous designs are assumed.


1.3 Contributions

This research demonstrates that current algorithms for the commonly used FP operations can be sped up considerably at a small to reasonable hardware cost. The contributions encompass the areas of integer addition, leading one prediction, FP addition, FP division, FP square root, and IEEE rounding.

Integer Addition. This research demonstrates that the logic of the carry tree in the commonly used carry-lookahead adder can be simplified at no extra hardware cost. The simplification leads to a slightly faster adder.

Leading One Prediction. When subtracting two FP numbers that are close in value, cancellation may produce a transient un-normalized number requiring a massive normalization shift, which is generally a slow operation. Leading one prediction is a technique for predicting, directly from the input operands, the number of places a result has to be left-shifted. Hence, subtraction and part of the normalization operation can proceed in parallel to reduce latency. This work examines the technique of leading one prediction in a general framework and applies it to sticky bit computation in high-speed multipliers and to condition code prediction in high-end processors.

FP Addition. Existing high-speed FP adders use a two-path implementation. A normal path handles the case when the operands differ greatly in magnitude, and a cancellation path the case when the operands are close in value. Operationally, the normal path involves an exponent subtract, an alignment shift, a mantissa add or subtract, and a rounding step. The cancellation path involves a mantissa subtract, a normalization shift, and a rounding step. The algorithm developed in this thesis combines rounding with the mantissa addition step in both paths and uses leading one prediction to speed up normalization in the cancellation path. For a two-stage pipelined implementation, the algorithm has one fewer rounding operation in the critical path than the existing algorithms. This speeds up the algorithm by an estimated 20% at no additional hardware cost.
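The two-path idea can be made concrete with a small model. The sketch below is illustrative only: the representation, the function names, and the 8-bit mantissa width are mine, not the thesis design. It shows the far/near path split and the normalization shift that leading one prediction is meant to overlap with the subtraction:

```python
# Toy two-path FP subtraction on magnitudes; value = mant * 2**(exp - FRAC).
FRAC = 8                                    # 1 integer bit + 8 fraction bits
MANT_MIN = 1 << FRAC                        # smallest normalized mantissa

def normalize(exp, mant):
    """Left-shift until the leading one sits at bit FRAC. In hardware, a
    leading one predictor derives this shift count from the operands, in
    parallel with the subtraction itself."""
    if mant == 0:
        return (0, 0)
    shift = 0
    while mant < MANT_MIN:
        mant <<= 1
        shift += 1
    return (exp - shift, mant)

def fp_sub(a, b):
    """a - b for normalized a >= b > 0, each an (exp, mant) pair."""
    (ea, ma), (eb, mb) = a, b
    d = ea - eb
    diff = ma - (mb >> d)                   # guard/sticky bits dropped here
    if d > 1:
        # Normal (far) path: big alignment shift, then at most a
        # one-position renormalization; no massive shift can occur.
        return (ea, diff) if diff >= MANT_MIN else (ea - 1, diff << 1)
    # Cancellation (near) path: exponents equal or adjacent, so the
    # subtraction may cancel many leading bits.
    return normalize(ea, diff)

# 1.00000001 * 2^2 minus 1.00000000 * 2^2 cancels down to 1.0 * 2^-6:
print(fp_sub((10, 0b100000001), (10, 0b100000000)))   # -> (2, 256)
```

In a real adder the rounding logic and guard/sticky handling dominate the design; this toy model only illustrates why the two paths have different normalization needs.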


The algorithm can also be incorporated into a three-stage pipelined adder. In this arrangement, the adder is shared between both paths so that only one mantissa adder is needed, allowing a reduction in hardware.

FP Division and Square Root. Traditionally, FP division has been regarded mainly as a subject of theoretical interest because of its low performance impact, and most FPUs have opted for a low-cost implementation. The increasing silicon area budget and the growing interest in graphics applications, however, have lately revived interest in FP division. SRT is a popular division technique used in many current processors. This research shows that with a roughly 30% increase in hardware, the latency of a simple radix-4 SRT divider can be reduced by half. Staging multiple such dividers (or recycling one with a 3X clock) allows higher-radix dividers (e.g., radix-64) to be built. The same method can be applied to the square root operation as well.

IEEE Rounding. FP operations do not have a closure property, and rounding ensures representability and reproducibility of the result. This research develops a systematic rounding method that minimizes the number of mantissa adders required in an FP operation. The method is general and can be used for most FP operations as well as for FPUs that use other number representations (such as signed-digit). Using this rounding method, this research shows that the round to nearest mode in the IEEE standard is easier to implement than the round to infinity modes, while round to zero is the easiest. The complexity of implementing a rounding mode is defined here as the minimum number of results an adder has to pre-compute.
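The four IEEE rounding decisions can be sketched as follows. This is a toy model of my own for illustration, operating on a sign-magnitude mantissa with a guard bit and a sticky bit; it is not the pre-computation scheme developed in Chapter 6. It does show why round to zero is easiest (pure truncation), why round to nearest needs the guard/sticky/lsb test, and why the directed modes additionally consult the sign:

```python
def round_mag(mant, guard, sticky, sign, mode):
    """Return the rounded magnitude, given the truncated magnitude 'mant',
    the first discarded bit (guard), the OR of all later discarded bits
    (sticky), and the sign (+1 or -1) of the full result."""
    lost = guard | sticky                       # any discarded information?
    if mode == "to_zero":
        inc = 0                                 # truncation: never increment
    elif mode == "to_nearest_even":
        inc = guard & (sticky | (mant & 1))     # break ties toward even lsb
    elif mode == "to_plus_inf":
        inc = lost if sign > 0 else 0           # round away only if positive
    elif mode == "to_minus_inf":
        inc = lost if sign < 0 else 0           # round away only if negative
    else:
        raise ValueError(mode)
    return mant + inc

# Positive 0b101.01 after dropping two bits: guard=0, sticky=1.
print([round_mag(0b101, 0, 1, +1, m) for m in
       ("to_zero", "to_nearest_even", "to_plus_inf", "to_minus_inf")])
# -> [5, 5, 6, 5]
```

In hardware terms, every mode selects between the truncated result and truncated+1, which is why the thesis measures rounding complexity by the number of results that must be pre-computed.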

1.4 Organization

The remainder of this dissertation is organized as follows. Chapter 2 describes the integer addition algorithm. Chapter 3 examines the theory of leading one prediction (LOP), contrasts two implementations, and extends the technique to condition code prediction and sticky bit computation. Chapter 4 describes a fast FP addition algorithm that is based on the improved integer adder and the LOP presented earlier; it also describes a test chip used to verify the algorithm. Chapter 5 details the modifications needed to improve the latency of a simple divider for staging. Chapter 6 presents the rounding method, using multiplication as an example. Finally, Chapter 7 summarizes and proposes future work.

Chapter 2

High-Speed Integer Addition

2.1 Introduction

Integer adders are an important basic element in an FPU. FP operations typically require many subcomputations, which use an adder to perform their functions. In rounding, for example, an adder increments, and in format conversion, an adder subtracts to convert a negative result into sign-magnitude form. On the exponent side, an integer adder computes 2 adjacent results to obtain the absolute difference that drives the alignment shifter. The exponent adjuster, yet another use of an adder, computes multiple exponent values to account for later rounding and normalization. For these reasons, a revisit of the integer addition algorithm is very much in order.
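One way to read the exponent-difference computation above (an illustrative model of mine, not the thesis circuit): evaluate both ea - eb and eb - ea in parallel as two's-complement additions, then use a sign bit to select the non-negative result, rather than comparing first:

```python
def exp_abs_diff(ea, eb, width=12):
    """|ea - eb| computed the way an exponent datapath would: two
    subtractions evaluated in parallel, then a select on the sign bit."""
    mask = (1 << width) - 1
    d_ab = (ea + (~eb & mask) + 1) & mask    # ea - eb, two's complement
    d_ba = (eb + (~ea & mask) + 1) & mask    # eb - ea, two's complement
    negative = (d_ab >> (width - 1)) & 1     # sign bit of ea - eb
    return d_ba if negative else d_ab

print(exp_abs_diff(1027, 1000), exp_abs_diff(1000, 1027))   # -> 27 27
```

The selection mux adds one gate delay but removes the serial compare-then-subtract dependency, which is the point of computing both results.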

2.2 Background

A ripple adder is efficient, but it has a latency that grows as O(n) because of the need to ripple the carry across its full width n. A carry-skip adder groups its operands to break this ripple chain. Within each group, addition is performed in the same manner as in a ripple adder, but the condition to skip a carry from a lower-order to a higher-order group, or to generate one, is detected in parallel. In this way, only the carry chain need be the critical path and, if enough hardware is devoted to it, carry-skip adders can be made fast. The group sizes clearly play a major role in determining performance. Majerski [5] studied this problem and suggested a


multi-level carry skip. Oklobdzija [6] recast the problem as a "tile shading" method. Guyot [7], Turrini [8], and Chen and Schlag [9] considered the case where the ratio of a carry skip to a carry propagate is non-integer. For short additions such as an exponent add, carry-skip adders can be quite efficient. But for long additions such as a mantissa add, they quickly lose appeal because of their O(√n) order of growth¹. Furthermore, carry-skip adders require a separate carry chain for each result to be computed (e.g., when computing sum and sum+1).

CLAs have an O(log n) order of growth and are therefore attractive candidates to improve upon. Mead and Conway [10] showed that for VLSI, a Manchester carry-chain adder can be both fast and efficient. Brent and Kung [11] repopularized a variant of the CLA that is based on a gate-level implementation and a binary carry tree. Another variant is the conditional-sum adder. In such an adder, results in each group are pre-computed assuming both values of the carry in, and the true sum is later selected based on its actual value. MacSorley [12] reviewed the field of addition and concluded that the conditional-sum adder is the fastest adder.

All the CLAs mentioned above use more or less the same carry-lookahead equation. Ling [13, 14] showed that for certain logic families such as emitter coupled logic (ECL), the carry-lookahead equation can be reformulated to take advantage of their dotting capability, thus simplifying the carry-lookahead chain and improving its speed. Bewick et al. [15] recently improved on the regularity of this Ling-type adder. Based on symmetry, Doran [16] showed that there are 32 valid ways to propagate a carry, of which only four permit Ling reformulation. Ling's approach can result in a drastic load reduction in the input stage circuitry, thereby allowing direct generation of the group generate from the input operands. Because this approach requires a dotting

¹The O(√n) order of growth applies to single-level carry-skip adders; for m-level skip adders, the order of growth is O(n^(1/m)). For single-level carry-skip adders, the order of growth can be reasoned as follows. The group size increases and then decreases in a carry-skip adder. This can be mentally pictured as a triangle whose width, height, and area correspond to the number of groups, the maximum group size, and the number of bits in the adder. The height, which determines the critical path, grows as O(√n) with the area because the shape of the triangle is roughly preserved. Applying the same argument repeatedly to each level of skip, one obtains the O(n^(1/m)) order of growth for m-level carry-skip adders.
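As a behavioral illustration of the carry-skip idea (my own sketch, not a circuit from the text): ripple within each fixed-size group, with a precomputed all-propagate signal that lets the group carry-in bypass the group. Functionally the bypass changes nothing, since when every bit propagates the ripple carry-out already equals the carry-in; in hardware, however, it is a mux that shortens the worst-case path:

```python
def carry_skip_add(a, b, n=32, group=4):
    """Add two n-bit integers with a toy carry-skip organization.
    Returns (sum mod 2**n, carry_out)."""
    result, carry = 0, 0
    for lo in range(0, n, group):
        bits = range(lo, min(lo + group, n))
        # Skip condition, computable in parallel with the ripple:
        # every bit of the group propagates (a_i XOR b_i == 1).
        skip = all(((a >> i) ^ (b >> i)) & 1 for i in bits)
        cin = carry
        for i in bits:                          # ripple inside the group
            ai, bi = (a >> i) & 1, (b >> i) & 1
            result |= (ai ^ bi ^ carry) << i
            carry = (ai & bi) | (carry & (ai | bi))
        if skip:
            carry = cin   # fast path: group cannot generate, carry-in skips
    return result, carry

a, b = 0x0F0F0F0F, 0xF0F0F0F1                   # forces a long skip chain
assert carry_skip_add(a, b) == ((a + b) & 0xFFFFFFFF, (a + b) >> 32)
```

Note that computing both sum and sum+1, as FP rounding requires, would need this entire carry machinery twice, which is the drawback the text points out.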


capability, it is not as suitable for CMOS adders. A straightforward application of Ling's scheme to CMOS adders increases hardware or delay time, or both. This chapter presents an implementation of a fully static CMOS adder using a modified Ling scheme. The implementation saves up to 1 gate delay and always reduces the number of serial transistors in the critical path relative to conventional CLAs. Furthermore, this is accomplished at a negligible hardware cost. In CMOS, because the number of serial transistors from the output to the power or ground node is a major speed-limiting factor, reducing it is of interest.

Section 3 shows how Ling's approach can be modified for a CMOS technology by way of a 32-bit adder design. The approach to be presented, however, is independent of the number of bits in the adder. Section 4 compares our adder with others reported in the literature. Section 5 discusses the implementation, and Section 6 contains a summary.

For two operands A = a0 a1 ... a(n-1) and B = b0 b1 ... b(n-1), a bit transmit is defined as ti = ai ⊕ bi, a bit propagate as pi = ai ∨ bi, and a bit generate as gi = ai bi. Unless otherwise stated, the most significant bit is numbered bit 0. The bit sum is defined as si = ai ⊕ bi and the final bit sum as Si = si ⊕ c(i-1). The term p(1,n) denotes the product p1 p2 ... pn. The index i ∈ (0, 32) is used for bits in the operands, the index j ∈ (0, 10) for groups of bits, and the index k ∈ (0, 3) for blocks of groups, exclusively and respectively.
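These bit-level definitions can be exercised directly. The sketch below (mine, for illustration) builds gi, pi, and si with the text's MSB-is-bit-0 numbering and checks the resulting carries and sums against ordinary integer addition:

```python
def add_via_gp(A, B):
    """A, B: equal-length bit strings, most significant bit first (bit 0).
    Returns (carry_out, sum_bits) computed from the g/p/s definitions."""
    n = len(A)
    a = [int(x) for x in A]
    b = [int(x) for x in B]
    g = [a[i] & b[i] for i in range(n)]      # bit generate g_i = a_i b_i
    p = [a[i] | b[i] for i in range(n)]      # bit propagate p_i = a_i OR b_i
    s = [a[i] ^ b[i] for i in range(n)]      # bit sum s_i = a_i XOR b_i
    c = [0] * (n + 1)                        # c[i]: carry out of bit i
    for i in reversed(range(n)):             # carries flow from bit n-1 to 0
        c[i] = g[i] | (p[i] & c[i + 1])
    S = [s[i] ^ c[i + 1] for i in range(n)]  # each sum uses incoming carry
    return c[0], "".join(map(str, S))

cout, S = add_via_gp("1011", "0110")         # 11 + 6
assert (cout << 4) + int(S, 2) == 11 + 6
```

With MSB-first numbering, the carry into bit i comes from bit i+1, which is how the code indexes the recurrence.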

2.3 A 32-Bit Modified Ling Adder

The adder is divided into 4 blocks; the highest-order block has 6 bits, and each of the remaining blocks has 9 bits. The carry in is treated as g32, so Block 3 has only 8 bits. Each block is further divided into 3-bit groups. Within each group and block, the local sum logic uses the conditional-sum algorithm [17]. Figure 2.1 shows the structure of the adder. This partition is motivated by an attempt to better balance the loading on the carry tree. The ECL adders reported in [14] and [15] have a 4-bit group; owing to the limited fan-in capability of fully static CMOS circuits, 3-bit groups are used in our adder².

[Figure 2.1: Structure and Number Convention of the 32-Bit Modified Ling Adder. Block 0 holds groups G0-G1, Block 1 holds G2-G4, Block 2 holds G5-G7, and Block 3 holds G8-G10; the block generates and propagates Gb0-Gb3 and Pb0-Pb3 feed a global carry look-ahead producing Cout, C0, C1, and C2.]

²Though a fan-in of 4 is often used in fully static CMOS, we have chosen a fan-in of 3 to avoid having 4 P-channel devices in series. The approach is independent of the group size, however.

2.3.1 Global-Carry

For ease of discussion, we illustrate our approach on Group 2 in Block 1 of the adder (see Figure 2.1). The conventional group-generate equation for the group [17] is

    G2 = g6 ∨ g7 p6 ∨ g8 p7 p6                                        (2.1)

Using the identity gi = pi gi and extracting p6, we rewrite Eqn (2.1) as G2 = p6 G2*, where G2* = g6 ∨ g7 ∨ g8 p7. The essence of Ling's approach is to propagate only the G2* term, because G2* can be expanded as

    G2* = a6 b6 ∨ a7 b7 ∨ a8 b8 a7 ∨ a8 b8 b7                         (2.2)

Eqn (2.2) contains 4 terms and a total of 10 literals, with the largest term having 3 literals. This can be implemented in CMOS in 1 complex gate [18]. Eqn (2.1), on the other hand, when expanded, contains 7 terms and a total of 24 literals, with the largest term containing 4 literals. In CMOS, the number of literals in a minterm corresponds to the number of N-channel serial transistors; Eqn (2.2) is therefore preferable to Eqn (2.1) for direct generation of the group generate from the input operands. This allows a saving of 1 gate delay because the gi and pi terms are not implemented.

A straightforward implementation of Eqn (2.2) in fully static CMOS has 4 P-channel transistors in series, severely limiting its usefulness. But it can be simplified, as the relationship gi pi = pi can again be put to use. Figure 2.2 shows an implementation of Eqn (2.2)³ in which only 3 P-channel transistors are in series.

The equation for the group propagate in a conventional CLA is P2 = p(6,8). Ling uses a modified group propagate, P2* = p(7,9). Gj* and Pj* are, respectively, the reduced and left-shifted versions of Gj and Pj. To see why the group propagate is defined this way, consider the block-generate equation for Block 1:

    Gb1 = G2 ∨ G3 P2 ∨ G4 P3 P2                                       (2.3)

Using the definitions of Pj* and Gj*, we can rewrite Eqn (2.3) as

    Gb1 = p6 (G2* ∨ G3* P2* ∨ G4* P3* P2*) = p6 Gb1*                  (2.4)

Hence, the use of Pj* affords a more efficient implementation of the block generate. Eqn (2.4) is easier to implement than Eqn (2.3) because the Gj* are available, but the Gj are not. Also, the p6 term in Eqn (2.4) propagates further down into the final-carry equation:

    C0 = Gb1 ∨ Gb2 Pb1 ∨ Gb3 Pb2 Pb1

which can be rewritten as

    C0 = p6 C0*                                                       (2.5)

where

    C0* = Gb1* ∨ Gb2* Pb1* ∨ Gb3* Pb2* Pb1*                           (2.6)

³The implementation is obtained by complementing Eqn (2.2) and applying the identity gi pi = pi. Complementation is needed because the P-channel network evaluates the complement of Eqn (2.2).
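The factorings in Eqns (2.1)-(2.4) are easy to confirm by enumeration. The following check is my own sketch (not from the thesis), with the starred symbols spelled out as the reduced and left-shifted group terms; it verifies G2 = p6 G2* exhaustively and Gb1 = p6 Gb1* on random operand bits for Block 1 (bits 6-14):

```python
from itertools import product
import random

def group_G(g, p, m):
    """Conventional 3-bit group generate, MSB of the group at index m
    (the form of Eqn 2.1)."""
    return g[m] | g[m + 1] & p[m] | g[m + 2] & p[m + 1] & p[m]

def group_Gs(g, p, m):
    """Reduced (Ling) group generate G*: the group generate with the
    leading p_m factored out (the form of Eqn 2.2)."""
    return g[m] | g[m + 1] | g[m + 2] & p[m + 1]

def group_P(p, m):          # conventional group propagate, e.g. P2 = p(6,8)
    return p[m] & p[m + 1] & p[m + 2]

def group_Ps(p, m):         # left-shifted group propagate, e.g. P2* = p(7,9)
    return p[m + 1] & p[m + 2] & p[m + 3]

def ling_identities_hold(trials=2000):
    # Eqns (2.1)-(2.2): G2 == p6 * G2*, exhaustive over bits 6-8.
    for a6, a7, a8, b6, b7, b8 in product((0, 1), repeat=6):
        g = {6: a6 & b6, 7: a7 & b7, 8: a8 & b8}
        p = {6: a6 | b6, 7: a7 | b7, 8: a8 | b8}
        if group_G(g, p, 6) != p[6] & group_Gs(g, p, 6):
            return False
    # Eqns (2.3)-(2.4): Gb1 == p6 * Gb1*, random trials over bits 6-14.
    for _ in range(trials):
        a = {i: random.randint(0, 1) for i in range(6, 15)}
        b = {i: random.randint(0, 1) for i in range(6, 15)}
        g = {i: a[i] & b[i] for i in a}
        p = {i: a[i] | b[i] for i in a}
        Gb1 = (group_G(g, p, 6)
               | group_G(g, p, 9) & group_P(p, 6)
               | group_G(g, p, 12) & group_P(p, 9) & group_P(p, 6))
        Gb1s = (group_Gs(g, p, 6)
                | group_Gs(g, p, 9) & group_Ps(p, 6)
                | group_Gs(g, p, 12) & group_Ps(p, 9) & group_Ps(p, 6))
        if Gb1 != p[6] & Gb1s:
            return False
    return True

assert ling_identities_hold()
```

The key step the check exercises is the absorption g_i = p_i g_i, which lets the single leading propagate p6 be factored out of every term of the block generate.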

[Figure 2.2: Implementation of the Group Generate]

The final-carry equations C2, C1, and Cout can be derived similarly as

    C2 = p24 Gb3*                                                     (2.7)

and

    C1 = p15 C1*                                                      (2.8)

where

    C1* = Gb2* ∨ Gb3* Pb2*                                            (2.9)

and

    Cout = p0 Cout*                                                   (2.10)

where

    Cout* = Gb0* ∨ Gb1* Pb0* ∨ Gb2* Pb1* Pb0* ∨ Gb3* Pb2* Pb1* Pb0*   (2.11)

The definitions of Gbk* and Pbk* in Eqns (2.6), (2.7), (2.9), and (2.11) follow those of the conventional Gbk and Pbk with a simple modification: Pj and Gj are replaced by Pj* and Gj*, respectively. For example,

    Gb2  = G5 ∨ G6 P5 ∨ G7 P6 P5
    Gb2* = G5* ∨ G6* P5* ∨ G7* P6* P5*                                (2.12)

and

    Pb2  = P5 P6 P7
    Pb2* = P5* P6* P7*

Ling's scheme calls for either implementing the Ck from Pbk*, Gbk*, and Ck* (Eqns (2.7)-(2.11)) or modifying the local sum logic to account for the fact that the Ck* are propagated [14]. Neither option is attractive. The former fails to reduce the number of serial transistors in the critical path, and the latter complicates the local sum logic, increasing hardware and delay time. In our implementation, only the Ck*, Pbk*, Gbk*, Pj*, and Gj* terms are implemented in the carry look-ahead circuitry, without modifying the local sum logic. The leading p terms in Eqns (2.5), (2.7), (2.8), and (2.10) are instead implemented in the local group-carry equations, which lie on non-critical paths. The following section shows that this is indeed possible and in fact desirable because of the reuse of the Pj* and Gj* terms.

2.3.2 Local Group-Carry We can prove for the general case that the leading p terms can be implemented in the local group-carry equations for all blocks in the adder. For ease of discussion, however, we show that this is possible for Block 1. That is, we show that only C1 needs to be propagated to Block 1 and p13 to Group 4. Note that together, these signals play the role of C1, the conventional carry. The nal-sum equation for bit 8 in Group 2 (see Figures 2.1 and 2.3) is

S8 = s8  (G3 _ P3 G4 _ P3P4C1) = s8  fp9[G3 _ P3(G4 _ P4C1)]g De ning and

(2:13)

gb3  p9(G3 _ P3G4)

(2:14)

pb3  p9[G3 _ P3(G4 _ P4)]

(2:15)

CHAPTER 2. HIGH-SPEED INTEGER ADDITION

13

and expanding Eqn (2.13) in terms of C1 using Shannon's theorem [19], we get

S8 = C 1(s8  gb3) _ C1(s8  pb3) The equations for S7 and S6 can be derived similarly. The S6 one is given below:

S6 = C 1fgb3[s6  (g7 _ p7g8)] _ gb3[s6  (g7 _ p7p8)]g _ C1fpb3[s6  (g7 _ p7g8)] _ pb3[s6  (g7 _ p7p8)]g

(2.16)

Hence, only C1 in Eqn (2.8) needs to be propagated globally to Block 1. The p13 term can be accounted for locally in Group 4 in G4 and P4 in Eqns (2.14) and (2.15). Both equations can be implemented in 2 complex gate delays since Pj and Gj require only 1 gate delay. In our implementation, all complex gates have at most 3 serial transistors and have roughly the same complexity as that shown in Figure 2.2. One complex gate delay is equal to two 2-input NAND gate delays at a load of 0.5pF from SPICE simulation [20]. The G4 _ P4 term in Eqn (2.15) is actually available from Group 4 within the same block and can be reused at a cost of 1 complex gate delay, increasing the number of gate delays of pb3 from 2 to 3. Figure 2.3 shows an implementation of Group 2. In Figure 2.3, we have used s7 as the local propagate (globally, we used pi ), allowing a more ecient implementation of (g7 _ p7g8) and (g7 _ p7p8 ) in Eqn (2.16). Because our approach does not modify the local sum logic, there is no increase in hardware in Figure 2.3 when compared with an implementation that uses the conventional conditional-sum algorithm4. In terms of number of gate delays, S0 is no worse than S6. The equation for S0 is similar to Eqn (2.16):

$$S_0 = \overline{C}_0\{\overline{gb}_1[\overline{g}_2(s_0 \oplus g_1) \lor g_2(s_0 \oplus p_1)] \lor gb_1[\overline{p}_2(s_0 \oplus g_1) \lor p_2(s_0 \oplus p_1)]\} \lor C_0\{\overline{pb}_1[\overline{g}_2(s_0 \oplus g_1) \lor g_2(s_0 \oplus p_1)] \lor pb_1[\overline{p}_2(s_0 \oplus g_1) \lor p_2(s_0 \oplus p_1)]\} \quad (2.17)$$

where $gb_1 = p_2 G_1$ and $pb_1 = p_2(G_1 \lor P_1)$. From the previous discussion, $P_j$ and $G_j$ are available in 1 complex gate delay, $Pb_k$ and $Gb_k$ in 2, and $C_k$ in 3. (Since we have used conditional sum in the local sum logic, it is fair to compare our adder with a true conditional-sum adder, knowing that a conditional-sum adder consumes more hardware than a CLA does [17].)

[Figure: transistor-level schematic of Group 2, with inputs a8 b8, a7 b7, a6 b6, the gb3 and pb3 signals driving a 2-1 mux selected by C1, and outputs S8_b, S7_b, S6_b.]

Figure 2.3: Implementation of Group 2

The final sum selection multiplexor counts as 1 gate delay. Hence, our adder has a total of 4 complex gate delays.
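The selection structure used in Eqns (2.16) and (2.17), precomputing both conditional results and letting the late-arriving carry pick one through a multiplexor, can be illustrated in software. The following Python sketch is our own illustration of the generic conditional-sum/carry-select principle, not the Ling-based circuit of this chapter: each 4-bit group precomputes its sum for carry-in 0 and carry-in 1, and the incoming carry selects between them (a real design would compute the group carries with a lookahead tree rather than rippling between groups).

```python
def group_sums(ga, gb):
    """Precompute a 4-bit group's (sum, carry-out) for carry-in 0 and 1."""
    t = ga + gb
    return ((t & 0xF, t >> 4), ((t + 1) & 0xF, (t + 1) >> 4))

def cond_sum_add(a, b, n=16):
    """Add two n-bit values by selecting precomputed group results."""
    res, carry = 0, 0
    for i in range(0, n, 4):
        pair = group_sums((a >> i) & 0xF, (b >> i) & 0xF)
        s, carry = pair[carry]      # the carry acts as the mux select
        res |= s << i
    return res, carry

for a in (0, 1, 0x1234, 0xFFFF, 0xABCD):
    for b in (0, 1, 0x0FF0, 0xFFFF):
        s, c = cond_sum_add(a, b)
        assert (c << 16) | s == a + b
```

The point of the sketch is only the mux selection: both conditional results exist before the carry arrives, so the carry contributes a single multiplexor delay, as in the final-sum selection above.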

2.4 Comparison with Other Adders

Table 2.1 compares our adder with CLA, conditional-sum, carry-select, and Multiple-Output Domino Logic (MODL) adders [21] in terms of complex gate delays and the number of serial transistors in the critical path. Our adder has fewer complex gate delays than the other adders. These adders were chosen because their delay times have the same order of growth, $O(\log n)$. In the comparison, we have assumed that


the CLA adder is implemented in a complex-gate-oriented medium (i.e., MOS LSI or VLSI) and that the carry-select adder uses 4-bit groups and conventional carry lookahead to propagate the global carry. To be fair, we have further assumed that the conditional-sum adder has a similar organization to our adder but without the modified Ling scheme, and that the MODL adder uses conditional-sum logic locally.

Table 2.1: Comparison of Gate Delays and Serial Transistors in CMOS Adders

Adder             Gate Delays (cin to S0)    Serial Transistors (cin to S0)
CLA               6                          21
Conditional-Sum   5                          16
Carry-Select      5                          18
MODL              5                          18
Proposed          4                          14

Because the CLA, conditional-sum, and carry-select adders do not generate the group generate directly, they have one more gate delay than our adder. The CLA requires yet another gate delay to generate the local sum, increasing its gate-delay count from 5 to 6. The MODL adder, though it generates the group generate directly, does so using a small 2-bit group [21] and therefore requires more levels in the global carry generation than ours.

Comparing complex gate delays in CMOS adders can be misleading because delays depend on both fan-in and fan-out. A better measure is the number of serial transistors a signal must traverse in the critical path. For fully static CMOS circuits, we evaluate both the P-channel and N-channel transistors in the critical path; for dynamic CMOS circuits [18], which include domino circuits, we evaluate only the N-channel transistors. Hence, this comparison scheme is slightly biased against fully static CMOS circuits.

In counting the number of transistors in series, the discharge N-channel transistors in domino logic are not included. Inverters are counted as 1 transistor and XOR gates


as 2. (If both the inputs and their complements are available, the XOR gate delay is 1.) The serial-transistor count for NAND, NOR, and complex gates is the number of transistors in the longest N-channel or P-channel chain, for static and dynamic CMOS circuits. Pass gates [18] complicate the situation slightly. The input-output (source-drain) path of a pass gate is counted as 0.5 transistors (when there are several pass transistors in series, capacitive loading has to be taken into account, as suggested in [22]; this situation never arose in our study) and the control-output (gate-drain) path as 1 transistor. By the same token, 2-1 multiplexors are counted as 2 transistors on the select-output path but as 0.5 on the input-output path. A similar comparison scheme has been suggested by Oklobdzija and Barnes [22], but the accounting details were not given.

From Table 2.1, our adder has fewer serial transistors in the critical path than the others. To count the serial transistors in the critical path, Eqns (2.6), (2.12), and (2.17) can be used. The input signal must traverse $G_j$ in 4 transistors (Figure 2.2), $Gb_k$ in another 4 transistors (Eqn (2.12) plus an inverting buffer), and $C_0$ in yet another 4 transistors (Eqn (2.6) plus an inverting buffer), giving a total of 12 transistors. All other terms in Eqn (2.17) arrive sooner than $C_0$. The final sum selection multiplexors contribute 2 more transistors. Hence, the critical path from cin to S0 in the adder has 14 transistors, as indicated in Table 2.1. As with other adders, the inverting output buffers in Figure 2.3 are not counted because they are included only for drive considerations.

The path from cin to Cout has the same number of serial transistors as the path from cin to S0. This can be seen by rewriting Eqn (2.11) as

$$C^*_{out} = p_0[Gb_0 \lor Pb_0(Gb_1 \lor Gb_2 Pb_1 \lor Gb_3 Pb_2 Pb_1)]$$

This cin-to-Cout path, however, is not the critical path, because it has much smaller capacitive loading than the path from cin to S0.


2.5 Implementation

A 53-bit adder using the above algorithm has been implemented in silicon in the HP CMOS26 1-µm process with three layers of metal. The third metal layer is used exclusively for power and ground in our implementation. The adder has an area of 2650 × 400 µm² and a simulated typical delay of 5.5 ns (at 5 volts and room temperature).

2.6 Summary

In summary, we have presented a fully static CMOS implementation of a Ling-type adder. The implementation has fewer complex gate delays and fewer serial transistors in the critical path than other conventional parallel adders. Compared with a conventional conditional-sum adder, the hardware increase in our implementation is negligible. The key idea that allows Ling's scheme to be used in the proposed CMOS adder is that the leading p term in Ling's equation can be propagated locally, reducing the number of serial transistors in the critical global carry propagation path.

Chapter 3

Leading One Prediction

3.1 Introduction

In FP addition, the result of a subtraction may require a massive normalization shift [23]. This can happen when the two operands are close in value, causing the higher-order bits to cancel. The simple way to normalize is to compute the result and then detect the location of its leading one (the one following an initial string of 0's in a positive result, as contrasted with the string of 1's preceding the leading zero in a negative result). Leading one detection (LOD) is slow because the result must first be computed. Leading one/zero prediction (LOP) avoids this problem by predicting the location directly from the input operands. The prediction is not perfect because the carry into each bit position is ignored; when it is off, the result requires a further 1-bit fine adjustment shift once the carry into that bit position becomes available. Figure 3.1 illustrates the difference between LOD and LOP. LOP works on both positive and negative results; for negative results, LOP predicts the leading zero.

Early uses of LOP and its variants can be traced back to the work of Ware et al. [24] and Hays et al. [25]. More recent uses include the DEC 50-MHz CMOS 64-bit FPU [26], the Intel 860 processor [27], the Cydrom-5 processor [28] (the Cydrom-5 and Intel 860 FP adders performed LOP before, as opposed to during, the mantissa addition step; the underlying principle remains the same), the IBM RS6000 [29], the Weitek WTL3170/3171 Sparc FP coprocessors [30], and the SuperSparc FPU



[Figure: block diagrams. In (a), an adder feeds a leading one detector (LOD), which drives a left shifter. In (b), the LOP operates in parallel with the adder and drives the left shifter directly.]

Figure 3.1: (a) Leading One Detection and (b) Leading One Prediction.

[31, 32]. While past work abounds, the descriptions given in these papers, with the exception of Hokenek and Montoye [29], are at best sketchy. This chapter explores LOP more fully, explaining it in Section 2 in the framework of bit pattern detection. Section 3 contrasts two implementations and discusses implementation issues. Section 4 generalizes LOP and applies it to sticky bit computation and condition code prediction. Section 5 contains a summary.

In this chapter, assuming $A = a_0 a_1 \ldots a_{n-1}$ and $B = b_0 b_1 \ldots b_{n-1}$, a transmit literal is defined as $t_i = a_i \oplus b_i$, a generate literal as $g_i = a_i b_i$, a zero literal as $z_i = \overline{a}_i \overline{b}_i$, and a propagate literal as $p_i = a_i \lor b_i$. When the context is clear, the subscript $i$ is dropped to simplify notation. $g^m$ denotes a string of $m$ $g$ literals; $g^*$ denotes zero or more $g$ literals. A $t^* g z^*$ string, for example, begins with zero or more t literals, followed by one g literal, and ends with zero or more z literals. Unless otherwise stated, a string always starts from the most significant bit (msb) of the operands A and B. More examples follow. Assuming $A = 11000$ and $B = 10001$, a string is formed by examining the bits of the operands. In this case, the first literal in the string is $g_0$ since both $a_0$ and $b_0$ are true. Similarly, the second literal is $t_1$ since $a_1$


is true but $b_1$ is false. Completing the evaluation, we obtain the string $g_0 t_1 z_2 z_3 t_4$, or simply gtzzt, formed by A and B. For $A = 1111110000$ and $B = 0000010000$, the string is tttttgzzzz, or $t^5 g z^4$.
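The string-forming rule is mechanical, and a small Python sketch (our own illustration; the function name is ours) reproduces the two examples above:

```python
def literal_string(a, b, n):
    """Form the g/t/z literal string of n-bit operands A and B, msb first."""
    out = []
    for i in range(n - 1, -1, -1):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        out.append('g' if ai & bi else ('t' if ai != bi else 'z'))
    return ''.join(out)

assert literal_string(0b11000, 0b10001, 5) == 'gtzzt'
assert literal_string(0b1111110000, 0b0000010000, 10) == 'tttttgzzzz'
```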

3.2 Theory

Given the operands A and B in 2's complement form, how do we predict the location of the leading one or zero of the result? The key to this problem is to realize that the position is independent of whether we are computing A + B or B + A. At each bit position, only the symmetric literals z, t, and g need to be considered; other symmetric literals can be expressed in terms of these (e.g., $p_i = t_i \lor g_i$). Exhausting all possible permutations on a small bit string shows that only a small number of patterns needs to be examined for leading one or zero prediction. Specifically, for a positive result, only the following two patterns produce leading 0's:

$$z^* \quad \text{and} \quad t^* g z^*$$

Similarly, for a negative result, only the following two patterns need to be detected for leading 1's:

$$g^* \quad \text{and} \quad t^* z g^*$$

An example of the first pattern is $A = 00001100$ and $B = 00010100$, which produce the string zzzttgzz; the result $A + B = 00100000$ has two leading 0's. An example of the second pattern is $A = 11111001$ and $B = 00001001$, giving the string ttttgzzg; the result $A + B = 00000010$ has six leading 0's. (In practice, the operand B must first be bit-inverted before being fed into the adder in a subtraction operation; in this case, the string is formed after the bit inversion.)


Some FPUs [27] use a magnitude comparator to ensure a positive result. For these units, only the positive patterns, $z^*$ and $t^* g z^*$, need to be detected, with an accompanying reduction in hardware. The following discussion assumes the detection of all four patterns mentioned above. At each bit position $i$, LOP detects the patterns and generates a shift signal $Q_i = 1$ if any of the patterns above is found. The logic equation for $Q_i$ is therefore:

$$Q_i = z^i \lor t^j g z^k \lor g^i \lor t^j z g^k$$

where $j$ and $k$ are integers such that $j + k = i - 1$. The equation is obtained by reasoning as follows. $Q_i$ is true if one (and only one, because the patterns are mutually exclusive) of the patterns in the equation is found. For example, assume two operands that produce the string zzzzgzt and consider position 3. The first pattern produces a one because the prefix is a z string (zzzz so far). The remaining patterns produce zero because the first literal in the string is a z; so $Q_3 = 1$. At bit position 4, all patterns produce zero because none of them matches the string zzzzg; hence $Q_4 = 0$, indicating the position of the leading one. The subsequent $Q_i$'s all equal zero for the same reason.

In general, for n-bit operands A and B, the LOP produces an n-bit vector Q containing a possible string of 1's followed by a string of 0's; the transition indicates the location of the leading one or zero (which could be at bit position 0). Alternatively, Q may be represented as a 1-of-n code, with the single one or zero indicating the location of the leading one or zero. These two representations are interchangeable, so the choice is a minor issue. In both cases, Q may be off by one bit because of the possible carry in. In the context of an FP adder, depending on the implementation of the exponent logic, the total shift amount may be represented as $Q + c_{fine}$ or as $Q - c_{fine}$, where $c_{fine}$ equals one if a fine adjustment is needed. Note that in the first case we always under-predict, and in the second case always over-predict; this is achieved by appropriately wiring the input operands. Developing an equation for $c_{fine}$ involves a case-by-case analysis of the patterns. We first describe an implementation.
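The prediction property claimed above, that the four patterns locate the leading one or zero to within one bit, can be checked by simulation. The Python sketch below is our own illustration: it encodes each pattern as a regular expression, takes the longest matching prefix of the literal string as the predicted position, and compares it against the true leading one of a positive result (or the true leading zero of a negative one).

```python
import random
import re

N = 16
MASK = (1 << N) - 1
PATTERNS = [r'z*', r't*gz*', r'g*', r't*zg*']   # the four patterns above

def literal_string(a, b):
    """g/t/z literal string of A and B, msb first."""
    out = []
    for i in range(N - 1, -1, -1):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        out.append('g' if ai & bi else ('t' if ai != bi else 'z'))
    return ''.join(out)

def lop_predict(a, b):
    """Predicted position: the longest prefix matching any of the patterns."""
    s = literal_string(a, b)
    best = 0
    for p in PATTERNS:
        m = re.match(p, s)
        if m:
            best = max(best, m.end())
    return best

def leading_position(x):
    """Index (from the msb) of the leading 1 (positive) or leading 0 (negative)."""
    if x >> (N - 1):         # negative in two's complement
        x = ~x & MASK        # leading zero of x = leading one of ~x
    return N - x.bit_length()

random.seed(1)
for _ in range(2000):
    a, b = random.randrange(1 << N), random.randrange(1 << N)
    s = (a + b) & MASK
    if s in (0, MASK):       # no leading-one/zero transition to find
        continue
    assert abs(lop_predict(a, b) - leading_position(s)) <= 1
```

The assertion is the off-by-one property: ignoring carries costs at most a 1-bit fine adjustment.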

3.3 Implementations

As shown in Figure 3.1, the LOP latency should preferably be equal to or less than the adder's. Obviously, if the adder uses a CLA scheme, LOP must employ a similar structure. In this section, we describe two parallel LOP implementations. In the first, each pattern is detected separately and the outputs are then OR'ed together. In the second, all bit patterns are detected at the same time. Both schemes have an $O(\log n)$ computation time.

In a parallel implementation, the literals in a string are first grouped at the input stage; these groups are in turn blocked at the second stage, and so on. Within each group or block, information is processed independently. In general, the size of a group (and therefore of a block) is implementation dependent: the fan-in of a logic family or technology often dictates the group size, but other factors may come into play depending on the specifics of an implementation. Assuming a group and block size of four (a block size of four is chosen for this particular presentation; LOP can be implemented with any block or group size), the LOP logic at each bit position in the group needs to compute the z, t, and g signals, which may be shared with the adder.

In this method, detecting the pattern $z^*$ requires the logical AND of all the $z_i$'s. Detecting the pattern $t^* g z^*$ requires keeping track of three states: the N (not found) state, indicating that the g literal has not yet been found in the groups (or blocks) examined so far; the J (just found) state, indicating that the g literal has just been found in the group (or block) being examined; and the F (found) state, indicating that the g literal has already been found. The N and F states correspond to the logical AND of all the $t_i$ and $z_i$ literals, respectively, in a group (or block). The J state corresponds to the following condition:

$$J = gzzz \lor tgzz \lor ttgz \lor tttg \quad (3.1)$$


For the subsequent stages, two N states (NN) produce a bigger N state; an N followed by a J state (NJ) produces a J state; and two F states (FF) produce an F state. Any other combination causes $Q_i$ to be false (= 0). The logic equation at the block level is then

$$J_{block} = JFFF \lor NJFF \lor NNJF \lor NNNJ \quad (3.2)$$
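The N/J/F grouping of Eqns (3.1) and (3.2) can be sketched in software. In the following Python illustration (our own construction, using a 16-literal string split into 4-literal groups), each group is classified per Eqn (3.1), the groups are combined per Eqn (3.2), and the root state is J exactly when the whole string has the form $t^* g z^*$:

```python
import random
import re

def classify(group):
    """Classify a 4-literal group: N = all t, F = all z, J = t*gz* (Eqn 3.1)."""
    if all(c == 't' for c in group):
        return 'N'
    if all(c == 'z' for c in group):
        return 'F'
    if re.fullmatch(r't*gz*', group):
        return 'J'
    return 'X'   # pattern broken inside the group

def combine(states):
    """Block level (Eqn 3.2): some N's, at most one J, then F's."""
    s = ''.join(states)
    if 'X' in s:
        return 'X'
    if re.fullmatch(r'N+', s):
        return 'N'
    if re.fullmatch(r'F+', s):
        return 'F'
    if re.fullmatch(r'N*JF*', s):
        return 'J'
    return 'X'

def root_state(s):
    return combine([classify(s[i:i + 4]) for i in range(0, len(s), 4)])

random.seed(3)
strings = ['t' * j + 'g' + 'z' * (15 - j) for j in range(16)]
strings += [''.join(random.choice('ztg') for _ in range(16)) for _ in range(2000)]
for s in strings:
    # the root is J exactly when the whole string is a t*gz* pattern
    assert (root_state(s) == 'J') == bool(re.fullmatch(r't*gz*', s))
```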

Eqn (3.2) is similar in form to Eqn (3.1). Its implementation is much like the CLA tree used in parallel adders, but is more hardware intensive because multiple carry trees are needed. The detection of the negative patterns, $g^*$ and $t^* z g^*$, can be performed in a similar manner.

Having determined the rough location from the $Q_i$'s, we now turn our attention to $c_{fine}$, the 1-bit fine adjustment. To determine $c_{fine}$, it is important to realize that at the bit position where $Q_i = 0$, we only need to know whether we are processing a z string or a g string. A t string will eventually turn into a z or a g string, depending on the ending literal. Hence, only one global variable needs to be maintained to differentiate these cases. Specifically, $c_{fine_i} = c_{i-1}$ for a z string and $c_{fine_i} = \overline{c}_{i-1}$ for a g string, where $c_{i-1}$ is the carry into the $(i-1)$th bit position. Thus the problem of determining $c_{fine}$ reduces to that of determining whether we have a z or a g string.

At any bit position with $Q_i = 0$, how far back in the string do we have to look to know whether we are processing a z or a g string? In general, the answer depends on the pattern to be detected and can be found through trial and error. In this case, the answer is two literals. Fewer literals won't do: in a $t_0 t_1 t_2 z_3 z_4 g_5 g_6$ pattern at bit position 4, the $z_3$ literal alone cannot tell us that we are actually processing a g string, because a tttz string is considered a g string. So, when $Q_i = 0$, we only need to examine the $(i-1)$th and $(i-2)$th bits. Further, we need not examine any strings with a $t_{i-1}$ literal, because they never cause $Q_i$ to be zero. Combining the above findings, we have the following equation:

$$c_{fine_i} = (g_{i-2} g_{i-1} \lor t_{i-2} z_{i-1} \lor z_{i-2} g_{i-1})\,\overline{c}_{i-1} \lor (z_{i-2} z_{i-1} \lor t_{i-2} g_{i-1} \lor g_{i-2} z_{i-1})\,c_{i-1} \quad (3.3)$$

Grouping, simplifying, and replacing $z_{i-1}$ by $\overline{g}_{i-1}$ (because patterns with $t_{i-1}$ never occur), we obtain

$$c_{fine_i} = c_{i-1} \oplus (t_{i-2} \oplus g_{i-1}) \quad (3.4)$$

Table 3.1: Three-Bit Patterns That Need to Be Detected

This Pattern   Produces These 3-Literal Patterns
$z^*$          zzz
$t^* g z^*$    ttt, ttg, tgz, gzz, and zzz
$g^*$          ggg
$t^* z g^*$    ttt, ttz, tzg, zgg, and ggg

Notice that in Eqn (3.3), the terms in the $\overline{c}_{i-1}$ parenthesis all produce a sum of 10; therefore, when $c_{i-1} = 0$, a fine adjustment has to be made, because $Q_i = 0$ while the leading one should have been at the $(i-1)$th bit position. Similar remarks hold for the terms in the $c_{i-1}$ parenthesis.

In a sense, the above method requires the maintenance of global state: for Q, we need to know which of the states N, J, or F we are in, and for $c_{fine_i}$, whether we are processing a z or a g string. But there is a more distributed way. Given that we have to detect the above patterns, how many literals in a window do we have to examine before we can declare any of the patterns found? For the $z^*$ pattern, for example, we have to examine a window of 3 literals. The reason is as follows: given only a window $z_{i-1} z_i$, we cannot know whether it belongs to a $t_{i-2} z_{i-1} z_i$ or a $z_{i-2} z_{i-1} z_i$ pattern. Similarly, for the $t^* g z^*$ pattern, the smallest window is 3 literals. In general, this window size is a function of the patterns to be detected and is determined on a case-by-case basis. The 3-literal windows, with indices omitted, are listed in Table 3.1 for the four patterns above. From Table 3.1, the equation for $Q_i$ can be written as

$$Q_i = z_{i-2} z_{i-1} z_i \lor t_{i-2} t_{i-1} t_i \lor t_{i-2} t_{i-1} g_i \lor t_{i-2} g_{i-1} z_i \lor g_{i-2} z_{i-1} z_i \lor g_{i-2} g_{i-1} g_i \lor t_{i-2} t_{i-1} z_i \lor t_{i-2} z_{i-1} g_i \lor z_{i-2} g_{i-1} g_i \quad (3.5)$$


Grouping and simplifying, we have

$$Q_i = t_{i-2} t_{i-1} \lor (t_{i-2} \overline{z}_{i-1} \lor \overline{t}_{i-2} z_{i-1}) z_i \lor (t_{i-2} \overline{g}_{i-1} \lor \overline{t}_{i-2} g_{i-1}) g_i$$

and therefore

$$Q_i = t_{i-2} t_{i-1} \lor (t_{i-2} \oplus z_{i-1}) z_i \lor (t_{i-2} \oplus g_{i-1}) g_i \quad (3.6)$$

Further, because a $g_{i-2} t_{i-1}$ or a $z_{i-2} t_{i-1}$ pattern would have caused $Q_{i-1}$ to be zero, $t_{i-2} t_{i-1}$ in Eqn (3.6) can be replaced by $t_{i-1}$, finally obtaining

$$Q_i = t_{i-1} \lor (t_{i-2} \oplus z_{i-1}) z_i \lor (t_{i-2} \oplus g_{i-1}) g_i \quad (3.7)$$

Note that in contrast to the first scheme, where a $Q_i = 0$ indicates a definite location of the leading one (subject, of course, to a 1-bit fine adjustment), a $Q_i = 0$ now indicates only a possible location of the leading one or zero. An additional means must be provided to detect the first such event. Note also that as more and more patterns need to be detected, this distributed scheme becomes increasingly attractive compared with the first one because of the optimization opportunities.

If desired, Eqn (3.7) can be rewritten by again replacing $z_{i-1}$ by $\overline{g}_{i-1}$ as

$$Q_i = t_{i-1} \lor (t_{i-2} \oplus \overline{g}_{i-1}) z_i \lor (t_{i-2} \oplus g_{i-1}) g_i$$

and then by further replacing $g_{i-1}$ by $a_{i-1}$ as

$$Q_i = t_{i-1} \lor (t_{i-2} \oplus \overline{a}_{i-1}) z_i \lor (t_{i-2} \oplus a_{i-1}) g_i$$

The replacements are possible because of the $t_{i-1}$ term in the equations: when $t_{i-1} = 0$, the literal at position $i-1$ is either z or g, so $z_{i-1} = \overline{g}_{i-1}$ and, since $a_{i-1} = b_{i-1}$ there, $g_{i-1} = a_{i-1}$; when $t_{i-1} = 1$, the expression is already true. In this distributed scheme, developing the equation for $c_{fine_i}$ takes a bit more work. How do we know when to adjust? This question can be answered by examining all possible 3-literal patterns, shown in Table 3.2. Again, patterns with a $t_{i-1}$ need not be examined, because they always produce $Q_i = 1$, as shown in Eqn (3.7). In the table, the first column contains the bit patterns of the string at the $(i-2)$th to the $i$th bit positions.


Table 3.2: Logic Equation for $c_{fine_i}$ at the $i$th Bit Position

q(i-2) q(i-1) q(i)   Sum (ci = 1)   Sum (ci = 0)   Adjustment
zzz                  001            000            0
zzt                  010            001            $\overline{c}_i$
zzg                  011            010            0
zgz                  101            100            0
zgt                  110            101            $c_i$
zgg                  111            110            0
tzz                  101            100            0
tzg                  111            110            0
tzt                  110            101            $c_i$
tgz                  001            000            0
tgt                  010            001            $\overline{c}_i$
tgg                  101            100            0
gzz                  001            000            0
gzt                  010            001            $\overline{c}_i$
gzg                  011            010            0
ggz                  101            100            0
ggt                  110            101            $c_i$
ggg                  111            110            0

The location of the leading one is (arbitrarily) assumed to be at the $(i-1)$th bit position, i.e., $Q_{i-1} = 0$ (this is possible by properly wiring the $t_i$, $z_i$, and $g_i$ terms). The second and third columns are the sums of the bit pattern with the carry into the $i$th bit equal to one and zero, respectively. The fourth column indicates the condition on $c_i$ under which an adjustment is needed. In Row 1, we know we never need to adjust, because $z_{i-2} z_{i-1} z_i$ cannot produce $Q_{i-1} = 0$. In the second row, when $c_i = 0$, the most significant non-sign bit is actually at the $i$th bit position, requiring a fine adjustment left shift of 1 bit. In the zgt row, a fine adjustment is needed when $c_i = 1$, because the non-sign bit is then at the $i$th bit position. Entries in the other rows can be interpreted similarly. The equation


for $c_{fine_i}$ can be written from the table as

$$c_{fine_i} = (z_{i-2} g_{i-1} \lor g_{i-2} g_{i-1} \lor t_{i-2} z_{i-1})\,t_i c_i \lor (z_{i-2} z_{i-1} \lor g_{i-2} z_{i-1} \lor t_{i-2} g_{i-1})\,t_i \overline{c}_i$$

which can be rewritten as

$$c_{fine_i} = [(z_{i-2} \lor g_{i-2}) g_{i-1} \lor t_{i-2} z_{i-1}]\,t_i c_i \lor [(z_{i-2} \lor g_{i-2}) z_{i-1} \lor t_{i-2} g_{i-1}]\,t_i \overline{c}_i$$

and as

$$c_{fine_i} = (\overline{t}_{i-2} g_{i-1} \lor t_{i-2} z_{i-1})\,t_i c_i \lor (\overline{t}_{i-2} z_{i-1} \lor t_{i-2} g_{i-1})\,t_i \overline{c}_i$$

Further, because patterns with $t_{i-1}$ never occur, we can substitute $\overline{g}_{i-1}$ for $z_{i-1}$, so that

$$c_{fine_i} = (t_{i-2} \oplus g_{i-1})\,t_i c_i \lor \overline{(t_{i-2} \oplus g_{i-1})}\,t_i \overline{c}_i \quad (3.8)$$

Hence,

$$c_{fine_i} = t_i\,\overline{[\,c_i \oplus (t_{i-2} \oplus g_{i-1})\,]} \quad (3.9)$$

A possible precharged implementation of Eqns (3.7) and (3.9) is given in Figure 3.2. This circuit is obtained from Kershaw et al. [33]. In this particular implementation, the replacement of $g_{i-1}$ by $a_{i-1}$ has been made, and the output of the LOP is represented as a 1-of-n code stored in the $Q_i$ array. The final $c_{fine}$ signal, obtained by a logical NOR of all the $c_{fine_i}$'s, indicates whether a 1-bit fine adjustment is needed. The top portion of the circuit is the Manchester carry chain of the adder and is not considered part of the LOP circuit. Initially, $F_0$ is grounded, discharging the $F_i$'s depending on the intermediate $U_i$'s.

Note that in Eqn (3.8), $t_i c_i$ and $t_i \overline{c}_i$ can be replaced by $c_{i-1}$ and $\overline{c}_{i-1}$, respectively, saving hardware and yielding an equation similar to Eqn (3.4):

$$c_{fine_i} = \overline{c_{i-1} \oplus t_{i-2} \oplus g_{i-1}} \quad (3.10)$$

The difference between Eqn (3.10) and Eqn (3.4) is important to note. The latter assumes that the total shift amount is $Q - c_{fine}$, and the former $Q + c_{fine}$; the $c_{fine}$'s should therefore differ.
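The claim that the distributed window detector flags the same first location as the global four-pattern scheme can be checked by simulation. In this Python sketch (our own illustration; WINDOWS lists the nine 3-literal windows of Eqn (3.5)), the first position at which a window check fails is compared with the first position at which the global prefix match fails:

```python
import random
import re

# the nine 3-literal windows of Eqn (3.5)
WINDOWS = {'zzz', 'ttt', 'ttg', 'tgz', 'gzz', 'ggg', 'ttz', 'tzg', 'zgg'}
GLOBAL = r'z*|t*gz*|g*|t*zg*'   # the four patterns of Section 3.2

def first_zero_window(s):
    """First i >= 2 whose 3-literal window is not a valid continuation."""
    return next((i for i in range(2, len(s)) if s[i - 2:i + 1] not in WINDOWS),
                len(s))

def first_zero_global(s):
    """First i >= 2 where the prefix s[:i+1] matches none of the four patterns."""
    return next((i for i in range(2, len(s))
                 if not re.fullmatch(GLOBAL, s[:i + 1])), len(s))

random.seed(5)
for _ in range(3000):
    s = ''.join(random.choice('ztg') for _ in range(12))
    if re.fullmatch(GLOBAL, s[:2]):   # boundary cases (i < 2) are handled separately
        assert first_zero_window(s) == first_zero_global(s)
```

Later windows may also evaluate to zero, which is why an additional means is needed to detect the first such event; the sketch only confirms that the first zero is the same in both schemes.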


[Figure: precharged CMOS schematic. A Manchester carry chain (inputs $a_i$, $b_i$; signals $g_i$, $k_i$, $t_i$, $c_i$; precharge clock) sits on top, with the LOP logic below producing $U_i$, $F_i$, $F_{i+1}$, $c_{fine}$, $Q_i$, and $S_i$.]

Figure 3.2: A Possible Precharged Implementation of Leading One Prediction

3.3.1 Setting the Boundary Conditions

The LOP logic at the msb and the lsb requires padding of the input operands. This is trivial for sign-magnitude numbers, since zeros can be prepended to the A and B operands before the appropriate operand is inverted. For two's complement numbers, the operands must be sign-extended. Another method is simply to look for the patterns that produce a 01xx...x result, because such a result definitely contains a leading 1 or 0 depending on the carry in. These two patterns give

$$Q_0 = z_0 t_1 \lor g_0 t_1$$

which can be inverted and simplified to $\overline{Q}_0 = t_0 \lor \overline{t}_1$.


3.3.2 Modification for Sign-Magnitude Numbers

The above presentation of LOP assumes numbers in 2's complement form. The scheme has difficulties with negative results, because the IEEE standard dictates a sign-magnitude mantissa, and the conversion requires an addition, possibly changing the position of the leading one or zero. A simple solution is not to fine adjust at all: after the coarse adjustment and conversion, the msb of the result indicates the need for a fine adjustment. This method, however, has lower performance, because the result has to be shifted before the msb can be examined. The following discussion assumes that a fine adjustment is to be performed.

There are (at least) two ways to perform LOP with sign-magnitude mantissas. The first method is to use a 1's complement adder. When the result is negative, it is simply bit-inverted. LOP in this case is straightforward, because bit inversion does not change the position of the leading one or zero, but the LOP unit must be able to detect both the leading zero and the leading one. The second method uses a magnitude comparator and swapping to guarantee a positive result. In this case, LOP is simpler, because only the pattern $t^* g z^*$ will produce a string of leading 0's (when denormalized numbers are allowed in the input, the pattern $z^*$ also needs to be detected).

For FP adders that compute $A + B + (0, 1)$, meaning that the sums $A + B$ and $A + B + 1$ are computed in parallel to avoid the time penalty of the extra rounding step, as in the algorithm to be presented in Chapter 4, the LOP unit needs to know the correct $c_i$ to examine for fine adjustment. Since LOP is needed only when the operation is a subtraction and the exponents of the operands differ by at most one, one can observe the lsb of the shifted mantissa to determine the correct $c_i$ to observe.


3.4 Generalization and Applications

3.4.1 Generalization

Though normally not thought of as such, parallel addition is also a bit pattern detection problem. In parallel addition, to compute the final sum at a bit position, we need to know whether the lower-order bits will produce a carry in. In other words, we need to detect the pattern

$$t^* g$$

or, more precisely, the pattern

$$t^* g x^*$$

where x is a don't-care literal. The difference between CLA and LOP is interesting to note: in CLA, the problem is done once a g is found after a string of t's, whereas in LOP we still have to make sure that the g is followed by a string of z's.

3.4.2 Applications

Sticky Bit Computation

In high-speed multipliers, the partial products are generated in parallel and then reduced by some sort of Wallace tree to two terms, S and C. The lower-order n-1 bits of S and C have to be summed and then examined to determine the sticky bit. The same situation occurs in high-speed dividers, in which the partial remainder is represented in a sum-and-carry form; before rounding, the two must be summed to determine the sticky bit. By recognizing that this is a bit pattern detection problem, we can detect the trailing 0's from S and C directly. Specifically, for multipliers, we need to detect the following patterns starting from the $(n+2)$th bit position of the S and C terms (starting there because of the overflow and the guard bits):

$$z^{n-1}$$


and

$$t^j g z^k$$

where $j \ge 1$ and $k$ are integers such that $j + k = n - 2$. In principle, the same distributed scheme described above can be used for this purpose, but there is an optimization. As explained above, a 3-literal window must be used in LOP, because there is no way to differentiate the patterns $t_{i-2} z_{i-1} z_i$ and $z_{i-2} z_{i-1} z_i$ owing to the valid pattern $t_{i-2} z_{i-1} g_i$. In this particular application, however, that pattern cannot occur; hence, we only need to examine the windows listed in Table 3.3.

Table 3.3: Two-Bit Patterns That Have to Be Examined

This Pattern   Produces These 2-Bit Patterns
$z^*$          zz
$t^* g z^*$    tt, tg, gz, and zz

From the table, the logic equation can be written as

$$Q_i = z_{i-1} z_i \lor t_{i-1} t_i \lor t_{i-1} g_i \lor g_{i-1} z_i$$

which simplifies to

$$Q_i = t_{i-1} \oplus z_i \quad (3.11)$$

Notice that each bit position requires only 3 simple gates to detect the patterns; additionally, a giant AND gate over all the $Q_i$'s is needed. To see that Eqn (3.11) is correct, take its inversion, $\overline{Q}_i = \overline{t_{i-1} \oplus z_i}$. Expanding, we get the patterns $t_{i-1} z_i$, $z_{i-1} t_i$, $z_{i-1} g_i$, $g_{i-1} t_i$, and $g_{i-1} g_i$. Ignoring the carry out, each of these patterns yields a non-zero sum regardless of the carry into the $i$th bit position, and therefore a sticky bit of one. An advantage of signed-digit binary multipliers over conventional binary multipliers lies in their reduced rounding hardware [34, 35, 36]; this application of LOP to sticky bit computation reduces that advantage.
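The trailing-zero condition can be verified directly. The following Python sketch (our own illustration) checks that the low-order field of S + C is zero, i.e., the sticky bit is zero, exactly when the literal string over that field is $z^*$ or $t^* g z^*$:

```python
import random
import re

K = 8   # width of the discarded low-order field

def literal_string(s, c):
    """g/t/z literal string of the low K bits of S and C, msb of the field first."""
    out = []
    for i in range(K - 1, -1, -1):
        si, ci = (s >> i) & 1, (c >> i) & 1
        out.append('g' if si & ci else ('t' if si != ci else 'z'))
    return ''.join(out)

def sticky_is_zero(s, c):
    """The low field of S + C is zero iff the string is z^K or t^j g z^m."""
    return re.fullmatch(r'z*|t*gz*', literal_string(s, c)) is not None

random.seed(7)
for _ in range(5000):
    s, c = random.randrange(1 << K), random.randrange(1 << K)
    assert sticky_is_zero(s, c) == ((s + c) % (1 << K) == 0)
```

Intuitively, in a $t^j g z^k$ string, the single g generates a carry that ripples through the t's (producing 0's) and falls off the top of the field, while the z's contribute nothing, so the field sums to zero.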


Condition Code Prediction

In a processor, condition codes are status flags indicating certain conditions or qualities of a result. One of these flags indicates a zero result. In the past, zero detection was done after the computation and was often a critical path in a pipelined machine. In recent processors that place a heavy emphasis on clock rate, this critical-path problem is often avoided by using a compare-and-branch instruction. For processors that do not have this instruction, LOP can be used to remove this critical path.
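Zero-result prediction is the same pattern problem applied over the full word. Ignoring the carry out, A + B is zero exactly when the literal string is $z^*$ or $t^* g z^*$ (the same condition as the sticky bit above), and A + B + 1 is zero exactly when the string is all t's. A Python sketch (our own illustration):

```python
import random
import re

def lits(a, b, n):
    """g/t/z literal string of A and B, msb first."""
    return ''.join('g' if (a >> i) & (b >> i) & 1
                   else ('t' if ((a ^ b) >> i) & 1 else 'z')
                   for i in range(n - 1, -1, -1))

def zero_flag(a, b, cin, n):
    """Predict (A + B + cin) mod 2**n == 0 from the literals alone."""
    pattern = r'z*|t*gz*' if cin == 0 else r't*'
    return re.fullmatch(pattern, lits(a, b, n)) is not None

random.seed(9)
N = 12
for _ in range(4000):
    a, b = random.randrange(1 << N), random.randrange(1 << N)
    for cin in (0, 1):
        assert zero_flag(a, b, cin, N) == (((a + b + cin) & ((1 << N) - 1)) == 0)
```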

3.5 Summary

In this chapter, we have examined leading one prediction (LOP) in the framework of bit pattern detection and described two possible implementations. The first implementation detects each pattern separately. The second detects all patterns together and consumes less hardware when multiple patterns are to be detected. We have shown how to adapt LOP to sign-magnitude numbers and to FP adders that compute multiple results for rounding and normalization. We then showed that parallel carry lookahead is also a bit pattern detection problem, allowing a meaningful comparison of the complexity of LOP and CLA. Other practical uses of LOP include sticky bit computation, needed for IEEE rounding in high-speed multipliers and dividers, and condition code prediction in high-end processors. The key ideas of LOP are (1) that the carry in an addition can affect the position of the leading one by at most one bit, and (2) that bit patterns generated using simple bit-wise operations contain enough information to predict the location of the leading one or zero.

Chapter 4

High-Speed FP Addition

4.1 Introduction

FP addition (FADD) is a frequent operation in many scientific applications. Yet despite its apparent simplicity, FP addition and multiplication have roughly the same latencies in most high-speed FPUs today. Traditionally, this is because FADD has a low degree of parallelism; many of its subcomputations cannot be executed in parallel. Existing algorithms, for example, require two to three serial mantissa addition steps. This chapter describes an improved FADD algorithm that minimizes these addition steps, thus offering a considerable speed advantage over the earlier ones. Section 2 reviews the existing algorithms and traces their evolution. Section 3 presents the main ideas behind the proposed algorithm; the reader is referred to Quach and Flynn [37, 38] for a detailed derivation of the equations. Section 4 describes the implementation of a test chip. Concluding remarks are given in Section 5.

In this chapter, $A = a_0 a_1 \ldots a_{n-1}$ denotes an n-bit vector or an n-bit operand. $\overline{A}$ denotes the bit inversion of A. The term Q.y denotes the y bit of a computation or quantity Q; for example, (A+B).co denotes the carry out of A+B. A[i...j] denotes the $i$th to the $j$th bits of A, and $A_i$ denotes the $i$th bit of A. Finally, $A + B + (0, 1)$ denotes the simultaneous computation of $A + B$ and $A + B + 1$ ulp, where ulp means unit in the last place. For A as defined above, 1 ulp means $2^{-n}$, assuming a binary point in front of $a_0$.


4.2 A Review of FP Addition Algorithm A simple algorithm for FADD consists of the following steps [17]: 1. Exponent subtraction (ES): Subtract the exponents and denote jEa , Ebj = d. 2. Alignment (Align): Right shift the mantissa of the smaller operand by d bits. Denote the larger exponent Ef . 3. Mantissa addition (SA): add or subtract the mantissa according to the e ective operation, which is the arithmetic operation actually carried out by the FP adder. 4. Conversion (Conv): Convert the result to sign-magnitude representation. The conversion requires an addition step. Denote the result Sf . 5. Leading one detection (LOD): Compute the amount of shifts needed and denote it Elod. For e ective addition, a 1-bit right shift may be needed and for e ective subtraction a full-width1 left shift may be needed. 6. Normalization (Norm): Normalize the mantissa by Elod bits and subtract Elod from Ef . 7. Rounding (Round): Perform IEEE rounding [3] by incrementing Sf by 1 ulp as needed. This step may cause an over ow, forcing another normalization right shift. Ef in this case has to be incremented by 1. The simple algorithm is slow because its composing steps are performed serially. Existing adders improve the algorithm in the following ways: Swapping The Conv step is only needed for a negative result, which can be avoided by swapping the mantissas according to the result of the ES step. This arrangement causes the smaller mantissa to be subtracted from the larger one. In the case of equal exponents, the result may still be negative and requires a conversion. But no rounding is needed in this case. Hence, rounding and conversion are made mutually exclusive by the swapping step, allowing them to be combined. As a 1

¹ Full-width refers to the width of the mantissa, which is much larger than the exponent width.
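As an illustration, the seven steps above can be sketched behaviorally for the effective-addition case. The 8-bit mantissa width, the guard-bit count, and the number format are assumptions made for this sketch only; it is not the implementation described later in the chapter.

```python
MBITS = 8   # mantissa width, including the implied leading 1 (assumption)
G = 3       # guard/round/sticky positions kept during alignment

def fadd(ea, ma, eb, mb):
    """Effective addition of (ea, ma) and (eb, mb); value = m * 2**(e - MBITS + 1)."""
    # Steps 1-2: exponent subtract and alignment of the smaller operand.
    if (ea, ma) < (eb, mb):
        (ea, ma), (eb, mb) = (eb, mb), (ea, ma)
    d = ea - eb
    a = ma << G
    b = (mb << G) >> d if d <= MBITS + G else 0
    if d and ((mb << G) & ((1 << min(d, 64)) - 1)):
        b |= 1                      # sticky: OR of the bits shifted out
    # Step 3: mantissa addition.
    s = a + b
    ef = ea
    # Steps 5-6: effective addition may need a 1-bit right shift.
    if s >> (MBITS + G):
        s = (s >> 1) | (s & 1)      # preserve the sticky bit across the shift
        ef += 1
    # Step 7: round to nearest even on the G low bits.
    m = s >> G
    gb = (s >> (G - 1)) & 1
    rest = s & ((1 << (G - 1)) - 1)
    if gb and (rest or (m & 1)):
        m += 1
        if m >> MBITS:              # rounding overflow: shift and bump exponent
            m >>= 1
            ef += 1
    return ef, m
```

For example, fadd(0, 128, 0, 128) computes 1.0 + 1.0 and returns (1, 128), i.e. 2.0, after the 1-bit right normalization shift of step 5.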


bonus, swapping also reduces the number of alignment shifters to one.

Mutual Exclusion. The simple algorithm can be further optimized by recognizing that the Align and the Norm steps are mutually exclusive. Normalization requiring a massive left shift is needed only when d <= 1 (the cancellation path). Conversely, alignment requiring a massive right shift is needed only when d > 1 (the normal path). Consequently, only one full-width shift, either the alignment or the normalization one, is in the critical path [39].

Leading One Prediction. LOD in the cancellation path can be performed in parallel with the SA step, removing it from the critical path. LOD now becomes LOP.

The steps in the simple and the existing algorithms are summarized in Table 4.1. In the existing algorithm, the Pred step in the cancellation path predicts the need of a 1-bit right shift to align the mantissas. The prediction is based on the least significant bit (lsb) and the next-to-least significant bit (nsb) of the exponents. Note that the existing algorithm executes more steps in parallel, requiring therefore more hardware.

Table 4.1: Steps in Conventional Algorithms.

  Simple  | Existing
          | Cancellation Path | Normal Path
  --------+-------------------+------------
  ES      | Pred + Swap       | ES + Swap
  Align   |                   | Align
  SA      | SA; LOP           | SA
  Conv    | Conv; Round       | Round
  LOD     | Norm              | Norm
  Norm    | select            | select
  Round   |                   |

FADD has long been a subject of active research. Sweeney [40] analyzed the statistical distribution of the operands in an FADD. Waser and Flynn [17] described a variant of the simple algorithm, which has its origin in the IBM 360/91 FPU [41]. Farmwald [39] was apparently the first to suggest the two-path implementation in the existing algorithm. Gosling [42] reviewed tricks commonly used in the area of FP. Most of these tricks have now become standard practice seen in many FPUs today. Sit et al.


[27], Benschneider et al. [26], Birman et al. [30], and Bural et al. [32] adopted the two-path implementation with conformance to the IEEE standard. Vassiliadis et al. [43] reported an FP adder for the S/370, which uses a hexadecimal base.

4.3 The Proposed Algorithm

From the above discussion, we see that the existing algorithm still requires two full-width addition steps in both the normal and the cancellation paths. In the former path, the two addition steps are the mantissa add and the rounding add. In the latter path, the two addition steps are the mantissa subtract and the format conversion or the rounding add. The key ideas behind the proposed algorithm, which requires only one full-width addition step, can be summarized as follows.

- In the normal path, the IEEE round-to-nearest (RN) mode requires only the computation of A + B + (0, 1) to account for all the normalization and rounding possibilities². The SA and the Round steps can therefore be combined to save one addition step. The above observation clearly holds when the result of the SA step needs a left shift or no shift. The only problem is when the result needs a 1-bit right shift, because rounding in this case requires adding 2 ulps, not 1 ulp, to A + B. The solution lies in the definition of the RN mode³. In the case of a 1-bit right shift, it is only necessary to add 2 ulps to A + B when the lsb of A + B is a logical 1, because after the right shift the lsb becomes the guard bit (gb). Hence, adding 1 ulp to A + B causes the carry into the nsb to be true, adding in effect 2 ulps to A + B.

 In principle, the above technique can also be applied to the SA step in the cancellation path. But there are three problems. First, subtraction requires a bit inversion

² Implementation of the directed rounding modes is explained later.
³ The RN mode rounds up a number in all cases except a tie, whence it rounds up when the lsb is odd and truncates when it is even.
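The right-shift case of this claim can be checked exhaustively with a toy model. This is an illustration of the arithmetic identity, not the chip's gate-level logic: the compound adder yields A+B and A+B+1 from one carry network, and selecting the incremented sum before a 1-bit right shift rounds the shifted-out guard bit up exactly when required.

```python
def compound_add(a, b):
    s = a + b
    return s, s + 1   # in hardware, both sums come from one carry network

# When the sum must be right-shifted by 1, its lsb becomes the guard bit.
# Selecting the incremented sum before the shift adds 2 ulps (seen after
# the shift) exactly when that lsb is 1 -- i.e., it performs the round-up.
for a in range(64):
    for b in range(64):
        s0, s1 = compound_add(a, b)
        round_up_after_shift = (s0 >> 1) + (s0 & 1)
        assert s1 >> 1 == round_up_after_shift
```

The loop passes for all operand pairs because (s+1) >> 1 equals (s >> 1) + (s & 1) for any non-negative integer s.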


of the operand to be subtracted and the addition of 1 ulp. This complementation 1 ulp conflicts with the rounding 1 ulp. Second, cancellation may produce a result with preceding 0's or 1's, and LOP can only predict them to within a bit. Consequently, the position of the rounding bits is not known until the end of the alignment step, rendering the above scheme inoperative. Finally, subtraction may produce a negative result, forcing, for format conversion, precisely the same full-width addition that we are trying to avoid. These problems must be solved to speed up the cancellation path. Existing algorithms use an extra step performing both rounding and format conversion. Having presented the possible optimizations and problems in both paths, we now show how this is done.

From the above discussion, it is clear that the challenge is in deriving an equation for Cin selecting among the pre-computed results for both paths. Since in an FP operation, normalization may require a 1-bit right shift, no shift, or a full-width left shift, Cin needs to account for all these normalization possibilities such that the final selected result appears to be normalized and rounded properly. In general, Cin is a function of the rounding mode, the rounding bits, the operands, and the effective operation. The FP adder examines the rounding bits for proper rounding. They include the least significant bit (lsb), the guard bit (gb), the round bit (rb), and the least significant sum bit (sb) of the result.

Because the mantissa has 53 bits and because a right shift of up to 52 bits may be needed in the normal path during alignment, a 105-bit adder is potentially needed. Since we are only concerned with the higher-order 53 bits, we use a 53-bit adder in the interest of hardware efficiency. In the case of complementation, a complementation 1 ulp needs to be added at the 105th bit position.
How far left into the higher-order bits this complementing 1 ulp, Cc, propagates, and whether it reaches the adder, clearly depends on the lower-order bits of the shifted mantissa. When Cc does reach the adder, it is added to the result at the lsb position. The rounding 1 ulp, Cr, on the other hand, is always added at the rb position for this particular implementation (Fig. 4.1).


[Figure omitted: bit-level diagram of the rounding scheme, showing the operand bits feeding the n-bit compound adder with carry-in Cin, the complementing 1 ulp Cc injected below the adder, the rounding 1 ulp Cr injected at the rb position, and a MUX selecting the final result.]
Figure 4.1: Rounding Scheme

When the lower-order bits of the shifted mantissa do allow Cc to reach the real adder, gb, rb, and sb must all be zero; therefore, no rounding is required. Hence, complementation and rounding, as far as the adder is concerned, are mutually exclusive events and can be combined, solving the first problem. The LOP problem can be avoided by realizing that the LOP result is only needed when a massive left shift is required to normalize the result. However, after such a normalization, all the rounding bits must be zeroes, again requiring no rounding⁴. Finally, format conversion of a negative result can be performed by computing A + B̄ + (0, 1). When the result is positive, A + B̄ + 1 is selected, and when the result is negative, A + B̄ is selected, followed by a bit inversion. This conversion scheme works well with the above rounding scheme. Table 4.2 lists the steps in the proposed algorithm. The number of mantissa addition steps in both paths has been reduced to one.

⁴ The lsb of the shifted mantissa needs to be handled properly by taking into account the effect of complementation before shifting it back into the result.
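A small model of this conversion scheme can make the selection concrete. The 8-bit operand width is an assumption for the sketch; the point is that one compound add of A and the bit inversion of B yields both candidate results, and the carry out tells which to select.

```python
N = 8
MASK = (1 << N) - 1

def sub_sign_magnitude(a, b):
    """Return (sign, magnitude) of a - b using one compound add of a and ~b."""
    t = a + (~b & MASK)          # A + inv(B) = a - b - 1 + 2**N
    carry, s0 = t >> N, t & MASK
    if carry:                    # result positive: select A + inv(B) + 1
        return 0, (s0 + 1) & MASK
    return 1, ~s0 & MASK         # result negative: select A + inv(B), then invert
```

For instance, sub_sign_magnitude(100, 30) returns (0, 70) via the incremented sum, while sub_sign_magnitude(30, 100) returns (1, 70) via the bit inversion, so no second full-width addition is ever needed.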


Table 4.2: Steps in the Proposed Algorithm.

  Cancellation Path     | Normal Path
  ----------------------+------------
  Pred + Swap           | ES + Swap
  SA; Conv; Round; LOP  | Align
  Norm                  | SA; Round
  select                | select

4.3.1 Optimization in the Exponent Path

In the above discussion, we showed how the number of full-width adds was reduced from two to one in the mantissa path. The exponent path can also be optimized as follows.

Exponent Subtraction. In the normal path, the smaller mantissa has to be aligned properly. The exponent adder computes the amount of right shift. When this subtraction step produces a negative result, it has to be converted before driving the alignment shifter. This conversion step is undesirable because of the addition step required. A simple way uses two exponent adders [44], computing both Ea - Eb and Eb - Ea. This simple method requires two adders and is not very hardware efficient. A better way is to compute A + B̄ + (0, 1), followed by a bit inversion.

Exponent Adjustment. The exponent value needs to be properly adjusted after rounding. In an effective addition, the carry out of the mantissa addition step is not known until the end of the operation. This carry-out signal determines if a 1-bit right shift is needed during normalization. When it is a 1, the right shift is definitely needed. But when it is zero, the right shift is only needed when rounding causes the result to overflow. The exponent, in either case, needs to be incremented by 1. In the cancellation path, the LOP determines the adjustment amount that needs to be subtracted from the exponent. But because the LOP can only predict this amount to within 1 bit, a fine adjustment is needed as the true adjustment becomes known, either by examining the msb or by predicting such need directly from the input operands as described in the earlier chapter.


Table 4.3: Possible Adjustments for the Exponent

  Case                        | Ef              | Eadj     | Cin_EA
  ----------------------------+-----------------+----------+-------
  Left Shift w/ Fine Adjust   | Ef - Elop - 1   | Ēlop     | 0
  Left Shift w/o Fine Adjust  | Ef - Elop       | Ēlop     | 1
  1-bit Right Shift           | Ef + 1          | All 0's  | 1
  No Shift                    | Ef              | All 1's  | 1
  1-bit Left Shift            | Ef - 1          | All 1's  | 0

The exponent adjust path is normally not critical because multiple results can be pre-computed. In the proposed algorithm, the timing on this path is more stringent, requiring optimization. Table 4.3 explains how this is done. The second column lists the final exponent value, and the third and fourth columns indicate the values of Eadj and Cin_EA needed to obtain that value. For the case of a massive left shift without a fine adjustment, for example, the final exponent is Ef - Elop. This is obtained by selecting Ef and Ēlop as the inputs into the exponent adjust adder with a carry in Cin_EA of 1.

Normally, the cancellation path is not a problem: because Elop is known early, computing both exponent values Ef + Ēlop + (0, 1) is sufficient to account for all cases arising from the fine adjustment. The normal path presents a problem because we need to compute Ef + (-1, 0, +1). This can be accomplished by computing either Ef + (-1, 0) or Ef + (0, 1). The former is needed for subtraction and the latter for addition. Hence, as shown in the table, we first subtract 1 and then add 1 for the case of no shift.
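The trick of covering Ef + (-1, 0, +1) with a single adder can be sketched as follows. The 11-bit exponent width is an assumption (matching IEEE double precision); the adder simply sees an all-ones or all-zeros second operand plus a carry-in.

```python
N = 11                        # exponent width; 11 bits assumed for doubles
MASK = (1 << N) - 1
ONES, ZEROS = MASK, 0

def exp_adjust(ef, eadj, cin):
    return (ef + eadj + cin) & MASK   # one plain n-bit add, carry discarded

ef = 1023
assert exp_adjust(ef, ONES, 0) == ef - 1    # all ones, no carry: Ef - 1
assert exp_adjust(ef, ONES, 1) == ef        # subtract 1, then add 1: Ef
assert exp_adjust(ef, ZEROS, 1) == ef + 1   # all zeros with carry: Ef + 1
```

Because the all-ones vector acts as -1 in two's complement, the three needed exponent values differ only in the constant operand and the carry-in, never in the adder itself.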

4.4 Implementation Issues

4.4.1 Hardware Requirement

The algorithm can be implemented either as a two-stage pipeline for performance or as a three-stage pipeline for a lower performance with a reduced hardware cost. In a two-stage implementation, the normal path delay consists of an exponent add, an


alignment shift, a mantissa add, plus a few gates for rounding. The cancellation path delay consists of a few gates to predict if a mantissa swap is needed, a mantissa swap, a mantissa subtract, a normalization shift using the LOP result, and a mux selecting between the results from the two paths. Notice that in both paths, the rounding step in the existing algorithm has been reduced to just a few rounding gate delays.

In terms of hardware, this two-stage implementation requires the same number of compound adders as the existing algorithm. The proposed algorithm, however, does not require an incrementor to perform the rounding operation. An additional hardware saving is possible in a three-stage implementation, in which only one compound adder is needed. The normal path executes an exponent subtract plus an alignment shift in the first cycle. The second cycle adds up the mantissas and rounds. The result is available in the third cycle. In the cancellation path, the first cycle subtracts the exponents and swaps the mantissas, the second cycle subtracts the mantissas and rounds, and the third cycle normalizes the result. The cancellation path cannot subtract the mantissas until the second cycle because of contention with the normal path for the mantissa adder.

Implementing the directed rounding modes, RM and RP, requires a row of half adders to compute A + B + (0, 1, 2); RZ requires no additional hardware. Support of single precision operations can be implemented in a straightforward manner as shown in Quach and Flynn [37].

Handling denormalized numbers requires two minor modifications, neither of which affects the critical path. First, the exponent subtracter has to differentiate two cases. When both operands are normals or both are denormals, the exponent difference is exact and can be used directly to drive the alignment shifter. But when only one of the operands is a denormal, the subtracter has to account for the fact that there is a gap in the value of the exponent going from normal to denormal. The subtracter therefore has to add one or subtract one appropriately to compute the correct shifting distance. The second modification is in the alignment logic, which has to make sure that the alignment will not underflow the minimum exponent value, emin. Both modifications cost little in hardware.


Special cases, those involving Not-a-Number (NaN) and infinities, are detected at the input stage, and the proper result is selected at the output stage. The five exceptions are handled in a similar manner. For addition, divide-by-zero never occurs. The invalid operation exception occurs when subtracting infinities or when an operand is a NaN. The underflow exception occurs when the result exponent drops below emin; this exception happens only in an effective subtraction. The overflow exception occurs when the result exponent exceeds emax; this exception happens only in an effective addition. Finally, the inexact exception occurs when there is a loss of accuracy or a masked overflow. The inexact flag can be set when rounding is required. Generally, the invalid exception can be detected early. The underflow, overflow, and inexact flags can be determined only after the result mantissa has been computed.

4.4.2 The Test Chip

To verify the algorithm in silicon, we have implemented an adder that has a dataflow similar to the three-stage design described above. We chose the three-stage design because it requires additional optimization over the two-stage implementation. Hence, when such an algorithm works, one can deduce that the same algorithm will also work for a two-stage implementation. To simplify testing, we have adopted a straightforward flow-through design⁵. A block diagram of the adder is shown in Figure 4.2. The test chip supports only addition and subtraction of double precision IEEE operands with the round-to-nearest mode.

⁵ Though the design has no latches, the dataflow of the adder exactly matches a design that has latches (i.e., a pipelined design).

In the figure, the Sign Logic determines the sign of the result. MuxExp selects the larger of the exponents, which is later adjusted as needed. ExpAdd computes the absolute difference of the exponents, whose carry out controls the swapper and whose sum drives the alignment right shifter. The shifter also computes a sticky bit for the Round Logic. The larger mantissa from the swapper serves as one input to the mantissa adder, the Compound Adder. The other input is selected based on the difference in the exponents. When the exponents differ by more than 1, the other output from


the swapper is selected, and when the exponents differ by 1 or less, the output from the right shifter is selected, thereby accounting for all the cases. MuxRnd, controlled by the Round Logic, performs rounding by selecting the proper result. A negative result is converted into sign-magnitude form by the 1's complementor. MuxRnd does not consume extra hardware because it is part of the Compound Adder. Note that MuxRnd takes the place of the incrementor that performs rounding in the existing algorithms.

[Figure omitted: block diagram of the adder. Inputs Sa, Sb, Ea, Eb, Fa, Fb feed the Sign Logic, MuxExp, ExpAdd, and Swapper; the Right Shifter with Sticky Bit Logic and MuxF feed the Compound Adder with LOP and Round Logic; MuxRnd and the 1's complementor feed the Left Shifter, Exponent Adjust Logic and Adder, and MuxFinal, producing Sf, Ef, Ff under the Control unit.]

Figure 4.2: Logic Diagram of the Adder.


Table 4.4: Breakdown of Area Consumption

  Component               | X x Y (um)  | Kum^2 | Percent
  ------------------------+-------------+-------+--------
  Exponent Adder          | 592 x 459   | 272   | 2
  Swapper                 | 2150 x 142  | 921   | 6.6
  Alignment Shifter       | 2952 x 282  | 833   | 6
  Normalization Shifter   | 2952 x 282  | 833   | 6
  Shift Decoders          | 423 x 170   | 72    | 0.5
  Sticky Bit Logic        | 2172 x 424  | 921   | 6.6
  Mantissa Adder          | 3089 x 450  | 1390  | 10
  LOP                     | 3000 x 430  | 1290  | 10
  One's Complementor      | 3000 x 71   | 213   | 1.5
  Round Logic             | 212 x 208   | 44    | 0.3
  Sign Logic              | 92 x 120    | 11    | 0.01
  Exponent Adjust Logic   | 592 x 58    | 34    | 0.2
  Exponent Adjust Adder   | 592 x 459   | 272   | 2
  Muxes                   |             | 225   | 1.6
  Buffers                 |             | 327.6 | 2.3
  Total                   |             | 7700  | 55

The LOP works in parallel with the mantissa Compound Adder, predicting the left shift distance needed and driving the shifter to normalize the result. The need for a fine adjustment shift is not predicted in this particular implementation. Instead, it is determined after the normalization shift by the msb: when msb = 0, a fine adjustment is needed. MuxFinal selects among the results computed. The Exponent Adjust Logic controls the input into the Exponent Adjust unit for final exponent adjustment. All control decisions are made by the Control unit.

The FP adder uses the modified Ling adder described in Chapter 2 as the mantissa, exponent, and exponent adjust adders. The exponent adjust adder is not in the critical path due to the way the logic is developed, as discussed above; it could be replaced by a slower adder to save hardware if desired.

Table 4.4 lists the area required to implement each component. It is interesting to note that the rounding logic takes up only 0.3% of the FP adder. The percentage is computed based on the total area of the FP adder excluding pads. The components occupy only 55% of the total area, with the rest consumed by wires and dead space.


4.4.3 Methodology

Before the implementation, all the logic equations were derived and simulated on a simulator written in the C language. The reader is referred to Quach and Flynn [37, 38] for a discussion of the derivation of the logic equations. The simulator simulates both the algorithm and the logic optimization. Both single precision and double precision operations are simulated. Half a billion randomly generated vectors were tested.

In parallel, the basic building blocks were laid out and design-rule checked. The first iteration required very little SPICE simulation. When the whole unit is put together, it is tested at the switch level to ensure functionality; about 1 million randomly generated test vectors were passed at this level. SPICE is then used to check and tune the latency. The switch-level simulator can also be run in an analog mode; 40,000 test vectors were run this way to ensure that no other critical paths exist. During implementation, optimizations that improve the performance are often found. This requires modifying the simulator and rerunning the test vectors.
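The random-vector methodology has the shape sketched below. Everything here is hypothetical scaffolding: model_fadd stands in for the algorithm-level simulator (which was actually written in C), and the host FPU serves as the rounding reference.

```python
import random
import struct

def random_double(rng):
    """Draw a random IEEE double, skipping NaN/Inf bit patterns for this sketch."""
    while True:
        bits = rng.getrandbits(64)
        if (bits >> 52) & 0x7FF != 0x7FF:
            return struct.unpack('<d', struct.pack('<Q', bits))[0]

def model_fadd(a, b):
    return a + b   # placeholder for the bit-level model of the adder

rng = random.Random(1993)
for _ in range(10_000):
    a, b = random_double(rng), random_double(rng)
    got = struct.pack('<d', model_fadd(a, b))
    want = struct.pack('<d', a + b)    # reference: host FPU, RN mode
    assert got == want
```

Comparing bit patterns rather than float values catches signed-zero and rounding discrepancies that an equality test on the values would miss.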

4.4.4 Technology

The adder is implemented in the HP CMOS26 triple-layer-metal process. In this process, both N- and P-channel transistors have a 1 um drawn length, with an oxide thickness of 150 Å. All pitches are 2 um for uncontacted metal and 2.6 um for contacted metal. In our implementation, the third metal layer is used exclusively for power and ground. The whole FP adder is laid out using a standard cell approach with automatic routing; no compaction was performed.

4.5 Results

Table 4.5 lists the simulated delays in the critical paths. In the normal path, ExpAdd computes the absolute magnitude of the exponent difference, which is then decoded (Dec) to drive the alignment shifter. The ExpAdd delay includes selecting the proper


result to obtain the absolute value. The output then goes through a 3-1 mux, which also performs inversion if needed. The delay for computing the multiple results in the significand adder is 3.7 ns. This delay does not include the inversion of the input operand and the final result selection, as in the case of an integer add; the delay is therefore smaller. The round logic takes 3.6 ns, which includes the delay for signals traveling from the msb to the round logic and the delay for driving the final selection muxes. Finally, the result is buffered to drive the final selection mux. One's complementation is not needed in this path. In the current implementation, the round logic is placed on the lsb side of the significand adder; by placing it on the msb side, the signal traveling delay, about 1 ns, can be saved. The final result selection muxes should not have been counted as part of the round delay, because this delay would be needed anyway to normalize the result before rounding had a more conventional rounding algorithm been used.

In the cancellation path, we need to compute the sign of the result of the exponent subtraction and swap the significands if needed. The result may be negative when the exponents are equal. The output of the swapper then goes to a 3-1 mux. At this point, we need to examine two paths to determine the critical path. The first path is LOP plus normalization shift, and the second is significand add, round, and one's complement. As shown in the table, the latter turns out to be worse, becoming the critical path of the whole adder. After the one's complement, the result is selected.

Without LOP, the second cancellation path would be worse because two more operations are needed for normalization. The first operation is leading one detection⁶ on the result from the one's complementor. The second is encoding, whose result is then used to drive the normalization shifter⁷. On the positive side, however, we would no longer need a fine adjustment step, a 1.4 ns saving. The critical path would then have been 22.3 ns⁸; hence, LOP is important for high-speed FP adders. The worst-case delay shown in Table 4.5 does not include the other rounding modes.

⁶ Leading-one detection requires a priority encoder.
⁷ A possible optimization is to merge the encoder logic and the decoder in the shifter.
⁸ Delay = T_Path2 + T_LOD + T_encoding + T_normalize(control) - T_normalize(data) - T_fine_adjust = 17.2 + 2.8 + 2.4 + 2.8 - 1.5 - 1.4 = 22.3 ns.


Table 4.5: Simulated Delay at 5 Volts and Room Temperature

  SA Path                 ns   | AS Path, Path 1        ns   | Path 2               ns
  ------------------------------+------------------------------+------------------------
  ExpAdd                  3.6  | ExpAdd.Cexp            2.6  | Same                2.6
  Shifter Decode          1.2  | Swapper + buffer       2.0  | Same                2.0
  Alignment               1.6  | 3-1 mux                0.6  | Same                0.6
  3-1 mux + buffer        0.8  | LOP                    6.5  | ComAdd + Round      7.3
  Significand add         3.7  | Shifter Decode         1.2  | One's Comp          0.8
  Round logic             3.6  | Normalize (Control)    1.6  | Normalize (Data)    1.5
  One's Comp + buffer     0.8  | Fine Adjust            1.4  | Same                1.4
  4-1 mux + buffer        1.0  | 4-1 mux + buffer       1.0  | Same                1.0
  Total                  16.3  | Total                 16.9  | Total              17.2

Round-to-infinity requires computing up to three outcomes in parallel in the significand adder, requiring a row of half adders as shown in Quach and Flynn [37]. Assuming a 0.8 ns delay for a half adder, the worst-case delay for an adder that performs all rounding modes is roughly 18 ns. Fig. 4.3 is a plot of the test chip. Actual testing and evaluation of the chip is still ongoing.

4.6 Summary

In this chapter, we have presented an improved algorithm for high-speed FP addition. The proposed algorithm has only one mantissa addition step in the critical path, therefore offering considerable advantages over the existing ones. The algorithm can be incorporated either in a two-stage implementation for minimum latency or in a three-stage implementation for reasonable latency and minimum hardware cost. A test chip, implemented in a 1 um CMOS technology, has been built to verify the algorithm. The adder has a simulated nominal flow-through latency of 17 ns and has a dataflow similar to that of a three-stage implementation.


The key idea behind the proposed algorithm is that the rounding and format conversion steps in the existing algorithm can be combined with the mantissa addition step.


Figure 4.3: A Plot of the Floating-Point Adder.


Chapter 5

A Radix-64 FP Divider

5.1 Introduction

Next to addition and multiplication, FP division is the (distant) third most frequently used instruction in many scientific applications. Traditionally, because of its low use frequency, FP division has mainly been treated as a subject of theoretical interest. In FPUs that do support division, it is often performed iteratively on a multiplier or on an autonomous unit allotted a minimum amount of hardware. But this is changing because of the increasing area budget and the growing interest in graphics applications.

Unlike FP add and multiply, FP division has a much longer latency because of its low parallelism. Many modern machines use an iterative division algorithm known as SRT [45, 46]. In the algorithm, the number of bits obtained per iteration can be roughly considered the radix of the algorithm. Current SRT algorithms use small radices (4 to 16) because of cycle-time and possibly perceived hardware limitations. This chapter presents a method to reduce the latency of a simple radix-4 SRT divider by approximately half. Because of this reduced latency, three such simple dividers may be staged, or one may be recycled with a 3X clock, to obtain a radix-64 divider. In this divider, an IEEE double precision operation can be computed in 11 cycles at a competitive clock rate (e.g., a 54-bit ALU addition time).


5.2 Background

Sweeney, Robertson [47], and Tocher [48] independently developed SRT. Atkins [45, 49] extended it to higher radices, but the extension requires a large lookup table. More recently, Taylor [50] overlapped two simple radix-4 dividers to obtain a radix-16 divider. To reduce latency, Taylor's method requires expensive hardware duplication so that the second stage can operate in parallel with the first. Fandrianto [51] "folded" the SRT P-D table to obtain a radix-8 divider with little hardware expenditure. This folding scheme, however, is limited to low radices (below radix-16). Carter and Robertson [52] reported a radix-16 divider using a signed-digit representation and a simple overlap method. Ercegovac et al. [53] scaled the dividend and the divisor to reduce the complexity of the lookup table, but unfortunately the need for scaling partially offsets its benefits. Though not explicitly stated, their method also allows staging.

Other high-speed division schemes exist. The Newton-Raphson algorithm [54, 17], which is essentially an approximation based on power series expansion, features reuse of the multiplier and a quadratic convergence rate. To speed up the initial iterations, a small lookup table is often used. The looked-up seed is then iterated to the desired precision for rounding. A popular variant of Newton-Raphson is the Goldschmidt algorithm, which allows a somewhat higher degree of overlapping. The number of iterations required in both algorithms depends on the size of the initial lookup table, which is currently small, thus limiting their performance. Though larger table sizes can be used, it is not considered effective. For proper IEEE rounding, these multiplicative algorithms have the additional problem that the result has to be computed to a precision twice that of its operands [55].

The method reported in this chapter uses 30% more hardware than a simple SRT divider but has half the latency. In hardware, our approach uses more than Ercegovac's but less than Taylor's; in latency, ours rivals the latter and requires no scaling as the former does. Section 3 briefly describes the general principle of SRT, and Section 4 presents our method. Implementation details, hardware cost issues, and extension


to the square root operation are also addressed. Section 5 contains a summary. Because the implementation of the exponent path is straightforward, the discussion in this chapter focuses on the mantissa path.
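For reference, the Newton-Raphson reciprocal iteration mentioned above is x(i+1) = x(i) * (2 - d*x(i)), which roughly doubles the number of correct bits per step. The sketch below uses a crude 3-bit table seed; both the seed construction and the iteration count are assumptions for illustration.

```python
def recip_newton(d, iters=4):
    """Approximate 1/d for 1 <= d < 2 by Newton-Raphson iteration."""
    # Toy seed standing in for the small hardware lookup table.
    x = 1.0 / (1.0 + round((d - 1.0) * 8) / 8)
    for _ in range(iters):
        x = x * (2.0 - d * x)   # the relative error squares on every pass
    return x

assert abs(recip_newton(1.5) * 1.5 - 1.0) < 1e-12
```

With a seed accurate to a few bits, four iterations are ample for double precision, which is why the table size directly sets the iteration count (and hence the latency) of these multiplicative schemes.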

5.3 The Proposed Implementation

5.3.1 SRT - A Brief Review

Division is basically a trial-and-error process. To divide x by y, we estimate the first quotient digit and then subtract the product of this guessed quotient digit and the divisor from the dividend to obtain a partial remainder. If the magnitude of the partial remainder is larger than that of the divisor, we try a larger digit until the magnitude of the partial remainder becomes smaller than that of the divisor. The partial remainder is then multiplied by the radix to form the next remainder, and the process repeats until the quotient reaches the desired precision. This pencil-and-paper division method can be described mathematically as:

P(j+1) = r*P(j) - q(j+1)*D,   with |P(j+1)| < D,

where r is the radix and the digit q(j+1) is selected subject to the constraint of the latter inequality. From the equations, we see that each selection of q(j+1) requires a full-width subtraction of q(j+1)*D from r*P(j). Division is consequently slow. With SRT, the idea is to introduce a coefficient k such that

P(j+1) = r*P(j) - q(j+1)*D,   but with |P(j+1)| < k*D,     (5.1)

where k = n/(r - 1) and n is the number of positive allowable quotient digits excluding zero. k is sometimes called the coefficient of redundancy. For radix-4 and q(j+1) in {-2, -1, 0, 1, 2}, n = 2 and k = 2/3. In general, k lies in the range [1/2, 1] and is a


design parameter that determines the complexity of the lookup table for the quotient digit during iteration; the larger k, the smaller the lookup table. Difficulty in forming the divisor multiples keeps k at a reasonable size.

To find the boundary defining each redundant region, the minimum and maximum of the q(j+1)'s can be substituted into Eq. (5.1), as shown in Figure 5.1. In the figure, only the positive half is shown; the negative half is a mirror image. The redundancy regions, labeled qj = {0,1} and qj = {1,2} in Figure 5.1, allow a tolerance in the selection of q(j+1); thus, only small numbers of bits in r*P(j) and D need to be examined. In general, these numbers depend on both k and r. Tan [46] has developed an analytical equation for determining them for the proper selection of q(j+1). In practice, these numbers are often obtained by computer programs performing an exhaustive search. For radix-4, we need to examine the top 5 bits of r*P(j) (denoted rP̂(j)) and 4 bits of D (denoted D̂). In other words, we are using truncated versions of r*P(j) and D.

[Figure 5.1: P-D Plot — omitted: plot of partial remainder versus divisor showing the quotient selection regions qj = 0, qj = {0,1}, qj = 1, qj = {1,2}, and qj = 2, bounded by the lines (1/3)D, (2/3)D, (4/3)D, (5/3)D, and (8/3)D.]

Assuming P and D have been properly latched at the input, each iteration of the


SRT division operation consists of the following three steps: (1) rP^ and D^ are used to consult a lookup table, which implements the P-D plot; (2) The selected quotient digit is used to form the divisor multiple, qj+1D. A full-width adder is used to subtract rPj and qj+1D to form Pj+1; and (3) The rP^j term is used to look up the next quotient digit. The rP^j term is formed by shifting Pj by the radix r and no multiplication is needed. The quotient digit, typically stored in a redundant format, is collected along the way and is normally not in critical path; the latency of the algorithm is limited by the full-width adder computing the next remainder. Current SRT dividers solve this problem by storing the partial remainder in a sum-carry or a signed-digit format. To reduce the size of the lookup table, the bits accessing the table are converted into a non-redundant (i.e., 2's complement) form using a small width adder. In a sense, this small ALU \models" the division process in the full-width adder. The critical path of this simple divider consists roughly of an 8-bit CLA, a 26-term PLA [50], and a CSA. Table 5.1: New Quotient Digits Encoding Scheme

qj+1 | sign | q1 | q2
  -2 |   1  |  x |  1
  -1 |   1  |  1 |  0
   0 |   0  |  0 |  0
   1 |   0  |  1 |  0
   2 |   0  |  x |  1
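The three-step iteration described above can be sketched as a behavioral model. This is a hedged illustration, not the hardware algorithm itself: full-precision rounding of rPj/D stands in for the truncated-operand P-D table, and the function name and operand values are illustrative.

```python
def srt4_divide(x, d, iters=14):
    # Behavioral radix-4 SRT model: digit set {-2,...,2}, redundancy 2/3.
    # Precondition: 0 < x <= (2/3)*d so the remainder bound holds.
    p, q = x, 0.0
    for j in range(1, iters + 1):
        rp = 4 * p                                 # shift remainder by the radix
        digit = max(-2, min(2, round(rp / d)))     # idealized digit selection
        p = rp - digit * d                         # next partial remainder
        q += digit * 4.0 ** -j                     # accumulate quotient digits
    return q
```

Each iteration preserves the invariant |Pj| <= (2/3)D, so the accumulated quotient converges to x/d by two bits per radix-4 digit.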

5.3.2 Improving the Simple Divider

The simple divider can be further improved in two ways: (1) use of a different encoding scheme for the quotient digits and (2) duplication of logic elements to increase parallelism.

Different Quotient Digit Encoding Scheme

One can fold the SRT table so that only the positive half of the table needs to be implemented. This reduces the number

[Figure 5.2 (table): for each truncated divisor value (rows 0 through 15) and truncated partial remainder value (columns 00.00 through values above 11.00), the entry gives the allowable quotient-digit magnitude: 0, 1, 2, A, or B.]

A = 1 if negative and y bit = 1; A = 2 otherwise; B = 2 if negative and y bit = 1; B = 1 otherwise.

Figure 5.2: Quotient Digits Allowable in the SRT Regions

of terms in the PLA from 26 to 19, as reported by Fandrianto [56]. A further reduction in the number of terms can be achieved by using a slightly different encoding scheme for the quotient digit, as shown in Table 5.1, where the symbol x represents a don't care in the new encoding scheme. Table 5.2 compares the complexity of the logic equations obtained using the conventional scheme and the encoding scheme proposed here. The logic equations are obtained using the CAD program espresso. In the table, the equations in the conventional rows use the standard encoding scheme; the equations using the new encoding scheme are listed below them. The bits in the truncated divisor and the truncated partial remainder have been renamed abcd and stuvwxy, respectively. Both sets of equations are obtained using a slightly different procedure than those reported by Fandrianto [56]. For a negative partial remainder, Fandrianto first converts it into

Table 5.2: Comparison of Quotient Digit Equations

Conventional scheme:

q1 = cdt'x'yv' + bst'x'yw + cst'x'yv' + dst'x'yv'w + ct'x'yv'w' + a'c'd't'x'y'u + bct'x'uv + t'x'y'uv + bdt'x'uv + bt'x'yv' + a'b't'x'y'u + at'xy'u'v' + t'x'yu' + bcdst'xy'u'v'w + abt'xy'u' + act'xy'u' + abt'xy'v' + at'x'y

q2 = a'b'd's'yuw + a'b'c'yuw' + a'c'd'yuv + a'b'c's'yu + a'b'c'd'yu + a'b'yuv + b'c'xv + a'xw' + xuv + a's'x + a'd'x + a'xv + b'xu + a'c'x + a'b'x + a'xu + xy + t

New encoding scheme:

q1 = a'c'd'u + a'b'u + uv + y + x

q2 = a'b'd's'yuw + a'b'c'yuw' + a'c'd'yuv + a'b'c's'yu + a'b'c'd'yu + a'b'yuv + b'c'xv + a'xw' + xuv + a's'x + a'd'x + a'xv + b'xu + a'c'x + a'b'x + a'xu + xy + t

positive using one's complementation. The t bit is ORed with the lower bits, ensuring a partial remainder that is always less than 11.00 (see Table 5.2). In our procedure, the conversion is done in a similar manner, but the t bit is not used to clamp the value of the partial remainder. Even so, it is interesting to note that our equations for q1 and q2 have only 18 terms, as opposed to the 19 reported by Fandrianto [56]. From Table 5.2, one can see that while the logic equation for q2 remains roughly the same, the complexity of q1 has been reduced considerably.

Duplication of Logic Elements

In this optimization, one can assume the value of qj+1 and proceed to compute the next partial remainders in parallel with the quotient digit lookup step. When the true quotient digit is available, the correct remainder is chosen for the next iteration. At first glance, one seems to need a separate copy of the next-partial-remainder hardware for each value of qj+1 in the allowable digit set, therefore requiring potentially five times as much hardware as the simple SRT scheme. But a critical observation is that the sign of the next quotient digit


arrives early, allowing us to reduce the number of quotient digits to three: 0, 1, and 2. Furthermore, from Table 5.2, we know whether qj+1 equals 0 or 1 early (after a few gate delays) using the quotient digit encoding above. This allows a further reduction in the next-quotient hardware. The end result is that we only need one more row of CSAs and a guess CLA. The critical path of this improved divider is now the larger of a table lookup delay or a CSA plus an 8-bit CLA delay. That is, we have reduced the latency of the simple SRT divider by half. Figure 5.3 compares the simple and the improved SRT schemes.
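The duplication idea can be sketched behaviorally. As before, full-precision rounding is a hedged stand-in for the table lookup; the point of the sketch is that the three candidate remainders (magnitudes 0, 1, 2, with the sign known early) are formed before the true digit is known.

```python
def srt4_step_speculative(p, d):
    # One improved-divider step: candidate next remainders for digit
    # magnitudes {0, 1, 2} are formed speculatively (the extra CSA rows
    # and guess CLA), in parallel with the quotient-digit lookup.
    rp = 4 * p
    sign = -1 if rp < 0 else 1                       # remainder sign arrives early
    candidates = {m: rp - sign * m * d for m in (0, 1, 2)}
    digit = max(-2, min(2, round(rp / d)))           # idealized "table lookup"
    return candidates[abs(digit)], digit             # late, cheap selection
```

In hardware the selection is a mux delay, so the remainder update no longer waits for the PLA.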

5.3.3 Overlap and Rounding

Overlap of the improved dividers is straightforward, as shown in Figure 5.4 below. In the figure, the divider obtains 6 quotient bits per iteration. For IEEE double precision, the divider has to compute up to 56 bits for proper rounding, requiring 10 cycles. An additional cycle is needed for rounding. During iteration, the quotient is kept in three registers: QM, Q, and QP. The Q register contains the value of the quotient, and the QM and QP registers contain the value of the quotient minus one and plus one, respectively. The conversion of the quotient from the PLA lookup table format into this format is quite straightforward, as shown in Table 5.3 below. In the table, QMj denotes the value of the register in the jth iteration and QMj+1 its value in the (j+1)th iteration. The values of the quotient registers in the (j+1)th iteration are determined by qj+1. If it is negative, we subtract 1 from Qj and set the next two quotient bits(1) accordingly. Decrementing Qj is equivalent to selecting QMj as the next quotient register. A similar argument applies for the other registers. In the table, QP is only used at the end of the iteration for rounding. A similar technique has been reported by Fandrianto [56] (to speed up the square root operation) and by Ercegovac and Lang [57] (to speed up rounding).

At the last iteration, the sign of the true remainder determines which register pair is selected for rounding. A positive remainder selects the Q and QP pair and the

(1) Remember that we are staging three radix-4 dividers; therefore, we only obtain 2 quotient bits at a time, but three times per iteration.


Table 5.3: Next Quotient Logic

qj+1 | QMj+1 | Qj+1 | QPj+1
  -2 |  QMj  |  QMj |  QMj
  -1 |  QMj  |  QMj |  Qj
   0 |  QMj  |  Qj  |  Qj
   1 |  Qj   |  Qj  |  Qj
   2 |  Qj   |  Qj  |  Qj

QM and Q pair otherwise. Rounding for the IEEE standard can be performed in a straightforward manner and will not be discussed further here.
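The register-selection rules of Table 5.3, together with the two quotient bits appended per radix-4 digit, can be checked with a small on-the-fly conversion model. The function name is illustrative, and the model assumes the leading digit is positive (true once the first nonzero quotient digit has been produced).

```python
def on_the_fly(digits):
    # Registers hold Q, Q-1 and Q+1 at the current scale; two quotient
    # bits are appended per radix-4 digit (no carry-propagate addition).
    QM = Q = QP = 0
    for d in digits:
        # Register selection per Table 5.3, then append (value mod 4) bits.
        Q_new = (QM if d < 0 else Q) * 4 + (d % 4)
        QM_new = (QM if d <= 0 else Q) * 4 + ((d - 1) % 4)
        QP_new = (QM if d < -1 else Q) * 4 + ((d + 1) % 4)
        QM, Q, QP = QM_new, Q_new, QP_new
    return QM, Q, QP
```

For the digit sequence [1, -2, 2], the signed-digit value is ((1*4 - 2)*4 + 2) = 10, and the model returns the (QM, Q, QP) triple (9, 10, 11).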

5.3.4 Extension to Square Root Operation

There are two ways to extend a divider to the square root operation. The first method uses a lookup table for the initial root, and subsequent iterations share the division table, as proposed by Fandrianto [56]. The second method uses a separate table altogether, as proposed by Ercegovac and Lang [58]. Normally, the second method is considered superior because of its smaller table size. For our method to be effective, though, the first scheme is actually better because q1 can be determined more quickly than q2 in this scheme.

5.4 Summary

We have presented a simple method to reduce the latency of a simple SRT radix-4 divider by approximately half. Staging three such improved dividers, or recycling one with a 3X clock, produces a radix-64 divider. Such a divider can deliver the quotient of two IEEE double precision numbers in 11 cycles. At lower clock rates, even higher radices (e.g., radix-256) can be achieved.

The key ideas presented in this chapter are: (1) During each iteration of the SRT process, all possible next partial remainders can be pre-computed. The true quotient


digit, known after the table lookup, can then be used to select among these precomputed next remainders. In this way, the next remainder computation and the quotient digit lookup can be performed in parallel. (2) The amount of duplicated logic may seem prohibitive because the quotient digit set consists of {-2, -1, 0, 1, 2} and each digit requires a set of next-remainder hardware. However, the sign of the next quotient digit is known early. This allows "folding" of the negative digit set into the positive half. Further, by using a slightly different quotient digit encoding scheme, a 0 or 1 quotient digit can be determined early. Together, these observations reduce the duplicated hardware to an extra row of CSAs and a guess CLA.
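The staging idea can also be modeled behaviorally: three radix-4 steps per clock yield 6 quotient bits per iteration. As in the earlier sketches, idealized digit selection is a hedged stand-in for the prediction PLAs.

```python
def radix64_iteration(p, d):
    # One radix-64 "clock": three staged radix-4 steps, 6 quotient bits.
    digits = []
    for _ in range(3):
        rp = 4 * p
        q = max(-2, min(2, round(rp / d)))   # idealized per-stage selection
        p = rp - q * d
        digits.append(q)
    return p, digits
```

The invariant x/d = Q/64 + p/(64*d) holds after one call, where Q is the signed-digit value of the three digits.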

[Figure 5.3 (block diagrams): (a) Simple SRT division: a mux selects a divisor multiple from {-2D, -D, 0, D, 2D}, a CSA subtracts it from the remainder P, and the QS logic and quotient collection logic close the loop. (b) Proposed scheme: duplicated CSAs and muxes compute the candidate next remainders in parallel with the QS logic, which then selects among them.]

Figure 5.3: Comparison of the Proposed and the Simple SRT Division Scheme


[Figure 5.4 (block diagram): three staged radix-4 units; each stage contains a prediction PLA, CSAs, a guess CLA (gCLA), and selection muxes steered by the early sign (Mux_sign, Mux_rPj). The sum-carry partial remainder (rPj_S, rPj_C) feeds the on-the-fly quotient conversion (QM, Q, QP registers), the round logic, and the sign and sticky computation, which together produce the quotient.]

Figure 5.4: Radix-64 Divider


Chapter 6

Fast IEEE Rounding

6.1 Introduction

Not all real numbers are representable in a computer because of the limited hardware precision. Rounding is a many-to-one mapping that maps an unrepresentable number into a representable one. This mapping can happen during I/O [59, 60] and during computation. Here, we deal with the latter and with the mapping specified by the IEEE standard [3]. This chapter reports our findings on multiplication. The same approach, however, can be applied to rounding in the other operations as well [34].

In the IEEE FP format, a number is represented as (-1)^s × 1.m × 2^e, where s is the sign, m the mantissa, and e the biased exponent. For a normalized number, m ∈ [0, 1). The term 1.m is often called the significand. The number of bits n in the significand depends on the precision; for double precision n = 53 and for single precision n = 24.

In high-speed multipliers, multiplication is carried out by first generating many partial products in parallel, followed by a reduction step reducing these partial products to two terms, sum (S) and carry (C) [61, 62]. The final summation is then carried out by a carry propagate adder (CPA). The product has 2n bits, potentially requiring a right shift for normalization and a rounding operation to map it into an n-bit number. For denormalized operands, the product also has to be normalized before rounding(1).

(1) In modern processors, the integer unit often stalls for a cycle for the FPU to normalize a denormalized result significand.
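The (-1)^s × 1.m × 2^e encoding can be exhibited directly by unpacking a double; the example value -6.5 is illustrative.

```python
import struct

# Pull apart an IEEE double into sign, biased exponent, and mantissa bits.
bits = struct.unpack('<Q', struct.pack('<d', -6.5))[0]
s = bits >> 63                        # sign
e = (bits >> 52) & 0x7FF              # biased exponent (bias 1023)
m = bits & ((1 << 52) - 1)            # 52 stored mantissa bits (n = 53 with hidden 1)
value = (-1) ** s * (1 + m / 2 ** 52) * 2 ** (e - 1023)
```

Here -6.5 = -1.625 × 2^2, so s = 1, e - 1023 = 2, and 1.m = 1.625.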


The discussion in this chapter assumes normalized operands. For performance reasons, rounding is typically done in hardware. Previous work on rounding, on multiplication by Santoro et al. [63] and on addition by Quach and Flynn [23], is rather ad hoc and therefore hard to verify, let alone generalize. This chapter presents a systematic rounding procedure with optimization guidelines to minimize latency and hardware. This procedure-based rounding approach has the additional advantage that verification and generalization to other types of multipliers are straightforward. Though applicable to other forms of rounding, we restrict our treatment here to that dictated by the IEEE standard. Section 2 reviews and discusses the issues surrounding the IEEE rounding modes. Section 3 demonstrates the proposed rounding procedure for all the IEEE rounding modes, and Section 4 is a summary.

For notational simplicity, we define our binary point to be at the nth bit of S and C. The integer n+1 bits are denoted Sh and Ch, and the fractional n-1 bits Sl and Cl (Figure 6.1). Also, we define

R = S + C, Rh = Sh + Ch, and Rl = Sl + Cl,

where Sh, Ch ∈ [0, 2^(n+1)); Rh ∈ [2^(n-1), 2^(n+1))(2); Sl, Cl ∈ [0, 1); and Rl ∈ [0, 2). sb1 denotes the logical OR of all bits in Rl except Rl.msb (in other words, the sticky bit before normalization).

6.2 IEEE Rounding Modes

Again, the four rounding modes dictated by the IEEE standard [3] are round to nearest (RN), round to positive infinity (RP), round to minus infinity (RM), and round to zero (RZ). From an implementation point of view, these 4 rounding modes

(2) Assume normalized input operands.


[Figure 6.1 (diagram): the rounding point splits the 2n-bit quantity K into Kh, the upper n+1 bits ending in the lsb, and Kl, the lower n-1 bits beginning with the msb.]

Figure 6.1: Explanation of Notation. K represents S, C, or R.

Table 6.1: Implementation of IEEE Rounding Modes

IEEE Rounding Mode | Positive Result Treated As | Negative Result Treated As
RN | RU with fix-up | RU with fix-up
RP | RI             | RZ
RM | RZ             | RI
RZ | RZ             | RZ

can be reduced to 3: round up (RU) with a fix-up, round to infinity (RI), and RZ, as shown in Table 6.1. Mathematically, for RU,

x = ceil(x) if x - floor(x) ≥ 0.5, and x = floor(x) otherwise;

for RI, x = ceil(x), and for RZ, x = floor(x), where ceil(x) and floor(x) are the ceiling and floor

functions [64], respectively. Table 6.2 shows how RN can be implemented as RU with a fix-up. In the table, the least significant bit is denoted lsb, the guard bit gb, and the sticky bit sb. The entries in the table are the values of the rounding "1" to be added to the result at the lsb position. From the table, we see that RN and RU differ in only one case, the tie case in which the lsb is zero. There, the RN result has a zero in the lsb and the RU result has a one, which can be fixed up by simply setting the lsb to zero without generating a carry propagation. For IEEE rounding, a simple method includes the following steps:


Table 6.2: Comparison of Round to Nearest and Round Up

lsb gb sb | RN | RU
 x  0  0 |  0 |  0
 x  0  1 |  0 |  0
 0  1  0 |  0 |  1
 0  1  1 |  1 |  1
 1  1  0 |  1 |  1
 1  1  1 |  1 |  1

1. Compute R (and therefore C and S) to a precision of 2n bits.

2. Normalize R if necessary and adjust the result exponent.

3. Compute lsb, gb, and sb. lsb corresponds to the nth bit of R, and gb to the (n-1)th bit of R. sb requires a logical OR of the n-2 lower-order bits of R.

4. Based on the rounding mode, lsb, gb, and sb, add 1 ulp to Rh when necessary.

5. If Rh.co = 1 as a result of rounding, right shift Rh by one bit and adjust the exponent accordingly.

Several implementation difficulties associated with the IEEE rounding modes can now be identified. First, S, C, and R need to be computed to a precision of 2n bits, requiring more hardware. Second, correct rounding in certain modes depends on sb, requiring more time because R must be computed and then rounded using an extra addition step.

Rounding was not an issue prior to the IEEE standard; few processors supported RP and RM, and RN was nonexistent. Rather, most processors only approximated RU. In some high-speed multipliers, for example, the product was only computed to a precision of n plus a couple of extra bits. RU was then implemented by adding a rounding "1" at the gb position during the CPA step. Rounding in this case requires no extra addition step. A natural question is whether we can do away with the addition step (i.e., step 4) in the rounding procedure for the IEEE standard. In the following section, we show


that such a rounding scheme is indeed possible and shall establish a procedure for doing so. But for the moment, let's examine the above rounding procedure again and consider the implications. To eliminate the rounding addition in step 4, it must be combined with step 1, where R is computed. Because the rounding "1" (r(1)) is added to R at the nth bit position, we must therefore compute Rh and Rl separately, breaking R into two parts. Further, the computation of Rl may propagate an overflow "1" (o(1)) into Rh. This means that we have to compute Rh, Rh + r(1) or Rh + o(1), and Rh + r(1) + o(1) in parallel and select the correct one. Finally, because rounding is now performed before normalization, an r(1) becomes an r(2) when a right shift is needed during normalization; i.e., we now have to compute Rh + r(2), instead of Rh + r(1), for rounding. The observation here is that the possible outcomes are few, making possible the rounding scheme to be proposed below.
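The five-step procedure of Section 6.2 can be exercised on toy-precision significands. This is a sequential reference model (it performs the separate rounding addition that the schemes below eliminate); n, a, and b are illustrative, and step 4 implements RN directly from lsb, gb, and sb per Table 6.2.

```python
def fp_mul_round(a, b, n):
    # a, b: n-bit significands in [2^(n-1), 2^n), i.e. 1.m scaled by 2^(n-1).
    r = a * b                                        # step 1: 2n-bit product R
    if r >= 1 << (2 * n - 1):                        # step 2: right-shift case,
        k, e = n, 1                                  #   exponent adjusted by 1
    else:
        k, e = n - 1, 0
    lsb = (r >> k) & 1                               # step 3: lsb, gb, sticky
    gb = (r >> (k - 1)) & 1
    sb = (r & ((1 << (k - 1)) - 1)) != 0
    res = (r >> k) + int(gb and (sb or lsb))         # step 4: RN add-1-ulp rule
    if res == 1 << n:                                # step 5: rounding overflow
        res, e = res >> 1, e + 1
    return res, e
```

For n = 3, multiplying 5 (1.01) by 6 (1.10) gives 30/4 = 7.5 ulp, an odd-lsb tie that rounds up to 8 and triggers the step-5 renormalization.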

6.3 Rounding for Binary Parallel Multipliers

The rounding procedure has 2 steps: constructing a rounding table and selecting a valid prediction scheme. Using RU as an example, we first explain how to construct this rounding table, describe a simple rounding hardware model, and then outline methods to select a valid prediction scheme. Finally, we present an improved model, which has the advantage that it provides a solution for all the IEEE rounding modes.

6.3.1 Round Up

6.3.1.1 Constructing a Rounding Table

In the rounding table, the number of columns corresponds to the shifting possibilities in the normalization process. For multiplication the result may require a right shift to normalize, giving 2 columns: no shift (NS) and right shift (RS). The number of rows in the table, on the other hand, depends on both the rounding mode and the magnitude of Rl. For RU, Rl ∈ [0, 2) and four rows, called regions, are required. Each entry in the table corresponds to the correction that must be applied to Rh to obtain the final, properly rounded result. A rounding table for the RU mode is


given in Table 6.3.

Table 6.3: Rounding Table for Round Up in Binary Multipliers

Region | Range of Rl    | No Shift | Right Shift
1      | 0 ≤ Rl < 0.5   | Rh       | Rh.lsb = 0: Rh + {0, 1}; Rh.lsb = 1: Rh + {1, 2}
2      | 0.5 ≤ Rl < 1   | Rh + 1   | Rh.lsb = 0: Rh + {0, 1}; Rh.lsb = 1: Rh + {1, 2}
3      | 1 ≤ Rl < 1.5   | Rh + 1   | Rh.lsb = 0: Rh + {2, 3}; Rh.lsb = 1: Rh + {1, 2}
4      | 1.5 ≤ Rl < 2   | Rh + 2   | Rh.lsb = 0: Rh + {2, 3}; Rh.lsb = 1: Rh + {1, 2}

We first consider the entries in Region 1. When Rh is normalized, no adjustment of the result is needed because Rl < 0.5. When Rh needs an RS to be normalized, we consider two cases:

1. When Rh.lsb = 0: after shifting, Rl ∈ [0, 0.25), and the result needs no correction. However, there is an optimization here. Because Rh.lsb = 0 and it is to be discarded after the shift, adding an equivalent "1" (e(1)) to Rh would not change the result. Thus, the possible correction digits are {0, 1}, producing the Rh + {0, 1} entry in the table.

2. When Rh.lsb = 1: after right shifting, the effective Rl ∈ [0.5, 0.75). Either an r(2) or an e(1) can be added to Rh for correction; therefore, we have the Rh + {1, 2} entry.

To obtain Region 2, because Rl ∈ [0.5, 1), an r(1) needs to be added to Rh when the result does not need an RS. The RS entries are obtained using a similar argument as that for Region 1. To obtain Region 3, consider the NS case. Because Rl ∈ [1, 1.5), an o(1) needs to be added to Rh, obtaining the entry Rh + 1. In the case of Rh needing an RS, we again consider two cases. When Rh.lsb = 0, the effective Rl ∈ [0.5, 0.75); hence, we can add either an r(2) or an e(3) to Rh. When Rh.lsb = 1, we can add


either an o(1) or an e(2), as indicated in the table, since Rl ∈ [1, 1.25). To obtain Region 4, we need to add both an r(1) and an o(1) to Rh in the case of NS. The RS entries are obtained as follows. After the RS, Rl ∈ [0.75, 1). When Rh.lsb = 0, either an r(2) or an e(3) can be added to Rh, obtaining the Rh + {2, 3} entry. When Rh.lsb = 1, the effective Rl ∈ [1.25, 1.5); hence, either an o(1) or an e(2) can be added, yielding the Rh + {1, 2} entry in the table. An intuitive argument might have yielded an Rh + {3, 4} entry because both an r(2) and an o(1) need to be added to Rh, but such an argument is incorrect.

Table 6.3 indicates that at least three outcomes, Rh, Rh + 1, and Rh + 2, must be computed in parallel to minimize latency, as pointed out earlier. Because we are computing multiple outcomes in parallel, which msb to observe for the need of a right shift is an interesting question. Preserving event sequentiality is the issue. The overflow "1" is added to Rh before the RS, and rounding comes later; hence, Rl.co determines the correct msb. In Regions 1 and 2, for example, Rl.co = 0 because Rl < 1, so Rh.msb should be examined. In Regions 3 and 4, because Rl.co = 1, (Rh + 1).msb should be examined.

Methods to determine the minimum number of regions are also of interest because this number affects the number of bits that we have to examine for prediction (as explained below). A minimum number of regions is assured by starting off with as many regions as needed and then coalescing them whenever possible. The necessity of a region is determined by events that can happen in the range of Rl and may affect Rh. In this case, rounding, overflow, and RS are the events of significance. Overflow here means that the value of Rl exceeds 1.

Informally speaking, the goal of rounding is to implement the rounding table with as little hardware as possible. Optimization is performed through judicious selection of the correction digits to reduce the number of adders and the complexity of the selection logic choosing the final result. Reducing the former saves hardware and reducing the latter improves speed. Reducing the former may have adverse effects on the latter; however, such an event is unlikely for current mainstream CMOS and ECL technologies. High-speed parallel adders are area-intensive, and low-speed adders are


time-consuming. Besides this area and/or time savings, a smaller number of adders also means less power consumption in ECL and less capacitive loading in CMOS. Both have a positive effect on speed. Our first goal is therefore to reduce the number of adders.

To reduce the number of adders, an obvious way is to reduce the number of possible outcomes in a rounding table. But this is not possible because the outcomes depend mainly on the rounding mode. However, high-speed adders generally employ some form of carry-lookahead network for carry propagation. Prime examples of this type of adder are the carry-select [65, 66], carry-lookahead, conditional-sum [17], and Ling-type adders described in Chapter 2. For these adders, only the carry-lookahead network needs to be duplicated for computing Rh + (0, 1)(3). Hence, adders computing adjacent results can be combined to produce a so-called compound adder, reducing the number of adders needed to implement a rounding table. This method is especially effective when coupled with the following observations on Table 6.3.

1. To know exactly which of the RS or NS columns Rh is in requires a full 2n-bit addition because we have to compute R.

2. To know exactly which row Rl is in requires an (n-1)-bit addition because we have to compute Rl. However,

3. To predict approximately which row Rl is in requires only an examination of Sl.msb and Cl.msb. This is illustrated in Table 6.4. When both msb's are 0, we know that Rl must be less than 1, corresponding to Regions 1 and 2 (Group 1). Similarly, when one of the msb's is 1, Rl must be in the range [0.5, 1.5), corresponding to Regions 2 or 3 (Group 2). Finally, when both msb's are 1, Rl must be in the range [1, 2), or in Regions 3 or 4 (Group 3).

Hence, rather than having to implement the whole rounding table, prediction allows us to only implement a group. Within a group, a compound adder computes the

(3) In fact, simultaneous computation of Rh + (0, 1, 2) is possible, as we shall see shortly.


outcomes, and we select the correction digits such that all outcomes in the group are computable by this adder. Different groups may compute outcomes that are numerically different. This is resolved by a predictor, which predictively adds a fixed constant (usually the difference) to Sh and Ch before sending them to the compound adder.

Table 6.4: Predicting the Regions with the msb's of Cl and Sl

Sl.msb Cl.msb | Regions | Group
00 | 1 or 2 | 1
01 | 2 or 3 | 2
10 | 2 or 3 | 2
11 | 3 or 4 | 3

A hardware model for rounding emerges naturally at this point. We digress to present it because selection of a prediction scheme depends on its details.

A Simple Hardware Model

Figure 6.2 shows a simple hardware model for rounding. In this model, we have an (n+1)-bit compound adder computing Ch + Sh + p + (0, 1), where p ∈ {0, 1} is determined by the predictor based on a certain prediction scheme (to be established later); a row of half adders that produces an empty slot for later insertion of p at the compound adder; a selector selecting among the possible outcomes; an Rl-adder that computes Rl; and some logic for determining lsb, gb, and sb. The selector also has to perform the implicit task of shifting the selected result as needed. The final selection multiplexors may get unwieldy if the number of results is large, necessitating a two-step multiplexing scheme. The following discussion assumes a single-stage selection scheme.
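The group prediction of Table 6.4 reduces to a two-bit sum, since each msb contributes exactly 0.5 to Rl and the remaining bits of Sl and Cl together contribute strictly less than 1. A brute-force check over 4-bit fractions (the function and bit width are illustrative) confirms the bounds:

```python
# Sl and Cl each lie in [0, 1): the msb contributes 0.5 and the remaining
# bits strictly less than 0.5, so the two msbs alone bound Rl = Sl + Cl.
def group(sl_msb, cl_msb):
    return 1 + sl_msb + cl_msb   # 1: Rl < 1;  2: 0.5 <= Rl < 1.5;  3: Rl >= 1
```

Group 2 overlaps both of its neighbors, which is exactly the tolerance the correction digits must absorb.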

In general, the selector needs to examine the appropriate Rh.msb, Rh.lsb, Rl.co, Rl.msb, and sb1 to select the correct result. To know exactly which region Rh is in, we need Rl.co and Rl.msb; to know exactly which column Rh is in, we need Rl.co and Rh.msb. This step is not optimizable because of our inability to predict the future. The Rh.lsb allows selection of the correct result in the RS column. This step is optimizable in that it may be simplified, or eliminated, through judicious selection of a prediction scheme and the correction digits. The need to examine

[Figure 6.2 (block diagram): S and C are split at the rounding point; a row of (n+1)-bit half adders feeds an (n+1)-bit compound adder steered by the predictor, in parallel with the Rl-adder and the lsb/gb/sb logic; the selector drives the final selection mux, producing the Rh and Rl portions of the result.]

Figure 6.2: A Hardware Model for IEEE Rounding

sb1 depends on the sizes of the regions in the rounding table and is not always needed. But when it is needed, it is not optimizable. The size of the regions in rounding Table 6.3, for example, is 0.5. The smaller the region, the more bits one potentially has to examine. A single-point region requires an examination of all bits. For RU, only the former 4 bits, Rh.msb, Rh.lsb, Rl.co, and Rl.msb, need to be examined. Normally, these bits are known at approximately the same time, with the exception of Rh.lsb, which is known much sooner. A circuit-level optimization is possible here. The paths computing Rh.msb in the compound adder and Rl.co and Rl.msb in the Rl-adder can be selectively sped up for early determination of the column and region. After the correct result is selected, lsb, gb, and sb can then be determined. These bits are needed for the fix-up step in RN.
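The compound adder itself can be modeled in a few lines. This is a hypothetical ripple-style model for illustration (a real design shares a parallel-prefix carry-lookahead network): one generate/propagate evaluation yields the carries for carry-in 0 and carry-in 1, giving the sum and the incremented sum together.

```python
def compound_add(a, b, n):
    # n-bit compound adder: one generate/propagate pass yields per-bit
    # carries for cin = 0 AND cin = 1, producing S and S+1 simultaneously.
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(n)]
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(n)]
    c0, c1 = [0], [1]                      # carries into each bit position
    for i in range(n):
        c0.append(g[i] | (p[i] & c0[i]))
        c1.append(g[i] | (p[i] & c1[i]))
    s0 = sum((p[i] ^ c0[i]) << i for i in range(n)) | (c0[n] << n)
    s1 = sum((p[i] ^ c1[i]) << i for i in range(n)) | (c1[n] << n)
    return s0, s1
```

Only the carry chain is duplicated; the generate/propagate terms and the final XOR stage are shared, which is why adders computing adjacent results combine cheaply.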


6.3.1.2 Selecting a Valid Prediction Scheme

The number of prediction schemes available to the predictor is an exponential function of the number of regions in a rounding table. Not all of these prediction schemes are valid, however, because the outcomes in all groups must be computable by the compound adder. Investigating possible prediction schemes is important because they affect the complexity of the selector logic. The next goal of the design should be to reduce the complexity of this selector logic, the general guideline being to avoid having to examine Rh.lsb. The complexity of the predictor logic, on the other hand, depends on the prediction scheme and is relatively insignificant because of the small number of literals involved.

Using Table 6.3 as an example, because there are 4 regions, we have a total of 16 possible prediction schemes, but only one is valid using the simple hardware model, as shown in the prediction table(4) (Table 6.5). In this particular case, the prediction scheme, which corresponds to a simple logical OR of Cl.msb and Sl.msb, can be easily guessed from Table 6.4 because a prediction "1" needs to be added whenever we know that we may be in Region 3. In general, an exhaustive search may be needed before a suitable prediction scheme can be found. It is interesting to note that this prediction scheme is unique for the simple model and corresponds to Algorithm 3 reported by Santoro et al. [63].

Table 6.5: Prediction Table for Simple Model

Sl.msb Cl.msb | Regions | Predictor Action
00 | 1 or 2 | Rh
01 | 2 or 3 | Rh + 1
10 | 2 or 3 | Rh + 1
11 | 3 or 4 | Rh + 1

(4) The prediction table lists the action of the predictor, not that of the compound adder, which has to pre-compute all results.


After a prediction scheme is chosen, we can select the correction digits for Table 6.3. Based on the prediction table, the allowable outcomes in Group 1 are Rh + (0, 1); hence, the digit "2" in Regions 1 and 2 must be discarded. Similarly, the digit "0" in Region 2 and the digit "3" in Region 3 must be discarded because the allowed outcomes in this group are now Rh + (1, 2) as a result of prediction. The digit "3" in Region 4 also needs to be discarded for the same reason, leaving only 3 pairs of selectable correction digits in the table. Table 6.6 lists all 8 possible digit selections for the simple model. In the table, only the RS columns are shown; all NS columns are the same as those in Table 6.3. To simplify notation, the actions corresponding to the cases Rh.lsb = 0 and Rh.lsb = 1 are denoted c1/c2, where c1 and c2 are the valid correction digits to be added to Rh. Digit selection scheme 8 does not require examination of Rh.lsb; there is thus an advantage in using this scheme over the others.

Table 6.6: Possible Digit Selections for Round Up in the Simple Model

Range of Rl    | 1   | 2   | 3   | 4 | 5   | 6   | 7   | 8
0 ≤ Rl < 0.5   | 0/1 | 0/1 | 0/1 | 0/1 | 1 | 1   | 1   | 1
0.5 ≤ Rl < 1   | 1   | 1   | 1   | 1 | 1   | 1   | 1   | 1
1 ≤ Rl < 1.5   | 1/2 | 1/2 | 2   | 2 | 1/2 | 1/2 | 2   | 2
1.5 ≤ Rl < 2   | 1/2 | 1   | 1/2 | 1 | 1/2 | 1   | 1/2 | 2

Improving the Simple Model

Other solutions are possible if we relax the constraint of Rh + (0, 1) in the compound adder of the simple model. An improved rounding hardware model is given in Figure 6.3. This model differs from the simple one in two ways:

1. The compound adder is reduced by 1 bit. The inputs to the adder come from the most significant n bits of Sh and Ch, enabling it to compute Rh + (0, 1, 2). The outcome Rh + 1 is accounted for as follows. When Rh.lsb = 0, it is set to 1 and Rh is selected. When Rh.lsb = 1, it is set to 0 and Rh + 2 is selected. In either case no carry propagation is generated.

[Figure 6.3 (block diagram): as Figure 6.2, but with an n-bit half adder row plus a full adder (FA), an n-bit compound adder, added Rh.lsb logic, and final selection muxes; the predictor and Rl-adder operate in parallel as before.]

Figure 6.3: An Improved Hardware Model for IEEE Rounding

Note that if one views the last element in the half adder array in the simple model as an adder of length j (j = 1 in this case) and defines k as the length of the compound adder, then we can compute Rh + (0, ..., 2^j), where j + k = n + 1. When viewed in this light, another optimization reveals itself. A prediction "1" controlled by a predictor can be added to the j-bit adder (j-adder). This option may seem redundant for some rounding modes like RU. For others (e.g., RI), it is absolutely necessary, as we shall see.

2. Additional logic must be used to generate Rh.lsb. j is a major factor in determining the complexity of this Rh.lsb logic. For this reason, a small j is to be preferred. If desired, the fix-up logic for RN mentioned in Section 6.2 can be incorporated into this Rh.lsb logic.

The improved model allows more possible outcomes in a group, resulting in a greater degree of freedom in selecting a prediction scheme and the correction digits. More important, it often provides solutions to a rounding table where the simple model fails.


With this improved model, 8 prediction schemes are possible for Table 6.3, as listed in Table 6.7. Some of these prediction schemes are more interesting than others. Schemes 3 and 5, for example, use Cl.msb and Sl.msb as the predictor, while scheme 8 uses no prediction at all. Other prediction schemes, however, may yield a simpler selector logic. Based on this prediction table, it can be easily shown that there are a total of 576 ways to select the correction digits, as opposed to 8 for the simple model.

Table 6.7: Possible Prediction Schemes for Round Up for the Improved Model

Sl.msb Cl.msb | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
01 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0
10 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0
11 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0

6.3.1.3 Summary of the Rounding Procedure

We summarize our procedure as follows.

1. For a particular rounding mode, identify the shifting possibilities in the normalization step. Determine the ranges of Rh and Rl. This step may require a careful analysis of the operation (i.e., addition, multiplication, etc.).

2. Determine the maximum number of regions needed by examining events that can happen in the range of Rl and can affect Rh. Examples of such events are shifts during the normalization step and rounding. Construct the rounding table, coalescing the regions whenever possible.

3. Determine the number of bits in Rl that need to be examined and construct a prediction table.

4. Based on the rounding table, develop a rounding hardware model.

5. Using this model and the rounding and prediction tables, find all optimal solutions. An optimal solution has the minimum number of adders and the simplest selector logic. In general, more than one optimal solution is possible. Lesser constraints, such as the complexity of the predictor logic, can be used to further screen these solutions.
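Step 5 lends itself to a brute-force search. The toy sketch below illustrates the idea on the right-shift correction digits of Table 6.8 (Section 6.3.2), under a deliberately simplified cost model of my own: the cost of an assignment is taken to be the number of distinct correction digits it uses, standing in for the number of pre-computed results.

```python
from itertools import product

# One admissible-digit set per (region, Rh.lsb) case, taken from the
# right-shift column of Table 6.8 (round to infinity).
CHOICES = [
    (0, 1),  # Region 1, Rh.lsb = 0
    (1, 2),  # Region 1, Rh.lsb = 1
    (2, 3),  # Region 2, Rh.lsb = 0
    (1, 2),  # Region 2, Rh.lsb = 1
    (2, 3),  # Region 3, Rh.lsb = 0
    (3, 4),  # Region 3, Rh.lsb = 1
]

def optimal_assignments(choices):
    """Enumerate one digit per case; keep the assignments that use the
    fewest distinct digits (fewest results to pre-compute)."""
    best_cost, best = None, []
    for assign in product(*choices):
        cost = len(set(assign))
        if best_cost is None or cost < best_cost:
            best_cost, best = cost, [assign]
        elif cost == best_cost:
            best.append(assign)
    return best_cost, best

cost, solutions = optimal_assignments(CHOICES)
# Under this cost model, two distinct digits suffice for the table above.
```

Real step-5 screening also weighs the selector and predictor logic, which this toy model ignores.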

6.3.2 Round to Infinity

To derive a rounding scheme for RI, we use the procedure developed earlier in Section 6.3.1.3. In step 1, the range of Rh is the same as that for RU, requiring 2 columns: NS and RS. In step 2, the range of Rl is again the same as that for RU, but the rounding threshold is now different. For RU the rounding threshold is 0.5; for RI it is 0. The range of Rl must be divided into three: Rl = 0, Rl ∈ (0, 1], and Rl ∈ (1, 2). A rounding table for RI is given in Table 6.8. The considerations for obtaining the entries in the table are similar to those for RU. Table 6.9 is the prediction table for this rounding mode.

Table 6.8: Rounding Table for Round to Infinity in Binary Multipliers

    Region  Range of Rl   No Shift  Right Shift
    1       Rl = 0        Rh        Rh.lsb = 0: Rh + {0, 1}
                                    Rh.lsb = 1: Rh + {1, 2}
    2       0 < Rl <= 1   Rh + 1    Rh.lsb = 0: Rh + {2, 3}
                                    Rh.lsb = 1: Rh + {1, 2}
    3       1 < Rl < 2    Rh + 2    Rh.lsb = 0: Rh + {2, 3}
                                    Rh.lsb = 1: Rh + {3, 4}

Several observations can be made in relation to Table 6.8. First, Rl = 0 has its own region. This has a speed implication because the selector logic now has to await full computation of Rl and determination of gb and sb before being able to select the final result. Second, in Regions 2 and 3 corresponding to Group 2 (Table 6.9), there are a total of three possible outcomes: Rh + 1, Rh + 2, and Rh + 3. Thus, there is no

solution using the simple model. Third, there are a total of 4 possible outcomes in the table; therefore, even with the improved model, prediction is needed. Since we can only compute up to Rh + 2 in the compound adder, we need to add a prediction "1" to the j-adder as soon as we know we may be in Region 3. This corresponds to the condition of at least one of the msb's of Sl or Cl being true (Table 6.9). This prediction scheme is unique and corresponds to a simple logical OR operation. We have now completed steps 3 and 4 of our procedure. For step 5, there are a total of 16 different solutions because of the freedom in selecting the correction digits.

Table 6.9: Prediction Table for the Round to Infinity Mode

    Sl.msb  Cl.msb  Regions  Group
    0       0       1 or 2   1
    0       1       2 or 3   2
    1       0       2 or 3   2
    1       1       3        3

The number of possible outcomes, their differences, and the sizes of the regions in a rounding table are often an indicator of the degree of difficulty in implementing a certain rounding mode. In this sense, RI is harder to implement than RN.
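Table 6.8 can also be checked numerically. The sketch below assumes the chapter's conventions as I read them: Rh is the pre-shift result counted in ulps, Rl in [0, 2) is the low part in the same units, a right shift discards the final lsb, and RI rounds up whenever any discarded part is nonzero (a ceiling):

```python
from fractions import Fraction
from math import ceil

# Correction digits from Table 6.8: the no-shift column per region, and the
# right-shift column per (region, Rh.lsb), where every listed digit must
# yield the same final result after the shift.
NO_SHIFT = {1: 0, 2: 1, 3: 2}
RIGHT_SHIFT = {(1, 0): {0, 1}, (1, 1): {1, 2},
               (2, 0): {2, 3}, (2, 1): {1, 2},
               (3, 0): {2, 3}, (3, 1): {3, 4}}

def region(rl):
    return 1 if rl == 0 else (2 if rl <= 1 else 3)

for rh in range(8, 32):
    for rl in (Fraction(k, 8) for k in range(16)):    # Rl sweeps [0, 2)
        assert rh + NO_SHIFT[region(rl)] == ceil(rh + rl)
        for c in RIGHT_SHIFT[(region(rl), rh & 1)]:   # all listed digits agree
            assert (rh + c) >> 1 == ceil(Fraction(rh + rl, 2))
```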

6.3.3 Round to Zero

A rounding table for RZ is shown in Table 6.10. Being a simple truncation, its table is the simplest among the rounding modes. Only two regions are required. Both hardware models have solutions, but the improved one has more because of the larger degree of freedom in selecting the correction digits. Not much more can be said about this rounding table other than the fact that if we select the digit "0" in Region 1 and the digit "1" in Region 2, then the selector does not have to examine Rh.msb. It only needs to right shift the result when Rh.msb = 1.

Table 6.10: Rounding Table for Round to Zero in Binary Multipliers

    Region  Range of Rl   No Shift  Right Shift
    1       0 <= Rl < 1   Rh        Rh + {0, 1}
    2       1 <= Rl < 2   Rh + 1    Rh.lsb = 0: Rh + {0, 1}
                                    Rh.lsb = 1: Rh + {1, 2}

6.4 Summary

In FP multiplication, the right shift in the normalization step gives rise to many possible rounding schemes. This chapter presented a systematic procedure for selecting among these schemes. The procedure starts with the construction of a rounding table and then uses prediction to reduce the number of possible outcomes in the rounding table. Constructing a rounding table involves examining the range of the result and the shifting possibilities in the normalization step, while selecting a prediction scheme depends on the details of a hardware model.

Two rounding hardware models are presented. The first has been shown to be identical to that reported by Santoro et al. [63]. The second is more powerful, providing solutions where the first fails. Both models pre-compute all outcomes needed for a rounding table in parallel and select the correct one. Optimization guidelines have been outlined in each step to reduce the number of adders and the complexity of the selector logic.

Applying this approach to the four IEEE rounding modes for conventional binary multipliers reveals that the RM and RP modes are more difficult to implement than the RN mode in terms of the number of possible outcomes and their maximum differences. RZ requires the least amount of hardware. Though not discussed in this dissertation, the procedure can also be applied to multipliers using a signed-digit representation, as shown by Quach et al. [34]. The procedure can also be applied to other operations such as addition and division.

Chapter 7

Conclusion and Future Work

7.1 Conclusion

Algorithms for the most frequently used floating-point (FP) operations (add, multiply, divide, and square root) in existing FPUs can be sped up considerably at a small extra hardware cost.

The latency of an FP add can be improved by about 20% by combining rounding with the mantissa addition step and by the use of leading-one prediction. The algorithm has been verified through extensive computer simulation and silicon implementation, which has a simulated nominal delay of 17ns.

The latency of a simple radix-4 divider can be halved by using a slightly different quotient-digit encoding scheme and by duplicating the next-remainder logic. This allows quotient digit table lookup and computation of the next partial remainder to proceed in parallel. Because of this reduced latency, three such simple dividers can be staged (or one can be recycled with a 3X clock) to obtain a radix-64 divider. The same idea can also be applied to other SRT-based division schemes and to square root operations.

Floating-point operations require rounding to ensure representability and predictability of the result. Rounding is an expensive operation in terms of hardware and latency. A popular fast rounding method is for the mantissa adder to pre-compute multiple results to simplify the subsequent rounding operation. In general, the more results the mantissa adder has to pre-compute, the more hardware it potentially requires.

The rounding method proposed in this dissertation minimizes the number of these results. Using this method, this research showed that, for the IEEE standard, the round to infinity modes require more results to be computed than the round to nearest mode does, with the round to zero mode requiring the least number of results.
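For concreteness, the SRT recurrence behind the divider results summarized above can be sketched in software. The following is a minimal radix-2 variant written for illustration (the divider in this dissertation is radix-4 with a redundant digit set); it assumes a divisor normalized to [1/2, 1) and a dividend below the divisor:

```python
from fractions import Fraction

def srt_radix2_divide(x, d, bits=32):
    """Radix-2 SRT sketch: quotient digits in {-1, 0, 1} chosen from the
    shifted partial remainder, preserving the invariant |p| <= d."""
    assert Fraction(1, 2) <= d < 1 and 0 <= x < d
    p, q = x, Fraction(0)
    for i in range(1, bits + 1):
        two_p = 2 * p
        if two_p >= Fraction(1, 2):    # only a few msb's of p are needed here
            digit = 1
        elif two_p <= Fraction(-1, 2):
            digit = -1
        else:
            digit = 0
        p = two_p - digit * d          # next partial remainder
        q += Fraction(digit, 2**i)     # accumulate the redundant quotient
    return q, p                        # x/d - q == p / (d * 2**bits)

q, rem = srt_radix2_divide(Fraction(1, 3), Fraction(3, 4), bits=40)
# q approximates (1/3) / (3/4) = 4/9 to within 2**-40
```

A radix-4 version replaces the three-way selection with a table lookup on a few bits of the partial remainder and divisor, which is the lookup that the scheme summarized above overlaps with the next-remainder computation.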

7.2 Future Work

This work can be extended in several directions.

Formal Approach

Current FPU design involves more art than science. Existing designs rely heavily on simulation to ensure functional correctness. This verification strategy not only is time consuming, but also fails to guarantee a bug-free design. While formal approaches exist, they apply only on a very localized scale. The rounding work reported earlier in this dissertation is such an example.

FPU Algorithm and Dataflow Compiler

FP algorithms have matured enough to be incorporated into CAD tools. Current FPU designs are based on cells, attempting to reuse as many basic blocks as possible. An expert-system type of approach should be possible and useful. Given a set of constraints in terms of performance and area, the system figures out the optimal solution, perhaps complete with all the rounding modes, etc. This tool would require an accurate area model for making decisions. There have been recent developments in modeling the area of memory structures, as reported by Mulder et al. [67]. Even a less aggressive attempt can be quite beneficial in this area.

A dataflow compiler is a step down from an algorithm compiler but is more general. Current FPU design begins with a high-level hardware description, followed by massive simulation to ensure timing and functionality. In essence, the design is performed at the RTL level, not at the dataflow level. With a dataflow compiler, a designer would define a dataflow for each instruction to be included in the design and specify the manner in which exceptions are to be handled. It is then up to the compiler to perform hardware minimization and logic optimization. Many of the optimization techniques commonly used in a language compiler, such as common subexpression

elimination and dataflow analysis to detect certain data dependences, are directly applicable in this area.

VLSI Regularity and Irregularity

Regularity is an important issue in VLSI, and arithmetic units are no exception. Intuitively, the notion of regularity is easy to grasp, but a formal definition has proven difficult. In general, there are two types of regularity in VLSI: repetitive and recursive. An example of a repetitively regular structure is a ripple-carry adder; in fact, any structure that can be partitioned in a bit-slice fashion can be considered repetitively regular. A rare example of a recursively regular structure can be found in the multiplier reported by Luk and Vuillemin [68]. In VLSI, structures of repetitive regularity are more useful because they reduce design, layout, and verification time. But from a synthesis tool point of view, both types of regularity are equally desirable.

Informally, one can make the argument that the fewer elements a logic unit uses, the more regular it is. In fact, this argument has often been made in the literature. But such a definition would have difficulty determining the size of the elements. If one design uses a carry-ripple adder and another uses a CLA, are both designs equally regular because they use the same number of elements? Regularity of a logical element can also be defined in terms of its description length on a Turing machine. But such a definition has proven less useful because irregular patterns can often be described concisely. Using this definition, for example, a Wallace tree is as regular as a multiplier tree built with 4-2 adders (or 5-3 counters), though in reality the latter is much easier to lay out.

The opposite side of regularity is irregularity. The motivation for intentional irregularity is speed. In an FPU, abnormal cases such as Not a Number are almost always treated by a different path. This is an example of irregularity because information flows irregularly. At a lower level, irregularity often arises from the need to speed up a critical path, such as reducing its loading or sizing up its devices. It is useful to also have formal definitions of irregularity, intentional and unintentional.

Partial Evaluation

Partial evaluation refers to a technique in which a computation is carried out only partially. In this way, only the most frequently used operations need to be supported in hardware and can be made fast. Less frequently used operations are supported in software. This technique has many applications. In an FALU, for example, it is possible to reduce the alignment and normalization shifts to a simple muxing operation because large-distance alignment or normalization is rare. Another example: current FPUs store their results in a register file in the same format as the operands. It is possible to store them in a redundant form to speed up later operations. Using this method, for example, an FALU can do away with the final carry-propagate addition step. Multiplication in this format does present a problem, but not an unsolvable one. A study of the distribution of the input operands is needed for this method to be effective.

Redundant Number Representation

Currently, all integers use a 2's complement representation, and FPUs almost always use the IEEE FP number representation. Though other number systems, such as the residue number system, have been proposed in the literature, their use has been limited to certain DSP applications. In a computer, an FPU only has to deal with the outside world when it is storing or loading data. If the computation is long enough, it is possible to represent a number in a signed-digit format. In this way, the intermediate computation can be sped up. Also, it is possible to select a signed-digit encoding scheme such that it is a superset of the conventional numbers. This means that loading data into the FPU requires no conversion at all; only storing data requires conversion, which is a less frequent event.
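The superset property is easy to demonstrate. The sketch below (an illustration, with hypothetical helper names) uses radix-2 signed digits {-1, 0, 1}: a conventional binary string is already a valid signed-digit string, so loading is free, while storing costs one carry-propagate subtract:

```python
def sd_value(digits):
    """Value of a radix-2 signed-digit string (msb first, digits in {-1, 0, 1})."""
    value = 0
    for d in digits:
        assert d in (-1, 0, 1)
        value = 2 * value + d
    return value

def load(conventional_bits):
    """Loading: digits {0, 1} are a subset of {-1, 0, 1}; no conversion."""
    return list(conventional_bits)

def store(digits):
    """Storing: split into positive and negative parts; a single
    carry-propagate subtract recovers conventional binary."""
    pos = sum(2**i for i, d in enumerate(reversed(digits)) if d == 1)
    neg = sum(2**i for i, d in enumerate(reversed(digits)) if d == -1)
    return pos - neg

assert sd_value(load([1, 1, 0, 1])) == 13   # 1101 loads unchanged
assert store([1, 0, -1, 1]) == 8 - 2 + 1    # a redundant encoding of 7
```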

Bibliography

[1] Y. Shimazu, T. Kengaku, T. Fujiyama, E. Teraoka, T. Ohno, T. Tokuda, O. Tomisawa, and S. Tsujimichi, "A 50MHz 24b Floating-Point DSP," in Proc. of the IEEE International Solid-State Circuits Conference, pp. 44–45, Feb. 1989.

[2] J. Foley, A. van Dam, S. K. Feiner, and J. F. Hughes, Computer Graphics: Principles and Practice. Menlo Park, California: Addison-Wesley, 2nd ed., 1987.

[3] The Institute of Electrical and Electronics Engineers, Inc., 345 East 47th Street, New York, NY 10017, USA, ANSI/IEEE Std 754-1985: IEEE Standard for Binary Floating-Point Arithmetic, 1985.

[4] A. Inoue and K. Takeda, "Performance Evaluation for Various Configurations of Superscalar Processors," Computer Architecture News, vol. 3, no. 1, pp. 4–14, Mar. 1993.

[5] S. Majerski, "On Determination of Optimal Distribution of Carry Skips in Adders," IEEE Transactions on Computers, vol. EC-16, no. 1, pp. 45–58, Feb. 1967.

[6] E. R. Barnes and V. G. Oklobdzija, "New Multilevel Scheme for Fast Carry-Skip Addition," IBM Technical Disclosure Bulletin, vol. 27, no. 11, pp. 6785–6787, Apr. 1985.

[7] A. Guyot, B. Hochet, and J. M. Muller, "A Way to Build Efficient Carry-Skip Adders," IEEE Transactions on Computers, vol. C-36, no. 4, pp. 1144–1151, Oct. 1987.

[8] S. Turrini, "Optimal Group Distribution in Carry-skip Adders," in Proc. of the 9th Symposium on Computer Arithmetic, pp. 96–103, 1989.

[9] P. K. Chen and M. D. F. Schlag, "Analysis and Design of CMOS Manchester Adders with Variable Carry Skip," in Proc. of the 9th Symposium on Computer Arithmetic, pp. 86–95, 1989.

[10] C. Mead and L. Conway, Introduction to VLSI Systems. Reading, Massachusetts: Addison-Wesley, 1980.

[11] R. P. Brent and H. T. Kung, "A Regular Layout for Parallel Adders," IEEE Transactions on Computers, vol. C-31, no. 3, pp. 260–264, Mar. 1982.

[12] O. L. MacSorley, "High-Speed Arithmetic in Binary Computers," IRE Transactions on Computers, vol. 49, pp. 67–91, Jan. 1961.

[13] H. Ling, "High Speed Binary Parallel Adder," IEEE Transactions on Computers, vol. EC-15, no. 5, pp. 799–802, Oct. 1966.

[14] H. Ling, "High Speed Binary Adder," IBM Journal of Res. and Dev., vol. 25, no. 3, pp. 156–166, May 1981.

[15] G. Bewick, P. Song, G. DeMicheli, and M. J. Flynn, "Approaching a Nanosecond: A 32-Bit Adder," in Proc. of International Conference on Computer Design, pp. 221–224, 1988.

[16] R. W. Doran, "Variants of an Improved Carry Lookahead Adder," IEEE Transactions on Computers, vol. C-37, no. 9, pp. 1110–1113, Sep. 1988.

[17] S. Waser and M. J. Flynn, Introduction to Arithmetic for Digital Systems Designers. New York: Holt, Rinehart and Winston, 1982.

[18] J. P. Uyemura, Fundamentals of MOS Digital Integrated Circuits. Reading, Massachusetts: Addison-Wesley, 1988.

[19] E. J. McCluskey, Logic Design Principles with Emphasis on Testable Semicustom Circuits. Englewood Cliffs, New Jersey: Prentice-Hall, 1986.

[20] L. W. Nagel, "SPICE2: A Computer Program to Simulate Semiconductor Circuits," Tech. Rep. ERL Memo ERL-M520, University of California at Berkeley, May 1975.

[21] I. S. Hwang and A. L. Fisher, "A 3.1ns 32b CMOS Adder in Multiple Output DOMINO Logic," in Proc. of the IEEE International Solid-State Circuits Conference, pp. 140–141, 1988.

[22] V. G. Oklobdzija and E. R. Barnes, "Some Optimal Schemes for ALU Implementation in VLSI Technology," in Proc. of the 7th Symposium on Computer Arithmetic, pp. 2–8, Jun. 1985.

[23] N. T. Quach and M. J. Flynn, "High-Speed Addition in CMOS," Tech. Rep. CSL-TR-90-415, Stanford University, Feb. 1990.

[24] F. A. Ware, W. H. McAllister, J. R. Carlson, D. K. Sun, and R. J. Vlach, "64 Bit Monolithic Floating Point Processors," IEEE Journal of Solid-State Circuits, vol. SC-17, no. 5, pp. 898–907, Oct. 1982.

[25] W. P. Hays, R. N. Kershaw, L. E. Bays, J. R. Boddie, E. F. Fields, R. L. Freyman, C. J. Garen, J. Hartung, J. J. Klinikowski, C. R. Miller, K. Mondal, H. S. Moscovits, Y. Rotblum, W. A. Stocker, J. Tow, and L. V. Tran, "A 32-bit VLSI Digital Signal Processor," IEEE Journal of Solid-State Circuits, vol. SC-20, no. 5, pp. 998–1004, Oct. 1985.

[26] B. J. Benschneider, W. J. Bowhill, E. M. Cooper, M. N. Gavrielov, P. E. Gronowski, V. K. Maheshwari, V. Peng, J. D. Pickholtz, and S. Samudrala, "A Pipelined 50-MHz CMOS 64-bit Floating-Point Arithmetic Processor," IEEE Journal of Solid-State Circuits, vol. 24, no. 5, pp. 1317–1323, Oct. 1989.

[27] H. P. Sit, M. R. Nofai, and S. Kim, "An 80MFLOPS Floating-Point Engine in the i860 Processor," in Proc. of International Conference on Computer Design, pp. 374–379, 1989.

[28] C. Nelson, "Cydrom-5 FPU." Private communication, 1986.

[29] E. Hokenek and R. K. Montoye, "Leading-Zero Anticipator (LZA) in the IBM RISC System/6000 Floating-Point Execution Unit," IBM Journal of Res. and Dev., vol. 34, no. 1, pp. 71–77, Jan. 1990.

[30] M. Birman, A. Samuels, G. Chu, T. Chuk, L. Hu, J. McLeod, and J. Barns, "Developing the WTL3170/3171 Sparc Floating-Point Coprocessor," IEEE Micro, pp. 55–64, Feb. 1990.

[31] G. Blanck and S. Krueger, "SuperSparc: A Fully Integrated Superscalar Processor," in Proceedings of the Hot Chips Symposium III, 1991.

[32] D. Bural, M. Darley, M. Gill, P. Groves, D. Steiss, and T. Wolf, "The Megacell Differentiated Floating-Point Product Family," in Proceedings of the Hot Chips Symposium III, 1991.

[33] R. N. Kershaw, L. E. Bays, R. L. Freyman, J. Klinikowski, C. R. Miller, K. Mondal, H. S. Moscovits, W. A. Stocker, and L. V. Tran, "A Programmable Digital Signal Processor with 32b Floating-Point Arithmetic," in Proc. of the IEEE International Solid-State Circuits Conference, pp. 92–93, 1985.

[34] N. T. Quach, N. Takagi, and M. J. Flynn, "On Fast IEEE Rounding," Tech. Rep. CSL-TR-91-459, Stanford University, Mar. 1991.

[35] N. Takagi, H. Yasuura, and S. Yajima, "High-Speed VLSI Multiplication Algorithm with a Redundant Binary Addition Tree," IEEE Transactions on Computers, vol. C-34, no. 9, pp. 789–796, Sep. 1985.

[36] S. Kuninobu, T. Nishiyama, H. Edamatsu, T. Taniguchi, and N. Takagi, "Design of High Speed MOS Multiplier Using Redundant Binary Representation," in Proc. of the 8th Symposium on Computer Arithmetic, pp. 80–86, 1987.

[37] N. T. Quach and M. J. Flynn, "An Improved Algorithm for High-Speed Floating-Point Addition," Tech. Rep. CSL-TR-90-442, Stanford University, Aug. 1990.

[38] N. T. Quach and M. J. Flynn, "Design and Implementation of the SNAP Floating-Point Adder," Tech. Rep. CSL-TR-91-501, Stanford University, Dec. 1991.

[39] M. P. Farmwald, On the Design of High Performance Digital Arithmetic Units. PhD thesis, Stanford University, Aug. 1981.

[40] D. W. Sweeney, "An Analysis of Floating-Point Addition," IBM Systems Journal, vol. 4, pp. 31–42, 1965.

[41] S. F. Anderson, J. G. Earle, R. E. Goldschmidt, and D. M. Powers, "The IBM System/360 Model 91: Floating-Point Execution Unit," IBM Journal of Res. and Dev., vol. 11, pp. 34–53, Jan. 1967.

[42] J. B. Gosling, "Some Tricks of the (Floating-Point) Trade," in Proc. of the 6th Symposium on Computer Arithmetic, pp. 218–220, 1983.

[43] S. Vassiliadis, D. S. Lemon, and M. Putrino, "S/370 Sign-Magnitude Floating-Point Adder," IEEE Journal of Solid-State Circuits, vol. 24, no. 4, pp. 1062–1070, Aug. 1989.

[44] N. P. Jouppi, "MultiTitan Floating Point Unit," Tech. Rep. 87/8, WRL Research Report, Apr. 1988.

[45] D. E. Atkins, "Higher-radix Division Using Estimates of the Divisor and Partial Remainders," IEEE Transactions on Computers, vol. C-17, pp. 925–934, Oct. 1968.

[46] K. G. Tan, "The Theory and Implementation of High-Radix Division," in Proc. of the 4th Symposium on Computer Arithmetic, pp. 154–163, Oct. 1978.

[47] J. E. Robertson, "A New Class of Digital Division Methods," IRE Transactions on Computers, vol. EC-5, pp. 65–73, Jun. 1956.

[48] K. D. Tocher, "Techniques of Multiplication and Division for Automatic Binary Computers," Quarterly Journal of Mech. Appl. Math., vol. 11, no. 3, pp. 364–384, 1958.

[49] D. E. Atkins, "Design of the Arithmetic Units of Illiac III: Use of Redundancy and Higher Radix Methods," IEEE Transactions on Computers, vol. C-19, no. 8, pp. 720–723, Aug. 1970.

[50] G. S. Taylor, "Radix 16 SRT Dividers with Overlapped Quotient Selection Stages," in Proc. of the 7th Symposium on Computer Arithmetic, pp. 64–71, 1985.

[51] J. Fandrianto, "Algorithm for High Speed Shared Radix 8 Division and Square-Root," in Proc. of the 10th Symposium on Computer Arithmetic, pp. 68–75, 1990.

[52] T. M. Carter and E. R. Robertson, "Radix-16 Signed-Digit Division," IEEE Transactions on Computers, vol. C-39, no. 12, pp. 1424–1433, Dec. 1990.

[53] M. D. Ercegovac, T. Lang, and R. Modiri, "Implementation of Fast Radix-4 Division with Operands Scaling," in Proc. of International Conference on Computer Design, pp. 486–489, 1988.

[54] M. J. Flynn, "On Division by Functional Iteration," IEEE Transactions on Computers, vol. C-19, no. 8, pp. 702–706, Aug. 1970.

[55] W. Kahan, "Checking Whether Floating-Point Division Is Correctly Rounded." University of California at Berkeley, unpublished class notes, Apr. 1987.

[56] J. Fandrianto, "Algorithm for High Speed Shared Radix 4 Division and Radix 4 Square-Root," in Proc. of the 8th Symposium on Computer Arithmetic, pp. 73–79, 1987.

[57] M. D. Ercegovac and T. Lang, "On-the-fly Conversion from Redundant into Conventional Representation," IEEE Transactions on Computers, vol. C-36, pp. 895–897, Jul. 1987.

[58] M. D. Ercegovac and T. Lang, "Radix-4 Square Root without Initial PLA," in Proc. of the 9th Symposium on Computer Arithmetic, pp. 161–168, 1989.

[59] W. D. Clinger, "How to Read Floating-Point Numbers Accurately," in ACM SIGPLAN '90 Conference on Programming Language Design and Implementation, pp. 92–101, 1990.

[60] G. L. Steele, Jr. and J. L. White, "How to Print Floating-Point Numbers Accurately," in ACM SIGPLAN '90 Conference on Programming Language Design and Implementation, pp. 112–126, 1990.

[61] C. S. Wallace, "A Suggestion for a Fast Multiplier," IEEE Transactions on Electronic Computers, vol. EC-13, pp. 14–17, Feb. 1964.

[62] D. D. Gajski, "Parallel Compressors," IEEE Transactions on Computers, vol. C-29, pp. 393–398, May 1980.

[63] M. R. Santoro, G. Bewick, and M. A. Horowitz, "Rounding Algorithms for IEEE Multipliers," in Proc. of the 9th Symposium on Computer Arithmetic, pp. 176–183, 1989.

[64] R. Graham, D. E. Knuth, and O. Patashnik, Concrete Mathematics: A Foundation for Computer Science. Menlo Park, California: Addison-Wesley, 1989.

[65] M. Uya, K. Kaneko, and J. Yasui, "A CMOS Floating Point Multiplier," IEEE Journal of Solid-State Circuits, vol. SC-19, no. 8, pp. 697–702, Oct. 1984.

[66] P. Y. Lu, A. Jain, J. Kung, and P. H. Ang, "A 32-MFLOP 32b CMOS Floating-Point Processor," in Proc. of the IEEE International Solid-State Circuits Conference, pp. 28–29, 1988.

[67] J. M. Mulder, N. Quach, and M. Flynn, "An Area-utility Model for On-Chip Memories and Its Application," Tech. Rep. CSL-TR-90-345, Stanford University, Feb. 1990.

[68] W. K. Luk and J. Vuillemin, "Recursive Implementation of Optimal Time VLSI Integer Multipliers," in Proc. IFIP, pp. 155–168, 1983.
