409ps 4.7 FO4 64b Adder Based on Output Prediction Logic in 0.18um CMOS
Sheng Sun, Yi Han, Xinyu Guo, Kian Haur Chong, Larry McMurchie, and Carl Sechen
University of Washington, Seattle, WA 98195, USA
{shengsun, hanyi77, xyguo, khchong, larry, sechen}@ee.washington.edu
Abstract
We present a fast 64b adder based on Output Prediction Logic (OPL) that has a measured worst-case delay of 409ps, equivalent to 4.7 FO4 inverter delays for the TSMC 0.18um process that was used for fabrication. This normalized delay is 1.45X faster than the fastest previously reported 64b adder. The adder uses a modified radix-3 Kogge-Stone architecture and has 5 logic levels.
1. Introduction
An adder, particularly a 64b adder, is one of the basic components determining microprocessor performance. Output Prediction Logic (OPL) is a recently developed circuit technique that is twice as fast as domino logic, and several times faster than static CMOS logic [1]. This paper describes the architecture and design of a 64b adder based on OPL that has the lowest normalized delay ever reported.
OPL relies on the alternating nature of logical output values for inverting gates on a critical path. In static CMOS logic, every gate is inverting. Thus, in the worst case, every output on a critical path must fully transition from 0 to 1 or from 1 to 0, as shown in Fig. 1(a). OPL significantly reduces the worst-case delay of a critical path by predicting that every inverting gate output on the path will be a logic one after the transitions are completed. Since all gates are inverting, as is the case for static CMOS, the OPL predictions will be correct exactly one-half of the time. This means that only every other gate needs to make a logic level transition, as shown in Fig. 1(b). Therefore, OPL obtains significant speedups (at least 2X) over the underlying logic families (e.g., static CMOS or dynamic logic).
Of course, a one at every output is not a stable state for an inverting gate. The one will erode (possibly going to zero) in the latter gates of a circuit path. To prevent this erosion, each gate is disabled with a clocked footer device. The gates remain disabled until their inputs are ready for evaluation. In this manner, predicted output values are maintained until new input values dictate otherwise. Successive clocks are delayed by a clock separation as shown in Fig. 2. Only one (low-skew, i.e., fast) pull-down event has to occur for every two consecutive clock separations, because at most one of two adjacent gates in a critical path has to transition. Note that if two consecutive gates in a path both transition, then the falling of the first gate does not trigger the falling of the second gate, and therefore this path is non-critical and the clock separation is immaterial. We therefore see that the needed clock separations are quite small, about ½ of a low-skew pull-down event.
A chain of various OPL dynamic gates is shown in Fig. 3. When the clock is low, the gate output is precharged to a logic one. When the clock goes high, the gate is enabled for evaluation. Note that an OPL-dynamic gate looks exactly like a domino gate, but without the output inverter. OPL circuits achieve a 2X speedup compared to their domino counterparts because only every other gate on a critical path needs to be discharged. In contrast, the worst case for a chain of domino gates occurs when every gate pulls up. The same is true when OPL-differential logic (equivalently, OPL applied to differential cascode voltage switch logic) is compared to differential domino, as demonstrated in [14].
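The prediction argument can be illustrated with a toy model. The sketch below assumes an idealized chain of single-input inverting gates whose outputs are all precharged to logic one; the function names `settle_chain` and `opl_discharges` are hypothetical and are not taken from the paper. It counts how many gates must actually pull down once the input is known.

```python
def settle_chain(x, n_gates):
    """Final logic values at the outputs of n_gates cascaded inverting gates."""
    outputs = []
    value = x
    for _ in range(n_gates):
        value = 1 - value  # every gate on the path inverts
        outputs.append(value)
    return outputs


def opl_discharges(x, n_gates):
    """Number of gates whose predicted logic one turns out to be wrong,
    i.e. the gates that must pull their output down to zero."""
    return sum(1 for v in settle_chain(x, n_gates) if v == 0)


if __name__ == "__main__":
    for x in (0, 1):
        print(x, settle_chain(x, 6), opl_discharges(x, 6))
```

For either input value, the final outputs alternate, so no two adjacent gates both discharge and about half of the gates keep their predicted value.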
Fig. 1. (a) Static CMOS worst-case behavior; (b) OPL [1].
Fig. 2. Clock scheme for OPL.
Fig. 3. A chain of dynamic OPL gates.
2. OPL design techniques
Because of clock skew, larger separations than what would otherwise be optimal are necessary for a more robust design. Fig. 4 shows Monte Carlo simulations on an example circuit, which has an optimal worst-case delay of 210ps for uniform separations of 55ps. With clock skew, the delays at the 55ps separation are widely scattered. For a larger separation of 65ps that gives margin for skew tolerance, the worst-case delay increases by 23% to 260ps, but all the delays fall in a narrow range, which implies good robustness. The Monte Carlo simulations also show that for a given skew (±15ps in the example), the nominal separation does not need to be increased by an amount equal to the skew tolerance. The simulations are based on a Gaussian distribution of clock skew with ±15ps at ±3σ.
Fig. 4. Monte Carlo simulations (100 points) of an OPL circuit’s delay with clock skew.
Therefore, OPL favors designs with fewer logic levels (and clock levels) and larger separations, which require a smaller timing budget for skew tolerance. There is also the practical challenge of generating very small clock separations. A realistic clock separation time is at least 50ps for the 0.18um TSMC process.
If a considerably slower gate follows a fast gate, then the optimal clock separation between the fast gate and slow gate is zero. For example, in Fig. 5, the two cases (output low, and output high, respectively) are shown for a chain of inverter/AOI222 pairs. A nonzero sep1 is needed since the AOI222 gate pulls down slowly and the inverter must correctly “latch” a zero (upper chain in Fig. 5). However, sep2 is optimally zero since the AOI222 gate will still correctly “latch” a zero (lower chain in Fig. 5) even if the inverter is clocked at the same time as the AOI222 gate.
In the top chain in Fig. 5, a sep2 of zero allows the AOI222 gate to get a head start in evaluating (i.e., at the same time as the preceding inverter) to yield improved speed. Also, the delay for the opposite case (lower chain in Fig. 5) is not degraded significantly even though the AOI222 gate evaluates early; since the inverter is so fast, the AOI222 gate is nonetheless able to correctly “latch” the falling inverter output. Some degradation in delay for the lower case (high output) can easily be tolerated since the delay for the upper case is considerably larger (limited by pull-downs of complex AOI222 gates, rather than pull-downs of fast inverters). Clock grouping, where fast gates are assigned the same clock as the following slower gates, minimizes the number of clock phases needed. This technique is especially useful when inverters are required. An example is the case of carries for an adder’s final-sum XOR gates, where both polarities of the carry signals are needed, which would otherwise require an additional logic level (clock phase).
Fig. 5. Clock grouping for the case when slow gates follow fast gates.
3. 64b OPL adder implementation
3.1. 64b adder architecture
We selected the Kogge-Stone architecture since it has minimum logic depth and presents a very regular structure [2]. It generates carries for each bit in parallel and then computes every final sum bit directly from a pre-computed partial sum. With the original binary (radix-2) Kogge-Stone scheme, the depth is $1+\log_2 64+1 = 8$ levels. The radix-3 and radix-4 versions have depth $1+\lceil\log_3 64\rceil+1 = 6$ levels and $1+\log_4 64+1 = 5$ levels, respectively. The extra two logic levels (i.e., the first and last levels) are for preliminary generate/propagate and final sum gates. We chose the radix-3 Kogge-Stone architecture and still were able to achieve 5 logic levels (Fig. 6). We were able to eliminate one level by merging the preliminary generate/propagate gates with the first level of the carry-merge tree (Fig. 7(a)). This merging of the preliminary generate/propagate gates is not possible with the radix-4 architecture, since the carry-merge cells already have stacks of four nMOS devices. Thus, the radix-4 approach still needs five levels. Ling’s approach can be used to reduce the first level at the cost of a more complex sum equation [3]. But more problematic with the radix-4 architecture is its larger fanout and wiring flux; wiring already presents a challenge for radix-3.
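The level counts quoted above can be reproduced with a behavioral model. The following is a minimal sketch of a generic radix-r Kogge-Stone prefix adder, not the authors' circuit: it keeps the preliminary generate/propagate step separate from the carry-merge tree and ignores OPL signal polarities. The function name `kogge_stone_add` and its parameters are hypothetical.

```python
import random


def kogge_stone_add(a, b, width=64, radix=3):
    """Add a and b modulo 2**width with a radix-`radix` parallel-prefix tree.

    Returns (sum, merge_levels), where merge_levels counts carry-merge levels
    only; the preliminary g/p level and the final sum level are extra.
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    # Preliminary generate and propagate (propagate doubles as partial sum).
    g = [((a >> i) & (b >> i)) & 1 for i in range(width)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]
    G, P = g[:], p[:]
    span, merge_levels = 1, 0
    while span < width:
        new_G, new_P = G[:], P[:]
        for i in range(width):
            gi, pi = G[i], P[i]
            # Merge up to `radix` adjacent groups, each `span` bits wide.
            for k in range(1, radix):
                j = i - k * span
                if j < 0:
                    break
                gi |= pi & G[j]
                pi &= P[j]
            new_G[i], new_P[i] = gi, pi
        G, P = new_G, new_P
        span *= radix
        merge_levels += 1
    total = 0
    for i in range(width):
        carry_in = G[i - 1] if i > 0 else 0
        total |= (p[i] ^ carry_in) << i
    return total, merge_levels


if __name__ == "__main__":
    x, y = random.getrandbits(64), random.getrandbits(64)
    for r in (2, 3, 4):
        s, levels = kogge_stone_add(x, y, radix=r)
        print(r, s == (x + y) % 2**64, 1 + levels + 1)
```

In this model the radix-3 tree needs four carry-merge levels, so the unmerged depth is six; the five-level design described in the text comes from folding the preliminary generate/propagate gates into the first merge level.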
In the carry-merge tree, fanouts (FOs) for generate and propagate are 3 and 5, respectively. The longest wires occur at level 3 for the 10 least significant bits (LSBs), which have a FO of 3 along with a span of 54 bits. Fortunately, the gates for the 10 LSBs are relatively simple, either an inverter or an AOI12, having small fan-ins and small size. Thus we duplicate these gates (equivalently, size up by 2X) to provide the necessary drive for the longer wires, without increasing the loading excessively to the level 2 gates. Propagates in level 2 also have large fanouts and are duplicated. The largest FO in our design is 11 nominally sized nMOS devices, which is comparable to a CMOS fanout-of-four. The worst-case delay for the 64b adder occurs when level 3 gates (which have long wires) need to pull down. We therefore use a larger clock separation between levels 3 and 4 to improve the critical path delay.
Fig. 6. Modified radix-3 Kogge-Stone adder structure.
Equations (1)-(4) show the carry-merge cells for the nodes requiring the full 3b merge. Fig. 7 shows the majority of the logic gates used in the 64b adder (with only the pull-down networks shown). The XOR gate is used for partial sums at level 1 as well as for final sums at level 5.
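$$G = g_2 + p_2 g_1 + p_2 p_1 g_0 = g_2 + p_2\,(g_1 + p_1 g_0) \tag{1}$$
$$P = p_2\, p_1\, p_0 \tag{2}$$
$$\overline{G} = \overline{g_2 + p_2\,(g_1 + p_1 g_0)} = \overline{g_2}\,\bigl(\overline{p_2} + \overline{g_1}\,(\overline{p_1} + \overline{g_0})\bigr) \tag{3}$$
$$\overline{P} = \overline{p_2\, p_1\, p_0} = \overline{p_2} + \overline{p_1} + \overline{p_0} \tag{4}$$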
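The carry-merge equations can be checked exhaustively. The sketch below reads Equations (3) and (4) as the complemented (De Morgan) forms of Equations (1) and (2), which is an assumption about how the alternating signal polarities of inverting gates are handled; the function names are hypothetical.

```python
from itertools import product


def merge3(g2, p2, g1, p1, g0, p0):
    """True-polarity 3b carry merge, Equations (1) and (2)."""
    G = g2 | (p2 & (g1 | (p1 & g0)))
    P = p2 & p1 & p0
    return G, P


def merge3_complement(g2, p2, g1, p1, g0, p0):
    """Complemented forms, computed only from complemented inputs."""
    ng2, np2, ng1, np1, ng0, np0 = (1 - v for v in (g2, p2, g1, p1, g0, p0))
    nG = ng2 & (np2 | (ng1 & (np1 | ng0)))
    nP = np2 | np1 | np0
    return nG, nP


def forms_agree():
    """Compare all forms over the full 64-row truth table."""
    for bits in product((0, 1), repeat=6):
        g2, p2, g1, p1, g0, p0 = bits
        G, P = merge3(*bits)
        nG, nP = merge3_complement(*bits)
        sum_of_products = g2 | (p2 & g1) | (p2 & p1 & g0)
        if G != sum_of_products or nG != 1 - G or nP != 1 - P:
            return False
    return True


if __name__ == "__main__":
    print(forms_agree())
```

The factored form in Equation (1) is the one that maps directly onto a series/parallel pull-down stack, which is why it is compared against the sum-of-products form here.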
The nominal size for the transistors in the dynamic pull-down networks was 2um. Transistors were sized up by a factor of N when they were in a stack of N transistors.
Fig. 7. Carry-merge cells: (a) Level 1, (b) Level 2 and level 4 (no P for level 4), (c) Level 3; and (d) XOR gate for sums. The nominal transistor size (x) is 2um for the 0.18um TSMC process.
3.2. Clocking
We use Reduced Swing Buffers (RSBs), shown in Fig. 8, to realize clock separations that are considerably smaller than a static buffer delay [4]. RSBs are cascaded in a chain to provide multi-phase clocking. clk_rsb is applied to the next level’s RSB clk_in. Two extra static buffers (buf1 and buf2) in the RSB provide the driving capability for the 200um of poly gate loading at each level. Many nominally sized RSBs are connected in parallel to achieve the desired drive strength. Also, clk_rsb, clk_buf and clk of different RSBs for the same logic level are connected together to reduce skew. Global VCN and VCP control signals are used instead of individual controls for each clock level, to reduce the number of chip pins required, though this sacrifices level-by-level control. VCN is the main control as it determines the rising edge delay.
3.3. Layout
All of the logic cells (dynamic OPL) are placed in a contiguous set of rows, in a logic level by logic level fashion, as illustrated in Fig. 10. The RSBs of one clock level are placed in a single row, in the middle of the set of rows for a level of logic cells. (The RSBs are spread uniformly throughout the RSB row for a given level.) This approach effectively reduced N by one half for the example modeled in Fig. 9, and yielded more than a 2X improvement in clock skew compared to placing the RSB row entirely before or after the rows of logic cells for a level.
Fig. 8. RSB (Reduced Swing Buffer)
Fig. 10. Adder floorplan.
We use a simple RC line model to analyze and minimize clock skew, where the wire resistance r and gate capacitance loading c are the dominating skew contributors (Fig. 9). Shown is the longest clock net, which runs over 8 rows (N=8) in the (initial) layout. The r of 3.1Ω is the wire resistance corresponding to the 16um of center-to-center row separation. The c of 60fF approximates 2 gate loads. The RC calculation and SPICE simulation results are very close. In general, skew is proportional to $1+2+3+\dots+(N-1) = (N-1)N/2$. Therefore, minimizing N is an efficient way to reduce skew. The modeling gives a worst-case skew of 5.2ps from n1 to n8. From simulations of full 3D extracted parasitics, the worst-case skew was approximately 8ps.
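The 5.2ps figure follows from the stated r, c, and N under a first-order (Elmore-style) estimate. The sketch below assumes a uniform RC ladder driven at n1, with wire capacitance neglected and one load c at every node; the function name `ladder_skew` is hypothetical.

```python
def ladder_skew(r_ohm, c_farad, n_loads):
    """Elmore-style delay difference between the first and last node of a
    uniform RC ladder with n_loads nodes, driven at the first node.

    Segment j (between node j and node j+1) carries the charging current of
    the (n_loads - j) loads beyond it, so the total is r*c*(1 + 2 + ... + (N-1)).
    """
    skew = 0.0
    for j in range(1, n_loads):
        downstream_loads = n_loads - j
        skew += r_ohm * c_farad * downstream_loads
    return skew


if __name__ == "__main__":
    skew_8 = ladder_skew(3.1, 60e-15, 8)  # values from the text, N = 8
    skew_4 = ladder_skew(3.1, 60e-15, 4)  # same net with N halved
    print(f"N=8: {skew_8 * 1e12:.2f} ps")  # about 5.21 ps
    print(f"N=4: {skew_4 * 1e12:.2f} ps")  # about 1.12 ps
```

Because the sum grows as (N-1)N/2, halving N in this model cuts the estimated skew by more than a factor of four, not just two.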
Each clock net is organized as a mesh to minimize skew, with vertical wiring entirely on M4 and horizontal wiring entirely on M3. Using 3D post-layout parasitic extraction, the simulated worst-case clock skew within any level in the final layout was less than 4ps. Of concern in OPL-dynamic circuits is cross-coupling noise to gate outputs that are to remain high. Since many signals will experience high-to-low transitions (but no low-to-high transitions), cross-coupling noise may pull these outputs too low. One approach that can appreciably lower the cross-coupling noise is full shielding (Fig. 11(a)). Full shielding using alternating VDD/GND stripes significantly reduces the coupling noise, but at the cost of appreciably increased parasitic capacitance. This is because the cross-capacitance to minimum-spaced neighboring wires is about 7X larger than the capacitance to ground for a wire in the 0.18um TSMC process (Fig. 11(b)), and full shielding guarantees close neighbors on both sides of every wire. While full shielding mitigates cross-coupling noise, it clearly results in larger capacitive loads and therefore larger delays.
Fig. 9. Clock skew modeling (r = 3.1Ω, c = 60fF, N= 8). Note that some loads (represented by c) are closer to the clock driver than others, giving rise to the modeled skew.
Fig. 11. (a) Pseudo-shielding with wide spacing vs. full shielding; (b) Parasitic wire capacitance and wire space (wire width=0.4um and length=1000um).
We found that a superior approach is to increase the wire pitch to the same amount that would be used in the full-shielding approach, but without inserting the VDD/GND shields. This pseudo-shielding approach uses a 1.6um pitch with 1.2um space, and reduces the coupling capacitance by 2.6X and the total capacitance by 1.6X, compared to a non-shielded 0.8um pitch with 0.4um space (Fig. 11(b)). The improvement is even greater when compared to the minimum spacing allowed by the design rules. Note that the intrinsic capacitance to ground increases a little as wire space goes up, because more electric field lines are directed to ground when coupling decreases.
4. Practical issues and testing
4.1. Power supply voltage drop
Voltage drop strongly impacts circuit performance. There are two factors that cause power supply drop. The first one is IR drop (VDD) or bounce (GND). A regular power/ground network was implemented to reduce the IR drop/bounce. M5 and M6 constituted the two power planes, with a 16um pitch. On average, each cell is connected once to each of VDD/GND through stacked vias to the top power planes. Simulated IR drop and bounce in the adder are 0.04V and 0.08V, respectively, for 3 pairs of power pads. Another factor that causes voltage drop is L·(di/dt). The inductance L mainly comes from the package and the board. A chip-on-board was designed and fabricated, which reduced L for each power pad to about 1nH, which is several times smaller than that for a standard PGA package. The use of multiple power pads also reduces the effective inductance proportionally. Decoupling capacitors were used at the board level and chip level to act as auxiliary power supplies.
4.2. Testing
Our testing approach is illustrated in Fig. 12. First, test vectors are serially shifted in. Then the addition launches with the rise of clk1. An additional clock phase is used to trigger a capturing dynamic D-flipflop (DDFF). The DDFF captured values are then passed to a conventional shift register, and are then serially shifted out. The DDFF, shown in Fig. 12(b), is a revised version of the semi-dynamic DFF developed by Sun Microsystems [5]. It is a rising-edge-triggered DFF with a nominal setup time of zero. Hence the delay of the 64b adder is determined by the delay difference between this additional clock phase (clk6) and the first clock phase (clk1). The sixth clock phase that drives the DDFFs is generated just as for the other clocks, i.e., delayed from clk5 using an RSB.
Fig. 12. (a) Testing structure (b) DDFF
Since the delay of the 64b adder is only a few hundred picoseconds, careful steps were taken to ensure the accuracy of final measured results. To factor out off-chip delay mismatches between clk1 and clk6, two multiplexers direct clk1 and clk6 to two output pads interchangeably, depending on control signal ctrl_clk. By switching ctrl_clk, two sets of delay measurements are obtained and averaged, yielding the true difference between clk1 and clk6.
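The reason averaging the two multiplexer settings cancels the off-chip mismatch can be shown with a small model. This is a sketch under the assumption that each output path adds a fixed delay; the path delays of 100ps and 123ps are hypothetical values chosen only so that the two readings match the 386ps and 432ps reported with Fig. 13, and the function names are not from the paper.

```python
def measured_difference(on_chip_delay, path_clk1, path_clk6):
    """Observed clk6 - clk1 delay when each clock leaves through its own path."""
    return on_chip_delay + path_clk6 - path_clk1


def true_delay(meas_direct, meas_swapped):
    """Average of the two multiplexer settings; the path mismatch cancels."""
    return (meas_direct + meas_swapped) / 2.0


def path_mismatch(meas_direct, meas_swapped):
    """Magnitude of the off-chip mismatch implied by the two readings."""
    return abs(meas_direct - meas_swapped) / 2.0


if __name__ == "__main__":
    path_a, path_b = 100.0, 123.0  # hypothetical off-chip path delays in ps
    direct = measured_difference(409.0, path_a, path_b)
    swapped = measured_difference(409.0, path_b, path_a)
    print(direct, swapped, true_delay(direct, swapped))
```

Swapping the paths flips the sign of the mismatch term, so the average depends only on the on-chip delay while the half-difference exposes the mismatch itself.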
5. Results and comparison
The 64b OPL adder has a worst-case delay of 411ps in simulations, with full RC extraction. Measurements on fabricated parts show a worst-case delay of 409ps (Fig. 13), obtained by testing using worst-case vectors as well as thousands of random vectors (for verification). When VCN is swept from low to high, the clock separations decrease. Initially, the adder produces correct results, but at some point erroneous results may be observed. The last successful case indicates the maximum speed of the adder. This is illustrated in Fig. 14. The best speed is obtained for VCN=1.045V. With VCN=1.045V, the clock separations are approximately 75, 70, 80, and 70ps, respectively. With VCN greater than 1.05V, the adder starts to fail for some vectors.
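The normalized figures reported for this adder can be reproduced from numbers given in this paper: the measured 409ps delay, the measured FO4 delay of 87ps for the TSMC 0.18um process run, and the 6.8 FO4 delay of the fastest previously reported 64b adder. The helper name `fo4_normalized` is hypothetical.

```python
def fo4_normalized(delay_ps, fo4_ps):
    """Express a delay in units of fanout-of-4 inverter delays."""
    return delay_ps / fo4_ps


if __name__ == "__main__":
    ours = fo4_normalized(409.0, 87.0)  # measured delay / measured FO4
    speedup = 6.8 / ours                # relative to the 6.8 FO4 prior adder
    print(f"{ours:.1f} FO4, {speedup:.2f}X")
```

Normalizing by FO4 delay is what allows adders built in different processes to be compared on one scale.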
Fig. 13. Adder worst-case delay is 409ps, the average of 386ps and 432ps. Falling edges are measured because an odd number of inverters were used as buffers to lead clk1 and clk6 to output pads.
Fig. 14. Adder delay vs. VCN.
The FO4 inverter delay (or simply FO4 delay) is commonly used to normalize the speed of designs for different process technologies. The measured FO4 delay for the TSMC 0.18um 1P6M process run was 87ps, and therefore the adder has a worst-case delay of 4.7 FO4 delays. As shown in Table 1, this is 1.45X faster than the fastest previously reported 64b adder, which had a delay of 6.8 FO4.
Table 1. Comparison of fast 64b adders
*The FO4 for the Intel 0.18um process is estimated to be 36 ps for typical environmental conditions [13].
Fig. 15 shows the chip-on-board and the on-die probe setup. Our adder has 6754 transistors for the logic gates, and 1384 for the RSBs. The adder area is 507um by 483um for the core, and 550um by 660um with test circuits. The RSBs occupy 6 rows out of 29 rows, or about 20% of the core area.
Fig. 15. (a) chip-on-board, (b) on-die probes.
6. Conclusion
We presented a 64b adder based on Output Prediction Logic (OPL) that had a measured worst-case delay of 409ps, equivalent to 4.7 FO4 inverter delays for the TSMC 0.18um process. This is 1.45X faster than the fastest previously reported 64b adder. The adder used a modified radix-3 Kogge-Stone architecture and had 5 logic levels.
Acknowledgments We are grateful for the contributions of Samuel Kio, and for the financial support provided by MARCO/C2S2.
References
[1] L. McMurchie, S. Kio, G. Yee, T. Thorp, C. Sechen, “Output Prediction Logic: a High Performance CMOS Design Technique,” Proc. Int. Conf. on Comp. Design, Sept. 2000.
[2] S. Knowles, “A Family of Adders,” Proc. 14th IEEE Symp. Computer Arithmetic, pp. 30-34, April 1999.
[3] H. Ling, “High-Speed Binary Adder,” IBM J. Research Develop., Vol. 25, No. 3, May 1981.
[4] S. Kio, K.H. Chong, C. Sechen, “A low power delayed-clocks generation and distribution system,” ISCAS, vol. 5, pp. 445-448, May 2003.
[5] F. Klass, et al., “A new family of semidynamic and dynamic flip-flops with embedded logic for high-performance processors,” J. Solid-State Circuits, May 1999, pp. 712-716.
[6] S. Mathew, et al., “Sub-500-ps 64-b ALUs in 0.18-µm SOI/bulk CMOS: design and scaling trend,” J. Solid-State Circuits, Nov. 2001, pp. 1636-1646; also ISSCC 2001.
[7] R. Zlatanovici, B. Nikolic, “Power-performance optimal 64-bit carry-lookahead adders,” ESSCIRC 2003, pp. 321-324.
[8] A. Neve, H. Schettler, T. Ludwig, D. Flandre, “Power-delay product minimization in high-performance 64-bit carry-select adders,” Trans. VLSI, March 2004, pp. 235-244.
[9] J. Kim, R. Joshi, C. Chuang, K. Roy, “SOI-optimized 64-bit high-speed CMOS adder design,” Dig. Symp. VLSI Circuits, pp. 122-125, June 2002.
[10] D. Stasiak, F. Mounes-Toussi, S.N. Storino, “A 440-ps 64-bit adder in 1.5-V 0.18 um partially depleted SOI technology,” JSSC, Oct. 2001, pp. 1546-1552; also in ISSCC 2000.
[11] J. Park, H. Ngo, J. Silberman, S. Dhong, “470-ps 64-bit parallel binary adder,” Dig. Symp. VLSI Circuits, pp. 192-193, June 2000.
[12] S. Lee, R. Woo, H. Yoo, “480 ps 64-bit race logic adder,” Dig. Symp. VLSI Circuits, pp. 27-28, June 2001.
[13] R. Ho, K. Mai, and M. A. Horowitz, “The future of wires,” Proc. IEEE, vol. 89, pp. 490-504, Apr. 2001.
[14] Kio Su, et al., “Application of Output Prediction Logic to Differential CMOS,” Proc. IEEE Workshop on VLSI, pp. 57-65, April 2001.