Parallel Phase Accumulator Architecture for DDFS
Irwin Horowitz and George S. La Rue
School of Electrical Engineering and Computer Science
Washington State University
Pullman, Washington 99164-2752
Abstract: A novel parallel architecture is described for a phase accumulator (PA) in a direct digital frequency synthesizer (DDFS) intended for space-based applications. A comparison is made between the parallel and pipelined PA architectures in a 0.18 µm CMOS technology. The parallel architecture dissipates about 1/3 less power while achieving performance at least as high as the pipelined architecture. The accumulator designs will be hardened against latch-up, total dose effects and single-event upsets through the use of guard rings, FET gate geometry and triple-mode redundancy (TMR) hardware.
Keywords: DDFS; Phase Accumulator; Radiation Hardness
I. INTRODUCTION
Direct digital frequency synthesizers (DDFSs) produce sinusoidal waveforms over a wide range of frequencies (up to the Nyquist rate). Fast frequency hopping, high spectral purity and low power dissipation are the performance measures required by many modern wireless communication systems. In addition, DDFSs can have higher frequency resolution and lower phase noise than phase-locked-loop-based frequency synthesizers [1].

The standard implementation of a DDFS consists of a phase accumulator (PA), a ROM lookup table that converts phase to digital amplitude and a digital-to-analog converter (DAC) that generates the analog output signal. A low-pass filter with a cutoff frequency near Nyquist may be added to the DAC output to remove high-frequency harmonics from the signal. The PA stores the current value of the phase and increments it every clock cycle by a user-specified N-bit input phase word. By varying this phase word (PW), the output frequency is modified according to (1).

$$f_{DDFS} = \frac{PW \cdot f_{clk}}{2^{N}} \tag{1}$$

As shown, the output frequency is proportional to the clock frequency. Designers typically utilize pipelining in order to achieve higher clock rates. This paper presents an alternative approach that uses multiple accumulators working in parallel to achieve a higher effective clock rate while reducing overall power dissipation.

II. PIPELINED PHASE ACCUMULATOR
This DDFS is intended for space-based applications, which require both low power and radiation-tolerant circuitry. The design specifications are listed in Table I. An initial 12-bit DDFS has been fabricated using Honeywell’s 0.35 µm MOI5 silicon-on-insulator (SOI) process. This process was chosen for its inherent radiation tolerance. The PA for this design is limited to 24 input bits because of area considerations on the die, and is operational up to a maximum simulated data rate of 1.6 GHz. It outputs the upper 12 bits to the ROM lookup table. Radiation testing showed that the PA continues to operate properly after a total ionizing dose (TID) of 300 krad Si.

TABLE I. DESIGN SPECIFICATIONS FOR PHASE ACCUMULATOR
Number of bits                   32
Data rate                        2 GHz
Power dissipation                < 150 mW
Total ionizing dose tolerance    > 200 krad Si

This material is based on research sponsored by the Air Force Research Laboratory under agreement number F29601-02-2-0299 in conjunction with the NSF Center for the Design of Analog-Digital Integrated Circuits (CDADIC).

The 24-bit PA consists of 6 pipeline stages, with 4 bits per stage. The pipelining requires a substantial number of synchronizing registers, each of which is clocked at the maximum rate. The devices in these registers must also be large enough to drive the circuit at this maximum rate, resulting in substantial capacitance that must be charged and discharged each clock cycle. This leads to power dissipation in excess of 1 W, primarily in the clock drivers.

To address the power issue, a decision was made to change from the MOI5 process to the 0.18 µm TSMC CMOS process, which has a lower delay-power product. In order to meet the radiation requirements, radiation-hard-by-design (RHBD) techniques are required. Guard rings and specially designed FET geometries are included in the design [2]. Triple-mode redundancy (TMR) with majority-voting circuitry was added to mitigate the effects of single-event upsets (SEUs), which occur more frequently in bulk processes than in SOI processes.

Fig. 1 shows the 32-bit pipelined PA, which has been simulated using 8 pipeline stages with 4 bits per stage. The upper 14 bits are output to the ROM lookup table. In order to achieve RHBD using triple redundancy, it is necessary to triplicate the entire PA circuit and then vote the output bits from each copy. Prior to the first input register in each pipeline, there are buffers (not shown) that invert the input phase word, permitting the use of fully differential logic. The hardware present in the complete circuit consists of 582 fully differential 2 GHz flip-flops, 96 input buffers, 24 4-bit carry-bypass adders and two 14-bit majority-voting blocks (one for the signal and one for its complement). The resulting power dissipation in the 0.18 µm technology was still higher than desired, and an alternative architecture was investigated.
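The behavior that the pipelined hardware has to reproduce can be summarized in a few lines of Python. The sketch below is only a behavioral model under stated assumptions: the function names `ddfs_frequency`, `phase_accumulator` and `sliced_add` are illustrative and do not come from the design, and `sliced_add` mirrors only the arithmetic split into eight 4-bit slices, not the registers that skew and realign the slices in time.

```python
def ddfs_frequency(pw, f_clk, n_bits=32):
    """Equation (1): output frequency for phase word pw and clock f_clk."""
    return pw * f_clk / 2**n_bits


def phase_accumulator(pw, cycles, n_bits=32, out_bits=14):
    """Return the upper out_bits of the phase after each of `cycles` clocks."""
    mask = (1 << n_bits) - 1
    phase = 0
    outputs = []
    for _ in range(cycles):
        phase = (phase + pw) & mask          # wraps modulo 2**n_bits
        outputs.append(phase >> (n_bits - out_bits))
    return outputs


def sliced_add(a, b, stages=8, width=4):
    """Add two words one `width`-bit slice at a time, carrying between slices.

    Assumption: only the arithmetic partition of the pipeline is modeled;
    the pipeline registers and their timing are left out.
    """
    slice_mask = (1 << width) - 1
    carry, total = 0, 0
    for s in range(stages):
        shift = s * width
        part = ((a >> shift) & slice_mask) + ((b >> shift) & slice_mask) + carry
        total |= (part & slice_mask) << shift
        carry = part >> width                # carry-out feeds the next slice
    return total                             # final carry-out is dropped (wrap)


if __name__ == "__main__":
    pw = 1 << 30                             # one quarter of full scale
    print(ddfs_frequency(pw, 2e9))           # 500000000.0
    print(phase_accumulator(pw, 4))          # [4096, 8192, 12288, 0]
    print(sliced_add(0xFFFFFFFF, 1))         # 0
```

With a 2 GHz clock and N = 32, equation (1) gives a frequency step of 2e9 / 2**32, or about 0.47 Hz per unit of the phase word.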
Figure 1. 32-bit 8-stage Pipeline Phase Accumulator (block diagram: 4-bit input slices of IN[31:0], 4-bit registers, 4-bit adders with ports A, B, Cin, Cout and S, and output bits OUT[13:0])
Figure 2. 32-bit M=4 Triply Redundant Parallel Phase Accumulator (block diagram: input IN[31:0], 32-bit registers and latches clocked by CLK and CLK1 to CLK4, 32-bit voters, 3:2 and 4:2 compressors, operands A1 to A4, outputs PA1[13:0] to PA4[13:0] and the full PA4[31:0] word)
III. PARALLEL PHASE ACCUMULATOR

Using M parallel accumulators in the PA allows each to operate at 1/M times the clock frequency, reducing the overall power consumption. Fig. 2 shows a block diagram of the parallel PA, including the TMR and voting blocks necessary for dealing with SEUs. M was chosen to be four in this design. The input phase word is first captured in three identical 32-bit single-ended registers to provide TMR. The three copies are then demultiplexed with a four-phase clock and 32-bit latches, resynchronized with registers, voted to eliminate SEUs and input to the parallel PAs. Each parallel PA operates at 500 MHz, or 1/4 of the 2 GHz design specification. The four-phase clock was designed with a 25% duty cycle in each phase using a ring counter.
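Running four accumulators at a quarter of the clock rate is only useful if their interleaved outputs match what a single fast accumulator would produce. A minimal Python sketch of that check follows. It assumes the update rule this section describes, in which each accumulator adds the last accumulator's previous value to a running sum of the newly captured phase words. The function names and the oldest-first ordering of the phase words within a group are illustrative assumptions, not details taken from the design.

```python
MASK32 = (1 << 32) - 1


def serial_accumulator(phase_words, start=0):
    """Reference model: add one phase word per fast clock cycle."""
    phase, out = start, []
    for pw in phase_words:
        phase = (phase + pw) & MASK32
        out.append(phase)
    return out


def parallel_accumulator(phase_words, start=0, m=4):
    """Model of m accumulators, each updating once per m fast clock cycles.

    In slow cycle n, accumulator k (1..m) forms
        PA_k(n) = PA_m(n-1) + A_1 + ... + A_k   (mod 2**32)
    where A_1..A_m are the phase words captured since the previous slow cycle
    (assumed here to be ordered oldest first).
    """
    if len(phase_words) % m:
        raise ValueError("number of phase words must be a multiple of m")
    pa_last, out = start, []
    for i in range(0, len(phase_words), m):
        group = phase_words[i:i + m]
        lanes = [(pa_last + sum(group[:k])) & MASK32 for k in range(1, m + 1)]
        out.extend(lanes)        # interleave lanes back into the fast-rate stream
        pa_last = lanes[-1]      # only the last lane feeds the next slow cycle
    return out


if __name__ == "__main__":
    words = [1, 2, 3, 4]
    print(parallel_accumulator(words))                                # [1, 3, 6, 10]
    print(parallel_accumulator(words) == serial_accumulator(words))   # True
```

Because only the last lane is carried into the next group, the other lanes hold no state of their own in this model.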
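$$
\begin{aligned}
PA_1(n) &= PA_4(n-1) + A_1 \\
PA_2(n) &= PA_4(n-1) + A_1 + A_2 \\
PA_3(n) &= PA_4(n-1) + A_1 + A_2 + A_3 \\
PA_4(n) &= PA_4(n-1) + A_1 + A_2 + A_3 + A_4
\end{aligned} \tag{2}
$$

Figure 3. 4-bit Conflict-Free Carry-Bypass Circuit (schematic: generate signals G0 to G3, propagate signals P0 to P3, bypass controls T1 and T3, carry input Cin and carry outputs Cout0 to Cout3)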
During the nth cycle, each of the accumulators adds the phase stored in the fourth accumulator during the previous cycle to between one and four 32-bit input phase words (A1-A4). The adders must therefore be able to handle between two and five 32-bit input operands. This is accomplished using a series of 3:2 and 4:2 compressor stages that reduce the total number of operands in each accumulator to two: a carry word C and a sum word S [3]. These words are then input into a hybrid 32-bit carry-bypass/carry-select adder. Each adder first evaluates the propagate Pi and generate Gi bits. The lower 16 bits are then combined using four 4-bit blocks in a carry-bypass design, while the upper 16 bits use four 4-bit blocks in a carry-select design. The bypass adders utilize a conflict-free bypass circuit (see Fig. 3) for the carry-in [4]. When all four propagate bits are set high, $T_3 = P_0 P_1 P_2 P_3$ is active; otherwise $T_1 = \overline{P_0 P_1 P_2}\,P_3$ or $\overline{P_3}$ is active. The carry-select adders evaluate Cout assuming that Cin is either zero or one. When Cin settles to its proper value, the correct value of Cout for each bit is selected using a 2:1 multiplexer. The sum for the ith bit is then generated from (3). This adder was chosen to improve the overall speed of the design in order to meet the specifications given in Table I.

$$Sum(i) = C_{out}(i-1) \oplus P(i) \tag{3}$$
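The operand reduction and the final sum in (3) can be illustrated with a short behavioral model in Python. This is a sketch under several assumptions: the 4:2 compressor is modeled as two cascaded 3:2 compressors, the order in which operands enter the compressors is arbitrary, and the final adder is a plain bit-serial carry chain rather than the hybrid carry-bypass/carry-select structure, so it reproduces the arithmetic result but not the speed. All function names are hypothetical.

```python
MASK32 = (1 << 32) - 1


def compress_3_2(a, b, c):
    """3:2 compressor (a row of full adders): three words in, sum and carry out."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1   # carries move up one bit position
    return s & MASK32, carry & MASK32


def compress_4_2(a, b, c, d):
    """4:2 compressor, modeled here as two cascaded 3:2 compressors."""
    s1, c1 = compress_3_2(a, b, c)
    return compress_3_2(s1, c1, d)


def final_add(x, y, width=32):
    """Two-operand add from P/G bits, using Sum(i) = Cout(i-1) xor P(i)."""
    p = x ^ y             # propagate bits
    g = x & y             # generate bits
    carry, total = 0, 0   # carry holds Cout(i-1); the carry into bit 0 is 0
    for i in range(width):
        p_i = (p >> i) & 1
        g_i = (g >> i) & 1
        total |= (carry ^ p_i) << i
        carry = g_i | (p_i & carry)   # Cout(i) = G(i) or (P(i) and Cout(i-1))
    return total


def add_five(pa4_prev, a1, a2, a3, a4):
    """Reduce five operands to two with 4:2 then 3:2 compression, then add."""
    s, c = compress_4_2(a1, a2, a3, a4)
    s, c = compress_3_2(pa4_prev, s, c)
    return final_add(s, c)


if __name__ == "__main__":
    operands = (0xFFFFFFFF, 0x12345678, 0x9ABCDEF0, 0x0F0F0F0F, 0x80000001)
    print(add_five(*operands) == sum(operands) & MASK32)   # True
```

The compressors never propagate a carry across the word, which is why only the last two-operand addition needs a fast carry path.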
Equation (2) provides the arithmetic operations performed by the circuit, where PAi is the output of the ith parallel PA and Aj are the four phase words input during the previous four clock cycles.

Since the inputs to each accumulator depend only on PA4, this accumulator and the logic leading to it need to be triply redundant and voted to ensure proper elimination of any SEUs. PA1, PA2 and PA3 may have upsets, but these only cause one-time glitches in the output, while an error in PA4 can cause a long-term phase error that corrupts many subsequent outputs. This design takes advantage of the partial redundancies found in the second and third accumulators to minimize the overall hardware required to achieve the TMR. The total hardware present in the parallel accumulator with triple redundancy consists of 100 single-ended 2 GHz flip-flops, 666 single-ended 500 MHz flip-flops, 5 32-bit majority voters, 3 32-bit 3:2 compressors, 3 32-bit 4:2 compressors and 6 32-bit hybrid adders.

Figure 4. Comparison of Power Dissipation in Pipelined and Parallel Phase Accumulators with and without TMR (plot of power in mW, 0 to 250, against data rate in Msps, 0 to 2000, for the pipelined and parallel designs, each with and without TMR)

IV. COMPARISON OF PIPELINED AND PARALLEL DESIGNS
A 4-stage pipelined accumulator with 8 bits per stage was unable to achieve the desired 2 GHz performance. The number of stages was increased to eight, with 4 bits per stage, to meet this criterion. Fewer pipeline stages result in less power dissipation, with the tradeoff of a lower maximum speed. Simulations of the 8-stage 32-bit pipelined phase accumulator shown in Fig. 1, in 0.18 µm CMOS with standard FET geometries, were compared to the results for the parallel phase accumulator in Fig. 2. Power dissipation results for designs with and without TMR are provided in Fig. 4. The parallel accumulator has about 1/3 less power dissipation at 2 GHz than the pipelined design. In addition, TMR results in a tripling of the power dissipation for the pipelined accumulator, since it requires three identical versions of the circuit. The increase should be less than triple for the parallel PA approach, since this architecture takes advantage of redundant pathways. The total simulated power dissipation for the 32-bit parallel phase accumulator with TMR at 2 GHz is 153 mW.

V. RADIATION HARDENING
Triple redundancy in the PA is necessary to protect against SEUs, which affect bulk CMOS processes more than SOI processes. In addition to SEUs, the circuit must also be hardened against total dose effects. A widely used radiation-hard-by-design (RHBD) layout technique is to use annular n-channel FETs, shown in Fig. 5. This enclosed layout has no channel over field oxide between source and drain, which results in a significant improvement in the radiation tolerance of the circuit. SPICE models were generated for 0.18 µm annular FETs using the method in [5].
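How majority voting protects an accumulator against a single upset can be illustrated with a small Python model. It is a sketch only: the function names are hypothetical, the upset is injected as a single flipped bit, and feeding the voted value back into all three copies is an assumption about how the voter is used rather than a description of the actual circuit.

```python
MASK32 = (1 << 32) - 1


def majority(a, b, c):
    """Bitwise 2-of-3 majority vote across three copies of a word."""
    return (a & b) | (a & c) | (b & c)


def run_tmr_accumulator(pw, cycles, upset=None):
    """Accumulate `pw` in three voted copies; return the voted phase per cycle.

    upset = (cycle, copy, bit) flips one bit in one copy to imitate an SEU.
    """
    copies = [0, 0, 0]
    out = []
    for n in range(cycles):
        copies = [(c + pw) & MASK32 for c in copies]
        if upset is not None and upset[0] == n:
            copies[upset[1]] ^= 1 << upset[2]
        voted = majority(*copies)
        out.append(voted)
        copies = [voted, voted, voted]   # assumption: the voted value is fed back
    return out


def run_unprotected(pw, cycles, upset):
    """Single accumulator: upset = (cycle, bit) corrupts the stored phase."""
    phase, out = 0, []
    for n in range(cycles):
        phase = (phase + pw) & MASK32
        if upset[0] == n:
            phase ^= 1 << upset[1]
        out.append(phase)
    return out


if __name__ == "__main__":
    pw = 0x01234567
    clean = run_tmr_accumulator(pw, 8)
    print(run_tmr_accumulator(pw, 8, upset=(3, 1, 20)) == clean)   # True
    print(run_unprotected(pw, 8, upset=(3, 20)) == clean)          # False
```

In the unprotected model the flipped bit stays in the stored phase, so every later output carries the same offset, whereas the voted model masks the upset in the cycle in which it occurs.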
Figure 5. Layout Geometry for Annular FET (layout sketch with terminals labeled G, S or D, and D or S)

For this process, the minimum effective device width of the annular FETs is 2.3 µm. This results in substantially more gate capacitance than for a minimum-width device. Since the minimum width of standard p-channel FETs is much less than this, and since the primary source of power dissipation is the clock drivers for the synchronizing latches, the latch design is critical to the overall power consumption and speed of the circuit. For the single-ended latches used in the parallel PA, the latch input uses a CMOS transmission gate sized according to the annular n-channel FET, while the feedback path has only a single minimum-sized p-channel FET. The latch is also designed as an inverting latch, with only one inverter in the feedforward path to improve the overall speed. The second inverter is placed in the feedback path, where it does not unduly influence the speed of the latch, as shown in Fig. 6. The fully differential latches used in the pipelined PA follow the same basic scheme, although with their feedback paths cross-coupled to avoid the need for a second inverter.

Figure 6. Single-ended Inverting Latch w/Reset (schematic labels: D, nQ, CLK, RST; device widths Wp = 2.2 µm, Wn = 2.3 µm, Wp = Wmin)

VI. CONCLUSION

A 24-bit pipelined accumulator has been fabricated as part of a DDFS using Honeywell’s 0.35 µm MOI5 SOI process, and tested for radiation tolerance and power dissipation. The excessive power requirements of this circuit make it a poor candidate for low-power applications, such as satellite communications. Both a 32-bit pipelined phase accumulator and a 32-bit parallel phase accumulator have been designed using TSMC’s 0.18 µm CMOS process. Simulations show that the parallel architecture dissipates about 1/3 less power than the pipelined design at the same data rate and in the same process. The overall power dissipation is significantly less than that of the first DDFS design.

REFERENCES
[1] J. Tierney, C. M. Rader, and B. Gold, “A digital frequency synthesizer,” IEEE Trans. on Audio and Electroacoustics, vol. 19, no. 1, pp. 48-57, January 1971.
[2] M. N. Martin and K. Strohbehn, “Analog rad-hard by design issues,” NASA Symposium on VLSI Design, May 2003.
[3] R. Zimmerman, “Binary Adder Architectures for Cell-Based VLSI and their Synthesis,” Ph.D. Thesis, Swiss Federal Institute of Technology (ETH) Zurich, Hartung-Gorre Verlag, 1998, pp. 36-40.
[4] T. Sato, M. Sakate, H. Okada, T. Sukemara, and G. Goto, “An 8.5ns 112-b Transmission Gate Adder with a Conflict-Free Bypass Circuit,” IEEE JSSC, vol. 27, no. 4, Apr. 1992, pp. 657-659.
[5] C. Champion, “Modeling of FETs with arbitrary gate geometries for radiation hardening,” Thesis, Washington State University, Aug. 2000.