A REGULAR MODULAR ARCHITECTURE FOR PIPELINED BINARY TREE MULTIPLIERS BASED ON A SOG STRUCTURE
Nicola Testoni, Marco Cisterni, Eleonora Franchi
Advanced Research Center on Electronic Systems (ARCES), University of Bologna
v.le Risorgimento 2, 40136 Bologna, Italy
e-mail: [email protected]
ABSTRACT This paper presents a highly regular modular architecture for the design of pipelined binary tree multipliers, suited to mapping onto a gate array device. The “tree of Wallace trees” architecture is employed to achieve a regular layout, while a symmetrical sea-of-gates (SOG) structure enables reduced design time and high predictability of line delay. Post-layout simulations of a 16×16-bit pipelined multiplier designed in a five-metal-layer, 0.35μm CMOS technology show operating frequencies up to 1GHz.
1. Introduction A great number of today's mid-volume integrated circuit production runs are not well served by the two leading technologies in this field: FPGAs and ASICs. The author of [1] suggests that a new class of devices, namely structured ASICs, could fill the gap between these technologies, since they are close to FPGAs in terms of design cost and turnaround time, and close to ASICs in terms of gate capacity, performance and power consumption. Modern high-end gate array devices can be viewed as extremely fine-grained structured ASICs, being both highly regular and fast to design. From this standpoint, we describe a modular architecture for pipelined binary tree multipliers that can be used to quickly design arbitrarily large multipliers with good throughput, exploiting the high line-delay predictability afforded by a symmetrical gate array support. Starting from the Wallace tree (WT) [2], the tree of Wallace trees (TWT) [3] architecture is discussed in Section 2. In Section 3 we highlight the advantages of a symmetrical sea-of-gates (SOG) structure for the design of a TWT pipelined multiplier, and we describe the implementation choices at circuit and layout level. Section 4 shows the simulation results for a prototype implementation in a five-metal-layer, 0.35μm CMOS technology. Section 5 concludes the paper.
Fig.1: TWT layout for a 16×16bit multiplier: the longest row is enclosed by a dashed line while arrows indicate the adding direction; partial product generators are shaded.
Fig.2: Block diagram of the pipeline: the critical path is drawn in black, while the gray block represents the other TWT leaves. Pipeline buffers are enclosed in the latch stages.
2. TWT architecture
The function of a TWT multiplier unit is divided into three stages. The first stage generates the partial products. The second stage sums the partial products in parallel using multiple full adders (FA): the TWT architecture uses a pair of FAs at each node to generate two outputs from four inputs (a 4:2 carry-save adder, CSA), which results in a balanced binary tree with a high degree of regularity and a simplified wiring structure. The third stage generates the final multiplication result: here, any fast adder architecture can be used as the carry propagate adder (CPA).
The layout of the TWT can be carried out in stages, from the leaf nodes towards the root, in the form of an L-tree [4] as shown in Fig.1, so that every node is placed between its two children. Although the root always ends up in the middle of the layout, the resulting routing density is the smallest achievable for a linear placement structure, which improves pipeline performance. Finally, owing to the symmetry of the layout at each leaf, multipliers with a higher number of bits can be designed at almost no extra cost while retaining all the benefits of constant and predictable line delay.
3. SOG implementation of TWT
Since each CSA stage of the TWT multiplier is identical to its children, an optimized implementation of the CSA should be adopted. In this paper we exploit the performance that can be achieved by using the highly regular structure of a SOG: even though this choice does not target single-transistor optimization, it grants a certain degree of optimization of the whole layout. The CSA stages are arranged so that partial products of the same weight lie on the same row. Only one row of maximum length exists (Fig.1), and it can be shown that this row coincides with the critical path (Fig.2). This same row is replicated to implement the whole multiplier layout by properly fixing some of its inputs, following a predictable order: this improves the global predictability of the line delay, since each interconnection is realized using the same structure. Moreover, since the TWT is a balanced binary tree and the SOG used is itself symmetrical, each path performs the same as the critical one: the length of each line is therefore quantized, and the optimization effort can be concentrated on the basic CSA and the pipeline buffers. Finally, thanks to the symmetry of the SOG, each cell in the TWT is designed only once and then mirrored left-right to realize the L-tree structure. This yields interconnections that have the same length and capacitance not only between rows but also between the leaves of the L-tree.
To improve pipeline throughput, dynamic latches are used in place of static registers: the two clock phases are generated within each latch from a single-phase clock. A pass-transistor solution is chosen for both the full adders and the latches: in this way, pipeline buffers always enclose at most three transmission gates. For a 16×16-bit multiplier, the difference in interconnection length between CSA stages results in a negligible difference in capacitive load, so equally sized pipeline buffers are used. Input signal buffers are designed to drive the lines of maximum length for each row: the multiplicand and multiplier bits do not follow the same path within the unit, but their difference in capacitance is less than 10% of the optimal line driver load.
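Functionally, the three stages described in Section 2 (partial product generation, 4:2 carry-save reduction in a balanced binary tree, final carry-propagate addition) can be checked with a short behavioral model. The following is only a sketch under stated assumptions: it works on Python integers rather than on gates, ignores pipelining and layout entirely, and the function names (`full_adder`, `csa_4to2`, `twt_multiply`) are illustrative and do not come from the paper.

```python
def full_adder(x, y, z):
    """Bitwise 3:2 carry-save step on integers: x + y + z == s + c."""
    s = x ^ y ^ z                                # per-bit sum
    c = ((x & y) | (x & z) | (y & z)) << 1       # per-bit carry, shifted to its weight
    return s, c


def csa_4to2(a, b, c, d):
    """4:2 CSA built from a pair of full adders: a + b + c + d == s + carry."""
    s1, c1 = full_adder(a, b, c)
    return full_adder(s1, c1, d)


def partial_products(x, y, n):
    """Stage 1: one shifted copy of x for every set bit among the n low bits of y."""
    return [(x << j) if (y >> j) & 1 else 0 for j in range(n)]


def twt_multiply(x, y, n=16):
    """Behavioral model of an n-bit unsigned multiplier (n a power of two, n >= 4).

    Returns (product, number_of_4to2_levels).
    """
    assert n >= 4 and n & (n - 1) == 0, "sketch assumes n is a power of two >= 4"
    operands = partial_products(x, y, n)
    levels = 0
    # Stage 2: every node compresses four operands (two per child) into two.
    while len(operands) > 2:
        reduced = []
        for i in range(0, len(operands), 4):
            reduced.extend(csa_4to2(*operands[i:i + 4]))
        operands = reduced
        levels += 1
    # Stage 3: carry-propagate addition of the two remaining operands
    # (Python's + stands in for whatever fast adder is chosen as the CPA).
    s, c = operands
    return s + c, levels
```

In this model, `twt_multiply(0xFFFF, 0xFFFF)` returns the exact product together with a level count of 3, since sixteen partial products are reduced as 16 → 8 → 4 → 2.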
Fig.3: Layout of the 16×16-bit TWT multiplier on a symmetrical sea of gates in 0.35μm CMOS technology.
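The L-tree linear placement, in which every node sits between its two children, can be illustrated with an in-order traversal of a balanced binary tree. The sketch below is an abstraction, not the authors' layout procedure: nodes are labelled `(level, index)` with level 0 for the leaves, each node is assumed to occupy one abstract slot, and the names `l_tree_placement` and `wire_spans` are hypothetical.

```python
def l_tree_placement(level, index=0):
    """Left-to-right slot order of a balanced binary tree rooted at (level, index).

    Each node is emitted between its two children (in-order traversal).
    """
    if level == 0:
        return [(0, index)]
    left = l_tree_placement(level - 1, 2 * index)
    right = l_tree_placement(level - 1, 2 * index + 1)
    return left + [(level, index)] + right


def wire_spans(order):
    """For each level, the set of child-to-parent distances measured in slots."""
    slot = {node: pos for pos, node in enumerate(order)}
    spans = {}
    for (level, index), pos in slot.items():
        parent = (level + 1, index // 2)
        if parent in slot:                       # the root has no parent
            spans.setdefault(level, set()).add(abs(slot[parent] - pos))
    return spans
```

For a tree with four leaves, `l_tree_placement(2)` places the root in the middle slot, and `wire_spans` reports a single distance per level (1 slot for the leaves, 2 slots for their parents). This mirrors, in abstract units, the observation that line lengths are quantized and equal across the leaves of the L-tree.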
4. Simulation results Simulations of a 16×16-bit unsigned multiplier were run with Spectre™ using a five-metal-layer, 0.35μm CMOS technology. Each cell of the SOG measures 1.5μm×17.4μm and contains one NMOS transistor with WN=3.4μm and one PMOS transistor with WP=4.7μm: these form factors were chosen to maximize the routability of the first metal layer and to minimize the propagation delay; each cell is symmetrical. The final dimension of the TWT structure is 636μm×606μm, with a high degree of regularity, as shown in Fig.3. The simulation involved only the critical path: the worst-case switching of each FA pair was considered in order to evaluate the performance of this architecture. The typical-corner post-layout simulation at 27°C with a 3.3V power supply, shown in Fig.4, demonstrates that operating frequencies up to 1GHz are achievable on this support.
5. Conclusions A highly regular modular architecture for the design of pipelined binary tree multipliers has been presented. The regularity of the architecture, combined with the structure of the SOG, allows for a high degree of predictability of line delay and for good balancing of the line capacitive load. System-level optimization of the pipeline buffers led to high operating frequencies even though the SOG does not allow for single-transistor optimization. The proposed architecture appears to be an ideal candidate for the design of a multiplier compiler providing both short design time and predictable performance.
Fig.4: Simulation results for the critical path of a 16×16-bit TWT multiplier in 0.35μm CMOS technology: all output signal transients end before the clock rises.
6. References
[1] B.Zahiri, “Structured ASICs: Opportunities and Challenges”, 21st IEEE International Conference on Computer Design, proceedings, pp.404-409, 2003.
[2] C.S.Wallace, “A suggestion for a fast multiplier”, IEEE Transactions on Electronic Computers, vol.13, pp.14-17, Feb. 1964.
[3] K.F.Pang, “Architecture for pipelined Wallace tree multiplier-accumulators”, IEEE International Conference on Computer Design, proceedings, pp.247-250, Sep. 1990.
[4] Y.Harata, et al., “A high speed multiplier using redundant binary adder tree”, IEEE J. Solid St. Circ., vol.22, pp.28-34, Feb. 1987.