Generation and validation of multioperand carry save adders from the web
Minas Dasygenis
Department of Informatics and Telecommunications Engineering, University of Western Macedonia, Kozani, 50100, Greece
[email protected]
Abstract
Many arithmetic circuits utilize multioperand addition, usually implemented with carry-save adder (CSA) trees. Automatic generation of custom VHDL models for these CSA trees allows the designer to perform a time-efficient design space exploration. Although CSA trees are heavily utilized in modern digital circuits, there is no tool accessible from the web that generates the HDL description of such multioperand designs. To the best of our knowledge, our novel tool is the first to automate the design of optimized CSA trees and simultaneously provide custom testbenches to verify their correctness. Our synthesized circuits on a Xilinx Virtex 6 FPGA operate at up to 724 MHz.
I. Introduction
The design automation and test (DAT) processes play a crucial role in the contemporary multi-billion transistor era. One aspect of DAT is the fast parametrized generation of bit-accurate models and their test vectors in a hardware description language (HDL). This enables designers to perform a rapid design space exploration and select the best custom implementation. HDL generators for circuits that are required in almost every digital system are especially important. Addition is by far the most important elementary operation within digital systems [14]. One very effective technique to add multiple vectors is the carry save adder (CSA) tree. Even though CSA trees were introduced some decades ago [4] and are still very popular as parts of other circuits (multipliers [10], FIR [7], cryptography [16], DSP and others), tools that support automatic HDL generation of multioperand CSA trees for vectors of arbitrary bit patterns are absent from the EDA landscape, due to the complexity of their design. Designing CSA trees is a very tedious process, especially when the designer requires the addition of multiple operand vectors with non-standard patterns (such as vectors carrying non-continuous input bit patterns).
All the commercial IP block generators can create HDL descriptions of adders using CSA, but they only support specific bitwidths (8, 16, 32, 64, etc.), require every vector of the module to have the same size, and require every vector to carry only continuous bit patterns. This means that the derived circuit will be unoptimized if the vectors are sparse and utilize only some bit positions. The further a design deviates from the standard parameters that the commercial tools accept, the less optimized the resulting circuit will be. Designing circuits optimized for performance and energy consumption demands an IP block generator with greater flexibility. Finally, EDA tools that are publicly accessible from the web are very scarce [12].
Carry save adders are used intensively in the design of computer arithmetic modules. One of the areas that makes heavy use of them is the Residue Number System (RNS) [15], a non-conventional, weightless arithmetic system. Due to its nature, this arithmetic system requires the addition of various bit patterns. During the last fifty years, many authors have presented circuits that utilize carry save adders as core components, such as multioperand modular adders [8], multipliers [11], RNS to binary (forward and reverse) converters [13] and a multitude of other modules. All these circuits consist of carry save adders that process non-standard sparse input vectors, and cannot be designed automatically using EDA IP block generators. In fact, the complete absence from the EDA landscape of a flexible tool that produces syntactically correct and synthesizable HDL code for adding custom input vectors was the primary motivation for our work. Even though the CSA is a heavily used arithmetic module, and thus an important one in circuit design, nobody until now has presented a tool to automate the design and test of parametrized CSA trees.
The derived CSA trees are optimized according to the heuristic of using the least number of components, and are thoroughly tested using random input vectors, which are supplied to the end user. Therefore, the novelty of this paper lies in the fact that we present the first web based tool to automate the optimized CSA design process and increase the productivity of every engineer that requires a custom CSA module. Such custom CSA modules cannot be designed with the IP block generators of major EDA companies (like Altera or Xilinx). Our tool is user friendly and accessible using any standard web browser, even from a mobile device. To the best of our knowledge, no CSA tool exists to date that accepts an arbitrary number of input vectors, each one consisting of several continuous or sparse bit patterns of various bit lengths, is accessible to everybody via a typical browser, and generates a syntactically correct register transfer level description in HDL, accompanied by a randomly generated testbench to validate its correctness. Motivated by this, we worked on developing such a tool.
Here, we present the outcome of our work: a web based tool to assist in the rapid design space exploration of CSA tree generation. Our tool accepts as inputs the number of vectors, their bit patterns and weights, and generates a synthesizable RTL description in the VHDL language. The CSA generation is optimized in terms of using the least number of components, because the tool can decide whether to use a full adder, a half adder, or to defer the placement of an adder to a future design iteration for a more efficient component utilization. Furthermore, it generates a custom testbench file to verify the circuit’s correctness, and a schematic. Finally, it reports metrics of performance (delay) and hardware (number of components and transistors).
The rest of the paper is structured as follows: The next section (Section II) discusses related work in the field of automating HDL generation. In Section III we present our web based tool and describe its modules in detail, while in Section IV we provide test cases with performance and hardware metrics. Finally, in Section V we give the concluding remarks.
II. Related Work
Reconfigurable computing systems, System-on-Chip, embedded systems and other complex systems usually have many design requirements, sometimes conflicting, that demand a thorough design space exploration. In a recent research work [18] the authors pinpointed that even though multiple research projects have worked on automating HDL generation during the last 15 years, the outcomes are not encouraging. Furthermore, they conclude that “despite all these efforts the automated hardware generation is yet to become a widely adopted industrial practice”, due to (a) lack of automation and (b) lack of optimization. Agreeing with these authors, we built our tool as a contribution toward EDA automation. All major EDA companies (Altera, Xilinx, MathWorks, etc.) support customized IP blocks for various functions. Xilinx provides the “Core Generator”, while Altera provides the “MegaWizard Generator”. These generators can be used to create HDL descriptions of typical adders for a specific bitwidth of two vectors, but not for fully parametrized carry save adders.
Automating VHDL generation for various design modules has been a topic of research for more than 15 years. Daveau et al. [3] presented an approach to allow the generation of VHDL from SDL models. The work was limited to the specification and no EDA tool was ever created. Daitx et al. [2] presented a tool to create VHDL descriptions of FIR filters according to a coefficient specification file. Their tool is not available online, in contrast with our tool. On the other hand, automating high level HDL generation is the focus of multiple research projects. All these generators accept a high level description of the code, usually in C, and create an HDL description of it. The major drawback of high level synthesis is that the input code must adhere to many requirements, such as regular memory access, perfectly nested loops and, in some cases, absence of pointers, while all the tools have a limited application domain. The ROCCC project [1] supports mainly streaming applications. The Streams-C project [6] and DWARV [17] also support a very limited application domain. Also, the majority of EDA companies in the field of IC design provide high level synthesis solutions (Vivado, Synphony, Altium, C-to-Silicon, Triton, etc.), but all of them support typical design requirements and cannot provide the granularity required to design optimized CSA trees. MathWorks also provides the “HDL Coder”, which is a high level synthesizer for Matlab functions, models and charts. It cannot be used to create parametrized CSA modules, because CSA modules demand bit level arithmetic and not high level design. As can be observed, our novelty lies in the automated design, from the web, of the optimized HDL of a custom CSA tree. Our tool can produce syntactically correct code for multiple input vectors, either sparse or continuous, of various input lengths, something that is not supported by other EDA tools.
III. The web CSA hardware compiler
Our tool, which has been installed on a public web server (the tool is available at http://arch.icte.uowm.gr/hdl/csa.php), utilizes a number of technologies (PHP, Python, JSON) in order to deliver a syntactically correct and synthesizable VHDL description. Our tool is partitioned into two parts, according to their function: the front-end and the back-end. These parts exchange information using the JavaScript Object Notation (JSON) format [5]. The tool’s back-end provides the core functions, and for this reason we will focus only on it. This back-end consists of three modules: (i) the CSA design module, which analyzes the user inputs and creates the specific design description in a special netlist format, (ii) the HDL Generator module, which takes as input this netlist format and creates signals, networks, assignments, and connections, resulting in the output description in VHDL, and (iii) the VHDL Testbench creator, which takes as input the constructed data structures of the previous module, and generates a full VHDL testbench, with handles for automatic design validation.
A. CSA design module
The first module is the CSA design module, which creates a netlist in an internal format developed at our laboratory, which we call the α-HDL format, and operates in three stages. In the first stage, it accepts as input an arbitrary array of vectors supplied by the user. Each vector carries a series of ‘1’ and ‘0’ in a weighted list, starting from the least significant bit on the left, up to the most significant bit on the right. If a location carries ‘1’, then this location provides a bit that has to be taken into consideration; otherwise this location is not used. Using this notation, the user can supply sparse vectors, that is, vectors that carry bits only in some positions and are not continuous. For example, a sparse vector that carries information only in the 0th and the 2nd bit positions can be written (the left bit is the LSB) as X−X−, with X being a position that should be taken into consideration for the computation, and − an empty position. This example denotes that the possible values of this vector are 1−1− (5), 1−0− (1), 0−1− (4) and 0−0− (0), with the corresponding decimal number placed in parentheses. Even though it may seem strange to use sparse vectors, special arithmetic circuits, such as those of the residue number system, make heavy use of them. The input vectors accepted at the beginning of the CSA design module define the operands of the CSA tree, with a ‘1’ in a given vector defining a bit that has to be used for computation and a ‘0’ defining an empty position. For example, the vector v1 = 0 0 a2 a3 a4 a5 0 0 a8, or −−XXXX−−X, is described with 001111001 or, in Python syntax, [0, 0, 1, 1, 1, 1, 0, 0, 1]. Similarly, the vector v2 = b0 b1 b2 b3 0 is described with 11110 or, in Python syntax, [1, 1, 1, 1, 0]. When a vector has ‘1’ in a position, this vector contributes one bit to the corresponding column.
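The mask notation can be made concrete with a short Python sketch. It is only an illustration under stated assumptions: the helper names `column_heights` and `possible_values` are hypothetical and are not taken from the tool. The first helper counts how many operand bits land in each column for the vectors v1 and v2, and the second lists the values that the sparse vector X−X− can take.

```python
from itertools import product


def column_heights(vectors):
    """Count the operand bits in each column (index 0 is the LSB)."""
    width = max(len(v) for v in vectors)
    return [sum(v[j] for v in vectors if j < len(v)) for j in range(width)]


def possible_values(mask):
    """List every integer a sparse operand can take; mask[0] is the LSB."""
    positions = [j for j, used in enumerate(mask) if used]
    values = []
    for bits in product((0, 1), repeat=len(positions)):
        # Each used position j contributes bit * 2**j to the value.
        values.append(sum(b << j for b, j in zip(bits, positions)))
    return sorted(values)


# v1 and v2 from the text, written LSB first.
vectors = [[0, 0, 1, 1, 1, 1, 0, 0, 1], [1, 1, 1, 1, 0]]
heights = column_heights(vectors)            # [1, 1, 2, 2, 1, 1, 0, 0, 1]
sparse_values = possible_values([1, 0, 1, 0])  # [0, 1, 4, 5]
```

The column heights are the quantity that a reduction stage would work on: only columns 2 and 3 hold two bits in this example, and every other column holds at most one.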
For the example with these inputs, the two dimensional array of vectors will be: [[0, 0, 1, 1, 1, 1, 0, 0, 1], [1, 1, 1, 1, 0]].
The second stage of the CSA module is the reduction, which consists of many iterations. In every iteration i, the reduction stage scans all columns j starting from the least significant column, locates the columns that have more than 1 bit and places full adders (FA) or half adders (HA). The adders are placed in the most efficient way, in order to minimize the total number of FAs and HAs, as follows. While the number of bits to be added in a column exceeds 2, full adders are placed in the netlist, with their output carry registered for future processing at the next iteration (i + 1), at the next column (j + 1), and their output sum registered for future processing at the next iteration (i + 1), same column (j). If the number of bits to be added is 2, then the tool examines whether to add an HA, or to delay the insertion of the HA in favor of a better utilization in a future iteration. The tool will not add an HA when a carry has been registered at the next iteration (i + 1) for this column (j), because an FA can be used in the next iteration to add all three bits. Also, the tool will not add an HA when two bits have been registered at the next iteration (i + 1) of the previous column (j − 1), because in iteration (i + 1) a carry bit will be created and registered at iteration (i + 2), column (j). Thus, in iteration (i + 2), an FA can be used to add all three bits. When the total number of bits in a column is less than 2, the reduction stage completes.
The third stage of the CSA module is the final addition using a ripple carry adder. This stage, which is also optimized, places the best number and types of adders. It checks the total number of bits to be added in every column (0, 1 or 2), and then decides whether to connect the column directly to the output (when the bits are 0 or 1 and no carry has been generated in the previous column), to place an HA (when this column carries two bits and no carry bit has been generated in the previous column, or this column carries one bit and a carry bit has been generated in the previous column), or to place an FA (when 3 bits have to be added).
The CSA design module also accepts as input the option to pipeline the design or not. The pipelined design uses D flip flops (DFF) to delay input and output bit columns, and increases the throughput of the design, at the cost of increased hardware (Section IV). The tool carefully adds delay units to both input and output columns, for a uniform delay to every bit. Currently, there is no pipeline optimization or parametrization: a DFF is utilized after every adder.
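The reduction and final addition stages can be illustrated with a simplified Python sketch that works on column heights only. Several points are assumptions rather than the tool's actual behavior: the function names are hypothetical, the reduction uses full adders only (the half adder scheduling heuristic described above is deliberately omitted), it stops once no column holds more than two bits, and a column whose only bit is an incoming carry is treated as a direct connection.

```python
def reduce_heights(heights):
    """Apply 3:2 compression until every column holds at most two bits.

    heights[j] is the number of bits in column j (index 0 is the LSB).
    Returns (final_heights, full_adder_count, iterations).
    """
    cols = list(heights)
    fa_count = 0
    iterations = 0
    while any(h > 2 for h in cols):
        nxt = [0] * (len(cols) + 1)
        for j, h in enumerate(cols):
            fas = h // 3            # full adders placed in column j
            nxt[j] += fas + h % 3   # sums stay in column j, leftovers pass through
            nxt[j + 1] += fas       # carries move to column j + 1
            fa_count += fas
        if nxt[-1] == 0:
            nxt.pop()               # drop an unused most significant column
        cols = nxt
        iterations += 1
    return cols, fa_count, iterations


def final_adder_plan(heights):
    """Pick a wire, HA or FA per column for the final ripple carry addition."""
    plan = []
    carry_in = False
    for h in heights:
        if h > 2:
            raise ValueError("reduce the columns to at most two bits first")
        total = h + (1 if carry_in else 0)
        if total <= 1:
            plan.append("wire")     # nothing to add, connect directly
            carry_in = False
        elif total == 2:
            plan.append("HA")
            carry_in = True         # an adder always exposes a carry output
        else:
            plan.append("FA")
            carry_in = True
    return plan


reduced, full_adders, steps = reduce_heights([3, 3, 3])  # ([1, 2, 2, 1], 3, 1)
plan = final_adder_plan(reduced)                          # ['wire', 'HA', 'FA', 'HA']
```

Because every full adder replaces three bits with two, the total bit count drops in each iteration that places an adder, so the loop terminates. The sketch does not model the extra output column created by the carry of the last adder.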
B. HDL Generator Module
This is a general purpose VHDL generator module that can be easily connected to many different generators. This module accepts as input a special and compact netlist format, which we name abstracted HDL (α-HDL). This is a special type of netlist that can be visualized as a hypercube (Fig. 1), in which every node corresponds to a component and carries the vectors that define its input connections. These vectors convey only the component that generated the signal, the output port of that component, the bitwidth and the signal type. By defining only the input connection vectors, a very compact and easy to generate netlist format is created. This netlist is encompassed in a single JavaScript Object Notation (JSON) object. The netlist format is outside the scope of this paper, and thus we will not describe it further. The HDL Generator module consists of five stages: (i) validation, (ii) top level input output, (iii) port mapping, (iv) HDL generator, and (v) schematic generator (Figure 2).
Fig. 1. The visualization of the α-HDL netlist format forms a hypercube
Fig. 2. The HDL Generator accepts our compact netlist and outputs the VHDL files and the schematic
At the validation stage, the α-HDL format is checked for inconsistencies, like the existence of vectors that carry invalid coordinates. Also, every component used in the netlist is checked against the HDL library of the tool to verify that it is supported. The HDL generator maintains a library of many primitive components (FA, HA, logic gates and more). Every component provides the architectural description, the component description, the entity description and the number of in, out, inout and buffer ports. The number and type of ports is important at the port mapping stage.
At the second stage, the top level input and output analysis is performed. The tool examines every vector, and registers the input vectors whose origin is an input port of the design. It should be noted that the netlist carries no explicit input and output port information: the input and output ports are discovered from the vectors. The vectors are analyzed and the top level input ports are defined in terms of port number, port bitwidth, signal type, and port type. In a similar way, this stage defines the output ports of the design. The results of this stage are two structures that define the input and output ports.
At the third stage, the port mapping operation occurs. In this operation, signals are created and connected to specific port numbers and port types. For every component, the input vectors are examined in order to form the connection pair between the originating component and the destination component. If a signal name already exists with the same attributes (originating component, originating port, type and bitwidth), this signal is registered to the portmap structure; otherwise a new signal name is created. The portmap structure of every component carries four sections: the ‘in’, ‘out’, ‘inout’ and ‘buffer’ sections, similar to the VHDL port types. Every port section is an ordered collection of signals, whose positions define the port number within this port type (for example, a signal name located at position ‘2’ of the out port section signifies that this signal will be connected to out2 during the VHDL creation). Port indices that have no connection are marked ‘open’ or connected to a ‘ground’ signal. The outcome of this stage is the portmapping data structure and the signals data structure.
At the fourth stage, the HDL generation occurs. Currently, only VHDL can be generated, but this stage can be extended to cover Verilog HDL generation. This stage outputs two VHDL files: (1) a VHDL file that carries the entities and the architectural descriptions of all primitive components used in the design, and (2) a VHDL file that describes the design derived from the netlist. Here, the tool places the appropriate library declarations, which depend on the type of signals used. Then, it instantiates every primitive component used. Using the input and output ports data structure, it creates the entity definition of the design. The next task is to define signals, using the signal data structure created before. After this, port mapping declarations that strictly adhere to the VHDL syntax are produced, using the appropriate data structure.
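The signal reuse rule of the port mapping stage can be sketched in Python as follows. This is a minimal illustration and not the tool's data structure: the class `SignalTable`, the helper `render_section`, the `sigN` naming scheme and the component names are all hypothetical. The sketch keys every signal on its four attributes and renders an ordered port section with unconnected positions marked ‘open’.

```python
class SignalTable:
    """Hand out one signal name per unique (component, port, type, bitwidth)."""

    def __init__(self):
        self._names = {}

    def lookup(self, component, port, sig_type, width):
        key = (component, port, sig_type, width)
        if key not in self._names:
            # First time these attributes are seen: create a new signal name.
            self._names[key] = f"sig{len(self._names)}"
        return self._names[key]


def render_section(signals):
    """Replace unconnected positions (None) of a port section with 'open'."""
    return [name if name is not None else "open" for name in signals]


table = SignalTable()
a = table.lookup("FA_0", "out0", "std_logic", 1)   # new signal
b = table.lookup("FA_0", "out0", "std_logic", 1)   # same attributes, reused
c = table.lookup("FA_0", "out1", "std_logic", 1)   # different port, new signal

# Ordered 'in' section of a hypothetical destination component:
# position 0 and 2 are connected, position 1 is left unconnected.
in_section = render_section([a, None, c])
```

Keying on the full attribute tuple means two destination components that read the same output port share one signal, which keeps the generated signal declarations free of duplicates.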
After the port mapping declarations, the signal assignment phase occurs, in which signals are connected to constant values, or signals are suitably connected to other signals of different bitwidth.
At the fifth stage, the block schematic of the design is created, for a visual representation. Using the DOT visualization language, a graph is constructed and rendered as a PNG picture. This stage makes use of the portmapping, input and output port data structures in order to complete all connections. Furthermore, it annotates every connection with the signal name, the input and output port of the connection, and the bitwidth. Finally, it uses colors to indicate the input and output paths.
Along with the above files, some metrics are also reported to the designer: the quantity of every component used and the delay, measured in terms of stages. It has to be noted that the tool does not perform any synthesis at all. The transistor count metrics are computed using standard calculations; for example, a one-bit full adder requires 28 transistors, a 2 input OR gate requires 6 transistors, and so on [9]. The delay corresponds to the worst case path from an input vector to the computation of the output vector, and is measured by counting the D flip flops that are used as pipeline latches. Finally, the transistor count is reported both for every component and in total, in case the user would like to implement this design in CMOS [9]. Our tool does not synthesize the circuit, because its purpose is different: it generates optimized VHDL code, which engineers will use in their own synthesis tools.
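The transistor count metric amounts to a weighted sum over the component quantities. The Python sketch below uses only the two per-component figures quoted in the text (28 transistors for a one-bit full adder and 6 for a 2 input OR gate); the function name, the component labels and the example quantities are hypothetical.

```python
# Per-component transistor counts quoted in the text; other components
# would need their own entries.
TRANSISTORS_PER_COMPONENT = {"FA": 28, "OR2": 6}


def transistor_report(quantities):
    """Return the transistor count per component type and in total."""
    per_component = {
        name: count * TRANSISTORS_PER_COMPONENT[name]
        for name, count in quantities.items()
    }
    return per_component, sum(per_component.values())


# Hypothetical design with four full adders and two OR gates.
per_component, total = transistor_report({"FA": 4, "OR2": 2})
```

An unknown component name raises a `KeyError` in this sketch, which makes a missing library entry visible instead of silently under-counting.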
C. HDL Testbench Generator
This module is of utmost importance, because it creates multiple test vectors, which can be used to test the correctness of the design in an HDL simulator. Evidently, the multioperand CSA design is a very complicated process and should be tested thoroughly. Our tool accepts as input the number of input cases to create, and generates the testbench VHDL file. To do this, it first creates an empty entity declaration, then it instantiates the top level component and creates signals for every input and output port. Furthermore, it creates a clock process and a function that is used to convert bits to an integer. The next step is to create the requested number of input test cases. For each input test case, the following loop is performed: for every operand, a random number is created and Boolean ANDed with the operand bitmask that was given in the beginning. For example, if the user had supplied for the first input vector the mask XX−− or 1100, then every randomly generated value will be Boolean ANDed with 1100, resulting in one of the following values (the lsb is the left bit): 1100 (3), 1000 (1), 0100 (2), 0000 (0). This number is converted to binary, extended to the full bitwidth of the operand, and assigned to the signal that is associated with this input. The tool sums all the generated operands and precomputes the final result. After the value assignments, it inserts a wait clause for the delay, which was computed in the previous stage, and then constructs an ‘assert’ statement to check the output. For example, if one vector has the random value (the lsb is the right bit) 100000b (32), and another has the random value 100011b (35), then the following lines will be added to the generated VHDL file:
-- input vector: 32 signal0