Using Digital Signal Processor Singularities to ...

Using Digital Signal Processor Singularities to Minimize ROM Based Controller Area Bertrand LE GAL, Lilian BOSSUET and Dominique DALLET

IMS Laboratory - UMR CNRS 5218 University of Bordeaux - Talence, FRANCE Abstract—The Interest in synthesis of custom Digital Signal Processors (DSP) using automated design flow like High Level Synthesis greatly increased in the last years. This phenomenon is due to the growing processing complexity and the time to market constraint. Dedicated processor component design is a complex process, for which tools must optimize the datapath and its controller. In this paper, we propose a controller design flow based on mapping Finite-State Machines into Memory Blocks in order to limit the critical path delay in the controller. Our design flow approach takes into account DSP circuit singularities providing efficient area saving compared to other approaches (more than 18% up to 42% on real applications).

I. INTRODUCTION Custom circuits generated using High-Level Synthesis design flows are based on generic architectures. These custom circuits, usually dedicated to Digital Signal Processing applications, are composed by two main parts: a datapath to perform the computations and a controller to command the hardware resources. Complexity of generated circuits increases with application functionalities and performance constraints. Design complexity heavily affects the controller part of the circuit. Therefore, the controller part limits the circuit maximum frequency characteristic. Many controllers related issues have been addressed in control intensive researches. It has been demonstrated that implementing a controller using a ROM based design provides interesting characteristics [1, 2]. However, existing techniques were developed for control-intensive applications and must be adapted to efficiently manage computation intensive circuit specificities.

In this paper, we present a new design flow for ROM based controllers dedicated to custom digital signal processors to save memory area. The problem formulation and its resolution are different from literature approaches. Indeed, we do not consider that the next state computation part of the controller is the most complex one. In the case of DSP circuits, the controller complexity is located in the output decoder part of the design. The main issue is to factorize efficiently the output command signals to save ROM area. Article is organized as follows. Section 2 presents the literature approaches optimizing ROM based implementations of controllers and explains the motivation behind studying this kind of solution. In Section 3, we detail the area optimization algorithm used and extend literature approaches to handle DSP singularities. Experimental results which validate our strategy to reduce the controller ROM area are reported in Section 4. Finally, Section 5 concludes this paper. II. RELATED WORKS The FSM can be described by a 5-tuple (X, Y, S, f, g) where X, Y and S are finite collections of the input, output and state variables; The f function defined as f: X!S " S is the transition function which provides the next state sj for the given inputs and the current state si. The function named g defined as g: X!S " Y is the output function which computes the machine output for the given inputs and current state. DSP controller optimization techniques have been developed considering logic based controller implementation (Figure 1a). Logic based controller design has been proved as inefficient for controllers with large number of states [3]. One

Figure 1. Various FSM controller implementations (a) logic based design (b) ROM based design (address and output) (c) ROM based design using an address counter (states).

Figure 2. Circuit composed of a datapath and its controller. way to cope the relation between the critical path and the number of FSM states is to implement the design controller in a ROM based design (Figure 1b). Using such controller architecture, the output values and the transition conditions are pre-computed (before logical synthesis and stored in ROM. In these controller architectures, the critical path is constant whatever the number of states and the number of resources (depending only on ROM characteristics). General methods for ROM based controller synthesis targeting implementation of sequential circuits using embedded memory blocks have been proposed in [4, 5]. These methods dedicated to control intensive applications, save ROM area by decomposing the memory block (corresponding to the controller) into two blocks: a semi-combinational address modifier and a smaller memory block to store the output values. An appropriately chosen decomposition strategy may reduce the required memory size at the cost of additional logic cells. These optimizations focus only the next address computation part of the controller implementation. A similar approach was proposed in [5] which considers the controller power consumption problem. Finally, in [6], the author uses don't care value to simplify state transition equations. This simplification reduces the memory size as well as multiplexer complexity of the address modifier part only of the controller. Proposed approach Literature approaches focus on general FSM models with uncorrelated output signals. They consider that next state computation part of the controller is more complex than the output decoder one. In dedicated DSP circuits, controllers do not have such characteristics: (1) the FSM models are linear (the next state decoding equations and conditions are simples) (2) the output signal set is complex (huge numbers of states and output signals).

Figure 3. Command signal table for ASIP circuit (signals control datapath resources)

In custom DSP circuits, the controller output signals are used to control the datapath resources like shown in Figure 2. In the same time, the next state computation part of controller is simpler than general FSM models because their execution path is linear. This singularity is due to digital signal application properties which are computation intensive applications. Controller architecture presented in Figure 1c is efficient for such kind of circuits. In this design, the next state computation function f is implemented using an adder, a register and a multiplexer resource. The output decoding function g is implemented using a ROM memory. Each word of the ROM store the output commands associated to state S. In this paper, we propose a cluster-based methodology that reduces efficiently ROM area associated to output signal generation. III. AREA SAVING TECHNIQUES ROM area increases with the number of resources and the number of execution states. Depending on the circuit complexity, these requirements can become huge. Two techniques have been proposed, removing spatial and temporal redundancy. These techniques use the fact that command signals (controller outputs) are not required for each hardware resources at each clock cycle. Undefined command values named don't care values are represented using X in example the truth table (Figure 3). Don't care values help the compaction process in reducing the ROM area as they can be modified without design functionality impact. Spatial redundancy - The first approach to save ROM area is to realize column compaction [4]. This step aims at removing output signals (columns) which are logically equivalent, or can be made equivalent through assignment of don't cares. Given a set of output columns, the problem of finding the smallest column set to drive the overall datapath resources can be obtained by compacting the given set. This problem is related to the maximum clique-partitioning problem, which is NP-complete. Removing temporal redundancy – The second approach is used to remove the inter-instruction redundancy (reducing the ROM height). Removing the temporal redundancies modifies output computation function g:X!S"Y updating it by g’:T(X!S)"Y where T:(X!S)"I is the function which associate an instruction for each controller state. This indexed relation between state and the output signals required an architectural modification: a new ROM memory is added to the controller design to implement the T relation. Modified controller architecture is presented in Figure 4. However this appraoch may leads to ROM size increasing in some circumstancies i.e. when their exist a low instruction redundancy in the controller (indexing ROM may be more

Figure 4. Architecture for instruction compacted controllers

Figure 5. Controller approaches in DSP circuit (a) single controller (b) clustered controllers (c) mixed approach expensive that memory saved compacting instructions).. A. Proposed Cluster-Based approach Using a single controller approach to control the overall datapath (Figure 5a) is a bottleneck during the optimization process, i.e. merging rows or columns can be forbidden by one bit value over hundreds. To solve this optimization issue, the dedicated processor circuit can be divided into independent synchronous clusters. Clusters are atomic elements composed of: a computation resource, its associated storage elements and required steering logic resources. Each cluster has its own characteristics (i.e. computation starting and ending states depending on the resource usages). Using such circuit decomposition (Figure 5b), each cluster controller can be optimized without considering others. This approach is efficient for instruction compaction. Unfortunately, the drawback of duplicating controllers using an island-styled approach (Figure 5b) is design area increase: it reduces the column compaction opportunities and it requires indexing ROMs in each cluster. Duplicated indexing ROMs and state counters which can be partially or fully redundant with others increases circuit area. Efficient datapath controller design is located between the clustered approach and single ROM one (Figure 5c). The clustermerging problem is an optimization problem where the objective function can be described as follow: N

min :

" Area(c ) i

i=1

with N the number of controllers; Area(ci) the controller memory size of the ith controller.

!

To find an efficient controller solution, a weighted graph B=(C, E) is built. Each vertex cl # C represents a cluster controller. Nodes ci are weighted with vi which represent the minimum memory cost of the ith controller. E # (C ! C) is the set of weighted edges el,m between cl and cm. Edges represent

the merging possibility between the linked clusters. Weight wl,m associated with edge el,m corresponds to the area saving (or lost) obtained while merging controllers c1 with c2. 1) First Step: Creating the weighted graph For each cluster ci with i # [1, N] in the architecture we create a node ci in B. For each node ci we compute the associated controller minimum memory cost vi. The optimal vi weight is obtained after applying the overall optimization schemes. There exist three distinct possible ways to obtain the best ROM design: (1) using column compaction only (2) using column and instruction packing (3) the same optimizations as 2 but in the opposite order. The minimum cost value is selected as node weight (vi). Once all nodes is created, we can create edges. Each graph node is linked to all others. Each edge ei,j models the area saving obtained while merging controller ci with cj. For each node couple (ci, cj) we create three distinct edges, weighted using the merged controller cost obtained using the three optimization processes (similar as nodes weight). An example of such graph is presented in Figure 6a. 2) Second Step: Removing inefficient opportunities Once the overall, weighted graph B has been constructed, we first eliminate the redundant edges linking node couples: for each couple (ci, cj) we only conserve one edge ei,j corresponding to the better area reduction. In case of area saving equivalence, column compacted only solution is preferred to other ones for critical path reason. This step result is presented in Figure 6b. Finally, the inefficient merging possibilities are removed (merging opportunities which increase the design area). Each edge is evaluated and ones with wi,j < 0 are removed. This model transformation is illustrated in Figure 6c. 3) Third Step: Incremental graph compaction Graph compaction problem is solved using greedy approach to limit the algorithmic complexity. The B graph is analyzed to find the maximum weight wi,j value. This weight

(a) Initial graph model (b) Redundant edge removing (c) Inefficient edge removing (d) ROM factorization result Figure 6. Bi-partite graph models obtained during the proposed optimization process.

Application

Technique

64 taps FFT

Without optimisation Column compaction Column and Instruction compaction Fully clustered approach Proposed approach Without optimisation Column compaction Column and Instruction compaction Fully clustered approach Proposed approach

inverse JPEG

Without optimisation Column compaction Column and Instruction compaction Fully clustered approach Proposed approach Without optimisation Column compaction Column and Instruction Fully clustered approach Proposed approach

# of # of states resources

91

1247

274

1069

80

2805

140

2472

# of clusters

Instruction # of ROM width bits

ROM size # of (kByte) instructions

Saving vs proposed

1 1 1 61 13 1 1 1 54 10

1247 692 692 1043 852 1069 551 551 873 677

114724 63664 48392 45057 37509 293975 151525 88707 74305 57095

14,0 7,8 5,9 5,5 4,6 35,9 18,5 10,8 9,1 7,0 0,0

91 91 69 [6, 35] [29, 44] 274 274 157 [3, 65] [53, 89]

67,3!% 41,1!% 22,5!% 16,8!% -----80,6!% 62,3!% 35,6!% 23,2!% ------

1 1 1 256 37 1 1 1 119 4

2805 1221 1221 2295 1674 2472 1110 1110 1963 1136

227205 98901 88479 78307 50749 348552 156510 139737 148909 106635

27,7 12,1 10,8 9,6 6,2 42,5 19,1 17,1 18,2 13,0

80 80 72 [4, 30] [13, 33] 140 140 125 [2, 76] [91, 94]

77,7!% 48,7!% 42,6!% 35,2!% -----69,4!% 31,9!% 23,7!% 28,4!% ------

Table 1. ROM area saving for two applications synthesized under different timing (latency) constraints corresponds to the better controller merging opportunity (area saving). We merge the two controllers associated to ei,j, removing nodes ci and cj from B. A new node ck is inserted in B. Edges linking ck to cm with cm # B/{ck} are created and weighted as described in algorithm Step 1. Newly created edges are optimized and the procedure performed again until there is no more possible merging available (Figure 6d). Saving results may be improved considering smaller clusters at the optimization start i.e. one controller for each register and multiplexer element). Unfortunately, such approach would increase drastically the optimization runtime. IV. EXPERIMENTAL RESULTS In this section, we present experimental results. The ROM based optimization techniques have been integrated in the VHDL backend of the GraphLab HLS tool [8]. Experiments are based on commonly used digital signal processing applications. The optimization process results have been obtained for different synthesis constraints applied to the applications producing different controllers architectures (different number of resources to control and different number of states). This procedure helps us to provide a fairly evaluation of the saving obtained. Five methodologies have been compared: (1) ROM optimized using the column compaction technique only (2) optimized using column compaction technique followed by instruction compaction (3) same as 2 but in the opposite order (4) full clustered approach, with for each controller column and instruction compaction (5) proposed cluster based approach. Results presented in Table 1 present ROM area saving obtained for benchmark applications depending on the controller number of states (and the number of resources). Results highlight that the proposed technique always provide better area saving (more than 17% saving) compared to other methodologies. However, memory saving obtained depends on the FSM characteristics. However, the saving depends on the number of resources, controller states, and command signal values.

V. CONCLUSION In this paper, we have presented a new ROM area saving methodology whose objective is to use the digital signal processor circuit singularities to extend literature approaches. The controller architecture is generated using is an efficient trade-off between a single controller and a full clustered approach. Proposed methodology, efficient for complex circuit, can be integrated to high-level synthesis in CAD tools or used independently on hand made circuits. As the experimental results show, the controllers generated using our design flow have significantly less area: 22% up to 42% compared to a single based controller design and 17% up to 36% compared to a full clustered approach. VI. [1] [2]

[3]

[4]

[5]

[6]

[7]

[8]

REFERENCES

W. Shiue, “Power/area/delay aware fsm synthesis and optimization,” Microelectronics Journal, vol. 36, no. 2, pp. 147–162, February 2005. K. Kuusilinna, V. Lahtinen, T. Hamalainen, and J. Saarinen, “Finite state machine encoding for vhdl synthesis,” in Computers and Digital Techniques, IEEE Proceedings -, vol. 148, no. 1, pp. 23–30, 2001. P. Bomel, E. Martin, and E. Boutillon, “Synchronization processor synthesis for latency insensitive systems,” in Proceedings of the DATE conference, pp. 896–897, 2005. S. Mitra, L. Avra, and E. Mc Cluskey, “An output encoding problem and a solution technique,” in Proceedings of the 1997 IEEE/ACM international conference on Computer-aided design, pp. 304–307, 1997. H. Selvaraj et al., “Fsm implementation in embedded memory blocks of programmable logic devices using functional decomposition,” in Proceedings of the ITCC’02 Conference, page 355, 2002. M. Rawski, H. Selvaraj, and T. Luba, “An application of functional decomposition in ROM-based FSM implementation in fpga devices,” Journal of Systems Architecture, vol. 51, no. 6-7, pp. 424–434, 2005. I. Garcia-Varga at al, “ROM-based finite state machine implementation in low cost fpgas,” in IEEE International Symposium on Industrial Electronics (ISIE), pp. 2342–2347, 2007. E. Casseau and B. Le Gal, “ High-Level Synthesis for the Design of FPGA-based Signal Processing Systems”, In the Proceedings of SAMOS’09, Samos, Greece, July 20-23, 2009

Using Digital Signal Processor Singularities to ...

Using Digital Signal Processor Singularities to ...

Suggest Documents

Digital Signal Processor (DSP) Architecture

Control Using Digital Signal Processor Chips - Dipartimento di ...

xdspcore: a compiler-basedconfigurable digital signal processor

Digital signal processor fundamentals and system design

digital signal processor architectures and programming

Digital Signal Processor Controlled PWM Phase ...

Signal Generation Using TMS320C6713 Processor - IJCSET

Digital signal processor control of scanned probe microscopes - Core

Digital Signal Processor With Efficient RGB Interpolation And ... - kaist

Design and verification for dual issue digital signal processor

A Scalable SIMD Digital Signal Processor for High

Low-Power Digital Signal Processor Design for a ...

Digital Signal Processor (DSP) based 1/f noise generator

RNS-enabled digital signal processor design - Electronics Letters

Digital Signal Processor System for AC Power Drivers

Development of Digital Signal Processor Controlled ... - IEEE Xplore

Digital signal processor-based real-time optical Doppler tomography ...

Digital signal processor control of scanned probe microscopes - Core

Digital signal processor against field programmable gate array ...

A robust digital signal processor: determining the true input rate

SMJ320C80 Digital Signal Processor (Rev. B) - Texas Instruments

A Fast DCT Algorithm for Watermarking in Digital Signal Processor

Power-Efficient Tightly-Coupled Processor Arrays for Digital Signal ...

Digital-signal-processor-based dynamic imaging ... - Semantic Scholar