
A High-Performance Data-Path for Synthesizing DSP Kernels

Michalis D. Galanis, George Theodoridis, Spyros Tragoudas, and Costas E. Goutis

Abstract A high-performance data-path for implementing DSP kernels is introduced in this paper. The data-path is realized by a Flexible Computational Component (FCC), a purely combinational circuit that can implement any 2x2 template (cluster) of primitive resources. Thus, the data-path's performance benefits from the intra-component chaining of operations. Due to the flexible structure of the FCC, the data-path is implemented with a small number of such components. This allows for direct connections among FCCs and for exploiting inter-component chaining, which further improves performance. Due to the universality and flexibility of the FCC, simple and efficient algorithms perform the scheduling and binding of the Data Flow Graph. DSP benchmarks synthesized with the FCC data-path method show significant performance improvements when compared with template-based data-path designs. Detailed results on execution time, FCC utilization, and area are presented.

Index terms: Template units, high-performance data-path, chaining, scheduling, binding.

Manuscript received April 7, 2004; revised November 8, 2004; revised May 23, 2005.


Affiliations Michalis D. Galanis and Costas E. Goutis are with the VLSI Design Laboratory, Electrical & Computer Engineering Department, University of Patras, Rio Campus, 26500, Greece. George Theodoridis is with the Physics Department, Aristotle University, Thessalonica, 54124, Greece. Spyros Tragoudas is with the Electrical and Computer Engineering Department, Southern Illinois University, Carbondale, IL 62901, USA.

Acknowledgements of financial support The work of Michalis D. Galanis was supported by the Alexander S. Onassis Public Benefit Foundation.


1. Introduction

Digital Signal Processing (DSP) and multimedia applications usually spend most of their execution time in a small number of code segments with well-defined characteristics, called kernels. To accelerate the execution of such kernels, various high-performance data-paths have been proposed [1]-[11]. Research in High-Level Synthesis (HLS) [2]-[4] and Application Specific Instruction Processors (ASIPs) [5]-[11] has shown that using complex resources instead of primitive ones (such as a single ALU) improves the performance of the data-path. In these works, complex operations (instructions in the ASIP domain) are used at the behavioral level instead of groups of primitive resources. At the architectural level, the complex operations are implemented by optimal custom-designed hardware units, called templates or clusters. A template may be a specialized hardware unit or a group of chained units. Special hardware is usually used for frequently occurring operations (e.g. multiply-add). Chaining is the removal of the intermediate registers between primitive units, which reduces the total delay of the combined units. The templates can either be obtained from an existing library [2]-[5] or be extracted from the kernel's Data Flow Graph (DFG) [6]-[11]. Although templates improve performance, the area of the resulting data-path is larger than that of the corresponding data-path implemented with primitive resources [2]-[11]. However, the area-time product, which indicates the efficiency of the synthesized data-path [2], is lower than that of a data-path implemented with primitive resources. Because the templates in existing template-based approaches lack flexibility, a large number of different templates is required, and even then the DFG is only partially covered. The remaining portion of the DFG is covered by primitive resources implemented in ASIC [2], [7], [8], [9] or FPGA technology [6].
However, the large number of resources in these data-paths prevents direct connections among the hardware resources used. With such connections, inter-component chaining of operations can be exploited, rather than only the intra-template chaining of [2]-[11]. Section 2.1 shows that inter-component chaining reduces the system's latency. Also, the use of a large number of different templates further complicates synthesis. To cover a part of the DFG with a given set of templates, the subset of templates that match this part must be identified (template matching), and then the most efficient template must be selected (template selection). Both template matching and selection are intractable. Thus, complex graph algorithms and heuristics are used to solve them, which may yield non-optimal solutions. Template selection has a significant impact on performance, and many optimization techniques have been proposed to address this problem [2].

When templates are generated according to the input program's behavior, the number of candidate complex instructions (templates) grows exponentially with program size [6]-[11]. Consequently, heuristics are used to prune this large search space. The performance of such data-paths varies among different applications. In particular, single-output templates are considered in [7], [9]; the frequency of appearance of a template type (instead of its frequency of execution) is the criterion for automatic template generation in [6]; and in [8] a branch and bound method, whose complexity grows very fast as the number of templates becomes large, is used to identify a single cut in a basic block with maximum speedup. In the latter case, the objective of maximizing the sum of the speedups of the individual cuts may not result in the minimum execution time, and so may not always achieve the highest possible performance.

In this paper, we introduce a data-path for implementing DSP kernels with high performance. The computational resource of the data-path is a uniform and Flexible Computational Component (FCC). The FCC is a combinational circuit consisting of a 2x2 array of nodes. Each FCC node contains one ALU and one multiplication unit, only one of which is activated at each control step.
Due to the steering logic inside the FCC, any two-level template can be easily derived. This flexibility allows the DFG to be covered with significantly fewer components than existing template-based methods require. Thus, an inter-FCC interconnection that allows direct communication among FCCs can be utilized, so inter-FCC chaining is exploited and performance is improved. Experimental comparisons with a template-based method on representative DSP benchmarks show an average performance improvement of approximately 17% with an average area increase of approximately 38%. The area-time product is lower in at least half of the benchmarks.

The rest of the paper is organized as follows. The motivation for developing the FCC data-path and its architecture are presented in Section 2. Section 3 describes the data-path and control-unit considerations, while Section 4 presents the proposed synthesis methodology. The experimental results are reported in Section 5. Finally, Section 6 concludes the paper.
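The trade-off between the quoted performance and area figures can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes the "17% improvement" means a 17% reduction in execution time, and the function names are hypothetical. It computes the area-time ratio relative to the reference design and the time reduction needed to break even for a given area increase.

```python
def area_time_ratio(area_increase: float, time_reduction: float) -> float:
    """Return (A_new * T_new) / (A_ref * T_ref) for fractional changes."""
    return (1.0 + area_increase) * (1.0 - time_reduction)


def break_even_time_reduction(area_increase: float) -> float:
    """Fractional time reduction at which the area-time product is unchanged."""
    return 1.0 - 1.0 / (1.0 + area_increase)


# Averages quoted in the text, read as +38% area and -17% execution time
# (the reading of "17% improvement" as a time reduction is an assumption).
ratio = area_time_ratio(0.38, 0.17)
threshold = break_even_time_reduction(0.38)
print(f"area-time ratio at the averages: {ratio:.3f}")           # 1.145
print(f"break-even time reduction at +38% area: {threshold:.1%}")  # 27.5%
```

Under this reading, a benchmark with a 38% area increase lowers the area-time product only if its execution time drops by more than about 27.5%. Since the quoted figures are averages over benchmarks, the product of the averages says nothing definite about any individual benchmark, which is consistent with the product being lower in some benchmarks and higher in others.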

2. The proposed data-path method

2.1. Preliminaries

Consider the example in Fig. 1. Let us assume that both the templates (Fig. 1a) and the DFG nodes (Fig. 1b) perform two-operand computations. We observe a similarity among the computational structures in the three control steps (c-steps). In particular, the sub-graphs in the first two c-steps have the same structure and differ only in the operations performed. In the third c-step, there are four pairs of operations (sub-graphs) that differ in the operations performed. We assume that all the operations in the DFG have different operands. To achieve minimum latency, existing methods implement the DFG using 8 template instances (Fig. 1b).
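The way chaining reduces the number of c-steps can be illustrated with a small sketch. It is not the paper's scheduling algorithm: the DFG below is a hypothetical example (not the graph of Fig. 1), resources are assumed unlimited, and a c-step is assumed long enough for `chain_depth` dependent operations to execute back to back.

```python
from functools import lru_cache

# Hypothetical DFG: node -> (operation, list of predecessor nodes).
# Primary inputs are not modelled; nodes without predecessors read only inputs.
DFG = {
    "m1": ("*", []),
    "m2": ("*", []),
    "a1": ("+", ["m1", "m2"]),
    "m3": ("*", ["a1"]),
    "s1": ("-", ["m3"]),
}


def asap_levels(dfg):
    """Return the ASAP level (1-based longest-path depth) of every node."""
    @lru_cache(maxsize=None)
    def level(node):
        preds = dfg[node][1]
        return 1 + max((level(p) for p in preds), default=0)

    return {node: level(node) for node in dfg}


def c_steps(dfg, chain_depth):
    """Assign each node a c-step when up to chain_depth dependent
    operations may be chained inside one c-step (unlimited resources)."""
    return {n: (lvl - 1) // chain_depth + 1
            for n, lvl in asap_levels(dfg).items()}


unchained = c_steps(DFG, chain_depth=1)  # one operation level per c-step
chained = c_steps(DFG, chain_depth=2)    # two-level chaining per c-step
print(max(unchained.values()), max(chained.values()))  # prints "4 2"
```

The count is in c-steps, not in absolute time: a c-step that accommodates two chained operations needs a longer clock period than one that holds a single operation, so the actual gain depends on the delays of the chained units.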

[Fig. 1: (a) templates; (b) DFG covered by template instances. Node operations: *, +, -.]
