A Multi-Core Signal Processing Platform with ...

> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT)
Other approaches that leverage the trade-off between cost and performance rely on regularity. In the field of multi-processor systems, regularity has commonly been used for the implementation of processor arrays [17][18], allowing designers to focus on a small portion of silicon and replicate the tiles many times over the die surface. Another class of devices that exploits regularity to reduce mask costs is that of structured ASICs, also known as gate arrays. Gate arrays are integrated circuits that are partially fabricated using a set of generic masks and can be further programmed with a set of custom masks. They are mainly divided into via-programmable (VPGA) [19] and metal-programmable (MPGA) [20][21] gate arrays. From this perspective, Field Programmable Gate Arrays (FPGAs) [22][23] can be considered as structured ASICs with maximum flexibility.

B. Overview

With respect to most commercial and research ASSPs, which tend to increase the flexibility of their accelerators in order to extend their application portfolio, the platform proposed in this paper reduces the design time by implementing custom hardware accelerators with a design flow that starts from a high-level description language, and leverages structured ASIC technologies to achieve the required flexibility. Considering the design methodology, a main objective of the platform is to fill the performance gap between fully automatic customization (utilizing HLS-ELS techniques) and manual customization (utilizing standard ASIC design methodologies) [25]. It does so through a design flow for designing and implementing algorithm-specific, coarse-grain, pipelined hardware accelerators, and a design platform for calibrating the architectural parameters to the specific features of the implemented accelerators.
From an implementation standpoint, the proposed platform targets the reduction of design and manufacturing costs by combining regularity at the architectural and physical levels, adopting a tile-based implementation methodology and structured ASIC technologies as the silicon platform for the hardware accelerators. Unlike most reconfigurable and configurable technologies, which target the general-purpose implementation of digital circuits, the ones proposed in this work are specifically designed and optimized for signal processing, as they target the direct mapping of data flow graphs into configurable pipelined datapaths. Among the platforms for embedded signal processing based on multi-processors augmented with hardware accelerators, configurable either at run-time or at design-time, few target the specific research domain proposed in this work. In a previous work [24], the authors present Morpheus: a multi-core digital signal processor with three reconfigurable computing units featuring different granularities. With respect to Morpheus, the proposed platform overcomes its intrinsic heterogeneity, which eases application mapping through the exploitation of high-level programming models.

Moreover, the proposed platform provides a higher computational density, as it can guarantee at any time that all the computational units run concurrently by exploiting either homogeneous or heterogeneous computational models. Finally, this work provides different configurable technologies for the implementation of accelerators, targeting the different trade-offs between NRE costs, TTM and computational efficiency.

In [25] an exploration of the trade-offs in customization through the automated instruction set extension of a multi-processor system on chip is described. The exploration was performed using Tensilica’s Xtensa Modeling Platform as its base. Although the automated flow for realizing tightly coupled accelerators provided good results, the next step, consisting of the implementation of coarse-grained extensions attached to the memory subsystem, had to be done by implementing custom blocks. A key objective of the proposed work is to overcome these kinds of limitations by providing the accelerators with a flexible but efficient memory access subsystem.

In [26] a multi-core cluster with shared hardware accelerators is presented. The accelerators have been designed with Mentor Catapult C [13]. Although the accelerators have dedicated ports connected to the shared memory subsystem, the address patterns are hardwired within the custom blocks, thus limiting their flexibility. In the proposed work, by contrast, only the inner loops of kernels are hardwired within custom blocks, while addressing and memory accesses are performed through programmable blocks on a separate interface.

In summary, the main contributions of this work are:
a) A design-time configurable computational platform aimed at the implementation of multiprocessor systems-on-chip augmented with (re-)configurable hardware accelerators.
b) A design framework that assists the user in the exploration of the architectural parameters of the platform and in the design of application-specific hardware accelerators utilizing high-level programming models.
c) The evaluation of the adopted programming, architectural and technological techniques through the exploration of their impact on flexibility, performance, energy efficiency and manufacturing costs.
d) The establishment of the novelty of the proposed platform through a comparison with the state-of-the-art devices that define its design space boundaries.

The paper is structured as follows. Section II gives an overview of the platform. Section III provides a description and an evaluation of the computational paradigm of the platform. Section IV provides a quantitative analysis of the platform. Section V compares the proposed platform with established state-of-the-art solutions. Section VI provides some final considerations.
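The separation discussed above for the accelerators, with the inner loop fixed in a custom block and the address patterns left programmable, can be illustrated with a small software model. This is only a sketch under stated assumptions, not the platform's actual interface: the names `strided_2d`, `mac_datapath` and `run_kernel`, the multiply-accumulate kernel, and the flat list used as memory are all hypothetical and chosen for illustration.

```python
from typing import Callable, Iterable, Iterator, List


def strided_2d(base: int, rows: int, cols: int,
               row_stride: int, col_stride: int = 1) -> Iterator[int]:
    """Hypothetical programmable address generator.

    Yields the addresses of a 2-D strided access pattern. Changing the
    arguments changes the pattern without touching the datapath.
    """
    for r in range(rows):
        for c in range(cols):
            yield base + r * row_stride + c * col_stride


def mac_datapath(a: int, b: int, acc: int) -> int:
    """Stand-in for a hardwired inner loop body (multiply-accumulate)."""
    return acc + a * b


def run_kernel(memory: List[int],
               addr_a: Iterable[int],
               addr_b: Iterable[int],
               datapath: Callable[[int, int, int], int] = mac_datapath) -> int:
    """Feed the fixed datapath with operands fetched through two
    independently programmed address streams."""
    acc = 0
    for ia, ib in zip(addr_a, addr_b):
        acc = datapath(memory[ia], memory[ib], acc)
    return acc


# A 4x4 matrix stored row by row in a flat memory: values 0..15.
memory = list(range(16))

# Same datapath, two different address programs.
row_dot = run_kernel(memory, strided_2d(0, 1, 4, 4), strided_2d(4, 1, 4, 4))
col_dot = run_kernel(memory, strided_2d(0, 4, 1, 4), strided_2d(1, 4, 1, 4))
print(row_dot, col_dot)  # 38 248
```

In this model the row-by-row and column-by-column dot products reuse the same `mac_datapath` unchanged; only the address generator arguments differ. An accelerator with hardwired address patterns would correspond to fixing those arguments inside `run_kernel`.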

With respect to a multi-processor system on chip with no specialization, the platform has been proven able to provide up to three orders of magnitude of improvement in performance and energy efficiency in several signal processing applications, delivering an average of 90 GOPS and 130 GOPS/W. When compared with the devices that define its design space boundaries, the proposed platform has been proven competitive in both area and energy efficiency. With respect to these devices, a key advantage of the proposed platform is that it targets application performance and energy efficiency through the hardware/software migration of kernels, while sharply reducing NRE costs through the exploitation of configurable technologies for the implementation of application-specific hardware accelerators.

REFERENCES

[1] G. Hughes, L. C. Litt, A. Wüest, S. Palaiyanur, “Mask and wafer cost of ownership (COO) from 65 to 22 nm half-pitch nodes”, Proceedings of the SPIE, vol. 7028, pp. 70281P-70281, 2008.
[2] P.-Y. Chen, T.-Y. Wang, C.-S. Chen, “A Low-Cost VLSI Architecture for Robust Distributed Estimation in Wireless Sensor Networks”, IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 58, issue 6, Jun. 2011, pp. 1277-1286.
[3] J. A. Rodríguez, O. D. Lifschitz, V. M. Jiménez-Fernández, P. Julián, O. E. Agamennoni, “Application-Specific Processor for Piecewise Linear Functions Computation”, IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 58, no. 5, May 2011, pp. 971-981.
[4] W. Wolf, A. Jerraya, “Multiprocessor System-on-Chip Technology”, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 27, no. 10, pp. 1701-1713, Oct. 2008.
[5] L. Dagum, R. Menon, “OpenMP: an industry standard API for shared-memory programming”, IEEE Computational Science & Engineering, vol. 5, issue 1, pp. 46-55, Jan.-Mar. 1998.
[6] P. Pacheco, “Parallel Programming with MPI”, 1997.
[7] NVIDIA Corporation, “NVIDIA CUDA C programming guide, v4.0”, http://www.nvidia.com/, Feb. 2011.
[8] The Khronos OpenCL Working Group, “OpenCL - The open standard for parallel programming of heterogeneous systems”, http://www.khronos.org/opencl/, Feb. 2011.
[9] A. Heinecke, M. Klemm, “From GPGPU to Many-Core: Nvidia Fermi and Intel Many Integrated Core Architecture”, Computing in Science & Engineering, vol. 14, issue 2, Mar. 2012, pp. 78-83.
[10] E. A. Lee, A. L. Sangiovanni-Vincentelli, “Component-Based Design for the Future”, Design, Automation & Test in Europe Conference & Exhibition (DATE), 14-18 March 2011.
[11] V. Berman, “Standards: The P1685 IP-XACT IP Metadata Standard”, IEEE Design & Test, vol. 23, issue 4, Jul. 2006, pp. 316-317.
[12] F. Sun, S. Ravi, A. Raghunathan, N. K. Jha, “A Scalable Synthesis Methodology for Application-Specific Processors”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 14, no. 11, Nov. 2006, pp. 1175-1188.
[13] http://www.mentor.com/esl/catapult/overview/
[14] K. Keutzer, S. Malik, R. Newton, J. Rabaey, A. L. Sangiovanni-Vincentelli, “System-level design: Orthogonalization of concerns and platform-based design”, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 19, issue 12, pp. 1523-1543, Dec. 2000.
[15] A. Gerstlauer, C. Haubelt, A. D. Pimentel, T. P. Stefanov, D. D. Gajski, J. Teich, “Electronic System-Level Synthesis Methodologies”, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 28, no. 10, Oct. 2009, pp. 1517-1530.
[16] F. Sun, S. Ravi, A. Raghunathan, N. K. Jha, “Application-Specific Heterogeneous Multiprocessor Synthesis Using Extensible Processors”, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 25, no. 9, Sep. 2006, pp. 1589.
[17] S. Bell, B. Edwards, J. Amann, R. Conlin, K. Joyce, V. Leung, J. MacKay, M. Reif, L. Bao, J. Brown, M. Mattina, C.C. Miao, C. Ramey, D. Wentzlaff, W. Anderson, E. Berger, N. Fairbanks, D. Khan, F. Montenegro, J. Stickney, J. Zook, “TILE64 processor: A 64-core SoC with mesh interconnect”, IEEE International Solid-State Circuits Conference (ISSCC’08), Feb. 2008, pp. 88-89.
[18] A. Duller, G. Panesar, D. Towner, “Parallel Processing - the picoChip way!”, Communicating Processing Architectures, 2003, pp. 125-138.
[19] H.-H. Tung, R.-B. Lin, M.-C. Li, T.-H. Heish, “Standard Cell Like Via-Configurable Logic Blocks for Structured ASIC in an Industrial Design Flow”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 20, no. 12, Dec. 2012, pp. 2184-2197.
[20] U. Ahmed, G. G. F. Lemieux, S. J. E. Wilton, “Performance and Cost Tradeoffs in Metal-Programmable Structured ASICs (MPSAs)”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 19, no. 12, Dec. 2011, pp. 2195-2208.
[21] www.st.com/spear
[22] www.altera.com
[23] www.xilinx.com
[24] D. Rossi, F. Campi, S. Spolzino, S. Pucillo, R. Guerrieri, “A Heterogeneous Digital Signal Processor for Dynamically Reconfigurable Computing”, IEEE Journal of Solid-State Circuits, vol. 45, no. 8, Aug. 2010, pp. 1615-1626.
[25] R. Hameed, W. Qadeer, M. Wachs, O. Azizi, A. Solomatnikov, B. C. Lee, S. Richardson, C. Kozyrakis, M. Horowitz, “Understanding sources of inefficiency in general-purpose chips”, Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA’10), vol. 38, issue 3, June 2010, pp. 37-47.
[26] M. Dehyadegari, A. Marongiu, M. R. Kakoee, L. Benini, S. Mohammadi, N. Yazdani, “A Tightly-Coupled Multi-Core Cluster with Shared-Memory HW Accelerators”, International Conference on Embedded Computer Systems, Jul. 2012, pp. 96-103.
[27] M. Coppola, “Spidergon: a Novel on-chip Communication Network”, 2004 International Symposium on System On Chip, Finland, Oct. 2004.
[28] A. T. Tran, “A Reconfigurable Source-Synchronous On-Chip Network for GALS Many-Core Platforms”, Jerraya, G. Martin, Multiprocessor System-on-Chip (MPSoC) Technology, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 29, no. 6, Jun. 2010, pp. 897-910.
[29] A. Lodi, C. Mucci, M. Bocchi, A. Cappelli, M. De Dominicis, L. Ciccarelli, “A Multi-Context Pipelined Array for Embedded Systems”, IEEE International Conference on Field Programmable Logic and Applications (FPL 2006), Aug. 2006, pp. 1-8.
[30] L. Benini, E. Flamand, D. Fuin, D. Melpignano, “P2012: Building an ecosystem for a scalable, modular and high-efficiency embedded computing accelerator”, Design, Automation & Test in Europe Conference & Exhibition, pp. 983-987, 12-16 March 2012.
[31] Y. Janin, V. Bertin, H. Chauvet, T. Deruyter, C. Eichwald, O. A. A. Giraud, V. Lorquet, T. Thery, “Designing Tightly-Coupled Extension Units for the STxP70 processor”, IEEE Int. Conf. Design Automation and Test in Europe, Mar. 2013, pp. 1-6.
[32] http://www.synopsys.com/IP/ProcessorIP/ARCProcessors
[33] C. Mucci, C. Chiesa, A. Lodi, M. Toma, F. Campi, “A C-based Algorithm Development Flow for a Reconfigurable Processor Architecture”, Proceedings of the IEEE Symposium on System on Chip (SoC2003), Tampere, Nov. 2003.
[34] V. Allan, R. Jones, R. Lee, S. Allan, “Software Pipelining”, ACM Computing Surveys, vol. 27, no. 3, September 1995.
[35] http://www.arm.com/products/processors/classic/arm9/arm926.php
[36] D. Rossi, C. Mucci, F. Campi, S. Spolzino, L. Vanzolini, H. Sahlbach, S. Whitty, R. Ernst, W. Putzke-Roming, R. Guerrieri, “Application Space Exploration of a Heterogeneous Run-Time Configurable Digital Signal Processor”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 21, no. 2, pp. 193-205, Feb. 2013.
[37] M. Chiesi, L. Vanzolini, E. Franchi Scarselli, R. Guerrieri, “Energy efficient scheduling of Parallel Workloads on GPU”, to appear in IEEE Transactions on Parallel and Distributed Systems (TPDS).
[38] D. Pramanik, H. H. Kamberian, C. J. Progler, M. Sanie, D. Pinto, “Cost effective strategies for ASIC masks”, Proceedings SPIE 5043, 2003, pp. 142-152.
[39] K. Jeong, A. B. Kahng, C. J. Progler, “Cost-driven mask strategies considering parametric yield, defectivity, and production volume”, Journal of Micro/Nanolithography MEMS MOEMS, vol. 10, issue 3, 033021, Jul.-Sep. 2011.
[40] R. S. Mackay, H. Kamberian, Y. Zhang, “Methods to Reduce Lithography Cost by Reticle Engineering”, Microelectronic Engineering, vol. 83, issue 4-9, Mar. 2006, pp. 914-918.
[41] S.-J. Huang, S.-G. Chen, “A High-Throughput Radix-16 FFT Processor With Parallel and Normal Input/Output Ordering for IEEE 802.15.3c Systems”, IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 59, no. 8, Aug. 2012, pp. 1752-1765.
[42] T.-C. Chen, S.-Y. Chien, Y.-W. Huang, C.-H. Tsai, C.-Y. Chen, T.-W. Chen, L.-G. Chen, “Analysis and architecture design of an HDTV720p 30 frames/s H.264/AVC encoder”, IEEE Transactions on Circuits and Systems for Video Technology, vol. 16, no. 6, Jun. 2006, pp. 673-688.
[43] S. Sukhsawas, K. Benkrid, “A High-level Implementation of a High Performance Pipeline FFT on Virtex-E FPGAs”, IEEE Computer Society Annual Symposium on VLSI, Feb. 2004, pp. 229-232.
[44] I. Yasri, N. H. Hamid, V. V. Yap, “Performance analysis of FPGA based Sobel edge detection operator”, in Proc. Int. Conf. Electron. Design, pp. 1-4, 2008.
[45] S. Heithecker, A. Lucas, R. Ernst, “A high-end real-time digital film processing reconfigurable platform”, EURASIP J. Embed. Syst., vol. 2007, no. 1, pp. 1-15, Jan. 2007.
[46] http://www.xilinx.com/products/silicon-devices/soc/zynq-7000
[47] http://www.amd.com/us/products/embedded/graphics-processors

Davide Rossi received the electronics engineering degree from the University of Bologna, Bologna, Italy, and the M.Sc. from Tampere University of Technology, Tampere, Finland, in 2007. In 2012 he received the Ph.D. degree with the Advanced Research Centre on Electronic Systems (ARCES) from the University of Bologna, Bologna, Italy. From 2008 to 2012 he was also a consultant for STMicroelectronics in the field of reconfigurable computing. Dr. Rossi is currently a senior member of the ERC MULTITHERMAN laboratory, Department of Electrical, Electronic and Information Engineering (DEI), University of Bologna, Bologna, Italy. His main research interests include VLSI system-on-chip design, configurable systems based on run-time and metal-programmable devices, multi-processor systems, and ultra-low-power platforms. He is co-author of about 15 publications in international conferences and journals in the same fields.

Claudio Mucci received the Electronics Engineering degree from the University of Bologna, Italy, in Feb. 2003 and the Ph.D. degree from the same university in 2007. Since 2003, he has been with the Advanced Research Center on Electronic Systems “E. De Castro” (ARCES), Bologna, and during that period he has been an STMicroelectronics consultant in the field of reconfigurable computing. Since 2009, he has been with STMicroelectronics Technology R&D, Agrate Brianza, Italy. His main research interests include configurable platforms based on run-time and metal-programmable technologies, digital signal processing, application development and related methodologies applied to embedded programmable systems. He is co-author of about 30 publications in international conferences and journals in the same field.

Matteo Pizzotti received the electronics engineering degree from the Polytechnic University of Milan, Italy, in 2005. Since 2005 he has been an STMicroelectronics consultant in several fields of digital electronics, including high-level modeling, reconfigurable embedded systems and related design flows, low-power design and microcontroller programming. He is currently working on low-power systems for energy harvesting sensor networks.

Luca Perugini received the electronics engineering degree from the University of Bologna, Italy, in 2006. Since 2006 he has been an STMicroelectronics consultant in the field of digital reconfigurable systems and 3D chip-to-chip wireless interconnects. He is currently working on low-power systems for energy harvesting sensor networks.

Roberto Canegallo received the degree in electrical engineering from the University of Pavia and the Ph.D. from the University of Bologna, Italy. He worked for five years in Central R&D on multilevel Flash memories. Since 1998, Dr. Canegallo has been working at the joint ST research lab with the University of Bologna. His work has focused on integrated, innovative designs of systems on chip; he is the author of many scientific papers and holds European and US patents. Dr. Canegallo is currently a senior member of Technical Staff in Smart Power Technology R&D at STMicroelectronics and the project coordinator at the STMicroelectronics lab jointly managed with ARCES, the Advanced Research Center on Electronic Systems of the University of Bologna. His main activities are in the field of advanced hardware design and three-dimensional silicon integration.

Roberto Guerrieri received the M.S. and Ph.D. degrees from the University of Bologna, Bologna, Italy. Since 1986, he has been visiting the Department of Electrical Engineering and Computer Science, University of California, Berkeley, for four years and the Department of Electrical Engineering, Massachusetts Institute of Technology, Boston. During his scientific activity, he has published more than 90 papers in various fields, including numerical simulation of semiconductor devices, numerical solution of Maxwell’s equations, parallel computation on massively parallel machines, and reconfigurable architectures. In 1998, he became the Director of the Laboratory for Electronic Systems, a joint venture of the University of Bologna and STMicroelectronics for the development of innovative designs of systems-on-chip. He is currently a Full Professor of electrical engineering with the University of Bologna. Dr. Guerrieri was the recipient of the Best Paper Award of the IEEE TRANSACTIONS ON SEMICONDUCTOR MANUFACTURING in 1992, for his work in the area of process modeling, and of the “ISSCC Best Paper Award” in 2004, for his work on sensor system designs.
