Power-Efficient Tightly-Coupled Processor Arrays for Digital Signal Processing

Submitted to the Faculty of Engineering of the Universität Erlangen-Nürnberg in fulfillment of the requirements for the degree of DOKTOR-INGENIEUR by Dmitrij Kissler
Erlangen 2011
Approved as a dissertation by the Faculty of Engineering of the Universität Erlangen-Nürnberg.
Date of submission: 15 July 2011
Date of the doctoral examination: 15 December 2011
Dean: Prof. Dr.-Ing. habil. Marion Merklein
Reviewers: Prof. Dr.-Ing. Jürgen Teich, Prof. Dr.-Ing. Wolfgang Nebel
Acknowledgments

I would like to thank my advisor, Professor Jürgen Teich, for his support and encouragement throughout this work. He has provided assistance in numerous ways, including valuable scientific, technical, and editorial advice. My special thanks go to Professor Wolfgang Nebel for the fruitful discussions and for agreeing to be the co-examiner of this thesis. I also wish to thank the members of the Ph.D. committee, Professor Robert Weigel and Professor Klaus Meyer-Wegener. I would like to thank Dr. Patrick Haspel and the Cadence Academic Network, who assisted me greatly during my work. Furthermore, I wish to thank all the colleagues I collaborated with during my years at the department. I express my sincere gratitude to my mother and my family for their great support and encouragement.
Dmitrij Kissler
Erlangen, July 2011
Contents

1 Overview  1
   1.1 Goals and Contributions of the Thesis  2
   1.2 Thesis Organization  5

2 Introduction  7
   2.1 Massively Parallel Architectures: Paradigms and Trends  7
       2.1.1 Characteristic Properties of Mobile Systems  9
       2.1.2 Digital Signal Processing  10
       2.1.3 Wireless Communication  10
       2.1.4 Performance Requirements  11
       2.1.5 Fine-Grained Reconfigurable Architectures  13
       2.1.6 Coarse-Grained Reconfigurable Architectures  14
       2.1.7 Many-Core Architectures  16
   2.2 Advanced VLSI Technology Challenges  19
       2.2.1 Design and Production Costs  19
       2.2.2 Variability Issues  20
       2.2.3 Reliability Issues  21
       2.2.4 Nanometer Transistor Properties  22
       2.2.5 Power and Energy Efficiency  26
       2.2.6 Power Consumption in Digital CMOS Circuits  27
   2.3 Low-Power Design Techniques and Tool Flows  29
       2.3.1 Low-Power Design Techniques  30
       2.3.2 Low-Power/Power-Aware Test and Verification  44
       2.3.3 Discussion  45
   2.4 Power Estimation and Modeling Techniques  46
       2.4.1 Different Types of Power Consumption and their Analysis  46
       2.4.2 Hardware Abstraction Layers  50
       2.4.3 Analytical Power Models  53
       2.4.4 Empirical Power Models  53
       2.4.5 Discussion  55

3 A Parameterizable TCPA Architecture: Weakly Programmable Processor Arrays  57
   3.1 Architecture of a Weakly Programmable Processing Element  57
       3.1.1 Instruction Set Architecture  58
       3.1.2 Dynamically Reprogrammable Interconnect Modules  59
   3.2 Weakly Programmable Processor Arrays  60
       3.2.1 Adaptive Reprogramming  61
       3.2.2 Multicast Reprogramming Scheme  62
       3.2.3 Reprogramming Speed  63
       3.2.4 Graphical Architecture Entry  63
       3.2.5 VHDL Template Code Complexity  64
   3.3 Summary  64

4 Power, Area Characterization and Modeling  67
   4.1 Probabilistic Power Analysis  68
       4.1.1 Spatial Signal Correlation  72
       4.1.2 Temporal Signal Correlation  72
       4.1.3 Analysis Techniques  73
   4.2 Power Profiling Results for WPPAs  74
       4.2.1 Reconfiguration Phase  85
       4.2.2 Functional Processing Phase  90
   4.3 Power Macro-Modeling  91
       4.3.1 Characterization Procedure  95
       4.3.2 Database Design  96
       4.3.3 Estimation Procedure  97
       4.3.4 Overall Power and Area Evaluation Framework  98
   4.4 Efficiency and Accuracy Comparisons  100
   4.5 Analytical Hardware Area Estimation  105
   4.6 Area Macro-Modeling  105
   4.7 Related Work  107
   4.8 Summary  109

5 Increasing the Energy Efficiency of Programmable Hardware Accelerators  111
   5.1 Power Reduction in Coarse-Grained Reconfigurable Architectures  111
       5.1.1 Automatic Clock Gating  112
       5.1.2 Hybrid Hierarchical Clock Gating  113
   5.2 Experimental Results  115
       5.2.1 Tool Flow  115
       5.2.2 Results for a 2 × 2 WPPA Array  117
       5.2.3 Results for a 3 × 8 WPPA Array  119
   5.3 Related Work  130
   5.4 Standby and Idle Power Optimization  130
       5.4.1 Related Work  131
       5.4.2 Overview of the Common and Unified Power Formats  136
       5.4.3 Reconfiguration-controlled Power Gating  142
       5.4.4 Scalable Power Control Network  143
       5.4.5 Power-aware Design Flow  145
       5.4.6 Experimental Results  146
   5.5 Summary  149

6 Design Space Exploration of WPPA Architectures  155
   6.1 Multi-Objective Evolutionary Algorithms  158
   6.2 Architecture-Compiler Coupling  160
   6.3 Exploration Results  165
       6.3.1 Single Algorithm Explorations  167
       6.3.2 Multi-Algorithm Exploration  169
   6.4 Issue Width Impact  170
   6.5 Estimation Methodology and Accuracy  171
   6.6 Related Work  172
   6.7 Summary  172

7 Concluding Remarks  183
   7.1 Future Directions: Invasive Tightly-Coupled Processor Arrays  185

8 German Part  187

A Appendix  193
   A.1 VHDL Code Listings  193
   A.2 Generic CPF Description  197
   A.3 WPPA Software Examples  198
   A.4 WPPA Template Code Complexity  202
   A.5 TCL Characterization Script Fragment  203

Author's Own Publications  213
Bibliography  217
Acronyms  249
List of Symbols  253
List of Figures  257
List of Tables  263
1 Overview

Digital circuit design has officially entered the power-limited era: power and energy are driving factors in virtually every design or design component. Exponentially growing power consumption has also put a final end to the "frequency scaling" approach, which was used to achieve higher performance in mono-processor systems for more than two decades. At the same time, technological advances provide ever growing integration densities and increasing "raw" transistor counts. However, these additional resources currently cannot be used as efficiently as they theoretically could be, due to engineering productivity issues. Therefore, the contemporary micro-architectural trend shows a very clear tendency towards massively parallel, tiled architectures, which is also referred to as "geometric scaling". Such architectures generally provide higher computational performance within acceptable power consumption limits. A second tendency, going hand in hand with the first, is towards heterogeneous embedded systems, enhanced with rich analog, mixed-signal, and micro-electromechanical devices and with application-specific hardware accelerators for high-end graphics and sophisticated digital signal processing. A high demand for such systems can be observed, for instance, in the mobile and wireless markets, see [297]. On the other hand, application areas such as wireless communication or multimedia in embedded devices traditionally require real-time or near-real-time processing speeds and very high power efficiency, until recently achieved solely by application-specific hardware. With the advent of the reconfigurable hardware paradigm, computation-intensive tasks can be executed on reconfigurable platforms with performance and energy efficiency close to that of dedicated circuits.
Modern embedded hardware architectures have to be flexible to a great extent since, for instance, different signal processing algorithms have to be implemented on the same hardware. A prime example is the set of different communication protocols for mobile wireless systems. In order for multiple applications to efficiently use a predesigned architecture, it has to be highly configurable. With the help of configurable parameters, architectures optimized for performance, size, and power in a particular application domain can be derived. Therefore, a new focus on "parameterized system design", different from the past focus on building an architecture optimized for one particular set of constraints, is required, see [116]. Coarse-grained reconfigurable architectures
can meet the growing demands on computational resources, fast reconfigurability, and flexibility, as well as high power efficiency. Because the reconfigurable interconnect signals in coarse-grained architectures are up to several words wide, many routing problems are alleviated, see [130]. As a consequence, both the amount of memory for the reconfiguration data streams and the reconfiguration time are reduced for coarse-grained architectures. Complex computations can be accomplished in several processing elements, which may be very small or may be quite large, complete embedded processor cores with a RISC (reduced instruction set computer) architecture. Generally, the main requirements for the design of low-power, energy-efficient embedded systems can be summarized as follows, see [147]:
• Higher circuit flexibility must be achieved, for instance, through programmability or reconfigurability.
• Higher design abstraction levels of physical, functional, and electrical models are needed, providing only those levels of detail which are actually required at the current design implementation stage.
• Circuits should feature a high silicon implementation regularity and support massive hardware parallelism.
• Massive EDA support for automated configuration and efficient design space exploration should be provided.
Hence, to improve productivity in hardware development, there is an urgent need for the highest possible abstraction levels in the description of massively parallel hardware systems and for aggressive automation of their development process.
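The notion of a parameterized architecture template can be illustrated with a small configuration sketch. This is illustrative Python, not the actual VHDL template of the thesis; all field names here are hypothetical, but the two derived instances mirror the 2 × 2 (4 PEs) and 3 × 8 (24 PEs) arrays discussed later.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ArrayConfig:
    """Hypothetical parameter set of a processor-array template."""
    rows: int          # array dimensions
    cols: int
    data_width: int    # functional-unit word width in bits
    issue_width: int   # VLIW instruction slots per processing element

    @property
    def num_pes(self) -> int:
        # Total number of processing elements in the array.
        return self.rows * self.cols

# Deriving two differently sized instances from the same template:
small = ArrayConfig(rows=2, cols=2, data_width=16, issue_width=2)
large = ArrayConfig(rows=3, cols=8, data_width=16, issue_width=3)

assert small.num_pes == 4 and large.num_pes == 24
```

The point of such a template is that every instance is generated from one description, which is what makes automated configuration and design space exploration over the parameter ranges possible.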
1.1 Goals and Contributions of the Thesis

The increased deployment of packet-switched networks and the firm trend towards seamless user mobility, combined with continued integration, have led to the steady growth of the embedded digital systems market, and especially of its mobile part. These application-specific systems are designed to fulfill different performance and power requirements depending on their usage scenario. Rapidly changing standards and high market pressure demand more flexible solutions and shorter development times. Legacy fixed architectures are not well suited to meet the requirements of complex, yet power-sensitive, media-oriented processing tasks, see [128]. Thus, generic, highly parameterizable architecture templates in the form of IP cores (intellectual property) have become more and more important when building such systems. At the same time, design techniques for power reduction have become integral to every modern design process.
Figure 1.1: Difference between standard multi-core processors, tightly-coupled, weakly programmable processor arrays (WPPAs), and application-specific circuit solutions. A WPPA realizes a trade-off between flexibility and efficiency.

However, power consumption itself does not set a technology scaling limit; rather, it dictates the maximum allowable current density and therefore the corresponding operating frequency and/or the number of devices per chip which can be switched on concurrently, see [302]. In this thesis, we focus on highly area- and power-efficient, massively parallel, tightly-coupled embedded hardware architectures called weakly programmable processor arrays (WPPAs), which are to be used as hardware accelerators in mobile embedded systems for sophisticated digital signal and image processing, see Figure 1.1. Our research confirmed both the need for and the benefits of deploying such efficient accelerators in modern high-performance embedded systems: the prototype implementations of a WPPA in 90 nm CMOS ASIC technology with 4 and 24 processing elements revealed power efficiency values ranging from 98 MOPS/mW to 124 MOPS/mW and from 0.064 mW/MHz to 0.66 mW/MHz. The corresponding chip area lies between 0.2 mm² (4 PEs) and 2.2 mm² (24 PEs). Compared to current general-purpose multicore architectures, which are manufactured in much smaller process geometries, WPPAs thus have a 100 times smaller area footprint and up to 1 000 times better power efficiency, see [247, 104]. The main contributions of this thesis lie in the following fields: architectural research, power modeling, power optimization, and automatic design space exploration.

Architectural Research A novel, highly parameterizable coarse-grained reconfigurable architecture called the weakly programmable processor array was designed.
It consists of several weakly programmable processing elements with a VLIW (very long instruction word) architecture, which are connected with the help of dynamically reconfigurable interconnect modules. One of the distinguishing properties of a WPPA is that it is an architectural template rather than a fixed design, unlike many other well-known coarse-grained reconfigurable architectures. The high degree of parameterization enables a flexible adaptation of hardware resources to the prospective set of applications as well as an automatic design space exploration. The main results with respect to the architecture were published in [4], [5], [3] and [12]. Furthermore, the WPPA architecture was presented in hardware and software demonstrations at the DATE University Booth [1] and [20].

Power Modeling With the help of a table-based, probabilistic macro-modeling technique with non-uniform parameter sampling, implemented by means of a relational database, we show that the power estimation times for large WPPA arrays consisting of several hundred processing elements can be reduced to the range of minutes, within a 10% estimation error compared to a state-of-the-art commercial gate-level post-layout power estimator. The main results were published in [7].

Power Optimization First, the important aspect of power-efficient dynamic reconfiguration control in coarse-grained reconfigurable architectures was addressed: proper clock domain partitioning with custom clock gating, combined with automatic clock gating, resulted in a 35% total power reduction. This is more than threefold the saving of the single clock gating techniques applied separately. The corresponding case study application, with 0.064 mW/MHz and 124 MOPS/mW power efficiency, outperforms the major coarse-grained and general-purpose embedded processor architectures by a factor of 1.7 to 28.
The active and standby leakage power consumption could also be significantly reduced thanks to a state-of-the-art, Common Power Format based design flow and a novel, highly scalable power control network for designs with hundreds of power domains. The main results were published in [9], [10] and [2].

Automatic Design Space Exploration An exploration framework for WPPAs based on state-of-the-art multi-objective evolutionary algorithms was implemented, which allows us to perform a highly accurate and expeditious automatic exploration and evaluation of any possible WPPA instance in terms of area, performance, and power at a high level of abstraction. The presented framework provides the means to automatically determine the absolute upper and lower limits of the objectives for a given parameter range, which would be impossible to achieve otherwise. A substantial acceleration of the automatic exploration procedure is achieved by the deployment of a novel, relational database-based macro-modeling methodology and modern multi-objective evolutionary algorithms. Finally, the automatic exploration of the combined
deployment of several different algorithms on a single WPPA instance, programmed by means of run-time reconfiguration, was investigated. In the following, the overall structure of the thesis is outlined by a short summary of the individual chapters.
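The table-based macro-modeling contribution mentioned above can be sketched as a lookup with interpolation between characterized parameter points. This is a minimal illustration only: the frequency grid and power values below are invented, and the thesis's actual implementation uses a relational database over many more parameters.

```python
import bisect

# Hypothetical characterization table: power per processing element in mW,
# sampled non-uniformly over the operating frequency in MHz.
freq_points = [50, 100, 200, 400]        # non-uniform sampling grid
power_mw    = [3.2, 6.4, 12.8, 25.6]     # characterized power per grid point

def estimate_power(freq_mhz: float) -> float:
    """Linear interpolation between the two nearest characterized points."""
    i = bisect.bisect_left(freq_points, freq_mhz)
    if i == 0:
        return power_mw[0]                # clamp below the grid
    if i == len(freq_points):
        return power_mw[-1]               # clamp above the grid
    f0, f1 = freq_points[i - 1], freq_points[i]
    p0, p1 = power_mw[i - 1], power_mw[i]
    return p0 + (p1 - p0) * (freq_mhz - f0) / (f1 - f0)

# An array-level estimate is then just a sum over processing elements,
# which is why estimation stays fast even for hundreds of PEs:
array_power = 24 * estimate_power(150)    # 24-PE array at 150 MHz
```

Because each query is a table lookup instead of a gate-level simulation, scaling to arrays with hundreds of processing elements costs only multiplications and additions.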
1.2 Thesis Organization

Chapter 2 gives a short introduction to the scientific fields relevant to our research. Section 2.1 describes the principal architectural paradigms of modern computer systems. Subsequently, Section 2.2 presents the major challenges of advanced VLSI and semiconductor technology. In Section 2.3, state-of-the-art low-power design techniques and EDA tool flows are described. Finally, Section 2.4 gives a short overview of the main current power estimation and modeling techniques.

Chapter 3 is devoted to the architecture of weakly programmable processor arrays. Different aspects of micro-architecture and hardware design are covered.

Chapter 4 deals with the power and area characterization and modeling of WPPAs. Section 4.1 introduces the foundations of probabilistic power analysis. In Section 4.2, the power profiling results of a prototype WPPA implementation in a 90 nm standard cell, semi-custom ASIC technology are discussed. Section 4.3 describes the proposed power and area macro-modeling methodology. Efficiency and accuracy comparisons are presented in Section 4.4. The results of analytical area estimation and area macro-modeling are covered in Sections 4.5 and 4.6. Related work and a summary conclude this chapter.

Chapter 5 covers the energy and power efficiency of CGRAs in general and WPPAs in particular. Section 5.1 deals with the optimization of dynamic power consumption. The corresponding experimental results are presented in Section 5.2. Related work and a short summary follow. In Section 5.4, the standby and idle power consumption of WPPAs is optimized with the help of power gating and a Common Power Format based EDA tool flow. A summary concludes this chapter.

Chapter 6 is devoted to automatic design space exploration.
The previously discussed results are used to explore the architectural design space of WPPA architectures, including chip area, average static and dynamic power dissipation, execution latency, issue width of the processing elements, and computational precision, in an automated fashion. Section 6.1 presents the major principles of multi-objective evolutionary algorithms. The coupling of architecture and compiler is discussed in Section 6.2. A related-work overview and a short summary are also given.
Chapter 7 summarizes the main contributions of the thesis and presents some concluding remarks together with suggestions for future work.

Appendix A contains VHDL and TCL code listings, a generic power intent description of the WPPA template, characteristic software listings in WPPA assembler code, and the WPPA template code complexity statistics.
2 Introduction

Massive parallelism now applies to a wide range of computing systems, from general-purpose to embedded computing, see [44, 157, 48, 268]. This is also referred to as geometric scaling [147], as opposed to the frequency scaling approach, now abandoned due to power consumption limits. It is used not only to satisfy performance requirements of up to the teraflop scale [298, 165], but also to meet the severe power restrictions of contemporary and near-future applications. As an example, the gross power breakdown of some typical industrial designs is given in Table 2.1.
2.1 Massively Parallel Architectures: Paradigms and Trends

The number of high-performance processor cores integrated on a single chip is expected to grow as fast as 19% annually over the following three years, as projected by the International Technology Roadmap for Semiconductors (ITRS) [147]. Meanwhile, all major semiconductor vendors have updated their product lines with different kinds of multi- and many-core architectures. Since the different application domains for massively parallel architectures impose their specific requirements on throughput, area, and power consumption, the corresponding architectures can be broadly classified into the following four major groups:

Application   Total Power (W)   Core (%)   I/O (%)   Memory (%)
Mobile        0.2 – 1           20 – 30    10 – 20   50 – 60
Video         2 – 6             30 – 40    25 – 30   40 – 50
Networking    30 – 60           40 – 50    20 – 25   40 – 50

Table 2.1: Power breakdown of typical industrial designs for different application domains.

Server Domain Typical thermal design power: ≈ 150 W; typical clock frequencies: ≈ 2-3 GHz. The performance for control-dominated, sequential applications is paramount. Such architectures are based on several conventional, superscalar, out-of-order uniprocessor cores, see [247, 104]. Mostly 4-8 homogeneous cores are integrated on one die. These architectures are also widely known as multicore processors or chip multiprocessors (CMPs). Large on-chip hierarchical caches with cache coherence protocols implemented in hardware are provided. The major core interconnection schemes are crossbar and point-to-point. These systems are mostly used for non-real-time, high-performance applications with large portions of control-oriented, sequential code.

Graphics and Scientific Computing Domain Typical thermal design power: ≈ 250 W; typical clock frequencies: ≈ 1 GHz. The performance for parallel, data-dominated floating-point applications is paramount, see [312]. Systems targeting this domain deploy a high number, typically 300-500 or more, of simple in-order processing elements. These elements are further grouped into clusters and can be used in SIMD-like fashion (single instruction, multiple data). A non-coherent, distributed memory system is typically deployed. Such architectures are also often denoted as many-core GPU systems, see [165].

DSP Domain Typical thermal design power: ≈ 5-50 W; typical clock frequencies: ≈ 1 GHz. Performance and power consumption are equally important. Target applications are data-oriented, mostly with SIMD data-level parallelism. The typical number of single DSP cores is 3-6, see [109]. Application-specific interconnect fabrics and networks-on-chip are used for inter-core communication. Heterogeneous, distributed memory systems allow more power-efficient computations. These systems are mostly used in applications with real-time processing requirements.

Mobile Domain Typical thermal design power: ≈ 1 W; typical clock frequencies: ≈ 500 MHz. Both static and dynamic power consumption are paramount for long run and standby times, see [56, 24]. Highly integrated analog, digital, and microwave circuits, enriched with micro-electromechanical systems (MEMS) like accelerometers, gyroscopes, and others, are implemented as heterogeneous systems-on-a-chip. The mobile application domain mostly has real-time processing requirements, as for instance in wireless communication. A large number of specialized hardware accelerators is deployed to meet the strict performance and power constraints. The market for mobile devices, with globally ≈ 300·10⁶ unit shipments, has almost reached the combined size of desktop personal computer and notebook sales and is rapidly growing.
An increasing number of formerly stand-alone portable consumer devices, like personal navigators, video camcorders, digital still cameras, and music and video players, have converged into one generic, multifunctional mobile device with broadband wireless connectivity, see, for instance, [217] and [254]. Since high performance and low power consumption are the key critical factors of any mobile system, we describe the characteristic properties of mobile systems in greater detail in the following section.
[Figure 2.1 depicts the following blocks: a baseband modem (WCDMA, CDMA-2000, EDGE, HSDPA, LTE, TD-SCDMA), a DSP, 2D/3D graphics, a multicore RISC CPU, a security block (AES), an interconnect fabric, memory and display controllers, and a PMU.]
Figure 2.1: Typical internal structure of a smartphone system-on-a-chip.
2.1.1 Characteristic Properties of Mobile Systems

The form factor and the operating time are the two most critical factors of any mobile device. The form factor and weight limitations indirectly also set a limit on the available energy budget, that is, the battery capacity. A typical Li-ion battery used in a smartphone has a capacity of 1.2-2 Ah and contributes as much as 20-30% of the overall weight, see [231]. A typical smartphone is, in addition to its original telephonic usage, meanwhile a small high-performance device that takes pictures, records and plays back several high-definition video formats, receives TV programs, plays music, and serves for broadband mobile internet access and as a personal navigator. To achieve these levels of integration and convergence, a careful selection and design of the hardware components is needed. A typical structure of the digital system-on-a-chip part of a smartphone is shown in Figure 2.1. Beside the baseband modem subsystem implementing several mobile communication standards, it also features a high-performance DSP for audio and video processing, specialized security circuits, advanced 2D/3D graphics accelerators, a multicore RISC processor for running the mobile operating system and controlling the peripherals and the user interface, a high-speed interconnect fabric, display and memory controllers, as well as a power management unit. Generally, 3 W is the approximate limit of the overall power budget in a mobile phone, see [217]. The power budget for computations, that is, baseband processing, audio and video playback, and the operating system, lies at about 1 W, see [265, 297]. At the same time, the overall computational load currently amounts to 100 giga-operations per second (GOPS) and will grow up to 1 000 GOPS for fourth-generation mobile communication standards (4G) like LTE Advanced (long-term evolution) and WiMAX (worldwide interoperability for microwave access).
The major part of these computations is due to sophisticated digital signal processing algorithms.
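The quoted budget figures directly imply the required energy efficiency: dividing the roughly 1 W computation budget by the workload yields the admissible energy per operation. A quick check of this arithmetic:

```python
# Power budget for computations in a mobile phone (from the text): about 1 W.
power_budget_w = 1.0

# Required energy per operation = power / (operations per second),
# converted to picojoules per operation.
for gops in (100, 1000):                  # today's load vs. the 4G load
    ops_per_s = gops * 1e9
    pj_per_op = power_budget_w / ops_per_s * 1e12
    print(f"{gops:5d} GOPS -> {pj_per_op:.0f} pJ/OP")
# 100 GOPS -> 10 pJ/OP, 1000 GOPS -> 1 pJ/OP
```

So within an unchanged 1 W budget, a tenfold workload increase tightens the energy-efficiency requirement from about 10 pJ/OP down to about 1 pJ/OP, which is exactly the regime discussed for 4G standards below.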
2.1.2 Digital Signal Processing

In digital signal processing (DSP), the analog physical signals first undergo an analog-to-digital conversion. After that, they are manipulated in different ways to extract information or to synthesize new signals with desired properties, see [202]. A key property of DSP algorithms is that they work on relatively small bit widths. This is a result of the analog-to-digital converter precisions of typically 8 to 24 bits. As opposed to the floating-point arithmetic of the scientific and high-performance computing domains, the fixed-point arithmetic of DSP allows area- and power-efficient implementation in hardware. Digital processing has displaced analog systems because of its low cost, the ease of reproduction of digital signals, the precision flexibility, and other important advantages. On the other side, digital signal processing poses very high computational requirements on the respective hardware. Since DSP algorithms naturally provide a lot of inherent parallelism, massively parallel architectures are perfectly suited to implement them. The main DSP algorithm types include filtering and transforms.

Filtering As in the case of analog signals, filtering is used to flexibly remove undesired signal components while leaving the desired components untouched. Decimation, interpolation, and other filter types exist.

Transforms Transforms are used to convert signals from one representation domain to another to reduce the computational complexity of signal processing. Transforms are extensively used in image and video processing as well as in wireless communication. The most prominent examples are the fast Fourier transform (FFT), the discrete Fourier transform (DFT), the discrete cosine transform (DCT), and the discrete wavelet transform (DWT).
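As a small illustration of the filtering class, a direct-form FIR filter is just a sliding dot product, which also makes the inherent parallelism visible: every output sample can be computed independently. The sketch below uses plain Python with integer samples, mirroring the fixed-point arithmetic typical of DSP hardware; the concrete signal and coefficients are arbitrary test values.

```python
def fir(samples, coeffs):
    """Direct-form FIR filter: y[n] = sum over k of h[k] * x[n-k].

    Integer inputs keep the example in the spirit of fixed-point DSP.
    Each output sample depends only on a short window of inputs, so
    all outputs could be computed in parallel on an array of PEs.
    """
    out = []
    for n in range(len(samples)):
        acc = 0
        for k in range(len(coeffs)):
            if n - k >= 0:               # zero initial filter state
                acc += coeffs[k] * samples[n - k]
        out.append(acc)
    return out

# A 3-tap moving-sum filter over a short test signal:
y = fir([1, 2, 3, 4], [1, 1, 1])
# y == [1, 3, 6, 9]
```

The same multiply-accumulate structure underlies the transforms named above (DFT, DCT), which is why DSP hardware is dominated by multiply-accumulate units.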
2.1.3 Wireless Communication

Wireless communication has shown breathtaking advances during the last 15 years. From a niche market of military and administrative applications, it has grown into one of the biggest accelerators of the whole semiconductor and high-tech industries. The peak data rates increased by as much as three orders of magnitude between successive technology generations: kbit/s for 2G systems, Mbit/s for 3G systems, up to Gbit/s for 4G. Today's wireless communication devices are all multi-standard and multi-application devices. To reduce the complexity of systems with application-specific implementations of different standards and radio interfaces, the concept of software-defined radio (SDR) was introduced, see [201]: a software radio is a radio that can have its functionality substantially modified by software or hardware reconfiguration.
[Figure 2.2 plots operations per bit over bit rate (Mbps) on a double-logarithmic scale, with iso-workload lines from 0.1 GOPS to 1 000 GOPS; the plotted standards include Bluetooth, GSM, GPRS, UMTS/WCDMA, HSDPA, LTE, LTE Advanced, 802.11b, 802.11n, UWB, DVD, and Blu-ray.]
Figure 2.2: Decoding workloads of different wireless protocols and multimedia standards (original version in [297]).

The reasons for this development are the extremely diversified market with its different communication standards, fast standard evolution and short times-to-market, widely varying computational loads even within the same standard (16-150 GOPS in LTE), as well as the required compactness and energy efficiency of the hardware. Therefore, more and more programmable or reconfigurable accelerators are used in wireless baseband processing chips. However, the usage of such accelerators is not "plug-and-play", since the performance requirements are exceptionally high.
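The two axes of Figure 2.2 relate to the iso-workload lines by a simple product: the decoding workload in GOPS is the per-bit complexity times the bit rate. A quick sanity check (the 3 500 operations per bit below is a derived illustrative value, not a figure stated in the text):

```python
def workload_gops(ops_per_bit: float, bit_rate_mbps: float) -> float:
    """Decoding workload in GOPS = (operations per bit) * (bit rate)."""
    return ops_per_bit * bit_rate_mbps * 1e6 / 1e9

# An assumed complexity of ~3 500 operations per decoded bit at the
# HSDPA rate of 14.4 Mbps lands near the ~50 GOPS regime:
print(workload_gops(3_500, 14.4))
```

This product form is also why the workloads in the figure span four orders of magnitude: both the per-bit complexity and the bit rates vary by orders of magnitude across the standards.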
2.1.4 Performance Requirements Figure 2.2 shows the decoding workloads of different wireless and multimedia standards on a double-logarithmic scale. It can be clearly seen that the different decoding workloads vary by as much as four orders of magnitude. Already for HSDPA with a bit rate of 14.4 Mbps, the computing power needed is about 50 GOPS. For LTE (100 Mbps) and LTE Advanced (1 000 Mbps), these loads are in the order of 1 000 GOPS. The energy efficiencies needed to implement these protocols in mobile user equipment are approximately 25 mW/GOPS (or 25 pJ/OP) for HSDPA, about 4 mW/GOPS (or 4 pJ/OP) for LTE, and must be close to 0.5-1 mW/GOPS (or 0.5-1 pJ/OP) for LTE Advanced. Figure 2.3 gives an overview of the different hardware architectures used in mobile devices together with their respective application areas and energy efficiencies. Mobile microprocessors are typically used for control-intensive, low-load applications like peripheral control and the operating system. Mobile DSPs are deployed for the higher workloads of graphics and modem processing. Configurable hardware and application-specific accelerators are used primarily for high-workload,
[Figure 2.3 plots computational load (GOPS) versus energy per operation (pJ/op) on a double-logarithmic scale, with regions for ASIC, configurable hardware, DSP, and μ-processor; marked workloads include radio codec/frontend/modem, media codec/display/audio, graphics pixel/vertex, control protocols, and the WPPA instances with 24 and 6 cores.]
energy-sensitive baseband signal processing, see [87]. Two example instances of the parameterizable weakly programmable processor arrays (WPPA) covered in this thesis can also be seen (WPPA with 24 cores, WPPA with 6 cores). With their achieved power efficiencies of 10 and 15 pJ/OP, see Chapter 6, these arrays are suitable for media and modem processing in mobile architectures.

Figure 2.3: Computational workloads and corresponding energy per operation (original version in [297]).

Finally, in Table 2.2 different hardware architectures are compared regarding their achievable energy and power efficiencies. It clearly shows that currently only heterogeneous systems consisting of different specialized blocks are able to provide adequate energy efficiencies for the workloads required by state-of-the-art wireless communication standards. Although Table 2.2 is based on the older 90 nm CMOS technology, the fundamental relations between the architectures do not change significantly with smaller process nodes, since these relations physically reflect the key trade-off between flexibility and performance of hardware implementations.

Architecture            Efficiency (MOPS/mW) ↑    Efficiency (pJ/OP) ↓
Desktop PC              0.05                      20·10^3
Mobile RISC             0.5                       2·10^3
Mobile DSP              10                        100
Fine-grained Rcf.       25                        40
Coarse-grained Rcf.     80                        12.5
ASIC                    > 200                     < 5

Sub-threshold leakage occurs while the transistor operates in the weak inversion region (VGS < VTH), see [71]. Diffusion current depends exponentially on its control voltage. During the
weak inversion phase, the drain-source current IDS is an exponential function of the gate voltage VGS and can be modeled in first approximation as

IDS = IS · 10^{(VGS − VTH)/S} · (1 − 10^{−n·VDS/S}) ,    (2.1)

S = n · (kT/q) · ln(10) ≈ 60 … 100 mV ,

IS = 2 · n · μ · COX · (W/L) · (kT/q)^2    (n ≥ 1, n = 1.4 … 1.5) .

The factor (1 − 10^{−n·VDS/S}) is ≈ 1 for VDS > 100 mV.
The parameters IS and n are empirical (n = 1 for an ideal transistor). S is the slope factor, also called the sub-threshold swing, T the absolute temperature in K, k the Boltzmann constant, q the electron charge, μ the electron mobility, COX the gate oxide capacitance, W the transistor width, and L the transistor length. The term kT/q is also referred to as the thermal voltage and equals approximately 25.8 mV at 27 °C (300 K). For the magnitude of the slope factor we can state S ≥ 2.3 · kT/q, which is approximately 60 mV for an ideal transistor at 300 K. The sub-threshold leakage is the leakage component most sensitive to parameter variations, see [204]. Example 2.1 If we assume a slope factor of 80 mV and a threshold voltage of 400 mV, the leakage current drops by five orders of magnitude between VGS = VTH and VGS = 0. Consequently, for a low-threshold device with VTH = 100 mV, the respective leakage current will be approximately 10^4 times higher, see [237]. In general, the sub-threshold leakage increases 10× if the threshold voltage is reduced by the slope factor. Gate Oxide Leakage According to the constant-field technology scaling rules, the equivalent gate oxide thickness TOX should be scaled together with the other physical transistor dimensions. For a 65 nm device this would result in a gate oxide thickness of only 1.2 nm or 12 Å, which is about the thickness of a few insulator molecule layers (for comparison, the radius of an oxygen molecule O2 is ≈ 1.21 Å). Going from the 180 nm technology node down to 90 nm, the gate leakage increased by more than 10^4, see [200]. The high gate leakage currents substantially reduce the gate resistance of the transistor and threaten the basic principles of CMOS design. The main sources of the gate leakage are the physical phenomena of direct-oxide tunneling, which is the main component of the gate leakage for lower TOX, see [68], and Fowler-Nordheim tunneling, see [240].
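Returning to the sub-threshold model of Equation 2.1, Example 2.1 can be verified numerically. The prefactor IS is set to 1 since only current ratios matter here, and VDS is assumed large enough that the (1 − 10^{−n·VDS/S}) factor is ≈ 1:

```python
# Numerical check of Example 2.1 with the sub-threshold current of Eq. 2.1.
# I_S = 1 (only ratios matter); the VDS-dependent factor is taken as ~1.

def i_sub(vgs_mv, vth_mv, s_mv):
    """Sub-threshold drain-source current (arbitrary units)."""
    return 10.0 ** ((vgs_mv - vth_mv) / s_mv)

# S = 80 mV, VTH = 400 mV: the current drops by 400/80 = 5 decades
ratio = i_sub(400, 400, 80) / i_sub(0, 400, 80)
print(ratio)   # 1e5: five orders of magnitude, as stated in Example 2.1
```

The decade count is simply VTH/S, which is why reducing the threshold voltage by one slope factor raises the leakage by 10×.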
Direct-oxide tunneling currents show an exponential dependence on the gate oxide thickness and the supply voltage VDD. The respective current density grows by several orders of magnitude for oxide thicknesses going from 19 Å down to 6 Å. Additionally, direct-oxide tunneling
has a larger impact on NMOS than on PMOS devices:

IGD ∼ e^{−TOX} · e^{VGD} ,   IGS ∼ e^{−TOX} · e^{VGS} .    (2.2)
Contrary to the sub-threshold leakage currents, which can be reduced by increasing the threshold voltages, gate leakage can only be reduced by reducing the supply voltage VDD. In 90 nm technology, the magnitude of the gate leakage currents is about 10× lower than that of the sub-threshold leakage currents, but this difference shrinks as the gate oxide thickness is reduced for smaller technology nodes. The final limit of the oxide thickness is reached when the gate oxide leakage is of about the same magnitude as the sub-threshold leakage. This happens at a physical oxide thickness of ≈ 1 nm, see [71]. It is desirable to keep the gate leakage current density at ≈ 100 A/cm² (1 μA/μm²), see [112, 147]. To achieve this, the introduction of dielectric materials with high permittivity (high-κ) is required instead of SiO2, which has a comparatively low dielectric permittivity. High-κ Dielectric Materials The effective transistor current in a MOS transistor is proportional to the transconductance parameter k′, defined as

k′ = μ · Cg = μ · ε / TOX .    (2.3)
Thereby, Cg denotes the gate capacitance, ε (or κ) is the relative permittivity of the gate dielectric, and TOX is the gate dielectric thickness. To increase the transconductance, one has to find a way to either enhance the mobility of the charge carriers (for instance by straining the silicon [274]) or increase the gate capacitance as in three-dimensional FinFET transistors [259]. Since scaling the dielectric thickness leads to an exponential increase in gate oxide leakage, the only way to achieve this goal is to use a material different from SiO2 (κ = 3.9) which provides a substantially higher dielectric permittivity. Such materials are, for instance, hafnium dioxide HfO2 (κ = 25), zirconium dioxide ZrO2 (κ ≈ 22), and titanium dioxide TiO2 (κ ≈ 86 … 173). The usage of high-κ materials comes at the cost of introducing metal gate electrodes instead of polysilicon. This is due to phonon scattering4 effects which limit the electron mobility and degrade the transistor performance. The introduction of high-κ materials and metal gates leads to a dramatic gate leakage reduction in the 45 nm node as compared to 65 nm transistors: Gate leakage is reduced by > 25× for NMOS and by 1 000× for PMOS, see [200].

4 In physics, a phonon is a collective excitation in a periodic, elastic arrangement of atoms or molecules in condensed matter, such as solids and some liquids. Often referred to as a quasiparticle, it represents an excited state in the quantum mechanical quantization of the modes of vibration of elastic structures of interacting particles, see [311].
Temperature Sensitivity As already mentioned, the dominant mechanism behind the sub-threshold leakage is a diffusion-based current, whereas the gate oxide leakage current is drift-based. Therefore, the sub-threshold current depends exponentially on temperature, while the gate leakage has only a weak temperature dependency. A temperature increase leads to a reduced charge carrier mobility μ and consequently to a reduction of the ION current. On the other hand, it also leads to a reduced threshold voltage VTH and a subsequent increase of the sub-threshold leakage current. At higher temperatures, this interdependence can lead to the so-called thermal runaway effect which eventually destroys the chip. As a consequence, an increase in operating temperature has a deteriorating effect especially on the ION/IOFF current ratio of the devices.

Drain-induced Barrier Lowering (DIBL) The reduction of the channel length of the transistor leads to a reduced barrier for the majority charge carriers to enter the channel, and hence to a reduced threshold voltage. In short-channel transistors, the barrier height and the threshold voltage are a strong function of the drain voltage. If the depletion charge between the source-body and the drain-body terminals becomes a larger fraction of the channel length, the threshold voltage is reduced, see [71]. This phenomenon is called drain-induced barrier lowering (DIBL):

VTH = VTH0 − λd · VDS .    (2.4)

The coefficient λd is the respective proportionality factor. The DIBL effect therefore turns the threshold voltage into a signal-dependent variable. For instance, for a gate length of 100 nm and VDS = 1 V, the DIBL effect amounts to ≈ 120 mV, see [106]. Example 2.2 Raising the VDS voltage of an NMOS transistor in the off state (65 nm, λd = 0.18, S = 100 mV, VGS = 0) from 0.1 V to 1.0 V increases the leakage current by a factor of 10, see [237].

Body Biasing Effect Additionally, the threshold voltage of a transistor can be influenced by applying biasing voltages to the bulk or well of the transistor. The linear relation between the body biasing voltage and the corresponding threshold voltage is described by

VTH = VTH0 − γd · VBS .    (2.5)

The coefficient γd is called the body-effect parameter. Considering these effects leads to a slightly modified expression for the sub-threshold leakage current:

IDS = IS · 10^{(VGS − VTH0 + λd·VDS + γd·VBS)/S} · (1 − 10^{−n·VDS/S}) .    (2.6)
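The combined influence of DIBL and body biasing on the sub-threshold leakage of Equation 2.6 can be illustrated numerically. All parameter values below are illustrative assumptions, not device data from the thesis:

```python
# Relative sub-threshold leakage change under DIBL and body biasing (Eq. 2.6).
# Parameter values are illustrative assumptions (I_S = 1, arbitrary units).

def leak(vgs, vds, vbs, vth0=0.3, lam=0.1, gam=0.2, s=0.1, n=1.4):
    """Sub-threshold leakage of Equation 2.6."""
    return 10 ** ((vgs - vth0 + lam * vds + gam * vbs) / s) * (1 - 10 ** (-n * vds / s))

# DIBL: raising VDS from 0.1 V to 1.0 V (lambda_d = 0.1, S = 100 mV)
dibl = leak(0.0, 1.0, 0.0) / leak(0.0, 0.1, 0.0)
print(f"{dibl:.1f}")   # ~8x: 10^(0.1*0.9/0.1), slightly more due to the VDS factor

# RBB: a reverse body bias of -0.5 V with gamma_d = 0.2 raises VTH by 100 mV
rbb = leak(0.0, 1.0, 0.0) / leak(0.0, 1.0, -0.5)
print(f"{rbb:.1f}")    # one slope factor -> one decade (10x) less leakage
```

The dominant effect in both cases is the shift of the exponent by λd·ΔVDS or γd·ΔVBS, measured in units of the slope factor S.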
According to Equation 2.6, the sub-threshold leakage current is an exponential function of the drain and bulk voltages VDS and VBS. The properties of a transistor (gate delay tp, drain current IDS) in the operational mode can be described in first approximation by the empirical alpha-power model of Sakurai and Newton [244, 146]:

IDS = (W / 2L) · μ · COX · (VDD − VTH)^α ,   α ≈ 1.2 … 1.3 ,    (2.7)

tp = (A · CL · VDD) / (VDD − VTH)^α .

Here, A is an empirical constant, CL is the load capacitance, and α models the velocity saturation of the technology node, see [146].
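The delay part of the alpha-power model (Equation 2.7) shows directly how voltage scaling trades speed for power. Since A·CL only scales the absolute delay, it is set to 1 here; α and VTH are typical illustrative values, not specific to any process in the thesis:

```python
# Gate delay versus supply voltage with the alpha-power model of Equation 2.7.
# A*CL normalized to 1; alpha and VTH are illustrative values.

def gate_delay(vdd, vth=0.3, alpha=1.25, a_cl=1.0):
    return a_cl * vdd / (vdd - vth) ** alpha

slowdown = gate_delay(0.8) / gate_delay(1.2)
print(f"{slowdown:.2f}")   # lowering VDD from 1.2 V to 0.8 V slows the gate by ~1.4x
```

As VDD approaches VTH, the denominator collapses and the delay grows rapidly, which is why voltage scaling must usually be accompanied by threshold scaling.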
2.2.5 Power and Energy Efficiency There exist many different metrics to express the efficiency of a given design with respect to its power and/or energy consumption. The following metrics are used most often:

• Thermal design power (W),
• Average power (W),
• Million operations per second per power ratio (MOPS/mW, GOPS/W),
• Average power to frequency ratio (mW/MHz),
• Energy (J),
• Energy-delay product (pJ·s),
• Energy-delay^n product (pJ·s^n).

Depending on the application and the target performance, some metrics are more appropriate than others. In portable, battery-powered systems, energy is the most appropriate metric in the majority of cases. In the area of high-performance computing, the appropriate metrics are rather GOPS/W (MOPS/mW) or the energy-delay product.
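These metrics are interconvertible. In particular, MOPS/mW and pJ/OP are reciprocal up to a factor of 1000, since 1 mW per MOPS = 10^-3 J per 10^6 operations = 1 nJ/OP. The example values below match the desktop PC and mobile DSP rows of Table 2.2:

```python
# Conversions between the efficiency metrics listed above.
# 1 mW / 1 MOPS = 1e-3 J / 1e6 ops = 1000 pJ/OP.

def mops_per_mw_to_pj_per_op(eff_mops_per_mw):
    return 1000.0 / eff_mops_per_mw

def energy_delay_product(power_w, runtime_s):
    """EDP in J*s: the energy of one run (P*t) weighted by its duration t."""
    return power_w * runtime_s ** 2

print(mops_per_mw_to_pj_per_op(0.05))   # desktop PC: 20000 pJ/OP
print(mops_per_mw_to_pj_per_op(10))     # mobile DSP: 100 pJ/OP
```

The same reciprocal relation explains why 25 mW/GOPS and 25 pJ/OP were used interchangeably in Section 2.1.4.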
2.2.6 Power Consumption in Digital CMOS Circuits Regarding modern CMOS transistor technology, two parts of the power consumption can be distinguished: • Dynamic power consumption: PDyn • Static power consumption: PLeak Both dynamic and static power dissipation have to be minimized. While dynamic power consumption is normally due to useful computations, static or leakage power in standby mode does not contribute any value and has to be reduced as far as possible. For digital semi-custom CMOS circuits we distinguish between the following types of power consumption: Definition 2.1 (Dynamic Power) Dynamic power (PDyn): Mainly consists of inter-cell or switching power and intra-cell or internal power. Definition 2.2 (Switching Power) Switching or inter-cell power (PSw): The power consumed by switching external capacitances in interconnect wires and cells in the fan-out. Definition 2.3 (Internal Power) Internal or intra-cell power (PInt): The power dissipated by input transitions that do not produce output switching. This is the power consumed by switching internal capacitances. Definition 2.4 (Leakage Power) Static or leakage power (PLeak): The direct-current mode power consumed by the design. This power is associated with maintaining the logic values of internal circuit nodes between the switching events (active leakage power) or during the standby periods (standby leakage power). It is formed by the parasitic leakage currents flowing from the power supply to the ground while the transistors are nonconducting. While static power used to be a minor component of the overall power budget down to ≈ 180 nm (20-25%), it has been increasing rapidly since the 90 nm node (35-50%). This is not surprising, since the sub-threshold and gate leakage currents increase exponentially with supply and threshold voltage scaling, as already mentioned in Section 2.2.4.
For an in-depth overview of leakage power sources and their modeling, we refer to [133]. The single power components relate as follows:

PTotal = PDyn + PLeak ,
PDyn = PSw + PSc + PInt .    (2.8)
The corresponding energy consumption of a synchronous CMOS circuit sums up to

ETotal(t) = ∫₀ᵗ ( CL · VDD · VSwing · f · α  +  VDD · QSC · f  +  VDD · ILeak ) dt .    (2.9)

The first term corresponds to PSw + PInt and the second to PSc (together forming PDyn); the last term corresponds to PLeak.
In Equation 2.9, CL denotes the total nodal capacitance which includes the driver, interconnect and fan-out capacitance, f denotes the clock frequency, VSwing denotes the signal voltage swing (for CMOS typically VSwing = VDD ), QSC is the charge due to the short-circuit momentary current, and α denotes the average number of transitions per clock cycle (switching activity factor), see [229]. In well-designed digital circuits (with equal input and output rise and fall times, see [301]), the short-circuit power dissipation (PSc ) is negligible. Therefore, we consider only the switching and internal parts of the dynamic power consumption (PSw , PInt ) additionally to the static power consumption (PLeak ). According to this, power optimization approaches can also be broadly categorized into techniques reducing the dynamic power and techniques for static power reduction. However, this categorization is not strict in the sense that advanced low-power design techniques usually address both components simultaneously. Additionally, one distinguishes between design time and run-time power optimization. Again, some techniques may fall into both categories. Reduction of the Dynamic Power Consumption To reduce the dynamic power consumption, all five variables of PDyn in Equation 2.9 can be tuned: Technology scaling and optimization, for instance, reduces the CL of the gates. By reducing the supply voltage and voltage swing, more power efficient computation can be achieved. Reduction of the clock frequency as well as the average signal transition number also greatly saves dynamic power. In this case, technology scaling and the reduction of the signal transition number would be categorized as design time and the remaining techniques as run-time optimization. Supply voltage and frequency reduction can be deployed both at design time and run time. 
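The switching term of Equation 2.9, PSw = α · CL · VDD · VSwing · f with VSwing = VDD for full-swing CMOS, already exposes the most powerful tuning knob. The node values below are illustrative:

```python
# Switching power term of Equation 2.9 with VSwing = VDD (full-swing CMOS).
# Capacitance, frequency, and activity values are illustrative.

def p_switching(alpha, cl, vdd, f, vswing=None):
    if vswing is None:
        vswing = vdd           # CMOS: full voltage swing
    return alpha * cl * vdd * vswing * f

base = p_switching(alpha=0.2, cl=1e-12, vdd=1.2, f=500e6)   # 1 pF node, 500 MHz
halved = p_switching(alpha=0.2, cl=1e-12, vdd=0.6, f=500e6)
print(round(halved / base, 2))   # quadratic VDD dependence: halving VDD quarters P_Sw
```

This quadratic dependence is why supply voltage reduction dominates all other dynamic power techniques, provided the resulting speed loss is acceptable.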
Reduction of the supply voltage has a quadratic effect on the dynamic power, but an even stronger one on the leakage currents: ∼ VDD^3 on the sub-threshold and ∼ VDD^4 on the gate leakage currents, see [171]. Reduction of the Static Power Consumption Regarding the static power consumption, the major tuning knobs are the manufacturing technology parameters and materials as well as the supply and threshold voltages of the transistors. Specific to the static power consumption is the fact that generally two major operation modes of the devices can be distinguished: 1. Normal or active operation mode and the 2. Standby or sleep mode.
Again, we separate design-time and run-time optimization techniques. For instance, devices with different threshold voltages, supply voltages, and gate lengths can be used at design time for different hardware blocks. Design-time techniques reduce the leakage of both the active and the standby operating modes, but their capabilities are limited: ≈ 42% with dual-VDD devices [257] and ≈ 30% using dual-gate-length techniques [243]. The usage of complex gates also reduces the leakage currents due to the transistor stacking effect, which will be explained in detail later on. Run-time techniques only reduce the leakage in the standby mode but have much higher reduction capabilities: 40× and beyond in the case of power gating, see for example [243]. Generally, a trade-off exists between static and dynamic power consumption during the active mode of operation: Using low-VTH devices leads to an increase in leakage power but simultaneously allows lower supply voltages for the same delay target. The optimal balance between dynamic and static power consumption depends, among other things, on the switching activity of the circuit and lies approximately at 1/2, that is, 50% dynamic power and 50% leakage power for circuits with a high switching activity, as exemplarily shown in [53] for an adder design. As of today, a much higher performance per joule of expended energy is required than currently available. According to [147], the computation performance in a suitable power-aware metric must be increased by one to two orders of magnitude by the year 2020. This can only be achieved by aggressively applying different low-power design techniques, which are briefly described in the following section.
2.3 Low-Power Design Techniques and Tool Flows The latest advances in semiconductor manufacturing, with process nodes down to 22 nm, have made low power one of the most important design goals for the majority of consumer and industrial applications. Power- and energy-related aspects are meanwhile key features of virtually every design or design component, see for example [239]. Following this development, power reduction techniques have become an integral part of the overall design process, embracing all abstraction layers from the system level down to the physical implementation, manufacturing, and test. This finally led to the insight that the power intent cannot be separated from the overall design flow and thus has to be seamlessly integrated into the existing design methodologies.
Furthermore, the current advance of the massively parallel computation paradigm was mainly provoked by the "power wall" [50], which also caused a subsequent priority shift from frequency/area-constrained designs towards the power-limited era.
2.3.1 Low-Power Design Techniques From the design perspective, simple and advanced techniques to reduce the overall power consumption of modern digital circuits can be distinguished. Simple techniques include, for example, (hierarchical) clock gating and multiple-VTH optimization. These techniques are already incorporated in most modern EDA design tools and have only little impact on the overall design methodology. Mostly, these are design-time optimization techniques. The advanced power reduction techniques like multiple supply voltages, power gating or power shut-off (PSO), dynamic and adaptive voltage/frequency scaling, and substrate biasing have a much bigger impact, both on the methodology and on the power saving capabilities. These techniques are supposed to mitigate the leakage current increase introduced by the aggressive technology, supply, and threshold voltage scaling over the next few years. Applied each on its own or in combination, they provide a better power and energy efficiency of the respective design. Clock Gating The main idea is to disconnect the clock from the flip-flops of an idle module, that is, to insert an AND gate in front of the clock input of the flip-flops. This greatly reduces the dynamic power, since the clock distribution network has the highest signal switching activity together with a very high overall capacitive load. This technique was originally proposed by Téllez et al. in [277] as activity-driven clock design. Additionally, the authors gave a formal description of clock gating together with respective gate insertion and clock tree construction algorithms. The work by Wu et al. [315] investigates various issues in automatically deriving a gated clock from a master clock. A recent publication on that topic [105] uses the concepts of observability don't-care conditions (ODC) (combinational) as well as stability conditions (sequential).
Regarding the granularity, different choices are possible which can be combined to achieve an even higher power reduction coefficient: • Clock gating at the flip-flop/register level: RTL clock gating • Disabling parts of the global clock distribution network: Architectural clock gating • Turning off the phase-locked loop circuit and the quartz oscillator during the low duty-cycle operation modes
Advantages:
Mature technique, applied automatically by modern CAD tools.
Limitations:
Gated clock signals suffer an additional gate delay and increase the overall clock skew. Potential high latencies for turning on quartz oscillator and phase-locked loop circuits. Has no effect on the leakage power consumption.
Operand Isolation Changes at the inputs of combinational blocks cause switching activity in the fan-out gates. Unintentional input changes of large combinational circuits, such as multipliers, can lead to a considerable dynamic power consumption. The operand isolation technique addresses this issue. An early description and definition is found in [88]: Sections of circuitry are isolated from "seeing" changes on their inputs unless they are expected to respond to them. That is, if the enable signal of a data path module like an adder or multiplier is inactive, the inputs of this module are frozen at a constant value. Similar to clock gating, this technique can be applied automatically with a low to moderate area and timing overhead, as exemplarily shown in [206]. AND/OR gates or latches can be used for isolation. Another work considering low-power behavioral synthesis was published in [85]. Advantages:
Mature technique, applied automatically by modern CAD tools.
Limitations:
Additional gate delay and area overhead due to isolation logic.
Multi-performance Devices Originally introduced in [89] for power-efficient memory circuits, this technique addresses the static power consumption and allows the optimization of both standby and operational leakage power without any impact on the design performance. Library cells with manufacturing process options for different device thresholds (see [183] for the analytical models) are used to optimize for power, timing, and area constraints. The main idea is to incrementally replace the fast but leaky transistors in non-critical paths with slower, less leaky devices with a higher threshold voltage, see [215] and [309]. The highest benefits come from the utilization of cells with two different thresholds. Adding a third and further thresholds, however, yields diminishing returns with respect to the power reduction, see [176]. For sub-100 nm technologies, most semi-custom cell library vendors provide libraries with multi-VTH cells. In modern CAD-based design flows, a synthesis tool for low-power applications is aware of the multi-threshold cells in the corresponding library and uses them to meet speed and area constraints while simultaneously optimizing for power dissipation. Example 2.3 In a 65 nm technology with a slope factor of S = 100 mV, increasing the threshold voltage by 100 mV causes a 10× leakage reduction.
Advantages:
Little impact on the design flow. No special layout strategies or voltage level conversion needed. Substantial leakage power reduction.
Limitations:
Added mask costs for the manufacturing process and additional source of parameter variability.
Gate Sizing/Tapering This is an effective circuit-level technique to reduce especially the internal power component of the dynamic power consumption. During gate sizing, the physical dimensions of devices on the non-critical timing paths are reduced. Provided appropriate standard cell libraries with discrete device sizes, this technique is automatically applied by modern synthesis CAD tools during the technology mapping step: cell resizing. Advantages:
Does not impact the design performance, as long as non-critical paths are considered.
Limitations:
Narrows the timing path-delay distribution: makes more paths time-critical. Increases the impact of process variations and thus affects the robustness. Continuous transistor sizing is applicable to full-custom circuit design only.
Transistor Stacking Effect This describes the general reduction of the leakage current flowing through a stack of two or more series-connected, turned-off transistors compared to a single device. In complex CMOS gates, such as multi-input NAND/NOR gates, three and more transistors are connected in series either in the pull-up or the pull-down network. Since the off-resistance of a nanometer-size transistor is a function of the applied voltage, the leakage reduction for stacked devices is much higher than linear. One can define the stack effect factor as the ratio of the leakage current of one switched-off device to the leakage current of a stack of two switched-off devices, see [216]:
X = IDevice / IStack = 10^{(λd·VDD/S) · (1+λd)/(1+2·λd)} .    (2.10)

Therefore, the stack effect primarily depends on the supply voltage VDD, the slope factor S, and the DIBL factor λd. For instance, in the pull-down network of a complex NAND gate in the off state (inputs 0 0), the voltage on the intermediate node between the series-connected NMOS transistors settles to an intermediate value. This leads to a negative gate-source voltage of the upper NMOS device and therefore to an exponential reduction of the leakage currents. This technique can also be applied automatically by respective CAD design tools.
Inputs    Output    Leakage Current (nA)
0 0       1         23.06
0 1       0         51.42
1 0       0         47.15
1 1       0         82.94
Table 2.3: The leakage values of a two-input NAND gate (180 nm technology, 0.2 V threshold, 1.5 V supply voltage), see [229].

The stack effect becomes stronger with technology scaling: A decrease in the transistor channel length and threshold voltage increases λd in a given technology, see [183]. For proper modeling of the stack effect we refer to [75]. Compare also Table 2.3 for 180 nm devices (factor ≈ 4) with the following Example 2.4 for 65 nm devices (factor ≈ 18). Example 2.4 For NMOS transistors (65 nm, λd = 0.18, S = 100 mV, VDD = 0.8 V), the stack effect factor is ≈ 18. See also [103] for an illustration. Advantages:
Super-linear leakage reduction effect. Can be used to control threshold variability to some extent. Negligible effect on performance, little deployment overhead. Fast transition times.
Limitations:
Dependence on the input pattern, see Table 2.3. Overall leakage reduction possibilities are limited.
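Example 2.4 can be checked numerically, assuming the stack effect factor of Equation 2.10 takes the form X = 10^{(λd·VDD/S)·(1+λd)/(1+2λd)} (this form follows from equating the sub-threshold currents of the two stacked devices when only DIBL shifts the threshold; the parameter values are those given in the example):

```python
# Stack effect factor for two stacked off-devices (cf. Equation 2.10 and
# Example 2.4), assuming only DIBL shifts the threshold voltage.

def stack_factor(vdd, lam, s):
    """Ratio of single-device leakage to two-device stack leakage."""
    return 10 ** ((lam * vdd / s) * (1 + lam) / (1 + 2 * lam))

x = stack_factor(vdd=0.8, lam=0.18, s=0.1)
print(round(x))   # ~18, matching Example 2.4
```

The exponent shrinks with λd·VDD, which is consistent with the much smaller factor of ≈ 4 for the 180 nm devices of Table 2.3.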
Sub-threshold/Ultra-Low-Voltage Logic Scaling the supply voltage into the sub-threshold region makes it possible to operate a circuit at its true minimum energy point, typically in the μW and nW power range. In this regime, sub-threshold leakage currents charge and discharge the load capacitances, limiting performance but giving significant energy savings over operation at nominal VDD, see [304, 60]. The performance is exponentially lower than at nominal VDD due to the lower on-current. This technique is suitable for low-throughput (10-100 kHz) and medium-throughput (1-10 MHz) applications. Advantages:
Allows operation at the minimum energy point. Suitable application domains: Implanted electronics, smart sensors, wearable electronics.
Limitations:
Not suitable for high-performance applications. Complex, full-custom design. Highly sensitive to process and environmental variations, see [47].
Body Biasing (Forward/Reverse) Originally introduced by Kobayashi and Sakurai in [168] as the self-adjusting threshold-voltage scheme (SATS, reverse biasing). Also known as variable threshold-voltage CMOS (VTCMOS), see [146], and its modification, the dynamic threshold-voltage MOSFET (DTMOS, forward biasing), see [32]. Treating the threshold voltage of a transistor as a variable parameter rather than a constant one, this technique manipulates the threshold voltage at run time by applying a negative bias voltage to the substrate or p-well of the NMOS transistors and a positive bias, that is, a voltage higher than the supply voltage VDD, to the n-well of the PMOS transistors in order to raise the respective threshold voltages: reverse body biasing (RBB). The reverse technique, which lowers the threshold voltages (and increases the performance), is called forward body biasing (FBB). The terms body bias, substrate bias, well bias, and back-gate bias refer to the same concept of threshold voltage modulation using the fourth terminal of the MOSFET, see [71]. In case of adaptive body biasing (ABB), the control is accomplished in a closed loop: The circuit parameters are monitored, compared to the desired values, and adjusted accordingly. Some physical restrictions exist, however, on the magnitude of the bias voltage: For FBB, the source-bulk diode must remain in the reverse-bias state (VSB > -0.6 V); for RBB, this value is limited to around 0.5 V because of the junction leakage increase. Biasing is only possible in bulk and partially depleted silicon-on-insulator (SOI) technologies. For fully depleted SOI, no biasing can be used since the body voltage of the transistor is floating. Furthermore, the biasing effect diminishes with technology scaling, as shown in the following example: Example 2.5 For the 130 nm technology, an overall bias range VBB of 1 V (reverse + forward) changes the threshold voltage by ≈ 200 mV, or approximately 3× the slope factor.
The same bias value achieves a threshold voltage change of 95 mV for 90 nm and of only 55 mV for 65 nm, or less than one slope factor, see [237]. Example 2.6 An NMOS transistor with a channel length of 50 nm and a gate oxide thickness of 1.3 nm is given. Different VDD and VBB voltages are provided for different ION and IOFF targets in three operation modes: a high-speed mode, a nominal mode, and a power-save mode. By varying the supply voltage VDD and the body biasing voltage VBB, the transistor can be tuned for high performance or low power dissipation, see [282]:

Operating Mode    VDD (V)    VBB (V)    ION (μA)    IOFF (μA)
nominal           0.9        -0.5       650         10
high-speed        1.2        0          1150        85
power-save        0.6        -2.0       120         0.3
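The on/off current ratios implied by the operating-mode data of Example 2.6 can be computed directly (the numbers below are the table values from the example):

```python
# On/off current ratios for the three operating modes of Example 2.6.
modes = {
    # mode: (VDD [V], VBB [V], ION [uA], IOFF [uA])
    "nominal":    (0.9, -0.5,  650.0, 10.0),
    "high-speed": (1.2,  0.0, 1150.0, 85.0),
    "power-save": (0.6, -2.0,  120.0,  0.3),
}
ratios = {name: ion / ioff for name, (_, _, ion, ioff) in modes.items()}
print(ratios)
# The deep reverse bias of the power-save mode yields the best on/off ratio
# (~400) at the price of the lowest drive current.
```

The high-speed mode shows the opposite trade-off: zero body bias and a raised VDD nearly double ION but degrade the on/off ratio to roughly 13.5.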
Advantages:
Allows for the post-manufacturing compensation of process variations or a dynamic trade-off between performance and power consumption. Used in burn-in circuit testing to reduce the leakage current. Forward body biasing improves the device short-channel effects, it reduces sensitivity of VT H to variation in gate length, oxide thickness, and channel doping, see [289].
Limitations:
• Diminishing effect with technology scaling: smaller body coefficient γd.
• Reverse body biasing potentially increases the threshold voltage variation [71].
• Only relevant for bulk and partially depleted SOI technologies.
• Reduces only sub-threshold leakage currents and not the gate-oxide leakage.
• Triple-well technology is required for fine-grained application: increased mask and manufacturing costs.
• Depending on the design, considerable activation and especially deactivation time for large biasing swings, see [319].
• Additional characterization at bias process corners is required during sign-off for run-time deployment.
Multiple Supply Voltages Using different supply voltages for different blocks of a big system-on-a-chip is the next natural choice for reducing the power dissipation. This technique is applied statically at design time. Originally, it was mainly used in memory circuits, see [188]. Lowering the supply voltage reduces the power consumption, but possibly at the expense of a lower performance, that is operating frequency, unless the threshold voltage is scaled as well. Regarding the application of different supply voltages in a design, two granularities can be distinguished: • Fine-grained/custom voltage assignment and • Coarse-grained/block-level voltage assignment or voltage islands. During the custom voltage assignment to gates inside a logic block, one searches for the non-critical timing paths for which the supply voltage can be reduced. Non-critical paths are gradually assigned a lower supply voltage until they become critical. This algorithm for the assignment of voltage levels is called clustered voltage scaling (CVS), see [294]. In CVS, the cells driven by each power supply are grouped (clustered) together, and level conversion is needed only at sequential element outputs (referred to as synchronous level conversion), see [294]. As in the case of devices with multiple threshold voltages, the highest benefits come from the utilization of two different supply voltage levels, see [176]. The optimal ratio between these voltages lies at about 0.7. Adding a third voltage level provides additional savings of ≈ 5-10%. An extension of the CVS scheme, extended clustered voltage scaling (ECVS), was proposed in [162, 273]. Here, the level conversion is allowed anywhere in the circuit
and not only at the outputs of sequential elements: Asynchronous level conversion. An effective application of ECVS is possible for slow and mid-range operating frequencies which provide a sufficient timing slack for the conversion. This type of voltage conversion is generally more sensitive to coupling and supply power noise. On the block level, different design modules operate at different supply voltages: Voltage islands, multiple supply voltages (MSV), originally proposed by Lackey et al. in [178]. In this scheme, the power grid for each block is designed and fixed statically and cannot be changed after the fabrication of the chip: Static voltage scaling (SVS). The appropriate supply voltage level is chosen based on the performance requirements of the corresponding block: Core processor, memory block, peripheral device, and others. In the contemporary era of many-core processors with potentially thousands of cores, see [48], this technique has been extended with dynamic voltage island creation and assignment schemes and is regaining considerable attention, see [186]. Both fine-grained and coarse-grained techniques require voltage level conversion circuits at the corresponding voltage boundaries. These circuits generally reduce the timing slack and increase the chip area. Additionally, overheads for the generation and the distribution of different supply voltages have to be considered. Advantages:
• Fixed-voltage circuits are more robust than circuits that operate on dynamically varying supply voltages, see for instance [26].
Limitations:
• Overheads for generation, regulation, and distribution of different voltages.
• Potential power savings diminish with the scaling of the main supply voltage.
• Impacts the layout process.
• More complex verification.
• Limited application adaptation capabilities: Often, a change of the voltage-island layout depending on the application is desirable, see [186].
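The clustered voltage scaling idea described above, gradually moving gates on non-critical paths to the lower supply until the path becomes critical, can be sketched as a greedy pass over a single timing path. The gate names and delay values are hypothetical; real CVS operates on the full netlist and additionally enforces level-conversion constraints:

```python
def cvs_greedy(path_gates, t_constraint):
    """Greedy CVS-style sketch: assign VDDL to gates along one path as long
    as the path delay still meets the timing constraint.

    path_gates: list of (name, delay_at_vddh, delay_at_vddl) tuples,
                with delay_at_vddl >= delay_at_vddh.
    """
    assignment = {name: "VDDH" for name, _, _ in path_gates}
    delay = sum(d_h for _, d_h, _ in path_gates)  # all-VDDH path delay
    for name, d_h, d_l in path_gates:
        if delay - d_h + d_l <= t_constraint:     # still meets timing?
            assignment[name] = "VDDL"
            delay += d_l - d_h
    return assignment, delay

# Hypothetical 3-gate path: 1.0 ns per gate at VDDH, 1.5 ns at VDDL
assign, delay = cvs_greedy(
    [("g1", 1.0, 1.5), ("g2", 1.0, 1.5), ("g3", 1.0, 1.5)], t_constraint=4.0)
```

With a 4.0 ns constraint, two of the three gates can be moved to VDDL before the path becomes critical.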
Voltage Level Conversion The main issue regarding the deployment of multiple voltages in a single design is the proper interfacing between the gates driven by a low supply voltage and those driven by a high supply voltage. If no care is taken, a DC current flows through the high-VDD gate when it is driven by the output of a low-VDD gate. This happens because the output signal of a low-VDD gate cannot turn the PMOS transistor of a high-VDD gate off completely. Therefore, a level converter restores the high-voltage swing from a low-voltage signal. The opposite direction, that is down-conversion, is unproblematic, if it is deployed at all. These special voltage level conversion circuits are also called level shifters. As already mentioned, we can distinguish between synchronous and asynchronous level shifters. In the case of synchronous voltage conversion, a flip-flop or latch is used,
see [149]. However, synchronous conversion is not always possible, especially in the case of on-chip communication circuits, that is, high-speed interconnect. In the case of an asynchronous conversion, an additional conversion delay has to be taken into account. A low-to-high level shifter cell in semi-custom design is implemented with a single signal input and dual supply voltage sources, see Figure 2.7. In [327] and [173], different design alternatives together with their respective advantages and limitations are discussed.

Figure 2.7: A simple voltage level converter design, see [327].

Dynamic and Adaptive Voltage and Frequency Scaling Since the operating frequency and the corresponding supply voltage are related by Equation 2.7, there exists the possibility of changing the supply voltage together with the operating frequency of a module at run-time, which is referred to as dynamic voltage and frequency scaling (DVFS). This technique was originally proposed in a slightly different context by Macken et al. in [185] and in its modern form by Chandrakasan et al. in [70]. In [272], a subset of this technique, namely varying the supply voltage only, the so-called variable supply-voltage scheme (VS scheme), is presented. Thereby, a digital circuit was used to control supply voltage and frequency on a chip. Regarding the power and energy savings in short-channel transistors, the supply voltage initially scales super-linearly with respect to the frequency but saturates again for very low frequency values, as shown in [272]. That means that DVFS is able to reduce the respective dynamic energy per operation in a super-linear fashion. The dynamic power consumption is even reduced ∼ VDD³: The dynamic power consumed on capacitive loads decreases as VDD², and the frequency can be scaled proportionally to VDD. If the clock frequency alone were scaled, only the power would be reduced; the energy per operation and therefore the battery life would stay the same.
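The cubic relationship can be checked with the first-order dynamic power equation P = α·C·VDD²·f, assuming the frequency scales proportionally to VDD (the load capacitance and voltages are illustrative values, not measured data):

```python
def dynamic_power(c_load, vdd, f_clk, alpha=1.0):
    """First-order dynamic switching power: P = alpha * C * VDD^2 * f."""
    return alpha * c_load * vdd ** 2 * f_clk

# Halving VDD and (proportionally) the clock frequency:
p_full = dynamic_power(c_load=1e-9, vdd=1.0, f_clk=1e9)
p_half = dynamic_power(c_load=1e-9, vdd=0.5, f_clk=0.5e9)
ratio = p_full / p_half  # the ~VDD^3 behavior predicts 2^3 = 8
```

Halving both VDD and f reduces the power eightfold, while scaling f alone would only halve it without changing the energy per operation.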
This technique can be applied very effectively for matching the computational power with the current application demand. However, ramping the supply rail with its high capacitive load up or down additionally consumes time and energy. Therefore, accurate workload prediction schemes are a must. Since the DIBL effect becomes larger with technology scaling, the potential savings achievable with DVFS are increasing. Another important aspect of DVFS is how to relate the supply voltage and clock frequency reliably and cost-effectively. In DVFS, an open-loop approach is used which is based on the selection of predefined operating points from a lookup table with discrete frequency-voltage entries. An extension of dynamic voltage and frequency scaling is adaptive voltage and frequency scaling (AVFS). In AVFS, a closed-loop approach is implemented: The operating points are based only on the frequency. The operating system decides which performance is currently required and sets the target frequency. The supply voltage to support the desired frequency is then adjusted automatically. Regarding the voltage regulation, generally the following alternatives exist: Switched-capacitor charge pump, linear regulator, and switching regulator. For the advantages and limitations of each scheme we refer to [132]. Generally, linear regulators require less area but have a limited conversion efficiency (below 40%). Switching regulators provide a high conversion efficiency (up to 80%) but have a large area footprint, see [231]. High switching delays between voltage levels can degrade the overall system performance.

Example 2.7 The following table provides a quantitative description of the achievable leakage current reduction with the help of AVFS and ABB for 90 nm and 65 nm (GP: general purpose, LP: low power) processes, see [308]:

Technique        90 nm GP    90 nm LP    65 nm LP
AVFS             5.3×        3.3×        5.6×
ABB (VDD/2)      4.1×        6.6×        4.5×
AVFS & ABB       21.6×       21.5×       24.8×
Advantages:
• Impressive savings in dynamic and leakage power consumption in case of aggressive voltage scaling.
• Can be smoothly combined with power gating.
Limitations:
• Continuously variable voltage source required.
• Very high timing and also energy overheads to change voltage/frequency: typical phase-locked loop latency ≈ 0.1 ms, voltage regulator latency ≈ 2-3 ms, software device driver latencies ≈ 20-50 ms.
• The design should provide a certain performance slack.
• Limited to voltage ranges where delay and voltage track monotonically.
• Only efficient together with accurate workload estimation algorithms.
• Fine-grained versions are difficult to implement cost-efficiently; additionally, asynchronous interfaces and meta-stability issues may arise in this case.
• Extremely complex verification and test in case of continuous scaling.
Voltage Dithering/Multi-level Voltage Scaling Originally proposed in [125], this scheme addresses the issue that significant power savings with the DVFS technique are possible only if the voltage can be changed on the same time scale as the variations in workload. Therefore, the goal is to minimize the time needed to switch to a new supply voltage and frequency level. The usage of DC-DC converters cannot provide fast switching times. Instead, a small set of predefined voltage supply rails is used, to which the circuit can be connected via special power switches. The use of power switches allows a continuous variation of the supply voltage to be approximated by dithering with a set of discrete voltage levels. Sometimes this is also called multi-level voltage scaling (MVS), see [26]. Example 2.8 For instance, if only voltages of 0.5 V and 0.75 V are available, the required voltage of 0.6 V can be imitated by working 40% of the time with 0.75 V and 60% with 0.5 V, see [125]. This gives better power saving possibilities because fast workload-specific changes are possible. If this technique is used on a block level (instead of the chip level), a fine-grained DVFS deployment, so-called local voltage dithering (LVD), or voltage rail hopping, is possible, see [69, 95]. Ultra-dynamic Voltage Scaling As already mentioned, the efficient application of DVFS is constrained to the voltage ranges where circuit delay and supply voltage track monotonically. Extending DVFS into the sub-threshold operation range and combining it with local voltage dithering yields the concept of ultra-dynamic voltage scaling (UDVS), see [69]. The minimum energy operating voltage of digital designs usually falls into the sub-threshold voltage region, see [110, 307]. Advantages:
• Operation at the minimum energy point is possible.
Limitations:
• High sensitivity to process variations and operating conditions: Additional control circuitry is required, see [308].
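Returning to Example 2.8, the dithering duty cycle follows from a simple linear interpolation between the two available rails (a sketch; real LVD controllers also account for the switching overheads):

```python
def dither_duty_cycle(v_target, v_low, v_high):
    """Fraction of time to spend on the high rail so that the time-weighted
    average voltage equals v_target."""
    return (v_target - v_low) / (v_high - v_low)

d_high = dither_duty_cycle(v_target=0.6, v_low=0.5, v_high=0.75)
d_low = 1.0 - d_high  # fraction of time at the low rail
```

Here d_high evaluates to 0.4, that is 40% of the time at 0.75 V and 60% at 0.5 V, matching Example 2.8.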
Power Gating/Power Shut-off Power gating (PG) or power shut-off (PSO) saves power by completely disconnecting blocks with a substantial amount of leakage current from the power supply rails. It was originally proposed by Shigematsu et al. as multi-threshold CMOS (MTCMOS) in [256]. Here, devices with a high threshold voltage are used to connect and disconnect the power supply rails from the low-VTH devices, see Figure 2.8. An earlier version of this technique turned off the external power supply. Connecting and disconnecting circuit blocks causes large changes in the currents flowing on a power supply grid and also introduces voltage droops along with high-density currents on some power grid segments, which together affect the chip reliability.
Figure 2.8: Power gating overview.
These and other peculiarities make the proper implementation of power gating very challenging. For a very good overview and a detailed discussion of special circuit-level techniques we also refer to [258] and [134]. Additionally, power gating requires new design concepts and a few types of special cells which, for convenience, are briefly described below, see also [263]: Power Domain:
A power domain is a collection of design elements. Unless otherwise specified, elements of a power domain share a common primary supply set. The primary supply set is implicitly connected to all elements within the domain.
Power Mode:
A static state of a design in which each power domain operates on a specific nominal condition.
Always On Cell:
A special cell that has a secondary power or ground pin in addition to its primary power and ground pin. The cell is on as long as the supply to the secondary power or ground pin is on.
Isolation Cell:
Logic used to isolate signals between two power domains where one is switched on and one is switched off.
Level Shifter Cell:
Logic to pass data signals between power domains operating at different voltages. Sometimes also in combination with isolation function.
State Retention Cell:
Special flip-flop or latch used to retain the state of the cell when its main power supply is shut off.
Power Switch Cell:
Device with a high threshold used to connect and disconnect the power supply from the low-VT H gates in a power domain. Two kinds of power switch cells are feasible: Header for the supply net switching or footer for the ground net switching, or both.
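The role of the isolation cell defined above can be captured in a toy functional model (a sketch for intuition only, not a real cell model; the clamp value and polarity are assumptions):

```python
def isolation_cell(signal, source_domain_on, clamp_value=0):
    """Pass the signal while the source power domain is on; drive a known
    clamp value once the domain is switched off, so that downstream
    always-on logic never sees a floating input."""
    return signal if source_domain_on else clamp_value

out_active = isolation_cell(1, source_domain_on=True)   # signal passes through
out_gated = isolation_cell(1, source_domain_on=False)   # clamped to 0
```

Without the clamp, the output of a switched-off domain would float and could cause short-circuit currents in the receiving always-on gates.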
While conceptually simple, the physical implementation of power gating on a large system-on-a-chip requires a systematic approach [160, 152]. Therefore, the power intent specification for semi-custom ASIC designs was standardized, see [263, 145]. Here, we go through the individual requirements. In Section 5.4.1, more information on power gating will be given. The newly developed power specification formats are described in Section 5.4.2. Power Switch Cell Generally, both types of power switch cells, that is header switches (PMOS) and footer switches (NMOS), are possible, as well as the combination of both, see Figure 2.8. The main optimization goal is to reduce the supply voltage drop, that is the IR drop, on the power switch devices during the active-mode operation of the circuit. Typically, the maximum allowable supply voltage drop is set to 1-5% of the supply voltage to prevent any significant delay increases. This also allows the normal CAD tools to be used for logic synthesis and timing analysis, see [258]. The power switches increase the resistance of the power rail by their respective on-resistance. Effectively, this reduces the supply voltage of the power-gated cells and also has a back-biasing effect. Regarding the overheads in chip area and leakage power, however, differences exist. Due to the higher mobility of electrons in NMOS devices, footer switches require a smaller area overhead for the same active-mode voltage drop. At the same time, a header switch induces an order of magnitude smaller leakage current compared to the footer style: The gate leakage through SiO2 for PMOS transistors is an order of magnitude lower than for NMOS. This is due to the different physical sources of the gate leakage for PMOS and NMOS. In NMOS, electron tunneling from the conduction band (ECB) forms the dominant part of the gate leakage current. In the case of PMOS, it is hole tunneling from the valence band (HVB).
In SiO2, the barrier height for HVB of 4.5 eV is much higher than for ECB with 3.1 eV. This translates directly into the lower leakage of header switches. However, this property depends on the gate dielectric used. If, for instance, silicon nitride Si3N4 were used, the opposite would be true, see [323]. Since the already mentioned transistor stacking effect is present in power-gated circuits, the choice between header and footer switches also influences the switching delays and therefore the design performance. Header switches increase the threshold voltage of the pull-up transistors and increase the low-to-high transition delay. The same applies to footer switches and the high-to-low transition delay. From this standpoint, the concurrent usage of header and footer switches would be optimal (symmetric switches). In this case, the stacking effect introduced by power gating would be independent of the input patterns. However, the area, leakage overhead, and active-mode IR drop of this scheme are prohibitive. Therefore, mostly an asymmetric switch insertion is used, see [117]. Using wider devices decreases the on-resistance but requires more area and energy for turning the device on and off, since a high-VTH transistor has to be very large if a low resistance is required in the linear operation region. This dynamic energy also sets a lower limit on the duration of the idle period below which no power savings due to power gating are possible. Wider switch transistors increase this minimal duration. An RTL-level modeling of sleep transistors for high-level synthesis was proposed in [242]. Generally, the design of the power switching cells includes the following optimization objectives: IR drop, ramp-up time, maximum rush current 5, leakage power, and chip area. The rush current can be as high as 4-10× the average current of the block in active mode, see [258, 94]. This is because all internal capacitances have to be charged in a relatively small amount of time. Furthermore, IR-drop analysis requires the determination of patterns that produce maximum instantaneous currents. The IR drop on the power switching cells introduces additional supply rail noise, especially during the power-off and power-on sequences, and impacts the signal integrity. The supply rail noise can be mitigated by inserting decoupling or bypass capacitors, see [232, 237]. One of the issues with decoupling capacitors is that they also have to be charged and discharged during the power-up and power-down sequences and therefore contribute to the overheads of power gating. In full-custom ASICs, often a single (large) transistor is used as a power switch, see [126].
It is sized, for example, according to the average current method from [207]. In semi-custom ASICs, this single transistor is split into several smaller devices which are designed as a special standard cell (segmented switch insertion, see [117]) and are put together with other special cells like retention registers and always-on buffers into the so-called coarse-grained standard cell library. To increase the efficiency of power gating, it can be combined with body biasing: Super cut-off CMOS (SCCMOS) [159], voltage scaling [25], and other voltage tuning techniques like, for example, boosted gate MOS (BGMOS), see [148]. Since high currents flow through the switch devices, topics like electromigration [281] and NBTI [64] play an important role and should also be taken into account. In [64], the authors explore the leakage/performance/aging optimization scenario and switch transistor sizing for the improvement of aging characteristics.

5 Transient current flowing during the wake-up phase.
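The 1-5% IR-drop budget mentioned above translates directly into an on-resistance constraint for the whole switch network, which in turn drives the switch sizing. A minimal sketch with assumed numbers:

```python
def max_switch_resistance(vdd, i_peak, drop_fraction=0.05):
    """On-resistance budget for the power switch network so that the
    active-mode IR drop stays below drop_fraction * VDD at peak current."""
    return drop_fraction * vdd / i_peak

# Assumed example: 1.2 V supply, 100 mA peak block current, 5% drop budget
r_max = max_switch_resistance(vdd=1.2, i_peak=0.1)
```

Here the entire switch network must present less than 0.6 Ω; splitting this budget over many parallel segmented switches determines their count and width.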
Figure 2.9: Transition to and from a stand-by mode, see [258].

Another aspect is the granularity of the power gating deployment. In semi-custom ASICs, for example, it can be implemented on a cell level (fine-grained) or a block level (coarse-grained). In the case of fine-grained power gating, each standard cell has the switching transistors built in (switch-in-cell). This does not change the placement and routing, but the area overhead is prohibitively large: almost 80-100%. For the coarse-grained scheme, different placement strategies are possible, see Chapter 5. Power Up/Down Sequencing To assure error-free functionality, a certain protocol has to be implemented by the power management circuitry, see Figure 2.8. First of all, to proceed to stand-by, the clock of the power-gated module should be disabled. After that, the isolation has to be activated, the optional state retention signal asserted, and, finally, the power turned off. During the wake-up procedure, the order is reversed. Of course, the transition process itself consumes energy, see Figure 2.9. A break-even point exists for power saving with the help of power gating. For a detailed modeling of power switch transistors under different temperatures and gating types, we refer to [242]. The total energy amount mainly depends on the size of the power switches and the transition time, as the following example shows. Example 2.9 Given a 65 nm technology with 1.2 V supply voltage, a minimum idle time TIdleMin can be computed, below which turning off the current switch will lead to an increase in power consumption:

TIdleMin = (ESleep + EWup − PSleep·(TSleep + TWup)) / (PIdle − PSleep) ≈ (ETr − PSleep·TTr) / PIdle    (2.11)

with PIdle ≫ PSleep, ETr = ESleep + EWup, and TTr = TSleep + TWup. This evaluates to idle times in the microsecond range, that is ≈ 100 clock cycles for a 100 MHz clock frequency with ETr ≈ 10 pJ, TTr ≈ 1 ns, see [258]. Since in [258] the
test circuits were quite small, the minimum idle times for large designs are typically an order of magnitude higher. For the generation of sleep signals, several strategies have been proposed. For instance, static and dynamic sleep signal generation can be distinguished, see [324]. The dynamic sleep signal generators thereby achieve accuracies of up to 90%, compared to 40%-60% for the static scheme. The clock enable signal can also be used for power gating (active-mode power gating, AMPG), see [296]. Extensions have also been proposed in [251]. Like most power-saving techniques, power gating can be applied in a coarse-grained, fine-grained [295, 141], and also ultra fine-grained manner [196]. Also, so-called multi-mode power gating was proposed in [25]. A detailed modeling approach for power gating can be found in [242], and an analysis of the overheads is presented in [126]. By and large, the following main advantages and limitations of power gating can be named:
Advantages:
• Substantial active leakage reduction in big systems-on-a-chip.
• Reduction of standby leakage to virtually zero.
Limitations:
• High impact on the CAD design methodology.
• Influences the active-mode performance.
• Complex electrical implementation.
• Highly complex verification and testing.
• Requires special circuits.
• Requires a standardized description.
• Reliability issues.
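The break-even condition of Equation 2.11 can be evaluated numerically. The values ETr = 10 pJ and TTr = 1 ns follow [258], while the idle and sleep powers are assumptions chosen so that PIdle ≫ PSleep holds:

```python
def t_idle_min(e_tr, t_tr, p_idle, p_sleep):
    """Minimum idle time below which power gating costs more energy than
    it saves (exact form of Equation 2.11)."""
    return (e_tr - p_sleep * t_tr) / (p_idle - p_sleep)

# E_Tr = 10 pJ, T_Tr = 1 ns as in [258]; P_Idle = 10 uW and
# P_Sleep = 0.1 uW are assumed for the sketch
t_min = t_idle_min(e_tr=10e-12, t_tr=1e-9, p_idle=10e-6, p_sleep=0.1e-6)
cycles_at_100mhz = t_min * 100e6
```

With these assumptions, t_min lands in the microsecond range, on the order of 100 clock cycles at 100 MHz, consistent with Example 2.9.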
2.3.2 Low-Power/Power-Aware Test and Verification The application of advanced power management techniques has a profound impact on test and verification. Additional circuit elements have to be tested, which can be quite difficult, as in the case of (segmented) power switches, see [117]. The operational modes of the tested device, that is the different power modes, have to be taken into account. The low-power features can no longer be ignored during the test, since otherwise the power consumption of the chip with all power domains switched on could lead to overheating and permanent damage of the tested device. Additionally, adequate fault models must be deployed to render automatic test pattern generation (ATPG) usable for testing the power-related components as well. Conventional ATPG minimizes the pattern count, not the switching activity. For an in-depth treatment of test-related issues we refer to [114]. In this section, only a broad overview of this topic is given. As with design signals crossing different power and/or voltage domains, the design-for-test (DFT) circuitry like scan chains must also be inserted accordingly to reduce the voltage conversion overhead in case of different voltage islands. The observability of power gating, retention, and isolation signals must be provided. The
power switching network together with the power management components must be tested thoroughly. The power consumption during the test must be controlled and monitored. Especially in the case of burn-in testing this is extremely important, since the chip is exposed to a high supply voltage and operating temperature. The feasibility of a burn-in test can be restricted by the allowed power consumption. The normal test-mode power consumption limit is already at 2× the functional power. The peak power consumption during a scan test can be up to 10× of the functional mode, see [114]. The power consumption during the test has direct implications on the physical chip design, for instance regarding the power rail dimensioning. Due to the pressure to reduce test time, the frequency and/or the number of switched-on modules is increased. This leads directly to a higher switching activity and leakage power.
2.3.3 Discussion With further technology scaling, the effectiveness of power gating will rise as opposed to the body biasing and voltage scaling techniques, see, for instance, [287]. The effectiveness of pure body biasing techniques diminishes approximately 4× per technology generation. Additionally, body biasing cannot be applied to CMOS processes other than bulk and partially depleted SOI. On the other hand, such new processes are required in sub-20 nm technology to reduce the problems of random dopant fluctuations and other sources of process variability. Voltage scaling techniques intrinsically assume an existing voltage reduction headroom as well as a positive performance slack. As the supply voltages continue to decrease, the effectiveness of voltage scaling reduces substantially. Furthermore, such techniques rely heavily on accurate and fast load prediction algorithms. Nevertheless, voltage biasing and scaling can be seamlessly integrated into power gating schemes with multiple gating modes and module coarseness by means of ultra-dynamic voltage scaling with several distinct voltage levels. This allows the leakage reduction capabilities to be traded off efficiently against wake-up latencies, state retention issues, and supply voltage integrity.
2.4 Power Estimation and Modeling Techniques Power optimization and minimization techniques are manifold and can be combined to address the needs of a specific design. Circuit-level parameters like supply voltage, threshold voltage, and delay can be varied to obtain a higher energy and power efficiency. However, these techniques are tightly coupled with higher design abstractions like the architectural and algorithmic levels. Generally, the potential power savings grow with higher abstraction: 10-50% in case of circuit and physical techniques and up to several orders of magnitude for optimizations at the architectural and algorithmic levels. For example, parallelization and pipelining techniques may lead to a substantial reduction in operating frequency and supply voltage since they are able to create a positive performance slack. Furthermore, architecture-level power optimization also enables a broader choice and a higher efficiency of the low-level techniques. To tune a given design for a better power and energy efficiency, there should be a possibility to predict its power consumption at an appropriate level of abstraction. The goal is to enable power modeling during the very early design phases, which can subsequently be used in an automated design space exploration. Over the years, many different power modeling techniques have been developed. For example, without complete and accurate power and timing models of the standard cells, the current semi-custom CAD design automation would not be possible. However, with higher abstraction, different issues arise for building reliable and accurate models. There is an inherent trade-off between the estimation accuracy and the model complexity/feasibility. The most accurate, circuit-level power estimation can only be applied to designs where almost all of the possible design choices have already been made, that is, the design was already implemented down to the transistor level, either based on experience or simply by chance.
With higher design complexities, however, such methodologies become very risky and can lead to non-functional designs and consequently a project failure. Instead, iterative approaches adopted from large-scale agile software development must be used: A gradual refinement of the estimation accuracy during the design and a continuous monitoring of the estimated power consumption. This method leads to several points of power analysis and reduction throughout the design flow. In other words, the granularity of power analysis should be on par with the impact of the design decisions that are being made, see [237].
2.4.1 Different Types of Power Consumption and their Analysis Regarding the terms power consumption and power dissipation, a subtle difference exists, see [229]:
Definition 2.5 (Power Consumption) Power consumption is defined as the amount of energy consumed or withdrawn from a power supply in unit time. Definition 2.6 (Power Dissipation) Power dissipation is defined as the amount of energy dissipated, that is converted into another form like heat, in unit time. For CMOS technology, these two values become the same only if a complementary pair of charging and discharging events is considered. Example 2.10 For a simple CMOS inverter with a switching of the input from logical 1 to logical 0, the energy drawn from the supply rail will be C·VDD²: One half will be dissipated on the on-resistance of the PMOS transistor and the other half stored in the load capacitance. During the opposite switching of the input, no additional power will be drawn from the supply. Instead, the energy stored in the load capacitance will be dissipated on the on-resistance of the NMOS transistor and turned into heat. Therefore, during the charging and discharging events, the amount of dissipated energy will be (1/2)·C·VDD² each. We will use the term power in the sense of Definition 2.5. Several types of power consumption can further be distinguished, see for instance [229]: Instantaneous power :
Power/current measurement value for a given time step, for instance 1 ps (a small interval) or 1 clock cycle (a larger interval). Among others, it determines the power rail noise injection and other power rail properties.
Peak power :
Maximum instantaneous power for a given simulation/measurement time. Among other things, an important quantity for power rail dimensioning and chip testing, see Section 2.3.2.
Time-averaged power :
Defined as the time average of power consumed over a relatively long period, that is hundreds and more clock cycles. Typically in the microseconds/milliseconds range. It is primarily used to calculate battery life and for high-level power estimation.
Root-mean-square power :
Reflects the long-term behavior simultaneously accounting for peak values. It determines the electromigration current limits which depend both on the average current and the peak value. Also applicable for bi-directional currents which would otherwise show a very small average.
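The difference between the time-averaged and the root-mean-square power matters for bi-directional currents, as noted above. A quick numerical illustration (the current waveform is assumed for the example):

```python
import math

def avg_and_rms(samples):
    """Time average and root-mean-square value of a sampled waveform."""
    avg = sum(samples) / len(samples)
    rms = math.sqrt(sum(x * x for x in samples) / len(samples))
    return avg, rms

# Bi-directional current alternating between +10 mA and -10 mA
i_avg, i_rms = avg_and_rms([0.010, -0.010] * 50)
```

The average evaluates to 0 A while the RMS value stays at 10 mA, which is why electromigration current limits are based on RMS rather than average values.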
The power consumption of a circuit can be analyzed in several ways. At the lower abstraction levels, like transistor, gate, and switch level, mostly simulation-based approaches are used. The ultimate goal is to verify the power consumption of the given design. Simulations at the low abstraction levels are able to provide the types of power estimates discussed above with a high accuracy. Regarding simulation, the following types can be distinguished:

Power emulation:
Hereby, hardware prototyping platforms are used for power estimation. Special "power estimation hardware" for the collection of switching activity is added to the original design. Accelerations of several orders of magnitude as compared to event-based simulations are possible, see [82, 83].
Event-based / Dynamic simulation:
Also called vector-driven analysis or pattern-dependent analysis. A simulation tool is used to obtain the information about the dynamic switching events throughout the circuit. This approach yields accurate power and timing data for the given input stimuli set.
Probabilistic / Static simulation:
Also called vectorless analysis or (input) pattern-independent analysis. If the prospective input stimuli are not known, probabilistic properties of the signals can be used instead. Such a probabilistic quantity is, for example, the probability of a signal to have the logic value "high" or the probability of a sequential circuit to be in a certain state. Given such data, for each net the likelihood of switching is estimated and the corresponding contribution to the overall power consumption is derived.
Statistical simulation:
Using dynamic analysis, the circuit is simulated with a set of different randomly generated inputs, and the power consumption is monitored. Additionally, to reduce the number of required stimuli, special statistical techniques, like sampling and others, are deployed. Depending on the statistical techniques used, parametric versions, like [57], and nonparametric versions, like [326], are further distinguished.
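The vectorless idea can be illustrated with a strongly simplified zero-delay model that ignores spatial and temporal signal correlation; the signal probabilities, capacitances, and function names below are illustrative:

```python
def toggle_probability(p_high):
    # For a temporally independent signal that is '1' with probability p,
    # a transition (0->1 or 1->0) occurs with probability 2*p*(1-p) per cycle.
    return 2.0 * p_high * (1.0 - p_high)

def vectorless_dynamic_power(nets, vdd, f_clk):
    """nets: list of (signal probability, net capacitance in farads).
    Dynamic power: sum over nets of alpha_i * C_i * VDD^2 * f_clk."""
    return sum(toggle_probability(p) * c for p, c in nets) * vdd ** 2 * f_clk

# One net with 1 fF capacitance and signal probability 0.5 at 1 V, 1 GHz.
p_dyn = vectorless_dynamic_power([(0.5, 1e-15)], vdd=1.0, f_clk=1e9)
```

A signal probability of 0.5 maximizes the toggle probability, which is why uncorrelated random data represents a pessimistic but common default assumption.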
It should be mentioned that the instantaneous currents predicted by circuit-level simulations generally cannot be observed and validated by in-field chip testing, for several reasons. As described in [330], the parasitic capacitance and inductance of a ready-to-use packaged integrated circuit make such measurements impossible in principle. The inductance and capacitance of the wire bonds and the power rails form an oscillating loop which acts as a low-pass filter. Thus, the instantaneous current behavior becomes unobservable from outside of the chip.

Example 2.11 In a high-performance processor running at 1 GHz clock frequency, the corresponding parameters are as follows [234]: R = 500 μΩ, L = 0.005 nH, C = 500 nF. This yields a resonance frequency of 100 MHz. That means that the dynamic current/power behavior for time intervals under 10 ns, or 10 clock cycles, cannot be measured from outside of the chip.

During probabilistic or static power analysis, only one run through the circuit is made. This is comparable to the idea of static timing analysis in the sense that no actual simulation has to be performed. In statistical power analysis, by contrast, the number of simulations to perform equals the number of different stimuli generated. Depending on the abstraction level and the accuracy constraints, this can be very time-consuming and hence not feasible. Moreover, dynamic simulation is strongly input pattern dependent and cannot provide reliable worst-case power estimates unless an exhaustive simulation of all possible signal combinations is performed. Probabilistic power analysis, in turn, generally cannot account for circuit-level phenomena like the slew rates of signals and other analog circuit behavior: the probabilistic measures a priori reflect the properties of discrete signals averaged over time.
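The resonance frequency stated in Example 2.11 follows directly from f_res = 1/(2π·sqrt(L·C)); the short check below uses the parameter values from the example (variable names are illustrative):

```python
import math

# Package parasitics from Example 2.11, converted to SI units
L = 0.005e-9   # 0.005 nH
C = 500e-9     # 500 nF

# Resonance frequency of the bond-wire/power-rail loop
f_res = 1.0 / (2.0 * math.pi * math.sqrt(L * C))   # about 100 MHz
```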
The general probabilistic model assumes continuous time (real-delay circuit model, inertial-delay circuit model), and the simplified versions assume discrete time steps measured in clock cycles (zero-delay circuit model). On the other hand, the probabilistic technique can achieve a very high circuit coverage. That means it can assure that a
high number of input-to-output paths are really exercised. For event-based simulation approaches, this is normally not the case and can only be achieved with a very high number of simulations. This is done, for example, in statistical power analysis. However, without any acceleration techniques, statistical power analysis is only useful for very small circuits. To accelerate statistical power analysis, one can either shorten the input sequences appropriately, that is, without large accuracy penalties, and/or reduce the number of input patterns that have to be simulated. This is where different statistical techniques come into play: by statistical sampling, a small subset of the whole input stimuli space can be used to estimate the power consumption without a large accuracy loss. One prominent example of statistical power estimation is Monte Carlo simulation, see [57]. Another example of statistical sampling is found in [81]. In Monte Carlo simulation, the initial assumption is that the sample power values are normally distributed, hence the classification as a parametric technique. Based on this, the probability that the estimated average power will not exceed predefined ranges can be computed. An input model based on a Markov process is used to generate the input stimuli. A new input sequence is generated iteratively and a simulation is performed. Subsequently, the mean and variance of the power consumption are calculated and the convergence criterion is checked. Furthermore, confidence intervals can be given, which is not possible with nonparametric techniques. The main shortcoming of Monte Carlo simulation is the initial normality assumption for the power sample distribution. If this assumption fails, for instance for bi-modal and multi-modal distributions or distributions with asymmetric tails, the estimation errors can be very high. Therefore, modifications were proposed in [96], including the nonparametric sampling in [326].
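The iterate-simulate-check loop of Monte Carlo power estimation can be sketched as follows; `simulate_power` stands for one lower-level simulation run with freshly generated stimuli, and all names and parameter values are illustrative, not taken from [57]:

```python
import math
import random

def monte_carlo_power(simulate_power, rel_error=0.05, z=1.96,
                      min_runs=30, max_runs=10000):
    """Iteratively simulate random stimuli until the normal-approximation
    confidence interval of the mean power is within rel_error of the mean."""
    samples = []
    while len(samples) < max_runs:
        samples.append(simulate_power())      # one run with fresh random stimuli
        n = len(samples)
        if n < min_runs:
            continue
        mean = sum(samples) / n
        var = sum((x - mean) ** 2 for x in samples) / (n - 1)
        half_width = z * math.sqrt(var / n)   # parametric (normality) assumption
        if half_width <= rel_error * mean:
            return mean, n                    # converged
    return sum(samples) / len(samples), len(samples)

# Stand-in for a gate-level simulator: power samples ~ N(10 mW, 1 mW).
random.seed(0)
mean_p, runs = monte_carlo_power(lambda: random.gauss(10.0, 1.0))
```

If the samples are not normally distributed, the stopping criterion above becomes unreliable, which is exactly the shortcoming discussed in the text.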
An alternative technique to accelerate statistical simulation is to shorten the input sequences without compromising the estimation accuracy. To this end, different sequence compaction techniques were proposed, see [190, 191, 194]. Input compaction techniques are of interest for probabilistic power analysis as well.
2.4.2 Hardware Abstraction Layers

Depending on the current design stage, estimates of different power types become available: during the architectural and system-level design, at most time-averaged estimates exist. As the implementation advances, more detailed simulations and estimations become possible, finally ending with the in-field current/power measurements after chip fabrication. Accordingly, different power modeling and circuit simulation approaches exist, depending on the abstraction level.
Transistor Level

Power modeling at the transistor level is accomplished with the help of differential equations from solid-state physics. For this purpose, the standardized Berkeley short-channel IGFET model (BSIM) transistor models were developed, see [255]. This kind of modeling is used, for instance, for standard cell library characterization and relies heavily on analog simulation. During the cell characterization, only time-continuous signals are used to exercise all input-to-output paths exhaustively. Thereby, current waveforms on the power and ground pins are monitored and subsequently stored in an appropriate format. The instantaneous power is obtained by multiplying the current with the supply voltage. From these data, average and peak power values are derived. The final gate-level power models contain the dynamic and steady-state behavior of the respective standard cells. Although transistor-level modeling provides the highest possible level of accuracy, its biggest disadvantage is the lack of scalability. The number of simulations grows exponentially with the number of inputs and outputs. Hence, its use is limited either to very small circuits or to a very small set of applied stimulus vectors, or both.

Switch Level

To alleviate the scalability problems of circuit-level modeling, one approach is to simplify the analog behavior of a transistor to a simple switch, see [55]. A switch can be either open or closed and has corresponding capacitive and resistive characteristics. With this technique, larger circuits can be simulated with more representative input signal sequences. Compared to circuit-level modeling, switch-level modeling is accurate in estimating dynamic power but has limitations regarding the leakage power. The abstraction level still remains very low.
Gate/Logic Level

At the gate level, the properties of physical devices are further abstracted to logical gates or cells which charge and discharge load capacitances under certain timing constraints. The corresponding acceleration compared to a transistor-level simulation is about three to four orders of magnitude. Representation and modeling standards for semi-custom standard cell libraries at the gate level are the composite current source model (CCSM), see [270, 269], and the effective current source model (ECSM), see [61]. The CCS model has two components: a driver and a receiver model. A CCS driver model determines the current flow through a fixed load capacitance at a fixed value of the input slew. A receiver model determines the dependence of the effective input capacitance on the input signal slew and the output capacitance.
The main distinction of ECSM compared to CCSM is that ECSM stores the values of the output voltage, whereas CCSM stores the values of the output current. Theoretically, both models are equivalent, since the following relation between output voltage and output current holds:

I_OUT(t) = C_OUT · dV_OUT(t)/dt.  (2.12)
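Relation (2.12) also shows how a stored ECSM voltage waveform can be converted numerically into the corresponding CCSM output current; the following sketch uses a simple finite-difference approximation, and all names and values are illustrative:

```python
def ecsm_to_ccsm(v_out, c_out, dt):
    """Differentiate a sampled ECSM output-voltage waveform to recover the
    CCSM output current: I_OUT(t) = C_OUT * dV_OUT/dt (Eq. 2.12)."""
    return [c_out * (v_out[k + 1] - v_out[k]) / dt
            for k in range(len(v_out) - 1)]

# A linear 0 V -> 1 V ramp over 1 ns into a 10 fF load yields a constant 10 uA.
i_out = ecsm_to_ccsm([0.0, 0.25, 0.5, 0.75, 1.0], c_out=10e-15, dt=0.25e-9)
```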
Additionally, the V_OUT values are normalized, that is, multiplied by the factor 1/V_DD. Thus, the ECSM model can be used in multi-voltage designs without modifications. Generally, gate-level modeling is used during the design verification stage, and not during the design stage itself. Therefore, more stress is put on efficient power simulation/analysis than on power modeling. The same applies to the circuit level. The final goal is to verify the power consumption, see [181]. The scalability is still restricted to ≈ 10^6 cells, that is, small to mid-range designs.

Register-Transfer/Micro-Architectural Level

The next abstraction step leads to the register-transfer or micro-architectural level, where functional macro-blocks like adders, multipliers, register files, on-chip memories, and on-chip buses are considered. This is also the most widely used representation form for digital designs and is fully supported by all design automation tools, due to the industry-wide adoption of hardware description languages for both design and simulation of digital and mixed-signal circuits. From the design perspective, at this step it is important to obtain reliable estimates of the average and worst-case average power consumption to be able to optimize for power and energy efficiency. Contrary to the gate-level and circuit-level design stages, the instantaneous and peak power values are only of minor interest here, since the long-term behavior of the circuit is important to predict, for example, the standby and operational time of a portable device. Furthermore, the average power dissipation directly influences the chip temperature which, in turn, affects the design reliability. Hence, at this abstraction level the focus is put more on efficient power modeling, as compared to efficient power simulation/analysis on the lower levels.
Consequently, the relative accuracy of power models is more important than their absolute accuracy, since the goal is to explore different alternative implementations and to quickly assess their feasibility, obviating the need to implement every possible architectural choice down to the transistor level. For power estimation at the micro-architectural level, mostly empirical models or macro-models are used.
Methods for Model Construction

Regarding the construction procedure of a power model, there exist two major techniques: top-down or analytical models, and bottom-up or empirical models. Empirical models are also referred to as macro-models, see [181, 235]. Both will be discussed in detail in the next sections.
2.4.3 Analytical Power Models

Analytical models are simulation-free. They can also be categorized as behavioral-level models. They rely, for instance, on information-theoretical measures such as entropy, see for instance [182]. The hardware circuit is modeled as a set of Boolean functions. The switching activity can be predicted using either entropy or informational energy averages. Hence, entropy- and Hamming-distance-based models take the switching activity of the circuit into account, like in [189]. Such models are classified as activity-based or switched capacitance models. However, early switched capacitance models did not account for leakage power consumption. An extension for the modeling of leakage power was proposed in [103]. Other approaches use ordered binary decision diagrams (OBDDs) to model the underlying hardware circuit, see [38]. Alternatively, the logical complexity [230] of the given implementation can be taken to estimate the power, see [205]. These models are therefore complexity-based.

Advantages:
• Small number of model parameters.

Limitations:

• Limited accuracy for irregularly structured circuits.
• Models are highly dependent on the type of the block modeled, that is, adder, multiplier, bus, cache memory.
• In-depth structural information is required, for instance for analytical memory models, which cannot be extended to other architectures easily, see [248].
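A minimal sketch of the Hamming-distance-based activity estimation mentioned above; the function names and the per-bit capacitance are illustrative:

```python
def hamming_distance(a, b):
    # Number of bit positions in which two input vectors differ.
    return bin(a ^ b).count("1")

def switched_capacitance(trace, c_per_bit):
    """Activity-based model: the capacitance switched between consecutive
    input vectors is taken proportional to their Hamming distance."""
    return sum(hamming_distance(x, y)
               for x, y in zip(trace, trace[1:])) * c_per_bit

# 4-bit input trace; 4 bits toggle, then 1 bit toggles: 5 bit-flips in total.
c_sw = switched_capacitance([0b0000, 0b1111, 0b1110], c_per_bit=1e-15)
```

Multiplying the switched capacitance by V_DD² and the clock frequency then yields a dynamic power estimate, in the spirit of the switched capacitance models cited above.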
2.4.4 Empirical Power Models

During empirical modeling or macro-modeling, fully implemented designs are simulated at lower abstraction levels, for instance gate or circuit level, to obtain reliable estimates of area and power consumption under different input data activity and, possibly, process and environmental conditions. Macro-modeling was introduced in [235]. In principle, the accuracy level is only constrained by the modeling accuracy of the respective simulation tools. The beauty of empirical models is that they are applicable to virtually any kind of hardware design and to both combinational and sequential logic. Furthermore, the modeling procedure itself can be performed in an automated fashion. Certainly, different kinds of empirical power models exist. They can be classified based on different criteria [248], for example the
1. representation strategy, the
2. learning or creation strategy, and the
3. application strategy/model evaluation.
Model Representation

Regarding the macro-model representation, the following variants can be named:

• Equation-based or continuous function models. The model consists of one or several equations, derived with the help of
  – statistical regression techniques,
    ∗ parametric [45, 154] and
    ∗ nonparametric [156, 177],
  or based on the principle of
  – power sensitivity [77] combined with statistical modeling [76, 142].
  The main advantages are the small storage requirement, since only the regression coefficients have to be stored, as well as the independence of the modeling accuracy from the (storage) size of the model.

• Look-up table based models, see [122] and its later extensions [124, 46]. Such models require some storage space, which mainly depends on the modeling accuracy.

• Hybrid [318, 29, 123, 180] and multi-model approaches, like [167, 35, 233], which try to combine the advantages of the first two techniques depending on the application scenario to increase the general modeling accuracy and evaluation speed.

• Approaches based on neural networks, see [65, 236]. The main advantage of neural networks is their ability to model highly nonlinear statistical behavior better than polynomial-based regression approaches, see [236]. Another advantage is their ability to also estimate the worst-case power consumption.

Model Creation

For the learning/creation strategy, the following variants can be distinguished:

• Pattern-dependent, as for instance [45, 138, 154], and
• pattern-independent (probabilistic) [227, 212, 213, 42, 220].
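To illustrate the difference between the first two representation variants, the sketch below contrasts a minimal equation-based model (a least-squares line fitted to characterization data) with a look-up-table model using nearest-neighbor access; all names and numbers are illustrative:

```python
def fit_linear(xs, ys):
    """Least-squares fit P = a*activity + b: a tiny equation-based macro-model.
    Only the two coefficients need to be stored."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

def lut_lookup(table, activity):
    # Table-based macro-model: return the power stored for the nearest
    # characterized activity point (no interpolation in this sketch).
    return min(table.items(), key=lambda kv: abs(kv[0] - activity))[1]

# Characterization data: power (mW) at three switching-activity points.
a, b = fit_linear([0.1, 0.2, 0.3], [1.0, 2.0, 3.0])
p_reg = a * 0.24 + b                                    # interpolates smoothly
p_lut = lut_lookup({0.1: 1.0, 0.2: 2.0, 0.3: 3.0}, 0.24)  # snaps to a stored point
```

The regression model interpolates between characterization points at the cost of a fixed functional form, while the table-based model reproduces the stored points exactly and can be refined by simply adding entries.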
Model Evaluation

The application/evaluation strategy basically depends on the model representation. For equation-based models, it is the mathematical evaluation of the regression equation(s). In the case of look-up tables, a read access to the storage of the empirical data is performed. For hybrid models, an additional (high-level) simulation may be necessary for model evaluation. Furthermore, regarding the model evaluation, we distinguish between the

• cycle-accurate and
• cumulative (time-averaged) techniques.

In the case of cycle-accurate estimation, the power consumption of a design is obtained by a summation over a number of clock cycles j of the power consumption P_ij of the subcircuits i, that is,

P_Total^cycle-wise = ∑_{∀j} P_j^cycle = ∑_{∀j} ∑_{∀i} P_ij.  (2.13)
The advantage of the cycle-accurate approach is the ability to resolve the power consumption over time. Such a detailed picture is needed especially during the design verification stage and the physical circuit design, for tasks like reliability analysis, noise analysis, and power rail analysis. Naturally, for a cycle-accurate estimation, the macro-models have to be (re-)evaluated cycle-wise. This means that this procedure generally requires a higher evaluation overhead than the estimation of average power, where the estimation is performed for one long time period at once:

P_Total^cumulative = ∑_{∀i} P_i.  (2.14)
Of course, time-averaged estimates can also be obtained from the cycle-accurate values. The time-averaged techniques are described in Chapter 4 in more detail.
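As a small illustration of the two evaluation strategies of Equations (2.13) and (2.14), the following sketch contrasts a cycle-wise summation with a single cumulative evaluation; function and variable names are illustrative:

```python
def cycle_accurate_power(p):
    """p[j][i]: power of subcircuit i in clock cycle j (Eq. 2.13).
    Returns the per-cycle powers and their total."""
    per_cycle = [sum(cycle) for cycle in p]   # P_j^cycle = sum_i P_ij
    return per_cycle, sum(per_cycle)          # resolved over time, then totaled

def cumulative_power(avg_per_subcircuit):
    # One evaluation per subcircuit over the whole period (Eq. 2.14).
    return sum(avg_per_subcircuit)

# Two subcircuits over two cycles.
trace = [[1.0, 2.0],
         [3.0, 4.0]]
per_cycle, total = cycle_accurate_power(trace)
avg_est = cumulative_power([2.0, 3.0])   # per-subcircuit averages of the same trace
```

The cycle-accurate path evaluates the macro-models once per cycle, whereas the cumulative path evaluates each subcircuit model exactly once, which is the overhead difference discussed above.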
2.4.5 Discussion

The most important criteria which should be considered during the choice of a macro-modeling technique are the following:
1. Generality of the model,
2. flexibility (different accuracies and later extensibility), and
3. automation of model construction.

Analytical models are always module-specific. Each different data path unit, like adder, shifter, multiplier, and others, requires its own analytical model. Furthermore, analytical models mostly imply a functional meaning, that is, they can hardly account for different circuit implementation alternatives like, for example,
carry-save and Booth multipliers. For irregularly structured circuits, the accuracy of analytical models is limited. The construction of an analytical model requires massive human interaction, and automation is very difficult. Empirical models, on the other hand, provide the largest possible generality. They can be extended at an arbitrary point in time to an arbitrary level of accuracy. However, the possibility of automatic model construction is probably their biggest advantage. Regarding the representation of an empirical model, a summary is given in Table 2.4. The main choice falls between a statistical regression and a table-based (or LUT) approach. Binary decision diagrams can also be used to store the empirical data in a tree-like structure, as described in [199]. However, they have scalability and extensibility problems. Although models based on neural networks, see [65, 236], provide good accuracies and are quite robust regarding out-of-sample accuracy issues⁶, they are difficult to deploy and extend. The main advantages of the table-based model representation over statistical regression are its extensibility and automatic generation. Furthermore, circuit area modeling can be included into the table-based power models "for free". This refers not only to the model construction but also to the model evaluation. The table-based approach will be described in Chapter 4 in more detail.

Table 2.4: Comparison between different empirical power model representations (see also [248]): look-up tables (LUT), binary decision diagrams (BDD), neural networks, and statistical regression are rated with respect to compactness, analytical form, accuracy, evaluation speed, generation complexity, automation, scalability, extensibility, robustness, and combined area modeling.

⁶ Out-of-sample accuracy means that the data used during the model evaluation exhibits different statistical properties as compared to the data used during the model creation/characterization phase.
3 A Parameterizable TCPA Architecture: Weakly Programmable Processor Arrays

This chapter presents the architecture of Weakly Programmable Processor Arrays from the micro-architectural point of view. Contrary to research on standard multi-core SoCs, which most often exploit only task-level parallelism, the applications relevant for our work typically require tightly coupled inter-processor communication in order to exploit multiple levels of parallelism: array-level parallelism (loop level), instruction-level parallelism (VLIW), sub-word parallelism (data level), and functional and software pipelining. The efficiency and flexibility of the WPPA architectures introduced here, belonging to the class of Tightly-Coupled Processor Arrays (TCPAs), are obtained through the concept of weak programmability of the processing elements (WPPEs) as well as of their interconnect. Hence, WPPAs combine the flexibility of CGRAs with the programmability of many-cores and the efficiency of dedicated, algorithm-specific systolic arrays. Another distinguishing property of the WPPAs developed here is that they form an architectural template rather than a fixed CGRA architecture, like many others mentioned in the introduction. This enables a flexible adaptation of hardware resources to the prospective set of applications as well as an automatic design space exploration. Our research results on WPPEs, instruction set definition, and the modeling of versatile reprogrammable interconnect architectures will be summarized next.
3.1 Architecture of a Weakly Programmable Processing Element

Our target architectures consist of arrays of processing elements which are weakly programmable (WPPEs). Each WPPE has a VLIW (very long instruction word) structure, see Figure 3.1. The term weakly programmable characterizes the property that only a limited amount of instruction memory in each processing element is available and the control overhead is kept as small as possible for the given application domain of algorithms. For example, there is typically no support for interrupts and exceptions. The instruction memory contains very long instruction word programs. Every single WPPE can be parameterized at synthesis time with respect to the number and types of functional units, like adders/subtractors, multipliers, shifters, and modules for logical operations. Special functional units such as multiply-accumulate (MAC) units, barrel shifters, etc. can be specified and added to the architecture template. A parameterizable register file for data is provided to store intermediate computational results. For control signals, a similar parameterizable control register file is used. Furthermore, special storage elements at the inputs of each WPPE have been proposed to store incoming data. These are FIFO (first in, first out) buffers. A micro-architectural overview of a WPPE / WPPA is given in Figure 3.1, see also [3].

Figure 3.1: Example of a weakly programmable processor array (WPPA). All parameters such as the number and type of PEs as well as their interconnect structure can be defined at synthesis time according to domain-specific needs. The figure shows an array of PEs with I/O at the borders, each PE encapsulated in an interconnect wrapper, and the internal structure of a WPPE with input registers (regI), general-purpose registers (regGP), flag registers (regFlags), instruction memory, instruction decoder, functional units (FU), a branch unit (BUnit) with program counter (pc), and output registers (regO).
3.1.1 Instruction Set Architecture

The instruction set architecture of a single WPPE may be chosen according to domain-specific computational needs and thus be minimized accordingly. Since our architecture primarily addresses embedded systems-on-a-chip for computation-intensive algorithms typically specified by nested-loop programs, the instruction set needs to be optimized individually for each application. For each WPPE, the instruction set is parameterizable at synthesis time: the widths of the different instruction fields, like operation code and register addresses, as well as the width of the instruction itself, can be chosen to suit different application needs. In general, five instruction types with corresponding semantics are distinguished [3]:
Code                     Operation
ADD   dest op1 op2       dest := op1 + op2
ADDI  dest op1 im        dest := op1 + im
MUL   dest op1 op2       dest := op1 ∗ op2
MULI  dest op1 im        dest := op1 ∗ im
ANDI  dest op1 im        dest := op1 ∧ im
SHLI  dest op1 im        dest := op1 << im
CONST dest const         dest := const

Table 3.1: An example subset of a WPPA instruction set.
Register instructions: Write the result of an operation with two operands from the operand registers to the given destination register.

Immediate instructions: Similar semantics as register instructions, except that the second operand is coded in the instruction itself.

Constant instructions: Used to load a specific constant into a register.

Move instructions: For the transfer of data between different storage elements.

Branch instructions: For conditional and unconditional branching.

In Tables A.1 and A.2 of the Appendix, two example VLIW programs for two different processing elements are given for an image processing algorithm (edge detection). Table A.4 gives another software example for a digital signal processing algorithm (fast Fourier transform). Finally, in Table A.3 a software program is shown which serves as a switching load generator without any computational meaning (synthetic load). It is used for power profiling purposes. These programs are to be executed on a 3 × 8 WPPA shown in Figure 3.5. The corresponding WPPA compiler works on the class of so-called dynamic piecewise linear algorithms (DPLAs), see [129]. This class of algorithms describes loop nests with a potential number of run-time conditions as a set of recurrence equations. Different transformations, like hierarchical partitioning of the iteration space, are applied to match the size of the application to the size of the physical array, or to optimize the local memory by data reuse within a partition. Chapter 6 contains additional examples of the target applications.
3.1.2 Dynamically Reprogrammable Interconnect Modules

In multi-processor hardware architectures, inter-processor communication plays a very important role. Flexible interconnect structures have been investigated through the definition of a novel concept of an interconnect wrapper module belonging to each
WPPE, see the example in Figure 3.5. It is also used to describe the capabilities of switching networks. Different topologies between the single WPPEs in a weakly programmable processor array, like torus, 4-D hypercube, and others, can be implemented and optimized statically, but also changed dynamically, see [4]. To allow modeling of a vast set of different interconnect topologies, an adjacency matrix has been proposed to specify each interconnect wrapper in the array at synthesis time. The structure of the adjacency matrix is shown by example in Figure 3.3(a). All input and output signals in the four directions N (north), E (east), S (south), and W (west) of the interconnect wrapper, as well as the input signals P_in and output signals P_out of the encapsulated WPPE, are organized in matrix form. The input signals of the interconnect wrapper and the output signals of the encapsulated WPPE build the rows of the corresponding adjacency matrix. The output signals of the interconnect wrapper and the input signals of the encapsulated WPPE build the columns of the adjacency matrix. In Figure 3.3(a), two input and two output signals are assumed in each direction of the interconnect wrapper, and two input and two output ports in the encapsulated WPPE. If an arbitrary input signal i and output signal j have to be connected, the variable c_ij in Figure 3.3(a) is set to 1. Otherwise, it is set to 0. If multiple input signals are allowed to drive a single output signal, a multiplexer with an appropriate number of input signals is generated. The inputs of this multiplexer are connected to the corresponding input signals and its output to the corresponding output signal. The select signals for such generated multiplexers are stored in interconnect control registers and can therefore be changed dynamically: either by a WPPE itself or by a global reconfiguration.
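The multiplexer generation rule above can be mimicked in software as a sketch; `route`, its signature, and the signal names are illustrative and not part of the actual WPPA tool flow. Each output j may be driven by any input i with c_ij = 1, and the value of its interconnect control register selects the actual driver:

```python
def route(adjacency, selects, inputs):
    """Evaluate one interconnect wrapper configuration.
    adjacency[i][j] = 1 allows input i to drive output j;
    selects[j] is the control-register value of output j's multiplexer."""
    outputs = []
    for j, sel in enumerate(selects):
        # Allowed drivers of output j: rows with a 1 in column j.
        drivers = [i for i in range(len(adjacency)) if adjacency[i][j]]
        outputs.append(inputs[drivers[sel]])   # multiplexer selection
    return outputs

# Two inputs, two outputs: output 0 may be driven by either input,
# output 1 only by input 1 (the WPPE output in this toy example).
adj = [[1, 0],
       [1, 1]]
out = route(adj, selects=[0, 0], inputs=["north_in0", "wppe_out0"])
```

Changing `selects` at run-time corresponds to rewriting the interconnect control registers, which is exactly how different topologies are switched dynamically.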
By changing the values of the control registers in an interconnect wrapper component, different topologies can be implemented and even be changed at run-time. The number and width of the input and output signals of the interconnect wrapper component may differ for each WPPE. These parameters have to be configured at synthesis time. The WPPAs do not utilize a network-on-chip (NoC) approach for inter-processor communication and deploy a circuit-switched interconnection scheme instead. The rationale behind this is the following: in modern CMOS technologies, wires, that is, bandwidth, are relatively cheap, whereas in complex NoCs the routers constitute a large area and power overhead. According to an industrial study by Intel, 18% of the power is consumed by the link itself and 82% is consumed by the router. Therefore, the circuit-switched approach is attractive especially for high-data-rate interconnects, see [237].
3.2 Weakly Programmable Processor Arrays

The interconnect wrapper components of each weakly programmable processing element are connected in a regular two-dimensional grid topology to form a weakly
programmable processor array, see Figure 3.1. Two levels of connections can be specified: regular static connections between the single interconnect wrapper components build the first, meshed level; the second, dynamic level is the current interconnect topology, which is defined by the values of the interconnect control registers in each interconnect wrapper. By means of the first, static level of interconnect signals with a regular layout, the problems of placement and routing of WPPEs in an array are considerably reduced, see [130]. With the help of the second level of interconnect, topologies different from a mesh can be specified and changed dynamically.

Figure 3.2: Reconfiguration in weakly programmable processor arrays. The figure shows external configuration requests entering the configuration manager, which accesses the configuration memory (with a look-up table) and steers the configuration controller; a global configuration bus connects to the configuration loader of each WPPE, which writes the VLIW instruction memory, the register files, and the multiplexer control registers of the surrounding ICN wrapper with its north, east, south, and west inputs and outputs.
3.2.1 Adaptive Reprogramming

To achieve dynamic reprogramming of VLIW programs and to adapt the interconnect structure of each WPPE to the currently processed algorithm, each WPPE possesses a small component called configuration loader, see Figure 3.2. This loader is controlled by a finite state machine. Additionally, there is a global reconfiguration controller and a global reconfiguration memory for the storage of different programs and interconnect schemes at the array level. The global reconfiguration controller is connected to the reconfiguration manager, which reacts to external reconfiguration requests, locates the corresponding program in the configuration memory with the help of a small look-up address table, and configures the reconfiguration controller. The external requests are supposed to come, for example, from an embedded processor core in an SoC design.
A Parameterizable TCPA Architecture: Weakly Programmable Processor Arrays
In a programming or reprogramming phase, the global reconfiguration controller reads the current data from the global configuration memory and puts it on a global configuration bus, see Figure 3.2. The width of the global configuration memory and the data bus is also parameterizable at synthesis time. It may differ from the width of the local VLIW instruction memories and the width of the interconnect control registers in the single WPPEs and interconnect wrapper components. The local configuration loaders in the WPPEs that have to be reprogrammed run synchronously with the global configuration controller and write the received data words to the local storage elements.

To achieve scalability of weakly programmable processor arrays, a multicast scheme has also been provided. Since a global bus is used for (re)programming of WPPEs, it is important to also be able to program and reprogram parts of a large processor array, for example an array of size 10 × 10 with 100 processing elements or more. With the help of multicast bits in special registers, see Figure 3.3(b), it is possible to address single processing elements as well as groups of processing elements in an array. This scheme is explained in detail in the following.

Regarding the choice of PEs for reconfiguration, currently a static scheme is used. That means that the programmer is responsible for the assignment of PEs to a given application. This information is then statically captured in a configuration data stream. To reduce the idle power consumption of a single processor array instance used for different applications, a special "shut-off" configuration type can be used to switch off an arbitrary group of PEs in case the number or the spatial location of the PEs allocated to a subsequent application changes, see Section 5.4 for details.
3.2.2 Multicast Reprogramming Scheme

A multicast scheme called RoMultiC (row multicast configuration) was chosen [292, 214, 293] and adapted to the case of WPPAs [3] for processor array (re)programming. There is a corresponding multicast bit for each row and each column in a two-dimensional processor array. Each interconnect wrapper in the processor array is connected to one horizontal and one vertical multicast bit line, see Figure 3.3(b). If a multicast bit is set, the corresponding row or column in the array is selected. Afterwards, only those WPPEs will be programmed which have both the vertical and the horizontal multicast bit line set. The single multicast bits are grouped into two multicast registers: a vertical mask register for the columns of the array and a horizontal mask register for the rows of the array. Both mask registers are located in the global configuration controller and are accessible from the configuration manager, see Figure 3.2. For the tuple of the two mask register values, we use the term multicast signature. A multicast signature uniquely identifies a group of weakly programmable processing elements. By means of this multicast scheme, partial reprogramming of the array becomes
[Figure 3.3(a) shows an adjacency matrix example for the interconnect wrapper: an entry c_ij = 1 if a connection from input i to output j is possible, and c_ij = 0 otherwise. Figure 3.3(b) illustrates the multicast reprogramming scheme with a horizontal mask register (H) for the rows and a vertical mask register (V) for the columns, e.g., the multicast signatures H: 1 1 0, V: 0 1 1 and H: 0 0 1, V: 1 1 1.]
Figure 3.3: Adjacency matrix example and multicast reprogramming scheme of the WPPEs and their interconnect wrapper modules.

possible. To exploit the advantages of this reprogramming scheme, one could partition a given processor array, for instance, into two rectangular domains, each running a different application. Then, by means of partial reprogramming, the first domain could be adapted to a new algorithm while an application is still running on the second.
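The row/column selection rule of the multicast scheme can be sketched in a few lines of Python (a hypothetical helper for illustration only, not part of the WPPA implementation; mask bits are listed here from row/column 0 upwards):

```python
def selected_pes(h_mask, v_mask):
    """Return the (row, column) pairs addressed by a multicast signature.

    A PE is (re)programmed only if both its horizontal (row) and its
    vertical (column) multicast bit are set.
    """
    return {(r, c)
            for r, h_bit in enumerate(h_mask)
            for c, v_bit in enumerate(v_mask)
            if h_bit and v_bit}

# Example signature for a 3 x 3 array: rows 0-1 and columns 1-2 selected.
print(sorted(selected_pes([1, 1, 0], [0, 1, 1])))
# [(0, 1), (0, 2), (1, 1), (1, 2)]
```

A signature of all ones addresses the whole array; disjoint rectangular domains, as in the partitioning example above, correspond to disjoint signatures.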
3.2.3 Reprogramming Speed

Realistic programs, such as different digital filter and edge detection algorithms, have been implemented with tiny programs of three VLIW instructions each on a 3 × 2 WPPA. A case study showed that, in the worst case, 3 μs are needed to program a 4 × 4 WPPA for a filtering algorithm, that is, to load the programs into each WPPE and to configure the interconnect structure (assuming a 32-bit wide configuration bus running at a frequency of 100 MHz for loading four VLIW instructions, each 79 bits wide), see [4]. In the case of fine-grained architectures such as FPGAs, this is approximately the time needed to configure a single look-up table (Xilinx Virtex-II). This shows that a WPPA can be dynamically reprogrammed at extremely high speed.

3.2.4 Graphical Architecture Entry

For architecture and software development, a graphical tool called WPPA Editor is used, see [175]. A screenshot is given in Figure 3.4. The WPPA architecture can either be edited graphically or loaded from a MAML (machine markup language)
Figure 3.4: Different graphical framework views of the WPPA Editor tool, see [175].

textual description, see [15]. Different software and hardware configuration panels as well as distinct interconnect views are provided. The same tool can also be used for cycle-accurate, compiled simulator generation and for graphical debugging.
3.2.5 VHDL Template Code Complexity

A template for a generically parameterizable WPPA was implemented in the VHDL hardware description language. It consists of 43 VHDL entities and four additional libraries with auxiliary types and function definitions, see Figure A.2 and Listings A.3, A.4, A.5, and A.6. Since the hardware description must describe a very large set of possible user-specific WPPA instances, it makes extensive use of both the generic and the generate concepts available in VHDL, see Listings A.1, A.2, and A.7 for example code snippets. The overall code complexity currently amounts to 16,328 lines of VHDL code. This is already the complexity of industry-level designs: for comparison, the freely available implementation of the Leon SPARC V8 processor from Gaisler Research comprises 15,000 lines of code and 37 VHDL entities, see [108].
3.3 Summary

In this chapter, a new class of massively parallel embedded processor architectures called weakly programmable processor arrays was introduced. The instruction set architecture and the structure of the single processing elements were described. A
Figure 3.5: Fourteen configuration domains of the 3 × 8 FFT case study application (each shown in a different color). A view from the WPPA Editor design and simulation environment.

novel approach for dynamically reconfigurable interconnection schemes for coarse-grained reconfigurable architectures was also introduced by means of the interconnect wrapper concept. With the help of a special multicast scheme, dynamic and partial reconfiguration can be applied to these processor arrays. The main results were published in [4], [3], [17], [19], [6], [11], [18], [14], [16], [8], and [21]. Furthermore, the WPPA architecture was also presented in hardware and software demonstrations at the DATE University and FPL booths [1], [13], as well as [20].

The large number of different parameters leads to a large number of possible array architectures with various hardware and software configurations. Therefore, the following interesting question arises immediately: How can an efficient WPPA architecture be automatically derived for a given set of prospective applications? The main objective functions are the execution latency of the algorithms, the power and energy consumption, as well as the chip area of the hardware implementations. To answer this question, efficient high-level models for power and area consumption depending on the software and hardware parameters are needed to assess the quality of a given WPPA instance, as well as means to automatically explore the large space of possible solutions. These are the main topics of Chapter 4 and Chapter 6.
4 Power, Area Characterization and Modeling

New application areas for embedded systems such as mobile multimedia processing, wireless communication, and medical image processing, with computational complexities of up to several hundred giga-operations per second, pose a challenge for modern embedded hardware architectures. Especially in the area of mobile and ubiquitous computing, implementations on standard general-purpose processors or multicore architectures are not suitable due to their low energy and power efficiency of tens of milliwatts per megahertz, combined with a high prime cost. Residing architecturally between general-purpose multicore systems on the one side and ASIC implementations on the other, coarse-grained reconfigurable architectures (CGRAs) tend to inherit the domain-specific flexibility of MPSoCs as well as the area and power efficiency of an ASIC implementation. Therefore, the completely flexible general-purpose paradigm is abandoned in favor of application-domain-specific, massively parallel reconfigurable solutions [237].

Nowadays, many commercial and academic coarse-grained reconfigurable architectures exist; refer for instance to [161] for an overview. During their evolution, CGRAs transformed from static, hard-coded architectures towards more flexible and customizable architecture templates. This is needed to achieve even better power and performance characteristics for different embedded application domains where computational power and energy efficiency are the key design metrics. Owing to their parameterization capabilities, such architecture templates allow the designer to tailor the available hardware resources to the specific needs of a set of algorithms, for example from the digital signal processing domain.
Provided an automated design flow including a scheduler/compiler, hardware description and configuration stream generators, this dramatically reduces the development costs of both hardware and software for modern systems-on-a-chip, where CGRAs are now used extensively, see for example [195, 59]. However, besides its advantages, the high degree of parameterization also leads to an explosion of the corresponding design space, with billions of design alternatives for as few as a dozen processing elements and design parameters. Therefore, highly efficient architecture-level power, area, and performance estimation is needed to cope with such combinatorial design complexity.
4.1 Probabilistic Power Analysis

The probabilistic analysis of power consumption is a technique which captures the main stochastic properties of the input/output and internal signals, that is, their switching characteristics, in a more abstract way than logic simulation. The corresponding theoretical fundamentals are presented in the following; see also the overview article by Najm [211].

The original work of Najm [209] proposed to use the transition density, that is, the average switching rate of a circuit node, as a measure of activity in addition to the signal probability, see [227]. This measure is based on a stochastic model. Furthermore, it is possible to propagate the transition density through a circuit to obtain the corresponding switching rates at the internal nodes. The process of activity propagation can be compared to an abstract, single-pass stochastic simulation of the circuit. This technique is especially well suited for very large system-on-a-chip designs and long input sequences for which an RTL-level simulation would not be feasible. Originally, it was aimed at circuit reliability analysis, since nodes with high switching rates are the first candidates for different kinds of hardware failures.

The underlying stochastic model is based on mean-ergodic, strict-sense stationary 0-1 stochastic processes. A process x(t), t ∈ (−∞, +∞), is strict-sense stationary (SSS) if its statistical properties do not depend on the observation start point. Furthermore, a constant mean, finite variance, and a decaying auto-correlation of the stochastic process are assumed (mean-ergodicity). If the inputs of a circuit satisfy the aforementioned properties, it can be shown that all internal and output nodes also satisfy them. One important example of such a process is the two-state continuous-time Markov process.
The probability P that x(t) takes the value 1 at any given time t is equal to its mean value E[x(t)] at that time and is a constant. This is the equilibrium probability of x(t). We can further state that for a time interval T the following equation holds, see [209]:

\lim_{T \to \infty} \frac{1}{T} \int_{-T/2}^{+T/2} x(t)\,dt = P(x) .   (4.1)
Let furthermore n_x(T) be the random number of transitions of x(t) in the time interval (−T/2, +T/2]. The expectation E[n_x(T)] depends only on the length of the time interval T; the start point of the observation is irrelevant since x(t) is an SSS stochastic process. Based on this, the following equation holds:

E[n_x(T_1 + T_2)] = E[n_x(T_1)] + E[n_x(T_2)] \;\Rightarrow\; E[n_x(T)] = k \cdot T, \quad k \ge 0 .   (4.2)
Therefore, the ratio E[n_x(T)]/T is the transition density D(x), that is, the expected number of transitions per unit time. If x(t) is a sample of the stochastic process, it is also SSS and mean-ergodic, and the following properties apply for x(t) as well:

\lim_{T \to \infty} \frac{1}{T} \int_{-T/2}^{+T/2} x(t)\,dt = P(x), \qquad \lim_{T \to \infty} \frac{n_x(T)}{T} = D(x) .   (4.3)
Definition 4.1 (Equilibrium Signal Probability) The equilibrium signal probability P(x) is, asymptotically, the fraction of time that x(t) spends in the state 1.

Definition 4.2 (Transition Density) The signal transition density D(x) is, asymptotically, the average number of transitions of x(t) per unit time.

Another aspect of the probabilistic power models is the assumed circuit delay model, which is directly related to the time interval used in the previous two definitions. If the smallest considered time interval is a single clock cycle, the definitions of signal probability and transition density become independent of the physical circuit delays. This is comparable to the concept of a switching event in abstract logic simulation because only functional transitions are considered. Additionally, the transition density becomes a transition probability.

Definition 4.3 (Transition Probability) The transition probability Pt(x) at a circuit node x is the average fraction of clock cycles in which the steady-state value of node x is different from its initial value [211].

Definition 4.4 (Signal Probability) The zero-delay signal probability Ps(x) at a circuit node x is the average fraction of clock cycles in which the steady-state value of node x is a logic high [211].

If arbitrary time intervals T are possible, a real-delay model of the circuit is assumed. The relation between the transition probability and the transition density is given by

D(x) \ge \frac{P_t(x)}{T_{\mathrm{clock}}} .   (4.4)

Both measures are equal only in the case of a zero-delay model; the transition probability Pt(x) therefore gives a lower bound on the transition density D(x). Similarly, Ps(x) and the equilibrium signal probability P(x) are equal only in the case of a zero-delay model.
Figure 4.1: Theoretical relationship between the transition probability and the signal probability for the zero-delay model (a), and the total power of a 32-bit multiplier as a function of the average signal probability Ps and the average transition probability Pt of the first operand (b).

Special Case of Transition and Signal Probability Interdependence The following relation between the transition probability and the signal probability can be derived for the zero-delay model if the input signals toggle at most once per cycle, that is, if they are always driven by registers, see [124]:

P_t(x_i)/2 \le P_s(x_i) \le 1 - P_t(x_i)/2 \quad\Longleftrightarrow\quad P_t(x_i) \le 1 - 2\left| P_s(x_i) - \tfrac{1}{2} \right| .   (4.5)

Graphically, this is the shaded area in Figure 4.1(a). Therefore, only one half of the theoretically possible combinations of (Ps, Pt) has to be taken into account during the model creation phase.

Example 4.1 Consider the two following binary sequences of length 10 with an equal number of zeroes and ones:

a) 0 1 1 0 0 0 1 1 0 1
b) 1 1 1 0 0 0 0 0 1 1 .

Although both sequences have the same signal probability Ps(a) = Ps(b) = 0.5, their transition probabilities differ considerably: Pt(a) = 0.5, Pt(b) = 0.2 .
Consider now an n-bit wide binary input stream I of length m clock cycles:

I = \{ (x_{11}, x_{12}, \ldots, x_{1n}), \ldots, (x_{m1}, x_{m2}, \ldots, x_{mn}) \} .   (4.6)

The switching activity metrics for a single bit j, that is, the signal probability Ps(x_j) and the signal transition probability Pt(x_j), are defined as follows, see [210]:

P_s(x_j) = \frac{\sum_{k=1}^{m} x_{k,j}}{m}, \qquad P_t(x_j) = \frac{\sum_{k=1}^{m-1} x_{k,j} \oplus x_{k+1,j}}{m-1}, \qquad 1 \le j \le n .   (4.7)
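For illustration, the two per-bit metrics of Eq. (4.7) can be computed directly from a bit sequence (hypothetical helper names; note that Example 4.1 normalizes the transition count by the stream length m, whereas Eq. (4.7) divides by m − 1):

```python
def signal_probability(bits):
    """Ps: fraction of cycles in which the signal is logic 1."""
    return sum(bits) / len(bits)

def transition_probability(bits):
    """Pt: fraction of consecutive cycle pairs in which the signal toggles."""
    return sum(a ^ b for a, b in zip(bits, bits[1:])) / (len(bits) - 1)

# The two sequences of Example 4.1: equal Ps, very different activity.
a = [0, 1, 1, 0, 0, 0, 1, 1, 0, 1]
b = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
print(signal_probability(a), signal_probability(b))          # 0.5 0.5
print(transition_probability(a), transition_probability(b))  # 5/9 vs. 2/9
```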
Spurious Activity Probabilistic power estimation by means of a zero-delay circuit model in principle ignores the spurious switching activity in combinational circuits generated by dynamic logic hazards, see [97]. The average dynamic power is computed in this case by a summation over the circuit nodes i according to

P_{\mathrm{Average}} = \frac{V_{DD}^2}{2\,T_{\mathrm{clock}}} \sum_i C_i\, P_t(x_i) ,   (4.8)

where C_i stands for the total capacitance at node x_i. In the case of the real-delay model, an additional difficulty arises from the fact that logic gates exhibit an inertial kind of delay. This means that not all theoretically possible spurious transitions are propagated towards the circuit output during circuit operation. Instead, spurious switching transitions whose duration is smaller than the respective internal gate delay are filtered out; this is the so-called glitch-filtering property of the inertial-delay model. However, the power consumption due to spurious switching activity can be reduced considerably if care is taken that most of the delay paths of the circuit are balanced. All power-aware commercial CAD tools respect this additional constraint during synthesis and place and route. The situation here is similar to the case of short-circuit power, which can also be minimized by lowering the supply voltage and balancing the rise and fall times of the input and output signals, see [301].

Signal Correlation Random signals are an example of fully uncorrelated signals. The signals within a digital circuit, as well as its input signals, however, are normally not random and depend on each other. They are said to be correlated: the logic value of one signal influences the value of the others and vice versa. Concerning signal correlations, spatial and temporal correlation types are distinguished. Compressed video and audio streams are major examples of digital signals with strong correlations of both types.
Furthermore, within both types a distinction can be made between structural and pattern-dependent correlations. To keep the models manageable, correlation coefficients are mostly averaged over the tuple width n and the stream length m.
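As a purely numeric sketch of the zero-delay average power formula of Eq. (4.8) (all capacitance, voltage, and activity values below are invented for illustration):

```python
def avg_dynamic_power(vdd, t_clock, nodes):
    """Eq. (4.8): nodes is an iterable of (node capacitance in F, Pt)."""
    return vdd ** 2 / (2 * t_clock) * sum(c * pt for c, pt in nodes)

# Three hypothetical nodes of 10 fF each, Vdd = 1.0 V, clock period 5 ns.
p = avg_dynamic_power(1.0, 5e-9, [(10e-15, 0.1), (10e-15, 0.2), (10e-15, 0.3)])
print(p)  # roughly 6e-07 W, i.e., about 0.6 uW
```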
4.1.1 Spatial Signal Correlation

Two signals are said to be spatially correlated if the value of the first signal logically depends on the value of the second and therefore cannot be changed independently. In the case of structural spatial correlation, this behavior results from the physical properties of the circuit implementation, namely the so-called reconvergent fan-out (RFO) phenomenon. Another form of spatial correlation is pattern-dependent correlation. The reason for this phenomenon lies in the information-theoretical properties of the algorithms implemented in the given hardware (video encoding, audio encoding, telecommunication, . . . ) [290] and also in the digital representation of the data, see [180]. Beyond pair-wise signal correlations, higher-order spatial correlations also exist. In general, spatial correlations may exist among every n-tuple of input signals, with n as small as two signals and as large as the input signal width.

Example 4.2 Consider a simple AND gate with spatially independent inputs a and b. The zero-delay signal probability at the output c is given by

P_s(c) = P_s(a) \cdot P_s(b) .

A similar formula can be derived for other Boolean expressions, see [227]. The zero-delay probabilities can be propagated through a BDD representation of a logic circuit in linear time, see [209]. For correlated signals, however, conditional probabilities have to be used instead:

P_s(c) = P_s(a \mid b) \cdot P_s(b), \qquad P_s(a \mid b) = \frac{P_s(a \cap b)}{P_s(b)}, \quad P_s(b) \ne 0 .

Spatial Correlation Coefficient According to [317], the average pair-wise spatial correlation coefficient S_in over a binary stream I is defined as

S_{in} = \frac{1}{m \cdot \lceil n/2 \rceil \cdot \lfloor n/2 \rfloor} \sum_{l=1}^{m} \sum_{j=1}^{n-1} \sum_{k=j+1}^{n} \left( x_{lj} \equiv x_{lk} \right) .   (4.9)

Modified definitions also exist, see [124, 40]. This definition uses the equivalence operator (≡) and captures correlated streams comprising both matching ones and matching zeroes, as opposed to the definition given in [124], which captures only matching ones by using the AND operator.
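A direct, hypothetical implementation of Eq. (4.9) may look as follows; in particular, the ⌈n/2⌉·⌊n/2⌋ normalization is an assumption taken from the equation as printed:

```python
from math import ceil, floor

def spatial_corr(stream):
    """Average pair-wise spatial correlation S_in of Eq. (4.9).

    stream: list of m words, each a tuple of n bits.
    """
    m, n = len(stream), len(stream[0])
    matches = sum(int(word[j] == word[k])
                  for word in stream
                  for j in range(n - 1)
                  for k in range(j + 1, n))
    return matches / (m * ceil(n / 2) * floor(n / 2))

# Fully correlated words score highest; alternating bit patterns score lower.
print(spatial_corr([(1, 1, 1, 1), (0, 0, 0, 0)]))  # 1.5
print(spatial_corr([(1, 0, 1, 0), (0, 1, 0, 1)]))  # 0.5
```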
4.1.2 Temporal Signal Correlation

In the case of temporal correlation, a given signal value depends on the values it had in previous time steps. In sequential networks, this kind of structural signal correlation
is mainly caused by feedback loops. Furthermore, temporal correlations of orders higher than one, that is, reaching further back than the previous time step, also exist, see [192].

Example 4.3 In the case of temporally uncorrelated signals, a simple relation between the transition probability Pt(x) and the signal probability Ps(x) can be given:

P_t(x) = P_s(x)\,\overline{P_s(x)} + \overline{P_s(x)}\,P_s(x) = 2\,P_s(x)\,\overline{P_s(x)} = 2\,P_s(x)\,(1 - P_s(x)) .

Temporal Correlation Coefficient In analogy to the spatial correlation coefficient, the average temporal correlation coefficient T_in of order l is defined as

T_{in} = \frac{1}{n \cdot (m-(l-1)) \cdot l} \sum_{j=1}^{n} \sum_{i=1}^{m-(l-1)} \sum_{k=i+1}^{i+l} \left( x_{ij} \equiv x_{kj} \right) .   (4.10)

To capture the time dimension, a window of length l is slid over the stream I. The window length l is often chosen to be 10, according to [100], where this metric was originally introduced. First-order temporal correlations, that is, between consecutive clock edges only, are already captured by the Pt and Ps metrics, as shown in [124].
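A sketch of Eq. (4.10) in Python (hypothetical helper; the window start range is clamped to m − l here so that the comparison index i + l never runs past the end of the stream, a slight simplification of the printed summation limits):

```python
def temporal_corr(stream, l):
    """Average temporal correlation T_in of order l over a bit stream.

    stream: list of m words, each a tuple of n bits; l: window length.
    """
    m, n = len(stream), len(stream[0])
    matches = sum(int(stream[i][j] == stream[k][j])
                  for j in range(n)
                  for i in range(m - l)          # clamped window start
                  for k in range(i + 1, i + l + 1))
    return matches / (n * (m - l) * l)

# A constant stream is perfectly temporally correlated at any order.
print(temporal_corr([(1, 0)] * 8, 3))  # 1.0
```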
4.1.3 Analysis Techniques

Approximate Analysis Techniques To accelerate probabilistic power estimation, simplifying assumptions are usually made about the signal correlations and gate delays. To be able to propagate the stochastic activity measures through a circuit, spatial independence of the signals is often assumed. Thereupon, the transition density can be propagated with the help of the Boolean difference function [210]. For the propagation of the equilibrium signal probability, the BDD representation of the circuit is used. However, all BDD-based techniques suffer from the growing size of the BDD for larger circuits.

Exact Analysis Techniques Several publications exist on the proper theoretical modeling of temporal and spatial signal correlations during probabilistic power analysis. In the case of combinational circuits, an exact modeling of both spatial and (high-order) temporal correlations is possible with the help of Bayesian networks [41, 42]. Additionally, exact techniques relying on the zero-delay model and OBDDs can be named, see [113, 192, 249]. Spatial pair-wise correlations were modeled in [102], and the respective propagation algorithm was presented as well. In [192], a lag-one Markov chain model is used for the proper modeling of spatio-temporal correlations under the zero-delay assumption. The same lag-one Markov model is also used in [249]. In [193], which is an enhancement of [192], the mathematical concepts of conditional independence and almost isotropic signals are brought into play. The authors formally prove that, based on conditional independence and signal isotropy, the statistics taken for pair-wise correlated signals are sufficient to characterize larger sets of higher-order correlated signals.

Extension to Sequential Circuits So far, the activity metrics were only used to characterize combinational circuits. However, the same techniques are also valid for sequential circuits, as shown in [123] and [46], if the goal is to estimate the average power consumption.
4.2 Power Profiling Results for WPPAs

To get an idea of how real signal statistics look on WPPAs, we examined gate-level simulations of two algorithms: an 8-point radix-2 fast Fourier transform and a synthetic load algorithm which generates high overall signal activity and 100% resource utilization without any computational meaning. The corresponding hardware architecture consists of 24 PEs, each containing two adders, two multipliers, one shifter, and a data path unit, with an overall instruction issue width of 6 slots. This yields an overall number of 48 adders, 48 multipliers, 24 shifters, and 24 data path units. However, since the data path units have a very low implementation complexity and a small overall power consumption, and are not used during the FFT algorithm, they are not considered here. The interconnect structure of the processor array is shown in Figure 3.5.

For each type of functional unit and for the multi-ported register file, the transition density and transition probability were captured during a 20 μs simulation run of a back-annotated gate-level netlist after place and route. The results are presented as three-dimensional plots with the functional unit number [1-24 or 1-48] or register file number [1-24] on the x-axis; the respective bits of the input [1-16] (first operand, second operand) or output signal [1-16] (result) in the case of functional units, and (input registers, write data) as well as (result, read data, output registers) in the case of the register file, on the y-axis. The z-axis gives the corresponding transition density (left figures) and signal probability (right figures).

Adder/Subtractor According to the results given in Figure 4.2, the common value of the transition density is notably under 0.1 for most adder inputs. It also shows that most of the switching activity is located in the second row of the WPPA array shown in Figure 3.5. The output transition densities are generally higher than those of the inputs. This is important especially for the estimation of the switching load on the register file ports (write data). The probability values show a very noisy behavior, with larger variations both over the bit width and over the adder number.
For the synthetic load algorithm, the transition density does not exceed 0.3 for most adder inputs, that is, a single input bit switches approximately 1.5 times per clock cycle, see Figure 4.3. The output signals (result) again show a higher transition density compared to the inputs (first operand, second operand). Note also the only minor variation of the transition densities over the signal width of 16 bits. However, this behavior strongly depends on the application. The transition density approaches a uniform distribution, for example, when the corresponding algorithm reduces the image correlation, as shown in [290].

Multiplier Whereas the transition density of the adder units for the FFT algorithm was quite low, the corresponding values for the multipliers are zero for most units, see Figure 4.4. This is due to the fact that only a few of the 48 multipliers are utilized by the software. The low values of ≈ 0.05 again show a uniform behavior over the complete signal bit width. The corresponding probability distribution lies around 0.5 for the first operand and is constant in the case of the second operand because a multiplication with a binary constant takes place. The multiplications are performed in the second and the third row of the array. However, Figure 4.5 shows that this behavior is highly application-specific. In the case of the synthetic load algorithm, the corresponding transition densities increase significantly to ≈ 0.3 and ≈ 0.35 for the first and the second operand, respectively, and up to 0.6 for the multiplier output (result). The probability distributions are smoother and lie around 0.2 and 0.3.

Shifter Similar to the case of the multipliers, the transition density distribution for the shifter modules is also very sparse. Shift operations occur mostly in the third row of the array, which also becomes clear from Figure 4.6. Except for the shifter outputs, both input operands show very low transition densities around 0.05.
In the case of the second operand, only the four least significant bits show values different from zero. This is due to the fact that for 16-bit wide operands, shift operations cannot be performed by values greater than 16. The corresponding probability values are around 0.3 to 0.5. Again, this behavior changes dramatically for the synthetic load algorithm, as shown in Figure 4.7. Now, all 24 shifter modules of the array are engaged, which leads to a transition density of 0.2 for the first operand and even 0.4 for the outputs. The probabilities also increase but are quite noisy, both across units and across bits. The second operand still shows low transition densities, but now for all units of the array.

Register File The results for the register files are also very indicative of the software dependency of the transition densities and especially of the register port utilization. That is
confirmed in Figure 4.8. Since the register file has 12 read and 6 write ports, the corresponding width of the write data signal is 96 bits (6 × 16 bits, one write port for each functional unit). For the read operands, 2 × 16 bits are used per unit: 2 × 16 × 5 for the two adders, two multipliers, and one shifter, as well as one read port for the data path unit. This amounts to a 176-bit wide read data signal for each of the 24 register files. Furthermore, each processing element has four inputs and two outputs, which corresponds to an input register width of 4 × 16 and an output register width of 2 × 16.

If we take a closer look at the transition densities of the write data signals, we see that in the first row of the processor array only the adder units are used by the software (bits 0 to 31): two adder modules with 16-bit wide output operands8. The transition densities of the write operands are significantly higher than those of the read operands. This is in accordance with the previous observations of higher transition densities at the outputs of the functional units. The switching activity at the processor inputs and outputs is very low and diminishes from the first array row to the last one. The switching activity of the register files clearly shows that, among all power-relevant parameters, the utilization rate of the register file ports will play the most important role for proper power estimation. If all functional units are utilized by the software, as in the case of the synthetic load, high transition densities result throughout all write and read ports, see Figure 4.9.
Furthermore, since the memories used in WPPA are macro blocks generated by the memory compiler CAD tool, these modules are only characterized by the memory size and aspect ratio in the ASIC library file and not the switching activity. Reconfiguration Logic Regarding the reconfiguration logic, we can state that due to architectural clock gating scheme described in the next chapter their dynamic power consumption during the functional processing is reduced to zero. Since the reconfiguration logic is active only a small fraction of the time, it was not included into the power macro-modeling procedure. The power profiling results for the reconfiguration logic are described in the next section in more detail.
⁸ The overall connection scheme of the functional units to the register file ports is as follows: the least significant bits are occupied by the adder units, the next ports are connected to the multipliers, and the most significant bits to the shifters and data path units.
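The signal-width bookkeeping above can be cross-checked in a few lines (a sketch; the constants simply restate the configuration given in the text):

```python
# Recompute the WPPA register file signal widths quoted in the text.
DATA_WIDTH = 16        # bits per operand
WRITE_PORTS = 6        # one 16-bit write port per functional unit
TWO_OPERAND_UNITS = 5  # 2x adder, 2x multiplier, 1x shifter
DPU_READ_PORTS = 1     # the data path unit uses a single read port

write_width = WRITE_PORTS * DATA_WIDTH                              # 96 bits
read_width = (2 * TWO_OPERAND_UNITS + DPU_READ_PORTS) * DATA_WIDTH  # 176 bits
input_reg_width = 4 * DATA_WIDTH                                    # 4 PE inputs
output_reg_width = 2 * DATA_WIDTH                                   # 2 PE outputs

print(write_width, read_width, input_reg_width, output_reg_width)  # 96 176 64 32
```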
4.2 Power Profiling Results for WPPAs
[Figure: surface plots of transition density and signal probability per bit for the adder ports (1. Op., 2. Op., Res.) across all adder units.]
Figure 4.2: Adder profiling results for the FFT on a WPPA with 24 PEs, data width 16 bit, 5 ns delay, 90 nm standard cell technology.
4 Power, Area Characterization and Modeling
[Figure: surface plots of transition density and signal probability per bit for the adder ports (1. Op., 2. Op., Res.) across all adder units.]
Figure 4.3: Adder profiling results for the synthetic load on a WPPA with 24 PEs, data width 16 bit, 5 ns delay, 90 nm standard cell technology.
[Figure: surface plots of transition density and signal probability per bit for the multiplier ports (1. Op., 2. Op., Res.) across all multiplier units.]
Figure 4.4: Multiplier profiling results for the FFT on a WPPA with 24 PEs, data width 16 bit, 5 ns delay, 90 nm standard cell technology.
[Figure: surface plots of transition density and signal probability per bit for the multiplier ports (1. Op., 2. Op., Res.) across all multiplier units.]
Figure 4.5: Multiplier profiling results for the synthetic load on a WPPA with 24 PEs, data width 16 bit, 5 ns delay, 90 nm standard cell technology.
[Figure: surface plots of transition density and signal probability per bit (0...15) for the shifter ports (1. Op., 2. Op., Res.) across all shifter units.]
Figure 4.6: Shifter profiling results for the FFT on a WPPA with 24 PEs, data width 16 bit, 5 ns delay, 90 nm standard cell technology.
[Figure: surface plots of transition density and signal probability per bit (0...15) for the shifter ports (1. Op., 2. Op., Res.) across all shifter units.]
Figure 4.7: Shifter profiling results for the synthetic load on a WPPA with 24 PEs, data width 16 bit, 5 ns delay, 90 nm standard cell technology.
[Figure: surface plots of transition density and signal probability per bit for the register file input registers (bits 0...63), write data (bits 0...95), read data (bits 0...175), and output registers (bits 0...31).]
Figure 4.8: Register file profiling results for the FFT on a WPPA with 24 PEs, data width 16 bit, 5 ns delay, 90 nm standard cell technology.
[Figure: surface plots of transition density and signal probability per bit for the register file input registers (bits 0...63), write data (bits 0...95), read data (bits 0...175), and output registers (bits 0...31).]
Figure 4.9: Register file profiling results for the synthetic load on a WPPA with 24 PEs, data width 16 bit, 5 ns delay, 90 nm standard cell technology.
4.2.1 Reconfiguration Phase

To capture the overall power consumption during the reconfiguration of a WPPA, cycle-accurate power profiling of the back-annotated netlist after place and route was performed. The following algorithms were tested on two array configurations⁹: a 4-tap finite impulse response (FIR) filter on a WPPA with 1 PE and on a WPPA with 4 PEs, an edge detection algorithm on a WPPA with 4 PEs, and an FFT on a WPPA with 24 PEs.

4-Tap FIR on one PE The profiling results of the single-PE FIR filter can be seen in Figure 4.10. The phases of array initialization, processing element configuration, and functional processing are clearly distinguished in the respective power traces. After the short array initialization phase and some idle time, the configuration of the first processing element (PE[0][0]) starts. Thereupon, the interconnect of the second processing element (PE[0][1]) is configured. No computation is performed on this PE; solely its interconnect is needed to guide the results to the outputs of the WPPA array. The configuration takes only 1.7 μs per PE. The maximum total power amounts to ≈ 6 mW. The major part of ≈ 3.5 mW comes from the internal power component, see Section 2.2.5 for definitions. The switching power is very low, around 1.5 mW.

4-Tap FIR on 4 PEs If the computation is spread over four processing elements to achieve a higher throughput, as shown in Figure 4.11, the overall configuration time doubles. The configuration power increases only slightly.

Edge Detection on 4 PEs In the case of the edge detection algorithm shown in Figure 4.12, the configuration power still stays in the 6 mW range. The switching power shows more spikes than in the previous traces, but its average does not change.

FFT on 24 PEs In the case of the FFT, the single processing elements are more complex (VLIW width 128 bit).
The 24 PEs of the array are configured in the 14 configuration domains shown in Figure 3.5, each containing one to four processing elements. The total power consumption during the reconfiguration depends on the number of PEs in the domain, as shown in Figure 4.13; however, it does not exceed 55 mW. The configuration power is significantly higher than in the previous examples since the VLIW instruction memory is wider in this case. As before, the internal power component forms the largest part of the power consumption at ≈ 45 mW. The switching power amounts to ≈ 15 mW.

⁹ Configuration I: array 2 × 2 (4 PEs), 1× adder, 1× multiplier, VLIW width = 64 bit. Configuration II: array 3 × 8 (24 PEs), 2× adder, 2× multiplier, 1× shifter, 1× data path unit, VLIW width = 128 bit.
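As a rough back-of-the-envelope check (assuming, pessimistically, that the ≈ 6 mW peak is drawn for the entire 1.7 μs configuration window of the single-PE FIR example), the configuration energy per PE is on the order of 10 nJ:

```python
# Upper-bound estimate of per-PE configuration energy. Assumption: the peak
# configuration power is drawn during the whole configuration window.
P_CONF = 6e-3    # W, peak total power during configuration (from the trace)
T_CONF = 1.7e-6  # s, configuration time per PE

energy_j = P_CONF * T_CONF
print(f"{energy_j * 1e9:.1f} nJ per PE")  # 10.2 nJ per PE
```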
[Figure: total, internal, and switching power traces over time (ns), annotated with end of array initialization, idle time, configuration of PE[0][0] and PE[0][1], and start of processing.]
Figure 4.10: Cycle-accurate profiling results for the 4-tap FIR on a WPPA with 1 PE, data width 16 bit, 200 MHz, 90 nm standard cell technology.
[Figure: total, internal, and switching power traces over time (ns), annotated with end of array initialization, configuration of PE[0][0] through PE[1][1], and start of processing.]
Figure 4.11: Cycle-accurate profiling results for the 4-tap FIR on a WPPA with 4 PEs, data width 16 bit, 200 MHz, 90 nm standard cell technology.
[Figure: total, internal, and switching power traces over time (ns), annotated with end of array initialization, configuration of PE[0][0] through PE[1][1], and start of processing.]
Figure 4.12: Cycle-accurate profiling results for the edge detection on a WPPA with 4 PEs, data width 16 bit, 200 MHz, 90 nm standard cell technology.
[Figure: total, internal, and switching power traces (logarithmic scale) over time (ns), annotated with end of array initialization, idle time, start of reconfiguration (14 domains), and start of processing.]
Figure 4.13: Cycle-accurate profiling results for the FFT on a WPPA with 24 PEs, data width 16 bit, 200 MHz, 90 nm standard cell technology.
[Figure: total power trace of the array logic over time (ns), annotated with end of array initialization, start and end of reconfiguration (14 domains), and the ≈ 7 mW level during processing.]
Figure 4.14: Cycle-accurate total power consumption of the array logic for the synthetic load algorithm on a WPPA with 24 PEs, data width 16 bit, 200 MHz, 90 nm standard cell technology.
4.2.2 Functional Processing Phase

The power consumed during functional processing clearly depends on the PE configuration, the array size, and the algorithm. For the FIR filter on one PE, it amounts to 7 mW; four PEs consume ≈ 20 mW. The corresponding switching power is low in both cases: 1 mW and 2.5 mW, respectively. For the edge detection, 12.5 mW are consumed during functional processing, with a switching power of around 2 mW. Looking at the functional power of the FFT in Figure 4.13, we see that at least ≈ 260 mW must be provided for this algorithm implementation. The switching power amounts to ≈ 50 mW and the internal power to ≈ 245 mW.

Power Consumption of the Array Logic Another interesting question is how much power is consumed by the WPPA array logic excluding the processing elements, that is, by the array reconfiguration modules during configuration and by the interconnect logic during functional processing. Figure 4.14 shows the corresponding power trace for the synthetic load algorithm, which generates an especially high switching activity. This power amounts to ≈ 15 mW during configuration and to around 7 mW during algorithm processing. These are rather low values compared to the overall power consumed in the processing elements: 20% for the reconfiguration phase and only 1% for the functional phase. This observation is especially important for design space exploration, since it allows us to concentrate the estimation and modeling effort on the functional and storage parts of the processing elements in a WPPA.
4.3 Power Macro-Modeling

For large WPPAs with potentially thousands of processing elements, simulation-based power estimation, even at the RTL level, is not viable because of its low speed. This applies even more to design space exploration. Therefore, probabilistic macro-modeling is used for the construction of power models. Power macro-modeling of digital circuits [235], [181] is a mature research topic with a variety of different approaches. Basically speaking, a macro-model formulates the power consumed (dependent variable) in terms of parameters (independent variables) that are easily observable at higher levels of abstraction than the gate or circuit level, see, for instance, [233]. In CMOS technology, the dynamic power dissipation can be accurately approximated by considering the charging and discharging of capacitive loads. Given the supply voltage and the load capacitance, only the switching activity is needed to compute the dynamic power.

Normally, power macro-modeling proceeds in two steps. In the characterization step, extensive circuit- or gate-level simulation is performed on a predesigned hardware module using input streams with various preselected statistical properties. A power macro-model is then computed that maps the statistics of the input-output signals to the circuit power dissipation. In our case, however, we can omit the time-intensive simulation, as will be shown in the following. During the estimation step, the statistics of the actual input-output signals are calculated and applied to the macro-model to derive the power dissipation of the circuit. Since macro-models are derived from circuit- or gate-level characterizations which capture the main physical details like parasitic capacitance and leakage current, these power estimates are sufficiently accurate for the relative comparison of different designs as required during design space exploration.
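The charging/discharging argument can be made concrete with the standard first-order CMOS relation P_dyn = α·C·V_dd²·f (a sketch with illustrative values, not data from this work):

```python
def dynamic_power(alpha, c_load, v_dd, f_clk):
    """First-order CMOS dynamic power: P = alpha * C * Vdd^2 * f.

    alpha  -- switching activity (transitions per clock cycle)
    c_load -- effective switched capacitance in farads
    v_dd   -- supply voltage in volts
    f_clk  -- clock frequency in hertz
    """
    return alpha * c_load * v_dd ** 2 * f_clk

# Illustrative values only: 0.1 activity, 10 pF, 1.0 V, 200 MHz -> 0.2 mW
p = dynamic_power(alpha=0.1, c_load=10e-12, v_dd=1.0, f_clk=200e6)
print(f"{p * 1e3:.3f} mW")
```

With supply voltage and capacitance fixed by the technology and the netlist, the switching activity is indeed the only remaining input, which is what the macro-model captures.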
Table-based Empirical Models As already mentioned in the overview in Section 2.4, there exist three broad categories of macro-modeling techniques: table-based, equation-based (regression analysis), and hybrid. For our work, we chose a modified look-up table (LUT) based approach [122] for the following reasons:

1. A LUT-based approach is robust and applicable to both combinational [122] and sequential modules like register files [46, 123],
2. it is sufficiently accurate for the estimation of the average power consumption,
3. it provides a high degree of parameter extensibility,
4. the characterization process can be fully automated,
5. power characterization can be performed together with the characterization for area and timing [36].
The key question for power macro-modeling, though, is the choice of the model parameters.

The Choice of Macro-model Parameters Generally, a dependent variable (the average power P_Average) is described in terms of model parameters X_i (independent variables):

P_Average = P(X_1, X_2, ..., X_n).

Parameters commonly used in micro-architectural power macro-models include the transition density/probability, the signal probability, the average spatial signal correlation coefficient, and the average temporal signal correlation coefficient. As already shown in several publications, power is not equally sensitive to the different parameters observable at the architecture level. The transition density of the input signals influences the power consumption the most, see [317], whereas the probability and the spatial correlation only occasionally influence the power consumption noticeably. In [77] and [318], the power sensitivity to a model parameter X was defined as

lim_{ΔX→0} ΔP_Average / ΔX .    (4.11)

Consequently, the higher the sensitivity of the power to the parameter X, the more points should be analyzed to increase the model accuracy. This is accomplished by our non-uniform parameter sampling technique. We found empirically that the transition density has the highest influence on the average power consumption, as shown in our experiments with different WPPA configurations and array sizes, both for combinational and sequential circuits. This is also confirmed by other recent publications, see [317]. In Figure 4.15, the power sensitivity surfaces for a 16-bit multiplier, synthesized in a 90 nm commercial standard cell library, are shown. The plot in the upper left corner shows the power sensitivity to the transition densities of the first and the second operand, with the probability for both operands held constant at 0.3.
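Numerically, the sensitivity of Equation (4.11) can be approximated by a finite difference over a characterized model. The sketch below uses a made-up quadratic power function standing in for a real macro-model:

```python
def sensitivity(power_fn, x, dx=1e-6):
    """Finite-difference approximation of the sensitivity in Eq. (4.11)."""
    return (power_fn(x + dx) - power_fn(x)) / dx

# Hypothetical stand-in for a characterized macro-model: average power (mW)
# as a nonlinear function of the transition density d.
power_model = lambda d: 0.5 + 2.0 * d + 4.0 * d ** 2

# Where the sensitivity is high, more sample points are placed on the grid.
print(round(sensitivity(power_model, 0.1), 3))  # 2.8
print(round(sensitivity(power_model, 0.5), 3))  # 6.0
```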
It is also clearly visible that especially for low transition densities around 0.1, which are the most relevant for the majority of applications, the dependency is very strong and nonlinear, whereas the influence of the signal probability, shown on the right, is almost negligible. Since the probability has only a minor impact on the average power consumption, we assume a uniform distribution over the bit width for both probability and transition density in our power models. However, this does not restrict our methodology in any regard: any suitable distribution over the bit width can be assumed for power characterization, like, for instance, linearly increasing or decreasing, uniform, or any
[Figure: power sensitivity surfaces of the total power (mW) of a 16-bit multiplier: over the transition densities d(1. Op.), d(2. Op.) with p(data) = 0.3; over the probabilities p(1. Op.), p(2. Op.) with d(data) = 0.2; and over the enable signal probability p(enable) and transition density d(enable) with p(data) = 0.3, d(data) = 0.2.]
Figure 4.15: Multiplier power sensitivity surfaces obtained using non-uniform parameter sampling, data width 16 bit, 5 ns delay, 90 nm standard cell technology.
other. In the case of our architecture, already the uniform distribution showed good accuracy for power estimation; this was also confirmed in [40]. Furthermore, the utilization pattern given by the enable control signals, and thus by the compiler, influences the average power consumption much more than the exact switching distribution on the input ports. Also, for the comparison of different design alternatives, the absolute value of the area or power consumption is of subordinate importance as long as the estimation procedure is not biased. The applied macro-modeling is not biased, that is, it performs either equally well or equally poorly for all design alternatives, as already shown empirically in [36].

The last plot in Figure 4.15 shows the dependency of the power consumption on the enable signal probability and transition density. Here we see that, except for a small deviation at very low probabilities, it is actually linear. To achieve this behavior, we apply the operand isolation technique to the inputs of all functional units: all operand inputs are gated by the enable signal of the functional unit. Besides a power reduction of up to 20%, as already shown for general digital circuits [206] and CGRAs [218], this technique allows us to directly compute the average power consumption of the functional units dependent on their utilization ratio, which is given statically by the compiler. Similar results were obtained for the other functional units, like shifters, adders, and logic units.
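Operand isolation can be illustrated behaviorally in a few lines (a sketch, not the RTL used in this work): while a functional unit is disabled, its operand inputs are forced to a constant, so no transitions propagate into the unit's combinational logic.

```python
def isolate(operand, enable, hold=0):
    """Behavioral sketch of operand isolation: the operand reaches the
    functional unit only while the unit is enabled; otherwise a constant is
    presented, so no input transitions propagate into its logic."""
    return operand if enable else hold

# A heavily toggling stream produces zero transitions behind the gate
# when the unit is disabled.
stream = [0xAAAA, 0x5555, 0xAAAA, 0x5555]
print([isolate(x, enable=False) for x in stream])  # [0, 0, 0, 0]
print(isolate(0x1234, enable=True))                # passes through when enabled
```

This is why the unit's average power becomes (nearly) proportional to its enable probability: the disabled state contributes almost no dynamic power.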
Non-Uniformity Another important conclusion is that, to appropriately reflect the real-world behavior of the signal probabilities and transition densities, a finer resolution should be used for the lower values between 0 and 0.2, and a coarser resolution for the values above 0.2, to save both characterization time and model storage space, as already done in Figure 4.15. Interestingly, this observation applies to FPGA designs as well, see [253]: for real-world input stimuli, the number of occurrences of switching activities between 0 and 0.25 is approximately two orders of magnitude higher than, for instance, between 0.3 and 1. For random input stimuli, this ratio is still as high as one order of magnitude (Virtex II 2V3000 FPGA, 150 nm, 1.5 V, 90% resource occupancy). We call this non-uniform parameter sampling, as opposed to the model given in [122] and its later refinements, which use equidistant sampling with a step of 0.1 from zero to one. For the intermediate values not captured by the model, we currently use the nearest neighbor method, without any interpolation. However, interpolation could also easily be integrated into the estimation framework.

Functional Units For functional units, the bit probabilities and transition densities of both operands should be taken as model parameters independently, since both influence the power consumption. The parameters for all functional units therefore are:

• probability and transition density of the first operand,
• probability and transition density of the second operand, as well as the
• bit width (8, 16, 24 bits).

Thus, more accurate estimates are also obtained if fixed coefficients are used at one of the inputs, as is usually the case with digital filters. Altogether, we deploy a 2D LUT-based macro-model.
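The non-uniform look-up can be sketched as follows (grid points and power values are illustrative placeholders, not the characterized data): the grid is denser below 0.2, and queries snap to the nearest characterized point, exactly the nearest-neighbor scheme described above.

```python
# Non-uniform sampling grid for the transition density axis: fine resolution
# below 0.2, coarse above (points are illustrative).
GRID = [0.0, 0.05, 0.1, 0.15, 0.2, 0.4, 0.6, 1.0]
# Hypothetical characterized average power values (mW), one per grid point.
POWER = [0.01, 0.12, 0.22, 0.31, 0.40, 0.70, 0.95, 1.40]

def lookup(density):
    """Nearest-neighbor estimate, as used instead of interpolation."""
    i = min(range(len(GRID)), key=lambda k: abs(GRID[k] - density))
    return POWER[i]

print(lookup(0.12), lookup(0.55))  # snaps to the 0.1 and 0.6 grid points
```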
The reason why we do not also consider, for instance, the output transition density, as in [122], is the following: the output transition density behaves noisily because it is not directly controllable, and for some circuit types it tends to be quite insensitive to changes in the input probability distribution, see [36]. Due to this property, the exploration of the output density space would additionally require a high number of different transition density/input probability combinations, as already mentioned in [142], and would slow down the characterization. On the other hand, since cycle-accurate power estimation is not our goal, the average accuracy of ≈ 8.5% of the 2D model, as evaluated in [36], is already sufficient. However, if more accurate estimates are needed, they can also be obtained by the same methodology.

Register File Regarding the register file for VLIW-like architectures, with the number of read ports ranging from 2 to 12 and the number of write ports from 1 to 6, we observed
the port utilization to have the largest impact on the power consumption. Therefore, our model has the port utilization as a parameter, among the following:

• port utilization rate,
• average transition density and probability over all data ports,
• average transition density and probability over the control ports (read/write enable, register addresses),
• number of input registers and number of output registers,
• number of general purpose registers,
• data width,
• input FIFO depth.

As already mentioned, the usage of an RTL synthesis tool for power macro-modeling also allows automatic clock gating to be fully accounted for. On the other hand, clock gating makes the dependence of the power on the utilization rate and the write/read signals non-linear.
4.3.1 Characterization Procedure

The characterization for a fixed frequency and hardware library proceeds as follows: besides hardware synthesis, a generalized synthesis script also performs a probabilistic power characterization with different probabilities and transition densities and the given bit-level distribution. For this purpose, a respective tool interface in the Tcl language is used, which allows these statistics to be set for every single data and control bit separately and independently, see Listing A.8. The whole Tcl script has 2 285 lines of code. After that, the corresponding values for internal, switching, and leakage power are available and written to a file in a specified format; in our case, this is a structured query language (SQL) data insertion script. This script is then read by the database management system and the values are saved in the database.

Speedup Considerations An important question is how much speedup the use of probabilistic characterization achieves in comparison to a simulation-based approach. The characterization procedure for one macro-model, for instance a 16-bit multiplier, already yields (n·k + n²)·m·n²/2 ≈ 675 840 different simulation runs with the typical settings k ≈ 50, m ≈ 5, n = 16, see [36]. Here, n denotes the bit width, k the number of simulation runs, and m the number of different sets to determine the output density. The factor (n·k + n²) accounts for the optimized stimuli generation with predefined statistics based on Markov chains, according to [317]. Especially for
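The final export step of the flow above (writing characterized power values as an SQL insertion script) might look like this sketch; the table and column names are hypothetical, chosen only for illustration:

```python
# Emit characterized power values as SQL INSERT statements, as the
# characterization flow does. Table and column names are hypothetical.
def to_sql_insert(module, p_in, d_in, internal_mw, switching_mw, leakage_mw):
    return ("INSERT INTO power_lut "
            "(module, probability, density, p_internal, p_switching, p_leakage) "
            f"VALUES ('{module}', {p_in}, {d_in}, "
            f"{internal_mw}, {switching_mw}, {leakage_mw});")

print(to_sql_insert("multiplier_16", 0.3, 0.2, 1.1, 0.4, 0.02))
```

One such statement would be generated per sampled (probability, density) grid point and module configuration, then bulk-loaded into the database.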
the large multi-ported register files, this overhead is computationally unaffordable: in our case, with 4 320 different hardware configurations and a large signal width of up to 12×24 = 288 bits due to the multiple ports, it would yield ≈ 2.9·10⁹ simulation runs, or 92 years if one simulation runs in one second, which is currently a very realistic assumption. In contrast, the probabilistic characterization of the register files took approximately four weeks on a consumer quad-core machine with 8 gigabytes of main memory. This already yields a speedup of 1 104×. Regarding the application of our non-uniform parameter sampling, if we can save some points for high transition densities, for instance above 0.7, this already leads to a quadratic reduction factor (both operands) in runtime and storage during the characterization of a single multiplier, without any loss of accuracy, or, alternatively, to a higher resolution of the relevant low density values. The characterization of all functional units (adder, multiplier, shifter, logic) took approximately 4 days. The VLIW memory macros were generated for different data widths and sizes with the help of a commercial memory compiler, but they are characterized for average power only.

Reconfiguration Logic and Clock Tree Power We do not consider the power consumption of the reconfiguration logic during the macro-modeling because it is active only for a small fraction of the time, as shown empirically in Section 4.2. For very big arrays with more than 500 elements, its leakage power consumption during the functional processing phase could be reduced to virtually zero by the application of a power shut-off technique (not implemented in this thesis). The area overhead of the reconfiguration logic is for the most part independent of the functional hardware parameters. The average power consumption of the clock tree (< 10%) and of the interconnection logic (< 5%) is not dominant¹⁰.
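The simulation-count arithmetic can be replayed directly with the values n = 16, k ≈ 50, m ≈ 5 from the text:

```python
# Replay the simulation-count arithmetic for one 16-bit multiplier model.
n, k, m = 16, 50, 5  # bit width, simulation runs per point, output-density sets

sim_runs = (n * k + n ** 2) * m * n ** 2 // 2
print(sim_runs)  # 675840

# Register file case: ~2.9e9 one-second simulation runs expressed in years.
SECONDS_PER_YEAR = 365.25 * 24 * 3600
print(round(2.9e9 / SECONDS_PER_YEAR))  # 92
```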
This was also reported for the ADRES architecture [52] and for a NoC system for video processing in [179]. The reason is that our target frequencies of 100-300 MHz and area footprints of < 10 mm² by no means overstrain the margins of technology nodes ≤ 90 nm. All data were obtained by gate-level simulation after place and route with annotated physical information.
4.3.2 Database Design

An open-source relational database management system (HyperSQL) was used, which is written entirely in Java and freely available from the website [143]. Currently, our macro-model database contains 21.2·10⁶ tuples with a storage space requirement of 3 gigabytes, including a data backup portion of ≈ 500 megabytes. During the test runs of our framework, the average query speed was 1 000 to 2 000 queries per second on a standard consumer PC with 8 gigabytes of main memory and a Linux operating system. In our case study applications, typically a dozen queries were needed for a single PE (issue width of 6 slots, ≈ 8 VLIW instructions). Inside the evaluation framework, which is also written in Java, the data additionally has to be accumulated and stored. Still, a 100-core heterogeneous coarse-grained processor array with a circuit complexity of ≈ 0.5·10⁶ logic gates, implementing a signal processing algorithm, can be analyzed for power and area in less than a minute when the simulation-free switching load discretization of Table 4.1 is used. The query speed of the database therefore corresponds to ≈ 100/(issue width) software configurations per second. Thus, the database does not constitute a bottleneck in the framework. The corresponding gate-level procedure, that is, front end design, back end design, and (flat) simulation, if possible at all, cannot be done in much less than a day, and not on a consumer machine.

¹⁰ See Sections 5.1.1 and 4.2 for case study power values.

Switching Load | Register File (data portion)   | Functional Units
               | Pin (6 points) | Din (8 points) | Pin (11 points) | Din (14 points)
idle           | 0              | 0              | 0               | 0
low            | 0.2            | 0.1            | 0.2             | 0.1
average        | 0.3            | 0.2            | 0.4             | 0.16
high           | 0.6            | 0.3            | 0.6             | 0.3

Table 4.1: Look-up table resolution and switching load discretization.
4.3.3 Estimation Procedure
In Table 4.1, the four switching load discretization steps are given together with our 2D look-up table resolution, that is, the number of discrete values in the macro-model. An example of enable probabilities is given in Table 4.2 for two out of the four PEs computing the edge detection algorithm. The scheduling pattern of the functional units also implies the port utilization pattern of the corresponding register file, see Table 4.2. In the fourth PE, the register file is not accessed in five out of seven instructions (5/7: idle, 2/7: 30% port usage), see Table A.2. Given this data and the signal switching activity (estimated or simulated), the power consumption of these two register file "states" can be computed and scaled with the corresponding duty cycle values. The power consumption of each functional unit is directly scaled with its enable probability, except for the idle case, which is stored in the database, since leakage power is still consumed then.
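The scaling step just described can be sketched in a few lines of Java. The duty cycles and enable probabilities below are those of PE[0][1] from Table 4.2; the per-state power values in mW are invented placeholders, not characterized data.

```java
// Sketch of the estimation step of Section 4.3.3: macro-model power values
// are scaled with the duty cycles of the register file "states" and with
// the enable probabilities of the functional units.
public class PowerScaling {

    // P = sum_i duty_i * P(state_i): weighted average over the register
    // file states (e.g. idle vs. 30% port usage).
    static double regfilePower(double[] dutyCycles, double[] statePowerMw) {
        double p = 0.0;
        for (int i = 0; i < dutyCycles.length; i++) {
            p += dutyCycles[i] * statePowerMw[i];
        }
        return p;
    }

    // A functional unit consumes its active power scaled by the enable
    // probability; in the remaining cycles only idle (leakage) power.
    static double fuPower(double enableProb, double activeMw, double idleMw) {
        return enableProb * activeMw + (1.0 - enableProb) * idleMw;
    }

    public static void main(String[] args) {
        // PE[0][1]: register file at 30% port usage 2/7 of the time,
        // idle 5/7 of the time (Table 4.2); placeholder mW values.
        double regfile = regfilePower(new double[] {2.0 / 7, 5.0 / 7},
                                      new double[] {3.1, 0.4});
        // add unit #0 with enable probability 2/7 (Table 4.2).
        double adder = fuPower(2.0 / 7, 1.2, 0.05);
        System.out.printf("regfile: %.3f mW, adder: %.3f mW%n", regfile, adder);
    }
}
```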
PE      FU          #   Enable Probability
[0][0]  add         0   7/8
                    1   idle
        mul         0   5/8
                    1   idle
        shift, dpu      idle
        regfile         Duty Cycle: (5/8, 2/8, 1/8); Port Usage: (30%, 15%, idle)
[0][1]  add         0   2/7
                    1   idle
        mul         0   1/7
                    1   idle
        shift, dpu      idle
        regfile         Duty Cycle: (2/7, 5/7); Port Usage: (30%, idle)
Table 4.2: Power-relevant scheduling data for the ED algorithm.
4.3.4 Overall Power and Area Evaluation Framework
The overall structure of the power and area evaluation framework can be seen in Figure 4.16. In the case of WPPAs, the parameter space is defined by the architectural parameters (array size, PE structure, I/O properties, . . . ) and the parameters of the compiler (scheduling strategies, tile shapes, and similar). The following objective functions are evaluated simultaneously for each design alternative: area cost of the hardware [mm2], average static and dynamic power dissipation [mW], execution latency [cycles], energy consumption [μJ], and data path width or precision [bit], see also Chapter 6. The width of the data path plays an important role as an objective, since the prospective main application domain is digital audio and video processing, see [74]. Given an abstract description of the algorithm(s), scheduling strategies, and tile shapes on the software side, together with the corresponding hardware restrictions, like the number of available functional units, the size of the register file, and the interconnect structure, the compiler performs scheduling, allocation, and binding. In contrast to multi-core processors, which exploit mainly task-level parallelism, applications on a TCPA are able to exploit tightly-coupled processing on multiple levels of parallelism: array-level parallelism (loop level), instruction-level parallelism, subword parallelism, as well as functional and software pipelining. In order to match hierarchies of parallelism and efficiently exploit multiple levels of memory, the algorithms are also tiled in a hierarchical manner in a first mapping step. Subsequently, exact scheduling methods based on mixed integer programming are applied to ob-
[Figure 4.16 diagram: Algorithms, Compiler Constraints, Scheduler, Compiler, Software Parameters, and Hardware Parameters produce Power-relevant Scheduling Statistics and Hardware Configurations for the Exploration Engine (High-speed Objectives Estimation, Relational Database with SQL Interface, Hardware Macro-models Library), which delivers the objectives Power, Area, Latency, and Precision for Multidimensional Design Space Visualization and Exploration/Decision Making; data activity levels high/avg/low, schedule_x/schedule_y, and hw_cfg_a/hw_cfg_b span the design space.]
Figure 4.16: Overall exploration/estimation framework structure.
tain a throughput-optimal solution. The proposed method allows the simultaneous optimization of schedules on different levels, within the processors and at the processor array level, with respect to the different resource constraints, see also [129]. As a result, the following power-relevant information is obtained:
• the processor array size,
• the utilization fraction of every functional unit in every processing element,
• the corresponding utilization fraction of the register files, as well as
• the vertical and horizontal code sizes of the VLIW memories.
According to our experiments, this high-level information is already enough to perform a sufficiently accurate architecture-level power estimation for the corresponding processor array. The main reason for this is that, besides the reconfiguration logic, the WPPA architecture is free from sophisticated control path modules, caches, interrupt controllers, etc. It is mainly data-stream oriented and, instead of a sophisticated general-purpose control path, provides plenty of functional units to support the inherent parallelism of the corresponding algorithms. Contrary to [318], where a high-level simulation is used to determine the control signal statistics, our control signals are already statically given by the compiler. Now, the only thing missing
to perform power estimation is the switching activity data. Here, without loss of generality, two strategies are applicable:
• Either a cycle-accurate, compiled simulator with representative data inputs can be used, which is automatically generated by the WPPA Editor framework, or
• we use a heuristic which discretizes the switching load into a few possible levels, for instance idle, low, average, and high, see Table 4.1, as also proposed in [194]. This yields a higher abstraction and a superior acceleration with only a minor accuracy loss.
Building on this information, the exploration engine can estimate the area and power consumption by querying the macro-model database. After that, it can also visualize and aggregate the data, that is, total power, internal power, switching power, leakage power, and area, in three different "resolutions": array level, PE level, and functional unit level.
Observation 4.1 Our estimation approach gives the designer the otherwise missing but urgently needed quick insight into the power and area breakdown of each single WPPA instance.
The dashed arrows in Figure 4.16 show the possibility of varying the hardware and software parameters to a certain extent. Thus, the framework allows an automatic design space exploration, which is described in Chapter 6 in more detail.
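The second, simulation-free strategy amounts to a simple mapping from a continuous average toggle rate to the four discrete levels. The Java sketch below illustrates this; note that the threshold values are illustrative assumptions, not the exact boundaries used in the thesis.

```java
// Heuristic switching load discretization: map a continuous switching
// activity (average toggle probability per net and cycle) onto the four
// levels of Table 4.1. Threshold values are assumptions for illustration.
public class SwitchingLoad {
    public enum Level { IDLE, LOW, AVERAGE, HIGH }

    public static Level discretize(double toggleRate) {
        if (toggleRate <= 0.0)  return Level.IDLE;
        if (toggleRate <= 0.15) return Level.LOW;
        if (toggleRate <= 0.35) return Level.AVERAGE;
        return Level.HIGH;
    }

    public static void main(String[] args) {
        System.out.println(discretize(0.0));  // IDLE
        System.out.println(discretize(0.3));  // AVERAGE
    }
}
```

With only four levels per model input, a macro-model query degenerates to a small table lookup, which is what makes the simulation-free estimation path so fast.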
4.4 Efficiency and Accuracy Comparisons
In the following experiments, we consider three algorithms: edge detection, ED (4 PEs), an 8-point radix-2 Fast Fourier Transform, FFT, as well as a synthetic load, SL (24 PEs). All these algorithms run on the same WPPA array of size 3 × 8. If some processing elements are not utilized, they are power gated as described in Section 5.4. For these algorithms, we examined the physical layout implementations of the corresponding WPPA by gate-level simulations with real image data (B&W, 250×275 pixel resolution). At the front end, the design was synthesized with the help of Cadence RTL Compiler (ver. 8.1) and a 90 nm, 1.0 V supply voltage commercial standard cell library for a 200 MHz maximum target frequency. After place and route using Cadence SoC Encounter (ver. 8.1), the gate-level netlist with back-annotated delays was simulated with real image data in the Cadence Incisive Simulator (ver. 8.2). For accurate power analysis, particularly of the clock trees, the simulator-generated value change dump file (.vcd) was read back and processed by the SoC Encounter power estimator, which is based on the Cadence PowerMeter engine, a sign-off power analysis tool. In Table 4.3, the estimated power consumption resulting from applying our macro-modeling framework and the gate-level simulation
results are given.

Algorithm                     Power (mW)
                              Simulated   Estimated
                                          idle   low     average   high
ED (4/24 PEs)                 45          40     43.8    46.1      51.2
FFT (24/24 PEs)               257         234    274     306.2     328.3
Synthetic Load (24/24 PEs)    453         234    358.7   446.8     533.7

Table 4.3: Power estimation accuracy.

It can be seen that the estimation error lies below 10%, with a slight tendency towards over-estimation, mainly due to the uniform transition density distribution assumption. Figure 4.17(a) graphically shows the estimated power consumption for the FFT algorithm running on a 3×8 array, Figure 4.17(b) shows the same for edge detection, and Figure 4.17(c) gives the power estimates for the synthetic load algorithm. If some processing elements run the same software programs, their respective power estimates will also be equal, as in Figure 4.17(c), provided the switching load assumption is not changed. We also see that although all PEs are homogeneous, that is, have the same hardware architecture, up to 25% power variation occurs due to the different software programs in the case of the FFT in Figure 4.17(a). In Table 4.3, the accuracy comparisons are summarized. Due to the switching load discretization and simulation-free estimation, our methodology additionally gives us a very important and unique possibility of setting the switching load on all design blocks directly:
Observation 4.2 We are currently not aware of any other technique which could generate an input stream causing a predefined switching activity throughout a given circuit. The power consumption difference between an idle and a high switching load can be as high as 120%, as estimated for the synthetic load algorithm. This behavior cannot be captured by simulation-based approaches.
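The accuracy figures of Table 4.3 can be checked with a few lines of Java. The comparison pairs the simulated value of each algorithm with the estimate at its respective switching load level (low for ED and FFT, average for the synthetic load, matching the activity levels of Figure 4.17):

```java
// Relative estimation error (in percent) of the macro-model estimates of
// Table 4.3 against the gate-level simulation; positive = over-estimation.
public class EstimationError {
    public static double relativeErrorPercent(double estimated, double simulated) {
        return 100.0 * (estimated - simulated) / simulated;
    }

    public static void main(String[] args) {
        System.out.printf("ED  (low):     %+.1f %%%n", relativeErrorPercent(43.8, 45.0));
        System.out.printf("FFT (low):     %+.1f %%%n", relativeErrorPercent(274.0, 257.0));
        System.out.printf("SL  (average): %+.1f %%%n", relativeErrorPercent(446.8, 453.0));
        // All three errors stay below 10% in magnitude.
    }
}
```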
[Figure 4.17: three 3D bar plots of Total Power (mW, 0-21) over PE row and column number. (a) FFT array power estimation (low switching activity); (b) ED array power estimation (low switching activity); (c) synthetic load array power estimation (average switching activity).]
Figure 4.17: Estimated power values for three different algorithms running on the same WPPA array with 24 processing elements. Processor elements colored in blue are power gated (0.02 mW power consumption) (90 nm, 1.0 V, general purpose, nominal VT H , commercial CMOS standard cell library, 200 MHz target frequency, 25 ◦ C, typical-typical process).
[Figure 4.18: bar chart of Estimated Power (mW), split into Total, Internal, and Switching Power. Array: 43.8 / 38.42 / 5.05; PE 1: 11.66 / 9.97 / 1.6; PE 2: 10.21 / 9.21 / 0.91; PE 3: 11.73 / 10.03 / 1.62; PE 4: 10.21 / 9.21 / 0.91.]
Figure 4.18: Estimated power values for the edge detection algorithm running on four processing elements (low switching activity).

Cycle-Accurate Simulation
One of the distinct advantages of our approach is the fast and clear insight into the different power types, that is, internal, switching, and leakage power. Figure 4.18 shows such a breakdown for the four processing elements performing the edge detection algorithm. Contrary to the profiling results given in Figure 4.12 (PE issue width 2), this ED implementation runs on a more complex processing element configuration with a larger issue width of 6. The power consumption is therefore higher. The leakage power is not shown due to its small value. Figure 4.20 gives the cycle-accurate power profile for all four processing elements. It clearly shows that the estimates are very close to the cycle-accurate power values.
Observation 4.3 Most existing estimation frameworks supply only a single total power figure. They do not provide a breakdown into the different power types (internal, switching, leakage). The accuracy of the total power estimate may be high, but errors in the switching and internal power could cancel each other. It is therefore safer for the designer to also know the approximate breakdown into the different power types. Moreover, different power management techniques address different power types, as described in Section 2.3.1. This breakdown is therefore very important to apply the right techniques to the right designs at the right time.
[Figure 4.19: cycle-accurate traces of Power Consumption (mW, 0-45) over Time (4.1-5.2 μs) for Total, Internal, and Switching Power.]
Figure 4.19: Cycle-accurate array power-type breakdown of the edge detection algorithm, obtained by a gate-level simulation after the place and route (functional processing phase).
[Figure 4.20: cycle-accurate Total Power traces (8-12 mW) over Time (4.1-5.2 μs) for PE 1 and PE 4.]
Figure 4.20: Cycle-accurate power traces for PE 1 (PE[0][0]) and PE 4 (PE[0][1]) of the edge detection algorithm, obtained by a gate-level simulation after the place and route (90 nm, 1.0 V, general purpose, nominal VT H , commercial CMOS standard cell library, 200 MHz target frequency, 25 ◦ C, typical-typical process, functional processing phase).
4.5 Analytical Hardware Area Estimation
In order to perform a technology-independent cost analysis of the highly parameterizable hardware modules, they were hierarchically decomposed down to the level of the basic components given in Table 4.4(a), the building blocks of all digital hardware circuits. The costs are normalized to the complexity of an inverter, Cinv. If our modules are to be implemented in another technology, such as an FPGA, the corresponding costs of the basic components have to be evaluated and can then be inserted into the cost equations. As an example, the technology-independent, analytical hardware cost estimate of the adder/subtractor block of a WPPE with x adder/subtractor modules, n-bit operands, m-bit immediate operand width, and r-bit register address field width in an instruction is given by the following equation:
CaddBLK(x, n, m, r) ≤ x·Cadd(n) + (n − 2)·Cor(1) + Cnor(1) + Cff(4) + 3·Cxor(1)
                      + 3·Cmux2(1) + Cmux3(2) + Cmux3(m) + Cmux2(r) + 2·Cand(r) + Cmux2(n).   (4.12)
Since instructions can be executed with both register and immediate operands, the corresponding hardware cost equation contains the width m in bits of the immediate operand as a parameter. Equation 4.12 comprises the hardware cost of
• the underlying adder/subtractor module, Cadd(n) ≤ 8n for a carry look-ahead adder,
• the corresponding instruction decoder modules, and
• the status flag generation logic.
Similar equations were derived for the technology-independent, analytical hardware cost estimates of all WPPE and WPPA components. Some of the resulting equations for the functional units are given in Table 4.5. For a complete set of equations for all WPPA modules, we refer to [166].
4.6 Area Macro-Modeling
The area estimates are obtained during the same modeling procedure as for power characterization: the synthesis tool delivers estimated area figures after hardware synthesis. The area estimates also show a good accuracy and lie within a 10% estimation error. The estimated relative area consumption of a processing element with a 16-bit and a 24-bit data path is presented in Figure 4.21. In both cases, the register file and the instruction memory dominate the area: 84% and 77%, respectively. For a larger data path width, the area consumption of the multipliers increases significantly.
(a) Hardware cost of the basic components

Component    Name                 Cost
NOT          Cinv(n)              1n
AND, OR      Cand(n), Cor(n)      2n
XOR          Cxor(n)              4n
MUX21        Cmux2(n)             3n
FLIP-FLOP    Cff(n)               8n
RAM-CELL     Crcell(n)            2n

(b) Parameters of the functional units

Parameter                 Symbol
Data Width [bit]          n
Imm. Op. Width [bit]      m
Reg. Addr. Width [bit]    r
Table 4.4: Analytical hardware area estimation of the basic components normalized to the area of one inverter (a) and complexity parameters of functional units with corresponding notation (b).
Functional Unit Block    Hardware Area ≤
Adder/Subtractor         13n + 6m + 7r + 57
Multiplier               10n^2 − 9n + 6r + 13
Logic                    9n + 3m + 7r + 27
Shifter                  (6n + 1)·log2(2n) + 3(n + m) + 7r + 45
Table 4.5: Analytical hardware area estimation equations for different functional units.
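The equations of Table 4.5 translate directly into code. This Java sketch evaluates them for an assumed parameter set (n = 16, m = 16, r = 4, chosen only for illustration); the result is in inverter-equivalent units.

```java
// Analytical area equations of Table 4.5, normalized to the area of one
// inverter. n = data width, m = immediate operand width, r = register
// address width (all in bits).
public class AreaModel {
    static double adderSubtractor(int n, int m, int r) {
        return 13.0 * n + 6.0 * m + 7.0 * r + 57.0;
    }
    static double multiplier(int n, int r) {
        return 10.0 * n * n - 9.0 * n + 6.0 * r + 13.0;
    }
    static double logicUnit(int n, int m, int r) {
        return 9.0 * n + 3.0 * m + 7.0 * r + 27.0;
    }
    static double shifter(int n, int m, int r) {
        // log2(2n) computed via the natural logarithm.
        return (6.0 * n + 1.0) * (Math.log(2.0 * n) / Math.log(2.0))
               + 3.0 * (n + m) + 7.0 * r + 45.0;
    }
    public static void main(String[] args) {
        System.out.println(adderSubtractor(16, 16, 4)); // 389.0
        System.out.println(multiplier(16, 4));          // 2453.0
        System.out.println(shifter(16, 16, 4));
    }
}
```

As the quadratic n-term of the multiplier shows, it overtakes every other functional unit in area as soon as the data path gets wide, which matches the area breakdown of Figure 4.21.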
[Figure 4.21: pie charts. (a) Area breakdown of a 16-bit PE: mem 52%, regfile 32%, multipliers 10%, adders 4%, shifters 2%. (b) Area breakdown of a 24-bit PE: mem 41%, regfile 36%, multipliers 15%, adders 5%, shifters 3%.]
Figure 4.21: Estimated area breakdown of a PE from 3 × 8 WPPA.
4.7 Related Work
A very large body of work exists on architecture-level power and performance simulators, the most prominent being XTREM [86], SimplePower [322], and Wattch [54]. These and similar approaches mostly concentrate on the estimation of power and performance for RISC and VLIW [31], [158] architectures with typical hardware building blocks like different cache hierarchies and structures, address and data busses, and parameterized register files and functional units. On the software side, different compiler optimizations can be examined, for example loop unrolling. Regarding the application of such approaches to CGRAs in general, and particularly to WPPAs, the main shortcomings, besides a rather different architecture concept and an outdated technology of 0.35 μm, are the following: either a too simplistic power estimation model is used to provide scalability, which does not consider the impact of different switching activities on the power consumption, or the deployment of time-intensive simulations, possibly at different abstraction levels, is obligatory to arrive at the corresponding switching activities and per-cycle resource usage counts [54]. The simulation time can quickly become prohibitively large, as exemplarily shown in [93] with simulation times in the order of years. To overcome these and other problems, regression-based analytical power models are used, for instance in [264]. However, they also need extensive simulations, both for model construction and deployment, are strongly dependent on the choice of the representative input vectors used during the characterization phase, which is a common problem for virtually all macro-models, and are difficult to use in daily practice. The last problem is caused by the use of statistical analysis methods in the model generation phase. Often, the need arises to add some new reference points for the improvement of a specific model.
To accomplish this, the complete statistical regression procedure must be repeated. This can be very time consuming and, depending on the additional data points, may in the end either completely change the corresponding regression coefficients or have no impact on the model at all. Another strongly related research field is that of high-level synthesis. Besides the architecture, approaches in this domain mostly consider variations of the supply and threshold voltages and the target frequencies of the corresponding hardware designs. The major difference of such approaches from CGRAs, and in particular from TCPAs, is that high-level synthesis tools optimize the hardware implementation of a single algorithm, see for instance [245], whereas programmable architectures are generally optimized for an application domain. Due to this fact, many optimization approaches used in high-level synthesis tend to be too implementation specific, which in turn restricts their broader usage. The ultimate goal of high-level synthesis for a given algorithm is the minimization of latency under area/power constraints. Nevertheless, some very important key points can be observed:
• For complex modules, the control inputs generally influence the power consumption more than the switching load on the data ports [100], [318], [233], [46].
• Power is not equally sensitive to the different statistics observed at the architectural level [317], [167].
• Compared to switch- and gate-level simulation based approaches, which do not scale well beyond ≈ 10^6 logic cells, power macro-modeling provides a significant acceleration of up to several orders of magnitude [318].
• The mean estimation error usually lies below 10% as compared to commercial gate-level power estimators, see [318], [122], [36] and many others.
Although sound research on efficient high-level performance and, partially, power analysis and modeling has been performed for CGRAs, some very important aspects still remain unclear. The existing estimation and exploration frameworks either use a (gate-level) simulation-based energy/performance estimation [72], [66], [52], a regression-based, analytical power estimation [92], or consider only the performance measured in latency cycles [34]. In [92], a Field Programmable Gate Array (FPGA) implementation is presented in which the individual processing elements are not programmable. This resembles more the high-level synthesis approach. Furthermore, it deals only with homogeneous processor arrays by multiplying the estimate for a single PE with the loop iteration volume. The underlying assumption is that all PEs of the array exhibit the same power behavior. While this assumption might be justified for simpler, homogeneous architectures, it is not necessarily true for programmable and heterogeneous arrays, as was shown in Section 4.4. Furthermore, in big arrays not all PEs are utilized equally; some of them may be idle most of the time. Moreover, in our framework a direct comparison between homogeneous and heterogeneous implementations is possible at the architectural level, regarding both area and power consumption. In the PICO framework [158], the corresponding PEs in the hardware accelerator(s) are also non-programmable, and no possibility for power characterization is given besides inefficient gate-level or RTL simulation. The size of the hardware accelerators is not mentioned either. Furthermore, no particular performance numbers, like the mean estimation time for one PE, have been published so far, except for high-level synthesis tools, like [245], or similar.
4.8 Summary
We presented a novel framework for fast and efficient micro-architecture level evaluation of power/area/latency design trade-offs for WPPAs, which belong to the class of tightly-coupled processor arrays. Based on a relational database, it provides extremely fast and accurate area and average power estimates for different design alternatives. Since different power macro-models all have their intrinsic advantages and limitations, see for instance [167] for further discussion, the power estimation engine for CGRAs in particular should be very flexible, allowing the deployment of different power modeling techniques, for example entropy-based information-theoretic ones, table-based statistical ones, and combinations thereof. Using the example of a table-based, probabilistic macro-modeling technique with non-uniform parameter sampling, implemented by means of a relational database, we show that with this approach power estimation speeds in the milliseconds range are possible, within a 10% estimation error compared to a state-of-the-art commercial gate-level post-layout power estimator. The main contributions can be summarized as follows:
1. A scalable evaluation framework for extremely fast but still accurate power, area, and additionally latency characterization, with techniques proposed in [129], of different design alternatives in a multidimensional design space of highly parameterized coarse-grained reconfigurable processor arrays.
2. A substantial reduction of the memory and timing requirements of the macro-modeling procedure due to a new, non-uniform parameter sampling technique.
3. The usage of the built-in probabilistic power estimator of the RTL synthesis tool, which leads to an acceleration factor of 1 000× and largely avoids the computational overhead in the power macro-model creation phase otherwise introduced by extensive simulations and the generation of signals with predefined statistical properties. This made a proper characterization feasible in the first place.
4. The deployment of a relational database and the API of the synthesis tool, together with comprehensive experimental results, to combine both area and power macro-modeling in a single step and methodology.
The main results of this chapter were also published in [7].
5 Increasing the Energy Efficiency of Programmable Hardware Accelerators
This chapter deals with the question of how the power efficiency of embedded CGRA accelerators can be substantially improved by exploiting their specific micro-architectural properties. First, Section 5.1 addresses power-efficient dynamic reconfiguration control techniques in CGRAs. To substantially reduce the dynamic power consumption, we propose to partition the clock domains with a custom dual clock gating scheme and to combine this with clock gating of the instruction memories as well as automatic clock gating. The experimental validation is described in Section 5.2. Section 5.3 gives an overview of related work. Subsequently, in Section 5.4, the static power consumption is addressed with the help of the power gating technique. Finally, an original systematic approach to efficiently handle hundreds of power domains, reconfiguration-controlled power gating, is presented and discussed in Sections 5.4.3 and 5.4.4. Experimental results are given in Section 5.4.6.
5.1 Power Reduction in Coarse-Grained Reconfigurable Architectures
A common architectural structure for reconfiguration management can be observed in many coarse-grained reconfigurable architectures. Usually, there is a global or distributed memory where the different algorithm configurations are kept. Furthermore, some kind of configuration loading module is needed to implement a certain configuration protocol. It can be as simple as a pair of registers with some multiplexers, or quite complex, with a set of hierarchical state machines. Since each reconfigurable architecture has to be programmed or configured before performing any computation, two work phases can clearly be distinguished: the (re-)configuration phase and the subsequent functional processing phase. A common property of all kinds of reconfiguration logic (coarse-grained or even fine-grained, as in FPGAs) is that it is idle most of the time. Even if the reconfiguration is
used on a per-frame basis for video decoding, the portion of time spent on reconfiguration will roughly amount to a few percent of the frame processing time. This also holds for any data-flow dominated application, for example from the area of digital audio and video processing. Regarding the power consumption aspect, this observation suggests that it is worth applying power reduction techniques especially to the reconfiguration logic in such coarse-grained reconfigurable architectures.
Component-Level Power Minimization
Regarding the relative power consumption in a CGRA, we can identify the sequential logic modules (memory, register files, single configuration control registers, and state machines) as the components with the highest fraction of static and also total power consumption, as well as area contribution. Compare, for instance, the data for the ADRES 4 × 4 design given in [52], accelerating a video processing application: the reported relative total power and area contribution of the configuration memory amounts to 37%. Given that in CGRAs these components are active only during a small percentage of the computation time, they are ideal objects for module-level power reduction techniques, like, for example, clock and power gating.
5.1.1 Automatic Clock Gating
Meanwhile, clock-gating technology in EDA tools has evolved to a state where it is completely automated and does not break the design flow methodology. These tools automatically find flip-flops that have an enable control and are implemented with a feedback from output to input and a multiplexer that selects whether the previous output is to be kept or a new input is to be accepted. They then replace these with a clock gate cell and a standard flip-flop. No manual enable signals are needed, since this control logic is already present in the design. Nevertheless, as already mentioned, one should bear in mind some subtle issues of automatic clock gating, as will be shown in Section 5.2. Furthermore, automatic clock gating can only be applied to flip-flops and not to complete macro blocks, like on-chip caches and other types of memory, which normally contribute the most to the power and area consumption. To use the advantages of clock gating also for macro blocks in the design, application-specific knowledge has to be exploited, and the logic has to be partitioned into different clock domains. Customized clock gating statements are therefore inserted, for example in the HDL source code, which are then translated by the synthesis tool into the clock gating cells of the underlying library. In the case of CGRAs, this does not even introduce any additional control signals, since such signals already naturally exist to separate the two phases.
112
)XQFWLRQDO SKDVH
&RQILJXUDWLRQ SKDVH
5.1 Power Reduction in Coarse-Grained Reconfigurable Architectures
HQG RI UHFRQILJXUDWLRQ SKDVH
FRQILJBGRQH JDWHG
FRQILJBFON FONBJDWHBHQ IXBFON
JDWHG
IXBUVW SF
XQGHILQHG
Figure 5.1: Dual clock gating scheme with the corresponding functional and configuration phases.
5.1.2 Hybrid Hierarchical Clock Gating
Regarding the application of clock gating in coarse-grained reconfigurable systems, an important observation is that the configuration and functional processing phases normally do not overlap. Therefore, depending on the underlying hardware architecture, the whole design can readily be partitioned into at least two separate clock domains. We propose to use a dual clocking scheme, as schematically depicted in Figure 5.1. The idea is to have a separate clock (config_clk) for the logic performing configuration and another clock (fu_clk) for the functional modules.
Observation 5.1 Although the use of different clock trees in a design is not new, to the best of our knowledge it has not been applied to coarse-grained reconfigurable architectures so far to physically separate the reconfiguration and processing phases and gain a substantial reduction in both power and energy.
After a global reset and during the configuration phase, config_clk is relinquished and fu_clk is gated. After the configuration is done, the clk_gate_en signal is asserted, gating config_clk and relinquishing fu_clk. To ensure the correct behavior of the functional logic, it is important to assert the existing synchronous reset signals for some clock cycles of fu_clk. The whole control can be accomplished by the reconfiguration logic itself, so that the control overhead is kept at a minimum. As opposed to fine-grained automatic clock gating, this approach is more advantageous regarding power and energy reduction, since gating the clock trees at the highest hierarchy level prevents a much larger number of clock tree buffers and hard macro blocks from unnecessary toggling. However, in the following, we exemplarily
[Figure 5.2 diagram: a Global Clock Manager gates (en/GCK) the Global Configuration Clock Domain (Configuration Manager, Configuration Controller, Lookup Table, Configuration Memory); inside each WPPE of the Weakly Programmable Processor Array, local clock gates control the Local Configuration Clock Domain (Configuration Loader, Local Clock Manager), the Local Memory Clock Domain (VLIW Memory), and the Local Functional Clock Domain (Functional Logic).]
Figure 5.2: Reconfiguration management and clock domain partitioning.
show that the highest power and energy reduction can be achieved by combining the custom clock gating of the reconfiguration logic with additional automatic clock gating of the whole design into a hybrid clock gating scheme, as discussed in the following sections. To verify its feasibility, this method was applied to our WPPA architecture and led to more than a threefold power reduction compared to the automatic or full custom clock gating techniques applied on their own.
Logic Partitioning for Clock Gating
To apply our proposed hybrid dual clock gating method to WPPAs, the whole array design was partitioned into the following clock domains: the global configuration clock domain on the array level, and three local clock domains in each PE, see Figure 5.2. According to this scheme, on the array level, the configuration manager, the configuration controller, the auxiliary lookup table, and the global configuration memory can all be gated by a single latch-based clock gating cell (GCK). The information about the current configuration status from the configuration manager, as well as internal or external configuration requests, is used by the global clock manager module to switch this clock domain on or off. On the processing element level, in addition to a local configuration clock domain, a functional clock domain is also needed. Since the configuration and processing phases in CGRAs normally do not overlap, these two clock domains are inversely related to each other and can be controlled by a single clk_gate_en signal from Figure 5.1 and its inversion, respectively. A third local clock domain is introduced for the instruction memory, which can be separated from the clock just after it has been configured and before the functional processing phase starts.
This intermediate state in single processing elements can last considerably long, if a sufficiently large processor array, for example, of size 10 × 10, is being configured with many different algorithms.
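The gating behavior described above can be sketched in a few lines of executable pseudocode. This is a behavioral illustration only (Python instead of RTL); apart from `clk_gate_en` from Figure 5.1, all class and method names are invented for the sketch.

```python
# Behavioral sketch (Python, not RTL) of the latch-based clock gating
# cell (GCK) and of the inverse coupling between the configuration and
# functional clock domains of one PE. Only `clk_gate_en` is a signal
# name from the text (Figure 5.1); everything else is illustrative.

class ClockGateCell:
    """Latch-based clock gate: the enable is captured while the clock
    is low, so glitches on `enable` cannot reach the gated clock."""

    def __init__(self):
        self.latched_en = 0

    def gated_clock(self, clk, enable):
        if clk == 0:                      # transparent (low) phase
            self.latched_en = enable
        return clk & self.latched_en      # AND with the latched enable


class ProcessingElementClocks:
    """Configuration and processing phases do not overlap in a CGRA,
    so one enable signal and its inversion drive the two domains."""

    def __init__(self):
        self.cfg_gate = ClockGateCell()
        self.fun_gate = ClockGateCell()

    def step(self, clk, clk_gate_en):
        cfg_clk = self.cfg_gate.gated_clock(clk, clk_gate_en)
        fun_clk = self.fun_gate.gated_clock(clk, 1 - clk_gate_en)
        return cfg_clk, fun_clk


pe = ProcessingElementClocks()
# Configuration phase: only the configuration clock toggles.
print(pe.step(0, 1), pe.step(1, 1))   # (0, 0) (1, 0)
# Functional phase: only the functional clock toggles.
print(pe.step(0, 0), pe.step(1, 0))   # (0, 0) (0, 1)
```

Because the enable is latched in the low phase of the clock, a change of `clk_gate_en` while the clock is high cannot produce a truncated clock pulse, which is exactly why latch-based gating cells are used instead of a plain AND gate.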
Parameter                  WPPA 2 × 2    WPPA 3 × 8
Adders/Element             1             2
Multipliers/Element        1             2
Data path width [bit]      16            16
FU instr. width [bit]      16            16
Regs. number/Element       5             15
Instr. memory/Element      16 × 64       16 × 128
Config. memory (ROM)       512 × 32      1024 × 32

Table 5.1: Static case study parameters for processing elements and WPPA array.
5.2 Experimental Results

To determine the actual power savings of the proposed hybrid clock gating technique, two WPPA designs were simulated on the gate level: a 2 × 2 array with four processing elements, with the configuration given in Table 5.1, computing edge detection (ED), and a 3 × 8 array with 24 processing elements for the computation of an FFT. In contrast to the power profiling results given in Section 4.2, where special power gating cells were already used, a 90 nm commercial standard cell library from a different vendor was used here, which at the time of the experiments did not provide power gating capabilities. Therefore, the simulation results differ slightly from the "newer" results given in Section 4.2. However, the average power consumptions are very close in both cases. This also serves as additional evidence that the proposed power reduction techniques do not rely on any specific ASIC library.
5.2.1 Tool Flow

The tool flow used for power analysis is depicted in Figure 5.3. At the front end, the design was synthesized with Synopsys Design Compiler [271] (ver. 2006) and a 90 nm, 1.0 V supply voltage standard cell library for a 200 MHz maximum target frequency (corresponding to XGA resolution, 1024 × 768, at 30 fps, greyscale). Two other possible operational modes are the 125 MHz mode for SVGA resolution at 30 fps, greyscale, and the 75 MHz mode for VGA resolution at 30 fps, greyscale. For the two lower frequency operational modes, no voltage scaling was performed. After place and route using Cadence SoC Encounter [63] (ver. 5.20), the gate-level netlist with back-annotated delays was simulated with real image data in ModelSim [198] (ver. 5.8e). For accurate power analysis, particularly of the clock tree, the simulator-generated value change dump file (.vcd) was read back and processed by the SoC Encounter (ver. 7.1) power estimator, which is based on the Cadence PowerMeter engine, a sign-off power analysis tool.

[Figure 5.3: Case study tool flow overview for power analysis. The VHDL RTL source is synthesized with Synopsys Design Compiler (synthesis, automatic/custom clock gate insertion) using 90 nm CMOS LEF/LIB libraries; the Verilog netlists of the 2 × 2 WPPA and the 3 × 8 WPPA (5 × 5 floorplan) pass through Cadence SoC Encounter (place and route, delay and parasitics extraction: .sdf, .spef, .sdc), MentorGraphics ModelSim (delay back-annotation, gate-level simulation: .vcd), and PowerMeter (gate-level simulation-based power analysis).]

To compare the power consumption of designs with and without different clock gating methods, the tool flow depicted in Figure 5.3 was applied to four designs in the case of the 2 × 2 WPPA with four processing elements:

1. without any clock gating,
2. with custom-inserted clock gating cells according to Figure 5.1 (but without clock gating of the instruction memories),
3. with automatically inserted clock gating cells (minimum width of a gated register = 3), and, finally,
4. with a combination of the latter two.

Additionally, for the 3 × 8 WPPA, the gating of configured instruction memories in the idle periods between the configuration of a single PE and that of the whole processor array was tested in a separate design and simulation run. The testbench for the 2 × 2 WPPA contained a configuration phase of the array for the ED algorithm (four configuration domains) of ≈ 4 μs with a subsequent image processing phase of ≈ 16 μs. For the 3 × 8 array, the corresponding configuration with the FFT application took ≈ 9 μs (14 configuration domains) with a subsequent processing time of 11 μs, see also Figure 3.5 in Chapter 3.
[Figure 5.4: Power consumption decrease (time-averaged power reduction in %) in a 2 × 2 WPPA for different clock gating strategies and target frequencies (75, 125, 200 MHz). Automatic: 9.3-10.3%, custom: 11.2-24.3%, hybrid: 27.1-35.1%.]
5.2.2 Results for a 2 × 2 WPPA Array

The power reduction of the automatic and full custom schemes amounts to 9% and 11% at the 200 MHz clock frequency, respectively, whereas combining automatic and full custom clock gating achieves a more than threefold power reduction of 35% at 200 MHz. While for automatic clock gating the power reduction hardly varies with clock frequency, it ranges from 11% to 24% and from 27% to 35% for the full custom and hybrid cases, respectively, see Figures 5.4 and 5.5. In the case of the hybrid scheme, the power savings slightly increase with increasing frequency, since more dynamic power can be saved. For all three frequencies, the hybrid approach outperforms the rest.

[Figure 5.5: Total power consumption in a 2 × 2 WPPA for different clock gating strategies and target frequencies. At 200 MHz: 19.75 mW (no gating), 17.91 mW (automatic), 17.53 mW (custom), 12.82 mW (hybrid); at 125 MHz: 12.32, 11.13, 9.33, 8.42 mW; at 75 MHz: 7.54, 6.77, 6.13, 5.5 mW.]

The absolute power values of the different reconfiguration modules are shown in Figure 5.6. According to these figures, hybrid clock gating achieves an 80% to 90% total power reduction for the reconfiguration logic at 200 MHz. With hybrid clock gating, 99% of the registers in the design are gated, more than in the automatic (79.89%) and full custom (92.75%) cases, see Table 5.2. A small area decrease of ≈ 3% for automatic and hybrid gating is also noteworthy, as is typically reported for clock-gated designs. However, an interesting question arises: what is the actual reason for the poor power reduction of automatic clock gating? The answer is shown in Table 5.3. Contrary to intuitive expectation, automatic clock gate insertion increases the switching activity and switching power of some logic modules up to 18-fold (or 1855%). Automatic clock gating increases the switching power of the reconfiguration logic by 180%, whereas full custom and hybrid clock gating increase it by only 34% and 12%, respectively. In this case, fine-grained clock gating alone substantially worsens the switching power dissipation. Since the power consumption in memory modules and sequential cells is mostly dominated by internal power and not by switching power, automatic clock gating still achieves a power reduction of 9%, but further improvements are nullified by the switching power increase. In some CAD synthesis tools, a hierarchical option for automatic clock gating can be selected, meaning that the top design and all subdesigns are processed in one step to identify common enable conditions and insert CG cells on higher hierarchy levels. This option was also tested separately but gave no remarkable improvements in this special case.
Design       Gate area    Area overhead   CG elements   Reg. total   Gated reg.
Not gated    0.233 mm²    0%              0             2767         0%
Automatic    0.225 mm²    −3.56%          181           2770         79.89%
Full custom  0.239 mm²    2.85%           9             2788         92.75%
Hybrid       0.226 mm²    −2.84%          195           2788         99.00%

Table 5.2: Statistics for different clock gating strategies. The area decrease for the automatic and hybrid case is due to the removal of obsolete multiplexers at the inputs of clock-gated registers.
[Figure 5.6: Total power (mW) consumed in the individual reconfiguration modules (cfg mgr, lu tab, cfg mem, cfg ctrl, ldr 1 1, ldr 1 2, ldr 2 1, ldr 2 2) for the different clock gating (CG) strategies (no CG, automatic CG, full custom CG, hybrid CG) at the 200 MHz target frequency.]
5.2.3 Results for a 3 × 8 WPPA Array

The application of the proposed hybrid clock gating technique led to a considerable power reduction of 38% in the case of the 3 × 8 WPPA, compared to only 9%-11% achieved by automatic clock gating alone, see Figure 5.7. Thus, the proposed methodology works well for designs of very different chip sizes, and the proposed clock gating scheme shows very good scalability. To assess the gating efficiency of the individual clock gating cells inserted into the design (automatically or by hand), and to have a handy measure for the clock gating "coarseness", we propose a logarithmic measure derived from the overall number of gated flip-flops and gating cells:

Definition 5.1 (Clock Gating Coarseness) The clock gating coarseness of a design is defined as the common logarithm of the ratio of the number of gated flip-flops to the number of clock gating cells, and is denoted by C_CG.

The corresponding C_CG values for the 3 × 8 and 2 × 2 WPPA are shown in Figure 5.8. According to this figure, the clock gating coarseness decreases in the following order: full custom, hybrid, automatic. But the highest gating efficiency does not automatically result in the highest power reduction, as was already shown experimentally for the 2 × 2 WPPA. Concerning the power savings, mid-range values of 15 gated flip-flops/gating cell (2 × 2 WPPA) and 35 gated flip-flops/gating cell (3 × 8 WPPA) seem to be optimal for the given architecture. Automatic clock gating (12 and 18 flip-flops/gating cell) is too fine-grained and therefore introduces too many additional cells which cannot be switched off for a sufficiently long period of time. The full custom scheme, with 287 and 484 flip-flops/gating cell, is too coarse-grained, which forces the place-and-route tool to build several additional clock sub-trees for a given design. The hybrid scheme uses the advantages of fine-grained gating during the actual computation (also in the functional logic) and shuts off entire clock sub-trees during the idle periods which form the usual state of the reconfiguration logic. This is also confirmed by the overall number of gated registers given in Figure 5.9. Automatic clock gating is able to identify only ≈ 50% of the potential flip-flops, whereas the hybrid method covers almost 100% of the sequential cells. Regarding the distribution of power savings across the different kinds of logic cells in the design, that is, sequential, macro-cells, and combinational, we can state that, as in the case of the smaller 2 × 2 array, the power consumption of the 3 × 8 WPPA is still dominated by the internal power consumption of the modules. Although automatic clock gating reduces the overall switching power of the design to a certain extent, it does not sufficiently reduce the internal power, whereas the full custom and hybrid schemes manage to reduce both switching and internal power at the same time. Full custom gating also shows the best capability of preventing combinational logic from unnecessary toggling.

Clock Tree Power Impact Figure 5.10 shows the respective power consumption of the clock tree in the case of the FFT algorithm during the reconfiguration phase. The proposed scheme of hybrid clock gating of reconfiguration logic and instruction memory leads to a considerable decrease in the switching power of the clock tree.
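Definition 5.1 above can be checked numerically against the register statistics of Table 5.2; the following sketch recomputes the 2 × 2 WPPA coarseness values shown in Figure 5.8.

```python
# Recomputing Definition 5.1 for the 2 x 2 WPPA from the register
# statistics of Table 5.2 (total registers, gated fraction, number of
# clock gating cells); the results reproduce Figure 5.8.
from math import log10

def clock_gating_coarseness(gated_flip_flops, gating_cells):
    """C_CG = log10(#gated flip-flops / #clock gating cells)."""
    return log10(gated_flip_flops / gating_cells)

schemes = {
    "automatic":   (2770, 0.7989, 181),
    "full custom": (2788, 0.9275, 9),
    "hybrid":      (2788, 0.9900, 195),
}
for name, (regs, gated_frac, cells) in schemes.items():
    c_cg = clock_gating_coarseness(regs * gated_frac, cells)
    print(f"{name:11s}: {regs * gated_frac / cells:5.1f} FFs/cell, "
          f"C_CG = {c_cg:.2f}")
# automatic  :  12.2 FFs/cell, C_CG = 1.09
# full custom: 287.3 FFs/cell, C_CG = 2.46
# hybrid     :  14.2 FFs/cell, C_CG = 1.15
```

The recomputed values 1.09, 2.46, and 1.15 match the 2 × 2 WPPA bars of Figure 5.8 and confirm that the hybrid scheme ends up close to the mid-range ratio of roughly 15 flip-flops per gating cell named as optimal above.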
Although internal and leakage power increase, the switching activity reduction leads to an overall decrease in power consumption of more than 50%, from 18.24 mW to 9.02 mW. The increase in internal and leakage power is explained by the additional gating cells and clock buffers. But also during functional processing, the proposed scheme does not deteriorate the clock tree power consumption much: only a slight increase from 18.24 mW to 19.37 mW, as shown in Figure 5.11. In case only a small group of processing elements is used by an algorithm, as for the edge detection example running on four out of the 24 PEs of the 3 × 8 WPPA, the advantage of hybrid clock gating of reconfiguration logic and instruction memories is even higher. Figure 5.12 gives the breakdown of clock tree power in this case. The overall power reduction during reconfiguration amounts to more than fourfold: from 18.24 mW to 4.39 mW. For the functional processing phase, the reduction is similarly large: 18.24 mW compared to 5.1 mW, see Figure 5.13.

Note: "Me CG" denotes the additional clock gating of the configured instruction memories, tested for the 3 × 8 WPPA only.
Module      Automatic     Full Custom   Hybrid
cfg mgr     −5.67%        30.71%        1.47%
lu tab      −272.59%      2.60%         −104.14%
cfg mem     7.71%         43.06%        39.48%
cfg ctrl    −70.95%       10.14%        −33.74%
ldr 1 1     −1855.65%     −29.43%       −153.71%
ldr 1 2     −1450.94%     20.75%        −151.19%
ldr 2 1     −1359.73%     −4.87%        −295.47%
ldr 2 2     −1476.15%     11.26%        −196.95%
rc logic    −180.60%      −34.07%       −12.22%

Table 5.3: Decrease (+) and increase (−) of switching power for different clock gating schemes, 2 × 2 WPPA, and 200 MHz target frequency.
[Figure 5.7: Decrease of total power in percent for the different clock gating schemes and WPPAs compared to a non-gated design, 200 MHz frequency. Automatic: 9.32% (2 × 2) / 10.68% (3 × 8); custom: 11.24% / 29.99%; hybrid: 35.09% / 37.82%.]
[Figure 5.8: Clock gating coarseness values for the different clock gating schemes and WPPAs. Automatic: 1.09 (2 × 2) / 1.27 (3 × 8); custom: 2.46 / 2.63; hybrid: 1.15 / 1.55.]
[Figure 5.9: Number of gated and ungated flip-flops for the different clock gating schemes in the 3 × 8 WPPA. No gating: 0 gated, 23,191 ungated; automatic: 12,410 gated, 10,781 ungated; custom: 20,791 gated, 2,428 ungated; memory: 20,791 gated, 2,428 ungated; hybrid: 23,095 gated, 124 ungated.]
[Figure 5.10: Average clock tree power consumption (internal, switching, leakage, and total components) during the reconfiguration phase, 24 out of 24 PEs used, for no CG, Me CG, and hybrid CG. Total: 18.24 mW (no CG) versus 9.02 mW (hybrid CG).]
[Figure 5.11: Average clock tree power consumption during the functional phase, 24 out of 24 PEs used. Total: 18.24 mW (no CG) versus 19.37 mW (hybrid CG).]
[Figure 5.12: Average clock tree power consumption during the reconfiguration phase, 4 out of 24 PEs used. Total: 18.24 mW (no CG) versus 4.39 mW (hybrid CG).]
[Figure 5.13: Average clock tree power consumption during the functional phase, 4 out of 24 PEs used. Total: 18.24 mW (no CG) versus 5.1 mW (hybrid CG).]
Gate-Level Power Traces The major benefits of also gating the instruction memories during reconfiguration become apparent from the gate-level power traces resolved over time, as shown in Figure 5.14. It shows the power consumption of the edge detection algorithm running on four out of the 24 processing elements of the 3 × 8 WPPA. The corresponding design was simulated after place and route with back-annotated physical information. However, these traces are not cycle-accurate. Instead, averaging over 100 clock cycles was used to achieve a longer trace duration. The respective traces clearly show the difference between the non-gated design (≈ 255 mW), the hybridly gated design (≈ 190 mW), and the hybrid and memory gated design (≈ 50 mW).

Observation 5.2 If an algorithm does not utilize all processing elements of the given array, the power used to clock the instruction memories can be substantial: up to 55%. If the instruction memories of unused processing elements are gated, the power consumption of the array can be reduced significantly. During reconfiguration, the total power is reduced 10×: from 255 mW to 25 mW.

Figures 5.15 and 5.16 show the breakdown into internal and switching power parts. In the case of full array utilization, that is, if 24 out of 24 processing elements are used, gating of the instruction memories still gives a substantial gain during the reconfiguration phase, see Figure 5.17: the reconfiguration power is reduced from 260 mW (no gating) and 175 mW (hybrid gating) to only 45 mW (memory gating). This corresponds to a 5.8× and 3.8× reduction. During functional processing, memory clock gating does not have any deteriorating effects compared to hybrid clock gating. Instead, it even slightly reduces the overall power consumption, see Figures 5.17, 5.18, and 5.19.
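The reduction factors quoted above follow directly from the plateau values of the gate-level power traces; a quick arithmetic cross-check:

```python
# Cross-check of the reduction factors quoted above, using the plateau
# values read off the gate-level power traces (Figures 5.14 and 5.17).

def reduction_factor(before_mw, after_mw):
    return before_mw / after_mw

# ED algorithm, 4 of 24 PEs, reconfiguration phase (Figure 5.14):
ed = reduction_factor(255, 25)       # no gating -> memory gating, ~10x
# FFT algorithm, 24 of 24 PEs, reconfiguration phase (Figure 5.17):
fft_no  = reduction_factor(260, 45)  # no gating -> memory gating, ~5.8x
fft_hyb = reduction_factor(175, 45)  # hybrid -> memory gating, ~3.8-3.9x
print(f"ED {ed:.1f}x, FFT {fft_no:.1f}x / {fft_hyb:.1f}x")
```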
Figure 5.18 shows that internal power is reduced the most, whereas switching power during reconfiguration stays approximately constant at 5-8 mW compared to hybrid clock gating, see Figure 5.19.

Software and Switching Activity Impact The effect of different software, and therefore of different switching activity of the processing elements and functional units, is best illustrated by a direct comparison of the power traces of the FFT and synthetic load algorithms given in Figure 5.20. Although the power consumption during reconfiguration is approximately equal, the respective power during the processing phase differs by 90%: 235 mW (FFT) versus 450 mW (SL). Internal power increases by 1.5× and switching power even by 4×, see Figures 5.21 and 5.22. The differing reconfiguration times also reflect that the software of the FFT and synthetic load algorithms is different.

Note: For large designs, cycle-accurate power profiling quickly becomes computationally infeasible on a consumer machine; in this case already for simulation lengths of 6-7 μs with ≈ 150 000 logic gates.
[Figure 5.14: Total power consumption over time with the different clock gating strategies for the ED algorithm: ≈ 255 mW (no clock gating), ≈ 190 mW (hybrid clock gating), and ≈ 50 mW (memory clock gating), with ≈ 25 mW during reconfiguration.]
[Figure 5.15: Internal power consumption over time with the different clock gating strategies for the ED algorithm: ≈ 235 mW (no clock gating), ≈ 170 mW (hybrid clock gating), and ≈ 40 mW (memory clock gating), with ≈ 20 mW during reconfiguration.]
[Figure 5.16: Switching power consumption over time with the different clock gating strategies for the ED algorithm.]
[Figure 5.17: Total power consumption over time with the different clock gating strategies for the FFT algorithm. During reconfiguration: 260 mW (no clock gating), 175 mW (hybrid clock gating), 45 mW (memory clock gating); further annotated levels: 290 mW and 230 mW.]
[Figure 5.18: Internal power consumption over time with the different clock gating strategies for the FFT algorithm. Annotated levels: 245, 230, 210, 160, and 35 mW.]
[Figure 5.19: Switching power consumption over time with the different clock gating strategies for the FFT algorithm. Annotated levels: ≈ 25 mW and ≈ 5 mW.]
[Figure 5.20: Total power consumption comparison of the FFT and synthetic load algorithms, both with memory clock gating. Processing phase: ≈ 450 mW (synthetic load) versus ≈ 235 mW (FFT); ≈ 65 mW during reconfiguration.]
[Figure 5.21: Internal power comparison of the FFT and synthetic load algorithms: ≈ 300 mW (synthetic load) versus ≈ 200 mW (FFT) during processing; ≈ 50 mW during reconfiguration.]
[Figure 5.22: Switching power comparison of the FFT and synthetic load algorithms: ≈ 150 mW (synthetic load) versus ≈ 35 mW (FFT) during processing; ≈ 15 mW during reconfiguration.]
5.3 Related Work

Although many academic and commercial coarse-grained reconfigurable architectures exist, see, for example, the overview given in [161], the aspect of applying specially suited power reduction techniques to such architectures has not been covered in depth so far. Often, single figures and breakdowns of power consumption are given, as in [267] or [52], but no attempts were made to exploit architecture-specific features to reach a better power and energy efficiency, apart from the configuration cache optimization given in [161].
5.4 Standby and Idle Power Optimization

In this section, we present an original, systematic approach to efficiently handle a very large number of power domains in modern massively parallel TCPA systems, in order to tightly match the different computational demands of the processed algorithms with the corresponding power consumption. It is based on a new, highly scalable and generic power control network and additionally uses a state-of-the-art Common Power Format (CPF) based front-to-back-end design methodology for a fully automated implementation. The power management is transparent to the user and is seamlessly integrated into the overall reconfiguration process: reconfiguration-controlled power gating.
Furthermore, for the first time, a TCPA case study design with as many as 24 switchable power domains is presented, with detailed results on power savings and overheads. The application of the proposed technique results in a 60% active leakage and 90% standby leakage power reduction for several digital signal processing algorithms. The main methodology was also applied to a 10 × 10 WPPA with 100 power domains and was successfully synthesized in a 90 nm commercial standard cell technology. Power gating constitutes an appropriate means to handle the increasing leakage power consumption both during active and standby modes, as already mentioned in Section 2.3.1. However, the application of power gating has important implications, particularly on the physical level of digital circuit design. Some of them are briefly summarized in the following section.
5.4.1 Related Work

One of the main problems in power-constrained designs is that of proper power delivery (and heat removal) as well as power integrity. Since the major goal of low-power design is to obviate the need for expensive heat removal systems, designers have to concentrate on power integrity and power delivery aspects.

Power Integrity The main sources of concern regarding power integrity are the substantially reduced noise margins and growing current densities of modern CMOS technology nodes. Operating voltages have been below one volt for quite some time, and the voltages for low-throughput and state-retention operating modes already reach into the sub-threshold region, so the respective noise margins are getting very tight. Simultaneously, the current densities grow, leading to increased supply voltage drops and electromigration issues, especially for designs using power gating. To get an idea of the currents flowing in a present-day high-performance microprocessor, assume ≈ 100 W power consumption at a 1 V supply voltage. This yields a direct current of 100 A, a typical value for arc welding machinery. In this case, a series resistance of 1 mΩ results in a voltage drop of 100 mV, or 10% of the supply voltage. Such a 10% change of supply voltage causes a 6% change of frequency and already an 18% change of power dissipation, see [219, 232, 314, 208]. Power integrity checks deal with these issues. The goal of power integrity checks and power grid analysis is to analyze the following (time-dependent) equation and ensure the correct operation of the circuit in all operating conditions, see [237, 225]:

V(t) = I(t) · R + R · C · dV(t)/dt + L · dI(t)/dt .    (5.1)
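The resistive and inductive terms of Equation (5.1) can be illustrated numerically. The 100 W, 1 V, 1 mΩ figures below are the ones from the text; the inductance and current-ramp values of the droop example are assumptions chosen purely for illustration.

```python
# Numerical illustration of the terms of Equation (5.1). The 100 W,
# 1 V, 1 mOhm figures are from the text; the inductance and current
# ramp of the droop example are invented for illustration.

def static_ir_drop(power_w, vdd_v, r_ohm):
    """Static I*R term: current drawn at VDD times series resistance."""
    return (power_w / vdd_v) * r_ohm

def inductive_droop(l_henry, delta_i_a, delta_t_s):
    """L * dI/dt term for a current ramp of delta_i_a over delta_t_s."""
    return l_henry * delta_i_a / delta_t_s

drop = static_ir_drop(100.0, 1.0, 1e-3)       # 100 A through 1 mOhm
print(f"static drop: {drop * 1e3:.0f} mV ({drop / 1.0:.0%} of VDD)")
# Hypothetical wake-up surge: a 10 A ramp within 10 ns through 100 pH.
droop = inductive_droop(100e-12, 10.0, 10e-9)
print(f"inductive droop: {droop * 1e3:.0f} mV")
```

The first print reproduces the 100 mV (10% of VDD) figure of the text; the second shows how, for fast current ramps, the L · dI/dt term alone can reach a similar magnitude, which is why this term dominates at high chip operating frequencies.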
Dynamic fluctuations in the power supply are mainly caused when a certain localized region of a chip, for instance a processor core, starts to draw a large amount of current, as in the case of power gating during the wake-up phase. If the change of current takes place sufficiently fast, the needed current cannot be delivered in time by the power delivery network. This leads to dynamic voltage droops, that is, supply voltage variations. These variations in voltage directly translate to variations in device timing and reliability, see [79]. Therefore, early realistic estimations of the worst case current draws are very important. The L · dI(t)/dt part of Equation 5.1 is especially relevant for high chip operating frequencies. The overall trend is that, given decreasing supply voltages and increasing current magnitudes, the ratio of supply noise to supply voltage is also increasing.

Power Delivery The power delivery network (PDN) consists of a hierarchical mesh of metallization layers connected on one side to the chip package pins and on the other side to the chip transistors. The overall goal is to provide a power delivery network with a low resistance and, increasingly, also a low inductance, see [280]. Thereby, a static voltage drop analysis is carried out first to determine the resistance component R in Equation 5.1. This gives the static, spatial variations of the supply voltage across the chip and ensures that all device voltages are within the specified limits, see [252]. Thereafter, a dynamic solving of Equation 5.1 is performed, see [237, 73, 94]. Based on the results, decoupling capacitors are inserted and floorplan adjustments are made. Of course, these procedures have to be performed for all the different voltages and respective power grids utilized in the design. These effects start to be relevant especially for high-performance, general-purpose designs like multicore and manycore architectures with power consumptions on the order of tens to hundreds of watts.
For mobile applications with < 1 W power consumption, these phenomena are of minor importance. However, if sufficiently large WPPA arrays, that is, with more than 1 000 processing elements, were used for sophisticated image and network processing or for wireless processing in base stations, power delivery and integrity issues would also come up quickly. Regarding the power gating technique, the major issues arise with placing and connecting the power switch cells, output isolation of gated circuit blocks, state retention in gated blocks, as well as enable/acknowledgement signal handling.

Switch Cell Placement When power gating is applied at the block level in a semi-custom ASIC flow, there are two possibilities for switch cell placement: a grid of switches or a ring of switches, see [169]. The choice depends primarily on the nature of the power gated blocks: for hard macro blocks like memory and intellectual property cores, which are already pre-placed and routed, the ring-of-switches technique is used, as shown in Figure 5.23. The advantage of this method is that the physical design of the power gated block stays untouched.

[Figure 5.23: Ring-style switch cell insertion. A switchable power domain is enclosed by a ring of switch cells whose enable_in/enable_out signals are daisy-chained around the block.]

In the alternative method, the grid of switches, the switch cells are placed in row, column, or checkerboard style in the midst of the normal standard cells, and the non-switchable ground or supply rails are routed as a signal. Since these cells have double the row height of a normal standard cell, no special treatment for place and route is needed. For proper routing of the non-switchable rails, a horizontal (Figure 5.24(a)) and a vertical (Figure 5.24(b)) version of the switch cell normally exists. A schematic overview of the column-style switch insertion is given in Figure 5.25. By varying the number of switch cells and the interconnection between them, the ASIC designer can adjust the ramp-up time of the power gated blocks together with the maximum rush current, supply voltage drop, and area overhead. Based on the current requirements of the gated block in active mode, an appropriate number and topology of switch cells is chosen. For additional details on the placement of power switches, we refer to [170].

[Figure 5.24: Horizontal (a) and vertical (b) versions of header cells, showing the nsleep_in/nsleep_out enable chain, the switch transistor (SW), and the VDD and VGND rails.]

Enable/Acknowledgement Signal Handling Each single switch cell has an enable input and an enable output. Therefore, several cells can be connected together in a daisy-chain style. This is needed to limit the maximum current surges which occur during switching the block on and off, see [258] for an example. The enable output of the last switch cell simultaneously serves as the acknowledgement signal for the respective power controlling circuitry, indicating that the supply of the switched block has finally stabilized (on or off). Again, there are several possibilities to organize these acknowledgement signals. To have better control over the switching latency and current surges, the single switch cells can be clustered together. The control signals of the clusters are then combined into a global acknowledgement signal, for example in an acknowledgement tree. For the same reason, different kinds of switch cells exist, for example one with two enable inputs (double-input cell), see [243]. The first enable turns on the weak switches of the daisy chain in a kind of pre-load. After that, the second enable input turns on the main switches.

Output Isolation After power shut-off, the outputs of the power-gated circuits float towards a voltage determined by the resistive divider built from the transistors in the stack: VGND for header-style PSO and VDD for footer-style PSO. Since only leakage currents exist in this mode, the transition to the respective voltage value at the outputs is very slow. This causes a large short-circuit current in the fan-out gates. Additionally, the logical output value is not controllable by the designer, that is, it may or may not assume the right value. Therefore, special care must be taken to isolate the floating outputs of power-gated blocks. The inputs of the respective blocks can also be isolated, but normally no switching should occur on the inputs of such blocks anyway, as for example on the clock. Nevertheless, the same controls can be used to isolate the inputs as well. A wide variety of different isolation circuits has been proposed in the literature, see [258] for an overview. For semi-custom design, however, normal AND/OR gates (retain-0 or retain-1 isolation) or latches (retain last value) are normally used. The isolation can also be combined with voltage level conversion.
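The daisy-chained enable scheme can be sketched behaviorally as follows; the number of switch cells and the per-cell current are invented illustration values, not data from any of the case studies.

```python
# Behavioral sketch of daisy-chained switch cell enables: one cell
# turns on per step, so the turn-on current is staggered instead of
# drawn all at once; the enable output of the last cell acts as the
# acknowledgement. Cell count and per-cell current are invented.

def wake_up(num_switches, amps_per_switch):
    """Yield (step, current drawn this step, acknowledge) while the
    enable propagates through the daisy chain."""
    for step in range(1, num_switches + 1):
        ack = step == num_switches     # enable_out of the last cell
        yield step, amps_per_switch, ack

events = list(wake_up(8, 0.0125))      # 8 cells, 12.5 mA each
peak_staggered = max(current for _, current, _ in events)
peak_simultaneous = 8 * 0.0125         # all cells switched at once
print(f"staggered peak: {peak_staggered * 1e3:.1f} mA, "
      f"simultaneous peak: {peak_simultaneous * 1e3:.1f} mA, "
      f"ack after step {events[-1][0]}")
```

The sketch makes the trade-off explicit: a longer daisy chain lowers the instantaneous rush current but stretches the ramp-up time, which is exactly the knob the designer tunes by choosing the number and topology of switch cells.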
State Retention Power Gating One important issue regarding power shut-off is how to preserve state in the turned-off blocks. If no care is taken, all state information
Figure 5.25: Column-style switch cell insertion.

in the sequential circuits and embedded memories will be lost. Whether some action should be taken depends on the type of design. In DSP applications, for instance, usually no state retention is needed: after power-on, the design block can start again from the reset state. However, for microprocessor cores in a mobile device some state retention strategy should be implemented, since starting the processor from the reset state (and booting the operating system) after wake-up leads to substantial delays and additional power consumption. Several strategies are possible, see [160]. In a software-based solution, the contents of selected registers are written to an external memory which is never turned off completely. This approach requires a considerable amount of time and is quite hardware-specific, which renders the application software less reusable. Another approach is to store the state in the scan-chain registers. But since the data are read and written sequentially, this technique is also quite slow. Finally, a hardware-assisted solution can be used, namely retention registers (state retention power gating, SRPG). After receiving a special control signal (save or restore), these special flip-flops either store their content in a shadow latch or restore it back again. This is the fastest option for state retention but also the most expensive one in terms of chip area: the corresponding area increases by ≈ 30%-50%. Thus, the usage of SR flip-flops should be limited. The layout of an SR flip-flop is similar to that of a power switch: it is also a double-height cell with true-VDD and switchable supply connections. The shadow latch of an SR flip-flop is connected to the true VDD, the other transistors to the virtual, that is, switchable supply.
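The save/restore behavior of such an SR flip-flop can be sketched as a toy model. This is illustrative only; the key point it encodes is that the shadow latch sits on the true VDD and therefore survives the power-off, while the main flip-flop on the switchable supply does not:

```python
class RetentionFF:
    """Toy model of a state-retention (SR) flip-flop: the main flip-flop
    is on the switchable supply and loses its state at power-off, while
    the shadow latch is on the true VDD and survives."""

    def __init__(self):
        self.q = None        # main flip-flop output (switchable supply)
        self.shadow = None   # shadow latch (true VDD)

    def clock(self, d):
        self.q = d

    def save(self):          # 'save' control signal before shut-off
        self.shadow = self.q

    def power_off(self):     # switchable supply disconnected
        self.q = None

    def restore(self):       # 'restore' control signal after power-on
        self.q = self.shadow
```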
5 Increasing the Energy Efficiency of Programmable Hardware Accelerators
To limit the state retention overhead in the special case of high-level synthesis, it was proposed to consider power gating already during scheduling, see [80]. Another alternative is to bias the sleep transistors appropriately and trade off leakage power savings against state retention and wake-up time, see [25]. The idea is to reduce the supply voltage on the virtual rail to a level where the state is still retained.

Always-On Cells For a coarse-grained standard cell library to be complete, one more kind of special cell is still missing: the always-on cells. These cells are buffers and flip-flops. Especially for the routing of control signals to or from the power management circuitry, or in clock trees, the need sometimes arises to route a signal through or over a power-gated region. Here, the always-on buffers come into play if the routing distance is very long and/or the driven load is high. Again, this kind of cell is implemented in a double-height style. Since the non-switchable VDD is routed as a signal, they can be placed in the midst of power-gated cells but remain constantly on. Similar reasons apply to the usage of always-on flip-flops.

Characteristics of Typical Industrial Designs with Power Gating Industrial embedded designs are mostly integrated into a system-on-a-chip together with application-specific and analog circuits. A typical SoC for mobile multimedia is fabricated in a 45 nm low-power process, contains ≈ 40·10^6 transistors, and deploys voltage and frequency scaling as well as power gating. Regarding power gating, 5-12 power domains, ≈ 4 000 level shifters, and ≈ 8 000 isolation cells are used, see [67]. Another example is a 32-bit microcontroller in 90 nm with 10 power domains, 882 isolation cells, 4 289 retention registers, and 14 level shifters. The general trend in the industry goes towards synthesizable designs/cores; the reasons are the high engineering costs and long verification times of full-custom designs.
Synthesizable cores described in one of the higher-level hardware description languages like SystemVerilog or VHDL enable faster verification and easy portability to new technologies and chip foundries. The Bobcat core from AMD [58, 241] is an x86 processor which is entirely synthesizable from an RTL description. The first versions were manufactured in 40 nm, and AMD plans to migrate the core to 32 nm or 28 nm processes soon.
5.4.2 Overview of the Common and Unified Power Formats

According to the ITRS [147], the behavioral and architectural levels of power management would together constitute as much as 50% of the overall power management effort already in 2010. Therefore, only automation together with a higher abstraction of power-related design techniques will help to dramatically reduce the development time and costs as well as the corresponding risks for current and future massively
parallel architectures for embedded and portable applications. Unfortunately, until recently only fragmented, EDA-tool-vendor-specific solutions existed, which could not address the power-related issues in a holistic manner. The following minimal requirements for conveying the power intent of a design throughout the EDA design flows for semi-custom ASICs have crystallized over time:

1. A single, unambiguous power intent specification with standard semantics must be provided to the tools at all levels of abstraction (behavioral, logical, physical, verification, and test).

2. The power specification should define consistent semantics across both verification and implementation. All power reduction features should be properly testable at a level of abstraction as high as possible, starting at the behavioral RTL. The correctness of all low-power features should be proved or checked against a golden specification by simulation and/or formal verification and equivalence checking tools. Therefore, the verification methodology and tools must be power-aware.

3. The power design intent should be easily and conveniently specifiable over many elements of the design; it should support the hierarchical design flow, IP portability, and reuse, and, in general, largely facilitate automation both in verification and implementation. Especially for soft IPs, effective means are needed to describe generic and technology-independent parts of the power intent.

4. The implementation of the power-related intent should be automated and standardized, as happened with the hardware description languages about two decades ago.

Without a power specification format, there is currently no possibility to express all power-related content with the existing hardware description languages or library formats only. The reason is quite obvious: the logic, and not the power aspect, was the principal focus at the time the hardware description languages were standardized.
Therefore, additional means for the description of power intent are needed. The advanced power reduction techniques described in Section 2.3.1, while conceptually simple, have a very strong impact on the design methodology at virtually all levels of abstraction, see for example [152, 160, 237], and not only on the physical design, as was formerly often assumed. Among other things, they require new design concepts and special cells. Additionally, the application of power gating has the consequence that formerly exclusively physical objects like power and/or ground nets become functional nets as well. Therefore, their new functional semantics has to be taken into account already during the design at the register transfer level. For additional verification subtleties
see [78, 114, 152]. To address these and several other problems, the semiconductor and EDA industry finally came up with two major power specification formats with consistent, standardized semantics, and the corresponding power-aware design tool suites: the Common Power Format and the Unified Power Format. Both are briefly described in the following.

Common Power Format In its current version, this is the Silicon Integration Initiative (Si2) Common Power Format standard, version 2.0, see [263], released in February 2011. Originally developed and maintained by Cadence Design Systems, Inc., it has been publicly available at no cost since 2007. The development of CPF is driven by Si2 and the Low Power Coalition (LPC), which is formed by major EDA, foundry, IP vendor, and ASIC design companies. For the industrial adoption of CPF-based design flows see, for example, [261]. The main features of the preceding standard version 1.1 primarily dealt with IP reuse issues: possibilities to merge and reconfigure power domains, reconfigure power rules, resolve precedence, and integrate power modes were given. CPF 2.0 extends these features and additionally puts more focus on interoperability with IEEE 1801-2009 (UPF 2.0). All CPF descriptions are based on an intuitively clear set of only four basic command types:

set_*    : version, scope, general variables
define_* : special cells description
create_* : design intent
update_* : implementation directives
This intuitive command set combined with clear semantics enables a designer-friendly, very dense description of the power intent, see Section 5.4.

Unified Power Format In its current version, it is the IEEE 1801-2009 Standard for Design and Verification of Low Power Integrated Circuits, also known as UPF version 2.0, see [145]. IEEE 1801-2009 standardizes and enhances Accellera's original UPF 1.0 draft standard (IEEE P1801, see [23]). The main contributors to the first version were Synopsys, Inc., Mentor Graphics, and Magma, Inc. The main enhancements of version 2.0 include support for bias supplies, greater flexibility and capabilities in the specification of power states, and enhanced semantics for merged power domains. In general, UPF provides a format for the description of the low-power design intent, with the ability to specify the power supply network, switches, isolation, retention, and other aspects relevant to the power management of an electronic system at a sufficiently abstract RT-level. A UPF-based design flow finally allows for power-aware simulation, equivalence checking, synthesis, and physical design, see [285]. For industrial applications of UPF refer, for example, to [111].
The UPF standard has slightly different semantics for the specification of the power intent, based on additional objects like supply nets and supply ports. First of all, in UPF each power domain has a scope and an extent:

Power Domain Scope: The hierarchical level at which the power domain is defined. Each scope has supply nets and supply ports.

Power Domain Extent: The actual set of elements which belong to the corresponding power domain.

Additionally, the standard defines the following terms which are needed to form a power intent description in UPF:

Supply Net: A conductor that carries a supply voltage or ground throughout a given power domain: a net with power state semantics.

Supply Port: A power supply connection point between two adjacent levels of the design hierarchy: a port with power state semantics. A supply net which crosses from one level of the design hierarchy to the next passes through a supply port. A supply port consequently has an associated "direction": input or output.

Supply Set: A collection of supply nets that provide a power source.

Switch: A design element that conditionally connects one or more input supply nets to a single output supply net according to the logical state of one or more control inputs.
The basic commands used in UPF are the following:

set_*      : scope, design attributes, level shifter/isolation/retention strategy, isolation/retention control, supply nets/sets, port attributes
create_*   : power domains/switches/state tables, supply nets/ports, logic nets/ports
connect_*  : supply nets (to ports), logic nets (to ports), set (to elements)
add_*      : port state, power state, power supply table state
describe_* : legal state transition
use_*      : interface cell
query_*    : design attributes, upf, cell instances, domains
map_*      : isolation/level shifter/retention cell, power switch
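For illustration, these objects and commands might compose as in the following sketch of a single switchable power domain. All instance and signal names (PD_pe, u_pe, pwr_en, pwr_on, isolate) are hypothetical; this is a hedged example of UPF usage, not a description taken from this work:

```tcl
# Hypothetical UPF sketch: one switchable domain PD_pe around instance u_pe
create_power_domain PD_pe -elements {u_pe}
create_supply_net VDD    -domain PD_pe
create_supply_net VDD_sw -domain PD_pe
create_supply_net VSS    -domain PD_pe

# Power switch controlled by pwr_en, acknowledge from the daisy chain on pwr_on
create_power_switch pe_sw -domain PD_pe \
    -input_supply_port  {in  VDD} \
    -output_supply_port {out VDD_sw} \
    -control_port       {ctrl pwr_en} \
    -ack_port           {ack pwr_on}

# Clamp the domain outputs low while the supply is off (retain-0 isolation)
set_isolation pe_iso -domain PD_pe \
    -isolation_power_net VDD -isolation_ground_net VSS \
    -clamp_value 0 -applies_to outputs
set_isolation_control pe_iso -domain PD_pe \
    -isolation_signal isolate -isolation_sense high
```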
An interesting feature of UPF is the standardized possibility of issuing query commands on all kinds of HDL and UPF design objects. These query_* commands do not make up the power intent of a design. They are only used for querying the design database and are included to enable portable, user-specified query procedures across tools that are compliant with the UPF standard. In the latest CPF 2.0 standard, this kind of query is also possible with the help of the find_design_objects command.

Compatibility Issues Although both power intent specification formats are supposed to describe the same physical objects, they are not 100% compatible with each other. This is mainly because of the different release times (CPF 1.1: 2008, UPF 2.0: 2009) and standardization organizations (CPF: Si2, UPF: IEEE/Accellera). Still, according to [262], roughly 65% of the commands and respective arguments of CPF 1.1 and UPF 2.0 are fully interoperable. Another 11% are partially interoperable and 17% are not interoperable. The remaining 7% are commands and arguments whose usage is not recommended. The coarse relations between the different versions of the power specification formats are given in Figure 5.26. Features currently not supported by IEEE 1801 but already supported in CPF 1.1 are, for instance:

• Disjoint power domains: power domains implemented by disjoint physical regions.

• Equivalent control pins: for the support of time-sequenced control signals for power gating, state retention, and isolation control.

• Implementation constraint setups: for DVFS, multi-mode multi-corner (MMMC) optimization and analysis (the create_operating_corner command in CPF), and others.
In general, the Common Power Format can be seen as more back-end friendly, that is, it has better expressiveness at lower levels of abstraction, while the Unified Power Format is more front-end friendly, that is, it provides powerful description and modeling capabilities especially at design stages prior to place and route, including functional verification.

Power-Aware Design Flow Modern CAD tools for ASIC design take the HDL description files together with the CPF/UPF files as input and automatically produce a design description at a lower abstraction level as output.
Figure 5.26: Main power specification formats and their interoperability (Cadence 2011).
Synthesis tools produce a netlist which fulfills the original design and power specification and is mapped to a specific hardware library. A power-format-capable synthesis tool optimizes for timing, power, and area simultaneously. For example, it considers the isolation and level shifter cell delays during the timing optimization step and automatically inserts these cells into the design. Since the hierarchical names of the design objects may change during synthesis, a modified power intent description can be generated which reflects the changes made. Subsequently, a power-format-capable logical equivalence checker can read the new design files and perform equivalence checks to ensure the correctness of the applied changes. During the simulation phase (gate-level or behavioral), a power-format-aware simulator is used to automatically generate assertions which verify, for instance, that power control signals associated with power shut-off do not become undefined, that isolation, state retention, save, and restore conditions are consistent, and that the correct sequencing of the signals is given. Finally, the place and route tools read both the synthesized netlist and the power intent files generated in the previous steps and produce the chip layout data. In this step, the power switching cells are inserted into the design according to the placement strategy. Again, a power-format-capable logical equivalence checker is used to ensure the equivalence of the design implementations at the different levels of abstraction. We will apply such a state-of-the-art CPF-based ASIC design flow to our generic WPPA architecture and describe the results in the following sections.
5.4.3 Reconfiguration-controlled Power Gating

In the case of a WPPA architecture, the ultimate design goal is to derive instances that combine flexibility and domain programmability with computational and power efficiency. Therefore, different power management strategies have to be carefully considered and combined when going from architecture to silicon. Since not every application exhibits the same amount of parallelism, the available hardware resources should tightly match the current computational needs. This can be achieved, for example, by shutting off the unused processing elements, see Figure 5.27. Since the number of processing elements, and hence power domains, can eventually become very large, design methodologies and implementation styles which worked well for a dozen power domains become unusable. This is mainly due to the lack of scalability of the corresponding power control infrastructure and, for the most part, a still manual implementation approach. To handle such complex designs, the implementation and verification process must be fully automated. Furthermore, an efficient and scalable power control network must be provided which allows the supply voltage of different processing elements to be switched on and off quickly, reliably, and very flexibly. In a WPPA, the idle PEs can be powered off in case of an algorithm utilizing only a small fraction of the available hardware resources. Since the WPPA architecture is reconfigurable, the process of powering on the required PEs and shutting down the unused PEs is an integral part of the reconfiguration management, which finally led us to the concept of reconfiguration-controlled power gating. After an initial reset, all PEs are put into the power-off state, that is, both the functional and the reconfiguration logic are disconnected from the power supply, see the power controller state machine given in Figure 5.28.
Only after an external configuration request does the global configuration manager initiate a configuration process of a (sub-)array. By means of a multicast scheme, an arbitrarily sized rectangular array of PEs can be chosen for a subsequent configuration. Thereafter, if not connected yet, all selected PEs start to simultaneously connect to the power supply over their own daisy chains of power switch cells. The daisy-chain style is used to minimize current surges during the power-up sequence. The output of the last power switch cell in each PE power domain is then used to asynchronously inform the respective power controller about the power connection status (pwr_on signal in Figure 5.28). Thus, the implementation of the power controller is completely decoupled from the physical power gating implementation details, like the number of switch cells and their physical arrangement. This makes the design robust, reusable, and adaptable. After three additional clock cycles, the respective PE is ready to be configured. During the power initialization steps (PWR_ON, RESET, CLK_ON, ISO_OFF), the output signal pwr_ok of each PE is deasserted to signal that a power-up sequence is currently taking place.

Figure 5.27: Logical view of a 3 × 8 WPPA with power shut-off capabilities. Generally, each individual PE or any group of PEs can be power gated.
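The power-up and power-down sequencing of the per-PE power controller can be sketched as a small behavioral model. This is a hedged reconstruction, not the thesis' actual VHDL: the state names follow Figure 5.28, but the exact transition conditions are assumptions.

```python
class PowerController:
    """Behavioral sketch of the per-PE power controller (cf. Figure 5.28)."""

    POWER_UP = ("POWER_ON", "RESET", "CLK_ON", "ISO_OFF")

    def __init__(self):
        self.state = "OFF_STATE"

    def step(self, config_en=False, pwr_on=False, shut_off=False):
        s = self.state
        if s == "OFF_STATE":
            if config_en:
                s = "POWER_ON"     # assert pwr_en, daisy chain starts switching
        elif s == "POWER_ON":
            if pwr_on:             # asynchronous ack from the last switch cell
                s = "RESET"
        elif s == "RESET":         # three initialization cycles follow
            s = "CLK_ON"
        elif s == "CLK_ON":
            s = "ISO_OFF"
        elif s == "ISO_OFF":
            s = "ON_STATE"         # PE is ready to be configured
        elif s == "ON_STATE":
            if shut_off:
                s = "ISO_ON"       # power-down sequence
        elif s == "ISO_ON":
            s = "CLK_OFF"
        elif s == "CLK_OFF":
            s = "POWER_OFF"
        elif s == "POWER_OFF":
            s = "OFF_STATE"
        self.state = s
        return s

    @property
    def pwr_ok(self):
        # Deasserted only while a power-up sequence is taking place.
        return self.state not in self.POWER_UP
```

Note that the model has exactly the nine states mentioned in Section 5.4.6, and that exactly three clock cycles (RESET, CLK_ON, ISO_OFF) separate the switch-cell acknowledge from the ready state.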
5.4.4 Scalable Power Control Network

As already mentioned, a scalable power control network is needed for a flexible usage of the available PEs. We propose to use a two-dimensional propagation scheme for the power status signals, schematically shown in Figure 5.28. Each PE sends its current power status, that is, ON or OFF, over the pwr_ok signal to its east and south neighbors. Thereby, two modes must be considered:

1. the PE is currently powered on, in the power-down sequence, or powered off, and

2. the PE is in the power-up sequence.

In the first case, the power status of the current PE is well defined, either ON or OFF, and the PE simply forwards the AND-ed power status signals from its north and west neighbors (pwr_ok_north and pwr_ok_west). Therefore, if all PEs in the array are currently either on or off, the global power status signal pwr_ok_global, connected to the configuration controller, is asserted. This means that all power-up sequences have finished successfully. Now assume that the configuration controller selects a PE which is currently powered off for a subsequent configuration. After the receipt of the configuration request over the global configuration bus, the corresponding power controller initiates a power-up sequence and goes into the POWER_ON state, see Figure 5.28. This means that its power status signal becomes deasserted. This state does not change as long as the last power switch cell of the daisy chain in the corresponding power domain is switched off (pwr_on deasserted). After the receipt of the asserted pwr_on signal, the controller
Figure 5.28: Schematic view of the scalable power control network and the power controller state machine.
performs a reset, connects the clocks to the PE, and releases the isolation of the PE outputs. During the whole power-up sequence, the power status signal pwr_ok remains deasserted. Thus, the global configuration controller has to wait until the last PE selected for configuration which is not already powered on has completed its power-up sequence. The deployment of such a two-dimensional power status propagation scheme has the advantage that the corresponding power control network can be implemented as a tree of AND gates with two inputs and a fan-out of two. The depth of this tree increases only linearly with the size of the processor array in one direction; to be more precise, it grows like (2N − 1). The width of the tree also grows linearly with N. If a one-dimensional power status propagation scheme were used, the depth of the power control network would grow with N². For a processor array of size 30 × 30 with 900 PEs, our approach yields a depth of only 59. Thus, this propagation scheme is highly scalable. The same applies to the worst-case critical path of the power control tree, which determines the propagation speed of the power status signals. Although conceptually simple, this scheme solves the scalability issues with minimal area and implementation complexity overhead.
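The two-dimensional propagation scheme and the depth claim can be sketched and checked as follows. This is a behavioral sketch under the stated signal semantics, not the gate-level implementation:

```python
def and_tree_depth(n):
    """Depth of the two-input AND merging tree for an n x n array (2n - 1)."""
    return 2 * n - 1

def global_pwr_ok(status):
    """Two-dimensional propagation: each PE ANDs its own well-defined power
    status with the forwarded status of its north and west neighbors; the
    south-east corner yields pwr_ok_global."""
    rows, cols = len(status), len(status[0])
    ok = [[False] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            north = ok[i - 1][j] if i > 0 else True   # boundary inputs tied high
            west = ok[i][j - 1] if j > 0 else True
            ok[i][j] = status[i][j] and north and west
    return ok[rows - 1][cols - 1]
```

For a 30 × 30 array, `and_tree_depth(30)` indeed yields 59, and a single PE still in its power-up sequence (status deasserted) deasserts the global signal.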
The distributed power control network has a major advantage over a centralized approach in which each PE reports its current power status directly to the global configuration controller: concentrating the merging tree for the power status signals in one place leads, especially for large arrays, to an unbalanced delay distribution for spatially distant PEs (center vs. corners). Furthermore, since the whole architecture is parameterized, the delay variation will also be difficult to predict in the general case. Additionally, due to the asynchronous handshake protocol between the PEs and the configuration controller, no synchronization or timing issues arise even in very large arrays. In the following, we describe the practical use of the Common Power Format to enhance our embedded, coarse-grained reconfigurable processor array architecture with additional power shut-off capabilities in a fully automated fashion. The main advantage thereof is the exceptional increase in design productivity, which allows the implementation and validation of a design with dozens of switchable power domains from RTL to GDSII in a straightforward manner.
5.4.5 Power-aware Design Flow

For the implementation of the above concepts, we used the CPF-based digital design tool suite from Cadence Design Systems. At the front end, the array instance described in VHDL was first synthesized with RTL Compiler, which optimizes for timing, power, and area simultaneously; for example, it considers the isolation and level shifter cell delays during the timing optimization step and automatically inserts these cells into the design, see Figure 5.29. The quality of the CPF file was checked with Conformal Low Power, and some inconsistencies, like missing isolation rules for power domain crossings, were corrected. After that, the design was simulated with the Incisive simulator. This tool is also power-aware, that is, it can automatically generate assertions which verify that power control signals associated with power shut-off do not become undefined, that isolation, state retention, save, and restore conditions are consistent, and that the correct sequencing of the signals is given. Finally, place and route was performed using SoC Encounter. Since this is the back-end design step, 140 header switches were inserted into each switchable power domain, resulting in 3 360 header switches for the whole design with an estimated maximum header current of 134 μA. A column insertion style was chosen, where the 140 header cells are connected in a daisy chain to limit the switch-on current surges. The switch daisy chain latency amounts to two clock cycles (10 ns at 200 MHz). The output of the last switch was connected to the local power controller to provide asynchronous feedback of the power domain state (pwr_on) and thus make the design more flexible.

CPF Description The intuitive command set of CPF combined with clear semantics enables a designer-friendly, very dense description of the power intent. This is
illustrated by a generic CPF description file for the WPPA architecture given in the Appendix in Figure A.1. Generally speaking, a CPF description consists of three main parts:

1. Physical structure: lines 3-5, abstract supply and ground net definitions,

2. Library part: lines 6-8, available special cells, and

3. Logical structure: lines 9-33, power domain partitioning, power mode definitions, switch and isolation rules.

The update_* statements are the "syntactic glue" between the abstract definition (create_*) and its physical implementation, for example in the case of power switch rules (lines 30, 31) and isolation rules (lines 32, 33). As already mentioned, power gating is implemented at the PE level. Instead of specifying in CPF each possible combination of single PEs being powered on or off combinatorially, the following method is used: beyond the obligatory "all PEs on" (line 22) and "all PEs off" power modes (line 23), only two more power modes are defined. The PEs being powered on or off are arranged in a checkerboard style. In the first of those two additional modes, all PEs on "white" checkerboard positions are powered on and the PEs on "black" positions are powered off (line 24); in the second mode, we have the opposite case (line 25). This captures the mesh-like interconnect scheme. Application-specific deviations are then described in additional power modes. This reduces the combinatorial number of power modes to a de facto constant and makes the description almost independent of the array size. Up to now, we have been able to successfully synthesize an array of size 10 × 10 with 100 switchable power domains and a 200 MHz target frequency.
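The checkerboard partitioning behind the two additional power modes can be sketched in a few lines; the PE instance names (pe_i_j) are hypothetical, and in the actual flow the two groups would be referenced by the corresponding create_power_mode statements (cf. lines 22-25 of the CPF file in Figure A.1):

```python
def checkerboard_groups(rows, cols):
    """Partition a rows x cols PE array into the two checkerboard groups
    used by the two additional power modes (hypothetical instance names)."""
    white = [f"pe_{i}_{j}" for i in range(rows) for j in range(cols)
             if (i + j) % 2 == 0]
    black = [f"pe_{i}_{j}" for i in range(rows) for j in range(cols)
             if (i + j) % 2 == 1]
    return white, black
```

Since the two groups are computed, not enumerated by hand, the generated CPF description stays almost independent of the array size, as claimed above.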
5.4.6 Experimental Results

As an experimental setup for a power shut-off scenario, two representative algorithms were chosen and implemented with the help of dynamic reconfiguration on the same hardware platform with 24 PEs: an 8-point radix-2 FFT (shaded area A) and an edge detection (ED) algorithm utilizing a sub-array of size 2 × 2 only (shaded area B), see Figure 5.27, and the physical chip layout in Figure 5.29. For determining the worst-case power consumption, the synthetic load application is used. Due to power gating at the PE level, the leakage power savings depend linearly on the number of PEs utilized by the given application. However, even for one single algorithm there will usually exist different implementations (regarding the number of used PEs) to trade off, for example, the total power consumption against the throughput or filtering quality, see Chapter 6. Figure 5.30 shows that an active leakage power
Figure 5.29: Chip layout of a 3 × 8 processor array with 24 switchable power domains (5 × 5 floor-plan).

reduction of 60% is achieved for the edge detection application running on a 2 × 2 sub-array (typical PVT conditions), compared to an array with 20 additional (clock-gated) idle PEs. For the case of the idle state where the power supply to all 24 elements is switched off, the reduction even amounts to 91%. Nevertheless, a leakage power increase of ≈ 20% in the running phase can be noticed compared to a non-power-gated design. This increase is first of all due to the additional power switch cells and also the placement and routing constraints in the physical layout. Therefore, a power switch optimization step, which is also available in SoC Encounter, as well as accurate floor-planning is required to account for these additional constraints. For the current case study, the following physical layout was chosen: the PEs are connected in a meander fashion starting in the top left corner and ending in the bottom right corner of the chip, with the configuration ROM containing the corresponding software located in the center, see Figure 5.29. As we see in Table 5.4(a), the process-voltage-temperature (PVT) variation of leakage power can be quite substantial, ranging from 2.68 mW to 156.78 mW for the same architecture. An important observation is also that the leakage overhead of the header cells stays at an acceptable level of < 5%. The leakage power of a PE core ranges from 0.08 mW to 5.81 mW in the worst case, see Table 5.4(b). From
[Figure 5.30 bar chart: leakage power of 2.68 mW (no power shut-off, idle), 2.2 mW (3×8 FFT), 0.88 mW (2×2 ED), 0.21 mW (idle), and 0.09 mW (idle∗).]
Figure 5.30: Overall leakage power behavior introduced by the application of the power shut-off technique (Typ.-Typ. process, 1.0 V, 25 °C).
this point of view, switching off one idle PE already compensates for the major leakage overheads of applying power gating to the whole array. Also to be noted is the chip area increase due to the additional power and ground rails around each switchable power domain. In the case of this design, however, no exact figures can be given, since the original, non-power-gated design was not optimized for area. Thus, the new design objects easily fitted into the same floor-plan, since it had sufficient area slack. This allows the same floor-plan to be reused for several differently parameterized designs. In addition, the design also extensively uses hierarchical clock gating with 74 clock domains to reduce dynamic power, see Table 5.5, as described in Section 5.1. The gate area increase amounts to ≈ 1% only. A corresponding CPF description, which we can generate automatically, allows a very dense description of the power intent. Thus, the flat CPF file, without any loops, contains only 349 lines of code (LOC), see Table 5.6(a). An additional 240 LOC are needed in the place and route (P&R) script to insert the physical power switches. Each PE has a dedicated power controller module (pwr_ctrl) implemented as a finite state machine with 9 states. The power and area overhead of the power controller modules is negligible: 0.5 μW leakage, 17 μW total power, and 123 gates, see Table 5.6(b). In Figure 5.31, the overall power profiles over time of three different applications working on the same input data are given for the functional processing phase. Obviously, the synthetic load application reveals the highest average power and additionally shows a sharp periodic behavior.
(a) Process-Voltage-Temperature Variation of Leakage Power

Condition (PVT)                               Array (mW)   Header Cells (mW)
Typical (Typ.-Typ., 1.0 V, 25 °C)             2.68         0.05 (1.87%)
Typical Leakage (Typ.-Typ., 1.0 V, 125 °C)    23.03        1.14 (4.95%)
Max. Leakage (Fast-Fast, 1.1 V, 125 °C)       156.78       5.89 (3.76%)

(b) Module Leakage

Module                 Typical Cond. (mW)   Max. Leakage (mW)
Power Controller       0.0006               0.05
PE Core                0.08                 5.81
Global Controller      0.02                 1.95
Reconf. Addr. Table    0.05                 4.8

Table 5.4: Case study results for leakage power consumption. (90 nm, 1.0 V, general purpose, nominal V_TH, commercial CMOS standard cell library, 200 MHz target frequency, 25 °C, typical-typical process)
In Figure 5.32, the overall break-down of the idle leakage power in the array is given. At the array level, the different configuration data memories and controllers still contribute ≈ 60% of the leakage. Therefore, these modules should, and can, also be power-gated, which would lead to an estimated minimum leakage power of 90 µW, marked as idle∗ in Figure 5.30.
5.5 Summary

In this chapter, we first combined the custom hierarchical and the fine-grained automatic clock gating technique and obtained a considerable increase in power efficiency for WPPA architectures. Although such a combination of clock gating at different levels is already known and widely used in commercial RISC processor cores like [120], no detailed experimental results and figures have been available so far. Clock gating is a simple and effective method for decreasing the dynamic power consumption, see [228] and [320]. On the register transfer level, clock gating disables the clock of a component whenever its output values are not used. Clock gating reduces power by decreasing the switching activity in the flip-flops, in the gates in the fan-out of the flip-flops, and in the clock tree [229]. However, automatic clock gating can also introduce a significant amount of redundant switching power in the inserted clock gating cells,
and the corresponding clock tree buffers. This fact was also verified during our experiments.

Theoretical peak performance            24 GOPS
Maximum avg. power (synthetic)          450 mW
Clock domains                           74
Power domains (switchable + default)    24 + 1
Power modes (functional + idle)         2 + 1
Header switches/Domain                  140
Isolation cells/Domain                  32
Always-on buffers/Array                 27
Power Control Network (2-AND)           24
Power-up latency                        8 clock cycles
Switch daisy chain latency              2 clock cycles

Table 5.5: Reconfigurable array properties.

To sum up, we can look at the achieved power efficiency of the proposed WPPA architectures and compare them with some well-established commercial and academic embedded processor architectures, see Table 5.7. With a power efficiency of up to 0.064 mW/MHz in a case-study image processing algorithm (1024 × 768 XGA resolution @ 30 fps, greyscale), our 2 × 2 array outperforms the TMS320C6454 DSP from Texas Instruments (core power) by a factor of 28. The low-power, single-core ARM946E-S embedded processor (with cache) on average has a 1.7× lower power efficiency. The corresponding power efficiency for the 3 × 8 WPPA array is 0.66 mW/MHz, which is still 2.7× better than the single-core TMS320C6454 DSP. With a performance-power efficiency of 98 MOPS/mW for the 3 × 8 WPPA and up to 124 MOPS/mW for a case-study FIR and edge detection implementation, it provides ASIC-like performance while simultaneously offering domain-specific flexibility and reconfiguration features. The main results were published in [9], [10], [12], and [15]. The major contributions to the reduction of standby and operational leakage power can be summarized as follows:

1. Reconfiguration-controlled power gating: An intelligent power shut-off technique to automatically adapt the computational resources and the power consumption to the current computational needs. This makes it possible to run a given architecture at minimum power for different application scenarios. This technique
can be applied to any kind of massively parallel coarse-grained architecture with a sufficient number of processing elements.

2. Scalable power control network: A new, highly scalable, two-dimensional power control network which allows an area-efficient, application-transparent implementation of power gating on a large number of power domains, up to 1,000 and more.

3. Effective design of many-power-domain CGRAs: For the first time, a complete CPF-based front-to-backend design methodology is applied to a CGRA architecture with a non-trivial number of 24 power domains, and the corresponding detailed results on the design, power, and hardware overhead are presented. It allows the most advanced power reduction techniques to be defined, implemented, and also validated in a concise and automated way. The main results were also published in [2].
(a) Common Power Format description statistics

File                     Lines   Words   Numbers
CPF (flat, no loops)     349     3,969   813
P&R header insertion     240     1,416   408

(b) Gate count statistics

Parameter                    Value   Gates
Processor Array (PEs)        24      1,061 K
Processing element (PE)      1       21 K
Instruction memory/Element   1       21 K
Power controller/Element     1       123

Table 5.6: Case study results for power gating. (90 nm, 1.0 V, general purpose, nominal V_TH, commercial CMOS standard cell library, 200 MHz target frequency, 25 °C, typical-typical process)
[Plot: total power (mW) versus time (0–400 ns) for the Synthetic Load, FFT, and ED applications.]
Figure 5.31: Cycle-accurate power profiles of different applications on a 3 × 8 reconfigurable processor array (simulation after P&R, functional phase, 200 MHz).
[Leakage break-down: header switches (3,360×): 35%; lu_tab: 24%; cfg_mem: 20%; cfg_ctrl: 10%; pwr_ctrl (24×): 6%; cfg_mgr: 5%.]
Figure 5.32: Leakage power break-down of the 3 × 8 processor array with power gating in the idle mode.
Name             Architecture          MHz    mW/MHz   Pwr. Eff. (Norm.)
WPPA             WPPA 2 × 2            200    0.064    28
ARM              ARM946E-S [24]        230    0.11     16.3
ARC Int.         ARC EP20 [30]         155    0.11     16.3
Recore Systems   Montium TP [267]      140∗   0.17∗    10.5∗
Silicon Hive     HiveFlex [260]        200    0.24     7.5
Renesas          SH-X2 (LP) [320]      266    0.3      6.0
WPPA             WPPA 3 × 8            200    0.66     2.7
ADRES            ADRES 4 × 4 [52]      100    0.7      2.6
Texas Instr.     TMS320C6454 [278]     720    1.79     1.0

Table 5.7: Power efficiency overview of some known commercial and academic programmable embedded architectures (90 nm CMOS technology, or scaled (∗) to 90 nm).
6 Design Space Exploration of WPPA Architectures

Since any kind of exhaustive search for highly parameterized systems like WPPAs, with an exponentially growing number of possible solutions, is not viable, multi-objective optimization using evolutionary algorithms (MOEA) [91] has proved to be a very promising approach to deal with this complexity, mainly due to its flexibility. In an embedded system, the application dictates the final system architecture and its feasibility. In this case, it is worth optimizing both the corresponding architecture and the compiler for different objectives like power, area, and throughput. However, these design objectives are in conflict with each other, which prohibits the use of standard single-objective optimization approaches like simulated annealing, tabu search, (mixed) integer linear programming, and others. While compilation speed is the most important factor in the case of compilers for general purpose computers, the design of a special adaptive processor array architecture which is tailored to a given application class has to result in high code efficiency, for instance, concerning code size and execution time. To fulfill given timing constraints, it is necessary to exploit the inherent parallelism of the application programs and to use additional hardware resources, which in turn conflicts with the major design goal of keeping hardware cost and chip area as low as possible. Therefore, it is in general impossible to find a design point which is optimal in all design objectives. Obviously, execution time is antipodal to hardware cost if multiple functional units are used to exploit the inherent parallelism of a given application program. Therefore, the designer of a particular WPPA architecture has to trade off these conflicting design goals and needs a set of appropriate solutions rather than a single one.
This is exactly the point where the potential of multi-objective evolutionary algorithms can be exploited to the full extent, as described in Section 6.1. Evolutionary algorithms maintain a pool of solutions which evolves in parallel over time. Genetic operators are applied to the solutions in the current pool to improve their average quality. Solutions with the lowest quality are successively removed from the pool, see [91, 119].

Methodology Overview   The overall exploration methodology is depicted in Figure 6.1. The design space comprising architecture and compiler parameters is explored with MOEAs for the optimization of the design objectives (area, power, latency, . . . ).

[Diagram: the design space (architecture and compiler parameters, algorithms) is explored by an evolutionary algorithm; efficient architecture/compiler co-generation feeds a retargetable compiler and a simulator of the WPPA (setup, programming, I/O), which return the objective metrics (power, cost, performance); decision making selects among the Pareto-optimal solutions.]

Figure 6.1: Overview of the exploration of different WPPA architectures.

The exploration framework itself is implemented in the Java programming language and comprises about 8,600 lines of code. It uses the Opt4J meta-heuristic optimization library [184]. The considered architecture design parameters include, among others, the size of the processor array and the internal structure of each single processing element (WPPE): the storage element configuration (number and class of input buffers and register files), the functional units (number and type), and the communication elements (number and size of I/O ports). On the compiler side, different kinds of partitioning and tiling transformations can be deployed. The following six objective functions are optimized simultaneously: area cost of the hardware [mm²] (min), average static and dynamic power dissipation [mW] (min), execution latency [clock cycles] (min), energy consumption [µJ] (min), VLIW issue width of each processing element (min. 1, max. 6) (max), and data path width, or precision, [bit] (max). Figure 6.2 shows an example of exploration results for two different algorithms. Red points mark the set of possible optimal designs. The precise mathematical definition of optimality in the case of multi-objective optimization will be given in the next section.
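The mix of minimization and maximization objectives listed above is typically handled by minimizing a uniform objective vector, with the maximization objectives negated. A minimal sketch of this convention (the function and parameter names are illustrative, not taken from the actual Java/Opt4J framework):

```python
# Hypothetical sketch: collecting the six objectives named in the text
# into one vector that is minimized component-wise. The two
# maximization objectives (issue width, data path width) are negated.
def objective_vector(area_mm2, power_mw, latency_cycles,
                     energy_uj, issue_width, datapath_bits):
    return (area_mm2, power_mw, latency_cycles,
            energy_uj, -issue_width, -datapath_bits)
```

With this convention, Pareto dominance can be checked uniformly with "smaller is better" on every component.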
[Scatter plots: (a) latency (cycles) versus power (mW), the power-latency trade-off for algorithm I; (b) latency (cycles) versus power (mW) and area (mm²), the power-latency-area trade-off for algorithm II.]
Figure 6.2: Example of different trade-offs in a WPPA architecture obtained after the exploration for two different algorithms. Notably, the headroom of each objective varies by as much as one order of magnitude among the set of optimal solutions. Red points mark this set of non-dominated solutions, points of other colors mark the dominated solutions, see Section 6.1.

Retargetable Mapping   The starting point of the compilation flow is the class of so-called dynamic piecewise linear algorithms (DPLAs), see [129]. This class of algorithms describes loop nests with a potentially large number of run-time conditions as a set of recurrence equations. Different transformations, such as hierarchical partitioning of the iteration space, are applied to match the size of the application to the size of the physical array, or to optimize the local memory by data reuse within a partition. The derivation of latency-optimal static schedules is based on the formulation and solution of a mixed integer linear program. In order to satisfy resource constraints, a so-called resource graph has to be specified, which expresses the binding possibilities of operations to functional units as well as the execution times and pipeline rates of these units. Thus, besides latency optimization, the binding of operations to functional units is also determined. Particular to this method is its holistic approach: both the schedule within a processing element and the schedule between all processors of the array are optimized simultaneously.

High-speed Compiled Array Simulation   As already mentioned, to estimate the switching activity data, either a cycle-accurate, compiled simulator with representative input data can be used, which is automatically generated by the WPPA design framework, or a switching load discretization heuristic may be applied together with the scheduling information. To achieve high simulation speeds, an instruction set simulation (ISS) approach is used for cycle-accurate WPPA array simulation, see [175]. To summarize this section, we list the main contributions made by our design space exploration framework for weakly programmable processor arrays:

• The presented framework for WPPAs allows us to perform a highly accurate and fast automatic evaluation of any possible WPPA instance in terms of area, performance, and power at a high abstraction level, exploiting the state of the art in estimation approaches.

• The presented framework constitutes the means to automatically determine the absolute upper and lower limits of the objectives for a given parameter range, which would be impossible to achieve otherwise.

• A substantial acceleration of the automatic exploration procedure is achieved by deploying a novel macro-modeling methodology based on a relational database together with state-of-the-art multi-objective evolutionary algorithms.

• Finally, the automatic exploration of the combined deployment of several different algorithms on a single WPPA instance, programmed by means of run-time reconfiguration, is investigated.
6.1 Multi-Objective Evolutionary Algorithms

In multi-objective problems (MOPs), the general goal is to optimize k objective functions simultaneously. This may be the maximization of all functions, the minimization of all functions, or an arbitrary combination thereof.

Definition 6.1 (General MOP) A general MOP is defined as minimizing (or maximizing) F(x) = (f_1(x), . . . , f_k(x)) subject to g_i(x) ≤ 0, i ∈ {1, . . . , m}, and h_j(x) = 0, j ∈ {1, . . . , p}, x ∈ Ω.

An MOP solution minimizes (or maximizes) the components of a vector F(x), where x is an n-dimensional decision variable vector x = (x_1, . . . , x_n) from some universe Ω. Note that g_i(x) ≤ 0 and h_j(x) = 0 represent constraints that must be fulfilled while minimizing (or maximizing) F(x), and Ω contains all possible x that can be used to satisfy an evaluation of F(x), see [84].

Since the multiple objectives being optimized are mostly conflicting, the search space is only partially ordered, see [276, 84]. Consequently, multi-objective optimization results in a set of solutions rather than a single solution. This set of solutions is found with the help of Pareto optimality theory [226] and is also called Pareto-optimal. The concept of Pareto optimality, which was originally proposed by Francis Ysidro Edgeworth [101] and later generalized by Vilfredo Pareto [226], is formally defined as follows [84]:

Definition 6.2 (Pareto Optimality) A solution x ∈ Ω is said to be Pareto-optimal with respect to Ω if and only if there is no x′ ∈ Ω for which v = F(x′) = (f_1(x′), . . . , f_k(x′)) dominates u = F(x) = (f_1(x), . . . , f_k(x)).

Definition 6.3 (Pareto Dominance) A vector u = (u_1, . . . , u_k) is said to dominate another vector v = (v_1, . . . , v_k) (denoted by u ≺ v) if and only if u is partially less than v, that is, ∀i ∈ {1, . . . , k} : u_i ≤ v_i ∧ ∃i ∈ {1, . . . , k} : u_i < v_i.

Definition 6.4 (Pareto-optimal Set) For a given MOP F(x), the Pareto-optimal set P∗ is defined as P∗ := {x ∈ Ω | ¬∃ x′ ∈ Ω : F(x′) ≺ F(x)}.

To determine an approximation of the Pareto-optimal set, evolutionary algorithms maintain a pool of solutions, called a population. A population consists of single individuals which are encoded by genes, which are combined into chromosomes. An individual is an encoded solution to some problem. This is the key difference compared with other optimization approaches, which operate on the problem formulation directly. By using an intermediate encoding, the problem domain and the optimization strategy can be strictly separated, see [276]. An individual is thus represented as a genotype, that is, a set of one or more chromosomes. Finally, a decoded genotype is called a phenotype. A phenotype is a solution in the objective space.

General Course of Action   To progress from the initial population to the following population, evolutionary operators are used. The three major evolutionary operators are recombination, mutation, and selection. The process of selection gives solutions of higher quality a larger probability to survive and generate offspring than those of lower quality.
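The dominance relation of Definitions 6.3 and 6.4 translates directly into code; a minimal sketch for minimization problems (function names are illustrative):

```python
def dominates(u, v):
    """Pareto dominance of Definition 6.3: u dominates v iff u is
    component-wise <= v and strictly < in at least one component."""
    return (all(ui <= vi for ui, vi in zip(u, v))
            and any(ui < vi for ui, vi in zip(u, v)))

def pareto_set(points):
    """Non-dominated subset of a finite list of objective vectors,
    i.e. a finite approximation of the Pareto-optimal set P* of
    Definition 6.4."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]
```

For example, among the bi-objective points (1, 2), (2, 1), (2, 2), and (3, 3), the first two are mutually non-dominated and form the Pareto set, while (2, 2) and (3, 3) are dominated.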
Recombination defines the process of exchanging genetic information between two or more individuals to better cover the solution space. A mutation operator randomly changes one or more genes in a chromosome to introduce additional solution diversity. Generally, evolutionary algorithms proceed in four major steps to obtain a new set of individuals:
1. Selection for variation,
2. Variation (generation of new solutions by applying the recombination and mutation operators),
3. Fitness assignment (evaluation of the new solutions),
4. Selection for survival.
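The four steps can be sketched as a generic single-objective evolutionary loop (this is only an illustrative skeleton, not the Opt4J implementation used by the framework; all function names are placeholders):

```python
import random

def evolve(population, fitness, mutate, crossover,
           generations=100, n_parents=10, n_offspring=10):
    """Generic evolutionary loop: (1) selection for variation,
    (2) variation via recombination and mutation, (3) fitness
    assignment, (4) selection for survival (lower fitness is better)."""
    size = len(population)
    for _ in range(generations):
        # 1. Selection for variation: pick the best individuals as parents.
        parents = sorted(population, key=fitness)[:n_parents]
        # 2. Variation: recombination of two parents, then mutation.
        offspring = [mutate(crossover(random.choice(parents),
                                      random.choice(parents)))
                     for _ in range(n_offspring)]
        # 3./4. Fitness assignment and selection for survival:
        # keep the best `size` individuals of parents plus offspring.
        population = sorted(population + offspring, key=fitness)[:size]
    return population
```

Because the survival step keeps the best individuals of the merged pool, the best fitness in the population can never get worse from one generation to the next (elitism).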
Advantages and Limitations   An important advantage of evolutionary algorithms is that they do not pose any restrictions on the nature of the objective functions and operate only on the objective functions themselves, as opposed to their derivatives or similar information. In general, evolutionary algorithms constitute a metaheuristic for solving optimization problems which is applicable to black-box optimization problems. One should not expect evolutionary algorithms to perform best for all optimization problems. There also exist cases where evolutionary algorithms perform decidedly poorly. This is due to the so-called No Free Lunch Theorem: if all optimization problems occur with the same probability, all optimization algorithms will have the same average performance [313]. After all, finding the global optimum of a general MOP is an NP-complete problem, see [33]. This means that optimization algorithms should always be matched with appropriate optimization problems to achieve good results.
6.2 Architecture-Compiler Coupling

In order to perform design space exploration for WPPAs, architectural and compiler-specific data must be collected for each single WPPA instance. The exploration-relevant information of a WPPA instance includes the following parameters x_i:
• Data path width (8, 16, 24 bit),
• Number and VLIW issue width of processing elements and instruction memories,
• Number and type of functional units (adder, multiplier, shifter, logical, data management),
• Configuration of the register file in each processing element (number of input ports, number of output ports, number of read/write ports, number of registers),
• Average utilization ratio of each functional unit in each processing element,
• Average utilization ratio of the register file in each processing element (percentage of accessed ports per cycle).

Generally, both architectural and compiler parameters can be varied during the exploration. However, in this work, the exploration is performed on the architectural parameters, with the tiling transformations of the compiler left untouched. Nevertheless, this does not restrict the presented exploration methodology in any regard, since the compilation and exploration phases are completely decoupled. For an in-depth treatment of the compiler and the corresponding implications, we refer to [129].
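As a hedged illustration of how the parameters x_i might be grouped into one record per WPPA instance (the field names and the `valid` check are invented for this sketch, not taken from the framework):

```python
from dataclasses import dataclass

@dataclass
class WppeParams:
    # Illustrative encoding of the exploration-relevant parameters x_i
    datapath_bits: int        # 8, 16, or 24
    issue_width: int          # VLIW issue width, 1..6
    adders: int               # functional units by type
    multipliers: int
    shifters: int
    regfile_read_ports: int   # register file configuration
    regfile_write_ports: int
    registers: int

    def valid(self):
        # Ranges named in the text: three bit widths, issue width 1..6
        return (self.datapath_bits in (8, 16, 24)
                and 1 <= self.issue_width <= 6)
```

Such a record would form one individual's genotype in the evolutionary exploration; invalid combinations can be rejected before the costly evaluation step.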
[Two stacked-bar plots of the power breakdown (%) among adder, multiplier, register file, and instruction memory under high, low, and idle switching loads: (a) 16 bit data path, (b) 24 bit data path.]
Figure 6.3: Example of power breakdowns for one processing element of different bit widths and an issue width of two (1× adder, 1× multiplier). Register file parameters: 4 read ports, 2 write ports, 2 inputs, 1 output, 2 general-purpose registers.

In the following, we give a short description of how each property of the architecture that is relevant for design space exploration can be directly obtained or estimated at a high level of abstraction. The exact, input-data-dependent values of the estimated properties could also be obtained by an instruction-set simulation, but this exploration scenario was not investigated in this thesis.

Data Path Width   The macro-modeling database contains power and area models of components for three different bit widths: 8, 16, and 24 bit. The bit width is one of the objective functions being maximized during the design space exploration. The rationale behind this decision is that for digital signal processing algorithms, a higher bit width is equivalent to a higher quality of signal processing; consider, for instance, high-definition audio and video coding.

Number and Issue Width of Processing Elements   The overall number of processing elements is determined during the scheduling and mapping stage of the retargetable compiler. As different hierarchical tiling transformations are performed, the number and type of processing elements is directly available after scheduling. The issue width of each processing element in the processor array primarily depends on the number of functional units in it. A larger issue width corresponds to a higher local-level parallelism; a smaller issue width and a higher number of processing elements corresponds to a higher degree of array-level parallelism. However, with a higher issue width, the power consumption of the instruction memory and the register file dominates that of the processing element, contributing together as much as 90% of the total power for low data switching loads, see Figures 6.3 and 6.4.

[Two stacked-bar plots of the power breakdown (%) among adder, multiplier, shifter, register file, and instruction memory under high, low, and idle switching loads: (a) 16 bit data path, (b) 24 bit data path.]

Figure 6.4: Example of power breakdowns for one processing element of different bit widths and an issue width of five (2× adder, 2× multiplier, 1× shifter). Register file parameters: 10 read ports, 5 write ports, 4 inputs, 2 outputs, 8 general-purpose registers.

Since the retargetable compiler is currently under active development and not all possible transformations are directly accessible for the WPPA as a target architecture yet, we make the following simplifying assumption for obtaining the number of processing elements during the exploration procedure: the scheduling and mapping phase is performed for one processing element which contains all functional units that an algorithm needs to achieve a certain level of computation latency. Since this may result in processing elements with tens of functional units and consequently several dozens of register file ports, this large processing element is subsequently, if needed, "cut apart" into several processing elements with an issue width specified by the exploration framework. This is a viable method, since we can provide a higher number of input/output ports and internal registers to each processing element so that the computational results can be passed across the array as needed. Generally, this simplification is not mandatory and is primarily used here due to the temporary compilation restrictions.

Number and Type of Functional Units   The type of functional units needed to compute a certain algorithm is defined by the algorithm itself. If additions and multiplications have to be performed, at least one adder and one multiplier unit should be available.¹⁴ On the other hand, the maximum number of each functional unit type is
¹⁴ A WPPA does not support software multiplication, mainly for latency reasons.
[Gantt charts: functional units ADD1, ADD2, MUL and register file ports REG1–REG3 scheduled over the iteration interval P; left-hand side: units busy in every cycle, right-hand side: units busy only every second cycle.]
Figure 6.5: Gantt charts for different utilization patterns of functional units. P denotes the length of the iteration interval, see [129].

set by the exploration framework during the exploration and forms the main part of the genotype of each generated solution. Depending on the number of available functional units, the compiler provides a performance-optimized schedule which minimizes the computation latency of the algorithm.

Register File Configuration   In each processing element, a global multi-ported register file is used to store intermediate computation results. All functional units in a processing element have parallel, unrestricted access to all registers. This is also the main reason for restricting the possible issue widths to a maximum of six: to keep the area and power overheads of the corresponding register file with 12 read and 6 write ports under control. Furthermore, input shift registers and output ports are physically integrated into the local register file as well. On the whole, the hardware design space of each register file consists of 1,296 architectural points. Depending on the utilization pattern and switching load on the read, write, and control ports (register address ports), additional "software" design points are supplemented. The corresponding characterization took us about eight weeks at an average characterization speed of 10 design points (hardware & software) per hour, see Section 4.¹⁵

Utilization Ratio of Functional Units   The utilization ratio of the functional units reflects the long-run behavior of the algorithm on an abstract level. This gives us the possibility to use this information directly for the estimation of the average dynamic power consumption of each functional unit. Figure 6.5 provides an example scenario in which three functional units are engaged: two adders and one multiplier. In the first case, depicted on the left-hand side, all functional units are occupied in each clock cycle. This yields a utilization ratio of 100%. In the second case, depicted on the right-hand side, the adders and the multiplier are used only every second clock cycle. This corresponds to an average utilization ratio of 50%. Since our architecture uses the operand isolation technique, this directly translates into a halved dynamic power consumption of each functional unit. If some functional units are idle during the computation, only their static power consumption is taken into account. This happens mainly in the case of run-time reconfiguration, where several different algorithms are executed on the same WPPA hardware.

¹⁵ Standard consumer 64-bit PC with 8 gigabytes of main memory and a Linux operating system.

Register File Utilization   The utilization pattern of the functional units not only provides information for the estimation of the average dynamic power consumption in the functional units but also has a direct impact on the utilization of the register file ports. As we already saw in Figures 4.8 and 4.9 in Section 4.2, the port utilization and the switching load have a major impact on the overall dynamic power consumption of the register file. This information is obtained as follows: for each clock cycle in the iteration interval P (cf. the example given in Figure 6.5), we compute the "vertical" utilization rate as the fraction of register file ports used in this clock cycle. This can be compared to the "horizontal" utilization rate used for the functional units. Since the iteration interval is always bounded, this technique works for all valid schedules produced by the compiler. In the case of Figure 6.5, this gives utilization patterns of (100%) for the left-hand side and (66%, 33%) for the right-hand side. For a utilization ratio of 0%, not only static power is consumed, as with the functional units, but also dynamic clocking power. Therefore, the idle dynamic power consumption of a register file is still very high compared to the functional units: between 50% and 60% of the power under the high data switching load.
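The "horizontal" (per-unit) and "vertical" (per-cycle) utilization rates described above can be sketched as follows; the schedule encoding as boolean occupancy vectors is an assumption for illustration:

```python
def horizontal_utilization(schedule):
    """Per-unit ('horizontal') utilization: fraction of cycles in the
    iteration interval P in which a functional unit is busy.
    `schedule` maps a unit name to a list of P booleans."""
    return {unit: sum(busy) / len(busy) for unit, busy in schedule.items()}

def vertical_utilization(port_schedule):
    """Per-cycle ('vertical') register file port utilization: fraction
    of ports accessed in each clock cycle of the iteration interval.
    `port_schedule` is a list (over cycles) of per-port booleans."""
    return [sum(ports) / len(ports) for ports in port_schedule]
```

For the right-hand case of Figure 6.5, a unit busy every second cycle yields a horizontal utilization of 50%, and a register file with two of three ports used in one cycle and one of three in the next yields the vertical pattern (66%, 33%).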
Accounting for Run-time Reconfiguration   In the case of run-time reconfiguration, all processing elements currently not utilized by the algorithm are power gated, see Section 5.4. Furthermore, the reconfiguration logic of each processing element is completely clock gated on the architectural level, as described in Section 5.1. Therefore, concerning the power consumption, we can concentrate on the processing elements that are actually utilized by the algorithm, as well as their main functional submodules (functional units, instruction memory, register file). During the design space exploration for several different algorithms running on the same WPPA hardware instance, the functional units, and possibly whole processing elements, missing in the hardware configuration of one algorithm are automatically added to the hardware configuration of the other algorithms, and vice versa. In this case, the maximal power consumption and maximal execution latency over the algorithms being explored are taken as the respective power consumption and execution latency of the generated solution. The rationale behind this decision is that by using the maximum operator we guide the exploration procedure towards more balanced solutions. That means that, contrary to taking the average values, solutions which exhibit extreme values for one of the algorithms tend to be discarded more often than the balanced solutions. This technique showed very good results, even compared to the exploration of architectures tailored to one specific algorithm only, as we will show in the next section.

[Dataflow graphs: (a) Fast Cosine Transform, (b) Discrete Cosine Transform.]

Figure 6.6: Dataflow graphs with arithmetic operations for the Discrete and Fast Cosine Transforms.
6.3 Exploration Results

We performed a design space exploration for three popular algorithms from the digital signal processing domain. For convenience, these algorithms are briefly described in the following.

Discrete Cosine Transform   The Discrete Cosine Transform (DCT) algorithm, see [27], is used in digital signal processing to efficiently transform a signal from the time domain to the frequency domain and vice versa. Perhaps the most prominent use of the DCT is in the JPEG image compression standard [151], where a two-dimensional version of the DCT is used to compute the set of coefficients needed for a compact digital image representation. Other applications of the DCT are feature extraction and dimensionality reduction in pattern recognition, image watermarking and data hiding, as well as other image processing applications. The two-dimensional version can further be subdivided into two one-dimensional transformations, namely the horizontal DCT and the vertical DCT. This further reduces the number of required arithmetic operations. In our case study, we investigate how the horizontal DCT for an 8 × 8 image block, see the dataflow graph in Figure 6.6(b), can be efficiently implemented on WPPA hardware.
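For reference, the transform being mapped can be written down directly; the following is a textbook direct (non-fast) 1-D DCT-II over one row of a block, not the thesis implementation:

```python
import math

def dct_1d(row):
    """Direct 1-D DCT-II of a length-N row, O(N^2) operations:
    X_u = c(u) * sum_x row[x] * cos((2x + 1) * u * pi / (2N)),
    with c(0) = sqrt(1/N) and c(u) = sqrt(2/N) for u > 0."""
    n = len(row)
    out = []
    for u in range(n):
        s = sum(row[x] * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                for x in range(n))
        c = math.sqrt(1 / n) if u == 0 else math.sqrt(2 / n)
        out.append(c * s)
    return out
```

A constant input row produces only a DC coefficient, all higher coefficients being zero; the fast variant (FCT) computes the same result with O(N log₂ N) multiplications.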
6 Design Space Exploration of WPPA Architectures

Figure 6.7: General linear digital filter schematic with distributed adders.

Finite Impulse Response Filter The general mathematical description of a linear digital filter is given by the following formula:

y(t_N) = \sum_{k=0}^{N} \alpha_k x_{N-k} - \sum_{k=1}^{N} \beta_k y_{N-k} .    (6.1)
Here, αk and βk denote the filter coefficients, see Figure 6.7. Furthermore, z−1 denotes a register. In our case, all coefficients βk are set to zero, that is, we consider a finite impulse response (FIR) filter. For the implementation of a FIR filter of order N, N real-number additions and multiplications have to be performed. The filter order N in our case study example is eight. During the design space exploration, the number of available adder and multiplier modules is varied. Subsequently, a compilation is performed which determines whether the corresponding algorithm can be mapped onto the available hardware. For each feasible solution the objective functions are evaluated and the results are passed on to the exploration framework. For the power evaluation, an average data switching load is assumed.

Fast Cosine Transform The Fast Cosine Transform (FCT), see [187], is a method to compute a Discrete Cosine Transform with an algorithmic complexity of O(N log2(N)) real-number multiplications instead of O(N²). The dataflow graph of the FCT is given in Figure 6.6(a).

Parameters of the Evolutionary Algorithm During the design space exploration, the parameters of the evolutionary algorithm [184] were set to the following values:
• Number of generations: 100,
• Number of individuals per generation, population size α = 30,
• Number of parents per generation, μ = 10,
• Number of offspring per generation, λ = 10,
• Crossover rate (recombination operator) pc = 0.5,
• Mutation rate (mutation operator) pm = 0.1.

The resulting sets of found non-dominated solutions each contained 40 individuals. In the following, we first introduce the exploration results for the implementations of single algorithms. After that, the results of a combined exploration of the DCT and FIR algorithms running on the same WPPA hardware are presented and discussed. Finally, we compare both results regarding their energy-delay product metrics.
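To make Equation (6.1) concrete before turning to the results, the FIR special case (all βk = 0) can be stated as a minimal reference implementation in plain Python — a functional sketch only, since on a WPPA the multiply-accumulate operations are distributed over the processing elements:

```python
def fir_filter(alpha, x):
    """Direct-form FIR filter, the beta_k = 0 special case of Equation (6.1):
    y_n = sum_{k=0}^{N} alpha_k * x_{n-k}, with x_n = 0 for n < 0."""
    N = len(alpha) - 1  # filter order
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k in range(N + 1):
            if n - k >= 0:           # samples before t = 0 are zero
                acc += alpha[k] * x[n - k]
        y.append(acc)
    return y
```

Feeding a unit impulse recovers the coefficients: `fir_filter([0.5, 0.25, 0.125], [1, 0, 0, 0])` yields `[0.5, 0.25, 0.125, 0.0]`.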
6.3.1 Single Algorithm Explorations

For each of the three presented algorithms (DCT, FIR, FCT), a single exploration according to Figure 6.1 was performed. The results are presented as three-dimensional scatter plots of the energy-delay product over the number of adders and multipliers (Figures 6.8, 6.12, 6.10, and 6.14) as well as of the latency over the energy and chip area (Figures 6.9, 6.13, 6.11, and 6.15). For better readability, each plot is split into two parts, a top part (a) and a bottom part (b). The uncolored design points represent the dominated, and the colored design points the non-dominated solutions.

Discrete Cosine Transform The exploration results for the DCT algorithm are given in Figure 6.8 and Figure 6.9. The dataflow graph of the DCT contains 32 additions and 32 multiplications. During the exploration we found that only up to 19 adder modules can be used efficiently for computations. On the other hand, the multiplications can be mapped efficiently on up to 32 multipliers. This is not surprising, since the dataflow graph already indicates that all 32 multiplications can be performed in parallel due to the absence of internal data dependencies. Furthermore, this serves as an additional indication that the exploration was performed correctly. The overall size of the design space is 32·32·6·3 = 18 432 solutions (#add · #mul · issue width · data width). During the exploration, 2 016 solutions were evaluated. This results in a design space coverage of 2 016 / 18 432 ≈ 11%. That means that the evolutionary algorithm was able to identify a large number of non-dominated solutions by exploring only about 1/10 of the overall hardware design space. The exploration took ≈ 5 hours, which results in an average speed of 403.2 solutions per hour (compilation & power/area estimation). The compilation takes the major part of the evaluation time (≈ 97%).
The reason for this is the complexity of solving the mixed integer linear programs for allocation and scheduling, see [129].
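The dominated/non-dominated classification used in all scatter plots follows the standard Pareto-dominance definition; a generic sketch (not the actual framework code, and assuming all objectives are to be minimized):

```python
def dominates(a, b):
    """a dominates b if a is no worse in every objective (e.g. latency,
    area, power) and strictly better in at least one; all minimized."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def non_dominated(solutions):
    """Return the Pareto front of a list of objective vectors."""
    return [s for s in solutions
            if not any(dominates(o, s) for o in solutions if o is not s)]
```

For example, among the objective vectors (1, 2), (2, 1), (2, 2), and (3, 3), only the first two are non-dominated.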
ID  #add  #mul  EDP (pJ·s)  Data  Issue  Lat. (cyc.)  Area (mm²)  Power (mW)  Energy (μJ)
A    1     2     23.66       24    3      257          5.66·10⁻²   14.33       18.41
B    19    17    1.81        24    6      19           0.76        200.54      19.05
C    17    19    0.68        8     6      19           0.4         74.82       7.11
X    10    8     1.07        8     3      34           0.19        37.09       6.31

Table 6.1: Characteristic properties of the found non-dominated WPPA solutions (DCT).

The power consumption values range between 14.33 mW (solution A) and 200.54 mW (solution B), see Table 6.1. The solution with the lowest energy-delay product of only 0.68 pJs is solution C. It should be noted that if the overall number of functional units is not evenly divisible by the issue width, additional idle adder modules are supplemented. The unit numbers given in the tables are the overall adder numbers including these idle adders.

FIR Filter The exploration results for the 8-tap FIR filter are presented in Figures 6.10, 6.11, and Table 6.2. The design with the lowest power consumption, only 4.29 mW, is solution A. The (non-dominated) solution with the maximum found power consumption is solution B (95.87 mW). The design with the lowest found energy-delay product is solution C with 6.13 pJs. The equivalent 24-bit version has an energy-delay product of 14.98 pJs (solution D). With 625 explored design points out of 8·8·6·3 = 1 152 (#add · #mul · issue width · data width) in 1.5 hours, this results in an average speed of 416.66 evaluations per hour. The design space coverage is 54%. Since the algorithm is rather simple, half of the design space was covered quickly. An interesting observation in Figure 6.10(b) is that the evolutionary algorithm found two non-dominated solutions for each of the bit widths (8, 16, 24) with the same hardware configuration of 10 adders and 8 multipliers. These are the design points lying vertically between solution C at the bottom and solution B at the top. The only difference is their issue width: 6 for the upper and 3 for the lower design points.
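As a reminder of how the energy-delay product column relates to the power and latency columns: energy is power integrated over the execution time, and the EDP multiplies that energy by the execution time once more. A small helper makes the relationship explicit (the clock frequency here is a placeholder assumption, not a value reported in the thesis):

```python
def energy_delay_product(power_w, latency_cycles, clock_hz):
    """EDP = E * t = P * t^2, where t = latency_cycles / clock_hz.
    Lower values reward designs that are simultaneously fast and frugal."""
    t = latency_cycles / clock_hz   # execution delay in seconds
    energy = power_w * t            # E = P * t (joules)
    return energy * t               # joule-seconds
```

For instance, `energy_delay_product(0.2, 19, 200e6)` evaluates the metric for a hypothetical design drawing 200 mW for 19 cycles at 200 MHz; doubling the latency at constant power quadruples the EDP.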
ID  #add  #mul  EDP (pJ·s)  Data  Issue  Lat. (cyc.)  Area (mm²)  Power (mW)  Energy (μJ)
A    1     1     44.02       8     2      641          2.50·10⁻²   4.29        13.73
B    10    8     18.56       24    6      88           0.39        95.87       42.18
C    10    8     6.13        8     3      88           0.19        31.67       13.94
D    10    8     14.98       24    3      88           0.32        77.39       34.05

Table 6.2: Characteristic properties of the found non-dominated WPPA solutions (FIR).

Fast Cosine Transform The manual exploration results for the FCT presented in [7] cover only a small fraction of the overall solution space. A more complete view of the solution space after an automatic design space exploration is given in Figures 6.12 and 6.13. In this case, up to 32 adders and up to 5 multipliers can be used to accelerate the execution. Again, this appears feasible if we look at the dataflow graph given in Figure 6.6(a) and also count the idle adders mentioned earlier. However, the exploration time for the FCT was substantially larger than for all other algorithms, namely 8 days. With 859 explored solutions out of 32·32·6·3 = 18 432 (#add · #mul · issue width · data width), this results in an average speed of only 4.47 evaluations per hour, that is, an exploration two orders of magnitude slower than in the previous case. The design space coverage lies here around 4.7%. The power consumption values range from 24.54 mW (solution A) to 219.13 mW (solution B). The latency of these solutions differs by a factor of almost 5× (lower power at higher latency), see Table 6.3. However, the lowest energy-delay product of 17.49 pJs is achieved by solution C with a power consumption of 63.44 mW. The equivalent 24-bit configuration has only an approximately doubled energy-delay value of 36.58 pJs (solution D). Compared to solution C for the DCT, the corresponding solution for the FCT has a 25.7× higher energy-delay product at comparable power consumption values. Although the lowest power consumption is lower than in the case of the DCT, latency plays the dominating role here. For the FCT, no solutions with a latency lower than 105 cycles could be found during this exploration run. Another aspect worth mentioning is the appropriate formulation of the algorithm in the special single-assignment language PAULA, see [129]. Special care is always needed to write the algorithm in the form the compiler can optimize best.
6.3.2 Multi-Algorithm Exploration

To investigate the exploration of several different algorithms for run-time reconfigurable WPPAs, we performed a combined exploration for the DCT and FIR algorithms. The compound design space still contains 18 432 solutions. The coverage ratio was 1 984 / 18 432 ≈ 11%, which is comparable to the case of the single-algorithm DCT exploration. The results are depicted in Figures 6.14 and 6.15. Table 6.4 gives the detailed values of the objective functions.
ID  #add  #mul  EDP (pJ·s)  Data  Issue  Lat. (cyc.)  Area (mm²)  Power (mW)  Energy (μJ)
A    8     4     154.61      8     2      502          0.15        24.54       61.6
B    31    5     60.4        24    4      105          0.83        219.13      115.05
C    31    5     17.49       8     3      105          0.38        63.44       33.31
D    31    5     36.58       24    3      105          0.57        132.72      69.68

Table 6.3: Characteristic properties of the found non-dominated WPPA solutions (FCT).
It should be noted that the values of the objective functions given in Table 6.4 are compound maximum values of the objective functions of the single algorithms. The rationale behind this is the already mentioned usage of the maximum operator during the multi-algorithm exploration.

Observation 6.1 The maximum operator allows us to find balanced architectural solutions for the respective algorithms, since any intermediate solutions with extreme values of the objective functions will eventually be discarded by the evolutionary algorithm. With minimum or average operators, intermediate solutions with highly unbalanced objective values for single algorithms would be retained and lead to a one-sided architecture instance optimized for one algorithm only.

The (non-dominated) solution with the lowest average power consumption is solution A with 11.28 mW. Solution B has the highest average power consumption of 98.28 mW. Solution C has the lowest maximum energy-delay product of the two explored algorithms, 6.13 pJs (in this case from the FIR algorithm, since it is higher than the 0.68 pJs for the DCT). That indeed highly balanced solutions are found can be seen in Figures 6.8(b), 6.9(b) (DCT), and Figures 6.10(b), 6.11(b) (FIR). Solution X is the minimum energy-delay point found during the multi-algorithm exploration. In the case of the FIR it is identical to its own solution C (minimum energy-delay single-algorithm FIR). For the DCT it is still a highly optimized solution with a minor latency trade-off of 34 cycles instead of 19 cycles compared with solution C (minimum energy-delay single-algorithm DCT), see Tables 6.1 and 6.2.
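The maximum-operator combination from Observation 6.1 is straightforward to state in code; a sketch of how the per-algorithm objective vectors of one hardware configuration could be merged during the multi-algorithm exploration (the example values are the latency, power, and EDP of solutions C from Tables 6.1 and 6.2):

```python
def compound_objectives(per_algorithm):
    """Merge the objective vectors obtained for each algorithm on the same
    hardware configuration into one compound vector via the maximum operator."""
    return tuple(max(values) for values in zip(*per_algorithm))

# Latency (cycles), power (mW), EDP (pJ·s) for DCT solution C and FIR solution C:
dct_c = (19, 74.82, 0.68)
fir_c = (88, 31.67, 6.13)
print(compound_objectives([dct_c, fir_c]))  # -> (88, 74.82, 6.13)
```

An instance that is extreme for either algorithm inflates the compound vector and is therefore likely to be dominated, which is exactly what steers the search towards balanced architectures.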
ID  #add  #mul  EDP (pJ·s)  Data  Issue  Lat. (cyc.)  Area (mm²)  Power (mW)  Energy (μJ)
A    3     2     29.79       8     5      325          6.01·10⁻²   11.28       18.34
B    10    8     18.56       24    6      88           0.39        98.28       42.18
C    10    8     6.13        8     3      88           0.19        37.09       13.94

Table 6.4: Characteristic properties of the found non-dominated WPPA solutions (FIR and DCT).

6.4 Issue Width Impact

Figure 6.16 shows the exemplary dependence of the average area (left) and power consumption (right) of a WPPA on the VLIW issue width of the single processing elements. Issue widths as small as one are expensive regarding both area and power consumption because of the instruction memory overhead and the large number of processing elements. The large increase at the transition from three to four issue slots is mainly due to the doubled number of input, general-purpose, and output registers. This heuristic is applied here because a higher number of functional units normally also demands more input signals, that is, a higher memory bandwidth. Generally, with a higher issue width the overall number of processing elements in the array decreases, which directly leads to a smaller area and power consumption. This is especially true for smaller data path bit widths (8 or 16 bits), where it is more advantageous to deploy processing elements with a large issue width of 5 or 6, since the additional registers do not incur large overheads. For larger bit widths (24 bits), however, it may be advantageous to use a larger number of processing elements with a narrower issue width instead of a small number of large ones, since the corresponding increase in memory bandwidth can be expensive.

Observation 6.2 According to our exploration results for WPPAs, for wider data path widths array-level parallelism is more power efficient than instruction-level parallelism.
6.5 Estimation Methodology and Accuracy

For the presented design space exploration, the relational database-based framework described in Section 4.3.4 is used together with the respective probabilistic power models from Chapter 4. Throughout the exploration, an average data switching activity was assumed. The switching activity of the control and enable signals was taken from the scheduling information provided by the compiler, as described in Section 6.2. Based on the profiling results obtained in Chapters 4 and 5, we can state that both the relative and the absolute accuracy of the exploration results are high. However, no empirical tests were conducted to quantitatively assess the absolute accuracy of the found solutions.
6.6 Related Work

The usage of MOEAs or genetic algorithms for design space exploration is not new. However, the existing work on this topic treats either RISC- or ASSP-based designs [223, 203, 164, 115, 93, 98], VLIW-based processors [31, 221, 222], or, most recently, router configuration in networks-on-chip [155, 305]. The application of genetic algorithms to high-level synthesis was published in [245, 246, 283]; there, however, application-specific circuits are the subject of power and area exploration. To the best of our knowledge, there exists no comparable exploration framework for coarse-grained reconfigurable architectures which is used to determine Pareto-optimal design points regarding area, power, and energy consumption for multiple applications simultaneously.
6.7 Summary

In this chapter, the feasibility of design space exploration for CGRAs with the help of state-of-the-art multi-objective evolutionary algorithms was shown. The presented framework for WPPAs allows us to perform a highly accurate and expeditious automatic evaluation of any possible WPPA instance in terms of area, performance, and power at a high abstraction level. Furthermore, a substantial acceleration of the automatic exploration procedure is achieved by the deployment of the novel, relational database-based macro-modeling methodology and state-of-the-art multi-objective evolutionary algorithms. The presented framework constitutes the means to automatically determine the absolute upper and lower limits of the objectives for a given parameter range, which would be impossible to achieve otherwise.
Figure 6.8: Energy-delay product over the number of functional units (DCT). Non-dominated solutions are colored.
Figure 6.9: Latency over the energy and chip area (DCT). Non-dominated solutions are colored.
Figure 6.10: Energy-delay product over the number of functional units (FIR). Non-dominated solutions are colored.
Figure 6.11: Latency over the energy and chip area (FIR). Non-dominated solutions are colored.
Figure 6.12: Energy-delay product over the number of functional units (FCT). Non-dominated solutions are colored.
Figure 6.13: Latency over the energy and chip area (FCT). Non-dominated solutions are colored.
Figure 6.14: Energy-delay product over the number of functional units (FIR and DCT). Non-dominated solutions are colored.
Figure 6.15: Latency over the energy and chip area (FIR and DCT). Non-dominated solutions are colored.
Figure 6.16: Dependence of the average area and power consumption of a WPPA processor array (DCT and FIR multi-algorithm) on the VLIW issue width of a single processing element for different data path widths (the bars denote the corresponding standard deviation). The number of functional units was constrained to lie between 18 and 30, that is (9 ≤ #add ≤ 15) and (9 ≤ #mul ≤ 15).
7 Concluding Remarks

The major reasons which led to the implementation of and research on weakly programmable processor arrays were the following:
• Regarding different applications, for example in mobile devices, no "one-size-fits-all" approach is possible. Systolic, i.e., non-programmable, dedicated array architectures are too inflexible. On the other hand, application-specific customization is necessary to achieve high energy and area efficiency. This requires highly parameterizable and extensible design templates.
• Neither exact values nor efficient evaluation techniques for the power consumption of large TCPAs and CGRAs were available. Also, no representative power traces for relevant applications were published.
• There existed no publications on advanced power reduction techniques applied to CGRAs which leveraged their special architectural properties.
• Finally, no publications were available on architecture-level power estimation and design space exploration frameworks for TCPAs.

The research on these topics finally resulted in the following contributions:

Architecture Implementation The distinguishing property of a WPPA is that it is an architectural template rather than a fixed CGRA architecture like many others mentioned in the introduction. This enables a flexible customization of the hardware resources to a prospective set of applications at design time as well as an automatic design space exploration.

WPPA Setup, Reprogramming Management, Validation: The proper structuring and efficient handling of the configuration data for WPPAs is of utmost importance both for configuration speed and flexibility. Therefore, we developed an efficient protocol which encapsulates the physical implementation details like address boundaries, etc., and instead gives the designer a logical view of the system at the algorithmic level.
Since the generation of the proper binary data is automated and the parsing is implemented in hardware, the associated timing overhead is negligible. Due to the deployment of clock gating and power gating techniques, the power overhead is also reduced significantly.
Compiler-friendly and Adaptive Architecture: To enable efficient compiler optimization, a separate control data interconnect is provided. This interconnect is completely independent of the data interconnect, has its own parameterized width, and can also be changed dynamically by the processing elements themselves. All hardware structures for data handling, like circular buffers, parameterized shift registers, and the corresponding register files, are also available for control signals. As already mentioned, each processing element can reprogram its interconnect at run-time by directly accessing the corresponding registers.

Power Modeling and Optimization The presented evaluation and modeling framework allows an extremely fast but still accurate power, area, and latency characterization of different design alternatives in the multidimensional design space of highly parameterized coarse-grained reconfigurable processor arrays. For the average power estimation of pre-designed hardware modules, we propose a table-based, probabilistic macro-modeling technique with non-uniform parameter sampling, implemented by means of a relational database. Since current embedded hardware architectures are primarily power-constrained, state-of-the-art power reduction techniques were implemented in the WPPA architecture.

Clock Gating: A novel clock gating scheme was used to substantially reduce the dynamic power consumption both during the reconfiguration phase (up to 5×) and during the functional processing phase (up to 37% reduction).

Power Gating: We also developed a systematic approach to efficiently handle a very large number of power domains (one for each processing element) in modern tightly-coupled processor arrays in order to tightly match the different computational demands of the processed algorithms with the corresponding power consumption.
It is based on a new, highly scalable and generic power control network and additionally uses a state-of-the-art Common Power Format based front-to-back-end design methodology for a fully automated implementation. The power management is transparent to the user and is seamlessly integrated into the overall reconfiguration process: reconfiguration-controlled power gating.

Design Space Exploration The feasibility of design space exploration for TCPAs with the help of state-of-the-art multi-objective evolutionary algorithms was shown. The presented framework for WPPAs allows us to perform a highly accurate and expeditious automatic evaluation of any possible WPPA instance in terms of area, performance, and power at a high abstraction level. This exploration framework is used to approximate Pareto-optimal design points regarding area, power, and energy consumption for multiple applications simultaneously, implemented by means of reconfiguration.
Figure 7.1: Productivity waves (original version by Synopsys, Inc., 2010).
7.1 Future Directions: Invasive Tightly-Coupled Processor Arrays

The principal paradigm shifts and trends in modern microprocessor design which were discussed in this thesis are highlighted in the following summary, see also [50]:
• Frequency scaling will slow down or stop.
• Various types of parallelism will be exploited, such as many-core designs, parallel vector engines, and array processing in general.
• Stringent constraints on power consumption will dominate future architectures.

A new era of microprocessor scaling is based not exclusively on device scaling but, to a large extent, on extended adaptability and reconfigurability, as well as on energy and power efficiency. Extensive process variations will force future circuits to become more error-resilient. Sophisticated error detection and correction techniques will be needed to automatically correct errors so that the user application can continue. First prototypes of such systems were already presented in [56] and [288]. Especially for massively parallel systems, such error resilience can be achieved with relatively low overheads. Increasing leakage power consumption will finally make new technology approaches like three-dimensional FinFET transistors [259, 172] and transistors based on graphene or nano-carbon [107], together with optical on-chip or inter-chip interconnects, a necessity. The complexity of digital designs will grow further and reach a new level with massively parallel, many-core systems, as schematically shown in Figure 7.1. To tackle the extreme complexity of such systems, revolutionary new approaches like
invasive computing [275] will be necessary. The distinguishing properties of invasive computing can be summarized as follows, see [284]: • A new concept of dynamic and resource-aware programming is introduced: A given program gets the ability to explore and dynamically spread its computations to neighbour processors similar to a phase of invasion, then to execute portions of code of high parallelism degree in parallel based on the available (invasible) region on a given multi-processor architecture. Afterwards, once the program terminates or if the degree of parallelism should be lower again, the program may enter a retreat phase, deallocate resources, and resume execution again, for example, sequentially on a single processor. • Invasion might provide the required self-organising behaviour to conventional programs for being able not only to tolerate certain types of faults and cope with feature variations, but also to provide scalability, higher resource utilisation numbers and performance gains by adjusting the amount of allocated resources to the temporal needs of a running application.
8 German Part

Energieeffiziente eng gekoppelte Prozessorfelder für digitale Signalverarbeitung
Zusammenfassung Der Schwerpunkt der vorliegenden Dissertation liegt auf der neuen Architekturklasse der energie- und fl¨acheneffizienten, schwachprogrammierbaren, eng gekoppelten Prozessorfelder (eng. Weakly Programmable Processor Arrays, WPPAs). Sie verbinden die Vorteile anwendungsspezifischer integrierter Schaltungen (Chipfl¨achenbedarf, Energieverbrauch und Rechenleistung) mit der Flexibilit¨at g¨angiger MulticoreSoCs, siehe Abb. 8.1. Diese Flexibilit¨at wird durch die Programmierbarkeit der einzelnen Prozessorelemente und der Verbindungstopologie des Prozessorfeldes erreicht. Die Programmierbarkeit ist anwendungsspezifisch und damit eingeschr¨ankt. Das Anwendungsfeld f¨ur schwachprogrammierbare Prozessorfelder bilden vor allem eingebettete SoC-Systeme f¨ur datenflussdominante Anwendungen, wie zum Beispiel Video- und andere digitale Signalverarbeitung in tragbaren Multimedia- und Mobilfunkger¨aten. Einerseits werden dort h¨ochste Rechenleistungen ben¨otigt, andererseits wird durch die begrenzte Batteriekapazit¨at die Gr¨oße und W¨armeemission und somit auch die Leistungsaufnahme der verwendeten Hardware stark eingeschr¨ankt. Im Gegensatz zu g¨angigen Multicore-SoC Architekturen, die in den meisten F¨allen nur Tasklevel-Parallelit¨at ausnutzen, ben¨otigen die oben genannten Anwendungen eine eng gekoppelte, parallele Verarbeitung auf mehreren Ebenen: Schleifen-Parallelit¨at (auf der Prozessorfeldebene), Instruktions-Parallelit¨at, sowie funktionales und Software-Pipelining. Mit Hilfe heutiger Halbleitertechnologien k¨onnen WPPAs auf einem einzigen Chip realisiert werden oder in Form eines IP-Cores in einem SoC integriert sein. 
Die Ergebnisse dieser Dissertation zeigen, dass es f¨ur die oben erw¨ahnten Anwendungsfelder mit Hilfe der vorgeschlagenen Methoden m¨oglich ist, Systeme zu entwerfen, die 1/10 bis 1/100 der Chipfl¨ache bei 100- bis 300-fachen Steigerung der Leistungseffizienz besitzen, bei einer Rechenleistung vergleichbar mit der von herk¨ommlichen Multicore-SoCs. Die wesentlichen Beitr¨age der vorliegenden Arbeit liegen in den folgenden vier Forschungsgebieten: (1) Erforschung von Architekturen, (2) Effiziente Modellierung des Leistungsverbrauchs auf einer hohen Abstraktionsebene, (3) Optimierung des Energie- und Leistungsverbrauchs, sowie (4) Effiziente Parameterraum-Exploration. Erforschung von Architekturen Die WPPA-Architekturen werden aus mehreren schwachprogrammierbaren Prozessorelementen (WPPE) gebildet, die zu Prozessorfeldern verbunden sind. Jedes einzelne Prozessorelement eines WPPAs besitzt
189
German Part
&38
&38
,2
,2
,2
,2
,2
,2
,2
&38
,2
8
,2
,2
&DFKH &DFKH &DFKH
&38
&38
6WDQGDUG 0XOWLFRUH 3UR]HVVRU
,2
&38
,2
&DFKH &DFKH &DFKH
6FKZDFKSURJUDPPLHUEDUHV 3UR]HVVRUIHOG
$QZHQGXQJV6SH]LILVFKH 6FKDOWXQJ
Abbildung 8.1: Unterschiede zwischen Multicore-Prozessoren, schwachprogrammierbaren eng gekoppelten Prozessorfeldern, sowie anwendungsspezifischen integrierten Schaltungen. Ein schwachprogrammierbares Prozessorfeld stellt einen Kompromiss zwischen der Flexibilit¨at auf der einen und der Effizienz auf der anderen Seite dar.
a VLIW (Very Long Instruction Word) architecture. They are called weakly programmable because the size of the instruction memory is limited and the control overhead for algorithms of a given application class is kept as small as possible. For example, no interrupts or exceptions are supported. The instruction memory holds one VLIW program at a time. Each WPPE has parameters, for example the number and type of the functional units (adders/subtractors, multipliers, shift units, logic units), and can be parameterized at synthesis time. The VHDL template can be extended with special functional units, for example MAC (multiply-accumulate) units or barrel shifters. A parameterizable data register file is used to store intermediate results. Input data are buffered in dedicated FIFO (first in, first out) memory elements at the inputs. Communication between the individual processor elements plays a very important role in parallel hardware architectures. Flexible interconnect structures can be realized with the concept of an interconnect wrapper module, one of which belongs to each WPPE. In addition, it can be used to describe parameterizable interconnect networks. By changing the values of special control registers in the interconnect wrapper module, the architecture can switch between the individual interconnect topologies at run time.

Efficient Power Consumption Modeling

To enable an extremely fast estimation of power and area consumption at the architecture level, a probabilistic macro-modeling methodology is proposed that is based on a novel implementation using a relational database as well as non-uniform parameter sampling. With it, large processor arrays with hundreds of processor elements can be characterized with respect to their energy and power consumption within a few minutes. The estimation error stays within 10% of modern commercial analysis tools that operate on a gate-level netlist with place-and-route information.

Optimization of Energy and Power Consumption

Both the dynamic and the static power consumption are reduced massively by means of modern techniques as well as architecture-specific properties. The application of a hybrid clock-gating technique reduces the dynamic power consumption by up to 35%; compared with conventional automatic clock-gating techniques, this is an improvement by a factor of three. Example implementations of WPPAs of different sizes of 2 × 2 and 3 × 8 in a commercial 90 nm CMOS standard-cell technology achieved power-efficiency values of 0.064 mW/MHz and 124 MOPS/mW. Compared with modern coarse-grained and embedded microprocessor architectures, this corresponds to an increase in power efficiency by a factor of 1.7 to 28. The reduction of the static power consumption during operation as well as during standby is achieved by means of the power-gating technique and an automated design flow based on the Common Power Format. To ensure the scalability of this method for future large processor arrays with thousands of processor elements, a novel, efficient interconnect network with asynchronous control is presented.
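The probabilistic macro-modeling approach summarized above stores characterization points in a relational database and samples the parameter space non-uniformly. The following Python sketch illustrates the idea only; the table layout, column names, and sample values are hypothetical, not the actual schema used in the thesis.

```python
import sqlite3

def build_macromodel_db():
    """Hypothetical characterization table; names and values are illustrative."""
    db = sqlite3.connect(":memory:")
    db.execute("""CREATE TABLE rf_data (
                    toggle_rate REAL, probability REAL, total_power REAL)""")
    # Non-uniform sampling: the grid is denser at low toggle rates,
    # where the power consumption varies most strongly.
    samples = [(0.001, 0.5, 0.10), (0.05, 0.5, 0.35),
               (0.10, 0.5, 0.60), (0.30, 0.5, 1.40), (0.50, 0.5, 2.10)]
    db.executemany("INSERT INTO rf_data VALUES (?, ?, ?)", samples)
    return db

def estimate_power(db, toggle_rate, probability):
    """Nearest-neighbour lookup in the characterization table."""
    row = db.execute(
        """SELECT total_power FROM rf_data
           ORDER BY (toggle_rate - ?) * (toggle_rate - ?)
                  + (probability - ?) * (probability - ?) LIMIT 1""",
        (toggle_rate, toggle_rate, probability, probability)).fetchone()
    return row[0]

db = build_macromodel_db()
print(estimate_power(db, 0.09, 0.5))  # nearest sample is (0.10, 0.5) -> 0.6
```

A real macro-model would interpolate between neighbouring sample points rather than return the nearest one; the nearest-neighbour query keeps the sketch short.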
Efficient Automatic Parameter Space Exploration

A notable result of this work is an innovative exploration framework that uses modern multi-objective evolutionary algorithms to enable an accurate and fast automatic exploration of the parameter space of arbitrary WPPA instances with respect to area and power consumption as well as throughput. This framework provides a means to determine automatically the respective lower and upper bounds of the objective functions for a given parameter range, which cannot be achieved with conventional methodologies. Subsequently, an exploration of WPPA architectures is investigated for several different algorithms that are to be executed on the same processor array by means of reconfiguration. Methods are presented that steer the exploration toward a balanced hardware architecture, one that represents a reasonable compromise with respect to the objectives for a given set of algorithms.
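The multi-objective exploration described above searches for non-dominated (Pareto-optimal) WPPA configurations. As a minimal illustration of the dominance test that such evolutionary algorithms rely on, the following Python sketch filters a set of hypothetical (area, power, inverse-throughput) points; the candidate values are invented for illustration.

```python
def dominates(a, b):
    """a dominates b if a is no worse in every objective and strictly
    better in at least one; all objectives are to be minimized."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Return the non-dominated subset (the Pareto front)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (area, power, inverse-throughput) tuples for four WPPA instances.
candidates = [(4.0, 2.0, 1.0), (3.0, 2.5, 1.0), (5.0, 3.0, 2.0), (3.5, 2.2, 0.9)]
print(pareto_front(candidates))  # the third point is dominated by the first
```

An evolutionary explorer applies this filter to each generation and breeds new parameter settings from the surviving front.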
A Appendix

A.1 VHDL Code Listings

Listing A.1: VHDL code for the multiplexer generation on the output signals south.
SOUTH_OUTPUTS_CHECK : FOR output IN SOUTH_PIN_NUM + WEST_PIN_NUM TO
    SOUTH_PIN_NUM + WEST_PIN_NUM + NORTH_PIN_NUM - 1 GENERATE

  SOUTH_OUTPUT_MUX_GEN : IF MULTI_SOURCE_MATRIX(0, output) > 1 GENERATE
    signal mux_south_out_signal    : std_logic_vector(NORTH_INPUT_WIDTH - 1 downto 0);
    signal mux_south_select_signal : std_logic_vector(MULTI_SOURCE_MATRIX(2, output) - 1 downto 0);
  begin
    south_outputs(NORTH_INPUT_WIDTH * (output - (SOUTH_PIN_NUM + WEST_PIN_NUM) + 1) - 1
        downto NORTH_INPUT_WIDTH * (output - (SOUTH_PIN_NUM + WEST_PIN_NUM)))
      <= mux_south_out_signal;
    -- [multiplexer instance header garbled in extraction]
      generic map (...       => NORTH_INPUT_WIDTH,
                   SEL_WIDTH     => MULTI_SOURCE_MATRIX(2, output),
                   NUM_OF_INPUTS => MULTI_SOURCE_MATRIX(0, output))
      port map (data_inputs => south_output_all_mux_ins(MULTI_SOURCE_MATRIX(4, output)
                                   downto MULTI_SOURCE_MATRIX(3, output)),
                sel         => mux_south_select_signal,
                output      => mux_south_out_signal);
  END GENERATE; -- SOUTH_OUTPUT_MUX_GEN
END GENERATE; -- SOUTH_OUTPUTS_CHECK
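Listing A.1 instantiates a multiplexer only for output pins with more than one candidate source (MULTI_SOURCE_MATRIX(0, output) > 1); a pin with exactly one source is wired directly, and the mux select is driven by a control register at run time. A Python sketch of that selection rule follows; the data layout (dictionaries keyed by pin name) is purely hypothetical and only mirrors the structure of the generate logic.

```python
# Sketch of the per-output multiplexer rule from Listing A.1 (hypothetical
# data layout): a mux is generated only when an output has more than one
# candidate source; a select register then picks the active source.
def route_outputs(multi_source_count, sources, select_regs):
    routed = {}
    for out, n_src in multi_source_count.items():
        if n_src == 0:
            continue                        # unconnected output, no driver
        if n_src == 1:
            routed[out] = sources[out][0]   # direct wire, no mux generated
        else:
            routed[out] = sources[out][select_regs[out]]  # mux + select reg
    return routed

sources = {"south0": ["north0"], "south1": ["north1", "west0", "east0"]}
counts = {"south0": 1, "south1": 3, "south2": 0}
print(route_outputs(counts, sources, {"south1": 2}))
# -> {'south0': 'north0', 'south1': 'east0'}
```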
Listing A.2: VHDL code for the input north and the output south connection.
N_S_SOUTH_OUTPUTS_CHECK : FOR output IN (SOUTH_PIN_NUM + WEST_PIN_NUM) TO
    (SOUTH_PIN_NUM + WEST_PIN_NUM + NORTH_PIN_NUM) - 1 GENERATE
  N_S_NORTH_INPUTS_CHECK : FOR input IN NORTH_PIN_NUM - 1 DOWNTO 0 GENERATE
    N_S_DRIVER_CHECK : IF MULTI_SOURCE_MATRIX(0, output) > 0 GENERATE

      N_S_FALSE_MS_CHECK : IF MULTI_SOURCE_MATRIX(0, output) = 1 GENERATE
        N_S_CONN_CHECK : IF ADJACENCY_MATRIX(input)(output) = '1' GENERATE
          signal N_S_north_in_signal  : std_logic_vector(NORTH_INPUT_WIDTH - 1 downto 0);
          signal N_S_south_out_signal : std_logic_vector(NORTH_INPUT_WIDTH - 1 downto 0);
        begin
          N_S_north_in_signal ...
          -- [connection instance header garbled in extraction]
            generic map (... => NORTH_INPUT_WIDTH)
            port map (input_signal  => N_S_north_in_signal,
                      output_signal => N_S_south_out_signal);
        END GENERATE; -- NORTH_SOUTH_CONNECTION_CHECK
      END GENERATE; -- FALSE_MULTI_SOURCE_CHECK

      N_S_TRUE_MS_CHECK : IF MULTI_SOURCE_MATRIX(0, output) > 1 GENERATE
        N_S_MS_CONN_CHECK : IF ADJACENCY_MATRIX(input)(output) = '1' GENERATE
          signal N_S_mux_north_in_signal : std_logic_vector(NORTH_INPUT_WIDTH - 1 downto 0);
        begin
          N_S_mux_north_in_signal(NORTH_INPUT_WIDTH - 1 downto 0) ...

  -- [intermediate lines garbled in extraction; the fragment continues with
  --  the port map of an array element instance]
        ...           => GLOBAL_config_mem_data,
        north_inputs  => INternal_TOP_north_in(j * NORTH_INPUT_WIDTH * NORTH_PIN_NUM - 1
                             downto (j - 1) * NORTH_INPUT_WIDTH * NORTH_PIN_NUM),
        north_outputs => INternal_TOP_north_out(j * SOUTH_INPUT_WIDTH * SOUTH_PIN_NUM - 1
                             downto (j - 1) * SOUTH_INPUT_WIDTH * SOUTH_PIN_NUM),
        south_inputs  => INTERNAL_NORTH_OUT_south_in_connections(i, j),
        south_outputs => INTERNAL_SOUTH_OUT_north_in_connections(i, j),
        east_inputs   => INTERNAL_WEST_OUT_east_in_connections(i, j),
        east_outputs  => INTERNAL_EAST_OUT_west_in_connections(i, j),
        west_inputs   => INternal_LEFT_west_in(i * WEST_INPUT_WIDTH * WEST_PIN_NUM - 1
                             downto (i - 1) * WEST_INPUT_WIDTH * WEST_PIN_NUM),
        west_outputs  => INTERNAL_LEFT_west_out(i * EAST_INPUT_WIDTH * EAST_PIN_NUM - 1
                             downto (i - 1) * EAST_INPUT_WIDTH * EAST_PIN_NUM));
END GENERATE FRC_FIRST_COLUMN_CHECK;
...
A.2 Generic CPF Description

set_cpf_version 1.0e
set_hierarchy_separator "/"
create_power_nets -voltage 1.0 -nets VDD
create_power_nets -voltage 1.0 -nets { \
    VDD_ij, i ∈ {0 … N}, j ∈ {0 … M} } -internal
create_ground_nets -voltage 0.0 -nets VSS
define_isolation_cell -cells AND2X2 -power VDD \
    -ground VSS -non_dedicated -enable A2 \
    -valid_location to
define_power_switch_cell -cells HEADERX0 \
    -stage_1_enable nSIN -stage_1_output nSOUT \
    -type header -power TVDD -power_switchable VDD
define_always_on_cell -cells OBUFFX2 -power TVDD \
    -ground VSS -power_switchable VDD
create_power_domain -name PA -default
create_power_domain -name PDOM_ij \
    -instances {array/wrp_i_j/core} -shutoff_condition \
    {!wrp_i_j/pwr_ctrl/output[0]} -secondary_domains PA
update_power_domain -name PA \
    -primary_power_net VDD -primary_ground_net VSS
update_power_domain -name PDOM_ij \
    -primary_power_net VDD_ij -primary_ground_net VSS
create_global_connection -domain PA \
    -net VDD -pins VDD
create_global_connection -domain PA -net VSS -pins VSS
create_global_connection -domain PDOM_ij \
    -net VDD -pins TVDD
create_global_connection -domain PDOM_ij \
    -net VDD_ij -pins VDD
create_global_connection -domain PDOM_ij \
    -net VSS -pins VSS
create_nominal_condition -name off -voltage 0.0
create_nominal_condition -name on -voltage 1.0
update_nominal_condition -name off -library_set all_libs
update_nominal_condition -name on -library_set all_libs
create_power_mode -name PM_ON -domain_conditions \
    { PA@on PDOM_ij@on }
create_power_mode -name PM_OFF -domain_conditions \
    { PA@on PDOM_ij@off }
create_power_mode -name PM_WH_ON \
    -domain_conditions \
    { PA@on PDOM_ij@on, ((i + j)%2 = 0) \
      PDOM_ij@off, ((i + j)%2 != 0) }
create_power_mode -name PM_BL_ON \
    -domain_conditions \
    { PA@on PDOM_ij@on, ((i + j)%2 != 0) \
      PDOM_ij@off, ((i + j)%2 = 0) }
update_power_mode -name PM_ON -sdc_files \
    $sdc_filepath/constraint_on.sdc
update_power_mode -name PM_OFF -sdc_files \
    $sdc_filepath/constraint_off.sdc
update_power_mode -name PM_BL_ON -sdc_files \
    $sdc_filepath/constraint_bl.sdc
update_power_mode -name PM_WH_ON -sdc_files \
    $sdc_filepath/constraint_wh.sdc
create_power_switch_rule -name SW_ij \
    -domain PDOM_ij -external_power_net VDD
update_power_switch_rule -name SW_ij \
    -cells HEADERX0 \
    -acknowledge_receiver {array/wrp_i_j/pwr_ctrl/input[1]}
create_isolation_rule -name iso_PDOM_ij \
    -from PDOM_ij -to PA \
    -isolation_condition {array/wrp_i_j/pwr_ctrl/output[2]} \
    -isolation_target from -isolation_output low
update_isolation_rules -names iso_PDOM_ij \
    -cells AND2X2 -location to

Figure A.1: A generic Common Power Format description file for the WPPA architecture.
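The domain conditions in Figure A.1 suggest that PM_WH_ON and PM_BL_ON switch complementary checkerboard halves of the array on and off based on (i + j) % 2; this reading of the mode names as "white"/"black" patterns is an assumption, since the inequality signs were lost in extraction. A small Python sketch of the resulting per-domain states:

```python
# Sketch of the checkerboard power modes suggested by the ((i + j)%2 = 0)
# domain conditions in Figure A.1; treating PM_WH_ON / PM_BL_ON as
# complementary "white"/"black" patterns is an assumption.
def domain_states(rows, cols, mode):
    states = {}
    for i in range(rows):
        for j in range(cols):
            even = (i + j) % 2 == 0
            if mode == "PM_ON":
                on = True
            elif mode == "PM_OFF":
                on = False
            elif mode == "PM_WH_ON":
                on = even
            elif mode == "PM_BL_ON":
                on = not even
            else:
                raise ValueError(mode)
            states[(i, j)] = "on" if on else "off"
    return states

print(domain_states(2, 2, "PM_WH_ON"))
# -> {(0, 0): 'on', (0, 1): 'off', (1, 0): 'off', (1, 1): 'on'}
```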
A.3 WPPA Software Examples

Table A.1: Software for the edge detection algorithm, PE 1 (PE[0][0]), consisting of eight VLIW instructions in WPPA assembler code (id: input data register, rd: general-purpose data register, od: output data register).
Table A.2: Software for the edge detection algorithm, PE 4 (PE[0][1]), consisting of eight VLIW instructions in WPPA assembler code (id: input data register, rd: general-purpose data register, od: output data register).
Table A.3: Synthetic load software consisting of six VLIW instructions in WPPA assembler code. All 24 processing elements execute this code simultaneously (id: input data register, rd: general-purpose data register, od: output data register).
Table A.4: Software for the FFT algorithm, PE 20 (PE[2][3]), consisting of nine VLIW instructions in WPPA assembler code (id: input data register, rd: general-purpose data register, od: output data register).
A.4 WPPA Template Code Complexity

[Figure A.2: bar chart of lines of code (log scale, 10^0 to 10^4) for each VHDL module of the WPPA template.]

Figure A.2: VHDL template code complexity statistics.
A.5 TCL Characterization Script Fragment
Listing A.8: Hardware power and area characterization loop procedure.
proc runHardwareCharacterizationLoop { moduleName { debug false } } {

    global mem_size
    global mem_width
    global source_code_path
    global fileName
    global outputFile
    global firstRun
    global CHARACTER_STEP
    global generated_lib
    global DESIGN

    global DATA_WIDTH
    global RF_ADDR_WIDTH
    global NUM_READ_PORTS
    global NUM_WRITE_PORTS

    global NUM_INPUT_REGS
    global SIZE_INPUT_REGS

    global NUM_FB_FIFOS
    global NUM_OUTPUT_REGS
    global NUM_GP_REGS

    global DATA_TOGGLE_RATE
    global DATA_PROBABILITY

    global CTRL_TOGGLE_RATE
    global CTRL_PROBABILITY

    global internalPower
    global netPower
    global leakagePower
    global totalPower
    global clkPower

    global cellArea
    global netArea
    global totalArea

    set SW_ITERATION_COUNTER 0
    set HW_ITERATION_COUNTER 0

    set SW_INTERNAL_COUNTER 0
    set HW_INTERNAL_COUNTER 0

    set FU_operand_name unknown

    if { [string equal $moduleName adder] } {
        set FU_operand_name summand
        puts "now setting FU_operand_name to \"summand\""
    } else {
        set FU_operand_name operand
        puts "now setting FU_operand_name to \"operand\""
    }

    switch -exact -- $moduleName {

        regfile {
            set read_port_num  $NUM_READ_PORTS
            set write_port_num $NUM_WRITE_PORTS

            set fileName "rf_characterization_d${DATA_WIDTH}_r${NUM_READ_PORTS}_w${NUM_WRITE_PORTS}.sql"
            writeNewSQLFile fileName

            puts "\tCharacterizing for write_ports = $NUM_WRITE_PORTS"

            foreach num_input_reg {4 2 1} {
                set NUM_INPUT_REGS $num_input_reg
                foreach num_output_reg {4 2 1} {
                    set NUM_OUTPUT_REGS $num_output_reg
                    foreach size_input_reg {16 8 2} {
                        set SIZE_INPUT_REGS $size_input_reg
                        foreach fb_fifo_num {4 0} {
                            set NUM_FB_FIFOS $fb_fifo_num
                            foreach num_gen_reg {8 4 0} {
                                set NUM_GP_REGS $num_gen_reg

                                ;# ==> PERFORMING HARDWARE SYNTHESIS
                                update_default_lib "${generated_lib}/regfile_default_${DATA_WIDTH}.vhd"
                                readSourceFiles $moduleName $firstRun
                                runSynthesis regfile
                                set firstRun false
                                ;# ==> END HARDWARE SYNTHESIS

                                computeArea

                                set HW_ITERATION_COUNTER [expr $HW_ITERATION_COUNTER + 1]
                                set HW_INTERNAL_COUNTER  [expr $HW_INTERNAL_COUNTER + 1]

                                ;# ==> PERFORMING STATISTICAL MACROMODELING
                                set use_fraction 200.0
                                while { $use_fraction > 10.0 } {
                                    set use_fraction [getNewUseFraction $use_fraction]

                                    if { ($use_fraction == 50.0) && ($write_port_num < 2) } {
                                        ;# IDLE: start next iteration
                                    } elseif { ($use_fraction == 25.0) && ($write_port_num < 4) } {
                                        ;# IDLE: start next iteration
                                    } elseif { ($use_fraction == 15.0) && ($write_port_num < 5) } {
                                        ;# IDLE: start next iteration
                                    } elseif { $use_fraction == 10.0 } {
                                        ;# IDLE: start next iteration
                                    } else {
                                        set UTILIZATION_RATE [expr $use_fraction / 100.0]

                                        ;#------> CLEAR THE OLD TOGGLE RATES AND PROBABILITIES
                                        resetToggleRatesAndProbabilities regfile $debug

                                        foreach pb_data {0.01 1.0 2.0 3.0 6.0 8.0} {
                                            set DATA_PROBABILITY [expr $pb_data / 10]
                                            ;#----> SET CURRENT DATA PROBABILITY
                                            setModuleProbability regfile $DATA_PROBABILITY \
                                                $DATA_WIDTH const 0 $debug ${UTILIZATION_RATE}

                                            foreach tr_data {0.1 5.0 10.0 15.0 20.0 25.0 30.0 50.0} {
                                                set DATA_TOGGLE_RATE [expr $tr_data / 100]
                                                ;#----> SET CURRENT DATA TOGGLE RATE
                                                setModuleToggleRate regfile $DATA_TOGGLE_RATE \
                                                    $DATA_WIDTH const 0 $debug ${UTILIZATION_RATE}

                                                foreach pb_ctrl {0.01 2.0 4.0 6.0 8.0} {
                                                    set CTRL_PROBABILITY [expr $pb_ctrl / 10]
                                                    ;#----> SET CURRENT CTRL PROBABILITY
                                                    setModuleProbabilityCTRL regfile $CTRL_PROBABILITY \
                                                        wes $NUM_WRITE_PORTS 0 $debug ${UTILIZATION_RATE}

                                                    foreach tr_ctrl {0.1 2.0 6.0 10.0 15.0 30.0 50.0} {
                                                        set CTRL_TOGGLE_RATE [expr $tr_ctrl / 100]

                                                        ;# [comparison operator garbled in extraction]
                                                        if { !($CTRL_TOGGLE_RATE ... 0.001) && \
                                                             ($DATA_PROBABILITY == 0.001 || \
                                                              $DATA_TOGGLE_RATE == 0.001) } {
                                                            ;# IDLE: make new iteration
                                                        } else {
                                                            ;#----> SET CURRENT CTRL TOGGLE RATE
                                                            setModuleToggleRateCTRL regfile $CTRL_TOGGLE_RATE \
                                                                wes $NUM_WRITE_PORTS 0 $debug ${UTILIZATION_RATE}

                                                            computePower

                                                            puts -nonewline $outputFile "INSERT INTO RF_DATA_d${DATA_WIDTH}_r${NUM_READ_PORTS}_w${NUM_WRITE_PORTS} VALUES ("
                                                            puts -nonewline $outputFile "${NUM_INPUT_REGS}, ${NUM_OUTPUT_REGS}, ${SIZE_INPUT_REGS}, ${NUM_FB_FIFOS}, ${NUM_GP_REGS},"
                                                            puts -nonewline $outputFile "${UTILIZATION_RATE}, ${DATA_TOGGLE_RATE}, ${DATA_PROBABILITY}, ${CTRL_TOGGLE_RATE}, ${CTRL_PROBABILITY},"
                                                            puts $outputFile "$totalPower, $internalPower, $netPower, $leakagePower, $clkPower, $totalArea, 200, 0)"

                                                            set SW_ITERATION_COUNTER [expr $SW_ITERATION_COUNTER + 1]
                                                            set SW_INTERNAL_COUNTER  [expr $SW_INTERNAL_COUNTER + 1]
                                                        } ;# END else $CTRL_TOGGLE_RATE
                                                    }
                                                }
                                            }
                                        }
                                    }
                                }
                            }
                        }
                    }
                }
            }

            puts "--> SUM: 3 x $HW_ITERATION_COUNTER HW iterations"
            puts "--> SUM: 3 x $SW_ITERATION_COUNTER SW iterations"

            close $outputFile
        } ;# end switch regfile

        memory {
            puts "now in runHardwareCharacterization module=$moduleName --> default"
            ;# ==> PERFORMING HARDWARE SYNTHESIS
            update_memory_wrapper "${source_code_path}/char_ram_wrapper.vhd"
            readSourceFiles $moduleName $firstRun
            runSynthesis $moduleName
            set firstRun false
            ;# ==> END HARDWARE SYNTHESIS

            computePower
            computeArea

            set newName "${moduleName}_characterization.sql"
            set newoutput [open $newName {WRONLY CREAT APPEND}]
            puts "file ${newoutput} opened ..."

            puts -nonewline $newoutput "INSERT INTO ${moduleName}_DATA VALUES ("
            puts -nonewline $newoutput "${mem_size}, ${mem_width},"
            puts -nonewline $newoutput "${DATA_TOGGLE_RATE}, ${DATA_PROBABILITY},"
            puts -nonewline $newoutput "${CTRL_TOGGLE_RATE}, ${CTRL_PROBABILITY},"
            puts $newoutput "$totalPower, $internalPower, $netPower, $leakagePower, $totalArea, 200, 0)"

            close $newoutput
        }

        fu_unit {
            puts "now in runHardwareCharacterization module=$moduleName --> default"

            foreach d_width {24 16 8} {
                set DATA_WIDTH $d_width

                set fileName "MaxLeak_${moduleName}_characterization_d${DATA_WIDTH}.sql"
                writeNewSQLFile fileName

                ;# ==> PERFORMING HARDWARE SYNTHESIS
                update_default_lib "${generated_lib}/${moduleName}_default_${DATA_WIDTH}.vhd"
                readSourceFiles $moduleName $firstRun
                runSynthesis $moduleName
                set firstRun false
                ;# ==> END HARDWARE SYNTHESIS

                computeArea

                ;#------> CLEAR THE OLD TOGGLE RATES AND PROBABILITIES
                resetToggleRatesAndProbabilities $moduleName $debug

                ;#===> SET select[0], select[1] toggle rates and probabilities
                initializeControlPorts $moduleName

                foreach pb_data {0.01 0.5 1.0 1.5 2.0 2.5 3.0 4.0 5.0 6.0 8.0 10.0} {
                    set DATA_PROBABILITY [expr $pb_data / 10]
                    ;#----> SET 1. operand's CURRENT DATA PROBABILITY
                    setOperandProbability $DATA_PROBABILITY first $FU_operand_name \
                        $DATA_WIDTH const 0.0 0.0 $debug

                    foreach tr_data {0.1 5.0 7.0 10.0 12.0 14.0 16.0 18.0 20.0 26.0 30.0 40.0 50.0 70.0} {
                        set DATA_TOGGLE_RATE [expr $tr_data / 100]
                        ;#----> SET 1. operand's CURRENT DATA TOGGLE RATE
                        setOperandToggleRate $DATA_TOGGLE_RATE first ${FU_operand_name} \
                            $DATA_WIDTH const 0.0 0.0 $debug

                        foreach op2_pb_data {0.01 0.5 1.0 1.5 2.0 2.5 3.0 4.0 5.0 6.0 8.0 10.0} {
                            set op2_DATA_PROBABILITY [expr $op2_pb_data / 10]
                            ;#----> SET 2. operand's CURRENT DATA PROBABILITY
                            if { ![string equal $moduleName shifter] } {
                                setOperandProbability $op2_DATA_PROBABILITY second $FU_operand_name \
                                    $DATA_WIDTH const 0.0 0.0 $debug
                            } else {
                                setOperandProbability $op2_DATA_PROBABILITY second $FU_operand_name \
                                    [expr int(floor(log($DATA_WIDTH) / log(2)))] const 0.0 0.0 $debug
                            }

                            foreach op2_tr_data {0.1 5.0 7.0 10.0 12.0 14.0 16.0 18.0 20.0 26.0 30.0 40.0 50.0 70.0} {
                                set op2_DATA_TOGGLE_RATE [expr $op2_tr_data / 100]
                                ;#----> SET 2. operand's CURRENT DATA TOGGLE RATE
                                if { ![string equal $moduleName shifter] } {
                                    setOperandToggleRate $op2_DATA_TOGGLE_RATE second ${FU_operand_name} \
                                        $DATA_WIDTH const 0.0 0.0 $debug
                                } else {
                                    setOperandToggleRate $op2_DATA_TOGGLE_RATE second ${FU_operand_name} \
                                        [expr int(floor(log($DATA_WIDTH) / log(2)))] const 0.0 0.0 $debug
                                }

                                foreach pb_ctrl {0.01 0.5 1.0 1.5 2.0 3.0 5.0 8.0 10.0} {
                                    set CTRL_PROBABILITY [expr $pb_ctrl / 10]
                                    ;#----> SET CURRENT enable CTRL PROBABILITY
                                    setModuleProbabilityCTRL $moduleName $CTRL_PROBABILITY enable \
                                        1 1 $debug 1.0

                                    foreach tr_ctrl {0.1 2.0 6.0 10.0 15.0 30.0 40.0 50.0 70.0} {
                                        set CTRL_TOGGLE_RATE [expr $tr_ctrl / 100]
                                        ;#----> SET CURRENT enable CTRL TOGGLE RATE
                                        setModuleToggleRateCTRL $moduleName $CTRL_TOGGLE_RATE enable \
                                            1 1 $debug 1.0

                                        computePower

                                        puts -nonewline $outputFile "INSERT INTO ${moduleName}_DATA_d${DATA_WIDTH} VALUES ("
                                        puts -nonewline $outputFile "${DATA_TOGGLE_RATE}, ${DATA_PROBABILITY},"
                                        puts -nonewline $outputFile "${op2_DATA_TOGGLE_RATE}, ${op2_DATA_PROBABILITY},"
                                        puts -nonewline $outputFile "${CTRL_TOGGLE_RATE}, ${CTRL_PROBABILITY},"
                                        puts $outputFile "$totalPower, $internalPower, $netPower, $leakagePower, $totalArea, 200, 1)"
                                    } ;# else ENABLE ($CTRL_TOGGLE_RATE ...
...
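The structure of Listing A.8 — nested foreach sweeps over non-uniform parameter grids, with one synthesized design characterized per inner iteration — can be condensed into a short Python sketch. The grids below are taken from the regfile branch of the fragment; the synthesis and power-measurement steps are omitted, so the sketch only enumerates the parameter combinations.

```python
# Condensed sketch of the nested characterization sweep in Listing A.8.
# The grids (register counts, probabilities, toggle rates) come from the
# Tcl fragment; the synthesize/measure step per point is stubbed out.
def characterization_points():
    rows = []
    for num_input_regs in (4, 2, 1):
        for num_gp_regs in (8, 4, 0):
            for pb_data in (0.01, 1.0, 2.0, 3.0, 6.0, 8.0):                  # probability grid
                for tr_data in (0.1, 5.0, 10.0, 15.0, 20.0, 25.0, 30.0, 50.0):  # toggle-rate grid
                    rows.append((num_input_regs, num_gp_regs,
                                 pb_data / 10, tr_data / 100))
    return rows

pts = characterization_points()
print(len(pts))  # 3 * 3 * 6 * 8 = 432 parameter combinations
```

Each tuple would correspond to one row inserted into the characterization database, which is what makes the subsequent macro-model lookup fast.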