
Tampereen teknillinen yliopisto. Julkaisu 483
Tampere University of Technology. Publication 483

Christian Panis

Scalable DSP Core Architecture Addressing Compiler Requirements

Thesis for the degree of Doctor of Technology to be presented with due permission for public examination and criticism in Tietotalo Building, Auditorium TB104, at Tampere University of Technology, on the 13th of August 2004, at 12 o'clock noon.

Tampereen teknillinen yliopisto - Tampere University of Technology Tampere 2004

ISBN 952-15-1205-9 ISSN 1459-2045

Abstract

This thesis considers the definition and design of an embedded configurable DSP (Digital Signal Processor) core architecture and addresses the requirements for developing an optimizing high-level language compiler. The introduction provides an overview of typical DSP core architectural features, briefly discusses currently available DSP cores, and summarizes the architectural aspects which have to be considered when developing an optimizing high-level language compiler. The introduction is followed by 12 publications which outline the research work carried out, providing a detailed description of the main core features and the design space exploration methodology. Most of the research work focuses on architectural aspects of the configurable RISC (Reduced Instruction Set Computer) DSP core, which is based on a modified Dual-Harvard load-store architecture. Due to increasing application code size, and because of the configurability of the core, automatic code generation by a high-level language compiler is required. Generating code efficiently requires that the relevant architectural aspects be considered as early as the definition stage; this results in an orthogonal instruction set architecture with simple issue rules. Architectural features are introduced to reduce area consumption and power dissipation, to fulfill the requirements of SoC (System-on-Chip) and SiP (System-in-Package) applications, and to close the gap between dedicated hardware implementations and software-based system solutions. Code density has a significant influence on the area of the DSP subsystem; thus xLIW (scalable Long Instruction Word) is introduced. An instruction buffer allows reduced power dissipation during the execution of loop-centric DSP algorithms. Simple issue rules and an exhaustive predicated execution feature enable cycle- and power-efficient execution of control code.
The scalable DSP core architecture introduced herein allows parameterization of the main architectural features according to application-specific requirements. To make use of this feature it is necessary to analyze the requirements of the application. This thesis introduces a design space exploration methodology based on a C-compiler and a cycle-true instruction set simulator. A unique XML-based configuration file is used to reduce the implementation and validation effort for configuring the tool-chain, for updating the documentation, and for automatically generating parts of the VHDL-RTL core description.



Preface

The research work described in this thesis was carried out during 1999-2004 at Infineon Technologies Austria and at the Institute of Digital and Computer Systems of the Tampere University of Technology in Tampere, Finland. I would like to express my deepest gratitude to my thesis advisor, Prof. Jari Nurmi. He introduced and guided me carefully through the scientific world. Jari hosted me during my stays at the university in Tampere and warmed the cockles of my heart in the sometimes cold Finland. Prof. Jarmo Takala, as head of the Institute of Digital and Computer Systems, supported my study work and, along with Lasse Harju and Timo Rintakoski, ensured me a warm and pleasant working environment during my time at TUT. A note of gratitude goes out to Prof. Lars Wanhammar and Dr. Mika Kuulusa for reviewing my thesis and supporting me with valuable feedback. Defining and developing a new DSP core following the approach of Hardware and Software Co-Definition can only be done with a competent and enthusiastic team. Therefore I would like to express my deepest thanks to the xDSPcore team, which contributed excellent work over this long period. Many thanks to Prof. Andreas Krall from Vienna University of Technology, who influenced the xDSPcore architecture and considered the aspects relevant to developing an optimizing C-compiler, and to Karl Vögler and Ulrich Hirnschrott, who developed the main parts of the C-compiler backend and supported my thesis by contributing benchmarks and analysis results alongside many productive discussions. Many thanks also to the internship and master's students who contributed to the xDSPcore research project, including Pierre Elbischger, Gunther Laure, Wolfgang Lazian, Raimund Leitner, Michael Bramberger and many more.
During my time at Infineon Technologies I had the pleasure of meeting many amazing people, which led to a plethora of fruitful discussions. As representatives, many thanks to Herbert Zojer, who supported the development of an innovative DSP core architecture, to Prof. Lajos Gazsi, Fellow of Infineon Technologies, and to Dr. Söhnke Mehrgardt, CTO of Infineon Technologies, who guided the development team inside the company. I would also like to express my thanks to Dr. Franz Dielacher, Manfred Haas and Reinhard Petschacher at Infineon Technologies Austria, and to Prof. Herbert Grünbacher and Erwin Ofner at Carinthian Tech Institute, who enabled me to finalize my research work.

In addition I would like to express my thanks to Prof. Tobias G. Noll and Volker Gierenz at RWTH Aachen, who assisted the project from the beginning with their technical expertise. The research was financially supported by Infineon Technologies Austria, by the European Commission through the project SoC-Mobinet (IST-2000-30094), and by the Carinthian Tech Institute, which hosted me for the last two years. Many thanks. Most of all I would like to express my deepest gratitude to my parents Maria and Herbert Panis and my brother Peter, who supported me unrelentingly throughout this long period with their love. Only through their support was it possible for me to complete my studies in Tampere, Finland.

Tampere, August 2004

Christian Panis


Table of Contents

1 Introduction
  1.1 Motivation
  1.2 Methodology
  1.3 Goals
  1.4 Outline of Thesis
2 DSP Specific Features
  2.1 Introduction
  2.2 Saturation
  2.3 Rounding
  2.4 Fixed-Point, Floating-Point
  2.5 Hardware Loops
  2.6 Addressing Modes
  2.7 Multiple Memory Banks
  2.8 CISC Instruction Sets
  2.9 Orthogonality
  2.10 Real-Time Requirements
3 DSP Cores
  3.1 Design Space
  3.2 Architectural Alternatives
  3.3 Available DSP Core Architectures
  3.4 xDSPcore
4 High Level Language Compiler Issues
  4.1 Coding Practices in DSPs
  4.2 Compiler Overview
  4.3 Requirements
  4.4 HLL-Compiler Friendly Core Architecture
5 Summary of Publications
  5.1 Architectural Aspects of Scalable DSP Core
  5.2 Design Space Exploration
  5.3 Author's Contribution to Published Work
6 Conclusion
  6.1 Main Results
  6.2 Future Research
7 References


List of Publications

This thesis is split into two parts, the first containing an introduction to Digital Signal Processor architectures and the second a reprint of the publications listed below.

[P1] C. Panis, J. Nurmi, “xDSPcore - a Configurable DSP Core”, Technical Report 1-2004, Tampere University of Technology, Institute of Digital and Computer Systems, Tampere, Finland, May 2004.

[P2] C. Panis, R. Leitner, H. Grünbacher, J. Nurmi, “xLIW - a Scaleable Long Instruction Word”, in Proceedings of the 2003 IEEE International Symposium on Circuits and Systems (ISCAS 2003), Bangkok, Thailand, May 25-28, 2003, pp. V69-V72.

[P3] C. Panis, R. Leitner, H. Grünbacher, J. Nurmi, “Align Unit for a Configurable DSP Core”, in Proceedings of the IASTED International Conference on Circuits, Signals and Systems (CSS 2003), Cancun, Mexico, May 19-21, 2003, pp. 247-252.

[P4] C. Panis, M. Bramberger, H. Grünbacher, J. Nurmi, “A Scaleable Instruction Buffer for a Configurable DSP Core”, in Proceedings of the 29th European Solid-State Circuits Conference (ESSCIRC 2003), Estoril, Portugal, September 16-18, 2003, pp. 49-52.

[P5] C. Panis, H. Grünbacher, J. Nurmi, “A Scaleable Instruction Buffer and Align Unit for xDSPcore”, IEEE Journal of Solid-State Circuits, Volume 35, Number 7, July 2004, pp. 1094-1100.

[P6] C. Panis, U. Hirnschrott, A. Krall, G. Laure, W. Lazian, J. Nurmi, “FSEL - Selective Predicated Execution for a Configurable DSP Core”, in Proceedings of the IEEE Annual Symposium on VLSI (ISVLSI-04), Lafayette, Louisiana, USA, February 19-20, 2004, pp. 317-320.

[P7] C. Panis, G. Laure, W. Lazian, H. Grünbacher, J. Nurmi, “A Branch File for a Configurable DSP Core”, in Proceedings of the International Conference on VLSI (VLSI'03), Las Vegas, Nevada, USA, June 23-26, 2003, pp. 7-12.

[P8] C. Panis, R. Leitner, J. Nurmi, “A Scaleable Shadow Stack for a Configurable DSP Concept”, in Proceedings of the 3rd IEEE International Workshop on System-on-Chip for Real-Time Applications (IWSOC), Calgary, Canada, June 30 - July 2, 2003, pp. 222-227.

[P9] C. Panis, J. Hohl, H. Grünbacher, J. Nurmi, “xICU - a Scaleable Interrupt Unit for a Configurable DSP Core”, in Proceedings of the 2003 International Symposium on System-on-Chip (SOC'03), Tampere, Finland, November 19-21, 2003, pp. 75-78.

[P10] C. Panis, G. Laure, W. Lazian, A. Krall, H. Grünbacher, J. Nurmi, “DSPxPlore - Design Space Exploration for a Configurable DSP Core”, in Proceedings of the International Signal Processing Conference (GSPx), Dallas, Texas, USA, March 31 - April 3, 2003, CD-ROM.

[P11] C. Panis, U. Hirnschrott, G. Laure, W. Lazian, J. Nurmi, “DSPxPlore - Design Space Exploration Methodology for an Embedded DSP Core”, in Proceedings of the 2004 ACM Symposium on Applied Computing (SAC 04), Nicosia, Cyprus, March 14-17, 2004, pp. 876-883.

[P12] C. Panis, A. Schilke, H. Habiger, J. Nurmi, “An Automatic Decoder Generator for a Scaleable DSP Architecture”, in Proceedings of the 20th Norchip Conference (Norchip'02), Copenhagen, Denmark, November 11-12, 2002, pp. 127-132.


List of Figures

Figure 1: Chosen Methodology for Definition of the Core Architecture
Figure 2: Principle of Saturation
Figure 3: Two's Complement Rounding (Motorola 56000 family)
Figure 4: Convergent Rounding (Motorola 56000 family)
Figure 5: Integer versus Fractional Data Representation
Figure 6: Fractional Multiplication Including Left Shift
Figure 7: Assembly Code Example for Finite Impulse Response (FIR) Filter
Figure 8: Example for Implied Addressing: Multiply Operation (Lucent 16xx)
Figure 9: Example for Implied Addressing: MAX2VIT Instruction (Starcore SC140)
Figure 10: Example for Immediate Data Addressing: MOVC Instruction (xDSPcore)
Figure 11: Example for Register Direct Addressing: Subtraction (TI C62x)
Figure 12: Principle of Register Indirect Addressing
Figure 13: Principle of Pre/Post Operation Mode
Figure 14: Assembly Code Example for Pre/Post Increment Instructions (xDSPcore)
Figure 15: Principle of Using a Modulo Buffer for Address Generation
Figure 16: Principle of the Bit Reversal Addressing Scheme
Figure 17: Assembly Code Example for Short Immediate Data (xDSPcore)
Figure 18: Processor Architectures: von Neumann, Harvard, modified Dual-Harvard
Figure 19: Example for Interleaved Memory Addressing (SC140)
Figure 20: Example for CISC Instructions: Multiply and Accumulate (MAC)
Figure 21: Influence of Binary Coding on Application Code Density (using the same ISA)
Figure 22: Principle of Define-in-use Dependency
Figure 23: Principle of Load-in-use Dependency
Figure 24: Example for Data Memory Bandwidth Limitations (Starcore SC140)
Figure 25: Architectural Alternatives: Issue Rates for Available DSP Core Architectures
Figure 26: Architectural Alternatives: RISC versus CISC Pipeline
Figure 27: Architectural Alternatives: Pipeline Depth of Available DSP Cores
Figure 28: Architectural Alternatives: Direct Memory versus Load-Store
Figure 29: Architectural Alternatives: Mode Dependent Limitations during Instruction Scheduling
Figure 30: Architectural Overview: OAKDSP Core
Figure 31: Architectural Overview: Motorola 56300
Figure 32: Architectural Overview: TI C54x
Figure 33: Architectural Overview: ZSP400
Figure 34: Architectural Overview: Carmel
Figure 35: Architectural Overview: TI C6xx
Figure 36: Architectural Overview: Starcore SC140
Figure 37: Architectural Overview: Blackfin
Figure 38: Architectural Overview: xDSPcore
Figure 39: Principle of Software Pipelining
Figure 40: Data Flow Graph of an Example Issuing Summation of two Data Values
Figure 41: Example for Assembler Code Implementation including Software Pipelining (xDSPcore)
Figure 42: Data Flow Graph for Maximum Search Example
Figure 43: C-Code Example for Illustration of Software Pipelining
Figure 44: Generated Assembler Code without Software Pipelining (xDSPcore)
Figure 45: Generated Assembler Code including Software Pipelining (xDSPcore)
Figure 46: Principle of Loop Unrolling
Figure 47: Principle of Predicated Execution using Loop Flags
Figure 48: General High-level Language Compiler Structure
Figure 49: Example for Banked Register Files (TI C62x)
Figure 50: Limitations during Instruction Scheduling caused by Processor Modes
Figure 51: Example for Address Generation Unit (Motorola 56300)
Figure 52: Example for Non-Orthogonal Instructions: MAX2VIT D4,D2 (Starcore SC140)
Figure 53: Example for Mode Dependent Instruction Sets: ARM Thumb Decompression Logic
Figure 54: Example for Address Generation Unit (Starcore SC140)
Figure 55: Configurable Long Instruction Word (CLIW of Carmel DSP Core)
Figure 56: xDSPcore Core Overview
Figure 57: Orthogonal Register File
Figure 58: Issuing Rules for xDSPcore Architecture
Figure 59: Results for Dhrystone Benchmarks generated by C-Compiler
Figure 60: Results for EFR Benchmarks generated by C-Compiler
Figure 61: xDSPcore Overview
Figure 62: DSPxPlore Overview
Figure 63: Screenshot of xSIM
Figure 64: DSPxPlore Design Flow



List of Tables

Table 1: Principle of Resource Allocation Table
Table 2: Resource Allocation Table including Software Pipeline Technology for increased Usage of Core Resources



List of Abbreviations

AGU      Address Generation Unit
ALU      Arithmetic Logic Unit
ANSI     American National Standards Institute
ASIC     Application Specific Integrated Circuit
ASIP     Application Specific Instruction Set Processor
BMU      Bit Manipulation Unit
CISC     Complex Instruction Set Computer
CLIW     Configurable Long Instruction Word
CMOS     Complementary Metal Oxide Semiconductor
CPU      Central Processing Unit
DMA      Direct Memory Access
DPG      Data Path Generator
DRAM     Dynamic Random Access Memory
DRM      Digital Radio Mondiale
DSP      Digital Signal Processor
FFT      Fast Fourier Transform
FIR      Finite Impulse Response
FPGA     Field Programmable Gate Array
FSM      Finite State Machine
GOPS     Giga Operations Per Second
GPP      General Purpose Processor
HDL      Hardware Description Language
HLL      High-Level Language
IC       Integrated Circuit
ICU      Interrupt Control Unit
IEEE     Institute of Electrical and Electronics Engineers
ILP      Instruction Level Parallelism
IR       Intermediate Representation
ISA      Instruction Set Architecture
ISO      International Organization for Standardization
ISR      Interrupt Service Routine
ISS      Instruction Set Simulator
LCP      Loop Carry Path
LSB      Least Significant Bit
MAC      Multiply and Accumulate
MSB      Most Significant Bit
MII      Minimum Initiation Interval
MIMD     Multiple Instruction Multiple Data
MIPS     Million Instructions Per Second
MMACS    Million MAC Instructions Per Second
MOPS     Million Operations Per Second
MTCMOS   Multi-Threshold CMOS
NMI      Non-Maskable Interrupt
NOP      No Operation
OCE      Open Compiler Environment
OS       Operating System
PCU      Program Control Unit
RAM      Random Access Memory
RISC     Reduced Instruction Set Computer
RTOS     Real-Time Operating System
SIMD     Single Instruction Multiple Data
SiP      System in Package
SJP      Split Join Path
SMT      Simultaneous Multithreading
SoC      System on Chip
SSA      Static Single Assignment
TLB      Translation Lookaside Buffer
TLP      Task Level Parallelism
VHDL     VHSIC Hardware Description Language
VHSIC    Very High Speed Integrated Circuit
VLES     Variable Length Execution Set
VLIW     Very Long Instruction Word
WCET     Worst Case Execution Time
xICU     Scaleable Interrupt Control Unit
xLIW     Scaleable Long Instruction Word

Part I INTRODUCTION


1 Introduction

The introduction begins with a short description of the motivation for defining and developing a new DSP core architecture as the topic of this thesis. This is followed by a brief introduction of the chosen methodology. A few sentences then illustrate the goals of the development project carried out for this thesis, before the outline of the thesis is provided.

1.1 Motivation

Increasing complexity of System-on-Chip (SoC) applications increases the demand for powerful embedded cores. The flexibility provided by software-programmable cores quite often leads to increased silicon area and increased power dissipation, so dedicated hardware is favored over software-based platform solutions. The picture is changing, however, due to the significantly increasing mask costs of advanced process technologies and the difficulty of bringing to a heterogeneous market high-volume products that would justify the high non-recurring cost. Together these elements increase the pressure for developing product platforms. Such a platform is shared by a group of applications, so that software executed on programmable core architectures can be used for differentiating the products. General-purpose processors with a fixed Instruction Set Architecture (ISA) are less well suited for integration into platforms. Closing the gap between dedicated hardware implementations and software-based solutions requires core architectures which enable platform-specific and application-specific adaptations.

For embedded Digital Signal Processors (DSPs) an additional problem exists. Non-orthogonal core architectures are preferred because they offer increased performance and lower area consumption when mapping DSP algorithms onto a processor. Therefore DSPs are still programmed manually in assembly language [162]. The drawback of this better usage of the available processor resources is an architecture-dependent description of the algorithms, which makes changes in the core architecture difficult and costly (due to compatibility issues) and prohibits application-specific adaptations [113]. Therefore products based on a programmable core architecture remain with the same architecture for a long time, even when it is no longer state of the art. A further consequence of using assembly language is long development cycles [174].
Ten years ago, algorithms executed on DSP cores consisted of several hundred lines of code. Manual coding was reasonable, even if minor changes in the application code required several weeks of coding and verification. Today's DSP cores are more powerful and execute large programs consisting of several hundred thousand lines of code. DSP cores are no longer used only for filtering operations; most notably in low-cost products, where no more than one core is reasonable, the control code is also executed on the DSP core. To increase the performance of DSP subsystems, a high degree of parallelism and deep pipeline structures are introduced. Unfortunately, manually programming highly parallel DSP core architectures with deep pipelines, while resolving data and control dependencies, is of limited feasibility or even impossible. Therefore the motivation of using assembly code to increase the use of the available resources is no longer valid.

1.2 Methodology

To arrive at the definition of a DSP core programmable in a high-level language, and not at just another DSP core, the methodology for defining the core architecture has to be changed. The technical reason, alongside many commercial ones, why efficient high-level language programming of DSP cores is still not feasible is the compromise of sacrificing orthogonality for improved efficiency in terms of area and power consumption; orthogonality, however, is the major requirement of the compiler. Considering early DSP cores as programmable filter structures, the major driving factor for their architectural features has been the algorithms executed on the cores. Further constraints on available core architectures, such as banked register files, mode registers and complex instruction sets, have been imposed by the need to implement the architecture in hardware with reasonable core performance. Figure 1 outlines the design methodology of the core architecture introduced in this thesis. The development of an optimizing high-level language compiler has been considered during the definition of the feature set and the main architectural concepts; in this respect the methodology differs from that of already existing core architectures.

Figure 1: Chosen Methodology for Definition of the Core Architecture.

Before an instruction is added to the instruction set architecture (ISA), its suitability is verified from three perspectives: the algorithms to be executed, the hardware implementation, and the software tool-chain.


1.3 Goals

To close the gap between dedicated hardware implementations and software-based solutions a paradigm change is required. The main architectural features of the core subsystem have to be scaleable to enable the definition of an application-specific optimum in terms of area consumption and power dissipation. To reach this goal it is necessary to consider the DSP subsystem as a whole instead of focusing only on the core architecture. To overcome the software compatibility issues caused by the scaleable core features, programming has to take place in a high-level language (HLL) like C, which enables an architecture-independent application description. Moreover, HLL compilers reduce software development effort and maintenance costs. Enabling the development of an optimizing HLL compiler that generates efficient code (where efficient means less than 10% overhead compared with manual coding) requires restricting the design space of the core architecture. The goal for the core architecture can be summarized as follows: a scaleable DSP core architecture meeting area and power targets competitive with hardwired implementations, suited as a target for an optimizing C-compiler, and designated for efficient execution of control code as well as loop-centric DSP-specific algorithms.

The proposed approach is to provide an application-specific scaleable DSP core architecture. To gain the advantage of this approach it is strictly required to understand the application-specific requirements on the core architecture. For this purpose a design space exploration methodology is introduced to analyze the influence of different core configurations on area consumption (and later also on power dissipation) for specific application code. Flexibility and scalability increase verification and validation effort. To keep this effort reasonably low, a unique configuration file is introduced. When parameters are changed, the current core configuration propagates automatically to the software tools, to the VHDL-RTL description used for generating silicon, and to the documentation, which is then updated automatically.
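To make the idea of a single configuration file driving tools, RTL and documentation concrete, a fragment of such a file might look like the sketch below. All element and attribute names here are hypothetical illustrations chosen for this chapter; the actual xDSPcore file format is not reproduced.

```xml
<!-- Illustrative sketch only: element and attribute names are
     hypothetical, not taken from the actual xDSPcore tool-chain. -->
<core name="example_variant_a">
  <!-- register file split into data, address and width parameters -->
  <registerfile data="16" address="8" width="40"/>
  <!-- two data memory banks for the Dual-Harvard memory interface -->
  <memory databanks="2" buswidth="64"/>
  <pipeline depth="5"/>
  <!-- number of parallel execution units -->
  <units mac="2" alu="2" agu="2"/>
</core>
```

A single file of this kind means that changing, say, the number of MAC units is made in one place and then consumed by the compiler, the instruction set simulator, the VHDL-RTL generator and the documentation generator alike.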

1.4 Outline of Thesis

This thesis consists of two parts: an introductory Part I, structured as outlined below, followed by Part II, which illustrates the main research results in 12 publications. The second chapter introduces DSP-specific architectural features as well as system aspects like worst-case execution time. The first part of the third chapter briefly discusses the design space of core subsystems, considering area consumption, performance and power dissipation, followed by architectural alternatives and their suitability for use in DSP core architectures. The third chapter ends with an introduction of some commercially available DSP core architectures and a brief illustration of xDSPcore, the configurable DSP core architecture introduced in this thesis. The fourth chapter discusses issues concerning high-level language compilers, starting with typical coding practices used when implementing algorithms in the field of digital signal processing. A short introduction to the structure of high-level language compilers follows. The fourth chapter ends with a discussion of the requirements a high-level language compiler places on the core architecture and summarizes the architectural properties needed to obtain efficient compilation results. In the fifth chapter a summary of the publications provides an overview of the research work and of the author's contribution. The sixth and final chapter concludes with a summary of the results of the project and provides an overview of future research topics.


2 DSP Specific Features

This section illustrates DSP-specific architectural features which differentiate DSPs from traditional microcontroller architectures. The architectural features are introduced and the motivation for choosing them is analyzed. Some of these features also exist in microcontroller architectures, where they are used to increase performance of the core when executing algorithms in the field of digital signal processing [4][7][95].

2.1 Introduction

"DSP is an embedded microprocessor specifically designed to handle signal processing algorithms cost effectively", where cost effectiveness means low silicon area and low power dissipation [102]. To obtain this target while considering the specific requirements of digital signal processing, algorithm-specific hardware is utilized to meet the performance, power and area targets. Orthogonality, being in conflict with these targets, is ignored. Ignoring orthogonal structure leads to highly specialized core architectures programmed manually by experts in assembly language. Developing a high-level language compiler for the specialized features is costly and requires so-called compiler-known functions and intrinsics to invoke the efficient use of the specialized hardware [37]. The consequences are algorithm descriptions which are not easily portable to different core architectures. The alternative is to use pure ANSI-C [109] and not make full use of the specialized features, which decreases the achievable performance on the core architecture. In 2003 a first draft of a standard for the Embedded C language was introduced [169]. Based on ANSI-C, additional standardized enhancements are introduced to make use of the special features required to implement digital signal processing algorithms efficiently [24]. The advantage of standardized intrinsics is that compilers for different core architectures are able to compile the same algorithmic source.
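The portability trade-off can be illustrated with a small sketch: the saturating 16-bit addition below is written in plain, portable ANSI-C. A DSP compiler with compiler-known functions (or the standardized fixed-point types of Embedded C) could collapse this whole pattern into a single saturating hardware instruction, whereas a compiler without such knowledge must generate the compare-and-branch sequence literally. The function name is illustrative, not taken from any particular toolchain.

```c
#include <stdint.h>

/* Portable ANSI-C saturating 16-bit addition. A compiler-known
 * function or an Embedded C _Sat type would express the same intent
 * in one line and map directly onto the saturation hardware. */
static int16_t sat_add16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;  /* widened: cannot overflow */
    if (sum > INT16_MAX) return INT16_MAX;  /* clip positive overflow   */
    if (sum < INT16_MIN) return INT16_MIN;  /* clip negative overflow   */
    return (int16_t)sum;
}
```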

2.2 Saturation

When the result of an operation exceeds the size of the destination register an overflow or underflow takes place. In signed two's complement number representation the sign bit flips, which leads to a significant error. As illustrated in Figure 2 the envelope of a signal is significantly changed, and the error caused by the overflow or underflow is crucial. To overcome this problem traditional DSP architectures support saturation circuits. If the value of a result exceeds the data range of the storage, the highest or lowest value which can be represented correctly is used instead of the calculated result. The error generated by the saturation circuit, as illustrated in Figure 2, is thus minimal.

In commercially available DSP core architectures three saturation mechanisms are commonly used. Differing from microcontroller architectures with their plain 16-bit and 32-bit registers, DSP cores support accumulator registers, which provide additional bits called guard bits. For example the SC140 from Starcore LLC [32] supports a 40-bit wide accumulator register. These guard bits allow storage of intermediate results exceeding the data range, for example 32 bits. To store the final result to data memory the value has to fit into the data range supported by the data memory port. Therefore the result stored in the accumulator register has to be evaluated against the required data range. If the value stored in the accumulator register exceeds the maximum value supported by the data memory port (indicated by guard bits differing from the sign bit) it has to be saturated. Some DSP cores, for example the Blackfin from Analog Devices and Intel [8], support an additional saturation method. The overflow flag is evaluated to indicate the necessity of saturation; if overflow or underflow takes place, e.g. for a 16-bit value, the result is saturated to fit into the 16-bit destination register. The third mechanism uses a saturation mode where the computation results are matched to the allowed word length after each arithmetic operation [21].

Figure 2: Principle of Saturation. Figure 2 illustrates three signals. The solid line shows the original signal flow without limitations. The dotted line illustrates the signal flow for the original signal when the data range of the destination register is exceeded, which leads to signal flipping. The dashed line represents the saturated signal. When the original signal exceeds the data range, in this example that of a 16-bit register, the result is saturated and the largest possible value is stored. The generated error is thus kept minimal compared with the flipping signal.
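A minimal sketch of the guard-bit saturation mechanism described above, emulating a 40-bit accumulator in a 64-bit host integer: whenever the guard bits differ from the sign bit, the value no longer fits the 32-bit data-memory range and is clipped. The function name and the 64-bit emulation are assumptions of this sketch, not the hardware implementation.

```c
#include <stdint.h>

/* Saturate a 40-bit accumulator value (emulated in int64_t) to the
 * 32-bit range of the data memory port. Guard bits differing from the
 * sign bit correspond to the value lying outside [INT32_MIN, INT32_MAX]. */
static int32_t sat40_to_32(int64_t acc40)
{
    if (acc40 > INT32_MAX) return INT32_MAX;  /* positive overflow: clip high */
    if (acc40 < INT32_MIN) return INT32_MIN;  /* negative overflow: clip low  */
    return (int32_t)acc40;                    /* value fits: pass unchanged   */
}
```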


2.3 Rounding

DSP core architectures support different rounding modes. In this subsection the two rounding modes available in most commercial core architectures are introduced. To illustrate the rounding modes 40-bit accumulator registers are used. Commercial core architectures define the rounding modes only for rounding 32-bit values to 16-bit values. The guard bits remain unchanged.

2.3.1 Two's Complement Rounding

Two's Complement rounding is also called the round-to-nearest technique [116]. If the value of the lower half of the data word is greater than or equal to half of the LSB of the resulting rounded word the value is rounded up; all values smaller than half are rounded down. Therefore statistically a small positive bias is introduced. In Figure 3 Two's Complement rounding is illustrated. Independent of the Least Significant Bit (LSB) of the high word, one is added at the half-way point, which causes the positive bias.

Figure 3: Two's Complement Rounding (Motorola 56000 family).
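The rule can be sketched as a C function rounding a 32-bit value to its upper 16 bits: half of the resulting LSB (0x8000) is added unconditionally before the low word is truncated. The sketch deliberately ignores the corner-case saturation a real core would additionally apply; the function name is illustrative.

```c
#include <stdint.h>

/* Two's complement (round-to-nearest) rounding of a 32-bit value to
 * 16 bits. The unconditional addition of 0x8000 rounds the half-way
 * case up every time, which produces the small positive bias. A 64-bit
 * intermediate avoids overflow for values near INT32_MAX. */
static int16_t round_tc(int32_t x)
{
    return (int16_t)(((int64_t)x + 0x8000) >> 16);
}
```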

2.3.2 Convergent Rounding

A slightly improved rounding methodology is convergent rounding, also called round-to-nearest-even [116]. The bias discussed above, caused by the decision at the half-way point, is compensated by rounding down if the high portion is even and rounding up if the high portion is odd. In Figure 4 convergent rounding is illustrated.


Figure 4: Convergent Rounding (Motorola 56000 family). In contrast to Two's Complement Rounding, the addition of one is only performed if the LSB of the high word is equal to one.
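Convergent rounding differs from the previous sketch only at the exact half-way point, where the result is forced to the even neighbour, cancelling the bias. Again a sketch with an illustrative function name, ignoring corner-case saturation.

```c
#include <stdint.h>

/* Convergent (round-to-nearest-even) rounding of a 32-bit value to
 * 16 bits: identical to two's complement rounding except when the low
 * word is exactly 0x8000, where the LSB of the result is cleared so
 * that ties round to the even value. */
static int16_t round_conv(int32_t x)
{
    int32_t r = (int32_t)(((int64_t)x + 0x8000) >> 16);
    if ((x & 0xFFFF) == 0x8000)  /* exactly half way between two results */
        r &= ~1;                 /* tie: round to the even neighbour     */
    return (int16_t)r;
}
```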

2.4 Fixed-Point, Floating-Point

DSP core architectures can be divided into those supporting fixed-point arithmetic and those supporting floating-point arithmetic; the floating-point DSP architectures mostly also support integer fixed-point arithmetic, for example for address calculation. Floating-point data representation uses a combination of significand and exponent:

value = significand · 2^exponent

Fixed-point representation is chosen for most of the available core architectures, especially those used as embedded cores for SoC or SiP applications. Algorithm development for fixed-point arithmetic requires more care, but the hardware implementation implies less power and area consumption and is therefore favored. Fixed-point representation is also called fractional representation, with integer as a special case. The place of the virtual binary point in the data word determines the number of integer and fraction bits in the word. In integers the point is to the right of the LSB, whereas in fractional numbers the point is right after the sign bit. Some operations like addressing and control code functions are inherently of type integer. Filtering operations on the other hand make use of fractional data representation. The difference is illustrated in Figure 5 for a 40-bit wide accumulator register. The common fixed-point format used for DSP cores is S[15.1], where the S stands for signed [172], and scales the data to the range -1 ≤ X < 1. The radix point is located on the left side of the

register between the sign bit and the next bit. The bits located right of the radix point encode the fraction. The guard bits as mentioned before are used to store intermediate results exceeding the data range of the destination register. For the example in Figure 5 eight guard bits are supported. This allows a data range of -256 ≤ X < 256. Integer can be interpreted as a special case of fractional where the radix point is located at the end of the register because no fractional values are supported.

Figure 5: Integer versus Fractional Data Representation. The advantage of using fractions for traditional DSP algorithms like filtering is that when the number of bits available for representing a value is reduced (e.g. during a rounding operation) the accuracy changes but the value remains correct. Integers are used for control code and address generation. In [172] definitions of accuracy, data range and different variants of fractional data representation can be found. With the exception of multiplication operations there is no difference concerning the hardware implementation of arithmetic functions for fractional and integer data types. The multiplication requires a left shift by one bit and a setting of the LSB to zero to correct the position of the radix point. In Figure 6 the required shift for signed multiplication is illustrated.


Figure 6: Fractional Multiplication Including Left Shift.
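The corrective left shift can be sketched for S[15.1] (Q15) operands: the 32-bit integer product of two fractional values carries two sign bits, so multiplying by two (the left shift with LSB forced to zero) realigns the radix point and yields a Q31 result. A sketch with an illustrative name; the special case (-1) · (-1), which a real core would saturate, is left out.

```c
#include <stdint.h>

/* Fractional (Q15 x Q15 -> Q31) multiplication: the plain integer
 * product has a redundant sign bit, removed by the left shift of one
 * (written as * 2 to stay within defined C behaviour). */
static int32_t fract_mul(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;  /* integer product, two sign bits  */
    return p * 2;                         /* realign radix point, LSB = 0    */
}
```

For example, 0.5 · 0.5 (both operands 0x4000 in Q15) yields 0x20000000 in Q31, i.e. 0.25.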

2.5 Hardware Loops

Filter algorithms often executed on DSPs are loop-centric. The code example in Figure 7 represents a Finite Impulse Response (FIR) filter [103]. The filter kernel consists of only one clock cycle. It loads the operands (including address calculation) and calculates one filter tap. Software pipelining is used to compensate the load-in-use dependency caused by split execution, which is illustrated in a later section.

ld (r0)+, d0 || ld (r1)+, d1
ld (r0)+, d0 || ld (r1)+, d1 || rep n
mac d0,d1,a4
mac d0,d1,a4

Figure 7: Assembly Code Example for Finite Impulse Response (FIR) Filter.

Executing the same algorithm on a traditional microcontroller architecture leads to additional instructions and clock cycles necessary for loop handling. Traditional microcontrollers do not support single Multiply and Accumulate (MAC) instructions and therefore require at least two instructions (multiply and accumulate) to calculate a filter tap. Single-issue microcontroller architectures do not provide enough hardware resources to make use of SW pipelining. Loop handling consists of setting the loop counter once, decrementing the loop counter with each loop iteration and evaluating the end-of-loop condition continuously. If no explicit loop instruction is available, conditional branch instructions are required to implement the jump back to loop start. To implement loop constructs more efficiently DSP core architectures support zero-overhead loop instructions. The loop is invoked so that the loop length and the number of iterations are part of the loop instruction. The remaining loop handling like decrementing of

the loop counter and the jump back to loop start with each loop cycle is implicitly handled in hardware. Branch delays caused by regular branch instructions are compensated.
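The FIR kernel of Figure 7 corresponds to C source along the following lines. A DSP compiler maps the loop body onto the single mac instruction and the loop itself onto the zero-overhead loop hardware (the rep n of the assembly example), so no counter decrement or branch appears in the generated kernel. The function name and the 64-bit accumulator emulation are illustrative assumptions.

```c
#include <stdint.h>

/* C source of an n-tap FIR kernel: one multiply-accumulate per tap.
 * The 64-bit accumulator stands in for a guarded accumulator register. */
static int64_t fir_tap_sum(const int16_t *x, const int16_t *c, int n)
{
    int64_t acc = 0;                  /* guarded accumulator            */
    for (int i = 0; i < n; i++)
        acc += (int32_t)x[i] * c[i];  /* one MAC per loop iteration     */
    return acc;
}
```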

2.6 Addressing Modes

This sub-section introduces different addressing modes supported by DSP core architectures [116], some of which also exist in traditional microcontroller architectures.

2.6.1 Implied Addressing

For Implied Addressing the addresses of the source operands are implicitly coded in the instruction word. Examples of implied addressing can be found in older core architectures like the Lucent 16xx core family, where the multiplier source operands are located in two registers X and Y. Even if the assembler syntax for the multiplication contains explicitly named registers (X, Y as illustrated in Figure 8), the multiplication does not allow different registers to be assigned.

P = X*Y

Figure 8: Example for Implied Addressing: Multiply Operation (Lucent 16xx).

Similar examples can be found in later DSP architectures like the Starcore SC140, where instructions like MAX2VIT use implied addressing [26]. Two register pairs are supported and selected by a mode bit. An example is illustrated in Figure 9.

MAX2VIT D2, D4

Figure 9: Example for Implied Addressing: MAX2VIT Instruction (Starcore SC140).

Implied addressing can be used to increase code density but restricts the use of the implied registers during register allocation for other instructions.

2.6.2 Immediate Data Addressing

Immediate Data Addressing is used for operations where the operand is part of the instruction word. Examples of immediate addressing can be found in most core architectures supporting the preload of register values with immediate data, e.g. the move constant (movc) instruction of xDSPcore illustrated in Figure 10.

MOVC 27,D4

Figure 10: Example for Immediate Data Addressing: MOVC instruction (xDSPcore).

2.6.3 Memory Direct Addressing

Memory Direct Addressing is also called absolute addressing. The address where data has to be fetched from or stored to is part of the instruction word. The reachable address space is thus limited by the available coding space in the instruction word or instruction words, which is the main factor limiting the use of absolute addressing.


2.6.4 Register Direct Addressing

Register Direct Addressing is used for instructions which receive their operands from registers which are addressed as part of the instruction. The difference to implied addressing is that the registers are explicitly coded inside the instruction word, which allows assignment of different registers to the same instruction during register allocation.

SUBF R1, R4

Figure 11: Example for Register Direct Addressing: Subtraction (TI C62x).

In Figure 11 an example from the TI C62x is illustrated [36]. The two-operand subtract instruction allows an assignment of different source and destination operands to the same subtraction instruction.

2.6.5 Register Indirect Addressing

Register Indirect Addressing and its variants, as explained in the following subsections, are quite often used for algorithms executed on DSP cores. The core architectures support registers which contain memory addresses and can be used for accessing memory entries. The memory addresses can be stored in specialized address registers used only for this purpose or in general purpose registers. These general purpose registers can also be used for other operations. Large address spaces can be addressed with little coding effort, which has a significant influence on code density. The principle of register indirect addressing is introduced in Figure 12.

Figure 12: Principle of Register Indirect Addressing.

Register Indirect Pre/Post Addressing

The Register Indirect addressing mode can be used with a Pre/Post Addressing option as illustrated in Figure 13. In particular, algorithms executed on DSP architectures process blocks of data and therefore consecutive addresses are used. The post-operation mode allows access to a memory address and afterwards increments or decrements the address stored in the address register. The pre-operation mode allows access to a data memory location with the already updated address.


Figure 13: Principle of Pre/Post Operation Mode.

The value for incrementing or decrementing the address located in the address register can be one (or equal to the granularity of the addressed memory space) or, for some core architectures, a programmable offset.

LD (R0+), D0 … pre-operation
LD (R0)+, D0 … post-operation

Figure 14: Assembly Code Example for Pre/Post Increment Instructions (xDSPcore).

Pre-operation requires an additional clock cycle for address calculation in many DSP core architectures. xDSPcore supports both modes without requiring an additional clock cycle; the related assembly code example is introduced in Figure 14.

Register Indirect Addressing with Indexing

For Register Indirect Addressing with Indexing the contents of two address registers are added and the result is used for addressing data memory locations. The difference to the pre/post modification addressing scheme introduced above is that none of the register values is modified. Two reasons favor this addressing mode. First, Register Indirect Addressing with Indexing allows the use of the same program code with different data sets. Between different data sets only the index register value has to be set to the start address of the new block of data. The second reason is the use by compilers communicating arguments to subroutines by passing data via the stack. One address register is assigned as stack frame pointer. This means that the subroutine does not have to know the absolute addresses. The transferred arguments are located relative to the stack frame pointer.

Register Indirect Addressing with Modulo Arithmetic

Modulo arithmetic can be used for implementing circular buffers [171]. The data values as illustrated in Figure 15 [78] are located on consecutive addresses in data memory. If the address pointer reaches the end of the circular buffer, specialized hardware circuits are used to reset the pointer to the start address.


begin = n 2ld ( m ) 

n , m∈ N

=8n

end = begin + m

Figure 15: Principle of Using a Modulo Buffer for Address Generation.

This implicit boundary check reduces the effort for manual control of the buffer addressing. Separate modulo registers are supported to store the size of the chosen buffer. Some commercially available core architectures support circular buffer addressing with a defined start address aligned to the size of the supported buffer, e.g. a circular buffer with a buffer size of 256 can start at the addresses 0, 256, 512 and so on. The drawback of this implementation is fragmented data memory. To overcome the fragmentation problem some core architectures like the SC140 [32] or Carmel [12] support a programmable size and start address of the circular buffer, which requires an additional base address register and an additional adder circuit for address calculation.

Register Indirect with Bit Reversal

The Register Indirect with Bit Reversal addressing mode is also called reverse carry addressing. This addressing mode is only used during execution of FFT algorithms. FFT algorithms have the drawback that they take either their input or their output values in scrambled order. To complicate matters further, the scrambling depends on the particular version of the FFT algorithm [103]. In Figure 16 [78] an example is illustrated. The lower bits of the generated addresses are mirrored, which scrambles the addresses as required by FFT algorithms [116].


Figure 16: Principle of the Bit Reversal Addressing Scheme.
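The two specialized AGU address-update schemes described above, the modulo (circular-buffer) update of Figure 15 and the bit-reversed indexing of Figure 16, can be sketched in C as follows. Function names, the unit address step and the buffer parameters are illustrative assumptions, not tied to any particular core.

```c
#include <stdint.h>

/* Modulo address update: the pointer steps through a circular buffer
 * of size m starting at `base` and wraps back to the start when the
 * end is reached -- the implicit boundary check of Figure 15. */
static uint32_t modulo_next(uint32_t addr, uint32_t base, uint32_t m)
{
    addr += 1;                 /* post-increment by one address step */
    if (addr >= base + m)      /* implicit end-of-buffer check       */
        addr = base;           /* wrap to the buffer start address   */
    return addr;
}

/* Bit-reversed index generation as required by radix-2 FFTs: the low
 * `bits` bits of the index are mirrored (Figure 16), so for an
 * 8-point transform (bits = 3) index 1 maps to 4 and 3 maps to 6. */
static uint32_t bit_reverse(uint32_t idx, unsigned bits)
{
    uint32_t r = 0;
    for (unsigned i = 0; i < bits; i++) {
        r = (r << 1) | (idx & 1);  /* shift result, append next input bit */
        idx >>= 1;
    }
    return r;
}
```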

2.6.6 Short Addressing Modes

Code density is a significant factor influencing the consumed silicon area of the core subsystem. Many of the addressing schemes described above use two instruction words to store the immediate or offset values. To increase code density DSP core architectures support instructions with small immediate values which can be coded into one instruction word.

Short Immediate Data

One example is short immediate data addressing, where a constant that is part of the instruction word can be stored in a register. Restricting the data range of the constant allows a single instruction word to be used. Figure 17 illustrates an example supported by xDSPcore.

MOVC 0, d0
MOVCL 1234, d0

Figure 17: Assembly Code Example for Short Immediate Data (xDSPcore).

The short version of the instruction supports a data range for the constant of -32 ≤ constant < 32. For assigning constants exceeding this range a second instruction with the same function is introduced which supports an additional instruction word for storing the immediate value.

Short Memory-Direct Addressing

As mentioned before, the use of memory direct addressing or absolute addressing is limited by the required coding space. However some core architectures support this address mode within one instruction word. The address mode can then be used in combination with special features, as for example in the Motorola 56000 family [20][21] where I/O registers can be addressed with this address mode. The small offset which can be placed in one instruction word (e.g. 6 bits for this example) is extended inside the core to a physical address in the 64k byte address space.
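For the MOVC/MOVCL pair of Figure 17, the instruction selection an assembler or compiler could perform reduces to a range test on the constant; this hypothetical predicate is a sketch of that decision, not part of any documented toolchain.

```c
#include <stdint.h>
#include <stdbool.h>

/* Does the constant fit the short (single-instruction-word) immediate
 * form? Short range as stated above: -32 <= constant < 32; anything
 * outside needs the long form with an extra instruction word. */
static bool fits_short_immediate(int32_t c)
{
    return c >= -32 && c < 32;
}
```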

Paged Memory-Direct Addressing

This address scheme splits the available address range into address pages. Once the page is set, a reduced coding space can be used to access addresses within the page. This allows the short version of the addressing schemes to be used. The overhead for the paging mechanism is not negligible, and the addressing scheme can only increase code density if the executed algorithm allows data to be mapped to pages. Changing the memory page requires additional instructions (with influence on code density) and additional execution cycles.

2.7 Multiple Memory Banks

Traditional algorithms executed on DSP architectures are data flow algorithms. For example, data values describing a signal are fetched, processed by digital signal processing algorithms and then stored back into data memory. The implementation of filter algorithms based on MAC instructions as illustrated in Figure 7 requires fetching two data values for the multiply operation. The summation of the multiplication results takes place in a local register. To perform the two independent data fetch operations in parallel, at least two independent memory ports are required.

Figure 18: Processor Architectures: von Neumann, Harvard, modified Dual-Harvard.

Figure 18 illustrates the principles of the von Neumann architecture with a combined data and program memory, the Harvard architecture where data and program memory are split, and the modified Dual-Harvard architecture with two independent data memory ports [96][113][149]. Some core architectures, for example Carmel [10], support the fetching of up to four independent data values, which can be used to increase the execution speed of filtering or FFT algorithms. Some commercial DSP cores like the SC140 [32] from Starcore LLC feature one address space for data and program memory, which eases the transfer of data between data and program memory. Others, including xDSPcore, feature separate address spaces. The X/Y memory splitting as used for the OAKDSP [29] is well suited if the two fetched operands are located in two different memory spaces (e.g. for the example in Figure 7). If the fetched operands are located in the same address space the memory operations have to be serialized, which leads to reduced system performance.

The Starcore SC140 [32] features interleaved addressing which can be used to reduce the possibility of memory hazards. In Figure 19 the memory mapping is illustrated. The chosen concept makes use of an implementation aspect: the performance of memory operations is limited, especially for large memory blocks. Therefore the memory implementation is physically split into small memory blocks, which allows higher clock frequencies to be reached.

Figure 19: Example for Interleaved Memory Addressing (SC 140). The small memory blocks can be accessed separately as illustrated in Figure 19 where 4k physical memory blocks are supported.
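A hypothetical sketch of the interleaved bank selection of Figure 19: with 4k physical blocks, the bank is chosen from the address bits above the block offset, so consecutive 4k regions land in different blocks and two parallel accesses rarely collide. The bank count is an illustrative assumption of this sketch, not the documented SC140 value.

```c
#include <stdint.h>

#define BLOCK_SIZE 4096u   /* 4k physical memory block (from Figure 19) */
#define NUM_BANKS  4u      /* number of interleaved blocks (assumed)    */

/* Select the physical memory block for an address: block index taken
 * modulo the bank count, interleaving consecutive 4k regions. */
static unsigned bank_of(uint32_t addr)
{
    return (addr / BLOCK_SIZE) % NUM_BANKS;
}
```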

2.8 CISC Instruction Sets

CISC instructions are built up of several micro-instructions. The Multiply and Accumulate (MAC) [97] instruction introduced in Figure 20 is used as an example of a CISC instruction. Two data values are required for the multiplication. A third operand is required for accumulation with the multiplication result. The result of the accumulation operation is stored in an accumulator register, the same register that is used as the third source operand. The example in Figure 20 also illustrates the additional left shift operation required for multiplying fractional data values.


Figure 20: Example for CISC Instructions: Multiply and Accumulate (MAC).

If the core architecture is based on load/store, the operands are fetched from a register file. For direct-memory architectures the operands have to be fetched from data memory. This requires memory addressing coded in the instruction word, thus increasing the complexity of the MAC instruction. Some core architectures support rounding and saturation logic as part of the MAC instruction. Driven by compiler requirements, modern DSP core architectures feature RISC instruction sets with some CISC extensions for increasing code density and performance during the execution of filtering and FFT algorithms.
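The micro-operations composing the fractional MAC of Figure 20 can be made explicit in a sketch: multiply, realign the radix point, accumulate. The 64-bit emulation of the guarded accumulator and the function name are assumptions of this sketch; a real core performs all three steps in one instruction.

```c
#include <stdint.h>

/* Fractional MAC decomposed into its micro-operations. The accumulator
 * `acc` is both the third source operand and the destination. */
static int64_t mac_fract(int64_t acc, int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;  /* micro-op 1: multiply          */
    int64_t q = (int64_t)p * 2;           /* micro-op 2: fractional shift  */
    return acc + q;                       /* micro-op 3: accumulate        */
}
```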

2.9 Orthogonality

During the definition of what a DSP core is, one of the major arguments was that orthogonality aspects are ignored for the sake of increased efficiency in terms of silicon area and power dissipation. The feature orthogonality is mentioned at least once in white papers, product briefs or even technical documentation of DSP vendors, often in combination with instruction set or core architecture [14][35]. In [116] orthogonality is defined as the extent "to which a processor's instruction set is consistent". It is also mentioned that orthogonality is not easily measured. Besides the aspect of instruction set consistency, the degree to which the operands and addressing modes are uniformly available for operations is used as a measure of orthogonality. Examples of missing orthogonality can be found in existing core architectures, e.g. the address registers of the Motorola 56k family [20][21], which are banked. Four of the eight registers are assigned to one Address Generation Unit (AGU) and the remaining four to the second AGU, which limits register allocation and instruction scheduling. The SC140 [32] does not allow the higher eight address registers to be used during modulo addressing because the higher address registers are then used as base registers. In the following some more examples of missing orthogonality are illustrated.

Reduced Number of Operations

Reducing the number of instructions relaxes the pressure on coding the instruction set; an example is the rotate instruction missing from the Lucent 16xx architecture [19].

Reduced Number of Addressing Modes

Providing all of the known addressing modes for all instructions demands a lot of coding effort inside the instruction set. To increase code density core architectures only support a subset of the addressing modes and restrict their use to a group of instructions.

Reduced Number of Source/Destination Operands

Allowing orthogonal use of registers by each instruction requires a large instruction space and decreases code density. Therefore core architects limit the use of some of the registers to specialized functions, for example the MAX2VIT instruction of the SC140 [32].

Use of Mode Bits

Most commercial DSP cores make use of mode bits. Depending on the mode indicated by the mode bits, the meaning of an instruction changes. Mode bits are often used for specialized addressing modes or saturation and rounding modes. The advantage of increased code density can be outweighed by limitations during register allocation and instruction scheduling. In a later section the problem is discussed in detail.

2.10 Real-Time Requirements

Real-time requirements form the last aspect where digital signal processing algorithms have specific requirements and influence on the architecture of Digital Signal Processors. Analyzing microarchitectural improvements in microcontrollers over the last number of years, it is apparent that most of the improvements have taken place in cache structures. Cache structures are well suited to reduce the average execution time of an algorithm. A similar development has taken place in DSP cores, for example in the SC140 from Starcore LLC [32] or the Blackfin from Analog Devices and Intel [8]. Cache structures have been introduced for data and program memory. The drawback of introducing caches is that a strong requirement of real-time applications is lost: minimization of the worst case execution time. The purpose of Worst Case Execution Time (WCET) analysis is to determine a priori the worst case execution time of a piece of code. WCET is used in real-time and embedded systems to perform scheduling of tasks, to determine whether performance goals for periodic events are met, and also to analyze, for example, interrupts and their response time [80]. The main influence on execution time comes from program flow aspects like loop

iterations and function calls, and architectural features like pipeline structures and cache architectures [80]. In research, several algorithms and tools for analyzing the WCET of application code have been introduced [81][91][93]. The program flow analysis for this purpose can be split into a global low-level analysis and a local low-level analysis. The global low-level analysis considers the effect of architectural features like data caches [111][170], instruction cache structures [47][83][94][124][154] and branch prediction [65]. These analyses determine only global effects but do not generate any actual execution time values. The local low-level analysis handles effects caused by single instructions and their neighbor instructions, for example pipeline effects [79][146][155] and the influence of memory accesses on the execution time. The influence of caches on the WCET is significant, as discussed in [50][119][127][135][168]. If core architectures support instructions whose latency depends on the input values, for example the multiplication instruction of the ARM [5] whose execution time can differ between 1 and 4 clock cycles, then the calculation of the WCET is more complicated. The multiplication of the PowerPC 603 [30][57][58] can even consume between 2 and 6 clock cycles depending on the source operands. In the Alpha 21604 [16][17] the execution time of a software division algorithm differs between 16 clock cycles and 144 cycles, which implies a ratio of 1:9. In [144] the contributions of different architectural features to the variation in execution time, and therefore to the uncertainty in WCET analysis, are illustrated. The largest impact arises from Translation Lookaside Buffer (TLB) accesses, followed by data and instruction caches. The influence of instruction execution compared with these dominating aspects is negligible [74].
To summarize, caches and prediction algorithms are counterproductive for fulfilling real-time requirements and thus for minimizing the worst-case execution time. The requirements for developing an optimizing compiler are similar: simple issue rules and architectures with few restrictions are preferred, as they allow more accurate results.


3 DSP Cores

This section starts with an introduction of the design space of DSP core architectures and the main parameters influencing their design, and illustrates the limiting parameters which cause the gap between theoretical and practical performance. The second part introduces some architectural alternatives and discusses their advantages and disadvantages. The third part describes commercially available core architectures, starting with cores from the early 1990s up to the latest announcements. This chapter ends with a brief introduction of xDSPcore.

3.1 Design Space

This section introduces the possible design space for RISC-based core architectures. Today most DSP core architectures are RISC-based load-store architectures. The trade-offs between the main architectural features considering silicon area, performance and power dissipation are briefly illustrated by some examples. The design space of xDSPcore and the possibilities to influence these parameters by configuration settings can be found in [P10][P11]. The purpose of this section is to illustrate the complexity of choosing the "best core" and to show that there is no general solution [13]. A DSP core is well suited when it solves an application-specific problem efficiently in terms of consumed silicon area and power consumption. However, it also has to be considered that the overall application partitioning has a significant influence on the costs, and that the costs of a product are not only caused by silicon production and packaging. Software development costs, maintenance and portability contribute significantly to the costs of SoC and SiP solutions.

3.1.1 Silicon Area This subsection introduces the main contributors to the silicon consumption of a core subsystem, with special focus on DSP architectures. The instruction set architecture (ISA) and its influence is then chosen as an example, aside from core and memory subsystem. This example shall illustrate the complexity and the mutual influence of these aspects. Core Increasing system complexity has led to large programs being executed on core architectures. The contribution of the core area to the die area of the core subsystem has therefore become insignificant, although this key number is still taken as a decision point for choosing one particular core. With the increasing complexity of modern silicon technologies such a comparison becomes even more difficult: performance figures, for example the core area in mm², require additional information like the chosen technology, silicon foundry, number of metal layers, temperature range and supply voltage.

Memory Subsystem The increasing size of programs executed on embedded core architectures increases the contribution of the memory subsystem to the overall area. Therefore the importance of code density, with its influence on the program memory area, has increased. In the following, the instruction set is taken as an example to illustrate the influence on core and memory subsystems. Instruction Set The instruction set of a core architecture can be split into two aspects: the instruction set architecture and the related binary coding. The instruction set is taken as an example to illustrate the cross coupling of different subsystem features. Further examples can be found in the design space discussion of [P10] and [P11]. The instruction set architecture mirrors the functionality supported by the core architecture: for example, the support of two- or three-operand instructions, features like addressing modes and saturation modes, or complex instructions like division. Instructions and the related binary coding are necessary to program the available units. The mapping of the instruction set architecture to instructions must consider micro-architectural aspects. It will be difficult to map the ISA onto the native instruction word if the native instruction word size is 16 bits, the ISA requirement is to support three-operand instructions, and each operand requires 4-bit register coding. In this case it is necessary to map the three-operand instructions onto two instruction words or to increase the size of the native instruction word.
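The bit-budget argument above can be made explicit with simple arithmetic. The following sketch (the word and register sizes are the ones used in the text) shows how little opcode space remains when three 4-bit register operands have to fit into a 16-bit native instruction word:

```python
def opcode_bits(word_bits, n_operands, reg_bits):
    """Bits left for the opcode after coding all register operands."""
    return word_bits - n_operands * reg_bits

# Three-operand instruction, 4-bit register coding:
print(opcode_bits(16, 3, 4))  # 4 -> only 2**4 = 16 distinct opcodes
print(opcode_bits(20, 3, 4))  # 8 -> 256 distinct opcodes fit
```

With only 4 opcode bits left, a realistic instruction set cannot be encoded, which is why the three-operand instructions must be split over two words or the native word must grow.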

Figure 21: Influence of Binary Coding on Application Code Density (using the same ISA). In Figure 21 the influence of the chosen binary coding is illustrated. The same ISA is once mapped onto 16-bit wide instruction words and once onto 20-bit wide instruction words. To illustrate the influence on code density, a piece of traditional control code is used, for example some PC benchmarks [18]. The results in Figure 21 show that the shorter native instruction word requires an increased number of long-words, which are simply additional instruction words for the identical instruction. This is reasonable because the coding space for immediate values and offsets is reduced in 16-bit wide native instructions. The overall code density for this example, normalized in bytes, is improved by 16 % when using the 16-bit native instruction words; however, the result will be different for other application code examples.

3.1.2 Performance The performance of DSP cores is measured in Million Instructions Per Second (MIPS) or Million Operations Per Second (MOPS) [25]. MOPS was introduced when multi-issue core architectures appeared on the market. These numbers are calculated by multiplying the reachable core frequency by the number of instructions executable in parallel. This led to announcements like the Texas Instruments TI C64x [39] with 8 GOPS (the possibility of eight instructions executed in parallel multiplied by a 1 GHz clock frequency). Berkeley Design Technology Inc. (BDTI) introduced the so-called BDTI benchmark suite, containing a dozen algorithmic examples. Most of these are based on small loop-centric kernels for filtering and vector operations, whereas other examples include an FFT, a Viterbi implementation and a control code example. Certain coding requirements restrict the implementation in order to simplify comparison between different core architectures. These small kernels are often not representative of application code executed on DSP cores. Another possibility to measure performance is counting the Million MAC instructions per second (MMACs). However, during the execution of control code, for example, the number of executable MAC instructions does not significantly influence performance. Micro-architectural limitations (e.g. as illustrated in Figure 24) reduce the accuracy of this performance factor for a mixture of DSP and control code. Theoretical versus Practical Performance The example of Texas Instruments can be used to illustrate the term theoretical performance: the theoretical performance of the 1 GHz TI C64x is 8 GOPS [37]; another example can be found in [133]. The practical performance is a measure of how efficiently a certain algorithm can make use of the resources provided by the core architecture.
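The distinction between the two numbers can be sketched in a few lines. The utilization figure below is an assumption for illustration (control code typically fills only a fraction of the issue slots, cf. the ILP discussion later in this chapter); it is not a measured value:

```python
def peak_gops(issue_width, clock_ghz):
    """Theoretical peak: instructions per cycle times clock frequency."""
    return issue_width * clock_ghz

def practical_gops(issue_width, clock_ghz, utilization):
    """Practical performance: the fraction of issue slots (0..1) an
    algorithm can actually fill on this core."""
    return peak_gops(issue_width, clock_ghz) * utilization

print(peak_gops(8, 1.0))                      # 8.0 (the 8-way, 1 GHz example)
print(round(practical_gops(8, 1.0, 0.3), 2))  # 2.4 (assumed 30 % utilization)
```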
Some of the factors limiting the reachable practical performance are introduced in this sub-section to illustrate the gap between theoretical and practical performance. Define-in-use Dependency One way to increase the number of MIPS and MOPS is to increase the reachable clock frequency of the core architecture. The clock frequency can be increased on the technological level by smaller feature sizes and on the architectural level by increasing the number of pipeline stages. The latter leads to super-pipelined architectures with 10 pipeline stages and more. Increasing the number of pipeline stages in the execution phase increases the define-in-use dependency [149], as illustrated in Figure 22. The five-stage pipeline of Figure 22 supports split execution, where two clock cycles are used for a calculation, e.g. one MAC instruction. The operands are read at the beginning of EX1 and the result is written at the end of stage EX2. Filtering operations, for example the FIR filter as illustrated in Figure 7, require consecutive MAC instructions for a cycle-efficient implementation. Because the result of the first MAC instruction is a source operand of the second MAC instruction, a NOP cycle is required to prevent data hazards [149]. For the core architecture of Figure 22, which features a lean pipeline structure, the additional NOP cycle is reasonable. However, the TI C62x [36] provides an eleven-stage pipeline containing 5 execution stages, where the define-in-use dependency increases significantly. A method to mitigate this problem is bypass circuits (bypassing intermediate results to the next instruction).
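The effect of the define-in-use dependency on a chain of dependent MAC instructions can be sketched with a small scheduling model. Single issue and the latencies below are assumptions for illustration, not a description of a particular core:

```python
def schedule_dependent_chain(n_instr, result_latency):
    """Issue cycle of each instruction in a chain where every
    instruction consumes the previous one's result, assuming single
    issue and results available result_latency cycles after issue."""
    issue = []
    ready = 0  # cycle at which the previous result becomes available
    for _ in range(n_instr):
        t = max(ready, issue[-1] + 1 if issue else 0)
        issue.append(t)
        ready = t + result_latency
    return issue

# Four dependent MACs, two execute stages (EX1/EX2): one NOP per pair.
print(schedule_dependent_chain(4, 2))  # [0, 2, 4, 6]
# With a bypass (effective latency 1) the chain issues back to back.
print(schedule_dependent_chain(4, 1))  # [0, 1, 2, 3]
```

Deeper execution phases (larger `result_latency`) stretch the schedule further, which is exactly the gap between theoretical and practical performance described above.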

Figure 22: Principle of Define-in-use Dependency. The xDSPcore architecture, which utilizes the pipeline structure of Figure 22, allows fetching the accumulator operand for the MAC instruction at the beginning of EX2 (as illustrated in Figure 58), which compensates the define-in-use dependency during the execution of filter operations. This example illustrates that increasing the reachable clock frequency by adding pipeline stages leads to an increased theoretical performance (due to relaxing the critical implementation path and reaching a higher clock frequency), but data and control dependencies in the application code can limit the increase of practical performance. Load-in-use Dependency A similar problem is the load-in-use dependency [149], as illustrated in Figure 23. To relax the timing at the data memory ports, additional pipeline stages are introduced for memory access. The execution of instructions dependent on the fetched data entries has to be delayed until the memory access has finished.


Figure 23: Principle of Load-in-use Dependency. In contrast to the define-in-use dependency, bypass circuits cannot be used to partly compensate this data dependency. The load-in-use dependency can cause a significant mismatch between theoretical and practical performance, especially during the execution of control code featuring short branch distances. Data Memory Bandwidth The data memory bandwidth of a core architecture is characterized by the number of load/store instructions executed in parallel, the size of the data memory ports and the structure of the access. The structure of the access covers alignment requirements at the data memory port and the number of independent addresses which can be generated and accessed each clock cycle. To prevent the practical performance from falling behind the theoretical performance, the necessary operands for each of the executed instructions have to be provided.
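Whether the memory ports can feed the parallel execution units is a simple bandwidth comparison. The sketch below checks the SC140-style configuration discussed next (four MACs, two 64-bit memory fetches per cycle, 16-bit operands); the function itself is an illustration, not a complete model of alignment restrictions:

```python
def memory_bw_sufficient(n_macs, operand_bits, n_ports, port_bits):
    """True if n_ports memory ports of port_bits each can supply the
    two operands per cycle that each of n_macs MAC units consumes.
    Ignores alignment and addressing restrictions."""
    return n_ports * port_bits >= n_macs * 2 * operand_bits

print(memory_bw_sufficient(4, 16, 2, 64))  # True: 128 >= 128 bits/cycle
print(memory_bw_sufficient(4, 32, 2, 64))  # False: 256 > 128 bits/cycle
```

The raw bandwidth being sufficient is only a necessary condition; as the text explains, the data must additionally be laid out so that two independent addresses cover all eight operands.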

Figure 24: Example for Data Memory Bandwidth Limitations (Starcore SC140). The core architecture illustrated in Figure 24 allows the execution of up to four MAC instructions in parallel (e.g. the SC140 [32]), which can be used to increase performance during the execution of filter algorithms. However, each of the four MAC instructions requires two operands every clock cycle. The example in Figure 24 enables fetching of two independent data values from data memory each cycle. The memory bandwidth for executing the four MAC instructions in parallel is sufficient when fetching two times 64-bit data and assuming 16-bit wide operands for the MAC instructions. The structure in Figure

24 illustrates a limitation of storing data in data memory. The data has to be placed in data memory so that it is possible to fetch the operands for all four MACs in parallel by addressing only two independent data entries. This limitation can require a large number of operations to position the data according to the required scheme, which is normally not accounted for in benchmark results, e.g. [9][28]. Program Memory Port The program memory port is used to fetch instructions from program memory. Multi-way VLIW architectures require a large number of instruction words to enable programming of the available parallel resources. An example is the Texas Instruments TI C6xx family [36], which requires a program memory port width of 256 bits, implying significant wiring effort. In combination with the poor code density of the TI C6xx family, its usage in area- and power-critical applications is not recommended. Therefore core architects have introduced architectural features to prevent large program memory ports. Providing a small program memory port requires less wiring to the program memory but leads to poor usage of the available parallel units. During the execution of control code this limitation is reasonable, because data and control dependencies limit the average ILP to 2-3, as illustrated in [106][112][115][159]. Loop-centric algorithms often used for typical DSP functions can make use of more parallelism, and therefore the peak performance of the core architecture would be limited by the reduced size of the program memory port. For increasing the peak performance of core architectures, extended program memory ports have been introduced [157]. Branch Delays Branch delays are unusable execution cycles caused by taken conditional branch instructions. Increasing the clock frequency by increasing the number of pipeline stages increases the number of branch delays and therefore decreases the practical performance.
Compared with single-issue microcontroller cores, this is further aggravated when executing control code with short branch distances on multi-way VLIW architectures. Branch prediction circuits as introduced in [31][44][173] can be used to reduce the number of branch delays, but the drawbacks of prediction circuits have already been pointed out in section 2.10. An alternative way to compensate branch delays is to avoid branch instructions by making use of predicated or conditional execution. Benchmark results in [132] illustrate that predicated execution can be used to remove about 30% of the conditional branch instructions. The chosen implementation of predicated execution has influence on the practical performance. For example, the implementation used for the SC140 [32], with only one flag and few conditions, can lead to a poor usage of the resources in control code sections and several unused execution cycles. This limitation is caused by restrictions

during instruction scheduling: scheduling of instructions between the generation and the evaluation of the status information is not allowed. For multi-way VLIW architectures featuring deep pipelines, the gap between the theoretical and practical performance can increase significantly.
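The branch-delay cost and the gain from predication can be sketched together. All numbers below (issue width, delay slots, branch distance, taken fraction) are illustrative assumptions; the 30 % branch reduction follows the figure cited from [132]:

```python
def predicated_abs(x):
    """If-conversion sketch: the compare sets a predicate, both
    alternatives are computed, and the predicate selects the result,
    giving straight-line code without a conditional branch."""
    p = int(x < 0)
    return p * (-x) + (1 - p) * x

def control_code_ipc(issue_width, branch_distance, delay_slots,
                     taken_fraction):
    """Useful instructions per cycle when every taken branch costs
    delay_slots dead cycles (no prediction assumed)."""
    cycles = branch_distance / issue_width + taken_fraction * delay_slots
    return branch_distance / cycles

print(predicated_abs(-5), predicated_abs(7))           # 5 7
# 4-way VLIW, 3 delay slots, a branch every 8 instructions, half taken:
print(round(control_code_ipc(4, 8, 3, 0.5), 2))        # 2.29
# removing ~30 % of the branches by predication (cf. [132]):
print(round(control_code_ipc(4, 8, 3, 0.5 * 0.7), 2))  # 2.62
```

Even this crude model shows why predicated execution matters on wide cores: the branch delays, not the issue width, dominate the control-code throughput.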

3.1.3 Power Consumption The power consumption of a core subsystem is influenced by a number of factors, namely the core architecture itself, the memory subsystem, the execution frequency and the executed algorithms, which influence the traffic on the data memory bus [139][143]. This section considers power consumption aspects where embedded core architectures can contribute to reducing power dissipation; technology aspects are not considered in detail. Power dissipation in CMOS circuits is mainly caused by three sources: leakage current, short circuit current, and charging and discharging of capacitive loads during logic changes.

P = Pleak + Pshort + Pdynamic [1]

Leakage current is primarily determined by the fabrication technology and the circuit area. Short circuit currents can be avoided by careful design [59][62][72][158][161], and the same is true for leakage [70][73][85][104][141][165]. Three degrees of freedom are an inherent part of the low power design space: voltage, physical capacitance and data activity. These factors are briefly discussed in this sub-section. Equation 2 contains the factors which mainly influence the dynamic power consumption [110]:

P_i_sw = 1/2 · V² · C_i · D_i / T [2]

where C_i is the switched capacitance of node i, D_i its data activity and T the clock period.
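As a minimal numeric sketch of Equation 2, the following fragment evaluates the switching power of a single node. The supply voltage, capacitance, activity and clock values are illustrative assumptions, not figures from the thesis:

```python
def dynamic_power(v, c_i, d_i, t):
    """Average switching power of one node, following Equation 2:
    0.5 * V^2 * C_i * D_i / T, with C_i the node capacitance, D_i the
    data activity (transitions per period) and T the clock period."""
    return 0.5 * v ** 2 * c_i * d_i / t

# Assumed values: 1.2 V supply, 10 fF node, activity 0.2, 100 MHz clock:
print(f"{dynamic_power(1.2, 10e-15, 0.2, 10e-9):.3e} W")  # 1.440e-07 W
# The quadratic voltage term: halving V quarters the power.
print(f"{dynamic_power(0.6, 10e-15, 0.2, 10e-9):.3e} W")  # 3.600e-08 W
```

The second line illustrates why voltage scaling, discussed next, is the most effective lever in this design space.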

Voltage The quadratic relation between voltage and power dissipation makes this parameter an effective lever for reducing power dissipation. However, voltage scaling influences not only one part of an SoC solution: system aspects have to be considered carefully, as with decreasing supply voltage a speed penalty is evident [56][107][131][156]. In [61] an architecture-driven voltage scaling strategy is presented, where pipelined and parallel architectures are used to compensate the throughput problem caused by the reduced supply voltage. A different approach is illustrated in [175]. Another possibility to compensate the speed decrease caused by a reduced supply voltage is to decrease Vt. This is limited by noise margin constraints and by the need to control the increase of the sub-threshold leakage current. Dual-Vt techniques such as those introduced in [107] require multi-threshold CMOS transistors (MTCMOS), which have to be supported by the target technology.

Physical Capacitance Dynamic power consumption depends linearly on the switched physical capacitance. Therefore, besides reducing the supply voltage, a reduction of the capacitance can be used to reduce power dissipation. The physical capacitance can be reduced by using less logic, smaller devices and shorter wires. On the other hand, as already mentioned for voltage scaling, it is not possible to minimize one parameter without influencing others; for example, reducing the device size will reduce the current drive of the transistors, in turn resulting in a slower operating speed of the circuits. Switching Activity Reducing the switching activity also linearly influences the dynamic power dissipation. A circuit containing a large amount of physical capacitance will dissipate no dynamic power when there is no switching. However, calculating the switching activity is not simple, because switching consists of spurious activity and functional activity. In certain circuits like adders and multipliers [52] the spurious activity can dominate. Combining data activity with physical capacitance leads to the switched capacitance, describing the average capacitance charged during each data period. Summary The design space for low power design is mainly influenced by the following parameters: supply voltage, capacitance and switching activity, which are cross-related to each other and also influence the static power dissipation. For an embedded DSP core the design space is even more limited, because aspects like voltage scaling or dual-Vt techniques are system or technology aspects and thus cannot be influenced by the core architecture itself. The DSP core architecture introduced in this thesis supports architectural features [P5] and compiler-related aspects [99] for reducing the switching activity. Implementation aspects to reduce the capacitance are considered by making use of manual full-custom design [P4].


3.2 Architectural Alternatives DSP cores are processors that provide specific features for the efficient implementation of algorithms for digital signal processing, as illustrated in section 2. Each core architecture aims to solve specific problems, whereas an efficient architecture meets the requirements of the executed algorithm. Meeting the requirements can be subsumed in the key features: area consumption, leading to costs; low power dissipation, leading to increased battery life time or higher integration density; and system development costs, which are mainly dominated by software development costs. In this section some architectural alternatives used in current DSP core architectures are briefly introduced. The solution space is multi-dimensional, and the different parameters have a mutually coupled influence upon the space. More details concerning the available design space for DSP core architectures, and a methodology for finding the best solution for a certain application-specific problem, can be found in [P10][P11][13].

3.2.1 Single Issue versus Multi Issue Single-issue architectures issue only one instruction each execution cycle. This concept is well established for microcontroller architectures, for example the ARM microcontrollers. The problem of efficient instruction scheduling is simplified to a linear problem, and programming a single-issue core is straightforward. Control code typically executed on microcontroller architectures is linear code with many dependencies, and therefore executing more than one instruction per execution bundle (instructions executed during the same clock cycle) does not significantly increase the performance of the core architecture. To increase the performance of these core architectures, more complex instructions can be used [20][29].

Figure 25: Architectural Alternatives: Issue Rates for Available DSP Core Architectures.

DSP algorithms are loop-centric algorithms where a significant amount of the execution time is spent in loop iterations. Therefore increasing the performance during the execution of the loop bodies significantly increases the performance of the core architecture. Software pipelining and loop unrolling, as introduced in a later section, allow the execution of several instructions in parallel to increase system performance. In Figure 25 the issue rate of available DSP core architectures over time is illustrated. While most of the core architectures in the 1980s allowed the execution of one instruction per execution cycle, only 10 years later up to 8 instructions could be executed. Core architects have increased the number of instructions executed in parallel to increase the relative performance of their core architectures. There are also other aspects to consider, for example the Instruction Level Parallelism (ILP). The average ILP indicates the average number of instructions executed in parallel. The ILP is limited by the core resources (the issue rate) and by data and control dependencies in the executed algorithm. Due to its issue rate, a single-issue core cannot reach an ILP of more than one. Increasing the possible number of instructions executed in parallel will not increase the average ILP when executing an algorithm primarily based on control code; for loop-centric algorithms the increased parallelism can be used to increase core performance. It is nearly impossible to develop code for a multi-issue DSP core architecture manually considering deep pipelines and the related dependencies; therefore the use of high-level language compilers like a C-compiler is required.
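The limit described above can be stated in one line: the sustained instructions per cycle are bounded both by the issue rate and by the dependency-limited ILP of the code. The ILP figures below are illustrative, in the range the cited studies report for control code:

```python
def sustained_ipc(issue_width, algorithm_ilp):
    """Sustained instructions per cycle: bounded by both the core's
    issue rate and the dependency-limited ILP of the algorithm."""
    return min(issue_width, algorithm_ilp)

print(sustained_ipc(1, 6.0))  # 1 -- a single-issue core caps everything
print(sustained_ipc(8, 2.5))  # 2.5 -- control code on an 8-way core
print(sustained_ipc(8, 6.0))  # 6.0 -- loop-centric DSP kernel
```

Widening the core only pays off for the last case, which is why multi-issue architectures target the loop-centric DSP kernels.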

3.2.2 VLIW versus Superscalar Scalar and superscalar architectures are common for microcontrollers. Scalar processors support the execution of one instruction per cycle, which limits the attainable performance. Superscalar processor architectures overcome this problem by supporting the execution of several instructions in parallel, where the resolving of dependencies in the executed application code is done by hardware circuits. Issue queues [149], scoreboards [149] and highly sophisticated branch prediction circuits [153] take care of making efficient use of the core resources. The programming model is based on dynamic scheduling, i.e. the execution order of instructions is defined during run-time based on dependency analysis [148]. Superscalar architectures thus minimize the average execution time by allowing a change of the program execution order as long as dependencies are considered. The Very Long Instruction Word (VLIW) programming model is based on static scheduling. Dependencies in the application code are already resolved at compile time. The execution order of instructions is not changed during run-time. Changing the execution order

during run-time is not possible due to the lack of hardware circuits for dependency resolution, which are not provided in VLIW architectures. The advantage is a reduced core complexity, which simplifies hardware development. Using caches for VLIW architectures leads to penalty cycles during cache misses, which cannot be used to execute different code sections. One possibility to overcome this limitation is multithreading, with the drawback of increased core complexity. Static scheduling allows minimization of the worst-case execution time, which is required for algorithms with real-time requirements. Developing a C-compiler for VLIW architectures is more complex because of the dependency analysis and the sophisticated instruction scheduling algorithms [45][166]. Most of the latest DSP architectures are based on the VLIW programming model, driven by the real-time requirements of algorithms executed on DSP architectures. To overcome the poor code density of traditional VLIW, enhanced implementations like Variable Length Execution Set (VLES) [36], Configurable Long Instruction Word (CLIW) [157] and scalable Long Instruction Word (xLIW) [P2] are used in existing core architectures.

3.2.3 Deep Pipeline versus Lean Pipeline Pipelines were already introduced in supercomputers in the 1960s; the motivation for pipelining is to increase instruction throughput by an increased usage of the hardware resources. This is achieved by splitting operations into sub-operations and invoking new sub-operations as early as possible. This split into sub-operations allows higher clock frequencies to be reached. In Figure 39 the concept of SW pipelining is illustrated for four operations; the main concept is the same for hardware and software pipelines. Pipeline structures used in DSPs are CISC and RISC pipelines. Direct memory architectures are based on CISC pipelines: besides the fetching, decoding and execution of instructions typical for RISC pipelines, the memory operations for fetching operands from data memory require additional pipeline stages, which are inserted after the decode stage. In Figure 26 the two types of pipelines are illustrated. In RISC pipelines separate instructions are used to fetch data from data memory, and these instructions use the same pipeline structure. In CISC pipelines the memory operations are part of the arithmetic instructions.


Figure 26: Architectural Alternatives: RISC versus CISC Pipeline. Besides the chosen pipeline structure (which is mainly influenced by the general architectural concept), the number of clock cycles used to implement the pipeline is an important performance aspect. Splitting operations into small sub-operations, and thus using several clock cycles to execute one “natural” pipeline stage, increases the computational power of core architectures and leads to super-pipelined architectures. Dependencies between pipeline stages (a more detailed discussion can be found in the Design Space section) can lead to a poor usage of the available hardware resources.

Figure 27: Architectural Alternatives: Pipeline Depth of Available DSP Cores. In the worst case an increased clock frequency can even produce high power dissipation but reduced system performance, due to data and control dependencies. In Figure 27 the pipeline depth of available core architectures is illustrated. To overcome data and control dependencies caused by deep pipeline structures, bypass and branch prediction circuits have been introduced.
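The worst case mentioned above, where a faster clock yields a slower system, can be made concrete with a simple speedup model. The clock gain and stall figures are illustrative assumptions only:

```python
def effective_speedup(clock_gain, stall_cpi):
    """Net speedup of a deeper pipeline over the original design:
    the clock runs clock_gain times faster, but hazards add
    stall_cpi extra stall cycles per instruction on average."""
    return clock_gain / (1.0 + stall_cpi)

# Doubling the pipeline depth ideally doubles the clock ...
print(effective_speedup(2.0, 0.0))             # 2.0
# ... but with 1.2 stall cycles per instruction (deep pipeline,
# branchy control code) the 'faster' core is actually slower:
print(round(effective_speedup(2.0, 1.2), 3))   # 0.909
```

Any result below 1.0 means the deeper pipeline lost more cycles to hazards than it won back in clock frequency, while still dissipating more power.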

3.2.4 Direct Memory versus Load-Store In load-store architectures separate instructions are used to transfer data between data memory and register file. In direct-memory architectures the data transfer is coded inside

the arithmetic instruction. In load-store architectures the register file plays a central role and is therefore traditionally located between data memory and execution units, whereas in direct memory architectures the register file is in parallel to the execution units, where intermediate results can be stored. The difference between load-store and direct memory architectures is illustrated in Figure 28.

Figure 28: Architectural Alternatives: Direct Memory versus Load-Store. For traditional DSP algorithms, e.g. filtering, the direct memory architecture allows an increase in code density by using fewer instruction words: the load/store operations are already included in the coding of the arithmetic instructions. However, the coding space for the data transfer has to be provided inside the instruction word, leading to more complex instruction words, for example the 24-bit instruction word of Carmel [11]. For code sections which cannot make use of the more complex instructions, the code density decreases. The execution of control code with CISC instructions suffers from a poor usage of the binary coding and therefore a decreased code density. The application code requires more instruction words when using less complex instructions, but these provide more flexibility and can be used to increase the code density at the application level.
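The trade-off can be sketched as a footprint comparison. The operation counts and word sizes below are hypothetical (one CISC word covering a MAC plus two loads versus three separate RISC-style instructions) and serve only to make the density argument tangible:

```python
def code_bytes(n_ops, words_per_op, word_bits):
    """Program memory footprint in bytes for n_ops operations."""
    return n_ops * words_per_op * word_bits // 8

# Hypothetical filter kernel of 100 MAC-with-two-loads operations:
print(code_bytes(100, 1, 24))  # 300 -- direct memory: one 24-bit CISC word
print(code_bytes(100, 3, 16))  # 600 -- load-store: MAC + 2 loads, 16-bit each
```

For control code the comparison can invert: when the memory-transfer fields of the CISC word stay unused, the wider word is pure overhead, which is the density loss described above.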

3.2.5 Mode Register versus Instruction Coding Memory dominates the area consumption of embedded DSP subsystems. Code density is a factor mirroring how efficiently a certain algorithm can make use of the provided core resources and the related instruction set architecture and instruction coding. High code density reduces the required program memory and therefore the necessary die area of the DSP subsystem. One possibility to increase code density is the use of a mode register. The mode register allows the meaning of an instruction to be changed by modifying the related mode bit; typical examples [32] are mode bits for addressing modes or saturation modes. The disadvantage of these mode registers shows up during instruction scheduling. In Figure 29 the problem is illustrated: an instruction is not necessarily dependent only on the

instruction word. The meaning of the instruction also depends on the mode set for a certain code section.

Figure 29: Architectural Alternatives: Mode Dependent Limitations during Instruction Scheduling. It is not possible to move an instruction out of a section without considering a change of the mode for the other section (which again requires additional instructions: the new mode has to be set and reset). For multi-issue VLIW DSP core architectures this is hardly possible. Therefore a mode register can help to increase the code density of a small kernel, but the reduced freedom for scheduling instructions can lead to an increased number of execution cycles and even to a decreased code density.
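The scheduling hazard can be illustrated with a tiny model of a mode-dependent instruction: the same ADD opcode either wraps or saturates depending on a mode bit, so moving it across a mode change silently alters the result. The model is a deliberate simplification for illustration, not the encoding of any particular core:

```python
def add(a, b, saturate, bits=16):
    """Model of a mode-dependent instruction: the identical ADD opcode
    wraps or saturates depending on the current mode bit."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    r = a + b
    if saturate:
        return max(lo, min(hi, r))
    return ((r - lo) % (1 << bits)) + lo  # two's-complement wrap-around

# The same instruction gives different results in the two modes, so the
# scheduler may not move it across a mode change:
print(add(30000, 10000, saturate=True))   # 32767
print(add(30000, 10000, saturate=False))  # -25536
```

Coding the saturation behavior into the instruction word instead of a mode register removes this scheduling barrier, at the price of a wider (or second) instruction word.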


3.3 Available DSP Core Architectures This section introduces commercially available DSP cores. Each core architecture is first introduced and its main aspects discussed, such as the available arithmetic units, pipeline structure, supported addressing modes and core-specific features. At the end of each description a short summary briefly assesses the features of the core architecture from an orthogonality point of view, which is a major aspect for developing a C-compiler. This section does not contain a table comparing available DSP cores on metrics like reachable frequency, number of instructions executed in parallel or pipeline depth, in order to prevent superficial comparisons: such tables hide micro-architectural limitations which influence the practical performance, as introduced in section 3.1.2. DSP cores are quite often rated by the number of MAC instructions they support each clock cycle. The first three cores described in this section are the OAKDSP, the Motorola 56000 and the TI C54x, chosen as examples of so-called single-MAC DSP cores (DSP core architectures supporting the execution of one MAC instruction at a time). The ZSP has been chosen as an example of a DSP core based on the superscalar programming model. As examples of dual-MAC architectures, the Infineon Carmel DSP, TI C62x and Blackfin have been chosen. As a last example, the SC140 of StarCore LLC has been chosen as a DSP core architecture supporting the execution of four MAC instructions in parallel. This thesis does not consider vector processor architectures due to their specialized architecture, which can only be used for one class of algorithmic problems.

3.3.1 OAKDSP The OAKDSP [29] core was introduced in the early 1990s by DSP Group (now Ceva [14]) as a successor to the PineDSP core. The OAKDSP core is a single-issue 16-bit DSP core based on a traditional direct memory architecture, where arithmetic instructions fetch their operands from memory and store the results back to memory. The instruction set is based on a native 16-bit instruction word; long immediate values or offsets are stored in an additional instruction word. The pipeline consists of four stages: fetch, decode, operand fetch and execution. The data memory space is split into an asymmetric X and Y memory space. The Computation Unit (CU) as illustrated in Figure 30 contains a sixteen-by-sixteen multiplier (also supporting double precision), an ALU/shifter data path for the implementation of MAC instructions and a separate Bit Manipulation Unit (BMU) containing a barrel shifter, which is the major difference to the PineDSP.


Figure 30: Architectural Overview: OAKDSP Core. Four shadowed accumulator registers, each 36 bits wide including four guard bits, are supported; two of these are assigned to the BMU and two to the CU. Zero overhead loop instructions with four nesting levels are supported. A software stack allows the execution of recursive function calls. The address generation unit supports post-increment/decrement operations and modulo address generation; reverse carry addressing is not supported. Three status registers are available containing flags, status bits, control bits, user I/O and paging bits; most of the fields can be modified by the user. The first status register contains the flags (zero, minus, normalized, overflow, carry ...) which are influenced by the last CU or BMU operation. Most of the registers are shadowed, which allows a task switch with reduced spilling of the core status to data memory. No separate interrupt control unit is available; three different interrupt sources and an NMI are supported.

Figure 31: Architectural Overview: Motorola 56300.

Summary: The OAKDSP core is a single-issue DSP core, which limits its relative performance. The limited address space requires a paging mechanism with the related limitations for instruction scheduling. Compared with modern core architectures, the reduced feature set enables good code density for typical DSP algorithms like filtering. The missing support for the reverse carry address mode limits the possibility of an efficient implementation of FFT algorithms on the OAKDSP core. Status and configuration registers limit instruction scheduling. Flags influenced by the last occurrence of ALU or BMU instructions limit the use of conditionally executed instructions. The missing support of nested interrupts is caused by only one level of shadow registers and limits the usage of interrupt service routines. Although the support of a software stack eases the development of a C-compiler, the architectural features of the OAKDSP are not orthogonal and therefore implementing a powerful C-compiler is questionable.

3.3.2 Motorola 56300 The Motorola 56300 DSP core [22] is a powerful member of the Motorola 56k family [20] introduced in 1995. The 56300 is a single-issue 24-bit load/store architecture where arithmetic instructions fetch their operands from two operand registers X and Y (each 56 bits wide). The native instruction set consists of 24-bit wide instruction words; long offsets or immediate values are stored in an additional instruction word. The instruction format can be split into a parallel and a non-parallel instruction format. The parallel instruction format supports CISC-like instructions: in addition to the operation op-code and operands, operations taking place on the X and Y memory busses and a condition can be coded. The pipeline consists of seven stages: prefetch I+II, decode, address generation I+II and execute I+II, all of which are hidden from the programmer. The memory space is split into X and Y memory. The computation unit as illustrated in Figure 31 contains a 24 by 24-bit multiplier, an accumulator including a shifter and a separate bit-field unit including a barrel shifter. The register file consists of six accumulator registers Ax and Bx, each 56 bits wide. The supported data types are 24-bit based (also including byte support); in addition, 8 guard bits support higher precision calculation. Zero-overhead loops are supported and can be nested. The nesting level is not limited because the loop handling registers are spilled to the software stack (limited only by the available data space). The 56300 supports register direct and indirect address modes (including pre- and post-operations) and specialized address modes used for efficient implementation of traditional DSP algorithms, namely reverse carry addressing, which allows efficient implementation of FFT algorithms, and modulo addressing. The size of the modulo buffer stored in the modulo

register is configurable, whereas the start address has to be aligned. The address register file is banked, which means half of the registers are assigned to each of the two Address Generation Units (AGU). The actual core status is stored as flag information like carry, zero and overflow in a status register. Mode registers are available for choosing the saturation and rounding modes, and an operation mode register is available for determining the status of the core (e.g. stack overflow). Four interrupts and one NMI are supported. Although the 56300 is a 24-bit DSP processor, a compatibility mode supporting the 16-bit data format is available. The unused bits are cleared or sign-extended, depending on their position. Summary: The 56300 from Motorola supports a 24-bit datapath and therefore is well suited for audio algorithms. The seven-stage pipeline allows higher clock frequencies to be reached, but dependencies in the application code can lead to limited usage of the clock cycles. Configuration and mode registers limit instruction scheduling. Even if the parallel instruction format allows an increase in performance of the DSP core architecture, it is still a single-issue core. Control code sections in particular will suffer from poor code density due to the 24-bit native instruction word size. The supported address modes allow efficient implementation of traditional DSP algorithms including the FFT. The use of the address registers is limited by the banked implementation of the address register file.
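Reverse carry addressing essentially makes the AGU step through bit-reversed indices. A small Python sketch (illustrative only; the helper is my own, not Motorola syntax) shows the access order it produces for an 8-point FFT:

```python
def bit_reverse(index, bits):
    """Mirror the low 'bits' bits of an index, e.g. 001 -> 100."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)  # shift reversed result left,
        index >>= 1                           # consume index LSB-first
    return result

# Address sequence produced for an 8-point FFT (3 address bits):
order = [bit_reverse(i, 3) for i in range(8)]
print(order)  # [0, 4, 2, 6, 1, 5, 3, 7]
```

Generating this permutation in the AGU, as a side effect of the address update, is what makes the FFT reordering stage free in terms of cycles.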

3.3.3 Texas Instruments 54x Texas Instruments introduced two major DSP families, namely the embedded core family TI C5xx and the high performance stand-alone DSP family TI C6xx, as outlined in [39]. The TI C54x [33] has been chosen as an example, as illustrated in Figure 32. Several members of the core family are available, e.g. [34], which provide different features, but the TI C54x has been chosen because it is still the most referenced DSP core. Berkeley Design Technology Inc. (BDTI) normalizes the performance figures of analyzed DSP cores to those of the TI C54x [9]. The TI C54x is based on a direct-memory architecture and supports three data busses and an independent program memory bus, each of which is 16 bits wide. Arithmetic instructions include the operand fetch from data memory. The native instruction word is 16 bits wide. Several instructions require a second instruction word (e.g. branch instructions). The second instruction word has to be fetched sequentially and therefore some pipeline cycles remain unused. Conditional execution is supported, which reduces the number of branch delays. The pipeline consists of six stages: pre-fetch, fetch, decode, access, read and execute. Executing branch instructions results in three branch delays: the first delay is caused by executing the two-instruction-word branch instruction itself, whereas the next

two execution cycles have to be flushed. To overcome this problem the TI C54x supports delayed branching, which allows the branch delays to be used by unconditionally executed instructions. The Central Processing Unit (CPU) contains a 17 by 17-bit multiplier which supports double precision arithmetic, an adder and a barrel shifter.

Figure 32: Architectural Overview: TI C54x.

The DSP core architecture features two accumulator registers, each 40 bits wide and including 8 guard bits for internal high-precision arithmetic. Zero-overhead hardware loop instructions are supported. The TI C54x supports several addressing modes including absolute and indirect addressing. The specialized address modes reverse carry and modulo addressing are also available; therefore an efficient implementation of FFT algorithms is possible. Flags indicating the status of the core architecture are stored in core status registers. To increase code density, some of the functions like the saturation of multiplication results are controlled by mode registers. The TI C54x supports several hardware and software interrupts including prioritization of the different interrupt sources. Summary: The core architecture of the Texas Instruments TI C54x is well suited for efficient implementation of traditional DSP algorithms. Therefore it is quite often used as a reference concerning code density and power dissipation. The reachable performance is limited by the single-issue execution logic, by the small-sized program memory port, which leads to stalls when executing instructions consisting of two instruction words, and by the missing register file, which requires fetching the operands from data memory for each arithmetic instruction. Only a few functions are stored in mode and status registers, therefore instruction scheduling is less limited. Dynamically generated flags are stored in status registers, which limits instruction scheduling when using predicated execution (i.e. no instructions are allowed to be scheduled between generating the condition and the conditionally executed instruction). The support of delayed branching reduces the drawback of branch delays.
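The modulo (circular-buffer) addressing used by these cores can be sketched in Python (an illustrative model, not taken from the thesis; the function name is my own):

```python
def modulo_update(addr, step, base, size):
    """Post-modify an address register inside a circular buffer.

    The buffer occupies [base, base + size); 'base' is assumed aligned,
    matching the aligned-start restriction mentioned for several cores.
    """
    return base + (addr - base + step) % size

# A 4-entry FIR delay line: the pointer wraps instead of running
# past the end of the buffer.
addr, trace = 0x100, []
for _ in range(6):
    trace.append(addr)
    addr = modulo_update(addr, 1, 0x100, 4)
print(trace)  # [256, 257, 258, 259, 256, 257]
```

In hardware the wrap happens inside the AGU during the post-increment, so a filter delay line costs no extra pointer-management instructions in the loop body.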


3.3.4 ZSP 400 The ZSP 400 DSP core [42] architecture illustrated in Figure 33 is a member of the ZSP DSP family of LSI Logic. Unlike the other core architectures introduced in this section, the ZSP uses a superscalar programming model. The core is based on a RISC load-store architecture where arithmetic instructions get their operands from the register file. Separate instructions are used to fetch and store data from and to data memory.

Figure 33: Architectural Overview: ZSP400.

The instruction set is based on a native 16-bit instruction word. The core supports the execution of up to four instructions each execution cycle, with some limitations concerning the grouping of instructions. The ZSP is therefore a multi-issue DSP core where dependencies are resolved at run-time (dynamic scheduling). The pipeline of the ZSP 400 contains five stages: fetch/decode, group, read, execute and write-back. The data memory bandwidth is four words; a cache is located between memory and core, and the communication is established via data links. No alignment restrictions have to be considered during data memory accesses. The Computation Unit consists of two MAC units and two ALU paths, each 16 bits wide. It is possible to combine them into a single 32-bit ALU path. The register file is built up of sixteen 16-bit wide general purpose registers. Two 16-bit registers can be addressed as one 32-bit register. Two of the 32-bit registers contain an additional 8 guard bits used for internal higher precision calculation. Eight of the 16-bit registers are shadowed and switching between the two sets is done with a configuration bit. The ZSP 400 supports two circular buffers; no explicit address registers are provided. The first 13 registers can be used for reverse carry addressing, which enables efficient implementation of FFT algorithms. The load/store instructions support auto-increment and offset address calculation, as usual for state-of-the-art DSP architectures.

Mode registers are supported to enable saturation and rounding modes. Several other core functions are controlled by additional configuration registers. Status registers contain core status information like hardware flags indicating, for example, overflow, zero or pending interrupts. Summary: The ZSP 400 is an example of the ZSP family from LSI Logic. The core is based on a superscalar programming model. This implies dynamic instruction scheduling at runtime and the intensive usage of cache structures, which limits the possibility of minimizing the worst-case execution time. The advantage of a unified register file for instruction scheduling is counteracted by a large number of restrictions and non-orthogonal architectural features: for example, only a few registers are shadowed, and while some registers can be used for any addressing mode, others do not support all of them. To increase code density, several typical DSP functions like saturation and rounding modes are shifted to mode registers, with limitations for instruction scheduling. The ZSP architecture is well suited for implementing a C-compiler because the superscalar architecture eases compiler development, but the restrictions and non-orthogonal architectural features limit the possibility of an optimizing compiler. Comparing the ZSP with traditional DSP core architectures gives the indication that the ZSP is not a typical DSP core. It is more a microcontroller with some features used in DSP core architectures (like address modes, MAC units and circular buffers).

3.3.5 Carmel The Carmel DSP core [10][11][12] was introduced in the mid-1990s by Infineon Technologies, the former Siemens semiconductor group. Carmel is a 16-bit fixed-point direct-memory architecture where arithmetic instructions fetch their operands from memory locations. This is reflected in the 8-stage pipeline: program address, program read, decode I+II, data read address, operand fetch, execution and write address, data write. The native instruction word size is 24 bits and instructions are built of up to two instruction words. The instruction coding is optimized for code density, which requires two pipeline stages for instruction decoding. Carmel is based on the VLIW programming model. The implementation of the program memory port is patented as Configurable Long Instruction Word (CLIW) [157]. The regular program memory port is 48 bits wide. An extended memory port of 96 bits allows the fetching of up to 144-bit instruction words. The CLIW memory contains parameterized instruction combinations. Some of the instructions are only supported as part of CLIW instructions.


Figure 34: Architectural Overview: Carmel.

Carmel supports the fetching of data from up to four independent memory locations. The data memory is split into A1, A2, B1 and B2 memory blocks. The execution unit as illustrated in Figure 34 contains two data paths, each of which contains a MAC unit and an ALU. The left path additionally supports a shifter and an exponent unit. The results can be stored into an intermediate register file of six 40-bit wide accumulator registers or directly to data memory via two 16-bit wide write ports. Zero-overhead hardware loops are supported with a nesting level of four. Similar to the OAKDSP, a fast context switch is enabled by two secondary accumulator registers. Carmel supports a 16-bit address space with the traditional addressing modes found in DSP cores. Efficient implementation of FFT algorithms is enabled by the support of a bit-reverse addressing mode. The first type of modulo addressing scheme is supported with aligned boundary addresses; a second modulo addressing mode prevents memory fragmentation by supporting non-aligned boundary addresses. Configuration registers are used to choose the rounding mode, to activate saturation and to enable the fractional data format. Conditional execution is supported for most of the instructions and can utilize two condition registers. Summary: The Carmel DSP core is a 16-bit embedded DSP core created by Infineon Technologies in cooperation with the Israeli company ICCOM. The traditional direct-memory architecture favors CISC instructions. To increase code density and execution speed for traditional DSP algorithms like filtering and FFT, any orthogonality aspects have

been ignored. The extended program memory port, with the restriction that some instructions can only be used with this port, limits the development of an optimizing C-compiler. The two pipeline stages necessary for decoding increase the number of branch delays. Configuration registers for major DSP functions like saturation and rounding modes limit instruction scheduling. Considering Carmel as a high performance DSP core, the limitation of a 16-bit address space is crucial and requires a paging mechanism with its related drawbacks. The conditional execution supported by Carmel limits instruction scheduling by supporting only two registers for storing conditions. In 2002 Carmel was sold to StarCore LLC and in the same year Carmel was discontinued. Carmel is an example of typical DSP core developments of the mid-1990s and one of the last examples of direct-memory architectures. Carmel's BDTI benchmarks are still leading for traditional DSP algorithms and especially for FFT algorithms.

3.3.6 Texas Instruments 62xx The Texas Instruments C62x [36][37] is a member of the Texas Instruments C6xx high performance DSP core family. The TI C62x is a 16-bit fixed-point multi-issue DSP core based on a RISC load-store architecture. Operands for the arithmetic instructions are fetched from the register file, and separate instructions are supported to move data between the register file and data memory. The instruction set is based on a 32-bit native instruction word. Up to eight instructions can be executed each clock cycle. The programming model is based on VLIW. To overcome the drawback of poor code density, Texas Instruments introduced the Variable Length Execution Set (VLES), which allows decoupling of the fetch and execution bundles. Two independent data busses as illustrated in Figure 35 connect the register file with data memory. The execution unit contains eight different units: two for data exchange with data memory, two multipliers, two ALU units and two shift and bit manipulation units. Each of the units contains many different, overlapping features, as explained in [36], which allows the shifting of functions from one unit to another. There is no explicit support of Multiply and Accumulate (MAC), which requires two instructions: one for the multiplication and one for the accumulation. The register file contains sixteen 32-bit wide general purpose registers. It is possible to store 40-bit values in two consecutive 32-bit registers. The pipeline consists of three phases, fetch, decode and execute, which are split over 11 clock cycles; the pipeline of the TI C62x can therefore be called super-pipelined. The TI C62x provides full predicated execution, which means that each instruction can be executed conditionally. Six registers of the register file can be used to build the condition.


Figure 35: Architectural Overview: TI C6xx.

Summary: The Texas Instruments TI C62x, as an example of the C6xx family, is a high performance processor architecture. A large register file and up to eight instructions executed in parallel provide an impressive peak performance. The deep pipeline structure allows high clock frequencies to be reached, but long define-use and load-use dependencies can result in poor usage of the available resources (details in the Design Space section). Some of the typical characteristics of DSP architectures like 40-bit accumulators or MAC units are not available. These specialized functions are emulated, for example by using two 32-bit registers to implement an accumulator or by combining a multiplication with an accumulation to realize a MAC instruction. Predicated execution is a powerful feature for implementing control code by reducing the number of branch instructions with their unusable branch delays. However, the limitation to a few of the available registers for building the condition restricts the use of this feature. An additional limitation for instruction scheduling is the banked register file with one path for transferring data between the two banks. An important drawback not to be overlooked is the poor code density. It is not feasible to use the core as an embedded core, and most applications use the TI C62x as a stand-alone device with external memory. The described DSP core is well suited for the development of an optimizing C-compiler.
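The effect of predicated execution can be sketched in Python (a hypothetical illustration of if-conversion; the select helper is my own, not a TI instruction):

```python
def select(pred, a, b):
    """Branch-free select: commit a when pred holds, otherwise b.

    Both 'sides' are already computed; the predicate only picks the
    result, which is what predicated execution does in hardware instead
    of branching around one of the instructions.
    """
    mask = -int(bool(pred))          # all ones when pred holds, else zero
    return (a & mask) | (b & ~mask)

# If-conversion of:  out = 8 if s > 8 else s  (a clipping loop)
samples = [3, -7, 12, -1]
clipped = [select(s > 8, 8, s) for s in samples]
print(clipped)  # [3, -7, 8, -1]
```

Because no branch is taken, the loop body has a fixed length and can be scheduled into a software pipeline without flushing branch delay slots.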

3.3.7 Starcore SC140 The Starcore SC140 [32] is the high performance DSP core of the Starcore DSP family. StarCore LLC was founded by Motorola and Agere, the former semiconductor group of Lucent Technologies. Infineon Technologies joined this cooperation just two years ago. The SC140 is a multi-issue high performance 16-bit fixed-point DSP core based on a RISC,

nearly 'pure' load-store architecture: most of the instructions get their operands from the register file, however a few get their operands directly from memory. The instruction set is based on a 16-bit wide native instruction word, where instructions are 16, 32 or 48 bits wide. Up to six instructions can be grouped into a Variable Length Execution Set (VLES); some limitations exist concerning the grouping. The SC140 features a five-stage pipeline: pre-fetch, fetch, dispatch, address generation and execution. Two independent data busses connect the core with data memory. The memory addresses are interleaved to reduce the possibility of an address conflict and the related stall cycles. The CU of the SC140 as illustrated in Figure 36 consists of four independent data paths, each of which supports the execution of a MAC instruction or an ALU operation including shifting. During the execution of filter algorithms the four available MAC units provide significant peak performance.

Figure 36: Architectural Overview: Starcore SC140.

The register file consists of sixteen 40-bit entries. The accumulator registers contain 8 guard bits for internal higher precision calculation. Zero-overhead looping is supported up to a nesting level of four. Two independent address generation units support the address modes available on state-of-the-art DSP cores, and due to the reverse carry support an efficient implementation of FFT algorithms is possible. The modulo addressing support allows the addressing of modulo buffers of any size starting at any position, which prevents fragmented data memory. Mode registers for coding special addressing, rounding and saturation modes are available. Only one flag (T) is available for dynamic evaluation of the core status used for conditional or predicated execution. The SC140 supports a separate interrupt control unit (ICU) with a feature set similarly powerful as those available in microcontroller architectures [15].

Summary: The SC140 is the powerful DSP core architecture of the Starcore DSP family, where the support of four independent MAC units increases peak performance during the execution of traditional DSP algorithms like filtering. The support of only one flag significantly limits the use of predicated execution for improving the execution of control code on the DSP architecture. The grouping mechanism used to identify the VLES decreases code density; benchmark results illustrate the poor code density [28]. The register file used for address generation is not fully orthogonal, and making use of some address modes limits the use of all registers. Mode registers restrict instruction scheduling. Some specialized instructions are limited to certain source and destination registers, which limits register allocation and instruction scheduling.

Figure 37: Architectural Overview: Blackfin.

3.3.8 Blackfin DSP The Blackfin DSP core [1][2][8] was co-developed by Intel and Analog Devices. Blackfin is a high-performance 16-bit fixed-point DSP core based on a RISC load-store architecture. Instructions are available for transferring data between the register file and data memory, whereas arithmetic instructions receive their operands from the register file. The register file consists of eight 32-bit wide entries, each of which can be addressed as two 16-bit entries. Two of the eight 32-bit register entries are extended by eight guard bits each and are used as accumulator registers for internal higher precision calculation. The instruction set is based on 16-bit wide native instruction words and instructions are 16, 32 and 64 bits wide. The fetch bundle contains 64 bits. Blackfin features an eight-stage pipeline: fetch I+II, decode, address generation, execute I+II+III and write back. Nested loops with a nesting level of two are supported.

The execution unit of Blackfin as introduced in Figure 37 contains two 16 by 16-bit multipliers, two 40-bit wide ALU datapaths and one shifter unit. Typical DSP address modes are available, including circular buffers (with no restrictions on the start address and the buffer size) and reverse carry addressing, which enables efficient implementation of FFT algorithms. Status registers are available which mirror the core status, including hardware flags and also configuration details, for example the rounding mode. Summary: Blackfin (ADSP-21535) is a high performance 16-bit DSP processor developed by Analog Devices and Intel. The register file contains only two accumulator registers and the remaining registers are 32 bits wide, which limits instruction scheduling. The core description emphasizes the topic of cache architectures (L1 and L2, provided for both data and program). The main problem of a cache architecture is the unpredictability of cache hit and cache miss events, which limits the possibility of reducing the worst-case execution time for real-time critical algorithms.

3.4 xDSPcore xDSPcore is a fixed-point embedded DSP core architecture based on a modified Dual-Harvard load-store architecture. A brief overview of the core architecture can be found in Figure 38. The bit-width of the datapath is parameterized; the first implementation has a 16-bit datapath. The operands for the arithmetic instructions are fetched from register files and the results are stored in the register files. Two independent data memory ports are used to transfer data values between data memory and register file. The native instruction word size is also parameterized. The first implementation uses a 20-bit wide instruction word, which allows the coding of all 3-operand arithmetic instructions within one instruction word. A parallel word is used to store long immediate or offset values, but a rich set of short addressing modes enables high code density. The chosen programming model is VLIW, and to overcome the code density drawback a scalable Long Instruction Word (xLIW) is introduced [P2][P3][P4]. For the core a RISC pipeline with three phases is chosen, namely instruction fetch, decode and execute. The number of clock cycles used to implement this structure can be parameterized. The first implementation contains a five-stage pipeline: fetch, align, decode and execute I+II.


Figure 38: Architectural Overview: xDSPcore.

The register file is split into three parts. The data register file contains eight accumulator registers, each of which is 40 bits wide. The accumulator without guard bits can be addressed as a 32-bit long register, which itself can be accessed as two 16-bit data registers. The number of entries can be scaled. The second part of the register file contains eight address registers and related modifier registers, which are used for the bit-reversal and modulo addressing schemes. The third register file, called branch file, contains the flags and reflects the core status used for conditional branch instructions and predicated execution. The register files are orthogonal and no register is assigned to specific functions. Zero-overhead loop instructions are supported with a scalable nesting level; the first implementation supports 4 nesting levels. Further nesting levels require spilling of the loop counter and loop addresses. xDSPcore supports the addressing modes usually provided by state-of-the-art DSP cores. Pre- and post-operations are supported without additional clock cycles. Bit-reversal addressing allows efficient implementation of FFT algorithms. The size of the modulo buffer is programmable and the start address has to be aligned. The address registers are structured orthogonally, which means each can be used for both AGUs. No configuration or mode registers are needed because all functions are coded inside the instruction word. Core status flags like the zero or sign flags are assigned to the destination registers. The flags are used for predicated execution, which reduces branch instructions in control code sections (if-then-else) without limitations and restrictions for instruction scheduling and register allocation. In the publication part of the thesis the main architectural features are introduced in detail.


4 High Level Language Compiler Issues This section covers the compiler aspects considered during the definition of xDSPcore, starting with an introduction of the coding practices used for implementing algorithms on DSP core architectures. The second part gives a brief overview of the structure of high-level language compilers, followed by a discussion of the architectural requirements for implementing an optimizing compiler. The second part ends with a short summary of why xDSPcore can be called a compiler-friendly architecture.

4.1 Coding Practices in DSPs Traditional DSP algorithms like filtering are loop-centric, which means that 90% of the execution cycles are spent in code sections consuming less than 10% of the application code. Increasing the usage of core resources in loop constructs therefore leads to a significantly decreased number of required execution cycles. This section introduces coding practices in Digital Signal Processors for increasing ILP in VLIW architectures, which leads to a better performance during the execution of loop constructs. The first part covers software pipelining, which reduces the number of execution cycles and increases the usage of core resources. The limitations and restrictions of software pipelining are investigated, as can be reviewed in [78]. The second part introduces loop unrolling, which is often used in combination with software pipelining. In contrast to software pipelining, loop unrolling is used to increase the work carried out inside the loop kernel. At the end of the section the specific implementation of predicated execution for xDSPcore is introduced, which can also be used for increasing the performance of loops. In [23][46][49][71][123][136] and [140] aspects of implementing C-code for efficient code generation are mentioned, which indicates the importance of high-level language programming of Digital Signal Processors.
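Loop unrolling, mentioned above, can be illustrated with a small Python sketch (illustrative only; on a real VLIW DSP the benefit appears in the machine code, where the two independent partial sums can be issued on two parallel MAC units):

```python
def dot_rolled(x, h):
    """Straightforward dot product: one MAC per loop iteration."""
    acc = 0
    for i in range(len(x)):
        acc += x[i] * h[i]
    return acc

def dot_unrolled2(x, h):
    """Unrolled by two: two independent partial sums per iteration,
    so a two-MAC datapath could issue both multiplies in parallel."""
    acc0 = acc1 = 0
    for i in range(0, len(x) - 1, 2):
        acc0 += x[i] * h[i]
        acc1 += x[i + 1] * h[i + 1]
    if len(x) % 2:                    # remainder iteration for odd lengths
        acc0 += x[-1] * h[-1]
    return acc0 + acc1

x, h = [1, 2, 3, 4, 5], [5, 4, 3, 2, 1]
print(dot_rolled(x, h), dot_unrolled2(x, h))  # 35 35
```

Unrolling halves the loop overhead (counter update, branch) per useful operation, at the cost of a larger loop body and the remainder handling shown above.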

4.1.1 Software Pipelining Software pipelining tries to invoke the next loop iteration as early as possible, resulting in overlapped execution. The example in Figure 39 illustrates the functionality of software pipelining on a loop body with four instructions A, B, C and D. In the first row A is executed for the first time (A1). In the second row, when B is executed for the first time (B1), the next loop iteration is initiated in parallel (A2).
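This overlapped-execution principle can be simulated with a short Python sketch (an illustrative model, not part of the thesis; names are my own):

```python
def software_pipeline(body, iterations):
    """Rows of overlapped execution: iteration k starts one row after
    iteration k-1, so row r executes stage body[j] of iteration r - j
    whenever that iteration exists."""
    depth = len(body)
    rows = []
    for r in range(iterations + depth - 1):
        rows.append([f"{body[j]}{r - j + 1}"          # e.g. 'A1', 'B3'
                     for j in range(depth - 1, -1, -1)
                     if 0 <= r - j < iterations])
    return rows

rows = software_pipeline(["A", "B", "C", "D"], 6)
print(rows[0])   # ['A1']                     -- prolog: pipeline filling
print(rows[3])   # ['D1', 'C2', 'B3', 'A4']  -- kernel: all stages busy
print(rows[-1])  # ['D6']                     -- epilog: pipeline draining
```

Row four of the simulation reproduces the fully overlapped bundle (D1, C2, B3, A4) described for Figure 39.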


Figure 39: Principle of Software Pipelining.

In row four of the example in Figure 39, four instructions are executed in parallel (D1, C2, B3 and A4) and the usage of resources reaches its maximum. Row four is called the kernel; the rows before it, which fill the pipeline, form the prolog, and the execution cycles which flush the pipeline form the epilog. Software pipelines face similar limitations as hardware pipelines, e.g. data dependencies between instructions of the loop. Some of these limitations are considered in the next subsections.

Trip Count
The trip count is equal to the number of loop iterations; when the loop iteration counter reaches the trip count, the loop is terminated. A minimum number of loop iterations is required to fill the software pipeline, and software pipelining does not increase system performance when the trip count is below this minimum. For the example in Figure 39 the pipeline depth is four, which requires a trip count of at least four to make use of the advantages of software pipelining.

Minimum Initiation Interval
The Minimum Initiation Interval (MII) is equal to the minimum number of execution bundles building up a software pipelined loop kernel. The MII is restricted by data dependencies, which are introduced later as the live too long problem, and by the number of available architectural resources.

Modulo Iteration Interval Scheduling
Modulo iteration interval scheduling provides a methodology to keep track of resources that are a modulo iteration interval away from each other. For example, for a two-cycle loop, instructions scheduled in cycle n cannot use the same resources as instructions scheduled in cycles n+2, n+4, ... . The xDSPcore architecture supports the execution of five instructions in parallel: two load/store, two arithmetic and one program flow instruction. In Figure 40 the Data Flow Graph (DFG) of a small kernel is illustrated. The sum of two values first loaded from data

memory is calculated and the result is then stored to data memory again. The memory addresses are auto-incremented.

Figure 40: Data Flow Graph of an Example Issuing the Summation of two Data Values.

The relative dependency between the load instructions and the add instruction is two, which requires one cycle of distance between fetching the data and issuing the add operation. The dependency between the ADD and the store operation is zero, therefore both instructions can be issued in the same cycle.

Cycle   MOV1          MOV2          CMP1          CMP2   BR
  0     LD (R0)+,D0   LD (R1)+,D1
  1
  2                                 ADD D0,D1,D4
  3     ST D4,(R2)+
  4
  5

Table 1: Principle of a Resource Allocation Table.

In this example three load/store instructions are executed, which requires a minimum kernel length of two execution bundles. A resource allocation table as used in Table 1 is useful for manually performing software pipelining. The two load instructions are scheduled into cycle 0. The data dependency leaves the second cycle unused. The third cycle (cycle 2) is used to invoke the add operation. As mentioned above, it is possible to invoke the store operation

54 in the same cycle as the ADD operation. To prevent resource conflicts during software pipelining the store operation is shifted to the fourth cycle (cycle 3) which has no influence on the overall cycle count already limited by an MII of two caused by the required three load/store instructions. In Table 2 software pipelining is manually introduced for that example. The first column is copied into the second column and the second into the third. Cycle Number
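The conflict check behind such a resource allocation table can be sketched as a modulo reservation table: a slot occupied in cycle n is also reserved in every cycle congruent to n modulo the initiation interval. The helper below is illustrative, not thesis code; the schedule mirrors Table 1 with an initiation interval of two.

```python
def modulo_schedule(ops, ii, units):
    """Place (cycle, resource) operations into a modulo reservation
    table of length ii; raise if a resource class is oversubscribed
    in any modulo cycle."""
    table = {(c, r): 0 for c in range(ii) for r in units}
    for cycle, resource in ops:
        slot = (cycle % ii, resource)
        table[slot] += 1
        if table[slot] > units[resource]:
            raise ValueError(f"conflict on {resource} in modulo cycle {cycle % ii}")
    return table

units = {"load_store": 2, "arith": 2}
# Table 1 placement: loads in cycle 0, add in cycle 2, store in cycle 3.
ops = [(0, "load_store"), (0, "load_store"), (2, "arith"), (3, "load_store")]
table = modulo_schedule(ops, ii=2, units=units)

# Placing the store in cycle 2 instead (the same bundle as the ADD) would
# put three load/store operations into modulo cycle 0 and raise a conflict,
# which is why the store is shifted to cycle 3 in the text:
bad = [(0, "load_store"), (0, "load_store"), (2, "arith"), (2, "load_store")]
try:
    modulo_schedule(bad, ii=2, units=units)
except ValueError:
    print("conflict detected")
```

This reproduces the reasoning of the text: with the store in cycle 3, modulo cycle 0 holds exactly two load/store operations and modulo cycle 1 holds one, so the two-cycle kernel is legal.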

Cycle | MOV1         | MOV2         | CMP1          | CMP2 | BR
------+--------------+--------------+---------------+------+----
  0   | LD (R0)+, D0 | LD (R1)+, D1 |               |      |
  1   |              |              |               |      |
  2   | LD (R0)+, D0 | LD (R1)+, D1 | ADD D0,D1,D4  |      |
  3   | ST D4, (R2)+ |              |               |      |
  4   |              |              | ADD D0,D1,D4  |      |
  5   | ST D4, (R2)+ |              |               |      |

Table 2: Resource Allocation Table including Software Pipeline Technology for increased Usage of Core Resources.

In Table 2 it becomes apparent why moving the store operation into the fourth cycle (cycle 3), done to prevent resource conflicts, has no influence on the overall performance. Cycles 2 and 3 of Table 2 (grey shaded in the original table) form the kernel as illustrated in Figure 41. Cycles 0 and 1 form the prolog; instead of using a NOP instruction in the free execution cycle, the loop instantiation can be scheduled there, which increases code density. Cycles 4 and 5 form the epilog of the software-pipelined loop.

Prolog:
  LD (R0)+, D0 || LD (R1)+, D1
  BKREP N-1, epilog
Kernel:
  LD (R0)+, D0 || LD (R1)+, D1 || ADD D0, D1, D4
  ST D4, (R2)+
Epilog:
  ADD D0, D1, D4 || ST D4, (R2)+

Figure 41: Example for Assembler Code Implementation including Software Pipelining (xDSPcore).

Live Too Long Problem

An additional limitation is the live-too-long problem. Assume, for example, a loop kernel consisting of two execution cycles. A register entry cannot be used for more than two cycles, because the next loop iteration would overwrite the value before it has been used. The two aspects influencing the live-too-long problem are the loop carry path (LCP) and the split join path (SJP). To illustrate the related limitations, a different code example from the one in the previous subsection is necessary. Figure 42 introduces the implementation of a search function, where the maximum value of a vector has to be found.

Figure 42: Data Flow Graph for Maximum Search Example.

Loop Carry Path

A loop carry path is caused by an instruction writing a result whose value is used in the next loop iteration. In the example in Figure 42 the LCP runs between the CMP function, which compares the current maximum value with the newly loaded value, and the conditionally executed move register function, which updates the latest maximum entry. The LCP is equal to two and restricts the MII to a value of two.

Split Join Path

If the same value is used by more than one instruction, it has to stay valid until all of these instructions have been executed. The longest such path determines the minimum MII that still guarantees correct semantics. In the example in Figure 42 the SJP runs between the load instruction and the conditional move register instruction and is equal to three. Therefore the minimum number of execution cycles building up a valid loop kernel is three; here the SJP dominates the MII. In some code sections it is possible to reduce the LCP by moving the instruction overwriting the value to a later execution cycle. Adding register move instructions can be used to split the SJP into smaller parts and thereby reduce the MII.
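How the LCP and SJP bound the MII together with the resource constraint can be sketched numerically. The LCP of two and SJP of three are taken from the maximum-search discussion above; the per-iteration resource counts (one load, two arithmetic operations) are an assumption made here for illustration, since the text does not enumerate them, and the helper is not thesis code.

```python
from math import ceil

def min_initiation_interval(uses, units, lcp, sjp):
    """MII is the maximum of the resource bound, the loop carry
    path, and the split join path."""
    res_bound = max(ceil(uses[r] / units[r]) for r in uses)
    return max(res_bound, lcp, sjp)

# Maximum-search example (assumed resource mix: one load, CMP plus
# conditional move on the arithmetic units): the SJP of three dominates
# both the LCP of two and the resource bound of one.
mii = min_initiation_interval({"load_store": 1, "arith": 2},
                              {"load_store": 2, "arith": 2},
                              lcp=2, sjp=3)
print(mii)  # 3
```

Splitting the SJP with an extra register move, as suggested above, would lower the sjp argument and can bring the MII back down toward the LCP bound.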

4.1.2 Compiler Support

The C-compiler for xDSPcore supports software pipelining. In contrast to commercially available C-compilers, the use of compiler-known functions and intrinsics is not required. Figure 43 illustrates a small C-code example, calculating 16 dot products and summing up the results in an accumulator register.

for (j=0; j
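The effect of the transformation a software-pipelining compiler performs can be checked with a small simulation. The sketch below mimics the prolog/kernel/epilog structure of the Figure 41 example for an element-wise sum of two vectors; it is an illustrative model of the overlapped schedule, not xDSPcore compiler output.

```python
def pipelined_sum(a, b):
    """Element-wise sum following the prolog/kernel/epilog structure of
    Figure 41: the loads of iteration i+1 overlap the add and store of
    iteration i."""
    n = len(a)
    out = []
    # Prolog: the first pair of loads fills the pipeline.
    d0, d1 = a[0], b[0]
    # Kernel: executed N-1 times; the next loads issue in parallel
    # with the add of the current iteration.
    for i in range(1, n):
        nxt0, nxt1 = a[i], b[i]   # LD (R0)+, D0 || LD (R1)+, D1
        d4 = d0 + d1              # ADD D0, D1, D4
        out.append(d4)            # ST D4, (R2)+
        d0, d1 = nxt0, nxt1
    # Epilog: drain the last add and store.
    out.append(d0 + d1)
    return out

a, b = [1, 2, 3, 4], [10, 20, 30, 40]
print(pipelined_sum(a, b))  # [11, 22, 33, 44]
```

The simulation produces the same result as a plain sequential loop, which is exactly the property the compiler must preserve when it overlaps iterations.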
