Mcost (S))
79
Function Assessment same solution Improvement Deterioration Improvement Improvement need analysis Deterioration need analysis Deterioration
The two objective functions are Memory Cost (Mcost) and Memory Cycles (Mcyc). Memory cost is computed using equation (4.1); memory cycles are obtained using the same greedy backtracking heuristic described in Section 3.5 for the data layout. (Energy consumption is not considered in the optimization here; we defer this to Chapter 6.) The initial temperature is computed by randomizing the solution space initially: the mean and the variance of δ (the improvement or deterioration) are computed over the initial iterations of this randomization process, and the temperature is initialized to the standard deviation of δ. To generate a new solution S′ from the current solution S, we use controlled randomization to ensure that S′ is in the neighborhood of S and that it represents a valid solution. For example, we change the number of banks in S′ by adding a randomly generated offset to the number of banks in S, while ensuring that the total does not exceed the maximum number of banks. We also ensure that the memory size is greater than the data size. Let us denote the memory cost and memory cycles associated with S as Mcost(S) and Mcyc(S); similarly, let Mcost(S′) and Mcyc(S′) correspond to solution S′. The new solution is a definite improvement if Mcost(S′) ≤ Mcost(S) and Mcyc(S′) ≤ Mcyc(S), in which case we say S′ dominates S. When the new solution does not dominate the existing solution, there are several possibilities, as illustrated in Table 4.2. We maintain an upper and lower threshold for each objective function; let (Mcyc^lt, Mcyc^ut) be the limits for memory cycles and (Mcost^lt, Mcost^ut) be the limits for memory cost. The change in the overall cost function is computed as follows.
δ = (Mcost(S′) − Mcost(S)) / (Mcost^ut − Mcost^lt) + (Mcyc(S′) − Mcyc(S)) / (Mcyc^ut − Mcyc^lt)        (4.4)
Since our objective is to present to the designer a list of all good solutions, we maintain a list of competitive solutions seen during the course of the optimization. Each of these solutions is assigned a weight. When a new locally good solution is encountered, we compare its weight with that of all the globally competitive solutions seen so far. There is a fixed amount of room in the data structure that stores globally competitive solutions; as a result, we remove a solution from the list if its weight is lower than that of all others, including the new entrant.
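To make the annealing loop concrete, the sketch below shows how the neighbourhood move, the normalised δ of equation (4.4), and the bounded archive of competitive solutions could be coded. It is a minimal illustration under assumed names (MAX_BANKS, the solution dictionary, the archive capacity); m_cost and m_cyc are placeholders for equation (4.1) and the Section 3.5 data-layout heuristic, not the thesis implementation.

    import math
    import random

    MAX_BANKS = 16  # assumed architectural limit, illustrative only

    def m_cost(s):
        # placeholder for the memory cost of equation (4.1)
        raise NotImplementedError

    def m_cyc(s):
        # placeholder for the data-layout heuristic of Section 3.5
        raise NotImplementedError

    def neighbour(s, data_size):
        # Controlled randomization: perturb the bank count of S while keeping
        # the architecture valid (bank limit respected, data still fits).
        s2 = dict(s)
        s2["banks"] = min(MAX_BANKS, max(1, s["banks"] + random.randint(-2, 2)))
        s2["bank_size"] = max(s["bank_size"], math.ceil(data_size / s2["banks"]))
        return s2

    def delta(s, s2, cost_lim, cyc_lim):
        # Normalised change in the combined objective, equation (4.4);
        # cost_lim and cyc_lim are the (lower, upper) thresholds.
        return ((m_cost(s2) - m_cost(s)) / (cost_lim[1] - cost_lim[0]) +
                (m_cyc(s2) - m_cyc(s)) / (cyc_lim[1] - cyc_lim[0]))

    def update_archive(archive, solution, weight, capacity=100):
        # Fixed-size list of competitive solutions: once capacity is
        # exceeded, evict the lowest-weight entry (possibly the new one).
        archive.append((weight, solution))
        if len(archive) > capacity:
            archive.remove(min(archive, key=lambda e: e[0]))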
4.5 Experimental Results

4.5.1 Experimental Methodology
We have implemented the multi-objective Genetic Algorithm (GA), Simulated Annealing (SA) and the heuristic data-layout algorithm on a standard desktop as a framework to perform memory architecture exploration. Some practical implementation constraints are applied to the memory architecture parameters to limit the search space. For example, the memory bank sizes (Bs, Bd, and Es) are restricted to powers of 2, as is done in practical implementations. As before, we have used the Texas Instruments (TI) TMS320C55X and the Texas Instruments Code Composer Studio (CCS) environment for obtaining the profile data and also for validating the data-layout placements. We have used the same set of 4 applications from the multimedia and communications domain used in the previous chapter as benchmarks to evaluate our methodology. The applications are compiled with the C55X processor compiler and assembler. The profile data is obtained by running the compiled executable in a cycle-accurate SW simulator. To obtain the profile data we use a memory configuration of a single large bank of single-access RAM to fit the
application data size. This configuration is selected because it does not resolve any of the parallel or self conflicts, so the conflict matrix can be obtained from this simulated memory configuration. The output profile data contains (a) the frequency of access for all data sections and (b) the conflict matrix. The other input required for our method is the application data section sizes, which were obtained from the C55X linker.
4.5.2 Experimental Results
This section presents our results on memory architecture exploration. We have applied GA and SA for the memory architecture exploration. Both the GA and SA based methods use the same data layout heuristic described in Section 3.5. The reason for trying two different evolutionary schemes for memory architecture exploration is to identify the better approach between GA and SA for the multi-objective problem at hand. A better approach will be able to search the design space uniformly and identify non-dominated points that are globally Pareto-optimal. In this section, we first compare the results from GA and SA, and then describe our observations of the exploration process based on the better approach.

4.5.2.1 Comparison of GA and SA Approaches
The objective is to obtain the set of Pareto-optimal points that minimize either memory cost or memory cycles. For one of the benchmarks, Vocoder, Figure 4.3 plots all the memory architectures explored by GA and SA; each point represents a memory architecture, and the non-dominated (Pareto-optimal) points are also plotted in the same figure. Note that each of the non-dominated points represents the best memory architecture for a given Mcyc or Mcost. In Figure 4.3, the x-axis represents the normalised memory cost as calculated by equation (4.1). We have used Ws = 1, Wd = 3 and We = 0.05 in our experiments. Based on these values and equation (4.1), the important data range for the x-axis (Mcost) is from 0.05 to 3.0. It can be seen that Mcost = 0.05 corresponds to an architecture where all the memory is off-chip, while Mcost = 3.0 corresponds to a memory
Figure 4.3: Comparison of GA and SA Approaches for Memory Exploration
architecture that has only on-chip memory composed of DARAM banks. The y-axis represents the memory stall cycles, i.e., the number of processor cycles spent in data accesses. This includes the memory bank conflicts and the additional wait-states for data accessed from external memory.
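As a sanity check on this axis, the snippet below reproduces the quoted end points, assuming equation (4.1) takes the form of a size-weighted sum over SARAM, DARAM and external memory, normalised by the total memory size; the exact form of (4.1) is the one defined earlier in the chapter, so treat this as an illustrative sketch only.

    def normalised_memory_cost(saram, daram, ext, ws=1.0, wd=3.0, we=0.05):
        # Assumed form of equation (4.1): size-weighted sum of the three
        # memory types, normalised by the total memory size.
        total = saram + daram + ext
        return (ws * saram + wd * daram + we * ext) / total

    # end points quoted in the text: all off-chip gives 0.05, all DARAM gives 3.0
    assert abs(normalised_memory_cost(0, 0, 96) - 0.05) < 1e-12
    assert abs(normalised_memory_cost(0, 96, 0) - 3.0) < 1e-12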
Figure 4.4: Vocoder Non-dominated Points Comparison Between GA and SA
From Figure 4.3, we observe that the multi-objective GA explores the design points uniformly in all regions of memory cost, whereas SA explores a large number of design points only in the region of Mcost < 1. Our observation is that the sharing and niche formation methods used in GA lead to better solution diversity than SA. This trend is observed in the other benchmarks as well. Table 4.3 presents the number of memory architectures explored and the number of non-dominated points obtained from the GA and SA based approaches. For each of the applications, GA and SA are run for a fixed time, so as to compare the efficiency of the two approaches. The execution time reported in Table 4.3 is the time taken on a Pentium P4 desktop machine with 1GB main memory operating at 1.7 GHz. From Table 4.3 we observe that both GA and SA explore a large number of design points (a few thousand) for each of the benchmarks and identify a few hundred Pareto optimal design points which are interesting from a platform based design [59] viewpoint. The total computation time
taken by these methods for each benchmark varies from 3 hours to 11 hours. Compared to this, the memory design space exploration typically done manually in industry can take several man-months and may explore only a few design points of interest. We observe that GA produces most of the non-dominated points in the first 25% of the time and slowly improves the solution quality after that. On the other hand, SA gives its best results only towards the end, when the annealing temperature approaches zero. Hence, given sufficient time SA catches up with GA, but the time taken by SA to reach the solution quality of GA is 2-3 times GA's run-time. We observe that GA explores significantly more points in the design space (almost by a factor of 2 to 3) than SA for all applications except Mpeg Enc. This is due to the higher execution time of 11 hours: SA's performance improves with time, and we observe that the number of global non-dominated points found by SA is highest for Mpeg Enc. However, the number of non-dominated points identified by the two methods is nearly the same. Interestingly, the non-dominated design points identified by the two methods only partly overlap. Note that our definition of a non-dominated point with respect to the GA and SA approaches refers to those design points that are not dominated by any other point seen so far by the respective method. Thus it is possible that a point identified as non-dominated in one approach is dominated by design points identified in the other approach. We observe that the non-dominated points from SA and GA are very close. This can be observed from Figure 4.4, where the non-dominated points of GA and SA are plotted together in one graph for two of the applications. Table 4.4 presents data on the total number of non-dominated points obtained from GA and SA. The table presents the number of common non-dominated points between GA and SA in column 4. These are the same Pareto optimal design points identified by both GA and SA. The number of unique non-dominated points represents the solutions that are globally non-dominated but present in only one of the GA/SA approaches. The presence of a unique non-dominated point in one approach means that this point is missing in the other approach. Column 6 reports the global non-dominated points; this is the sum of column 4 and column 5. The ratio of column 6 to column 3 in a way represents the efficiency of an approach. We
Table 4.3: Memory Architecture Exploration

Application   Time Taken   GA: Arch explored   GA: non-dominated pts   SA: Arch explored   SA: non-dominated pts
Mpeg Enc      11 hours     8780                270                     9172                287
Vocoder       6 hours      7724                105                     3850                104
Jpeg          3 hours      9266                89                      2240                90
DSL           6 hours      11240               133                     8560                149
Table 4.4: Non-dominated Points Comparison GA-SA

Application   Method   Non-dom (ND) pts   Common ND pts   Unique ND pts   Global ND pts   Dominated pts   Avg min dist from unique NDs
Mpeg Enc      GA       270                115             143             258             12              1.3%
Mpeg Enc      SA       287                115             99              214             73              2.1%
Vocoder       GA       105                56              45              101             4               0.77%
Vocoder       SA       104                56              22              78              26              3.1%
Jpeg          GA       90                 32              63              95              2               1.6%
Jpeg          SA       89                 32              2               34              55              2.1%
DSL           GA       133                71              54              125             8               1.8%
DSL           SA       149                71              34              105             44              5.5%
observe that the number of common points increases if the time allotted to SA is increased. Further, column 7 reports the number of non-dominated points identified by one method that are dominated by points from the other method. This too is an indicator of the efficiency of the approach: the more dominated points, the lower the efficiency. For example, for the Mpeg Enc benchmark, 73 of the non-dominated design points reported by SA are in fact dominated by design points seen by the GA approach. As a consequence, the number of global non-dominated points reduces to 214 for this benchmark. In contrast, GA reports 270 non-dominated points, of which 258 are globally non-dominated. This trend is observed for almost all benchmarks. Thus the experimental data indicate that GA performs better than SA. One concern that still remains is the set of
unique non-dominated points identified by SA but not by GA. If these design points are interesting from a platform based design viewpoint, then to be competitive the GA approach should at least find a close enough design point. To assess this analytically, we find the minimum Euclidean distance between each unique non-dominated point reported by SA and all the non-dominated points reported by GA. The minimum distance is normalised with respect to the distance between the unique non-dominated point and the origin. This metric in some sense identifies a close enough design point for each Pareto optimal point missed by GA. If we can find an alternate non-dominated point in GA at a very small distance from the unique non-dominated point reported by SA, then GA's solution space can be considered an acceptable superset. In column 8, we report the average (arithmetic mean) minimum distance of all unique non-dominated points in SA to the non-dominated points in GA. A similar metric is reported for the unique non-dominated points identified by GA. We also report the maximum of the minimum distance over all unique non-dominated points in column 9. The worst case average distance from unique non-dominated points is 1.8% for GA and 5.5% for SA. Thus for every unique non-dominated point reported by SA, the GA method can find a corresponding non-dominated point within a distance of 1.8%. In summary, we observe that GA finds more non-dominated points in general and results in better solution quality for a given time. Only a few of GA's non-dominated points are dominated by SA. Also, GA searches the design space more uniformly. This may be due to the sharing and niche count based approach used in the multi-objective GA, which facilitates better solution diversity and explores a larger number of Pareto-optimal points.
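Stated as code, the metric is straightforward; the sketch below assumes each design point is an (Mcost, Mcyc) pair and is illustrative only.

    import math

    def min_normalised_distance(point, front):
        # Minimum Euclidean distance from one unique non-dominated point to
        # the other method's front, normalised by the point's distance to
        # the origin.
        nearest = min(math.hypot(point[0] - p[0], point[1] - p[1]) for p in front)
        return nearest / math.hypot(point[0], point[1])

    def avg_min_distance(unique_points, front):
        # Arithmetic mean over all unique non-dominated points (column 8 of
        # Table 4.4); taking max() instead of the mean gives column 9.
        return sum(min_normalised_distance(u, front)
                   for u in unique_points) / len(unique_points)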
4.5.2.2 Memory Architecture Exploration Results
Figures 4.5, 4.6, 4.7 and 4.8 plot all the memory architectures explored for each of the 4 applications using the GA approach. The figures also plot the non-dominated points with respect to memory cost and memory cycles. Note that each non-dominated point is a Pareto optimal memory architecture for a given memory cost or memory cycles. The results present 150-200 non-dominated solutions (representing optimal architectures)
for each of the applications.
Figure 4.5: Vocoder: Memory Exploration (All Design Points Explored and Non-dominated Points)
Figure 4.6: MPEG: Memory Exploration (All Design Points Explored and Non-dominated Points)
Figure 4.7: JPEG: Memory Exploration (All Design Points Explored and Non-dominated Points)
Figure 4.8: DSL: Memory Exploration (All Design Points Explored and Non-dominated Points)
4.6 Related Work
Broadly, two types of approaches have been attempted for memory design space exploration: (i) Architecture Description Language (ADL) based approaches that use
simulation as a means to evaluate different design choices, and (ii) exhaustive search or evolutionary approaches for memory architecture exploration, with analytical model based estimation to evaluate memory architectures. Architecture description language based approaches like LISA [56], EXPRESSION [46], and ISDL [32] capture processor architecture details in a high level language as a front-end and use a generator as a back-end to generate 'C' models that simulate the processor architecture configuration. Specifically, LISA and EXPRESSION capture the micro-architectural details of the memory organization in a high level language format. From the specification, both EXPRESSION and LISA generate 'C' models that simulate the memory behavior. To evaluate a specific memory configuration for a given application, the application has to be compiled and run on the generated 'C' model to get the performance numbers in terms of the number of memory stalls. LISA allows the flexibility to capture memory architecture details at different abstraction levels, such as functional and cycle accurate specifications. A functional 'C' model is 1-2 orders of magnitude faster, in terms of run-time, than a cycle accurate simulation model. Though ADLs provide an elegant means to capture the memory architecture details and further provide a platform to evaluate a given configuration by means of simulation, there are some open issues that need to be addressed. One, simulation is an expensive method in terms of run-time, and this limits the number of configurations that can be evaluated. Two, the memory configurations need to be fed as inputs manually. To evaluate significantly different memory organizations, developing the specification is a time consuming task. Further, these methods do not address the problem of configuration selection itself. Providing new configurations is a manual task, and depending on the designer who modifies the specification, the type of configurations evaluated could differ. While these methods are very effective in evaluating a given memory architecture accurately, they are not suitable for exploring a design space with thousands of configurations, for the following reasons: (i) for every configuration that needs to be explored, the input specification needs to be modified, and this is a manual process, and (ii) since these are simulation based approaches, even with a functional simulator, the number of architecture
configurations that can be evaluated for a large application is very limited because of the large time taken by the simulator. The second type of approach is estimation based methods. In [54], Panda et al. present a heuristic algorithm for SPRAM-cache based local memory exploration. The objective of this work is to determine the right size of on-chip memory for a given application. Their algorithm partitions the on-chip memory into Scratch-pad RAM and data cache and also computes the right line size for the data cache. It searches the entire memory space to find the combination of Scratch-pad RAM, data cache and line size that gives the best memory performance. This approach is very useful for architectures that contain both SPRAM and cache. Our work differs from this work in many aspects. We address a different memory architecture class, which consists of an on-chip SPRAM with multiple SARAM banks and DARAM banks, but without cache memory. We have proposed a two-level iterative approach for memory architecture exploration. The main advantage of our method is that it integrates data layout and memory exploration into one problem. To the best of our knowledge there is no other work that considers the integration of memory exploration and data layout as one single problem and optimizes for performance and area (memory cost). The memory exploration strategy presented in [64] explores the design space to find the optimal configurations considering the cache size, processor cycles and energy consumption. They propose an enumeration based search. Our approach, on the other hand, uses evolutionary methods and is efficient, in terms of computation time, in exploring complex memory architectures with multiple objectives. There are also other memory design space exploration approaches that consider cache based target memory architectures [9, 52, 51]. In this chapter, our work addresses the memory architecture exploration of DSP memory architectures that are typically organized as multiple memory banks, where each of the banks can consist of single/dual port memories with different sizes. We consider non-uniform memory bank sizes. Our work uses an integrated data layout and memory architecture exploration approach, which is key for guiding GA's search path in the right direction. The cost functions and the solution search space are very different for a cache based memory architecture as used in [9, 52] and an on-chip
scratch pad based DSP memory architecture as used in our work. Although the approach presented in this chapter does not address cache based architectures, we deal with this in Chapter 7. In summary, the unique contributions of our work are the following: (a) integrating memory architecture exploration and data layout in an iterative framework to explore the memory design space, (b) addressing the class of DSP memory architectures that are more complex and heterogeneous, and (c) solving the design space exploration problem for multiple objectives (memory architecture performance and memory area) to obtain a set of Pareto-optimal design solutions.
4.7 Conclusions
In this chapter we addressed the multi-level multi-objective memory architecture exploration problem through a combination of evolutionary algorithms (for memory architecture exploration) and an efficient heuristic data placement algorithm. More specifically, for the outer level memory exploration problem, we have used two different evolutionary algorithms: (a) a multi-objective Genetic Algorithm and (b) Simulated Annealing. We have addressed two of the key system design objectives: (i) performance in terms of memory stall cycles and (ii) memory cost. Our approach explores the design space and gives a few hundred Pareto optimal memory architectures at various system design points in a few hours of run time on a standard desktop. Each of these Pareto optimal design points is interesting to the system designer from a platform based design viewpoint. We have presented a fully automated approach in order to meet time-to-market requirements. We extend the methodology to handle energy consumption in Chapter 6.
Chapter 5
Data Layout Exploration

5.1 Introduction
In Chapter 3, we addressed the data-layout problem only from a performance (reducing memory stalls) perspective. In this chapter, we explore the data-layout design space with the objective of identifying a set of Pareto-optimal data-layout solutions that are interesting from a performance and power viewpoint. In the earlier chapters, we solved the data layout and memory architecture design space exploration problems from the logical architecture viewpoint. This is because, during system design, the software application developers work with the logical memory architecture, which specifies the logical structure of the memory architecture in terms of the size of the on-chip memory, the organization of the on-chip memory in terms of the number of memory banks, the number of ports per memory bank, the size of the external memory, and the access latency of all these memories. These are the parameters that impact the performance of the application, and hence the software developers use this logical view of the memory architecture to optimize the data layout and extract the maximum performance. The hardware designers take the logical memory architecture specification as input and design a physical memory architecture. This process is referred to as memory allocation in the literature [35, 61]. Each of the logical memories is constructed with one or
more memory modules taken from a semiconductor vendor memory library. For example, a logical memory bank of 16KB×16 can be constructed with four 4KB×16, eight 2KB×16, eight 4KB×8, or sixteen 1KB×16 memory units. Each of these options, for different process technologies and different memory unit organizations, results in different performance, area and energy consumption trade-offs. Hence the memory allocation process is performed with the objective of reducing the memory area in terms of silicon gates, and the energy consumption. The memory allocation problem in general is NP-Complete [35]. Earlier approaches to the data layout step typically use a logical memory architecture as input [10, 53], and as a consequence power consumption data for the memory banks is not available. By considering the physical memory architecture, the data layout method proposed in this chapter can optimize for power as well. Also, a common design assumption in earlier approaches [10] is that, for data layout, power and performance are non-conflicting objectives, and therefore optimizing performance will also result in lower power. However, we show that this assumption in general is not valid for all classes of memory architectures. Specifically, we show that for DSP memory architectures, power and performance are conflicting objectives and a significant trade-off (up to 70%) is possible. Hence this factor needs to be carefully considered in the data layout method to choose an optimal power-performance point in the design space. When we extend these problems to take the physical memory architecture into account, there are two possible approaches. One approach is to solve the data layout and memory architecture exploration problem for the logical memory architecture, as described in the previous chapters, and then map the logical memory architecture to a physical memory architecture. Alternatively, the above problem can be directly solved for the physical memory architecture. We evaluate both these approaches and demonstrate that the latter is more beneficial. In this process, we develop a comprehensive automatic memory architecture exploration framework, which can explore logical and physical memory architectures. We do this in a systematic way, first addressing the data layout problem for physical memory architecture in this chapter. The following chapter deals with the memory architecture exploration problem considering the physical memory architecture.
In this chapter we propose MODLEX, a Multi Objective Data Layout EXploration framework based on a Genetic Algorithm, which explores the data layout design space for a given logical memory architecture mapped to a physical memory architecture and obtains a list of Pareto-optimal data layout solutions from a performance and power perspective. In other words, the MODLEX approach identifies the best data layouts for a given physical memory architecture (implementing a logical memory architecture) in order to identify design points that are interesting from a power and performance viewpoint. The main contributions of this chapter are: (a) combining different data layout optimizations into a unified framework that can be used for complex embedded DSP memory architectures; even though we target DSP memory architectures, our method works for microcontrollers as well; (b) modeling the data layout problem as a multi-objective Genetic Algorithm (GA) with performance and power as the objectives; our method optimizes the data layout for power and run-time and presents a set of solution points that are optimal with respect to power and performance; and (c) showing that, although most of the work in the literature assumes performance and power are non-conflicting objectives with respect to data allocation, a significant trade-off (up to 70%) is possible between power and performance. The remainder of this chapter is organized as follows. Section 5.2 deals with the problem definition. In Section 5.3, we present our MODLEX framework. In Section 5.4, we describe the experimental methodology and report our experimental results. In Section 5.5, we present the related work. Finally, concluding remarks are provided in Section 5.6.
5.2 Problem Definition
We are given a logical memory architecture Me with m on-chip SARAM memory banks, n on-chip DARAM memory banks, and an off-chip memory, together with the memory access characteristics of the application in terms of (i) the conflict matrix defined in Section 3.2.1, which specifies the number of concurrent accesses between each pair of data sections i and j as well as self-conflicts, and (ii) the frequency of access of individual data sections. The problem at
hand is to realise the logical memory architecture Me in terms of physical memory modules available in the ASIC memory library for a given technology or process node, and to obtain a suitable data layout for the physical memory architecture Mp such that the number of memory stalls incurred and the energy consumed by the memory architecture are minimised. More specifically, we consider the data layout problem for physical memory architecture with the following two objectives:

• the number of memory stalls incurred due to conflicting accesses (parallel and self conflicts) and the additional cycles incurred in accessing off-chip memory; and

• the total memory power, calculated as the sum of the memory power of all memory banks over all memory accesses. The memory power of each bank is computed by multiplying the number of read/write accesses (based on the data placed in the bank) by the power per read/write access of the specific memory module accessed.

We defer the consideration of memory area optimization from a physical memory architecture exploration perspective to the following chapter.
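For illustration, the two objectives could be evaluated as sketched below, given a placement vector, the conflict matrix, per-section access frequencies and per-bank module data. The data structures are assumptions made for the sketch; the actual stall model is the one described in Section 3.5, and the power-per-access values come from the ASIC memory library.

    def memory_stalls(placement, conflicts, banks):
        # First objective: conflicts between data sections placed in the
        # same single-port bank, weighted by that bank's latency. Dual-port
        # (DARAM) banks serve two concurrent accesses, so parallel and self
        # conflicts there cost nothing. Extra wait-states for off-chip
        # accesses would be added in the same way.
        stalls = 0
        for b, bank in enumerate(banks):
            if bank["dual_port"]:
                continue
            sections = [i for i, p in enumerate(placement) if p == b]
            for x in sections:
                for y in sections:
                    if x <= y:  # x == y picks up the self conflicts C[x][x]
                        stalls += conflicts[x][y] * bank["latency"]
        return stalls

    def memory_power(placement, access_freq, banks):
        # Second objective: accesses to each data section multiplied by the
        # power per read/write access of the physical module backing the
        # bank (off-chip memory modelled as one more entry in banks).
        return sum(access_freq[i] * banks[b]["power_per_access"]
                   for i, b in enumerate(placement))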
5.3 MODLEX: Multi Objective Data Layout EXploration

5.3.1 Method Overview
We formulate the data layout problem as a multi-objective GA [30] to obtain a set of Pareto optimal design points. The objectives are minimizing memory stall cycles and memory power. Figure 5.1 illustrates our MODLEX (Multi Objective Data Layout EXploration) framework, which takes application profile information and a logical memory architecture as inputs. The logical memory architecture, as explained in Chapter 4, contains the number of memory banks, memory bank sizes, memory bank types (single-port, dual-port), and memory bank latencies. The logical memory to physical memory
map is obtained using a greedy heuristic method, which is explained in the following section. The core engine of the MODLEX framework is the multi-objective data layout, which is implemented as a Genetic Algorithm (GA). The data layout block takes the application data and the logical memory architecture as input and outputs a data placement. The cost of a data placement in terms of memory stalls is computed as explained in Chapter 3. To compute the memory power, we use the physical memory architecture and the power per read/write access obtained from the ASIC memory library. The memory power computation is further explained in Section 5.3.3.3. The overall fitness function used by the GA is a combination of the memory stall cost and the memory power cost. Based on the fitness function, the GA evolves by selecting the fittest individuals (the data placements with the lowest cost) for the next generation. To handle multiple objectives, the fitness is computed by ranking the chromosomes based on the non-dominated criteria (as explained in Section 2.5). This process is repeated for a maximum number of generations, specified as an input parameter.
Figure 5.1: MODLEX: Multi Objective Data Layout EXploration Framework
5.3.2 Mapping Logical Memory to Physical Memory
To get the memory power and area numbers, the logical memories have to be mapped to physical memory modules available in an ASIC memory library for a specific technology/process node. As mentioned earlier, each logical memory bank can be implemented physically in many ways. For example, a logical memory bank of 4K×16 bits can be formed with two physical memories of size 2K×16 bits or four physical memories of size 2K×8 bits. Different approaches have been proposed for mapping logical memory to physical memories [35, 61]. The memory mapping problem in general is NP-Complete [35]. However, since the logical memory architecture is already organized as multiple memory banks, most of the mapping turns out to be a direct one-to-one mapping. In this chapter a simple greedy heuristic is used to perform the mapping of logical to physical memory with the objective of reducing silicon area. This is achieved by first sorting the memory modules by area per byte and then choosing the smallest area-per-byte physical memory modules to form the required logical memory bank size. Though this heuristic is very simple, it results in an efficient physical memory architecture. Further, in the following chapter, we consider the exploration of the physical memory architecture with the added objective of area optimization.
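A minimal sketch of this greedy mapping is shown below, assuming each library entry records its capacity in bytes and its silicon area; bit-width matching and other electrical constraints imposed by a real memory library are ignored here.

    def map_logical_bank(bank_bytes, library):
        # Tile one logical bank with the most area-efficient modules first
        # (smallest area per byte); pad any remainder with the smallest module.
        ranked = sorted(library, key=lambda m: m["area"] / m["bytes"])
        chosen, remaining = [], bank_bytes
        for module in ranked:
            while remaining >= module["bytes"]:
                chosen.append(module["name"])
                remaining -= module["bytes"]
        if remaining > 0:
            chosen.append(min(ranked, key=lambda m: m["bytes"])["name"])
        return chosen

    # hypothetical library entries (area in arbitrary units)
    lib = [{"name": "4Kx16", "bytes": 8192, "area": 1.0},
           {"name": "2Kx16", "bytes": 4096, "area": 0.55}]
    print(map_logical_bank(16384, lib))  # ['4Kx16', '4Kx16']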
5.3.3 Genetic Algorithm Formulation

To map the data layout problem to the GA framework, we use the chromosomal representation, fitness computation, selection function and genetic operators defined in Section 3.4. For easy reference and completeness, we briefly describe them in the following subsections.

5.3.3.1 Chromosome Representation
For the data memory layout problem, each individual chromosome represents a memory placement. A chromosome is a vector of d elements, where d is the number of data sections. Each element of a chromosome can take a value in (0..m), where 1..m represent on-chip logical memory banks (including both SARAM and DARAM banks) and 0
represents off-chip memory. For the purpose of data layout it is sufficient to consider the logical memory architecture, from which the number of memory stalls can be computed. However, for computing the power consumption of a given placement, the corresponding physical memory architecture, obtained from our heuristic mapping algorithm, needs to be considered. If element i of a chromosome has the value k, then data section i is placed in memory bank k; thus a chromosome represents a memory placement for all data sections. Note that a chromosome may not always represent a valid memory placement, as the size of the data sections placed in a memory bank k may exceed the size of k. Such a chromosome is marked as invalid and assigned a low fitness value.
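As an illustration, the validity check could look like the following sketch (the names are assumptions; the encoding is the one described above).

    def is_valid(chromosome, section_sizes, bank_capacities):
        # chromosome[i] is the bank index for data section i: 1..m are the
        # on-chip banks, 0 is off-chip (treated as unbounded here). The
        # placement is invalid when any on-chip bank's capacity is exceeded.
        used = [0] * (len(bank_capacities) + 1)
        for section, bank in enumerate(chromosome):
            used[bank] += section_sizes[section]
        return all(used[b] <= bank_capacities[b - 1]
                   for b in range(1, len(used)))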
5.3.3.2 Chromosome Selection and Generation
The strongest individuals in a population are used to produce new offspring. The selection of an individual depends on its fitness; an individual with a higher fitness has a higher probability of contributing one or more offspring to the next generation. In every generation, from the P individuals of the current generation, M new offspring are generated, resulting in a total population of (P + M). From this, the P fittest individuals survive to the next generation; the remaining M individuals are annihilated. Crossover and mutation operators are implemented as explained in Section 3.4.

5.3.3.3 Fitness Function and Ranking
For each individual, corresponding to a data layout, the fitness function computes the power consumed by the memory architecture (Mpow) and the performance in terms of memory stall cycles (Mcyc). The computation of Mcyc is similar to the cost function used in our heuristic algorithm described in Section 3.5 and is explained briefly below. The number of memory stalls incurred in a memory bank j can be computed by summing the number of conflicts between pairs of data sections that are kept in j. For each pair of conflicting data sections, the number of conflicts is given by the conflict matrix. Thus the number of stalls in memory bank j is given by Σ Cx,y, over all (x, y) such that data sections x and y are placed in memory bank j. As DARAM banks support concurrent
accesses, DARAM bank conflicts Cx,y between data sections x and y placed in a DARAM bank, as well as self conflicts Cx,x, do not incur any memory stalls. Note that our model assumes only up to two concurrent accesses in any cycle. The total memory stalls incurred in bank j can be computed by multiplying the number of conflicts by the bank latency. The total memory stalls for the complete memory architecture are computed by summing the memory stalls incurred by all the individual memory banks. The memory power corresponding to a chromosome is computed as follows. Assume each logical memory bank j is mapped to a set of physical memory banks mj,1, mj,2, ..., mj,nj. If Pj,k is the power per read/write access of memory module mj,k and AFi,j,k is the number of accesses to data section i that map to physical memory bank mj,k, then the total power consumed is given by
Ponchip = Σi Σj Σk AFi,j,k × Pj,k        (5.1)
Note that AFi,j,k is 0 if data section i is either not mapped to logical memory bank j, or not mapped to physical memory bank k. Also, AFi,j,k and AFi,j,k′ would both account for an access to data section i that is mapped to logical memory bank j when j is implemented using multiple banks k and k′; for example, a logical memory bank of 2K×16 implemented using two physical memory modules of size 2K×8. Thus the total power Mpow for all the memory banks, including off-chip memory, is given by

Mpow = Ponchip + Σi AFi,off × Poff
where AFi,off represents the number of accesses to off-chip memory from data section i, and Poff is the power per access for off-chip memory. Once the memory power and memory cycles are computed for all the individuals in the population, the individuals are ranked according to the Pareto optimality conditions on power consumption (Mpow) and performance in terms of memory stall cycles (Mcyc). More specifically, if (Mpow^a, Mcyc^a) and (Mpow^b, Mcyc^b) are the memory power and memory cycles
of chromosome A and chromosome B, A is ranked higher (i.e., has a lower rank value) than B if

((Mpow^a < Mpow^b) ∧ (Mcyc^a ≤ Mcyc^b)) ∨ ((Mcyc^a < Mcyc^b) ∧ (Mpow^a ≤ Mpow^b))
The ranking process in a multi-objective GA proceeds in the non-dominated sorting manner described in Section 4.3. All non-dominated individuals in the current population get rank value 1 and are flagged. Subsequently, rank-2 individuals are identified as the non-dominated solutions in the remaining population. In this way all chromosomes in the population get a rank value. Higher fitness values are assigned to rank-1 individuals as compared to rank-2, and so on. This fitness is used for the selection probability: individuals with higher fitness get a better chance of being selected for reproduction. To ensure solution diversity, which is critical for getting a good distribution of solutions on the Pareto-optimal front, the fitness value is reduced for a chromosome that has many neighboring solutions. This is accomplished as explained in Section 4.3. The GA must be provided with an initial population, which is created randomly. In our implementation we have used a fixed number of generations as the termination criterion. As the GA evolves over generations, non-dominated solutions, which are Pareto-optimal (data layouts) in terms of performance and power, are saved in a database.
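The sketch below isolates the dominance test and the front-peeling loop, with each individual reduced to its (Mpow, Mcyc) pair; the fitness-sharing and niche-count adjustments of Section 4.3 are omitted.

    def dominates(a, b):
        # a and b are (Mpow, Mcyc) pairs: a dominates b when it is strictly
        # better in one objective and no worse in the other.
        return ((a[0] < b[0] and a[1] <= b[1]) or
                (a[1] < b[1] and a[0] <= b[0]))

    def rank_population(objectives):
        # Non-dominated sorting: rank 1 is the current Pareto front, rank 2
        # the front of what remains, and so on.
        ranks, remaining, rank = {}, set(range(len(objectives))), 1
        while remaining:
            front = {i for i in remaining
                     if not any(dominates(objectives[j], objectives[i])
                                for j in remaining if j != i)}
            for i in front:
                ranks[i] = rank
            remaining -= front
            rank += 1
        return ranks

    print(rank_population([(3, 1), (1, 3), (2, 2), (3, 3)]))
    # the first three are mutually non-dominated (rank 1); (3, 3) gets rank 2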
5.4 Experimental Results

5.4.1 Experimental Methodology
We have used the same set of benchmark programs and profile information as in the earlier chapters. For the memory allocation step, we have used TI's ASIC memory library, from which the area and power numbers are obtained. We consider a set of 6 different logical memory architectures listed in Table 5.1. The corresponding physical memory architectures and the normalized area (as the ASIC library is proprietary to Texas Instruments, we present only normalized power and area numbers) required by the physical memory for the different architectures are also given in Table 5.1. Note that the memory size used for each of the memory architectures is 96KB, which is enough to fit the data of each of the applications considered. Further, the architectures A1 to A5 are sorted by physical memory area in descending order. Architecture A6 will be used in Section 5.4.3 for comparison with other related work. In column 3 of Table 5.1, the physical memory banks marked 1P and 2P represent single and dual port memory banks respectively. Architectures A1 to A5 are selected such that the memory configuration, in terms of the number of memory banks and the bank types (SARAM and DARAM), is varied. In all of these configurations, the data width is 16 bits in both the logical architecture and the physical memory banks. From the table it can be observed that the memory area increases with the DARAM size and the number of banks. A1 has the highest number of memory banks and the largest DARAM size; hence A1 consumes the largest area. A2 and A3 have the same DARAM size but different SARAM configurations. A3 and A4 present a non-uniform bank size based SARAM architecture. Non-uniform bank size based architectures allow the usage of memory banks of multiple sizes and hence present opportunities to optimize memory area and power consumption: larger memory banks optimize area, whereas smaller memory banks reduce power consumption. A5 has the fewest memory banks and uses larger memories with a reduced memory area. In summary, we would expect architecture A1 to perform very well in terms of performance because of its large DARAM memory, and architecture A4 to perform better in terms of power consumption because of its smaller DARAM size and the presence of non-uniform bank sizes. Note that architecture A4 has more memory area than A5 even though it has only half of A5's DARAM; this is due to the higher number of banks in A4. Note that A6 has the lowest area because it has 32KB of off-chip RAM, and off-chip memory is not included in the area.
Table 5.1: Memory Architectures Used for Data Layout

Arch   Logical Memory        Physical Memory      Memory Area
A1     2x8K (SARAM)          2x8192 (1P)          1
       5x16K (DARAM)         20x4096 (2P)
A2     16x4K (SARAM)         16x4096 (1P)         0.91
       32K (DARAM)           8x4096 (2P)
A3     8x4K (SARAM)          8x4096 (1P)          0.82
       1x32K (SARAM)         1x32K (1P)
       8x4K (DARAM)          8x4096 (2P)
A4     8x2K (SARAM)          8x2048 (1P)          0.77
       4x4K (SARAM)          4x4096 (1P)
       3x16K (SARAM)         3x16K (1P)
       16K (DARAM)           4x4096 (2P)
A5     64K (SARAM)           2x32K (1P)           0.72
       32K (DARAM)           8x4096 (2P)
A6     8x2K (SARAM)          8x2048 (1P)          0.57
       4x4K (SARAM)          4x4096 (1P)
       16K (SARAM)           1x16K (1P)
       16K (DARAM)           4x4096 (2P)
       1x32K (Off-Chip)      1x32K (SDRAM)
5.4.2 Experimental Results
This section presents the experimental results on the multi-objective data layout for physical memory architectures. Figures 5.2, 5.3, and 5.4 show the sets of non-dominated points, each of which corresponds to a Pareto optimal data layout from a power and performance viewpoint, for the 3 applications on architectures A1-A5. Note that architectures A1 to A5 correspond to fixed physical memory architectures with known silicon areas. Figure 5.2 presents the data layout solution space from a power consumption and performance (memory stalls) viewpoint. Each point in the plot represents a data layout for a given architecture. Observe that several data layout solutions are presented for each of the architectures considered.
Figure 5.2: Data Layout Exploration: MPEG Encoder
It should be noted that the non-dominated points seen by the multi-objective GA are only near optimal, in the sense that they are non-dominated among the solutions seen so far; the evolutionary method may later find a design point that dominates an existing non-dominated solution. However, we choose the number of generations in our method in such a way that increasing the number of iterations does
Figure 5.3: Data Layout Exploration: Voice Encoder
Figure 5.4: Data Layout Exploration: Multi-Channel DSL
Figure 5.5: Individual Optimizations vs Integrated
not result in any new non-dominated points. Note that the solution points corresponding to architecture A1 give better performance (fewer memory stalls); observe that there is a solution point (data layout) that resolves all the memory stalls for MPEG. This is because of the large DARAM in A1. However, most of the solution points of A1 consume more power, again due to the large DARAM size. The solution points of A2 follow those of A1 very closely. Observe that A2's solution points are only slightly inferior in performance to A1's, even though A2 has less than half the DARAM of A1. Also, A2's solution points dominate most of A1's solution points in the low performance region (they perform better in terms of both power and performance; note also that the memory area of A2 is lower than that of A1). Hence, it can be deduced that MPEG does not gain in performance from a larger DARAM, while at the same time the larger DARAM of A1 decreases the power efficiency of the data layouts. Interestingly, even A3's solution points dominate A1's in the lower performance region. The solution points corresponding to A4 give the best results
in terms of power and performance in the mid-region. Observe that A4's solution points dominate all the other solution points in the mid-region. A5's solution points are notably inferior compared to the rest of the solutions. This is because of the small number of memory banks in A5. From this, it can be deduced that MPEG performs multiple simultaneous memory accesses and thus, for MPEG, multiple memory banks are more important than DARAM banks for achieving better solution points. Figure 5.3 presents the results for the Voice Encoder application for the 5 architectures A1-A5. Unlike MPEG, the solution points of A1 are clearly superior here, mainly in terms of performance. Observe that the solution points of architectures A1, A2 and A4 dominate parts of the power-performance space: solutions of A1 dominate the high performance region, solutions of A2 and A4 dominate the middle region in both performance and power, and solutions of A2 again dominate the low power-performance region. From the results, it can be deduced that for the voice encoder, DARAM and multiple memory banks are both equally critical. With only a small increase in area compared to A5, A3 achieves much better performance than A5; this is due to the higher number of banks in A3, which resolves more parallel conflicts. Figure 5.4 presents the results for the Multi-channel DSL application for the 5 architectures A1-A5 described in Table 5.1. Observe that every architecture gives a solution point with near zero memory stalls. This indicates that the application does not require more than 16K of DARAM (the smallest DARAM size used among architectures A1-A5, used in A4). Also, it can be deduced that this application does not need more than 3 banks to resolve all the parallel conflicts (note that A5 has only 3 banks). A significant portion of the DSL application was developed in the 'C' language, and this is one reason for the smaller number of parallel and self conflicts. Typically, hand optimized assembly code will try to exploit DSP architectures by using multiple simultaneous accesses and self accesses, whereas compiler generated assembly code may not be as efficient as hand-optimized code, mainly in terms of parallel memory accesses. Interestingly, the solution points of A4 dominate most of the other solution points. This is mainly due to the non-uniform bank sizes of A4, which present opportunities for the data
layout to optimize and trade off power and performance. Also observe the wide range of trade-offs available between power and performance for all the applications on each of the architectures. This is very useful for application engineers and system designers from a platform design viewpoint. These different power-performance operating points are also essential for SoCs that have Dynamic Voltage and Frequency Scaling (DVFS) [23]. DVFS presents different operating points for an SoC to save power based on use cases. For example, in a mobile phone, stand-alone MP3 playback may not require much performance, whereas MP3 playback while shooting a still picture will. DVFS allows different operating points for the stand-alone MP3 player and MP3 with camera: the MP3 player can be operated with the processor running at 80 MHz at 1.2V, whereas MP3 with camera needs more performance and can be operated with the processor running at 120 MHz at 1.45V. Hence we need two different data layouts for MP3 at these two operating points. Next, we report the execution time required by our multi-objective GA to obtain the data layouts. The GA is run on a Pentium4 desktop at 1.7GHz. It takes 26 to 31 minutes to obtain all Pareto optimal design points for a single architecture; this run-time is approximately the same for all the applications.
5.4.3 Comparison of MODLEX and Stand-alone Optimizations
In this section we present results for the stand-alone optimizations and compare them with our integrated approach, MODLEX, where all the optimizations are considered together. For this purpose we consider the following optimizations:

O1: performing just the on-chip/off-chip data partitioning, similar to the approach proposed in [10, 67]

O2: performing O1 and also resolving parallel memory conflicts by utilizing only multiple memory banks [44, 40]

O3: the MODLEX approach, which integrates O1 and O2, resolves self-conflicts, and also exploits non-uniform sized memory banks
Figure 5.5 presents the results for MPEG on the memory architecture A6 described in Table 5.1. There are six different plots, and each plot corresponds to a specific data layout optimization. The plots O1 and O2 use only the optimizations O1 and O2 respectively; in comparison, our MODLEX framework with optimization O3 presents different solution points from a power and performance viewpoint. Observe that for the same memory architecture, the MODLEX approach presents a wide range of solutions, from the high performance region that resolves almost all the memory stalls to the low performance region. Note that, from a power and performance perspective, the solution points of the integrated approach completely dominate the solution points of the other two plots. Methods like [10, 67] give power/performance close to point P1, point P2 corresponds to the works [44, 40, 58], and the data layout that optimizes power [15] is represented by point P3. From the results we can conclude that our integrated approach gives better solution points with respect to both power and performance. Also, from the experimental results it can be concluded that a wide range of design points with respect to power and performance can be obtained from multi-objective data layout optimization. The computation cost involved in our approach is very small: less than an hour on a standard desktop.
5.5 Related Work
The data layout problem [10, 15, 40, 44, 53, 67] has been widely researched in the literature, from either a performance or a power perspective individually. In [18], a low-energy memory design method referred to as VAbM is proposed; it optimizes the memory area by allocating multiple memory banks with variable bit-width to optimally fit the application data. In [15], Benini et al. present a data layout method that aims at energy reduction. The main idea of this work is to use the access frequency of the memory address space as the starting point and to design smaller (larger) banks for the most (least) frequently accessed memory addresses. In [40], the authors present a heuristic algorithm to efficiently partition the data to avoid parallel conflicts in DSP applications. Their objective is to partition the data into multiple chunks of the same size so that they can fit in a memory
architecture with uniform bank sizes. This approach works well if we consider only performance as an objective. However, if the objective is to optimize both performance and power, then a memory architecture with non-uniform banks is very attractive. All the above optimizations are very effective individually for the class of memory architecture they target. However, a complete data layout approach has to combine many or all of these approaches to address the problem comprehensively. Also, simply combining different optimizations may not be optimal compared to an integrated approach, which is likely to yield a better result. Our MODLEX framework accomplishes this. Further, our data layout approach can effectively partition data to resolve parallel conflicts and also exploit non-uniform bank architectures to save power. To the best of our knowledge there is no other work in the literature that addresses this problem.
5.6 Conclusions
In this chapter we presented MODLEX, a Multi Objective Data Layout EXploration framework for physical memory architectures. Our approach results in many data layouts that are Pareto-optimal with respect to power and performance, which is important from a platform design viewpoint. We demonstrated that a significant trade-off (up to 70%) is possible between power and performance. In the next chapter we extend our framework to explore the memory architecture design space along with the data layout.
Chapter 6
Physical Memory Exploration

6.1 Introduction
As discussed in Chapter 2, at the memory exploration step a memory architecture is defined; this includes determining the on-chip memory size, the number and size of each memory bank in SPRAM, the number of memory ports per bank, the types of memory (scratch pad RAM or cache), and the wait-states/latency. This architecture was referred to as the logical memory architecture in Chapter 4, as it is not yet tied to specific ASIC memory library modules. We proposed a logical memory exploration (LME) framework in Chapter 4 for identifying Pareto optimal design points in terms of performance and area. At the memory architecture design stage, each of the logical memory architectures under consideration has to be mapped to a physical memory architecture. Alternatively, one can directly explore the space of physical memory architectures, taking into consideration the different memory banks/modules available in the semi-conductor vendor memory library. The physical memory architecture proposed for a given application is once again evaluated from a performance, power and cost (memory area) viewpoint, and a list of Pareto optimal design points is obtained that are interesting from a platform design perspective. In this chapter we evaluate these two approaches, explained in greater detail below.

• The first approach is an extension of the Logical Memory Exploration (LME)
Figure 6.1: Memory Architecture Exploration
method described in Chapter 4. The output of LME is a set of design points (logical memory architectures) that are Pareto optimal with respect to performance and (logical) memory cost. Also, as part of LME, the data layout step generates data placement details for each of the logical memory architectures explored in LME. The non-dominated points from LME and the placement details for each non-dominated point form the inputs to the physical memory architecture exploration step. The mapping of a logical memory architecture to a physical memory architecture is formulated as a multi-objective Genetic Algorithm that explores the design space with power and area as the objectives. The area and power numbers of the physical memory modules are obtained from a semi-conductor vendor memory library. The physical memory exploration step is performed for every non-dominated point from LME. Note that performance was one of the objectives at the LME level, and it does not change during the physical memory exploration step. Hence, at the output of the physical memory exploration, for every non-dominated point generated by LME, a set of non-dominated points is identified that are optimal with respect to power and area. We refer to this approach as LME2PME.

• The second approach is a direct approach to Physical Memory Exploration (PME). In this approach we integrate three critical components: (i) memory architecture exploration, (ii) memory allocation, which constructs a logical memory by picking memory modules from a semi-conductor vendor memory library, and (iii)
data layout exploration, which is critical for estimating performance. The memory allocation step is important as it influences the power per read/write access as well as the memory area of all the memory modules. This integrated approach is shown in Figure 6.2. For memory architecture exploration, we use a multi-objective non-dominated sorting Genetic Algorithm approach [25]. For the data layout problem, which needs to be solved for each of the thousands of memory architectures, we use the fast and efficient heuristic method described in Section 3.5. For the memory allocation, we use an exhaustive search algorithm. Thus the overall framework uses a two-level iterative approach, with memory architecture exploration and memory packing at the outer level and data layout at the inner level. We propose a fully automated framework for this integrated approach, which we refer to as DirPME.
Figure 6.2: Memory Architecture Exploration - Integrated Approach
Thus, the main contribution of this chapter is a two-pronged approach to physical memory architecture exploration. Our method optimizes the memory architecture for a given application and presents a set of solution points that are optimal with respect to performance, power and area. The remainder of this chapter is organized as follows. In Section 6.2, we present the LME2PME approach. Section 6.3 deals with our DirPME framework. In Section 6.4, we present the experimental methodology and results for both the LME2PME and DirPME
frameworks. Section 6.5 covers some of the related work from the literature. Finally in Section 6.6, we conclude by summarizing our work in this chapter.
6.2 Logical Memory Exploration to Physical Memory Exploration (LME2PME)
6.2.1 Method Overview
The LME2PME method extends the Logical Memory Exploration (LME) process described in Chapter 4 by considering memory power and memory area in addition to the memory performance objective addressed by the LME. Note that the LME works on minimizing the number of memory stalls for a given logical memory cost, where the logical memory cost is a factor proportional to memory area. For a given application, LME finds a list of Pareto optimal logical memory architectures with performance and logical memory area as the objective criteria. This is shown in the top right portion of Figure 6.3. At the LME step, the memory is not mapped to physical modules, and hence the actual silicon area and power consumption numbers are not known. Also, for a given logical memory architecture there are many possible ways to implement the actual physical memory architecture. As shown in Figure 6.3, the non-dominated points from LME are taken as inputs for the memory allocation exploration step. The output of Physical Memory Exploration is a set of Pareto optimal points with memory power, memory area and memory stalls as the objective criteria. For each non-dominated logical memory architecture generated by LME, there are multiple physical memory architectures with different power-area operating points but the same memory stalls. This is shown in Figure 6.3, where the design solution LM1 in LME's output corresponds to a memory stall of ms1 and generates a set of Pareto optimal points (denoted by PM1s in the lower half of Figure 6.3) with respect to memory area and memory power. Similarly, LM2, which incurs a memory stall of ms2, results in a set PM2s of physical memory architectures. Note that ms1 and ms2, the memory stalls as determined by
LME, do not change during the Physical Memory Exploration step. Different physical memory architectures are explored with different area-power operating points for a given memory performance.
Figure 6.3: Logical to Physical Memory Exploration - Overview
6.2.2 Physical Memory Exploration
In a traditional HW-SW codesign method, once the logical memory architecture is finalized, the HW and SW designs proceed independently. The SW design teams focus on performance optimization of the application on the given logical memory architecture, while the HW design teams focus on area optimization during the memory allocation step. In the process, power optimization, which requires both a HW and a SW perspective, is not considered. Our LME2PME method addresses this problem by taking the required inputs from the data layout step, which helps in optimizing power consumption and area at the same time during memory allocation. Figure 6.4 describes the LME2PME method.
The top part of Figure 6.4, the logical memory architecture exploration (LME), is the same as what was described in Chapter 4. The bottom part of Figure 6.4 shows the physical memory exploration. As shown in the figure, the Physical Memory Exploration (PME) step takes two inputs from the LME. The first input is the set of non-dominated points generated by LME; the second input is the data placement details, which are the output of the data layout step and provide information on which data-section is placed in which memory bank. From the data placement and the profile data, the PME computes the number of memory accesses per logical memory bank. This information can be used to decide between larger and smaller memories while mapping a logical memory bank. As discussed in Chapter 2, a smaller memory consumes less power per read/write access than a larger memory. Hence, if a logical memory bank is known to hold data that is accessed a large number of times, it is power-optimal to build this logical memory bank from many smaller physical memories. However, this comes with a higher silicon area cost and hence results in an area-power trade-off. We formulate the memory allocation exploration as a Multi-Objective Genetic Algorithm problem. To map an optimization problem to the GA framework, we need the following: a chromosomal representation, a fitness computation, a selection function, genetic operators, the creation of the initial population and the termination criteria. Figure 6.5 explains the GA formulation of the Physical Memory Mapping problem.
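As an illustration, the per-bank access count can be derived from the data placement and the profile data along the following lines; this is a hypothetical sketch, not the thesis implementation.

# Hypothetical sketch: memory accesses per logical bank, computed from the
# LME data placement (section -> bank) and profiled access frequencies.

def accesses_per_bank(placement, access_freq):
    bank_accesses = {}
    for section, bank in placement.items():
        bank_accesses[bank] = bank_accesses.get(bank, 0) + access_freq[section]
    return bank_accesses

# Example: two sections in bank 0 contribute their combined access counts.
assert accesses_per_bank({"x": 0, "y": 0, "z": 1},
                         {"x": 100, "y": 50, "z": 10}) == {0: 150, 1: 10}

Banks with high access counts are candidates for construction from several small physical modules, trading extra area for lower power per access.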
6.2.3 Genetic Algorithm Formulation
6.2.3.1 Chromosome Representation
Each individual chromosome represents a physical memory architecture. As shown in Figure 6.5, a chromosome consists of a list of physical memories picked from an ASIC memory library. This list of physical memories is used to construct a given logical memory architecture. Typically, multiple physical memory modules are used to construct a logical memory bank. As an example, if the logical bank is of size 8K×16bits, the physical memory modules can be two 4K×16bits, eight 2K×8bits, eight 1K×16bits, and so on. We have limited the number of physical memory modules per logical memory bank
Figure 6.4: Logical to Physical Memory Exploration - Method
to at most k. Thus, a chromosome is a vector of d elements, where d = Nl × k + 1 and Nl is the number of logical memory banks, which is an input from LME. Each element is an index into the semiconductor vendor memory library and corresponds to a specific physical memory module. For decoding a chromosome, for each of the Nl logical banks, the chromosome has k elements. As mentioned earlier, each of the k elements is an integer used to index into the semiconductor vendor memory library. With the k physical memory modules, a logical memory bank is formed. We have used a memory allocator that performs exhaustive combinations of the k physical memory modules to obtain the largest logical memory required with the specified word size. Here, the bank size, the word size and the number of ports are obtained from the logical memory architecture corresponding to the chosen non-dominated point.
Figure 6.5: GA Formulation of LME2PME
In this process, it may happen that m out of the total k physical memories selected are not used, if the given logical memory bank can be constructed with k−m physical memories¹. For example, if k=4, the 4 elements are 2K×8bits, 2K×8bits, 1K×8bits and 16K×8bits, and the logical memory bank is 2K×16bits, then our memory allocator builds the 2K×16bits logical memory bank from the two 2K×8bits modules, and the remaining two memories are ignored. Note that the 16K×8bit and 1K×8bit memories are removed from the configuration, as the logical memory bank can be constructed optimally with the two 2K×8bit memory modules. Here, the memory area of this logical
¹This approach of using only the required k − m physical memory modules relaxes the constraint that the chromosome representation has to exactly match a given logical memory architecture. This, in turn, facilitates the GA approach to explore many physical memory architectures efficiently.
memory bank is the sum of the memory areas of the two 2K×8bit physical memory modules². This process is repeated for each of the Nl logical memory banks. The memory area of a memory architecture is the sum of the areas of all the logical memory banks.
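The decoding step might be sketched as follows. This is an illustrative reconstruction under assumptions: decode_chromosome and allocate_bank are hypothetical names, with allocate_bank standing in for the exhaustive allocator described above.

# Hypothetical sketch of LME2PME chromosome decoding. Each logical bank owns
# k gene slots; each gene indexes a module in the vendor memory library.

def decode_chromosome(genes, logical_banks, library, k, allocate_bank):
    """Return (modules used per bank, total area) or None if infeasible."""
    used_per_bank, total_area = [], 0.0
    for b, bank_spec in enumerate(logical_banks):
        candidates = [library[g] for g in genes[b * k:(b + 1) * k]]
        used = allocate_bank(candidates, bank_spec)  # surplus modules are ignored
        if used is None:
            return None                              # bank cannot be constructed
        used_per_bank.append(used)
        total_area += sum(m.area for m in used)      # area of the used modules only
    return used_per_bank, total_area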
6.2.3.2 Chromosome Selection and Generation
The strongest individuals in a population are used to produce new offspring. The selection of an individual depends on its fitness; an individual with a higher fitness has a higher probability of contributing one or more offspring to the next generation. In every generation, from the P individuals of the current generation, M new offspring are generated using mutation and crossover operators, resulting in a total population of (P + M). The crossover operation is performed as illustrated in Figure 6.5. From this total population of (P + M), the P fittest individuals survive to the next generation; the remaining M individuals are annihilated. Crossover and mutation operators are implemented in the standard way.
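A minimal sketch of this survivor selection, assuming crossover, mutate and rank stand in for the genetic operators and the non-dominated ranking (lower rank means fitter):

import random

def next_generation(parents, crossover, mutate, rank, M):
    # P parents produce M offspring; the P fittest of (P + M) survive.
    offspring = [mutate(crossover(random.choice(parents), random.choice(parents)))
                 for _ in range(M)]
    pool = parents + offspring
    pool.sort(key=rank)
    return pool[:len(parents)]      # the remaining M individuals are annihilated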
6.2.3.3 Fitness Function and Ranking
For each of the individuals, the fitness function computes Marea and Mpow. Note that Mcyc is not computed, as it is already available from LME. Marea is obtained from the memory mapping block as the sum of the areas of all the physical memory modules used in the chromosome. Mpow is computed based on two factors: (a) the access frequency of the data-sections and the data-placement information, and (b) the power per read/write access derived from the semiconductor vendor memory library for all the physical memory modules. To compute the memory power, the method uses the data layout information provided by the LME step. Based on the data layout, and the physical memories required to form the logical memory (obtained from the chromosome representation), the accesses to each data section are mapped to the respective physical memories.
²Although the chromosome representation may have more physical memories than required to construct the given logical memory, the fitness function (area and power estimates) is derived only for the required physical memories.
From this mapping, the power per access for each physical memory, and the number of accesses to each data section, the total memory power consumed by all accesses to a data section is determined. The total memory power consumed by the entire application on a given physical memory architecture is then computed by summing the power consumed by all the data sections. Once the memory area, memory power and memory cycles are computed for all the individuals in the population, the individuals are ranked according to the Pareto optimality condition given in the following equation, which is similar to the Pareto optimality condition discussed in Chapter 4 but considers all three objective functions. Let $(M^a_{pow}, M^a_{cyc}, M^a_{area})$ and $(M^b_{pow}, M^b_{cyc}, M^b_{area})$ be the memory power, memory cycles and memory area of chromosome A and chromosome B respectively. A dominates B if the following expression is true:

$$\big((M^a_{pow} < M^b_{pow}) \land (M^a_{cyc} \le M^b_{cyc}) \land (M^a_{area} \le M^b_{area})\big)$$
$$\lor\; \big((M^a_{cyc} < M^b_{cyc}) \land (M^a_{pow} \le M^b_{pow}) \land (M^a_{area} \le M^b_{area})\big)$$
$$\lor\; \big((M^a_{area} < M^b_{area}) \land (M^a_{cyc} \le M^b_{cyc}) \land (M^a_{pow} \le M^b_{pow})\big) \qquad (6.1)$$
For ranking of the chromosomes, we use the non-dominated sorting process described in Section 4.3. The GA must be provided with an initial population that is created randomly. In our implementation we have used a fixed number of generations as the termination criterion.
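The dominance condition above is the standard Pareto test for three minimized objectives; a direct transcription for (power, cycles, area) tuples:

def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

# Example: equal cycles, lower power and area.
assert dominates((1.0, 100, 2.0), (1.2, 100, 2.5))
assert not dominates((1.0, 100, 2.0), (1.0, 100, 2.0))   # no strict improvement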
6.3 Direct Physical Memory Exploration (DirPME) Framework
6.3.1 Method Overview
In the LME2PME approach described in the previous section, the physical memory exploration is done in two steps. In this section we describe the DirPME framework that
directly operates in the physical memory design space.
Figure 6.6: MAX: Memory Architecture eXploration Framework
Figure 6.6 explains our DirPME framework. The core engine of the framework is the multi-objective memory architecture exploration, which takes the application data size and the semi-conductor vendor memory library as inputs and forms different memory architectures. The memory allocation procedure builds the complete memory architecture from the memory modules chosen by the exploration block. If the memory modules together do not form a proper memory architecture, the memory allocation block rejects the chosen memory architecture as invalid. The memory allocation step also checks the access time of the on-chip memory modules and rejects those whose cycle time is greater than the required access time. The exploration process using the genetic algorithm and the chromosome representation is discussed in detail in the following section. Once the memory modules are selected, the memory mapping block computes the total memory area, which is the sum of the areas of all the individual memory modules. Details of the selected memory architecture, such as the on-chip memory size, number of memory banks, number of ports and off-chip memory bank latency, are passed to the data layout procedure. The application data buffers and the application profile information are also
given as inputs to the data layout. The application itself consists of multiple modules, including several third-party IP modules, as shown in Figure 6.6. With these inputs, the data layout maps the application data buffers to the memory architecture; the data layout heuristic is the same as explained in Section 3.5. The output of the data layout is a valid placement of application data buffers; from this placement and the application's memory access characteristics, the memory stalls are determined. The memory power is also computed using the application characteristics and the power per access available from the semi-conductor vendor memory library. Lastly, the memory cost is computed by summing the cost of the individual physical memories. Thus the fitness function for the memory exploration is computed from the memory area, performance and power. Based on the fitness function, the GA evolves by selecting the fittest individuals for the next generation. Since the fitness function contains multiple objectives, the fitness is computed by ranking the chromosomes based on the non-dominated criteria (explained in Section 6.3.2). This process is repeated for a maximum number of generations specified as an input parameter.
6.3.2 Genetic Algorithm Formulation
6.3.2.1 Chromosome Representation
For the memory architecture exploration problem in DirPME, each individual chromosome represents a physical memory architecture. As shown in Figure 6.7, a chromosome consists of two parts: (a) the number of logical memory banks (Li), and (b) the list of physical memory modules that form each logical memory bank. Once again we assume that each logical memory bank is constructed using at most k physical memories. A key difference between the LME2PME and DirPME approaches is that in LME2PME the number of logical memory banks is fixed (equal to Nl), so the chromosomes are all of the same size, whereas in DirPME each chromosome has its own Li and hence chromosomes can be of different sizes. Thus, a chromosome is a vector of d elements, where d = Li × k + 1 and Li is the number of logical memory banks of the ith chromosome. The first element of a chromosome is Li and it can take a value in
(0 .. maxbanks), where maxbanks is the maximum number of logical banks given as an input parameter. The remaining elements of a chromosome can take a value in (0 .. m), where 1 .. m represent the physical memory module ids in the semiconductor vendor memory library. The index 0 represents a void memory (of size zero bits), which gives the memory allocation step flexibility in constructing the logical memories.
Figure 6.7: GA Formulation of Physical Memory Exploration
For decoding a chromosome, first Li is read, and then for each of the Li logical banks the chromosome has k elements. Each of the k elements is an integer used to index into the semiconductor vendor memory library. With the k physical memory modules corresponding to a logical memory bank, a rectangular memory bank is formed. We have used the
same memory allocator (described in Section 6.2.3.1), which performs exhaustive combinations of the k physical memory modules to obtain the largest logical memory with the required word size. In this process it may happen that some of the physical memory modules are wasted. For example, if k=4, the 4 elements are 2K×8bits, 2K×8bits, 1K×8bits and 16K×8bits, and the bit-width requirement is 16 bits, then our memory allocator builds a 5K×16bits logical memory bank from the given 4 memory modules. Note that 11K×8bits is wasted in this configuration; this architecture will have a low fitness, as the memory area will be very high, but it is considered in the exploration process nonetheless. The memory area of a logical memory bank is the sum of the memory areas of all its physical memory modules. This process is repeated for each of the Li logical memory banks. The memory area of a memory architecture is the sum of the areas of all the logical memory banks.
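For contrast with the LME2PME decoding sketched earlier, a hypothetical sketch of the DirPME decoding follows; build_bank stands in for the exhaustive allocator, and the gene layout is as described above.

def decode_dirpme(genes, library, k, word_size, build_bank):
    # First gene is Li, the number of logical banks, so chromosomes vary in length.
    num_banks = genes[0]
    banks, total_area = [], 0.0
    for b in range(num_banks):
        ids = genes[1 + b * k : 1 + (b + 1) * k]
        modules = [library[i] for i in ids if i != 0]   # index 0 is the void memory
        bank = build_bank(modules, word_size)           # largest bank of given width
        if bank is None:
            return None                                 # invalid architecture, rejected
        banks.append(bank)
        total_area += sum(m.area for m in modules)      # wasted bits still cost area
    return banks, total_area

Note the difference from LME2PME: here the area includes every selected module, so architectures that waste bits are penalized through their fitness rather than discarded.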
6.3.2.2 Chromosome Selection and Generation
The strongest individuals in a population are used to produce new offspring. The selection of an individual depends on its fitness; an individual with a higher fitness has a higher probability of contributing one or more offspring to the next generation. In every generation, from the P individuals of the current generation, M new offspring are generated using mutation and crossover operators, resulting in a total population of (P + M). From this, the P fittest individuals survive to the next generation; the remaining M individuals are annihilated. Crossover and mutation operators are implemented in the standard way. The crossover operation is illustrated in Figure 6.7.
6.3.2.3 Fitness Function and Ranking
For each of the individuals, the fitness function computes Marea, Mpow and Mcyc. The value of Mcyc is computed by the data layout using the heuristic explained in Section 3.5. Marea is obtained from the memory mapping block as the sum of the areas of all the memory modules used in the chromosome. Memory power computation is performed in the same way as described in Section
6.2.3.3. Once Marea, Mpow and Mcyc are computed, the chromosomes are ranked as per the process described in Section 6.2.3.3.
6.4 Experimental Methodology and Results
6.4.1 Experimental Methodology
We have used the same set of benchmark programs and profile information as in the earlier chapters. For performing the memory allocation step, we have used TI’s ASIC memory library. The area and power numbers are obtained from the ASIC memory library. The results from the LME2PME method are presented in the following section. After that we present the results from the DirPME framework. Finally we compare the results from LME2PME and DirPME.
6.4.2 Experimental Results from LME2PME
As discussed in Section 6.2, the LME2PME approach performs memory allocation exploration on the set of Pareto-optimal logical memory architectures obtained from the LME, with the objective of obtaining Pareto-optimal physical memory architectures that are interesting from an area, power and performance viewpoint. Figures 6.8, 6.9 and 6.10 present the results of the LME2PME approach for all 3 applications. In these figures, the x-axis represents the total memory area (normalized) required by a physical memory architecture and the y-axis represents the total power (normalized) consumed by the memory accesses. Each plot corresponds to a set of performance operating points from the LME. Note that the performance points are grouped to reduce the number of plots, so that it is easier to analyze the results. The performance band 0 − 0.1 corresponds to an operating point that resolves > 90% of the memory stalls (from the on-chip memory bank conflicts) and hence is a high performance operating point. Similarly, the performance band 0.8 − 0.9 corresponds to an operating point that resolves less than 20% of the memory stalls and hence is a low performance operating point.
For each of the Pareto-optimal logical memory architectures, the memory allocation exploration step constructs a set of physical memory architectures that have different area-power operating points. Note that the performance (number of memory stalls) remains unchanged from the LME step. Each point in Figures 6.8, 6.9 and 6.10 represents a physical memory architecture. It can be observed from these figures that each plot presents a wide choice of area-power operating points in the physical view. Note that the plots are arranged from the high performance band to the low performance band. Each plot starts in a high-power, low-area region and ends in a low-power, high-area region. Observe that all the high performance (low memory stalls) plots operate in a high area-power region, while the low performance (high memory stalls) operating points have lower area-power values. Thus, from a platform design viewpoint, a system designer needs to be clear on the critical factor among area, power and performance. Based on this information, the system designer can select the appropriate set of operating points that are interesting from the system design perspective.
6.4.3 Experimental Results from DirPME
This section presents the experimental results of the multi-objective Memory Architecture Exploration. The objective is to explore the memory design space to obtain all the non-dominated design solutions (memory architectures) that are Pareto optimal with respect to area, power and performance. The Pareto-optimal design points identified by our framework for the voice encoder application are shown in Figure 6.11. It should be noted that the non-dominated points seen by the multi-objective GA are only near optimal, as the evolutionary method may produce a design point in future generations that dominates them. One can observe a set of points for each x-z plane (memory power - memory stalls) corresponding to a given area. These represent the trade-off between power and performance for a given area. The same data is plotted as a 2D graph in Figure 6.12, where architectures that require an area within a specific range are plotted using the same color. These correspond to the points in a set of x-z planes for the area range.
Figure 6.8: Voice Encoder: Memory Architecture Exploration - Using LME2PME Approach
Figures 6.12, 6.13 and 6.14 show the set of non-dominated points, each corresponding to a Pareto optimal memory architecture, for the 3 applications. It can be observed from Figures 6.12, 6.13 and 6.14 that an increase in memory area results in improved performance and power. Increased area translates to one or more of: larger on-chip memory, an increased number of memory banks, and more dual-port memory; all of these are essential for improved performance. We look at the optimal memory architectures derived by our framework. In particular we consider the regions (a) R1, (b) R2 and (c) R3 in each of the figures. The region R1 corresponds to (high performance, high area, high power); R2 corresponds to (low performance, high area, low power); and the region R3 corresponds to (medium performance, low area, medium power). Since the memory exploration design space is very large, it is important to focus on regions that are critical to the targeted applications. The region R1 has memory architectures with large dual-port memories that aid performance but are also a cause of high power consumption. The region R2 has a large
Figure 6.9: MPEG: Memory Architecture Exploration - Using LME2PME Approach
number of memory banks of different sizes. This helps in reducing the power consumption by keeping the data sections with higher access frequency in smaller memory banks. However, the region R2 does not have dual-port memory modules and hence results in low performance, while the presence of a higher number of memory banks increases the area. The region R3 does not have dual-port memory modules and also has a smaller number of on-chip memory banks. Since the memory banks are large, the power per access is high, resulting in higher power consumption. Note that for a given area there can be more than one memory architecture. Also, it can be observed that for a fixed memory area, the design points are Pareto optimal with respect to power and performance. Observe the wide range of trade-offs available between power and performance for a given area. We observe that by trading off performance, the power consumed can be reduced by as much as 70-80%. Table 6.1 gives details on the run-time, the total number of memory architectures explored and the number of non-dominated (near-optimal) points for each application. Note that the number of non-dominated design solutions is also large. Hence,
Figure 6.10: DSL: Memory Architecture Exploration - Using LME2PME Approach
Table 6.1: Memory Architectures Explored - Using DirPME Approach

Application   Time Taken   No. of Arch. Explored   No. of Non-dominated Points
Mpeg Enc      2.5 hours    9780                    670
Vocoder       3.5 hours    13724                   981
DSL           2 hours      7240                    438
to select an optimal memory architecture for a targeted application, the system designer needs to follow a clear top-down approach of narrowing down the region (area, power, performance) of interest and then focusing on specific memory architectures. The table also reports the execution time taken on a standard desktop (Pentium 4 at 1.7 GHz). As can be seen, the execution time for each of these applications is fairly low.
Figure 6.11: Voice Encoder (3D view): Memory Architecture Exploration - Using DirPME Approach
6.4.4 Comparison of LME2PME and DirPME
In this section we compare the non-dominated points from the LME2PME and DirPME approaches. Table 6.2 presents data on the total number of non-dominated points obtained from LME2PME and DirPME. The number of unique non-dominated points listed in column 4 represents the solutions that are globally non-dominated but present in only one of the LME2PME and DirPME approaches. The presence of unique non-dominated points in one approach means that these points are missing in the other approach. The ratio of column 4 to column 3 in a way represents the efficiency of an approach. We observe that this ratio is lower for the DirPME approach than for LME2PME. The number of unique non-dominated points in DirPME increases if the time allotted to DirPME is increased.
Figure 6.12: Voice Encoder: Memory Architecture Exploration - Using DirPME Approach
Further, column 5 of Table 6.2 reports the number of non-dominated points identified by one method that are dominated by points from the other method. For example, for the MPEG encoder benchmark, 709 of the non-dominated design points reported by DirPME are in fact dominated by design points seen by the LME2PME approach. As a consequence, the unique non-dominated points reduce to 26 for this benchmark. In contrast, LME2PME fares better with 175 non-dominated points, of which none are dominated by the DirPME approach. This trend is observed for almost all benchmarks. Thus the experimental data indicate that LME2PME does a better job than DirPME. One concern that still remains is the set of unique non-dominated points identified by DirPME but not by LME2PME. If these design points are interesting from a platform-based design perspective, then to be competitive the LME2PME approach should at least find a close enough design point. In order to quantitatively assess this, we find the minimum Euclidean distance between each unique non-dominated point reported by DirPME
Figure 6.13: MPEG Encoder: Memory Architecture Exploration - Using DirPME Approach
and all the non-dominated points reported by LME2PME. The minimum distance is normalized with respect to the distance between the unique non-dominated point and the origin. This metric in some sense represents how close a non-dominated point in the DirPME approach is to a point in LME2PME. If we can find an alternate non-dominated point in LME2PME at a very close distance to the unique non-dominated point reported by DirPME, then the LME2PME solution space can be considered an acceptable superset. In column 6, we report the average (arithmetic mean) minimum distance of all unique non-dominated points in DirPME to the non-dominated points in LME2PME; a similar metric is reported for the unique non-dominated points identified by LME2PME. We also report the maximum of the minimum distance over all unique non-dominated points in column 7 of Table 6.2. The worst case average distance from the unique non-dominated points is 0.46% for LME2PME and 0.49% for DirPME. Thus for every unique non-dominated point reported by DirPME, the LME2PME method can find a corresponding non-dominated
Figure 6.14: DSL: Memory Architecture Exploration - Using DirPME Approach
point within a distance of 0.46%. In column 7, we report the maximum of the minimum distances of all non-dominated points in DirPME to the non-dominated points in LME2PME; the same metric is presented the other way, i.e. the maximum of the minimum distances of all non-dominated points in LME2PME to the non-dominated points in DirPME. Observe from column 7 that for every non-dominated point that is missing in LME2PME but reported by DirPME, we can find a close enough non-dominated point in LME2PME within at most 4.1% distance of the missing point for the MPEG benchmark. Similarly, for every new non-dominated point reported by LME2PME, we can find a close enough non-dominated point in DirPME within at most 6.2% distance of the missing point. Finally, in column 8, the run-times for all the benchmarks for both approaches are reported. Note that the DirPME approach takes significantly more time than the LME2PME approach. In summary, we observe that LME2PME finds more non-dominated points in general and offers better solution quality for a given time.
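The closeness metric can be sketched as follows, assuming 3-tuples of normalized objective values; this is an illustrative transcription of the computation described above, not the thesis code.

import math

def normalized_min_distance(point, other_front):
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    d_min = min(dist(point, q) for q in other_front)
    return d_min / dist(point, (0.0,) * len(point))   # normalize by distance to origin

def closeness_stats(unique_points, other_front):
    ds = [normalized_min_distance(p, other_front) for p in unique_points]
    return sum(ds) / len(ds), max(ds)   # (average, maximum), as reported in Table 6.2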
Table 6.2: Non-dominant Points Comparison LME2PME-DirPME

Application   Method    Num of non-dom   Unique   No. of      avg min dist   max of min dist   run-time
                        (ND) points      ND pts   dominated   from unique    from unique
                                                  pts         NDs            NDs
Mpeg Enc      LME2PME   175              175      0           0.22%          4.1%              0.45.47
              DirPME    735              26       709         0.37%          6.2%              7.17.52
Vocoder       LME2PME   214              192      22          0.04%          0.23%             0.40.52
              DirPME    558              13       545         0.49%          6.8%              3.08.54
DSL           LME2PME   134              114      20          0.46%          3.75%             0.26.23
              DirPME    1093             12       1081        0.38%          7.08%             4.26.34
However, since for every unique non-dominated point in LME2PME we can find a very close non-dominated point in DirPME, and vice versa, we can conclude that the two approaches perform very closely. Further, the DirPME approach operates on a much bigger search space. Hence we expect the DirPME approach to catch up and fare equally well or even better than the LME2PME approach when sufficient time is given.
6.5 Related Work
Memory architecture exploration is performed in [18] using a low-energy memory design method, referred to as VAbM, that optimizes the memory area by allocating multiple memory banks with variable bit-widths to optimally fit the application data. Their work addresses custom memory design for application-specific hardware accelerators, whereas our work focuses on defining the memory architecture for the programmable processors of an embedded SoC. In [15], Benini et al. present a method that combines memory allocation and data layout to optimize the power consumption and area of the memory architecture. They start from a given data layout and design smaller (larger) bank sizes for the most (least) frequently accessed memory addresses. In our method the data layout is not fixed, and hence we explore the complete design space with respect to area, performance and power. Performance-energy design space exploration is presented in [72]. They present
a branch and bound algorithm which produces Pareto trade-off points representing different performance-energy execution options. In [62], an integrated memory exploration approach is presented which combines scheduling and memory allocation. They consider different speeds of memory access during memory exploration, but they consider only performance and area as objectives and output only one design point. In our work we consider area, power and performance as objectives, and we explore the complete design space to output several hundred Pareto optimal design points. There are other methods for memory architecture exploration for target architectures involving on-chip caches [9, 51, 52, 64]. We compare them with our memory architecture exploration approach for hybrid architectures described in the next chapter. The memory allocation exploration step of the LME2PME approach is an extension of the memory packing or memory allocation process [63]. The memory allocation step typically constructs a logical memory architecture from a set of physical memories with minimizing memory area as the objective. In our approach, however, the memory allocation exploration has to consider two aspects: (a) area optimization, by picking the right set of memory modules, and (b) power optimization, by considering the memory access frequency of the data-sections placed in a logical bank. Note that these are conflicting objectives, and our approach outputs Pareto-optimal design points which present interesting trade-offs between these objectives.
6.6 Conclusions
In this chapter we presented two different approaches for Physical Memory Architecture Exploration. The first method, LME2PME, is a two-step process and an extension of the LME method described in Chapter 4. The LME2PME method offers flexibility by exploring the logical and physical memory architecture design spaces independently. This enables system designers to start the memory architecture definition process without locking in the technology node and semiconductor vendor memory library. The second method is a direct physical memory architecture exploration (DirPME) framework that integrates memory exploration, logical to physical memory
mapping and data layout. Both the LME2PME and DirPME approaches address three of the key system design objectives: (i) memory area, (ii) performance and (iii) memory power. Our approach explores the design space and gives a few hundred Pareto-optimal memory architectures at various system design points in a few hours of run time. We have presented a fully automated approach that meets time-to-market requirements. In the next chapter, we extend the framework to address cache based memory architectures.
Chapter 7
Cache Based Architectures
7.1 Introduction
In the previous chapters, memory architecture exploration frameworks and data layout heuristics were presented for target architectures that are primarily Scratch-Pad RAM (SPRAM) based. Many SoC designs, on the other hand, also include a cache in their memory architecture [77], as caches provide performance benefits comparable to SPRAM but with lower software overhead [83], both at program development time (requiring very little data layout and management effort from the application developer) and at runtime (the movement of data from off-chip memory to cache is transparent and managed by hardware). Hence in this chapter we consider memory architectures with both SPRAM and cache. The work in this chapter also applies to memory architectures with on-chip memories that can be configured both as cache and as Scratch-Pad RAM. We discussed in Chapter 6 how the presence of caches alters the objective functions in memory architecture exploration and in the data layout heuristics. In a cache architecture, if two data sections that are accessed alternately are mapped to the same cache sets, a large number of conflict misses results [33], potentially eliminating any benefit from the cache. Hence it is important to address memory exploration and data layout approaches for cache based architectures. Further, the memory exploration problem becomes more challenging if the target architecture consists of both SPRAM
and cache. In this chapter, we address the memory architecture exploration problem for hybrid memory architectures that have a combination of SPRAM and cache. As discussed in Chapter 4, the evaluation of a memory architecture cannot be separated from the problem of data layout, which physically places the application data in the memory. A non-optimal data layout will yield inferior performance even on a very good memory architecture platform, thereby leading the memory exploration search path in a wrong direction. Hence, before addressing the memory architecture exploration problem for a cache based memory architecture, it is important to have an efficient data layout heuristic. For SPRAM-Cache based architectures, a critical step is to partition the data placement between on-chip SPRAM and external RAM. Data partitioning aims at improving the overall memory sub-system performance by placing in SPRAM those data sections that have the following characteristics: (a) higher access frequency, (b) lifetimes overlapping with many other data sections, and (c) poor spatial access characteristics. Placing all data that exhibit the above characteristics in SPRAM reduces the number of potentially conflicting data sections in the cache; the resulting reduction in cache misses improves overall memory sub-system performance. Typically the SPRAM size is small, and hence it is not possible to accommodate all the data identified for SPRAM placement. Hence, even after data partitioning, there will be a significant number of potentially conflicting data sections placed in external RAM. If these data are not carefully placed in the off-chip RAM, there will be a significant number of cache misses, resulting in lower system performance. Cache conscious data layout addresses this problem and aims at placing data in external RAM (off-chip RAM) with the objective of reducing cache misses. The mapping of data from off-chip RAM to the L1 cache is dictated by the cache size and associativity. Hence data-sections which map to the same cache set can incur a large number of conflict misses when accessed alternately. Careful analysis of data access characteristics and an understanding of the temporal access pattern of the data structures are required in order to come up with a cache conscious data layout that minimizes conflict misses. A number of earlier approaches address the problem of data
layout mapping for cache architectures in embedded systems [17, 21, 22, 41, 50, 53]. In this chapter our aim is to perform memory architecture exploration and data layout in an integrated manner, assuming a hybrid architecture which includes both on-chip SPRAM and data cache (see Figure 7.1). As a first step, we address the data layout problem for Cache-SPRAM based architectures. We address this problem using a two-step approach for each memory architecture: (a) data partitioning, to divide the data between SPRAM and cache with the objective of improving the overall memory sub-system performance and power, and (b) cache conscious data layout, to minimize the number of cache misses within a given external memory address space.
Figure 7.1: Target Memory Architecture
The major contributions of this chapter are:

• an efficient heuristic to partition data between SPRAM and caches based on access frequency, temporal and spatial locality in access patterns;

• a data layout heuristic for data caches that improves run-time and reduces the off-chip memory address space usage; and

• hybrid memory architecture exploration with the objective of improving run-time performance, power consumption and area.

The rest of this chapter is organized as follows. In the following section we give an overview of the proposed method. In Section 7.3 we explain our data partitioning heuristic. Section 7.4 describes the cache conscious data layout heuristic. In Section 7.5, we present the experimental results. We discuss related work in Section 7.6. Conclusions are presented in Section 7.7.
7.2 Solution Overview
Figure 7.2 presents our memory architecture exploration framework. The proposed memory exploration framework consists of two levels. The outer level explores various memory architectures, while the inner level explores the placement of data sections (the data layout problem) to minimize memory stalls. More specifically, the outer level, the memory architecture exploration phase, targets the optimization of the cache and SPRAM sizes and the organization of the cache architecture, including cache-line size and associativity. We use an exhaustive search¹ for memory architecture exploration by imposing certain practical constraints (such as the memory bank size always being a power of 2) on the architectural parameters. Although these constraints limit the search space, they still allow all "practical" architectures to be considered and at the same time help to reduce the run-time of the memory exploration phase drastically. The exploration module takes the application's total data size as input and provides an instance of a memory architecture by defining (a) cache size, (b) cache block size, (c) cache associativity and (d) SPRAM size². Based on the SPRAM size and the application access characteristics, the data partitioning heuristic identifies the data sections to be placed in SPRAM. The remaining data sections are placed in off-chip RAM. The details of the data partitioning heuristic are presented in Section 7.3. The cache conscious data layout heuristic assigns addresses to the data sections placed in off-chip RAM such that these data do not conflict in the cache. The data layout heuristic uses the temporal access information as input to find the optimal data placement; the objective is to minimize the number of cache misses. In Section 7.4 we discuss the proposed cache conscious data layout. The data partitioning heuristic and the data layout heuristic together place the application data in SPRAM and off-chip RAM respectively. From the temporal access information
¹Alternative approaches such as genetic algorithms or simulated annealing could also be used here. However, we found that the exhaustive approach explores all practical memory architectures in a reasonable amount of computation time.
²The proposed framework can easily be extended to consider SPRAM organization parameters such as the number of banks, number of ports, etc. We do not consider these here, as they were extensively dealt with in the earlier chapters.
of the data sections and the access frequency information, the run-time performance in terms of memory stall cycles is computed. The memory stalls include stall cycles due to concurrent accesses to the same single-ported SPRAM bank, stall cycles due to cache misses, and the miss-penalty (the off-chip memory access to fetch the cache block). The software eCacti [45] is used to obtain the power per cache read-hit, read-miss, write-hit and write-miss. The SPRAM power per read access and power per write access are obtained from the semiconductor vendor's ASIC memory library. The area for a given cache architecture is computed using eCacti [45], and the area for the SPRAM is obtained from the memory library.
Figure 7.2: Memory Exploration Framework
The exploration process is repeated for all valid memory architectures, and the area, power and performance are computed for each of them. The last step is to identify the list of "optimal" architectures. Since this is a multi-objective problem, all the solution points are evaluated according to the Pareto optimality condition given by Equation 6.1 in Section 6.2.3.3. According to this equation, if $(M^a_{pow}, M^a_{cyc}, M^a_{area})$ and $(M^b_{pow}, M^b_{cyc}, M^b_{area})$ are the memory power, memory cycles and memory area for memory architectures A and B respectively, then A dominates B if the following expression is true:

$$\big((M^a_{pow} < M^b_{pow}) \land (M^a_{cyc} \le M^b_{cyc}) \land (M^a_{area} \le M^b_{area})\big)$$
$$\lor\; \big((M^a_{cyc} < M^b_{cyc}) \land (M^a_{pow} \le M^b_{pow}) \land (M^a_{area} \le M^b_{area})\big)$$
$$\lor\; \big((M^a_{area} < M^b_{area}) \land (M^a_{cyc} \le M^b_{cyc}) \land (M^a_{pow} \le M^b_{pow})\big)$$
From the set of solutions generated by the memory architecture exploration module, all the dominated solutions are identified and removed. The non-dominated solutions form the Pareto optimal set, which represents the set of good architectural solutions that provide interesting design trade-off points from a power, performance and cost viewpoint.
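A sketch of the exhaustive enumeration with power-of-two constraints is given below; the parameter bounds and the validity check are illustrative assumptions, not the ranges used in the experiments.

from itertools import product

def powers_of_two(lo, hi):
    v = lo
    while v <= hi:
        yield v
        v *= 2

def enumerate_architectures():
    cache_sizes = list(powers_of_two(1024, 65536))   # assumed range: 1 KB .. 64 KB
    block_sizes = list(powers_of_two(8, 128))        # bytes per cache block
    assocs = [1, 2, 4]                               # cache associativity
    spram_sizes = list(powers_of_two(1024, 65536))
    for c, b, a, s in product(cache_sizes, block_sizes, assocs, spram_sizes):
        if b * a <= c:                               # at least one set per way
            yield {"cache": c, "block": b, "assoc": a, "spram": s}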
7.3 Data Partitioning Heuristic
As the cache structure has associated tag overheads, SPRAM consumes much less area than a cache on a per-bit basis [12]. Further, an SPRAM access consumes less power than a memory access that is a cache hit [12]. While the data sections mapped to off-chip memory share the cache space dynamically and in a transparent manner, SPRAM space is assigned to data sections exclusively if dynamic data layout is not used. As a result, the usage of SPRAM is costly from a system perspective, as it gets locked to specific data after the data-section layout, unlike a cache, where the space is effectively reused through dynamic mapping of data by hardware. Hence, SPRAM has to be carefully utilized, and the objective in a memory architecture exploration should be to minimize the SPRAM size. The objective of data partitioning is to identify the data sections that must be placed in SPRAM for the best performance. We refer to a set of data (one or more scalar variables or array variables) that are grouped together as one data-section. A data-section forms an atomic unit that will be assigned a memory address; all data that are part of a data section are placed contiguously in memory. An example of a data section is an array data structure. In order to identify data sections that should be mapped to SPRAM, our heuristic
uses different characteristics of the data section. These include the access frequency, the temporal access pattern and the spatial locality pattern, which are explained below. To model the temporal access pattern of different data sections, a temporal relationship graph (TRG) representation has been proposed in [17]. A TRG is an undirected graph, where nodes represent data sections and an edge between a pair of nodes indicates that two successive references to either of the data sections are interleaved by a reference to the other. The weight associated with an edge (a, b) represents the number of times such interleaved accesses of a and b have occurred in the access pattern. We illustrate these ideas with the help of an example. Let there be 4 data-sections a, b, c and d, and let the access pattern of these data sections in the application be: aaabcbcbcbcdddddaaaaaaacacaacac
Figure 7.3: Example: Temporal Relationship Graph
For this access pattern the TRG is shown in Figure 7.3. Given a trace of data memory references, the weight associated with (a, b), denoted by TRG(a, b), is the number of times that two successive occurrences of a are intervened by at least one reference to b, or vice versa. As an example, for the pattern bcbcbcb, TRG(b, c) = 5: references to c intervene successive references to b on three occasions, and references to b intervene
successive references of c twice, making TRG(b, c) = 5. For the given pattern, TRG(b, d) = 0, as there are no interleaved accesses; hence no edge exists between b and d. The TRG is computed for all the data sections from the address trace collected from an instruction set simulator. We define STRG(i) as the sum of all TRG weights on the edges connected to node i. As an example, from Figure 7.3, STRG(a) = 10. Next, we define a term, the spatial locality factor, which gives a measure of spatial locality in the access trace for each data section. The spatial locality is influenced by the stride in accessing different elements of the data section. The spatial locality factor is computed by determining the number of misses incurred by a data section on a cache with a single block, driven by the filtered access trace that contains only accesses pertaining to that data section; it is the ratio of the number of such misses to the size of the data section. For example, if the accesses to data section b in the filtered trace bbbb correspond to cache blocks b1 b2 b1 b1, where b1 and b2 correspond to different blocks (determined by the cache block size), and the size of data section b is Sb cache blocks, then the spatial locality factor is 3/Sb.
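The two measures can be computed from an access trace along the following lines; this is a hypothetical sketch in which a trace is a sequence of (section, block) pairs, not the thesis implementation.

from collections import defaultdict

def build_trg(trace):
    """TRG edge weights: one count per gap of a section that another intervenes."""
    trg = defaultdict(int)
    seen_since = {}                          # section -> others seen since its last access
    for sec, _ in trace:
        if sec in seen_since:                # repeat access: close the gap
            for other in seen_since[sec]:
                trg[frozenset((sec, other))] += 1
        seen_since[sec] = set()              # open a new gap for sec
        for other in seen_since:
            if other != sec:
                seen_since[other].add(sec)   # sec intervenes the others' gaps
    return trg

def spatial_locality_factor(trace, section, size_in_blocks):
    """Misses of `section` on a one-block cache, divided by its size in blocks."""
    misses, resident = 0, None
    for sec, block in trace:
        if sec == section and block != resident:
            misses += 1
            resident = block
    return misses / size_in_blocks

# TRG(b, c) = 5 for the pattern bcbcbcb, and SLF = 3/Sb for the blocks example.
assert build_trg([(s, None) for s in "bcbcbcb"])[frozenset("bc")] == 5
assert spatial_locality_factor([("b", 1), ("b", 2), ("b", 1), ("b", 1)], "b", 4) == 0.75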
Table 7.1: Input Parameters for Data Partitioning Algorithm

Notation     Description
N            Number of data sections
TRG(a, b)    Temporal access pattern between nodes a and b
STRG(a)      Sum of TRG weights on all edges connected to node a
AF(a)        Access frequency of data section a
SLF(a)       Spatial locality factor of data section a
nSTRG(a)     Normalized STRG(a)
nAF(a)       Normalized AF(a)
nSLF(a)      Normalized SLF(a)
CI(a)        Conflict index of data section a
There are three parameters that control the decision to keep a data section in on-chip SPRAM.

1. Access Frequency (AF): placing the most frequently accessed data sections in SPRAM gives better power consumption and better run-time performance.

2. Temporal Access Characteristics: a data section is said to be conflicting if it is accessed along with many other data sections. Placing the most conflicting data sections in SPRAM reduces the number of cache conflict misses and hence improves the overall memory subsystem performance. This parameter is computed from the TRG; the STRG factor is a direct indication of the extent to which a data section's lifetime overlaps with those of other data sections.

3. Spatial Locality Factor (SLF): data sections with a small spatial locality factor use more cache lines simultaneously and thereby reduce the available cache space for other data. Such data also exhibit less spatial reuse, causing more cache misses, which in turn increases the power consumption due to off-chip memory accesses. Hence, it is both power and performance efficient to place a data section with a small spatial locality factor in SPRAM.

Thus, a frequently accessed data section that conflicts most with the rest of the data and also exhibits little spatial locality is an ideal candidate for SPRAM placement, as this gives the best performance from an overall memory subsystem perspective. For each data section, a conflict index is computed from the three parameters above. The conflict index of the node corresponding to data section s is computed as follows:

$$nSTRG(s) = \frac{STRG(s)}{\sum_{i=1}^{N} STRG(i)} \qquad (7.1)$$

$$nAF(s) = \frac{AF(s)}{\sum_{i=1}^{N} AF(i)} \qquad (7.2)$$

$$nSLF(s) = \frac{SLF(s)}{\sum_{i=1}^{N} SLF(i)} \qquad (7.3)$$

$$CI(s) = nSTRG(s) + nAF(s) + nSLF(s) \qquad (7.4)$$
In the above equations, SLF(s) and AF(s) correspond to the spatial locality factor and the access frequency of s respectively. The terms on the LHS of Equations 7.1, 7.2 and
7.3 are normalized factors. The higher the conflict index, the more suitable the data section is for SPRAM placement. Our data partitioning heuristic is given in Figure 7.4. The greedy heuristic sorts the data sections by conflict index and assigns the data section with the highest conflict index to SPRAM. The corresponding node is removed from the TRG, and the conflict index for the remaining data sections is recomputed. Note that the above step is performed for every data section identified for SPRAM placement. This process is repeated either until the SPRAM space is full or until there are no more data sections to be placed.
7.4 Cache Conscious Data Layout
7.4.1 Overview
The data partitioning step places the most conflicting data in SPRAM and thereby reduces the possible conflict misses in the cache. However, the SPRAM is typically very small, and only a few data-sections will have been placed in it³. The remaining data sections still need to be placed carefully to reduce cache misses. In this section we discuss the cache conscious data layout. The problem of cache-conscious data layout is to find an optimal data placement in off-chip RAM with the following objectives: (a) to reduce the number of cache misses and (b) to reduce the address space used in off-chip RAM. In other words, the objective is to reduce the "holes" in off-chip RAM after placement. By this, we mean that the data sections are placed in the off-chip RAM in such a manner that the gaps left between data sections to reduce conflict misses are minimized; these gaps lead to wasted memory space and hence increase hardware cost. To the best of our knowledge, reducing cache misses (the first objective) has been the sole objective targeted by all earlier published data layout approaches [17, 22, 41, 50, 53]. But it is very important to consider objective (a) in the context of objective (b) for the following reasons.

³As mentioned earlier, data placement within the SPRAM can be done in a subsequent phase using any of the data layout methods discussed in Chapter 3. We do not experiment with this, as it has been extensively dealt with in the previous chapters.
Algorithm: SPRAM-Cache Data Partitioning
Inputs:
  N = number of data sections
  Access frequency of all data sections
  Temporal Relationship Graph (TRG)
  Spatial Locality Factor (SLF)
  Data section sizes
Output: List of data sections to be placed in SPRAM
begin
  1. Compute the access frequency per byte for all data sections
  2. Normalize the access frequency per byte for all data sections
  3. for i = 0 to N-1
     3.1 compute STRG(i); sumSTRG += STRG(i);
  4. for i = 0 to N-1
     4.1 nSTRG(i) = STRG(i)/sumSTRG;
  5. for i = 0 to N-1
     5.1 compute SLF(i); sumSLF += SLF(i);
  6. for i = 0 to N-1
     6.1 nSLF(i) = SLF(i)/sumSLF;
  7. for i = 0 to N-1
     7.1 conflict-index(i) = nSTRG(i) + nAF(i) + nSLF(i);
  8. sort the data sections in descending order of conflict-index
  9. while (available space in SPRAM)
     9.1 identify the data section s with the highest conflict index
     9.2 place s in SPRAM if it fits within the available space
     9.3 update the SPRAM available space to account for the above placement
     9.4 remove s from the TRG
     9.5 recompute STRG for the remaining nodes in the TRG
     9.6 recompute the conflict index with the newly updated STRG
  10. exit
end
Figure 7.4: Heuristic Algorithm for Data Partitioning
• For SoC architectures with an instruction cache and a data cache that share the same off-chip RAM, a data layout approach that optimizes only the data cache misses, without considering optimization of the off-chip RAM address space, will use up too much address space by spreading the data placement, leaving many holes. This places severe constraints on code placement, requiring the code to be placed across the holes and in the remaining off-chip RAM, which may result in additional instruction cache misses. Hence, there is a chance that all the gains achieved by optimizing the data cache misses are lost.

• A data layout approach which optimizes the data placement in off-chip RAM without any holes will be independent of the instruction cache placement. Hence, the architecture exploration of the data cache can be done independently of the instruction cache. For example, an application with 96K of data will have around 2700 hybrid architectures that are worth exploring. If the code placement is not independent of the data layout and the code segments are placed in the holes created, then the memory exploration process needs to consider both instruction and data cache configurations together. This will increase the number of architectures considered; in such a scenario, the number of architectures explored could increase to 50000+. Hence, it is important to design a data layout algorithm that is independent of the instruction cache.

We formulate the cache conscious data layout problem as a graph partitioning problem [38]. Inputs to the data layout algorithm are (i) the application's data section sizes and (ii) the Temporal Relationship Graph. The data layout algorithm is explained in a block diagram in Figure 7.5. The first step in the data layout problem is modelled as a graph partitioning problem, where data sections are grouped into disjoint subsets such that the memory requirement of the data sections in each subset is less than the cache size. More specifically, the first step is a k-way graph partitioning, where k = ⌈application data size / cache size⌉. The data sections in each partition are selected such that they have intervening accesses and hence can cause potential conflict misses. Thus the output of the graph partitioning step is k partitions, with each partition
Figure 7.5: Cache Conscious Data Layout
having a set of data sections that conflict among themselves the most, with each partition size less than the cache size. Since each of the k partitions is smaller than the cache size, each of these partitions can be mapped to an off-chip RAM address space that corresponds to one cache page. This step eliminates all the conflicts between data sections that are in the same partition. The graph partitioning method is discussed in detail in Section 7.4.2. The next step in the data layout is to minimize the possible conflicts between data sections that are in two different partitions. This is handled by the offset-computation step; the details of the offset computation are presented in Section 7.4.3. Once the offset-computation step assigns a cache-block offset to each data section, the address-assignment step allocates unique off-chip addresses to all the data sections. Finally, using the address assignment, the number of cache misses and the power consumed for cache
and off-chip memory accesses are computed; these are used for identifying the Pareto-optimal solutions. The following subsections detail the graph partitioning heuristic and the offset computation heuristic.
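The partition count k of the first step follows directly from the application data size and the cache size. A one-line sketch, using the 96K data footprint mentioned above together with a hypothetical 8KB cache:

```python
import math

def num_partitions(data_size_bytes: int, cache_size_bytes: int) -> int:
    """k for the k-way partitioning: one partition per cache page."""
    return math.ceil(data_size_bytes / cache_size_bytes)

# e.g., 96KB of application data with an 8KB cache gives 12 partitions
assert num_partitions(96 * 1024, 8 * 1024) == 12
```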
7.4.2 Graph Partitioning Formulation
In this section we explain the graph partitioning heuristic, which is a generalisation of Kernighan-Lin [38] and operates on the temporal relationship graph for the data sections that need to be placed in off-chip RAM. Note that this excludes all data sections that have been mapped to SPRAM. The temporal relationship graph is G = {V, E, s, w}, where V is the set of vertices representing data sections and E is the set of edges between pairs of data sections, each representing a temporal access conflict. Further, the functions s and w are associated respectively with the nodes and edges of the TRG: s(u) represents the size of the data section associated with a node u, and w(u, v) represents the number of temporal access conflicts between a pair of nodes u and v. The weight function w(u, v) is the same as TRG(u, v), but restricted to the data sections that need to be assigned to the off-chip RAM. The graph partitioning problem aims at dividing G into m disjoint partitions. An m-way partition of G is a collection of subsets G_i = {V_i, E_i} such that

• the subsets are disjoint: V_i ∩ V_j = ∅ for i ≠ j;

• ⋃_{i=1}^{m} V_i = V; and

• every edge e = (u, v) ∈ E is in G_i iff u ∈ G_i and v ∈ G_i.

The objective of the graph partitioning step is to group the nodes such that the sum of the weights on the internal edges is maximized. The objective function to be maximized is given in Equation (7.5), with the constraint given in Equation (7.6):

$$\max \sum_{i} \sum_{e_j \in E_i} w(e_j) \qquad (7.5)$$

$$\sum_{u_j \in V_i} s(u_j) \le \text{cache-size}, \quad \forall \, G_i \qquad (7.6)$$
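As a small sketch of how a candidate partitioning can be scored against Equations (7.5) and (7.6), assuming the TRG is represented as a dict mapping an unordered pair of section names to w(u, v):

```python
from itertools import combinations

def internal_cost(partition, trg):
    """Sum of w(e) over edges with both endpoints in the partition."""
    return sum(trg.get(frozenset(pair), 0)
               for pair in combinations(partition, 2))

def is_feasible(partition, sizes, cache_size):
    """Constraint (7.6): the partition must fit in one cache page."""
    return sum(sizes[u] for u in partition) <= cache_size

def objective(partitions, trg):
    """Equation (7.5): total internal edge weight, to be maximized."""
    return sum(internal_cost(p, trg) for p in partitions)
```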
An edge e_ext = (u, v) is said to be an external edge for a partition G_i if u ∈ G_i and v ∉ G_i; i.e., if one of the nodes connected by the edge is in partition G_i and the other is not. Similarly, an edge e_int is said to be an internal edge if both the nodes it connects are in partition G_i. The sum of the weights on the external edges of partition G_i is referred to as its external cost, $E_i = \sum_{e_{ext} \in G_i} w(e_{ext})$. The sum of the weights on the internal edges of partition G_i is referred to as its internal cost, $I_i = \sum_{e_{int} \in G_i} w(e_{int})$. The total external cost is $E = \sum_i \sum_{e_{ext} \in G_i} w(e_{ext})$. Thus the objective of the partitioning problem is to find a partition with minimum external cost. Alternatively, the graph partitioning problem can be formulated as maximizing the total internal cost, i.e., $\sum_i \sum_{e_{int} \in G_i} w(e_{int})$, subject to the constraint $\sum_{u_j \in G_i} s(u_j) \le \text{cache-size}$ for all G_i.

The optimal partitioning problem is NP-complete [38, 66]. There are a number of heuristic approaches [26, 47] to this problem, including the well-known Kernighan-Lin heuristic [38] for two partitions. We extend the heuristic proposed in [38, 66] to solve our problem. The Kernighan-Lin heuristic aims at finding a minimal external cost partition of a graph into two equally sized sub-graphs. It achieves this by starting with a random partition and repeatedly swapping the pair of nodes that gives the maximum gain, where the gain is computed from the difference between external and internal costs. Let us consider two nodes a and b present in two different sub-graphs A and B respectively. We define the external cost (ECost) of a as $E_a = \sum_{x \in B} w(a, x)$ and the internal cost (ICost) of a as $I_a = \sum_{y \in A} w(a, y)$, for each a ∈ A. Similarly, the ECost and ICost of b are defined as E_b and I_b respectively. Let D_a = E_a − I_a be the difference between ECost and ICost for each a ∈ A. A result proved by Kernighan and Lin [38] shows that, for any a ∈ A and b ∈ B, if they are interchanged, the reduction in partitioning cost is given by R_ab = D_a + D_b − 2 × w(a, b). The nodes a and b are interchanged to partitions B and A respectively if R_ab > 0. In [66], the graph partitioning heuristic is generalized to an m-way partition. It starts with a random set of m partitions, picks any two of the partitions, and applies the Kernighan-Lin heuristic repeatedly on this pair until no more profitable exchanges are possible. Then these two partitions are marked as pair-wise optimal, and the algorithm picks two other partitions to apply the heuristic. This process is repeated until all the
partitions are pair-wise optimal. We have adapted the algorithm of [66] and added additional constraints to make it work for our problem; a sketch of the constrained swap test follows this list. The main constraints are as below:

1. $\sum_{a \in G_i} s(a) \le \text{cache-size}$ for all partitions;

2. if a data-section size s(a) > cache-size, then this data section is placed in a partition by itself and marked optimal; and

3. nodes a and b are interchanged to partitions B and A respectively only if R_ab > 0, and only if $\sum_{a \in A} s(a) < \text{cache-size}$ and $\sum_{b \in B} s(b) \le \text{cache-size}$.
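A minimal sketch of the constrained swap test, with the size checks read as applying to the two partitions after the exchange (one plausible reading of constraint 3):

```python
def w(trg, u, v):
    """TRG edge weight between sections u and v (0 if no edge)."""
    return trg.get(frozenset((u, v)), 0)

def gain(a, b, A, B, trg):
    """R_ab = D_a + D_b - 2*w(a, b), with D = ECost - ICost."""
    Da = (sum(w(trg, a, x) for x in B)
          - sum(w(trg, a, y) for y in A if y != a))
    Db = (sum(w(trg, b, y) for y in A)
          - sum(w(trg, b, x) for x in B if x != b))
    return Da + Db - 2 * w(trg, a, b)

def can_swap(a, b, A, B, sizes, trg, cache_size):
    """Allow the exchange only if it helps and both partitions still fit."""
    if gain(a, b, A, B, trg) <= 0:
        return False
    size_A = sum(sizes[u] for u in A) - sizes[a] + sizes[b]
    size_B = sum(sizes[u] for u in B) - sizes[b] + sizes[a]
    return size_A <= cache_size and size_B <= cache_size
```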
The output of the graph partitioning step is a collection of sub-graphs that maximizes the internal cost, minimizes the external cost, and ensures that no partition is larger than the cache size4. Thus, each partition can be placed in an off-chip RAM address region that maps to one cache page, such that none of the data sections in the same partition conflict in the cache. We are now left with optimizing the cache conflicts that might arise between data sections belonging to two different partitions. Since the external cost has already been minimized, the number of such conflicts will already be small. The offset computation step, described in the following subsection, aims at reducing conflicts caused by data sections belonging to different partitions.
7.4.3 Cache Offset Computation
The cache offset computation step aims at reducing cache conflict misses between data sections that are part of two different partitions. Each partition is placed in the off-chip RAM address space that corresponds to one cache page. It may be noted that the ordering of the partitions does not have any impact on the cache misses. For each data section in a partition, a cache-block offset needs to be assigned, which in turn determines a unique off-chip memory address for the data section.

4 Obviously, a partition containing a data section whose size is larger than the cache size will not obey this property. But such a data section can be considered to form l = ⌈data section size / cache-size⌉ consecutive partitions, each less than or equal to the cache size.
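As a small illustration of the last point, a unique off-chip address can be derived from a section's partition index and its cache-line offset, assuming the partitions are laid out in consecutive cache-page-sized regions (their ordering, as noted above, does not affect misses):

```python
def offchip_address(base: int, partition_idx: int, line_offset: int,
                    cache_size: int, block_size: int) -> int:
    """Start address of a section assigned a given cache-line offset."""
    return base + partition_idx * cache_size + line_offset * block_size

# e.g., a 4KB cache page with 32-byte blocks: a section assigned
# line 4 of partition 2 starts at base + 2*4096 + 4*32 = base + 8320
assert offchip_address(0, 2, 4, 4096, 32) == 8320
```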
Algorithm: Offset Computation Heuristic
Inputs:
  TRG_blk values for all the data blocks
  External costs for all the partitions (E_i)
  Internal costs for all the partitions (I_i)
  External costs E_{i,uj} for each node u_j in a partition G_i
  Cache configuration
  Data section sizes
Output:
  Offsets assigned to each of the data sections
begin
1. Sort the partitions in decreasing order of external cost
2. for i = 1 to k partitions
   2.1 pick the partition G_i with the highest external cost
   2.2 sort its data sections in descending order of external cost E_{i,uj}
   2.3 for all data sections in G_i
       2.3.1 pick the data section u_j with the highest E_{i,uj}
       2.3.2 evaluate the placement cost of u_j for each available cache line of the target cache configuration:
             2.3.2.1 place u_j in the available cache line, with the constraint that the data section must be placed contiguously
             2.3.2.2 compute the cost of placement using the TRG_blk information for all the data blocks already placed
             2.3.2.3 store the cost of placement C_l for cache line l
             2.3.2.4 repeat the last three steps for all possible cache lines
       2.3.3 find the cache line l that gives the minimal cost
       2.3.4 assign l as the starting offset for u_j
       2.3.5 mark the cache lines from l to l + size(u_j)/block-size as unavailable for other data sections in G_i
   2.4 end for
3. end for
4. placement complete
end
Figure 7.6: Heuristic Algorithm for Offset Computation
To decide the offset that gives the least number of conflicts, we compute the placement cost for all possible placements of the data section inside a cache page. To compute the placement cost, we use a fine-grained version of the TRG. Note that the TRG computed in Section 7.3 is at the granularity of a data section, but to determine at which offset to place a data section, the temporal access pattern needs to be computed at a finer granularity. We illustrate these ideas with the help of an example. Let there be two data sections a and b of size 128 bytes and 64 bytes respectively, and consider the following access pattern:

a[0] b[0] a[60] b[1] a[61] b[2] a[62] b[3]

For this access pattern, TRG(a, b) is 6, as explained in Section 7.3; basically, data sections a and b are accessed 6 times in an interleaved way. However, for a direct-mapped cache of size 4KB with a 32-byte block size, placing a at address k and b at off-chip address k + 4KB will not result in any conflict misses, even though TRG(a, b) = 6. This is because a[60], a[61] and a[62] map to one cache line (C + 1), while a[0], b[0], b[1], b[2] and b[3] map to cache line C. In contrast, if a is placed at address k and b at address k + 4KB + 32B, the placement results in 5 conflict misses. Hence, to determine the cost of placing a data section, in terms of conflict misses, the TRG values are needed at a finer granularity. For the above example, if we keep the granularity at one cache block, then data section a is divided into 4 data blocks and data section b into 2 data blocks. We define a new term TRG_blk that represents the temporal access pattern among data blocks; this is similar to the approach described in [17]. The above access sequence then becomes a0, b0, a1, b0, a1, b0, a1, b0, where a0 and a1 represent the first two (cache-block-sized) blocks of data section a and b0 represents the first block of data section b. For this example, TRG_blk consists of the nodes a0, a1, a2, a3, b0 and b1; TRG_blk(a1, b0) = 5 and all other TRG_blk values are 0. We use the TRG_blk values to compute the cost of placement, C(s, l), for a data section s at a cache offset l.

The offset computation algorithm is presented in Figure 7.6. To begin with, the partitions are ordered based on their total external cost (E_i). The partition G_i with the highest external cost is selected first for offset computation. Data sections that are part
of partition G_i are ordered based on the external cost of the corresponding nodes in G_i. The data section u_j with the highest external cost (E_{i,uj}) is taken up for offset computation first. Data section u_j is placed in each of the allowable cache lines and the placement cost is computed with the help of TRG_blk. Here, by allowable, we mean that there are enough contiguous free cache lines in the cache page to accommodate data section u_j. For example, if the data section size is 128 bytes and the cache block size is 32 bytes, then a feasible cache line means that 4 contiguous lines are free. Note that at this point no offset is assigned to the data section u_j. The cost of placement C(u_j, l) for data section u_j is computed for every allowable cache line l from 1 to N_l, where N_l is the total number of cache lines. The cache line l that has the minimum cost is assigned to data section u_j, and the cache lines from l to l + size(u_j)/line-size are marked as full so that they are not available to any other data section in G_i. This restriction ensures that the cache offsets for all data sections in a partition G_i are assigned within one cache page, which in turn ensures that the amount of external address space used is close to the application data size. The above process is repeated for all data sections in partition G_i. After this, the partition G_{i+1} with the next highest external cost is selected for offset computation. This process continues until all partitions are handled.
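The numbers in the worked example above pin down one counting rule for the TRG that is consistent with both TRG(a, b) = 6 and TRG_blk(a1, b0) = 5: every re-access of a node contributes one conflict with each distinct node touched since its previous access. The sketch below implements that rule; the rule itself is our reading of Section 7.3, not a verbatim reproduction of it.

```python
from collections import defaultdict

def build_trg(trace):
    """Count, for every re-access of a node u, one conflict with each
    distinct node v accessed since u's previous access."""
    trg = defaultdict(int)
    seen = set()
    since_last = defaultdict(set)  # u -> nodes touched since u's last access
    for u in trace:
        if u in seen:
            for v in since_last[u]:
                trg[frozenset((u, v))] += 1
        seen.add(u)
        since_last[u] = set()
        for k in since_last:
            if k != u:
                since_last[k].add(u)
    return trg

# Block-level trace of the example: a0 b0 a1 b0 a1 b0 a1 b0
trg_blk = build_trg(["a0", "b0", "a1", "b0", "a1", "b0", "a1", "b0"])
assert trg_blk[frozenset(("a1", "b0"))] == 5
assert trg_blk.get(frozenset(("a0", "b0")), 0) == 0
```

The same routine applied to the section-level trace a b a b a b a b reproduces TRG(a, b) = 6.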
7.5 Experimental Methodology and Results

7.5.1 Experimental Methodology
We have used Texas Instruments' TMS320C64X processor for our experiments. This processor has a 16KB data cache, and we have used Texas Instruments' Code Composer Studio (CCS) environment for obtaining profile data and data memory address traces, and also for validating data-layout placements. We have used 3 different applications - AAC (Advanced Audio Codec), MPEG video encoder and JPEG image compression - from MediaBench [43] for performing the experiments. We compute the TRG, sumTRG, and the spatial locality factor from the data memory address traces obtained from CCS. We used eCacti [45] to obtain the area and power numbers for the different cache configurations. First, we
report experimental results demonstrating the benefits of our cache-conscious data layout method. Subsequently, in Section 7.5.4, we report the results pertaining to cache-SPRAM memory architecture exploration.
7.5.2 Cache-Conscious Data Layout
In this section we present results of our cache-conscious data layout and compare them with the approach proposed by Calder et al. [17]. We have used the above 3 MediaBench applications and 4 different cache sizes. In this experiment, for all cache sizes, we have used a 32-byte cache block and a direct-mapped cache configuration. Table 7.2 presents the results of the data layout. Column 4 in Table 7.2 presents the number of cache misses incurred when the data layout approach of [17] is used, and Column 5 gives the number of cache misses incurred when our data layout approach is applied. Our approach performs consistently better and reduces the number of cache misses, especially for AAC and MPEG; it achieves up to a 34% reduction in cache misses (for AAC with a 16KB cache). Also, our approach consumes an off-chip memory address space that is very close to the application data size. This is by construction of the graph-partitioning approach, which avoids gaps during data layout as explained in Section 7.4. In contrast, Calder's approach [17] consumes 1.5 to 2.6 times the application data size in off-chip address space to achieve the performance given in Table 7.2. This is a significant advantage of our approach, as increased off-chip address space implies increased memory cost for the SoC. In Table 7.3, we present the results of our approach for different cache configurations (direct-mapped, 2-way and 4-way set-associative caches). Note that these experiments are performed with a cache-only architecture and no SPRAM. Observe that for all the applications, the reduction in misses is significant for 2-way and 4-way set-associative caches. However, for the 4KB cache configuration for MPEG, the reduction in cache misses is small; this is due to the large data set (footprint) requirement of MPEG. Also, observe that the data set of JPEG is much smaller, and hence a direct-mapped 16KB cache or a 4-way set-associative 8KB cache resolves most of the conflict misses.
Table 7.2: Data Layout Comparison

Application  Cache Size  Number of        Cache misses,  Cache misses,    Improvement (%)
                         memory accesses  Calder [17]    Graph-Partition
                                                         (our approach)
AAC          32KB        43 Million       0              0                0
             16KB                         14746          9711             34
             8KB                          155749         128322           17
             4KB                          446912         385795           14
MPEG         32KB        92 Million       17204          14574            15
             16KB                         275881         224278           19
             8KB                          2332008        2314398          1
             4KB                          11919814       11919814         0
JPEG         32KB        38 Million       0              0                0
             16KB                         0              0                0
             8KB                          2350           2112             10
             4KB                          10220          10294            -1
Table 7.3: Data Layout for Different Cache Configurations

                         Number of cache misses
Application  Cache Size  Direct Mapped  2-Way Set-Associative  4-Way Set-Associative
AAC          32KB        0              0                      0
             16KB        9711           5252                   4111
             8KB         128322        66741                  53732
             4KB         385795        314122                 260110
MPEG         32KB        14574         2122                   712
             16KB        224278        123632                 78412
             8KB         2314398       1863214                1301257
             4KB         11919814      10121122               9884788
JPEG         32KB        0             0                      0
             16KB        0             0                      0
             8KB         2112          112                    10
             4KB         10294         4300                   3200
7.5.3 Cache-SPRAM Data Partitioning
In this section we present the results of our cache-SPRAM data partitioning method. Figures 7.7, 7.8 and 7.9 present the results of the data partitioning heuristic. In these figures, the x-axis represents the SPRAM size and the y-axis represents the performance in terms of memory stalls. Experiments were performed for three different cache sizes (4KB, 8KB and 16KB). For each of the cache sizes, the SPRAM size is increased from 0 to the application data size. For each memory configuration, data partitioning and cache-conscious data layout are performed to obtain the memory stalls. The memory stalls refer to the number of stalls due to external memory accesses caused by cache misses.
Figure 7.7: AAC: Performance for different Hybrid Memory Architecture
Observe that for all the applications, when the SPRAM size is increased, a significant performance improvement is achieved for all the cache sizes. However, the performance improvement is more pronounced with 4KB and 8KB caches than with 16KB caches. Observe that for AAC, an 8KB cache with 24KB of SPRAM gives the same performance as a 16KB cache with 4KB of SPRAM, while the 16KB cache with 4KB of SPRAM consumes more area than the 8KB cache with 24KB of SPRAM. Similarly, for JPEG, a 4KB cache with 20KB of SPRAM gives the same performance as a 16KB cache with no SPRAM. This gives the designers an architectural choice to select a configuration that suits
Figure 7.8: MPEG: Performance for different Hybrid Memory Architecture
Figure 7.9: JPEG: Performance for different Hybrid Memory Architecture
the target application. As we discussed earlier, both caches and SPRAM have their own advantages. For instance, caches offer hardware-managed, reusable on-chip memory space that provides feature extendability to the system, whereas SPRAM provides predictable
performance and lower power consumption. Hence, the selection of an architecture needs careful analysis from different viewpoints. We now present the power consumption numbers for all the applications in Figures 7.10, 7.11 and 7.12. In these figures, the x-axis represents the SPRAM size and the y-axis represents the total power consumed by the memory subsystem. There are three plots, one for each cache size (4KB, 8KB, and 16KB). As expected, the power numbers for the 16KB cache configurations are higher than for the other two. However, in all the figures, observe that the power numbers converge for higher SPRAM sizes. This is because, for higher SPRAM sizes, most of the application's critical data sections are mapped to SPRAM and hence not much activity happens in the cache; the power numbers are then mostly influenced by the SPRAM accesses. Observe that for the 16KB cache, the power numbers are higher for smaller SPRAM sizes and gradually decrease as the SPRAM size increases.
Figure 7.10: AAC: Power consumed for different hybrid memory architecture
In summary, the system designer needs to look at the performance graphs for his application, similar to those presented in Figures 7.7, 7.8 and 7.9, and also study the power graphs, similar to those presented in Figures 7.10, 7.11 and 7.12, to arrive at a suitable
Figure 7.11: MPEG: Power consumed for different hybrid memory architecture
Figure 7.12: JPEG: Power consumed for different hybrid memory architecture
architecture. One more dimension that is not covered here is the memory area. The next section presents memory architecture exploration, where the system designer can look
at the memory design space from a power, performance and area viewpoint.
7.5.4 Memory Architecture Exploration
In this section we present the results of our memory architecture exploration. As mentioned in Section 7.2, we explore the cache-SPRAM solution space with the following parameters: (a) cache size, (b) cache block size, (c) cache associativity and (d) SPRAM size. Again, we have used the same 3 benchmark applications and, as mentioned earlier, an exhaustive search method for memory exploration, varying the above parameters. We start with no SPRAM and a 4KB cache and keep increasing the cache size up to the application data size (88KB, 108KB and 40KB for AAC, MPEG and JPEG respectively). For each cache size explored, we then increase the SPRAM size from 0 to the application data size in 4KB steps. Also, for each cache configuration, we vary the block size from 8 bytes to 64 bytes in 8-byte steps and the associativity from 1 to 4. Based on the application data size, the number of memory configurations evaluated varies from 1200 to 2800. From the total set of memory configurations evaluated, we compute the non-dominated solutions based on the Pareto-optimality criteria explained in Section 7.2.

Figures 7.13, 7.14, and 7.15 present the non-dominated solutions for AAC, MPEG and JPEG respectively. In these figures, the x-axis represents the number of memory stall cycles and the y-axis represents the power consumption. We have presented the power vs. performance graphs for different area bands5. We observe from Figure 7.13 that as the area band increases, we get better power and performance; note that the solution points converge from the top-right portion of the graph (a high-power, low-performance region) to the lower-left portion (a low-power, high-performance region) as the area is increased. In Figure 7.14, the solution in the top right corner has a memory configuration of a 4KB direct-mapped cache with 32-byte cache blocks and no SPRAM. As we can observe, this is a very conservative architecture, giving low performance and high power consumption. On the other hand, the solution in the lower left corner has a memory configuration of an 8KB, 2-way set-associative cache with 16-byte cache blocks and 128KB of SPRAM. This is a very high-end architecture that consumes a lot of area but gives the best performance and power consumption. Thus, the set of Pareto-optimal design points gives designers a critical view from which to pick memory configurations that suit the application and system requirements.

5 Again, due to proprietary reasons, we present normalized areas for the different configurations instead of absolute values.
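The non-dominated filtering itself is straightforward; a minimal sketch over the three measured objectives (all to be minimized), with the Solution fields as illustrative names:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Solution:
    stalls: int    # memory stall cycles
    power: float   # memory subsystem power
    area: float    # normalized memory area

def dominates(a: Solution, b: Solution) -> bool:
    """a dominates b: no worse in every objective, better in at least one."""
    no_worse = (a.stalls <= b.stalls and a.power <= b.power
                and a.area <= b.area)
    strictly_better = (a.stalls < b.stalls or a.power < b.power
                       or a.area < b.area)
    return no_worse and strictly_better

def pareto_front(solutions):
    """Keep only the configurations not dominated by any other."""
    return [s for s in solutions
            if not any(dominates(o, s) for o in solutions if o is not s)]
```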
Figure 7.13: AAC: Non-dominated Solutions
7.6 Related Work

7.6.1 Cache Conscious Data Layout
Many earlier works propose source-code-level transformations with the objective of improving data locality. Loop-transformation-based data locality optimizing algorithms are proposed by Wolf et al. [82]; they describe a locality optimization algorithm that applies a combination of loop interchange, skewing, reversal and tiling to improve the data locality of loop nests. Earlier works [21, 37, 53] propose source-level
Figure 7.14: MPEG: Non-dominated Solutions
Figure 7.15: JPEG: Non-dominated Solutions
transformations such as array tiling, re-ordering of data structures and loop unrolling to improve cache performance. In contrast, we focus on optimizing data placement at the object-module level, without any code modifications. We emphasize that this is important, as the application development flow in embedded systems typically involves the integration of many IPs whose source code may not be available. The data layout optimization proposed by Panda et al. addresses the scenario of data arrays placed at off-chip RAM addresses that are multiples of the cache size, which results in thrashing due to cache conflict misses in a direct-mapped cache. They propose introducing dummy words (padding) between the data arrays to avoid cache conflicts. Data layout heuristics that aim at minimizing cache conflict misses have been proposed in [17, 41]. The problem has been formulated as an Integer Linear Program (ILP) in [41]; the authors also propose a heuristic method to avoid the long run-times of ILP solvers. Calder et al. [17] use a Temporal Relationship Graph (TRG) that captures the temporal access characteristics of data and propose a greedy algorithm for cache-conscious data layout. While the approaches in [17, 41] target only the minimization of conflict misses en masse, our approach aims at minimizing conflict misses within a certain off-chip memory address space. The constraint of working within a certain external memory address space is very important for memory architecture exploration, since it makes the instruction-cache performance independent of the data cache for architectures where the external memory address space is common to both data and instruction caches, thereby reducing the memory architecture search space. Chilimbi et al. [22] propose two cache-friendly placement techniques - coloring and clustering - that improve a data structure's spatial and temporal locality and thereby improve cache performance. Their approach works mainly for trees and tree-like data structures. They also propose a cache-conscious heap allocation method which allocates memory close to contemporaneously accessed data objects based on the programmer's input; this reduces the number of conflict misses. However, this approach is expensive in terms of performance, as run-time decisions need to be taken. Embedded systems are
performance sensitive, and hence the use of dynamic heap objects is usually discouraged; any additional run-time overhead in memory allocation will take away the benefit that comes from reduced conflict misses. Further, critical sections of embedded applications are typically developed in hand-written assembly language, so modifications to the layout of structures cannot be handled completely by compilers. A greedy data layout heuristic is proposed in [65] that optimizes energy consumption in horizontally partitioned cache architectures. Their approach uses the observation that the energy consumed per access in a small cache is less than that in a larger cache. Hence, for cache architectures that have a main cache and a smaller mini-cache, the authors show that a simple greedy data partitioning heuristic, which partitions data between the main cache and the mini-cache, performs well in reducing the overall energy consumption of the memory subsystem. Our work addresses a different target memory architecture, with SPRAM and cache. Palem et al. [50] propose a compile-time data remapping method to reorganize record data types with the objective of improving the temporal access characteristics of data objects. Their method analyzes program traces to mark the data objects of records whose access characteristics and field layout do not exhibit temporal locality, and they propose a heuristic algorithm to remap the fields of the data objects marked during the analysis phase. The heuristic remaps the fields of data objects to improve temporal locality and thus avoids additional cache misses. Their approach is very efficient for record-type data structures like linked lists and trees. However, it requires compiler support to reorganize the fields of data structures, along with the corresponding code changes to access the remapped fields, whereas our work focuses on the layout of data structures that do not require code changes - an important constraint in the IP-based embedded application development flow.
7.6.2 SPRAM-Cache Data Partitioning
An Integer Linear Programming (ILP) based approach to partitioning instruction traces between SPRAM and instruction cache, with the objective of reducing energy consumption,
has been proposed in [80]. Our work focuses on data partitioning between SPRAM and data cache. Further, we consider DSP applications, which typically have multiple simultaneous memory accesses leading to parallel and self conflicts. To the best of our knowledge, only [53] addresses data partitioning for SPRAM-cache based hybrid architectures. They present a data partitioning technique that places data into on-chip SRAM and data cache with the objective of maximizing performance. Based on the lifetimes and access frequencies of array variables, the most conflicting arrays are identified and placed in scratch-pad RAM to reduce the conflict misses in the data cache. This work addresses the problem of limiting the number of memory stalls by reducing conflict misses in the data cache through efficient data partitioning. They also demonstrate memory exploration of hybrid architectures with their proposed data partitioning heuristic, and they propose a model to estimate the number of cycles spent in cache accesses. However, their memory exploration framework does not have an integrated cache-conscious data layout. Our approach performs data partitioning based on three factors: (i) access frequency, (ii) temporal access characteristics and (iii) spatial access characteristics. Our proposed method is a comprehensive data layout approach for SPRAM-cache based architectures, as we perform data partitioning followed by cache-conscious data layout. Also, our approach addresses all the key system design objectives, namely area, power and performance.
7.6.3 Memory Architecture Exploration
Panda et al. propose a local memory exploration method for SPRAM-cache based memory organizations. They propose a simple and efficient heuristic to walk through the SPRAM-cache space; for each memory architecture configuration considered, the performance of the configuration is estimated by an analytical model. In [64], an exhaustive-search based exploration approach is proposed for a cache-based memory architecture, which explores the memory design space over parameters like on-chip memory size, cache size, line size and associativity. The authors extend the work of Panda et al. to consider energy consumption as the performance metric for the memory
architecture exploration. Memory exploration for cache-based memory architectures is also considered in [51]. The main difference between the above works and our method is that our memory architecture exploration framework integrates an efficient data layout heuristic as part of the framework to evaluate each memory architecture. Without an efficient data layout, a random mapping of the application may result in poor performance even for a good memory architecture. Further, our memory architecture exploration framework considers multiple objectives, namely performance, area and power. The memory hierarchy exploration problem is formulated in a Genetic Algorithm framework in [52] and [9]; their target architecture consists of separate L1 caches for instruction and data, and a unified L2 cache. Their objective function is a single formula which combines area, average access time and power. In [52], additional parameters such as bus width and bus encoding are considered, and the problem is modeled in a multi-objective GA framework. The main difference between their work and ours is the integration of data layout as part of the memory architecture exploration framework. The absence of a cache-conscious data layout means that the application data may not be efficiently placed in off-chip RAM, leading to poor performance; a point to note here is that such poor performance is a result of inefficient data placement and not of the cache configuration. The other main difference is that [52, 9] use simulation-based fitness function evaluation, which limits the number of evaluations due to the large run-time; in comparison, our approach uses an analytical model to compute the fitness functions.
7.7 Conclusions
In this chapter we have presented a memory architecture exploration framework for SPRAM-cache based memory architectures. Our framework integrates memory exploration, data partitioning between SPRAM and cache, and cache-conscious data layout to explore the memory design space, and presents a list of Pareto-optimal solutions. We have addressed three of the key system design objectives, viz., (i) memory area, (ii) performance and (iii) memory power. Our approach explores the memory design space and presents
several Pareto-optimal solutions within a few hours on a standard desktop. Our solution is fully automated and meets time-to-market requirements.
Chapter 8

Conclusions

In this chapter, we present a summary of the thesis and outline possible extensions to this work.
8.1 Thesis Summary
In this work, we presented methods and a framework to address the memory subsystem optimization problem for embedded SoCs. In Chapter 3, we presented three different methods to address the data layout problem for Digital Signal Processors. Multiple methods are required to address the different stages of the embedded design flow. For instance, data layout during memory architecture exploration needs to be very fast, as data layout is invoked several thousand times to evaluate different memory architectures; on the other hand, the data layout method used for final system production needs to generate a highly optimized solution irrespective of run-time. Hence, we proposed three different approaches for data layout in Chapter 3: (i) an Integer Linear Programming (ILP) based approach, (ii) a Genetic Algorithm (GA) formulation of data layout and (iii) a fast and efficient heuristic method. We compared the results of all three approaches. The heuristic method performs very efficiently both in terms of the quality of the data layout and in terms of run-time: the quality of the data layout (the number of memory stalls reduced) generated by the heuristic algorithm is within 5% of
that of the GA's output. The ILP approach gives the best quality solution, but its run-time is very high.

In Chapter 4, we addressed the logical memory architecture exploration problem for embedded DSP processors. As discussed in Chapter 1, the logical view is closer to the behavior and helps in reducing the search space by abstracting the problem. We formulated the logical memory architecture exploration (LME) problem in multi-objective GA and multi-objective SA frameworks. The multiple objectives include performance (in terms of memory stalls) and cost (in terms of "logical" memory area). Both GA and SA produce 100-250 Pareto-optimal design points for all application benchmarks. Our experiments showed that the multi-objective GA performed better than the SA approach in terms of (i) the quality of solutions, measured as the number of non-dominated solutions generated in a given time, and (ii) uniformity of search across the design space (diversity of solutions). Both GA and SA based approaches take approximately 30 minutes of run-time to generate the Pareto-optimal solutions for one benchmark.

Chapter 5 addressed the data layout exploration problem from a physical memory architecture perspective; again, the target memory architecture is that of embedded DSP processors. We proposed a Multi-Objective Data Layout EXploration (MODLEX) framework that searches the data layout design space from a performance and power-consumption viewpoint for a given physical memory architecture. We showed that our method effectively uses the multiple memory banks, single/dual-ported memories, and non-uniform banks to produce around 100-200 data layout solutions that are Pareto-optimal with respect to performance and power consumed. We also showed that a large trade-off, of up to 70%, between power and performance is possible by using different data layout solutions, specifically for DSP-based memory architectures.

In Chapter 6, we addressed the memory architecture exploration of embedded DSP processors from a physical memory perspective. We proposed two different approaches to physical memory exploration. The first approach extends the logical memory architecture exploration described in Chapter 4 to address the physical memory architecture exploration problem; this approach is referred to as LME2PME. As part of the steps to extend
the LME to address PME, we proposed a memory allocation exploration framework that takes a Pareto-optimal logical memory architecture and its corresponding data layout as input, and explores the physical memory space by constructing the given logical memory architecture from physical memories in the different possible ways, with the objective of optimizing area and power consumption. The memory allocation exploration is formulated as a multi-objective GA. The second approach proposed in Chapter 6 is an integrated approach that directly addresses the physical memory architecture exploration problem; this approach is known as DirPME. It formulates the physical memory exploration problem directly as a multi-objective GA, working on data layout, memory exploration and memory allocation simultaneously, and hence its search space is very large compared to that of LME2PME. We showed that both approaches, LME2PME and DirPME, provide several hundreds of Pareto-optimal points that are interesting from an area, power and performance viewpoint. Further, we showed that, for a given run-time, LME2PME provides better solutions than DirPME. However, the solutions of DirPME and LME2PME are very close, and hence both approaches are useful depending on the needs of system designers.

Finally, in Chapter 7, we extended our memory architecture exploration framework to address SPRAM-cache based on-chip memory architectures. We proposed an efficient data partitioning heuristic to partition data sections between on-chip SPRAM and cache, and a graph-partitioning based cache-conscious data layout heuristic with the objective of reducing cache conflict misses. An exhaustive search method is applied to explore the SPRAM-cache design space. Each memory architecture is evaluated by mapping the target application onto the memory architecture under consideration, using the data partitioning heuristic and the cache-conscious data layout heuristic, to obtain the performance in terms of the number of memory stalls. We used eCacti [45] to obtain the area and power-per-access numbers for the cache, and a semiconductor memory library to obtain the area and power numbers for SPRAM. Based on area, power and performance, and by applying the Pareto-optimality conditions, the list of Pareto-optimal memory architectures
is identified.
8.2 Future Work
In this section, we outline some of the possible extensions to our work.
8.2.1 Standardization of Input and Output Parameters
The memory architecture exploration problem, as discussed in Chapter 1 and illustrated in Figure 1.4, has to be addressed at several levels, namely the behavioral level, the logical architecture level, the physical architecture level, and the data layout. To be able to communicate the interfaces and input/output parameters across these different levels of abstraction, it is critical to standardize the communication. This involves standardizing the input and output file formats and parameters from an IP, platform, and semiconductor-library viewpoint. Currently there is no standardization of the format, syntax and semantics of these parameters that is aligned across these levels. It is critical to address this problem so that multiple methods and optimizations can be integrated seamlessly to address specific applications, architectures and system aspects.
8.2.2 Impact of platform change on system performance
The impact of a platform change on system parameters like area, power and performance can be studied for a given application, semiconductor memory library and process node. This impact analysis is critical to identify where to spend effort in improving the platform so that the overall system performance improvement is high.
8.2.3 Impact of Application IP library rework on system performance
We have addressed the memory architecture exploration problem for a given set of applications. At this stage, the make, buy or reuse decisions are made and the list of IP
modules to be used as part of the system is known, as shown in Figure 1.4. We could extend our memory architecture exploration framework to analyze the impact of rework or design improvement of one or more software IPs on memory power, performance and area. This analysis could properly direct the IP optimization efforts, with the objective of improving system area, power and performance.
8.2.4 Impact of semiconductor library rework on the system performance
Our memory architecture exploration framework can be extended to study the suitability of a specific semiconductor memory library for a specific embedded system. Further, the impact of reworking a semiconductor memory library on memory system area and power can be studied, to decide on and prioritize the areas of rework.
8.2.5 Multiprocessor Architectures
Our work on data layout and memory architecture exploration focuses mainly on optimizing the on-chip memory organization of a processor (DSP or microcontroller) in a SoC. Our work can be extended to optimize shared memory subsystems in a multiprocessor-based SoC.
Bibliography

[1] ARM920T and ARM922T: ARM9 Family of Embedded Processors. http://www.arm.com/products/CPUs/families/ARM9Family.html.
[2] ARM926EJ-S and ARM926E-S: ARM9E Family of Embedded Processors. http://www.arm.com/products/CPUs/families/ARM9Family.html.
[3] lp_solve. http://lpsolve.sourceforge.net/5.5/.
[4] SystemC - Language for System-Level Modeling, Design and Verification. http://www.systemc.org/home.
[5] Verilog Hardware Description Language. http://www.verilog.com/index.html.
[6] International Technology Roadmap for Semiconductors. SEMATECH, 3101 Industrial Terrace, Suite 106, Austin, TX 78758, 2001.
[7] 2007 global mobile communications - statistics, trends and forecasts. Technical report, http://www.reportbuyer.com/telecoms/mobile/2007 global mobile trends.html, 2007.
[8] 1st IEEE International Symposium on Industrial Embedded Systems. Panel discussion: Open Issues in SoC Design. http://www.iestcfa.org/panel discussions.htm, 2006.
[9] G. Ascia, V. Catania, and M. Palesi. Parameterised system design based on genetic algorithms. In Proceedings of the ACM 2nd International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES), November 2001.
[10] O. Avissar, R. Barua, and D. Stewart. Heterogeneous memory management for embedded systems. In Proceedings of the ACM 2nd International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES), November 2001.
[11] F. Balasa, F. Catthoor, and H. De Man. Background memory area estimation for multidimensional signal processing systems. IEEE Trans. VLSI Systems, 3:157-172, June 1995.
[12] R. Banakar, S. Steinke, B-S. Lee, M. Balakrishnan, and P. Marwedel. Scratchpad Memory: A design alternative for Cache On-chip memory in Embedded Systems. In Tenth International Symposium on Hardware/Software Codesign (CODES), Estes Park, Colorado, May 2002. ACM.
[13] M. Barr. Embedded Systems Gallery. http://www.netrino.com/Publications/Glossary/index.php.
[14] L. Benini, L. Macchiarulo, A. Macii, E. Macii, and M. Poncino. From architecture to layout: Partitioned memory synthesis for embedded systems-on-chip. In Design Automation Conference, 2001.
[15] L. Benini, L. Macchiarulo, A. Macii, and M. Poncino. Layout driven memory synthesis for embedded systems-on-chip. In Proceedings of the ACM 3rd International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES), November 2002.
[16] Broadcom. BCM2702: High Performance Mobile Multimedia Processor. http://www.broadcom.com/collateral/pb/2702-PB02-R.pdf, 2006.
[17] B. Calder, C. Krintz, S. John, and T. Austin. Cache-conscious data placement. In Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, San Jose, California, October 1998.
[18] Y. Cao, H. Tomiyama, T. Okuma, and H. Yasuura. Data memory design considering effective bitwidth for low energy embedded systems. In Proc. of the International Symposium on System Synthesis (ISSS), 2002.
[19] F. Catthoor, N. D. Dutt, and C. E. Kozyrakis. How to solve the current memory access and data transfer bottlenecks: at the processor architecture or at the compiler level? In Design, Automation and Test in Europe Conference and Exhibition, pages 426-433, 2000.
[20] J.A. Chandy and P. Banerjee. Parallel simulated annealing strategies for VLSI cell placement. In Ninth International Conference on VLSI Design, 1996.
[21] T. M. Chilimbi, B. Davidson, and J. R. Larus. Cache conscious structure definition. In Proceedings of the 1999 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems, May 1999.
[22] T. M. Chilimbi, M. D. Hill, and J. R. Larus. Cache conscious structure layout. In International Conference on Programming Languages Design and Implementation (PLDI99), May 1999.
[23] K. Choi, R. Soma, and M. Pedram. Fine-grained dynamic voltage and frequency scaling for precise energy and performance trade-off based on the ratio of off-chip access to on-chip computation times. In Design, Automation and Test in Europe Conference and Exhibition, volume I, 2004.
[24] T.H. Cormen, C.E. Leiserson, R.L. Rivest, and C. Stein. Introduction to Algorithms. MIT Press, 2001.
[25] K. Deb. Multi-objective evolutionary algorithms: Introducing bias among pareto-optimal solutions. Technical report, IIT Kanpur, 1996.
[26] W.W. Donaldson and R.R. Meyer. A dynamic-programming heuristic for regular grid-graph partitioning. Technical report, http://pages.cs.wisc.edu/~wwd/rev4.pdf, 2007.
[27] G. Dueck and T. Scheuer. Threshold accepting: A general purpose optimization algorithm appearing superior to simulated annealing. Journal of Computational Physics, 90:161-175, 1990.
[28] R. Fehr. Intellectual property: A solution for system design. In Technology Leadership Day, October 2000.
[29] D. Gajski. Design methodology for systems-on-chip. Technical report, Centre for Embedded Computer Systems, University of California, Irvine, California, http://www.cecs.uci.edu/eve presentations.htm, 2002.
[30] D. E. Goldberg. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, 1989.
[31] P. Grun, N. Dutt, and A. Nicolau. Memory Architecture Exploration for Programmable Embedded Systems. Kluwer Academic Publishers, 2003.
[32] G. Hadjiyiannis, S. Hanono, and S. Devadas. ISDL: An Instruction Set Description Language for Retargetability. In Proceedings of the Design Automation Conference (DAC), June 1997.
[33] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, second edition, 1995.
[34] J. D. Hiser and J. W. Davidson. Embarc: an efficient memory bank assignment algorithm for retargetable compilers. In Proceedings of the 2004 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems, pages 182-191. ACM Press, 2004.
[35] P. K. Jha and N. D. Dutt. Library mapping for memories. In EuroDesign, 1997.
[36] B. Juurlink and P. Langen. Dynamic techniques to reduce memory traffic in embedded systems. In Conference On Computing Frontiers, pages 192-201, 2004.
[37] M. Kandemir, J. Ramanujam, and A. Choudhary. Improving cache locality by a combination of loop and data transformations. IEEE Transactions on Computers, 1999.
[38] B.W. Kernighan and S. Lin. An efficient heuristic procedure for partitioning graphs. The Bell System Technical Journal, 49(2):291-307, 1970.
[39] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220, 1983.
[40] M. Ko and S. S. Bhattacharyya. Data partitioning for DSP software synthesis. In Proceedings of the International Workshop on Software and Compilers for Embedded Processors, September 2003.
[41] C. Kulkarni, C. Ghez, M. Miranda, F. Catthoor, and H. De Man. Cache conscious data layout organization for embedded multimedia applications. In Design, Automation and Test in Europe, pages 686-691, 2001.
[42] Bernard Laurent and Thierry Karger. A system to validate and certify soft and hard IP. In Design, Automation and Test in Europe Conference and Exhibition, 2003.
[43] C. Lee, M. Potkonjak, and W. H. Mangione-Smith. MediaBench: A Tool for Evaluating and Synthesizing Multimedia and Communications Systems. In International Symposium on Microarchitecture, 1997.
[44] R. Leupers and D. Kotte. Variable partitioning for dual memory bank DSPs. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Salt Lake City (USA), May 2001.
[45] M. Mamidipaka and N. Dutt. eCACTI: An enhanced power estimation model for on-chip caches. Technical report, Centre for Embedded Computer Systems, University of California, Irvine, California, 2004.
[46] P. Mishra, P. Grun, N. Dutt, and A. Nicolau. Processor-memory co-exploration driven by a memory-aware architecture description language. In Proceedings of the International Conference on VLSI Design, 2001.
[47] B. Monien and R. Diekmann. A local graph partitioning heuristic meeting bisection bounds. In 8th SIAM Conference on Parallel Processing for Scientific Computing, 1997.
[48] H. Orsila, T. Kangas, E. Salminen, T. D. Hamalainen, and M. Hannikainen. Automated memory-aware application distribution for multi-processor system-on-chips. Journal of System Architecture: the EUROMICRO Journal, 53(11), November 2007.
[49] R. Oshana. DSP Software Development Techniques for Embedded and Real-Time Systems. Embedded computer systems, 2006.
[50] K. V. Palem, R. M. Rabbah, V. J. Mooney III, P. Korkmaz, and K. Puttaswamy. Design space optimization of embedded memory systems via data remapping. ACM Conference on Languages, Compilers and Tools for Embedded Systems (LCTES), June 2002.
[51] G. Palermo, C. Silvano, and V. Zaccaria. Multi-objective design space exploration of embedded systems. Journal of Embedded Computing, 1(3), August 2005.
[52] M. Palesi and T. Givargis. Multi-objective design space exploration using genetic algorithms. In International Workshop on Hardware/Software Codesign (CODES), May 2002.
[53] P. R. Panda, N. D. Dutt, and A. Nicolau. Memory Issues in Embedded Systems-on-chip: Optimizations and Exploration. Kluwer Academic Publishers, Norwell, Mass., 1998.
[54] P. R. Panda, N. D. Dutt, and A. Nicolau. Local memory exploration and optimization in embedded systems. IEEE Trans. Computer-Aided Design, 18(1):3-13, January 1999.
[55] P. R. Panda, N. D. Dutt, and A. Nicolau. On-chip vs. off-chip memory: The data partitioning problem in embedded processor-based systems. ACM Trans. Design Automation of Electronic Systems, 5(3):682-704, July 2000.
[55] P. R. Panda, N. D. Dutt, and A. Nicolau. On-chip vs. off-chip memory: The data partitioning problem in embedded processor-based systems. ACM Trans. Design Automation of Electronic Systems, 5(3):682–704, July 2000. [56] S. Peesl, A. Hoffmannl, V. Zivojnovic2, and H. Meyrl. LISA - Machine Description Language for Cycle-Accurate Models of Programmable DSP Architectures. In Design Automation Conference, 1999. [57] R. Rutenbar. Simulated annealing algorithms: an overview. IEEE Circuits and Devices Magazine, January 1989. [58] M. A. R. Saghir, P. Chow, and C. G. Lee. Exploiting dual data-memory banks in digital signal processors. In Proceedings of the 7th Intl Conference Architectural Support for Programming Languages and Operating Systems, pages 234–243, October 1996. [59] A. Sangiovanni-Vincentelli, L. Carloni, F. De Bernardinis, and M. Sgroi. Benefits and challenges for platform based design. In Design Automation Conference, 2004. [60] E. Schmidt. Power Modelling of Embedded Memories. PhD thesis, 2003. [61] H. Schmit and D. Thomas. Array mapping in behavioral synthesis. In Proc. of the International Symposium on System Synthesis (ISSS), 1995. [62] J. Seo, T. Kim, and P. Panda. An integrated algorithm for memory allocation and assignment in high-level synthesis. In Proceedings of 39th Design Automation Conference, pages 608–611, 2002. [63] J. Seo, T. Kim, and P. Panda. Memory allocation and mapping in high-level synthesis: an integrated approach. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 11(5), October 2003. [64] W.T. Shiue and C. Chakrabarti. Memory exploration for low power, embedded systems. In Design Automation Conference, pages 140–145, New York, 1999. ACM Press.
[65] A. Shrivastava, I. Issenin, and N. D. Dutt. Compilation techniques for energy reduction in horizontally partitioned cache architectures. In Proceedings of the ACM 6th International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES), September 2005.
[66] K. Shyam and R. Govindarajan. Compiler directed power optimization for partitioned memory architectures. In Proc. of the Compiler Construction Conference (CC-07), 2007.
[67] J. Sjodin and C. Platen. Storage allocation for embedded processors. In Proceedings of the ACM 2nd International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES), November 2001.
[68] A.J. Smith. Cache memories. ACM Computing Surveys, 1993.
[69] J. C. Spall. Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control. Wiley, 2003.
[70] S. Sriram and S. S. Bhattacharyya. Embedded Multiprocessors: Scheduling and Synchronization. Embedded computer systems, 2000.
[71] A. Sundaram and S. Pande. An efficient data partitioning method for limited memory embedded systems. In ACM SIGPLAN Workshop on Languages, Compilers and Tools for Embedded Systems (in conjunction with PLDI '98), pages 205-218, 1998.
[72] R. Szymanek, F. Catthoor, and K. Kuchcinski. Time-energy design space exploration for multi-layer memory architectures. In Design Automation and Test Europe, 2004.
[73] Texas Instruments. Code Composer Studio (CCS) IDE. http://focus.ti.com/dsp/docs/.
[74] Texas Instruments. TMS320 DSP Algorithm Standard. http://dspvillage.ti.com/, 2001.
[75] Texas Instruments. TMS320C54x DSP CPU and Peripherals Reference Set, volume 1, 2001. http://dspvillage.ti.com/docs/dspproducthome.html.
[76] Texas Instruments. TMS320C55x DSP CPU Reference Guide, 2001. http://dspvillage.ti.com/docs/dspproducthome.html.
[77] Texas Instruments. TMS320C64x DSP CPU Reference Guide, 2003. http://dspvillage.ti.com/docs/dspproducthome.html.
[78] H. Tomiyama and N. D. Dutt. Program path analysis to bound cache-related preemption delay in preemptive real-time systems. In Eighth International Symposium on Hardware/Software Codesign (CODES), 2000.
[79] S. Udayakumaran, A. Dominguez, and R. Barua. Dynamic allocation for scratch-pad memory using compile-time decisions. ACM Transactions in Embedded Computing Systems, 5:1-33, 2005.
[80] M. Verma, L. Wehmeyer, and P. Marwedel. Cache-aware scratchpad allocation algorithm. In Proceedings of the conference on Design, Automation and Test in Europe - Volume 2. IEEE Computer Society, 2004.
[81] L. Wehmeyer and P. Marwedel. Influence of memory hierarchies on predictability for time constrained embedded software. In Design, Automation and Test in Europe (DATE), 2005.
[82] M. E. Wolf and M. S. Lam. A data locality optimizing algorithm. In Proceedings of the 1991 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems, June 1991.
[83] W. Wolf and M. Kandemir. Memory system optimization of embedded software. Proceedings of the IEEE, 91(1), January 2003.
[84] D.F. Wong, H.W. Leong, and C.L. Liu. Simulated Annealing for VLSI Design. Kluwer Academic Publishers, 1988.