EXPLORING REGISTER FILE AND MEMORY ORGANIZATION IN ASIP SYNTHESIS
MANOJ KUMAR JAIN
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
INDIAN INSTITUTE OF TECHNOLOGY DELHI
© Indian Institute of Technology Delhi, 2004
All rights reserved.
EXPLORING REGISTER FILE AND MEMORY ORGANIZATION IN ASIP SYNTHESIS by
MANOJ KUMAR JAIN DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
Submitted in partial fulfillment of the degree of Doctor of Philosophy to the
Indian Institute of Technology Delhi May 2004
Certificate
This is to certify that the project titled “EXPLORING REGISTER FILE AND MEMORY ORGANIZATION IN ASIP SYNTHESIS”, being submitted by Manoj Kumar Jain, in partial fulfillment for the award of Doctor of Philosophy, is a record of the bonafide work carried out by him under our guidance and supervision at the Department of Computer Science and Engineering, Indian Institute of Technology Delhi. Also, the work presented in this project has not been submitted elsewhere, either in part or in full, for the award of any other degree or diploma.
Prof. M. Balakrishnan
Prof. Anshul Kumar
Department of Computer Science & Engineering, Indian Institute of Technology Delhi, New Delhi, India
Acknowledgments

I am grateful to my supervisors Prof. M. Balakrishnan and Prof. Anshul Kumar for their time, guidance and support; I cannot adequately express my thanks to them in words. Next, I want to thank the other research scholars of the Embedded Systems Group, namely Anup Gangwar, Basant Dwivedi and Rajeshwari Banakar, for their consistent co-operation. I would like to thank the project sponsors, namely the Naval Research Board, Government of India, and the Department of Science & Technology, Government of India.

I have worked in a cooperation project between our group and the embedded systems group at the University of Dortmund, Germany. I am grateful to Prof. Peter Marwedel for his valuable guidance, and I want to thank the research scholars Lars Wehmeyer and Stefan Steinke of that group for their cooperation during my Dortmund visit and afterwards. It was a nice opportunity for me to work with that group and I enjoyed it.

I am thankful to all the other faculty members and staff members of the Department of Computer Science and Engineering at IIT Delhi for their help throughout my research work. I am also thankful to M. L. Sukhadia University, Udaipur (Rajasthan), India for granting me study leave to pursue the PhD degree at I.I.T. Delhi.

Last but not least, I would like to mention that the blessings of my father Sh. Naresh Chandra Jain were one of the most important secrets of my success. I would like to dedicate my thesis to my late mother Dhapu Bai, as it was her dream which has now come true. The patience of my wife Sunita and my sons Ronak and Yash played a key role in enabling me to devote sufficient time to research.
Manoj Kumar Jain
Abstract

An Application Specific Instruction Set Processor (ASIP) is a processor designed for one particular application or for a set of specific applications. An ASIP exploits special characteristics of the given application(s) to meet the desired performance, cost and power requirements. A typical ASIP design flow includes key steps such as application analysis, design space exploration, instruction set generation, code generation for software, and hardware synthesis [1]. Performance estimation, which drives the design space exploration, is usually done by simulation. Such a technique needs a retargetable compiler to generate code for the different processor configurations to be explored, followed by simulation of the generated code. This process is generally very slow. Further, there is a well-known trade-off between retargetability and code quality, in terms of performance and code size, when compared to hand-optimized code: when the design space is large, not all target-specific optimizations can be performed. Therefore, in our opinion, a compiler-simulator based approach is not suitable for early design space exploration. In the domain of Application Specific Instruction Set Processors (ASIPs), this problem can be addressed by scheduler based approaches, which are much faster. However, existing scheduler based approaches do not help in exploring the storage organization. We demonstrated the importance of including storage organization in design space exploration through a study using a retargetable code generator (encc) and a standard simulator. We studied the impact of register file size on performance, code size, power and energy consumption [2] for selected benchmarks on the ARM7TDMI processor. Results indicated that the choice of an appropriate number of registers has a significant impact on performance and energy in many cases. The major contribution of this thesis is a scheduler based approach to explore register file size, number of register windows and cache memory configurations in an integrated manner. The proposed technique estimates the cycle count for application execution on the chosen processor and memory configuration. We consider a parameterized model for the processor as well as the memory. Performance for different register file sizes is estimated by predicting the number of memory spills and their delay. The technique developed does not require explicit register assignment. We estimate register needs on unscheduled code using the concept of reuse chains with significant extensions [3].
Costs due to register spills are considered while merging the reuse chains. The proposed technique also considers global register needs. For an architecture with register windows, the number of context switches leading to spills is estimated to evaluate the time penalty due to a limited number of register windows. A sequence of function calls and returns during the execution of the application is generated by instrumenting the input application, and the number of window spills and restores required for a specific number of windows is computed by a stack based analysis of this trace. We observe that the number of memory locations required to store spilled scalar variables and register windows is usually small compared to the total number of cache locations; the spilling overhead is therefore insensitive to the cache organization. This observation allows us to estimate the two independently. We use the sim-cache simulator of the SimpleScalar tool set to obtain cache miss statistics. Once the number of misses for a particular cache is known, we compute the additional schedule overhead due to cache misses based on the block size and delay information. Performance estimates for a range of register file sizes, register windows and cache memory configurations are generated for selected benchmarks on different processors: ARM7TDMI, LEON [4] and Trimedia (TM-1000). Experiments showed that our estimates were within 9.6%, 9.7% and 3.3%, respectively, of the actual performance results produced by standard tool sets for these processors. Further, this technique was nearly 77 times faster than the simulator based approach. We have demonstrated the utility of our approach through a case study on the LEON processor: we used the addresses of the registers identified as 'spare' by our technique to address coprocessor registers, which simplified the coprocessor interface to the LEON processor and saved a number of loads and stores. The work presented here is integrated in an overall ASIP synthesis methodology named ASSIST being developed at IIT Delhi.
Contents

List of Figures  v
List of Tables  vii

1 Introduction and Objectives  1
  1.1 Application Specific Instruction Set Processor  1
  1.2 Steps in ASIP Synthesis  2
    1.2.1 Application Analysis  3
    1.2.2 Architectural Design Space Exploration  4
    1.2.3 Instruction Set Generation  5
    1.2.4 Code Synthesis  5
    1.2.5 Micro-architecture Design and Hardware Synthesis  5
  1.3 Variations in the Design Flow  5
  1.4 Objectives  6
  1.5 Overall Approach  6
  1.6 Organization of Thesis  7

2 Related Work  9
  2.1 Application Analysis  9
  2.2 Architectural Exploration  10
    2.2.1 Broad Architectural Features  11
    2.2.2 Performance Estimation  12
    2.2.3 Search Control  15
  2.3 Instruction Set Generation  17
    2.3.1 Instruction Set Synthesis  17
    2.3.2 Instruction Selection  18
    2.3.3 Instruction Set Extension  21
  2.4 Code Synthesis  22
    2.4.1 Retargetable Code Generator  22
    2.4.2 Compiler Generator  25
  2.5 Micro-architecture Design and Hardware Synthesis  27
  2.6 Conclusion  27

3 Impact of Register File Size on Performance and Power Metrics  29
  3.1 Experimental Setup  29
    3.1.1 The ARM7TDMI Processor  30
    3.1.2 Benchmark Suite  30
    3.1.3 The encc Compiler  31
    3.1.4 Power Model  32
    3.1.5 Experimentation with Register File Size  33
  3.2 Results  34
    3.2.1 Number of Executed Instructions  34
    3.2.2 Number of Cycles  35
    3.2.3 Ratio of Spill Instructions to Total Static Code Size  36
    3.2.4 Average Power Consumption  37
    3.2.5 Energy Consumption  39
    3.2.6 Analysis of Results  41
  3.3 Conclusion  45

4 Overview of Our Methodology  47
  4.1 ASSIST: ASIP Design Methodology  47
  4.2 Storage Space Exploration  49
  4.3 Execution Time Estimation with Limited Registers  51
  4.4 Our Approach and Compilers  54
  4.5 Conclusion  55

5 Execution Time Estimation at the Basic Block Level  57
  5.1 Register Allocation using Register Reuse Chains (RRCs)  58
  5.2 Generating Initial Register Reuse Chains  59
  5.3 Converting Initial Chains into Dependence Conservative Chains  61
  5.4 Computing Merging Costs  64
  5.5 Chain Merging and Performance Estimation  66
  5.6 Complexity Analysis  67
  5.7 Conclusion  68

6 Global Performance Estimation and Validation  69
  6.1 Addressing Global Register Needs  69
  6.2 Illustrative Example  73
  6.3 Experimental Setup  74
    6.3.1 ARM7TDMI  74
    6.3.2 TM-1000  74
    6.3.3 Benchmark Suite  74
    6.3.4 Processor Description  75
  6.4 Results of Performance Estimations  75
  6.5 Validation  76
    6.5.1 Validation for ARM7TDMI  77
    6.5.2 Validation for TM-1000  78
  6.6 Limitations  79
  6.7 Conclusion  80

7 Register Windows and Cache Memory Exploration  81
  7.1 Estimating Register Window Spills  81
    7.1.1 Results  83
  7.2 Cache Miss Overheads  86
    7.2.1 Results  88
  7.3 Execution Time Validation  88
  7.4 Conclusion  91

8 Illustrative Case Studies and Applications of Our Approach  93
  8.1 ADPCM Encoder and Decoder Storage Exploration  93
  8.2 Collision Detection Execution Time Validation  94
  8.3 Applications  96
    8.3.1 Reducing Bits in Instruction  97
    8.3.2 Alternate Use of Spare Registers  97
  8.4 Hardwiring some Constants to the Spare Registers  98
  8.5 Utilizing Spare Register Addresses to Interface Co-processors  99
  8.6 Conclusion  101

9 Conclusions and Future Work  103
  9.1 Summary of Our Contributions  103
  9.2 Limitations and Future Work  104

Bibliography  105

List of Figures

1.1 Flow diagram of ASIP design methodology  3
2.1 Block diagram of an architecture explorer  10
2.2 The flow of estimation scheme: Ghazal [5] & Gupta [6]  14
2.3 The integrated scheduling/instruction-formation process (Huang and Despain [7, 8])  18
2.4 Expression tree  20
2.5 a) Multiplier-Accumulator, b) Instruction patterns  20
2.6 a) Optimal cover with MAC, b) Cover with intrinsic patterns only  21
2.7 Retargetable code generator  22
2.8 Compiler generator  25
3.1 Number of executed instructions  35
3.2 Number of cycles  36
3.3 Ratio of number of spill instructions to total static instructions  37
3.4 Average power consumption using only off-chip memory  38
3.5 Average power consumption using on-chip instruction and off-chip data memory  39
3.6 Energy consumption using off-chip memory  40
3.7 Energy consumption using on-chip instruction and off-chip data memory  41
3.8 Results for the program lattice_init  42
3.9 Results for the program me_ivlin  42
3.10 Contributions of time saving and power saving in energy saving  44
3.11 Energy saving due to voltage scaling  44
4.1 ASIP synthesis methodology: ASSIST  48
4.2 Storage exploration technique  51
4.3 Flow diagram of our performance estimation scheme  53
5.1 Register allocation using register reuse chains  59
5.2 Source code and dependency graph  60
5.3 Algorithm: initial reuse chains formation  62
5.4 Initial register reuse chains  62
5.5 Algorithm: converting initial chains into dependence conservative chains  63
5.6 Dependence conservative chains  64
5.7 Computing merging costs  66
5.8 Sets and schedule estimates with different k  67
6.1 Variable use sets  71
6.2 Optimal register allocation  73
6.3 Execution time estimates for matrix-mult for ARM7TDMI  76
6.4 Validation for ARM7TDMI  77
6.5 Validation for TM-1000  79
7.1 Algorithm: computing number of window spills and restores  82
7.2 Impact of register file size on execution time  84
7.3 Impact of number of register windows  85
7.4 Trade-off between number of windows and their sizes  86
7.5 Results for matrix-mult  89
7.6 Validation on LEON  90
8.1 Results for adpcm rawaudio encoder for LEON  95
8.2 Results for adpcm rawaudio decoder for LEON  95
8.3 Part of the generated code for ARM7TDMI  98
8.4 Results for use of our technique for interfacing a coprocessor to the LEON processor  100

List of Tables

2.1 Comparison of various architecture models used for design space exploration  13
2.2 Comparison of major performance measurement techniques  16
2.3 Comparison of major instruction generation approaches  23
2.4 Comparison of major code synthesis approaches  26
3.1 Maximum variation in results for various benchmark programs  43
5.1 EUK nodes of all the nodes  61
8.1 MIPS instruction formats  97
Chapter 1
Introduction and Objectives

1.1 Application Specific Instruction Set Processor

An Application Specific Instruction Set Processor (ASIP) is a processor designed for one particular application or for a set of specific applications. An ASIP exploits special characteristics of the given application(s) to meet the desired performance, cost and power requirements. ASIPs are a balance between two extremes: Application Specific Integrated Circuits (ASICs) and general programmable processors. ASIPs offer the required flexibility (which is not provided by ASICs) at a lower cost and power than general programmable processors. Thus ASIPs can be used efficiently in many embedded systems such as servo-motor control, automotive controls, avionics, cellular phones, etc.

General programmable processors (GPPs) are designed for general use. Often a specific application needs a certain resource mix which does not match the GPP resource mix. For example, an application may be computation intensive with a good amount of operation level concurrency available in it, and may have a ratio of 1:3 of integer to floating point operations, whereas the GPP may have two integer units and one floating point unit based on the requirements of the expected wide range of applications. Other resources available in the GPP may not be useful for the given application, yet they add to the chip area and power consumption. A possible solution could be to provide multiple FUs but include only the required resources.
This calls for designing an ASIP. Similarly, if we have to design a processor for a mobile device (e.g. a mobile phone), then area and power consumption are the main concerns along with the performance requirements. In this case we have to choose the required resources very carefully. Again, we are trying to get a processor tuned for a particular application, implying that the design is moving towards an ASIP. On the other hand, if we plan to design an ASIC to meet the given performance, power and area constraints for the given applications, the design is rigid: even a small change in the specifications or standards may lead to a complete redesign of the ASIC, and such redesign efforts are too expensive. This rigidity of ASICs also contributes to the increasing interest in ASIP synthesis. Research in ASIP synthesis is more than a decade old now. The term ASIP, coined by Sato et al. [9], initially stood for Application Specific Integrated Processor, but later it became a popular abbreviation for Application Specific Instruction Set Processor. To understand how ASIPs are designed, we now analyze the major steps followed in ASIP synthesis.
1.2 Steps in ASIP Synthesis

Gloria et al. [10] reported the main requirements of the design of application-specific architectures as follows.

• Design starts with the application behavior.
• Evaluate several architectural options.
• Identify hardware functionalities to speed up the application.
• Introduce hardware resources for frequently used operations only if they can be supported during compilation.

In ASIP design, it is important to search for a processor architecture that matches the target application. To achieve this goal, it is essential to estimate the design quality of various candidate architectures in terms of their area, performance and power consumption.
Figure 1.1: Flow diagram of ASIP design methodology (application analysis, architectural design space exploration, instruction set generation, code synthesis, and micro-architecture design and hardware synthesis, taking the application(s) and design constraints to object code and a processor description)

A study of various approaches followed for ASIP synthesis [1] suggests the following five key steps as part of a typical design flow. These steps are shown in figure 1.1.

1. Application analysis
2. Architectural design space exploration
3. Instruction set generation
4. Code synthesis
5. Micro-architecture design and hardware synthesis

We now describe in detail the action taken in each step, while surveying the previous work done in that step. Classification of the main approaches is done for each step, wherever possible.
1.2.1 Application Analysis Input to this step is an application or a set of applications, along with their test data. The purpose of application analysis is to get the desired characteristics/requirements which can guide the architecture exploration step. The given application, written in a high level language, is analyzed and the result of analysis is expressed in terms of some parameters, which are used in the subsequent steps.
Examples of these parameters are data types and their access methods, execution counts of the operations and sequence of contiguous operations (Sato et al. [9]), the average basic block size, number of Multiply-Accumulate (MAC) operations, ratio of address computation instructions to data computation instructions and the ratio of input/output instructions to the total instructions (Gupta et al. [6] and Ghazal et al. [5]).
1.2.2 Architectural Design Space Exploration

This is the key step in ASIP design. It involves identifying the broad architectural features of the ASIP. First, the architecture space to be explored is defined, keeping in view the parameters extracted during application analysis and the input constraints. Within this space, the performance of possible architectures is estimated, and a suitable architecture satisfying the performance and power constraints and having minimum hardware cost is selected. The architecture is defined using a standard architecture description language (ADL); examples of such ADLs are EXPRESSION [11] and LISA [12]. Performance estimation, which drives the design space exploration, is either simulator based (e.g. Gloria et al. [10], Kienhuis et al. [13], Kin et al. [14], Imai et al. [15], Binh et al. [16, 17]) or scheduler based (e.g. Gupta et al. [6] and Ghazal et al. [5]). In the simulation based approaches, a simulation model of the architecture based on the selected features is generated, and the application is simulated on this model to compute the performance. In contrast, in the scheduler based approaches, the problem is formulated as a resource constrained scheduling problem with the selected architectural components as the resources, and the application is scheduled to generate an estimate of the cycle count; profile data is used to obtain the frequency of each operation. The scheduler based approaches are preferable over simulator based approaches because simulations are slow and the design space to be explored is quite large. The search space explored by different researchers varies. Almost all approaches explore the number and types of functional units. Kin et al. [14] consider issue width, number of branch units, number of memory units, and the sizes of the instruction and data caches in their model. Gupta et al. [6] and Ghazal [5] use a model which includes parameters like the number of registers, the number of operation slots in each instruction, concurrent load/store operations, and the latency of functional units and operations.
The architecture space explored by scheduler based approaches is rather limited; in particular, the register file and the storage organization are not explored by such approaches, though they are considered to a certain extent by simulator based approaches.
1.2.3 Instruction Set Generation

The instruction set is generated in one of two ways: it is either augmented with special instructions synthesized from scratch (Huang et al. [7]), or instructions are selected from a pre-defined superset for the particular application and the selected architecture (Binh et al. [18] and Liem et al. [19]). This instruction set is used during the code synthesis and hardware synthesis steps.
1.2.4 Code Synthesis

Code is synthesized in two different ways: either a compiler generator is used to generate a compiler which then synthesizes the code (Hatcher et al. [20]), or a retargetable code generator is used to synthesize code for the particular application or set of applications (Leupers et al. [21] and Hanono and Devadas [22]).
1.2.5 Micro-architecture Design and Hardware Synthesis

In this step, a micro-architecture is first designed to implement the selected architectural features and instruction set (Huang et al. [7]). After this, the hardware is synthesized for this micro-architecture, starting from a description in VHDL/Verilog, using standard synthesis tools.
1.3 Variations in the Design Flow

Some techniques consider the processor micro-architecture to be fixed, generating the instruction set only within the flexibility provided by the micro-architecture, e.g. Gschwind et al.
[23], Hoon Choi et al. [24] and Leupers et al. [25]. Others consider the process of instruction set generation only after the parallelism and functionality of the processor micro-architecture are finalized based on the application, e.g. Hanono et al. [22] and Huang et al. [7].
1.4 Objectives

As mentioned earlier in section 1.2.2, the design space explored by scheduler based approaches is rather limited; in particular, the storage (registers, memory hierarchy, etc.) is completely ignored. On the other hand, simulator based approaches are not suitable for early design space exploration because of the large design space and the slow simulation process. Since the design space explored by scheduler based approaches needs to be broadened, we decided to explore the storage organization in a scheduler based approach. Our emphasis is on exploring the following features in an integrated manner.

1. Register file size (number of registers in the register file).
2. Number of register windows.
3. Cache memory configurations.
1.5 Overall Approach

The main focus of our work is to include on-chip storage exploration as part of the design space exploration. To do this, we need to take into account the influence of the storage architecture on performance, i.e. execution time. Our approach is to first find the execution time ignoring the influence of storage constraints, and then add the overhead due to register spills caused by a limited register file, the overhead due to limited register windows, and the overhead due to cache misses. A novel technique to estimate the overhead due to register spills is a major contribution of this thesis. It has been efficiently integrated in our methodology to compute the various storage-related execution time overheads for a range of processors. The approach has been validated and illustrated using a number of interesting design space exploration experiments.
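To make the composition of the estimate concrete, here is a minimal sketch assuming the per-component counts and penalties are already available from the analyses mentioned above; all names and numbers are illustrative, not our tools' actual interface.

    def estimate_cycles(base_cycles: int,
                        spill_count: int, spill_penalty: int,
                        window_spills: int, window_penalty: int,
                        cache_misses: int, miss_penalty: int) -> int:
        """Add the storage-related overheads to the ideal schedule length."""
        return (base_cycles
                + spill_count * spill_penalty
                + window_spills * window_penalty
                + cache_misses * miss_penalty)

    # Example: 10,000 ideal cycles, 120 spills at 2 cycles each, 30 window
    # spills at 32 cycles each, 500 cache misses at 10 cycles each.
    print(estimate_cycles(10_000, 120, 2, 30, 32, 500, 10))  # -> 16200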
1.6 Organization of Thesis

Related work reported in the literature is summarized in the next Chapter, whereas the motivating study is presented in Chapter 3. The proposed ASIP synthesis flow as well as the methodology for storage organization exploration are presented in Chapter 4. Our scheduler based technique for estimating performance at the basic block level is presented in Chapter 5, whereas the global performance estimation technique along with the results is presented in Chapter 6. Integrated exploration of register window size, number of register windows and memory configurations is presented in Chapter 7. Illustrative case studies and applications of our technique are presented in Chapter 8, whereas conclusions and directions for future research are presented in the last Chapter.
Chapter 2
Related Work

With the increasing interest in ASIP synthesis, many researchers have proposed techniques for ASIP design. Techniques suggested by Sato et al. [9], M. Breternitz Jr. et al. [26] and Gloria et al. [10] are among the earliest. In this Chapter a brief survey of ASIP design methodologies is presented; this helps in placing our work in the overall context. The survey is organized into five sections corresponding to the steps identified in Chapter 1.
2.1 Application Analysis

Typically, ASIP design starts with analysis of the applications. The applications, with their test data, are analyzed statically or dynamically; dynamic analysis is done using a suitable profiler. Sato et al. [9] reported an Application Program Analyzer (APA) in 1991. The output of APA includes data types and their access methods, the frequency of individual operations and sequences of contiguous operations, and is used to define the instruction set. More recent application analyzers, such as those developed by Gupta et al. [6] and Ghazal et al. [5], extract a larger number of application parameters. These include the average basic block size, the number of Multiply-Accumulate (MAC) operations, the ratio of address computation instructions to data computation instructions, the ratio of input/output instructions to the total instructions, the average number of cycles between the generation of a scalar and its consumption in the data flow graph, etc.
The idea behind extracting these parameters is to make a decision about the inclusion of a hardware unit in the processor depending on the values of these parameters. For example, if the MAC operation is frequently used in the application, then it is useful to have a unit performing this functionality in hardware. Similarly, the data types used in the application may hint at the memory word size, and the data access sequence may suggest a memory organization suitable for the specific applications. Saghir et al. [27] have shown that the dynamic behavior of DSP kernels differs significantly from that of DSP application programs. This implies that kernels should play a more significant role in DSP application analysis, especially for parameters related to operation concurrency.
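As a rough illustration of this kind of dynamic analysis, the toy analyzer below derives a few such parameters from a trace of executed operations; the function, trace format and parameter set are hypothetical, not those of APA or any cited analyzer.

    from collections import Counter

    def analyze(op_trace, blocks_executed):
        """op_trace: list of executed operation names; blocks_executed:
        number of basic blocks executed. Returns a parameter dictionary."""
        freq = Counter(op_trace)
        # Contiguous mul->add pairs hint that a MAC unit may pay off.
        mul_add_pairs = sum(1 for a, b in zip(op_trace, op_trace[1:])
                            if (a, b) == ("mul", "add"))
        return {
            "op_frequencies": dict(freq),
            "avg_block_size": len(op_trace) / blocks_executed,
            "mul_add_pairs": mul_add_pairs,
        }

    trace = ["load", "mul", "add", "load", "mul", "add", "store"]
    print(analyze(trace, blocks_executed=2))  # avg_block_size 3.5, 2 pairs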
2.2 Architectural Exploration

Figure 2.1 shows the block diagram of a typical architecture explorer. Inputs from the application analysis step are used along with the range of the architecture design space to select suitable architecture(s). The selection process can typically be viewed as a search over the design space driven by a performance estimator. The considered methodologies differ in the range and nature of the architecture design space, the estimation techniques employed, and the search control mechanism. We now survey the research done on each of these aspects.
Figure 2.1: Block diagram of an architecture explorer (input from application analysis and the architecture design space drive a performance estimator for a specific architecture under search control, yielding the selected architecture(s))
2.2.1 Broad Architectural Features

The architecture design space to be explored is usually defined in terms of a parameterized architecture model. Different values can be assigned to the parameters (keeping the design constraints in view) to get different architecture instances in the design space. Clearly, the size of the design space depends on the number of parameters and the range of values which can be assigned to them. The parameterized architecture model suggested by almost all researchers includes the number of functional units of different types. Gong et al. [28] also consider storage units and interconnect resources in their architectural model. Binh et al. [18, 16] emphasize incorporating pipelined functional units in the model. Kienhuis et al. [13] consider the available element types (like buffer, controller, router and functional unit) and their composition rules in their architectural model; composition rules generate alternative implementations of elements, and all parameters together support a large design space. Kin et al. [14] consider issue width, number of branch units, number of memory units, and the sizes of the instruction and data caches in their model. Gupta et al. [6] use a model which includes parameters like the number of operation slots in each instruction, concurrent load/store operations, and the latency of functional units and operations. Ghazal [5] includes optimizing features like addressing support, instruction packing, memory pack/unpack support, and complex arithmetic patterns such as dual multiply-accumulate and complex multiplication in their architectural model. This results in a large design space, and they have developed a retargetable estimator which takes advantage of such an architecture model. Middha et al. [29] explored functional units having one or more inputs and one or more outputs in a TRIMARAN [30] based framework. Architectures considered by different researchers also differ in terms of the instruction level parallelism they support. For example, Binh et al. [16] and Kienhuis et al. [13] do not support instruction level parallelism, whereas Gong et al. [28] and Gupta et al. [6] support VLIW architectures, and Ghazal et al. [5] and Kin et al. [14] support VLIW as well as superscalar architectures. Most of these approaches consider only a flat memory; Kin et al. [14] consider instruction and data cache sizes during design space exploration. Similarly, no approach considers flexibility in the number of pipeline stages, though a pipelined architecture is considered by Binh et al. [16] and Gloria et al. [10].
Table 2.1 compares the various architecture models used in the major approaches for design space exploration.
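The sketch below illustrates what such a parameterized architecture model can look like and how it spans a design space; the parameters and value ranges are illustrative only and not taken from any of the cited models.

    from dataclasses import dataclass
    from itertools import product

    @dataclass(frozen=True)
    class ArchConfig:
        num_alus: int        # functional units of each type
        num_mult_units: int
        num_registers: int   # register file size
        issue_width: int     # instruction level parallelism

    def design_space():
        """Enumerate every architecture instance in a small design space."""
        for alus, mults, regs, width in product([1, 2], [0, 1],
                                                [8, 16, 32], [1, 2, 4]):
            yield ArchConfig(alus, mults, regs, width)

    print(sum(1 for _ in design_space()))  # 2 * 2 * 3 * 3 = 36 candidates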
2.2.2 Performance Estimation

In the literature, two major techniques have been reported for performance estimation: scheduler based and simulator based.
Scheduler based

In this type of approach, the problem is formulated as a resource constrained scheduling problem with the selected architectural components as the resources, and the application is scheduled to generate an estimate of the cycle count. Profile data is used to obtain the frequency of each operation. Examples of such approaches are Gupta et al. [6] and Ghazal et al. [5]. Gupta et al. [6] have developed a resource constrained scheduler that estimates the number of clock cycles, given the description file of the target processor architecture, application timing constraints and the profiling data. A key feature of this list scheduler is its capability to reflect the flexibility of the instruction set to handle concurrency. A processor selector rejects processors which do not meet the constraints. Ghazal et al. [5] have developed a retargetable estimator for the performance estimation required for architectural exploration, using a richer architectural model. A distinguishing feature of their approach is that they consider several optimizing transformations while mapping the application onto the target architecture. The optimizations considered include optimized multi-operation pattern matching (e.g. multiply-accumulate), address mode optimization (e.g. auto-update, circular buffer), loop optimization (e.g. pre-fetched sequential loads), loop vectorization/packing, simple if-else conversion (by use of predicated instructions), rescheduling within basic blocks, loop unrolling, and software pipelining. The flow diagram of the estimation scheme followed in both approaches (Ghazal et al. [5] and Gupta et al. [6]) is shown in figure 2.2. One of the important features of the architectural models used in Ghazal et al. [5] and Gupta et al. [6] is that they capture the differentiating capabilities of the instruction set and special functional resources, rather than the complete specification required for code synthesis or simulation. This helps in fast performance estimation. Further, the estimator used in Ghazal et al. [5] also produces a trace of the application annotated with the suggested optimizations and ranked bottlenecks.
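A minimal sketch of resource constrained list scheduling of the kind used for cycle-count estimation is given below; it assumes unit-latency operations and greedy issue, and illustrates the general idea rather than the actual scheduler of Gupta et al. [6].

    def list_schedule(ops, deps, units):
        """ops: op name -> FU type; deps: op -> set of predecessor ops
        (unit latency assumed); units: FU type -> available count.
        Returns the estimated cycle count."""
        done, cycle = set(), 0
        while len(done) < len(ops):
            free = dict(units)
            ready = [o for o in ops
                     if o not in done and deps.get(o, set()) <= done]
            issued = []
            for o in ready:                   # greedy issue within resources
                if free.get(ops[o], 0) > 0:
                    free[ops[o]] -= 1
                    issued.append(o)
            done |= set(issued)
            cycle += 1
        return cycle

    ops = {"a": "alu", "b": "alu", "c": "mul", "d": "alu"}
    deps = {"c": {"a", "b"}, "d": {"c"}}
    print(list_schedule(ops, deps, {"alu": 1, "mul": 1}))  # 4 cycles
    print(list_schedule(ops, deps, {"alu": 2, "mul": 1}))  # 3 with two ALUs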
Table 2.1: Comparison of various architecture models used for design space exploration. (The original table compares Gloria [10], Sato [9], Gong [28], Binh [18, 16], Kienhuis [13], Kin [14], Gupta [6] and Ghazal [5] on attributes such as number of FUs of different types, register organization, storage units, interconnect resources, pipelining, controller, variable issue width, number of branch units, memory hierarchy, supported instruction level parallelism (VLIW/superscalar), addressing support, memory pack/unpack, loop vectorization, complex arithmetic patterns, and distinguishing features.)

Figure 2.2: The flow of the estimation scheme of Ghazal [5] and Gupta [6]. (Profiled C code passes through the SUIF front end, is translated and annotated into architecture-compatible instructions, and feeds cycle-level estimation against a parameterized architecture model covering address generation cost, functional unit usage rules and instruction set attributes; the output is an estimate with an annotated profile.)
Simulator based

A simulation based approach requires generating code for the given application and running it on a simulator. For exploring the design space, the compiler used for code generation as well as the simulator should be retargetable. To drive the retargetting process, machine description languages such as EXPRESSION [31] and LISATek [12, 32, 33] have been developed. Simulation based approaches to ASIP design exploration have been used extensively by several research groups, including Gong et al. [28], Gloria et al. [10], Kienhuis et al. [13], Kin et al. [14], Imai et al. (ASIPMeister) [15, 16, 17, 34], Aditya et al. [35], Hoffman et al. [12, 32, 33] and Middha et al. [29]. This is also the popular approach in industry, e.g. Target Compiler Technologies [36], Axys [37] and Tensilica [38]. Kienhuis et al. [13] constructed a retargetable simulator for an architecture template. For each architecture instance, a specific simulator is derived in three steps: the architecture instance is constructed, an execution model is added, and the executable architecture is instrumented with metric collectors to obtain performance numbers.
Object oriented principles together with a high-level simulation mechanism are used to ensure retargetability and efficient simulation speed. Kin et al. [14] propose an approach to design space exploration for media processors that considers power consumption. Code is generated using the IMPACT [39] compiler to increase instruction level parallelism, and the optimized code is consumed by the Lsim simulator. Run-times of benchmarks measured through simulation are used to compute the energy based on the power model they describe; the power dissipation numbers are normalized with respect to the power dissipation of a baseline machine. Most researchers have focused on performance and area and do not address power consumption; only Kin et al. [14] and Imai et al. [15] considered it, and Kin et al. [14] compute the power consumption of the generated ASIP from a very coarse model based on the total cycle count. The main approaches to performance estimation are compared in table 2.2.
2.2.3 Search Control

Algorithms for search range from branch-and-bound (Binh et al. [16] and Kin et al. [14]) to exhaustive search over a restricted domain (Gupta et al. [6] and Fischer et al. [40]). Binh et al. [16] suggested a HW/SW partitioning algorithm (branch-and-bound) for synthesizing the highest performance pipelined ASIPs with multiple identical functional units, with gate count and power consumption as the given constraints. They later improved their algorithm to consider RAM and ROM sizes as well as chip area constraints [17]; chip area includes the hardware cost of the register file for a given application program with the associated input data set. This optimization algorithm finds the best trade-offs between the CPU core, RAM and ROM of an ASIP chip to achieve the highest performance while satisfying design constraints on the chip area. Kin et al. [14] first reduce the search space by eliminating machine configurations not satisfying the given area constraint, as well as those dominated by at least one other machine configuration, and then use a branch-and-bound algorithm to search for an optimum solution. Gupta et al. [6] developed a processor selector which first reduces the domain of search by restricting the various architectural parameter ranges based on certain gross characteristics of the processors, such as branch delay, presence/absence of a multiply-accumulate operation, depth of pipeline and memory bandwidths. The output is a smaller subset of processors. Processor description files of this restricted range of processors are supplied to the performance estimator, which exhaustively explores this set.

Table 2.2: Comparison of major performance measurement techniques. (The original table compares Gloria [10], Gong [28], Binh [18, 16], Kienhuis [13], Kin [14], Gupta [6] and Ghazal [5] on attributes such as technique (simulation based vs. scheduler based), algorithm (list scheduling, branch-and-bound), FU customization, interconnect resources, optimizations considered, measure of performance (cycle count, resource allocation statistics, annotated application traces), memory hierarchy, and other main features such as power consumption and area constraint consideration.)
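The pruning-then-search strategy described above can be sketched as follows: configurations violating the area budget or dominated by another configuration are dropped, and the survivors are searched exhaustively. The two-metric toy data and the function names are illustrative only.

    def dominates(d, c, area, cycles):
        """d dominates c: no worse on both metrics, strictly better on one."""
        return (area(d) <= area(c) and cycles(d) <= cycles(c)
                and (area(d) < area(c) or cycles(d) < cycles(c)))

    def prune_then_search(configs, area, cycles, max_area):
        feasible = [c for c in configs if area(c) <= max_area]
        survivors = [c for c in feasible
                     if not any(dominates(d, c, area, cycles)
                                for d in feasible)]
        return min(survivors, key=cycles)     # exhaustive over the survivors

    configs = [(1, 100), (2, 80), (3, 90), (3, 79), (4, 60)]  # (area, cycles)
    print(prune_then_search(configs, lambda c: c[0], lambda c: c[1],
                            max_area=3))      # (3, 79): fastest within budget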
2.3 Instruction Set Generation

We have found that almost all approaches to instruction set generation can be classified, on the basis of how they generate instructions, as instruction set synthesis, instruction set selection, or instruction set extension approaches.
2.3.1 Instruction Set Synthesis

In this class of techniques, the instruction set is synthesized for a particular application based on the application requirements, quantified in terms of the required micro-operations and their frequencies. Examples of such approaches are Huang et al. [7, 8], Praet et al. [41], Hoon Choi et al. [24] and Gschwind et al. [23]. These approaches differ in the manner in which instructions are formed from a list of operations. Huang and Despain [7, 8] integrated the problem of instruction formation with the scheduling problem. The simulated annealing technique is used to solve the scheduling problem. Instructions are generated from time steps in the schedule; each time step corresponds to one instruction. A pipelined machine with a data stationary control model is assumed. They set the goal of the instruction set design to optimize and trade off the instruction set size, the program size and the number of cycles to execute a program. General multi-cycle instructions are not considered; however, multi-cycle arithmetic/logic operations, memory accesses, and changes of control flow (branch/jump/call) are supported by specifying the delay cycles as design parameters. The formulation takes the application, architecture template, objective function and design constraints as inputs, and generates as outputs the instruction set, the resource allocation (which instantiates the architecture template) and the assembly code for the application. This approach is shown in figure 2.3.
Figure 2.3: The integrated scheduling/instruction-formation process of Huang and Despain [7, 8]. (The application's micro-operations, design constraints, objective function and architecture template drive scheduling/allocation and instruction formation, producing the instruction set, micro-architecture and compiled code.)
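The core idea of forming instructions from schedule time steps can be sketched as follows; this is a strong simplification of Huang and Despain's formulation, with illustrative names, and it ignores their annealing-based scheduler.

    def form_instructions(schedule):
        """schedule: list of sets, each holding the micro-ops issued in one
        time step. Each step yields one instruction; distinct micro-op
        bundles form the instruction set."""
        instr_set, program = set(), []
        for step in schedule:
            instr = tuple(sorted(step))
            instr_set.add(instr)
            program.append(instr)
        return instr_set, program

    schedule = [{"load", "mul"}, {"add"}, {"load", "mul"}, {"store"}]
    iset, prog = form_instructions(schedule)
    print(len(iset), len(prog))  # 3 distinct instructions, 4-cycle program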
2.3.2 Instruction Selection

In this class of techniques, a superset of instructions is available and a subset of them is selected to satisfy the performance requirements within the architectural constraints. Examples of such approaches are Imai et al. [9, 15, 42, 18], Liem et al. [19], Leupers et al. [25] and Shu et al. [43]. The approaches in this class differ in the algorithms they use to select instructions from the superset. Though some approaches (e.g. Liem et al. [19], Leupers et al. [25] and Shu et al. [43]) present instruction selection for compiler backends, others (e.g. Imai et al. [9, 15, 18, 42]) view it from a processor synthesis perspective. The underlying problems in the two contexts are closely related: in ASIP synthesis, the attempt is to evaluate which instructions can be gainfully included in the instruction set, whereas in code generation, the problem is to find suitable instructions to be included in the generated code. Imai et al. [9, 15] assume that the instruction set can be divided into two groups: operations and functions. It is relatively easy for a compiler to generate instructions corresponding to operators used in the C language compared to those which correspond to functions, because the full set of operators is already described, but the full set of user-defined functions is not known a priori. They further divide the set of operators into two subgroups, primitive operators and basic operators, where the set of primitive operators is chosen so that the basic operators can be substituted by a combination of primitive operators. Thus they have classified functionalities as follows.
• Primitive Functionalities (PF): can be realized by a minimal hardware component, such as an ALU or shifter.
• Basic Functionalities (BF): the set of operations used in C except those included in PF.
• Extended Functionalities (XF): library functions or user defined functions.

The intermediate instructions are described as primary RTL, basic RTL and extended RTL respectively for these classes. The generated ASIP includes hardware modules corresponding to all of the primary RTL, but only a part of the basic RTL and extended RTL. The selection problem is formulated as an integer linear programming problem with the objective of maximizing the performance of the CPU under constraints on chip area and power consumption; a branch-and-bound algorithm is used to solve the problem. Later, they revised their formulation to consider functional module sharing constraints and pipelining [42, 18].

Some approaches (Liem et al. [19], Shu et al. [43] and Leupers et al. [25]) use pattern matching for instruction selection. The set of template patterns is extracted from the instruction set, and the graph representing the intermediate code of the application is covered by these patterns. By arranging instruction patterns in a tree structure, the matching process becomes significantly faster. For example, consider the expression

y = w1 * x1 + w2 * x2 + w3 * x3 + w4 * x4

The corresponding expression tree is shown in figure 2.4. This type of expression evaluation is very frequent in the DSP domain. The hardware configuration required to implement it is shown in figure 2.5 a), and the instruction patterns in figure 2.5 b). Apart from intrinsic patterns like MULT and ADD, there is also MAC, which performs a multiplication and an addition together. If only the intrinsic patterns (ADD and MULT) are available for pattern matching, then we get the cover shown in figure 2.6 b); if we use the MAC pattern as well, we get the optimal cover shown in figure 2.6 a). Using the cost and delay of the different patterns and the frequency count of this expression tree obtained from dynamic profiling, the cost-performance trade-offs can be evaluated. This evaluation decides whether such complex instructions should be included in the selected instruction set.
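The cost trade-off between covering with intrinsic patterns only and covering with a MAC pattern can be sketched with a small tree-covering routine; the unit pattern costs are illustrative, and the routine is a simplification of the tree pattern matching used in the cited work.

    def cover(node, use_mac=True):
        """node: operand name (str) or ('add'|'mul', left, right).
        Returns the minimum number of cost-1 patterns covering the tree."""
        if isinstance(node, str):
            return 0                           # operands cost nothing
        op, left, right = node
        best = 1 + cover(left, use_mac) + cover(right, use_mac)  # ADD/MULT
        if use_mac and op == "add":
            # MAC covers an add and one mul child in a single pattern.
            for m, other in ((left, right), (right, left)):
                if isinstance(m, tuple) and m[0] == "mul":
                    best = min(best, 1 + cover(m[1], use_mac)
                                       + cover(m[2], use_mac)
                                       + cover(other, use_mac))
        return best

    mul = lambda a, b: ("mul", a, b)
    add = lambda a, b: ("add", a, b)
    # y = w1*x1 + w2*x2 + w3*x3 + w4*x4, associated as in figure 2.4
    tree = add(add(add(mul("w1", "x1"), mul("w2", "x2")),
                   mul("w3", "x3")), mul("w4", "x4"))
    print(cover(tree), cover(tree, use_mac=False))  # 4 with MAC, 7 without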
Figure 2.4: Expression tree for y = w1*x1 + w2*x2 + w3*x3 + w4*x4

Figure 2.5: a) Multiplier-accumulator, b) Instruction patterns (MULT, ADD and MAC)
Figure 2.6: a) Optimal cover with MAC, b) Cover with intrinsic patterns only

Liem et al. [19] adopt a pattern matching approach to instruction set selection. A wide range of instruction types is represented as a pattern set, which is organized in a manner such that matching is extremely efficient and retargetting to architectures with a new instruction set is well defined. Shu et al. [43] have implemented a matching method based on a pattern tree structure of instructions. Two genetic algorithms (GAs) are implemented for pattern selection: a pure GA which uses standard GA operators, and a GA with backtracking which employs variable-length chromosomes.
2.3.3 Instruction Set Extension

We observe that there are design methodologies in both the instruction synthesis (Gschwind et al. [23]) and the instruction selection (Imai et al. [15, 42, 18] and Choi et al. [24]) categories which start from a basic instruction set and only synthesize/select application-specific special or complex instructions. A comparison of various approaches for instruction set generation is shown in table 2.3. Hoon et al. [24] proposed an approach which supports multi-cycle complex instructions as well. First, the list of micro-operations is matched to the primary instructions in order to estimate the execution time if only these instructions are employed. If the estimated performance is unacceptable, then the complex instructions are considered; if the performance is still unacceptable, then special instructions which require special hardware units are included. Gschwind [23] describes an approach for application-specific design based on an extendible microprocessor core: critical code portions are customized using application-specific instruction set extensions. K. Atasu et al. [44] proposed an approach for automatic application-specific instruction set extensions under micro-architectural constraints. Their algorithm selects maximal-speedup convex subgraphs of the application data flow graph under constraints on the number of inputs and outputs in a cluster.
Figure 2.7: Retargetable code generator (the application, architecture template and instruction set architecture (ISA) are input to a retargetable code generator, which produces the object code)
2.4 Code Synthesis

The reported work follows two different approaches for code synthesis: retargetable code generators and compiler generators. Table 2.4 compares the various code synthesis approaches.
2.4.1 Retargetable Code Generator Taking architecture template, instruction set architecture (ISA) and application as inputs, object code is generated (figure 2.7). Cheng et al. [45], Wilson et al. [46], Leupers et al. [21, 47, 25], Kreuzer et al. [48], Praet et al.[49], Hanono and Devadas [22], Visser et al. [50] and Yamaguchi et al. [51] follow such an approach. All these approaches try to address the following sub-problems while synthesizing instructions 1. Instruction mapping: DFG of a given application is mapped to the instruction patterns using tree or graph pattern matching technique in an attempt to get a good cover.
Related Work 23
Table 2.3: Comparison of major instruction generation approaches
Approach → Imai Liem Praet Huang Shu Leupers Hoon Gshwind Attribute ↓ [9, 15, 42, 18] [19] [41] [7, 8] [43] [25] [24] [23] Generation selection selection synthesis synthesis selection selection synthesis synthesis mechanism Technique/ ILP, branch-and- tree pattern graph modified scheduling tree pattern ILP, dy- subset-sum hw/sw coAlgorithm bound matching, pattern problem, simulated matching, namic problem evaluation dynamic match- annealing genetic programprogram- ing algorithms ming ming Instruction primary, basic single-cycle single-cycle, intrinsic, types and special multi-cycle complex complex Other main primary, basic capture instruction utilization, better op- generated extendible issues and and special RTL behavioral instruction operand timization single and core techdistinguishing for basic, pri- represenencoding, delay than tree multi-cycle nique, features mary and special tation of load/store and delay pattern complex in- application functionalities instructions branches. integrated matching structions then specific ininstruction formasub-set-sum struction set tion with scheduling technique extensions problem
The algorithms used to solve these three sub-problems differ between the various approaches. Some approaches, for example Kreuzer et al. [48] and Leupers et al. [21, 47, 25], also address the issue of code compaction while generating code. Based on retargetability, retargetable compilers can be classified as automatic-retargetable (or parameterizable), user-retargetable and developer-retargetable, depending on the effort involved in retargeting. Since developer-retargetable compilers are in effect used to develop compilers, we consider them in the class of compiler generators.
Automatic-retargetable or parameterizable The description of the target processor, or the target processor parameters, is simply plugged in. Once the compiler is built, it generates code for the specified processor configuration; no rewriting of code is required. RECORD [52] and TRIMARAN [30] are examples of such compilers. The problem with such compilers is that they can explore only a very limited architecture design space. TRIMARAN is capable of exploring various configurations of the HPL-PD architecture: only a few parameters, such as the number and type of functional units (FUs), the number and types of registers in register files, and instruction latencies, can be varied, and a new FU, for example, cannot be added by setting parameters. RECORD targets DSP-like architectures; its very low level processor description (in MIMOLA [53]) makes it unsuitable for early and rapid design space exploration.
User-retargetable Such compilers provide the flexibility to change the processor description and to write modules performing processor specific optimizations. The source code is modular, and a user without in-depth knowledge should be able to write the desired modules with reasonable effort. CHESS [54], AVIV [55], IMPACT [55] and CodeSyn [56] are such compilers.
[Figure: the Application, Architecture Template and Instruction Set Architecture (ISA) are input to a Retargetable Compiler Generator, which produces a Customized Compiler and, from it, Object Code]
Figure 2.8: Compiler Generator
Though a user should be able to retarget the compiler with reasonable effort, such compilers are still not suitable for early and rapid design space exploration.
2.4.2 Compiler Generator Taking the architecture template and instruction set architecture as inputs, a customized compiler is generated, which is then used to generate object code for the given application written in a high level language (figure 2.8). Examples of such approaches are Hatcher et al. [20] and Kuroda et al. [57]; developer-retargetable compilers are also considered in this class. A compiler generated in this manner generally has phases similar to general compilers, namely program analysis, intermediate code optimization and code generation; the difference is that the optimizations are tailored to the specific architecture-application combination. Hatcher et al. [20] developed a code generator generator that accepts a machine description in a YACC-like format and a set of C structure definitions for valid tree nodes, and produces C source (both functions and initialized data) for a time and space efficient code generator. Important features of their system include compact machine descriptions via operator factoring and leaf predicates and the fast production of efficient code generators. The system supports arbitrary evaluation orders and goal oriented code generation.
Table 2.4: Comparison of major code synthesis approaches

Hatcher [20]: compiler generator; supports arbitrary evaluation orders.
Kuroda [57]: compiler generator; relies on expert knowledge.
Saghir [27]: code generator; needs experts for kernel synthesis.
Cheng [45]: code generator; tree pattern matching for instruction mapping.
Wilson [46]: code generator; tree pattern matching.
Leupers [47, 58, 59, 60]: code generator; tree pattern matching; code compaction based on a heuristic scheduler; behavioral/structural target architectures; address assignment.
Kreuzer [48, 61]: code generator.
Praet [49]: code generator; graph pattern matching.
Gebotys [62]: code generator; ILP based.
Yamaguchi [51]: code generator; scheduling followed by instruction mapping.
Hanono [22]: code generator; branch-and-bound; mapping, binding and scheduling performed concurrently.
Visser [50]: code generator; simulated annealing; phase coupling, with detailed register allocation in a second step.

All of these approaches address resource allocation and binding, scheduling and code optimization, and all are retargetable; several also handle register spills, exploit instruction level parallelism or address the phase-ordering problem.
Developer-retargetable Such compilers can be retargeted only by compiler developers, since a large part of the compiler needs to be rewritten when it is retargeted for some other configuration. SPAM [63] and CoSy [64] are examples of such retargetable compilers. Though such compilers cover a larger design space, the significant effort (involving a compiler developer) required for retargeting rules them out for initial design space exploration.
2.5 Micro-architecture Design and Hardware Synthesis In this step, a micro-architecture is first designed to implement the selected architectural features and instruction set (Huang and Despain [7, 8]). After this, the hardware for this micro-architecture is synthesized, starting from a description in VHDL/Verilog, using standard synthesis tools.
2.6 Conclusion Though a variety of approaches have evolved to address each of the key steps in ASIP synthesis, the target architecture space explored by these methodologies is still limited; in particular, the register file is not treated as part of storage during exploration, even when memory organization is explored. With the increase in integration it should be possible to support memory hierarchies on the chip, yet this has not been addressed in an integrated manner. Similarly, the issues of pipelined ASIP design as well as low power ASIP design are still in their infancy. We also observe that the problems of processor synthesis and retargetable code generation have been considered in isolation. We propose to explore storage organization, including the register file, efficiently without using a retargetable compiler or simulator.
Chapter 3
Impact of Register File Size on Performance and Power Metrics
Our survey presented in the last Chapter underlines the need to broaden the design space explored for ASIPs by scheduler based techniques. In particular, the storage organization, which includes register file size, register windows and cache memory, has so far not been explored by such approaches. Before developing new techniques to explore these features, we studied the impact of register file size on various factors using an existing retargetable code generator and simulator framework. The results of this study are recorded in this Chapter. Section 3.1 presents the experimental setup used, whereas the important results along with their analysis are presented in section 3.2. The last section of the Chapter summarizes the outcome of our study.
3.1 Experimental Setup Some benchmark programs were chosen, and code generation and performance evaluation were performed with a varying number of registers for the ARM7TDMI processor using the parameterizable compiler encc [65], developed at the University of Dortmund, Germany. The benchmark programs were then analyzed to identify the application characteristics responsible for the observed behavior.
3.1.1 The ARM7TDMI Processor The ARM7TDMI by ARM Ltd [66] is a 32-bit RISC processor which offers high performance combined with low power consumption. This processor employs a special architectural strategy known as THUMB, whose key idea is a 16-bit reduced instruction set for conserving power. Thus the ARM7TDMI has two instruction sets: 1. the standard 32-bit ARM set, and 2. the 16-bit THUMB set. THUMB code operates on the same 32-bit register set as ARM code, so it achieves better performance than traditional 16-bit processors using 16-bit registers, and consumes less power than traditional 32-bit processors. Various portions of a system can be optimized for speed or for code density by switching between THUMB and ARM execution as appropriate. The ARM7TDMI processor has a total of 37 registers (31 general purpose 32-bit registers and 6 status registers), but these are not visible simultaneously; the processor state and operating mode dictate which registers are available to the programmer. In THUMB mode, only 8 general purpose registers are available to the user, requiring only 3 bits for register coding. This reduces the instruction size considerably.
3.1.2 Benchmark Suite To investigate the effect of changing the register file size, a number of benchmark applications had to be selected. The areas covered by ARM processors include automotive equipment, consumer entertainment, digital imaging, industrial applications, networking, security, storage and wireless applications [67]. The chosen benchmarks were taken from the domains of digital signal processing and multimedia, along with standard sorting algorithms. These benchmark programs are available at http://www.cse.iitd.ernet.in/∼manoj/research/benchmarks.html.
1. biquad_N_sections (DSP domain)
2. lattice_init (DSP domain)
3. matrix-mult (multiplication of two m × n matrices)
4. me_ivlin (media application)
5. bubble_sort
6. heap_sort
7. insertion_sort
8. selection_sort
The biquad_N_sections program, part of the DSP-kernel benchmark suite [68], performs the filtering of input values through N biquad IIR sections. lattice_init calculates the output of a lattice filter, whereas matrix_mult implements the multiplication of two 2D matrices. me_ivlin is a multimedia application, mainly consisting of integer arithmetic operations. The standard sorting algorithms sort a given array of integers using different methods.
3.1.3 The encc Compiler The encc [65] compiler was used for code generation and performance evaluation. encc was developed for the RISC class of architectures and generates code with the objective of reducing energy consumption. It features a built-in power model which is used to take decisions during the compilation process. The compiler is configured through a parameter file which contains several constant declarations and processor specific information; using this configuration file for the target processor, a customized compiler is generated. In our case, we took the configuration file for the ARM7TDMI processor and changed the number of registers in the range from 3 to 8. For each case, a compiler was generated which was used to compile and evaluate the performance of the benchmark programs. Taking an application program written in C, an intermediate representation (IR) file is generated using LANCE [69]. Some standard optimizations are performed on this IR file using LANCE library functions; these include constant propagation, copy propagation, dead code elimination, constant folding, jump optimizations and common subexpression elimination. Taking an IR file as input, the code generator generates a forest of data flow trees for each function, and a cover is obtained for each tree based on tree pattern matching. At this stage, the internal power model is used to generate a valid cover with minimal power consumption. A low level intermediate representation is then generated, on which register allocation, instruction scheduling, spill code generation and peephole optimizations are performed to produce assembly code. An assembler and a linker are used to create the object code. An instruction set simulator produces the outputs required for validation, together with an instruction trace which is analyzed by a trace analyzer. encc also provides information on spilled registers. The available optimization objectives are time, energy, size and power; one can be selected at a time.
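The parameter file format is not reproduced here; purely as a hypothetical illustration (all identifiers and values below are invented, not encc's actual format), such a description might carry constants like the following, with the register count being the one value varied in this study:

/* Hypothetical processor parameter fragment for a compiler like encc.
 * All names and values are illustrative assumptions. */
#define TARGET_NAME       "ARM7TDMI"
#define NUM_REGISTERS     5     /* varied from 3 to 8, one compiler build per value */
#define INSTRUCTION_WIDTH 16    /* THUMB instructions are 16 bits wide              */
#define LOAD_LATENCY      3     /* cycles, illustrative                             */
#define STORE_LATENCY     2     /* cycles, illustrative                             */

For each such value a customized compiler was generated and the benchmark programs were re-compiled and re-evaluated.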
3.1.4 Power Model The power model used in the compiler is based on the processor power model developed by Tiwari et al. [70], which distinguishes between basic power and inter-instruction effects. Basic power is the current measured while a single instruction executes in a loop; an approximate amount is added for stalls and cache misses. The change of circuit state between different instructions and resource constraints are summed up in the inter-instruction effects. For computing the basic power costs and inter-instruction effects, actual measurements were made for the THUMB instructions on an ARM evaluation board. A change in the register file size changes not only the number of data accesses but also the associated instruction accesses. To isolate the effects on power consumption due to data and program with changing register file size, two configurations were studied: 1. both data and instructions in external (off-chip) memory, and 2. data in off-chip and instructions in on-chip memory.
The off-chip data and on-chip instruction configuration is an interesting possibility, as in many embedded systems implementations a “fixed synthesized code” could be stored in an on-chip memory. The power consumption models of the two memories were again generated from actual current measurements. For off-chip memory, measurements were carried out on the four 128KX8 SRAM chips (IDT71V124SA) used in the ATMEL evaluation board (AT91M40400). For on-chip instruction memory, the processor current measurements for instructions were carried out with and without
the use of on-chip memory for programs. In effect, the processor instruction set power model mentioned earlier is based on measurements carried out without the use of on-chip memory. Based on these measurements, the power consumption of each of the two memories for the different possible access bit-widths and for read and write operations was computed; this constituted the memory power models. Current measurements were done by Theokharidis et al. [71]. Thus, effectively, the results presented in the next section utilize the following power models associated with each instruction for the two configurations.
1. Off-chip data and instruction:
P_tot(inst) = P_cpu(inst) + P_offchip(read, 16) + P_offchip(read/write, width)    (3.1)
2. On-chip instruction and off-chip data:
P_tot(inst) = P_cpu(inst) + P_onchip(read, 16) + P_offchip(read/write, width)    (3.2)
P_cpu(inst) includes the inter-instruction effects. The instructions, being THUMB instructions, are 16 bits wide and are read from off-chip or on-chip memory respectively. The third term in both equations is optional, as only some instructions require a data access; the data width, which may differ between instructions, is also accounted for. This power model has been integrated into the encc compiler.
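Structurally, equations 3.1 and 3.2 are a sum of table lookups; the sketch below only illustrates this structure (p_cpu and p_mem stand in for the measured models, and all names are assumptions, not encc internals):

/* Per-instruction power, per equations 3.1/3.2: CPU power plus a 16-bit
 * instruction fetch from off-chip or on-chip memory, plus an optional data
 * access of instruction-dependent width. The lookup tables are assumed. */
typedef enum { MEM_OFFCHIP, MEM_ONCHIP } mem_kind;

extern double p_cpu(int inst);                            /* incl. inter-instruction effects */
extern double p_mem(mem_kind m, int is_write, int width); /* from current measurements       */

double p_total(int inst, mem_kind inst_mem,
               int has_data_access, int is_write, int data_width) {
    double p = p_cpu(inst) + p_mem(inst_mem, 0, 16);      /* instruction fetch   */
    if (has_data_access)                                  /* optional third term */
        p += p_mem(MEM_OFFCHIP, is_write, data_width);
    return p;
}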
3.1.5 Experimentation with Register File Size The number of physical registers was varied in the range from 3 to 8 for the ARM7TDMI processor. The number of registers was increased beyond 8 as well, but in that case only assembly code could be generated as no instruction set simulator was available to execute the code. However, we were able to get information about spilling and static code size in such cases. For each different number of physical registers, encc was compiled to generate a customized compiler which was then used to generate code and other trace information for our benchmark programs.
3.2 Results We present the results obtained for the number of executed instructions, the number of cycles, the ratio of spill instructions to total static code size, and power and energy consumption. The results and their analysis are based on the following two assumptions. 1. The processor cycle time does not change with the number of registers; the change in the number of cycles therefore translates directly to performance. 2. The power consumed by each instruction does not change significantly with the number of registers.
3.2.1 Number of Executed Instructions The results obtained for the number of executed instructions are shown in figure 3.1. The numbers of instructions are identified by analysis of the instruction trace; encc uses a standard trace driven simulator which generates this trace. Values for different programs are scaled to produce the results on a single plot, and the scale factors are shown in the figure: if (x m) is mentioned with the name of a benchmark application, the observations for that benchmark are multiplied by a factor ‘m’. This is acceptable since the general trends can still be observed. We can observe one sharp curvature (knee) in some curves. The curve for the program biquad_N_sections has its knee at 4 registers, whereas the programs bubble_sort and insertion_sort both have their knee at 5 registers. The curves for some of the other programs do not contain such a knee. In the program biquad_N_sections there are two for loops with high iteration count. Each contains a statement like some_array[loop_counter] = value; which needs 4 registers for its execution without spilling: one each for the value of loop_counter, the base address of the array some_array, the offset value, and the value to be written into the array. Thus the number of instructions shoots up significantly when we lower the number of physical registers from 4 to 3, since additional spill code has to be inserted within the loop. Looking at the programs bubble_sort and insertion_sort, we observe that each contains a 2-level nested loop. The statements in the innermost loop in both cases need 5 registers for execution, which is why we observe a knee at 5 registers in the curves for these programs.
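Reconstructed as code (an illustration of the loop shape described above, not the benchmark source):

/* Illustrative reconstruction of the biquad_N_sections loop shape: four
 * values are live at the store, so 4 registers avoid spilling, while a
 * 3-register allocation forces spill code into the loop body. */
void fill(int *some_array, int n, int value) {
    for (int loop_counter = 0; loop_counter < n; loop_counter++) {
        /* live simultaneously: loop_counter, base address of some_array,
           the element offset, and the value to be written */
        some_array[loop_counter] = value;
    }
}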
[Plot: number of executed instructions (millions) vs. number of registers (3-8); scale factors: biquad (x 500), lattice_init (x 2), matrix-mult (x 250), me_ivlin (x 1), bubble_sort (x ), heap_sort (x 14), insertion_sort (x 6), selection_sort (x 3)]
Figure 3.1: Number of Executed Instructions
3.2.2 Number of Cycles The results obtained for the number of cycles are shown in figure 3.2. The numbers of cycles are obtained using the ‘armsd’ simulator available in the ARM tool set. Again, the values for different programs are scaled to produce the results on a single plot, and the scale factors are shown. The general behavior of the curves for the number of cycles is similar to that for the number of instructions. However, as we lower the number of registers, more spill instructions are inserted; since spill instructions consist mainly of multi-cycle load and store instructions, the average number of cycles per instruction increases more than the number of instructions. Still, the general shape of the curves is the same: the same application characteristics are responsible for similar behavior in both the number of instructions and the number of cycles.
[Plot: number of cycles (millions) vs. number of registers (3-8), one scaled curve per benchmark (same scale factors as figure 3.1)]
Figure 3.2: Number of Cycles
3.2.3 Ratio of Spill Instructions to Total Static Code Size
To obtain this ratio, spill instructions are identified in the code (static code, not the execution trace) generated by encc. The results obtained for the ratio of spill instructions to total static code size are shown in figure 3.3. The values for the program lattice_init are high because of high register pressure: this program contains a 2-level nested for loop whose inner loop contains two statements which need 6 registers for execution. An interesting feature of this program is the presence of common sub-expressions in the two statements of the inner loop; three additional registers are required to avoid repetition of address calculations and memory accesses. Values for the program me_ivlin are high due to the large number of variables required to be live for a long time, so spilling is high, but it decreases continuously with an increasing number of registers; to eliminate all spill code from this program, 19 registers are required. The number of spills drops sharply at 7 registers for the program matrix-mult, because 7 registers are sufficient to execute the statement in the innermost for loop (3-level nesting).
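A hypothetical sketch of this pattern (invented names; the actual benchmark differs) shows how the shared sub-expressions raise the register pressure:

/* Two inner-loop statements share the sub-expression a[i][j]: keeping the
 * common address and loaded value in registers avoids recomputing the
 * address and re-reading memory, at the cost of extra live registers. */
void kernel(int m, int n, int a[m][n], int out[m][n],
            const int coeff[n], int accum[m]) {
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++) {
            out[i][j] = a[i][j] * coeff[j];   /* uses a[i][j] ...     */
            accum[i] += a[i][j] * coeff[j];   /* ... and so does this */
        }
}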
[Plot: ratio of spill instructions to total static instructions vs. number of registers (3-10), one scaled curve per benchmark (same scale factors as figure 3.1)]
Figure 3.3: Ratio of Number of Spill Instructions to Total Static Instructions
3.2.4 Average Power Consumption We have used two different memory configurations in our study. One considers only off-chip memory, while the other considers on-chip instruction memory and off-chip data memory.
Off-chip memory The results obtained for average power consumption while considering only off-chip memory are shown in figure 3.4. The power values are highest for the matrix-mult program, because the innermost loop (3-level nesting) contains the statement c[i][j] = c[i][j] + a[i][k] * b[k][j]; which accesses two 2-D array elements for reading and one 2-D array element for both reading and writing. Since all the arrays are 2-D, each address calculation requires an arithmetic shift left (instead of another expensive multiplication) and an addition; one power-hungry multiplication is still required for the actual arithmetic operation between the two matrices, so the power consumption is high. The values for the program lattice_init are also high because it too is a memory access intensive application.
[Plot: average power consumption (mW) vs. number of registers (3-8), one scaled curve per benchmark]
Figure 3.4: Average Power Consumption using only Off-chip Memory
A 2-level nested for loop can be found, and the inner loop body contains statements accessing two 2-D matrices and one 1-D array. The values for the program me_ivlin are quite high due to high register pressure, which leads to more spilling to memory; since the power consumption of the external data memory is significantly higher than the power consumed within the processor, the application's power demands are high. The values for the programs bubble_sort and heap_sort are similar because memory accesses in both are of a similar extent. The values for the program selection_sort are the lowest, because in selection sort data movement in memory is minimal. For the program insertion_sort the amount of data movement in memory is more than that of selection_sort but less than that of bubble_sort, which explains its estimates. Our analysis shows that using more registers does not help significantly in reducing power consumption, especially for memory intensive applications (e.g. the programs matrix-mult and lattice_init): although the number of instructions executed and the number of cycles decrease considerably with an increasing number of registers in our observation range, these applications have higher power consumption which even additional registers cannot reduce. For the other applications, the saving in power consumption is marginal and saturates after a few registers.
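Written out (with illustrative names), the address arithmetic is:

/* Row-major 2-D indexing with row length n = 2^LOG2_N: the element address
 * is base + ((i << LOG2_N) + j), i.e. a shift and an add instead of a
 * multiplication; only the matrix arithmetic itself still multiplies. */
#define LOG2_N 4                        /* assumed row length n = 16 */
static int get2d(const int *base, int i, int j) {
    return base[(i << LOG2_N) + j];     /* &base[i*n + j] via shift+add */
}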
On-chip Instruction Memory and Off-chip Data Memory The results obtained for average power consumption while considering on-chip instruction and off-chip data memory are shown in figure 3.5.
[Plot: average power consumption (mW) vs. number of registers (3-8), one scaled curve per benchmark]
Figure 3.5: Average Power Consumption using On-chip Instruction and Off-chip Data Memory
We observe a significant change in power consumption for the applications which are not memory intensive but have high register pressure (e.g. the program me_ivlin); in such applications, significant spilling is avoided by providing additional registers. On-chip instruction memory consumes less power than the off-chip memory used for data accesses, for several reasons: on-chip memory is usually smaller, and the bus lines that need to be driven are shorter since the boundaries of the chip are not left. The average power consumption is lower for all the benchmark programs compared to the other memory configuration (i.e. only off-chip memory).
3.2.5 Energy Consumption Energy is computed as the product of average power consumption and execution time, E = P × t. Execution time is calculated in terms of the number of cycles, as a constant cycle time is assumed. Again, we present results for both memory configurations.
Off-chip memory The results obtained for energy consumption while considering only off-chip memory are shown in figure 3.6. For this memory configuration the average power consumption is almost constant.
[Plot: energy consumption (WS) vs. number of registers (3-8), one scaled curve per benchmark]
Figure 3.6: Energy Consumption using Off-chip Memory
The energy is computed as the product of power and time; thus, the curves follow the same trend as the number of cycles required for execution.
On-chip Instruction Memory and Off-chip Data Memory The results obtained for energy consumption while considering on-chip instruction memory and off-chip data memory are shown in figure 3.7. For this configuration the average power consumption is lower in general, and there is a significant saving in power when spilling is reduced by providing additional registers. This results in a significant reduction in energy consumption with a larger number of registers. The difference is especially visible for applications which are not too memory intensive but have high register pressure, such as me_ivlin.
[Plot: energy consumption (WS) vs. number of registers (3-8), one scaled curve per benchmark]
Figure 3.7: Energy Consumption using On-chip Instruction and Off-chip Data Memory
3.2.6 Analysis of Results Having seen the variation of the individual parameters (number of instructions executed, number of cycles taken for execution etc.) across all the programs, we now look at the variation of all parameters together for each program. We analyze the results for two application programs individually: lattice_init and me_ivlin. We used on-chip instruction memory and off-chip data memory while generating these results. Results obtained for the program lattice_init are shown in figure 3.8. The y-axis is normalized to bring all values into the same range; the intent is to compare the shapes of the curves on the same plot. We find that in this application the power consumption does not change significantly with the number of registers, even though there is some change in the number of spill instructions. This is due to the fact that this application is memory intensive. The energy consumption shows a steady drop, dominated by the reduction in the number of cycles, without any pronounced knee. Results obtained for the program me_ivlin are shown in figure 3.9. Here we can see the power consumption change as we vary the number of registers.
[Plot: normalized results for lattice_init: executed instructions (1 = 0.25 million), number of cycles (millions), energy (1 = 20 mWS), power (W) and spill instructions (thousands) vs. number of registers (3-8)]
Figure 3.8: Results for the program lattice_init
[Plot: normalized results for me_ivlin: executed instructions (millions), number of cycles (1 = 5 millions), energy (1 = 62.5 mWS), power (W) and spill instructions (thousands) vs. number of registers (3-8)]
Figure 3.9: Results for the program me_ivlin
This is because the application is not memory intensive but has high register pressure, so additional registers help in avoiding spills and thus reduce the memory accesses. A careful analysis shows two knees in the energy curve: the one at register value 4 is due to the knee in the cycle count, whereas the knee at register value 6 is due to the knee in the power curve. Table 3.1 shows the maximum percentage increase in performance and reduction in power and energy due to an increase of one register in each of the application programs, and indicates where this takes place.

Application : Performance (reg. size, % inc.) : Power (reg. size, % red.) : Energy (reg. size, % red.)
biquad_N_sections : 3 → 4, 57.5 : 3 → 4, 12.6 : 3 → 4, 62.9
lattice_init : 4 → 5, 20.5 : 6 → 7, 1.0 : 4 → 5, 21.0
matrix-mult : 3 → 4, 29.7 : 7 → 8, 7.4 : 3 → 4, 33.4
me_ivlin : 3 → 4, 53.4 : 5 → 6, 15.3 : 3 → 4, 59.3
bubble_sort : 4 → 5, 46.3 : 4 → 5, 17.3 : 4 → 5, 55.6
heap_sort : 6 → 7, 25.6 : 6 → 7, 10.3 : 6 → 7, 33.2
insertion_sort : 4 → 5, 44.8 : 4 → 5, 22.3 : 4 → 5, 57.1
selection_sort : 3 → 4, 22.2 : 5 → 6, 14.0 : 5 → 6, 30.1
Average : 37.5 : 12.5 : 44.1
Table 3.1: Maximum variation in results for various benchmark programs

This table establishes the importance of register file size as an architectural feature: a single register increase results in a performance improvement of up to 57.5% and an energy reduction of up to 62.9%. Power is relatively insensitive to changes in the number of registers. Furthermore, there is a high degree of correlation between the register file size which gives optimum performance and the one which gives optimum energy consumption. We have also analyzed the contributions of various factors to the energy saving, and notice that most of the energy saving comes from the saving in execution time. This contribution for the application biquad_N_sections is shown in figure 3.10; the code was optimized for time in this case. It is important to note, though, that two codes, one optimized for execution time and the other for energy, need not be the same. We observe that the saving in execution time contributes most of the energy saving. The supply voltage is assumed to be 3.3 Volts for all observations, whereas the clock period is around 30 ns.
Figure 3.10: Contributions of time saving and power saving in energy saving
[Plot: normalized clock period (1 = 101.75), supply voltage (1 = 3.3 V) and energy with reduced voltage (1 = 49.63 micro WS) vs. number of registers (3-8)]
Figure 3.11: Energy saving due to voltage scaling
Impact of Register File Size on Performance and Power Metrics
45
To study the full potential of energy reduction with an increasing number of registers, we assume the execution time to be constant but allow flexibility in the supply voltage (figure 3.11). To keep the execution time constant when execution requires a larger number of cycles, we have to increase the supply voltage to reduce the clock period. For estimating the supply voltage with varying clock period we have used models from Chandrakasan et al. [72], and with the estimated voltage we have calculated the energy. Energy is the product of average power and execution time; here the execution time is constant and the power consumption depends quadratically on the supply voltage. Taking these factors into consideration, we have computed the energy consumption. The curves show that energy consumption can be reduced by using a lower supply voltage when a larger number of registers is available.
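A sketch of this computation under stated assumptions (the delay-model constants, threshold voltage, capacitance and cycle counts below are illustrative, not measured values) is:

#include <stdio.h>

#define V_T     0.8     /* threshold voltage (V): assumed               */
#define K_DELAY 2.0e-9  /* delay model constant: assumed                */
#define C_EFF   1.0e-9  /* effective switched capacitance (F): assumed  */

/* Chandrakasan-style delay model: clock period t = K * V / (V - Vt)^2,
 * monotonically decreasing in V above the threshold. */
static double clock_period(double vdd) {
    return K_DELAY * vdd / ((vdd - V_T) * (vdd - V_T));
}

/* Smallest supply voltage whose clock period meets `target` (bisection). */
static double vdd_for_period(double target) {
    double lo = V_T + 1e-3, hi = 5.0;
    for (int i = 0; i < 60; i++) {
        double mid = 0.5 * (lo + hi);
        if (clock_period(mid) > target)
            lo = mid;              /* too slow: raise the voltage */
        else
            hi = mid;
    }
    return hi;
}

int main(void) {
    double exec_time = 1.0e-3;     /* fixed execution time budget (s): assumed */
    /* illustrative cycle counts for register file sizes 3..8 */
    long cycles[] = { 55000, 50000, 43000, 38000, 35000, 29000 };
    for (int r = 3; r <= 8; r++) {
        long   n      = cycles[r - 3];
        double t_clk  = exec_time / n;          /* required clock period  */
        double vdd    = vdd_for_period(t_clk);
        double energy = C_EFF * vdd * vdd * n;  /* E ~ C * V^2 * cycles   */
        printf("registers=%d Vdd=%.2f V energy=%.3e J\n", r, vdd, energy);
    }
    return 0;
}

With more registers the cycle count drops, the required clock period lengthens, a lower Vdd suffices, and the energy falls quadratically in the voltage, which is the effect plotted in figure 3.11.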
3.3 Conclusion From our experiments we conclude that an increase in the number of registers by one may result in a performance improvement of up to 57.5% and a reduction in energy consumption of up to 62.9%. Further, there is a high degree of correlation between performance improvement and energy reduction, while power does not depend strongly on the number of registers. The cost of varying the register file size in an ASIP is complex due to its effect on instruction encoding, instruction bit-width and required chip area. This study was carried out in a simulator based framework, which, as noted earlier, is slow. Further, the retargetability of encc is very limited; for example, we cannot generate code for a processor with multiple functional units. This study strengthened our motivation for developing techniques for register file size exploration which do not require a retargetable compiler or simulator. We are interested in techniques to explore register file size along with other storage configurations such as caches and the number of register windows; it is desirable that these techniques be fast and reasonably accurate, so that a wider architectural design space can be explored quickly.
Chapter 4
Overview of Our Methodology
Previous Chapters discussed the motivation and related work. In this Chapter our proposed ASIP design methodology is described, together with the overall technique for storage exploration. Specific components of storage exploration are described in detail in subsequent Chapters.
4.1 ASSIST : ASIP Design Methodology The work reported in this thesis is part of a broader framework for ASIP synthesis, named ASSIST, which is under development. The overall flow of the ASSIST methodology is shown in figure 4.1. The inputs include the application behavior in C; performance, power and area constraints; a basic processor configuration; pipeline templates and memory access models; power models for various components; and area and clock period models. The proposed overall approach is as follows. The application is analyzed with the help of a profiler to extract application parameters. Design space exploration is an iterative process which starts with a basic (or minimal) configuration that can be synthesized. The performance estimator (where our technique is used) estimates performance based on the present processor and memory configuration, the application parameters and the input models. The configuration selector compares the estimates with the user specified constraints to generate the next potential configuration. This process is iterated until a
satisfactory configuration is chosen which is used by a retargetable compiler generator in producing a customized compiler and by a VHDL synthesizer to generate a synthesizable VHDL description of the customized processor.
[Figure: the application and constraints are profiled and application parameters extracted; the design space explorer iterates between the configuration selector and the retargetable performance estimator, starting from a basic processor configuration and the memory configurations and access models; the chosen processor and memory configuration drives a retargetable compiler generator (producing the ASIP compiler) and a synthesizable VHDL generator]
Figure 4.1: ASIP Synthesis Methodology : ASSIST
ASSIST is part of a larger research project named ASSET, which is targeted towards the automated synthesis of embedded systems. The emphasis of the project is on early design space exploration, with the final synthesis/implementation done using commercial tools. Currently, tools have been developed for application modelling, simulation, hardware and software estimation, partitioning, and a converter for generating the HDL description to be used by a hardware synthesiser. At present, only the synthesizable processor description is generated automatically, while the simulator requires designer intervention.
4.2 Storage Space Exploration Storage exploration is part of the design space exploration phase of the overall methodology. The proposed technique for storage space exploration is shown in figure 4.2. We estimate the cycle count for application execution on the chosen processor and memory configuration, considering a parameterized model for the processor as well as the memory. Parameters of the data cache include size, line size, associativity, replacement policy and access time. The processor configuration specification includes the register file and windows organization along with pipeline information and functional unit (FU) operation capabilities and latencies. Register allocation is done on unscheduled code using the concept of reuse chains [73] with significant extensions [3]; the proposed technique for register allocation is described in the next Chapter. We have defined the cost of merging reuse chains considering spills, and have developed a systematic way of merging these chains. A resource constrained list scheduler is used for performance estimation. Further, we have proposed a novel technique for global performance estimation based on usage analysis of variables; global performance estimation is done without code generation. We also estimate the overheads due to limited register windows and the data cache memory [74], and have integrated this technique to explore register file size, windows and cache configurations. The overall execution time estimate (ET) for an application on the specified memory and processor configuration can be expressed as follows.
ET = et_R + oh_W + oh_C    (4.1)

where
et_R : execution time when the register file contains R registers,
oh_W : additional schedule overhead due to limited windows,
oh_C : additional schedule overhead due to cache misses.

et_R can be further expressed by the following equation:

et_R = bet + oh_dep + spill_R * t_R    (4.2)

where
bet : base execution time considering constraints of resources other than storage,
oh_dep : overhead due to additional dependencies inserted during register allocation,
spill_R : number of register spills,
t_R : delay associated with each register spill.

The computation of et_R is described in the next Chapter. oh_W can be further expressed by the following equation:

oh_W = spill_W * t_W    (4.3)

where spill_W is the number of window spills and t_W is the delay associated with each register window spill. oh_C can be further expressed by the following equation:

oh_C = miss_C * t_C    (4.4)

where miss_C is the number of cache misses and t_C is the cache miss penalty. The computation of oh_W and oh_C is explained in Chapter 7. t_W is computed from the register window size and the latency of the ‘store’ instruction; t_C is computed using the block size and the delays associated with each data transfer. Knowing all the execution time estimates, the storage configuration selector selects a suitable processor and memory configuration to meet the desired performance.
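The composition of these estimates is simple enough to state as code; the following sketch (illustrative structure and field names, not the ASSIST implementation) assembles equation 4.1 from equations 4.2 to 4.4:

/* Sketch of how equations 4.2-4.4 combine into the estimate of eq. 4.1.
 * All inputs (cycle counts, spill counts, penalties) are parameters. */
typedef struct {
    long bet;      /* base execution time: constraints other than storage   */
    long oh_dep;   /* overhead of dependencies added by register allocation */
    long spill_r;  /* number of register spills        */
    long t_r;      /* delay per register spill         */
    long spill_w;  /* number of register window spills */
    long t_w;      /* delay per window spill           */
    long miss_c;   /* number of cache misses           */
    long t_c;      /* cache miss penalty               */
} estimate_inputs;

long execution_time_estimate(const estimate_inputs *in) {
    long et_r = in->bet + in->oh_dep + in->spill_r * in->t_r;  /* eq. 4.2 */
    long oh_w = in->spill_w * in->t_w;                         /* eq. 4.3 */
    long oh_c = in->miss_c * in->t_c;                          /* eq. 4.4 */
    return et_r + oh_w + oh_c;                                 /* eq. 4.1 */
}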
[Figure: the application and performance constraints, the parameterized processor (including register file and register windows configurations) and the parameterized memory feed the storage explorer; register spill overheads (register allocation using RRCs, global needs handling, list scheduling with validated results: et_R), window spill overheads (function call/return sequence, stack based analysis of window spills: oh_W) and the cache miss penalty (sim-cache: oh_C) combine into the execution time estimate ET, from which the storage configuration selector chooses the register file size, cache configuration and register windows configuration]
Figure 4.2: Storage exploration technique
4.3 Execution Time Estimation with Limited Registers We use the concept of Register Reuse Chains (RRCs) [73], which makes it possible to do register allocation for unscheduled code blocks. While doing register allocation, an attempt is made to minimize the schedule overhead which will occur due to limited registers. A priority based resource constrained list scheduler is used to get local performance estimates for each block. Global analysis is performed at the function level to find the additional schedule overhead required to handle global needs, which are ignored when the local estimates are generated; this overhead is added to the local estimates to produce the total estimates. Our performance estimation methodology is shown in figure 4.3. The input application (written in C) is profiled using gprof to find the execution count of each basic block as well as of each function; the estimated execution times are multiplied by these execution counts. The intermediate representation is generated using SUIF [75], on which control and dependency analysis is performed. The control flow graph is generated at the function level, whereas the data flow graph is generated at the basic block level.
For each basic block B, local register allocation is performed using a modified register reuse chains approach (figure 5.1), taking as inputs the data flow graph and the number of registers to be used for local allocation (say k). The data flow graph may be modified because of additional dependencies as well as spills inserted during register allocation. This modified data flow graph is taken as input by a priority based resource constrained list scheduler, which produces schedule estimates. This estimate is multiplied by the execution frequency of block B to compute the local estimate (LE_B,k) for this block. Local estimates are produced for all the basic blocks contained in a function, for the complete range of register file sizes to be explored. Schedule overheads needed to handle global needs with a limited number of registers are computed using lifetime analysis of variables; for each block, we need information on the variables used, defined, consumed, and live at the entry and exit points of that block. This global needs overhead is also generated for the complete range of numbers of registers for each basic block. Then, we decide on the optimal distribution of the available registers (say n) into registers to handle local register allocation (k) and registers to handle global needs (n − k), such that the overall schedule estimate for that block is minimized.
The overall estimate for a block B can be expressed as

OE_B = min_k (LE_B,k + GE_B,n−k)    (4.5)

where OE_B is the total schedule estimate for basic block B, and GE_B,n−k is the overhead to handle global needs with n − k registers. The OE_B values for all blocks are summed up to produce estimates at the function level, and the estimates for all functions are added together to produce the overall estimate for the application, i.e. et_R. So et_R can be expressed as

et_R = Σ (for each function) Σ (for each basic block B) OE_B    (4.6)
[Figure: the application written in C passes through the SUIF front end and profiling, giving the SUIF IR and profile information; dependence analysis produces a control flow graph per function and a data flow graph per basic block; local register allocation using register reuse chains (for a given number of registers) yields a modified data flow graph for the priority based resource constrained list scheduler, which, together with the parameterized processor (load/store times, number of buses and ports), produces local schedule estimates LE_B,k; these are combined with the global estimates GE_B,n−k for each basic block into schedule estimates per block and finally execution time estimates for the complete application]
Figure 4.3: Flow diagram of our performance estimation scheme
4.4 Our Approach and Compilers As discussed earlier, compilers have been used extensively in conjunction with simulators to find execution times during architecture exploration. As simulation is a time consuming process, an alternative is to use profile information and avoid simulation. This approach is similar in spirit to ours, but there are a couple of practical issues which we discuss here. Firstly, there are problems in using retargetable compilers for this purpose. As described in Chapter 2, easily-retargetable compilers either require a processor description at a very low level (e.g. MIMOLA in RECORD) or assume most of the processor to be fixed with only a few variable parameters (as in the case of TRIMARAN). A larger design space is covered by user-retargetable and developer-retargetable compilers, but that flexibility comes at the cost of rewriting significant parts of the compiler, which makes them unsuitable for design space exploration for ASIPs. Moreover, there is a well known trade-off between retargetability and code quality in terms of performance and code size compared to hand optimized code; this implies that when an attempt is made to cover a larger design space, the quality of the generated code is not acceptable. In contrast, our scheme uses a very simple machine description which makes it very easy to retarget. This is possible because, although we perform compiler-like analysis of the given application, we do not need to generate code. Thus we are able to explore a larger design space which includes VLIW as well as RISC processors. This novel approach works at a higher (application) level, so estimates are produced very quickly; moreover, since we do not need a detailed machine description, our approach is useful in early design space exploration, where such tools are not available. The second issue is the order in which register allocation and scheduling are carried out in the two approaches. The goal of register allocation is to eliminate unnecessary load/store instructions and consequently reduce code size as well as power consumption; the main goal of instruction scheduling, on the other hand, is to increase computing speed by fully utilizing the resources of the processor.
Most of the approaches suggested for register allocation perform it either after scheduling, like a typical compiler, or try to solve register allocation and scheduling in an integrated manner. Goodman and Hsu [76] suggested a DAG driven technique for register allocation, but their approach needs pre-pass scheduling. Norris et al. [77] and Gupta et al. [78] proposed to solve register allocation and instruction scheduling in an integrated manner. Global register needs are ignored by most of these approaches. In our approach, however, register needs are estimated on unscheduled code, because register allocation is more important in ASIP synthesis. Fewer bits are required to identify a register when the register file is small, and a typical instruction format includes references to two registers. Saving bits in register addressing saves bits in the instruction, so more bits are available for opcodes and more application specific instructions can be incorporated; alternatively, the instruction width and the bus width for accessing the instruction cache could be reduced. Even if the reduction in register address bit-width does not reduce the instruction bit-width, we still have fewer bit transitions, resulting in reduced power consumption. That is why, in our technique, we prefer to do register allocation prior to scheduling.
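The arithmetic behind this argument is a one-liner; as a sketch, going from 32 to 8 registers shrinks each register field from 5 to 3 bits, saving 4 bits in a two-register instruction format (the 3-bit case is the THUMB encoding from section 3.1.1):

#include <math.h>

/* Register-field width for an R-register file: ceil(log2(R)) bits. With a
 * two-register instruction format, shrinking R from 32 to 8 saves
 * 2 * (5 - 3) = 4 bits per instruction for opcodes or narrower encodings. */
static int reg_field_bits(int R) {
    return (int)ceil(log2((double)R));
}
/* reg_field_bits(32) == 5, reg_field_bits(8) == 3 */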
4.5 Conclusion We have presented our methodology for ASIP design as well as storage exploration in this Chapter. Specific components of storage exploration are described in the next three Chapters: the computation of the execution time estimate at the basic block level (LE_B,k) and of the schedule overhead due to global register needs (GE_B,n−k) is presented in Chapters 5 and 6, whereas the computation of the overheads due to limited register windows (oh_W) and cache misses (oh_C) is described in Chapter 7.
Chapter 5
Execution Time Estimation at the Basic Block Level
The overall methodology for ASIP design, storage exploration and execution time estimation was described in the last Chapter. The technique for estimating execution time considering register file size is described in this Chapter; here, the overheads due to limited register windows and cache size are ignored. In other words, this Chapter describes the method of computing LE_B,k, the execution time of a basic block B considering a certain number (k) of registers for allocation. These values are required in equation 4.5.
We use the basic concept of register reuse chains (RRCs) for register allocation, as mentioned in the last Chapter. Initial chain formation and chain reduction, presented in the next two sections, are based on the idea proposed by Zhang et al. [73]. Merging cost computation, set formation and actual merging, presented in sections 5.3 and 5.4, are our major contributions. Section 5.5 describes our performance estimation scheme. It is important to note that since [73] does not consider spilling, its solution is incomplete: if chain reduction is not able to reduce the number of chains to the number of registers, [73] does not provide any solution.
5.1 Register Allocation using Register Reuse Chains (RRCs) Zhang et al. [73] proposed a DAG driven technique which performs register allocation before scheduling, using the concept of register reuse chains (RRCs), which we have adopted with significant extensions. A register reuse chain is an ordered set of nodes in a data dependence graph in which all the nodes share the same register; we use these chains to estimate the increase in schedule length and the spilling overheads for a given number of registers. Since the approach of [73] does not consider the option of spilling a variable, it has no notion of a cost of merging: chains are merged only if this is possible without spilling, and otherwise are not allowed to merge. We consider the option of spilling variables while deciding on the possibility of merging chains; we compute the merging cost for every pair of reuse chains and merge them systematically. Another major extension is that we also consider global needs, that is, variables needed across basic blocks may be allocated registers. For this, the liveness of variables is analyzed across basic blocks and variable usage sets are generated; global needs are estimated using these sets, as explained in Chapter 6. This is our contribution, whereas [73] and most DAG driven techniques do not address these global needs. The proposed register allocation strategy is shown in figure 5.1. Initial register reuse chains are formed in a manner such that no additional dependency is generated. If the number of chains is higher than the number of registers available for local allocation (say k), chains are merged to reduce the number to the desired level. The chain merging is driven by the merging cost, which is the cost in terms of spills and schedule overheads to be paid if the chains in question are merged; these costs are computed when compatibility analysis is done. An attempt is made to introduce the minimum number of spills in the process of merging. The result of such chain merging is the insertion of spill instructions as well as additional dependencies; the data flow graph is modified in the process. Our scheduler (a priority based resource constrained list scheduler) takes care of this modified data flow graph [3].
[Figure 5.1: Proposed register allocation strategy: the data flow graph (DAG) and the number of registers are input; initial register reuse chains are formed and, while the number of chains exceeds the registers available, chains are merged]
The merging cost is the measure of the loads and stores required for the said merging. Based on the dependency information, these costs are easily computed: the cost depends on how many values have to be stored to memory and loaded back when they are required. The merging cost of chain j reusing the register assigned to chain i can be expressed by the following equation.
M_i,j = ∞, if ∃ n1 ∈ c_i, n2 ∈ c_j such that path(n2, n1);
M_i,j = ( ||{ <n1, n2> ∈ DAG | n1 ∈ c_i, n2 ∈ c_j }|| + ||{ n3 | <last(c_i), n3> ∈ DAG, path(last(c_j), n3) }|| ) × (cost_load + cost_store), otherwise.    (5.1)
Here ||S|| represents the cardinality of the set S. For simplicity, the above formula just identifies the number of dependencies that need to be broken for a possible merging of two chains. Once such a number is identified, it is assumed that for every such dependency one store (where the value is produced; we call it the source) and one load (where the value is used; we call it the destination) is required. The formula does not reflect the fact that if several of these dependencies share the same source node but have different destination nodes, then that value needs to be stored only once and loaded before every destination node; in the actual computation this fact has been considered. The computation of the merging cost for some pairs of chains of the illustrated example, assuming a load/store cost of 1, is shown in figure 5.7; the computation of the merging cost between chain 3 and chain 7 is explained here. M3,7 represents the cost paid if the variables of chain 7 reuse the register assigned to the variables of chain 3. Since the computation of variables ‘e’ and ‘f’ needs the value of ‘d’, ‘d’ is stored once after its value is produced and loaded twice, once before the computation of ‘e’ and once before the computation of ‘f’. Thus one store and two loads are required for the said merging, and M3,7 is 3. M7,3 represents the cost paid if the variables of chain 3 reuse the register assigned to the variables of chain 7. As ‘f’ is computed using the value of ‘d’, it is not possible to schedule the computation of ‘d’ after the computation of ‘f’; so the nodes in chain 3 cannot reuse the register allocated to the variables in chain 7, and M7,3 is infinite.
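A simplified sketch of this computation (illustrative data structures; path() and edge() are assumed helpers, and the last(c_i)-successor term of equation 5.1 is only indicated) is:

#include <limits.h>

/* path(a, b): is there a dependency path from node a to node b?
 * edge(a, b): is there a direct DAG edge from a to b?  Both assumed. */
extern int path(int a, int b);
extern int edge(int a, int b);

/* Cost for chain j (nodes cj[0..nj-1]) to reuse the register of chain i
 * (nodes ci[0..ni-1]): infinite if any node of j must precede a node of i;
 * otherwise one store per spilled source value plus one load per
 * destination, following the refined counting described above. The
 * successors of last(c_i) are handled analogously and omitted here. */
long merging_cost(const int *ci, int ni, const int *cj, int nj,
                  long cost_load, long cost_store) {
    for (int a = 0; a < ni; a++)
        for (int b = 0; b < nj; b++)
            if (path(cj[b], ci[a]))
                return LONG_MAX;           /* merging impossible */

    long cost = 0;
    for (int a = 0; a < ni; a++) {
        int stored = 0;                    /* store the source only once  */
        for (int b = 0; b < nj; b++)
            if (edge(ci[a], cj[b])) {      /* dependency to be broken     */
                cost += cost_load;         /* reload before destination   */
                stored = 1;
            }
        if (stored)
            cost += cost_store;
    }
    return cost;
}

On the worked example above, the source ‘d’ is stored once and reloaded before ‘e’ and ‘f’, giving the cost of 3 for M3,7.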
[Diagram of example chains with merging costs, where M_i,j is the cost in loads and stores if chain j reuses the register of chain i: M0,1 = 2, M1,0 = infinite; M3,4 = 0, M4,3 = 2; M3,7 = 3, M7,3 = infinite; M5,6 = 2, M6,5 = 3; some of these costs are due to additional dependencies generated during chain reduction]
Figure 5.7: Computing merging costs
5.5 Chain Merging and Performance Estimation
After the compatibility graph is generated, its edges are sorted in non-decreasing order of the associated costs, and sets of RRCs are formed. Each set contains the RRCs which can share a register (by incurring cost in terms of loads and stores, or by generating additional dependencies); the ordered compatibility edges guide the set formation. The chains of each set are actually merged in the same order as they appear in the set. In this merging, spill instructions (loads and stores) are inserted, and apart from the spills, dependencies may get modified. A priority based resource constrained scheduler is used to estimate the computation time. This estimate is multiplied by the execution frequency of the block (obtained by profiling on typical data sets) to get the local estimate (LE_B,k) for that block. The sets and schedule estimates for varying numbers of registers for local allocation (k) are shown in figure 5.8.
Execution Time Estimation at the Basic Block Level Number of Registers (k) >= 8
Sets
67 Schedule estimates (# cycles) LEB,k
, , , , , , ,
29
7
, , , , ,
35
6
, , , ,
38
5
, , ,
43
4
, ,
50
3
,
55
Figure 5.8: Sets and schedule estimates with diff. k
5.6 Complexity Analysis
The algorithm for initial chains formation starts with picking an unassigned node (which is not assigned to any reuse chain). A new chain is formed starting from the selected node. Recursively EUK’s are identified and added to this chain. Since identification of EUK involves dependence analysis, so the complexity of a single chain formation will be O(|V | 2 ), where |V | is the number of nodes in DAG. So complexity of forming all the initial reuse chains will be O(|V | 3 ). The chain reduction also involves dependency analysis and swapping of some nodes between the chains, hence this algorithm also has the complexity of O(|V |3 ).
The optimum local register allocation problem is known to be NP-hard [79], so we use heuristic approaches for register allocation. Computing the merging cost between a pair of chains involves dependence-violation checks, so computing the merging costs between all chain pairs is also O(|V|^3). The merging heuristic is very simple, with only dependence violation checked before a merger, so merging all the pairs that need to be merged also has complexity O(|V|^3). Since no step exceeds O(|V|^3), the overall complexity of our technique is O(|V|^3), |V| being the number of nodes in the DAG.
5.7 Conclusion

In this Chapter we have presented a technique to estimate execution time at the basic block level considering a limited number of registers. We use the concept of reuse chains with significant extensions: we merge reuse chains systematically considering spills, which was not considered in [73]. As the proposed technique is DAG driven, it enables performance estimation without code generation. Since no pre-pass scheduling is required before register allocation, the registers can be put to best use. The technique in this Chapter does not consider the overheads due to global register needs; these overheads are computed as described in the next Chapter.
Chapter 6
Global Performance Estimation and Validation

Execution time estimation with a limited register file size has two components, namely execution time estimation at the basic block level and the additional schedule overhead to handle global register needs. The estimation scheme at the basic block level was presented in the last Chapter; the schedule overhead to handle global register needs is presented in this Chapter. Results generated and validated for different processors are also presented here.
6.1 Addressing Global Register Needs

While producing schedule estimates as described in the last Chapter, we considered nodes within the basic block only. We did not consider that registers are also required to hold values of variables which are not accessed within the block, but which are generated by some predecessor block and may be required by a successor block. Similarly, if a variable is produced in the block and also required by a successor block, then the register used to store that variable cannot be reused for any other variable of the basic block under consideration. Such register needs are called global register needs. The additional schedule overhead due to such needs is computed as follows.
When there are a total of n registers and k are used for local allocation, n − k can be used for handling global register needs. The additional schedule overhead due to global needs (GE_{B,n−k}) consists of two components, namely a store component (GS_{B,n−k}) and a load component (GL_{B,n−k}). GE_{B,n−k} may be expressed as follows.
GE_{B,n−k} = GS_{B,n−k} × t_store + GL_{B,n−k} × t_load        (6.1)
where
GS_{B,n−k} : number of additional stores.
GL_{B,n−k} : number of additional loads.
t_store : time required for one store.
t_load : time required for one load.

t_store and t_load are extracted from the input machine description. GS_{B,n−k} and GL_{B,n−k} are estimated as described below. During control and data flow graph generation, the following additional data sets are generated.

use_B : set of variables used in block B prior to any definition in that block.
def_B : set of variables defined in block B prior to any use in that block.

From the control flow graph, the predecessor and successor blocks of each block are known. Using this information along with the sets use_B and def_B, two sets, namely in_B and out_B, are computed using standard compiler techniques [80].

in_B : set of variables live at the time of entry into block B.
out_B : set of variables live at the time of exit from block B.

Taking use_B, def_B, in_B and out_B as inputs, we compute the following sets for each basic block B (figure 6.1).

C_B : set of variables used in block B, but not used in any of the successor blocks.
liveacross_B : set of variables that need to live across block B but are not accessed in block B.
defout_B : set of variables generated in block B and required by successor blocks.
inuseout_B : set of variables used in block B and required by successor blocks.
Figure 6.1: Variable use sets for block B (use_B, def_B, in_B, out_B, C_B, liveacross_B, inuseout_B, defout_B).

Depending on the availability of registers, some variables are allocated registers, whereas the remaining ones are stored in memory; similarly, variables to be loaded from memory need memory accesses. Some additional notation used is as follows.

||S|| : cardinality of set S.
S^R : subset of S which gets registers.
prof_B : total execution count of block B from the profiler.

The upper bound on total register needs (TRN) will be
TRN = ||liveacross_B^R|| + max(||defout_B|| + ||inuseout_B||, k)        (6.2)
The reason for taking this as the total need is that, out of the variables used in the block and required by successor blocks (i.e. ||defout_B|| + ||inuseout_B||), only k can be preserved in the registers made available for local allocation. If we have a sufficient number of registers to handle this need, then no additional store is required; otherwise the excess variables are stored to memory each time block B is executed. So GS_{B,n−k} can be expressed as follows.
GS_{B,n−k} = 0                         if TRN ≤ n
           = (TRN − n) × prof_B        otherwise        (6.3)
If the total number of outgoing values is greater than the total number of registers (n), then only n such values can get registers. While computing ||out_P^R||, branch probabilities are also considered.

||out_B^R|| = min(n, ||liveacross_B^R|| + ||defout_B|| + ||inuseout_B||)        (6.4)
||liveacross_B^R|| is non-zero only if ||in_B^R|| is larger than ||use_B||, in which case it is the difference of the two. This is based on the assumption that elements of liveacross_B (values flowing across the block but not accessed in it) are allocated registers only if registers are spare even after allocating registers to all the elements of in_B ∩ use_B, that is, those variables which are required in block B.
||liveacross_B^R|| = ||in_B^R|| − ||use_B||    if ||in_B^R|| > ||use_B||
                   = 0                         otherwise        (6.5)
where ||in_B^R|| is computed as follows.

||in_B^R|| = min(||in_B||, n, min_P ||out_P^R||)        (6.6)

P is any immediate predecessor block of block B. Since ||in_B^R|| depends on ||out_P^R|| of its predecessor blocks, and the control flow graph may have loops, a few iterations may be required to reach stable ||in_B^R|| and ||out_B^R|| values for each block.
Loads are estimated from the number of times a required variable must be loaded from memory because it is not available in a register.
GL_{B,n−k} = 0                                     if ||in_B^R|| ≥ ||use_B||
           = (||use_B|| − ||in_B^R||) × prof_B     otherwise        (6.7)
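The following Python sketch puts equations 6.1 to 6.7 together. It is a hedged illustration: the block data structure and the fixed-point iteration bound are our assumptions, not the actual implementation, and set sizes are represented as plain integers.

    def global_overhead(blocks, n, k, t_store, t_load, iters=10):
        # blocks: dict id -> {'in', 'use', 'defout', 'inuseout', 'prof', 'preds'}
        # n: total registers; k: registers reserved for local allocation
        in_r = {b: 0 for b in blocks}
        out_r = {b: n for b in blocks}
        # Fixed-point iteration for equations 6.4-6.6; needed because the
        # control flow graph may contain loops.
        for _ in range(iters):
            for b, d in blocks.items():
                preds = [out_r[p] for p in d['preds']] or [n]
                in_r[b] = min(d['in'], n, min(preds))                    # eq. 6.6
                live_r = max(in_r[b] - d['use'], 0)                      # eq. 6.5
                out_r[b] = min(n, live_r + d['defout'] + d['inuseout'])  # eq. 6.4

        ge = 0
        for b, d in blocks.items():
            live_r = max(in_r[b] - d['use'], 0)
            trn = live_r + max(d['defout'] + d['inuseout'], k)           # eq. 6.2
            gs = max(trn - n, 0) * d['prof']                             # eq. 6.3
            gl = max(d['use'] - in_r[b], 0) * d['prof']                  # eq. 6.7
            ge += gs * t_store + gl * t_load                             # eq. 6.1
        return ge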
6.2 Illustrative Example

This section demonstrates, through an illustrative example, the impact of global register needs and local register pressure on the optimal distribution of the available registers (n) into registers for local allocation (k) and registers for handling global needs (n − k) for each basic block. A snapshot of a typical control flow is shown in figure 6.2, omitting details within the basic blocks for simplicity. Block B3 has two immediate predecessor blocks (B1 and B2) and two immediate successor blocks (B4 and B5). Blocks B1 and B2 produce the values of variables 'd' and 'e', which are used by block B3. Blocks B1 and B2 do not have high register pressure, and both keep the three variables 'a', 'b' and 'c' (which are live across the block) in registers. Local estimates for block B3 are 65, 50, 40 and 30 cycles for k = 2, 3, 4 and 5 respectively. We have a total of 5 registers (n = 5). Global estimates for block B3 are 6, 4, 2 and 0 for n − k = 0, 1, 2 and 3. So the optimal distribution is k = 5 and n − k = 0, i.e. use all available registers for local allocation. In doing so, variables 'a', 'b' and 'c' are stored into memory before the execution of block B3 and are loaded from memory into registers again before executing block B6, which uses these three variables.
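The distribution decision in this example amounts to minimizing LE_{B,k} + GE_{B,n−k} over k, as the following small sketch shows (the numbers are those quoted above for block B3):

    LE = {2: 65, 3: 50, 4: 40, 5: 30}    # local estimates (cycles) per k
    GE = {0: 6, 1: 4, 2: 2, 3: 0}        # global overhead per n - k
    n = 5

    best_k = min(LE, key=lambda k: LE[k] + GE[n - k])
    print(best_k, LE[best_k] + GE[n - best_k])   # -> 5 36

With the alternative local estimates discussed below (65, 50, 49 and 48), the same computation picks k = 3.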
Figure 6.2: Optimal register allocation for the example control flow (blocks B1 to B6); {...} denotes a set of variables kept in registers and [...] a set of variables stored in memory.
Had block B3 not had high register pressure, the optimal distribution could have been different. For example, if the local estimates for block B3 were 65, 50, 49 and 48 cycles for k = 2, 3, 4 and 5 respectively, then the optimal distribution would use 3 registers for local allocation and 2 for global needs. One of the three variables 'a', 'b' or 'c' would then be stored in memory and loaded when required.
6.3 Experimental Setup

Performance estimates with varying register file sizes were generated for selected benchmark applications. Two processors, namely the ARM7TDMI [66] and the Trimedia TM-1000 [81], were chosen for experimentation and validation.
6.3.1 ARM7TDMI

The ARM7TDMI by ARM Ltd [66] is a 32-bit RISC processor. It supports 16-bit instructions in THUMB mode which operate on the same 32-bit register set, so it achieves better performance than traditional 16-bit processors using 16-bit registers and consumes less power than traditional 32-bit processors.
6.3.2 TM-1000

The Trimedia TM-1000 CPU is based on a VLIW architecture with an instruction set optimized for system-on-chip (SoC) solutions requiring real-time processing of video, audio, graphics, and communications data streams. Its five-issue-slot instruction length enables up to five operations to be scheduled in parallel in a single VLIW instruction.
6.3.3 Benchmark Suite

As described in Chapter 3, the benchmark applications are drawn either from the domain of media applications or DSP, or are implementations of standard sorting algorithms. The benchmarks are as follows.
1. biquad_N_sections (DSP domain)
2. lattice_init (DSP domain)
3. matrix-mult (multiplication of two m × n matrices)
4. me_ivlin (media application)
5. bubble_sort
6. heap_sort
7. insertion_sort
8. selection_sort
6.3.4 Processor Description

The processor itself is specified using a very simple description. This description gives the latency of SUIF operations for the processor core. We also specify the maximum register file size to be explored. The nature of the pipelining (number of pipeline stages and their specification) and the buses are specified. The processor can have more than one type of FU as well as multiple instances of each FU type. For VLIW processors, slotting information is also captured. Flat memory is assumed, and the read/write latencies are captured in the description.
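For illustration, a description of this kind might look as follows; this is a hypothetical sketch conveying the content of the description, not the actual input syntax of our tools.

    # hypothetical processor description (illustrative only)
    core:      ARM7TDMI-like
    pipeline:  3                 # fetch, decode, execute
    regfile:   max 16            # range of register file sizes to explore
    fu:        alu  x1           # latencies of SUIF operations on this FU
               add 1, sub 1, mul 5
    fu:        ldst x1
               load 3, store 3
    memory:    flat, read 3, write 3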
6.4 Results of Performance Estimation

Our technique generates performance estimates for the complete range of register file sizes, i.e. between 1 and the maximum limit defined in the input description file. Estimates for all the functions as well as for the complete application are generated. Where register allocation is not feasible even with the insertion of loads and stores, an error is reported. The maximum register file size was defined as 16 for the ARM7TDMI and 128 for the TM-1000. The results show the performance improvement with a larger number of registers. Two factors are responsible for this improvement: reduction in spills and more flexibility in scheduling. Beyond a certain limit the performance saturates, and providing additional registers does not improve it further. Plots for many benchmarks have a sharp knee reflecting a drastic change in performance
at a specific number of registers. Observations for the matrix multiplication benchmark on the ARM7TDMI processor are shown in figure 6.3; in this case the performance saturates at 12 registers. Since this technique allows one to vary the register file size and generate performance estimates, it can guide an ASIP designer in choosing a suitable register file size. This choice will be driven by the specified performance constraints as well as a suitable cost model for the register file.

Figure 6.3: Execution time estimates for matrix-mult on the ARM7TDMI (estimated cycle count versus number of registers, 2 to 16; the curve saturates at 12 registers).
6.5 Validation

To verify the correctness of the estimates produced by our technique, we validated it for two processors, namely the ARM7TDMI and the Trimedia TM-1000. A specific register file size was chosen for validation, considering the sizes supported by the standard tool sets. Suitable estimates are used for library function calls. For example, the benchmarks considered for experimentation and validation include the library function 'rand', used to generate a random number; estimates for this function were generated using the standard tool sets and are 128 and 43 cycles for the ARM7TDMI and TM-1000 processors respectively. Library functions like printf are omitted from the source code itself, as these are not relevant in the embedded systems context.
6.5.1 Validation for ARM7TDMI

Estimates for the ARM7TDMI were generated for 8 registers using the following tool sets.

1. ARM : the armulator tool set, consisting of the ARM code generator and ARM simulator.
2. ASSIST : our technique.
3. encc : the retargetable encc code generator [65] and an ARM simulator [82].

Figure 6.4: Validation for ARM7TDMI (normalized estimates, in million cycles, for ARM, ASSIST and encc on biquad (*35), lattice_init (*15), matrix-mult (*1), me_ivlin (*5), bubble_sort (/7), heap_sort (*4), insertion_sort (/3) and selection_sort (/4)).
Results are shown in figure 6.4. The scaling factor is shown in parentheses along with the benchmark names on the X axis; scaling is done to present results of different benchmark applications on the same plot. We observe that, on average, our technique produced estimates within 9.6% of the performance numbers produced by the standard compiler and simulator available in the armulator tool set, whereas the simulator based encc technique produced estimates with an average error of about 29.6%.

We have compared the run time of our technique with the encc technique. Both techniques can provide performance estimates with different register file sizes, though the register file size that can be explored with encc is limited to 8, because the ARM instruction set simulator used for simulation does not support larger register files. Both tools (ours and encc) were run on a SUN Blade 1000 server with dual 64-bit processors at a clock frequency of 750 MHz and 4 GB of RAM. The comparison shows that, on average, our technique was 77 times faster than encc: average execution times were around 2 and 144 seconds for ASSIST and encc respectively. The improvement is larger for applications with higher loop frequencies. Of course, encc also generates code apart from performance estimation, whereas we only generate estimates; but the major time consumed in the encc framework is in simulation, which uses the armsd simulator of the armulator tool set. There has been significant progress in the speeds achieved by target specific simulators, but these are generally compiled code simulators and are not directly relevant in the context of design space exploration. If we use either the encc framework (the encc code generator and the ARM simulator on generated code) or the ARM framework (the ARM compiler and the ARM simulator), the runtime of the two simulator based frameworks is almost equal; we preferred the encc framework for comparison because of its flexibility in changing register file sizes. The speed-up reported in this thesis is the average speed-up over all the benchmark applications for the complete range of register file sizes.
6.5.2 Validation for TM-1000

To show retargetability and the capability of handling multi-issue architectures that exploit instruction level parallelism, we have chosen the Trimedia TM-1000 processor.
Estimates were generated for 128 registers using the following tool sets.

1. tmcc and tmsim : the Trimedia tool set, consisting of the tmcc compiler and the tmsim instruction set simulator.
2. ASSIST : our technique.

Results are shown in figure 6.5. We observe that our technique produced estimates within 3.3% of the performance numbers produced by the standard compiler and simulator available in the Trimedia tool set.

Figure 6.5: Validation for TM-1000 (normalized estimates, in million cycles, for tmcc/tmsim and ASSIST on biquad (*35), lattice_init (*15), matrix-mult (*1), me_ivlin (*5), bubble_sort (/7), heap_sort (*4), insertion_sort (/3) and selection_sort (/4)).
6.6 Limitations

We have ignored the impact of register file size on instruction width. With a change in instruction width, the code size, the complexity of the decoder logic, the bus width, etc. may vary. Due to changes in the complexity of various hardware components, the cycle time may change, which affects the real-time performance; performance in terms of number of cycles, however, is unaltered.
6.7 Conclusion

We have developed a novel technique to decide a suitable register file size. Our approach uses a priority based resource constrained list scheduler to estimate execution time in terms of clock cycles for each basic block. Register allocation is performed on unscheduled code using the concept of register reuse chains [73] with major extensions; for example, we consider spilling of variables when needed during the merging of reuse chains, and a systematic chain merging is proposed and used. A novel idea of estimating schedule overheads to handle global register needs is also proposed: live variable analysis is performed across all the blocks, and this variable usage information is used to estimate the global schedule overhead. Performance results for register file sizes from 1 to 8 for the ARM7TDMI and from 1 to 128 for the Trimedia TM-1000 were generated for selected benchmarks. The technique was validated for these two processors for specific numbers of registers. Results show that our estimates are within 9.6% for the ARM7TDMI and 3.3% for the TM-1000 compared to the actual performance produced by the standard tool sets. Further, this technique was nearly 77 times faster than a simulator based technique. Apart from the significant speed-up, the proposed technique can produce estimates in an initial evaluation phase when the tools (retargetable compilers and simulators) are not yet available. Further, it uses a very simple processor description, which makes it easily retargetable to cover a large design space.

In this Chapter, we presented a technique to estimate application execution time considering a limited register file size. We have not considered the overheads due to limited register windows or cache misses; these overheads are computed as described in the next Chapter.
Chapter 7
Register Windows and Cache Memory Exploration

In the previous Chapters, we described an efficient methodology to find the effect of register file size on execution time, ignoring any overheads due to a limited number of register windows or cache misses. In this Chapter, we propose a technique to estimate these additional overheads.
7.1 Estimating Register Window Spills

Processors with a register window scheme typically organize a set of registers as a circular buffer. When a procedure is called (i.e. on a context switch), the tail of the buffer is advanced to allocate a new window of registers that the procedure can use for its locals. On overflow, registers are spilled from the head of the buffer to make room for the new window at the tail; these registers are not reloaded until a chain of returns makes it necessary. We consider the context switches due to function calls and returns.

Vishal et al. [83] proposed a technique to estimate the number of register window spills and restores, based on generating a trace of calls and returns from the application execution. We used their basic idea of instrumenting the application and generating such a trace. Their approach then builds a finite automaton for each number of register windows, and the trace is fed to these
automata one at a time to compute the number of window spills and restores in each case. In contrast, our approach uses a very simple and efficient stack based analysis: the trace is traversed only once, and the spills and restores for all numbers of register windows are computed simultaneously. We now elaborate our technique in detail.

The input application is instrumented so that, when executed, it writes a character 'c' to an intermediate output file whenever a function call is encountered, and a character 'r' whenever a function returns to its caller. The instrumentation inserts dummy statements just before the body of each function as well as just before each return statement. The resulting c/r trace (a sequence of 'c' and 'r') is analyzed as shown in figure 7.1. Depending on whether the character received from the input string is 'c' or 'r', a push-like or pop-like operation is performed. These operations only move the top pointer; no data is stored in or retrieved from this
Input : c/r trace {s : s is 'c' or 'r'} and nrw : number of register windows
Output : spills_nrw and restores_nrw

top : stack top pointer
top = 0
spills_nrw = 0
restores_nrw = 0
top_stored_nrw = 0
for each item of the input c/r sequence
    if (item is a call)
        top = top + 1
        if ((top − top_stored_nrw) > nrw)
            spills_nrw = spills_nrw + 1
            top_stored_nrw = top_stored_nrw + 1
        end if
    else
        top = top − 1
        if (top = top_stored_nrw)
            restores_nrw = restores_nrw + 1
            top_stored_nrw = top_stored_nrw − 1
        end if
    end if
end for
Figure 7.1: Algorithm for computing the number of window spills and restores
stack. We keep a reference pointer for each value of the number of register windows. For a particular number of register windows, the difference between the corresponding reference pointer and top gives the number of register windows occupied. If a call arrives and all the register windows are occupied, then a window from the bottom of the stack is spilled into memory, the reference pointer is raised one step, and then top is incremented. Similarly, when we come back to access a window saved in memory, it is restored from memory. Reference pointers for different numbers of register windows may have different values. Each character of the input trace thus affects the numbers of spills and restores associated with the different window counts; two counters, one counting spills and the other counting restores, are associated with each window count for this purpose. The complexity of the algorithm is O(n), where n is the number of function calls.

Depending on the total number of registers available and the number of register windows, the window size is decided. The register window configuration tells us the number of registers which can be used for allocation within a context. We use the techniques proposed in the last two Chapters to estimate execution time when each function gets a specified number of registers for allocation. Window spills (spill_W) due to a limited number of register windows are computed as shown above. As mentioned in Chapter 4 (equation 4.3), the execution penalty due to window spills and restores (oh_W) is computed using the window size and the latency associated with the load and store operations of the processor. This additional penalty is added to the estimates produced by the technique to produce the overall execution time estimates.
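A runnable Python version of this one-pass analysis is sketched below; it follows figure 7.1 but keeps one reference pointer per window count, so that all counts are handled in a single traversal of the trace. The guard against an empty spill stack is our addition, to avoid counting a restore when nothing has been spilled.

    def window_spills(trace, max_windows):
        # trace: string of 'c' (call) and 'r' (return) characters
        # returns {nrw: (spills, restores)} for nrw = 1 .. max_windows
        top = 0
        stored = {w: 0 for w in range(1, max_windows + 1)}
        spills = {w: 0 for w in range(1, max_windows + 1)}
        restores = {w: 0 for w in range(1, max_windows + 1)}
        for item in trace:
            if item == 'c':
                top += 1
                for w in stored:
                    if top - stored[w] > w:       # all w windows occupied
                        spills[w] += 1            # spill the bottom window
                        stored[w] += 1
            else:
                top -= 1
                for w in stored:
                    # returning into a window that was spilled to memory
                    if stored[w] > 0 and top == stored[w]:
                        restores[w] += 1
                        stored[w] -= 1
        return {w: (spills[w], restores[w]) for w in stored}

    print(window_spills("cccrrr", 2))   # -> {1: (2, 2), 2: (1, 1)}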
7.1.1 Results

We have generated various trade-off curves for our benchmark applications. Though we generated exploration results for all the benchmarks chosen for our study, the most interesting results were observed for the quick_sort program; the other programs do not have a deep level of nesting of function calls, and 2 or 3 windows are usually sufficient to remove window spills in most cases. Figure 7.2 shows the impact of varying the register file size on the execution time of quick_sort. For these results, a single register file of the specified size is assumed, and all the registers are allowed to be used for register allocation in any function. Any schedule overhead due to a limited number of register windows was ignored while generating these estimates. The results indicate that execution time decreases as the register file size increases, but beyond a certain point it does not decrease further: for this application, 8 registers are sufficient, and beyond that the cycle count is not reduced further.
Figure 7.2: Impact of register file size on execution time for quick_sort (execution time in cycles versus register file size, 3 to 12; the curve flattens at 8 registers).
The variation in the number of window spills and restores with the number of register windows is shown in figure 7.3. The curve saturates at 9 windows, meaning that there are no window spills and restores once 9 or more register windows are available. In all these experiments the input data size is kept fixed at 1000, and the values to be sorted are generated randomly.

We are interested in the trade-off between the number of registers and the window sizes. For a given total number of registers, the window size differs with the number of windows. While generating the results (figure 7.4), we assumed that the register file is divided into windows of equal size. We also assume that, within a context, the number of registers available for register allocation is equal to the window size.

Figure 7.3: Impact of the number of register windows for quick_sort (number of window spills and restores versus number of register windows, 1 to 10; the curve saturates at 9 windows).
Depending on the performance requirement, a suitable register file size can be chosen, and for the chosen register file size the number of windows, and hence the window size (number of registers in a window), can be decided. For example, if the total register file size is 16 registers and there are 4 register windows, then each window has 4 registers for local allocation. We consider only the registers available for local allocation, ignoring the registers used within a window for inputs and outputs; this can also be taken care of if the user specifies the window configuration, i.e. how many registers are used for inputs, outputs and local variables.
At one extreme, when the number of windows is small, the time overhead due to context switches dominates the cycle count. At the other extreme, when the number of windows is large for the same total number of registers, the individual window size becomes small and the overhead of loads and stores (within a context) dominates the cycle count. In the next section we see how to account for the overheads due to cache misses.
Figure 7.4: Trade-off between the number of windows and their sizes (execution time in cycles versus number of register windows, 1 to 6, for total register file sizes of 12, 15, 16, 18 and 20).
7.2 Cache Miss Overheads

In the previous discussion, we considered the effect of a limited number of registers and register windows on performance. In this section, we consider the cache, along with the register file, as part of the storage hierarchy for design space exploration. The gap between processor and memory speeds keeps increasing; to overcome the delay incurred by slow memory accesses, different alternatives have been suggested, and using a cache memory is one of them. The idea is that, by exploiting the temporal and spatial locality of memory accesses, a small fast cache memory can give a significant speedup.

A number of techniques for memory exploration are reported in the literature [84, 85, 86, 87]. Jacob et al. [84] presented an analytical approach for designing memory hierarchies assuming fully associative caches, using the concept of workload locality derived from analysis of address traces. Kirovski et al. [85] and Abraham et al. [86] proposed simulation driven memory exploration techniques; the latter simulates the caches only on traces from a reference processor, and models the traces of other designs as a dilated version of the reference processor's trace, where each block of instruction addresses is stretched out by the dilation coefficient. Grun et al. [87] proposed a method to decide the memory architecture based on access patterns. To the best of our knowledge, no approach considers the register file as part of the storage during design space exploration.

We observe that the number of memory locations required to store spilled scalar variables and register windows is usually small compared to the total number of cache locations. Therefore, we assume that the spilling overhead is insensitive to the cache organization; this allows us to estimate the two independently. To obtain memory access profiles (total number of accesses, hits, misses, etc.) we need to generate the addresses to which memory accesses are made, and a simulator is then required to simulate those accesses. Since memory access patterns are typically application dependent, we can use any standard tool set to find memory access profiles. Once we know the number of misses for a particular cache, knowing the block size and delay information we compute the additional schedule overhead per miss using the following equation.

d = α1 + α2 × (b − 1)
where α1 is the time required (in processor clock cycles) for transferring the first datum from the next level of memory to the cache, α2 is the time required for each of the remaining transfers, and b is the block size. For our experiments we chose α1 = 8 and α2 = 3: three clocks are required for each load/store on the LEON processor, and for the initial transfer we took 8 intuitively; the designer can specify suitable values for these constants. This additional delay is added to the base execution time estimated by the methodology described earlier to obtain the overall execution time estimate. Cache misses depend on the block size: small block sizes may not give the desired advantage of locality of reference, whereas larger block sizes lead to a heavy miss penalty. While performing exploration, for each cache size we tried the possible combinations of block size, associativity and replacement policy, and identified the combination giving the minimum number of cache misses.
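A small sketch of this computation follows; the miss counts below are placeholders, not measured values, and only the formula and the chosen α1 and α2 come from the text.

    ALPHA1, ALPHA2 = 8, 3    # cycles for the first and each subsequent transfer

    def miss_penalty(block_size):
        return ALPHA1 + ALPHA2 * (block_size - 1)   # d = alpha1 + alpha2*(b - 1)

    def total_estimate(base_cycles, misses, block_size):
        return base_cycles + misses * miss_penalty(block_size)

    # e.g. for a block size of 8 words, each miss costs 8 + 3*7 = 29 cycles
    print(miss_penalty(8))                   # -> 29
    print(total_estimate(90000, 400, 8))     # -> 101600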
We have used the simplescalar tool set [88] to compute the number of cache misses for different cache sizes; it assumes a MIPS-like processor with some minor changes. Since we do not require timing information from simplescalar, we use the sim-cache simulator, which is fast compared to sim-outorder: sim-cache does not use processor timing information and gives only cache miss statistics, taking only an address trace along with the cache configurations as input. Since we explore the data cache here, and a large amount of data is stored in the form of arrays in memory, the array storage as well as the sequence of accesses do not significantly depend on the processor. Under this assumption, we have used sim-cache to find cache miss statistics.
7.2.1 Results

We have generated execution time estimates for various benchmark applications for different register file sizes and different data cache sizes. We have not considered the impact of cache size variation on memory latency, but this can easily be done by using different values of α1 and α2 for different cache sizes. Consider the results produced for the matrix-mult program for different register file sizes and memory configurations, shown in figure 7.5; some interesting trade-offs can be observed. Based on the generated execution time estimates and the input performance constraint, suitable configurations can be suggested. Suppose the performance constraint requires that the application take no more than 1.0E+05 cycles; then one of the following three configurations can be suggested.

1. 12 registers with a 4K data cache
2. 15 registers with a 2K data cache
3. 20 registers with a 1K data cache
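Selecting such configurations from the generated estimates is straightforward; the following fragment (with illustrative, not measured, numbers) filters (register file size, data cache size) pairs against the constraint.

    estimates = {                 # (registers, data cache) -> estimated cycles
        (12, "4K"): 9.8e4, (15, "2K"): 9.9e4, (20, "1K"): 9.7e4,
        (9, "4K"): 1.2e5, (12, "2K"): 1.1e5,
    }
    constraint = 1.0e5
    suitable = sorted(cfg for cfg, cyc in estimates.items() if cyc <= constraint)
    print(suitable)               # -> [(12, '4K'), (15, '2K'), (20, '1K')]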
7.3 Execution Time Validation

To verify the correctness of our techniques, we validated our results against the numbers produced by standard tool sets. We chose the LEON [4] processor for validation, so that we could validate our results on a processor which has all the features we explore. LEON is a SPARC V8-based architecture; it has register windows and instruction and data caches. The VHDL model of LEON is available under the GNU General Public License. We have compared the execution time estimates produced by our technique with the tsim simulator provided with the LEON tool set.
Figure 7.5: Execution time estimates for matrix-mult on LEON (execution time in cycles versus register file size, R3 to R31, for data cache sizes D1 of 1K, 2K, 4K and 8K).

Since the number of windows is fixed at 8 in the tsim LEON simulator, we validated our results for 8 register windows. Each window has 8 in, 8 local and 8 out registers; the outs are the ins of the next window but are visible in the current window. Additionally, 8 global registers are available. For register allocation purposes, the usual SPARC convention is to allocate eight registers (%l0-%l7) for local values. A compiler could allocate more registers for local values at the expense of having fewer outs/ins available for argument passing, but two registers of the outs and ins are reserved for storing the stack pointer and the return address, so only six of each can be used for register allocation. Out of the 8
global registers, only four can be used for register allocation, while the others are reserved for specific purposes. Using this information we decided to use 24 − x registers (8 local, 4 global and 6 each from the ins and outs) for register allocation in each function, where x registers are required for the ins and outs of the function under consideration. We also considered a 4K instruction cache and a 4K data cache, which is the memory configuration assumed by the tsim simulator. The comparison of our estimates and the tsim generated estimates is shown in figure 7.6. On average, our technique produced estimates within 9.7% of the performance numbers produced by the standard compiler and simulator available in the LEON tool set.

Figure 7.6: Validation on LEON (normalized estimates, in million cycles, for tsim-leon and ASSIST on biquad (*35), lattice (*15), matrix-mult (*1), bubble_sort (/7), heap_sort (*4), insertion (/3) and quick_sort (*10)).
7.4 Conclusion

In this Chapter we have described a technique to estimate the execution time overheads due to limited register windows and the data cache. We have combined it with the techniques proposed in Chapters 5 and 6 for estimating register spills and global needs. After integrating all the techniques, we generated and validated estimation results for the selected benchmarks on the LEON processor. To obtain cache miss statistics, we used the sim-cache simulator available with the simplescalar tool set. We have generated various trade-off results, such as number of register windows versus window size for a given total number of registers; trade-off results were also generated for register file size versus cache memory configuration. This allows us to arrive at a suitable register file size, window configuration and cache configuration. So far, we have considered benchmark applications which are very small; in the next Chapter, we present a couple of case studies with real life applications and show the usefulness of our technique.
Chapter 8
Illustrative Case Studies and Applications of Our Approach

As mentioned in the last Chapter, we have developed a technique to explore register file size, windows and cache configurations in an integrated manner, though so far we have generated results only for small illustrative applications. In this Chapter, we present results for a couple of real life benchmark applications, namely the ADPCM encoder and decoder, which are part of the media benchmarks, and a collision detection application developed at IIT Delhi for detecting the collision of an object with a camera [89]. We also discuss the usefulness of our technique, supported by a case study in which the registers predicted as 'spare' by our technique are used to smoothen the coprocessor interface.
8.1 ADPCM Encoder and Decoder Storage Exploration

The processor we have chosen is LEON [4], a SPARC V8-based architecture whose synthesizable VHDL code is freely downloadable. The processor provides an interface for connecting a coprocessor to enhance the performance of computation intensive code segments. LEON is reconfigurable, as a number of parameters can be specified: integer unit parameters include the number of register windows, the presence or absence of an iterative multiplier and a coprocessor, and whether the fast branch unit is enabled. Similarly, the instruction and data cache sizes can be configured.
The benchmark applications chosen are the adpcm rawaudio encoder and adpcm rawaudio decoder, which are part of the media benchmarks [90]. ADPCM stands for Adaptive Differential Pulse Code Modulation; it is a family of speech compression and decompression algorithms. A common implementation takes 16-bit linear PCM samples and converts them to 4-bit samples, resulting in a compression rate of 4:1. The data files used are clinton.pcm and clinton.adpcm.

To explore the storage organization, we first generated execution time estimates using the techniques proposed in Chapters 5 and 6. Since a high level of nesting is not present in these applications, window spills would be minimal; window configuration exploration results are therefore not included. To estimate the overheads due to limited cache size, we used sim-cache [88] of the simplescalar tool set to compute cache miss statistics. To find the cache misses for a particular cache size, we took all possible combinations of the number of blocks, block size, associativity and replacement policy, and chose the combination giving the minimum number of cache misses. After computing the number of cache misses, we computed the miss penalty as described in the last Chapter, taking the constants α1 and α2 as 8 and 3 respectively. This cache miss penalty was added to the base execution time estimates, which were produced ignoring cache misses. We then plotted the execution estimates for various combinations of cache sizes and register file sizes.

Exploration results for the adpcm rawaudio encoder and decoder are shown in figures 8.1 and 8.2 respectively. The results show that increasing the data cache size from 1K to 2K significantly improves the performance of the encoder but not of the decoder. If the encoder is to complete its execution within 16 million cycles, then we have to use at least a 2K data cache, irrespective of the register file size. We also observe that there is no significant performance improvement beyond a register file size of 11 for either application.
8.2 Collision Detection Execution Time Validation

The second real life application we chose is collision detection, developed at IIT Delhi. The application checks whether any object is approaching the camera and is likely to collide
Figure 8.1: Execution time estimates for the adpcm rawaudio encoder on LEON (execution time in cycles versus register file size, R3 to R30, for data cache sizes D1 of 1K, 2K, 4K, 8K and 16K).
Figure 8.2: Execution time estimates for the adpcm rawaudio decoder on LEON (execution time in cycles versus register file size, R3 to R30, for data cache sizes D1 of 1K, 2K, 4K, 8K and 16K).
with it, and also estimates the time to collision. We ran the application on a set of images grabbed and stored offline, whereas the actual application is multi-threaded and analyzes grabbed images in real time. The processor chosen is LEON, which is a SPARC-like architecture and has register windows. The application is not very big, but it is computation intensive. We used our technique as described in Chapters 5, 6 and 7 to estimate execution cycle counts. We also estimated the execution of the same application using the tsim simulator available in the LEON tool set. Moreover, a suitable interface was created to perform VHDL simulation, since fully synthesizable VHDL code for the LEON processor is available. The execution time estimates produced by our estimator (443278 cycles) are within 10.33% of the estimates produced by tsim (494375 cycles) and within 5.26% of the estimates produced by VHDL simulation. We use this benchmark application to show the utility of our technique, as described later.
8.3 Applications

Optimizing the register file size is useful in many ways. Some of these are listed here.
• If a smaller register file can be used, then a register address needs fewer bits; we can then either reduce the instruction width or provide more room for the opcode, so that new application specific instructions can be easily accommodated.
• Reduced switching activity, and thus savings in power consumption.
• If the instruction width cannot be reduced by sparing some registers, these registers or their addresses can still be used efficiently for specific purposes. For example, hard-wiring some registers with fixed values helps remove some move instructions. There are other possibilities as well.
8.3.1 Reducing Bits in the Instruction

If we reduce the register file size in powers of 2 (say 64 to 32, or 32 to 8), then we can save bit(s) required for addressing a register. If, on analyzing the instruction set, we find the flexibility to reduce the range of the coefficients/immediate values which exist in some instruction formats, then we can save some bits in the instruction. This reduces the activity on the bus and thus saves power consumption as well. Alternatively, we may provide additional bit(s) for the opcode, so that new instructions can easily be incorporated for the operations to be performed by application specific FUs.

For example, the MIPS processor supports instructions of the formats shown in table 8.1. It has 32 registers, so 5 bits are kept for register addressing. If our analysis predicts that 16 registers are sufficient for achieving the desired performance for the input application(s), then we need only 4 bits to address a register. If the address/immediate field can be reduced by 1 bit (i.e. 16 to 15) and the target address is given three bits less (24 to 21), then we save three bits in the instruction. There is no switching activity on the unused 3 bits. Alternatively, we may provide 3 additional bits for the opcode, giving a larger encoding space for new instructions. Different combinations of widths for the address/immediate field, the target address field and the opcode can also be explored.

Field size   6 bits   5 bits   5 bits   5 bits   5 bits   6 bits
R-format     op       rs       rt       rd       shamt    funct      arithmetic instruction format
I-format     op       rs       rt       address/immediate            transfer, branch, immediate format
J-format     op       target address                                 jump instruction format

All MIPS instructions are 32 bits.

Table 8.1: MIPS instruction formats
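The bit arithmetic behind this argument is simply that ceil(log2(r)) bits are needed to address r registers, as the short check below shows:

    from math import ceil, log2

    for r in (32, 16):
        print(r, "registers ->", ceil(log2(r)), "bits")   # 32 -> 5, 16 -> 4

    # R-format: three register fields shrink by 1 bit each        -> 3 bits saved
    # I-format: two register fields (2 bits) + immediate 16 -> 15 -> 3 bits saved
    # J-format: target address reduced from 24 to 21 bits         -> 3 bits saved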
8.3.2 Alternate Use of Spare Registers

If analysis using our technique predicts that the register file size suitable for the input application is not a power of 2, say 45, then 6 bits are still required to address a register and we cannot save bits in the instruction. We show, however, how these addressable spare registers (64 − 45 = 19) or their addresses can still be used. We allow the code generator to use only 45 registers, because that gives us the desired performance. The following sections describe two such usages.
8.4 Hardwiring Some Constants to Spare Registers

A number of applications use some constant values repeatedly; it may be beneficial to allocate specific registers to these frequently used values to reduce loads/stores. MIPS and LEON already hardwire the first register to 0. We illustrate this with an example from the code generated by encc [65] for the biquad_N_sections application on the ARM7TDMI processor. The default configuration assumes 8 registers; we generated code for 8 registers as well as for 6 registers (figure 8.3). If the performance is not affected significantly, then we can use only 6 registers for register allocation and hardwire two registers with values. Part of the generated code is shown below; though the generated code differs between the two cases (8 registers or 6 registers), this part of the code is the same.
LL3_0
;; 69: biquad_N_sections.c    coefficients[f] = 7 ;
    LSL r3,r0,#2
    ADD r4,r1,r3
    MOV r3,#7
    STR r3, [r4, #0]
LL2_0
    .....
    .....
LL6_0
;; 72: biquad_N_sections.c    wi[f] = 0 ;
    LSL r1,r0,#2
    ADD r3,r2,r1
    MOV r1,#0
    STR r1, [r3, #0]
LL5_0
Figure 8.3: Part of the generated code for the ARM7TDMI

When 6 registers are used, registers r6 and r7 are spare. If we hardwire r6 with the value 7, then the instruction MOV r3, #7 can be removed and STR r3, [r4, #0] changed to STR r6, [r4, #0]. Similarly, if we hardwire r7 with the value 0, then the instruction MOV r1, #0 can be removed and STR r1, [r3, #0] changed to STR r7, [r3, #0].
8.5 Utilizing Spare Register Addresses to Interface Coprocessors

In another interesting use, spare register addresses may be used to address coprocessor registers. Typically, coprocessors and accelerators need to be interfaced to the processor for transmitting operands from processor to coprocessor and results from coprocessor to processor. This can be achieved very efficiently if the coprocessor input/output registers are mapped into the register address space of the processor.

We have come across two different scenarios in interfacing RISC architectures to coprocessors. MIPS allows direct transfer of values between the main register file and the coprocessor register file, so operand values are moved from the main register file to the coprocessor register file and result values produced in coprocessor registers are moved back to the main register file; applying the proposed idea of using the addresses of 'spare' registers predicted by our technique to address coprocessor registers can save such moves. LEON provides a floating point unit (FPU) interface for connecting a coprocessor, but does not allow values to be transferred directly between the integer unit register file and the floating point register file, so communication between processor and coprocessor goes through memory: operand values are first stored into memory and then loaded into the coprocessor register file, and after the coprocessor has produced results, these are first stored into memory and then loaded into the main register file. Applying our idea of using the addresses of 'spare' registers to address coprocessor registers can achieve significant benefits in this case, because all such loads and stores can be avoided.

An approach for complete SoC synthesis is being developed at IIT Delhi [91]: the application C program is run and profiled to identify its computation intensive part, and synthesizable VHDL is automatically generated for this part along with the necessary interfaces. We extend this synthesis approach to a LEON processor based ASIP. The level 1 data/instruction caches can be configured to meet the needs of the application, which can be predicted using our techniques. The number of register windows can also easily be reconfigured by setting a constant in the
top level VHDL description file. For software synthesis we need a retargetable compiler which can generate code for the chosen processor and memory configurations.

As mentioned earlier, we have chosen the collision detection application for this study. We generated a VHDL description of a coprocessor for update_sums and simulated the complete application using the LEON VHDL simulator. The execution time reduced from 467912 cycles to 178573 cycles through the use of the coprocessor. We used the FPU serial interface for the coprocessor; the main processor and coprocessor cannot run in parallel with this interface. As discussed earlier, in the case of LEON the parameters can be passed only through memory, so all the parameters have to be stored into memory when computed by the integer unit (IU) and then loaded from memory into the coprocessor. What we propose is to use the addresses of 'extra' registers to address coprocessor registers. The VHDL code of the processor was modified so that the FPU register file is synthesized as part of the IU register file: we merely modified the decoding of the register addresses of a couple of 'extra' registers, so that after decoding they point to FPU registers (figure 8.4). When a coprocessor parameter is generated by the IU, it can then be written directly to one of the FPU registers. This way some loads and stores from memory are saved; in this case, each invocation of the coprocessor reduces from 36 cycles to 4.

Figure 8.4: Use of our technique for interfacing a coprocessor to the LEON processor (before: separate IU and coprocessor register files with their own address decoders; after: the IU register address decoder is modified so that 'spare' IU register addresses decode to coprocessor registers).

Simulation results for collision
detection using LEON and a coprocessor show that 11446 cycles are saved, which is 6.4% of the total application time, just by using the addresses of two 'extra' IU registers to point to FPU registers. The reduction is relatively small because the coprocessor is not invoked very frequently.
8.6 Conclusion

We have integrated all the proposed techniques to explore the storage organization. The design space explored includes register file size, windows and cache memory configurations. To compute cache miss statistics we use the sim-cache simulator available with the simplescalar tool set. Our exploration results were validated on the LEON processor. The usefulness of our technique was established by simplifying the coprocessor interface to the LEON processor: the addresses of 'spare' registers predicted by our technique were used to address coprocessor registers, saving loads and stores of operands and results.
Chapter 9
Conclusions and Future Work

We now present overall conclusions and summarize our contributions. Further, we point out the limitations and identify specific directions for future work.
9.1 Summary of Our Contributions

We have developed a complete strategy to explore the on-chip storage architecture for Application Specific Instruction Set Processors. This work involves deciding a suitable register file size, number of register windows and on-chip memory configuration. We have proposed a scheduler based technique to decide a suitable register file size. We profile the application using gprof to get execution counts of the various blocks, and our technique uses the SUIF intermediate format to extract dependency and control information. Using the concept of register reuse chains, we are able to do register allocation before scheduling. Using the scheduler and profiler outputs, local estimates for all the blocks are generated; overheads to handle global needs are also considered when estimating performance. The proposed technique does register allocation prior to any scheduling to ensure the best use of registers, and global needs are also taken care of. As part of our approach to explore on-chip storage, we also evaluate the register window configuration (number of register windows and window size) as well as data cache configurations in an integrated manner.
We have conducted a number of experiments using this methodology to validate it and to demonstrate its utility. The technique was validated for three processors, namely the ARM7TDMI [66], the TM-1000 [81] (a VLIW processor) and LEON [4]. Results show that our estimates are within 9.6%, 3.3% and 9.7% respectively for these processors, compared to the actual performance produced by standard tool sets. Further, this technique was nearly 77 times faster than the simulator based technique for the TM-1000. In the case of the LEON architecture, variation in the number of register windows was also considered. Our technique requires neither a code generator nor a simulator for the specific target architecture; further, the processor description required for retargeting is very simple. Apart from the on-chip storage related parameters, we can vary the number and types of functional units as well. To show the applicability of our approach on real examples, we considered ADPCM (encoder/decoder) from mediabench [90] and a real time vision application for collision detection [89]. We were able to demonstrate that the technique is capable of studying the trade-off between register file size and cache size; thus the design space we can handle is quite large. Further, we have reported experiments that effectively use the 'spare' register address space for purposes like storage of frequently used constants and efficient interfacing of coprocessors, showing significant performance gains in these case studies.
9.2 Limitations and Future Work

In our work, we presently ignore the effect of processor configuration changes on the clock period and instruction width. We estimate performance in terms of cycle count, whereas real performance is the product of cycle count and clock period. Further, a change in register file size will generally impact the instruction format and hence the instruction width; a change in instruction width is likely to influence the area and, perhaps in some cases, the clock period. Further work is required to overcome these limitations. Our work has focused on the estimation of performance as the basic criterion to drive design space exploration; however, power and energy are increasingly becoming important factors, and research is required to estimate the influence of the storage architecture on these parameters.
Bibliography

[1] M.K. Jain, M. Balakrishnan, and A. Kumar. ASIP Design Methodologies: Survey and Issues. In Proceedings of the IEEE/ACM International Conference on VLSI Design (VLSI 2001), pages 76–81, January 2001.
[2] M.K. Jain, L. Wehmeyer, S. Steinke, P. Marwedel, and M. Balakrishnan. Evaluating Register File Size in ASIP Design. In Proceedings of the Ninth International Symposium on Hardware/Software Co-design (CODES 2001), pages 109–114, April 2001.
[3] M.K. Jain, M. Balakrishnan, and Anshul Kumar. An Efficient Technique for Exploring Register File Size in ASIP Synthesis. In Proceedings of the Fifth International Conference on Compilers, Architecture and Synthesis for Embedded Systems, (CASES 2002), pages 252–261, October 2002.
[4] LEON Homepage. "http://www.gaisler.com/leon.html".
[5] N. Ghazal, R. Newton, and J. Rabaey. Retargetable Estimation Scheme for DSP Architecture Selection. In Proceedings of the Asia and South Pacific Design Automation Conference, pages 485–489, January 2000.
[6] T.V.K. Gupta, P. Sharma, M. Balakrishnan, and S. Malik. Processor Evaluation in an Embedded Systems Design Environment. In Proceedings of the Thirteenth International Conference on VLSI Design, pages 98–103, January 2000. 105
[7] I. J. Huang and A. M. Despain. Generating Instruction Sets and Micro-architectures from Applications. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pages 391–396, November 1994.

[8] I. J. Huang and A. M. Despain. Synthesis of Application Specific Instruction Sets. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), Vol. 14, Issue 6, pages 663–675, June 1995.

[9] J. Sato, M. Imai, T. Hakata, A. Y. Alomary, and N. Hikichi. An Integrated Design Environment for Application Specific Integrated Processor. In Proceedings of the IEEE International Conference on Computer Design: VLSI in Computers and Processors (ICCD), pages 414–417, October 1991.

[10] A. D. Gloria and P. Faraboschi. An Evaluation System for Application Specific Architectures. In Proceedings of the 23rd Annual Workshop and Symposium on Microprogramming and Micro-architecture (MICRO-23), pages 80–89, November 1990.

[11] A. Halambi, P. Grun, A. Khare, V. Ganesh, N. Dutt, and A. Nicolau. EXPRESSION: A Language for Architecture Exploration through Compiler/Simulator Retargetability. In Proceedings of Design Automation and Test in Europe (DATE), pages 485–490, March 1999.

[12] S. Pees, V. Zivojnovic, and H. Meyr. LISA: Machine Description Language for Cycle-Accurate Models of Programmable DSP Architectures. In Proceedings of the Design Automation Conference (DAC), pages 933–938, June 1999.

[13] B. Kienhuis, E. Deprettere, K. Vissers, and P. van der Wolf. The Construction of a Retargetable Simulator for an Architecture Template. In Proceedings of the Sixth International Workshop on Hardware/Software Co-design (CODES/CASHE '98), pages 125–129, March 1998.

[14] J. Kin, C. Lee, W.H. Mangione-Smith, and M. Potkonjak. Power Efficient Media Processors: Design Space Exploration. In Proceedings of the 36th Design Automation Conference, pages 321–326, June 1999.
[15] M. Imai, A. Alomary, J. Sato, and N. Hikichi. An Integer Programming Approach to Instruction Implementation Method Selection Problem. In Proceedings of the European Design Automation Conference (EURO-DAC) with EURO-VHDL, pages 106–111, September 1992.

[16] N. N. Binh, M. Imai, and A. Shiomi. A New HW/SW Partitioning Algorithm for Synthesizing the Highest Performance Pipelined ASIPs with Multiple Identical FUs. In Proceedings of the Design Automation Conference with EURO-VHDL and EURO-DAC, pages 126–131, September 1996.

[17] N. N. Binh, M. Imai, and Y. Takeuchi. A Performance Maximization Algorithm to Design ASIPs under the Constraint of Chip Area Including RAM and ROM Sizes. In Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC), pages 367–372, February 1998.

[18] N. N. Binh, M. Imai, A. Shiomi, and N. Hikichi. A Hardware/Software Partitioning Algorithm for Pipelined Instruction Set Processor. In Proceedings of the Design Automation Conference with EURO-VHDL and EURO-DAC, pages 176–181, September 1995.

[19] C. Liem, T. May, and P. Paulin. Instruction-Set Matching and Selection for DSP and ASIP Code Generation. In Proceedings of the European Design and Test Conference (EDAC-ETC-EUROASIC), pages 31–37, March 1994.

[20] P. J. Hatcher and J. W. Tuller. Efficient Retargetable Compiler Code Generation. In Proceedings of the International Conference on Computer Languages, pages 25–30, October 1988.

[21] R. Leupers, W. Schenk, and P. Marwedel. Retargetable Assembly Code Generation by Bootstrapping. In Proceedings of the Seventh International Symposium on High-Level Synthesis, pages 88–93, May 1994.

[22] S. Hanono and S. Devadas. Instruction Selection, Resource Allocation, and Scheduling in the AVIV Retargetable Code Generator. In Proceedings of the Design Automation Conference (DAC), pages 510–515, June 1998.
[23] M. Gschwind. Instruction Set Selection for ASIP Design. In Proceedings of the Seventh International Workshop on Hardware/Software Codesign (CODES), pages 7–11, May 1999.

[24] Hoon Choi, In-Cheol Park, Seung Ho Hwang, and Chong-Min Kyung. Synthesis of Application Specific Instructions for Embedded DSP Software. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Digest of Technical Papers, pages 665–671, November 1998.

[25] R. Leupers and P. Marwedel. Instruction Selection for Embedded DSPs with Complex Instructions. In Proceedings of the European Design Automation Conference (EURO-DAC) with EURO-VHDL, pages 200–205, September 1996.

[26] M. Breternitz Jr. and J. P. Shen. Architecture Synthesis of High-Performance Application-Specific Processors. In Proceedings of the Design Automation Conference (DAC), pages 542–548, January 1990.

[27] M. A. R. Saghir, P. Chow, and C. G. Lee. Application Driven Design of DSP Architectures and Compilers. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 2, pages 437–440, April 1994.

[28] J. Gong, D. D. Gajski, and A. Nicolau. Performance Evaluation for Application-Specific Architectures. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 3, Issue 4, pages 483–490, December 1995.

[29] B. Middha, V. Raj, A. Gangwar, A. Kumar, and M. Balakrishnan. A TRIMARAN based Framework for Exploring the Design Space of VLIW. In Proceedings of the International Symposium on System Synthesis (ISSS), pages 2–7, October 2002.

[30] Trimaran Homepage. "http://www.trimaran.org".

[31] P. Mishra, P. Grun, N. Dutt, and A. Nicolau. Processor-Memory Co-Exploration driven by a Memory-Aware Architecture Description Language. In Proceedings of the IEEE/ACM International Conference on VLSI Design (VLSI 2001), pages 70–75, January 2001.
[32] A. Hoffmann, T. Kogel, A. Nohl, G. Braun, O. Schliebusch, O. Wahlen, A. Wieferink, and H. Meyr. A Novel Methodology for the Design of Application-Specific Instruction-Set Processors (ASIPs) Using a Machine Description Language. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 20(11):1338–1354, November 2001.

[33] O. Schliebusch, A. Hoffmann, A. Nohl, G. Braun, and H. Meyr. Architecture Implementation Using the Machine Description Language LISA. In Proceedings of the IEEE/ACM International Conference on VLSI Design and ASP Design Automation Conference (VLSI/ASP-DAC 2002), pages 239–244, January 2002.

[34] M. Itoh, S. Sato, A. Shiomi, Y. Takeuchi, A. Kitajima, and M. Imai. PEAS-III: An ASIP Design Environment. In Proceedings of the International Conference on Computer Design (ICCD), pages 430–436, September 2000.

[35] S. Aditya and B. R. Rau. Automatic Architectural Synthesis and Compiler Retargeting for VLIW and EPIC Processors. Technical Report HPL-1999-93, HP Laboratories, Palo Alto, 1999.

[36] Target Compiler Technologies Homepage. "http://www.retarget.com".

[37] Axys Homepage. "http://axysdesign.com".

[38] Tensilica Homepage. "http://www.tensilica.com".

[39] IMPACT Research Group. "http://www.crhc.uiuc.edu/Impact".

[40] D. Fischer, J. Teich, M. Thies, and R. Weper. Efficient Architecture/Compiler Co-Exploration for ASIPs. In Proceedings of the Fifth International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES 2002), pages 27–34, October 2002.

[41] J. V. Praet, G. Goossens, D. Lanneer, and H. De Man. Instruction Set Definition and Instruction Selection for ASIPs. In Proceedings of the Seventh International Symposium on High-Level Synthesis, pages 11–16, May 1994.
[42] A. Alomary, T. Nakata, Y. Honma, M. Imai, and N. Hikichi. An ASIP Instruction Set Optimization Algorithm with Functional Module Sharing Constraint. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Digest of Technical Papers, pages 526–532, November 1993.

[43] J. Shu, T. C. Wilson, and D. K. Banerji. Instruction-set Matching and GA-based Selection for Embedded Processor Code Generation. In Proceedings of the Ninth International Conference on VLSI Design, pages 73–76, January 1996.

[44] K. Atasu, L. Pozzi, and P. Ienne. Automatic Application-Specific Instruction-Set Extensions under Microarchitectural Constraints. In Proceedings of the Design Automation Conference (DAC), 2003.

[45] W. Cheng and Y. Lin. Code Generation for a DSP Processor. In Proceedings of the Seventh International Symposium on High-Level Synthesis, pages 82–87, May 1994.

[46] T. Wilson, G. Grewal, B. Halley, and D. Banerji. An Integrated Approach to Retargetable Code Generation. In Proceedings of the Seventh International Symposium on High-Level Synthesis, pages 70–75, May 1994.

[47] R. Leupers, R. Niemann, and P. Marwedel. Methods for Retargetable DSP Code Generation. In Proceedings of the International Workshop on VLSI Signal Processing, pages 127–136, October 1994.

[48] W. Kreuzer, M. Gotschlich, and B. Wess. A Retargetable Optimizing Code Generator for Digital Signal Processors. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), Connecting the World, Vol. 2, pages 257–260, 1996.

[49] J. V. Praet, D. Lanneer, G. Goossens, W. Geurts, and H. De Man. A Graph Based Processor Model for Retargetable Code Generation. In Proceedings of the European Design and Test Conference (ED&TC), pages 102–107, March 1996.

[50] B. S. Visser. A Framework for Retargetable Code Generation using Simulated Annealing. In Proceedings of the 25th EUROMICRO Conference, pages 458–462, September 1999.
[51] M. Yamaguchi, N. Ishiura, and T. Kambe. Binding and Scheduling Algorithms for Highly Retargetable Compilation. In Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC), pages 93–98, February 1998.

[52] R. Leupers and P. Marwedel. Retargetable Generation of Code Selectors from HDL Processor Models. In Proceedings of the European Design & Test Conference (ED&TC), 1997.

[53] R. Johnk and P. Marwedel. MIMOLA Reference Manual V 3.45. Technical Report #470, University of Dortmund, Germany, 1994.

[54] D. Lanneer, J. V. Praet, A. Kifli, K. Schoofs, W. Geurts, F. Thoen, and G. Goossens. Chess: Retargetable Code Generation for DSP Processors. In Code Generation for Embedded Processors, Kluwer Academic Publishers, pages 85–102, 1995.

[55] S. Hanono and S. Devadas. Instruction Selection, Resource Allocation and Scheduling in the AVIV Retargetable Code Generator. In Proceedings of the 35th Design Automation Conference, pages 510–515, June 1998.

[56] C. Liem. Retargetable Compilers for Embedded Core Processors: Methods and Experiences in Industrial Applications. Kluwer Academic Publishers, 1997.

[57] T. Kuroda and T. Nishitani. A Knowledge-Based Retargetable Compiler for Application Specific Signal Processors. In Proceedings of the IEEE International Symposium on Circuits and Systems, pages 631–634, May 1989.

[58] R. Leupers and P. Marwedel. A BDD-based Frontend for Retargetable Compilers. In Proceedings of the European Design and Test Conference (ED&TC), pages 239–243, March 1995.

[59] R. Leupers and P. Marwedel. Time-constrained Code Compaction for DSPs. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 5, Issue 1, pages 112–122, January 1995.
[60] R. Leupers and P. Marwedel. Algorithms for Address Assignment in DSP Code Generation. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Digest of Technical Papers, pages 109–112, November 1996.

[61] W. Kreuzer and B. Wess. Cooperative Register Assignment and Code Compaction for Digital Signal Processors with Irregular Data Paths. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 1, pages 691–694, April 1997.

[62] C. H. Gebotys. An Efficient Model for DSP Code Generation: Performance, Code Size, Estimated Energy. In Proceedings of the Tenth International Symposium on System Synthesis, pages 41–47, September 1997.

[63] G. Araujo, S. Malik, and M. Lee. Using Register Transfer Paths in Code Generation for Heterogeneous Memory-Register Architecture. In Proceedings of the Design Automation Conference (DAC), 1996.

[64] ACE Associated Compiler Experts. "http://www.ace.nl".

[65] S. Steinke and L. Wehmeyer. encc: An Energy Aware C Compiler. "http://ls12-www.cs.uni-dortmund.de/~steinke/LOW_POWER/encc.html", 2000.

[66] ARM Ltd. Homepage. "http://www.arm.com".

[67] Advanced RISC Machines Ltd (ARM). ARM Powered[tm] products. "http://www.arm.com/arm/arm_powered?OpenDocument", 2001.

[68] V. Zivojnovic, J. Martinez, C. Schlaeger, and H. Meyr. DSPstone: A DSP-Oriented Benchmarking Methodology. In Proceedings of the International Conference on Signal Processing Applications and Technology (ICSPAT '94), Dallas, Texas, October 1994.

[69] LANCE System Homepage. "http://ls12-www.cs.uni-dortmund.de/~leupers/lanceV2/".

[70] V. Tiwari, S. Malik, and A. Wolfe. Power Analysis of Embedded Software: A First Step Towards Software Power Minimization. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pages 384–390, November 1994.
[71] M. Theokharidis. Measuring Energy Consumption of ARM7TDMI Processor Instructions (in German). Master's thesis, University of Dortmund, Department of Computer Science XII, 2000. Available from "http://ls12-www.cs.uni-dortmund.de/publications/theses".

[72] A. P. Chandrakasan et al. Low Power CMOS Digital Design. IEEE Journal of Solid-State Circuits, 27(4):473–484, April 1992.

[73] Y. Zhang and H. J. Lee. Register Allocation Over a Dependence Graph. In Proceedings of the Second International Workshop on Compiler and Architecture Support for Embedded Systems (CASES 1999), October 1999.

[74] M.K. Jain, M. Balakrishnan, and A. Kumar. Exploring Storage in ASIP Synthesis. In Proceedings of the EUROMICRO Symposium on Digital System Design (Euro-DSD 2003), September 2003.

[75] SUIF Homepage. "http://suif.stanford.edu".

[76] J. R. Goodman and W. C. Hsu. Code Scheduling and Register Allocation in Large Basic Blocks. In Proceedings of the ACM International Conference on Supercomputing (ICS 1988), pages 444–452, 1988.

[77] C. Norris and L. L. Pollock. Register Allocation Over the Program Dependence Graph. In Proceedings of the Conference on Programming Language Design and Implementation (SIGPLAN 1993), June 1994.

[78] D. A. Berson, R. Gupta, and M. L. Soffa. URSA: A Unified Resource Allocator for Registers and Functional Units in VLIW Architectures. In Proceedings of the IFIP WG 10.3 (Concurrent Systems) Working Conference on Architectures and Compilation Techniques for Fine and Medium Grain Parallelism, Orlando, FL, pages 243–254, January 1993.

[79] M. Farach and V. Liberatore. On Local Register Allocation. DIMACS Technical Report #97-33, July 1997.

[80] A. V. Aho and J. D. Ullman. Principles of Compiler Design. Addison-Wesley, 1977.
[81] Trimedia Homepage. "http://www.trimedia.com".

[82] L. Wehmeyer, M.K. Jain, S. Steinke, P. Marwedel, and M. Balakrishnan. Analysis of the Influence of Register File Size on Energy Consumption, Code Size and Execution Time. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 20(11):1329–1337, November 2001.

[83] V. Bhatt, M. Balakrishnan, and A. Kumar. Exploring the Number of Register Windows in ASIP Synthesis. In Proceedings of the IEEE/ACM International Conference on VLSI Design and ASP Design Automation Conference (VLSI/ASP-DAC 2002), pages 223–229, January 2002.

[84] B. L. Jacob, P. M. Chen, S. R. Silverman, and T. N. Mudge. An Analytical Model for Designing Memory Hierarchies. IEEE Transactions on Computers, 45(10):1180–1194, October 1996.

[85] D. Kirovski, C. Lee, M. Potkonjak, and W. H. Mangione-Smith. Application-Driven Synthesis of Memory-Intensive Systems-on-Chip. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 18(9):1316–1326, September 1999.

[86] S. G. Abraham and S. A. Mahlke. Automatic and Efficient Evaluation of Memory Hierarchies for Embedded Systems. Technical Report HPL-1999-132, HP Laboratories, Palo Alto, October 1999.

[87] P. Grun, N. Dutt, and A. Nicolau. APEX: Access Pattern based Memory Architecture Exploration. In Proceedings of the International Symposium on System Synthesis (ISSS 2001), October 2001.

[88] Simplescalar Homepage. "http://www.simplescalar.com".

[89] S. K. Lodha and S. Gupta. An FPGA based Real Time Collision Detection and Avoidance. B.Tech. Thesis, Department of Computer Science and Engineering, IIT Delhi, 1997.

[90] MediaBench Homepage. "http://cares.icsl.ucla.edu/MediaBench".
[91] A. Singh, A. Chhabra, A. Gangwar, B.K. Dwivedi, M. Balakrishnan, and A. Kumar. SoC Synthesis with Automatic Hardware Software Interface Generation. In Proceedings of the IEEE/ACM International Conference on VLSI Design (VLSI 2003), January 2003.
Technical Biography of Author
Manoj Kumar Jain received the M.Sc. degree from M.L. Sukhadia University, Udaipur (India) in 1989, and the M.Tech. degree in Computer Applications from the Indian Institute of Technology, Delhi (India) in 1993. He has been an Assistant Professor in Computer Science at M.L. Sukhadia University, Udaipur (India) since December 1993. He completed his Ph.D. degree in Computer Science and Engineering at the Indian Institute of Technology, Delhi (India) in May 2004. His research interests include Application Specific Instruction Set Processor (ASIP) design and embedded systems design.
List of Publications
1. M.K. Jain, M. Balakrishnan and A. Kumar. ASIP Design Methodologies: Survey and Issues. Technical Report TR #2000/23, Embedded System Project, Department of Computer Science and Engineering, Indian Institute of Technology, Delhi.

2. M.K. Jain, L. Wehmeyer, P. Marwedel and M. Balakrishnan. Register File Synthesis in ASIP Design. Technical Report #746, 07.12.2000, Lehrstuhl Informatik XII, University of Dortmund, Germany.

3. M.K. Jain, M. Balakrishnan and A. Kumar. ASIP Design Methodologies: Survey and Issues. Proceedings of the Fourteenth International Conference on VLSI Design, 3-7 January 2001, Pages: 76-81.

4. L. Wehmeyer, M.K. Jain, S. Steinke, P. Marwedel and M. Balakrishnan. Using an Energy Aware Compiler Framework to Evaluate Changes in Register File Size towards ASIP-Design. Fifth International Workshop on Software and Compilers for Embedded Systems (SCOPES 2001), 20-22 March 2001, St. Goar, Germany.

5. M.K. Jain, L. Wehmeyer, S. Steinke, P. Marwedel and M. Balakrishnan. Evaluating Register File Size in ASIP Design. Proceedings of the Ninth International Symposium on Hardware/Software Co-design (CODES 2001), 25-27 April 2001, Copenhagen, Denmark, Pages: 109-114.

6. L. Wehmeyer, M.K. Jain, S. Steinke, P. Marwedel and M. Balakrishnan. Analysis of the Influence of Register File Size on Energy Consumption, Code Size and Execution Time. IEEE TCAD, vol. 20, no. 11, November 2001, Pages: 1329-1337.

7. M.K. Jain, M. Balakrishnan and A. Kumar. An Efficient Technique for Exploring Register File Size in ASIP Synthesis. Proceedings of the Fifth International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES 2002), 8-11 October 2002, Grenoble, France, Pages: 252-261.

8. M.K. Jain, M. Balakrishnan and A. Kumar. Exploring Storage Organization in ASIP Synthesis. Euromicro Symposium on Digital System Design (Euro-DSD 2003), 1-6 September 2003, Belek Near Antalya, Turkey.