Code Generation for a Dual Instruction Set Processor Based on ...
Recommend Documents
have developed a custom FPGA-based Altera NIOS system to increase the .... is implemented as custom logic block and its software macro generated is then ... By targeting low cost devices, powerful, customized embedded systems can be ...
OFDM is the basis for the European digital audio broadcasting (DAB) standard, ... OFDM system are efficiently implemented by use of the fast Fourier transform ... system provides VHDL and Verilog code to synthesize a full functional ... Custom instru
Another contribution is the cycle-level analysis of a modified ARM processor that includes a ... and we evaluate in detail the effects of varying architectural.
cro architectural aspects in order to write functionally correct software, even in assembly language. ...... of test-program generators for digital signal processors.
array (FPGA), pipeline, reconfigurable processor. .... Each row implements a stage in the pipeline. ... XiRisc is a VLIW processor based on the classic RISC.
A summary of the ARM processor instruction set is shown in Figure 5-1: ..... To
ensure the maximum compatibility between ARM processor programs and future.
A summary of the ARM processor instruction set is shown in Figure 5-1: Instruction ... Register List. Branch cond. 1 0 1
Mar 26, 2001 - Register pipelining, a known optimization technique in compilers, eliminates .... and determining the number of free lower and higher registers, the .... The benchmark biquad_N_sections is typical for the DSP domain and part ...
User-de ned high-level self-test programs are compiled to machine code for prede ned RTL structures 14 . A more recent self-test program compiler, which.
Index TermsâApplication-specific instruction set processor. (ASIP), coordinate rotation digital computer (CORDIC) pro- cessors, modern signal processing, QR ...
tion Specific Instruction Set Processor (ASIP) to implement data-adaptive motion estimation algorithms, that is char- acterized by a specialized data-path and ...
Outline of OROCHI: A Multiple Instruction Set Executable SMT Processor. Hajime Shimada. Graduate ... meet the rising demands of portable media processing mar- ket. .... quirements, we are planning to add the hardware mecha- nism which ...
IBM Corp. Esses Junction, VT Faculty of Electrical Engineering Endicott, NY. USA. Delft, The ... branch outcomes by tagging instructions in an instruc- tion cache ...
Jul 4, 2009 - for embedded control systems based on MIC. GME is used to .... 2) According to the abstract principles, acquiring suit- .... http://www.omg.org/docs/ptc/03-10-14.pdf, 2003. ... approach to engineering object-oriented design lan-.
of a specific processor. For instance, SPACE and MPARM support ARM ISS simulation [2], [3]. Platune supports MIPS architecture [4]. In 2007, QEMU-SystemC ...
Jun 5, 2018 - [3] Intel IXP1200 Network Processor Family - Hardware Reference Manual 2001 http://www.intel.com. [4] Cisco ASR 1000 Series Aggregation ...
We have proposed a scalable instruction set architecture (ISA) extension aimed ..... original architecture of ARM is taken as an example. We can figure out that ...
This chapter describes the signals and operation of the Intel 80386SX processor
bus and describes a ... in the latest generation of Pentium and compatible.
Design automation for embedded systems comprising both hardware and ... and stringent time-to-market constraints demand for design automation tools that.
formal description method used as an intermediate representation of statecharts for model checking purposes ..... storage method of events (FIFO, LIFO, set, mul-.
Agency for Technology and Innovation, Nokia Siemens Net- works, Elektrobit, and Texas Instruments. REFERENCES. [1] I. E. Telatar, âCapacity of multi-antenna ...
maximum likelihood (ML) or near-ML estimation with reduced complexity. This paper reviews related ..... gines and APP cost function units. The total area of the.
IBM Corp. TU Delft. IBM Corp. Endicott, NY. E.E. Dept. Essex Junction, VT. USA. Delft, The ... of branch unit is its capability to remove branch instructions from the ...
like a classic register defined everywhere in the description. (i.e. CP ... of hardware components and the definition of the pipeline. It ..... It is a 16 bits RISC co-.
Code Generation for a Dual Instruction Set Processor Based on ...
Sep 24, 2003 - code size and execution time. ⫠More fine-grained approach is required. ⢠Mode switches must be explicitly handled by code generation. A. B. C.
Code Generation for a Dual Instruction Set Processor Based on Selective Code Transformation
Sheayun Lee, Jaejin Lee, Sang Lyul Min, Jason Hiser, and Jack W. Davidson Seoul National University & University of Virginia Sep. 24, 2003 SCOPES 2003, Wien, Austria
Motivation Code size is often a critical constraint in cost-sensitive embedded systems.
• Due to a limited amount of available memory Dual instruction set processors provide an effective mechanism to trade performance off for a small code size.
• Ex. Thumb programs are typically 30% smaller and run 25% slower than their ARM counterparts. Sheayun Lee - SCOPES 2003
1
Dual Instruction Set Processors Full instruction set • Normal instructions, all operations available • Relatively rich addressing modes • Full set of general purpose registers available Reduced instruction set • Typically a subset of the full instruction set • Shorter instructions (usually half – 16 bits) • Limited operations and addressing modes • Only a subset of GP registers made visible Examples • ARM/Thumb, MIPS 32/16-TinyRISC, ARC Tangent, … Sheayun Lee - SCOPES 2003
2
Comparison of the two ISA Full instruction set
• Larger code size, faster execution • Used for time-critical functions
Real-time data processing Interrupt/exception handlers
Reduced instruction set
• Smaller code size, slower execution • Used for non-time-critical functions
User interface functions Bookkeeping code
Sheayun Lee - SCOPES 2003
3
Coarse-Grained Approach Current tool support from ADS (ARM Developer Suite)
• Instruction set mode specified at the module (file) level
Provided as a command-line switch to the compiler
• Linker support to handle the dynamic mode •
transitions Called ARM/Thumb interworking
Sheayun Lee - SCOPES 2003
4
Fine-Grained Approach A
Observation • Function-level or modulelevel approaches are too coarse-grained for enabling a flexible tradeoff between code size and execution time.
More fine-grained approach is required. • Mode switches must be explicitly handled by code generation. Sheayun Lee - SCOPES 2003
B
C
D
E
F 5
Our Objective Given a program, generate code that
• maximizes performance
in terms of execution time
• while satisfying code size requirement
given as an upper bound on the instruction space
• by selectively using the two different instruction sets for different program parts in a single program (even inside a single function)
taking the mode switching overhead into account
Sheayun Lee - SCOPES 2003
6
Intuitive Solution Which instruction set should be used for which part of a given program?
• Use the full instruction set for
time-critical portion of the code (frequently executed) thereby maximizing the performance
• Use the reduced instruction set for
all the other parts (infrequently executed) thereby minimizing the code size
Sheayun Lee - SCOPES 2003
7
Selective Code Transformation Full FullISA ISA program program
Code size diff Exec time diff
Constraint on total code size
Try translating every block into full ISA Profile Source Source program program
Reduced ReducedISA ISA program program
Compile in reduced ISA
Frequency information
Decision on Heuristic which blocks selection of to translate blocks to into ARM mode translate
Transformer: selective transformation Sheayun Lee - SCOPES 2003
Mixed Mixed mode mode program program 8
Program Notations A program is given by P = , where • V = {vi | i = 1, 2, …, n} • E = {eij = | there exists a control flow vi → vj} Code size and execution time variables for each basic block v Instruction set
Reduced
Full
Code size
sR(v)
sF(v)
Execution time
tR(v)
tF(v)
Sheayun Lee - SCOPES 2003
9
Formal Problem Description Find an assignment of instruction set for each basic block that maximizes performance while satisfying code size constraint. f : V → {α , β } α if vi is to be compiled into full instructions f (vi ) = β if vi is to be compiled into reduced instructions
Such an assignment partitions the set of blocks into two disjoint subsets. • F = {v | f(v) = α } • R = {v | f(v) = β } Sheayun Lee - SCOPES 2003
10
Code Size and Execution Time Code size overhead due to mode switches
Total code size
S = ∑ s F (v ) + ∑ s R (v ) + s v∈F
*
v∈R
∆ t = ∑ cV (v) × (t R (v) − t F (v)) − t
*
v∈F
Execution time savings Execution frequency of block v Sheayun Lee - SCOPES 2003
Execution time overhead due to mode switches
11
Mode Switching Overhead Set of edges along which mode switches should occur
E * = {eij ∈ E | (vi ∈ F ∧ v j ∈ R) ∨ (vi ∈ R ∧ v j ∈ F )} Execution frequency of edge e Total code size overhead Total execution time overhead
s * = os × E * t * = ot × ∑ cE (e) e∈E *
Sheayun Lee - SCOPES 2003
12
Optimization Problem Given a program P = , find an assignment f : V → {α, β} such that it maximizes ∆ t = ∑ cV (v) × (t R (v) − t F (v)) − ot × ∑ cE (e) v∈F
e∈E *
while satisfying S = ∑ s F ( v ) + ∑ s R ( v ) + os × E * ≤ U s v∈F
v∈R
Upper bound on the maximum code size for the whole program Sheayun Lee - SCOPES 2003
13
Approximation Method Path-based approach to selective code transformation • Based on intraprocedural acyclic subpaths
Acyclic subpaths capture the set of basic blocks that are executed together.
• Enumerate the acyclic subpaths and associate with each of them
Cost (increase of code size) Benefit (reduction in execution time)
• A greedy heuristic incrementally selects one path at a time, whose blocks are to be transformed. Sheayun Lee - SCOPES 2003
14
Path-Based Cost-Benefit Model c( p) =
∑ (s
v∈V ( p ) ∩ R
Cost of transforming blocks on path p
F
(
(v) − sR (v)) + os × E M ( p ) − E m ( p )
Set of edges where mode switch instructions are newly introduced
)
Set of edges where previously existing mode switch instructions are removed
b( p ) = ∑ (cV (v) × (t R (v) − t F (v))) − ot × ∑ cE (e) − ∑ cE (e) M m v∈V ( p ) ∩ R e ∈ E ( p ) e ∈ E ( p ) Benefit from transforming blocks on path p Sheayun Lee - SCOPES 2003
Set of blocks on path p that are currently in the reduced instruction set mode 15
Selection Algorithm: Greedy Heuristic B ← U s – SR R←V F←∅ P ← {intraprocedural acyclic subpaths} do {
for each p ∈ P calculate r (p) = b (p) / c (p) select p ∈ P with maximum r (p) with c (p) ≤ B B ← B – c (p) F ← F ∪ V (p) R ← R – V (p) P ← P – { p | V (p) ∩ R = ∅ } } while (B ≥ minp∈P{ c (p) } ∧ minp∈P{ c (p) } ≥ 0 ∧ R ≠ ∅)
Sheayun Lee - SCOPES 2003
16
Implementation Based on vpo (very portable optimizer), targeted for ARM/Thumb architecture
• RTL representation of programs • Instruction selection mechanism based on peephole optimization
One or more Thumb RTL statement can be combined to form a single ARM RTL statement.
• Automatic insertion of mode switching instructions
Based on control-flow analysis
Sheayun Lee - SCOPES 2003
17
Experiments Test programs are taken from • MiBench • MediaBench Six different versions of code are generated for each of the test programs. • With different code size limits Each test program is run on an evaluation board for execution time measurement. • XScale core-based PXA250 processor • Execution times are measured by using the gettimeofday () system call. Sheayun Lee - SCOPES 2003
18
Results sha
G.721.encode
1.8
1.8
1.6
1.6
1.4
T 20% 40% 60% 80% A' A
1.2 1.0 0.8 0.6 0.4
1.4 1.2 1.0 0.8 0.6 0.4
0.2
0.2
0.0
0.0 code size
execution time
Sheayun Lee - SCOPES 2003
T 20% 40% 60% 80% A' A
code size
execution time
19
Conclusions Code generation framework that can exploit dual instruction sets
• Path-based cost-benefit model • Heuristic selection of instruction set mode for •
each basic block Automatic handling of dynamic mode switches
Sheayun Lee - SCOPES 2003
20
Future Work Efficient post-pass register allocation algorithm
• Exploit all the available registers in the • •
transformed code sections Need to precisely analyze the impact of allocating each program variable to a register Further enhance the performance of the resulting mixed instruction set mode program