CSE 237A Hardware/Software Codesign Tajana Simunic Rosing Department of Computer Science and Engineering University of California, San Diego.

ES Design
[Figure: embedded system design flow; recoverable labels: Hardware components, Hardware, Verification and Validation]

ES Application Classes
Class: Data flow
  Application: laser printers, X-terminals, routers, bridges, image processing
  Processor: R4600, I960, 29k, Coldfire, PPC (403, 605)
  Requirements: processes data and passes it on; high memory bandwidth, high throughput
Class: Interactive
  Application: set-top boxes, video games, PDAs, portable video & info appliances
  Processor: R3900, R4100/4300/4600, ARM 6xx/7xx, V851, SH 1/2/3
  Requirements: interactive, low cost, low power, high throughput
Class: Classic embedded
  Application: portable controllers, disk controllers, automotive, industrial control
  Processor: Piranha, ARM, MIPS, cores
  Requirements: mix of CPU power, low cost, low power, peripherals
Time-constrained computing systems.

System Design Problem Areas
1. Design environment, co-simulation, constraint analysis.
2. HDL modeling, architectural synthesis, logic synthesis, physical synthesis.
3. Software synthesis, optimization, retargetable code generation, debugging & programming environment.
4. Test issues.
[Figure: target system; block labels: Processor, ASIC, Memory, DMA, Analog I/O, Interface]

System Architecture: Yesterday
PCB design with add-in boards.
[Figure: board-level system; labels: Processor, Cache, Cache/DRAM Controller, PCI Bus, External Bus (ISA/EISA), Audio, Motion Video, Graphics, SCSI/IDE, I/O, LAN, DRAM, VRAM]

System Architecture: Today
HW/SW Codesign of a SoC
[Figure: single-chip system; labels: Processor Core, DSP Processor Core, Memory, Cache/SRAM, Glue, PCI Interface, EISA Interface, LAN Interface, I/O Interface, SCSI, Encryption/Decryption, Graphics, Video, Motion, VRAM]

HW-centric view of a Platform
• Pre-qualified/verified foundation IP* + reference design
• Scalable bus, test, power, IO, clock, timing architectures
• HW-SW kernel: processor(s), RTOS(es) and SW architecture
• Application space: hardware IP, SW IP, MEM, CPU, FPGA
• Programmable: reconfigurable hardware region (FPGA, LPGA, …)
• Foundry-specific HW qualification
• SW architecture characterisation
*IP can be:
• HW or SW
• hard, soft or 'firm' (HW)
• source or object (SW)
Source: Grant Martin and Henry Chang, "Platform-Based Design: A Tutorial," ISQED 2002, 18 March 2002, San Jose, CA.

SW-Centric View of Platforms
[Figure: layered platform stack; labels: Application Software, Platform API, Software Platform (RTOS, Device Drivers, BIOS, Network Communication), Hardware Platform (Input devices, Output devices, I/O, network)]
Source: Grant Martin and Henry Chang, "Platform-Based Design: A Tutorial," ISQED 2002, 18 March 2002, San Jose, CA.

CMOS VLSI Trends
[Figure: product categories for Yesterday (1980s), Today, and Tomorrow; labels: memory, processors, gate arrays, ASICs, SoC, platform SoC, struc. SoC, custom SoC, reconfigurable, struc. ASIC, reconfigurable (no processor), struc. ASIC (no processor)]

Increasing Customization Cost
Example: design with 80 M transistors in 100 nm technology*
Estimated cost: $85 M - $90 M
Top cost drivers:
• Verification (40%)
• Architecture design (26%)
• Embedded design: 1400 man months (SW), 1150 man months (HW)
• HW/SW integration
12 to 18 months
*Handel H. Jones, "How to Slow the Design Cost Spiral," Electronics Design Chain, September 2002, www.designchain.com

Responses to Increasing Cost
• General purpose ISA: high volumes and reuse
• Abstraction: compilation technologies and high application/development productivity
• Universality
• Custom silicon for embedded platforms in sufficiently high volumes
• Domain-specific ISAs, e.g., DSPs
• Application Specific Standard Products
• Reconfigurable hardware

HW/SW Codesign

HW/SW Codesign: Motivations
Benefit from both HW and SW
HW:
• Parallelism -> better performance, lower power
• Higher implementation cost
SW:
• Sequential implementation -> great for some problems
• Lower implementation cost, but often slower and higher power

Co-Design Methodology
[Figure: co-design flow; labels: Function, Architecture, Mapping, Synthesis, Verification, HW, SW]

HW/SW Codesign Issues
• Task level concurrency management: which tasks in the final system?
• High level transformations: transformations outside the scope of traditional compilers
• Hardware/software partitioning: which operation is mapped to hardware, which to software?
• Compilation: hardware-aware compilation
• Scheduling: performed several times, with varying precision
• Design space exploration: set of possible designs, not just one

Software or hardware?
Decision based on hardware/software partitioning.

Hardware/software codesign
[Figure: a Specification is mapped (Mapping) onto Processor P1, Processor P2, and Hardware]

System Partitioning
[Figure: partitioning flow; labels: Specification, Capture, Model, Partition, Synthesize, Interface, HW, Processor]
Specification:
process (a, b, c) in port a, b; out port c; { read(a); … write(c); }
Model:
Line () { a=… … detach }
Good partitioning mechanism:
1) Minimize communication across the bus
2) Allow parallelism -> both HW & CPU operating concurrently
3) Near peak processor utilization at all times

Determining Communication Level
[Figure: communication layers between the Application Program and the custom Application hardware; labels: Send, Receive, Wait; Operating System; I/O driver; Register reads/writes; Interrupt service; I/O bus; Bus transactions; Interrupts]
• Easier to program at application level (send, receive, wait) but difficult to predict
• More difficult to specify at low level
• Difficult to extract from program but timing and resources easier to predict

Partitioning Costs
Software
• Resources: performance and power consumption
• Lines of code: development and testing cost
• Cost of components
Hardware
• Resources: fixed number of gates, limited memory & I/O
• Difficult to estimate timing for custom hardware
• Recent design shift towards IP: well-defined resource and timing characteristics

Software Cost Analysis Process
[Figure: cost estimation flow; labels: Functional Blocks, Feature Points, Calibration, Language Conversion, Source Lines of Code (SLOC), Equivalent SLOC including reuse, Software development effort, Software maintenance effort, Software schedule, Software Development and Testing Cost]

Hardware Cost Analysis Process
[Figure: cost estimation flow; labels: Gate Count, S/G Ratio, Rent's Rule, I/O Count, I/O Format, Feature Size, Interconnect Length, Core Area, Die Area, Number Up, Wafer Characteristics, Die Yield, Die Cost, Wafer Fabrication and Sawing Cost, Single-Chip Package Cost, Chip Hardware Cost, Tooling Cost, Test Development Cost, Productivity, reuse, Design Cost]
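
The hardware cost flow names Rent's Rule, die area, number up (dies per wafer), die yield, and die cost, but the slide does not give the formulas it uses. The sketch below fills in common textbook forms as an assumption: Rent's Rule as T = t * G^p, a standard dies-per-wafer approximation, and a negative-binomial yield model. The function names and the default values of `t`, `p`, and `alpha` are illustrative, not taken from the lecture.

```python
import math


def rent_io_count(gates, t=2.5, p=0.6):
    """Rent's Rule estimate of I/O terminals: T = t * G**p (t, p assumed)."""
    return t * gates ** p


def dies_per_wafer(wafer_diameter_mm, die_area_mm2):
    """Textbook approximation of whole dies on a round wafer ("number up")."""
    radius = wafer_diameter_mm / 2
    gross = math.pi * radius ** 2 / die_area_mm2
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)
    return int(gross - edge_loss)


def die_yield(die_area_mm2, defects_per_mm2, alpha=3.0):
    """Negative-binomial yield model (alpha is an assumed clustering factor)."""
    return (1 + defects_per_mm2 * die_area_mm2 / alpha) ** (-alpha)


def die_cost(wafer_cost, wafer_diameter_mm, die_area_mm2, defects_per_mm2):
    """Wafer cost spread over the expected number of good dies."""
    good_dies = dies_per_wafer(wafer_diameter_mm, die_area_mm2) * die_yield(
        die_area_mm2, defects_per_mm2
    )
    return wafer_cost / good_dies
```

Larger dies lose twice here: fewer fit on a wafer and a smaller fraction of them work, which is why die area sits upstream of both yield and cost in the flow.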

HW & SW Foundries
HW1: LSI Logic ASIC Wafer Foundry Data
• 0.18 µm feature size, 8 inch wafers, 6 layers
• TSMC 018 Wafer Processing
HW2: Samsung Semiconductor ASIC Wafer Foundry Data
• 0.35 µm feature size, 6 inch wafers, 4 layers
• TSMC 035 Wafer Processing
SW1: Nominal to High development effort
SW2: Low to Nominal development effort

MIXED Implementation Using HW1 and SW1
[Figure: stacked bar chart. Y axis: Percent of Total Cost (0% to 100%). X axis: Production Quantity and Level of Reuse (1000, 10000, and 100000 units, each with no, 20%, and 40% reuse). Cost categories: Software development, Testing, Packaging, Fabrication, Tooling, Design, Recurring.]
Reuse of:
• Gate-level IP
• Code

Total Cost Per Chip
[Figure: line chart for 10,000 units. Y axis: Total Cost ($/chip), 0 to 45. X axis: Percent Custom Hardware, 0 to 100. Series: HW1/SW1, HW1/SW2, HW2/SW1, HW2/SW2.]

Partitioning Analysis
• Result of compilation is synthesizable HDL and assembly code for the processor
• Compiler & profiler determine dependence and rough performance estimates

Hardware/Software Partitioning
Simple architectural model: CPU + 1 or more ASICs on a bus
[Figure: Processor, memory, and ASICs connected by a bus]
Properties of classic partitioning algorithms:
• Single rate
• Single-thread: CPU waits for ASIC
• Type of CPU is known; ASIC is synthesized

HW/SW Partitioning Styles
HW first approach
• start with an all-ASIC solution which satisfies constraints
• migrate functions to software to reduce cost
SW first approach
• start with an all-software solution which does not satisfy constraints
• migrate functions to hardware to meet constraints
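
A minimal sketch of the SW first style, under assumptions that are not in the slides: tasks run one after another (the single-thread model where the CPU waits for the ASIC), each task has a known software time, hardware time, and hardware cost, and the function to migrate next is picked greedily by time saved per unit of hardware cost. The name `sw_first_partition` and the task data are hypothetical.

```python
def sw_first_partition(tasks, deadline):
    """Greedy SW-first partitioning.

    tasks maps a task name to (sw_time, hw_time, hw_cost).
    Returns the set of tasks moved to hardware, or None if even the
    all-hardware solution misses the deadline.
    """
    in_hw = set()

    def total_time():
        # Serial execution: sum the time of each task on its current target.
        return sum(hw if name in in_hw else sw
                   for name, (sw, hw, _cost) in tasks.items())

    while total_time() > deadline:
        candidates = [name for name in tasks if name not in in_hw]
        if not candidates:
            return None  # nothing left to migrate
        # Heuristic (an assumption): best time saving per unit hardware cost.
        best = max(candidates,
                   key=lambda n: (tasks[n][0] - tasks[n][1]) / tasks[n][2])
        in_hw.add(best)
    return in_hw


example_tasks = {
    "fir": (100, 20, 20),
    "dct": (100, 20, 25),
    "ctl": (10, 12, 30),
}
```

With `example_tasks` and a deadline of 150, only `fir` is migrated. A HW first variant would start with every task in `in_hw` and move tasks back to software while the deadline still holds.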

Partitioning - ILP
Ingredients:
• Cost function
• Constraints
Cost function, involving linear expressions of integer variables from a set X:
$$C = \sum_{x_i \in X} a_i x_i \quad \text{with } a_i \in \mathbb{R},\ x_i \in \mathbb{N} \qquad (1)$$
Constraints:
$$\forall j \in J:\ \sum_{x_i \in X} b_{i,j}\, x_i \ge c_j \quad \text{with } b_{i,j},\ c_j \in \mathbb{R} \qquad (2)$$
Def.: The problem of minimizing (1) subject to the constraints (2) is called an integer programming (IP) problem. If all $x_i$ are constrained to be either 0 or 1, the IP problem is said to be a 0/1 integer programming problem.
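
To make the definition concrete, the sketch below solves a tiny 0/1 integer program by exhaustive enumeration. It is only an illustration: the function name `solve_01_ip` and the numbers in the example are assumptions, and enumeration is exponential in the number of variables, so it is not how a real solver works.

```python
from itertools import product


def solve_01_ip(a, b, c):
    """Minimize sum(a[i] * x[i]) over x in {0,1}^n
    subject to sum(b[j][i] * x[i]) >= c[j] for every constraint j.

    Returns (best_cost, best_x), or (None, None) if infeasible.
    """
    n = len(a)
    best_cost, best_x = None, None
    for x in product((0, 1), repeat=n):
        feasible = all(
            sum(row[i] * x[i] for i in range(n)) >= bound
            for row, bound in zip(b, c)
        )
        if not feasible:
            continue
        cost = sum(ai * xi for ai, xi in zip(a, x))
        if best_cost is None or cost < best_cost:
            best_cost, best_x = cost, x
    return best_cost, best_x


# Illustrative instance: minimize 5*x1 + 6*x2 + 4*x3 with x1 + x2 + x3 >= 2.
result = solve_01_ip([5, 6, 4], [[1, 1, 1]], [2])
```

For this instance the cheapest feasible choice sets x1 = x3 = 1, giving C = 9.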

FAQ on integer programming
• Maximizing the cost function can be done by setting C' = -C and minimizing C'.
• Integer programming is NP-complete.
• Running times can increase exponentially with problem size.
• Commercial solvers can solve for thousands of variables.
• IP models are a good starting point for modelling even if in the end heuristics have to be used to solve them.

IP model for HW/SW partitioning
Notation:
• Index set I denotes task graph nodes.
• Index set L denotes task graph node types, e.g. square root, DCT or FFT.
• Index set KH denotes hardware component types, e.g. hardware components for the DCT or the FFT.
• Index set J denotes hardware component instances.
• Index set KP denotes processors. All processors are assumed to be of the same type.
• T is a mapping from task graph nodes to their types, T: I → L.
Therefore:
• $X_{i,k} = 1$ if node $v_i$ is mapped to HW component type $k \in KH$
• $Y_{i,k} = 1$ if node $v_i$ is mapped to processor $k \in KP$
• $NY_{\ell,k} = 1$ if at least one node of type $\ell$ is mapped to processor $k \in KP$

Constraints
Operation assignment constraints:
$$\forall i \in I:\ \sum_{k \in KH} X_{i,k} + \sum_{k \in KP} Y_{i,k} = 1$$
All task graph nodes have to be mapped either in software or in hardware. Variables are assumed to be integers. Additional constraints to guarantee they are either 0 or 1:
$$\forall i \in I:\ \forall k \in KH:\ X_{i,k} \le 1$$
$$\forall i \in I:\ \forall k \in KP:\ Y_{i,k} \le 1$$

Operation assignment constraints
$$\forall \ell \in L,\ \forall i: T(v_i) = c_\ell,\ \forall k \in KP:\ NY_{\ell,k} \ge Y_{i,k}$$
For all types $\ell$ of operations and for all nodes i of this type: if i is mapped to some processor k, then that processor must implement the functionality of $\ell$.
Decision variables must also be 0/1 variables:
$$\forall \ell \in L,\ \forall k \in KP:\ NY_{\ell,k} \le 1$$

Resource & design constraints
• ∀ k ∈ KH, the cost for components of that type should not exceed its maximum.
• ∀ k ∈ KP, the cost for the associated data storage area should not exceed its maximum.
• ∀ k ∈ KP, the cost for storing instructions should not exceed its maximum.
• The total cost (Σ over k ∈ KH) of HW components should not exceed its maximum.
• The total cost of data memories (Σ over k ∈ KP) should not exceed its maximum.
• The total cost of instruction memories (Σ over k ∈ KP) should not exceed its maximum.

Scheduling
[Figure: task graph with nodes v1 to v11 and edges e3, e4, mapped onto Processor p1, ASIC h1 (FIR1, FIR2), and Communication channel c1. Timelines over t for FIR2 on h1, p1, and c1 show the alternative orderings: v3 then v4 or v4 then v3; v7 then v8 or v8 then v7; e3 then e4 or e4 then e3.]

Scheduling / precedence constraints
• For all nodes $v_{i1}$ and $v_{i2}$ that are potentially mapped to the same processor or hardware component instance, introduce a binary decision variable $b_{i1,i2}$ with $b_{i1,i2} = 1$ if $v_{i1}$ is executed before $v_{i2}$ and $b_{i1,i2} = 0$ otherwise. Define constraints of the type
  (end time of $v_{i1}$) ≤ (start time of $v_{i2}$) if $b_{i1,i2} = 1$ and
  (end time of $v_{i2}$) ≤ (start time of $v_{i1}$) if $b_{i1,i2} = 0$
• Ensure that the schedule for executing operations is consistent with the precedence constraints in the task graph.
• Timing constraints need to be met.
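
The conditional constraints above are not linear as written. One common way to linearize them, shown here as an assumption rather than as the formulation used in the lecture, is the big-M technique. The symbols $s_i$ (start time of $v_i$), $d_i$ (execution time of $v_i$), and $M$ (a constant larger than any feasible schedule length) are introduced for illustration only.

```latex
% Big-M linearization of the either/or ordering constraint (assumed technique)
\begin{align}
  s_{i1} + d_{i1} &\le s_{i2} + M \,(1 - b_{i1,i2}) \\
  s_{i2} + d_{i2} &\le s_{i1} + M \, b_{i1,i2}
\end{align}
```

With $b_{i1,i2} = 1$ the first inequality enforces that $v_{i1}$ ends before $v_{i2}$ starts, and the second is relaxed by $M$. With $b_{i1,i2} = 0$ the roles swap.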

Example
HW types H1, H2 and H3 with costs of 20, 25, and 30. Processors of type P. Tasks T1 to T5.
Execution times:
T    H1   H2   H3   P
1    20             100
2         20        100
3              12   10
4              12   10
5    20             100

Operation assignment constraint
$$\forall i \in I:\ \sum_{k \in KH} X_{i,k} + \sum_{k \in KP} Y_{i,k} = 1$$
$X_{1,1} + Y_{1,1} = 1$ (task 1 mapped to H1 or to P)
$X_{2,2} + Y_{2,1} = 1$
$X_{3,3} + Y_{3,1} = 1$
$X_{4,3} + Y_{4,1} = 1$
$X_{5,1} + Y_{5,1} = 1$
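
The five instantiated constraints follow mechanically from the execution-time table: each task gets one X variable per hardware type that can run it and one Y variable for the single processor. A small sketch of that derivation, where the dictionary layout and the name `assignment_constraints` are illustrative assumptions:

```python
# Execution times from the example; a missing entry means the task
# cannot run on that resource.
EXEC_TIMES = {
    1: {"H1": 20, "P": 100},
    2: {"H2": 20, "P": 100},
    3: {"H3": 12, "P": 10},
    4: {"H3": 12, "P": 10},
    5: {"H1": 20, "P": 100},
}
HW_INDEX = {"H1": 1, "H2": 2, "H3": 3}  # index k of each hardware type in KH


def assignment_constraints(table):
    """Return one 'sum of X + sum of Y = 1' constraint string per task."""
    constraints = []
    for task, times in sorted(table.items()):
        terms = [f"X{task},{HW_INDEX[r]}" for r in times if r in HW_INDEX]
        # Single processor assumed, so its index k is always 1.
        terms += [f"Y{task},1" for r in times if r == "P"]
        constraints.append(" + ".join(terms) + " = 1")
    return constraints


constraints = assignment_constraints(EXEC_TIMES)
```

Printing `constraints` lists the same five equations as above, one per task.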

Operation assignment constraint
Assume the types of tasks T1 to T5 are $\ell$ = 1, 2, 3, 3, and 1.
$$\forall \ell \in L,\ \forall i: T(v_i) = c_\ell,\ \forall k \in KP:\ NY_{\ell,k} \ge Y_{i,k}$$
Functionality 3 has to be implemented on the processor if node 4 is mapped to it.
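
Written out for the example, assuming a single processor with index k = 1 and the task types 1, 2, 3, 3, 1 given above, the general constraint expands to one inequality per task:

```latex
\begin{align}
  NY_{1,1} &\ge Y_{1,1} \\
  NY_{2,1} &\ge Y_{2,1} \\
  NY_{3,1} &\ge Y_{3,1} \\
  NY_{3,1} &\ge Y_{4,1} \\
  NY_{1,1} &\ge Y_{5,1}
\end{align}
```

The fourth line is the case named on the slide: if node 4 is mapped to the processor ($Y_{4,1} = 1$), then $NY_{3,1}$ is forced to 1.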

Other equations
Time constraint: application-specific hardware is required for time constraints under 100 time units.
Cost function:
C = 20 #(H1) + 25 #(H2) + 30 #(H3) + cost(processor) + cost(memory)
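
The example is small enough to explore by enumeration. The sketch below is a simplified stand-in for the IP model, not the lecture's formulation, and it rests on several assumptions: at most one instance of each hardware type, one processor, no precedence edges between tasks, tasks on the same resource run back to back while different resources run in parallel, and cost(memory) is taken as 0. The value passed as `processor_cost` is hypothetical, since the slides do not state cost(P).

```python
from itertools import product

HW_COST = {"H1": 20, "H2": 25, "H3": 30}
EXEC_TIMES = {
    "T1": {"H1": 20, "P": 100},
    "T2": {"H2": 20, "P": 100},
    "T3": {"H3": 12, "P": 10},
    "T4": {"H3": 12, "P": 10},
    "T5": {"H1": 20, "P": 100},
}


def best_partition(deadline, processor_cost):
    """Enumerate every task-to-resource mapping and keep the cheapest
    one whose busiest resource finishes within the deadline."""
    tasks = list(EXEC_TIMES)
    best = None
    for choice in product(*(EXEC_TIMES[t] for t in tasks)):
        busy = {}  # total execution time accumulated on each resource
        for task, resource in zip(tasks, choice):
            busy[resource] = busy.get(resource, 0) + EXEC_TIMES[task][resource]
        if max(busy.values()) > deadline:
            continue
        # C = sum of used hardware types + processor cost if it is used.
        cost = sum(HW_COST[r] for r in busy if r in HW_COST)
        if "P" in busy:
            cost += processor_cost
        if best is None or cost < best[0]:
            best = (cost, dict(zip(tasks, choice)))
    return best


cheap_cpu = best_partition(deadline=100, processor_cost=5)
costly_cpu = best_partition(deadline=100, processor_cost=40)
```

Under these assumptions, a processor cost of 5 yields T3 and T4 in software, T1 and T5 on H1, and T2 on H2 for a total of 50. With a processor cost of 40, the all-hardware mapping at cost 75 is cheaper.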

Result
For a time constraint of 100 time units and cost(P)