HW/SW Codesign - Computer Science and Engineering

CSE 237A Hardware/Software Codesign Tajana Simunic Rosing Department of Computer Science and Engineering University of California, San Diego.

ES Design

[Figure: embedded system design flow. Labels recoverable from the slide: Hardware components, Hardware, Verification and Validation.]

TSR

ES Application Classes

Class: Data flow
  Application: laser printers, X-terminals, routers, bridges, image processing
  Processor: R4600, I960, 29k, Coldfire, PPC (403, 605)
  Requirements: Processes data and passes it on. High memory bw, high throughput.

Class: Interactive
  Application: set-top boxes, video games, PDAs, portable video & info appliances
  Processor: R3900, R4100/4300/4600, ARM 6xx/7xx, V851, SH 1/2/3
  Requirements: Interactive, low cost, low power, high throughput.

Class: Classic embedded
  Application: portable controllers, disk controllers, automotive, industrial control
  Processor: Piranha, ARM, MIPS, Cores
  Requirements: mix of CPU power, low cost, low power, peripherals

Time-constrained computing systems.

System Design Problem Areas

1. Design environment, co-simulation, constraint analysis.
2. HDL modeling, architectural synthesis, logic synthesis, physical synthesis.
3. Software synthesis, optimization, retargetable code generation, debugging & programming environment.
4. Test issues.

[Figure: target system block diagram. Labels: Processor, ASIC, Memory, DMA, Analog I/O, Interface.]

System Architecture: Yesterday

[Figure: PCB design with add-in boards. Labels: Processor, Cache, Cache/DRAM Controller, Audio, Motion Video, Graphics, VRAM, DRAM, SCSI/IDE, I/O, LAN, PCI Bus, External Bus, ISA/EISA.]

System Architecture: Today

HW/SW Codesign of a SoC

[Figure: single-chip system. Labels: Processor Core, DSP Processor Core, Memory, Cache/SRAM, Graphics, Video, Motion, VRAM, PCI Interface, EISA Interface, LAN Interface, I/O Interface, SCSI, Encryption/Decryption, Glue.]

HW-centric view of a Platform

[Figure: platform built from a HW-SW kernel plus a reference design. Labels: Pre-Qualified/Verified Foundation-IP*; scaleable bus, test, power, IO, clock, timing architectures; Processor(s), RTOS(es) and SW architecture; MEM, CPU, FPGA; Hardware IP, SW IP; Application Space; Programmable; Reconfigurable Hardware Region (FPGA, LPGA, …); Foundry-Specific HW Qualification; SW architecture characterisation.]

IP can be:
• HW or SW
• hard, soft or ‘firm’ (HW)
• source or object (SW)

Source: Grant Martin and Henry Chang, “Platform-Based Design: A Tutorial,” ISQED 2002, 18 March 2002, San Jose, CA.

SW-Centric View of Platforms

[Figure: layered platform stack. Labels: Application Software, Platform API, Software Platform, RTOS, Device Drivers, BIOS, Network Communication, Hardware Platform, Input devices, Output devices, I/O, network.]

Source: Grant Martin and Henry Chang, “Platform-Based Design: A Tutorial,” ISQED 2002, 18 March 2002, San Jose, CA.

CMOS VLSI Trends

[Figure: evolution of chip categories over three columns: Yesterday (1980s), Today, Tomorrow. Labels: memory, processors, gate arrays, ASICs, platform SoC, structured SoC, custom SoC, reconfigurable, reconfigurable (no processor), structured ASIC, structured ASIC (no processor).]

Increasing Customization Cost

Estimated cost: $85 M to $90 M. Example: design with 80 M transistors in 100 nm technology.*

• Top cost drivers
  - Verification (40%)
  - Architecture design (26%)
  - Embedded design: 1400 man months (SW), 1150 man months (HW)
• HW/SW integration: 12-18 months

*Handel H. Jones, “How to Slow the Design Cost Spiral,” Electronics Design Chain, September 2002, www.designchain.com

Responses to Increasing Cost

• General purpose ISA
  - Universality: high volumes and reuse
  - Abstraction: compilation technologies and high application/development productivity
• Custom silicon for embedded platforms in sufficiently high volumes
  - Domain specific ISAs, e.g., DSPs
  - Application Specific Standard Products
  - Reconfigurable hardware
  - HW/SW Codesign

HW/SW Codesign: Motivations

• Benefit from both HW and SW
• HW:
  - Parallelism -> better performance, lower power
  - Higher implementation cost
• SW:
  - Sequential implementation -> great for some problems
  - Lower implementation cost, but often slower and higher power

Co-Design Methodology

[Figure: co-design flow. Labels: Function, Architecture, Mapping, Synthesis, Verification, HW, SW.]

HW/SW Codesign Issues

• Task level concurrency management: Which tasks in the final system?
• High level transformations: Transformation outside the scope of traditional compilers
• Hardware/software partitioning: Which operation mapped to hardware, which to software?
• Compilation: Hardware-aware compilation
• Scheduling: Performed several times, with varying precision
• Design space exploration: Set of possible designs, not just one.

Software or hardware?

Decision based on hardware/software partitioning.

Hardware/software codesign

[Figure: a Specification is mapped onto Processor P1, Processor P2 and Hardware.]

System Partitioning

[Figure: partitioning flow between a processor and custom hardware. Labels: Specification, Capture, Partition, Model, Synthesize, Processor, Interface, HW.]

Code fragments shown on the slide:

process (a, b, c) in port a, b; out port c; { read(a); … write(c); }

Line () { a=… … detach }

Good partitioning mechanism:
1) Minimize communication across bus
2) Allows parallelism -> both HW & CPU operating concurrently
3) Near peak processor utilization at all times

Determining Communication Level

[Figure: communication between an application program and custom application hardware at several levels. Labels: Application Program; Send, Receive, Wait; Operating System; Register reads/writes; I/O driver; Interrupt service; Interrupts; Bus transactions; I/O bus; Application hardware (custom).]

• Easier to program at application level (send, receive, wait) but difficult to predict
• More difficult to specify at low level: difficult to extract from program, but timing and resources easier to predict

Partitioning Costs

• Software resources
  - Performance and power consumption
  - Lines of code – development and testing cost
  - Cost of components
• Hardware resources
  - Fixed number of gates, limited memory & I/O
  - Difficult to estimate timing for custom hardware
  - Recent design shift towards IP: well-defined resource and timing characteristics

Software Cost Analysis Process

[Figure: cost estimation flow. Labels: Functional Blocks, Feature Points, Calibration, Language Conversion, Source Lines of Code (SLOC), Equivalent SLOC including reuse, Software development effort, Software maintenance effort, Software schedule, Software Development and Testing Cost.]

Hardware Cost Analysis Process

[Figure: cost estimation flow. Labels: Gate Count, S/G Ratio, Rent’s Rule, I/O Count, I/O Format, Feature Size, Interconnect Length, Core Area, Die Area, Number Up, Die Yield, Die Cost, Wafer Characteristics, Wafer Fabrication and Sawing Cost, Single-Chip Package Cost, Chip Hardware Cost, Tooling Cost, Test Development Cost, Productivity, reuse, Design Cost.]

HW & SW Foundries

• HW1: LSI Logic ASIC wafer foundry data
  - 0.18 µm feature size
  - 8 inch wafers
  - 6 layers
  - TSMC 018 wafer processing
• HW2: Samsung Semiconductor ASIC wafer foundry data
  - 0.35 µm feature size
  - 6 inch wafers
  - 4 layers
  - TSMC 035 wafer processing
• SW1: Nominal to High development effort
• SW2: Low to Nominal development effort

MIXED Implementation Using HW1 and SW1

[Figure: stacked bar chart. y-axis: Percent of Total Cost (0% to 100%). x-axis: Production Quantity and Level of Reuse (1000, 10000 and 100000 units, each with no, 20% and 40% reuse). Legend labels: Software development, Testing, Packaging, Fabrication, Tooling, Design, Recurring. Reuse of: gate-level IP, code.]

Total Cost Per Chip

[Figure: line plot for 10,000 units. y-axis: Total Cost ($/chip), 0 to 45. x-axis: Percent Custom Hardware, 0 to 100. Curves: HW1/SW1, HW1/SW2, HW2/SW1, HW2/SW2.]

Partitioning Analysis

• Result of compilation is synthesizable HDL and assembly code for the processor
• Compiler & profiler determine dependence and rough performance estimates

Hardware/Software Partitioning

Simple architectural model: CPU + 1 or more ASICs on a bus

[Figure: memory, processor and ASICs attached to a common bus.]

Properties of classic partitioning algorithms:
• Single rate; single-thread: CPU waits for ASIC
• Type of CPU is known; ASIC is synthesized

HW/SW Partitioning Styles

HW first approach
• start with all-ASIC solution which satisfies constraints
• migrate functions to software to reduce cost

SW first approach
• start with all-software solution which does not satisfy constraints
• migrate functions to hardware to meet constraints
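The SW first approach can be sketched as a greedy loop. This is only a minimal illustration, not the algorithm of any particular tool: the function name `sw_first_partition`, the task dictionary layout, the "time saved per unit of hardware cost" ranking and the sample numbers are all assumptions. The timing model is the single-thread one, where the CPU waits for the ASIC, so execution times simply add up.

```python
def sw_first_partition(tasks, deadline):
    """Greedy SW-first partitioning sketch (hypothetical helper).

    tasks: {name: {"sw": software time, "hw": hardware time,
                   "cost": hardware cost (assumed > 0)}}
    Returns (tasks moved to hardware, total time, hardware cost),
    or None if the deadline cannot be met.
    """
    in_hw = set()
    # Start with the all-software solution.
    total = sum(t["sw"] for t in tasks.values())
    while total > deadline:
        # Software tasks that get faster in hardware, ranked by
        # time saved per unit of hardware cost (assumed heuristic).
        candidates = [
            ((t["sw"] - t["hw"]) / t["cost"], name)
            for name, t in tasks.items()
            if name not in in_hw and t["sw"] > t["hw"]
        ]
        if not candidates:
            return None  # nothing left that is worth migrating
        _, name = max(candidates)
        in_hw.add(name)
        total -= tasks[name]["sw"] - tasks[name]["hw"]
    hw_cost = sum(tasks[n]["cost"] for n in in_hw)
    return in_hw, total, hw_cost


# Made-up sample data, for illustration only.
tasks = {
    "A": {"sw": 100, "hw": 20, "cost": 20},
    "B": {"sw": 100, "hw": 20, "cost": 25},
    "C": {"sw": 10, "hw": 12, "cost": 30},
}
result = sw_first_partition(tasks, deadline=150)
```

A HW first variant would start with every task in hardware and move tasks back to software while the deadline still holds. A greedy loop like this does not guarantee a minimum-cost partition.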

Partitioning - ILP

Ingredients:
• Cost function
• Constraints

Cost function, involving linear expressions of integer variables from a set $X$:

$$C = \sum_{x_i \in X} a_i x_i \quad \text{with } a_i \in \mathbb{R},\ x_i \in \mathbb{N} \qquad (1)$$

Constraints:

$$\forall j \in J: \sum_{x_i \in X} b_{i,j}\, x_i \ge c_j \quad \text{with } b_{i,j}, c_j \in \mathbb{R} \qquad (2)$$

Def.: The problem of minimizing (1) subject to the constraints (2) is called an integer programming (IP) problem. If all $x_i$ are constrained to be either 0 or 1, the IP problem is said to be a 0/1 integer programming problem.
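To make the definition concrete, a 0/1 IP problem of this form can be solved for tiny instances by plain enumeration. The sketch below is an illustration only: the function name `solve_01_ip` and the sample coefficients are assumptions. Real solvers use far better methods than trying all $2^n$ assignments.

```python
from itertools import product


def solve_01_ip(a, b, c):
    """Minimize sum(a[i] * x[i]) subject to
    sum(b[j][i] * x[i]) >= c[j] for every constraint j, with x[i] in {0, 1}.

    Note: b is stored row-per-constraint here, so b[j][i] plays the
    role of b_{i,j} in equation (2).
    Returns (best cost, best x), or (None, None) if infeasible.
    """
    best_cost, best_x = None, None
    for x in product((0, 1), repeat=len(a)):
        feasible = all(
            sum(b_ji * x_i for b_ji, x_i in zip(row, x)) >= c_j
            for row, c_j in zip(b, c)
        )
        if not feasible:
            continue
        cost = sum(a_i * x_i for a_i, x_i in zip(a, x))
        if best_cost is None or cost < best_cost:
            best_cost, best_x = cost, x
    return best_cost, best_x


# Hypothetical instance: minimize 3*x0 + 2*x1 + 4*x2
# subject to x0 + x1 >= 1 and x1 + x2 >= 1.
a = [3, 2, 4]
b = [[1, 1, 0], [0, 1, 1]]
c = [1, 1]
best_cost, best_x = solve_01_ip(a, b, c)
```

For this instance, choosing only `x1 = 1` satisfies both constraints at cost 2.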

FAQ on integer programming

• Maximizing the cost function can be done by setting C' = -C.
• Integer programming is NP-complete.
• In the worst case, running times increase exponentially with problem size.
• Commercial solvers can solve for thousands of variables.
• IP models are a good starting point for modelling, even if in the end heuristics have to be used to solve them.

IP model for HW/SW partitioning

Notation:
• Index set $I$ denotes task graph nodes.
• Index set $L$ denotes task graph node types, e.g. square root, DCT or FFT.
• Index set $KH$ denotes hardware component types, e.g. hardware components for the DCT or the FFT.
• Index set $J$ denotes hardware component instances.
• Index set $KP$ denotes processors. All processors are assumed to be of the same type.
• $T$ is a mapping from task graph nodes to their types, $T: I \to L$.

Therefore:
• $X_{i,k} = 1$ if node $v_i$ is mapped to HW component type $k \in KH$
• $Y_{i,k} = 1$ if node $v_i$ is mapped to processor $k \in KP$
• $NY_{\ell,k} = 1$ if at least one node of type $\ell$ is mapped to processor $k \in KP$

Constraints

Operation assignment constraints:

$$\forall i \in I: \sum_{k \in KH} X_{i,k} + \sum_{k \in KP} Y_{i,k} = 1$$

All task graph nodes have to be mapped either in software or in hardware. Variables are assumed to be integers. Additional constraints guarantee they are either 0 or 1:

$$\forall i \in I,\ \forall k \in KH: X_{i,k} \le 1$$

$$\forall i \in I,\ \forall k \in KP: Y_{i,k} \le 1$$

Operation assignment constraints

$$\forall \ell \in L,\ \forall i: T(v_i) = c_\ell,\ \forall k \in KP: NY_{\ell,k} \ge Y_{i,k}$$

For all types $\ell$ of operations and for all nodes $i$ of this type: if $i$ is mapped to some processor $k$, then that processor must implement the functionality of $\ell$.

Decision variables must also be 0/1 variables:

$$\forall \ell \in L,\ \forall k \in KP: NY_{\ell,k} \le 1$$
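The operation assignment constraints can be checked mechanically for a candidate mapping. The following is a minimal sketch with assumed names (`assignment_ok`, `required_functionality`) and an assumed dictionary encoding of $X$, $Y$ and $NY$ keyed by index pairs. The sample mapping is hypothetical.

```python
def assignment_ok(nodes, X, Y):
    """Check sum_k X[i,k] + sum_k Y[i,k] == 1 for every node i."""
    for i in nodes:
        mapped = sum(v for (n, _), v in X.items() if n == i)
        mapped += sum(v for (n, _), v in Y.items() if n == i)
        if mapped != 1:
            return False
    return True


def required_functionality(node_type, Y):
    """Smallest 0/1 values NY[l,k] satisfying NY[l,k] >= Y[i,k]
    for every node i of type l = node_type[i]."""
    NY = {}
    for (i, k), mapped in Y.items():
        key = (node_type[i], k)
        NY[key] = max(NY.get(key, 0), mapped)
    return NY


# Hypothetical mapping: five nodes, one processor (k = 1),
# node types 1, 2, 3, 3, 1; nodes 3 and 4 run in software.
node_type = {1: 1, 2: 2, 3: 3, 4: 3, 5: 1}
X = {(1, 1): 1, (2, 2): 1, (3, 3): 0, (4, 3): 0, (5, 1): 1}
Y = {(1, 1): 0, (2, 1): 0, (3, 1): 1, (4, 1): 1, (5, 1): 0}

ok = assignment_ok(node_type.keys(), X, Y)
NY = required_functionality(node_type, Y)
```

Here only `NY[(3, 1)]` ends up as 1, because only nodes of type 3 are mapped to the processor.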

Resource & design constraints

• ∀ k ∈ KH, the cost for components of that type should not exceed its maximum.
• ∀ k ∈ KP, the cost for the associated data storage area should not exceed its maximum.
• ∀ k ∈ KP, the cost for storing instructions should not exceed its maximum.
• The total cost (Σ k ∈ KH) of HW components should not exceed its maximum.
• The total cost of data memories (Σ k ∈ KP) should not exceed its maximum.
• The total cost of instruction memories (Σ k ∈ KP) should not exceed its maximum.

Scheduling

[Figure: task graph with nodes v1 to v11 and edges including e3 and e4, mapped onto processor p1, ASIC h1 (FIR1, FIR2) and communication channel c1. Timelines over t for FIR2 on h1, p1 and c1 show the alternative execution orders: ... v3 ... v4 or ... v4 ... v3; ... v7 ... v8 or ... v8 ... v7; ... e3 ... e4 or ... e4 ... e3.]

Scheduling / precedence constraints

• For all nodes $v_{i_1}$ and $v_{i_2}$ that are potentially mapped to the same processor or hardware component instance, introduce a binary decision variable $b_{i_1,i_2}$ with $b_{i_1,i_2} = 1$ if $v_{i_1}$ is executed before $v_{i_2}$ and $b_{i_1,i_2} = 0$ otherwise. Define constraints of the type
  (end time of $v_{i_1}$) ≤ (start time of $v_{i_2}$) if $b_{i_1,i_2} = 1$, and
  (end time of $v_{i_2}$) ≤ (start time of $v_{i_1}$) if $b_{i_1,i_2} = 0$.
• Ensure that the schedule for executing operations is consistent with the precedence constraints in the task graph.
• Timing constraints need to be met.
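The conditional constraints above are not linear as written. One common way to express them in an IP model is a "big-M" formulation. The slides do not state which linearization is used, so this is only a sketch under assumed notation: $s_i$ is the start time of $v_i$, $d_i$ its execution time, and $M$ a constant larger than any feasible schedule length.

```latex
\begin{align}
s_{i_1} + d_{i_1} &\le s_{i_2} + M\,(1 - b_{i_1,i_2}) \\
s_{i_2} + d_{i_2} &\le s_{i_1} + M\, b_{i_1,i_2}
\end{align}
```

With $b_{i_1,i_2} = 1$ the first inequality enforces "$v_{i_1}$ ends before $v_{i_2}$ starts" and the second holds trivially because of $M$. With $b_{i_1,i_2} = 0$ the roles are swapped.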

Example

• HW types H1, H2 and H3 with costs of 20, 25, and 30.
• Processors of type P.
• Tasks T1 to T5.
• Execution times:

T    H1    H2    H3    P
1    20    -     -     100
2    -     20    -     100
3    -     -     12    10
4    -     -     12    10
5    20    -     -     100

A dash marks a cell that is empty on the slide.

Operation assignment constraint

$$\forall i \in I: \sum_{k \in KH} X_{i,k} + \sum_{k \in KP} Y_{i,k} = 1$$

For the example (execution times as in the table of the Example slide):

X1,1 + Y1,1 = 1 (task 1 mapped to H1 or to P)
X2,2 + Y2,1 = 1
X3,3 + Y3,1 = 1
X4,3 + Y4,1 = 1
X5,1 + Y5,1 = 1

Operation assignment constraint Assume

∀∀ ℓ

types of tasks are ℓ =1, 2, 3, 3, and 1.

∈L, ∀ i:T(vi)=c ℓ, ∀ k ∈ KP: NY ℓ,k ≥ Yi,k Functionality 3 to be implemented on processor if node 4 is mapped to it.

TSR

38

Other equations Time

constraint: Application specific hardware required for time constraints under 100 time units. T 1 2 3 4 5

H1 20

H2

H3

20 12 12 20

P 100 100 10 10 100

Cost function: C=20 #(H1) + 25 #(H2) + 30 # (H3) + cost(processor) + cost(memory) TSR

39
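The example is small enough to explore by enumeration. The sketch below uses the execution times and hardware costs from the example, but the rest is assumed and is not taken from the slides. It assumes one instance per component type, tasks on the same component running back to back, different components running fully in parallel, no precedence or communication delays, zero memory cost, and a made-up processor cost `COST_P = 5`.

```python
from itertools import product

# (hardware type, hardware time) per task, from the example table.
HW_TIME = {1: ("H1", 20), 2: ("H2", 20), 3: ("H3", 12),
           4: ("H3", 12), 5: ("H1", 20)}
SW_TIME = {1: 100, 2: 100, 3: 10, 4: 10, 5: 100}
HW_COST = {"H1": 20, "H2": 25, "H3": 30}
COST_P = 5        # assumed processor cost, not given in the slides
DEADLINE = 100    # time constraint from the example


def evaluate(mapping):
    """Return (cost, makespan) for mapping[task] = 'P' or a HW type."""
    busy = {}  # accumulated execution time per used component
    for task, comp in mapping.items():
        t = SW_TIME[task] if comp == "P" else HW_TIME[task][1]
        busy[comp] = busy.get(comp, 0) + t
    # One instance per used component type (assumption).
    cost = sum(COST_P if comp == "P" else HW_COST[comp] for comp in busy)
    return cost, max(busy.values())


def best_partition():
    """Enumerate all 2^5 mappings and keep the cheapest feasible one."""
    tasks = sorted(SW_TIME)
    best = None
    for in_hw in product((False, True), repeat=len(tasks)):
        mapping = {t: (HW_TIME[t][0] if hw else "P")
                   for t, hw in zip(tasks, in_hw)}
        cost, makespan = evaluate(mapping)
        if makespan <= DEADLINE and (best is None or cost < best[0]):
            best = (cost, makespan, mapping)
    return best


best = best_partition()
```

Under these assumptions the search returns cost 50, with T1 and T5 on H1, T2 on H2, and T3 and T4 on the processor. A different processor cost or timing model can change the outcome.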

Result For T 1 2 3 4 5

a time constraint of 100 time units and cost(P)