Design Automation for Streaming Systems
Eylon Caspi, University of California, Berkeley
December 2, 2005
Outline ♦ Streaming for Hardware ♦ From Programming Model to Hardware Model ♦ Synthesis Methodology for FPGA • Streams, Queues, SFSM Logic ♦ Characterization of 7 Multimedia Apps ♦ Optimizations • Pipelining, Placement, Queue Sizing, Decomposition
Large System Design Challenges ♦ Devices growing with Moore’s Law
• AMD Opteron dual core CPU: ~230M transistors • Xilinx Virtex 4 / Altera Stratix-II FPGAs: ~200K LUTs ♦ Problems of DSM, large systems
• Growing interconnect delay, timing closure
♦ "Routing delays typically account for 45% to 65% of the total path delays" (Xilinx Constraints Guide)
• Slow place-and-route • Design complexity • Designs do not scale well on next gen. device; must redesign ♦ Same problems in FPGAs
Limitations of RTL ♦ RTL = Register Transfer Level ♦ Fully exposed timing behavior
• always @(posedge clk) ... ♦ Laborious, error prone ♦ Unpredictable interconnect delay
• How deep to pipeline? • Redesign on next-gen device ♦ Undermines reuse ♦ Existing solutions
• Modular design • Floorplanning • Physical synthesis • Hierarchical CAD • Latency insensitive communication
Streams ♦ A better communication abstraction ♦ Streams connect modules
(Figure: compute modules connected by streams, with a stream-accessed memory)
• FIFO buffered channel (queue) • Blocking read • Timing independent (deterministic) ♦ Robust to communication delay
• Pipeline across long distances • Robust to unknown delay (post-placement pipelining) ♦ Alternate transport (packet-switched NoC)
♦ Flexibly timed module interfaces
• Robust to module optimization (pipeline, reschedule, etc.) ♦ Enhances modular design + reuse
Streaming Applications ♦ Persistent compute structure (infrequent changes) ♦ Large data sets, mostly sequential access ♦ Limited feedback ♦ Implement with deep, system level pipelining ♦ E.g. DSP, multimedia
♦ E.g. JPEG Encoder (figure)
Ad Hoc Streaming ♦ Every module needs streaming flow control • Block if inputs not available, output not ready to receive ♦ Every stream needs queueing • Pipeline to match interconnect delay • Queue to absorb delay mismatch, dynamic rates ♦ Manual implementation, in HDL • Laborious (flow control, queues) • Error prone (deadlock if violate protocol, queue too small) • No automation (pipeline depth, queue choice / width / depth) ♦ Interconnect / queue IP (e.g. OCP / Sonics Bus) • Still no automation
Systematic Streaming ♦ Strong stream semantics: Process Networks
• Stream = FIFO channel with (a flavor of) blocking read • E.g. Kahn Process Networks, Dataflow Process Networks (E. A. Lee)
♦ Streams as programming primitive
• Language support hides flow control ♦ Compiler support
• Compiler generated flow control • Compiler controlled pipelining, queue depth, queue impl. • Compiler optimizations (e.g. module merging, partitioning) ♦ Benefits
• Easy, correct, high performance • Portable • Paging / virtualization is a logical extension (automatic page partitioning)
Outline ♦ Streaming for Hardware ♦ From Programming Model to Hardware Model ♦ Synthesis Methodology for FPGA • Streams, Queues, SFSM Logic ♦ Characterization of 7 Multimedia Apps ♦ Optimizations • Pipelining, Placement, Queue Sizing, Decomposition
SCORE Model
Stream Computations Organized for Reconfigurable Execution
♦ Application = Graph of stream-connected operators
♦ Operator = Process with local state
♦ Stream = FIFO channel, unbounded capacity, blocking read
♦ Segment = Memory, accessed via streams
♦ Dynamics: • Dynamic I/O rates • Dynamic graph construction (omitted in this work)
(Figure: operators (SFSMs) connected by streams, with a memory segment)
SCORE Programming Model: TDF ♦ TDF = behavioral language for
• SFSM Operators (Streaming Extended Finite State Machine) • Static operator graphs ♦ State machine for • Sequencing, branching
• Firing control ♦ Firing semantics
• In state X, wait for X's inputs, then evaluate X's action
♦ Example TDF state (operator with inputs i, j and output o):
  state foo (i, j): o = i + j; goto bar;
SCORE / TDF Process Networks ♦ A process from M inputs to N outputs, unified stream type S (i.e. S^M → S^N) ♦ SFSM = (Σ, σ0, σ, R, fNS, fO) • Σ = Set of states • σ0 ∈ Σ = Initial state • σ ∈ Σ = Present state • R ⊆ (Σ × S^M) = Set of firing rules • fNS : R → Σ = Next state function • fO : R → S^N = Output function ♦ Similar to dataflow process networks [Lee+Parks, IEEE May '95], but with stateful actors
Related Streaming Models ♦ Streaming Models • Kahn PN, DFPN, BDF, SDF, CSDF, HDF, StreamsC, YAPI, Catapult C, SHIM
♦ Streaming Platforms • Pleiades, Philips VSP, Imagine, TRIPS ♦ How do we differ? • Stateful processes • Deterministic • Dynamic dataflow rates (FSM nodes) • Direct synthesis to hardware • Bounded buffers
Streaming Platforms ♦ FPGA (this work) ♦ Paged FPGA • Page = fixed size partition, connected by streams • Stylized page-to-page interconnect • Hierarchical PAR ♦ Paged, Virtual FPGA (SCORE) • Time shared pages • Area abstraction (virtually large) ♦ Multiprocessor on Chip ♦ Heterogeneous
The Compilation Problem
Programming Model: TDF • Communicating SFSMs • Unrestricted size, # IOs, timing • Unbounded stream buffering
Execution Model: FPGA • Single circuit / configuration • One or more clocks • Fixed size queues
Big semantic gap
(Figure: TDF operators, streams, and a memory segment compiled to an FPGA)
The Semantic Gap ♦ Semantic gap between TDF, HW ♦ Need to bind: • Stream protocol • Stream pipelining • Queue implementation • Queue depths • SFSM synthesis style (behavioral synthesis) • Memory allocation • Primary I/O ♦ A SCORE device binds some implementation decisions (custom hardware); a raw FPGA does not ♦ Want to characterize the cost of implementation decisions
Outline ♦ Streaming for Hardware ♦ From Programming Model to Hardware Model ♦ Synthesis Methodology for FPGA • Streams, Queues, SFSM Logic ♦ Characterization of 7 Multimedia Apps ♦ Optimizations • Pipelining, Placement, Queue Sizing, Decomposition
Compilation Tool Flow
Application (TDF) → tdfc → Verilog → Synplify → EDIF (unplaced LUTs, etc.) → Xilinx ISE → device configuration bits
• tdfc: local optimization, system optimization, queue sizing, pipeline extraction, SFSM partitioning / merging, pipelining; generates flow control, streams, queues
• Synplify: behavioral synthesis, retiming
• Xilinx ISE: slice packing, place and route
Wire Protocol for Streams
♦ D = Data, V = Valid, B = Backpressure
♦ Synchronous transaction protocol
• Producer asserts V when D is ready; consumer deasserts B when ready to accept
• Transaction commits if (¬B ∧ V) at the clock edge
• Encode EOS (E) as an extra D bit (out of band, easy to enqueue)
(Figure: producer drives clock-synchronous D, E, V to consumer; consumer drives B back)
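A minimal Verilog sketch of the commit condition implied by this protocol follows; the module and signal names (stream_xfer_example, d/v/b/d_q) are illustrative, not part of the tdfc output.

  module stream_xfer_example #(parameter W = 16) (
    input              clk,
    input      [W-1:0] d,    // data (EOS would ride along as one extra bit)
    input              v,    // Valid, driven by the producer
    input              b,    // Backpressure, driven by the consumer
    output reg [W-1:0] d_q   // consumer-side capture register
  );
    wire commit = v & ~b;          // transaction commits if (not B and V) at the clock edge
    always @(posedge clk)
      if (commit) d_q <= d;        // capture data only on a committed transfer
  endmodule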
Operator Firing
♦ In state X, fire if
• Inputs desired by X are ready (Valid, EOS)
• Outputs emitted by X are ready (Backpressure)
♦ Firing guard / control flow:
  if (iv && !ie && !ob) begin ib=0; ov=1; ... end
♦ Subtlety: master, slave
• The operator is a slave: to synchronize streams, it (1) waits for flow control in, then (2) fires / emits out
• Connecting two slaves would deadlock
• Need a master (queue) between every pair of operators
(Figure: operator with input signals id, ie, iv, ib and output signals od, oe, ov, ob)
SFSM Synthesis
♦ Implemented as behavioral Verilog, using a state 'case' in both FSM and datapath (DP)
♦ FSM handles firing control, branching
♦ FSM sends the state to the DP; DP sends boolean flags to the FSM
(Figure: FSM (control, stream I/O signals B, V, E, D) beside a datapath with data registers and per-state datapath logic)
FSM Module, Firing Control

TDF:
  foo (input  unsigned[16] x,
       input  unsigned[16] y,
       output unsigned[16] o)
  { state one (x, eos(y)) : o=x+1;
    ... }

Verilog FSM Module:
  module foo_fsm (clock, reset, x_e, x_v, x_b, y_e, y_v, y_b,
                  o_e, o_v, o_b, state, statecase);
    ...
    always @* begin
      // Default is stall
      x_b_=1; y_b_=1; o_e_=0; o_v_=0;
      state_reg_ = state_reg;
      statecase_ = statecase_stall;
      did_goto_  = 0;
      case (state_reg)
        state_one: begin
          // Firing condition(s) for state one
          if (x_v && !x_e && y_v && y_e && !o_b)
            statecase_ = statecase_1;
          // Stream flow control for state one
          x_b_=0; y_b_=0; o_v_=1; o_e_=0;
        end
        ...
      endcase
    end // always @*
  endmodule // foo_fsm
Data-Path Module

TDF:
  foo (input  unsigned[16] x,
       input  unsigned[16] y,
       output unsigned[16] o)
  { state one (x, eos(y)) : o=x+1;
    ... }

Verilog Data-path Module:
  module foo_dp (clock, reset, x_d, y_d, o_d, state, statecase);
    ...
    always @* begin
      // Default
      o_d_ = 16'bx;
      did_goto_ = 0;
      case (state)
        state_one: begin
          // Data-path for state one, active when the firing condition holds
          if (statecase_ == statecase_1)
            o_d_ = (x_d + 1'd1);
        end
        ...
      endcase
    end // always @*
  endmodule // foo_dp
Stream Buffers (Queues)
♦ Systolic • Cascade of depth-1 stages (or depth-N)
♦ Shift register • Put: shift all entries • Get: tail pointer
♦ Circular buffer • Memory with head / tail pointers
Enabled Register Queue
♦ Systolic, depth-1 stage
♦ 1 state bit (empty/full) = V
♦ Shift in data unless: • Full and downstream not ready to consume the queued element
♦ Area: 1 FF per data bit • On FPGA, 1 LUT cell per data bit • Depth-1 (single stage) nearly free, since FFs pack with logic
♦ Speed: as fast as a FF • But combinationally connects producer + consumer via B
(Figure: enabled register stage with inputs iD, iV, iB and outputs oD, oV, oB)
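A minimal Verilog sketch of such a depth-1 enabled-register stage, assuming the behavior described above; names are illustrative and the real tdfc-generated queues differ in detail.

  module enreg_queue_example #(parameter W = 17) (   // W counts D plus the EOS bit
    input          clk, reset,
    input  [W-1:0] i_d,                              // data in
    input          i_v,                              // valid in
    output         i_b,                              // backpressure to producer
    output [W-1:0] o_d,                              // data out
    output         o_v,                              // valid out
    input          o_b                               // backpressure from consumer
  );
    reg [W-1:0] data;
    reg         full;                                // 1 state bit (empty/full); doubles as Valid
    wire        en = ~full | ~o_b;                   // shift in unless full and consumer not ready
    assign o_d = data;
    assign o_v = full;
    assign i_b = full & o_b;                         // note: combinational B path from consumer to producer
    always @(posedge clk) begin
      if (reset)    full <= 1'b0;
      else if (en)  full <= i_v;                     // take a new word, or go empty if the producer is idle
      if (en & i_v) data <= i_d;
    end
  endmodule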
Xilinx SRL16 ♦ SRL16 = Shift register of depth 16 in one 4-LUT cell • Shift register of arbitrary width: parallel SRL16, arbitrary depth: cascade SRL16
♦ Improves queue density by 16x
(Figure: a Xilinx LUT cell in 4-LUT mode vs. SRL16 mode)
Shift Register Queue
♦ State: empty bit + capacity counter
♦ Data stored in a shift register
• In at position 0 • Out at position Address
♦ Address = number of stored elements minus 1
♦ Synplify infers SRL16E from a Verilog array
• Parameterized depth, width
♦ Flow control
• ov = (State==Non-Empty)
• ib = (Address==Depth-1)
♦ Performance improvements
• Registered data out • Registered flow control • Specialized, pre-computed fullness
(Figure: FSM (empty bit, +1/-1 Address counter) controlling a shift register; stream signals iD,E / iV / iB in and oD,E / oV / oB out)
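A minimal Verilog sketch of this organization (empty bit, up/down address counter, shift register intended to map onto SRL16E cells). It is a simplified stand-in for the tdfc-generated queue with illustrative names; for brevity it refuses an enqueue while full even if a dequeue happens in the same cycle.

  module srl_queue_example #(parameter W = 17, DEPTH = 16, AW = 4) (
    input           clk, reset,
    input  [W-1:0]  i_d,
    input           i_v,
    output          i_b,
    output [W-1:0]  o_d,
    output          o_v,
    input           o_b
  );
    reg [W-1:0]  sr [0:DEPTH-1];               // shift register; Synplify can map this onto SRL16E cells
    reg [AW-1:0] addr;                         // number of stored elements minus 1
    reg          empty;
    wire put = i_v & ~i_b;                     // committed enqueue
    wire get = o_v & ~o_b;                     // committed dequeue
    assign o_d = sr[addr];                     // out at position Address
    assign o_v = ~empty;
    assign i_b = ~empty & (addr == DEPTH-1);   // full
    integer k;
    always @(posedge clk) begin
      if (put) begin                           // in at position 0 (SRL contents have no reset)
        for (k = DEPTH-1; k > 0; k = k-1) sr[k] <= sr[k-1];
        sr[0] <= i_d;
      end
      if (reset) begin
        empty <= 1'b1;
        addr  <= {AW{1'b0}};
      end else case ({put, get})
        2'b10: begin empty <= 1'b0; if (!empty) addr <= addr + 1'b1; end
        2'b01: begin if (addr == 0) empty <= 1'b1; else addr <= addr - 1'b1; end
        default: ;                             // 00: hold; 11: one in, one out, count unchanged
      endcase
    end
  endmodule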
SRL Queue with Registered Data Out
♦ Registered data out
• od has only clock-to-Q delay • Non-retimable
♦ Data output register extends the shift register
♦ Bypass the shift register when the queue is empty
♦ 3 states (Empty, One, More)
♦ Address = number of stored elements minus 2
♦ Flow control
• ov = !(State==Empty)
• ib = (Address==Depth-2)
(Figure: shift register followed by a data output register; FSM with states Empty / One / More and +1/-1 Address counter)
SRL Queue with Registered Flow Control
♦ Registered flow control
• ov has only clock-to-Q delay • ib has only clock-to-Q delay • Non-retimable
♦ Flow control
• ov_next = !(State_next==Empty)
• ib_next = (Address_next==Depth-2)
♦ Based on precomputed fullness
• full_next = (Address_next==Depth-2)
(Figure: the same shift-register queue datapath, with next-cycle fullness precomputed to drive the registered ov / ib)
SRL Queue with Specialized, Pre-Computed Fullness ♦ Speed up critical full pre-computation by special-casing full_next for each state
♦ Flow control
• ov_next = !(State_next==Empty)
• ib_next = full_next
♦ The zero pre-computation is less critical
♦ Result
• >200 MHz unless very large (e.g. 128 x 128)
• All output delays are clock-to-Q
• Area ≈ 3 x (SRL16E area)
(Figure: queue datapath with full_next special-cased per state Empty / One / More)
SRL Queue Speed
SRL Queue Area
Page Synthesis
♦ Page = cluster of operator(s) + queues
♦ SFSMs
• One or more per page • Further decomposed into FSM, data-path
♦ Page input queues
• Deep • Drain pipelined page-to-page streams before reconfiguration
♦ In-page queues
• Shallow
♦ Separately synthesizable modules
• Separate characterization • Consider custom resources
(Figure: a page containing page input queue(s), an in-page queue, and operators Op 1 / Op 2, each split into FSM and datapath)
Page Synthesis
♦ Module hierarchy
• Individual SFSMs (combinational cores)
• Local / output queues
• Operators and local / output queues
• Input queues
• Page
(Figure: the same page structure, viewed as a module hierarchy)
Outline ♦ Streaming for Hardware ♦ From Programming Model to Hardware Model ♦ Synthesis Methodology for FPGA • Streams, Queues, SFSM Logic ♦ Characterization of 7 Multimedia Apps ♦ Optimizations • Pipelining, Placement, Queue Sizing, Decomposition
Tool Flow, Revisited
♦ Separate compilation for application, SFSMs • Page • SFSM • Datapath • FSM
♦ Tool options
• Identical queuing for every stream (SRL16 based, depth 16)
• I/O boundary registers (for Xilinx static timing analysis)
• Synplify 8.0 • Target 200 MHz • Optimize: FSM, retiming, pipelining • Retain monolithic FSM encodings
• ISE 6.3i • Constrain to minimum square area, at least max slice packing + 20%, expand if PAR fails • Device: XC2VP70 -7
(Flow: Application → tdfc → Verilog → Synplify → EDIF (unplaced LUTs, etc.) → Xilinx ISE → device configuration bits)
PAR Flow for Minimum Area
♦ Inputs: EDIF netlist from Synplify; constraints file • Page area • Target period
♦ ngdbuild: • Convert netlist EDIF → NGD
♦ map: • Pack LUTs, MUXes, etc. into slices
♦ trce (pre-PAR): • Static timing analysis, logic only
♦ par: • Place and route
♦ trce (post-PAR): • Static timing analysis
(Flow: EDIF + constraints → ngdbuild → map → trce → par → trce; if the packed-slice or packed-timing target is missed, retry with 1 extra row/column of area)
SCORE Applications ♦ 7 Multimedia Applications / 279 Operators • MPEG, JPEG, Wavelet, IIR • Written by Joe Yeh
• Mostly feed-forward streaming • Constant consumption / production ratios, except compressors (ZLE, Huffman)

Application       SFSMs  Segments  Streams (In/Local/Out)  Speed (MHz)  Area (4-LUT cells)  %Area FSMs  %Area Queues
IIR                   8         0        1 /   7 /  1            166               1,922        3.4%        27.7%
JPEG Decode           9         1        1 /  41 /  8             47               7,442        7.0%        28.7%
JPEG Encode          11         4        8 /  42 /  1             57               6,728        7.5%        36.9%
MPEG Encode IP       80        16        6 / 231 /  1             47              41,472        5.5%        39.7%
MPEG Encode IPB     114        17        3 / 313 /  1             50              65,772        5.2%        40.5%
Wavelet Encode       30         6        1 /  50 /  7            106               8,320       10.1%        32.0%
Wavelet Decode       27         6        7 /  49 /  1            109               8,712        8.5%        29.6%
Total               279        50       27 / 733 / 20              -             140,368        5.9%        38.3%
Page Area
♦ 87% of SFSMs are smaller than 512 LUTs (by design)
♦ FSMs are small
♦ Datapaths dominate in most large pages
(Chart: page area breakdown; the largest pages are DCT, IDCT)
Page Speed
♦ FSMs (flow control) are fast, never critical
♦ Queues are critical for 1/3 of the fastest pages
♦ Datapaths dominate
(Chart: speed-critical component per page, with segments of 47%, 43%, and 10%)
Outline ♦ Streaming for Hardware ♦ From Programming Model to Hardware Model ♦ Synthesis Methodology for FPGA • Streams, Queues, SFSM Logic ♦ Characterization of 7 Multimedia Apps ♦ Optimizations • Pipelining, Placement, Queue Sizing, Decomposition
Improving Performance, Area
♦ Local (module) optimization
• Traditional compiler optimization (const folding, CSE, etc.)
• Datapath pipelining / loop scheduling
• Granularity transforms (composition / decomposition)
♦ System level optimization
• Interconnect pipelining
• Shrink / remove queues
• Area-time transformations (rate matching, serialization, parallelization)
Pipelining With Streams
♦ Datapath pipelining • Add registers at output (or input) • Retime into place
♦ Harder in practice (FSM, cycles) • Add registers at strategic locations • Rewrite control • Avoid violating the communication protocol
♦ Stream pipelining • Add registers on streams • Retime into the datapath • Modify queues, not processes
(Figure: FSM + DP operators connected by streams)
Logic Pipelining
♦ Add L pipeline registers to D, V
♦ Retime backwards
• This pipelines feed-forward parts of the producer's data-path
♦ Stale flow control may overflow the queue (by L)
♦ Modify the queue to emit back-pressure when empty slots ≤ L
♦ No manual modification of processes
(Figure: producer → L registers on D (Data) and V (Valid), retimed backwards → queue with L-slot reserve → consumer; B (Backpressure) unregistered)
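A minimal sketch of the reserve-modified backpressure: the queue asserts B early, while RESERVE slots are still free, so words already in flight in the added pipeline registers can be absorbed. Names are illustrative, and the occupancy counter is assumed to be maintained by the queue's existing FSM.

  module reserve_backpressure_example #(parameter DEPTH = 16, RESERVE = 2, CW = 5) (
    input  [CW-1:0] count,                       // current queue occupancy
    output          i_b                          // backpressure to the pipelined producer
  );
    // Ordinary queue:      i_b = (count == DEPTH)
    // Queue with reserve:  back-pressure once only RESERVE slots remain,
    // where RESERVE = L for logic pipelining (or 2W for interconnect pipelining).
    assign i_b = (count >= DEPTH - RESERVE);
  endmodule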
Logic Relaying + Retiming
♦ Break up deep logic in a process
♦ Relay through enabled register queue(s)
♦ Retime registers into the adjacent process
• This pipelines feed-forward parts of the process's datapath
• Can retime into producer or consumer
♦ No manual modification of processes
(Figure: producer → enabled register queue(s) → original queue → consumer; registers retimed into the producer)
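A minimal sketch of relaying a stream through N depth-1 enabled-register stages, reusing the hypothetical enreg_queue_example stage sketched earlier; the retiming step that later pulls these registers into adjacent logic is left to the synthesis tool.

  module relay_chain_example #(parameter W = 17, N = 2) (
    input          clk, reset,
    input  [W-1:0] i_d,  input  i_v,  output i_b,
    output [W-1:0] o_d,  output o_v,  input  o_b
  );
    wire [W-1:0] d [0:N];                        // stage-to-stage D/V/B nets
    wire         v [0:N];
    wire         b [0:N];
    assign d[0] = i_d;   assign v[0] = i_v;   assign i_b = b[0];
    assign o_d  = d[N];  assign o_v  = v[N];  assign b[N] = o_b;
    genvar k;
    generate
      for (k = 0; k < N; k = k + 1) begin : stage
        enreg_queue_example #(.W(W)) q (         // hypothetical depth-1 stage from the earlier sketch
          .clk(clk), .reset(reset),
          .i_d(d[k]),   .i_v(v[k]),   .i_b(b[k]),
          .o_d(d[k+1]), .o_v(v[k+1]), .o_b(b[k+1])
        );
      end
    endgenerate
  endmodule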
Benefits, Limitations ♦ Benefits • Simple to implement, relies only on retiming • Sufficient for many cases, e.g. DCT, IDCT ♦ Limitations • Feed-forward only (weaker than loop sched.) • Resource sharing obfuscates retiming opportunities ♦ Extends to interconnect pipelining • Do not retime into logic — register placement only • Also pipeline B, modify queue
Pipelining Configuration
♦ Three pipelining sites per stream: input-side logic relaying (Li enabled registers), logic pipelining (Lp D/V registers), output-side logic relaying (Lr enabled registers)
♦ Pipeline depth parameters: Li+Lp+Lr
♦ Uniform pipelining: same depths for every stream
(Figure: queue with reserve, Li enabled registers, the SFSM, Lp registers on D and V, and Lr enabled registers, with retiming arrows at each site)
Speedup from Logic Pipelining
(Chart: speedup vs. pipelining depth, for enabled registers (Lr) and D FFs (Lp))
Expansion from Logic Pipelining
(Chart: area expansion vs. pipelining depth, for enabled registers (Lr) and D FFs (Lp))
Some Things Are Better Left Unpipelined
♦ Page speedup: (chart)
♦ Page expansion: (chart)
♦ Initially fast pages should not be pipelined
Page Specific Logic Pipelining ♦ Separate pipelining of each SFSM ♦ Assumption: application speed = slowest page speed
• Critical Page ♦ Repeatedly improve slowest page until no further improvement is possible ♦ Page improvement heuristics
• Greedy Lr : add one level of pipelining in 0+0+Lr
• Greedy Lp : add one level of pipelining in 1+Lp+0
• Max : pipeline to the best page speed (brute force)
♦ Greedy heuristics may end early
• Non-monotonicity: adding a level of pipelining may slow a page
Speedup from Page-Specific Pipelining
(Chart: speedup, for enabled registers (Lr) and D FFs (Lp))
Expansion from Page-Specific Pipelining
(Chart: area expansion, for enabled registers (Lr) and D FFs (Lp))
Interconnect Delay ♦ Critical routing delay grows with circuit size • Routing delay for an application: avg. 45% - 56% • Routing delay for its slowest page: avg. 40% - 50% • Ratio (appl. to slowest page): avg. 0.99x - 1.34x ♦
Averaged over 7 apps / varies with logic pipelining
♦ Modular design helps • Retain critical routing delay of page, not application • Page-to-page delays (streams) can be pipelined
Interconnect Pipelining ♦ Add W pipeline registers to D, V, B
• Mobile registers for placer
• Not retimable
♦ Stale flow control may overflow queue (by 2W)
• Staleness = total delay on the B-V feedback loop = 2W
♦ Modify the downstream queue to emit back-pressure when empty slots ≤ 2W
(Figure: producer → W registers on D (Data), V (Valid), B (Backpressure) across a long distance → queue with 2W-slot reserve → consumer)
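A minimal sketch of W interconnect pipeline registers on one stream (D and V forward, B backward); these registers are meant to stay mobile for the placer rather than be retimed, and the downstream queue needs a 2W-slot reserve as above. Module and signal names are illustrative.

  module stream_pipe_example #(parameter DW = 17, W = 2) (
    input           clk,
    input  [DW-1:0] d_in,  input  v_in,  output b_out,   // producer side
    output [DW-1:0] d_out, output v_out, input  b_in     // consumer (queue) side
  );
    // Reset omitted for brevity; FPGA flip-flops initialize at configuration.
    reg [DW-1:0] d_pipe [0:W-1];
    reg          v_pipe [0:W-1];
    reg          b_pipe [0:W-1];
    integer k;
    always @(posedge clk) begin
      d_pipe[0] <= d_in;  v_pipe[0] <= v_in;  b_pipe[0] <= b_in;
      for (k = 1; k < W; k = k + 1) begin
        d_pipe[k] <= d_pipe[k-1];
        v_pipe[k] <= v_pipe[k-1];
        b_pipe[k] <= b_pipe[k-1];
      end
    end
    assign d_out = d_pipe[W-1];
    assign v_out = v_pipe[W-1];
    assign b_out = b_pipe[W-1];   // W cycles each way on the B-V loop: 2W total staleness
  endmodule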
Speedup from Interconnect Pipelining
Speedup from Interconnect Pipelining, No Area Constraint
Expansion from Interconnect Pipelining, No Area Constraint
Interconnect Register Allocation
♦ Commercial FPGAs / tool flows
• No dedicated interconnect registers
• Allocation: add to netlist, slice pack, place-and-route
• If registers are packed with logic: limited register mobility
• If registers are packed alone: area overhead
♦ Better: post-placement register allocation
• Weaver et al., "Post-Placement C-Slow Retiming for the Xilinx Virtex FPGA," FPGA 2003
• Allocation: PAR, c-slow, retime, scavenge registers, reroute
• No area overhead (scavenge registers from the existing placement)
• Better performance, since routing delay is known
• Modification for streaming: PAR, pipeline, retime, scavenge registers, reroute, modify queue depths (configuration specialization)
Throughput Modeling ♦ Pipelining feedback loops may reduce throughput (tokens per clock period) • Which loops / streams are critical? ♦ Throughput model for PN • Feedback cycle C with M tokens and N pipe delays has token period TC = N/M
• Overall token period: T = maxC {TC } • Available slack: CycleSlackC = (T - TC) • Generalize to multi-rate, dynamic rate by unfolding
(Figure: an example multi-rate network unfolded to an equivalent single-rate PN, with cycle periods TC1 = 3 and TC2 = 2)
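A worked instance of this model, using the cycle periods from the figure: T = max{TC1, TC2} = max{3, 2} = 3, so CycleSlackC2 = T - TC2 = 1. Under the model, optimizations on cycle C2 (e.g. extra pipeline registers) may lengthen its token period by up to 1 before C2, rather than C1, limits overall throughput.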
Throughput Aware Optimizations ♦ Throughput aware placement • Adapt [Singh+Brown, FPGA 2002] • Stream slack: Te = maxC s.t. e∈C {TC } • Stream net criticality: crit = 1 - ((T - Te) / T) ♦ Throughput aware pipelining • Pipeline stream w/o exceeding slack • Pipeline module s.t. depth does not exceed any output stream slack
♦ Pipeline balancing (by retiming) ♦ Process serialization • Serial arithmetic for a process with low throughput, high slack
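Continuing the earlier example (T = 3): a stream that lies only on cycle C2 has Te = TC2 = 2, so its net criticality is crit = 1 - ((3 - 2) / 3) = 2/3, lower than the criticality of 1 for streams on the critical cycle C1.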
Stream Buffer Sizing
♦ Fixed size buffers in hardware
• For minimum area (want the smallest feasible queue)
• For performance (want deep enough to avoid stalls from producer-consumer timing mismatch)
♦ Semantic gap
• Buffers are unbounded in TDF, bounded in HW
• A small buffer may create artificial deadlock (bufferlock)
• Theorem: the memory bound is undecidable for a Turing complete process network
• In practice, our buffering requirements are small
(Figure: the same pair of processes exchanging streams x and y, shown with bounded vs. unbounded buffers)
Dealing with Undecidability
♦ Handle unbounded streams
• Buffer expansion [Parks '95]: detect bufferlock, expand buffers
• Hardware implementation: buffer expansion = rewire to another queue; storage in off-chip memory or a queue bank
♦ Guarantee a depth bound for some cases
• User depth annotation
• Analysis: identify compatible SFSMs with balanced schedules
♦ Detect bufferlock and fail
Interface Automata
de Alfaro + Henzinger, Symp. Found. SW Eng. (FSE) 2001
♦ A finite state machine that transitions on I/O actions
• Not input-enabled (not every I/O on every cycle)
♦ G = (V, E, Ai, Ao, Ah, Vstart)
• Ai = input actions (x? in CSP notation)
• Ao = output actions (y!)
• Ah = internal actions (z;)
• E ⊂ V x (Ai ∪ Ao ∪ Ah) x V (transition on action)
♦ Execution trace = (v, a, v, a, …)
(Figure: interface automaton for a 'select' operator with inputs s, t, f and output o, illustrating non-deterministic branching)
Automata Composition
♦ Composition ~ product FSM with synchronization (rendezvous) on common actions
♦ Composition edges:
(i) step A on an unshared action
(ii) step B on an unshared action
(iii) step both on a shared action
♦ Compatible Composition → Bounded Memory
(Figure: two automata A and B with a shared action y; their automata composition over states {AB, A'B, AB', A'B'} vs. the direct product)
Stream Buffer Bounds Analysis
♦ Given a process network, find minimum buffer sizes to avoid bufferlock
♦ A buffer (queue) is also an automaton
♦ Symbolic Parks' algorithm
• Compose the network using arbitrary buffer sizes
• If deadlock, try larger sizes
♦ Practical considerations: avoiding state explosion
• Multi-action automata
• Know which streams to expand first
• Compose pairwise in a clever order (composition is associative)
• Cull states reaching deadlock
• Partition the system
(Figure: processes A and B connected by streams x, i, o, y; the buffer on one stream is modeled as a queue automaton Q with states 0, 1, 2 and transitions on i? and o!)
SFSM Decomposition (Partitioning)
♦ Why decompose
• To improve locality • To fit into custom page resources
♦ Decomposition by state clustering
• 1 state (i.e. 1 cluster) active at a time
♦ Cluster states to contain transitions
• Fast local transitions, slow external transitions
• Formulation: minimize cut of transition probability under area, I/O constraints
♦ Similar to:
• VLIW trace scheduling [Fisher '81]
• FSM decomp. for low power [Benini/DeMicheli ISCAS '98]
• GarpCC HW/SW partitioning [Callahan '00]
• VM/cache code placement
(Figure legend: state flow vs. data flow)
Early SFSM Decomposition Results ♦ Approach 1: Balanced, multi-way, min-cut
• Modified Wong FBB [Yang+Wong, ACM ‘94] • Edge weight is mix: c*(transition probability) + (1-c)*(wire bits) • Poor at simultaneous I/O constraint + cut optimization ♦ Approach 2: Spectral order + Extent cover
• Spectral ordering clusters connected components in 1D (minimize squared weighted distance; the weight is the same mix as above)
• Then choose area- and I/O-feasible extents [start, end] using dynamic programming
• Effective for partitioning to custom page resources
♦ Under 2% external transitions
• Amdahl's law: few slow transitions ⇒ small performance loss
• Achievable with either approach
Summary ♦ Streaming addresses large system design challenges
• Growing interconnect delay • Flexibly timed module interfaces
• Design complexity • Reuse
♦ Methodology to compile SCORE applications to Virtex-II Pro
• Language + compiler support for streaming ♦ Characterized 7 applications on Virtex-II-Pro
• Queue area ~38% ♦
• Flow control FSM area ~6%
Improve by merging SFSMs, eliminating queues
♦ Stream pipelining
• For logic
• For interconnect
♦ Stream based optimizations
• Pipelining • Queue sizing • Placement • Serialization • Module merging • Partitioning
Supplemental Material
TDF ∈ Dataflow Process Networks
♦ Dataflow Process Networks [Lee+Parks, IEEE May '95]
• Process enabled by a set of firing rules: R = { r1, r2, …, rK }
• Firing rule = set of input patterns: ri = ( ri,1, ri,2, …, ri,M )
♦ DF process for a TDF operator:
• Feedback arc for state
• Firing rule(s) per state; patterns match state + input presence
• E.g. for state σ: rσ = ( [σ], rσ,1, rσ,2, … )
• Patterns: rσ,j = [*] if input j is in the input signature of state σ; rσ,j = ⊥ if input j is not in the input signature of state σ
• A single firing rule per state = DFPN sequential firing rules
• Multiple firing rules per state translate the same way, with restrictions to retain determinism
(Figure: dataflow process with a feedback arc carrying the state)
SFSM Partitioning Transform
♦ Only 1 partition active at a time
• Transform to activate via streams
♦ New state in each partition: "wait"
• Used when not active
• Waits for activation from other partition(s)
• Has one input signature (firing rule) per activator
♦ Firing rules are not sequential, but determinism is guaranteed
• Only 1 possible activator
♦ Activation streams from a given source to a given destination partition can be merged + binary-encoded
(Figure: an SFSM with states A, B, C, D split into partitions AB and CD, each gaining a Wait state, with activation streams between the partitions)
Virtual Paged Hardware (SCORE)
♦ Compute model has unbounded resources
• Programmer does not target a particular device size
♦ Paging
• "Compute pages" swapped in/out (like virtual memory)
• Page context = thread (FSM to block on stream access)
♦ Efficient virtualization
• Amortize reconfiguration cost over an entire input buffer
• Requires "working sets" of tightly-communicating pages to fit on the device
(Figure: a pipeline of compute pages (Transform, Quantize, RLE, Encode) with buffers between them)