Università degli Studi di Catania
Dipartimento di Ingegneria Informatica e delle Telecomunicazioni (DIIT)
Application Specific Routing Algorithms for Networks on Chip
Maurizio Palesi
http://www.diit.unict.it/~mpalesi
NEC Laboratories America, December 14, 2009
Systems on Chip
We are now in the era of block-based design
Yesterday: ASIC/ASSP Design → Bus Standards, Predictable, Preverified → System-Board Integration
Today: IP/Block Authoring → VSI Compatible Standards, Predictable, Preverified → System-Chip Integration
Trends
From multi-core to many-core
➔ From dual-, tri-, quad-, hexa-, octo-core chips to ones with tens or even hundreds of cores
Homogeneous vs. heterogeneous debate
Examples
➔ Teraflops Research Chip (Polaris), a 3.16 GHz, 80-core processor prototype
➔ NVIDIA’s GPUs, CUDA architecture (100s of cores)
➔ Tilera TILE64, 64-core processor
➔ picoChip PC200 series, 200–300 cores per device for DSP & wireless
Trends in SoCs
Technology: no slowdown in sight!
➔ Faster and smaller transistors: 90 → 65 → 45 → 32 nm
➔ …but slower wires, lower voltage, more noise!
✔ 80% or more of the delay of critical paths will be due to interconnects
Design complexity: from 2 to 10 to 100 cores!
➔ Design reuse is essential
➔ …but differentiation/innovation is key for winning on the market!
Performance and power: GOPS for mWs!
➔ Performance requirements keep going up
➔ …but power budgets do not!
The Deep Submicron Effects
Communication Architectures
Shared bus
➔ Low area
➔ Poor scalability
➔ High energy consumption
Network-on-Chip
➔ Scalability and modularity
➔ Low energy consumption
➔ Increase of design complexity
Intel's Teraflops
➔ 100 million transistors
➔ 80 cores, 160 FP engines
➔ Teraflops performance @ 62 Watts
➔ On-die mesh network
➔ Power-aware design
Terascale vs. ASCI Red
ASCI Red: the first computer to reach a Teraflops of processing (1996)
➔ 10,000 Pentium @ 200 MHz
➔ 500 kW of power (+500 kW to keep the room cool)
Terascale: the first on-chip solution reaching a Teraflops of processing (2006)
➔ 80-core chip (VLIW based)
➔ 3.16–5.7 GHz
➔ 62 W–256 W
Outline
Application Specific Routing Algorithms
Concurrent Mapping and Routing
Dealing with Manufacturing Defects
Encoding Scheme for Low Power
Routing Algorithms
NoC performance is determined by: Topology, Switching, Flow control, Routing
Routing determines the path selected by a packet to reach its destination
➔ Deterministic
➔ Adaptive
[Figure: mesh network with a source node S and a destination node D]
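As a concrete instance of a deterministic routing algorithm, the sketch below implements XY (dimension-order) routing on a 2D mesh: a packet first travels along the X dimension until it reaches the destination column, then along Y. The coordinate convention and the function name `xy_route` are assumptions made for illustration; they are not taken from the slides.

```python
def xy_route(src, dst):
    """Return the nodes visited from src to dst under XY routing.

    Nodes are (x, y) tuples on a 2D mesh (an assumed convention).
    The path is fully determined by src and dst, which is what makes
    the algorithm deterministic rather than adaptive.
    """
    x, y = src
    dst_x, dst_y = dst
    path = [(x, y)]
    # Phase 1: correct the X coordinate.
    while x != dst_x:
        x += 1 if dst_x > x else -1
        path.append((x, y))
    # Phase 2: correct the Y coordinate.
    while y != dst_y:
        y += 1 if dst_y > y else -1
        path.append((x, y))
    return path


if __name__ == "__main__":
    print(xy_route((0, 0), (2, 1)))
```

An adaptive algorithm would instead return a set of admissible next hops at each node and let a selection policy choose among them.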
Performance of Routing Algorithms (ranked from best to worst)
Uniform traffic: XY, West-First, Odd-Even, Negative-first
Transpose 1 traffic: Negative-first, Odd-Even, West-First, XY
Hotspot traffic: Odd-Even, West-First, then XY and Negative-first together
No Winner Routing Algorithm
➔ The ranking changes with the traffic scenario
Application Specific Routing Algorithm
Information about
➔ Tasks which communicate and tasks which never communicate
✔ After task mapping → information about network nodes which communicate
➔ Concurrent/non-concurrent communications
➔ Communication bandwidth requirements
Many opportunities
➔ Improving performance (e.g., maximizing routing adaptivity)
➔ Simplifying the estimation/control of congestion
➔ Designing more effective selection policies
[Figure: Application Specification (tasks T1, ..., Tn) and Network Topology feed the AS Routing Algorithm Design]
APSRA Example
[Figure, built up over several slides: a Communication Graph with tasks T1 to T6; a Topology Graph with nodes P1 to P6 and channels l12, l21, l23, l32, l14, l41, l25, l52, l36, l63, l45, l54, l56, l65; the corresponding Channel Dependency Graph; and the Application Specific Channel Dependency Graph obtained from it]
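The role of the application specific channel dependency graph can be illustrated with a small sketch. It relies on the classical sufficient condition for deadlock freedom (an acyclic channel dependency graph), and builds the graph only from the paths of node pairs that actually communicate. This is a minimal illustration under those assumptions, not the APSRA algorithm itself; the function names and the example paths are hypothetical.

```python
from collections import defaultdict


def build_cdg(paths):
    """Build a channel dependency graph from routing paths.

    Each path is a sequence of node names; a channel is a (from, to)
    pair. A dependency a -> b exists when some path uses channel b
    immediately after channel a.
    """
    cdg = defaultdict(set)
    for path in paths:
        channels = list(zip(path, path[1:]))
        for channel in channels:
            cdg[channel]  # make sure every channel is a vertex
        for first, second in zip(channels, channels[1:]):
            cdg[first].add(second)
    return dict(cdg)


def has_cycle(cdg):
    """Depth-first search for a cycle in the dependency graph."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {channel: WHITE for channel in cdg}

    def visit(channel):
        color[channel] = GREY
        for nxt in cdg[channel]:
            if color[nxt] == GREY:
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[channel] = BLACK
        return False

    return any(color[c] == WHITE and visit(c) for c in cdg)


# Hypothetical paths on the P1-P2-P5-P4 ring of the example topology.
all_pairs = [["P1", "P2", "P5"], ["P2", "P5", "P4"],
             ["P5", "P4", "P1"], ["P4", "P1", "P2"]]
# Assume the application never needs the last communication.
app_pairs = all_pairs[:3]

if __name__ == "__main__":
    print(has_cycle(build_cdg(all_pairs)), has_cycle(build_cdg(app_pairs)))
```

In this toy case the full set of paths closes a cycle of dependencies, while the application specific set does not, so no turn needs to be prohibited to keep the routing deadlock free.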
APSRA Methodology
Inputs: Communication Graph (application to be mapped, tasks T1, ..., Tn), Network Topology (nodes P1, ..., P13), Mapping Function
APSRA [Palesi, et al., TPDS’09]
➔ GOAL: maximize adaptivity
➔ Output: Routing Tables
Compression [Palesi, et al., SAMOS’06]
➔ Inputs: Routing Tables, Memory budget
➔ Output: Compressed Routing Tables
APSRA Performance (1/2)
[Figure: results under Transpose 1 and MMS traffic for XY, Odd-Even (sel=random), APSRA (sel=random), Odd-Even (sel=buffer level), APSRA (sel=buffer level)]
APSRA Performance (2/2)
[Figure: throughput vs. pir, with the saturation pir marked]
Max. pir (packets/cycle/IP) and APSRA improvement
Traffic scenario      XY      OE      APSRA   vs. XY   vs. OE
Random                0.012   0.011   0.012     0.0%    14.3%
Locality              0.019   0.020   0.021    10.5%     5.0%
Transpose 1           0.011   0.015   0.027   145.5%    80.0%
Transpose 2           0.011   0.016   0.027   145.5%    68.8%
Hotspot-4c            0.003   0.004   0.004    13.6%     7.1%
Hotspot-4tr           0.003   0.003   0.004    29.6%    12.9%
Hotspot-8r            0.004   0.006   0.007    71.8%    13.6%
Mms                   0.017   0.017   0.020    12.6%    12.6%
Average improvement                            53.6%    26.8%
Outline
✓ Application Specific Routing Algorithms
Concurrent Mapping and Routing
Dealing with Manufacturing Defects
Encoding Scheme for Low Power
DSE for NoC Architectures
[Figure: design effort and design quality against increased customization level and flexibility] [Ogras et al., ASAP'05]
The Mapping Problem
Inputs: NoC, Application (concurrent apps.), IP Library
1. The application is divided into a graph of concurrent tasks (T1, T2, ..., Tm)
2. The application tasks are assigned and scheduled onto IPs from the library (ASIC1, MEM1, CPU1, CPU2, DSP1)
3. Mapping (NP-hard): decide to which tile each selected IP should be mapped such that the metrics of interest are optimized
Impact of Mapping on Performance
A/V multimedia system
➔ Mapped on 16 IPs
Average packet latency of 3000 random mappings
➔ Results shown for the top 478 mappings
➔ The remaining 2522 mappings have latency much higher than 200 clock cycles
Problem Formulation
Given
➔ An application (or a set of concurrent applications) already mapped and scheduled onto a set of IPs
➔ A network topology
Find the best mapping and the best routing function which
➔ Maximize performance (minimize the mapping coefficient)
➔ Maximize fault-tolerance characteristics (maximize the robustness index)
Such that
➔ The aggregated communications assigned to any channel do not exceed its capacity
Robustness Index
Is an extension of the concept of path diversity
$RI(c) = \frac{1}{|L_c|} \sum_{l \in L_c} |P_c \setminus P_{c,l}|$
Example 1: a single link fault does not compromise the communication from s to d
➔ RI(s→d) = (1+1+1+1+1+1+1+1)/8 = 1
Example 2: a single link fault in either l' or l” makes the communication from s to d impossible
➔ RI(s→d) = (0+1+1+1+1+0)/6 = 0.67
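The two numeric examples can be reproduced with a short sketch. It assumes the following reading of the formula: P_c is the set of routing paths available to communication c, L_c is the set of links those paths use, and P_{c,l} is the subset of paths that traverse link l, so that each term counts the paths surviving a fault on l. The function name and the link labels in the examples are hypothetical.

```python
def robustness_index(paths):
    """Average number of surviving paths over all single-link faults.

    `paths` is a list of paths, each given as a list of link labels.
    """
    if not paths:
        raise ValueError("at least one path is required")
    links = set().union(*paths)  # L_c: every link used by some path
    surviving = 0
    for link in links:
        # |P_c \ P_{c,l}|: paths that do not traverse the faulty link
        surviving += sum(1 for path in paths if link not in path)
    return surviving / len(links)


# Assumed shape of Example 1: two paths with no link in common.
disjoint = [["a1", "a2", "a3", "a4"], ["b1", "b2", "b3", "b4"]]
# Assumed shape of Example 2: both paths share the links l' and l''.
shared = [["l'", "a1", "a2", "l''"], ["l'", "b1", "b2", "l''"]]

if __name__ == "__main__":
    print(robustness_index(disjoint), round(robustness_index(shared), 2))
```

With these inputs the first call yields 1.0 (8 links, one surviving path each) and the second 4/6, matching the values on the slide.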
Armament to Deal with the Mapping Problem
Characterization of NoC resources
➔ As the first step to develop a mapping technique for NoCs
Find a model of the communication cost
➔ Correlated with the performance metrics
➔ Which does not require expensive simulations
Model of Communication Cost
Define a metric that does not depend on the traffic pattern
➔ Based exclusively on the inter-node distance
✔ Topology
✔ Routing algorithm
Equivalent distance
➔ Analogy to the electrical equivalent resistance
[Figure: paths between a source S and a destination D]
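The electrical analogy can be made concrete with the usual series and parallel rules. The sketch below is one possible reading, under the assumption that every link behaves like a unit resistor and that the admissible paths between S and D are link-disjoint; the slides do not spell out the exact definition, so the helper names and the example are purely illustrative.

```python
def series(*resistances):
    """Links traversed one after another add up."""
    return sum(resistances)


def parallel(*resistances):
    """Alternative disjoint paths combine like parallel resistors."""
    return 1.0 / sum(1.0 / r for r in resistances)


# One 2-hop path between S and D (unit resistance per link).
single_path = series(1, 1)
# Two disjoint 2-hop paths between the same S and D.
two_paths = parallel(series(1, 1), series(1, 1))

if __name__ == "__main__":
    print(single_path, two_paths)
```

Under this reading, a pair of nodes joined by more admissible paths gets a smaller equivalent distance than a pair at the same hop count with a single path, which is how the metric can reflect both the topology and the routing algorithm.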
Multi Media System
➔ 8x8 mesh-based NoC
➔ 8-flit packets
➔ 3-flit buffers
➔ SS packet injection distribution
➔ 50 simulations per point
Source: Hu and Marculescu, TCAD 24(4), 2005
Simulation Results
Correlation Index for S1
Correlation Index for S8
Correlation Index for S10
Correlation with the Mean of the Avg Latencies
Correlation Coefficients
Cyclic Dependency
[Figure: given the Application (tasks T1, ..., Tn) and the Network Topology, the Mapping depends on the Routing Function and the Routing Function depends on the Mapping]
Design Space Exploration Flow
Experiments: Pareto front
MMS Traffic
[Figure legend: mono-objective mapping with XY routing algorithm; mono-objective mapping with customized routing algorithm; multi-objective Pareto mapping with customized routing algorithm]
Experiments: Dead Comms
MMS Traffic
[Figure: percentage of dead communications (0% to 90%) vs. percentage of faulty links (0.5%, 1%, 2%, 3%, 4%, 5%, 10%, 15%) for MCBMAP-XY, MCBMAP-AS, MOGARMAP-MC, MOGARMAP-RI]
Experiments: Delay
Experiments: Summary
[Figure, two bar charts with MCBMAP-XY as baseline: saturation pir improvement and average delay reduction (0% to 80%) for MCBMAP-AS, MOGARMAP-RI, MOGARMAP-MC under Uniform, Hot-spot and MMS traffic, plus the Average]
Outline
✓ Application Specific Routing Algorithms
✓ Concurrent Mapping and Routing
Dealing with Manufacturing Defects
Encoding Scheme for Low Power
Introduction and Motivations
Initial yield of a complex SoC is very small
➔ Yield goes down with size and complexity of the chip
➔ Dealing with the reduction of yield due to manufacturing defects
✔ Isolate faulty blocks of the chip
✔ Use the chip as a “depowered” chip (e.g., Sun UltraSparc T1)
On-chip communication system
➔ Represents the heart of the system
➔ Takes a sizable share of the system silicon area
✔ E.g., 20% of the silicon area in Intel's TeraScale 80-core chip
✔ High probability of being affected by manufacturing defects
Managing Faulty Links
[Figure: mesh with source S, destination D and faulty links]
Faulty links elimination
➔ The routing function is computed on the network from which all the fully faulty or partially faulty links have been removed
Managing Faulty Links
Partially Faulty Links usage
➔ The routing function is computed on the network from which all partially faulty links with a fault degree greater than a certain threshold have been removed
[Figure: Router with TFM_TX (flit splitter) and TFM_RX (flit assembler); labels: Flit, HFlit, LFlit]
Two Questions
How are routing paths determined?
➔ Routing function
When should a partially faulty link be used?
➔ Selection function
[Figure: the packet header H (source & destination addresses, followed by the payload) feeds the Routing Function, which returns the Admissible Output Channels; the Selection Function uses Network Conditions Information to pick the Output Channel]
Routing Function
Inputs: Communication Graph (application to be mapped, tasks T1, ..., Tn), Network Topology (nodes P1, ..., P13), Mapping Function
APSRA [Palesi, et al., CODES+ISSS’06]
➔ GOAL: maximize adaptivity
➔ Output: Routing Tables
Compression [Palesi, et al., SAMOS’06]
➔ Inputs: Routing Tables, Memory budget
➔ Output: Compressed Routing Tables
Selection Function
Faulty Links elimination (FE)
Partially Faulty Links usage strategy (FU)
Partially Faulty Links usage with Lookahead strategy (FUL)
Load Balancing based strategies (LB)
➔ FE+LB
➔ FU+LB
➔ FUL+LB
Partially Faulty Links usage Strategy
Fault degree (FD) of a link: e.g., a link with 16 lines, 4 of which are faulty, has FD = 4/16
In low traffic conditions
➔ Only fault-free links are chosen, if available. Otherwise, links with the lowest FD are chosen
In high traffic conditions
➔ A link is selected with a probability inversely proportional to its FD
$Pr_F(l_i) = \frac{1 - FD(l_i)}{\sum_{l_j \in L_{ao}} \left(1 - FD(l_j)\right)}$
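The FU selection rule can be sketched as follows. The sketch assumes that the admissible outputs are given as a mapping from a port label to its fault degree, that ties in low traffic are broken at random, and that the traffic condition is supplied by the caller; these choices and all names are illustrative, not the router implementation from the talk.

```python
import random


def fu_probabilities(fd):
    """Pr_F for each admissible output: (1 - FD) normalized over the set."""
    weights = {link: 1.0 - degree for link, degree in fd.items()}
    total = sum(weights.values())
    if total == 0:
        raise ValueError("every admissible output is fully faulty")
    return {link: weight / total for link, weight in weights.items()}


def fu_select(fd, high_traffic, rng=random):
    """Pick an output link according to the FU strategy."""
    if not high_traffic:
        # Low traffic: fault-free links if any, else the lowest FD.
        best = min(fd.values())
        candidates = [link for link, degree in fd.items() if degree == best]
        return rng.choice(candidates)  # random tie-break (assumption)
    # High traffic: sample with probability Pr_F.
    probs = fu_probabilities(fd)
    links = list(probs)
    return rng.choices(links, weights=[probs[l] for l in links], k=1)[0]


# Hypothetical admissible outputs: N is fault free, E has FD = 4/16.
example_fd = {"N": 0.0, "E": 4 / 16, "S": 0.5}

if __name__ == "__main__":
    print(fu_probabilities(example_fd))
    print(fu_select(example_fd, high_traffic=False))
```

In high traffic the degraded links still receive a share of the packets, which spreads the load instead of saturating the few fault-free links.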
Detecting Traffic Conditions
Hu and Marculescu, DyAD: Smart Routing for Networks-on-Chip, DAC 2004
If max{SN, SE, SS, SW} > T
    High traffic conditions
else
    Low traffic conditions
[Figure: router with the quantities SN, SE, SS, SW for its four neighbors; label: Free slots]
Limitation of FU strategy
The selection function used in FU does not take the entire path into consideration
➔ It takes the decision based solely on the quality of the next link
➔ This can cause inefficiencies
✔ E.g., although the next link is of high quality (e.g., fault-free), the rest of the path(s) on which the message will be obliged to travel may be formed by many low-quality links
[Figure: paths from ns to nd]
FU with Lookahead Strategy (FUL)
[Figure: paths from ns to nd over links li, lj, lk, ll, lm, ln, annotated with FD(li), FD(lj), FD(lk), FD(ll), FD(lm), FD(ln)]
Routing Table
dst | AOs (N E S W) | ERM (N E S W)
    | 0 1 1 0       | 0 1 0 0
ERM bit E: set if the equivalent resistance of paths having as first link E and ending at node is minimum
FU with Lookahead Strategy (FUL)
In low traffic conditions
➔ One of the AO links li such that the i-th bit of ERM is set is randomly chosen
✔ If multiple fault-free paths exist, they are always used
In high traffic conditions
➔ The same as FU
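The low-traffic rule of FUL amounts to masking the admissible outputs with the ERM bits and picking one of the survivors at random. The sketch below assumes both fields are stored as four flags in N, E, S, W order, as in the routing table entry shown earlier; the fallback when no ERM bit is set is an assumption added so the function always returns a port.

```python
import random

PORTS = ("N", "E", "S", "W")


def ful_low_traffic_select(aos, erm, rng=random):
    """Randomly choose an admissible output whose ERM bit is set."""
    candidates = [p for p, a, e in zip(PORTS, aos, erm) if a and e]
    if not candidates:
        # Assumed fallback: any admissible output.
        candidates = [p for p, a in zip(PORTS, aos) if a]
    return rng.choice(candidates)


if __name__ == "__main__":
    # Entry from the routing table slide: AOs = 0110, ERM = 0100.
    print(ful_low_traffic_select((0, 1, 1, 0), (0, 1, 0, 0)))
```

With the entry from the slide, E and S are admissible but only E has its ERM bit set, so E is always selected.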
Load Balancing Technique
Determine the best set of selection probabilities, i.e., the one that distributes the traffic over the network optimally
[Figure: Selection Function at a router; the set of admissible outputs for a certain destination d (e.g., lE, lS, lSE among the links lN, lNE, lE, lS, lSE) and the selection probabilities Pr(lE), Pr(lS), Pr(lSE) for a given router and a given destination]
Load Balancing Technique
Traffic Function TF(l, X)
➔ Returns an estimation of the traffic load on network link l when the set of selection probabilities X is used
Goal
$\min_{X} \operatorname{var}_{l \in L} TF(l, X)$
➔ Such that X is a feasible set of selection probabilities ($x_i \in [0,1]$, $\sum x_{l_i,d,l_o} = 1$)
➔ Sequential Quadratic Programming optimization
Load Balancing based Strategies
Fault Elimination + LB (FE+LB)
➔ Selection function based on the selection probabilities computed by solving the LB problem (Pr(LB))
Partially Faulty Links usage strategy + LB (FU+LB)
➔ In low traffic conditions
✔ Selection function uses Pr(LB)
➔ In high traffic conditions
✔ Pr(FLB) = Pr(LB) · Pr(F)
Partially Faulty Links usage with Lookahead strategy + LB (FUL+LB)
➔ In low traffic conditions
✔ Selection function uses Pr(LB) among the AOs where ERM is set
➔ In high traffic conditions
✔ Like FU+LB
Entry in Routing Table
FE:     dst | AOs
FU:     dst | AOs | Pr(F)
FE+LB:  dst | AOs | Pr(LB)
FUL:    dst | AOs | Pr(F) | ERM
FU+LB:  dst | AOs | Pr(LB) | Pr(FLB)
FUL+LB: dst | AOs | Pr(LB) | Pr(FLB) | ERM
Size: grows from FE to FUL+LB
Router Architecture (FUL+LB)
[Figure legend: FE, FE+LB (2 bits), FE+LB (3 bits), FE+LB (continuous)]
Implication on Router Design
[Figure: Area Breakdown and Power Breakdown (0% to 100%) for FE, FU, FUL, FE+LB, FU+LB, FUL+LB; components: TFM RX, TFM TX, CTRL, ARBITER, ROUTING TABLE, SELECTOR, CROSSBAR, FIFO]
Implication on Router Design
[Figure: Area, Timing and Power (scale 0.00 to 1.40) for FE, FU, FUL, FE+LB, FU+LB, FUL+LB]
Performance Evaluation
Simulation platform: Noxim
Traffic scenarios: uniform, transpose, bit-reversal, butterfly, shuffle
8x8 mesh-based NoC architecture
➔ Buffers: 4 flits
➔ Packet size: random, 2–16 flits
➔ Poisson packet injection distribution
➔ Simulation time: 100,000 clock cycles (warm-up session: 10,000 clock cycles)
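For readers unfamiliar with the synthetic traffic scenarios, the sketch below gives the textbook bit-permutation definitions of three of them, where the destination is a fixed function of the source node index. With 64 nodes the index has 6 bits. These are the common definitions from the interconnection-network literature; the exact variants implemented in Noxim may differ, and the function names are chosen here for illustration.

```python
def bit_reversal(node, bits):
    """Destination index is the source index with its bits reversed."""
    return int(format(node, "0{}b".format(bits))[::-1], 2)


def shuffle(node, bits):
    """Destination index is the source index rotated left by one bit."""
    mask = (1 << bits) - 1
    return ((node << 1) | (node >> (bits - 1))) & mask


def butterfly(node, bits):
    """Destination index swaps the most and least significant bits."""
    msb = (node >> (bits - 1)) & 1
    lsb = node & 1
    if msb == lsb:
        return node
    return node ^ ((1 << (bits - 1)) | 1)


if __name__ == "__main__":
    BITS = 6  # 8x8 mesh, 64 nodes
    for src in (1, 33, 32):
        print(src, bit_reversal(src, BITS), shuffle(src, BITS),
              butterfly(src, BITS))
```

Unlike uniform traffic, these patterns concentrate load on specific regions of the mesh, which is why routing algorithms rank differently under them.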
Delay Variation
[Figure: delay variation under uniform traffic]
Energy Variation
[Figure: energy variation under uniform traffic]
Improvement in Saturation pir
[Figure: increase in saturation pir (0% to 40%) for FU, FUL, FE+LB, FU+LB, FUL+LB under Uniform, Transpose, Bit reversal, Butterfly and Shuffle traffic]
Improvement in Average Delay
[Figure: reduction in average delay (0% to 100%) for FU, FUL, FE+LB, FU+LB, FUL+LB under Uniform, Transpose, Bit reversal, Butterfly and Shuffle traffic]
Reduction in Energy
[Figure: reduction in energy (-5% to 25%) for FU, FUL, FE+LB, FU+LB, FUL+LB under Uniform, Transpose, Bit reversal, Butterfly and Shuffle traffic]
Delay vs. Percentage of Link Faults
[Figure, two plots comparing FE and FUL+LB: faults randomly distributed; faults randomly clustered]
Outline
✓ Application Specific Routing Algorithms
✓ Concurrent Mapping and Routing
✓ Dealing with Manufacturing Defects
Encoding Scheme for Low Power
Motivations
The interconnect system is one of the main contributors to both the power dissipation and the energy consumption of the system
➔ Approx. 28% in Intel’s 80-tile TeraFLOPS
As technology shrinks, the power ratio between NoC links and routers increases
➔ Links are becoming more power hungry than routers
General Scheme – bus-based Archs
[Figure: processor P connected to Mem through a long, high-capacitance bus; an Encoder and a Decoder are added on the bus, at the cost of some overhead]
Encoding Schemes
Applied in the context of bus-based architectures
➔ Bus-invert method [Stan and Burleson, TVLSI'95]
➔ Gray code [Su et al., D&T'94]
➔ T0 method [Benini et al., GLSVLSI'97]
➔ Working-zone encoding [Mussoll et al., PDAW'98]
➔ …
Do not take into account coupling effects
➔ Dominant in DSM regime
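As background for the schemes listed above, the sketch below shows the classic bus-invert idea: if sending the next word would toggle more than half of the bus lines, the inverted word is sent instead and an extra invert line is raised. It is a simplified illustration that assumes the bus starts at all zeros and ignores the transitions of the invert line itself; it counts only self transitions, so it does not model the coupling effects mentioned on the slide. The function names are hypothetical.

```python
def bus_invert_encode(words, width):
    """Encode a sequence of words; returns (bus_value, invert_bit) pairs."""
    mask = (1 << width) - 1
    previous = 0  # assumed initial state of the bus
    encoded = []
    for word in words:
        word &= mask
        toggles = bin(previous ^ word).count("1")
        if 2 * toggles > width:
            word ^= mask  # send the complement instead
            invert = 1
        else:
            invert = 0
        encoded.append((word, invert))
        previous = word
    return encoded


def bus_invert_decode(encoded, width):
    """Recover the original words from (bus_value, invert_bit) pairs."""
    mask = (1 << width) - 1
    return [value ^ mask if invert else value for value, invert in encoded]


if __name__ == "__main__":
    data = [0x00, 0xFF, 0xFE]
    sent = bus_invert_encode(data, 8)
    print(sent, bus_invert_decode(sent, 8))
```

For the three-byte example, the data lines toggle once in total instead of nine times, at the price of one extra line.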
General Scheme – NoC Archs
[Figure: a source Processing Element PEs and a destination PEd, each attached through a Network Interface (NI) to the Routers R0, R1, ..., Rh along the path, with an encoder (E) and a decoder (D)]
Power Model
$P = [T_{0 \to 1}(C_s + C_l) + T_c C_c] V_{dd}^2 F_{CK}$
$T_c = k_1 T_1 + k_2 T_2 + k_3 T_3 + k_4 T_4$
Proposed Scheme (1/2)
$P \propto T_{0 \to 1} C_s + (k_1 T_1 + k_2 T_2 + k_3 T_3 + k_4 T_4) C_c$
$P' \propto T'_{0 \to 1} C_s + (k_1 T'_1 + k_2 T'_2 + k_3 T'_3 + k_4 T'_4) C_c$
Invert if P > P'
Proposed Scheme (2/2)
Invert if P > P'
$T_{0 \to 1} + 8 T_2 > T_{0 \to 0} + 8 T_4^{**}$
Encoder and Decoder
Partitions
[Figure: three partitioning options for a 32-bit Data in: one 32-bit encoder with a single inv line; two 16-bit encoders producing Data out[0..15] and Data out[16..31] with inv0 and inv1; four 8-bit encoders producing Data out[0..7], Data out[8..15], Data out[16..23] and Data out[24..31] with inv0 to inv3]
Overhead
[Figure: percent overhead (0% to 12%) in Area, Delay and Power for BI8, BI16, BI32, CDBI8, CDBI16, CDBI32, SC8, SC16, SC32]
Power and Energy Saving
[Figure: percent power saving and percent energy saving for BI-32, CDBI-32, SC-32, BI-16, CDBI-16, SC-16, BI-8, CDBI-8, SC-8 on Text, PDF, Pic BW bmp, Pic C bmp, Pic BW jpg, Pic C jpg, Music and Video data]
Saving vs. Path Length
[Figure: power saving and energy saving vs. number of hops (1 to 16) for BI32, CDBI32, SC32, Scs32, BI16, CDBI16, SC16, Scs16, BI8]
Outline
✓ Application Specific Routing Algorithms
✓ Concurrent Mapping and Routing
✓ Dealing with Manufacturing Defects
✓ Encoding Scheme for Low Power
References (1/3)
M. Palesi, S. Kumar, V. Catania. Leveraging Partially Faulty Links Usage for Enhancing Yield and Performance in Networks on Chip. Accepted for publication in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.
M. Palesi, S. Kumar, V. Catania. Bandwidth Aware Routing Algorithms for Networks-on-Chip Platforms. Computers & Digital Techniques, IET, Vol. 3, No. 5 (11 August 2009), pp. 413–429.
M. Palesi, R. Holsmark, S. Kumar, V. Catania. Application Specific Routing Algorithms for Networks on Chip. IEEE Transactions on Parallel and Distributed Systems, 20(3), pp. 316–330, March 2009.
A. Mejia, M. Palesi, J. Flich, S. Kumar, P. Lopez, R. Holsmark and J. Duato. Region-Based Routing: A Mechanism to Support Efficient Routing Algorithms in NoCs. IEEE Transactions on Very Large Scale Integration Systems, 17(3), pp. 356–369, March 2009.
G. Ascia, V. Catania, M. Palesi, D. Patti. Implementation and Analysis of a New Selection Strategy for Adaptive Routing in Networks-on-Chip. IEEE Transactions on Computers, 57(6), pp. 809–820, June 2008.
R. Holsmark, M. Palesi, S. Kumar. Deadlock free Routing Algorithms for Irregular Mesh Topology NoC Systems with Rectangular Regions. Journal of Systems Architecture, 54/3-4 (2008), pp. 427–440.
D. Bertozzi, S. Kumar, M. Palesi. Networks-on-Chip: Emerging Research Topics and Novel Ideas. VLSI Design, vol. 2007, Article ID 26454, doi:10.1155/2007/26454.
G. Ascia, V. Catania, M. Palesi. A Multiobjective Genetic Approach to Mapping Problem on Network-on-Chip. Journal of Universal Computer Science, 12(4):370–394, 2006.
G. Ascia, V. Catania, M. Palesi. Mapping Cores on Network-on-Chip. International Journal of Computational Intelligence Research (IJCIR), ISSN 0972-9836, 1(1-2):109–126, 2005.
M. Palesi, F. Fazzino, G. Ascia, V. Catania. Data Encoding for Low-Power in Wormhole-Switched Networks-on-Chip. To appear in The 12th Euromicro Conference on Digital System Design (DSD), 27–29 Aug 2009, Patras, Greece.
References (2/3)
R. Tornero, V. Sterrantino, M. Palesi, J. M. Orduna. A Multiobjective Strategy for Concurrent Mapping and Routing in Networks on Chip. To appear in The 12th International Workshop on Nature Inspired Distributed Computing held in conjunction with The 23rd IEEE/ACM International Parallel and Distributed Processing, May 25–28, 2009, Rome, Italy.
R. Holsmark, M. Palesi, S. Kumar, A. Mejia. HiRA: A Methodology for Deadlock Free Routing in Hierarchical Networks on Chip. 3rd ACM/IEEE International Symposium on Networks on Chip, May 10–13, 2009, San Diego, CA.
D. Frazzetta, G. Dimartino, M. Palesi, S. Kumar, V. Catania. Efficient Application Specific Routing Algorithms for NoC Systems utilizing Partially Faulty Links. 11th EUROMICRO Conference on Digital System Design, Architectures, Methods and Tools, pp. 18–25, Sep. 3–5, 2008, Parma, Italy.
R. Tornero, J. M. Orduna, M. Palesi, J. Duato. A Communication-Aware Topological Mapping Technique for NoCs. International Conference on Parallel and Distributed Computing, pp. 910–919, August 26–29th, 2008, Las Palmas de Gran Canaria, Spain.
M. Palesi, G. Longo, S. Signorino, S. Kumar, R. Holsmark, V. Catania. Design of Bandwidth Aware and Congestion Avoiding Efficient Routing Algorithms for Networks-on-Chip Platforms. IEEE International Symposium on Networks-on-Chip, pp. 97–106, 7th–11th April 2008, Newcastle University, UK.
G. Longo, S. Signorino, M. Palesi, S. Kumar, R. Holsmark, V. Catania. Bandwidth Aware Routing Algorithms for Networks-on-Chip. 2nd Workshop on Interconnection Network Architectures: On-Chip, Multi-Chip. Goteborg, Sweden, January 27, 2008.
R. Tornero, J. M. Orduna, M. Palesi, J. Duato. A Communication-Aware Task Mapping Technique for NoCs. 2nd Workshop on Interconnection Network Architectures: On-Chip, Multi-Chip. Goteborg, Sweden, January 27, 2008.
M. Palesi, S. Kumar, R. Holsmark, V. Catania. Exploiting Communication Concurrency for Efficient Deadlock Free Routing in Reconfigurable NoC Platforms. IEEE International Parallel and Distributed Processing Symposium, pp. 1–8, Long Beach, CA, March 2007.
G. Ascia, V. Catania, M. Palesi, D. Patti. Neighbors-on-Path: A New Selection Strategy for On-Chip Networks. Fourth IEEE Workshop on Embedded Systems for Real Time Multimedia, pp. 79–84. Seoul, Korea, October 26–27, 2006.
References (3/3)
M. Palesi, R. Holsmark, S. Kumar, V. Catania. A Methodology for Design of Application Specific Deadlock-free Routing Algorithms for NoC Systems. International Conference on Hardware-Software Codesign and System Synthesis, pp. 142–147. Seoul, Korea, October 22–25, 2006.
R. Holsmark, M. Palesi, S. Kumar. Deadlock Free Routing Algorithms for Mesh Topology NoC Systems with Regions. DSD 2006, 9th EUROMICRO Conference on Digital System Design, Architectures, Methods and Tools, pp. 696–703. Croatia, Sept 2006.
M. Palesi, S. Kumar, R. Holsmark. A Method for Router Table Compression for Application Specific Routing in Mesh Topology NoC Architectures. SAMOS VI Workshop: Embedded Computer Systems: Architectures, Modeling, and Simulation, pp. 373–384. Samos, Greece, July 17–20, 2006.
G. Ascia, V. Catania, M. Palesi, D. Patti. A New Selection Policy for Adaptive Routing in Network on Chip. International Conference on Electronics, Hardware, Wireless and Optical Communications. Madrid, Spain, February 15–17, 2006.
G. Ascia, V. Catania, M. Palesi. An Evolutionary Approach to Network-on-Chip Mapping Problem. IEEE Congress on Evolutionary Computation. Edinburgh, UK, September 2nd–5th, 2005.
G. Ascia, V. Catania, M. Palesi. Multi-objective Mapping for Mesh-based NoC Architectures. In Second IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, pages 182–187, Stockholm, Sweden, Sept. 8–10, 2004.