Architecture, Reconfiguration and Modeling

Muhammad Yasir Qadri and Stephen J. Sangwine (editors)

Multicore Technology: Architecture, Reconfiguration and Modeling

Contents

List of Figures
List of Tables
Preface
Contributors

I Architecture and Design Flow

1 MORA: High-Level FPGA Programming Using a Many-core Framework
  Wim Vanderbauwhede, Sai Rahul Chalamalasetti, and Martin Margala
  1.1 Overview of the state of the art in high-level FPGA programming
  1.2 Introduction to the MORA framework
    1.2.1 MORA Concept
    1.2.2 MORA Tool Chain
  1.3 The MORA Reconfigurable Cell
    1.3.1 Processing Element
    1.3.2 Control Unit and Address Generator
    1.3.3 Asynchronous Handshake
    1.3.4 Execution Model
  1.4 The MORA Intermediate Representation
    1.4.1 Expression Language
    1.4.2 Coordination Language
    1.4.3 Generation Language
    1.4.4 Assembler
  1.5 MORA-C++ API
    1.5.1 Key Features
    1.5.2 MORA-C++ by Example
    1.5.3 MORA-C++ Compilation
    1.5.4 Floating-Point Compiler (FloPoCo) Integration
  1.6 Hardware Infrastructure for the MORA Framework
    1.6.1 Direct Memory Access (DMA) Channel Multiplexing
    1.6.2 Vectorized RC Support
    1.6.3 Shared Memory Access
  1.7 Results
    1.7.1 Thousand-core Implementation
    1.7.2 Results
    1.7.3 Comparison with Other DCT Implementations
  1.8 Conclusion and Future Work

2 Implementing Time-Constrained Applications on a Predictable MPSoC
  Sander Stuijk, Akash Kumar, Roel Jordans, and Henk Corporaal
  2.1 Introduction
  2.2 Application modeling and programming
  2.3 Platform architecture
    2.3.1 Processing element
    2.3.2 Network interface
    2.3.3 Interconnect
  2.4 Design flow
  2.5 Platform generation
  2.6 Case study
    2.6.1 Throughput guarantees
    2.6.2 Designer effort
    2.6.3 Overhead
  2.7 Conclusions

3 SESAM Prototyping Solution
  Nicolas Ventroux, Tanguy Sassolas, Alexandre Guerre, and Caaliph Andriamisaina
  3.1 Introduction
  3.2 Existing work
  3.3 SESAM overview
  3.4 SESAM infrastructure
    3.4.1 Approximate-timed TLM
    3.4.2 Interconnections
    3.4.3 ArchC
    3.4.4 Instruction Set Simulators
    3.4.5 Traffic generators
    3.4.6 MMU and TLBs
    3.4.7 CLU
    3.4.8 Memory
    3.4.9 DMA
    3.4.10 Control Manager
  3.5 SESAM programming and execution models
    3.5.1 Programming model
    3.5.2 Execution models
    3.5.3 Hardware Abstraction Layer
  3.6 SESAM debug
    3.6.1 GDB stub
    3.6.2 Debug and task migration
    3.6.3 Specializing GDB’s functionalities
  3.7 SESAM energy modeling
    3.7.1 PowerArchC
    3.7.2 DVFS and DPM
    3.7.3 Scheduling example
  3.8 SESAM exploration
    3.8.1 Dynamic parameters
    3.8.2 Application parallelism exploration
    3.8.3 Distributed simulations
  3.9 Use case
    3.9.1 SCMP overview
    3.9.2 Implemented applications
  3.10 Validation
    3.10.1 SESAM accuracy
    3.10.2 SESAM simulation speed
    3.10.3 SESAM sizing example: NoC study
    3.10.4 Performance study in SESAM
    3.10.5 Power management evaluation in SESAM
  3.11 Conclusion

II Parallelism and Optimization

4 Verified Multicore Parallelism using Atomic Verifiable Operations
  Michal Dobrogost, Christopher Kumar Anand, and Wolfram Kahl
  4.1 Introduction
    4.1.1 Novelty
    4.1.2 Impact
    4.1.3 Chapter Organization
  4.2 Notation
    4.2.1 Map Modification Notation
    4.2.2 Groups
    4.2.3 Disjoint Unions
  4.3 Background and Previous Work
    4.3.1 Concurrency Verification
    4.3.2 Motivating Example
    4.3.3 Strictly Forward Inspection of Partial Order
    4.3.4 The Follows Map (Φ)
    4.3.5 Φ Slices
    4.3.6 Discussion of the Follows Map (Φ)
    4.3.7 Merging Φ Slices to Strengthen our Partial Order
    4.3.8 State
    4.3.9 Verification Step
  4.4 The Loop AVOp
    4.4.1 Loop AVOp Indexing
    4.4.2 Loop definition
  4.5 Efficient Verification of Looping Programs
    4.5.1 Locally Sequential Loops
    4.5.2 Bumping State to the Next Iteration via Λ
    4.5.3 Loop Short Circuit Theorem
    4.5.4 Verifying Nested Loops
    4.5.5 Verifier Run Time
  4.6 Rewritable Loops
    4.6.1 Motivating example
    4.6.2 The Rewrite Map: ρ
    4.6.3 Formulation of AVOps as Functions on the State
    4.6.4 Induced Rewrites and Rewriting AVOps
    4.6.5 Short Circuit Theorem for Rewritable Loops
    4.6.6 Verifier Run Time on Rewritable Loops
    4.6.7 Memory requirements
  4.7 Extension to Full AVOp Set
    4.7.1 Extension of AVOp State and Hazard Checking
    4.7.2 Extension of the Rewrite Map
    4.7.3 Extension of Loops for Accessing Global Memory
    4.7.4 Extension of Verification Algorithm
  4.8 Future Work
    4.8.1 Restricted Global Memory Access for Verification
    4.8.2 Stronger Short Circuit Theorem for Rewritable Loops
    4.8.3 Breaking up AVOp Streams to Increase AVOp Feed Rate
    4.8.4 Verifying the Verifier
    4.8.5 Modifications for Cached Memory Multicore Processors
    4.8.6 Scheduling
    4.8.7 Fault Tolerance
  4.9 Conclusion

5 Accelerating Critical Section Execution with Multi-Core Architectures
  M. Aater Suleman and Onur Mutlu
  5.1 Introduction
  5.2 The Problem
  5.3 Accelerated Critical Sections (ACS)
    5.3.1 Design
    5.3.2 False Serialization
  5.4 Trade-off Analysis: Why ACS Works
  5.5 Results
    5.5.1 Performance at Optimal Number of Threads
    5.5.2 Performance when the Number of Threads equals the Number of Contexts
    5.5.3 Application Scalability
    5.5.4 ACS on Symmetric CMP
    5.5.5 ACS versus Techniques to hide critical section latency
  5.6 Contributions and Impact
  5.7 Related Previous Work
    5.7.1 Improving Locality of Shared Data and Locks
    5.7.2 Hiding the Latency of Critical Sections
    5.7.3 Asymmetric CMPs and CoreFusion
    5.7.4 Remote Procedure Calls
  5.8 Conclusion

III Memory Systems

6 TMbox: Hybrid Transactional Memory System
  Nehir Sonmez, Oriol Arcas, Osman S. Unsal, Adrián Cristal, and Satnam Singh
  6.1 Introduction
    6.1.1 FPGAs for Architectural Investigation
    6.1.2 Transactional Memory
    6.1.3 Contributions
  6.2 The TMbox Architecture
    6.2.1 Interconnection
  6.3 Hybrid TM Support for TMbox
    6.3.1 Instruction Set Architecture Extensions
    6.3.2 Bus Extensions
    6.3.3 Cache Extensions
  6.4 Experimental Evaluation
    6.4.1 Architectural Benefits and Drawbacks
    6.4.2 Experimental Results
  6.5 Related Work
  6.6 Conclusions

7 EM²: A Scalable Shared Memory Architecture for Large-scale Multicores
  Omer Khan, Mieszko Lis, Keun Sup Shim, Myong Hyon Cho, and Srinivas Devadas
  7.1 Background
  7.2 Migration-based Memory Coherence
    7.2.1 Remote-access-only (RA) Architecture
    7.2.2 The Execution Migration Machine (EM²)
    7.2.3 Hybrid EM² Architecture
    7.2.4 Hardware-level Migration Framework
    7.2.5 Data Placement
  7.3 Analytical Models: Directory Coherence versus EM²
    7.3.1 Interconnect Traversal Costs
    7.3.2 Off-chip Memory Access Costs
    7.3.3 EM² Memory Access Latency
    7.3.4 Directory Coherence Memory Access Latency
  7.4 Methods
    7.4.1 Architectural Simulation
    7.4.2 On-chip Interconnect Model
    7.4.3 Application Benchmarks
    7.4.4 Directory-based Cache Coherence Baseline Selection
    7.4.5 Remote-access NUCA Baseline Selection
    7.4.6 Cache Size Selection
    7.4.7 Instruction Cache
    7.4.8 Area and Energy Estimation
  7.5 Results and Analysis
    7.5.1 Advantages over Directory-based Cache Coherence
    7.5.2 Advantages over Traditional NUCA (RA)
    7.5.3 Overall Area, Performance and Energy
    7.5.4 Performance Scaling Potential for EM² Designs
  7.6 Related Work
    7.6.1 Thread Migration
    7.6.2 Remote-access NUCA and Directory Coherence
  7.7 Conclusion

8 CAFÉ: Cache-Aware Fair and Efficient Scheduling for CMPs
  Richard West, Puneet Zaroo, Carl A. Waldspurger, and Xiao Zhang
  8.1 Introduction
  8.2 Cache Occupancy Estimation
    8.2.1 Basic Cache Model
    8.2.2 Extended Cache Model for LRU Replacement Policies
    8.2.3 Experiments
  8.3 Cache Utility Curves
    8.3.1 Curve Types
    8.3.2 Curve Generation
    8.3.3 Experiments
    8.3.4 Discussion
  8.4 Cache-Aware Scheduling
    8.4.1 Fair Scheduling
    8.4.2 Efficient Scheduling
  8.5 Related Work
  8.6 Conclusions and Future Work

IV Debugging

9 Software Debugging Infrastructure for Multi-Core Systems-on-Chip
  Bojan Mihajlović, Warren J. Gross, and Željko Žilić
  9.1 Introduction
  9.2 Software Debugging
    9.2.1 Debugger Programs
    9.2.2 Fault types
    9.2.3 Multi-threaded Software
  9.3 Traditional Debugging Methods
    9.3.1 Software Instrumentation
    9.3.2 Scan-chain Methods
    9.3.3 In-Circuit Emulation
  9.4 Debugging with Trace Generation
    9.4.1 Triggers
    9.4.2 Trace Ordering
    9.4.3 Debug Interface
    9.4.4 Data Volume
  9.5 Generalized Debugging Procedure
  9.6 Trace Compression Scheme
    9.6.1 Overview
    9.6.2 Consecutive Address Elimination
    9.6.3 Finite Context Method
    9.6.4 Move-to-Front and Address Encoding
    9.6.5 Data Stream Serializer
    9.6.6 Run-length and Prefix Encoding
    9.6.7 Lempel-Ziv Encoding
  9.7 Conclusions
  9.8 Glossary

V Networks-on-Chip

10 On Chip Interconnects For Multi-core Architectures
  Prasun Ghosal and Soumyajit Poddar
  10.1 Introduction
  10.2 Evolution of Interconnects for Multi-core Architectures
    10.2.1 More than Moore Trends: A New Perspective
    10.2.2 From Single Bus Based to Network-on-Chip Architectures
    10.2.3 On Chip Applications
  10.3 Emerging Technologies for Interconnections
    10.3.1 Three Dimensional Interconnects
    10.3.2 Photonic Interconnects
    10.3.3 Wireless Interconnect Technology
    10.3.4 RF Waveguide Interconnects
    10.3.5 Carbon Nanotubes (CNT)
  10.4 Conclusion and Future Research Directions

11 Routing in Multi-core NoC
  Prasun Ghosal and Tuhin Subhra Das
  11.1 Introduction
  11.2 Routing Topologies in NoC
    11.2.1 Topologies in 2D NoCs
    11.2.2 Topologies in 3D NoCs
    11.2.3 Topologies in Optical or Photonic NoCs
    11.2.4 Topologies in Wireless NoCs
  11.3 Design of Router
    11.3.1 Channel
    11.3.2 Virtual Channel
    11.3.3 Buffer Organization
  11.4 Switching Techniques
    11.4.1 Circuit switching
    11.4.2 Packet Switching
  11.5 Routing Flow Control
    11.5.1 Store-and-Forward
    11.5.2 Virtual Cut-Through
    11.5.3 Wormhole
  11.6 Traffic Patterns
    11.6.1 Synthetic Traffic
    11.6.2 Realistic Traffic
  11.7 Routing Algorithms
    11.7.1 Oblivious routing
    11.7.2 Adaptive routing
  11.8 Problems of Routing in NoC
  11.9 Emerging Techniques in NoC Routing
    11.9.1 Routing in Optical NoC
    11.9.2 Wireless NoC
  11.10 Conclusion

12 Efficient Topologies for 3-D Network-on-Chip
  Mohammad Ayoub Khan and Abdul Quaiyum Ansari
  12.1 Introduction
    12.1.1 Classification of Network Topologies
    12.1.2 Topology Properties
    12.1.3 Performance Evaluation Parameters
    12.1.4 Basic 3-D Topologies
    12.1.5 Power Consumption Issues in 3-D Topologies
  12.2 Related Work
  12.3 Binary Search Tree based Ring Topology
    12.3.1 Number of nodes (N) at lth level
    12.3.2 Average Degree (d) of the Network at lth level
    12.3.3 Diameter (D) of Level l network
  12.4 Layout and Implementation
  12.5 Discussion and Analysis
  12.6 Conclusions
  12.7 Glossary

13 Network-on-Chip Performance Evaluation using an Analytical Method
  Sahar Foroutan, Abbas Sheibanyrad, and Frédéric Pétrot
  13.1 Introduction
  13.2 Network-on-Chip Concepts
    13.2.1 Architectural Parameters
    13.2.2 Layered Concept
    13.2.3 Levels-of-Abstraction
    13.2.4 Design Flow
    13.2.5 Parameters Addressed in NoC Performance Evaluation
  13.3 The State-of-the-Art in NoC Performance Evaluation
    13.3.1 Simulation-Based Methods
    13.3.2 Analytical Methods
  13.4 An Analytical Performance Evaluation Method
    13.4.1 The Method
    13.4.2 Validation of the Method
  13.5 Conclusion

Bibliography

List of Figures

1.1 MORA-C++ Tool Chain
1.2 MORA Reconfigurable Cell (RC)
1.3 Block Diagram of the MORA PE
1.4 MORA Control Unit Flow Chart
1.5 MORA Address Generator
1.6 Reconfigurable Cell Floating-Point Core
1.7 Vector Support for RC Architecture
1.8 Shared Memory Access Interface Architecture
1.9 ADG diagram for the DCT small Algorithm
1.10 Slice Count for Benchmark Algorithms
1.11 Effect of vectorization on Slice/BRAM Counts
1.12 Throughput Versus Number of Lanes for 8-bit Benchmarks

2.1 SDF3/MAMPS design flow
2.2 Example SDFG and implementation of actor P
2.3 MAMPS platform architecture
2.4 MAMPS scheduling code for processing element
2.5 Overview of the SDF3 mapping flow
2.6 Dataflow model for interconnect communication in MAMPS
2.7 Two tile MAMPS platform in XPS
2.8 The SDF graph for the MJPEG decoder
2.9 Measured and guaranteed worst-case throughput

3.1 SESAM overview
3.2 SESAM infrastructure
3.3 Timed TLM versus Approximate-timed TLM
3.4 Routing example for a Mesh network
3.5 SESAM programming model
3.6 Structure of the debugging solution implemented in SESAM
3.7 PowerArchC: Power model generation flow
3.8 PowerArchC: Power-aware ISS architecture generation
3.9 DPM and DVFS techniques timing issues
3.10 Summary of buffer monitors and scheduling implications
3.11 SESAM exploration tool and environment
3.12 SESAM AGP toolchain
3.13 Example of automatic parallelization
3.14 Parallelization of SESAM simulations.
3.15 SCMP architecture
3.16 Evaluation of SESAM accuracy
3.17 SESAM simulation speed
3.18 Network performance results
3.19 SCMP performance profiling with a variable number of PE
3.20 Power Aware scheduling results with the WCDMA application

4.1 Locally sequential program
4.2 Φ is defined for other cores
4.3 Visualization of Φ-dependency.
4.4 Φ map at instruction SendSignal s3 → c3
4.5 Φ map at instruction WaitSignal s3
4.6 Example using the Loop AVOp
4.7 Example of an unrolled loop
4.8 Example with nested loops
4.9 Non-rewritable loop example
4.10 Motivating example for loop rewriting.
4.11 Motivating example unrolled.
4.12 Effects of an inner loop
4.13 Loop with rewriting verified without fully unrolling.
4.14 Rewritable Loop unrolled into a loop without a rewrite.
4.15 Defining diagram for projected rewrite
4.16 Accessing global memory

5.1 Amdahl’s serial part, parallel part, and critical section in a multi-threaded 15-puzzle kernel
5.2 Accelerated Critical Sections (ACS).
5.3 Source code and its execution: baseline and ACS
5.4 Execution time when number of threads is optimal for each application.
5.5 Speedup over a single small core
5.6 Execution time when number of threads equals number of contexts.
5.7 ACS versus TLR performance.

6.1 An 8-core TMbox infrastructure
6.2 TMbox MIPS assembly for atomic{a++}
6.3 Cache state diagram.
6.4 Eigenbench results on 1–16 cores.
6.5 SSCA2 benchmark results on 1–16 cores.
6.6 Intruder benchmark results on 1–16 cores.

7.1 Hybrid migration/remote-access architecture
7.2 Efficient execution migration in a five-stage CPU core
7.3 Average memory latency costs
7.4 Parallel completion time under different DirCC protocols.
7.5 Cache hierarchy miss rates at various cache sizes
7.6 The performance of DirCC (under a MOESI protocol)
7.7 Cache hierarchy miss rates for EM² and RA designs
7.8 Non-local memory accesses in RA baseline
7.9 Per-benchmark core miss rates
7.10 Core miss rates handled by remote accesses
7.11 The performance of EM² and RA variants relative to DirCC
7.12 Dynamic energy usage for all EM² and RA variants
7.13 EM² performance scales with network bandwidth.

8.1 Accuracy of basic Estimate-M method on dual-core system
8.2 Occupancy and estimation error
8.3 Two pairs of co-runners in dual-core systems
8.4 Cache occupancy for four co-runners in a quad-core system
8.5 Occupancy estimation for an over-committed quad-core system (Part 1).
8.6 Occupancy estimation for an over-committed quad-core system (Part 2).
8.7 Fine-grained occupancy estimation in over-committed quad-core system.
8.8 Effect of memory bandwidth contention on the MPKC miss-rate curve for the SPEC CPU2000 mcf workload.
8.9 Miss-ratio curves (MRCs) for various SPEC CPU workloads, obtained online by CAFÉ versus offline by page-coloring.
8.10 MRC for mcf with different co-runners.
8.11 Vtime compensation.
8.12 Cache divvying occupancy prediction.
8.13 Co-runner placement.

9.1 Remote Debugging Scenario – Software View
9.2 Debugging multiple cores through IEEE 1149.1 (JTAG)
9.3 Debugging a single-core SoC through In-Circuit Emulation
9.4 Debugging through Trace Generation and ICE
9.5 Example: Creating a tracepoint in the GDB debugger
9.6 Trace-based Debugging Scenario
9.7 Trace Compression Scheme
9.8 Finite Context Method
9.9 Huffman Tree for Prefix Encoding

10.1 Side view of multi-path interconnect
10.2 Network on chip concept
10.3 Reduction of interconnect length from 2D ICs to 3D ICs

10.4 Schematic representation of TSV first, middle, and last processes (The International Technology Roadmap for Semiconductors 2009 for Interconnects 2009)
10.5 Schematic of photonic interconnect using micro ring resonators
10.6 A simple schematic of a Micro ring resonator.
10.7 A photonic switch in the left side and a non-blocking photonic router on the right side (Wang et al. 2008)
10.8 Torus topology: (a) the photonic network (with the routers shown as yellow boxes), and (b) the electrical network (with the gateways shown as pink boxes) (Wang et al. 2008)
10.9 The concentrated mesh topology and the wireless routers shown from left to right respectively

11.1 Factors affecting the performance of a NoC
11.2 Mesh topology
11.3 Torus
11.4 Folded Torus
11.5 Octagon
11.6 Star
11.7 Binary tree
11.8 Butterfly
11.9 Butterfly fat tree
11.10 Honeycomb
11.11 Mesh-of-tree
11.12 Diametric 2D mesh
11.13 Diametric 2D mesh of tree
11.14 A 9 × 9 Structural Diametrical 2D Mesh
11.15 A 9 × 9 Star Type Topology
11.16 Custom mesh topology
11.17 3D irregular mesh
11.18 Dragonfly topology
11.19 Wireless mesh
11.20 MORFIC (Mesh Overlaid with RF Inter Connect)
11.21 Hybrid ring
11.22 Hybrid star
11.23 Hybrid tree
11.24 Hybrid irregular topology
11.25 A typical router architecture
11.26 Router Data Flow
11.27 Different Routing Policies
11.28 West First Turn
11.29 North Last Turn
11.30 North First Turn

12.1 Classification of interconnection networks
12.2 Basic Network Topologies
12.3 Diameter in a connected graph
12.4 3-D Mesh and Torus Topologies (Khan and Ansari 2011c)
12.5 Binary Tree (Khan and Ansari 2011b)
12.6 Proposed Topology with different levels (l = 1, l = 2, and l = 3) (Khan and Ansari 2011b)
12.7 Ring Based Tree Topology (Khan and Ansari 2011b)
12.8 3-D Tree-Mesh
12.9 Layout of the proposed topology (Khan and Ansari 2011b)
12.10 Number of nodes in level l (Khan and Ansari 2011b)
12.11 Degree and Diameter Analysis of the proposed topology (Khan and Ansari 2011b)

13.1 Operational Layered Concept of a NoC-based SoC Architecture
13.2 The relation between NoC layers and levels of abstraction from a performance evaluation viewpoint
13.3 A generic design flow for a NoC-based system
13.4 Performance requirements versus performance analysis
13.5 Optimization loop: architectural exploration and mapping exploration
13.6 At each router of the path, disrupting packets appear probabilistically in front of the tagged packet
13.7 Dependency trees corresponding to the latency (a) ‘core to south’, and (b) ‘core to east’ of r3,4 in a 6 × 5 2D-mesh NoC with x-first routing algorithm
13.8 Router delay model related to a 2D-mesh NoC
13.9 Buffer occupancy caused by Pj at time instant (a) t, and (b) t + 3, when Pj is transferred and Pi can be written into the buffer
13.10 The average number of accumulated flits in the output buffer at the arrival of Pi when there is no header contention
13.11 The order of delay component computation in one iteration
13.12 Iterative computation for inputs {1, 2, 3, 4} of router r
13.13 Latency/load curves for the path r2,4 → r4,2 with buffer lengths in flits as indicated and uniform traffic (path latency excludes the source queue waiting time)
13.14 Latency/load curves for the path r2,4 → r4,2 with buffer lengths in flits as indicated and localised traffic
13.15 Analytical method for different buffer lengths and 0.01 offered load steps
13.16 The average utilization of buffer r3,4 → r4,4 under two traffic distributions

List of Tables

1.1 Utilization results of Single Precision Floating Point Core on Virtex 4 LX200
1.2 Latency of Shared Memory Interface Modules
1.3 Benchmark Implementation Results (no Vectorization)
1.4 Benchmark Throughput Results for a Single DMA Channel without RC Vectorization
1.5 Benchmark Throughput Results with Multiple DMA Channels and RC Vectorization
1.6 DCT Benchmark Throughput Comparison

2.1 Designer effort

3.1 Hardware Abstraction Layer of SESAM
3.2 Basic remote protocol support commands
3.3 Additional remote protocol commands for fast debugging

5.1 Best number of threads for each configuration

6.1 LUT occupation of components of the Honeycomb core
6.2 HTM instructions for TMbox
6.3 TM Benchmarks Used

7.1 Various parameter settings for the analytical cost model for the ocean contiguous benchmark
7.2 System configurations used
7.3 Area and energy estimates
7.4 Synthetic benchmark settings

9.1 Example of CAE with 16-bit Addresses
9.2 Address Encoding Scheme
9.3 Example of Differential Address Encoding – 16-bit Addresses

11.1 Relative comparison of 2D irregular topologies
11.2 Comparison of optical network topologies
11.3 The Wavelength Assignment of 4-WRON

12.1 Classification of NoC Topology (Khan and Ansari 2011b)
12.2 Analysis of Network Parameters for Base Module (Khan and Ansari 2011b)

13.1 Characteristics of Analytical Methods
13.2 Parameters of the Analytical Performance Evaluation Method Presented in Section 13.4
13.3 Comparing simulation and analytical tool runtimes

Contributors

Christopher Kumar Anand
Department of Computing and Software, McMaster University, Hamilton, Ontario, Canada

Abdul Quaiyum Ansari
Jamia Millia Islamia, New Delhi, India

Oriol Arcas
Barcelona Supercomputing Center, Universitat Politècnica de Catalunya, Spain

Caaliph Andriamisaina
CEA LIST, Gif-sur-Yvette, France

Sai Rahul Chalamalasetti
Department of Electrical and Computer Engineering, University of Massachusetts, Lowell, MA, USA

Henk Corporaal
Department of Electrical Engineering, Eindhoven University of Technology, The Netherlands

Myong Hyon Cho
Massachusetts Institute of Technology, Cambridge, MA, USA

Adrián Cristal
Barcelona Supercomputing Center, CSIC - Spanish National Research Council, Spain

Tuhin Subhra Das
Department of Information Technology, Bengal Engineering and Science University, Shibpur, India

Srinivas Devadas
Massachusetts Institute of Technology, Cambridge, MA, USA

Michal Dobrogost
Department of Computing and Software, McMaster University, Hamilton, Ontario, Canada

Sahar Foroutan
Laboratoire TIMA, Grenoble, France

Prasun Ghosal
Department of Information Technology, Bengal Engineering and Science University, Shibpur, India

Warren J. Gross
Department of Electrical & Computer Engineering, McGill University, Montreal, Canada

Alexandre Guerre
Embedded Computing Lab, CEA LIST, Gif-sur-Yvette, France

Roel Jordans
Department of Electrical Engineering, Eindhoven University of Technology, The Netherlands

Wolfram Kahl
Department of Computing and Software, McMaster University, Hamilton, Ontario, Canada

Mohammad Ayoub Khan
Center for Development of Advanced Computing, Noida, India

Omer Khan
University of Connecticut, Storrs, CT, USA

Akash Kumar
Department of Electrical and Computer Engineering, National University of Singapore, Singapore

Mieszko Lis
Massachusetts Institute of Technology, Cambridge, MA, USA

Martin Margala
Department of Electrical and Computer Engineering, University of Massachusetts, Lowell, MA, USA

Bojan Mihajlović
Department of Electrical & Computer Engineering, McGill University, Montreal, Canada

Onur Mutlu
Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA

Frederic Pétrot
Laboratoire TIMA, Grenoble, France

Soumyajit Poddar
School of VLSI Technology, Bengal Engineering and Science University, Shibpur, India

Tanguy Sassolas
Embedded Computing Lab, CEA LIST, Gif-sur-Yvette, France

Hamed Sheibanyrad
Laboratoire TIMA, Grenoble, France

Keun Sup Shim
Massachusetts Institute of Technology, Cambridge, MA, USA

Satnam Singh
Google, Inc., Mountain View, CA, USA

Nehir Sonmez
Barcelona Supercomputing Center, Universitat Politècnica de Catalunya, Spain

Sander Stuijk
Department of Electrical Engineering, Eindhoven University of Technology, The Netherlands

M. Aater Suleman
Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX

Osman S. Unsal
Barcelona Supercomputing Center, Universitat Politècnica de Catalunya, Spain

Wim Vanderbauwhede
School of Computing Science, University of Glasgow, Scotland

Carl A. Waldspurger
Formerly at VMware Inc., Palo Alto, CA, USA

Richard West
Department of Computer Science, Boston University, Boston, MA, USA

Nicolas Ventroux
Embedded Computing Lab, CEA LIST, Gif-sur-Yvette, France

Puneet Zaroo
VMware Inc., Palo Alto, CA, USA

Željko Žilić
Department of Electrical & Computer Engineering, McGill University, Montreal, Canada

Xiao Zhang
Google, Inc., Mountain View, CA, USA

Part I

Architecture and Design Flow

Part II

Parallelism and Optimization

Part III

Memory Systems

Part IV

Debugging

Part V

Networks-on-Chip

12 Efficient Topologies for 3-D Network-on-Chip

Mohammad Ayoub Khan
Center for Development of Advanced Computing (C-DAC), Ministry of Communications and IT, Govt. of India, B-30, Sector 62, Noida, UP, India

Abdul Quaiyum Ansari
Department of Electrical Engineering, Jamia Millia Islamia (Central University), New Delhi, India

CONTENTS

12.1 Introduction
    12.1.1 Classification of Network Topologies
    12.1.2 Topology Properties
    12.1.3 Performance Evaluation Parameters
    12.1.4 Basic 3-D Topologies
    12.1.5 Power Consumption Issues in 3-D Topologies
12.2 Related Work
12.3 Binary Search Tree based Ring Topology
    12.3.1 Number of nodes (N) at lth level
    12.3.2 Average Degree (d) of the Network at lth level
    12.3.3 Diameter (D) of Level l network
12.4 Layout and Implementation
12.5 Discussion and Analysis
12.6 Conclusions
12.7 Glossary

The Network-on-Chip (NoC) represents a relatively new communication paradigm for increasingly complex on-chip networks. The NoC provides techniques for a generic on-chip interconnection network (IN) realized by routers that connect processing elements (PEs) such as ASICs, FPGAs, memories and IP cores. Reducing latency and wire length requires an efficient interconnection architecture. Performance of the network is measured in terms of throughput, and the throughput and efficiency of an interconnect depend on the network parameters of a given topology. The topology of a communication network therefore plays an important role in efficient design. The performance of an interconnection architecture depends on its degree and diameter, and its cost can be defined as the product degree × diameter. Three-dimensional (3-D) integrated circuits offer a low-latency and area-efficient interconnect solution for NoCs. The 3-D arrangement offers opportunities for new circuit architectures based on the geometric capacity that provides greater numbers of interconnections among multi-layer active circuits. A 3-D NoC can significantly reduce wire length for local and global interconnects. This chapter investigates 3-D topologies for NoC applications and then considers a new design of an efficient topology based on ring structures. We have obtained the degree of the proposed topology as $\frac{6 \times 2^{n-1} - 4}{2^{n} - 1}$, while the diameter of the topology is obtained as $D = 1 + 2 \times (n - 1)$. The degree of the proposed topology is 25% less than that of the torus, along with a drastic reduction in the diameter. The layout of the proposed topology could be easily extended to a 3-D NoC architecture by adding a few extra links.

12.1 Introduction

A Network-on-Chip (NoC) can be defined as a communication paradigm for multiple processors on a single chip. The NoC paradigm represents a promising solution for forthcoming complex embedded systems and multimedia applications. The International Technology Roadmap for Semiconductors estimates that NoCs will soon contain billions of transistors running at speeds of many GHz (line-rate), operating below 1 V (Khan and Ansari 2011a). A typical NoC application consists of multiple storage components (memory cores) and processing elements, such as general-purpose CPUs, specialized cores and embedded hardware, connected together over a complex communication architecture. NoC technology has replaced the traditional bus architecture with a low-cost, point-to-point, packet-based architecture. The NoC solution incorporates a layered network protocol stack analogous to the open system interconnection (OSI) model. The performance of the network is measured in terms of throughput (Abuelrub 2008). The throughput and efficiency of the interconnect depend on the network parameters of the topology. We can formally define topology as a physical interconnection of the routing elements (R) by the communication channels (C) (Khan and Ansari 2011b). This may be represented by a network graph G = (R, C), where the routing elements are the vertices and the channels are the edges of the graph. R and C are described in equations (12.1) and (12.2) (Khan and Ansari 2011b).

$$R = \{r_1, r_2, r_3, \ldots, r_n\} \tag{12.1}$$

$$C = \{c_1, c_2, c_3, \ldots, c_n\} \tag{12.2}$$

TABLE 12.1 Classification of NoC Topology (Khan and Ansari 2011b) Direct Indirect Orthogonal (Mesh and Torus, Tree) Crossbar switch fabric Cube-Connected-Cycles (CCC) Fully-Connected Network Octagon Omega 4-Cube Delta Butterfly Fat-tree Topology

12.1.1 Classification of Network Topologies

The topology of a network determines which protocols the routing elements can implement, and how efficiently. The topology of a NoC affects the scalability, performance, power consumption, area and complexity of the routing elements (Khan and Ansari 2011b). NoC topologies can be classified broadly into two categories, direct and indirect, as shown in Table 12.1 (Khan and Ansari 2011b). In a direct network, each routing element is directly connected to a limited number of neighboring routing elements using a network interface (NI) (Khan and Ansari 2011b). The NI also connects to IP cores. Each routing element performs routing as well as arbitration. The IP element connected to a routing element injects messages into the network through an injection channel and removes incoming messages through an ejection channel. The injected messages compete with the messages that pass through the routing element for the use of output channels (Khan and Ansari 2011b).

12.1.2 Topology Properties

Interconnection networks can be static, dynamic or hybrid in nature, as shown in the overall classification of Figure 12.1. Hybrid networks are interconnections with complicated structures such as hierarchical or hypergraph topologies. The following paragraphs present the two families of static and dynamic networks.

FIGURE 12.1 Classification of interconnection networks

In a static interconnection network, the links among the nodes of the system are passive. Each node is directly connected to a small subset of nodes by interconnecting links, and each node performs both routing and computation. Here, we present some important properties of topologies other than node degree and diameter.

Regularity: A network is regular when all the nodes have the same degree.

Symmetry: A network is symmetric when it looks the same from each node's perspective.

Orthogonal property: A network is orthogonal if the nodes and interconnecting links can be arranged in n dimensions such that each link is placed in exactly one dimension. In a weakly orthogonal topology, some nodes may not have any link in some dimensions.

In static networks, the paths for message transmission are selected by a routing algorithm. The switching mechanism determines how inputs are connected to outputs in a node. All the existing switching techniques can also be used in direct networks. In contrast to static networks, in which the interconnection links between the nodes are passive, the linking configuration in a dynamic network is a function of the states of the switching elements (SEs). In layman's terms, the paths between the graph nodes of a dynamic network change as the states of the switching elements change. Dynamic networks are built using crossbars (especially of size 2 × 2). A dynamic network may consist of a single stage or of multiple intermediate stages for switching. In an indirect network, the routing elements (switches) are connected to one or many intermediate routing elements. These intermediate nodes are responsible for routing and arbitration. These networks are sometimes referred to as multistage interconnect networks (MIN). A broad classification of NoC topology is shown in Table 12.1.

FIGURE 12.2 Basic Network Topologies

12.1.3 Performance Evaluation Parameters

Network topologies may be evaluated by cost (size, degree, diameter and bisection width) and by performance measures (Khan and Ansari 2011b). Here, we define some of the basic terminology of performance evaluation parameters (Khan and Ansari 2011b).

Size: The size of a network may be defined as the number of vertices in the graph.

$$S(G) = |R| \tag{12.3}$$

Degree: The node degree is defined as the maximum number of physical links emanating from a node. The degree of the network may be defined as follows (Khan and Ansari 2011b):

$$\sum_{i=1}^{n} \mathrm{Degree}(r_i), \quad \forall r_i \in R \tag{12.4}$$

Diameter: The diameter of a network is the maximum inter-node distance, i.e., the maximum distance between any two points in G, where the distance between $(r_i, r_j)$ is the length of the shortest path between them. This should be relatively small if network latency is to be minimized. The diameter is more important with store-and-forward routing than with wormhole routing (Parhami 1999). For the connected graph G in Figure 12.3, d(A, E) = 2 and Diam(G) = 3. We can therefore define the diameter as in equation (12.5) (Khan and Ansari 2011b).

$$\mathrm{Diameter}(G) = \max\big(d(r_i, r_j)\big) \tag{12.5}$$

FIGURE 12.3 Diameter in a connected graph

Bisection width: The bisection width (BW) of a network is defined as the minimum number of channels or links that must be removed to partition the network into two disconnected networks of equal size. This is important when nodes communicate with each other in a random fashion. A small bisection width limits the rate of data transfer between the two halves of the network, which may affect the performance of communication-intensive algorithms in an SoC.

Number of edges (Degree): This represents the number of communication ports (edges) required at each node (processing element). The degree should be a constant independent of network size if the architecture is to be readily scalable to larger sizes (Parhami 1999). The node degree has a direct effect on the cost of the network.

Channel Width: The number of bits that can be sent simultaneously over a communication channel or link. Sometimes, this is loosely defined as the number of wires in the communication channel or link.

Channel Rate: This defines the peak rate at which a single wire can deliver bits (Khan and Ansari 2011b).

Channel Bandwidth: This defines the peak rate at which a communication channel or link can deliver bits (Khan and Ansari 2011b).

Cost of Network: This is defined as the total number of communication links (Khan and Ansari 2011b).
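The size, degree and diameter defined above can be computed directly from an adjacency-list view of the network graph G = (R, C). The following Python sketch is only an illustration: the four-node ring and all function names are assumptions made for this example and are not taken from the chapter.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop counts from `source` to every reachable node (shortest paths)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def size(adj):
    """S(G) = |R|: the number of routing elements (vertices)."""
    return len(adj)


def max_node_degree(adj):
    """Node degree: maximum number of links emanating from any node."""
    return max(len(neighbours) for neighbours in adj.values())


def total_degree(adj):
    """Sum of Degree(r_i) over all r_i in R, as in equation (12.4)."""
    return sum(len(neighbours) for neighbours in adj.values())


def diameter(adj):
    """Diameter(G): the largest shortest-path distance between two nodes."""
    return max(max(bfs_distances(adj, r).values()) for r in adj)


# Hypothetical example network: a ring of four routing elements.
ring = {
    "r1": ["r2", "r4"],
    "r2": ["r1", "r3"],
    "r3": ["r2", "r4"],
    "r4": ["r3", "r1"],
}

print(size(ring), max_node_degree(ring), total_degree(ring), diameter(ring))
# prints: 4 2 8 2
```

The breadth-first search assumes unit-cost links and a connected graph, which matches the hop-count notion of distance used in the definition of the diameter.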

12.1.4 Basic 3-D Topologies

FIGURE 12.4 3-D Mesh and Torus Topologies (Khan and Ansari 2011c)

Various new topologies have been explored to implement 3-D NoCs. Some of these are based on basic topologies such as mesh and torus that are extensively used in 2-D designs. In a 3-D mesh architecture, multiple 2-D meshes are connected using vertical links, i.e., one more dimension Z is added to the regular X and Y dimensions. Similarly, a 3-D torus network is the same as a 3-D mesh network except that it contains wrap-around edges at the terminal nodes of all axes (Kini, Kumar, and Mruthyunjaya 2009; Rahman and Horiguchi 2004). Consequently the degree increases in both 3-D networks. The mesh is the most regular and simple architecture used in the design of NoCs, and a mesh network is simple to understand and verify. In a 2-D mesh every interior node is connected to four of its neighbors. In an N-dimensional mesh network every interior node is connected to 2N neighboring nodes, so the degree of a node in an N-dimensional mesh is 2 × N. The number of connections per node remains constant in a mesh network as the size of the network increases. The performance of a large mesh network degrades due to the increase in the diameter. The diameter of a 3-D mesh can be defined as D = d × (k − 1), where d represents the dimension and k is the number of nodes along each dimension (Khan and Ansari 2011c).

A torus network is the same as a mesh network with the boundary nodes connected by wrap-around edges. These wrap-around edges significantly reduce the overall diameter of the network and thus improve throughput and latency. The diameter and network cost of the torus are just half of those of the mesh topology. In an n × n torus the degree of the network is 4, the number of nodes is N = n × n and the network cost is 4 × n (Ki, Lee, and Oh 2009). The architecture shown in Figure 12.4 has an asymmetric number of nodes in its planes. There are many SoC applications where unequal numbers of IP cores are placed in the planes. Every plane then has a different number of processing elements and thus a different diameter. Therefore, we have logically partitioned the torus space into quadrants and selected the nearest wrap-around edge to connect to the destination node (Khan and Ansari 2011c). The diameter of an asymmetric torus can be defined as follows:

$$D = d \times \left\lfloor \frac{n_x}{2} \right\rfloor + \left\lfloor \frac{n_y}{2} \right\rfloor + \left\lfloor \frac{n_z}{2} \right\rfloor \tag{12.6}$$

where $n_x$, $n_y$ and $n_z$ are the numbers of nodes in planes x, y and z respectively.
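The mesh diameter D = d × (k − 1) can be checked numerically on small networks. The Python sketch below is an illustration only: it assumes a symmetric network with k nodes along each of the d dimensions, and the helper names `build_grid` and `diameter` are introduced here for the example. For the symmetric torus it compares against d × ⌊k/2⌋, the standard diameter of a k-ary d-cube, which is not a formula taken from this chapter.

```python
from collections import deque
from itertools import product


def build_grid(d, k, wrap):
    """Adjacency sets of a d-dimensional grid with k nodes per dimension.

    wrap=False gives a mesh; wrap=True adds the wrap-around edges of a torus.
    """
    adj = {}
    for node in product(range(k), repeat=d):
        neighbours = set()
        for axis in range(d):
            for step in (-1, 1):
                coord = node[axis] + step
                if wrap:
                    coord %= k
                elif not 0 <= coord < k:
                    continue  # no link beyond the mesh boundary
                other = node[:axis] + (coord,) + node[axis + 1:]
                if other != node:
                    neighbours.add(other)
        adj[node] = neighbours
    return adj


def diameter(adj):
    """Largest shortest-path hop count, found by BFS from every node."""
    longest = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        longest = max(longest, max(dist.values()))
    return longest


d, k = 3, 4
print(diameter(build_grid(d, k, wrap=False)), d * (k - 1))   # mesh:  9 9
print(diameter(build_grid(d, k, wrap=True)), d * (k // 2))   # torus: 6 6
```

For the 4 × 4 × 4 case the wrap-around edges cut the diameter from 9 to 6 hops, which illustrates why the torus is attractive when latency dominates.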

12.1.5 Power Consumption Issues in 3-D Topologies

Power dissipation is an important issue in 3-D circuits. The interconnection network dissipates a substantial amount of power in its interconnects and buffers. For example, the on-chip network of the MIT Raw consumes 36% of the total chip power, and the interconnection network accounts for 20% of the total power of the Alpha 21364 microprocessor (Soteriou and Peh 2004). Therefore, there is a need for power-aware interconnections and efficient topologies. The existing traditional approaches to power saving are not sufficient to address the power issues in current SoC designs. Power consumption is also affected by IC technology: 3-D IC technology is expected to have lower power consumption than 2-D circuits due to shorter global interconnects. The topology of the NoC also affects the power consumption of the network in many ways. The power consumption can be reduced by using a topology that has minimum network cost. Pavlidis and Friedman (2007) have shown the effect on power for three different types of topologies. When the authors used 2-D IC technology for a 3-D NoC, they found that power consumption decreased because the number of hops for packet switching was reduced. The 3-D topology can reduce power even in small networks where the number of IPs is very small. However, the power savings are greater in larger networks. In a second approach, the authors experimented with 3-D IC technology for a 2-D NoC, where the horizontal bus length was made shorter by implementing the IPs in more than one physical plane. The greater number of physical planes integrated in a 3-D IC technology provides the optimum value for power regardless of the network size and operating frequency. In their third approach the authors used 3-D IC technology for a 3-D topology, where they observed the greatest savings in power in addition to the minimum delay.

12.2 Related Work

Researchers have considered butterfly fat tree (BFT) and generic fat tree based interconnection networks for NoC applications (Greenberg and Guan 1997; Grecu et al. 2004; Guerrier and Greiner 2000). Feero and Pande (2009) experimented with a 64-IP SoC with a BFT topology that contains 28 switches. Each switch in a BFT network has six ports, one to each of four child nodes and two to parent nodes, with the exception of the switches at the topmost layer. When the network is mapped to a 2-D structure, the longest inter-switch wire length for a BFT-based NoC is $l_{2DIC}/3$, where $l_{2DIC}$ is the die length on one side (Grecu et al. 2004; Pande et al. 2005). Pande et al. (2005) found that if the NoC is spread over a 20 mm × 20 mm die, then the longest inter-switch wire is 10 mm. On the other hand, when the authors mapped the same BFT network onto a four-layer 3-D SoC, wire routing became simpler, and the longest inter-switch wire length was reduced by at least a factor of two (Feero and Pande 2009). In this work, the load on the router varies. At the bottom layer more IPs are connected to each router, while there are fewer routers at the top layer. Therefore, the numbers of input/output ports and the power dissipation also vary. The diameter of the fat tree is large and its node degree varies, so a fat tree based topology may not be optimal for many NoC applications. A significant amount of research has also been conducted on off-chip networks (Pinkston and Duato 2006; Dally and Towles 2003; Duato, Yalamanchili, and Ni 2003), and the basic concepts of off-chip networks can be applied to NoCs. The majority of NoC topologies gravitate towards either ring or mesh. The IBM Cell processor, the first product with an NoC, is built on a ring topology. The ring topology is widely used for its design simplicity, ordering properties and low power consumption. The IBM Cell architecture (Hofstee 2005; Gschwind et al. 2006) is a joint effort between IBM, Sony and Toshiba to design a power-efficient family of chips targeting game systems. The IBM Cell has been designed on a 90 nm, 221 mm2 chip that can run at frequencies above 4 GHz. This consists of one IBM 64-Four rings that is used to boost the bandwidth in turn that alleviates the latency problem in the network. The Intel Larrabee is also based on a two-ring topology (Seiler et al. 2009). Balfour and Dally (2006) have presented a comparison of various on-chip network topologies including mesh, concentrated mesh, torus and fat tree. The multiple-mesh structure of MIT's Raw chip has also been used on the Tilera TILE64 chip (Wentzlaff et al. 2007).

FIGURE 12.5 Binary Tree (Khan and Ansari 2011b)

12.3 Binary Search Tree based Ring Topology

A binary tree is an ordered rooted structure where every node has at most two child nodes, designated as the left and right child. The maximum depth (height) of a binary tree of n nodes is (n − 1), reached when every non-leaf node has exactly one child (Cormen et al. 2010). The minimum depth of a binary tree of n nodes (n > 0) is $\lfloor \log_2 n \rfloor$, reached when the tree is balanced, that is, when every non-leaf node above the last level has exactly two children. In what follows we consider a few examples of binary trees with different permutations. A binary search tree (BST) is a binary tree that satisfies the following criteria: every value in the left subtree of a node is no greater than the value at that node, and every value in the right subtree is no smaller (Cormen et al. 2010). A BST algorithm is applied to find any element in the tree. The BST, by nature, allows us to apply the divide-and-conquer technique easily.

$$\forall y \text{ in the left subtree of } x:\quad [y] \le [x] \tag{12.7}$$

$$\forall y \text{ in the right subtree of } x:\quad [y] \ge [x] \tag{12.8}$$
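The divide-and-conquer character of the BST property in equations (12.7) and (12.8) can be seen in a short Python sketch. The class and function names below are chosen for illustration only, and the key values are arbitrary; each comparison discards one whole subtree from the search.

```python
class Node:
    """A binary search tree node holding a key and two optional children."""

    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def bst_insert(root, key):
    """Insert `key` so that left keys <= node key <= right keys; return the root."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = bst_insert(root.left, key)
    else:
        root.right = bst_insert(root.right, key)
    return root


def bst_search(root, key):
    """Return the node holding `key`, or None if the key is absent."""
    node = root
    while node is not None and node.key != key:
        # The BST property lets us drop one subtree at every step.
        node = node.left if key < node.key else node.right
    return node


root = None
for value in [8, 3, 10, 1, 6, 14]:
    root = bst_insert(root, value)

print(bst_search(root, 6) is not None)  # True: 6 is in the tree
print(bst_search(root, 7) is not None)  # False: 7 is not in the tree
```

On a balanced tree the search visits a number of nodes proportional to the depth of the tree rather than to the number of nodes.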

In this work the authors have constructed a modified binary tree structure that reduces the network diameter and degree. The basic module of the proposed tree is shown in Figure 12.6 (Khan and Ansari 2011b). The structure contains only three nodes. Every node in the basic module can communicate with the other nodes directly, without any intermediate hop. Thus, the diameter of the basic module is 1. Figure 12.6 also shows the ring interconnection for l = 1, l = 2 and l = 3. Each new level is formed by extending the terminal nodes. The ring interconnection shown has a maximum of 21 nodes at level l = 3 (Khan and Ansari 2011b). The following sections derive the total number of nodes, the average degree and the diameter of the network.

FIGURE 12.6 Proposed Topology with different levels: (a) l = 1, (b) l = 2, and (c) l = 3 (Khan and Ansari 2011b)

FIGURE 12.7 Ring Based Tree Topology (Khan and Ansari 2011b)

12.3.1

Number of nodes (N ) at lth level

Theorem 7 For a ring based tree with l levels:

1. The total number of nodes N at level l is $3(2^{l} - 1)$.
2. The total number of terminal nodes T at level l is $3(2^{l-1})$.

FIGURE 12.8 3-D Tree-Mesh

Proof 10 The number of nodes at any level l can be derived using induction as follows.

Base case: At level l = 1, the number of nodes is N = 3. This clearly holds, as the base module has three nodes.

Induction hypothesis: If we move to the next level, l = 2, the network has 3 old nodes and 6 new nodes. Therefore, N = 3 + 6. Similarly, we can derive N for the other levels as follows (Khan and Ansari 2011b):

$$
\begin{aligned}
l &= 3, & N &= 9 + 12 \\
l &= 4, & N &= 21 + 24 \\
l &= 5, & N &= 45 + 48 \\
l &= 6, & N &= 93 + 96 \\
l &= 7, & N &= 189 + 192 \\
&\;\;\vdots \\
l &= n, & N &= \left(3(2^{n-1}) - 3\right) + 3(2^{n-1}) \\
& & &= 3(2 \cdot 2^{n-1} - 1) \\
& & &= 3(2^{n-1+1} - 1) \\
& & &= 3(2^{n} - 1)
\end{aligned}
$$

Therefore, the number of nodes at level l = n can be written as follows:

$N = 3(2^{n} - 1)$ (12.9)
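The induction can be checked numerically. The following Python sketch (the function names are illustrative, not from the chapter) grows the network level by level, adding the terminal nodes given by Theorem 7 at each step, and compares the running total with the closed form of Equation (12.9).

```python
def terminal_nodes(level):
    """Terminal nodes at a level: T = 3 * 2**(level - 1) (Theorem 7)."""
    return 3 * 2 ** (level - 1)


def total_nodes(level):
    """Closed form of Equation (12.9): N = 3 * (2**level - 1)."""
    return 3 * (2 ** level - 1)


def total_nodes_by_growth(level):
    """Add the new terminal nodes of every level, as in the proof."""
    n = 0
    for l in range(1, level + 1):
        n += terminal_nodes(l)
    return n


for l in range(1, 8):
    # The level-by-level count must agree with the closed form.
    assert total_nodes_by_growth(l) == total_nodes(l)
    print(l, total_nodes(l), terminal_nodes(l))
```

For l = 7 the loop reports N = 381 = 189 + 192, matching the last explicit row of the derivation.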

Proof 11 The number of terminal nodes at any level can be derived by induction.

Base case: At level l = 1, the number of terminal nodes is T = 3. Therefore, at level l = 1, $T = 3(2^{l-1})$. This clearly holds, as the base module has three terminal nodes (Khan and Ansari 2011b).

$$
\begin{aligned}
l &= 1, & T &= 3 \\
l &= 2, & T &= 6 = 3(2) \\
l &= 3, & T &= 12 = 3(4) = 3(2^{2}) \\
l &= 4, & T &= 24 = 3(8) = 3(2^{3}) \\
&\;\;\vdots \\
l &= n, & T &= 3(2^{n-1})
\end{aligned}
$$

Therefore, the number of terminal nodes at level l = n can be written as follows:

$T = 3(2^{n-1})$ (12.10)

12.3.2 Average Degree (d) of the Network at lth level

The degree of a node is the number of communication ports (edges) required at that node (processing element). The node degree has a direct effect on the cost of each node, and the effect is more significant for parallel ports containing several wires.


Proof 12 In the topology, every internal node has a degree of 4, while every terminal node has a degree of 2. Therefore, the node degree varies between 2 and 4. With l = n, the total number of terminal nodes at the lth level is $3(2^{n-1})$, and the total number of internal nodes is $3(2^{n-1}) - 3$. The total degree of the internal and terminal nodes can therefore be calculated as follows (Khan and Ansari 2011b):

Total degree of the internal nodes: $4 \times \left(3(2^{n-1}) - 3\right)$

Total degree of the terminal nodes: $2 \times 3(2^{n-1})$

Total degree of the network: $4 \times \left(3(2^{n-1}) - 3\right) + 2 \times 3(2^{n-1}) = 18(2^{n-1}) - 12$

Therefore, the average degree can be defined as:

$$
\frac{\text{Total degree of network}}{\text{Number of nodes } N} = \frac{18(2^{n-1}) - 12}{3(2^{n} - 1)} = \frac{6(2^{n-1}) - 4}{2^{n} - 1}
$$

The average degree of the network at the lth level is as follows:

$d = \dfrac{6(2^{n-1}) - 4}{2^{n} - 1}$ (12.11)

12.3.3 Diameter (D) of Level l network

The diameter of a network is the maximum inter-node distance, i.e., the maximum over all pairs of nodes $(r_i, r_j)$ of the length of the shortest path between them. We have derived D using induction as follows (Khan and Ansari 2011b):

Base case: At level l = 1, every node is connected to every other node at unit distance. Therefore, at level l = 1, D = 1 + 2 × (n − 1) with l = n, which clearly holds, as the diameter of the base module is 1.

Proof 13 At the next level, l = 2, the diameter D can be written as the sum of the diameter of the base module and the distances of the left and right networks from the base module. At level l = 2, the left and right networks are each at distance 1 from the base module, contributing 2 in total. Hence, D = 1 (base module) + 1 (left network) + 1 (right network) = 3. Similarly, we can verify the remaining values of l:

$$
\begin{aligned}
l &= 3, & D &= 1 + 2 + 2 = 5 \\
l &= 4, & D &= 1 + 3 + 3 = 7 \\
l &= 5, & D &= 1 + 4 + 4 = 9 \\
l &= 6, & D &= 1 + 5 + 5 = 11 \\
&\;\;\vdots \\
l &= n, & D &= 1 + (n - 1) + (n - 1) \\
& & &= 1 + (2n - 2) \\
& & &= 1 + 2(n - 1)
\end{aligned}
$$

Therefore, the diameter of the network can be written as follows:

$D = 1 + 2(n - 1)$ (12.12)

12.4 Layout and Implementation

The ring topology is simple, but it performs poorly compared with higher-dimensional networks such as the mesh, torus and tree, which offer better latency, throughput, energy and reliability. A ring has a node degree of two while a mesh or torus has a node degree of four, where node degree in an NoC refers to the number of links (physical ports) in and out of a node. The mesh and torus therefore require more links at the routers. The topologies featured in Figure 12.4 are three-dimensional topologies that map readily onto multiple metal layers. The torus has to be physically arranged in a folded form to equalize wire lengths, instead of employing long wrap-around links between edge nodes. The 3-D torus topology has a lower hop count (which leads to lower delay and energy) than a mesh. On the other hand, the tree topology has the advantage of a lower diameter.

The topology shown in Figure 12.7 at level l = 3.5 has 29 nodes. Sometimes, traffic across the sub-networks moves through the root node only. The remaining 28 nodes in the network are divided into 4 sub-networks, as shown in Figure 12.9. Each sub-network has 7 nodes with a local root node. To minimize the wiring length, a mesh and torus structure has been adopted, as shown. The placement of the proposed topology permits simple routing schemes such as XY, or XYZ for a three-dimensional arrangement.
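To illustrate what a simple routing scheme such as XYZ involves, the sketch below implements generic dimension-order routing on a 3-D grid: a packet first corrects its X coordinate, then Y, then Z. This is a textbook scheme written under the assumption that routers are addressed by integer (x, y, z) coordinates on a grid without wrap-around links; it is not the authors' router, and the function name `xyz_route` is hypothetical.

```python
def xyz_route(src, dst):
    """Return the list of routers visited from src to dst.

    src and dst are (x, y, z) tuples. The route corrects X first,
    then Y, then Z, moving one hop at a time.
    """
    path = [tuple(src)]
    current = list(src)
    for axis in range(3):  # 0 = X, 1 = Y, 2 = Z
        step = 1 if dst[axis] > current[axis] else -1
        while current[axis] != dst[axis]:
            current[axis] += step
            path.append(tuple(current))
    return path


route = xyz_route((0, 0, 0), (2, 1, 1))
print(route)
print(len(route) - 1, "hops")
```

When source and destination share the same z coordinate, the function reduces to XY routing. The hop count always equals the Manhattan distance between the two routers, because each dimension is corrected exactly once and never revisited.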


FIGURE 12.9 Layout of the proposed topology (Khan and Ansari 2011b)

TABLE 12.2 Analysis of Network Parameters for Base Module (Khan and Ansari 2011b)

Level (l)   Nodes (N)   Terminal nodes (T)   Average degree (d)   Diameter (D)
1           3           3                    2                    1
2           9           6                    2.6                  3
3           21          12                   2.8                  5
4           45          24                   2.9                  7
5           93          48                   2.9                  9
6           189         96                   2.9                  11
...         ...         ...                  ...                  ...
20          3145725     1572864              2.9                  39
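The network parameters tabulated here follow from the closed forms derived in Section 12.3. As a hedged sketch, the Python fragment below evaluates Equations (12.9) to (12.12) for a given level and cross-checks the average degree against a direct count of ports. The function name `network_parameters` is illustrative and not from the chapter, and the average degree is computed exactly, so it shows more digits than the table.

```python
from fractions import Fraction


def network_parameters(level):
    """Evaluate the closed forms of Section 12.3 for one level.

    Returns (N, T, d, D): total nodes, terminal nodes,
    average degree (exact fraction) and diameter.
    """
    nodes = 3 * (2 ** level - 1)                 # Equation (12.9)
    terminals = 3 * 2 ** (level - 1)             # Equation (12.10)
    avg_degree = Fraction(6 * 2 ** (level - 1) - 4,
                          2 ** level - 1)        # Equation (12.11)
    diameter = 1 + 2 * (level - 1)               # Equation (12.12)
    return nodes, terminals, avg_degree, diameter


for level in (1, 2, 3, 4, 5, 6, 20):
    n, t, d, dia = network_parameters(level)
    # Cross-check: 4 ports per internal node, 2 per terminal node.
    assert d == Fraction(4 * (n - t) + 2 * t, n)
    print(f"l={level:2d}  N={n:8d}  T={t:8d}  d={float(d):.3f}  D={dia}")
```

The average degree starts at exactly 2 for l = 1 and approaches 3 from below as the level grows, since the constant terms in Equation (12.11) become negligible next to the powers of two.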

12.5 Discussion and Analysis

In Figure 12.7, we have shown a tree with l = 3.5 that contains a total of 29 nodes. The equivalent VLSI layout is shown in Figure 12.9. An exhaustive analysis of the network parameters of the proposed topology is given in Table 12.2. Based on this mathematical analysis, Figures 12.10 and 12.11 plot the number of nodes, the degree and the diameter against the level. The proposed tree topology can have a large number of nodes if sufficient levels are chosen: l = 1 gives a total of 3 nodes, while the topology can support 3145725 nodes for l = 20. The extended 3-D Mesh-Tree topology shown in Figure 12.8 has a slightly larger diameter of 3 + 2 × (1 + 2 × (n − 1)). Here, 2 × (1 + 2 × (n − 1)) accounts for the diameters of the source and destination sub-networks, while 3 is the diameter of the 3-D mesh.
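As a small worked example of this expression, the sketch below evaluates the diameter of a single ring based tree and of the extended 3-D Mesh-Tree for a few levels. It assumes, as the text states, a mesh diameter of 3 and one tree traversal at each of the source and destination ends; the function and constant names are illustrative.

```python
MESH_DIAMETER = 3  # diameter of the 3-D mesh, as stated in the text


def tree_diameter(level):
    """Diameter of one ring based tree: D = 1 + 2 * (level - 1)."""
    return 1 + 2 * (level - 1)


def mesh_tree_diameter(level):
    """Mesh hops plus a tree traversal at the source and at the destination."""
    return MESH_DIAMETER + 2 * tree_diameter(level)


for level in (1, 3, 20):
    print(level, tree_diameter(level), mesh_tree_diameter(level))
```

For l = 20 this gives 3 + 2 × 39 = 81, so the extension roughly doubles the diameter of a single tree while multiplying the number of attached sub-networks.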


FIGURE 12.10 Number of nodes in level l (Khan and Ansari 2011b)

12.6 Conclusions

FIGURE 12.11 Degree and Diameter Analysis of the proposed topology (Khan and Ansari 2011b)

The topology determines which protocol strategies can be implemented, and implemented efficiently, by the routing elements. The topology of an NoC affects the scalability, performance, power consumption, area and complexity of the routing elements. In this chapter, we have constructed a modified ring based binary tree structure that reduces the network diameter and degree drastically. We have found that the degree of the proposed topology is 25% less than that of the torus, along with a drastic reduction in diameter. For an SoC with 3145725 nodes, the diameter of the proposed tree is 39. The diameter of a torus topology with the same number of nodes is approximately 1800, which is too large for an NoC application. We also found that the average degree of the presented topology varies between 2 and 3, while the torus has a fixed degree regardless of the number of nodes in the topology.

This chapter has demonstrated that both mesh- and tree-based NoCs are capable of achieving better performance when instantiated in a 3D IC environment compared to more traditional 2D implementations. However, the proposed tree-based topology shows significant gains in terms of network diameter, degree and number of nodes. The tree-based NoCs achieve significant gains in energy dissipation and area overhead without any change in throughput and latency. The Network-on-Chip (NoC) paradigm continues to attract significant research attention in both academia and industry. With the advent of 3D IC technology, the achievable performance benefits from NoC methodology will be more pronounced, as this chapter has shown. Consequently, this will also facilitate adoption of the NoC paradigm as a mainstream design solution for larger multi-core systems.
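The diameter comparison drawn in these conclusions can be reproduced with a short calculation. The sketch assumes a square 2-D torus of k × k nodes, whose diameter is 2⌊k/2⌋ because the wrap-around links halve the worst-case distance in each dimension. The chapter does not state which torus shape it used, so this shape is an assumption, as are the function names.

```python
import math


def tree_diameter(level):
    """Diameter of the ring based tree, Equation (12.12)."""
    return 1 + 2 * (level - 1)


def square_torus_diameter(num_nodes):
    """Diameter of the smallest k x k 2-D torus holding num_nodes nodes."""
    k = math.isqrt(num_nodes - 1) + 1  # ceil(sqrt(num_nodes)) for num_nodes >= 1
    return 2 * (k // 2)


level = 20
nodes = 3 * (2 ** level - 1)  # 3145725 nodes, Equation (12.9)
print(tree_diameter(level))
print(square_torus_diameter(nodes))
```

Under this assumption the torus needs a 1774 × 1774 grid and has a diameter of 1774, in line with the approximate figure of 1800 quoted in the text, against 39 for the proposed tree.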

12.7 Glossary

Topology: The topology defines the logical structure of, and the interconnection between, the nodes in a network.

Multicore: A multi-core processor is a single computing component with two or more independent processors (called ‘cores’) on a single chip. Multiple instructions can be executed on a multi-core processor at the same time, increasing overall speed for programs amenable to parallel computing.

Application Specific Integrated Circuits (ASICs): An ASIC is an integrated circuit (IC) that is customized for a particular use, rather than intended for general-purpose use. For example, a chip designed solely to run a Bluetooth transceiver is an ASIC.

Field Programmable Gate Arrays (FPGAs): A typical integrated circuit performs a particular function defined at the time of manufacture. In contrast, the function of an FPGA is defined by a user's program (VHDL/Verilog/netlist), written by someone other than the device manufacturer.

System on Chip (SoC): The SoC is a new paradigm for the design of VLSI systems. An SoC is an integrated circuit (IC) that integrates all components of a computer or other electronic system into a single chip. It may contain digital, analog, mixed-signal, and often radio-frequency functions, all on a single chip substrate.

Binary Search Tree (BST): A BST is a binary tree with an ordered placement of nodes. In a BST, the left node is always less than the root, while the right node is always greater than the root.

Intellectual Property Cores (IP): An IP core is a reusable block of logic, cell or layout design that is the intellectual property of one party and can be licensed to others for integration into an ASIC, FPGA or SoC design.

Open System Interconnection (OSI): The OSI model was developed by the International Organization for Standardization (ISO). It characterizes and standardizes the functions of a communications system in terms of abstraction layers.

Bibliography

Abuelrub, Emad. 2008. “A Comparative Study on the Topological Properties of Hyper-Mesh Interconnection Network.” In Proceedings of the World Congress on Engineering, 9–5. Vol. 1. London, UK. Adiga, N. R., G. Almasi, G. S. Almasi, Y. Aridor, R. Barik, D. Beece, R. Bellofatto, et al. 2002. “An overview of the BlueGene/L Supercomputer.” In Supercomputing ’02: Proceedings of the 2002 ACM/IEEE conference on Supercomputing, 1–22. Baltimore, Maryland: IEEE Computer Society Press. Advanced Micro Devices, Inc. 2007. AMD64 Architecture Programmer’s Manual, Volume 2: System Programming. Advanced Micro Devices, Inc., September. . 2009. Multi-Core Processors from AMD. Advanced Micro Devices, Inc. http://multicore.amd.com/. Agarwal, A. 1991. “Limits on Interconnection Network Performance.” IEEE Transactions on Parallel and Distributed Systems 2, no. 4 (October): 398– 412. Akesson, B., S. Stuijk, A. Molnos, M. Koedam, R. Stefan, A. Nelson, A. Nedad, and K. Goossens. 2012. “Virtual Platforms for Mixed Time-Criticality Applications: The CoMPSoC Architecture and SDF3 Design Flow.” In Proceedings of Workshop on Quo Vadis, Virtual Platforms: Challenges and Solutions for Today and Tomorrow. Albonesi, David H. 1999. “Selective cache ways: on-demand cache resource allocation.” In Proceedings of the ACM/IEEE International Symposium on Microarchitecture (MICRO ’99), 248–259. November. Alfalou, A., M. Elbouz, M. Jridi, and A. Loussert. 2009. “A new simultaneous compression and encryption method for images suitable to recognize form by optical correlation.” In Proceedings of SPIE - The International Society for Optical Engineering. Vol. 7486. SPIE, P. O. Box 10 Bellingham WA 98227-0010 USA. Ali, M., M. Welzl, and S. Hellebrand. 2005. “A dynamic routing mechanism for network on chip.” In Proceedings of the 23rd NORCHIP Conference, 2005. 70–73. November 21–22. doi:10.1109/NORCHP.2005.1596991.


Amdahl, Gene M. 1967. “Validity of the single processor approach to achieving large scale computing capabilities.” In Proceedings of the April 18-20, 1967, Spring Joint Computer Conference, 483–485. ACM. Anand, Christopher K., and Wolfram Kahl. 2008. “Synthesising and Verifying Multi-Core Parallelism in Categories of Nested Code Graphs.” In Process Algebra for Parallel and Distributed Processing, edited by Michael Alexander and William Gardner. Chapman & Hall/CRC. Anis, E., and N. Nicolici. 2007. “On using lossless compression of debug data in embedded logic analysis.” In Proceedings of IEEE International Test Conference, 1–10. IEEE. isbn: 1089-3539. doi:10.1109/TEST.2007. 4437613. Annavaram, Murali, Ed Grochowski, and John Shen. 2005. “Mitigating Amdahl’s Law through EPI Throttling.” SIGARCH Computer Architecture News (New York, NY, USA) 33 (2): 298–309. doi:10.1145/1080695. 1069995. Araujo, C., M. Gomes, E. Barros, S. Rigo, R. Azevedo, and G. Araujo. 2005. “Platform designer: An approach for modeling multiprocessor platforms based on SystemC.” Design Automation for Embedded Systems 10 (4): 253–283. Arcas, Oriol, Philipp Kirchhofer, Nehir Sonmez, Martin Schindewolf, Wolfgang Karl, Osman S. Unsal, and Adrian Cristal. 2012. “A low-overhead profiling and visualization framework for Hybrid Transactional Memory.” In Proceedings of 20th Annual IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM 2012), 1–8. Toronto, Canada, May. ArchC - The Architecture Description Language. http : / / archc . sourceforge.net. Arden, W., M. Brillou¨et, P. Cogez, M. Graef, B. Huizing, and R. Mahnkopf. 2010. “More-than-Moore.” White Paper: International Technology Roadmap for Semiconductors, ITRS. ARM Ltd. 2010a. CoreSight for Cortex-A Series Processors, March. http: / / www . arm . com / products / system - ip / debug - trace / coresight-for-cortex-a.php. . 2010b. RealView Development Suite Documentation. http : / / infocenter . arm . 
com / help / topic / com . arm . doc . subset . swdev.rvds/.


August, D., J. Chang, S. Girbal., D. Gracia-Perez., G. Mouchard, D. Penry, O. Temam, and N. Vachharajani. 2007. “UNISIM: An Open Simulation Environment and Library for Complex Architecture Design and Collaborative Development.” Computer Architecture Letters 6 (2): 45–48. doi:10.1109/L-CA.2007.12. Austin, Todd, Eric Larson, and Dan Ernst. 2002. “SimpleScalar: An Infrastructure for Computer System Modeling.” Computer 35 (2): 59–67. doi:10.1109/2.982917. Awasthi, M., K. Sudan, R. Balasubramonian, and J. Carter. 2009. “Dynamic hardware-assisted software-controlled page placement to manage capacity allocation and sharing within large caches.” In Proceedings of the International Symposium on High Performance Computer Architecture, 2009 (HPCA’09), 250–261. IEEE. Azevedo, R., S. Rigo, M. Bartholomeu, G. Araujo, C. Araujo, and E. Barros. 2005. “The ArchC Architecture Description Language and Tools.” International Journal of Parallel Programming 33 (5): 453–484. Bach, Moshe (Maury), Mark Charney, Robert Cohn, Elena Demikhovsky, Tevi Devor, Kim Hazelwood, Aamer Jaleel, et al. 2010. “Analyzing Parallel Programs with Pin.” Computer 43 (3): 34–41. Bainbridge, J., and S. B. Furber. 2002. “Chain: A Delay-Insensitive Chip Area Interconnect.” IEEE Micro 22 (5): 16–23. Balfour, J., and W. J. Dally. 2006. “Design tradeoffs for tiled CMP on-chip networks.” In Proceedings of the 20th annual international conference on Supercomputing, 187–198. ACM. Banerjee, K., S. Im, and N. Srivastava. 2006. “Can Carbon Nanotubes Extend the Lifetime of On-Chip Electrical Interconnections?” In Proceedings of the 1st International Conference on Nano-Networks and Workshops, 2006 (NanoNet’06), 1–9. IEEE. Barroso, Luis Andre, and Michel Dubois. 1991. “Cache Coherence on a Slotted Ring.” In Proceedings of the International Conference on Parallel Processing, 230–237. Vol. 1. Bartzas, Alexandros, Lazaros Papadopoulos, and Dimitrios Soudris. 2009. 
“A system-level design methodology for application-specific networks-onchip.” Journal of Embedded Computing 3 (3): 167–177. Bartzas, Alexandros, N. Skalis, K. Siozios, and Dimitrios Soudris. 2007. “Exploration of alternative topologies for application-specific 3D networkson-chip.” In Proceedings of WASP.


Bechara, C., A. Berhault, N. Ventroux, S. Chevobbe, Y. Lhuillier, R. David, and D. Etiemble. 2011. “A Small Footprint Interleaved Multithreaded Processor for Embedded Systems.” In Proceedings of IEEE International Conference on Electronics, Circuits, and Systems (ICECS). Beirut, Lebanon, December. Bechara, C., N. Ventroux, and D. Etiemble. 2010. “Towards a Parameterizable Cycle-Accurate ISS in ArchC.” In Proceedings of ACS/IEEE International Conference on Computer Systems and Applications (AICCSA), 1–7. Hammamet, Tunisia, May. . 2011. “A TLM-based Multithreaded Instruction Set Simulator for MPSoC Simulation Environment.” In Proceedings of International Workshop on Rapid Simulation and Performance Evaluation: Methods and Tools (RAPIDO). Crete, Greece, January. Beigne, E., F. Clermidy, P. Vivet, A. Clouard, and M. Renaudin. 2005. “An Asynchronous NOC Architecture Providing Low Latency Service and its Multi-Level Design Framework.” In Proceedings of the 11th IEEE International Symposium on Asynchronous Circuits and Systems, 54–63. Beltrame, G., C. Bolchini, L. Fossati, A. Miele, and D. Sciuto. 2008. “ReSP: A non-intrusive Transaction-Level Reflective MPSoC Simulation Platform for design space exploration.” In Proceedings of Asia and South Pacific Design Automation Conference (ASPDAC), 673–678. Seoul, Korea, January. doi:10.1109/ASPDAC.2008.4484036. Benini, L., D. Bertozzi, A. Bogliolo, F. Menichelli, and M. Olivieri. 2005. “MPARM: Exploring the Multi-Processor SoC Design Space with SystemC.” Journal on VLSI Signal Processing Systems 41 (2): 169–182. Benini, L., and G. De Micheli. 2002. “Networks on Chips: A New SoC Paradigm.” Computer 35, no. 1 (January): 70–78. . 2006. Networks on Chips: Technology and Tools. Morgan Kaufmann. Bennett, Jon C. R., and Hui Zhang. 1996. “W F 2 Q: Worst-case Fair Weighted Fair Queueing.” In Proceedings IEEE INFOCOM’96. Fifteenth Annual Joint Conference of the IEEE Computer Societies. Networking the Next Generation. 120–128. Vol. 1. 
IEEE, March. Bentley, J., D. Sleator, R. Tarjan, and V. Wei. 1986. “A locally adaptive data compression scheme.” Communications of the ACM 29, no. 4 (April): 320–330. doi:10.1145/5684.5688. Berg, E., H. Zeffer, and E. Hagersten. 2006. “A statistical multiprocessor cache model.” In Proceedings of IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS ’06), 89–99.


Bertogna, M., M. Cirinei, and G. Lipari. 2008. “Schedulability Analysis of Global Scheduling Algorithms on Multiprocessor Platforms.” IEEE Transactions on Parallel and Distributed Systems 20, no. 4 (April): 553– 566. Bhattacharyya, S. S., P. K. Murthy, and E. A. Lee. 1996. Software Synthesis from Dataflow Graphs. Kluwer Academic Publishers. Bilsen, G., M. Engels, R. Lauwereins, and J. Peperstraete. 1996. “Cyclo-static dataflow.” IEEE Transactions on Signal Processing 44 (2): 397–408. Binkert, Nathan L., Ronald G. Dreslinski, Lisa R. Hsu, Kevin T. Lim, Ali G. Saidi, and Steven K. Reinhardt. 2006. “The M5 Simulator: Modeling Networked Systems.” IEEE Micro 26 (4): 52–60. doi:10 . 1109 / MM . 2006.82. Birrell, Andrew D., and Bruce Jay Nelson. 1984. “Implementing remote procedure calls.” ACM Transactions on Computer Systems (New York, NY, USA) 2 (1): 39–59. doi:10.1145/2080.357392. Bjerregaard, T., and J. Sparso. 2005. “A Router Architecture for ConnectionOriented Service Guarantees in the MANGO Clockless Network-onChip.” In Proceedings of the Conference on Design, Automation and Test in Europe, 1226–1231. Vol. 2. Bobda, Christophe, Ali Ahmadinia, Mateusz Majer, Jurgen Teich, Sandor Fekete, and Jan van der Veen. 2005. “DyNoC: A Dynamic Infrastructure for Communication in Dynamically Reconfigurable Devices.” In Proceedings of the IEEE International Conference on Field Programmable Logic and Applications, 153–158. IEEE. Bolotin, Evgeny, Israel Cidon, Ran Ginosar, and Avinoam Kolodny. 2004. “QNoC: QoS Architecture and Design Process for Network On Chip.” Journal of Systems Architecture 50, nos. 2–3 (February): 105–128. Bonfietti, A., M. Lombardi, M. Milano, and L. Benini. 2010. “An Efficient and Complete Approach for Throughput-maximal SDF Allocation and Scheduling on Multi-Core Platforms.” In Proceedings of International Conference on Design, Automation and Test in Europe, DATE’10, 897– 902. IEEE. Borkar, Shekhar. 2007. 
“Thousand core chips: a technology perspective.” In Proceedings of the 44th annual Design Automation Conference, 746–749. ACM. Boukhechem, S., and E.-B. Bouernnane. 2008. “TLM Platform Based on SystemC For STARSoC Design Space Exploration.” In Proceedings of NASA/ESA Conference on Adaptive Hardware and Systems, 354–361. Noordwijk, The Netherlands, June.


Boyd-Wickizer, Silas, Robert Morris, and M. Frans Kaashoek. 2009. “Reinventing Scheduling for Multicore Systems.” In Proceedings of the 12th Workshop on Hot Topics in Operating Systems (HotOS-XII), Monte Verita, Switzerland. Brown, Jeffery A., Rakesh Kumar, and Dean Tullsen. 2007. “Proximity-aware directory-based coherence for multi-core processor architectures.” In Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures, 126–134. ACM. Bukhari, K. Z., G. K. Kuzmanov, and S. Vassiliadis. 2002. “DCT and IDCT implementations on different FPGA technologies.” In Proceedings of the 13th Annual Workshop on Circuits, Systems and Signal Processing (ProRISC02) – Veldhoven, The Netherlands, 232–235. Burtscher, M., I. Ganusov, S. J. Jackson, J. Ke, P. Ratanaworabhan, and N. B. Sam. 2005. “The VPC trace-compression algorithms.” IEEE Transactions on Computers 54 (11): 1329–1344. doi:10.1109/TC.2005.186. Buyukkurt, B., Z. Guo, and W. Najjar. 2006. “Impact of loop unrolling on area, throughput and clock frequency in ROCCC: C to VHDL compiler for FPGAs.” Reconfigurable Computing: Architectures and Applications:401–412. Calandrino, John M., and James H. Anderson. 2008. “Cache-Aware RealTime Scheduling on Multicore Platforms: Heuristics and a Case Study.” In EuroMicro Conference on Real-Time Systems (ECRTS ’08), 299–308. July. Casper, Jared, Tayo Oguntebi, Sungpack Hong, Nathan G. Bronson, Christos Kozyrakis, and Kunle Olukotun. 2011. “Hardware acceleration of transactional memory on commodity systems.” ACM SIGARCH Computer Architecture News 39 (1): 27–38. Chafi, Hassan, Jared Casper, Brian D. Carlstrom, Austen McDonald, Chi Cao Minh, Woongki Baek, Christos Kozyrakis, and Kunle Olukotun. 2007. “A Scalable, Non-blocking Approach to Transactional Memory.” In Proceedings of IEEE 13th International Symposium on High Performance Computer Architecture, 2007. HPCA 2007. 97–108. Chakraborty, Koushik, Philip M. Wells, and Gurindar S. Sohi. 2006. 
“Computation spreading: employing hardware migration to specialize CMP cores on-the-fly.” ACM SIGOPS Operating Systems Review 40, no. 5 (October): 283–292. doi:10.1145/1168917.1168893.


Chalamalasetti, S. R., W. Vanderbauwhede, S. Purohit, and M. Margala. 2009. “A low cost reconfigurable soft processor for multimedia applications: Design synthesis and programming model.” In Proceedings of International Conference on Field Programmable Logic and Applications, 2009. FPL 2009. 534–538. IEEE. Chang, Jichuan, and Gurindar S. Sohi. 2007. “Cooperative cache partitioning for chip multiprocessors.” In Proceedings of The International Conference on Supercomputing (ICS ’07), 242–252. June. Charest, L., E. M. Aboulhamid, C. Pilkington, and P. Paulin. 2002. “SystemC performance evaluation using a pipelined DLX multiprocessor.” In IEEE Design Automation and Test in Europe (DATE), 3. Chaudhuri, M. 2009. “PageNUCA: Selected policies for page-grain locality management in large shared chip-multiprocessor caches.” In Proceedings of The IEEE 15th International Symposium on High Performance Computer Architecture, 2009. HPCA 2009. 227–238. IEEE. Chen, Kuan-Ju, Chin-Hung Peng, and Feipei Lai. 2010. “Star-Type Architecture with Low transmission Latency for a 2D Mesh NOC.” In Proceedings of The IEEE Asia Pacific Conference on Circuits and Systems (APCCAS), 919–922. IEEE. Chiou, D., H. Sunjeliwala, H. Sunwoo, J. Dam Xu, and N. Patil. 2006. “FPGAbased Fast, Cycle-Accurate, Full-System Simulators.” In UTFAST-200601, 795–825. Vol. 15. 5. Austin, TX, USA, November. Chiu, Ge-Ming. 2000. “The Odd-Even Turn Model for Adaptive Routing.” IEEE Transactions On Parallel And Distributed Systems 11, no. 7 (July): 729–738. Cho, Myong Hyon, Keun Sup Shim, Mieszko Lis, Omer Khan, and Srinivas Devadas. 2011. “Deadlock-Free Fine-Grained Thread Migration.” In Proceedings of the Fifth ACM/IEEE International Symposium on Networkson-Chip, 33–40. ACM. Cho, Sangyeun, and Lei Jin. 2006. “Managing Distributed, Shared L2 Caches through OS-Level Page Allocation.” In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, 455–468. 
Christie, Dave, Jae-Woong Chung, Stephan Diestelhorst, Michael Hohmuth, Martin Pohlack, Christof Fetzer, Martin Nowack, et al. 2010. “Evaluation of AMD’s advanced synchronization facility within a complete transactional memory stack.” In Proceedings of the 5th European conference on Computer systems (EuroSys ’10), 27–40. Paris, France. isbn: 978-160558-577-2.


Chung, E.S., and J.C. Hoe. 2010. “High-Level Design and Validation of the BlueSPARC Multithreaded Processor.” IEEE Transactions on CAD 29 (10): 1459–1470. doi:10.1109/TCAD.2010.2057870. Chung, E.S., E. Nurvitadhi, J. C Hoe, B. Falsafi, and K. Mai. 2008. “A complexity-effective architecture for accelerating full-system multiprocessor simulations using FPGAs.” In Proceedings of the International Symposium on FPGAs, 77–86. Cong, J. 2008. “A new generation of C-base synthesis tool and domain-specific computing.” In Proceedings of the IEEE International SOC Conference, 2008, 386–386. IEEE. Cong, J., K. Gururaj, G. Han, A. Kaplan, M. Naik, and G. Reinman. 2008. “MC-Sim: An efficient simulation tool for MPSoC designs.” In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 364–371. San Jose, USA, November. doi:10.1109/ ICCAD.2008.4681599. Coppola, M., S. Curaba, M. Grammatikakis, and G. Maruccia. 2003. “IPSIM: SystemC 3.0 Enhancements for Communication Refinement.” In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition, 106–111. December 19. Coppola, M., R. Locatelli, G. Maruccia, L. Pieralisi, and A. Scandurra. 2004. “Spidergon: a novel on-chip communication network.” In Proceedings of International Symposium on System-on-Chip, 15. IEEE. Cormen, Thomas H., Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. 2010. Introduction to Algorithms. Third. PHI Learning, September. Craven, S., C. Patterson, and P. Athanas. 2006. “A Methodology for Generating Application-Specific Heterogeneous Processor Arrays.” In Proceedings of the Hawaii International Conference on System Sciences, 251. Vol. 39. Citeseer. Culler, David E., Jaswinder Pal Singh, and Anoop Gupta. 1999. Parallel Computer Architecture: A Hardware/Software Approach. San Francisco: Morgan Kaufmann. Cytron, R., J. Ferrante, B. K. Rosen, M. N. Wegman, and F. K. Zadeck. 1991. 
“Efficiently computing static single assignment form and the control dependence graph.” ACM Transactions on Programming Languages and Systems (TOPLAS) 13 (4): 490. Dall’Osso, M., G. Biccari, L. Giovannini, D. Bertozzi, and L. Benini. 2003. “Xpipes: a Latency Insensitive Parameterized Network-on-chip Architecture For Multi-Processor SoCs.” In Proceedings of the 21st International Conference on Computer Design, 536–539.


Dally, W. J. 1990. “Performance analysis of k-ary n-cube interconnection networks.” IEEE Transactions on Computers 39, no. 6 (June): 775–785. doi:10.1109/12.53599. . 1992. “Virtual-Channel Flow Control.” IEEE Transactions on Parallel and Distributed Systems 3, no. 2 (March): 194–205. Dally, William J., and Brian Towles. 2001. “Route Packets, not wires: On-Chip interconnection networks.” In Proceedings of the 38th Design Automation Conference, 684–689. Las Vegas, Nevada, USA, June. . 2003. Principles and practices of interconnection networks. Morgan Kaufmann. Dave, Nirav, Michael Pellauer, and Joel Emer. 2006. “Implementing a Functional/Timing Partitioned Microprocessor Simulator with an FPGA.” In Proceedings of the 2nd Workshop on Architecture Research using FPGA Platforms (WARFP 2006). De Dinechin, F., B. Pasca, O. Cret, and R. Tudoran. 2008. “An FPGA-specific approach to floating-point accumulation and sum-of-products.” In Proceedings of the International Conference on ICECE Technology, 2008. FPT 2008. 33–40. December. doi:10.1109/FPT.2008.4762363. Dean, Jeffrey, and Sanjay Ghemawat. 2008. “MapReduce: simplified data processing on large clusters.” Communications of the ACM (New York, NY, USA) 51 (1): 107–113. Draper, J. T., and J. Ghosh. 1994. “A Comprehensive Analytical Model for Wormhole Routing in Multicomputer Systems.” Journal of Parallel and Distributed Computing 23, no. 2 (November): 202–214. Duato, J., S. Yalamanchili, and L.M. Ni. 2003. Interconnection Networks: An Engineering Approach. Amsterdam: Morgan Kaufmann. Dybdahl, Haakon, Per Stenstr¨om, and Lasse Natvig. 2006. “A CachePartitioning Aware Replacement Policy for Chip Multiprocessors.” In Proceedings of the High Performance Computing-HiPC 2006, 22–34. Vol. 4297/2006. Lecture Notes in Computer Science. Springer Berlin / Heidelberg. Elmiligi, H., A. A. Morgan, M. W. El-Kharashi, and F. Gebali. 2007. 
“Performance Analysis of Networks-on-Chip Routers.” In Proceedings of the International Design and Test Workshop, 232–236. IEEE, December. Emulation and Verification Engineering (EVE). 2010. ZeBu: A Unified Verification Approach for Hardware Designers and Embedded Software Developers.


Fauth, A., J. Van Praet, and M. Freericks. 1995. “Describing instruction set processors using nML.” In EDTC ’95: Proceedings of the 1995 European conference on Design and Test, 503. Washington, DC, USA: IEEE Computer Society. isbn: 0-8186-7039-8.
Fedorova, Alexandra, Margo Seltzer, and Michael D. Smith. 2006. Cache-Fair Thread Scheduling for Multicore Processors. TR-17-06. Technical report. Harvard University.
Feero, B. S., and P. P. Pande. 2009. “Networks-on-Chip in a Three-Dimensional Environment: A Performance Evaluation.” IEEE Transactions on Computers 58, no. 1 (January): 32–45. doi:10.1109/TC.2008.142.
Felber, Pascal, Christof Fetzer, and Torvald Riegel. 2008. “Dynamic performance tuning of word-based Software Transactional Memory.” In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 237–246.
Felicijan, F., and S. B. Furber. 2004. “An Asynchronous On-Chip Network Router with Quality-of-Service (QoS) Support.” In Proceedings of International IEEE SOC Conference, 274–277.
Fensch, C., and M. Cintra. 2008. “An OS-based alternative to full hardware coherence on tiled CMPs.” In Proceedings of the IEEE 14th International Symposium on High Performance Computer Architecture, 2008 (HPCA 2008), 355–366. IEEE.
Ferri, Cesare, Samantha Wood, Tali Moreshet, R. Iris Bahar, and Maurice Herlihy. 2010. “Embedded-TM: Energy and complexity-effective hardware transactional memory for embedded multicore systems.” Journal of Parallel and Distributed Computing 70 (10): 1042–1052.
Flanagan, Cormac, and Patrice Godefroid. 2005. “Dynamic partial-order reduction for model checking software.” In Proceedings of the 32nd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL’05), 110–121. Long Beach, California, USA: ACM. isbn: 1-58113-830-X. doi:10.1145/1040305.1040315.
Foroutan, S., Y. Thonnart, R. Hersemeule, and A. Jerraya. 2010. “An Analytical Method for Evaluating Network-On-Chip Performance.” In Proceedings of the Conference on Design, Automation and Test in Europe (DATE’10), 1629–1632. Dresden: IEEE.
Ganguly, Amlan, Kevin Chang, Sujay Deb, Partha Pande, Benjamin Belzer, and Christof Teuscher. 2011. “Scalable Hybrid Wireless Network-on-Chip Architectures for Multi-Core Systems.” IEEE Transactions on Computers 60, no. 10 (September): 1485–1502.


Ganguly, Amlan, Kevin Chang, Partha Pratim Pande, Benjamin Belzer, and Alireza Nojeh. 2009. “Performance Evaluation of Wireless Networks on Chip Architectures.” In Proceedings of the IEEE International Symposium on Quality Electronic Design (ISQED), 350–355. March 16–18.
Ganguly, Amlan, Partha Pande, and Benjamin Belzer. 2009. “Crosstalk-Aware Channel Coding Schemes for Energy Efficient and Reliable NoC Interconnects.” IEEE Transactions on VLSI 17, no. 11 (November): 1626–1639.
Ganguly, Amlan, Partha Pande, Benjamin Belzer, and Cristian Grecu. 2008. “Design of Low power & Reliable Networks on Chip through Joint Crosstalk Avoidance and Multiple Error Correction Coding.” Special Issue on Defect and Fault Tolerance, Journal of Electronic Testing: Theory and Applications (JETTA) 24 (1): 67–81.
Gangwal, O. P., A. Rădulescu, K. Goossens, S. González Pestana, and E. Rijpkema. 2005. “Building Predictable Systems on Chip: An Analysis of Guaranteed Communication in the AEthereal Network on Chip.” In Dynamic and Robust Streaming In and Between Connected Consumer-Electronics Devices, edited by P. van der Stok, 1–36. Vol. 3. Springer.
Garcia-Molina, H., R. J. Lipton, and J. Valdes. 1984. “A Massive Memory Machine.” IEEE Transactions on Computers 100 (5): 391–399.
Gebali, F., H. Elmiligi, and M.W. El-Kharashi. 2011. Networks-on-chips: Theory and Practice. CRC Press.
Geilen, M. C. W., and T. Basten. 2003. “Requirements on the Execution of Kahn Process Networks.” In European Symposium on Programming, ESOP’03, 319–334. Vol. 2618. Lecture Notes in Computer Science. Springer.
Geilen, M., T. Basten, and S. Stuijk. 2005. “Minimising buffer requirements of synchronous dataflow graphs with model checking.” In Proceedings of the 42nd annual Design Automation Conference, 819–824. DAC ’05. Anaheim, California, USA: ACM. isbn: 1-59593-058-2. doi:10.1145/1065579.1065796. http://doi.acm.org/10.1145/1065579.1065796.
Ghamarian, A. H., M. C. W. Geilen, T. Basten, B. D. Theelen, M. R. Mousavi, and S. Stuijk. 2006. “Liveness and Boundedness of Synchronous Data Flow Graphs.” In Proceedings of the International Conference on Formal Methods in Computer Aided Design, FMCAD’06, 68–75. IEEE.
Ghamarian, A. H., M. C. W. Geilen, S. Stuijk, T. Basten, A. J. M. Moonen, M. J. G. Bekooij, B. D. Theelen, and M. R. Mousavi. 2006. “Throughput Analysis of Synchronous Data Flow Graphs.” In Proceedings of the International Conference on Application of Concurrency to System Design, ACSD’06, 25–36. IEEE. doi:10.1109/ACSD.2006.33.


Ghamarian, A. H., S. Stuijk, T. Basten, M. C. W. Geilen, and B. D. Theelen. 2007. “Latency Minimization for Synchronous Data Flow Graphs.” In Proceedings of the Conference on Digital System Design, DSD’07, 189–196. IEEE.
Gheorghita, S.V., S. Stuijk, T. Basten, and H. Corporaal. 2005. “Automatic scenario detection for improved WCET estimation.” In Proceedings of the Design Automation Conference, DAC 05, 101–104. ACM.
Ghosal, Prasun, and Tuhin Subhra Das. 2012a. “Network-on-chip Routing Using Structural Diametrical 2D Mesh Architecture.” In Proceedings of Third International Conference on Emerging Applications of Information Technology (EAIT 2012).
Ghosal, Prasun, and Tuhin Subhra Das. 2012b. “SD2D: A Novel Routing Architecture For Network-on-Chip.” In Proceedings of 3rd International Symposium on Electronic System Design (ISED 2012).
Ghosal, Prasun, and Tuhin Subhra Das. 2013. “A Novel Routing Algorithm For On-chip Communication in NoC on Diametrical 2D Mesh Interconnection Architecture.” In Advances in Computing and Information Technology, edited by Natarajan Meghanathan et al., 667–676. Vol. 178. Advances in Intelligent Systems and Computing Series 178. Springer.
Ghosal, Prasun, and Sankar Karmakar. 2012. “Diametrical Mesh of Tree (D2D-MoT) Routing Architecture for Network-on-Chip.” International Journal of Advanced Engineering Technology III, no. I (January): 243–247.
Gibson, J., R. Kunz, D. Ofelt, M. Horowitz, J. Hennessy, and M. Heinrich. 2000. “FLASH vs. (simulated) FLASH: Closing the simulation loop.” ACM SIGOPS Operating Systems Review 34, no. 5 (March): 49–58.
GNU Project. 2010. GDB: The GNU Project Debugger, March. http://www.gnu.org/software/gdb/.
Godefroid, Patrice. 1996. Partial-Order Methods for the Verification of Concurrent Systems: An Approach to the State-Explosion Problem. 142. Vol. 1032. New York, NY, USA: Springer-Verlag Inc. isbn: 3-540-60761-7 (Berlin softcover). http://www.springerlink.com/content/w1675757101j/.
Goel, A. K. 2001. “Nanotechnology circuit design-the.” In Proceedings of the 2001 1st IEEE Conference on Nanotechnology, 2001 (IEEE-NANO’01), 123–127. IEEE.
Goel, A. K. 2007. High-speed VLSI interconnections. Vol. 185. Wiley-IEEE Press.


Gokhale, M.B., J.M. Stone, J. Arnold, and M. Kalinowski. 2000. “Stream-oriented FPGA computing in the Streams-C high level language.” In Proceedings of the 2000 IEEE Symposium on Field-Programmable Custom Computing Machines, 49–56. IEEE Computer Society, Washington, DC, USA.
Goossens, K., J. Dielissen, O. P. Gangwal, S. G. Pestana, A. Radulescu, and E. Rijpkema. 2005. “A Design Flow for Application-Specific Networks on Chip with Guaranteed Performance to Accelerate SOC Design and Verification.” In Proceedings of the Conference on Design, Automation and Test in Europe, 1182–1187. Vol. 2. ACM.
Goossens, Kees, John Dielissen, and Andrei Radulescu. 2005. “Æthereal Network on Chip: Concepts, Architectures, and Implementations.” IEEE Design & Test of Computers 22, no. 5 (September): 414–421.
Goyal, Pawan, Harrick M. Vin, and Haichen Cheng. 1996. “Start-Time Fair Queueing: A Scheduling Algorithm for Integrated Services Packet Switching Networks.” 26 (4): 157–168.
Gray, J. 1986. “Why Do Computers Stop and What Can Be Done About It?” In Symposium on Reliability in Distributed Software and Database Systems, 3–12.
Gray, Jan. 1998. The Myriad Uses of Block RAM. http://www.fpgacpu.org/usenet/bb.html.
Gray, Jan. 2000. “Hands-on Computer Architecture - Teaching Processor and Integrated Systems Design with FPGAs.” In Proceedings of the 2000 Workshop on Computer Architecture Education, 17. ACM.
Grecu, C., P. P. Pande, A. Ivanov, and R. Saleh. 2004. “A scalable communication-centric SoC interconnect architecture.” In Proceedings of the 5th International Symposium on Quality Electronic Design, 2004, 343–348. doi:10.1109/ISQED.2004.1283698.
Greenberg, R. I., and Lee Guan. 1997. “An improved analytical model for wormhole routed networks with application to butterfly fat-trees.” In Proceedings of the 1997 International Conference on Parallel Processing, 1997, 44–48. August. doi:10.1109/ICPP.1997.622554.
Grottke, M., and K.S. Trivedi. 2005. “A classification of software faults.” In Proceedings of the International Symposium on Software Reliability Engineering, 4–19.
Gschwind, M., B. D’Amora, K. O’Brien, and A. Eichenberger. 2006. “Cell broadband engine-enabling density computing for data-rich environment.” In Proceedings of the International Symposium on Computer Architecture. June.


Gu, Huaxi, Jiang Xu, and Wei Zhang. 2009. “A Low Power Fat Tree based Optical Network-on-Chip for Multiprocessor System-on-Chip.” In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition, 3–8.
Guan, W. J., W. K. Tsai, and D. Blough. 1993. “An Analytical Model for Wormhole Routing in Multicomputer Interconnection Networks.” In Proceedings of Seventh International Parallel Processing Symposium, 650–654. Newport, CA: IEEE.
Guerre, A., N. Ventroux, R. David, and A. Merigot. 2009. “Approximate-Timed Transactional Level Modeling for MPSoC Exploration: a Network-on-Chip Case Study.” In Proceedings of the IEEE EUROMICRO Conference on Digital System Design (DSD), 390–397. Patras, Greece, August.
Guerre, A., N. Ventroux, R. David, and A. Merigot. 2010. “Hierarchical Network-on-Chip for Embedded Many-core Architectures.” In Proceedings of the ACM/IEEE International Symposium on Networks-on-Chip (NOCS), 189–196. Grenoble, France, May.
Guerrier, P., and A. Greiner. 2000. “A Generic Architecture for On-Chip Packet-Switched Interconnections.” In Proceedings of the Conference on Design, Automation and Test in Europe, 250–256. Paris, France.
Gupta, A., W. Weber, and T. Mowry. 1990. “Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes.” In Proceedings of the International Conference on Parallel Processing.
Gupta, T., C. Bertolini, O. Heron, N. Ventroux, T. Zimmer, and F. Marc. 2010. “High Level Power and Energy Exploration using ArchC.” In Proceedings of the IEEE International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 25–32. Petrópolis, Brazil, October.
Gustafsson, J. 2006. “The Worst Case Execution Time Tool Challenge 2006.” In Proceedings of the International Symposium on Leveraging Applications of Formal Methods, Verification and Validation, 233–240.
Guz, Zvika, Isask’har Walter, Evgeny Bolotin, Israel Cidon, Ran Ginosar, and Avinoam Kolodny. 2006. “Efficient Link Capacity and QoS Design for Network-on-Chip.” In Proceedings of the Conference on Design, Automation and Test in Europe, 9–14. DATE’06. 3001 Leuven, Belgium: European Design and Automation Association. isbn: 3-9810801-0-6. http://dl.acm.org/citation.cfm?id=1131481.1131487.
Guz, Zvika, Isask’har Walter, Evgeny Bolotin, Israel Cidon, Ran Ginosar, and Avinoam Kolodny. 2007. “Network Delays and Link Capacities in Application-Specific Wormhole NoCs.” VLSI Design 2007:15pp. doi:10.1155/2007/90941.
Hadjiyiannis, G., S. Hanono, and S. Devadas. 1997. “ISDL: An Instruction Set Description Language For Retargetability.” In Proceedings of the 34th Design Automation Conference, 1997, 299–302. June.


Haid, W., Kai Huang, I. Bacivarov, and L. Thiele. 2009. “Multiprocessor SoC software design flows.” Signal Processing Magazine 26 (6): 64–71.
Hailpern, B., and P. Santhanam. 2002. “Software debugging, testing, and verification.” IBM Systems Journal 41, no. 1 (January): 4–12. doi:10.1147/sj.411.0004.
Halambi, Ashok, Peter Grun, Vijay Ganesh, Asheesh Khare, Nikil Dutt, and Alex Nicolau. 1999. “EXPRESSION: a language for architecture exploration through compiler/simulator retargetability.” In Proceedings of the conference on Design, automation and test in Europe (DATE ’99), 100. Munich, Germany: ACM. isbn: 1-58113-121-6. doi:10.1145/307418.307549.
Hardavellas, Nikos, Michael Ferdman, Babak Falsafi, and Anastasia Ailamaki. 2009. “Reactive NUCA: near-optimal block placement and replication in distributed caches.” In Proceedings of the International Symposium on Computer Architecture, 184–195. Vol. 37. 3. ACM.
Heinrich, Joe. 1994. MIPS R4000 Microprocessor User’s Manual. MIPS Technologies, Inc.
Hennessy, John L., and D. Patterson. 2003. Computer Architecture: A Quantitative Approach. 3rd ed. Amsterdam: Morgan Kaufmann.
Henriksson, T., and P. van der Wolf. 2006. “TTL Hardware Interface: A High-Level Interface for Streaming Multiprocessor Architectures.” In Proceedings of the IEEE/ACM/IFIP Workshop on Embedded Systems for Real Time Multimedia (ESTIMedia), 107–112. Seoul, Korea: IEEE Computer Society, October.
Herlihy, Maurice, and J. Moss. 1993. “Transactional memory: architectural support for lock-free data structures.” In Proceedings of the International Symposium on Computer Architecture (ISCA-20).
Hill, Mark, and Michael Marty. 2008. “Amdahl’s Law in the Multicore Era.” IEEE Computer 41 (7).
Hoare, C.A.R. 1978. “Communicating sequential processes.” Communications of the ACM 21 (8): 666–677.
Hofstee, H. Peter. 2005. “Power Efficient Processor Architecture and The Cell Processor.” In Proceedings of the 11th International Symposium on High-Performance Computer Architecture, 2005 (HPCA’11), 258–262. Washington, DC, USA: IEEE Computer Society. isbn: 0-7695-2275-0. doi:10.1109/HPCA.2005.26.


Holsti, Niklas, Jan Gustafsson, Guillem Bernat, Clément Ballabriga, Armelle Bonenfant, Roman Bourgade, Hugues Cassé, et al. 2008. “WCET 2008 – Report from the Tool Challenge 2008.” In Proceedings of the 8th International Workshop on Worst-Case Execution Time (WCET) Analysis, edited by Raimund Kirner, 149–171. Also published in print by Austrian Computer Society (OCG) under ISBN 978-3-85403-237-3. Dagstuhl, Germany: Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, Germany. isbn: 978-3-939897-10-1. http://drops.dagstuhl.de/opus/volltexte/2008/1663.
Hong, Sungpack, Tayo Oguntebi, Jared Casper, Nathan Bronson, Christos Kozyrakis, and Kunle Olukotun. 2010. “EigenBench: A Simple Exploration Tool for Orthogonal TM Characteristics.” In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC), 2010, 1–11. IEEE.
Hopkins, A.B.T., and K.D. McDonald-Maier. 2006. “Debug support strategy for systems-on-chips with multiple processor cores.” IEEE Transactions on Computers 55 (2): 174–184. doi:10.1109/TC.2006.22.
Howes, L. W., O. Pell, O. Mencer, and O. Beckmann. 2006. “Accelerating the development of hardware accelerators.” In Proceedings of the Workshop on Edge Computing.
HPC Project. n.d. “Par4All, automatic parallelization.” http://www.par4all.org.
Hsieh, Wilson C., Paul Wang, and William E. Weihl. 1993. “Computation migration: enhancing locality for distributed-memory parallel systems.” In Principles and Practice of Parallel Programming: Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming: San Diego, California, United States, 239–248. Vol. 19. 22.
Hu, Jingcao, and Radu Marculescu. n.d. “DyAD: smart routing for networks-on-chip.” In Proceedings of the 41st annual Design Automation Conference (DAC’04), 260–263. New York, NY, USA: ACM. doi:10.1145/996566.996638.
Hu, P-C., and L. Kleinrock. 1997. “An Analytical Model for Wormhole Routing with Finite Size Input Buffers.” In Proceedings of 15th International Teletraffic Congress, 549–560. Washington, DC, June 23–27.
IEEE Standard Test Access Port and Boundary-Scan Architecture. 2001. IEEE Std 1149.1-2001. Technical report. doi:10.1109/IEEESTD.2001.92950.
IEEE Standard for Reduced-Pin and Enhanced-Functionality Test Access Port and Boundary-Scan Architecture. 2009. Technical report. doi:10.1109/IEEESTD.2010.5412866.


IEEE-ISTO. 2003. “The Nexus 5001 Forum Standard for a Global Embedded Processor Debug Interface.” IEEE-ISTO 5001-2003.
IEEE-ISTO. 2010. Nexus 5001 Forum Adopting IEEE Std 1149.7. http://www.nexus5001.org/news-events/pressreleases/nexus-5001%E2%84%A2-forum-adopting-ieee-std-11497.
Infineon Technologies. 2010. MCDS - Multi-Core Debug Solution, March. http://www.ip-extreme.com/IP/mcds.shtml.
INRIA. FloPoCo Compiler. http://flopoco.gforge.inria.fr/.
Intel Corporation. 2005. Intel PXA27x Processor Family, Electrical, Mechanical, and Thermal Specification Datasheet.
Intel Corporation. 2009. Intel 64 and IA-32 Architectures Software Developer’s Manual, Volume 3: System Programming Guide. Intel Corporation, June.
Intel Corporation. 2009. Intel Multi-Core Technology. Intel Corporation. http://www.intel.com/multi-core/.
Intel Corporation. 2010a. Intel Itanium architecture software developer’s manual. http://www.intel.com/design/itanium/manuals/iiasdmanual.htm.
Intel Corporation. 2010b. Multi-Core Debugging for Intel Processors. http://www.intel.com/intelpress/articles/ms2a_2.pdf.
International Business Machines Corporation and Sony Computer Entertainment Incorporated and Toshiba Corporation. n.d. Cell Broadband Engine Programming Handbook. 1.0. Hopewell Junction, NY: IBM Systems and Technology Group.
International Telecommunications Union (ITU). 2005. “National Spectrum Management.”
Ipek, E., M. Kirman, N. Kirman, and J.F. Martinez. 2007. “Core Fusion: Accommodating Software Diversity in Chip Multiprocessors.” ACM SIGARCH Computer Architecture News 35 (2): 186–197.
Iyer, Ravi. 2004. “CQoS: a framework for enabling QoS in shared caches of CMP platforms.” In Proceedings of the 18th Annual International Conference on Supercomputing, 257–266.
Jantsch, A. 2003. “Communication performance in Network-on-Chips.” Presentation at the Swedish INTELECT Summer School on Multiprocessor Systems on Chip (Stockholm).


Jayadevappa, Suryaprasad, Ravi Shankar, and Imad Mahgoub. 2004. “A Comparative Study of Modeling at Different Levels of Abstraction in System on Chip Designs: A Case Study.” In Proceedings of the Annual Symposium on VLSI, 52–58. Los Alamitos, CA, USA: IEEE Computer Society, February. doi:10.1109/ISVLSI.2004.1339508.
Jenks, Stephen, and Jean-Luc Gaudiot. 1996. “Nomadic Threads: A Migrating Multithreaded Approach to Remote Memory Accesses in Multiprocessors.” In Proceedings of the 1996 Conference on Parallel Architectures and Compilation Techniques, 1996, 2–11. IEEE.
Jenks, Stephen, and Jean-Luc Gaudiot. 2002. “An Evaluation of Thread Migration for Exploiting Distributed Array Locality.” In Proceedings of the 16th Annual International Symposium on High Performance Computing Systems and Applications, 2002, 190–195. IEEE.
Jerraya, A. A., and W. Wolf. 2005. Multiprocessor Systems-on-Chips. Morgan Kaufmann Publishers Inc.
Jordans, R., F. Siyoum, S. Stuijk, A. Kumar, and H. Corporaal. 2011. “An Automated Flow to Map Throughput Constrained Applications to a MPSoC.” In Proceedings of the Workshop on Predictability and Performance in Embedded Systems, PPES’11, 47–58. Dagstuhl publishing.
Joshi, Ajay, Christopher Batten, Yong-Jin Kwon, Scott Beamer, Imran Shamim, Krste Asanovic, and Vladimir Stojanovic. 2009. “Silicon-Photonic Clos Networks for Global On-Chip Communication.” In Proceedings of the 3rd ACM/IEEE International Symposium on Networks-on-Chip (NoCS), 124–133. San Diego, CA, USA, May.
Justin A. Boyan, Michael L. Littman. 1994. “Packet Routing in Dynamically Changing Networks: A Reinforcement Learning Approach.” Advances in Neural Information Processing Systems:671–671.
Kachris, Christoforos, and Chidamber Kulkarni. 2007. “Configurable Transactional Memory.” In Proceedings of the 15th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, 2007 (FCCM 2007), 65–72. IEEE.
Kahn, G. 1974. “The Semantics of a Simple Language for Parallel Programming.” In Proceedings of the IFIP Congress Information Processing ’74, edited by J. L. Rosenfeld, 471–475. New York, NY: North-Holland.
Kandemir, M., Feihui Li, M. J. Irwin, and Seung Woo Son. 2008. “A novel migration-based NUCA design for Chip Multiprocessors.” In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2008 (SC 2008), 1–12. IEEE.


Kao, C.-F., S.-M. Huang, and I.-J. Huang. 2007. “A Hardware Approach to Real-Time Program Trace Compression for Embedded Processors.” IEEE Transactions on Circuits and Systems 54 (3): 530–543. doi:10.1109/TCSI.2006.887613.
Karim, F., A. Nguyen, and S. Dey. 2002. “An Interconnect Architecture for Networking Systems on Chips.” IEEE Micro 22 (5): 36–45.
Kavvadias, N., and S. Nikolaidis. 2008. “Elimination of Overhead Operations in Complex Loop Structures for Embedded Microprocessors.” IEEE Transactions on Computers 57, no. 2 (February): 200–214. doi:10.1109/TC.2007.70790.
Kermani, Parviz, and Leonard Kleinrock. 1979. “Virtual Cut-Through: a New Computer Communication Switching Technique.” Computer Networks 3 (4): 267–286.


Khan, M. A., and A. Q. Ansari. 2011a. “128-Bit High-Speed FIFO for Network-on-Chip.” In Proceedings of the IEEE International Conference on Emerging Trends in Computing, 116–121. March.
Khan, M. A., and A. Q. Ansari. 2011b. “An Efficient Tree-Based Topology for Network-On-Chip.” In Proceedings of the IEEE World Congress on Information Technology, edited by Ajith Abraham, 11–14. Mumbai: University of Mumbai, IEEE, December.
Khan, M. A., and A. Q. Ansari. 2011c. “An Quadrant-XYZ routing algorithm for 3-D Asymmetric Torus Routing Chip.” International Journal of ACM Jordan (IJJ): The Research Bulletin of Jordan ACM-ISWSA 2 (2): 18–26.
Khonsari, A., M. Ould-Khaoua, and J. Ferguson. 2003. “A General Analytical Model of Adaptive Wormhole Routing in k-Ary n-Cube Interconnection Networks.” Simulation Series 35:547–554.
Ki, Woo-seo, Hyeong-Ok Lee, and Jae-Cheol Oh. 2009. “The new torus network design based on 3-dimensional hypercube.” In Proceedings of the 11th International Conference on Advanced Communication Technology, 2009 (ICACT 2009), 615–620. Vol. 01. February.
Kiasari, A. E., D. Rahmati, H. Sarbazi-Azad, and S. Hessabi. 2008. “A Markovian Performance Model for Networks-on-Chip.” In Proceedings of the 16th Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP 2008), 157–164. PDP’08. Washington, DC, USA: IEEE Computer Society. isbn: 978-0-7695-3089-5. doi:10.1109/PDP.2008.83.
Kim, Changkyu, Doug Burger, and Stephen W. Keckler. 2002. “An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip Caches.” In ACM SIGPLAN Notices, 211–222. Vol. 37. 10. ACM.


Kim, Kwanho, Se-Joong Lee, Kangmin Lee, and Hoi-Jun Yoo. 2005. “An Arbitration Look-Ahead Scheme for Reducing End-to-End Latency in Networks on Chip.” In Proceedings of the IEEE International Symposium Circuits and Systems (ISCAS’05), 2357–2360. Vol. 3.
Kim, Seongbeom, Dhruba Chandra, and Yan Solihin. 2004. “Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture.” In Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques, 111–122. IEEE Computer Society, October.
Kindratenko, Volodymyr V., Robert J. Brunner, and Adam D. Myers. 2007. “Mitrion-C Application Development on SGI Altix 350/RC100.” In Proceedings of the 15th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM ’07), 239–250. Washington, DC, USA: IEEE Computer Society. isbn: 0-7695-2940-2. doi:10.1109/FCCM.2007.45.
Kini, N. G., M. S. Kumar, and H. S. Mruthyunjaya. 2009. “A Torus Embedded Hypercube Scalable Interconnection Network for Parallel Architecture.” In Proceedings of the IEEE International Advance Computing Conference, 2009 (IACC 2009), 858–861. March. doi:10.1109/IADCC.2009.4809127.
Kock, E. A. de, W. J. M. Smits, P. van der Wolf, J.-Y. Brunel, W. M. Kruijtzer, P. Lieverse, K. A. Vissers, and G. Essink. 2000. “YAPI: application modeling for signal processing systems.” In Proceedings of the Design Automation Conference, DAC’00, 402–405. ACM.
Konstantakopulos, Theodoros, Jonathan Eastep, James Psota, and Anant Agarwal. 2008. Energy Scalability of On-Chip Interconnection Networks in Multicore Architectures. MIT-CSAIL-TR-2008-066. Technical report. MIT CSAIL Technical Report.
Koohi, Somayyeh, Meisam Abdollahi, and Shaahin Hessabi. n.d. “All-Optical Wavelength-Routed NoC based on a Novel Hierarchical Topology.” In Proceedings of the Fifth ACM/IEEE International Symposium on Networks-on-Chip NOCS’11, 97–104.
Koohi, Somayyeh, and Shaahin Hessabi. n.d. “Contention-Free on-Chip Routing of Optical Packets.” In Proceedings of the 3rd ACM/IEEE International Symposium Networks-on-Chip, 2009, 134–143.
Kopp, C., S. Bernabe, B.B. Bakir, J. Fedeli, R. Orobtchouk, F. Schrank, H. Porte, L. Zimmermann, and T. Tekin. 2011. “Silicon photonic circuits: On-CMOS integration, fiber optical coupling, and packaging.” IEEE Journal of Selected Topics in Quantum Electronics 17 (3): 498–509.


Krasnov, Alex, Andrew Schultz, John Wawrzynek, Greg Gibeling, and Pierre-Yves Droz. 2007. “RAMP Blue: A message-passing manycore system in FPGAs.” In Proceedings of the International Conference on Field Programmable Logic and Applications, 2007 (FPL 2007), 54–61. IEEE.
Kreupl, F., A.P. Graham, G.S. Duesberg, W. Steinhögl, M. Liebau, E. Unger, and W. Hönlein. 2002. “Carbon nanotubes in interconnect applications.” Microelectronic Engineering 64 (1): 399–408.
Kumar, A., S. Fernando, Y. Ha, B. Mesman, and H. Corporaal. 2008. “Multiprocessor systems synthesis for multiple use-cases of multiple applications on FPGA.” ACM Transactions on Design Automation of Electronic Systems 13 (3): 1–27.
Kumar, Amit, Partha Kundu, Arvind P. Singh, Li-Shiuan Peh, and Niraj K. Jha. 2007. “A 4.6Tbits/s 3.6GHz Single-cycle NoC Router with a Novel Switch Allocator.” In Proceedings of the 25th International Conference on Computer Design, 2007 (ICCD 2007), 63–70. IEEE.
Kumar, R., D.M. Tullsen, N.P. Jouppi, and P. Ranganathan. 2005. “Heterogeneous Chip Multiprocessors.” IEEE Computer 38 (11): 32–38.
Kumar, S., A. Jantsch, J.-P. Soininen, M. Forsell, M. Millberg, J. Oberg, K. Tiensyrja, and A. Hemani. 2002. “A Network on Chip Architecture and Design Methodology.” In Proceedings of the IEEE Computer Society Annual Symposium on VLSI, 105–112. Pittsburgh, PA.
Kundu, S., R.P. Dasari, S. Chattopadhyay, and K. Manna. 2008. “Mesh-of-Tree based scalable network-on-chip architecture.” In Proceedings of the IEEE Region 10 and the Third international Conference on Industrial and Information Systems, 2008 (ICIIS 2008), 1–6. IEEE.
Kurian, G., J.E. Miller, J. Psota, J. Eastep, J. Liu, J. Michel, L.C. Kimerling, and A. Agarwal. 2010. “ATAC: A 1000-core cache-coherent processor with on-chip optical network.” In Proceedings of the 19th international conference on Parallel Architectures and Compilation Techniques, 477–488. ACM.
Labrecque, Martin, Mark Jeffrey, and J. Gregory Steffan. 2010. “Application-specific signatures for transactional memory in soft processors.” In Proceedings of the 6th international conference on Reconfigurable Computing: Architectures, Tools and Applications, 42–54. ARC’10. Bangkok, Thailand: Springer-Verlag. doi:10.1007/978-3-642-12133-3_7.
Lam, M. 1988. “Software Pipelining: an effective scheduling technique for VLIW machines.” ACM SIGPLAN Notices (New York, NY, USA) 23 (7): 318–328. doi:10.1145/960116.54022.
Lawler, E. L., and D. E. Wood. 1966. “Branch-And-Bound Methods: A Survey.” Operations Research 14 (4): 699–719.


Lee, E. A., and D. G. Messerschmitt. 1987a. “Static Scheduling of Synchronous Data Flow Programs for Digital Signal Processing.” IEEE Transactions on Computers 36 (1): 24–35.
Lee, E. A., and D. G. Messerschmitt. 1987b. “Synchronous Data Flow.” Proceedings of the IEEE 75, no. 9 (September): 1235–1245.
Lee, S.B., S.W. Tam, I. Pefkianakis, S. Lu, M.F. Chang, C. Guo, G. Reinman, C. Peng, M. Naik, L. Zhang, et al. 2009. “A scalable micro wireless interconnect structure for CMPs.” In Proceedings of the 15th annual international conference on Mobile computing and networking, 217–228. ACM.
Lee, S.-J., Seong-Jun Song, Kangmin Lee, Jeong-Ho Woo, and Sung-Eun. 2003. “An 800 MHz Star-Connected On-Chip Network for Application to Systems on a Chip.” In Digest of Technical Papers. 2003 IEEE International Solid-State Circuits Conference, 2003 (ISSCC), 468–469. Vol. 1. ISSCC.
Lelewer, D., and D. Hirschberg. 1987. “Data compression.” ACM Computing Surveys 19, no. 3 (September): 261–296. doi:10.1145/45072.45074. http://portal.acm.org/citation.cfm?id=45074.
Leupers, R., and P. Marwedel. 1998. “Retargetable code generation based on structural processor description.” Design Automation for Embedded Systems 3 (1): 75–108.
Li, Yonghui, and Huaxi Gu. 2009. “XY-turn model for deadlock free routing in honeycomb networks-on-chip.” In Proceedings of the 15th Asia-Pacific Conference on Communications (APCC’09), 900–903. August. doi:10.1109/APCC.2009.5375521.
Li, Z., D. Fay, A. Mickelson, L. Shang, M. Vachharajani, D. Filipovic, W. Park, and Y. Sun. 2009. “Spectrum: A hybrid nanophotonic-electric on-chip network.” In Proceedings of the 46th ACM/IEEE Design Automation Conference, 2009 (DAC’09), 575–580. IEEE.
Liang, J., S. Swaminathan, and R. Tessier. 2000. “aSOC: A Scalable, Single-Chip Communications Architecture.” In Proceedings of the 2000 International Conference on Parallel Architectures and Compilation Techniques, 37–46. Philadelphia, PA.
Liedtke, Jochen, Hermann Härtig, and Michael Hohmuth. 1997. “OS-controlled cache predictability for real-time systems.” In Proceedings of the Third IEEE Real-Time Technology and Applications Symposium, 213–224. IEEE.


Lin, Jiang, Qingda Lu, Xiaoning Ding, Zhao Zhang, Xiaodong Zhang, and P. Sadayappan. 2008. “Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems.” In Proceedings of the 14th IEEE International Symposium on High Performance Computer Architecture, 367–378.
Lines, A. 2004. “Asynchronous Interconnect for Synchronous SoC Design.” IEEE Micro 24, no. 1 (January): 32–41.
Liu, Chun, Anand Sivasubramaniam, and Mahmut Kandemir. 2004. “Organizing the Last Line of Defense before Hitting the Memory Wall for CMPs.” In Proceedings of the International Symposium on High-Performance Computer Architecture, 176–185.
Liu, W., W. Yuan, X. He, Z. Gu, and X. Liu. 2008. “Efficient SAT-Based Mapping and Scheduling of Homogeneous Synchronous Dataflow Graphs for Throughput Optimization.” In Proceedings of the Real-Time Systems Symposium, RTSS’08, 492–504. IEEE.
Loo, S.M., B.E. Wells, N. Freije, and J. Kulick. 2002. “Handel C for rapid prototyping of VLSI coprocessors for real time systems.” In Proceedings of the Thirty-Fourth Southeastern Symposium on System Theory, 2002, 6–10. Vol. 34.
Loucif, S., and M. Ould-Khaoua. 2004. “Modeling Latency in Deterministic Wormhole-Routed Hypercubes under Hot-Spot Traffic.” The Journal of Supercomputing 27 (3): 265–278.
Loucif, S., M. Ould-Khaoua, and G. Min. 2005. “Analytical Modelling of Hot-Spot Traffic in Deterministically-Routed K-Ary N-Cubes.” In Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS’05) - Workshop 15, 8–pp. Vol. 16. IPDPS ’05. Washington, DC, USA: IEEE Computer Society. isbn: 0-7695-2312-9. doi:10.1109/IPDPS.2005.108.
Lu, Z. 2007. “Design and Analysis of On-Chip Communication for Network-on-Chip Platforms.” PhD diss., KTH.
Lu, Zhonghai, Axel Jantsch, and Ingo Sander. 2005. “Feasibility Analysis of Messages for On-Chip Networks using Wormhole Routing.” In Proceedings of the 2005 Asia and South Pacific Design Automation Conference, 960–964. ASP-DAC’05. Shanghai, China: ACM. isbn: 0-7803-8737-6. doi:10.1145/1120725.1120767. http://doi.acm.org/10.1145/1120725.1120767.


Magnusson, Peter S., Magnus Christensson, Jesper Eskilson, Daniel Forsgren, Gustav Hållberg, Johan Högberg, Fredrik Larsson, Andreas Moestedt, and Bengt Werner. 2002. “Simics: A Full System Simulation Platform.” Computer (Los Alamitos, CA, USA) 35 (2): 50–58. doi:10.1109/2.982916. http://dl.acm.org/citation.cfm?id=619072.621909.
Majer, M., C. Bobda, A. Ahmadinia, and J. Teich. 2005. “Packet Routing in Dynamically Changing Networks on Chip.” In Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium, 154b. IEEE, April. doi:10.1109/IPDPS.2005.323.
Manna, K., S. Chattopadhyay, and I. S. Gupta. 2010. “Energy and performance evaluation of a dimension order routing algorithm for Mesh-of-Tree based Network-on-Chip architecture.” In Proceedings of the Annual IEEE India Conference (INDICON), 1–4. December 17–19. doi:10.1109/INDCON.2010.5712666.
Manzke, Michael, and Ross Brennan. 2004. “Extending FPGA based teaching boards into the area of distributed memory multiprocessors.” In Proceedings of the 2004 Workshop on Computer Architecture Education: held in conjunction with the 31st International Symposium on Computer Architecture (WCAE ’04), 5. NY, USA.
Marchetti, M., L. Kontothanassis, R. Bianchini, and M. L. Scott. 1995. “Using simple page placement policies to reduce the cost of cache fills in coherent shared-memory systems.” In Proceedings of the 9th International Parallel Processing Symposium, 480–485. IEEE.
Martin, G. 2006. “Overview of the MPSoC Design Challenge.” In Proceedings of the 43rd annual Design Automation Conference, DAC’06, 274–279. ACM.
Martínez, José F., and Josep Torrellas. 2002. “Speculative synchronization: applying thread-level speculation to explicitly parallel applications.” ACM SIGOPS Operating Systems Review 36 (5): 18–29.
Mattson, Richard L., Jan Gecsei, Donald R. Slutz, and Irving L. Traiger. 1970. “Evaluation Techniques for Storage Hierarchies.” IBM Systems Journal 9 (2): 78–117.
Mazurkiewicz, A. 1987. “Trace theory.” Petri Nets: Applications and Relationships to Other Models of Concurrency:278–324.
Meincke, T., A. Hemani, S. Kumar, P. Ellervee, J. Oberg, T. Olsson, P. Nilsson, D. Lindqvist, and H. Tenhunen. 1999. “Globally asynchronous locally synchronous architecture for large high-performance ASICs.” In Proceedings of the 1999 IEEE International Symposium on Circuits and Systems, 1999 (ISCAS’99), 512–515. Vol. 2. IEEE.

BIBLIOGRAPHY

435

Michaud, P. 2004. “Exploiting the cache capacity of a single-chip multi-core processor with execution migration.” In IEE Proceedings Software, 186– 195. IEEE. Mihajlovic, B., M.H. Neishaburi, J.G. Tong, N. Azuelos, Z. Zilic, and W. J. Gross. 2009. “Providing Infrastructure Support to Assist NoC Software Development.” In Proceedings of the Workshop on Diagnostic Services in Network-on-Chips. Mihajlovic, B., and Z. Zilic. 2011. “Real-time address trace compression for emulated and real system-on-chip processor core debugging.” In Proceedings of the ACM Great Lakes Symposium on VLSI, 331–336. Lausanne, Switzerland. doi:10.1145/1973009.1973075. Mihajlovic, B., Z. Zilic, and K. Radecka. 2007. “Compression and encryption of self-test programs for wireless sensor network nodes.” In Proceedings of the IEEE Midwest Symposium on Circuits and Systems, 1344–1347. . 2010. “Infrastructure for Testing Nodes of a Wireless Sensor Network.” In Handbook of Research on Developments and Trends in Wireless Sensor Networks, edited by H. Jin and W. Jiang, 79–107. IGI Global. isbn: 978-1-61520-701-5. http : / / www . igi - global . com / bookstore/chapter.aspx?titleid=41112. Milenkovic, A., V. Uzelac, M. Milenkovic, and M. Burtscher. 2011. “Caches and Predictors for Real-Time, Unobtrusive, and Cost-Effective Program Tracing in Embedded Systems.” IEEE Transactions on Computers 60, no. 7 (July): 992–1005. doi:10.1109/TC.2010.146. Milenkovic, M., and M. Burtscher. 2007. “Algorithms and Hardware Structures for Unobtrusive Real-Time Compression of Instruction and Data Address Traces.” In Proceedings of the Data Compression Conference, 283–292. isbn: 1068-0314. doi:10.1109/DCC.2007.10. Millberg, M., E. Nilsson, T. Thid, and A. Jantsch. 2004. “Guaranteed Bandwidth Using Looped Containers in Temporally Disjoint Networks within the Nostrum Network on Chip.” In Proceedings of the Conference on Design, Automation and Test in Europe, 890–895. Vol. 2. 
Miller, Jason E., Harshad Kasture, George Kurian, Charles Gruenwald, Nathan Beckmann, Christopher Celio, Jonathan Eastep, and Anant Agarwal. 2010. “Graphite: A distributed parallel simulator for multicores.” In Proceedings of the 2010 IEEE 16th International Symposium on High Performance Computer Architecture, 1–12. IEEE. Minh, Chi Cao, Jae Woong Chung, Christos Kozyrakis, and Kunle Olukotun. 2008. “STAMP: Stanford Transactional Applications for MultiProcessing.” In Proceedings of the IEEE International Symposium on Workload Characterization, 2008 (IISWC 2008), 35–46. IEEE.

436

BIBLIOGRAPHY

ModelSim. n.d. “http://www.model.com/.” Mookherjea, S., and A. Melloni. 2008. Microring resonators in integrated optics. http : / / mnp . ucsd . edu / ece240a _ 2009 / chapter _ microring.pdf. Moore, G.E., et al. 1998. “Cramming more components onto integrated circuits.” Proceedings of the IEEE 86 (1): 82–85. Moore, Kevin E., Jayaram Bobba, Michelle J. Moravan, Mark D. Hill, and David A. Wood. 2006. “LogTM: Log-based transactional memory.” In Proceedings of the 12th International Symposium on High-Performance Computer Architecture (HPCA-’06), 254–265. Austin: IEEE Computer Society. Morad, Tomer Y., et al. 2006. “Performance, Power Efficiency and Scalability of Asymmetric Cluster Chip Multiprocessors.” Computer Architecture Letters 5 (1): 14–17. Moraes, F., N. Calazans, A. Mello, L. M¨oller, and L. Ost. 2004. “HERMES: An Infrastructure for Low Area Overhead Packet-Switching Networks on Chip.” Integration, the VLSI Journal 38, no. 1 (October): 69–93. doi:10. 1016/j.vlsi.2004.03.003. Moreira, O., J.-D. Mol, M. Bekooij, and J. van Meerbergen. 2005. “Multiprocessor Resource Allocation for Hard-real-time Streaming with a Dynamic Job-mix.” In Proceedings of the Real Time and Embedded Technology and Applications Symposium, 332–341. IEEE. Moses, J., K. Aisopos, A. Jaleel, R. Iyer, R. Illikkal, D. Newell, and S. Makineni. 2009. “CMPSched$im: Evaluating OS/CMP interaction on shared cache management.” In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS ’09), 113–122. April. Najaf-Abadi, H. H., and H. Sarbazi-Azad. 2004. “An Accurate Combinatorial Model for Performance Prediction of Deterministic Wormhole Routing in Torus Multicomputer Systems.” In Proceedings of the IEEE International Conference on Computer Design (ICCD’04), 548–553. Washington, DC, USA: IEEE Computer Society. isbn: 0-7695-2231-9. http://dl.acm. org/citation.cfm?id=1032648.1033415. Naveh, A., E. Rotem, A. Mendelson, S. Gochman, R. Chabukswar, K. 
Krishnan, and A. Kumar. 2006. “Power and Thermal Management in the Intel Core Duo Processor.” Intel Technology Journal 10 (2): 109–122. Njoroge, Njuguna, Jared Casper, Sewook Wee, Yuriy Teslyar, Daxia Ge, Christos Kozyrakis, and Kunle Olukotun. 2007. “ATLAS: A chipmultiprocessor with TM support.” In Proceedings of the conference on Design, automation and test in Europe, DATE’07, 3–8.

BIBLIOGRAPHY

437

Noakes, Michael D., Deborah A. Wallach, and William J. Dally. 1993. “The JMachine Multicomputer: An Architectural Evaluation.” ACM SIGARCH Computer Architecture News 21 (2): 224–235. Nurmi, J., H. Tenhunen, J. Isoaho, and A. Jantsch. 2004. Interconnect-centric design for advanced SoC and NoC. Springer. Ogras, U. 2007. “Modeling, Analysis and Optimization of Network-On-Chip Communication Architectures.” PhD diss., Carnegie Mellon University. Ogras, U. Y., and R. Marculescu. 2006. “It’s a Small World After All: NoC Performance Optimization via Long-Range Link Insertion.” IEEE Transactions on Very Large Scale Integration Systems 14, no. 7 (July): 693– 706. . 2007. “Analytical Router Modeling for Networks-on-Chip Performance Analysis.” In Proceedings of the Conference on Design, Automation and Test in Europe, 1096–1101. DATE ’07. San Jose, CA, USA: EDA Consortium. isbn: 978-3-9810801-2-4. http : / / dl . acm . org / citation.cfm?id=1266366.1266602. Oi, Hitoshi, and N. Ranganathan. 1999. “A Cache Coherence Protocol for the Bidirectional Ring Based Multiprocessor.” In Proceedings of the International Conference on Parallel and Distributed Computing and Systems, PDCS’99, 3–6. Open SystemC Initiative. http://www.systemc.org. Ould-Khaoua, M., and H. Sarbazi-Azad. 2001. “An Analytical Model of Adaptive Wormhole Routing in Hypercubes in the Presence of Hot Spot Traffic.” IEEE Transactions on Parallel and Distributed Systems 12, no. 3 (March): 283–292. Pan, Y., P. Kumar, J. Kim, G. Memik, Y. Zhang, and A. Choudhary. 2009. “Firefly: illuminating future network-on-chip with nanophotonics.” ACM SIGARCH Computer Architecture News 37 (3): 429–440. Panainte, E. Moscu, K. L. M. Bertels, and S. Vassiliadis. 2007. “The Molen Compiler for Reconfigurable Processors.” ACM Transactions in Embedded Computing Systems (TECS) 6, no. 1 (February): 6. Pande, P. P., C. Grecu, A. Ivanov, and R. Saleh. 2003. 
“Design of a Switch for Network on Chip Applications.” In Proceedings of the International Symposium on Circuits and Systems, 217–220. Vol. 5. . 2005. “Timing analysis of network on chip architectures for MP-SoC platforms.” Microelectronics Journal 36 (September): 833–45. doi:10 . 1016/j.mejo.2005.03.006.

438

BIBLIOGRAPHY

Pande, P. P., C. Grecu, M. Jones, A. Ivanov, and R. Saleh. 2005. “Performance Evaluation and Design Trade-Offs for Network-On-Chip Interconnect Architectures.” IEEE Transactions on Computers 54, no. 8 (August): 1025– 1040. Pande, Partha Pratim, Amlan Ganguly, Sujay Deb, and Kevin Chang. 2011. “Energy-Efficient Network-on-Chip Architectures for Multicore Systems.” In Handbook of Energy-Aware and Green Computing, edited by Ishfaq Ahmad and Sanjay Ranka. Chapman / Hall/CRC Press Taylor / Francis Group LLC. Pande, Partha Pratim, Cristian Grecu, Amlan Ganguly, Andre Ivanov, and Resve Saleh. 2011. “Test and Fault Tolerance of NoC Infrastructures.” In Networks-on-Chips: Theory and Practice, edited by Fayez Gebali, Haytham Elmiligi, and M.Watheq El-Kharashi. Taylor & Francis Group LLC-CRC Press. Papadopoulos, Gregory M., and David E. Culler. 1990. “Monsoon: an explicit token-store architecture.” ACM SIGARCH Computer Architecture News 18 (3a): 82–91. Parekh, A. 1992. “A Generalized Processor Sharing Approach to Flow Control in Integrated Services Networks.” PhD diss., Massachusetts Institute of Technology. Parhami, Behrooz. 1999. Introduction to Parallel Processing. First. 556. Series in Computer Science. Springer. Parks, T. M. 1995. “Bounded Scheduling of Process Networks.” PhD diss., University of California, EECS Department. Paulin, P., C. Pilkington, and E. Bensoudane. 2002. “StepNP: A System-Level Exploration Platform for Network Processors.” IEEE Design & Test 19, no. 6 (November): 17–26. Pavlidis, Vasilis F., and Eby G. Friedman. 2007. “3-D Topologies for Networkson-Chip.” IEEE Transactions on Very Large Scale Integration (VLSI) Systems 15, no. 10 (October): 1081–1090. doi:10.1109/TVLSI.2007. 893649. Pees, Stefan, Andreas Hoffmann, Vojin Zivojnovic, and Heinrich Meyr. 1999. “LISA—machine description language for cycle-accurate models of programmable DSP architectures.” In Proceedings of the 36th annual ACM/IEEE Design Automation Conference, (AC ’99), 933–938. 
New Orleans, Louisiana, United States: ACM. isbn: 1-58133-109-7. doi:http: //doi.acm.org/10.1145/309847.310101. Petracca, Michele, Benjamin G. Lee, Keren Bergman, and Luca P. Carloni. 2009. “Photonic NoCs: System-Level Design Exploration.” IEEE Micro 29 (4): 74–85.

BIBLIOGRAPHY

439

Peyton Jones, Simon, et al. 2003. The Revised Haskell 98 Report. Cambridge University Press. isbn: 0521826144. Pfister, G. F., and V. Norton. 1985. “Hot Spot Contention and Combining in Multistage Interconnection Networks.” IEEE Transactions on Computers 34 (10): 943–948. Pimentel, A. D. 2008. “The Artemis Workbench for System-level Performance Evaluation of Embedded Systems.” International Journal of Embedded Systems 3 (3): 181–196. Pimentel, A.D., C. Erbas, and S. Polstra. 2006. “A Systematic approach to exploring embedded system architectures at multiple abstraction levels.” IEEE Transactions on Computers 55, no. 2 (February): 99–112. Pinkston, Timothy Mark, and Jose Duato. 2006. Appendix E: Interconnection networks. 4th edition. 1114. Computer Architecture: A Quantitative Approach. Elsevier. Pisinger, David, and Mikkel Sigurd. 2007. “Using Decomposition Techniques and Constraint Programming for Solving the Two-Dimensional BinPacking Problem.” INFORMS Journal on Computing (Institute for Operations Research)(the Management Sciences (INFORMS), Linthicum, Maryland, USA) 19 (1): 36–51. doi:10.1287/ijoc.1060.0181. Plattner, B. 1984. “Real-time Execution Monitoring.” IEEE Transactions on Software Engineering SE-10 (6): 756–764. Puente, V., J. Gregorio, and R. Beivide. 2002. “SICOSYS: an integrated framework for studying interconnection network performance in multiprocessor systems.” In Proceedings of the Euromicro Workshop on Parallel, Distributed and Network-based Processing, 15–22. Canary Islands, Spain: IEEE, January. Purohit, Sohan, Sai Rahul Chalamalasetti, Martin Margala, and Pasquale Corsonello. 2008. “Power-Efficient High Throughput Reconfigurable Datapath Design for Portable Multimedia Devices.” In Proceedings of the International Conference on Reconfigurable Computing and FPGAs (Reconfig08), 217–222. Pusceddu, Matteo, Simone Ceccolini, Gianluca Palermo, Donatella Sciuto, and Antonino Tumeo. 2010. 
“A Compact TM Multiprocessor System on FPGA.” Proceedings of the International Conference on Field Programmable (FPL’10):578–581.

440

BIBLIOGRAPHY

Qin, Wei, Subramanian Rajagopalan, and Sharad Malik. 2004. “A formal concurrency model based architecture description language for synthesis of software development tools.” In Proceedings of the 2004 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems (LCTES ’04), 47–56. Washington, DC, USA: ACM. isbn: 1-58113-806-7. doi:http://doi.acm.org/10.1145/997163. 997171. Qureshi, Moinuddin K., and Yale N. Patt. 2006. “Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches.” In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, 423–432. Rabaey, J.M., A.P. Chandrakasan, and B. Nikolic. 2005. “Digital integrated circuits.” Chap. Coping with Interconnect in, 445–490. Prentice Hall of India Pvt. Limited. Rafique, Nauman, Won-Taek Lim, and Mithuna Thottethodi. 2006. “Architectural support for operating system-driven CMP cache management.” In Proceedings of the Parallel Architectures and Compilation Techniques (PACT ’06), 2–12. September. Rahman, M. M. H., and S. Horiguchi. 2004. “High performance hierarchical torus network under matrix transpose traffic patterns.” In Proceedings of the 7th International Symposium on Parallel Architectures, Algorithms and Networks, 111–116. May. doi:10.1109/ISPAN.2004.1300467. Rajwar, Ravi, and James R. Goodman. 2001. “Speculative lock elision: Enabling highly concurrent multithreaded execution.” In Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture, 294–305. IEEE Computer Society. . 2002. “Transactional lock-free execution of lock-based programs.” In Proceedings of the 10th international conference on Architectural support for programming languages and operating systems, 5–17. ASPLOS-X. San Jose, California: ACM. isbn: 1-58113-574-2. doi:10 . 1145 / 605397 . 605399. http://doi.acm.org/10.1145/605397.605399. Ramanujam, Rohit Sunkam, and Bill Lin. 2009. 
“A Layer-Multiplexed 3D On-Chip Network Architecture.” IEEE Embedded Systems Letters 1 (2): 50–55. Rangan, Krishna K., Gu-Yeon Wei, and David Brooks. 2009. “Thread motion: fine-grained power management for multi-core systems.” ACM SIGARCH Computer Architecture News 37 (3): 302–313. Ranganathan, P., V.S. Pai, H. Abdel-Shafi, and S.V. Adve. 1997. “The Interaction of Software Prefetching with ILP Processors in Shared-Memory Systems.” ACM SIGARCH Computer Architecture News 25 (2): 144–156.

BIBLIOGRAPHY

441

Ranganathan, Parthasarathy, Sarita V. Adve, and Norman P. Jouppi. 2000. “Reconfigurable caches and their application to media processing.” In Proceedings of the 27th Annual International Symposium on Computer Architecture, 214–224. June. Rantala, Ville, Teijo Lehtonen, and Juha Plosila. 2006. Network on Chip Routing Algorithms. TUCS Technical Reports 779. Technical report. Turku Centre for Computer Science. Reshadi, Midia, Ahmad Khademzadeh, Akram Reza, and Maryam Bahmani. A Novel Mesh Architecture for On-Chip Networks. http : / / www . design-reuse.com/articles/23347/on-chip-network.html. Rhoads, Steve. Plasma soft core. http : / / opencores . org / project , plasma. Richardson, A. 2006. WCDMA Design Handbook. Cambridge University Press. isbn: 0521828155. Rigo, S., G. Araujo, M. Bartholomeu, and R. Azevedo. 2004. “ArchC: a systemC-based architecture description language.” In Proceedings of the 16th Symposium on Computer Architecture and High Performance Computing SBAC-PAD 2004, 66–73. doi:10.1109/SBAC-PAD.2004.8. Rigo, Sandro, Marcio Juliato, Rodolfo Azevedo, Guido Ara´ ujo, and Paulo Centoducatte. 2004. “Teaching computer architecture using an architecture description language.” In Proceedings of the 2004 Workshop on Computer Architecture Education (WCAE ’04), 6. Munich, Germany: ACM. doi:http://doi.acm.org/10.1145/1275571.1275580. Rusu, S., S. Tam, H. Muljono, D. Ayers, J. Chang, R. Varada, M. Ratta, and S. Vora. 2010. “A 45nm 8-core enterprise Xeon® processor.” IEEE Journal of Solid-State Circuits 45 (1): 7–14. Saastamoinen, I., D. Siguenza-Tortosa, and J. Nurmi. 2002. “Interconnect IP Node for Future System-on-Chip Designs.” In Proceedings of the the First IEEE International Workshop on Electronic Design, Test and Applications (DELTA’02), 116–120. Christchurch. Salminen, E., T. Kangasb, V. Lahtinenb, J. Riihim’kib, K. Kuusilinnac, and T. D. H¨ am¨ al¨ ainen. 2007. 
“Benchmarking Mesh and Hierarchical Bus Networks in System-On-Chip Context.” Journal of Systems Architecture 53, no. 8 (August): 477–488. Salminen, E., A. Kulmala, and T.D. Hamalainen. 2007. “On Network-onChip Comparison.” Digital Systems Design, Euromicro Symposium on (Los Alamitos, CA, USA):503–510. doi:10.1109/DSD.2007.80.

442

BIBLIOGRAPHY

Salminen, E., A. Kulmala, and T.D. Hamalainen. 2008. “On the Credibility of Load-Latency Measurement of Network-on-Chips.” In Proceedings of the International Symposium on System-on-Chip (SOC 2008), 1–7. Tampere, November. Sangiovanni-Vincentelli, A., and G. Martin. 2001. “Platform-based design and software design methodology for embedded systems.” IEEE Design and Test of Computers 18 (6): 23–33. Santambrogio, A., M.D. Fracassi, M. Gotti, P. Sandionigi, and C. Antola. 2007. “A Novel Hardware/Software Codesign Methodology Based on Dynamic Reconfiguration with Impulse C and Codeveloper.” In Proceedings of the 3rd Southern Conference on Programmable Logic (SPL’07), 221– 224. IEEE, February. Sarbazi-Azad, H., A. Khonsari, and M. Ould-Khaoua. 2002. “Analysis of Deterministic Routing in k-ary n-Cubes with Virtual Channels.” Journal of Interconnection Networks 3 (August): 85–101. Sarbazi-Azad, H., M. Ould-Khaoua, and L. M. Mackenzie. 2001. “Communication Delay in Hypercubes in the Presence of Bit-Reversal Traffic.” Parallel Computing 27, no. 13 (December): 1801–1816. Sassolas, T., N. Ventroux, N. Boudouani, and G. Blanc. 2011. “A Power-Aware Online Scheduling Algorithm for Streaming Applications in Embedded MPSoC.” In Proceedings of the IEEE International Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS), 1–10. Grenoble, France: Springer, September. Sazeides, Y., and J.E. Smith. 1997. “The predictability of data values.” In Proceedings of the IEEE/ACM International Symposium on Microarchitecture, 248–258. isbn: 1072-4451. doi:10.1109/MICRO.1997.645815. Schultz, M. R. de, A. K. I. Mendonca, F. G. Carvalho, O. J. V. Furtado, and L. C. V. Santos. 2007. “Automatically-retargetable model-driven tools for embedded code inspection in SoCs.” In Proceedings of the 50th Midwest Symposium on Circuits and Systems (MWSCAS’07), 245–248. May. doi:10.1109/MWSCAS.2007.4488580. Seiler, L., D. Carmean, E. Sprangle, T. Forsyth, P. Dubey, S. Junkins, A. Lake, et al. 
2009. “Larrabee: A Many-Core x86 Architecture for Visual Computing.” IEEE Micro 29, no. 1 (January): 10–21. doi:10.1109/MM. 2009.9. Shabbir, A., A. Kumar, S. Stuijk, B. Mesman, and H. Corporaal. 2010. “CAMPSoC: An automated design flow for predictable multi-processor architectures for multiple applications.” Special Issue on HW/SW Co-Design: Systems and Networks on Chip, Journal of Systems Architecture 56 (7): 265–277.

BIBLIOGRAPHY

443

Shacham, A., K. Bergman, and L. P. Carloni. 2008. “Photonic Networks-onChip for Future Generations of Chip Multiprocessors.” IEEE Transactions on Computers 57, no. 9 (September): 1246–1260. doi:10.1109/ TC.2008.78. Sheibanyrad, A. 2008. “Impl´ementation Asynchrone d’un R´eseau-sur-Puce Distribu´e (Asynchronous Implementation of a Distributed Network-onChip).” PhD diss., Universit´e de Pierre et Marie Curie. Sheibanyrad, A., A. Greiner, and I. Miro-Panades. 2008. “Multisynchronous and Fully Asynchronous NoCs for GALS Architectures.” IEEE Design & Test of Computers 25, no. 6 (December): 572–580. Sheibanyrad, A., I. Miro Panades, and A. Greiner. 2007. “Systematic Comparison Between the Asynchronous and the Multi-Synchronous Implementations of a Network On Chip Architecture.” In Proceedings of the Conference on Design, Automation and Test in Europe, 1090–1095. Nice, France: IEEE. Shen, H., P. Gerin, and F. P´etrot. 2008. “Configurable Heterogeneous MPSoC Architecture Exploration Using Abstraction Levels.” In Proceedings of the IEEE/IFIP International Symposium on Rapid System Prototyping, 51– 57. Paris, France: IEEE, June. Sherwood, Timothy, Brad Calder, and Joel S. Emer. 1999. “Reducing Cache Misses using Hardware and Software Page Placement.” In Proceedings of the 13th international conference on Supercomputing, 155–164. ACM, June. Shim, Keun Sup, Mieszko Lis, Myong Hyon Cho, Omer Khan, and Srinivas Devadas. 2011. “System-level Optimizations for Memory Access in the Execution Migration Machine (EM2 ).” In Proceedings of the International Workshop on Computer Architecture and Operating System Co-design. Shojaei, H., A. H. Ghamarian, T. Basten, M. C. W. Geilen, S. Stuijk, and R. Hoes. 2009. “A Parameterized Compositional Multi-dimensional Multiple-choice Knapsack Heuristic for CMP Run-time Management.” In Proceedings of the Design Automation Conference, DAC’09, 917–922. ACM. SoClib (Open Platform for Virtual Prototyping of Multi-Processor Systemson-Chip). 
http://www.soclib.fr/. Soininen, J. P., and H. Hensala. 2003. “A Design Methodology for NoC Based Systems.” In Networks on Chips, edited by A. Jantsch and H. Tenhunen. Boston: Kluwer.

444

BIBLIOGRAPHY

Song, Zhaohui, Guangsheng Ma, and Dalei Song. 2008. “A NoC-Based High Performance Deadlock Avoidance Routing Algorithm.” In Proceedings of the International Multisymposiums Computer and Computational Sciences, 2008 (IMSCCS’08), 140–143. Sonmez, Nehir, Oriol Arcas, Gokhan Sayilar, Osman S. Unsal, Adrian Cristal, Ibrahim Hur, Satnam Singh, and Mateo Valero. 2011. “From Plasma to BeeFarm: Design Experience of an FPGA-based Multicore Prototype.” In Proceedings of the 7th international conference on Reconfigurable computing: architectures, tools and applications, 350–362. Springer, March 23– 25. Soteriou, V., and Li-Shiuan Peh. 2004. “Design-space exploration of poweraware on/off interconnection networks.” In Proceedings of the IEEE International Conference on Computer Design: VLSI in Computers and Processors, (ICCD’04), 510–517. October. doi:10.1109/ICCD.2004. 1347970. Sridharan, S., et al. 2006. “Thread migration to improve synchronization performance.” In Proceedings of the Workshop on Operating System Interference in High Performance Applications (OSIHPA’06). Srikantaiah, Shekhar, Mahmut Kandemir, and Mary Jane Irwin. 2008. “Adaptive Set Pinning: Managing Shared Caches in CMPs.” ACM SIGARCH Computer Architecture News 36 (1): 135–144. Sriram, S., and S. S. Bhattacharyya. 2009. Embedded Multiprocessors: Scheduling and Synchronization. Second. CRC Press. Stoica, Ion, Hussein Abdel-Wahab, Kevin Jeffay, Sanjoy K. Baruah, Johannes E. Gehrke, and C. Greg Plaxton. 1996. “A Proportional Share Resource Allocation Algorithm for Real-Time, Time-Shared Systems.” In Proceedings of the 17th IEEE Real-Time Systems Symposium, 288–299. IEEE, December. Stuijk, S. 2007. “Predictable Mapping of Streaming Applications on Multiprocessors.” PhD diss., TU Eindhoven. Stuijk, S., T. Basten, M. C. W. Geilen, and H. Corporaal. 2007. “Multiprocessor Resource Allocation for Throughput-Constrained Synchronous Dataflow Graphs.” In Proceedings of the Design Automation Conference, 777–782. ACM. 
Stuijk, S., M. C. W. Geilen, and T. Basten. 2006a. “Exploring Trade-Offs in Buffer Requirements and Throughput Constraints for Synchronous Dataflow Graphs.” In Proceedings of the Design Automation Conference, DAC’06, 899–904. ACM.

BIBLIOGRAPHY

445

. 2006b. “SDF3 : SDF For Free.” In Proceedings of the International Conference on Application of Concurrency to System Design, ACSD’06, 276–278. IEEE. doi:10.1109/ACSD.2006.23. . 2008. “Throughput-Buffering Trade-Off Exploration for Cyclo-Static and Synchronous Dataflow Graphs.” IEEE Transactions on Computers 57 (10): 1331–1345. . 2010. “A Predictable Multiprocessor Design Flow for Streaming Applications with Dynamic Behaviour.” In Proceedings of the Conference on Digital System Design, DSD’10, 548–555. IEEE. doi:10.1109/DSD. 2010.31. Sudan, Kshitij, Niladrish Chatterjee, David Nellans, Manu Awasthi, Rajeev Balasubramonian, and Al Davis. 2010. “Micro-pages: increasing DRAM efficiency with locality-aware data placement.” SIGARCH Computer Architecture News 38:219–230. Suh, G. E., L. Rudolph, and S. Devadas. 2004. “Dynamic Partitioning of Shared Cache Memory.” Journal of Supercomputing 28, no. 1 (April): 7–26. Suh, G. Edward, Srinivas Devadas, and Larry Rudolph. 2001. “Analytical cache models with applications to cache partitioning.” In Proceedings of the International Conference on Supercomputing (ICS ’01), 1–12. June. citeseer.ist.psu.edu/suh01analytical.html. Suleman, M. Aater, O. Mutlu, M.K. Qureshi, and Y.N. Patt. 2009. “Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures.” ACM Sigplan Notices 44 (3): 253–264. Suleman, M. Aater, Yale N. Patt, Eric A. Sprangle, Anwar Rohillah, Anwar Ghuloum, and Doug Carmean. 2007. ACMP: Balancing Hardware Efficiency and Programmer Efficiency. TR-HPS-2007-001. Technical report. HPS Technical Report, February. Sullivan, C., A. Wilson, and S. Chappell. 2004. “Using C based logic synthesis to bridge the productivity gap.” In Proceedings of the 2004 conference on Asia South Pacific design automation: electronic design and solution fair, 349–354. IEEE Press Piscataway, NJ, USA. Synopsys Inc. n.d. “Design Compiler.” http://www.synopsys.com. . n.d. 
“Primetime Power Analysis, http://www.synopsys.com.” Tam, David, Reza Azimi, Livio Soares, and Michael Stumm. 2009. “RapidMRC: Approximating L2 Miss Rate Curves on Commodity Systems for Online Optimizations.” ACM Sigplan Notices 44 (3): 121–132.

446

BIBLIOGRAPHY

Tam, David, Reza Azimi, and Michael Stumm. 2007. “Thread clustering: sharing-aware scheduling on SMP-CMP-SMT multiprocessors.” In Proceedings of EuroSys 2007, 47–58. March. Tan, Zhangxi, Andrew Waterman, Rimas Avizienis, Yunsup Lee, Henry Cook, David Patterson, and Krste Asanovi´c. 2010. “RAMP gold: An FPGAbased architecture simulator for multiprocessors.” In Proceedings of the 47th ACM/IEEE Design Automation Conference (DAC’10), 463–468. IEEE. Thacker, Chuck. 2009. “A DDR2 Controller for BEE3.” In. Microsoft Research. . 2010a. Beehive: A many-core computer for FPGAs (v5). http:// projects.csail.mit.edu/beehive/BeehiveV5.pdf. . 2010b. Hardware Transactional Memory for Beehive. http : / / research . microsoft . com / en - us / um / people / birrell / beehive/hardwaretransactionalmemoryforbeehive3.pdf. The International Technology Roadmap for Semiconductors. 2008. http:// www.itrs.net/Links/2008ITRS/Update/2008_Update.pdf. The International Technology Roadmap for Semiconductors 2009 for Interconnects. 2009. http : / / public . itrs . net / links / 2009ITRS / Home2009.htm. The International Technology Roadmap for Semiconductors: Assembly and Packaging. 2007. Theelen, B. D., M. C. W. Geilen, T. Basten, J. P. M. Voeten, S. V. Gheorghita, and S. Stuijk. 2006. “A scenario-aware data flow model for combined longrun average and worst-case performance analysis.” In Proceedings of the International Conference on Formal Methods and Models for Co-Design, MEMOCODE, 185–194. IEEE. Thid, Rikard, Ingo Sander, and Axel Jantsch. 2006. “Flexible Bus and NoC Performance Analysis with Configurable Synthetic Workloads.” In Proceedings of the 9th EUROMICRO Conference on Digital System Design, 681–688. DSD ’06. Washington, DC, USA: IEEE Computer Society. isbn: 0-7695-2609-8. doi:10.1109/DSD.2006.52. Thomas, D., A. Hunt, and C. Fowler. 2001. Programming Ruby: the pragmatic programmer’s guide. Addison-Wesley Reading, MA. Thoziyoor, Shyamkumar, Jung Ho Ahn, Matteo Monchiero, Jay B. 
Brockman, and Norman P. Jouppi. 2008. “A Comprehensive Memory Modeling Tool and Its Application to the Design and Analysis of Future Memory Hierarchies.” In Proceedings of the 35th International Symposium on Computer Architecture, (ISCA’08), 51–62. IEEE.

BIBLIOGRAPHY

447

Trancoso, Pedro, and Josep Torrellas. 1996. “The Impact of Speeding up Critical Sections with Data Prefetching and Forwarding.” In Proceedings of the 1996 International Conference on Parallel Processing, (ICPP’96), 79–86. Vol. 3. IEEE. Transaction-level Modeling Working Group. SystemC. http : / / www . systemc.org/. Tripp, J.L., K.D. Peterson, C. Ahrens, J.D. Poznanovic, and M. Gokhale. 2005. “Trident: an FPGA compiler framework for floating-point algorithms.” In Proceedings of the 15th International Conference on Field Programmable Logic and Applications (FPL 2005), 317–322. Uzelac, V., and A. Milenkovic. 2009. “A real-time program trace compressor utilizing double move-to-front method.” In Proceedings of the ACM/IEEE Design Automation Conference, 738–743. isbn: 0738-100X. . 2010. “Hardware-based data value and address trace filtering techniques.” In Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems, 117–126. isbn: 9781-60558-903-9. doi:10.1145/1878921.1878940. Uzelac, V., A. Milenkovic, M. Burtscher, and M. Milenkovic. 2010. “Real-time unobtrusive program execution trace compression using branch predictor events.” In Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems, 97–106. Scottsdale, Arizona, USA. doi:10.1145/1878921.1878938. Vanderbauwhede, W., S. R. Chalamalasetti, S. Purohit, and M. Margala. 2011. “A few lines of code, thousands of cores: High-level FPGA programming using vector processor networks.” In Proceedings of the 2011 International Conference on High Performance Computing and Simulation (HPCS), 461–467. IEEE. Vanderbauwhede, W., M. Margala, S. R. Chalamalasetti, and S. Purohit. 2010. “A C++-embedded Domain-Specific Language for programming the MORA soft processor array.” In Proceedings of the 21st IEEE International Conference on Application-specific Systems Architectures and Processors (ASAP), 141–148. IEEE. Vanderbauwhede, W., M. 
Margala, SR Chalamalasetti, and S. Purohit. 2009. “Programming Model and Low-level Language for a Coarse-Grained Reconfigurable Multimedia Processor.” In Proceedings of the 2009 International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA’09), 195–201. Varatkar, Girish V., and Radu Marculescu. 2004. “On-Chip Traffic Modeling and Synthesis for MPEG-2 Video Applications.” IEEE Transactions on Very Large Scale Integration (VLSI) Systems 12 (1): 108–119.

448

BIBLIOGRAPHY

Ventroux, N., and R. David. 2010. “SCMP Architecture: An Asymmetric Multiprocessor System-on-Chip for Dynamic Applications.” In Proceedings of the Second International Forum on Next-Generation Multicore/Manycore Technologies (IFMT), 6. Saint-Malo, France: ACM, June.

Ventroux, N., A. Guerre, T. Sassolas, L. Moutaoukil, G. Blanc, C. Bechara, and R. David. 2010. “SESAM: an MPSoC Simulation Environment for Dynamic Application Processing.” In Proceedings of the IEEE International Conference on Embedded Software and Systems (ICESS), 1880–1886. Bradford, UK: IEEE, July.

Ventroux, N., T. Sassolas, R. David, G. Blanc, A. Guerre, and C. Bechara. 2010. “SESAM Extension For Fast MPSoC Architectural Exploration And Dynamic Streaming Application.” In Proceedings of the IEEE/IFIP International Conference on VLSI and System-on-Chip (VLSI-SoC), 341–346. Madrid, Spain: IEEE, October.

Ventroux, N., T. Sassolas, A. Guerre, B. Creusillet, and R. Keryell. 2012. “SESAM/Par4All: A Tool for Joint Exploration of MPSoC Architectures and Dynamic Dataflow Code Generation.” In Proceedings of the HIPEAC Workshop on Rapid Simulation and Performance Evaluation: Methods and Tools (RAPIDO), 9. Paris, France, January.

Verghese, Ben, Scott Devine, Anoop Gupta, and Mendel Rosenblum. 1996. “Operating system support for improving data locality on CC-NUMA compute servers.” ACM SIGPLAN Notices (New York, NY, USA) 31 (9): 279–289. doi:http://doi.acm.org/10.1145/248209.237205.

Viaud, E., F. Pêcheux, and A. Greiner. 2006. “An efficient TLM/T modeling and simulation environment based on conservative parallel discrete event principles.” In Proceedings of the conference on Design, automation and test in Europe (DATE), 94–99. Nice, France: European Design and Automation Association, April.

VMware, Inc. 2009. vSphere Resource Management Guide: ESX 4.0, ESXi 4.0, vCenter Server 4.0. VMware, Inc.

Waldspurger, Carl A., and William E. Weihl. 1994. “Lottery Scheduling: Flexible Proportional Share Resource Management.” In Proceedings of the 1st USENIX conference on Operating Systems Design and Implementation (OSDI’94), 1–11. November.

Waldspurger, Carl A., and William E. Weihl. 1995. Stride Scheduling: Deterministic Proportional-Share Resource Management. MIT/LCS/TM-528. Technical report. MIT, June.

Wang, Howard, M. Petracca, A. Biberman, B. G. Lee, L. P. Carloni, and K. Bergman. 2008. “Nanophotonic Optical Interconnection Network Architecture for On-Chip and Off-Chip Communications.” In Proceedings of the Conference on Optical Fiber communication/National Fiber Optic Engineers Conference (OFC/NFOEC’08), 1–3. 24–28 February. doi:10.1109/OFC.2008.4528127.

Wang, Yi, and Dan Zhao. 2007. “Design and Implementation of Routing Scheme for Wireless Network-on-Chip.” In Proceedings of the IEEE International Symposium on Circuits and Systems, 2007 (ISCAS’07), 1357–1360. IEEE.

Weaver, David L., and Tom Germond. 1994. The SPARC architecture manual version 9. Sun Microsystems, Inc.

Wein, E. 2007. “Scale in Chip Interconnect requires Network Technology.” In Proceedings of the International Conference on Computer Design, 2006 (ICCD’06), 180–186. IEEE.

Weldezion, Awet Yemane, Matt Grange, Dinesh Pamunuwa, Zhonghai Lu, Axel Jantsch, Roshan Weerasekera, and Hannu Tenhunen. 2009. “Scalability of Network-On-Chip Communication Architecture for 3-D Meshes.” In Proceedings of the 2009 3rd ACM/IEEE International Symposium on Networks-on-Chip, 114–123. NOCS ’09. Washington, DC, USA: IEEE Computer Society. isbn: 978-1-4244-4142-6. doi:10.1109/NOCS.2009.5071459.

Wentzlaff, D., P. Griffin, H. Hoffmann, Liewei Bao, B. Edwards, C. Ramey, M. Mattina, Chyi-Chang Miao, J.F. Brown, and A. Agarwal. 2007. “On-Chip Interconnection Architecture of the Tile Processor.” IEEE Micro 27, no. 5 (September): 15–31. doi:10.1109/MM.2007.4378780.

Wieferink, A., M. Doerper, R. Leupers, G. Ascheid, H. Meyr, T. Kogel, G. Braun, and A. Nohl. 2004. “A System Level Processor/Communication Co-Exploration Methodology for Multi-Processor System-on-Chip Platforms.” In International Conference on Design, Automation and Test in Europe (DATE), 1530–1591. Vol. 2. Paris, France, February.

Wiggers, M., M. Bekooij, P. Jansen, and G. Smit. 2006. “Efficient Computation of Buffer Capacities for Multi-Rate Real-Time Systems with Back-Pressure.” In Proceedings of the International Conference on Hardware-Software Codesign and System Synthesis, CODES+ISSS’06, 10–15. ACM.

Wikipedia. Fifteen puzzle. http://en.wikipedia.org/wiki/Fifteenpuzzle.

Wiklund, D., and D. Liu. 2003. “SoCBUS: Switched Network on Chip for Hard Real Time Embedded Systems.” In Proceedings of the 17th International Symposium on Parallel and Distributed Processing, 8. IEEE.

Wilhelm, R., J. Engblom, A. Ermedahl, N. Holsti, S. Thesing, D. Whalley, G. Bernat, C. Ferdinand, R. Heckmann, T. Mitra, et al. 2008. “The worst-case execution-time problem – overview of methods and survey of tools.” ACM Transactions on Embedded Computing Systems 7 (3): 1–53.

Woo, S.C., M. Ohara, E. Torrie, J.P. Singh, and A. Gupta. 1995. “The SPLASH-2 programs: characterization and methodological considerations.” ACM SIGARCH Computer Architecture News 23 (2): 24–36.

Woo, Steven Cameron, Jaswinder Pal Singh, and John L. Hennessy. 1994. “The performance advantages of integrating block data transfer in cache-coherent multiprocessors.” ACM SIGPLAN Notices (New York, NY, USA) 29:219–229.

Xilinx Inc. Xilinx Floating Point Operator. http://www.xilinx.com/.

Xilinx Inc. n.d. Fast Simplex Link overview. http://www.xilinx.com/products/ipcenter/FSL.htm.

Xu, Yi, Yu Du, Bo Zhao, Xiuyi Zhou, Youtao Zhang, and Jun Yang. 2009. “A low-radix and low-diameter 3D interconnection network design.” In Proceedings of the 15th International Symposium on High Performance Computer Architecture, 2009 (HPCA’09), 30–42. IEEE, February 14–18. doi:10.1109/HPCA.2009.4798234.

Yang, F. C., C. L. Chiang, and J. Huang. 2010. “A Reverse-Encoding-Based On-Chip Bus Tracer for Efficient Circular-Buffer Utilization.” IEEE Transactions on VLSI Systems 18 (5): 732–741. doi:10.1109/TVLSI.2009.2014872.

Yang, Z. J., A. Kumar, and Y. Ha. 2010. “An area-efficient dynamically reconfigurable Spatial Division Multiplexing network-on-chip with static throughput guarantee.” In Proceedings of the International Conference on Field-Programmable Technology, FPT’10, 389–392.

Yankova, Y., G. Kuzmanov, K. Bertels, G. Gaydadjiev, Y. Lu, and S. Vassiliadis. 2007. “DWARV: Delftworkbench automated reconfigurable VHDL generator.” In Proceedings of the International Conference on Field Programmable Logic and Applications (FPL’07), 697–701. IEEE.

Yi, J. J., and D. J. Lilja. 2006. “Simulation of computer architectures: simulators, benchmarks, methodologies, and recommendations.” IEEE Transactions on Computers 55, no. 3 (March): 268–280. doi:10.1109/TC.2006.44.

Ykman-Couvreur, Ch., V. Nollet, F. Catthoor, and H. Corporaal. 2006. “Fast multidimension multichoice knapsack heuristic for MP-SoC Run-Time Management.” In Proceedings of International Symposium on SoC, 1–4. IEEE.

Ykman-Couvreur, Ch., V. Nollet, F. Catthoor, and H. Corporaal. 2011. “Fast multidimension multichoice knapsack heuristic for MPSoC runtime management.” ACM Transactions on Embedded Computer Systems 10, no. 3 (May): 35:1–35:16. doi:10.1145/1952522.1952528.

Yourst, Matt T. 2007. “PTLsim: A Cycle Accurate Full System x86-64 Microarchitectural Simulator.” In Proceedings of the 2007 IEEE International Symposium on Performance Analysis of Systems & Software, 23–34. San Jose, CA, USA: IEEE, April. isbn: 1-4244-1081-9. doi:10.1109/ISPASS.2007.363733.

Zebchuk, Jason, Vijayalakshmi Srinivasan, Moinuddin K. Qureshi, and Andreas Moshovos. 2009. “A tagless coherence directory.” In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42), 423–434. IEEE.

Zeferino, C. Albenes, and A. Amadeu Susin. 2003. “SoCIN: A Parametric and Scalable Network-on-Chip.” In Proceedings of the 16th Symposium on Integrated Circuits and Systems Design, 169–174.

Zhang, Hui, and Srinivasan Keshav. 1991. “Comparison of Rate-Based Service Disciplines.” In Proceedings of the conference on Communications architecture & protocols, 113–121. SIGCOMM ’91. Zurich, Switzerland: ACM. isbn: 0-89791-444-9. doi:10.1145/115992.116004. http://doi.acm.org/10.1145/115992.116004.

Zhang, Lei, Mei Yang, Yingtao Jiang, Emma Regentova, and Enyue Lu. Generalized Wavelength Routed Optical Micronetwork In Network-on-chip. http://www.osti.gov/eprints/topicpages/documents/record/390/1675818.html.

Zhang, M., and K. Asanović. 2005. “Victim replication: maximizing capacity while hiding wire delay in tiled chip multiprocessors.” ACM SIGARCH Computer Architecture News 33 (2): 336–345.

Zhang, Xiao, Sandhya Dwarkadas, and Kai Shen. 2009. “Hardware Execution Throttling for Multi-core Resource Management.” In Proceedings of the 2009 conference on USENIX Annual technical conference, 23–23. USENIX’09. San Diego, California: USENIX Association. http://dl.acm.org/citation.cfm?id=1855807.1855830.

Zhang, Yuting, and Richard West. 2006. “Process-Aware Interrupt Scheduling and Accounting.” In Proceedings of the 27th IEEE International Real-Time Systems Symposium (RTSS’06), 191–201. IEEE.

Zhao, Hongzhou, Arrvindh Shriraman, and Sandhya Dwarkadas. 2010. “SPACE: sharing pattern-based directory coherence for multicore scalability.” In Proceedings of the 19th international conference on Parallel architectures and compilation techniques, 135–146. PACT ’10. Vienna, Austria: ACM. isbn: 978-1-4503-0178-7. doi:10.1145/1854273.1854294. http://doi.acm.org/10.1145/1854273.1854294.

Zhao, Li, Ravi Iyer, Ramesh Illikkal, Jaideep Moses, Don Newell, and Srihari Makineni. 2007. “CacheScouts: Fine-Grain Monitoring of Shared Caches in CMP Platforms.” In Proceedings of the Parallel Architectures and Compilation Techniques (PACT ’07), 339–352. IEEE Computer Society, September.

Ziv, J., and A. Lempel. 1977. “A universal algorithm for sequential data compression.” IEEE Transactions on Information Theory 23 (3): 337–343.
