IEEE TRANSACTIONS ON COMPUTERS, VOL. –, NO. -, MONTH YYYY
DuCNoC: A High-Throughput FPGA-based NoC Simulator Using Dual-Clock Lightweight Router Micro-Architecture
Hadi Mardani Kamali, Kimia Zamiri Azar, Shaahin Hessabi, Member, IEEE
Abstract—On-chip interconnections play an important role in multi/many-processor systems-on-chip (MPSoCs). To achieve efficient optimization, each specific application must utilize a specific architecture, and consequently a specific interconnection network. For design-space exploration and for finding the best NoC solution for each specific application, a fast and flexible NoC simulator is necessary, especially for large design spaces. In this paper, we present an FPGA-based NoC co-simulator, which can be configured via software. In our proposed NoC simulator, entitled DuCNoC, we implement a dual-clock router micro-architecture, which demonstrates a 75x-350x speed-up against BOOKSIM. Additionally, we implement a two-layer configurable global interconnection in our proposed architecture to (1) reduce the virtualization time overhead, (2) make an efficient trade-off between the resource utilization and the simulation time of the whole simulator, and especially (3) provide the capability of simulating irregular topologies. The migration of some important sub-modules, such as traffic generators (TGs) and traffic receptors (TRs), to the software side, and the implementation of dual-clock context switching in virtualization, are other major features of DuCNoC. Thanks to its dual-clock router micro-architecture, as well as the TG and TR migration to the software side, DuCNoC can simulate a 100-node (10×10) non-virtualized or a 2048-node virtualized mesh network on a Xilinx Zynq-7000.

Index Terms—Network-on-Chip, FPGA, Virtualization, Global Interconnection, Dual-Clock, Router Micro-Architecture
I. INTRODUCTION

ACCORDING to Moore's law, the complexity of basic elements in an integrated circuit, i.e. the number of transistors, approximately doubles within a fixed time interval. On the other hand, due to unsustainable levels of power consumption imposed by higher operating frequencies, manufacturers were forced to increase the number of cores on a System-on-Chip (SoC). As a result, multi/many-core systems have emerged [1]. A scalable and efficient interconnection network is one of the principal components of multi/many-core systems. Earlier, when the number of cores was low, there were non-scalable and non-heterogeneous interconnection networks, such as shared-memory bus-based or point-to-point networks, which are inefficient for large-scale and communication-intensive applications. Consequently, the network-on-chip (NoC) was introduced for implementing large and heterogeneous networks [2], [3].
The authors are with the Department of Computer Engineering, Sharif University of Technology, Tehran 11155-9517, Iran (e-mail: mardani, [email protected]; [email protected]).
Manuscript received MONTH dd, yyyy.
In order to take advantage of this new interconnection network, we face a large design-space exploration problem, which is significantly time-consuming. Hence, a cycle-accurate, flexible, and, above all, high-throughput NoC simulator is necessary to comprehensively explore the corresponding design space for each application. Recent FPGAs are equipped with hundreds of thousands of fine-grained logic cells and coarse-grained communication resources, which provide huge parallelism. Additionally, the NoC infrastructure bears a resemblance to the interconnection network between cells in FPGAs. Therefore, FPGAs are a suitable framework for implementing different NoCs. Unlike FPGA-based NoC simulators, software-based simulators, written in C/C++, have an inherent throughput restriction, especially when the simulated network is large. Although software-based simulators are more flexible, they have to sacrifice some features and configurations to reach a desired trade-off between flexibility and throughput. Meanwhile, throughput tuning, e.g. multi-threading, causes excessive time and memory overhead. Therefore, acceleration via FPGAs motivates researchers to implement FPGA-based NoC simulators to reduce simulation time, especially for large networks.
In this paper, we present DuCNoC, an FPGA-based NoC co-simulator, which can be configured via software. The main aim of this simulator is to maximize speed-up compared to software simulators. We employ several novel techniques to improve simulation speed. The main contribution is a dual-clock router micro-architecture, which significantly decreases the simulation time of packet traversals in the network. We also implement a two-layer configurable global interconnection, in order not only to decrease synchronization latency in virtualization, but also to provide the capability of simulating all types of topologies, even irregular ones.
In addition, our configurable global interconnection network enables DuCNoC not only to simulate larger networks on FPGAs with fewer resources by simulating smaller clusters, but also to improve speed-up by simulating bigger clusters. Migrating some important sub-modules, such as traffic generators (TGs) and traffic receptors (TRs), to the software side, as well as employing dual-clock context switching in virtualization, are other major features of DuCNoC. Thanks to our dual-clock router micro-architecture, as well as the proposed two-layer virtualization mechanism, DuCNoC accelerates simulation by 75x-350x compared to BOOKSIM. Also, the lower overhead of the proposed two-layer
interconnection decreases the overhead of virtualization by around 34%, which reduces the simulation time overhead by 17% on average compared to an architecture with the conventional virtualization mechanism. Also, DuCNoC is able to simulate a 100-node non-virtualized or a 2048-node virtualized mesh network on a Xilinx Zynq-7000.
The rest of this paper is organized as follows: Section II describes previous work. Section III illustrates our approach. The router micro-architecture and the overall architecture of DuCNoC are explained in Section IV. Implementation and experimental results are presented in Section V. Finally, we conclude the paper in Section VI.

II. RELATED WORK

A. Software-based Simulators

The vast majority of current NoC simulators are implemented at the highest level of abstraction, i.e. software, written in C/C++. The most valuable feature of software-based NoC simulators is flexibility: they can easily be developed, modified, and run. BOOKSIM [4] is the most widely used software-based NoC simulator. Some important features of BOOKSIM are accurate modeling of channel latency, pipeline optimization, efficient switch allocation, flexible traffic generation, and especially the capability of integration with GEM5 [5], one of the most powerful full-system simulators. SICOSYS [6] is implemented for modeling packet traversals; it can also be integrated as a sub-module with RSIM [7] and SimOS [8]. In addition, there are some other high-level simulators, written in SystemC, which are able to provide performance evaluation and verification. OCCN [9] is a SystemC-based architecture with three distinct layers for modeling NoCs. It also provides system performance modeling to collect instantaneous and periodic statistics from all nodes. Mapping task graphs and evaluating their performance are the focus of [10].
In [11], a SystemC-based event-driven simulator for modular exploration is designed, which can assess the performance of bus-based, point-to-point, and crossbar interconnections. ARTS [12] is another framework written in SystemC for MPSoCs. It provides a full-system simulator, but since its main focus is on scheduling and mapping, simulation is not efficient. Although software-based NoC simulators can simulate small-scale networks with desirable performance, when the size of the simulated network increases, e.g. for a many-core system, they have to simulate a large number of simultaneous tasks, resulting in substantial throughput degradation.

B. FPGA-based Simulators

Growth in VLSI equips FPGAs with abundant logic resources, which provide fine-grained parallelism and coarse-grained communication. Additionally, as mentioned before, the cell interconnection in FPGAs bears a resemblance to the NoC architecture. Therefore, implementing NoC simulators on FPGAs results in several orders of magnitude speed-up over software simulators. One drawback of FPGA-based simulators is the
necessity of re-simulation and re-synthesis for each change in parameters and configurations. One efficient way to eliminate wasted re-synthesis time is to decouple the simulator architecture from the simulated system [13], [14]. In fact, the base architecture of the NoC simulator is designed statically, and the infrastructure required for different parameters is applied according to the simulated network. Similarly, DuCNoC decouples the simulator architecture from the simulated network, whereby changes in the simulated design do not require re-synthesis. Additionally, similar to most FPGA-based architectures, the DuCNoC cycle (simulation cycle) is separated from the system cycle (FPGA cycle) to provide better observability and more convenience in the evaluation phase. In fact, by keeping a global counter in the overall design, the timing model does not depend on the FPGA cycle (neither the fast nor the slow one in DuCNoC). Furthermore, resource limitation is another drawback of FPGAs for simulating networks. In order to overcome this obstacle, virtualization is used to implement larger networks on the same FPGA [17], [21], [23], [26].
DRNoC [16] is an FPGA-based emulation work-flow based on partial reconfiguration to eliminate re-synthesis time. For evaluation purposes, all configurable parameters can be changed at run-time, and this feature improves total simulation time. However, partial reconfiguration has some non-negligible overheads, such as occupying a large amount of memory, excessive time overhead, and synchronization concerns. Both direct-mapped and virtualized simulation are embedded in [17], which provides researchers the opportunity to compare the two models. However, fewer parameters can be configured in this simulator. In AcENoCs [18], the software side is implemented on a soft MICROBLAZE core in the FPGA. The traffic generator is also implemented on the software side, so bigger NoCs can be simulated on the same FPGA.
Since virtualization is not used in this model, the NoC size is limited by the FPGA size. Also, AcENoCs has no solution for the re-synthesis time overhead. Unlike AcENoCs, HardNoC [19] avoids using a soft IP core to implement traffic injectors and collectors, and has few configurable parameters, which reduces the flexibility of the platform. Similar to AcENoCs, a soft MICROBLAZE core is employed in [20], which implements a NoC simulator with a bigger flit size, i.e. 256 bits. DART [21] is equipped with an engine which can auto-generate nodes in the FPGA according to the chosen topology; therefore, the re-synthesis time for changing parameters is eliminated. DART also decouples the simulator architecture and the simulated network, adopted from PROToFLEX [13] and SIMFLEX [14], which eliminates the re-synthesis flow, and employs virtualization to overcome the FPGA size limitation. However, its TGs and TRs are implemented inside the FPGA, which limits the size of the NoCs it can implement. Similar to DART, auto-generation for removing the re-synthesis process is accomplished in XHiNoC [22], but the utilized interface, i.e. UART, restricts the performance of the simulator. A parametric evaluation system is implemented in [23], which is able to simulate the fat-tree topology, and consequently the Nearest Common Ancestor (NCA) routing algorithm is chosen
TABLE I: Comparison of Latest FPGA-based NoC Simulators

| NoC Simulator | Topology | Flow Control | Routing | Traffic Pattern | Router Arch. | FPGA | HW/SW Interface | Software-side Tool | Max. Freq. | Speed-up (/BOOKSIM) | Re-synth. Solution | Virtualized |
| Papamichael 2011 [17] | 4×4 Mesh, Torus | Credit-based | DoR (XY) | Syn. | 4/8/12/16-port, 2/4/8-VC | V6 ML605 | UART | C on host PC | 56 MHz | 28x | - | TDMA-based |
| AcENoCs 2011 [18] | 5×5 Mesh | Credit-based | DoR (XY) | Syn. + Trace | 5-port, 1/2-VC, 4/5-stage pipe | V5 | UART | on-chip soft-core | 100 MHz | 14x-47x | - | - |
| Zhang 2013 [20] | 4×4 Mesh, Torus | 256-bit (flit), credit-based | DoR (XY) | Syn. | 5-port, 4-VC, 5-stage pipe | V5 | - | on-chip soft-core | 100 MHz | 3000x (sim. model) | - | - |
| DART 2014 [21] | 7×7 Mesh, Irregular (Virtualized) | 36-bit (flit), credit-based | DoR (XY) | Syn. + Trace | up to 8-port, up to 4-VC, 5-stage pipe | V6 ML605 | PCIe | C on host PC | 46 MHz | 100x | Yes | TDMA-based |
| Ying 2014 [22] | 2×2×2 Mesh | Credit-based | 3D DoR (XYZ) | Syn. | 5-port, dedicated | V5 XUPV5 | UART | C on host PC | 100 MHz | - | Yes | - |
| Van Chu 2015 [24] | 64×64 Mesh (Virtualized) | 22/25-bit (flit), credit-based | DoR (XY) | Syn. | 5-port, 2-VC, 1-stage pipe | V7 VC707 | - | - | 100 MHz | 700x (sim. model) | - | TDMA-based |
| FOLCS 2015 [25] | 16×16 Mesh | Credit-based | DoR (XY) | Syn. | 5-port, up to 4-VC, (3+)-stage pipe | V6 ML605 | UART | C on host PC | 66 MHz | 17x-22x | - | - |
| AdapNoC 2016 [26] | 32×32 Mesh, Torus (Virtualized) | 32-bit (flit), credit-based | DoR (XY or YX) + Adaptive | Syn. + Trace | 5-port, 0/2/4-VC, (3+)-stage pipe | V6 ML605 | PCIe | on-chip soft-core | 100 MHz | 53x-180x, 1150x (sim. model) | Yes | Dual-Clock TDMA-based |
| DuCNoC | 64×32 Mesh, Torus, Irregulars (Virtualized) | 32-bit (flit), credit-based | DoR (XY or YX) + Oblivious + Adaptive | Syn. + Trace | up to 64-port, up to 4-VC, (3+)-stage pipe | ZYNQ-7000 ZC706 | PCIe | on-chip hard-core | 85/170 MHz | 75x-350x, 1820x (sim. model) | Yes | Dual-Clock TDMA-based |
as its dedicated routing algorithm. The simplified router micro-architecture of [23] helps implement bigger networks on the same FPGA, but fewer parameters can be configured. Table I summarizes the major features of FPGA-based NoC simulators.

III. DuCNoC APPROACH

The main contributions of this paper are as follows:
• We implement a fully dual-clock lightweight router micro-architecture with a 3-stage pipeline, which guarantees its distinguished throughput against that of BOOKSIM as well as FPGA-based NoC simulators.
• Our two-layer interconnection not only implements virtualization more efficiently, but can also support different types of topologies, even irregular ones.
• We propose a new cluster-based traffic aggregator in order to support modeling of adaptive routing algorithms more accurately, even for irregular networks.
Although DuCNoC and AdapNoC share some major features, such as dual-clock context switching in virtualization and the migration of traffic generators (TGs) and traffic receptors (TRs) to the software side [26], there are major differences between DuCNoC and AdapNoC, as follows:
• The DuCNoC router is implemented using a fully dual-clock router micro-architecture, whereas AdapNoC uses a single-clock router micro-architecture. It should be noted that the dual-clock context switching for the virtualization mechanism, which is implemented in AdapNoC, is separate from the router architecture.
• Unlike AdapNoC, which can simulate only equivalent clusters during a simulation, our proposed two-layer interconnection enables DuCNoC to support different cluster sizes in each simulation.
• The traffic-aggregator flow is changed compared to AdapNoC. AdapNoC implements a vector-based traffic aggregator, whereas DuCNoC utilizes a two-layer traffic aggregation mechanism, which improves the accuracy of the implemented adaptive routing algorithms.
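As a rough illustration of the difference, a two-layer aggregation can be sketched as follows. The data structures and function names here are hypothetical stand-ins for DuCNoC's actual BRAM-based implementation, which is not detailed at this point: layer 1 keeps per-node counts inside a cluster, while layer 2 keeps one aggregate per cluster, so an adaptive routing decision can compare congestion even across cluster boundaries.

```python
# Hypothetical sketch of two-layer traffic aggregation (illustrative only).
from collections import defaultdict

intra = defaultdict(int)   # layer 1: per-node occupancy within a cluster
inter = defaultdict(int)   # layer 2: per-cluster aggregate

def record_flit(cluster, node):
    """Count a waiting flit at both layers."""
    intra[(cluster, node)] += 1
    inter[cluster] += 1

def less_congested(cluster_a, cluster_b):
    """Adaptive choice between two candidate next clusters."""
    return cluster_a if inter[cluster_a] <= inter[cluster_b] else cluster_b

record_flit("A", 1); record_flit("A", 2); record_flit("B", 1)
print(less_congested("A", "B"))  # cluster "B" carries less aggregated traffic
```

A vector-based aggregator, by contrast, would expose only one flat view; keeping the second, per-cluster layer is what lets the adaptive decision remain meaningful when the network is virtualized into clusters.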
It should be noted that we use the ZYNQ-7000 series for implementing DuCNoC. This type of FPGA is equipped with a Kintex fabric, which has fewer logic cells than the Virtex family. Although ZYNQ-7000 FPGAs are equipped with hard-core on-chip processors, our proposed dual-clock router micro-architecture, rather than the type of FPGA, is the main contributor to our distinguished operating frequency, i.e. 170 MHz, which considerably increases DuCNoC throughput compared to other FPGA-based simulators. Additionally, providing the aforementioned features imposes some limitations on DuCNoC. The main resource required for implementing our two-layer interconnection and traffic aggregator is BRAM-based memory in the FPGA, which is limited. Hence, implementing these features restricts the available resources on the FPGA, which reduces the maximum feasible network size. Also, due to the BRAM-based memory restriction, the maximum flit size is 32 bits.
IV. DuCNoC ARCHITECTURE

The overall architecture of DuCNoC is divided into two main sub-modules: (1) the software side and (2) the FPGA side. The software side contains the central processing unit, which controls all actions on the FPGA side, such as sending and receiving data to/from the FPGA side, determining the latest state of FPGA-side operations, and configuring all parameters on the FPGA side. Additionally, since the TGs and TRs are migrated to the software side, generating and sending traffic, as well as receiving and gathering traffic, are handled by the software side. Therefore, all statistical information is calculated on the software side. The FPGA side, which contains all programmable logic, memories, and interfaces, simulates the interconnection network, packet traversal, router micro-architecture, and flow control.
Fig. 1 illustrates the overall architecture of DuCNoC. The figure shows a combination of the overall SoC architecture of the ZYNQ-7000 [31], [32] and the top modules of DuCNoC. As can be seen, by using the Xilinx ZYNQ-7000 ZC706, the software side can be developed on the Processing System (PS), which contains a dual-core ARM Cortex-A9 processor. Additionally, the FPGA side, which implements the infrastructure of DuCNoC, is implemented on the Programmable Logic (PL), which is a Kintex-7 FPGA fabric. Note that using the two ARM processors of the PS for the software side allows us to simulate dynamic traffic with negligible throughput degradation. The overall architecture has a 100 MHz input clock; the PL fabric clocks (85 MHz and 170 MHz) are generated on the software side and are used for the dual-clock TDMA-based context switching in virtualization. Some modules are in common with AdapNoC [26], such as the bank registers for intermediate states, the centralized traffic information aggregator, the routing table updater, and the buffer management on the FPGA side. These modules are employed to implement TDMA-based virtualization and the modeling of adaptive routing algorithms.
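To make the configurable parameters discussed in this paper concrete, the following sketch collects them in a single record. The field names are illustrative, not DuCNoC's actual interface, but the bounds follow the text: at most 64 nodes per cluster, up to 4 VCs per link, a (3+)-stage pipeline, a 32-bit maximum flit size, and the 85/170 MHz fabric clocks with their 2:1 ratio.

```python
from dataclasses import dataclass

# Illustrative configuration record; names are hypothetical, ranges come
# from the paper's stated DuCNoC limits.
@dataclass
class DuCNoCConfig:
    nodes_per_cluster: int = 16   # virtualized cluster size, at most 64
    num_vcs: int = 4              # virtual channels per link, at most 4
    pipeline_stages: int = 3      # (3+): extra EMPTY stages are injected
    flit_width: int = 32          # bits; limited by BRAM-based memories
    slow_clk_mhz: int = 85        # head-flit path / context-switch base clock
    fast_clk_mhz: int = 170       # body/tail-flit path clock

    def validate(self) -> None:
        assert 1 <= self.nodes_per_cluster <= 64, "AXI bus limits clusters to 64 nodes"
        assert 1 <= self.num_vcs <= 4, "up to 4 VCs per link"
        assert self.pipeline_stages >= 3, "reformed pipeline has at least 3 stages"
        assert self.flit_width <= 32, "BRAM restriction: maximum flit size is 32 bits"
        assert self.fast_clk_mhz == 2 * self.slow_clk_mhz, "2:1 dual-clock ratio"

cfg = DuCNoCConfig()
cfg.validate()  # default configuration satisfies all DuCNoC bounds
```

A host-side tool would hand such a record to the FPGA side through configuration registers, so changing any of these values requires no re-synthesis.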
Additionally, as can be seen in Fig. 1, the communication interface between the host and the FPGA is implemented via PCI-Express (PCIe), to guarantee the performance of data transmission for dynamic traffic. FIFOs are employed in DuCNoC for three different purposes: (1) a bridge between the PCIe interface and DDR3, (2) a bridge between DDR3 and the PL, and (3) FIFOs between nodes. The bridges, including (1) the PCIe-to-DDR3 bridge and (2) the DDR3-to-PL bridge, are responsible for transmitting traces from the host to the network of nodes on the FPGA side. In fact, FIFO set (1) is employed between the PCIe multi-channel DMA and our custom DMA (PDMA), to transfer traces from the PCIe interface to DDR3 via an AXI-based channel. FIFO set (2) is a BRAM-based FIFO between the DDR3 interface and the network of nodes on the FPGA side, acting as a traffic translator to convert traces into dynamic traffic for simulation on the network of nodes. One of the ARM CPUs (CPU 1) is responsible for modeling the TGs and TRs on the software side. Fig. 1 shows that CPU 1 is connected to the Memory Interface in order to write/read the generated/received traffic. The BRAM-based FIFO set (3) is employed between the Memory Interface and the network of nodes to model the ejection/injection (E/I) procedures. Additionally,
Fig. 1. DuCNoC overall architecture, implemented on a ZYNQ-7000 ZC706, which consists of a dual-core ARM Cortex-A9 processor as the PS, as well as a Kintex-7 FPGA as the PL.
some FIFOs of set (3) are located between the nodes in order to model the flit queues in the routers.

A. Dual-Clock Mechanism for the Reformed 3-Stage Pipeline

One efficient acceleration approach is to equip the simulator with a dual-clock router micro-architecture. In order to employ a dual-clock mechanism in our router micro-architecture, we need to alter and duplicate the pipeline infrastructure. A typical 5-stage router consists of buffer write (BW), route computation (RC), virtual-channel allocation (VA), switch allocation (SA), and switch traversal (ST). In DuCNoC, we employ an optimized pipeline structure that merges stages to provide the required infrastructure for a dual-clock router micro-architecture. Among all stages, RC and VA are meant only for the head flit. Therefore, if we consider just one stage for BW alongside RC, and likewise one stage for VA alongside SA, we can implement the pipeline with just three stages, i.e. BWRC, VA-SA (VSA), and ST. Fig. 2 shows the structures of a conventional 5-stage pipeline router and the reformed 3-stage pipeline router. As can be seen, in a conventional 5-stage pipeline router, RC and VA are active for head flits, and then become inactive and are replaced by bubbles (no-operations) for the rest of the packet. In the reformed 3-stage pipeline router, on the other hand, BWRC, VSA, and ST are active for all types of flits. The overall structure of our pipeline is similar to speculative approaches [27]–[29], in which a flit speculatively enters the next pipeline stage. For successful
Fig. 2. The Behavior of the Router Pipeline Structure in (a) the Conventional 5-stage and (b) the Reformed 3-stage Router.
speculations, the corresponding flit enters ST directly; for failed speculations, the flit must cross some stages again, depending on where the speculation failed. In order to eliminate speculation, we employ two BWRC stages (one for head flits, with the worse critical path, and one for body/tail flits, with the better critical path). Similarly, we need two stages for VSA. In effect, there are two parallel pipeline paths: one for head flits and one for body/tail flits. Thereby, in this 3-stage pipeline router micro-architecture, BWRC and VSA are duplicated, and their critical paths differ. In fact, since RC and VA are necessary only for head flits, the BWRC and VSA stages for head flits have the longer critical paths. The overall architecture of a dual-clock router micro-architecture is similar to that of a conventional single-clock one. In DuCNoC, we employ two clocks with a frequency ratio of 2 to 1 (170 MHz and 85 MHz). The faster clock is dedicated to body flits, while the slower one handles head flits. This mechanism establishes a harmony between the dual-clock architecture and the reformed pipeline infrastructure with its different paths. Additionally, since we separate the path of head flits from that of tail/body flits, and the delay of each path is different, re-ordering may seem to pose a big challenge. However, since the ST stage is common between the two paths, head flits proceed to ST first, and body/tail flits are processed afterwards; hence, there is no re-ordering issue in this mechanism. Using a dual-clock router micro-architecture has significant advantages over a single-clock one. As mentioned, BWRC and VSA have longer critical paths when dealing with head flits. In other words, when the packet size is large and the number of body flits increases, the probability of dealing with the worse critical path decreases.
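This effect can be quantified with a simple back-of-the-envelope model (ours, not the authors' formal analysis): if the head flit of an L-flit packet occupies a stage for one slow-clock (85 MHz) cycle and each of the remaining L-1 flits for one fast-clock (170 MHz) cycle, the gain over an all-slow single-clock router is 2L/(L+1), approaching the 2:1 clock ratio for long packets.

```python
# Back-of-the-envelope dual-clock model: head flit on the slow clock
# (worse BWRC/VSA critical path), body/tail flits on the fast clock.

def per_packet_time_ns(length, slow_mhz=85.0, fast_mhz=170.0):
    """Time for one packet to stream through a pipeline stage, dual-clock case."""
    t_slow = 1e3 / slow_mhz   # ns per slow-clock cycle
    t_fast = 1e3 / fast_mhz   # ns per fast-clock cycle
    return t_slow + (length - 1) * t_fast

def dual_clock_speedup(length, slow_mhz=85.0, fast_mhz=170.0):
    """Speedup versus a single-clock router that always runs at the slow clock."""
    single = length * (1e3 / slow_mhz)
    return single / per_packet_time_ns(length, slow_mhz, fast_mhz)

for L in (1, 2, 8, 64):
    print(L, round(dual_clock_speedup(L), 3))  # approaches 2.0 as L grows
```

The model ignores allocation stalls and contention, but it captures the trend stated above: the larger the packet, the closer the router runs to the fast clock.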
Therefore, the possibility of using the faster clock increases, and the overall simulation time decreases considerably. Additionally, this mechanism increases the efficiency of the whole system: when the simulation time decreases, the utilization percentage increases, and the throughput improves. In order to simulate different numbers of pipeline stages in the router, DuCNoC has a configurable parameter which can be set to three or more (3+). This parameter indicates the required number of EMPTY (no-operation) stages to be injected. In fact, different numbers of EMPTY stages can be inserted to simulate a real NoC router with more than
Fig. 3. Reformed (3+)-stage Credit-based DuCNoC Router Micro-architecture (For a 5-port Router).
three pipeline stages. Furthermore, the number of VCs can be configured up to 4 VCs per link. To realize the flow-control mechanism, we employ credit-based wormhole (WHR) VC flow control. The overall architecture of the router is illustrated in Fig. 3. As can be seen, the router consists of the input unit, the credit arbiter, the VC and switch allocation, and an updatable routing table.

B. Configurable Dual-Clock Sharable Virtualization

A major concern with FPGA-based NoC simulators is their resource limitation for simulating a big network on one FPGA. In fact, if the simulation is carried out purely on the FPGA side, there is a maximum possible NoC size for each FPGA. The most efficient solution for overcoming the resource limitation on the FPGA side is a virtualization mechanism based on TDMA, which virtualizes the nodes on the FPGA side [17], [21], [23], [26]. By employing the virtualization mechanism, the simulated network is divided into sub-networks, called clusters. Each cluster is implemented on the FPGA side in a dedicated time-slot, separately. The overall simulation time is divided into equal intervals, called time-slots, and all time-slots are allotted to clusters in a round-robin manner. For instance, suppose that we need to simulate a 4×4 mesh network, but the available resources can only simulate a 2×2 mesh network. Using virtualization, we divide the network into four 2×2 mesh sub-networks (clusters), and the available resources are dedicated to each cluster sequentially. Evidently, serialization of resource allocation is necessary in the virtualization technique; it is accomplished in a round-robin manner. After each time-slot, the intermediate state of the current cluster, which holds the essential information of the simulation, such as the interconnection between nodes in the cluster and the synchronization variables, is stored, and the intermediate state of the next assigned cluster is loaded.
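The round-robin time-slot allocation described above, including the store/load of intermediate cluster states, can be sketched as follows (a behavioral toy model, not the hardware implementation; all names are illustrative):

```python
# Toy model of TDMA-based virtualization: a 4x4 mesh split into four 2x2
# clusters that take turns on one shared set of simulated FPGA resources.
NUM_CLUSTERS = 4
SLOT_CYCLES = 8  # timespan of each time-slot (configurable; affects time only)

# per-cluster intermediate state, stored/loaded around each context switch
states = [{"cycle": 0} for _ in range(NUM_CLUSTERS)]

def simulate_slot(state):
    """Stand-in for running one cluster on the shared fabric for one slot."""
    state["cycle"] += SLOT_CYCLES

schedule = []
for slot in range(12):                 # 12 time-slots overall
    cluster = slot % NUM_CLUSTERS      # round-robin allocation
    state = states[cluster]            # load this cluster's intermediate state
    simulate_slot(state)
    schedule.append(cluster)           # state is stored back via `states`

print(schedule)            # [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]
print(states[0]["cycle"])  # 24: each cluster advanced 3 slots of 8 cycles
```

The IDLE-slot sharing and dual-clock context switching discussed next refine exactly this loop: skipping clusters with no pending intra-cluster traffic, and hiding the store/load latency between system-clock edges.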
Similar to AdapNoC, a configurable parameter is defined in DuCNoC to indicate the possible number of nodes in each cluster. Since we employ an AXI-based communication bus between the FPGA side and the software side, the maximum possible number of nodes in each cluster is restricted to 64. Another virtualization-related parameter, which can be configured in DuCNoC, indicates the timespan of each time-slot. It should be noted that the value of the timespan has no effect on the functionality of the system, but it can considerably influence the overall simulation time. In a simple and intuitive scheme, if we have n clusters, 1/n-th of the total simulation time should be allocated, distributed over several time-slots, to each cluster.
Two approaches are employed in DuCNoC to compensate for the serialization overhead due to virtualization as much as possible: (1) the capability of sharing IDLE time-slots, and (2) a dual-clock mechanism for context switching between clusters. In order to fulfill the capability of sharing IDLE time-slots, it is necessary to provide a mechanism to indicate which time-slot is IDLE. A time-slot is IDLE when there is no intra-cluster transmission between its nodes. Therefore, to detect IDLEness, we need access to the contents of the input buffers in each router. As can be seen in Fig. 3, we extract the mentioned contents through an extra output port named "Input BW info", which reports the number of flits waiting in the input BW. If no flit is waiting in the input BW, the corresponding node is idle, and consequently a cluster whose nodes are all idle is considered an IDLE cluster. After detecting an IDLE cluster, its dedicated time-slot is allocated to the next assigned cluster which is not IDLE. This mechanism improves simulation time, especially when the injection rate is low, and for small clusters located in the corners of big networks, far away from the hotspots.
In a virtualized architecture, each cluster is implemented on the FPGA side separately in its dedicated time-slot. Therefore, there are considerable wasted cycles for changing the intermediate states of clusters between time-slots; especially when the ratio of clusters to nodes increases, the wasted cycles become the dominant factor in the time overhead. In order to compensate for a significant extent of the time overhead due to context switching, we use a dual-clock architecture. Fig. 4 illustrates a simple scenario showing how a dual-clock architecture can reduce the number of wasted clocks for context switching in the virtualization structure. As can be seen, by using the falling edges of the state-handler clock, whose frequency is double the system clock frequency, context switching is accomplished between two rising edges of the system clock without wasting any system clock. Hence, in any simulation scenario with many context switches, a significant number of system clocks can be saved, and DuCNoC considerably reduces the time overhead of virtualization. By implementing sharable time-slots, which pays off especially for lower injection rates and small clusters located in the corners of big networks, as well as by employing the dual-clock mechanism for context switching, which pays off especially for higher injection rates, the impact of serialization in virtualization is reduced significantly. This shows that our proposed virtualization structure can minimize the NoC remapping time overhead.

Fig. 4. Mechanism of Dual-Clock Context Switching. (a) State-Handler Clock with Twofold Frequency Compared to the System Clock. (b) The Effectiveness of Dual-Clock Context Switching in Decreasing the Virtualization Overhead.

One of the most important concerns in simulating a virtualized network is the intra-cluster (within a cluster) and inter-cluster (between clusters) interconnections. Two inevitable requirements should be considered in virtualized architectures: (1) the synchronization between intra- and inter-cluster communications, and (2) TDMA-based virtualization in two layers of abstraction, i.e. the FPGA side and the software side. DuCNoC utilizes an index-based two-layer interconnection infrastructure to not only synchronize intra- and inter-cluster communications, but also provide the capability of simulating different topologies, even irregular networks. Both layers consist of BRAM-based FIFOs for establishing the interconnection between nodes, whether for nodes within a cluster (intra-cluster) or nodes in different clusters (inter-cluster). The first layer (layer 1) is a complete graph between a specific number of nodes, which is designated to model each intra-cluster topology in each time-slot. The number of nodes per cluster is a configurable parameter, which defines the number of nodes in the complete graph. As mentioned, the maximum number of nodes in each cluster is restricted to 64; therefore, an up-to-64-node complete graph is established to enable simulating all types of topologies for each cluster. The second layer (layer 2) is a higher-level topology, which determines the interconnection of inter-cluster communications. The BRAM-based FIFOs for the intra-cluster interconnections (layer 1) are shared between time-slots. In fact, all clusters share the same FPGA resources corresponding to the mentioned complete graph in order to accomplish their simulations in different time-slots. In other words, there is only one up-to-64-node complete graph on the FPGA side, and each cluster is laid on this graph sequentially when its time-slot arrives. On the other hand, the BRAM-based FIFOs for the inter-cluster
Fig. 5. Two-layer Interconnection for Intra- and Inter-cluster Communications. The infrastructure of shared complete graph is changed in each time-slot to simulate the corresponding cluster. Inter-cluster FIFOs are dedicated and fixed.
interconnections (layer 2) are dedicated and fixed for all interconnections between clusters. This dedicated, non-virtualized structure between clusters handles all intermediate flits between clusters and overcomes the synchronization concerns between intra- and inter-cluster communications.

Fig. 5 shows an example of a mesh network and the structure of the two-layer interconnection between all nodes. As it can be seen, we need to simulate a 3×3 mesh network, in which the biggest cluster (Cluster A) has 4 nodes. So, we have to set the cluster size to 4, and consequently we have a 4-node complete graph. It should be noted that we can simulate all smaller clusters by using this bigger complete graph. As it can be seen, all three clusters (A, B, and C) are implemented on this complete graph during their allocated time-slots. The utilized FPGA resources are the same across clusters. For instance, the utilized resources for N1 in cluster A, N7 in cluster B, and N6 in cluster C are the same, i.e. R1, which is dedicated to each cluster in its corresponding time-slot. Similarly, all FIFOs between routers are the same. For instance, the BRAM-based FIFO between N1 and N2, the BRAM-based FIFO between N7 and N8, and the BRAM-based FIFO between N3 and N6 are the same, i.e. shared FIFO (SF12) in the complete graph. So, DuCNoC is able to divide larger networks into smaller clusters and simulate these clusters sequentially.

Unlike the resource-sharing capability within clusters (layer 1), the inter-cluster interconnections (layer 2) are dedicated and fixed (non-virtualized). These BRAM-based FIFOs are implemented in a non-virtualized manner to handle parallelism between clusters. Suppose that the current time-slot is
dedicated to Cluster A, and the next time-slot is dedicated to Cluster B. If there are some flits in N4 whose destination is N7, all of them will be stored in the corresponding dedicated FIFO (DFAB1), and will be processed in the next time-slot, which is dedicated to Cluster B. In general, all flits between clusters are stored in the non-virtualized inter-cluster (layer 2) FIFOs (DFs) until their corresponding time-slots are reached. Accordingly, handling parallel operations between clusters is accomplished using the non-virtualized dedicated FIFOs between clusters, which guarantees cycle accuracy in the virtualization mechanism.

Also, as it can be seen in Fig. 5, time-slots 0, 3, 6, ..., 3i are allocated to cluster A, and similarly, time-slots 3i+1 and 3i+2 are dedicated to cluster B and cluster C, respectively. The configurable time-span of all time-slots dedicated to one cluster is equal, but it can differ for other clusters, which are different in size. Additionally, if there is an IDLE cluster, the next assigned cluster can use the time-slot of the IDLE cluster. For instance, if cluster B is IDLE in time-slot 1, cluster C can use its time-slot, and cluster A will be simulated again in time-slot 2.

Note that all FIFOs, including intra- and inter-cluster FIFOs and the bridge between DDR3 and PL, are implemented by using BRAMs, which helps us save LUTs and FFs (logic cells) on the FPGA side. Since FIFOs are a major sub-module in NoCs, and we implement them by using BRAMs, we lighten the resource utilization of the whole architecture, which lets us propose a lightweight global interconnection compared to other FPGA-based simulators.

Since DuCNoC uses a shared complete graph for all clusters, it is obligatory to dequeue all flits in the intra-cluster FIFOs (SFs) that are related to the current cluster, in order to allocate them to the next assigned cluster.
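The idle-slot reallocation described above can be sketched as a round-robin TDMA scheduler that hands an IDLE cluster's time-slot to the next non-idle cluster. This is a minimal software model under our own naming, not DuCNoC's hardware scheduler:

```cpp
#include <cassert>
#include <vector>

// Round-robin TDMA over clusters; a cluster whose routers report no
// waiting flits ("Input BW info" all zero) is IDLE, and its time-slot
// is granted to the next non-idle cluster in round-robin order.
struct TdmaScheduler {
    int clusters;
    int current = 0;
    explicit TdmaScheduler(int n) : clusters(n) {}

    // idle[i] == true means cluster i has nothing to simulate right now.
    // Returns the cluster that actually receives the next time-slot,
    // or -1 if every cluster is idle.
    int next(const std::vector<bool>& idle) {
        for (int step = 0; step < clusters; ++step) {
            int c = current;
            current = (current + 1) % clusters;
            if (!idle[c]) return c;
        }
        return -1;
    }
};
```

With clusters A=0, B=1, C=2 and B idle, the produced slot sequence is A, C, A, ..., matching the example in the text where cluster C takes over B's time-slot.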
Therefore, based on the configured value of the time-slot, a threshold is specified that determines how many simulation cycles are required for blocking the incoming FIFOs to dequeue all flits buffered in them. This mechanism minimizes the bank-register required for storing the essential intermediate contents during context switching. Due to the HW/SW co-designed architecture of DuCNoC, it is indispensable to consider TDMA on both the FPGA side and the software side. As it can be seen in Fig. 1, another BRAM-based FIFO set is employed between the memory interface and the network of nodes, which is responsible for managing TDMA on the FPGA side. On the other hand, an extra level of flow control manages TDMA on the software side [19].

C. Cluster-based Traffic Aggregator for Modeling Adaptive Routing Algorithms

Contrary to previous FPGA-based NoC simulators [17], [18], [20]-[22], [24], [25], which have used deterministic table-based routing algorithms, and similar to AdapNoC [26], DuCNoC is able to simulate adaptive routing algorithms by means of a centralized table-based architecture. Based on this table-based structure, a routing table updater, whose operation depends on the whole traffic information, is a mandatory component. As was mentioned regarding Fig. 3, "Input BW info" is updated via the incoming queues in each router, and indicates
how many flits are waiting in the queues. By using this information, gathered in each router, we can update all routing tables in all nodes via the routing table updater. Based on the updated routing tables, DuCNoC enables us to simulate adaptive routing algorithms. Since DuCNoC provides a configurable virtualization methodology, which supports all types of topologies, even irregular networks, it is necessary to restructure the information-gathering module in DuCNoC compared to AdapNoC [26]. Unlike AdapNoC, whose adaptive routing was limited to mesh networks, DuCNoC is able to simulate adaptive routing algorithms for all topologies. In fact, since the topology of the whole network might not be regular, we establish a non-uniform structure to gather all information from the "Input BW info" ports of each router.

In order to buffer the traffic information of all clusters, we implement a two-layer aggregator module in DuCNoC. The traffic information of each cluster is gathered in its corresponding time-slot and stored in the shared registers of the cluster. This information is transferred to the registers dedicated to the current cluster in layer 2 at context-switching time. Layer 1 registers are implemented in the complete graph, so they are virtualized, and the resources utilized for these registers are shared between clusters. On the other hand, the layer 2 bank-register is implemented in a non-virtualized manner, so it is accessible permanently, regardless of which cluster is currently laid on the FPGA side. By doing so, similar to AdapNoC, the traffic information updater is aware of all traffic information in each node, and updates the routing tables of the nodes of each cluster before its corresponding time-slot starts.
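The two-layer aggregation above can be sketched as follows. The struct and field names are illustrative: shared registers (SRs) are keyed by local slot of the complete graph and flushed into the permanently accessible dedicated registers (DRs), keyed by global node id, at context switching.

```cpp
#include <array>
#include <cassert>
#include <map>

// Per-port occupancy counters ("Input BW info") for one node:
// North, West, South, East.
using PortRegs = std::array<int, 4>;

struct TrafficAggregator {
    // Layer 1: shared registers, one set per slot of the complete graph;
    // reused by whichever cluster currently owns the FPGA fabric.
    std::map<int, PortRegs> shared;     // local slot -> SR
    // Layer 2: dedicated registers, permanently accessible, one set
    // per *global* node id.
    std::map<int, PortRegs> dedicated;  // global node -> DR

    // Called continuously during a cluster's time-slot.
    void update(int localSlot, const PortRegs& info) { shared[localSlot] = info; }

    // At context switching, SRs of the finishing cluster are flushed into
    // the DRs of its global nodes, then cleared for the next cluster.
    void contextSwitch(const std::map<int, int>& slotToGlobal) {
        for (auto& [slot, regs] : shared)
            dedicated[slotToGlobal.at(slot)] = regs;
        shared.clear();
    }
};
```

Because `dedicated` is never cleared, a routing table updater can read any node's last known congestion at any time, which is the property the non-virtualized layer 2 bank-register provides.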
Additionally, since the bank-register of layer 2 is permanently accessible, the routing tables can be updated continuously. Therefore, in order to take advantage of the layer 2 bank-register information, immediately after context switching in each time-slot, the routing tables of the current cluster are updated. So, the routing tables of each cluster hold the latest traffic information in each time-slot, which helps achieve close-to-optimum results with adaptive routing algorithms. As it can be seen in Fig. 1, two important sub-modules are implemented to support adaptive routing algorithms: (1) the traffic information aggregator, and (2) the routing table updater. As their names imply, we aggregate the traffic information of the nodes by using the traffic information aggregator, while the routing table updater is responsible for updating the routing table of each node before its corresponding time-slot starts. Aggregating traffic information, as well as updating the routing tables in the network, allows us to model adaptive routing algorithms by using these updated routing tables. In fact, the routing table updater module updates the routing table of each node based on the aggregated traffic.

In order to evaluate the efficiency and optimality of the two-layer traffic aggregator and routing table updater, we employ a simple adaptive routing algorithm, i.e. Adaptive Toggle DOR (ATDOR). As its name implies, regarding the traffic of the network, ATDOR can toggle the route between XY and YX for
each source-destination pair. In fact, the routing table updater determines which route (XY or YX) is better according to the aggregated traffic, and updates the routing tables accordingly; consequently, route computation (RC) for each packet is accomplished based on the updated routing table. Note that at least 2 VCs are required to ensure deadlock avoidance. Therefore, since DuCNoC is capable of simulating up to 4 VCs, it is possible to simulate a deadlock-free ATDOR by using DuCNoC.

Fig. 6 illustrates the structure of the two-layer centralized traffic aggregator and routing table updater in a 3×3 virtualized mesh network, which enables us to simulate adaptive routing algorithms. As it can be seen, each cluster uses a shared register set (SR) for each node. Since the biggest cluster has 4 nodes in this scenario, there are 4 register sets in the 4-node complete graph, i.e. SR1 for R1, SR2 for R2, SR3 for R3, and SR4 for R4. Also, the number of registers in each set depends on the number of ports in each node. Accordingly, since each node has 4 ports, there are 4 registers in each set, i.e. the North (N), West (W), South (S), and East (E) registers. For each cluster, in its corresponding time-slot, the values of the SRs are updated, and before context switching, the values of the SRs are stored in the corresponding registers of the dedicated register set (DR). For instance, suppose that the current time-slot is dedicated to cluster A, and the next time-slot is dedicated to cluster B. During the current time-slot, the values of SR1, SR2, SR4, and SR5 are updated continuously. After this time-slot completes, the values of these registers are stored in DR1, DR2, DR4, and DR5. After the dedicated registers have been updated, the routing table updater updates the routing tables of the cluster which uses the next time-slot. So, the routing tables of N7 (T7) and N8 (T8) will be updated.
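The XY/YX toggle of ATDOR can be sketched as follows. This is our own minimal model, not DuCNoC's updater: for each source-destination pair it sums the aggregated per-node congestion (the DR contents) along the XY path and along the YX path, and records the cheaper direction.

```cpp
#include <cassert>
#include <map>

// Minimal ATDOR-style decision for a W-wide mesh, node id = y*W + x.
enum class Route { XY, YX };

struct AtdorUpdater {
    int W;                   // mesh width
    std::map<int, int> load; // aggregated flits waiting at each node (DRs)

    // Sum of congestion along a dimension-ordered path, x-first or y-first.
    int cost(int x0, int y0, int x1, int y1, bool xFirst) const {
        int c = 0;
        auto at = [&](int x, int y) {
            auto it = load.find(y * W + x);
            return it == load.end() ? 0 : it->second;
        };
        if (xFirst) {
            for (int x = x0; x != x1; x += (x1 > x0 ? 1 : -1)) c += at(x, y0);
            for (int y = y0; y != y1; y += (y1 > y0 ? 1 : -1)) c += at(x1, y);
        } else {
            for (int y = y0; y != y1; y += (y1 > y0 ? 1 : -1)) c += at(x0, y);
            for (int x = x0; x != x1; x += (x1 > x0 ? 1 : -1)) c += at(x, y1);
        }
        return c;
    }

    // Toggle the routing-table entry for one source-destination pair.
    Route choose(int src, int dst) const {
        int x0 = src % W, y0 = src / W, x1 = dst % W, y1 = dst / W;
        return cost(x0, y0, x1, y1, true) <= cost(x0, y0, x1, y1, false)
                   ? Route::XY : Route::YX;
    }
};
```

Running `choose` for every pair before a cluster's time-slot corresponds to the routing-table update step described above; ties default to XY.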
Accordingly, RC in N7 and N8 will be accomplished by means of the updated routing tables. Unlike AdapNoC, this infrastructure equips DuCNoC to simulate adaptive routing algorithms in a virtualized structure. Additionally, since the routing table updater operates as a non-virtualized module, the routing tables of all nodes are updated in the first cycles of each time-slot, right after context switching. Therefore, the impact of delay in updating the routing tables is dramatically decreased, and subsequently each node accomplishes its routing procedure in a close-to-optimum state.

D. ZYNQ-based Software Simulation Framework

As it can be seen in Fig. 1, the overall architecture of DuCNoC is similar to AdapNoC. The major difference is the type of embedded processor employed for the software side. Contrary to AdapNoC, the software side is implemented on a ZYNQ-7000, which has a dual-core (hard-core) ARM processor. As mentioned previously, the ZYNQ-7000 consists of two parts: (1) an FPGA fabric called PL, and (2) an embedded processor called PS. We employ a dual-processor architecture to (1) handle and support trace-based traffic patterns (dynamic traffic) by using the first core, and (2) run the software application as a user command
Fig. 6. Two-layer Traffic Information Aggregator for Intra- and Inter-cluster Communications. The registers for intra-cluster aggregation are shared, and the registers for inter-cluster aggregation are dedicated.
prompt for configuring the PL by using the second core. The capability of the ZYNQ SoC for communication between PS and PL provides a high-throughput DMA-based communication between the software side and the FPGA side in DuCNoC. Also, the ZYNQ-7000 is equipped with 1 GB of DDR3, which is directly accessible via both PL and PS. By using the PCIe multi-channel DMA in the PL, the traces of dynamic traffic are transmitted via PCIe to DDR3. The packets are then decoded to the specific pattern in order to be transmitted via BRAM-based FIFOs to the FPGA side. Additionally, by using an AXI-stream structure, the efficiency of the bus between PS and PL is increased. The ARM processor also generates synthetic traffic for the simulated network; thus, it is able to support both dynamic and synthetic traffic. It should be noted that receiving packets from the BRAM-based FIFOs on the FPGA side is handled via the TR on the ARM, which calculates statistical information such as latency and the correctness of packets.

Handling the dual-clock architecture on the FPGA side is accomplished via the ZYNQ PS. In fact, the ZYNQ PS provides the clock configuration and all relevant configurations for the PL. Since our design needs two different clocks, the ZYNQ PS provides both required clocks permanently, and an asynchronous structure on the FPGA side handles the dual-clock architecture. Also, all parameters for network configuration, which are listed in Table II, are set via CPU 1. Using the configurable virtualization structure, as well as an efficient dual-clock router micro-architecture, yields a considerable improvement over AdapNoC. Also, using configurable virtualization helps increase the possible size of networks to be simulated. On the other hand, using
TABLE II: Configurable parameters in DuCNoC

Topology: Mesh / Torus / Tree / Irregular
Size: up to 64×32
Routing Algorithm: DoR / Oblivious / Adaptive (ATDOR)
No. of VCs per port: up to 4
Traffic Type: Synthetic / Dynamic
Pipeline Stages: 3 and more
Number of Router Ports: 2 and more
Packet Size (flits): up to 64
Link Latency: Variable
Input VC buffer size: up to 16
VC Allocation: Fixed / Round Robin
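The parameters of Table II could be gathered in a single configuration record on the software side. The struct below is purely illustrative (field names are not DuCNoC's actual API); it encodes the ranges from the table and a validity check:

```cpp
#include <cassert>
#include <string>

// The configurable parameters of Table II (illustrative field names).
struct NocConfig {
    std::string topology = "mesh";      // mesh / torus / tree / irregular
    int width = 64, height = 32;        // up to 64x32
    std::string routing = "DoR";        // DoR / oblivious / adaptive (ATDOR)
    int vcsPerPort = 2;                 // up to 4
    std::string traffic = "synthetic";  // synthetic / dynamic (trace-based)
    int pipelineStages = 3;             // 3 and more
    int routerPorts = 5;                // 2 and more
    int packetSizeFlits = 8;            // up to 64
    int linkLatency = 1;                // variable
    int vcBufferDepth = 16;             // up to 16
    bool roundRobinVA = true;           // fixed / round-robin

    // Check the configuration against the limits in Table II.
    bool valid() const {
        return width <= 64 && height <= 32 &&
               vcsPerPort >= 1 && vcsPerPort <= 4 &&
               pipelineStages >= 3 && routerPorts >= 2 &&
               packetSizeFlits <= 64 && vcBufferDepth <= 16;
    }
};
```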
dual-clock architecture increases the ratio of throughput to resource utilization. In fact, in comparison with AdapNoC, DuCNoC utilizes approximately equivalent resources, while the throughput is considerably improved. In addition, the structure of the adaptive routing table in DuCNoC enables us to present a virtualized adaptive routing table structure, especially for irregular networks.

V. EXPERIMENTAL RESULTS

In order to demonstrate DuCNoC performance, various scenarios should be considered. We targeted a large Xilinx ZYNQ-based SoC architecture (ZC706 evaluation board) to evaluate all configurable parameters on the FPGA side. Additionally, ZYNQ facilitates software-side development by using the Xilinx SDK. Also, Vivado 16.2 is used for synthesizing, implementing, and downloading the overall design onto the FPGA, which enables implementing a schematic block diagram of all parts of the design, even the software side.
Fig. 7. Average Latency Comparison between BOOKSIM [4], DART [21], AdapNoC [26], and DuCNoC. Using new pipeline structure in router micro-architecture improves the accuracy of DuCNoC compared to AdapNoC.
DuCNoC consists of four major sub-modules: (1) the software side, which is implemented on the PS (ARM processors); (2) the FPGA side, which consists of the interconnection network and the router micro-architecture, and is implemented on the PL; (3) the UART interface between the host and the PS for transceiving commands, status, and statistical information; and (4) the PCIe interface for transmitting traces from the host to the PL. The software side has been developed in C++ by using the Xilinx SDK, and the PL is implemented in Verilog HDL by using Xilinx Vivado. Also, all sub-modules in the PL are implemented in an AXI4-based structure.

In order to verify the correctness of DuCNoC simulation results, the overall latency reported by DuCNoC in at least one sample scenario should be compared with the results reported by the baseline simulator, i.e. BOOKSIM. Fig. 7 shows our first validation result in non-virtualized mode. It illustrates the average latency of the network depicted in Table III (Case 1) in different simulators. Since VSA is a single merged stage in our proposed pipeline, one cycle of delay should be counted in BOOKSIM for SA and VA altogether. So, the BOOKSIM configuration for the router micro-architecture has 4 cycles of routing delay, 1 cycle of delay for SA, and zero cycles of delay for VA. As it can be seen, the average latency reported by DuCNoC resembles the average latency calculated by BOOKSIM; using a different pipeline structure in the DuCNoC router micro-architecture approximately eliminates the statistical deviation. It should be noted that 20,000 warm-up cycles, 45,000 measurement cycles, and 15,000 draining cycles are considered for all simulation scenarios.

As mentioned before, we reformed the pipeline of our router micro-architecture into a 3-stage infrastructure, which helped establish a dual-clock router micro-architecture. According to the implementation reports, i.e. the slack time in the place and route (PAR) process, the critical path of DuCNoC restricts the maximum possible frequency to 170 MHz. Therefore, in order to establish a dual-clock architecture, the slower clock frequency should be 85 MHz. By using our proposed dual-clock router micro-architecture, we achieve up to 60% speed-up in DuCNoC in comparison with AdapNoC. Fig. 8 shows the DuCNoC speed compared to
TABLE III: Mesh Network Benchmarks

Parameter | Case 1 | Case 2
Topology | 3×3 | 8×8
Link Latency | 1 Cycle | 1 Cycle
Routing Algorithm | DoR (XY) | DoR (XY) & ATDOR
No. of VCs per port | 2 | 2
VA | Round-Robin | Round-Robin
Input VC buffer Size | 4 | 4
Router Pipeline stages | 5 | 5
Packet Size (flits) | 2 | 16
Traffic Pattern | Random | Random

Fig. 8. The Effectiveness of the Dual-clock Router Micro-architecture in DuCNoC against AdapNoC [26] (Injection rate = 0.4 flits/node/cycle).
AdapNoC for different packet sizes, for the network depicted in Table III (Case 1). It is obvious that increasing the packet size improves the throughput of DuCNoC. Also, since the behavior of the dual-clock router micro-architecture is similar for all injection rates, the injection rate has no effect on these results; the injection rate in Fig. 8 is 0.4 flits/node/cycle. As it can be seen in Fig. 8, increasing the size of the packets makes our proposed architecture more efficient. Even for 2-flit packets, DuCNoC still achieves around 13% speed-up against AdapNoC. However, increasing the packet size beyond a meaningful value results in BRAM-based buffer overhead for storing the packets on the FPGA side. Moreover, the improvement saturates for all packet sizes above 64 flits, where it is limited to 65-70%. So, the best packet size in our router micro-architecture is less than 64, which is less than the size of the BRAM-based FIFOs between nodes.

Using a novel configurable virtualization mechanism enables us to implement a new structure for the traffic aggregator and routing table updater to support adaptive routing algorithms, especially in virtualized mode. As depicted in Fig. 6, we implemented a two-layer structure in order to aggregate traffic information and update the routing tables of all nodes. Furthermore, unlike the virtualized structure of the simulated network, the traffic aggregator and routing table updater behave in a non-virtualized manner, as mentioned before. Using this non-virtualized structure in a virtualized network helps us eliminate the delay impact of the routing table update procedure, which provides more accurate results for adaptive routing algorithms compared to AdapNoC. Fig. 9 illustrates the simulation results of an 8×8 network, which is depicted in Table III (Case 2), in DuCNoC, AdapNoC, the baseline ATDOR [15], and O1turn [30]. As it can be seen, thanks to our two-layer non-virtualized structure, we can simulate
Fig. 9. Simulation Results for Modeling the Adaptive Routing Algorithm in O1Turn [30], the Implemented ATDOR [15], AdapNoC [26], and DuCNoC.

TABLE IV: Resource Utilization of a 10×10 (100-node) Non-Virtualized Mesh Network on a ZYNQ-7000

Module | LUTs | Registers | BRAM (# of 36Kb)
Flit-Queue | 8,643 | 2,629 | 91*
Router | 22,515 | 5,485 | -
Traffic Aggregator / Updater | 5,723 | 6,982 | -
Complete-Graph Mapper | 4,827 | 3,852 | -
PCIe + AXI4 + DDR3 | 2,264 | 1,850 | 48+
* Including all FIFOs between all nodes (W, S, E, and N). + Including FIFOs for E/I and the BRAM interface between DDR3 and the E/I FIFOs.

TABLE V: Resource Utilization of a 32×32 (1024-node) Virtualized Mesh Network with Different Cluster Sizes on a ZYNQ-7000

Cluster Size | LUTs | Registers | BRAM* (Inter-cluster) | BRAM* (Intra-cluster) | BRAM* (DDR3/PL bridge)
2 (2×1) | 58,795 | 19,128 | 152 | 2 | 1
4 (2×2) | 81,865 | 28,543 | 109 | 5 | 1
8 (2×4) | 98,479 | 36,914 | 83 | 9 | 2
16 (4×4) | 113,685 | 38,927 | 46 | 18 | 4
32 (8×4) | 134,915 | 45,815 | 25 | 34 | 8
64 (8×8) | 178,641 | 66,912 | 15 | 65 | 16
Total Available | 218,600 | 437,200 | 545 (all BRAM) | |
* Number of employed 36Kb BRAM modules.

TABLE VI: Resource Utilization of Different Virtualized Mesh Network Sizes with Different Cluster Sizes on a ZYNQ-7000

Network Size | Cluster Size | LUTs | Registers | BRAM* | Speed-up+
512 (32×16) | 16 (4×4) | 99,638 | 23,825 | 46 | 66
512 (32×16) | 32 (8×4) | 114,915 | 26,554 | 86 | 121
512 (32×16) | 64 (8×8) | 147,898 | 37,920 | 88 | 265
1024 (32×32) | 16 (4×4) | 113,685 | 38,927 | 68 | 42
1024 (32×32) | 32 (8×4) | 134,915 | 45,815 | 67 | 98
1024 (32×32) | 64 (8×8) | 178,641 | 66,912 | 96 | 220
2048 (64×32) | 16 (4×4) | 137,032 | 77,836 | 120 | 15
2048 (64×32) | 32 (8×4) | 165,983 | 91,770 | 94 | 38
2048 (64×32) | 64 (8×8) | 202,005 | 124,894 | 113 | 85
Total Available | | 218,600 | 437,200 | 545 |
* Number of employed 36Kb BRAM modules. + Speed-up compared to BOOKSIM (injection rate = 0.3 flits/node/cycle, packet size = 8 flits).

adaptive routing algorithms with more accuracy in DuCNoC than in AdapNoC. Additionally, unlike Fig. 7, which concerns a non-virtualized network, Fig. 9 can be used to validate virtualized network simulation in DuCNoC.

Table IV shows the DuCNoC resource utilization for a 100-node non-virtualized network, the biggest feasible non-virtualized network on the FPGA side, which is approximately similar to AdapNoC. The major difference is the BRAM utilization, which is responsible for transceiving data between DDR3 and the PL, and for the BRAM-based FIFOs establishing non-virtualized queues between all nodes.

Table V shows the resource utilization of a 1024-node network, as a sample feasible virtualized network on the FPGA side, with different cluster sizes. As it can be seen, since we use a new approach in the cluster-based architecture, which lays just one cluster on the FPGA side in each time-slot, increasing the size of the cluster increases the number of utilized LUTs, decreases the number of BRAMs utilized for inter-cluster transmissions, and increases the number of BRAMs utilized for intra-cluster transmissions. BRAM is used for three different purposes (virtualized or non-virtualized) in DuCNoC: (1) non-virtualized, as a bridge between DDR3 and the PL; (2) virtualized, as FIFOs between intra-cluster nodes; and (3) non-virtualized, as FIFOs between inter-cluster nodes. It is obvious in Table V that increasing the size of the cluster generally decreases the overall BRAM usage and increases logic resource utilization. Note that all FIFOs (intra- or inter-cluster) are implemented by using BRAMs. Additionally, these BRAMs, as a bridge between DDR3 and the PL, enable us to keep a copy of packets in DDR3
simultaneously. Therefore, since DDR3 is accessible from the PS, all BRAM data can be observed, which enables us to observe the internal states of all packets. Although increasing the cluster size provides more parallelism in the whole simulation, resource utilization imposes a hard limit on the FPGA side; in fact, there is a restricting threshold for the size of the clusters. On the other hand, decreasing the cluster size utilizes fewer resources, but decreases parallelism drastically, and the overall speed-up is not desirable for an FPGA-based simulator. Furthermore, decreasing the cluster size increases the BRAM utilization on the FPGA side, which causes big concerns in PAR and the maximum operational frequency.

Table VI shows the resource utilization and speed-up for different network and cluster sizes. As it can be seen, although increasing the cluster size provides more speed-up, it restricts the overall feasible network size. On the other hand, decreasing the cluster size allows simulating bigger networks, but the speed-up is not suitable for an FPGA-based NoC simulator. Accordingly, we have to restrict the number of nodes in each cluster to 64, which leads to simulating networks with meaningful speed-ups in comparison with other simulators. In addition, although we can simulate an 8192-node network by employing 16-node clusters, the overall speed-up is around 3x, which is inappropriate for an FPGA-based simulator. So, the biggest network size that can be simulated with considerable speed-up is 1024 or 2048 nodes; nevertheless, it is possible to simulate bigger networks with less throughput. Additionally, it is obvious that
Fig. 10. Virtualization Overhead Comparison between Different Cluster Sizes (Packet length = 8 flits).

Fig. 11. Time Overhead in the Virtualization Structure. The sharable time-slot structure, as well as the cluster-based virtualization infrastructure of DuCNoC, reduces the time overhead by 7% and 17% in comparison with AdapNoC [26] and DART [21], respectively.
the biggest cluster size, i.e. 64, is the best cluster size for simulating all network sizes, unless it is infeasible, e.g. for networks smaller than 64 nodes, whose size is less than the cluster size, and for networks larger than 2048 nodes, where the LUTs are not enough for 64-node clusters.

Using a cluster-based virtualization methodology introduces a significant overhead compared to a non-virtualized architecture. Inheriting sharable context switching in the virtualization mechanism from AdapNoC, as well as implementing a thoroughly dual-clock router micro-architecture, eliminates the mentioned overhead to a great extent. Fig. 10 depicts the overhead of virtualization for the 8×8 network (Table III, Case 2), in both virtualized and non-virtualized architectures. For a non-virtualized structure, the number of simulation cycles required for simulating one cycle of the network is equal to 1, while it is variable (more than 1) for virtualized structures. It should be noted that since time-slots are sharable, the overhead is lower at lower injection rates. Fig. 10 shows that the relationship between virtualization overhead and cluster size is linear, especially for higher injection rates: if we divide the network into two clusters, the virtualization overhead is approximately doubled. In other words, we
equivalently divide the system cycles of a simulation between clusters, with a TDMA structure, so the number of clusters directly determines the overhead of virtualization. Furthermore, IDLEness is more probable for smaller cluster sizes, especially for corner clusters. So, for the same injection rate, the overhead of virtualization is lower for smaller cluster sizes. However, as it can be seen in Fig. 10, increasing the injection rate maximizes the timing overhead imposed by virtualization. This implies that the IDLEness mechanism is more effective at lower injection rates, while the probability of IDLEness is close to zero at higher injection rates. Accordingly, using the sharable time-slot mechanism, as well as the dual-clock structure for context switching, significantly compensates for the virtualization time overhead.

Fig. 11 illustrates the average number of simulation cycles necessary in virtualized mode. In fact, the y-axis depicts the overhead cycles required to simulate one cycle in virtualized mode. As mentioned before, the number of simulation cycles required for simulating one network cycle in non-virtualized mode is equal to 1; but since we serialize the network simulation, several simulation cycles are needed to model one cycle of the non-virtualized mode. Since we employ a new cluster-based virtualization approach in DuCNoC, which is equipped with a time-span threshold for each time-slot, the serialization overhead is decreased in comparison with AdapNoC. Although at the highest feasible injection rate, i.e. in saturation, the time overhead of DART, AdapNoC, and DuCNoC is approximately equivalent, at lower injection rates the time overhead of DuCNoC is reduced by around 34%. Also, as it can be seen, DuCNoC decreases the time overhead by 7% and 17% on average compared to AdapNoC and DART, respectively. Using an index-oriented global interconnection helps support irregular and custom topologies.
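Table-based source routing for an arbitrary topology can be initialized offline as sketched below: for every destination, a BFS over the adjacency list yields the next hop at every node, so the tables can be filled "with whatever we need" even for irregular graphs. This is an illustrative precomputation, not DuCNoC's table-initialization code.

```cpp
#include <cassert>
#include <queue>
#include <vector>

// Arbitrary (possibly irregular) topology as an adjacency list.
using Adj = std::vector<std::vector<int>>;

// table[node][dest] = next hop from node toward dest
// (-1 means node == dest, or dest is unreachable).
std::vector<std::vector<int>> buildTables(const Adj& g) {
    int n = (int)g.size();
    std::vector<std::vector<int>> table(n, std::vector<int>(n, -1));
    for (int dest = 0; dest < n; ++dest) {
        std::vector<int> next(n, -1);
        std::vector<bool> seen(n, false);
        std::queue<int> bfs;
        bfs.push(dest);
        seen[dest] = true;
        // BFS outward from the destination: the node we came from is
        // exactly the next hop toward the destination.
        while (!bfs.empty()) {
            int u = bfs.front(); bfs.pop();
            for (int v : g[u]) {
                if (!seen[v]) {
                    seen[v] = true;
                    next[v] = u;  // from v, step toward dest via u
                    bfs.push(v);
                }
            }
        }
        for (int node = 0; node < n; ++node) table[node][dest] = next[node];
    }
    return table;
}
```

On an irregular 4-node star (node 1 connected to 0, 2, and 3), every route between leaves passes through node 1, which the precomputed tables reflect directly.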
To demonstrate and evaluate the behavior of DuCNoC on custom and irregular topologies, three different networks are chosen. Fig. 12 shows these irregular sample networks: (a) a mesh network with express links, (b) a tree, and (c) an irregular topology, all implemented by means of our cluster-based virtualization architecture. As can be seen in Fig. 12, the results obtained by DuCNoC are compared with those of DART and BOOKSIM, confirming the correctness of the proposed simulator on different custom and irregular topologies. It should be noted that using table-based source routing allows us to initialize all routing tables with whatever routes we need, which helps support irregular topologies with a guaranteed deadlock-free routing algorithm. As a HW/SW co-simulator, DuCNoC suffers more throughput degradation than pure FPGA-based simulators. However, the dual-clock router architecture, cluster-based virtualization, and the inherited sharable time-slot mechanism yield a significant speed-up compared to other HW/SW co-simulators, such as DART and AcENoCs. Fig. 13 shows the average DuCNoC speed-up against AdapNoC, DART, and AcENoCs in all depicted scenarios. Two points are substantial: (1) a 75x-350x improvement on average compared to BOOKSIM, and (2) a lower throughput degradation rate against AdapNoC
[Fig. 12 graphics (topology diagrams and latency plots) are not recoverable from the text extraction; only the following annotations survive: "The worst average latency without express links"; "Content-aware routing and express links reduce average packet latency" (mesh with express links); "Due to limited number of paths between each source-destination, content-aware routing has less effect on average latency"; "Content-aware routing has no effect on average latency in tree". Axes: Average Packet Latency (flit cycles) vs. Injection Rate (flits/node/cycle); curves: DuCNoC Deterministic, DuCNoC Adaptive, BOOKSIM [4], DART [21], No Express Link [4].]
Fig. 13. DuCNoC Simulation Speed Compared to AcENoCs [18], DART [21], and AdapNoC [26].
due to our new cluster-based virtualization. For SPLASH-2 traffic, DuCNoC provides an average 86x speed-up against BOOKSIM. Fig. 14b illustrates the DuCNoC speed-up against BOOKSIM for trace-based traffic (SPLASH-2). As can be seen, the speed-up differs across benchmarks: thanks to IDLEness detection in our proposed architecture, DuCNoC provides better speed-up for low traffic volumes, like fft, while for high traffic volumes, like cholesky, the speed-up is below average. Additionally, as shown in Fig. 14a, the average packet latency obtained by DuCNoC matches that of BOOKSIM, which demonstrates the accuracy of DuCNoC.
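The table-based source routing mentioned above can be illustrated with a small sketch: next-hop tables are precomputed offline for an arbitrary (possibly irregular) topology and then consulted per hop, so any deadlock-free route set can be loaded into the tables. The example below simply uses BFS shortest paths over a hypothetical irregular graph; it is not DuCNoC's actual table contents or format:

```python
from collections import deque

def build_routing_tables(adj):
    """Precompute per-node next-hop tables for an arbitrary topology
    (adjacency lists) via one BFS per destination."""
    tables = {n: {} for n in adj}
    for dst in adj:
        dist = {dst: 0}
        q = deque([dst])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    tables[v][dst] = u  # v forwards toward dst via u
                    q.append(v)
    return tables

def route(tables, src, dst):
    """Follow table lookups hop by hop; returns the full path."""
    path = [src]
    while path[-1] != dst:
        path.append(tables[path[-1]][dst])
    return path
```

Because each hop strictly decreases the BFS distance to the destination, the resulting routes are acyclic and hence deadlock-free by construction, which is the same guarantee table-based source routing gives DuCNoC on irregular topologies.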
[Plot data for Figs. 13 and 14 is not recoverable from the text extraction. Fig. 13 axes: Simulation Speed (10^3 cycles per second) vs. Injection Rate (flits/node/cycle); series: AdapNoC [26], AcENoCs [18], DART [21], DuCNoC; annotations: "35.6% improvement against AdapNoC", "112.1% improvement against AdapNoC", "Fewer throughput degradation rate". Fig. 14 benchmarks: barnes, cholesky, fft, lu_con, lu_non, ocean_non, radix, raytrace, water_spatial, AVG; axes: Average Packet Latency (Cycles) and DuCNoC/BOOKSIM speed-up.]
Fig. 12. Sample Custom and Irregular Networks. (a) A Mesh Network with Express Bypassing Links via 2 Clusters (b) A 9-node Tree via 3 Clusters (c) An Irregular Topology.
Fig. 14. Simulating Trace-based Traffic (SPLASH-2) in DuCNoC. (a) Average Packet Latency (b) Speed Comparison with BOOKSIM
VI. CONCLUSION In this paper, we presented DuCNoC, an FPGA-based NoC co-simulator that can be configured via software. The main contribution of this simulator is maximizing speed-up compared to other NoC simulators, especially software simulators. We employed several architectural techniques in the router micro-architecture to accelerate the
simulation speed. The main contribution is the dual-clock router micro-architecture, which significantly decreases the latency of packet traversal, providing a 75x-350x speed-up against BOOKSIM. Also, by using BRAMs, we implemented a lightweight configurable global interconnection, in order to decrease the synchronization latency of virtualization. The global interconnection network enables DuCNoC to simulate a 2048-node network on the ZYNQ ZC706 with fewer resources, by simulating feasible clusters on the FPGA. Migrating some important sub-modules, like traffic generators (TGs) and traffic receptors (TRs), to the software side, as well as employing dual-clock virtualization, are other major features of DuCNoC. REFERENCES [1] S. Borkar, "Thousand Core Chips: A Technology Perspective," in Proc. of ACM/IEEE Design Automation Conference (DAC), pp. 746-749, 2007. [2] B. Dally and B. Towles, "Route Packets, Not Wires: On-Chip Interconnection Networks," in Proc. of IEEE Design Automation Conference (DAC), pp. 684-689, 2001. [3] B. Dally and B. Towles, "Principles and Practices of Interconnection Networks," Morgan Kaufmann, San Francisco, CA, 2004. [4] J. Nan, J. Balfour, D. U. Becker, B. Towles, W. J. Dally, G. Michelogiannakis and J. Kim, "A Detailed and Flexible Cycle-Accurate Network-on-Chip Simulator," in Proc. of Int'l Symp. on Performance Analysis of Systems and Software (ISPASS), pp. 86-96, 2013. [5] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill and D. A. Wood, "The GEM5 Simulator," SIGARCH Computer Architecture News, vol. 39, no. 2, pp. 1-7, 2011. [6] V. Puente, J. A. Gregorio and R. Beivide, "SICOSYS: An Integrated Framework for Studying Interconnection Network Performance in Multiprocessor Systems," in Proc. of Euromicro Workshop on Parallel, Distributed and Network-based Processing, pp. 15-20, 2002. [7] V. S. Pai, P. Ranganathan and S. V.
Adve, "RSIM: An Execution-Driven Simulator for ILP-Based Shared-Memory Multiprocessors and Uniprocessors," in Proc. Workshop on Computer Architecture Education, 1997. [8] M. Rosenblum, E. Bugnion, S. A. Herrod and S. Devine, "Using the SimOS Machine Simulator to Study Complex Computer Systems," in ACM Transactions on Modeling and Computer Simulation (TOMACS), vol. 7, no. 1, pp. 78-103, 1997. [9] M. Coppola, S. Curaba, G. Grammatikakis, G. Maruccia and F. Papariello, "OCCN: A NoC Modeling and Simulation Framework," in Journal of Systems Architecture: the EUROMICRO Journal, vol. 50, no. 2-3, pp. 174-179, 2004. [10] K. Goossens, J. Dielissen, O. P. Gangwal, S. G. Pestana, A. Radulescu and E. Rijpkema, "A Design Flow for Application-Specific Networks on Chip with Guaranteed Performance to Accelerate SOC Design and Verification," in Proc. Design, Automation and Test in Europe (DATE), pp. 1182-1187, 2005. [11] T. Kogel, M. Doerper, A. Wieferink, R. Leupers and G. Ascheid, "A Modular Simulation Framework for Architectural Exploration of On-Chip Interconnection Networks," in Proc. IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and Systems Synthesis (CODES/ISSS), pp. 7-12, 2003. [12] S. Mahadevan, K. Virk and J. Madsen, "ARTS: A SystemC-Based Framework for Modelling Multiprocessor Systems-on-Chip," in Design Automation for Embedded Systems, vol. 11, no. 4, pp. 285-311, 2006. [13] E. S. Chung, E. Nurvitadhi, J. C. Hoe, B. Falsafi and K. Mai, "PROToFLEX: FPGA-Accelerated Hybrid Functional Simulator," in IEEE Int'l Parallel and Distributed Processing Symposium, vol. 1, no. 6, pp. 26-30, 2007. [14] Z. Tan, A. Waterman, R. Avizienis, Y. Lee, H. Cook, D. Patterson and K. Asanovic, "RAMP Gold: An FPGA-based Architecture Simulator for Multiprocessors," in Proc. of IEEE Design Automation Conference (DAC), pp. 463-468, 2010. [15] R. Manevich, I. Cidon, A. Kolodny, I. Walter and S.
Wimer, "A Cost Effective Centralized Adaptive Routing for Networks-on-Chip," in Euromicro Conference on Digital System Design (DSD), pp. 39-46, 2011. [16] Y. E. Krasteva, F. Criado, E. de la Torre and T. Riesgo, "A Fast Emulation-Based NoC Prototyping Framework," in Int'l Conf. on Reconfigurable Computing and FPGAs, pp. 211-216, 2008. [17] M. K. Papamichael, "Fast Scalable FPGA-based Network-on-Chip Simulation Models," in ACM/IEEE Int'l Conf. on Formal Methods and Models for Codesign (MEMOCODE), pp. 77-82, 2011. [18] V. Pai, S. Lotlikar and P. Gratz, "AcENoCs: A Configurable HW/SW Platform for FPGA Accelerated NoC Emulation," in IEEE Int'l Conf. on VLSI Design, pp. 147-152, 2011. [19] G. Heck, R. Guazzelli, F. Moraes, N. Calazans and R. Soares, "HardNoC: A Platform to Validate Networks on Chip Through FPGA Prototyping," in Southern Conf. on Programmable Logic, pp. 1-6, 2012.
[20] Y. Zhang, P. Qu, Z. Qian, H. Wang and W. Zheng, "Software/Hardware Hybrid Network-on-Chip Simulation on FPGA," in Proc. of Int'l Conf. on Network and Parallel Computing (NPC 2013), vol. 8147, pp. 167-178, 2013. [21] D. Wang, C. Lo, J. Vasiljevic, N. Enright Jerger and J. Gregory Steffan, "DART: A Programmable Architecture for NoC Simulation on FPGAs," in IEEE Transactions on Computers, vol. 63, no. 3, pp. 664-678, 2014. [22] H. Ying, T. Hollstein and K. Hofmann, "A Hardware/Software Co-design Reconfigurable Network-on-Chip FPGA Emulation Method," in Reconfigurable and Communication-Centric Systems-on-Chip (ReCoSoC), pp. 1-8, 2014. [23] S. Abba and J. A. Lee, "A Parametric-based Performance Evaluation and Design Trade-offs for Interconnect Architectures using FPGAs for Networks-on-Chip," in Microprocessors and Microsystems, vol. 38, no. 5, pp. 375-398, 2014. [24] T. V. Chu, S. Sato and K. Kise, "Ultra-Fast NoC Emulation on a Single FPGA," in IEEE Int'l Conf. on Field Programmable Logic and Applications (FPL), pp. 1-8, 2015. [25] T. Naruko and K. Hiraki, "FOLCS: A Lightweight Implementation of a Cycle-accurate NoC Simulator on FPGAs," in Proc. Int'l Workshop on Many-core Embedded Systems (MES), pp. 25-32, 2015. [26] H. M. Kamali and S. Hessabi, "AdapNoC: A Fast and Flexible FPGA-based NoC Simulator," in IEEE Int'l Conf. on Field Programmable Logic and Applications (FPL), pp. 1-8, 2016. [27] S. S. Mukherjee, P. Bannon, S. Lang, A. Spink and D. Webb, "The Alpha 21364 Network Architecture," in IEEE Micro, vol. 22, no. 1, pp. 26-35, 2002. [28] R. Mullins, A. West and S. Moore, "Low-Latency Virtual-Channel Routers for On-Chip Networks," in Proc. of Int'l Symp. on Computer Architecture (ISCA), p. 188, 2004. [29] L.-S. Peh and W. J. Dally, "A Delay Model for Router Microarchitectures," in IEEE Micro, vol. 21, no. 1, pp. 26-34, 2001. [30] D. Seo, A. Ali, W. T. Lim, N. Rafique and M.
Thottethodi, "Near-Optimal Worst-Case Throughput Routing for Two-Dimensional Mesh Networks," in Proc. of Int'l Symp. on Computer Architecture (ISCA), pp. 432-443, 2005. [31] Xilinx Inc., "Zynq-7000 All Programmable SoC, Technical Reference Manual (UG585)," Available: https://www.xilinx.com/support/documentation/user_guides/ug585-Zynq7000-TRM.pdf, 2016. [32] Xilinx Inc., "ZC706 PCIe Targeted Ref. Design (UG963)," Available: https://www.xilinx.com/support/documentation/boards_and_kits/zc706/2014_4/ug963-zc706-pcie-trd-ug.pdf, 2015. Hadi Mardani Kamali received the B.Sc. and M.Sc. degrees in computer engineering from Khajeh Nasir University of Technology and Sharif University of Technology, Tehran, Iran, in 2011 and 2013, respectively. Since 2013, he has worked as a research assistant at the VLSI laboratory of Sharif University of Technology. His research focuses on electronic design automation, FPGA-based architectures, and hardware/software co-design.
Kimia Zamiri Azar received the B.Sc. and M.Sc. degrees in computer engineering from Khajeh Nasir University of Technology and Shahid Beheshti University, Tehran, Iran, in 2013 and 2015, respectively. She worked as a research assistant in the arithmetic laboratory of Shahid Beheshti University. Her current research interests include computer architecture, FPGA-based architectures, and multi-core architectures.
Shaahin Hessabi received the B.S. and M.S. degrees in electrical engineering from Sharif University of Technology, Tehran, Iran, in 1986 and 1990, respectively, and the Ph.D. degree in electrical and computer engineering from the University of Waterloo, Ontario, Canada. He joined Sharif University of Technology in 1996. Since 2007, he has been an Associate Professor with the Department of Computer Engineering, Sharif University of Technology, Tehran, Iran. He has published more than 100 refereed papers in related areas. His research interests include design for testability, VLSI design, Network-on-Chip, and System-on-Chip. He has served as the program chair, general chair, and program committee member of various conferences, including DATE, NOCS, NoCArch, and CADS.