IEEE TRANSACTIONS ON COMPUTERS, VOL. –, NO. -, MONTH YYYY
DuCNoC: A High-Throughput FPGA-based NoC Simulator Using Dual-Clock Lightweight Router Micro-Architecture
Hadi Mardani Kamali, Kimia Zamiri Azar, Shaahin Hessabi, Member, IEEE
Abstract—On-chip interconnections play an important role in multi/many-processor systems-on-chip (MPSoCs). To achieve efficient optimization, each specific application must utilize a specific architecture, and consequently a specific interconnection network. For design-space exploration and for finding the best NoC solution for each specific application, a fast and flexible NoC simulator is necessary, especially for large design spaces. In this paper, we present an FPGA-based NoC co-simulator, which can be configured via software. In our proposed NoC simulator, entitled DuCNoC, we implement a dual-clock router micro-architecture, which demonstrates a 75x-350x speed-up against BOOKSIM. Additionally, we implement a two-layer configurable global interconnection in our proposed architecture to (1) reduce the virtualization time overhead, (2) make an efficient trade-off between the resource utilization and the simulation time of the whole simulator, and especially (3) provide the capability of simulating irregular topologies. The migration of some important sub-modules, such as traffic generators (TGs) and traffic receptors (TRs), to the software side, and the implementation of dual-clock context switching in virtualization, are other major features of DuCNoC. Thanks to its dual-clock router micro-architecture, as well as the TG and TR migration to the software side, DuCNoC can simulate a 100-node (10×10) non-virtualized or a 2048-node virtualized mesh network on a Xilinx Zynq-7000.

Index Terms—Network-on-Chip, FPGA, Virtualization, Global Interconnection, Dual-Clock, Router Micro-Architecture
I. INTRODUCTION

ACCORDING to Moore's law, the complexity of basic elements in an integrated circuit, i.e. the number of transistors, approximately doubles within a fixed time interval. On the other hand, due to unsustainable levels of power consumption imposed by higher operating frequencies, manufacturers were forced to increase the number of cores on a System-on-Chip (SoC). As a result, multi/many-core systems have emerged [1]. A scalable and efficient interconnection network is one of the principal components of multi/many-core systems. Earlier, when the number of cores was low, there were non-scalable and non-heterogeneous interconnection networks, such as shared-memory bus-based or point-to-point networks, which are inefficient for large-scale and communication-intensive applications. Consequently, the network-on-chip (NoC) was introduced for implementing large and heterogeneous networks [2], [3].
The authors are with the Department of Computer Engineering, Sharif University of Technology, Tehran 11155-9517, Iran (e-mail: mardani, [email protected]; [email protected]).
Manuscript received MONTH dd, yyyy.
In order to take advantage of this new interconnection network, we face a large design-space exploration problem, which is significantly time-consuming. Hence, a cycle-accurate, flexible, and, above all, high-throughput NoC simulator is necessary to comprehensively explore the corresponding design space for each application. Recent FPGAs are equipped with hundreds of thousands of fine-grained logic cells and coarse-grained communication resources, which provide huge parallelism. Additionally, the NoC infrastructure bears a resemblance to the interconnection network between cells in FPGAs. Therefore, FPGAs are a suitable framework for implementing different NoCs. Unlike FPGA-based NoC simulators, software-based simulators, written in C/C++, have an inherent throughput restriction, especially when the simulated network is large. Although software-based simulators are more flexible, they have to sacrifice some features and configurations to reach a desired trade-off between flexibility and throughput. Meanwhile, throughput tuning, e.g. multi-threading, causes excessive time and memory overhead. Therefore, acceleration via FPGAs motivates researchers to implement FPGA-based NoC simulators to reduce simulation time, especially for large networks.
In this paper, we present DuCNoC, an FPGA-based NoC co-simulator, which can be configured via software. The main aim of this simulator is to maximize speed-up compared to software simulators. We employ several novel techniques to improve simulation speed. The main contribution is a dual-clock router micro-architecture, which significantly decreases the simulation time of packet traversals in the network. We also implement a two-layer configurable global interconnection, in order not only to decrease synchronization latency in virtualization, but also to provide the capability of simulating all types of topologies, even irregular ones.
In addition, our configurable global interconnection network enables DuCNoC not only to simulate larger networks on FPGAs with fewer resources by simulating smaller clusters, but also to improve speed-up by simulating bigger clusters. Migrating some important sub-modules, such as traffic generators (TGs) and traffic receptors (TRs), to the software side, as well as employing dual-clock context switching in virtualization, are other major features of DuCNoC. Thanks to our dual-clock router micro-architecture, as well as the proposed two-layer virtualization mechanism, DuCNoC accelerates simulation by 75x-350x compared to BOOKSIM. Also, the lower overhead of the proposed two-layer
interconnection decreases the overhead of virtualization by around 34%, which reduces the simulation time overhead by 17% on average compared to an architecture with the conventional virtualization mechanism. Also, DuCNoC is able to simulate a 100-node non-virtualized or a 2048-node virtualized mesh network on a Xilinx Zynq-7000.
The rest of this paper is organized as follows: Section II describes previous work. Section III illustrates our approach. The router micro-architecture and the overall architecture of DuCNoC are explained in Section IV. Implementation and experimental results are presented in Section V. Finally, we conclude the paper in Section VI.

II. RELATED WORK

A. Software-based Simulators

The vast majority of current NoC simulators are implemented at the highest level of abstraction, i.e. software, written in C/C++. The most valuable feature of software-based NoC simulators is flexibility: they can easily be developed, modified, and run. BOOKSIM [4] is the most widely used software-based NoC simulator. Some important features of BOOKSIM are accurate modeling of channel latency, pipeline optimization, efficient switch allocation, flexible traffic generation, and especially the capability of integration with GEM5 [5], one of the most powerful full-system simulators. SICOSYS [6] is implemented for modeling packet traversals; it can also be integrated as a sub-module with RSIM [7] and SimOS [8]. In addition, there are some other high-level simulators, written in SystemC, which are able to provide performance evaluation and verification. OCCN [9] is a SystemC-based architecture with three distinct layers for modeling NoCs. It also provides system performance modeling to collect instantaneous and periodic statistics from all nodes. Mapping task graphs and evaluating their performance are the focus of [10].
In [11], a SystemC-based event-driven simulator for modular exploration is designed, which can assess the performance of bus-based, point-to-point, and crossbar interconnections. ARTS [12] is another framework written in SystemC for MPSoCs. It provides a full-system simulator, but since its main focus is on scheduling and mapping, simulation is not efficient. Although software-based NoC simulators can simulate small-scale networks with desirable performance, when the size of the simulated network increases, e.g. for a many-core system, they have to simulate a large number of simultaneous tasks, resulting in substantial throughput degradation.

B. FPGA-based Simulators

Growth in VLSI equips FPGAs with abundant logic resources, which provide fine-grained parallelism and coarse-grained communication. Additionally, as mentioned before, the cell interconnection in FPGAs bears a resemblance to the NoC architecture. Therefore, implementing NoC simulators on FPGAs results in several orders of magnitude speed-up over software simulators. One drawback of FPGA-based simulators is the
necessity of re-simulation and re-synthesis for each change in parameters and configurations. One efficient way to eliminate wasted re-synthesis time is to decouple the simulator architecture from the simulated system [13], [14]. In fact, the base architecture of the NoC simulator is designed statically, and the infrastructure required for different parameters is applied according to the simulated network. Similarly, DuCNoC decouples the simulator architecture from the simulated network, whereby changes in the simulated design do not require re-synthesis. Additionally, similar to most FPGA-based architectures, the DuCNoC cycle (simulation cycle) is separated from the system cycle (FPGA cycle) to provide better observability and more convenience in the evaluation phase. In fact, by keeping a global counter in the overall design, the timing model does not depend on the FPGA cycle (neither the fast nor the slow one in DuCNoC). Furthermore, resource limitation is another drawback of FPGAs for simulating networks. In order to overcome this obstacle, virtualization is used to implement larger networks on the same FPGA [17], [21], [23], [26].
DRNoC [16] is an FPGA-based emulation work-flow based on partial reconfiguration to eliminate re-synthesis time. For evaluation purposes, all configurable parameters can be changed at run-time, and this feature improves total simulation time. However, partial reconfiguration has some non-negligible overheads, such as occupying a large amount of memory, excessive time overhead, and synchronization concerns. Both direct-mapped and virtualized simulation are embedded in [17], which provides researchers the opportunity to compare the two models. However, fewer parameters can be configured in this simulator. In AcENoCs [18], the software side is implemented on a soft MICROBLAZE core in the FPGA. The traffic generator is also implemented on the software side, so bigger NoCs can be simulated on the same FPGA.
Since virtualization is not used in this model, the NoC size is limited by the FPGA size. Also, AcENoCs has no solution for the re-synthesis time overhead. Unlike AcENoCs, HardNoC [19] avoids using a soft IP core to implement traffic injectors and collectors, and has few configurable parameters, which reduces the flexibility of the platform. Similar to AcENoCs, a soft MICROBLAZE core is employed in [20], which implements a NoC simulator with a bigger flit size, i.e. 256 bits. DART [21] is equipped with an engine which can auto-generate nodes in the FPGA according to the chosen topology; therefore, the re-synthesis time for changing parameters is eliminated. DART also decouples the simulator architecture and the simulated network, adopted from PROToFLEX [13] and SIMFLEX [14], which eliminates the re-synthesis flow, and employs virtualization to overcome the FPGA size limitation. However, its TGs and TRs are implemented inside the FPGA, which limits the size of the NoCs it can implement. Similar to DART, auto-generation for removing the re-synthesis process is accomplished in XHiNoC [22], but the utilized interface, i.e. UART, restricts the performance of the simulator. A parametric evaluation system is implemented in [23], which is able to simulate the fat-tree topology, and consequently the Nearest Common Ancestor (NCA) routing algorithm is chosen
TABLE I: Comparison of Latest FPGA-based NoC Simulators

| NoC Simulator | Topology | Flow Control | Routing | Traffic Pattern | Router Arch. | FPGA | HW/SW Interface | Software-side Tool | Max. Freq. | Speed-up (/BOOKSIM) | Re-synth. Solution | Virtualized |
| Papamichael 2011 [17] | 4×4 Mesh, Torus | Credit-based | DoR (XY) | Syn. | 4/8/12/16-port, 2/4/8-VC | V6 ML605 | UART | C on host PC | 56 MHz | 28x | - | TDMA-based |
| AcENoCs 2011 [18] | 5×5 Mesh | Credit-based | DoR (XY) | Syn. + Trace | 5-port, 1/2-VC, 4/5-stage pipe | V5 | UART | on-chip soft-core | 100 MHz | 14x-47x | - | - |
| Zhang 2013 [20] | 4×4 Mesh, Torus | 256-bit (flit), credit-based | DoR (XY) | Syn. | 5-port, 4-VC, 5-stage pipe | V5 | - | on-chip soft-core | 100 MHz | 3000x (sim. model) | - | - |
| DART 2014 [21] | 7×7 Mesh, Irregular (Virtualized) | 36-bit (flit), credit-based | DoR (XY) | Syn. + Trace | up to 8-port, up to 4-VC, 5-stage pipe | V6 ML605 | PCIe | C on host PC | 46 MHz | 100x | Yes | TDMA-based |
| Ying 2014 [22] | 2×2×2 Mesh | Credit-based | 3D DoR (XYZ) | Syn. | 5-port, dedicated | V5 XUPV5 | UART | C on host PC | 100 MHz | - | Yes | - |
| Van Chu 2015 [24] | 64×64 Mesh (Virtualized) | 22/25-bit (flit), credit-based | DoR (XY) | Syn. | 5-port, 2-VC, 1-stage pipe | V7 VC707 | - | - | 100 MHz | 700x (sim. model) | - | TDMA-based |
| FOLCS 2015 [25] | 16×16 Mesh | Credit-based | DoR (XY) | Syn. | 5-port, up to 4-VC, (3+)-stage pipe | V6 ML605 | UART | C on host PC | 66 MHz | 17x-22x | - | - |
| AdapNoC 2016 [26] | 32×32 Mesh, Torus (Virtualized) | 32-bit (flit), credit-based | DoR (XY or YX) + Adaptive | Syn. + Trace | 5-port, 0/2/4-VC, (3+)-stage pipe | V6 ML605 | PCIe | on-chip soft-core | 100 MHz | 53x-180x, 1150x (sim. model) | Yes | Dual-Clock TDMA-based |
| DuCNoC | 64×32 Mesh, Torus, Irregulars (Virtualized) | 32-bit (flit), credit-based | DoR (XY or YX) + Oblivious + Adaptive | Syn. + Trace | up to 64-port, up to 4-VC, (3+)-stage pipe | ZYNQ-7000 ZC706 | PCIe | on-chip hard-core | 85/170 MHz | 75x-350x, 1820x (sim. model) | Yes | Dual-Clock TDMA-based |
as its dedicated routing algorithm. The simplified router micro-architecture of [23] helps implement bigger networks on the same FPGA, but fewer parameters can be configured. Table I summarizes the major features of FPGA-based NoC simulators.

III. DuCNoC APPROACH

The main contributions of this paper are as follows:
• We implement a fully dual-clock lightweight router micro-architecture with a 3-stage pipeline, which guarantees its distinguished throughput against that of BOOKSIM as well as FPGA-based NoC simulators.
• Our two-layer interconnection not only implements virtualization more efficiently, but can also support different types of topologies, even irregular ones.
• We propose a new cluster-based traffic aggregator in order to support modeling of adaptive routing algorithms more accurately, even for irregular networks.
Although DuCNoC and AdapNoC share some major features, such as dual-clock context switching in virtualization and the migration of traffic generators (TGs) and traffic receptors (TRs) to the software side [26], there are major differences between DuCNoC and AdapNoC, as follows:
• The DuCNoC router is implemented using a fully dual-clock router micro-architecture, whereas AdapNoC uses a single-clock router micro-architecture. It should be noted that the dual-clock context switching for the virtualization mechanism, which is implemented in AdapNoC, is separate from the router architecture.
• Unlike AdapNoC, which can simulate only equivalent clusters during a simulation, our proposed two-layer interconnection enables DuCNoC to support different cluster sizes in each simulation.
• The traffic-aggregator flow is changed compared to AdapNoC. AdapNoC implements a vector-based traffic aggregator, whereas DuCNoC utilizes a two-layer traffic aggregation mechanism, which improves the accuracy of the implemented adaptive routing algorithms.
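As a rough illustration of the difference, a two-layer aggregation can be sketched as follows. The data structures and function names here are hypothetical stand-ins for DuCNoC's actual BRAM-based implementation, which is not detailed at this point: layer 1 keeps per-node counts inside a cluster, while layer 2 keeps one aggregate per cluster, so an adaptive routing decision can compare congestion even across cluster boundaries.

```python
# Hypothetical sketch of two-layer traffic aggregation (illustrative only).
from collections import defaultdict

intra = defaultdict(int)   # layer 1: per-node occupancy within a cluster
inter = defaultdict(int)   # layer 2: per-cluster aggregate

def record_flit(cluster, node):
    """Count a waiting flit at both layers."""
    intra[(cluster, node)] += 1
    inter[cluster] += 1

def less_congested(cluster_a, cluster_b):
    """Adaptive choice between two candidate next clusters."""
    return cluster_a if inter[cluster_a] <= inter[cluster_b] else cluster_b

record_flit("A", 1); record_flit("A", 2); record_flit("B", 1)
print(less_congested("A", "B"))  # cluster "B" carries less aggregated traffic
```

A vector-based aggregator, by contrast, would expose only one flat view; keeping the second, per-cluster layer is what lets the adaptive decision remain meaningful when the network is virtualized into clusters.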
It should be noted that we use the ZYNQ-7000 series for implementing DuCNoC. This type of FPGA is equipped with a Kintex fabric, which has fewer logic cells than the Virtex family. Although ZYNQ-7000 FPGAs are equipped with hard-core on-chip processors, our proposed dual-clock router micro-architecture, rather than the type of FPGA, is the main contributor to our distinguished operating frequency, i.e. 170 MHz, which considerably increases DuCNoC throughput compared to other FPGA-based simulators. Additionally, providing the aforementioned features imposes some limitations on DuCNoC. The main resource required for implementing our two-layer interconnection and traffic aggregator is BRAM-based memory in the FPGA, which is limited. Hence, implementing these features restricts the available resources on the FPGA, which reduces the maximum feasible network size. Also, due to the BRAM-based memory restriction, the maximum flit size is 32 bits.
IV. DuCNoC ARCHITECTURE

The overall architecture of DuCNoC is divided into two main sub-modules: (1) the software side and (2) the FPGA side. The software side contains the central processing unit, which controls all actions on the FPGA side, such as sending and receiving data to/from the FPGA side, determining the latest state of FPGA-side operations, and configuring all parameters on the FPGA side. Additionally, since the TGs and TRs are migrated to the software side, generating and sending traffic, as well as receiving and gathering traffic, are handled by the software side. Therefore, all statistical information is calculated on the software side. The FPGA side, which contains all programmable logic, memories, and interfaces, simulates the interconnection network, packet traversal, router micro-architecture, and flow control.
Fig. 1 illustrates the overall architecture of DuCNoC. The figure shows a combination of the overall SoC architecture of the ZYNQ-7000 [31], [32] and the top modules of DuCNoC. As can be seen, by using the Xilinx ZYNQ-7000 ZC706, the software side can be developed on the Processing System (PS), which contains a dual-core ARM Cortex-A9 processor. Additionally, the FPGA side, which implements the infrastructure of DuCNoC, is implemented on the Programmable Logic (PL), which is a Kintex-7 FPGA fabric. Note that using the two ARM processors of the PS for the software side allows us to simulate dynamic traffic with negligible throughput degradation. The overall architecture has a 100 MHz input clock; the PL fabric clocks (85 MHz and 170 MHz) are generated on the software side and are used for the dual-clock TDMA-based context switching in virtualization. Some modules are in common with AdapNoC [26], such as the bank registers for intermediate states, the centralized traffic information aggregator, the routing table updater, and the buffer management on the FPGA side. These modules are employed to implement TDMA-based virtualization and the modeling of adaptive routing algorithms.
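To make the configurable parameters discussed in this paper concrete, the following sketch collects them in a single record. The field names are illustrative, not DuCNoC's actual interface, but the bounds follow the text: at most 64 nodes per cluster, up to 4 VCs per link, a (3+)-stage pipeline, a 32-bit maximum flit size, and the 85/170 MHz fabric clocks with their 2:1 ratio.

```python
from dataclasses import dataclass

# Illustrative configuration record; names are hypothetical, ranges come
# from the paper's stated DuCNoC limits.
@dataclass
class DuCNoCConfig:
    nodes_per_cluster: int = 16   # virtualized cluster size, at most 64
    num_vcs: int = 4              # virtual channels per link, at most 4
    pipeline_stages: int = 3      # (3+): extra EMPTY stages are injected
    flit_width: int = 32          # bits; limited by BRAM-based memories
    slow_clk_mhz: int = 85        # head-flit path / context-switch base clock
    fast_clk_mhz: int = 170       # body/tail-flit path clock

    def validate(self) -> None:
        assert 1 <= self.nodes_per_cluster <= 64, "AXI bus limits clusters to 64 nodes"
        assert 1 <= self.num_vcs <= 4, "up to 4 VCs per link"
        assert self.pipeline_stages >= 3, "reformed pipeline has at least 3 stages"
        assert self.flit_width <= 32, "BRAM restriction: maximum flit size is 32 bits"
        assert self.fast_clk_mhz == 2 * self.slow_clk_mhz, "2:1 dual-clock ratio"

cfg = DuCNoCConfig()
cfg.validate()  # default configuration satisfies all DuCNoC bounds
```

A host-side tool would hand such a record to the FPGA side through configuration registers, so changing any of these values requires no re-synthesis.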
Additionally, as can be seen in Fig. 1, the communication interface between the host and the FPGA is implemented via PCI-Express (PCIe), to guarantee the performance of data transmission for dynamic traffic. FIFOs are employed in DuCNoC for three different purposes: (1) a bridge between the PCIe interface and DDR3, (2) a bridge between DDR3 and the PL, and (3) FIFOs between nodes. The bridges, including (1) the PCIe-to-DDR3 bridge and (2) the DDR3-to-PL bridge, are responsible for transmitting traces from the host to the network of nodes on the FPGA side. In fact, FIFO set (1) is employed between the PCIe multi-channel DMA and our custom DMA (PDMA), to transfer traces from the PCIe interface to DDR3 via an AXI-based channel. FIFO set (2) is a BRAM-based FIFO between the DDR3 interface and the network of nodes on the FPGA side, acting as a traffic translator to convert traces into dynamic traffic for simulation on the network of nodes. One of the ARM CPUs (CPU 1) is responsible for modeling the TGs and TRs on the software side. Fig. 1 shows that CPU 1 is connected to the Memory Interface in order to write/read the generated/received traffic. The BRAM-based FIFO set (3) is employed between the Memory Interface and the network of nodes to model the ejection/injection (E/I) procedures. Additionally,
Fig. 1. DuCNoC overall architecture, implemented on a ZYNQ-7000 ZC706, which consists of a dual-core ARM Cortex-A9 processor as the PS, as well as a Kintex-7 FPGA as the PL.
some FIFOs of set (3) are located between the nodes in order to model the flit queues in the routers.

A. Dual-Clock Mechanism for the Reformed 3-Stage Pipeline

One efficient acceleration approach is to equip the simulator with a dual-clock router micro-architecture. In order to employ a dual-clock mechanism in our router micro-architecture, we need to alter and duplicate the pipeline infrastructure. A typical 5-stage router consists of buffer write (BW), route computation (RC), virtual-channel allocation (VA), switch allocation (SA), and switch traversal (ST). In DuCNoC, we employ an optimized pipeline structure that merges stages to provide the required infrastructure for a dual-clock router micro-architecture. Among all stages, RC and VA are meant only for the head flit. Therefore, if we consider just one stage for BW alongside RC, and likewise one stage for VA alongside SA, we can implement the pipeline with just three stages, i.e. BWRC, VA-SA (VSA), and ST. Fig. 2 shows the structures of a conventional 5-stage pipeline router and the reformed 3-stage pipeline router. As can be seen, in a conventional 5-stage pipeline router, RC and VA are active for head flits, and then become inactive and are replaced by bubbles (no-operations) for the rest of the packet. In the reformed 3-stage pipeline router, on the other hand, BWRC, VSA, and ST are active for all types of flits. The overall structure of our pipeline is similar to speculative approaches [27]–[29], in which a flit speculatively enters the next pipeline stage. For successful
Fig. 2. The Behavior of the Router Pipeline Structure in (a) the Conventional 5-stage and (b) the Reformed 3-stage Router.
speculations, the corresponding flit enters ST directly; for failed speculations, the flit must cross some stages again, depending on where the speculation failed. In order to eliminate speculation, we employ two BWRC stages (one for head flits, with the worse critical path, and one for body/tail flits, with the better critical path). Similarly, we need two stages for VSA. In effect, there are two parallel pipeline paths: one for head flits and one for body/tail flits. Thereby, in this 3-stage pipeline router micro-architecture, BWRC and VSA are duplicated, and their critical paths differ. In fact, since RC and VA are necessary only for head flits, the BWRC and VSA stages for head flits have the longer critical paths. The overall architecture of a dual-clock router micro-architecture is similar to that of a conventional single-clock one. In DuCNoC, we employ two clocks with a frequency ratio of 2 to 1 (170 MHz and 85 MHz). The faster clock is dedicated to body flits, while the slower one handles head flits. This mechanism establishes a harmony between the dual-clock architecture and the reformed pipeline infrastructure with its different paths. Additionally, since we separate the path of head flits from that of tail/body flits, and the delay of each path is different, re-ordering may seem to pose a big challenge. However, since the ST stage is common between the two paths, head flits proceed to ST first, and body/tail flits are processed afterwards; hence, there is no re-ordering issue in this mechanism. Using a dual-clock router micro-architecture has significant advantages over a single-clock one. As mentioned, BWRC and VSA have longer critical paths when dealing with head flits. In other words, when the packet size is large and the number of body flits increases, the probability of dealing with the worse critical path decreases.
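This effect can be quantified with a simple back-of-the-envelope model (ours, not the authors' formal analysis): if the head flit of an L-flit packet occupies a stage for one slow-clock (85 MHz) cycle and each of the remaining L-1 flits for one fast-clock (170 MHz) cycle, the gain over an all-slow single-clock router is 2L/(L+1), approaching the 2:1 clock ratio for long packets.

```python
# Back-of-the-envelope dual-clock model: head flit on the slow clock
# (worse BWRC/VSA critical path), body/tail flits on the fast clock.

def per_packet_time_ns(length, slow_mhz=85.0, fast_mhz=170.0):
    """Time for one packet to stream through a pipeline stage, dual-clock case."""
    t_slow = 1e3 / slow_mhz   # ns per slow-clock cycle
    t_fast = 1e3 / fast_mhz   # ns per fast-clock cycle
    return t_slow + (length - 1) * t_fast

def dual_clock_speedup(length, slow_mhz=85.0, fast_mhz=170.0):
    """Speedup versus a single-clock router that always runs at the slow clock."""
    single = length * (1e3 / slow_mhz)
    return single / per_packet_time_ns(length, slow_mhz, fast_mhz)

for L in (1, 2, 8, 64):
    print(L, round(dual_clock_speedup(L), 3))  # approaches 2.0 as L grows
```

The model ignores allocation stalls and contention, but it captures the trend stated above: the larger the packet, the closer the router runs to the fast clock.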
Therefore, the possibility of using the faster clock increases, and the overall simulation time decreases considerably. Additionally, this mechanism increases the efficiency of the whole system: when the simulation time decreases, the utilization percentage increases, and the throughput improves. In order to simulate different numbers of pipeline stages in the router, DuCNoC has a configurable parameter which can be set to three or more (3+). This parameter indicates the required number of EMPTY (no-operation) stages to be injected. In fact, different numbers of EMPTY stages can be inserted to simulate a real NoC router with more than
Fig. 3. Reformed (3+)-stage Credit-based DuCNoC Router Micro-architecture (For a 5-port Router).
three pipeline stages. Furthermore, the number of VCs can be configured up to 4 VCs per link. To realize the flow-control mechanism, we employ credit-based wormhole (WHR) VC flow control. The overall architecture of the router is illustrated in Fig. 3. As can be seen, the router consists of the input unit, the credit arbiter, the VC and switch allocation, and an updatable routing table.

B. Configurable Dual-Clock Sharable Virtualization

A major concern with FPGA-based NoC simulators is their resource limitation for simulating a big network on one FPGA. In fact, if the simulation is carried out purely on the FPGA side, there is a maximum possible NoC size for each FPGA. The most efficient solution for overcoming the resource limitation on the FPGA side is a virtualization mechanism based on TDMA, which virtualizes the nodes on the FPGA side [17], [21], [23], [26]. By employing the virtualization mechanism, the simulated network is divided into sub-networks, called clusters. Each cluster is implemented on the FPGA side in a dedicated time-slot, separately. The overall simulation time is divided into equal intervals, called time-slots, and all time-slots are allotted to clusters in a round-robin manner. For instance, suppose that we need to simulate a 4×4 mesh network, but the available resources can only simulate a 2×2 mesh network. Using virtualization, we divide the network into four 2×2 mesh sub-networks (clusters), and the available resources are dedicated to each cluster sequentially. Evidently, serialization of resource allocation is necessary in the virtualization technique; it is accomplished in a round-robin manner. After each time-slot, the intermediate state of the current cluster, which holds the essential information of the simulation, such as the interconnection between nodes in the cluster and the synchronization variables, is stored, and the intermediate state of the next assigned cluster is loaded.
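The round-robin time-slot allocation described above, including the store/load of intermediate cluster states, can be sketched as follows (a behavioral toy model, not the hardware implementation; all names are illustrative):

```python
# Toy model of TDMA-based virtualization: a 4x4 mesh split into four 2x2
# clusters that take turns on one shared set of simulated FPGA resources.
NUM_CLUSTERS = 4
SLOT_CYCLES = 8  # timespan of each time-slot (configurable; affects time only)

# per-cluster intermediate state, stored/loaded around each context switch
states = [{"cycle": 0} for _ in range(NUM_CLUSTERS)]

def simulate_slot(state):
    """Stand-in for running one cluster on the shared fabric for one slot."""
    state["cycle"] += SLOT_CYCLES

schedule = []
for slot in range(12):                 # 12 time-slots overall
    cluster = slot % NUM_CLUSTERS      # round-robin allocation
    state = states[cluster]            # load this cluster's intermediate state
    simulate_slot(state)
    schedule.append(cluster)           # state is stored back via `states`

print(schedule)            # [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]
print(states[0]["cycle"])  # 24: each cluster advanced 3 slots of 8 cycles
```

The IDLE-slot sharing and dual-clock context switching discussed next refine exactly this loop: skipping clusters with no pending intra-cluster traffic, and hiding the store/load latency between system-clock edges.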
Similar to AdapNoC, a configurable parameter is defined in DuCNoC to indicate the possible number of nodes in each cluster. Since we employ an AXI-based communication bus between the FPGA side and the software side, the maximum possible number of nodes in each cluster is restricted to 64. Another virtualization-related parameter, which can be configured in DuCNoC, indicates the timespan of each time-slot. It should be noted that the value of the timespan has no effect on the functionality of the system, but it can considerably influence the overall simulation time. In a simple and intuitive scheme, if we have n clusters, 1/n-th of the total simulation time should be allocated, distributed over several time-slots, to each cluster.
Two approaches are employed in DuCNoC to compensate for the serialization overhead due to virtualization as much as possible: (1) the capability of sharing IDLE time-slots, and (2) a dual-clock mechanism for context switching between clusters. In order to fulfill the capability of sharing IDLE time-slots, it is necessary to provide a mechanism to indicate which time-slot is IDLE. A time-slot is IDLE when there is no intra-cluster transmission between its nodes. Therefore, to detect IDLEness, we need access to the contents of the input buffers in each router. As can be seen in Fig. 3, we extract the mentioned contents through an extra output port named "Input BW info", which reports the number of flits waiting in the input BW. If no flit is waiting in the input BW, the corresponding node is idle, and consequently a cluster whose nodes are all idle is considered an IDLE cluster. After detecting an IDLE cluster, its dedicated time-slot is allocated to the next assigned cluster which is not IDLE. This mechanism improves simulation time, especially when the injection rate is low, and for small clusters located in the corners of big networks, far away from the hotspots.
In a virtualized architecture, each cluster is implemented on the FPGA side separately in its dedicated time-slot. Therefore, there are considerable wasted cycles for changing the intermediate states of clusters between time-slots; especially when the ratio of clusters to nodes increases, the wasted cycles become the dominant factor in the time overhead. In order to compensate for a significant extent of the time overhead due to context switching, we use a dual-clock architecture. Fig. 4 illustrates a simple scenario showing how a dual-clock architecture can reduce the number of wasted clocks for context switching in the virtualization structure. As can be seen, by using the falling edges of the state-handler clock, whose frequency is double the system clock frequency, context switching is accomplished between two rising edges of the system clock without wasting any system clock. Hence, in any simulation scenario with many context switches, a significant number of system clocks can be saved, and DuCNoC considerably reduces the time overhead of virtualization. By implementing sharable time-slots, which pays off especially for lower injection rates and small clusters located in the corners of big networks, as well as by employing the dual-clock mechanism for context switching, which pays off especially for higher injection rates, the impact of serialization in virtualization is reduced significantly. This shows that our proposed virtualization structure can minimize the NoC remapping time overhead.

Fig. 4. Mechanism of Dual-Clock Context Switching. (a) State-Handler Clock with Twofold Frequency Compared to the System Clock. (b) The Effectiveness of Dual-Clock Context Switching in Decreasing the Virtualization Overhead.

One of the most important concerns in simulating a virtualized network is the intra-cluster (within a cluster) and inter-cluster (between clusters) interconnections. Two inevitable requirements should be considered in virtualized architectures: (1) the synchronization between intra- and inter-cluster communications, and (2) TDMA-based virtualization in two layers of abstraction, i.e. the FPGA side and the software side. DuCNoC utilizes an index-based two-layer interconnection infrastructure to not only synchronize intra- and inter-cluster communications, but also provide the capability of simulating different topologies, even irregular networks. Both layers consist of BRAM-based FIFOs for establishing the interconnection between nodes, whether for nodes within a cluster (intra-cluster) or nodes in different clusters (inter-cluster). The first layer (layer 1) is a complete graph between a specific number of nodes, which is designated to model each intra-cluster topology in each time-slot. The number of nodes per cluster is a configurable parameter, which defines the number of nodes in the complete graph. As mentioned, the maximum number of nodes in each cluster is restricted to 64; therefore, an up-to-64-node complete graph is established to enable simulating all types of topologies for each cluster. The second layer (layer 2) is a higher-level topology, which determines the interconnection of inter-cluster communications. The BRAM-based FIFOs for the intra-cluster interconnections (layer 1) are shared between time-slots. In fact, all clusters share the same FPGA resources corresponding to the mentioned complete graph in order to accomplish their simulations in different time-slots. In other words, there is only one up-to-64-node complete graph on the FPGA side, and each cluster is laid on this graph sequentially when its time-slot arrives. On the other hand, the BRAM-based FIFOs for the inter-cluster
Fig. 5. Two-layer Interconnection for Intra- and Inter-cluster Communications. The infrastructure of shared complete graph is changed in each time-slot to simulate the corresponding cluster. Inter-cluster FIFOs are dedicated and fixed.
interconnections (layer 2) are dedicated and fixed for all interconnections between clusters. This dedicated, non-virtualized structure between clusters handles all intermediate flits between clusters and overcomes the synchronization concerns between intra- and inter-cluster communications.

Fig. 5 shows an example of a mesh network and the structure of the two-layer interconnection between all nodes. As it can be seen, we need to simulate a 3×3 mesh network, in which the biggest cluster (Cluster A) has 4 nodes. So, we have to set the cluster size to 4, and consequently we have a 4-node complete graph. It should be noted that we can simulate all smaller clusters by using this bigger complete graph. As it can be seen, all three clusters (A, B, and C) are implemented on this complete graph during their allocated time-slots. The utilized FPGA resources are the same across clusters. For instance, the utilized resources for N1 in cluster A, N7 in cluster B, and N6 in cluster C are the same, i.e. R1, which is dedicated to each cluster in its corresponding time-slot. Similarly, all FIFOs between routers are the same. For instance, the BRAM-based FIFO between N1 and N2, the BRAM-based FIFO between N7 and N8, and the BRAM-based FIFO between N3 and N6 are the same, i.e. shared FIFO (SF12) in the complete graph. So, DuCNoC is able to divide larger networks into smaller clusters and simulate these clusters sequentially.

Unlike the resource-sharing capability within clusters (layer 1), the inter-cluster interconnections (layer 2) are dedicated and fixed (non-virtualized). These BRAM-based FIFOs are implemented in a non-virtualized manner to handle parallelism between clusters. Suppose that the current time-slot is
dedicated to Cluster A, and the next time-slot is dedicated to Cluster B. If there are some flits in N4 whose destination is N7, all of them will be stored in the corresponding dedicated FIFO (DFAB1), and will be processed in the next time-slot, which is dedicated to Cluster B. In general, all flits between clusters are stored in the non-virtualized inter-cluster (layer 2) FIFOs (DFs) until their corresponding time-slots are reached. Accordingly, handling parallel operations between clusters is accomplished using the non-virtualized dedicated FIFOs between clusters, which guarantees cycle accuracy in the virtualization mechanism.

Also, as it can be seen in Fig. 5, time-slots 0, 3, 6, ..., 3i are allocated to cluster A, and similarly, time-slots 3i+1 and 3i+2 are dedicated to cluster B and cluster C, respectively. The configurable time-span of all time-slots dedicated to one cluster is equal, but it can differ for other clusters, which are different in size. Additionally, if there is an IDLE cluster, the next assigned cluster can use the time-slot of the IDLE cluster. For instance, if cluster B is IDLE in time-slot 1, cluster C can use its time-slot, and cluster A will be simulated again in time-slot 2.

Note that all FIFOs, including intra- and inter-cluster FIFOs and the bridge between DDR3 and PL, are implemented by using BRAMs, which helps us save LUTs and FFs (logic cells) on the FPGA side. Since FIFOs are a major sub-module in NoCs, and we implement them by using BRAMs, we lighten the resource utilization of the whole architecture, which lets us propose a lightweight global interconnection compared to other FPGA-based simulators.

Since DuCNoC uses a shared complete graph for all clusters, it is obligatory to dequeue all flits in the intra-cluster FIFOs (SFs) that are related to the current cluster, in order to allocate them to the next assigned cluster.
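The idle-slot reallocation described above can be sketched as a round-robin TDMA scheduler that hands an IDLE cluster's time-slot to the next non-idle cluster. This is a minimal software model under our own naming, not DuCNoC's hardware scheduler:

```cpp
#include <cassert>
#include <vector>

// Round-robin TDMA over clusters; a cluster whose routers report no
// waiting flits ("Input BW info" all zero) is IDLE, and its time-slot
// is granted to the next non-idle cluster in round-robin order.
struct TdmaScheduler {
    int clusters;
    int current = 0;
    explicit TdmaScheduler(int n) : clusters(n) {}

    // idle[i] == true means cluster i has nothing to simulate right now.
    // Returns the cluster that actually receives the next time-slot,
    // or -1 if every cluster is idle.
    int next(const std::vector<bool>& idle) {
        for (int step = 0; step < clusters; ++step) {
            int c = current;
            current = (current + 1) % clusters;
            if (!idle[c]) return c;
        }
        return -1;
    }
};
```

With clusters A=0, B=1, C=2 and B idle, the produced slot sequence is A, C, A, ..., matching the example in the text where cluster C takes over B's time-slot.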
Therefore, based on the configured value of the time-slot, a threshold is specified that determines how many simulation cycles are required for blocking the incoming FIFOs to dequeue all flits buffered in them. This mechanism minimizes the bank-register required for storing the essential intermediate contents during context switching. Due to the HW/SW co-designed architecture of DuCNoC, it is indispensable to consider TDMA on both the FPGA side and the software side. As it can be seen in Fig. 1, another BRAM-based FIFO set is employed between the memory interface and the network of nodes, which is responsible for managing TDMA on the FPGA side. On the other hand, an extra level of flow control manages TDMA on the software side [19].

C. Cluster-based Traffic Aggregator for Modeling Adaptive Routing Algorithms

Contrary to previous FPGA-based NoC simulators [17], [18], [20]-[22], [24], [25], which have used deterministic table-based routing algorithms, and similar to AdapNoC [26], DuCNoC is able to simulate adaptive routing algorithms by means of a centralized table-based architecture. Based on this table-based structure, a routing table updater, whose operation depends on the whole traffic information, is a mandatory component. As was mentioned regarding Fig. 3, "Input BW info" is updated via the incoming queues in each router, and indicates
how many flits are waiting in the queues. By using this information, gathered in each router, we can update all routing tables in all nodes via the routing table updater. Based on the updated routing tables, DuCNoC enables us to simulate adaptive routing algorithms. Since DuCNoC provides a configurable virtualization methodology, which supports all types of topologies, even irregular networks, it is necessary to restructure the information-gathering module in DuCNoC compared to AdapNoC [26]. Unlike AdapNoC, whose adaptive routing was limited to mesh networks, DuCNoC is able to simulate adaptive routing algorithms for all topologies. In fact, since the topology of the whole network might not be regular, we establish a non-uniform structure to gather all information from the "Input BW info" ports of each router.

In order to buffer the traffic information of all clusters, we implement a two-layer aggregator module in DuCNoC. The traffic information of each cluster is gathered in its corresponding time-slot and stored in the shared registers of the cluster. This information is transferred to the registers dedicated to the current cluster in layer 2 at context-switching time. Layer 1 registers are implemented in the complete graph, so they are virtualized, and the resources utilized for these registers are shared between clusters. On the other hand, the layer 2 bank-register is implemented in a non-virtualized manner, so it is accessible permanently, regardless of which cluster is currently laid on the FPGA side. By doing so, similar to AdapNoC, the traffic information updater is aware of all traffic information in each node, and updates the routing tables of the nodes of each cluster before its corresponding time-slot starts.
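The two-layer aggregation above can be sketched as follows. The struct and field names are illustrative: shared registers (SRs) are keyed by local slot of the complete graph and flushed into the permanently accessible dedicated registers (DRs), keyed by global node id, at context switching.

```cpp
#include <array>
#include <cassert>
#include <map>

// Per-port occupancy counters ("Input BW info") for one node:
// North, West, South, East.
using PortRegs = std::array<int, 4>;

struct TrafficAggregator {
    // Layer 1: shared registers, one set per slot of the complete graph;
    // reused by whichever cluster currently owns the FPGA fabric.
    std::map<int, PortRegs> shared;     // local slot -> SR
    // Layer 2: dedicated registers, permanently accessible, one set
    // per *global* node id.
    std::map<int, PortRegs> dedicated;  // global node -> DR

    // Called continuously during a cluster's time-slot.
    void update(int localSlot, const PortRegs& info) { shared[localSlot] = info; }

    // At context switching, SRs of the finishing cluster are flushed into
    // the DRs of its global nodes, then cleared for the next cluster.
    void contextSwitch(const std::map<int, int>& slotToGlobal) {
        for (auto& [slot, regs] : shared)
            dedicated[slotToGlobal.at(slot)] = regs;
        shared.clear();
    }
};
```

Because `dedicated` is never cleared, a routing table updater can read any node's last known congestion at any time, which is the property the non-virtualized layer 2 bank-register provides.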
Additionally, since the bank-register of layer 2 is permanently accessible, the routing tables can be updated continuously. Therefore, in order to take advantage of the layer 2 bank-register information, immediately after context switching in each time-slot, the routing tables of the current cluster are updated. So, the routing tables of each cluster hold the latest traffic information in each time-slot, which helps achieve close-to-optimum results with adaptive routing algorithms. As it can be seen in Fig. 1, two important sub-modules are implemented to support adaptive routing algorithms: (1) the traffic information aggregator, and (2) the routing table updater. As their names imply, we aggregate the traffic information of the nodes by using the traffic information aggregator, while the routing table updater is responsible for updating the routing table of each node before its corresponding time-slot starts. Aggregating traffic information, as well as updating the routing tables in the network, allows us to model adaptive routing algorithms by using these updated routing tables. In fact, the routing table updater module updates the routing table of each node based on the aggregated traffic.

In order to evaluate the efficiency and optimality of the two-layer traffic aggregator and routing table updater, we employ a simple adaptive routing algorithm, i.e. Adaptive Toggle DOR (ATDOR). As its name implies, regarding the traffic of the network, ATDOR can toggle the route between XY and YX for
each source-destination pair. In fact, the routing table updater determines which route (XY or YX) is better according to the aggregated traffic, and updates the routing tables accordingly; consequently, route computation (RC) for each packet is accomplished based on the updated routing table. Note that at least 2 VCs are required to ensure deadlock avoidance. Therefore, since DuCNoC is capable of simulating up to 4 VCs, it is possible to simulate a deadlock-free ATDOR by using DuCNoC.

Fig. 6 illustrates the structure of the two-layer centralized traffic aggregator and routing table updater in a 3×3 virtualized mesh network, which enables us to simulate adaptive routing algorithms. As it can be seen, each cluster uses a shared register set (SR) for each node. Since the biggest cluster has 4 nodes in this scenario, there are 4 register sets in the 4-node complete graph, i.e. SR1 for R1, SR2 for R2, SR3 for R3, and SR4 for R4. Also, the number of registers in each set depends on the number of ports in each node. Accordingly, since each node has 4 ports, there are 4 registers in each set, i.e. the North (N), West (W), South (S), and East (E) registers. For each cluster, in its corresponding time-slot, the values of the SRs are updated, and before context switching, the values of the SRs are stored in the corresponding registers of the dedicated register set (DR). For instance, suppose that the current time-slot is dedicated to cluster A, and the next time-slot is dedicated to cluster B. During the current time-slot, the values of SR1, SR2, SR4, and SR5 are updated continuously. After this time-slot completes, the values of these registers are stored in DR1, DR2, DR4, and DR5. After the dedicated registers have been updated, the routing table updater updates the routing tables of the cluster which uses the next time-slot. So, the routing tables of N7 (T7) and N8 (T8) will be updated.
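The XY/YX toggle of ATDOR can be sketched as follows. This is our own minimal model, not DuCNoC's updater: for each source-destination pair it sums the aggregated per-node congestion (the DR contents) along the XY path and along the YX path, and records the cheaper direction.

```cpp
#include <cassert>
#include <map>

// Minimal ATDOR-style decision for a W-wide mesh, node id = y*W + x.
enum class Route { XY, YX };

struct AtdorUpdater {
    int W;                   // mesh width
    std::map<int, int> load; // aggregated flits waiting at each node (DRs)

    // Sum of congestion along a dimension-ordered path, x-first or y-first.
    int cost(int x0, int y0, int x1, int y1, bool xFirst) const {
        int c = 0;
        auto at = [&](int x, int y) {
            auto it = load.find(y * W + x);
            return it == load.end() ? 0 : it->second;
        };
        if (xFirst) {
            for (int x = x0; x != x1; x += (x1 > x0 ? 1 : -1)) c += at(x, y0);
            for (int y = y0; y != y1; y += (y1 > y0 ? 1 : -1)) c += at(x1, y);
        } else {
            for (int y = y0; y != y1; y += (y1 > y0 ? 1 : -1)) c += at(x0, y);
            for (int x = x0; x != x1; x += (x1 > x0 ? 1 : -1)) c += at(x, y1);
        }
        return c;
    }

    // Toggle the routing-table entry for one source-destination pair.
    Route choose(int src, int dst) const {
        int x0 = src % W, y0 = src / W, x1 = dst % W, y1 = dst / W;
        return cost(x0, y0, x1, y1, true) <= cost(x0, y0, x1, y1, false)
                   ? Route::XY : Route::YX;
    }
};
```

Running `choose` for every pair before a cluster's time-slot corresponds to the routing-table update step described above; ties default to XY.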
Accordingly, RC in N7 and N8 will be accomplished by means of the updated routing tables. Unlike AdapNoC, this infrastructure equips DuCNoC to simulate adaptive routing algorithms in a virtualized structure. Additionally, since the routing table updater operates as a non-virtualized module, the routing tables of all nodes are updated in the first cycles of each time-slot, right after context switching. Therefore, the impact of delay in updating the routing tables is dramatically decreased, and subsequently each node accomplishes its routing procedure in a close-to-optimum state.

D. ZYNQ-based Software Simulation Framework

As it can be seen in Fig. 1, the overall architecture of DuCNoC is similar to AdapNoC. The major difference is the type of embedded processor employed for the software side. Contrary to AdapNoC, the software side is implemented on a ZYNQ-7000, which has a dual-core (hard-core) ARM processor. As mentioned previously, the ZYNQ-7000 consists of two parts: (1) an FPGA fabric called PL, and (2) an embedded processor called PS. We employ a dual-processor architecture to (1) handle and support trace-based traffic patterns (dynamic traffic) by using the first core, and (2) run the software application as a user command
Fig. 6. Two-layer Traffic Information Aggregator for Intra- and Inter-cluster Communications. The registers for intra-cluster aggregation are shared, and the registers for inter-cluster aggregation are dedicated.
prompt for configuring the PL by using the second core. The capability of the ZYNQ SoC for communication between PS and PL provides a high-throughput DMA-based communication between the software side and the FPGA side in DuCNoC. Also, the ZYNQ-7000 is equipped with 1 GB of DDR3, which is directly accessible via both PL and PS. By using the PCIe multi-channel DMA in the PL, the traces of dynamic traffic are transmitted via PCIe to DDR3. The packets are then decoded to the specific pattern in order to be transmitted via BRAM-based FIFOs to the FPGA side. Additionally, by using an AXI-stream structure, the efficiency of the bus between PS and PL is increased. The ARM processor also generates synthetic traffic for the simulated network; thus, it is able to support both dynamic and synthetic traffic. It should be noted that receiving packets from the BRAM-based FIFOs on the FPGA side is handled via the TR on the ARM, which calculates statistical information such as latency and the correctness of packets.

Handling the dual-clock architecture on the FPGA side is accomplished via the ZYNQ PS. In fact, the ZYNQ PS provides the clock configuration and all relevant configurations for the PL. Since our design needs two different clocks, the ZYNQ PS provides both required clocks permanently, and an asynchronous structure on the FPGA side handles the dual-clock architecture. Also, all parameters for network configuration, which are listed in Table II, are set via CPU 1. Using the configurable virtualization structure, as well as an efficient dual-clock router micro-architecture, yields a considerable improvement over AdapNoC. Also, using configurable virtualization helps increase the possible size of networks to be simulated. On the other hand, using
TABLE II: Configurable parameters in DuCNoC

Topology: Mesh / Torus / Tree / Irregular
Size: up to 64×32
Routing Algorithm: DoR / Oblivious / Adaptive (ATDOR)
No. of VCs per port: up to 4
Traffic Type: Synthetic / Dynamic
Pipeline Stages: 3 and more
Number of Router Ports: 2 and more
Packet Size (flits): up to 64
Link Latency: Variable
Input VC buffer size: up to 16
VC Allocation: Fixed / Round Robin
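The parameters of Table II could be gathered in a single configuration record on the software side. The struct below is purely illustrative (field names are not DuCNoC's actual API); it encodes the ranges from the table and a validity check:

```cpp
#include <cassert>
#include <string>

// The configurable parameters of Table II (illustrative field names).
struct NocConfig {
    std::string topology = "mesh";      // mesh / torus / tree / irregular
    int width = 64, height = 32;        // up to 64x32
    std::string routing = "DoR";        // DoR / oblivious / adaptive (ATDOR)
    int vcsPerPort = 2;                 // up to 4
    std::string traffic = "synthetic";  // synthetic / dynamic (trace-based)
    int pipelineStages = 3;             // 3 and more
    int routerPorts = 5;                // 2 and more
    int packetSizeFlits = 8;            // up to 64
    int linkLatency = 1;                // variable
    int vcBufferDepth = 16;             // up to 16
    bool roundRobinVA = true;           // fixed / round-robin

    // Check the configuration against the limits in Table II.
    bool valid() const {
        return width <= 64 && height <= 32 &&
               vcsPerPort >= 1 && vcsPerPort <= 4 &&
               pipelineStages >= 3 && routerPorts >= 2 &&
               packetSizeFlits <= 64 && vcBufferDepth <= 16;
    }
};
```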
dual-clock architecture increases the ratio of throughput to resource utilization. In fact, in comparison with AdapNoC, DuCNoC utilizes approximately equivalent resources, while the throughput is considerably improved. In addition, the structure of the adaptive routing table in DuCNoC enables us to present a virtualized adaptive routing table structure, especially for irregular networks.

V. EXPERIMENTAL RESULTS

In order to demonstrate DuCNoC performance, various scenarios should be considered. We targeted a large Xilinx ZYNQ-based SoC architecture (ZC706 evaluation board) to evaluate all configurable parameters on the FPGA side. Additionally, ZYNQ facilitates software-side development by using the Xilinx SDK. Also, Vivado 16.2 is used for synthesizing, implementing, and downloading the overall design onto the FPGA, which enables implementing a schematic block diagram of all parts of the design, even the software side.
Fig. 7. Average Latency Comparison between BOOKSIM [4], DART [21], AdapNoC [26], and DuCNoC. Using new pipeline structure in router micro-architecture improves the accuracy of DuCNoC compared to AdapNoC.
DuCNoC consists of four major sub-modules: (1) the software side, which is implemented on the PS (ARM processors); (2) the FPGA side, which consists of the interconnection network and the router micro-architecture, and is implemented on the PL; (3) the UART interface between the host and the PS for transceiving commands, status, and statistical information; and (4) the PCIe interface for transmitting traces from the host to the PL. The software side has been developed in C++ by using the Xilinx SDK, and the PL is implemented in Verilog HDL by using Xilinx Vivado. Also, all sub-modules in the PL are implemented in an AXI4-based structure.

In order to verify the correctness of DuCNoC simulation results, the overall latency reported by DuCNoC in at least one sample scenario should be compared with the results reported by the baseline simulator, i.e. BOOKSIM. Fig. 7 shows our first validation result in non-virtualized mode. It illustrates the average latency of the network depicted in Table III (Case 1) in different simulators. Since VSA is a single merged stage in our proposed pipeline, one cycle of delay should be counted in BOOKSIM for SA and VA altogether. So, the BOOKSIM configuration for the router micro-architecture has 4 cycles of routing delay, 1 cycle of delay for SA, and zero cycles of delay for VA. As it can be seen, the average latency reported by DuCNoC resembles the average latency calculated by BOOKSIM; using a different pipeline structure in the DuCNoC router micro-architecture approximately eliminates the statistical deviation. It should be noted that 20,000 warm-up cycles, 45,000 measurement cycles, and 15,000 draining cycles are considered for all simulation scenarios.

As mentioned before, we reformed the pipeline of our router micro-architecture into a 3-stage infrastructure, which helped establish a dual-clock router micro-architecture. According to the implementation reports, i.e. the slack time in the place and route (PAR) process, the critical path of DuCNoC restricts the maximum possible frequency to 170 MHz. Therefore, in order to establish a dual-clock architecture, the slower clock frequency should be 85 MHz. By using our proposed dual-clock router micro-architecture, we achieve up to 60% speed-up in DuCNoC in comparison with AdapNoC. Fig. 8 shows the DuCNoC speed compared to
TABLE III: Mesh Network Benchmarks

Parameter | Case 1 | Case 2
Topology | 3×3 | 8×8
Link Latency | 1 Cycle | 1 Cycle
Routing Algorithm | DoR (XY) | DoR (XY) & ATDOR
No. of VCs per port | 2 | 2
VA | Round-Robin | Round-Robin
Input VC buffer Size | 4 | 4
Router Pipeline stages | 5 | 5
Packet Size (flits) | 2 | 16
Traffic Pattern | Random | Random

Fig. 8. The Effectiveness of the Dual-clock Router Micro-architecture in DuCNoC against AdapNoC [26] (Injection rate = 0.4 flits/node/cycle).
AdapNoC for different packet sizes, for the network depicted in Table III (Case 1). It is obvious that increasing the packet size improves the throughput of DuCNoC. Also, since the behavior of the dual-clock router micro-architecture is similar for all injection rates, the injection rate has no effect on these results; the injection rate in Fig. 8 is 0.4 flits/node/cycle. As it can be seen in Fig. 8, increasing the size of the packets makes our proposed architecture more efficient. Even for 2-flit packets, DuCNoC still achieves around 13% speed-up against AdapNoC. However, increasing the packet size beyond a meaningful value results in BRAM-based buffer overhead for storing the packets on the FPGA side. Moreover, the improvement saturates for all packet sizes above 64 flits, where it is limited to 65-70%. So, the best packet size in our router micro-architecture is less than 64, which is less than the size of the BRAM-based FIFOs between nodes.

Using a novel configurable virtualization mechanism enables us to implement a new structure for the traffic aggregator and routing table updater to support adaptive routing algorithms, especially in virtualized mode. As depicted in Fig. 6, we implemented a two-layer structure in order to aggregate traffic information and update the routing tables of all nodes. Furthermore, unlike the virtualized structure of the simulated network, the traffic aggregator and routing table updater behave in a non-virtualized manner, as mentioned before. Using this non-virtualized structure in a virtualized network helps us eliminate the delay impact of the routing table update procedure, which provides more accurate results for adaptive routing algorithms compared to AdapNoC. Fig. 9 illustrates the simulation results of an 8×8 network, which is depicted in Table III (Case 2), in DuCNoC, AdapNoC, the baseline ATDOR [15], and O1turn [30]. As it can be seen, thanks to our two-layer non-virtualized structure, we can simulate
Fig. 9. Simulation Results for Modeling the Adaptive Routing Algorithm in O1Turn [30], the Implemented ATDOR [15], AdapNoC [26], and DuCNoC.

TABLE IV: Resource Utilization of a 10×10 (100-node) Non-Virtualized Mesh Network on a ZYNQ-7000

Module | LUTs | Registers | BRAM (# of 36Kb)
Flit-Queue | 8,643 | 2,629 | 91*
Router | 22,515 | 5,485 | -
Traffic Aggregator / Updater | 5,723 | 6,982 | -
Complete-Graph Mapper | 4,827 | 3,852 | -
PCIe + AXI4 + DDR3 | 2,264 | 1,850 | 48+
* Including all FIFOs between all nodes (W, S, E, and N). + Including FIFOs for E/I and the BRAM interface between DDR3 and the E/I FIFOs.

TABLE V: Resource Utilization of a 32×32 (1024-node) Virtualized Mesh Network with Different Cluster Sizes on a ZYNQ-7000

Cluster Size | LUTs | Registers | BRAM* (Inter-cluster) | BRAM* (Intra-cluster) | BRAM* (DDR3/PL bridge)
2 (2×1) | 58,795 | 19,128 | 152 | 2 | 1
4 (2×2) | 81,865 | 28,543 | 109 | 5 | 1
8 (2×4) | 98,479 | 36,914 | 83 | 9 | 2
16 (4×4) | 113,685 | 38,927 | 46 | 18 | 4
32 (8×4) | 134,915 | 45,815 | 25 | 34 | 8
64 (8×8) | 178,641 | 66,912 | 15 | 65 | 16
Total Available | 218,600 | 437,200 | 545 (all BRAM) | |
* Number of employed 36Kb BRAM modules.

TABLE VI: Resource Utilization of Different Virtualized Mesh Network Sizes with Different Cluster Sizes on a ZYNQ-7000

Network Size | Cluster Size | LUTs | Registers | BRAM* | Speed-up+
512 (32×16) | 16 (4×4) | 99,638 | 23,825 | 46 | 66
512 (32×16) | 32 (8×4) | 114,915 | 26,554 | 86 | 121
512 (32×16) | 64 (8×8) | 147,898 | 37,920 | 88 | 265
1024 (32×32) | 16 (4×4) | 113,685 | 38,927 | 68 | 42
1024 (32×32) | 32 (8×4) | 134,915 | 45,815 | 67 | 98
1024 (32×32) | 64 (8×8) | 178,641 | 66,912 | 96 | 220
2048 (64×32) | 16 (4×4) | 137,032 | 77,836 | 120 | 15
2048 (64×32) | 32 (8×4) | 165,983 | 91,770 | 94 | 38
2048 (64×32) | 64 (8×8) | 202,005 | 124,894 | 113 | 85
Total Available | | 218,600 | 437,200 | 545 |
* Number of employed 36Kb BRAM modules. + Speed-up compared to BOOKSIM (injection rate = 0.3 flits/node/cycle, packet size = 8 flits).

adaptive routing algorithms with more accuracy in DuCNoC than in AdapNoC. Additionally, unlike Fig. 7, which concerns a non-virtualized network, Fig. 9 can be used to validate virtualized network simulation in DuCNoC.

Table IV shows the DuCNoC resource utilization for a 100-node non-virtualized network, the biggest feasible non-virtualized network on the FPGA side, which is approximately similar to AdapNoC. The major difference is the BRAM utilization, which is responsible for transceiving data between DDR3 and the PL, and for the BRAM-based FIFOs establishing non-virtualized queues between all nodes.

Table V shows the resource utilization of a 1024-node network, as a sample feasible virtualized network on the FPGA side, with different cluster sizes. As it can be seen, since we use a new approach in the cluster-based architecture, which lays just one cluster on the FPGA side in each time-slot, increasing the size of the cluster increases the number of utilized LUTs, decreases the number of BRAMs utilized for inter-cluster transmissions, and increases the number of BRAMs utilized for intra-cluster transmissions. BRAM is used for three different purposes (virtualized or non-virtualized) in DuCNoC: (1) non-virtualized, as a bridge between DDR3 and the PL; (2) virtualized, as FIFOs between intra-cluster nodes; and (3) non-virtualized, as FIFOs between inter-cluster nodes. It is obvious in Table V that increasing the size of the cluster generally decreases the overall BRAM usage and increases logic resource utilization. Note that all FIFOs (intra- or inter-cluster) are implemented by using BRAMs. Additionally, these BRAMs, as a bridge between DDR3 and the PL, enable us to keep a copy of packets in DDR3
simultaneously. Therefore, since DDR3 is accessible from the PS, all BRAM data can be observed, which enables us to observe the internal states of all packets. Although increasing the cluster size provides more parallelism in the whole simulation, resource utilization imposes a hard limit on the FPGA side; in fact, there is a restricting threshold for the size of the clusters. On the other hand, decreasing the cluster size utilizes fewer resources, but decreases parallelism drastically, and the overall speed-up is not desirable for an FPGA-based simulator. Furthermore, decreasing the cluster size increases the BRAM utilization on the FPGA side, which causes big concerns in PAR and the maximum operational frequency.

Table VI shows the resource utilization and speed-up for different network and cluster sizes. As it can be seen, although increasing the cluster size provides more speed-up, it restricts the overall feasible network size. On the other hand, decreasing the cluster size allows simulating bigger networks, but the speed-up is not suitable for an FPGA-based NoC simulator. Accordingly, we have to restrict the number of nodes in each cluster to 64, which leads to simulating networks with meaningful speed-ups in comparison with other simulators. In addition, although we can simulate an 8192-node network by employing 16-node clusters, the overall speed-up is around 3x, which is inappropriate for an FPGA-based simulator. So, the biggest network size that can be simulated with considerable speed-up is 1024 or 2048 nodes; nevertheless, it is possible to simulate bigger networks with less throughput. Additionally, it is obvious that
Fig. 10. Virtualization Overhead Comparison between Different Cluster Sizes (Packet length = 8 flits).

Fig. 11. Time Overhead in the Virtualization Structure. The sharable time-slot structure, as well as the cluster-based virtualization infrastructure of DuCNoC, reduces the time overhead by 7% and 17% in comparison with AdapNoC [26] and DART [21], respectively.
the biggest cluster size, i.e. 64, is the best cluster size for simulating all network sizes, unless it is infeasible, e.g. for networks smaller than 64 nodes, whose size is less than the cluster size, and for networks larger than 2048 nodes, where the LUTs are not enough for 64-node clusters.

Using a cluster-based virtualization methodology introduces a significant overhead compared to a non-virtualized architecture. Inheriting sharable context switching in the virtualization mechanism from AdapNoC, as well as implementing a thoroughly dual-clock router micro-architecture, eliminates the mentioned overhead to a great extent. Fig. 10 depicts the overhead of virtualization for the 8×8 network (Table III, Case 2), in both virtualized and non-virtualized architectures. For a non-virtualized structure, the number of simulation cycles required for simulating one cycle of the network is equal to 1, while it is variable (more than 1) for virtualized structures. It should be noted that since time-slots are sharable, the overhead is lower at lower injection rates. Fig. 10 shows that the relationship between virtualization overhead and cluster size is linear, especially for higher injection rates: if we divide the network into two clusters, the virtualization overhead is approximately doubled. In other words, we
equivalently divide the system cycles of a simulation between clusters, with a TDMA structure, so the number of clusters directly determines the overhead of virtualization. Furthermore, IDLEness is more probable for smaller cluster sizes, especially for corner clusters. So, for the same injection rate, the overhead of virtualization is lower for smaller cluster sizes. However, as it can be seen in Fig. 10, increasing the injection rate maximizes the timing overhead imposed by virtualization. This implies that the IDLEness mechanism is more effective at lower injection rates, while the probability of IDLEness is close to zero at higher injection rates. Accordingly, using the sharable time-slot mechanism, as well as the dual-clock structure for context switching, significantly compensates for the virtualization time overhead.

Fig. 11 illustrates the average number of simulation cycles necessary in virtualized mode. In fact, the y-axis depicts the overhead cycles required to simulate one cycle in virtualized mode. As mentioned before, the number of simulation cycles required for simulating one network cycle in non-virtualized mode is equal to 1; but since we serialize the network simulation, several simulation cycles are needed to model one cycle of the non-virtualized mode. Since we employ a new cluster-based virtualization approach in DuCNoC, which is equipped with a time-span threshold for each time-slot, the serialization overhead is decreased in comparison with AdapNoC. Although at the highest feasible injection rate, i.e. in saturation, the time overhead of DART, AdapNoC, and DuCNoC is approximately equivalent, at lower injection rates the time overhead of DuCNoC is reduced by around 34%. Also, as it can be seen, DuCNoC decreases the time overhead by 7% and 17% on average compared to AdapNoC and DART, respectively. Using an index-oriented global interconnection helps support irregular and custom topologies.
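Table-based source routing for an arbitrary topology can be initialized offline as sketched below: for every destination, a BFS over the adjacency list yields the next hop at every node, so the tables can be filled "with whatever we need" even for irregular graphs. This is an illustrative precomputation, not DuCNoC's table-initialization code.

```cpp
#include <cassert>
#include <queue>
#include <vector>

// Arbitrary (possibly irregular) topology as an adjacency list.
using Adj = std::vector<std::vector<int>>;

// table[node][dest] = next hop from node toward dest
// (-1 means node == dest, or dest is unreachable).
std::vector<std::vector<int>> buildTables(const Adj& g) {
    int n = (int)g.size();
    std::vector<std::vector<int>> table(n, std::vector<int>(n, -1));
    for (int dest = 0; dest < n; ++dest) {
        std::vector<int> next(n, -1);
        std::vector<bool> seen(n, false);
        std::queue<int> bfs;
        bfs.push(dest);
        seen[dest] = true;
        // BFS outward from the destination: the node we came from is
        // exactly the next hop toward the destination.
        while (!bfs.empty()) {
            int u = bfs.front(); bfs.pop();
            for (int v : g[u]) {
                if (!seen[v]) {
                    seen[v] = true;
                    next[v] = u;  // from v, step toward dest via u
                    bfs.push(v);
                }
            }
        }
        for (int node = 0; node < n; ++node) table[node][dest] = next[node];
    }
    return table;
}
```

On an irregular 4-node star (node 1 connected to 0, 2, and 3), every route between leaves passes through node 1, which the precomputed tables reflect directly.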
To demonstrate and evaluate the behavior of DuCNoC on custom and irregular topologies, three different networks are chosen. Fig. 12 shows these irregular sample networks: (a) a mesh network with express links, (b) a tree, and (c) an irregular topology, all implemented by means of our cluster-based virtualization architecture. As can be seen in Fig. 12, the results obtained by DuCNoC are compared with those of DART and BOOKSIM, confirming the correctness of the proposed simulator on different custom and irregular topologies. It should be noted that using table-based source routing allows us to initialize all routing tables with whatever routes we need, which helps support irregular topologies with a guaranteed deadlock-free routing algorithm. As a HW/SW co-simulator, DuCNoC suffers more throughput degradation than pure FPGA-based simulators. However, the dual-clock router architecture, cluster-based virtualization, and the inherited sharable time-slot mechanism yield a significant speed-up compared to other HW/SW co-simulators, such as DART and AcENoCs. Fig. 13 shows the average DuCNoC speed-up against AdapNoC, DART, and AcENoCs in all depicted scenarios. Two points are substantial: (1) a 75x-350x improvement on average compared to BOOKSIM, and (2) a lower throughput degradation rate against AdapNoC
[Fig. 12 graphics (topology diagrams and latency plots) are not recoverable from the text extraction; only the following annotations survive: "The worst average latency without express links"; "Content-aware routing and express links reduce average packet latency" (mesh with express links); "Due to limited number of paths between each source-destination, content-aware routing has less effect on average latency"; "Content-aware routing has no effect on average latency in tree". Axes: Average Packet Latency (flit cycles) vs. Injection Rate (flits/node/cycle); curves: DuCNoC Deterministic, DuCNoC Adaptive, BOOKSIM [4], DART [21], No Express Link [4].]
Fig. 13. DuCNoC Simulation Speed Compared to AcENoCs [18], DART [21], and AdapNoC [26].
due to our new cluster-based virtualization. For SPLASH-2 traffic, DuCNoC provides an average 86x speed-up against BOOKSIM. Fig. 14b illustrates the DuCNoC speed-up against BOOKSIM for trace-based traffic (SPLASH-2). As can be seen, the speed-up differs across benchmarks: thanks to IDLEness detection in our proposed architecture, DuCNoC provides better speed-up for low traffic volumes, like fft, while for high traffic volumes, like cholesky, the speed-up is below average. Additionally, as shown in Fig. 14a, the average packet latency obtained by DuCNoC matches that of BOOKSIM, which demonstrates the accuracy of DuCNoC.
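The table-based source routing mentioned above can be illustrated with a small sketch: next-hop tables are precomputed offline for an arbitrary (possibly irregular) topology and then consulted per hop, so any deadlock-free route set can be loaded into the tables. The example below simply uses BFS shortest paths over a hypothetical irregular graph; it is not DuCNoC's actual table contents or format:

```python
from collections import deque

def build_routing_tables(adj):
    """Precompute per-node next-hop tables for an arbitrary topology
    (adjacency lists) via one BFS per destination."""
    tables = {n: {} for n in adj}
    for dst in adj:
        dist = {dst: 0}
        q = deque([dst])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    tables[v][dst] = u  # v forwards toward dst via u
                    q.append(v)
    return tables

def route(tables, src, dst):
    """Follow table lookups hop by hop; returns the full path."""
    path = [src]
    while path[-1] != dst:
        path.append(tables[path[-1]][dst])
    return path
```

Because each hop strictly decreases the BFS distance to the destination, the resulting routes are acyclic and hence deadlock-free by construction, which is the same guarantee table-based source routing gives DuCNoC on irregular topologies.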
[Plot data for Figs. 13 and 14 is not recoverable from the text extraction. Fig. 13 axes: Simulation Speed (10^3 cycles per second) vs. Injection Rate (flits/node/cycle); series: AdapNoC [26], AcENoCs [18], DART [21], DuCNoC; annotations: "35.6% improvement against AdapNoC", "112.1% improvement against AdapNoC", "Fewer throughput degradation rate". Fig. 14 benchmarks: barnes, cholesky, fft, lu_con, lu_non, ocean_non, radix, raytrace, water_spatial, AVG; axes: Average Packet Latency (Cycles) and DuCNoC/BOOKSIM speed-up.]
Fig. 12. Sample Custom and Irregular Networks. (a) A Mesh Network with Express Bypassing Links via 2 Clusters (b) A 9-node Tree via 3 Clusters (c) An Irregular Topology.
Fig. 14. Simulating Trace-based Traffic (SPLASH-2) in DuCNoC. (a) Average Packet Latency (b) Speed Comparison with BOOKSIM
VI. CONCLUSION In this paper, we presented DuCNoC, an FPGA-based NoC co-simulator that can be configured via software. The main contribution of this simulator is maximizing speed-up compared to other NoC simulators, especially software simulators. We employed several architectural techniques in the router micro-architecture to accelerate the
simulation speed. The main contribution is the dual-clock router micro-architecture, which significantly decreases the latency of packet traversal, providing a 75x-350x speed-up against BOOKSIM. Also, by using BRAMs, we implemented a lightweight configurable global interconnection, in order to decrease the synchronization latency of virtualization. The global interconnection network enables DuCNoC to simulate a 2048-node network on the ZYNQ ZC706 with fewer resources, by simulating feasible clusters on the FPGA. Migrating some important sub-modules, like traffic generators (TGs) and traffic receptors (TRs), to the software side, as well as employing dual-clock virtualization, are other major features of DuCNoC. REFERENCES [1] S. Borkar, "Thousand Core Chips: A Technology Perspective," in Proc. of ACM/IEEE Design Automation Conference (DAC), pp. 746-749, 2007. [2] B. Dally and B. Towles, "Route Packets, Not Wires: On-Chip Interconnection Networks," in Proc. of IEEE Design Automation Conference (DAC), pp. 684-689, 2001. [3] B. Dally and B. Towles, "Principles and Practices of Interconnection Networks," Morgan Kaufmann, San Francisco, CA, 2004. [4] J. Nan, J. Balfour, D. U. Becker, B. Towles, W. J. Dally, G. Michelogiannakis and J. Kim, "A Detailed and Flexible Cycle-Accurate Network-on-Chip Simulator," in Proc. of Int'l Symp. on Performance Analysis of Systems and Software (ISPASS), pp. 86-96, 2013. [5] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill and D. A. Wood, "The GEM5 Simulator," SIGARCH Computer Architecture News, vol. 39, no. 2, pp. 1-7, 2011. [6] V. Puente, J. A. Gregorio and R. Beivide, "SICOSYS: An Integrated Framework for Studying Interconnection Network Performance in Multiprocessor Systems," in Proc. of Euromicro Workshop on Parallel, Distributed and Network-based Processing, pp. 15-20, 2002. [7] V. S. Pai, P. Ranganathan and S. V.
Adve, "RSIM: An Execution-Driven Simulator for ILP-Based Shared-Memory Multiprocessors and Uniprocessors," in Proc. Workshop on Computer Architecture Education, 1997. [8] M. Rosenblum, E. Bugnion, S. A. Herrod and S. Devine, "Using the SimOS Machine Simulator to Study Complex Computer Systems," in ACM Transactions on Modeling and Computer Simulation (TOMACS), vol. 7, no. 1, pp. 78-103, 1997. [9] M. Coppola, S. Curaba, G. Grammatikakis, G. Maruccia and F. Papariello, "OCCN: A NoC Modeling and Simulation Framework," in Journal of Systems Architecture: the EUROMICRO Journal, vol. 50, no. 2-3, pp. 174-179, 2004. [10] K. Goossens, J. Dielissen, O. P. Gangwal, S. G. Pestana, A. Radulescu and E. Rijpkema, "A Design Flow for Application-Specific Networks on Chip with Guaranteed Performance to Accelerate SOC Design and Verification," in Proc. Design, Automation and Test in Europe (DATE), pp. 1182-1187, 2005. [11] T. Kogel, M. Doerper, A. Wieferink, R. Leupers and G. Ascheid, "A Modular Simulation Framework for Architectural Exploration of On-Chip Interconnection Networks," in Proc. IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and Systems Synthesis (CODES/ISSS), pp. 7-12, 2003. [12] S. Mahadevan, K. Virk and J. Madsen, "ARTS: A SystemC-Based Framework for Modelling Multiprocessor Systems-on-Chip," in Design Automation for Embedded Systems, vol. 11, no. 4, pp. 285-311, 2006. [13] E. S. Chung, E. Nurvitadhi, J. C. Hoe, B. Falsafi and K. Mai, "PROToFLEX: FPGA-Accelerated Hybrid Functional Simulator," in IEEE Int'l Parallel and Distributed Processing Symposium, vol. 1, no. 6, pp. 26-30, 2007. [14] Z. Tan, A. Waterman, R. Avizienis, Y. Lee, H. Cook, D. Patterson and K. Asanovic, "RAMP Gold: An FPGA-based Architecture Simulator for Multiprocessors," in Proc. of IEEE Design Automation Conference (DAC), pp. 463-468, 2010. [15] R. Manevich, I. Cidon, A. Kolodny, I. Walter and S.
Wimer, "A Cost Effective Centralized Adaptive Routing for Networks-on-Chip," in Euromicro Conference on Digital System Design (DSD), pp. 39-46, 2011. [16] Y. E. Krasteva, F. Criado, E. de la Torre and T. Riesgo, "A Fast Emulation-Based NoC Prototyping Framework," in Int'l Conf. on Reconfigurable Computing and FPGAs, pp. 211-216, 2008. [17] M. K. Papamichael, "Fast Scalable FPGA-based Network-on-Chip Simulation Models," in ACM/IEEE Int'l Conf. on Formal Methods and Models for Codesign (MEMOCODE), pp. 77-82, 2011. [18] V. Pai, S. Lotlikar and P. Gratz, "AcENoCs: A Configurable HW/SW Platform for FPGA Accelerated NoC Emulation," in IEEE Int'l Conf. on VLSI Design, pp. 147-152, 2011. [19] G. Heck, R. Guazzelli, F. Moraes, N. Calazans and R. Soares, "HardNoC: A Platform to Validate Networks on Chip Through FPGA Prototyping," in Southern Conf. on Programmable Logic, pp. 1-6, 2012.
[20] Y. Zhang, P. Qu, Z. Qian, H. Wang and W. Zheng, "Software/Hardware Hybrid Network-on-Chip Simulation on FPGA," in Proc. of Int'l Conf. on Network and Parallel Computing (NPC 2013), vol. 8147, pp. 167-178, 2013. [21] D. Wang, C. Lo, J. Vasiljevic, N. Enright Jerger and J. Gregory Steffan, "DART: A Programmable Architecture for NoC Simulation on FPGAs," in IEEE Transactions on Computers, vol. 63, no. 3, pp. 664-678, 2014. [22] H. Ying, T. Hollstein and K. Hofmann, "A Hardware/Software Co-design Reconfigurable Network-on-Chip FPGA Emulation Method," in Reconfigurable and Communication-Centric Systems-on-Chip (ReCoSoC), pp. 1-8, 2014. [23] S. Abba and J. A. Lee, "A Parametric-based Performance Evaluation and Design Trade-offs for Interconnect Architectures using FPGAs for Networks-on-Chip," in Microprocessors and Microsystems, vol. 38, no. 5, pp. 375-398, 2014. [24] T. V. Chu, S. Sato and K. Kise, "Ultra-Fast NoC Emulation on a Single FPGA," in IEEE Int'l Conf. on Field Programmable Logic and Applications (FPL), pp. 1-8, 2015. [25] T. Naruko and K. Hiraki, "FOLCS: A Lightweight Implementation of a Cycle-accurate NoC Simulator on FPGAs," in Proc. Int'l Workshop on Many-core Embedded Systems (MES), pp. 25-32, 2015. [26] H. M. Kamali and S. Hessabi, "AdapNoC: A Fast and Flexible FPGA-based NoC Simulator," in IEEE Int'l Conf. on Field Programmable Logic and Applications (FPL), pp. 1-8, 2016. [27] S. S. Mukherjee, P. Bannon, S. Lang, A. Spink and D. Webb, "The Alpha 21364 Network Architecture," in IEEE Micro, vol. 22, no. 1, pp. 26-35, 2002. [28] R. Mullins, A. West and S. Moore, "Low-Latency Virtual-Channel Routers for On-Chip Networks," in Proc. of Int'l Symp. on Computer Architecture (ISCA), p. 188, 2004. [29] L.-S. Peh and W. J. Dally, "A Delay Model for Router Microarchitectures," in IEEE Micro, vol. 21, no. 1, pp. 26-34, 2001. [30] D. Seo, A. Ali, W. T. Lim, N. Rafique and M.
Thottethodi, "Near-Optimal Worst-Case Throughput Routing for Two-Dimensional Mesh Networks," in Proc. of Int'l Symp. on Computer Architecture (ISCA), pp. 432-443, 2005. [31] Xilinx Inc., "Zynq-7000 All Programmable SoC, Technical Reference Manual (UG585)," Available: https://www.xilinx.com/support/documentation/user_guides/ug585-Zynq7000-TRM.pdf, 2016. [32] Xilinx Inc., "ZC706 PCIe Targeted Ref. Design (UG963)," Available: https://www.xilinx.com/support/documentation/boards_and_kits/zc706/2014_4/ug963-zc706-pcie-trd-ug.pdf, 2015. Hadi Mardani Kamali received the B.Sc. and M.Sc. degrees in computer engineering from Khajeh Nasir University of Technology and Sharif University of Technology, Tehran, Iran, in 2011 and 2013, respectively. Since 2013, he has worked as a research assistant at the VLSI laboratory of Sharif University of Technology. His research focuses on electronic design automation, FPGA-based architectures, and hardware/software co-design.
Kimia Zamiri Azar received the B.Sc. and M.Sc. degrees in computer engineering from Khajeh Nasir University of Technology and Shahid Beheshti University, Tehran, Iran, in 2013 and 2015, respectively. She worked as a research assistant in the arithmetic laboratory of Shahid Beheshti University. Her current research interests include computer architecture, FPGA-based architectures, and multi-core architectures.
Shaahin Hessabi received the B.S. and M.S. degrees in electrical engineering from Sharif University of Technology, Tehran, Iran, in 1986 and 1990, respectively, and the Ph.D. degree in electrical and computer engineering from the University of Waterloo, Ontario, Canada. He joined Sharif University of Technology in 1996. Since 2007, he has been an Associate Professor with the Department of Computer Engineering, Sharif University of Technology, Tehran, Iran. He has published more than 100 refereed papers in related areas. His research interests include design for testability, VLSI design, Network-on-Chip, and System-on-Chip. He has served as the program chair, general chair, and program committee member of various conferences, including DATE, NOCS, NoCArch, and CADS.