AN FPGA BASED OPEN SOURCE NETWORK-ON-CHIP ARCHITECTURE

Andreas Ehliar∗ and Dake Liu
Department of Electrical Engineering, Linköping University, Sweden
email: [email protected], [email protected]
ABSTRACT
Networks on Chip (NoC) have long been seen as a potential solution to the problems encountered when implementing large digital hardware designs. In this paper we describe an open source FPGA based NoC architecture with low area overhead, high throughput, and low latency compared to other published work. The architecture has been optimized for Xilinx FPGAs and the NoC is capable of operating at a frequency of 260 MHz in a Virtex-4 FPGA. We have also developed a bridge so that generic Wishbone bus compatible IP blocks can be connected to the NoC.

1. INTRODUCTION

As chip manufacturing techniques continue to improve, more complex systems are being designed. Designing such a large system is not easy, however, and much research in both academia and industry is focused on this problem. One of the problems encountered is how to handle the on-chip interconnections between different modules. One promising solution to the on-chip interconnection problem is the Networks on Chip (NoC) paradigm, which has seen a lot of research lately. A thorough review of the concepts involved in NoCs is outside the scope of this article and we refer readers unfamiliar with the topic to [1]. Most publications in this research area target ASICs, however, with only a few publications considering the problems and opportunities of an FPGA based NoC. As entry level FPGAs increase in size, interest in NoCs for FPGAs will also increase in both academia and industry. In this paper we present an open source NoC architecture. The architecture, which is optimized for the Virtex-4 FPGA family, is based on packet switching with wormhole routing. In addition, we have also developed a bridge which allows Wishbone compatible components to communicate over the NoC.

∗ Funded by the Stringent research center of the Swedish Foundation for Strategic Research

2. BACKGROUND

Networks-on-chip have been a popular research area for some time now. An early paper that discusses the advantages of an ASIC based NoC compared to more traditional approaches is [2]. Other well known ASIC based NoC research projects include the Æthereal project [3] and the xpipes project [4].

2.1. FPGA based NoCs

While the majority of NoC publications discuss ASIC based NoCs, some publications explicitly deal with FPGA based NoCs. An early one is [5], where a packet switched NoC is studied on the Virtex-II Pro FPGA. One of the main goals of that NoC is that it should be usable in a dynamically reconfigurable system. A recent example of an FPGA based NoC is described in [6], in which the authors describe a packet switched NoC running on a Virtex-II and compare it to a statically scheduled NoC on the same FPGA. A circuit switched NoC for FPGAs named PNoC is described in [7]. Another recent example of an FPGA based NoC is NoCem [8], which is aimed at multicore processors in an FPGA. The source code for NoCem is also available on the Internet [9]. The interested reader can find a survey of some additional FPGA based NoCs in [10], which also includes a comparison with other interconnect architectures such as buses.

3. OUR NOC ARCHITECTURE

The main goals of the NoC architecture described in this paper are high throughput, low latency (especially for small messages), and low area overhead. Another goal is to make it possible to easily interface a standard bus protocol such as Wishbone to it. A third goal is that there should be a certain amount of flexibility with regard to the choice of topology. The authors' experience from SoCBUS [11] also indicates that the large latency involved in transmitting small messages can be a huge problem in a real system.
Since it is critical to be able to handle small messages in a system where a standard bus is connected to a NoC, the architecture presented in this paper is based upon packet switching. Wormhole routing is used to avoid the need for large packet buffers and to reduce the latency. We have mostly used 2D meshes during simulation and hardware development, although almost any topology is possible as long as a deadlock free routing algorithm is used. (A discussion of deadlock free routing algorithms is outside the scope of this paper; the interested reader is referred to for example [1].) The signaling used on the NoC is shown in Table 1.

Table 1. The data and control signals in a link between two NoC switches.

Name     Direction  Width  Description
Strobe   →          1      Valid data is present
Data     →          36     Used as data signals
Last     →          1      Last data in a transaction
Dest     →          5      Address of destination node
Route    →          3-4    Destination port on the switch (one hot coded)
Ready    ←          1      Signals that the remote node is ready to receive data

3.1. Input part

An incoming packet is first buffered in an input FIFO. As long as an output port is available, the input FIFO will be emptied as fast as it can be filled. However, if no output port is available, the input FIFO will quickly fill up. To avoid overruns, the input module will signal the sender that no further data should be sent as soon as only a few entries are left. This margin is required because the pipeline latency will cause additional entries to be written before the sender can react. The FIFO is efficiently implemented using the SRL16 components of the Virtex-4. Due to the high delay of the SRL16 outputs, it is necessary to minimize the logic between the output of the SRL16 and the following flip-flop. Therefore, only a simple routing decision is performed in this stage. Since our current architecture supports 32 destination nodes, one five-input look-up table per output port is enough to make a routing decision. Unfortunately, this does not take into account that the FIFO might be empty and contain stale destination data. In order to handle this situation, the route look-up also has to know whether the input FIFO is empty or not. Adding this logic increased the critical path beyond what was deemed acceptable. Therefore, in order to shorten the critical path of the route look-up, the NoC architecture was modified so that the route look-up is instead performed in the previous switch. The result of the route look-up is then sent to the next switch using one hot coding.
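To make the route look-up concrete, the following Python sketch models the decision in behavioral form. It is only an illustration under assumed conditions (an 8×4 mesh, XY routing and a particular port numbering, none of which are mandated by this paper and all of which are hypothetical); the actual switch implements the equivalent decision with one five-input look-up table per output port and forwards the result as the one hot Route signal of Table 1.

```python
# Behavioral sketch (not the released RTL) of a route look-up performed on
# behalf of the *next* switch. Topology, routing function and port numbering
# are hypothetical; only the 5-bit destination field and the one hot "Route"
# encoding follow Table 1.

def xy_route(dest, my_x, my_y, mesh_width=8):
    """Pick an output port for 'dest' using XY routing (a standard deadlock
    free choice for 2D meshes). Ports: 0=east, 1=west, 2=north, 3=south, 4=local."""
    dest_x, dest_y = dest % mesh_width, dest // mesh_width
    if dest_x > my_x:
        return 0  # east
    if dest_x < my_x:
        return 1  # west
    if dest_y > my_y:
        return 2  # north
    if dest_y < my_y:
        return 3  # south
    return 4      # local port, packet has arrived

def one_hot(port):
    """Encode the selected port as the one hot 'Route' value sent to the next switch."""
    return 1 << port

# Example: a switch at (1, 1) forwarding a packet whose 5-bit destination is node 7.
port = xy_route(7, my_x=1, my_y=1)
print(port, format(one_hot(port), '05b'))   # -> 0 00001 (east)
```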
The other critical path of the input part is the read enable signal of the input FIFO. In order to keep the latency down, the read enable signal is generated by looking at the destination port of all other input ports. If no other input port is trying to communicate with the selected output port and the output port is ready to send, the packet will be sent immediately.

3.2. Output part

Once the first part of a packet is available in the input FIFO, the arbiter of the selected output port is notified. If the port is already busy, or if several input ports are trying to send at once, the arbiter uses round robin arbitration to choose the next packet to be sent once the current sender is finished. The arbitration is therefore distributed among the input ports, where the read enable signal has to be generated without waiting a clock cycle for the arbiter. If the output port is available and no other input port is trying to send to this port, the arbiter will allocate the output port for the duration of the incoming packet. Besides the arbiters, only one mux per output port is needed. A small logic depth optimization that has been made is to move a small portion of the arbiter into the output mux. This is possible because a 4-to-1 mux only uses three of the available inputs on each of the two LUTs that are required to implement such a mux. It should be noted that the output mux is not connected to all input ports, since messages are not supposed to be routed back to the port they arrived on.
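The behavior of the output port arbitration can be summarized with the following Python sketch. It is an assumed behavioral model, not the released RTL: an input is granted immediately when the output port is idle and uncontended; otherwise a round robin pointer selects the next sender once the current packet has been forwarded.

```python
# Behavioral sketch of the distributed output port arbitration described above.
# The class and method names are hypothetical.

class OutputArbiter:
    def __init__(self, num_inputs):
        self.num_inputs = num_inputs
        self.owner = None      # input port currently holding the output port
        self.rr_ptr = 0        # round robin pointer

    def arbitrate(self, requests):
        """requests: one boolean per input port, true if that input has a packet
        destined for this output port. Returns the granted input port index."""
        if self.owner is not None:
            return self.owner                      # port is allocated for the whole packet
        if sum(requests) == 1:
            self.owner = requests.index(True)      # uncontended: grant immediately
        elif sum(requests) > 1:
            for i in range(self.num_inputs):       # contention: round robin search
                candidate = (self.rr_ptr + i) % self.num_inputs
                if requests[candidate]:
                    self.owner = candidate
                    break
        return self.owner

    def packet_done(self):
        """Called when the 'Last' word of the current packet has been sent."""
        if self.owner is not None:
            self.rr_ptr = (self.owner + 1) % self.num_inputs
        self.owner = None

# Example: three input ports compete for one output port.
arb = OutputArbiter(num_inputs=3)
print(arb.arbitrate([False, True, True]))  # -> 1 (first requester from pointer 0)
arb.packet_done()
print(arb.arbitrate([False, True, True]))  # -> 2 (round robin moved past port 1)
```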
4. WISHBONE BRIDGE

In addition to the NoC architecture described above, we have also developed an interface that allows Wishbone [12] compatible components to be easily connected to our NoC architecture. The protocol that is used to communicate over the NoC is summarized in Table 2, and the data-flow in the bridge is shown in Fig. 1. In order to be able to operate a bus connected to the Wishbone side of the bridge at a high clock frequency, it is important that as many signals as possible are registered before being allowed to enter or leave the Wishbone bus. This causes problems during a write burst, because the bridge does not know beforehand whether a slave will acknowledge a transaction or not. This means that it is necessary to use the unregistered acknowledgment signal. The usage of this signal is shown in Fig. 1, where it is used to drive the CE input of the data and address flip-flops. For all other purposes, the registered version of the acknowledgment signal is used. In the current version of the bridge, a few other control signals are also sparingly used in their unregistered versions, but the acknowledgment signal is the most critical of these.

4.1. Deadlock avoidance

In order to avoid deadlocks we must make sure that all messages will be accepted. As a counterexample, consider a system with two nodes that have sent a large number of read requests to each other. If too many read requests are present in the network, there is no space available in the network for the replies to these read requests and no further progress can be made. We have solved this problem by having a short queue for read requests in the Wishbone bridge. As soon as a read request is received from the NoC, no new incoming Wishbone transactions will be accepted. This queue can be sized so that it is guaranteed never to be filled in a given system (a behavioral sketch of this rule is given below, after Section 4.2).

4.2. Limitations

One problem in the Wishbone standard is that it is designed with a combinatorial bus in mind. If the bus is pipelined, it is no longer possible to utilize Wishbone to its full potential. Wishbone provides signals for handling burst reads, but the only length indication provided for a linear burst is the fact that at least one more word is requested. This causes problems if many pipeline stages separate the slave and the master. We have therefore augmented the Wishbone interface with a transaction length signal so that a read reply will contain exactly the number of words that have been requested.
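The deadlock avoidance rule of Section 4.1 can be modeled as follows. This is an assumed Python sketch (the class, method names and default queue depth are hypothetical), meant only to show the invariant: read requests from the NoC are held in a bounded queue, and while any request is pending the bridge refuses new transactions from its local Wishbone master, so replies can always be produced and drained.

```python
# Behavioral sketch (not the actual bridge) of the bounded read request queue.
from collections import deque

class BridgeModel:
    def __init__(self, read_queue_depth=4):          # depth is system dependent
        self.read_requests = deque()
        self.read_queue_depth = read_queue_depth

    def accept_wishbone_transaction(self):
        """A new transaction from the local Wishbone master is only accepted
        when no read request from the NoC is waiting to be served."""
        return len(self.read_requests) == 0

    def noc_read_request(self, request):
        """A read request arriving from the NoC. The queue is sized so that
        this assertion can never fire in a correctly dimensioned system."""
        assert len(self.read_requests) < self.read_queue_depth
        self.read_requests.append(request)

    def serve_one_read(self):
        """Serve a queued read request and free its queue slot."""
        return self.read_requests.popleft() if self.read_requests else None
```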
Table 2. The protocol used by the Wishbone bridge. A write request packet can contain up to N words, a read request packet always contains 2 words, and a read reply can contain up to M words (M ≤ 31).

Request type  Word   Bits    Value
Write         0      35:34   "00" (Write request)
              0      29:0    Address (in 32 bit words)
              1..N   35:32   Byte select signals
              1..N   31:0    Data
Read          0      35:34   "01" (Read request)
              0      29:0    Address (in 32 bit words)
              1      34:30   Number of requested words
              1      29:26   Byte selects for non burst read
              1      25:21   Source node address
              1      20:18   Request ID
Read reply    1..M   35      "1" (Read reply)
              1..M   34:32   The request ID of this read request
              1..M   31:0    Data
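As an illustration of the packet format, the following Python sketch encodes the 36-bit NoC words of Table 2. It is an assumed sketch, not the released bridge code; the field positions follow the table, while the function and argument names and the example values are hypothetical.

```python
# Illustrative encoders for the Wishbone bridge packets of Table 2.

def write_request(address, data_words, byte_selects):
    """Write request: header word followed by up to N data words."""
    words = [(0b00 << 34) | (address & 0x3FFFFFFF)]               # 35:34 = "00", 29:0 = address
    for sel, data in zip(byte_selects, data_words):
        words.append(((sel & 0xF) << 32) | (data & 0xFFFFFFFF))   # 35:32 = byte selects, 31:0 = data
    return words

def read_request(address, num_words, byte_selects, src_node, request_id):
    """Read request: always exactly two words."""
    word0 = (0b01 << 34) | (address & 0x3FFFFFFF)                 # 35:34 = "01", 29:0 = address
    word1 = ((num_words    & 0x1F) << 30) \
          | ((byte_selects & 0xF)  << 26) \
          | ((src_node     & 0x1F) << 21) \
          | ((request_id   & 0x7)  << 18)                         # 34:30, 29:26, 25:21, 20:18
    return [word0, word1]

def read_reply(data_words, request_id):
    """Read reply: up to M data words, each tagged with the request ID."""
    return [(1 << 35) | ((request_id & 0x7) << 32) | (d & 0xFFFFFFFF)
            for d in data_words]

# Example: a 2-word burst read of address 0x100 requested by node 3, request ID 5.
for w in read_request(0x100, num_words=2, byte_selects=0xF, src_node=3, request_id=5):
    print(f"{w:09x}")
```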
Fig. 1. Simplified view of the data flow of the Wishbone to NoC bridge. The dotted line is the registered acknowledgment signal. The dashed line is an internal control signal that forces a load of the first value in a transaction.

The current version of the Wishbone bridge also assumes that a slave will not answer a Wishbone request with a retry or an error. Handling these signals in a fully Wishbone compliant way would severely reduce the performance of the NoC. As a future extension, some sort of error reporting register should be introduced to the bridge. Also, while the bridge does not handle retries itself, it will issue a retry to the Wishbone master if a Wishbone read request is received while the answer to a previous read request has not yet arrived. It will also issue a retry if a Wishbone write request is received when the NoC is unable to receive further messages due to a full FIFO. The Wishbone master must honor this request and release the bus for at least one clock cycle if any other device is connected to the same Wishbone bus, in order to avoid deadlocks.

4.3. Testing

Both the Wishbone wrappers and the NoC architecture have been tested in RTL simulations in different NoC configurations (different numbers of nodes and switches). The largest design we have tested contains 16 NoC switches, 32 Wishbone/NoC bridges, 96 memories, and 96 transaction generators. The NoC has also been tested on a Virtex-4 SX35 based FPGA, where we tested a four node NoC with 12 Wishbone bridges connected to memories and transaction generators.

5. RESULTS

The resource utilization of our design is shown in Table 3 and compared with three other publications. When compared to the packet switched architecture in [6], our architecture can operate at the same frequency in the same FPGA technology, whereas our switch only uses 30% of the slices (in fairness, the authors hint that their NoC could be faster, but they do not give a maximum number).
Table 3. The performance of our NoC compared to other FPGA based NoCs.

                    Our 4 port  Our 5 port  [6]        PNoC [7]   NoCem [8]†
                    switch      switch      (4 ports)  (4 ports)
Data width          36 bits     36 bits     32 bits    32 bits    32 bits
Virtex-II 6000-4    166 MHz     151 MHz     166 MHz    -          -
Virtex-II Pro 30-7  257 MHz     244 MHz     -          138 MHz    150 MHz
Virtex-4 LX80-12    272 MHz     260 MHz     -          -          -
Latency (cycles)    3           3           6          -          -
Slices              431         659         1464       364        -
LUTs                780         826         -          -          1455‡
Flip-flops          452         615         -          -          -

† The number of ports for this value is not stated in the paper.
‡ Not explicitly mentioned in the paper; calculated from the size of a 2×2 NoC.
When compared to [7], our system is capable of operating at a significantly higher frequency while being only slightly larger (in addition to providing slightly wider links). The authors also do not mention how deadlocks are avoided or handled in their design, and the latency of their NoC is unknown. Our NoC can also operate at a higher clock frequency than NoCem [8] with less resource usage. However, the resource usage comparison is not completely fair, since NoCem is capable of handling virtual channels (although [8] does not mention whether the reported LUT resource usage is with or without virtual channels). Finally, the size of the Wishbone bridge depends on the routing table and the size of the read request FIFO, but a typical bridge with a simple routing table and a 32-entry read request FIFO will use 450 LUTs and 429 flip-flops.
6. FUTURE WORK

Since we will release this work as open source, it is our hope that this research project can be a platform upon which further FPGA based NoC research can take place. The NoC architecture is available for use under the MIT license at http://www.da.isy.liu.se/research/soc/fpganoc/

7. CONCLUSION

In this paper we have presented an open source Network-on-Chip architecture optimized for the Virtex-4 FPGA. The network can operate at over 260 MHz and the area of a NoC switch is significantly smaller than in previous results at the same operating frequency. We have also presented a bridge which allows Wishbone compatible components to be connected to this NoC.

8. REFERENCES

[1] W. Dally and B. Towles, Principles and Practices of Interconnection Networks. Morgan Kaufmann, 2004.
[2] W. J. Dally and B. Towles, "Route packets, not wires: On-chip interconnection networks," in Design Automation Conference, 2001, pp. 684-689. [Online]. Available: citeseer.ist.psu.edu/dally01route.html
[3] K. Goossens, J. Dielissen, and A. Radulescu, "Aethereal network on chip: concepts, architectures, and implementations," IEEE Design & Test of Computers, vol. 22.
[4] D. Bertozzi and L. Benini, "Xpipes: a network-on-chip architecture for gigascale systems-on-chip," IEEE Circuits and Systems Magazine, vol. 4, 2004.
[5] T. Bartic, J.-Y. Mignolet, V. Nollet, T. Marescaux, D. Verkest, S. Vernalde, and R. Lauwereins, "Highly scalable network on chip for reconfigurable systems," in Proceedings of the International Symposium on System-on-Chip, 2003.
[6] N. Kapre, N. Mehta, M. deLorimier, R. Rubin, H. Barnor, M. J. Wilson, M. Wrighton, and A. DeHon, "Packet switched vs. time multiplexed FPGA overlay networks," in IEEE Symposium on Field-Programmable Custom Computing Machines, 2006.
[7] C. Hilton and B. Nelson, "PNoC: a flexible circuit-switched NoC for FPGA-based systems," IEE Proceedings - Computers and Digital Techniques, vol. 153, 2006.
[8] G. Schelle and D. Grunwald, "Onchip interconnect exploration for multicore processors utilizing FPGAs," in 2nd Workshop on Architecture Research using FPGA Platforms, 2006.
[9] "NoCem - network on chip emulator," http://www.opencores.com/projects.cgi/web/nocem/overview
[10] T. Mak, P. Sedcole, P. Y. Cheung, and W. Luk, "On-FPGA communication architectures and design factors," in 16th International Conference on Field Programmable Logic and Applications, 2006.
[11] D. Wiklund and D. Liu, "SoCBUS: switched network on chip for hard real time embedded systems," in Proceedings of the International Parallel and Distributed Processing Symposium, 2003.
[12] "Wishbone system-on-chip (SoC) interconnection architecture for portable IP cores," http://www.opencores.org/, 2002.