A Validation and Performance Evaluation Tool for ProtoNoC
David Castells-Rufas Cephis - MISE ETSE, UAB Bellaterra, Spain
[email protected]
Jaume Joven Cephis - MISE ETSE, UAB Bellaterra, Spain
[email protected]
Abstract-Simulating a NoC at the RTL level can be extremely complex: the simulation of a relatively small NoC, such as a 4x4 mesh, can involve observing thousands of wires on a standard HDL simulator. JHDL's facilities for extending the simulator environment, together with the possibility of fully analyzing the runtime object model of the circuit, offer a great opportunity to develop modules that address complex features such as high-level validation and performance evaluation. We present a tool that allows NoC architecture models to be defined with some flexibility. Traffic generation processes described in a high-level language can be added to the model. Simulation can then be used to validate system operation under realistic conditions and to obtain accurate values of expected performance.
I. INTRODUCTION
By now almost everybody in the microelectronics world agrees that multiprocessors on a single chip will enter mainstream production in the near future. The increasing cost of chip design will force chips to be reused for multiple applications, and multiprocessors on a single chip offer this flexibility. The interconnection scheme among multiple processors is still an open topic addressed by many research groups. There is some consensus that traditional bus-based interconnects cannot be stretched any further and that multiprocessor communication should be based on the Network on Chip (NoC) concept [1]. However, there are many possible NoC designs, and many interrelated variables determine NoC performance: traffic distribution, network topology, switching scheme, flow control, packet size, etc. In addition, performance is application dependent and, as stated before, chips will not be limited to a single application but will have to be reused for several. A solution that performs well on average in most situations will therefore be preferred over an optimal solution that performs very well in a single application but worse in others.
1-4244-0622-6/06/$20.00 ©2006 IEEE.
Jordi Carrabina Cephis - MISE ETSE, UAB Bellaterra, Spain j
[email protected]
This work concentrates on developing a system to help in NoC design space exploration and performance evaluation under different load conditions. Simulating a NoC can be a very tedious process: a relatively small NoC can have thousands of wires. It is not acceptable to use traditional simulation tools (like ModelSim) to verify such a complex system, just as no software programmer would accept using a logic analyzer to debug a program. Higher-level tools are needed to develop and verify NoCs effectively.
II. PROTONOC
We designed the simplest network we could think of and, because of its simplicity, we call it ProtoNoC. The goal of the ProtoNoC design is minimal area usage. ProtoNoC is based on a mesh topology with XY routing. Flow control is based on a simple four-phase handshake. A GALS approach is followed, so the processors in the tiles of the mesh are clocked but the network itself is unclocked. The channel width is equal to the packet width or, in other words, a packet has a size of one unit of the channel width. The packet is divided into a header, which contains the destination address and the source address, and a payload (Figure 1). The width of all fields can be parameterized, but in this example we fix the values at 4 bits for each of the four address fields and 16 bits for the payload. Having a total packet size of 32 bits simplifies the design of the Network Interface Controller (NIC) when 32-bit microprocessors are used.
Figure 1. ProtoNoC packet layout (fields: Dx, Dy, Sx, Sy, Payload)
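As an illustration of the packet layout, the following minimal Java sketch packs and unpacks the five fields of a 32-bit packet. The field order follows Figure 1, but the exact bit positions (Dx in the most significant bits) and the class and method names are assumptions made for this example, not part of the ProtoNoC sources.

```java
/** Sketch of a 32-bit ProtoNoC packet: four 4-bit address fields and a 16-bit payload. */
final class ProtoNocPacket {
    // Assumed bit positions (hypothetical): Dx occupies the most significant bits.
    private static final int DX_SHIFT = 28;
    private static final int DY_SHIFT = 24;
    private static final int SX_SHIFT = 20;
    private static final int SY_SHIFT = 16;
    private static final int ADDR_MASK = 0xF;       // 4-bit address field
    private static final int PAYLOAD_MASK = 0xFFFF; // 16-bit payload

    private ProtoNocPacket() {}

    /** Builds a packet word; values wider than their field are truncated. */
    static int pack(int dx, int dy, int sx, int sy, int payload) {
        return ((dx & ADDR_MASK) << DX_SHIFT)
             | ((dy & ADDR_MASK) << DY_SHIFT)
             | ((sx & ADDR_MASK) << SX_SHIFT)
             | ((sy & ADDR_MASK) << SY_SHIFT)
             | (payload & PAYLOAD_MASK);
    }

    static int dx(int packet) { return (packet >>> DX_SHIFT) & ADDR_MASK; }
    static int dy(int packet) { return (packet >>> DY_SHIFT) & ADDR_MASK; }
    static int sx(int packet) { return (packet >>> SX_SHIFT) & ADDR_MASK; }
    static int sy(int packet) { return (packet >>> SY_SHIFT) & ADDR_MASK; }
    static int payload(int packet) { return packet & PAYLOAD_MASK; }

    public static void main(String[] args) {
        // Packet from node (0,0) to node (3,3) carrying 0xBEEF.
        int packet = pack(3, 3, 0, 0, 0xBEEF);
        System.out.printf("packet = 0x%08X%n", packet); // prints packet = 0x3300BEEF
    }
}
```

With this layout, half of each 32-bit word carries addressing information, which is consistent with a single bus write transferring 16 bits of payload.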
The switching scheme of ProtoNoC does not fit within the standard switching schemes. Figure 2 shows different classic switching schemes together with ProtoNoC switching, which we call Ephemeral Circuit Switching. It has some things in common with both circuit switching and packet switching.
Figure 2. Different switching schemes (panels: Circuit Switching, Store & Forward, Virtual Cut-Through, Wormhole, Ephemeral Circuit; each panel plots nodes N1 to N4 against time). Packet header data is denoted by a green (darker) box and payload is denoted by a white box. An arrow denotes the indication of a successful transmission.
In packet switching networks, a header containing the source and destination addresses is embedded in each packet, because the channel is multiplexed to allow simultaneous communication among multiple parties. In these networks the packet size is normally greater than the channel width. Packet switching is based on the concept that, once a packet leaves a NIC, it is stored somewhere in the network before arriving at its destination. The packet switching strategies differ in how they handle the storage and forwarding of packets at each switching node. Store & Forward switching stores the complete packet before forwarding it to the next switching node. Virtual cut-through switching starts to send the packet as soon as it arrives but stores the whole packet until it has been completely forwarded. Wormhole switching starts to send the packet as soon as it arrives and stores only the fragments of the packet that have not yet been forwarded, not the whole packet.
Ephemeral Circuit Switching shares with packet switching the concept of embedding the header in the packet, but the header is smaller than the channel width and the packet size is just one channel width unit. When the packet size is one channel width unit, there are no differences among the flavors of packet switching.
In circuit switching networks, a connection is established prior to transmitting any data, and afterwards the channel is exclusively owned for transmission, even if nothing is transmitted. Although payload data is transmitted after connection, establishing the connection involves some transmission of information, at least the address of the party to connect to. Ephemeral Circuit Switching can thus be interpreted as a circuit switching mode that embeds some payload data into the signaling information: once the connection is established, it is not used to transmit further information and is immediately torn down.
Ephemeral Circuit Switching shows several advantages over other switching schemes: it does not block the channel for long periods of time as circuit switching does, it has a low latency, and it does not require any storage, in contrast with the packet switching alternatives. On the other hand, some drawbacks exist: the channel overhead is high and the throughput is low.
III. NETWORK MODELING
A. Design Environment
A parametric NoC model has been built in JHDL. JHDL [3], [4] is a design environment that provides a Java API for describing hardware circuits in a constructive way, as well as a collection of tools and utilities for their simulation and hardware execution.
JHDL includes an interactive simulation environment that provides several facilities for exercising the circuit and viewing its state during simulation. The simulator includes a hierarchical circuit browser with a tabular view of signals, a schematic viewer that includes signal values, a waveform viewer, a memory viewer and a command line interpreter. Moreover, the command line interface can be extended with custom commands that interact with the system model in specific ways. More complex hybrid testbenches can also be developed. For instance, a testbench can instantiate high-level behavioral circuits for stimuli generation and use the schematic viewer and waveform viewer for easy visual inspection of results. There are some good reasons to use JHDL as a modeling language for NoCs:
* Circuits can dynamically change their interface using the addPort and removePort functions. This feature is not present in VHDL, Verilog or SystemC, and allows boundary switches of the mesh to be designed efficiently by removing unused ports.
* Block construction can be parameterized by complex arguments, not only simple numeric parameters. This feature, otherwise present only in SystemC, enables the design of powerful traffic generation modules.
* Custom state viewers can be developed that provide a much richer interpretation of system state than waveforms.
B. Model
The network model consists of three basic blocks that are replicated to form the entire system: processors, network interfaces and switching nodes.
The processors are behaviorally modeled and run in independent threads. The processor model includes two primitives to send data to and receive data from the network. Multiple standard Java programs can be loaded into a processor and then run simultaneously following a round-robin schedule. Traffic generators consist of pairs of transmitter/receiver programs executed on different processors. Using advanced JHDL features, traffic generators can be easily included in a testbench (Figure 3) and network features can be fully examined under different traffic conditions. In this model computation is untimed, while communication is timed at the cycle-accurate level.
    addTrafficGenerator(0, 0, 3, 3);
    addTrafficGenerator(3, 3, 0, 0);
    addTrafficGenerator(3, 0, 0, 3);
    addTrafficGenerator(0, 3, 3, 0);
    addTrafficGenerator(1, 1, 3, 2);
    buildProcessors();
Figure 3. Traffic generation instructions
NICs are designed to expose the network to the processors. A processor accesses a NIC over a system bus, through a bank of registers. Only four registers are used to provide a simple interface: Status, Control, Rx Data, and Tx Data. Given the packet layout used (Figure 1), 16 bits can be transferred with every bus write operation if no congestion occurs.
Figure 4. A full switch
Switching nodes are responsible for the routing and transmission of packets. A full switch contains five bidirectional ports: North, East, South, West, and Local. A processor is connected to the Local port through a NIC. The other ports interconnect neighboring switches. Mesh boundary switches have fewer ports because they lack neighboring switches in some directions. Each incoming port can be routed to any outgoing port depending on the routing decision for each packet. Figure 4 shows a simplified view of the switch module in which only one outgoing direction is considered. As the transmitted packet includes the destination address, the XYRouting module redirects the request signal to the appropriate output direction. Multiple incoming packets can compete for the same output port. The PathSwitch module resolves the collision by applying a priority selection to the request lines. After selecting the highest priority request, PathSwitch (Figure 5) latches the selected input channel and sets a bit to indicate that a channel has been established. The selected channel will not change until the connection is dropped, which allows the acknowledge signals to be redirected correctly.
Figure 5. PathSwitch module design
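The routing and arbitration behavior described above can be sketched behaviorally in plain Java. This is only an illustration under stated assumptions, not the JHDL circuit: the class and method names are hypothetical, x is assumed to grow towards East and y towards North, the lowest request index is assumed to have the highest priority, and a connection is assumed to be dropped when the selected request line is released.

```java
import java.util.ArrayList;
import java.util.List;

/** Behavioral sketch of XY routing and of priority selection with a held channel. */
final class SwitchSketch {
    /** Ports of a full switch, as named in the text. */
    enum Port { NORTH, EAST, SOUTH, WEST, LOCAL }

    private SwitchSketch() {}

    /** XY routing: correct the X coordinate first, then Y, then deliver locally. */
    static Port route(int x, int y, int dx, int dy) {
        if (dx > x) return Port.EAST;   // assumption: x grows towards East
        if (dx < x) return Port.WEST;
        if (dy > y) return Port.NORTH;  // assumption: y grows towards North
        if (dy < y) return Port.SOUTH;
        return Port.LOCAL;
    }

    /** Nodes visited from source to destination, both included, as {x, y} pairs. */
    static List<int[]> path(int sx, int sy, int dx, int dy) {
        List<int[]> nodes = new ArrayList<>();
        int x = sx;
        int y = sy;
        nodes.add(new int[] {x, y});
        for (Port p = route(x, y, dx, dy); p != Port.LOCAL; p = route(x, y, dx, dy)) {
            switch (p) {
                case EAST:  x++; break;
                case WEST:  x--; break;
                case NORTH: y++; break;
                case SOUTH: y--; break;
                default:    break;
            }
            nodes.add(new int[] {x, y});
        }
        return nodes;
    }

    /** Fixed-priority selection; returns the winning index, or -1 if nothing is requested. */
    static int selectRequest(boolean[] requests) {
        for (int i = 0; i < requests.length; i++) {
            if (requests[i]) return i; // assumption: lowest index wins
        }
        return -1;
    }

    /** Holds the selected input until its request line is released. */
    static final class PathSwitch {
        private int selected = -1; // -1 means no channel is established

        int update(boolean[] requests) {
            if (selected >= 0 && !requests[selected]) {
                selected = -1; // connection dropped
            }
            if (selected < 0) {
                selected = selectRequest(requests);
            }
            return selected;
        }
    }

    public static void main(String[] args) {
        // Route of the generator from node (1,1) to node (3,2).
        for (int[] n : path(1, 1, 3, 2)) {
            System.out.print("(" + n[0] + "," + n[1] + ") ");
        }
        System.out.println();
    }
}
```

Holding the selection while the request stays asserted is what lets the acknowledge be steered back to the right input; a different priority or release policy would only change selectRequest and update.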
C. Asynchronous Modeling
ProtoNoC is based on an asynchronous handshake and a GALS approach. Unfortunately, asynchronous circuit simulation is not supported in JHDL. To overcome this limitation we provide a synchronous model in which the asynchronous parts are replaced by synchronous ones controlled by a global clock. For instance, latches are replaced by equivalent registers, and all switching node outputs are registered to avoid asynchronous loops. In fact, there are not two different models but a single model that is parameterized as either Asynchronous or Synchronous. The synchronous system can be verified in the simulation environment and synthesized. The asynchronous system cannot be simulated in JHDL, but it can be synthesized and later either simulated in other tools or executed on the hardware platform.
IV. VALIDATION
The interactive simulation features of JHDL let designers dive into the schematic view of the system, with all signal values presented, and advance the clock step by step. This method can be very productive for the verification of small circuits, such as a single NoC switching node. However, larger and graphically complex systems, such as a whole NoC, are not effectively verified with this method. For instance, errors in handshake chaining and wrong routing decisions are difficult to detect and solve. We extend the simulation environment by adding a custom viewer of the NoC layout (Figure 6). This viewer represents the switching nodes and the processors connected to them (through a NIC). Each bidirectional channel is represented by two arrows in opposite directions connecting the channel endpoints. The drawing conveys the state of each channel through a color code that depends on the current state of the handshake (black=idle, blue=request, magenta=acknowledge, red=release). The internal state of a switching node is shown by drawing a line between the inputs and outputs of established circuits.
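One possible reading of this color code is sketched below in Java. The paper does not state which signal levels correspond to each color, so the mapping from the request and acknowledge levels to the four states is an assumption based on the usual four-phase sequence (request raised, acknowledge raised, request lowered, acknowledge lowered). The class and method names are illustrative and are not part of JHDL or of the tool.

```java
/** Sketch: decoding the phase of a four-phase handshake from its two signal levels. */
final class HandshakeView {
    /** Channel states with the colors used by the custom viewer. */
    enum Phase {
        IDLE("black"), REQUEST("blue"), ACKNOWLEDGE("magenta"), RELEASE("red");

        final String color;

        Phase(String color) {
            this.color = color;
        }
    }

    private HandshakeView() {}

    /**
     * Assumed mapping (not given in the paper):
     * req=0, ack=0 is idle; req=1, ack=0 is request;
     * req=1, ack=1 is acknowledge; req=0, ack=1 is release.
     */
    static Phase decode(boolean request, boolean acknowledge) {
        if (request) {
            return acknowledge ? Phase.ACKNOWLEDGE : Phase.REQUEST;
        }
        return acknowledge ? Phase.RELEASE : Phase.IDLE;
    }

    public static void main(String[] args) {
        // Walk through one complete transfer in signal order.
        boolean[][] sequence = {{false, false}, {true, false}, {true, true}, {false, true}};
        for (boolean[] s : sequence) {
            Phase phase = decode(s[0], s[1]);
            System.out.println(phase + " -> " + phase.color);
        }
    }
}
```

A viewer built this way only needs the two handshake wires of each channel, which keeps the drawing cheap to refresh at every simulation step.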
Figure 6. Custom viewer for validation
Such a simple drawing is extremely useful for tracking system behavior and detecting router malfunctions. Moreover, simulation remains interactive: the clock can be advanced step by step, waveforms can be generated, and the annotated schematic can be explored.
V. PERFORMANCE EVALUATION
Performance evaluation of ProtoNoC can be based on the analysis of the handshake signals. Latency can be measured as the time between a rising edge of the request signal and a rising edge of the acknowledge signal. The number of packets that traverse a given point can be measured by counting the falling edges of the acknowledge signal at that point. Latency and channel usage can be sampled by inserting additional behavioral circuits, which custom viewers then use for graphical feedback. Figure 7 shows the channel usage of the mesh with the traffic pattern used in Figure 3. The upward channel between node (3,1) and node (3,2) shows some congestion because it is the only segment shared by two communication paths.
Figure 7. Channel usage graph for the example traffic pattern
In ProtoNoC, bandwidth and latency depend on the number of hops between the two communication endpoints. The maximum bandwidth of the asynchronous and synchronous designs is shown in Figure 8. In the synchronous design, the bandwidth has been measured by counting the number of clock cycles needed to perform a transmission in simulation at the maximum frequency given by synthesis (96 MHz). In the asynchronous design, a deeper analysis of circuit delays has been performed.
Figure 8. Maximum bandwidth as a function of hops (series: Sync, Async; x-axis: Hops, 1 to 6; y-axis: Mbps, 50 to 450)
VI. SYNTHESIS RESULTS
Synchronous and asynchronous versions of a ProtoNoC 4x4 mesh have been synthesized with Quartus II for an Altera S30 device. The results (TABLE I) show that the asynchronous circuit is smaller, as expected.
TABLE I. SYNTHESIS RESULTS FOR ALTERA S30 DEVICE
Design      Cost (LCs)
            Synchronous   Asynchronous
Mesh 4x4    5995          5764
VII. CONCLUSIONS AND FUTURE WORK
We have presented a tool and a methodology that allow an early evaluation of NoC architectures in application environments. Performance estimates have been obtained for both synchronous and asynchronous designs. In future work, real applications will be used instead of traffic generators, and the performance estimates will be compared with the execution of the actual hardware system.
REFERENCES
[1] L. Benini et al., "Networks on chips: a new SoC paradigm", IEEE Computer, 35(1), Jan. 2002, pp. 70-78.
[2] J. Wu, Distributed System Design, CRC Press, 1999.
[3] P. Bellows and B. Hutchings, "JHDL - an HDL for reconfigurable systems", in K. Pocek and J. Arnold, editors, Proceedings of IEEE Workshop on FPGAs for Custom Computing Machines, pages 175-184, Napa, CA, April 1998. IEEE Computer Society Press.
[4] B. Hutchings, P. Bellows, J. Hawkins, S. Hemmert, B. Nelson, M. Rytting, "A CAD Suite for High-Performance FPGA Design", IEEE Symposium on Field-Programmable Custom Computing Machines, 1999.