Programmable Protocol Processing Engine in Heterogeneous MP-SoC: A Case Study

Mohammad Badawi, Huisheng Zhou, Zhonghai Lu and Ahmed Hemani
School of Information and Communication Technology
KTH Royal Institute of Technology, Stockholm, Sweden
{badawi, huisheng, zhonghai, hemani}@kth.se
ABSTRACT
This paper presents a programmable protocol processing engine and provides a quantitative characterization of its performance and energy. The protocol processing engine is implemented using a low-power 32-bit processor and a minimal software implementation of the TCP/IP stack. We elevated the performance of this engine and reduced its energy consumption by improving the compute-intensive error detection functions provided by the stack, thereby effectively utilizing the HW resources. Besides detailed performance and energy measures, our results show that the speed-up and energy reduction achieved by improving the stack reach 24% and 39%, respectively.
1. INTRODUCTION
Modern embedded systems increasingly utilize heterogeneous MP-SoCs (Multi-Processor System-on-Chip) in order to meet growing performance requirements while keeping power consumption within bounds. To efficiently utilize such a sophisticated SoC, the target application must be partitioned into HW/SW processes, and each process must then be mapped to a corresponding processor or accelerator such that the overall performance and energy costs satisfy user constraints. The accurate costs of a partition can be determined through profiling and characterization. Characterizing a partition reveals its resource-level costs in terms of performance, energy and area. Moreover, detailed characterization facilitates determining system-level properties such as buffer sizes, the distribution of interconnect bandwidth among partitions and the dimensioning of configurable units.
Figure 1: Mapping Application to Heterogeneous MP-SoC. The partitioned application (1: application layer (SW); 2: protocol processing layers (SW); 3: CODEC algorithmic (SW); 4: data path and physical-layer control) is mapped onto a heterogeneous MP-SoC comprising a multi-threaded application processor, a protocol processing engine, an image and video accelerator, tiled memory, a PHY controller and a communication interconnect.
This paper focuses on the protocol processing partition, highlighted in red in Fig. 1, and quantifies its performance and energy characteristics. We implemented a protocol processing engine utilizing an open-source, resource-constrained SW implementation of the TCP/IP (Transmission Control Protocol/Internet Protocol) stack called uIP [8] and a high-performance, low-power 32-bit processor called LEON3 [1]. By considering the protocol processing engine as a separate partition with no overhead related to resource sharing, it was possible to determine its maximum performance and minimum energy, which is a key requirement for setting system-level properties. This work was presented in part at the International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies (HEART2014), Sendai, Japan, June 9-11, 2014.
In this study, we used an FPGA board implementing an SoC composed of LEON3, a bus, memory and an Ethernet MAC (Media Access Controller). As described in Section 3, we implemented the SW driver of the Ethernet MAC and ported uIP to LEON3 to arrive at the initial version of the protocol processing engine. We then improved this initial version by re-implementing the compute-intensive error detection functions provided by uIP, thereby effectively utilizing the 32-bit HW units of LEON3. Following that, we characterized the performance of both versions of the engine and profiled the executed instruction traces. We also estimated the energy of both versions using the profiling results together with energy models of LEON3 instructions collected at gate level, and we highlighted the performance and energy hot-spots.
2. RELATED WORK
Several studies have evaluated partitioning critical processing tasks and off-loading them to specialized resources. In [2], M. Verderber et al. performed timing and power analysis, as well as optimization, of an MPEG-2 decoder. Similarly to our study, they implemented the decoder on an FPGA as a partitioned system and then reported their analysis of time, power, speed-up and energy reduction. In [3], A. Hodjat et al. used an FPGA to implement a system comprising a LEON core and a loosely-coupled Advanced Encryption Standard accelerator. They then characterized the cycle count, throughput, energy and look-up table size, and evaluated this system against a pure SW-based solution. In [4], a person tracking system with distributed FPGA-based cameras was built as an image processing layer on top of a communication layer. The communication layer was a software/hardware Object Request Broker (ORB) integrating the GIOP, TCP, IP and Ethernet protocols. The authors interconnected two FPGA-based cameras with a desktop computer to analyze performance. They measured Time-of-Flight as a metric for the performance of the interconnection network and Sending/Receiving Overhead as a metric for processing the GIOP, TCP, IP and Ethernet protocols at the node side. Although investigating the performance of the communication layer as part of a system shows some similarity to our study, there are considerable differences. Firstly, our protocol processing engine uses an embedded SW protocol stack, while the Linux kernel is used in [4]. Secondly, our study focuses on protocol processing at the node and provides performance as well as energy analysis of the protocol in terms of its composing functions; in [4], performance is measured for the protocol as one unit. Thirdly, our study quantifies the performance and energy of the protocol's critical functions (hot-spots) and improves them.
3. PROTOCOL PROCESSING ENGINE
As mentioned before, the "Application" processor off-loads the protocol processing task to the engine, which in turn controls the Ethernet device and completes packet transmission and reception. Once the protocol processing task is done, the engine returns an event of success or failure to the "Application". Next, we discuss the details of the protocol processing engine.
3.1 LEON3 Embedded Processor
LEON3 [1] is a 32-bit processor core based on SPARC-V8. It is a highly configurable processor designed for embedded applications by Aeroflex Gaisler [5]. It has separate data and instruction caches, and it uses the AMBA AHB (Advanced High-speed Bus) interface for data and instruction accesses. LEON3 was designed not only as a high-performance processor but also as a low-power device. The SoC used in this study was implemented on the LEON3 GR-XC3S-1500 Template Design development board [6] from Aeroflex Gaisler. The GR-XC3S-1500 board includes a Spartan3-1500 FPGA, external memories and I/O, including an Ethernet MAC called GRETH [1]. Next, we elaborate how LEON3, GRETH and the protocol stack interact to complete communication tasks.
3.2 Embedded TCP/IP Software Stack
uIP [7][8] is a minimal SW implementation of the TCP/IP stack that targets embedded systems with limited resources, specifically 8-bit micro-controllers. It supports a single network interface and focuses mainly on the Network and Transport layers of the OSI (Open Systems Interconnection) model. It implements IPv4 (Internet Protocol), ICMP (Internet Control Message Protocol), UDP (User Datagram Protocol) and TCP (Transmission Control Protocol), but it only handles a single "in-flight" TCP segment per connection. uIP complies with the necessary requirements of RFC1122 [9]; hence, it can communicate with full-scale TCP/IP implementations as well as with implementations of equal capabilities. Being a resource-constrained implementation, uIP uses lightweight stack-less threads called Proto-Threads, and it can be utilized as the main program in a single-tasking system or as a task in a multi-tasking system. It is worth mentioning that uIP refers to the upper-layer protocols as "The Application" and to the lower layers, including hardware and driver, as "The Network Device".

uIP offers a generic implementation of the error detection functions (checksum) required by the transport-layer protocols. However, it gives the user the possibility to implement a checksum calculation specific to the target architecture. Note that the checksum calculation has to be performed over the whole packet; thus, the implementation of the checksum function has a major impact on uIP performance. uIP replaces dynamic memory allocation with a single Global Buffer, which can accommodate a packet of maximum length. When a packet arrives, the driver of "The Network Device" places the packet into the Global Buffer, and since there is only a single Global Buffer, the next arriving packet will overwrite it. This obliges uIP and "The Application" to either finish packet processing before the arrival of the next packet, or to move the packet to a secondary buffer before processing starts. Secondary buffering has to be implemented by the user as part of "The Application"; if it is not implemented, or the secondary buffer is full, incoming packets will be dropped, which further impacts performance.
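To make the buffering constraint concrete, the following is a minimal sketch of such a secondary-buffering scheme. It assumes uIP's uip_buf (the Global Buffer), uip_len and UIP_BUFSIZE symbols; the secondary-buffer name, the flag and the function name are our own illustrative choices, not part of uIP:

    #include <string.h>
    #include "uip.h"   /* provides uip_buf (the Global Buffer) and uip_len */

    /* Illustrative secondary buffer; name, size and flag are ours. */
    static unsigned char sec_buf[UIP_BUFSIZE];
    static int sec_buf_full = 0;

    /* Called by "The Application" as soon as uIP delivers new data:
     * copy the packet out of the Global Buffer before the next
     * arriving packet can overwrite it. */
    static void stash_packet(void)
    {
        if (sec_buf_full || uip_len > UIP_BUFSIZE)
            return;                /* no room: this packet is dropped */
        memcpy(sec_buf, uip_buf, uip_len);
        sec_buf_full = 1;          /* the consumer clears this when done */
    }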
3.3 Ethernet MAC and SW Driver
The SW driver we implemented conforms to the GRETH specification [1] and is responsible for two main tasks: Initialization and Communication Control. Initialization is done only once and writes the Configuration registers, setting the HW address (MAC address), selecting the speed and specifying whether the operation mode is half-duplex or full-duplex. Communication Control is done whenever the caller stack (uIP) needs to transmit packet-data from memory to the network and vice versa. In the case of transmission, the data packets have to be placed in memory, and then the SW driver is called. The caller stack (uIP) passes the driver a set of parameters, including the address of the data to be transmitted and its length. The driver then allocates a 1KB-aligned memory area and builds a table of "address-length" tuples called Descriptors. Each Descriptor is 32-bit aligned and points to the address of a single packet-data to be transmitted. Once the Descriptor table is ready, the SW driver writes the address of the first Descriptor to the GRETH Transmit Descriptor Pointer register and fires the transmission by setting the GRETH Control register. GRETH then reads each packet from memory, adds the frame header as well as the Cyclic Redundancy Check field, and sends the frame to the network. The SW driver reads the GRETH Status register after every packet transmission to ensure the success of the transmission. When all packets have been transmitted, the driver disables transmission by clearing the GRETH Control register and returns SUCCESS to the caller stack. Note that reception is performed similarly but in reverse order.
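As an illustration of this transmit flow, here is a hedged sketch of building a one-entry Descriptor table and firing a transmission. The descriptor layout, bit positions and register addresses below are simplified assumptions for illustration only; the authoritative definitions are in the GRETH specification [1]:

    #include <stdint.h>

    /* Simplified GRETH transmit Descriptor: a control word (length and
     * status bits) followed by the 32-bit aligned packet address.
     * Bit positions and register addresses are assumptions. */
    struct greth_desc {
        volatile uint32_t ctrl;   /* [10:0] length, [11] enable (assumed) */
        volatile uint32_t addr;   /* address of the packet data           */
    };

    #define DESC_EN     (1u << 11)                         /* assumed */
    #define GRETH_CTRL  ((volatile uint32_t *)0x80000E00)  /* assumed */
    #define GRETH_TXPTR ((volatile uint32_t *)0x80000E14)  /* assumed */
    #define CTRL_TX_EN  (1u << 0)                          /* assumed */

    /* The Descriptor table must live in a 1KB-aligned memory area. */
    static struct greth_desc tx_table[2] __attribute__((aligned(1024)));

    static void greth_send(const void *pkt, uint32_t len)
    {
        tx_table[0].addr = (uint32_t)(uintptr_t)pkt;   /* 32-bit target */
        tx_table[0].ctrl = (len & 0x7FFu) | DESC_EN;   /* length+enable */

        *GRETH_TXPTR = (uint32_t)(uintptr_t)tx_table;  /* first Descriptor  */
        *GRETH_CTRL |= CTRL_TX_EN;                     /* fire transmission */

        while (tx_table[0].ctrl & DESC_EN)  /* MAC clears enable when the
                                               packet has been sent */
            ;
    }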
3.4 Computation Hot-spot
As mentioned before, uIP offers a generic implementation of error detection (checksum). In this generic implementation, data words are divided into 8-bit blocks that are added together using 8-bit arithmetic, the bit width that uIP is designed for. Since we are using a 32-bit processor, we realized that this fine-grained implementation under-utilizes the available arithmetic resources as well as the memory and bus bandwidth. For this reason, we implemented SW 32-bit error detection functions based on Fletcher's checksum [10] and substituted them for the original functions provided by uIP. In the improved implementation, data words are divided into 16-bit blocks that are accumulated into a 32-bit block called Sum using 32-bit arithmetic. Sum is initialized to "0", and after each data block is accumulated, Sum is folded and summed if its most significant bit is set. The speed-up and energy reduction achieved by this improved implementation are discussed in detail in Section 4.
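A minimal sketch of this folding scheme follows. Function and variable names are illustrative, not uIP's; in uIP, the replacement would be hooked in via the architecture-specific checksum option mentioned in Section 3.2:

    #include <stdint.h>
    #include <stddef.h>

    /* Sketch of the improved 16-bit-block checksum described above.
     * The one's complement of the returned value is what goes into
     * the protocol header. */
    static uint16_t chksum32(const uint8_t *data, size_t len)
    {
        uint32_t sum = 0;   /* "Sum", initialized to 0 */

        while (len > 1) {
            /* accumulate a 16-bit block using 32-bit arithmetic */
            sum += ((uint32_t)data[0] << 8) | data[1];
            if (sum & 0x80000000u)               /* MSB set: fold */
                sum = (sum & 0xFFFFu) + (sum >> 16);
            data += 2;
            len -= 2;
        }
        if (len)                                 /* odd trailing byte */
            sum += (uint32_t)data[0] << 8;

        while (sum >> 16)                        /* final fold to 16 bits */
            sum = (sum & 0xFFFFu) + (sum >> 16);

        return (uint16_t)sum;
    }

Compared with the generic 8-bit version, each iteration consumes two bytes per addition and keeps the running sum in a single 32-bit register, which is what lets the 32-bit datapath of LEON3 be used effectively.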
4. EVALUATION
As shown in Subsection 3.2, uIP mainly deals with IP, UDP, TCP and ICMP. These protocols belong to the bottleneck layers (Network and Transport) and have a considerable effect on the performance and energy of the applications that utilize them. Therefore, characterizing their performance and energy is of significant importance. To perform this detailed characterization, we conducted several experiments on a small Ethernet network. Our experimental network is composed of the LEON3 GR-XC3S-1500 Template Design board, a desktop computer and a 10/100 Mbps switch. The LEON3 core runs at 50 MHz and was configured with a direct-mapped cache system of 8/4 KB instruction/data cache with 16 bytes per line. GRETH was configured to run at 100 Mbps in full-duplex mode. uIP and the SW driver were built and ported to the LEON3 architecture using the development tools from Gaisler [5], namely the SPARC-ELF cross-compiler/debugger v4.4.2 and GRMON v1.1.52 for board programming and monitoring. In all experiments the network was operating in the ideal case (with no additional sources of traffic), and we used uIP as the main program to allow measuring the upper-bound performance when the protocol processor runs as a separate partition. It is worth mentioning that both uIP and the SW driver were compiled with the "-O3" optimization option. The desktop computer used is a Dell OPTIPLEX GX620 with 2GB of memory, a dual-core processor running at 2.8GHz and a Linux operating system. We conducted experiments to characterize the UDP/IP and TCP/IP protocols, as detailed next.
4.1 UDP/IP

4.1.1 Performance
In this experiment, we implemented a UDP/IP transmitter and receiver for transferring UDP packets from the desktop computer to the LEON3 board. The transmitter ran on the desktop computer, where we used an open-source tool called "packETH". The receiver was the protocol processing engine, using uIP Proto-Threads. We transmitted a 100KB text file as UDP packets and recorded the time the engine needed to successfully receive each packet. Since the goal of this experiment is to measure the performance of uIP, we calculated the reception time from the moment a packet reaches "The Network Device" until it reaches the secondary buffer. "The Application" we used performs no action except copying the received packet from the Global Buffer to a secondary buffer. This copy is the means of avoiding packet overwriting and should therefore be counted as part of protocol processing, not application processing (as explained in Section 3.2). The experiment was performed for packet sizes between 64 and 1024 bytes, and we examined the engine with both versions of the checksum. Fig. 2 shows the time the engine needs to receive one UDP packet, as well as the time spent in each of the uIP tasks. As the figure shows, most of the time is consumed by calculating the checksum and moving the packet between memory locations, both of which depend on packet size. Fig. 2 also shows that the time required for calling the driver and for uIP flag processing is constant and independent of packet size. Fig. 3 shows the percentage of time that a packet spends in each of the uIP tasks. Since UDP does not guarantee reliable delivery of data, we modified the experiment to measure the percentage of packets that the engine can receive (packet reception rate). In the transmitter, we introduced a small delay between transmitted packets, transmitted the 100KB text file and monitored the number of packets received by the engine. We again used packet sizes between 64 and 1024 bytes, while adding an inter-packet delay ranging from 20 µsec to 350 µsec. Fig. 4 shows the uIP packet reception rate as a function of inter-packet delay.

Figure 2: UDP Reception Time (two panels: generic vs. improved checksum; packet reception time in µsec over packet sizes of 64-1024 bytes, broken down into Driver Setting, MEMCPY, CHECKSUM, µIP Process and Total).

Figure 3: UDP Reception: Percentage Execution Time (two panels: generic vs. improved checksum; percentage of execution time per task over packet sizes of 64-1024 bytes).

Figure 4: UDP Reception Rate (two panels: generic vs. improved checksum; packet reception rate vs. inter-packet delay in µsec, for packet sizes of 64B to 1024B).

4.1.2 Energy
In this experiment, we estimated the energy the protocol processing engine consumes when uIP receives a UDP packet. Our goal was to estimate the energy as in a real-life scenario where the LEON3 core is an ASIC (Application Specific Integrated Circuit). To achieve this, we performed the same experiment of transmitting a 100KB file from the computer to the engine. On the LEON3 side, we collected the trace of executed instructions resulting from performing the memory copy, the checksum calculation and the total uIP reception process. The trace was collected and profiled using the SPARC-ELF debugger and GRMON, with LEON3 on the FPGA as the target architecture. Since the trace of executed instructions is the same whether LEON3 is implemented on an FPGA or as an ASIC, this takes us halfway. We then combined the profiled instruction traces with the gate-level energy models provided by the energy estimation methodology in [11][12]. This methodology includes an energy database for LEON3's individual instructions; the energy of each instruction in the database was determined from the consumption of the LEON3 core only, excluding the cache system. The LEON3 core used in this methodology was synthesized with a TSMC 90nm technology library at a 400 MHz clock frequency. With the instruction energy models and the traces of executed instructions, we could estimate the amount of energy consumed by the engine when receiving 64-byte as well as 1024-byte UDP packets. We structured our calculation so as to obtain the energy consumption as a function of the cache miss rate. Fig. 5 shows the amount of energy consumed by the engine for each of the uIP reception tasks, for both the generic and the improved checksum implementations. Note that the estimated energy accounts for the effect of cache misses on the core but does not include the energy of the cache itself; the linear increase in energy is thus due to the increase in processor idle time. As shown so far, the improved checksum implementation results in significant speed-up as well as energy reduction. Table 1 quantifies the achieved speed-up and energy reduction.
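As a sketch of how such an estimate can be assembled (this is our own simplification; the actual methodology is defined in [11][12]), the total energy is the per-instruction energy summed over the trace, plus an idle-energy term that grows linearly with the cache miss rate:

    #include <stdint.h>
    #include <stddef.h>

    /* One entry of an (assumed) instruction-energy database: how many
     * times an instruction appears in the profiled trace and its
     * gate-level energy per execution in nJ. */
    struct insn_stat {
        uint64_t count;
        double   energy_nj;
    };

    /* Energy as a function of cache miss rate: active energy summed
     * over the trace, plus idle energy for the stall cycles caused by
     * misses. mem_accesses, miss_penalty_cycles and idle_nj_per_cycle
     * are illustrative parameters of our simplified model. */
    double estimate_energy_nj(const struct insn_stat *db, size_t n,
                              uint64_t mem_accesses, double miss_rate,
                              double miss_penalty_cycles,
                              double idle_nj_per_cycle)
    {
        double active = 0.0;
        for (size_t i = 0; i < n; i++)
            active += (double)db[i].count * db[i].energy_nj;

        /* stall cycles grow linearly with the miss rate, hence the
         * linear increase in energy observed in Fig. 5 */
        double stall_cycles = (double)mem_accesses * miss_rate
                              * miss_penalty_cycles;
        return active + stall_cycles * idle_nj_per_cycle;
    }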
Figure 5: UDP Packet Reception Energy (two panels: reception of a 64-byte and of a 1024-byte UDP packet; energy in nJ vs. cache miss rate from 0 to 0.06, for MEMCPY, CHECKSUM Generic, uIP Process Generic, CHECKSUM Improved and uIP Process Improved).

Table 1: Performance and Energy Improvement
Packet Size (Bytes)   Speed-Up   Energy Reduction
64                    11%        8%
1024                  24%        39%

4.2 TCP/IP
To characterize the TCP/IP performance within uIP, we implemented a TCP/IP client-server application. The client ran on the desktop computer and was implemented using BSD Stream Sockets. The server was the engine and used uIP Proto-Sockets. We conducted two experiments in which the client transmits a 15KB text file to the server, to study the effect of the Receive Window Size and of the number of simultaneous connections on the performance of the engine.
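For reference, the following is a minimal sketch of the client side using BSD Stream Sockets; the IP address, port and file name are illustrative placeholders, not values from our setup:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #define ENGINE_IP   "192.168.0.2"   /* illustrative */
    #define ENGINE_PORT 1234            /* illustrative */

    int main(void)
    {
        struct sockaddr_in srv = { 0 };
        char buf[1024];
        size_t n;

        int fd = socket(AF_INET, SOCK_STREAM, 0);   /* BSD stream socket */
        srv.sin_family = AF_INET;
        srv.sin_port = htons(ENGINE_PORT);
        inet_pton(AF_INET, ENGINE_IP, &srv.sin_addr);

        if (fd < 0 || connect(fd, (struct sockaddr *)&srv, sizeof(srv)) < 0)
            return 1;

        FILE *f = fopen("file15k.txt", "rb");       /* the 15KB test file */
        if (!f)
            return 1;
        while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
            send(fd, buf, n, 0);                    /* stream to the engine */

        fclose(f);
        close(fd);
        return 0;
    }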
4.2.1 Effect of Receive Window Size
In this experiment, we measured the engine's reception throughput for a single TCP connection but with different values of the Receive Window Size. The receiver (engine) advertises the Receive Window Size in the acknowledgment, informing the transmitter of the maximum packet size it can receive. We repeatedly performed this experiment while increasing the Receive Window Size and keeping the size of the uIP Global Buffer equal to the Receive Window Size plus the size of the headers. Fig. 6 (right) shows the engine's reception throughput as a function of the Receive Window Size, annotated with the packet RTT (Round-Trip Time). As shown in the figure, the uIP reception throughput increases with the Receive Window Size, but with a down-shift once the Receive Window Size exceeds 500 bytes. This happens because, beyond 500 bytes, the size of the transmitted packet equals half the Receive Window Size unless the size of the corresponding Global Buffer is doubled.

4.2.2 Effect of Multiple Connections
In this experiment, we measured the engine's reception throughput for multiple simultaneous connections, which is the case that involves uIP Proto-Thread switching. We modified the client to request multiple connections and then transmit the same 15KB text file on each connection. We repeated the experiment while increasing the number of connections and keeping the advertised Receive Window Size equal to 500 bytes. Fig. 6 (left) shows the engine's reception throughput as a function of the number of TCP connections.

Figure 6: TCP Throughput (left: throughput in MByte/sec vs. number of simultaneous connections, 1-10, "Increasing Number of Connections"; right: throughput in MByte/sec vs. advertised window size, 100-1500 bytes, "Increasing Window Size", with measured RTTs ranging from 161 µsec to 583 µsec).

5. CONCLUSION
Network communication using the TCP/IP stack has become a widespread service in modern embedded systems. Hence, quantifying its performance and energy while emphasizing the hot-spots is a key need for supporting efficient utilization of the stack. Besides influencing system-level design decisions, these quantitative measures are needed for the cost functions used during partitioning and mapping. In this study, we presented an implementation of a protocol processing engine utilizing a minimal TCP/IP stack and a high-performance, low-power processor. We quantified the execution time required by each protocol task, as well as the percentage each task occupies of the total execution time. We estimated the energy consumed by the uIP tasks as in the real-life case, where the processor is synthesized using standard cells. We also elaborated the effects of the Receive Window Size and the number of simultaneous connections on uIP performance. Our results highlighted error detection and memory copy as the two major hot-spots for performance and energy. Using the 32-bit error detection functions, we sped up the protocol processing engine by 24% and reduced its energy consumption by 39%.
6. REFERENCES
[1] GRLIB IP Core User's Manual. Aeroflex Gaisler, 2012.
[2] M. Verderber, A. Zemva, and A. Trost. HW/SW codesign of the MPEG-2 video decoder. In International Symposium on Parallel and Distributed Processing, 2003.
[3] A. Hodjat and I. Verbauwhede. Interfacing a high speed crypto accelerator to an embedded CPU. In 38th Asilomar Conference on Signals, Systems and Computers, volume 1, pages 488-492, Nov 2004.
[4] A. A. Zarezadeh and C. Bobda. Hardware middleware for person tracking on embedded distributed smart cameras. Int. J. Reconfig. Comput., 2012:11:11-11:11, January 2012.
[5] http://www.gaisler.com
[6] LEON3 GR-XC3S-1500 Template Design. Aeroflex Gaisler, 2006.
[7] A. Dunkels. The uIP Embedded TCP/IP Stack, Reference Manual. Swedish Institute of Computer Science, 2006.
[8] A. Dunkels. Full TCP/IP for 8-bit architectures. In First International Conference on Mobile Applications, Systems and Services (MobiSys 2003), May 2003.
[9] RFC 1122: Requirements for Internet Hosts - Communication Layers, 1989.
[10] J. Fletcher. An arithmetic checksum for serial transmissions. IEEE Transactions on Communications, 30:247-252, Jan 1982.
[11] S. Penolazzi, A. Hemani, and L. Bolognino. A general approach to high-level energy and performance estimation in system-on-chip architectures. Journal of Low Power Electronics, 5(3):373-384, 2009.
[12] S. Penolazzi. A System-Level Framework for Energy and Performance Estimation in System-on-Chip Architectures. PhD thesis, KTH, Electronic Systems, 2011.