Case Study on a Software Communications Architecture Component for Hardware Acceleration
Rolando Duarte, Omar Granados, Chen Liu, Jean Andrian
Department of Electrical and Computer Engineering, Florida International University, Miami, FL 33174, U.S.A
{rduar002, ogran002, cliu, andrianj}@fiu.edu
Abstract- In this paper we introduce an architecture for hardware acceleration of Software Communication Architecture (SCA) components. Intercommunication, responsiveness, and power dissipation are all essential concerns for SCA components. We therefore studied the SCA to identify its hotspot function and to optimize it for both performance and energy efficiency. We found that the amplifier function, which involves single-precision floating-point computation, takes a significant portion of the execution time. A case study on the implementation of the amplifier component on the Virtex5 Field-Programmable Gate Array (FPGA) platform is presented. We compare the hardware-accelerated amplifier implementation, which employs a floating-point unit (FPU) engine, against a pure software implementation with no FPU support. We obtained a speedup of up to 2x and an energy reduction of up to 100% (that is, the software-only implementation consumes up to twice the energy of the FPU implementation), with a power consumption increase of only 0.68%.
Keywords- Software Communication Architecture, Software Defined Radio, Energy Efficiency, Hardware Acceleration.
I.
INTRODUCTION
The vision of ubiquitous computing systems has led to a sharp increase in research efforts on the development and deployment of devices whose underlying technology is invisible to the end users. That is, such systems are designed to allow users to perform a wide range of computational tasks anytime and anywhere without being aware of the specific devices performing such tasks [1]. Because the user's environment is likely to change frequently in ubiquitous computing, one of the main requirements for this vision to become a reality is spontaneous interoperation between devices [2]. This means that as mobile devices change environment, they should be able to communicate spontaneously with any other device in their new operating environment without manual intervention. The complexity of this issue has increased considerably with the significant number of wireless protocols now available to serve as the link between computing elements.
Recently, a promising technology known as software defined radio (SDR) has gained a great deal of attention as a possible way of enabling spontaneous interoperation among devices. An SDR system is a radio in which all of the baseband, intermediate frequency (IF), and part of the radio frequency (RF) processing tasks are done exclusively through software. This characteristic makes it possible for the radio to use intelligent algorithms to modify its transmission and reception parameters depending on the changes it detects in its operating environment, thus eliminating the need for user intervention [3]. In order to deploy SDR systems in a truly ubiquitous computational environment, mobility is a requirement, and thus size, weight, and power (SWaP) are major constraints. In addition, the system should be able to support communication protocols of different levels of complexity. Therefore, two main characteristics the SDR system must have are high computational performance and low power consumption.
A promising approach that offers improvements in both areas is to apply hardware acceleration techniques to the computation-intensive components implemented under the SDR's software architecture. A general hardware-assisted approach was presented by Tang et al. [14]. However, their design mainly focused on accelerating the garbage collection function in the middleware layer, which belongs to a totally different application domain from ours. In [4], Lau et al. showed that increased execution speed could be achieved by offloading some of the signal processing functions of a general SDR system onto an FPGA-based hardware accelerator. Since the Software Communications Architecture (SCA) is the most widely accepted software platform for SDR systems, in [5] Millage showed how hardware acceleration of signal processing components implemented in the SCA can be achieved using a graphics processing unit (GPU). There, it is observed that execution time is reduced compared to that of software components running only on a general purpose processor (GPP). Although GPUs have considerable computational power because of parallelization, their use as hardware accelerators in mobile devices is not very practical because of their significantly high power consumption.
Therefore, in contrast with [5], in this work we propose a hardware acceleration platform for SCA components using an FPGA in order to achieve a balance between computational performance and energy consumption. We also present an initial feasibility study of such a platform on a specific FPGA board. Since hardware acceleration is only performed on the signal processing portion of the code that implements a waveform component, the resulting platform is still compliant with SCA standards; thus it can be readily deployed for current military and commercial systems without backward compatibility issues.
The rest of the paper is organized as follows. In section II, an overview of software defined radio and the Software Communications Architecture (SCA) is presented. Section III introduces the proposed architecture using hardware acceleration and describes the specific FPGA platform that will be used to implement it. A study on the energy consumption and performance of an SCA component implemented with and without floating point unit support is presented in section IV. Lastly, concluding remarks are given in section V.
II.
BACKGROUND
a) Software Defined Radio
Software defined radio (SDR) refers to a radio system where the majority of the transmission and reception-path components are realized through software. The purpose behind such a system is to allow the modification of radio waveforms and protocols with no or minimal hardware changes. That is, alterations to parameters such as modulation format, transmission power, or transmission frequency can be done entirely through software. The block diagram of the main hardware (HW) components of an SDR system is presented in Fig. 1. In an ideal SDR system, most of the parameters of each hardware element should be reconfigurable through software in order to support a large variety of wireless protocols [6]. It can be observed that for this type of platform to be implemented, enabling technologies such as baseband processors, analog-to-digital and digital-to-analog converters (ADC and DAC), reconfigurable components of the radio frequency (RF) front end, as well as antenna elements, all need to be properly chosen to comply with the requirements of the waveforms they are aimed to support.
Fig. 1 Block diagram of SDR System (blocks: RF Module, ADC/DAC, IF and Baseband Processing)
Selecting the right software architecture for an SDR system is almost as important as the choice of hardware platform. Such an architecture should provide portability among different hardware platforms. Moreover, in order to facilitate its adoption in commercial applications, the SDR system's software performance bottleneck should be minimized. That is, one has to ensure that the baseband signal processing components and the communication protocols between them are implemented as efficiently as possible. The benefits of doing so are reflected in lower computational intensity and lower energy consumption. Although many SDR software architectures have emerged recently, including GNURadio, OMG's PIM and PSM for Software Radio Components Specification [7], and NASA's Space Telecommunications Radio System (STRS) architecture [8], the so-called software communications architecture (SCA) still remains the de facto standard of the SDR community [9]. Therefore, in this work our focus is on the SCA.
b) The Software Communications Architecture
The software communications architecture (SCA) was developed as the standard software framework for the Department of Defense's (DoD) military radio development program known as the Joint Tactical Radio System (JTRS). The introduction of this standard was mainly aimed at facilitating technology insertion, hardware abstraction, and hardware interoperability for SDR systems [10]. The structure of the SCA is presented in Fig. 2. Here, the elements labeled C1 through C4 represent the waveform components. In the SDR context, the function of each of these components is to perform a specific signal processing task within a waveform. In the SCA, all the services and interfaces available to each component are defined in the SCA core framework. From the figure it can be seen that waveform components interact with each other and with the core framework through the logical bus provided by the Common Object Request Broker Architecture (CORBA). One of the crucial features of this architecture is that it allows different applications (waveforms or waveform components) to run on distributed processing units and to communicate with each other. This is particularly useful when a single processing unit does not satisfy the computational requirements of complex waveforms.
Fig. 2 Software Communications Architecture Structure [10]
III.
SYSTEM ARCHITECTURE
a) Proposed System Architecture for Hardware Acceleration
To take advantage of the benefits obtained through the hardware acceleration of software components while still adhering to the SCA standard, only the code of the signal processing routine of each waveform component is accelerated in our proposed architecture, as shown in Fig. 3. That is, all the core framework interfaces and CORBA services of these accelerated components remain unchanged, allowing the component to be initialized and located by other waveform components. However, the signal processing portion of the code is replaced by a subroutine which calls the hardware-accelerated implementation. One implication of this process is that data stored in CORBA-type variables must be converted and loaded into the memory location of the hardware accelerator used to implement the signal processing routine. In Fig. 3, the component to be accelerated is denoted as C3 and the accelerated signal processing routine as C3_HW.
Fig. 3 Proposed SCA Architecture with Hardware Acceleration
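The component structure described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the class `AmplifierComponent`, the method `process`, and the functions `amplify_sw` and `amplify_hw` are hypothetical names that do not come from the SCA specification or from any SCA implementation, a plain Python sequence stands in for a CORBA sequence, and the "hardware" routine is simulated in software. A simple gain stage is used as the signal processing routine, anticipating the amplifier case study of section IV.

```python
from array import array
from typing import Callable, List, Sequence, Tuple

IQ = Tuple[List[float], List[float]]
Routine = Callable[[array, array, float, float], IQ]


def amplify_sw(i_in: array, q_in: array, i_gain: float, q_gain: float) -> IQ:
    """Software reference routine: scale each integer sample by a float gain."""
    return [x * i_gain for x in i_in], [x * q_gain for x in q_in]


def amplify_hw(i_in: array, q_in: array, i_gain: float, q_gain: float) -> IQ:
    """Stand-in for the accelerated routine (C3_HW in Fig. 3).

    Assumption: a real system would hand the buffers to the accelerator here;
    this sketch simply reuses the software routine.
    """
    return amplify_sw(i_in, q_in, i_gain, q_gain)


class AmplifierComponent:
    """Hypothetical waveform component (C3): its interface never changes,
    only the signal processing routine behind it is swapped."""

    def __init__(self, i_gain: float, q_gain: float,
                 routine: Routine = amplify_sw) -> None:
        self.i_gain = i_gain
        self.q_gain = q_gain
        self._routine = routine

    def process(self, i_seq: Sequence[int], q_seq: Sequence[int]) -> IQ:
        # Convert the middleware-level sequences into contiguous native
        # integer buffers, mimicking the conversion step before the data is
        # loaded into the accelerator's memory.
        i_buf = array("i", i_seq)
        q_buf = array("i", q_seq)
        return self._routine(i_buf, q_buf, self.i_gain, self.q_gain)


software = AmplifierComponent(2.0, 0.5)
accelerated = AmplifierComponent(2.0, 0.5, routine=amplify_hw)
```

Because both instances expose the same `process` interface, other components would not need to know which routine is in use, which is the property the proposed architecture relies on.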
Since the proposed system is to be implemented using an FPGA board, in the following subsection a thorough description of the characteristics of the hardware platform in question is provided.
Fig. 4 System Architecture with FPU Support
b) Hardware Platform
We used the Virtex5 FPGA platform [11] to construct the system architecture shown in Fig. 4. FPGAs are well suited to SDR design because they are reconfigurable and reprogrammable. MicroBlaze [12] is a soft-core 32-bit RISC microprocessor operating at a clock rate of 125 MHz. It can be configured to have a 5-stage pipeline (performance-optimized) or a 3-stage pipeline (area-optimized), a barrel shifter, an integer multiplier/divider, floating point operation, and cache support. For our design, we elected to use the 5-stage pipeline and floating point unit (FPU) support for the IEEE single precision format.
MicroBlaze can access the local memory, denoted as Block RAM (BRAM), through the Local Memory Bus (LMB). It accesses instructions and data through the Instruction Local Memory Bus (ILMB) and the Data Local Memory Bus (DLMB), respectively. Every instruction or data access through the LMB takes 1 clock cycle, and the LMB can only connect the processor to the local BRAM. In contrast, the Processor Local Bus (PLB) takes about 7 clock cycles per access; however, it allows any peripheral or custom hardware to connect to it. The PLB is bidirectional and 32 bits wide.
We also employ a timer to collect timing information, such as the number of clock cycles a specified operation takes on MicroBlaze or on any other peripheral connected through the PLB. Interrupts are used so that, if MicroBlaze needs to perform an urgent operation, it can suspend the current operation to serve the urgent one. RS232_UART is used for displaying the output on the host machine. The FPGA platform is a memory-mapped system and every component is assigned a memory range. Therefore, if MicroBlaze needs to access any peripheral, it only needs to provide the address location and/or data. MicroBlaze addresses are 32 bits wide, so it can address memory locations up to 0xFFFFFFFF.
IV.
EXPERIMENTAL RESULTS
a) Experimental Setup
For the purpose of this study, a waveform implemented in an SCA-based software architecture known as OSSIE was initially profiled. It was observed that one of the components on which the processor spent the most time (i.e., a higher percentage of samples were taken during its execution) was the IQ amplifier. This was taken as an indication that accelerating this component might lead to a significant performance improvement, as it is frequently accessed by the processor. Therefore, our initial feasibility study consisted of executing the amplifier function from the SCA on MicroBlaze without the FPU and comparing it against the case with FPU support. The amplifier function accepts seven parameters and multiplies the in-phase input (I_in_0) by its gain (I_gain) and the quadrature input (Q_in_0) by its gain (Q_gain). I_gain and Q_gain are floating-point values, while I_in_0 and Q_in_0 are integer vectors.
b) Results
We executed the amplifier function on MicroBlaze with and without FPU support on data sets from 128 Bytes to 64 Kbytes (the maximum capacity of the local memory), doubling the data set size each time. The clock cycle, execution time, and energy consumption readings are presented in Table I (with FPU support) and Table II (without FPU support). We first enabled the FPU support and ran the amplifier with a data set of 128 Bytes. From Table I, we can see that it takes 17842 clock cycles to perform the floating-point multiplication of the 128-Byte data set. The microprocessor runs at 125 MHz. Given that the execution time equals the number of clock cycles multiplied by the clock cycle time, the execution time we obtained is 0.1427 ms. From Table II, we can observe that without FPU support it takes 32651 clock cycles to perform the floating-point multiplication of the 128-Byte data set, for an execution time of 0.2612 ms.
Next, we utilized XPower Analyzer [13] to obtain the power consumption of the system. XPower Analyzer is a Xilinx tool dedicated to analyzing the power consumption of a post-implemented, placed and routed design. It collects information about the hardware design and provides an accurate estimate of its power utilization. Using XPower Analyzer, the overall power consumption of the hardware system shown in Fig. 4 can be estimated. Our results show that with FPU support the total power consumption is 1.2650 Watts, of which 1.1105 Watts is static power and 0.1545 Watts is dynamic power. If we disable the FPU, the static power is 1.1102 Watts and the dynamic power is 0.1463 Watts, for a total power consumption of 1.2565 Watts. Adding floating point unit support requires more hardware and therefore consumes more power. Based on our measurements, the hardware overhead incurred by the FPU corresponds to a 0.68% increase in power consumption. The system architecture with FPU support requires 415 slice registers used as flip-flops and 771 look-up tables (LUTs), which constitute 21% more flip-flops and 40% more LUTs, respectively, than the system that does not use the FPU engine.
Since we have the power consumption readings, we can calculate the energy consumed during the execution period. The 128-Byte data set has an energy consumption of 0.1805 mJoules with FPU support, as shown in Table I. In Table II, the energy consumed without FPU support for the same data set size is 0.3270 mJoules.
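The arithmetic behind these readings can be retraced with a short Python sketch. The cycle counts, clock rate, and power figures are the ones quoted in the text; the function names are ours and purely illustrative. Only the FPU energy is recomputed here, and small differences in the last digit are expected from rounding of intermediate values.

```python
CLOCK_HZ = 125e6  # MicroBlaze clock rate stated in the text


def exec_time_ms(cycles: int, clock_hz: float = CLOCK_HZ) -> float:
    """Execution time = number of clock cycles x clock period, in milliseconds."""
    return cycles / clock_hz * 1e3


def energy_mj(power_w: float, time_ms: float) -> float:
    """Energy = power x time; Watts times milliseconds gives millijoules."""
    return power_w * time_ms


def percent_increase(new: float, old: float) -> float:
    """Relative increase of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


# Static + dynamic power reported by XPower Analyzer (Watts)
P_FPU = 1.1105 + 0.1545
P_NO_FPU = 1.1102 + 0.1463

t_fpu = exec_time_ms(17842)     # 128-Byte data set, with FPU
t_no_fpu = exec_time_ms(32651)  # 128-Byte data set, without FPU
e_fpu = energy_mj(P_FPU, t_fpu)
power_overhead = percent_increase(P_FPU, P_NO_FPU)

print(f"time with FPU:    {t_fpu:.4f} ms")
print(f"time without FPU: {t_no_fpu:.4f} ms")
print(f"energy with FPU:  {e_fpu:.4f} mJ")
print(f"power overhead:   {power_overhead:.2f} %")
```

Under these inputs the times round to 0.1427 ms and 0.2612 ms and the power overhead to 0.68%, in line with the values reported above.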
Fig. 5 Performance of FPU over no FPU Support (x-axis: Input Size (Byte), 128 to 64K; y-axis: Performance, 1.70 to 2.05)
# Bytes   Clk Cycles (10^6)   Exec Time (ms)   Energy (mJoules)   uJoules / Byte
128       0.0178              0.1427           0.1805             1.4106
256       0.0345              0.2799           0.3541             1.3834
512       0.0658              0.5484           0.6937             1.3549
1K        0.1341              1.0731           1.3576             1.3257
2K        0.2622              2.0983           2.6544             1.2961
4K        0.5124              4.0996           5.1862             1.2661
8K        1.0005              8.0041           10.1255            1.2360
16K       1.9521              15.6166          19.7557            1.2057
32K       3.8061              30.4487          38.5188            1.1755
64K       7.4158              59.3265          75.0504            1.1451
Table I - MicroBlaze with FPU Support
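As a cross-check of the last column of Table I, the following Python sketch recomputes the energy per byte from the energy column and normalizes it to the 128-Byte case, which is the quantity plotted in Fig. 7. The energy values are copied from Table I; the variable and function names are illustrative only, and the recomputed figures agree with the table to about three decimal places because the table entries are themselves rounded.

```python
# Energy column of Table I (mJoules), keyed by data set size in bytes
TABLE_I_ENERGY_MJ = {
    128: 0.1805,
    256: 0.3541,
    512: 0.6937,
    1024: 1.3576,
    2048: 2.6544,
    4096: 5.1862,
    8192: 10.1255,
    16384: 19.7557,
    32768: 38.5188,
    65536: 75.0504,
}


def uj_per_byte(energy_mj: float, n_bytes: int) -> float:
    """Convert millijoules to microjoules and divide by the data set size."""
    return energy_mj * 1e3 / n_bytes


per_byte = {n: uj_per_byte(e, n) for n, e in TABLE_I_ENERGY_MJ.items()}

# Normalize to the smallest data set, as done for Fig. 7
normalized = {n: v / per_byte[128] for n, v in per_byte.items()}

for n in sorted(per_byte):
    print(f"{n:6d} B  {per_byte[n]:.4f} uJ/B  normalized {normalized[n]:.3f}")
```

The per-byte energy decreases monotonically with the data set size, and the 64-Kbyte case comes out at roughly 0.81 of the 128-Byte case.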
Fig. 6 Energy Reduction of FPU over no FPU Support (x-axis: Input Size (Byte), 128 to 64K; y-axis: Energy Reduction %, 70.00 to 105.00)
# Bytes   Clk Cycles (10^6)   Exec Time (ms)   Energy (mJoules)   uJoules / Byte
128       0.0326              0.2612           0.3270             2.5546
256       0.0645              0.5164           0.6465             2.5256
512       0.1275              1.0207           1.2778             2.4957
1K        0.2521              2.0168           2.5248             2.4656
2K        0.4984              3.9875           4.9919             2.4374
4K        0.9855              7.8845           9.8705             2.4097
8K        1.9475              15.1504          19.5048            2.3809
16K       3.8469              30.7755          38.5272            2.3515
32K       7.5965              60.7725          76.0798            2.3217
64K       14.9975             119.9802         150.2007           2.2918
Table II - MicroBlaze without FPU Support
Fig. 7 Normalized Energy per Byte Consumption with FPU Support (x-axis: Input Size (Byte), 128 to 64K; y-axis: Normalized Energy per Byte, 7.00E-01 to 1.05E+00)
Executing the amplifier function with FPU support is 1.83x faster than without FPU support for an input size of 128 Bytes, as shown in Fig. 5. As we repeatedly double the input data set, the performance increases slightly, reaching up to 2.02x. Here, performance is the ratio of the execution time without FPU support to the execution time with FPU support. The increase is due to the fact that the data size is directly proportional to the number of floating-point operations. With a bigger input data size the microprocessor performs more floating-point calculations, each of which takes less time on the FPU than in pure software, where floating-point operations are emulated (translated into integer operations).
Fig. 6 shows the energy reduction of using the FPU over no FPU support, expressed as the additional energy consumed without the FPU relative to the energy consumed with it. From the figure we can observe that a data set of 128 Bytes results in an energy reduction of 81%. Recall that, for a given power, energy is proportional to the total execution time. Since the performance increases slightly when we increase the data size, we should expect the energy reduction to increase accordingly. This expectation matches the readings in Fig. 6: increasing the data size yields a larger energy reduction, up to 100% for the 64-Kbyte data set, which means that the execution without FPU support consumes about twice the energy of the execution with the FPU.
Fig. 7 shows the energy consumed per byte with FPU support, normalized to the 128-Byte input case. We can see that as the input data size increases, the energy consumed per unit of data decreases slightly, down to about 80% of the 128-Byte value for an input data set size of 64 Kbytes.
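The figures quoted in this discussion can be recomputed from the first and last rows of Table I and Table II. The Python sketch below is a worked check, not part of the original measurement flow; the function names are ours. It also reports the absolute energy saving, which is a different normalization (relative to the no-FPU energy) that we add for comparison and that is not plotted in the paper.

```python
def speedup(t_no_fpu: float, t_fpu: float) -> float:
    """Execution time without FPU divided by execution time with FPU."""
    return t_no_fpu / t_fpu


def extra_energy_percent(e_no_fpu: float, e_fpu: float) -> float:
    """Additional energy used without the FPU, relative to the FPU case
    (the quantity that matches the percentages quoted for Fig. 6)."""
    return (e_no_fpu - e_fpu) / e_fpu * 100.0


def saving_percent(e_no_fpu: float, e_fpu: float) -> float:
    """Energy saved by the FPU, relative to the no-FPU case (added for comparison)."""
    return (e_no_fpu - e_fpu) / e_no_fpu * 100.0


# (exec time in ms, energy in mJoules) from Table I (FPU) and Table II (no FPU)
FPU = {128: (0.1427, 0.1805), 65536: (59.3265, 75.0504)}
NO_FPU = {128: (0.2612, 0.3270), 65536: (119.9802, 150.2007)}

results = {}
for size in (128, 65536):
    t_fpu, e_fpu = FPU[size]
    t_no, e_no = NO_FPU[size]
    results[size] = (
        speedup(t_no, t_fpu),
        extra_energy_percent(e_no, e_fpu),
        saving_percent(e_no, e_fpu),
    )
    s, extra, saved = results[size]
    print(f"{size:6d} B  speedup {s:.2f}x  extra energy {extra:.0f}%  saving {saved:.0f}%")
```

With these inputs the speedup goes from about 1.83x to 2.02x and the relative energy figure from about 81% to 100%, while the absolute saving at 64 Kbytes is about 50%, consistent with the FPU version using half the energy.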
V.
CONCLUSION
In this paper, a hardware acceleration technique for Software Communication Architecture (SCA) components using an FPGA is presented. Through this platform, the computational and power efficiency of the accelerated SCA components can be improved without creating compatibility issues with their software-based counterparts. First, profiling an SCA waveform allowed us to identify the amplifier component as a viable candidate for a feasibility study on the implementation of such a component on an FPGA board. We then observed that the amplifier function performs a large number of floating-point operations. We therefore configured the MicroBlaze general-purpose processor with FPU support. We were able to accelerate the execution by up to 2x and to achieve an energy reduction of up to 100% for an input set of 64 Kbytes (the execution without FPU support consumes about twice the energy), with an increase in power consumption of 0.68%. For future work, we plan to construct a corresponding coprocessor that will take the role of the amplifier function, in order to compare the effectiveness of our proposed design against the two techniques employed in this study: pure software execution with and without floating point unit support embedded in the microprocessor. We will also look into implementing other SCA components as hardware accelerators to form a heterogeneous hardware acceleration platform for software defined radio systems.
REFERENCES
[1]. Jingde Cheng, "Testing and Debugging Persistent Computing Systems: A New Challenge in Ubiquitous Computing," Embedded and Ubiquitous Computing, 2008. EUC '08. IEEE/IFIP International Conference on, vol. 1, pp. 408-414, 17-20 Dec. 2008.
[2]. Kindberg, T.; Fox, A., "System software for ubiquitous computing," Pervasive Computing, IEEE, vol. 1, no. 1, pp. 70-81, Jan-Mar 2002.
[3]. Shukla, A.; Burbidge, E.; Usman, I., "Cognitive Radios - What are They and Why are the Military and Civil Users Interested in Them," Antennas and Propagation, 2007. EuCAP 2007. The Second European Conference on, pp. 1-10, 11-16 Nov. 2007.
[4]. D. Lau, J. Blackburn, and C. Jenkins, "Using c-to-hardware acceleration in FPGAs for waveform baseband processing," in 2006 Software Defined Radio Technical Conf. Product Exposition, Nov. 2006.
[5]. Millage, J.G., "GPU Integration into a Software Defined Radio Framework," Master's Thesis, Iowa State University, 2010.
[6]. Haghighat, A., "A review on essentials and technical challenges of software defined radio," MILCOM 2002. Proceedings, vol. 1, pp. 377-382, 7-10 Oct. 2002.
[7]. Gailliard, G.; Nicollet, E.; Sarlotte, M.; Verdier, F., "Transaction Level Modelling of SCA Compliant Software Defined Radio Waveforms and Platforms PIM/PSM," Design, Automation & Test in Europe Conference & Exhibition, 2007. DATE '07, pp. 1-6, 16-20 April 2007.
[8]. Reinhart, R.C.; Johnson, S.K.; Kacpura, T.J.; Hall, C.S.; Smith, C.R.; Liebetreu, J., "Open Architecture Standard for NASA's Software-Defined Space Telecommunications Radio Systems," Proceedings of the IEEE, vol. 95, no. 10, pp. 1986-1993, Oct. 2007.
[9]. P. Isomäki, N. Avessta, An overview of software defined radio technologies, Tech. Rep. 652, Turku Centre for Computer Science, Finland, 2004.
[10]. Aguayo Gonzalez, C.R.; Dietrich, C.B.; Reed, J.H., "Understanding the software communications architecture," Communications Magazine, IEEE, vol. 47, no. 9, pp. 50-57, September 2009.
[11]. Virtex-5 FPGA User Guide. [Online]. Available: http://www.xilinx.com/support/documentation/user_guides/ug190.pdf
[12]. MicroBlaze Processor Reference Guide. [Online]. Available: http://www.xilinx.com/support/documentation/sw_manuals/mb_ref_guide.pdf
[13]. Xilinx Xpower Analyzer. [Online]. Available: http://www.xilinx.com/products/design_tools/logic_design/verification/xpower_an.htm
[14]. Jie Tang, Shaoshan Liu, Zhimin Gu, Xiao-Feng Li, and Jean-Luc Gaudiot, "Hardware-Assisted Middleware: Acceleration of Garbage Collection Operations," Proceedings of the 21st IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP 2010), Rennes, France, July 7-9, 2010.