FPGA Based Application Specific Processing for Sensor Nodes

Teemu Nyländen, ∗Janne Janhunen, Jari Hannuksela, Olli Silvén
∗Centre for Wireless Communications, University of Oulu, Finland
Computer Science and Engineering Laboratory, University of Oulu, Finland
{teemu.nylanden, jari.hannuksela, olli.silven}@ee.oulu.fi, {janne.janhunen}@ee.oulu.fi

Abstract: Energy efficient sensor nodes are among the rapidly expanding applications for embedded systems technology. Typically, the processing resources in sensor nodes are based on programmable micro-controllers and digital signal processors, and the same processing architecture is used regardless of the actual task of the node. This regularly results in at least an order of magnitude over-provisioning of resources, and in higher power consumption than would be needed by tightly application specific processing solutions. Our experiments show that Flash FPGA technology enables implementing precisely provisioned processing for sensor nodes with energy efficiency that rivals off-the-shelf processor solutions. The expected competitiveness originates from savings in silicon real estate and lowered software overheads, as inherently parallel tasks can be offloaded to dedicated hardware accelerators on the same die with a microcontroller unit and radio baseband. The results pave the way for a novel type of self-powered sensor node whose processing resources are configured according to its tasks.
I. INTRODUCTION

Wireless sensor nodes (WSNs) are expected to become the backbone for ubiquitous computing, and their projected application area is constantly growing. They have been envisioned to perform tasks such as climate monitoring, electric motor supervision, belt tension sensing and vibration sensing. The environments in which the low cost, even expendable WSNs are expected to be deployed vary from burning hot steel production processes to freezing-cold outdoor environmental monitoring. The huge range of application areas and environments also defines a wide range of requirements for the WSNs that are difficult to meet with a single fixed design. These include not only the environmental specifications, but also significantly disparate computational performance demands, and low power operation, even zero-power operation relying on energy harvested from the operating environment. The impact of the requirements on the feasibility of implementation technologies is substantial.

A typical low cost wireless sensor node is illustrated in Fig. 1. A signal processor, micro-controller unit, and radio transceiver are used to implement the processing, control, and communications functionalities, respectively. In addition, the WSN has a power supply that draws energy from a battery or an energy harvesting unit. The multiple sensors and measurement channels may come from the application needs, such as disturbance cancellation to extract the signal of interest. Multiple sensors may also be employed to compensate for the mediocre performance of the low cost sensors, and even to provide for redundancy needed in simple self-diagnostics.

Fig. 1. Block diagram of a typical wireless sensor node. (Blocks shown: energy source; energy harvesting and energy storage; four sensors, each with an A/D converter; signal processor; RF uplink.)
Although important for reliability, multichannel sensing poses a challenge for signal processing, since the computing requirements are easily multiplied. This has adverse impacts on power consumption, and makes energy efficient computing solutions a prime issue of interest for system designers. Typically, the most energy efficient solutions implement as much on a single chip as possible. Along these lines, a typical application-specific integrated circuit (ASIC) System-on-Chip (SoC) attempts to target a large user community, and therefore includes support for a wide range of uses in the same fixed design that may include signal processing and communications resources. For most purposes for which these types of designs are intended, they are over-provisioned, and sacrifice energy efficiency. When compared to solutions tailored to the specific applications, the degree of over-provisioning of gate count is often an order of magnitude.

Another development challenge is the real time software needed for coping with the multiple measurement channels and other simultaneous processing needs. Often a great deal of software design and a major testing effort are needed to ensure that all the timings work as required.

In this paper we investigate the potential of using Flash field programmable gate array (FPGA) technology for implementing precisely provisioned computation resources for WSNs. This technology could allow the system designers to include all the necessary digital functionalities on a single die. In particular, minimizing off-chip communications is important for power efficiency. Although the power needs of such solutions are still higher than for gate equivalent ASIC designs, they can compete against over-provisioned platforms. Fig. 2 depicts a sensor node with a Flash FPGA circuit and an external micro-controller.
Fig. 2. A Flash FPGA based wireless sensor node (picture courtesy of VTT, Oulu, Finland).
Our design approach relies on exploiting genuine parallel processing for the measurement channels to reduce software testing and development effort. Thanks to the reconfigurable system platform, a mix of applications from signal processing to baseband processing can be supported on a single chip. The experiments show that this approach is quite competitive, although significant room for development remains. A somewhat similar Flash FPGA approach for WSNs was illustrated in [1], concentrating more on the development tools. The transport triggered architecture (TTA)-based Codesign Environment (TCE) used in our design, however, already provides such tools [2], [3].

We have focused our investigations on floating point implementations because efficient fixed-point algorithms tend to require non-portable intrinsics or in-line assembly. The IEEE 754-2008 standard includes a half-precision 16-bit format that is an attractive alternative to fixed point designs. Unfortunately, the FPGA design tools we used lack this format, so the comparisons presented here are based on 32-bit integer and 32-bit floating point arithmetic.

II. ENERGY EFFICIENCY TRADE-OFF
Scarce energy resources limit the functionalities that can be supported within a sensor node. From the power management point of view, the power consumption of the WSN is divided into three main categories: sensing, signal processing and communication [4]. Since in most cases the radio transceiver is the most energy consuming part of the WSN, the data needs to be processed within the node, so that as few bits as necessary are transmitted through the radio channel [5], [6]. Typical energy efficient radios need about 200 pJ to 1 nJ per transmitted bit [7].

In a typical case, the combination of an off-the-shelf microcontroller unit (MCU) and a digital signal processor (DSP) offers computational resources to spare, which leads to a waste of the scarce energy resources. On the other hand, the MCU on its own often does not provide sufficient processing capabilities. MCUs and DSPs implemented in a 90 nm low power CMOS process need approximately 100-150 pJ for 32-bit arithmetic operations. Compared to solutions tailored to the specific applications, the degree of over-provisioning of the gate count is often an order of magnitude.

By using the application-specific processor (ASP) design approach, the waste of computational resources, and therefore also much of the waste of energy, can be avoided without sacrificing the computational performance. The best ASP architectures need only around 5-10 pJ for 32-bit arithmetic operations on 90 nm CMOS ASIC technology. The use of ASPs as hardware accelerators provides the means to offload the computational operations from the MCU. While the data is processed in parallel in the accelerators, the decode logic is simplified and the MCU can then act only as a control unit, for example in energy management tasks.

The interesting question is whether ASPs can be implemented so efficiently that even the use of reconfigurable logic could be defended in a self-powered system. Based on our experience, when an ASP is used instead of a general purpose digital signal processor, the savings in gate count and power consumption are easily around 90 %. This could allow for the use of circuit technology that is ten times more power hungry.
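The trade-off above can be illustrated with a small calculation. The following Python sketch uses only the energy ranges quoted in this section; the function names are hypothetical, and the break-even measure (how many arithmetic operations cost as much as one transmitted bit) is an illustrative simplification rather than a model used in the study.

```python
# Energy figures quoted in the text, in picojoules, kept as (low, high) ranges.
RADIO_PJ_PER_BIT = (200.0, 1000.0)   # 200 pJ to 1 nJ per transmitted bit [7]
MCU_DSP_PJ_PER_OP = (100.0, 150.0)   # 90 nm MCU/DSP, 32-bit operation
ASP_PJ_PER_OP = (5.0, 10.0)          # best 90 nm ASP architectures


def break_even_ops_per_bit(radio_pj_per_bit: float, pj_per_op: float) -> float:
    """Number of operations that cost as much energy as transmitting one bit."""
    return radio_pj_per_bit / pj_per_op


def technology_headroom(saving_fraction: float) -> float:
    """Factor by which a technology may be more power hungry while the total
    stays unchanged, given a fractional saving in power consumption."""
    return 1.0 / (1.0 - saving_fraction)


if __name__ == "__main__":
    for name, (lo, hi) in (("MCU/DSP", MCU_DSP_PJ_PER_OP), ("ASP", ASP_PJ_PER_OP)):
        worst = break_even_ops_per_bit(RADIO_PJ_PER_BIT[0], hi)
        best = break_even_ops_per_bit(RADIO_PJ_PER_BIT[1], lo)
        print(f"{name}: {worst:.1f} to {best:.1f} operations per saved bit")
    print(f"headroom at a 90 % saving: {technology_headroom(0.9):.1f}x")
```

Under these figures an ASP can spend roughly twenty times more operations per saved bit than an MCU/DSP combination, and a 90 % saving corresponds to a tenfold headroom in circuit technology.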
Our architectural choice has been the TTA, which results in designs that feature an exposed bypass network, are close to conventional hardware accelerators, and achieve almost the same power efficiency as has been exhibited by earlier solutions. Consequently, this is an interesting option when a more power hungry circuit technology is used. The TTA processors are programmable in high-level languages, and an advanced codesign environment, TCE, is available [2]. Fig. 3 illustrates the TCE design flow.

III. TTA PROCESSOR FOR SENSOR NODE
The TTA closely resembles the very long instruction word (VLIW) architectures [8]. The programming model of the TTA defines only the data movements, and the data movements themselves trigger the actual operations. Since the programmer controls the data movements between the functional units (FUs) and register files (RFs), the programmer can define the level of parallelism. The TTA processor consists of a set of functional units and register files that are connected through an interconnection network [2]. The interconnection network consists of transport buses and input and output sockets that connect the FUs and RFs into one entity. The TTA is scalable, but adding functional units to the design also linearly increases the complexity. The number of concurrent data movements in a TTA processor is proportional to the number of buses. Due to minimal instruction encoding, both TTA and VLIW architectures suffer from poor code density, which results in long instruction words. However, dictionary compression can be applied to improve the code density, as illustrated in [9].

Fig. 3. TCE based design flow for application tailored sensor nodes. (Stages shown: algorithm; high level language (C, C++, OpenCL); processor designer; designer or automatic explorer; retargetable compiler; retargetable simulator; processor generator; program image generator; platform integrator; FPGA synthesis tools; FPGA programming file; two feedback paths.)

The TTA based signal processor for WSN used for comparisons in the current study is illustrated in Fig. 4. We limit our treatment to this power critical part of the system, which can be configured according to the application needs for each sensor. As a result, the number of manufactured hardware versions of the sensor nodes can be reduced.

Fig. 4. Block diagram of a TTA based wireless sensor node. (Blocks shown: energy source; energy harvesting and energy storage; four sensors, each with an A/D converter; a TTA with four FUs; FPGA; MCU; RF uplink.)
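To make the transport triggered programming model described in this section more concrete, the following Python sketch models a toy machine in which a program consists only of data moves and a write to a trigger port starts the operation. It is a simplified illustration under several assumptions: the class names, the port naming ("operand", "trigger", "result"), the zero-latency functional units and the "imm.N" notation for immediates are all hypothetical and are not taken from TCE.

```python
from typing import Callable, Dict, List, Tuple

Move = Tuple[str, str]  # (source, destination), e.g. ("add.result", "rf.0")


class FunctionalUnit:
    """Toy FU with one operand port, one trigger port and one result port."""

    def __init__(self, op: Callable[[int, int], int]) -> None:
        self.op = op
        self.operand = 0
        self.result = 0

    def write(self, port: str, value: int) -> None:
        if port == "operand":
            self.operand = value                        # plain latch
        elif port == "trigger":
            self.result = self.op(self.operand, value)  # the move fires the op
        else:
            raise ValueError(f"unknown port: {port}")


class TinyTTA:
    """Register file plus FUs; one instruction is a list of parallel moves."""

    def __init__(self, fus: Dict[str, FunctionalUnit], n_regs: int, n_buses: int) -> None:
        self.fus = fus
        self.rf = [0] * n_regs
        self.n_buses = n_buses

    def _read(self, src: str) -> int:
        unit, port = src.split(".")
        if unit == "imm":
            return int(port)
        if unit == "rf":
            return self.rf[int(port)]
        if port != "result":
            raise ValueError(f"cannot read port: {src}")
        return self.fus[unit].result

    def _write(self, dst: str, value: int) -> None:
        unit, port = dst.split(".")
        if unit == "rf":
            self.rf[int(port)] = value
        else:
            self.fus[unit].write(port, value)

    def step(self, moves: List[Move]) -> None:
        if len(moves) > self.n_buses:
            raise ValueError("more moves than transport buses")
        values = [self._read(src) for src, _ in moves]  # all reads happen first
        pending = list(zip(values, (dst for _, dst in moves)))
        # Operand and register writes are applied before trigger writes.
        for value, dst in sorted(pending, key=lambda item: item[1].endswith(".trigger")):
            self._write(dst, value)


def run_demo() -> int:
    """Compute (3 + 4) * 5 with two buses; the sum bypasses the register file."""
    cpu = TinyTTA({"add": FunctionalUnit(lambda a, b: a + b),
                   "mul": FunctionalUnit(lambda a, b: a * b)}, n_regs=4, n_buses=2)
    program = [
        [("imm.3", "add.operand"), ("imm.4", "add.trigger")],
        [("add.result", "mul.operand"), ("imm.5", "mul.trigger")],
        [("mul.result", "rf.0")],
    ]
    for instruction in program:
        cpu.step(instruction)
    return cpu.rf[0]
```

In this toy model the number of moves per instruction is bounded by the number of buses, and the result of the adder is moved directly to the multiplier, which mirrors the exposed bypass network mentioned in the text.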
Since the TTA provides for instruction level parallelism, the signal processing for multiple channels can be carried out on a single TTA hardware accelerator in parallel. If a general purpose DSP design were to be used, the implementation would most likely be I/O over-provisioned and the data would need to be buffered. In addition, a high frequency processor would need to be used to meet any real-time requirements and/or to keep the data buffer size reasonable.

To simplify comparisons and to enable even head-to-head comparisons to off-the-shelf processors, we implemented a straightforward algorithm with pre-processing on our TTA processor. The design consists of a polyphase decimation finite impulse response (FIR) filter, followed by a low-pass FIR filter and a simple algorithm used to extract local minima and maxima from the processed data. It should be noted that rather advanced indexed addressing is needed. The implementation was realized for both 32-bit integer and floating point arithmetic to fulfill the high dynamic range requirements of the design. Smaller word lengths could also have been used, but were not fully supported by the whole design tool chain. In addition, comparisons to programmable processors would have been made difficult.

The TTA processor was realized both with and without dictionary compression. However, there were only insignificant differences between the two implementations in terms of energy consumption. Because the implementation is quite simple, the instruction memory size was small even without instruction word compression. The implementations were therefore carried out without dictionary compression. We recognize that more specialized function units would improve the energy efficiency; however, as it stands, the design effort with the existing tool chain has been nearly the same as with a standard programmable DSP solution.

IV. IMPLEMENTATION EXPERIMENTS

The experiments were carried out using in-system programmable single-chip static random access memory (SRAM) and Flash FPGAs. High-performance programmable logic with densities of over 1 million system gates is available for both technologies. However, significant differences exist. The Flash FPGAs are live at power-up and can enter and exit ultra-low power modes in a microsecond, yet still retain SRAM and register data [10], [11]. In contrast, the SRAM FPGAs require the programming element to be loaded from an external device at every system power-up. In the worst case scenario, the programming might take hundreds of milliseconds and cause a considerable in-rush current spike.

Table I depicts the main differences between the Altera Cyclone III and the Actel Igloo used in our implementation. The Flash based Igloo uses 130 nm complementary metal oxide semiconductor (CMOS) technology and 1.2 V and 1.5 V core voltages. The SRAM based Cyclone III uses 65 nm Taiwan Semiconductor Manufacturing Company (TSMC) technology and 1.2 V core voltage. Our design was implemented on two low-power FPGAs: an SRAM-based Altera Cyclone III and a Flash-based Actel Igloo. The FPGA power and energy consumption results presented herein were derived from Altera (Quartus II) and Actel (SmartPower) development tools.
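The benchmark processing chain described at the end of Section III (polyphase decimation FIR filter, low-pass FIR filter, extraction of local minima and maxima) can be sketched in plain Python as follows. This is a minimal floating point illustration, not the TTA implementation: the coefficient values, the decimation factor and all function names are hypothetical placeholders, since the paper does not list them.

```python
from typing import List, Sequence, Tuple

# Hypothetical placeholder parameters; the paper does not give the real ones.
DECIMATION = 4
H_DEC = [0.25, 0.25, 0.25, 0.25]   # decimation filter taps (moving average)
H_LP = [0.5, 0.5]                  # low-pass filter taps


def fir(x: Sequence[float], h: Sequence[float]) -> List[float]:
    """Direct-form FIR: y[n] = sum_k h[k] * x[n - k], zero initial state."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def polyphase_decimate(x: Sequence[float], h: Sequence[float], m: int) -> List[float]:
    """Decimate by m: y[n] = sum_k h[k] * x[n*m - k], evaluated phase by phase.

    Only the retained output samples are computed, which is the point of the
    polyphase form; phase p holds the taps h[p], h[p + m], h[p + 2m], ...
    """
    phases = [list(h[p::m]) for p in range(m)]
    n_out = (len(x) + m - 1) // m
    y = []
    for n in range(n_out):
        acc = 0.0
        for p, taps in enumerate(phases):
            for j, coeff in enumerate(taps):
                idx = (n - j) * m - p          # indexed addressing into the input
                if 0 <= idx < len(x):
                    acc += coeff * x[idx]
        y.append(acc)
    return y


def local_extrema(y: Sequence[float]) -> List[Tuple[int, float, str]]:
    """Return (index, value, kind) for strict local minima and maxima."""
    found = []
    for i in range(1, len(y) - 1):
        if y[i] > y[i - 1] and y[i] > y[i + 1]:
            found.append((i, y[i], "max"))
        elif y[i] < y[i - 1] and y[i] < y[i + 1]:
            found.append((i, y[i], "min"))
    return found


def process_channel(samples: Sequence[float]) -> List[Tuple[int, float, str]]:
    """One measurement channel: decimate, low-pass filter, extract extrema."""
    decimated = polyphase_decimate(samples, H_DEC, DECIMATION)
    return local_extrema(fir(decimated, H_LP))
```

Each measurement channel would run `process_channel` independently, which is the kind of per-channel parallelism the TTA accelerator is meant to exploit.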
TABLE I. THE MAIN DIFFERENCES BETWEEN THE TWO FPGAS USED IN OUR WORK

Device               Core voltage (V)   Process        Number of LUT inputs   Embedded multipliers
Altera Cyclone III   1.2                65 nm TSMC     4                      Up to 288
Actel Igloo          1.2 - 1.5          130 nm CMOS    3                      None

TABLE II. COMPARISON OF THE POWER DISSIPATION FOR THE 32-BIT FLOATING POINT IMPLEMENTATION

Device               Core voltage (V)   Clock (MHz)   Total power dissipation (mW)   Total dynamic power dissipation (mW)
Altera Cyclone III   1.2                11            59.41                          7.45
Altera Cyclone III   1.2                20            63.92                          11.96
Actel Igloo          1.2                11            8.96                           8.90
Actel Igloo          1.5                20            25.22                          25.01

TABLE III. COMPARISON OF THE POWER DISSIPATION FOR THE 32-BIT INTEGER IMPLEMENTATION

Device               Core voltage (V)   Clock (MHz)   Total power dissipation (mW)   Total dynamic power dissipation (mW)
Altera Cyclone III   1.2                15            59.78                          7.82
Altera Cyclone III   1.2                30            67.34                          15.38
Actel Igloo          1.2                15            12.61                          12.55
Actel Igloo          1.5                30            40.42                          40.20

Fig. 5. Architecture overviews for the two FPGAs. (Altera Cyclone III: PLL, I/Os, M9K embedded memory blocks, embedded multipliers, LEs. Actel Igloo: RAM blocks, CCC, I/Os, versatiles, ISP AES decryption, user nonvolatile FlashROM, Flash*Freeze, charge pumps.)

The device architecture overview is presented in Fig. 5. The figure illustrates the main differences between the two FPGAs. The Igloo does not contain any embedded multipliers, in contrast to the Cyclone III, which can contain up to 288 embedded multipliers. There are also major differences between the logic cell composition of the two devices. The Igloo logic cell, which is called a 'versatile', is composed of a three-input lookup-table (LUT) equivalent or a D-flip-flop/latch with enable [11]. The corresponding Altera logic cell is called a logic element (LE), and it is much more complex than the Igloo versatile. The Cyclone III logic element is composed of a four-input LUT, a programmable register and a carry chain connection, and it is able to drive local, row and column interconnections. The Cyclone III logic element also supports register packing and feedback [12]. The reduced logic cell content of the Igloo, with fewer LUT inputs, results in more logic cells being needed than on the Cyclone III, which also has an impact on the critical path and therefore on the maximum operating frequency.

Table II presents the power characteristics for the SRAM-based Cyclone III, as well as the Flash-based Actel Igloo, with two different 32-bit floating point configurations. In addition, the same characteristics are presented in Table III for the 32-bit integer configuration. The clearest difference between the two FPGAs is their static power dissipation. As illustrated in Tables II and III,
the static power dissipation of the Cyclone III is at least two thirds of the total power dissipation of the device. Although the Flash-based Igloo does not quite match the dynamic power dissipation of the SRAM-based Cyclone III, its static power consumption is negligible, and the Igloo therefore achieves better power efficiency. It also has to be noted that since the Igloo does not contain any embedded multipliers, it cannot fully compete with the Cyclone III in terms of dynamic power dissipation and silicon area. The results agree with the power characteristics presented in [10] and originate from the different processes used in the devices, as well as from the different logic cell contents. The Flash based Igloo uses a 130 nm CMOS process and the SRAM based Cyclone III uses a 65 nm TSMC process, which has a significant impact on the dynamic power consumption.

On the other hand, the Actel Igloo device achieves relatively low energy consumption with the core voltage of 1.2 V. However, if a higher operating frequency is required, the core voltage has to be increased to 1.5 V. This has significant effects on the dynamic power dissipation, and therefore also on the total energy consumption of the device. The Altera Cyclone III, in turn, achieves the same maximum operating frequency with the 1.2 V core voltage, which improves its energy efficiency at higher operating frequencies.

The energy consumption of the devices is illustrated in Tables IV and V. Both FPGAs have the same maximum frequencies of 20 MHz for the floating point implementation, and 30 MHz for the integer implementation. However, the core voltage of the Igloo device has to be increased from 1.2 V to 1.5 V to achieve the higher frequency. The Igloo therefore sacrifices some energy efficiency to meet the increased real-time requirements.
Nevertheless, even with the reduced energy efficiency, the Actel Igloo consumes only about 60 % of the energy that the Cyclone III needs at the same clock frequency.
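The static power claims above follow directly from Tables II and III, because the static component is the difference between the total and the dynamic power dissipation. The Python sketch below reproduces that arithmetic; the table values are copied from the paper, while the function and variable names are hypothetical.

```python
# (device, core voltage V, clock MHz, total mW, dynamic mW) from Tables II and III
POWER_ROWS = [
    ("Altera Cyclone III", 1.2, 11, 59.41, 7.45),    # Table II, floating point
    ("Altera Cyclone III", 1.2, 20, 63.92, 11.96),
    ("Actel Igloo", 1.2, 11, 8.96, 8.90),
    ("Actel Igloo", 1.5, 20, 25.22, 25.01),
    ("Altera Cyclone III", 1.2, 15, 59.78, 7.82),    # Table III, integer
    ("Altera Cyclone III", 1.2, 30, 67.34, 15.38),
    ("Actel Igloo", 1.2, 15, 12.61, 12.55),
    ("Actel Igloo", 1.5, 30, 40.42, 40.20),
]


def static_power_mw(total_mw: float, dynamic_mw: float) -> float:
    """Static power is what remains after removing the dynamic component."""
    return total_mw - dynamic_mw


def static_share(total_mw: float, dynamic_mw: float) -> float:
    """Fraction of the total power dissipation that is static."""
    return static_power_mw(total_mw, dynamic_mw) / total_mw


if __name__ == "__main__":
    for device, volts, clock, total, dynamic in POWER_ROWS:
        print(f"{device:18s} {volts:.1f} V {clock:2d} MHz: "
              f"static {static_power_mw(total, dynamic):6.2f} mW "
              f"({100 * static_share(total, dynamic):4.1f} % of total)")
```

For every Cyclone III row the static share is above two thirds, while for every Igloo row it stays below one percent.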
The power and energy consumption figures in Tables II, III, IV and V reveal the differences between the technologies used in the two devices. Since the Igloo is designed for energy efficiency, trading off some computational capacity, it uses slower transistors than the Cyclone III, which is why it does not achieve the same maximum frequency with the lower core voltage.

TABLE IV. COMPARISON OF THE ENERGY CONSUMPTION FOR THE 32-BIT FLOATING POINT IMPLEMENTATION

Device               Core voltage (V)   Clock (MHz)   Total energy consumption (mJ)   Energy consumption per operation (pJ)
Altera Cyclone III   1.2                11            73.45                           1040
Altera Cyclone III   1.2                20            43.46                           850
Actel Igloo          1.2                11            16.62                           330
Actel Igloo          1.5                20            25.92                           510

TABLE V. COMPARISON OF THE ENERGY CONSUMPTION FOR THE 32-BIT INTEGER IMPLEMENTATION

Device               Core voltage (V)   Clock (MHz)   Total energy consumption (mJ)   Energy consumption per operation (pJ)
Altera Cyclone III   1.2                15            54.20                           1600
Altera Cyclone III   1.2                30            30.52                           900
Actel Igloo          1.2                15            11.42                           340
Actel Igloo          1.5                30            18.32                           540
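The energy figures in Tables IV and V can be related to the power figures in Tables II and III with a short calculation. The Python sketch below assumes that each total energy value is the total power dissipation multiplied by the run time of the benchmark, which the paper does not state explicitly; under that assumption the implied run time is simply energy divided by power. The function names are hypothetical.

```python
# (device, clock MHz, total power mW, total energy mJ) from Tables II to V
FLOAT_ROWS = [
    ("Altera Cyclone III", 11, 59.41, 73.45),
    ("Altera Cyclone III", 20, 63.92, 43.46),
    ("Actel Igloo", 11, 8.96, 16.62),
    ("Actel Igloo", 20, 25.22, 25.92),
]
INT_ROWS = [
    ("Altera Cyclone III", 15, 59.78, 54.20),
    ("Altera Cyclone III", 30, 67.34, 30.52),
    ("Actel Igloo", 15, 12.61, 11.42),
    ("Actel Igloo", 30, 40.42, 18.32),
]


def implied_runtime_s(energy_mj: float, power_mw: float) -> float:
    """Run time implied by E = P * t (mJ divided by mW gives seconds)."""
    return energy_mj / power_mw


def energy_ratio(rows, clock_mhz: int) -> float:
    """Igloo energy divided by Cyclone III energy at the given clock."""
    energy = {device: e for device, clock, _, e in rows if clock == clock_mhz}
    return energy["Actel Igloo"] / energy["Altera Cyclone III"]


if __name__ == "__main__":
    for label, rows in (("floating point", FLOAT_ROWS), ("integer", INT_ROWS)):
        for device, clock, power, energy in rows:
            print(f"{label:14s} {device:18s} {clock:2d} MHz: "
                  f"implied run time {implied_runtime_s(energy, power):.2f} s")
        top = max(clock for _, clock, _, _ in rows)
        print(f"{label}: Igloo/Cyclone III energy ratio at {top} MHz = "
              f"{energy_ratio(rows, top):.2f}")
```

At the maximum clock frequencies the ratio comes out at roughly 0.6 for both arithmetic formats.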
In addition, the logic cell composition differs notably between the two devices. The Igloo versatile is a lightweight counterpart to the much heavier Cyclone III logic element, and the Cyclone III also provides resources such as embedded multipliers. The difference between the required logic cells is illustrated in Tables VI and VII.

TABLE VI. COMPARISON OF THE REQUIRED LOGIC CELLS ON THE ALTERA CYCLONE III

Arithmetic                     Fixed point 32-bit   Floating point 32-bit
Combinational logic elements   4302                 5052
Dedicated logic registers      3050                 3440
Total logic elements           5346                 6385
% of total logic elements      34.7                 41.4

TABLE VII. COMPARISON OF THE REQUIRED LOGIC CELLS ON THE ACTEL IGLOO

Arithmetic                     Fixed point 32-bit   Floating point 32-bit
Combinational logic cells      8926                 10176
Sequential logic cells         3594                 3239
Total logic cells              12520                13448
% of total logic cells         50.9                 54.7
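The logic cell overhead of the Igloo can be quantified from Tables VI and VII. The Python sketch below computes how many Igloo versatiles are needed per Cyclone III logic element, and the device capacity implied by the reported utilisation percentages. The capacities are derived estimates, limited by the rounding of the percentages, and are not figures quoted in the paper; the names are hypothetical.

```python
# (total logic cells, percent of device) per arithmetic format, Tables VI and VII
CYCLONE_III = {"fixed": (5346, 34.7), "float": (6385, 41.4)}
IGLOO = {"fixed": (12520, 50.9), "float": (13448, 54.7)}


def cell_ratio(arith: str) -> float:
    """Igloo versatiles needed per Cyclone III logic element."""
    return IGLOO[arith][0] / CYCLONE_III[arith][0]


def implied_capacity(cells: int, percent: float) -> float:
    """Device capacity implied by a cell count and its utilisation share."""
    return cells / (percent / 100.0)


if __name__ == "__main__":
    for arith in ("fixed", "float"):
        print(f"{arith}: {cell_ratio(arith):.2f} versatiles per logic element")
    print(f"implied Cyclone III capacity: "
          f"{implied_capacity(*CYCLONE_III['fixed']):.0f} logic elements")
    print(f"implied Igloo capacity: "
          f"{implied_capacity(*IGLOO['fixed']):.0f} versatiles")
```

For both arithmetic formats the Igloo needs a little over twice as many logic cells as the Cyclone III.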
V. DISCUSSION

Very few WSNs perform constant data processing; on the contrary, in a typical case there are long idle periods. During these periods, the WSN should enter a low power mode to reduce unnecessary energy consumption. In addition, the WSN should be swiftly awakened from the sleep periods. The Flash*Freeze technology included in the Actel Igloo enables ultra-low power consumption while maintaining the FPGA content. The Actel Igloo can also enter and exit the low-power modes in approximately a microsecond. This feature can be used to disable the TTA accelerator during idle periods, in which case its power consumption drops to about 50 µW.

The energy efficiency of our Flash FPGA implementation could easily be improved with three simple alterations. First of all, our design was implemented using 32-bit arithmetic due to the lack of support for the half-precision floating point format in our tool chain, and to provide results comparable to conventional programmable processor designs. For the same reason we refrained from using advanced function units. This resulted in excess energy consumption and in an increased number of versatiles used on the Actel Igloo. By reducing the word length to 16 bits, using e.g. the IEEE 754-2008 16-bit half-precision floating point format, the savings in dynamic power dissipation would be about 50 %. Since over 99 % of the total power dissipation of the Igloo originates from dynamic power dissipation, the total energy consumption would therefore also decrease by about 50 %.

Secondly, the Igloo device used in our design did not contain any embedded multipliers, which resulted in reduced energy and area efficiency compared to the SRAM based FPGA that did contain multipliers. By using a Flash FPGA containing embedded multipliers, the total power dissipation could be reduced by another 50 %. It has to be noted that although the Cyclone III contained 18x18-bit embedded multipliers, they were not efficiently harnessed in our implementation due to the 32-bit word length. Naturally, a decreased word length would also improve the energy efficiency of the Cyclone III.

Finally, the Igloo used 130 nm CMOS technology, leaving space for yet another improvement.
A Flash FPGA using 65 nm technology, instead of 130 nm technology, would result in at least a 50 % improvement in the energy efficiency. In sum, the total power dissipation of the FPGA could be decreased to approximately one eighth of the total power dissipation of the Igloo by retaining the same granularity in the TTA processors’ function units. Table VIII shows our estimates on scaled and technology normalized implementations, as well as rough energy consumption estimates for 16-bit implementations. The Actel Igloo values have been scaled from 1.5 V 130 nm technology to 1.2 V 65 nm technology using (1).

$$\text{Scaled value} = \text{Unscaled value} \times \frac{65\,\text{nm}}{130\,\text{nm}} \times \frac{(1.2\,\text{V})^2}{(1.5\,\text{V})^2} \qquad (1)$$
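Equation (1) can be checked numerically with a few lines of Python. The sketch below implements the scaling factor and applies it to the Igloo energy-per-operation values of Tables IV and V; the function name and parameter names are hypothetical. Applied to the 330 pJ and 340 pJ entries, it gives values that round to the 32-bit Igloo entries of Table VIII.

```python
def scale_energy(value: float,
                 node_from_nm: float = 130.0, node_to_nm: float = 65.0,
                 v_from: float = 1.5, v_to: float = 1.2) -> float:
    """Apply Eq. (1): linear scaling with feature size, quadratic with voltage."""
    return value * (node_to_nm / node_from_nm) * (v_to ** 2) / (v_from ** 2)


if __name__ == "__main__":
    # Energy per operation (pJ) of the Actel Igloo from Tables IV and V
    for label, pj_per_op in (("floating point 32-bit", 330.0),
                             ("fixed point 32-bit", 340.0)):
        print(f"{label}: {pj_per_op:.0f} pJ/op scales to "
              f"{scale_energy(pj_per_op):.1f} pJ/op")
```

The combined scaling factor is 0.5 × 0.64 = 0.32, so the technology normalization alone accounts for roughly a threefold reduction.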
It also has to be noted that no technology scaling has been performed for the Altera Cyclone III energy efficiency values. In addition, the energy efficiency of the Actel Igloo could be improved even more by introducing embedded multipliers. For comparison, Texas Instruments TIC67x would demand about 285 pJ/FLOP assuming the same technology scaling from 130 nm to 65 nm [13]. Furthermore, the algorithm used in our study did not provide for the efficient use of special function units (SFUs). The SFUs can be used to reduce the instruction overhead and therefore reduce the processors’ power dissipation [14]. We estimate that the benefits from this move could double the energy efficiency.

TABLE VIII. ENERGY EFFICIENCY ESTIMATES SCALED TO 1.2 V 65 NM TECHNOLOGY

Arithmetic              Actel Igloo          Altera Cyclone III
Floating point 32-bit   106 pJ/op (11 MHz)   850 pJ/op (20 MHz)(*)
Fixed point 32-bit      109 pJ/op (15 MHz)   900 pJ/op (30 MHz)(*)
Floating point 16-bit   53 pJ/op (11 MHz)    425 pJ/op (20 MHz)
Fixed point 16-bit      55 pJ/op (15 MHz)    450 pJ/op (30 MHz)

(*) Unscaled values
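The benefit of the idle-period argument made at the beginning of this section can be estimated with a simple duty cycle model. The Python sketch below uses the active power of the Igloo floating point implementation at 1.2 V and 11 MHz (Table II) and the approximate 50 µW Flash*Freeze figure from the text. The duty cycle values are hypothetical examples, and the roughly one microsecond mode transition is neglected as an assumption.

```python
ACTIVE_MW = 8.96   # Actel Igloo, 1.2 V, 11 MHz, floating point (Table II)
SLEEP_MW = 0.05    # about 50 microwatts in Flash*Freeze mode


def average_power_mw(duty_cycle: float,
                     active_mw: float = ACTIVE_MW,
                     sleep_mw: float = SLEEP_MW) -> float:
    """Average power when active for a fraction duty_cycle of the time."""
    if not 0.0 <= duty_cycle <= 1.0:
        raise ValueError("duty_cycle must be between 0 and 1")
    return duty_cycle * active_mw + (1.0 - duty_cycle) * sleep_mw


if __name__ == "__main__":
    for duty in (1.0, 0.1, 0.01):   # hypothetical duty cycles
        print(f"duty cycle {100 * duty:5.1f} %: "
              f"average power {average_power_mw(duty):.3f} mW")
```

With a low duty cycle the average power approaches the sleep mode figure, which is why fast entry to and exit from the low-power mode matters for a self-powered node.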
VI. SUMMARY

The experiments show that the Flash FPGAs offer adequate energy efficiency, even with 32-bit arithmetic, provided that the Flash*Freeze technology is harnessed. The Flash based FPGA designs suffered significantly from the lack of multipliers in the circuits; however, they still achieved reasonable power efficiency, consuming about 330 pJ per operation and 19.57 nJ per processed sample with 32-bit floating point arithmetic.

The floating point design turned out not to be significantly more power hungry than the fixed point solution. It is worth noting that the use of a 32-bit integer format eliminated the need for many of the shifts that would have been needed with a 16-bit format, so the actual disparity could be even smaller. Although significant improvements can be achieved by using shorter arithmetic formats, e.g. the IEEE 754-2008 16-bit half-precision floating point format, they were not yet implemented due to the lack of support in the tool chain used.

Our Flash FPGA implementation matches and even surpasses the energy efficiency of common DSPs. The energy consumption estimates in Table VIII show that reconfigurable and programmable FPGA solutions could achieve low energy consumption in comparison to off-the-shelf DSPs.
ACKNOWLEDGEMENTS

This study was carried out in the InterSync project. The project is a part of the FIMECC research program EFFIMA. The project is also financially supported by the Technology Development Centre of Finland (TEKES) and industrial companies. Their support is gratefully acknowledged.

REFERENCES

[1] P. Völgyesi, J. Sallai, A. Lédeczi, P. Dutta, and M. Maróti, “Software development for a novel WSN platform,” in Proceedings of the 2010 ICSE Workshop on Software Engineering for Sensor Network Applications, ser. SESENA ’10. New York, NY, USA: ACM, 2010, pp. 20–25.
[2] Tampere University of Technology, “TTA codesign environment v1.4 user manual,” http://tce.cs.tut.fi/user manual/TCE.pdf.
[3] D. Tabak and G. J. Lipovski, “Move architecture in digital controllers,” IEEE Trans. Comput., vol. 29, pp. 180–190, February 1980.
[4] V. Raghunathan, S. Ganeriwal, and M. Srivastava, “Emerging techniques for long lived wireless sensor networks,” IEEE Communications Magazine, vol. 44, no. 4, pp. 108–114, April 2006.
[5] A. Willig, “Recent and emerging topics in wireless industrial communications: A selection,” IEEE Trans. Industrial Informatics, vol. 4, no. 2, pp. 102–124, 2008.
[6] M. Hempstead, M. Lyons, D. Brooks, and G.-Y. Wei, “Survey of hardware systems for wireless sensor networks,” Journal of Low Power Electronics, vol. 4, pp. 11–20, April 2008.
[7] J. Long, W. Wu, Y. Dong, Y. Zhao, M. Sanduleanu, J. Gerrits, and G. van Veenendaal, “Energy-efficient wireless front-end concepts for ultra lower power radio,” in IEEE Custom Integrated Circuits Conference (CICC 2008), Sept. 2008, pp. 587–590.
[8] J. Heikkinen, T. Rantanen, A. Cilio, J. Takala, and H. Corporaal, “Evaluating template-based instruction compression on transport triggered architectures,” in International Workshop on System-on-Chip for Real-Time Applications, 2003, p. 192.
[9] J. Heikkinen, J. Takala, and H. Corporaal, “Dictionary-based program compression on TTAs: effects on area and power consumption,” in IEEE Workshop on Signal Processing Systems Design and Implementation, November 2005, pp. 479–484.
[10] Microsemi SoC Products Group, “Flash FPGAs in the value-based market,” http://www.actel.com/documents/ValueFPGA WP.pdf.
[11] Microsemi SoC Products Group, “IGLOO low-power flash FPGAs with Flash*Freeze technology,” http://www.actel.com/documents/IGLOO DS.pdf.
[12] Altera Corporation, “Cyclone III device handbook,” http://www.altera.com/literature/hb/cyc3/cyclone3 handbook.pdf.
[13] B. Khailany, “The VLSI implementation and evaluation of area- and energy-efficient streaming media processors,” Ph.D. dissertation, Stanford University, June 2003.
[14] T. Pitkänen and J. Takala, “Low-power application-specific processor for FFT computations,” Journal of Signal Processing Systems, pp. 1–12, 2010.