EURASIP Journal on Embedded Systems

Design and Architectures for Signal and Image Processing Guest Editors: Markus Rupp, Ahmet T. Erdogan, and Bertrand Granado


Copyright © 2009 Hindawi Publishing Corporation. All rights reserved. This is a special issue published in volume 2009 of “EURASIP Journal on Embedded Systems.” All articles are open access articles distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Editor-in-Chief Zoran Salcic, University of Auckland, New Zealand

Associate Editors
Sandro Bartolini, Italy
Neil Bergmann, Australia
Shuvra Bhattacharyya, USA
Ed Brinksma, The Netherlands
Paul Caspi, France
Liang-Gee Chen, Taiwan
Dietmar Dietrich, Austria
Stephen A. Edwards, USA
Alain Girault, France
Rajesh K. Gupta, USA
Thomas Kaiser, Germany
Bart Kienhuis, The Netherlands
Chong-Min Kyung, Korea
Miriam Leeser, USA
John McAllister, UK
Koji Nakano, Japan
Antonio Nunez, Spain
Sri Parameswaran, Australia
Zebo Peng, Sweden
Marco Platzner, Germany
Marc Pouzet, France
S. Ramesh, India
Partha S. Roop, New Zealand
Markus Rupp, Austria
Asim Smailagic, USA
Leonel Sousa, Portugal
Jarmo Henrik Takala, Finland
Jean-Pierre Talpin, France
Jürgen Teich, Germany
Dongsheng Wang, China

Contents

Design and Architectures for Signal and Image Processing, Markus Rupp, Ahmet T. Erdogan, and Bertrand Granado, Volume 2009, Article ID 674308, 3 pages
Multicore Software-Defined Radio Architecture for GNSS Receiver Signal Processing, Heikki Hurskainen, Jussi Raasakka, Tapani Ahonen, and Jari Nurmi, Volume 2009, Article ID 543720, 10 pages
An Open Framework for Rapid Prototyping of Signal Processing Applications, Maxime Pelcat, Jonathan Piat, Matthieu Wipliez, Slaheddine Aridhi, and Jean-François Nezan, Volume 2009, Article ID 598529, 13 pages
Run-Time HW/SW Scheduling of Data Flow Applications on Reconfigurable Architectures, Fakhreddine Ghaffari, Benoit Miramond, and François Verdier, Volume 2009, Article ID 976296, 13 pages
Techniques and Architectures for Hazard-Free Semi-Parallel Decoding of LDPC Codes, Massimo Rovini, Giuseppe Gentile, Francesco Rossi, and Luca Fanucci, Volume 2009, Article ID 723465, 15 pages
Comments on "Techniques and Architectures for Hazard-Free Semi-Parallel Decoding of LDPC Codes", Kiran K. Gunnam, Gwan S. Choi, and Mark B. Yeary, Volume 2009, Article ID 704174, 3 pages
Reply to "Comments on Techniques and Architectures for Hazard-Free Semi-Parallel Decoding of LDPC Codes", Massimo Rovini, Giuseppe Gentile, Francesco Rossi, and Luca Fanucci, Volume 2009, Article ID 635895, 2 pages
OLLAF: A Fine Grained Dynamically Reconfigurable Architecture for OS Support, Samuel Garcia and Bertrand Granado, Volume 2009, Article ID 574716, 11 pages
Trade-Off Exploration for Target Tracking Application in a Customized Multiprocessor Architecture, Jehangir Khan, Smail Niar, Mazen A. R. Saghir, Yassin El-Hillali, and Atika Rivenq-Menhaj, Volume 2009, Article ID 175043, 21 pages
A Prototyping Virtual Socket System-On-Platform Architecture with a Novel ACQPPS Motion Estimator for H.264 Video Encoding Applications, Yifeng Qiu and Wael Badawy, Volume 2009, Article ID 105979, 20 pages
FPSoC-Based Architecture for a Fast Motion Estimation Algorithm in H.264/AVC, Obianuju Ndili and Tokunbo Ogunfunmi, Volume 2009, Article ID 893897, 16 pages
FPGA Accelerator for Wavelet-Based Automated Global Image Registration, Baofeng Li, Yong Dou, Haifang Zhou, and Xingming Zhou, Volume 2009, Article ID 162078, 10 pages
A System for an Accurate 3D Reconstruction in Video Endoscopy Capsule, Anthony Kolar, Olivier Romain, Jade Ayoub, David Faura, Sylvain Viateur, Bertrand Granado, and Tarik Graba, Volume 2009, Article ID 716317, 15 pages
Performance Evaluation of UML2-Modeled Embedded Streaming Applications with System-Level Simulation, Tero Arpinen, Erno Salminen, Timo D. Hämäläinen, and Marko Hännikäinen, Volume 2009, Article ID 826296, 16 pages
Cascade Boosting-Based Object Detection from High-Level Description to Hardware Implementation, K. Khattab, J. Dubois, and J. Miteran, Volume 2009, Article ID 235032, 12 pages
Very Low-Memory Wavelet Compression Architecture Using Strip-Based Processing for Implementation in Wireless Sensor Networks, Li Wern Chew, Wai Chong Chia, Li-minn Ang, and Kah Phooi Seng, Volume 2009, Article ID 479281, 16 pages
Data Cache-Energy and Throughput Models: Design Exploration for Embedded Processors, Muhammad Yasir Qadri and Klaus D. McDonald-Maier, Volume 2009, Article ID 725438, 7 pages
Hardware Architecture for Pattern Recognition in Gamma-Ray Experiment, Sonia Khatchadourian, Jean-Christophe Prévotet, and Lounis Kessal, Volume 2009, Article ID 737689, 15 pages
Evaluation and Design Space Exploration of a Time-Division Multiplexed NoC on FPGA for Image Analysis Applications, Linlin Zhang, Virginie Fresse, Mohammed Khalid, Dominique Houzet, and Anne-Claire Legrand, Volume 2009, Article ID 542035, 15 pages
Efficient Processing of a Rainfall Simulation Watershed on an FPGA-Based Architecture with Fast Access to Neighbourhood Pixels, Lee Seng Yeong, Christopher Wing Hong Ngau, Li-Minn Ang, and Kah Phooi Seng, Volume 2009, Article ID 318654, 19 pages

Hindawi Publishing Corporation EURASIP Journal on Embedded Systems Volume 2009, Article ID 674308, 3 pages doi:10.1155/2009/674308

Editorial

Design and Architectures for Signal and Image Processing

Markus Rupp (EURASIP Member),1 Ahmet T. Erdogan,2 and Bertrand Granado3

1 Institute of Communications and Radio-Frequency Engineering (INTHFT), Vienna University of Technology, 1040 Vienna, Austria
2 The School of Engineering and Electronics, The University of Edinburgh, Edinburgh EH9 3JL, UK
3 ENSEA, Cergy-Pontoise University, Boulevard du Port, 95011 Cergy-Pontoise Cedex, France

Correspondence should be addressed to Markus Rupp, [email protected]

Received 8 December 2009; Accepted 8 December 2009

Copyright © 2009 Markus Rupp et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This Special Issue of the EURASIP Journal on Embedded Systems is intended to present innovative methods, tools, design methodologies, and frameworks for the algorithm-architecture matching approach in the design flow, including system-level design and hardware/software codesign, RTOS, system modeling and rapid prototyping, system synthesis, design verification, and performance analysis and estimation.

Today, typical sequential design flows are in use and they are reaching their limits due to:

(i) The complexity of today's systems designed with the emerging submicron technologies for integrated circuit manufacturing;
(ii) The intense pressure on the design cycle time in order to reach a shorter time-to-market and reduce development and production costs;
(iii) The strict performance constraints that have to be reached in the end, typically low and/or guaranteed application execution time, integrated circuit area, and overall system power dissipation.

Because in such a design methodology the system is seen as a whole, this special issue also covers the following topics:

(i) New and emerging architectures: SoC, MPSoC, configurable computing (ASIPs), and (dynamically) reconfigurable systems using FPGAs;
(ii) Smart sensors: audio and image sensors for high performance and energy efficiency;
(iii) Applications: automotive, medical, multimedia, telecommunications, ambient intelligence, object recognition, and cryptography;

(iv) Resource management techniques for real-time operating systems in a codesign framework;
(v) Systems and architectures for real-time image processing;
(vi) Formal models, transformations, and architectures for reliable embedded system design.

We received 30 submissions, of which we eventually accepted 17 for publication.

The paper entitled "Multicore software defined radio architecture for GNSS receiver signal processing" by H. Hurskainen et al. describes a multicore Software-Defined Radio (SDR) architecture for Global Navigation Satellite System (GNSS) receiver implementation. Three GNSS SDR architectures are discussed: (1) a hardware-based SDR that is feasible for embedded devices but relatively expensive, (2) a pure SDR approach that has a high level of flexibility and a low bill of materials but is not yet suited for handheld applications, and (3) a novel architecture that uses a programmable array of multiple processing cores and exhibits both flexibility and potential for mobile devices.

The paper entitled "An open framework for rapid prototyping of signal processing applications" by M. Pelcat et al. presents an open-source Eclipse-based framework which aims to facilitate the exploration and development processes in this context. The framework includes a generic graph editor (Graphiti), a graph transformation library (SDF4J), and an automatic mapper/scheduler tool with simulation and code generation capabilities (PREESM). The input of the framework is composed of a scenario description and two graphs: one graph describes an algorithm and the second describes an architecture. As an example, a prototype

for a 3GPP long-term evolution (LTE) algorithm on a multicore digital signal processor is built, illustrating both the features and the capabilities of this framework.

The paper entitled "Run-time HW/SW scheduling of data flow applications on reconfigurable architectures" by F. Ghaffari et al. presents an efficient dynamic and run-time hardware/software scheduling approach. This scheduling heuristic maps the different tasks of a highly dynamic application online in such a way that the total execution time is minimized. The scheduling method is applied to several image processing applications. The presented experiments include simulation and synthesis results on a Virtex V-based platform, and these results show better performance than existing methods.

The paper entitled "Techniques and architectures for hazard-free semiparallel decoding of LDPC codes" by M. Rovini et al. describes three different techniques to properly reschedule the decoding updates, based on the careful insertion of "idle" cycles, to prevent the hazards of the pipeline mechanism in LDPC decoding. Alongside these, different semiparallel architectures of a layered LDPC decoder suitable for use with such techniques are analyzed. Taking the LDPC codes for the wireless local area network (IEEE 802.11n) as a case study, a detailed analysis of the performance attained with the proposed techniques and architectures is reported, and results of the logic synthesis on a 65 nm low-power CMOS technology are shown.

The paper entitled "OLLAF: a fine grained dynamically reconfigurable architecture for OS support" by S. Garcia and B. Granado presents OLLAF, a fine grained dynamically reconfigurable architecture (FGDRA) specially designed to efficiently support an OS. The studies presented here show the contribution of this architecture in terms of hardware context management and preemption support, as well as the gain that can be obtained, by using OLLAF instead of a classical FPGA, in terms of context management and preemption overhead.

The paper entitled "Trade-off exploration for target tracking application in a customized multiprocessor architecture" by J. Khan et al. presents the design of an FPGA-based multiprocessor system-on-chip (MPSoC) architecture optimized for multiple target tracking (MTT) in automotive applications. The paper explains how the MTT application is designed and profiled to partition it among different processors. It also explains how different optimizations were applied to customize the individual processor cores to their assigned tasks and to assess their impact on performance and FPGA resource utilization, resulting in a complete MTT application running on an optimized MPSoC architecture that fits in a contemporary medium-sized FPGA and meets the real-time constraints of the given application.

The paper entitled "A prototyping virtual socket system-on-platform architecture with a novel ACQPPS motion estimator for H.264 video encoding applications" by Y. Qiu and W. M. Badawy presents a novel adaptive crossed quarter polar pattern search (ACQPPS) algorithm proposed to realize an enhanced inter prediction for H.264. Moreover, an efficient prototyping system-on-platform architecture is also presented, which can be utilized for a realization of

an H.264 baseline profile encoder with the support of the integrated ACQPPS motion estimator and related video IP accelerators. The implementation results show that the ACQPPS motion estimator can achieve very high estimated image quality, comparable to that of the full search method in terms of peak signal-to-noise ratio (PSNR), while keeping the complexity at an extremely low level.

The paper entitled "FPSoC-based architecture for a fast motion estimation algorithm in H.264/AVC" by O. Ndili and T. Ogunfunmi presents an architecture based on a modified hybrid fast motion estimation (FME) algorithm. The presented results show that the modified hybrid FME algorithm outperforms previous state-of-the-art FME algorithms, while its losses compared with full search motion estimation (FSME), in terms of PSNR performance and computation time, are insignificant.

The paper entitled "FPGA accelerator for wavelet-based automated global image registration" by B. Li et al. presents an architecture for wavelet-based automated global image registration (WAGIR), which is fundamental for most remote sensing image processing algorithms and extremely computation intensive. The authors propose a block wavelet-based automated global image registration (BWAGIR) architecture based on a block resampling scheme. The architecture with 1 processing unit outperforms the CL cluster system with 1 node by at least 7.4X, and the MPM massively parallel machine with 1 node by at least 3.4X. The BWAGIR with 5 units achieves a speedup of about 3X against the CL with 16 nodes, and a speed comparable with the MPM with 30 nodes.

The paper entitled "A system for an accurate 3D reconstruction in video endoscopy capsule" by A. Kolar et al. presents the hardware and software development of a wireless multispectral vision sensor which allows transmitting a 3D reconstruction of a scene in real time. The paper also presents a method to acquire the images at a 25 frames/s video rate with a discrimination between the texture and the projected pattern. This method uses an energetic approach, a pulsed projector, and an original 64 × 64 CMOS image sensor with programmable integration time. Multiple images are taken with different integration times to obtain an image of the pattern which is more energetic than the background texture. Also presented is a 3D reconstruction processing that allows a precise and real-time reconstruction. This processing, which is specifically designed for an integrated sensor and its integration in an FPGA-like device, has a low power consumption compatible with a VCE examination. The paper presents experimental results with the realization of a large-scale demonstrator using an SOPC prototyping board.

The paper entitled "Performance evaluation of UML2-modeled embedded streaming applications with system-level simulation" by T. Arpinen et al. presents an efficient method to capture an abstract performance model of a streaming-data real-time embedded system (RTES). This method uses an MDA (model-driven architecture) approach. The goal of the performance modeling and simulation is to achieve early estimates of PE, memory, and on-chip network utilization, task response times, and other information that is used

for design-space exploration. UML2 is used for performance model specification. The application workload modeling is carried out using UML2 activity diagrams. The platform is described with structural UML2 diagrams and model elements annotated with performance values. The focus here is on modeling streaming-data applications. It is characteristic of streaming applications that a long sequence of data items flows through a stable set of computation steps (tasks), with only occasional control messaging and branching.

The paper entitled "Cascade boosting-based object detection from high-level description to hardware implementation" by K. Khattab et al. presents an implementation of boosting-based object detection algorithms, which are considered the fastest accurate object detection algorithms today, although their implementation in a real-time solution is still a challenge. A new parallel architecture, which exploits the parallelism and the pipelining in these algorithms, is proposed. The method to develop this architecture was based on a high-level SystemC description. SystemC enables PC simulation that allows simple and fast testing and leaves the structure open to any kind of hardware or software implementation, since SystemC is independent of all platforms.

The paper entitled "Very low memory wavelet compression architecture using strip-based processing for implementation in wireless sensor networks" by L. W. Chew et al. presents a hardware architecture for strip-based image compression using the SPIHT algorithm. The lifting-based 5/3 DWT, which supports a lossless transformation, is used in the proposed work. The wavelet coefficients output from the DWT module are stored in a strip buffer at predefined locations using a new 1D addressing method for SPIHT coding. In addition, a proposed modification of the traditional SPIHT algorithm is also presented. In order to improve the coding performance, a degree-0 zerotree coding methodology is applied during the implementation of SPIHT coding. To facilitate the hardware implementation, the proposed SPIHT coding eliminates the use of lists in its set-partitioning approach and is implemented in two passes. The proposed modification reduces both the memory requirement and the complexity of the hardware coder.

The paper entitled "Data cache-energy and throughput models: design exploration for embedded processors" by M. Y. Qadri and K. D. McDonald-Maier proposes cache-energy models that strive to provide a complete application-based analysis. As a result, they could facilitate the tuning of a cache and an application for a given power budget. The models presented in this paper are an improved extension of energy and throughput models for a data cache, in terms of the leakage energy, which is indicated for the entire processor rather than simply the cache on its own. The energy model covers the per-cycle energy consumption of the processor. The leakage energy statistics of the processor in the data sheet cover the cache and all peripherals of the chip. The models are also improved in terms of the refinement of the miss rate, which has been split into two terms: a read miss rate and a write miss rate. This was done as the read energy and write energy components correspond to the respective miss rate contributions of the cache. The model-based approach

presented was used to predict the processor's performance with sufficient accuracy. An example application for design exploration, which could facilitate the identification of an optimal cache configuration and code profile for a target application, was discussed.

The paper entitled "Hardware architecture for pattern recognition in gamma-ray experiment" by S. Khatchadourian et al. presents an intelligent way of triggering data in the HESS (high energy stereoscopic system) phase II experiment. The system relies on the utilization of image processing algorithms in order to increase the trigger efficiency. The proposed trigger scheme is based on a neural system that extracts the interesting features of the incoming images and rejects the background more efficiently than classical solutions. The paper presents the basic principles of the algorithms as well as their hardware implementation in FPGAs.

The paper entitled "Evaluation and design space exploration of a time-division multiplexed NoC on FPGA for image analysis applications" by L. Zhang et al. presents an adaptable fat-tree NoC architecture for field-programmable gate arrays (FPGAs) designed for image analysis applications. The authors propose a dedicated communication architecture for image analysis algorithms. This communication mechanism is a generic NoC infrastructure dedicated to dataflow image processing applications, mixing circuit-switching and packet-switching communications. The complete architecture integrates two dedicated communication architectures and reusable IP blocks. Communications are based on the NoC concept to support the high bandwidth required for a large number and variety of data. For data communication inside the architecture, an efficient time-division multiplexed (TDM) architecture is proposed. This NoC uses a fat-tree (FT) topology with virtual channels (VC) and flit packet-switching with fixed routes. Two versions of the NoC are presented in this paper. The results of their implementations and their design space exploration (DSE) on an Altera Stratix II are analyzed, compared with a point-to-point communication, and illustrated with a multispectral image application.

The paper entitled "Efficient processing of a rainfall simulation watershed on an FPGA-based architecture with fast access to neighbourhood pixels" by L. S. Yeong et al. describes a hardware architecture to implement the watershed algorithm using rainfall simulation. The speed of the architecture is increased by utilizing a multiple-memory-bank approach to allow parallel access to the neighbourhood pixel values. In a single read cycle, the architecture is able to obtain all five values of the center and four neighbors for a 4-connectivity watershed transform. The proposed rainfall watershed architecture consists of two parts. The first part performs the arrowing operation and the second part assigns each pixel to its associated catchment basin.

Markus Rupp
Ahmet T. Erdogan
Bertrand Granado

Hindawi Publishing Corporation EURASIP Journal on Embedded Systems Volume 2009, Article ID 543720, 10 pages doi:10.1155/2009/543720

Research Article

Multicore Software-Defined Radio Architecture for GNSS Receiver Signal Processing

Heikki Hurskainen, Jussi Raasakka, Tapani Ahonen, and Jari Nurmi

Department of Computer Systems, Tampere University of Technology, P.O. Box 553, 33101 Tampere, Finland

Correspondence should be addressed to Heikki Hurskainen, [email protected]

Received 27 February 2009; Revised 22 May 2009; Accepted 30 June 2009

Recommended by Markus Rupp

We describe a multicore Software-Defined Radio (SDR) architecture for Global Navigation Satellite System (GNSS) receiver implementation. A GNSS receiver picks up very low-power signals from multiple satellites and then uses dedicated processing to demodulate and measure the exact timing of these signals, from which the user's position, velocity, and time (PVT) can be estimated. Three GNSS SDR architectures are discussed: (1) a hardware-based SDR that is feasible for embedded devices but relatively expensive, (2) a pure SDR approach that has a high level of flexibility and a low bill of materials but is not yet suited for handheld applications, and (3) a novel architecture that uses a programmable array of multiple processing cores and exhibits both flexibility and potential for mobile devices. We present the CRISP project, in which the multicore architecture will be realized, along with a numerical analysis of the application requirements for the platform's processing cores and network payload.

Copyright © 2009 Heikki Hurskainen et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

Global navigation has been a challenge to mankind for centuries. In the modern world, however, it has become easier with the help of Global Navigation Satellite Systems (GNSSs). The NAVSTAR Global Positioning System (GPS) [1] has been the most famous implementation of GNSS and the only fully operational system available to civilian users, although this situation is changing. Galileo [2] is emerging as a competitor and complement to GPS, as both are satellite navigation systems based on Code Division Multiple Access (CDMA) techniques. CDMA is a technique that allows multiple transmitters to use the same carrier simultaneously by multiplying the transmitted signal by pseudorandom noise (PRN) codes. The PRN code rate is higher than the data symbol rate, which spreads the energy of a data symbol over a wider bandwidth. The PRN codes are unique to each transmitter, so a transmitter can be identified at reception by correlating the received signal with a replica of its PRN code. The Russian GLONASS system, originally based on Frequency Division Multiple Access (FDMA), is adding a

CDMA feature to the system with the GLONASS-K satellites [3]. China has also shown interest in implementing its own system, called Compass, during the following decade [4]. The GPS modernization program [5] introduces additional signals with new codes and modulations. The realization of the new navigation systems and the modernization of GPS produce updates and upgrades to the system specifications.

Besides changing specifications, GNSS is also facing challenges from an environmental point of view. Reflections in difficult environments cause multipath effects that make it harder to determine the exact signal timing crucial for navigation algorithms. Research on multipath mitigation algorithms is active, since accurate navigation capability in environments with heavy multipath is desired. Along with interference issues, multipath mitigation is also one of the biggest drivers for the introduction of new GNSS signal modulations.

Designing a true GNSS receiver is not a trivial task. A true GNSS receiver should be reconfigurable and flexible in design, so that the possibilities of new specifications and algorithms can be exploited, and its price should be low enough to enable mass market penetration.
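The CDMA identification principle described in the introduction can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the "PRN codes" below are random ±1 sequences rather than real Gold codes, and the noise level is arbitrary.

```python
# Toy CDMA despreading: identify a transmitter by correlating the received
# signal with replica PRN codes. Codes and noise level are made up.
import numpy as np

rng = np.random.default_rng(0)
CODE_LEN = 1023  # GPS C/A codes are 1023 chips long

# Stand-in "PRN codes": random +/-1 sequences (real C/A codes come from
# Gold-code generators, not used here).
codes = {sv: rng.choice([-1.0, 1.0], CODE_LEN) for sv in range(1, 5)}

# Transmitter 3 sends one data symbol d = -1, spread by its PRN code.
d = -1.0
received = d * codes[3] + 0.5 * rng.standard_normal(CODE_LEN)  # noisy channel

# The receiver correlates with each candidate replica; the matching code
# yields a large normalized correlation magnitude, the others stay near zero.
corr = {sv: abs(np.dot(received, c)) / CODE_LEN for sv, c in codes.items()}
best = max(corr, key=corr.get)
print(best)  # expected: 3
```

Because the code rate exceeds the symbol rate, the despreading gain (here the factor CODE_LEN) is what lets the matching correlator stand out above the noise and the other codes.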


2. GNSS Principles and Challenges

2.1. Navigation and Signal Processing. Navigation can be performed when four or more satellites are visible to the receiver. The pseudoranges from the receiver to the satellites and the navigation data (containing ephemeris parameters) are needed [1, 6, 7]. When pseudoranges (ρ) are measured by the receiver, they can be used to solve the unknowns, the user's location (x, y, z)_u and clock bias b_u, with known positions of the satellites (x, y, z)_i. The relation between pseudorange, satellite position, and user position is given by

\rho_i = \sqrt{(x_i - x_u)^2 + (y_i - y_u)^2 + (z_i - z_u)^2} + b_u.    (1)
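Equation (1) can be solved for the four unknowns by iterative least squares. The sketch below is a hedged illustration, not the paper's method: the satellite positions, user position, and clock bias are made-up example values, and the measurements are noiseless. Each Jacobian row is the negated unit line-of-sight vector plus a 1 for the clock-bias term.

```python
# Gauss-Newton solve of Equation (1) for user position and clock bias.
# All numeric values are illustrative, not real ephemeris data.
import numpy as np

sats = np.array([  # four example satellite positions (meters)
    [15600e3,  7540e3, 20140e3],
    [18760e3,  2750e3, 18610e3],
    [17610e3, 14630e3, 13480e3],
    [19170e3,   610e3, 18390e3],
])
true_user = np.array([1111e3, 2222e3, 3333e3])
true_bias = 1000.0  # clock bias expressed in meters

ranges = np.linalg.norm(sats - true_user, axis=1)
rho = ranges + true_bias  # noiseless pseudoranges per Equation (1)

x = np.zeros(4)  # initial guess: (x, y, z, b) = 0
for _ in range(10):
    r = np.linalg.norm(sats - x[:3], axis=1)
    pred = r + x[3]
    # Jacobian: negated unit line-of-sight vectors, plus 1 for the bias term
    H = np.hstack([-(sats - x[:3]) / r[:, None], np.ones((4, 1))])
    dx, *_ = np.linalg.lstsq(H, rho - pred, rcond=None)
    x += dx

print(np.round(x))  # converges to the true position and bias
```

With more than four satellites the same loop performs an overdetermined least-squares fit, which is how real receivers exploit extra measurements.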

The transmitted signal contains low-rate navigation data (50 Hz for the GPS Standard Positioning Service (SPS)), a repeating PRN code sequence (1023 chips at 1.023 MHz for GPS SPS), and a high-rate carrier (GPS SPS is transmitted in the L1 band, which is centered at 1575.42 MHz) [1]. For the Galileo E1 Open Service (OS) and the future GPS L1C, it also contains a Multiplexed Binary Offset Carrier (MBOC) modulation [8, 9]. These signal components are illustrated in Figure 1.

The signal processing for GNSS can be divided into analog and digital parts. Since the carrier frequencies of GNSS are high (>1 GHz), it is impossible to perform digital signal processing on them directly. In the analog part of the receiver, called the radio front-end, the received signal is amplified, filtered, downconverted, and finally quantized and sampled into digital format.

The digital signal processing part (i.e., baseband processing) has two major tasks. First, the Doppler frequencies and code phases of the satellites need to be acquired. The details of the acquisition process are well explained in the literature, for example, [1, 7]. There are a number of ways to implement acquisition, with parallel methods being faster than serial ones at the cost of consuming more resources. The parallel methods can be applied either as convolution in the time domain (matched filters) or as multiplication in the frequency domain (using FFT and IFFT). Second, after successful acquisition, the signals found are tracked. In tracking, the frequency and phase of the receiver are continuously fine-tuned to keep receiving the acquired signals. Also, the GNSS data is demodulated and the precise timing is formed from the signal phase measurements. A detailed description of the tracking process can be found, for example, in [1, 7]. The principles for data demodulation are also illustrated in Figure 1.

2.2. Design Challenges of GNSS. The environment we are living in is constantly changing in topographic, geometric, economic, and political ways. These changes are driving the GNSS evolution. Besides new systems (e.g., Galileo, Compass), the existing ones (i.e., GPS, GLONASS) are being modernized. This leads to a constantly evolving field of specifications, which may increase frustration and uncertainty among receiver designers and manufacturers.

The signal spectrum of future GNSS signals is growing with the new systems. Currently, GPS L1 (centered at 1575.42 MHz) is the only commercially exploited GNSS frequency band. The Galileo system's E1 OS signal will share the same band. Another common band of future GPS and Galileo signals will be centered at 1176.45 MHz (GPS L5 and Galileo E5a). The GPS modernization program is also activating the L2 frequency band (centered at 1227.60 MHz) for civilian use by implementing the L2C (L2 Civil) signal [10]. This band has already been assigned for navigation use, but only for authorized users via the GPS Precise Positioning Service (PPS) [1].

To improve signal code tracking and multipath performance, the new Binary Offset Carrier (BOC) modulation was originally introduced as the baseline for Galileo and modern GPS L1 signal development [11]. A later agreement between European and US GNSS authorities further specified the usage of Multiplexed BOC (MBOC) modulation in both systems. In MBOC modulation, two different binary subcarriers are added to the signal, either in time-multiplexed mode (TMBOC) or summed together with predefined weighting factors as Composite BOC (CBOC) [8, 9, 12].

Like any other wireless communication, satellite navigation also suffers from multipath in environments prone to it (e.g., urban canyons, indoors). The problem caused by multipath is even bigger in navigation than in communication, since precise timing also needs to be resolved. The field of multipath mitigation is actively researched, and new algorithms and architectures are presented frequently, for example, in [13–15].

Besides GNSS, there are also other wireless communication technologies that are developing rapidly, and the direction of development is towards multipurpose low-cost receivers (user handsets) with enhanced capabilities [16].
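The CBOC combination mentioned above can be sketched numerically. This is an assumption-laden illustration, not a specification-grade generator: the 10/11 vs. 1/11 power split between the BOC(1,1) and BOC(6,1) subcarriers follows the MBOC definition cited in the text, while the sampling rate, chip count, and stand-in code are arbitrary choices.

```python
# Illustrative CBOC-like waveform: a PRN chip stream multiplied by a weighted
# sum of sine-phased BOC(1,1) and BOC(6,1) square-wave subcarriers.
import numpy as np

SAMPLES_PER_CHIP = 12  # divisible by 12 so both subcarriers fit exactly

def boc_subcarrier(m, n_chips, sps=SAMPLES_PER_CHIP):
    # Sine-phased BOC(m,1): 2*m half-cycles of a square wave per code chip.
    # Sampling at t + 0.5 avoids landing exactly on a zero crossing.
    t = np.arange(n_chips * sps)
    return np.sign(np.sin(2 * np.pi * m * (t + 0.5) / sps))

# Stand-in PRN chips (random +/-1, not a real code), held for a whole chip
chips = np.repeat(np.random.default_rng(2).choice([-1.0, 1.0], 4),
                  SAMPLES_PER_CHIP)

# CBOC: power split 10/11 on BOC(1,1) and 1/11 on BOC(6,1)
cboc = chips * (np.sqrt(10 / 11) * boc_subcarrier(1, 4)
                + np.sqrt(1 / 11) * boc_subcarrier(6, 4))
```

Because the two square-wave subcarriers are orthogonal over a chip, the weighted sum keeps unit average power while placing a small fraction of it at higher frequencies, which is what sharpens the correlation peak for code tracking.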

3. Overview of SDR GNSS Architectures

In this section we present three architectures for a Software-Defined Radio (SDR) GNSS receiver. A simplified definition of SDR is given in [17]: "Radio in which some or all of the physical layer functions are software defined." The root SDR architecture was presented in [18]. Figure 2 illustrates an example of GNSS receiver functions mapped onto this canonical architecture. Only the reception part of the architecture is presented, since current GNSS receivers do not transmit.

Radio Frequency (RF) conversion handles the signal processing before digitization. The Intermediate Frequency (IF) processing block transfers the frequency of the received signal from IF to baseband and may also take care of Doppler removal in GNSS. The baseband processing segment handles the accurate timing and demodulation, thus enabling the construction of the navigation data bits. The division into IF and baseband sections can vary depending on the chosen solution, since the complex envelope of the received signal can also be handled at baseband. The desired navigation output (Position, Velocity, and Time (PVT)) is solved in the last steps of the GNSS receiver chain.
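The IF-to-baseband and correlation steps just described can be sketched as follows. This is a hedged toy, not any receiver's actual signal path: the sampling rate, intermediate frequency, and stand-in code are invented, the signal is noiseless, and Doppler is assumed to be already removed.

```python
# Toy "carrier wipeoff" and "code correlation": the digitized IF signal is
# mixed with a complex replica carrier, then correlated with a replica PRN
# code. All numbers are illustrative.
import numpy as np

rng = np.random.default_rng(3)
FS = 4.092e6   # sampling rate (Hz), illustrative
F_IF = 1.0e6   # intermediate frequency after downconversion (Hz), illustrative
N = 4092       # one code period's worth of samples (1 ms at FS)

code = rng.choice([-1.0, 1.0], 1023)             # stand-in PRN code
code_samples = code[np.arange(N) * 1023 // N]    # code resampled at FS

t = np.arange(N) / FS
received = code_samples * np.cos(2 * np.pi * F_IF * t)  # noiseless IF signal

# Carrier wipeoff: multiply by a complex replica carrier at the IF
baseband = received * np.exp(-2j * np.pi * F_IF * t)

# Code correlation: normalized prompt correlator output
prompt = np.dot(baseband, code_samples) / N
print(abs(prompt))  # close to 0.5 for an aligned, noiseless real IF signal
```

The factor 0.5 comes from mixing a real cosine with a complex exponential; tracking loops would steer the replica carrier and code so that this prompt output stays maximal.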

Figure 1: Principles for GNSS signal modulation in transmission and demodulation in reception.

Figure 2: Canonical SDR architecture adapted to GNSS, modified from [18].

Current state-of-the-art mass market receivers are based on a chipset or single-chip receiver [19], usually implemented as an Application-Specific Integrated Circuit (ASIC). ASICs have high Non-Recurring Engineering (NRE) costs, but when produced in high volumes they have a very low price per unit. ASICs can also be optimized for small size and low power consumption, both desirable features in handheld, battery-operated devices. On the other hand, ASICs are fixed solutions and impossible to reconfigure, and design modifications are very expensive to realize in ASIC technology. This approach has proven successful in mass market receivers because of its price and power consumption advantages, although it may not hold its position with the growing demand for flexibility and shortened time to market.

3.1. Hardware Accelerated SDR Receiver Architecture. The first SDR receiver architecture discussed in this paper is the approach where the most demanding parts of the receiver are implemented on a reconfigurable hardware platform, usually a Field Programmable Gate Array (FPGA) programmed with a Hardware Description Language (HDL). This architecture, comprising a radio front-end circuit, reconfigurable baseband hardware, and navigation software, is well known and presented in numerous publications, for example, [16, 20–22]. FPGAs have proved suitable for performing GNSS signal processing functions [23]. The building blocks of hardware accelerated SDR receivers are illustrated in Figure 3.

In this architecture the RF conversion is performed by an analog radio, whose last step transforms the signal from analog to digital format. IF processing and baseband functionalities are performed in accelerating hardware. The source, PVT in the GNSS case, is constructed in navigation processing. The big advantage of reconfigurable FPGAs over ASIC technologies is the savings in design, NRE, and mask costs due to a shorter development cycle. The risk is also smaller with FPGAs, since possible bugs in the design can be fixed by later upgrades. On the other hand, FPGAs have a much higher unit price and power consumption. A true GNSS receiver poses some implementation challenges. The specifications are designed to be compatible (i.e., the systems do not interfere with each other too much), but true interoperability is reached at the receiver level. One example of an interoperable design challenge is the selection of the number of correlators and their spacing for tracking, since different modulations place different requirements on the correlator structure.

3.1.1. Challenges with the Radio Front End. Although the focus of this paper is mainly on baseband functions, the radio should not be forgotten. The block diagram of a GNSS single-frequency radio front end is given on the left-hand side of Figure 3. In the radio, the received signal is first amplified with the Low Noise Amplifier (LNA) and then, after necessary filtering, downconverted to a low IF, for example, 4 MHz [24]. The signal is converted to digital format after downconversion.
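After the A/D stage the remaining IF is removed digitally (the carrier wipeoff step of Figure 2). A minimal sketch of that mixing step, using an assumed 16.368 MHz sampling rate (the front-end output rate used later in this paper) and the 4 MHz IF mentioned above; low-pass filtering of the double-frequency image is omitted for brevity:

```python
import math

def carrier_wipeoff(samples, fs, f_if):
    """Mix digitized IF samples down to baseband with a local carrier
    replica, returning in-phase and quadrature components."""
    i_out, q_out = [], []
    for n, s in enumerate(samples):
        phase = 2 * math.pi * f_if * n / fs
        i_out.append(s * math.cos(phase))
        q_out.append(s * -math.sin(phase))
    return i_out, q_out

fs, f_if = 16.368e6, 4.0e6
# Feed in a pure IF tone: after mixing, the averaged I branch
# recovers ~0.5 and the Q branch averages to ~0.
carrier = [math.cos(2 * math.pi * f_if * n / fs) for n in range(4096)]
i_bb, q_bb = carrier_wipeoff(carrier, fs, f_if)
```

In a GNSS receiver the same multiplication also removes the Doppler offset, which is why the local carrier is steered by an NCO rather than fixed at the nominal IF.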

Figure 3: Hardware accelerated baseband architecture. From left to right: analog radio part, reconfigurable baseband hardware, and navigation software running on a GPP.

The challenges for GNSS radio design come from the increasing number of frequency bands. To call a receiver a true GNSS receiver, and to get the best performance, more than one frequency band should be processed by the radio front end. Dual- and/or multifrequency receivers are likely choices for future receivers, and thus it is important to study potential architectures [25]. Another challenge comes from the increased bandwidth of the new signals, which makes the radio more vulnerable to interference. For mass market consumer products, the radio design must also meet certain price and power consumption requirements; only solutions with a reasonable price and power consumption will survive.

3.1.2. Baseband Processing. The fundamental signal processing for GNSS was presented in Figure 1. The carrier and code removal processes are illustrated in more detail in Figure 4. The incoming signal is divided into in-phase and quadrature-phase components by multiplying it with locally generated sine and cosine waves. Both phases are then correlated in identical branches with several closely delayed versions (for GPS: early, prompt, and late) of the locally generated PRN code [1]. The results are then integrated and fed to discriminator computation and a feedback filter. Numerically Controlled Oscillators (NCOs) are used to steer the local replicas. An example of the different needs of the new GNSS signals is the addition of two correlator fingers (the bump-jumping algorithm) required by Galileo BOC modulation [26]. In Figure 4 the additional correlator components needed for Galileo tracking are marked with a darker shade. For the most part, the GPS and Galileo signals in the L1 band use the same components. The main difference is that, due to the BOC family of modulations, Galileo needs additional correlators, very-early (VE) and very-late (VL), to remove the uncertainty in main peak location estimation [27]. The increasing number of correlators is reflected in increased complexity, measured by the number of transistors in the final design [13].
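The in-phase correlator branch can be sketched in a few lines (an illustrative model with whole-sample delays and a toy random code standing in for a real PRN sequence; real receivers use fractional-chip spacing, integrate over full code periods, and add the VE/VL fingers for BOC):

```python
import random

def correlate(signal, code, delay):
    """Correlate the incoming signal with a replica code shifted by
    'delay' samples (circular shift for this sketch)."""
    n = len(signal)
    return sum(signal[i] * code[(i - delay) % n] for i in range(n))

random.seed(1)
prn = [random.choice((-1, 1)) for _ in range(1023)]   # toy PRN sequence
rx = [prn[(i - 5) % 1023] for i in range(1023)]       # signal delayed 5 samples

# Early, prompt, and late correlators; the spacing here is one sample
early = correlate(rx, prn, 4)
prompt = correlate(rx, prn, 5)
late = correlate(rx, prn, 6)

# Normalized early-minus-late discriminator steers the code NCO;
# it is ~0 when the prompt replica is aligned, as here.
disc = (early - late) / (2 * prompt)
```

With the prompt replica aligned, the prompt correlator reaches the full code length while early and late fall near the code's low off-peak autocorrelation, which is exactly the peak shape the discriminator exploits.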

The level of hardware acceleration depends on the selected algorithms. Acquisition is needed rarely compared to tracking and is thus more suitable for a software implementation. FFT-based algorithms are also easier to implement in software, since hardware description languages usually lack direct support for floating-point arithmetic. Tracking, on the other hand, is a process consisting mostly of multiplications and accumulations on relatively small word lengths. What makes it more suitable for hardware implementation is that the number of these relatively simple computations is high, with a real-time deadline.

Figure 4: GPS/Galileo tracking channel.

3.2. Ideal SDR GNSS Receiver Architecture. The ideal SDR is characterized by assigning all functions after the analog radio to a single processor [18]. In the ideal case all hardware problems are turned into software problems. A fundamental block diagram of a software receiver is illustrated in Figure 5 [28]. The architecture of the radio front end is the same as that illustrated in Figure 3. After the radio, the digitized signals are fed to buffers for software usage. All of the digital signal processing, acquisition, and tracking functions are then performed in software.

Figure 5: Software receiver architecture. On the left-hand side: analog radio part; on the right-hand side: baseband and navigation implemented as software running on a GPP.

In the literature, for example, [28, 29], the justification and reasoning for SDR GNSS is strongly attributed to the well-known Moore's law, which states that the capacity of integrated circuits doubles every 18–24 months [30]. Ideal SDR solutions should become feasible if and when the available processing power increases. Currently reported SDR GPS receiver implementations work in real time only if the clock speed of the processor ranges from 900 MHz [31] to 3 GHz [29], which is too high for mobile devices but not, for example, for a laptop PC. In recent years the availability of GNSS radio front ends with USB has improved, making the implementation of a pure software receiver on a PC platform quite straightforward.

The area where pure software receivers have already made a breakthrough is postprocessing applications. Postprocessing with software receivers allows fast algorithm prototyping and signal analysis. Typical postprocessing applications are ionospheric monitoring, geodetic applications, and other scientific applications [21, 32]. Software is definitely more flexible than hardware when compared in terms of time to market, bill of materials, and reconfigurable implementation. But with a required clock frequency of around 1 GHz or more, the generated heat and battery life will be an issue for small handheld devices.

3.3. SDR with Multiple Cores. What about having an array of reconfigurable cores for baseband processing? In a multicore architecture baseband processing is divided among multiple processing cores. This reduces the required clock frequency to a range achievable by embedded devices and provides an increased level of parallelism, which also eases the workload per processing unit. An example of a GNSS receiver architecture with a reconfigurable baseband is illustrated in Figure 6. In this example one of the four cores acts as an acquisition engine and the remaining three perform tracking functions. A fixed set of cores is not desirable, since the need for acquisition and tracking varies over time. For example, when the receiver is turned on, all cores should be performing acquisition to guarantee the fastest possible Time To First Fix (TTFF). After satellites have been found, more of the acquisition cores are moved to the tracking task. If (and when) manufactured in large volumes, the (properly scaled) array of processing cores can eventually be implemented in an ASIC circuit. This lowers the per-unit price and makes this solution more appealing for mass markets, while still being reconfigurable and having a high degree of flexibility. In the next section we present one future realization of this architecture.
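A core-allocation policy of this kind can be sketched as a toy model (the policy and the function below are purely illustrative, not part of the CRISP design):

```python
def allocate_cores(total_cores, tracked_satellites, cores_per_satellite=1):
    """Toy policy: cores not needed for tracking keep acquiring.
    At power-on everything acquires (fastest TTFF); as satellites
    are found, cores migrate to the tracking task."""
    tracking = min(total_cores - 1,
                   tracked_satellites * cores_per_satellite)
    acquisition = total_cores - tracking
    return {"acquisition": acquisition, "tracking": tracking}

# Cold start: all nine cores of a 9-core array would search satellites
print(allocate_cores(9, 0))   # {'acquisition': 9, 'tracking': 0}
# After four satellites are found, four cores track, the rest acquire
print(allocate_cores(9, 4))   # {'acquisition': 5, 'tracking': 4}
```

Keeping at least one core on acquisition even in steady state mirrors the need to pick up rising satellites while the others track.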

Figure 6: Software reconfigurable baseband receiver architecture. From left: analog radio part, baseband implemented on an array of reconfigurable cores, and navigation software running on a GPP.

4. CRISP Platform

Cutting edge Reconfigurable ICs for Stream Processing (CRISP) [33] is a project in the Framework Programme 7 (FP7) of the European Union (EU). The objectives of CRISP are to research the optimal utilization, efficient programming, and dependability of a reconfigurable multiprocessor platform for streaming applications. The CRISP consortium is a good mixture of academic and industrial know-how, with partners Recore (NL), University of Twente (NL), Atmel (DE), Thales Netherlands (NL), Tampere University of Technology (FI), and NXP (NL). The three-year project started at the beginning of 2008.

The reconfigurable CRISP platform, also called the General Streaming Processor (GSP), designed and implemented within the project, will consist of two separate devices: the General Purpose Device (GPD) and the Reconfigurable Fabric Device (RFD). The GPD contains an off-the-shelf General Purpose Processor (GPP) with memories and peripheral connections, whereas the RFD consists of 9 reconfigurable cores. The array of reconfigurable cores is illustrated in Figure 7 [34], with "R" depicting a router. The reconfigurable cores are Montium cores. (It was recently decided to use the Xentium processing tile as the Reconfigurable Core in the CRISP GSP. The Xentium has at least similar performance to the Montium with respect to cycle count, but is designed for better programmability, e.g., hardware supporting optimal software pipelining.) Montium [35] is a reconfigurable processing core. It has five Arithmetic and Logical Units (ALUs), each having two memories, resulting in a total of 10 internal memories. The cores communicate via a Network-on-Chip (NoC) which includes two global memories. The device interfaces to other devices and the outer world via standard interfaces.

Within the CRISP project the GNSS receiver is one of the two applications designed as a proof of concept for the platform. The other is a radar beamforming application, which has much higher computational demands than a standalone GNSS receiver.

4.1. Specifying the GNSS Receiver for the Multicore Platform. In the CRISP project our intention is to specify, implement, and integrate a GNSS receiver application supporting GPS and Galileo L1 Open Service (OS) signals on the multicore platform. In this case the restriction to L1 band usage comes from the selected radio [24], but in principle the multicore approach can be extended to multifrequency receivers if a suitable radio front end is used.

4.1.1. Requirements for the Tile Processor. The requirements of the GNSS L1 application have been studied in [36].

Table 1: Estimation of GNSS baseband process complexity for the Montium Tile Processor running at 200 MHz, max performance of 1 GMAC/s [36].

Process                  Usage (MMAC/s)   Usage of TP (%)
Acquisition (GPS)             43.66             4.4
Acquisition (Galileo)        196.15            19.6
Tracking (GPS)               163.67            16.4
Tracking (Galileo)           229.14            22.9

The results, restated in Table 1, indicated that a single Montium core running at a 200 MHz clock speed is barely capable of executing the minimum required amount of acquisition and tracking processes. This analysis did not take into account the processing power needed for the baseband-to-navigation handover, nor navigation processing itself. It is thus evident that an array of cores (more than one) is needed for GNSS L1 purposes. The estimates given in Table 1 are based on the reported [35] performance of the Montium core. The acquisition figures are computed for a search speed of one satellite per second, and the tracking figures are for a single channel.

The results presented in Table 1 reflect the complexity of the processes when the input stream is sampled at 16.368 MHz, which is the output frequency of the radio front end selected for the CRISP platform [24]. This is approximately 16 times the navigation signal's fundamental frequency of 1.023 MHz. The GNSS application can also be used with a lower-rate input stream without a significant loss in application performance. For this paper, we analyzed the effect of input stream decimation on the complexity of the main baseband processes. The other parameters, such as the acquisition time, the number of frequency bins for acquisition, and the number of active correlators per channel for tracking, remained the same as in [36]. Figures 8 and 9 illustrate the effect of decimation by factors 1, 2, 4, 8, and 16 on the utilization of the Montium Tile Processor. Decimation factor 1 equates to the case where no decimation is applied, that is, the results shown in Table 1.

Figure 7: Array of 9 reconfigurable cores [34] with an example mapping of the GNSS application illustrated; the selection of cores is random. "R" depicts a router and "IF" an interface.

The presented figures show how the complexity of both processes, measured as Montium Tile Processor utilization percentage, decreases rapidly as the decimation factor increases. The behavior is the same for GPS and Galileo signals, except that utilization with Galileo signals is somewhat larger than with GPS in all studied cases. To ease the computational load of the Tile Processor, decimation of the input stream appears to be a feasible choice. The amount of decimation should be sufficient to effect meaningful savings in TP utilization without significantly degrading the performance of the application. For the current GPS SPS signal, decimation by a factor of 4 (4.092 MHz) is feasible without significant loss in receiver performance. A factor of 8 (2.046 MHz) is equal to the Nyquist rate for 1.023 MHz, the PRN code rate used in the GPS SPS signal. In the Galileo case, 4 is the maximum decimation factor. This is because with a sampling frequency of approximately 4 MHz the BOC(1,1) component of the Galileo E1 OS signal can still be received with a maximum loss of only −0.9 dB compared with reception of the whole MBOC bandwidth [12]. (This also applies to the modern GPS L1C signals, but they are not specified in our application [36].)
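The decimation itself is a simple operation. A minimal sketch, using block averaging as the low-pass step (one possible choice for illustration, not necessarily the filter used in the CRISP FPGA):

```python
def decimate(samples, factor):
    """Decimate by averaging non-overlapping blocks: a crude low-pass
    filter plus downsampling. Real designs would use a proper filter."""
    return [sum(samples[i:i + factor]) / factor
            for i in range(0, len(samples) - factor + 1, factor)]

stream = list(range(16))        # e.g., 16 samples at 16.368 MHz
out = decimate(stream, 4)       # 4 samples at 4.092 MHz
```

Each output sample summarizes four input samples, so every downstream multiply-accumulate loop runs over one quarter of the data.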

Table 2: Estimation of GNSS baseband process complexity with a decimated (by factor 4) input stream. Montium Tile Processor running at 200 MHz, max performance of 1 GMAC/s.

Process                  Usage (MMAC/s)   Usage of TP (%)
Acquisition (GPS)              9.57             0.96
Acquisition (Galileo)         43.66             4.37
Tracking (GPS)                40.92             4.09
Tracking (Galileo)            57.28             5.73

In the ideal case the decimation of the input stream would change with the receiver mode (GPS/Galileo). However, since in CRISP the decimation of the radio stream will be implemented as hardware in the FPGA that connects the radio to the parallel interface of the final CRISP prototype platform, run-time configuration of the decimation factor is not feasible. For this reason, in the rest of the paper we focus on the scenario where a fixed decimation factor of 4 is used, resulting in a stream sample rate of 4.092 MHz. Table 2 shows the baseband complexity estimate for the case where the input stream is decimated by a factor of four. Compared to the original complexity figures shown in Table 1, the utilization of the TP is about four times smaller.
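The relation between the two tables can be checked with a small script. Assuming, as the per-sample multiply-accumulate structure of tracking suggests, that the tracking load scales linearly with the input rate, the Table 2 tracking entries are the Table 1 entries divided by four (within rounding), while FFT-based acquisition drops faster than linearly:

```python
# Tracking load in MMAC/s from Table 1 (full rate) and Table 2 (rate / 4)
tracking_full = {"GPS": 163.67, "Galileo": 229.14}
tracking_dec4 = {"GPS": 40.92, "Galileo": 57.28}

for system, full in tracking_full.items():
    # Linear scaling with sample rate: decimation by 4 divides the
    # per-second multiply-accumulate count by 4.
    assert abs(full / 4 - tracking_dec4[system]) < 0.01

# Acquisition is FFT-based (~N log N work per search), so it falls
# faster than linearly: 43.66 / 4 = 10.92 MMAC/s, while Table 2
# reports 9.57 MMAC/s for GPS acquisition.
print("tracking entries consistent with linear scaling")
```

This consistency check is an interpretation of the tables, not a computation from the paper itself.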

Figure 8: Acquisition process utilization of Montium Tile Processor resources as a function of the decimation factor of the input stream.

Figure 9: Tracking process utilization of Montium Tile Processor resources as a function of the decimation factor of the input stream.

Figure 10: Link payloads for the GPS acquisition process (a) and average payload of the GPS tracking processes (b).

4.1.2. Requirements for the Network-on-Chip. To analyze the multicore GNSS receiver application we built a functional software receiver in the C++ language, running on a PC. A detailed analysis of this software receiver will be given in a subsequent paper [37]. In our SW receiver each process was implemented as a software thread. By approximating one process per core, this approach enabled us to estimate the link payloads by logging the communication between the threads. We estimated a scenario where one core was allocated to the acquisition process and six cores were mapped to tracking processes. This scenario is illustrated in Figure 7. Digitized RF front-end data is input to the NoC via an interface.

A specific chip interface is used to connect the RFD to the GPD and to forward channel data (channel phase measurement data related to pseudorange measurements, and navigation data) to the GPD. The selected mapping is a compromise between the minimal operative setup (one acquisition and four tracking cores) and the needs of the dependability testing processes, where individual cores may be taken offline for testing purposes. The scenario was simulated with a prerecorded set of real GPS signals. Since signal sources for Galileo navigation were not available, the Galileo case was not tested. The link payloads caused by the cores communicating while the software ran for 5 seconds are illustrated in Figure 10. The results show that, in GPS mode, our GNSS application causes a payload on each link/processing core with a constant baseline of 4096 bytes per millisecond. This is caused by the radio front-end input, that is, the incoming signal. In this scenario we used real GPS front-end data sampled at 4.092 MHz, with each byte representing one sample. This sampling rate is also equal to the potential decimation scenario discussed earlier. With a higher sampling rate the link payload baseline would be raised, but on the other hand one byte can be preprocessed to contain more than one sample, decreasing the traffic caused by the radio front-end input. The first peak in the upper part of Figure 10 is caused by the acquisition process output. When the GNSS application starts, FFT-based acquisition is started; the results are ready after 60 milliseconds and are then transmitted to the tracking channels. This peak is also the largest individual payload event caused by the GNSS application. After a short initialization period the tracking processes start to produce channel output. The average of the simulated GPS tracking link/processing core payloads is illustrated in Figure 10(b).
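The constant baseline can be reproduced with a simple model (only the 4.092 MHz, one-byte-per-sample stream comes from the paper; the per-event byte counts below are illustrative guesses for the periodic traffic on top of it):

```python
def baseline_payload_bytes_per_ms(sample_rate_hz, bytes_per_sample=1):
    """Payload contributed by the raw front-end stream alone."""
    return sample_rate_hz * bytes_per_sample / 1000

BASE = baseline_payload_bytes_per_ms(4.092e6)   # 4092.0 bytes/ms,
                                                # close to the reported
                                                # ~4096 bytes/ms baseline

def tracking_link_payload(t_ms, symbol_bytes=8, meas_bytes=16):
    """Toy per-millisecond payload of one tracking core: the constant
    stream baseline, a navigation symbol every 20 ms (50 Hz in GPS),
    and a measurement burst once per second. The symbol and
    measurement sizes are illustrative, not values from the paper."""
    payload = BASE
    if t_ms % 20 == 0:
        payload += symbol_bytes
    if t_ms % 1000 == 0:
        payload += meas_bytes
    return payload
```

The model reproduces the qualitative shape of Figure 10(b): a flat floor dominated by the sample stream, small 20 ms steps, and a once-per-second spike.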
Every 20 milliseconds a navigation data symbol (the data rate is 50 Hz in GPS) is transmitted, and once a second a higher transmission peak is caused by the loop

phase measurement data, which is transmitted to the GPD for pseudorange estimation. In Galileo mode the payload caused by the incoming signal will be equal, since the same radio input is used for both GPS and Galileo. However, the transmission of data symbols will cause a bigger payload, since the data rate of the Galileo E1 signal is 250 symbols per second [8]. The Galileo phase measurement rate will remain the same as in GPS mode. From the results it is seen that the link payload caused by the incoming RF signal is the largest one in both operating modes; if the link payload needs to be optimized, reducing it is the first thing to study. The results also indicate that when the GNSS application is running smoothly, the link payloads it causes are predictable. Note that this estimate does not include any overhead caused by the network protocol or any data other than navigation-related traffic (dependability, real-time mapping of the processes). These issues will be studied in our future work.

4.2. Open Issues. Besides the additional network load caused by traffic other than the GNSS application itself, some other issues remain open. There may be challenges in designing software for a multicore environment. Power consumption, as well as the final bill of materials (BOM), that is, the final price of the multicore product, remain open issues at the time of this writing. In the future these issues will be studied and suitable optimizations performed after the prototyping and proofs of concept have been completed successfully.

5. Conclusions

In this paper we discussed three Software-Defined Radio (SDR) architectures for a Global Navigation Satellite System (GNSS) receiver. The use of flexible architectures in GNSS receivers was justified by the need to support upcoming navigation systems and newly developed algorithms, especially for multipath mitigation. The hardware accelerated SDR architecture is quite close to current mass market solutions: the ASIC is replaced with a reconfigurable piece of hardware, usually an FPGA. The second architecture, the ideal (or pure) SDR receiver, uses a single processor to realize all necessary signal processing functions. Real-time receivers remain a challenge, but postprocessing applications are already taking advantage of this architecture. The third architecture, SDR with multiple cores, is a novel approach for GNSS receivers. It benefits from both a high degree of flexibility and, when properly designed and scaled, a reasonably low unit price in high-volume production. In this paper we also presented the CRISP project, in which such a multicore architecture will be realized, along with an analysis of the GNSS application requirements for the multicore platform. We extended the previously published analysis of processing tile utilization to cover the effect of input stream decimation. Decimation by a factor of four seems to offer a good compromise between core utilization and application performance.

We implemented a software GNSS receiver with its processes implemented as threads and used it to analyze the GNSS application's communication payload on individual links. This analysis indicated that the incoming signal represents the largest part of the communication in the network between processing cores.

Acknowledgments

The authors want to thank Stephen T. Burgess from Tampere University of Technology for his useful comments on the manuscript. This work was supported in part by the FUGAT project funded by the Finnish Funding Agency for Technology and Innovation (TEKES). Parts of this research were conducted within the FP7 Cutting edge Reconfigurable ICs for Stream Processing (CRISP) project (ICT-215881) supported by the European Commission.

References

[1] E. D. Kaplan and C. J. Hegarty, Eds., Understanding GPS: Principles and Applications, Artech House, Boston, Mass, USA, 2nd edition, 2006.
[2] J. Benedicto, S. E. Dinwiddy, G. Gatti, R. Lucas, and M. Lugert, "GALILEO: satellite system design and technology developments," European Space Agency, November 2000.
[3] S. Revnivykh, "GLONASS status and progress," December 2008, http://www.oosa.unvienna.org/pdf/icg/2008/icg3/04.pdf.
[4] G. Gibbons, "International system providers meeting (ICG3) reflects GNSS's competing interests, cooperative objectives," Inside GNSS, December 2008.
[5] U.S. Air Force, "GPS modernization fact sheet," 2006, http://pnt.gov/public/docs/2006/modernization.pdf.
[6] M. S. Braasch and A. J. van Dierendonck, "GPS receiver architectures and measurements," Proceedings of the IEEE, vol. 87, no. 1, pp. 48–64, 1999.
[7] K. Borre, D. M. Akos, N. Bertelsen, P. Rinder, and S. H. Jensen, A Software-Defined GPS and Galileo Receiver: A Single-Frequency Approach, Birkhäuser, Boston, Mass, USA, 2007.
[8] "Galileo Open Service, signal in space interface control document (OS SIS ICD)," Draft 1, February 2008.
[9] "Interface Specification—Navstar GPS space segment/user segment L1C interfaces," IS-GPS-800, August 2007.
[10] R. D. Fontana, W. Cheung, and T. Stansell, "The modernized L2C signal—leaping forward into the 21st century," GPS World, pp. 28–34, September 2001.
[11] "Galileo Joint Undertaking—Galileo Open Service, signal in space interface control document (OS SIS ICD)," GJU, May 2006.
[12] G. W. Hein, J.-A. Avila-Rodriguez, S. Wallner, et al., "MBOC: the new optimized spreading modulation recommended for GALILEO L1 OS and GPS L1C," in Proceedings of the IEEE/ION Position, Location, and Navigation Symposium (PLANS '06), pp. 883–892, San Diego, Calif, USA, April 2006.
[13] H. Hurskainen, E. S. Lohan, X. Hu, J. Raasakka, and J. Nurmi, "Multiple gate delay tracking structures for GNSS signals and their evaluation with Simulink, SystemC, and VHDL," International Journal of Navigation and Observation, vol. 2008, Article ID 785695, 17 pages, 2008.
[14] S. Kim, S. Yoo, S. Yoon, and S. Y. Kim, "A novel unambiguous multipath mitigation scheme for BOC(kn, n) tracking in GNSS," in Proceedings of the International Symposium on Applications and the Internet Workshops, p. 57, 2007.
[15] F. Dovis, M. Pini, and P. Mulassano, "Multiple DLL architecture for multipath recovery in navigation receivers," in Proceedings of the 59th IEEE Vehicular Technology Conference (VTC '04), vol. 5, pp. 2848–2851, May 2004.
[16] F. Dovis, A. Gramazio, and P. Mulassano, "SDR technology applied to Galileo receivers," in Proceedings of the International Technical Meeting of the Satellite Division of the Institute of Navigation (ION GPS '02), Portland, Ore, USA, September 2002.
[17] "SDR Forum," January 2009, http://www.sdrforum.org.
[18] J. Mitola, "The software radio architecture," IEEE Communications Magazine, 1995.
[19] P. G. Mattos, "A single-chip GPS receiver," GPS World, October 2005.
[20] P. J. Mumford, K. Parkinson, and A. G. Dempster, "The Namuru open GNSS research receiver," in Proceedings of the International Technical Meeting of the Satellite Division of the Institute of Navigation (ION GNSS '06), vol. 5, pp. 2847–2855, Fort Worth, Tex, USA, September 2006.
[21] S. Ganguly, A. Jovancevic, D. A. Saxena, B. Sirpatil, and S. Zigic, "Open architecture real time development system for GPS and Galileo," in Proceedings of the International Technical Meeting of the Satellite Division of the Institute of Navigation (ION GNSS '04), pp. 2655–2666, Long Beach, Calif, USA, September 2004.
[22] H. Hurskainen, T. Paakki, Z. Liu, J. Raasakka, and J. Nurmi, "GNSS receiver reference design," in Proceedings of the 4th Advanced Satellite Mobile Systems Conference (ASMS '08), pp. 204–209, Bologna, Italy, August 2008.
[23] J. Hill, "Navigation signal processing with FPGAs," in Proceedings of the National Technical Meeting of the Institute of Navigation, pp. 420–427, June 2004.
[24] Atmel, "GPS front end IC ATR0603," datasheet, 2006.
[25] M. Detratti, E. Lopez, E. Perez, and R. Palacio, "Dual-frequency RF front end solution for hybrid Galileo/GPS mass market receivers," in Proceedings of the IEEE Consumer Communications and Networking Conference (CCNC '08), pp. 603–607, Las Vegas, Nev, USA, January 2008.
[26] P. Fine and W. Wilson, "Tracking algorithms for GPS offset carrier signals," in Proceedings of the ION National Technical Meeting (NTM '99), San Diego, Calif, USA, January 1999.
[27] H. Hurskainen and J. Nurmi, "SystemC model of an interoperative GPS/Galileo code correlator channel," in Proceedings of the IEEE Workshop on Signal Processing Systems (SIPS '06), pp. 327–332, Banff, Canada, October 2006.
[28] D. M. Akos, "The role of Global Navigation Satellite System (GNSS) software radios in embedded systems," GPS Solutions, May 2003.
[29] C. Dionisio, L. Cucchi, and R. Marracci, "SOFTREC G3, software receiver and signal analysis for GNSS bands," in Proceedings of the 10th IEEE International Symposium on Spread Spectrum Techniques and Applications (ISSSTA '08), Bologna, Italy, August 2008.
[30] G. E. Moore, "Cramming more components onto integrated circuits," Proceedings of the IEEE, vol. 86, no. 1, pp. 82–85, 1998.
[31] S. Söderholm, T. Jokitalo, K. Kaisti, H. Kuusniemi, and H. Naukkarinen, "Smart positioning with Fastrax's software GPS receiver solution," in Proceedings of the International Technical Meeting of the Satellite Division of the Institute of Navigation (ION GNSS '08), pp. 1193–1200, Savannah, Ga, USA, September 2008.
[32] J. H. Won, T. Pany, and G. W. Hein, "GNSS software defined radio: real receiver or just a tool for experts?" Inside GNSS, pp. 48–56, July-August 2006.
[33] "CRISP Project," December 2008, http://www.crisp-project.eu.
[34] P. Heysters, "CRISP project presentation," June 2008, http://www.crisp-project.eu/images/publications/D6.1 CRISP project presentation 080622.pdf.
[35] P. M. Heysters, G. K. Rauwerda, and L. T. Smit, "A flexible, low power, high performance DSP IP core for programmable systems-on-chip," in Proceedings of the IP/SoC, Grenoble, France, December 2005.
[36] H. Hurskainen, J. Raasakka, and J. Nurmi, "Specification of GNSS application for multiprocessor platform," in Proceedings of the International Symposium on System-on-Chip (SOC '08), pp. 128–133, Tampere, Finland, November 2008.
[37] J. Raasakka, H. Hurskainen, and J. Nurmi, "Modeling multicore software GNSS receiver with real time SW receiver," in Proceedings of the International Technical Meeting of the Satellite Division of the Institute of Navigation (ION GNSS '09), Savannah, Ga, USA, September 2009.

Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 598529, 13 pages
doi:10.1155/2009/598529

Research Article
An Open Framework for Rapid Prototyping of Signal Processing Applications

Maxime Pelcat,1 Jonathan Piat,1 Matthieu Wipliez,1 Slaheddine Aridhi,2 and Jean-François Nezan1

1 IETR/Image and Remote Sensing Group, CNRS UMR 6164/INSA Rennes, 20 avenue des Buttes de Coësmes, 35043 Rennes Cedex, France
2 HPMP Division, Texas Instruments, 06271 Villeneuve Loubet, France

Correspondence should be addressed to Maxime Pelcat, [email protected]

Received 27 February 2009; Revised 7 July 2009; Accepted 14 September 2009

Recommended by Markus Rupp

Embedded real-time applications in communication systems have significant timing constraints, thus requiring multiple computation units. Manually exploring the potential parallelism of an application deployed on multicore architectures is greatly time-consuming. This paper presents an open-source Eclipse-based framework which aims to facilitate the exploration and development processes in this context. The framework includes a generic graph editor (Graphiti), a graph transformation library (SDF4J), and an automatic mapper/scheduler tool with simulation and code generation capabilities (PREESM). The input of the framework is composed of a scenario description and two graphs: one describing an algorithm and the other describing an architecture. The rapid prototyping results of a 3GPP Long-Term Evolution (LTE) algorithm on a multicore digital signal processor illustrate both the features and the capabilities of this framework.

Copyright © 2009 Maxime Pelcat et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

The recent evolution of digital communication systems (voice, data, and video) has been dramatic. Over the last two decades, low data-rate systems (such as dial-up modems, first and second generation cellular systems, 802.11 wireless local area networks) have been replaced or augmented by systems capable of data rates of several Mbps, supporting multimedia applications (such as DSL, cable modems, 802.11b/a/g/n wireless local area networks, 3G, WiMax, and ultra-wideband personal area networks). As communication systems have evolved, the resulting increase in data rates has necessitated higher system algorithmic complexity. A more complex system requires greater flexibility in order to function with different protocols in different environments. Additionally, there is an increased need for the system to support multiple interfaces and multicomponent devices. Consequently, this requires the optimization of device parameters over varying constraints such as performance, area, and power. Achieving this device optimization requires a good understanding of the

application complexity and the choice of an appropriate architecture to support this application. An embedded system commonly contains several processor cores in addition to hardware coprocessors. The embedded system designer needs to distribute a set of signal processing functions onto a given hardware platform with predefined features. The functions are then executed as software code on the target architecture; this action is called a deployment in this paper. A common approach to implementing a parallel algorithm is the creation of a program containing several synchronized threads whose execution is driven by the scheduler of an operating system. Such an implementation does not meet the hard timing constraints required by real-time applications or the memory consumption constraints required by embedded systems [1]. One-time manual scheduling developed for single-processor applications is also not suitable for multiprocessor architectures: manual data transfers and synchronizations quickly become very complex, leading to wasted time and potential deadlocks.

Furthermore, the task of finding an optimal deployment of an algorithm mapped onto a multicomponent architecture is not straightforward. When performed manually, the result is inevitably a suboptimal solution. These issues raise the need for new methodologies which allow the exploration of several solutions in order to achieve a better result. Several features must be provided by a fast prototyping process: description of the system (hardware and software), automatic mapping/scheduling, simulation of the execution, and automatic code generation. This paper draws on previously presented works [2–4] in order to generate a more complete rapid prototyping framework. This complete framework is composed of three complementary tools based on Eclipse [5] that provide a full environment for the rapid prototyping of real-time embedded systems: Parallel and Real-time Embedded Executives Scheduling Method (PREESM), Graphiti, and Synchronous Data Flow for Java (SDF4J). This framework implements the Algorithm-Architecture Matching (AAM) methodology, previously called Algorithm-Architecture Adequation (AAA) [6]. The focus of this rapid prototyping activity is currently static code mapping/scheduling, but dynamic extensions are planned for future generations of the tool. From the graph descriptions of an algorithm and of an architecture, PREESM can find the right deployment, provide simulation information, and generate a framework code for the processor cores [2]. These rapid prototyping tasks can be combined and parameterized in a workflow. In PREESM, a workflow is defined as an oriented graph representing the list of rapid prototyping tasks to execute on the input algorithm and architecture graphs in order to determine and simulate a given deployment. A rapid prototyping process in PREESM consists of a succession of transformations. These transformations are associated in a data flow graph representing a workflow that can be edited in the Graphiti generic graph editor.
The PREESM input graphs may also be edited using Graphiti. The PREESM algorithm models are handled by the SDF4J library. The framework can be extended by modifying the workflows or by connecting new plug-ins (for compilation, graph analyses, and so on). In this paper, the differences between the proposed framework and related works are explained in Section 2. The framework structure is described in Section 3. Section 4 details the features of PREESM that can be combined by users in workflows. The use of the framework is illustrated by the deployment of a wireless communication algorithm from the 3rd Generation Partnership Project (3GPP) Long-Term Evolution (LTE) standard in Section 5. Finally, conclusions are given in Section 6.

2. State of the Art of Rapid Prototyping and Multicore Programming

There exist numerous solutions to partition algorithms onto multicore architectures. If the target architecture is homogeneous, several solutions exist which generate multicore code from C with additional information (OpenMP [7], CILK [8]). In the case of heterogeneous architectures,

languages such as OpenCL [9] and the Multicore Association Application Programming Interface (MCAPI [10]) define ways to express the parallel properties of a code. However, they are not currently linked to efficient compilers and runtime environments. Moreover, compilers for such languages would have difficulty in extracting and solving the bottlenecks of the implementation that appear inherently in graph descriptions of the architecture and the algorithm. The Poly-Mapper tool from PolyCore Software [11] offers functionalities similar to PREESM but, in contrast to PREESM, its mapping/scheduling is manual. Ptolemy II [12] is a simulation tool that supports many models of computation. However, it also has no automatic mapping, and its code generation for embedded systems currently focuses on single-core targets. Another family of frameworks for data flow based programming is built on the CAL [13] language and includes OpenDF [14]. OpenDF employs a more dynamic model than PREESM but its related code generation does not currently support multicore embedded systems. Closer to PREESM are the Model Integrated Computing (MIC [15]), the Open Tool Integration Environment (OTIE [16]), the Synchronous Distributed Executives (SynDEx [17]), the Dataflow Interchange Format (DIF [18]), and SDF for Free (SDF3 [19]). Neither MIC nor OTIE can be accessed online. According to the literature, MIC focuses on the transformation between algorithm domain-specific models and metamodels, while OTIE defines a single system description that can be used during the whole signal processing design cycle. DIF is designed as an extensible repository of representation, analysis, transformation, and scheduling of data flow languages. DIF is a Java library which allows the user to go from graph specification using the DIF language to C code generation.
However, the hierarchical Synchronous Data Flow (SDF) model used in the SDF4J library and PREESM is not available in DIF. SDF3 is an open-source tool implementing some data flow models and providing analysis, transformation, visualization, and manual scheduling as a C++ library. SDF3 implements Scenario Aware Data Flow (SADF [20]) and provides a Multiprocessor System-on-Chip (MP-SoC) binding/scheduling algorithm to output MP-SoC configuration files. SynDEx and PREESM are both based on the AAM methodology [6] but the tools do not provide the same features. SynDEx is not open source, it has its own model of computation that does not support schedulability analysis, and code generation is possible but not provided with the tool. Moreover, the architecture model of SynDEx is at too high a level to account, in the mapping/scheduling, for the bus contentions and DMA used in modern chips (multicore processors or MP-SoCs). The features that differentiate PREESM from the related works and similar tools are:

(i) the tool is open source and accessible online;

(ii) the algorithm description is based on a single well-known and predictable model of computation;

EURASIP Journal on Embedded Systems

SDF4J Generic graph editor eclipse plug-in

3

Data flow graph transformation library

PREESM

Scheduler

Graph transformation

Graphiti

Rapid prototyping eclipse plug-ins

Code generator

Core Eclipse framework

Figure 1: An Eclipse-based Rapid Prototyping Framework.

(iii) the mapping and the scheduling are totally automatic;

(iv) the functional code for heterogeneous multicore embedded systems can be generated automatically;

(v) the algorithm model provides a helpful hierarchical encapsulation, thus simplifying the mapping/scheduling [3].

The PREESM framework structure is detailed in the next section.

3. An Open-Source Eclipse-Based Rapid Prototyping Framework

3.1. The Framework Structure. The framework structure is presented in Figure 1. It is composed of several tools to increase reusability in several contexts. The first step of the process is to describe both the target algorithm and the target architecture graphs. A graphical editor reduces the development time required to create, modify, and edit those graphs. The role of Graphiti [21] is to support the creation of algorithm and architecture graphs for the proposed framework. Graphiti can also be quickly configured to support any type of file format used for generic graph descriptions. The algorithm is currently described as a Synchronous Data Flow (SDF [22]) graph. The SDF model is a good solution for describing algorithms with static behavior. SDF4J [23] is an open-source library providing the usual transformations of SDF graphs in the Java programming language. The extensive use of SDF and its derivatives in the programming model community led to the development of SDF4J as an external tool. Due to the greater specificity of the architecture description compared to the algorithm description, it was decided to perform the architecture transformation inside the PREESM plug-ins. The PREESM project [24] involves the development of a tool that performs the rapid prototyping tasks. The PREESM tool uses the Graphiti tool and the SDF4J library to design algorithm and architecture graphs and to generate their transformations. The PREESM core is an Eclipse plug-in that executes sequences of rapid prototyping tasks, or workflows. The tasks of a workflow are delegated to PREESM plug-ins. There are currently three PREESM plug-ins: the graph

transformation plug-in, the scheduler plug-in, and the code-generation plug-in. The three tools of the framework are detailed in the next sections.

3.2. Graphiti: A Generic Graph Editor for Editing Architectures, Algorithms, and Workflows. Graphiti is an open-source plug-in for the Eclipse environment that provides a generic graph editor. It is written using the Graphical Editor Framework (GEF). The editor is generic in the sense that any type of graph may be represented and edited. Graphiti is used routinely with the following graph types and associated file formats: CAL networks [13, 25], a subset of IP-XACT [26], GraphML [27], and PREESM workflows [28].

3.2.1. Overview of Graphiti. A type of graph is registered within the editor by a configuration. A configuration is an XML (Extensible Markup Language [29]) file that describes:

(1) the abstract syntax of the graph (types of vertices and edges, and attributes allowed for objects of each type);

(2) the visual syntax of the graph (colors, shapes, etc.);

(3) transformations from the file format in which the graph is defined to Graphiti's XML file format G, and vice versa (Figure 2).

Two kinds of input transformations are supported, from XML to XML and from text to XML (Figure 2). XML is transformed to XML with an Extensible Stylesheet Language Transformation (XSLT [30]), and text is parsed to its Concrete Syntax Tree (CST), represented in XML according to an LL(k) grammar, by the Grammatica [31] parser. Similarly, two kinds of output transformations are supported, from XML to XML and from XML to text. Graphiti handles attributed graphs [32]. An attributed graph is defined as a directed multigraph G = (V, E, μ) with V the set of vertices and E the multiset of edges (there can be more than one edge between any two vertices). μ is a function μ : ({G} ∪ V ∪ E) × A → U that associates instances with attributes from the attribute name set A and values from U, the set of possible attribute values.
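As an illustration of this attributed multigraph model, the μ function can be represented as a mapping from (instance, attribute name) pairs to values. The following is a minimal sketch with hypothetical Python names for illustration only; Graphiti itself is a Java/Eclipse plug-in with a different implementation.

```python
# Sketch of an attributed directed multigraph G = (V, E, mu).
# mu associates (instance, attribute name) pairs with values.
# All names here are hypothetical, chosen to mirror the formal model.

class AttributedGraph:
    def __init__(self):
        self.vertices = set()
        self.edges = []     # multiset: several edges may link the same pair
        self.mu = {}        # (instance, attribute name) -> value

    def add_vertex(self, v, type_):
        self.vertices.add(v)
        self.mu[(v, "type")] = type_   # built-in "type" attribute

    def add_edge(self, src, dst, type_):
        e = (src, dst, len(self.edges))  # index keeps parallel edges distinct
        self.edges.append(e)
        self.mu[(e, "type")] = type_
        return e

    def set_attr(self, instance, name, value):
        self.mu[(instance, name)] = value

    def get_attr(self, instance, name):
        return self.mu.get((instance, name))

g = AttributedGraph()
g.add_vertex("v1", "node")
g.add_vertex("v2", "node")
e = g.add_edge("v1", "v2", "edgeType")
g.set_attr(e, "sourcePort", "out")
g.set_attr(e, "targetPort", "in")
print(g.get_attr(e, "sourcePort"))  # -> out
```

Here the built-in type attribute of each instance plays the role of t = μ(i, type), and the per-type attribute sets A_t = τ(t) would be enforced by the configuration file in the real tool.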
A built-in type attribute is defined so that each instance i ∈ {G} ∪ V ∪ E has a type t = μ(i, type), and only admits attributes from a set A_t ⊂ A

Figure 2: Input/output with Graphiti's XML format G.

Figure 3: A sample graph.

Figure 4: The type of vertices of the graph shown in Figure 3.

given by A_t = τ(t). Additionally, a type t has a visual syntax σ(t) that defines its color, shape, and size. To edit a graph, the user selects a file and the matching configuration is computed based on the file extension. The transformations defined in the configuration file are then applied to the input file and result in a graph defined in Graphiti's XML format G, as shown in Figure 2. The editor uses the visual syntax defined by σ in the configuration to draw the graph, vertices, and edges. For each instance of type t the user can edit the relevant attributes allowed by τ(t) as defined in the configuration. Saving a graph consists of writing the graph in G and transforming it back to the input file's native format.

3.2.2. Editing a Configuration for a Graph Type. To create a configuration for the graph represented in Figure 3, a node (a single type of vertex) must be defined. A node has a unique identifier called id, and accepts a list of values initially equal to [0] (Figure 4). Additionally, ports need to be specified on the edges, so the configuration describes an edgeType element (Figure 5) that carries sourcePort and targetPort parameters to store an edge's source and target ports, respectively, such as acc, in, and out in Figure 3. Graphiti is a stand-alone tool, totally independent of PREESM. However, Graphiti generates the workflow graphs, IP-XACT, and GraphML files that are the main inputs of PREESM. The GraphML files contain the algorithm model. These inputs are loaded and stored in PREESM by the SDF4J library. This library, discussed in the next section, executes the graph transformations.

3.3. SDF4J: A Java Library for Algorithm Data Flow Graph Transformations. SDF4J is a library defining several data flow oriented graph models such as SDF and Directed Acyclic Graph (DAG [33]). It provides the user with several classic SDF transformations, such as hierarchy flattening, and

Figure 5: The type of edges of the graph shown in Figure 3.

SDF to Homogeneous SDF (HSDF [34]) transformations, and some clustering algorithms. This library also gives the possibility to expand optimization templates. It defines its own graph representation based on the GraphML standard and provides the associated parser and exporter classes. SDF4J is freely available (GPL license) for download.

3.3.1. SDF4J SDF Graph Model. An SDF graph is used to simplify the application specifications. It allows the representation of the application behavior at a coarse grain level. This data flow representation models the application operations and specifies the data dependencies between these operations. An SDF graph is a finite directed, weighted graph G = ⟨V, E, d, p, c⟩ where:

(i) V is the set of nodes; a node computes an input data stream and outputs the result;

(ii) E ⊆ V × V is the edge set, representing channels which carry data streams;

(iii) d : E → N ∪ {0} is a function with d(e) the number of initial tokens on an edge e;

(iv) p : E → N is a function with p(e) representing the number of data tokens produced at e's source to be carried by e;

Figure 6: An SDF graph.

(v) c : E → N is a function with c(e) representing the number of data tokens consumed from e by e's sink node.

This model offers strong compile-time predictability properties, but has limited expressive capability. The SDF implementation enabled by SDF4J supports the hierarchy defined in [3], which increases the model's expressiveness. This specific implementation is straightforward for the programmer and allows user-defined structural optimizations. This model is also intended to lead to better code generation using common C patterns like loops and function calls. It is highly expandable, as the user can associate any properties with the graph components (edges, vertices) to produce a customized model.

3.3.2. SDF4J SDF Graph Transformations. SDF4J implements several algorithms intended to transform the base model or to optimize the application behavior at different levels.

(i) The hierarchy flattening transformation aims to flatten the hierarchy (remove hierarchy levels) at the chosen depth in order to later extract as much parallelism as possible from the designer's hierarchical description.

(ii) The HSDF transformation (Figure 7) transforms the SDF model into an HSDF model in which the number of tokens exchanged on each edge is homogeneous (production = consumption). This model reveals all the potential parallelism in the application but dramatically increases the number of vertices in the graph.

(iii) The internalization transformation, based on [35], is an efficient clustering method minimizing the number of vertices in the graph without decreasing the potential parallelism in the application.

(iv) The SDF to DAG transformation converts the SDF or HSDF model into the DAG model, which is commonly used by scheduling methods [33].

3.4. PREESM: A Complete Framework for Hardware and Software Codesign. In the framework, the role of the PREESM tool is to perform the rapid prototyping tasks. Figure 8 depicts an example of a classic workflow which can be executed in the PREESM tool.
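The HSDF transformation above rests on the SDF balance equations: for every edge e, p(e)·q(src(e)) = c(e)·q(snk(e)), where q is the graph's basic repetition vector giving how many times each node fires per graph iteration. The following sketch (illustrative only, not the SDF4J API; it assumes a connected, consistent graph) computes q:

```python
# Sketch: compute the basic repetition vector q of a consistent SDF graph
# by propagating the ratios q(snk)/q(src) = p(e)/c(e) along the edges,
# then scaling to the smallest integer solution. Assumes the graph is
# connected and consistent (otherwise the balance equations have no
# positive solution).
from fractions import Fraction
from functools import reduce
from math import lcm

def repetition_vector(vertices, edges):
    """edges: list of (src, snk, p, c) tuples with token rates p and c."""
    q = {vertices[0]: Fraction(1)}
    changed = True
    while changed:                      # propagate ratios to a fixed point
        changed = False
        for src, snk, p, c in edges:
            if src in q and snk not in q:
                q[snk] = q[src] * Fraction(p, c); changed = True
            elif snk in q and src not in q:
                q[src] = q[snk] * Fraction(c, p); changed = True
    scale = reduce(lcm, (f.denominator for f in q.values()), 1)
    return {v: int(f * scale) for v, f in q.items()}

# An edge producing 2 tokens at op1 and consuming 3 at op2:
# op1 must fire 3 times for every 2 firings of op2.
print(repetition_vector(["op1", "op2"], [("op1", "op2", 2, 3)]))
# -> {'op1': 3, 'op2': 2}
```

In the HSDF expansion, each node v is then duplicated q(v) times, which is why the transformed graph of Figure 7 is so much larger than the original SDF graph.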
As seen in Section 3.3, the data flow model chosen to describe applications in PREESM is the SDF model. This model, described in [22], has the great advantage of enabling the formal verification of static schedulability. The typical number of vertices to schedule in

Figure 7: An SDF graph and its HSDF transformation.

PREESM is between one hundred and several thousand. The architecture is described using the IP-XACT language, an IEEE standard from the SPIRIT consortium [26]. The typical size of an architecture representation in PREESM is between a few cores and several dozen cores. A scenario is defined as a set of parameters and constraints that specify the conditions under which the deployment will run. As can be seen in Figure 8, prior to entering the scheduling phase, the algorithm goes through three transformation steps: the hierarchy flattening transformation, the HSDF transformation, and the DAG transformation (see Section 3.3.2). These transformations prepare the graph for the static scheduling and are provided by the graph transformation module (see Section 4.1). Subsequently, the DAG (the converted SDF graph) is processed by the scheduler [36]. As a result of the deployment by the scheduler, code is generated and a Gantt chart of the execution is displayed. The generated code consists of scheduled function calls, synchronizations, and data transfers between cores. The functions themselves are handwritten. The plug-ins of the PREESM tool implement the rapid prototyping tasks that a user can add to the workflows. These plug-ins are detailed in the next section.
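The first of these transformation steps, hierarchy flattening to a chosen depth, can be sketched as follows. The nested-tuple representation and function name are hypothetical, chosen only to illustrate the idea; SDF4J's actual hierarchical SDF classes differ.

```python
# Sketch: flatten a hierarchical graph description down to a chosen depth,
# as in the hierarchy flattening transformation. A hierarchical vertex is
# written ('name', [children]); an atomic actor is just its name string.

def flatten(vertex, depth):
    """Return the vertices obtained by descending at most `depth` levels.
    Hierarchical vertices deeper than `depth` are kept as single vertices."""
    if isinstance(vertex, str) or depth == 0:
        return [vertex]
    name, children = vertex
    out = []
    for child in children:
        out.extend(flatten(child, depth - 1))
    return out

top = ("top", [("filter", ["fir", "decimate"]), "src", "sink"])
print(flatten(top, 1))   # keeps 'filter' as one hierarchical vertex
print(flatten(top, 2))   # fully flattened leaf actors
```

Limiting the depth, as PREESM allows, trades exposed parallelism against the number of vertices the scheduler must handle.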

4. The Current Features of PREESM

4.1. The Graph Transformation Module. In order to generate an efficient schedule for a given algorithm description, the application defined by the designer must be transformed. The purpose of this transformation is to reveal the potential parallelism of the algorithm and simplify the work of the task scheduler. To provide the user with flexibility while optimizing the design, all of the graph transformations provided by the SDF4J library can be instantiated in a workflow, with parameters allowing the user to control each of the three transformations. For example, the hierarchy flattening transformation can be configured to flatten a given number of hierarchy levels (depth) in order to keep some of the user's hierarchical construction and to maintain the number of vertices to schedule at a reasonable level. The HSDF transformation provides the scheduler with a graph of high potential parallelism, as all the vertices of the SDF graph are repeated according to the SDF graph's basic repetition vector. Consequently, the number of vertices to schedule is larger than in the original graph. The clustering transformation prepares the algorithm for the scheduling process by grouping vertices according to criteria such as strong connectivity or strong data dependency between

Figure 8: Example of a workflow graph: from SDF and IP-XACT descriptions to the generated code.

vertices. The grouped vertices are then transformed into a hierarchical vertex, which is then treated as a single vertex in the scheduling process. This vertex grouping reduces the number of vertices to schedule, speeding up the scheduling process. The user can freely use the available transformations in his workflow in order to control the criteria for optimizing the targeted application and architecture. As can be seen in the workflow displayed in Figure 8, the graph transformation steps are followed by the static scheduling step.

4.2. The PREESM Static Scheduler. Scheduling consists of statically distributing the tasks that constitute an application between the available cores in a multicore architecture and minimizing parameters such as final latency. This problem has been proven to be NP-complete [37]. A static scheduling algorithm is usually described as a monolithic process, and carries out two distinct functionalities: choosing the core to execute a specific function and evaluating the cost of the generated solutions. The PREESM scheduler splits these functionalities into three submodules [4] which share minimal interfaces: the task scheduling, the edge scheduling, and the Architecture Benchmark Computer (ABC) submodules. The task scheduling submodule produces a scheduling solution for the application tasks mapped onto the architecture cores and then queries the ABC submodule to evaluate the cost of the

proposed solution. The advantage of this approach is that any task scheduling heuristic may be combined with any ABC model, leading to many different scheduling possibilities. For instance, an ABC minimizing the deployment memory or energy consumption can be implemented without modifying the task scheduling heuristics. The interface offered by the ABC to the task scheduling submodule is minimal. The ABC gives the number of available cores, receives a deployment description, and returns costs to the task scheduling submodule (infinite if the deployment is impossible). The time keeper calculates and stores timings for the tasks and the transfers when necessary for the ABC. The ABC needs to schedule the edges in order to calculate the deployment cost. However, it is not designed to make any deployment choices; this task is delegated to the edge scheduling submodule. The router in the edge scheduling submodule finds potential routes between the available cores. The choice of module structure was motivated by the behavioral commonality of the majority of scheduling algorithms (see Figure 9).

4.2.1. Scheduling Heuristics. Three algorithms are currently coded; they are modified versions of the algorithms described in [38].

(i) A list scheduling algorithm schedules tasks in the order dictated by a list constructed from estimating a critical path. Once a mapping choice has been

EURASIP Journal on Embedded Systems made, it will never be modified. This algorithm is fast but has limitations due to this last property. List scheduling is used as a starting point for other refinement algorithms. (ii) The FAST algorithm is a refinement of the list scheduling solution which uses probabilistic hops. It changes the mapping choices of randomly chosen tasks; that is, it associates these tasks to another processing unit. It runs until stopped by the user and keeps the best latency found. The algorithm is multithreaded to exploit the multicore parallelism of a host computer. (iii) A genetic algorithm is coded as a refinement of the FAST algorithm. The n best solutions of FAST are used as the base population for the genetic algorithm. The user can stop the processing at any time while retaining the last best solution. This algorithm is also multithreaded. The FAST algorithm has been developed to solve complex deployment problems. In the original heuristic, the final order of tasks to schedule, as defined by the list scheduling algorithm, was not modified by the FAST algorithm. The FAST algorithm only modifies the mapping choices of the tasks. In large-scale applications, the initial order of the tasks performed by the list scheduling algorithm becomes occasionally suboptimal. In the modified version of the FAST scheduling algorithm, the ABC recalculates the final order of a task when the heuristic maps a task to a new core. The task switcher algorithm used to recalculate the order simply looks for the earliest appropriately sized hole in the core schedule for the mapped task (see Figure 10). 4.2.2. Scheduling Architecture Model. The current architecture representation was driven by the need to accurately model multicore architectures and hardware coprocessors with intercores message-passing communication. This communication is handled in parallel to the computation using Direct Memory Access (DMA) modules. 
This model is currently used to closely simulate the Texas Instruments TMS320TCI6487 processor (see Section 5.3.2). The model will soon be extended to shared memory communications and more complex interconnections. The term operator represents either a processor core or a hardware coprocessor. Operators are linked by media, each medium representing a bus and the associated DMA. The architectures can be either homogeneous (with all operators and media identical) or heterogeneous. For each medium, the user defines a DMA set up time and a bus data rate. As shown in Figure 9, the architecture model is only processed in the scheduler by the ABC, and not by the heuristic and edge scheduling submodules.

4.2.3. Architecture Benchmark Computer. Scheduling often requires much time. Testing intermediate solutions with precision is an especially time-consuming operation. The ABC submodule was created by reusing the useful concept of time scalability introduced in SystemC Transaction Level

Figure 9: Scheduler module structure.

Modeling (TLM) [39]. This language defines several levels of system temporal simulation, from untimed to cycle-accurate precision. This concept motivated the development of several ABC latency models with different timing precisions. Three ABC latency models are currently coded (see Figure 11).

(i) The loosely-timed model takes into account task and transfer times but no transfer contention.

(ii) The approximately-timed model associates each inter-core communication medium with its constant rate and simulates contentions.

(iii) The accurately-timed model adds set up times which simulate the duration necessary to initialize a parallel transfer controller like the Texas Instruments Enhanced Direct Memory Access (EDMA [40]). This set up time is scheduled on the core which sends the transfer.

The task and architecture properties feeding the ABC submodule are evaluated experimentally, and include media data rates, set up times, and task timings. ABC models evaluating parameters other than latency are planned in order to minimize memory size, memory accesses, cadence (i.e., average runtime), and so on. Currently, only latency is minimized due to the limitations of the list scheduling algorithms: these costs cannot be evaluated on partial deployments.

4.2.4. Edge Scheduling Submodule. When a data block is transferred from one operator to another, transfer tasks are added and then mapped to the corresponding medium. A route is associated with each edge carrying data from one operator to another, which may possibly go through several other operators. The edge scheduling submodule routes the edges and schedules their route steps. The existing routing process is basic and will be developed further once the architecture model has been extended. Edge scheduling can be executed with different algorithms of varying complexity, which results in another level of scalability.
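The three ABC latency models described above differ only in which terms of the transfer cost they account for. The following sketch condenses that difference into one function; the formulas and parameter names are illustrative simplifications (in PREESM the actual rates, set up times, and task timings are measured experimentally, and the set up time is scheduled on the sending core rather than folded into the transfer).

```python
# Sketch: end time of an inter-core transfer under the three ABC models.
# medium_free_at models contention (when the shared medium becomes free),
# size/rate the transfer duration, setup the DMA initialization time.

def transfer_end(medium_free_at, ready, size, rate, setup, model):
    duration = size / rate
    if model == "loosely-timed":
        start = ready                        # contention ignored
    else:
        start = max(ready, medium_free_at)   # contention simulated
    if model == "accurately-timed":
        duration += setup                    # DMA set up time added
    return start + duration

# 1024-byte transfer, ready at t=0, medium busy until t=3,
# rate 256 bytes/unit, 5-unit DMA set up time:
print(transfer_end(3, 0, 1024, 256, 5, "loosely-timed"))        # -> 4.0
print(transfer_end(3, 0, 1024, 256, 5, "approximately-timed"))  # -> 7.0
print(transfer_end(3, 0, 1024, 256, 5, "accurately-timed"))     # -> 12.0
```

The scheduler can thus trade evaluation speed against timing fidelity by switching models, in the spirit of TLM's untimed-to-cycle-accurate spectrum.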
Currently, two algorithms are implemented: (i) the simple edge scheduler follows the scheduling order given by the task list provided by the list scheduling algorithm;

EURASIP Journal on Embedded Systems

Figure 10: Switchable scheduling heuristics.

Figure 11: Switchable ABC models.

(ii) the switching edge scheduler reuses the task switcher algorithm discussed in Section 4.2.1 for edge scheduling: when a new communication edge needs to be scheduled, the algorithm looks for the earliest hole of appropriate size in the medium schedule.

The scheduler framework enables the comparison of different edge scheduling algorithms using the same task scheduling submodule and architecture model description. The main advantage of the scheduler structure is that the scheduling algorithms are independent of the cost type and benchmark complexity.

4.3. Generating a Code from a Static Schedule. Using the AAM methodology from [6], code can be generated from the static scheduling of the input algorithm on the input architecture (see the workflow in Figure 8). This code consists of an initialization phase and a loop endlessly repeating the algorithm graph. From the deployment generated by the scheduler, the code generation module generates a generic representation of the code in XML. The specific code for the target is then obtained after an XSLT transformation. The code generation flow for a Texas Instruments tri-core processor TMS320TCI6487 (see Section 5.3.2) is illustrated in Figure 12. PREESM currently supports the C64x and C64x+ based processors from Texas Instruments with the DSP/BIOS operating system [41], and x86 processors with the Windows operating system. The supported inter-core communication schemes include TCP/IP with sockets, Texas Instruments EDMA3 [42], and the RapidIO link [43].

An actor is a task with no hierarchy. A function must be associated with each actor, and the prototype of the function must be defined to add the right parameters in the right order. A CORBA Interface Definition Language (IDL) file is associated with each actor in PREESM. An example of an IDL file is shown in Figure 13. This file gives the generic prototypes of the initialization and loop function calls associated with a task.
IDL was chosen because it is a language-independent way to express an interface.

Depending on the type of medium between the operators in the PREESM architecture model, the XSLT transformation generates calls to the appropriate predefined communication library. Specific code libraries have been developed to manage the communications and synchronizations between the target cores [2].
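Returning to the switching edge scheduler described above, its search for the earliest suitably sized hole in a medium schedule can be sketched as follows (an illustrative model with a hypothetical function name; the actual scheduler operates on PREESM's internal timed medium schedules):

```python
def earliest_hole(busy, duration, earliest_start=0):
    """Earliest start time for a transfer of `duration` time units in a
    medium schedule given as a sorted list of busy (start, end)
    intervals. Simplified sketch of the 'earliest hole' search."""
    t = earliest_start
    for start, end in busy:
        if t + duration <= start:
            return t            # the hole before this busy slot fits
        t = max(t, end)         # otherwise skip past the busy slot
    return t                    # schedule after the last busy slot

busy_slots = [(0, 10), (12, 20), (25, 40)]
print(earliest_hole(busy_slots, 2))  # 10: fits into the hole [10, 12)
print(earliest_hole(busy_slots, 5))  # 20: first hole of size 5 starts at 20
```

The same search is repeated for every newly created communication edge, so the medium schedule fills up without reordering transfers that are already placed.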

5. Rapid Prototyping of a Signal Processing Algorithm from the 3GPP LTE Standard

The framework functionalities detailed in the previous sections are now applied to the rapid prototyping of a signal processing application from the 3GPP LTE radio access network physical layer.

5.1. The 3GPP LTE Standard. The 3GPP [44] is a group formed by telecommunication organizations to standardize the third generation (3G) mobile phone system specification. This group is currently developing a new standard: the Long-Term Evolution (LTE) of the 3G. The aim of this standard is to bring data rates of tens of megabits per second to wireless devices.

The communication between the User Equipment (UE) and the evolved base station (eNodeB) starts when the UE requests a connection to the eNodeB via a random access preamble (Figure 14). The eNodeB then allocates radio resources to the user for the rest of the random access procedure and sends a response. The UE answers with an L2/L3 message containing an identification number. Finally, the eNodeB sends back the identification number of the connected UE. If several UEs sent the same random access preamble at the same time, only one connection is granted and the other UEs must send a new random access preamble. After the random access procedure, the eNodeB allocates resources to the UE, and uplink and downlink logical channels are created to exchange data continuously.

The decoding algorithm, at the eNodeB, of the UE random access preamble is studied in this section. This algorithm is known as the Random Access CHannel Preamble Detection (RACH-PD).
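The contention behavior of the random access procedure can be illustrated with a toy model (hypothetical names; this is a sketch of the "only one connection is granted" rule, not of the full four-message handshake):

```python
def random_access(requests):
    """requests: list of (ue_name, preamble_id) sent in the same slot.
    UEs sending the same preamble collide; one connection is granted
    per detected preamble and the other UEs must retry. Illustrative
    model only."""
    by_preamble = {}
    for ue, preamble in requests:
        by_preamble.setdefault(preamble, []).append(ue)
    granted, must_retry = [], []
    for ues in by_preamble.values():
        granted.append(ues[0])      # contention resolution keeps one UE
        must_retry.extend(ues[1:])  # the others send a new preamble
    return granted, must_retry

granted, retry = random_access([("ue1", 17), ("ue2", 17), ("ue3", 42)])
print(granted, retry)  # ['ue1', 'ue3'] ['ue2']
```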

Figure 12: Code generation.

    module antenna_delay {
        typedef long cplx;
        typedef short param;
        interface antenna_delay {
            void init(in cplx antIn);
            void loop(in cplx antIn, out char waitOut, in param antSize);
        };
    };

Figure 13: Example of an IDL prototype.

Figure 14: Random access procedure.

Figure 15: The random access slot structure.

5.2. The RACH Preamble Detection. The RACH is a contention-based uplink channel used mainly in the initial transmission requests from the UE to the eNodeB for connection to the network. The UE, seeking connection with a base station, sends its signature in a dedicated RACH preamble time and frequency window, in accordance with a predefined preamble format. Signatures have special autocorrelation and intercorrelation properties that maximize the ability of the eNodeB to distinguish between different UEs. The RACH preamble procedure implemented in the LTE eNodeB can detect and identify each user's signature and is dependent on the cell size and the system bandwidth. Assume that the eNodeB has the capacity to handle the processing of this RACH preamble detection every millisecond in a worst-case scenario.

The preamble is sent over a specified time-frequency resource, denoted as a slot, available with a certain cycle period and a fixed bandwidth. Within each slot, a Guard Period (GP) is reserved at each end to maintain time orthogonality between adjacent slots [45]. This preamble-based random access slot structure is shown in Figure 15.

The case study in this article assumes a RACH-PD for a cell size of 115 km. This is the largest cell size supported by LTE and is also the case requiring the most processing power. According to [46], preamble format no. 3 is used, with 21,012 complex samples as a cyclic prefix for GP1, followed by a preamble of 24,576 samples, followed by the same 24,576 samples repeated. In this case the slot duration is 3 ms, which gives a GP2 of 21,996 samples.

As per Figure 16, the algorithm for the RACH preamble detection can be summarized in the following steps [45].

(1) After the cyclic prefix removal, the preprocessing (Preproc) function isolates the RACH bandwidth by shifting the data in frequency and filtering it with downsampling. It then transforms the data into the frequency domain.

(2) Next, the circular correlation (CirCorr) function correlates the data with several prestored preamble root sequences (or signatures) in order to discriminate between simultaneous messages from several users. It also applies an IFFT to return to the temporal domain and calculates the energy of each root sequence correlation.

Figure 16: Random Access Channel Preamble Detection (RACH-PD) Algorithm.

(3) Then, the noise floor threshold (NoiseFloorThr) function collects these energies and estimates the noise level for each root sequence.

(4) Finally, the peak search (PeakSearch) function detects all signatures sent by the users in the current time window. It additionally evaluates the transmission timing advance corresponding to the approximate user distance.
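Steps (2) to (4) can be sketched in a few lines (a time-domain toy model with hypothetical names and sizes; the real implementation correlates in the frequency domain with an IFFT, and the threshold rule shown is illustrative, not the exact eNodeB criterion):

```python
import cmath

def circcorr_energy(x, s):
    """Energy of the circular correlation of received samples x with a
    root sequence s (step 2). Computed in the time domain here; the
    frequency-domain multiply followed by an IFFT is equivalent."""
    n = len(x)
    return [
        abs(sum(x[i] * s[(i + k) % n].conjugate() for i in range(n))) ** 2
        for k in range(n)
    ]

def detect_peaks(energies, factor=8.0):
    """Steps 3-4: flag energies exceeding `factor` times a noise-floor
    estimate (the median is used here as a robust illustrative estimate)."""
    noise_floor = sorted(energies)[len(energies) // 2]
    return [(k, e) for k, e in enumerate(energies) if e > factor * noise_floor]

# A preamble equal to the root sequence cyclically shifted by 3 samples
# yields an energy peak at lag 3; the lag encodes the timing advance.
n = 8
root = [cmath.exp(1j * cmath.pi * 3 * i * i / n) for i in range(n)]  # toy Zadoff-Chu-like sequence
rx = [root[(i + 3) % n] for i in range(n)]
energies = circcorr_energy(rx, root)
print(energies.index(max(energies)))  # 3
print(detect_peaks([1.0, 1.2, 0.9, 40.0, 1.1, 0.8, 1.0, 1.3]))  # [(3, 40.0)]
```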


In general, depending on the cell size, three parameters of the RACH may be varied: the number of receive antennas, the number of root sequences, and the number of times the same preamble is repeated. The 115 km cell case implies 4 antennas, 64 root sequences, and 2 repetitions.

5.3. Architecture Exploration

5.3.1. Algorithm Model. The goal of this exploration is to determine, through simulation, the architecture best suited to the 115 km cell RACH-PD algorithm. The RACH-PD algorithm behavior is described as an SDF graph in PREESM. A static deployment enables static memory allocation, removing the need for runtime memory administration. The algorithm can easily be adapted to different configurations by tuning the HSDF parameters. Using the same approach as in [47], a valid scheduling derived from the representation in Figure 16 can be described by the compact expression:

(8Preproc)(4(64(InitPower(2((SingleZCProc)(PowAcc))))PowAcc))(64NoiseFloorThreshold)PeakSearch

We can separate the preamble detection algorithm into 4 steps:

(1) preprocessing step: (8Preproc);
(2) circular correlation step: (4(64(InitPower(2((SingleZCProc)(PowAcc))))PowAcc));
(3) noise floor threshold step: (64NoiseFloorThreshold);
(4) peak search step: PeakSearch.

Each of these steps is mapped onto the available cores and will appear in the exploration results detailed in

Figure 17: Four architectures explored.
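As a sanity check, the repetition counts in the compact schedule expression of Section 5.3.1 multiply out to the operation count reported in the text:

```python
# Operation count implied by the compact expression
# (8Preproc)(4(64(InitPower(2((SingleZCProc)(PowAcc))))PowAcc))
# (64NoiseFloorThreshold)PeakSearch
preproc = 8
circular_correlation = 4 * (64 * (1 + 2 * (1 + 1)) + 1)  # 4 antennas, 64 root sequences, 2 repetitions
noise_floor_threshold = 64
peak_search = 1
total = preproc + circular_correlation + noise_floor_threshold + peak_search
print(total)  # 1357
```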

Section 5.3.4. The given description generates 1,357 operations; this does not include the communication operations necessary in the case of multicore architectures. Placing these operations by hand onto the different cores would be extremely time-consuming. As seen in Section 4.2, the PREESM rapid prototyping tool offers automatic scheduling, avoiding the problem of manual placement.

5.3.2. Architecture Exploration. The four architectures explored are shown in Figure 17. The cores are all homogeneous Texas Instruments TMS320C64x+ Digital Signal Processors (DSPs) running at 1 GHz [48]. The connections are made via DMA links. The first architecture is a single-core DSP such as the TMS320TCI6482. The second architecture is a dual-core, with each core similar to that of the TMS320TCI6482. The third is a tri-core and is equivalent to the new TMS320TCI6487 [40]. Finally, the fourth, a quad-core, is a theoretical architecture for exploration only. The exploration goal is to determine the number of cores required to run the RACH-PD algorithm in a 115 km cell, and how best to distribute the operations over the given cores.

5.3.3. Architecture Model. To solve the deployment problem, each operation is assigned an experimental timing (in terms of CPU cycles). These timings are measured with

Figure 18: Timings of the RACH-PD algorithm schedule on target architectures.

deployments of the actors on a single C64x+. Since the C64x+ is a 32-bit fixed-point DSP core, the algorithms must be converted from floating-point to fixed-point prior to these deployments. The EDMA is modelled as a non-blocking medium (see Section 4.2.2) transferring data at a constant rate and with a given set-up time. Assuming the EDMA has the same performance from L2 internal memory to L2 internal memory as the EDMA3 of the TMS320TCI6482 (see [42]), the transfer of N bytes via EDMA should take approximately:

transfer(N) = 135 + (N / 3.375) cycles.

Consequently, in the PREESM model, the average data rate used for simulation is 3.375 GBytes/s and the EDMA set-up time is 135 cycles.

5.3.4. Architecture Choice. The PREESM automatic scheduling process is applied to each architecture. The workflow used is close to that of Figure 8. The simulation results obtained are shown in Figure 18. The list scheduling heuristic is used with the loosely-timed, approximately-timed, and accurately-timed ABCs. Due to the 115 km cell constraints, preamble detection must be processed in less than 4 ms.

The experimental timings were measured on code executions using a TMS320TCI6487. The timings feeding the simulation are measured in loops, each calling a single function with the L1 cache activated (for more details about the C64x+ cache, see [48]). This represents the application behavior when local data access is ideal, and thus leads to an optimistic simulation.

The RACH application is well suited to a parallel architecture, as the addition of one core reduces the latency dramatically. Two cores can process the algorithm within a time frame close to the real-time deadline with the loosely-timed and approximately-timed models, but the high data transfer contention and large number of transfers disqualify this solution when the accurately-timed model is used.
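The EDMA timing model translates directly into a small cost function (constants taken from the text above; the function name is ours):

```python
EDMA_SETUP_CYCLES = 135        # set-up time measured for the TMS320TCI6482 EDMA3
EDMA_BYTES_PER_CYCLE = 3.375   # i.e., 3.375 GBytes/s at 1 GHz

def edma_transfer_cycles(n_bytes):
    """Estimated EDMA transfer duration: transfer(N) = 135 + N / 3.375 cycles."""
    return EDMA_SETUP_CYCLES + n_bytes / EDMA_BYTES_PER_CYCLE

print(edma_transfer_cycles(27 * 1024))  # 8327.0 cycles for a 27-KByte buffer
```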
The 3-core solution is clearly the best one: its CPU loads (less than 86% with the accurately-timed ABC) are satisfactory and do not justify the use of a fourth core, as can be seen in Figure 18. The high data contention in this case study justifies the use of several ABC models: simple models for fast results, and more complex models to dimension the system correctly.

Figure 19: TMS320TCI6487 architecture.

5.4. Code Generation. The code libraries developed for the TMS320TCI6487 and the code automatically generated by PREESM (see Section 4.3) were used in this experiment. Details of the code libraries and code optimizations are given in [2]. The architecture of the TMS320TCI6487 is shown in Figure 19. The communication between the cores is performed by copying data with the EDMA3 from one core's local L2 memory to another core's L2 memory. The cores are synchronized using inter-core interruptions.

Two modes are available for memory sharing: in symmetric mode, each CPU has 1 MByte of L2 memory, while in asymmetric mode core-0 has 1.5 MBytes, core-1 has 1 MByte, and core-2 has 0.5 MByte. In the PREESM generated code, the sizes of the statically allocated buffers are 1.65 MBytes for one core, 1.25 MBytes for a second core, and 200 kBytes for a third core. The asymmetric mode is chosen to fit this memory distribution. As the necessary memory is larger than the internal L2, some buffers are manually placed in external memory and the L2 cache [40] is activated. A memory minimization ABC in PREESM would help this process, targeting memory objectives while mapping the actors onto the cores.

Modeling the RACH-PD algorithm in PREESM while varying the architectures (1-, 2-, 3- and 4-core-based) enabled the exploration of multiple solutions under the criterion of meeting the stringent latency requirement. Once the target architecture is chosen, PREESM can be set up to generate a framework code for the simulated solution. As explained in the previous paragraph, the buffers statically allocated by the generated code were larger than the physical memory of the target architecture. This

necessitated manually moving some of the noncritical buffers to external memory. The generated code, representing a priori a good deployment solution, had when executed on the target an average load of 78% per core while meeting the real-time deadline. The goal of decoding a RACH-PD every 4 ms on the TMS320TCI6487 is thus successfully accomplished. A simplified view of the code execution is shown in Figure 20. The execution of the generated code led to a realistic assessment of a deployment very close to that predicted with the accurately-timed ABC, where the simulation had shown an average load per core of around 80%.

Figure 20: Execution of the RACH-PD algorithm on a TMS320TCI6487.

These results show that prototyping the application with PREESM makes it possible to assess different solutions by simulation and to give the designer a realistic picture of the multicore solution before solving complex mapping problems. This global result needs to be tempered, because one week of manual memory optimization effort, as well as some manual constraints, was necessary to obtain such a fast deployment. New ABCs computing the costs of the semaphores used for synchronization, and the memory balance between the cores, will reduce this manual optimization time.

6. Conclusions

The intent of this paper was to detail the functionalities of a rapid prototyping framework comprising the Graphiti, SDF4J, and PREESM tools. The main features of the framework are the generic graph editor, the graph transformation module, the automatic static scheduler, and the code generator. With this framework, a user can describe and simulate a deployment, choose the most suitable architecture for the algorithm, and generate an efficient framework code. The framework has been successfully tested on the RACH-PD algorithm from the 3GPP LTE standard: the RACH-PD algorithm, with 1,357 operations, was deployed on a tri-core DSP and the simulation was validated by the execution of the generated code. In the near future, an increasing number of CPUs will be available in complex Systems on Chip. Developing methodologies and tools to efficiently partition code on these architectures is thus an increasingly important objective.

References

[1] E. A. Lee, "The problem with threads," Computer, vol. 39, no. 5, pp. 33–42, 2006.
[2] M. Pelcat, S. Aridhi, and J. F. Nezan, "Optimization of automatically generated multi-core code for the LTE RACH-PD algorithm," in Proceedings of the Conference on Design and Architectures for Signal and Image Processing (DASIP '08), Bruxelles, Belgium, November 2008.
[3] J. Piat, S. S. Bhattacharyya, M. Pelcat, and M. Raulet, "Multicore code generation from interface based hierarchy," in Proceedings of the Conference on Design and Architectures for Signal and Image Processing (DASIP '09), Sophia Antipolis, France, September 2009.
[4] M. Pelcat, P. Menuet, S. Aridhi, and J.-F. Nezan, "Scalable compile-time scheduler for multi-core architectures," in Proceedings of the Conference on Design and Architectures for Signal and Image Processing (DASIP '09), Sophia Antipolis, France, September 2009.
[5] "Eclipse Open Source IDE," http://www.eclipse.org/downloads.
[6] T. Grandpierre and Y. Sorel, "From algorithm and architecture specifications to automatic generation of distributed real-time executives: a seamless flow of graphs transformations," in Proceedings of the 1st ACM and IEEE International Conference on Formal Methods and Models for Co-Design (MEMOCODE '03), pp. 123–132, 2003.
[7] "OpenMP," http://openmp.org/wp.
[8] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou, "Cilk: an efficient multithreaded runtime system," Journal of Parallel and Distributed Computing, vol. 37, no. 1, pp. 55–69, 1996.
[9] "OpenCL," http://www.khronos.org/opencl.
[10] "The Multicore Association," http://www.multicore-association.org/home.php.
[11] "PolyCore Software Poly-Mapper tool," http://www.polycoresoftware.com/products3.php.
[12] E. A. Lee, "Overview of the Ptolemy project," Technical Memorandum UCB/ERL M01/11, University of California, Berkeley, Calif, USA, 2001.
[13] J. Eker and J. W. Janneck, "CAL language report," Tech. Rep. ERL Technical Memo UCB/ERL M03/48, University of California, Berkeley, Calif, USA, December 2003.
[14] S. S. Bhattacharyya, G. Brebner, J. Janneck, et al., "OpenDF: a dataflow toolset for reconfigurable hardware and multicore systems," ACM SIGARCH Computer Architecture News, vol. 36, no. 5, pp. 29–35, 2008.
[15] G. Karsai, J. Sztipanovits, A. Ledeczi, and T. Bapty, "Model-integrated development of embedded software," Proceedings of the IEEE, vol. 91, no. 1, pp. 145–164, 2003.
[16] P. Belanovic, An open tool integration environment for efficient design of embedded systems in wireless communications, Ph.D. thesis, Technische Universität Wien, Wien, Austria, 2006.
[17] T. Grandpierre, C. Lavarenne, and Y. Sorel, "Optimized rapid prototyping for real-time embedded heterogeneous multiprocessors," in Proceedings of the 7th International Workshop on Hardware/Software Codesign (CODES '99), pp. 74–78, 1999.
[18] C.-J. Hsu, F. Keceli, M.-Y. Ko, S. Shahparnia, and S. S. Bhattacharyya, "DIF: an interchange format for dataflow-based design tools," in Proceedings of the 3rd and 4th International Workshops on Computer Systems: Architectures, Modeling, and Simulation (SAMOS '04), vol. 3133 of Lecture Notes in Computer Science, pp. 423–432, 2004.
[19] S. Stuijk, Predictable mapping of streaming applications on multiprocessors, Ph.D. thesis, Technische Universiteit Eindhoven, Eindhoven, The Netherlands, 2007.
[20] B. D. Theelen, "A performance analysis tool for scenario-aware streaming applications," in Proceedings of the 4th International Conference on the Quantitative Evaluation of Systems (QEST '07), pp. 269–270, 2007.
[21] "Graphiti Editor," http://sourceforge.net/projects/graphiti-editor.
[22] E. A. Lee and D. G. Messerschmitt, "Synchronous data flow," Proceedings of the IEEE, vol. 75, no. 9, pp. 1235–1245, 1987.
[23] "SDF4J," http://sourceforge.net/projects/sdf4j.
[24] "PREESM," http://sourceforge.net/projects/preesm.
[25] J. W. Janneck, "NL—a network language," Tech. Rep., ASTG Technical Memo, Programmable Solutions Group, Xilinx, July 2007.
[26] SPIRIT Schema Working Group, "IP-XACT v1.4: a specification for XML meta-data and tool interfaces," Tech. Rep., The SPIRIT Consortium, March 2008.
[27] U. Brandes, M. Eiglsperger, I. Herman, M. Himsolt, and M. S. Marshall, "GraphML progress report, structural layer proposal," in Proceedings of the 9th International Symposium on Graph Drawing (GD '01), P. Mutzel, M. Junger, and S. Leipert, Eds., pp. 501–512, Springer, Vienna, Austria, 2001.
[28] J. Piat, M. Raulet, M. Pelcat, P. Mu, and O. Déforges, "An extensible framework for fast prototyping of multiprocessor dataflow applications," in Proceedings of the 3rd International Design and Test Workshop (IDT '08), pp. 215–220, Monastir, Tunisia, December 2008.
[29] "w3c XML standard," http://www.w3.org/XML.
[30] "w3c XSLT standard," http://www.w3.org/Style/XSL.
[31] "Grammatica parser generator," http://grammatica.percederberg.net.
[32] J. W. Janneck and R. Esser, "A predicate-based approach to defining visual language syntax," in Proceedings of IEEE Symposium on Human-Centric Computing (HCC '01), pp. 40–47, Stresa, Italy, 2001.
[33] J. L. Pino, S. S. Bhattacharyya, and E. A. Lee, "A hierarchical multiprocessor scheduling framework for synchronous dataflow graphs," Tech. Rep., University of California, Berkeley, Calif, USA, 1995.
[34] S. Sriram and S. S. Bhattacharyya, Embedded Multiprocessors: Scheduling and Synchronization, CRC Press, Boca Raton, Fla, USA, 1st edition, 2000.
[35] V. Sarkar, Partitioning and scheduling parallel programs for execution on multiprocessors, Ph.D. thesis, Stanford University, Palo Alto, Calif, USA, 1987.
[36] O. Sinnen and L. A. Sousa, "Communication contention in task scheduling," IEEE Transactions on Parallel and Distributed Systems, vol. 16, no. 6, pp. 503–515, 2005.
[37] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman, San Francisco, Calif, USA, 1990.
[38] Y.-K. Kwok, High-performance algorithms of compile-time scheduling of parallel processors, Ph.D. thesis, Hong Kong University of Science and Technology, Hong Kong, 1997.
[39] F. Ghenassia, Transaction-Level Modeling with SystemC: TLM Concepts and Applications for Embedded Systems, Springer, New York, NY, USA, 2006.
[40] "TMS320TCI6487 DSP platform," Texas Instruments product bulletin (SPRT405).
[41] "TMS320 DSP/BIOS user's guide (SPRU423F)".
[42] B. Feng and R. Salman, "TMS320TCI6482 EDMA3 performance," Technical Document SPRAAG8, Texas Instruments, November 2006.
[43] "RapidIO," http://www.rapidio.org/home.
[44] "The 3rd Generation Partnership Project," http://www.3gpp.org.
[45] J. Jiang, T. Muharemovic, and P. Bertrand, "Random access preamble detection for long term evolution wireless networks," US patent no. 20090040918.
[46] "3GPP technical specification group radio access network; evolved universal terrestrial radio access (EUTRA) (Release 8), 3GPP, TS 36.211 (V 8.1.0)".
[47] S. S. Bhattacharyya and E. A. Lee, "Memory management for dataflow programming of multirate signal processing algorithms," IEEE Transactions on Signal Processing, vol. 42, no. 5, pp. 1190–1201, 1994.
[48] "TMS320C64x/C64x+ DSP CPU and instruction set," Reference Guide SPRU732G, Texas Instruments, February 2008.

Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 976296, 13 pages
doi:10.1155/2009/976296

Research Article

Run-Time HW/SW Scheduling of Data Flow Applications on Reconfigurable Architectures

Fakhreddine Ghaffari, Benoit Miramond, and François Verdier

ETIS Laboratory, UMR 8051, ENSEA, University of Cergy-Pontoise, CNRS, 6 avenue du Ponceau, BP 44, 95014 Cergy-Pontoise Cedex, France

Correspondence should be addressed to Fakhreddine Ghaffari, fakhreddine.ghaff[email protected]

Received 1 March 2009; Revised 22 July 2009; Accepted 7 October 2009

Recommended by Markus Rupp

This paper presents an efficient dynamic and run-time Hardware/Software scheduling approach. The scheduling heuristic consists in mapping online the different tasks of a highly dynamic application in such a way that the total execution time is minimized. We consider soft real-time, data flow graph oriented applications for which the execution time is a function of the nature of the input data. The target architecture is composed of two processors connected to a dynamically reconfigurable hardware accelerator. Our approach takes advantage of the reconfiguration property of the considered architecture to adapt the processing to the system dynamics. We compare our heuristic with another similar approach. We present the results of our scheduling method on several image processing applications. Our experiments include simulation and synthesis results on a Virtex-5-based platform. These results show better performance than existing methods.

Copyright © 2009 Fakhreddine Ghaffari et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

One of the main steps of the HW/SW codesign of a mixed (software and hardware) electronic system is the scheduling of the application tasks on the processing elements (PEs) of the platform. Scheduling an application formed by N tasks on M target processing units consists in finding a feasible partitioning, in which the N tasks are launched onto their corresponding M units, and an ordering on each PE for which the total execution time of the application meets the real-time constraints. This problem of multiprocessor scheduling is known to be NP-hard [1, 2], which is why we propose a heuristic approach.

Many applications, in particular in image processing (e.g., an intelligent embedded camera), have data-dependent execution times, according to the nature of the input to be processed. In this kind of application, the implementation is often stressed by real-time constraints, which demand adaptive computation capabilities. In this case, according to the nature of the input data, the system must adapt its behaviour to the dynamics of the evolution of the data and continue to meet the variable needs of required calculation

(in quantity and/or in type). Examples of applications where the processing needs change in quantity (i.e., the computation load is variable) come from intelligent image processing, where the duration of the processing can depend on the number of objects in the image (motion detection, tracking, etc.) or on the number of areas of interest (contour detection, labelling, etc.). We can also cite the run-time selection of different filters according to the texture of the processed image (here it is the type of processing that is variable). Another example of a dynamic application is video encoding, where the run-length encoding (RLE) of frames depends on the information within the frames.

For these dynamic applications, many implementations are possible. In this paper we consider an intelligent embedded camera, for which we propose a new design approach compared to classical worst-case implementations. Our method consists in evaluating the application context online and adapting its implementation onto the different targeted processing units by launching a run-time partitioning algorithm. The online modification of the partitioning result can also be a solution for fault tolerance, by reassigning


at run time the tasks of the faulty unit to other operational units [3]. This also requires revising the scheduling strategy. More precisely, the scheduling result must change at run time in two cases.

(i) First, to evaluate the partitioning result: after each modification of the task implementations, we need to know the new total execution time, and this is only possible by rescheduling all the tasks.

(ii) Second, by modifying the scheduling result we can obtain a better total execution time, one which meets the real-time constraint without modifying the partitioning. This is because the characteristics of the tasks (mainly their execution times) change according to the nature of the input data.

In that context, the choice of the implementation of the scheduler is of major importance and depends on the complexity of the heuristic. Indeed, with our method the decisions taken online by the scheduler can be very time-consuming: a software implementation of the proposed scheduling strategies would delay the application tasks. For this reason, we propose in this work a hardware implementation of our scheduling heuristic. With this implementation, the scheduler takes only a few clock cycles, so we can easily call it at run time without penalty on the total execution time of the application.

The primary contribution of our work is the concept of an efficient online scheduling heuristic for heterogeneous multiprocessor platforms. This heuristic provides good results for both hardware tasks (on the FPGA) and software tasks (on the targeted general-purpose processors), as well as an extensive speedup through the hardware implementation of the scheduling heuristic itself. Finally, the implementation of our scheduler allows the system to adapt itself to the application context in real time. We have simulated and synthesized our scheduler targeting an FPGA (Xilinx Virtex 5) platform.
We have tested the scheduling technique on several image processing applications implemented onto a heterogeneous target architecture composed of two processors coupled with a configurable logic unit (FPGA). The remainder of this paper is organized as follows. Section 2 presents related works on hardware/software scheduling approaches. Section 3 introduces the framework of our scheduling problem. Section 4 presents the proposed approach. Section 5 shows the experimental results, and finally Section 6 concludes this paper.

2. Related Works The field of study that tries to find an execution order for a set of tasks that meets system design objectives (e.g., minimizing the total application execution time) has been widely covered in the literature. In [4–6] the problem of HW/SW scheduling for system-on-chip platforms with a dynamically reconfigurable logic architecture is exhaustively studied. Moreover, several works deal with scheduling algorithms implemented in hardware [7–9]. Scheduling in such systems

is based on priorities. Therefore, an obvious solution is to implement priority queues. Many hardware architectures for these queues have been proposed: binary tree comparators, FIFO queues plus a priority encoder, and a systolic array priority queue [7]. Nevertheless, all these approaches are based on a fixed-priority static scheduling technique. Moreover, most of the proposed hardware approaches address the implementation of only one scheduling algorithm (e.g., Earliest Deadline First) [9, 10]. Hence they are inefficient and inappropriate for systems where the required scheduling behavior changes at run time. Also, the performance of systems with data-dependent task execution times can be improved by using dynamic schedulers instead of static (compile-time) scheduling techniques [11, 12]. In our work, we propose a new hardware-implemented approach which computes task priorities at run time based on the characteristics of each task (execution time, graph dependencies, etc.). Our approach is dynamic in the sense that the execution order is decided at run time, and it supports a heterogeneous (HW/SW) multiprocessor architecture. The idea of dynamic partitioning/scheduling is based on the dynamic reconfiguration of the target architecture. FPGAs [13, 14] increasingly offer very attractive reconfiguration capabilities: partial or total, static or dynamic. The reconfiguration latency of dynamically reconfigurable devices represents a major problem that must not be neglected. Several references address temporal partitioning for reconfiguration latency minimization [15]. Moreover, configuration prefetching techniques are used to minimize the reconfiguration overhead. A similar technique to lighten this overhead is developed in [16] and is integrated into an existing design environment: a prefetch and replacement unit modifies the schedule and significantly reduces the latency, even for highly dynamic tasks.
In fact, there are two different approaches in the literature. The first reduces the reconfiguration overhead by modifying scheduling results. The second distinguishes between scheduling and reconfiguration: reconfiguration occurs only if the HW/SW partitioning step requires it, and the scheduling algorithm is needed only to validate the partitioning result. After partitioning, the implementation of each task is unchanged and reconfiguration is no longer necessary. Scheduling aims at finding the best execution time for a given implementation strategy. Since scheduling does not change the partitioning decision, it does not take reconfiguration time into account. In this paper, we focus only on the scheduling strategy in the second case. We assume that the reconfiguration aspects are taken into account during the HW/SW partitioning step (the decision of task implementation). Furthermore, we addressed this step in our previous works [17].

3. Problem Definition 3.1. Target Architecture. The target architecture is depicted in Figure 1. It is a heterogeneous architecture, which contains two software processing units: a Master Processor and a Slave

Figure 1: The target architecture.

Processor. The platform also contains a hardware processing unit, the Reconfigurable Computing Unit (RCU), and shared memory resources. The software processing units are Von Neumann monoprocessing systems and execute only a single task at a time. Each hardware task (implemented on the RCU) occupies a tile of the reconfigurable area [18]. The tile size is the same for all tasks to facilitate placement and routing on the RCU; we choose, for example, the tile size of the task that uses the maximum of resources on the RCU (by "resource" we mean the logic elements used by the RCU to map any task). The RCU can be reconfigured partially or totally. Each hardware task is represented by a partial bitstream. All bitstreams are stored in the contexts memory (the memory shared between the processors and the RCU in Figure 1). These bitstreams are loaded onto the RCU before scheduling, to reconfigure the FPGA according to the run-time partitioning results [17]. The HW/SW partitioning result can change at run time according to the temporal characteristics of the tasks [6]. In [17] we proposed an HW/SW partitioning approach based on HW → SW and SW → HW task migrations. The idea of task migration consists in accelerating the task(s) which become critical by moving their implementations from software units to hardware units, and decelerating the tasks which become noncritical by returning them to the software units. After each new HW/SW partitioning result, the scheduler must evaluate this solution by providing the corresponding total execution time. It thus has a real-time constraint, since it is launched at run time. With this approach of dynamic partitioning/scheduling, the target architecture is very flexible: it can adapt itself even to very dynamic applications. 3.2. Application Model. The considered applications are data-flow-oriented applications such as image processing, audio processing, or video processing.
To model this kind of application we consider a Data Flow Graph (DFG) (an example is depicted in Figure 2), that is, a directed acyclic graph where nodes are processing functions and edges describe communication between tasks (data dependencies

MS: Master processor; SL: Slave processor; HW: FPGA

Figure 2: An Example of DFG with 6 tasks.

between tasks). The size of the DFG depends on the functional partitioning of the application, and hence on the number of tasks and edges. We can notice that the structure of the DFG has a great effect on the execution time of the scheduling operations. A low-granularity DFG makes the system easy to predict because task execution times do not vary considerably, thus limiting timing-constraint violations. On the other hand, for a very low granularity, the number of tasks in a DFG of great size explodes, and the communications between tasks become unmanageable. Each node of the DFG represents a specific task of the application. For each task there can be up to three different implementations: a hardware implementation (HW) placed in the FPGA, a software implementation running on the master processor (MS), and another software implementation running on the slave processor (SL). Each node of Figure 2 is annotated with two data: the implementation (MS, SL, or HW) and the execution time of the task. Similarly, each edge is annotated with the communication time between the two nodes (tasks). Each task of the DFG is characterized by the following four parameters: (a) Texe (execution time), (b) Impl (implementation on the RCU, on the master processor, or on the slave processor), (c) Nbpred (number of predecessor tasks), (d) Nbsucc (number of successor tasks). All the tasks of a DFG are thus modeled identically, and the only real-time constraint is on the total execution time. At each scheduler invocation, this total execution time corresponds to the longest path in the mapped task graph. It then depends both on the application partitioning and on the chosen order of execution on the processors.
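As an illustrative encoding of this model (our own, not the authors'), the 6-task DFG of Figure 2 can be written as task records carrying the parameters above; the implementations and execution times are read from Figures 2 and 8 and are assumptions for the sake of the example.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    impl: str                                 # "MS", "SL", or "HW"
    texe: int                                 # execution time (Texe)
    succ: list = field(default_factory=list)  # successor task names

# The DFG of Figure 2 (communication times on edges omitted for brevity).
dfg = {
    "A": Task("MS", 5, ["B", "C"]),
    "B": Task("SL", 3, ["D"]),
    "C": Task("SL", 7, ["E", "F"]),
    "D": Task("HW", 18, []),
    "E": Task("MS", 2, []),
    "F": Task("MS", 13, []),
}

# Nbsucc and Nbpred are derived from the edge structure.
nbsucc = {n: len(t.succ) for n, t in dfg.items()}
nbpred = {n: sum(n in t.succ for t in dfg.values()) for n in dfg}
```

This keeps the four parameters of each task explicit while letting Nbpred and Nbsucc be recomputed whenever the graph structure changes.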

4. Proposed Approach The applications are periodic. In one period, all the tasks of the DFG must be executed. In the image processing, for instance, the period is the execution time needed to


For all software tasks do {
    Compute ASAP; the task with the minimum ASAP will be chosen
    If (equality of ASAP)
        Compute urgency; the task with the maximum urgency will be chosen
        If (equality of urgency)
            Compare execution times; the task with the maximum execution time will be chosen
}

Algorithm 1: Principle of our scheduling policy.
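A minimal software sketch of this selection rule (field names are ours): the three criteria of Algorithm 1 can be encoded as a sort key, with urgency and execution time negated so that larger values win the tie-breaks.

```python
def select_task(ready_tasks):
    """Pick the next software task to launch: minimum ASAP date first,
    then maximum urgency, then maximum execution time (Algorithm 1)."""
    return min(ready_tasks,
               key=lambda t: (t["asap"], -t["urgency"], -t["texe"]))

ready = [
    {"name": "B", "asap": 5, "urgency": 0,  "texe": 3},
    {"name": "C", "asap": 5, "urgency": 13, "texe": 7},
]
# Equal ASAP dates, so the more urgent task C is scheduled first.
print(select_task(ready)["name"])  # prints "C"
```

The tuple key makes the lexicographic priority of the three criteria explicit: urgency and execution time are only consulted when the earlier criteria tie.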

process one image. The scheduling must occur online at the end of the execution of all the tasks, and when a violation of the real-time constraint is predicted. Hence the partitioning/scheduling result is applied to the next period (the next image, for image processing applications). Our run-time scheduling policy is dynamic, since the execution order of the application tasks is decided at run time. For the tasks implemented on the RCU, we assume that the hardware resources are sufficient to execute in parallel all hardware tasks chosen by the partitioning step. Therefore the only condition for launching their execution is the satisfaction of all data dependencies; that is to say, a task may begin execution only after all its incoming edges have been satisfied. For the tasks implemented on the software processors, the launch conditions are the following: (1) the satisfaction of all data dependencies; (2) the availability of the software unit. A task can thus be in one of four states: (i) waiting, (ii) running, (iii) ready, (iv) stopped. A task is in the waiting state while it waits for the end of execution of one or several predecessor tasks. When a software processing unit has finished executing a task, new tasks may become ready for execution, provided all their dependencies have been completed. A task can be stopped in case of preemption or after finishing its execution. The states of the processing units (MS, SL, and HW) of our target architecture are: execution, reconfiguration, or idle. In the following, we explain the principle of our approach as well as a hardware implementation of the proposed HW/SW scheduler. As explained in Algorithm 1, the basic idea of our scheduling heuristic is to decide task priorities according to three criteria.

The first criterion is the As Soon As Possible (ASAP) time: the task with the shortest ASAP date is launched first. The second criterion is the urgency: the task with the maximum urgency has priority over the others. This criterion is based on the nature of the successors of the task, and it is employed only if at least two tasks are equal on the first criterion. If there is still equality on this second criterion, we compare the last criterion, which is the execution time of the tasks: we launch first the task with the larger execution time. We use these criteria to choose among two or several software tasks (on the master or on the slave) ready to run. 4.0.1. The Urgency Criterion. The urgency criterion is based on the implementation of tasks and the implementations of their successors. A task is considered urgent when it is implemented on a software unit (master or slave) and has one or more successor tasks implemented on different units (the hardware unit or the other software unit). Figure 3 shows three examples of DFGs. In Figure 3(a), task C is implemented on the slave processor and is followed by task D, which is implemented on the RCU. Thus the urgency (Urg) of task C is the execution time of its successor (Urg(C) = 13). In example (b) it is task B which is followed by a task D implemented on a different unit (the master processor). In the last example (c), both tasks B and C are urgent, but task B is more urgent than task C since its successor has a greater execution time than the successor of task C. When a task has several successors with different implementations, the urgency is the maximum of the execution times of those successors. In the general case, when the direct successor of a task A has the same implementation as A but itself has a successor with a different implementation, the latter feeds its urgency back to task A.
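The urgency computation, including the feedback through same-unit successors, can be sketched as follows (a dictionary-based DFG with our own field names, not the authors' implementation):

```python
def urgency(name, dfg):
    """Urgency of a task: the maximum execution time among successors
    mapped to a different unit; a successor on the same unit propagates
    its own urgency back (the feedback case described above)."""
    task = dfg[name]
    u = 0
    for s in task["succ"]:
        if dfg[s]["impl"] != task["impl"]:
            u = max(u, dfg[s]["texe"])   # successor on a different unit
        else:
            u = max(u, urgency(s, dfg))  # same unit: propagate urgency back
    return u

# Case (a) of Figure 3: C on the slave is followed by D on the RCU.
case_a = {
    "C": {"impl": "SL", "texe": 7, "succ": ["D"]},
    "D": {"impl": "HW", "texe": 13, "succ": []},
}
print(urgency("C", case_a))  # prints 13
```

The recursion terminates because the DFG is acyclic; in the hardware scheduler this computation is performed by the per-task IPs rather than by a recursive traversal.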
We show the scheduling result for case (a) when the urgency criterion is respected in Figure 3(d), and when it is not in Figure 3(e). We can notice for all the examples of DFGs in Figure 3 that the urgency criterion makes the best choice to obtain a minimum total execution time. The third criterion (the execution time) is an arbitrary choice and very rarely has an impact on the total execution time. We can also notice that our scheduler supports the dynamic creation and deletion of tasks. These online services are only possible when keeping a fixed structure of the DFG along the execution; in that case the dependencies between tasks are known a priori. Dynamic deletion is then possible by assigning a null execution time to the tasks which are not active, and dynamic creation by assigning their execution times when they become active. This scheduling strategy needs an online computation of several criteria for all the software tasks in the DFG. We first tried to implement this new scheduling policy on a processor. Figure 4 shows the computation time of our scheduling method when implemented on an Intel Core 2


Duo CPU running at 2.8 GHz with 4 GB of RAM. We can notice that the average computation time of the scheduler is about 12 milliseconds per image. These experiments were done on an image processing application (the DFG depicted in Figure 12) whose processing period for one image is 19 milliseconds. The scheduling (with this software implementation) therefore takes about 63% of the computation time of one image on a desktop computer. We can conclude that, in an embedded context, a software implementation of this strategy is incompatible with real-time constraints. We describe in the following an optimized hardware implementation of our scheduler.

Figure 3: Case examples of urgency computing. (a) Urg[C] = 13; (b) Urg[B] = 5; (c) Urg[B] = 8, Urg[C] = 2; (d) case of DFG (a), task B before task C; (e) case of DFG (a), task C before task B.

Figure 4: Execution time (ms) per image of the software implementation of the scheduler, on an Intel Core 2 Duo CPU (2.8 GHz, 4 GB RAM).

4.1. Hardware Scheduler Architecture. In this section, we describe the proposed architecture of our scheduler. This architecture is shown in Figure 5 for a DFG example of three tasks. It is divided into four main parts:

(1) the DFG IP Sched (the middle part surrounded by a dashed line in the figure);

(2) the DFG Update (DFG Up in the figure);

(3) the MS Manager (SWTM);

(4) the Slave Manager (SLTM). The basic idea of this hardware architecture is to parallelize the scheduling of the processing tasks as much as possible: in the best case, on an architecture with infinite resources, we can schedule all the tasks of the DFG in parallel. We associate with the application DFG a modified graph with the same structure, composed of IP nodes (each IP represents a task). Therefore in the best case, where tasks are independent, we could schedule all the tasks of the DFG in only one clock cycle. To also parallelize the management of the software execution times, we associate a hardware module with each software unit: (i) the Master Task Manager (SWTM in Figure 5), (ii) the Slave Task Manager (SLTM in Figure 5). These two modules manage the order of task executions and compute the processor execution time for each one. The input signals of this scheduler architecture are the following. (i) A pointer in memory to the implementations of all the tasks. We have three kinds of implementation (RCU, master, and slave); with the signals SW and HW we can encode these three possibilities. (ii) The measured execution time of each task (Texe). (iii) The clock signal and the reset.


Figure 5: An example of the scheduler architecture for a DFG of three tasks.

The output signals are the following. (i) The total execution time after scheduling all the tasks (Texe Total). (ii) The All Done signal, which indicates the end of the scheduling. (iii) Scheduled DFG, a pointer to the scheduling result matrix to be sent to the operating system (or any simple executive). (iv) Nb Task and Nb Task Slave, the number of tasks scheduled on the master and on the slave, respectively. These two signals were added solely for the purpose of simulation in ModelSim (to check the scheduling result); in the real case we do not need them, since this information comes from the partitioning block. The last part is the DFG Update, which updates the result matrix after each task is scheduled. In the following paragraphs, we detail each part of this architecture. 4.1.1. The DFG IP Sched Block. This block contains N components (N is the number of tasks in the application). With each task we associate an IP component which computes the intrinsic characteristics of this task (urgency, ASAP, ready state, etc.). It also computes the total execution time for the entire graph. The proposed architecture of this IP is shown in Figure 6 (in the appendix). For each task the implementation (PE) and the execution time are fixed, so the role of this IP is to compute the start time of the task and to define its state. This is done by taking into account the state of the corresponding target (master, slave, or RCU). It then iterates along the DFG structure to determine a total execution ordering and to assign the start times. This IP also computes the urgency criterion of critical tasks according to the implementations and execution times of their successors. If a task is implemented on the RCU, it is launched as soon as all its predecessors are done. So the scheduling time of hardware tasks depends on the number of tasks that

we can run in parallel. For example, the IP can schedule all hardware tasks that can run in parallel in a single clock cycle. For the software tasks (on the master or on the slave), the scheduling takes one clock cycle per task. Thus the computing time of the hardware scheduler only depends on the result of the HW/SW partitioning. 4.1.2. The DFG Update Block. When a DFG is scheduled, the result modifies the DFG into a new structure. The DFG Update block (Figure 7 in the appendix) generates new edges (dependencies between tasks) after scheduling, in order to give a total order of execution on each computing unit according to the scheduling results. We represent the dependencies between tasks in the DFG by a matrix where the rows represent the successors and the columns represent the predecessors. For example, Figure 8 depicts the dependency matrix corresponding to the DFG of Figure 2. After scheduling, the resulting matrix is an update of the original one: it contains more dependencies. Producing it is the role of the DFG Update block. 4.1.3. The MS Manager Block. The objective of this module is to schedule the software tasks according to the algorithm given above. Figure 9 in the appendix presents the architecture of the Master Manager block. The input signal ASAP SW represents the ASAP times of all the tasks. The Urgency Time signal represents the urgency of each task of the application. The SW Ready signal represents the ready signals of all the software tasks. The MIN ASAP TASKS signal represents all the "ready" tasks having the same minimum ASAP time. The MAX CT TASKS signal represents all the "ready" tasks having the same maximum urgency. The tasks which satisfy the two preceding criteria are represented by the Tasks Ready signal. The Task Scheduled signal determines the single software task which will be scheduled.
With this signal, it is possible to choose the correct value of the TEXE SW signal and thereafter to set the new value of the SW Total Time signal. A single clock cycle is necessary to schedule a single software task. By analogy, the Slave Manager block has the same role as the MS Manager block; from a scheduling point of view there is no difference between the two processors. 4.2. HW/SW Scheduler Outputs. In this section, we describe how the results of our scheduler are processed by a target module such as an executive or a Real-Time Operating System (RTOS). As depicted in Figure 8, the output of our run-time HW/SW scheduler is an n × n matrix, where n is the total number of tasks in the DFG. Figure 10 shows the scheduling result for the DFG depicted in Figure 12. This matrix is used by a centralized Operating System (OS) to fill its task queues for the three computing units. The table shown in Figure 11 is a compilation of the results of both the partitioning and scheduling operations.


Ready-signal logic shown in Figure 6: SW_Ready = SW and Ready and (not Done); SL_Ready = SL and Ready and (not Done); HW_Ready = (not SW) and (not SL) and Ready and (not Done).
Figure 6: An IP representing one task.

Figure 7: The DFG updating architecture.

The OS browses the matrix row by row. Whenever it finds a "1", it puts the task whose number corresponds to the column into the waiting state. At the end of a task's execution, the corresponding waiting tasks on each unit become either ready or running. A task is in the ready state only when all its dependencies are done and the target unit is busy; thus there is no ready state for hardware tasks. It should be noted that if the OS runs on the master processor, for example, the latter will be interrupted each time the OS executes.
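This bookkeeping can be sketched as follows (our simplification of the OS role, with the matrix layout of Figure 8): a task whose row still contains a pending predecessor waits, and completing a task clears its column and releases its successors.

```python
def initial_states(matrix):
    """Row i lists the predecessors of task i (Figure 8 layout): a task
    with a pending predecessor waits, the others can be launched."""
    return ["waiting" if any(row) else "ready" for row in matrix]

def complete(matrix, states, j):
    """Task j has finished: clear column j and release the tasks that
    are left with no pending predecessor."""
    states[j] = "stopped"
    for i, row in enumerate(matrix):
        row[j] = 0
        if states[i] == "waiting" and not any(row):
            states[i] = "ready"
    return states

# "Before scheduling" matrix of Figure 8: only A can start; finishing A
# releases B and C.
m = [[0, 0, 0, 0, 0, 0],
     [1, 0, 0, 0, 0, 0],
     [1, 0, 0, 0, 0, 0],
     [0, 1, 0, 0, 0, 0],
     [0, 0, 1, 0, 0, 0],
     [0, 0, 1, 0, 0, 0]]
states = initial_states(m)
states = complete(m, states, 0)
```

In the real system the ready/running distinction additionally depends on whether the target software unit is busy; the sketch only tracks dependency satisfaction.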

5. Experiments and Results With the idea of covering a wide range of data-flow applications, we conducted experiments on real and artificial applications. In this paper we present a summary of the results obtained on three case studies in the domain of real-time image processing: (i) a motion detection application, (ii) an artificial extension of this detection application, (iii) a robotic vision application.

(a) Before scheduling. DFG of Figure 2: A (MS, 5) → B (SL, 3) and C (SL, 7); B → D (HW, 18); C → E (MS, 2) and F (MS, 13). Dependency matrix (rows: successors, columns: predecessors):

      A B C D E F
  A   0 0 0 0 0 0
  B   1 0 0 0 0 0
  C   1 0 0 0 0 0
  D   0 1 0 0 0 0
  E   0 0 1 0 0 0
  F   0 0 1 0 0 0

(b) After scheduling, new dependencies impose a total order on each software unit (B before C on the slave, E before F on the master):

      A B C D E F
  A   0 0 0 0 0 0
  B   1 0 0 0 0 0
  C   1 1 0 0 0 0
  D   0 1 0 0 0 0
  E   0 0 1 0 0 0
  F   0 0 1 0 1 0

Figure 8: Matrix result after scheduling.

The second case study is a complex DFG which contains different classical structures (fork, join, sequential). This DFG is depicted in Figure 12; it contains twenty tasks. Each task can be implemented on a software computation unit (master or slave processor) or on the reconfigurable RCU. The original DFG is the model of an image processing application: motion detection on a fixed image background. This application is composed of 10 sequential tasks (IDs 1 to 10 in Figure 12). We added 10 other virtual tasks to obtain a complex DFG containing the different possible parallel structures. This type of parallel program paradigm (fork, join, etc.) arises in many application areas. In order to test the presented scheduling approach, we performed a large number of experiments in which several scenarios of HW/SW partitioning results were analyzed. As an example, Figure 12 presents the scheduling result when tasks 3, 4, 7, 8, 11, 12, 17, 18, and 20 are implemented in hardware. As explained in Section 4.1, new dependencies (dotted lines) are added to the original graph to impose a total order on each processor. In this figure all execution times are in milliseconds (ms). We also conducted experiments on a more dynamic application from the robotic vision domain [19]. It consists of a subset of a cognitive system allowing a robot equipped with a CCD camera to navigate and to perceive objects. The global architecture in which the visual system is integrated is biologically inspired and based on the interactions between the processing of the visual flow and the robot movements. In order to learn its environment, the system identifies keypoints in the landscape.

Keypoints are detected in a sampled scale space based on an image pyramid, as presented in Figure 13. The application is dynamic in the sense that the number of keypoints depends on the scene observed by the camera; the execution times of the Search and Extract tasks in the graph therefore change dynamically (see [19] for more details about this application). 5.1. Comparison Results. Throughout our experiments, we compared the result of our scheduler with the one given by the HCP (Heterogeneous Critical Path) algorithm developed by Bjorn-Jorgensen and Madsen [20]. This algorithm is an approach to scheduling on a heterogeneous multiprocessor architecture. It starts with the calculation of priorities for each task associated with a processor. A task is chosen depending on the length of its critical path (CPL): the task with the largest minimum CPL has the highest priority. We compared against this method because it has been shown to outperform several other approaches (MD, MCP, PC, etc.) [21]. A summary of the experiments is presented in Figure 14. Each column gives average values for one of the three presented applications with different partitioning strategies. Comparing the first and second rows, our scheduling method provides consistent results: the quality of the scheduling solutions found by our method and by HCP is similar. Moreover, our method obtains better results for the complex Icam application: HCP returns an average total execution time of 69 milliseconds, whereas our method returns only 58 milliseconds for the same DFG. For the simple Icam application, the DFG is completely sequential, so whatever the scheduling method the result is always the same. For the robotic vision application, we find the same total execution time with the two methods because of the existence of a critical path in the DFG which always sets the overall execution time.
We also measured the execution overhead of the proposed scheduling algorithm when it is implemented in software (third row of Figure 14) and in hardware (fourth row). Since the scheduling overhead depends on the number of tasks in the application, we only indicate average values in Figure 14. For example, Figure 15 presents the execution time of the hardware scheduler (in cycles) according to the number of software tasks. From Figure 15, it may be concluded that when the result of partitioning changes at run time, the computation time needed by our scheduler to schedule all the DFG tasks depends heavily on this modification of task implementations. So: (1) the partitioning result and the DFG structure have a great impact on the scheduler computation time; (2) the longest sequential sequence of tasks corresponds to the case where all tasks are in software (each task takes one clock cycle); this case corresponds to the maximum scheduling computation time,



Figure 9: The module of the MS Manager.

Table 1: Device utilization summary after synthesis. Rows: number of slice registers; number of slice LUTs; number of fully used bit slices; number of bonded IOBs; number of BUFG/BUFGCTRLs; scheduler frequency.


If Max{|PredMVy|, |PredMVx|} > 2∗Min{|PredMVy|, |PredMVx|}, the pattern direction is chosen to be E, N, W, or S, as shown in Figure 2(b). (3.4.4) If the predicted MV does not fall onto any coordinate axis, and Max{|PredMVy|, |PredMVx|} ≤ 2∗Min{|PredMVy|, |PredMVx|}, the pattern direction is chosen to be NE, NW, SW, or SE, as shown in Figure 2(c).

(a) E pattern; (b) NE pattern; (c) N pattern; (d) NW pattern; (e) W pattern; (f) SW pattern; (g) S pattern; (h) SE pattern. Legend: points with the predicted MV and extension; initial SPs along the quarter circle.

Figure 1: Possible adaptive search patterns designed.

3.5. Size of the Search Pattern. To simplify the selection of the search pattern size, the horizontal and vertical components of the motion predictor are again utilized. The size of the search pattern, that is, the radius of the designed quarter polar search pattern, is simply defined as

R = Max{|PredMVy|, |PredMVx|},  (1)

where R is the radius of the quarter circle, and PredMVy and PredMVx are the vertical and horizontal components of the motion predictor, respectively.
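The radius rule of Eq. (1) and the direction rule of Section 3.4 can be sketched together as follows; the sign conventions (positive x to the east, positive y to the north) are our assumption, not stated in the text.

```python
def pattern_radius(pred_mv):
    """Radius of the quarter polar search pattern, Eq. (1)."""
    x, y = pred_mv
    return max(abs(x), abs(y))

def pattern_direction(pred_mv):
    """Direction selection (Section 3.4 sketch): an axis-aligned pattern
    when one component dominates, a diagonal pattern otherwise."""
    x, y = pred_mv
    if x == 0 and y == 0:
        return "square"                  # small square pattern, Figure 2(a)
    if max(abs(x), abs(y)) > 2 * min(abs(x), abs(y)):
        # one axis dominates: E, N, W, or S pattern
        if abs(x) >= abs(y):
            return "E" if x > 0 else "W"
        return "N" if y > 0 else "S"
    # comparable components: NE, NW, SE, or SW pattern
    ns = "N" if y > 0 else "S"
    ew = "E" if x > 0 else "W"
    return ns + ew
```

For example, a predictor of (3, -1) selects an E pattern of radius 3, while (2, 2) selects an NE pattern.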

3.6. Initial Search Points. After the direction and size of a search pattern are decided, some search points are selected in the initial search stage. Each search point represents a block to be checked with intensity matching. The initial search points include (when the MVP is not zero): (1) the predicted motion vector point; (2) the center point of the search pattern, which represents the candidate block in the current frame; (3) some points on the directional axis;

Table 1: A look-up table for the vertical and horizontal components of initial search points on the NW/NE/SW/SE axis.

  R      0  1  2  3  4  5  6  7  8  9  10
  |SPx|  0  1  2  2  3  4  4  5  6  6  7
  |SPy|  0  1  2  2  3  4  4  5  6  6  7

(4) the extension predicted motion vector point (the point with prolonged length of motion predictor), and the contraction predicted motion vector point (the point with contracted length of motion predictor)

 W

R 6 7 8 9 10 —

Normally, if no overlapping exists, there will be totally seven search points selected in the initial search stage, in order to get a point with the MME, which can be used as a basis for the refined search stage thereafter. If a search point is on the axis of NW, NE, SW, or SE, the corresponding decomposed coordinates of that point will satisfy,

N NW

|SPx |

E

R=



2

(SPx )2 + SP y ,

(2)

where SPx and SP y are the vertical and horizontal components of a search point on the axis of NW, NE, SW, or SE. Because |SPx | is equal to |SP y | in this case, then SW



SE

Point with the predicted MV Max{|PredMVy |, |PredMVx|} > 2∗ Min{|PredMVy |, |PredMVx|} N/E/W/S pattern selected (b)

N NW

NE

W



 

 

R = 2 · |SPx | = 2 · SP y .

S

E

Obviously, neither |SPx | nor |SP y | is an integer, as R is always an integer-based radius for block processing. To simplify and reduce the computational complexity of a search point definition on the axis of NW, NE, SW or SE, a look-up table (LUT) is employed, as listed in Table 1. The values of SPx and SP y are predefined according to the radius R, and now they are integers. Figure 3 illustrates some examples of defined initial search points with the look-up table. When the radius R > 20, the value of |SPx | and |SP y | can be determined by    R   |SPx | = SP y  = Round √ .

2

SW

SE S

Point with the predicted MV Max{|PredMVy |, |PredMVx|} ≤ 2∗ Min{|PredMVy |, |PredMVx|} NW/NE/SW/SE pattern selected (c)

Figure 2: (a) Square pattern size = 1, (b) N/W/E/S search pattern selected, (c) NW/NE/SW/SE search pattern selected.

(3)

(4)
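For the R > 20 case, equation (4) can be applied directly. The sketch below assumes Python's `round`; for R ≤ 20 the predefined Table 1 LUT values would be used instead (they differ from plain rounding for some small R).

```python
import math

def diagonal_sp_components(r):
    """Equation (4): for R > 20, |SPx| = |SPy| = Round(R / sqrt(2)).

    For R <= 20 the predefined LUT values of Table 1 are used instead;
    note that the small-R table entries do not all coincide with plain
    rounding of R / sqrt(2).
    """
    c = round(r / math.sqrt(2))
    return c, c
```

Since √2 · Round(R/√2) stays within about √2/2 of R, the rounded diagonal point nearly preserves the radius relation of equation (3).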

There are two initial search points related to the extended motion predictors. One has a prolonged length of the motion predictor (extension version), whereas the other has a reduced length of the motion predictor (contraction version). Two scaled factors are adaptively defined according to the radius R, so that the lengths of these two initial search points can be easily derived from the original motion predictor, as shown in Table 2. The scaled factors are chosen so that the initial search points related to the extension and contraction of the motion predictor are distributed reasonably around the motion predictor point, in order to obtain better motion predictor points.

Figure 3: (a) An example of initial search points defined for the E pattern using the look-up table; (b) an example of initial search points defined for the NE pattern using the look-up table. Each panel shows the point with the predicted MV, the initial SPs for the selected pattern, and the components SPx and SPy determined by the look-up table.

Table 2: Definition of scaled factors for initial search points related to the motion predictor.

R        0∼2    3∼5    6∼10    >10
SFE      3      2      1.5     1.25

R        0∼10   >10
SFC      0.5    0.75
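In code, the Table 2 factors and the extension (EMVP) and contraction (CMVP) points they define might look like the following sketch; truncation toward zero for non-integer components is the behavior described in Section 3.6.

```python
def scale_factors(r):
    """Table 2: scaled factors for extension (SFE) and contraction (SFC)."""
    if r <= 2:
        sfe = 3.0
    elif r <= 5:
        sfe = 2.0
    elif r <= 10:
        sfe = 1.5
    else:
        sfe = 1.25
    sfc = 0.5 if r <= 10 else 0.75
    return sfe, sfc

def extended_predictors(mvp):
    """Compute EMVP = SFE * MVP and CMVP = SFC * MVP.

    Non-integer components are truncated toward zero, as required for
    integer block positions.
    """
    px, py = mvp
    r = max(abs(px), abs(py))       # radius per equation (1)
    sfe, sfc = scale_factors(r)
    trunc = int                     # int() truncates toward zero
    emvp = (trunc(sfe * px), trunc(sfe * py))
    cmvp = (trunc(sfc * px), trunc(sfc * py))
    return emvp, cmvp
```

For example, an MVP of (6, 3) has radius 6, so SFE = 1.5 and SFC = 0.5, giving EMVP = (9, 4) after truncation and CMVP = (3, 1).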

Therefore, the initial search points related to the motion predictor can be identified as

EMVP = SFE · MVP,                                                 (5)
CMVP = SFC · MVP,                                                 (6)

where MVP is a point representing the median vector predictor, SFE and SFC are the scaled factors for the extension and contraction, respectively, and EMVP and CMVP are the initial search points with the prolonged and contracted lengths of the predicted motion vector, respectively. If a horizontal or vertical component of EMVP or CMVP is not an integer after the scaling, the component value is truncated to an integer for video block processing.

3.7. Algorithm Procedure

Step 1. Get a predicted motion vector (MVP) for the candidate block in the current frame for the initial search stage.

Step 2. Find the adaptive direction of the search pattern by rules (3.4.1)–(3.4.4), determine the pattern size R with (1), and choose initial SPs in the reference frame along the quarter circle and the predicted MV, using the look-up table together with (5) and (6).

Step 3. Check the initial search points with block pixel intensity measurement, and take the MME point, which has the minimum SAD, as the search center for the next search stage.

Step 4. Refine the local search by applying a unit-sized square pattern around the MME point (search center), and check its neighboring points with block pixel intensity measurement. If, after this search, the MME point is still the search center, stop searching and obtain the final motion vector for the candidate block, corresponding to the final best matching point identified in this step. Otherwise, set the new MME point as the search center and apply the square pattern search to that MME point again, until the stop condition is satisfied.

3.8. Algorithm Complexity. As the ACQPPS is a predicted and adaptive multistep algorithm for motion search, its computational complexity depends exclusively on the object motions contained in the video sequences and the scenarios for estimation processing. The main overhead of the ACQPPS algorithm lies in the block SAD computations. Other algorithm overhead, such as the selection of the adaptive search pattern direction and the determination of the search arm and initial search points, is incurred merely by a combination of if-condition judgments, and can thus even be ignored when compared with the block SAD calculations. If large, quick, and complex object motions are included in the video sequences, the number of search points (NSP) will be increased accordingly. On the contrary, if small, slow, and simple object motions are shown in the sequences, the ACQPPS algorithm requires only a few processing steps to finish the motion search, that is, the number of search points is correspondingly reduced. Unlike ME algorithms with fixed search ranges, for example, the full search algorithm, it is impractical to precisely identify the number of computational steps for ACQPPS. On average, however, an approximation

Figure 4: A hardware architecture for the ACQPPS motion estimator. [Block diagram: a look-up table and motion predictor storage feed the initial search processing unit, followed by the refined search processing unit; an 18 × 18 register array holds reference block data and a 16 × 16 register array holds current block data, both feeding the pipelined multilevel SAD calculator and SAD comparator, which output the MME point and the generated MV; current and reference video frame storage supplies the reference and residual data.]

equation can be utilized to represent the computational complexity of the ACQPPS method. The worst case of motion search for a video sequence is to use the 4 × 4 block size, if a fixed block size is employed. In this case, the number of search points for ACQPPS motion estimation is usually around 12 ∼ 16, according to practical motion search results. Therefore, the algorithm complexity can be simply identified, in terms of image size and frame rate, as

C ≈ 16 × Block SAD computations × Number of blocks in a video frame × Frame rate,    (7)

where the block size is 4 × 4 for the worst case of computations. For a standard software implementation, each 4 × 4 block SAD calculation actually requires 16 subtractions and 15 additions, that is, 31 arithmetic operations. Accordingly, the complexity of ACQPPS is approximately 14 and 60 times less than that required by the full search algorithm with the [−7, +7] and [−15, +15] search ranges, respectively. In practice, the ACQPPS complexity is roughly at the same level as the simple DS algorithm.
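Plugging illustrative numbers into (7) (CIF at 30 frames/s, worst-case 4 × 4 blocks, 16 search points per block, 31 arithmetic operations per 4 × 4 SAD) gives an order-of-magnitude estimate. The full-search comparison counts below, (2·7+1)² and (2·15+1)² candidate points, are an assumption about how the 14× and 60× figures arise.

```python
# Worst-case ACQPPS operation count for CIF (352x288) at 30 frames/s.
blocks_per_frame = (352 // 4) * (288 // 4)      # 4x4 blocks in a CIF frame
ops_per_sad = 31                                 # 16 subtractions + 15 additions
search_points = 16                               # upper end of the 12-16 range
acqpps_ops = search_points * ops_per_sad * blocks_per_frame * 30

# Full search checks every offset in the range: (2*7+1)^2 or (2*15+1)^2 points.
ratio_7 = (2 * 7 + 1) ** 2 / search_points       # candidate-point ratio, ~14x
ratio_15 = (2 * 15 + 1) ** 2 / search_points     # candidate-point ratio, ~60x

print(acqpps_ops, round(ratio_7), round(ratio_15))   # 94279680 14 60
```

So the worst case works out to roughly 94 million arithmetic operations per second, and the candidate-point ratios reproduce the 14× and 60× reductions quoted above.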

4. Hardware Architecture of ACQPPS Motion Estimator

The ACQPPS is designed with low complexity, which makes it appropriate for implementation on a hardware architecture. The hardware architecture takes advantage of pipelining and parallel operations of the adaptive search patterns, and utilizes a fully pipelined multilevel SAD calculator to improve the computational efficiency and, therefore, reasonably reduce the clock frequency. As mentioned above, the computation of the motion vector for the smallest block shape, that is, the 4 × 4 block, is the worst case for calculation. The worst case refers to the percentage usage of the memory bandwidth. It is necessary that the computational efficiency be as high as possible in the worst case. All of the other block shapes can be constructed from 4 × 4 blocks, so that the computation of distortion in 4 × 4 partial solutions and result additions can solve all of the other block shapes.

4.1. ACQPPS Hardware Architecture. An architecture for the ACQPPS motion estimator is shown in Figure 4. There are two main stages for the motion vector search, the initial and the refined search, indicated by the hardware semaphore. In the initial search stage, the architecture utilizes the previously calculated motion vectors to produce an MVP for the current block. Some initial search points are generated utilizing the MVP and LUT to define the search range of the adaptive patterns. After an MME point is found in this stage, the search refinement takes effect, applying the square pattern around MME points iteratively to obtain a final best MME point, which indicates the final best MV for the current block.

For motion estimation, the reference frames are stored in SRAM or DRAM, while the current frame and produced MVs are stored in dual-port memory (BRAM). Meanwhile, the LUT also uses the BRAM to facilitate the generation of initial search points. Figure 5 illustrates the data search flow of the ACQPPS hardware IP with regard to each block motion search. The initial search processing unit (ISPU) is used to generate the initial search points and then perform the initial motion search. To generate the initial search points, previously calculated MVs and an LUT are employed. The LUT contains the vertical and horizontal components of the initial search points defined in Table 1. Both the produced MVs and the LUT values are stored in BRAM, where they can be accessed through two independent data ports in parallel to facilitate the processing. When the initial search stage is finished, the refined search processing unit (RSPU) is enabled. It employs the square pattern around the MME point derived in the initial search stage to refine the local motion search.
The local refined search steps might be performed iteratively a few times, until the MME point remains at the search center

Figure 5: (a) The data search flow for individual block motion estimation when the MVP is not zero; (b) the data search flow for individual block motion estimation when the MVP is zero. [Both flows preload the current block data into the 16 × 16 registers and generate the MVP; reference data for the (0, 0) offset point, the MVP offset point, and the other initial offset SPs are loaded into the 18 × 18 register array with SAD calculation enabled; after the initial MME point position is obtained, new offset SPs are generated with a diamond or square pattern around the MME point for the refinement search, iterating until the MME point is the search center, which yields the final MV for the current block.] Note: the clock cycles for each task are not on an exact timing scale and are for illustration purposes only.

after certain refined steps. The search data flow of the ACQPPS IP architecture conforms to the algorithm steps defined in Section 3.7, with further improvement and optimization for the hardware parallel and pipelining features.

4.2. Fully Pipelined SAD Calculator. As the main ME operations are SAD calculations, which have a critical impact on the performance of a hardware-based motion estimator, a fully pipelined SAD calculator is designed to speed up the SAD computations. Figure 6 displays the basic architecture of the pipelined SAD calculator, with the processing support

of variable block sizes. According to the VBS indicated by the block shape and enable signals, the SAD calculator can employ appropriate parallel and pipelined adder operations to generate the SAD result for a searched block. With the parallel calculations of the basic processing unit (BPU), it takes 4 clock cycles to finish the 4 × 4 block SAD computations (one BPU per 4 × 4 block SAD), and 8 clock cycles to produce a final SAD result for a 16 × 16 block. To support the VBS feature, different block shapes can be processed based on the prototype of the BPU. In that case, a 16 × 16 macroblock is divided into 16 basic 4 × 4 blocks.
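The basic 4 × 4 SAD that a BPU computes (16 absolute differences reduced by 15 additions) is, in software form, simply:

```python
def sad_4x4(cur, ref):
    """Sum of absolute differences between two 4x4 pixel blocks.

    `cur` and `ref` are 4x4 nested lists of pixel values; this is the
    software equivalent of one BPU operation.
    """
    return sum(abs(cur[y][x] - ref[y][x]) for y in range(4) for x in range(4))
```

Larger block SADs are then just sums of these 4 × 4 partial results, which is exactly why the calculator is built around 4 × 4 BPUs.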

Figure 6: An architecture for the pipelined multilevel SAD calculator. [Four BPUs, each fed through multiplexers by register data arrays holding four of the current 4 × 4 blocks (0–15) and the corresponding reference 4 × 4 blocks, produce the 4 × 4 SADs; accumulators and multiplexers under data selection control combine them into 4 × 8, 8 × 4, 8 × 8, 8 × 16, 16 × 8, and 16 × 16 SAD results.]

Figure 7: Organization of variable block sizes based on basic 4 × 4 blocks. The 16 basic blocks are numbered 0–15 in raster order within the 16 × 16 macroblock.

Organization of VBS using 4 × 4 blocks:
16 × 16: {0, 1, . . . , 14, 15}
16 × 8: {0, 1, . . . , 6, 7}, {8, 9, . . . , 14, 15}
8 × 16: {0, 1, 4, 5, 8, 9, 12, 13}, {2, 3, 6, 7, 10, 11, 14, 15}
8 × 8: {0, 1, 4, 5}, {2, 3, 6, 7}, {8, 9, 12, 13}, {10, 11, 14, 15}
8 × 4: {0, 1}, {2, 3}, {4, 5}, {6, 7}, {8, 9}, {10, 11}, {12, 13}, {14, 15}
4 × 8: {0, 4}, {1, 5}, {2, 6}, {3, 7}, {8, 12}, {9, 13}, {10, 14}, {11, 15}
4 × 4: {0}, {1}, . . . , {14}, {15}

Computing stages for VBS using 4 × 4 blocks:
16 × 16: Stage 1 {0, 1, 2, 3}; Stage 2 {4, 5, 6, 7}; Stage 3 {8, 9, 10, 11}; Stage 4 {12, 13, 14, 15}
16 × 8: Stage 1 {0, 1, 2, 3}/{8, 9, 10, 11}; Stage 2 {4, 5, 6, 7}/{12, 13, 14, 15}
8 × 16: Stage 1 {0, 1}/{2, 3}; Stage 2 {4, 5}/{6, 7}; Stage 3 {8, 9}/{10, 11}; Stage 4 {12, 13}/{14, 15}
8 × 8: Stage 1 {0, 1}/{2, 3}/{8, 9}/{10, 11}; Stage 2 {4, 5}/{6, 7}/{12, 13}/{14, 15}
8 × 4: Stage 1 {0, 1}/{2, 3}/{4, 5}/{6, 7}/{8, 9}/{10, 11}/{12, 13}/{14, 15}
4 × 8: Stage 1 {0}/{1}/{2}/{3}/{8}/{9}/{10}/{11}; Stage 2 {4}/{5}/{6}/{7}/{12}/{13}/{14}/{15}
4 × 4: Stage 1 {0}/{1}/ . . . /{14}/{15}

The other six block sizes in H.264, that is, 16 × 16, 16 × 8, 8 × 16, 8 × 8, 8 × 4, and 4 × 8, can be organized by combinations of the basic 4 × 4 blocks, as shown in Figure 7, which also describes the computing stages for each variable-sized block constructed on the basic 4 × 4 blocks to obtain the VBS SAD results.
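This grouping means any VBS SAD is just a sum of 4 × 4 partial SADs. A sketch for a few of the partitions, assuming the raster sub-block numbering of Figure 7 (0–3 on the top row, 12–15 on the bottom):

```python
# 4x4 sub-block indices making up each partition of a 16x16 macroblock,
# following the raster numbering of Figure 7 (row-major, 4 blocks per row).
VBS_GROUPS = {
    "16x16": [list(range(16))],
    "16x8":  [[0, 1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13, 14, 15]],
    "8x16":  [[0, 1, 4, 5, 8, 9, 12, 13], [2, 3, 6, 7, 10, 11, 14, 15]],
    "8x8":   [[0, 1, 4, 5], [2, 3, 6, 7], [8, 9, 12, 13], [10, 11, 14, 15]],
}

def vbs_sads(sad4x4, shape):
    """Compose variable-block-size SADs from the 16 per-sub-block SADs.

    `sad4x4` is a list of 16 partial SADs, one per basic 4x4 block.
    """
    return [sum(sad4x4[i] for i in group) for group in VBS_GROUPS[shape]]
```

Because each partition is a disjoint union of the 16 sub-blocks, the partition SADs for any shape always sum to the 16 × 16 SAD, which is the invariant the staged accumulator design exploits.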

For instance, the largest block, 16 × 16, requires 4 stages of parallel data loading from the register arrays into the SAD calculator to obtain a final block SAD result. In this case, the schedule of data loading is {0, 1, 2, 3} → {4, 5, 6, 7} → {8, 9, 10, 11} → {12, 13, 14, 15}, where "{}" indicates each parallel pixel data input with the current and reference block data.

4.3. Optimized Memory Structure. When a square pattern is used to refine the MV search results, the mapping of the memory architecture is important to speed up the performance. In our design, the memory architecture is mapped onto a 2D register space for the refined stage. The maximum size of this space is 18 × 18 at pixel bit depth, that is, the mapped register memory can accommodate the largest 16 × 16 macroblock plus the edge redundancy for the rotated data shift and storage operations. A simple combination of parallel register shifts and related data fetches from SRAM can reduce the memory bandwidth and facilitate the refinement processing, as many of the pixel data for searching in this stage remain unchanged. For example, 87.89% and 93.75% of the pixel data stay unchanged when the (1, −1) and (1, 0) offset searches for the 16 × 16 block are executed, respectively.

4.4. SAD Comparator. The SAD comparator is utilized to compare the previously generated block SAD results to obtain the final estimated MV, which corresponds to the best MME point, that is, the point with the minimum SAD and thus the lowest block pixel intensity difference. To select and compare the proper block SAD results, as shown in Figure 6, the signals of the different block shapes and computing stages are employed to determine the appropriate mode of minimum SAD to be utilized. For example, if the 16 × 16 block size is used for motion estimation, the 16 × 16 block data are loaded into the BPUs for SAD calculations. Each 16 × 16 block requires 4 computing stages to obtain a final block SAD result. In this case, the result mode of "16 × 8 or 16 × 16 SAD" is selected first. Meanwhile, the signal of the computing stages is also used to indicate valid input to the SAD comparator for retrieving the proper SAD results from the BPUs, and thus obtain the MME point with a minimum SAD for this block size.
The best MME point position obtained by the SAD comparator is further employed to produce the best-matched reference block data and the residual data, which are important to other video encoding functions, such as the mathematical transforms and motion compensation.
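The data reuse percentages quoted in Section 4.3 follow directly from the overlap of two 16 × 16 windows; a quick check:

```python
def reuse_fraction(dx, dy, n=16):
    """Fraction of pixels shared by an n x n window and its (dx, dy) offset."""
    ox = max(0, n - abs(dx))     # overlapping columns
    oy = max(0, n - abs(dy))     # overlapping rows
    return ox * oy / (n * n)

print(round(100 * reuse_fraction(1, -1), 2))   # 87.89
print(round(100 * reuse_fraction(1, 0), 2))    # 93.75
```

A (1, −1) diagonal step shares a 15 × 15 region (225 of 256 pixels), and a (1, 0) step shares 15 × 16 (240 of 256), matching the figures in the text.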

5. Virtual Socket System-on-Platform Architecture

The bitstream and hardware complexity analysis derived in Section 2 helps guide both the architecture design for prototyping the IP accelerated system and the optimized implementation of an H.264 BP encoding system based on that architecture.

5.1. The Proposed System-on-Platform Architecture. The variety of options, switches, and modes required in a video bitstream results in increasing interactions between different video tasks or function-specific IP blocks.

Consequently, function-oriented and fully dedicated architectures become inefficient if high levels of flexibility are not provided in the individual IP modules. For the architectures to remain efficient, the hardware blocks need optimization to deal with the increasing complexity of visual object processing. Besides, the hardware must remain flexible enough to manage and allocate the various resources, memories, and computational video IP accelerators for different encoding tasks. Given that programmable solutions with programmable and reconfigurable processing cores are preferable for video codec applications, heterogeneous functionality and algorithms can be executed on the same hardware platform and upgraded flexibly in software. To accelerate the performance of the processing cores, parallelization is demanded. The parallelization can take place at different levels, such as task, data, and instruction. Furthermore, specific video processing algorithms performed by IP accelerators or processing cores can improve the execution efficiency significantly. The requirements of H.264 video applications are so demanding that multiple acceleration techniques may have to be combined to meet real-time conditions. Programmable, reconfigurable, heterogeneous processors are therefore the preferable choice for an implementation of an H.264 BP video encoder. Architectures supporting concurrent performance and hardware video IP accelerators are well suited to achieving the real-time requirement imposed by the H.264 standard.

Figure 8 shows the proposed extensible system-on-platform architecture. The architecture consists of a programmable and reconfigurable processing core built upon an FPGA, and two extensible cores with a RISC and a DSP. The RISC can take charge of general sequence control and IP integration information, make mode selections for video coding, and configure basic operations, while the DSP can be utilized to process particular or flexible computational tasks. The processing cores are connected through the heterogeneous integrated on-platform memory spaces for the exchange of control information. The PCI/PCMCIA standard bus provides a data transfer solution for the host connected to the platform framework, and reconfigures and controls the platform in a flexible way. Desirable video IP accelerators are integrated in the system platform architecture to improve the encoding performance for H.264 BP video applications.

5.2. Virtual Socket Management. The concept of a virtual socket is thus introduced to the proposed system-on-platform architecture. The virtual socket is a solution for the host-platform interface, which can map a virtual memory space from the host environment to the physical storage on the architecture. It is an efficient mechanism for the management of the virtual memory interface and the heterogeneous memory spaces on the system framework. It enables a truly integrated, platform-independent environment for hardware-software codevelopment.

Figure 8: The proposed extensible system-on-platform hardware architecture. [On the FPGA, a virtual socket (VS) controller links the PCI bus interface, an interrupt line, and a local bus mux interface to the IP memory interface, which serves IP modules 1 through N and the BRAM, SRAM, and DRAM memory spaces; the extensible RISC and DSP cores attach alongside.]

Through the virtual socket interface, a few virtual socket application programming interface (API) function calls can be employed to make the generic hardware functional IP accelerators automatically map virtual memory addresses from the host system to the different memory spaces on the hardware platform. With the efficient virtual socket memory organization, the hardware abstraction layer therefore provides the system architecture with simplified memory access, interrupt-based control, and shielded interactions between the platform framework and the host system.

Through the integration of IP accelerators into the hardware architecture, the system performance is improved significantly. The codesigned virtual socket host-platform interface management and system-on-platform hardware architecture provide a useful embedded system approach for the realization of an advanced and complicated H.264 video encoding system. Hence, the IP accelerators on the FPGA, together with the extensible DSP and RISC, constitute an efficient programmable embedded solution for performing dedicated, real-time video processing tasks. Moreover, given the various video configurations for H.264 encoding, the physically implemented virtual socket interface and APIs easily enable the encoder configurations, data manipulations, and communications between the host computer system and the hardware architecture, in turn facilitating the system development for H.264 video encoders.

5.3. Integration of IP Accelerators. The IP accelerator illustrated here can be any H.264-compliant hardware block which is defined to handle a computationally extensive task for video applications, without a specific design for interaction controls between the IP and the host. For encoding, the basic modules to be integrated include the Motion Estimator, Discrete Cosine Transform and Quantization (DCT/Q), Deblocking Filter, and Context Adaptive Variable Length Coding (CAVLC), while the Inverse Discrete Cosine Transform and Inverse Quantization (IDCT/Q−1) and Motion Compensation (MC) are integrated for decoding. An IP memory interface is provided by the architecture to achieve the integration. All IP modules are connected to the IP memory interface, which provides accelerators a straight way to exchange data between the host and the memory spaces. Interrupt signals can be generated by accelerators when demanded. Moreover, to control the concurrent performance of the accelerators, an IP bus arbitrator is designed and integrated in the IP memory interface, for the interface controller to allocate appropriate memory operation time for each IP module and avoid the memory access conflicts possibly caused by heterogeneous IP operations.

IP interface signals are configured to connect the IP modules to the IP memory interface. Each accelerator is likely to have its own interface requirement for interaction between the platform and the IP modules. To make the integration easy, certain common interface signals are defined to link the IP blocks and the memory interface together. With the IP interface signals, the accelerators can focus on their own computational tasks, and the architecture efficiency is thus improved. Practically, the IP modules can be flexibly reused, extended, and migrated to other independent platforms very easily. Table 3 defines the necessary IP interface signals for the proposed architecture. IP modules only need to issue the memory requests and access parameters to the IP memory interface, and the rest of the tasks are taken over by the platform controllers. This feature is especially useful when the motion estimator is integrated in the system.

5.4. Host Interface and API Function Calls. The host interface provides the architecture with the necessary data for video processing. It can also control the video accelerators to operate in sequential or parallel mode, in accordance with the H.264 video codec specifications. The hardware-software partitioning is simplified so that the host interface can focus on the data communication and flow control for video tasks, while the hardware accelerators deal with local memory

accesses and video codec functions. Therefore, the software abstraction layer covers data exchange and video task flow control for the hardware performance. A set of related virtual socket API functions is defined to implement the host interface features. The virtual socket APIs are software function calls coded in C/C++, which perform data transfers and signal interactions between the host and the hardware system-on-platform. The virtual socket API, as a software infrastructure, can be utilized by a variety of video applications to control the hardware features defined. With the virtual socket APIs, the manipulation of video data in the local memories can be executed conveniently, and the efficiency of hardware and software interactions can thus be kept high.

Table 3: IP interface signals.

Interface signals                                   Description
Clk, reset, start                                   Platform signals for IP
Input Valid, Output Valid                           Valid strobes for IP memory access
Data In, Data Out                                   Input and output memory data for IP
Memory Read                                         IP request for memory read
Mem HW Accel, offset, count                         IP number, offset, and data count provided by IP/Host for memory read
Mem HW Accel1, offset1, count1                      IP number, offset, and data count provided by IP/Host for memory write
Mem Read Req, Mem Write Req                         IP bus request for memory access
Mem Read Release Req, Mem Write Release Req         IP bus release request for memory access
Mem Read Ack, Mem Write Ack                         IP bus request grant for memory access
Mem Read Release Ack, Mem Write Release Ack         IP bus release grant for memory access
Done                                                IP interrupt signal

6. System Optimizations

6.1. Memory Optimization. Due to the significant memory access requirement of video encoding tasks, a large number of clock cycles is consumed by the processing core while waiting for data fetches from the local memory spaces. To reduce or avoid this memory access overhead, the storage of video frame data can be organized across multiple independent memory spaces (SRAM and DRAM) and dual-port memory (BRAM), enabling parallel and pipelined memory access during video encoding. This optimization practically provides the system architecture with multiport memory storage that reduces the data access bandwidth on each individual memory space. Furthermore, with dual-port data access, DMA can be scheduled to transfer large amounts of video frame data through the PCI bus and the virtual socket interface in parallel with the operations of the encoding tasks, so that the processing core does not suffer memory and encoding latency. In this case, the data control flow of video encoding is managed to run the DMA transfers and IP accelerator operations in fully parallel and pipelined stages.

6.2. Architecture Optimization. As the main video encoding functions (such as ME, DCT/Q, IDCT/Q−1, MC, Deblocking Filter, and CAVLC) can be accelerated by IP modules, the interconnection between those video processing accelerators has an important impact on the overall system performance. To make the IP accelerators execute the main computational encoding routines in full parallel and pipelining mode, the IP integration architecture has to be optimized. A few caches are inserted between the video IP accelerators to facilitate concurrent encoding performance. The caches can be organized as parallel dual-port memory (BRAM) or pipelined memory (FIFO). The interconnection control of the data streaming between IP modules is defined using those caches to eliminate the extra overhead of the processing routines, so that the encoding functions can operate in fully parallel and pipelined stages.

6.3. Algorithm Optimization. The complexity of the encoding algorithms can be modified while the IP accelerators are being shaped. This optimization can be undertaken after choosing the most appropriate modes, options, and configurations for the H.264 BP applications. It is known that the motion estimator accounts for the major share of the encoding computations. To reduce the complexity of motion estimation, the very efficient and fast ACQPPS algorithm and its corresponding hardware architecture have been realized, based on the reduction of spatio-temporal correlation redundancy. Other algorithm optimizations can also be applied. For example, a simple optimization applies to the mathematical transform and quantization: as many blocks tend to have minimal residual data after motion compensation, the transform and quantization of a motion-compensated block can be skipped if the SAD of that block is lower than a prescribed threshold, in order to speed up the processing.

The combined application of the memory, algorithm, and architecture optimizations in the system can meet the major challenges of realizing the video encoding system. The optimization techniques reduce the encoding complexity and memory bandwidth, with a well-defined parallel and pipelined data streaming control flow, in order to implement a simplified H.264 BP encoder.

6.4. An IP Accelerated Model for Video Encoding. An optimized IP accelerated model is presented in Figure 9 for the realization of a simplified H.264 BP video encoder. In this architecture, BRAM, SRAM, and DRAM are used as multiport memories to facilitate video processing. The current video frame is transferred by DMA and stored in BRAM. Meanwhile, the IP accelerators fetch the data from BRAM and

EURASIP Journal on Embedded Systems


[Figure 9 block diagram: a virtual socket controller; BRAM, SRAM, and DRAM memories; IP memory interface and control; and the ME, DCT/Q, IDCT/Q−1, MC, deblocking, and CAVLC accelerators connected through BRAM(1)–BRAM(4) and FIFO(1)–FIFO(2) buffers.]

Figure 9: An optimized architecture for simplified H.264 BP video encoding system.
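The memory hierarchy in Figure 9 is sized so that whole frames fit in on-platform memory. The storage figures quoted later in Section 7 (a 148.5 KB CIF 4:2:0 frame; 55 frames in 8 MB SRAM, 882 in 128 MB DRAM) can be checked with quick arithmetic; this is an illustrative Python sketch, not part of the hardware design.

```python
# A CIF 4:2:0 frame is 352x288 luma samples plus two quarter-size chroma
# planes, i.e. 1.5 bytes per pixel.

def frames_that_fit(mem_bytes: int, width: int = 352, height: int = 288) -> int:
    """Number of whole YCbCr 4:2:0 frames a memory of mem_bytes can hold."""
    frame_bytes = width * height * 3 // 2   # 152,064 bytes = 148.5 KiB for CIF
    return mem_bytes // frame_bytes

cif_frame_kib = 352 * 288 * 3 / 2 / 1024            # 148.5 KiB, as stated
sram_frames = frames_that_fit(8 * 1024 * 1024)      # 8 MB SRAM   -> 55 frames
dram_frames = frames_that_fit(128 * 1024 * 1024)    # 128 MB DRAM -> 882 frames
print(cif_frame_kib, sram_frames, dram_frames)      # 148.5 55 882
```

The computed counts match the paper's figures of 55 SRAM and 882 DRAM reference frames.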

[Figure 10 flowchart: the current frame is transferred to BRAM; ME is triggered and, fetching data for each block, generates MVs and residual data saved to FIFO(1) and BRAM(4); in pipelined stages, DCT/Q, IDCT/Q−1, MC, and the deblocking filter generate block coefficients, reconstructed pixels, and filtered pixels saved to FIFO(2), BRAM(3), SRAM/DRAM, and BRAM(1)/BRAM(2); CAVLC works as a parallel process to produce bitstreams; the loop repeats until the frame is finished, then an interrupt ends encoding.]

Figure 10: A video task partitioning and data control flow for the optimized system architecture.
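A toy timing model makes the benefit of the pipelined schedule in Figure 10 concrete. The stage latencies below are illustrative placeholders, not measured values from the paper; the point is only that a buffered pipeline's makespan is dominated by its slowest stage (motion estimation).

```python
# Idealized makespan of a buffered linear pipeline processing n_blocks items:
# fill time (sum of latencies) plus one bottleneck interval per extra item.

def pipeline_makespan(stage_latencies, n_blocks):
    """Makespan of a linear pipeline processing n_blocks items."""
    bottleneck = max(stage_latencies)
    return sum(stage_latencies) + (n_blocks - 1) * bottleneck

# Hypothetical per-block latencies (arbitrary units); ME dominates, as in the paper.
stages = {"ME": 100, "DCT/Q": 10, "IDCT/Q^-1+MC+Deblock": 15, "CAVLC": 12}

serial = sum(stages.values()) * 396       # 396 MBs in a CIF frame, no overlap
pipelined = pipeline_makespan(list(stages.values()), 396)
print(serial, pipelined)                  # pipelined time ~ n_blocks * ME latency
```

With these numbers the pipelined makespan (39,637 units) approaches 396 × 100, while fully serial execution needs 54,252 units, illustrating why the encoding time is dominated by ME.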

start video encoding routines. Because BRAM is a dual-port memory, the overhead of DMA transfer is hidden by this dual-port cache. This IP accelerated system model combines the memory, algorithm, and architecture optimization techniques to reduce or eliminate the overhead resulting from the heterogeneous video encoding tasks. The

video encoding model provided in this architecture is compliant with the H.264 standard specifications. A data control flow based on the video task partitioning is shown in Figure 10. The data streaming makes clear that parallel and pipelined operations dominate the encoding tasks, which yields efficient processing performance.
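The SAD-threshold skip described in Section 6.3 can be sketched as follows. This is an illustrative Python model, not the hardware implementation; `SAD_SKIP_THRESHOLD` and the `transform_quantize` callback are hypothetical stand-ins, since the paper does not prescribe a specific threshold value.

```python
# Skip transform and quantization for well-predicted blocks: when the SAD of
# a motion-compensated block is below a threshold, its residual is negligible
# and DCT/Q can be bypassed to speed up encoding.

SAD_SKIP_THRESHOLD = 64   # illustrative value, not prescribed by the paper

def sad(current, predicted):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(c - p) for c, p in zip(current, predicted))

def encode_residual(current, predicted, transform_quantize):
    """Return quantized coefficients, or None when the block is skipped."""
    if sad(current, predicted) < SAD_SKIP_THRESHOLD:
        return None                        # well-predicted block: skip DCT/Q
    residual = [c - p for c, p in zip(current, predicted)]
    return transform_quantize(residual)

# A well-predicted block is skipped; a poorly predicted one is transformed
# (identity transform used here purely for illustration).
good = encode_residual([10] * 16, [10] * 16, lambda r: r)
bad = encode_residual([200] * 16, [10] * 16, lambda r: r)
print(good, bad[0])   # None 190
```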


7. Implementations

The proposed ACQPPS algorithm is integrated and verified under the H.264 JM Reference Software [28], while the hardware architectures, including the ACQPPS motion estimator and the system-on-platform framework, are synthesized with Synplify Pro 8.6.2 and implemented using Xilinx ISE 8.1i SP3 targeting the Virtex-4 XC4VSX35FF668-10, based on the WILDCARD-4 [29]. The system hardware architecture can process QCIF/SIF/CIF video frames with the support of the on-platform design resources. The Virtex-4 XC4VSX35 contains 3,456 Kb of BRAM [30], 192 XtremeDSP (DSP48) slices [31], and 15,360 logic slices, equivalent to almost 1 million logic gates. Moreover, the WILDCARD-4 integrates 8 MB of SRAM and 128 MB of DRAM. With these design resources and memory, whole QCIF/SIF/CIF video frames can be stored directly in the on-platform memories for efficient hardware processing. For example, if a CIF YUV (YCbCr) 4:2:0 video sequence is encoded with the optimized hardware architecture proposed in Figure 9, the total size of each current frame is 148.5 KB. Each current CIF frame can therefore be transferred from the host system and stored directly in BRAM for motion estimation and video encoding, whereas the generated reference frames are stored in SRAM or DRAM. The SRAM and DRAM can accommodate a maximum of 55 and 882 CIF reference frames, respectively, which is more than enough for the practical video encoding process.

Table 4: Video sequences for experiment with real-time frame rate.

Sequence (bit rate, Kbps)     Size/frame rate    No. of frames
Foreman (512)                 QCIF/30 fps        300
Carphone (256)                QCIF/30 fps        382
News (128)                    QCIF/30 fps        300
Miss Am (64)                  QCIF/30 fps        150
Suzie (256)                   QCIF/30 fps        150
Highway (192)                 QCIF/30 fps        2000
Football (2048)               SIF/30 fps         125
Table Tennis (1024)           SIF/30 fps         112
Foreman (1024)                CIF/30 fps         300
Mother Daughter (128)         CIF/30 fps         300
Stefan (2048)                 CIF/30 fps         90
Highway (512)                 CIF/30 fps         2000

Table 5: Video sequences for experiment with low bit and frame rates.

Sequence (bit rate, Kbps)     Size/frame rate    No. of frames
Foreman (90)                  QCIF/7.5 fps       75
Carphone (56)                 QCIF/7.5 fps       95
News (64)                     QCIF/15 fps        150
Miss Am (32)                  QCIF/15 fps        75
Suzie (90)                    QCIF/15 fps        75
Highway (64)                  QCIF/15 fps        1000
Football (256)                SIF/10 fps         40
Table Tennis (150)            SIF/10 fps         35
Foreman (150)                 CIF/10 fps         100
Mother Daughter (64)          CIF/10 fps         100
Stefan (256)                  CIF/10 fps         30
Highway (150)                 CIF/10 fps         665

7.1. Performance of ACQPPS Algorithm. A variety of video sequences containing different amounts of motion, listed in Table 4, is examined to verify the algorithm performance for real-time encoding (30 fps). All sequences are in YUV (YCbCr) 4:2:0 format, with the luminance component processed for ME. The frame size varies from QCIF to SIF and CIF, which is the typical testing condition, and the target bit rate ranges from 64 Kbps to 2 Mbps. SAD is used as the intensity matching criterion. The search window is [−15, +15] for FS. EPZS uses the extended diamond pattern and the PMVFAST pattern [32] for its primary and secondary refinement stages, and enables the window-based, temporal, and spatial memory predictors for advanced motion search. UMHexagonS utilizes search range prediction and a default scale factor optimized for each image size. Encoded frames are produced in an IPP...PPP sequence, as H.264 BP encoding is employed. For reconstructed video quality evaluation, the frame-based average peak signal-to-noise ratio (PSNR) and number of search points (NSP) per MB (16 × 16 pixels) are measured. Video encoding is configured with full-pel motion accuracy, a single reference frame, and VBS. Since VBS is a complicated feature of H.264, to make the calculation of NSP practical across different block sizes, all search points for variable block estimation are normalized to MB-level search points so that the NSP results can be compared fairly. The implementation results in Tables 6 and 7 show that the estimated image quality produced by ACQPPS, in

terms of PSNR, is very close to that of FS, while the average number of search points is dramatically reduced. The PSNR difference between ACQPPS and FS is in the range −0.13 dB ∼ 0 dB; in most cases the degradation is less than 0.06 dB, and in some cases ACQPPS is approximately equivalent or equal to FS. Compared with the other fast search methods, that is, DS (small pattern), UCBDS, TSS, FSS, and HEX, ACQPPS consistently yields higher PSNR, on average +0.56 dB over those algorithms on the evaluated sequences. Its performance is also comparable to the complicated and advanced EPZS and UMHexagonS algorithms, achieving an average PSNR within −0.07 dB ∼ +0.05 dB of EPZS and −0.04 dB ∼ +0.08 dB of UMHexagonS.

Table 6: Average PSNR performance for experiment with real-time frame rate.

Sequence                FS      DS      UCBDS   TSS     FSS     HEX     EPZS    UMHexagonS   ACQPPS
Foreman (QCIF)          38.48   38.09   37.93   38.27   38.19   37.87   38.45   38.44        38.42
Carphone (QCIF)         36.43   36.23   36.16   36.30   36.24   36.04   36.42   36.37        36.37
News (QCIF)             37.44   37.26   37.35   37.28   37.29   37.25   37.43   37.35        37.43
Miss Am (QCIF)          39.07   39.01   39.01   39.00   38.94   39.01   38.98   39.01        39.03
Suzie (QCIF)            38.65   38.46   38.47   38.59   38.54   38.45   38.61   38.58        38.60
Highway (QCIF)          38.23   37.99   38.13   38.11   38.09   38.06   38.18   38.17        38.13
Football (SIF)          31.37   31.23   31.23   31.22   31.24   31.20   31.40   31.37        31.36
Table Tennis (SIF)      33.87   33.71   33.79   33.62   33.72   33.71   33.87   33.84        33.84
Foreman (CIF)           36.30   35.91   35.83   35.72   35.70   35.69   36.27   36.24        36.25
Mother Daughter (CIF)   36.26   36.16   36.24   36.21   36.21   36.22   36.26   36.23        36.26
Stefan (CIF)            33.87   33.46   33.36   33.45   33.39   33.30   33.89   33.82        33.82
Highway (CIF)           37.96   37.70   37.83   37.79   37.77   37.76   37.89   37.87        37.83

Table 7: Average number of search points per MB for experiment with real-time frame rate.

Sequence                FS        DS      UCBDS    TSS      FSS      HEX      EPZS     UMHexagonS   ACQPPS
Foreman (QCIF)          2066.73   60.64   109.70   124.85   122.37   109.26   119.02   125.95       55.63
Carphone (QCIF)         1872.04   46.82   91.54    108.44   106.52   94.82    114.83   121.91       54.02
News (QCIF)             1719.92   33.72   74.48    88.28    90.30    81.12    81.36    79.73        41.32
Miss Am (QCIF)          1471.96   30.70   64.35    74.95    76.52    68.43    62.94    56.27        32.32
Suzie (QCIF)            1914.32   44.19   88.19    108.19   104.97   93.21    96.98    88.74        47.59
Highway (QCIF)          1791.86   40.27   85.14    101.49   100.94   90.27    85.04    84.24        46.12
Football (SIF)          2150.45   68.42   118.21   131.82   129.91   117.62   184.81   202.19       72.63
Table Tennis (SIF)      2031.72   55.66   105.56   120.09   121.27   108.79   128.36   124.95       54.25
Foreman (CIF)           1960.07   76.83   124.56   128.85   125.76   117.21   122.22   124.26       67.20
Mother Daughter (CIF)   1473.73   35.08   70.39    82.18    82.80    73.67    80.38    63.51        40.89
Stefan (CIF)            1954.21   69.32   116.72   118.23   122.37   113.50   137.59   149.80       58.91
Highway (CIF)           1730.90   45.63   90.81    104.90   103.98   93.22    78.82    75.98        47.57

In addition to real-time video encoding at 30 fps, many other application cases, such as mobile scenarios and videoconferencing, require video encoding in low bit and frame rate environments, with less than 30 fps. Accordingly, the satisfactory settings are usually 7.5 fps ∼ 15 fps for QCIF and 10 fps ∼ 15 fps for SIF/CIF with various low bit rates, for example, 90 Kbps for QCIF and 150 Kbps for SIF/CIF, to maximize the perceived video quality [40, 41]. To further evaluate the ME algorithms under low bit and frame rate conditions, the video sequences listed in Table 5 are used, with the corresponding performance results given in Tables 8 and 9. The experiments show that the PSNR difference between ACQPPS and FS remains small, within an acceptable range of −0.49 dB ∼ −0.02 dB; in most cases the discrepancy is below 0.2 dB. Moreover, ACQPPS still clearly outperforms DS, UCBDS, TSS, FSS, and HEX. Mobile scenarios usually contain quick and considerable motion displacements under low frame rate encoding; in such cases ACQPPS is particularly much better than those fast algorithms, gaining up to +2.42 dB PSNR on the tested sequences. Compared with EPZS and UMHexagonS, ACQPPS yields an average PSNR within −0.36 dB ∼ +0.06 dB and −0.15 dB ∼ +0.07 dB, respectively. Normally, ACQPPS produces a favorable PSNR for sequences not only with small object motions,

but also with a large amount of motion. In particular, if a sequence includes large object motions or a considerable amount of motion, the advantage of ACQPPS is obvious, as it can adaptively choose different shapes and sizes for its search pattern, which suits efficient large-motion search. This advantage can be observed when ACQPPS is compared with DS. It is known that DS uses a simple diamond pattern for very low complexity motion search. For video sequences with slow and small motions, for example, Miss Am (QCIF) and Mother Daughter (CIF) at 30 fps, the PSNR performance of DS and ACQPPS is relatively close, indicating that DS performs well for simple motion search. When complicated and large amounts of motion are included in the video images, however, DS is unable to yield good PSNR, as its motion search is easily trapped in undesirable local minima. For example, the PSNR differences between DS and ACQPPS are 0.34 dB and 0.44 dB when Foreman (CIF) is tested at 1 Mbps/30 fps and at 150 Kbps/10 fps, respectively. Furthermore, ACQPPS produces an average PSNR +0.02 dB ∼ +0.36 dB higher than DS for real-time video encoding, and +0.07 dB ∼ +1.94 dB higher in the low bit and frame rate environment.
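The local-minimum behaviour discussed above can be made concrete with a minimal sketch of the small-diamond refinement that DS relies on. This is an illustrative Python model: `cost` stands in for a block SAD evaluation, and the greedy stop-at-first-minimum rule is exactly what lets large or complex motion defeat a fixed small pattern.

```python
# Greedy descent over the small diamond pattern (centre + 4 neighbours).
# The search stops at the first position whose neighbours are all worse,
# i.e. at a local minimum of the matching cost surface.

SMALL_DIAMOND = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]

def small_diamond_search(cost, start=(0, 0), max_steps=32):
    """Return the motion vector at the first local minimum found."""
    best = start
    for _ in range(max_steps):
        candidates = [(best[0] + dx, best[1] + dy) for dx, dy in SMALL_DIAMOND]
        nxt = min(candidates, key=cost)
        if nxt == best:          # centre beats all neighbours: local minimum
            return best
        best = nxt
    return best

# On a smooth, single-minimum cost surface the pattern converges fine;
# on multimodal surfaces it may stop far from the true global minimum.
mv = small_diamond_search(lambda p: (p[0] - 5) ** 2 + (p[1] + 3) ** 2)
print(mv)   # (5, -3)
```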

Table 8: Average PSNR performance for experiment with low bit and frame rates.

Sequence                FS      DS      UCBDS   TSS     FSS     HEX     EPZS    UMHexagonS   ACQPPS
Foreman (QCIF)          34.88   34.44   34.19   34.40   34.42   33.99   34.85   34.80        34.80
Carphone (QCIF)         34.12   33.99   33.96   34.02   33.99   33.84   34.08   34.04        34.06
News (QCIF)             35.28   35.21   35.20   35.11   35.20   35.19   35.25   35.21        35.24
Miss Am (QCIF)          38.36   38.23   38.33   38.25   38.23   38.31   38.28   38.27        38.34
Suzie (QCIF)            36.54   36.40   36.34   36.44   36.39   36.26   36.52   36.50        36.50
Highway (QCIF)          36.19   35.80   36.01   35.96   35.94   35.90   36.13   36.09        35.98
Football (SIF)          25.11   24.82   24.92   24.79   24.84   24.89   25.08   25.10        25.01
Table Tennis (SIF)      27.57   26.65   26.95   26.85   26.84   26.86   27.57   27.60        27.45
Foreman (CIF)           31.95   31.29   31.32   31.16   31.25   31.06   31.90   31.79        31.73
Mother Daughter (CIF)   36.07   35.84   35.95   35.90   35.92   35.91   36.07   36.02        36.05
Stefan (CIF)            27.02   24.59   24.92   24.11   24.12   24.96   26.89   26.67        26.53
Highway (CIF)           37.21   36.88   36.98   36.94   36.92   36.94   37.12   37.09        37.01

Table 9: Average number of search points per MB for experiment with low bit and frame rates.

Sequence                FS        DS      UCBDS    TSS      FSS      HEX      EPZS     UMHexagonS   ACQPPS
Foreman (QCIF)          2020.51   90.20   140.01   134.64   133.92   125.12   163.63   190.38       98.94
Carphone (QCIF)         1836.04   58.40   102.76   112.65   111.28   100.74   141.56   160.32       71.81
News (QCIF)             1680.68   34.74   74.11    87.22    88.92    79.64    96.40    102.30       52.61
Miss Am (QCIF)          1406.26   32.60   64.56    75.39    75.08    67.74    68.05    63.20        44.22
Suzie (QCIF)            1823.23   52.96   94.39    110.43   106.30   95.11    115.96   112.17       64.30
Highway (QCIF)          1710.86   42.06   84.42    97.99    97.34    87.36    98.22    97.77        58.13
Football (SIF)          1914.43   80.13   132.67   123.20   125.01   119.62   192.88   246.76       92.51
Table Tennis (SIF)      1731.44   50.10   98.45    97.71    100.73   93.82    159.45   182.19       64.39
Foreman (CIF)           1789.76   91.32   140.31   124.01   124.24   120.48   154.55   170.89       88.62
Mother Daughter (CIF)   1467.56   42.21   78.32    87.45    87.75    78.42    90.40    78.14        52.36
Stefan (CIF)            1663.89   65.44   110.17   100.53   102.36   103.71   153.97   194.64       78.69
Highway (CIF)           1715.63   52.26   97.20    109.24   107.49   96.94    91.74    92.27        64.45

The number of search points for each method, which mainly reflects algorithm complexity, is also measured to compare search efficiency. The NSP results show that the search efficiency of ACQPPS is higher than that of the other algorithms: ACQPPS produces very good PSNR performance with a reasonably small NSP, one of the lowest among all methods. Compared with DS, the NSP of ACQPPS is similar, usually only slightly higher; this limited increase is very reasonable, as it in turn buys ACQPPS much better PSNR for the encoded video. Furthermore, for video sequences containing complex and quick object motions, for example, Foreman (CIF) and Stefan (CIF) at 30 fps, the NSP of ACQPPS can even be lower than that of DS, which confirms that ACQPPS has better search efficiency than DS thanks to its highly adaptive search patterns. In general, the complexity of ACQPPS is very low while its search performance is high, which makes it especially suitable for hardware architecture implementation.
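The scale of the complexity reduction is easy to quantify from the real-time results of Table 7. The script below simply averages the FS and ACQPPS columns copied from that table; it is a sanity check on the reported data, not new measurement.

```python
# Per-sequence NSP/MB values copied from Table 7 (real-time experiment).
FS_NSP = [2066.73, 1872.04, 1719.92, 1471.96, 1914.32, 1791.86,
          2150.45, 2031.72, 1960.07, 1473.73, 1954.21, 1730.90]
ACQPPS_NSP = [55.63, 54.02, 41.32, 32.32, 47.59, 46.12,
              72.63, 54.25, 67.20, 40.89, 58.91, 47.57]

fs_avg = sum(FS_NSP) / len(FS_NSP)
acqpps_avg = sum(ACQPPS_NSP) / len(ACQPPS_NSP)
speedup = fs_avg / acqpps_avg

# On these sequences ACQPPS examines roughly 1/36 of the points FS does.
print(round(fs_avg, 1), round(acqpps_avg, 1), round(speedup, 1))  # 1844.8 51.5 35.8
```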

7.2. Design Resources for ACQPPS Motion Estimator. As the complexity and search points of ACQPPS have been greatly reduced, the design resources used by the ACQPPS architecture can be kept at a very low level. The main part of the design resources is the SAD calculator. Each BPU requires one 32-bit processing element (PE) to implement SAD calculations. Every PE has two 8-bit pixel data inputs, one from the current block and the other from the reference block, and contains 16 subtractors, 8 three-input adders, and 1 latch register, without requiring extra interim registers or accumulators. As a whole, a 32 × 4 PE array is needed to implement the pipelined multilevel SAD calculator, requiring in total 64 subtractors, 32 three-input adders, and 4 latch registers. Other related design resources mainly include an 18 × 18 register array, a 16 × 16 register array, and a few accumulators, subtractors, and comparators, which generate the block SAD results, residual data, and final estimated MVs. Some additional multiplexers, registers, memory access, and data flow control logic gates are also needed in the architecture. A comparison of design resources between ACQPPS and other ME architectures [33–36] is presented in Table 10. The results show that the proposed ACQPPS architecture can utilize greatly reduced design

resources to realize a high-performance motion estimator for H.264 encoding.

Table 10: Performance comparison between proposed ACQPPS and other motion estimation hardware architectures.

                                 [33]         [34]         [35]               [36]         Proposed architecture
Type                             ASIC         ASIC         ASIC               ASIC         FPGA + DSP
Algorithm                        FS           FS           FS                 FS           ACQPPS
Search range                     [−16, +15]   [−32, +31]   [−16, +15]         [−16, +15]   Flexible
Gate count                       103 K        154 K        67 K               108 K        35 K
Supported block sizes            All          All          8×8/16×16/32×32    All          All
Freq. [MHz]                      66.67        100          60                 100          75
Max fps of CIF                   102          60           30                 56           120
Min Freq. [MHz] for CIF 30 fps   19.56        50           60                 54           18.75

Table 11: Design resources for system-on-platform architecture.

Target FPGA: XC4VSX35FG668-10
Critical path: 5 ns    Gates: 279,774    DFFs/Latches: 3,388
LUTs: 3,161    CLB slices: 3,885    Resource utilization: 25%

Table 12: DMA performance for video sequence transfer (WildCard-4).

Format             DMA Write (ms)   DMA Read (ms)   DMA R/W (ms)
QCIF 4:2:0 YCbCr   0.556            0.491           0.515
CIF 4:2:0 YCbCr    2.224            1.963           2.059

7.3. Throughput of ACQPPS Motion Estimator. Unlike FS, which has a fixed search range, the search points and search range of ACQPPS depend on the video sequence: the number of search points increases when a sequence contains considerable or quick motion, and decreases when the motion is slow or small. An ME scheme with a fixed block size can be applied to the throughput analysis; the worst case is motion estimation using 4 × 4 blocks, which is the most time consuming fixed block size, so the overall throughput of the ACQPPS architecture can be reasonably generalized and evaluated from it. In general, if the clock frequency is 50 MHz and the memory (SRAM, BRAM, and DRAM) is organized for DWORD (32-bit) accesses, the ACQPPS hardware architecture needs an average of approximately 12.39 milliseconds for motion estimation in the worst case of 4 × 4 blocks. For a real hardware implementation, this worst-case 4 × 4 throughput represents the overall motion search ability of the motion estimator architecture.
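The SAD processing element described in Section 7.2 (16 subtract/absolute units reduced by three-input adders) can be modelled behaviourally. This Python sketch captures the arithmetic only; the real PE is a pipelined 32-bit hardware datapath, and the tree-reduction here is an illustrative software analogue of its adder structure.

```python
# Behavioural model of a 4x4 SAD processing element: 16 absolute differences
# followed by a reduction tree built from three-input adders.

def sad_pe_4x4(cur, ref):
    """SAD of two 4x4 blocks given as 16-element lists of 8-bit samples."""
    diffs = [abs(c - r) for c, r in zip(cur, ref)]       # 16 subtract/abs units
    # Reduce with three-input adders, mimicking the hardware adder tree:
    # 16 partial sums -> 6 -> 2 -> 1.
    while len(diffs) > 1:
        diffs = [sum(diffs[i:i + 3]) for i in range(0, len(diffs), 3)]
    return diffs[0]

cur = list(range(16))            # sample current block, values 0..15
ref = [0] * 16                   # sample reference block
print(sad_pe_4x4(cur, ref))      # 120 = sum(0..15)
```

Identical blocks give a SAD of zero, which is what lets the encoder's threshold test (Section 6.3) detect well-predicted blocks cheaply.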

Therefore, the ACQPPS architecture can complete motion estimation for more than four CIF (352 × 288) video sequences, or equivalently one 4CIF (704 × 576) video sequence, at a 75 MHz clock frequency within each 33.33-millisecond time slot (30 fps), meeting the real-time encoding requirement for a low design cost, low bit rate implementation. The throughput of the ACQPPS architecture can be compared with that of a variety of other recently developed motion estimator hardware architectures, as illustrated in Table 10. The comparison shows that the proposed ACQPPS architecture achieves higher throughput than the other architectures at a reduced operational clock frequency; it requires only a very low clock frequency, 18.75 MHz, to generate motion estimation results for CIF video sequences at 30 fps.

7.4. Realization of System Architecture. Table 11 lists the design resources utilized by the system-on-platform framework. The implementation results indicate that the system architecture uses approximately 25% of the FPGA design resources when no hardware IP accelerator is integrated in the platform; if video functions are needed, more design resources are demanded to integrate and accommodate the necessary IP modules. Table 12 gives the performance of the platform's DMA video frame transfer feature. Different DMA burst sizes result in different DMA transfer rates; in our case, the maximum DMA burst size is defined to accommodate a whole CIF 4:2:0 video frame, that is, 38,016 DWORDs per DMA transfer buffer. The DMA results verify that it takes on average approximately 2 milliseconds to transfer a whole CIF 4:2:0 video frame on the WildCard-4. This transfer performance can sufficiently support up to level 4 bitstream rates for the H.264 BP video encoding system.

7.5. Overall Encoding Performance.
In view of the complexity analysis of the H.264 video tasks described in Section 2, the most time consuming task is motion estimation; the other encoding tasks have much less overhead. Therefore, the video tasks can be scheduled to operate in parallel, pipelined stages, as displayed in Figures 9 and 10, for the proposed architecture

model. In this case, the overall encoding time for a video sequence is approximately

Encoding time = Total motion estimation time
    + Processing time of DCT/Q for the last block
    + max{Processing time of IDCT/Q−1 + MC + Deblocking for the last block,
          Processing time of CAVLC for the last block}.        (8)

The processing time of DCT/Q, IDCT/Q−1, MC, the deblocking filter, and CAVLC for a divided block depends directly on the architecture design of each module. On average, the overhead of these video tasks for encoding an individual block is much less than that of motion estimation; the encoding time they contribute for the last block can even be ignored when compared to the total processing time of the motion estimator over a whole video sequence. Therefore, to simplify the overall encoding performance analysis for the proposed architecture model, the total encoding overhead for a video sequence can be approximated as

Encoding time ≈ Total motion estimation time.        (9)

This simplified encoding performance analysis is valid as long as the video tasks operate in concurrent, pipelined stages with the efficient optimization techniques. Accordingly, when the proposed ACQPPS motion estimator is integrated into the system architecture to perform the motion search, the overall encoding performance of the proposed architecture model can be generalized. A performance comparison is presented in Table 13, where the proposed architecture is compared with other recently developed H.264 BP video encoding systems [37–39], including both fully dedicated hardware and codesign architectures. The results indicate that the proposed system-on-platform architecture, when integrated with the IP accelerators, yields very good performance, comparable to or even better than other H.264 video encoding systems. In particular, compared with the other codesign architectures, the proposed system has much higher encoding throughput, about 30 and 6 times higher than the processing ability of the architectures presented in [38, 39], respectively. This high performance is contributed directly by the efficient ACQPPS motion estimation architecture and the techniques employed for the system optimizations.

Table 13: An overall performance comparison for H.264 BP video encoding systems.

                                 [37]               [38]               [39]            Proposed architecture
Architecture                     ASIC               Codesign           Codesign        Codesign (extensible multiple processing cores)
ME Algorithm                     Full Search (FS)   Full Search (FS)   Hexagon (HEX)   ACQPPS
Freq. [MHz]                      144                100                81              75
Max fps of CIF                   272.73             5.125              18.6            120
Min Freq. [MHz] for CIF 30 fps   15.84              585                130.65          18.75
Core Voltage Supply              1.2 V              1.2 V              1.2 V           1.2 V
I/O Voltage Supply               1.8/2.5/3.3 V      1.8/2.5/3.3 V      2.5/3.3 V       2.5/3.3 V

8. Conclusions

An integrated, reconfigurable, hardware-software codesign, IP accelerated system-on-platform architecture is proposed in this paper, together with an efficient virtual socket interface and optimization approaches for its hardware realization. The system architecture is flexible with respect to host interface control and extensible with multiple cores, forming a useful integrated and embedded system approach for dedicated functions. An advanced application of the proposed architecture is the development of an H.264 video encoding system. As motion estimation is the most complicated and important task in a video encoder, a novel block-based adaptive motion estimation search algorithm, ACQPPS, and its hardware architecture are developed to reduce complexity to an extremely low level while keeping the encoding performance, in terms of PSNR and bit rate, as high as possible. Integrating video IP accelerators, especially the ACQPPS motion estimator, into the architecture framework improves the overall encoding performance. The proposed system architecture is mapped onto an integrated FPGA device, WildCard-4, toward an implementation of a simplified H.264 BP video encoder. In practice, beyond H.264 video applications, the proposed system architecture can greatly facilitate the realization and efficient verification of multistandard video codecs. The advantages of the proposed architecture should become even more desirable for prototyping future video encoding systems as new video standards continue to emerge, for example, the coming H.265 draft.

Acknowledgment The authors would like to thank the support from Alberta Informatics Circle of Research Excellence (iCore), Xilinx Inc., Natural Science and Engineering Research Council of Canada (NSERC), Canada Foundation for Innovation (CFI), and the Department of Electrical and Computer Engineering at the University of Calgary.


References

[1] M. Tekalp, Digital Video Processing, Signal Processing Series, Prentice Hall, Englewood Cliffs, NJ, USA, 1995.
[2] "Information technology—generic coding of moving pictures and associated audio information: video," ISO/IEC 13818-2, September 1995.
[3] "Video coding for low bit rate communication," ITU-T Recommendation H.263, March 1996.
[4] "Coding of audio-visual objects—part 2: visual, amendment 1: visual extensions," ISO/IEC 14496-4/AMD 1, April 1999.
[5] Joint Video Team of ITU-T and ISO/IEC JTC 1, "Draft ITU-T recommendation and final draft international standard of joint video specification (ITU-T Rec. H.264 | ISO/IEC 14496-10 AVC)," JVT-G050r1, May 2003; JVT-K050r1 (non-integrated form) and JVT-K051r1 (integrated form), March 2004; Fidelity Range Extensions JVT-L047 (non-integrated form) and JVT-L050 (integrated form), July 2004.
[6] T. Wiegand, G. J. Sullivan, G. Bjøntegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560–576, 2003.
[7] S. Wenger, "H.264/AVC over IP," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 645–656, 2003.
[8] B. Zeidman, Designing with FPGAs and CPLDs, Publishers Group West, Berkeley, Calif, USA, 2002.
[9] S. Notebaert and J. D. Cock, Hardware/Software Co-design of the H.264/AVC Standard, Ghent University, white paper, 2004.
[10] W. Staehler and A. Susin, IP Core for an H.264 Decoder SoC, Universidade Federal do Rio Grande do Sul (UFRGS), white paper, October 2008.
[11] R. Chandra, IP-Reuse and Platform Base Designs, STMicroelectronics Inc., white paper, February 2002.
[12] T. Wiegand, H. Schwarz, A. Joch, F. Kossentini, and G. J. Sullivan, "Rate-constrained coder control and comparison of video coding standards," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 688–703, 2003.
[13] J. Ostermann, J. Bormans, P. List, et al., "Video coding with H.264/AVC: tools, performance, and complexity," IEEE Circuits and Systems Magazine, vol. 4, no. 1, pp. 7–28, 2004.
[14] M. Horowitz, A. Joch, F. Kossentini, and A. Hallapuro, "H.264/AVC baseline profile decoder complexity analysis," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 704–716, 2003.
[15] S. Saponara, C. Blanch, K. Denolf, and J. Bormans, "The JVT advanced video coding standard: complexity and performance analysis on a tool-by-tool basis," in Proceedings of the Packet Video Workshop (PV '03), Nantes, France, April 2003.
[16] J. R. Jain and A. K. Jain, "Displacement measurement and its application in interframe image coding," IEEE Transactions on Communications, vol. 29, no. 12, pp. 1799–1808, 1981.
[17] T. Koga, K. Iinuma, A. Hirano, Y. Iijima, and T. Ishiguro, "Motion compensated interframe coding for video conferencing," in Proceedings of the IEEE National Telecommunications Conference (NTC '81), vol. 4, pp. 1–9, November 1981.
[18] R. Li, B. Zeng, and M. L. Liou, "A new three-step search algorithm for block motion estimation," IEEE Transactions on Circuits and Systems for Video Technology, vol. 4, pp. 438–442, 1994.
[19] L.-M. Po and W.-C. Ma, "A novel four-step search algorithm for fast block motion estimation," IEEE Transactions on Circuits and Systems for Video Technology, vol. 6, no. 3, pp. 313–317, 1996.
[20] L.-K. Liu and E. Feig, "A block-based gradient descent search algorithm for block motion estimation in video coding," IEEE Transactions on Circuits and Systems for Video Technology, vol. 6, no. 4, pp. 419–421, 1996.
[21] S. Zhu and K. K. Ma, "A new diamond search algorithm for fast block-matching motion estimation," in Proceedings of the International Conference on Information, Communications and Signal Processing (ICICS '97), vol. 1, pp. 292–296, Singapore, September 1997.
[22] C. Zhu, X. Lin, and L.-P. Chau, "Hexagon-based search pattern for fast block motion estimation," IEEE Transactions on Circuits and Systems for Video Technology, vol. 12, no. 5, pp. 349–355, 2002.
[23] J. Y. Tham, S. Ranganath, M. Ranganath, and A. A. Kassim, "A novel unrestricted center-biased diamond search algorithm for block motion estimation," IEEE Transactions on Circuits and Systems for Video Technology, vol. 8, no. 4, pp. 369–377, 1998.
[24] Y. Nie and K.-K. Ma, "Adaptive rood pattern search for fast block-matching motion estimation," IEEE Transactions on Image Processing, vol. 11, no. 12, pp. 1442–1449, 2002.
[25] H. C. Tourapis and A. M. Tourapis, "Fast motion estimation within the H.264 codec," in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME '03), vol. 3, pp. 517–520, Baltimore, Md, USA, July 2003.
[26] A. M. Tourapis, "Enhanced predictive zonal search for single and multiple frame motion estimation," in Visual Communications and Image Processing, vol. 4671 of Proceedings of SPIE, pp. 1069–1079, January 2002.
[27] Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, "Fast integer pel and fractional pel motion estimation for AVC," JVT-F016, December 2002.
[28] K. Sühring, "H.264 JM Reference Software v.15.0," September 2008, http://iphome.hhi.de/suehring/tml/download.
[29] Annapolis Micro Systems, "WildCard-4 Reference Manual," 12968-000 Revision 3.2, December 2005.
[30] Xilinx Inc., "Virtex-4 User Guide," UG070 (v2.3), August 2007.
[31] Xilinx Inc., "XtremeDSP for Virtex-4 FPGAs User Guide," UG073 (v2.1), December 2005.
[32] A. M. Tourapis, O. C. Au, and M. L. Liou, "Predictive motion vector field adaptive search technique (PMVFAST)—enhanced block based motion estimation," in Proceedings of the IEEE Visual Communications and Image Processing (VCIP '01), pp. 883–892, January 2001.
[33] Y.-W. Huang, T.-C. Wang, B.-Y. Hsieh, and L.-G. Chen, "Hardware architecture design for variable block size motion estimation in MPEG-4 AVC/JVT/ITU-T H.264," in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '03), vol. 2, pp. 796–798, May 2003.
[34] M. Kim, I. Hwang, and S. Chae, "A fast VLSI architecture for full-search variable block size motion estimation in MPEG-4 AVC/H.264," in Proceedings of the IEEE Asia and South Pacific Design Automation Conference, vol. 1, pp. 631–634, January 2005.
[35] J.-F. Shen, T.-C. Wang, and L.-G. Chen, "A novel low-power full-search block-matching motion-estimation design for H.263+," IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, no. 7, pp. 890–897, 2001.
[36] S. Y. Yap and J. V. McCanny, "A VLSI architecture for advanced video coding motion estimation," in Proceedings of the IEEE International Conference on Application-Specific Systems, Architectures, and Processors (ASAP '03), vol. 1, pp. 293–301, June 2003.
[37] S. Mochizuki, T. Shibayama, M. Hase, et al., "A 64 mW high picture quality H.264/MPEG-4 video codec IP for HD mobile applications in 90 nm CMOS," IEEE Journal of Solid-State Circuits, vol. 43, no. 11, pp. 2354–2362, 2008.
[38] R. R. Colenbrander, A. S. Damstra, C. W. Korevaar, C. A. Verhaar, and A. Molderink, "Co-design and implementation of the H.264/AVC motion estimation algorithm using co-simulation," in Proceedings of the 11th EUROMICRO Conference on Digital System Design Architectures, Methods and Tools (DSD '08), pp. 210–215, September 2008.
[39] Z. Li, X. Zeng, Z. Yin, S. Hu, and L. Wang, "The design and optimization of H.264 encoder based on the nexperia platform," in Proceedings of the 8th IEEE International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing (SNPD '07), vol. 1, pp. 216–219, July 2007.
[40] S. Winkler and F. Dufaux, "Video quality evaluation for mobile applications," in Visual Communications and Image Processing, vol. 5150 of Proceedings of SPIE, pp. 593–603, Lugano, Switzerland, July 2003.
[41] M. Ries, O. Nemethova, and M. Rupp, "Motion based reference-free quality estimation for H.264/AVC video streaming," in Proceedings of the 2nd International Symposium on Wireless Pervasive Computing (ISWPC '07), pp. 355–359, February 2007.

EURASIP Journal on Embedded Systems

Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 893897, 16 pages
doi:10.1155/2009/893897

Research Article

FPSoC-Based Architecture for a Fast Motion Estimation Algorithm in H.264/AVC

Obianuju Ndili and Tokunbo Ogunfunmi
Department of Electrical Engineering, Santa Clara University, Santa Clara, CA 95053, USA
Correspondence should be addressed to Tokunbo Ogunfunmi, [email protected]

Received 21 March 2009; Revised 18 June 2009; Accepted 27 October 2009

Recommended by Ahmet T. Erdogan

There is an increasing need for high quality video on low power, portable devices. Possible target applications range from entertainment and personal communications to security and health care. While H.264/AVC answers the need for high quality video at lower bit rates, it is significantly more complex than previous coding standards and thus results in greater power consumption in practical implementations. In particular, motion estimation (ME) in H.264/AVC consumes the largest power in an H.264/AVC encoder. It is therefore critical to speed up integer ME in H.264/AVC via fast motion estimation (FME) algorithms and hardware acceleration. In this paper, we present our hardware-oriented modifications to a hybrid FME algorithm, our architecture based on the modified algorithm, and our implementation and prototype on a PowerPC-based Field Programmable System on Chip (FPSoC). Our results show that the modified hybrid FME algorithm on average outperforms previous state-of-the-art FME algorithms, while its losses when compared with FSME, in terms of PSNR performance and computation time, are insignificant. We show that although our implementation platform is FPGA-based, our implementation results compare favourably with previous architectures implemented on ASICs. Finally we also show an improvement over some existing architectures implemented on FPGAs.

Copyright © 2009 O. Ndili and T. Ogunfunmi.
This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

Motion estimation (ME) is by far the most powerful compression tool in the H.264/AVC standard [1, 2], and it is generally carried out in two stages: an integer-pel search followed by a fractional-pel refinement. ME in H.264/AVC features variable block sizes, quarter-pixel accuracy for the luma component (one-eighth-pixel accuracy for the chroma component), and multiple reference pictures. However, the power of ME in H.264/AVC comes at the price of increased encoding time. Experimental results [3, 4] have shown that ME can consume up to 80% of the total encoding time of H.264/AVC, with integer ME consuming a greater proportion. In order to meet real-time and low-power constraints, it is desirable to speed up the ME process. Two approaches to ME speed-up are designing fast ME algorithms and accelerating ME in hardware.

Considering the algorithm approach, there are traditional, single-search fast algorithms such as new three-step search (NTSS) [5], four-step search (4SS) [6], and diamond search (DS) [7]. However, these algorithms were developed for a fixed block size and cannot efficiently support variable block size ME (VBSME) for H.264/AVC. In addition, while these algorithms are good for small search ranges and low-resolution video, at higher definition for some high-motion sequences such as "Stefan," these algorithms can drop into a local minimum in the early stages of the search process [4]. In order to have more robust fast algorithms, hybrid fast algorithms that combine earlier single-search techniques have been proposed. One such algorithm was proposed by Yi et al. [8, 9]: a fast ME algorithm known variously as the Simplified Unified Multi-Hexagon (SUMH) search or Simplified Fast Motion Estimation (SFME) algorithm. SUMH is based on UMHexagonS [4], a hybrid fast motion estimation algorithm. Yi et al. show in [8] that with similar or

even better rate-distortion performance, SUMH reduces ME time by about 55% and 94% on average when compared with UMHexagonS and Fast Full Search, respectively. In addition, SUMH yields a bit rate reduction of up to 18% when compared with Full Search in low complexity mode. Both SUMH and UMHexagonS are nonnormative parts of the H.264/AVC standard. Considering ME speed-up via hardware acceleration, although there has been some previous work on VLSI architectures for VBSME in H.264/AVC, the overwhelming majority of these works have been based on the Full Search Motion Estimation (FSME) algorithm. This is because FSME presents a regular-patterned search window which in turn provides good candidate-level data reuse (DR) with regular searching flows. Good candidate-level DR reduces data-access power. Power consumption for an integer ME module mainly comes from two parts: the data-access power to read reference pixels from local memories and the computational power consumed by the processing elements. For FSME, the data-access power is reduced because the reference pixels of neighbouring candidates are considerably overlapped. On the other hand, because of the exhaustive search done in FSME, the computational complexity, and thus the power consumed by the processing elements, is large. Several low-power integer ME architectures with corresponding fast algorithms were designed for standards prior to H.264/AVC [10–13]. However, these architectures do not support H.264/AVC. Additionally, because the irregular searching flows of fast algorithms usually lead to poor intercandidate DR, the power reduction at the algorithm level is usually constrained by the power reduction at the architecture level. There is therefore an urgent need for architectures with hardware-oriented fast algorithms for portable systems implementing H.264/AVC [14].
Note also that because the data flow of FME is very similar to that of the fractional-pel search, some hardware reuse can be achieved [15]. For H.264/AVC, previous works on architectures for fast motion estimation (FME) [14–18] have been based on diverse FME algorithms. Rahman and Badawy in [16] and Byeon et al. in [17] base their works on UMHexagonS. In [14], Chen et al. propose a parallel, content-adaptive, variable block size, 4SS algorithm, upon which their architecture is based. In [15], Zhang and Gao base their architecture on the following search sequence: diamond search (DS), cross search (CS), and finally fractional-pel ME. In this paper, we base our architecture on SUMH, which has been shown in [8] to outperform UMHexagonS. We present hardware-oriented modifications to SUMH. We show that the modified SUMH has a better PSNR performance than that of the parallel, content-adaptive variable block size 4SS proposed in [14]. In addition, our results (see Section 2) show that for the modified SUMH, the average PSNR loss is 0.004 dB to 0.03 dB when compared with FSME, while when compared to SUMH, most of the sequences show an average improvement of up to 0.02 dB, while two of the sequences show an average loss

of 0.002 dB. Thus in general, there is an improvement over SUMH. In terms of percentage computational time savings, while SUMH saves 88.3% to 98.8% when compared with FSME, the modified SUMH saves 60.0% to 91.7% when compared with FSME. Finally, in terms of percentage bit rate increase, when compared with FSME, the modified SUMH shows a bit rate improvement (decrease in bit rate) of 0.02% in the sequence "Coastguard." The worst bit rate increase is in "Foreman," and that is 1.29%. When compared with SUMH, there is a bit rate improvement of 0.03% to 0.34%. The rest of this paper is organized as follows. In Section 2 we summarize integer-pel motion estimation in SUMH and present the hardware-oriented SUMH along with simulation results. In Section 3 we briefly present our proposed architecture based on the modified SUMH. We also present our implementation results as well as comparisons with prior works. In Section 4 we present our prototyping efforts on the XUPV2P development board. This board contains an XC2VP30 Virtex-II Pro FPGA with two hardwired PowerPC 405 processors. Finally our conclusions are presented in Section 5.

2. Motion Estimation Algorithm

2.1. Integer-Pel SUMH Algorithm. H.264/AVC uses block matching for motion vector search. Integer-pel motion estimation uses the sum of absolute differences (SAD) as its matching criterion. The SAD and the resulting motion vector are given by

    SAD(dx, dy) = \sum_{x=0}^{X-1} \sum_{y=0}^{Y-1} | a(x, y) - b(x + dx, y + dy) |,    (1)

    (MV_x, MV_y) = \arg\min_{(dx, dy)} SAD(dx, dy).    (2)

In (1), a(x, y) and b(x, y) are the pixels of the current and candidate blocks, respectively, (dx, dy) is the displacement of the candidate block within the search window, and X × Y is the size of the current block. In (2), (MV_x, MV_y) is the motion vector of the best-matching candidate block.

H.264/AVC features seven interprediction block sizes, which are 16 × 16, 16 × 8, 8 × 16, 8 × 8, 8 × 4, 4 × 8, and 4 × 4. These are referred to as block modes 1 to 7. An up-layer block is a block that contains subblocks; for example, mode 5 or 6 is the up layer of mode 7, and mode 4 is the up layer of mode 5 or 6. SUMH [8] utilizes five key steps for intensive-search, integer-pel motion estimation: cross search, hexagon search, multi big hexagon search, extended hexagon search, and extended diamond search. For motion vector (MV) prediction, SUMH uses the spatial median and up-layer predictors, while for SAD prediction, the up-layer predictor is used. In median MV prediction, the median value of the adjacent blocks on the left, top, and top-right (or top-left) of the current block is used to predict the MV of the current block. The complete flow chart of the integer-pel motion vector search in SUMH is shown in Figure 1.

The convergence and intensive-search conditions are determined by empirical thresholds shifted right by a blocktype shift factor. The blocktype shift factor specifies the number of bits to shift to the right in order to get the corresponding thresholds for the different block sizes. There are 8 blocktype shift factors corresponding to 8 block modes: 1 dummy block mode and the 7 block modes in H.264/AVC. The 8 block modes are 16 × 16 (dummy), 16 × 16, 16 × 8, 8 × 16, 8 × 8, 8 × 4, 4 × 8, and 4 × 4. The array of blocktype shift factors corresponding, respectively, to these 8 block modes is

    blocktype_shift_factor = {0, 0, 1, 1, 2, 3, 3, 1}.    (3)

The convergence search condition is described in pseudocode by

    min_mcost < (ConvergeThreshold >> blocktype_shift_factor[blocktype]),    (4)

where min_mcost is the minimum motion vector cost. The intensive search condition is described in pseudocode by

    (blocktype == 1 && min_mcost > (CrossThreshold1 >> blocktype_shift_factor[blocktype]))
        || (min_mcost > (CrossThreshold2 >> blocktype_shift_factor[blocktype])),    (5)

where the thresholds are empirically set as follows: ConvergeThreshold = 1000, CrossThreshold1 = 800, and CrossThreshold2 = 7000.

2.2. Hardware Oriented SUMH Algorithm. The goal of our hardware-oriented modification is to make SUMH less sequential without incurring performance losses or increases in computation time. The sequential nature of SUMH arises from the fact that there are many data dependencies. The most severe data dependency arises during the up-layer predictor search step. This dependency forces the algorithm to sequentially and individually conduct the search for the 41 possible SADs in a 16 × 16 macroblock. The sequence begins with the 16 × 16 macroblock and then computes the SADs of the subblocks in each quadrant of the 16 × 16 macroblock. Performing the algorithm in this manner consumes a lot of computational time and power, yet its rate-distortion benefits can still be obtained in a parallel implementation. In our modification, we skip this search step. The decision control structures in SUMH are another feature that makes the algorithm unsuitable for hardware implementation. In a parallel and pipelined implementation, these structures would require that the pipeline be flushed at random times. This in turn wastes clock cycles and adds more overhead to the hardware's control circuit.
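The threshold logic in (3)–(5) is compact enough to model directly. The sketch below is an illustrative Python rendering of the pseudocode above (names follow the pseudocode; it is not the JM implementation itself):

```python
# Blocktype shift factors from (3), indexed by block mode:
# 16x16 (dummy), 16x16, 16x8, 8x16, 8x8, 8x4, 4x8, 4x4.
BLOCKTYPE_SHIFT_FACTOR = [0, 0, 1, 1, 2, 3, 3, 1]

CONVERGE_THRESHOLD = 1000
CROSS_THRESHOLD_1 = 800
CROSS_THRESHOLD_2 = 7000

def converged(min_mcost, blocktype):
    """Convergence search condition (4): stop early when the minimum
    motion vector cost falls below the scaled threshold."""
    return min_mcost < (CONVERGE_THRESHOLD >> BLOCKTYPE_SHIFT_FACTOR[blocktype])

def intensive_search(min_mcost, blocktype):
    """Intensive search condition (5): trigger the five-step search."""
    shift = BLOCKTYPE_SHIFT_FACTOR[blocktype]
    return ((blocktype == 1 and min_mcost > (CROSS_THRESHOLD_1 >> shift))
            or min_mcost > (CROSS_THRESHOLD_2 >> shift))
```

In the modified SUMH these predicates are effectively hard-wired: the convergence condition is treated as never satisfied and the intensive-search condition as always satisfied, which is exactly what removes the random pipeline flushes described above.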

In our modification, we consider the convergence condition not satisfied, and intensive search condition satisfied. This removes the decision control structures that make SUMH unsuitable for parallel processing. Another effect of this modification is that we expect to have a better rate-distortion performance. On the other hand, the expected disadvantage of this modification is an increase in computation time. However, as shown by our complexity analysis and results, this increase is minimal and will also be easily compensated for by hardware acceleration. Further modifications we make to SUMH are the removal of the small local search steps and the convergence search step. Our modifications to SUMH allow us to process in parallel, all the candidate macroblocks (MB), for one current macroblock (CMB). We use the so-called HF3V2 2-stitched zigzag scan proposed in [19], in order to satisfy the data dependencies between CMBs. These data dependencies arise because of the side information used to predict the MV of the CMB. Note that if we desire to process several CMBs in parallel, we will need to set the value of the MV predictor to the zero displacement MV, that is, MV = (0, 0). Experiments in [20–22], as well as our own experiments [23], show that when the search window is centered around MV = (0, 0), the average PSNR loss is less than 0.2 dB compared with when the median MV is also used. Figure 2 shows the complete flow chart of the modified integer-pel, SUMH. 2.3. Complexity Analysis of the Motion Estimation Algorithms. We consider a search range s. The number of search points to be examined by FSME algorithm is directly proportional to the square of the search range. There are (2s + 1)2 search points. Thus the algorithm complexity of Full Search is O(s2 ). We obtain the algorithm complexity of the modified SUMH algorithm by considering the algorithm complexity of each of its search steps as follows. 
(1) Cross search: there are s search points both horizontally and vertically yielding a total of 2s search points. Thus the algorithm complexity of this search step is O(2s). (2) Hexagon and extended hexagon search: There are 6 search points each in both of these search steps, yielding a total of 12 search points. Thus the algorithm complexity of this search step is constant O(1). (3) Multi-big hexagon search: there are (1/4)s hexagons with 16 search points per hexagon. This yields a total of 4s search points. Thus the algorithm complexity of this search step is O(4s). (4) Diamond search: there are 4 search points in this search step. Thus the algorithm complexity of this search step is constant O(1). Therefore in total there are 1 + 2s + 12 + 4 + 4s search points in the modified SUMH, and its algorithm complexity is O(6s). In order to obtain the algorithm complexity of SUMH, we consider its worst case complexity, even though the


[Figure 1 (flow chart): start with a predictor check; if the convergence condition is satisfied, perform a small local search and stop; otherwise, if the intensive search condition is satisfied, perform cross search, hexagon search, multi big hexagon search, up layer predictor search, and a small local search; if convergence is still not satisfied, perform extended hexagon search, extended diamond search, and a convergence search, then stop.]

Figure 1: Flow chart of integer-pel search in SUMH.

Table 1: Complexity of algorithms in million operations per second (MOPS).

| Algorithm        | Search points for search range s = ±16 | MOPS for CIF video at 30 Hz |
| FSME             | 1089                                   | 17103                       |
| Best case SUMH   | 5                                      | 78                          |
| Worst case SUMH  | 127                                    | 1995                        |
| Median case SUMH | 66                                     | 1037                        |
| Modified SUMH    | 113                                    | 1775                        |
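The entries of Table 1 follow mechanically from the per-step search-point counts derived in Section 2.3 and the operation-count assumptions listed there (1322 operations per SAD, 396 macroblocks per CIF frame, 30 frames per second). A small sketch that reproduces them:

```python
def search_points(algorithm, s=16):
    """SAD evaluations per macroblock for search range +/-s (Section 2.3)."""
    if algorithm == "FSME":
        return (2 * s + 1) ** 2          # exhaustive search: (2s + 1)^2
    if algorithm == "modified SUMH":
        # start point + cross (2s) + hexagon/extended hexagon (12)
        # + multi big hexagon (4s) + diamond (4)
        return 1 + 2 * s + 12 + 4 * s + 4
    if algorithm == "worst case SUMH":
        # modified SUMH + small local/convergence searches + up-layer step
        return search_points("modified SUMH", s) + 14
    if algorithm == "best case SUMH":
        return 5                          # initial candidate + convergence search
    raise ValueError(algorithm)

def mops(points, ops_per_sad=1322, mbs_per_frame=396, fps=30):
    """Million operations per second for CIF video at 30 Hz."""
    return points * ops_per_sad * mbs_per_frame * fps / 1e6

# search_points("FSME") == 1089 and round(mops(1089)) == 17103,
# matching the FSME row of Table 1; the median case (66 points) gives 1037.
```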

algorithm may terminate much earlier. The worst case complexity of SUMH is similar to that of the modified SUMH, except that it adds 14 more search points. This number is obtained by adding 4 search points each for 2 small local searches and 1 convergence search, and 2 search points for the worst case up layer predictor search. Thus for the worst case SUMH, there are in total 14+1+2s+12+4+4s search points and its algorithm complexity is O(6s). Note that in the best case, SUMH has only 5 search points: 1 for the initial search candidate and 4 for the convergence search. Another way to define the complexity of each algorithm is in terms of the number of required operations. We can then express the complexity as Million Operations Per Second (MOPS). To compare the algorithms in terms of MOPS we assume the following. (1) The macroblock size is 16 × 16. (2) The SAD cost function requires 2 × 16 × 16 data loads, 16 × 16 = 256 subtraction operations, 256 absolute operations, 256 accumulate operations, 41 compare operations and 1 data store operation. This yields a total of 1322 operations for one SAD computation. (3) CIF resolution is 352 × 288 pixels = 396 macroblocks. (4) The frame rate is 30 frames per second.

[Figure 2 (flow chart): start with a check of the center and median MV predictors, then cross search, hexagon search, multi big hexagon search, extended hexagon search, and extended diamond search, then stop.]

Figure 2: Flow chart of modified integer-pel search.

(5) The total number of operations required to encode CIF video in real time is 1322 × 396 × 30 × za , where za is the number of search points for each algorithm. Thus there are 15.7za MOPS per algorithm, where one OP (operation) is the amount of computation it takes to obtain one SAD value. In Table 1 we compare the computational complexities of the considered algorithms in terms of MOPS. As expected, FSME requires the largest number of MOPS. The number of MOPS required for the modified SUMH is about 10% less than that required for the worst case SUMH and about 40% more than that required for the median case SUMH. 2.4. Performance Results for the Modified SUMH Algorithm. Our experiments are done in JM 13.2 [24]. We use the following standard test sequences: “Stefan” (large motion), “Foreman” and “Coastguard” (large to moderate motion) and “Silent” (small motion). We chose these sequences because we consider them extreme cases in the spectrum of low bit-rate video applications. We also use the following


Table 2: Simulation conditions.

| Sequence        | Quantization parameter | Search range | Frame size | No. of frames |
| Foreman         | 22, 25, 28, 31, 33, 35 | 32           | CIF        | 100           |
| Mother-daughter | 22, 25, 28, 31, 33, 35 | 32           | CIF        | 150           |
| Stefan          | 22, 25, 28, 31, 33, 35 | 16           | CIF        | 90            |
| Flower          | 22, 25, 28, 31, 33, 35 | 16           | CIF        | 150           |
| Coastguard      | 18, 22, 25, 28, 31, 33 | 32           | QCIF       | 220           |
| Carphone        | 18, 22, 25, 28, 31, 33 | 32           | QCIF       | 220           |
| Silent          | 18, 22, 25, 28, 31, 33 | 16           | QCIF       | 220           |

Table 3: Comparison of speed-up ratios with full search (each cell: SUMH / modified SUMH).

| Sequence        | QP 18         | QP 22         | QP 25         | QP 28         | QP 31         | QP 33         | QP 35         |
| Foreman         | N/A           | 48.55 / 8.16  | 41.55 / 6.86  | 32.68 / 5.66  | 25.87 / 4.77  | 21.68 / 4.23  | 19.11 / 3.74  |
| Stefan          | N/A           | 15.35 / 4.62  | 13.16 / 4.21  | 12.20 / 3.93  | 10.67 / 3.50  | 10.05 / 3.23  | 8.96 / 3.06   |
| Mother-daughter | N/A           | 16.63 / 2.49  | 19.31 / 2.72  | 21.56 / 3.01  | 28.63 / 3.47  | 35.43 / 4.20  | 43.90 / 5.08  |
| Flower          | N/A           | 9.73 / 3.07   | 10.72 / 3.29  | 11.32 / 3.49  | 12.94 / 3.78  | 13.77 / 4.02  | 15.02 / 4.21  |
| Coastguard      | 86.34 / 12.06 | 70.12 / 10.31 | 58.05 / 9.01  | 43.62 / 7.98  | 36.04 / 6.80  | 30.10 / 6.13  | N/A           |
| Silent          | 21.86 / 3.54  | 16.74 / 3.18  | 13.17 / 2.99  | 11.90 / 2.82  | 9.29 / 2.66   | 8.56 / 2.64   | N/A           |
| Carphone        | 24.67 / 4.14  | 29.44 / 4.62  | 37.12 / 5.38  | 46.97 / 6.02  | 53.97 / 7.07  | 64.07 / 8.82  | N/A           |

Table 4: Comparison of percentage time savings with full search (each cell: SUMH / modified SUMH).

| Sequence        | QP 18         | QP 22         | QP 25         | QP 28         | QP 31         | QP 33         | QP 35         |
| Foreman         | N/A           | 97.94 / 87.75 | 97.59 / 85.43 | 96.94 / 82.34 | 96.13 / 79.04 | 95.38 / 76.36 | 94.76 / 73.31 |
| Stefan          | N/A           | 93.48 / 78.38 | 92.40 / 76.29 | 91.80 / 74.61 | 90.63 / 71.46 | 90.05 / 69.05 | 88.83 / 67.35 |
| Mother-daughter | N/A           | 93.98 / 60.00 | 94.82 / 63.34 | 95.36 / 66.85 | 96.50 / 71.22 | 97.17 / 76.21 | 97.72 / 80.35 |
| Flower          | N/A           | 89.72 / 67.45 | 90.67 / 69.62 | 91.16 / 71.37 | 92.27 / 73.56 | 92.71 / 75.14 | 93.34 / 76.27 |
| Coastguard      | 98.84 / 91.71 | 98.57 / 90.30 | 98.27 / 88.91 | 97.70 / 87.47 | 97.22 / 85.29 | 96.67 / 83.70 | N/A           |
| Silent          | 95.42 / 71.77 | 94.02 / 68.62 | 92.40 / 66.61 | 91.60 / 64.56 | 89.23 / 62.47 | 88.32 / 62.20 | N/A           |
| Carphone        | 95.94 / 75.87 | 96.60 / 78.36 | 97.30 / 81.41 | 97.87 / 83.41 | 98.14 / 85.87 | 98.43 / 88.66 | N/A           |

sequences: "Mother-daughter" (small motion, talking head and shoulders), "Flower" (large motion with camera panning), and "Carphone" (large motion). The sequences are coded at 30 Hz. The picture sequence is IPPP with the I-frame refresh rate set at every 15 frames. We consider 1 reference frame. The rest of our simulation conditions are summarized in Table 2. Figure 3 shows curves that compare the rate-distortion efficiencies of Full Search ME, SUMH, and the modified SUMH. Figure 4 shows curves that compare the rate-distortion efficiencies of Full Search ME and the single- and multiple-iteration parallel content-adaptive 4SS of [14]. In

Tables 3 and 4, we show a comparison of the speed-up ratios of SUMH and the modified SUMH. Table 5 shows the average percentage bit rate increase of the modified SUMH when compared with Full Search ME and SUMH. Finally Table 6 shows the average Y-PSNR loss of the modified SUMH when compared with Full Search ME and SUMH. From Figures 3 and 4, we see that the modified SUMH has a better rate-distortion performance than the proposed parallel content-adaptive 4SS of [14], even under smaller search ranges. In Section 3 we will show comparisons of our supporting architecture with the supporting architecture

[Figure 3 (four R-D curves, each plotting Y-PSNR (dB) against bitrate (kbps) for Full Search, SUMH, and the modified SUMH; 1 reference frame, IPPP): Stefan (CIF, SR = 16), Foreman (CIF, SR = 32), Silent (QCIF, SR = 16), and Coastguard (QCIF, SR = 32).]

Figure 3: Comparison of rate-distortion efficiencies for the modified SUMH.

proposed in [14]. Note, though, that the architecture in [14] is implemented on an ASIC (TSMC 0.18-μm 1P6M technology), while our architecture is implemented on an FPGA. From Figure 3 and Table 6 we also observe that the largest PSNR losses occur in the "Foreman" sequence, while the smallest PSNR losses occur in "Silent." This is because the "Foreman" sequence has both high local object motion and greater high-frequency content. It therefore performs the worst under a given bit rate constraint. On the other hand, "Silent" is a low-motion sequence. It therefore performs much better under the same bit rate constraint.

Given the tested frames from Table 2 for each sequence, we observe additionally from Table 6 that Full Search performs better than the modified SUMH for sequences with larger local object (foreground) motion, but little or no background motion. These sequences include “Foreman,” “Carphone,” “Mother-daughter,” and “Silent.” However the rate-distortion performance of the modified SUMH improves for sequences with large foreground and background motions. Such sequences include “Flower,” “Stefan,” and “Coastguard.” We therefore suggest that a yet greater improvement in the rate-distortion performance of

[Figure 4 (four R-D curves; panels: Stefan, Foreman, Silent, and Coastguard, all CIF, SR = 32, 1 reference frame, IPPP; each plots PSNR (dB) against bitrate (kbps) for FS, the proposed content-adaptive parallel-VBS 4SS, and the single-iteration parallel-VBS 4SS).]

Figure 4: Comparison of rate-distortion efficiencies for parallel content-adaptive 4SS of [25] (reproduced from [25]).

the modified SUMH algorithm can be achieved by improving its local motion estimation. For Table 3, we define the speed-up ratio as the ratio of the ME coding time of Full Search to the ME coding time of the algorithm under consideration. From Table 3 we see that the speed-up ratio increases as the quantization parameter (QP) decreases, because there are fewer skip-mode macroblocks at lower QP. From our results in Table 3, we further calculate the percentage time savings t for ME computation according to

    t = (1 − 1/r) × 100,    (6)

where r are the data points in Table 3. The percentage time savings obtained are displayed in Table 4. From Table 4, we find that SUMH saves 88.3% to 98.8% in ME computation time compared to Full Search, while the modified SUMH saves 60.0% to 91.7%. Therefore, the modified SUMH does not incur much loss in terms of ME computation time. In our experiments we set rate-distortion optimization to high complexity mode (i.e., rate-distortion optimization is turned on), in order to ensure that all of the algorithms compared have a fair chance to yield their highest rate-distortion performance. From Table 5 we find that the

Table 5: Average percentage bit rate increase for modified SUMH.

| Sequence        | Compared with Full Search | Compared with SUMH |
| Foreman         | 1.29                      | −0.04              |
| Stefan          | −0.34                     | 0.40               |
| Mother-daughter | −0.05                     | 0.15               |
| Flower          | −0.17                     | 0.19               |
| Coastguard      | −0.02                     | −0.03              |
| Silent          | −0.33                     | 0.56               |
| Carphone        | −0.06                     | 0.27               |

average percentage bit rate increase of the modified SUMH is very low. When compared with Full Search, there is a bit rate improvement (decrease in bit rate), in “Coastguard” of 0.02%. The worst bit rate increase is in “Foreman” and that is 1.29%. When compared with SUMH, there is a bit rate improvement (decrease in bit rate), going from 0.04% (in “Coastguard”) to 0.34% (in “Stefan”). From Table 6 we see that the average PSNR loss for the modified SUMH is very low. When compared to Full Search, the PSNR loss for modified SUMH ranges from 0.006 dB to

0.03 dB. When compared to SUMH, most of the sequences show a PSNR improvement of up to 0.02 dB, while two of the sequences show a PSNR loss of 0.002 dB. Thus in general, the losses when compared with Full Search are insignificant, while on the other hand there is an improvement when compared with SUMH. We therefore conclude that the modified SUMH can be used without much penalty, instead of Full Search ME, for ME in H.264/AVC.
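The time savings in Table 4 follow from the speed-up ratios r in Table 3 through relation (6); a quick numerical check of that relation:

```python
def time_savings(r):
    """Percentage ME time saved for a speed-up ratio r, from (6)."""
    return (1.0 - 1.0 / r) * 100.0

# Foreman at QP 22 (Table 3): SUMH speed-up 48.55, modified SUMH 8.16.
# These correspond to the Table 4 entries 97.94% and 87.75%.
savings_sumh = round(time_savings(48.55), 2)
savings_modified = round(time_savings(8.16), 2)
```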

3. Proposed Supporting Architecture

Our top-level architecture for fast integer VBSME is shown in Figure 5. The architecture is composed of search window (SW) memory, current MB memory, an address generation unit (AGU), a control unit, a block of processing units (PUs), an SAD combination tree, a comparison unit, and a register for storing the 41 minimum SADs and their associated motion vectors. While the current and reference frames are stored off-chip in external memory, the current MB (CMB) data and the search window (SW) data are stored in on-chip, dual-port block RAMs (BRAMs). The SW memory has N 16 × 16 BRAMs that store N candidate MBs, where N is related to the search range s. N can be chosen to be any factor or multiple of |s| so as to achieve a tradeoff between speed and hardware cost. For example, if we consider a search range of s = ±16, then we can choose N such that N ∈ {. . ., 32, 16, 8, 4, 2, 1}. The AGU generates addresses for the blocks being processed. There are N PUs, each containing 16 processing elements (PEs) in a 1D array. A PU, shown in Figure 6, calculates 16 4 × 4 SADs for one candidate MB, while a PE, shown in Figure 8, calculates the absolute difference between two pixels, one each from the candidate MB and the current MB. From Figure 6, groups of 4 PEs in the PU calculate 1 column of 4 × 4 SADs. These are stored via demultiplexing in registers D1–D4, which hold the inputs to the SAD combination tree, one of which is shown in Figure 7. For N PUs there are N SAD combination trees. Each SAD combination tree further combines the 16 4 × 4 output SADs from one PU to yield a total of 41 SADs per candidate MB. Figure 7 shows that the 16 4 × 4 SADs are combined such that registers D6 contain 4 × 8 SADs, D7 contain 8 × 8 SADs, D8 contain 8 × 16 SADs, D9 contain 16 × 8 SADs, D10 contain 8 × 4 SADs, and finally, D11 contains the 16 × 16 SAD. These SADs are compared appropriately in the comparison unit (CU).
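The SAD combination tree can be modelled in software to check the 41-SAD bookkeeping. The sketch below is our own behavioural model, not the RTL; block sizes are written width × height, and sad4x4[r][c] is assumed to hold the SAD of the 4 × 4 block at row r, column c of the macroblock. It merges the sixteen 4 × 4 SADs exactly as described: 16 (4 × 4) + 8 (4 × 8) + 8 (8 × 4) + 4 (8 × 8) + 2 (8 × 16) + 2 (16 × 8) + 1 (16 × 16) = 41.

```python
def combine_sads(sad4x4):
    """Combine the 16 4x4 SADs of one candidate MB into the 41
    variable-block-size SADs used by H.264/AVC inter prediction."""
    sads = []
    # 4x4: the 16 leaf SADs
    for r in range(4):
        for c in range(4):
            sads.append(sad4x4[r][c])
    # 4x8: vertically stacked pairs; 8x4: horizontally adjacent pairs
    s4x8 = [[sad4x4[2 * r][c] + sad4x4[2 * r + 1][c] for c in range(4)]
            for r in range(2)]
    s8x4 = [[sad4x4[r][2 * c] + sad4x4[r][2 * c + 1] for c in range(2)]
            for r in range(4)]
    # 8x8 quadrants, then the half- and full-macroblock sums
    s8x8 = [[s4x8[r][2 * c] + s4x8[r][2 * c + 1] for c in range(2)]
            for r in range(2)]
    s8x16 = [s8x8[0][c] + s8x8[1][c] for c in range(2)]
    s16x8 = [s8x8[r][0] + s8x8[r][1] for r in range(2)]
    s16x16 = s16x8[0] + s16x8[1]
    for grid in (s4x8, s8x4, s8x8):
        for row in grid:
            sads.extend(row)
    sads.extend(s8x16)
    sads.extend(s16x8)
    sads.append(s16x16)
    return sads   # 41 SADs; the 16x16 SAD is last
```

With all 4 × 4 SADs equal to 1, the model returns 41 values ending with a 16 × 16 SAD of 16, mirroring the D6–D11 register contents described above.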
CU consists of 41 N-input comparing elements (CEs). A CE is shown in Figure 9. 3.1. Address Generation Unit. For each of N MBs being processed simultaneously, the AGU generates the addresses of the top row and the leftmost column of 4 × 4 sub-blocks. The address of each sub-block is the address of its top left pixel. From the addresses of the top row and leftmost column of 4 × 4 sub-blocks, we obtain the addresses of all other block partitions in the MB. The interface of the AGU is fixed and we parameterize it by the address of the current MB, the search type and the

Table 6: Average Y-PSNR loss for modified SUMH.

| Sequence        | Compared with Full Search | Compared with SUMH |
| Foreman         | 0.0290 dB                 | −0.0065 dB         |
| Stefan          | −0.0125 dB                | 0.0058 dB          |
| Mother-daughter | −0.0020 dB                | 0.0187 dB          |
| Flower          | −0.0002 dB                | 0.0042 dB          |
| Coastguard      | 0.0078 dB                 | 0.0018 dB          |
| Silent          | 0.0098 dB                 | 0.0018 dB          |
| Carphone        | −0.0225 dB                | 0.0205 dB          |

Table 7: Search passes for modified SUMH.

| Pass | Description                                                                  |
| 1-2  | Horizontal scan of cross search; candidate MBs separated by 2 pixels         |
| 3-4  | Vertical scan of cross search; candidate MBs separated by 2 pixels           |
| 5    | Hexagon search: 6 search points                                              |
| 6–13 | Multi-big hexagon search: (1/4)|s| hexagons, each containing 16 search points |
| 14   | Extended hexagon search: 6 search points                                     |
| 15   | Diamond search: 4 search points                                              |
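With N = 8 PUs and s = ±16, the pass count in Table 7 and the SW memory size quoted in Section 3.2 can be checked with a short model (our own sketch; the per-step search-point counts follow Section 2.3):

```python
import math

def num_search_passes(s=16, n_pu=8):
    """Search passes needed when n_pu candidate MBs are examined per pass.
    Per-step point counts follow Section 2.3 for search range +/-s."""
    steps = [
        2 * s,           # cross search (horizontal + vertical scans)
        6,               # hexagon search
        (s // 4) * 16,   # multi-big hexagon: s/4 hexagons x 16 points
        6,               # extended hexagon search
        4,               # extended diamond search
    ]
    return sum(math.ceil(points / n_pu) for points in steps)

def sw_memory_bytes(n_pu=8, mb_side=16):
    """SW memory: one 16x16 candidate MB (256 bytes) per PU."""
    return n_pu * mb_side * mb_side

# For s = +/-16 and N = 8 this gives 15 passes and 2048 bytes,
# matching Table 7 and the SW memory size quoted in Section 3.2.
```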

search pass. The search type is modified SUMH; however, we can expand our architecture to support other types of search, for example, Full Search. The search pass depends on the search step and the search range. We show, for instance, in Table 7 that there are 15 search passes for the modified SUMH considering a search range s = ±16. There is a separation of 2 pixels between 2 adjacent search points in the cross search; therefore address generation for search passes 1 to 4 in Table 7 is straightforward. For the remaining search passes 5–15, tables of constant offset values are obtained from the JM reference software [24]. These offset values are the separation in pixels between the minimum MV from the previous search pass and the candidate search point. In general, the affine address equations can be represented by

AE_x = i·C_x,    AE_y = i·C_y,    (7)

where AE_x and AE_y are the horizontal and vertical addresses of the top left pixel in the MB, i is a multiplier, and C_x and C_y are constants obtained from the JM reference software.

3.2. Memory. Figures 10 and 11 show the CMB and search window (SW) memory organization for N = 8 PUs. Both CMB and SW memories are synthesized into BRAMs. Considering a search range of s = ±16, there are 15 search passes for the modified SUMH search flowchart shown in Figure 2. These search passes are shown in Table 7. In each search pass, 8 MBs are processed in parallel; hence the SW memory organization is as shown in Figure 11. The SW memory is 128 bytes wide and the required memory size is 2048 bytes. For the same search range s = ±16, if FSME were used along with levels A and B data reuse, the SW size would be

Figure 5: The proposed architecture for fast integer VBSME.

Figure 6: The architecture of a Processing Unit (PU).

48 × 48 pixels, that is 2304 bytes [25]. Thus by using the modified SUMH, we achieve an 11% on-chip memory savings even without a data reuse scheme. In each clock cycle, we load 64 bits of data. This means that it takes 256 cycles to load data for one search pass and 3840 (256 × 15) cycles to load data for one CMB. Under similar conditions for FSME it would take 288 clock cycles to load data for one CMB. Thus the ratio of the required memory bandwidth for the modified SUMH to the required memory bandwidth for FSME is 13.3. While this ratio is undesirably high, it is well mitigated by the fact that there

are only 113 search locations for one CMB in the modified SUMH, compared to 1089 search locations for one CMB in FSME. In other words, the amount of computation for one CMB in the modified SUMH is approximately 0.1 times that for FSME. Thus there is an overall power saving in using the modified SUMH instead of FSME.

3.3. Processing Unit. Table 8 shows the pixel data schedule of the N PUs for the search passes of the cross search. In Table 8 we consider as an illustrative example the cross search and a search range s = ±16, hence the given pixel coordinates.
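The PU behavior — 16 PEs consuming one row of candidate/current pixel pairs per clock, with all 16 4 × 4 SADs complete after 16 clocks — can be sketched behaviorally as follows (illustrative Python, not the Verilog source; the exact clock-to-register mapping is our assumption):

```python
# Behavioral sketch of one processing unit (PU). Each clock, the 16 PEs
# take one row of 16 pixel pairs; groups of 4 PEs accumulate one block
# column, so the 16 4x4 SADs are all available after 16 clocks (cf. Table 8).

def pu_4x4_sads(cand, cmb):
    sads = [[0] * 4 for _ in range(4)]   # sads[block_row][block_col]
    for clk in range(16):                # one pixel row per clock
        for pe in range(16):             # 16 PEs, one per pixel column
            diff = abs(cand[clk][pe] - cmb[clk][pe])
            sads[clk // 4][pe // 4] += diff
    return [s for row in sads for s in row]  # 16 4x4 SADs, row-major
```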

Figure 7: SAD combination tree. (Register legend: D5 holds the 4 × 4 SADs, D6 the 4 × 8 SADs, D7 the 8 × 8 SADs, D8 the 8 × 16 SADs, D9 the 16 × 8 SADs, D10 the 8 × 4 SADs, and D11 the 16 × 16 SAD.)

Table 8: Data schedule for processing unit (PU).

Clock | PU1 | ··· | PU8 | Comments
1–16 | (−15, 0)–(0, 0) … (−15, −15)–(0, −15) | ··· | (−1, 0)–(14, 0) … (−1, −15)–(14, −15) | Search pass 1: left horizontal scan of cross search
17–32 | (1, 0)–(16, 0) … (1, −15)–(16, −15) | ··· | (15, 0)–(30, 0) … (15, −15)–(30, −15) | Search pass 2: right horizontal scan of cross search
33–48 | (0, 15)–(15, 15) … (0, 0)–(15, 0) | ··· | (0, 1)–(15, 1) … (0, −14)–(15, −14) | Search pass 3: top vertical scan of cross search
49–64 | (0, −1)–(15, −1) … (0, −16)–(15, −16) | ··· | (0, −15)–(15, −15) … (0, −30)–(15, −30) | Search pass 4: bottom vertical scan of cross search

Table 8 shows that it takes 16 cycles to output the 16 4 × 4 SADs from each PU.

3.4. SAD Combination Tree. The data schedule for the SAD combination is shown in Table 9. There are N SAD combination (SC) trees, each processing the 16 4 × 4 SADs that are output from each PU. It takes 5 cycles to combine the 16 4 × 4 SADs and output 41 SADs for the 7 interprediction block sizes in H.264/AVC: 1 16 × 16 SAD, 2 16 × 8 SADs, 2 8 × 16 SADs, 4 8 × 8 SADs, 8 8 × 4 SADs, 8 4 × 8 SADs, and 16 4 × 4 SADs.
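The combination of the 16 4 × 4 SADs into the 41 partition SADs can be sketched behaviorally as follows (illustrative Python, not the Verilog implementation; the row-major block indexing and the vertical/horizontal pairing convention for "4 × 8" versus "8 × 4" are our assumptions):

```python
# Sketch of one SAD combination (SC) tree: the 16 4x4 SADs from a PU
# (row-major, s44[4*r + c]) are summed pairwise into the 41 SADs of the
# 7 H.264/AVC inter partition sizes.

def combine_sads(s44):
    """16 4x4 SADs (row-major) -> list of 41 partition SADs."""
    assert len(s44) == 16
    blk = lambda r, c: s44[4 * r + c]
    s4x8 = [blk(r, c) + blk(r + 1, c) for r in (0, 2) for c in range(4)]  # 8 4x8
    s8x4 = [blk(r, c) + blk(r, c + 1) for r in range(4) for c in (0, 2)]  # 8 8x4
    s8x8 = [s4x8[i] + s4x8[i + 1] for i in (0, 2, 4, 6)]                  # 4 8x8
    s8x16 = [s8x8[0] + s8x8[2], s8x8[1] + s8x8[3]]                        # 2 8x16
    s16x8 = [s8x8[0] + s8x8[1], s8x8[2] + s8x8[3]]                        # 2 16x8
    s16x16 = [sum(s44)]                                                   # 1 16x16
    return s16x16 + s16x8 + s8x16 + s8x8 + s8x4 + s4x8 + list(s44)       # 41 total
```

Each level of the tree corresponds to one pipeline stage, which is why the combination takes 5 cycles in hardware.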

Figure 8: Processing element (PE).

Figure 9: Comparing element (CE).

Table 9: Data schedule for SAD combination (SC) unit.

Clock | SC1 | SC2 | ··· | SC8
17 | 16 4 × 4 SAD | 16 4 × 4 SAD | ··· | 16 4 × 4 SAD
18–19 | 8 4 × 8 SAD, 8 8 × 4 SAD, 4 8 × 8 SAD | 8 4 × 8 SAD, 8 8 × 4 SAD, 4 8 × 8 SAD | ··· | 8 4 × 8 SAD, 8 8 × 4 SAD, 4 8 × 8 SAD
20 | 2 8 × 16 SAD, 2 16 × 8 SAD | 2 8 × 16 SAD, 2 16 × 8 SAD | ··· | 2 8 × 16 SAD, 2 16 × 8 SAD
21 | 1 16 × 16 SAD | 1 16 × 16 SAD | ··· | 1 16 × 16 SAD

3.5. Comparison Unit. The data schedule for the CU is shown in Table 10. The CU consists of 41 CEs, each element processing N SADs of the same interprediction block size from the N PUs. Each CE compares SADs in twos. It therefore takes log2(N) + 1 cycles to output the 41 minimum SADs. Thus, given N = 8, the CU consumes 4 cycles.

Figure 10: Data arrangement in current macroblock (CMB) memory.

Figure 11: Data arrangement in search window (SW) memory.
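The CE's pairwise (tournament) reduction can be sketched as follows (illustrative Python; the index carried alongside each SAD stands in for its associated motion vector):

```python
# Sketch of one comparing element (CE): a pairwise reduction of the N
# SADs for one partition size. N -> N/2 -> ... -> 1 takes log2(N) halving
# steps, plus one cycle to latch the result: 4 cycles for N = 8 (Table 10).

def ce_min(sads):
    vals = list(enumerate(sads))            # (candidate index, SAD) pairs
    cycles = 0
    while len(vals) > 1:
        nxt = []
        for k in range(0, len(vals) - 1, 2):
            a, b = vals[k], vals[k + 1]
            nxt.append(a if a[1] <= b[1] else b)
        if len(vals) % 2:                   # odd leftover passes through
            nxt.append(vals[-1])
        vals = nxt
        cycles += 1
    return vals[0], cycles + 1              # +1 cycle to register the result
```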

3.6. Summary of Dataflow. The dataflow represented by the data schedules described variously in Tables 8–10 may be summarized by the algorithmic state machine (ASM) chart shown in Figure 12. The ASM chart also represents the mapping of the modified SUMH algorithm in Figure 2, to our proposed architecture in Figure 5. In our ASM chart, there are 6 states and 2 decision boxes. The states are labeled S1 to S6, while the decision boxes are labeled Q1 and Q2. In each state box, we provide the summary description of the state as well as its output variables in italic font. From Figure 12 we see that implementation of the modified SUMH on our proposed architecture IP core starts in state S1 when the motion vector (MV) predictors are checked. This is done by the PowerPC processor which is part of our SoC prototyping platform (see Section 4). The MV predictors are stored in external memory and accessed from there by the PowerPC processor. The output from state S1 is the MV predictors. In the next state S2, the minimum MV cost is obtained and mode decision is done to obtain the right blocktype. This is also done by the PowerPC processor and the outputs of this state are the minimum MV, its SAD

cost, its blocktype, and its address. The minimum MV cost is obtained by minimizing the cost in →  m, REF | λmotion Jmotion −      → − → − − → = SAD dx, d y, REF, m + λmotion · R m − p +R(REF) ,

(8) → m = (mx , m y )T is the current MV being considered, where − REF denotes the reference picture, λmotion is the Lagrangian → m) is the SAD cost obtained as multiplier, SAD(dx, d y, REF, − − → → m− in (1), p = (px , p y ) is the MV used for the prediction, R(− − → p ) represents the number of bits used for MV coding, and, R(REF) is the bits for coding REF. In the state S3, some of the outputs from state S2 are passed into our proposed architecture IP core. In state S4, the AGU computes the addresses of candidate blocks, using the address of the MV predictor as the base address, and the control unit waits for the initialization of search window data in the BRAMs. The output of state S4 is


Table 10: Data schedule for comparison unit (CU).

Clock | CE1–CE16 | CE17–CE32 | CE33–CE36 | CE37–CE40 | CE41
22 | 8 4 × 4 SAD | 8 4 × 8 SAD, 8 8 × 4 SAD | 8 8 × 8 SAD | 8 8 × 16 SAD, 8 16 × 8 SAD | 8 16 × 16 SAD
23 | 4 4 × 4 SAD | 4 4 × 8 SAD, 4 8 × 4 SAD | 4 8 × 8 SAD | 4 8 × 16 SAD, 4 16 × 8 SAD | 4 16 × 16 SAD
24 | 2 4 × 4 SAD | 2 4 × 8 SAD, 2 8 × 4 SAD | 2 8 × 8 SAD | 2 8 × 16 SAD, 2 16 × 8 SAD | 2 16 × 16 SAD
25 | 1 4 × 4 SAD | 1 4 × 8 SAD, 1 8 × 4 SAD | 1 8 × 8 SAD | 1 8 × 16 SAD, 1 16 × 8 SAD | 1 16 × 16 SAD

Figure 12: Algorithmic state machine chart for the modified SUMH algorithm.

the addresses of the candidate blocks and a flag indicating that BRAM initialization is complete. In state S5, the processing units and SAD combination trees compute the SADs of the candidate blocks. The output of S5 is the computed SADs and unchanged AGU addresses. In state S6, the CU compares these SADs with previously computed SADs and obtains the 41 minimum SADs. The outputs of S6 are the 41 minimum SADs and their corresponding addresses.

In decision box Q1, we check whether the current search pass is the last search pass of a particular search step, for example, the cross search step. If not, we continue with the other passes of that search step. If so, we go to decision box Q2. In Q2 we check whether it is the last search pass of the modified SUMH algorithm. If not, we move on to the next search step, for example, the hexagon search. If so, we check the MV predictors of the next current macroblock, according to the HF3V2 2-stitched zigzag scan proposed in [19].
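The Lagrangian cost minimized in state S2 (equation (8)) can be sketched as follows; the SAD lookup and bit-count functions are passed in, since the paper obtains them from (1) and from the entropy coder, and all names here are illustrative:

```python
# Sketch of the Lagrangian motion cost of equation (8):
# J = SAD(m) + lambda * (R(m - p) + R(REF)).
# sad_of(m) and rate_bits(mvd) stand in for the SAD of (1) and the
# MV-coding bit count; both are supplied by the caller.

def motion_cost(m, p, sad_of, rate_bits, lam, r_ref=0):
    mvd = (m[0] - p[0], m[1] - p[1])    # motion vector difference
    return sad_of(m) + lam * (rate_bits(mvd) + r_ref)

def best_mv(candidates, p, sad_of, rate_bits, lam):
    """MV minimizing J over the candidate set (state S2 behavior)."""
    return min(candidates, key=lambda m: motion_cost(m, p, sad_of, rate_bits, lam))
```

Note that a candidate with the smallest SAD is not necessarily chosen: a large MV difference from the predictor p can cost more rate than the SAD it saves.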


Table 11: Synthesis results.

Process (μm) | 0.13 (FPGA)
Number of slices | 11.4 K
Number of slice flip-flops | 16.4 K
Number of 4-input LUTs | 18.7 K
Total equivalent gate count | 388 K
Max frequency (MHz) | 145.2
Algorithm | Modified SUMH
Video specifications | CIF 30-fps
Search range | ±16
Block size | 16 × 16 to 4 × 4
Minimum required frequency (MHz) | 24.1
Number of 16 × 8-bit dual-port RAMs | 129
Memory utilization (Kb) | 398
Voltage (V) | 1.5
Power consumed (mW) | 25

3.7. Synthesis Results and Analysis. The proposed architecture has been implemented in Verilog HDL. Simulation and functional verification of the architecture were done using the Mentor Graphics ModelSim tool [26]. We then synthesized the architecture using the Xilinx synthesis tool (XST). XST is part of the Xilinx integrated software environment (ISE) [27]. After synthesis, place and route is done targeting the Virtex-II Pro XC2VP30 Xilinx FPGA on our development board. Finally, we obtain power analysis for our design using the XPower tool, which is also part of Xilinx ISE. Our synthesis results are shown in Table 11.

From Table 11 we see that our architecture can achieve a maximum frequency of 145.2 MHz. The FPGA power consumption of our architecture is 25 mW, obtained using the Xilinx XPower tool. The total equivalent gate count is 388 K. Our simulations in ModelSim support the dataflow described in Sections 3.1 to 3.6. We find that it takes 27 cycles to obtain the minimum SAD from each search pass, after initialization. The 27 cycles comprise 1 cycle for the AGU, 1 cycle to read data from on-chip memory, 16 cycles for the PU, 5 cycles for the SAD combination tree, and 4 cycles for the comparison unit. Therefore, it takes 405 (15 × 27) cycles to complete the search for 1 CMB, 1 reference frame, and s = ±16. For a CIF image (396 MBs) at 30 Hz and considering 5 reference frames, a minimum clock frequency of approximately 24.1 MHz (405 × 396 × 30 × 5 cycles per second) is required. Thus, with a maximum possible clock speed of 145.2 MHz, our architecture can compute CIF sequences in real time within a search range of ±16 and using 5 reference frames.

We provide Table 12, which compares our architecture with previous state-of-the-art architectures implemented on ASICs. Note that a direct comparison of our implementation with implementations done in ASIC technology is impossible because the platforms are different. ASICs still provide the highest performance in terms of area, power consumption, and maximum frequency. However, we provide Table 12 not for direct comparison, but to show that our implementation achieves ASIC-like levels of performance. This is desirable because it indicates that an ASIC implementation of our architecture would yield even better performance results. Our Verilog implementation was kept portable in order to simplify FPGA-to-ASIC migration.

From Table 12 we see that our architecture achieves many desirable results. The most remarkable is that the power consumption is very low despite the fact that our implementation is done on an FPGA, which typically consumes more power than an ASIC. Besides the low power consumption of our architecture, another favorable result is that the algorithm we use has better PSNR performance than the algorithms used in the other works. We also note that our architecture achieves the highest maximum frequency, and by extension it is the only one that can support high-definition (HD) 1080p sequences at 30 Hz, a search range s = ±16, and 1 reference frame. This would require a minimum frequency of approximately 85.9 MHz. In the next section we discuss our prototyping efforts and compare our results with similar works.
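The cycle-count arithmetic behind the minimum-frequency requirement can be checked with a one-line model (the helper name is ours, not from the paper):

```python
# Minimum clock frequency check: 27 cycles per search pass, 15 passes per
# CMB per reference frame (405 cycles per CMB), scaled by macroblocks per
# frame, frame rate, and number of reference frames.

def min_freq_mhz(cycles_per_pass=27, passes=15, mbs=396, fps=30, refs=1):
    return cycles_per_pass * passes * mbs * fps * refs / 1e6

cif_5ref = min_freq_mhz(mbs=396, refs=5)  # CIF, 30 fps, 5 reference frames
```

For CIF at 30 fps with 5 reference frames this gives 24.057 MHz, matching the 24.1 MHz figure in Table 11.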

4. Architecture Prototype

The top-level prototype design of our architecture is shown in Figure 13. It is based on the prototype design in [25]. In [25], Canals et al. propose an FPSoC-based architecture for the Full Search block matching algorithm. Their implementation is done on a Virtex-4 FPGA. Our prototype is done on the XUPV2P development board available from Digilent Inc. [28]. The board contains a Virtex-II Pro XC2VP30 FPGA with 30,816 logic cells, 136 18-bit multipliers, 2,448 Kb of block RAM, and two PowerPC processors. There are several connectors, which include a serial RS-232 port for communication with a host personal computer. The board also features JTAG programming via an on-board USB2 port, as well as a DDR SDRAM DIMM that can accept up to 2 Gbytes of RAM. The embedded development tool used to design our prototype is the Xilinx Platform Studio (XPS) in the Xilinx Embedded Development Kit (EDK) [29]. The EDK makes it relatively simple to integrate user Intellectual Property (IP) cores as peripherals in an FPSoC. Hardware/software co-simulation can then be done to test the user IP. In our prototype design, as shown in Figure 13, we employ a PowerPC hard-core embedded processor as our controller. The processor sends stimuli to the motion estimation IP core and reads results back for comparison. The processor is connected to the other design modules via a 64-bit processor local bus (PLB). The boot program memory is a 64 Kb BRAM. It contains a bootloop program necessary to keep the processor in a known state after we load the hardware and before we load the software. The PLB connects to the user IP core through an IP interface (IPIF). This interface exposes several programmable interconnects. We use a slave-master FIFO attachment that is 64 bits wide and 512 positions deep. The status and control signals of the FIFO are available to the user logic block. The user logic block contains logic for reading

Table 12: Comparison with other architectures implemented on ASICs.

 | Chao's et al. [11] | Miyakoshi's et al. [12] | Lin's [13] | Chen's et al. [14] | This work
Process (μm) | 0.35 | 0.18 | 0.18 | 0.18 | 0.13 (FPGA)
Voltage (V) | 3.3 | 1.0 | 1.8 | 1.3 | 1.5
Transistor count | 301 K | 1000 K | 546 K | 708 K | 388 K
Maximum frequency (MHz) | 50 | 13.5 | 48.67 | 66 | 145.2
Video spec. | CIF 30-fps | CIF 30-fps | CIF 30-fps | CIF 30-fps | CIF 30-fps
Operating frequency (MHz) | 50 | 13.5 | 48.67 | 13.5 | 24.1
Algorithm | Diamond search | Gradient descent | 4SS | Single-iteration parallel VBS 4SS w/ 1 ref. | Hardware oriented SUMH
Block size | 16 × 16 and 8 × 8 | 16 × 16 | 16 × 16 and 8 × 8 | 16 × 16 to 4 × 4 | 16 × 16 to 4 × 4
Power (mW) | 223.6 | 6.56 | 8.46 | 2.13 | 25
Normalized power (1.8 V, 0.18 μm)∗ | 17.60 | 21.25 | 8.46 | 4.08 | 69.02
Architecture | 1D tree. No data reuse scheme | 1D tree. No data reuse scheme | 1D tree. Level A data reuse scheme | 2D tree. Level B data reuse scheme | 1D tree. No data reuse scheme
Can support HD 1920 × 1080p | No | No | No | No | Yes

∗Normalized power = Power × (0.18²/process²) × (1.8²/voltage²).
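The footnote's normalization, which scales each design's measured power to a common 0.18 μm / 1.8 V operating point, can be reproduced directly (helper name ours):

```python
# Normalized power from the Table 12 footnote:
# P_norm = P * (0.18 / process_um)^2 * (1.8 / voltage_v)^2

def normalized_power(power_mw, process_um, voltage_v):
    return power_mw * (0.18 / process_um) ** 2 * (1.8 / voltage_v) ** 2
```

Applied to the table's raw figures, this recovers the normalized column (e.g. 223.6 mW at 0.35 μm / 3.3 V scales to about 17.6 mW).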

Table 13: Comparison with other FPSoC architectures.

 | Canals et al. [25] | This work
FPGA | Virtex-4 | Virtex-II Pro
Algorithm | Full Search | Hardware oriented SUMH
Video format | QCIF | QCIF
Search range | ±16 | ±16
Number of slices | 12.5 K | 11.4 K
Memory utilization (Kb) | 784 | 398
Clock frequency (MHz) | 100 | 100

Figure 13: FPSoC prototype design of our architecture.

and writing to the FIFO, and the Verilog implementation of our architecture. During operation, the PowerPC processor writes input stimuli to the FIFO and sets the status and control bits. The

user logic reads the status and control signals and, when appropriate, reads data from the FIFO. The data passes into the IP core and, when the ME computation is done, the results are written back to the FIFO. The PowerPC reads the results and compares them with expected results to verify the accuracy of the IP. Intermediate results during operation are sent to a terminal on the host personal computer via the RS-232 serial connection. We target QCIF video for our prototype in order to compare our results with the results in [25]. Table 13 shows this comparison. We see from Table 13 that our architecture consumes fewer FPGA resources and has a lower memory utilization. Again, we note that a direct comparison of the two architectures is complicated by the fact that different FPGAs

were used in both prototyping platforms. The work in [25] is based on a Virtex-4 FPGA which uses 90-nm technology, while our work is based on a Virtex-II Pro FPGA which uses 130-nm technology.

5. Conclusion

In this paper we have presented our low-power, FPSoC-based architecture for a fast ME algorithm in H.264/AVC. We described our adopted fast ME algorithm, which is a hardware oriented SUMH algorithm. We showed that the modified SUMH has superior rate-distortion performance compared to some existing state-of-the-art fast ME algorithms. We also described our architecture for the hardware oriented SUMH. We showed that the FPGA-based implementation of our architecture yields ASIC-like levels of performance in terms of speed, area, and power. Our results showed, in addition, that our architecture has the potential to support HD 1080p, unlike the other architectures we compared it with. Finally, we have discussed our prototyping efforts and compared them with a similar prototyping effort. Our results showed that our implementation uses fewer FPGA resources. In summary, therefore, the modified SUMH is more attractive than SUMH because it is hardware oriented. It is also more attractive than Full Search because, although Full Search is hardware oriented, it is much more complex than the modified SUMH and thus requires more hardware area, speed, and power for implementation. We therefore conclude that for low-power handheld devices, the modified SUMH can be used, without much penalty, instead of Full Search for ME in H.264/AVC.

Acknowledgments The authors acknowledge the support from Xilinx Inc., the Xilinx University Program, the Packard Foundation and the Department of Electrical Engineering, Santa Clara University, California. The authors also thank the editor and Reviewers of this journal for their useful comments.

References [1] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, “Overview of the H.264/AVC video coding standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560–576, 2003. [2] G. J. Sullivan, P. Topiwala, and A. Luthra, “The H.264/AVC advanced video coding standard: overview and introduction to the fidelity range extensions,” in Proceedings of the 27th Conference on Applications of Digital Image Processing, vol. 5558 of Proceedings of SPIE, pp. 454–474, August 2004. [3] H.-C. Lin, Y.-J. Wang, K.-T. Cheng, et al., “Algorithms and DSP implementation of H.264/AVC,” in Proceedings of the Asia and South Pacific Design Automation Conference (ASPDAC ’06), pp. 742–749, Yokohama, Japan, January 2006. [4] Z. Chen, P. Zhou, and Y. He, “Fast integer pel and fractional pel motion estimation for JVT,” in Proceedings of the 6th Meeting of the Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCE, Awaji Island, Japan, December 2002, JVT-F017.

15 [5] R. Li, B. Zeng, and M. L. Liou, “New three-step search algorithm for block motion estimation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 4, no. 4, pp. 438–442, 1994. [6] L.-M. Po and W.-C. Ma, “A novel four-step search algorithm for fast block motion estimation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 6, no. 3, pp. 313–317, 1996. [7] J. Y. Tham, S. Ranganath, M. Ranganath, and A. A. Kassim, “A novel unrestricted center-biased diamond search algorithm for block motion estimation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 8, no. 4, pp. 369–377, 1998. [8] X. Yi, J. Zhang, N. Ling, and W. Shang, “Improved and simplified fast motion estimation for JM,” in Proceedings of the 16th Meeting of the Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, Posnan, Poland, July 2005, JVT-P021.doc. [9] X. Yi and N. Ling, “Improved normalized partial distortion search with dual-halfway-stop for rapid block motion estimation,” IEEE Transactions on Multimedia, vol. 9, no. 5, pp. 995– 1003, 2007. [10] C. De Vleeschouwer, T. Nilsson, K. Denolf, and J. Bormans, “Algorithmic and architectural co-design of a motionestimation engine for low-power video devices,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 12, no. 12, pp. 1093–1105, 2002. [11] W.-M. Chao, C.-W. Hsu, Y.-C. Chang, and L.-G. Chen, “A novel hybrid motion estimator supporting diamond search and fast full search,” in Proceedings of IEEE International Symposium on Circuits and Systems (ISCAS ’02), vol. 2, pp. 492–495, Phoenix, Ariz, USA, May 2002. [12] J. Miyakoshi, Y. Kuroda, M. Miyama, K. Imamura, H. Hashimoto, and M. Yoshimoto, “A sub-mW MPEG-4 motion estimation processor core for mobile video application,” in Proceedings of the Custom Integrated Circuits Conference (ICC ’03), pp. 181–184, 2003. [13] S.-S. Lin, Low-power motion estimation processors for mobile video application, M.S. 
thesis, Graduate Institute of Electronic Engineering, National Taiwan University, Taipei, Taiwan, 2004. [14] T.-C. Chen, Y.-H. Chen, S.-F. Tsai, S.-Y. Chien, and L.-G. Chen, “Fast algorithm and architecture design of low-power integer motion estimation for H.264/AVC,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 17, no. 5, pp. 568–576, 2007. [15] L. Zhang and W. Gao, “Reusable architecture and complexitycontrollable algorithm for the integer/fractional motion estimation of H.264,” IEEE Transactions on Consumer Electronics, vol. 53, no. 2, pp. 749–756, 2007. [16] C. A. Rahman and W. Badawy, “UMHexagonS algorithm based motion estimation architecture for H.264/AVC,” in Proceedings of the 5th International Workshop on System-onChip for Real-Time Applications (IWSOC ’05), pp. 207–210, Banff, Alberta, Canada, 2005. [17] M.-S. Byeon, Y.-M. Shin, and Y.-B. Cho, “Hardware architecture for fast motion estimation in H.264/AVC video coding,” in IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. E89-A, no. 6, pp. 1744–1745, 2006. [18] Y.-Y. Wang, Y.-T. Peng, and C.-J. Tsai, “VLSI architecture design of motion estimator and in-loop filter for MPEG-4 AVC/H.264 encoders,” in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS ’04), vol. 2, pp. 49– 52, Vancouver, Canada, May 2004.

16 [19] C.-Y. Chen, C.-T. Huang, Y.-H. Chen, and L.-G. Chen, “Level C+ data reuse scheme for motion estimation with corresponding coding orders,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 16, no. 4, pp. 553–558, 2006. [20] S. Yalcin, H. F. Ates, and I. Hamzaoglu, “A high performance hardware architecture for an SAD reuse based hierarchical motion estimation algorithm for H.264 video coding,” in Proceedings of the International Conference on Field Programmable Logic and Applications (FPL ’05), pp. 509–514, Tampere, Finland, August 2005. [21] S.-J. Lee, C.-G. Kim, and S.-D. Kim, “A pipelined hardware architecture for motion estimation of H.264/AVC,” in Proceedings of the 10th Asia-Pacific Conference on Advances in Computer Systems Architecture (ACSAC ’05), vol. 3740 of Lecture Notes in Computer Science, pp. 79–89, Springer, Singapore, October 2005. [22] C.-M. Ou, C.-F. Le, and W.-J. Hwang, “An efficient VLSI architecture for H.264 variable block size motion estimation,” IEEE Transactions on Consumer Electronics, vol. 51, no. 4, pp. 1291–1299, 2005. [23] O. Ndili and T. Ogunfunmi, “A hardware oriented integer pel fast motion estimation algorithm in H.264/AVC,” in Proceedings of the IEEE/ECSI/EURASIP Conference on Design and Architectures for Signal and Image Processing (DASIP ’08), Bruxelles, Belgium, November 2008. [24] H.264/AVC Reference Software JM 13.2., 2009, http://iphome.hhi.de/suehring/tml/download. [25] J. A. Canals, M. A. Mart´ınez, F. J. Ballester, and A. Mora, “New FPSoC-based architecture for efficient FSBM motion estimation processing in video standards,” in Proceedings of the International Society for Optical Engineering, vol. 6590 of Proceedings of SPIE, p. 65901N, 2007. [26] Mentor Graphics ModelSim SE User’s Manual—Software Version 6.2d, 2009, http://www.model.com/support. [27] Xilinx ISE 9.1 In-Depth Tutorial, 2009, http://download .xilinx.com/direct/ise9 tutorials/ise9tut.pdf. 
[28] Xilinx Virtex-II Pro Development System, 2009, http:// www.digilentinc.com/Products/Detail.cfm?Prod=XUPV2P. [29] Xilinx Platform Studio and Embedded Development Kit, 2009, http://www.xilinx.com/ise/embedded/edk pstudio.htm.


Hindawi Publishing Corporation EURASIP Journal on Embedded Systems Volume 2009, Article ID 162078, 10 pages doi:10.1155/2009/162078

Research Article

FPGA Accelerator for Wavelet-Based Automated Global Image Registration

Baofeng Li, Yong Dou, Haifang Zhou, and Xingming Zhou

National Laboratory for Parallel and Distributed Processing, National University of Defense Technology, Changsha 410073, China

Correspondence should be addressed to Baofeng Li, [email protected]

Received 14 February 2009; Accepted 30 June 2009

Recommended by Bertrand Granado

Wavelet-based automated global image registration (WAGIR) is fundamental for most remote sensing image processing algorithms and extremely computation-intensive. With more and more algorithms migrating from ground computing to onboard computing, an efficient dedicated architecture for WAGIR is desired. In this paper, a BWAGIR architecture is proposed based on a block resampling scheme. BWAGIR achieves significant performance by pipelining computational logic, parallelizing the resampling process and the calculation of the correlation coefficient, and parallelizing memory access. A proof-of-concept implementation of the architecture with 1 BWAGIR processing unit performs at least 7.4X faster than the CL cluster system with 1 node, and at least 3.4X faster than the MPM massively parallel machine with 1 node. Further speedup can be achieved by parallelizing multiple BWAGIR units. The architecture with 5 units achieves a speedup of about 3X against the CL with 16 nodes and a comparable speed with the MPM with 30 nodes. More importantly, the BWAGIR architecture can be deployed onboard economically.

Copyright © 2009 Baofeng Li et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

With the rapid innovation of remote sensing technology, more and more remote sensing image processing algorithms are required to be performed onboard instead of at the ground station, to meet the requirement of processing voluminous remote sensing data in real time. Image registration [1, 2] is the basis of many image processing operations, such as image fusion, image mosaicking, and geographic navigation. Considering the computation-intensive and memory-intensive characteristics of remote sensing image registration and the limited computing power of onboard computers, implementing image registration efficiently and effectively with a dedicated architecture is of great significance.

In the past twenty years, FPGA technology has developed significantly. The capacity and performance of FPGA chips have increased greatly to accommodate many large-scale applications. Due to its excellent reconfigurability and convenient design flow, the FPGA has become the

most popular choice for hardware designers implementing various kinds of application-specific architectures. Therefore, implementing remote sensing image registration efficiently in an FPGA is just the point of this paper. Though Castro-Pareja et al. [3, 4] have proposed a fast automatic image registration (FAIR) architecture of mutual information-based 3D image registration for medical imaging applications, few works addressing hardware acceleration of remote sensing image registration have been reported.

Many approaches have been proposed for remote sensing image registration. As for hardware implementation, only the automated algorithms are suitable, because onboard computing demands that the algorithms be accurate and robust and operate without manual intervention. Proposed automated remote sensing image registration algorithms can be classified into two categories, CPs-based algorithms [5–12] and global algorithms [13–22]. In the former, some matched control points (CPs) are extracted from both images automatically to


2. Wavelet-Based Automatic Global Image Registration Algorithm

decide the final mapping function. However, the problem is that it is difficult to automatically determine efficient CPs. The selected CPs need to be accurate, sufficient, and with even distribution. Missing or spurious CPs make CPs-based algorithms unreliable and unstable [23]. Hence, CPs-based algorithms are not in our consideration. Automated global registration, however, is an approach that does not rely on point-to-point matching. The final mapping function is computed globally over the images. Therefore, the algorithms are stable and robust and easy to be automatically processed. One of the disadvantages of global registration is that it is computationally expensive. Fortunately, the wavelet decomposition helps to relieve this situation because it provides us a way to obtain the final result progressively. A wavelet-based automated global image registration (WAGIR) algorithm for the remote sensing application proposed by Moigne et al. [13–15] has been proved to be efficient and effective. In WAGIR, the lowestresolution wavelet subbands are firstly registered with a rough accuracy and a wider search interval, and a local best result is obtained. Nextly, this result is refined repeatedly after the iterative registrations on the higher-resolution subbands. The final result is obtained at the highest-resolution subbands, viz. the original images. Many parallel schemes of WAGIR are proposed in previous works, such as parameter-parallel scheme (PP), imageparallel (IP) scheme, hybrid-parallel (HP) scheme which merges PP and IP, and group-parallel (GP) scheme [13, 24– 27] which are implemented targeting large, expensive supercomputers, cluster system or grid system that are impractical to be deployed onboard. In this paper, we propose a block wavelet-based automated global image registration (BWAGIR) architecture based on a block resampling scheme. 
The architecture with 1 processing unit outperforms the CL cluster system with 1 node by at least 7.4X, and the MPM massively parallel machine with 1 node by at least 3.4X. The BWAGIR with 5 units achieves a speedup of about 3X over the CL with 16 nodes and a speed comparable to the MPM with 30 nodes. More importantly, our work targets onboard computing. The remainder of this paper is organized as follows. In Section 2, the traditional WAGIR algorithm is reviewed and analyzed based on the hierarchy architecture. The proposed block resampling scheme is detailed in Section 3. The architecture of BWAGIR is presented in Section 4. Section 5 presents the proof-of-concept implementation and the experimental results with comparison to several related works. Finally, this paper is concluded in Section 6.

2. Wavelet-Based Automatic Global Image Registration Algorithm
Image registration is the process that determines the most accurate match between two images of the same scene or object. In the global registration process, one image is registered against another known standard image. We refer to the former as the input image, the latter as the reference image, the best matching image as the registered image, and the image after each resampling process as the resampled image.

2.1. Review of WAGIR Algorithm. WAGIR can be described as the pseudocode in Algorithm 1. Here we assume that the LL subbands form the feature space; 2D rotations and translations form the search space; the search strategy follows the multiresolution approach provided by wavelet decomposition; and the cross-correlation coefficient is adopted as the similarity metric. Firstly, an N-level wavelet decomposition transforms the input image and the reference image, of size M × M, into nLLi and nLLr sequences, where n represents the corresponding decomposition level. Then NLLi and NLLr, with the lowest resolution, are registered with an accuracy of δ·2^N. A local best combination of rotations and translations (bestθ, bestX, bestY) is obtained and used as the search center for registering the next-level subbands, (N − 1)LLi and (N − 1)LLr. Another combination, with accuracy δ·2^(N−1), is gained. This process iterates until the overall best result with the expected accuracy δ is retrieved after registering the original input image (0LLi) and reference image (0LLr). Finally, a resampling process is carried out to get the registered image. At each level, the algorithm shown in Algorithm 2 is employed to register nLLi and nLLr. The result of the previous level (θC, XC, YC) is used as the search center. For each combination of rotations and translations, the algorithm shown in Algorithm 3 is performed to get a resampled image of nLLr. Then a correlation coefficient is calculated to measure the similarity between the resampled nLLr and the nLLi. The combination corresponding to the maximal correlation coefficient is the best result of the current level. The resampling algorithm is performed by sequentially selecting one registered image location at a time, calculating the corresponding coordinate of the selected location in the reference image, accessing the neighboring 4 × 4 pixel window in the reference image, calculating the corresponding interpolation weights according to the computed coordinate, and finally calculating the pixel value of the selected location by the cubic convolution interpolation method. The correlation coefficient is calculated with (1):



$$
C(A,B) = \frac{\sum_{i=0}^{M-1}\sum_{j=0}^{M-1} A_{ij} B_{ij} \;-\; \frac{1}{M^2}\left(\sum_{i=0}^{M-1}\sum_{j=0}^{M-1} A_{ij}\right)\left(\sum_{i=0}^{M-1}\sum_{j=0}^{M-1} B_{ij}\right)}{\sqrt{\left[\sum_{i=0}^{M-1}\sum_{j=0}^{M-1} A_{ij}^{2} - \frac{1}{M^2}\left(\sum_{i=0}^{M-1}\sum_{j=0}^{M-1} A_{ij}\right)^{2}\right]\left[\sum_{i=0}^{M-1}\sum_{j=0}^{M-1} B_{ij}^{2} - \frac{1}{M^2}\left(\sum_{i=0}^{M-1}\sum_{j=0}^{M-1} B_{ij}\right)^{2}\right]}} . \tag{1}
$$
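As a quick sanity check on (1), the coefficient can be computed directly with NumPy. This is our own illustrative sketch (function name and NumPy usage are assumptions, not part of the proposed architecture):

```python
import numpy as np

def correlation_coefficient(a: np.ndarray, b: np.ndarray) -> float:
    """Cross-correlation coefficient of two M x M images, following (1)."""
    m2 = a.size                                   # M * M
    num = (a * b).sum() - (a.sum() * b.sum()) / m2
    den_a = (a * a).sum() - (a.sum() ** 2) / m2
    den_b = (b * b).sum() - (b.sum() ** 2) / m2
    return float(num / np.sqrt(den_a * den_b))
```

Identical images yield a coefficient of 1, and a negated copy yields −1, as expected for a normalized cross correlation.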


Input: input image and reference image
Output: registered image
1 Initialize the registration process (wavelet level N; search scope: rotation angle (θscope_L, θscope_R), horizontal offset (Xscope_L, Xscope_R), and vertical offset (Yscope_L, Yscope_R));
2 Perform wavelet decomposition of the input image and reference image;
3 bestθ = 0; bestX = 0; bestY = 0;
4 stepθ = δ·2^N; stepX = δ·2^N; stepY = δ·2^N;
5 for (n = N; n ≥ 0; n−−) do
      (width, height) = (image width/2^n, image height/2^n);
      // registering at the current wavelet level based on the results of the previous level
      Perform Register(nLLi, nLLr, bestθ, bestX, bestY, stepθ, stepX, stepY);
      (θscope_L, θscope_R) = (−stepθ, stepθ);
      (Xscope_L, Xscope_R) = (−stepX, stepX);
      (Yscope_L, Yscope_R) = (−stepY, stepY);
      stepθ /= 2; stepX /= 2; stepY /= 2;
      bestX ∗= 2; bestY ∗= 2; // size of the next wavelet subband is twice the current one
6 Resample(input image, bestθ, bestX, bestY, registered image); // last resample to obtain the result image
7 Over.

Algorithm 1: Main WAGIR algorithm.
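The coarse-to-fine loop of Algorithm 1 can be sketched as follows; `register_level` stands in for the per-level Register step and is an assumed callback, as are all other names here:

```python
def wagir_search(n_levels, delta, register_level):
    """Coarse-to-fine search over wavelet levels N..0 (a sketch of Algorithm 1)."""
    best = (0.0, 0.0, 0.0)            # (best_theta, best_x, best_y)
    step = delta * 2 ** n_levels      # accuracy delta * 2^N at the coarsest level
    for n in range(n_levels, -1, -1):
        # refine around the previous level's best result
        best = register_level(n, centre=best, step=step)
        step /= 2                     # halve the search step for the next level
        if n > 0:                     # next subband is twice the current size
            theta, x, y = best
            best = (theta, 2 * x, 2 * y)
    return best
```

The doubling of the translation estimate between levels reflects the factor-of-two size change between successive wavelet subbands.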

Register (the registering algorithm)
Input: nLLi, nLLr, θcenter, Xcenter, Ycenter, stepθ, stepX, stepY
Output: local bestθ, bestX, and bestY
1 (angle, x, y) = (θscope_L, Xscope_L, Yscope_L); // control variables
2 max_co = −1; // record the maximum correlation
// the registration processing
3 while (angle

[Figure 5 blocks: preprocessing (Zernike), neural network, L2 accept/reject.]

Figure 5: Schematic of the HESS Phase-II Trigger System.

3.2. Intelligent Preprocessing. The second studied approach aims to make use of algorithms that have already brought significant results in terms of pattern recognition. Neural networks are good candidates because they are a powerful computational model. On the other hand, their inherent parallelism makes them suitable for a hardware implementation. Algorithms based on neural networks, used in different fields of physics, have been successfully implemented and have already proved their efficiency [5, 6]. Typical applications include particle recognition in tracking systems, event classification problems, off-line reconstruction of events, and online triggering in high-energy physics. From the assumption that neural networks may be useful in such experiments, we have proposed a new Level 2 (L2) trigger system that enables rather complex processing to be implemented on the incoming images. The major issue with neural networks resides in the learning phase, which strives to identify optimal parameters (weights) in order to solve the given problem. This is true when considering unsupervised learning, in which representative patterns have to be iteratively presented to the network in a first learning phase until the global error has reached a predefined value. One of the most important drawbacks of this type of algorithm is that the number of weights strongly depends on the dimensionality of the problem, which is often unknown in practice. This implies finding the optimal structure of the network (number of neurons, number of layers) in order to solve the problem. Moreover, the curse of dimensionality [7] constitutes another challenge when dealing with neural networks. This problem expresses a correlation between the size of the network and the number of examples to furnish. This relation is exponential; that is, if the network's size becomes significant, the number of training examples may become prohibitively large, which cannot be accommodated in practice.

In order to reduce the size of the network, it is possible to simplify its task, that is, to reduce the dimensionality of the problem. In this case, a preprocessing step aims at finding correlations in the data and at applying basic transformations in order to ease the resolution. In this study, we advise using an "intelligent" preprocessing based on the extraction of the intrinsic features of the incoming images. The structure of the proposed L2 trigger is depicted in Figure 5. It is divided into three stages. A rejection step aims to eliminate isolated pixels and small images that cannot be processed by the system. A second step consists of applying a preprocessing on the incoming data. Finally, the classifier takes the decision according to the nature of the event to identify. These different steps are described in the following sections.

3.2.1. The Rejection Step. The rejection step has two significant roles. First, it aims to remove isolated pixels that are typically due to background. These pixels are eliminated by applying a filtering mask on the entire image in order to keep only the relevant information, that is, clusters of pixels. This consists of testing the neighborhood of each pixel of the image. As the image has a hexagonal mesh grid, a hexagonal neighborhood is used. The direct neighborhood of each pixel of the image is tested. If none of the 6 neighbors are activated, the corresponding central pixel is considered isolated and is deactivated. Second, the rejection step permits the elimination of particles that cannot be distinguished by the classifier. Very small images (

0 and values will

be read from Q1_C and subsequently from Q2_WNES and Q1_WNES until all the locations in the plateau have been visited and classified. The plateau processing steps and the associated conditions are shown in Figure 11. There are other parts which are not shown in the main diagram but warrant a discussion. These are (1) memory counters—to determine the number of unprocessed elements in a queue, and (2) the priority encoder—to determine the controls for Q1_sel and Q2_sel. The rest of the architecture consists of a few main parts, shown in Figure 10, which are (1) centre and neighbour coordinates—to obtain the centre and neighbour locations, (2) multibank memory—to obtain the five required pixel values, (3) smallest-valued neighbour—to determine which neighbour has the smallest value,

Figure 10: Watershed architecture based on rainfall simulation. Shown here is the arrowing architecture. This architecture starts from pixel memory and ends up with an arrow memory with labels to indicate the steepest descending paths.

(4) plat/inner—to determine if the current pixel is part of a plateau and whether it is an edge or inner plateau pixel,

(5) arrowing—to determine the direction of the steepest descent; this direction is to be written to the "Arrow Memory",

(6) pixel status—to determine the status of the pixels, that is, whether they have been read before, put into a queue before, or have been labelled.

The next subsections will describe the parts listed above in the same order.
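Functionally, the arrowing decision for a single pixel amounts to picking the lowest of its four neighbours. This behavioural sketch is our own, assuming the −1..−4 encoding for W, N, E, S used by the arrow memory:

```python
def arrow(image, r, c):
    """Return the arrow label of the steepest descending neighbour of (r, c),
    or None if no neighbour is lower (minimum or plateau pixel)."""
    rows, cols = len(image), len(image[0])
    centre = image[r][c]
    # (label, (row offset, col offset)) for West, North, East, South
    offsets = [(-1, (0, -1)), (-2, (-1, 0)), (-3, (0, 1)), (-4, (1, 0))]
    best_label, best_value = None, centre
    for label, (dr, dc) in offsets:
        nr, nc = r + dr, c + dc
        if 0 <= nr < rows and 0 <= nc < cols and image[nr][nc] < best_value:
            best_label, best_value = label, image[nr][nc]
    return best_label
```

A `None` result corresponds to the minima/plateau cases that the architecture handles through the queues rather than by immediate arrowing.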

E1 = 1 when Q1 is empty; E2 = 1 when Q2 is empty.

Figure 11: Stages of Plateau Processing and their various conditions.

Notes: 1. In stage 1 of the processing, mc6–9 is used as a secondary counter for Q1_WNES and incremented as mc1–4 increments, but does not decrement when mc1–4 is decremented. In stage 2, if mc5 = 0 (i.e., complete lower minima), mc6–9 is used as the counter to track the number of elements in Q1_WNES. In this state, mc6–9 is decremented when Q1_WNES is read from. However, if mc5 > 0, mc6–9 is reset and resumes the role of memory counter for Q2_WNES. 2. Q1_C is only ever used once, and that is during stage 2 of the processing.

Figure 12: State diagram of the architecture-ARROWING (in_ctrl values = state numbers).

4.1. Memory Counter. The architecture is a tristate system whose state is determined by whether the queues, Q1 and Q2, are empty or not. This is shown in Figure 12. These states in turn determine the control of the main multiplexer, in_ctrl, which controls the data input into the system.
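In software terms, the state selection from the two empty flags can be sketched as below; the state numbering is our own assumption for illustration, not a transcription of Figure 12:

```python
def next_state(e1_empty: bool, e2_empty: bool) -> int:
    """Pick the controller state from the queue-empty flags E1 and E2."""
    if e1_empty and e2_empty:
        return 0   # normal: read from the pixel coordinate register
    if not e1_empty:
        return 1   # drain Q1 first
    return 2       # otherwise drain Q2
```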


Figure 13: Memory counter for Queue C, W, N, E, and S. The memory counter is used to determine the number of elements in the various queues for the directions of Centre, West, North, East, and South.

To determine the initial queue states, Memory Counters (MCs) are used to keep track of how many elements are pending processing in each of the West, North, East, South, and Centre queues. There are five MCs for Q1 and another five for Q2, one counter for each queue direction. These MCs are named mc1–5 for Q1_W, Q1_N, Q1_E, Q1_S,

Method       Clock cycles   Memory Req.
Sequential   5              1x image size
Parallel     1              5x image size
Graph-based  1              1x image size

and Q1_C, respectively, and similarly mc6–10 for Q2_W, Q2_N, Q2_E, Q2_S, and Q2_C, respectively. This is shown in Figure 13. The MCs increase by one count each time an element is written to the queue. Similarly, the MCs decrease by one count every time an element is read from the queue. The increment is determined by tracking the write enable we_tx, where x = 1–10, while the decrement is determined by tracking the values of Q1_sel and Q2_sel. A special case occurs during stage one of plateau processing, whereby mc6–9 is used to count the number of elements in Q1_W, Q1_N, Q1_E, and Q1_S, respectively. In this stage, mc6–9 is incremented when the queues are written to but is only decremented when Q1_WNES is read again in stage two for complete lower minima labelling. The MC primarily consists of a register and a multiplexer which selects between a (+1) increment or a (−1) decrement of the current register value. Selecting between these two values and writing the new value to the register effectively counts up and down. The update of the MC register value is controlled by a write enable, which is the output of a 2-input XOR. This XOR gate ensures that the MC register is updated only when exactly one of its inputs is active.
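A behavioural model of one MC (names are ours) makes the XOR-gated update explicit:

```python
class MemoryCounter:
    """One up/down memory counter: +1 on a queue write, -1 on a queue read.
    The XOR write-enable means the register only updates when exactly one
    of (write, read) is active in a cycle."""
    def __init__(self):
        self.count = 0

    def cycle(self, write: bool, read: bool) -> int:
        if write ^ read:                 # XOR: update on exactly one event
            self.count += 1 if write else -1
        return self.count
```

A simultaneous write and read leaves the count unchanged, which matches the intent of gating the register's write enable with a 2-input XOR.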


4.3. Centre and Neighbour Coordinate. The centre and neighbourhood block is used to determine the coordinates of the pixel's neighbours and to pass through the centre coordinate. These coordinates are used to address the various queues and the multibank memory. It performs an addition and subtraction by one unit on both the row and column coordinates. This is rearranged and grouped into the respective outputs. The outputs from the block are five pixel locations, corresponding to the centre pixel location and the four neighbours, West (W), North (N), East (E), and South (S). This is shown in Figure 15.

4.4. The Smallest-Valued Neighbour Block. This block determines the smallest-valued neighbour (SVN) and its position in relation to the current pixel. This is used to determine if the current pixel has a lower minima and to find the steepest descending path to that minima (arrowing).
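The centre-and-neighbour block reduces to four ±1 offsets on the row and column; a sketch, with our own function name:

```python
def neighbour_coords(r, c):
    """Centre pixel plus its W, N, E, S neighbour coordinates."""
    return {
        "C": (r, c),
        "W": (r, c - 1),
        "N": (r - 1, c),
        "E": (r, c + 1),
        "S": (r + 1, c),
    }
```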


4.2. The Priority Encoder. The priority encoder is used to determine the output of Q1_sel and Q2_sel by comparing the outputs of the MCs to zero. It selects the output from the queues in the order it is stored, that is, from queue Qx_W to Qx_C, x = 1 or 2. Together with the state of in_ctrl, Q1_sel and Q2_sel determine the data input into the system. The logic to determine the control bits for Q1_sel and Q2_sel is shown in Figure 14.
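Behaviourally, the priority encoder reduces to a first-non-zero scan over the five counters; this sketch (our own naming) returns the selector code, with 0 standing for the "disable" condition:

```python
def priority_select(mc):
    """mc: element counts for queues [W, N, E, S, C] in fixed priority order.
    Returns the Qx_sel code 1..5 for the first non-empty queue, 0 if all empty."""
    for i, count in enumerate(mc):
        if count > 0:
            return i + 1
    return 0  # all queues empty: selector disabled
```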

Table 1: Comparison of the number of clock cycles required for reading all five required values and the memory requirements for the three different methods.

Figure 14: The priority encoder. (a) shows the controls for Q1_sel and Q2_sel using the priority encoders. The output of the memory counters determines the multiplexer control of Q1_sel and Q2_sel. (b) shows the logic of the priority encoders used. There is a special "disable" condition for the multiplexers of Q1 and Q2. This is used so that Q1_sel and Q2_sel can have an initial condition and will not interfere with the memory counters.

This second part of the architecture will describe how we get from Figure 4(c) to Figure 4(d) in hardware. Compared to the arrowing architecture, the labelling architecture is considerably simpler as there are no parallel memory reads. In fact, everything runs in a fairly sequential manner. Part 2 of the architecture is shown in Figure 21. The architecture for Part 2 is very similar to Part 1. Both are tristate systems whose state depends on the condition

=0

Fill queue mux = 0

6. Labelling Architecture

nter cou

PQ_c oun ter

Normal 0

_ PQ

the north (0, 0) finds (1, 0) again. The current pixel location (0, 0) on the other hand is written to Q1 C because it is a plateau pixel but not an inner (i.e., an edge) and is immediately arrowed. The status for this location (0, 0) is changed from 0 → 6. Q1 S will contain the pixel location (1, 0). This is read back into the system and mc4 = 1 → 0 indicating Q1 S to be empty. The pixel location (1, 0) is arrowed and written into Q1 C. With mc1 − 4 = 0 and mc5 > 0, the pixel locations (0, 0) and (1, 0) is reread into the system but nothing is performed because both their PSsequal 6 (i.e., completed).

b=

Read queue d) 1 (c un atchment basin fo

Figure 22: The 3 states in Architecture:Labelling.

of the queues and uses pixel state memory and queues for storing pixel locations. The difference is that Part 2 architecture only requires a single queue and a single bit pixel status register. The three states for the system are shown in Figure 22.

Values are initially read in from the pixel coordinate register. Whether a pixel location has been processed before is checked against the pixel status (PS) register. If it has not been processed before (i.e., was never part of any steepest descending path), it will be written to the Path Queue (PQ). Once PQ is not empty, the system will process the next pixel along the current steepest descending path. This is calculated by the "Reverse Arrowing Block" (RAB) using the current pixel location and direction information obtained from the "Arrow Memory." This process continues until a non-negative value is read from "Arrow Memory." This non-negative value is called the "Catchment Basin Label" (CBL). Reading a CBL tells the system that a minimum has been reached, and all the pixel locations stored in PQ will be labelled with that CBL and written to "Label Memory." At the same time, the pixel status for the corresponding pixel locations will be updated accordingly from 0 → 1. Once PQ is empty, the next value will be obtained from the pixel coordinate register.

6.1. The Reverse Arrowing Block. This block calculates the neighbour pixel location in the path of the steepest descent, given the current location and arrowing label. In other words, it simply finds the location of the pixel pointed to by the current pixel. The output of this block is a simple case of selecting the appropriate neighbouring coordinate. Firstly the neighbouring coordinates are calculated and fed into a 4-input multiplexer. Invalid neighbours are automatically ignored as they will never be selected. The values in "Arrow Memory" only point to valid pixels. Hence, no special consideration is required to handle these cases. The bulk of the block's complexity lies in the control of the multiplexer. The control is determined by translating the value from the "Arrow Memory" into proper control logic.
Using a bank of four comparators, the value from “Arrow Memory” is determined by comparing it to four possible valid direction labels (i.e., −4 → −1). For each of these values, only one of the comparators will produce a positive outcome (see truth table in Figure 23). Any other values outside the valid range will simply be ignored. The comparator output is then passed through some logic that will produce a 2-bit output corresponding to the multiplexer control. If the value from “Arrow Memory” is −1, the control logic will be (x = 0, y = 0) corresponding to the West neighbour location. Similarly, if the value from “Arrow Memory” is −2, −3, or −4, the control logic will be (x = 0, y = 1), (x = 1, y = 0), or (x = 1, y = 1) corresponding to the North, East, or South neighbour locations, respectively. This is shown in Figure 23.
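The reverse arrowing decode can be sketched as a direct table lookup; the dictionary encoding is our own shorthand for the comparator bank and multiplexer control described above:

```python
def reverse_arrow(r, c, arrow_label):
    """Map an arrow label (-1..-4 for W, N, E, S) to the coordinate of the
    neighbour it points at, mirroring the reverse arrowing block."""
    step = {
        -1: (0, -1),   # West
        -2: (-1, 0),   # North
        -3: (0, 1),    # East
        -4: (1, 0),    # South
    }[arrow_label]
    return (r + step[0], c + step[1])
```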

7. Example for the Labelling Architecture

This example will pick up where the previous example stopped. In the previous part, the resulting output was written to the "Arrow Memory." It contains the directions of the steepest descent (negative values from −1 → −4) and numbered minima (positive values from 0 → total number


Figure 23: Inside the reverse arrowing block.

of minima) as seen in Figure 4(c). In this part, we will use the information stored in "Arrow Memory" to label each pixel with the label of its respective minimum. Once all the pixels associated with a minimum have been labelled accordingly, a catchment basin is formed.

The system starts off in the normal state, and the initial conditions are as follows: PQ_counter = 0, mux = 1. In the first clock cycle, the first pixel location (0, 0) is read from the pixel location register. Once this has been read in, the pixel location register will increment to the next pixel location (0, 1). The PS for the first location (0, 0) is 0. This enables the write enable for the PQ, and the first location is written to the queue. At the same time, the location (0, 0) and the direction −3 obtained from "Arrow Memory" are used to find the next coordinate (0, 1) in the steepest descending path.

Since PQ is not empty, the system enters the "Fill Queue" state and mux = 0. The next input into the system is the value from the reverse arrowing block, (0, 1), and since PS = 0, it is put into PQ. The next location processed is (0, 2). For (0, 2), PS = 0 and it is also written to PQ. However, for this location, the value obtained from "Arrow Memory" is 1. This is a CBL and is buffered for processing in the next state.

Once a non-negative value from "Arrow Memory" is read (i.e., b = 1), the system enters the next state, which is the "Read Queue" state. In this state, all the pixel locations stored in PQ are read one at a time, and the memory locations in "Label Memory" corresponding to these locations are written with the buffered CBL. At the same time, PS is also updated from 0 → 1 to reflect the changes made to "Label Memory." This tells the system that the locations from PQ have been processed so that they will not be rewritten when encountered again.

Table 3: Results of the implemented architecture on a Xilinx Spartan-3 FPGA (64 × 64 image size).

Arrowing:   Slice flip flops   423 out of 26,624 (1%)
            Occupied slices    2,658 out of 13,312 (19%)
Labelling:  Slice flip flops   39 out of 26,624 (1%)
            Occupied slices    37 out of 13,312 (1%)

With each read from PQ, PQ_counter is decremented. When PQ is empty, PQ_counter = 0 and the system returns to the normal state. In the next clock cycle, (0, 1) is read from the pixel coordinate register. For (0, 1), PS = 1, so nothing gets written to PQ and PQ_counter remains at 0. The same goes for (0, 2). When the coordinate (0, 3) is read from the pixel coordinate register, the whole process of filling up PQ, reading from PQ, and writing to "Label Memory" starts again.
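Putting the labelling pass together, here is a behavioural sketch (our own, assuming the −1..−4 W/N/E/S arrow encoding and non-negative CBLs) of what Sections 6 and 7 describe: follow arrows from each pixel until a CBL is read, then flush that label to every queued location on the path.

```python
def label_image(arrow_mem):
    """Resolve catchment basin labels from an arrow memory (2D list)."""
    rows, cols = len(arrow_mem), len(arrow_mem[0])
    step = {-1: (0, -1), -2: (-1, 0), -3: (0, 1), -4: (1, 0)}
    labels = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            path, cur = [], (r, c)          # path plays the role of PQ
            while labels[cur[0]][cur[1]] is None:
                v = arrow_mem[cur[0]][cur[1]]
                if v >= 0:                   # reached a numbered minimum (CBL)
                    labels[cur[0]][cur[1]] = v
                    break
                path.append(cur)             # queue this location for labelling
                dr, dc = step[v]
                cur = (cur[0] + dr, cur[1] + dc)
            cbl = labels[cur[0]][cur[1]]
            for (pr, pc) in path:            # flush the path queue with the CBL
                labels[pr][pc] = cbl
    return labels
```

Already-labelled pixels terminate a path early, which corresponds to the PS = 1 check that keeps locations from being rewritten.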

8. Synthesis and Implementation

The rainfall watershed architecture was designed in Handel-C and implemented on a Celoxica RC10 board containing a Xilinx Spartan-3 FPGA. Place and route were completed to obtain a bitstream, which was downloaded into the FPGA for testing. The watershed transform was computed by the FPGA architecture, and the arrowing and labelling results were verified to have the same values as software simulations in Matlab. The Spartan-3 FPGA contains a total of 13,312 slices. The implementation results of the architecture are given in Table 3 for an image size of 64 × 64 pixels. An image resolution of 64 × 64 required 2,658 and 37 slices for the arrowing and labelling architectures, respectively. This represents about 20% of the chip area on the Spartan-3 FPGA.

9. Summary

This paper proposed a fast method of implementing the watershed transform based on rainfall simulation with a multiple-bank memory addressing scheme to allow parallel access to the centre and neighbourhood pixel values. In a single read cycle, the architecture is able to obtain all five values of the centre and four neighbours for a 4-connectivity watershed transform. This multiple-bank memory has the same footprint as a single-bank design. The datapath and control architecture for the arrowing and labelling hardware have been described in detail, and an implemented architecture on a Xilinx Spartan-3 FPGA has been reported. The work can be extended to implement an 8-connectivity watershed transform by increasing the number of memory banks and working out its addressing. The multiple-bank memory approach can also be applied to other watershed architectures such as those proposed in [10–13, 15].


References

[1] S. E. Hernandez and K. E. Barner, "Tactile imaging using watershed-based image segmentation," in Proceedings of the Annual Conference on Assistive Technologies (ASSETS '00), pp. 26–33, ACM, New York, NY, USA, 2000.
[2] M. Fussenegger, A. Opelt, A. Pinz, and P. Auer, "Object recognition using segmentation for feature detection," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR '04), vol. 3, pp. 41–44, IEEE Computer Society, Washington, DC, USA, 2004.
[3] W. Zhang, H. Deng, T. G. Dietterich, and E. N. Mortensen, "A hierarchical object recognition system based on multiscale principal curvature regions," in Proceedings of the 18th International Conference on Pattern Recognition (ICPR '06), vol. 1, pp. 778–782, IEEE Computer Society, Washington, DC, USA, 2006.
[4] M. S. Schmalz, "Recent advances in object-based image compression," in Proceedings of the Data Compression Conference (DCC '05), p. 478, March 2005.
[5] S. Han and N. Vasconcelos, "Object-based regions of interest for image compression," in Proceedings of the Data Compression Conference (DCC '05), pp. 132–141, 2008.
[6] T. Acharya and P.-S. Tsai, JPEG2000 Standard for Image Compression: Concepts, Algorithms and VLSI Architectures, John Wiley & Sons, New York, NY, USA, 2005.
[7] V. Osma-Ruiz, J. I. Godino-Llorente, N. Sáenz-Lechón, and P. Gómez-Vilda, "An improved watershed algorithm based on efficient computation of shortest paths," Pattern Recognition, vol. 40, no. 3, pp. 1078–1090, 2007.
[8] A. Bieniek and A. Moga, "An efficient watershed algorithm based on connected components," Pattern Recognition, vol. 33, no. 6, pp. 907–916, 2000.
[9] H. Sun, J. Yang, and M. Ren, "A fast watershed algorithm based on chain code and its application in image segmentation," Pattern Recognition Letters, vol. 26, no. 9, pp. 1266–1274, 2005.
[10] M. Neuenhahn, H. Blume, and T. G. Noll, "Pareto optimal design of an FPGA-based real-time watershed image segmentation," in Proceedings of the Conference on Program for Research on Integrated Systems and Circuits (ProRISC '04), 2004.
[11] C. Rambabu and I. Chakrabarti, "An efficient immersion-based watershed transform method and its prototype architecture," Journal of Systems Architecture, vol. 53, no. 4, pp. 210–226, 2007.
[12] C. Rambabu, I. Chakrabarti, and A. Mahanta, "Flooding-based watershed algorithm and its prototype hardware architecture," IEE Proceedings: Vision, Image and Signal Processing, vol. 151, no. 3, pp. 224–234, 2004.
[13] C. Rambabu and I. Chakrabarti, "An efficient hill-climbing-based watershed algorithm and its prototype hardware architecture," Journal of Signal Processing Systems, vol. 52, no. 3, pp. 281–295, 2008.
[14] D. Noguet and M. Ollivier, "New hardware memory management architecture for fast neighborhood access based on graph analysis," Journal of Electronic Imaging, vol. 11, no. 1, pp. 96–103, 2002.
[15] C. J. Kuo, S. F. Odeh, and M. C. Huang, "Image segmentation with improved watershed algorithm and its FPGA implementation," in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '01), vol. 2, pp. 753–756, Sydney, Australia, May 2001.
