Optimization of Image Processing Algorithms on Mobile Platforms

Nov 1, 2010
Pramod Poudel and Mukul Shirvaikar
Electrical Engineering Department
The University of Texas at Tyler, Tyler, TX 75799
e-mail: [email protected] [email protected]

ABSTRACT
This work presents a technique to optimize popular image processing algorithms on mobile platforms such as cell phones, netbooks and personal digital assistants (PDAs). The increasing demand for video applications like context-aware computing on mobile embedded systems requires the use of computationally intensive image processing algorithms. The system engineer has a mandate to optimize them so as to meet real-time deadlines. A methodology to take advantage of the asymmetric dual-core processor, which includes an ARM and a DSP core supported by shared memory, is presented with implementation details. The target platform chosen is the popular OMAP 3530 processor for embedded media systems. It has an asymmetric dual-core architecture with an ARM Cortex-A8 and a TMS320C64x+ Digital Signal Processor (DSP). The development platform was the BeagleBoard with 256 MB of NAND flash memory and 256 MB of SDRAM. The basic image correlation algorithm is chosen for benchmarking as it finds widespread application in various template matching tasks such as face recognition. The basic algorithm prototypes conform to OpenCV, a popular computer vision library. OpenCV algorithms can be easily ported to the ARM core, which runs a popular operating system such as Linux or Windows CE. However, the DSP is architecturally more efficient at handling DFT algorithms. The algorithms are tested on a variety of images and performance results are presented measuring the speedup obtained due to the dual-core implementation. A major advantage of this approach is that it allows the ARM processor to perform important real-time tasks, while the DSP addresses performance-hungry algorithms.
Keywords: OMAP 3530, BeagleBoard, OpenCV, Image processing, Computer vision, Real-time systems, Multi-core processing

1. INTRODUCTION
Image processing techniques typically apply transformation functions to image or video data and produce enhanced images or data properties as output. Image and video processing finds widespread use in popular applications like video streaming on cellular phones. Since the transformations have to be applied to every image pixel, these algorithms are computationally intensive. The performance of any image processing algorithm is typically measured on the basis of speed and depends on the complexity of the algorithm and the performance of the hardware used. The order of complexity of a particular class of algorithms can be computed and is typically available in the literature 1. Thus, to improve performance, embedded system capabilities have to be optimized. Embedded system capabilities are a function of both the hardware platform and the software implementation strategy. When there are plenty of hardware resources, with unlimited power and higher processing speeds, for example, on non-mobile platforms like desktop computers and workstations, achieving real-time performance may not be a major issue. In today's scenario there has been an exponential growth in the use of mobile embedded systems like cell phones, PDAs, tablets, video game consoles, etc. in the consumer electronics market. The power consumption and form factor for these devices is relatively small. Multiple applications like video streaming, smart cameras, context-aware computing and virtual reality are resident on these devices to meet consumer needs. Computer vision and image processing algorithms thus find heavy application on these platforms. However, these mobile platforms do not have the resources and computational power that PCs and workstations have, and as a result image processing and computer vision algorithms on these devices may not be able to meet real-time performance requirements.

One of the best ways to achieve better performance on mobile platforms is to explore and utilize the available hardware resources on the mobile processor using “smart” software. Today’s silicon technology provides state-of-the-art processing units with many hardware resources on a single silicon die, called a System-On-a-Chip (SOC). Such an SOC not only provides multiple hardware accelerators for different functions but also provides multiple homogeneous or heterogeneous processor cores. Utilization of multiple cores for distributing tasks can significantly boost performance. In the case of heterogeneous processors, dividing the work such that each core is assigned the tasks it favors most, in terms of performance, can further improve overall performance. The use of multi-tasking and multi-core processing on such SOCs can increase the performance of image processing algorithms considerably. This work presents a methodology for the optimization of image-processing algorithms on a mobile heterogeneous multi-core platform from Texas Instruments, namely the OMAP3530. The BeagleBoard (rev. C4) is used as the development platform for the research project. This mobile heterogeneous multi-core platform consists of an ARM Cortex-A8 as the General Purpose Processor (GPP), which acts as the host core, a TMS320C64x+ Digital Signal Processor (DSP), and a PowerVR SGX530 Graphics Processing Unit (GPU). The ARM Cortex-A8 runs at 720 MHz while the DSP core runs at 520 MHz and the GPU core runs at 110 MHz 2. It also has 256 MB each of Double Data Rate (DDR) SDRAM and NAND flash memory. The image processing algorithms used to benchmark the speed are selected from the open source computer vision library called OpenCV and include the correlation algorithm for template matching. Two-dimensional correlation is used in applications like pattern matching and face recognition. It can be implemented using either the convolution technique or the Discrete Fourier Transform (DFT).
Both techniques are computationally intensive and require numerous multiply-and-accumulate (MAC) operations. The DSP is ideally suited for MAC operations. This research describes various approaches implemented to fully utilize the multiple cores on the mobile platform for such algorithms. Specifically, the following approaches are implemented: (a) ARM-only implementation, (b) DSP-only implementation, (c) dual-core implementation (algorithm on DSP, free ARM core), (d) dual-core implementation using a fast math library for floating-point operations and (e) dual-core implementation with data division. The performance results for the same algorithm implemented on another TI platform, namely the OMAP-L138, are also provided. Detailed benchmark results are presented to judge the efficiency of each approach. Qualitative analysis of the various approaches is also provided.

2. PAST WORK
Computer vision finds many applications in fields like machine learning, artificial intelligence and surveillance 3. Traditionally these algorithms have been implemented on non-mobile platforms like Pentium personal computers (PCs) and workstations. The widely used open source computer vision library OpenCV was also designed with a focus on non-mobile platforms and is highly optimized to take advantage of Intel’s architecture enhancements 4. The past work described later in this section reveals that speed-up in image-processing algorithms can be obtained using three different techniques.
Software Approaches
In this technique, different software methods are used to speed up the algorithm. Compiler-based techniques such as generalized loop-unrolling, which speed up the computationally intensive parts of an algorithm that are generally bounded within a loop, are a popular software method 5. The selection of an algorithm based on a theoretical estimate of computational complexity is often used for speed-up. Using the Fast Fourier Transform instead of the direct DFT algorithm reduces algorithm complexity and enhances performance. Coding critical parts of an algorithm in assembly language can also help to improve performance 6. Many other techniques for software optimization of algorithms are available in the literature 7.
Hardware Approaches
Dedicated hardware or Application Specific Integrated Circuits (ASICs) can be used to improve the performance of an algorithm. Custom ASICs like JPEG/MPEG or DCT processors can be used for application-specific image processing 8. Today, reconfigurable hardware like Field Programmable Gate Arrays (FPGAs) is widely used for fast prototyping of dedicated hardware. FPGAs are also used for image processing tasks. Comparisons of the performance of computer vision algorithms on FPGAs with DSP implementations are provided in the literature 9. Both FPGA and ASIC
implementations suffer from long time-to-market because the tools and languages used for design are not programmer-friendly. There are also many dedicated hardware modules on the market, like Direct Memory Access (DMA) controllers and Intel’s SIMD extensions such as MMX and SSE, for video and image processing 10. The use of dedicated processors like DSPs and Graphics Processing Units (GPUs) for image and video processing can also result in better performance while freeing the GPP for other tasks 11.
Combined Hardware/Software Approaches
Due to powerful and cheap computing resources, the combination of hardware and software techniques is becoming more popular for performance optimization. In the past, there has been some work done on image processing algorithms targeting specific platforms like DSPs. The optimization of the Integral Image algorithm on a TI DSP processor is described in the literature 12. It uses DMA and static on-chip memory to implement software techniques like double buffering for performance improvement. However, there has been little effort in the past to leverage computer vision algorithms using TI’s Open Multimedia Application Platform (OMAP) processors like OMAP3 and OMAP4. OMAP architectures typically consist of a DSP and the popular ARM RISC processor. The use of the dual-core OMAP platform for multimedia applications in wireless terminals is shown in 13, while 14 gives some idea of application profiling on such platforms. The application of OMAP in medical engineering, for example ECG, is presented in 15. All of these articles on OMAP are based on earlier releases of OMAP technology. Today’s image and video applications demand even more performance, which can only be met by newer technology like OMAP3 and OMAP4. This research further builds upon previous work, which demonstrated performance optimization of an algorithm using the ARM and DSP cores on the OMAP3 platform 16.
Ideas and techniques presented in this work are simple, effective, and can empower one to achieve multi-core processing and task offloading to TI’s DSP on OMAP platforms for image processing algorithms. Moreover, these techniques can also be applied to general algorithms which need to run on the DSP of OMAP platforms with performance optimization.

3. TECHNICAL BACKGROUND
Optimization of image processing algorithms on OMAP3 platforms like the BeagleBoard requires in-depth understanding of the platform and the necessary software development tools. As the main focus remains on the use of a heterogeneous RISC-DSP architecture, knowledge of the underlying hardware architecture, the available software tools, and the inter-processor communication between the two cores is very important to achieve this goal.
3.1 Asymmetric Dual-Core Architecture (OMAP3530)
The OMAP3530 is a high-end application processor from Texas Instruments based on the OMAP3 architecture. It has support for operating systems like Linux and Windows CE. It consumes less power and is thus well suited to mobile embedded systems. A few of the devices and features provided by this platform are listed below.
• Microprocessor unit based on the ARM Cortex-A8.
• Image, Video and Audio accelerator (IVA2.2) subsystem with a C64x+ DSP.
• Graphics Processing Unit (PowerVR SGX) for 3D graphics acceleration.
• Camera image signal processor (ISP).
• Display subsystem with support for NTSC/PAL video out.
• Level 3 (L3) and Level 4 (L4) interconnects for high-bandwidth data transfer between memory controllers and on-chip peripherals.
• SmartReflex adaptive voltage control.
• Package-on-Package (POP) memory stacking implementation.
3.2 ARM Cortex-A8
The ARM Cortex-A8 is a very popular RISC processor used today in most well-known mobile platforms like the BeagleBoard, iPhone and Motorola Droid. It is a super-scalar RISC processor based on the ARMv7-A architecture 17. This processor is finding a growing market share in today’s emerging mobile embedded devices. Some of its advanced features are highlighted below.
• ARM integer execution pipeline.
• NEON pipeline execution unit for advanced Single Instruction Multiple Data (SIMD) and Vector Floating Point (VFP) instructions.
• Dynamic branch prediction with branch target address cache, global history buffer and 8-entry return stack.
• Memory Management Unit (MMU) and separate data and instruction Translation Look-Aside Buffers (TLBs).

3.3 DSP Architecture (IVA2.2 Subsystem)
The OMAP 3530 contains the TI Image, Video and Audio accelerator (IVA2.2) based on the C64x+ DSP core 18. The DSP is an application-specific processor well suited to multimedia and signal processing algorithms. This subsystem is capable of providing optimum performance for image, multimedia or signal processing tasks. Some of the key components and features of the C64x+ DSP are highlighted below.
• TI TMS320C64x+ 32-bit fixed-point DSP.
• Very Long Instruction Word (VLIW) architecture with 8 execution units supporting a maximum of 8 instructions/cycle.
• Separate L1 data and L1 program configurable caches of maximum size 32 KB.
• Combined L2 program and data configurable cache of maximum size 64 KB.
• Enhanced Direct Memory Access (EDMA) for data transfer between memory and external peripherals.
• Video hardware accelerator.
3.4 Development Board (BeagleBoard)
The BeagleBoard (rev. C4), which houses the OMAP 3530, is the development board for this work. The BeagleBoard is a low-cost, high-performance board which can be easily expanded. The platform is mainly targeted at the open source community and is supported by Texas Instruments through BeagleBoard.org 19. It provides most of the capabilities of the OMAP3530 in a complete development platform. The components and interfaces present on this board, like the OMAP3530 processor, 256 MB POP memory, USB 2.0 OTG, USB host, audio codec, SD/MMC connector, and video, audio, expansion, DVI-D, S-Video and LCD connectors, make it an on-the-go lab for fast prototyping and testing new ideas.

Figure 1: Codec Engine framework.

Figure 2: The basic structure of OpenCV.

3.5 Software Framework (Codec Engine)
The Codec Engine software framework provides support for the Video, Image, Speech, Audio (VISA) interfaces as well as a UNIVERSAL algorithm interface between the ARM and the DSP core. Basically, it provides a set of APIs that work with the eXpressDSP Algorithm Interoperability Standard (xDAIS) 20. xDAIS is TI’s standard for algorithms
to address the issues of algorithm resource allocation and consumption on a DSP. These APIs remain the same across different operating systems, providing a common standard interface. The Codec Engine can be configured such that an algorithm runs locally or remotely, in a GPP-only environment or a GPP-DSP environment like the OMAP3530. The Codec Engine basically addresses the issue of portability from one platform to another. It is extensible, configurable, and easy to use. A high-level view of the CE framework is shown in Figure 1. The Engine functional layer, along with the VISA and UNIVERSAL functions, provides dynamic object creation capability either locally or remotely, depending on the platform used. This gives support for remote calling and remote instantiation of an algorithm 21. The DSPLink software layer provides support for inter-processor communication (IPC) using shared memory and interrupts between the ARM and the DSP cores.
3.6 OpenCV Library
OpenCV is a very popular open source real-time computer vision library initially developed by Intel and now maintained by Willow Garage. It is a collection of highly optimized libraries containing more than 500 algorithms and is designed to take advantage of the Intel Integrated Performance Primitives (IPP) for better performance. This work focuses on the OpenCV 1.0 version, which provides 5 main library components. The structure of OpenCV is shown in Figure 2. The CV component contains a collection of computer vision and image processing algorithms. The Machine Learning Library (MLL) contains statistical classifiers and clustering tools, while HighGUI provides routines for storing and loading video and images, and CvAux contains defunct and experimental algorithms. The core support, like basic data structures and low-level algorithmic support, is provided by the CXCORE library component. Since this work is based on the OMAP platform, use of the IPP library for better performance is not possible.
Instead, OpenCV algorithms are redesigned to take advantage of the heterogeneous architecture present on the OMAP application processor. The reference algorithm cvMatchTemplate for correlation, taken from the OpenCV library for benchmarking, is defined by equation (1) in the spatial domain:

c(x, y) = Σ_{m=0}^{J−1} Σ_{n=0}^{K−1} [h(m, n) − f(x + m, y + n)]²    (1)

where f is the input image and h is the J × K template.
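As a concrete illustration, equation (1) can be computed directly in C. The following is a minimal sketch; sqdiff_at and the demo helpers are hypothetical names for illustration only, not part of OpenCV, and cvMatchTemplate itself is far more heavily optimized:

```c
#include <assert.h>

/* Squared-difference template match of equation (1):
 *   c(x, y) = sum_{m=0}^{J-1} sum_{n=0}^{K-1} (h(m,n) - f(x+m, y+n))^2
 * f is a W-pixel-wide row-major image, h is a J x K row-major template. */
long sqdiff_at(const unsigned char *f, int W,
               const unsigned char *h, int J, int K,
               int x, int y)
{
    long c = 0;
    for (int n = 0; n < K; n++)           /* template rows    */
        for (int m = 0; m < J; m++) {     /* template columns */
            long d = (long)h[n * J + m] - (long)f[(y + n) * W + (x + m)];
            c += d * d;
        }
    return c;
}

/* 4x4 test image and a 2x2 template cut from it at position (1, 1). */
static const unsigned char demo_img[16] = {  1,  2,  3,  4,
                                             5,  6,  7,  8,
                                             9, 10, 11, 12,
                                            13, 14, 15, 16 };
static const unsigned char demo_tpl[4]  = {  6,  7, 10, 11 };

/* A perfect match yields zero distance at the template's source position. */
long sqdiff_demo_match(void)    { return sqdiff_at(demo_img, 4, demo_tpl, 2, 2, 1, 1); }

/* Elsewhere the distance is positive: at (0, 0) every pixel differs by 5. */
long sqdiff_demo_mismatch(void) { return sqdiff_at(demo_img, 4, demo_tpl, 2, 2, 0, 0); }
```

In a full template-matching pass, sqdiff_at would be evaluated at every valid (x, y), with the best match at the minimum of c.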

4. ALGORITHM IMPLEMENTATION
The OpenCV library contains a large collection of image-processing algorithms. A number of representative algorithms were selected for implementation on the BeagleBoard. The goal was to offload the computationally intensive parts of the image processing algorithms to the DSP core and utilize the ARM core for other general processing tasks. Two sets of libraries were created to achieve this. The first library runs on the ARM core while the other runs on the DSP core. The RPC and IPC necessary to synchronize tasks were implemented using the Codec Engine framework and DSPLink.
4.1 Software Development Environment
The Linux platform was used both on the host machine and the target board. Ubuntu 8.04 was run in a virtual machine on the host, whose native operating system was Windows XP. The Codec Engine (CE) framework provided by TI was used to perform Remote Procedure Calls (RPC). This framework in turn uses DSP/BIOS Link for Inter-Processor Communication (IPC) using shared memory. The DSPLink module, which can be easily built using OpenEmbedded, must be installed on the target platform. Since the DSP can only work on a physically contiguous memory address space, the CMEM library and module are used to manage physically contiguous memory blocks. The CMEM library is also provided by Texas Instruments. Along with this, the CE framework makes use of various software components and other utilities like Framework Components, Linux Utils, DSP/BIOS, XDC Tools and CODEGEN Tools. All of these utilities come in packages provided by TI and can be easily built and managed using OpenEmbedded tools, which provide the necessary build environment and utilities for cross-platform development 22.
4.2 Software Architecture
The conventional OpenCV library runs only on the ARM core of the OMAP3530 platform. However, with a new design it is possible to run OpenCV algorithms on both of the cores found on the OMAP3530.
The new design incorporated in this work is shown in Figure 3; it provides support to run image processing algorithms on the DSP
core as well as the ARM core. For simplicity in integrating the OpenCV library on the DSP core, the TI C6ACCEL library framework is used. This framework did not previously support running the OpenCV library on the DSP; the C6ACCEL design was extended to provide this support. C6ACCEL is a recently released library from TI which ARM SoC developers can use to access algorithms running on the DSP core of a heterogeneous architecture like OMAP3. This library in turn uses the CE framework to achieve RPC. CE provides support for VISA as well as UNIVERSAL algorithms and provides a set of APIs to perform RPC and IPC. For non-VISA algorithms like the ones used in this work, the Codec Engine UNIVERSAL APIs are used.

Figure 3: Design architecture for Optimized library.

Figure 4: Codec Engine APIs for RPC.

4.3 Remote Procedure Call
The CE, along with the UNIVERSAL interface, provides a set of APIs for RPC and IPC. CE in turn uses DSPLink for IPC. DSPLink is general-purpose runtime software provided by TI for communication between a GPP and a TI DSP. Figure 4 shows how RPC is implemented using this set of APIs. The CERuntime_init() function sets up the runtime environment for CE on the ARM. Engine_open() loads the DSP executable, sets up the DSP runtime environment and selects the specific DSP server package containing the algorithm. UNIVERSAL_create() allocates memory and other resources, creates the algorithm instance on the DSP and initializes it. UNIVERSAL_processAsync() initiates the algorithm on the DSP and returns control back to the ARM; it does not wait for the DSP to complete its task. When output from the algorithm is required, UNIVERSAL_processWait() is used to synchronize the DSP with the ARM. After the processing is over, the UNIVERSAL_delete() and Engine_close() functions are used to delete the algorithm instance, free the resources and put the DSP back in reset mode.
4.4 Data Processing and Flow
The data processing flow for an application using both OMAP3530 cores is shown in Figure 5. Whenever an application running on the ARM core needs to use the DSP, the Codec Engine runtime environment is initialized and the Codec Engine is opened. This removes the DSP from reset mode and the DSP server executable is loaded. Runtime environment support is then started on the DSP side. This ensures that the hardware and software support to access DSP algorithms is initialized and the correct algorithm package is selected on the DSP. When data is to be passed to the DSP, an input/output buffer in the shared memory region is used. This memory must be physically contiguous in the memory address space. Virtual-to-physical address translation must be done for all pointers to be passed, since the DSP does not support virtual memory management.
When working in a multi-processor environment, handling cache coherency is very important to ensure that the data available in a core's local cache is the data that was passed from the other core. All input data buffers should be written back to memory on the ARM side and cache-invalidated on the DSP side, while all output data buffers should be cache-invalidated on the ARM side and written back to memory on the DSP side. Once the data is written into shared memory, the algorithm on the remote core is started. After this, the DSP is busy executing the algorithm while the ARM can be used for other tasks. The Codec Engine framework provides an asynchronous API for this purpose, which starts the task on the DSP and returns control to the ARM core. The ARM core must then synchronize with the DSP core to access the output of the DSP algorithm. If processing on the DSP side is complete by then, the ARM receives a task completion signal from the DSP; if not, the ARM waits until the processing is done. After all the processing on the DSP is over, the instance of the algorithm is deleted and the DSP is put back into reset mode to save power. The same procedure is followed whenever the DSP is to be used.

[Figure 5: flowchart with parallel ARM and DSP lanes showing environment initialization and DSP start-up, placement of the required data in contiguous shared memory, address translation and cache coherency operations, remote algorithm start-up with concurrent ARM tasks, result write-back and completion signaling by the DSP, synchronization and output processing by the ARM, and finally stopping the DSP and returning it to the reset state.]
Figure 5: Flow of data processing in the design.

Processing on both cores simultaneously can result in better performance, as illustrated by the timing diagram shown in Figure 6. The timing diagram provides a high-level overview of task concurrency but does not show the communication overhead between the ARM and the DSP core. Most of the overhead consists of the time required for cache management, so careful attention should be given to software partitioning between the cores. The DSP is more suitable for computationally intensive signal processing algorithms, while the ARM is better for control and other housekeeping tasks. Thus, using the DSP for inappropriate tasks may result in poor performance due to unavoidable overhead.
4.5 Tool-Chains and Cross Compilation
Cross-platform development requires cross-compilation tools. OpenEmbedded tools, which provide the necessary build environment and utilities, were chosen for this purpose. The GNU compiler (GCC 4.3) was used to build the necessary software packages for the Cortex-A8 processor. Whenever possible, NEON optimization was enabled using
a compiler flag for vector as well as floating-point operations. For DSP-side development, Code Generation Tools 6.1.9 (CGT) were used on the Linux host. Code Composer Studio IDE v4 (CCS) was also used on the Windows platform to build the OpenCV library for the DSP and for JTAG debugging. Codec Engine development requires the eXpressDSP Components (XDC) tools, which are a standard for providing reusable software components optimized for real-time embedded systems 23. XDC contains utilities and standards for API development, static configuration and packaging. The XDC tool uses a config.bld file to associate targets like C64x+ or C67x with platforms like EVM3530 or DM6446. Certain configuration steps must be followed before the compile and link process in order to use a package. Briefly, the steps to use XDC for application development are as follows:
• Configure the application: a *.cfg file is used to specify the packages to be used and the static objects to create. It also performs an integrity check between the application and dependent packages and sets options for modules and objects to change their default behavior.
• Write C code following the XDC conventions.
• Process the application configuration file for the target and platform using the configuro tool provided with XDC. This generates a compiler.opt file to be used by the compiler and a linker.cmd file to be used by the linker.
• Compile and link using a conventional Makefile or the Code Composer Studio method.

[Figure 6: timing diagram of ARM and DSP activity over time. Legend: A — task running on ARM; B — initialization of runtime environment on ARM; C — initialization of runtime environment on DSP and algorithm instance creation; D — remote algorithm start-up; E — algorithm processing on remote processor; F — task running concurrently on ARM; G — synchronizing ARM with DSP; H — waiting for DSP to complete processing the algorithm; I — task running on ARM.]

Figure 6: Illustration of processing on dual cores using a timing diagram. (Note: diagram not drawn to scale.)
4.6 Target Setup and Debug Environment
The target platform (i.e., the BeagleBoard) is connected to the host machine as shown in Figure 7. The host machine runs virtual Ubuntu Linux 8.04. This virtual environment is used for cross-compilation and generation of ARM and DSP executables. The folder containing these executables and the test images is then mounted on the BeagleBoard via the Network File System (NFS). Specifically, a folder “/beagleboard” on the virtual Ubuntu system is shared with the target platform over the network. The BeagleBoard is connected to the network using a USB/Ethernet hub. Other peripherals like a mouse, keyboard and web-cam are also connected to this USB hub. A display monitor is connected to the BeagleBoard using an HDMI connector as the terminal interface. This is where the output is displayed and used for debugging with the GNU Debugger (GDB). An RS-232 serial connection is configured between the target platform and the host machine to enable a console terminal on the host platform. This terminal is used for boot environment setup and as a console command terminal. A 2 GB SD card containing an Angstrom Linux image is inserted in the target to enable self-booting of the BeagleBoard. For debugging purposes, a JTAG probe is connected between the target platform and Code Composer Studio running on the Windows host. Running an application on the target is similar to running an application on any Linux platform; for example, the ./remote_ti_platform_evm3530_OpenCV.xv5T template dsp command starts our template matching application on the DSP. Loading of the DSP executable on the C64x+ is done implicitly by the ARM application.

5. EXPERIMENTAL PROCEDURE AND RESULTS
The comprehensive experimental procedure followed in this work is outlined in this section. Five different techniques that were studied are presented and the results obtained from each technique are tabulated. Finally, the results using the Cortex-A8 core are also compared to those of the floating-point DSP on the OMAP-L138 using the same algorithms.

Although many algorithms were tested, including cvResize, cvSobel and cvCvtColor, emphasis is given to cvMatchTemplate, an OpenCV implementation of the template matching algorithm.
5.1 Test Data
Five sets of images, each consisting of a test image and a template image, were used to test the performance of the algorithm. The algorithm was tested using the same set of test data for each of the methods under study. The gettimeofday() function, defined in the time.h header file, is used to measure the algorithm processing time. The average execution time over 10 runs of the algorithm was recorded. Figure 8 shows the test images and the template image, along with the matched output indicated by an arrow. The input images, which are RGB color images, are first converted to grayscale before the cvMatchTemplate algorithm is applied. The template image is shown in Figure 8 (f).
5.2 Method 1 (ARM Only Implementation)
OpenCV 1.0 was built to run on the ARM processor. OpenEmbedded provides a very easy build method for this task using BitBake. BitBake is a tool for executing tasks and managing metadata which supports cross-compilation, handles inter-package dependencies, runs multiple tasks, and supports multiple builds and multiple operating systems, among other utilities. Using a simple command like bitbake OpenCV, the OpenCV library package was cross-compiled on the host platform along with its dependencies. The NEON SIMD co-processor was used for all the floating-point operations via the compiler flag -mfpu=neon. This package was later installed on the target OMAP3 platform running Angstrom embedded Linux. The application program was then cross-compiled to test the cvMatchTemplate algorithm for different image sizes. The execution time for each experiment is tabulated in Table 1.
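The timing method described in section 5.1, gettimeofday() averaged over 10 runs, can be sketched as follows. time_algorithm_ms() and dummy_algorithm() are hypothetical names standing in for timing a call such as cvMatchTemplate():

```c
#include <stddef.h>
#include <sys/time.h>

/* Average wall-clock execution time of fn() over `runs` calls,
 * in milliseconds, measured with gettimeofday(). */
double time_algorithm_ms(void (*fn)(void), int runs)
{
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int i = 0; i < runs; i++)
        fn();
    gettimeofday(&t1, NULL);
    double elapsed_ms = (t1.tv_sec  - t0.tv_sec)  * 1000.0 +
                        (t1.tv_usec - t0.tv_usec) / 1000.0;
    return elapsed_ms / runs;
}

/* Toy stand-in workload so the timer has something to measure;
 * `volatile` keeps the compiler from optimizing the loop away. */
static volatile long sink;
void dummy_algorithm(void)
{
    for (long i = 0; i < 100000; i++)
        sink += i;
}
```

For example, time_algorithm_ms(dummy_algorithm, 10) returns the mean per-call time in milliseconds; in the experiments the measured call would be the template-matching routine instead.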

Figure 7: Target debug configuration.

5.3 Method 2 (DSP Only Implementation)
The OpenCV library, specifically cv.lib and cxcore.lib, was built for the C64x+ DSP using Code Composer Studio IDE v4 in order to execute the algorithm on the DSP. The cvMatchTemplate algorithm, which is provided in cv.lib, was called using the C6ACCEL library framework on the DSP side. The CE framework was used for RPC, while DSPLink, which uses physically contiguous shared memory, was used for IPC. The UNIVERSAL_process() API provided by the CE framework was used to call the DSP algorithm. A new set of APIs (e.g. DSP_cvMatchTemplate() instead of cvMatchTemplate()) was provided for calling the algorithms running on the DSP side. The input arguments to these APIs were kept similar to those of the original API cvMatchTemplate(). An application program using this implementation was run and execution times were tabulated for different input images.

5.4 Method 3 (Dual-Core Implementation: Algorithm on DSP, Free ARM Core)
In Method 2 described above, control is returned to the ARM processor only after the algorithm processing is complete on the DSP core. This technique does not take advantage of the dual-core processing capability. In order to achieve simultaneous processing on both cores, the UNIVERSAL_processAsync() and UNIVERSAL_processWait() APIs provided by CE were used. UNIVERSAL_processAsync() initiates the algorithm on the remote DSP core and immediately returns control to the ARM processor, which is then free to work on some other part of the application. Whenever the output of the DSP processing is required, DSP_cvSyncDSP(), a wrapper API for UNIVERSAL_processWait(), is called to synchronize the DSP with the ARM. With asynchronous processing, the overhead of calling DSP_cvMatchTemplate() and DSP_cvSyncDSP() is the only execution overhead visible on the ARM side.

5.5 Method 4 (Dual-Core Implementation Using a Fast Math Library for Floating-Point Operations)
The C64x+ core is a fixed-point DSP, so floating-point operations must be emulated in software. The run-time library support provided by the C6000 DSP compiler for this task performs poorly. To further enhance performance, TI provides the FastRTS library, which supplies optimized software emulation of floating-point operations on a fixed-point DSP by replacing the standard run-time support library. After compiling the OpenCV library, it was linked with the FastRTS library by adding the -lfastrts.lib flag to the linker command. The results using the FastRTS library are shown in Table 1.

5.6 Method 5 (Data Division for Simultaneous Processing)
The final approach used in the study was to divide the input data: the test image is split equally into two segments, and each core is given a different segment. The cores then process their data independently and simultaneously.
After initiating the algorithm on the DSP core using the asynchronous method described in Section 5.4, the same algorithm was started on the ARM core for its half of the image data. The outputs from both cores are combined at the end to form one result image.

Table 1: Comparison of total execution time for the cvMatchTemplate algorithm, in milliseconds. Methods 1 through 5 were measured on the OMAP3530; the last two columns give the OMAP-L138 C674x DSP result at 300 MHz and its linear projection to 720 MHz.

| Image Size | Method 1 (ARM Only) | Method 2 (DSP Only) | Method 3 (Dual-Core) | Method 4 (Dual-Core, Fast Math Library) | Method 5 (Dual-Core, Data Division) | OMAP-L138 C674x DSP, 300 MHz | C674x DSP, 720 MHz (projected) |
|---|---|---|---|---|---|---|---|
| 64x64 | 10.52 | 110.687 | 110.718 | 58.380 | 37.781 | 22.40 | 9.33 |
| 128x128 | 39.11 | 518.737 | 519.195 | 271.484 | 153.076 | 94.70 | 39.46 |
| 256x256 | 185.21 | 2339.416 | 2339.508 | 1231.445 | 690.948 | 437.00 | 182.08 |
| 320x240 | 304.90 | 4129.913 | 4130.035 | 2171.020 | 1014.618 | 762.00 | 317.50 |
| 640x480 | 878.91 | 11583.282 | 11583.344 | 6090.546 | 3634.857 | 2090.00 | 870.83 |
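The data division of Method 5 can be sketched as a row-wise split of the correlation result map, where each half is the unit of work handed to one core. The naive correlation below is an illustrative stand-in for cvMatchTemplate, not the OpenCV implementation:

```c
/* Naive cross-correlation score map, restricted to result rows
 * [row0, row1): this is the unit of work handed to each core in the
 * data-division scheme.  The result map is (ih-th+1) x (iw-tw+1). */
static void match_rows(const unsigned char *img, int iw, int ih,
                       const unsigned char *tpl, int tw, int th,
                       long *result, int row0, int row1)
{
    int rw = iw - tw + 1;
    int y, x, ty, tx;
    (void)ih;  /* image height is implied by the row range */
    for (y = row0; y < row1; y++)
        for (x = 0; x < rw; x++) {
            long acc = 0;
            for (ty = 0; ty < th; ty++)
                for (tx = 0; tx < tw; tx++)
                    acc += (long)img[(y + ty) * iw + (x + tx)] *
                           tpl[ty * tw + tx];
            result[y * rw + x] = acc;
        }
}

/* Split the result rows between the two cores and combine.  The halves
 * overlap in the *input* rows they read (by template height minus one
 * rows), so no candidate position at the seam is lost. */
static void match_divided(const unsigned char *img, int iw, int ih,
                          const unsigned char *tpl, int tw, int th,
                          long *result)
{
    int rh = ih - th + 1;
    int half = rh / 2;
    match_rows(img, iw, ih, tpl, tw, th, result, 0, half);   /* "DSP" half */
    match_rows(img, iw, ih, tpl, tw, th, result, half, rh);  /* "ARM" half */
}
```

On the target, the two match_rows calls would run concurrently on the DSP and ARM cores rather than sequentially as shown here.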

5.7 Implementation on the OMAP-L138 Platform
An approach similar to that used for the BeagleBoard was followed to build the OpenCV library for the ARM and DSP cores of the OMAP-L138, a dual-core SoC from TI with an ARM9 RISC GPP and a C674x floating-point DSP [24]. Both cores run at 300 MHz. With the help of technical support from TI, a similar software design and development environment was set up on their OMAP-L138 development platform, and the execution times of various OpenCV algorithms on the C674x were measured. This helped to benchmark the algorithms on a floating-point DSP, as shown in Table 1. Since the OMAP-L138 runs at 300 MHz, a linearly projected execution time is also provided in the table for 720 MHz, which is the clock speed of the Cortex-A8 ARM processor.
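The projection is a simple clock-ratio scaling; for example, the 256x256 entry of 437.00 ms at 300 MHz projects to 437.00 * 300 / 720 ≈ 182.08 ms, matching Table 1. The helper name below is illustrative:

```c
/* Linearly project an execution time measured at one clock speed to
 * another, as done for the 720 MHz column of Table 1:
 *     t_projected = t_measured * f_measured / f_target */
static double project_time_ms(double t_ms, double f_measured_mhz,
                              double f_target_mhz)
{
    return t_ms * f_measured_mhz / f_target_mhz;
}
```

A linear projection assumes execution time is purely CPU-bound; memory-bound phases would not scale with clock speed, so the 720 MHz column should be read as an optimistic estimate.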

5.8 Output Data
Figure 8 shows the output of the cvMatchTemplate algorithm for the different test images. Although many different algorithms were tested, only results for cvMatchTemplate are presented. In each image, a rectangle is drawn around the highest-correlated pixel. The algorithm successfully finds the correct match, or the best available match when the template is absent: since the searched template does not appear in image (b), Figure 8 (b) shows only the best match.
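Locating the highest-correlated pixel in the result map amounts to a peak scan, the role cvMinMaxLoc() plays in OpenCV; the helper below is an illustrative equivalent, and the rectangle in Figure 8 would be drawn at (best_x, best_y) with the template's width and height:

```c
/* Locate the peak of a correlation result map of size rw x rh,
 * equivalent to the maximum-location output of cvMinMaxLoc(). */
static void find_peak(const float *result, int rw, int rh,
                      int *best_x, int *best_y)
{
    int x, y;
    float best = result[0];
    *best_x = 0;
    *best_y = 0;
    for (y = 0; y < rh; y++)
        for (x = 0; x < rw; x++)
            if (result[y * rw + x] > best) {
                best = result[y * rw + x];
                *best_x = x;
                *best_y = y;
            }
}
```

Note that for comparison methods where a lower score is better (e.g., squared-difference methods), the scan would track the minimum instead.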

Figure 8: Output of cvMatchTemplate. Test image of size (a) 64x64, (b) 128x128, (c) 256x256. (d) 320x240, and (e) 640x480. (f) Template image of size 16x16.

6. DISCUSSION
Figure 9 shows a plot of the various methods implemented to optimize performance on the OMAP 3530 and OMAP-L138.

Method 1 (ARM only) - The ARM Cortex-A8 core shows the best performance. This can be attributed to the floating-point support available in the form of the NEON co-processor. It can be observed that the DSP performance for the same algorithm on the OMAP-L138 is virtually identical; again, this is due to the floating-point hardware support available to the DSP on the OMAP-L138. The major disadvantage of the ARM-only approach is that the DSP sits idle, so the dual-core processor efficiency and load factor are extremely low. It should also be noted that if the ARM were running other tasks on the Linux kernel, the time-slicing round-robin scheduler could slow the algorithm considerably.

Figure 9: Execution time for various methods.

Method 2 (DSP only) - The match template algorithm runs slower on the DSP than on the ARM. DSP performance is affected by many factors, such as CE communication overhead, data transfer, and cache management. The most important factor is that the algorithm involves heavy floating-point computation, which the fixed-point C64x+ DSP does not support in hardware; much time is consumed in software emulation of floating-point operations. Moreover, Method 2 is a synchronous method that does not free the ARM for other tasks.

Method 3 (Dual-core implementation: algorithm on DSP, free ARM core) - This method introduced asynchronous DSP operation, in which the algorithm is instantiated on the DSP and control is returned to the ARM core; the DSP is synchronized at the end, when the output is required. The communication overhead to start the algorithm and the overhead to synchronize the DSP are essentially constant for all image sizes, while the total execution time varies with the input data size. The biggest advantage of this method is that both cores are utilized, so the overall dual-core processor efficiency is much higher.

Method 4 (Dual-core implementation using a fast math library for floating-point operations) - The figure shows that performance increases considerably when the FastRTS library is utilized. The FastRTS library provides optimized support for floating-point emulation, which is evident from the plot: the execution time is nearly halved compared to the previous method.

Method 5 (Dual-core implementation with data division) - The data is divided equally between the two processors and the same algorithm runs simultaneously on the two cores. This is a hybrid method that can allow further speedup depending on the ARM processor load.
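The fire-and-forget pattern of Method 3 can be illustrated on a desktop with a POSIX thread standing in for the remote DSP: dsp_start() plays the role of UNIVERSAL_processAsync() and dsp_sync() that of the synchronization call. All names and the stand-in workload below are illustrative:

```c
#include <pthread.h>

/* Minimal analogue of the asynchronous pattern in Method 3.  The
 * "DSP job" here is a stand-in computation (sum of 0..input-1). */
typedef struct {
    pthread_t thread;
    long input;
    long result;
} DspJob;

static void *dsp_worker(void *p)
{
    DspJob *job = (DspJob *)p;
    long i, acc = 0;
    for (i = 0; i < job->input; i++)  /* stand-in for the DSP algorithm */
        acc += i;
    job->result = acc;
    return NULL;
}

/* Fire the job and return immediately, like UNIVERSAL_processAsync(). */
static int dsp_start(DspJob *job, long input)
{
    job->input = input;
    return pthread_create(&job->thread, NULL, dsp_worker, job);
}

/* Block until the job completes, like the DSP_cvSyncDSP() wrapper. */
static long dsp_sync(DspJob *job)
{
    pthread_join(job->thread, NULL);
    return job->result;
}
```

Between dsp_start() and dsp_sync(), the calling core is free to do other work, which is exactly the property exploited in Methods 3 through 5.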
Looking at the results for the OMAP-L138, the performance of the floating-point DSP matches that of the ARM with the NEON co-processor, thanks to the C674x floating-point hardware. In a nutshell, the performance of cvMatchTemplate was improved by using the dual-core architecture on the OMAP3 platform. The performance of the DSP-only implementation on the OMAP3530 was not as high as expected; however, it gradually improved with the different techniques discussed. The performance on the C64x+ DSP can be explained by the required software emulation of floating-point operations and the overhead due to the SOC architecture, a conclusion validated by the results from the OMAP-L138.

7. CONCLUSIONS AND FUTURE WORK
In this research a mechanism to optimize image processing and computer vision algorithms from the popular OpenCV library on an asymmetric dual-core processor has been established. A comprehensive study of the OMAP3 architecture and a detailed outline of how optimized algorithm development can be performed on the OMAP3 platform were undertaken. Shared memory was used for data communication and DSPLink was used for the necessary inter-processor communication protocol. The C6ACCEL library framework, which in turn uses the Codec Engine framework, was extended to provide support for the OpenCV algorithms. The following methods were tested: (a) ARM-only implementation, (b) DSP-only implementation, (c) dual-core implementation (algorithm on DSP, free ARM core), (d) dual-core implementation using a fast math library for floating-point operations, and (e) dual-core implementation with data division. The benchmark results show that the ARM Cortex-A8 outperforms the C64x+ DSP, which can be attributed to the floating-point support available in the form of the NEON co-processor. On the other hand, results from the OMAP-L138 suggest that a DSP with floating-point support can equal or even exceed the performance of the ARM Cortex-A8. Also, the ARM-only implementation results in extremely low dual-core processing efficiency and load factor. Thus, methods (c) through (e), which run the algorithm on the DSP core, yield better efficiency by freeing the ARM core for other tasks. The techniques discussed in this work can be applied to any algorithm to take advantage of dual-core processing power. A number of techniques for performance improvement can be implemented in future work. One possible approach is to use enhanced DMA (EDMA) and on-chip static RAM on the DSP side to reduce main-memory access times. Software techniques like double or triple buffering can further improve algorithm performance.
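The double-buffering idea mentioned as future work can be sketched as a ping-pong index swap, in which fetching the next data block (the job EDMA would do in hardware) overlaps processing of the current one. The code below only models the index logic with stand-in transfer and compute steps:

```c
/* Ping-pong (double) buffering skeleton: while one buffer is being
 * processed, the other could be filled by EDMA with the next block,
 * hiding memory-transfer latency.  Here "transfer" is an array copy
 * and "processing" is a running sum; the index swap is the point. */
static long process_double_buffered(const long *input, int nblocks)
{
    long buffer[2];
    long sum = 0;
    int active = 0, i;

    buffer[active] = input[0];                /* prime the first buffer */
    for (i = 0; i < nblocks; i++) {
        if (i + 1 < nblocks)
            buffer[1 - active] = input[i + 1]; /* "EDMA" fetches next   */
        sum += buffer[active];                 /* "DSP" processes current */
        active = 1 - active;                   /* swap ping-pong buffers */
    }
    return sum;
}
```

Triple buffering extends the same idea with a third buffer so that input transfer, processing, and output write-back can all overlap.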
Each OpenCV algorithm can be individually tuned for performance by using combined hardware/software techniques. During the course of this research, support was provided for only ten OpenCV algorithms. Future work could extend support to all OpenCV algorithms in the library.

REFERENCES
[1] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest and Clifford Stein, "Introduction to Algorithms," Second Edition, MIT Press, 2001.
[2] OMAP3530/25 Application Processor, SPRS507F, February 2008, revised October 2009, Texas Instruments Inc.
[3] F. Esposito and D. Malerba, "Editorial: Machine Learning in Computer Vision," Journal of Applied Artificial Intelligence, Vol. 15, pp. 693-706, 2001.
[4] G. Bradski and A. Kaehler, "Learning OpenCV," First Edition, O'Reilly Publications, 2008.
[5] J. C. Huang and T. Leng, "Generalized Loop-unrolling: A Method for Program Speed-up," IEEE Proc. on Application-Specific Systems and Software Engineering and Technology, pp. 244-248, 1999.
[6] http://www.azillionmonkeys.com/qed/optimize.html, accessed 1 Nov 2010.
[7] http://www.eetimes.com/design/automotive-design/4007106/Back-to-the-Basics-Tested-and-effective-methods-for-speeding-up-DSP-algorithms, accessed 1 Nov 2010.
[8] P. Tseng, Y. Chang, Y. Huang, H. Fang, C. Huang and L. Chen, "Advances in Hardware Architectures for Image and Video Coding—A Survey," Proc. of IEEE, Vol. 93, No. 1, pp. 184-197, Jan 2005.
[9] D. Baumgartner, P. Rossler and W. Kubinger, "Performance Benchmark of DSP and FPGA Implementation of Low-Level Vision Algorithms," IEEE Transactions on Computer Vision and Pattern Recognition, pp. 1-8, 2007.
[10] Kiefer Kuah, "Motion Estimation with Intel Streaming SIMD Extensions 4 (Intel SSE4)," Intel Software Solutions Group, Oct 2008.
[11] M. Ujaldon and U. V. Catalyurek, "High-Performance Signal Processing on Emerging Many-Core Architectures Using CUDA," IEEE International Conference on Multimedia and Expo, pp. 1825-1828, 2009.
[12] B. Kisacanin, "Integral Image Optimizations for Embedded Vision Applications," Proc. IEEE SSIAI, pp. 181-184, 2008.
[13] J. Chaoui, K. Cyr, S. de Gregorio, J. Giacalone, J. Webb and Y. Masse, "Open Multimedia Application Platform: Enabling Multimedia Applications in Third Generation Wireless Terminals Through a Combined RISC/DSP Architecture," IEEE, Vol. 2, pp. 1009-1012, 2001.
[14] U. S. Gorgonio, H. R. B. Cunha, E. X de L. Filho, S. O. D. Luiz, A. Perkusich and M. R. A. Morais, "Application Profiling in a Dual-Core Platform," International Conference on Consumer Electronics, pp. 1-2, 2008.
[15] J. Liang and Y. Wu, "Wireless ECG Monitoring System Based on OMAP," IEEE Proc. on Computational Science and Engineering, Vol. 2, pp. 1002-1006, 2009.
[16] P. Poudel and M. Shirvaikar, "Optimization of Computer Vision Algorithms for Real Time Platforms," 42nd IEEE Proc. on South Eastern Symposium on System Theory, pp. 51-55, 2010.
[17] "Cortex-A8 Technical Reference Manual," Revision r3p2, 2008-2010.
[18] SPRUFA3, "OMAP35xx Application Processor IVA2.2 Technical Reference Manual," Texas Instruments Inc., 2008.
[19] "BeagleBoard System Reference Manual Rev C4," Revision 0.0, 2009.
[20] SPRUED5B, "Codec Engine Server Integrator User Guide," Texas Instruments Inc., 2007.
[21] S. Preissegi, "Programming Details of Codec Engine for DaVinci Technology," Texas Instruments, 2006.
[22] H. H. P. Freyther, K. Kooi, D. Vollmann, J. Lenehan, M. Juszkiewicz and R. Leggewie, "OpenEmbedded User Manual," http://docs.openembedded.org/usermanual/usermanual.html, 2009.
[23] SPRUEX4, "XDC Consumer User's Guide," Texas Instruments Inc., 2007.
[24] SPRS586B, "OMAP-L138 Low-Power Application Processor," Texas Instruments Inc., 2010.
