Parallel Implementation of Real-Time Block-Matching based Motion Estimation on Embedded Multi-Core Architectures

Nicolai Behmann, Oliver Jakob Arndt and Holger Blume
Leibniz Universität Hannover, Institute of Microelectronic Systems
Appelstr. 4, 30167 Hanover, Germany
{behmann, arndt, blume}@ims.uni-hannover.de

Abstract—Considering the strict demands of video-based advanced driver-assistance systems in terms of real-time execution, complex applications are usually realized with dedicated hardware solutions. However, modern vector-accelerated multi-core processors, serving as cheap off-the-shelf components, feature increasing computational performance while executing flexible and maintainable software code. As the processor characteristics (e.g., vector-instruction set, NoC topology) vary between chips, the parallelization and vectorization benefit depends on the target platform. In this work, we present a parallel real-time implementation of a predictive block-matching based motion estimation algorithm using parallelization methods on data and task level. We target our implementation at the Samsung Exynos 5 Octa mobile ARM big.LITTLE multi-core processor, exhibiting NEON SIMD vectorization. While we illustrate our implementation strategies in detail, we specifically point out representative differences of the NEON instruction-set extension compared with the SSE vector operations common in x86-based architectures. Furthermore, to analyze and compare the resulting parallelization and vectorization benefits, as well as the scalability over threads, we also consider an Intel Core-i7 general-purpose processor as a reference platform.
I. INTRODUCTION

Motion estimation from image sequences is an essential preprocessing step in different fields, such as video processing, coding, and multimedia applications, as well as ego-motion estimation in video-based driver assistance systems. Appropriate algorithms are often based on block-matching, combining complex data flows and dependencies with a huge amount of memory-expensive block similarity calculations. The requirements concerning real-time processing, combined with low latency and high data throughput, are often fulfilled with dedicated or customized hardware architectures. For prototyping purposes, these accelerators are inconvenient in terms of development costs and flexibility. Through the emerging smartphone market, mobile multi-core processors deliver increasing computing capabilities, while featuring small power demands and product costs. Embedded multi-core SoCs profit from a larger memory bandwidth and additional SIMD units on a single die, offering high data parallelism at minimum power consumption. Moreover, the use of high-level programming languages reduces the time needed to evaluate the performance of an algorithm and offers portability and maintainability, while reducing the development costs and time-to-market. However, efficiently exploiting the SoC resources (e.g., processor, memory, SIMD units) and implementing complex parallelization methods demand experienced programmers, who must avoid common concurrency errors such as false sharing, deadlocks, or race conditions. In particular,
the use of dedicated SIMD units requires specific programming instruments, which differ in the resulting speedup and the amount of required code modifications. Therefore, in-depth knowledge about the possible speedup and the necessary programming effort is helpful in the concept phase of the migration from sequential to concurrent architectures. When combining different parallelization methods, their interferences and influences on the resulting scaling behavior across multiple processors need to be taken into account to find the optimal tradeoff between hardware utilization and algorithmic acceleration. As the scaling behavior over threads depends on the architecture-specific system interconnection network and on the capabilities of the particular processing units, we compare these parameters between the embedded multi-core processor Samsung Exynos 5 Octa, featuring the ARM big.LITTLE architecture, and the general-purpose Intel Core-i7 quad-core processor. The latter features hyperthreading technology, enabling two independent hardware threads with separate register sets to be executed on the same core. In this work, we introduce a parallel implementation of an exemplary motion estimation pipeline, consisting of the Parallel Predictive Block-Matching (PPBM) algorithm [1] and a vector field erosion [2]. The implementation targets general-purpose and mobile multi-core platforms, satisfying real-time requirements in advanced driver assistance systems. We evaluate the platform performance of a low-power SoC, featuring the ARM big.LITTLE architecture, in comparison with a reference Intel Core-i7. Specifically, this work makes the following major contributions:
• Parallelized and vectorized implementation of the PPBM motion estimation algorithm and erosion
• In-depth analysis and comparison of vectorization strategies and instruction-set extensions
• Evaluation of the parallel runtime behavior and scalability over threads, describing the influences and interferences of combined data- and task-level parallelization methods
The remainder of this paper is organized as follows: The following section reviews the related work on algorithms and migration strategies to concurrent multi-core architectures. The general concept of the implemented motion estimation algorithm is described in Section III. Sections IV and V present the parallelization methods on task and data level, respectively. Results for the separate parallelization methods and their interference are given in Section VI. Finally, Section VII concludes this paper.
II. RELATED WORK
Motion estimation is the task of assigning a two-dimensional vector to objects, image patches, or pixels, describing the translational movement between two successive frames. Motion estimation is applied in many fields, ranging from motion-compensated temporal interpolation and temporal noise reduction in consumer electronics to ego-motion estimation in advanced driver assistance systems. While optical-flow approaches approximate the image intensities with a Taylor expansion, resulting in a linear equation system with additional constraints, we transfer and adapt a block-based algorithm from video coding to the use in driver assistance. Block-based algorithms divide the input image into equally sized image patches (blocks) and generate distinct motion vectors for each. Leading to a two-dimensional matching problem between blocks of two successive frames, such block-based algorithms vary in their search space and the set of tested matching candidates. Traversing every candidate inside a predefined maximum search space, the Full Search Block-Matching (FSBM) features low control flow and many independent calculations, predestined for concurrent execution. While the search-area dimensions determine the maximum detectable motion vector, the computational expense increases with the huge number of candidates required in driver assistance systems at increasing car speed. Predictive algorithms, e.g., proposed by de Haan [3], Blume [1], and Wang [4], consider temporal and spatial dependencies and generate smoother vectors with a closer relation to the true motion, reducing errors on periodic structures and sensor noise. While decreasing the computational effort by consciously selecting matching candidates, dependencies in data and control flow involve more complex implementation challenges. As these algorithms still exhibit high computational complexity, previous implementations make use of dedicated or customized hardware [17], resulting in low flexibility. To refine the motion vector field from block to pixel-granular coordinates, de Haan [2] proposed a vector field erosion filter.

In the past, programmers profited from higher performance through increasing clock rates without software modifications. Due to the exponential growth of power dissipation with rising clock frequencies, processors shifted from single- to multi-core architectures. As the so-called Free Lunch, described by Sutter [6], is over, the migration of serial code to concurrent architectures is mandatory to harvest the added processing performance of multiple computation engines. As programs nowadays have to be written in an explicitly parallel manner to exploit all processor resources, evaluations of programming instruments in terms of simplicity, flexibility, and the resulting performance benefits are important. Asanovic et al. [7] proposed, besides parallel programming models and auto-tuners, different methods to increase algorithm performance by adapting the software design to specific computation and communication patterns. Lee [11] argues that threads are a problematic abstraction for concurrent execution and therefore proposed essentially deterministic library components. Encapsulating domain-specific knowledge in reusable components, integrating concurrency with languages, and supporting explicit declarations to help compilers and operating-system schedulers can help to effectively exploit multi-core architectures through software in the opinion
of Manferdelli et al. [8]. Hoefler et al. [9] proposed to use performance-modeling methods from the development to the production stage, enriching the application design process. A market survey by Hebisch et al. [10] highlights widely used programming tools in tool-driven multi-core software development. Besides sequential code optimizations, Arndt et al. [5] presented an efficient software migration workflow from single- to multi- and many-core architectures.

Present and future architectures feature different SIMD hardware units with increasing register width, ideally suited for parallel processing on data level. Related programming instruments range from automatic vectorization on loop level by the compiler to the use of hardware-specific assembler instructions or intrinsics. Language extensions, such as pragmas and array-slicing notations, introduced to the C language with Intel Cilk Plus, allow the programmer to provide the compiler with information, e.g., data dependencies, minimum loop lengths, and branching probabilities, for optimized code generation.

III. ALGORITHM BASICS

Motion estimation algorithms assign an individual vector V(x, k−1) to each pixel I(x, k−1) of the reference image, describing the local motion between two successive frames k−1 and k. Each vector consists of the pixel-accurate displacement, describing the position of the candidate block with the maximum similarity in the following frame. The basic FSBM traverses the reference block within a predefined search area in the following frame and selects the candidate vector with the maximum similarity, on the basis of the block-wise sum of absolute differences (SAD). Large block displacements, common in driver assistance systems, require a speed-proportional adaptation of the search area, increasing the computation time while becoming more noise sensitive. Assuming spatial and temporal relations within the image, the number of candidate vectors, which determines the necessary execution time, can be significantly reduced, while the vector field homogeneity is enhanced. With respect to the predominant motion in horizontal and vertical directions, the Parallel Predictive Block-Matching (PPBM) [1] algorithm divides the search space into two independent convergence directions (horizontal and vertical), each converging towards the true object movement over several frames and over the blocks within an object. Each convergence direction includes one spatial, one temporal, and eight update candidate vectors.
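Using the notation above, the matching criterion and the selection of the best candidate can be stated compactly as follows (a restatement under the assumption that maximum similarity corresponds to minimum SAD; B(c) denotes the set of pixel positions of the block at c and C the evaluated candidate set):

$$\mathrm{SAD}(\vec{c}, \vec{d}) = \sum_{\vec{x} \in B(\vec{c})} \bigl|\, I(\vec{x} + \vec{d},\, k) - I(\vec{x},\, k-1) \,\bigr|, \qquad \vec{V}(\vec{c}, k-1) = \arg\min_{\vec{d} \in \mathcal{C}} \mathrm{SAD}(\vec{c}, \vec{d})$$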
Fig. 1. Common predictor configuration of the PPBM, and possible horizontal candidate vectors, extracted from the spatial and temporal predictor.
Fig. 2. Motion estimation processing chain using Parallel Predictive Block-Matching (BM) including Combination (C), as well as iterative Erosion (ER).
Based on the hypothesis that an object is bigger than a block, spatial predictors represent an already calculated motion vector of the current vector field. Consequently, the motion vector is taken from a block left of or above the current block for the horizontal or vertical convergence direction, respectively. The temporal candidate takes the motion from the previous frame into account, so its position can be chosen independently. The fixed update-vector set, added to the spatial predictor, ensures the convergence of the predicted motion vectors towards the true object movement and corrects small motion changes. Within each convergence direction, the candidate with the maximum similarity, evaluated by the sum of absolute differences (SAD), is chosen. Finally, the independent convergence directions are combined using a decision scheme, based on the possible convergence towards the true motion within each convergence direction and on the motion vector length. Additionally, the zero vector is evaluated, resulting in 21 candidate block positions for each block. A typical predictor configuration for the current block at the position c, used in consumer electronics and advanced driver assistance systems, is illustrated in Figure 1 with an exemplary block size of [4 × 4]. The spatial and temporal predictors, v_s,h/v and v_t,h/v, are extracted from the current or previous vector field at the positions V_h/v(c + s_h/v, k−1) and V_h/v(c + t_h/v, k−2), respectively, for the separate horizontal and vertical search directions. For clarity, only the candidate vectors of the horizontal search direction are illustrated.

The implemented motion estimation algorithm uses a cascade of Parallel Predictive Block-Matching (PPBM) stages with decreasing block sizes from stage to stage, as illustrated in Figure 2. An initial block-matching stage PPBM1 with larger blocks avoids errors caused by periodic structures and reduces the calculation time. The following stages with smaller block sizes, taking the results of previous stages as temporal predictors into account, improve the coarse-grained vector field V_BM(x, k−1), especially at object edges. After the last PPBM stage, the separate vector fields of the horizontal and vertical search direction are combined in stage C. Finally, a vector field erosion ER, proposed by de Haan [2], refines the vector field to the pixel plane by iteratively dividing each block into four equally sized sub-blocks and assigning to each sub-block the median vector of its neighborhood.
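To make the candidate structure of one convergence direction concrete, the following minimal sketch enumerates the candidates of the horizontal direction. The concrete update-vector set U is an assumption made purely for illustration; the paper only states that a fixed set of eight update vectors is added to the spatial predictor, and that the zero vector is evaluated once in addition for both directions.

#include <array>
#include <cstdint>
#include <vector>

struct MV { int16_t x, y; };

// Candidates of the horizontal convergence direction: one spatial predictor
// (left neighbour), one temporal predictor (previous vector field), and eight
// update vectors added to the spatial predictor. The update set U below is
// purely illustrative and not taken from the paper.
std::vector<MV> horizontalCandidates(const MV& spatial, const MV& temporal)
{
    static const std::array<MV, 8> U = {{ {1, 0}, {-1, 0}, {2, 0}, {-2, 0},
                                          {0, 1}, {0, -1}, {4, 0}, {-4, 0} }};
    std::vector<MV> c;
    c.push_back(spatial);
    c.push_back(temporal);
    for (const MV& u : U)
        c.push_back({ int16_t(spatial.x + u.x), int16_t(spatial.y + u.y) });
    return c;  // 10 candidates; together with the vertical direction and the
}              // shared zero vector, 21 block positions are evaluated per block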
Fig. 3. Data decomposition parallelization scheme with synchronizations after each block-matching stage, subdivided into separate orthogonal directions.
IV. PARALLELIZATION

In this section, we introduce the parallelization methods on task level for the presented motion estimation pipeline, including the Parallel Predictive Block-Matching and the erosion. While considering internal control- and data-flow dependencies of the algorithm, we present the associated software structure and work-distribution strategy. Based on the heterogeneous ARM big.LITTLE [19] multi-core architecture, the Samsung Exynos 5 Octa 5422 SoC [18] represents our target platform for the previously introduced processing chain. The SoC's general-purpose processing power is divided into two quad-core clusters, which communicate via a coherent interconnect bus and share the global memory. The big and LITTLE clusters feature four ARM Cortex-A15 and four Cortex-A7 cores, running at 2.0 GHz and 1.4 GHz, respectively. As each big core features a 128 bit ARM NEONv2 and VFPv4 SIMD unit, we compare the performance of these accelerators with the vector units of the Intel Core-i7 4771 general-purpose processor. This architecture provides a 256 bit AVX2 vector unit on each core, backward compatible with the MMX and SSE instruction-set extensions. Due to hyperthreading, the processor delivers a total of eight threads with hardware-managed context switching in a single quad-core cluster, running at 3.5 GHz.

In order to minimize synchronizations, work imbalances, and latencies, distributing the workload among multiple cores using parallelism on task level requires detailed knowledge of the algorithm's data dependencies, which are caused by the spatial predictors and the algorithm's processing chain. Since the algorithm contains two independent and thus potentially concurrent cascades of PPBM stages, representing the horizontal and vertical convergence direction, both cascades could be processed in parallel. However, as such a scheme limits the dynamic scalability of the resulting implementation and involves high memory consumption, we decided to execute them sequentially. Task parallelism is instead enabled by a data decomposition [5] within each individual processing stage, as illustrated in Figure 3. Thereby, a separate and thus private data chunk is distributed to each core, improving data locality and preventing cache-coherency overhead. Each core concurrently executes the same processing step on an equally sized data patch, which nearly eliminates possible work imbalances and idle times before synchronization points. The data dependencies caused by the spatial predictors are respected by splitting the workload horizontally or vertically for the horizontal or vertical convergence direction, respectively. As each core can then process its private data chunk in scan order, data locality is further improved. To later exploit the entire vector-register capacity and to increase the share of linear memory accesses, the combination process has been split into two stages: first, the SAD of the zero-vector candidate is calculated for each block in the current image and stored in a temporary array; afterwards this value is loaded and the resulting motion vector is generated.
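As an illustration of this data decomposition, the following is a minimal sketch (not the paper's actual code) of how one PPBM stage of the horizontal convergence direction could be distributed over the cores; the block-grid layout, the OpenMP parallel-for, and the matchBlock() helper are assumptions made for this example.

#include <cstdint>
#include <vector>

struct MV { int16_t x, y; };

// Hypothetical per-block step: evaluates the candidate set of one convergence
// direction and returns the vector with the minimum SAD.
MV matchBlock(int bx, int by, const std::vector<MV>& field, int blocksX);

// Horizontal convergence direction: the spatial predictor is the left neighbour,
// so blocks within a row must be processed left-to-right, while different rows
// are independent and can be handed to the cores as private chunks.
void ppbmHorizontalStage(std::vector<MV>& field, int blocksX, int blocksY)
{
    #pragma omp parallel for schedule(static)   // one contiguous strip of rows per core
    for (int by = 0; by < blocksY; ++by)
        for (int bx = 0; bx < blocksX; ++bx)    // scan order: predictor already computed
            field[by * blocksX + bx] = matchBlock(bx, by, field, blocksX);
    // the implicit barrier at the end of the parallel-for is one of the
    // per-stage synchronizations counted later as Nsync
}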
Fig. 4. Schematics of threads, caches, and communication topologies: (a) Samsung Exynos 5 Octa (four Cortex-A15 cores with NEON units and a shared 2 MB L2 cache, four Cortex-A7 cores with a shared 512 kB L2 cache, common DDR-RAM); (b) Intel Core-i7 (four cores with two hardware threads and an AVX2 unit each, 256 kB L2 cache per core, shared 8 MB L3 cache, DDR-RAM).
(a) Unoptimized SAD calculation with variable block size:

for (sad = 0, y = 0; y < blockHeight; y++) {
    for (x = 0; x < blockWidth; x++) {
        idx = y * lineWidth + x;
        sad += abs(a[idx] - b[idx]);
    }
}

(b) SAD calculation using Cilk Plus array notation:

for (sad = 0, y = 0, pA = a, pB = b; y < 16; ++y) {
    tmp[0:16] = std::abs(pA[0:16] - pB[0:16]);
    sad += __sec_reduce_add(tmp[0:16]);
    pA += lineWidth; pB += lineWidth;
}

(c) SAD calculation using OpenMP compiler hints:

for (sad = 0, y = 0, pA = a, pB = b; y < 16; ++y) {
    #pragma omp simd reduction(+:sad) aligned(pA, pB:16) linear(pA, pB:1)
    for (int x = 0; x < 16; x++)
        sad += std::abs(pA[x] - pB[x]);
    pA += lineWidth; pB += lineWidth;
}

(d) SAD calculation using SSE intrinsics:

__m128i rSad = _mm_setzero_si128();
for (y = 0, pA = a, pB = b; y < 16; y++) {
    __m128i rA = _mm_load_si128((const __m128i*) pA);
    __m128i rB = _mm_load_si128((const __m128i*) pB);
    __m128i rT = _mm_sad_epu8(rA, rB);
    rSad = _mm_adds_epu16(rSad, rT);
    pA += lineWidth; pB += lineWidth;
}
sad = _mm_extract_epi16(rSad, 0) + _mm_extract_epi16(rSad, 4);

Fig. 5. Code samples for the SAD calculation of a block with size [16 × 16] using OpenMP pragmas, Cilk Plus array-slicing notation, and SSE intrinsics.
The number of synchronizations Nsync of the entire processing chain can be precalculated from the number of block-matching stages NB and the final block size b(NB) as Nsync = 2 · (NB + 1) + log2(b(NB)). For a typical motion estimation pipeline, containing NB = 3 PPBM stages and a concluding block size of b(NB) = 2 ([2 × 2]), Nsync = 9 synchronizations are necessary.

V. VECTORIZATION

The runtime bottleneck, represented by the SAD similarity calculation, is perfectly suited for data-level parallelism, as it exhibits no data dependencies and no control flow, while featuring linear memory accesses within each block row. The programmable SIMD accelerators of both considered architectures feature optimized memory accesses and specialized SAD instructions, reducing the number of memory accesses and instructions in comparison to the sequential execution. Moreover, offloading this workload enables hardware-internal task parallelism, so the calling processor is able to execute scalar operations concurrently. In the following, we present different vectorization methods and offer example implementations of the SAD calculation of one block, in order to evaluate the effort and difficulty of the required code manipulations. The non-vectorized SAD calculation of a single block can be formulated as a doubly nested for loop over all block elements, accumulating the pixel-wise absolute differences, as shown in Figure 5a.

Auto-vectorizer: Modern compilers often include the possibility of auto-vectorization, representing the most common method of vectorization [14]. The auto-vectorizer is able to vectorize on loop level using a compiler-specific set of known patterns, which determines the effectiveness and efficiency of the code generation. Due to the formulation in high-level languages, this instrument provides the highest portability without the need for any code changes. In order to improve the efficiency of the vectorization, the loop body should be as slim as possible to fit the loop patterns known to the compiler, and the loop length should be known at compile time. Additional performance improvements and optimized load and store instructions can be enabled through compiler hints, which declare aligned memory or explicitly exclude loop-carried dependencies caused by multiple pointers pointing to the same memory block. For this purpose, OpenMP [16] offers convenient SIMD pragmas. Considering the SAD calculation of one block, compiler hints concerning properly aligned and linear memory accesses, together with a summating reduction, can improve the auto-vectorization result; they are illustrated in Figure 5c.

Array slicing: An alternative abstract vectorization instrument is the notation of array operations as slicing operations, known from Matlab. Intel Cilk Plus [15] introduces this method in the current Intel compilers and in GCC starting from version 4.9. Due to the explicitly parallel syntax formulation, as illustrated in Figure 5b, efficient vectorizations on multiple architectures become possible. However, Cilk Plus array-slicing operations are currently supported exclusively on x86 machines.

Hardware-specific intrinsics and inline assembly instructions offer low-level access to the SIMD engines [13] at the cost of portability and flexibility losses [12], as well as increased programming effort. Hardware-specific intrinsics, available in high-level programming languages, allow the compiler to further optimize the software and hide the explicit register allocation from the user, in comparison to inlining blocks of assembly language. As these intrinsics are directly mapped to the equivalent assembler instructions, we focus on this vectorization method. As such vector-instruction set extensions become more and more standardized and are typically valid for a whole architecture family, the portability increases; for instance, Intel released a wrapper for the execution of NEON-specific intrinsics on AVX/SSE SIMD units. Using data-level parallelization, the inner loop of the SAD calculation of one block can be executed in parallel. Within each iteration of the loop over all block rows, one entire line of the reference and of the candidate block is first loaded into two separate vector registers, and the row-wise SAD is accumulated over all rows. The AVX2 and SSE instruction-set extensions of the Intel architecture feature a single instruction, which computes the element-wise absolute difference of two vector registers and internally reduces each group of eight consecutive absolute differences to one 16 bit element. Finally, the accumulated 16 bit SAD value over all rows needs to be extracted from the vector register into a scalar register. For block widths of more than eight elements, the two SADs of the upper and lower register half need to be added. The resulting code is illustrated in Figure 5d and reveals the extensive modifications required during the migration workflow in order to exploit the SIMD units. As the maximum considered block width of the PPBM is 16 pixels (each 8 bit grayscale) and the maximum register length derived from it is 128 bit, we focus on the SSE instruction-set extension in the remainder of this paper.

Considering ARM NEON-compatible vector units, the SAD calculation of two vectors is realized by a vertical accumulation of the element-wise absolute differences of two registers into one accumulation register with a single instruction. Finally, the accumulation register, containing the column-wise absolute differences over all rows of each column, needs to be reduced to one block-wise SAD. This is realized by a staged summating reduction, horizontally adding the values of the vector register pair-wise in each iteration. In order to prevent range overflows during the vertical accumulation, the element-wise absolute differences are widened to 16 bit elements prior to the accumulation. Thus, the NEON vectorization demands twice as many registers as the SSE units for block widths above eight pixels (128 bit accumulation registers).

Within each vector calculation for a block, the reference image patch is constant; for this reason we combine the SAD calculations of all 10 candidate vectors of one convergence direction calculated in one stage, which is additionally parallelized on task level. In each iteration of the loop over the lines of the compared blocks, the reference line is loaded once and then compared with each corresponding line of the candidate blocks. The definition of the update-vector set at runtime prevents a further combination of multiple SAD calculations within one vector register, e.g., by register shifting or simultaneous calculation in the lower and upper half of the register. Since no vector-valued median filter is defined, the motion vector field erosion from block to pixel coordinates is divided into an element-wise median filtering and a subsequent verification. This postprocessing step ensures that both components result from one input vector, preventing new motion vectors without a previous block-matching. The SIMD units enable us to calculate the erosion for all four sub-blocks in one vector register. The unstructured register assignment of the neighboring vectors prevents optimized load operations and results in scalar motion-vector loading, followed by an unoptimized stuffing of the SIMD registers.
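The NEON variant described above is not shown in Figure 5; the following is a minimal sketch of how such a kernel could look for a [16 × 16] block of 8-bit grayscale pixels, using the widening absolute-difference accumulation and the staged pairwise reduction. The pointer and stride names (pA, pB, lineWidth) follow Figure 5; the exact kernel used in the paper may differ.

#include <arm_neon.h>
#include <cstdint>

// Sketch: block-wise SAD of a [16 x 16] block with NEON intrinsics. Two 16-bit
// accumulators are needed for a 16-pixel row, matching the observation that
// NEON requires twice as many registers as SSE for this block width.
static inline unsigned sad16x16_neon(const uint8_t* pA, const uint8_t* pB,
                                     int lineWidth)
{
    uint16x8_t accLo = vdupq_n_u16(0);
    uint16x8_t accHi = vdupq_n_u16(0);
    for (int y = 0; y < 16; ++y) {
        uint8x16_t rA = vld1q_u8(pA);
        uint8x16_t rB = vld1q_u8(pB);
        // widening absolute-difference accumulate: acc += |rA - rB| (8 -> 16 bit)
        accLo = vabal_u8(accLo, vget_low_u8(rA),  vget_low_u8(rB));
        accHi = vabal_u8(accHi, vget_high_u8(rA), vget_high_u8(rB));
        pA += lineWidth;
        pB += lineWidth;
    }
    // staged pairwise reduction of the 16 column sums to one block-wise SAD
    uint16x8_t acc = vaddq_u16(accLo, accHi);
    uint32x4_t s32 = vpaddlq_u16(acc);
    uint64x2_t s64 = vpaddlq_u32(s32);
    return (unsigned)(vgetq_lane_u64(s64, 0) + vgetq_lane_u64(s64, 1));
}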
VI. EVALUATION

In this section, we present our performance-evaluation results. Initially, we compare the different presented vectorization methods in terms of performance and resource exploitation. Additionally, we give a performance comparison between the SSE and NEON SIMD units on the examined platforms, representing the most common vector units. Afterwards, we evaluate the scaling behavior of the vectorized motion estimation algorithm on multiple cores.

A. Vectorization

The duration of the vectorized SAD calculation for one block is illustrated in Figure 6 in clock cycles for particular block sizes on the Intel Core-i7 general-purpose architecture and the Samsung Exynos 5 Octa embedded processor. For the non-vectorized execution, the execution time increases linearly with the block area, comparably on both considered architectures. The auto-vectorizer of the GCC compiler (-O3) generates a comparable scaling behavior on both architectures, but delivers a speedup of 3.18 at a block size of [16 × 16] on the Intel GPP architecture, in comparison to 7.21 on the ARM big.LITTLE architecture. The performance of the vectorization using additional OpenMP compiler hints nearly matches the performance of the auto-vectorizer on the general-purpose processor for block sizes up to [8 × 8] (S = 1.56); further block-size enlargements enable an additional acceleration by a factor of 2 in comparison to the optimized auto-vectorizer. Considering the ARM mobile multiprocessor architecture, the number of clock cycles of the SAD calculation for one block using OpenMP compiler pragmas follows the auto-vectorized implementation, but the calculation slows down for blocks of size [2 × 2]. The usage of array-slicing notations on the GPP processor, introduced by Intel Cilk Plus, results in a speedup between the highly optimized usage of SSE hardware intrinsics and the auto-vectorizer, delivering an acceleration of 4.75 for the SAD calculation of blocks with size [16 × 16], which is a factor of 3.45 better than the auto-vectorizer.

The SAD calculation implementation using hardware-specific and highly optimized SSE and NEON intrinsics on the Intel general-purpose and the mobile multi-core architecture, respectively, outperforms all auto-vectorized implementations for all considered block sizes. The number of core cycles is nearly constant for block sizes up to [8 × 8] when utilizing the SSE instruction-set intrinsics, due to unoptimized memory accesses for underused vectors: data of blocks with row lengths smaller than 8 needs to be loaded by the scalar units and then stuffed into vector registers, resulting in additional overhead that undoes the advantage of the smaller row count. The NEON instruction-set extension is able to handle small amounts of data more efficiently than SSE. Image patches of size [16 × 16] need double the number of registers and instructions per block line, due to the element-wise accumulation with 16 bit words, resulting in the jump in the number of clock cycles. Due to the slower execution of the non-vectorized SAD calculation on the ARM architecture in comparison to the GPP, the resulting speedup of 91.1 for the NEON intrinsics at blocks of size [16 × 16] outperforms the SSE intrinsics, which achieve a speedup of 85.1 compared to the sequential execution.
Fig. 6. Evaluation of clock cycles for the SAD calculation of one block with variable block size on the general-purpose and mobile ARM architecture, comparing the non-vectorized, auto-vectorized, OpenMP, Cilk Plus (GPP only), and intrinsics (SSE/NEON) implementations: (a) SAD cycles on the Intel Core-i7 (3.5 GHz); (b) SAD cycles on the Samsung Exynos 5 Octa (2.0 GHz).
Fig. 7. Scaling behavior of the presented motion estimation pipeline as a function of the number of cores for an image sequence with 720p resolution.

B. Parallelization

The execution time of the parallelized and vectorized implementation of the motion estimation pipeline, using NEON or SSE instruction-set intrinsics on the ARM or Intel architecture respectively, is illustrated in Figure 7 as a function of the number of cores for an image sequence with a resolution of 1280×720 (720p). The pipeline includes descending block sizes of [8 × 8], [4 × 4], and [2 × 2] with concluding combination and erosion. The execution time scales with the number of cores up to a maximum of four, representing the number of physical cores and SIMD units on the Intel Core-i7, running at 3.5 GHz, and the number of Cortex-A15 cores of the ARM big.LITTLE architecture. Increasing the number of cores to five results in an abrupt change of the execution time on both architectures. On the ARM big.LITTLE architecture, this phenomenon can be traced back to the Cortex-A7 cores, which have no SIMD units of their own and vie for the vector units, thus generating synchronization and communication overhead. Whereas the execution time decreases again on the general-purpose architecture, it remains nearly constant on the embedded multi-core architecture and even increases slightly, due to the increasing number of LITTLE processors using the bus system to transfer their data to the NEON vector units. Comparably, on the hyperthreaded Intel GPP architecture an increase to more than four cores activates additional threads on already active cores, which then share the SIMD unit with the other thread. The optimal tradeoff between hardware utilization and acceleration is the use of four threads on both considered multi-core architectures.

Table I summarizes the execution times and resulting speedups of the realized motion estimation pipeline at a resolution of 720p on both considered architectures. The total speedup Stot characterizes the ratio between the sequential, unvectorized reference and the presented implementation utilizing data- and task-level parallelism. Using data parallelization methods for the combination and erosion stage, the memory bandwidth limits the resulting speedup. Additionally, the vectorization speedup of the erosion is lower, due to inefficient load operations combined with a smaller number of vector operations, which determine the vectorization speedup.
TABLE I. EXECUTION TIME AND SPEEDUP OF THE ALGORITHM STAGES USING DIFFERENT PARALLELIZATION METHODS AT 720P.

      block size   tseq [ms]   ttot [ms]   Svec   Spar   Stot
ARM   [8 × 8]         104.0         3.3     9.7    3.2   31.5
      [4 × 4]         292.9        10.5     8.5    3.3   27.8
      [2 × 2]        1001.3        43.7     7.9    2.9   22.9
      C                40.1         6.5     3.3    1.8    6.2
      ER              107.9        15.6     4.9    1.4    6.9
      total          1546.2        79.6     7.5    2.6   19.4
GPP   [8 × 8]          24.4         0.9     8.7    3.1   27.1
      [4 × 4]          60.6         2.5     7.3    3.2   24.2
      [2 × 2]         201.7         8.8     6.5    3.5   22.9
      C                 7.5         0.9     2.4    3.4    8.3
      ER               13.8         3.9     1.7    2.1    3.5
      total           308.0        17.0     5.8    3.1   18.1
VII. CONCLUSION

With the increasing computational performance of low-power embedded multi-core architectures, which is approaching that of general-purpose processors, the implementation of complex real-time algorithms becomes feasible. In this work, we presented a parallelized and vectorized implementation of a common motion estimation algorithm on a Samsung Exynos 5 Octa, representing a mobile heterogeneous multi-core architecture. In applications demanding flexibility and low development costs, such a parallel software implementation competes with dedicated hardware architectures, which feature low power consumption and high computational efficiency. Maximum resource utilization and computational performance of the SoC are achievable by software tuning and manual vectorization, e.g., using instruction-set specific intrinsics, at the expense of lower architectural flexibility.

REFERENCES
[1] H. Blume, Bewegungsschätzung in Videosignalen mit örtlich-zeitlichen Prädiktoren, Vortragsband zum 5. Dortmunder Fernsehseminar, pp. 220–231, Fakultät für Elektrotechnik, Universität Dortmund, 1993.
[2] G. de Haan, Motion Estimation and Compensation, Dissertation, Technische Universität Delft, Philips-Verlag, 1992.
[3] G. de Haan, P. Biezen et al., True-motion estimation with 3-D recursive search block matching, IEEE Transactions on Circuits and Systems for Video Technology, vol. 3, no. 5, pp. 368–379, 388, Oct. 1993.
[4] J. Wang, D. Wang, and W. Zhang, Temporal compensated motion estimation with simple block-based prediction, IEEE Transactions on Broadcasting, vol. 49, no. 3, pp. 241–248, 2003.
[5] O. J. Arndt, D. Becker, C. Banz, and H. Blume, Parallel Implementation of Real-Time Semi-Global Matching on Embedded Multi-Core Architectures, IEEE Intl. Conf. on Embedded Computer Systems, pp. 56–63, 2013.
[6] H. Sutter, The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software, Dr. Dobb's Journal, 30(3), 2005.
[7] K. Asanovic, R. Bodik, B. C. Catanzaro et al., The landscape of parallel computing research: a view from Berkeley, Technical Report UCB/EECS-2006-183, Electrical Engineering and Computer Sciences, University of California at Berkeley, 2006.
[8] J. L. Manferdelli, N. K. Govindaraju, and C. Crall, Challenges and Opportunities in Many-Core Computing, Proceedings of the IEEE, 96(5), pp. 808–815, 2008.
[9] T. Hoefler, W. Gropp, W. Kramer, and M. Snir, Performance modeling for systematic performance tuning, State of the Practice Reports, p. 6, ACM, 2011.
[10] E. Hebisch, D. Spath, and A. Weisbecker, Market Overview of Tools for Multicore Software Development, Fraunhofer Verlag, 1st edition, 2010.
[11] E. A. Lee, The problem with threads, Computer, 39, pp. 33–42, 2006.
[12] R. J. Ridder, Programming Digital Signal Processors With High-Level Languages, DSP Engineering, 2000.
[13] G. Ren, P. Wu, and D. Padua, An Empirical Study on the Vectorization of Multimedia Applications for Multimedia Extensions, Intl. Parallel and Distributed Processing Symposium, pp. 89b, IEEE, 2005.
[14] D. Nuzman, I. Rosen, and A. Zaks, Auto-vectorization of Interleaved Data for SIMD, SIGPLAN Conference on Programming Language Design and Implementation, vol. 41, pp. 132–143, ACM, 2006.
[15] Intel Cilk Plus, www.cilkplus.org, 2015.
[16] OpenMP 4.0 Application Program Interface, http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf, 2015.
[17] M. Mohammadzadeh, M. Eshghi, and M. M. Azadfar, Parameterizable implementation of full search block matching algorithm using FPGA for real-time applications, Fifth IEEE Intl. Caracas Conference on Devices, Circuits and Systems, vol. 1, pp. 200–203, 2004.
[18] Samsung Exynos 5 Octa 5422, www.samsung.com, 2015.
[19] ARM big.LITTLE architecture, www.arm.com, 2015.