Fast Fourier transform benchmark on X86 Xeon system for multimedia data processing

Multimedia Tools and Applications An International Journal ISSN 1380-7501 Multimed Tools Appl DOI 10.1007/s11042-015-2843-7


Young-Soo Park (1), Koo-Rack Park (1), Jin-Mook Kim (2), Hwa-Young Jeong (3)

Received: 21 December 2014 / Revised: 19 May 2015 / Accepted: 23 July 2015
© Springer Science+Business Media New York 2015

Abstract We benchmark well-known fast Fourier transform (FFT) libraries on an X86 Xeon E5 2690 v3 system. In image processing, the Fourier transform is an important tool that decomposes an image into its sine and cosine components. If the input image is represented in the spatial domain, the output of the Fourier transform represents the image in the Fourier, or frequency, domain. Each point in the Fourier-domain image represents a particular frequency contained in the spatial-domain image. The Fourier transform is widely used for image analysis, image filtering, image compression and image reconstruction, and it plays an important role in digital signal processing applications such as speech recognition and image processing. The discrete Fourier transform (DFT) is a specific kind of Fourier transform that maps a sequence over time to a sequence over frequency. Implemented directly, the DFT has time complexity O(N^2), which is not a practical approach. The fast Fourier transform computes the same DFT with an O(N log N) algorithm and is easy to parallelize. The FFT is widely used in scientific computing programs, and choosing the right library can improve the performance of a program without any additional effort. We therefore benchmark well-known FFT libraries on an X86-based Intel Xeon E5 2690 system running Linux. We installed the Intel IPP library, the FFTW3 library (Fastest Fourier Transform in the West), the Kiss-FFT library and the numutils library on the system. The benchmark is written in C and measures performance over a range of transform sizes. It benchmarks both real and complex transforms in one dimension.

Keywords Fourier transform · Signal processing · Image processing · FFTW3 · INTEL IPP · Numutils · Kiss-fft

* Corresponding author: Jin-Mook Kim

1 Division of Computer Science & Engineering, College of Engineering, Kongju National University, Cheonan-Daero 1223-24, Subuk-gu, Cheonan-si, Chungnam 330717, Korea

2 Division of IT Education, Sunmoon University, 70, Sunmoon-ro 221beon-gil, Tangjeong-myeon, Asan-si, Chungnam 336708, Korea

3 Humanitas College, Kyung Hee University, 24, Kyungheedae-ro, Dongdaemun-gu, Seoul 130701, Korea

1 Introduction

The fast Fourier transform (FFT) refers to a class of algorithms for efficiently computing the discrete Fourier transform (DFT) [14]. An FFT computes the DFT in only O(N log N) operations and is easy to parallelize. The FFT is used in many different fields such as physics, astronomy, engineering, applied mathematics, cryptography and computational finance. Its many and varied applications include solving partial differential equations (PDEs) in computational fluid dynamics, digital signal processing and multiplying large polynomials. Because of its importance, the FFT is used in several benchmarks for parallel computers, such as the HPC Challenge and NAS Parallel Benchmarks. In this paper we examine high-performance FFT computation on an Intel Xeon CPU: we benchmark widely used FFT libraries on an Intel X86 Xeon E5 system and measure their performance over a range of transform sizes. The benchmark covers both real and complex transforms in one dimension.

1.1 Fourier transforms

The Fourier transform (FT) has been widely used in applications ranging from filter design and signal processing to image reconstruction, circuit analysis and synthesis, stochastic modeling and non-destructive measurements. In electromagnetics, the FT has been widely used in environmental modeling, multi-sensor system design, antenna theory and radar cross section prediction. For example, the split-step parabolic equation method (essentially the beam propagation method of optics) has been in use for decades and is based on a sequence of FT operations between the frequency and spatial domains. Two- and three-dimensional propagation problems over non-flat, realistic terrain profiles and through a non-uniform atmosphere have been solved successfully with this method [15].


The principle of a transform in engineering is to find a different representation of the signal under investigation. The FT is the most important transform and is widely used in electrical and computer (EC) engineering. The transformation from the time domain to the frequency domain, and back again, is based on the Fourier transform and its inverse, defined as follows:

$$ S(f) = \int_{-\infty}^{\infty} s(t)\, e^{-j 2\pi f t}\, dt \tag{1a} $$

$$ s(t) = \int_{-\infty}^{\infty} S(f)\, e^{j 2\pi f t}\, df \tag{1b} $$

Here, $s(t)$, $S(f)$ and $f$ are the time signal, the frequency signal and the frequency, respectively, and $j = \sqrt{-1}$. Engineers and physicists sometimes prefer to write the transform in terms of the angular frequency $\omega = 2\pi f$, as

$$ S(\omega) = \int_{-\infty}^{\infty} s(t)\, e^{-j\omega t}\, dt \tag{2a} $$

$$ s(t) = \frac{1}{2\pi} \int_{-\infty}^{\infty} S(\omega)\, e^{j\omega t}\, d\omega \tag{2b} $$

which, however, destroys the symmetry. To restore the symmetry of the transforms, the convention

$$ s(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} S(\omega)\, e^{j\omega t}\, d\omega \tag{3a} $$

$$ S(\omega) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} s(t)\, e^{-j\omega t}\, dt \tag{3b} $$

is sometimes used. The FT is valid for real or complex signals and is, in general, a complex function of $\omega$ (or $f$) [19, 22, 23]. The FT is valid for both periodic and non-periodic time signals that satisfy certain minimum conditions. Almost all real signals easily satisfy these requirements (it should be noted that the Fourier series is a special case of the FT). Mathematically, the FT is defined for continuous-time signals, and in order to do frequency analysis the time signal must be observed over an infinite duration.


1.2 The discrete Fourier transform

The Fourier series of a periodic function can be written in terms of complex exponentials as

$$ y(t) = \sum_{k=-\infty}^{\infty} c_k \exp\left(\frac{i k 2\pi t}{T}\right) \tag{4a} $$

with

$$ c_k = \frac{1}{T} \int_{-T/2}^{T/2} y(t) \exp\left(-\frac{i k 2\pi t}{T}\right) dt \tag{4b} $$

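To establish the discrete Fourier transform, we first replace $y(t)$ by a discrete representation $y_j$, $j = 0, 1, \ldots, N-2, N-1$, with $N\Delta t = T$ (the period $T$ is related to the sampling interval $\Delta t$). Equation (4b) then becomes the discrete Fourier transform (DFT):

$$ c_k = \frac{1}{N\Delta t} \sum_{j=0}^{N-1} y_j \exp\left(-i k \frac{2\pi}{N\Delta t}\, j\Delta t\right) \Delta t = \frac{1}{N} \sum_{j=0}^{N-1} y_j \exp\left(-i \frac{2\pi}{N} jk\right) \tag{4c} $$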

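Now consider $c_{k+N}$:

$$ \begin{aligned} c_{k+N} &= \frac{1}{N} \sum_{j=0}^{N-1} y_j \exp\left(-i \frac{2\pi}{N} (k+N) j\right) \\ &= \frac{1}{N} \sum_{j=0}^{N-1} y_j \exp\left(-i \frac{2\pi}{N} kj\right) \exp\left(-i \frac{2\pi}{N} Nj\right) \\ &= \frac{1}{N} \sum_{j=0}^{N-1} y_j \exp\left(-i \frac{2\pi}{N} jk\right) \exp(-i 2\pi j) \\ &= \frac{1}{N} \sum_{j=0}^{N-1} y_j \exp\left(-i \frac{2\pi}{N} jk\right) = c_k \end{aligned} \tag{4d} $$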

where we have used the fact that $\exp(-i 2\pi j) = 1$ for integer $j$. Therefore, the $c_k$ are completely defined by specifying $N$ values; for indices outside this range the values simply repeat. In MATLAB, as in most other implementations of the DFT, the components are specified for

$$ c_k, \quad k = 0, 1, \ldots, N-2, N-1 \tag{4e} $$

If instead we based the definition on our understanding of Fourier series, in which negative and positive frequency components are summed to obtain a real time series, we would specify the components

$$ c_k, \quad k = \begin{cases} -\left(\dfrac{N}{2} - 1\right), \ldots, -1, 0, 1, \ldots, \dfrac{N}{2} & \text{for } N \text{ even} \\[2ex] -\dfrac{N-1}{2}, \ldots, -1, 0, 1, \ldots, \dfrac{N-1}{2} & \text{for } N \text{ odd} \end{cases} \tag{4f} $$

Since the $c_k$ repeat every $N$ values, the two ranges describe the same set of coefficients, so replacing definition (4e) by (4f) changes nothing essential, although it can cause confusion at first. To change from one convention to the other we have to reorder the components of the DFT. In MATLAB, the function fftshift performs this kind of reordering.
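The reordering between the two index conventions (4e) and (4f) can be sketched in C as follows. This is only an illustration under stated assumptions: the function names `signed_index` and `reorder_to_signed` are hypothetical and do not come from any of the benchmarked libraries, and the layout follows the range in (4f), where for even N the component N/2 is kept at the positive end.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Map a DFT index k in 0..n-1 (convention 4e) to the signed index of
 * convention (4f): indices up to n/2 are kept, larger ones wrap to negatives. */
long signed_index(size_t k, size_t n)
{
    assert(k < n);
    return (k <= n / 2) ? (long)k : (long)k - (long)n;
}

/* Copy coefficients stored in (4e) order into ascending signed order (4f).
 * `in` and `out` must not overlap and must each hold n values. */
void reorder_to_signed(const double complex *in, double complex *out, size_t n)
{
    size_t first_negative = n / 2 + 1;       /* first k that maps to k - n */
    size_t n_negative = n - first_negative;  /* how many negative indices */

    for (size_t k = 0; k < n; ++k) {
        size_t dst = (k >= first_negative) ? k - first_negative
                                           : k + n_negative;
        out[dst] = in[k];
    }
}
```

For N = 8 the output order is k = -3, -2, -1, 0, 1, 2, 3, 4, and for N = 5 it is k = -2, -1, 0, 1, 2. Library routines may place the N/2 component at the negative end instead, so the convention should be checked before comparing results.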

1.2.1 Inverse discrete Fourier transform

We might guess from inspection of Eq. (4a) that the inverse discrete Fourier transform is given by

$$ y_j = \sum_{k=0}^{N-1} c_k \exp\left(i \frac{2\pi}{N} jk\right) \tag{4g} $$

To show that this works we can substitute for $c_k$ using Eq. (4c):

$$ \begin{aligned} y_j &= \sum_{k=0}^{N-1} \left[ \frac{1}{N} \sum_{l=0}^{N-1} y_l \exp\left(-i \frac{2\pi}{N} lk\right) \right] \exp\left(i \frac{2\pi}{N} jk\right) \\ &= \frac{1}{N} \sum_{l=0}^{N-1} \sum_{k=0}^{N-1} y_l \exp\left(-i \frac{2\pi}{N} (l-j) k\right) \end{aligned} \tag{4h} $$

Now the sum over $k$ is a geometric progression of the form

$$ S = 1 + r + r^2 + \ldots + r^{N-1} \tag{4i} $$

with

$$ r = \exp\left(-i \frac{2\pi}{N} (l-j)\right) \tag{4j} $$

If $l = j$ we can see that $r$ is unity and the sum is $S = N$. If $l \neq j$ we can sum the geometric progression by first multiplying Eq. (4i) by $r$:

$$ rS = r + r^2 + \ldots + r^{N-1} + r^N \tag{4k} $$

Subtracting Eq. (4k) from Eq. (4i) yields

$$ (1-r) S = 1 - r^N \tag{4l} $$

which gives the following expression for the sum:

$$ S = \frac{1 - r^N}{1 - r} \tag{4m} $$

If we substitute Eq. (4j) into (4m) the numerator is

$$ 1 - \exp(-i 2\pi (l-j)) = 0 \tag{4n} $$

So all the $l \neq j$ terms sum to zero, only the $l = j$ term survives, and the right-hand side of Eq. (4h) reduces to $\frac{1}{N} y_j N = y_j$. We can therefore see that Eq. (4g) is correct.

1.2.2 Fourier transform pairs

If we replace $c_k$ with $Y_k$ in Eqs. (4c) and (4g) we get a discrete Fourier transform pair in the conventional notation:

$$ Y_k = \frac{1}{N} \sum_{j=0}^{N-1} y_j \exp\left(-\frac{i 2\pi jk}{N}\right), \qquad y_j = \sum_{k=0}^{N-1} Y_k \exp\left(\frac{i 2\pi jk}{N}\right) \tag{4o} $$

As for the continuous Fourier transform, there is an ambiguity in the multiplying factor placed in front of the forward and inverse discrete Fourier transforms. The factor can instead be placed on the inverse transform:

$$ Y_k = \sum_{j=0}^{N-1} y_j \exp\left(-\frac{i 2\pi jk}{N}\right), \qquad y_j = \frac{1}{N} \sum_{k=0}^{N-1} Y_k \exp\left(\frac{i 2\pi jk}{N}\right) \tag{4p} $$

Thus, compared with Eq. (4o), the terms $Y_k$ are $N$ times as big and the inverse DFT requires the introduction of a $1/N$ factor. One could also write a symmetric form

$$ Y_k = \frac{1}{\sqrt{N}} \sum_{j=0}^{N-1} y_j \exp\left(-\frac{2\pi i kj}{N}\right), \qquad y_j = \frac{1}{\sqrt{N}} \sum_{k=0}^{N-1} Y_k \exp\left(\frac{2\pi i kj}{N}\right) \tag{4q} $$

but we are not aware of anyone who uses this convention.

2 Related works

2.1 Fast Fourier transform

A straightforward implementation of the DFT of N samples has time complexity O(N^2) [7, 8], so computing the DFT directly is not the best approach in practice. The fast Fourier transform (FFT) is an improved algorithm that produces exactly the same result as the DFT [11]. It uses a divide-and-conquer strategy, so transforming N samples takes only O(N log N) computation time. The only difference between the DFT and the FFT is that the FFT is much faster [4, 5]; it can be considered a faster version of the DFT. The idea is to keep dividing a DFT sequence of N samples into two subsequences, splitting the even and odd indices at each step. If N is a power of 2, the splitting continues until each subsequence has only one element. The rearranged index is just the original index in bit-reversed order.

2.2 Parallel fast Fourier transform

When parallelizing the FFT, we need to consider which form of the algorithm is suitable for implementation. The recursive FFT is easy to implement, but there are two reasons to use the iterative form instead. First, the iterative version of the FFT performs fewer index computations. Second, it is easier to derive a parallel FFT algorithm from the iterative form. We have already seen that the output index is the bit-reversed input index, so we use this index rearrangement. The parallel algorithm has three phases. Here, n is the number of elements and p is the number of processes. In the first phase, the processes permute the input sequence into bit-reversed index order. In the second phase, each process performs the first log n - log p iterations of the FFT, carrying out the required complex multiplications, additions and subtractions. In the third phase, the processes perform the final log p iterations of the FFT, exchanging values with their partners along each dimension of a hypercube. Figure 1 shows the process for the parallel fast Fourier transform. Each process thus controls n/p elements of the input sequence, and in each of the log p exchange iterations a process swaps n/p values with its partner process. The overall communication time complexity is O((n/p) log p), and the computational complexity of the parallel algorithm is O(n log n / p).
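The bookkeeping behind the three phases can be sketched in C as below. The helper names (`ilog2`, `local_iterations`, `hypercube_partner`) are hypothetical and no message-passing calls are shown; the sketch assumes that n and p are powers of two with p not larger than n, and that the partner in exchange step d is found by flipping bit d of the process rank, which is the usual hypercube pairing.

```c
#include <assert.h>

/* Integer log2 of a power of two. */
unsigned ilog2(unsigned long x)
{
    unsigned r = 0;
    assert(x != 0 && (x & (x - 1)) == 0);
    while (x > 1) {
        x >>= 1;
        ++r;
    }
    return r;
}

/* Number of FFT iterations each process can run on its own n/p elements
 * without communication: log n - log p. */
unsigned local_iterations(unsigned long n, unsigned long p)
{
    assert(p <= n);
    return ilog2(n) - ilog2(p);
}

/* Partner of process `rank` in exchange step d (0 <= d < log p):
 * the process whose rank differs only in bit d. */
unsigned long hypercube_partner(unsigned long rank, unsigned d)
{
    return rank ^ (1ul << d);
}
```

With n = 1024 and p = 8, each process holds 128 elements, runs 7 iterations locally and then takes part in 3 exchange steps. Which hypercube dimension is used in which step depends on how the data are distributed, so the order here is an assumption.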

2.3 INTEL IPP

Intel IPP is an extensive library of multi-core-ready, highly optimized software functions for multimedia, data processing and communications applications. Intel IPP functions are designed to deliver performance beyond what an optimizing compiler alone can provide, by matching the function algorithms to low-level optimizations based on the processor's available features, such as the Streaming SIMD Extensions (SSE, SSE2, SSE3, SSSE3, SSE4 and SSE4.1) and other optimized instruction sets. The library has been optimized for a variety of SIMD instruction sets. Automatic "dispatching" detects the SIMD instruction set that is available on the running processor and selects the optimal SIMD instructions for that processor (see "Understanding CPU Dispatching in the Intel® IPP Library"). Intel AVX optimization in the Intel IPP library consists of "hand-optimized" and "compiler-tuned" functions, that is, code that has been directly optimized for the Intel AVX instruction set. Given the large number of primitives in the Intel IPP library, it is impossible to optimize every Intel IPP function for a large set of new instructions within the time frame of a single product release or update (processor-specific optimizations may also take cache size and the number of cores or threads into consideration). Therefore, the functions in Table 4 represent those that either receive the greatest benefit from the new Intel AVX instructions or are the most widely used by Intel IPP customers.

Fig. 1 Parallel Fast Fourier Transform diagram. The top sequence is the input and the bottom sequence is the output. Each process is represented by a gray rectangle

2.4 FFTW

FFTW stands for Fastest Fourier Transform in the West [6, 21]. FFTW is a C subroutine library for computing the discrete Fourier transform (DFT) in one or more dimensions, of arbitrary input size, and of both real and complex data, as well as of even/odd data, i.e. the discrete cosine/sine transforms, or DCT/DST [8, 9, 17]. FFTW's speed is superior to that of other publicly available DFT programs, and its performance is highly competitive with that of vendor-tuned solutions [12, 13]. FFTW is also portable, so it works well on most architectures without modification, something vendor-tuned codes cannot match [18]. The latest official release of FFTW is version 3.3.4, available from the download page. Version 3.3 introduced support for the AVX x86 extensions, a distributed-memory implementation on top of MPI, and a Fortran 2003 API. Version 3.3.1 introduced support for the ARM Neon extensions [1, 2]. See the release notes for more information.


2.5 Kiss-FFT library

For the Kiss-FFT library [3], specialized versions of the functions kf_bfly2 and kf_bfly4 were created for m = 1 and for m = 1, 2, 4, respectively. In addition, a template of kf_bfly4 that is specialized dynamically at run time was created, modified in only 7 positions. The binary template contains far fewer memory operations than either the statically specialized versions or the unspecialized code.

2.6 Numutils FFT library

The purpose of this project is to improve on a selection of the concepts used in the analysis and development of the num-utils library [20]. The library implements a number of numerical algorithms on floating-point data. The main goal of the project is to improve performance. The sources for num-utils-ng are available from git://github.com/serge-sans-paille/num-utils-ng.git. The num-utils library is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.

3 Test method and platform and results

3.1 Test method and platform

3.1.1 Fourier transforms

The tests ran out-of-place complex DFT and FFT transforms on randomized 64-bit double-precision complex input data. Each FFT application was tested with different numbers of CPU cores through the SMP interface of the FFT library.

3.1.2 Platforms

Intel Xeon E5 2690, 2.90 GHz, 8 GB RAM, Linux 2.6.32-431.el6.x86_64.

Figure 2 shows the functional block diagram of the Intel Xeon E5 2690 processor architecture, and Fig. 3 shows its instruction execution diagram.

In computing, FLOPS (FLoating-point Operations Per Second) is a measure of computer performance that is useful in fields of scientific computing that make heavy use of floating-point calculations. For such cases it is a more accurate measure than generic instructions per second. The theoretical peak performance one expects from the Xeon E5 2690 is:

$$ \text{FLOPS} = \text{sockets} \times \frac{\text{cores}}{\text{socket}} \times \frac{\text{clock cycles}}{\text{sec}} \times \frac{\text{FLOPs}}{\text{cycle}} $$

$$ \text{GFlops/sec} = 8\ [\text{single-precision floats in SIMD vector unit}] \times 2\ [\text{FMA}] \times 2.60\ [\text{GHz}] \times 24\ [\text{cores}] = 998.4\ \text{GFlops/sec} \approx 0.99\ \text{TFlops/sec} $$
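The peak-performance arithmetic above can be reproduced with a few lines of C. The function names `peak_gflops` and `percent_of_peak` are hypothetical helpers introduced for this sketch; the input values (8 SIMD lanes, 2 operations per FMA, 2.60 GHz, 24 cores) are the ones given in the formula.

```c
#include <assert.h>

/* Theoretical peak in GFlops/sec:
 * SIMD lanes x FLOPs per lane per cycle x clock (GHz) x total cores. */
double peak_gflops(int simd_lanes, int flops_per_lane,
                   double clock_ghz, int cores)
{
    assert(simd_lanes > 0 && flops_per_lane > 0);
    assert(clock_ghz > 0.0 && cores > 0);
    return simd_lanes * flops_per_lane * clock_ghz * cores;
}

/* Share of the theoretical peak reached by a measured rate, in percent.
 * The measured rate is given in MFLOPS, the peak in GFLOPS. */
double percent_of_peak(double measured_mflops, double peak_gflops_value)
{
    assert(peak_gflops_value > 0.0);
    return 100.0 * (measured_mflops / 1000.0) / peak_gflops_value;
}
```

For example, `peak_gflops(8, 2, 2.60, 24)` gives about 998.4, matching the figure above, and `percent_of_peak` can be used to relate a measured MFLOPS value to this peak. The peak counts every core and every SIMD lane, so a benchmark that does not use all of them will sit well below it.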


Fig. 2 Intel Xeon E5 2690 Processor Architecture-Functional Block Diagram

3.1.3 MKL (Intel® Math Kernel Library)

Use the optimized vector and matrix functions in the Intel Math Kernel Library (MKL) or the lower-level Vector Math Library (VML) [10]. The library is optimized in assembly language to take advantage of the SIMD vector instruction set of the Xeon Phi [16]. So far, we have seen the Xeon Phi run faster than the Xeon (host) only when MKL routines were used, so it is recommended to express all operations on data arrays as vector or matrix operations and to use MKL wherever possible. MKL routines run more efficiently on relatively large input vectors and matrices, which is what brings out the power of the Xeon Phi; more threads are spawned automatically for larger jobs.

3.1.4 SIMD (Single Instruction Multiple Data)

SIMD is an abbreviation of Single Instruction, Multiple Data streams. It describes a computer architecture that processes multiple data streams simultaneously with a single instruction. Although recent CPUs support SIMD instructions, plain C/C++ code is composed of SISD (Single Instruction, Single Data stream) instructions. With SIMD instructions, however, one can sum multiple numbers simultaneously or calculate a product of vectors with fewer loop iterations. Because of its large SIMD width of 64 bytes, vectorization is even more important for the MIC architecture than for Intel Xeon. The MIC architecture offers new instructions such as gather/scatter, fused multiply-add and masked vector instructions, which allow more loops to be parallelized on the coprocessor than on an Intel Xeon based host. In 2006 Intel started developing an x86 many-core design (codename Larrabee), initially targeted as an alternative to existing graphics processors. It uses a 512-bit SIMD instruction set called IMCI (Initial Many Core Instructions). Figure 4 shows the history of SIMD ISA extensions.


Fig. 3 Intel Xeon E5 2690 Processor Architecture-Instruction Execution Diagram

3.1.5 FMA (Fused Multiply and Add)

FMA represents the number of arithmetic operations in a "fused multiply-add" instruction. That is, these Intel processors can execute a "multiply and add" (really two separate operations) as a single instruction, at the same clock rate.

3.2 Benchmark execution

This benchmark was designed to show the performance improvement obtained on the Intel Xeon when multi-threaded execution support is enabled in the FFT library, which was configured with './configure --enable-sse2 --enable-threads' before the FFT application was compiled against it. In addition to the FFT application, the same FFT and DFT tasks were run using Intel Performance Primitives 8.0 (IPP).

Fig. 4 The history of SIMD ISA extensions


Fig. 5 Comparison of 1D FFTs on a X86 Xeon (fftw3, intel-ipp, jmfft, numutils fft library)

The benchmark source code consists of a Makefile and the files intel-ipp.c, kissfft.c, numutils.c and fftw3.c, which were written for these tests; the parts of the code that call each library are listed in Tables 1, 2, 3 and 4. Essentially, each program calls the DFT/FFT planner and measures the time it takes, then calls the DFT/FFT several times and reports the average execution time. Performance is reported as

$$ \text{mflops} = \frac{5 N \log_2 N}{\text{time for one FFT in } \mu\text{s}} $$

For a fixed time budget, the required processing rate (FLOPS) increases as the FFT size increases. Conversely, where processing time is significant to the system design, the higher the FLOPS, the lower the latency. Integrators can use these benchmark figures to estimate the processor they need for their applications and to estimate latency.
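The measurement procedure and the mflops metric can be sketched in C as follows. This is a hypothetical harness, not the benchmark source itself: `average_time_us`, `fft_mflops`, `transform_fn` and `count_call` are names invented for the example, timing uses the standard `clock()` function (CPU time, which is an assumption about what should be measured), and the time for one FFT is taken in microseconds so that the result is in millions of operations per second.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Any transform can be timed through a callback with an opaque context. */
typedef void (*transform_fn)(void *ctx);

/* Run the transform `iterations` times and return the average CPU time
 * of one call in microseconds. */
double average_time_us(transform_fn run, void *ctx, int iterations)
{
    assert(run != NULL && iterations > 0);
    clock_t start = clock();
    for (int i = 0; i < iterations; ++i)
        run(ctx);
    clock_t stop = clock();
    return 1e6 * (double)(stop - start) / (double)CLOCKS_PER_SEC
               / (double)iterations;
}

/* mflops = 5 N log2(N) / (time for one FFT in microseconds),
 * for N a power of two. */
double fft_mflops(size_t n, double time_us)
{
    unsigned log2n = 0;
    assert(n >= 2 && (n & (n - 1)) == 0 && time_us > 0.0);
    for (size_t m = n; m > 1; m >>= 1)
        ++log2n;
    return 5.0 * (double)n * (double)log2n / time_us;
}

/* Trivial example callback: counts how often it was called. */
void count_call(void *ctx)
{
    ++*(int *)ctx;
}
```

As a worked example, a 1024-point transform that takes 10 microseconds scores 5 x 1024 x 10 / 10 = 5120 mflops. The factor 5 N log2(N) is a nominal operation count used for scaling, so the metric allows libraries to be compared at a given size even if they actually execute a different number of operations.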

3.3 Benchmark results

The measured FFT performance is lower than the theoretical maximum because an FFT does not only perform floating-point operations. Each FFT application moves a variable amount of data depending on the FFT size, and this consumes most of the processing time: the time is spent on data movement and other system overhead, with less than about 10 % spent performing floating-point operations. The optimum transform size lies between 32 and 512 points. The benchmark incorporates a number of publicly available FFT implementations in C and measures their performance over a range of transform sizes. It benchmarks both real and complex transforms in one dimension. Table 1 shows the Kiss-FFT library code used for benchmarking, Table 2 the num-utils FFT library code, Table 3 the FFTW3 library code and Table 4 the INTEL_IPP library code. Finally, Table 5 shows the benchmarking results for the four libraries. Figure 5 shows the single-precision, complex, power-of-two, out-of-place FFT benchmark for each library.

Table 1 The part of the code with the kiss-fft library that is used in this paper for benchmarking

Kiss-FFT library

    if (p->in_place) {
        for (i = 0; i < iter; ++i)
            kiss_fft(work, in);
    } else {
        void *out = p->out;
        for (i = 0; i […]

    […] sign == -1) {
        the_plan = FFTW(plan_dft_r2c)(p->rank, p->n, p->in, p->out, the_flags);
    } else {
        the_plan = FFTW(plan_dft_c2r)(p->rank, p->n, p->in, p->out, the_flags);
    }

Table 4 The part of the code with the intel-ipp library that is used in this paper for benchmarking

INTEL_IPP library

    if (p->sign == -1) {
        for (i = 0; i < iter; ++i) {
            MANGLEC(ippsFFTFwd_CToC)(p->in, p->out, thing, buffer);
        }
    } else {
        for (i = 0; i < iter; ++i) {
            MANGLEC(ippsFFTInv_CToC)(p->in, p->out, thing, buffer);
        }
    }

Table 5 Fourier transform performance benchmarking results (Mflops)

Transform size    fftw3     Intel-ipp    Numutils    Kiss-fft
2                 1263.5    1263.8       528.54      953.99
4                 3730.4    4868         1096.1      2134
8                 5485.1    11003        1526        1792.5
16                6347.8    17439        1750.5      2850.5
32                6518      19339        1966.8      2374.9
64                6591      20109        2170.5      3214.5
128               6071.8    18192        2374        2817.8
256               6015.9    18245        2531        3578.8
512               5879.6    18492        2638        3135.7
1024              5673.2    16452        2609        3790.7
2048              5564.6    14365        2439.1      3284.6
4096              5261.2    14193        1736        3828.1
8192              5050.2    12176        1452.5      2960.8
16384             4654.8    11239        1275.5      3029.7
32768             4314.5    10911        1064.9      2453.9
65536             4305.6    10283        995.53      2477.6


4 Conclusions

Each FFT routine seems to have its own way of storing the conjugate-symmetric output of real transforms, especially for multidimensional transforms. We benchmarked each routine using whatever format the routine chose to implement. Table 5 and Fig. 5 show the single-precision, complex, power-of-two, out-of-place FFT benchmark for each library. The bell-shaped curve for single-precision complex even-sized FFTs is typical of all the FFT libraries. As the FFT size increases up to a certain point, the FFT latency per operation gets smaller, that is, FLOPS increase. At some FFT size the hardware, algorithm and data size reach an optimum. As the FFT size increases further, performance drops quickly because the processor has to wait for data to move from slower memory subsystems into processor memory (cache). Therefore, for very large FFT sizes the latency is directly proportional to the speed of non-cache memory. Furthermore, Table 5 and Fig. 5 show the Fourier transform performance (Mflops) obtained with each fast Fourier transform library. Intel® IPP software building blocks are highly optimized using the Intel® Streaming SIMD Extensions (Intel® SSE) and Intel® Advanced Vector Extensions (Intel® AVX) instruction sets, so an application performs faster than what an optimizing compiler can produce alone. The OpenCV library in combination with the Intel IPP library is a very good tool for image and video processing. IPP makes use of MMX, SSE and SSE2 instructions (on Intel processors), so each operation is very fast. IPP provides the low-level functions, while OpenCV provides many high-level image processing methods and algorithms. IPP and OpenCV are therefore complementary tools for building efficient and effective image processing and computer vision applications.

References

1. Blumofe RD, Frigo M, Joerg CF, Leiserson CE, Randall KH (1996) An analysis of dag-consistent distributed shared-memory algorithms. In Proceedings of the Eighth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), (Padua, Italy), pp. 297–308
2. Blumofe RD, Joerg CF, Kuszmaul BC, Leiserson CE, Randall KH, Zhou Y (1995) Cilk: an efficient multithreaded runtime system. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), (Santa Barbara, California), pp. 207–216
3. Borgerding M (2006) KissFFT v1.2.5. http://sourceforge.net/projects/kissfft/
4. Cooley JW, Lewis PAW, Welch PD (1967) The fast Fourier transform algorithm and its applications. IBM Research
5. Cooley JW, Tukey JW (1965) An algorithm for the machine computation of the complex Fourier series. Math Comput 19:297–301
6. Cormen TH, Leiserson CE, Rivest RL (1990) Introduction to Algorithms. The MIT Press, Cambridge
7. Duhamel P, Vetterli M (1990) Fast Fourier transforms: a tutorial review and a state of the art. Signal Proc 19:259–299
8. Good IJ (1958) The interaction algorithm and practical Fourier analysis. J Roy Stat Soc B 20:361–372
9. Hong J-W, Kung HT (1981) I/O complexity: the red-blue pebbling game. In Proceedings of the Thirteenth Annual ACM Symposium on Theory of Computing, (Milwaukee), pp. 326–333
10. Intel® Math Kernel Library Reference Manual. https://software.intel.com/sites/products/documentation/hpc/mkl/mklman/
11. Johnson HW, Burrus CS (1983) The design of optimal DFT algorithms using dynamic programming. IEEE Trans Acoust Speech Signal Proc 31:378–387
12. Leroy X (1996) The caml light system release 0.71. Institut National de Recherche en Informatique et en Automatique (INRIA)
13. Loan CV (1992) Computational frameworks for the fast Fourier transform. SIAM, Philadelphia
14. Oppenheim AV, Schafer RW (1989) Discrete-time signal processing. Prentice-Hall, Englewood Cliffs, NJ 07632
15. Perez F, Takaoka T (1987) A prime factor FFT algorithm implementation using a program generation technique. IEEE Trans Acoust Speech Signal Proc 35:1221–1223
16. PRACE-1IP Whitepapers, Evaluations on Intel MIC. http://www.prace-ri.eu/Evaluation-Intel-MIC
17. Press WH, Flannery BP, Teukolsky SA, Vetterling WT (1992) Numerical recipes in C: the art of scientific computing, 2nd edn. Cambridge University Press, New York
18. Savage JE (1993) Space-time tradeoffs in memory hierarchies. Tech. Rep. CS 93-08, Brown University, CS Dept., Providence, RI 02912
19. Selesnick I, Burrus CS (1996) Automatic generation of prime length FFT programs. IEEE Trans Signal Proc 14–24
20. Singleton RC (1969) An algorithm for computing the mixed radix fast Fourier transform. IEEE Trans Audio Electroacoust AU-17:93–103
21. Swarztrauber PN (1982) Vectorizing the FFTs, parallel computations. pp. 51–83
22. Temperton C (1985) Implementation of a self-sorting in-place prime factor FFT algorithm. J Comput Phys 58:283–299
23. Temperton C (1988) A new set of minimum-add small-n rotated DFT modules. J Comput Phys 75:190–198

Young-Soo Park received the B.S. in Physics from Chung-Nam University, Daejeon, Korea, in 1999 and the M.S. in Physics from Chung-Nam University, Daejeon, Korea, in 2001. His research interests include high performance computing, software optimization, software parallelization, software vectorization, INTEL Xeon Phi coprocessor, and X86 architecture.

Koo-Rack Park received the B.S. in Electronic Engineering from Chung-Ang University, Seoul, Korea, in 1986 and the M.S. in Computer Science and Engineering from Soongsil University, Seoul, Korea, in 1988. He received his doctoral degree in Computer Science from Kyonggi University, Suwon, Korea, in 2000. Currently, he is a Professor in the Division of Computer Science and Engineering at Kongju National University, Cheonan, South Korea. His research interests include Multimedia, Electronic Commerce, Simulation, Crime Prediction, Predictive Modelling, and IT Convergence.

Jin-Mook Kim received the Ph.D. in computer engineering, computer security and authentication from Kwangwoon University in 2006. Currently, he is an assistant professor in the Division of IT Education at Sunmoon University in Korea. His research interests include network control architecture, security engineering, authentication on the network, and smartphone security.

Hwa-Young Jeong received the M.S. and Ph.D. degrees from Kyunghee University, Seoul, Korea, in 1994. His major is Software Engineering in Computer Science. He has been working as an Assistant Professor since March 2005. He worked in the R&D center of Aju System Co., Ltd. (developing FA machines and IC test handlers for semiconductors) as a programmer and software engineer from 1994 to 1998, and in the same position at CAN Research Co., Ltd. from 1998 to 1999. His research interests include Web Engineering, Multimedia Applications, Software Development Methodology, Networks and so on.