Pattern Recognition with OpenCL Heterogeneous Platform

Jordan Vrtanoski, Member, IEEE, Toni Draganov Stojanovski, Member, IEEE

Abstract — The OpenCL platform provides a unified development environment for various multicore processors. In this paper, we evaluate the OpenCL framework for application in pattern recognition. We selected the most common algorithm for training Artificial Neural Networks (ANN) – the backpropagation algorithm – for parallelization with OpenCL because of its high demand for processing resources. We show a SIMD version of the algorithm suitable for OpenCL implementation. When training on the MNIST handwritten digits data set, our OpenCL implementation executed 25.8 times faster on an ATI 5870 GPU than the same OpenCL code on an Intel Xeon W3530.

Keywords — OpenCL, HPC, Pattern Recognition, Perceptron, Backpropagation.


I. INTRODUCTION

The first pioneers in the field of general numerical processing on GPUs, before the appearance of the GeForce 8800, were forced to map their problems from the native domain into the domain of graphical APIs. Although the revolutionary design of the GeForce 8800 was driven mainly by the gaming industry, researchers noticed that the processor could be used for executing general-purpose algorithms [1]. The data to be processed had to be mapped as a series of polygons or images and loaded as such into the memory of the graphical processor. The algorithms were implemented as series of pixel shaders, and the results of the processing were represented as color values of pixels in the produced bitmap.

In 2006, NVidia introduced CUDA [2] in order to address the growing demand for utilization of the GPU as a platform for general-purpose computation. CUDA removed the complexity of implementing general-purpose algorithms on the GPU. It provides a development environment in which the developer writes the code using an extension of the programming language C, customized for the purpose of parallel programming on graphical processors. CUDA gave developers more freedom to code general-purpose algorithms. However, CUDA has the limitation of being a proprietary platform: the platform and the code developed for it can be used only on NVidia hardware. To address this, the Khronos Group published a new open standard for programming heterogeneous parallel computers – OpenCL [3].

J. Vrtanoski is with Ericsson AB, Galeries 2, Down Town Jebel Ali, Dubai, UAE (phone: +971-55-9342320; e-mail: [email protected]). T. Stojanovski is with the University for Information Science and Technology "St. Paul the Apostle", Ohrid, Macedonia, and with the European University of the Republic of Macedonia, Skopje, R. Macedonia (phone: 389-78-396693; e-mail: [email protected]).

Similar to CUDA, OpenCL allows development of code in standard C. The code is developed in two conceptual parts: host code and platform code. The host code facilitates data exchange between the operating system and the platform; in addition, the host program acts as the controller of the flow of execution of the algorithm on the platform. Host code can be written in C or C++. The platform code is written in ISO C99, with several restrictions forced by the nature of GPU hardware. The following concepts defined by ISO C99 are not applicable in platform code: recursion, function pointers, bit fields, and the extern, static, auto, and register storage classes. In addition, several new qualifiers for working with the OpenCL memory model are introduced.

The OpenCL platform supports two programming models: data parallel and task parallel. The task parallel model allows the work to be divided into multiple tasks, each executed in parallel with the other tasks. The data parallel model allows the same task to be executed in parallel over different sections of the data. The developer can use both models simultaneously.

Driven by the hardware design of GPU devices, the OpenCL platform has a specific memory model. The memory is divided into several regions: private, local, global and constant. In comparison, the traditional CPU model considers the whole memory as one continuous region. The private memory region represents the registers of the device, and access to this memory is virtually instantaneous. Local memory is shared between all threads in the same group and can be used for inter-thread communication; it resides on the device but is slower than the private memory. The global memory region is the memory outside of the device and is the slowest memory in the system. The constant memory region is the same as global memory, with the difference that threads can only read from it. Some devices contain a cache that can significantly reduce the time to access this memory.

The paper is organized as follows. Section II gives an overview of pattern recognition using ANN. Section III explains our approach for parallelization of the backpropagation algorithm. Section IV presents experimental results and their comparison on CPU and GPU platforms. Section V gives the conclusion and directions for future research.
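Before moving on, the following OpenCL C kernel sketch (ours, not taken from the paper; all names are illustrative) ties together the platform-code restrictions and the four memory regions described above:

__kernel void scale_inputs(__global const float *input,   /* global memory */
                           __global float *output,
                           __constant float *coeff,       /* constant (read-only) memory */
                           __local float *scratch)        /* local memory, shared by the group */
{
    float acc;                        /* private memory: registers of the device */
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);

    scratch[lid] = input[gid];        /* stage one element per thread into local memory */
    barrier(CLK_LOCAL_MEM_FENCE);     /* synchronize the thread group */

    acc = scratch[lid] * coeff[0];
    output[gid] = acc;
}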

II. ANN PATTERN RECOGNITION


Since the introduction of the perceptron mathematical model by Rosenblatt [4], ANN have become very popular in the domain of pattern recognition. The perceptron is a mathematical model used to approximate linear multivariable functions. The input variables (stimulus) are first multiplied by predefined weights, after which the resulting values are accumulated. The accumulated value is passed through an output function to obtain the final result (response). In order to offset the response on the X axis, the model introduces an additional constant input called bias. The model of the perceptron is presented in Fig. 1.

The perceptron model can be used for classification. A single perceptron can classify only problems that are linearly separable; the XOR function illustrates the problem of linear separability. The approach to overcome this limitation is to pass the output of the perceptron as input to a new perceptron, effectively introducing another (hidden) layer between the input and the output. This topology is known as the Multi Layer Perceptron (MLP) network.

In order for an MLP ANN to be capable of classifying a given problem, the network must be trained. In general, there are two categories of training algorithms: supervised and unsupervised training. During supervised training, both the input and the output are provided to the network: the input is presented together with its desired output. The network processes the input samples, and the supervisor algorithm calculates the difference between the desired and the actual response. The error is then propagated back through the system to adjust the weights. In the case of unsupervised training, the network is provided only with the inputs, and the network itself should determine the appropriate response for a given input in relation to the training set.

We have chosen the backpropagation algorithm for our work because it is the most common method of training ANN. First references to this algorithm are found in the book of Bryson and Ho [5]. The algorithm takes the error from the output and propagates it backwards through the network. The goal of the training is to find values for the weights that will lead the ANN to respond with the desired response for a given input.
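To make the perceptron model concrete, the response of a single perceptron can be computed in plain C as follows (this sketch is ours and is not part of the original paper):

#include <math.h>

/* Response of one perceptron: weighted sum of n inputs plus bias,
   passed through a sigmoid output function. */
static double sigmoid(double x) { return 1.0 / (1.0 + exp(-x)); }

double perceptron_response(const double *x, const double *w, double bias, int n)
{
    double sum = bias;              /* the bias offsets the response */
    for (int i = 0; i < n; i++)
        sum += x[i] * w[i];         /* accumulate the weighted stimulus */
    return sigmoid(sum);            /* output function gives the response */
}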

Fig. 1. Model of a perceptron

III. PARALLEL BACKPROPAGATION

Research on parallelization of the backpropagation algorithm dates back to the 1980s. The prior work on parallelization of backpropagation is summarized by Tsaregorodtsev [6]. In summary, the common approach is to separate the training set into multiple subsets, calculate the gradient vector for each subset in parallel, and finally accumulate the gradients of the subsets into an overall gradient vector, which is then used to calculate the adjustment of the network. Sierra-Canto et al. [7] described a parallel version of the backpropagation algorithm implemented in CUDA.

Our approach is driven by the nature of OpenCL data parallelism. OpenCL is based on the SIMT (single instruction, multiple threads) parallelism model. The domain of the problem is divided into multiple segments, each processed by a single thread. Threads are organized in thread groups, and the hardware executes a whole thread group (wavefront) in lock step, meaning that all threads in the group execute the same instruction in parallel. Parallel thread execution places a huge demand on memory throughput, and how fast data can be made available to the processor is bound by physical limitations. To mask the memory latency, OpenCL devices (GPU devices) have dedicated thread management hardware that schedules several threads simultaneously per processing unit. Once a thread reaches a point where it needs access to memory, the request is issued and the thread is parked until the memory becomes available; a new thread starts executing in its place until it, too, requests data from memory. Threads are rotated until the whole group is executed. Our design should therefore create enough threads to allow the OpenCL device to hide the memory latency. Reading data from global memory represents a significant bottleneck, so the algorithm should partition the data in a way that minimizes reads from global memory.

The algorithm should also be designed to execute on any type of OpenCL device. Some OpenCL devices, such as Intel x86 CPUs, limit the thread group size to 1 due to the hardware nature of the device. Large thread groups are normally used to speed up an algorithm by prefetching data from global memory into shared memory. Since our implementation of the backpropagation algorithm should support all devices, we will not rely on shared-memory caching techniques for speedup, such as those described in [8], [9].

A. Backpropagation algorithm

We introduced several modifications to the general form of the algorithm. The main reason for modifying the algorithm is to simplify the OpenCL implementation. In the general algorithm, the error from each layer is propagated backwards to the layer below as given by

$\delta_j^{L-1} = \varphi'_j \sum_{i=1}^{p} w_{i,j}\,\delta_i^{L}$   (1)

where the error for the output layer is calculated as

$\delta_i^{L} = \varphi'\!\left(\textstyle\sum_j x_j w_{j,i}\right) Err_i$   (2)

In order to simplify the implementation of the algorithm, and to use only one equation for the calculation of δ, we have added a hidden layer of weights to the network after the output layer. The weights can be observed as a matrix and the input as a vector, so the algorithm becomes a series of matrix operations. To simplify the operations, the newly introduced n×m layer is populated with the identity matrix. Therefore, we can simplify equation (2) to the following form:

$\delta_i^{L+1} = Err_i, \qquad \delta_j^{L} = \varphi'_j \sum_{i=1}^{p} I_{i,j}\,\delta_i^{L+1}$   (3)

The resulting equation (3) has the same form for implementation as equation (1). In addition, we added a parameter γ to each layer. The parameter controls the sensitivity to training; the only reason for this modification is to allow the network to converge faster when trained with a large number of samples. The serial version of the backpropagation algorithm, containing our modifications, is presented in Fig. 2.
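To make the layer update in (1) and (3) concrete, the following C sketch (ours; it assumes row-major weight storage and a sigmoid output function whose derivative is o(1-o)) back-propagates the error from layer l+1 to layer l:

/* delta_next: p errors of layer l+1; w: p x m weights between the layers;
   o: m outputs of layer l; delta: resulting m errors of layer l. */
void backpropagate_layer(const double *w, const double *delta_next,
                         const double *o, double *delta, int p, int m)
{
    for (int j = 0; j < m; j++) {
        double sum = 0.0;
        for (int i = 0; i < p; i++)
            sum += w[i * m + j] * delta_next[i];   /* sum_i w[i][j] * delta[l+1][i] */
        delta[j] = o[j] * (1.0 - o[j]) * sum;      /* sigmoid'(o[j]) * sum */
    }
}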

B. OpenCL parallelization

During parallelization of an algorithm, we first try to identify the loops in which the processed data has no dependency on the data processed in previous iterations of the loop. Since such data has no dependency on computations from previous iterations, we can replace the loop iterations with OpenCL threads. In the case of the algorithm in Fig. 2, the loop over epochs carries a data dependency through the weight matrix W. At the same time, each iteration of the patterns loop is independent of the previous one. This means that, in order to parallelize the algorithm, we can simply create an OpenCL thread for each sample in the data set. Such an approach of dividing the data domain per sample is described in detail in [6]. However, it results in a smaller number of threads, and the use of only one kernel prevents further optimization of the memory access. As an alternative to dividing the data domain per sample, Krpan et al. [10] divide the data domain per perceptron in order to lower the number of threads, trying to avoid the thread scheduling overhead.

Our approach is to distribute the work of the algorithm over several kernels. Having more kernels allows us to optimize the memory access patterns for each kernel separately and achieve better overall performance. Each kernel uses a different approach to distribute the data over the threads. Our goal is to create a large number of threads, enough to allow the device to hide the memory latency. The pseudo code of the kernels is shown in Fig. 3. The host code is used to load the samples and expected results into memory, to orchestrate the execution of the kernels, and to monitor the training progress.

IV. EXPERIMENTING

We have implemented the algorithm in OpenCL. For the experiment, we use an Apple Mac Pro with an Intel Xeon W3530 with 4 cores (8 threads) and an ATI Radeon HD 5870 with 1600 streaming cores and a maximal physical memory throughput of 153.6 GB/s. As the test data set, we have selected the "MNIST Database of Handwritten Digits" [11]. The training set contains 60,000 handwritten digits. Each digit is an 8-bit grayscale image of 28x28 pixels. The set is classified into 10 different classes, one for each Indo-Arabic numeral.

Function sigmoid(x)
    s = 1 / (1 + e^(-x));
    Return s

Function sigmoid_prim(x)
    s = x (1 - x);
    Return s

Procedure perceptron_training
    W[NumberOfLayers+1] ← I   (n×n identity matrix)
    For epoch ← 0 to NumberOfEpochs do
        ΔW ← [0];
        For p ← 1 to NumberOfPatterns do
            For l ← 1 to NumberOfLayers do
                o[l] ← sigmoid(x · W[l]);
            End for l
            δ[l+1] ← d - o[l];
            For l ← NumberOfLayers down to 1 do
                δ[l] ← W[l+1] · δ[l+1] × sigmoid_prim(o[l]);
            End for l
            For l ← 1 to NumberOfLayers do
                ΔW[l] ← γ (ΔW[l] + η δ[l] ⊗ x[l]);
            End for l
        End for p
        For l ← 1 to NumberOfLayers do
            ΔW[l] ← ΔW[l] + α ΔW[l]_(t-1);
            ΔW[l]_(t-1) ← ΔW[l];
            W[l] ← W[l] + ΔW[l];
        End for
    End for epoch
End.

Fig. 2. Serial version of the backpropagation algorithm

The training set is divided into 2 subsets for GPU execution, and into 20 subsets for CPU execution to lower the overhead of thread scheduling. During CPU execution, the algorithm was executed over all 8 available threads. Our ANN was constructed with 784 inputs, one for each pixel, and 10 outputs, one for each class. The parameters and the number of perceptrons in each layer are given in Table 1.

Kernel propagate   (repeated NumberOfPatterns × n times)
    tid ← globalThreadId;
    o[tid] ← W[row: tid mod n] · x;
End Kernel

Kernel calculate_error   (repeated NumberOfPatterns × n times)
    tid ← globalThreadId;
    e[tid] ← d[tid] - o[tid];
End Kernel

Kernel back_propagate   (repeated NumberOfPatterns × m times)
    tid ← globalThreadId;
    e[tid] ← Wᵀ[column: tid mod m] · e^(l+1) × sigmoid_prim(o[tid]);
End Kernel

Kernel aggregate_delta   (repeated m × n times)
    tid ← globalThreadId;
    For set ← 1 to NumberOfPatterns do
        ΔW[tid] ← γ (ΔW[tid] + e[set×n + tid] × o[set×m + tid]);
    End for
End Kernel

Kernel update_weights   (repeated m × n times)
    tid ← globalThreadId;
    ΔW[tid] ← ΔW[tid] + η × α × ΔW_(t-1)[tid];
    ΔW_(t-1)[tid] ← ΔW[tid];
    W[tid] ← W[tid] + ΔW[tid];
End Kernel

Fig. 3. Pseudo code for the kernels
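For concreteness, a possible OpenCL C form of the "propagate" kernel from Fig. 3 is sketched below. This is our illustration, not the authors' code: the row-major layout of W and x, the argument names, and the in-kernel application of the sigmoid are assumptions.

__kernel void propagate(__global const float *w,   /* n x m weight matrix of the layer */
                        __global const float *x,   /* inputs, one row of m values per pattern */
                        __global float *o,         /* outputs, one value per work-item */
                        const int n,               /* perceptrons in the layer */
                        const int m)               /* inputs per perceptron */
{
    int tid = get_global_id(0);          /* tid ranges over NumberOfPatterns * n */
    int row = tid % n;                   /* row of W, i.e. the perceptron */
    int pat = tid / n;                   /* pattern index */

    float sum = 0.0f;
    for (int j = 0; j < m; j++)          /* dot product W[row,:] . x[pat,:] */
        sum += w[row * m + j] * x[pat * m + j];

    o[tid] = 1.0f / (1.0f + exp(-sum));  /* sigmoid output (assumed applied here) */
}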

TABLE 1: ANN LAYOUT

    Layer          Inputs   Perceptrons   γ
    Input layer    784      100           0.50
    Hidden layer   100      100           0.15
    Output layer   100      10            0.05

To measure the efficiency of the kernels on the GPU, we use the effective memory throughput of each kernel [12]. We calculate the amount of data Br read by the kernel and add the amount of data Bw written by the kernel. The total amount of data processed is divided by the kernel execution time T (in seconds) to obtain the throughput in GB/s. We have used the following equation to calculate the effective bandwidth of a kernel:

$\text{Effective Bandwidth} = \dfrac{B_r + B_w}{T} \times 10^{-9}\ \text{GB/s}$   (4)
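A trivial helper for equation (4) might look as follows (our sketch; it assumes Br and Bw are given in bytes and T in seconds):

/* Effective bandwidth of a kernel in GB/s, per equation (4). */
static double effective_bandwidth_gbps(double bytes_read,
                                       double bytes_written,
                                       double time_seconds)
{
    return (bytes_read + bytes_written) / time_seconds * 1e-9;
}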

Table 2 shows the average throughput of each of the implemented kernels over 1,000 epochs. The low throughput of the kernels that process a small amount of data is mainly attributed to the scattered data access and to the low number of threads, which prevents the GPU device from hiding the memory latency. Table 3 shows the time taken to execute the kernels on the OpenCL platform. Our experiment showed that the GPU platform is 25.8 times faster than the CPU platform when executing one epoch over 60,000 samples. We have observed that the performance of the "aggregate_delta" kernel is significantly lower on the CPU. The algorithm of this kernel, in its serial form, has O(n^3) complexity. In addition, the kernel has a scattered memory access pattern, which prevents the compiler from applying implicit CPU vectorisation and causes cache misses.

TABLE 2: EFFECTIVE BANDWIDTH OF KERNELS - ATI 5870

    Kernel            Data (GB)   Time (sec.)   Throughput (GB/s)
    Propagate         35.0699     0.138628      271.633
    Calculate Error   0.0671      0.000069      104.348
    Backpropagate     35.3980     0.017880      2125.740
    Aggregate Delta   52.5713     0.170376      331.315
    Update Weights    0.0018      0.000029      64.883

TABLE 3: PLATFORM EXECUTION TIME

    Kernel                           W3530 Time (s)   ATI 5870 Time (s)   Spd.
    Memory write sample matrix       0.037917         0.050852            0.75
    Mem. write expected result       0.001348         0.000690            1.95
    Propagate through input layer    0.525760         0.138628            3.79
    Propagate through hidden layer   0.105667         0.025799            4.10
    Propagate through output layer   0.013095         0.002581            5.07
    Calculate error                  0.001607         0.000068            23.63
    Memory read error to host        0.000493         0.000752            0.66
    Backpropagate on output layer    0.004682         0.000226            20.72
    Backpropagate on hidden layer    0.037959         0.002037            18.63
    Backpropagate on input layer     0.270636         0.017879            15.14
    Agg. delta on input layer        10.390769        0.169562            61.28
    Agg. delta on hidden layer       0.996805         0.037428            26.63
    Agg. delta on output layer       0.084311         0.036355            2.32
    Update weights on input layer    0.000257         0.000028            9.18
    Update weights on hidden layer   0.000053         0.000011            4.82
    Upd. weights on output layer     0.000019         0.000008            2.38
    Total time for 60k samples       12.4713          0.4829              25.83

The importance of explicit vectorisation for CPU and GPU performance is described by Dickson et al. in [13]. The access pattern and the algorithmic complexity are the main reasons for the poor performance of the OpenCL implementation of this kernel when executed on the CPU.

Although the algorithms of the "propagate" and "backpropagate" kernels are similar, and the amounts of data they process are also similar, their performance differs tenfold. The increased performance of the "backpropagate" kernel is attributed to its memory access pattern, which maximizes the use of the CPU/GPU hardware cache and allows implicit CPU vectorization.

V. CONCLUSION AND FUTURE WORK

Our parallelisation of the backpropagation algorithm significantly reduces the cost of ANN implementations, and can have a positive impact on the applications of ANN in telecommunications such as voice recognition, equalisers, network design, management, routing and control. Future work will focus on the optimization of the memory access patterns of the kernels. We expect that optimizing the underperforming kernels will lead to additional speedup on both the CPU and the GPU platforms.

REFERENCES

[1] M. J. Harris, W. V. Baxter, T. Scheuermann, and A. Lastra, "Simulation of cloud dynamics on graphics hardware," presented at HWWS '03: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, 2003.
[2] CUDA Parallel Computing Platform. [Online]. Available: http://www.nvidia.com/object/cuda_home_new.html. [Accessed: 22-Sep-2012].
[3] OpenCL - The open standard for parallel programming of heterogeneous systems. [Online]. Available: http://www.khronos.org. [Accessed: 22-Sep-2012].
[4] F. Rosenblatt, "The perceptron: A probabilistic model for information storage and organization in the brain," Psychological Review, vol. 65, no. 6, pp. 386-408, 1958.
[5] A. E. Bryson and Y.-C. Ho, Applied Optimal Control: Optimization, Estimation, and Control. Xerox College Publishing, 1969.
[6] V. G. Tsaregorodtsev, "Parallel implementation of back-propagation neural network software on SMP computers," presented at PaCT'05: Proceedings of the 8th International Conference on Parallel Computing Technologies, 2005.
[7] X. Sierra-Canto, F. Madera-Ramirez, and V. Uc-Cetina, "Parallel training of a back-propagation neural network using CUDA," presented at ICMLA '10: Proceedings of the 2010 Ninth International Conference on Machine Learning and Applications, 2010, pp. 307-312.
[8] K. Fatahalian, J. Sugerman, and P. Hanrahan, "Understanding the efficiency of GPU algorithms for matrix-matrix multiplication," presented at HWWS '04: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, 2004.
[9] B. Jang, D. Schaa, P. Mistry, and D. Kaeli, "Exploiting memory access patterns to improve memory performance in data-parallel architectures," IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 1, pp. 105-118, 2011.
[10] N. Krpan and D. Jakobovic, "Parallel neural network training with OpenCL," presented at MIPRO 2012: Proceedings of the 35th International Convention, 2012, pp. 1053-1057.
[11] L. Deng, "The MNIST database of handwritten digit images for machine learning research [Best of the Web]," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 141-142, 2012.
[12] "AMD Accelerated Parallel Processing OpenCL Programming Guide," AMD, Jul. 2012.
[13] N. G. Dickson, K. Karimi, and F. Hamze, "Importance of explicit vectorization for CPU and GPU software performance," Journal of Computational Physics, vol. 230, no. 13, Jun. 2011.